Crawling with NodeJS
JSMeetup2@Paris 24.11.2010
@sylvinus
Crawling?
Web crawling
Grab
Process
Store
?
NodeJS
Server-side Javascript
Async / Event-driven / Reactor pattern
Small stdlib, Exploding module ecosystem
Why?
Boldly going where no one has g...
Threads vs. Async
ZOMG Server-side CSS3 selectors!
Apricot
https://github.com/silentrob/Apricot
HTML/DOM Parser, inspired by Hpricot
Sizzle + JSDOM + XUI
Problems w/ Apricot
if (file.match(/^https?:///)) {
    var urlInfo = url.parse(file, parseQueryString=false),
    host = ...
Problems w/ Apricot
No advanced HTTP client in Node’s lib
npm install request
https + redirects + buffering
Problems w/ Apricot
Apricot.parse("<p id='test'>An HTML Fragment</p>", function(doc) {
doc.find("selector"); // Populates ...
Problems w/ Apricot
XUI api?!
jQuery please :)
require("jsdom").jQueryify !!
var jsdom = require("jsdom"),
window = jsdom....
Concurrency?
https://github.com/coopernurse/node-pool
npm install generic-pool
generic-pool
// Create a MySQL connection pool with
// a max of 10 connections and a 30 second max idle time
var poolModul...
So what?
Apricot - XUI + jQuery
+ request + generic-pool
+ qunit + ?
=
??
Simple API?
var Crawler = require("node-crawler").Crawler;
var c = new Crawler({
"maxConnections":10,
"timeout":60,
"defau...
Name contest! :)
node-crawler ?
Crawly ?
?????
Thanks!
First code on github tonight
Help & Forks welcomed
(We’re hiring HTML5/JS hackers ;-)
Also, http://html5weekend.or...
Upcoming SlideShare
Loading in...5
×

Web Crawling with NodeJS

127,640

Published on

Lightning talk I did at the second #JSMeetup in Paris. #Parisjs
It kickstarted the project http://github.com/sylvinus/node-crawler

Published in: Technology, Design
2 Comments
30 Likes
Statistics
Notes
  • i think this is what I've looking for, i going to use it for some bigger project. thanks for this great module though !!
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • The repository is here! https://github.com/sylvinus/node-crawler
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total Views
127,640
On Slideshare
0
From Embeds
0
Number of Embeds
5
Actions
Shares
0
Downloads
239
Comments
2
Likes
30
Embeds 0
No embeds

No notes for slide

Web Crawling with NodeJS

  1. 1. Crawling with NodeJS JSMeetup2@Paris 24.11.2010 @sylvinus
  2. 2. Crawling?
  3. 3. Web crawling Grab Process Store
  4. 4. ?
  5. 5. NodeJS Server-side Javascript Async / Event-driven / Reactor pattern Small stdlib, Exploding module ecosystem
  6. 6. Why? Boldly going where no one has g... Threads vs. Async ZOMG Server-side CSS3 selectors!
  7. 7. Apricot https://github.com/silentrob/Apricot HTML/DOM Parser, inspired by Hpricot Sizzle + JSDOM + XUI
  8. 8. Problems w/ Apricot if (file.match(/^https?:///)) {     var urlInfo = url.parse(file, parseQueryString=false),     host = http.createClient(((urlInfo.protocol === 'http:') ? 80 : 443), urlInfo.hostname),     req_url = urlInfo.pathname;     if (urlInfo.search) {       req_url += urlInfo.search;     }     var request = host.request('GET', req_url, { host: urlInfo.hostname });     request.addListener('response', function (response) {       var data = '';       response.addListener('data', function (chunk) {         data += chunk;       });       response.addListener("end", function() {         fnLoaderHandle(null, data);       });     });     if (request.end) {       request.end();     } else {       request.close();     }          } else {     fs.readFile(file, encoding='utf8', fnLoaderHandle);   }
  9. 9. Problems w/ Apricot No advanced HTTP client in Node’s lib npm install request https + redirects + buffering
  10. 10. Problems w/ Apricot Apricot.parse("<p id='test'>An HTML Fragment</p>", function(doc) { doc.find("selector"); // Populates internal collection, See Sizzle selector syntax (rules) doc.each(callback); // Itterates over the collection, applying a callback to each match (element) doc.remove(); // Removes all elements in the internal collection (See XUI Syntax) doc.inner("fragment"); // See XUI Syntax doc.outer("fragment"); // See XUI Syntax doc.top("fragment"); // See XUI Syntax doc.bottom("fragment"); // See XUI Syntax doc.before("fragment"); // See XUI Syntax doc.after("fragment"); // See XUI Syntax doc.hasClass("class"); // See XUI Syntax doc.addClass("class"); // See XUI Syntax doc.removeClass("class"); // See XUI Syntax doc.toHTML; // Returns the HTML doc.innerHTML; // Returns the innerHTML of the body. doc.toDOM; // Returns the DOM representation // Most methods are chainable, so this works doc.find("selector").addClass('foo').after(", just because"); });
  11. 11. Problems w/ Apricot XUI api?! jQuery please :) require("jsdom").jQueryify !! var jsdom = require("jsdom"), window = jsdom.jsdom().createWindow(); jsdom.jQueryify(window, 'http://code.jquery.com/jquery-1.4.2.min.js' , function() { window.$('body').append('<div class="testing">Hello World, It works</div>'); console.log(window.$('.testing').text()); });
  12. 12. Concurrency? https://github.com/coopernurse/node-pool npm install generic-pool
  13. 13. generic-pool // Create a MySQL connection pool with // a max of 10 connections and a 30 second max idle time var poolModule = require('generic-pool'); var pool = poolModule.Pool({ name : 'mysql', create : function(callback) { var Client = require('mysql').Client; var c = new Client(); c.user = 'scott'; c.password = 'tiger'; c.database = 'mydb'; c.connect(); callback(c); }, destroy : function(client) { client.end(); }, max : 10, idleTimeoutMillis : 30000, log : false }); // borrow connection - callback function is called // once a resource becomes available pool.borrow(function(client) { client.query("select * from foo", [], function() { // return object back to pool pool.returnToPool(client); }); });
  14. 14. So what? Apricot - XUI + jQuery + request + generic-pool + qunit + ? = ??
  15. 15. Simple API? var Crawler = require("node-crawler").Crawler; var c = new Crawler({ "maxConnections":10, "timeout":60, "defaultHandler":function(error,result,$) { $("#content a:link").each(function(a) { c.queue(a.href); }) } }); c.queue(["http://jamendo.com/","http://tedxparis.com", ...]); c.queue([{ "uri":"http://parisjs.org/register", "method":"POST" "handler":function(error,result,$) { $("div:contains(Thank you)").after(" very much"); } }]);
  16. 16. Name contest! :) node-crawler ? Crawly ? ?????
  17. 17. Thanks! First code on github tonight Help & Forks welcomed (We’re hiring HTML5/JS hackers ;-) Also, http://html5weekend.org/
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×