Roll your own web crawler. RubyDay 2013


It is all about data.
Having the right data at the right time might make the difference between you and your competitor. Google can show you only what it can catch. If you know where to find the data you are interested in, let's go deeper and roll your own web crawler framework.
Taking advantage of the latest cool technologies, I will show you how to build your own distributed web crawler based on Redis and MongoDB.

Usage Rights: CC Attribution License

    Presentation Transcript

    • Roll Your Own Web Crawler. Get the data you want, because you want the data now! Francesco Laurita, RubyDay 2013, Milan, Italy
    • What is a web crawler? "A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing." http://en.wikipedia.org/wiki/Web_crawler
    • How does it work?
      1. Start with a list of URLs to visit (the seeds).
      2. Fetch a page and add all of the hyperlinks it contains to the list of URLs to visit (push).
      3. Store the page content somewhere.
      4. Mark the visited URL as visited.
      5. Visit the queued URLs recursively.
      Directed graph. Queue (FIFO).
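      To make the loop above concrete, here is a minimal single-process sketch in Ruby (not from the slides; it assumes open-uri and the nokogiri gem, and leaves out politeness, error handling, and persistence):

          require "set"
          require "open-uri"
          require "nokogiri"

          seeds   = ["http://example.com/"]   # hypothetical seed
          queue   = seeds.dup                 # FIFO queue of URLs to visit
          visited = Set.new

          until queue.empty? || visited.size >= 100   # cap so the demo terminates
            url = queue.shift
            next if visited.include?(url)

            html = URI.open(url).read                 # fetch the page
            doc  = Nokogiri::HTML(html)

            doc.css("a[href]").each do |a|            # push its hyperlinks
              link = URI.join(url, a["href"]).to_s rescue nil
              queue << link if link&.start_with?("http")
            end

            File.write("page-#{visited.size}.html", html)  # store the content somewhere
            visited << url                                  # mark the URL as visited
          end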
    • How does it work? A web crawler is able to "walk" a WebGraph. A WebGraph is a directed graph whose vertices are pages, and a directed edge connects page A to page B if there is a link between A and B. Directed graph. Queue (FIFO).
    • Generic web crawler infrastructure. While it is fairly easy to build and write a standalone single-instance crawler, building a distributed and scalable system that can download millions of pages over weeks is not.
    • Why should you roll your own web crawler?
      Universal crawlers: general purpose; the most interesting content first (PageRank).
      Focused crawlers: better accuracy; only certain topics; highly selective; not only for search engines. Ready to be used for a machine-learning engine as a service, a data warehouse, and so on.
    • Sentiment Analysis
    • Finance
    • A.I., Machine Learning, Recommendation Engine as a Service
    • Last but not least...
    • Polipus (because octopus was taken)
    • Polipus (because octopus was taken): a distributed, easy-to-use, DSL-ish web crawler framework written in Ruby.
      * Distributed and scalable
      * Easy to use
      https://github.com/taganaka/polipus
      Heavily inspired by Anemone:
      * Well designed
      * Easy to use
      * Not distributed
      * Not scalable
      https://github.com/chriskite/anemone
    • Polipus in action
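      The original slide presumably showed live code. As a rough sketch of the DSL-ish interface the talk describes (the method names follow my reading of the project's README and are an assumption, not a copy of the slide):

          require "polipus"

          # Crawl starting from a seed URL; by default Polipus keeps its URL queue
          # and tracker in Redis and stores downloaded pages in MongoDB.
          Polipus.crawler("rubygems", "http://rubygems.org/") do |crawler|
            crawler.on_page_downloaded do |page|
              # Hypothetical callback usage: page carries the fetched URL and HTTP status
              puts "Downloaded: #{page.url} (#{page.code})"
            end
          end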
    • Polipus under the hood: Redis (what is it?)
      * A NoSQL DB
      * An advanced key/value store
      * A caching server
      * A lot of things...
    • Polipus under the hood: Redis (what is it?). It is a way to share memory over TCP/IP: data structures can be shared between different processes.
      * List (LinkedList) --> queue.pop, queue.push
      * Hash --> {}
      * Set --> Set
      * SortedSet --> SortedSet.new
      * ...
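      A short sketch (not from the slides) of how those Redis structures map to their Ruby counterparts, assuming the redis-rb gem and a local Redis server:

          require "redis"

          redis = Redis.new(host: "localhost", port: 6379)

          # List as a shared FIFO queue: like Array#push / Array#shift, but over TCP/IP
          redis.rpush("urls", "http://example.com/")
          url = redis.lpop("urls")

          # Hash: like a Ruby Hash {}
          redis.hset("page:1", "title", "Example")
          redis.hget("page:1", "title")     # => "Example"

          # Set: like Ruby's Set
          redis.sadd("visited", url)
          redis.sismember("visited", url)   # => true

          # Sorted set: like a SortedSet, ordered by a score
          redis.zadd("ranking", 0.9, url)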
    • Polipus under the hood: Redis as a reliable and distributed queue.
      1) A producer pushes a URL to visit into the queue (RPUSH).
      2) A consumer fetches the URL and at the same time pushes it into a processing list (RPOPLPUSH, non-blocking, or BRPOPLPUSH, blocking).
      An additional client may monitor the processing list for items that remain there for too long, and push those timed-out items back into the queue if needed.
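      A minimal sketch of that pattern with redis-rb (the key names are made up for illustration; the real implementation is the redis-queue gem linked on the next slide):

          require "redis"

          redis      = Redis.new
          queue      = "polipus:queue"        # hypothetical key names
          processing = "polipus:processing"

          # Producer: push a URL to visit
          redis.rpush(queue, "http://example.com/page")

          # Consumer: atomically move a URL into the processing list,
          # blocking for up to 30 seconds while the queue is empty
          url = redis.brpoplpush(queue, processing, timeout: 30)

          if url
            # ... fetch and store the page ...
            # Acknowledge: remove the entry from the processing list once done;
            # a watchdog can re-queue entries that linger here too long.
            redis.lrem(processing, 1, url)
          end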
    • Polipus under the hood: Redis as a reliable and distributed queue. https://github.com/taganaka/redis-queue
    • Polipus under the hood: Redis as a URL tracker. A crawler should know whether a URL has already been visited or is about to be visited.
      * SET (a = Set.new; a << url; a.include?(url))
      * Bloom filter (SETBIT / GETBIT)
    • Polipus under the hood: Redis Bloom filter. "A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set." http://en.wikipedia.org/wiki/Bloom_filter
    • Polipus under the hood: Redis Bloom filter.
      * Very space efficient: 1,000,000 elements take ~2 MB on Redis.
      * With a cost: false positives are possible, while false negatives are not. With a false-positive probability of 0.1%, out of every 1M pages about 1k might be erroneously marked as already visited.
      Using a SET: no errors at all, but 1,000,000 elements occupy ~150 MB on Redis.
      https://github.com/taganaka/redis-bloomfilter
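      To illustrate how a Bloom filter can sit on top of plain SETBIT/GETBIT, here is a toy version (this is not the redis-bloomfilter gem; the size and hashing scheme are arbitrary assumptions):

          require "redis"
          require "digest"

          class TinyBloomFilter
            def initialize(redis, key: "crawler:bf", size: 10_000_000, hashes: 7)
              @redis, @key, @size, @hashes = redis, key, size, hashes
            end

            def add(url)
              positions(url).each { |pos| @redis.setbit(@key, pos, 1) }
            end

            def include?(url)
              positions(url).all? { |pos| @redis.getbit(@key, pos) == 1 }
            end

            private

            # Derive k bit positions from salted SHA1 digests of the URL
            def positions(url)
              (0...@hashes).map { |i| Digest::SHA1.hexdigest("#{i}:#{url}").to_i(16) % @size }
            end
          end

          bf = TinyBloomFilter.new(Redis.new)
          bf.add("http://example.com/")
          bf.include?("http://example.com/")   # => true (false positives possible, false negatives not)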
    • Polipus under the hood: MongoDB.
      1) MongoDB is used mainly for storing pages.
      2) Pages are stored with an upsert, so that a document can easily be updated during a fresh crawl of the same content.
      3) By default the body of the page is compressed in order to save disk space.
      4) No query() is needed, thanks to the Bloom filter.
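      A sketch of what upsert-plus-compression can look like with the mongo gem and Ruby's zlib (the field names and collection are my assumptions, not the actual Polipus storage code):

          require "mongo"
          require "zlib"
          require "digest"

          client = Mongo::Client.new("mongodb://localhost:27017/crawler")
          pages  = client[:pages]

          def store_page(pages, url, body)
            pages.update_one(
              { uuid: Digest::SHA1.hexdigest(url) },   # stable key per URL
              { "$set" => {
                  url:        url,
                  body:       BSON::Binary.new(Zlib::Deflate.deflate(body)),  # compressed body
                  fetched_at: Time.now.utc
                } },
              upsert: true                             # insert on first crawl, refresh on re-crawl
            )
          end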
    • Polipus: the infrastructure
    • Is it so easy?! Not really...
      1) Redis is an in-memory database.
      2) A queue of URLs can grow very fast.
      3) A queue of 1M URLs occupies about 370 MB on Redis (about 400 chars per entry).
      4) MongoDB will eat your disk space: 50M saved pages are around 400 GB.
      Suggested Redis conf: maxmemory 2.5GB (or whatever your instance can handle); maxmemory-policy noeviction. After 6M URLs I got Redis to refuse writes.
    • An experiment using the currently available code. Setup:
      6x t1.micro (web crawlers, 5 workers each)
      1x m1.medium (Redis and MongoDB)
      MongoDB with default settings
      Redis: maxmemory 2.5GB, maxmemory-policy noeviction
      ~4,700,000 pages downloaded in 24h... then I ran out of disk because of MongoDB.
    • TODO
      • Redis memory guard: it should be able to move items from the Redis queue to MongoDB when the queue size hits a threshold, and move them back to Redis at some point.
      • Honor the robots.txt file, so that we respect Disallow directives if any.
      • Add support for Ruby Mechanize: maintain browsing sessions; fill and submit forms.
    • Questions?
      francesco@gild.com
      facebook.com/francesco.laurita
      www.gild.com