Roll your own web crawler. RubyDay 2013

It is all about data.
Having the right data at the right time might make the difference between you and your competitor. Google can show you only what it manages to catch. If you know where to find the data you are interested in, let's go deeper and roll your own web crawler framework.
Taking advantage of the latest cool technologies, I will show you how to build your own distributed web crawler based on Redis and MongoDB.

  1. Get the Data you want, because you want the Data now! Francesco Laurita, RubyDay 2013, Milan, Italy. Roll your own Web Crawler
  2. What is a web crawler? “A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.” http://en.wikipedia.org/wiki/Web_crawler
  3. How does it work? 1. Start with a list of URLs to visit (the seeds). 2. Fetch a page and get all of the hyperlinks in it, adding them to the list of URLs to visit (push); the page content is stored somewhere and the visited URL is marked as visited. 3. URLs are recursively visited. (Diagram: directed graph, FIFO queue)
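
     A minimal sketch of that loop in plain Ruby, assuming the nokogiri gem for link extraction; the commented-out store call is a hypothetical persistence step, not part of any real crawler API:

        require 'net/http'
        require 'uri'
        require 'set'
        require 'nokogiri'

        seeds   = ['http://www.example.com/']
        queue   = seeds.dup   # FIFO list of URLs to visit
        visited = Set.new     # URLs already fetched

        until queue.empty?
          url = queue.shift               # pop (FIFO)
          next if visited.include?(url)

          body = Net::HTTP.get(URI(url))  # download the page
          visited << url                  # mark the URL as visited
          # store(url, body)              # persist the page content somewhere (hypothetical)

          # push every hyperlink found in the page onto the queue
          Nokogiri::HTML(body).css('a[href]').each do |a|
            begin
              link = URI.join(url, a['href']).to_s
            rescue URI::Error
              next
            end
            queue << link unless visited.include?(link)
          end
        end
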
  4. How does it work? A web crawler is able to “walk” a WebGraph: a directed graph whose vertices are pages, with a directed edge from page A to page B if there is a link from A to B. (Diagram: directed graph, FIFO queue)
  5. Generic Web Crawler Infrastructure. While it’s fairly easy to build and write a standalone single-instance crawler, building a distributed and scalable system that can download millions of pages over weeks is not.
  6. Why should you roll your own Web Crawler? Universal crawlers: * general purpose * only the most “interesting” content (PageRank). Focused crawlers: * better accuracy * only certain topics * highly selective * not only for search engines. Ready to be used for a Machine Learning engine as a service, a data warehouse and so on.
  7. Sentiment Analysis
  8. Finance
  9. A.I., Machine Learning, Recommendation Engine as a Service
  10. Last but not least....
  11. Polipus (because octopus was taken)
  12. Polipus (because octopus was taken). A distributed, easy-to-use, DSL-ish web crawler framework written in Ruby: * distributed and scalable * easy to use. https://github.com/taganaka/polipus — Heavily inspired by Anemone (* well designed * easy to use * not distributed * not scalable) https://github.com/chriskite/anemone
  13. Polipus in action
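
     The live demo is not reproduced in the transcript; as a rough sketch, a basic Polipus crawl looks something like the snippet below (entry point and callback names follow the project README of the time and may have changed since):

        require 'polipus'

        # Crawl starting from a seed URL; Polipus keeps the URL queue on Redis
        # and the downloaded pages on MongoDB behind the scenes.
        Polipus.crawler('rubyday', 'http://www.rubyday.it/') do |crawler|
          crawler.on_page_downloaded do |page|
            # process each fetched page as it arrives
            puts "#{page.url} downloaded"
          end
        end
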
  14. Polipus: Under the hood. Redis (What is it?) * A NoSQL DB * An advanced Key/Value store * A caching server * A lot of things...
  15. Polipus: Under the hood. Redis (What is it?) It is a way to share memory over TCP/IP: data structures can be shared between different processes. * List (LinkedList) --> queue.pop, queue.push * Hash --> {} * Set --> Set * SortedSet --> SortedSet.new * ....
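
     For instance, with the redis gem each of those structures is one command away, and every process that can reach the server sees the same data (key names below are illustrative):

        require 'redis'

        redis = Redis.new(host: 'localhost', port: 6379)

        # List (LinkedList) used as a shared FIFO queue
        redis.rpush('urls:queue', 'http://www.example.com/')  # queue.push
        url = redis.lpop('urls:queue')                        # queue.pop

        # Hash --> {}
        redis.hset('page:stats', 'status_code', 200)

        # Set --> Set
        redis.sadd('urls:seen', url)

        # SortedSet --> SortedSet.new
        redis.zadd('urls:by_score', 1, url)
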
  16. Polipus: Under the hood. Redis * Reliable and distributed queue. 1) A producer pushes a URL to visit onto the queue (RPUSH). 2) A consumer fetches the URL and at the same time pushes it into a processing list: RPOPLPUSH (non-blocking) / BRPOPLPUSH (blocking). An additional client may monitor the processing list for items that remain there too long and push those timed-out items back onto the queue if needed.
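
     A bare-bones sketch of that pattern with the redis gem (Polipus wraps it in the redis-queue gem linked on the next slide; key names here are illustrative):

        require 'redis'

        redis = Redis.new

        # 1) Producer: push a URL to visit
        redis.rpush('queue:urls', 'http://www.example.com/')

        # 2) Consumer: atomically move an item into a processing list.
        #    BRPOPLPUSH blocks until something is available (or the timeout expires).
        url = redis.brpoplpush('queue:urls', 'queue:urls:processing', timeout: 10)

        if url
          # ... do the actual work here, then acknowledge by removing the item
          redis.lrem('queue:urls:processing', 1, url)
        end

        # A watchdog can LRANGE 'queue:urls:processing' and RPUSH back any item
        # that has been sitting there for too long.
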
  17. Polipus: Under the hood. Redis * Reliable and distributed queue: https://github.com/taganaka/redis-queue
  18. Polipus: Under the hood. Redis * URL tracker. A crawler should know whether a URL has already been visited or is about to be visited. * SET (a = Set.new; a << url; a.include?(url)) * Bloom filter (SETBIT / GETBIT)
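
     The exact (non-probabilistic) option is simply a Redis SET shared by every worker, instead of an in-process Set.new (key and method names below are illustrative):

        require 'redis'

        redis = Redis.new

        def mark_visited!(redis, url)
          redis.sadd('urls:visited', url)
        end

        def visited?(redis, url)
          redis.sismember('urls:visited', url)
        end

        mark_visited!(redis, 'http://www.example.com/')
        visited?(redis, 'http://www.example.com/')   # => true
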
  19. Polipus: Under the hood. Redis Bloom filter: “A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.” http://en.wikipedia.org/wiki/Bloom_filter
  20. Polipus: Under the hood. Redis Bloom filter: * Very space efficient: 1,000,000 elements take ~2MB on Redis * With a cost: false positives are possible, while false negatives are not. With a false-positive probability of 0.1%, for every 1M pages about 1k of them might erroneously be marked as already visited. Using a SET: no errors at all, but 1,000,000 elements occupy ~150MB on Redis. https://github.com/taganaka/redis-bloomfilter
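
     A toy version of the idea on top of Redis bitmaps, in the spirit of the redis-bloomfilter gem above; the real gem derives the bitmap size and number of hashes from the target error rate, so the constants here are only illustrative:

        require 'redis'
        require 'digest'

        BITS   = 16_000_000   # ~2MB bitmap, roughly sized for 1M URLs at ~0.1% error
        HASHES = 10

        def bit_positions(url)
          (0...HASHES).map { |i| Digest::SHA1.hexdigest("#{i}:#{url}").to_i(16) % BITS }
        end

        def bf_add(redis, url)
          bit_positions(url).each { |pos| redis.setbit('urls:bloom', pos, 1) }
        end

        def bf_include?(redis, url)
          # may return a false positive, never a false negative
          bit_positions(url).all? { |pos| redis.getbit('urls:bloom', pos) == 1 }
        end

        redis = Redis.new
        bf_add(redis, 'http://www.example.com/')
        bf_include?(redis, 'http://www.example.com/')   # => true
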
  21. Polipus: Under the hood. MongoDB 1) MongoDB is used mainly for storing pages. 2) Pages are stored with an upsert, so that a document can easily be updated when the same content is crawled again. 3) By default the body of the page is compressed in order to save disk space. 4) No query() is needed to check for already-seen pages, thanks to the Bloom filter.
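
     A sketch of that storage step with the current mongo gem and Zlib (collection and field names are illustrative, not necessarily the ones Polipus uses):

        require 'mongo'
        require 'zlib'
        require 'digest'

        client = Mongo::Client.new(['127.0.0.1:27017'], database: 'crawler')
        pages  = client[:pages]

        url  = 'http://www.example.com/'
        body = '<html>...</html>'

        pages.update_one(
          { uuid: Digest::SHA1.hexdigest(url) },     # stable key derived from the URL
          { '$set' => {
              url:        url,
              body:       BSON::Binary.new(Zlib::Deflate.deflate(body)),  # compressed body
              fetched_at: Time.now.utc
          } },
          upsert: true                               # insert on first crawl, update on re-crawl
        )

        # Reading it back: Zlib::Inflate.inflate(pages.find(uuid: ...).first['body'].data)
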
  22. Polipus: The infrastructure
  23. Is it so easy?! Not really... 1) Redis is an in-memory database. 2) A queue of URLs can grow very fast. 3) A queue of 1M URLs occupies about 370MB on Redis (about 400 chars per entry). 4) MongoDB will eat your disk space: 50M saved pages are around 400GB. Suggested Redis conf: maxmemory 2.5GB (or whatever your instance can handle), maxmemory-policy noeviction. After about 6M entries I got Redis to refuse writes.
  24. An experiment using the currently available code. Setup: 6x t1.micro (web crawlers, 5 workers each), 1x m1.medium (Redis and MongoDB). MongoDB with default settings; Redis with maxmemory 2.5GB and maxmemory-policy noeviction. ~4,700,000 pages downloaded in 24h... then I ran out of disk space because of MongoDB.
  25. TODO • Redis memory guard: move items from the Redis queue to MongoDB if the queue size hits a threshold, and move them back to Redis at some point. • Honor the robots.txt file, so that Disallow directives, if any, are respected (a naive sketch follows below). • Add support for Ruby Mechanize: maintain browsing sessions, fill in and submit forms.
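
     None of this exists in Polipus yet; as a naive sketch, honoring Disallow directives could start out like this (wildcard rules and per-agent sections are ignored for brevity):

        require 'net/http'
        require 'uri'

        # Collect the Disallow path prefixes from a site's robots.txt
        def disallowed_paths(site)
          robots = Net::HTTP.get(URI.join(site, '/robots.txt'))
          robots.each_line
                .map { |line| line[/\ADisallow:\s*(\S+)/i, 1] }
                .compact
        rescue StandardError
          []   # no robots.txt (or unreachable): nothing is disallowed
        end

        def allowed?(url, disallowed)
          path = URI(url).path
          disallowed.none? { |prefix| path.start_with?(prefix) }
        end

        rules = disallowed_paths('http://www.example.com/')
        allowed?('http://www.example.com/some/page', rules)   # => true or false
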
  26. Questions? francesco@gild.com facebook.com/francesco.laurita www.gild.com
