Get the Data you want, because you want the Data now!
Francesco Laurita
RubyDay 2013, Milan - Italy
Roll your own Web Crawler
Friday, June 14, 13
What is a web crawler?
“A Web crawler is an Internet bot that systematically browses the World
Wide Web, typically for the purpose of Web indexing.”
http://en.wikipedia.org/wiki/Web_crawler
How does it work?
1. It starts with a list of URLs to visit (the seeds)
2. It gets all of the hyperlinks in each page and adds them to the list of URLs to visit (push)
   a. The page content is stored somewhere
   b. The visited URL is marked as visited
3. URLs are visited recursively
(Figure labels: Directed graph; Queue (FIFO))
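The steps above can be sketched as a breadth-first walk driven by a FIFO queue. This is a minimal illustration, not the code from the talk: the `web` hash is a hypothetical in-memory stand-in for real HTTP fetching and link extraction, and the `crawl` method name is an assumption.

```ruby
require "set"

# Hypothetical stand-in for the web: URL => hyperlinks found in that page.
web = {
  "http://a.example/" => ["http://b.example/", "http://c.example/"],
  "http://b.example/" => ["http://c.example/", "http://a.example/"],
  "http://c.example/" => ["http://d.example/"],
  "http://d.example/" => []
}

# Breadth-first crawl: returns the URLs in the order they were visited.
def crawl(seeds, web)
  queue   = seeds.dup       # FIFO list of URLs still to visit
  seen    = Set.new(seeds)  # URLs already visited or already queued
  visited = []

  until queue.empty?
    url = queue.shift                      # take from the front (FIFO)
    visited << url                         # "store" the page, mark it visited
    web.fetch(url, []).each do |link|      # every hyperlink in the page
      queue.push(link) if seen.add?(link)  # push only URLs not seen before
    end
  end
  visited
end

order = crawl(["http://a.example/"], web)
puts order
```

Keeping a `seen` set next to the queue is what stops the crawl from looping forever on cycles such as A -> B -> A.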
How does it work?
A web crawler is able to “walk” a “WebGraph”.
A WebGraph is a directed graph whose vertices are pages; a directed edge
connects page A to page B if page A contains a link to page B.
Generic Web Crawler Infrastructure
While it’s fairly easy to build a standalone, single-instance crawler,
building a distributed and scalable system that can download millions of
pages over weeks is not.
Why should you roll your own Web Crawler?
Universal Crawlers:
* General purpose
* Most interesting content (page rank)
Focused Crawlers:
* Better accuracy
* Only certain topics
* Highly selective
* Not only for search engines
The data is ready to be used for a Machine Learning Engine as a service,
a data warehouse and so on
Sentiment Analysis
Finance
A.I., Machine Learning, Recommendation Engine as a Service
Last but not least....
Polipus (because octopus was taken)
A distributed, easy-to-use, DSL-ish web crawler framework written
in Ruby
* Distributed and scalable
* Easy to use
https://github.com/taganaka/polipus
Heavily inspired by Anemone
* Well designed
* Easy to use
* Not distributed
* Not scalable
https://github.com/chriskite/anemone
Polipus in action
Polipus: Under the hood
Redis
(What is it?)
* Is a NoSQL DB
* Is an advanced Key/Value Store
* Is a caching server
* Is a lot of things...
Polipus: Under the hood
Redis
(What is it?)
* It is a way to share memory over TCP/IP
It can share memory (data structures) between different processes
* List (LinkedList) --> queue.pop, queue.push
* Hash --> {}
* Set --> Set
* SortedSet --> SortedSet.new
* ....
Polipus: Under the hood
Redis
* Reliable and Distributed Queue
1) A producer pushes a URL to visit into the queue
LPUSH (the opposite end from the one RPOPLPUSH pops, so the queue stays FIFO)
2) A consumer fetches the URL and, at the same time, pushes
it into a processing list
RPOPLPUSH (non-blocking) / BRPOPLPUSH (blocking)
An additional client may monitor the processing list for
items that remain there for too long, and push
those timed-out items into the queue again if needed.
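The producer / consumer / monitor roles described above can be modelled in a few lines. This is only a sketch of the pattern: plain Ruby collections stand in for the two Redis lists, nothing here talks to a Redis server, and the class and method names (`ReliableQueue`, `fetch`, `commit`, `requeue_stale`) are made up for the illustration.

```ruby
# In-memory model of the reliable queue pattern (no Redis involved).
class ReliableQueue
  def initialize(timeout:)
    @timeout    = timeout # seconds an item may stay in processing
    @queue      = []      # pending URLs, oldest first
    @processing = {}      # url => time it was fetched
  end

  # Producer: add a URL to the pending queue.
  def push(url)
    @queue.push(url)
  end

  # Consumer: take the oldest URL and record it as "being processed"
  # in the same step (the role RPOPLPUSH plays on Redis).
  def fetch(now: Time.now)
    url = @queue.shift
    @processing[url] = now if url
    url
  end

  # Consumer: the URL was handled, drop it from the processing list.
  def commit(url)
    @processing.delete(url)
  end

  # Monitor: push items that stayed in processing for too long back
  # into the pending queue. Returns the re-queued URLs.
  def requeue_stale(now: Time.now)
    stale = @processing.select { |_, fetched_at| now - fetched_at > @timeout }.keys
    stale.each do |url|
      @processing.delete(url)
      @queue.push(url)
    end
    stale
  end

  def size
    @queue.size
  end
end

queue = ReliableQueue.new(timeout: 30)
queue.push("http://a.example/")
queue.push("http://b.example/")

start  = Time.at(0)
first  = queue.fetch(now: start) # this worker finishes its job
second = queue.fetch(now: start) # this worker crashes before committing
queue.commit(first)

requeued = queue.requeue_stale(now: start + 60)
```

The URL taken by the crashed worker is not lost: it is still in the processing list, so the monitor can put it back once the timeout expires.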
Polipus: Under the hood
Redis
* Reliable and Distributed Queue
https://github.com/taganaka/redis-queue
Polipus: Under the hood
Redis
* URL Tracker
A crawler should know whether a URL has already been visited or is
about to be visited
* SET
(a = Set.new; a << url; a.include?(url))
* Bloom Filter (SETBIT / GETBIT)
Polipus: Under the hood
Redis
Bloom Filter:
“A Bloom filter is a space-efficient probabilistic data structure that is used
to test whether an element is a member of a set.”
http://en.wikipedia.org/wiki/Bloom_filter
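To make the definition concrete, here is a minimal in-memory Bloom filter. It is a sketch under stated assumptions, not the redis-bloomfilter implementation: a Ruby array of 0/1 plays the role of the Redis bit string that SETBIT / GETBIT would address, and the double-hashing scheme over SHA-256 is just one common way to derive the bit positions.

```ruby
require "digest"

# Minimal in-memory Bloom filter (illustrative only).
class BloomFilter
  def initialize(bits:, hashes:)
    @bits   = bits    # size of the bit field (m)
    @hashes = hashes  # number of bit positions per key (k)
    @field  = Array.new(bits, 0)
  end

  # Set the k bits selected by the key.
  def add(key)
    positions(key).each { |i| @field[i] = 1 }
    self
  end

  # false => the key was certainly never added
  # true  => the key was probably added (false positives are possible)
  def include?(key)
    positions(key).all? { |i| @field[i] == 1 }
  end

  private

  # Derive k positions from two base hashes (double hashing).
  def positions(key)
    digest = Digest::SHA256.hexdigest(key)
    h1 = digest[0, 16].to_i(16)
    h2 = digest[16, 16].to_i(16)
    (0...@hashes).map { |i| (h1 + i * h2) % @bits }
  end
end

visited = BloomFilter.new(bits: 10_000, hashes: 7)
visited.add("http://a.example/")
visited.add("http://b.example/")

puts visited.include?("http://a.example/")
```

A URL that was added is always reported as present. A URL that was never added is usually reported as absent, but can occasionally collide with bits set by other URLs.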
Polipus: Under the hood
Redis
Bloom Filter:
* Very space efficient! 1,000,000 elements take ~2 MB on Redis
* With a cost: false positives are possible, while false negatives are not
With a false positive probability of 0.1%, out of every 1M pages about 1k
might be erroneously marked as already visited
Using SET: no errors at all, but 1,000,000 elements occupy ~150 MB
on Redis
https://github.com/taganaka/redis-bloomfilter
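The ~2 MB figure is in line with the textbook Bloom filter sizing formulas. The sketch below applies them to the numbers on the slide (1M elements, 0.1% false positives). The `bloom_size` helper is hypothetical, and the result is the theoretical size of the bit field only, without any Redis overhead.

```ruby
# Standard sizing for n elements at false positive rate p:
#   bits   m = -n * ln(p) / (ln 2)^2
#   hashes k = (m / n) * ln 2
def bloom_size(n, p)
  bits   = (-n * Math.log(p) / (Math.log(2)**2)).ceil
  hashes = (bits.to_f / n * Math.log(2)).round
  { bits: bits, hashes: hashes, megabytes: bits / 8.0 / 1_000_000 }
end

sizing = bloom_size(1_000_000, 0.001)
expected_false_positives = 1_000_000 * 0.001 # about 1k pages per 1M

puts sizing
```

For these inputs the formulas give roughly 14.4 million bits, that is about 1.8 MB, with 10 hash functions.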
Polipus: Under the hood
MongoDB
1) MongoDB is used mainly for storing pages
2) Pages are stored with an upsert, so that a document can easily be
updated when the same content is crawled again
3) By default the body of the page is compressed in order to save disk space
4) No query() is needed, because the Bloom filter already tracks visited URLs
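Points 2 and 3 can be illustrated without a database. In this sketch a Ruby Hash keyed by URL stands in for the MongoDB collection, and Zlib from the standard library stands in for whatever compression Polipus applies; the `PageStore` class and its methods are hypothetical.

```ruby
require "zlib"

# In-memory stand-in for the page collection (no MongoDB involved).
class PageStore
  def initialize
    @docs = {}
  end

  # Upsert: insert the page if the URL is new, otherwise overwrite it.
  # The body is stored compressed.
  def upsert(url, body)
    @docs[url] = { url: url, body: Zlib::Deflate.deflate(body), fetched_at: Time.now }
  end

  # Return the decompressed body, or nil if the URL was never stored.
  def body(url)
    doc = @docs[url]
    doc && Zlib::Inflate.inflate(doc[:body])
  end

  # Size of the compressed body as stored.
  def stored_bytes(url)
    @docs.fetch(url)[:body].bytesize
  end

  def count
    @docs.size
  end
end

store   = PageStore.new
html    = "<html><body>" + "<p>hello crawler</p>" * 200 + "</body></html>"
updated = html + "<!-- updated -->"

store.upsert("http://a.example/", html)
store.upsert("http://a.example/", updated) # same URL: the document is replaced

puts store.count
puts store.stored_bytes("http://a.example/")
```

Crawling the same URL twice leaves a single document behind, and repetitive HTML compresses to a small fraction of its original size.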
Polipus: The infrastructure
Is it so easy?!
Not really...
1) Redis is an in-memory database
2) A queue of URLs can grow very fast
3) A queue of 1M URLs occupies about 370 MB on Redis (about 400 chars
for each entry)
4) MongoDB will eat your disk space: 50M saved pages take around 400 GB
Suggested Redis conf:
maxmemory 2.5GB (or whatever your instance can handle)
maxmemory-policy noeviction
After 6M, Redis started refusing writes
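A quick back-of-the-envelope check of the figures above, assuming binary units (1 MB = 1024 * 1024 bytes) and assuming the queue is the only thing held in Redis, which is not quite true in practice:

```ruby
mb = 1024.0 * 1024
gb = 1024.0 * mb

bytes_per_url  = 370 * mb / 1_000_000   # queue cost of one URL entry
urls_in_budget = 2.5 * gb / bytes_per_url # entries that fit in maxmemory
bytes_per_page = 400 * gb / 50_000_000  # MongoDB disk cost of one page

puts bytes_per_url.round    # bytes per queued URL
puts urls_in_budget.round   # URLs that fit in 2.5 GB
puts bytes_per_page.round   # bytes per stored page
```

This gives a little under 400 bytes per queued URL, room for roughly 6.9 million entries in 2.5 GB, and about 8.6 KB of disk per stored page. With `noeviction`, Redis rejects writes once the limit is reached instead of silently dropping queued URLs.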
An experiment using the currently available code
Setup:
6x t1.micro (web crawlers, 5 workers each)
1x m1.medium (Redis and MongoDB)
MongoDB with default settings
Redis
maxmemory 2.5GB
maxmemory-policy noeviction
~4,700,000 pages downloaded in 24h
...then I ran out of disk because of MongoDB
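Turning the result into a rate, using only the numbers from the setup above (6 instances with 5 workers each, about 4.7M pages in 24 hours):

```ruby
pages   = 4_700_000
hours   = 24
workers = 6 * 5 # 6 crawler instances, 5 workers each

pages_per_second     = pages / (hours * 3600.0)
pages_per_worker_sec = pages_per_second / workers

puts pages_per_second.round(1)
puts pages_per_worker_sec.round(1)
```

That is roughly 54 pages per second overall, or a bit under 2 pages per second per worker.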
TODO
• Redis memory guard
  • Should be able to move items from the Redis queue to MongoDB if the
    queue size hits a threshold, and move them back to Redis at some
    point
• Honor the robots.txt file
  • So that we can respect Disallow directives, if any
• Add support for Ruby Mechanize
  • Maintain browsing sessions
  • Fill in and submit forms
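As an illustration of the robots.txt item, one possible shape for a Disallow check is sketched below. It is deliberately simplified and is not Polipus code: it only reads the rules in the `User-agent: *` group and ignores Allow rules, wildcards and agent-specific groups. Both helper names are hypothetical.

```ruby
require "uri"

# Collect the Disallow path prefixes that apply to every user agent ("*").
def disallowed_prefixes(robots_txt)
  applies  = false
  prefixes = []
  robots_txt.each_line do |line|
    # Drop comments, then split "Field: value" on the first colon.
    field, value = line.sub(/#.*/, "").strip.split(":", 2).map(&:strip)
    next if value.nil?
    case field.downcase
    when "user-agent" then applies = (value == "*")
    when "disallow"   then prefixes << value if applies && !value.empty?
    end
  end
  prefixes
end

# A URL is allowed when its path does not start with any disallowed prefix.
def allowed?(url, prefixes)
  path = URI.parse(url).path
  path = "/" if path.empty?
  prefixes.none? { |prefix| path.start_with?(prefix) }
end

robots = <<~ROBOTS
  User-agent: *
  Disallow: /private/   # keep crawlers out
  Disallow: /tmp

  User-agent: BadBot
  Disallow: /
ROBOTS

rules = disallowed_prefixes(robots)
puts allowed?("http://a.example/public/page", rules)
puts allowed?("http://a.example/private/page", rules)
```

A crawler would run this check before pushing a discovered URL into the queue.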
Questions?
francesco@gild.com
facebook.com/francesco.laurita
www.gild.com