Transcript of "Crawling the world"

  1. Crawling the world @mmoreram
  2. Apparently...
  3. Nobody uses parsing in their applications
  4. Not even Chuck Norris
  5. Many businesses need crawling
  6. Crawling brings you knowledge. Knowledge is power
  7. And power is money
  8. What is crawling? Or parsing?
  9. Crawling: we download a URL with a request (HTML, XML…). We manipulate the response by searching for the desired data, like links, headers or any other kind of text or label. Once we have the needed content, we can update our database and make further decisions, for example parsing some of the links we found. And that's it!
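A minimal version of that loop might look like this in Python (the requests and BeautifulSoup libraries, the example URL and the selectors are illustrative assumptions, not something from the deck):

```python
# One crawl step: download a URL, search the response for the data we
# want, then decide what to parse next.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/catalog", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# The "desired data": every link plus the page heading, as an example.
links = [a["href"] for a in soup.select("a[href]")]
heading = soup.find("h1")
print(heading.get_text(strip=True) if heading else "no heading", len(links))

# From here we would update our database and enqueue some of the found
# links for further parsing.
```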
  10. “Machines will do what humans do before they realize” –Marc Morera, yesterday
  11. Let's see an example, step by step
  12. chicplace.com. Our goal is to parse all the available products, saving name, description, price, shop and categories. There are several strategies we can use when a site must be parsed. Let's see all the available strategies
  13. Parsing strategies. Linear: just one script. If any page fails (crawling error, server timeout, …), some kind of exception can be thrown and caught. Advantages: only one script is needed. Easier? Not even close… Problems: it cannot be distributed, so one script handles 1M requests. Memory problems?
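A rough sketch of the linear strategy (the URL list and the process() helper are hypothetical):

```python
# Linear strategy: a single script visits every page sequentially, so
# all state (and every failure) lives in one long-running process.
import requests

def process(html):
    ...  # hypothetical parsing/storage step

def crawl_linear(urls):
    for url in urls:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            process(response.text)
        except requests.RequestException as error:
            # The "some kind of exception" from the slide: we can log
            # and skip, but we cannot hand the page to another machine.
            print(f"{url} failed: {error}")
```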
  14. Parsing strategies. Distributed: one script for each case. If any page fails, it can be recovered by simply executing that script again. Advantages: every case is encapsulated in an individual script with low memory usage, and the work can easily be distributed using queues. Problems: none
  15. Crawling steps. Analyzing: think like Google does; find the fastest way through the labyrinth. Scripting: build scripts using queues for the distributed strategy; each queue maps to one kind of page. Running: keep in mind the impact of your actions (DDoS attacks, copyright)
  16. Analyzing. Every parsing process should be evaluated as a simple crawler, Google being the canonical example. How do we access all the needed pages with the lowest server impact? Usually, all serious websites are designed so that every page is reachable within 3 clicks
  17. Analyzing. We will use the category map to access all available products
  18. Analyzing. Each category lists all of its available products
  19. Analyzing. Do we also need to parse the product page? In fact, we do. We already have name, price and category, but we also need description and shop. So we have the main page to parse all category links, the category page with all products (which can be paginated), and we also need the product page to get the full information. The product page is responsible for saving all the data in the database
  20. Scripting. We will use the distributed strategy, with queues and supervisord. Supervisord is responsible for managing X instances of a process running at the same time. Using a distributed queue system, we will have 3 workers
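The supervisord side can be as small as this sketch (program names, script paths and instance counts are invented for illustration):

```ini
; Hypothetical supervisord config: keep 10 category workers and
; 50 product workers running in parallel, restarting them on failure.
[program:category-worker]
command=python /opt/crawler/category_worker.py
numprocs=10
process_name=%(program_name)s_%(process_num)02d
autorestart=true

[program:product-worker]
command=python /opt/crawler/product_worker.py
numprocs=50
process_name=%(program_name)s_%(process_num)02d
autorestart=true
```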
  21. Worker? Yep, worker. Using a queue system, a worker is like a box (a script) with parameters (input values) that just does something. We have 3 kinds of workers. One of them, the CategoryWorker, will receive a category URL, parse the related content (HTML) and detect all products. Each product will generate a new job for the ProductWorker
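A toy CategoryWorker in Python (Redis is assumed as the queue backend; the queue names follow the next slides, while the CSS selector is a guess at the site's markup):

```python
# Hypothetical CategoryWorker: block on categories-queue, parse the
# category page, and enqueue every product URL it finds.
import requests
import redis
from bs4 import BeautifulSoup

queue = redis.Redis()

while True:
    # BLPOP blocks until a category URL arrives in the queue.
    _, category_url = queue.blpop("categories-queue")
    html = requests.get(category_url.decode(), timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.select("a.product-link"):  # selector is a guess
        queue.rpush("products-queue", link["href"])
```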
  22. Running. We enable all the workers and force the first one to run. The first worker will find all the category URLs and enqueue them into a queue named categories-queue. The second worker (for example, 10 instances) will just consume categories-queue, fetching each URL and parsing its content; here "content" means just product URLs
  23. Running. Each product URL is enqueued into another queue named products-queue. The third and last worker (50 instances) just consumes this queue, parses the content and extracts the needed data (name, description, shop, category and price)
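The last stage can be sketched the same way (save_product() and the selectors are hypothetical stand-ins for the real persistence code and markup):

```python
# Hypothetical ProductWorker: consume products-queue and store the
# final record (name, description, price; shop and category would
# follow the same pattern).
import requests
import redis
from bs4 import BeautifulSoup

queue = redis.Redis()

def save_product(record):
    ...  # write the record to the database

while True:
    _, product_url = queue.blpop("products-queue")
    html = requests.get(product_url.decode(), timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    save_product({
        "name": soup.select_one("h1").get_text(strip=True),
        "description": soup.select_one(".description").get_text(strip=True),
        "price": soup.select_one(".price").get_text(strip=True),
    })
```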
  24. OK. Call me God
  25. but…
  26. “Don't shoot the messenger” –Some bored man
  27. Warning! 50 workers requesting chicplace in parallel is a big problem. @Gonzalo (the CTO) will be angry, and he will detect that something is happening. So we must be careful not to alert him, or at least avoid being discovered
  28. Warning: do not try this at home
  29. Be invisible. To be invisible, we can parse the whole site slowly (over days). To be faster, we can mask our IP using proxies (how about a different proxy for every request?). Even better, we can route requests through an anonymizing network like Tor. To be stupid, we can just parse chicplace with our own IP (most companies will not even notice)
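Both tricks are a few lines with most HTTP clients; a sketch (the 5-second delay and Tor's default SOCKS port 9050 are assumptions about your setup, and requests needs the requests[socks] extra for SOCKS proxies):

```python
# Throttled, masked crawling: pause between pages and send every
# request through Tor's local SOCKS proxy instead of our own IP.
import time
import requests

proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

for url in ["https://example.com/page-1", "https://example.com/page-2"]:
    response = requests.get(url, timeout=10, proxies=proxies)
    print(url, response.status_code)
    time.sleep(5)  # parse the site slowly to stay under the radar
```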
  30. They are attacking me!
  31. “And whatever you ask in prayer, you will receive, if you have faith” –Matthew 21:22
  32. My prayer! A good crawling implementation is infallible: the server will receive dozens of requests per second and will not recognize any pattern that separates crawler requests from simple user requests. So…?
  33. Welcome to the amazing world of Crawling
  34. Where no one is SAFE