Crawler

  1. Crawler, @hack-stuff.com. Anything can be a crawler. November 11, 2012
  2. What's a Crawler? Crawlers walk the network, search anything they find, and do anything they want... Search engine. Data finder / collector. Anything else...
  3. Concept: A crawler can easily be separated into three steps... Download. Data operation. Find the next seed.
  4. Pseudo Code: Fetch the web page, parse it, get the useful information, and repeat:

        for url in nextSeed():
            info = fetch(url)
            data, seeds = operate(info)
            pushSeed(seeds)
  5. Greedy: But easy things are always hard to solve... Web servers always block the crawler! The data is never structured! How to find the next seed? The crawler is always bounded by network speed...
  6. Operation: Once we connect to the target... Download the web page and parse the HTML code. Download the database and parse the DB format. Finally, record everything into our DB.
  7. Pseudo Code: Parse the HTML code, for example, to search for what you need:

        from BeautifulSoup import *

        soup = BeautifulSoup(webpage)
        ## Print the main body
        print soup.html.body
        ## Print the first tag <a> in body
        print soup.html.body.a
        ## Find a particular tag
        tags = soup.findAll('form')
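     The slide stops at parsing; the "record everything into our DB" step from slide 6 could look like the minimal sketch below. The sqlite3 file name and the pages table are assumptions for illustration, not part of the slides.

        import sqlite3

        ## Open (or create) a local database with one table for raw pages
        conn = sqlite3.connect('crawler.db')
        conn.execute('CREATE TABLE IF NOT EXISTS pages (url TEXT, body TEXT)')

        ## Store the fetched page next to its URL
        conn.execute('INSERT INTO pages VALUES (?, ?)', (url, webpage))
        conn.commit()
        conn.close()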
  8. Operation (cont'd): Moreover, you can also do something else, like sending a payload, while operating on the web page... POST / GET methods based on the HTML. Find the next seed on the web page. Something good / bad. (A sketch of both follows below.)
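     A hedged sketch of those two operations, reusing the soup object from slide 7: submit the first form via POST, and collect every link as a candidate seed. The form fields ('user', 'pass') and the login form itself are hypothetical.

        import urllib, urllib2

        ## POST: fill in the first form found on the page
        form = soup.find('form')
        payload = urllib.urlencode({'user': 'cmj', 'pass': 'secret'})
        result = urllib2.urlopen(form['action'], payload).read()

        ## Next seeds: every <a href> on the page
        seeds = [a['href'] for a in soup.findAll('a', href=True)]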
  9. Link to Site: Before we can operate on the web page, we need to... Link to the web site. Get the web page. But the server master hates net crawlers, because they: add no functionality; slow down / burn out the resources; act like thieves.
  10. Fetch: If you are not Google, you must be a human.
  11. Be a Human: Behave like a human being... No one can press a key in under 0.11 seconds. No one reads a page in only a few seconds. No one can work all day.
  12. Rules: Use a framework / tool to emulate the browser. Change the default settings. Simulate an existing browser. Cookie support. Timing issues and random delays.
  13. Pseudo Code: Simple fetch code:

        import urllib2
        from cookielib import CookieJar
        import time, random

        for n in range(MAX_LOOP):
            ## Cookie
            ck = CookieJar()
            ck = urllib2.HTTPCookieProcessor(ck)
            req = urllib2.build_opener(ck)
            ## User-Agent
            req.addheaders = [('User-Agent', 'crawlercmj')]
            data = req.open(url).read()
            ## Wait
            time.sleep(random.randint(0, 5))
  14. Seed: The last step, but the hardest one... We never know the next sheep.
  15. Find Sheep: Use a well-known search engine. But the search engine also blocks other crawlers. The crawler needs to parse the garbage code. The result may be js code... Or use a random / enumeration method. Too hard to find a useful target. Costs lots of time. Cannot catch the sheep immediately.
  16. Search-Engine Based: Design another crawler. Take an initial keyword as the seed. Fetch the search engine. Parse the result and get the next seed if possible. Repeat until stopped or blocked. (See the sketch below.)
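     A minimal sketch of that loop, reusing fetch() and operate() from slide 4. The search URL is a placeholder, and the seen set / queue handling are assumptions, not part of the slides.

        import urllib

        seeds = ['initial keyword']
        seen = set()

        while seeds:
            query = seeds.pop(0)
            if query in seen:
                continue        ## skip keywords we already searched
            seen.add(query)
            page = fetch('http://search.example.com/?q=' + urllib.quote(query))
            data, next_seeds = operate(page)
            seeds.extend(next_seeds)    ## repeat until stopped or blocked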
  17. Tricky: Use a distributed model. Separate each part. More volunteers can speed it up.
  18. Pyro4: Pyro4 can help you remotely control Python objects... An exposed object can be accessed as if it were local. Use remote resources to do the processing. Provides an M-n model. (A sketch follows below.)
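     A hedged sketch of a Pyro4 worker for that distributed model. The Worker class and its fetch() method are illustrative; Pyro4.Daemon, daemon.register, and Pyro4.Proxy are the library's actual entry points.

        import urllib2
        import Pyro4

        class Worker(object):
            def fetch(self, url):
                ## Runs on the volunteer machine, using its bandwidth
                return urllib2.urlopen(url).read()

        daemon = Pyro4.Daemon()            ## listen for remote calls
        uri = daemon.register(Worker())    ## expose the object, get its URI
        print uri                          ## hand this URI to the master
        daemon.requestLoop()

     On the master side, a proxy makes the remote object look local:

        worker = Pyro4.Proxy(uri)
        data = worker.fetch(url)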
  19. Thanks for participating. Q&A
