Growing spiders to crawl the web - Dutch PHP conference

Transcript

  • 1. GROWING SPIDERS TO CRAWL THE WEB - Juozas Kaziukėnas // juokaz.com // @juokaz
  • 2. WEB SPIDERS
  • 3. Juozas Kaziukėnas, Lithuanian. You can call me Joe. More info: http://juokaz.com
  • 4. WHY CRAWL?
  • 5. WE NEED DATA: 1. Get data, 2. ???, 3. Profit
  • 6. IF PEOPLE ARE SCRAPING YOUR SITE, YOU HAVE DATA PEOPLE WANT. CONSIDER MAKING AN API - Russell Ahlstrom
  • 7. DATA SCIENCE
  • 8. 1. FIGURE OUT WHAT TO REQUEST, 2. MAKE A REQUEST, 3. PARSE THE RESPONSE, 4. STORE RESULTS
  • 9. WHAT TO EXTRACT
  • 10. AS LITTLE AS POSSIBLE
  • 11. MAKE A REQUEST
  • 12. file_get_contents($url);
  • 13. HANDLING HTTP ERRORS
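The slides name the problem but not the code; a minimal sketch of HTTP error handling around cURL, assuming an illustrative `fetchOrNull()` helper (the name and the null-on-failure convention are not from the talk):

```php
<?php
// Sketch: fetch a URL and surface failures instead of silently
// returning partial or error-page content.
function fetchOrNull($url) {
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true); // return body instead of printing it
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($handle, CURLOPT_TIMEOUT, 10);          // never let a crawler hang forever

    $body = curl_exec($handle);
    $status = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    $errno = curl_errno($handle);
    curl_close($handle);

    // Network-level failure (DNS, timeout, refused connection) or non-2xx status.
    if ($errno !== 0 || $status < 200 || $status >= 300) {
        return null;
    }
    return $body;
}
```

A caller can then decide whether a failed URL should be retried or dropped, which the later slides on retries and measurement build on.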
  • 14. OPTIMIZE HTTP REQUESTS
  • 15. function get($url) {
            // Create a handle.
            $handle = curl_init($url);

            // Set options...

            // Do the request.
            $ret = curl_exec($handle);

            // Do stuff with the results...

            // Destroy the handle.
            curl_close($handle);
        }
  • 16. function get($url) {
            // Create a handle.
            $handle = curl_init($url);

            // Set options...

            // Do the request.
            $ret = curlExecWithMulti($handle);

            // Do stuff with the results...

            // Destroy the handle.
            curl_close($handle);
        }
  • 17. function curlExecWithMulti($handle) {
            // In real life this is a class variable.
            static $multi = NULL;

            // Create a multi if necessary.
            if (empty($multi)) {
                $multi = curl_multi_init();
            }

            // Add the handle to be processed.
            curl_multi_add_handle($multi, $handle);

            // Do all the processing.
            $active = NULL;
            do {
                $mrc = curl_multi_exec($multi, $active);
            } while ($mrc == CURLM_CALL_MULTI_PERFORM);

            while ($active && $mrc == CURLM_OK) {
                if (curl_multi_select($multi) != -1) {
                    do {
                        $mrc = curl_multi_exec($multi, $active);
                    } while ($mrc == CURLM_CALL_MULTI_PERFORM);
                }
            }

            // Remove the handle from the multi processor.
            curl_multi_remove_handle($multi, $handle);

            return TRUE;
        }
  • 18. QUEUES FOR EVERYTHING
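The talk recommends queues without showing one; a sketch of the URL-frontier shape using SplQueue plus a "seen" set so the same URL is never crawled twice (a real crawler would use a queue server, and the `enqueueUrl` helper name is illustrative):

```php
<?php
// A FIFO frontier of URLs to crawl, with deduplication.
$frontier = new SplQueue();
$seen = array();

function enqueueUrl(SplQueue $frontier, array &$seen, $url) {
    if (isset($seen[$url])) {
        return false; // already queued or crawled
    }
    $seen[$url] = true;
    $frontier->enqueue($url);
    return true;
}

enqueueUrl($frontier, $seen, 'http://example.com/');
enqueueUrl($frontier, $seen, 'http://example.com/about');
enqueueUrl($frontier, $seen, 'http://example.com/'); // duplicate, ignored
```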
  • 19. ASYNCHRONOUS PROCESSING
  • 20. DO NOT BLOCK FOR I/O
  • 21. RETRIES
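One way to implement the retries slide is a wrapper around any fetch callable; the attempt count and doubling backoff below are illustrative defaults, not values from the talk:

```php
<?php
// Sketch: retry a callable that returns null on failure, backing off
// between attempts.
function withRetries(callable $attempt, $maxTries = 3, $delayMs = 0) {
    for ($try = 1; $try <= $maxTries; $try++) {
        $result = $attempt();
        if ($result !== null) {
            return $result; // success
        }
        if ($try < $maxTries && $delayMs > 0) {
            usleep($delayMs * 1000); // back off before the next attempt
            $delayMs *= 2;           // exponential backoff
        }
    }
    return null; // permanently failed; record it and move on
}
```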
  • 22. REGULAR EXPRESSIONS
  • 23. REGULAR EXPRESSIONS NOT
  • 24. XPATH
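XPath extraction with PHP's built-in DOM extension, the approach the slides recommend over regular expressions. The HTML string here is an inline stand-in for a fetched page:

```php
<?php
$html = '<html><body>
    <ul id="products">
        <li class="product"><a href="/p/1">Widget</a></li>
        <li class="product"><a href="/p/2">Gadget</a></li>
    </ul>
</body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; suppress parser warnings.
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Extract link text keyed by href for every product entry.
$links = array();
foreach ($xpath->query('//li[@class="product"]/a') as $a) {
    $links[$a->getAttribute('href')] = trim($a->textContent);
}
```

Unlike a regex, the query keeps working when attributes are reordered or whitespace changes, though it still breaks if the site's markup structure changes, which is what the "what happens when the page changes" slide warns about.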
  • 25. PHANTOM.JS/SELENIUM
  • 26. WHAT HAPPENS WHEN THE PAGE CHANGES
  • 27. ACTING LIKE A HUMAN
  • 28. HTTP HEADERS
  • 29. $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] = "Cache-Control: max-age=0";
        $header[] = "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // Browsers keep this blank.
        curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7");
        curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
  • 30. COOKIES AND SESSIONS
        curl_setopt($curl, CURLOPT_COOKIEJAR, $cookieJar);
        curl_setopt($curl, CURLOPT_COOKIEFILE, $cookieJar);
  • 31. AVOIDING GETTING BLOCKED
  • 32. DO NOT DDOS
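"Do not DDoS" in practice means rate-limiting your own requests; a per-host throttle sketch (the class name and the one-second politeness delay are assumptions for illustration):

```php
<?php
// Remember the last request time per host and sleep until the minimum
// interval has passed before hitting that host again.
class Throttle {
    private $lastHit = array();
    private $intervalUs;

    public function __construct($intervalSeconds = 1.0) {
        $this->intervalUs = (int)($intervalSeconds * 1000000);
    }

    public function waitFor($url) {
        $host = parse_url($url, PHP_URL_HOST);
        $now = (int)(microtime(true) * 1000000);
        if (isset($this->lastHit[$host])) {
            $elapsed = $now - $this->lastHit[$host];
            if ($elapsed < $this->intervalUs) {
                usleep($this->intervalUs - $elapsed);
            }
        }
        $this->lastHit[$host] = (int)(microtime(true) * 1000000);
    }
}
```

Because the delay is tracked per host, a crawler can still keep many hosts busy in parallel while staying polite to each one.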
  • 33. PROXY NETWORK
  • 34. ACT LIKE A HUMAN BROWSING THE PAGE
        curl_setopt($curl, CURLOPT_AUTOREFERER, true);
  • 35. ROBOTS.TXT
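Honouring robots.txt starts with checking Disallow rules; a minimal sketch that only handles the common `User-agent: *` group (real parsers also handle Allow lines, wildcards, and per-bot groups, and `isAllowed` is an illustrative name):

```php
<?php
// Return false if the given path is disallowed for all user agents.
function isAllowed($robotsTxt, $path) {
    $applies = false;
    foreach (preg_split('/\r?\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if (stripos($line, 'User-agent:') === 0) {
            $applies = (trim(substr($line, 11)) === '*');
        } elseif ($applies && stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, 9));
            // An empty Disallow means "allow everything".
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return false;
            }
        }
    }
    return true;
}
```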
  • 36. LEGAL ISSUES
  • 37. YOU ARE GOING TO GET SUED
  • 38. MEASURE EVERYTHING
  • 39. 1. Response time, 2. Response size, 3. HTTP error type, 4. Retry count, 5. Failing proxy IP, 6. Failing parsing, 7. etc.
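The metrics listed above can be captured with even a trivial in-memory collector; a sketch (a real crawler would ship these to a metrics system, and the `CrawlStats` class and its method names are illustrative):

```php
<?php
// Count named events and record timings so failing proxies, parse
// errors, and slow responses show up in numbers, not guesses.
class CrawlStats {
    private $counters = array();
    private $timings = array();

    public function increment($name) {
        $this->counters[$name] = ($this->counters[$name] ?? 0) + 1;
    }

    public function recordTiming($name, $seconds) {
        $this->timings[$name][] = $seconds;
    }

    public function count($name) {
        return $this->counters[$name] ?? 0;
    }

    public function averageTiming($name) {
        $samples = $this->timings[$name] ?? array();
        return $samples ? array_sum($samples) / count($samples) : 0.0;
    }
}
```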
  • 40. OPTIMIZE AND REPEAT
  • 41. WEB CRAWLING FOR FUN AND PROFIT
  • 42. THANKS! Juozas Kaziukėnas @juokaz
