GROWING SPIDERS TO CRAWL THE WEB
Juozas Kaziukėnas // juokaz.com // @juokaz
WEB SPIDERS
Juozas Kaziukėnas, Lithuanian. You can call me Joe. More info: http://juokaz.com
WHY CRAWL?
WE NEED DATA
1. Get data
2. ???
3. Profit
IF PEOPLE ARE SCRAPING YOUR SITE, YOU HAVE DATA PEOPLE WANT. CONSIDER MAKING AN API
Russell Ahlstrom
DATA SCIENCE
1. FIGURE OUT WHAT TO REQUEST
2. MAKE A REQUEST
3. PARSE THE RESPONSE
4. STORE RESULTS
WHAT TO EXTRACT
AS LITTLE AS POSSIBLE
MAKE A REQUEST
file_get_contents($url);
HANDLING HTTP ERRORS
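The slide is a title only; one minimal sketch of what error handling might look like, using a hypothetical classifyResponse() helper (the name and thresholds are ours, not from the talk):

```php
<?php
// Hypothetical helper (not from the slides): decide what to do with a
// finished request based on the cURL error number and the HTTP status code.
function classifyResponse(int $curlErrno, int $httpCode): string {
    if ($curlErrno !== 0) {
        return 'retry'; // network-level failure: timeout, DNS, refused, ...
    }
    if ($httpCode >= 500) {
        return 'retry'; // server error, often transient
    }
    if ($httpCode >= 400) {
        return 'skip';  // client error, retrying will not help
    }
    return 'ok';
}

// After curl_exec(), something like:
// $action = classifyResponse(curl_errno($handle),
//                            curl_getinfo($handle, CURLINFO_HTTP_CODE));
```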
OPTIMIZE HTTP REQUESTS
function get($url) {
    // Create a handle.
    $handle = curl_init($url);

    // Set options...

    // Do the request.
    $ret = curl_exec($handle);

    // Do stuff with the results...

    // Destroy the handle.
    curl_close($handle);
}
function get($url) {
    // Create a handle.
    $handle = curl_init($url);

    // Set options...

    // Do the request.
    $ret = curlExecWithMulti($handle);

    // Do stuff with the results...

    // Destroy the handle.
    curl_close($handle);
}
function curlExecWithMulti($handle) {
    // In real life this is a class variable.
    static $multi = NULL;

    // Create a multi if necessary.
    if (empty($multi)) {
        $multi = curl_multi_init();
    }

    // Add the handle to be processed.
    curl_multi_add_handle($multi, $handle);

    // Do all the processing.
    $active = NULL;
    do {
        $ret = curl_multi_exec($multi, $active);
    } while ($ret == CURLM_CALL_MULTI_PERFORM);

    while ($active && $ret == CURLM_OK) {
        if (curl_multi_select($multi) != -1) {
            do {
                $mrc = curl_multi_exec($multi, $active);
            } while ($mrc == CURLM_CALL_MULTI_PERFORM);
        }
    }

    // Remove the handle from the multi processor.
    curl_multi_remove_handle($multi, $handle);

    return TRUE;
}
QUEUES FOR EVERYTHING
ASYNCHRONOUS PROCESSING
DO NOT BLOCK FOR I/O
RETRIES
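The retry idea above can be sketched as a wrapper with exponential backoff; withRetries and its parameters are illustrative names of ours, not code from the talk:

```php
<?php
// Illustrative retry wrapper with exponential backoff (names are ours).
function withRetries(callable $fn, int $maxAttempts = 3, int $baseDelayMs = 100) {
    for ($attempt = 1; ; $attempt++) {
        try {
            return $fn();
        } catch (Exception $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // out of attempts, surface the last error
            }
            // Sleep 100ms, 200ms, 400ms, ... between attempts.
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }
}
```

A request that fails twice and succeeds on the third attempt returns normally; only after $maxAttempts failures does the error propagate.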
REGULAR EXPRESSIONS
REGULAR EXPRESSIONS NOT
XPATH
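For example, extracting a value with PHP's built-in DOM extension and an XPath query; the sample HTML and class name here are made up for illustration:

```php
<?php
// Parse HTML with DOMDocument and query it with XPath instead of regexes.
$html = '<html><body><div class="price">19.99</div></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html); // @ silences warnings from messy real-world markup

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@class="price"]');
$price = $nodes->item(0)->textContent;
```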
PHANTOM.JS/SELENIUM
WHAT HAPPENS WHEN THE PAGE CHANGES
ACTING LIKE A HUMAN
HTTP HEADERS
$header = array();
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: "; // Browsers keep this blank.

curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7");
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
COOKIES AND SESSIONS
curl_setopt($curl, CURLOPT_COOKIEJAR, $cookieJar);
curl_setopt($curl, CURLOPT_COOKIEFILE, $cookieJar);
AVOIDING GETTING BLOCKED
DO NOT DDOS
PROXY NETWORK
ACT LIKE A HUMAN BROWSING THE PAGE
curl_setopt($curl, CURLOPT_AUTOREFERER, true);
ROBOTS.TXT
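A deliberately tiny sketch of honoring robots.txt; this version only looks at Disallow prefixes and ignores User-agent sections, wildcards and Allow rules, so a real crawler should use a proper parser:

```php
<?php
// Naive robots.txt check (sketch, our own simplification): a path is
// blocked if any Disallow prefix matches it; everything else is ignored.
function isAllowed(string $robotsTxt, string $path): bool {
    foreach (preg_split('/\r?\n/', $robotsTxt) as $line) {
        if (stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, 9)); // 9 = strlen('Disallow:')
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return false;
            }
        }
    }
    return true;
}
```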
LEGAL ISSUES
YOU ARE GOING TO GET SUED
MEASURE EVERYTHING
1. Response time
2. Response size
3. HTTP error type
4. Retries count
5. Failing proxy IP
6. Failing parsing
7. etc.
OPTIMIZE AND REPEAT
WEB CRAWLING FOR FUN AND PROFIT
THANKS!
Juozas Kaziukėnas
@juokaz
Growing spiders to crawl the web - Dutch PHP conference
