Growing web spiders - VilniusPHP

Published in: Technology, News & Politics

  1. 1. GROWING WEB SPIDERS Juozas Kaziukėnas // juokaz.com // @juokaz
  2. 2. 300’000’000 products / 24 hours = 12’500’000 products per hour; / 3600 seconds = 3’472 products per second; / 3000 nodes ≈ 1.16 products per second per node (about 0.86 sec. per product). 24’000 cores on Amazon = $300/h
  3. 3. Juozas Kaziukėnas, Lithuanian. You can call me Joe. More info: http://juokaz.com
  4. 4. WHY CRAWL?
  5. 5. WE NEED DATA 1. Get data 2. ??? 3. Profit
  6. 6. IF PEOPLE ARE SCRAPING YOUR SITE, YOU HAVE DATA PEOPLE WANT. CONSIDER MAKING AN API. Russell Ahlstrom
  7. 7. DATA SCIENCE
  8. 8. 1. FIGURE OUT WHAT TO REQUEST 2. MAKE A REQUEST 3. PARSE THE RESPONSE 4. STORE RESULTS
  9. 9. WHAT TO EXTRACT
  10. 10. AS LITTLE AS POSSIBLE
  11. 11. MAKE A REQUEST
  12. 12. file_get_contents($url);
  13. 13. HANDLING HTTP ERRORS
  14. 14. OPTIMIZE HTTP REQUESTS
  15. 15. function get($url) {
              // Create a handle.
              $handle = curl_init($url);
              // Set options...
              // Do the request.
              $ret = curl_exec($handle);
              // Do stuff with the results...
              // Destroy the handle.
              curl_close($handle);
          }
  16. 16. function get($url) {
              // Create a handle.
              $handle = curl_init($url);
              // Set options...
              // Do the request through a shared multi handle.
              $ret = curlExecWithMulti($handle);
              // Do stuff with the results...
              // Destroy the handle.
              curl_close($handle);
          }
  17. 17. function curlExecWithMulti($handle) {
              // In real life this is a class variable.
              static $multi = NULL;
              // Create a multi if necessary.
              if (empty($multi)) {
                  $multi = curl_multi_init();
              }
              // Add the handle to be processed.
              curl_multi_add_handle($multi, $handle);
              // Start the transfer.
              $active = NULL;
              do {
                  $ret = curl_multi_exec($multi, $active);
              } while ($ret == CURLM_CALL_MULTI_PERFORM);
              // Keep processing until the transfer finishes or an error occurs.
              while ($active && $ret == CURLM_OK) {
                  // Wait for activity; if select fails (-1), pause briefly instead of spinning.
                  if (curl_multi_select($multi) == -1) {
                      usleep(100);
                  }
                  do {
                      $ret = curl_multi_exec($multi, $active);
                  } while ($ret == CURLM_CALL_MULTI_PERFORM);
              }
              // Remove the handle from the multi processor.
              curl_multi_remove_handle($multi, $handle);
              return TRUE;
          }
  18. 18. QUEUES FOR EVERYTHING
  19. 19. ASYNCHRONOUS PROCESSING
  20. 20. DO NOT BLOCK FOR I/O
  21. 21. RETRIES
  22. 22. REGULAR EXPRESSIONS
  23. 23. REGULAR EXPRESSIONS NOT
  24. 24. XPATH
  25. 25. PHANTOM.JS/SELENIUM
  26. 26. WHAT HAPPENS WHEN THE PAGE CHANGES
  27. 27. ACTING LIKE A HUMAN
  28. 28. HTTP HEADERS
  29. 29. $header = array();
          $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
          $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
          $header[] = "Cache-Control: max-age=0";
          $header[] = "Connection: keep-alive";
          $header[] = "Keep-Alive: 300";
          $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
          $header[] = "Accept-Language: en-us,en;q=0.5";
          $header[] = "Pragma: "; // Browsers keep this blank.
          curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
          curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
  30. 30. COOKIES AND SESSIONS
          curl_setopt($curl, CURLOPT_COOKIEJAR, $cookieJar);
          curl_setopt($curl, CURLOPT_COOKIEFILE, $cookieJar);
  31. 31. AVOIDING GETTING BLOCKED
  32. 32. DO NOT DDOS
  33. 33. PROXY NETWORK HAProxy
  34. 34. ACT LIKE A HUMAN BROWSING THE PAGE
          curl_setopt($curl, CURLOPT_AUTOREFERER, true);
  35. 35. ROBOTS.TXT
  36. 36. LEGAL ISSUES
  37. 37. YOU ARE GOING TO GET SUED
  38. 38. MEASURE EVERYTHING
  39. 39. 1. Response time 2. Response size 3. HTTP error type 4. Retries count 5. Failing proxy IP 6. Failing parsing 7. etc.
  40. 40. OPTIMIZE AND REPEAT
  41. 41. WEB CRAWLING FOR FUN AND PROFIT
  42. 42. THANKS! Juozas Kaziukėnas @juokaz
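
Slides 22 to 24 recommend XPath over regular expressions for parsing, and slide 10 recommends extracting as little as possible. A minimal sketch of that idea follows, using PHP's bundled DOM extension. The HTML fragment, the class names `title` and `price`, and the helper `extractProduct()` are all hypothetical and chosen only for illustration; they are not taken from the talk.

```php
<?php
// Hypothetical product page fragment; real pages will differ.
$html = <<<HTML
<html><body>
  <div class="product">
    <h1 class="title">Blue Widget</h1>
    <span class="price">19.99</span>
  </div>
</body></html>
HTML;

// Hypothetical helper: pull only the two fields we care about.
function extractProduct(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // string(...) returns the text of the first matching node,
    // or an empty string when nothing matches.
    $title = $xpath->evaluate('string(//h1[@class="title"])');
    $price = $xpath->evaluate('string(//span[@class="price"])');

    return ['title' => trim($title), 'price' => (float) $price];
}

$product = extractProduct($html);
```

When the page layout changes (slide 26), a missing node yields an empty title here, which is the kind of parsing failure slide 39 suggests counting.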

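The slides on handling HTTP errors, retries, and measuring everything (13, 21, 38, 39) can be combined into one rough sketch. The function names `fetchWithRetries()` and `curlFetch()`, the result array shape, the retry count, and the delays are assumptions made for this example, not the speaker's implementation. The fetcher is passed in as a callable so the retry logic can be exercised without touching the network.

```php
<?php
// Hypothetical retry wrapper: calls $fetch($url) until it returns a 2xx
// status or the attempts run out. $fetch must return an array with at
// least the keys 'status' (int) and 'body' (string).
function fetchWithRetries(callable $fetch, string $url, int $maxAttempts = 3, int $baseDelayMs = 200): array
{
    $result = ['status' => 0, 'body' => ''];
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $fetch($url);
        if ($result['status'] >= 200 && $result['status'] < 300) {
            break;
        }
        if ($attempt < $maxAttempts) {
            // Exponential backoff: base, 2x base, 4x base, ...
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }
    // Record how many attempts were made, one of the metrics worth tracking.
    $result['attempts'] = min($attempt, $maxAttempts);
    return $result;
}

// Hypothetical cURL-based fetcher that also records response time and size.
function curlFetch(string $url): array
{
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($handle);
    $body = ($body === false) ? '' : $body;
    // The handle is released when it goes out of scope.
    return [
        'status'  => (int) curl_getinfo($handle, CURLINFO_HTTP_CODE),
        'body'    => $body,
        'seconds' => curl_getinfo($handle, CURLINFO_TOTAL_TIME),
        'bytes'   => strlen($body),
    ];
}

// Stub fetcher for demonstration: fails twice with 503, then succeeds.
$calls = 0;
$flaky = function (string $url) use (&$calls): array {
    $calls++;
    return $calls < 3
        ? ['status' => 503, 'body' => '']
        : ['status' => 200, 'body' => 'ok'];
};

$result = fetchWithRetries($flaky, 'http://example.com/', 5, 1);
```

In real use the stub would be replaced with `'curlFetch'`, for example `fetchWithRetries('curlFetch', $url)`. In a queue-based crawler (slides 18 to 20) the backoff would more likely be a delayed re-enqueue than a blocking `usleep()`, which is used here only to keep the sketch short.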