Growing web spiders - VilniusPHP

Published in: Technology, News & Politics

Transcript

  • 1. GROWING WEB SPIDERS Juozas Kaziukėnas // juokaz.com // @juokaz
  • 2. 300’000’000 products / 24 hours = 12’500’000 products per hour / 3600 seconds = 3’472 products per second / 3000 nodes = 1.16 products per second per node (about 0.86 sec. per product) 24’000 cores on Amazon = $300/h
  • 3. Juozas Kaziukėnas, Lithuanian You can call me Joe More info http://juokaz.com
  • 4. WHY CRAWL?
  • 5. WE NEED DATA 1. Get data 2. ??? 3. Profit
  • 6. IF PEOPLE ARE SCRAPING YOUR SITE, YOU HAVE DATA PEOPLE WANT. CONSIDER MAKING AN API. Russell Ahlstrom
  • 7. DATA SCIENCE
  • 8. 1. FIGURE OUT WHAT TO REQUEST 2. MAKE A REQUEST 3. PARSE THE RESPONSE 4. STORE RESULTS
  • 9. WHAT TO EXTRACT
  • 10. AS LITTLE AS POSSIBLE
  • 11. MAKE A REQUEST
  • 12. FILE_GET_CONTENTS($URL);
  • 13. HANDLING HTTP ERRORS
  • 14. OPTIMIZE HTTP REQUESTS
  • 15. function get($url) {
          // Create a handle.
          $handle = curl_init($url);

          // Set options...

          // Do the request.
          $ret = curl_exec($handle);

          // Do stuff with the results...

          // Destroy the handle.
          curl_close($handle);
        }
  • 16. function get($url) {
          // Create a handle.
          $handle = curl_init($url);

          // Set options...

          // Do the request.
          $ret = curlExecWithMulti($handle);

          // Do stuff with the results...

          // Destroy the handle.
          curl_close($handle);
        }
  • 17. function curlExecWithMulti($handle) {
          // In real life this is a class variable.
          static $multi = NULL;

          // Create a multi if necessary.
          if (empty($multi)) {
            $multi = curl_multi_init();
          }

          // Add the handle to be processed.
          curl_multi_add_handle($multi, $handle);

          // Do all the processing.
          $active = NULL;
          do {
            $ret = curl_multi_exec($multi, $active);
          } while ($ret == CURLM_CALL_MULTI_PERFORM);

          while ($active && $ret == CURLM_OK) {
            if (curl_multi_select($multi) != -1) {
              do {
                $mrc = curl_multi_exec($multi, $active);
              } while ($mrc == CURLM_CALL_MULTI_PERFORM);
            }
          }

          // Remove the handle from the multi processor.
          curl_multi_remove_handle($multi, $handle);

          return TRUE;
        }
  • 18. QUEUES FOR EVERYTHING
  • 19. ASYNCHRONOUS PROCESSING
  • 20. DO NOT BLOCK FOR I/O
  • 21. RETRIES
  • 22. REGULAR EXPRESSIONS
  • 23. REGULAR EXPRESSIONS NOT
  • 24. XPATH
  • 25. PHANTOM.JS/SELENIUM
  • 26. WHAT HAPPENS WHEN THE PAGE CHANGES
  • 27. ACTING LIKE A HUMAN
  • 28. HTTP HEADERS
  • 29. $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] = "Cache-Control: max-age=0";
        $header[] = "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // Browsers keep this blank.
        curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
        curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
  • 30. COOKIES AND SESSIONS
        curl_setopt($curl, CURLOPT_COOKIEJAR, $cookieJar);
        curl_setopt($curl, CURLOPT_COOKIEFILE, $cookieJar);
  • 31. AVOIDING GETTING BLOCKED
  • 32. DO NOT DDOS
  • 33. PROXY NETWORK HAProxy
  • 34. ACT LIKE A HUMAN BROWSING THE PAGE
        curl_setopt($curl, CURLOPT_AUTOREFERER, true);
  • 35. ROBOTS.TXT
  • 36. LEGAL ISSUES
  • 37. YOU ARE GOING TO GET SUED
  • 38. MEASURE EVERYTHING
  • 39. 1. Response time 2. Response size 3. HTTP error type 4. Retries count 5. Failing proxy IP 6. Failing parsing 7. etc.
  • 40. OPTIMIZE AND REPEAT
  • 41. WEB CRAWLING FOR FUN AND PROFIT
  • 42. THANKS! Juozas Kaziukėnas @juokaz
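
Slides 13 and 21 name HTTP error handling and retries without showing code. The following is a minimal sketch of one way to retry a failed request with exponential backoff; it is not taken from the talk. The function name `fetchWithRetries`, its parameters, and the convention that the fetch callable returns `false` on failure are all assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper: call $fetch($url) until it succeeds or the
 * attempt budget is used up. $fetch is assumed to return the response
 * body as a string, or false on failure.
 */
function fetchWithRetries(callable $fetch, string $url, int $maxAttempts = 3, int $baseDelayMs = 200): array
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $body = $fetch($url);
        if ($body !== false) {
            return ['body' => $body, 'attempts' => $attempt];
        }
        if ($attempt < $maxAttempts) {
            // Exponential backoff: base delay, then 2x, 4x, ...
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }
    // Every attempt failed; the caller decides whether to requeue the URL.
    return ['body' => false, 'attempts' => $maxAttempts];
}
```

Returning the attempt count alongside the body makes it easy to record the "retries count" metric listed on slide 39. In a queue-based crawler the sleep would more likely be replaced by re-enqueueing the job with a delay, so that the worker does not block.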
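
Slides 22 to 24 recommend XPath over regular expressions for parsing. As a rough sketch of what that can look like with PHP's bundled DOM extension, the function below loads an HTML string and returns the text of every node matching an XPath expression. The function name `extractByXPath` and the sample markup are assumptions for illustration, not material from the talk.

```php
<?php
/**
 * Hypothetical helper: return the trimmed text content of every node
 * in $html that matches the XPath $expression.
 */
function extractByXPath(string $html, string $expression): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; buffer libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($expression);
    if ($nodes === false) {
        // Malformed expression: treat as "nothing found".
        return [];
    }

    $values = [];
    foreach ($nodes as $node) {
        $values[] = trim($node->textContent);
    }
    return $values;
}

// Made-up sample markup with two product prices.
$html = '<html><body>'
      . '<div class="product"><span class="price">9.99</span></div>'
      . '<div class="product"><span class="price">19.50</span></div>'
      . '</body></html>';
$prices = extractByXPath($html, '//div[@class="product"]/span[@class="price"]');
```

Keeping the expression as a parameter means that when the page layout changes (slide 26), only the expression needs updating, and an empty result can be counted as a parsing failure for monitoring.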
