Web Scraping with PHP

  1. 1. Web Scraping with Matthew Turland, php|tek 2009 Unconference, May 21, 2009
  2. 2. What Is It?
  3. 3. Normal Web Browsing
  4. 4. Difference #1: Immediate Audience
  5. 5. Difference #2: Consumption Method
  6. 6. Why Is It Useful?
  7. 7. Data Without Web Services
  8. 8. Integration Testing
  9. 9. Crawlers
  10. 10. "With plain text, we give ourselves the ability to manipulate knowledge, both manually and programmatically, using virtually every tool at our disposal." (3.14 The Power of Plain Text, The Pragmatic Programmer)
  11. 11. Disadvantages
  12. 12. Potential Lack of Stability
  13. 13. Reverse Engineering Required
  14. 14. More Requests
  15. 15. No Nice Neat Data Package
  16. 16. Step #1: Retrieval
  17. 17. Speaking the Language
  18. 18. The Web We Weave
      Request:  GET / HTTP/1.1, User-Agent: ...
      Response: HTTP/1.1 200 OK, Content-Type: ...
  19. 19. Browsing → Requests
      <a href="/index.php?foo=bar">Index</a>
      GET /index.php?foo=bar HTTP/1.1
      <form method="post" action="/index.php"> <input name="foo" value="bar" /> </form>
      POST /index.php HTTP/1.1
      foo=bar
  20. 20. Responses → Rendered Elements
      <img src="/intl/en_ALL/images/logo.gif" />
      GET /intl/en_ALL/images/logo.gif HTTP/1.1
      Host: google.com
      HTTP/1.1 200 OK
      Content-Type: image/gif
      Content-Length: 8558
  21. 21. Not As Easy As It Looks
  22. 22. Redirections
  23. 23. Referer [sic]
  24. 24. Cookies
  25. 25. User Agent Sniffing
  26. 26. robots.txt
  27. 27. Caching
  28. 28. HTTP Authentication
  29. 29. PHP: Glue for the Web
  30. 30. HTTP Client Libraries: Streams, cURL, PEAR::HTTP_Client, pecl_http, Zend_Http_Client
  31. 31. Simple Streams Example
      $uri = 'http://www.example.com/some/resource';
      $get = file_get_contents($uri);
      $context = stream_context_create(array(
          'http' => array(
              'method' => 'POST',
              'header' => 'Content-Type: ' . 'application/x-www-form-urlencoded',
              'content' => http_build_query(array(
                  'var1' => 'value1',
                  'var2' => 'value2'
              ))
          )
      ));
      $post = file_get_contents($uri, false, $context);
  32. 32. pecl_http Example
      $http = new HttpRequest($uri);
      $http->enableCookies();
      $http->setMethod(HTTP_METH_POST);
      $http->addPostFields(array('var1' => 'value1'));
      // setOptions() takes a single associative array of options
      $http->setOptions(array(
          'useragent' => 'PHP ' . phpversion(),
          'referer' => 'http://example.com/some/referer'
      ));
      $response = $http->send();
      $headers = $response->getHeaders();
      $body = $response->getBody();
  33. 33. pecl_http Request Pooling
      $pool = new HttpRequestPool;
      foreach ($urls as $url) {
          $request = new HttpRequest($url, HTTP_METH_GET);
          $pool->attach($request);
      }
      $pool->send();
      foreach ($pool as $request) {
          echo $request->getUrl(), PHP_EOL;
          echo $request->getResponseBody(), PHP_EOL;
      }
  34. 34. HTTP Resources
      ➔ RFC 2616 HyperText Transfer Protocol
      ➔ RFC 3986 Uniform Resource Identifiers
      ➔ "HTTP: The Definitive Guide" (ISBN 1565925092)
      ➔ "HTTP Pocket Reference: HyperText Transfer Protocol" (ISBN 1565928628)
      ➔ "HTTP Developer's Handbook" (ISBN 0672324547) by Chris Shiflett
      ➔ Ben Ramsey's blog series on HTTP
  35. 35. Step #2: Analysis
  36. 36. Tidy Extension
      $config = array('output-xhtml' => true);
      // Parse markup from a string ...
      $tidy = tidy_parse_string($markupString, $config);
      // ... or from a file
      $tidy = tidy_parse_file($markupFilePath, $config);
      $output = tidy_get_output($tidy);
  37. 37. DOM Extension
      $doc = new DOMDocument;
      // Load markup from a string ...
      $doc->loadHTML($htmlString);
      // ... or from a file
      $doc->loadHTMLFile($htmlFilePath);
      // Select nodes by tag name ...
      $listItems = $doc->getElementsByTagName('li');
      // ... or with an XPath expression
      $xpath = new DOMXPath($doc);
      $listItems = $xpath->query('//ul/li');
      foreach ($listItems as $listItem) {
          echo $listItem->nodeValue, PHP_EOL;
      }
  38. 38. SimpleXML Extension
      // Load from a string ...
      $sxe = new SimpleXMLElement($markupString);
      // ... or from a file path (third argument marks the first as a path or URL)
      $sxe = new SimpleXMLElement($filePath, 0, true);
      echo $sxe->body->ul->li[0], PHP_EOL;
      $children = $sxe->body->ul->li;
      $children = $sxe->body->ul->children();
      foreach ($children as $li) {
          echo $li, PHP_EOL;
      }
      echo $sxe->body->ul['id'];
      $attributes = $sxe->body->ul->attributes();
      foreach ($attributes as $name => $value) {
          echo $name, '=', $value, PHP_EOL;
      }
  39. 39. XMLReader Extension
      $doc = XMLReader::xml($xmlString);
      $doc = XMLReader::open($filePath);
      while ($doc->read()) {
          if ($doc->nodeType == XMLReader::ELEMENT) {
              var_dump($doc->localName);
              var_dump($doc->hasValue);
              var_dump($doc->value);
              var_dump($doc->hasAttributes);
              var_dump($doc->getAttribute('id'));
          }
      }
  40. 40. CSS Selector Libraries
      ➔ phpQuery
      ➔ Simple HTML DOM Parser
      ➔ Zend_Dom_Query
      $doc1 = phpQuery::newDocumentFile($markupFilePath);
      $doc2 = phpQuery::newDocument($markupString);
      $listItems = pq('ul > li'); // uses $doc2
      $listItems = pq('ul > li', $doc1);
  41. 41. PCRE Extension
  42. 42. Best Practices
  43. 43. Approximate Human Behavior
  44. 44. Minimize Requests
  45. 45. Batch Jobs, Non-Peak Hours
  46. 46. Account for Unavailability
  47. 47. Aim for Parallelism
  48. 48. Validate Data
  49. 49. Test, Test, Test!
  50. 50. Questions
  51. 51. Please leave a comment! http://joind.in/event/view/41
  52. 52. And ping me online! Matthew Turland Senior Consultant, Blue Parabola LLC matthew@blueparabola.com http://blueparabola.com matt@ishouldbecoding.com http://ishouldbecoding.com @elazar
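
Slide 41 names the PCRE extension without showing any code. The following is a minimal sketch of regex-based extraction, assuming a small and predictable fragment of markup; the `$markup` string and the pattern are made up for illustration and are not from the talk.

```php
<?php
// Hypothetical markup fragment; in practice this would come from a retrieved page.
$markup = '<ul id="links">'
    . '<li><a href="/a.php">First</a></li>'
    . '<li><a href="/b.php">Second</a></li>'
    . '</ul>';

// Capture the href value and the link text of each anchor.
// Patterns like this break easily when markup changes, so keep them narrow.
$pattern = '#<a\s+href="([^"]*)"[^>]*>(.*?)</a>#is';

$links = array();
if (preg_match_all($pattern, $markup, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $links[] = array(
            'href' => html_entity_decode($match[1]),
            'text' => trim(strip_tags($match[2])),
        );
    }
}

foreach ($links as $link) {
    echo $link['href'], ' => ', $link['text'], PHP_EOL;
}
```

Regular expressions tend to suit short, flat fragments like this one; for nested or irregular markup, the DOM-based extensions on slides 37 to 40 are usually the more robust choice.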
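
Slides 22 to 25 list redirections, the Referer header, cookies, and user agent sniffing as retrieval pitfalls. As a rough sketch, the streams approach from slide 31 can cover them through HTTP context options. The helper name `buildScraperContext`, the user agent string, and the cookie values are assumptions for illustration only.

```php
<?php
/**
 * Build an HTTP stream context that sends a Referer header, optional
 * cookies, and a user agent, and that follows a limited number of redirects.
 * Hypothetical helper, not part of PHP or of the original slides.
 */
function buildScraperContext($referer, array $cookies = array())
{
    $headers = array('Referer: ' . $referer);

    if ($cookies) {
        $pairs = array();
        foreach ($cookies as $name => $value) {
            $pairs[] = $name . '=' . $value;
        }
        $headers[] = 'Cookie: ' . implode('; ', $pairs);
    }

    return stream_context_create(array(
        'http' => array(
            'method'          => 'GET',
            'user_agent'      => 'ExampleScraper/0.1', // assumed value
            'header'          => implode("\r\n", $headers),
            'follow_location' => 1,   // follow Location redirects
            'max_redirects'   => 5,   // but give up after five hops
            'timeout'         => 10,  // seconds
        ),
    ));
}

$context = buildScraperContext(
    'http://example.com/some/referer',
    array('session' => 'abc123')
);

// Inspect the options that would be sent; no request is made here.
$options = stream_context_get_options($context);
echo $options['http']['header'], PHP_EOL;
```

The resulting context would be passed as the third argument to `file_get_contents($uri, false, $context)`, as on slide 31.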
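
Slide 46 advises accounting for unavailability. One possible way to do that, sketched here under assumptions, is to retry a failed request with an increasing delay. The function `fetchWithRetry` and the `$flaky` stand-in are hypothetical; a real fetcher might wrap `file_get_contents()`, which returns `false` on failure.

```php
<?php
/**
 * Call $fetch until it returns something other than false, waiting
 * longer after each failure. Returns false if every attempt fails.
 * Hypothetical helper for illustration.
 */
function fetchWithRetry(callable $fetch, $maxAttempts = 3, $baseDelayMicros = 500000)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $fetch();
        if ($result !== false) {
            return $result;
        }
        if ($attempt < $maxAttempts) {
            // Delay doubles each time: base, 2x base, 4x base, ...
            usleep((int) ($baseDelayMicros * pow(2, $attempt - 1)));
        }
    }
    return false;
}

// Stand-in for a real request: fails twice, then succeeds.
$calls = 0;
$flaky = function () use (&$calls) {
    $calls++;
    return $calls < 3 ? false : '<html>ok</html>';
};

// Short base delay so the demonstration finishes quickly.
$body = fetchWithRetry($flaky, 5, 1000);
var_dump($body !== false, $calls);
```

Passing the fetcher in as a callable keeps the retry logic separate from the HTTP client, so the same loop could wrap streams, cURL, or pecl_http calls.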
