Web Scraping with PHP

Published in: Technology

  1. 1. Web Scraping with PHP. Matthew Turland, Acadiana Open Source Group, April 30, 2009
  2. 2. What Is It?
  3. 3. Normal Web Browsing
  4. 4. Difference #1: Immediate Audience
  5. 5. Difference #2: Consumption Method
  6. 6. Why Is It Useful?
  7. 7. Data Without Web Services
  8. 8. Integration Testing
  9. 9. Crawlers
  10. 10. "With plain text, we give ourselves the ability to manipulate knowledge, both manually and programmatically, using virtually every tool at our disposal." (3.14, "The Power of Plain Text", The Pragmatic Programmer)
  11. 11. Disadvantages
  12. 12. Potential Lack of Stability
  13. 13. Reverse Engineering Required
  14. 14. More Requests
  15. 15. No Nice Neat Data Package
  16. 16. Step #1: Retrieval
  17. 17. Speaking the Language
  18. 18. The Web We Weave

          GET / HTTP/1.1
          User-Agent: ...

          HTTP/1.1 200 OK
          Content-Type: ...

  19. 19. Browsing -> Requests

          <a href="/index.php?foo=bar">Index</a>

          GET /index.php?foo=bar HTTP/1.1

          <form method="post" action="/index.php">
              <input name="foo" value="bar" />
          </form>

          POST /index.php HTTP/1.1

          foo=bar

  20. 20. Responses -> Rendered Elements

          <img src="/intl/en_ALL/images/logo.gif" />

          GET /intl/en_ALL/images/logo.gif HTTP/1.1
          Host: google.com

          HTTP/1.1 200 OK
          Content-Type: image/gif
          Content-Length: 8558
  21. 21. Not As Easy As It Looks
  22. 22. Redirections
  23. 23. Referer [sic]
  24. 24. Cookies
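
      A minimal sketch of carrying cookies from one response to the next request by hand. The helper names `extractCookies` and `buildCookieHeader` are hypothetical, and the header lines are sample data; with the streams approach such lines are typically read from `$http_response_header`. Cookie attributes (path, expiry, domain) are ignored here, which a real client would not do.

      ```php
      <?php
      // Hypothetical helper: collect cookie name/value pairs from raw response header lines.
      function extractCookies(array $headerLines): array
      {
          $cookies = [];
          foreach ($headerLines as $line) {
              // Match "Set-Cookie: name=value; attributes..." case-insensitively.
              if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $line, $m)) {
                  $cookies[$m[1]] = $m[2];
              }
          }
          return $cookies;
      }

      // Hypothetical helper: build the Cookie request header for the next request.
      function buildCookieHeader(array $cookies): string
      {
          $pairs = [];
          foreach ($cookies as $name => $value) {
              $pairs[] = $name . '=' . $value;
          }
          return 'Cookie: ' . implode('; ', $pairs);
      }

      // Sample response headers (assumed data for illustration).
      $responseHeaders = [
          'HTTP/1.1 200 OK',
          'Set-Cookie: session=abc123; Path=/; HttpOnly',
          'Set-Cookie: lang=en',
      ];

      $cookies = extractCookies($responseHeaders);
      echo buildCookieHeader($cookies), PHP_EOL; // Cookie: session=abc123; lang=en
      ```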
  25. 25. User Agent Sniffing
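
      Sites that check the User-Agent or Referer header can often be satisfied by sending both explicitly. The sketch below does this with a stream context; the helper name `buildRequestContext` and the header values are assumptions chosen for illustration.

      ```php
      <?php
      // Hypothetical helper: a stream context that sends User-Agent and Referer headers.
      function buildRequestContext(string $userAgent, string $referer)
      {
          return stream_context_create([
              'http' => [
                  'method'     => 'GET',
                  'user_agent' => $userAgent,            // sent as the User-Agent header
                  'header'     => 'Referer: ' . $referer,
              ],
          ]);
      }

      $context = buildRequestContext(
          'Mozilla/5.0 (compatible; ExampleScraper/1.0)',
          'http://www.example.com/'
      );

      // Inspect what will be sent; pass $context as the third argument to
      // file_get_contents() to make the actual request.
      $options = stream_context_get_options($context);
      echo $options['http']['user_agent'], PHP_EOL;
      ```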
  26. 26. robots.txt
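
      A rough sketch of honoring robots.txt before requesting a path. The helpers `disallowedPaths` and `isAllowed` are hypothetical and deliberately simplified: they handle only `User-agent` and `Disallow` lines, treat rules as plain path prefixes, and merge the `*` group with any group naming the agent. `Allow` lines, wildcards, and group precedence are not handled.

      ```php
      <?php
      // Hypothetical helper: return the Disallow path prefixes that apply to $agent.
      function disallowedPaths(string $robotsTxt, string $agent): array
      {
          $rules = [];
          $applies = false;      // does the current group apply to our agent?
          $inAgentLines = false; // are we still reading consecutive User-agent lines?
          foreach (preg_split('/\R/', $robotsTxt) as $line) {
              $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
              if ($line === '' || strpos($line, ':') === false) {
                  continue;
              }
              list($field, $value) = array_map('trim', explode(':', $line, 2));
              $field = strtolower($field);
              if ($field === 'user-agent') {
                  if (!$inAgentLines) {
                      $applies = false; // a new group starts here
                  }
                  $inAgentLines = true;
                  if ($value === '*' || ($value !== '' && stripos($agent, $value) !== false)) {
                      $applies = true;
                  }
              } else {
                  $inAgentLines = false;
                  if ($field === 'disallow' && $applies && $value !== '') {
                      $rules[] = $value;
                  }
              }
          }
          return $rules;
      }

      // Hypothetical helper: a path is allowed unless it starts with a disallowed prefix.
      function isAllowed(string $path, array $disallowed): bool
      {
          foreach ($disallowed as $prefix) {
              if (strpos($path, $prefix) === 0) {
                  return false;
              }
          }
          return true;
      }

      // Sample robots.txt content (assumed data for illustration).
      $robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/ # scratch\n\n"
              . "User-agent: BadBot\nDisallow: /\n";

      $rules = disallowedPaths($robots, 'ExampleScraper/1.0');
      var_dump(isAllowed('/index.html', $rules));   // bool(true)
      var_dump(isAllowed('/private/data', $rules)); // bool(false)
      ```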
  27. 27. Caching
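
      One way to cache is a conditional GET: send back the validators from the previous response and reuse the stored body when the server answers 304 Not Modified. The helper names and the layout of the `$cached` array below are assumptions for illustration.

      ```php
      <?php
      // Hypothetical helper: build conditional request headers from stored validators.
      function conditionalHeaders(array $cached): array
      {
          $headers = [];
          if (!empty($cached['etag'])) {
              $headers[] = 'If-None-Match: ' . $cached['etag'];
          }
          if (!empty($cached['last_modified'])) {
              $headers[] = 'If-Modified-Since: ' . $cached['last_modified'];
          }
          return $headers;
      }

      // Hypothetical helper: true when the status line reports 304 Not Modified.
      function isNotModified(string $statusLine): bool
      {
          return (bool) preg_match('#^HTTP/\d(\.\d)?\s+304\b#', $statusLine);
      }

      // Values saved from an earlier response (assumed data for illustration).
      $cached = [
          'etag'          => '"abc123"',
          'last_modified' => 'Thu, 30 Apr 2009 12:00:00 GMT',
          'body'          => '<html>...</html>',
      ];

      $headers = conditionalHeaders($cached);
      print_r($headers);

      // On a 304 response, reuse $cached['body'] instead of downloading again.
      var_dump(isNotModified('HTTP/1.1 304 Not Modified')); // bool(true)
      ```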
  28. 28. HTTP Authentication
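
      For HTTP Basic authentication the client sends the user name and password, joined by a colon and base64-encoded, in an Authorization header. A minimal sketch follows; `basicAuthHeader` is a hypothetical helper and the credentials are the well-known sample pair from the HTTP authentication RFC. Digest authentication is more involved and not shown.

      ```php
      <?php
      // Hypothetical helper: build the Authorization header for HTTP Basic authentication.
      function basicAuthHeader(string $user, string $password): string
      {
          return 'Authorization: Basic ' . base64_encode($user . ':' . $password);
      }

      $header = basicAuthHeader('Aladdin', 'open sesame');
      echo $header, PHP_EOL; // Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==

      // The header can then be attached to a stream context for the request.
      $context = stream_context_create(['http' => ['header' => $header]]);
      ```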
  29. 29. PHP: Glue for the Web
  30. 30. HTTP Client Libraries: PEAR::HTTP_Client, pecl_http, Zend_Http_Client, Streams, cURL
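
      The deck shows streams and pecl_http examples; for comparison, here is a hedged sketch of the same kind of request with the cURL extension. The helpers `curlOptions` and `fetch`, the user agent string, and the cookie file path are assumptions. The options cover redirects, cookies, and the user agent in one place.

      ```php
      <?php
      // Hypothetical helper: cURL options for a browser-like GET request.
      function curlOptions(string $cookieFile): array
      {
          return [
              CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
              CURLOPT_FOLLOWLOCATION => true,        // follow redirects...
              CURLOPT_MAXREDIRS      => 5,           // ...but not forever
              CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)',
              CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is released
              CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here
              CURLOPT_TIMEOUT        => 10,          // seconds
          ];
      }

      // Hypothetical helper: perform the request and return the response body.
      function fetch(string $url, array $options): string
      {
          $handle = curl_init($url);
          curl_setopt_array($handle, $options);
          $body = curl_exec($handle);
          if ($body === false) {
              throw new RuntimeException(curl_error($handle));
          }
          return $body;
      }

      $options = curlOptions(sys_get_temp_dir() . '/scraper-cookies.txt');

      // Usage (requires network access):
      // $html = fetch('http://www.example.com/some/resource', $options);
      ```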
  31. 31. Simple Streams Example

          $uri = 'http://www.example.com/some/resource';

          // GET request
          $get = file_get_contents($uri);

          // POST request with form-encoded data
          $context = stream_context_create(array(
              'http' => array(
                  'method'  => 'POST',
                  'header'  => 'Content-Type: application/x-www-form-urlencoded',
                  'content' => http_build_query(array(
                      'var1' => 'value1',
                      'var2' => 'value2'
                  ))
              )
          ));
          $post = file_get_contents($uri, false, $context);

  32. 32. pecl_http Example

          $http = new HttpRequest($uri);
          $http->enableCookies();
          $http->setMethod(HTTP_METH_POST);
          $http->addPostFields(array('var1' => 'value1'));
          // setOptions() takes a single array of options
          $http->setOptions(array(
              'useragent' => 'PHP ' . phpversion(),
              'referer'   => 'http://example.com/some/referer'
          ));
          $response = $http->send();
          $headers = $response->getHeaders();
          $body = $response->getBody();

  33. 33. pecl_http Request Pooling

          $pool = new HttpRequestPool;
          foreach ($urls as $url) {
              $request = new HttpRequest($url, HTTP_METH_GET);
              $pool->attach($request);
          }
          // Send all attached requests together
          $pool->send();
          foreach ($pool as $request) {
              echo $request->getUrl(), PHP_EOL;
              echo $request->getResponseBody(), PHP_EOL;
          }
  34. 34. HTTP Resources
        - RFC 2616 HyperText Transfer Protocol
        - RFC 3986 Uniform Resource Identifiers
        - "HTTP: The Definitive Guide" (ISBN 1565925092)
        - "HTTP Pocket Reference: HyperText Transfer Protocol" (ISBN 1565928628)
        - "HTTP Developer's Handbook" (ISBN 0672324547) by Chris Shiflett
        - Ben Ramsey's blog series on HTTP
  35. 35. Step #2: Analysis
  36. 36. Tidy Extension

          $config = array('output-xhtml' => true);
          // Parse from a string...
          $tidy = tidy_parse_string($markupString, $config);
          // ...or from a file
          $tidy = tidy_parse_file($markupFilePath, $config);
          $output = tidy_get_output($tidy);

  37. 37. DOM Extension

          $doc = new DOMDocument;
          // Load from a string or from a file
          $doc->loadHTML($htmlString);
          $doc->loadHTMLFile($htmlFilePath);

          // Select by tag name or by XPath expression
          $listItems = $doc->getElementsByTagName('li');
          $xpath = new DOMXPath($doc);
          $listItems = $xpath->query('//ul/li');

          foreach ($listItems as $listItem) {
              echo $listItem->nodeValue, PHP_EOL;
          }

  38. 38. SimpleXML Extension

          // Load from a string or from a file (third argument true = path or URL)
          $sxe = new SimpleXMLElement($markupString);
          $sxe = new SimpleXMLElement($filePath, 0, true);

          // Child elements
          echo $sxe->body->ul->li[0], PHP_EOL;
          $children = $sxe->body->ul->li;
          $children = $sxe->body->ul->children();
          foreach ($children as $li) {
              echo $li, PHP_EOL;
          }

          // Attributes
          echo $sxe->body->ul['id'];
          $attributes = $sxe->body->ul->attributes();
          foreach ($attributes as $name => $value) {
              echo $name, '=', $value, PHP_EOL;
          }

  39. 39. XMLReader Extension

          // Read from a string or from a file
          $doc = XMLReader::xml($xmlString);
          $doc = XMLReader::open($filePath);

          // Walk the document node by node
          while ($doc->read()) {
              if ($doc->nodeType == XMLReader::ELEMENT) {
                  var_dump($doc->localName);
                  var_dump($doc->hasValue);
                  var_dump($doc->value);
                  var_dump($doc->hasAttributes);
                  var_dump($doc->getAttribute('id'));
              }
          }

  40. 40. CSS Selector Libraries
        - phpQuery
        - Simple HTML DOM Parser
        - Zend_Dom_Query

          $doc1 = phpQuery::newDocumentFile($markupFilePath);
          $doc2 = phpQuery::newDocument($markupString);
          $listItems = pq('ul > li'); // uses $doc2
          $listItems = pq('ul > li', $doc1);
  41. 41. PCRE Extension
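
      The slide gives no example, so here is a small sketch of pulling link targets and link text out of markup with `preg_match_all`. The markup and the pattern are assumptions for illustration. Regular expressions suit small, predictable fragments like this one; for nested or irregular markup the DOM-based approaches are generally more robust.

      ```php
      <?php
      // Sample markup (assumed data for illustration).
      $html = '<ul><li><a href="/one">One</a></li><li><a href="/two">Two</a></li></ul>';

      // Capture the href value and the link text of each anchor.
      preg_match_all(
          '/<a\s[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/is',
          $html,
          $matches,
          PREG_SET_ORDER
      );

      $links = [];
      foreach ($matches as $match) {
          $links[$match[1]] = $match[2];
          echo $match[1], ' => ', $match[2], PHP_EOL;
      }
      // Prints:
      // /one => One
      // /two => Two
      ```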
  42. 42. Best Practices
  43. 43. Approximate Human Behavior
  44. 44. Minimize Requests
  45. 45. Batch Jobs, Non-Peak Hours
  46. 46. Account for Unavailability
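
      One common way to account for a site being temporarily unavailable is to retry with a growing delay. The sketch below assumes a hypothetical `retry` helper and an operation that signals failure by throwing `RuntimeException`; the failing operation is simulated so the example runs without network access.

      ```php
      <?php
      // Hypothetical helper: call $operation until it succeeds or $maxAttempts is reached,
      // doubling the delay after each failure (exponential backoff).
      function retry(callable $operation, int $maxAttempts = 3, int $baseDelayMs = 200)
      {
          for ($attempt = 1; ; $attempt++) {
              try {
                  return $operation();
              } catch (RuntimeException $e) {
                  if ($attempt >= $maxAttempts) {
                      throw $e; // give up and let the caller decide what to do
                  }
                  usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
              }
          }
      }

      // Simulated request that fails twice before succeeding.
      $calls = 0;
      $result = retry(function () use (&$calls) {
          $calls++;
          if ($calls < 3) {
              throw new RuntimeException('Service unavailable');
          }
          return 'ok after ' . $calls . ' attempts';
      }, 5, 1);

      echo $result, PHP_EOL; // ok after 3 attempts
      ```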
  47. 47. Aim for Parallelism
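
      Besides pecl_http request pooling, parallel retrieval can be sketched with the cURL extension's multi interface. `fetchAll` is a hypothetical helper; error handling is left out, so a failed transfer simply yields an empty or null body.

      ```php
      <?php
      // Hypothetical helper: fetch several URLs concurrently, returning bodies keyed by URL.
      function fetchAll(array $urls): array
      {
          $multi = curl_multi_init();
          $handles = [];
          foreach ($urls as $url) {
              $handle = curl_init($url);
              curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
              curl_setopt($handle, CURLOPT_TIMEOUT, 10);
              curl_multi_add_handle($multi, $handle);
              $handles[$url] = $handle;
          }

          // Drive all transfers until none are still running.
          do {
              $status = curl_multi_exec($multi, $running);
              if ($running) {
                  curl_multi_select($multi, 1.0); // wait for activity instead of busy-looping
              }
          } while ($running && $status === CURLM_OK);

          $bodies = [];
          foreach ($handles as $url => $handle) {
              $bodies[$url] = curl_multi_getcontent($handle);
              curl_multi_remove_handle($multi, $handle);
          }
          return $bodies;
      }

      // Usage (requires network access):
      // $pages = fetchAll(array('http://www.example.com/a', 'http://www.example.com/b'));
      ```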
  48. 48. Validate Data
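
      Scraped values can silently change shape when a site's markup changes, so it helps to check each record before storing it. A minimal sketch with `filter_var` and a regular expression follows; the `validateRecord` helper and the field names `url`, `price`, and `date` are assumptions for illustration.

      ```php
      <?php
      // Hypothetical helper: return the names of fields that fail validation.
      function validateRecord(array $record): array
      {
          $errors = [];
          if (filter_var($record['url'] ?? '', FILTER_VALIDATE_URL) === false) {
              $errors[] = 'url';
          }
          if (filter_var($record['price'] ?? '', FILTER_VALIDATE_FLOAT) === false) {
              $errors[] = 'price';
          }
          // Expect dates in YYYY-MM-DD form.
          if (!preg_match('/^\d{4}-\d{2}-\d{2}$/', $record['date'] ?? '')) {
              $errors[] = 'date';
          }
          return $errors;
      }

      // Sample records (assumed data for illustration).
      $good = ['url' => 'http://www.example.com/item/1', 'price' => '19.99', 'date' => '2009-04-30'];
      $bad  = ['url' => 'not a url', 'price' => 'N/A', 'date' => 'April 30'];

      var_dump(validateRecord($good)); // empty array: nothing failed
      var_dump(validateRecord($bad));  // lists url, price, and date
      ```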
  49. 49. Test, Test, Test!
  50. 50. Questions
