Web Scraping with PHP

  1. Web Scraping with PHP. Matthew Turland, php|tek 2009 Unconference, May 21, 2009
  2. What Is It?
  3. Normal Web Browsing
  4. Difference #1: Immediate Audience
  5. Difference #2: Consumption Method
  6. Why Is It Useful?
  7. Data Without Web Services
  8. Integration Testing
  9. Crawlers
  10. "With plain text, we give ourselves the ability to manipulate knowledge, both manually and programmatically, using virtually every tool at our disposal." (3.14 The Power of Plain Text, The Pragmatic Programmer)
  11. Disadvantages
  12. Potential Lack of Stability
  13. Reverse Engineering Required
  14. More Requests
  15. No Nice Neat Data Package
  16. Step #1: Retrieval
  17. Speaking the Language
  18. The Web We Weave

          GET / HTTP/1.1    HTTP/1.1 200 OK
          User-Agent: ...   Content-Type: ...

  19. Browsing → Requests

          <a href="/index.php?foo=bar">Index</a>

          GET /index.php?foo=bar HTTP/1.1

          <form method="post" action="/index.php">
            <input name="foo" value="bar" />
          </form>

          POST /index.php HTTP/1.1

          foo=bar

  20. Responses → Rendered Elements

          <img src="/intl/en_ALL/images/logo.gif" />

          GET /intl/en_ALL/images/logo.gif HTTP/1.1
          Host: google.com

          HTTP/1.1 200 OK
          Content-Type: image/gif
          Content-Length: 8558

  21. Not As Easy As It Looks
  22. Redirections
  23. Referer [sic]
  24. Cookies
  25. User Agent Sniffing
  26. robots.txt
  27. Caching
  28. HTTP Authentication
  29. PHP: Glue for the Web
  30. HTTP Client Libraries
       ➔ Streams, cURL
       ➔ PEAR::HTTP_Client
       ➔ pecl_http
       ➔ Zend_Http_Client
  31. Simple Streams Example

          $uri = 'http://www.example.com/some/resource';

          // GET request
          $get = file_get_contents($uri);

          // POST request with a form-encoded body
          $context = stream_context_create(
              array(
                  'http' => array(
                      'method' => 'POST',
                      'header' => 'Content-Type: ' .
                          'application/x-www-form-urlencoded',
                      'content' => http_build_query(array(
                          'var1' => 'value1',
                          'var2' => 'value2'
                      ))
                  )
              )
          );
          $post = file_get_contents($uri, false, $context);

  32. pecl_http Example

          $http = new HttpRequest($uri);
          $http->enableCookies();
          $http->setMethod(HTTP_METH_POST);
          $http->addPostFields(array('var1' => 'value1'));
          // setOptions() takes a single associative array
          $http->setOptions(array(
              'useragent' => 'PHP ' . phpversion(),
              'referer' => 'http://example.com/some/referer'
          ));
          $response = $http->send();
          $headers = $response->getHeaders();
          $body = $response->getBody();

  33. pecl_http Request Pooling

          $pool = new HttpRequestPool;
          foreach ($urls as $url) {
              $request = new HttpRequest($url, HTTP_METH_GET);
              $pool->attach($request);
          }
          $pool->send();
          foreach ($pool as $request) {
              echo $request->getUrl(), PHP_EOL;
              echo $request->getResponseBody(), PHP_EOL;
          }

  34. HTTP Resources
       ➔ RFC 2616 HyperText Transfer Protocol
       ➔ RFC 3986 Uniform Resource Identifiers
       ➔ "HTTP: The Definitive Guide" (ISBN 1565925092)
       ➔ "HTTP Pocket Reference: HyperText Transfer Protocol" (ISBN 1565928628)
       ➔ "HTTP Developer's Handbook" (ISBN 0672324547) by Chris Shiflett
       ➔ Ben Ramsey's blog series on HTTP
  35. Step #2: Analysis
  36. Tidy Extension

          $config = array('output-xhtml' => true);
          // Parse from a string...
          $tidy = tidy_parse_string($markupString, $config);
          // ...or from a file
          $tidy = tidy_parse_file($markupFilePath, $config);
          $output = tidy_get_output($tidy);

  37. DOM Extension

          $doc = new DOMDocument;
          // Load from a string or from a file
          $doc->loadHTML($htmlString);
          $doc->loadHTMLFile($htmlFilePath);

          // Select nodes by tag name...
          $listItems = $doc->getElementsByTagName('li');
          // ...or with an XPath expression
          $xpath = new DOMXPath($doc);
          $listItems = $xpath->query('//ul/li');

          foreach ($listItems as $listItem) {
              echo $listItem->nodeValue, PHP_EOL;
          }

  38. SimpleXML Extension

          // Load from a string...
          $sxe = new SimpleXMLElement($markupString);
          // ...or from a file path (third argument true = first argument is a path or URL)
          $sxe = new SimpleXMLElement($filePath, 0, true);

          // Child elements
          echo $sxe->body->ul->li[0], PHP_EOL;
          $children = $sxe->body->ul->li;
          $children = $sxe->body->ul->children();
          foreach ($children as $li) {
              echo $li, PHP_EOL;
          }

          // Attributes
          echo $sxe->body->ul['id'];
          $attributes = $sxe->body->ul->attributes();
          foreach ($attributes as $name => $value) {
              echo $name, '=', $value, PHP_EOL;
          }

  39. XMLReader Extension

          // Open from a string or from a file
          $doc = XMLReader::xml($xmlString);
          $doc = XMLReader::open($filePath);

          // Walk the document one node at a time
          while ($doc->read()) {
              if ($doc->nodeType == XMLReader::ELEMENT) {
                  var_dump($doc->localName);
                  var_dump($doc->hasValue);
                  var_dump($doc->value);
                  var_dump($doc->hasAttributes);
                  var_dump($doc->getAttribute('id'));
              }
          }

  40. CSS Selector Libraries
       ➔ phpQuery
       ➔ Simple HTML DOM Parser
       ➔ Zend_Dom_Query

          $doc1 = phpQuery::newDocumentFile($markupFilePath);
          $doc2 = phpQuery::newDocument($markupString);
          $listItems = pq('ul > li'); // uses $doc2
          $listItems = pq('ul > li', $doc1);

  41. PCRE Extension
  42. Best Practices
  43. Approximate Human Behavior
  44. Minimize Requests
  45. Batch Jobs, Non-Peak Hours
  46. Account for Unavailability
  47. Aim for Parallelism
  48. Validate Data
  49. Test, Test, Test!
  50. Questions
  51. Please leave a comment! http://joind.in/event/view/41
  52. And ping me online! Matthew Turland, Senior Consultant, Blue Parabola LLC; matthew@blueparabola.com, http://blueparabola.com; matt@ishouldbecoding.com, http://ishouldbecoding.com; @elazar
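
Slide 19 shows a submitted form becoming a POST request. The following is a minimal sketch of that mapping using only core PHP functions. The helper name `buildPostRequest` and the host `www.example.com` are assumptions made for illustration; they do not come from the slides.

```php
<?php
/**
 * Build the raw text of an HTTP/1.1 POST request for a set of form fields.
 * buildPostRequest() is a hypothetical helper, not part of PHP.
 */
function buildPostRequest(string $host, string $path, array $fields): string
{
    // Encode the fields the way a browser encodes a submitted form.
    $body = http_build_query($fields);

    $lines = array(
        "POST $path HTTP/1.1",
        "Host: $host",
        'Content-Type: application/x-www-form-urlencoded',
        'Content-Length: ' . strlen($body),
        'Connection: close',
        '',     // an empty line separates the headers from the body
        $body,
    );

    // HTTP header lines end with CRLF.
    return implode("\r\n", $lines);
}

// The form from slide 19: one field named "foo" with the value "bar".
echo buildPostRequest('www.example.com', '/index.php', array('foo' => 'bar')), PHP_EOL;
```

Seeing the request as text makes it easier to compare what a scraper sends with what a browser sends for the same form.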
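
Slides 22 through 25 list redirections, the Referer header, cookies, and user agent sniffing as complications. One possible way to account for them with the streams HTTP wrapper is sketched below. The helper names `buildScrapeContext` and `parseStatusCode`, the user agent string, and the limits chosen (5 redirects, 10 second timeout) are assumptions for illustration only.

```php
<?php
/**
 * Create a stream context that sets a user agent, optional Referer and
 * Cookie headers, and a bounded number of redirects.
 * buildScrapeContext() is a hypothetical helper, not part of PHP.
 */
function buildScrapeContext(string $userAgent, ?string $referer = null, ?string $cookie = null)
{
    $headers = array();
    if ($referer !== null) {
        $headers[] = 'Referer: ' . $referer;
    }
    if ($cookie !== null) {
        $headers[] = 'Cookie: ' . $cookie;
    }

    $http = array(
        'method'          => 'GET',
        'user_agent'      => $userAgent,
        'follow_location' => 1,     // follow redirects...
        'max_redirects'   => 5,     // ...but give up after a few hops
        'timeout'         => 10.0,  // seconds
        'ignore_errors'   => true,  // still return the body on 4xx/5xx
    );
    if ($headers) {
        $http['header'] = implode("\r\n", $headers);
    }

    return stream_context_create(array('http' => $http));
}

/**
 * Return the status code of the final response in a list of header lines.
 * After redirects the list holds several status lines; the last one wins.
 */
function parseStatusCode(array $responseHeaders): ?int
{
    $status = null;
    foreach ($responseHeaders as $line) {
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Usage (requires network access, so it is left commented out):
// $context = buildScrapeContext('ExampleScraper/0.1', 'http://example.com/some/referer');
// $body    = file_get_contents('http://www.example.com/some/resource', false, $context);
// $status  = parseStatusCode($http_response_header);
```

Checking the status code explicitly matters here because `ignore_errors` makes `file_get_contents()` return a body even for error responses.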
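
Slide 41 names the PCRE extension without an example. A rough sketch of what regex-based extraction can look like follows; the function name `extractLinks` and the sample markup are assumptions. The pattern is deliberately simple and only handles quoted `href` values, so it is a fallback for small, predictable fragments rather than a substitute for the DOM-based approaches on slides 37 through 40.

```php
<?php
/**
 * Extract the href values of anchor tags from an HTML fragment.
 * extractLinks() is a hypothetical helper, not part of PHP.
 */
function extractLinks(string $html): array
{
    // Group 1 captures the opening quote so \1 can require the same closing quote;
    // group 2 captures the attribute value itself.
    $pattern = '/<a\s[^>]*href=(["\'])(.*?)\1/i';
    preg_match_all($pattern, $html, $matches);

    // Attribute values may contain entities such as &amp;
    return array_map('html_entity_decode', $matches[2]);
}

$html = '<ul>'
      . '<li><a href="/index.php?foo=bar">Index</a></li>'
      . '<li><a class="x" href=\'/a?b=1&amp;c=2\'>A</a></li>'
      . '</ul>';

print_r(extractLinks($html));
```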
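
Slides 43 and 46 recommend approximating human behavior and accounting for unavailability. One common way to do both is to retry failed requests with an increasing pause between attempts. The sketch below assumes a fetch callback that returns `false` on failure, as `file_get_contents()` does; the function name `fetchWithRetry` and its default values are illustrative, not taken from the talk.

```php
<?php
/**
 * Call $fetch until it returns something other than false, pausing longer
 * after each failure (base delay, then 2x, then 4x, and so on).
 * fetchWithRetry() is a hypothetical helper, not part of PHP.
 */
function fetchWithRetry(callable $fetch, int $maxAttempts = 3, int $baseDelayMs = 500)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $fetch();
        if ($result !== false) {
            return $result;
        }
        if ($attempt < $maxAttempts) {
            // usleep() takes microseconds; double the pause on every failure.
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }

    // Every attempt failed; let the caller decide what to do next.
    return false;
}

// Usage (requires network access, so it is left commented out):
// $body = fetchWithRetry(function () {
//     return @file_get_contents('http://www.example.com/some/resource');
// });
// if ($body === false) {
//     // Log the failure and skip this resource for now.
// }
```

Passing the fetch step in as a callback keeps the retry logic independent of the HTTP client, so the same helper works with streams, cURL, or any of the libraries on slide 30.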
