Web Scraping with PHP

    Web Scraping with PHP - Presentation Transcript

    1. Web Scraping with PHP. Matthew Turland, php|tek 2009 Unconference, May 21, 2009
    2. What Is It?
    3. Normal Web Browsing
    4. Difference #1: Immediate Audience
    5. Difference #2: Consumption Method
    6. Why Is It Useful?
    7. Data Without Web Services
    8. Integration Testing
    9. Crawlers
    10. "With plain text, we give ourselves the ability to manipulate knowledge, both manually and programmatically, using virtually every tool at our disposal." (3.14 The Power of Plain Text, The Pragmatic Programmer)
    11. Disadvantages
    12. Potential Lack of Stability
    13. Reverse Engineering Required
    14. More Requests
    15. No Nice Neat Data Package
    16. Step #1: Retrieval
    17. Speaking the Language
    18. The Web We Weave
        Request:
            GET / HTTP/1.1
            User-Agent: ...
        Response:
            HTTP/1.1 200 OK
            Content-Type: ...
    19. Browsing → Requests
        Link:
            <a href="/index.php?foo=bar">Index</a>
        Resulting request:
            GET /index.php?foo=bar HTTP/1.1
        Form:
            <form method="post" action="/index.php">
              <input name="foo" value="bar" />
            </form>
        Resulting request:
            POST /index.php HTTP/1.1

            foo=bar
    20. Responses → Rendered Elements
        Markup in the response:
            <img src="/intl/en_ALL/images/logo.gif" />
        Follow-up request:
            GET /intl/en_ALL/images/logo.gif HTTP/1.1
            Host: google.com
        Response:
            HTTP/1.1 200 OK
            Content-Type: image/gif
            Content-Length: 8558
    21. Not As Easy As It Looks
    22. Redirections
    23. Referer [sic]
    24. Cookies
    25. User Agent Sniffing
    26. robots.txt
    27. Caching
    28. HTTP Authentication
    29. PHP: Glue for the Web
    30. HTTP Client Libraries
        ➔ Streams, cURL
        ➔ PEAR::HTTP_Client
        ➔ pecl_http
        ➔ Zend_Http_Client
    31. Simple Streams Example
            $uri = 'http://www.example.com/some/resource';

            // GET request using the default stream context
            $get = file_get_contents($uri);

            // POST request: method, header, and body are set through an HTTP context
            $context = stream_context_create(
                array(
                    'http' => array(
                        'method' => 'POST',
                        'header' => 'Content-Type: ' .
                            'application/x-www-form-urlencoded',
                        'content' => http_build_query(array(
                            'var1' => 'value1',
                            'var2' => 'value2'
                        ))
                    )
                )
            );
            $post = file_get_contents($uri, false, $context);
    32. pecl_http Example
            $http = new HttpRequest($uri);
            $http->enableCookies();
            $http->setMethod(HTTP_METH_POST);
            $http->addPostFields(array('var1' => 'value1'));
            // setOptions() takes a single associative array of options
            $http->setOptions(array(
                'useragent' => 'PHP ' . phpversion(),
                'referer' => 'http://example.com/some/referer'
            ));
            $response = $http->send();
            $headers = $response->getHeaders();
            $body = $response->getBody();
    33. pecl_http Request Pooling
            $pool = new HttpRequestPool;
            foreach ($urls as $url) {
                $request = new HttpRequest($url, HTTP_METH_GET);
                $pool->attach($request);
            }
            $pool->send();
            foreach ($pool as $request) {
                echo $request->getUrl(), PHP_EOL;
                echo $request->getResponseBody(), PHP_EOL;
            }
    34. HTTP Resources
        ➔ RFC 2616 HyperText Transfer Protocol
        ➔ RFC 3986 Uniform Resource Identifiers
        ➔ "HTTP: The Definitive Guide" (ISBN 1565925092)
        ➔ "HTTP Pocket Reference: HyperText Transfer Protocol" (ISBN 1565928628)
        ➔ "HTTP Developer's Handbook" (ISBN 0672324547) by Chris Shiflett
        ➔ Ben Ramsey's blog series on HTTP
    35. Step #2: Analysis
    36. Tidy Extension
            $config = array('output-xhtml' => true);
            // Parse markup from a string...
            $tidy = tidy_parse_string($markupString, $config);
            // ...or from a file
            $tidy = tidy_parse_file($markupFilePath, $config);
            $output = tidy_get_output($tidy);
    37. DOM Extension
            $doc = new DOMDocument;
            // Load markup from a string...
            $doc->loadHTML($htmlString);
            // ...or from a file
            $doc->loadHTMLFile($htmlFilePath);

            // Select list items by tag name...
            $listItems = $doc->getElementsByTagName('li');
            // ...or with an XPath expression
            $xpath = new DOMXPath($doc);
            $listItems = $xpath->query('//ul/li');

            foreach ($listItems as $listItem) {
                echo $listItem->nodeValue, PHP_EOL;
            }
    38. SimpleXML Extension
            // Load markup from a string...
            $sxe = new SimpleXMLElement($markupString);
            // ...or from a file (third argument marks the first as a path or URL)
            $sxe = new SimpleXMLElement($filePath, 0, true);

            // Access child elements
            echo $sxe->body->ul->li[0], PHP_EOL;
            $children = $sxe->body->ul->li;
            $children = $sxe->body->ul->children();
            foreach ($children as $li) {
                echo $li, PHP_EOL;
            }

            // Access attributes
            echo $sxe->body->ul['id'];
            $attributes = $sxe->body->ul->attributes();
            foreach ($attributes as $name => $value) {
                echo $name, '=', $value, PHP_EOL;
            }
    39. XMLReader Extension
            // Read from a string...
            $doc = XMLReader::xml($xmlString);
            // ...or from a file
            $doc = XMLReader::open($filePath);

            // Walk the document node by node and inspect each element
            while ($doc->read()) {
                if ($doc->nodeType == XMLReader::ELEMENT) {
                    var_dump($doc->localName);
                    var_dump($doc->hasValue);
                    var_dump($doc->value);
                    var_dump($doc->hasAttributes);
                    var_dump($doc->getAttribute('id'));
                }
            }
    40. CSS Selector Libraries
        ➔ phpQuery
        ➔ Simple HTML DOM Parser
        ➔ Zend_Dom_Query
            $doc1 = phpQuery::newDocumentFile($markupFilePath);
            $doc2 = phpQuery::newDocument($markupString);
            $listItems = pq('ul > li'); // uses $doc2
            $listItems = pq('ul > li', $doc1);
    41. PCRE Extension
    42. Best Practices
    43. Approximate Human Behavior
    44. Minimize Requests
    45. Batch Jobs, Non-Peak Hours
    46. Account for Unavailability
    47. Aim for Parallelism
    48. Validate Data
    49. Test, Test, Test!
    50. Questions
    51. Please leave a comment! http://joind.in/event/view/41
    52. And ping me online!
        Matthew Turland, Senior Consultant, Blue Parabola LLC
        matthew@blueparabola.com, http://blueparabola.com
        matt@ishouldbecoding.com, http://ishouldbecoding.com
        @elazar
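
    Slide 41 names the PCRE extension, but the transcript carries no code for it. As a minimal sketch, assuming the goal is to pull link targets out of a small and predictable fragment of markup, preg_match_all() can capture every href value. The function name extractHrefs, the pattern, and the sample markup are illustrative assumptions and do not come from the talk. For markup that is large or likely to change, the DOM-based approaches on slides 37 through 40 are usually the more robust choice.

```php
<?php
// Hypothetical helper (not from the slides): collect href values from anchor tags.
function extractHrefs(string $markup): array
{
    // Match <a ... href="..."> or href='...' and capture the quoted value.
    // The backreference \1 requires the closing quote to match the opening one.
    $pattern = '/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is';
    if (!preg_match_all($pattern, $markup, $matches)) {
        return array(); // no matches, or a PCRE error
    }
    // Decode entities such as &amp; that appear inside attribute values.
    return array_map('html_entity_decode', $matches[2]);
}

// Sample markup invented for this sketch.
$markup = '<ul>'
    . '<li><a href="/index.php?foo=bar">Index</a></li>'
    . "<li><a class='ext' href='http://example.com/?a=1&amp;b=2'>Example</a></li>"
    . '</ul>';

foreach (extractHrefs($markup) as $href) {
    echo $href, PHP_EOL;
}
```

    The pattern is deliberately narrow: it ignores unquoted attribute values and would also accept an attribute whose name merely ends in "href", which is why regular expressions tend to fit best when the target markup is simple and stable.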

    © All Rights Reserved
