Web Scraping with PHP

    Web Scraping with PHP - Presentation Transcript

    1. Web Scraping with PHP, Matthew Turland, Acadiana Open Source Group, April 30, 2009
    2. What Is It?
    3. Normal Web Browsing
    4. Difference #1: Immediate Audience
    5. Difference #2: Consumption Method
    6. Why Is It Useful?
    7. Data Without Web Services
    8. Integration Testing
    9. Crawlers
    10. "With plain text, we give ourselves the ability to manipulate knowledge, both manually and programmatically, using virtually every tool at our disposal." (3.14 The Power of Plain Text, The Pragmatic Programmer)
    11. Disadvantages
    12. Potential Lack of Stability
    13. Reverse Engineering Required
    14. More Requests
    15. No Nice Neat Data Package
    16. Step #1: Retrieval
    17. Speaking the Language
    18. The Web We Weave
      Request:
        GET / HTTP/1.1
        User-Agent: ...
      Response:
        HTTP/1.1 200 OK
        Content-Type: ...
    19. Browsing -> Requests
      Link:
        <a href="/index.php?foo=bar">Index</a>
      Request:
        GET /index.php?foo=bar HTTP/1.1
      Form:
        <form method="post" action="/index.php">
          <input name="foo" value="bar" />
        </form>
      Request:
        POST /index.php HTTP/1.1
        foo=bar
    20. Responses -> Rendered Elements
      Element:
        <img src="/intl/en_ALL/images/logo.gif" />
      Request:
        GET /intl/en_ALL/images/logo.gif HTTP/1.1
        Host: google.com
      Response:
        HTTP/1.1 200 OK
        Content-Type: image/gif
        Content-Length: 8558
    21. Not As Easy As It Looks
    22. Redirections
    23. Referer [sic]
    24. Cookies
    25. User Agent Sniffing
    26. robots.txt
    27. Caching
    28. HTTP Authentication
    29. PHP: Glue for the Web
    30. HTTP Client Libraries
      • PEAR::HTTP_Client
      • pecl_http
      • Zend_Http_Client
      • Streams
      • cURL
    31. Simple Streams Example
      $uri = 'http://www.example.com/some/resource';
      $get = file_get_contents($uri);
      $context = stream_context_create(array(
          'http' => array(
              'method' => 'POST',
              'header' => 'Content-Type: application/x-www-form-urlencoded',
              'content' => http_build_query(array(
                  'var1' => 'value1',
                  'var2' => 'value2'
              ))
          )
      ));
      $post = file_get_contents($uri, false, $context);
    32. pecl_http Example
      $http = new HttpRequest($uri);
      $http->enableCookies();
      $http->setMethod(HTTP_METH_POST);
      $http->addPostFields(array('var1' => 'value1'));
      $http->setOptions(array(
          'useragent' => 'PHP ' . phpversion(),
          'referer' => 'http://example.com/some/referer'
      ));
      $response = $http->send();
      $headers = $response->getHeaders();
      $body = $response->getBody();
    33. pecl_http Request Pooling
      $pool = new HttpRequestPool;
      foreach ($urls as $url) {
          $request = new HttpRequest($url, HTTP_METH_GET);
          $pool->attach($request);
      }
      $pool->send();
      foreach ($pool as $request) {
          echo $request->getUrl(), PHP_EOL;
          echo $request->getResponseBody(), PHP_EOL;
      }
    34. HTTP Resources
      • RFC 2616 HyperText Transfer Protocol
      • RFC 3986 Uniform Resource Identifiers
      • "HTTP: The Definitive Guide" (ISBN 1565925092)
      • "HTTP Pocket Reference: HyperText Transfer Protocol" (ISBN 1565928628)
      • "HTTP Developer's Handbook" (ISBN 0672324547) by Chris Shiflett
      • Ben Ramsey's blog series on HTTP
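
Several of the complications named on slides 22 through 28 (redirections, Referer, cookies, user agent sniffing, HTTP authentication) can be handled with the same streams HTTP wrapper used on slide 31. The following is a minimal sketch, not the presenter's code: the helper name build_http_context_options, the cookie, the credentials, and the user agent string are all hypothetical, and cookie values are assumed to be already safe to send as-is.

```php
<?php
// Hypothetical helper: builds options for the streams "http" wrapper.
function build_http_context_options(array $cookies, $referer, $userAgent, $user = null, $password = null)
{
    $headers = array();
    if ($referer !== null) {
        // Some sites check where a request claims to come from.
        $headers[] = 'Referer: ' . $referer;
    }
    if ($cookies) {
        // Assumes names and values need no further encoding.
        $pairs = array();
        foreach ($cookies as $name => $value) {
            $pairs[] = $name . '=' . $value;
        }
        $headers[] = 'Cookie: ' . implode('; ', $pairs);
    }
    if ($user !== null) {
        // HTTP Basic authentication is a base64-encoded "user:password" pair.
        $headers[] = 'Authorization: Basic ' . base64_encode($user . ':' . $password);
    }
    return array(
        'http' => array(
            'method' => 'GET',
            'user_agent' => $userAgent,
            'header' => implode("\r\n", $headers),
            'follow_location' => 1,   // follow redirections
            'max_redirects' => 5,     // but give up on long redirect chains
            'timeout' => 10,          // seconds
            'ignore_errors' => true,  // still return the body on 4xx/5xx responses
        ),
    );
}

// Example values only; none of these come from the slides.
$options = build_http_context_options(
    array('session' => 'abc123'),
    'http://www.example.com/',
    'ExampleScraper/0.1',
    'user',
    'secret'
);
$context = stream_context_create($options);
// $body = file_get_contents('http://www.example.com/some/resource', false, $context);
```

The request itself is left commented out so the sketch runs without network access. robots.txt and caching are not covered here, since they call for logic beyond request options.
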
    35. Step #2: Analysis
    36. Tidy Extension
      $config = array('output-xhtml' => true);
      // Parse from a string...
      $tidy = tidy_parse_string($markupString, $config);
      // ...or from a file
      $tidy = tidy_parse_file($markupFilePath, $config);
      $output = tidy_get_output($tidy);
    37. DOM Extension
      $doc = new DOMDocument;
      // Load from a string...
      $doc->loadHTML($htmlString);
      // ...or from a file
      $doc->loadHTMLFile($htmlFilePath);
      // Select by tag name...
      $listItems = $doc->getElementsByTagName('li');
      // ...or by XPath expression
      $xpath = new DOMXPath($doc);
      $listItems = $xpath->query('//ul/li');
      foreach ($listItems as $listItem) {
          echo $listItem->nodeValue, PHP_EOL;
      }
    38. SimpleXML Extension
      // Load from a string...
      $sxe = new SimpleXMLElement($markupString);
      // ...or from a file
      $sxe = new SimpleXMLElement($filePath, null, true);
      echo $sxe->body->ul->li[0], PHP_EOL;
      // Two ways to get the child elements
      $children = $sxe->body->ul->li;
      $children = $sxe->body->ul->children();
      foreach ($children as $li) {
          echo $li, PHP_EOL;
      }
      // Attributes: one by name, or all of them
      echo $sxe->body->ul['id'];
      $attributes = $sxe->body->ul->attributes();
      foreach ($attributes as $name => $value) {
          echo $name, '=', $value, PHP_EOL;
      }
    39. XMLReader Extension
      // Read from a string...
      $doc = XMLReader::xml($xmlString);
      // ...or from a file
      $doc = XMLReader::open($filePath);
      while ($doc->read()) {
          if ($doc->nodeType == XMLReader::ELEMENT) {
              var_dump($doc->localName);
              var_dump($doc->hasValue);
              var_dump($doc->value);
              var_dump($doc->hasAttributes);
              var_dump($doc->getAttribute('id'));
          }
      }
    40. CSS Selector Libraries
      • phpQuery
      • Simple HTML DOM Parser
      • Zend_Dom_Query
      $doc1 = phpQuery::newDocumentFile($markupFilePath);
      $doc2 = phpQuery::newDocument($markupString);
      $listItems = pq('ul > li'); // uses $doc2
      $listItems = pq('ul > li', $doc1);
    41. PCRE Extension
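
Slide 41 names the PCRE extension without an example. As a rough sketch of how it might be used for analysis, the following pulls link targets and link text out of a markup string with preg_match_all. The sample markup and the pattern are illustrative assumptions, not taken from the slides, and a pattern like this only suits small, predictable fragments; the DOM-based approaches above tolerate messy markup better.

```php
<?php
// Sample markup, made up for illustration.
$markupString = '<ul><li><a href="/a">A</a></li><li><a href="/b?x=1">B</a></li></ul>';

// Capture the href value (group 1) and the link contents (group 2).
// Assumes double-quoted attribute values and no nested <a> elements.
$pattern = '/<a\s[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/is';

preg_match_all($pattern, $markupString, $matches, PREG_SET_ORDER);

$links = array();
foreach ($matches as $match) {
    $links[] = array(
        'href' => html_entity_decode($match[1]), // turn &amp; back into &
        'text' => strip_tags($match[2]),         // drop any inline markup
    );
}

foreach ($links as $link) {
    echo $link['href'], ' => ', $link['text'], PHP_EOL;
}
```
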
    42. Best Practices
    43. Approximate Human Behavior
    44. Minimize Requests
    45. Batch Jobs, Non-Peak Hours
    46. Account for Unavailability
    47. Aim for Parallelism
    48. Validate Data
    49. Test, Test, Test!
    50. Questions
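
Some of the best practices on slides 43 through 48 (minimizing requests, accounting for unavailability, validating data) can be sketched in a single wrapper around whatever retrieval function is in use. This is an assumption-laden illustration rather than anything from the talk: polite_fetch, its parameters, and the stub fetcher are hypothetical names, the cache is a plain in-memory array, and the only validation is a non-empty string check.

```php
<?php
// Hypothetical helper: fetch a URI with an in-memory cache, retries, and pauses.
// $fetcher is any callable that takes a URI and returns a body string or false.
function polite_fetch($uri, $fetcher, array &$cache, $maxAttempts = 3, $delayMicroseconds = 500000)
{
    if (isset($cache[$uri])) {
        return $cache[$uri]; // minimize requests: reuse what was already fetched
    }
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $body = call_user_func($fetcher, $uri);
        if (is_string($body) && $body !== '') { // validate before trusting the data
            $cache[$uri] = $body;
            return $body;
        }
        if ($attempt < $maxAttempts) {
            // Account for unavailability: wait longer after each failed attempt.
            usleep($delayMicroseconds * $attempt);
        }
    }
    return false;
}

// Stub fetcher that fails once and then succeeds, so the sketch runs offline.
$calls = 0;
$flakyFetcher = function ($uri) use (&$calls) {
    $calls++;
    return $calls < 2 ? false : '<html><body>ok</body></html>';
};

$cache = array();
$first = polite_fetch('http://www.example.com/', $flakyFetcher, $cache, 3, 0);
$second = polite_fetch('http://www.example.com/', $flakyFetcher, $cache, 3, 0);
```

For real use, 'file_get_contents' could be passed as the fetcher and the delay left at a non-zero value so requests are spaced out more like a person browsing.
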

    tobias382, 6 months ago


    © All Rights Reserved
