Web Scraping with PHP Matthew Turland September 16, 2008
  • Web Scraping with PHP

    1. 1. Web Scraping with PHP Matthew Turland September 16, 2008
    2. 2. Everyone acquainted?
          • Lead Programmer for surgiSYS, LLC
          • PHP Community member
          • Blog: http://ishouldbecoding.com
    3. 3. What is Web Scraping? A 2 Stage Process
          • Stage 1, Retrieval: send a request (GET /some/resource ...) and receive a response (HTTP/1.1 200 OK ...) containing the resource with the data you want.
          • Stage 2, Analysis: turn the raw resource into usable data.
    4. 4. How is it different from... Data Mining Focus in data mining Focus in web scraping Consuming Web Services Web service data formats Web scraping data formats
    5. 5. Potential Applications (what, and when)
          • Data source: Web service is unavailable or data access is one-time only.
          • Crawlers and indexers: Remote data search offers no capabilities for search or data source integration.
          • Integration testing: Applications must be tested by simulating client behavior and ensuring responses are consistent with requirements.
    6. 6. Disadvantages Δ == vs
    7. 7. Legal Concerns TOS TOU EUA Original source Illegal syndicate IANAL!
    8. 8. Retrieval GET /some/resource ... sizeof( ) == sizeof( ) if( ) require ;
    9. 9. The Low-Down on HTTP
    10. 10. Know enough HTTP to... Use one like this: To do this:
    11. 11. Know enough HTTP to...
          • Learn to use and troubleshoot a client like one of these: PEAR::HTTP_Client, pecl_http, Zend_Http_Client.
          • Or roll your own with cURL or Filesystem + Streams!
    12. 12. Let's GET Started
          GET /wiki/Main_Page HTTP/1.1
          Host: en.wikipedia.org
          • Request line: method or operation (GET), URI address for the desired resource (/wiki/Main_Page), protocol version in use by the client (HTTP/1.1).
          • Header: header name (Host), header value (en.wikipedia.org).
          • More headers follow...
    13. 13. Warning about GET
          • In principle: "Let's do this by the book."
          • In reality: "'Safe operation'? Whatever."
    14. 14. URI vs URL 1. Uniquely identifies a resource 2. Indicates how to locate a resource 3. Does both and is thus human-usable. URI URL More info in RFC 3986 Sections 1.1.3 and 1.2.2
    15. 15. Query Strings
          URL: http://en.wikipedia.org/w/index.php?title=Query_string&action=edit
          Query string: title=Query_string&action=edit
          • A question mark separates the resource address and the query string.
          • Equal signs separate parameter names and their respective values.
          • Ampersands separate parameter name-value pairs.
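As a rough illustration of the anatomy described on this slide, PHP's built-in parse_url() and parse_str() can split the Wikipedia URL into its query string and parameters. The variable names are illustrative only and do not come from the slides.

```php
<?php
// URL taken from the Query Strings slide.
$url = 'http://en.wikipedia.org/w/index.php?title=Query_string&action=edit';

// Everything after the question mark is the query string.
$queryString = parse_url($url, PHP_URL_QUERY);

// Split on ampersands and equal signs into name => value pairs.
$params = array();
parse_str($queryString, $params);

print_r($params);
```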
    16. 16. URL Encoding
          Parameter   Value
          first       this is a field
          second      was it clear enough (already)?
          Query string: first=this+is+a+field&second=was+it+clear+enough+%28already%29%3F
          • Also called percent encoding.
          • Handy PHP URL functions: urlencode and urldecode.
          • $_SERVER['QUERY_STRING'] / http_build_query($_GET)
          • More info on URL encoding in RFC 3986 Section 2.1
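A minimal sketch of the encoding functions named on this slide, using the two field values from its table. The separator is passed to http_build_query() explicitly as a precaution, so the arg_separator.output ini setting cannot change the result.

```php
<?php
// Field values from the URL Encoding slide's table.
$fields = array(
    'first'  => 'this is a field',
    'second' => 'was it clear enough (already)?',
);

// urlencode() handles a single value: spaces become "+" and
// reserved characters become %XX escapes.
$encoded = urlencode($fields['second']);

// urldecode() reverses the encoding.
$decoded = urldecode($encoded);

// http_build_query() encodes every value and joins the pairs with "&".
$query = http_build_query($fields, '', '&');

echo $query, "\n";
```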
    17. 17. POST Requests Most Common HTTP Operations 1. GET 2. POST ... /w/index.php POST /new/resource -or- /updated/resource GET /some/resource HTTP/1.1 Header: Value ... POST /some/resource HTTP/1.1 Header: Value request body none
    18. 18. POST Request Example
          POST /w/index.php?title=Wikipedia:Sandbox HTTP/1.1
          Content-Type: application/x-www-form-urlencoded
          (blank line)
          wpStarttime=20080719022313&wpEdittime=20080719022100...
          • Content-Type is the content type for data submitted via HTML form (multipart/form-data for file uploads).
          • A blank line separates request headers and body.
          • Request body ... look familiar?
          • Note: Most browsers have a query string length limit. Lowest known common denominator: IE7, where strlen(entire URL) <= 2,048 bytes. This limit is not standardized and only applies to query strings, not request bodies.
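For anyone rolling their own client, the request anatomy shown here can be sketched as a small string builder. buildRequest() is a hypothetical helper written for this illustration, and the host, path, and field names are placeholders.

```php
<?php
/**
 * Build a raw HTTP/1.1 request string.
 * Hypothetical helper for illustration; not part of any library.
 */
function buildRequest($method, $host, $path, array $headers = array(), $body = '')
{
    $lines = array("$method $path HTTP/1.1", "Host: $host");
    if ($body !== '') {
        // The server needs the body length to know where the request ends.
        $headers['Content-Length'] = strlen($body);
    }
    foreach ($headers as $name => $value) {
        $lines[] = "$name: $value";
    }
    // Lines end in CRLF; a blank line separates the headers from the body.
    return implode("\r\n", $lines) . "\r\n\r\n" . $body;
}

$body = http_build_query(array('var1' => 'value1', 'var2' => 'value2'), '', '&');
$request = buildRequest('POST', 'www.example.com', '/some/resource', array(
    'Content-Type' => 'application/x-www-form-urlencoded',
    'Connection'   => 'close',
), $body);

echo $request, "\n";
```

To actually send it, the string could be written with fwrite() to a socket opened with fsockopen('www.example.com', 80), and the response read back with fgets().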
    19. 19. HEAD Request
          HEAD /wiki/Main_Page HTTP/1.1
          Host: en.wikipedia.org
          Same as GET with two exceptions:
          • 1. The request line uses HEAD in place of GET.
          • 2. The response (HTTP/1.1 200 OK, Header: Value) has no response body.
          Sometimes headers are all you want.
    20. 20. Responses
          HTTP/1.0 200 OK
          Server: Apache
          X-Powered-By: PHP/5.2.5
          ...
          [body]
          • Status line: lowest protocol version required to process the response (HTTP/1.0), response status code (200), response status description (OK).
          • Same header format as requests, but different headers are used (see RFC 2616 Section 14).
    21. 21. Response Status Codes
          • 1xx Informational: Request received, continuing process.
          • 2xx Success: Request received, understood, and accepted.
          • 3xx Redirection: Client must take additional action to complete the request.
          • 4xx Client Error: Request is malformed or could not be fulfilled.
          • 5xx Server Error: Request was valid, but the server failed to process it.
          See RFC 2616 Section 10 for more info.
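A short sketch of how a scraper might split a status line and map the code to the classes listed on this slide. parseStatusLine() is a hypothetical helper, not a built-in.

```php
<?php
/**
 * Split a status line such as "HTTP/1.1 200 OK" into its parts and
 * name the status class. Hypothetical helper for illustration.
 */
function parseStatusLine($line)
{
    if (!preg_match('#^HTTP/(\d+\.\d+) (\d{3})(?: (.*))?$#', trim($line), $m)) {
        throw new InvalidArgumentException("Not a status line: $line");
    }
    $classes = array(
        1 => 'Informational',
        2 => 'Success',
        3 => 'Redirection',
        4 => 'Client Error',
        5 => 'Server Error',
    );
    $code  = (int) $m[2];
    $class = (int) floor($code / 100); // first digit selects the class
    return array(
        'version'     => $m[1],
        'code'        => $code,
        'description' => isset($m[3]) ? $m[3] : '',
        'class'       => isset($classes[$class]) ? $classes[$class] : 'Unknown',
    );
}

$status = parseStatusLine('HTTP/1.0 200 OK');
print_r($status);
```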
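    22. 22. Headers
          • Set-Cookie (response) and Cookie (request). See RFC 2109 or RFC 2965 for more info.
          • Location (redirects): watch out for infinite loops!
          • Last-Modified (response) and If-Modified-Since (request), OR ETag (response) and If-None-Match (request): the server answers 304 Not Modified when the resource has not changed.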
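The conditional headers on this slide might be put to work as follows. This is a sketch only: conditionalHeaders() is a hypothetical helper, and the date and ETag values are made-up placeholders standing in for values saved from an earlier response.

```php
<?php
/**
 * Build conditional request headers from values saved with a cached copy.
 * $lastModified and $etag are assumed to be the Last-Modified and ETag
 * header values of an earlier response; pass null when unknown.
 */
function conditionalHeaders($lastModified = null, $etag = null)
{
    $headers = array();
    if ($lastModified !== null) {
        $headers[] = 'If-Modified-Since: ' . $lastModified;
    }
    if ($etag !== null) {
        $headers[] = 'If-None-Match: ' . $etag;
    }
    return $headers;
}

// Placeholder values for illustration.
$headers = conditionalHeaders('Tue, 16 Sep 2008 12:00:00 GMT', '"abc123"');

// The http stream wrapper accepts the headers as an array of lines.
$context = stream_context_create(array('http' => array(
    'method' => 'GET',
    'header' => $headers,
)));
```

If the server then answers 304 Not Modified, the cached copy can be reused instead of downloading the resource again.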
    23. 23. More Headers
          • WWW-Authenticate and Authorization: see RFC 2617 for more info.
          • User-Agent: some servers perform user agent sniffing; some clients perform user agent spoofing.
          • Possible outcomes: 200 OK / 403 Forbidden.
    24. 24. Best Practices
          • If responses are unlikely to change often, cache them locally and check for updates with HEAD requests.
          • If possible, measure target server response time and vary when requests are sent based on peak usage times.
          • Replicate normal user behavior (particularly headers) as closely as possible. If the application is flexible, vary behavior where it makes sense for performance.
    25. 25. Simple Streams Examples
          $uri = 'http://www.example.com/some/resource';
          // GET request
          $get = file_get_contents($uri);
          // POST request with a form-encoded body
          $context = stream_context_create(array('http' => array(
              'method' => 'POST',
              'header' => 'Content-Type: application/x-www-form-urlencoded',
              'content' => http_build_query(array(
                  'var1' => 'value1',
                  'var2' => 'value2'
              ))
          )));
          // Last 2 parameters here also apply to fopen()
          $post = file_get_contents($uri, false, $context);
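The streams examples return only the body. When the response headers matter too, one possible approach is to open the URI with fopen() and read the header lines from stream_get_meta_data(). The fetch() helper below is hypothetical, and its 10 second timeout is an arbitrary choice.

```php
<?php
/**
 * Fetch a URI through a stream wrapper and return the response header
 * lines alongside the body. Hypothetical helper for illustration.
 */
function fetch($uri, array $httpOptions = array())
{
    $defaults = array(
        'timeout'       => 10,   // seconds; arbitrary choice
        'ignore_errors' => true, // still return the body on 4xx/5xx responses
    );
    // Caller-supplied options take precedence over the defaults.
    $context = stream_context_create(array('http' => $httpOptions + $defaults));

    $stream = @fopen($uri, 'r', false, $context);
    if ($stream === false) {
        return false; // could not open (DNS failure, refused, timeout, ...)
    }
    $meta = stream_get_meta_data($stream);
    $body = stream_get_contents($stream);
    fclose($stream);

    return array(
        // For http:// URIs, wrapper_data holds the raw response header lines.
        'headers' => isset($meta['wrapper_data']) ? $meta['wrapper_data'] : array(),
        'body'    => $body,
    );
}
```

A call such as fetch('http://www.example.com/some/resource', array('method' => 'HEAD')) would then return the header lines with an empty body.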
    26. 26. Streams Resources
          • Language Reference > Context options and parameters
              • HTTP context options
              • Context parameters
          • Appendices > List of Supported Protocols/Wrappers
              • HTTP and HTTPS
          • php|architect's Definitive Guide to PHP Streams (ETA late 2008 / early 2009)
    27. 27. pecl_http Examples
          $http = new HttpRequest($uri);
          $http->enableCookies();
          $http->setMethod(HTTP_METH_POST); // or HTTP_METH_GET
          $http->addPostFields($postData);
          // Authentication, identification, and partial-content options
          $http->setOptions(array(
              'httpauth' => $username . ':' . $password,
              'httpauthtype' => HTTP_AUTH_BASIC,
              'useragent' => 'PHP ' . phpversion(),
              'referer' => 'http://example.com/some/referer',
              'range' => array(array(1, 5), array(10, 15))
          ));
          $response = $http->send();
          $headers = $response->getHeaders();
          $body = $response->getBody();
          See PHP Manual for more info.
    28. 28. PEAR::HTTP_Client Examples
          $cookiejar = new HTTP_Client_CookieManager();
          $request = new HTTP_Request($uri);
          $request->setMethod(HTTP_REQUEST_METHOD_POST);
          $request->setBasicAuth($username, $password);
          $request->addHeader('User-Agent', $userAgent);
          $request->addHeader('Referer', $referrer);
          $request->addHeader('Range', 'bytes=2-3,5-6');
          foreach ($postData as $key => $value) {
              $request->addPostData($key, $value);
          }
          $request->sendRequest();
          // Carry cookies from the first response over to the next request
          $cookiejar->updateCookies($request);
          $request = new HTTP_Request($otheruri);
          $cookiejar->passCookies($request);
          $response = $request->sendRequest();
          $headers = $request->getResponseHeader();
          $body = $request->getResponseBody();
          See PEAR Manual and API Docs for more info.
    29. 29. Zend_Http_Client Examples
          $client = new Zend_Http_Client($uri);
          $client->setMethod(Zend_Http_Client::POST);
          $client->setAuth($username, $password);
          $client->setHeaders('User-Agent', $userAgent);
          $client->setHeaders(array(
              'Referer' => $referrer,
              'Range' => 'bytes=2-3,5-6'
          ));
          $client->setParameterPost($postData);
          $client->setCookieJar();
          $client->request();
          // Reuse the same client (and its cookie jar) for a second request
          $client->setUri($otheruri);
          $client->setMethod(Zend_Http_Client::GET);
          $response = $client->request();
          $headers = $response->getHeaders();
          $body = $response->getBody();
          See ZF Manual for more info.
    30. 30. cURL Examples
          Fatal error: Allowed memory size of n00b bytes exhausted (tried to allocate 1337 bytes) in /this/slide.php on line 1
          Just kidding. Really, the equivalent cURL code for the previous examples is so verbose that it won't fit on one slide and I don't think it's deserving of multiple slides.
          See PHP Manual, Context Options, or my php|architect article for more info.
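For readers who want it anyway, a rough cURL counterpart to the POST examples might look like the sketch below. It is not from the talk: both functions are hypothetical helpers, the redirect and timeout limits are arbitrary, and the curl extension is assumed to be installed.

```php
<?php
/**
 * Build cURL options for a form POST with cookies, basic auth, and custom
 * User-Agent and Referer headers. Hypothetical helper for illustration.
 */
function curlPostOptions($uri, array $postData, $username, $password, $userAgent, $referrer)
{
    return array(
        CURLOPT_URL            => $uri,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($postData, '', '&'),
        CURLOPT_HTTPAUTH       => CURLAUTH_BASIC,
        CURLOPT_USERPWD        => $username . ':' . $password,
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_REFERER        => $referrer,
        CURLOPT_COOKIEFILE     => '',   // empty string enables in-memory cookie handling
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 5,    // guard against redirect loops (arbitrary limit)
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_TIMEOUT        => 10,   // seconds; arbitrary choice
    );
}

/**
 * Send the POST and return the status code and body.
 */
function curlPost($uri, array $postData, $username, $password, $userAgent, $referrer)
{
    $handle = curl_init();
    curl_setopt_array($handle, curlPostOptions($uri, $postData, $username, $password, $userAgent, $referrer));
    $body = curl_exec($handle);
    if ($body === false) {
        throw new RuntimeException(curl_error($handle));
    }
    return array(
        'status' => curl_getinfo($handle, CURLINFO_HTTP_CODE),
        'body'   => $body,
    );
}
```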
    31. 31. HTTP Resources
          • RFC 2616 HyperText Transfer Protocol
          • RFC 3986 Uniform Resource Identifiers
          • "HTTP: The Definitive Guide" (ISBN 1565925092)
          • "HTTP Pocket Reference: HyperText Transfer Protocol" (ISBN 1565928628)
          • "HTTP Developer's Handbook" (ISBN 0672324547) by Chris Shiflett
          • Ben Ramsey's blog series on HTTP
    32. 32. Analysis: from raw resource to usable data
          • Tools: DOM, XMLReader, SimpleXML, XSL, tidy, PCRE, String functions, JSON, ctype, XML Parser
    33. 33. Cleanup
          • tidy is good for correcting markup malformations.
          • String functions and PCRE can be used for manual cleanup prior to using a parsing extension.
          • DOM is generally forgiving when parsing malformed markup. It generates warnings that can be suppressed.
          • Save a static copy of your target, use a validator on the input (ex: W3C Markup Validator), fix validation errors manually, and write code to automatically apply fixes.
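One way to see DOM's forgiving behavior is to feed it deliberately malformed markup and collect the libxml warnings quietly rather than letting them surface as PHP warnings. The markup below is invented for this sketch.

```php
<?php
// Deliberately malformed markup: the <p> element is never closed.
$html = '<html><body><p>Unclosed paragraph<div id="total">42</div></body></html>';

// Collect libxml warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Keep the warnings for logging if wanted, then restore the old setting.
$warnings = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The repaired tree can be queried as usual.
$xpath = new DOMXPath($doc);
$node  = $xpath->query('//div[@id="total"]')->item(0);

echo $node->textContent, "\n";
```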
    34. 34. Parsing
          • DOM and SimpleXML are tree-based parsers that store the entire document in memory to provide full access.
          • XMLReader is a pull-based parser that iterates over nodes in the document and is less memory-intensive.
          • SAX (the XML Parser extension) also streams the document, but is push-based: the parser invokes event-based callbacks.
          • JSON can be used to parse isolated JavaScript data.
          • Nothing "official" for CSS. Find something like CSSTidy.
          • PCRE can be used for parsing. Last resort, though.
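To contrast with the tree-based parsers, here is a minimal XMLReader sketch that pulls one node at a time. The XML document and its element names are made up for the example.

```php
<?php
// Small made-up document; a real target would be far larger.
$xml = '<items><item id="a">Alpha</item><item id="b">Beta</item></items>';

$reader = new XMLReader();
$reader->XML($xml);

$items = array();
// Pull nodes one by one; the full tree is never built in memory.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
        // readString() returns the text content of the current element.
        $items[$reader->getAttribute('id')] = $reader->readString();
    }
}
$reader->close();

print_r($items);
```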
    35. 35. Validation
          • Make as few assumptions (and as many assertions) about the target as possible.
          • Validation provides additional sanity checks for your application.
          • PCRE can be used to form pattern-based assertions about extracted data.
          • ctype can be used to form primitive type-based assertions.
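The two kinds of assertion mentioned on this slide might be sketched like this. Both function names and the price format are assumptions chosen for the illustration.

```php
<?php
/**
 * Pattern-based assertion: does the value look like a price such as "19.99"?
 * The expected format is an assumption made for this example.
 */
function looksLikePrice($value)
{
    // \z anchors at the very end, so a trailing newline is rejected.
    return preg_match('/^\d+\.\d{2}\z/', $value) === 1;
}

/**
 * Type-based assertion: is the value a non-empty string of decimal digits?
 */
function looksLikeInteger($value)
{
    return is_string($value) && ctype_digit($value);
}

var_dump(looksLikePrice('19.99'), looksLikeInteger('2008'));
```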
    36. 36. Transformation
          • XSL can be used to extract data from an XML-compatible document and retrofit it to a format defined by an XSL template.
          • To my knowledge, this capability is unfortunately unique to XML-compatible data.
          • Use components like template engines to separate formatting of data from retrieval/analysis logic.
    37. 37. Abstraction
          • Remain in keeping with the DRY principle.
          • Develop components that can be reused across projects. Ex: DomQuery, Zend_Dom.
          • Make an effort to minimize application-specific logic. This applies to both retrieval and analysis.
    38. 38. Assertions
          • Apply to long-term real-time web scraping applications.
          • Affirm conditions of behavior and output of the target application.
          • Use in the application during runtime to avoid Bad Things (tm) happening when the target application changes.
          • Include in unit tests of the application. You are using unit tests, right?
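A runtime assertion of this kind could be as simple as refusing to continue when the expected markup is missing. extractTitle() and the fixture string are hypothetical and stand in for whatever the real target provides.

```php
<?php
/**
 * Extract the page title, failing loudly if the target's markup no longer
 * matches expectations. Hypothetical helper for illustration.
 */
function extractTitle($html)
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = $doc->getElementsByTagName('title');
    // Assertion: the target is expected to have exactly one <title>.
    if ($nodes->length !== 1) {
        throw new UnexpectedValueException(
            'Expected exactly one <title> element, found ' . $nodes->length
        );
    }
    $title = trim($nodes->item(0)->textContent);
    // Assertion: the title must not be empty.
    if ($title === '') {
        throw new UnexpectedValueException('The <title> element is empty');
    }
    return $title;
}

// Stand-in for target output saved to a local file.
$fixture = '<html><head><title>Main Page</title></head><body></body></html>';
echo extractTitle($fixture), "\n";
```

The same function can be called from a unit test against locally stored copies of the target's output, loaded with file_get_contents().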
    39. 39. Testing
          • Write tests on target application output stored in local files that can be run sans internet during development.
          • If possible/feasible/appropriate, write "live tests" that actively test using assertions on the target application.
          • Run live tests when the target appears to have changed (because your web scraping application breaks).
    40. 40. Questions?
          • No heckling... OK, maybe just a little.
          • I will hang around afterward if you have questions, points for discussion, or just want to say hi. It's cool, I don't bite or have cooties or anything. I have business cards too.
          • I generally blog about my experiences with web scraping and PHP at http://ishouldbecoding.com. </shameless_plug>
          • Thanks for coming!