When RSS Fails: Web Scraping with HTTP

A brief introduction to the HTTP protocol for use in web scraping, best practices, and availability of PHP-based HTTP client libraries.

Published in: Technology

  1. When RSS Fails: Web Scraping with HTTP. Matthew Turland, Senior Consultant, Blue Parabola LLC. February 27, 2009
  2. What Is Web Scraping? A 2-Step Process
  3. Its Goal: Data
  4. Obtain It
  5. Transform It
  6. Automate It
  7. Step 1: Retrieval
  8. The Client
  9. The Server
  10. The Request
  11. The Response
  12. Or In Your Case
  13. Step 2: Analysis
  14. Locate Desired Data
  15. Extract It
  16. Use It
  17. So To Recap: 2-Step Process
      Step 1, Retrieval: GET /some/resource ... is answered with HTTP/1.1 200 OK ... (the resource with the data you want)
      Step 2, Analysis: raw resource in, usable data out
  18. How Is It Different?
      Consuming Web Services: web service data formats vs. web scraping data formats
      Data Mining: focus in data mining vs. focus in web scraping
  19. What Is It Used For? System integration; crawlers and indexers; integration testing
  20. Disadvantages
  21. One small change to markup...
  22. ... may break your application.
  23. Or in modern terms...
  24. Reverse Engineering Required
  25. Multiple Requests
  26. No Nice Neat Data Package
  27. Quite the Opposite, In Fact
  28. Know enough HTTP to... Use one like this: To do this:
  29. Know enough HTTP to... Learn to use and troubleshoot one like this: PEAR::HTTP_Client, pecl_http, Zend_Http_Client. Or roll your own! Filesystem + Streams, cURL
  30. Let's GET Started
      Request line: GET /wiki/Main_Page HTTP/1.1
        method: the operation desired
        URI: the address for the desired resource
        protocol version: the version in use by the client
      Header: Host: en.wikipedia.org (header name, then header value)
      More headers follow...
  31. URI vs URL
      1. Uniquely identifies a resource
      2. Indicates how to locate a resource
      3. Does both and is thus human-usable.
      More info in RFC 3986 Sections 1.1.3 and 1.2.2
  32. Warning about GET
      GET in principle: "Let's do this by the book."
      GET in reality: "'Safe operation'? Whatever."
  33. Query Strings
      URL: http://en.wikipedia.org/w/index.php?title=Query_string&action=edit
      Query string: title=Query_string&action=edit
      Question mark to separate the resource address and query string
      Equal signs to separate parameter names and respective values
      Ampersands to separate parameter name-value pairs
  34. URL Encoding
      Also called percent encoding.
      Parameter: first     Value: this is a field
      Parameter: second    Value: is it clear enough (already)?
      Query string: first=this+is+a+field&second=is+it+clear+enough+%28already%29%3F
      Handy PHP URL functions: parse_str, urlencode, urldecode
      $_SERVER['QUERY_STRING'] / http_build_query($_GET)
      More info on URL encoding in RFC 3986 Section 2.1
  35. POST Requests
      Most common HTTP operations: 1. GET 2. POST
      GET /some/resource HTTP/1.1, Header: Value ..., request body: none
      POST /some/resource HTTP/1.1, Header: Value ..., followed by a request body
      POST /w/index.php ... /new/resource -or- /updated/resource
  36. POST Request Example
      POST /w/index.php?title=Wikipedia:Sandbox HTTP/1.1
      Content-Type: application/x-www-form-urlencoded
      (blank line)
      wpStarttime=20080719022313&wpEdittime=20080719022100 ...
      Content-Type is the content type for data submitted via HTML form (multipart/form-data for file uploads).
      A blank line separates request headers and body.
      Request body ... look familiar?
      Note: Most browsers have a query string length limit. Lowest known common denominator: IE7, strlen(entire URL) <= 2,048 bytes. This limit is not standardized. It applies to query strings, but not request bodies.
  37. HEAD Request
      HEAD /wiki/Main_Page HTTP/1.1
      Host: en.wikipedia.org
      Same as GET with two exceptions:
      1. HEAD vs GET as the method
      2. No response body: HTTP/1.1 200 OK, Header: Value, and nothing after the headers
      Sometimes headers are all you want.
  38. Responses
      Status line: HTTP/1.0 200 OK
        protocol version: lowest version required to process the response
        response status code
        response status description
      Headers: Server: Apache, X-Powered-By: PHP/5.2.5 ...
        Same header format as requests, but different headers are used (see RFC 2616 Section 14)
      [body]
  39. Response Status Codes
      1xx Informational: Request received, continuing process.
      2xx Success: Request received, understood, and accepted.
      3xx Redirection: Client must take additional action to complete the request.
      4xx Client Error: Request is malformed or could not be fulfilled.
      5xx Server Error: Request was valid, but the server failed to process it.
      See RFC 2616 Section 10 for more info.
  40. Headers
      Set-Cookie (response) and Cookie (request): see RFC 2109 or RFC 2965 for more info.
      Location: watch out for infinite loops!
      Last-Modified paired with If-Modified-Since, OR ETag paired with If-None-Match: 304 Not Modified
  41. More Headers
      WWW-Authenticate (response) and Authorization (request): see RFC 2617 for more info. 200 OK / 403 Forbidden
      User-Agent: some servers perform user agent sniffing; some clients perform user agent spoofing
  42. Best Practices
  43. Simulate User Behavior
  44. Minimize Requests
  45. Batch Jobs, Non-Peak Hours
  46. Questions? No heckling... OK, maybe just a little. I generally blog about my experiences with web scraping and PHP at http://ishouldbecoding.com. </shameless_plug> Thanks for coming!
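
The query string and URL encoding slides (33 and 34) can be tried out in a few lines. This is a minimal sketch, not part of the original deck: it uses Python's standard `urllib.parse` module as a stand-in for the PHP functions the slides name (`urlencode`, `parse_str`, `http_build_query`), and the variable names are illustrative assumptions.

```python
from urllib.parse import urlencode, parse_qsl, urlsplit

# Parameter names and values as listed on the URL Encoding slide.
params = {
    "first": "this is a field",
    "second": "is it clear enough (already)?",
}

# urlencode() joins name=value pairs with "&", writes spaces as "+",
# and percent-encodes reserved characters such as "(", ")" and "?".
query = urlencode(params)
print(query)

# parse_qsl() reverses the encoding and returns (name, value) pairs.
decoded = dict(parse_qsl(query))

# The question mark separates the resource address from the query string.
url = "http://en.wikipedia.org/w/index.php?title=Query_string&action=edit"
parts = urlsplit(url)
wiki_params = dict(parse_qsl(parts.query))
print(parts.path, wiki_params)
```

Round-tripping through `parse_qsl` is a quick way to confirm that a hand-built query string says what you think it says before sending it to a server.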

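The request and response anatomy from slides 30, 38 and 39 can also be sketched without touching the network. The helper names below (`build_request`, `parse_response`, `status_class`) are hypothetical and written only for illustration; the sample response is canned text modeled on the Responses slide. A real client would also need to handle repeated headers such as Set-Cookie, set Content-Length on request bodies, and deal with chunked bodies, all of which this sketch ignores.

```python
def build_request(method, path, host, headers=None, body=""):
    """Assemble a raw HTTP/1.1 request: request line, headers, blank line, body."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # A blank line (CRLF CRLF) separates the headers from the body.
    return "\r\n".join(lines) + "\r\n\r\n" + body


def parse_response(raw):
    """Split a raw response into version, status code, reason, headers, body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalize them.
        # Simplification: a repeated header overwrites the earlier one.
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


# Status code classes as listed on the Response Status Codes slide.
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code):
    """Map a status code to its class by its first digit."""
    return STATUS_CLASSES.get(code // 100, "Unknown")


request = build_request("GET", "/wiki/Main_Page", "en.wikipedia.org")

# Canned response text; no network access is involved.
sample_response = (
    "HTTP/1.0 200 OK\r\n"
    "Server: Apache\r\n"
    "X-Powered-By: PHP/5.2.5\r\n"
    "\r\n"
    "<html>...</html>"
)
version, code, reason, headers, body = parse_response(sample_response)
print(version, code, reason, status_class(code))
```

Seeing the raw text this way makes it easier to troubleshoot a client library, since whatever API it exposes ultimately produces and consumes messages of this shape.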