When RSS Fails: Web Scraping with HTTP
Upcoming SlideShare
Loading in...5
×
 

Like this? Share it with your network

Share

When RSS Fails: Web Scraping with HTTP

on

  • 6,770 views

A brief introduction to the HTTP protocol for use in web scraping, best practices, and availability of PHP-based HTTP client libraries.

A brief introduction to the HTTP protocol for use in web scraping, best practices, and availability of PHP-based HTTP client libraries.

Statistics

Views

Total Views
6,770
Views on SlideShare
6,762
Embed Views
8

Actions

Likes
1
Downloads
57
Comments
2

2 Embeds 8

http://www.slideshare.net 5
http://www.slideee.com 3

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

When RSS Fails: Web Scraping with HTTP Presentation Transcript

  • 1. When RSS Fails: Web Scraping with HTTP Matthew Turland Senior Consultant Blue Parabola LLC February 27, 2009
  • 2. What is Web Scraping? A 2 Step Process
  • 3. Its Goal: Data
  • 4. Obtain It
  • 5. Transform It
  • 6. Automate It
  • 7. Step 1: Retrieval
  • 8. The Client
  • 9. The Server
  • 10. The Request
  • 11. The Response
  • 12. Or In Your Case
  • 13. Step #2: Analysis
  • 14. Locate Desired Data
  • 15. Extract It
  • 16. Use It
  • 17. So To Recap 2 Step Process Step 1: GET /some/resource Retrieval ... HTTP/1.1 200 OK Resource ... with data you want Usable Raw Step 2: data resource Analysis
  • 18. How Is It Different? Consuming Web Services Web service data formats Web scraping data formats Data Mining Focus in data mining Focus in web scraping
  • 19. What Is It Used For? System integration Crawlers and indexers Integration testing
  • 20. Disadvantages
  • 21. One small change to markup...
  • 22. ... may break your application.
  • 23. Or in modern terms...
  • 24. Reverse Engineering Required
  • 25. Multiple Requests
  • 26. No Nice Neat Data Package
  • 27. Quite the Opposite, In Fact
  • 28. Know enough HTTP to... Use one like this: To do this:
  • 29. Know enough HTTP to... Learn to use and troubleshoot one like this: PEAR::HTTP_Client pecl_http Zend_Http_Client Or roll your own! Filesystem + Streams cURL
  • 30. Let's GET Started request line protocol version in method URI address for the use by the client or operation desired resource GET /wiki/Main_Page HTTP/1.1 Host: en.wikipedia.org header name header value header more headers follow...
  • 31. URI vs URL URI 1. Uniquely identifies a resource URL 2. Indicates how to locate a resource 3. Does both and is thus human-usable. More info in RFC 3986 Sections 1.1.3 and 1.2.2
  • 32. Warning about GET GET In principle: quot;Let's do this by the book.quot; GET In reality: quot;'Safe operation'? Whatever.quot;
  • 33. Query Strings Value Ampersands to separate Parameter parameter name-value pairs. URL Query String http://en.wikipedia.org/w/index.php? title=Query_string&action=edit Question mark to separate Equal signs to separate parameter the resource address and names and respective values query string
  • 34. URL Encoding Also called percent encoding. Parameter Value first this is a field second is it clear enough (already)? Query String first=this+is+a+field&second=is+it+clear+%28already%29%3F parse_str, urlencode, urldecode: Handy PHP URL functions $_SERVER['QUERY_STRING'] / http_build_query($_GET) More info on URL encoding in RFC 3986 Section 2.1
  • 35. POST Requests Most Common POST HTTP Operations /w/index.php 1. GET 2. POST ... /new/resource GET /some/resource HTTP/1.1 -or- Header: Value ... /updated/resource POST /some/resource HTTP/1.1 none Header: Value request body
  • 36. POST Request Example Content type for data submitted via HTML form Blank line separates (multipart/form-data for file uploads) request headers and body POST /w/index.php?title=Wikipedia:Sandbox HTTP/1.1 Content­Type: application/x­www­form­urlencoded wpStarttime=20080719022313&wpEdittime=20080719022100 ... Note: Most browsers have a query string length limit. Lowest known common denominator: IE7 strlen(entire URL) <= 2,048 bytes. Request body This limit is not standardized. It applies ... look familiar? to query strings, but not request bodies.
  • 37. HEAD Request Same as GET with two exceptions: HEAD /wiki/Main_Page HTTP/1.1 Host: en.wikipedia.org ? 1 HEAD vs GET HTTP/1.1 200 OK Header: Value Sometimes headers are all you want 2 No response body Headers Body
  • 38. Responses Response Lowest protocol version Response status code required to process the status description response Status line HTTP/1.0 200 OK Server: Apache X­Powered­By: PHP/5.2.5 ... Same header format as requests, but different [body] headers are used (see RFC 2616 Section 14)
  • 39. Response Status Codes 1xx Informational Request received, continuing process. 2xx Success Request received, understood, and accepted. 3xx Redirection Client must take additional action to complete the request. 4xx Client Error Request is malformed or could not be fulfilled. 5xx Server Error Request was valid, but the server failed to process it. See RFC 2616 Section 10 for more info.
  • 40. Headers Set-Cookie See RFC 2109 or RFC 2965 for more info. Cookie Location Watch out for infinite loops! Last-Modified ETag OR If-Modified-Since If-None-Match 304 Not Modified
  • 41. More Headers WWW-Authenticate See RFC 2617 Authorization for more info. 200 OK / 403 Forbidden Some servers perform User-Agent user agent sniffing Some clients perform User-Agent: user agent spoofing
  • 42. Best Practices
  • 43. Simulate User Behavior
  • 44. Minimize Requests
  • 45. Batch Jobs, Non-Peak Hours
  • 46. Questions?  No heckling... OK, maybe just a little.  I generally blog about my experiences with web scraping and PHP at http://ishouldbecoding.com. </shameless_plug> Thanks for coming!