17. So To Recap
2 Step Process
Step 1: Retrieval
GET /some/resource
...
HTTP/1.1 200 OK
...
Resource with data you want
Step 2: Analysis
Raw resource -> Usable data
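The two steps can be sketched in PHP as follows. This is only an illustration: the retrieval step is simulated with a hard-coded string so the sketch runs without network access, and extractLinks() is a hypothetical helper name, not something from the talk.

```php
<?php
// Step 1 (retrieval) is simulated with a hard-coded response body.
$rawResource = '<html><head><title>Main Page</title></head>'
    . '<body><a href="/wiki/PHP">PHP</a> <a href="/wiki/HTTP">HTTP</a></body></html>';

/**
 * Step 2 (analysis): turn the raw resource into usable data.
 * Returns an array mapping link text to link target.
 */
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely well-formed, so warnings are suppressed.
    @$doc->loadHTML($html);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $links[$anchor->textContent] = $anchor->getAttribute('href');
    }
    return $links;
}

$usableData = extractLinks($rawResource);
print_r($usableData);
```

In a real scraper, $rawResource would come from an HTTP client rather than a literal.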
18. How Is It Different?
Consuming Web Services: web service data formats vs. web scraping data formats
Data Mining: focus in data mining vs. focus in web scraping
19. What Is It Used For?
System integration
Crawlers and indexers
Integration testing
29. Know enough HTTP to...
Learn to use and troubleshoot one like this:
PEAR::HTTP_Client, pecl_http, Zend_Http_Client
Or roll your own!
Filesystem + Streams, cURL
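A minimal sketch of the "roll your own" route using the filesystem functions with an HTTP stream context. The helper name buildGetContext(), the user agent string, and the timeout are assumptions made for illustration; the actual request is left commented out so the sketch does not touch the network.

```php
<?php
/**
 * Build a stream context for a GET request.
 * buildGetContext() is a hypothetical helper name.
 */
function buildGetContext(string $userAgent, float $timeoutSeconds)
{
    return stream_context_create([
        'http' => [
            'method'        => 'GET',
            'header'        => "Accept: text/html\r\n",
            'user_agent'    => $userAgent,
            'timeout'       => $timeoutSeconds,
            'ignore_errors' => true, // still return the body on 4xx/5xx
        ],
    ]);
}

$context = buildGetContext('ExampleScraper/0.1', 10.0);

// The retrieval itself would then look like this:
// $body = file_get_contents('http://en.wikipedia.org/wiki/Main_Page', false, $context);

print_r(stream_context_get_options($context));
```

The same options array also accepts 'content' for a request body, which is how a POST would be sent through this route.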
30. Let's GET Started
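GET /wiki/Main_Page HTTP/1.1
Host: en.wikipedia.org
more headers follow...
Request line: GET /wiki/Main_Page HTTP/1.1
GET: method or operation
/wiki/Main_Page: URI address for the desired resource
HTTP/1.1: protocol version in use by the client
Header: Host: en.wikipedia.org
Host: header name
en.wikipedia.org: header value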
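To make the anatomy concrete, here is a small sketch that assembles the request above as a raw string. buildGetRequest() is a hypothetical helper and the Connection header is just an example of "more headers"; neither comes from the slides.

```php
<?php
/**
 * Assemble a raw HTTP/1.1 GET request.
 * buildGetRequest() is a hypothetical helper for illustration only.
 */
function buildGetRequest(string $host, string $uri, array $headers = []): string
{
    // Request line: method, URI, protocol version.
    $lines = ["GET $uri HTTP/1.1"];
    // HTTP/1.1 requires the Host header.
    $lines[] = "Host: $host";
    // Each header is a name and a value separated by a colon.
    foreach ($headers as $name => $value) {
        $lines[] = "$name: $value";
    }
    // Lines end with CRLF; a blank line terminates the header block.
    return implode("\r\n", $lines) . "\r\n\r\n";
}

$request = buildGetRequest('en.wikipedia.org', '/wiki/Main_Page', ['Connection' => 'close']);
echo $request;
```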
31. URI vs URL
URI
1. Uniquely identifies a resource
URL
2. Indicates how to locate a resource
3. Does both and is thus human-usable.
More info in RFC 3986 Sections 1.1.3 and 1.2.2
32. Warning about GET
GET
In principle:
"Let's do this by the book."
GET
In reality:
"'Safe operation'? Whatever."
33. Query Strings
URL: http://en.wikipedia.org/w/index.php?title=Query_string&action=edit
Query String: title=Query_string&action=edit
Question mark to separate the resource address and query string.
Equal signs to separate parameter names and respective values.
Ampersands to separate parameter name-value pairs.
Parameter and Value: the two sides of each pair, for example title and Query_string.
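The separators described here are exactly what PHP's built-in parse_url() and parse_str() split on. A short sketch using the Wikipedia URL from this slide:

```php
<?php
$url = 'http://en.wikipedia.org/w/index.php?title=Query_string&action=edit';

// parse_url() separates the resource address from the query string.
$parts = parse_url($url);
echo $parts['path'], "\n";
echo $parts['query'], "\n";

// parse_str() splits the query string on ampersands and equal signs
// and stores the name-value pairs in $params.
parse_str($parts['query'], $params);
print_r($params);
```

Always pass the second argument to parse_str(); without it, older PHP versions wrote the parameters straight into the current scope.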
34. URL Encoding
Also called percent encoding.
Parameter: first
Value: this is a field
Parameter: second
Value: is it clear enough (already)?
Query String
first=this+is+a+field&second=is+it+clear+enough+%28already%29%3F
parse_str, urlencode, urldecode: Handy PHP URL functions
$_SERVER['QUERY_STRING'] / http_build_query($_GET)
More info on URL encoding in RFC 3986 Section 2.1
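A brief sketch of the functions named above, applied to the two example parameters from this slide. Variable names are arbitrary.

```php
<?php
$params = [
    'first'  => 'this is a field',
    'second' => 'is it clear enough (already)?',
];

// http_build_query() percent-encodes each value and joins the pairs
// with equal signs and ampersands.
$query = http_build_query($params);
echo $query, "\n";

// urlencode() and urldecode() handle a single value.
$encoded = urlencode($params['second']);
$decoded = urldecode($encoded);
echo $encoded, "\n";

// parse_str() reverses http_build_query().
parse_str($query, $roundTrip);
print_r($roundTrip);
```

Note that these functions encode spaces as plus signs, the form-style variant; rawurlencode() produces %20 instead.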
35. POST Requests
Most Common HTTP Operations
1. GET
2. POST
...
GET /some/resource HTTP/1.1
Header: Value
...
Request body: none
POST /some/resource HTTP/1.1
Header: Value
...
Request body: present
POST /w/index.php -> /new/resource -or- /updated/resource
36. POST Request Example
POST /w/index.php?title=Wikipedia:Sandbox HTTP/1.1
Content-Type: application/x-www-form-urlencoded

wpStarttime=20080719022313&wpEdittime=20080719022100
...
Content-Type: content type for data submitted via HTML form (multipart/form-data for file uploads).
Blank line separates request headers and body.
Request body ... look familiar?
Note: Most browsers have a query string length limit.
Lowest known common denominator: IE7, strlen(entire URL) <= 2,048 bytes.
This limit is not standardized. It applies to query strings, but not request bodies.
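The request body uses the same encoding as a query string, so http_build_query() can produce it. The following is a rough sketch only: buildPostRequest() is a hypothetical helper, and the Host and Content-Length headers are additions that a real HTTP/1.1 request would normally carry but that the slide does not show.

```php
<?php
/**
 * Assemble a raw form-encoded POST request.
 * buildPostRequest() is a hypothetical helper for illustration only.
 */
function buildPostRequest(string $host, string $uri, array $fields): string
{
    // The body is encoded exactly like a query string.
    $body = http_build_query($fields);

    $head = [
        "POST $uri HTTP/1.1",
        "Host: $host",
        'Content-Type: application/x-www-form-urlencoded',
        'Content-Length: ' . strlen($body),
    ];

    // A blank line separates the request headers from the body.
    return implode("\r\n", $head) . "\r\n\r\n" . $body;
}

$request = buildPostRequest('en.wikipedia.org', '/w/index.php?title=Wikipedia:Sandbox', [
    'wpStarttime' => '20080719022313',
    'wpEdittime'  => '20080719022100',
]);
echo $request;
```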
37. HEAD Request
Same as GET with two exceptions:
1. The method in the request line is HEAD instead of GET.
2. No response body is returned, only headers.
HEAD /wiki/Main_Page HTTP/1.1
Host: en.wikipedia.org
HTTP/1.1 200 OK
Header: Value
Sometimes headers are all you want.
38. Responses
HTTP/1.0 200 OK
Server: Apache
X-Powered-By: PHP/5.2.5
...
[body]
Status line: HTTP/1.0 200 OK
HTTP/1.0: lowest protocol version required to process the response
200: response status code
OK: response status description
Same header format as requests, but different headers are used (see RFC 2616 Section 14).
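As a rough illustration of this structure, the sketch below splits a raw response into its status line, headers, and body. parseResponse() is a hypothetical helper, and the sample response reuses the values from this slide with a placeholder body. It is deliberately simplified: repeated headers such as Set-Cookie would overwrite each other.

```php
<?php
/**
 * Split a raw HTTP response into its parts.
 * parseResponse() is a hypothetical, simplified helper.
 */
function parseResponse(string $raw): array
{
    // A blank line separates the headers from the body.
    [$head, $body] = array_pad(explode("\r\n\r\n", $raw, 2), 2, '');
    $lines = explode("\r\n", $head);

    // Status line: protocol version, status code, status description.
    $statusLine = array_shift($lines);
    [$version, $code, $description] = array_pad(explode(' ', $statusLine, 3), 3, '');

    // Remaining lines are "Name: value" headers.
    $headers = [];
    foreach ($lines as $line) {
        [$name, $value] = array_pad(explode(':', $line, 2), 2, '');
        $headers[trim($name)] = trim($value);
    }

    return [
        'version'     => $version,
        'code'        => (int) $code,
        'description' => $description,
        'headers'     => $headers,
        'body'        => $body,
    ];
}

$raw = "HTTP/1.0 200 OK\r\nServer: Apache\r\nX-Powered-By: PHP/5.2.5\r\n\r\n<html>...</html>";
$response = parseResponse($raw);
print_r($response);
```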
39. Response Status Codes
1xx Informational
Request received, continuing process.
2xx Success
Request received, understood, and accepted.
3xx Redirection
Client must take additional action to complete the request.
4xx Client Error
Request is malformed or could not be fulfilled.
5xx Server Error
Request was valid, but the server failed to process it.
See RFC 2616 Section 10 for more info.
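Because the class is determined by the first digit, a scraper can branch on it without knowing every individual code. A minimal sketch, with statusClass() as a hypothetical helper name:

```php
<?php
/**
 * Map a status code to its class name based on the first digit.
 * statusClass() is a hypothetical helper for illustration only.
 */
function statusClass(int $code): string
{
    $classes = [
        1 => 'Informational',
        2 => 'Success',
        3 => 'Redirection',
        4 => 'Client Error',
        5 => 'Server Error',
    ];
    // Integer division by 100 yields the first digit of a 3-digit code.
    return $classes[intdiv($code, 100)] ?? 'Unknown';
}

foreach ([200, 304, 404, 503] as $code) {
    echo $code, ': ', statusClass($code), "\n";
}
```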
40. Headers
Set-Cookie, Cookie: see RFC 2109 or RFC 2965 for more info.
Location: watch out for infinite loops!
Last-Modified + If-Modified-Since OR ETag + If-None-Match: 304 Not Modified
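The caching headers pair up: a validator from an earlier response is echoed back in the next request, and the server answers 304 Not Modified if nothing changed. A small sketch of that pairing follows; conditionalHeaders() is a hypothetical helper, and the sample ETag and date are made-up values.

```php
<?php
/**
 * Derive conditional request headers from a previously cached response.
 * conditionalHeaders() is a hypothetical helper for illustration only.
 */
function conditionalHeaders(array $cachedResponseHeaders): array
{
    $requestHeaders = [];
    // ETag from the response is sent back as If-None-Match.
    if (isset($cachedResponseHeaders['ETag'])) {
        $requestHeaders['If-None-Match'] = $cachedResponseHeaders['ETag'];
    }
    // Last-Modified from the response is sent back as If-Modified-Since.
    if (isset($cachedResponseHeaders['Last-Modified'])) {
        $requestHeaders['If-Modified-Since'] = $cachedResponseHeaders['Last-Modified'];
    }
    return $requestHeaders;
}

// Made-up sample values standing in for a stored response.
$cached = [
    'ETag'          => '"abc123"',
    'Last-Modified' => 'Sat, 19 Jul 2008 02:21:00 GMT',
];
$headers = conditionalHeaders($cached);
print_r($headers);
```

A 304 response carries no body, so the scraper reuses its stored copy instead of downloading the resource again.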
41. More Headers
WWW-Authenticate, Authorization: see RFC 2617 for more info.
200 OK / 403 Forbidden
User-Agent: some servers perform user agent sniffing.
User-Agent: some clients perform user agent spoofing.
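For a scraper, both headers end up as plain strings in the request. The sketch below assumes the Basic scheme from RFC 2617; basicAuthHeader(), the credentials, and the user agent string are all hypothetical values chosen for illustration.

```php
<?php
/**
 * Build an Authorization header for the Basic scheme (RFC 2617):
 * the base64 encoding of "username:password".
 * basicAuthHeader() is a hypothetical helper for illustration only.
 */
function basicAuthHeader(string $username, string $password): string
{
    return 'Authorization: Basic ' . base64_encode("$username:$password");
}

$requestHeaders = [
    basicAuthHeader('user', 'secret'),
    // A client may present whatever User-Agent value it likes.
    'User-Agent: Mozilla/5.0 (compatible; ExampleScraper/0.1)',
];

echo implode("\r\n", $requestHeaders), "\r\n";
```

Basic credentials are only encoded, not encrypted, so they should travel over HTTPS.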
46. Questions?
No heckling... OK, maybe just a little.
I generally blog about my experiences with web scraping
and PHP at http://ishouldbecoding.com.
</shameless_plug>
Thanks for coming!