3. What is Web Scraping?
A two-stage process.
Stage 1: Retrieval
GET /some/resource ...
HTTP/1.1 200 OK ... Resource with data you want
Stage 2: Analysis
Raw resource to usable data
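The two stages can be sketched as two small functions. This is only an illustration: the names retrieve() and analyze(), the sample markup, and the choice to extract a page title are assumptions, not something the slides prescribe.

```php
<?php
// Stage 1: retrieval. Fetch the raw resource (an http:// URL or a local path).
function retrieve(string $url): string
{
    $body = file_get_contents($url);
    if ($body === false) {
        throw new RuntimeException("Could not retrieve $url");
    }
    return $body;
}

// Stage 2: analysis. Turn the raw resource into usable data.
// Here the "usable data" is just the page title (an arbitrary choice).
function analyze(string $raw): array
{
    preg_match('#<title>(.*?)</title>#is', $raw, $m);
    return ['title' => isset($m[1]) ? trim($m[1]) : null];
}

// Stand-in for a retrieved document so the sketch runs offline.
$raw = '<html><head><title>Example Page</title></head><body></body></html>';
$data = analyze($raw);
print_r($data);

// With network access the stages chain together, e.g.:
// $data = analyze(retrieve('http://example.com/some/resource'));
```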
4. How is it different from...
Data Mining: focus in data mining vs. focus in web scraping
Consuming Web Services: web service data formats vs. web scraping data formats
5. Potential Applications
What: Data source. When: Web service is unavailable or data access is one-time only.
What: Crawlers and indexers. When: Remote data source offers no capabilities for search or data source integration.
What: Integration testing. When: Applications must be tested by simulating client behavior and ensuring responses are consistent with requirements.
11. Know enough HTTP to... PEAR::HTTP_Client pecl_http Zend_Http_Client Learn to use and troubleshoot one like this: Or roll your own! cURL Filesystem + Streams
13. Warning about GET In principle: "Let's do this by the book." GET In reality: "'Safe operation'? Whatever." GET
14. URI vs URL 1. Uniquely identifies a resource 2. Indicates how to locate a resource 3. Does both and is thus human-usable. URI URL More info in RFC 3986 Sections 1.1.3 and 1.2.2
15. Query Strings
URL: http://en.wikipedia.org/w/index.php?title=Query_string&action=edit
Query string: title=Query_string&action=edit
A question mark separates the resource address from the query string.
Equal signs separate parameter names from their respective values.
Ampersands separate parameter name-value pairs.
Example: parameter "title" has the value "Query_string".
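As a rough sketch of those separators in practice, PHP's parse_url() and parse_str() can take the slide's URL apart. The variable names are arbitrary.

```php
<?php
// URL taken from the slide.
$url = 'http://en.wikipedia.org/w/index.php?title=Query_string&action=edit';

// parse_url() splits the resource address from the query string at the "?".
$query = parse_url($url, PHP_URL_QUERY);

// parse_str() splits pairs on "&" and names from values on "=".
parse_str($query, $params);

echo $query, PHP_EOL;
print_r($params);
```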
16. URL Encoding
Parameter: first. Value: this is a field
Parameter: second. Value: was it clear enough (already)?
Query string: first=this+is+a+field&second=was+it+clear+enough+%28already%29%3F
Also called percent encoding.
Handy PHP URL functions: urlencode and urldecode; $_SERVER['QUERY_STRING'] / http_build_query($_GET)
More info on URL encoding in RFC 3986 Section 2.1
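A minimal sketch of the encoding step, using the two parameter values from the slide. The explicit '&' separator is passed only so the result does not depend on the arg_separator.output ini setting.

```php
<?php
// Parameter names and values from the slide.
$fields = [
    'first'  => 'this is a field',
    'second' => 'was it clear enough (already)?',
];

// http_build_query() URL-encodes each value and joins the pairs.
$query = http_build_query($fields, '', '&');
echo $query, PHP_EOL;

// urlencode() and urldecode() do the same job for a single value.
$encoded = urlencode($fields['second']);
$decoded = urldecode($encoded);
echo $encoded, PHP_EOL;
echo $decoded, PHP_EOL;
```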
17. POST Requests
Most Common HTTP Operations: 1. GET 2. POST ...
/w/index.php POST /new/resource -or- /updated/resource
GET /some/resource HTTP/1.1
Header: Value
...
[request body: none]
POST /some/resource HTTP/1.1
Header: Value
[request body]
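One way to build a POST request with PHP's stream context options is sketched below. The form fields and the example.com URL are placeholders, and the actual send is left commented out so the snippet runs without network access.

```php
<?php
// Hypothetical form fields; they become the request body.
$body = http_build_query(['title' => 'Query_string', 'action' => 'edit'], '', '&');

// HTTP context options: method, extra headers, and the request body.
$context = stream_context_create([
    'http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => $body,
    ],
]);

// Inspect what would be sent.
$options = stream_context_get_options($context);
echo $options['http']['method'], ' with body: ', $options['http']['content'], PHP_EOL;

// Sending the request (placeholder URL, requires network access):
// $response = file_get_contents('http://example.com/some/resource', false, $context);
```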
20. Responses
HTTP/1.0 200 OK
Server: Apache
X-Powered-By: PHP/5.2.5
...
[body]
Status line: HTTP/1.0 200 OK
HTTP/1.0: lowest protocol version required to process the response
200: response status code
OK: response status description
Same header format as requests, but different headers are used (see RFC 2616 Section 14)
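To make the response anatomy concrete, here is a small sketch that splits a raw response like the one on the slide into status line, headers, and body. It assumes a well-formed response with CRLF line endings and does no error handling.

```php
<?php
// Raw response modeled on the slide (the elided "..." headers are left out).
$raw = "HTTP/1.0 200 OK\r\n"
     . "Server: Apache\r\n"
     . "X-Powered-By: PHP/5.2.5\r\n"
     . "\r\n"
     . "[body]";

// A blank line separates the headers from the body.
list($head, $body) = explode("\r\n\r\n", $raw, 2);
$lines = explode("\r\n", $head);

// Status line: protocol version, status code, status description.
$statusLine = array_shift($lines);
list($version, $code, $description) = explode(' ', $statusLine, 3);
$code = (int) $code;

// Remaining lines are "Name: value" headers.
$headers = [];
foreach ($lines as $line) {
    list($name, $value) = explode(':', $line, 2);
    $headers[trim($name)] = trim($value);
}

echo "$version | $code | $description", PHP_EOL;
print_r($headers);
```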
22. Headers
Set-Cookie / Cookie: See RFC 2109 or RFC 2965 for more info.
Location: Watch out for infinite loops!
Last-Modified / If-Modified-Since OR ETag / If-None-Match: 304 Not Modified
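The caching headers pair up as a conditional request: send back the validator from an earlier response and treat 304 Not Modified as "use the cached copy". The sketch below builds those request headers. The helper names and the cached values are hypothetical.

```php
<?php
// Hypothetical validators saved from an earlier response.
$cached = [
    'Last-Modified' => 'Sat, 29 Oct 1994 19:43:31 GMT',
    'ETag'          => '"abc123"',
];

// Build conditional request headers from whichever validators are available.
function conditionalHeaders(array $cached): array
{
    $headers = [];
    if (isset($cached['Last-Modified'])) {
        $headers[] = 'If-Modified-Since: ' . $cached['Last-Modified'];
    }
    if (isset($cached['ETag'])) {
        $headers[] = 'If-None-Match: ' . $cached['ETag'];
    }
    return $headers;
}

// A 304 status line means the cached copy is still current.
function isNotModified(string $statusLine): bool
{
    return (bool) preg_match('#^HTTP/\d\.\d 304\b#', $statusLine);
}

print_r(conditionalHeaders($cached));
var_dump(isNotModified('HTTP/1.1 304 Not Modified'));
```

Either validator on its own is enough; sending both is shown here only for completeness.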
23. More Headers
WWW-Authenticate / Authorization: See RFC 2617 for more info.
User-Agent: 200 OK / 403 Forbidden
User-Agent: Some servers perform user agent sniffing. Some clients perform user agent spoofing.
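As a sketch of how a client might set these two request headers with stream context options: the credentials and the user agent string are made up, and only the Basic scheme from RFC 2617 is shown.

```php
<?php
// Hypothetical credentials and client name, for illustration only.
$user = 'user';
$pass = 'pass';

$context = stream_context_create([
    'http' => [
        'method'     => 'GET',
        // Sent as the User-Agent request header.
        'user_agent' => 'MyScraper/1.0',
        // Basic authentication: base64 of "user:password".
        'header'     => 'Authorization: Basic ' . base64_encode("$user:$pass") . "\r\n",
    ],
]);

$options = stream_context_get_options($context);
echo 'User-Agent: ', $options['http']['user_agent'], PHP_EOL;
echo $options['http']['header'];
```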
30. cURL Examples
Fatal error: Allowed memory size of n00b bytes exhausted (tried to allocate 1337 bytes) in /this/slide.php on line 1
Just kidding. Really, the equivalent cURL code for the previous examples is so verbose that it won't fit on one slide, and I don't think it deserves multiple slides.
See the PHP Manual, Context Options, or my php|architect article for more info.
32. Analysis
Raw resource to usable data
Tools: DOM, XMLReader, SimpleXML, XSL, tidy, PCRE, String functions, JSON, ctype, XML Parser
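As one possible illustration of the analysis stage, the DOM extension plus XPath can pull link targets and labels out of retrieved markup. The HTML snippet is invented for the example, and DOM is just one of the listed tools.

```php
<?php
// Invented markup standing in for a retrieved resource.
$html = '<html><body><ul>'
      . '<li><a href="/a">First</a></li>'
      . '<li><a href="/b">Second</a></li>'
      . '</ul></body></html>';

// Real-world HTML is often invalid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every anchor that has an href attribute.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//a[@href]') as $a) {
    $links[$a->getAttribute('href')] = trim($a->textContent);
}

print_r($links);
```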