Web Scraping with PHP
10. With plain text, we give ourselves the
ability to manipulate knowledge, both
manually and programmatically, using
virtually every tool at our disposal.
3.14 The Power of Plain Text,
The Pragmatic Programmer
18. The Web We Weave
Request:
GET / HTTP/1.1
User-Agent: ...
Response:
HTTP/1.1 200 OK
Content-Type: ...
19. Browsing → Requests
<a href="/index.php?foo=bar">Index</a>
GET /index.php?foo=bar HTTP/1.1
<form method="post" action="/index.php">
    <input name="foo" value="bar" />
</form>
POST /index.php HTTP/1.1
foo=bar
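The mapping above from markup to request lines can be reproduced in a few lines of PHP. This is a minimal sketch, not code from the slides: `buildRequestLines()` is a hypothetical helper, and the `Host` header that HTTP/1.1 requires is left out to mirror the slide. It uses the built-in `http_build_query()` to encode the fields.

```php
<?php
// Hypothetical helper (not from the slides): list the request lines a browser
// would send for a link (GET) or a form submission (POST).
function buildRequestLines($method, $path, array $fields)
{
    $encoded = http_build_query($fields); // array('foo' => 'bar') becomes foo=bar

    if ($method === 'GET') {
        // GET carries the fields in the query string of the request line.
        return array('GET ' . $path . '?' . $encoded . ' HTTP/1.1');
    }

    // POST carries the fields in the body, after the headers and a blank line.
    return array(
        'POST ' . $path . ' HTTP/1.1',
        'Content-Type: application/x-www-form-urlencoded',
        'Content-Length: ' . strlen($encoded),
        '',
        $encoded,
    );
}

$get  = buildRequestLines('GET', '/index.php', array('foo' => 'bar'));
$post = buildRequestLines('POST', '/index.php', array('foo' => 'bar'));

echo implode("\n", $get), "\n\n", implode("\n", $post), "\n";
```

The same field array produces both forms; only where the encoded string travels differs.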
20. Responses → Rendered Elements
<img src="/intl/en_ALL/images/logo.gif" />
GET /intl/en_ALL/images/logo.gif HTTP/1.1
Host: google.com
HTTP/1.1 200 OK
Content-Type: image/gif
Content-Length: 8558
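A scraper that wants the same resources a browser would render has to find them in the markup and issue the follow-up requests itself. The sketch below uses the markup and host from the slide; the variable names are illustrative, and it assumes every `src` is a root-relative path (resolving other URL forms is out of scope here).

```php
<?php
// Markup and host are taken from the slide; everything else is illustrative.
$html = '<html><body><img src="/intl/en_ALL/images/logo.gif" /></body></html>';
$host = 'google.com';

$doc = new DOMDocument;
$doc->loadHTML($html);

$requests = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    // Assumption: src is a root-relative path, as in the slide.
    $requests[] = "GET $src HTTP/1.1\r\nHost: $host\r\n\r\n";
}

echo $requests[0];
```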
31. Simple Streams Example
$uri = 'http://www.example.com/some/resource';
$get = file_get_contents($uri);
$context = stream_context_create(
    array(
        'http' => array(
            'method' => 'POST',
            'header' => 'Content-Type: ' .
                'application/x-www-form-urlencoded',
            'content' => http_build_query(array(
                'var1' => 'value1',
                'var2' => 'value2'
            ))
        )
    )
);
$post = file_get_contents($uri, false, $context);
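For scraping, two further HTTP context options are often useful: `timeout` and `ignore_errors`. The following is a sketch under assumptions: `buildPostContext()` is a hypothetical wrapper, and the 10-second default is an arbitrary choice. It inspects the context with `stream_context_get_options()` instead of making a network call.

```php
<?php
// Hypothetical helper (not from the slides): build an HTTP POST stream context.
function buildPostContext(array $fields, $timeout = 10)
{
    return stream_context_create(array(
        'http' => array(
            'method'        => 'POST',
            'header'        => 'Content-Type: application/x-www-form-urlencoded',
            'content'       => http_build_query($fields),
            'timeout'       => $timeout, // read timeout in seconds
            'ignore_errors' => true,     // fetch the body even on 4xx/5xx responses
        ),
    ));
}

$context = buildPostContext(array('var1' => 'value1', 'var2' => 'value2'));

// Inspect what would be sent, without touching the network.
$options = stream_context_get_options($context);
echo $options['http']['method'], ' body: ', $options['http']['content'], PHP_EOL;
```

The context is passed as the third argument to `file_get_contents()`, as in the slide. After that call, PHP populates `$http_response_header` in the calling scope with the response header lines.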
32. pecl_http Example
$http = new HttpRequest($uri);
$http->enableCookies();
$http->setMethod(HTTP_METH_POST);
$http->addPostFields(array('var1' => 'value1'));
$http->setOptions(array(
    'useragent' => 'PHP ' . phpversion(),
    'referer' => 'http://example.com/some/referer'
));
$response = $http->send();
$headers = $response->getHeaders();
$body = $response->getBody();
33. pecl_http Request Pooling
$pool = new HttpRequestPool;
foreach ($urls as $url) {
    $request = new HttpRequest($url, HTTP_METH_GET);
    $pool->attach($request);
}
$pool->send();
foreach ($pool as $request) {
    echo $request->getUrl(), PHP_EOL;
    echo $request->getResponseBody(), PHP_EOL;
}
34. HTTP Resources
➔ RFC 2616 HyperText Transfer Protocol
➔ RFC 3986 Uniform Resource Identifiers
➔ "HTTP: The Definitive Guide" (ISBN 1565925092)
➔ "HTTP Pocket Reference: HyperText Transfer Protocol" (ISBN 1565928628)
➔ "HTTP Developer's Handbook" (ISBN 0672324547) by Chris Shiflett
➔ Ben Ramsey's blog series on HTTP
36. Tidy Extension
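$config = array('output-xhtml' => true);
// Parse from a string...
$tidy = tidy_parse_string($markupString, $config);
// ...or from a file
$tidy = tidy_parse_file($markupFilePath, $config);
// Clean and repair the parsed markup, then fetch it as a string
tidy_clean_repair($tidy);
$output = tidy_get_output($tidy);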
37. DOM Extension
$doc = new DOMDocument;
// Load from a string...
$doc->loadHTML($htmlString);
// ...or from a file
$doc->loadHTMLFile($htmlFilePath);
// Select by tag name...
$listItems = $doc->getElementsByTagName('li');
// ...or with an XPath expression
$xpath = new DOMXPath($doc);
$listItems = $xpath->query('//ul/li');
foreach ($listItems as $listItem) {
    echo $listItem->nodeValue, PHP_EOL;
}
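A self-contained version of the DOM slide, as a sketch: the sample markup, the `nav` id, and the variable names are made up for illustration. Real pages are rarely clean, so the example also calls `libxml_use_internal_errors()` to collect parser warnings internally instead of emitting them as PHP warnings.

```php
<?php
// Made-up sample markup for illustration.
$html = '<html><body><ul id="nav">'
      . '<li><a href="/one">One</a></li>'
      . '<li><a href="/two">Two</a></li>'
      . '</ul></body></html>';

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();

// Map each link target to its text for anchors inside the chosen list.
$xpath = new DOMXPath($doc);
$links = array();
foreach ($xpath->query('//ul[@id="nav"]/li/a') as $anchor) {
    $links[$anchor->getAttribute('href')] = trim($anchor->nodeValue);
}

print_r($links);
```

Compared with `getElementsByTagName()`, the XPath query narrows the match to one list by its `id` in a single expression.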
38. SimpleXML Extension
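// Load from a string...
$sxe = new SimpleXMLElement($markupString);
// ...or from a file (third argument true: first argument is a path or URL)
$sxe = new SimpleXMLElement($filePath, 0, true);
// Element access
echo $sxe->body->ul->li[0], PHP_EOL;
// Only the <li> children of the <ul>, or all of its child elements
$children = $sxe->body->ul->li;
$children = $sxe->body->ul->children();
foreach ($children as $li) {
    echo $li, PHP_EOL;
}
// Attribute access
echo $sxe->body->ul['id'];
$attributes = $sxe->body->ul->attributes();
foreach ($attributes as $name => $value) {
    echo $name, '=', $value, PHP_EOL;
}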
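SimpleXML only accepts well-formed XML, so scraped markup usually needs to go through Tidy or DOM first (`simplexml_import_dom()` converts a DOM node). The sketch below uses a made-up, already well-formed document. It shows a detail the slide leaves implicit: elements and attributes are objects, and a string cast is needed to store their values.

```php
<?php
// Made-up, well-formed sample document.
$markup = '<html><body><ul id="menu">'
        . '<li>First</li><li>Second</li>'
        . '</ul></body></html>';

$sxe = new SimpleXMLElement($markup);

// Elements and attributes are SimpleXMLElement objects; cast to get strings.
$id    = (string) $sxe->body->ul['id'];
$items = array();
foreach ($sxe->xpath('//ul/li') as $li) {
    $items[] = (string) $li;
}

echo $id, ': ', implode(', ', $items), PHP_EOL;
```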
39. XMLReader Extension
// Read from a string...
$doc = XMLReader::xml($xmlString);
// ...or from a file
$doc = XMLReader::open($filePath);
// Step through the document one node at a time
while ($doc->read()) {
    if ($doc->nodeType == XMLReader::ELEMENT) {
        var_dump($doc->localName);
        var_dump($doc->hasValue);
        var_dump($doc->value);
        var_dump($doc->hasAttributes);
        var_dump($doc->getAttribute('id'));
    }
}
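In the loop above, `hasValue` is false for element nodes because an element's text lives in its child text nodes. A minimal sketch of one way to get at that text with `readString()`, using a made-up document; it calls `XML()` on an instance, which works whether or not the PHP version declares that method static.

```php
<?php
// Made-up sample document.
$xml = '<items><item id="a">Alpha</item><item id="b">Beta</item></items>';

$reader = new XMLReader;
$reader->XML($xml);

// Stream through the nodes, keeping only <item> elements.
$found = array();
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->localName == 'item') {
        // readString() returns the text content of the current element.
        $found[$reader->getAttribute('id')] = $reader->readString();
    }
}
$reader->close();

print_r($found);
```

Because XMLReader is a forward-only pull parser, it never holds the whole document in memory, which makes it a reasonable fit for large feeds.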
40. CSS Selector Libraries
➔ phpQuery
➔ Simple HTML DOM Parser
➔ Zend_Dom_Query
$doc1 = phpQuery::newDocumentFile($markupFilePath);
$doc2 = phpQuery::newDocument($markupString);
$listItems = pq('ul > li'); // uses $doc2
$listItems = pq('ul > li', $doc1);
52. And ping me online!
Matthew Turland
Senior Consultant, Blue Parabola LLC
matthew@blueparabola.com
http://blueparabola.com
matt@ishouldbecoding.com
http://ishouldbecoding.com
@elazar