Web Scraping with PHP

Matthew Turland
Senior Engineer at Synacor, Inc.
php|tek 2009 Unconference
May 21, 2009
What Is It?
Normal Web Browsing
Difference #1: Immediate Audience
Difference #2: Consumption Method
Why Is It Useful?
Data Without Web Services
Integration Testing
Crawlers
"With plain text, we give ourselves the ability to manipulate knowledge,
both manually and programmatically, using virtually every tool at our
disposal."
(3.14 The Power of Plain Text, The Pragmatic Programmer)
Disadvantages
Potential Lack of Stability
Reverse Engineering Required
More Requests
No Nice Neat Data Package
Step #1: Retrieval
Speaking the Language
The Web We Weave

Request:
GET / HTTP/1.1
User-Agent: ...

Response:
HTTP/1.1 200 OK
Content-Type: ...
Browsing → Requests

<a href="/index.php?foo=bar">Index</a>

   GET /index.php?foo=bar HTTP/1.1

<form method="post" action="/index.php">
  <input name="foo" value="bar" />
</form>

       POST /index.php HTTP/1.1

       foo=bar
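
As a rough sketch of how the form above becomes the POST request shown, the following builds the request text by hand. The function name buildFormPost and the host www.example.com are illustrative assumptions, not part of the original slides.

```php
<?php
// Hypothetical helper: serialize form fields the way a browser would
// for an application/x-www-form-urlencoded POST and wrap them in a
// minimal HTTP/1.1 request.
function buildFormPost($host, $path, array $fields)
{
    // Encodes array('foo' => 'bar') as the string foo=bar
    $body = http_build_query($fields);

    return "POST $path HTTP/1.1\r\n"
        . "Host: $host\r\n"
        . "Content-Type: application/x-www-form-urlencoded\r\n"
        . 'Content-Length: ' . strlen($body) . "\r\n"
        . "\r\n"
        . $body;
}

echo buildFormPost('www.example.com', '/index.php', array('foo' => 'bar')), PHP_EOL;
```

In practice an HTTP client library assembles this text for you; writing it out once makes it easier to see which parts of a page map to which parts of a request.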
Responses → Rendered Elements
<img src="/intl/en_ALL/images/logo.gif" />

GET /intl/en_ALL/images/logo.gif HTTP/1.1
Host: google.com

        HTTP/1.1 200 OK
        Content-Type: image/gif
        Content-Length: 8558
Not As Easy As It Looks
Redirections
Referer [sic]
Cookies
User Agent Sniffing
robots.txt
Caching
HTTP Authentication
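
Several of the items above can be handled through HTTP stream context options. The following is a minimal sketch, not a complete solution: the helper name buildScraperContext, the user agent string, the referer URL, and the cookie values are all made up for illustration.

```php
<?php
// Hypothetical helper: build an HTTP stream context that sets a user
// agent, a Referer header, and cookies, and that follows redirects.
// Every value passed in below is a placeholder.
function buildScraperContext($userAgent, $referer, array $cookies)
{
    $headers = array('Referer: ' . $referer);

    if ($cookies) {
        // Cookie header format: name=value pairs separated by "; "
        $pairs = array();
        foreach ($cookies as $name => $value) {
            $pairs[] = $name . '=' . $value;
        }
        $headers[] = 'Cookie: ' . implode('; ', $pairs);
    }

    return stream_context_create(array(
        'http' => array(
            'method'          => 'GET',
            'user_agent'      => $userAgent,
            'header'          => implode("\r\n", $headers),
            'follow_location' => 1,  // follow Location redirects
            'max_redirects'   => 5,  // but stop after a few hops
            'timeout'         => 10, // seconds
        ),
    ));
}

$context = buildScraperContext(
    'ExampleScraper/0.1',
    'http://www.example.com/',
    array('session' => 'abc123')
);

// Inspect the options; the context itself would be passed as the
// third argument to file_get_contents().
print_r(stream_context_get_options($context));
```

This sketch does not cover robots.txt, caching, or HTTP authentication, and it does not store cookies returned by the server between requests; those would need additional code or a client library.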
PHP: Glue for the Web
HTTP Client Libraries


Streams, cURL
PEAR::HTTP_Client
pecl_http
Zend_Http_Client
Simple Streams Example
$uri = 'http://www.example.com/some/resource';
$get = file_get_contents($uri);
$context = stream_context_create(
  array(
    'http' => array(
      'method' => 'POST',
      'header' => 'Content-Type: ' .
         'application/x-www-form-urlencoded',
      'content' => http_build_query(array(
         'var1' => 'value1',
         'var2' => 'value2'
      ))
    )
  )
);
$post = file_get_contents($uri, false, $context);
pecl_http Example

$http = new HttpRequest($uri);
$http->enableCookies();
$http->setMethod(HTTP_METH_POST);
$http->addPostFields(array('var1' => 'value1'));
$http->setOptions(array(
  'useragent' => 'PHP ' . phpversion(),
  'referer' => 'http://example.com/some/referer'
));
$response = $http->send();
$headers = $response->getHeaders();
$body = $response->getBody();
pecl_http Request Pooling

$pool = new HttpRequestPool;
foreach ($urls as $url) {
  $request = new HttpRequest($url, HTTP_METH_GET);
  $pool->attach($request);
}
$pool->send();
foreach ($pool as $request) {
  echo $request->getUrl(), PHP_EOL;
  echo $request->getResponseBody(), PHP_EOL;
}
HTTP Resources

➔ RFC 2616 HyperText Transfer Protocol
➔ RFC 3986 Uniform Resource Identifiers
➔ "HTTP: The Definitive Guide" (ISBN 1565925092)
➔ "HTTP Pocket Reference: HyperText Transfer Protocol" (ISBN 1565928628)
➔ "HTTP Developer's Handbook" (ISBN 0672324547) by Chris Shiflett
➔ Ben Ramsey's blog series on HTTP
Step #2: Analysis
Tidy Extension
$config = array('output-xhtml' => true);
// Parse markup from a string...
$tidy = tidy_parse_string($markupString, $config);
// ...or from a file
$tidy = tidy_parse_file($markupFilePath, $config);
$output = tidy_get_output($tidy);
DOM Extension
$doc = new DOMDocument;
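// Load markup from a string...
$doc->loadHTML($htmlString);
// ...or from a file
$doc->loadHTMLFile($htmlFilePath);
// Select elements by tag name...
$listItems = $doc->getElementsByTagName('li');
// ...or with an XPath expression
$xpath = new DOMXPath($doc);
$listItems = $xpath->query('//ul/li');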
foreach ($listItems as $listItem) {
    echo $listItem->nodeValue, PHP_EOL;
}
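
Scraped markup is rarely well-formed, so it can help to see the DOM calls above run against a concrete, slightly broken fragment. This is a minimal sketch assuming the DOM extension is installed; the HTML string and the "langs" id are invented for the example.

```php
<?php
// Deliberately sloppy markup (unclosed <li> elements) of the kind
// scraped pages often contain.
$html = '<html><body><ul id="langs"><li>PHP<li>Perl</li><li>Python</ul></body></html>';

// Collect libxml parse warnings internally instead of emitting them
// as PHP warnings, then discard them once parsing is done.
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();

// Select only the list items inside the list with the expected id.
$xpath = new DOMXPath($doc);
$items = array();
foreach ($xpath->query('//ul[@id="langs"]/li') as $li) {
    $items[] = trim($li->nodeValue);
}

print_r($items);
```

Anchoring the XPath expression on an id or class, rather than on position alone, tends to survive small layout changes on the target page somewhat better.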
SimpleXML Extension
$sxe = new SimpleXMLElement($markupString);
$sxe = new SimpleXMLElement($filePath, null, true);
echo $sxe->body->ul->li[0], PHP_EOL;
$children = $sxe->body->ul->li;
$children = $sxe->body->ul->children();
foreach ($children as $li) {
  echo $li, PHP_EOL;
}
echo $sxe->body->ul['id'];
$attributes = $sxe->body->ul->attributes();
foreach ($attributes as $name => $value) {
  echo $name, '=', $value, PHP_EOL;
}
XMLReader Extension

$doc = XMLReader::xml($xmlString);
$doc = XMLReader::open($filePath);
while ($doc->read()) {
  if ($doc->nodeType == XMLReader::ELEMENT) {
    var_dump($doc->localName);
    var_dump($doc->hasValue);
    var_dump($doc->value);
    var_dump($doc->hasAttributes);
    var_dump($doc->getAttribute('id'));
  }
}
CSS Selector Libraries
 ➔ phpQuery
 ➔ Simple HTML DOM Parser
 ➔ Zend_Dom_Query


$doc1 = phpQuery::newDocumentFile($markupFilePath);
$doc2 = phpQuery::newDocument($markupString);
$listItems = pq('ul > li'); // uses $doc2
$listItems = pq('ul > li', $doc1);
PCRE Extension
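
The slides give no code for this one, so here is a small sketch of extracting links with preg_match_all(). The markup and the pattern are assumptions made for illustration; regular expressions tend to be brittle against arbitrary HTML and are usually best kept for small, predictable fragments.

```php
<?php
$html = '<p><a href="/a.html">First</a> and <a href="/b.html">Second</a></p>';

// Capture the href value and the link text of each anchor.
// The i modifier makes the match case-insensitive; [^"]* and [^<]*
// keep each capture from running past its closing delimiter.
preg_match_all(
    '#<a\s+href="([^"]*)"[^>]*>([^<]*)</a>#i',
    $html,
    $matches,
    PREG_SET_ORDER
);

// Map each URL to its link text.
$links = array();
foreach ($matches as $match) {
    $links[$match[1]] = $match[2];
}

print_r($links);
```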
Best Practices
Approximate Human Behavior
Minimize Requests
Batch Jobs, Non-Peak Hours
Account for Unavailability
Aim for Parallelism
Validate Data
Test, Test, Test!
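
One way to account for unavailability is to retry failed requests with an increasing delay. The sketch below is an assumption about how that might look, not something from the slides: fetchWithRetry, its parameters, and the stand-in fetcher are hypothetical, and closures require PHP 5.3 or later.

```php
<?php
// Hypothetical helper: call $fetch until it returns something other
// than false, waiting longer after each failure (exponential backoff).
// $fetch is any callable taking a URL; $sleep is injectable so the
// delay can be replaced in tests.
function fetchWithRetry($fetch, $url, $maxAttempts = 3, $baseDelay = 1, $sleep = 'sleep')
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = call_user_func($fetch, $url);
        if ($result !== false) {
            return $result;
        }
        if ($attempt < $maxAttempts) {
            // Wait 1x, 2x, 4x ... the base delay, in seconds
            call_user_func($sleep, $baseDelay * pow(2, $attempt - 1));
        }
    }
    return false;
}

// Stand-in fetcher that fails twice before succeeding.
$calls = 0;
$flaky = function ($url) use (&$calls) {
    $calls++;
    return $calls < 3 ? false : "body of $url";
};

// Record the requested delays instead of actually sleeping.
$delays = array();
$noSleep = function ($seconds) use (&$delays) {
    $delays[] = $seconds;
};

$body = fetchWithRetry($flaky, 'http://www.example.com/', 5, 1, $noSleep);
echo $body, PHP_EOL;
```

In real use, $fetch could wrap file_get_contents(), which returns false on failure, and the default sleep() would supply the pauses between attempts.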
Questions
Please leave a comment!
http://joind.in/event/view/41

And ping me online!
Matthew Turland
Senior Consultant, Blue Parabola LLC
matthew@blueparabola.com
http://blueparabola.com
matt@ishouldbecoding.com
http://ishouldbecoding.com
@elazar