Web Scraping 101
with F.O.S. Goutte
By Joshua Copeland
About Me
● CTO of Engaged Nation
● PHP Developer for 6+ years
● Java, .NET, and C/C++ experience
● Serial Entrepreneur
● Prior Real Estate Agent
● ♥’s Family, Tech, & Skating
● Self Proclaimed Computer
Josh Copeland
@OGProgrammer
What is Web Scraping?
Web scraping is the process of automatically collecting information from the web.
Doing it fully automatically still requires breakthroughs in text processing, semantic
understanding, artificial intelligence, and human-computer interaction.
Current web scraping solutions range from the ad-hoc, requiring human effort, to
fully automated systems that are able to convert entire sites into structured
information, with limitations.
Traditional Methods
In 2009 there was no “all-in-one” library with
both an HTTP client and an HTML parser. These
were your choices in PHP back then:
➔ Tidy Extension
Wasn’t designed for extraction, only
HTML error fixing.
➔ DOM
➔ SimpleXML
➔ XMLReader
➔ CSS Selectors
These work fine for HTML parsing, but none of
them is a crawler.
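To give a feel for the pre-Goutte approach, here is a minimal sketch that parses a page with PHP's built-in DOM extension and queries it with XPath. The sample markup and the `extractLinks` helper are made up for illustration; you would still need a separate HTTP client to download the page.

```php
<?php
// Hypothetical helper: return [link text => href] for every <a> directly inside an <h2>.
function extractLinks($html)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = array();
    // XPath equivalent of the CSS selector "h2 > a".
    foreach ($xpath->query('//h2/a') as $a) {
        $links[trim($a->textContent)] = $a->getAttribute('href');
    }
    return $links;
}

// Made-up sample markup standing in for a downloaded page.
$html = '<html><body><h2><a href="/post-1">First post</a></h2>'
      . '<h2><a href="/post-2">Second post</a></h2>'
      . '<p><a href="/about">About</a></p></body></html>';

foreach (extractLinks($html) as $text => $href) {
    print $text . ' => ' . $href . "\n";
}
```

Note how much plumbing is needed just to parse one page, with no way to follow links or submit forms.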
Introducing Goutte!
A simple PHP Web Scraper.
Did you know?
Goutte was built by
Fabien Potencier who
also built the Symfony
Framework.
FriendsOfSymfony is the
group that maintains this
package and others in
the Symfony world.
Examples from this presentation available at
https://github.com/php-vegas/web-scraper-examples
What does Goutte use?
● Symfony Components
a. BrowserKit
b. CssSelector
c. DomCrawler
● Guzzle HTTP Component.
Did you know?
Fabien Potencier also
built these Symfony
components.
You should check out
his GitHub profile, where
his username is “fabpot”.
He’s kind of a big deal.
What does Goutte do?
● Uses Guzzle (cURL, streams, sockets, or event loops)
○ GET/POST Requests
● Fine-tune cURL settings
● Follow links - Crawl the site
● Extract data
○ XPath, CssSelector
● Submit forms
○ Login!
What Goutte doesn’t do
● Does not interpret the response in any way.
○ Will not execute JavaScript
■ Which means no AJAX
● Could simulate the AJAX request
■ Try Google cached versions of the site
■ Use PhantomJS, Spiderling, CasperJS, Selenium
● Can’t render or screenshot the page
○ Could save the HTML & assets
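Simulating the AJAX request usually means calling the endpoint the page's JavaScript would call and decoding the JSON yourself. The sketch below assumes a made-up payload and a hypothetical `productNames` helper; the real URL and response shape have to be found in your browser's network tab.

```php
<?php
// Hypothetical helper: decode a JSON response body and pull one field out of each item.
function productNames($jsonBody)
{
    $data = json_decode($jsonBody, true);
    if (!is_array($data) || !isset($data['products'])) {
        throw new InvalidArgumentException('Unexpected response shape');
    }
    return array_column($data['products'], 'name');
}

// Headers many JavaScript libraries send with AJAX calls; some endpoints check for them.
// Pass them through whatever HTTP client you use.
$ajaxHeaders = array(
    'X-Requested-With' => 'XMLHttpRequest',
    'Accept'           => 'application/json',
);

// Made-up payload standing in for what the endpoint might return.
$body = '{"products":[{"name":"Widget"},{"name":"Gadget"}]}';
print implode(', ', productNames($body)) . "\n";
```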
Let’s get started!
What you’ll need
➔ Recommend using Composer
Easiest way to install PHP libraries
➔ Alternatively could use PHAR
Available releases on their GitHub
➔ Version 3
◆ PHP 5.5+
◆ Guzzle 6+
➔ Version 2
◆ PHP 5.4
◆ Guzzle 4-5
➔ Version 1
◆ PHP 5.3
◆ Guzzle 3
Require Goutte in your project
composer require fabpot/goutte
Basic Example
require 'vendor/autoload.php'; // Composer autoloader
use Goutte\Client;
$client = new Client();
// Go to the symfony.com website
$crawler = $client->request('GET', 'http://www.symfony.com/blog/');
// Click on the "Security Advisories" link
$link = $crawler->selectLink('Security Advisories')->link();
$crawler = $client->click($link);
// Get the latest posts in this category and display the titles
$crawler->filter('h2 > a')->each(function ($node) {
    print $node->text() . "\n";
});
Guzzle Settings Example
use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;
// Create the guzzle client with your default options
$guzzle = new GuzzleClient(
array(
// base_uri isn't supported due to BrowserKit, anyone want to make a PR on github for this?
// 'base_uri' => 'https://www.symfony.com',
'timeout' => 0,
'allow_redirects' => false,
'cookies' => true,
// Proxy from proxylist.hidemyass.com
'proxy' => 'tcp://63.150.152.151:3128'
)
);
$client = new Client();
$client->setClient($guzzle);
Check out all the Guzzle options at http://docs.guzzlephp.org/en/latest/request-options.html
Basic HTTP Auth Example
$client = new Client();
// Params are username, password, and auth type (basic or digest)
$client->setAuth('test', 'test', 'basic');
$crawler = $client->request('GET', 'http://browserspy.dk/password-ok.php');
print $client->getResponse()->getStatus();
// 401 = no good, 200 = happy
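For the curious, basic auth is nothing more than a base64-encoded "username:password" pair sent in the Authorization header. A minimal sketch of what goes over the wire, using a hypothetical `basicAuthHeader` helper (this is plain PHP, not Goutte's internals):

```php
<?php
// Hypothetical helper: build the value of the Authorization header for HTTP Basic auth.
function basicAuthHeader($username, $password)
{
    // The credentials are only encoded, not encrypted, so use HTTPS in real life.
    return 'Basic ' . base64_encode($username . ':' . $password);
}

print 'Authorization: ' . basicAuthHeader('test', 'test') . "\n";
```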
Form Login Example
$crawler = $client->request('GET', 'http://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));
$crawler->filter('.flash-error')->each(function ($node) {
print $node->text()."\n";
});
// outputs “You can't perform that action at this time.”
Getting info from the Response
// Get the URI
print 'Request URI : ' . $crawler->getUri() . PHP_EOL;
// Get the Symfony\Component\BrowserKit\Response object
$response = $client->getResponse();
// Get important stuff out of the Response object
$status = $response->getStatus();
$content = $response->getContent();
$headers = $response->getHeaders();
Watch out for...
● DDOSing a site, put a sleep(x) between calls
○ Good way to get your IP banned, use a proxy
● Pulling bad/malformed data
○ Write tests to make sure this doesn’t happen
● Fetching elements by unique IDs, hashes, etc
○ Get creative, find an RSS feed, API, or structured data
● Protections against scraping, like JavaScript or AJAX
○ Some buttons have JS events attached
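The sleep(x) advice can be sketched as a small loop that pauses between requests. The `crawlPolitely` helper and the fake fetcher are made up for illustration; in a real scraper the callback would wrap your client's request call.

```php
<?php
// Hypothetical helper: call $fetch for each URL, pausing between calls.
function crawlPolitely(array $urls, $fetch, $delaySeconds = 1.0)
{
    $urls = array_values($urls);
    $results = array();
    $last = count($urls) - 1;
    foreach ($urls as $i => $url) {
        $results[$url] = call_user_func($fetch, $url);
        if ($i < $last) {
            // usleep() takes microseconds; no pause is needed after the final URL.
            usleep((int) round($delaySeconds * 1000000));
        }
    }
    return $results;
}

// Stand-in fetcher so the sketch runs without network access.
$fakeFetch = function ($url) {
    return 'body of ' . $url;
};

$pages = crawlPolitely(
    array('http://example.com/a', 'http://example.com/b'),
    $fakeFetch,
    0.2
);
print count($pages) . " pages fetched\n";
```

A fixed delay is the simplest choice; a randomized delay or honoring the site's robots.txt would be gentler still.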
More examples in GitHub
This presentation + Dell product info scraper
https://github.com/php-vegas/web-scraper-examples
CSRF Scanner
https://github.com/marlon-be/marlon-csrfscanner
There are extensions for WP, Laravel, Mink, & others.
Just search packagist.org for “goutte”.
In theory you could use PHP's v8js extension
(V8 engine) to execute JavaScript and plug it
into Guzzle as a handler, aka middleware.
This would be awesome, but existing projects
outside PHP already do this. PHP just
doesn't have a ready-made way right now.
What were those projects?
● PhantomJS
● Spiderling
● CasperJS
● Selenium
Questions?
@OGProgrammer
Rate this talk
http://spkr8.com/t/68291


Editor's Notes

  1. HTTP clients were pretty clunky too. Thank you, Amazon, for sponsoring Guzzle!
  2. Allows you to make requests to sites and crawl the site (just as Google does) to “scrape” HTML for data.
  3. There is always manual scraping! Pay someone overseas?