SlideShare a Scribd company logo
1 of 18
XQuery: Querying the World (formerly known as Web Scraping) Dennis Knochenwefel <dennis.knochenwefel@28msec.com>
Evolution Web Scraping
PHP (2007) $url = "http://www.nfl.com/teams/sandiegochargers/roster?team=SD"; $raw = file_get_contents($url); $newlines = array("","","","2020","","0B"); $content = str_replace($newlines, "", html_entity_decode($raw)); $start = strpos($content,'<table cellpadding="2" class="standard_table"'); $end = strpos($content,'</table>',$start) + 8; $table = substr($content,$start,$end-$start); preg_match_all("|<tr(.*)</tr>|U",$table,$rows); foreach ($rows[0] as $row){     if ((strpos($row,'<th')===false)){         preg_match_all("|<td(.*)</td>|U",$row,$cells);         $number = strip_tags($cells[0][0]);         $name = strip_tags($cells[0][1]);         $position = strip_tags($cells[0][2]);         echo "{$position} - {$name} - Number {$number} <br>";     } } $url = "http://www.nfl.com/teams/sandiegochargers/roster?team=SD"; $raw = file_get_contents($url); $newlines = array("","","","2020","","0B"); $content = str_replace($newlines, "", html_entity_decode($raw)); $start = strpos($content,'<table cellpadding="2" class="standard_table"'); $end = strpos($content,'</table>',$start) + 8; $table = substr($content,$start,$end-$start); preg_match_all("|<tr(.*)</tr>|U",$table,$rows); foreach ($rows[0] as $row){     if ((strpos($row,'<th')===false)){         preg_match_all("|<td(.*)</td>|U",$row,$cells);         $number = strip_tags($cells[0][0]);         $name = strip_tags($cells[0][1]);         $position = strip_tags($cells[0][2]);         echo "{$position} - {$name} - Number {$number} <br>";     } } source: http://www.bradino.com/php/screen-scraping/
PHP (June 2011) $url="http://www.rtu.ac.in/results/reformat.php"; $post="rollnumber=08epccs060&filename=fetchmodulesem_4_btech410m.php&button=Submit"; $ch=curl_init(); curl_setopt($ch,CURLOPT_URL,$url); curl_setopt($ch,CURLOPT_POST,1); curl_setopt($ch,CURLOPT_POSTFIELDS,$post); curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1); curl_setopt($ch,CURLOPT_RETURNTRANSFER,1); $content=curl_exec($ch); curl_close($ch); $totalPath="html/body/table[4]/tbody/tr[3]/td[4]"; $page=new DOMDocument(); $xpath=new DOMXPath($page); $page->loadHTML($content); $page->saveHTML();  // this shows the page contents $total=$xpath->query($totalPath); echo $total->length;    //shows 0 echo $total->item(0)->nodeValue;   //shows nothing $url="http://www.rtu.ac.in/results/reformat.php"; $post="rollnumber=08epccs060&filename=fetchmodulesem_4_btech410m.php&button=Submit"; $ch=curl_init(); curl_setopt($ch,CURLOPT_URL,$url); curl_setopt($ch,CURLOPT_POST,1); curl_setopt($ch,CURLOPT_POSTFIELDS,$post); curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1); curl_setopt($ch,CURLOPT_RETURNTRANSFER,1); $content=curl_exec($ch); curl_close($ch); $totalPath="html/body/table[4]/tbody/tr[3]/td[4]"; $page=new DOMDocument(); $xpath=new DOMXPath($page); $page->loadHTML($content); $page->saveHTML();  // this shows the page contents $total=$xpath->query($totalPath); echo $total->length;    //shows 0 echo $total->item(0)->nodeValue;   //shows nothing ! ! source: http://stackoverflow.com/questions/6283361/unable-to-get-table-data-from-a-html-page
XQuery
Real World Example
awesome site awesome data no API
Deal with sessions
Need to emulate setting options
Different Notions Publisher <=> Consumer
JSON ? XML ? CSV ! HTML ! XLS ! Zip ! App Website
Stateless REST API ? JSON ? XML ? CSV ! HTML ! XLS ! Zip ! Session! App Website
Stateless REST API ? JSON ? XML ? CSV ! HTML ! XLS ! Zip ! Session! App Website Customize with URL Params HTML Forms
Stateless REST API ? JSON ? XML ? CSV ! HTML ! XLS ! Zip ! Session! App Website Customize with URL Params HTML Forms
CSV ! HTML ! XLS ! Zip ! HTML ! Session! Session! App Website XQuery ! HTML Forms HTML Forms
Summary
Session handling Forms ! ! XQuery Web Data Processing A browser can do it?                  XQuery can do it!
Result: http://www.unemployment.by/country

More Related Content

What's hot

C A S Sample Php
C A S Sample PhpC A S Sample Php
C A S Sample Php
JH Lee
 

What's hot (20)

Perl6 operators and metaoperators
Perl6   operators and metaoperatorsPerl6   operators and metaoperators
Perl6 operators and metaoperators
 
C A S Sample Php
C A S Sample PhpC A S Sample Php
C A S Sample Php
 
Security Meetup 22 октября. «Реверс-инжиниринг в Enterprise». Алексей Секрето...
Security Meetup 22 октября. «Реверс-инжиниринг в Enterprise». Алексей Секрето...Security Meetup 22 октября. «Реверс-инжиниринг в Enterprise». Алексей Секрето...
Security Meetup 22 октября. «Реверс-инжиниринг в Enterprise». Алексей Секрето...
 
R57.Php
R57.PhpR57.Php
R57.Php
 
PhoneGap: Local Storage
PhoneGap: Local StoragePhoneGap: Local Storage
PhoneGap: Local Storage
 
Session8
Session8Session8
Session8
 
Nop2
Nop2Nop2
Nop2
 
WordPress Security: Be a Superhero - WordCamp Raleigh - May 2011
WordPress Security: Be a Superhero - WordCamp Raleigh - May 2011WordPress Security: Be a Superhero - WordCamp Raleigh - May 2011
WordPress Security: Be a Superhero - WordCamp Raleigh - May 2011
 
PHPUnit でよりよくテストを書くために
PHPUnit でよりよくテストを書くためにPHPUnit でよりよくテストを書くために
PHPUnit でよりよくテストを書くために
 
Perl Bag of Tricks - Baltimore Perl mongers
Perl Bag of Tricks  -  Baltimore Perl mongersPerl Bag of Tricks  -  Baltimore Perl mongers
Perl Bag of Tricks - Baltimore Perl mongers
 
MySQL Create Table
MySQL Create TableMySQL Create Table
MySQL Create Table
 
Php
PhpPhp
Php
 
The Magic Of Tie
The Magic Of TieThe Magic Of Tie
The Magic Of Tie
 
So cal0365productivitygroup feb2019
So cal0365productivitygroup feb2019So cal0365productivitygroup feb2019
So cal0365productivitygroup feb2019
 
IsTrue(true)?
IsTrue(true)?IsTrue(true)?
IsTrue(true)?
 
Teaching Your Machine To Find Fraudsters
Teaching Your Machine To Find FraudstersTeaching Your Machine To Find Fraudsters
Teaching Your Machine To Find Fraudsters
 
Debugging: Rules And Tools - PHPTek 11 Version
Debugging: Rules And Tools - PHPTek 11 VersionDebugging: Rules And Tools - PHPTek 11 Version
Debugging: Rules And Tools - PHPTek 11 Version
 
How to stand on the shoulders of giants
How to stand on the shoulders of giantsHow to stand on the shoulders of giants
How to stand on the shoulders of giants
 
PHP Tips & Tricks
PHP Tips & TricksPHP Tips & Tricks
PHP Tips & Tricks
 
Coding website
Coding websiteCoding website
Coding website
 

Similar to London XQuery Meetup: Querying the World (Web Scraping)

R57shell
R57shellR57shell
R57shell
ady36
 
PHP and Rich Internet Applications
PHP and Rich Internet ApplicationsPHP and Rich Internet Applications
PHP and Rich Internet Applications
elliando dias
 
Javascript Continues Integration in Jenkins with AngularJS
Javascript Continues Integration in Jenkins with AngularJSJavascript Continues Integration in Jenkins with AngularJS
Javascript Continues Integration in Jenkins with AngularJS
Ladislav Prskavec
 

Similar to London XQuery Meetup: Querying the World (Web Scraping) (20)

Drupal Development (Part 2)
Drupal Development (Part 2)Drupal Development (Part 2)
Drupal Development (Part 2)
 
Web Scraping with PHP
Web Scraping with PHPWeb Scraping with PHP
Web Scraping with PHP
 
Ae internals
Ae internalsAe internals
Ae internals
 
Web Scraping with PHP
Web Scraping with PHPWeb Scraping with PHP
Web Scraping with PHP
 
Introduction to CodeIgniter (RefreshAugusta, 20 May 2009)
Introduction to CodeIgniter (RefreshAugusta, 20 May 2009)Introduction to CodeIgniter (RefreshAugusta, 20 May 2009)
Introduction to CodeIgniter (RefreshAugusta, 20 May 2009)
 
R57shell
R57shellR57shell
R57shell
 
Blog Hacks 2011
Blog Hacks 2011Blog Hacks 2011
Blog Hacks 2011
 
Database api
Database apiDatabase api
Database api
 
PHP POWERPOINT SLIDES
PHP POWERPOINT SLIDESPHP POWERPOINT SLIDES
PHP POWERPOINT SLIDES
 
PHP and Rich Internet Applications
PHP and Rich Internet ApplicationsPHP and Rich Internet Applications
PHP and Rich Internet Applications
 
Daily notes
Daily notesDaily notes
Daily notes
 
8時間耐久CakePHP2 勉強会
8時間耐久CakePHP2 勉強会8時間耐久CakePHP2 勉強会
8時間耐久CakePHP2 勉強会
 
Dropping ACID with MongoDB
Dropping ACID with MongoDBDropping ACID with MongoDB
Dropping ACID with MongoDB
 
Php My Sql
Php My SqlPhp My Sql
Php My Sql
 
Secure Coding With Wordpress (BarCamp Orlando 2009)
Secure Coding With Wordpress (BarCamp Orlando 2009)Secure Coding With Wordpress (BarCamp Orlando 2009)
Secure Coding With Wordpress (BarCamp Orlando 2009)
 
Javascript Continues Integration in Jenkins with AngularJS
Javascript Continues Integration in Jenkins with AngularJSJavascript Continues Integration in Jenkins with AngularJS
Javascript Continues Integration in Jenkins with AngularJS
 
The History of PHPersistence
The History of PHPersistenceThe History of PHPersistence
The History of PHPersistence
 
[PL] Jak nie zostać "programistą" PHP?
[PL] Jak nie zostać "programistą" PHP?[PL] Jak nie zostać "programistą" PHP?
[PL] Jak nie zostać "programistą" PHP?
 
2013 - Benjamin Eberlei - Doctrine 2
2013 - Benjamin Eberlei - Doctrine 22013 - Benjamin Eberlei - Doctrine 2
2013 - Benjamin Eberlei - Doctrine 2
 
Views notwithstanding
Views notwithstandingViews notwithstanding
Views notwithstanding
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

London XQuery Meetup: Querying the World (Web Scraping)

  • 1. XQuery: Querying the World (formerly known as Web Scraping) Dennis Knochenwefel <dennis.knochenwefel@28msec.com>
  • 3. PHP (2007) $url = "http://www.nfl.com/teams/sandiegochargers/roster?team=SD"; $raw = file_get_contents($url); $newlines = array("","","","2020","","0B"); $content = str_replace($newlines, "", html_entity_decode($raw)); $start = strpos($content,'<table cellpadding="2" class="standard_table"'); $end = strpos($content,'</table>',$start) + 8; $table = substr($content,$start,$end-$start); preg_match_all("|<tr(.*)</tr>|U",$table,$rows); foreach ($rows[0] as $row){ if ((strpos($row,'<th')===false)){ preg_match_all("|<td(.*)</td>|U",$row,$cells); $number = strip_tags($cells[0][0]); $name = strip_tags($cells[0][1]); $position = strip_tags($cells[0][2]); echo "{$position} - {$name} - Number {$number} <br>"; } } $url = "http://www.nfl.com/teams/sandiegochargers/roster?team=SD"; $raw = file_get_contents($url); $newlines = array("","","","2020","","0B"); $content = str_replace($newlines, "", html_entity_decode($raw)); $start = strpos($content,'<table cellpadding="2" class="standard_table"'); $end = strpos($content,'</table>',$start) + 8; $table = substr($content,$start,$end-$start); preg_match_all("|<tr(.*)</tr>|U",$table,$rows); foreach ($rows[0] as $row){ if ((strpos($row,'<th')===false)){ preg_match_all("|<td(.*)</td>|U",$row,$cells); $number = strip_tags($cells[0][0]); $name = strip_tags($cells[0][1]); $position = strip_tags($cells[0][2]); echo "{$position} - {$name} - Number {$number} <br>"; } } source: http://www.bradino.com/php/screen-scraping/
  • 4. PHP (June 2011) $url="http://www.rtu.ac.in/results/reformat.php"; $post="rollnumber=08epccs060&filename=fetchmodulesem_4_btech410m.php&button=Submit"; $ch=curl_init(); curl_setopt($ch,CURLOPT_URL,$url); curl_setopt($ch,CURLOPT_POST,1); curl_setopt($ch,CURLOPT_POSTFIELDS,$post); curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1); curl_setopt($ch,CURLOPT_RETURNTRANSFER,1); $content=curl_exec($ch); curl_close($ch); $totalPath="html/body/table[4]/tbody/tr[3]/td[4]"; $page=new DOMDocument(); $xpath=new DOMXPath($page); $page->loadHTML($content); $page->saveHTML();  // this shows the page contents $total=$xpath->query($totalPath); echo $total->length;    //shows 0 echo $total->item(0)->nodeValue;   //shows nothing $url="http://www.rtu.ac.in/results/reformat.php"; $post="rollnumber=08epccs060&filename=fetchmodulesem_4_btech410m.php&button=Submit"; $ch=curl_init(); curl_setopt($ch,CURLOPT_URL,$url); curl_setopt($ch,CURLOPT_POST,1); curl_setopt($ch,CURLOPT_POSTFIELDS,$post); curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1); curl_setopt($ch,CURLOPT_RETURNTRANSFER,1); $content=curl_exec($ch); curl_close($ch); $totalPath="html/body/table[4]/tbody/tr[3]/td[4]"; $page=new DOMDocument(); $xpath=new DOMXPath($page); $page->loadHTML($content); $page->saveHTML();  // this shows the page contents $total=$xpath->query($totalPath); echo $total->length;    //shows 0 echo $total->item(0)->nodeValue;   //shows nothing ! ! source: http://stackoverflow.com/questions/6283361/unable-to-get-table-data-from-a-html-page
  • 7. awesome site awesome data no API
  • 9. Need to emulate setting options
  • 11. JSON ? XML ? CSV ! HTML ! XLS ! Zip ! App Website
  • 12. Stateless REST API ? JSON ? XML ? CSV ! HTML ! XLS ! Zip ! Session! App Website
  • 13. Stateless REST API ? JSON ? XML ? CSV ! HTML ! XLS ! Zip ! Session! App Website Customize with URL Params HTML Forms
  • 14. Stateless REST API ? JSON ? XML ? CSV ! HTML ! XLS ! Zip ! Session! App Website Customize with URL Params HTML Forms
  • 15. CSV ! HTML ! XLS ! Zip ! HTML ! Session! Session! App Website XQuery ! HTML Forms HTML Forms
  • 17. Session handling Forms ! ! XQuery Web Data Processing A browser can do it? XQuery can do it!

Editor's Notes

  1. http://stackoverflow.com/questions/6283361/unable-to-get-table-data-from-a-html-page
  2. http://stackoverflow.com/questions/6283361/unable-to-get-table-data-from-a-html-page
  3. http://stackoverflow.com/questions/6283361/unable-to-get-table-data-from-a-html-page