London XQuery Meetup: Querying the World (Web Scraping)

Presentation held at the London XQuery Meetup in September 2011. It shows how web scraping has naturally evolved towards XQuery and discusses the obstacles encountered when scraping websites. A live example demonstrates that XQuery can solve these problems.

Published in: Technology, News & Politics

    1. XQuery: Querying the World (formerly known as Web Scraping)
       Dennis Knochenwefel <dennis.knochenwefel@28msec.com>
    2. Evolution
       Web Scraping
    3. PHP (2007)

           $url = "http://www.nfl.com/teams/sandiegochargers/roster?team=SD";
           $raw = file_get_contents($url);
           // strip tabs, newlines and other whitespace debris before string matching
           $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
           $content = str_replace($newlines, "", html_entity_decode($raw));
           // cut the roster table out of the page by string position
           $start = strpos($content,'<table cellpadding="2" class="standard_table"');
           $end = strpos($content,'</table>',$start) + 8;
           $table = substr($content,$start,$end-$start);
           // split the table into rows, then each non-header row into cells
           preg_match_all("|<tr(.*)</tr>|U",$table,$rows);
           foreach ($rows[0] as $row){
               if ((strpos($row,'<th')===false)){
                   preg_match_all("|<td(.*)</td>|U",$row,$cells);
                   $number = strip_tags($cells[0][0]);
                   $name = strip_tags($cells[0][1]);
                   $position = strip_tags($cells[0][2]);
                   echo "{$position} - {$name} - Number {$number} <br>\n";
               }
           }

       source: http://www.bradino.com/php/screen-scraping/
    4. PHP (June 2011)

           $url="http://www.rtu.ac.in/results/reformat.php";
           $post="rollnumber=08epccs060&filename=fetchmodulesem_4_btech410m.php&button=Submit";
           $ch=curl_init();
           curl_setopt($ch,CURLOPT_URL,$url);
           curl_setopt($ch,CURLOPT_POST,1);
           curl_setopt($ch,CURLOPT_POSTFIELDS,$post);
           curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
           curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
           $content=curl_exec($ch);
           curl_close($ch);
           $totalPath="html/body/table[4]/tbody/tr[3]/td[4]";
           $page=new DOMDocument();
           $xpath=new DOMXPath($page);
           $page->loadHTML($content);
           $page->saveHTML();  // this shows the page contents
           $total=$xpath->query($totalPath);
           echo $total->length;    //shows 0
           echo $total->item(0)->nodeValue;   //shows nothing

       source: http://stackoverflow.com/questions/6283361/unable-to-get-table-data-from-a-html-page
    5. XQuery
    6. Real World Example
    7. awesome site, awesome data, no API
    8. Deal with sessions
    9. Need to emulate setting options
    10. Different Notions: Publisher <=> Consumer
    11. JSON? XML? CSV! HTML! XLS! Zip! (diagram labels: App, Website)
    12. Stateless REST API? JSON? XML? CSV! HTML! XLS! Zip! Session! (diagram labels: App, Website)
    13. Stateless REST API? JSON? XML? CSV! HTML! XLS! Zip! Session! (diagram labels: App, Website, Customize with URL Params, HTML Forms)
    14. Stateless REST API? JSON? XML? CSV! HTML! XLS! Zip! Session! (diagram labels: App, Website, Customize with URL Params, HTML Forms)
    15. CSV! HTML! XLS! Zip! HTML! Session! Session! XQuery! (diagram labels: App, Website, HTML Forms, HTML Forms)
    16. Summary
    17. Session handling, Forms: XQuery Web Data Processing
        A browser can do it? XQuery can do it!
    18. Result: http://www.unemployment.by/country
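
Slide 4's PHP snippet ends with an XPath query that matches nothing. The slides do not say why, but one frequent cause with paths copied from a browser's inspector is the `tbody` step: browsers add that element to their DOM even when the served HTML omits it, so the path fails against the raw source. The sketch below reproduces the effect with Python's standard library (used here only because it runs anywhere; it is not the XQuery from the talk). The table is a made-up, well-formed stand-in for the real results page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the results page (not the real markup).
# Note that the served source contains no <tbody> element.
PAGE = """<html><body>
<table>
  <tr><th>Subject</th><th>Internal</th><th>External</th><th>Total</th></tr>
  <tr><td>Maths</td><td>18</td><td>60</td><td>78</td></tr>
  <tr><td>Physics</td><td>15</td><td>55</td><td>70</td></tr>
</table>
</body></html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Path as a browser inspector would report it (with the implied tbody).
browser_path = "body/table[1]/tbody/tr[3]/td[4]"
# The same path written against the markup that is actually served.
source_path = "body/table[1]/tr[3]/td[4]"

with_tbody = root.findall(browser_path)
without_tbody = root.findall(source_path)

print(len(with_tbody))        # number of matches for the browser-style path
print(without_tbody[0].text)  # text of the cell found by the source-style path
```

Another candidate in the original snippet is the order of operations: the `DOMXPath` object is created before `loadHTML()` is called, which may leave it bound to the still-empty document.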

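Slides 8, 9 and 17 name the two things a scraper has to handle the way a browser does: sessions and forms. As a rough illustration of what that involves (in Python, not the XQuery used in the talk), the sketch below reads a form's default fields, including hidden ones, overrides the options a user would change, and prepares a POST for a cookie-aware opener. The page markup, the field names, and the `example.org` URL are hypothetical.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urlencode, urljoin
from urllib.request import HTTPCookieProcessor, Request, build_opener


class FormFields(HTMLParser):
    """Collect the first form's action and the default input/select values."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._select = None  # name of the <select> currently being parsed

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = a.get("action") or ""
        elif tag == "input" and "name" in a:
            self.fields[a["name"]] = a.get("value") or ""
        elif tag == "select" and "name" in a:
            self._select = a["name"]
        elif tag == "option" and self._select is not None:
            # Keep the first option unless a later one is marked as selected.
            if self._select not in self.fields or "selected" in a:
                self.fields[self._select] = a.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def build_form_request(page_url, html, overrides):
    """Return a POST request that submits the form with some options changed."""
    parser = FormFields()
    parser.feed(html)
    fields = dict(parser.fields)
    fields.update(overrides)  # emulate the user setting options
    body = urlencode(fields).encode("ascii")
    return Request(urljoin(page_url, parser.action or ""), data=body)


# Hypothetical page; the real site's markup is not part of the slides.
PAGE_URL = "http://example.org/stats/index.html"
PAGE_HTML = """
<form action="report" method="post">
  <input type="hidden" name="token" value="abc123">
  <select name="format">
    <option value="html" selected>HTML</option>
    <option value="csv">CSV</option>
  </select>
  <input type="text" name="year" value="2010">
</form>
"""

# One opener for every request: its cookie jar would carry the session
# cookie from the first response to all later requests.
opener = build_opener(HTTPCookieProcessor(CookieJar()))

request = build_form_request(PAGE_URL, PAGE_HTML, {"format": "csv", "year": "2011"})
# opener.open(request) would send the POST; it is omitted so the sketch runs offline.
print(request.get_method(), request.full_url, request.data)
```

Reusing a single opener is the design choice that matters here: fetching the form page and submitting it through the same cookie jar is what keeps the server-side session alive between the two requests.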