Transcript of "Web::Scraper for SF.pm LT"

  1. Practical Web Scraping with Web::Scraper. Tatsuhiko Miyagawa [email_address], Six Apart, Ltd. / Shibuya Perl Mongers. SF.pm Lightning Talk
  2. How many of you have done screen-scraping w/ Perl?
  3. How many of you have used LWP::Simple and regexp?
  4. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
  5. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
     > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
     Monday, August 27, 2007 at 12:49:46
  6. It works!
  7. WWW::MySpace 0.70
  8. WWW::Search::Ebay 2.231
  9. There are 3 problems (at least)
  10. (1) Fragile: easy to break even with slight HTML changes (like newlines, order of attributes etc.)
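The fragility described on this slide can be reproduced without touching the network. The following is a minimal sketch using only core Perl: the pattern is the one from the earlier one-liner, while the `class="clock"` attribute and the line break in the two variants are invented here for illustration and do not come from the real page.

```perl
use strict;
use warnings;

# The pattern from the one-liner on the earlier slide.
my $pattern = qr{<strong id="ctu">(.*?)</strong>};

# The markup as shown in the talk.
my $original = '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>';

# Hypothetical variants: an extra attribute before id, and a newline in the text.
my $reordered = '<strong class="clock" id="ctu">Monday, August 27, 2007 at 12:49:46</strong>';
my $wrapped   = "<strong id=\"ctu\">Monday, August 27, 2007\nat 12:49:46</strong>";

# 1 where the pattern still matches, 0 where it silently stops matching.
my @matched = map { $_ =~ $pattern ? 1 : 0 } ($original, $reordered, $wrapped);
print "@matched\n";    # prints "1 0 0"
```

Neither variant changes what a browser shows, yet both defeat the regular expression.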
  11. (2) Hard to maintain: regular expression based scrapers are good only when they're used in write-only scripts
  12. (3) Improper HTML & encoding handling
  13. <span class="message">I &hearts; Shibuya</span>
      > perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
      I &hearts; Shibuya
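To make the entity problem concrete, here is a small sketch that runs the slide's regular expression and a Web::Scraper rule over the same fragment. It assumes Web::Scraper is installed from CPAN and that its default HTML::TreeBuilder-based parser is in use; the variable names are illustrative.

```perl
use strict;
use warnings;
use Web::Scraper;

# The fragment from the slide, kept in memory so nothing is fetched.
my $html = '<span class="message">I &hearts; Shibuya</span>';

# Regexp approach: the entity comes back undecoded.
my ($raw) = $html =~ m@<span class="message">(.*?)</span>@;

# Web::Scraper approach: 'TEXT' returns the parsed text of the element.
my $s = scraper {
    process 'span.message', message => 'TEXT';
};
my $text = $s->scrape($html)->{message};

binmode STDOUT, ':encoding(UTF-8)';
print "regexp:  $raw\n";
print "scraper: $text\n";
```

With the regexp, decoding `&hearts;` is left to the caller; with the parser-based approach the text should arrive as the character U+2665.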
  14. Web::Scraper to the rescue
  15. Web scraping toolkit, inspired by scrapi.rb, DSL-ish
  16. Example
      #!/usr/bin/perl
      use strict;
      use warnings;
      use Web::Scraper;
      use URI;
      my $s = scraper {
          process "strong#ctu", time => 'TEXT';
          result 'time';
      };
      my $uri = URI->new("http://timeanddate.com/worldclock/");
      print $s->scrape($uri);
  17. Basics
      use Web::Scraper;
      my $s = scraper {
          # DSL goes here
      };
      my $res = $s->scrape($uri);
  18. process
      process $selector,
          $key => $what,
          … ;
  19. $selector: CSS Selector or XPath (start with /)
  20. CSS Selector: strong#ctu
      XPath: //strong[@id="ctu"]
      <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
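The equivalence of the two selector styles can be checked offline. This sketch feeds the slide's fragment to one scraper with both rules, assuming Web::Scraper is installed; the surrounding `<table><tr>` wrapper is added here only so the fragment parses as well-formed HTML, and the key names `css` and `xpath` are arbitrary.

```perl
use strict;
use warnings;
use Web::Scraper;

# The slide's fragment, wrapped in a table (the wrapper is an assumption).
my $html = <<'HTML';
<table><tr><td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br /></td></tr></table>
HTML

my $s = scraper {
    # A CSS selector ...
    process 'strong#ctu', css => 'TEXT';
    # ... and an XPath expression (it starts with "/") for the same element.
    process '//strong[@id="ctu"]', xpath => 'TEXT';
};

my $res = $s->scrape($html);
print "css:   $res->{css}\n";
print "xpath: $res->{xpath}\n";
```

Both keys should hold the same timestamp string, so the choice between the two notations is mostly a matter of taste.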
  21. $key: key for the result hash; append "[]" for looping
  22. $what: '@attr', 'TEXT', 'RAW', Web::Scraper, sub { … }, Hash reference
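Two of the less obvious `$what` forms, a nested Web::Scraper object and a code reference, might be used as in the sketch below. It assumes Web::Scraper is installed and borrows the sample markup from the talk's later slides; the key names `sites[]` and `upper[]` are made up for this illustration.

```perl
use strict;
use warnings;
use Web::Scraper;

# Sample markup taken from the later slides of the talk.
my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

# A nested scraper is applied to each element matched by the outer rule.
my $site = scraper {
    process 'a', link => '@href', name => 'TEXT';
};

my $s = scraper {
    process 'ul.sites > li', 'sites[]' => $site;
    # A code reference receives the matched element as its first argument.
    process 'ul.sites > li > a', 'upper[]' => sub { uc $_[0]->as_text };
};

my $res = $s->scrape($html);
print "$_->{name}: $_->{link}\n" for @{ $res->{sites} };
print join(', ', @{ $res->{upper} }), "\n";
```

The nested form keeps per-item rules together, while the code reference is handy when a value needs post-processing.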
  23. <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
      </ul>
  24. process "ul.sites > li > a",
          'urls[]' => '@href';
      # { urls => [ … ] }
      <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
      </ul>
  25. process '//ul[@class="sites"]/li/a',
          'names[]' => 'TEXT';
      # { names => [ 'OpenGuides', … ] }
      <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
      </ul>
  26. process "ul.sites > li > a",
          'sites[]' => {
              link => '@href', name => 'TEXT',
          };
      # { sites => [ { link => …, name => … },
      #              { link => …, name => … } ] };
      <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
      </ul>
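The three rules from the preceding slides can be combined into one runnable script. This is a sketch that assumes Web::Scraper is installed and scrapes the sample list from an in-memory string rather than a live page.

```perl
use strict;
use warnings;
use Web::Scraper;

# The sample list from the slides.
my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $s = scraper {
    # "[]" on the key collects one value per matched element.
    process 'ul.sites > li > a', 'urls[]' => '@href';
    process '//ul[@class="sites"]/li/a', 'names[]' => 'TEXT';
    # A hash reference builds one hash per matched element.
    process 'ul.sites > li > a', 'sites[]' => {
        link => '@href', name => 'TEXT',
    };
};

my $res = $s->scrape($html);
print "$_\n" for @{ $res->{urls} };
print "$_\n" for @{ $res->{names} };
print "$_->{name} => $_->{link}\n" for @{ $res->{sites} };
```

Passing a URI object to `scrape`, as in the talk's first example, would fetch the page instead of using the string.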
  27. Tools
  28. > cpan Web::Scraper
      comes with 'scraper' CLI
  29. > scraper http://example.com/
      scraper> process "a", "links[]" => '@href';
      scraper> d
      $VAR1 = {
          links => [
              'http://example.org/',
              'http://example.net/',
          ],
      };
      scraper> y
      ---
      links:
        - http://example.org/
        - http://example.net/
  30. > scraper /path/to/foo.html
      > GET http://example.com/ | scraper
  31. Demo
  32. Thank you
      http://search.cpan.org/dist/Web-Scraper
      http://www.slideshare.net/miyagawa/webscraper