Practical Web Scraping with Web::Scraper
Tatsuhiko Miyagawa [email_address]
Six Apart, Ltd. / Shibuya Perl Mongers
YAPC::Europe 2007 Vienna
Tatsuhiko Miyagawa
CPAN: MIYAGAWA
abbreviation Acme::Module::Authors Acme::Sneeze Acme::Sneeze::JP Apache::ACEProxy Apache::AntiSpam Apache::Clickable Apache::CustomKeywords Apache::DefaultCharset Apache::GuessCharset Apache::JavaScript::DocumentWrite Apache::No404Proxy Apache::Profiler Apache::Session::CacheAny Apache::Session::Generate::ModUniqueId Apache::Session::Generate::ModUsertrack Apache::Session::PHP Apache::Session::Serialize::YAML Apache::Singleton Apache::StickyQuery Archive::Any::Create Attribute::Profiled Attribute::Protected Attribute::Unimplemented Bundle::Sledge capitalization Catalyst::Plugin::JSONRPC Catalyst::View::Jemplate Catalyst::View::JSON CGI::Untaint::email Class::DBI::AbstractSearch Class::DBI::Extension Class::DBI::Pager Class::DBI::Replication Class::DBI::SQLite Class::DBI::View Class::Trigger Convert::Base32 Convert::DUDE Convert::RACE Date::Japanese::Era Date::Range::Birth Device::KeyStroke::Mobile Dunce::time Email::Find Email::Valid::Loose Encode::JavaScript::UCS Encode::JP::Mobile Encode::Punycode File::Find::Rule::Digest Geo::Coder::Google HTML::Entities::ImodePictogram HTML::RelExtor HTML::ResolveLink HTML::XSSLint HTTP::MobileAgent HTTP::ProxyPAC HTTP::Server::Simple::Authen IDNA::Punycode Inline::Basic Inline::TT JSON::Syck Kwiki::Emoticon Kwiki::Export Kwiki::Footnote Kwiki::OpenSearch Kwiki::OpenSearch::Service Kwiki::TypeKey Kwiki::URLBL Log::Dispatch::Config Log::Dispatch::DBI Mac::Macbinary Mail::Address::MobileJp Mail::ListDetector::Detector::Fml MSIE::MenuExt Net::DAAP::Server::AAC Net::IDN::Nameprep Net::IPAddr::Find Net::YahooMessenger NetAddr::IP::Find PHP::Session plagger Plagger POE::Component::Client::AirTunes POE::Component::YahooMessenger Template::Plugin::Clickable Template::Plugin::Comma Template::Plugin::FillInForm Template::Plugin::HTML::Template Template::Plugin::JavaScript Template::Plugin::MobileAgent Template::Plugin::Shuffle Template::Provider::Encoding Term::Encoding Term::TtyRec Text::Emoticon Text::Emoticon::GoogleTalk 
Text::Emoticon::MSN Text::Emoticon::Yahoo Text::MessageFormat Time::Duration::ja Time::Duration::Parse Web::Scrape WebService::Bloglines WebService::ChangesXml WebService::Google::Suggest WWW::Baseball::NPB WWW::Blog::Metadata::MobileLinkDiscovery WWW::Blog::Metadata::OpenID WWW::Blog::Metadata::OpenSearch WWW::Cache::Google WWW::OpenSearch XML::Atom XML::Atom::Lifeblog XML::Atom::Stream XML::Liberal
 
http://code.sixapart.com/
 
Practical  Web Scraping with Web::Scraper
Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup.
http://en.wikipedia.org/wiki/Screen_scraping
"Screen-scraping is so 1999!"
 
 
RSS is metadata, not a complete HTML replacement
Practical  Web Scraping with Web::Scraper
What's wrong with LWP & Regexp?
 
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46
It works!
WWW::MySpace 0.70
WWW::Search::Ebay 2.231
WWW::Mixi 0.50
It works …
There are 3 problems (at least)
(1) Fragile
Easy to break even with slight HTML changes (newlines, order of attributes, etc.)
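A minimal sketch of that fragility, using only core Perl. The pattern is the one from the one-liner; the variant snippets and the helper name `extract_time` are made up for illustration and are not from the talk. A browser would render all four variants the same way, but only the first one matches.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as the LWP::Simple one-liner.
my $pattern = qr{<strong id="ctu">(.*?)</strong>};

# Hypothetical variations of the same markup.
my %variants = (
    original       => '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    extra_attr     => '<strong class="big" id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    single_quotes  => q{<strong id='ctu'>Monday, August 27, 2007 at 12:49:46</strong>},
    newline_inside => qq{<strong id="ctu">Monday, August 27, 2007\n at 12:49:46</strong>},
);

# Returns the captured text, or undef when the pattern does not match.
sub extract_time {
    my ($html) = @_;
    return $html =~ $pattern ? $1 : undef;
}

for my $name (sort keys %variants) {
    my $time = extract_time($variants{$name});
    printf "%-15s %s\n", $name, defined $time ? "matched" : "NO MATCH";
}
```

The newline case fails because `.` does not match a newline unless the `/s` modifier is used; the other two fail because the literal attribute text no longer lines up with the pattern.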
(2) Hard to maintain
Regular-expression-based scrapers are good only when they're used in write-only scripts
(3) Improper HTML & encoding handling
<span class="message">I &hearts; Vienna</span>

> perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
I &hearts; Vienna

<span class="message">I &hearts; Vienna</span>

> perl -MHTML::Entities -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities($1)'
I ♥ Vienna

<span class="message">ウィーンが大好き!</span>

> perl -MHTML::Entities -MEncode -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities(decode_utf8($1))'
Wide character in print at -e line 1.
ウィーンが大好き!
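The "Wide character in print" warning appears because decoded characters are printed to a filehandle that has no encoding layer. A minimal sketch of one common remedy, using only the core Encode module: decode bytes on the way in, work on characters, and encode again on the way out. The byte string below is an assumption standing in for content fetched from the network.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Raw UTF-8 bytes for the first four characters of the message (assumed input).
my $bytes = "\xE3\x82\xA6\xE3\x82\xA3\xE3\x83\xBC\xE3\x83\xB3";

# Decode once at the boundary: 12 bytes become 4 characters.
my $chars = decode_utf8($bytes);

# Give STDOUT an encoding layer so printing characters does not warn.
binmode STDOUT, ':encoding(UTF-8)';

printf "%d bytes, %d characters\n", length($bytes), length($chars);
print $chars, "\n";
```

Entity decoding (as with decode_entities in the one-liner) would happen on the decoded characters, between the two boundaries.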
The "right" way of screen-scraping
(1), (2) Maintainable, less fragile
Use XPath and CSS Selectors
XPath
HTML::TreeBuilder::XPath
XML::LibXML

XPath

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

    use HTML::TreeBuilder::XPath;

    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    print $tree->findnodes('//strong[@id="ctu"]')->shift->as_text;
    # Monday, August 27, 2007 at 12:49:46
CSS Selectors
"XPath for HTML coders"
"XPath for people who hate XML"
CSS Selectors

    body { font-size: 12px; }
    div.article { padding: 1em }
    span#count { color: #fff }

XPath: //strong[@id="ctu"]
CSS Selector: strong#ctu
CSS Selectors

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);

    my $tree  = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath = selector_to_xpath "strong#ctu";
    print $tree->findnodes($xpath)->shift->as_text;
    # Monday, August 27, 2007 at 12:49:46
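To see the selector-to-XPath translation on its own, the following sketch runs a few selectors against an in-memory copy of the snippet, so no network access is needed. It assumes HTML::TreeBuilder::XPath and HTML::Selector::XPath are installed from CPAN; the surrounding table markup, the extra selectors, and the helper `match_count` are illustrative additions, not part of the talk.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

# The time-cell snippet, wrapped in a table so the <td> parses in context.
my $html = <<'HTML';
<table><tr>
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br /></td>
</tr></table>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Hypothetical helper: how many nodes does a CSS selector match?
sub match_count {
    my ($tree, $selector) = @_;
    my @nodes = $tree->findnodes(selector_to_xpath($selector));
    return scalar @nodes;
}

for my $selector ('strong#ctu', 'td > strong', 'td strong') {
    printf "%-12s => %-28s matches: %d\n",
        $selector, selector_to_xpath($selector), match_count($tree, $selector);
}
```

Printing the generated XPath next to each selector makes it easier to check what the shorter CSS form actually expands to.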
Complete Script

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use LWP::UserAgent;
    use HTTP::Response::Encoding;
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get("http://www.timeanddate.com/worldclock/");
    if ($res->is_error) {
        die "HTTP GET error: ", $res->status_line;
    }

    my $content = decode $res->encoding, $res->content;
    my $tree    = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath   = selector_to_xpath("strong#ctu");
    my $node    = $tree->findnodes($xpath)->shift;
    print $node->as_text;
Robust, Maintainable, and Sane character handling
Example (before)

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46
Example (after)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use LWP::UserAgent;
    use HTTP::Response::Encoding;
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get("http://www.timeanddate.com/worldclock/");
    if ($res->is_error) {
        die "HTTP GET error: ", $res->status_line;
    }

    my $content = decode $res->encoding, $res->content;
    my $tree    = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath   = selector_to_xpath("strong#ctu");
    my $node    = $tree->findnodes($xpath)->shift;
    print $node->as_text;
but … long and boring
Practical Web Scraping with  Web::Scraper
Web scraping toolkit
inspired by scrapi.rb
DSL-ish
Example (before)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use LWP::UserAgent;
    use HTTP::Response::Encoding;
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get("http://www.timeanddate.com/worldclock/");
    if ($res->is_error) {
        die "HTTP GET error: ", $res->status_line;
    }

    my $content = decode $res->encoding, $res->content;
    my $tree    = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath   = selector_to_xpath("strong#ctu");
    my $node    = $tree->findnodes($xpath)->shift;
    print $node->as_text;
Example (after)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Web::Scraper;
    use URI;

    my $s = scraper {
        process "strong#ctu", time => 'TEXT';
        result 'time';
    };

    my $uri = URI->new("http://timeanddate.com/worldclock/");
    print $s->scrape($uri);
Basics

    use Web::Scraper;

    my $s = scraper {
        # DSL goes here
    };

    my $res = $s->scrape($uri);

process

    process $selector, $key => $what, … ;

$selector: CSS Selector or XPath (starts with /)

$key: key for the result hash; append "[]" for looping

$what:
    '@attr'
    'TEXT'
    Web::Scraper
    sub { … }
    Hash reference
<ul class="sites">
  <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
  <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
    process "ul.sites > li > a",
        'urls[]' => '@href';
    # { urls => [ … ] }

    process '//ul[@class="sites"]/li/a',
        'names[]' => 'TEXT';
    # { names => [ 'OpenGuides', … ] }

    process "ul.sites > li",
        'sites[]' => scraper {
            process 'a',
                link => '@href',
                name => 'TEXT';
        };
    # { sites => [ { link => …, name => … },
    #              { link => …, name => … } ] };

    process "ul.sites > li > a",
        'sites[]' => sub {
            # $_ is HTML::Element
            +{ link => $_->attr('href'), name => $_->as_text };
        };
    # { sites => [ { link => …, name => … },
    #              { link => …, name => … } ] };
    process "ul.sites > li > a",
        'sites[]' => {
            link => '@href',
            name => 'TEXT',
        };
    # { sites => [ { link => …, name => … },
    #              { link => …, name => … } ] };
result

    result;        # get stash as hashref (default)
    result @keys;  # get stash as hashref containing @keys
    result $key;   # get value of stash $key;

    my $s = scraper {
        process …;
        process …;
        result 'foo', 'bar';
    };
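Putting process and result together: a small sketch that scrapes the sample list from an in-memory string rather than a URI, so it runs without network access. It assumes Web::Scraper is installed from CPAN and relies on scrape() accepting an HTML string as well as a URI object; the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# The sample markup from the slides, held in a string.
my $html = <<'HTML';
<ul class="sites">
  <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
  <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $s = scraper {
    # "[]" on the key collects one hash per matched <a> element.
    process 'ul.sites > li > a',
        'sites[]' => { link => '@href', name => 'TEXT' };
    # Return just the 'sites' value instead of the whole stash.
    result 'sites';
};

my $sites = $s->scrape($html);

for my $site (@$sites) {
    print "$site->{name}: $site->{link}\n";
}
```

Without the `result 'sites'` line, scrape() would return the whole stash, i.e. a hash reference with a single `sites` key.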
More Examples
 
Thumbnail URLs on Flickr set

    #!/usr/bin/perl
    use strict;
    use Data::Dumper;
    use Web::Scraper;
    use URI;

    my $url = "http://flickr.com/photos/bulknews/sets/72157601700510359/";

    my $s = scraper {
        process "a.image_link img", "thumbs[]" => '@src';
    };

    warn Dumper $s->scrape( URI->new($url) );
 
<span class="vcard">
  <a href="http://twitter.com/iamcal" class="url" rel="contact" title="Cal Henderson">
    <img alt="Cal Henderson" class="photo fn" height="24" id="profile-image"
         src="http://assets0.twitter.com/…/mini/buddyicon.gif" width="24" /></a>
</span>
<span class="vcard">
  …
</span>
Twitter Friends

    #!/usr/bin/perl
    use strict;
    use Web::Scraper;
    use URI;
    use Data::Dumper;

    my $url = "http://twitter.com/miyagawa";

    my $s = scraper {
        process "span.vcard a", "people[]" => '@title';
    };

    warn Dumper $s->scrape( URI->new($url) );

Twitter Friends (complex)

    #!/usr/bin/perl
    use strict;
    use Web::Scraper;
    use URI;
    use Data::Dumper;

    my $url = "http://twitter.com/miyagawa";

    my $s = scraper {
        process "span.vcard", "people[]" => scraper {
            process "a",   link  => '@href', name => '@title';
            process "img", thumb => '@src';
        };
    };

    warn Dumper $s->scrape( URI->new($url) );
Tools
> cpan Web::Scraper
comes with 'scraper' CLI
> scraper http://example.com/
scraper> process "a", "links[]" => '@href';
scraper> d
$VAR1 = {
    links => [
        'http://example.org/',
        'http://example.net/',
    ],
};
scraper> y
---
links:
  - http://example.org/
  - http://example.net/

> scraper /path/to/foo.html
> GET http://example.com/ | scraper
TODO
Web::Scraper Needs documentation
More examples to put in eg/ directory
integrate with WWW::Mechanize and Test::WWW::Declare
XPath auto-suggestion off of DOM + element
DOM + XPath => Element
DOM + Element => XPath? (Template::Extract?)
Questions?
Thank you
http://search.cpan.org/dist/Web-Scraper
http://www.slideshare.net/miyagawa/webscraper
