Web::Scraper

From miyagawa, 8 months ago

Slideshow transcript

Slide 1: Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa miyagawa@gmail.com Six Apart, Ltd. / Shibuya Perl Mongers YAPC::Europe 2007 Vienna

Slide 2: Tatsuhiko Miyagawa Tatsuhiko Miyagawa 2007/08/28 YAPC::Europe 2007

Slide 3: CPAN: MIYAGAWA

Slide 4: abbreviation Acme::Module::Authors Acme::Sneeze Acme::Sneeze::JP Apache::ACEProxy Apache::AntiSpam Apache::Clickable Apache::CustomKeywords Apache::DefaultCharset Apache::GuessCharset Apache::JavaScript::DocumentWrite Apache::No404Proxy Apache::Profiler Apache::Session::CacheAny Apache::Session::Generate::ModUniqueId Apache::Session::Generate::ModUsertrack Apache::Session::PHP Apache::Session::Serialize::YAML Apache::Singleton Apache::StickyQuery Archive::Any::Create Attribute::Profiled Attribute::Protected Attribute::Unimplemented Bundle::Sledge capitalization Catalyst::Plugin::JSONRPC Catalyst::View::Jemplate Catalyst::View::JSON CGI::Untaint::email Class::DBI::AbstractSearch Class::DBI::Extension Class::DBI::Pager Class::DBI::Replication Class::DBI::SQLite Class::DBI::View Class::Trigger Convert::Base32 Convert::DUDE Convert::RACE Date::Japanese::Era Date::Range::Birth Device::KeyStroke::Mobile Dunce::time Email::Find Email::Valid::Loose Encode::JavaScript::UCS Encode::JP::Mobile Encode::Punycode File::Find::Rule::Digest Geo::Coder::Google HTML::Entities::ImodePictogram HTML::RelExtor HTML::ResolveLink HTML::XSSLint HTTP::MobileAgent HTTP::ProxyPAC HTTP::Server::Simple::Authen IDNA::Punycode Inline::Basic Inline::TT JSON::Syck Kwiki::Emoticon Kwiki::Export Kwiki::Footnote Kwiki::OpenSearch Kwiki::OpenSearch::Service Kwiki::TypeKey Kwiki::URLBL Log::Dispatch::Config Log::Dispatch::DBI Mac::Macbinary Mail::Address::MobileJp Mail::ListDetector::Detector::Fml MSIE::MenuExt Net::DAAP::Server::AAC Net::IDN::Nameprep Net::IPAddr::Find Net::YahooMessenger NetAddr::IP::Find PHP::Session plagger Plagger POE::Component::Client::AirTunes POE::Component::YahooMessenger Template::Plugin::Clickable Template::Plugin::Comma Template::Plugin::FillInForm Template::Plugin::HTML::Template Template::Plugin::JavaScript Template::Plugin::MobileAgent Template::Plugin::Shuffle Template::Provider::Encoding Term::Encoding Term::TtyRec Text::Emoticon Text::Emoticon::GoogleTalk 
Text::Emoticon::MSN Text::Emoticon::Yahoo Text::MessageFormat Time::Duration::ja Time::Duration::Parse Web::Scrape WebService::Bloglines WebService::ChangesXml WebService::Google::Suggest WWW::Baseball::NPB WWW::Blog::Metadata::MobileLinkDiscovery WWW::Blog::Metadata::OpenID WWW::Blog::Metadata::OpenSearch WWW::Cache::Google WWW::OpenSearch XML::Atom XML::Atom::Lifeblog XML::Atom::Stream XML::Liberal

Slide 6: http://code.sixapart.com/

Slide 8: Practical Web Scraping with Web::Scraper

Slide 9: "Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup."
    Source: http://en.wikipedia.org/wiki/Screen_scraping

Slide 11: "Screen-scraping is so 1999!"

Slide 14: RSS is metadata, not a complete HTML replacement

Slide 15: Practical Web Scraping with Web::Scraper

Slide 16: What's wrong with LWP & Regexp?

Slide 18:
    <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
    <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

Slide 19:
    <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
    <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
    > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
    Monday, August 27, 2007 at 12:49:46

Slide 20: It works!

Slide 21: WWW::MySpace 0.70

Slide 22: WWW::Search::Ebay 2.231

Slide 23: WWW::Mixi 0.50

Slide 24: It works …

Slide 25: There are 3 problems (at least)

Slide 26: (1) Fragile: easy to break even with slight HTML changes (newlines, order of attributes, etc.)
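
A minimal sketch of the fragility described on this slide. The `$changed` markup is a hypothetical variant (an extra attribute before `id` and a newline inside the element), not something taken from timeanddate.com; the regex is the one from the one-liner on Slide 19.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The regex from the one-liner on Slide 19.
my $re = qr{<strong id="ctu">(.*?)</strong>};

# Markup as shown on the slide.
my $original = '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>';

# Hypothetical variant: an extra attribute before id and a newline in the
# text. A browser would render both versions identically.
my $changed = qq{<strong class="time" id="ctu">Monday, August 27, 2007\n at 12:49:46</strong>};

# In list context a failed match returns an empty list, leaving undef.
my ($before) = $original =~ $re;
my ($after)  = $changed  =~ $re;

print "original: ", (defined $before ? $before : "(no match)"), "\n";
print "changed:  ", (defined $after  ? $after  : "(no match)"), "\n";
```

The second match fails for two independent reasons: the literal `<strong id=` no longer appears, and `.` does not match a newline without the `/s` modifier.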

Slide 27: (2) Hard to maintain: regular-expression-based scrapers are good only when they're used in write-only scripts

Slide 28: (3) Improper HTML & encoding handling

Slide 29:
    <span class="message">I &hearts; Vienna</span>
    > perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
    I &hearts; Vienna

Slide 30:
    <span class="message">I &hearts; Vienna</span>
    > perl -MHTML::Entities -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities($1)'
    I ♥ Vienna

Slide 31:
    <span class="message">ウィーンが大好き!</span>
    > perl -MHTML::Entities -MEncode -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities(decode_utf8($1))'
    Wide character in print at -e line 1.
    ウィーンが大好き!
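
One way to avoid the "Wide character in print" warning, sketched under the assumption that the page is served as UTF-8: decode the bytes once on input, work on characters, and encode once on output. The sample string is made up for illustration, and the regex is kept only to mirror the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);
use HTML::Entities qw(decode_entities);

# Simulated raw UTF-8 bytes, as they would arrive over HTTP (assumed
# encoding). \x{30A6}\x{30A3}\x{30FC}\x{30F3} spells "Vienna" in katakana.
my $bytes = encode_utf8(
    qq{<span class="message">I &hearts; \x{30A6}\x{30A3}\x{30FC}\x{30F3}</span>}
);

# 1. Decode bytes into Perl characters before doing anything else.
my $html = decode_utf8($bytes);

# 2. Extract the text, then resolve entities on the character string.
my ($text) = $html =~ m{<span class="message">(.*?)</span>};
$text = decode_entities($text);

# 3. Declare the output encoding; without this layer, print warns.
binmode STDOUT, ':encoding(UTF-8)';
print $text, "\n";
```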

Slide 32: The "right" way of screen-scraping

Slide 33: (1), (2): Maintainable, less fragile

Slide 34: Use XPath and CSS Selectors

Slide 35: XPath: HTML::TreeBuilder::XPath, XML::LibXML

Slide 36: XPath
    <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
    <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
    use HTML::TreeBuilder::XPath;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    print $tree->findnodes('//strong[@id="ctu"]')->shift->as_text;
    # Monday, August 27, 2007 at 12:49:46

Slide 37: CSS Selectors: "XPath for HTML coders", "XPath for people who hate XML"

Slide 38: CSS Selectors
    body { font-size: 12px; }
    div.article { padding: 1em }
    span#count { color: #fff }

Slide 39:
    XPath: //strong[@id="ctu"]
    CSS Selector: strong#ctu
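
A short sketch of the equivalence on this slide, using `selector_to_xpath` from HTML::Selector::XPath to translate CSS selectors into XPath. The inline `$html` fragment is an assumption that stands in for the fetched page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

# Inline fragment standing in for the real page (assumption).
my $html = '<p>time: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong></p>';
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Print each CSS selector next to the XPath it is translated to.
for my $css ('strong#ctu', 'div.article', 'ul.sites > li > a') {
    printf "%-20s => %s\n", $css, selector_to_xpath($css);
}

# The translated expression and the hand-written XPath find the same text.
my ($by_css)   = $tree->findnodes(selector_to_xpath('strong#ctu'));
my ($by_xpath) = $tree->findnodes('//strong[@id="ctu"]');
my $css_text   = $by_css->as_text;
my $xpath_text = $by_xpath->as_text;
print "$css_text\n";

$tree->delete;    # free the parse tree
```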

Slide 40: CSS Selectors
    <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
    <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);
    my $tree  = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath = selector_to_xpath "strong#ctu";
    print $tree->findnodes($xpath)->shift->as_text;
    # Monday, August 27, 2007 at 12:49:46

Slide 41: Complete Script
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use LWP::UserAgent;
    use HTTP::Response::Encoding;
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);
    # fetch the page and stop on HTTP errors
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get("http://www.timeanddate.com/worldclock/");
    if ($res->is_error) {
        die "HTTP GET error: ", $res->status_line;
    }
    # decode bytes using the encoding reported by the response
    my $content = decode $res->encoding, $res->content;
    # parse, translate the CSS selector to XPath, and print the node text
    my $tree  = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath = selector_to_xpath("strong#ctu");
    my $node  = $tree->findnodes($xpath)->shift;
    print $node->as_text;

Slide 42: Robust, maintainable, and sane character handling

Slide 43: Example (before)
    <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
    <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
    > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
    Monday, August 27, 2007 at 12:49:46

Slide 44: Example (after)
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use LWP::UserAgent;
    use HTTP::Response::Encoding;
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get("http://www.timeanddate.com/worldclock/");
    if ($res->is_error) {
        die "HTTP GET error: ", $res->status_line;
    }
    my $content = decode $res->encoding, $res->content;
    my $tree  = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath = selector_to_xpath("strong#ctu");
    my $node  = $tree->findnodes($xpath)->shift;
    print $node->as_text;

Slide 45: but … long and boring

Slide 46: Practical Web Scraping with Web::Scraper

Slide 47: Web scraping toolkit, inspired by scrapi.rb, DSL-ish

Slide 48: Example (before)
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use LWP::UserAgent;
    use HTTP::Response::Encoding;
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get("http://www.timeanddate.com/worldclock/");
    if ($res->is_error) {
        die "HTTP GET error: ", $res->status_line;
    }
    my $content = decode $res->encoding, $res->content;
    my $tree  = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath = selector_to_xpath("strong#ctu");
    my $node  = $tree->findnodes($xpath)->shift;
    print $node->as_text;

Slide 49: Example (after)
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Web::Scraper;
    use URI;
    # one rule: store the text of strong#ctu under "time", then return it
    my $s = scraper {
        process "strong#ctu", time => 'TEXT';
        result 'time';
    };
    my $uri = URI->new("http://timeanddate.com/worldclock/");
    print $s->scrape($uri);

Slide 50: Basics
    use Web::Scraper;
    my $s = scraper {
        # DSL goes here
    };
    my $res = $s->scrape($uri);

Slide 51: process
    process $selector, $key => $what, …;

Slide 52: $selector: CSS Selector or XPath (starts with /)

Slide 53: $key: key for the result hash; append "[]" for looping

Slide 54: $what: '@attr', 'TEXT', a Web::Scraper object, sub { … }, or a hash reference

Slide 55:
    <ul class="sites">
    <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
    <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
    </ul>

Slide 56:
    <ul class="sites">
    <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
    <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
    </ul>
    process "ul.sites > li > a", 'urls[]' => '@href';
    # { urls => [ … ] }

Slide 57:
    <ul class="sites">
    <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
    <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
    </ul>
    process '//ul[@class="sites"]/li/a', 'names[]' => 'TEXT';
    # { names => [ 'OpenGuides', … ] }

Slide 58:
    <ul class="sites">
    <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
    <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
    </ul>
    process "ul.sites > li", 'sites[]' => scraper {
        process 'a', link => '@href', name => 'TEXT';
    };
    # { sites => [ { link => …, name => … },
    #              { link => …, name => … } ] };

Slide 59:
    <ul class="sites">
    <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
    <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
    </ul>
    process "ul.sites > li > a", 'sites[]' => sub {
        # $_ is HTML::Element
        +{ link => $_->attr('href'), name => $_->as_text };
    };
    # { sites => [ { link => …, name => … },
    #              { link => …, name => … } ] };

Slide 60:
    <ul class="sites">
    <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
    <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
    </ul>
    process "ul.sites > li > a", 'sites[]' => {
        link => '@href', name => 'TEXT'
    };
    # { sites => [ { link => …, name => … },
    #              { link => …, name => … } ] };
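
The `process` variants from Slides 56 to 60 can be tried together without network access. The sketch below assumes the sample list is passed to `scrape` as a scalar reference to an HTML string rather than as a URI. The key names `nested`, `by_sub`, and `by_hash` are chosen here only to keep the five results apart.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# The sample markup from the slides, inlined so no HTTP request is made.
my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $s = scraper {
    # '@attr': collect an attribute from every match (CSS selector)
    process 'ul.sites > li > a', 'urls[]' => '@href';

    # 'TEXT': collect the text of every match (XPath, starts with /)
    process '//ul[@class="sites"]/li/a', 'names[]' => 'TEXT';

    # nested scraper: one hash per <li>
    process 'ul.sites > li', 'nested[]' => scraper {
        process 'a', link => '@href', name => 'TEXT';
    };

    # code reference: $_ is the matched HTML::Element
    process 'ul.sites > li > a', 'by_sub[]' => sub {
        +{ link => $_->attr('href'), name => $_->as_text };
    };

    # hash reference: several values taken from the same element
    process 'ul.sites > li > a', 'by_hash[]' => { link => '@href', name => 'TEXT' };
};

my $res = $s->scrape(\$html);

print join(', ', @{ $res->{names} }), "\n";
print "$_->{name} <$_->{link}>\n" for @{ $res->{by_hash} };
```

The last three rules produce the same list of `{ link, name }` hashes; the nested scraper is the most flexible, while the hash reference is the shortest when every value comes from the same element.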

Slide 61: result
    my $s = scraper {
        process …;
        process …;
        result 'foo', 'bar';
    };
    result;        # get stash as hashref (default)
    result @keys;  # get stash as hashref containing @keys
    result $key;   # get value of stash $key
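
A sketch of the three `result` forms side by side. The `$html` fragment and the `make_scraper` helper are hypothetical, introduced only so the same rules can be run with different key lists.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical markup, used only to populate three stash keys.
my $html = '<h1 id="t">Vienna</h1><p class="by">miyagawa</p><p class="n">42</p>';

# Hypothetical helper: same process rules each time, ending with
# result(@keys). @keys may be empty.
sub make_scraper {
    my @keys = @_;
    return scraper {
        process 'h1#t', title  => 'TEXT';
        process 'p.by', author => 'TEXT';
        process 'p.n',  count  => 'TEXT';
        result @keys;
    };
}

my $all  = make_scraper()->scrape(\$html);                   # whole stash (hashref)
my $some = make_scraper('title', 'author')->scrape(\$html);  # hashref with two keys
my $one  = make_scraper('title')->scrape(\$html);            # plain value, not a hashref

print join(',', sort keys %$all),  "\n";
print join(',', sort keys %$some), "\n";
print "$one\n";
```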

Slide 62: More Examples

Slide 64: Thumbnail URLs on Flickr set
    #!/usr/bin/perl
    use strict;
    use Data::Dumper;
    use Web::Scraper;
    use URI;
    my $url = "http://flickr.com/photos/bulknews/sets/72157601700510359/";
    # collect the src of every thumbnail image inside a.image_link
    my $s = scraper {
        process "a.image_link img", "thumbs[]" => '@src';
    };
    warn Dumper $s->scrape( URI->new($url) );

Slide 66:
    <span class="vcard">
      <a href="http://twitter.com/iamcal" class="url" rel="contact" title="Cal Henderson">
        <img alt="Cal Henderson" class="photo fn" height="24" id="profile-image"
             src="http://assets0.twitter.com/…/mini/buddyicon.gif" width="24" /></a>
    </span>
    <span class="vcard"> … </span>

Slide 67: Twitter Friends
    #!/usr/bin/perl
    use strict;
    use Web::Scraper;
    use URI;
    use Data::Dumper;
    my $url = "http://twitter.com/miyagawa";
    # the title attribute of each vcard link holds the friend's name
    my $s = scraper {
        process "span.vcard a", "people[]" => '@title';
    };
    warn Dumper $s->scrape( URI->new($url) );

Slide 68: Twitter Friends (complex)
    #!/usr/bin/perl
    use strict;
    use Web::Scraper;
    use URI;
    use Data::Dumper;
    my $url = "http://twitter.com/miyagawa";
    # one hash per vcard: profile link, name, and thumbnail URL
    my $s = scraper {
        process "span.vcard", "people[]" => scraper {
            process "a", link => '@href', name => '@title';
            process "img", thumb => '@src';
        };
    };
    warn Dumper $s->scrape( URI->new($url) );

Slide 69: Tools

Slide 70:
    > cpan Web::Scraper
    comes with 'scraper' CLI

Slide 71:
    > scraper http://example.com/
    scraper> process "a", "links[]" => '@href';
    scraper> d
    $VAR1 = {
        links => [
            'http://example.org/',
            'http://example.net/',
        ],
    };
    scraper> y
    ---
    links:
      - http://example.org/
      - http://example.net/

Slide 72:
    > scraper /path/to/foo.html
    > GET http://example.com/ | scraper

Slide 73: TODO

Slide 74: Web::Scraper needs documentation

Slide 75: More examples to put in eg/ directory

Slide 76: Integrate with WWW::Mechanize and Test::WWW::Declare

Slide 77: XPath auto-suggestion off of DOM + element
    DOM + XPath => Element
    DOM + Element => XPath? (Template::Extract?)

Slide 78: Questions?

Slide 79: Thank you
    http://search.cpan.org/dist/Web-Scraper
    http://www.slideshare.net/miyagawa/webscraper