Web Scraper Shibuya.pm tech talk #8

  1. Practical Web Scraping with Web::Scraper
     Tatsuhiko Miyagawa [email_address]
     Six Apart, Ltd. / Shibuya Perl Mongers
     Shibuya.pm Tech Talks #8
  2. Practical Web Scraping with Web::Scraper
  3. "Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup."
     http://en.wikipedia.org/wiki/Screen_scraping
  5. "Screen-scraping is so 1999!"
  8. RSS is metadata, not a complete HTML replacement
  9. Practical Web Scraping with Web::Scraper
  10. What's wrong with LWP & Regexp?
  12. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
  13. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

      > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
      Monday, August 27, 2007 at 12:49:46
  14. It works!
  15. WWW::MySpace 0.70
  16. WWW::Search::Ebay 2.231
  17. WWW::Mixi 0.50
  18. It works …
  19. There are 3 problems (at least)
  20. (1) Fragile
      Easy to break even with slight HTML changes
      (like newlines, order of attributes, etc.)
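To make the fragility concrete, here is a minimal sketch (not from the slides; the "new" markup variants are hypothetical): the regex from the one-liner above keeps working only as long as the page is serialized exactly the same way.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # The regex from the one-liner above, tied to one exact serialization.
    my $re = qr{<strong id="ctu">(.*?)</strong>};

    # Original markup: matches.
    my $old  = '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>';

    # Hypothetical later revisions of the same page: an extra class attribute,
    # or single-quoted attributes. The data is unchanged; the regex breaks.
    my $new1 = '<strong class="time" id="ctu">Monday, August 27, 2007 at 12:49:46</strong>';
    my $new2 = "<strong id='ctu'>Monday, August 27, 2007 at 12:49:46</strong>";

    for my $html ($old, $new1, $new2) {
        print $html =~ $re ? "matched: $1\n" : "no match\n";
    }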
  21. (2) Hard to maintain
      Regular-expression-based scrapers are good …
      only when they're used in write-only scripts
  22. (3) Improper HTML & encoding handling
  23. <span class="message">I &hearts; Shibuya</span>
      (the one-liners below assume $c already holds the markup above)

      > perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
      I &hearts; Shibuya
  24. <span class="message">I &hearts; Shibuya</span>

      > perl -MHTML::Entities -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities($1)'
      I ♥ Shibuya
  25. <span class="message">Perl が大好き! </span>

      > perl -MHTML::Entities -MEncode -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities(decode_utf8($1))'
      Wide character in print at -e line 1.
      Perl が大好き!
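The "Wide character in print" warning comes from printing a decoded (character) string to a byte-oriented STDOUT. A minimal sketch of the usual remedy (not from the slides; it assumes the raw UTF-8 HTML is piped in on STDIN):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use HTML::Entities;

    # Encode output as UTF-8 so printing decoded text no longer warns.
    binmode STDOUT, ':utf8';

    my $c = do { local $/; <STDIN> };   # raw bytes, e.g. piped-in HTML
    if ($c =~ m@<span class="message">(.*?)</span>@) {
        print decode_entities(decode_utf8($1)), "\n";
    }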
  26. The "right" way of screen-scraping
  27. (1), (2) Maintainable, less fragile
  28. Use XPath and CSS Selectors
  29. XPath
      HTML::TreeBuilder::XPath
      XML::LibXML
  30. XPath
      <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

      use HTML::TreeBuilder::XPath;
      my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
      print $tree->findnodes('//strong[@id="ctu"]')->shift->as_text;
      # Monday, August 27, 2007 at 12:49:46
  31. CSS Selectors
      "XPath for HTML coders"
      "XPath for people who hate XML"
  32. CSS Selectors
      body { font-size: 12px; }
      div.article { padding: 1em }
      span#count { color: #fff }
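The mapping from CSS selectors to XPath is exactly what HTML::Selector::XPath provides. A small sketch (assuming the module is installed) that simply prints whatever XPath it generates for the selectors above:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Selector::XPath qw(selector_to_xpath);

    # Show the XPath equivalent of each selector from the slide.
    for my $sel ('body', 'div.article', 'span#count', 'strong#ctu') {
        printf "%-12s => %s\n", $sel, selector_to_xpath($sel);
    }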
  33. XPath: //strong[@id="ctu"]
      CSS Selector: strong#ctu
  34. CSS Selectors
      <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

      use HTML::TreeBuilder::XPath;
      use HTML::Selector::XPath qw(selector_to_xpath);
      my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
      my $xpath = selector_to_xpath "strong#ctu";
      print $tree->findnodes($xpath)->shift->as_text;
      # Monday, August 27, 2007 at 12:49:46
  35. Complete Script
      #!/usr/bin/perl
      use strict;
      use warnings;
      use Encode;
      use LWP::UserAgent;
      use HTTP::Response::Encoding;
      use HTML::TreeBuilder::XPath;
      use HTML::Selector::XPath qw(selector_to_xpath);

      my $ua = LWP::UserAgent->new;
      my $res = $ua->get("http://www.timeanddate.com/worldclock/");
      if ($res->is_error) {
          die "HTTP GET error: ", $res->status_line;
      }

      my $content = decode $res->encoding, $res->content;
      my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
      my $xpath = selector_to_xpath("strong#ctu");
      my $node = $tree->findnodes($xpath)->shift;
      print $node->as_text;
  36. Robust, Maintainable, and Sane character handling
  37. Example (before)
      <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

      > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
      Monday, August 27, 2007 at 12:49:46
  38. Example (after)
      #!/usr/bin/perl
      use strict;
      use warnings;
      use Encode;
      use LWP::UserAgent;
      use HTTP::Response::Encoding;
      use HTML::TreeBuilder::XPath;
      use HTML::Selector::XPath qw(selector_to_xpath);

      my $ua = LWP::UserAgent->new;
      my $res = $ua->get("http://www.timeanddate.com/worldclock/");
      if ($res->is_error) {
          die "HTTP GET error: ", $res->status_line;
      }

      my $content = decode $res->encoding, $res->content;
      my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
      my $xpath = selector_to_xpath("strong#ctu");
      my $node = $tree->findnodes($xpath)->shift;
      print $node->as_text;
  39. but … long and boring
  40. Practical Web Scraping with Web::Scraper
  41. Web scraping toolkit
      inspired by scrapi.rb
      DSL-ish
  42. Example (before)
      #!/usr/bin/perl
      use strict;
      use warnings;
      use Encode;
      use LWP::UserAgent;
      use HTTP::Response::Encoding;
      use HTML::TreeBuilder::XPath;
      use HTML::Selector::XPath qw(selector_to_xpath);

      my $ua = LWP::UserAgent->new;
      my $res = $ua->get("http://www.timeanddate.com/worldclock/");
      if ($res->is_error) {
          die "HTTP GET error: ", $res->status_line;
      }

      my $content = decode $res->encoding, $res->content;
      my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
      my $xpath = selector_to_xpath("strong#ctu");
      my $node = $tree->findnodes($xpath)->shift;
      print $node->as_text;
  43. Example (after)
      #!/usr/bin/perl
      use strict;
      use warnings;
      use Web::Scraper;
      use URI;

      my $s = scraper {
          process "strong#ctu", time => 'TEXT';
          result 'time';
      };

      my $uri = URI->new("http://timeanddate.com/worldclock/");
      print $s->scrape($uri);
  44. Basics
      use Web::Scraper;

      my $s = scraper {
          # DSL goes here
      };
      my $res = $s->scrape($uri);
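The Basics slide leaves $uri undefined; here is a self-contained sketch of the same pattern (the URL and selectors are placeholders, not from the slides, and fetching it needs network access):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Web::Scraper;
    use URI;

    # A scraper that collects every link's href and anchor text.
    my $s = scraper {
        process 'a', 'links[]' => '@href', 'texts[]' => 'TEXT';
    };

    my $res = $s->scrape(URI->new("http://example.com/"));
    print "$_\n" for @{ $res->{links} || [] };

scrape() can also be handed raw HTML content instead of a URI object, which is handy for testing against saved pages.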
  45. process
      process $selector,
          $key => $what,
          … ;
  46. $selector:
      CSS Selector, or XPath (starts with /)
  47. $key:
      key for the result hash
      append "[]" for looping
  48. $what:
      '@attr'
      'TEXT'
      'RAW'
      Web::Scraper
      sub { … }
      Hash reference
  49. <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
      </ul>
  50. process "ul.sites > li > a",
          'urls[]' => '@href';
      # { urls => [ … ] }

      <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
      </ul>
  51. process '//ul[@class="sites"]/li/a',
          'names[]' => 'TEXT';
      # { names => [ 'OpenGuides', … ] }

      <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
      </ul>
  52. process "ul.sites > li",
          'sites[]' => scraper {
              process 'a',
                  link => '@href', name => 'TEXT';
          };
      # { sites => [ { link => …, name => … },
      #              { link => …, name => … } ] }

      <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
      </ul>
  53. process "ul.sites > li > a",
          'sites[]' => sub {
              # $_ is HTML::Element
              +{ link => $_->attr('href'), name => $_->as_text };
          };
      # { sites => [ { link => …, name => … },
      #              { link => …, name => … } ] }

      <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
      </ul>
  54. process "ul.sites > li > a",
          'sites[]' => {
              link => '@href', name => 'TEXT'
          };
      # { sites => [ { link => …, name => … },
      #              { link => …, name => … } ] }

      <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
      </ul>
  55. result
      result;
      # get stash as hashref (default)

      result @keys;
      # get stash as hashref containing @keys

      result $key;
      # get value of stash $key

      my $s = scraper {
          process …;
          process …;
          result 'foo', 'bar';
      };
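Pulling the pieces together, a minimal end-to-end sketch of result with a single key (the markup is the sample list from the earlier slides; feeding scrape() an in-memory HTML string is an assumption that avoids network access):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Web::Scraper;

    my $html = <<'HTML';
    <ul class="sites">
      <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
      <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
    </ul>
    HTML

    # result 'names' returns just that stash value instead of the whole hashref.
    my $s = scraper {
        process 'ul.sites > li > a', 'names[]' => 'TEXT';
        result 'names';
    };

    my $names = $s->scrape($html);   # arrayref, thanks to result 'names'
    print "$_\n" for @$names;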
  56. Live Demo
  57. Tools
  58. > cpan Web::Scraper
      comes with 'scraper' CLI
  59. > scraper http://example.com/
      scraper> process "a", "links[]" => '@href';
      scraper> d
      $VAR1 = {
          links => [
              'http://example.org/',
              'http://example.net/',
          ],
      };
      scraper> y
      ---
      links:
        - http://example.org/
        - http://example.net/
  60. > scraper /path/to/foo.html
      > GET http://example.com/ | scraper
  61. Recent Updates
  62. 0.13
      'c' and 'c all'
      WARN in scraper
  63. 0.14
      automatic absolute URI for link elements (a@href, img@src)
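A quick sketch of what that 0.14 change means in practice (the URL is a placeholder and the run needs network access): when a document is scraped from a URI, '@href' on link elements comes back resolved against the page URI rather than as the raw relative string.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Web::Scraper;
    use URI;

    my $links = scraper {
        process 'a', 'links[]' => '@href';
        result 'links';
    };

    # Even if the page uses relative hrefs like "/about", the values returned
    # for a@href are absolute URIs resolved against the page URI
    # (behavior added in 0.14, per the slide).
    my $urls = $links->scrape(URI->new("http://example.com/"));
    print "$_\n" for @$urls;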
  64. 0.14 (cont.)
      'RAW' and 'HTML'
  65. 0.15
      $Web::Scraper::UserAgent
      $scraper->user_agent
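A minimal sketch of how these 0.15 hooks might be used; only the two names on the slide are taken from the talk, and the exact call styles (assigning an LWP::UserAgent to the package variable, passing one to the accessor) are assumptions:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Web::Scraper;
    use LWP::UserAgent;
    use URI;

    # Package-wide override: every scraper uses this agent.
    $Web::Scraper::UserAgent = LWP::UserAgent->new(
        agent   => 'MyScraper/0.1',
        timeout => 10,
    );

    my $s = scraper {
        process 'title', title => 'TEXT';
    };

    # Per-object override via the accessor named on the slide
    # (assumed to accept an LWP::UserAgent instance).
    $s->user_agent(LWP::UserAgent->new(agent => 'MyScraper/0.1'));

    print $s->scrape(URI->new("http://example.com/"))->{title}, "\n";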
  66. 0.19
      support encoding detection w/ META tags
  67. TODO
  68. Web::Scraper
      needs documentation
  69. More examples to put in the eg/ directory
  70. Alternative API inspired by scRUBYt!
  71. OO backend API, if you don't like the DSL
  72. Integrate with WWW::Mechanize and Test::WWW::Declare
  73. XPath auto-suggestion off of DOM + element
      DOM + XPath => Element
      DOM + Element => XPath?
      (Template::Extract?)
  74. Generic XML support (e.g. RSS/Atom feeds)
  75. Extensible text filters
      date, geo, hCards (microformats)

      <span class="entry-date">October 1st, 2007 17:13:31 +0900</span>
      process ".entry-date", date => 'TEXT :rfc822';
  76. Summary
  77. Web::Scraper, inspired by scrapi
  78. easy, fun, maintainable & less fragile
  79. CSS selector
      XPath
  80. Questions?
  81. Thank you
      http://search.cpan.org/dist/Web-Scraper
      http://www.slideshare.net/miyagawa/webscraper