Web::Scraper for SF.pm LT


Transcript

  • 1. Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa [email_address] Six Apart, Ltd. / Shibuya Perl Mongers SF.pm Lightning Talk
  • 2.
    • How many of you
    • have done
    • screen-scraping w/ Perl?
  • 3.
    • How many of you
    • have used
    • LWP::Simple and regexp?
  • 4.  
  • 5. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
  • 6.
    • <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
    • > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
    • Monday, August 27, 2007 at 12:49:46
  • 7.
    • It works!
  • 8. WWW::MySpace 0.70
  • 9. WWW::Search::Ebay 2.231
  • 10.
    • There are
    • 3 problems
    • (at least)
  • 11.
    • (1)
    • Fragile
    • Easy to break even with slight HTML changes
    • (like newlines, order of attributes etc.)
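
The fragility described in this slide can be seen with the same pattern as the one-liner above. This is a minimal sketch using core Perl only; the three HTML variants are made up for illustration (an extra attribute placed before `id`, and a line break inside the element), not taken from the real page.

```perl
use strict;
use warnings;

# The pattern from the one-liner above.
my $re = qr@<strong id="ctu">(.*?)</strong>@;

# Three renderings of the same element; only markup details differ.
# The second and third variants are hypothetical.
my %html = (
    original   => '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    attr_order => '<strong class="t" id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    newline    => qq{<strong id="ctu">Monday, August 27, 2007\nat 12:49:46</strong>},
);

my %matched;
for my $name (sort keys %html) {
    # "." does not match a newline without /s, and the literal
    # '<strong id="ctu">' does not tolerate a preceding attribute.
    $matched{$name} = $html{$name} =~ $re ? 1 : 0;
    printf "%-10s %s\n", $name, $matched{$name} ? "matched" : "no match";
}
```

Only the `original` variant matches; the other two render identically in a browser but break the scraper.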
  • 12.
    • (2)
    • Hard to maintain
    • Regular expression based scrapers are good
    • Only when they're used in write-only scripts
  • 13.
    • (3)
    • Improper
    • HTML & encoding
    • handling
  • 14.
    • <span class="message">I &hearts; Shibuya</span>
    • > perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
    • I &hearts; Shibuya
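
The captured text above still contains the raw `&hearts;` entity. As a hedged illustration of what proper handling would involve, the sketch below decodes the capture with `decode_entities` from HTML::Entities (a CPAN module, assumed to be installed). It is not part of the original slides, and it only patches one symptom of regexp-based scraping.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

my $c = '<span class="message">I &hearts; Shibuya</span>';

# The regexp returns the markup as written, entity included.
my ($raw) = $c =~ m@<span class="message">(.*?)</span>@;

# In non-void context decode_entities returns a decoded copy;
# "&hearts;" becomes the single character U+2665.
my $text = decode_entities($raw);

binmode STDOUT, ':encoding(UTF-8)';
print "raw:     $raw\n";
print "decoded: $text\n";
```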
  • 15.
    • Web::Scraper
    • to the rescue
  • 16.
    • Web scraping toolkit
    • inspired by scrapi.rb
    • DSL-ish
  • 17. Example
    • #!/usr/bin/perl
    • use strict;
    • use warnings;
    • use Web::Scraper;
    • use URI;
    • my $s = scraper {
    • process "strong#ctu", time => 'TEXT';
    • result 'time';
    • };
    • my $uri = URI->new("http://timeanddate.com/worldclock/");
    • print $s->scrape($uri);
  • 18. Basics
    • use Web::Scraper;
    • my $s = scraper {
    • # DSL goes here
    • };
    • my $res = $s->scrape($uri);
  • 19. process
    • process $selector,
    • $key => $what,
    • … ;
  • 20.
    • $selector:
    • CSS Selector
    • or
    • XPath (when it starts with /)
  • 21.
    • CSS Selector: strong#ctu
    • XPath: //strong[@id="ctu"]
    <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
  • 22.
    • $key:
    • key for the result hash
    • append "[]" for looping
  • 23.
    • $what:
    • '@attr'
    • 'TEXT'
    • 'RAW'
    • Web::Scraper (a nested scraper { … })
    • sub { … }
    • Hash reference
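
The `$key` and `$what` options listed above can be tried against an inline HTML string instead of a live URL. The following is a minimal sketch, assuming Web::Scraper is installed from CPAN; the key names (`first_url`, `names`, `shouted`, `sites`) are made up for illustration, and the HTML mirrors the list shown in the next slides.

```perl
use strict;
use warnings;
use Web::Scraper;

my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $s = scraper {
    # '@attr' without "[]": only the first matching element is used
    process "ul.sites > li > a", first_url => '@href';
    # 'TEXT' with "[]": one entry per matching element
    process "ul.sites > li > a", 'names[]' => 'TEXT';
    # sub { ... }: the callback receives the matched HTML::Element
    process "ul.sites > li > a", 'shouted[]' => sub { uc($_[0]->as_text) };
    # nested scraper: runs against each matched <li>
    process "ul.sites > li", 'sites[]' => scraper {
        process "a", link => '@href', name => 'TEXT';
    };
};

# scrape() also accepts a reference to an HTML string
my $res = $s->scrape(\$html);

print "first_url: $res->{first_url}\n";
print "names:     @{ $res->{names} }\n";
print "shouted:   @{ $res->{shouted} }\n";
print "site:      $_->{name} <$_->{link}>\n" for @{ $res->{sites} };
```

Passing a string reference keeps the example independent of the network; with a `URI` object, as in the slides, Web::Scraper fetches the page itself.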
  • 24. <ul class="sites"> <li><a href="http://vienna.openguides.org/">OpenGuides</a></li> <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li> </ul>
  • 25.
    • process "ul.sites > li > a",
    • 'urls[]' => '@href';
    • # { urls => [ … ] }
    <ul class="sites"> <li><a href="http://vienna.openguides.org/">OpenGuides</a></li> <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li> </ul>
  • 26.
    • process '//ul[@class="sites"]/li/a',
    • 'names[]' => 'TEXT';
    • # { names => [ 'OpenGuides', … ] }
    <ul class="sites"> <li><a href="http://vienna.openguides.org/">OpenGuides</a></li> <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li> </ul>
  • 27.
    • process "ul.sites > li > a",
    • 'sites[]' => {
    • link => '@href', name => 'TEXT',
    • };
    • # { sites => [ { link => …, name => … },
    • # { link => …, name => … } ] };
    <ul class="sites"> <li><a href="http://vienna.openguides.org/">OpenGuides</a></li> <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li> </ul>
  • 28.
    • Tools
  • 29.
    • > cpan Web::Scraper
    • comes with 'scraper' CLI
  • 30.
    • > scraper http://example.com/
    • scraper> process "a", "links[]" => '@href';
    • scraper> d
    • $VAR1 = {
    • links => [
    • 'http://example.org/',
    • 'http://example.net/',
    • ],
    • };
    • scraper> y
    • ---
    • links:
    • - http://example.org/
    • - http://example.net/
  • 31.
    • > scraper /path/to/foo.html
    • > GET http://example.com/ | scraper
  • 32.
    • Demo
  • 33.
    • Thank you
    • http://search.cpan.org/dist/Web-Scraper
    • http://www.slideshare.net/miyagawa/webscraper