Web::Scraper for SF.pm LT
Presentation Transcript

  • Practical Web Scraping with Web::Scraper
    Tatsuhiko Miyagawa [email_address]
    Six Apart, Ltd. / Shibuya Perl Mongers
    SF.pm Lightning Talk
    • How many of you
    • have done
    • screen-scraping w/ Perl?
    • How many of you
    • have used
    • LWP::Simple and regexp?
  • <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
  • > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
    Monday, August 27, 2007 at 12:49:46
    • It works!
  • WWW::MySpace 0.70
  • WWW::Search::Ebay 2.231
    • There are
    • 3 problems
    • (at least)
    • (1)
    • Fragile
    • Easy to break even with slight HTML changes
    • (like newlines, order of attributes etc.)
    • (2)
    • Hard to maintain
    • Regular-expression-based scrapers are good
    • only when they're used in write-only scripts
    • (3)
    • Improper
    • HTML & encoding
    • handling
  • <span class="message">I &hearts; Shibuya</span>
  • > perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
    I &hearts; Shibuya
    • Web::Scraper
    • to the rescue
    • Web scraping toolkit
    • inspired by scrapi.rb
    • DSL-ish
  • Example
    • #!/usr/bin/perl
    • use strict;
    • use warnings;
    • use Web::Scraper;
    • use URI;
    • my $s = scraper {
    • process "strong#ctu", time => 'TEXT';
    • result 'time';
    • };
    • my $uri = URI->new("http://timeanddate.com/worldclock/");
    • print $s->scrape($uri);
  • Basics
    • use Web::Scraper;
    • my $s = scraper {
    • # DSL goes here
    • };
    • my $res = $s->scrape($uri);
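As a self-contained sketch of this basic flow: `scrape` also accepts a raw HTML string (not just a `URI` object), which keeps an experiment runnable without network access. The markup below is adapted from the clock example earlier in the talk:

```perl
use strict;
use warnings;
use Web::Scraper;

# scrape() accepts a URI object, a raw HTML string, or a filehandle;
# a string makes the example self-contained.
my $html = '<td>Current <strong>UTC</strong> time: '
         . '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong></td>';

my $s = scraper {
    process "strong#ctu", time => 'TEXT';
    result 'time';
};

print $s->scrape($html), "\n";   # Monday, August 27, 2007 at 12:49:46
```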
  • process
    • process $selector,
    • $key => $what,
    • … ;
    • $selector:
    • CSS Selector
    • or
    • XPath (start with /)
    • CSS Selector: strong#ctu
    • XPath: //strong[@id="ctu"]
    <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
    • $key:
    • key for the result hash
    • append "[]" for looping
    • $what:
    • '@attr'
    • 'TEXT'
    • 'RAW'
    • Web::Scraper
    • sub { … }
    • Hash reference
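The `sub { … }` form of `$what` receives the matched element (an `HTML::Element`) in `$_`, so you can compute an arbitrary value per match. A small sketch; the uppercasing is purely illustrative:

```perl
use strict;
use warnings;
use Web::Scraper;

my $s = scraper {
    # coderef $what: called once per match with the HTML::Element in $_
    process "a", 'links[]' => sub { uc $_->as_text };
};

my $res = $s->scrape('<a href="/x">foo</a> <a href="/y">bar</a>');
# $res->{links} is [ 'FOO', 'BAR' ]
```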
  • <ul class="sites"> <li><a href="http://vienna.openguides.org/">OpenGuides</a></li> <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li> </ul>
    • process "ul.sites > li > a",
    • 'urls[]' => '@href';
    • # { urls => [ … ] }
    <ul class="sites"> <li><a href="http://vienna.openguides.org/">OpenGuides</a></li> <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li> </ul>
    • process '//ul[@class="sites"]/li/a',
    • 'names[]' => 'TEXT';
    • # { names => [ 'OpenGuides', … ] }
    <ul class="sites"> <li><a href="http://vienna.openguides.org/">OpenGuides</a></li> <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li> </ul>
    • process "ul.sites > li > a",
    • 'sites[]' => {
    • link => '@href', name => 'TEXT',
    • };
    • # { sites => [ { link => …, name => … },
    • # { link => …, name => … } ] };
    <ul class="sites"> <li><a href="http://vienna.openguides.org/">OpenGuides</a></li> <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li> </ul>
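The nested-hash rule above can be checked end to end against the same markup; this sketch just turns the slide's example into a runnable script:

```perl
use strict;
use warnings;
use Web::Scraper;

my $html = <<'HTML';
<ul class="sites">
  <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
  <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $s = scraper {
    # each matched <a> yields a sub-hash built from its own $what rules
    process "ul.sites > li > a",
        'sites[]' => { link => '@href', name => 'TEXT' };
};

my $res = $s->scrape($html);
for my $site (@{ $res->{sites} }) {
    print "$site->{name}: $site->{link}\n";
}
```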
    • Tools
    • > cpan Web::Scraper
    • comes with 'scraper' CLI
    • > scraper http://example.com/
    • scraper> process "a", "links[]" => '@href';
    • scraper> d
    • $VAR1 = {
    • links => [
    • 'http://example.org/',
    • 'http://example.net/',
    • ],
    • };
    • scraper> y
    • ---
    • links:
    • - http://example.org/
    • - http://example.net/
    • > scraper /path/to/foo.html
    • > GET http://example.com/ | scraper
    • Demo
    • Thank you
    • http://search.cpan.org/dist/Web-Scraper
    • http://www.slideshare.net/miyagawa/webscraper