Web::Scraper for SF.pm LT
 


    Presentation Transcript

    • Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa [email_address] Six Apart, Ltd. / Shibuya Perl Mongers SF.pm Lightning Talk
      • How many of you have done screen-scraping w/ Perl?
      • How many of you have used LWP::Simple and regexp?
    • <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
    • > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
      Monday, August 27, 2007 at 12:49:46
      • It works!
    • WWW::MySpace 0.70
    • WWW::Search::Ebay 2.231
      • There are
      • 3 problems
      • (at least)
      • (1)
      • Fragile
      • Easy to break even with slight HTML changes
      • (like newlines, order of attributes etc.)
      • (2)
      • Hard to maintain
      • Regular expression based scrapers are good
      • Only when they're used in write-only scripts
      • (3)
      • Improper
      • HTML & encoding
      • handling
    • <span class="message">I &hearts; Shibuya</span>
    • > perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
      I &hearts; Shibuya
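The entity problem on this slide can be sketched in a few runnable lines. This assumes the CPAN module HTML::Entities (shipped with the HTML::Parser distribution) is available: the raw regex capture keeps the literal &hearts; entity, and only an explicit decode step recovers the character.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;  # decode_entities; part of the HTML::Parser dist on CPAN

my $html = '<span class="message">I &hearts; Shibuya</span>';

# Naive regex scraping returns the text with entities still encoded
my ($raw) = $html =~ m@<span class="message">(.*?)</span>@;
print "$raw\n";                       # I &hearts; Shibuya

# An HTML-aware step decodes entities into real characters
print decode_entities($raw), "\n";    # I ♥ Shibuya
```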
      • Web::Scraper
      • to the rescue
      • Web scraping toolkit
      • inspired by scrapi.rb
      • DSL-ish
    • Example

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Web::Scraper;
        use URI;

        my $s = scraper {
            process "strong#ctu", time => 'TEXT';
            result 'time';
        };

        my $uri = URI->new("http://timeanddate.com/worldclock/");
        print $s->scrape($uri);
    • Basics

        use Web::Scraper;

        my $s = scraper {
            # DSL goes here
        };
        my $res = $s->scrape($uri);
    • process

        process $selector,
            $key => $what,
            … ;

      • $selector: CSS Selector, or XPath (starts with /)

        CSS Selector: strong#ctu
        XPath: //strong[@id="ctu"]

        <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
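As a sketch of the equivalence above (assuming Web::Scraper is installed from CPAN; its scrape() also accepts an HTML string, not just a URI), the CSS selector and the XPath expression address the same element:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;  # CPAN module; this sketch assumes it is installed

# The slide's <td> wrapped in a table so the HTML parser keeps it
my $html = '<table><tr><td>Current <strong>UTC</strong> time used:'
         . ' <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>'
         . '</td></tr></table>';

my $css   = scraper { process 'strong#ctu',          time => 'TEXT' };
my $xpath = scraper { process '//strong[@id="ctu"]', time => 'TEXT' };

# Both selectors match the same node, so both scrapes agree
print $css->scrape($html)->{time},   "\n";
print $xpath->scrape($html)->{time}, "\n";
```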
      • $key: key for the result hash; append "[]" for looping
      • $what:
        '@attr'
        'TEXT'
        'RAW'
        Web::Scraper
        sub { … }
        Hash reference
    • <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
      </ul>

        process "ul.sites > li > a",
            'urls[]' => '@href';
        # { urls => [ … ] }

        process '//ul[@class="sites"]/li/a',
            'names[]' => 'TEXT';
        # { names => [ 'OpenGuides', … ] }

        process "ul.sites > li > a",
            'sites[]' => {
                link => '@href', name => 'TEXT',
            };
        # { sites => [ { link => …, name => … },
        #              { link => …, name => … } ] }
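Putting the hash-reference form together as a self-contained script (again assuming Web::Scraper from CPAN is installed; the slide's HTML is fed in as a string rather than fetched from a URI):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;  # CPAN module; this sketch assumes it is installed

my $html = <<'HTML';
<ul class="sites">
  <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
  <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $s = scraper {
    # One sub-hash per matched <a>: its href and its text
    process "ul.sites > li > a",
        'sites[]' => { link => '@href', name => 'TEXT' };
};

my $res = $s->scrape($html);
for my $site (@{ $res->{sites} }) {
    print "$site->{name}: $site->{link}\n";
}
# OpenGuides: http://vienna.openguides.org/
# YAPC::Europe: http://vienna.yapceurope.org/
```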
      • Tools

        > cpan Web::Scraper

        comes with the 'scraper' CLI:

        > scraper http://example.com/
        scraper> process "a", "links[]" => '@href';
        scraper> d
        $VAR1 = {
            links => [
                'http://example.org/',
                'http://example.net/',
            ],
        };
        scraper> y
        ---
        links:
          - http://example.org/
          - http://example.net/

        > scraper /path/to/foo.html
        > GET http://example.com/ | scraper
      • Demo
      • Thank you
      • http://search.cpan.org/dist/Web-Scraper
      • http://www.slideshare.net/miyagawa/webscraper