Web::Scraper for SF.pm LT

  • guest47ce76, 2 years ago
    This is a good example of screen scraping. But I think that if a site is built entirely in Flash, like http://cricinfo.tv, then this module is useless. Also, if a site loads its Flash videos dynamically, then this module, or any other Perl module, cannot help. If you know a way to solve this riddle, please get back to me at shekarkcb@gmail.com.

    Thanks for the presentation.

Web::Scraper for SF.pm LT - Presentation Transcript

  1. Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa [email_address] Six Apart, Ltd. / Shibuya Perl Mongers SF.pm Lightning Talk
    • How many of you have done screen-scraping w/ Perl?
    • How many of you have used LWP::Simple and regexp?
  2.  
  3. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
  4. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
    > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
    Monday, August 27, 2007 at 12:49:46
    • It works!
  5. WWW::MySpace 0.70
  6. WWW::Search::Ebay 2.231
    • There are 3 problems (at least)
    • (1) Fragile: easy to break even with slight HTML changes (like newlines, order of attributes etc.)
    • (2) Hard to maintain: regular expression based scrapers are good only when they're used in write-only scripts
    • (3) Improper HTML & encoding handling
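
    The fragility in problem (1) can be reproduced without any network access. The sketch below reuses the pattern from the one-liner on slide 4 and runs it against three variants of the same tag; the `extra_attr` and `newline` variants are hypothetical page changes invented for illustration, not markup taken from the real site.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # The pattern from the one-liner, tied to one exact spelling of the tag.
    my $re = qr{<strong id="ctu">(.*?)</strong>};

    # Hypothetical variants of the same markup; a browser renders them alike.
    my %variants = (
        original   => '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
        extra_attr => '<strong class="t" id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
        newline    => qq{<strong id="ctu">Monday, August 27, 2007\nat 12:49:46</strong>},
    );

    # Record which variants the regular expression still recognizes.
    my %matched;
    for my $name (sort keys %variants) {
        $matched{$name} = $variants{$name} =~ $re ? 1 : 0;
        printf "%-10s %s\n", $name, $matched{$name} ? 'matched' : 'no match';
    }
    ```

    Only the original spelling matches: an added attribute changes the literal text the pattern expects, and `.` does not cross a newline unless the `/s` modifier is given.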
  7. <span class="message">I &hearts; Shibuya</span>
    > perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
    I &hearts; Shibuya
    • Web::Scraper to the rescue
    • Web scraping toolkit inspired by scrapi.rb
    • DSL-ish
  8. Example
    • #!/usr/bin/perl
    • use strict;
    • use warnings;
    • use Web::Scraper;
    • use URI;
    • my $s = scraper {
    • process "strong#ctu", time => 'TEXT';
    • result 'time';
    • };
    • my $uri = URI->new("http://timeanddate.com/worldclock/");
    • print $s->scrape($uri);
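
    The same scraper can be tried offline. The sketch below is a variation on the slide's example, not the slide's own code: it assumes the page fragment from slide 3 has been saved into a string (the surrounding `<html>` and `<table>` wrapper is added here for illustration) and passes a reference to that string to `scrape` instead of a URI.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Web::Scraper;

    # Saved fragment of the page, so no network access is needed.
    # The wrapper elements around the <td> are an assumption.
    my $html = <<'HTML';
    <html><body><table><tr>
    <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
    <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br /></td>
    </tr></table></body></html>
    HTML

    my $s = scraper {
        # Take the text of the element matching the CSS selector.
        process 'strong#ctu', time => 'TEXT';
        # Return only the 'time' value instead of the whole result hash.
        result 'time';
    };

    # scrape() is given a reference to the HTML string rather than a URI object.
    my $time = $s->scrape(\$html);
    print "$time\n";
    ```

    Working from a saved copy keeps the selector logic testable even when the live page changes or is unreachable.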
  9. Basics
    • use Web::Scraper;
    • my $s = scraper {
    • # DSL goes here
    • };
    • my $res = $s->scrape($uri);
  10. process
    • process $selector,
    • $key => $what,
    • … ;
    • $selector:
    • CSS Selector
    • or
    • XPath (starts with /)
    • CSS Selector: strong#ctu
    • XPath: //strong[@id="ctu"]
    <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
    • $key:
    • key for the result hash
    • append "[]" for looping
    • $what:
    • '@attr'
    • 'TEXT'
    • 'RAW'
    • Web::Scraper object (a nested scraper)
    • sub { … }
    • Hash reference
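
    The selector and `$what` forms listed above can be compared side by side. This is a minimal sketch, assuming the same saved fragment as on the slide (the `<html>` wrapper is added for illustration); the result keys `css`, `xpath`, `id` and `year` are names chosen for this example.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Web::Scraper;

    # Saved fragment from the slide; the wrapper elements are an assumption.
    my $html = <<'HTML';
    <html><body><table><tr>
    <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
    <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br /></td>
    </tr></table></body></html>
    HTML

    my $s = scraper {
        # The same element addressed by CSS selector and by XPath.
        process 'strong#ctu',          css   => 'TEXT';
        process '//strong[@id="ctu"]', xpath => 'TEXT';
        # '@attr' form: the value of the id attribute.
        process 'strong#ctu',          id    => '@id';
        # sub { ... } form: the callback receives the matched element.
        process 'strong#ctu',          year  => sub {
            my $el = shift;
            my ($y) = $el->as_text =~ /(\d{4})/;
            return $y;
        };
    };

    # Without a 'result' line, scrape() returns the whole result hash.
    my $res = $s->scrape(\$html);
    print "$_: $res->{$_}\n" for sort keys %$res;
    ```

    Both selector styles address the same node here, so the choice is mostly about which one reads better for a given page.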
  11. <ul class="sites"> <li><a href="http://vienna.openguides.org/">OpenGuides</a></li> <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li> </ul>
    • process "ul.sites > li > a",
    • 'urls[]' => '@href';
    • # { urls => [ … ] }
    <ul class="sites"> <li><a href="http://vienna.openguides.org/">OpenGuides</a></li> <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li> </ul>
    • process '//ul[@class="sites"]/li/a',
    • 'names[]' => 'TEXT';
    • # { names => [ 'OpenGuides', … ] }
    <ul class="sites"> <li><a href="http://vienna.openguides.org/">OpenGuides</a></li> <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li> </ul>
    • process "ul.sites > li > a",
    • 'sites[]' => {
    • link => '@href', name => 'TEXT'
    • };
    • # { sites => [ { link => …, name => … },
    • # { link => …, name => … } ] };
    <ul class="sites"> <li><a href="http://vienna.openguides.org/">OpenGuides</a></li> <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li> </ul>
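
    Put together as a complete script, the hash-reference form might look like the sketch below. It assumes the list markup from the slide is available as a string; the loop that prints the result is added here for illustration.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Web::Scraper;

    # The list markup from the slide, held in a string.
    my $html = <<'HTML';
    <ul class="sites">
    <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
    <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
    </ul>
    HTML

    my $s = scraper {
        # "[]" collects one entry per matching <a>; the hash reference
        # builds a small record from each matched element.
        process 'ul.sites > li > a', 'sites[]' => {
            link => '@href',
            name => 'TEXT',
        };
    };

    my $res   = $s->scrape(\$html);
    my @sites = @{ $res->{sites} };

    # Stringify the link in case the module returns it as an object.
    print "$_->{name}: $_->{link}\n" for @sites;
    ```

    Each entry in `sites` is a hash with `link` and `name` keys, matching the structure sketched in the comment on the slide.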
    • Tools
    • > cpan Web::Scraper
    • comes with 'scraper' CLI
    • > scraper http://example.com/
    • scraper> process "a", "links[]" => '@href';
    • scraper> d
    • $VAR1 = {
    • links => [
    • 'http://example.org/',
    • 'http://example.net/',
    • ],
    • };
    • scraper> y
    • ---
    • links:
    • - http://example.org/
    • - http://example.net/
    • > scraper /path/to/foo.html
    • > GET http://example.com/ | scraper
    • Demo
    • Thank you
    • http://search.cpan.org/dist/Web-Scraper
    • http://www.slideshare.net/miyagawa/webscraper

Tatsuhiko Miyagawa, 2 years ago


© All Rights Reserved
