Web::Scraper for SF.pm LT

From miyagawa, 8 months ago

Slideshow transcript

Slide 1: Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa miyagawa@gmail.com Six Apart, Ltd. / Shibuya Perl Mongers SF.pm Lightning Talk

Slide 2: How many of you have done screen-scraping w/ Perl? Tatsuhiko Miyagawa 2007/11/027 SF.pm Lightning Talks

Slide 3: How many of you have used LWP::Simple and regexp?

Slide 4:

Slide 5: <td>Current <strong>UTC</strong> (or GMT/Zulu)- time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

Slide 6: <td>Current <strong>UTC</strong> (or GMT/Zulu)- time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
    > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
    Monday, August 27, 2007 at 12:49:46

Slide 7: It works!

Slide 8: WWW::MySpace 0.70

Slide 9: WWW::Search::Ebay 2.231

Slide 10: There are 3 problems (at least)

Slide 11: (1) Fragile: easy to break with even slight HTML changes (newlines, order of attributes, etc.)
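To make the fragility concrete, here is a minimal sketch using only core Perl. It reuses the pattern from the one-liner on Slide 6 and runs it against a few hand-written variants of the same markup. The helper `extract_time` and the variant strings are illustrative assumptions, not part of the talk.

```perl
use strict;
use warnings;

# Same pattern as the Slide 6 one-liner: it only matches one exact
# spelling of the opening tag, and "." does not cross newlines.
sub extract_time {
    my ($html) = @_;
    return $html =~ m@<strong id="ctu">(.*?)</strong>@ ? $1 : undef;
}

# Hypothetical variants of the markup, all equivalent to a browser.
my %variants = (
    original       => '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    extra_attr     => '<strong class="big" id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    single_quotes  => "<strong id='ctu'>Monday, August 27, 2007 at 12:49:46</strong>",
    newline_inside => "<strong id=\"ctu\">Monday, August 27, 2007\nat 12:49:46</strong>",
);

for my $name (sort keys %variants) {
    my $time = extract_time($variants{$name});
    printf "%-15s %s\n", $name, defined $time ? "match: $time" : 'no match';
}
```

Only the `original` variant matches. The other three fail even though they describe the same element.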

Slide 12: (2) Hard to maintain: regular-expression-based scrapers are good only when they're used in write-only scripts

Slide 13: (3) Improper HTML & encoding handling

Slide 14: <span class="message">I &hearts; Shibuya</span>
    > perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
    I &hearts; Shibuya
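The point of this slide is that a regex capture returns the raw markup, entity and all. As a sketch of the missing step, assuming the CPAN module HTML::Entities is installed (the talk itself does not mention it), the captured text can be decoded explicitly:

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

my $html = '<span class="message">I &hearts; Shibuya</span>';

# The regex hands back the text exactly as written in the source.
my ($raw) = $html =~ m@<span class="message">(.*?)</span>@;

# In non-void context decode_entities returns a decoded copy
# and leaves $raw untouched.
my $decoded = decode_entities($raw);

binmode STDOUT, ':encoding(UTF-8)';
print "raw:     $raw\n";
print "decoded: $decoded\n";
```

Here `$raw` still contains `&hearts;`, while `$decoded` contains the character U+2665. An HTML parser does this decoding as part of building the tree, which is one reason to prefer it over pattern matching.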

Slide 15: Web::Scraper to the rescue

Slide 16: Web scraping toolkit, inspired by scrapi.rb; DSL-ish

Slide 17: Example
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Web::Scraper;
    use URI;
    my $s = scraper {
        process "strong#ctu", time => 'TEXT';
        result 'time';
    };
    my $uri = URI->new("http://timeanddate.com/worldclock/");
    print $s->scrape($uri);

Slide 18: Basics
    use Web::Scraper;
    my $s = scraper {
        # DSL goes here
    };
    my $res = $s->scrape($uri);

Slide 19: process
    process $selector, $key => $what, …;

Slide 20: $selector: CSS selector or XPath (an XPath starts with /)

Slide 21: <td>Current <strong>UTC</strong> (or GMT/Zulu)- time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
    CSS Selector: strong#ctu
    XPath: //strong[@id="ctu"]
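As a sketch of the two selector styles side by side, the following runs both against an inline copy of the markup instead of the live site, so it needs no network access. It assumes Web::Scraper is installed; the `<table>` wrapper and the variable names are added here for illustration.

```perl
use strict;
use warnings;
use Web::Scraper;

# Inline copy of the snippet from the slide, wrapped in a table
# so the fragment parses as ordinary HTML.
my $html = <<'HTML';
<table><tr><td>Current <strong>UTC</strong> (or GMT/Zulu)- time used:
<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br /></td></tr></table>
HTML

# CSS selector form.
my $by_css = scraper {
    process 'strong#ctu', time => 'TEXT';
    result 'time';
};

# XPath form (recognized because it starts with "/").
my $by_xpath = scraper {
    process '//strong[@id="ctu"]', time => 'TEXT';
    result 'time';
};

# scrape() accepts an HTML string as well as a URI object.
my $css_time   = $by_css->scrape($html);
my $xpath_time = $by_xpath->scrape($html);

print "CSS:   $css_time\n";
print "XPath: $xpath_time\n";
```

Both scrapers should select the same element, so the two printed values are expected to be identical.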

Slide 22: $key: key for the result hash; append "[]" for looping

Slide 23: $what: '@attr', 'TEXT', 'RAW', a Web::Scraper object, sub { … }, or a hash reference
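Two of the less obvious `$what` forms are the callback and the nested scraper. The sketch below tries both on an inline HTML list, assuming Web::Scraper is installed. The key names `shouted` and `items` are made up for this example.

```perl
use strict;
use warnings;
use Web::Scraper;

my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $s = scraper {
    # sub { ... }: the callback receives the matched HTML::Element
    # and its return value is stored under the key.
    process 'ul.sites > li > a', 'shouted[]' => sub {
        my $elem = shift;
        return uc $elem->as_text;
    };

    # Nested scraper: run another scraper block against each matched <li>.
    process 'ul.sites > li', 'items[]' => scraper {
        process 'a', name => 'TEXT', url => '@href';
    };
};

my $res = $s->scrape($html);

print join(', ', @{ $res->{shouted} }), "\n";
print "$_->{name} => $_->{url}\n" for @{ $res->{items} };
```

The nested form is useful when each repeated element has several fields that should stay grouped together.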

Slide 24:
    <ul class="sites">
    <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
    <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
    </ul>

Slide 25:
    <ul class="sites">
    <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
    <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
    </ul>
    process "ul.sites > li > a", 'urls[]' => '@href';
    # { urls => [ … ] }

Slide 26:
    <ul class="sites">
    <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
    <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
    </ul>
    process '//ul[@class="sites"]/li/a', 'names[]' => 'TEXT';
    # { names => [ 'OpenGuides', … ] }

Slide 27:
    <ul class="sites">
    <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
    <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
    </ul>
    process "ul.sites > li > a", 'sites[]' => { link => '@href', name => 'TEXT' };
    # { sites => [ { link => …, name => … },
    #              { link => …, name => … } ] };
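Putting the hash-reference form into a complete script, a minimal sketch might look like this. It assumes Web::Scraper is installed and scrapes an inline copy of the list; the variable names are illustrative.

```perl
use strict;
use warnings;
use Web::Scraper;

my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $site_scraper = scraper {
    # One hash per matched <a>, collected into an array because of "[]".
    process 'ul.sites > li > a',
        'sites[]' => { link => '@href', name => 'TEXT' };
    # Return just the array reference rather than the whole result hash.
    result 'sites';
};

my $sites = $site_scraper->scrape($html);

for my $site (@$sites) {
    print "$site->{name}: $site->{link}\n";
}
```

No base URI is passed to `scrape` here, so the `link` values are the `href` strings as they appear in the markup.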

Slide 28: Tools

Slide 29:
    > cpan Web::Scraper
    comes with 'scraper' CLI

Slide 30:
    > scraper http://example.com/
    scraper> process "a", "links[]" => '@href';
    scraper> d
    $VAR1 = {
        links => [
            'http://example.org/',
            'http://example.net/',
        ],
    };
    scraper> y
    ---
    links:
      - http://example.org/
      - http://example.net/

Slide 31:
    > scraper /path/to/foo.html
    > GET http://example.com/ | scraper

Slide 32: Demo

Slide 33: Thank you
    http://search.cpan.org/dist/Web-Scraper
    http://www.slideshare.net/miyagawa/webscraper