Web::Scraper for SF.pm LT

Published in: Technology

Transcript

  • 1. Practical Web Scraping with Web::Scraper. Tatsuhiko Miyagawa [email_address], Six Apart, Ltd. / Shibuya Perl Mongers. SF.pm Lightning Talk.
  • 2. How many of you have done screen-scraping w/ Perl?
  • 3. How many of you have used LWP::Simple and regexp?
  • 4. (no text)
  • 5. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
  • 6. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
       > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
       Monday, August 27, 2007 at 12:49:46
  • 7. It works!
  • 8. WWW::MySpace 0.70
  • 9. WWW::Search::Ebay 2.231
  • 10. There are 3 problems (at least)
  • 11. (1) Fragile: easy to break with even slight HTML changes (newlines, order of attributes, etc.)
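The fragility described on slide 11 can be sketched with core Perl alone. The pattern is the one from the slide 6 one-liner; the markup variants other than the first are hypothetical, the timestamp is shortened for readability, and `extract_time` is a helper name made up for this sketch.

```perl
use strict;
use warnings;

# Same pattern as the slide 6 one-liner, wrapped in a helper
# (extract_time is a made-up name for this sketch).
sub extract_time {
    my ($html) = @_;
    return $html =~ m{<strong id="ctu">(.*?)</strong>} ? $1 : undef;
}

# Only the first variant mirrors the slides; the rest are hypothetical
# "slight HTML changes" of the kind slide 11 mentions.
my @variants = (
    [ original        => '<strong id="ctu">12:49:46</strong>' ],
    [ extra_attribute => '<strong class="t" id="ctu">12:49:46</strong>' ],
    [ single_quotes   => "<strong id='ctu'>12:49:46</strong>" ],
    [ newline_in_text => "<strong id=\"ctu\">12:49:46\n</strong>" ],
);

for my $variant (@variants) {
    my ($name, $html) = @$variant;
    my $time = extract_time($html);
    printf "%-16s %s\n", $name, defined $time ? "matched: $time" : 'no match';
}
```

Only the original markup should match. An added attribute, a different quote style, or a newline inside the element (`.` does not match a newline without the `/s` flag) each defeats the pattern, even though a browser would render all four the same way.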
  • 12. (2) Hard to maintain: regular-expression-based scrapers are good only when they're used in write-only scripts
  • 13. (3) Improper HTML & encoding handling
  • 14. <span class="message">I &hearts; Shibuya</span>
        > perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
        I &hearts; Shibuya
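To make the entity problem on slide 14 concrete, here is a minimal sketch that runs the regex and a Web::Scraper 'TEXT' extraction side by side on the same markup. It assumes Web::Scraper is installed and passes the HTML to `scrape` as a string rather than fetching a URL; the variable names are illustrative.

```perl
use strict;
use warnings;
use Web::Scraper;

my $html = '<span class="message">I &hearts; Shibuya</span>';

# Regex approach: captures the markup verbatim, so the entity stays encoded.
my ($raw) = $html =~ m{<span class="message">(.*?)</span>};

# Web::Scraper approach: 'TEXT' returns the text of the parsed element.
my $s = scraper {
    process 'span.message', message => 'TEXT';
};
my $res = $s->scrape($html);

binmode STDOUT, ':encoding(UTF-8)';
print "regex:   $raw\n";
print "scraper: $res->{message}\n";
```

The regex line should still show the raw `&hearts;` entity, while the scraper line should show the decoded heart character, because the HTML parser underneath decodes entities before the text is returned.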
  • 15. Web::Scraper to the rescue
  • 16. Web scraping toolkit, inspired by scrapi.rb, DSL-ish
  • 17. Example
        #!/usr/bin/perl
        use strict;
        use warnings;
        use Web::Scraper;
        use URI;

        my $s = scraper {
            process "strong#ctu", time => 'TEXT';
            result 'time';
        };

        my $uri = URI->new("http://timeanddate.com/worldclock/");
        print $s->scrape($uri);
  • 18. Basics
        use Web::Scraper;

        my $s = scraper {
            # DSL goes here
        };

        my $res = $s->scrape($uri);
  • 19. process
        process $selector,
            $key => $what,
            … ;
  • 20. $selector: CSS selector or XPath (starts with /)
  • 21. CSS Selector: strong#ctu
        XPath: //strong[@id="ctu"]
        <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
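The two selector styles on slide 21 can be compared directly. This is a sketch under a few assumptions: Web::Scraper is installed, the HTML is passed to `scrape` as a string instead of a URI, and the `<table><tr>` wrapper is added here only so the slide's `<td>` fragment parses as a well-formed table.

```perl
use strict;
use warnings;
use Web::Scraper;

# The <td> fragment is from the slides; the table/tr wrapper is an
# assumption added so the fragment is well-formed.
my $html = '<table><tr><td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: '
         . '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br /></td></tr></table>';

# CSS selector form.
my $by_css = scraper {
    process 'strong#ctu', time => 'TEXT';
};

# XPath form: selectors starting with "/" are treated as XPath.
my $by_xpath = scraper {
    process '//strong[@id="ctu"]', time => 'TEXT';
};

my $css_time   = $by_css->scrape($html)->{time};
my $xpath_time = $by_xpath->scrape($html)->{time};

print "CSS:   $css_time\n";
print "XPath: $xpath_time\n";
```

Both scrapers should pick out the same element. Neither selects the first `<strong>` (the one containing "UTC"), since the selection is by `id` rather than by position in the markup.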
  • 22. $key: key for the result hash; append "[]" for looping
  • 23. $what: '@attr', 'TEXT', 'RAW', Web::Scraper, sub { … }, Hash reference
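Two of the less obvious `$what` forms from slide 23, a code reference and a nested Web::Scraper, might look like the sketch below. It assumes Web::Scraper is installed, reuses the list markup from the following slides, and scrapes from a string; the keys `lengths` and `items` are made up for illustration.

```perl
use strict;
use warnings;
use Web::Scraper;

my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $s = scraper {
    # Code reference: called with the matched element; its return value is stored.
    process 'ul.sites > li > a', 'lengths[]' => sub {
        my $elem = shift;
        return length $elem->as_text;
    };

    # Nested scraper: run against each matched <li>, yielding one hash per item.
    process 'ul.sites > li', 'items[]' => scraper {
        process 'a', name => 'TEXT', link => '@href';
    };
};

my $res = $s->scrape($html);

print "$_\n" for @{ $res->{lengths} };
print "$_->{name} => $_->{link}\n" for @{ $res->{items} };
```

The code reference suits small per-element transformations, while the nested scraper suits repeated blocks that each contain several fields.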
  • 24. <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
        </ul>
  • 25. process "ul.sites > li > a",
            'urls[]' => '@href';
        # { urls => [ … ] }
        <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
        </ul>
  • 26. process '//ul[@class="sites"]/li/a',
            'names[]' => 'TEXT';
        # { names => [ 'OpenGuides', … ] }
        <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
        </ul>
  • 27. process "ul.sites > li > a",
            'sites[]' => {
                link => '@href', name => 'TEXT',
            };
        # { sites => [ { link => …, name => … },
        #              { link => …, name => … } ] };
        <ul class="sites">
        <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
        <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
        </ul>
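Slides 25 to 27 can be combined into one runnable script. This sketch assumes Web::Scraper is installed and scrapes the sample list from a string; Data::Dumper is used only to show the shape of the result.

```perl
use strict;
use warnings;
use Web::Scraper;
use Data::Dumper;

my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $s = scraper {
    # Slide 25: collect every href into an array ("[]" means loop).
    process 'ul.sites > li > a', 'urls[]' => '@href';

    # Slide 26: same elements via XPath, collecting the link text.
    process '//ul[@class="sites"]/li/a', 'names[]' => 'TEXT';

    # Slide 27: a hash reference yields one hash per matched element.
    process 'ul.sites > li > a', 'sites[]' => {
        link => '@href',
        name => 'TEXT',
    };
};

my $res = $s->scrape($html);

$Data::Dumper::Sortkeys = 1;
print Dumper($res);
```

The result is a single hash reference with the keys `urls`, `names` and `sites`, each holding an array with one entry per `<a>` element.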
  • 28. Tools
  • 29. > cpan Web::Scraper
        comes with 'scraper' CLI
  • 30. > scraper http://example.com/
        scraper> process "a", "links[]" => '@href';
        scraper> d
        $VAR1 = {
            links => [
                'http://example.org/',
                'http://example.net/',
            ],
        };
        scraper> y
        ---
        links:
          - http://example.org/
          - http://example.net/
  • 31. > scraper /path/to/foo.html
        > GET http://example.com/ | scraper
  • 32. Demo
  • 33. Thank you
        http://search.cpan.org/dist/Web-Scraper
        http://www.slideshare.net/miyagawa/webscraper
