Practical Web Scraping with Web::Scraper
Tatsuhiko Miyagawa [email_address]
Six Apart, Ltd. / Shibuya Perl Mongers
YAPC::Europe 2007 Vienna
abbreviation Acme::Module::Authors Acme::Sneeze Acme::Sneeze::JP Apache::ACEProxy Apache::AntiSpam Apache::Clickable Apache::CustomKeywords Apache::DefaultCharset Apache::GuessCharset Apache::JavaScript::DocumentWrite Apache::No404Proxy Apache::Profiler Apache::Session::CacheAny Apache::Session::Generate::ModUniqueId Apache::Session::Generate::ModUsertrack Apache::Session::PHP Apache::Session::Serialize::YAML Apache::Singleton Apache::StickyQuery Archive::Any::Create Attribute::Profiled Attribute::Protected Attribute::Unimplemented Bundle::Sledge capitalization Catalyst::Plugin::JSONRPC Catalyst::View::Jemplate Catalyst::View::JSON CGI::Untaint::email Class::DBI::AbstractSearch Class::DBI::Extension Class::DBI::Pager Class::DBI::Replication Class::DBI::SQLite Class::DBI::View Class::Trigger Convert::Base32 Convert::DUDE Convert::RACE Date::Japanese::Era Date::Range::Birth Device::KeyStroke::Mobile Dunce::time Email::Find Email::Valid::Loose Encode::JavaScript::UCS Encode::JP::Mobile Encode::Punycode File::Find::Rule::Digest Geo::Coder::Google HTML::Entities::ImodePictogram HTML::RelExtor HTML::ResolveLink HTML::XSSLint HTTP::MobileAgent HTTP::ProxyPAC HTTP::Server::Simple::Authen IDNA::Punycode Inline::Basic Inline::TT JSON::Syck Kwiki::Emoticon Kwiki::Export Kwiki::Footnote Kwiki::OpenSearch Kwiki::OpenSearch::Service Kwiki::TypeKey Kwiki::URLBL Log::Dispatch::Config Log::Dispatch::DBI Mac::Macbinary Mail::Address::MobileJp Mail::ListDetector::Detector::Fml MSIE::MenuExt Net::DAAP::Server::AAC Net::IDN::Nameprep Net::IPAddr::Find Net::YahooMessenger NetAddr::IP::Find PHP::Session plagger Plagger POE::Component::Client::AirTunes POE::Component::YahooMessenger Template::Plugin::Clickable Template::Plugin::Comma Template::Plugin::FillInForm Template::Plugin::HTML::Template Template::Plugin::JavaScript Template::Plugin::MobileAgent Template::Plugin::Shuffle Template::Provider::Encoding Term::Encoding Term::TtyRec Text::Emoticon Text::Emoticon::GoogleTalk 
Text::Emoticon::MSN Text::Emoticon::Yahoo Text::MessageFormat Time::Duration::ja Time::Duration::Parse Web::Scrape WebService::Bloglines WebService::ChangesXml WebService::Google::Suggest WWW::Baseball::NPB WWW::Blog::Metadata::MobileLinkDiscovery WWW::Blog::Metadata::OpenID WWW::Blog::Metadata::OpenSearch WWW::Cache::Google WWW::OpenSearch XML::Atom XML::Atom::Lifeblog XML::Atom::Stream XML::Liberal
 
http://code.sixapart.com/
 
Web pages  are built using text-based mark-up languages ( HTML  and  XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup.  http://en.wikipedia.org/wiki/Screen_scraping
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46
WWW::MySpace 0.70
WWW::Search::Ebay 2.231
WWW::Mixi 0.50
<span class="message">I &hearts; Vienna</span>

> perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
I &hearts; Vienna

<span class="message">I &hearts; Vienna</span>

> perl -MHTML::Entities -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities($1)'
I ♥ Vienna

<span class="message"> ウィーンが大好き! </span>

> perl -MHTML::Entities -MEncode -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities(decode_utf8($1))'
Wide character in print at -e line 1.
ウィーンが大好き!
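The three one-liners above point at an order of operations: decode bytes into characters, extract, decode entities, then encode again on output. Below is a minimal sketch of that order. The sample bytes are built locally as a stand-in for a fetched response body, and the variable names are illustrative only. Setting an encoding layer on STDOUT is one common way to avoid the "Wide character in print" warning.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                  # this source file contains UTF-8 literals
use Encode qw(decode encode);
use HTML::Entities qw(decode_entities);

# Assumption: stand-in for raw UTF-8 bytes as they would arrive over HTTP.
my $bytes = encode('UTF-8', '<span class="message">I &hearts; ウィーン</span>');

# 1. bytes -> characters
my $html = decode('UTF-8', $bytes);

# 2. extract the element content (still contains the raw entity)
my ($raw) = $html =~ m{<span class="message">(.*?)</span>};

# 3. entities -> characters
my $message = decode_entities($raw);

# 4. characters -> bytes, done once at the output boundary
binmode STDOUT, ':encoding(UTF-8)';
print "$message\n";
```

Keeping steps 1 and 4 at the edges of the program means everything in between works on characters, not bytes.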
XPath

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
print $tree->findnodes('//strong[@id="ctu"]')->shift->as_text;
# Monday, August 27, 2007 at 12:49:46
CSS Selectors

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath "strong#ctu";
print $tree->findnodes($xpath)->shift->as_text;
# Monday, August 27, 2007 at 12:49:46
Complete Script

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $ua  = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}

my $content = decode $res->encoding, $res->content;
my $tree  = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node  = $tree->findnodes($xpath)->shift;
print $node->as_text;
Example (before)

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46

Example (after)

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $ua  = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}

my $content = decode $res->encoding, $res->content;
my $tree  = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node  = $tree->findnodes($xpath)->shift;
print $node->as_text;
Example (before)

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $ua  = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}

my $content = decode $res->encoding, $res->content;
my $tree  = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node  = $tree->findnodes($xpath)->shift;
print $node->as_text;
Example (after)
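The code for this slide did not survive extraction. As a rough sketch only, and not necessarily what the slide showed, the same lookup could be written with Web::Scraper along these lines. To stay runnable offline, the example scrapes an inline copy of the markup (wrapped in a table so it parses cleanly); the key name `utc_time` is an arbitrary choice. A live run would pass `URI->new("http://www.timeanddate.com/worldclock/")` to `scrape` instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Assumption: inline sample adapted from the markup shown earlier in the deck.
my $html = <<'HTML';
<table><tr>
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br /></td>
</tr></table>
HTML

my $s = scraper {
    # CSS selector, then the key to store under and what to extract
    process 'strong#ctu', utc_time => 'TEXT';
    # a single key makes scrape() return that value directly
    result 'utc_time';
};

my $time = $s->scrape($html);
print "$time\n";
```

Fetching, decoding, tree building and selector translation, which the "before" script spells out by hand, are left to the module here.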
Basics
process
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
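The `process` calls that accompanied this markup were lost in extraction. The sketch below illustrates, under that caveat, the forms `process` accepts: a plain key takes the first match, a key ending in `[]` collects every match, `'@href'` reads an attribute, `'TEXT'` reads the text content, and a nested `scraper` builds one hash per matched node. The key names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $s = scraper {
    # plain key: value of the first matching node only
    process 'ul.sites > li > a', first_name => 'TEXT';

    # key ending in []: one value per matching node
    process 'ul.sites > li > a', 'urls[]'  => '@href';
    process 'ul.sites > li > a', 'names[]' => 'TEXT';

    # nested scraper: one hash per <li>, several keys in one process call
    process 'ul.sites > li', 'sites[]' => scraper {
        process 'a', link => '@href', name => 'TEXT';
    };
};

my $res = $s->scrape($html);

for my $site (@{ $res->{sites} }) {
    print "$site->{name}: $site->{link}\n";
}
```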
result

my $s = scraper {
    process …;
    process …;
    result 'foo', 'bar';
};
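As a concrete illustration of `result`, here is a small sketch against the sites list from the earlier slides. The key names are hypothetical. Listing several keys returns a hash reference limited to those keys, while naming a single key returns that value on its own. `result` goes last in the block because its return value becomes the scraper's return value.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

# Several keys: a hash ref holding only the listed keys.
my $both_scraper = scraper {
    process 'ul.sites > li > a', 'links[]' => '@href', 'names[]' => 'TEXT';
    process 'ul.sites', list_class => '@class';   # collected but not returned
    result 'links', 'names';
};
my $both = $both_scraper->scrape($html);

# One key: the value itself, here an array ref.
my $links_scraper = scraper {
    process 'ul.sites > li > a', 'links[]' => '@href';
    result 'links';
};
my $links = $links_scraper->scrape($html);

print join(', ', sort keys %$both), "\n";
print "$_\n" for @$links;
```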
Thumbnail URLs on Flickr set
 
<span class="vcard">
<a href="http://twitter.com/iamcal" class="url" rel="contact" title="Cal Henderson">
<img alt="Cal Henderson" class="photo fn" height="24" id="profile-image" src="http://assets0.twitter.com/…/mini/buddyicon.gif" width="24" /></a>
</span>
<span class="vcard">
…
</span>
Twitter Friends
Twitter Friends (complex)
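The code on these two slides was lost in extraction. What follows is a sketch of how the vcard markup could be scraped, first as a flat list of profile links and then as one record per friend. It is not the author's original code. The sample HTML is a hypothetical stand-in: the image URLs under example.com and the second friend are invented, and the key names are arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Assumption: hypothetical sample modelled on the vcard markup above.
my $html = <<'HTML';
<div>
<span class="vcard">
  <a href="http://twitter.com/iamcal" class="url" rel="contact" title="Cal Henderson">
  <img alt="Cal Henderson" class="photo fn" height="24" src="http://example.com/cal.gif" width="24" /></a>
</span>
<span class="vcard">
  <a href="http://twitter.com/someone" class="url" rel="contact" title="Some One">
  <img alt="Some One" class="photo fn" height="24" src="http://example.com/someone.gif" width="24" /></a>
</span>
</div>
HTML

# Simple: a flat list of profile URLs.
my $simple = scraper {
    process 'span.vcard a', 'friends[]' => '@href';
    result 'friends';
};
my $urls = $simple->scrape($html);

# Complex: one hash per vcard, built by a nested scraper.
my $complex = scraper {
    process 'span.vcard', 'friends[]' => scraper {
        process 'a',   link  => '@href', name => '@title';
        process 'img', image => '@src';
    };
    result 'friends';
};
my $friends = $complex->scrape($html);

print "$_\n" for @$urls;
print "$_->{name} <$_->{link}>\n" for @$friends;
```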
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Recently uploaded (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 

Web::Scraper

  • 1. Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa [email_address] Six Apart, Ltd. / Shibuya Perl Mongers YAPC::Europe 2007 Vienna
  • 4. abbreviation Acme::Module::Authors Acme::Sneeze Acme::Sneeze::JP Apache::ACEProxy Apache::AntiSpam Apache::Clickable Apache::CustomKeywords Apache::DefaultCharset Apache::GuessCharset Apache::JavaScript::DocumentWrite Apache::No404Proxy Apache::Profiler Apache::Session::CacheAny Apache::Session::Generate::ModUniqueId Apache::Session::Generate::ModUsertrack Apache::Session::PHP Apache::Session::Serialize::YAML Apache::Singleton Apache::StickyQuery Archive::Any::Create Attribute::Profiled Attribute::Protected Attribute::Unimplemented Bundle::Sledge capitalization Catalyst::Plugin::JSONRPC Catalyst::View::Jemplate Catalyst::View::JSON CGI::Untaint::email Class::DBI::AbstractSearch Class::DBI::Extension Class::DBI::Pager Class::DBI::Replication Class::DBI::SQLite Class::DBI::View Class::Trigger Convert::Base32 Convert::DUDE Convert::RACE Date::Japanese::Era Date::Range::Birth Device::KeyStroke::Mobile Dunce::time Email::Find Email::Valid::Loose Encode::JavaScript::UCS Encode::JP::Mobile Encode::Punycode File::Find::Rule::Digest Geo::Coder::Google HTML::Entities::ImodePictogram HTML::RelExtor HTML::ResolveLink HTML::XSSLint HTTP::MobileAgent HTTP::ProxyPAC HTTP::Server::Simple::Authen IDNA::Punycode Inline::Basic Inline::TT JSON::Syck Kwiki::Emoticon Kwiki::Export Kwiki::Footnote Kwiki::OpenSearch Kwiki::OpenSearch::Service Kwiki::TypeKey Kwiki::URLBL Log::Dispatch::Config Log::Dispatch::DBI Mac::Macbinary Mail::Address::MobileJp Mail::ListDetector::Detector::Fml MSIE::MenuExt Net::DAAP::Server::AAC Net::IDN::Nameprep Net::IPAddr::Find Net::YahooMessenger NetAddr::IP::Find PHP::Session plagger Plagger POE::Component::Client::AirTunes POE::Component::YahooMessenger Template::Plugin::Clickable Template::Plugin::Comma Template::Plugin::FillInForm Template::Plugin::HTML::Template Template::Plugin::JavaScript Template::Plugin::MobileAgent Template::Plugin::Shuffle Template::Provider::Encoding Term::Encoding Term::TtyRec Text::Emoticon Text::Emoticon::GoogleTalk 
Text::Emoticon::MSN Text::Emoticon::Yahoo Text::MessageFormat Time::Duration::ja Time::Duration::Parse Web::Scrape WebService::Bloglines WebService::ChangesXml WebService::Google::Suggest WWW::Baseball::NPB WWW::Blog::Metadata::MobileLinkDiscovery WWW::Blog::Metadata::OpenID WWW::Blog::Metadata::OpenSearch WWW::Cache::Google WWW::OpenSearch XML::Atom XML::Atom::Lifeblog XML::Atom::Stream XML::Liberal
  • 9. Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup. http://en.wikipedia.org/wiki/Screen_scraping
  • 18. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
  • 19. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br /> > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
  • 29. <span class="message">I &hearts; Vienna</span> > perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1' I &hearts; Vienna
  • 30. <span class="message">I &hearts; Vienna</span> > perl -MHTML::Entities -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities($1)' I ♥ Vienna
  • 31. <span class="message"> ウィーンが大好き! </span> > perl -MHTML::Entities -MEncode -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities(decode_utf8($1))' Wide character in print at -e line 1. ウィーンが大好き!
  • 36. XPath <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br /> use HTML::TreeBuilder::XPath; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); print $tree->findnodes('//strong[@id="ctu"]')->shift->as_text; # Monday, August 27, 2007 at 12:49:46
  • 40. CSS Selectors <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br /> use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath "strong#ctu"; print $tree->findnodes($xpath)->shift->as_text; # Monday, August 27, 2007 at 12:49:46
  • 41. Complete Script #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get("http://www.timeanddate.com/worldclock/"); if ($res->is_error) { die "HTTP GET error: ", $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath("strong#ctu"); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 43. Example (before) <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br /> > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
  • 44. Example (after) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get("http://www.timeanddate.com/worldclock/"); if ($res->is_error) { die "HTTP GET error: ", $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath("strong#ctu"); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 48. Example (before) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get("http://www.timeanddate.com/worldclock/"); if ($res->is_error) { die "HTTP GET error: ", $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath("strong#ctu"); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 55. <ul class="sites"> <li><a href="http://vienna.openguides.org/">OpenGuides</a></li> <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li> </ul>
  • 66. <span class="vcard"> <a href="http://twitter.com/iamcal" class="url" rel="contact" title="Cal Henderson"> <img alt="Cal Henderson" class="photo fn" height="24" id="profile-image" src="http://assets0.twitter.com/…/mini/buddyicon.gif" width="24" /></a> </span> <span class="vcard"> … </span>
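
The slides that showed Web::Scraper code did not survive in this transcript, so the following is only a minimal sketch and not the speaker's own example. It assumes the `ul.sites` markup from slide 55 and uses Web::Scraper's `scraper`, `process`, and `scrape` to collect the link targets and link text. The variable names and the result keys `urls` and `names` are chosen here for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Sample markup copied from slide 55.
my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

# Each process rule pairs a CSS selector with a result key.
# A trailing [] on the key collects every match into an array reference;
# '@href' reads the href attribute, 'TEXT' reads the element's text.
my $sites = scraper {
    process 'ul.sites > li > a', 'urls[]'  => '@href';
    process 'ul.sites > li > a', 'names[]' => 'TEXT';
};

# scrape() is given the HTML as a string here, so nothing is fetched.
my $result = $sites->scrape($html);

for my $i (0 .. $#{ $result->{urls} }) {
    print "$result->{names}[$i]: $result->{urls}[$i]\n";
}
```

Compared with the "Complete Script" on slide 41, the tree building and the selector-to-XPath translation are handled inside `scrape`, so the script only has to state which elements it wants and under which key to store them.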