Web Scraper Shibuya.pm tech talk #8

Tatsuhiko Miyagawa
Tatsuhiko MiyagawaSoftware Engineer
Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa   [email_address] Six Apart, Ltd. / Shibuya Perl Mongers Shibuya.pm Tech Talks #8
[object Object],[object Object]
Web pages  are built using text-based mark-up languages ( HTML  and  XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup.  http://en.wikipedia.org/wiki/Screen_scraping
Web pages  are built using text-based mark-up languages ( HTML  and  XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus,  screen scrapers  were reborn in the web era to  extract machine-friendly data from HTML  and other markup.  http://en.wikipedia.org/wiki/Screen_scraping
[object Object],[object Object]
 
 
[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object]
 
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br />
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
[object Object]
WWW::MySpace 0.70
WWW::Search::Ebay 2.231
WWW::Mixi 0.50
[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
<span class=&quot;message&quot;>I &hearts; Shibuya</span> > perl –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print $1' I &hearts; Shibuya
<span class=&quot;message&quot;>I &hearts; Shibuya</span> > perl  –MHTML::Entities  –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@  and print  decode_entities ($1)' I  ♥  Shibuya
<span class=&quot;message&quot;>Perl が大好き! </span> > perl –MHTML::Entities  –MEncode  –e  '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@  and print decode_entities( decode_utf8 ($1))' Wide character in print at –e line 1. Perl が大好き!
[object Object],[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object]
XPath <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br /> use HTML::TreeBuilder::XPath; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); print $tree->findnodes ('//strong[@id=&quot;ctu&quot;]') ->shift->as_text; # Monday, August 27, 2007 at 12:49:46
[object Object],[object Object],[object Object]
CSS Selectors ,[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
CSS Selectors <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br /> use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath =  selector_to_xpath  &quot;strong#ctu&quot;; print $tree->findnodes($xpath)->shift->as_text; # Monday, August 27, 2007 at 12:49:46
Complete Script #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua  = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree  = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node  = $tree->findnodes($xpath)->shift; print $node->as_text;
[object Object],[object Object],[object Object],[object Object]
Exmaple (before) <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
Example (after) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua  = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree  = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node  = $tree->findnodes($xpath)->shift; print $node->as_text;
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object]
Example (before) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua  = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree  = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node  = $tree->findnodes($xpath)->shift; print $node->as_text;
Example (after) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Basics ,[object Object],[object Object],[object Object],[object Object],[object Object]
process ,[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot; http://vienna.openguides.org/ &quot;>OpenGuides</a></li> <li><a href=&quot; http://vienna.yapceurope.org/ &quot;>YAPC::Europe</a></li> </ul>
[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;> OpenGuides </a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;> YAPC::Europe </a></li> </ul>
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
result ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],my $s = scraper { process …; process …; result 'foo', 'bar'; };
[object Object]
[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],<span class=&quot;entry-date&quot;>October 1st, 2007 17:13:31 +0900</span> process &quot;.entry-date&quot;, date => 'TEXT :rfc822 ';
[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object]
[object Object],[object Object],[object Object]
1 of 81

Recommended

Real-Time Python Web: Gevent and Socket.io by
Real-Time Python Web: Gevent and Socket.ioReal-Time Python Web: Gevent and Socket.io
Real-Time Python Web: Gevent and Socket.ioRick Copeland
13.8K views27 slides
HTTP For the Good or the Bad - FSEC Edition by
HTTP For the Good or the Bad - FSEC EditionHTTP For the Good or the Bad - FSEC Edition
HTTP For the Good or the Bad - FSEC EditionXavier Mertens
4K views74 slides
Sphinx 1.1 i18n 機能紹介 by
Sphinx 1.1 i18n 機能紹介Sphinx 1.1 i18n 機能紹介
Sphinx 1.1 i18n 機能紹介Ian Lewis
3.7K views43 slides
Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014 by
Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014
Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014Puppet
4.2K views48 slides
루비가 얼랭에 빠진 날 by
루비가 얼랭에 빠진 날루비가 얼랭에 빠진 날
루비가 얼랭에 빠진 날Sukjoon Kim
658 views25 slides
Apache CouchDB talk at Ontario GNU Linux Fest by
Apache CouchDB talk at Ontario GNU Linux FestApache CouchDB talk at Ontario GNU Linux Fest
Apache CouchDB talk at Ontario GNU Linux FestMyles Braithwaite
821 views84 slides

More Related Content

What's hot

Ultra fast web development with sinatra by
Ultra fast web development with sinatraUltra fast web development with sinatra
Ultra fast web development with sinatraSérgio Santos
996 views36 slides
CouchDB Day NYC 2017: Mango by
CouchDB Day NYC 2017: MangoCouchDB Day NYC 2017: Mango
CouchDB Day NYC 2017: MangoIBM Cloud Data Services
304 views34 slides
Javascript - The Stack and Beyond by
Javascript - The Stack and BeyondJavascript - The Stack and Beyond
Javascript - The Stack and BeyondAll Things Open
540 views34 slides
Stop Worrying & Love the SQL - A Case Study by
Stop Worrying & Love the SQL - A Case StudyStop Worrying & Love the SQL - A Case Study
Stop Worrying & Love the SQL - A Case StudyAll Things Open
1.5K views42 slides
Browser Extensions for Web Hackers by
Browser Extensions for Web HackersBrowser Extensions for Web Hackers
Browser Extensions for Web HackersMark Wubben
2.6K views89 slides
今時なウェブ開発をSmalltalkでやってみる? by
今時なウェブ開発をSmalltalkでやってみる?今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?Sho Yoshida
1.1K views44 slides

What's hot(20)

Ultra fast web development with sinatra by Sérgio Santos
Ultra fast web development with sinatraUltra fast web development with sinatra
Ultra fast web development with sinatra
Sérgio Santos996 views
Javascript - The Stack and Beyond by All Things Open
Javascript - The Stack and BeyondJavascript - The Stack and Beyond
Javascript - The Stack and Beyond
All Things Open540 views
Stop Worrying & Love the SQL - A Case Study by All Things Open
Stop Worrying & Love the SQL - A Case StudyStop Worrying & Love the SQL - A Case Study
Stop Worrying & Love the SQL - A Case Study
All Things Open1.5K views
Browser Extensions for Web Hackers by Mark Wubben
Browser Extensions for Web HackersBrowser Extensions for Web Hackers
Browser Extensions for Web Hackers
Mark Wubben2.6K views
今時なウェブ開発をSmalltalkでやってみる? by Sho Yoshida
今時なウェブ開発をSmalltalkでやってみる?今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?
Sho Yoshida1.1K views
Rotzy - Building an iPhone Photo Sharing App on Google App Engine by geehwan
Rotzy - Building an iPhone Photo Sharing App on Google App EngineRotzy - Building an iPhone Photo Sharing App on Google App Engine
Rotzy - Building an iPhone Photo Sharing App on Google App Engine
geehwan694 views
Eve - REST API for Humans™ by Nicola Iarocci
Eve - REST API for Humans™Eve - REST API for Humans™
Eve - REST API for Humans™
Nicola Iarocci8.2K views
HPPG - high performance photo gallery by Remigijus Kiminas
HPPG - high performance photo galleryHPPG - high performance photo gallery
HPPG - high performance photo gallery
Remigijus Kiminas2.1K views
Android webservices by Krazy Koder
Android webservicesAndroid webservices
Android webservices
Krazy Koder3.5K views
Debugging and Testing ES Systems by Chris Birchall
Debugging and Testing ES SystemsDebugging and Testing ES Systems
Debugging and Testing ES Systems
Chris Birchall6.8K views
Non-Relational Databases by kchodorow
Non-Relational DatabasesNon-Relational Databases
Non-Relational Databases
kchodorow545 views
yusukebe in Yokohama.pm 090909 by Yusuke Wada
yusukebe in Yokohama.pm 090909yusukebe in Yokohama.pm 090909
yusukebe in Yokohama.pm 090909
Yusuke Wada2.2K views
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama... by BigDataCloud
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud789 views
Tips on how to improve the performance of your custom modules for high volume... by Odoo
Tips on how to improve the performance of your custom modules for high volume...Tips on how to improve the performance of your custom modules for high volume...
Tips on how to improve the performance of your custom modules for high volume...
Odoo37.3K views

Similar to Web Scraper Shibuya.pm tech talk #8

Web::Scraper by
Web::ScraperWeb::Scraper
Web::ScraperTatsuhiko Miyagawa
18.4K views79 slides
Introduction To Lamp by
Introduction To LampIntroduction To Lamp
Introduction To LampAmzad Hossain
603 views34 slides
A brief history of the web by
A brief history of the webA brief history of the web
A brief history of the webJorge Zapico
551 views30 slides
PHP Presentation by
PHP PresentationPHP Presentation
PHP PresentationAnkush Jain
8.4K views103 slides
How Xslate Works by
How Xslate WorksHow Xslate Works
How Xslate WorksGoro Fuji
4.8K views57 slides
Introducing Modern Perl by
Introducing Modern PerlIntroducing Modern Perl
Introducing Modern PerlDave Cross
30.4K views161 slides

Similar to Web Scraper Shibuya.pm tech talk #8(20)

A brief history of the web by Jorge Zapico
A brief history of the webA brief history of the web
A brief history of the web
Jorge Zapico551 views
PHP Presentation by Ankush Jain
PHP PresentationPHP Presentation
PHP Presentation
Ankush Jain8.4K views
How Xslate Works by Goro Fuji
How Xslate WorksHow Xslate Works
How Xslate Works
Goro Fuji4.8K views
Introducing Modern Perl by Dave Cross
Introducing Modern PerlIntroducing Modern Perl
Introducing Modern Perl
Dave Cross30.4K views
Forum Presentation by Angus Pratt
Forum PresentationForum Presentation
Forum Presentation
Angus Pratt4.9K views
XML processing with perl by Joe Jiang
XML processing with perlXML processing with perl
XML processing with perl
Joe Jiang927 views
introduction to web technology by vikram singh
introduction to web technologyintroduction to web technology
introduction to web technology
vikram singh35.5K views
10 Things You're Not Doing [IBM Lotus Notes Domino Application Development] by Chris Toohey
10 Things You're Not Doing [IBM Lotus Notes Domino Application Development]10 Things You're Not Doing [IBM Lotus Notes Domino Application Development]
10 Things You're Not Doing [IBM Lotus Notes Domino Application Development]
Chris Toohey994 views
PHP Presentation by Nikhil Jain
PHP PresentationPHP Presentation
PHP Presentation
Nikhil Jain15.5K views

More from Tatsuhiko Miyagawa

Carton CPAN dependency manager by
Carton CPAN dependency managerCarton CPAN dependency manager
Carton CPAN dependency managerTatsuhiko Miyagawa
4.1K views39 slides
Deploying Plack Web Applications: OSCON 2011 by
Deploying Plack Web Applications: OSCON 2011Deploying Plack Web Applications: OSCON 2011
Deploying Plack Web Applications: OSCON 2011Tatsuhiko Miyagawa
8.2K views143 slides
Plack at OSCON 2010 by
Plack at OSCON 2010Plack at OSCON 2010
Plack at OSCON 2010Tatsuhiko Miyagawa
42K views138 slides
cpanminus at YAPC::NA 2010 by
cpanminus at YAPC::NA 2010cpanminus at YAPC::NA 2010
cpanminus at YAPC::NA 2010Tatsuhiko Miyagawa
1.6K views31 slides
Plack at YAPC::NA 2010 by
Plack at YAPC::NA 2010Plack at YAPC::NA 2010
Plack at YAPC::NA 2010Tatsuhiko Miyagawa
3.7K views117 slides
PSGI/Plack OSDC.TW by
PSGI/Plack OSDC.TWPSGI/Plack OSDC.TW
PSGI/Plack OSDC.TWTatsuhiko Miyagawa
3.1K views118 slides

More from Tatsuhiko Miyagawa(20)

Deploying Plack Web Applications: OSCON 2011 by Tatsuhiko Miyagawa
Deploying Plack Web Applications: OSCON 2011Deploying Plack Web Applications: OSCON 2011
Deploying Plack Web Applications: OSCON 2011
Tatsuhiko Miyagawa8.2K views
Plack perl superglue for web frameworks and servers by Tatsuhiko Miyagawa
Plack perl superglue for web frameworks and serversPlack perl superglue for web frameworks and servers
Plack perl superglue for web frameworks and servers
Tatsuhiko Miyagawa6.7K views
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery by Tatsuhiko Miyagawa
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryRemedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Tatsuhiko Miyagawa39.1K views
Building a desktop app with HTTP::Engine, SQLite and jQuery by Tatsuhiko Miyagawa
Building a desktop app with HTTP::Engine, SQLite and jQueryBuilding a desktop app with HTTP::Engine, SQLite and jQuery
Building a desktop app with HTTP::Engine, SQLite and jQuery
Tatsuhiko Miyagawa4.2K views

Recently uploaded

The Role of Patterns in the Era of Large Language Models by
The Role of Patterns in the Era of Large Language ModelsThe Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language ModelsYunyao Li
80 views65 slides
Business Analyst Series 2023 - Week 4 Session 8 by
Business Analyst Series 2023 -  Week 4 Session 8Business Analyst Series 2023 -  Week 4 Session 8
Business Analyst Series 2023 - Week 4 Session 8DianaGray10
86 views13 slides
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online by
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineKVM Security Groups Under the Hood - Wido den Hollander - Your.Online
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineShapeBlue
181 views19 slides
Future of AR - Facebook Presentation by
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook PresentationRob McCarty
62 views27 slides
DRBD Deep Dive - Philipp Reisner - LINBIT by
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBITShapeBlue
140 views21 slides
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T by
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TCloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TShapeBlue
112 views34 slides

Recently uploaded(20)

The Role of Patterns in the Era of Large Language Models by Yunyao Li
The Role of Patterns in the Era of Large Language ModelsThe Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language Models
Yunyao Li80 views
Business Analyst Series 2023 - Week 4 Session 8 by DianaGray10
Business Analyst Series 2023 -  Week 4 Session 8Business Analyst Series 2023 -  Week 4 Session 8
Business Analyst Series 2023 - Week 4 Session 8
DianaGray1086 views
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online by ShapeBlue
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineKVM Security Groups Under the Hood - Wido den Hollander - Your.Online
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online
ShapeBlue181 views
Future of AR - Facebook Presentation by Rob McCarty
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook Presentation
Rob McCarty62 views
DRBD Deep Dive - Philipp Reisner - LINBIT by ShapeBlue
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBIT
ShapeBlue140 views
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T by ShapeBlue
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TCloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
ShapeBlue112 views
Live Demo Showcase: Unveiling Dell PowerFlex’s IaaS Capabilities with Apache ... by ShapeBlue
Live Demo Showcase: Unveiling Dell PowerFlex’s IaaS Capabilities with Apache ...Live Demo Showcase: Unveiling Dell PowerFlex’s IaaS Capabilities with Apache ...
Live Demo Showcase: Unveiling Dell PowerFlex’s IaaS Capabilities with Apache ...
ShapeBlue85 views
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading... by The Digital Insurer
Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading...
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O... by ShapeBlue
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
ShapeBlue88 views
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT by ShapeBlue
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBITUpdates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
ShapeBlue166 views
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And... by ShapeBlue
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...
ShapeBlue63 views
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ... by ShapeBlue
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
ShapeBlue79 views
NTGapps NTG LowCode Platform by Mustafa Kuğu
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform
Mustafa Kuğu365 views
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti... by ShapeBlue
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
ShapeBlue98 views
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue by ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlueElevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
ShapeBlue179 views
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ... by ShapeBlue
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
ShapeBlue144 views
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue by ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueVNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
ShapeBlue163 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker50 views

Web Scraper Shibuya.pm tech talk #8

  • 1. Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa [email_address] Six Apart, Ltd. / Shibuya Perl Mongers Shibuya.pm Tech Talks #8
  • 2.
  • 3. Web pages are built using text-based mark-up languages ( HTML and XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup. http://en.wikipedia.org/wiki/Screen_scraping
  • 4. Web pages are built using text-based mark-up languages ( HTML and XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup. http://en.wikipedia.org/wiki/Screen_scraping
  • 5.
  • 6.  
  • 7.  
  • 8.
  • 9.
  • 10.
  • 11.  
  • 12. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br />
  • 13. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
  • 14.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23. <span class=&quot;message&quot;>I &hearts; Shibuya</span> > perl –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print $1' I &hearts; Shibuya
  • 24. <span class=&quot;message&quot;>I &hearts; Shibuya</span> > perl –MHTML::Entities –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print decode_entities ($1)' I ♥ Shibuya
  • 25. <span class=&quot;message&quot;>Perl が大好き! </span> > perl –MHTML::Entities –MEncode –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print decode_entities( decode_utf8 ($1))' Wide character in print at –e line 1. Perl が大好き!
  • 26.
  • 27.
  • 28.
  • 29.
  • 30. XPath <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> use HTML::TreeBuilder::XPath; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); print $tree->findnodes ('//strong[@id=&quot;ctu&quot;]') ->shift->as_text; # Monday, August 27, 2007 at 12:49:46
  • 31.
  • 32.
  • 33.
  • 34. CSS Selectors <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath &quot;strong#ctu&quot;; print $tree->findnodes($xpath)->shift->as_text; # Monday, August 27, 2007 at 12:49:46
  • 35. Complete Script #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 36.
  • 37. Exmaple (before) <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
  • 38. Example (after) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 39.
  • 40.
  • 41.
  • 42. Example (before) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 43.
  • 44.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49. <ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
  • 50.
  • 51.
  • 52.
  • 53.
  • 54.
  • 55.
  • 56.
  • 57.
  • 58.
  • 59.
  • 60.
  • 61.
  • 62.
  • 63.
  • 64.
  • 65.
  • 66.
  • 67.
  • 68.
  • 69.
  • 70.
  • 71.
  • 72.
  • 73.
  • 74.
  • 75.
  • 76.
  • 77.
  • 78.
  • 79.
  • 80.
  • 81.