Linux Creative Group

Hpricot – Dig The Impossible With Ruby

By: Subhransu Behera
arya.subhransu@gmail.com
Ruby!!!

What’s Special?

So … Let’s See!
•  Dynamic
•  Easy to learn
•  Easy to maintain and grow
•  Convenient shortcuts
   Ex: str = "Linux Creative Group"
       str_join = str.split(" ").join("+")
•  Transparent, code faster
•  Few syntax errors, fewer bugs
•  It’s fun
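
A runnable version of the split/join shortcut, as a minimal sketch. The variable names are lowercased here because Ruby treats capitalized names as constants, and the `puts` line is added only to show the result.

```ruby
# Split the string on spaces, then glue the words back together with "+".
str = "Linux Creative Group"
str_join = str.split(" ").join("+")

puts str_join  # prints "Linux+Creative+Group"
```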

Ruby Gems
•  Package management system for Ruby applications and libraries
•  Resolves dependencies.
•  Provides a central repository of software.
•  One command rules:
   - gem install <gem_name>
•  Can have your own local gem server
   - gem install <gem_name> --source <gem_server_ip_and_port>
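
Dependency resolution comes down to checking candidate versions against a version constraint. The sketch below shows that check using the `Gem::Requirement` and `Gem::Version` classes that ship with RubyGems. The constraint and the version numbers are made up for illustration and do not describe any particular gem.

```ruby
require 'rubygems'

# Hypothetical constraint: at least 0.8 but below 1.0,
# written with the pessimistic operator "~>".
requirement = Gem::Requirement.new("~> 0.8")

# Hypothetical candidate versions a repository might offer.
candidates = %w[0.6.1 0.8.6 1.0.0].map { |v| Gem::Version.new(v) }

# Keep only the versions that satisfy the constraint.
acceptable = candidates.select { |v| requirement.satisfied_by?(v) }

puts acceptable.map(&:to_s).inspect  # prints ["0.8.6"]
```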

Hpricot makes it easy to parse
Hpricot
•  Pull information from virtually any website.
•  Search by element ID, tags, CSS selectors.
•  Parse HTML, including broken HTML
•  Update HTML
•  Use this data anywhere and any way you want!
•  Use XPath to reach an element directly.
•  Let’s see how it works.
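
The search and update features listed above can be illustrated with a short sketch. Hpricot is a separate gem and may not be installed, so this example swaps in REXML, the XML library bundled with Ruby, and a small well-formed fragment. Unlike Hpricot, REXML does not accept broken HTML. The markup, the `forecast` ID and the city names are made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample markup (not taken from any real site).
html = <<~HTML
  <html>
    <body>
      <div id="forecast">
        <p class="city">Delhi</p>
        <p class="city">Tokyo</p>
      </div>
    </body>
  </html>
HTML

doc = REXML::Document.new(html)

# Search by element ID.
forecast = REXML::XPath.first(doc, "//*[@id='forecast']")

# Search by tag name plus an attribute test
# (roughly what the CSS selector "p.city" expresses).
cities = REXML::XPath.match(doc, "//p[@class='city']").map(&:text)

# Update the document: change the text of the first matching element.
REXML::XPath.first(doc, "//p[@class='city']").text = "Paris"
updated = REXML::XPath.match(doc, "//p").map(&:text)

puts forecast.name     # prints "div"
puts cities.inspect    # prints ["Delhi", "Tokyo"]
puts updated.inspect   # prints ["Paris", "Tokyo"]
```

With Hpricot itself, searches of this kind would go through `Hpricot(html).search(...)`, which takes CSS selectors as well as XPath expressions.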


Let’s Parse A Badly Designed Site!!
•  http://www.worldweather.org
•  It’s a site that provides weather information for different locations across the globe.
•  On the main page they have a badly nested table structure!!
•  An ideal web developer would have put them nicely in divs with meaningful IDs.
•  But let’s face the truth and parse the country names and their URLs.

Easy Steps – 1. Open The Site
Easy Steps – 2. Inspect With Firebug
Easy Steps – 3. Copy XPath of the Element
Easy Steps – 4. Parse By XPath Using Hpricot
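
The four steps can be sketched end to end on a stand-in page. This is not the code from the slides: it uses REXML (bundled with Ruby) instead of Hpricot so that it runs without extra gems, and the nested table, country names and URLs below are invented for illustration. One thing to watch for with a copied XPath: browser inspectors often show `tbody` elements that the raw HTML does not contain, so the path may need those segments removed.

```ruby
require 'rexml/document'

# Hypothetical stand-in for a nested-table layout.
# The country names and relative URLs are made up.
page = <<~HTML
  <table>
    <tr>
      <td>
        <table>
          <tr><td><a href="/country/a.htm">Country A</a></td></tr>
          <tr><td><a href="/country/b.htm">Country B</a></td></tr>
        </table>
      </td>
    </tr>
  </table>
HTML

# An XPath of the kind an inspector might give for one link,
# generalized by dropping the row index so it matches every row.
link_xpath = "//td/table/tr/td/a"

doc = REXML::Document.new(page)

# Collect each link's text and href into a small hash.
countries = REXML::XPath.match(doc, link_xpath).map do |a|
  { name: a.text.strip, url: a.attributes["href"] }
end

countries.each { |c| puts "#{c[:name]} -> #{c[:url]}" }
```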
Use some Logic & You’ll Get
Just Try it Out

 Questions?
References



•  Ruby Programming Language: http://www.ruby-lang.org/en/
•  Hpricot: http://code.whytheluckystiff.net/hpricot/
•  XPath: http://en.wikipedia.org/wiki/XPath
•  Firebug: http://getfirebug.com/

Thanks 
