Linux Creative Group

 Hpricot – Dig The Impossible With Ruby

                 By: Subhransu Behera
            arya.subhransu@gmail.com
Ruby !!!

What’s Special?
So … Let’s See !
•  Dynamic
•  Easy to learn
•  Easy to maintain and grow
•  Convenient shortcuts
   Ex: str = "Linux Creative Group"
       str_join = str.split(" ").join("+")
•  Transparent, code faster
•  Few syntax errors, fewer bugs
•  It’s fun
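
The split/join shortcut from the slide can be tried one call at a time in irb. This is only a small sketch; the lowercase variable names are illustrative.

```ruby
# Step through the split/join shortcut one call at a time.
str = "Linux Creative Group"

words    = str.split(" ")     # break the string on spaces into an array
str_join = words.join("+")    # glue the words back together with "+"

p words      # => ["Linux", "Creative", "Group"]
p str_join   # => "Linux+Creative+Group"

# The reverse trip works the same way.
p str_join.split("+").join(" ") == str   # => true
```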

Ruby Gems
•  Package management system for Ruby applications and libraries
•  Resolves dependencies.
•  Provides a central repository of software.
•  One command rules:
   - gem install <gem_name>
•  Can have your own local gem server
   - gem install <gem_name> --source <gem_server_ip_and_port>
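
The two install commands can also be assembled from Ruby, for example in a setup script. The sketch below is an illustration only: `gem_installed?` and `install_command` are hypothetical helper names, and the server address in the usage note is made up. `Gem::Specification.find_all_by_name` is part of RubyGems itself.

```ruby
require "rubygems"

# True when at least one version of the named gem is installed locally.
def gem_installed?(name)
  !Gem::Specification.find_all_by_name(name).empty?
end

# Build the install command shown on the slide. `source` is optional and
# would point at your own gem server (hypothetical address in the example).
def install_command(name, source = nil)
  command = ["gem", "install", name]
  command += ["--source", source] if source
  command.join(" ")
end

# Print the command only when the gem is missing.
puts install_command("hpricot") unless gem_installed?("hpricot")
```

Calling `install_command("hpricot", "http://192.168.1.10:8808")` would give the second form of the command, with `--source` pointing at a local server.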

Hpricot makes it easy to parse
Hpricot

•  Pull information from virtually any website.
•  Search by element ID, tags, CSS selectors.
•  Parse HTML, including broken HTML.
•  Update HTML.
•  Use this data anywhere and any way you want!
•  Parse by XPath to target an element directly.
•  Let’s see how it works.
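
The points above can be tried on a small piece of markup before touching a real site. This is a rough sketch, assuming the hpricot gem is installed; the sample HTML, the `lookups` helper and the "visited" class are invented for illustration, and the script skips the Hpricot calls when the gem is missing.

```ruby
begin
  require "hpricot"   # third-party gem: gem install hpricot
rescue LoadError
  warn "hpricot is not installed; the calls below will be skipped"
end

# Deliberately sloppy markup: the <p> and <li> tags are never closed.
SAMPLE_HTML = <<PAGE
<div id="main">
  <p class="intro">Weather links
  <ul>
    <li><a href="/one.htm">One</a>
    <li><a href="/two.htm">Two</a>
  </ul>
</div>
PAGE

# Run the kinds of lookup listed on the slide against one document.
def lookups(html)
  doc = Hpricot(html)
  {
    :by_id    => doc.at("#main"),                     # element ID
    :by_tag   => doc.search("a"),                     # tag name
    :by_css   => doc.search("p.intro"),               # CSS selector
    :by_xpath => doc.search("//div[@id='main']//a"),  # XPath
    :doc      => doc
  }
end

if defined?(Hpricot)
  found = lookups(SAMPLE_HTML)
  found[:by_tag].each { |a| puts "#{a.inner_text} -> #{a.attributes['href']}" }

  # Update the HTML: add a class to every link, then serialise it again.
  found[:by_tag].set("class", "visited")
  puts found[:doc].to_html
end
```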


Let’s Parse a Badly Designed Site !!
•  http://www.worldweather.org
•  It’s a site that provides weather information for different locations across the globe.
•  On the main page they have a badly nested table structure !!
•  An ideal web developer would have put them neatly in divs with meaningful IDs.
•  But let’s face the truth and parse the country names and their URLs.

Easy Steps – 1. Open The Site
Easy Steps – 2. Inspect With Firebug
Easy Steps – 3. Copy X-Path of the Element
Easy Steps – 4. Parse By X-Path Using Hpricot
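
A minimal sketch of this step, assuming the hpricot gem is installed and the site is reachable. The helper names `links_at` and `load_page` and the XPath in the usage comment are placeholders, not taken from the slides; paste in the XPath you copied from Firebug instead. One thing to watch for: browsers usually insert `tbody` elements into tables, so a copied XPath may need its `tbody` steps removed before it matches the raw HTML.

```ruby
# Collect [text, href] pairs for every link matched by the given XPath.
# `doc` is anything that responds to `search`, such as an Hpricot document.
def links_at(doc, xpath)
  doc.search(xpath).map do |link|
    [link.inner_text.strip, link.attributes["href"]]
  end
end

# Fetch and parse a live page. Needs the hpricot gem and network access,
# so the requires are kept local to this method.
# (On older Rubies use open(url) instead of URI.open(url).)
def load_page(url)
  require "open-uri"
  require "hpricot"
  Hpricot(URI.open(url).read)
end

# Usage (placeholder XPath):
#   doc = load_page("http://www.worldweather.org")
#   links_at(doc, "//table//a").each { |name, href| puts "#{name} -> #{href}" }
```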
Use some Logic & You’ll Get
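
The "logic" here is mostly cleanup of whatever the XPath query returned. As a sketch under assumptions: the input is taken to be a list of [name, href] pairs, `country_index` is a hypothetical helper name, and the sample paths are invented. `URI.join` from the standard library turns relative links into absolute ones.

```ruby
require "uri"

BASE_URL = "http://www.worldweather.org/"

# Turn raw [name, href] pairs into a { name => absolute_url } hash,
# skipping links that have no text or no href.
def country_index(pairs, base = BASE_URL)
  pairs.each_with_object({}) do |(name, href), index|
    next if name.nil? || name.strip.empty? || href.nil?
    index[name.strip] = URI.join(base, href).to_s
  end
end

# Made-up sample data standing in for the parsed links.
sample = [["India", "066/m066.htm"], ["", "#top"], ["Japan", "/046/m046.htm"]]
country_index(sample).each { |name, url| puts "#{name}: #{url}" }
```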
Just Try it Out

 Questions?
References

•  Ruby Programming Language: http://www.ruby-lang.org/en/
•  Hpricot: http://code.whytheluckystiff.net/hpricot/
•  X-Path: http://en.wikipedia.org/wiki/XPath
•  Firebug: http://getfirebug.com/
Thanks 
