HTML Parsing With Hpricot

We can use Hpricot to parse virtually any website. These slides show some techniques for parsing a site by tags, element IDs, and XPath.

Published in Technology, Education

Transcript

  • 1. Linux Creative Group Hpricot – Dig The Impossible With Ruby By: Subhransu Behera arya.subhransu@gmail.com
  • 2. Ruby !!! What’s Special?
  • 3. So … Let’s See !
      • Dynamic
      • Easy to Learn
      • Easy to maintain and grow
      • Convenient Short-Cuts
        Ex: Str = "Linux Creative Group"
            Str_join = Str.split(" ").join("+")
      • Transparent, code faster
      • Few Syntax Errors, Fewer Bugs
      • It’s Fun
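
The short-cut on this slide can be tried as a standalone script. This is a minimal sketch that only uses core Ruby; the lowercase variable names `name` and `joined` are substitutions for the slide's `Str` and `Str_join`, chosen because capitalized names are constants in Ruby.

```ruby
# The slide's one-liner, using lowercase local variables instead of constants.
name = "Linux Creative Group"

# split(" ") breaks the string on whitespace into an array of words;
# join("+") glues the words back together with "+" between them.
joined = name.split(" ").join("+")

puts joined
```

Chaining `split` and `join` like this avoids an explicit loop, which is the kind of convenience the slide is pointing at.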

  • 4. Ruby Gems
      • Package Management System for Ruby Applications and Libraries
      • Resolve Dependencies.
      • Provides Central Repository of Software.
      • One Command Rules:
          - gem install <gem_name>
      • Can Have your Own Local Gem Server
          - gem install <gem_name> --source <gem_server_ip_and_port>

  • 5. Hpricot makes it easy to Parse
  • 6. Hpricot
      • Pull information from virtually any website.
      • Search by Element ID, Tags, CSS Selectors.
      • Parse HTML, including broken HTML.
      • Update HTML.
      • Use this data anywhere and any way you want!
      • Parse by XPath to reach an element directly.
      • Let’s see … how it works.


  • 7. Let’s Parse A Badly Designed Site !!
      • http://www.worldweather.org
      • It’s a site that provides weather information for different locations across the globe.
      • The main page has a badly nested table structure !!
      • An ideal web developer would have put the entries neatly in divs with meaningful IDs.
      • But let’s face the truth and parse the Country Names and their URLs.

  • 8. Easy Steps – 1. Open The Site
  • 9. Easy Steps – 2. Inspect With Firebug
  • 10. Easy Steps – 3. Copy X-Path of the Element
  • 11. Easy Steps – 4. Parse By X- Path Using Hpricot
  • 12. Use some Logic & You’ll Get
  • 13. Just Try it Out Questions?
  • 14. References
      • Ruby Programming Language: http://www.ruby-lang.org/en/
      • Hpricot: http://code.whytheluckystiff.net/hpricot/
      • XPath: http://en.wikipedia.org/wiki/XPath
      • Firebug: http://getfirebug.com/

  • 15. Thanks 