Screen Scraping with Ruby Jeremy Raines [email_address]
A brief overview of screen scraping with Ruby

Published in: Technology
Screen Scraping with Ruby

  1. Screen Scraping with Ruby
       • Jeremy Raines [email_address]
  2. Pre-Reqs
       • XPath
           • Used for addressing elements in an XML document
           • http://www.w3schools.com/XPath/default.asp
       • Ruby regular expressions
           • www.rubular.com -- fastest way to build rope-swinging regex skills in Ruby
  3. Rubular
       • Input a string and try out regexes on it
  4. XPath basics
       • //book -- selects all book elements, no matter where they are in the document
       • //bookstore/book -- selects all book elements that are direct children of a bookstore element
       • //bookstore//book -- selects all book elements that are descendants of a bookstore element, no matter how deep in the tree
       • //book[1] -- selects the first book element (strictly, every book that is the first book child of its parent)
       • //book[@category="fiction"] -- @ selects an attribute; here the predicate keeps only the book elements whose category is "fiction"
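
The expressions on this slide can be tried out in irb with REXML. The following is a minimal sketch, not code from the talk: the sample document, the `SAMPLE_XML` constant, and the `titles` helper are all made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical sample document (not from the slides).
SAMPLE_XML = <<~DOC
  <library>
    <bookstore>
      <book category="fiction"><title>Dune</title></book>
      <shelf>
        <book category="cooking"><title>Salt</title></book>
      </shelf>
    </bookstore>
    <book category="fiction"><title>Emma</title></book>
  </library>
DOC

# Hypothetical helper: returns the title text of every book matched by +xpath+.
def titles(doc, xpath)
  REXML::XPath.match(doc, xpath).map { |book| book.elements['title'].text }
end

doc = REXML::Document.new(SAMPLE_XML)

all_books       = titles(doc, '//book')                       # every book, anywhere
direct_children = titles(doc, '//bookstore/book')             # only direct children of bookstore
descendants     = titles(doc, '//bookstore//book')            # any depth below bookstore
fiction         = titles(doc, '//book[@category="fiction"]')  # filtered by attribute

puts all_books.inspect, direct_children.inspect, descendants.inspect, fiction.inspect
```

Comparing the `/` and `//` results shows the difference: the book nested inside `shelf` only appears in the descendant query.
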
  5. Not-so-secret weapon: Firebug
       • Lets you find the XPath of any element in a web page
       • “Inspect Element” on a representative item
       • Focus on classes, think about loops
           • This will produce easier-to-read XPaths
  6. Example
  7. Example (cont'd)
       • //sbScoreboxScores//sbScoreboxTeamAway
       • //sbScoreboxScores//sbScoreboxTotal
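
The page this example was taken from is not reproduced in the transcript, so the following is only a sketch under assumptions: it treats `sbScoreboxScores`, `sbScoreboxTeamAway`, and `sbScoreboxTotal` as class names (as the Firebug slide suggests), and the markup, the `div`/`td` element names, and the `texts_at` helper are invented for illustration.

```ruby
require 'rexml/document'

# Made-up, well-formed stand-in for the scoreboard markup (assumption).
SCOREBOARD = <<~HTML
  <div class="sbScoreboxScores">
    <table>
      <tr><td class="sbScoreboxTeamAway">Away Team</td><td class="sbScoreboxTotal">3</td></tr>
      <tr><td>Home Team</td><td class="sbScoreboxTotal">5</td></tr>
    </table>
  </div>
HTML

# Hypothetical helper: text of every element matched by +xpath+.
def texts_at(doc, xpath)
  REXML::XPath.match(doc, xpath).map(&:text)
end

doc = REXML::Document.new(SCOREBOARD)

# Class-based paths: anchor on the container class, then look at any depth inside it.
away_team = texts_at(doc, '//div[@class="sbScoreboxScores"]//td[@class="sbScoreboxTeamAway"]')
totals    = texts_at(doc, '//div[@class="sbScoreboxScores"]//td[@class="sbScoreboxTotal"]')

puts away_team.inspect
puts totals.map(&:to_i).inspect
```

Because the paths name classes instead of positions, the same expression matches every row, which is what makes looping over results straightforward.
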
  8. How do we get at the data in these elements using Ruby?
       • open-uri -- standard library for opening pages
       • REXML -- built-in XML parser with XPath support
           • Fast, good for straightforward tasks
       • Hpricot -- popular gem
           • More powerful; I usually use it in conjunction with:
       • WWW::Mechanize
           • For getting at data behind forms. Requires Hpricot
       • ScRUBYt! -- powerful, high-level abstraction, magic
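
As a rough illustration of the first two tools working together, here is a sketch that assumes the page is well-formed XML (REXML will raise a parse error on sloppy HTML). The `scrape_text` helper and the URL in the comment are hypothetical, not taken from the talk.

```ruby
require 'open-uri'
require 'rexml/document'
require 'stringio'

# Hypothetical helper: returns the stripped text of every element matching
# +xpath+. +source+ may be a String or any IO-like object, which includes
# what URI.open yields.
def scrape_text(source, xpath)
  doc = REXML::Document.new(source)
  REXML::XPath.match(doc, xpath).map { |node| node.text.to_s.strip }
end

# Live usage (hypothetical URL; needs network access and a well-formed page):
#   URI.open('http://example.com/feed.xml') { |page| scrape_text(page, '//title') }

# Offline usage with an in-memory stand-in for a fetched page:
feed = StringIO.new('<feed><title> First </title><title>Second</title></feed>')
puts scrape_text(feed, '//title').inspect
```

Keeping the fetch separate from the parsing makes it easy to iterate on XPaths in irb against a saved copy of the page instead of hitting the site each time.
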
  9. My Process
       • Find a good source
       • Use Firebug to determine the XPaths of the elements you want to scrape
       • Play around in irb to see what kind of output you can get from addressing these elements with the aforementioned tools
       • Refine the XPaths
       • Play around in Rubular to find regexes that will clean up your output (remove whitespace, etc.)
       • Ruby's string methods help with this last part
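
The clean-up steps at the end of this process might look something like the sketch below. The raw string and the `clean` and `extract_numbers` helpers are made up for illustration; real scraped output will need its own patterns.

```ruby
# Collapse every run of whitespace (spaces, tabs, newlines) into a single
# space, then trim both ends.
def clean(text)
  text.gsub(/\s+/, ' ').strip
end

# Pull every integer out of an already-cleaned string.
def extract_numbers(text)
  text.scan(/\d+/).map(&:to_i)
end

raw    = "\n   Final   score:\t 5 -\n 3  \n" # made-up example of messy scraped text
tidy   = clean(raw)                          # => "Final score: 5 - 3"
scores = extract_numbers(tidy)               # => [5, 3]

puts tidy
puts scores.inspect
```

Here the regex (`gsub`, `scan`) does the pattern work and the string methods (`strip`, `to_i`) finish the job, which mirrors the Rubular-then-string-methods order above.
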
  10. Example: Scraping Quotes from your Tumblr
  11. Warning
       • Screen scraping is very iterative and involves a lot of trial and error
       • Make sure you comment a lot as you go along
       • Even with clear XPaths, it's best to describe what your code is doing with all the regexes and string functions that inevitably build up
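
One way to follow this advice for regexes is Ruby's extended mode (the `x` flag), which lets comments live inside the pattern itself. This is a sketch only: the input format, the `SCORE_LINE` pattern, and the `parse_score_line` helper are invented for illustration.

```ruby
# Matches a made-up line format such as "Away Team 3, Home Team 5".
# In extended mode, whitespace and #-comments inside the pattern are ignored.
SCORE_LINE = /
  \A
  (?<away>\D+?)        # away team name: shortest run of non-digits
  \s+
  (?<away_score>\d+)   # away score
  ,\s*                 # comma separating the two halves
  (?<home>\D+?)        # home team name
  \s+
  (?<home_score>\d+)   # home score
  \z
/x

# Hypothetical helper: returns a hash of the parsed fields, or nil when the
# line does not have the expected shape.
def parse_score_line(line)
  match = SCORE_LINE.match(line.strip)
  return nil unless match

  {
    away: match[:away], away_score: match[:away_score].to_i,
    home: match[:home], home_score: match[:home_score].to_i
  }
end

puts parse_score_line("  Away Team 3, Home Team 5 \n").inspect
```

Named groups plus inline comments mean the pattern documents itself, so it is easier to revisit when the source page changes.
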
  12. References
       • http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/
           • Inspiration for this presentation
           • Describes the Firebug + Hpricot method
       • http://code.whytheluckystiff.net/hpricot/wiki
       • http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails/
       • http://www.xml.com/pub/a/2005/11/09/rexml-processing-xml-in-ruby.html
       • http://mechanize.rubyforge.org/mechanize/files/EXAMPLES_txt.html
           • WWW::Mechanize examples
       • For an example that requires Mechanize for logging in, check out my code for logging into LinkedIn and scraping all your 2nd-degree contacts
           • jeremyraines.com/linkedinscraper