Screen Scraping with Ruby


A brief overview of screen scraping with Ruby

Published in: Technology

  1. Screen Scraping with Ruby
     Jeremy Raines [email_address]
  2. Pre-Reqs
     - XPath
       - Used for addressing elements in an XML doc
     - Ruby Regular Expressions
       - (link not preserved) -- fastest way to build rope-swinging regex skills in Ruby
  3. Rubular
     - Input a string and try out regexes on it
  4. XPath basics
     - //book -- selects all book nodes, no matter where they are in the document
     - //bookstore/book -- selects all books that are a direct child element of bookstore
     - //bookstore//book -- selects all books that are descendants of bookstore, no matter how deep in the tree
     - //book[1] -- selects each book that is the first book child of its parent; (//book)[1] selects only the first book in the document
     - //book[@category="fiction"] -- @ selects an attribute, in this case specifically fiction books
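
The expressions on this slide can be tried out in irb with REXML. What follows is a minimal sketch, not part of the original deck: the bookstore document, the titles in it, and the `titles` helper are all made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical bookstore document, invented for this example.
XML = <<~DOC
  <bookstore>
    <book category="fiction"><title>Dune</title></book>
    <book category="cooking"><title>Salt</title></book>
    <shelf>
      <book category="fiction"><title>Emma</title></book>
    </shelf>
  </bookstore>
DOC

doc = REXML::Document.new(XML)

# Hypothetical helper: run an XPath and return the titles of the matched books.
def titles(doc, xpath)
  REXML::XPath.match(doc, xpath).map { |book| book.elements['title'].text }
end

all_books     = titles(doc, '//book')                      # every book, at any depth
direct_books  = titles(doc, '//bookstore/book')            # direct children only
nested_books  = titles(doc, '//bookstore//book')           # any descendant of bookstore
first_book    = titles(doc, '/bookstore/book[1]')          # first book child of bookstore
fiction_books = titles(doc, '//book[@category="fiction"]') # filter on an attribute

puts "direct children: #{direct_books.inspect}"
puts "fiction:         #{fiction_books.inspect}"
```

The book inside `shelf` is what separates `/` from `//`: it is a descendant of `bookstore` but not a direct child.
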
  5. Not-so-secret weapon: Firebug
     - Lets you find the XPath of any element in a web page
     - "Inspect Element" on a representative item
     - Focus on classes, think about loops
       - This will produce easier-to-read XPaths
  6. Example
  7. Example (cont'd)
     - //sbScoreboxScores//sbScoreboxTeamAway
     - //sbScoreboxScores//sbScoreboxTotal
  8. How do we get at the data in these elements using Ruby?
     - open-uri -- built-in library for opening pages
     - REXML -- built-in XML parser with XPath support
       - Fast, good for straightforward tasks
     - Hpricot -- popular gem
       - More powerful; I usually use it in conjunction with:
     - WWW::Mechanize
       - For getting at data behind forms. Requires Hpricot
     - scRUBYt! -- powerful, high-level abstraction, magic
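
As a rough sketch of the two built-in tools working together: open-uri fetches the page, and REXML addresses elements by XPath. Several things here are assumptions of mine rather than anything from the deck: the markup is invented, the `sb*` names are treated as CSS classes on `div` elements, and `fetch` is a hypothetical helper that is defined but never called, so the example runs offline. REXML is a strict XML parser, so this only works on well-formed markup.

```ruby
require 'open-uri'
require 'rexml/document'

# Hypothetical helper: download a page body with open-uri.
# Defined for illustration only; it is not called below.
def fetch(url)
  URI.open(url, &:read)
end

# Invented, well-formed markup standing in for a scoreboard page.
# Assumption: the sb* names are class attributes on div elements.
PAGE = <<~HTML
  <div class="sbScoreboxScores">
    <div class="sbScoreboxTeamAway">Away Team</div>
    <div class="sbScoreboxTotal">3</div>
    <div class="sbScoreboxTeamHome">Home Team</div>
    <div class="sbScoreboxTotal">5</div>
  </div>
HTML

doc = REXML::Document.new(PAGE)

# Address elements by class, scoped to the scorebox container.
scorebox = '//div[@class="sbScoreboxScores"]'
away     = REXML::XPath.first(doc, %(#{scorebox}//div[@class="sbScoreboxTeamAway"])).text
totals   = REXML::XPath.match(doc, %(#{scorebox}//div[@class="sbScoreboxTotal"]))
                       .map { |node| node.text.to_i }

puts "away team: #{away}"
puts "totals:    #{totals.inspect}"
```
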
  9. My Process
     - Find a good source
     - Use Firebug to determine the XPaths of the elements you want to scrape
     - Play around in irb to see what kind of output you get from addressing these elements with the aforementioned tools
     - Refine the XPaths
     - Play around in Rubular to find regexes that will clean up your output (remove whitespace, etc.)
     - Ruby's string methods help with this last part
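
The clean-up step at the end of this process might look something like the sketch below. The raw string is hypothetical, made up to resemble the whitespace-heavy text that tends to come back from a scraped element.

```ruby
# Hypothetical raw text, as it might come back from a scraped element.
raw = "\n\t  Away Team \u00a0  3 \n   (Final)  \n"

cleaned = raw
  .gsub("\u00a0", ' ')  # non-breaking spaces are not matched by \s, so swap them first
  .gsub(/\s+/, ' ')     # collapse every run of whitespace into a single space
  .strip                # trim what is left at either end

# String#[] with a regex returns the first match, which is handy for one-off values.
score = cleaned[/\d+/].to_i

puts cleaned
puts score
```
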
  10. Example: Scraping Quotes from your Tumblr
  11. Warning
      - Screen scraping is very iterative and involves a lot of trial and error
      - Make sure you comment a lot as you go along
      - Even with clear XPaths, it's best to describe what your code is doing with all the regexes and string functions that inevitably build up
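
One way to keep regexes commented as they pile up is Ruby's extended mode (the `x` flag), which ignores whitespace inside the pattern and allows `#` comments. This is only a sketch: the input line and its format are assumed for illustration and are not taken from the deck.

```ruby
# Hypothetical scraped line; the format is an assumption.
line = 'Final: Away Team 3, Home Team 5'

# Extended mode lets each part of the pattern carry its own comment.
SCORE_LINE = /
  \A
  (?<status>\w+):\s+      # game status, then a colon
  (?<away>.+?)\s+         # away team name, lazy so it stops before the score
  (?<away_score>\d+),\s+  # away score, then a comma
  (?<home>.+?)\s+         # home team name
  (?<home_score>\d+)      # home score
  \z
/x

result = SCORE_LINE.match(line)

puts "#{result[:away]} #{result[:away_score]}, #{result[:home]} #{result[:home_score]}"
```
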
  12. References
      - Inspiration for this presentation; describes the Firebug + Hpricot method
      - WWW::Mechanize examples
      - For an example that requires Mechanize for logging in, check out my code for logging into LinkedIn and scraping all your 2nd-degree contacts