Screen Scraping with Ruby

A brief overview of screen scraping with Ruby


Screen Scraping with Ruby Presentation Transcript

  • 1. Screen Scraping with Ruby Jeremy Raines [email_address]
  • 2. Pre-Reqs
    • XPath
      • Used for addressing elements in an XML doc
    • Ruby Regular Expressions
      • Rubular -- fastest way to build rope-swinging regex skills in Ruby
  • 3. Rubular
    • Input a string and try out regexes on it
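
The same try-it-and-see loop that Rubular offers in the browser also works in irb. A minimal sketch, assuming a made-up scoreboard string (the sample text, pattern, and variable names are illustrative and not from the talk):

```ruby
# Hypothetical sample text, standing in for something scraped from a page.
line = "  Yankees 5,  Red Sox 3  "

# Group 1 lazily captures a run of letters and spaces (the team name),
# group 2 captures the digits that follow it (the score).
pattern = /([A-Za-z ]+?)\s+(\d+)/

# With capture groups, String#scan returns one array of captures per match.
matches = line.scan(pattern)

# Tidy each pair: trim the name and convert the score to an Integer.
scores = matches.map { |team, score| [team.strip, score.to_i] }

puts scores.inspect
```

Pasting the same string and pattern into Rubular highlights the two capture groups, which makes it easier to see why the lazy `+?` is needed before the whitespace.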
  • 4. XPath basics
    • //book -- selects all book nodes, no matter where they are in the document
    • //bookstore/book -- selects all book elements that are direct children of a bookstore element
    • //bookstore//book -- selects all book elements that are descendants of bookstore, no matter how deep in the tree
    • //book[1] -- selects the first book element under each parent
    • //book[@category="fiction"] -- @ selects an attribute; here, only books whose category is "fiction"
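
These expressions can be tried out directly with REXML. The sketch below assumes a small, made-up bookstore document (the titles, the `shelf` element, and the `titles` helper are hypothetical) and uses `/bookstore/book[1]` for the positional case so that exactly one node is expected:

```ruby
require 'rexml/document'

# Hypothetical bookstore document, invented for illustration.
xml = <<~XML
  <bookstore>
    <book category="fiction"><title>Dune</title></book>
    <book category="reference"><title>XPath Primer</title></book>
    <shelf>
      <book category="fiction"><title>Emma</title></book>
    </shelf>
  </bookstore>
XML

doc = REXML::Document.new(xml)

# Helper: run an XPath query and return the titles of the matching books.
titles = lambda do |xpath|
  REXML::XPath.match(doc, xpath).map { |book| book.elements['title'].text }
end

all_books     = titles.call('//book')                      # every book, at any depth
direct_books  = titles.call('//bookstore/book')            # direct children of bookstore only
nested_books  = titles.call('//bookstore//book')           # any descendant of bookstore
first_book    = titles.call('/bookstore/book[1]')          # first book child of the root
fiction_books = titles.call('//book[@category="fiction"]') # filter on an attribute

puts fiction_books.inspect
```

The difference between `/` and `//` shows up in `direct_books` versus `nested_books`: only the latter reaches the book nested inside `shelf`.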
  • 5. Not-so-secret weapon: Firebug
    • Lets you find the XPath of any element in a web page
    • “Inspect Element” on a representative item
    • Focus on classes, think about loops
      • This will produce easier-to-read XPaths
  • 6. Example
  • 7. Example (cont'd)
      • //sbScoreboxScores//sbScoreboxTeamAway
      • //sbScoreboxScores//sbScoreboxTotal
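
The slide images are not part of this transcript, so the following is only a guess at how paths like these could be used. It assumes that `sbScoreboxScores`, `sbScoreboxTeamAway`, and `sbScoreboxTotal` are values of `class` attributes (in keeping with the "focus on classes" advice), and it runs against a made-up, well-formed fragment; the markup, the `sbScoreboxTeamHome` class, and the team names are hypothetical:

```ruby
require 'rexml/document'

# Hypothetical, well-formed markup loosely modeled on a scoreboard box.
# REXML is an XML parser, so real-world HTML that is not well-formed
# would need cleaning up (or a more forgiving parser) first.
html = <<~HTML
  <div class="sbScoreboxScores">
    <table>
      <tr class="sbScoreboxTeamAway"><td>Away Team</td><td class="sbScoreboxTotal">3</td></tr>
      <tr class="sbScoreboxTeamHome"><td>Home Team</td><td class="sbScoreboxTotal">5</td></tr>
    </table>
  </div>
HTML

doc = REXML::Document.new(html)

# Assumption: the names on the slide are class names, so match on @class.
away_xpath  = '//*[@class="sbScoreboxScores"]//*[@class="sbScoreboxTeamAway"]'
total_xpath = '//*[@class="sbScoreboxScores"]//*[@class="sbScoreboxTotal"]'

# One away row per score box; its first cell holds the team name.
away_row  = REXML::XPath.first(doc, away_xpath)
away_team = away_row.elements['td'].text

# Looping over every total cell gives the scores in document order.
totals = REXML::XPath.match(doc, total_xpath).map { |cell| cell.text.to_i }

puts "#{away_team}: #{totals.first}"
```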
  • 8. How do we get at the data in these elements using Ruby?
    • open-uri -- standard library for opening pages
    • REXML -- XML parser with XPath support that ships with Ruby
      • Fast, good for straightforward tasks
    • Hpricot -- popular gem
      • More powerful; I usually use it in conjunction with:
    • WWW::Mechanize
      • For getting at data behind forms. Requires Hpricot
    • ScRUBYt! -- powerful, high-level abstraction, magic
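
As a rough illustration of the first two tools working together, here is a sketch that uses only libraries shipped with Ruby. The method names `fetch_page` and `scrape_texts`, the example URL, and the sample feed are all hypothetical; Hpricot and WWW::Mechanize are left out because they are separate gems:

```ruby
require 'open-uri'
require 'rexml/document'

# Fetch the raw body of a page. URI.open is provided by open-uri.
def fetch_page(url)
  URI.open(url) { |io| io.read }
end

# Return the stripped text of every element matching the XPath.
# The markup must be well-formed XML for REXML to parse it.
def scrape_texts(markup, xpath)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, xpath).map { |node| node.text.to_s.strip }
end

# With network access, the two steps would be chained like this
# (hypothetical URL, not executed here):
#   page = fetch_page('http://example.com/feed.xml')
#   puts scrape_texts(page, '//item/title')

# Offline demonstration on a made-up feed.
sample = '<rss><channel>' \
         '<item><title> First post </title></item>' \
         '<item><title>Second post</title></item>' \
         '</channel></rss>'

item_titles = scrape_texts(sample, '//item/title')
puts item_titles.inspect
```

Keeping the fetch and the parse in separate methods makes it easy to experiment in irb with a saved copy of the page instead of hitting the site on every attempt.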
  • 9. My Process
    • Find a good source
    • Use Firebug to determine the XPaths of the elements you want to scrape
    • Play around in irb to see what kind of output you get from addressing these elements with the tools above
    • Refine the XPaths
    • Play around in Rubular to find regexes that will clean up your output (remove whitespace, etc.)
    • Ruby's string methods help with this last part
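
The clean-up step at the end of this process might look something like the sketch below. The raw string and the "(posted ...)" note are invented for illustration; the point is only to show regexes and string methods working together:

```ruby
# Hypothetical raw text, as it might come back from an XPath query.
raw = "\n\t  Quote of   the day \n   (posted 3 hours ago) \t\n"

# Regex: drop the parenthesized "(posted ...)" note.
without_note = raw.gsub(/\(posted[^)]*\)/, '')

# Regex: collapse every run of whitespace (spaces, tabs, newlines) into one space.
single_spaced = without_note.gsub(/\s+/, ' ')

# String method: trim the leading and trailing space that remains.
cleaned = single_spaced.strip

puts cleaned
```

Each step is kept on its own line with a comment, so that when the page layout changes it is clear which regex needs adjusting.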
  • 10. Example: Scraping Quotes from your Tumblr
  • 11. Warning
    • Screen scraping is very iterative & involves a lot of trial and error
    • Make sure you comment a lot as you go along
    • Even with clear XPaths, it's best to describe what your code is doing, because regexes and string functions inevitably build up
  • 12. References
      • Inspiration for this presentation
      • Describes the Firebug + hpricot method
      • WWW::Mechanize examples
    • For an example that requires Mechanize for logging in, check out my code for logging into LinkedIn and scraping all your 2nd-degree contacts