Screen Scraping with Ruby

A brief overview of screen scraping with Ruby

Usage Rights

© All Rights Reserved

Presentation Transcript

  • Screen Scraping with Ruby Jeremy Raines [email_address]
  • Pre-Reqs
    • XPath
      • Used for addressing elements in an XML doc
      • http://www.w3schools.com/XPath/default.asp
    • Ruby Regular Expressions
      • www.rubular.com -- fastest way to build rope-swinging regex skills in Ruby
  • Rubular
    • Input a string and try out regexes on it
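The same kind of experiment can be run in irb once a pattern looks promising in Rubular. The sketch below is only an illustration: the sample string and the patterns are made up for this example and do not come from the slides.

```ruby
# Hypothetical scraped string; any text pasted into Rubular works the same way.
line = "Final: Cubs 3, Mets 5"

# =~ returns the index of the first match, or nil when nothing matches.
position = line =~ /\d+/

# scan returns every match; with capture groups, each match is an array of groups.
scores = line.scan(/([A-Z][a-z]+) (\d+)/)

# match returns a MatchData object; named groups are read with [].
m = line.match(/(?<away>\w+) (?<away_score>\d+), (?<home>\w+) (?<home_score>\d+)/)

puts position
puts scores.inspect
puts "#{m[:away]} #{m[:away_score]} at #{m[:home]} #{m[:home_score]}"
```

Capture groups are usually the part worth practicing, since they let one pattern both locate and split the text you want.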
  • XPath basics
    • //book -- selects all book nodes no matter where they are in the document
    • //bookstore/book -- selects all book elements that are direct children of a bookstore element
    • //bookstore//book -- selects all book elements that are descendants of bookstore, no matter how deep in the tree
    • //book[1] -- selects each book element that is the first book child of its parent
    • //book[@category="fiction"] -- @ selects an attribute; the predicate keeps only books whose category is "fiction"
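These expressions can be tried directly from Ruby with REXML. The sketch below assumes a small made-up bookstore document (the titles and the shelf element are invented for illustration) and a hypothetical titles helper.

```ruby
require 'rexml/document'

# Hypothetical sample document, modeled on the bookstore example used in XPath tutorials.
xml = <<~XML
  <bookstore>
    <book category="fiction"><title>Dune</title></book>
    <book category="cooking"><title>Salt</title></book>
    <shelf>
      <book category="fiction"><title>Emma</title></book>
    </shelf>
  </bookstore>
XML

doc = REXML::Document.new(xml)

# Hypothetical helper: run an XPath and return the title text of each matched book.
def titles(doc, xpath)
  REXML::XPath.match(doc, xpath).map { |book| book.elements['title'].text }
end

all_books     = titles(doc, '//book')                       # every book, at any depth
direct_books  = titles(doc, '//bookstore/book')             # direct children of bookstore only
nested_books  = titles(doc, '//bookstore//book')            # any descendant of bookstore
fiction_books = titles(doc, '//book[@category="fiction"]')  # filtered by attribute value

puts direct_books.inspect
puts fiction_books.inspect
```

The difference between / and // shows up in the nested book: it is missing from direct_books but present in nested_books.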
  • Not-so-secret weapon: Firebug
    • Lets you find the XPath of any element in a web page.
    • Use “Inspect Element” on a representative item
    • Focus on classes, think about loops
      • This will produce easier-to-read XPaths
  • Example
  • Example (cont'd)
      • //sbScoreboxScores//sbScoreboxTeamAway
      • //sbScoreboxScores//sbScoreboxTotal
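The names above look like CSS class names, so in a runnable XPath they would most likely appear inside attribute predicates. The sketch below is a guess at that shape: the div/table/td markup and the team names are assumptions made for illustration, not the actual page from the slides.

```ruby
require 'rexml/document'

# Hypothetical scoreboard markup; the real page's tags and nesting may differ.
html = <<~HTML
  <div class="sbScoreboxScores">
    <table>
      <tr><td class="sbScoreboxTeamAway">Cubs</td><td class="sbScoreboxTotal">3</td></tr>
      <tr><td class="sbScoreboxTeamHome">Mets</td><td class="sbScoreboxTotal">5</td></tr>
    </table>
  </div>
HTML

doc = REXML::Document.new(html)

# Class-based XPaths: match on the class attribute rather than on position in the tree.
away_xpath  = '//div[@class="sbScoreboxScores"]//td[@class="sbScoreboxTeamAway"]'
total_xpath = '//div[@class="sbScoreboxScores"]//td[@class="sbScoreboxTotal"]'

away_team = REXML::XPath.first(doc, away_xpath).text
totals    = REXML::XPath.match(doc, total_xpath).map(&:text)

puts away_team
puts totals.inspect
```

Matching on classes keeps the expression readable and tends to survive small layout changes better than a long positional path.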
  • How do we get at the data in these elements using Ruby?
    • open-uri -- standard library for opening pages (require 'open-uri')
    • REXML -- standard-library XML parser with XPath support
      • Fast, good for straightforward tasks
    • Hpricot -- popular gem
      • More powerful; I usually use it in conjunction with:
    • WWW::Mechanize
      • For getting at data behind forms. Requires Hpricot
    • scRUBYt! -- powerful, high-level abstraction, magic
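A minimal sketch of the two standard-library tools working together might look like the following. The fetch and extract helpers, the sample feed, and the URL in the final comment are all hypothetical. REXML only parses well-formed XML, so this approach suits feeds and clean XHTML, which is one reason a more forgiving parser such as Hpricot is listed above.

```ruby
require 'open-uri'
require 'rexml/document'

# Download a page body as a string. URI.open is the open-uri entry point on
# current Rubies; code from the era of these slides called open(url) directly.
def fetch(url)
  URI.open(url, &:read)
end

# Return the text of every element in an XML string that matches the XPath.
def extract(xml, xpath)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, xpath).map(&:text)
end

# Inline stand-in for a fetched feed, so the sketch runs without a network call.
sample = '<rss><channel>' \
         '<item><title>First post</title></item>' \
         '<item><title>Second post</title></item>' \
         '</channel></rss>'

titles = extract(sample, '//item/title')
puts titles.inspect

# With a real, well-formed feed (hypothetical URL):
#   titles = extract(fetch('http://example.com/feed.xml'), '//item/title')
```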
  • My Process
    • Find a good source
    • Use Firebug to determine the XPaths of the elements you want to scrape
    • Play around in irb to see what kind of output you get from addressing these elements with the aforementioned tools
    • Refine the XPaths
    • Play around in Rubular to find regexes that will clean up your output (remove whitespace, etc)
    • Ruby's string methods help with this last part
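The clean-up step usually ends up as a mix of one or two regexes and plain string methods. The sketch below assumes a made-up chunk of raw text and a hypothetical clean helper; the " -- " separator between quote and author is also an assumption.

```ruby
# Hypothetical raw text, as it might come back from an element's text node.
raw = "\n   \"Scraping is   mostly\n cleanup.\" \n\t -- Example Author  "

# Hypothetical helper: normalize whitespace in scraped text.
def clean(text)
  collapsed = text.gsub(/\s+/, ' ')  # collapse runs of whitespace, newlines included
  collapsed.strip                    # drop leading and trailing whitespace
end

cleaned = clean(raw)

# Plain string methods handle the rest: split on the separator, drop the quote marks.
quote, author = cleaned.split(' -- ', 2)
quote = quote.delete('"')

puts quote
puts author
```

Keeping each transformation on its own commented line makes it easier to see which step to adjust when the source page changes.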
  • Example: Scraping Quotes from your Tumblr
  • Warning
    • Screen scraping is very iterative & involves a lot of trial and error
    • Make sure you comment a lot as you go along
    • Even with clear XPaths, it's best to describe what your code is doing, because regexes and string functions inevitably build up
  • References
    • http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/
      • Inspiration for this presentation
      • Describes the Firebug + hpricot method
    • http://code.whytheluckystiff.net/hpricot/wiki
    • http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails/
    • http://www.xml.com/pub/a/2005/11/09/rexml-processing-xml-in-ruby.html
    • http://mechanize.rubyforge.org/mechanize/files/EXAMPLES_txt.html
      • WWW::Mechanize examples
    • For an example that requires Mechanize for logging in, check out my code for logging into LinkedIn and scraping all your 2nd-degree contacts
      • jeremyraines.com/linkedinscraper