Screen Scraping with Ruby

A brief overview of screen scraping with Ruby

Screen Scraping with Ruby Presentation Transcript

  • 1. Screen Scraping with Ruby Jeremy Raines [email_address]
  • 2. Pre-Reqs
    • XPath
      • Used for addressing elements in an XML doc
      • http://www.w3schools.com/XPath/default.asp
    • Ruby Regular Expressions
      • www.rubular.com -- fastest way to build rope-swinging regex skills in Ruby
  • 3. Rubular
    • Input a string and try out regexes on it
  • 4. XPath basics
    • //book -- selects all book nodes no matter where they are in the document
    • //bookstore/book -- selects all book elements that are direct children of a bookstore element
    • //bookstore//book -- selects all book elements that are descendants of a bookstore element, no matter how deep in the tree
    • //book[1] -- selects the first book child of each parent element
    • //book[@category="fiction"] -- @ selects an attribute; here, only book elements whose category attribute is "fiction"
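
The expressions above can be tried out in irb. The following is a minimal sketch using REXML; the sample document, its titles, and the `titles` helper are hypothetical and exist only for illustration.

```ruby
require 'rexml/document'

# Hypothetical sample document, modelled on the bookstore used in the slide.
xml = <<~XML
  <bookstore>
    <book category="fiction"><title>Dune</title></book>
    <book category="cooking"><title>Salt</title></book>
    <shelf>
      <book category="fiction"><title>Emma</title></book>
    </shelf>
  </bookstore>
XML

doc = REXML::Document.new(xml)

# Helper (hypothetical): run an XPath query and return the title of each matched book.
titles = lambda do |path|
  REXML::XPath.match(doc, path).map { |book| book.elements['title'].text }
end

all_books     = titles.call('//book')                      # every book, at any depth
direct_books  = titles.call('//bookstore/book')            # direct children only
nested_books  = titles.call('//bookstore//book')           # any descendant of bookstore
first_book    = titles.call('/bookstore/book[1]')          # first book child of the root
fiction_books = titles.call('//book[@category="fiction"]') # filter on an attribute

puts all_books.inspect, direct_books.inspect, fiction_books.inspect
```

The positional query is written from the document root here so that it matches exactly one element; `//book[1]` would match the first book under every parent.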
  • 5. Not-so-secret weapon: Firebug
    • Lets you find the XPath of any element in a web page.
    • “Inspect Element” of a representative item
    • Focus on classes, think about loops
      • This will produce easier-to-read XPaths
  • 6. Example
  • 7. Example (cont'd)
      • //sbScoreboxScores//sbScoreboxTeamAway
      • //sbScoreboxScores//sbScoreboxTotal
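
The screenshots for this example are not part of the transcript, so the sketch below rests on an assumption: that `sbScoreboxScores`, `sbScoreboxTeamAway`, and `sbScoreboxTotal` are class names on the page's elements. The markup, team names, and scores are hypothetical. It shows the "focus on classes, think about loops" idea: select each repeating container by class, then read the fields inside it.

```ruby
require 'rexml/document'

# Hypothetical, well-formed markup; the real page's structure is an assumption.
html = <<~HTML
  <div id="scores">
    <div class="sbScoreboxScores">
      <span class="sbScoreboxTeamAway">Team A</span>
      <span class="sbScoreboxTotal">3</span>
    </div>
    <div class="sbScoreboxScores">
      <span class="sbScoreboxTeamAway">Team B</span>
      <span class="sbScoreboxTotal">5</span>
    </div>
  </div>
HTML

doc = REXML::Document.new(html)

results = []
# Loop over every score box, then address the fields relative to that box.
REXML::XPath.each(doc, "//div[@class='sbScoreboxScores']") do |box|
  team  = box.elements["span[@class='sbScoreboxTeamAway']"].text
  total = box.elements["span[@class='sbScoreboxTotal']"].text
  results << [team, total.to_i]
end

results.each { |team, total| puts "#{team}: #{total}" }
```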
  • 8. How do we get at the data in these elements using Ruby?
    • open-uri – built-in library for opening pages
    • REXML – built-in XML parser with XPath support
      • Fast, good for straightforward tasks
    • Hpricot – popular gem
      • More powerful; I usually use it in conjunction with:
    • WWW::Mechanize
      • For getting at data behind forms. Requires Hpricot
    • ScRUBYt! -- powerful, high-level abstraction, magic
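
As a rough illustration of the first two tools in this list, here is a minimal sketch that pairs open-uri with REXML. The helper names `fetch_page` and `extract_text` and the sample markup are hypothetical. One caveat worth stating: REXML expects well-formed XML, which many real pages are not.

```ruby
require 'open-uri'
require 'rexml/document'

# Fetch the raw body of a page. URI.open is provided by open-uri.
# It is defined but not called below, so running this makes no network request.
def fetch_page(url)
  URI.open(url) { |io| io.read }
end

# Return the stripped text of every element matched by xpath
# in a well-formed document.
def extract_text(markup, xpath)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, xpath).map { |node| node.text.to_s.strip }
end

# Hypothetical stand-in for a fetched page.
sample = '<html><body><h2 class="title">First</h2>' \
         '<h2 class="title"> Second </h2></body></html>'

headlines = extract_text(sample, "//h2[@class='title']")
puts headlines.inspect
```

Against a live page the two helpers would be chained as `extract_text(fetch_page(url), xpath)`, provided the page parses as XML.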
  • 9. My Process
    • Find a good source
    • Use Firebug to determine the XPaths of the elements you want to scrape
    • Play around in irb to see what kind of output you can get from addressing these elements with the aforementioned tools
    • Refine XPaths
    • Play around in Rubular to find regexes that will clean up your output (remove whitespace, etc.)
    • Ruby's string methods help with this last part
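
The clean-up step at the end of this process might look like the sketch below. The raw string is a hypothetical example of scraped text; the methods used (`strip`, `gsub`, `String#[]` with a regex, `delete`, `to_i`) are standard Ruby.

```ruby
# Hypothetical raw text as it might come back from a scraped element.
raw = "\n   Total:\t 1,234   views \n"

# Trim the ends, then collapse every run of whitespace into a single space.
clean = raw.strip.gsub(/\s+/, ' ')

# Pull out the first run of digits and commas, drop the commas, convert to Integer.
number = clean[/[\d,]+/].delete(',').to_i

puts clean   # Total: 1,234 views
puts number  # 1234
```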
  • 10. Example: Scraping Quotes from your Tumblr
  • 11. Warning
    • Screen scraping is very iterative & involves a lot of trial and error
    • Make sure you comment a lot as you go along
    • Even with clear XPaths, it's best to describe what your code is doing, because regexes and string functions inevitably build up
  • 12. References
    • http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/
      • Inspiration for this presentation
      • Describes the Firebug + hpricot method
    • http://code.whytheluckystiff.net/hpricot/wiki
    • http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails/
    • http://www.xml.com/pub/a/2005/11/09/rexml-processing-xml-in-ruby.html
    • http://mechanize.rubyforge.org/mechanize/files/EXAMPLES_txt.html
      • WWW::Mechanize examples
    • For an example that requires Mechanize for logging in, check out my code for logging into LinkedIn and scraping all your 2nd-degree contacts
      • jeremyraines.com/linkedinscraper