Screen Scraping with Ruby

A brief overview of screen scraping with Ruby

Published in Technology

Transcript

  • 1. Screen Scraping with Ruby Jeremy Raines [email_address]
  • 2. Pre-Reqs
    • XPath
      • Used for addressing elements in an XML doc
      • http://www.w3schools.com/XPath/default.asp
    • Ruby Regular Expressions
      • www.rubular.com -- fastest way to build rope-swinging regex skills in Ruby
  • 3. Rubular
    • Input a string and try out regexes on it
  • 4. XPath basics
    • //book -- selects all book nodes, no matter where they are in the document
    • //bookstore/book -- selects all books that are direct children of bookstore
    • //bookstore//book -- selects all books that are descendants of bookstore, no matter how deep in the tree
    • //book[1] -- selects the first book child of each parent element
    • //book[@category="fiction"] -- @ selects an attribute; this predicate matches only books whose category is fiction
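The expressions above can be tried out with REXML, which ships with Ruby. This is a minimal sketch: the sample document, its titles, and the `titles` helper are made up for illustration. The attribute test is written as a predicate, `//book[@category="fiction"]`, which is the form XPath requires.

```ruby
require "rexml/document"

# Hypothetical sample document, invented for this example.
SAMPLE_XML = <<~XML
  <bookstore>
    <book category="fiction"><title>Dune</title></book>
    <book category="cooking"><title>Salt</title></book>
    <shelf>
      <book category="fiction"><title>Emma</title></book>
    </shelf>
  </bookstore>
XML

doc = REXML::Document.new(SAMPLE_XML)

# Hypothetical helper: run an XPath and return the <title> text of each match.
def titles(doc, xpath)
  REXML::XPath.match(doc, xpath).map { |book| book.elements["title"].text }
end

p titles(doc, "//book")                       # every book, at any depth
p titles(doc, "//bookstore/book")             # direct children of bookstore only
p titles(doc, "//bookstore//book")            # all descendants of bookstore
p titles(doc, "//bookstore/book[1]")          # first book child of bookstore
p titles(doc, '//book[@category="fiction"]')  # books whose category is fiction
```

Comparing the second and third calls shows the difference between `/` and `//`: the book nested inside `shelf` only appears in the `//` result.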
  • 5. Not-so-secret weapon: Firebug
    • Lets you find the XPath of any element in a web page.
    • “Inspect Element” of a representative item
    • Focus on classes, think about loops
      • This will produce easier-to-read XPaths
  • 6. Example
  • 7. Example (cont'd)
      • //sbScoreboxScores//sbScoreboxTeamAway
      • //sbScoreboxScores//sbScoreboxTotal
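The two names on this slide read like CSS class names rather than element names. Assuming that is the case, a class-based lookup might look like the sketch below. Only `sbScoreboxScores`, `sbScoreboxTeamAway` and `sbScoreboxTotal` come from the slide; the markup, the tag names, the `sbScoreboxTeamHome` class, the scores, and the `by_class` helper are all invented for illustration.

```ruby
require "rexml/document"

# Hypothetical scoreboard fragment. The real page's markup is not shown in
# the transcript, so the structure here is an assumption.
SCOREBOARD = <<~HTML
  <div class="sbScoreboxScores">
    <table>
      <tr><td class="sbScoreboxTeamAway">Away</td><td class="sbScoreboxTotal">3</td></tr>
      <tr><td class="sbScoreboxTeamHome">Home</td><td class="sbScoreboxTotal">5</td></tr>
    </table>
  </div>
HTML

doc = REXML::Document.new(SCOREBOARD)

# Hypothetical helper: find elements with class `inner` anywhere below an
# element with class `outer`, and return their text.
def by_class(doc, outer, inner)
  xpath = %(//*[@class="#{outer}"]//*[@class="#{inner}"])
  REXML::XPath.match(doc, xpath).map(&:text)
end

away   = by_class(doc, "sbScoreboxScores", "sbScoreboxTeamAway")
totals = by_class(doc, "sbScoreboxScores", "sbScoreboxTotal")
p away
p totals
```

Note that `@class="..."` is an exact string comparison, so it only matches elements whose class attribute contains that single class and nothing else.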
  • 8. How do we get at the data in these elements using Ruby?
    • open-uri -- standard library for opening URLs
    • REXML -- built-in XML parser with XPath support
      • Fast, good for straightforward tasks
    • Hpricot -- popular gem
      • More powerful; I usually use it in conjunction with:
    • WWW::Mechanize
      • For getting at data behind forms. Requires Hpricot
    • ScRUBYt! -- powerful, high-level abstraction, magic
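As a rough sketch of the first two tools working together, the code below fetches a document with open-uri and queries it with REXML. The `extract_texts` and `scrape` helpers and the sample feed are hypothetical, and `scrape` is defined but not called, since any URL here would be a guess. REXML is a strict XML parser and raises on markup that is not well formed, so this approach suits feeds and clean XHTML better than arbitrary web pages.

```ruby
require "open-uri"
require "rexml/document"

# Hypothetical helper: parse an XML string and return the stripped text
# of every node matched by the XPath.
def extract_texts(xml, xpath)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, xpath).map { |node| node.text.to_s.strip }
end

# Hypothetical helper: download a URL and run the same extraction.
# URI.open comes from open-uri (Ruby 2.5 and later).
def scrape(url, xpath)
  xml = URI.open(url) { |io| io.read }
  extract_texts(xml, xpath)
end

# Invented sample feed, so the example runs without network access.
FEED = <<~XML
  <rss><channel>
    <item><title> First post </title></item>
    <item><title>Second post</title></item>
  </channel></rss>
XML

p extract_texts(FEED, "//item/title")
```

Keeping the parsing in `extract_texts` separate from the download makes it easy to experiment in irb on a saved copy of the page.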
  • 9. My Process
    • Find a good source
    • Use Firebug to determine the XPaths of the elements you want to scrape
    • Play around in irb to see what kind of output you can get from addressing these elements with the aforementioned tools
    • Refine the XPaths
    • Play around in Rubular to find regexes that will clean up your output (remove whitespace, etc.)
    • Ruby's string methods help with this last part
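The clean-up step at the end of this process might look like the sketch below, which combines a few regexes with Ruby's string methods. The `clean` helper and the raw input string are made up for illustration; real scraped output will need its own patterns.

```ruby
# Hypothetical helper: tidy a scraped fragment into a single line of text.
def clean(text)
  text.gsub(/<[^>]+>/, "")  # drop leftover tags
      .gsub("&nbsp;", " ")  # turn non-breaking space entities into spaces
      .gsub(/\s+/, " ")     # collapse runs of whitespace, including newlines
      .strip                # trim leading and trailing space
end

# Invented example of the messy text a first scraping attempt tends to return.
raw = "\n  <b>Away:</b>&nbsp; Boston \t  3\n"

line  = clean(raw)
score = line[/\d+\z/].to_i  # pull the trailing number out of the cleaned line

p line
p score
```

Each `gsub` handles one kind of noise, which makes it easy to comment and to adjust as new cases turn up.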
  • 10. Example: Scraping Quotes from your Tumblr
  • 11. Warning
    • Screen scraping is very iterative & involves a lot of trial and error
    • Make sure you comment a lot as you go along
    • Even with clear XPaths, it's best to describe what your code is doing with all the regexes and string functions that inevitably build up
  • 12. References
    • http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/
      • Inspiration for this presentation
      • Describes the Firebug + Hpricot method
    • http://code.whytheluckystiff.net/hpricot/wiki
    • http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails/
    • http://www.xml.com/pub/a/2005/11/09/rexml-processing-xml-in-ruby.html
    • http://mechanize.rubyforge.org/mechanize/files/EXAMPLES_txt.html
      • WWW::Mechanize examples
    • For an example that requires Mechanize for logging in, check out my code for logging in to LinkedIn and scraping all your 2nd-degree contacts
      • jeremyraines.com/linkedinscraper