Screen Scraping with Ruby

A brief overview of screen scraping with Ruby

Published in: Technology

Transcript

  • 1. Screen Scraping with Ruby Jeremy Raines [email_address]
  • 2. Pre-Reqs
    • XPATH
      • Used for addressing elements in an XML doc
      • http://www.w3schools.com/XPath/default.asp
    • Ruby Regular Expressions
      • www.rubular.com -- fastest way to build rope-swinging regex skills in Ruby
  • 3. Rubular
    • Input a string and try out regexes on it
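The same try-it-on-a-string workflow can be reproduced in plain Ruby or irb. The following is a minimal sketch; the input string and the pattern are made up for illustration and do not come from the slides.

```ruby
# Hypothetical input string, invented for this sketch.
line = "  Final: Away 3,  Home 10  "

# Named groups capture each team's score.
pattern = /Away\s+(?<away>\d+),\s+Home\s+(?<home>\d+)/

match = pattern.match(line)
away_score = match[:away].to_i
home_score = match[:home].to_i

# scan returns every match, which is handy for pulling out all numbers at once.
numbers = line.scan(/\d+/).map(&:to_i)

puts "away=#{away_score} home=#{home_score} numbers=#{numbers.inspect}"
```

Named groups keep the extraction readable once several regexes pile up in a scraper.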
  • 4. XPATH basics
    • //book -- selects all book nodes no matter where they are in the document
    • //bookstore/book – selects all book elements that are direct children of bookstore
    • //bookstore//book – selects all book elements that are descendants of bookstore, no matter how deep in the tree
    • //book[1] -- selects the first book element under each parent
    • //book[@category="fiction"] -- @ selects an attribute; the predicate in brackets keeps only the fiction books
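These expressions can be tried out from Ruby with REXML, which ships with Ruby. This is a small sketch, not taken from the slides: the sample document and the `titles` helper are hypothetical. To fetch only the very first match in the document, it uses `REXML::XPath.first` rather than an index predicate.

```ruby
require 'rexml/document'

# Hypothetical sample document, invented for this sketch.
xml = <<~XML
  <library>
    <book category="fiction"><title>Emma</title></book>
    <bookstore>
      <book category="fiction"><title>Dune</title></book>
      <shelf>
        <book category="reference"><title>XPath Notes</title></book>
      </shelf>
    </bookstore>
  </library>
XML

doc = REXML::Document.new(xml)

# Hypothetical helper: run an XPath and return the matching books' titles.
titles = lambda do |xpath|
  REXML::XPath.match(doc, xpath).map { |book| book.elements['title'].text }
end

all_books     = titles.call('//book')                       # every book, anywhere
direct_books  = titles.call('//bookstore/book')             # direct children only
nested_books  = titles.call('//bookstore//book')            # any depth below bookstore
fiction_books = titles.call("//book[@category='fiction']")  # attribute predicate

# First matching book in the whole document.
first_title = REXML::XPath.first(doc, '//book').elements['title'].text

puts all_books.inspect, direct_books.inspect, nested_books.inspect,
     fiction_books.inspect, first_title
```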
  • 5. Not-so-secret weapon: Firebug
    • Lets you find the Xpath of any element in a web page.
    • “Inspect Element” of a representative item
    • Focus on classes, think about loops
      • This will produce easier-to-read XPaths
  • 6. Example
  • 7. Example (cont'd)
      • //sbScoreboxScores//sbScoreboxTeamAway
      • //sbScoreboxScores//sbScoreboxTotal
  • 8. How do we get at the data in these elements using Ruby?
    • Open-uri – built-in library for opening pages
    • REXML – built-in XML parser with XPath support
      • Fast, good for straightforward tasks
    • Hpricot – popular gem
      • More powerful; I usually use it in conjunction with:
    • WWW::Mechanize
      • For getting at data behind forms. Requires Hpricot
    • ScRUBYt! -- powerful, high-level abstraction, magic
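As a rough sketch of the simplest combination above (open-uri plus REXML), the code below pulls two values out of a scoreboard-like fragment. The markup is hypothetical: it borrows the class names from the example slide, but the real page structure is an assumption, and so are the `fetch_page` helper and the team name. REXML only accepts well-formed XML, so messy real-world HTML usually calls for an HTML parser such as Hpricot instead.

```ruby
require 'open-uri'
require 'rexml/document'

# Hypothetical helper: fetches a page body with open-uri.
# It is not called below, so the sketch runs without network access.
def fetch_page(url)
  URI.open(url).read
end

# Hypothetical, well-formed markup standing in for a fetched page.
html = <<~HTML
  <html><body>
    <div class="sbScoreboxScores">
      <div class="sbScoreboxTeamAway">Cubs</div>
      <div class="sbScoreboxTotal">3</div>
    </div>
  </body></html>
HTML

doc = REXML::Document.new(html)

# Address elements by class, nested under the scores container.
scores = "//div[@class='sbScoreboxScores']"
away  = REXML::XPath.first(doc, "#{scores}//div[@class='sbScoreboxTeamAway']").text
total = REXML::XPath.first(doc, "#{scores}//div[@class='sbScoreboxTotal']").text.to_i

puts "#{away}: #{total}"
```

In a real run, `html` would come from `fetch_page` with the URL of the page being scraped.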
  • 9. My Process
    • Find a good source
    • Use Firebug to determine the XPaths of the elements you want to scrape
    • Play around in irb to see what kind of output you can get from addressing these elements with aforementioned tools
    • Refine Xpaths
    • Play around in Rubular to find regexes that will clean up your output (remove whitespace, etc)
    • Ruby's string methods help with this last part
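The clean-up step at the end of this process might look like the following sketch. The raw string is invented to imitate whitespace-heavy output from a scraped element, and the `clean` helper is a hypothetical name.

```ruby
# Hypothetical raw text, as it might come out of a scraped element.
raw = "\n\t  Scraping is   mostly\n cleanup.  \n   -- Anonymous\n"

# Collapse every run of whitespace to a single space, then trim the ends.
def clean(text)
  text.gsub(/\s+/, ' ').strip
end

# Split on the first "--" to separate the quote from its attribution.
quote, author = raw.split('--', 2).map { |part| clean(part) }

puts "#{quote} (#{author})"
```

Keeping the clean-up in one small, commented method makes it easier to adjust when the page markup changes.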
  • 10. Example: Scraping Quotes from your Tumblr
  • 11. Warning
    • Screen scraping is very iterative & involves a lot of trial and error
    • Make sure you comment a lot as you go along
    • Even with clear Xpaths, it's best to describe what your code is doing with all the regexes and string functions that inevitably build up
  • 12. References
    • http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/
      • Inspiration for this presentation
      • Describes the Firebug + hpricot method
    • http://code.whytheluckystiff.net/hpricot/wiki
    • http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails/
    • http://www.xml.com/pub/a/2005/11/09/rexml-processing-xml-in-ruby.html
    • http://mechanize.rubyforge.org/mechanize/files/EXAMPLES_txt.html
      • WWW::Mechanize examples
    • For an example that requires Mechanize for logging in, check out my code for logging into LinkedIn and scraping all your 2nd-degree contacts
      • jeremyraines.com/linkedinscraper