Screen Scraping with Ruby

    Screen Scraping with Ruby - Presentation Transcript

    1. Screen Scraping with Ruby Jeremy Raines [email_address]
    2. Pre-Reqs
      • XPATH
        • Used for addressing elements in an XML doc
        • http://www.w3schools.com/XPath/default.asp
      • Ruby Regular Expressions
        • www.rubular.com -- fastest way to build rope-swinging regex skills in Ruby
    3. Rubular
      • Input a string and try out regexes on it
    4. XPATH basics
      • //book -- selects all book nodes no matter where they are in the document
      • //bookstore/book -- selects all books that are direct child elements of bookstore
      • //bookstore//book -- selects all books that are descendants of bookstore, no matter how deep in the tree
      • //book[1] -- selects the first book element under each parent; (//book)[1] selects the first book in the whole document
      • //book[@category="fiction"] -- @ selects an attribute; the predicate here keeps only books whose category is fiction
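The expressions above can be tried from Ruby with REXML. This is a minimal sketch, assuming a small made-up bookstore document (the titles and the shelf element are invented for illustration) and a hypothetical helper named titles:

```ruby
require 'rexml/document'

# Hypothetical sample document, modeled on the bookstore example from the slide.
xml = <<~XML
  <bookstore>
    <book category="fiction"><title>Emma</title></book>
    <book category="reference"><title>Ruby Basics</title></book>
    <shelf>
      <book category="fiction"><title>Dracula</title></book>
    </shelf>
  </bookstore>
XML

doc = REXML::Document.new(xml)

# Hypothetical helper: run an XPath query and return the titles of the matching books.
def titles(doc, xpath)
  REXML::XPath.match(doc, xpath).map { |book| book.elements['title'].text }
end

all_books     = titles(doc, '//book')                      # every book, at any depth
direct_books  = titles(doc, '//bookstore/book')            # only direct children of bookstore
nested_books  = titles(doc, '//bookstore//book')           # any descendant of bookstore
first_book    = titles(doc, '/bookstore/book[1]')          # first book child of the root bookstore
fiction_books = titles(doc, '//book[@category="fiction"]') # filter on an attribute value

puts fiction_books.inspect
```

Running each query in irb against a small document like this is a quick way to see how / and // differ before pointing the same expressions at a real page.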
    5. Not-so-secret weapon: Firebug
      • Lets you find the Xpath of any element in a web page.
      • “Inspect Element” of a representative item
      • Focus on classes, think about loops
        • This will produce easier to read xpaths
    6. Example
    7. Example (cont'd)
        • //sbScoreboxScores//sbScoreboxTeamAway
        • //sbScoreboxScores//sbScoreboxTotal
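The example slides themselves are not part of this transcript, so the following is only a guess at how such paths might be used. It assumes the sb... names are class attribute values rather than element names, and the markup, team names, and scores are invented:

```ruby
require 'rexml/document'

# Hypothetical markup: only the sbScoreboxScores, sbScoreboxTeamAway and
# sbScoreboxTotal names come from the slide; tags, the "Home" class and
# all values are made up.
html = <<~HTML
  <div class="sbScoreboxScores">
    <table>
      <tr><td class="sbScoreboxTeamAway">Away Team</td><td class="sbScoreboxTotal">3</td></tr>
      <tr><td class="sbScoreboxTeamHome">Home Team</td><td class="sbScoreboxTotal">5</td></tr>
    </table>
  </div>
HTML

doc = REXML::Document.new(html)

# Match on the class attribute with a tag-agnostic * step, so the query
# does not depend on whether the page uses div, table, td, and so on.
box = '//*[@class="sbScoreboxScores"]'

away_team = REXML::XPath.first(doc, "#{box}//*[@class='sbScoreboxTeamAway']").text
totals    = REXML::XPath.match(doc, "#{box}//*[@class='sbScoreboxTotal']").map(&:text)

puts away_team
puts totals.inspect
```

Note that @class="..." compares the whole attribute value, so an element carrying several classes would need a contains() test instead.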
    8. How do we get at the data in these elements using Ruby?
      • open-uri – built-in library for opening pages
      • REXML – built-in XML parser with XPath support
        • Fast, good for straightforward tasks
      • Hpricot – popular gem
        • More powerful, I usually use in conjunction with:
      • WWW::Mechanize
        • For getting at data behind forms. Requires Hpricot
      • ScRUBYt! -- powerful, high-level abstraction, magic
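As a rough illustration of the two built-in options, here is a sketch that pairs open-uri with REXML. The helper names fetch and scrape and the sample markup are assumptions made for this example. REXML only accepts well-formed XML/XHTML, which is one reason a more forgiving parser such as Hpricot is often used on real pages:

```ruby
require 'open-uri'
require 'rexml/document'

# Hypothetical helper: download a page body as a string.
# URI.open is provided by open-uri (Ruby 2.5+); older Rubies used open(url).
def fetch(url)
  URI.open(url, &:read)
end

# Hypothetical helper: return the stripped text of every node matching
# the XPath in a well-formed XML/XHTML string.
def scrape(markup, xpath)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, xpath).map { |node| node.text.to_s.strip }
end

# Offline demonstration: an inline string stands in for fetch(some_url).
sample = '<html><body><h1> Scores </h1><p class="note">Final</p></body></html>'

headings = scrape(sample, '//h1')
notes    = scrape(sample, '//p[@class="note"]')

puts headings.inspect
puts notes.inspect
```

Against a live page the call would look like scrape(fetch(url), xpath), with url being whatever source you picked.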
    9. My Process
      • Find a good source
      • Use Firebug to determine the XPaths of the elements you want to scrape
      • Play around in irb to see what kind of output you can get from addressing these elements with aforementioned tools
      • Refine Xpaths
      • Play around in Rubular to find regexes that will clean up your output (remove whitespace, etc)
      • Ruby's string methods help with this last part
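The clean-up steps at the end of this process might look like the sketch below. The raw strings are invented stand-ins for scraped output, and the helper name clean is hypothetical:

```ruby
# Hypothetical raw output, as it might come back from a scrape.
raw = "\n\t  &ldquo;Simplicity   is the\n ultimate sophistication.&rdquo;  \n"

# Hypothetical helper combining regexes with Ruby's string methods.
def clean(text)
  text
    .gsub(/&[a-z]+;/i, '') # drop leftover named HTML entities (lossy, fine for a sketch)
    .gsub(/\s+/, ' ')      # collapse runs of whitespace, including newlines and tabs
    .strip                 # trim leading and trailing whitespace
end

cleaned = clean(raw)

# String#[] with a regex pulls out the first match, handy for numbers.
score = "Total: 12 pts"[/\d+/].to_i

puts cleaned
puts score
```

Each regex here is the kind of thing worth testing in Rubular first, then pasting into irb alongside the scraped string.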
    10. Example: Scraping Quotes from your Tumblr
    11. Warning
      • Screen scraping is very iterative & involves a lot of trial and error
      • Make sure you comment a lot as you go along
      • Even with clear Xpaths, it's best to describe what your code is doing with all the regexes and string functions that inevitably build up
    12. References
      • http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/
        • Inspiration for this presentation
        • Describes the Firebug + hpricot method
      • http://code.whytheluckystiff.net/hpricot/wiki
      • http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails/
      • http://www.xml.com/pub/a/2005/11/09/rexml-processing-xml-in-ruby.html
      • http://mechanize.rubyforge.org/mechanize/files/EXAMPLES_txt.html
        • WWW::Mechanize examples
      • For an example that requires Mechanize for logging in, check out my code for logging into LinkedIn and scraping all your 2nd-degree contacts
        • jeremyraines.com/linkedinscraper

    jeremyraines, 2 years ago

    A brief overview of screen scraping with Ruby

    © All Rights Reserved
