Web Scraping In Ruby Utosc 2009.Key

    Notes on slide 1

    Civil Law, Contract Law & Tort Law
    Copyright is civil law with a lot of case law & precedent
    TOS, AUP, etc. are contract law
    Trespass to chattel is tort law: there has to be 'damage'
    DexKnows: look at the TOU for automated scraping, robots.txt and the sitemap

    Web Scraping In Ruby Utosc 2009.Key - Presentation Transcript

    1. Web Scraping in Ruby* * for fun and profit
    2. 2 of the 6 W's: Who? Why?
    3. Why Ruby? Fun: 5.times { print "I like scraping in Ruby" }; OOP; Pretty; Closures; Flexible; Interactive mode
    4. Why Ruby? Community; Culture of testing; Rails is a web app; test culture + web app = web testing; Lots of libraries!
    5. Why Scrape the Web? Information; Research; Testing (acceptance & integration, performance / load); Standard API: HTTP + text; Why not?
    6. IANAL
    7. Legal Concerns: Copyright; Online != Public Domain; Fair Use; License; AUP, EULA, TOS, TOU ...; Example: www.dexknows.com; Trespass to chattel
    8. Objective: Get data. Have fun. Be nice.
    9. Request
         GET /
         Host: www.google.com
       Response
         200 OK
         <html>
         <head>
           <title>My page</title>
         </head>
         <body>
           <p>da body</p>
         </body>
         </html>
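
    The request on this slide can be built with Ruby's standard net/http library. This is a minimal sketch, assuming only the host shown on the slide; the `fetch` helper name is hypothetical, and nothing is sent over the network until it is called.

```ruby
require 'net/http'
require 'uri'

# Build the GET request from the slide without sending it.
uri = URI('http://www.google.com/')
request = Net::HTTP::Get.new(uri)

puts "#{request.method} #{request.path}"  # request line: method and path
puts "Host: #{request['Host']}"           # header net/http fills in from the URI

# Hypothetical helper: sends the request and returns the response object.
# Calling it needs network access, so it is only defined here.
def fetch(uri, request)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end
```

    Calling `fetch(uri, request)` returns a `Net::HTTPResponse`; its `code` and `body` correspond to the status line and the HTML shown on the slide.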
    10. HTTP Status Codes
          200 level - Success on client and server
          300 level - Redirection - client is supposed to do something else
          400 level - Client error
          500 level - Server error
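
    A scraper usually branches on these levels before it tries to parse anything. A small sketch of that check; the `status_class` name and the returned symbols are illustrative choices, not part of any library.

```ruby
# Map an HTTP status code (Integer or String) to the broad levels on the slide.
def status_class(code)
  case code.to_i
  when 200..299 then :success
  when 300..399 then :redirection
  when 400..499 then :client_error
  when 500..599 then :server_error
  else :other # e.g. 1xx informational codes, which the slide does not cover
  end
end

puts status_class(200)
puts status_class('404')
```

    Accepting strings is deliberate: net/http exposes the response code as a string, so `status_class(response.code)` works without conversion.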
    11. <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
        <html lang="en">
        <head>
          <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
          <title>All about Joe</title>
        </head>
        <body>
          <div id="header"> | <span class="nav-link">home</span> | </div>
          <div id="content">
            <div id="sidebar">I’m a paragraph</div>
            <div id="body">
              <h1>I’m a header</h1>
              <p class="bio">
                <img src="/img/mugshot.png">
                <span class="name">Joe Blow</span>
              </p>
              <p>Info about Joe</p>
            </div>
          </div>
          <div id="footer">&copy; 2009 Creative Commons</div>
        </body>
        </html>
    12. CSS & XPath Selectors: a way to target data inside a document
          CSS:   "head title"   "div#body * span.name"
          XPath: /html/head/title   //div[@id='body']/*/span[@class='name']
    13. Selecting XPath Nodes
          title              selects nodes named title, relative to the current node
          /html              selects from the root node
          //title            selects all title nodes in the document, at any depth
          [@src]             selects nodes that have the attribute
          [@class='name']    selects nodes whose attribute has the given value
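
    The XPath expressions above can be tried from Ruby. The sketch below uses REXML, which ships with the standard Ruby distribution, as a stand-in for hpricot or nokogiri so that it runs without extra gems. The markup is a trimmed, well-formed version of the "All about Joe" page (REXML is an XML parser and would reject the unclosed `<img>` and `<meta>` tags of the original); the `PAGE` constant and variable names are illustrative.

```ruby
require 'rexml/document'

# Trimmed, well-formed version of the sample page from the slides.
PAGE = <<~HTML
  <html lang="en">
    <head>
      <title>All about Joe</title>
    </head>
    <body>
      <div id="content">
        <div id="body">
          <h1>I'm a header</h1>
          <p class="bio">
            <img src="/img/mugshot.png"/>
            <span class="name">Joe Blow</span>
          </p>
          <p>Info about Joe</p>
        </div>
      </div>
    </body>
  </html>
HTML

doc = REXML::Document.new(PAGE)

# Absolute path from the root node.
title = REXML::XPath.first(doc, '/html/head/title').text

# Attribute predicates with a value: the span inside any child of div#body.
name = REXML::XPath.first(doc, "//div[@id='body']/*/span[@class='name']").text

# Attribute predicate without a value: every img that has a src attribute.
images = REXML::XPath.match(doc, '//img[@src]').map { |img| img.attributes['src'] }

puts title
puts name
puts images.inspect
```

    With Nokogiri the same expressions would typically go to `doc.xpath(...)`, and the CSS forms from the previous slide to `doc.css(...)`; Nokogiri's HTML parser also tolerates the original, non-XML markup.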
    14. Browser Tools (Firefox): QuarkRuby's version of Firebug (click to get CSS & XPath selectors); Firefinder extension to Firebug (query doc with CSS selectors)
    15. Interaction: net/http, Watir family
        Parsing: regex, hpricot & nokogiri
        Mechanize, webrat, scrubyt
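
    The lowest-tech parsing option on this list is a regular expression. A minimal sketch, assuming a made-up HTML fragment and a deliberately simple pattern; regexes are brittle against real-world markup, which is one reason parsers such as hpricot and nokogiri are on the list too.

```ruby
# Made-up fragment standing in for a fetched response body.
html = '<p><a href="/about">About</a> and <a href="http://example.com/">Example</a></p>'

# Capture the href value and the link text of each anchor tag.
# Assumes double-quoted attributes and no nested tags inside the link.
LINK_PATTERN = /<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/m

links = html.scan(LINK_PATTERN).map { |href, text| { url: href, text: text } }

links.each { |link| puts "#{link[:text]}: #{link[:url]}" }
```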
    16. Scrubyt
        http://github.com/scrubber/scrubyt_examples/blob/master/google.rb

        require 'rubygems'
        require 'scrubyt'

        google_data = Scrubyt::Extractor.define do
          fetch 'http://www.google.com/search?hl=en&q=ruby'

          link_title "//a[@class='l']", :write_text => true do
            link_url
          end
        end

        p google_data.to_hash
    17. Watir (Safariwatir)

        require 'safariwatir'

        browser = Watir::Safari.new
        browser.goto("http://google.com")
        browser.text_field(:name, "q").set("safariwatir")
        browser.button(:name, "btnI").click
        puts "FAILURE" unless browser.contains_text("software")
    18. Webrat Scraper

        require 'webrat_scraper'

        class MyScraper < WebratScraper
          def initialize
            @url = "http://www.google.com"
            super
          end

          def first_result_for(search_term)
            visit @url
            fill_in "q", :with => search_term
            click_button
            first_link = (doc/"li.g a.l").first
            { :text => first_link.inner_text, :url => first_link.attributes["href"].to_s }
          end
        end

        m = MyScraper.new
        result = m.first_result_for("webrat-mechanize")
        puts result.inspect
    19. Resources
          Ruby: http://ruby-lang.org
          HTTP Status Codes: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
          CSS Selectors: http://www.w3schools.com/Css/css_syntax.asp
          XPath: http://www.w3schools.com/XPath/xpath_syntax.asp
          Mechanize: http://mechanize.rubyforge.org/mechanize/
          Nokogiri: http://nokogiri.rubyforge.org/nokogiri/
          Hpricot: http://github.com/whymirror/hpricot
          Webrat: http://wiki.github.com/brynary/webrat
          Scrubyt: http://scrubyt.org/
          Webrat Scraper: http://github.com/jtzemp/webrat-scraper
          TourBus: http://github.com/dbrady/tourbus
    20. Advanced Topics: Distributed Scraping; Anonymization; Security; Captcha; XSRF & CSRF protections; Load and Performance Testing

    jtzemp, 1 month ago

    A brief overview of web scraping in general, and so more

    © All Rights Reserved
