Web Scraping In Ruby Utosc 2009.Key


A brief overview of web scraping in general, and some tools in Ruby to get it done.

Published in: Technology
  • Civil Law, Contract Law & Tort Law
    Copyright is civil law, with a lot of case law & precedent.
    TOS, AUP, etc. are contract law.
    Trespass to chattel is tort law: there has to be ‘damage’.
    DexKnows: check the TOU for rules on automated scraping, plus robots.txt and the sitemap.
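
Since the note above points at robots.txt, here is a minimal sketch of how a scraper might check it before requesting a page. The method names `disallowed_paths` and `allowed?` are hypothetical, not from the talk, and the parser is deliberately simplified: it only reads `Disallow` rules in groups addressed to `User-agent: *` and ignores `Allow` lines, wildcards, and crawler-specific groups.

```ruby
# Collect the Disallow path prefixes that apply to every crawler,
# i.e. those in groups that include "User-agent: *".
# Assumption: simplified parsing, no Allow rules and no wildcards.
def disallowed_paths(robots_txt)
  applies = false    # does the current group apply to "*"?
  in_agents = false  # was the previous directive a User-agent line?
  paths = []
  robots_txt.each_line do |line|
    line = line.sub(/#.*/, '').strip # drop comments and whitespace
    next if line.empty?
    field, value = line.split(':', 2).map(&:strip)
    next if value.nil?               # not a "field: value" line
    case field.downcase
    when 'user-agent'
      applies = false unless in_agents # a new group starts here
      applies ||= (value == '*')
      in_agents = true
    when 'disallow'
      in_agents = false
      paths << value if applies && !value.empty?
    else
      in_agents = false
    end
  end
  paths
end

# True when no collected Disallow prefix matches the start of the path.
def allowed?(robots_txt, path)
  disallowed_paths(robots_txt).none? { |prefix| path.start_with?(prefix) }
end
```

A scraper would fetch `/robots.txt` once per site, pass its body to `allowed?` along with each path it intends to visit, and skip the paths that come back false. A site's TOU can still forbid scraping that robots.txt permits.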
  • Web Scraping In Ruby Utosc 2009.Key

    1. Web Scraping in Ruby*
           * for fun and profit
    2. 2 of the 6 W’s: Who? Why?
    3. Why Ruby? Fun (5.times { print "I like scraping in Ruby" }); OOP; pretty; closures; flexible; interactive mode.
    4. Why Ruby? Community: a culture of testing. Rails is a web app, and test culture + web app = web testing. Lots of libraries!
    5. Why Scrape the Web? Information; research; testing (acceptance & integration, performance / load); a standard API: HTTP + text. Why not?
    6. IANAL (I am not a lawyer)
    7. Legal Concerns: Copyright; Online != Public Domain; Fair Use; License: AUP, EULA, TOS, TOU ... (example: www.dexknows.com); Trespass to chattel.
    8. Objective: Get data. Have fun. Be nice.
    9. Request
           GET /
           Host: www.google.com
       Response
           200 OK
           <html>
             <head>
               <title>My page</title>
             </head>
             <body>
               <p>da body</p>
             </body>
           </html>
    10. HTTP Status Codes
            200 level - Success on client and server
            300 level - Redirection - client is supposed to do something else
            400 level - Client error
            500 level - Server error
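
The request/response cycle and the status-code levels above can be sketched with Ruby's standard net/http library. The method names `status_class` and `fetch` are hypothetical, chosen for this illustration; `fetch` makes a real network request when called and reports redirects rather than following them.

```ruby
require 'net/http'
require 'uri'

# Map an HTTP status code (Integer or String) to its level.
def status_class(code)
  case code.to_i
  when 200..299 then :success
  when 300..399 then :redirection
  when 400..499 then :client_error
  when 500..599 then :server_error
  else :unknown
  end
end

# Send a GET request and return [status level, response body].
# Net::HTTP returns the status code as a String such as "200".
def fetch(url)
  response = Net::HTTP.get_response(URI.parse(url))
  [status_class(response.code), response.body]
end
```

For example, `level, body = fetch("http://www.google.com/")` returns the level alongside the raw HTML, so a scraper can branch on `:redirection` or `:client_error` before attempting to parse anything.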
    11. <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
        <html lang="en">
          <head>
            <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
            <title>All about Joe</title>
          </head>
          <body>
            <div id="header"> | <span class="nav-link">home</span> | </div>
            <div id="content">
              <div id="sidebar">I’m a paragraph</div>
              <div id="body">
                <h1>I’m a header</h1>
                <p class="bio">
                  <img src="/img/mugshot.png">
                  <span class="name">Joe Blow</span>
                </p>
                <p>Info about Joe</p>
              </div>
            </div>
            <div id="footer">&copy; 2009 Creative Commons</div>
          </body>
        </html>
    12. CSS & XPath Selectors: a way to target data inside a document
            CSS
                "head title"
                "div#body * span.name"
            XPath
                //head/title
                //div[@id='body']/*/span[@class='name']
    13. Selecting XPath Nodes
            title             selects child nodes named "title" (relative to the current node)
            /html             selects from the root node
            //title           selects matching nodes anywhere in the document, at any depth
            [@src]            selects nodes that have an attribute
            [@class='name']   and, optionally, a given value for it
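
To see the XPath expressions above in action, here is a minimal sketch that runs them against a trimmed copy of the sample page. It uses REXML, which ships with Ruby, instead of the hpricot or nokogiri libraries the talk covers, so no extra gems are assumed. REXML needs well-formed XML, so the `img` tag is self-closed and the doctype, `meta` tag, and footer are left out. The method name `extract_profile` is hypothetical.

```ruby
require 'rexml/document'

# Trimmed, well-formed version of the sample page.
PAGE = <<~HTML
  <html>
    <head><title>All about Joe</title></head>
    <body>
      <div id="content">
        <div id="body">
          <h1>I'm a header</h1>
          <p class="bio">
            <img src="/img/mugshot.png"/>
            <span class="name">Joe Blow</span>
          </p>
          <p>Info about Joe</p>
        </div>
      </div>
    </body>
  </html>
HTML

# Pull the title, the person's name, and all image sources out of the page.
def extract_profile(html)
  doc = REXML::Document.new(html)
  {
    # Absolute path from the root node
    title: REXML::XPath.first(doc, '/html/head/title').text,
    # Attribute predicates with values, at any depth
    name: REXML::XPath.first(doc, "//div[@id='body']/*/span[@class='name']").text,
    # Attribute predicate without a value: every img that has a src
    images: REXML::XPath.match(doc, '//img[@src]').map { |img| img.attributes['src'] }
  }
end
```

Calling `extract_profile(PAGE)` returns a hash with the title "All about Joe", the name "Joe Blow", and the single image path "/img/mugshot.png". Real-world HTML is rarely well-formed, which is why a forgiving parser is the usual choice for scraping.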
    14. Browser Tools
            Firefox
            QuarkRuby’s version of Firebug: click to get CSS & XPath selectors
            Firefinder extension to Firebug: query the doc with CSS selectors
    15. Interaction: net/http, Watir family. Parsing: regex, hpricot & nokogiri. Also: Mechanize, webrat, scrubyt.
    16. Scrubyt
        http://github.com/scrubber/scrubyt_examples/blob/master/google.rb

        require 'rubygems'
        require 'scrubyt'

        google_data = Scrubyt::Extractor.define do
          fetch 'http://www.google.com/search?hl=en&q=ruby'

          link_title "//a[@class='l']", :write_text => true do
            link_url
          end
        end

        # Print the extracted results as a hash
        p google_data.to_hash
    17. Watir (Safariwatir)

        require 'safariwatir'

        browser = Watir::Safari.new
        browser.goto("http://google.com")
        browser.text_field(:name, "q").set("safariwatir")
        browser.button(:name, "btnI").click
        puts "FAILURE" unless browser.contains_text("software")
    18. Webrat Scraper

        require 'webrat_scraper'

        class MyScraper < WebratScraper
          def initialize
            @url = "http://www.google.com"
            super
          end

          def first_result_for(search_term)
            visit @url
            fill_in "q", :with => search_term
            click_button
            # Take the first result link and return its text and URL
            first_link = (doc/"li.g a.l").first
            { :text => first_link.inner_text, :url => first_link.attributes["href"].to_s }
          end
        end

        m = MyScraper.new
        result = m.first_result_for("webrat-mechanize")
        puts result.inspect
    19. Resources
            Ruby: http://ruby-lang.org
            HTTP Status Codes: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
            CSS Selectors: http://www.w3schools.com/Css/css_syntax.asp
            XPath: http://www.w3schools.com/XPath/xpath_syntax.asp
            Mechanize: http://mechanize.rubyforge.org/mechanize/
            Nokogiri: http://nokogiri.rubyforge.org/nokogiri/
            Hpricot: http://github.com/whymirror/hpricot
            Webrat: http://wiki.github.com/brynary/webrat
            Scrubyt: http://scrubyt.org/
            Webrat Scraper: http://github.com/jtzemp/webrat-scraper
            TourBus: http://github.com/dbrady/tourbus
    20. Advanced Topics: Distributed Scraping; Anonymization; Security; Captcha; XSRF & CSRF protections; Load and Performance Testing.