Web Scraping In Ruby Utosc 2009.Key
A brief overview of web scraping in general, and some tools in Ruby to get it done.


Published in Technology
  • Civil Law, Contract Law & Tort Law
    Copyright is civil law with a lot of case law & precedent
    TOS, AUP, etc. are contract law
    Trespass to chattels is tort law - have to ‘damage’
    Dex knows: look at TOU for automated scraping, robots.txt and the sitemap
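
The robots.txt check mentioned in the note above can be sketched roughly as follows. This is a minimal illustration, not a complete robots.txt parser: it only reads `Disallow` prefixes under `User-agent: *` and ignores `Allow` rules, wildcards, and groups that list several user agents. The helper names and the sample rules are hypothetical.

```ruby
# Collect the Disallow path prefixes listed under "User-agent: *".
# Simplified on purpose: no Allow rules, no wildcards, no grouped user agents.
def disallowed_prefixes(robots_txt)
  applies = false
  prefixes = []
  robots_txt.each_line do |line|
    line = line.sub(/#.*/, "").strip          # drop comments and whitespace
    next if line.empty?
    field, value = line.split(":", 2).map(&:strip)
    next if value.nil?                        # skip lines without a colon
    case field.downcase
    when "user-agent" then applies = (value == "*")
    when "disallow"   then prefixes << value if applies && !value.empty?
    end
  end
  prefixes
end

# True when no collected prefix matches the start of the path.
def allowed?(robots_txt, path)
  disallowed_prefixes(robots_txt).none? { |prefix| path.start_with?(prefix) }
end

# Hypothetical rules, for illustration only.
SAMPLE_ROBOTS = <<~ROBOTS
  # sample rules
  User-agent: *
  Disallow: /search
  Disallow: /private/

  Sitemap: http://www.example.com/sitemap.xml
ROBOTS

puts allowed?(SAMPLE_ROBOTS, "/about")
puts allowed?(SAMPLE_ROBOTS, "/search?q=ruby")
```

A check like this says nothing about the site's terms of use, which still need to be read separately.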

Transcript

  • 1. Web Scraping in Ruby* * for fun and profit
  • 2. 2 of the 6 W’s Who? Why?
  • 3. Why Ruby? Fun 5.times { print "I like scraping in Ruby" } OOP Pretty Closures Flexible Interactive mode
  • 4. Why Ruby? Community Culture of testing Rails is a web app test culture + web app = web testing Lots of libraries!
  • 5. Why Scrape the Web? Information Research Testing acceptance & integration performance / load Standard API: HTTP + text Why not?
  • 6. IANAL
  • 7. Legal Concerns Copyright Online != Public Domain Fair Use License AUP, EULA, TOS, TOU ... Example: www.dexknows.com Trespass to chattel
  • 8. Objective Get data. Have fun. Be nice.
  • 9. Request
    GET / HTTP/1.1
    Host: www.google.com
    Response
    HTTP/1.1 200 OK
    <html>
      <head>
        <title>My page</title>
      </head>
      <body>
        <p>da body</p>
      </body>
    </html>
  • 10. HTTP Status Codes
    200 level - Success on client and server
    300 level - Redirection - client is supposed to do something else
    400 level - Client error
    500 level - Server error
  • 11. Sample HTML document
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html lang="en">
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <title>All about Joe</title>
      </head>
      <body>
        <div id="header"> | <span class="nav-link">home</span> | </div>
        <div id="content">
          <div id="sidebar">I’m a paragraph</div>
          <div id="body">
            <h1>I’m a header</h1>
            <p class="bio">
              <img src="/img/mugshot.png">
              <span class="name">Joe Blow</span>
            </p>
            <p>Info about Joe</p>
          </div>
        </div>
        <div id="footer">&copy; 2009 Creative Commons</div>
      </body>
    </html>
  • 12. CSS & XPath Selectors
    Way to target data inside a document
    CSS
      "head title"
      "div#body * span.name"
    XPath
      /html/head/title
      //div[@id='body']/*/span[@class='name']
  • 13. Selecting XPath Nodes
    title - selects all nodes named "title"
    /html - selects from the root node
    //title - selects matching nodes anywhere below the current node
    [@src] - selects nodes with an attribute
    [@class='name'] - and optionally a value
  • 14. Browser Tools
    Firefox
    QuarkRuby’s version of Firebug - click to get CSS & XPath selectors
    Firefinder extension to Firebug - query doc with CSS selectors
  • 15. Interaction Parsing net/http regex Watir family hpricot & nokogiri Mechanize webrat scrubyt
  • 16. Scrubyt
    http://github.com/scrubber/scrubyt_examples/blob/master/google.rb
    require 'rubygems'
    require 'scrubyt'
    google_data = Scrubyt::Extractor.define do
      # Load the Google results page for the query "ruby"
      fetch 'http://www.google.com/search?hl=en&q=ruby'
      # Match each result link by XPath and record its text
      link_title "//a[@class='l']", :write_text => true do
        link_url
      end
    end
    p google_data.to_hash
  • 17. Watir (Safariwatir)
    require 'safariwatir'
    browser = Watir::Safari.new
    browser.goto("http://google.com")
    # Type the query into the search box
    browser.text_field(:name, "q").set("safariwatir")
    # btnI is the "I'm Feeling Lucky" button
    browser.button(:name, "btnI").click
    puts "FAILURE" unless browser.contains_text("software")
  • 18. Webrat Scraper
    require 'webrat_scraper'
    class MyScraper < WebratScraper
      def initialize
        @url = "http://www.google.com"
        super
      end
      # Search for a term and return the first result's text and URL
      def first_result_for(search_term)
        visit @url
        fill_in "q", :with => search_term
        click_button
        first_link = (doc/"li.g a.l").first
        {:text => first_link.inner_text, :url => first_link.attributes["href"].to_s }
      end
    end
    m = MyScraper.new
    result = m.first_result_for("webrat-mechanize")
    puts result.inspect
  • 19. Resources
    Ruby: http://ruby-lang.org
    HTTP Status Codes: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
    CSS Selectors: http://www.w3schools.com/Css/css_syntax.asp
    XPath: http://www.w3schools.com/XPath/xpath_syntax.asp
    Mechanize: http://mechanize.rubyforge.org/mechanize/
    Nokogiri: http://nokogiri.rubyforge.org/nokogiri/
    Hpricot: http://github.com/whymirror/hpricot
    Webrat: http://wiki.github.com/brynary/webrat
    Scrubyt: http://scrubyt.org/
    Webrat Scraper: http://github.com/jtzemp/webrat-scraper
    TourBus: http://github.com/dbrady/tourbus
  • 20. Advanced Topics Distributed Scraping Anonymization Security Captcha XSRF & CSRF protections Load and Performance Testing
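
The status-code levels from slide 10 can be sketched as a small lookup. The method name `status_level` and the symbols it returns are assumptions made for illustration. Anything outside the four levels on the slide (for example the 100 level) is reported as `:other`.

```ruby
# Map an HTTP status code (Integer or numeric String) to the level named on slide 10.
def status_level(code)
  case Integer(code)
  when 200..299 then :success       # request worked on client and server
  when 300..399 then :redirection   # client is supposed to do something else
  when 400..499 then :client_error
  when 500..599 then :server_error
  else :other                       # levels the slide does not cover
  end
end

[200, 301, 404, 503].each do |code|
  puts "#{code} -> #{status_level(code)}"
end
```

A scraper would typically retry or back off on the 500 level, follow the 300 level, and stop on the 400 level, though that policy is a design choice rather than something the slides prescribe.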
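
The XPath selectors from slides 12 and 13 can be tried against a trimmed copy of the sample page from slide 11. This sketch uses REXML, which ships with a standard Ruby install, instead of the Nokogiri or Hpricot libraries named in the talk, so it runs without extra gems. REXML only accepts well-formed XML, so the `<img>` tag is self-closed here, and the helper names are hypothetical.

```ruby
require 'rexml/document'

# Trimmed, well-formed copy of the slide 11 page (the <img> tag is self-closed for REXML).
PAGE = <<~XML
  <html lang="en">
    <head><title>All about Joe</title></head>
    <body>
      <div id="content">
        <div id="body">
          <h1>I'm a header</h1>
          <p class="bio">
            <img src="/img/mugshot.png"/>
            <span class="name">Joe Blow</span>
          </p>
          <p>Info about Joe</p>
        </div>
      </div>
    </body>
  </html>
XML

DOC = REXML::Document.new(PAGE)

# Return the text of the first node matching an XPath expression, or nil.
def first_text(doc, xpath)
  node = REXML::XPath.first(doc, xpath)
  node && node.text
end

# Return the src attribute of every <img> element that has one.
def image_sources(doc)
  REXML::XPath.match(doc, "//img[@src]").map { |img| img.attributes["src"] }
end

puts first_text(DOC, "/html/head/title")
puts first_text(DOC, "//div[@id='body']/*/span[@class='name']")
puts image_sources(DOC).inspect
```

Real-world HTML is rarely well-formed, which is the reason the talk points to forgiving parsers such as Nokogiri and Hpricot for actual scraping.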
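
The "net/http" plus "regex" pairing from slide 15 is the lowest-level option on that list, and a minimal sketch of it might look like the following. The method names `fetch_body` and `extract_title` are assumptions, and the sample markup is the response body from slide 9. `fetch_body` is defined but not called, so running the block does not touch the network.

```ruby
require 'net/http'
require 'uri'

# Fetch a page body over HTTP. Redirects are not followed in this sketch.
def fetch_body(url)
  response = Net::HTTP.get_response(URI.parse(url))
  raise "Unexpected status #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  response.body
end

# Pull the <title> text out of an HTML string with a regex; nil when there is none.
def extract_title(html)
  match = html.match(%r{<title[^>]*>(.*?)</title>}mi)
  match && match[1].strip
end

SAMPLE_HTML = "<html><head><title>My page</title></head><body><p>da body</p></body></html>"

puts extract_title(SAMPLE_HTML)
```

Regexes are fine for a single well-known tag like this one, but they get brittle quickly on nested markup, which is where the parser-based tools on the same slide come in.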