Web Scraping in Ruby*
 * for fun and profit
2 of the 6 W’s


Who?
Why?
Why Ruby?
Fun
  5.times { print "I like scraping in Ruby" }
OOP
Pretty
Closures
Fle...
Why Ruby?

Community
Culture of testing
Rails is a web app
  test culture + web app = web testing
Lots of libraries!
Why Scrape the Web?
Information
Research
Testing
  acceptance & integration
  performance / load
Standard API: HTTP + text...
IANAL
Legal Concerns
Copyright
  Online != Public Domain
  Fair Use
License
  AUP, EULA, TOS, TOU ...
  Example: www.dexknows.co...
Objective


Get data. Have fun. Be nice.
Request
GET /
Host: www.google.com




Response
200 OK
<html>
  <head>
     <title>My page</title>
  </head>
  <body>
    ...
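The request and response on this slide are abbreviated: on the wire the request line also carries the protocol version (`GET / HTTP/1.1`) and the status line reads `HTTP/1.1 200 OK`. A minimal sketch of the same exchange with Ruby's standard net/http library follows. The helper name `fetch_page` is made up for illustration and is not from the talk.

```ruby
require 'net/http'
require 'uri'

# Build the request from the slide without sending it.
request = Net::HTTP::Get.new('/')
request['Host'] = 'www.google.com'

puts "#{request.method} #{request.path}"
puts "Host: #{request['Host']}"

# Hypothetical helper: send a GET and return [status code, body].
# Defined but not called here, so the sketch runs without network access.
def fetch_page(url)
  response = Net::HTTP.get_response(URI(url))
  [response.code.to_i, response.body]
end
```

Calling `fetch_page("http://www.google.com/")` would perform the actual round trip and hand back the numeric status and the HTML body.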
HTTP Status Codes

200 level - Success on client and server
300 level - Redirection - client is supposed to do
something e...
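One way a scraper might act on these status classes is sketched below. The method name `status_class` and the symbols it returns are assumptions for illustration. The 1xx "informational" range is not on the slide but is part of HTTP, so it is included for completeness.

```ruby
# Map an HTTP status code to the class it belongs to.
def status_class(code)
  case code
  when 100..199 then :informational # not listed on the slide
  when 200..299 then :success
  when 300..399 then :redirection   # client should follow the Location header
  when 400..499 then :client_error
  when 500..599 then :server_error
  else :unknown
  end
end

[200, 301, 404, 503].each do |code|
  puts "#{code}: #{status_class(code)}"
end
```

A scraper would typically parse the body on `:success`, follow the redirect on `:redirection`, and back off or log on the two error classes.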
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 "http://www.w3.org/TR/html4/loose.dtd">

<html lang="en">
...
CSS & XPath Selectors
Way to target data inside a document
CSS
  “head title”
  "div#body * span.name"
XPath
  /html/head/title...
Selecting XPath Nodes

      title           selects child nodes named title


     /html             selects from the root n...
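The XPath expressions from these slides can be tried against a trimmed copy of the sample page. This sketch uses REXML, which ships with standard Ruby distributions, rather than the Hpricot or Nokogiri libraries named in the talk; that substitution is an assumption made to keep the example dependency-free. REXML only accepts well-formed XML, so the sample markup is tidied (closed `img` tag, straight quotes). For real-world HTML a forgiving parser such as Nokogiri is the better fit.

```ruby
require 'rexml/document'

# Trimmed, well-formed version of the sample page from the talk.
html = <<~HTML
  <html lang="en">
    <head><title>All about Joe</title></head>
    <body>
      <div id="body">
        <h1>I'm a header</h1>
        <p class="bio">
          <img src="/img/mugshot.png"/>
          <span class="name">Joe Blow</span>
        </p>
      </div>
    </body>
  </html>
HTML

doc = REXML::Document.new(html)

# Absolute path from the root node.
title = REXML::XPath.first(doc, '/html/head/title').text

# Attribute predicates: a span with class "name" two levels below div#body.
name = REXML::XPath.first(doc, "//div[@id='body']/*/span[@class='name']").text

# Any node anywhere in the document that carries a src attribute.
srcs = REXML::XPath.match(doc, '//*[@src]').map { |node| node.attributes['src'] }

puts title
puts name
puts srcs.inspect
```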
Browser Tools

Firefox
  QuarkRuby’s version of Firebug
    click to get CSS & XPath Selectors
  Firefinder extension to Fi...
Interaction                                          Parsing

 net/http                                     regex

       ...
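The simplest pairing on this slide, net/http for interaction and a regex for parsing, can be sketched without any gems. The HTML string below stands in for a fetched response body and reuses the "My page" example from the Request/Response slide. Regexes are brittle against real-world markup, so treat this as a quick-and-dirty approach rather than a recommendation.

```ruby
# Stand-in for a response body fetched with net/http.
html = '<html><head><title>My page</title></head><body><p>da body</p></body></html>'

# Capture the text between <title> and </title>.
# /m lets . match newlines, /i ignores tag case.
title = html[%r{<title[^>]*>(.*?)</title>}mi, 1]

puts title
```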
Scrubyt
http://github.com/scrubber/scrubyt_examples/blob/master/google.rb


require 'rubygems'
require 'scrubyt'
 
google_...
Watir (Safariwatir)
require 'safariwatir'

browser = Watir::Safari.new
browser.goto("http://google.com")
browser.text_fiel...
Webrat Scraper
   require 'webrat_scraper'

   class MyScraper < WebratScraper
     def initialize
       @url = "http://w...
Resources
Ruby http://ruby-lang.org

HTTP Status Codes http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

CSS Selecto...
Advanced Topics
Distributed Scraping
Anonymization
Security
  Captcha
  XSRF & CSRF protections
Load and Performance Testi...
Web Scraping In Ruby Utosc 2009.Key

A brief overview of web scraping in general, and some tools in Ruby to get it done.

Published in: Technology
  • Civil Law, Contract Law & Tort Law
    Copyright is civil law with a lot of case law & precedent
    TOS, AUP, etc. are contract law
    Trespass to chattel is tort law - have to show ‘damage’
    Dex Knows: look at the TOU for automated scraping, robots.txt and the sitemap
  • Web Scraping In Ruby Utosc 2009.Key

    1. 1. Web Scraping in Ruby*
          * for fun and profit
    2. 2. 2 of the 6 W’s
          Who?
          Why?
    3. 3. Why Ruby?
          Fun
            5.times { print "I like scraping in Ruby" }
          OOP
          Pretty
          Closures
          Flexible
          Interactive mode
    4. 4. Why Ruby?
          Community
          Culture of testing
          Rails is a web app
            test culture + web app = web testing
          Lots of libraries!
    5. 5. Why Scrape the Web?
          Information
          Research
          Testing
            acceptance & integration
            performance / load
          Standard API: HTTP + text
          Why not?
    6. 6. IANAL
    7. 7. Legal Concerns
          Copyright
            Online != Public Domain
            Fair Use
          License
            AUP, EULA, TOS, TOU ...
            Example: www.dexknows.com
          Trespass to chattel
    8. 8. Objective
          Get data. Have fun. Be nice.
    9. 9. Request
          GET /
          Host: www.google.com

          Response
          200 OK
          <html>
            <head>
              <title>My page</title>
            </head>
            <body>
              <p>da body</p>
            </body>
          </html>
    10. 10. HTTP Status Codes
            200 level - Success on client and server
            300 level - Redirection - client is supposed to do something else
            400 level - Client error
            500 level - Server error
    11. 11. <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
             "http://www.w3.org/TR/html4/loose.dtd">
            <html lang="en">
              <head>
                <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
                <title>All about Joe</title>
              </head>
              <body>
                <div id="header"> | <span class="nav-link">home</span> | </div>
                <div id="content">
                  <div id="sidebar">I’m a paragraph</div>
                  <div id="body">
                    <h1>I’m a header</h1>
                    <p class="bio">
                      <img src="/img/mugshot.png">
                      <span class="name">Joe Blow</span>
                    </p>
                    <p>Info about Joe</p>
                  </div>
                </div>
                <div id="footer">&copy; 2009 Creative Commons</div>
              </body>
            </html>
    12. 12. CSS & XPath Selectors
            Way to target data inside a document
            CSS
              "head title"
              "div#body * span.name"
            XPath
              /html/head/title
              //div[@id='body']/*/span[@class='name']
    13. 13. Selecting XPath Nodes
            title             selects child nodes named title of the current node
            /html             selects from the root node
            //title           selects all title nodes anywhere in the document
            [@src]            selects nodes with an attribute
            [@class='name']   and optionally a value
    14. 14. Browser Tools
            Firefox
              QuarkRuby’s version of Firebug
                click to get CSS & XPath Selectors
              Firefinder extension to Firebug
                query doc with CSS Selectors
    15. 15. Interaction Parsing net/http regex Watir family hpricot & nokogiri Mechanize webrat scrubyt
    16. 16. Scrubyt
            http://github.com/scrubber/scrubyt_examples/blob/master/google.rb

            require 'rubygems'
            require 'scrubyt'

            google_data = Scrubyt::Extractor.define do
              fetch 'http://www.google.com/search?hl=en&q=ruby'

              link_title "//a[@class='l']", :write_text => true do
                link_url
              end
            end

            p google_data.to_hash
    17. 17. Watir (Safariwatir)
            require 'safariwatir'

            browser = Watir::Safari.new
            browser.goto("http://google.com")
            browser.text_field(:name, "q").set("safariwatir")
            browser.button(:name, "btnI").click
            puts "FAILURE" unless browser.contains_text("software")
    18. 18. Webrat Scraper
            require 'webrat_scraper'

            class MyScraper < WebratScraper
              def initialize
                @url = "http://www.google.com"
                super
              end

              def first_result_for(search_term)
                visit @url
                fill_in "q", :with => search_term
                click_button
                first_link = (doc/"li.g a.l").first
                { :text => first_link.inner_text,
                  :url  => first_link.attributes["href"].to_s }
              end
            end

            m = MyScraper.new
            result = m.first_result_for("webrat-mechanize")
            puts result.inspect
    19. 19. Resources
            Ruby               http://ruby-lang.org
            HTTP Status Codes  http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
            CSS Selectors      http://www.w3schools.com/Css/css_syntax.asp
            XPath              http://www.w3schools.com/XPath/xpath_syntax.asp
            Mechanize          http://mechanize.rubyforge.org/mechanize/
            Nokogiri           http://nokogiri.rubyforge.org/nokogiri/
            Hpricot            http://github.com/whymirror/hpricot
            Webrat             http://wiki.github.com/brynary/webrat
            Scrubyt            http://scrubyt.org/
            Webrat Scraper     http://github.com/jtzemp/webrat-scraper
            TourBus            http://github.com/dbrady/tourbus
    20. 20. Advanced Topics
            Distributed Scraping
            Anonymization
            Security
              Captcha
              XSRF & CSRF protections
            Load and Performance Testing