Mechanize at the Ruby Drink-up of Sophia, November 2011

Simple web-scraping with Mechanize and Nokogiri. Presented at the Ruby Drink-up of Sophia Antipolis on the 8th of November 2011 by Muriel Salvan (@MurielSalvan).

Presentation Transcript

  • Simple web-scraping with Mechanize and Nokogiri. Nov 8th, 2011. Muriel Salvan, Open Source Lead developer and architect, X-Aeon Solutions, http://x-aeon.com
  • The need
    • Browse the web from a Ruby environment
    • Navigate like a browser (forms, cookies, history, redirects) => Mechanize
    • Parse HTML pages (DOM) => Nokogiri
  • Installation
    • gem install mechanize
    • Also installs Nokogiri if not already done
    • Warning: version 1.0.0 can be more stable than 2.x.x for some complex queries (use "gem install mechanize -v 1.0.0" to force that version).
  • Basic example
      require 'mechanize'
      agent = Mechanize.new
      page = agent.get('http://rivierarb.fr')
      element = page.root.css('h1.logo').first
      element.content
      => "Riviera.rb"
  • Main usage
    • Create a Mechanize agent. This is the same as creating a browser session (it handles cookies, history, authentication...).
    • Use the agent to perform HTTP(S) requests (get, post). Each request returns a page whose document is parsed by Nokogiri.
    • Parse the page using CSS selectors, XPath, DOM iterators.
    • Fill and post forms using intuitive helpers.
  • Common requests
      page = agent.get('http://rivierarb.fr')
      page2 = page.links_with(:text => 'Green King').first.click
      page3 = agent.back
      agent.user_agent = 'My user agent'
  • Common parsing: Selectors
      page.root.css('body div.myclass').each { |element| … }
      page.root.xpath('//h3/a[@class="l"]').each { |element| … }
  • Common parsing: Elements
      <div>
        <a href="http://www.google.com">
          Click here
          <img src="http://www.google.com/favicon.ico"/>
        </a>
      </div>
      With element being the <a> node:
      element['href'] => "http://www.google.com"
      element.content => "\n Click here \n \n "
      element.children[1].name => "img"
      element.parent.name => "div"
  • Filling and submitting forms: Basic example (Google search)
      form = agent.get('http://www.google.com').forms.first
      form.q = 'Rivierarb'
      results_page = form.submit
  • Filling and submitting forms: Fields
      When your HTML form has
      <input … name="myfield">...</input>
      you can write
      form.myfield = 'The field value'
      form.field_with(:name => 'myfield').value = 'The field value'
      form.checkboxfield = '1'
      form.selectfield = '5'
  • Filling and submitting forms: Buttons
      Warning: Mechanize does not add the value of the button being clicked! If the web server cares about button values in POST data, add them manually.
      <input type="submit" name="btn1" value="Clicked">...</input>
      form.add_field!('btn1', 'Clicked')
      button = form.button_with(:name => 'btn1')
      page = form.click_button(button)
  • Other Mechanize features
    • Download raw files
    • Upload files
    • Give read/write access to cookies
    • Use proxies
    • Use HTML parsers other than Nokogiri
    • Limitation: no JavaScript engine (therefore no Ajax)
  • Links
    • Mechanize
    • Mechanize examples
    • Nokogiri
    • Nokogiri element API
    This presentation is available under CC-BY license by Muriel Salvan
  • Q/A