Mechanize at the Ruby Drink-up of Sophia, November 2011

Simple web-scraping with Mechanize and Nokogiri. Presented at the Ruby Drink-up of Sophia Antipolis on the 8th of November 2011 by Muriel Salvan (@MurielSalvan).

Published in: Technology

  1. Simple web-scraping with Mechanize and Nokogiri. Nov 8th, 2011. Muriel Salvan, Open Source Lead developer and architect, X-Aeon Solutions, http://x-aeon.com
  2. The need: browse the web from a Ruby environment. Navigate like a browser (forms, cookies, history, redirects) => Mechanize
  3. Parse HTML pages (DOM) => Nokogiri
  4. Installation: gem install mechanize. This also installs Nokogiri if it is not already installed.
  5. Warning: version 1.0.0 can be more stable than 2.x.x for some complex queries (use "gem install mechanize -v 1.0.0" to force that version).
  6. Basic example
       require 'mechanize'
       agent = Mechanize.new
       page = agent.get('http://rivierarb.fr')
       element = page.root.css('h1.logo').first
       element.content  # => "Riviera.rb"
  7. Main usage: create a Mechanize agent. This is the same as creating a browser session (it handles cookies, history, authentication...).
  8. Use the agent to perform HTTP(S) requests (get, post). Each request returns a page that can be parsed with Nokogiri.
  9. Parse the page using CSS selectors, XPath, DOM iterators.
  10. Fill and post forms using intuitive helpers.
  11. Common requests
       page = agent.get('http://rivierarb.fr')
       page2 = page.links_with(:text => 'Green King').first.click
       page3 = agent.back
       agent.user_agent = 'My user agent'
  12. Common parsing: selectors
       page.root.css('body div.myclass').each { |element| … }
       page.root.xpath('//h3/a[@class="l"]').each { |element| … }
  13. Common parsing: elements
       <div>
         <a href="http://www.google.com">
           Click here
           <img src="http://www.google.com/favicon.ico"/>
         </a>
       </div>
       element['href']           # => "http://www.google.com"
       element.content           # => "\n Click here \n \n"
       element.children[1].name  # => "img"
       element.parent.name       # => "div"
  14. Filling and submitting forms: basic example (Google search)
       form = agent.get('http://www.google.com').forms.first
       form.q = 'Rivierarb'
       results_page = form.submit
  15. Filling and submitting forms: fields
      When your HTML form has
        <input … name="myfield">...</input>
      you can write
        form.myfield = 'The field value'
        form.field_with(:name => 'myfield').value = 'The field value'
        form.checkboxfield = '1'
        form.selectfield = '5'
  16. Filling and submitting forms: buttons
      Warning: Mechanize does not add the value of the button being clicked. If the web server relies on button values in the POST data, add them manually.
        <input type="submit" name="btn1" value="Clicked">...</input>
        form.add_field!('btn1', 'Clicked')
        button = form.button_with(:name => 'btn1')
        page = form.click_button(button)
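
To see why a missing button value can matter to a server, it may help to look at the form-encoded POST body itself. The sketch below uses only Ruby's standard library and does not call Mechanize; the `post_body` helper is hypothetical, and the field names are taken from the slides as placeholders.

```ruby
require 'uri'

# Hypothetical helper: builds the form-encoded body a server would receive.
# `button` is an optional [name, value] pair for the clicked submit button.
def post_body(fields, button = nil)
  params = fields.dup
  params[button[0]] = button[1] if button
  URI.encode_www_form(params)
end

fields = { 'myfield' => 'The field value' }

post_body(fields)                       # => "myfield=The+field+value"
post_body(fields, ['btn1', 'Clicked'])  # => "myfield=The+field+value&btn1=Clicked"
```

A server that branches on `btn1` sees it only in the second body, which is the effect of adding the field manually.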
  17. Other Mechanize features: download raw files
  18. Upload files
  19. Read and write cookies
  20. Use proxies
  21. Use HTML parsers other than Nokogiri
  22. Has no JavaScript engine (therefore no Ajax)
  23. Links: Mechanize
  24. Mechanize examples
  25. Nokogiri
  26. Nokogiri element API
      This presentation is available under a CC-BY license by Muriel Salvan.
  27. Q/A
