Mechanize at the Ruby Drink-up of Sophia, November 2011

Simple web-scraping with Mechanize and Nokogiri. Presented at the Ruby Drink-up of Sophia Antipolis on the 8th of November 2011 by Muriel Salvan (@MurielSalvan).

Published in: Technology

Transcript of "Mechanize at the Ruby Drink-up of Sophia, November 2011"

  1. Simple web-scraping with Mechanize and Nokogiri. Nov 8th 2011. Muriel Salvan, Open Source Lead developer and architect, X-Aeon Solutions, http://x-aeon.com
  2. The need: browse the web from a Ruby environment. Navigate like a browser (forms, cookies, history, redirects) => Mechanize
  3. Parse HTML pages (DOM) => Nokogiri
  4. Installation: gem install mechanize. This also installs Nokogiri if it is not already installed.
  5. Warning: version 1.0.0 can be more stable than 2.x.x for some complex queries ("gem install mechanize -v 1.0.0" to enforce it).
  6. Basic example
        require 'mechanize'
        agent = Mechanize.new
        page = agent.get('http://rivierarb.fr')
        element = page.root.css('h1.logo').first
        element.content  => "Riviera.rb"
  7. Main usage: create a Mechanize agent. This is the same as creating a browser session (handles cookies, history, authentication...).
  8. Use the agent to perform HTTP(S) requests (get, post). Each request gives a Nokogiri page.
  9. Parse the page using CSS selectors, XPath, DOM iterators.
  10. Fill and post forms using intuitive helpers.
  11. Common requests
        page = agent.get('http://rivierarb.fr')
        page2 = page.links_with(:text => 'Green King').first.click
        page3 = agent.back
        agent.user_agent = 'My user agent'
  12. Common parsing: selectors
        page.root.css('body div.myclass').each { |element| … }
        page.root.xpath('//h3/a[@class="l"]').each { |element| … }
  13. Common parsing: elements
        <div>
          <a href="http://www.google.com">
            Click here
            <img src="http://www.google.com/favicon.ico"/>
          </a>
        </div>
        element['href']           => "http://www.google.com"
        element.content           => "\nClick here\n\n"
        element.children[1].name  => "img"
        element.parent.name       => "div"
        element
  14. Filling and submitting forms: basic example (Google search)
        form = agent.get('http://www.google.com').forms.first
        form.q = 'Rivierarb'
        results_page = form.submit
  15. Filling and submitting forms: fields. When your HTML form has
        <input … name="myfield">...</input>
      you can write
        form.myfield = 'The field value'
        form.field_with(:name => 'myfield').value = 'The field value'
        form.checkboxfield = '1'
        form.selectfield = '5'
  16. Filling and submitting forms: buttons. Warning: Mechanize does not add the value of the button being clicked! If the web server cares about button values in POST data, add them manually.
        <input type="submit" name="btn1" value="Clicked">...</input>
        form.add_field!('btn1', 'Clicked')
        button = form.button_with(:name => 'btn1')
        page = form.click_button(button)
  17. Other Mechanize features: download raw files
  18. Upload files
  19. Give read/write access to cookies
  20. Use proxies
  21. Use HTML parsers other than Nokogiri
  22. Does not have a JavaScript engine (therefore no Ajax)
  23. Links: Mechanize
  24. Mechanize examples
  25. Nokogiri
  26. Nokogiri element API
  27. Q/A

This presentation is available under CC-BY license by Muriel Salvan.