Mechanize at the Ruby Drink-up of Sophia, November 2011

Simple web-scraping with Mechanize and Nokogiri. Presented at the Ruby Drink-up of Sophia Antipolis on the 8th of November 2011 by Muriel Salvan (@MurielSalvan).


Transcript

  • 1. Simple web-scraping with Mechanize and Nokogiri. Nov 8th 2011. Muriel Salvan, Open Source Lead developer and architect, X-Aeon Solutions, http://x-aeon.com
  • 2. The need
    • Browse the web from a Ruby environment
    • Navigate like a browser (forms, cookies, history, redirects) => Mechanize
    • Parse HTML pages (DOM) => Nokogiri
  • 4. Installation
    • gem install mechanize
    • Also installs Nokogiri if not already done
    • ! Version 1.0.0 can be more stable than 2.x.x for some complex queries ("gem install mechanize -v 1.0.0" to enforce it).
  • 6. Basic example
      require 'mechanize'
      agent = Mechanize.new
      page = agent.get('http://rivierarb.fr')
      element = page.root.css('h1.logo').first
      element.content
      # => "Riviera.rb"
  • 7. Main usage
    • Create a Mechanize agent. Same as creating a browser session (handles cookies, history, authentication...).
    • Use the agent to perform HTTP(S) requests (get, post). Each request gives a Nokogiri page.
    • Parse the page using CSS selectors, XPath, DOM iterators.
    • Fill and post forms using intuitive helpers.
  • 11. Common requests
      page = agent.get('http://rivierarb.fr')
      page2 = page.links_with(:text => 'Green King').first.click
      page3 = agent.back
      agent.user_agent = 'My user agent'
  • 12. Common parsing: Selectors
      page.root.css('body div.myclass').each { |element| … }
      page.root.xpath('//h3/a[@class="l"]').each { |element| … }
  • 13. Common parsing: Elements
      <div>
        <a href="http://www.google.com">
          Click here
          <img src="http://www.google.com/favicon.ico"/>
        </a>
      </div>

      # element is the <a> node
      element['href']            # => "http://www.google.com"
      element.content            # => "\n Click here \n \n"
      element.children[1].name   # => "img"
      element.parent.name        # => "div"
  • 14. Filling and submitting forms: Basic example (Google search)
      form = agent.get('http://www.google.com').forms.first
      form.q = 'Rivierarb'
      results_page = form.submit
  • 15. Filling and submitting forms: Fields
      When your HTML form has
      <input … name="myfield">...</input>
      you can write
      form.myfield = 'The field value'
      form.field_with(:name => 'myfield').value = 'The field value'
      form.checkboxfield = '1'
      form.selectfield = '5'
  • 16. Filling and submitting forms: Buttons
      ! Mechanize does not add the value of the button being clicked! If the web server cares about button values in the POST data, add them manually.
      <input type="submit" name="btn1" value="Clicked">...</input>
      form.add_field!('btn1', 'Clicked')
      button = form.button_with(:name => 'btn1')
      page = form.click_button(button)
  • 17. Other Mechanize features
    • Download raw files
    • Upload files
    • Give read/write access to cookies
    • Use proxies
    • Use HTML parsers other than Nokogiri
    • Does not have a JavaScript engine (therefore no Ajax)
  • 23. Links This presentation is available under CC-BY license by Muriel Salvan
  • 27. Q/A