Ruby Robots

A talk on building web robots in Ruby with rest-client, Nokogiri, and Mechanize, presented at the rsonrails event.


  1. Ruby Robots, by Daniel Cukier (@danicuki). Photo: http://www.flickr.com/photos/flysi/183272970
  2. Relatives: spiders, crawlers, bots
  3. Why a robot?
  4. Photo: http://www.flickr.com/photos/nhankamer/5016628611
  5. Crawling a site with Anemone:

         require 'anemone'

         Anemone.crawl(url) do |anemone|
           anemone.on_every_page do |page|
             puts page.url
           end
         end

     Output:

         http://www.cantora.mus.br/
         http://www.cantora.mus.br/fotos
         http://www.cantora.mus.br/?locale=en
         http://www.cantora.mus.br/?locale=pt-BR
         http://www.cantora.mus.br/musicas
         http://www.cantora.mus.br/videos
         http://www.cantora.mus.br/agenda
         http://www.cantora.mus.br/novidades
         http://www.cantora.mus.br/musicas/baixar
         http://www.cantora.mus.br/visitors/baixar
         http://www.cantora.mus.br/social
         http://www.cantora.mus.br/fotos?locale=pt-BR
         http://www.cantora.mus.br/musicas?locale=en
         http://www.cantora.mus.br/fotos?locale=en
  6. XPath

         <html>...
         <div class="bla">
           <a>legal</a>
         </div>
         ...</html>

         html_doc = Nokogiri::HTML(html)
         # Attribute values in an XPath predicate must be quoted.
         info = html_doc.xpath("//div[@class='bla']/a")
         info.text
         => "legal"
  7. XPath

         <table id="super">
           <tr>
             <td>L1C1</td>
             <td>L1C2</td>
           </tr>
           <tr>
             <td>L2C1</td>
             <td>L2C2</td>
           </tr>
           <tr>
             <td>L3C1</td>
             <td>L3C2</td>
           </tr>
         </table>

         >> html_doc = Nokogiri::HTML(html)
         >> info = html_doc.xpath("//table[@id='super']/tr")
         >> info.size
         => 3
         >> info[0].xpath("td").size
         => 2
         >> info[2].xpath("td")[1].text
         => "L3C2"
  8. rest-client
  9. ETG. Photo: http://www.flickr.com/photos/amortize/766738216
  10. Photo: http://www.flickr.com/photos/abbeychristine/223898960
  11. Good bot

          /robots.txt
          User-agent: *
          Disallow:

      Photo: http://www.flickr.com/photos/temily/5645585162
  12. Ruby Robots, by Daniel Cukier (@danicuki). Photo: http://www.flickr.com/photos/flysi/183272970
  13. Photo: http://www.flickr.com/photos/nephelim/5632618462
  14. maxRowsList=16
  15. Probing the maxRowsList parameter:

          >> body = RestClient.get(url)
          >> json = JSON.parse(body)
          >> content = json["Content"]
          >> content.size
          => 16

      AHA!!! http://.../artistas?maxRowsList=1600&filter=Recentes

          >> body = RestClient.get(url)
          >> json = JSON.parse(body)
          >> content = json["Content"]
          >> content.size
          => 1600

      http://.../artistas?maxRowsList=1600000&filter=Recentes

          >> content.size
          => 9154

      Bingo!!!
  16. Extracting the profile names:

          >> json["Content"].map {|c| c["ProfileUrl"]}
          => ["caravella", "tomleite", "jeffersontavares", "rodrigoaraujo",
              "jorgemendes", "bossapunk", "daviddepiro", "freetools", "ironia",
              "tiagorosa", "outprofile", "lucianokoscky",
              "bandateatraldecarona", "tlounge", "almanaque", "razzyoficial",
              "cretinosecanalhas", "cincorios", "ninoantunes", "caiocorsalette",
              "alinedelima", "thelio", "grupodomdesamba", "ladoz",
              "alexandrepontes", "poeiradgua", "betimalu", "leonardobessa",
              "kamaross", "marcusdocavaco", "atividadeinformal", "angelkeys",
              "locojohn", "forcamusic", "tiaguinhoabreu", "marcelonegrao",
              "jstonemghiphop", "uniaoglobal", "bandaefex", "severarock",
              "manitu", "sasso", "kakka", "xsopretty", "belepoke", "caixaazul",
              "wknd", "bandastarven", "bleiamusic", "3porcentoaocubo",
              "lucianoterra", "hipnoia", "influencianegra", "bandaursamaior",
              "mariafreitas", "jessejames", "vagnerrockxe", "stageo3",
              "lemoneight", "innocence", "dinda", "marcelocapela",
              "paulocamoeseoslusiadas", "magnussrock", "bandatheburk",
              "mercantes", "bandaturnerock", "flaviasaolli", "tonysagga",
              "thiagoponde", "centeio", "grupodeubranco", "bocadeleao",
              "eusoueliascardan", "notoriaoficial", "planomasterrock", "rofgod",
              "dreemonphc", "chicobrant", "osz", "bandalightspeed",
              "cavernadenarnia", "sergiobenevenuto", "viniciusdeoliveira", ...]
  17. email? phone?
  18. Scraping a profile page:

          >> html = RestClient.get("http://.../robomacaco")
          >> html_doc = Nokogiri::HTML(html)
          >> info = html_doc.xpath("//span[@class='name']")
          >> info.text
          => "robo-macaco@hotmail.comRIO DE JANEIRO - RJ - Brasil21 9675-0199"
  19. cookies

          cookies = {}
          c = "s_nr=12954999; s_v19=12978609471; ... __utmc=206845458"
          # Split on the first "=" only, so values containing "=" survive.
          cook = c.split(";").map {|i| i.strip.split("=", 2)}
          cook.each {|u| cookies[u[0]] = u[1]}
          RestClient.get(url, :cookies => cookies)
  20. Proxies
  21. http://www.ip-adress.com/proxy_list
  22. Scraping the proxy list:

          >> response = RestClient.get(url)
          >> html_doc = Nokogiri::HTML(response)
          >> table = html_doc.xpath("//table[@class='proxylist']")
          >> lines = table.children
          >> lines.shift # drop the header (Text IP)
          >> lines[1].text
          => "208.52.144.55 document.write(\":\"+i+r+i+r) anonymous proxy server-2 minutes ago United States"
  23. The port digits are hidden in a script:

          <script type="text/javascript">
            z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;
          </script>

  24. JAVASCRIPT = RUBY. Photo: http://www.flickr.com/photos/drics/4266471776/
  25. Evaluating the script text as Ruby:

          <script type="text/javascript">
            z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;
          </script>

          >> script = html_doc.xpath("//script")[1]
          >> eval script.text
          >> z
          => 5
          >> i
          => 8
  26. Rebuilding the proxy address:

          >> lines[1].text
          => "208.52.144.55 document.write(\":\"+i+r+i+r) anonymous proxy server-2 minutes ago United States"
          >> server = lines[1].text.split[0]
          => "208.52.144.55"
          >> digits = lines[1].text.split(")")[0].split("+")
          => ["208.52.144.55 document.write(\":\"", "i", "r", "i", "r"]
          >> digits.shift
          >> digits
          => ["i", "r", "i", "r"]
          >> port = digits.map {|c| eval(c)}.join("")
          => "8080"

      Voilà

          RestClient.proxy = "http://#{server}:#{port}"
  27. mechanize

          require 'mechanize'

          agent = Mechanize.new
          site = "http://www.cantora.mus.br"
          page = agent.get("#{site}/baixar")
          form = page.forms.first
          # Field names are strings, as are the values assigned to them.
          form["visitor[name]"] = "daniel"
          form["visitor[email]"] = "danicuki@gmail.com"
          page = agent.submit(form)
          tracks = page.links.select { |l| l.href =~ /track/ }
          tracks.each do |t|
            file = agent.get("#{site}#{t.href}")
            file.save
          end
  28. Protection techniques: javascript, text as image, captcha. Don't be naive.
  29. Captcha: prove you are not a robot. YES you can!
  30. 3 steps
      1. Download the image
      2. Filter the image
      3. Run OCR software
  31. Scaling. Photo: http://www.flickr.com/photos/liquene/3330714590
  32. Clouds

          $ knife ec2 server create

  33. threads + queues
  34. In this crazy programmer's life
      Every kind of situation comes my way
      Suddenly a client, with a blunt proposal
      To grab information from a site
      You're crazy, I don't do that kind of crime
      If you like, I have some friends down south
      Do it for me and I'll pay you with this cool jewel
      I'll give you a ruby
      For you to steal
      With your robot
      Want to make a robot?
      Just use Ruby
      Just use Ruby
      To make a robot
      Photo: http://www.flickr.com/photos/jobafunky/5572503988
  35. Thank you. Daniel Cukier (@danicuki)
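
Slides 6 and 7 select nodes with XPath through Nokogiri. As a rough sketch of the same queries using only Ruby's bundled REXML library (swapped in here for Nokogiri so the example has no gem dependency), the table from slide 7 can be queried as below. The `table_rows` helper and the `TABLE_MARKUP` constant are hypothetical names, and REXML requires well-formed markup, unlike Nokogiri's HTML parser.

```ruby
require "rexml/document"

# The table from slide 7, written as well-formed markup.
TABLE_MARKUP = <<~HTML
  <table id="super">
    <tr><td>L1C1</td><td>L1C2</td></tr>
    <tr><td>L2C1</td><td>L2C2</td></tr>
    <tr><td>L3C1</td><td>L3C2</td></tr>
  </table>
HTML

# Hypothetical helper: returns the <tr> elements of the table with this id.
# Note the quotes around the attribute value inside the predicate.
def table_rows(markup, table_id)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, "//table[@id='#{table_id}']/tr")
end

rows = table_rows(TABLE_MARKUP, "super")
puts rows.size                                    # 3
puts REXML::XPath.match(rows[2], "td")[1].text    # L3C2
```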
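
Slide 11 describes a good bot as one that reads /robots.txt first. A minimal sketch of such a check in plain Ruby follows. The helper name `allowed_by_robots?` and the second rule set are made up for illustration, and the parser is deliberately simplified: it handles only `User-agent` and `Disallow` prefix rules, not `Allow`, wildcards, grouped user agents, or `Crawl-delay`.

```ruby
# Hypothetical helper: true if `path` is not disallowed for `agent`.
# Simplified on purpose: only User-agent and Disallow prefix rules.
def allowed_by_robots?(robots_txt, path, agent = "*")
  applies = false
  disallowed = []
  robots_txt.each_line do |line|
    line = line.sub(/#.*/, "").strip          # drop comments and whitespace
    next if line.empty?
    field, value = line.split(":", 2).map(&:strip)
    value ||= ""
    case field.downcase
    when "user-agent"
      applies = (value == "*" || value.casecmp?(agent))
    when "disallow"
      disallowed << value if applies && !value.empty?
    end
  end
  disallowed.none? { |prefix| path.start_with?(prefix) }
end

# The rules shown on slide 11: an empty Disallow permits everything.
open_rules = "User-agent: *\nDisallow:\n"
# Made-up rules, for illustration only.
closed_rules = "User-agent: *\nDisallow: /private\n"

puts allowed_by_robots?(open_rules, "/musicas")         # true
puts allowed_by_robots?(closed_rules, "/private/data")  # false
puts allowed_by_robots?(closed_rules, "/fotos")         # true
```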
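
Slides 14 to 16 change the `maxRowsList` query parameter and read `ProfileUrl` values out of the JSON response. The sketch below isolates those two steps with the standard library. The host `example.com`, the helper names `listing_url` and `profile_urls`, and the sample JSON body are assumptions for illustration; the real endpoint is elided in the slides.

```ruby
require "json"
require "uri"

# Hypothetical helper: builds the listing URL for a given page size.
def listing_url(base, max_rows, filter = "Recentes")
  "#{base}?#{URI.encode_www_form(maxRowsList: max_rows, filter: filter)}"
end

# Hypothetical helper: pulls every ProfileUrl out of a JSON response body.
def profile_urls(body)
  JSON.parse(body).fetch("Content").map { |entry| entry["ProfileUrl"] }
end

puts listing_url("http://example.com/artistas", 1600)
# prints http://example.com/artistas?maxRowsList=1600&filter=Recentes

# A made-up response body with the same shape as the one on slide 15.
sample = '{"Content": [{"ProfileUrl": "caravella"}, {"ProfileUrl": "tomleite"}]}'
p profile_urls(sample)
# prints ["caravella", "tomleite"]
```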
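
Slide 19 turns a browser cookie string into a hash. One way to package that step as a reusable function is sketched below. The name `parse_cookie_string` and the sample cookie values are hypothetical.

```ruby
# Hypothetical helper: parses "name=value; name=value" into a Hash.
def parse_cookie_string(raw)
  raw.split(";").each_with_object({}) do |pair, cookies|
    name, value = pair.strip.split("=", 2)   # limit 2 keeps "=" inside values
    next if name.nil? || name.empty?         # skip empty segments
    cookies[name] = value.to_s
  end
end

# Made-up cookie values, for illustration only.
cookies = parse_cookie_string("session=abc123; token=a=b=c; theme=dark")
p cookies
```

The resulting hash can be passed as `:cookies => cookies` to `RestClient.get`, as on slide 19.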
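
Slides 25 and 26 recover the proxy port by running the page's script text through `eval`. Evaluating text downloaded from a third-party site executes whatever that site sends, so a variant that only parses the text may be preferable. The sketch below is one such variant, not the approach shown in the talk; the helper names `decode_port` and `proxy_url_for` are hypothetical, and the regular expressions assume the exact script and `document.write` shapes shown on slides 23 and 26.

```ruby
# Hypothetical helper: maps the letters in document.write(":"+i+r+i+r)
# to digits using the "name=digit;" assignments from the script.
def decode_port(script_text, cell_text)
  digits = script_text.scan(/([a-z])=(\d)/).to_h
  expr = cell_text[/document\.write\(":"\+([a-z+]+)\)/, 1]
  expr.split("+").map { |name| digits.fetch(name) }.join
end

# Hypothetical helper: the first whitespace-separated token is the IP.
def proxy_url_for(script_text, cell_text)
  server = cell_text.split.first
  "http://#{server}:#{decode_port(script_text, cell_text)}"
end

# Script body and table cell text as shown on slides 23 and 26.
script_text = "z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;"
cell_text = '208.52.144.55 document.write(":"+i+r+i+r) anonymous proxy server-2 minutes ago United States'

puts proxy_url_for(script_text, cell_text)
# prints http://208.52.144.55:8080
```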
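
Slide 33 names threads and queues as the way to scale a robot. A minimal sketch of that pattern with Ruby's built-in `Thread` and `Queue` is shown below: a fixed pool of workers drains a shared queue of URLs. The method name `crawl_in_parallel` is hypothetical, and the block passed to it stands in for the real download step (for example a `RestClient.get` call).

```ruby
# Hypothetical helper: runs the given block for each URL on a small
# pool of worker threads and returns a Hash of url => block result.
def crawl_in_parallel(urls, workers: 4)
  queue = Queue.new
  urls.each { |u| queue << u }
  queue.close                        # pop returns nil once the queue is drained

  results = Queue.new                # thread-safe collection of results
  threads = Array.new(workers) do
    Thread.new do
      while (url = queue.pop)
        results << [url, yield(url)]
      end
    end
  end
  threads.each(&:join)

  Array.new(results.size) { results.pop }.to_h
end

# Stand-in work: upcase the path instead of fetching it.
pages = crawl_in_parallel(%w[/fotos /musicas /videos], workers: 2) { |path| path.upcase }
p pages.keys.sort
```

Keeping the worker count small and fixed also limits how hard the robot hits the target site.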
