Ruby
Robots

Daniel Cukier
@danicuki
                http://www.flickr.com/photos/flysi/183272970
Relatives


• spiders
• crawlers
• bots
Why robot?
http://www.flickr.com/photos/nhankamer/5016628611
require 'anemone'

Anemone.crawl(url) do |anemone|
  anemone.on_every_page do |page|
      puts page.url
  end
end
                           http://www.cantora.mus.br/
                           http://www.cantora.mus.br/fotos
                           http://www.cantora.mus.br/?locale=en
                           http://www.cantora.mus.br/?locale=pt-BR
                           http://www.cantora.mus.br/musicas
                           http://www.cantora.mus.br/videos
                           http://www.cantora.mus.br/agenda
                           http://www.cantora.mus.br/novidades
                           http://www.cantora.mus.br/musicas/baixar
                           http://www.cantora.mus.br/visitors/baixar
                           http://www.cantora.mus.br/social
                           http://www.cantora.mus.br/fotos?locale=pt-BR
                           http://www.cantora.mus.br/musicas?locale=en
                           http://www.cantora.mus.br/fotos?locale=en
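The same crawl can be made politer. A minimal sketch, assuming the :depth_limit, :delay and :obey_robots_txt options and skip_links_like from anemone's README (url and the locale pattern are placeholders):

require 'anemone'

Anemone.crawl(url, :depth_limit => 2,          # stop two clicks deep
                   :delay => 1,                # wait 1s between requests
                   :obey_robots_txt => true) do |anemone|
  anemone.skip_links_like(/\?locale=/)         # skip the locale variants
  anemone.on_every_page do |page|
    puts "#{page.code} #{page.url}"            # HTTP status + URL
  end
end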
XPath
<html>
...
<div class="bla">
  <a>legal</a>
</div>
...
</html>




html_doc = Nokogiri::HTML(html)
info = html_doc.xpath(
  "//div[@class='bla']/a")
info.text
=> "legal"
XPath
<table id="super">   >> html_doc = Nokogiri::HTML(html)
  <tr>               >> info = html_doc.xpath(
    <td>L1C1</td>      "//table[@id='super']/tr")
    <td>L1C2</td>    >> info.size
                     => 3
  </tr>
  <tr>
                     >> info
    <td>L2C1</td>    => legal
    <td>L2C2</td>
  </tr>              >> info[0].xpath("td").size
  <tr>               => 2
    <td>L3C1</td>
    <td>L3C2</td>    >> info[2].xpath("td")[1].text
  </tr>              => "L3C2"
</table>
rest-client
GET

http://www.flickr.com/photos/amortize/766738216
http://www.flickr.com/photos/abbeychristine/223898960
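A minimal sketch of a GET with rest-client (url is a placeholder; older versions of the gem are required as 'rest_client'):

require 'rest-client'
require 'json'

body = RestClient.get(url)                                # response body as a String
body = RestClient.get(url, :user_agent => "Mozilla/5.0")  # extra headers go in a hash
json = JSON.parse(body)                                   # when the endpoint returns JSON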
Good bot

/robots.txt

User-agent: *
Disallow:

http://www.flickr.com/photos/temily/5645585162
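A good bot reads /robots.txt before crawling. A minimal hand-rolled sketch with rest-client (it only honors Disallow lines in the User-agent: * section; a real parser also handles Allow and wildcards):

require 'rest-client'
require 'uri'

def disallowed_paths(site)
  txt = RestClient.get("#{site}/robots.txt") rescue ""
  agents, paths = [], []
  txt.each_line do |line|
    case line.strip
    when /\AUser-agent:\s*(.+)\z/i then agents << $1.strip
    when /\ADisallow:\s*(\S+)/i    then paths << $1 if agents.include?("*")
    when ""                        then agents = []   # blank line ends the section
    end
  end
  paths
end

def allowed?(url, disallowed)
  path = URI.parse(url).path
  disallowed.none? { |rule| path.start_with?(rule) }
end

rules = disallowed_paths("http://www.cantora.mus.br")
puts allowed?("http://www.cantora.mus.br/fotos", rules)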
Ruby
Robots

Daniel Cukier
@danicuki
                http://www.flickr.com/photos/flysi/183272970
http://www.flickr.com/photos/nephelim/5632618462
maxRowsList=16
>> body = RestClient.get(url)
>> json = JSON.parse(body)
>> content = json["Content"]
>> content.size
=> 16

AHA!!!

http://.../artistas?maxRowsList=1600&filter=Recentes
>> body = RestClient.get(url)
>> json = JSON.parse(body)
>> content = json["Content"]
>> content.size
=> 1600

http://.../artistas?maxRowsList=1600000&filter=Recentes
>> content.size
=> 9154

Bingo!!!
>> b["Content"].map {|c| c["ProfileUrl"]}
["caravella", "tomleite", "jeffersontavares", "rodrigoaraujo",
"jorgemendes", "bossapunk", "daviddepiro", "freetools", "ironia",
"tiagorosa", "outprofile", "lucianokoscky",
"bandateatraldecarona", "tlounge", "almanaque", "razzyoficial",
"cretinosecanalhas", "cincorios", "ninoantunes", "caiocorsalette",
"alinedelima", "thelio", "grupodomdesamba", "ladoz",
"alexandrepontes", "poeiradgua", "betimalu", "leonardobessa",
"kamaross", "marcusdocavaco", "atividadeinformal", "angelkeys",
"locojohn", "forcamusic", "tiaguinhoabreu", "marcelonegrao",
"jstonemghiphop", "uniaoglobal", "bandaefex", "severarock",
"manitu", "sasso", "kakka", "xsopretty", "belepoke", "caixaazul",
"wknd", "bandastarven", "bleiamusic", "3porcentoaocubo",
"lucianoterra", "hipnoia", "influencianegra", "bandaursamaior",
"mariafreitas", "jessejames", "vagnerrockxe", "stageo3",
"lemoneight", "innocence", "dinda", "marcelocapela",
"paulocamoeseoslusiadas", "magnussrock", "bandatheburk",
"mercantes", "bandaturnerock", "flaviasaolli", "tonysagga",
"thiagoponde", "centeio", "grupodeubranco", "bocadeleao",
"eusoueliascardan", "notoriaoficial", "planomasterrock", "rofgod",
"dreemonphc", "chicobrant", "osz", "bandalightspeed",
"cavernadenarnia", "sergiobenevenuto", "viniciusdeoliveira", ...]
email?
phone?
>> html = RestClient.get("http://.../robomacaco")
>> html_doc = Nokogiri::HTML(html)
>> info = html_doc.xpath("//span[@class='name']")
>> info.text
=> "robo-macaco@hotmail.com
RIO DE JANEIRO - RJ - Brasil
21 9675-0199"
cookies



# Copy the cookie string from the browser and turn it into a hash
cookies = {}
c = "s_nr=12954999; s_v19=12978609471; ... __utmc=206845458"
cook = c.split(";").map {|i| i.strip.split("=")}
cook.each {|u| cookies[u[0]] = u[1]}

# Send them along with the request
RestClient.get(url, :cookies => cookies)
Proxies
http://www.ip-adress.com/proxy_list
>> response = RestClient.get(url)
>> html_doc = Nokogiri::HTML(response)
>> table = html_doc.xpath("//table[@class='proxylist']")
>> lines = table.children
>> lines.shift   # drop the header row
>> lines[1].text
=> "208.52.144.55 document.write(":"+i+r+i+r)
anonymous proxy server-2 minutes ago United States"
<script type="text/javascript">
  z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;
</script>
JAVASCRIPT = RUBY



     http://www.flickr.com/photos/drics/4266471776/
<script type="text/javascript">
  z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;
</script>


>> script = html_doc.xpath("//script")[1]
>> eval script.text
>> z
=> 5
>> i
=> 8
>> lines[1].text
=> "208.52.144.55 document.write(":"+i+r+i+r) anonymous
proxy server-2 minutes ago United States"


>> server = lines[1].text.split[0]
=> "208.52.144.55"


>> digits = lines[1].text.split(")")[0].split("+")
=> ["208.52.144.55 document.write(":"", "i", "r", "i", "r"]
>> digits.shift
>> digits
=> ["i", "r", "i", "r"]
>> port = digits.map {|c| eval(c)}.join("")
=> "8080"
                Voilà

RestClient.proxy = "http://#{server}:#{port}"
mechanize
require 'mechanize'

agent = Mechanize.new
site = "http://www.cantora.mus.br"
page = agent.get("#{site}/baixar")
form = page.forms.first
form['visitor[name]'] = 'daniel'
form['visitor[email]'] = "danicuki@gmail.com"
page = agent.submit(form)
tracks = page.links.select { |l| l.href =~ /track/ }
tracks.each do |t|
  file = agent.get("#{site}#{t.href}")
  file.save
end
protection techniques

• javascript
• text as image
• captcha
• don't be naive
captcha
prove you are not a robot




      YES you can!
3 steps

1. Download Image
2. filter image
3. run OCR software
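A minimal sketch of the three steps, assuming ImageMagick's convert and the tesseract binary are installed (the captcha URL and the filter values are placeholders to be tuned per captcha):

require 'rest-client'

# 1. Download the image
File.open("captcha.png", "wb") do |f|
  f.write(RestClient.get("http://example.com/captcha.png"))
end

# 2. Filter it: grayscale + threshold usually strips background noise
system("convert captcha.png -colorspace Gray -threshold 50% filtered.png")

# 3. Run the OCR software and read what it saw
system("tesseract filtered.png result")   # writes result.txt
puts File.read("result.txt").strip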
scaling




http://www.flickr.com/photos/liquene/3330714590
clouds


$ knife ec2 server create
threads + queues
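A minimal sketch of a fetcher that scales with Ruby's own Thread and Queue (the urls list and the 4 workers are placeholders):

require 'rest-client'
# Queue ships with Ruby's thread library (require 'thread' on old Rubies)

urls  = ["http://www.cantora.mus.br/", "http://www.cantora.mus.br/fotos"]
queue = Queue.new
urls.each { |u| queue << u }

workers = 4.times.map do
  Thread.new do
    until queue.empty?
      url = queue.pop(true) rescue break   # non-blocking pop; stop when drained
      puts "#{url} -> #{RestClient.get(url).size} bytes"
    end
  end
end
workers.each(&:join)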
In this crazy programmer's life
All kinds of situations come my way
Suddenly a client, a blunt proposal
To grab information from a website
You're crazy, that kind of crime I don't do
If you want, I've got some friends down south
Do it for me and I'll pay you with this cool jewel

I'll give you a ruby
For you to steal
With your robot

Want to build a robot?
Just use Ruby
Just use Ruby
To build a robot
                                http://www.flickr.com/photos/jobafunky/5572503988
Thank you




Daniel Cukier
@danicuki
