Crawler from zero (從零開始的爬蟲之旅)

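Drawing on first-hand experience, this talk shares the journey of using Ruby to crawl 2000+ pages per minute and maintain a database of more than a hundred million records. Moving from a single thread to multiple threads and then to Auto Scaling, it walks step by step through the bottlenecks you may hit at each stage of writing a crawler, in the hope of helping anyone who takes on similar work in the future.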

Shi-Ken Don @ RubyConf Taiwan 2016


  1. Crawler from zero shiken.don@gmail.com RubyConf Taiwan 2016
  2. About Me • a.k.a. Shi-Ken Don • 2009 - 2014 • Ruby Developer since 2013 • 2014 - 2015 • Web Developer at Backer-Founder since 2015
  3. About Me • a.k.a. Shi-Ken Don • 2009 - 2014 • Ruby Developer since 2013 • 2014 - 2015 • Web Developer at Backer-Founder since 2015 200 20
  4. Agenda • What I did • Why Ruby? Ruby • Comparison • Know-how
  5. Crawler Architecture Wikipedia:File:WebCrawlerArchitecture.svg
  6. Crawler Architecture Wikipedia:File:WebCrawlerArchitecture.svg
  7. CrowdTrail :D
  8. CrowdTrail • 14 • 2 •
  9. DEMO CrowdTrail :D >>>> https://goo.gl/sWfDBc <<<<
  10. • 5000 • Web • 0
  11. • 5000 • Web • 0 Kickstarter 2 T T
  12. • • JRuby • Ruby
  13. Ruby
  14. Ruby Ruby
  15. Ruby Ruby Why, or Why not?
  16. (Crawler) Python Python Ruby
  17. Ruby
  18. Ruby (Scrape web content)
  19. Network flow Google Chrome DeveloperTools screenshot 1.75 1.34 76%
  20. Ruby Python
                                 user     system      total        real
      Ruby Thread.new       16.090000   1.010000  17.100000 ( 17.254499)
      Ruby Parallel         16.900000   1.100000  18.000000 ( 18.080813)
      Python threading       0.000000   0.000000  16.360000 ( 16.532583)
      1000 thread www.facebook.com
      ruby_thread_tests.rb threading_test.py
      Python Ruby
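The `user system total real` header on this slide looks like the output format of Ruby's standard `Benchmark` module. A minimal sketch of that kind of comparison follows, assuming a simulated request instead of real network traffic; `simulate_request`, the 10 ms sleep, and the count of 20 are hypothetical stand-ins, not the contents of ruby_thread_tests.rb.

```ruby
require "benchmark"

# Hypothetical stand-in for one HTTP request: sleeping releases the
# interpreter lock the same way waiting on a socket does.
def simulate_request
  sleep 0.01
end

REQUESTS = 20

Benchmark.bm(12) do |bm|
  # One request after another.
  bm.report("sequential") { REQUESTS.times { simulate_request } }

  # One thread per request; the waits overlap.
  bm.report("Thread.new") do
    Array.new(REQUESTS) { Thread.new { simulate_request } }.each(&:join)
  end
end
```

For I/O-bound work the `real` column is the one to compare, since `user` and `system` only count CPU time.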
  21.                                    user     system      total        real
      Thread.new (parse and create)  5.640000   0.580000   6.220000 ( 6.073333)
      Thread.new (parse only)        5.060000   0.440000   5.500000 ( 5.455837)
      Thread.new (create directly)   3.340000   0.520000   3.860000 ( 5.169519)
      ruby_thread_sidekiq_test.rb
      100 thread www.facebook.com parse only
  22. • ActiveRecord::ConnectionTimeoutError: could not obtain a database connection within 5.000 seconds (waited 5.009 seconds)
  23. database.yml
        default: &default
          adapter: postgresql
          encoding: unicode
          pool: 25 # Increase this
  24. Thread
        begin
          # …
          ProjectLog.create!(title: title)
        ensure
          ActiveRecord::Base.clear_active_connections!
        end
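A complementary way to avoid exhausting the connection pool is to cap the number of threads at or below the `pool` value, using a fixed set of workers that drain a queue. The sketch below uses only the standard library; `run_in_pool`, the pool size of 2, and the stand-in work block are hypothetical and do not come from the talk.

```ruby
# Hypothetical helper: run the block once per item using at most `size` threads.
# Items must not be nil or false, because nil marks the end of the queue.
def run_in_pool(items, size:)
  jobs = Queue.new
  items.each { |item| jobs << item }
  jobs.close # after close, pop returns nil once the queue is empty (Ruby 2.3+)

  results = Queue.new
  workers = Array.new(size) do
    Thread.new do
      while (item = jobs.pop)
        results << yield(item)
      end
    end
  end
  workers.each(&:join)

  # Results arrive in completion order, not input order.
  Array.new(results.size) { results.pop }
end

# Stand-in for "fetch a page and create a record".
lengths = run_in_pool(%w[a bb ccc dddd], size: 2) { |url| url.length }
puts lengths.sort.inspect
```

In a Rails process, `size` would be chosen to stay at or below the `pool` setting in database.yml so that every worker thread can hold a connection.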
  25. Sidekiq • 200 8 • RubyThread • Auto scaling
  26. Heroku Auto Scaling heroku.rake
        task :auto_scaling, [:WORKER_NAME] => :environment do |_t, args|
          args.with_defaults(WORKER_NAME: "worker")
          # Locals instead of constants, so re-running the task does not
          # trigger "already initialized constant" warnings.
          app_name = ENV["HEROKU_APP_NAME"]
          worker_name = args[:WORKER_NAME]
          heroku = Heroku::API.new
          # Total number of jobs waiting across all Sidekiq queues.
          queues_size = Sidekiq::Queue.all.map(&:size).inject(0, :+)
          # 2X dyno 600 jobs
          # 50 parse project_log
          # jobs 45
          now_minutes = Time.now.min
          # / /
          left_minutes = now_minutes.between?(0, 45) ? 45 - now_minutes : 0
          workers_size = queues_size / 500 / [left_minutes, 1].max
          workers_size = 1 if workers_size < 1
          workers_size = 10 if workers_size > 10 # 10 worker
          puts "Scaling #{worker_name} dyno count to #{workers_size}"
          heroku.post_ps_scale(app_name, worker_name, workers_size)
        end
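The scaling rule in this task can be pulled out into a pure function, which makes it easy to try against sample queue sizes without touching Heroku or Sidekiq. This is a sketch under assumptions: `desired_worker_count` and the constant names are hypothetical, and reading 500 as "jobs one worker handles per minute" and 45 as "the minute of the hour by which jobs should be done" is an interpretation of the slide, not something it states.

```ruby
JOBS_PER_WORKER_PER_MINUTE = 500 # divisor used on the slide (interpretation assumed)
DEADLINE_MINUTE = 45             # minute of the hour used on the slide
MIN_WORKERS = 1
MAX_WORKERS = 10

# Hypothetical helper: how many worker dynos to request for a given backlog.
def desired_worker_count(queue_size, now_minute)
  left_minutes =
    now_minute.between?(0, DEADLINE_MINUTE) ? DEADLINE_MINUTE - now_minute : 0
  workers = queue_size / JOBS_PER_WORKER_PER_MINUTE / [left_minutes, 1].max
  # Clamp to the same 1..10 range as the rake task.
  [[workers, MIN_WORKERS].max, MAX_WORKERS].min
end

puts desired_worker_count(100_000, 5)    # large backlog, 40 minutes left
puts desired_worker_count(100, 30)       # small backlog
puts desired_worker_count(1_000_000, 50) # past the deadline minute
```

Because the division is integer division, small backlogs round down to zero and are then lifted to the minimum of one worker.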
  27. Sidekiq • PostgreSQL > Redis > Sidekiq connections Sidekiq Redis Thread Redis Thread
  28. Heroku PostgreSQL Pricing
  29. PostgreSQL Standard 2 • Rails App DB • RakeTask DB • Restarting • Sidekiq MAX connections = 200
  30. • File descriptor
        ✦ MacOS: 256 (default)
        ✦ Linux: 1024 (default)
        ✦ Windows: who cares
      Linux File descriptor 1024 CPU RAM
  31. Linux
      • #
      • cat /proc/sys/fs/file-max
      • #
      • sysctl -w fs.file-max=100000
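Note that `fs.file-max` is the system-wide ceiling, while the 1024 default is the per-process soft limit (the value `ulimit -n` reports). A process can inspect its own limit, and raise it up to the hard limit, from Ruby's standard library. A minimal sketch, where the target of 4096 is an arbitrary value chosen for illustration:

```ruby
# Current per-process limits on open file descriptors.
soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
puts "soft=#{soft} hard=#{hard}"

# Raise the soft limit for this process only, never beyond the hard limit.
# 4096 is a hypothetical target, not a recommendation from the talk.
target = [hard, 4096].min
Process.setrlimit(Process::RLIMIT_NOFILE, target, hard) if target > soft

new_soft, = Process.getrlimit(Process::RLIMIT_NOFILE)
puts "soft limit is now #{new_soft}"
```

Each open socket consumes one descriptor, so this limit bounds how many connections a single crawler process can hold at once.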
  32. Know-how
  33. • • Heroku dyno 500 Thread • dynos#process-thread-limits
  34. Queue (Weight) Sidekiq Queue Job
  35. Queue
        Project.find_each do |project|
          # Sidekiq queue
          name = project.platform.name
          SnapshotWorker.sidekiq_options_hash["queue"] = name
          SnapshotWorker.perform_async(…)
        end
  36. Retry
        sidekiq_options_hash["retry"] += 1
        self.class.sidekiq_retry_in do |count|
          Random.rand(retry_after + count)
        end
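The block passed to `sidekiq_retry_in` returns a random delay, which spreads retries out so that failed jobs do not all hit the target site again at the same moment. The same calculation can be tried in isolation. In this sketch `jittered_retry_delay` is a hypothetical name, and `retry_after` is assumed to be a number of seconds (the slide does not show where it comes from).

```ruby
# Hypothetical stand-in for the retry block: a random delay in seconds,
# drawn from 0 up to (but not including) retry_after + count.
def jittered_retry_delay(count, retry_after:, rng: Random.new)
  rng.rand(retry_after + count)
end

# The upper bound grows by one second with each retry attempt.
(0..4).each do |count|
  puts "retry #{count}: wait #{jittered_retry_delay(count, retry_after: 30)}s"
end
```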
  37. • User-Agent • Mozilla/5.0 (compatible; MSIE 10.0;Windows NT 6.1;Trident/6.0) • Google Bot • Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  38. • User-Agent • Mozilla/5.0 (compatible; MSIE 10.0;Windows NT 6.1;Trident/6.0) • Google Bot • Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) SEO Googlebot ψ( ∇´)ψ
  39. User-Agent
        DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; CrowdTrail/1.0; +https://crowdwatch.tw/)"

        def random_user_agent_string
          format(
            "%s Random/0.%d.%d",
            DEFAULT_USER_AGENT,
            Random.rand(100),
            Random.rand(100)
          )
        end

        HTTParty.get("https://www.facebook.com", headers: { "User-Agent" => random_user_agent_string })
  40. • header IP • X-Forwarded-For • X-Real-IP • CF-Connecting-IP
  41. X-Forwarded-For
        def random_x_forward_for
          format(
            "140.118.%d.%d",
            Random.rand(255),
            Random.rand(255)
          )
        end

        HTTParty.get("https://www.facebook.com", headers: { "X-Forwarded-For" => random_x_forward_for })
  42. Proxy • Proxy Server • Proxy • http://txt.proxyspy.net/proxy.txt
  43. CAPTCHA • Ruby OCR e.g. ruby-tesseract-ocr • antigate.com reCAPTCHA antigate
  44. Parser Know-how
  45. index
        # Bad
        doc.css('.tab')[2].text
        # Good
        doc.css('.tab').text[/ (\d+) /, 1]
      DOM Parser Regular Expression
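The difference between the two approaches shows up when the page layout changes. The sketch below uses plain strings in place of a parsed document, so it runs without Nokogiri; the tab labels, the numbers, and both helper names are hypothetical. It also anchors the regular expression on a label, which is a slightly stricter variant of the pattern on the slide.

```ruby
# Hypothetical text of the tab bar before and after the site adds a new tab.
before_change = "Campaign Updates 12 Comments 345 "
after_change  = "Campaign FAQ 3 Updates 12 Comments 345 "

# Position-based: "the second number on the page". Breaks when tabs move.
def comments_by_index(text)
  text.scan(/\d+/)[1]
end

# Pattern-based: the number that follows the label. Survives reordering.
def comments_by_label(text)
  text[/Comments\s+(\d+)/, 1]
end

puts comments_by_index(before_change) # correct only while the layout is unchanged
puts comments_by_index(after_change)  # silently picks up the wrong tab's number
puts comments_by_label(after_change)  # still finds the comment count
```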
  46. Integer / Float instead of to_i
        # Bad
        doc.css('.tab .pledged').text.to_i
        # Good
        Integer(doc.css('.tab .pledged').text)
      to_i silently turns nil into 0
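The behaviour behind this advice can be checked in plain Ruby: `to_i` never fails and falls back to 0, whereas `Integer()` raises on anything that is not a number. A minimal sketch follows; `parse_amount` is a hypothetical helper, and stripping thousands separators is an assumption about how the scraped amounts are formatted.

```ruby
# to_i hides bad input: all of these quietly become 0.
puts "".to_i    # => 0
puts "N/A".to_i # => 0
puts nil.to_i   # => 0

# Hypothetical helper: strict parsing that returns nil instead of a fake 0.
# Base 10 is passed explicitly so a leading zero is not read as octal.
def parse_amount(text)
  Integer(text.to_s.delete(",").strip, 10)
rescue ArgumentError
  nil
end

puts parse_amount("1,234").inspect
puts parse_amount("N/A").inspect
```

Returning nil (or letting the exception propagate) makes a layout change visible as a failed job rather than storing a zero in the database.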
  47. THANK YOU
      Follow me on
      https://github.com/shikendon
      https://medium.com/@shikendon
      https://www.facebook.com/zxuandon
