從零開始的爬蟲之旅 Crawler from zero

Shi-Ken Don, Web Developer
shiken.don@gmail.com
RubyConf Taiwan 2016
About Me
• a.k.a. Shi-Ken Don
• 2009 - 2014
• Ruby Developer since 2013
• 2014 - 2015
• Web Developer at Backer-Founder since 2015
200
20
Agenda
• What I did
• Why Ruby?
• Comparison
• Know-how
Crawler Architecture
Wikipedia:File:WebCrawlerArchitecture.svg
CrowdTrail
:D
CrowdTrail
• 14
• 2
•
DEMO
CrowdTrail
:D
>>>> https://goo.gl/sWfDBc <<<<
• 5000
• Web
• 0
Kickstarter
2
T T
•
• JRuby
• Ruby
Ruby
Why, or Why not?
(Crawler) Python
Python Ruby
Ruby
Ruby
(Scrape web content)
Network flow
Google Chrome DeveloperTools screenshot
1.75 1.34
76%
Ruby Python
                      user      system       total         real
Ruby Thread.new    16.090000   1.010000   17.100000  ( 17.254499)
Ruby Parallel      16.900000   1.100000   18.000000  ( 18.080813)
Python threading    0.000000   0.000000   16.360000  ( 16.532583)
1000 threads, www.facebook.com
ruby_thread_tests.rb
threading_test.py
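The user / system / total / real columns in the table are the format printed by Ruby's standard Benchmark module. As a rough sketch of how such a thread benchmark can be set up (this is not the author's ruby_thread_tests.rb): `fake_fetch` and `run_threads` are hypothetical names, and a short sleep stands in for the HTTP request so the example runs offline.

```ruby
require "benchmark"

# Hypothetical stand-in for one HTTP request (assumption: the real test hit the network).
def fake_fetch(delay = 0.01)
  sleep(delay)
  :ok
end

# Start `count` threads at once and wait for every result.
def run_threads(count)
  Array.new(count) { Thread.new { fake_fetch } }.map(&:value)
end

# Benchmark.bm prints the user / system / total / real columns.
Benchmark.bm(12) do |bm|
  bm.report("Thread.new") { run_threads(100) }
end
```

Swapping the body of `fake_fetch` for a real request would turn this into a network benchmark, at which point the numbers depend mostly on the remote server and the connection.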
Python Ruby
                                  user      system      total        real
Thread.new (parse and create)   5.640000   0.580000   6.220000  ( 6.073333)
Thread.new (parse only)         5.060000   0.440000   5.500000  ( 5.455837)
Thread.new (create directly)    3.340000   0.520000   3.860000  ( 5.169519)
ruby_thread_sidekiq_test.rb
100 threads, www.facebook.com
parse only
• ActiveRecord::ConnectionTimeoutError: could not obtain a database connection within 5.000 seconds (waited 5.009 seconds)
database.yml
default: &default
  adapter: postgresql
  encoding: unicode
  pool: 25 # Increase this
Thread
begin
  # …
  ProjectLog.create!(title: title)
ensure
  # Return this thread's connection to the pool, even if create! raised
  ActiveRecord::Base.clear_active_connections!
end
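Why releasing the connection matters can be shown with a toy pool. This is only a sketch of the idea: `TinyPool` and its methods are hypothetical and much simpler than ActiveRecord's real connection pool. It uses a thread-safe Queue holding a fixed number of connections and raises once none can be obtained in time.

```ruby
require "timeout"

# Toy connection pool (hypothetical; not ActiveRecord's implementation).
class TinyPool
  class TimeoutError < StandardError; end

  def initialize(size, wait: 1.0)
    @wait = wait
    @slots = Queue.new
    size.times { |i| @slots << "conn-#{i}" }
  end

  # Borrow a connection; give up after @wait seconds.
  def checkout
    Timeout.timeout(@wait) { @slots.pop }
  rescue Timeout::Error
    raise TimeoutError, "could not obtain a connection within #{@wait} seconds"
  end

  def checkin(conn)
    @slots << conn
  end

  # Borrow for the duration of the block; `ensure` returns it even on errors.
  def with_connection
    conn = checkout
    yield conn
  ensure
    checkin(conn) if conn
  end
end

pool = TinyPool.new(2)

# Eight threads share two connections because each one is returned promptly.
used = Array.new(8) do
  Thread.new { pool.with_connection { |conn| sleep(0.01); conn } }
end.map(&:value)
puts "served #{used.size} threads with 2 connections"
```

If a thread keeps its connection instead of returning it, the pool drains and later callers time out. That is the situation the error message describes, and it is why both a larger pool and the `ensure` block help.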
Sidekiq
• 200 8
• RubyThread
• Auto scaling
Heroku Auto Scaling
heroku.rake
task :auto_scaling, [:WORKER_NAME] => :environment do |_t, args|
  args.with_defaults(WORKER_NAME: "worker")
  # Local variables rather than constants, so the task can run repeatedly without redefinition warnings
  app_name = ENV["HEROKU_APP_NAME"]
  worker_name = args[:WORKER_NAME]
  heroku = Heroku::API.new
  # Total number of jobs waiting across all Sidekiq queues
  queues_size = Sidekiq::Queue.all.map(&:size).inject(0, :+)
  # 2X dyno 600 jobs
  # 50 parse project_log
  # jobs 45
  now_minutes = Time.now.min
  # Minutes left until minute 45 of the current hour (0 once past it)
  left_minutes = now_minutes.between?(0, 45) ? 45 - now_minutes : 0
  # Queued jobs divided by 500 per worker, then by the minutes left
  workers_size = queues_size / 500 / [left_minutes, 1].max
  workers_size = 1 if workers_size < 1
  workers_size = 10 if workers_size > 10 # cap at 10 workers
  puts "Scaling #{worker_name} dyno count to #{workers_size}"
  heroku.post_ps_scale(app_name, worker_name, workers_size)
end
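The sizing arithmetic in the task can be pulled out into a pure function, which makes it easy to try with different inputs. This is a sketch: `workers_needed` and its keyword names are hypothetical, while the values 500, 45 and 10 are taken from the task above.

```ruby
# Number of worker dynos to request, given the queue backlog and the current minute.
def workers_needed(queue_size, minute, jobs_per_worker: 500, deadline: 45, max_workers: 10)
  # Minutes left until the deadline minute; 0 once the deadline has passed.
  left_minutes = minute.between?(0, deadline) ? deadline - minute : 0
  # Integer division, as in the rake task.
  wanted = queue_size / jobs_per_worker / [left_minutes, 1].max
  # Never fewer than 1 worker, never more than max_workers.
  wanted.clamp(1, max_workers)
end

puts workers_needed(30_000, 15) # 30 minutes left: 60 / 30 = 2
puts workers_needed(30_000, 40) # 5 minutes left: 60 / 5 = 12, capped at 10
puts workers_needed(100, 50)    # small backlog: floor of 1
```

Keeping the calculation separate from the Heroku API call also means it can be checked without scaling any real dynos.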
Sidekiq
• PostgreSQL > Redis > Sidekiq connections
Sidekiq Redis Thread Redis
Thread
Heroku PostgreSQL Pricing
PostgreSQL Standard 2
• Rails App DB
• RakeTask DB
• Restarting
• Sidekiq MAX connections = 200
• File descriptor
✦ MacOS: 256 (default)
✦ Linux: 1024 (default)
✦ Windows: who cares
Linux File descriptor
1024
CPU RAM
Linux
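• # Show the system-wide file descriptor limit
• cat /proc/sys/fs/file-max
• # Raise the system-wide limit (the per-process default of 1024 is a separate setting, see ulimit -n)
• sysctl -w fs.file-max=100000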
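The per-process limit can also be inspected from Ruby itself with `Process.getrlimit`. A small sketch for Unix-like systems follows; `raise_nofile_limit` is a hypothetical helper name, and it only raises the soft limit, never beyond the hard limit.

```ruby
# Current per-process limits on open file descriptors: [soft, hard].
soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
puts "open files: soft limit #{soft}, hard limit #{hard}"

# Raise the soft limit toward `target`, staying within the hard limit.
# Returns the soft limit in effect afterwards.
def raise_nofile_limit(target)
  soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
  new_soft = [[target, soft].max, hard].min
  Process.setrlimit(Process::RLIMIT_NOFILE, new_soft, hard)
  Process.getrlimit(Process::RLIMIT_NOFILE).first
end
```

Each open socket uses one descriptor, so a crawler running hundreds of threads can reach the soft limit well before CPU or RAM becomes the bottleneck.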
Know-how
•
• Heroku dyno 500 Thread
• dynos#process-thread-limits
Queue (Weight)
Sidekiq Queue Job
Queue
Project.find_each do |project|
  # Send each job to the Sidekiq queue named after the project's platform
  name = project.platform.name
  SnapshotWorker.sidekiq_options_hash["queue"] = name
  SnapshotWorker.perform_async(…)
end
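With one queue per platform, weights decide how often each queue gets checked first. The sketch below illustrates the general idea of weighted polling in plain Ruby; it is not Sidekiq's code, and `polling_order`, the weights, and the "other_platform" queue name are hypothetical.

```ruby
# Hypothetical weights: a queue with weight 5 is checked first about 5x as often as weight 1.
WEIGHTS = { "kickstarter" => 5, "other_platform" => 1 }.freeze

# Repeat each queue name by its weight, shuffle, then drop duplicates.
# The result lists every queue once, in the order they would be polled.
def polling_order(weights, rng: Random.new)
  weights.flat_map { |name, weight| [name] * weight }.shuffle(random: rng).uniq
end

rng = Random.new(42)
firsts = Array.new(1000) { polling_order(WEIGHTS, rng: rng).first }
puts "kickstarter polled first in #{firsts.count('kickstarter')} of 1000 rounds"
```

Every queue still appears in each round, so a low-weight queue is delayed rather than starved.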
Retry
sidekiq_options_hash["retry"] += 1
self.class.sidekiq_retry_in do |count|
  Random.rand(retry_after + count)
end
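The retry block picks a random delay whose upper bound grows with each attempt, which spreads retries out instead of firing them all at the same moment. A standalone sketch of that calculation: `retry_delay` is a hypothetical name, and `retry_after` is assumed to be a number of seconds, since the slide does not show where it comes from.

```ruby
# Random delay in seconds, from 0 up to (but not including) retry_after + count.
def retry_delay(retry_after, count, rng: Random.new)
  # Guard against an upper bound of zero or less.
  upper = [retry_after + count, 1].max
  rng.rand(upper)
end

rng = Random.new(2016)
(0..4).each do |count|
  puts "attempt #{count}: wait #{retry_delay(30, count, rng: rng)}s"
end
```

Passing the random number generator in as an argument keeps the function repeatable when a fixed seed is used.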
• User-Agent
• Mozilla/5.0 (compatible; MSIE 10.0;Windows NT 6.1;Trident/6.0)
• Google Bot
• Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
SEO
Googlebot
ψ( ∇´)ψ
User-Agent
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; CrowdTrail/1.0; +https://crowdwatch.tw/)"
def random_user_agent_string
  format(
    "%s Random/0.%d.%d",
    DEFAULT_USER_AGENT,
    Random.rand(100),
    Random.rand(100)
  )
end
HTTParty.get("https://www.facebook.com", headers: { "User-Agent" => random_user_agent_string })
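The same header can be set with Ruby's standard library alone, which is handy for checking what will be sent before any request is made. This sketch swaps HTTParty for Net::HTTP; `random_user_agent` and `build_request` are hypothetical names, and no network call happens here.

```ruby
require "net/http"
require "uri"

BASE_USER_AGENT = "Mozilla/5.0 (compatible; CrowdTrail/1.0; +https://crowdwatch.tw/)"

# Append a random pseudo-version so each request carries a slightly different string.
def random_user_agent(base, rng: Random.new)
  format("%s Random/0.%d.%d", base, rng.rand(100), rng.rand(100))
end

# Build (but do not send) a GET request carrying the given User-Agent.
def build_request(url, user_agent)
  Net::HTTP::Get.new(URI(url), { "User-Agent" => user_agent })
end

request = build_request("https://www.facebook.com", random_user_agent(BASE_USER_AGENT))
puts request["User-Agent"]
```

Identifying the crawler by name and URL in the User-Agent, as the string above does, lets site owners see who is crawling them.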
• header IP
• X-Forwarded-For
• X-Real-IP
• CF-Connecting-IP
X-Forwarded-For
def random_x_forwarded_for
  format(
    "140.118.%d.%d",
    Random.rand(256), # 0..255
    Random.rand(256)
  )
end
HTTParty.get("https://www.facebook.com", headers: { "X-Forwarded-For" => random_x_forwarded_for })
Proxy
• Proxy Server
• Proxy
• http://txt.proxyspy.net/proxy.txt
CAPTCHA
• Ruby OCR
e.g. ruby-tesseract-ocr
• antigate.com
reCAPTCHA antigate
Parser Know-how
index
# Bad
doc.css('.tab')[2].text
# Good
doc.css('.tab').text[/ (\d+) /, 1]
DOM Parser Regular Expression
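The reason to prefer a regular expression over a fixed index is that the index breaks as soon as the page gains or loses a tab. A sketch on plain strings (no HTML parser needed): `count_for` is a hypothetical helper, and the sample tab text is made up for illustration.

```ruby
# Find the first number that follows `label` in the text, or nil if the label is absent.
def count_for(text, label)
  digits = text[/#{Regexp.escape(label)}\D*(\d+)/, 1]
  digits && Integer(digits, 10)
end

tabs = "Updates 12 Comments 345 Backers 6789"
puts count_for(tabs, "Comments")

# A new tab at the front shifts every index, but the label lookup still works.
changed = "FAQ 3 " + tabs
puts count_for(changed, "Comments")
```

Anchoring on a label ties the extraction to what the number means rather than to where it currently sits on the page.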
Prefer Integer() / Float() over to_i
# Bad
doc.css('.tab .pledged').text.to_i
# Good
Integer(doc.css('.tab .pledged').text)
to_i silently turns nil or non-numeric text into 0
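The difference is easy to see in isolation: `to_i` hides a parsing failure behind a plausible-looking 0, while `Integer()` raises. In the sketch below, `parse_amount` is a hypothetical helper; stripping thousands separators and forcing base 10 are assumptions about the scraped text, not something the slides specify.

```ruby
# Strict parse: raises ArgumentError instead of quietly returning 0.
def parse_amount(text)
  # Base 10 is explicit because Integer("012") would otherwise be read as octal.
  Integer(text.to_s.delete(",").strip, 10)
end

puts "N/A".to_i            # 0, indistinguishable from a real zero
puts nil.to_i              # 0
puts parse_amount("1,234") # 1234

begin
  parse_amount("N/A")
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

A raised error shows up in the job's retry or failure log, so a changed page layout is noticed instead of zeros being stored.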
THANK YOU
Follow me on
https://github.com/shikendon
https://medium.com/@shikendon
https://www.facebook.com/zxuandon