Building Efficient and Reliable Crawler System With Sidekiq Enterprise
Published on

Rubyconf 2016

Published in: Software

  1. 1. Photo: http://cliparts.co/clipart/3666251
  2. 2. Has anyone ever written crawlers?
  3. 3. Has anyone ever used cron?
  4. 4. Has anyone ever used Sidekiq?
  5. 5. Gary (Chien-Wei Chu)
 @icarus4 / @icarus4.chu Was a C programmer
 Fell in love with Ruby in 2013 CTO of Statementdog
  6. 6. I Play Photo: https://static01.nyt.com/images/2016/08/19/sports/19BADMINTONweb3/19BADMINTONweb3-master675.jpg
  7. 7. Photo: http://classic.battle.net/images/battle/scc/protoss/pix/units/screenshots/d05.jpg
  8. 8. Photo: http://resources.workable.com/wp-content/uploads/2015/08/ruby-560x224.jpg
  9. 9. • Introduction to Statementdog
  10. 10. • Introduction to Statementdog • Data behind Statementdog
  11. 11. • Introduction to Statementdog • Data behind Statementdog • Past practice of Statementdog
  12. 12. • Introduction to Statementdog • Data behind Statementdog • Past practice of Statementdog • Problems of the past practice
  13. 13. • Introduction to Statementdog • Data behind Statementdog • Past practice of Statementdog • Problems of the past practice • How we design our system to solve the problems.
  14. 14. Focus on: • More reliable job scheduling • Dealing with throttling issues
  15. 15. (Revenue)
  16. 16. (Revenue) (EPS)
  17. 17. (Revenue) (EPS) (Gross Margin) (Net Income)
  18. 18. (Revenue) (EPS) (Gross Margin) (Net Income) (Assets) (Liabilities)
  19. 19. (Revenue) (EPS) (Gross Margin) (Net Income) (Assets) (Liabilities) (Operating Cash Flow) (Free Cash Flow) (Investing Cash Flow)
  20. 20. (Revenue) (EPS) (Gross Margin) (Net Income) (Assets) (Liabilities) (Operating Cash Flow) (Free Cash Flow) (Investing Cash Flow) (ROE) (ROA) (Accounts Receivable) (Accounts Payable)
  21. 21. (Revenue) (EPS) (Gross Margin) (Net Income) (Assets) (Liabilities) (Operating Cash Flow) (Free Cash Flow) (Investing Cash Flow) (ROE) (ROA) (Accounts Receivable) (Accounts Payable)
  22. 22. (Revenue) (EPS) (Gross Margin) (Net Income) (Assets) (Liabilities) (Operating Cash Flow) (Free Cash Flow) (Investing Cash Flow) (ROE) (ROA) (Accounts Receivable) (Accounts Payable) (PMI)
  23. 23. (Revenue) (EPS) (Gross Margin) (Net Income) (Assets) (Liabilities) (Operating Cash Flow) (Free Cash Flow) (Investing Cash Flow) (ROE) (ROA) (Accounts Receivable) (Accounts Payable) (PMI) GDP
  24. 24.
  25. 25. Taiwan Market Observation Post System, Taiwan Stock Exchange, Taiwan Depository & Clearing Corporation, Yahoo Stock Feed … …
  26. 26. Yearly - dividend, remuneration of directors and supervisors
 Quarterly - quarterly financial statements
 Monthly - revenue
 Weekly -
 Daily - closing price
 Hourly - stock news from Yahoo stock feed
 Minutely - important news from Taiwan Market Observation Post System
  27. 27. Something like this, but written in PHP. A super long running process (1 hour+) loops from the first stock to the last one:
 Stock.find_each do |stock|
   # download XML financial report data
   # …
   # extract XML data
   # …
   # calculate advanced data
   # …
 end

  28. 28. A super long running process for quarterly report
  29. 29. A super long running process for quarterly report A super long running process for monthly revenue
  30. 30. A super long running process for quarterly report A super long running process for monthly revenue A super long running process for daily price
  31. 31. A super long running process for quarterly report A super long running process for monthly revenue A super long running process for daily price A super long running process for news . . .
  32. 32. • Really slow
  33. 33. • Really slow • Inefficient - unable to retry only the failed one
  34. 34. • Really slow • Inefficient - unable to retry only the failed one • Unpredictable server load
  35. 35. When the server load is low (chart: server load over time, Job 1 to Job 5)
  36. 36. When the server load is HIGH (chart: server load over time, other task)
  37. 37. When the server load is HIGH (chart: server load over time, other task plus Job 1 to Job 5)
  38. 38. When the server load is HIGH: too many crawler processes executed at the same time (chart: server load over time, other task plus Job 1 to Job 5)
  39. 39. • Really slow • Inefficient - unable to retry only the failed one • Unpredictable server load • Scaling out is not easy
  40. 40. • Inherent problems of Unix Cron:
  41. 41. • Inherent problems of Unix Cron: • Unreliable scheduling
  42. 42. • Inherent problems of Unix Cron: • Unreliable scheduling • High availability is not easy
  43. 43. • Inherent problems of Unix Cron: • Unreliable scheduling • High availability is not easy • Hard to prioritize jobs by popularity
  44. 44. • Inherent problems of Unix Cron: • Unreliable scheduling • High availability is not easy • Hard to prioritize jobs by popularity • Not easy to deal with bandwidth throttling issues
  45. 45. Created by Mike Perham
  46. 46. Diagram: requests (Request, Request, Request, ...) arrive at the web server process
  47. 47. Diagram: requests arrive at the web server process, which pushes to the job queue (very fast)
  48. 48. Diagram: requests arrive at the web server process, which pushes to the job queue (very fast); worker processes run on the worker server
  49. 49. Diagram: requests arrive at the web server process, which pushes to the job queue (very fast); worker processes run on the worker server. Add extra servers when needed
  50. 50. Diagram: requests arrive at the web server process, which pushes to the job queue (very fast); worker processes run on the worker server. Label: Producer
  51. 51. Diagram: requests arrive at the web server process, which pushes to the job queue (very fast); worker processes run on the worker server. Labels: Producer, Consumer
  52. 52. Multi-thread (one worker process running thread 1, thread 2, thread 3, ..., thread 25) vs. single process (one worker process)
  53. 53. Multi-thread (one worker process running thread 1, thread 2, thread 3, ..., thread 25) vs. single process (one worker process): 1 : 25
  54. 54. Multi-thread (one worker process running thread 1, thread 2, thread 3, ..., thread 25) vs. single process (one worker process): 1 : 25, with the same degree of memory consumption
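
The producer/consumer and multi-thread slides above can be sketched with the Ruby standard library alone. This is only an illustration of the idea, not how Sidekiq is implemented: `JOB_COUNT`, `THREAD_COUNT`, and the `:stop` marker are assumptions made for the example, and the `sleep` stands in for network I/O.

```ruby
# Minimal producer/consumer sketch using only the Ruby standard library.
# JOB_COUNT and THREAD_COUNT are illustrative values, not Sidekiq settings.
JOB_COUNT = 100
THREAD_COUNT = 25

queue = Queue.new
results = Queue.new # thread-safe collection of finished job ids

# Producer: pushing to the queue is cheap, so it returns almost immediately.
JOB_COUNT.times { |job_id| queue << job_id }

# One stop marker per worker thread so every thread exits cleanly.
THREAD_COUNT.times { queue << :stop }

# Consumers: 25 threads inside a single process share the same queue.
workers = Array.new(THREAD_COUNT) do
  Thread.new do
    while (job_id = queue.pop) != :stop
      sleep 0.001 # stand-in for network I/O, where threads overlap well
      results << job_id
    end
  end
end

workers.each(&:join)
processed = Array.new(results.size) { results.pop }.sort
puts "processed #{processed.size} jobs with #{THREAD_COUNT} threads"
```

Because crawling is mostly waiting on the network, the threads spend most of their time blocked, which is why one process with many threads can keep many requests in flight.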
  55. 55. Sidekiq (OSS) Sidekiq Pro Sidekiq Enterprise
  56. 56. Sidekiq Pro: Batches, Enhanced Reliability, Search in Web UI, Worker Metrics, Expiring Jobs
 Sidekiq Enterprise: Rate Limiting, Periodic Jobs, Unique Jobs, Historical Metrics, Multi-process, Encryption
  57. 57. Parallelism Makes Things Faster
  58. 58. • Really slow • Inefficient - unable to retry only the failed one • Unpredictable server load • Scaling out is not easy
  59. 59. • Efficient - retry only the failed one • Predictable server load • Easy to scale out
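
A rough sketch of why one small job per stock makes retries cheap, using plain Ruby rather than Sidekiq's own retry mechanism. `fetch_report`, `MAX_ATTEMPTS`, and the simulated failure for stock 2 are all hypothetical, chosen only to make the behaviour visible.

```ruby
# Sketch: one small job per stock, so a failure only re-runs that stock.
# fetch_report is a hypothetical stand-in for a real crawler call.
MAX_ATTEMPTS = 3

attempts = Hash.new(0)

# Fails the first time it sees stock 2, to simulate a transient error.
fetch_report = lambda do |stock_id|
  attempts[stock_id] += 1
  raise IOError, "timeout for stock #{stock_id}" if stock_id == 2 && attempts[stock_id] == 1
  "report-#{stock_id}"
end

queue = [1, 2, 3, 4]
reports = {}

until queue.empty?
  stock_id = queue.shift
  begin
    reports[stock_id] = fetch_report.call(stock_id)
  rescue IOError
    # Re-enqueue only the failed stock; the others are not repeated.
    queue.push(stock_id) if attempts[stock_id] < MAX_ATTEMPTS
  end
end

puts attempts.inspect
```

Only stock 2 is attempted twice; with a single long-running loop, the same failure would have meant re-running the whole pass.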
  60. 60. • Really slow • Inefficient - unable to retry only the failed one • Unpredictable server load • Scaling out is not easy
  61. 61. • Inherent problems of Unix Cron: • Unreliable scheduling • High availability is not easy • Hard to prioritize jobs by popularity • Not easy to deal with bandwidth throttling issues
  62. 62. –Mike Perham, CEO, Contributed Systems, Creator of Sidekiq
  63. 63. Keep the state of cron executions in the most robust part of our system: the database. All scheduled jobs are invoked by one particular job that runs every minute.
  64. 64. Keep the state of cron executions in the most robust part of our system: the database. All scheduled jobs are invoked by one particular job that runs every minute.
  65. 65. create_table :cron_jobs do |t| 
 t.string :klass, null: false 
 t.string :cron_expression, null: false 
 t.timestamp :next_run_at, null: false, index: true 
 end 
 
 Create table for storing cron settings table name: cron_jobs
  66. 66. create_table :cron_jobs do |t| 
 t.string :klass, null: false 
 t.string :cron_expression, null: false 
 t.timestamp :next_run_at, null: false, index: true 
 end 
 
 Create table for storing cron settings worker class name
  67. 67. create_table :cron_jobs do |t| 
 t.string :klass, null: false 
 t.string :cron_expression, null: false 
 t.timestamp :next_run_at, null: false, index: true 
 end 
 
 Create table for storing cron settings Something like 0 */2 * * *
  68. 68. create_table :cron_jobs do |t| 
 t.string :klass, null: false 
 t.string :cron_expression, null: false 
 t.timestamp :next_run_at, null: false, index: true 
 end 
 
 Create table for storing cron settings. next_run_at: when the job should next be executed
  69. 69. klass | cron_expression | next_run_at
 Push2000NewsJobs | “0 */2 * * *” | …
 Push2000DailyPriceJobs | “0 2 * * 1-5” | …
 Push2000MonthlyRevenueJobs | “0 0 10 * *” | …
 …
  70. 70. # Add to your Cron setting
 every :minute do
   runner 'CronJobWorker.perform_async'
 end
 Cron schedules only one job, which runs every minute
  71. 71. class CronJobWorker
   include Sidekiq::Worker

   def perform
     # find_each takes no conditions, so filter with where first
     CronJob.where("next_run_at <= ?", Time.now).find_each do |job|

     end
   end
 end

 CronJobWorker invokes all of your crawlers. Find the jobs that should be executed.
  72. 72. class CronJobWorker
   include Sidekiq::Worker

   def perform
     # find_each takes no conditions, so filter with where first
     CronJob.where("next_run_at <= ?", Time.now).find_each do |job|
       Sidekiq::Client.push('class' => job.klass.constantize, 'args' => ['foo', 'bar'])
     end
   end
 end

 CronJobWorker invokes all of your crawlers. Push the jobs to the job queue.
  73. 73. class CronJobWorker
   include Sidekiq::Worker

   def perform
     # find_each takes no conditions, so filter with where first
     CronJob.where("next_run_at <= ?", Time.now).find_each do |job|
       Sidekiq::Client.push('class' => job.klass.constantize, 'args' => ['foo', 'bar'])

       x = Sidekiq::CronParser.new(job.cron_expression)
       job.update!(next_run_at: x.next.to_time)
     end
   end
 end

 CronJobWorker invokes all of your crawlers. Set up the next execution time.
  74. 74. class CronJobWorker
   include Sidekiq::Worker

   def perform
     # find_each takes no conditions, so filter with where first
     CronJob.where("next_run_at <= ?", Time.now).find_each do |job|
       Sidekiq::Client.push('class' => job.klass.constantize, 'args' => ['foo', 'bar'])

       x = Sidekiq::CronParser.new(job.cron_expression)
       job.update!(next_run_at: x.next.to_time)
     end
   end
 end

 CronJobWorker invokes all of your crawlers.
  75. 75. Missed job executions will run at the next minute
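
To see why a missed execution is picked up on the next tick, here is a small in-memory sketch of the same idea. It is not the worker from the slides: `CronRow`, `tick`, and the `interval_seconds` column are assumptions made for the example, and a fixed interval replaces the cron expression so that no cron parser is needed.

```ruby
# In-memory stand-in for the cron_jobs table. interval_seconds replaces the
# cron expression purely to keep the sketch free of a cron parser (assumption).
CronRow = Struct.new(:klass, :interval_seconds, :next_run_at)

# Runs every row whose next_run_at has passed and reschedules it.
# Returns the names of the jobs that were pushed during this tick.
def tick(rows, now)
  due = rows.select { |row| row.next_run_at <= now }
  due.each do |row|
    # Advance from `now`, so a row that was missed runs once, not once per
    # missed slot.
    row.next_run_at = now + row.interval_seconds
  end
  due.map(&:klass)
end

start = Time.utc(2016, 12, 1, 0, 0, 0)
rows = [
  CronRow.new("Push2000NewsJobs", 2 * 60 * 60, start),
  CronRow.new("Push2000DailyPriceJobs", 24 * 60 * 60, start + 60)
]

# Suppose the scheduler was down at 00:00 and 00:01 and wakes up at 00:02:
# both rows are overdue, so both are still picked up.
pushed = tick(rows, start + 120)
pushed_again = tick(rows, start + 180)
puts pushed.inspect
```

Because the due time lives in a row rather than in the scheduler's memory, a tick that comes late still finds the row with `next_run_at` in the past.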
  76. 76. • Inherent problems of Unix Cron: • Unreliable scheduling • Hard to prioritize jobs by popularity • High availability is not easy • Not easy to deal with bandwidth throttling issues
  77. 77. Drawbacks solved • Inherent problems of Unix Cron: • Unreliable scheduling • Hard to prioritize jobs by popularity • High availability is not easy • Not easy to deal with bandwidth throttling issues
  78. 78. table: cron_jobs
 klass | cron_expression | args | next_run_at
 Push2000NewsJobs | “0 */2 * * *” | [] | …
  79. 79. table: cron_jobs
 klass | cron_expression | args | next_run_at
 Push2000NewsJobs | “0 */2 * * *” | [] | …
 NewsWorker | “*/30 * * * *” | [popular_stock_id_1] | …
 NewsWorker | “*/30 * * * *” | [popular_stock_id_2] | …
 …
  80. 80. Drawbacks solved • Inherent problems of Unix Cron: • Unreliable scheduling • Hard to prioritize jobs by popularity • High availability is not easy • Not easy to deal with bandwidth throttling issues
  81. 81. • Inherent problems of Unix Cron: • Unreliable scheduling • Hard to prioritize jobs by popularity • High availability is not easy • Not easy to deal with bandwidth throttling issues
  82. 82. Sidekiq.configure_server do |config|
   config.periodic do |mgr|
     # standard five-field cron expression: run every minute
     mgr.register("* * * * *", CronJobWorker)
   end
 end
 

  83. 83. • Inherent problems of Unix Cron: • Unreliable scheduling • Hard to prioritize jobs by popularity • High availability is not easy • Not easy to deal with bandwidth throttling issues
  84. 84. • Inherent problems of Unix Cron: • Unreliable scheduling • Hard to prioritize jobs by popularity • High availability is not easy • Not easy to deal with bandwidth throttling issues
  85. 85. You always want your crawler to be as fast as possible
  86. 86. However, your target server doesn’t always allow you to crawl at an unlimited rate
  87. 87. If you want to crawl data for your 2000 stocks: insert 2000 jobs into the queue at the same time
 Stock.pluck(:id).each do |stock_id|
   SomeWorker.perform_async(stock_id)
 end
  88. 88. Assume the target server accepts requests at a maximum rate of 1 request / second
  89. 89. Insert 2000 jobs into the queue at the same time (timeline in seconds: job1, job2, job3, ..., job2000 all start at second 1). All of your jobs may be blocked (except the first one)
  90. 90. Improvement 1 Schedule jobs with incremental delays Stock.pluck(:id).each_with_index do |stock_id, index| 
 SomeWorker.perform_in(index, stock_id) 
 end 

  91. 91. Timeline in seconds: job1 at second 1, job2 at second 2, job3 at second 3, …, job2000 at second 2000
  92. 92. Workable, but… If the target server is unreachable (timeline in seconds: job1, job2, job3, …, job2000)
  93. 93. Workable, but… If the target server is unreachable, job3~2000 will still execute at the same time (timeline in seconds: job1 at second 1, job2 at second 2, job3 at second 3, …, job2000 at second 2000)
  94. 94. • Limit your worker threads to perform a specific job at a bounded rate • Sidekiq Enterprise provides two types of rate limiting APIs
  95. 95. CONCURRENT_LIMITER = Sidekiq::Limiter.concurrent('price', 10) 
 
 def perform(...) 
 CONCURRENT_LIMITER.within_limit do 
 # crawl stock data 
 end 
 end 

  96. 96. CONCURRENT_LIMITER = Sidekiq::Limiter.concurrent('price', 10) 
 
 def perform(...) 
 CONCURRENT_LIMITER.within_limit do 
 # crawl stock data 
 end 
 end 
 Only 10 concurrent operations inside the block can happen at any given moment
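
To make the "at most N at once" idea concrete, here is a minimal in-process sketch. It is not the Sidekiq Enterprise limiter, which coordinates across processes; `ConcurrencyLimiter` is a hypothetical class, and the limit of 3 and the 10 threads are arbitrary values for the demonstration.

```ruby
# Sketch of a concurrency limiter: at most `limit` blocks run at once.
# This is an in-process illustration; it does not coordinate across processes
# the way a Redis-backed limiter would.
class ConcurrencyLimiter
  def initialize(limit)
    @slots = SizedQueue.new(limit)
  end

  def within_limit
    @slots.push(true) # blocks while `limit` callers are already inside
    begin
      yield
    ensure
      @slots.pop # free the slot even if the block raised
    end
  end
end

limiter = ConcurrencyLimiter.new(3)
lock = Mutex.new
running = 0
peak = 0

threads = Array.new(10) do
  Thread.new do
    limiter.within_limit do
      lock.synchronize do
        running += 1
        peak = running if running > peak
      end
      sleep 0.01 # stand-in for a crawl request
      lock.synchronize { running -= 1 }
    end
  end
end
threads.each(&:join)
puts "peak concurrency: #{peak}"
```

The peak never exceeds the limit, no matter how many threads ask for a slot.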
  97. 97. BUCKET_LIMITER = Sidekiq::Limiter.bucket('price', 10, :second) 
 
 def perform(...) 
 BUCKET_LIMITER.within_limit do 
 # crawl stock data 
 end 
 end 
 For every second, you can perform up to 10 operations
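
The bucket idea (up to N operations per time window) can be sketched as a fixed-window counter. This is an illustration only, not the Sidekiq Enterprise implementation: `BucketLimiter` and `try_acquire` are hypothetical names, the sketch reports an over-limit call instead of waiting or rescheduling, and the clock is injectable so the behaviour can be checked without sleeping.

```ruby
# Sketch of a fixed-window ("bucket") limiter: up to `limit` operations per
# `window` seconds. Class and method names are illustrative.
class BucketLimiter
  def initialize(limit, window: 1, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @limit = limit
    @window = window
    @clock = clock
    @bucket = nil
    @count = 0
    @lock = Mutex.new
  end

  # Returns true and counts the operation if the current bucket has room.
  def try_acquire
    @lock.synchronize do
      bucket = (@clock.call / @window).floor
      if bucket != @bucket
        # A new window has started, so the counter starts over.
        @bucket = bucket
        @count = 0
      end
      return false if @count >= @limit
      @count += 1
      true
    end
  end
end

now = 0.0
limiter = BucketLimiter.new(2, clock: -> { now })
first_window = Array.new(3) { limiter.try_acquire }
now = 1.2 # move the fake clock into the next one-second window
second_window = limiter.try_acquire
puts first_window.inspect
```

With a limit of 2 per second, the third call in the same second is refused, and the first call in the next second is accepted again.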
  98. 98. You must fine-tune the parameters of your limiter for each data source to get better performance
  99. 99. So far, you have already achieved better performance. However, the throttling control of your target server may not always be static. Many websites apply throttling dynamically.
  100. 100. If throttling is detected, pause your workers for a while
  101. 101. Redis (job queue)
  102. 102. Redis (job queue) with queues: default, critical, low
  103. 103. Redis (job queue) with queues: default, critical, low; five worker threads
  104. 104. Redis (job queue) with queues: default, critical, low; five worker threads
  105. 105. Redis (job queue) with queues: default, critical, low, yahoo; five worker threads
  106. 106. Redis (job queue) with queues: default, critical, low, yahoo (paused); five worker threads. Pause this queue when throttled
  107. 107. Redis (job queue) with queues: default, critical, low, yahoo (paused); five worker threads. Schedule a job in another queue, executed after a few seconds, to “unpause” the queue
  108. 108. Redis (job queue) with queues: default, critical, low, yahoo (resumed); five worker threads. Resumed after the unpause-queue job executed
  109. 109. class SomeWorker 
 include Sidekiq::Worker 
 
 def perform 
 # try to crawl something 
 # ... 
 
 if throttled 
 queue_name = self.class.get_sidekiq_options['queue'] 
 queue = Sidekiq::Queue.new(queue_name) 
 queue.pause! 
 ResumeJobQueueWorker.perform_in(30.seconds, queue_name) 
 end 
 end 
 end 
 
 
 

  110. 110. class SomeWorker 
 include Sidekiq::Worker 
 
 def perform 
 # try to crawl something 
 # ... 
 
 if throttled 
 queue_name = self.class.get_sidekiq_options['queue'] 
 queue = Sidekiq::Queue.new(queue_name) 
 queue.pause! 
 ResumeJobQueueWorker.perform_in(30.seconds, queue_name) 
 end 
 end 
 end 
 
 
 

  111. 111. class SomeWorker 
 include Sidekiq::Worker 
 
 def perform 
 # try to crawl something 
 # ... 
 
 if throttled 
 queue_name = self.class.get_sidekiq_options['queue'] 
 queue = Sidekiq::Queue.new(queue_name) 
 queue.pause! 
 ResumeJobQueueWorker.perform_in(30.seconds, queue_name) 
 end 
 end 
 end 
 
 class ResumeJobQueueWorker 
 include Sidekiq::Worker 
 sidekiq_options queue: :queue_control, unique: :until_executed 
 
 def perform(queue_name) 
 queue = Sidekiq::Queue.new(queue_name) 
 queue.unpause! if queue.paused? 
 end 
 end 
 

  112. 112. The queue for ResumeJobQueueWorker MUST NOT be the paused queue. We have a dedicated queue for ResumeJobQueueWorker.
  113. 113. Decrease Sidekiq server poll interval for more precise timing control
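
Why the resume job needs its own queue can be shown with a small in-memory model. `PausableQueue` is a hypothetical class written for this sketch, not a Sidekiq class; it only mimics the rule that a paused queue hands out no jobs.

```ruby
# In-memory sketch of "pause on throttle, resume later".
# PausableQueue and its methods are illustrative, not Sidekiq classes.
class PausableQueue
  def initialize
    @jobs = []
    @paused = false
  end

  def push(job)
    @jobs << job
  end

  def pause!
    @paused = true
  end

  def unpause!
    @paused = false
  end

  def paused?
    @paused
  end

  # A paused queue hands out nothing, so workers leave its jobs untouched.
  def pop
    paused? ? nil : @jobs.shift
  end
end

crawl_queue = PausableQueue.new   # the queue that gets throttled
control_queue = PausableQueue.new # dedicated queue for the resume job
%w[job1 job2 job3].each { |job| crawl_queue.push(job) }

done = []
done << crawl_queue.pop # job1 runs and detects throttling
crawl_queue.pause!
control_queue.push(-> { crawl_queue.unpause! if crawl_queue.paused? })

blocked = crawl_queue.pop # nil while paused
control_queue.pop.call    # the resume job still runs: its queue is not paused
done << crawl_queue.pop   # job2 after resuming
puts done.inspect
```

Had the resume job been pushed onto `crawl_queue` instead, it would have been stuck behind the pause it was meant to lift.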
  114. 114. Queue pausing alleviates throttling issues. Is it possible for us to do things even better?
  115. 115. Most throttling controls aim to block requests from the same IP address
  116. 116. We can change our IP address via a proxy service
  117. 117. Diagram: the Sidekiq server sends a request to the target server from a.b.c.d
  118. 118. Diagram: the Sidekiq server sends two requests to the target server, both from a.b.c.d
  119. 119. Diagram: the Sidekiq server sends four requests to the target server, all from a.b.c.d. Same IP for each request
  120. 120. Diagram: the Sidekiq server (a.b.c.d) sends a request to the proxy service end point, in front of the target server
  121. 121. Diagram: the Sidekiq server (a.b.c.d) sends a request to the proxy service end point; a proxy server (e.f.g.h) reaches the target server
  122. 122. Diagram: the Sidekiq server (a.b.c.d) sends two requests to the proxy service end point; proxy servers (e.f.g.h, i.j.k.l) reach the target server
  123. 123. Diagram: the Sidekiq server (a.b.c.d) sends four requests to the proxy service end point; proxy servers (e.f.g.h, i.j.k.l, m.n.o.p, q.r.s.t) reach the target server
  124. 124. Diagram: the Sidekiq server (a.b.c.d) sends four requests to the proxy service end point; proxy servers (e.f.g.h, i.j.k.l, m.n.o.p, q.r.s.t) reach the target server. Different IP for each request
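
On the crawler side, routing requests through a proxy service end point might look like the sketch below, using Ruby's standard `Net::HTTP`. The host `proxy.example.test`, port 8080, and the `proxied_client` helper are placeholders invented for this example; the rotation of outgoing IP addresses is assumed to happen inside the proxy service, as in the diagrams above.

```ruby
require "net/http"

# Hypothetical proxy service end point; replace with the one your provider gives you.
PROXY_HOST = "proxy.example.test"
PROXY_PORT = 8080

# Builds an HTTP client that sends its requests through the proxy end point.
# No connection is opened until a request is actually made.
def proxied_client(target_host, target_port = 80)
  Net::HTTP.new(target_host, target_port, PROXY_HOST, PROXY_PORT)
end

http = proxied_client("example.com")
puts "via #{http.proxy_address}:#{http.proxy_port}" if http.proxy?
```

The worker code stays the same for every request; only the end point it connects through changes.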
  125. 125. • Inherent problems of Unix Cron: • Unreliable scheduling • Hard to prioritize jobs by popularity • High availability is not easy • Not easy to deal with bandwidth throttling issues
  126. 126. • With Sidekiq (Enterprise) and a proper design, the following problems are solved • Slow crawler • Inefficient - unable to retry only the failed one • Unpredictable server load • Scaling out is not easy • Inherent problems of Unix Cron • Not easy to deal with bandwidth throttling issues
