15. • Introduction to Statementdog
• Data behind Statementdog
• Past practice of Statementdog
• Problems of the past practice
• How we design our system to solve the problems.
16. Focus on:
• More reliable job scheduling
• Dealing with throttling issues
37. Yearly - dividend, remuneration of directors and supervisors
Quarterly - quarterly financial statements
Monthly - Revenue
Weekly -
Daily - closing price
Hourly - stock news from Yahoo stock feed
Minutely - important news from Taiwan Market Observation Post System
39. Something like this, but written in PHP:
a super long-running process (1 hour+)
that loops from the first stock to the last one
Stock.find_each do |stock|
# download xml financial report data
…
# extract xml data
…
# calculate advanced data
…
end
43. A super long-running process for quarterly reports
A super long-running process for monthly revenue
A super long-running process for daily prices
A super long-running process for news
…
56. • Inherent problems of Unix Cron:
• Unreliable scheduling
• High availability is not easy
• Hard to prioritize jobs by popularity
• Not easy to deal with bandwidth throttling issues
70. Sidekiq Pro: Batches, Enhanced Reliability, Search in Web UI, Worker Metrics, Expiring Jobs
Sidekiq Enterprise: Rate Limiting, Periodic Jobs, Unique Jobs, Historical Metrics, Multi-process, Encryption
73. • Really slow
• Inefficient - unable to retry only the failed ones
• Unpredictable server loading
• Scaling out is not easy
74. • Efficient - retries only the failed ones
• Predictable server loading
• Easy to scale out
76. • Inherent problems of Unix Cron:
• Unreliable scheduling
• High availability is not easy
• Hard to prioritize jobs by popularity
• Not easy to deal with bandwidth throttling issues
80. Keep the state of cron executions in
the most robust part of our system - the database
All scheduled jobs are invoked by a single job
that runs every minute
82. Create a table for storing cron settings (table name: cron_jobs)
create_table :cron_jobs do |t|
  t.string :klass, null: false                        # worker class name
  t.string :cron_expression, null: false              # something like "0 */2 * * *"
  t.timestamp :next_run_at, null: false, index: true  # when the job should next be executed
end
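As a sketch of how rows in this table might be seeded (the crawler class names and cron schedules below are illustrative assumptions, not from the talk; CronJob is the ActiveRecord model backed by cron_jobs):

CronJob.create!(
  klass: 'DailyPriceWorker',          # hypothetical crawler class
  cron_expression: '30 14 * * 1-5',   # weekdays, after market close
  next_run_at: Time.now
)
CronJob.create!(
  klass: 'MonthlyRevenueWorker',      # hypothetical crawler class
  cron_expression: '0 9 10 * *',      # the 10th of every month
  next_run_at: Time.now
)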
87. # Add to your Cron setting
every :minute do
runner 'CronJobWorker.perform_async'
end
Cron itself only schedules this one job, every minute
88. CronJobWorker invokes all of your crawlers
class CronJobWorker
  include Sidekiq::Worker

  def perform
    # Find the jobs that should have been executed by now
    CronJob.where("next_run_at <= ?", Time.now).find_each do |job|
      # Push the job to its queue
      Sidekiq::Client.push(
        'class' => job.klass.constantize,
        'args'  => ['foo', 'bar']
      )

      # Set up the next execution time
      x = Sidekiq::CronParser.new(job.cron_expression)
      job.update!(next_run_at: x.next.to_time)
    end
  end
end
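For context, one of the crawler classes stored in cron_jobs.klass might look like the sketch below; the class names are hypothetical, and the per-stock fan-out mirrors the pattern shown later in these slides:

class DailyPriceWorker
  include Sidekiq::Worker

  def perform(*args)
    # Instead of one super long-running loop, enqueue one small job per stock
    Stock.pluck(:id).each do |stock_id|
      DailyPriceCrawlerWorker.perform_async(stock_id)
    end
  end
end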
92. Missed job executions will simply run in the next minute,
since a job whose next_run_at is still in the past keeps matching the query until it is actually pushed
94. Drawbacks solved
• Inherent problems of Unix Cron:
• Unreliable scheduling
• Hard to prioritize jobs by popularity
• High availability is not easy
• Not easy to deal with bandwidth throttling issues
101. • Inherent problems of Unix Cron:
• Unreliable scheduling
• Hard to prioritize jobs by popularity
• High availability is not easy
• Not easy to deal with bandwidth throttling issues
104. However, your target server won't
always allow you to crawl at an
unlimited rate
105. If you want to crawl data for your 2000 stocks,
this inserts 2000 jobs into the queue at the same time
Stock.pluck(:id).each do |stock_id|
  SomeWorker.perform_async(stock_id)
end
106. Assume the target server accepts requests at
a maximum rate of
1 request per second
108. Improvement 1
Schedule jobs with incremental delays
Stock.pluck(:id).each_with_index do |stock_id, index|
  # the i-th job is delayed by i seconds, so roughly one request per second
  SomeWorker.perform_in(index, stock_id)
end
111. Workable, but…
[timeline: job1 at second 1, job2 at second 2, …, job2000 at second 2000]
If the target server becomes unreachable,
jobs 3~2000 will still fire at their originally scheduled times
112. • Limit your worker threads to perform specific jobs
at a bounded rate
• Sidekiq Enterprise provides two types of rate
limiting APIs (a sketch of one follows below)
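A minimal sketch of how a worker might use the window limiter; the limiter name, the 1-request-per-second rate, and the wait_timeout value are assumptions:

class SomeWorker
  include Sidekiq::Worker

  # Allow at most 1 request per second against the target server, shared by all threads
  TARGET_LIMITER = Sidekiq::Limiter.window(:target_server, 1, :second, wait_timeout: 5)

  def perform(stock_id)
    TARGET_LIMITER.within_limit do
      # crawl data for one stock here
    end
    # If the limit can't be obtained within wait_timeout, Sidekiq::Limiter::OverLimit
    # is raised and Sidekiq Enterprise reschedules the job to run again later
  end
end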
116. You must fine-tune your limiter's parameters
for each data source to get better performance
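For example, each data source might get its own limiter with its own rate (assuming Sidekiq Enterprise is loaded; the names and rates here are made up):

YAHOO_LIMITER = Sidekiq::Limiter.window(:yahoo_feed, 5, :second, wait_timeout: 5)   # a more tolerant source
MOPS_LIMITER  = Sidekiq::Limiter.window(:mops, 1, :second, wait_timeout: 10)        # a stricter source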
117. By now you have already got better performance.
However, the throttling applied by your target server
may not always be static.
Many websites throttle dynamically.
127. class SomeWorker
  include Sidekiq::Worker

  def perform
    # try to crawl something
    # ...
    if throttled
      # We got throttled: pause this worker's entire queue...
      queue_name = self.class.get_sidekiq_options['queue']
      queue = Sidekiq::Queue.new(queue_name)
      queue.pause!
      # ...and schedule a job to resume it 30 seconds later
      ResumeJobQueueWorker.perform_in(30.seconds, queue_name)
    end
  end
end

class ResumeJobQueueWorker
  include Sidekiq::Worker
  sidekiq_options queue: :queue_control, unique: :until_executed

  def perform(queue_name)
    queue = Sidekiq::Queue.new(queue_name)
    queue.unpause! if queue.paused?
  end
end
130. The queue for ResumeJobQueueWorker
MUST NOT be the same as the paused queue,
otherwise the resume job would sit in a paused queue and never get to run.
We have a dedicated queue for
ResumeJobQueueWorker
143. • Inherent problems of Unix Cron:
• Unreliable scheduling
• Hard to prioritize jobs by popularity
• High availability is not easy
• Not easy to deal with bandwidth throttling issues
145. • With Sidekiq (Enterprise) and a proper design, the following problems
are solved:
• Slow crawler
• Inefficient - unable to retry only the failed ones
• Unpredictable server loading
• Scaling out is not easy
• Inherent problems of Unix Cron
• Not easy to deal with bandwidth throttling issues