This document discusses strategies for improving the reliability and performance of scheduled crawling jobs. It proposes using Sidekiq to schedule jobs instead of cron, storing job definitions and schedules in a database table. Jobs would be invoked by a CronJobWorker that runs every minute, querying the database to find and enqueue jobs due for execution. This avoids issues with cron like unreliable scheduling. It also allows prioritizing popular jobs and dealing with throttling from target servers by rate limiting job queues.
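The CronJobWorker pattern described above can be sketched in a few lines. This is a minimal, language-neutral illustration in Python, assuming a jobs table that stores each job's interval and last-run timestamp (the schema and job names are invented for illustration, not taken from the document):

```python
import sqlite3

# Hypothetical schema: each row stores a job name, its interval in
# seconds, and the last time it was enqueued.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (name TEXT, interval_s INTEGER, last_run REAL)")
db.execute("INSERT INTO jobs VALUES ('crawl_popular', 60, 0), ('crawl_rest', 300, 0)")

def due_jobs(now):
    """Return names of jobs whose interval has elapsed since last_run."""
    rows = db.execute(
        "SELECT name FROM jobs WHERE ? - last_run >= interval_s", (now,)
    ).fetchall()
    return [name for (name,) in rows]

def enqueue_due(now):
    """One per-minute worker tick: find due jobs, mark them as run,
    and return them (a real worker would enqueue them to Sidekiq)."""
    names = due_jobs(now)
    for name in names:
        db.execute("UPDATE jobs SET last_run = ? WHERE name = ?", (now, name))
    return names
```

Because the schedule lives in a table rather than a crontab, popular jobs can be given shorter intervals or routed to higher-priority queues without redeploying.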
Scala in-practice-3-years by Patric Fornasier, Springr, presented at Pune Sca... - Thoughtworks
3 years ago, Springer decided to use Scala on a large, strategic project. This talk is about the journey the development teams made. Why did they choose Scala in the first place? Did they get what they hoped for? What challenges and surprises did they encounter along the way? And, most importantly, are they still happy with their choice?
A chronicle of my attempt to create a real time web app using pure clojure at every layer of the stack, from the client to the styles to the web server
This document summarizes a presentation about optimizing server-side performance. It discusses measuring performance metrics like time to first byte, optimizing databases through techniques like adding indexes and reducing joins, using caching with Memcached and APC, choosing fast web servers like Nginx and Lighttpd, and using load testing tools like JMeter to test performance before deployment. The presentation was given by a senior engineer at Wayfair to discuss their experiences optimizing their platform.
Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014 - datafundamentals
This document discusses two approaches to ETL jobs in Hadoop: a manual "special snowflake" approach and an automated approach. The manual approach involves a team spending a year copying and pasting code for 15 jobs. This leads to spaghetti code and is not sustainable. The automated approach involves designing reusable templates and rules to automate the ETL process. This frees up the developer Brent to focus on design rather than manual work. It results in code that is clean, consistent, easy to maintain and passes the "10 minute test" of being idempotent. The document demonstrates generating ETL code from metadata and deploying the automated jobs to Hadoop.
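The "generate ETL code from metadata" idea can be illustrated with a toy generator: each job is a small metadata record, and one shared template renders it into a script. The table and column names below are made up for illustration:

```python
# One template, many jobs: changing the template fixes every job at once,
# which is what makes the generated code consistent and maintainable.
ETL_TEMPLATE = """\
-- job: {name}
INSERT INTO {target}
SELECT {columns}
FROM {source};
"""

def generate_job(meta):
    """Render one ETL job from its metadata record."""
    return ETL_TEMPLATE.format(
        name=meta["name"],
        target=meta["target"],
        source=meta["source"],
        columns=", ".join(meta["columns"]),
    )

jobs_metadata = [
    {"name": "load_orders", "source": "staging.orders",
     "target": "warehouse.orders", "columns": ["id", "amount"]},
]
scripts = [generate_job(m) for m in jobs_metadata]
```

Adding a sixteenth job then means adding one metadata record, not copy-pasting a fifteenth script.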
Promise of a better future by Rahul Goma Phulore and Pooja Akshantal, Thought... - Thoughtworks
With the recent, vivid trend towards multicore hardware and ever-growing application requirements, concurrency is no longer the niche area it used to be, and is slowly becoming the norm. In this talk, we will talk about promises/futures, one of the concurrency models that has risen to the occasion. We will look at what they are, and how they're implemented and used in Java and JavaScript. We will see how Scala, with its functional paradigm and greater abstraction capabilities, avoids the "callback hell" typically associated with the model, allows writing concurrent code in "direct style", and thereby greatly reduces the cognitive burden, allowing you to focus better on application logic.
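The "direct style" point can be sketched briefly. The talk covers Scala, Java, and JavaScript; this Python version with `concurrent.futures` just illustrates the shape, and the function names are invented for the example:

```python
from concurrent.futures import ThreadPoolExecutor

# The callback version would nest "on completion" handlers; with futures
# you can compose the steps top-to-bottom instead.
def fetch_user(user_id):
    return {"id": user_id, "name": "ada"}

def fetch_orders(user):
    return [{"user": user["name"], "total": 42}]

with ThreadPoolExecutor() as pool:
    user_future = pool.submit(fetch_user, 1)
    # Blocking on .result() keeps this example readable; Scala's
    # for-comprehensions over Futures get the same linear shape
    # without blocking.
    orders = pool.submit(fetch_orders, user_future.result()).result()
```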
Developer-friendly taskqueues: What you should ask yourself before choosing one - Sylvain Zimmer
This document summarizes key considerations for choosing a task queue system. It discusses task properties like idempotency and reentrancy. It covers performance factors like latency and throughput as well as consistency models. Common task queue systems like Celery, RQ, and MRQ are evaluated based on factors like performance, complexity, community support, and future plans. The document emphasizes thinking carefully about specific needs before choosing a system and being grateful for open source software.
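Idempotency, one of the task properties mentioned, means a task can be retried or redelivered safely: running it twice has the same effect as running it once. A common trick is to key each task by a unique id and record completed ids. A minimal sketch, with all names invented for illustration:

```python
# Queue systems often guarantee at-least-once delivery, so the task
# itself must tolerate duplicates.
completed = set()
balance = {"acct": 0}

def credit(task_id, amount):
    """Apply a credit exactly once, even if the queue redelivers it."""
    if task_id in completed:
        return balance["acct"]          # duplicate delivery: no-op
    balance["acct"] += amount
    completed.add(task_id)
    return balance["acct"]
```

In production the `completed` set would live in durable storage shared by all workers, not in process memory.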
This document summarizes Azkaban, an open source workflow scheduler that was created at LinkedIn to manage Hadoop jobs and their dependencies. Key features of Azkaban include defining job dependencies in a simple interface, retry functionality, scheduling, and viewing logs and execution details in the web UI. The document also discusses how the author uses Azkaban to manage Python batch jobs at their company, including writing job files in YAML format and using the Azkaban API. In conclusion, the author finds Azkaban simple to use and sees no reason to replace it, though he hopes for more active development.
Getting Started with Apache Camel at DevNation 2014 - Claus Ibsen
Get off to a good start with Apache Camel. This session will give you an introduction to Apache Camel and teach you:
- How Camel is related to enterprise integration patterns (EIPs).
- How to use EIPs in Camel routes written in Java code or XML files.
- How to get started developing with Camel, including how to set up new projects from scratch using Maven and Eclipse.
- With a live demo, how to build Camel applications in Java, Spring, and OSGi Blueprint.
- How ready-to-use features make integration much easier.
- About the web console tools that give you insight into your running Apache Camel applications, including visual route diagrams with tracing, debugging, and profiling capabilities.
- Useful resources to learn more about Camel.
This session will be taught with a 50/50 mix of slides and live demos, and it will conclude with Q&A time.
George Wilson presented on modern cloud architecture and automation for websites built with content management systems like Joomla. He demonstrated how to automate the deployment of a Joomla site on AWS using just 7 commands and a configuration file. This included uploading the code, creating the application version, and provisioning the environment. Wilson discussed the rise of using CLIs and APIs to manage websites and their content programmatically. Documenting APIs with OpenAPI/Swagger was presented as a best practice. While these techniques may not apply to all Joomla sites, Wilson argued they are relevant for many sites in Joomla's target markets that prioritize agility and automation.
This document provides tips for writing LotusScript code for large systems with a focus on logging, performance, code reuse, and handling weird situations. Some key points include:
- Logging is important for stability and managing large systems. Recommends using OpenLog or creating and emailing log documents to avoid performance impacts.
- Views with click-sorted columns and unnecessary views hurt performance. Recommends minimizing views and avoiding click-sort.
- Agents need to be well-behaved to avoid overloading servers. Suggests profiling agents, breaking large tasks into multiple runs, and not relying on Agent Manager to kill misbehaving agents.
- Code reuse is important for maintenance. Recommends creating
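One of the tips above, breaking large tasks into multiple runs, can be sketched as a worker that processes a bounded batch per run and checkpoints where it stopped. The talk targets LotusScript agents; this Python version merely illustrates the pattern, and the "work" is a stand-in:

```python
# The checkpoint would normally be persisted (e.g. in a profile
# document) so the next scheduled run can resume.
checkpoint = {"pos": 0}

def run_once(items, batch_size):
    """Process at most batch_size items, resuming from the checkpoint."""
    start = checkpoint["pos"]
    batch = items[start:start + batch_size]
    processed = [item.upper() for item in batch]   # stand-in for real work
    checkpoint["pos"] = start + len(batch)
    return processed
```

Each scheduled run does a bounded amount of work, so no single run monopolizes the server or gets killed for overrunning its time limit.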
Van Wilson
Senior Consultant with Cardinal Solutions
Find more by Van Wilson: https://speakerdeck.com/vjwilson
All Things Open
October 26-27, 2016
Raleigh, North Carolina
Enterprise Integration Patterns with Apache Camel - Ioan Eugen Stan
This document discusses Enterprise Integration Patterns (EIPs) using Apache Camel, a Java framework for integration and mediation. It provides an overview of common EIPs like content-based routing, normalization, and the transactional client pattern. It also demonstrates how to implement EIPs like these using the Java and Spring DSLs in Camel. Key features of Camel like components, exchanges, processors and error handling are explained. Tools for working with Camel like Fuse IDE and Hawt.io are also introduced.
Developing Microservices with Apache Camel - Claus Ibsen
Red Hat Microservices Architecture Day - New York, November 2015. Presented by Claus Ibsen.
Apache Camel is a very popular integration library that works very well with microservice architecture. This talk introduces you to Apache Camel and how you can easily get started with Camel on your computer. Then we cover how to create new Camel projects from scratch as microservices, which you can boot using Camel or Spring Boot, or other micro containers such as Jetty or fat JARs. We then take a look at what options you have for monitoring and managing your Camel microservices using tooling such as Jolokia, and hawtio web console.
A 2-hour session covering what Apache Camel is and the latest news on the upcoming Camel v3; the main topic of the talk is the new Camel K sub-project for running integrations natively on the cloud with Kubernetes. The last part of the talk is about running Camel with GraalVM / Quarkus to achieve natively compiled binaries with impressive startup time and footprint.
Modernizing Legacy Applications in PHP, by Paul Jones - iMasters
Paul Jones, creator of "Aura for PHP" and author of "Modernizing Legacy App in PHP", spoke about 'Modernizing Legacy Applications in PHP' at iMasters PHP Experience 2015.
iMasters PHP Experience 2015 took place on April 25, 2015, at the Hotel Renaissance in São Paulo-SP - http://phpexperience.imasters.com.br/
Camel Day Italy 2021 - What's new in Camel 3 - Claus Ibsen
Slides for the 50-minute presentation at Camel Day Italy 2021, where Claus Ibsen and Andrea Cosentino had the opportunity to give a deeper-dive talk about the journey towards Camel 3, and what was done to re-architect the Camel core in v3 to make it great for microservices, cloud native, Kubernetes, Quarkus, GraalVM, Knative, and Apache Kafka.
Camel Day Italy 2021: https://www.meetup.com/it-IT/red-hat-developers-italy/events/275332376/
Camel K allows building and deploying Apache Camel integration applications on Kubernetes in about 1 second. It provides a lightweight runtime for Camel on Kubernetes that enables low-code/no-code integration using Camel's Java DSL. Camel K applications can take advantage of serverless capabilities provided by Knative like autoscaling and scaling to zero. Quarkus is a Kubernetes-native Java stack that provides a minimal footprint and container-first experience for building microservices. It works well with Camel/Camel K by enabling native compilation of Camel routes for very fast startup times and low memory usage.
The document discusses Reactive Xamarin, which combines Rx (Reactive Extensions) and Xamarin. It introduces key Rx concepts like Observables, LINQ, and Schedulers. Observables represent asynchronous push-based collections and address concurrency using Schedulers. LINQ allows querying Observables. Reactive UI and RxLite provide UI frameworks that integrate these Rx concepts into Xamarin apps through bindings and commands. In summary, Reactive Xamarin leverages Rx to build responsive and concurrent Xamarin apps in a reactive and declarative manner.
The document describes the journey of automating large scale enterprise crash dump analysis. It details how manual crash analysis used to be a slow and difficult process involving passing large files between experts. Through four steps of automation - automating analysis, adding a web frontend, integrating workflows, and enabling deep analysis in the browser - a tool called SuperDump was created that transformed the process. SuperDump reduced analysis time from days to minutes, enabled non-experts to analyze crashes, and improved productivity, security, and quality by making analysis scalable and easy.
This document discusses tools and techniques for optimizing Ruby performance. It begins by looking at common expensive tasks like database operations, network access, and inefficient algorithms. It then discusses tools for benchmarking and profiling Ruby code like Benchmark, benchmark-ips, and stackprof. The document provides examples of optimizing ActiveRecord queries and using caching and memoization. It also discusses optimizing the environment through server, database, and caching configuration. Finally, it notes that in some CPU-intensive or async tasks, Ruby may not be the best tool.
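Memoization, one of the techniques mentioned, is language-neutral even though the talk focuses on Ruby tooling; a minimal sketch in Python, with the expensive work stood in by a trivial computation:

```python
from functools import lru_cache

# Cache the result of an expensive call so repeated requests with the
# same argument skip the real work entirely.
calls = {"count": 0}

@lru_cache(maxsize=None)
def expensive_lookup(key):
    calls["count"] += 1          # track how often real work happens
    return key * 2

expensive_lookup(21)
expensive_lookup(21)             # second call is served from the cache
```

The trade-off is memory for CPU, which is usually a win for hot, repeated lookups but should be bounded (`maxsize`) for unbounded key spaces.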
Developer day - AWS: Fast Environments = Fast Deployments - Matthew Cwalinski
The document discusses how AWS enables fast and flexible deployments through automation. It outlines problems with manual and unique server deployments like breakages and lack of change management. The solution presented is to automate the entire process through continuous integration and deployment tools like Jenkins, GitHub, Grunt, and AWS CloudFormation. This treats servers as identical and deployable resources, ensures all code is tested and production-ready, and allows for boring but successful automated deployments on demand.
Apache Camel Introduction & What's in the box - Claus Ibsen
Slides from JavaBin talk in Grimstad Norway, presented by Claus Ibsen in February 2016.
This slide deck is fully up to date with the latest Apache Camel 2.16.2 release and includes additional slides presenting many of the features that Apache Camel provides out of the box.
Konrad Malawski gave a talk at Scala Days CPH 2017 about the current state and future direction of Akka. He discussed how Akka is moving from the actor model to reactive streams and Akka Streams for better concurrency and distribution capabilities. Akka Cluster provides robust membership and fault tolerance for distributed actors across many nodes, while Cluster Sharding enables easy sharding of data and work across a cluster. The talk outlined Akka's past successes and hinted at upcoming improvements to further "do better than that."
The document discusses different tools for automating processes in Salesforce - Workflow, Process Builder, and Apex - and provides guidance on which tool to use for different situations. It notes that Workflow is quick but limited, Process Builder can meet most needs but has scalability risks, and Apex can do anything but requires more effort. Key considerations for each tool include ease of use, functionality, scalability, debugging, and time to deploy.
ApacheCon EU 2016 - Apache Camel the integration library - Claus Ibsen
This presentation will demonstrate to developers involved with integration how the Apache Camel project can make your life much easier.
We start with an introduction to what Apache Camel is, and how you can use Camel to make integration much easier, allowing you to focus on your business logic rather than low-level messaging protocols and transports.
You will hear how Apache Camel is related to Enterprise Integration Patterns, which you can use in your architectural designs as well as in Java or XML code, running on the JVM with Camel.
You will also hear what other features Camel provides out of the box, which can make integration much easier for you.
1) Ansible is being used at Backbase to automate the provisioning of different server configurations for testing their Customer Experience Platform (CXP).
2) A REST API and UI allow users to easily provision new environments from available server stacks configured with Ansible for testing.
3) This enables Backbase to implement continuous delivery practices like automated testing of new versions without affecting production environments.
The document discusses using Apache Camel and Apache Karaf to build distributed, asynchronous systems in a similar way to AKKA. It provides examples of building a dynamic routing system using Camel routing and JMS, as well as a modular ETL system for processing CSV files using a configurable, hot-deployable mutation framework. The examples demonstrate how to achieve scalability, modularity, and asynchronous behavior without deep knowledge of the underlying technologies through an event-driven architecture based on messaging.
This document presents a summary of the teaching-innovation project PID 11-145 for the course "Fundamentos de la Programación" in the Degree in Information and Documentation at the University of Granada. The project consisted of producing audiovisual materials to support the teaching of basic programming elements, instructions, and operators.
The document describes three steps for modifying and adding information to a spreadsheet. First, add two rows at the end to show the total and the average of the amounts. Second, add a new table with the zone names and the number of clients. Third, add another new table with the zone names and the maximum amount.
El documento describe las partes internas y externas del computador. Entre las partes internas se encuentran el disco duro, la memoria RAM, la unidad de disquete, la ventiladora, la fuente de poder, la tarjeta madre, el procesador y la batería. Las partes externas incluyen el ratón, la CPU, el monitor y el teclado. El documento proporciona una breve descripción de cada parte y su función dentro del computador.
Numerosas especies de animales se encuentran en peligro de extinción debido a factores como el cambio climático, la destrucción de hábitat y la caza furtiva. Entre ellas se encuentran ballenas, tiburones, osos polares, elefantes pigmeos y otros que ya sólo podemos ver en fotografías. El cóndor andino y el manatí también están amenazados; el cóndor es el ave más imponente de los Andes y el manatí es un mamífero acuático que se alimenta de vegetación y se
George Steiner, um filósofo e ensaísta de 88 anos, defende que os jovens precisam da liberdade de errar para se tornarem adultos completos. Ele critica a educação atual por não permitir que as crianças cometam erros ou sonhem com utopias. Steiner também acredita que a cultura clássica pode estar morrendo, sendo substituída por novas formas culturais mais inclusivas.
This document discusses how to break bad habits by using GitLab CI to automate routine tasks. It provides examples of automating tests, packaging code, and deploying artifacts and websites. Specifically, it shows how to:
1. Run automated tests with GitLab CI
2. Package code into downloadable artifacts
3. Deploy packages and websites to AWS S3 and GitLab Pages
4. Separate testing and production using environments
5. Allow multiple developers to work on the same project simultaneously
6. Avoid mistakes by not deploying directly to production
El documento presenta información sobre alternativas de mitigación para el cambio climático en Bucaramanga, Colombia. Describe las principales manifestaciones del cambio climático observadas en la ciudad como aumento de temperatura, escasez de agua, incendios forestales y enfermedades virales. Luego detalla varias medidas de adaptación y mitigación en sectores como agua, agricultura, salud, turismo y energía, las cuales buscan reducir los efectos del cambio climático mediante el uso eficiente de recursos y tecnologías limpias. Finalmente
SearchLove Boston 2016 | Kindra Hall | Storytelling: The Secret of Irresistib...Distilled
It's no secret; in marketing, whoever tells the best story wins. The problem? ‘Storytelling’ has surpassed buzzword status and now everyday marketers are missing opportunities to connect with their customers because they simply don't know what a good story is anymore. In this engaging and immediately applicable presentation, strategic storytelling consultant Kindra Hall will reveal specific storytelling strategies to create great content to win customers without a fight.
Introducing Cloudera Director at Big Data BashAndrei Savu
My slide deck for Big Data Bash. This is a quick introduction on Cloudera Director and it ends with a list of open questions around some interesting future problems we are planning to work on.
The document discusses improving the onboarding process for new engineers. It describes current problems with README-driven onboarding like errors, things not working, and a lack of helpful guidance. It then provides suggestions for a nicer onboarding process like using automated setup scripts, provisioning consistent development environments with Vagrant, providing example projects and tasks, documenting best practices, and ensuring new engineers get help and have time for questions. The overall message is that the onboarding process should be made easier and more successful for new engineers.
The document discusses asynchronous programming using async and await in C#. It begins by explaining what asynchronous programming is and why it is useful for improving app responsiveness and simplifying asynchronous code. It then describes how async and await works by generating state machines and using continuation tasks. The document covers some gotchas with async code as well as best practices like naming conventions. It provides references for further reading on asynchronous patterns, tasks, and unit testing asynchronous code.
1) The document discusses the challenges of building a non-serverless lottery application and how the team transitioned to a serverless architecture using AWS Lambda and other serverless technologies.
2) It describes the process of mapping out the value stream for the original non-serverless application versus the serverless approach.
3) The document then outlines how the team implemented the serverless lottery application including using Lambda, API Gateway, MongoDB Atlas, SQS, and other services and how they addressed challenges like cold starts and resiliency.
Startup DevOps - Jon Milsom Pitchero - Leeds DevOps - August 2014Jon Milsom
Presentation at Leeds DevOps by Jon Milsom (Co-Founder & CTO, Pitchero), August 2014
http://www.pitchero.com/
http://www.leedsdevops.org.uk/
https://twitter.com/jonmilsom
https://twitter.com/leedsDevops
Continuous Integration, the minimum viable productJulian Simpson
What does it mean to 'do' Continuous Integration? It used to be enough to execute your unit tests in CI. But the bar is steadily raising for engineering practices. In the last decade we've seen tremendous improvements inacceptance testing. JavaScript is now a platform in it's own right. Cloudcomputing is now vital. There's growing interest in deployment to prod.So Continuous Integration is under more pressure than ever. As the bar slowly raises for engineering practices, we ll present 2011's minimum viable feature set for Continuous Integration
Asynchronous Processing with Ruby on Rails (RailsConf 2008)Jonathan Dahl
The document discusses asynchronous processing and provides recommendations for when and how to implement it. It describes asynchronous processing as running tasks without blocking normal execution flow. Common uses include sending emails, processing images, and database synchronization. It recommends using a background job queue like Delayed Job for general purpose asynchronous tasks and message queues like SQS with custom workers for distributed processing tasks requiring high speed and scalability.
Writing Asynchronous Programs with Scala & AkkaYardena Meymann
The document provides an overview of Yardena Meymann's background and experience working with asynchronous programming in Scala. It discusses some of the common tools and approaches for writing asynchronous programs in Scala, including Futures, Actors, Streams, HTTP clients/servers, and integration with Kafka. It highlights some of the challenges of asynchronous programming and how different tools address issues like error handling, retries, and backpressure.
Building source code level profiler for C++.pdfssuser28de9e
1. The document describes building a source code level profiler for C++ applications. It outlines 4 milestones: logging execution time, reducing macros, tracking function hit counts, and call path profiling using a radix tree.
2. Key aspects discussed include using timers to log function durations, storing profiling data in a timed entry class, and maintaining a call tree using a radix tree with nodes representing functions and profiling data.
3. The goal is to develop a customizable profiler to identify performance bottlenecks by profiling execution times and call paths at the source code level.
Managing Thousands of Spark Workers in Cloud Environment with Yuhao Zheng and...Databricks
At DataVisor, we fight online fraud, abuse, and money laundering using unsupervised machine learning approach that clusters millions of users. In order to support the computationally intensive workload, DataVisor uses Spark as the mainstay of its computation infrastructure. The scalability and portability of our Spark infrastructure is critical to our company when we expand our business. In this talk, we will present our story of how we manage our Spark infrastructure at scale.
At peak time, we have 2000+ Spark workers online, and we group these workers into ~50 clusters of various size. The benefits of this, on one hand, is data isolation, which is critical to DataVisor as we are processing multi-customer data. On the other hand, this is for cost and performance consideration, as we want to provide just enough resources to each Spark application. When under-provision, Spark application will fail due to out-of-memory or out-of-disk. However we want to avoid unnecessary over-provision as it dramatically increases our cloud cost.
Next, we will present our DataVisor SparkGenerator (DSG), which is designed to automatically manage our Spark infrastructure. The responsibility of DSG includes (a) launching and shutting down Spark cluster, to maximize concurrency and minimize cost, (b) assigning Spark applications to the proper clusters intelligently, according to the Spark application profile, and (c) managing the dependency among Spark applications, to make our pipeline run smoothly and efficiently, and (d) running all of the Spark worker on Spot instances, reducing the cloud computation cost versus on-demand by over 80%.
Voxxed Days Vienna - The Why and How of Reactive Web-Applications on the JVMManuel Bernhardt
The document discusses the need for reactive and functional programming approaches to build scalable applications that can take advantage of many-core processors and distributed systems. It introduces key concepts like immutability, functions, and declarative programming. Specific frameworks like Scala, Play and Akka are presented as tools that support this reactive, functional style for building web applications that can horizontally scale across multiple cores and nodes. The talk promotes adopting these approaches to build systems that can better handle concurrency, distribution and failure.
Give your little scripts big wings: Using cron in the cloud with Amazon Simp...Amazon Web Services
Most developers write them and every company has them – a vast library of small and large scripts that are designed to run on a scheduled basis. These background angels help keep the lights on and the doors open. They’ve been built up over time and are forgotten little heroes that are only remembered when the machines they live on fail. They are scattered throughout a company’s IT infrastructure and do important things.
In this session, we will explain how to use Ruby on Simple Workflow to quickly build a system that schedules scripts, runs them on time, retries them if they fail, and stores the history of their execution. You will walk away from this session with an understanding of how Simple Workflow brings resiliency, concurrency, and tracking to your applications.
This document discusses various techniques for improving Rails application performance, including reducing roundtrips through CSS sprites and data URIs, using tools like Firebug and NewRelic to diagnose issues, avoiding N+1 queries, and leveraging caching, monitoring, and scaling. It also briefly mentions plugins like Bullet and tools like RubyProf that can help optimize applications.
Performance Benchmarking: Tips, Tricks, and Lessons LearnedTim Callaghan
Presentation covering 25 years worth of lessons learned while performance benchmarking applications and databases. Presented at Percona Live London in November 2014.
Advanced technic for OS upgrading in 3 minutesHiroshi SHIBATA
This document discusses strategies for rapidly automating operating system upgrades and application deployments at scale. It proposes a two-phase image creation strategy using official OS images and Packer to build minimal and role-specific images. Automated tools like Puppet, Capistrano, Consul and Fluentd are configured to allow deployments to complete within 30 minutes through infrastructure-as-code practices. Continuous integration testing with Drone and Serverspec is used to refactor configuration files and validate server configurations.
[QCon.ai 2019] People You May Know: Fast Recommendations Over Massive DataSumit Rangwala
The “People You May Know” (PYMK) recommendation service helps LinkedIn’s members identify other members that they might want to connect to and is the major driver for growing LinkedIn's social network. The principal challenge in developing a service like PYMK is dealing with the sheer scale of computation needed to make precise recommendations with a high recall. PYMK service at LinkedIn has been operational for over a decade, during which it has evolved from an Oracle-backed system that took weeks to compute recommendations to a Hadoop backed system that took a few days to compute recommendations to its most modern embodiment where it can compute recommendations in near real time.
This talk will present the evolution of PYMK to its current architecture. We will focus on various systems we built along the way, with an emphasis on systems we built for our most recent architecture, namely Gaia, our real-time graph computing capability, and Venice our online feature store with scoring capability, and how we integrate these individual systems to generate recommendations in a timely and agile manner, while still being cost-efficient. We will briefly talk about the lessons learned about scalability limits of our past and current design choices and how we plan to tackle the scalability challenges for the next phase of growth.
https://qcon.ai/qconai2019/presentation/people-you-may-know-fast-recommendations-over-massive-data
The document discusses using queue systems to execute tasks asynchronously in the background to improve application performance and scalability. It provides an overview of different types of queue systems including dedicated job queues like Gearman and Beanstalkd, message queues like RabbitMQ, and software-as-a-service queues like Amazon SQS. It also discusses using databases like Redis as queues. The document then dives deeper into examples of using Gearman and Beanstalkd in PHP applications and compares their performance. It also discusses using queue abstraction layers and best practices for queueing jobs.
"Drupal is always so fast!" ... said no one, ever.
Drupal has a reputation as being a slow CMS, but that reputation is undeserved; there are many small things that impact a Drupal site's performance in sometimes substantial ways. This session will highlight many 'quick wins' that will get your site performing like a champ in no time!
Then we'll take a demonstration site that has many elements of real-world 'slow' Drupal sites, show how to do a quick performance evaluation/triage, and change the site from loading in 4-5 seconds to loading in less than a second, and maxing out at 2 requests per second to a speedy 4,000+ requests per second!
The session will also discuss the importance of a plan, benchmarking, and assumptions when you do performance work on your own Drupal site.
Serverless in production, an experience report (FullStack 2018)Yan Cui
This document discusses considerations for making serverless applications production ready. It covers topics like testing, monitoring, logging, deployment pipelines, performance optimization, and security. The document emphasizes principles over specific tools, and recommends focusing on shipping working software through practices like embracing external services for testing instead of mocking.
WinOps Conf 2016 - Michael Greene - Release PipelinesWinOps Conf
There are benefits to be gained when patterns and practices from developer techniques are applied to operations. Notably, a fully automated solution where infrastructure is managed as code and all changes are automatically validated before reaching production. This is a process shift that is recognized among industry innovators. For organizations already leveraging these processes, it should be clear how to leverage Microsoft platforms. For organizations that are new to the topic, it should be clear how to bring this process to your environment and what it means to your organizational culture. This presentation explains the components of a Release Pipeline for configuration as code, the value to operations, and solutions that are used when designing a new Release Pipeline architecture.
Similar to Building Efficient and Reliable Crawler System With Sidekiq Enterprise (20)
What is Augmented Reality Image Trackingpavan998932
Augmented Reality (AR) Image Tracking is a technology that enables AR applications to recognize and track images in the real world, overlaying digital content onto them. This enhances the user's interaction with their environment by providing additional information and interactive elements directly tied to physical images.
WhatsApp offers simple, reliable, and private messaging and calling services for free worldwide. With end-to-end encryption, your personal messages and calls are secure, ensuring only you and the recipient can access them. Enjoy voice and video calls to stay connected with loved ones or colleagues. Express yourself using stickers, GIFs, or by sharing moments on Status. WhatsApp Business enables global customer outreach, facilitating sales growth and relationship building through showcasing products and services. Stay connected effortlessly with group chats for planning outings with friends or staying updated on family conversations.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
GraphSummit Paris - The art of the possible with Graph TechnologyNeo4j
Sudhir Hasbe, Chief Product Officer, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Odoo ERP software
Odoo ERP software, a leading open-source software for Enterprise Resource Planning (ERP) and business management, has recently launched its latest version, Odoo 17 Community Edition. This update introduces a range of new features and enhancements designed to streamline business operations and support growth.
The Odoo Community serves as a cost-free edition within the Odoo suite of ERP systems. Tailored to accommodate the standard needs of business operations, it provides a robust platform suitable for organisations of different sizes and business sectors. Within the Odoo Community Edition, users can access a variety of essential features and services essential for managing day-to-day tasks efficiently.
This blog presents a detailed overview of the features available within the Odoo 17 Community edition, and the differences between Odoo 17 community and enterprise editions, aiming to equip you with the necessary information to make an informed decision about its suitability for your business.
Measures in SQL (SIGMOD 2024, Santiago, Chile)Julian Hyde
SQL has attained widespread adoption, but Business Intelligence tools still use their own higher level languages based upon a multidimensional paradigm. Composable calculations are what is missing from SQL, and we propose a new kind of column, called a measure, that attaches a calculation to a table. Like regular tables, tables with measures are composable and closed when used in queries.
SQL-with-measures has the power, conciseness and reusability of multidimensional languages but retains SQL semantics. Measure invocations can be expanded in place to simple, clear SQL.
To define the evaluation semantics for measures, we introduce context-sensitive expressions (a way to evaluate multidimensional expressions that is consistent with existing SQL semantics), a concept called evaluation context, and several operations for setting and modifying the evaluation context.
A talk at SIGMOD, June 9–15, 2024, Santiago, Chile
Authors: Julian Hyde (Google) and John Fremlin (Google)
https://doi.org/10.1145/3626246.3653374
Using Query Store in Azure PostgreSQL to Understand Query PerformanceGrant Fritchey
Microsoft has added an excellent new extension in PostgreSQL on their Azure Platform. This session, presented at Posette 2024, covers what Query Store is and the types of information you can get out of it.
Atelier - Innover avec l’IA Générative et les graphes de connaissancesNeo4j
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Allez au-delà du battage médiatique autour de l’IA et découvrez des techniques pratiques pour utiliser l’IA de manière responsable à travers les données de votre organisation. Explorez comment utiliser les graphes de connaissances pour augmenter la précision, la transparence et la capacité d’explication dans les systèmes d’IA générative. Vous partirez avec une expérience pratique combinant les relations entre les données et les LLM pour apporter du contexte spécifique à votre domaine et améliorer votre raisonnement.
Amenez votre ordinateur portable et nous vous guiderons sur la mise en place de votre propre pile d’IA générative, en vous fournissant des exemples pratiques et codés pour démarrer en quelques minutes.
What is Master Data Management by PiLog Groupaymanquadri279
PiLog Group's Master Data Record Manager (MDRM) is a sophisticated enterprise solution designed to ensure data accuracy, consistency, and governance across various business functions. MDRM integrates advanced data management technologies to cleanse, classify, and standardize master data, thereby enhancing data quality and operational efficiency.
Microservice Teams - How the cloud changes the way we workSven Peters
A lot of technical challenges and complexity come with building a cloud-native and distributed architecture. The way we develop backend software has fundamentally changed in the last ten years. Managing a microservices architecture demands a lot of us to ensure observability and operational resiliency. But did you also change the way you run your development teams?
Sven will talk about Atlassian’s journey from a monolith to a multi-tenanted architecture and how it affected the way the engineering teams work. You will learn how we shifted to service ownership, moved to more autonomous teams (and its challenges), and established platform and enablement teams.
Most important New features of Oracle 23c for DBAs and Developers. You can get more idea from my youtube channel video from https://youtu.be/XvL5WtaC20A
Revolutionizing Visual Effects Mastering AI Face Swaps.pdfUndress Baby
The quest for the best AI face swap solution is marked by an amalgamation of technological prowess and artistic finesse, where cutting-edge algorithms seamlessly replace faces in images or videos with striking realism. Leveraging advanced deep learning techniques, the best AI face swap tools meticulously analyze facial features, lighting conditions, and expressions to execute flawless transformations, ensuring natural-looking results that blur the line between reality and illusion, captivating users with their ingenuity and sophistication.
Web:- https://undressbaby.com/
Zoom is a comprehensive platform designed to connect individuals and teams efficiently. With its user-friendly interface and powerful features, Zoom has become a go-to solution for virtual communication and collaboration. It offers a range of tools, including virtual meetings, team chat, VoIP phone systems, online whiteboards, and AI companions, to streamline workflows and enhance productivity.
Artificia Intellicence and XPath Extension FunctionsOctavian Nadolu
The purpose of this presentation is to provide an overview of how you can use AI from XSLT, XQuery, Schematron, or XML Refactoring operations, the potential benefits of using AI, and some of the challenges we face.
OpenMetadata Community Meeting - 5th June 2024OpenMetadata
The OpenMetadata Community Meeting was held on June 5th, 2024. In this meeting, we discussed about the data quality capabilities that are integrated with the Incident Manager, providing a complete solution to handle your data observability needs. Watch the end-to-end demo of the data quality features.
* How to run your own data quality framework
* What is the performance impact of running data quality frameworks
* How to run the test cases in your own ETL pipelines
* How the Incident Manager is integrated
* Get notified with alerts when test cases fail
Watch the meeting recording here - https://www.youtube.com/watch?v=UbNOje0kf6E
15. • Introduction to Statementdog
• Data behind Statementdog
• Past practice of Statementdog
• Problems of the past practice
• How we design our system to solve the problems.
16. Focus on:
• More reliable job scheduling
• Dealing with throttling issue
37. Yearly - dividend, remuneration of directors and supervisors
Quarterly - quarterly financial statements
Monthly - Revenue
Weekly -
Daily - closing price
Hourly - stock news from Yahoo stock feed
Minutely - important news from Taiwan Market Observation Post System
39. Something like this, but written in PHP

A super long running process (1 hour+) that loops from the first stock to the last one:

Stock.find_each do |stock|
  # download XML financial report data
  …
  # extract XML data
  …
  # calculate advanced data
  …
end
43. A super long running process for quarterly report
A super long running process for monthly revenue
A super long running process for daily price
A super long running process for news
…
56. • Inherent problems of Unix Cron:
• Unreliable scheduling
• High availability is not easy
• Hard to prioritize jobs by popularity
• Not easy to deal with bandwidth throttling issues
70. Sidekiq Pro: Batches, Enhanced Reliability, Search in Web UI, Worker Metrics, Expiring Jobs
Sidekiq Enterprise: Rate Limiting, Periodic Jobs, Unique Jobs, Historical Metrics, Multi-process, Encryption
73. • Really slow
• Inefficient - unable to retry only the failed one
• Unpredictable server loading
• Not easy to scale out
74. • Efficient - able to retry only the failed one
• Predictable server loading
• Easy to scale out
80. Keep the state of cron executions in
the most robust part of our system - the database
All scheduled jobs are invoked by a single job
executed every minute
82. Create a table for storing cron settings (table name: cron_jobs)

create_table :cron_jobs do |t|
  t.string :klass, null: false                        # worker class name
  t.string :cron_expression, null: false              # something like 0 */2 * * *
  t.timestamp :next_run_at, null: false, index: true  # when the job should next be executed
end
87. # Add to your cron setting
every :minute do
  runner 'CronJobWorker.perform_async'
end

Cron now schedules only a single job, every minute
88. CronJobWorker invokes all of your crawlers

class CronJobWorker
  include Sidekiq::Worker

  def perform
    # Find the jobs that are due for execution
    CronJob.where("next_run_at <= ?", Time.now).find_each do |job|
      # Push each due job onto its Sidekiq queue
      Sidekiq::Client.push(
        class: job.klass.constantize,
        args: ['foo', 'bar']
      )
      # Set up the next execution time from the cron expression
      x = Sidekiq::CronParser.new(job.cron_expression)
      job.update!(next_run_at: x.next.to_time)
    end
  end
end
92. Missed job executions will simply be
picked up and executed at the next minute
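The scheduling semantics above can be sketched in plain Ruby (no Rails or Sidekiq; `CronJob` here is an illustrative stand-in for the database model, and the worker names are made up). Because the per-minute worker selects every job whose `next_run_at` lies in the past, a tick missed while the scheduler was down is picked up on the next run instead of being lost:

```ruby
# Minimal in-memory sketch of "find all jobs that are due".
CronJob = Struct.new(:klass, :next_run_at)

def due_jobs(jobs, now)
  jobs.select { |job| job.next_run_at <= now }
end

jobs = [
  CronJob.new('PriceCrawlerWorker', Time.utc(2024, 1, 1, 10, 0)),
  CronJob.new('NewsCrawlerWorker',  Time.utc(2024, 1, 1, 10, 5)),
]

# The scheduler was down between 10:00 and 10:06;
# the 10:07 tick still finds both overdue jobs.
now = Time.utc(2024, 1, 1, 10, 7)
puts due_jobs(jobs, now).map(&:klass).inspect
# => ["PriceCrawlerWorker", "NewsCrawlerWorker"]
```

Updating each job's `next_run_at` after enqueueing (as the worker above does) is what keeps a job from being enqueued twice for the same tick.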
94. Drawbacks solved
• Inherent problems of Unix Cron:
• Unreliable scheduling
• Hard to prioritize jobs by popularity
• High availability is not easy
• Not easy to deal with bandwidth throttling issues
104. However, your target server doesn't
always allow you to crawl at an
unlimited rate
105. Insert 2000 jobs into the queue at the same time
Stock.pluck(:id).each do |stock_id|
  SomeWorker.perform_async(stock_id)
end
If you want to crawl data for your 2000 stocks
106. Assume the target server accepts requests at a
maximum rate of
1 request per second
108. Improvement 1
Schedule jobs with incremental delays
Stock.pluck(:id).each_with_index do |stock_id, index|
  # Delay the i-th job by i seconds to stay under 1 request/second
  SomeWorker.perform_in(index.seconds, stock_id)
end
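The one-second spacing is the special case of a general rule: at a maximum rate of r requests per second, the i-th job can be delayed by i / r seconds. A small helper (the name staggered_delays is made up for this sketch):

```ruby
# Compute staggered delays (in seconds) so that `count` jobs respect a
# maximum rate of `rate` requests per second.
def staggered_delays(count, rate)
  (0...count).map { |i| i / rate.to_f }
end

staggered_delays(5, 1.0)  # => [0.0, 1.0, 2.0, 3.0, 4.0]
staggered_delays(4, 2.0)  # => [0.0, 0.5, 1.0, 1.5]
```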
111. Workable, but…
Timeline (seconds): job1 at t = 1, job2 at t = 2, job3 at t = 3, …, job2000 at t = 2000
If the target server becomes unreachable, the remaining delays elapse during
the outage, so job3~job2000 will still all execute at the same time
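The pile-up is easy to see numerically: once the delays elapse during an outage, every remaining job is due at the same instant. A quick plain-Ruby check, assuming jobs 1 and 2 ran before an outage starting at t = 3:

```ruby
# Jobs scheduled one second apart, as in Improvement 1.
scheduled_at = (1..2000).to_a

# Jobs 1 and 2 succeed, then the target server goes down at t = 3 and
# comes back at t = 2600. Every remaining job's delay has elapsed
# during the outage, so all of them are due at the same instant.
outage_start = 3
recovery     = 2600
piled_up = scheduled_at.count { |t| t >= outage_start && t <= recovery }
piled_up  # => 1998 jobs runnable simultaneously at recovery
```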
112. • Limit your worker threads so that a specific job
is performed at a bounded rate
• Sidekiq Enterprise provides two types of rate
limiting APIs
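Sidekiq Enterprise's limiters are a commercial API, so as a rough stand-in only, here is a minimal token-bucket rate limiter in plain Ruby (the class name, fake clock, and parameters are all invented for this sketch; the Enterprise API is different):

```ruby
# Minimal token-bucket rate limiter: allows `capacity` operations per
# `interval` seconds, refilling the bucket as time passes. The injectable
# clock makes the behavior deterministic for testing.
class TokenBucket
  def initialize(capacity, interval, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @capacity = capacity
    @refill_rate = capacity / interval.to_f  # tokens per second
    @tokens = capacity.to_f
    @clock = clock
    @last = clock.call
  end

  # Consume a token and return true if allowed; false when throttled.
  def allow?
    now = @clock.call
    @tokens = [@capacity, @tokens + (now - @last) * @refill_rate].min
    @last = now
    return false if @tokens < 1
    @tokens -= 1
    true
  end
end

# With a fake clock: 2 requests per second.
t = 0.0
bucket = TokenBucket.new(2, 1, clock: -> { t })
bucket.allow?  # => true
bucket.allow?  # => true
bucket.allow?  # => false (bucket empty)
t = 1.0        # one second later, the bucket has refilled
bucket.allow?  # => true
```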
116. You must fine-tune the limiter parameters
for each data source to get good performance
117. So far, you have already gained better performance.
However, the throttling policy of your target server
may not always be static:
many websites throttle dynamically.
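How a crawler decides it is being throttled is site-specific; a common heuristic (an assumption here, not from the talk) is to treat HTTP 429, HTTP 503, or a Retry-After header as throttling signals:

```ruby
# Decide whether a response indicates we are being throttled.
# `status` is the HTTP status code, `headers` a Hash of response headers.
def throttled?(status, headers = {})
  return true if [429, 503].include?(status)  # Too Many Requests / Unavailable
  headers.key?('Retry-After')                 # explicit back-off hint
end

throttled?(200)                         # => false
throttled?(429)                         # => true
throttled?(200, 'Retry-After' => '30')  # => true
```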
127. class SomeWorker
  include Sidekiq::Worker

  def perform
    # try to crawl something
    # ...
    if throttled   # set by the crawl attempt above
      # Pause our own queue, and schedule a resume in 30 seconds
      queue_name = self.class.get_sidekiq_options['queue']
      queue = Sidekiq::Queue.new(queue_name)
      queue.pause!
      ResumeJobQueueWorker.perform_in(30.seconds, queue_name)
    end
  end
end
129. (same SomeWorker as slide 127, paired with the resume worker below)
class ResumeJobQueueWorker
  include Sidekiq::Worker
  sidekiq_options queue: :queue_control, unique: :until_executed

  def perform(queue_name)
    queue = Sidekiq::Queue.new(queue_name)
    queue.unpause! if queue.paused?
  end
end
130. The queue for ResumeJobQueueWorker
MUST NOT be the paused queue itself:
a resume job sitting on a paused queue would never run.
That is why we use a dedicated queue (queue_control) for ResumeJobQueueWorker.
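Why the dedicated queue is mandatory can be shown with a toy queue model in plain Ruby (no Sidekiq involved; ToyBroker and its methods are invented for this sketch): a resume job placed on the paused queue never runs, so the queue stays paused forever.

```ruby
# Toy model of queues and pausing: paused queues are skipped by the
# worker loop, so a job on a paused queue never runs.
class ToyBroker
  def initialize
    @queues = Hash.new { |h, k| h[k] = [] }
    @paused = {}
  end

  def push(queue, job)
    @queues[queue] << job
  end

  def pause!(queue)
    @paused[queue] = true
  end

  def unpause!(queue)
    @paused.delete(queue)
  end

  def paused?(queue)
    @paused.key?(queue)
  end

  # One pass of the worker pool: run (and discard) every job on every
  # unpaused queue; a resume job unpauses its target queue.
  def drain
    @queues.each do |name, jobs|
      next if paused?(name)
      jobs.each { |job| unpause!(job[:resumes]) if job[:resumes] }
      jobs.clear
    end
  end
end

# Wrong: the resume job is pushed onto the very queue it should resume.
broker = ToyBroker.new
broker.pause!('crawler')
broker.push('crawler', resumes: 'crawler')
broker.drain
broker.paused?('crawler')  # => true -- stuck forever

# Right: the resume job lives on a dedicated, never-paused control queue.
broker.push('queue_control', resumes: 'crawler')
broker.drain
broker.paused?('crawler')  # => false
```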
145. • With Sidekiq (Enterprise) and a proper design, the following problems
are solved:
• Slow crawler
• Inefficient: unable to retry only the failed jobs
• Unpredictable server loading
• Scaling out is not easy
• Inherent problems of Unix cron
• Not easy to deal with bandwidth throttling issues