WEB CRAWL
with Elixir
Who am I?
• Jechol Lee (mr.jechol@gmail.com)
• Software engineer at Skelterlabs
• Loves Elixir, Elm, Ruby
• We are HIRING!
First try
[Diagram: a JSON crawler and an HTML crawler fetch Site A over HTTP and push products into a SaveQueue, which saves them to the Database (SQL and ES).]
[Diagram: supervision tree with a top-level Sup, one Sup per site (site A, site B), a simple one-for-one Sup, and Task.Sup processes running Page and Item tasks; labels: req, product, progress, start_link.]
Crawler speed > Site rate limit
[Diagram: the Crawler sends HTTP requests to the Website at 10/s and gets BLOCKED.]
DB speed < Crawler speed
[Diagram: the Crawler pushes 200/s into the SaveQueue, but the Database saves only 100/s.]
Memory exhaustion
[Diagram: the Crawler pushes 200/s while the Database saves 100/s, so the SaveQueue grows until it runs Out of Memory.]
Solution
OOM Problem : Producer driven
[Diagram: the Crawler sets the pace at 200/s; the SaveQueue and the Database (100/s) have to absorb whatever arrives.]
GenStage : Demand-driven
[Diagram: the SaveQueue sends DEMAND 100 to the Crawler, and the Crawler answers with 100 products.]
Elixir GenStage (July 2016)
[Diagram: the Database saves 100/s, so the SaveQueue demands 100 at a time and the Crawler sends 100.]
Memory usage
[Chart: memory usage, producer-driven compared with demand-driven.]
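The demand-driven flow can be sketched with GenStage roughly as follows. This is a minimal sketch, not the talk's actual code: the module names `Sketch.Crawler` and `Sketch.SaveQueue` are hypothetical, the producer emits fake products instead of crawling, and the `gen_stage` package is assumed to be installable through `Mix.install`.

```elixir
# Assumption: the gen_stage Hex package can be fetched; Mix.install needs Elixir 1.12+.
Mix.install([{:gen_stage, "~> 1.2"}])

defmodule Sketch.Crawler do
  # Hypothetical producer: emits fake products, and only when asked.
  use GenStage

  def start_link(_opts \\ []), do: GenStage.start_link(__MODULE__, 0)

  @impl true
  def init(counter), do: {:producer, counter}

  @impl true
  def handle_demand(demand, counter) when demand > 0 do
    # Produce exactly `demand` events, never more.
    products = Enum.map(counter..(counter + demand - 1), &{:product, &1})
    {:noreply, products, counter + demand}
  end
end

defmodule Sketch.SaveQueue do
  # Hypothetical consumer: stands in for the SaveQueue in front of the database.
  use GenStage

  def start_link(report_to), do: GenStage.start_link(__MODULE__, report_to)

  @impl true
  def init(report_to), do: {:consumer, report_to}

  @impl true
  def handle_events(products, _from, report_to) do
    # A real consumer would write to the database here.
    send(report_to, {:saved, length(products)})
    {:noreply, [], report_to}
  end
end

{:ok, crawler} = Sketch.Crawler.start_link()
{:ok, save_queue} = Sketch.SaveQueue.start_link(self())

# max_demand caps how many products may be in flight to the consumer.
{:ok, _subscription} =
  GenStage.sync_subscribe(save_queue, to: crawler, max_demand: 100, min_demand: 50)
```

Because the consumer asks for more only after it has handled what it already received, the number of unsaved products stays bounded by `max_demand` regardless of how fast the producer could run.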
Site rate limit : TokenBucket
[Diagram: the Crawler's HTTP requests to the Website pass through a TokenBucket refilled at 60/min, which still allows short bursts.]
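A token bucket of this kind can be sketched as a small pure data structure. This is an illustration under stated assumptions, not the talk's implementation: the module name `Sketch.TokenBucket`, its function names, and the choice to pass the current time in milliseconds explicitly are all hypothetical.

```elixir
defmodule Sketch.TokenBucket do
  # Hypothetical token bucket: `capacity` bounds a burst,
  # `interval_ms` is the time needed to earn one token back.
  defstruct [:capacity, :tokens, :interval_ms, :last_ms]

  # per_minute = 60 gives one token per second.
  def new(capacity, per_minute, now_ms) do
    %__MODULE__{
      capacity: capacity,
      tokens: capacity,
      interval_ms: div(60_000, per_minute),
      last_ms: now_ms
    }
  end

  # Returns {:ok, bucket} when a request may go out now, {:wait, bucket} otherwise.
  def take(%__MODULE__{} = bucket, now_ms) do
    bucket = refill(bucket, now_ms)

    if bucket.tokens >= 1 do
      {:ok, %{bucket | tokens: bucket.tokens - 1}}
    else
      {:wait, bucket}
    end
  end

  defp refill(bucket, now_ms) do
    # Whole tokens earned since the last refill, capped at capacity.
    earned = div(max(now_ms - bucket.last_ms, 0), bucket.interval_ms)

    %{
      bucket
      | tokens: min(bucket.capacity, bucket.tokens + earned),
        last_ms: bucket.last_ms + earned * bucket.interval_ms
    }
  end
end

# Burst of 2, then 60/min: the third immediate request has to wait.
bucket = Sketch.TokenBucket.new(2, 60, 0)
{:ok, bucket} = Sketch.TokenBucket.take(bucket, 0)
{:ok, bucket} = Sketch.TokenBucket.take(bucket, 0)
{:wait, bucket} = Sketch.TokenBucket.take(bucket, 0)
# One second later a single token has been earned back.
{:ok, _bucket} = Sketch.TokenBucket.take(bucket, 1_000)
```

In a running crawler the bucket would typically live inside a process that all requests go through, with `System.monotonic_time(:millisecond)` supplying `now_ms`.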
Network Requests Overflow
[Diagram, four steps: the request queue starts with Page 1; every fetched page adds its item requests plus the next page (Page 2, Page 3), so the queue of pending network requests keeps growing.]
Can't depend on random processing order.
[Diagram: the queue holds Page 4 mixed in with items from Pages 1, 2, and 3.]
Priority Queue
[Diagram: a priority queue holding Page 3 and the items of Pages 1 and 2.]
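One way to get a deterministic order is a priority queue keyed on request type. The sketch below is an assumption about the policy, not something the slides spell out: it serves item requests before page requests, and the module name `Sketch.PriorityQueue` is hypothetical. It relies only on Erlang's `:gb_sets`.

```elixir
defmodule Sketch.PriorityQueue do
  # Hypothetical request queue: a lower priority number is served first;
  # the sequence number keeps insertion order among equal priorities.
  def new, do: {:gb_sets.new(), 0}

  def push({set, seq}, priority, request) do
    {:gb_sets.add({priority, seq, request}, set), seq + 1}
  end

  def pop({set, seq}) do
    if :gb_sets.is_empty(set) do
      :empty
    else
      {{_priority, _seq, request}, rest} = :gb_sets.take_smallest(set)
      {request, {rest, seq}}
    end
  end
end

# Assumed policy: item requests (0) go before page requests (1), so the
# items of fetched pages are drained before the next listing page is requested.
queue =
  Sketch.PriorityQueue.new()
  |> Sketch.PriorityQueue.push(1, {:page, 2})
  |> Sketch.PriorityQueue.push(0, {:item, 1, "a"})
  |> Sketch.PriorityQueue.push(0, {:item, 1, "b"})

{first, queue} = Sketch.PriorityQueue.pop(queue)
{second, queue} = Sketch.PriorityQueue.pop(queue)
{third, queue} = Sketch.PriorityQueue.pop(queue)
:empty = Sketch.PriorityQueue.pop(queue)
```

With this policy the page request comes out last even though it was pushed first, which keeps the number of pending item requests from piling up behind new pages.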
Revised Architecture
Existing Architecture
[Diagram: an HTML crawler (C21) and a JSON crawler (Lego) push into the SaveQueue, which writes to SQL and ES.]
Demand-driven
[Diagram: DEMAND flows from the SaveQueue to the HTML crawler (C21) and the JSON crawler (Lego); PRODUCT events flow back to the SaveQueue, which writes to SQL and ES.]
Rate limit by TokenBucket
[Diagram: the HTML crawler (C21) and the JSON crawler (Lego) send HTTP requests through a WebProxy (PriorityQueue + TokenBucket) to twotap.com and c21stores.com; C21 HTML Parser processes are SPAWNed for the responses.]
Error monitoring
[Diagram: the ErrorMonitor MONITORs the HTML crawler (C21), receives messages such as {:DOWN, :page_not_found}, and reports to sentry.io.]
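The monitoring idea can be sketched with `Process.monitor/1`. The slide's `{:DOWN, :page_not_found}` is shorthand; the message a monitoring process actually receives is `{:DOWN, ref, :process, pid, reason}`. In this sketch the module name `Sketch.ErrorMonitor` and the `report_fun` callback are hypothetical, and the callback merely stands in for a call to an error-tracking service.

```elixir
defmodule Sketch.ErrorMonitor do
  # Hypothetical monitor: watches crawler processes and hands crash reasons
  # to `report_fun`, a stand-in for an error-tracking client.
  use GenServer

  def start_link(report_fun), do: GenServer.start_link(__MODULE__, report_fun)

  def watch(monitor, pid), do: GenServer.call(monitor, {:watch, pid})

  @impl true
  def init(report_fun), do: {:ok, report_fun}

  @impl true
  def handle_call({:watch, pid}, _from, report_fun) do
    Process.monitor(pid)
    {:reply, :ok, report_fun}
  end

  @impl true
  def handle_info({:DOWN, _ref, :process, pid, reason}, report_fun) do
    # Normal exits are not errors; everything else is reported.
    if reason != :normal, do: report_fun.(pid, reason)
    {:noreply, report_fun}
  end
end

parent = self()

{:ok, monitor} =
  Sketch.ErrorMonitor.start_link(fn pid, reason ->
    send(parent, {:reported, pid, reason})
  end)

# A stand-in crawler that fails with the reason shown on the slide.
crawler =
  spawn(fn ->
    receive do
      :crawl -> exit(:page_not_found)
    end
  end)

:ok = Sketch.ErrorMonitor.watch(monitor, crawler)
send(crawler, :crawl)
```

A monitor, unlike a link, is one-directional: the crawler's crash is observed and reported, but the ErrorMonitor itself keeps running.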
Fault-tolerance by SupervisionTree
[Diagram: supervision tree containing a root Supervisor, a C21 Supervisor, a Task Supervisor for the C21 HTML Parser tasks, the ErrorMonitor, the WebProxy (PriorityQueue + TokenBucket), the SaveQueue, the HTML crawler (C21), and the JSON crawler (Lego).]
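A tree of this shape can be sketched with the standard `Supervisor` and `Task.Supervisor` modules. Everything named `Sketch.*` below is hypothetical, the workers are empty stand-ins for the real crawler and queue processes, and the exact parent/child layout is an assumption about the diagram.

```elixir
defmodule Sketch.Worker do
  # Hypothetical stand-in for a long-running process such as a crawler or the SaveQueue.
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, name, name: name)

  @impl true
  def init(name), do: {:ok, name}
end

defmodule Sketch.SiteSupervisor do
  # One supervisor per site: the crawler plus a Task.Supervisor for its parser tasks.
  use Supervisor

  def start_link(site), do: Supervisor.start_link(__MODULE__, site)

  @impl true
  def init(site) do
    children = [
      {Task.Supervisor, name: Module.concat(site, ParserTasks)},
      Supervisor.child_spec({Sketch.Worker, Module.concat(site, Crawler)}, id: :crawler)
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end

children = [
  Supervisor.child_spec({Sketch.Worker, Sketch.ErrorMonitorStub}, id: :error_monitor),
  Supervisor.child_spec({Sketch.Worker, Sketch.SaveQueueStub}, id: :save_queue),
  Supervisor.child_spec({Sketch.SiteSupervisor, Sketch.C21}, id: :c21)
]

{:ok, root} = Supervisor.start_link(children, strategy: :one_for_one)

# Kill the C21 crawler; its supervisor starts a fresh one under the same name.
old_crawler = Process.whereis(Sketch.C21.Crawler)
Process.exit(old_crawler, :kill)
Process.sleep(100)
new_crawler = Process.whereis(Sketch.C21.Crawler)
```

With `:one_for_one`, a crash in one site's crawler restarts only that crawler; the SaveQueue stand-in and the other children are left alone.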
Tree for Multiple Crawlers
[Diagram: the same supervision tree extended with a JSON crawler for GNC and a JCPenney Supervisor, next to the C21 Supervisor (HTML crawler, Task Supervisor, HTML Parser tasks), the JSON crawler for Lego, the ErrorMonitor, the WebProxy (PriorityQueue + TokenBucket), and the SaveQueue.]
Final
[Diagram: the complete system. A root Supervisor oversees the site crawlers (C21, Lego, GNC, JCPenney), the ErrorMonitor (MONITOR, {:DOWN, :page_not_found}, reporting to sentry.io), the WebProxy (PriorityQueue + TokenBucket) that sends HTTP to twotap.com and c21stores.com, the SPAWNed C21 HTML Parser tasks, and the SaveQueue, which exchanges DEMAND and PRODUCT with the crawlers and writes to SQL and ES.]
Building Blocks
GenStage
Task
Task.Supervisor
GenServer
Supervisor
Agent
GenServer vs Task
• Tasks don't provide services.
  → No handle_call, etc.
• They just run a function and exit.
Task.Supervisor.async
• Does not trap exits.
• The task is linked to the caller, so if the task crashes the caller dies with it.
• Does not restart the task.
Task.async
vs
Task.Supervisor.async
Only the latter builds a supervision relationship,
so the task is visible in observer.
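The difference can be seen in a short script. This is a minimal sketch using only the standard library; the variable names are illustrative.

```elixir
{:ok, sup} = Task.Supervisor.start_link()

# Both calls link the task to the caller and are awaited the same way.
plain = Task.async(fn -> 1 + 1 end)
supervised = Task.Supervisor.async(sup, fn -> 2 + 2 end)

plain_result = Task.await(plain)
supervised_result = Task.await(supervised)

# Only a task started through the Task.Supervisor is listed as its child.
gate =
  Task.Supervisor.async(sup, fn ->
    receive do
      :finish -> :done
    end
  end)

children_while_running = Task.Supervisor.children(sup)
send(gate.pid, :finish)
gate_result = Task.await(gate)
```

While the `gate` task is waiting, its pid shows up in `Task.Supervisor.children/1`; a task started with plain `Task.async/1` never appears under any supervisor.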
End