
Digdagによる大規模データ処理の自動化とエラー処理 (Automating Large-Scale Data Processing and Error Handling with Digdag)

Talk at Workflow Engines Night, Tokyo, Japan


  1. Digdagによる大規模データ処理の自動化とエラー処理 (Automating Large-Scale Data Processing and Error Handling with Digdag). Sadayuki Furuhashi, Workflow Engines Night
  2. Sadayuki Furuhashi. A founder of Treasure Data, Inc., located in Silicon Valley. An open-source hacker; founder of OSS projects such as MessagePack and Fluentd. GitHub: @frsyuki
  3. 3. What’s workload automation? • あらゆる手作業の自動化 > バッチデータ解析の自動化: • データロード - ETL - JOIN- 集計処理 - レポート生成 - 通知 > メール送信の自動化 • アドレス一覧の取得 - 対象の絞り込み - テンプレートから
 本文を生成 - メール送信 - 完了通知 > システム間のデータ連携の自動化 > サーバ・DB・ネットワーク機器の管理やプロビジョニング の自動化 > テスト・デプロイの自動化(CI)
  4. Required features.
     • Basics
       > Run tasks in dependency order
       > Periodic execution
       > Run when a file is created
       > Bulk execution of past periods (backfill)
       > Run with variables such as the current time
     • Error handling
       > Notify on failure
       > Resume from the point of failure
     • State monitoring
       > Notify when a run takes too long
       > Visualize task execution times
       > Collect and store execution logs
     • Performance
       > Run tasks in parallel
       > Limit the number of concurrent tasks
     • Development support
       > Version control of workflows
       > GUI-based workflow development
       > Libraries that make routine processing easy
       > Reproducibility between a laptop and the server (if it runs locally, it runs on the server)
       > Run tasks with Docker images
  5. Products. OSS: Makefile, Jenkins, Luigi, Airflow, Rundeck, Azkaban, Grid Engine, OpenLava, Obsidian Scheduler, Hinemos, StackStorm, Platform LSF. Proprietary: Tivoli Workload Scheduler (IBM), CA Workload Automation (CA Technologies), JP1/AJS3 (Hitachi), Systemwalker Job Workload Server (Fujitsu), Workload Automation (Automic), BatchMan (Honico), Control-M (BMC), Schedulix, ServiceNow Workflow.
  6. Challenge: multiple clouds & regions, plus on-premises. Different APIs, different tools, many scripts.
  7. Challenge: multiple DB technologies. Amazon S3, Amazon Redshift, Amazon EMR.
  8. Challenge: multiple DB technologies. Amazon S3, Amazon Redshift, Amazon EMR. > Hi! > I'm a new technology!
  9. Challenge: modern, complex data analytics.
     Ingest: application logs, user attribute data, ad impressions, 3rd-party cookie data.
     Enrich: removing bot access, geo location from IP address, parsing User-Agent, JOINing user attributes to event logs.
     Model: A/B testing, funnel analysis, segmentation analysis, machine learning.
     Load: creating indexes, data partitioning, data compression, statistics collection.
     Utilize: recommendation API, realtime ad bidding, visualization using BI applications.
  10. Traditional "false" solution:

      #!/bin/bash
      ./run_mysql_query.sh
      ./load_facebook_data.sh
      ./rsync_apache_logs.sh
      ./start_emr_cluster.sh
      for query in emr/*.sql; do
        ./run_emr_hive $query
      done
      ./shutdown_emr_cluster.sh
      ./run_redshift_queries.sh
      ./call_finish_notification.sh

      > Poor error handling
      > Write once, nobody reads
      > No alerts on failure
      > No alerts on runs that take too long
      > No retrying on errors
      > No resuming
      > No parallel execution
      > No distributed execution
      > No log collection
      > No visualized monitoring
      > No modularization
      > No parameterization
  11. Solution: a multi-cloud workflow engine. Solves: poor error handling; write once, nobody reads; no alerts on failure; no alerts on runs that take too long; no retrying on errors; no resuming; no parallel execution; no distributed execution; no log collection; no visualized monitoring; no modularization; no parameterization.
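      Two of these gaps, retrying and failure alerts, map directly onto Digdag's built-in _retry and _error parameters. A minimal sketch; the script and addresses are hypothetical:

      +load:
        sh>: scripts/load.sh          # hypothetical load script
        _retry: 3                     # re-run this task up to 3 times on failure
      _error:
        mail>: error.txt              # runs when the workflow ultimately fails (hypothetical template)
        subject: 'Workflow failed'
        to: ['oncall@example.com']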
  12. Example in our case: 1. Dump data to BigQuery. 2. Load all tables to Treasure Data. 3. Run queries. 4. Create reports on Tableau Server (on-premises). 5. Notify on Slack.
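      Sketched as a Digdag workflow, those five steps can look roughly like this; the operator choices, file names, and scripts are assumptions for illustration, not the production workflow:

      +dump_to_bigquery:
        bq>: queries/dump.sql                  # bq> runs a BigQuery query (hypothetical SQL file)
      +load_to_td:
        td_load>: config/bigquery_import.yml   # hypothetical bulk-load config
      +run_queries:
        td>: queries/summary.sql               # hypothetical Treasure Data query
        create_table: summary
      +publish_tableau:
        sh>: scripts/refresh_tableau.sh        # hypothetical script calling Tableau Server
      +notify:
        sh>: scripts/notify_slack.sh           # hypothetical Slack webhook script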
  13. Workflow constructs
  14. Key constructs. Operators: packaged knowledge to run tasks, e.g. pg>, s3>, gcs>, emr>, td>, py>, rb>. Parameters: programmable variables for operators, e.g. ${session_time}, ${workflow_name}, ${JSON.parse(http.last_content)}. Task groups: sequences of tasks that organize & modularize workflows.
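      Put together, the three constructs fit in one .dig file. A minimal sketch with hypothetical query and template files:

      +analysis:                                     # task group
        +aggregate:
          td>: queries/aggregate.sql                 # operator: run a Treasure Data query
          create_table: daily_summary
        +notify:
          mail>: notify.txt                          # operator: send an email (hypothetical template)
          subject: 'Finished at ${session_time}'     # parameter: built-in session variable
          to: ['me@example.com']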
  15. Operator library.

      _export:
        td:
          database: workflow_temp
      +task1:
        td>: queries/open.sql
        create_table: daily_open
      +task2:
        td>: queries/close.sql
        create_table: daily_close

      Standard libraries:
      redshift>: runs Amazon Redshift queries
      emr>: creates/shuts down a cluster & runs steps
      s3_wait>: waits until a file is put on S3
      pg>: runs PostgreSQL queries
      td>: runs Treasure Data queries
      td_for_each>: repeats a task for each result row
      mail>: sends an email

      Open-source libraries: you can release & use open-source operator libraries.
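      Open-source operator plugins load from Maven-style repositories via a plugin section under _export. A sketch modeled on the digdag-plugin-example project referenced on slide 22; the repository, coordinates, and version here are assumptions:

      _export:
        plugin:
          repositories:
            - https://jitpack.io                              # assumed plugin repository
          dependencies:
            - io.digdag.plugin:digdag-plugin-example:0.1.0    # assumed coordinates/version

      +hello:
        example>: Digdag                                      # operator provided by that plugin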
  16. Task grouping & parallel execution.

      +load_data:
        _parallel: true
        +load_users:
          redshift>: copy/users.sql
        +load_items:
          redshift>: copy/items.sql

      Parallel execution: tasks under the same group run in parallel if the _parallel option is set to true.
  17. Grouping workflows... (diagram: the Ingest / Enrich / Model / Load / Utilize stages as one flat list of +task entries)
  18. Grouping workflows (diagram: the same stages organized into nested groups such as +ingest, +enrich, +model with +basket_analysis and +learn inside, and +load)
  19. Parameters & loops.

      +send_email_to_active_users:
        td_for_each>: list_active.sql
        _do:
          +send:
            email>: template.txt
            to: ${td.for_each.addr}

      Parameters: a task can propagate parameters to the tasks that follow it.
      Loops: generate subtasks dynamically so that Digdag applies the same set of operators to different data sets.
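      Parameter propagation in practice: with store_last_results, the first result row of a query becomes available to later tasks. A sketch assuming a hypothetical count.sql that returns a column named cnt:

      +count_rows:
        td>: queries/count.sql          # hypothetical query returning a column `cnt`
        store_last_results: true        # exposes the row as ${td.last_results.*}
      +report:
        mail>: report.txt               # hypothetical template
        subject: 'Row count: ${td.last_results.cnt}'
        to: ['team@example.com']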
  20. Unite engineering & analytics teams.

      +wait_for_arrival:
        s3_wait>: bucket/www_${session_date}.csv
      +load_table:
        redshift>: scripts/copy.sql

      Powerful for engineers: comfortable for advanced users. Friendly for analysts: still straightforward for analysts to understand & leverage workflows.
  21. Pushing workflows to a server, with a Docker image.

      schedule:
        daily>: 01:30:00
      timezone: Asia/Tokyo
      _export:
        docker:
          image: my_image:latest
      +task:
        sh>: ./run_in_docker

      Digdag server: develop on a laptop, push it to a server; workflows run periodically on the server; backfill; web editor & monitor. Docker: install scripts & dependencies in a Docker image, not on a server, so workflows can run anywhere, including a developer's laptop.
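      The client-side loop looks roughly like this; project and workflow names are hypothetical, and the backfill flags are sketched from memory, so check `digdag backfill --help` for the exact form:

      $ digdag push my_project                                    # upload the project to the server
      $ digdag start my_project my_workflow --session now         # run it once, immediately
      $ digdag backfill my_project my_workflow --from 2017-01-01  # run past scheduled sessions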
  22. Amazon ECR, Dockerfile & operator plugin template:
      • https://github.com/myui/dockernized-digdag-server
      • https://github.com/myui/digdag-plugin-example

      $ docker pull myui/digdag-server:latest
      $ docker run -p 65432:65432 myui/digdag-server
      $ open http://localhost:65432/
  23. Demo
  24. Real-world workflows
  25. Digdag at Treasure Data: 3,600 workflows run every day; 28,000 tasks run every day; 850 active workflows; 400,000 workflow executions in total.
  26. Example: customer analysis & alerting.

      timezone: UTC
      schedule:
        daily>: 09:00
      _export:
        mail:
          from: 'bizops@example.com'
        td:
          database: summary
      +reports:
        td_run>: prepare_users_data
      +for_each_users:
        td_for_each>: inactive_users.sql
        _do:
          +alert_email:
            mail>: mail.txt
            subject: 'Inactive Alert: ${td.each.account_name}'
            to: ['${td.each.owner_email}']
  27. Example: customer analysis & alerting (the same workflow, with the mail.txt template):

      Usage: ${td.each.percentage}%
      Account Name: ${td.each.account_name}
      Type: Purchase
      ${td.each.salesforce_link}
      Region: ${td.each.region}
      Owner: ${td.each.owner_name} (${td.each.owner_email})
      Account: ${td.each.account_name}
      Status: ${td.each.activity_status}
      Actual: ${td.each.total_purchase}
      Limit: ${td.each.monthly_purchase_limit}
  28. Example: backend of a BI app.

      timezone: <%= ev @timezone %>
      <% if @schedule then %>
      schedule: <%= ev @schedule %>
      <% end %>
      _export:
        td:
          database: <%= ev @database %>
        all_mode: ${(moment(session_time).dayOfYear() - 1) % 3 == 0}
      +all_load:
        if>: ${all_mode == "true"}
        _do:
          +create_all_records:
            td>: segment_web_access.sql
            create_table: "cdp_tmp_web_access"
            _retry: 5
          +rename_tmp_table:
            td_ddl>:
            rename_tables:
              - from: "cdp_tmp_web_access"
                to: "cdp_web_access"
            _retry: 5
          +get_all_count:
            td>: incremental_count.sql
            table_name: "cdp_web_access"
            store_last_results: true
            _retry: 5
          +syndicate_loop:
            loop>: ${Math.ceil(td.last_results.total_count / 20000)}
            _do:
              td>: incremental_select.sql
              table_name: "cdp_web_access"
              result_connection: cdp_web_access
              result_settings:
                id: 1
              _retry: 5
  29. Example: moving a Spark app to production.

      _export:
        td:
          database: digdag_demo_${session_date_compact}
      +setup:
        td_ddl>:
        create_databases: ["${td.database}"]
      +ingestion:
        _parallel: true
        +items_from_access_logs:
          +wait_for_arrival:
            s3_wait>: digdag-demo-bucket/www_login_${session_date_compact}.csv
          +load_logs:
            td_load>: s3_import_1479918530
        +facebook_ads:
          td_load>: facebook_ads_reporting_import_1479843958
        +items_from_aurora:
          td_load>: mysql_import_1479918544
      +enrichment:
        _parallel: 5
        +ip_location_to_user:    # ip_location, user
          td>: queries/ip_location_to_user.sql
          create_table: ip_location_to_user
        +item_to_click_count:    # item, click_count
          td>: queries/item_to_click_count.sql
          create_table: item_to_click_count
        +item_to_item_count:     # item_1, item_2, count
          td>: queries/item_to_item_count.sql
          create_table: item_to_item_count
      +modeling:
        emr>:
        cluster: j-OD82XANWFYQ8
        staging: s3://digdag-demo-data/emr/staging/
        steps:
          - type: spark
            application: spark/target/scala-2.11/simple-td-spark-project_2.11-1.0.jar
            submit_options: ["--class", "ItemRecommends"]
            jars: [td-spark-assembly-0.1.jar]
          - type: spark
            application: spark/target/scala-2.11/simple-td-spark-project_2.11-1.0.jar
            submit_options: ["--class", "LocationRecommends"]
            jars: [td-spark-assembly-0.1.jar]
      +loading:
        _parallel: true
        +load_location_recommends:
          redshift>: copy/copy_location_recommends.sql
        +load_item_recommends:
          redshift>: copy/copy_item_recommends.sql
  30. Deployment & fault tolerance
  31. HA deployment of Digdag: it's just like a web application. (diagram: a Digdag client and the visual UI talk to a Digdag server (API & scheduler & executor), which keeps all task state in PostgreSQL)
  32. HA deployment of Digdag: stateless servers + a replicated DB. (diagram: the client and visual UI reach two Digdag servers through an HTTP load balancer; both servers (API & scheduler & executor) share a replicated PostgreSQL (HA) that holds all task state)
  33. HA deployment of Digdag: isolating the API from execution for reliability. (diagram: API-only servers behind the HTTP load balancer and separate scheduler & executor servers, all sharing the replicated PostgreSQL (HA) that holds all task state)

      $ digdag server --disable-local-agent --disable-executor-loop   # API-only server
      $ digdag server --max-task-threads 100                          # scheduler & executor server
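      Both roles point at the same PostgreSQL, typically via a server config file. A minimal sketch; the host and credentials below are hypothetical:

      # server.properties (values are hypothetical)
      database.type = postgresql
      database.host = db.example.com
      database.port = 5432
      database.user = digdag
      database.password = secret
      database.database = digdag

      $ digdag server --config server.properties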
  34. Single-server task logs: a server writes logs to a local disk, and the same server serves the logs. (diagram: Digdag client, HTTP load balancer, one Digdag server with PostgreSQL and local disks)

      $ digdag server --task-log <dir>
      $ digdag log <attempt-id> -f
  35. Centralized task log storage: a server uploads logs to AWS S3 and pre-signs the download URL, so the client downloads logs directly from S3. (diagram: Digdag client, HTTP load balancer, two Digdag servers, PostgreSQL, AWS S3)

      log-server.type = s3
      log-server.s3.bucket = my-digdag-log-bucket
      log-server.s3.path = logs/

      $ digdag log <attempt-id> -f
  36. Sadayuki Furuhashi. Visit my website! https://digdag.io
