Automating Large-Scale Data Processing and Error Handling with Digdag
Sadayuki Furuhashi
Workflow Engines Night
Sadayuki Furuhashi
A founder of Treasure Data, Inc., located in Silicon Valley.
An open-source hacker; founder of several OSS projects.
GitHub: @frsyuki
What’s workload automation?
• Automating every kind of manual work
> Batch data analytics:
• data load - ETL - JOIN - aggregation - report generation - notification
> Sending email:
• fetch the address list - narrow down the targets - generate the body from a template - send the mail - notify completion
> Data exchange between systems
> Management and provisioning of servers, databases, and network devices
> Testing & deployment (CI)
Required features
• Basics
> Run tasks in dependency order
> Scheduled execution
> Run when a file is created
> Bulk execution of past periods (backfill)
> Run with variables such as the current time
• Error handling
> Notify on failure
> Resume from the point of failure
• Monitoring
> Notify when execution takes too long
> Visualize task execution times
> Collect and store execution logs
• Speed
> Run tasks in parallel
> Limit the number of concurrent tasks
• Development support
> Version control of workflows
> GUI-based workflow development
> Libraries that make routine operations easy
> Reproducibility between a laptop and the server (if it runs locally, it runs on the server)
> Run tasks in Docker images
Products
OSS
• Makefile
• Jenkins
• Luigi
• Airflow
• Rundeck
• Azkaban
• Grid Engine
• OpenLava
• Obsidian Scheduler
• Hinemos
• StackStorm
• Platform LSF
Proprietary
• Tivoli Workload Scheduler (IBM)
• CA Workload Automation (CA Technologies)
• JP1/AJS3 (Hitachi)
• Systemwalker Job Workload Server (Fujitsu)
• Workload Automation (Automic)
• BatchMan (Honico)
• Control-M (BMC)
• Schedulix
• ServiceNow Workflow
Challenge: Multiple Clouds & Regions
[Diagram: multiple cloud providers plus on-premises environments]
Different APIs, different tools, many scripts.
Challenge: Multiple DB technologies
[Diagram: Amazon S3, Amazon Redshift, Amazon EMR]
...and a new one keeps arriving: "Hi! I'm a new technology!"
Challenge: Modern complex data analytics
Ingest: application logs, user attribute data, ad impressions, 3rd-party cookie data
Enrich: removing bot access, geo location from IP address, parsing User-Agent, JOINing user attributes to event logs
Model: A/B testing, funnel analysis, segmentation analysis, machine learning
Load: creating indexes, data partitioning, data compression, statistics collection
Utilize: recommendation API, realtime ad bidding, visualization using BI applications
Pipeline: Ingest → Enrich → Model → Load → Utilize
Traditional "false" solution
#!/bin/bash
./run_mysql_query.sh
./load_facebook_data.sh
./rsync_apache_logs.sh
./start_emr_cluster.sh
for query in emr/*.sql; do
  ./run_emr_hive $query
done
./shutdown_emr_cluster.sh
./run_redshift_queries.sh
./call_finish_notification.sh
> Poor error handling
> Write once, nobody reads
> No alerts on failure
> No alerts on too long run
> No retrying on errors
> No resuming
> No parallel execution
> No distributed execution
> No log collection
> No visualized monitoring
> No modularization
> No parameterization
Solution: Multi-Cloud Workflow Engine
Solves every problem on the previous list: poor error handling, write-once-nobody-reads scripts, missing alerts on failure and on long-running jobs, no retries, no resuming, no parallel or distributed execution, no log collection, no visual monitoring, no modularization, and no parameterization.
Example in our case
1. Dump data to BigQuery
2. Load all tables to Treasure Data
3. Run queries
4. Create reports on Tableau Server (on-premises)
5. Notify on Slack
Workflow constructs
Key constructs
Operators
> Packaged knowledge to run tasks.
> e.g. pg>, s3>, gcs>, emr>, td>, py>, rb>
Parameters
> Programmable variables for operators.
> e.g. ${session_time}, ${workflow_name}, ${JSON.parse(http.last_content)}
Task groups
> Sequence of tasks to organize & modularize workflows.
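A minimal sketch that combines the three constructs (the query file, mail template, and address below are hypothetical):

+daily:                                # task group
  +aggregate:
    td>: queries/daily.sql             # operator
    create_table: daily_summary
  +notify:
    mail>: body.txt                    # operator
    subject: 'Done: ${session_date}'   # parameter
    to: ['analyst@example.com']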
Operator library
_export:
  td:
    database: workflow_temp

+task1:
  td>: queries/open.sql
  create_table: daily_open

+task2:
  td>: queries/close.sql
  create_table: daily_close
Standard libraries
redshift>: runs Amazon Redshift queries
emr>: creates/shuts down a cluster & runs steps
s3_wait>: waits until a file is put on S3
pg>: runs PostgreSQL queries
td>: runs Treasure Data queries
td_for_each>: repeats task for result rows
mail>: sends an email
Open-source libraries
You can release & use open-source operator libraries.
Task grouping & parallel execution
+load_data:
  _parallel: true

  +load_users:
    redshift>: copy/users.sql

  +load_items:
    redshift>: copy/items.sql
Parallel execution
Tasks under the same group run in parallel if the _parallel option is set to true.
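The same grouping also covers the "limit the number of concurrent tasks" requirement: newer Digdag versions accept a limit form of _parallel. A sketch, reusing the tasks above plus a hypothetical third copy task:

+load_data:
  _parallel:
    limit: 2                   # at most 2 subtasks run at the same time

  +load_users:
    redshift>: copy/users.sql

  +load_items:
    redshift>: copy/items.sql

  +load_orders:
    redshift>: copy/orders.sql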
Grouping workflows...
[Diagram: the Ingest → Enrich → Model → Load → Utilize pipeline as a flat sequence of +task nodes, before grouping]
Grouping workflows
[Diagram: the same pipeline organized into task groups: +ingest, +enrich, +model (containing +basket_analysis and +learn), and +load, each made of +task subtasks]
Parameters & Loops
+send_email_to_active_users:
  td_for_each>: list_active.sql
  _do:
    +send:
      mail>: template.txt
      to: ${td.for_each.addr}
Parameter
A task can propagate parameters to the tasks that follow it.
Loop
Subtasks are generated dynamically, so Digdag applies the same set of operators to different data sets.
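Parameter propagation in a sketch (count_users.sql, the cnt column, and the addresses are hypothetical): a td> task can store its query result so that following tasks can reference it, using the same store_last_results option that appears in the BI example later.

+count:
  td>: count_users.sql
  store_last_results: true     # exposes ${td.last_results.*} to following tasks

+notify:
  mail>: body.txt
  subject: 'Total users: ${td.last_results.cnt}'
  to: ['team@example.com']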
Unite Engineering & Analytic Teams
+wait_for_arrival:
  s3_wait>: |
    bucket/www_${session_date}.csv

+load_table:
  redshift>: scripts/copy.sql
Powerful for Engineers
> Comfortable for advanced users
Friendly for Analysts
> Still straightforward for analysts to understand & leverage workflows
Pushing workflows to a server with Docker image
schedule:
  daily>: 01:30:00

timezone: Asia/Tokyo

_export:
  docker:
    image: my_image:latest

+task:
  sh>: ./run_in_docker
Digdag server
> Develop on a laptop, push it to a server.
> Workflows run periodically on the server.
> Backfill
> Web editor & monitor
Docker
> Install scripts & dependencies in a Docker image, not on a server.
> Workflows can run anywhere, including a developer's laptop.
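Pushing a project to the server is one command; a sketch with a hypothetical project name and revision:

$ digdag push my_project -r v1

Each push registers a new revision on the server, which also covers the workflow version-control requirement above.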
Amazon ECR Dockerfile & Operator plugin template
• https://github.com/myui/dockernized-digdag-server
• https://github.com/myui/digdag-plugin-example
$ docker pull myui/digdag-server:latest
$ docker run -p 65432:65432 myui/digdag-server
open http://localhost:65432/
Demo
Real-world workflows
Digdag at Treasure Data
> 3,600 workflows run every day
> 28,000 tasks run every day
> 850 active workflows
> 400,000 workflow executions in total
Example: Customer analysis & alerting
timezone: UTC

schedule:
  daily>: 09:00

_export:
  mail:
    from: 'bizops@example.com'
  td:
    database: summary

+reports:
  td_run>: prepare_users_data

+for_each_users:
  td_for_each>: inactive_users.sql
  _do:
    +alert_email:
      mail>: mail.txt
      subject: 'Inactive Alert: ${td.each.account_name}'
      to: ['${td.each.owner_email}']
The mail.txt template referenced above:

Usage: ${td.each.percentage}%
Account Name: ${td.each.account_name}
Type: Purchase
${td.each.salesforce_link}
Region: ${td.each.region}
Owner: ${td.each.owner_name} (${td.each.owner_email})
Account: ${td.each.account_name}
Status: ${td.each.activity_status}
Actual: ${td.each.total_purchase}
Limit: ${td.each.monthly_purchase_limit}
Example: Backend of a BI app
timezone: <%= ev @timezone %>
<% if @schedule then %>
schedule: <%= ev @schedule %>
<% end %>

_export:
  td:
    database: <%= ev @database %>
  all_mode: ${(moment(session_time).dayOfYear() - 1) % 3 == 0}

+all_load:
  if>: ${all_mode == "true"}
  _do:
    +create_all_records:
      td>: segment_web_access.sql
      create_table: "cdp_tmp_web_access"
      _retry: 5

    +rename_tmp_table:
      td_ddl>:
        rename_tables:
          - from: "cdp_tmp_web_access"
            to: "cdp_web_access"
      _retry: 5

    +get_all_count:
      td>: incremental_count.sql
      table_name: "cdp_web_access"
      store_last_results: true
      _retry: 5

+syndicate_loop:
  loop>: ${Math.ceil(td.last_results.total_count / 20000)}
  _do:
    td>: incremental_select.sql
    table_name: "cdp_web_access"
    result_connection: cdp_web_access
    result_settings:
      id: 1
    _retry: 5
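The _retry: 5 lines above are Digdag's per-task retry directive; pairing it with a workflow-level _error task gives failure alerting. A minimal sketch (the query, mail template, and address are hypothetical):

+flaky_load:
  td>: queries/load.sql
  _retry: 3                    # retry this task up to 3 times before failing

_error:                        # runs once if the workflow fails
  +alert:
    mail>: error_body.txt
    subject: 'Workflow failed at ${session_time}'
    to: ['oncall@example.com']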
Example: Moving Spark app to production
_export:
  td:
    database: digdag_demo_${session_date_compact}

+setup:
  td_ddl>:
    create_databases: ["${td.database}"]

+ingestion:
  _parallel: true

  +items_from_access_logs:
    +wait_for_arrival:
      s3_wait>: digdag-demo-bucket/www_login_${session_date_compact}.csv
    +load_logs:
      td_load>: s3_import_1479918530

  +facebook_ads:
    td_load>: facebook_ads_reporting_import_1479843958

  +items_from_aurora:
    td_load>: mysql_import_1479918544

+enrichment:
  _parallel: 5

  +ip_location_to_user:
    # ip_location, user
    td>: queries/ip_location_to_user.sql
    create_table: ip_location_to_user

  +item_to_click_count:
    # item, click_count
    td>: queries/item_to_click_count.sql
    create_table: item_to_click_count

  +item_to_item_count:
    # item_1, item_2, count
    td>: queries/item_to_item_count.sql
    create_table: item_to_item_count

+modeling:
  emr>:
    cluster: j-OD82XANWFYQ8
    staging: s3://digdag-demo-data/emr/staging/
    steps:
      - type: spark
        application: spark/target/scala-2.11/simple-td-spark-project_2.11-1.0.jar
        submit_options: ["--class", "ItemRecommends"]
        jars: [td-spark-assembly-0.1.jar]
      - type: spark
        application: spark/target/scala-2.11/simple-td-spark-project_2.11-1.0.jar
        submit_options: ["--class", "LocationRecommends"]
        jars: [td-spark-assembly-0.1.jar]

+loading:
  _parallel: true

  +load_location_recommends:
    redshift>: copy/copy_location_recommends.sql

  +load_item_recommends:
    redshift>: copy/copy_item_recommends.sql
Deployment & Fault tolerance
HA deployment of Digdag
It's just like a web application.
[Diagram: Digdag client → Digdag server (API & scheduler & executor, visual UI) → PostgreSQL (all task state)]
HA deployment of Digdag
Stateless servers + replicated DB.
[Diagram: Digdag client → HTTP load balancer → multiple Digdag servers (API & scheduler & executor, visual UI) → PostgreSQL, replicated for HA (all task state)]
HA deployment of Digdag
Isolating API and execution for reliability.
[Diagram: Digdag client → HTTP load balancer → API-only Digdag servers, plus separate scheduler & executor Digdag servers, all sharing PostgreSQL replicated for HA (all task state)]

API-only servers disable local execution:
$ digdag server --disable-local-agent --disable-executor-loop

Executor servers raise the task thread count:
$ digdag server --max-task-threads 100
Single-server task logs
[Diagram: Digdag client → HTTP load balancer → Digdag server with local disks, backed by PostgreSQL]
A server writes logs to a local disk, and the same server serves the logs.

$ digdag server --task-log <dir>
$ digdag log <attempt-id> -f
Centralized task log storage
[Diagram: Digdag client → HTTP load balancer → Digdag servers, backed by PostgreSQL, with task logs in AWS S3]
A server uploads logs to S3 and pre-signs the download URL; the client downloads logs directly from S3.

log-server.type = s3
log-server.s3.bucket = my-digdag-log-bucket
log-server.s3.path = logs/

$ digdag log <attempt-id> -f
Sadayuki Furuhashi
https://digdag.io
Visit my website!
