Automating Large-Scale Data Processing and Handling Errors with Digdag
Sadayuki Furuhashi
Workflow Engines Night
Sadayuki Furuhashi
A founder of Treasure Data, Inc. located in Silicon Valley.
OSS projects I founded:
An open-source hacker.
GitHub: @frsyuki
What’s workload automation?
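• Automating every kind of manual work
> Automating batch data analysis:
• Data loading - ETL - JOIN - aggregation - report generation - notification
> Automating email delivery
• Fetch the address list - filter the recipients - generate the body from a template - send the email - notify on completion
> Automating data integration between systems
> Automating management and provisioning of servers, databases, and network devices
> Automating testing and deployment (CI)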
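Required features
• Basic features
> Run tasks in dependency order
> Run periodically
> Run when a file is created
> Run past periods in bulk (backfill)
> Run with variables such as the time
• Error handling
> Notify on failure
> Resume from the point of failure
• Status monitoring
> Notify when a run takes too long
> Visualize task execution time
> Collect and store execution logs
• Speed-up
> Run tasks in parallel
> Limit the number of concurrent executions
• Development support
> Version control of workflows
> GUI-based workflow development
> Libraries that make routine processing easy to run
> Reproducibility: runs the same way locally and on the server (if it works locally, it works on the server)
> Run tasks using Docker images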
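As a rough illustration of the first basic feature in the list above (running tasks in dependency order), here is a minimal Python sketch. It is not how Digdag is implemented; the `run_in_dependency_order` helper and the task names are hypothetical, loosely modeled on the batch-analysis pipeline mentioned earlier, and the ordering relies on the standard library's `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter


def run_in_dependency_order(dependencies, run_task):
    """Run every task only after all of the tasks it depends on.

    dependencies maps a task name to the set of task names it depends on.
    Returns the order in which the tasks were run.
    """
    order = []
    # static_order() yields a task only after all of its predecessors.
    for task in TopologicalSorter(dependencies).static_order():
        run_task(task)
        order.append(task)
    return order


# Hypothetical pipeline: each step depends on the previous one.
PIPELINE = {
    "load": set(),
    "etl": {"load"},
    "join": {"etl"},
    "aggregate": {"join"},
    "report": {"aggregate"},
    "notify": {"report"},
}

executed = run_in_dependency_order(PIPELINE, lambda name: None)
```

A real engine would also persist task state and handle failures; the sketch only shows the ordering step.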
Products
OSS
• Makefile
• Jenkins
• Luigi
• Airflow
• Rundeck
• Azkaban
• Grid Engine
• OpenLava
• Obsidian Scheduler
• Hinemos
• StackStorm
• Platform LSF
Proprietary
• Tivoli Workload Scheduler (IBM)
• CA Workload Automation (CA Technologies)
• JP1/AJS3 (Hitachi)
• Systemwalker Job Workload Server (Fujitsu)
• Workload Automation (Automic)
• BatchMan (Honico)
• Control-M (BMC)
• Schedulix
• ServiceNow Workflow
Challenge: Multiple Clouds & Regions
On-Premises
Different APIs, different tools, many scripts.
Challenge: Multiple DB technologies
Amazon S3
Amazon Redshift
Amazon EMR
> Hi!
> I'm a new technology!
Challenge: Modern complex data analytics
Ingest
> Application logs
> User attribute data
> Ad impressions
> 3rd-party cookie data
Enrich
> Removing bot access
> Geo location from IP address
> Parsing User-Agent
> JOIN user attributes to event logs
Model
> A/B Testing
> Funnel analysis
> Segmentation analysis
> Machine learning
Load
> Creating indexes
> Data partitioning
> Data compression
> Statistics collection
Utilize
> Recommendation API
> Realtime ad bidding
> Visualize using BI applications
Ingest - Enrich - Model - Load - Utilize
Traditional "false" solution
#!/bin/bash
./run_mysql_query.sh
./load_facebook_data.sh
./rsync_apache_logs.sh
./start_emr_cluster.sh
for query in emr/*.sql; do
  ./run_emr_hive "$query"
done
./shutdown_emr_cluster.sh
./run_redshift_queries.sh
./call_finish_notification.sh
> Poor error handling
> Write once, Nobody reads
> No alerts on failure
> No alerts on overly long runs
> No retrying on errors
> No resuming
> No parallel execution
> No distributed execution
> No log collection
> No visualized monitoring
> No modularization
> No parameterization
Solution: Multi-Cloud Workflow Engine
Solves
> Poor error handling
> Write once, Nobody reads
> No alerts on failure
> No alerts on overly long runs
> No retrying on errors
> No resuming
> No parallel execution
> No distributed execution
> No log collection
> No visualized monitoring
> No modularization
> No parameterization
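Two of the gaps in the list above, retrying on errors and resuming, can be sketched in a few lines of Python. This illustrates the idea only and is not Digdag's implementation; `run_with_resume`, the step names, and the in-memory `done` set (a real engine would persist this state) are all assumptions made for the example.

```python
def run_with_resume(steps, done, max_attempts=3):
    """Run steps in order, skipping finished ones and retrying failures.

    steps: list of (name, callable) pairs.
    done: set of step names that already succeeded; it is updated in place,
          so a later call resumes from the first unfinished step.
    Returns True when every step has succeeded, False otherwise.
    """
    for name, action in steps:
        if name in done:
            continue  # finished in an earlier run: resume past it
        for attempt in range(1, max_attempts + 1):
            try:
                action()
            except Exception:
                if attempt == max_attempts:
                    return False  # give up; `done` keeps the progress so far
            else:
                done.add(name)
                break
    return True


# Hypothetical two-step workflow whose second step fails during an outage.
calls = []
outage = {"active": True}


def load():
    calls.append("load")


def query():
    calls.append("query")
    if outage["active"]:
        raise RuntimeError("temporary outage")


steps = [("load", load), ("query", query)]
done = set()
first_run = run_with_resume(steps, done, max_attempts=2)   # query fails twice
outage["active"] = False
second_run = run_with_resume(steps, done, max_attempts=2)  # load is skipped
```

The second call does not repeat `load`, which is the behavior the plain shell script cannot offer without extra bookkeeping.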
Example in our case
1. Dump data to BigQuery
2. Load all tables to Treasure Data
3. Run queries
4. Create reports on Tableau Server (on-premises)
5. Notify on Slack
Workflow constructs
Key constructs
Operators
> Packaged knowledge to run tasks.
> e.g. pg>, s3>, gcs>, emr>, td>, py>, rb>
Parameters
> Programmable variables for operators.
> e.g. ${session_time}, ${workflow_name}, ${JSON.parse(http.last_content)}
Task groups
> Sequence of tasks to organize & modularize workflows.
Operator library
_export:
  td:
    database: workflow_temp

+task1:
  td>: queries/open.sql
  create_table: daily_open

+task2:
  td>: queries/close.sql
  create_table: daily_close
Standard libraries
redshift>: runs Amazon Redshift queries
emr>: creates/shuts down a cluster & runs steps
s3_wait>: waits until a file is put on S3
pg>: runs PostgreSQL queries
td>: runs Treasure Data queries
td_for_each>: repeats task for result rows
mail>: sends an email
Open-source libraries
You can release & use open-source operator libraries.
Task grouping & parallel execution
+load_data:
  _parallel: true

  +load_users:
    redshift>: copy/users.sql

  +load_items:
    redshift>: copy/items.sql
Parallel execution
Tasks under the same group run in parallel if the _parallel option is set to true.
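The effect of a parallel flag on a task group can be approximated in Python with a thread pool. This is a sketch under stated assumptions rather than Digdag's executor: `run_group` is a hypothetical helper, the two tasks are stand-ins for the Redshift loads, and `max_workers` plays the role of a concurrency limit.

```python
from concurrent.futures import ThreadPoolExecutor


def run_group(tasks, parallel=False, max_workers=4):
    """Run a group of named callables and return {name: result}.

    With parallel=False the tasks run one after another; with
    parallel=True they are submitted to a thread pool and the call
    returns once all of them have finished.
    """
    if not parallel:
        return {name: task() for name, task in tasks.items()}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(task) for name, task in tasks.items()}
        # result() re-raises any exception raised inside the task.
        return {name: future.result() for name, future in futures.items()}


# Stand-ins for the two load tasks in the group.
results = run_group(
    {
        "load_users": lambda: "users loaded",
        "load_items": lambda: "items loaded",
    },
    parallel=True,
)
```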
Grouping workflows...
[Diagram: many ungrouped +task nodes spread across the Ingest - Enrich - Model - Load - Utilize stages]
Grouping workflows
[Diagram: the same tasks organized into the groups +ingest, +enrich, +model, +basket_analysis, +learn, and +load across the Ingest - Enrich - Model - Load - Utilize stages]
Parameters & Loops
+send_email_to_active_users:
  td_for_each>: list_active.sql
  _do:
    +send:
      mail>: template.txt
      to: ${td.each.addr}
Parameter
A task can propagate parameters to following tasks.
Loop
Generate subtasks dynamically so that Digdag applies the same set of operators to different data sets.
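To make the loop idea concrete, the following Python sketch expands one task template into one subtask per result row and fills in `${...}` placeholders. It is a simplification and not Digdag's engine: Digdag evaluates JavaScript expressions inside `${...}`, while this sketch only resolves dotted names. The helpers `lookup`, `render`, and `expand_for_each` are hypothetical, and each row is exposed under `td.each`, the name used by the alerting example in this deck.

```python
import re

# Matches ${dotted.name} placeholders; expressions are not supported here.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][\w.]*)\}")


def lookup(params, dotted_name):
    """Resolve a dotted name such as 'td.each.addr' in nested dicts."""
    value = params
    for key in dotted_name.split("."):
        value = value[key]
    return value


def render(template, params):
    """Replace every ${dotted.name} in template with its parameter value."""
    return PLACEHOLDER.sub(lambda m: str(lookup(params, m.group(1))), template)


def expand_for_each(rows, task_template):
    """Create one concrete subtask per row by rendering the task's fields."""
    subtasks = []
    for row in rows:
        params = {"td": {"each": row}}
        subtasks.append({k: render(v, params) for k, v in task_template.items()})
    return subtasks


# Hypothetical query result and task template.
rows = [{"addr": "a@example.com"}, {"addr": "b@example.com"}]
subtasks = expand_for_each(rows, {"mail": "template.txt", "to": "${td.each.addr}"})
```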
Unite Engineering & Analytic Teams
+wait_for_arrival:
  s3_wait>: |
    bucket/www_${session_date}.csv

+load_table:
  redshift>: scripts/copy.sql
Powerful for Engineers
> Comfortable for advanced users
Friendly for Analysts
> Still straightforward for analysts to understand & leverage workflows
Pushing workflows to a server with Docker image
schedule:
  daily>: 01:30:00

timezone: Asia/Tokyo

_export:
  docker:
    image: my_image:latest

+task:
  sh>: ./run_in_docker
Digdag server
> Develop on laptop, push it to a server.
> Workflows run periodically on a server.
> Backfill
> Web editor & monitor
Docker
> Install scripts & dependencies in a Docker image, not on a server.
> Workflows can run anywhere, including a developer's laptop.
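What a daily schedule and a backfill amount to can be sketched in Python as date arithmetic. This is only an illustration, not Digdag's scheduler: `next_daily_run` and `backfill_runs` are hypothetical helpers, and Asia/Tokyo is modeled as a fixed UTC+9 offset (an assumption that holds because Japan has no daylight saving time) so that the sketch needs nothing beyond the standard library.

```python
from datetime import date, datetime, time, timedelta, timezone

# Assumption: Asia/Tokyo modeled as a fixed UTC+9 offset (no DST in Japan).
TOKYO = timezone(timedelta(hours=9), "Asia/Tokyo")


def next_daily_run(now, at=time(1, 30), tz=TOKYO):
    """Return the first daily run time strictly after `now` (timezone-aware)."""
    local_now = now.astimezone(tz)
    candidate = datetime.combine(local_now.date(), at, tzinfo=tz)
    if candidate <= local_now:
        candidate += timedelta(days=1)
    return candidate


def backfill_runs(first_day, last_day, at=time(1, 30), tz=TOKYO):
    """List the daily run times from first_day to last_day, inclusive."""
    days = (last_day - first_day).days
    return [
        datetime.combine(first_day + timedelta(days=i), at, tzinfo=tz)
        for i in range(days + 1)
    ]


# 12:00 UTC is 21:00 in Tokyo, so the next 01:30 run falls on the following day.
upcoming = next_daily_run(datetime(2017, 1, 10, 12, 0, tzinfo=timezone.utc))
past_runs = backfill_runs(date(2017, 1, 1), date(2017, 1, 3))
```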
Amazon ECR Dockerfile & Operator plugin template
• https://github.com/myui/dockernized-digdag-server
• https://github.com/myui/digdag-plugin-example
$ docker pull myui/digdag-server:latest
$ docker run -p 65432:65432 myui/digdag-server
open http://localhost:65432/
Demo
Real-world workflows
Digdag at Treasure Data
3,600 workflows run every day
28,000 tasks run every day
850 active workflows
400,000 workflow executions in total
Example: Customer analysis & alerting
timezone: UTC

schedule:
  daily>: 09:00

_export:
  mail:
    from: 'bizops@example.com'
  td:
    database: summary

+reports:
  td_run>: prepare_users_data

+for_each_users:
  td_for_each>: inactive_users.sql
  _do:
    +alert_email:
      mail>: mail.txt
      subject: 'Inactive Alert: ${td.each.account_name}'
      to: ['${td.each.owner_email}']
mail.txt:
Usage: ${td.each.percentage}%
Account Name: ${td.each.account_name}
Type: Purchase
${td.each.salesforce_link}
Region: ${td.each.region}
Owner: ${td.each.owner_name} (${td.each.owner_email})
Account: ${td.each.account_name}
Status: ${td.each.activity_status}
Actual: ${td.each.total_purchase}
Limit: ${td.each.monthly_purchase_limit}
Example: Backend of a BI app
timezone: <%= ev @timezone %>
<% if @schedule then %>
schedule: <%= ev @schedule %>
<% end %>

_export:
  td:
    database: <%= ev @database %>
  all_mode: ${
    (moment(session_time).dayOfYear() - 1)
    % 3 == 0
    }

+all_load:
  if>: ${all_mode == "true"}
  _do:
    +create_all_records:
      td>: segment_web_access.sql
      create_table: "cdp_tmp_web_access"
      _retry: 5

    +rename_tmp_table:
      td_ddl>:
      rename_tables:
        - from: "cdp_tmp_web_access"
          to: "cdp_web_access"
      _retry: 5

+get_all_count:
  td>: incremental_count.sql
  table_name: "cdp_web_access"
  store_last_results: true
  _retry: 5

+syndicate_loop:
  loop>: ${Math.ceil(
    td.last_results.total_count / 20000
    )}
  _do:
    td>: incremental_select.sql
    table_name: "cdp_web_access"
    result_connection: cdp_web_access
    result_settings:
      id: 1
    _retry: 5
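The two expressions in this workflow are plain arithmetic, which the following Python sketch reproduces for checking by hand. The function names `all_mode` and `loop_count` are hypothetical; the sketch assumes, as in moment.js, that the day of the year is counted from 1.

```python
import math
from datetime import date


def all_mode(session_date):
    """True when (day of year - 1) is divisible by 3, i.e. every third day."""
    day_of_year = session_date.timetuple().tm_yday  # 1 for January 1st
    return (day_of_year - 1) % 3 == 0


def loop_count(total_count, batch_size=20000):
    """Number of batch_size-row iterations needed to cover total_count rows."""
    return math.ceil(total_count / batch_size)


# January 1st and 4th satisfy the condition; January 2nd and 3rd do not.
every_third_day = [all_mode(date(2017, 1, d)) for d in (1, 2, 3, 4)]
# 45,000 rows in batches of 20,000 need three iterations.
iterations = loop_count(45000)
```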
Example: Moving Spark app to production
_export:
  td:
    database: digdag_demo_${session_date_compact}

+setup:
  td_ddl>:
  create_databases: ["${td.database}"]

+ingestion:
  _parallel: true

  +items_from_access_logs:
    +wait_for_arrival:
      s3_wait>: digdag-demo-bucket/www_login_${session_date_compact}.csv
    +load_logs:
      td_load>: s3_import_1479918530

  +facebook_ads:
    td_load>: facebook_ads_reporting_import_1479843958

  +items_from_aurora:
    td_load>: mysql_import_1479918544

+enrichment:
  _parallel: 5

  +ip_location_to_user:
    # ip_location, user
    td>: queries/ip_location_to_user.sql
    create_table: ip_location_to_user

  +item_to_click_count:
    # item, click_count
    td>: queries/item_to_click_count.sql
    create_table: item_to_click_count

  +item_to_item_count:
    # item_1, item_2, count
    td>: queries/item_to_item_count.sql
    create_table: item_to_item_count

+modeling:
  emr>:
  cluster: j-OD82XANWFYQ8
  staging: s3://digdag-demo-data/emr/staging/
  steps:
    - type: spark
      application: spark/target/scala-2.11/simple-td-spark-project_2.11-1.0.jar
      submit_options: ["--class", "ItemRecommends"]
      jars: [td-spark-assembly-0.1.jar]
    - type: spark
      application: spark/target/scala-2.11/simple-td-spark-project_2.11-1.0.jar
      submit_options: ["--class", "LocationRecommends"]
      jars: [td-spark-assembly-0.1.jar]

+loading:
  _parallel: true

  +load_location_recommends:
    redshift>: copy/copy_location_recommends.sql

  +load_item_recommends:
    redshift>: copy/copy_item_recommends.sql
Deployment & Fault tolerance
HA deployment of Digdag
It's just like a web application.
[Diagram components: Digdag client, Visual UI, Digdag server (API & scheduler & executor), PostgreSQL (all task state)]
HA deployment of Digdag
Stateless servers + Replicated DB
[Diagram components: Digdag client, Visual UI, HTTP load balancer, multiple Digdag servers (API & scheduler & executor), PostgreSQL with HA (all task state)]
HA deployment of Digdag
Isolating API and execution for reliability
[Diagram components: Digdag client, HTTP load balancer, Digdag servers (API), Digdag servers (scheduler & executor), PostgreSQL with HA (all task state)]
$ digdag server --disable-local-agent --disable-executor-loop
$ digdag server --max-task-threads 100
Single-server task logs
[Diagram components: Digdag client, HTTP load balancer, Digdag server, local disks, PostgreSQL]
A server writes logs to a local disk.
The same server serves the logs.
$ digdag --task-log <dir>
$ digdag log <attempt-id> -f
Centralized task log storage
[Diagram components: Digdag client, HTTP load balancer, Digdag servers, AWS S3, PostgreSQL]
A server uploads logs.
A server pre-signs the download URL.
log-server.type = s3
log-server.s3.bucket = my-digdag-log-bucket
log-server.s3.path = logs/
$ digdag log <attempt-id> -f
Client downloads logs directly from S3.
Sadayuki Furuhashi
https://digdag.io
Visit my website!

More Related Content

What's hot

エンジニアの個人ブランディングと技術組織
エンジニアの個人ブランディングと技術組織エンジニアの個人ブランディングと技術組織
エンジニアの個人ブランディングと技術組織
Takafumi ONAKA
 
Swagger ではない OpenAPI Specification 3.0 による API サーバー開発
Swagger ではない OpenAPI Specification 3.0 による API サーバー開発Swagger ではない OpenAPI Specification 3.0 による API サーバー開発
Swagger ではない OpenAPI Specification 3.0 による API サーバー開発
Yahoo!デベロッパーネットワーク
 
CircleCIのinfrastructureを支えるTerraformのCI/CDパイプラインの改善
CircleCIのinfrastructureを支えるTerraformのCI/CDパイプラインの改善CircleCIのinfrastructureを支えるTerraformのCI/CDパイプラインの改善
CircleCIのinfrastructureを支えるTerraformのCI/CDパイプラインの改善
Ito Takayuki
 
Docker Compose 徹底解説
Docker Compose 徹底解説Docker Compose 徹底解説
Docker Compose 徹底解説
Masahito Zembutsu
 
マイクロサービスバックエンドAPIのためのRESTとgRPC
マイクロサービスバックエンドAPIのためのRESTとgRPCマイクロサービスバックエンドAPIのためのRESTとgRPC
マイクロサービスバックエンドAPIのためのRESTとgRPC
disc99_
 
Dockerからcontainerdへの移行
Dockerからcontainerdへの移行Dockerからcontainerdへの移行
Dockerからcontainerdへの移行
Kohei Tokunaga
 
DDD x CQRS 更新系と参照系で異なるORMを併用して上手くいった話
DDD x CQRS   更新系と参照系で異なるORMを併用して上手くいった話DDD x CQRS   更新系と参照系で異なるORMを併用して上手くいった話
DDD x CQRS 更新系と参照系で異なるORMを併用して上手くいった話
Koichiro Matsuoka
 
モノリスからマイクロサービスへの移行 ~ストラングラーパターンの検証~(Spring Fest 2020講演資料)
モノリスからマイクロサービスへの移行 ~ストラングラーパターンの検証~(Spring Fest 2020講演資料)モノリスからマイクロサービスへの移行 ~ストラングラーパターンの検証~(Spring Fest 2020講演資料)
モノリスからマイクロサービスへの移行 ~ストラングラーパターンの検証~(Spring Fest 2020講演資料)
NTT DATA Technology & Innovation
 
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)
NTT DATA Technology & Innovation
 
ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方
Yoshiyasu SAEKI
 
WayOfNoTrouble.pptx
WayOfNoTrouble.pptxWayOfNoTrouble.pptx
WayOfNoTrouble.pptx
Daisuke Yamazaki
 
世界一わかりやすいClean Architecture
世界一わかりやすいClean Architecture世界一わかりやすいClean Architecture
世界一わかりやすいClean Architecture
Atsushi Nakamura
 
SPAセキュリティ入門~PHP Conference Japan 2021
SPAセキュリティ入門~PHP Conference Japan 2021SPAセキュリティ入門~PHP Conference Japan 2021
SPAセキュリティ入門~PHP Conference Japan 2021
Hiroshi Tokumaru
 
TLS, HTTP/2演習
TLS, HTTP/2演習TLS, HTTP/2演習
TLS, HTTP/2演習
shigeki_ohtsu
 
趣味と仕事の違い、現場で求められるアプリケーションの可観測性
趣味と仕事の違い、現場で求められるアプリケーションの可観測性趣味と仕事の違い、現場で求められるアプリケーションの可観測性
趣味と仕事の違い、現場で求められるアプリケーションの可観測性
LIFULL Co., Ltd.
 
はじめての datadog
はじめての datadogはじめての datadog
はじめての datadog
Naoya Nakazawa
 
今こそ知りたいSpring Batch(Spring Fest 2020講演資料)
今こそ知りたいSpring Batch(Spring Fest 2020講演資料)今こそ知りたいSpring Batch(Spring Fest 2020講演資料)
今こそ知りたいSpring Batch(Spring Fest 2020講演資料)
NTT DATA Technology & Innovation
 
シリコンバレーの「何が」凄いのか
シリコンバレーの「何が」凄いのかシリコンバレーの「何が」凄いのか
シリコンバレーの「何が」凄いのか
Atsushi Nakada
 
KeycloakでAPI認可に入門する
KeycloakでAPI認可に入門するKeycloakでAPI認可に入門する
KeycloakでAPI認可に入門する
Hitachi, Ltd. OSS Solution Center.
 
Dockerからcontainerdへの移行
Dockerからcontainerdへの移行Dockerからcontainerdへの移行
Dockerからcontainerdへの移行
Akihiro Suda
 

What's hot (20)

エンジニアの個人ブランディングと技術組織
エンジニアの個人ブランディングと技術組織エンジニアの個人ブランディングと技術組織
エンジニアの個人ブランディングと技術組織
 
Swagger ではない OpenAPI Specification 3.0 による API サーバー開発
Swagger ではない OpenAPI Specification 3.0 による API サーバー開発Swagger ではない OpenAPI Specification 3.0 による API サーバー開発
Swagger ではない OpenAPI Specification 3.0 による API サーバー開発
 
CircleCIのinfrastructureを支えるTerraformのCI/CDパイプラインの改善
CircleCIのinfrastructureを支えるTerraformのCI/CDパイプラインの改善CircleCIのinfrastructureを支えるTerraformのCI/CDパイプラインの改善
CircleCIのinfrastructureを支えるTerraformのCI/CDパイプラインの改善
 
Docker Compose 徹底解説
Docker Compose 徹底解説Docker Compose 徹底解説
Docker Compose 徹底解説
 
マイクロサービスバックエンドAPIのためのRESTとgRPC
マイクロサービスバックエンドAPIのためのRESTとgRPCマイクロサービスバックエンドAPIのためのRESTとgRPC
マイクロサービスバックエンドAPIのためのRESTとgRPC
 
Dockerからcontainerdへの移行
Dockerからcontainerdへの移行Dockerからcontainerdへの移行
Dockerからcontainerdへの移行
 
DDD x CQRS 更新系と参照系で異なるORMを併用して上手くいった話
DDD x CQRS   更新系と参照系で異なるORMを併用して上手くいった話DDD x CQRS   更新系と参照系で異なるORMを併用して上手くいった話
DDD x CQRS 更新系と参照系で異なるORMを併用して上手くいった話
 
モノリスからマイクロサービスへの移行 ~ストラングラーパターンの検証~(Spring Fest 2020講演資料)
モノリスからマイクロサービスへの移行 ~ストラングラーパターンの検証~(Spring Fest 2020講演資料)モノリスからマイクロサービスへの移行 ~ストラングラーパターンの検証~(Spring Fest 2020講演資料)
モノリスからマイクロサービスへの移行 ~ストラングラーパターンの検証~(Spring Fest 2020講演資料)
 
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)
 
ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方
 
WayOfNoTrouble.pptx
WayOfNoTrouble.pptxWayOfNoTrouble.pptx
WayOfNoTrouble.pptx
 
世界一わかりやすいClean Architecture
世界一わかりやすいClean Architecture世界一わかりやすいClean Architecture
世界一わかりやすいClean Architecture
 
SPAセキュリティ入門~PHP Conference Japan 2021
SPAセキュリティ入門~PHP Conference Japan 2021SPAセキュリティ入門~PHP Conference Japan 2021
SPAセキュリティ入門~PHP Conference Japan 2021
 
TLS, HTTP/2演習
TLS, HTTP/2演習TLS, HTTP/2演習
TLS, HTTP/2演習
 
趣味と仕事の違い、現場で求められるアプリケーションの可観測性
趣味と仕事の違い、現場で求められるアプリケーションの可観測性趣味と仕事の違い、現場で求められるアプリケーションの可観測性
趣味と仕事の違い、現場で求められるアプリケーションの可観測性
 
はじめての datadog
はじめての datadogはじめての datadog
はじめての datadog
 
今こそ知りたいSpring Batch(Spring Fest 2020講演資料)
今こそ知りたいSpring Batch(Spring Fest 2020講演資料)今こそ知りたいSpring Batch(Spring Fest 2020講演資料)
今こそ知りたいSpring Batch(Spring Fest 2020講演資料)
 
シリコンバレーの「何が」凄いのか
シリコンバレーの「何が」凄いのかシリコンバレーの「何が」凄いのか
シリコンバレーの「何が」凄いのか
 
KeycloakでAPI認可に入門する
KeycloakでAPI認可に入門するKeycloakでAPI認可に入門する
KeycloakでAPI認可に入門する
 
Dockerからcontainerdへの移行
Dockerからcontainerdへの移行Dockerからcontainerdへの移行
Dockerからcontainerdへの移行
 

Viewers also liked

Network visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetryNetwork visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetry
pphaal
 
5 g network &amp; technology
5 g network &amp; technology5 g network &amp; technology
5 g network &amp; technology
Frikha Nour
 
AWS Data Collection & Storage
AWS Data Collection & StorageAWS Data Collection & Storage
AWS Data Collection & Storage
Amazon Web Services
 
NFV Tutorial
NFV TutorialNFV Tutorial
NFV Tutorial
Rashid Mijumbi
 
NFV and OpenStack
NFV and OpenStackNFV and OpenStack
NFV and OpenStack
Marie-Paule Odini
 
Using Agilio SmartNICs for OpenStack Networking Acceleration
Using Agilio SmartNICs for OpenStack Networking AccelerationUsing Agilio SmartNICs for OpenStack Networking Acceleration
Using Agilio SmartNICs for OpenStack Networking Acceleration
Netronome
 
Nfv orchestration open stack summit may2015 aricent
Nfv orchestration open stack summit may2015 aricentNfv orchestration open stack summit may2015 aricent
Nfv orchestration open stack summit may2015 aricent
Aricent
 
大規模環境のOpenStack アップグレードの考え方と実施のコツ
大規模環境のOpenStackアップグレードの考え方と実施のコツ大規模環境のOpenStackアップグレードの考え方と実施のコツ
大規模環境のOpenStack アップグレードの考え方と実施のコツ
Tomoya Hashimoto
 
Monitor OpenStack Environments from the bottom up and front to back
Monitor OpenStack Environments from the bottom up and front to backMonitor OpenStack Environments from the bottom up and front to back
Monitor OpenStack Environments from the bottom up and front to back
Icinga
 
Treasure Data Cloud Data Platform
Treasure Data Cloud Data PlatformTreasure Data Cloud Data Platform
Treasure Data Cloud Data Platform
inside-BigData.com
 
NFV evolution towards 5G
NFV evolution towards 5GNFV evolution towards 5G
NFV evolution towards 5G
Marie-Paule Odini
 
Design Principles for 5G
Design Principles for 5GDesign Principles for 5G
Design Principles for 5G
Open Networking Summit
 
NFV : Virtual Network Function Architecture
NFV : Virtual Network Function ArchitectureNFV : Virtual Network Function Architecture
NFV : Virtual Network Function Architecturesidneel
 
【AWS初心者向けWebinar】AWSから始める動画配信
【AWS初心者向けWebinar】AWSから始める動画配信【AWS初心者向けWebinar】AWSから始める動画配信
【AWS初心者向けWebinar】AWSから始める動画配信
Amazon Web Services Japan
 
Cloud Network Virtualization with Juniper Contrail
Cloud Network Virtualization with Juniper ContrailCloud Network Virtualization with Juniper Contrail
Cloud Network Virtualization with Juniper Contrail
buildacloud
 
Contrail Deep-dive - Cloud Network Services at Scale
Contrail Deep-dive - Cloud Network Services at ScaleContrail Deep-dive - Cloud Network Services at Scale
Contrail Deep-dive - Cloud Network Services at Scale
MarketingArrowECS_CZ
 
170827 jtf garafana
170827 jtf garafana170827 jtf garafana
170827 jtf garafana
OSSラボ株式会社
 
ビッグデータ処理データベースの全体像と使い分け
ビッグデータ処理データベースの全体像と使い分けビッグデータ処理データベースの全体像と使い分け
ビッグデータ処理データベースの全体像と使い分け
Recruit Technologies
 

Viewers also liked (18)

Network visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetryNetwork visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetry
 
5 g network &amp; technology
5 g network &amp; technology5 g network &amp; technology
5 g network &amp; technology
 
AWS Data Collection & Storage
AWS Data Collection & StorageAWS Data Collection & Storage
AWS Data Collection & Storage
 
NFV Tutorial
NFV TutorialNFV Tutorial
NFV Tutorial
 
NFV and OpenStack
NFV and OpenStackNFV and OpenStack
NFV and OpenStack
 
Using Agilio SmartNICs for OpenStack Networking Acceleration
Using Agilio SmartNICs for OpenStack Networking AccelerationUsing Agilio SmartNICs for OpenStack Networking Acceleration
Using Agilio SmartNICs for OpenStack Networking Acceleration
 
Nfv orchestration open stack summit may2015 aricent
Nfv orchestration open stack summit may2015 aricentNfv orchestration open stack summit may2015 aricent
Nfv orchestration open stack summit may2015 aricent
 
大規模環境のOpenStack アップグレードの考え方と実施のコツ
大規模環境のOpenStackアップグレードの考え方と実施のコツ大規模環境のOpenStackアップグレードの考え方と実施のコツ
大規模環境のOpenStack アップグレードの考え方と実施のコツ
 
Monitor OpenStack Environments from the bottom up and front to back
Monitor OpenStack Environments from the bottom up and front to backMonitor OpenStack Environments from the bottom up and front to back
Monitor OpenStack Environments from the bottom up and front to back
 
Treasure Data Cloud Data Platform
Treasure Data Cloud Data PlatformTreasure Data Cloud Data Platform
Treasure Data Cloud Data Platform
 
NFV evolution towards 5G
NFV evolution towards 5GNFV evolution towards 5G
NFV evolution towards 5G
 
Design Principles for 5G
Design Principles for 5GDesign Principles for 5G
Design Principles for 5G
 
NFV : Virtual Network Function Architecture
NFV : Virtual Network Function ArchitectureNFV : Virtual Network Function Architecture
NFV : Virtual Network Function Architecture
 
【AWS初心者向けWebinar】AWSから始める動画配信
【AWS初心者向けWebinar】AWSから始める動画配信【AWS初心者向けWebinar】AWSから始める動画配信
【AWS初心者向けWebinar】AWSから始める動画配信
 
Cloud Network Virtualization with Juniper Contrail
Cloud Network Virtualization with Juniper ContrailCloud Network Virtualization with Juniper Contrail
Cloud Network Virtualization with Juniper Contrail
 
Contrail Deep-dive - Cloud Network Services at Scale
Contrail Deep-dive - Cloud Network Services at ScaleContrail Deep-dive - Cloud Network Services at Scale
Contrail Deep-dive - Cloud Network Services at Scale
 
170827 jtf garafana
170827 jtf garafana170827 jtf garafana
170827 jtf garafana
 
ビッグデータ処理データベースの全体像と使い分け
ビッグデータ処理データベースの全体像と使い分けビッグデータ処理データベースの全体像と使い分け
ビッグデータ処理データベースの全体像と使い分け
 

Similar to Digdagによる大規模データ処理の自動化とエラー処理

Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics Pipelines
Sadayuki Furuhashi
 
EG Reports - Delicious Data
EG Reports - Delicious DataEG Reports - Delicious Data
EG Reports - Delicious Data
Benjamin Shum
 
Embulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loaderEmbulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loader
Sadayuki Furuhashi
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Sadayuki Furuhashi
 
Hydra - Getting Started
Hydra - Getting StartedHydra - Getting Started
Hydra - Getting Started
abramsm
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
DataWorks Summit
 
6 tips for improving ruby performance
6 tips for improving ruby performance6 tips for improving ruby performance
6 tips for improving ruby performance
Engine Yard
 
Treasure Data and AWS - Developers.io 2015
Treasure Data and AWS - Developers.io 2015Treasure Data and AWS - Developers.io 2015
Treasure Data and AWS - Developers.io 2015
N Masahiro
 
Building a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSBuilding a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWS
SmartNews, Inc.
 
Tools for Solving Performance Issues
Tools for Solving Performance IssuesTools for Solving Performance Issues
Tools for Solving Performance Issues
Odoo
 
Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann
Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann
Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann
Danny Abukalam
 
Ajax Performance Tuning and Best Practices
Ajax Performance Tuning and Best PracticesAjax Performance Tuning and Best Practices
Ajax Performance Tuning and Best Practices
Doris Chen
 
Expanding your impact with programmability in the data center
Expanding your impact with programmability in the data centerExpanding your impact with programmability in the data center
Expanding your impact with programmability in the data center
Cisco Canada
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
clairvoyantllc
 
Docker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackDocker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic Stack
Jakub Hajek
 
Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek
PROIDEA
 
DjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling DisqusDjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling Disqus
zeeg
 
Iac d.damyanov 4.pptx
Iac d.damyanov 4.pptxIac d.damyanov 4.pptx
Iac d.damyanov 4.pptx
Dimitar Damyanov
 
DockerCon Europe 2018 Monitoring & Logging Workshop
DockerCon Europe 2018 Monitoring & Logging WorkshopDockerCon Europe 2018 Monitoring & Logging Workshop
DockerCon Europe 2018 Monitoring & Logging Workshop
Brian Christner
 
Embulk - 進化するバルクデータローダ
Embulk - 進化するバルクデータローダEmbulk - 進化するバルクデータローダ
Embulk - 進化するバルクデータローダ
Sadayuki Furuhashi
 

Similar to Digdagによる大規模データ処理の自動化とエラー処理 (20)

Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics Pipelines
 
EG Reports - Delicious Data
EG Reports - Delicious DataEG Reports - Delicious Data
EG Reports - Delicious Data
 
Embulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loaderEmbulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loader
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
Hydra - Getting Started
Hydra - Getting StartedHydra - Getting Started
Hydra - Getting Started
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
 
6 tips for improving ruby performance
6 tips for improving ruby performance6 tips for improving ruby performance
6 tips for improving ruby performance
 
Treasure Data and AWS - Developers.io 2015
Treasure Data and AWS - Developers.io 2015Treasure Data and AWS - Developers.io 2015
Treasure Data and AWS - Developers.io 2015
 
Building a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSBuilding a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWS
 
Tools for Solving Performance Issues
Tools for Solving Performance IssuesTools for Solving Performance Issues
Tools for Solving Performance Issues
 
Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann
Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann
Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann
 
Ajax Performance Tuning and Best Practices
Ajax Performance Tuning and Best PracticesAjax Performance Tuning and Best Practices
Ajax Performance Tuning and Best Practices
 
Expanding your impact with programmability in the data center
Expanding your impact with programmability in the data centerExpanding your impact with programmability in the data center
Expanding your impact with programmability in the data center
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
Docker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackDocker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic Stack
 
Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek
 
DjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling DisqusDjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling Disqus
 
Iac d.damyanov 4.pptx
Iac d.damyanov 4.pptxIac d.damyanov 4.pptx
Iac d.damyanov 4.pptx
 
DockerCon Europe 2018 Monitoring & Logging Workshop
DockerCon Europe 2018 Monitoring & Logging WorkshopDockerCon Europe 2018 Monitoring & Logging Workshop
DockerCon Europe 2018 Monitoring & Logging Workshop
 
Embulk - 進化するバルクデータローダ
Embulk - 進化するバルクデータローダEmbulk - 進化するバルクデータローダ
Embulk - 進化するバルクデータローダ
 

More from Sadayuki Furuhashi

Scripting Embulk Plugins
Scripting Embulk PluginsScripting Embulk Plugins
Scripting Embulk Plugins
Sadayuki Furuhashi
 
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Sadayuki Furuhashi
 
Making KVS 10x Scalable
Making KVS 10x ScalableMaking KVS 10x Scalable
Making KVS 10x Scalable
Automating Large-Scale Data Processing and Error Handling with Digdag

  • 2. Sadayuki Furuhashi A founder of Treasure Data, Inc. located in Silicon Valley. OSS projects I founded: An open-source hacker. Github: @frsyuki
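  • 3. What's workload automation?
    • Automating every kind of manual work
      > Automating batch data analysis:
        • Data loading - ETL - JOIN - aggregation - report generation - notification
      > Automating email delivery:
        • Fetching the address list - narrowing down the recipients - generating the body from a template - sending the email - completion notice
      > Automating data integration between systems
      > Automating the management and provisioning of servers, databases, and network equipment
      > Automating testing and deployment (CI)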
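  • 4. Required features
    • Basic features
      > Run tasks in dependency order
      > Run periodically
      > Run when a file is created
      > Bulk-run past periods (backfill)
      > Run with variables such as the time
    • Error handling
      > Notify on failure
      > Resume from the point of failure
    • Status monitoring
      > Notify when a run takes too long
      > Visualize task execution time
      > Collect and store execution logs
    • Speed-up
      > Run tasks in parallel
      > Limit the number of concurrent runs
    • Development support
      > Version control for workflows
      > GUI-based workflow development
      > Libraries that make routine processing easy to run
      > Reproducibility: behaves the same locally and on the server (if it works locally, it works on the server)
      > Run tasks using Docker images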
  • 5. Products
    OSS
    • Makefile
    • Jenkins
    • Luigi
    • Airflow
    • Rundeck
    • Azkaban
    • Grid Engine
    • OpenLava
    • Obsidian Scheduler
    • Hinemos
    • StackStorm
    • Platform LSF
    Proprietary
    • Tivoli Workload Scheduler (IBM)
    • CA Workload Automation (CA Technologies)
    • JP1/AJS3 (Hitachi)
    • Systemwalker Job Workload Server (Fujitsu)
    • Workload Automation (Automic)
    • BatchMan (Honico)
    • Control-M (BMC)
    • Schedulix
    • ServiceNow Workflow
  • 6. Challenge: Multiple Cloud & Regions On-Premises Different API, Different tools, Many scripts.
  • 7. Challenge: Multiple DB technologies
    Amazon S3, Amazon Redshift, Amazon EMR
  • 8. Challenge: Multiple DB technologies
    Amazon S3, Amazon Redshift, Amazon EMR
    > Hi!
    > I'm a new technology!
  • 9. Challenge: Modern complex data analytics
    Ingest → Enrich → Model → Load → Utilize
    • Ingest: application logs, user attribute data, ad impressions, 3rd-party cookie data
    • Enrich: removing bot access, geo location from IP address, parsing User-Agent, JOIN user attributes to event logs
    • Model: A/B testing, funnel analysis, segmentation analysis, machine learning
    • Load: creating indexes, data partitioning, data compression, statistics collection
    • Utilize: recommendation API, realtime ad bidding, visualize using BI applications
  • 10. Traditional "false" solution
      #!/bin/bash
      ./run_mysql_query.sh
      ./load_facebook_data.sh
      ./rsync_apache_logs.sh
      ./start_emr_cluster.sh
      for query in emr/*.sql; do
        ./run_emr_hive $query
      done
      ./shutdown_emr_cluster.sh
      ./run_redshift_queries.sh
      ./call_finish_notification.sh
    > Poor error handling
    > Write once, nobody reads
    > No alerts on failure
    > No alerts on too long run
    > No retrying on errors
    > No resuming
    > No parallel execution
    > No distributed execution
    > No log collection
    > No visualized monitoring
    > No modularization
    > No parameterization
  • 11. Solution: Multi-Cloud Workflow Engine
    Solves
    > Poor error handling
    > Write once, nobody reads
    > No alerts on failure
    > No alerts on too long run
    > No retrying on errors
    > No resuming
    > No parallel execution
    > No distributed execution
    > No log collection
    > No visualized monitoring
    > No modularization
    > No parameterization
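To make "retrying on errors" and "resuming" concrete, here is a minimal Python sketch of the idea. It is not how Digdag is implemented: `run_workflow`, the `done` set, and `max_retries` are hypothetical names chosen for illustration, and a real engine would persist the task state in a database rather than in memory.

```python
from typing import Callable, List, Set, Tuple

Task = Tuple[str, Callable[[], None]]


def run_workflow(tasks: List[Task], done: Set[str], max_retries: int = 2) -> bool:
    """Run tasks in order, skipping finished ones and retrying failures.

    `done` stands in for persisted task state: names recorded in it are
    skipped, which is what lets a second call resume after a failure.
    Returns True when every task has succeeded.
    """
    for name, action in tasks:
        if name in done:
            continue  # already succeeded in an earlier attempt
        for attempt in range(1 + max_retries):
            try:
                action()
            except Exception as error:
                # A real engine would send an alert here instead of printing.
                print(f"alert: {name} failed (attempt {attempt + 1}): {error}")
            else:
                done.add(name)
                break
        else:
            return False  # retries exhausted; later tasks are not started
    return True


if __name__ == "__main__":
    calls = {"load": 0}

    def flaky_load() -> None:
        # Fails once, then succeeds, to exercise the retry path.
        calls["load"] += 1
        if calls["load"] < 2:
            raise RuntimeError("temporary network error")

    state: Set[str] = set()
    finished = run_workflow([("dump", lambda: None), ("load", flaky_load)], state)
    print(finished, sorted(state))
```

Calling `run_workflow` again with the same `done` set skips everything that already succeeded, which is the behavior the plain shell script on the previous slide cannot offer.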
  • 12. Example in our case
    1. Dump data to BigQuery
    2. Load all tables to Treasure Data
    3. Run queries
    4. Create reports on Tableau Server (on-premises)
    5. Notify on Slack
  • 14. Key constructs
    Operators
    > Packaged knowledge to run tasks.
    > e.g. pg>, s3>, gcs>, emr>, td>, py>, rb>
    Parameters
    > Programmable variables for operators.
    > e.g. ${session_time}, ${workflow_name}, ${JSON.parse(http.last_content)}
    Task groups
    > Sequence of tasks to organize & modularize workflows.
  • 15. Operator library
      _export:
        td:
          database: workflow_temp

      +task1:
        td>: queries/open.sql
        create_table: daily_open

      +task2:
        td>: queries/close.sql
        create_table: daily_close
    Standard libraries
      redshift>: runs Amazon Redshift queries
      emr>: creates/shuts down a cluster & runs steps
      s3_wait>: waits until a file is put on S3
      pg>: runs PostgreSQL queries
      td>: runs Treasure Data queries
      td_for_each>: repeats a task for each result row
      mail>: sends an email
    Open-source libraries
      You can release & use open-source operator libraries.
  • 16. Task grouping & parallel execution
      +load_data:
        _parallel: true

        +load_users:
          redshift>: copy/users.sql

        +load_items:
          redshift>: copy/items.sql
    Parallel execution
      Tasks under the same group run in parallel if the _parallel option is set to true.
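As a rough mental model of what a parallel option on a task group means, the following Python sketch runs the children of a group either one after another or concurrently. It is only an illustration under stated assumptions: `run_group`, `copy_users`, and `copy_items` are hypothetical names, and Digdag's own scheduler is not shown in the slides.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_group(tasks: Dict[str, Callable[[], str]], parallel: bool) -> Dict[str, str]:
    """Run every task in a group and return its result keyed by task name."""
    if not parallel:
        # Sequential mode: each child starts after the previous one returns.
        return {name: task() for name, task in tasks.items()}
    with ThreadPoolExecutor(max_workers=len(tasks) or 1) as pool:
        futures = {name: pool.submit(task) for name, task in tasks.items()}
        # result() blocks, so the group finishes only when all children do.
        return {name: future.result() for name, future in futures.items()}


def copy_users() -> str:
    # Hypothetical stand-in for running copy/users.sql.
    return "users loaded"


def copy_items() -> str:
    # Hypothetical stand-in for running copy/items.sql.
    return "items loaded"


if __name__ == "__main__":
    group = {"load_users": copy_users, "load_items": copy_items}
    print(run_group(group, parallel=True))
```

Either mode returns the same results; only the scheduling differs, which is why switching a group to parallel execution is a one-line change in the workflow file.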
  • 17. Grouping workflows...
    Ingest → Enrich → Model → Load → Utilize
    (diagram: twelve ungrouped +task nodes spread across the stages)
  • 18. Grouping workflows
    Ingest → Enrich → Model → Load → Utilize
    (diagram: the same tasks organized under the groups +ingest, +enrich, +model, +basket_analysis, +learn, and +load)
  • 19. Parameters & Loops
      +send_email_to_active_users:
        td_for_each>: list_active.sql
        _do:
          +send:
            mail>: template.txt
            to: ${td.each.addr}
    Parameter
      A task can propagate parameters to the tasks that follow it.
    Loop
      Subtasks are generated dynamically, so Digdag applies the same set of operators to different data sets.
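The loop on this slide can be pictured as expanding one task definition into one concrete subtask per result row. The Python sketch below models that expansion only. It assumes row values are exposed under a placeholder prefix such as `td.each`, and `render` and `expand_loop` are hypothetical helper names, not Digdag APIs.

```python
import re
from typing import Dict, List

# Matches ${name} where name may contain dots, e.g. ${td.each.addr}.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][\w.]*)\}")


def render(template: str, params: Dict[str, str]) -> str:
    """Replace ${name} placeholders with values from params."""
    return PLACEHOLDER.sub(lambda match: params[match.group(1)], template)


def expand_loop(
    rows: List[Dict[str, str]],
    task: Dict[str, str],
    prefix: str = "td.each",  # assumed placeholder prefix for row columns
) -> List[Dict[str, str]]:
    """Generate one concrete subtask per result row."""
    subtasks = []
    for row in rows:
        params = {f"{prefix}.{column}": value for column, value in row.items()}
        subtasks.append({key: render(value, params) for key, value in task.items()})
    return subtasks


if __name__ == "__main__":
    rows = [{"addr": "a@example.com"}, {"addr": "b@example.com"}]
    task = {"mail>": "template.txt", "to": "${td.each.addr}"}
    for subtask in expand_loop(rows, task):
        print(subtask)
```

The task definition is written once, and the query result decides how many subtasks exist at run time.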
  • 20. Unite Engineering & Analytic Teams
      +wait_for_arrival:
        s3_wait>: |
          bucket/www_${session_date}.csv

      +load_table:
        redshift>: scripts/copy.sql
    Powerful for Engineers
    > Comfortable for advanced users
    Friendly for Analysts
    > Still straightforward for analysts to understand & leverage workflows
  • 21. Pushing workflows to a server with Docker image
      schedule:
        daily>: 01:30:00

      timezone: Asia/Tokyo

      _export:
        docker:
          image: my_image:latest

      +task:
        sh>: ./run_in_docker
    Digdag server
    > Develop on a laptop, push it to a server.
    > Workflows run periodically on a server.
    > Backfill
    > Web editor & monitor
    Docker
    > Install scripts & dependencies in a Docker image, not on a server.
    > Workflows can run anywhere, including a developer's laptop.
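Backfill means running a scheduled workflow for past periods it has not yet covered. As a small illustration, the Python sketch below lists the run times a daily 01:30:00 schedule in Asia/Tokyo would cover over a date range. `daily_sessions` is a hypothetical helper, the fixed UTC+9 offset is an assumption that holds because Japan does not observe daylight saving time, and how Digdag computes sessions internally is not shown in the slides.

```python
from datetime import date, datetime, time, timedelta, timezone
from typing import List

# Japan Standard Time has no daylight saving, so a fixed offset is enough here.
JST = timezone(timedelta(hours=9))


def daily_sessions(first: date, last: date, at: time, tz: timezone) -> List[datetime]:
    """List the daily run times from first to last, inclusive."""
    days = (last - first).days
    return [
        datetime.combine(first + timedelta(days=offset), at, tzinfo=tz)
        for offset in range(days + 1)
    ]


if __name__ == "__main__":
    # One entry per day that a backfill over this range would need to run.
    for session in daily_sessions(date(2017, 1, 30), date(2017, 2, 1), time(1, 30), JST):
        print(session.isoformat())
```

Each listed time corresponds to one past run that a backfill would execute with that day's time variables.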
  • 22. Amazon ECR Dockerfile & Operator plugin template
    • https://github.com/myui/dockernized-digdag-server
    • https://github.com/myui/digdag-plugin-example
      $ docker pull myui/digdag-server:latest
      $ docker run -p 65432:65432 myui/digdag-server
      open http://localhost:65432/
  • 23. Demo
  • 25. Digdag at Treasure Data
    > 3,600 workflows run every day
    > 28,000 tasks run every day
    > 850 active workflows
    > 400,000 workflow executions in total
  • 26. Example: Customer analysis & alerting
      timezone: UTC

      schedule:
        daily>: 09:00

      _export:
        mail:
          from: 'bizops@example.com'
        td:
          database: summary

      +reports:
        td_run>: prepare_users_data

      +for_each_users:
        td_for_each>: inactive_users.sql
        _do:
          +alert_email:
            mail>: mail.txt
            subject: 'Inactive Alert: ${td.each.account_name}'
            to: ['${td.each.owner_email}']
  • 27. Example: Customer analysis & alerting (the same workflow as slide 26, shown next to its mail.txt template)
    mail.txt
      Usage: ${td.each.percentage}%
      Account Name: ${td.each.account_name}
      Type: Purchase
      ${td.each.salesforce_link}
      Region: ${td.each.region}
      Owner: ${td.each.owner_name} (${td.each.owner_email})
      Account: ${td.each.account_name}
      Status: ${td.each.activity_status}
      Actual: ${td.each.total_purchase}
      Limit: ${td.each.monthly_purchase_limit}
  • 28. Example: Backend of a BI app
      timezone: <%= ev @timezone %>
      <% if @schedule then %>
      schedule: <%= ev @schedule %>
      <% end %>

      _export:
        td:
          database: <%= ev @database %>
        all_mode: ${(moment(session_time).dayOfYear() - 1) % 3 == 0}

      +all_load:
        if>: ${all_mode == "true"}
        _do:
          +create_all_records:
            td>: segment_web_access.sql
            create_table: "cdp_tmp_web_access"
            _retry: 5
          +rename_tmp_table:
            td_ddl>:
            rename_tables:
              - from: "cdp_tmp_web_access"
                to: "cdp_web_access"
            _retry: 5

      +get_all_count:
        td>: incremental_count.sql
        table_name: "cdp_web_access"
        store_last_results: true
        _retry: 5

      +syndicate_loop:
        loop>: ${Math.ceil(td.last_results.total_count / 20000)}
        _do:
          td>: incremental_select.sql
          table_name: "cdp_web_access"
          result_connection: cdp_web_access
          result_settings:
            id: 1
          _retry: 5
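The two expressions in this workflow are plain arithmetic, and a Python sketch makes their effect easier to check. It mirrors `(dayOfYear - 1) % 3 == 0` with a 1-based day of year and `Math.ceil(total_count / 20000)` for the loop count. `is_all_mode` and `batch_count` are hypothetical names used only for this illustration.

```python
import math
from datetime import date


def is_all_mode(session_day: date) -> bool:
    """Mirror (dayOfYear - 1) % 3 == 0 using a 1-based day of year."""
    return (session_day.timetuple().tm_yday - 1) % 3 == 0


def batch_count(total_count: int, batch_size: int = 20000) -> int:
    """Mirror Math.ceil(total_count / batch_size): batches needed to cover all rows."""
    return math.ceil(total_count / batch_size)


if __name__ == "__main__":
    # Shows which of the first days of a year take the full-reload branch.
    for day in (date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 4)):
        print(day.isoformat(), is_all_mode(day))
    # Shows how many loop iterations a given row count needs.
    print(batch_count(45000))
```

Under this rule the full reload runs on January 1 and then every third day, and because the day of year resets, the cycle restarts on each January 1. The loop count grows by one for every additional 20,000 rows or part thereof.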
  • 29. Example: Moving Spark app to production
      _export:
        td:
          database: digdag_demo_${session_date_compact}

      +setup:
        td_ddl>:
        create_databases: ["${td.database}"]

      +ingestion:
        _parallel: true

        +items_from_access_logs:
          +wait_for_arrival:
            s3_wait>: digdag-demo-bucket/www_login_${session_date_compact}.csv
          +load_logs:
            td_load>: s3_import_1479918530

        +facebook_ads:
          td_load>: facebook_ads_reporting_import_1479843958

        +items_from_aurora:
          td_load>: mysql_import_1479918544

      +enrichment:
        _parallel: 5

        +ip_location_to_user:
          # ip_location, user
          td>: queries/ip_location_to_user.sql
          create_table: ip_location_to_user

        +item_to_click_count:
          # item, click_count
          td>: queries/item_to_click_count.sql
          create_table: item_to_click_count

        +item_to_item_count:
          # item_1, item_2, count
          td>: queries/item_to_item_count.sql
          create_table: item_to_item_count

      +modeling:
        emr>:
        cluster: j-OD82XANWFYQ8
        staging: s3://digdag-demo-data/emr/staging/
        steps:
          - type: spark
            application: spark/target/scala-2.11/simple-td-spark-project_2.11-1.0.jar
            submit_options: ["--class", "ItemRecommends"]
            jars: [td-spark-assembly-0.1.jar]
          - type: spark
            application: spark/target/scala-2.11/simple-td-spark-project_2.11-1.0.jar
            submit_options: ["--class", "LocationRecommends"]
            jars: [td-spark-assembly-0.1.jar]

      +loading:
        _parallel: true

        +load_location_recommends:
          redshift>: copy/copy_location_recommends.sql

        +load_item_recommends:
          redshift>: copy/copy_item_recommends.sql
  • 30. Deployment & Fault tolerance
  • 31. HA deployment of Digdag
    It's just like a web application.
    (diagram: Digdag client, Digdag server, PostgreSQL)
    > Digdag server: API & scheduler & executor, Visual UI
    > PostgreSQL: all task state
  • 32. HA deployment of Digdag
    Stateless servers + Replicated DB
    (diagram: Digdag client, HTTP Load Balancer, two Digdag servers, PostgreSQL HA)
    > Digdag servers: API & scheduler & executor, Visual UI
    > PostgreSQL HA: all task state
  • 33. HA deployment of Digdag
    Isolating API and execution for reliability
    (diagram: Digdag client, HTTP Load Balancer, Digdag servers for the API, Digdag servers for scheduler & executor, PostgreSQL HA)
    > PostgreSQL HA: all task state
      $ digdag server --disable-local-agent --disable-executor-loop
      $ digdag server --max-task-threads 100
  • 34. Single-server task logs
    (diagram: Digdag client, HTTP Load Balancer, Digdag server, local disks, PostgreSQL)
    > A server writes logs to a local disk.
    > The same server serves the logs.
      $ digdag --task-log <dir>
      $ digdag log <attempt-id> -f
  • 35. Centralized task log storage
    (diagram: Digdag client, HTTP Load Balancer, two Digdag servers, PostgreSQL, AWS S3)
    > A server uploads logs.
    > A server pre-signs the download URL.
    > The client downloads logs directly from S3.
      log-server.type = s3
      log-server.s3.bucket = my-digdag-log-bucket
      log-server.s3.path = logs/

      $ digdag log <attempt-id> -f