
Digdagによる大規模データ処理の自動化とエラー処理 (Automating Large-Scale Data Processing and Error Handling with Digdag)

Talk at Workflow Engines Night, Tokyo, Japan


  1. Digdagによる大規模データ処理の自動化とエラー処理 (Automating Large-Scale Data Processing and Error Handling with Digdag). Sadayuki Furuhashi, Workflow Engines Night
  2. Sadayuki Furuhashi. A founder of Treasure Data, Inc., located in Silicon Valley. An open-source hacker; founder of OSS projects such as MessagePack and Fluentd. GitHub: @frsyuki
  3. 3. What’s workload automation? • あらゆる手作業の自動化 > バッチデータ解析の自動化: • データロード - ETL - JOIN- 集計処理 - レポート生成 - 通知 > メール送信の自動化 • アドレス一覧の取得 - 対象の絞り込み - テンプレートから
 本文を生成 - メール送信 - 完了通知 > システム間のデータ連携の自動化 > サーバ・DB・ネットワーク機器の管理やプロビジョニング の自動化 > テスト・デプロイの自動化(CI)
  4. Required features.
     • Basics
       > Run tasks in dependency order
       > Periodic execution
       > Run when a file is created
       > Bulk execution of past periods (backfill)
       > Run with variables such as the current time
     • Error handling
       > Notify on failure
       > Resume from the point of failure
     • State monitoring
       > Notify when a run takes too long
       > Visualize task execution times
       > Collect and store execution logs
     • Performance
       > Run tasks in parallel
       > Limit the number of concurrent tasks
     • Development support
       > Version control of workflows
       > GUI-based workflow development
       > Libraries that make routine processing easy
       > Reproducibility between a laptop and the server (if it runs locally, it runs on the server)
       > Run tasks with Docker images
  5. Products. OSS: Makefile, Jenkins, Luigi, Airflow, Rundeck, Azkaban, Grid Engine, OpenLava, Obsidian Scheduler, Hinemos, StackStorm, Platform LSF. Proprietary: Tivoli Workload Scheduler (IBM), CA Workload Automation (CA Technologies), JP1/AJS3 (Hitachi), Systemwalker Job Workload Server (Fujitsu), Workload Automation (Automic), BatchMan (Honico), Control-M (BMC), Schedulix, ServiceNow Workflow.
  6. Challenge: multiple clouds & regions, plus on-premises. Different APIs, different tools, many scripts.
  7. Challenge: multiple DB technologies. Amazon S3, Amazon Redshift, Amazon EMR.
  8. Challenge: multiple DB technologies. Amazon S3, Amazon Redshift, Amazon EMR. > Hi! > I'm a new technology!
  9. Challenge: modern, complex data analytics.
     Ingest: application logs, user attribute data, ad impressions, 3rd-party cookie data.
     Enrich: removing bot access, geo location from IP address, parsing User-Agent, JOINing user attributes to event logs.
     Model: A/B testing, funnel analysis, segmentation analysis, machine learning.
     Load: creating indexes, data partitioning, data compression, statistics collection.
     Utilize: recommendation API, realtime ad bidding, visualization using BI applications.
  10. Traditional "false" solution:

      #!/bin/bash
      ./run_mysql_query.sh
      ./load_facebook_data.sh
      ./rsync_apache_logs.sh
      ./start_emr_cluster.sh
      for query in emr/*.sql; do
        ./run_emr_hive $query
      done
      ./shutdown_emr_cluster.sh
      ./run_redshift_queries.sh
      ./call_finish_notification.sh

      > Poor error handling
      > Write once, nobody reads
      > No alerts on failure
      > No alerts on runs that take too long
      > No retrying on errors
      > No resuming
      > No parallel execution
      > No distributed execution
      > No log collection
      > No visualized monitoring
      > No modularization
      > No parameterization
  11. Solution: a multi-cloud workflow engine. Solves: poor error handling; write once, nobody reads; no alerts on failure; no alerts on runs that take too long; no retrying on errors; no resuming; no parallel execution; no distributed execution; no log collection; no visualized monitoring; no modularization; no parameterization.
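      Two of these gaps, retrying and failure alerts, map directly onto Digdag's built-in _retry and _error parameters. A minimal sketch; the script and addresses are hypothetical:

      +load:
        sh>: scripts/load.sh          # hypothetical load script
        _retry: 3                     # re-run this task up to 3 times on failure
      _error:
        mail>: error.txt              # runs when the workflow ultimately fails (hypothetical template)
        subject: 'Workflow failed'
        to: ['oncall@example.com']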
  12. Example in our case: 1. Dump data to BigQuery. 2. Load all tables to Treasure Data. 3. Run queries. 4. Create reports on Tableau Server (on-premises). 5. Notify on Slack.
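      Sketched as a Digdag workflow, those five steps can look roughly like this; the operator choices, file names, and scripts are assumptions for illustration, not the production workflow:

      +dump_to_bigquery:
        bq>: queries/dump.sql                  # bq> runs a BigQuery query (hypothetical SQL file)
      +load_to_td:
        td_load>: config/bigquery_import.yml   # hypothetical bulk-load config
      +run_queries:
        td>: queries/summary.sql               # hypothetical Treasure Data query
        create_table: summary
      +publish_tableau:
        sh>: scripts/refresh_tableau.sh        # hypothetical script calling Tableau Server
      +notify:
        sh>: scripts/notify_slack.sh           # hypothetical Slack webhook script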
  13. Workflow constructs
  14. Key constructs. Operators: packaged knowledge to run tasks, e.g. pg>, s3>, gcs>, emr>, td>, py>, rb>. Parameters: programmable variables for operators, e.g. ${session_time}, ${workflow_name}, ${JSON.parse(http.last_content)}. Task groups: sequences of tasks that organize & modularize workflows.
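      Put together, the three constructs fit in one .dig file. A minimal sketch with hypothetical query and template files:

      +analysis:                                     # task group
        +aggregate:
          td>: queries/aggregate.sql                 # operator: run a Treasure Data query
          create_table: daily_summary
        +notify:
          mail>: notify.txt                          # operator: send an email (hypothetical template)
          subject: 'Finished at ${session_time}'     # parameter: built-in session variable
          to: ['me@example.com']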
  15. Operator library.

      _export:
        td:
          database: workflow_temp
      +task1:
        td>: queries/open.sql
        create_table: daily_open
      +task2:
        td>: queries/close.sql
        create_table: daily_close

      Standard libraries:
      redshift>: runs Amazon Redshift queries
      emr>: creates/shuts down a cluster & runs steps
      s3_wait>: waits until a file is put on S3
      pg>: runs PostgreSQL queries
      td>: runs Treasure Data queries
      td_for_each>: repeats a task for each result row
      mail>: sends an email

      Open-source libraries: you can release & use open-source operator libraries.
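      Open-source operator plugins load from Maven-style repositories via a plugin section under _export. A sketch modeled on the digdag-plugin-example project referenced on slide 22; the repository, coordinates, and version here are assumptions:

      _export:
        plugin:
          repositories:
            - https://jitpack.io                              # assumed plugin repository
          dependencies:
            - io.digdag.plugin:digdag-plugin-example:0.1.0    # assumed coordinates/version

      +hello:
        example>: Digdag                                      # operator provided by that plugin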
  16. Task grouping & parallel execution.

      +load_data:
        _parallel: true
        +load_users:
          redshift>: copy/users.sql
        +load_items:
          redshift>: copy/items.sql

      Parallel execution: tasks under the same group run in parallel if the _parallel option is set to true.
  17. Grouping workflows... (diagram: the Ingest / Enrich / Model / Load / Utilize stages as one flat list of +task entries)
  18. Grouping workflows (diagram: the same stages organized into nested groups such as +ingest, +enrich, +model with +basket_analysis and +learn inside, and +load)
  19. Parameters & loops.

      +send_email_to_active_users:
        td_for_each>: list_active.sql
        _do:
          +send:
            email>: template.txt
            to: ${td.for_each.addr}

      Parameters: a task can propagate parameters to the tasks that follow it.
      Loops: generate subtasks dynamically so that Digdag applies the same set of operators to different data sets.
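      Parameter propagation in practice: with store_last_results, the first result row of a query becomes available to later tasks. A sketch assuming a hypothetical count.sql that returns a column named cnt:

      +count_rows:
        td>: queries/count.sql          # hypothetical query returning a column `cnt`
        store_last_results: true        # exposes the row as ${td.last_results.*}
      +report:
        mail>: report.txt               # hypothetical template
        subject: 'Row count: ${td.last_results.cnt}'
        to: ['team@example.com']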
  20. Unite engineering & analytics teams.

      +wait_for_arrival:
        s3_wait>: bucket/www_${session_date}.csv
      +load_table:
        redshift>: scripts/copy.sql

      Powerful for engineers: comfortable for advanced users. Friendly for analysts: still straightforward for analysts to understand & leverage workflows.
  21. Pushing workflows to a server, with a Docker image.

      schedule:
        daily>: 01:30:00
      timezone: Asia/Tokyo
      _export:
        docker:
          image: my_image:latest
      +task:
        sh>: ./run_in_docker

      Digdag server: develop on a laptop, push it to a server; workflows run periodically on the server; backfill; web editor & monitor. Docker: install scripts & dependencies in a Docker image, not on a server, so workflows can run anywhere, including a developer's laptop.
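      The client-side loop looks roughly like this; project and workflow names are hypothetical, and the backfill flags are sketched from memory, so check `digdag backfill --help` for the exact form:

      $ digdag push my_project                                    # upload the project to the server
      $ digdag start my_project my_workflow --session now         # run it once, immediately
      $ digdag backfill my_project my_workflow --from 2017-01-01  # run past scheduled sessions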
  22. Amazon ECR, Dockerfile & operator plugin template:
      • https://github.com/myui/dockernized-digdag-server
      • https://github.com/myui/digdag-plugin-example

      $ docker pull myui/digdag-server:latest
      $ docker run -p 65432:65432 myui/digdag-server
      $ open http://localhost:65432/
  23. Demo
  24. Real-world workflows
  25. Digdag at Treasure Data: 3,600 workflows run every day; 28,000 tasks run every day; 850 active workflows; 400,000 workflow executions in total.
  26. Example: customer analysis & alerting.

      timezone: UTC
      schedule:
        daily>: 09:00
      _export:
        mail:
          from: 'bizops@example.com'
        td:
          database: summary
      +reports:
        td_run>: prepare_users_data
      +for_each_users:
        td_for_each>: inactive_users.sql
        _do:
          +alert_email:
            mail>: mail.txt
            subject: 'Inactive Alert: ${td.each.account_name}'
            to: ['${td.each.owner_email}']
  27. Example: customer analysis & alerting (the same workflow, with the mail.txt template):

      Usage: ${td.each.percentage}%
      Account Name: ${td.each.account_name}
      Type: Purchase
      ${td.each.salesforce_link}
      Region: ${td.each.region}
      Owner: ${td.each.owner_name} (${td.each.owner_email})
      Account: ${td.each.account_name}
      Status: ${td.each.activity_status}
      Actual: ${td.each.total_purchase}
      Limit: ${td.each.monthly_purchase_limit}
  28. Example: backend of a BI app.

      timezone: <%= ev @timezone %>
      <% if @schedule then %>
      schedule: <%= ev @schedule %>
      <% end %>
      _export:
        td:
          database: <%= ev @database %>
        all_mode: ${(moment(session_time).dayOfYear() - 1) % 3 == 0}
      +all_load:
        if>: ${all_mode == "true"}
        _do:
          +create_all_records:
            td>: segment_web_access.sql
            create_table: "cdp_tmp_web_access"
            _retry: 5
          +rename_tmp_table:
            td_ddl>:
            rename_tables:
              - from: "cdp_tmp_web_access"
                to: "cdp_web_access"
            _retry: 5
          +get_all_count:
            td>: incremental_count.sql
            table_name: "cdp_web_access"
            store_last_results: true
            _retry: 5
          +syndicate_loop:
            loop>: ${Math.ceil(td.last_results.total_count / 20000)}
            _do:
              td>: incremental_select.sql
              table_name: "cdp_web_access"
              result_connection: cdp_web_access
              result_settings:
                id: 1
              _retry: 5
  29. Example: moving a Spark app to production.

      _export:
        td:
          database: digdag_demo_${session_date_compact}
      +setup:
        td_ddl>:
        create_databases: ["${td.database}"]
      +ingestion:
        _parallel: true
        +items_from_access_logs:
          +wait_for_arrival:
            s3_wait>: digdag-demo-bucket/www_login_${session_date_compact}.csv
          +load_logs:
            td_load>: s3_import_1479918530
        +facebook_ads:
          td_load>: facebook_ads_reporting_import_1479843958
        +items_from_aurora:
          td_load>: mysql_import_1479918544
      +enrichment:
        _parallel: 5
        +ip_location_to_user:    # ip_location, user
          td>: queries/ip_location_to_user.sql
          create_table: ip_location_to_user
        +item_to_click_count:    # item, click_count
          td>: queries/item_to_click_count.sql
          create_table: item_to_click_count
        +item_to_item_count:     # item_1, item_2, count
          td>: queries/item_to_item_count.sql
          create_table: item_to_item_count
      +modeling:
        emr>:
        cluster: j-OD82XANWFYQ8
        staging: s3://digdag-demo-data/emr/staging/
        steps:
          - type: spark
            application: spark/target/scala-2.11/simple-td-spark-project_2.11-1.0.jar
            submit_options: ["--class", "ItemRecommends"]
            jars: [td-spark-assembly-0.1.jar]
          - type: spark
            application: spark/target/scala-2.11/simple-td-spark-project_2.11-1.0.jar
            submit_options: ["--class", "LocationRecommends"]
            jars: [td-spark-assembly-0.1.jar]
      +loading:
        _parallel: true
        +load_location_recommends:
          redshift>: copy/copy_location_recommends.sql
        +load_item_recommends:
          redshift>: copy/copy_item_recommends.sql
  30. Deployment & fault tolerance
  31. HA deployment of Digdag: it's just like a web application. (diagram: a Digdag client and the visual UI talk to a Digdag server (API & scheduler & executor), which keeps all task state in PostgreSQL)
  32. HA deployment of Digdag: stateless servers + a replicated DB. (diagram: the client and visual UI reach two Digdag servers through an HTTP load balancer; both servers (API & scheduler & executor) share a replicated PostgreSQL (HA) that holds all task state)
  33. HA deployment of Digdag: isolating the API from execution for reliability. (diagram: API-only servers behind the HTTP load balancer and separate scheduler & executor servers, all sharing the replicated PostgreSQL (HA) that holds all task state)

      $ digdag server --disable-local-agent --disable-executor-loop   # API-only server
      $ digdag server --max-task-threads 100                          # scheduler & executor server
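      Both roles point at the same PostgreSQL, typically via a server config file. A minimal sketch; the host and credentials below are hypothetical:

      # server.properties (values are hypothetical)
      database.type = postgresql
      database.host = db.example.com
      database.port = 5432
      database.user = digdag
      database.password = secret
      database.database = digdag

      $ digdag server --config server.properties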
  34. Single-server task logs: a server writes logs to a local disk, and the same server serves the logs. (diagram: Digdag client, HTTP load balancer, one Digdag server with PostgreSQL and local disks)

      $ digdag server --task-log <dir>
      $ digdag log <attempt-id> -f
  35. Centralized task log storage: a server uploads logs to AWS S3 and pre-signs the download URL, so the client downloads logs directly from S3. (diagram: Digdag client, HTTP load balancer, two Digdag servers, PostgreSQL, AWS S3)

      log-server.type = s3
      log-server.s3.bucket = my-digdag-log-bucket
      log-server.s3.path = logs/

      $ digdag log <attempt-id> -f
  36. Sadayuki Furuhashi. Visit my website! https://digdag.io
