Automating Workflows for Analytics Pipelines


1. Automating Workflows for Analytics Pipelines. Sadayuki Furuhashi. Open Source Summit 2017.
2. Sadayuki Furuhashi. An open-source hacker and a founder of Treasure Data, Inc., located in Silicon Valley. Author of several OSS projects. GitHub: @frsyuki
3. What's a Workflow Engine?
   • Automates your manual operations:
   • Load data → Clean up → Analyze → Build reports
   • Get customer list → Generate HTML → Send email
   • Monitor server status → Restart on abnormal status
   • Back up database → Alert on failure
   • Run tests → Package → Deploy (Continuous Delivery)
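   For instance, the first pipeline above ("Load data → Clean up → Analyze → Build reports") could be written as a minimal sketch in Digdag, the workflow engine this deck introduces later; every file name here is illustrative, not from the deck:

       # pipeline.dig (hypothetical)
       +load_data:
         sh>: ./load_data.sh        # ingest raw data
       +clean_up:
         sh>: ./clean_up.sh         # filter & normalize
       +analyze:
         sh>: ./analyze.sh          # run the analysis
       +build_reports:
         sh>: ./build_reports.sh    # render reports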
4. Challenge: Multiple Clouds & Regions. Cloud providers plus on-premises systems mean different APIs, different tools, and many scripts.
5. Challenge: Multiple DB technologies. Amazon S3, Amazon Redshift, Amazon EMR.
6. Challenge: Multiple DB technologies. Amazon S3, Amazon Redshift, Amazon EMR, and then: "Hi! I'm a new technology!"
7. Challenge: Modern complex data analytics (Ingest → Enrich → Model → Load → Utilize)
   • Ingest: application logs, user attribute data, ad impressions, 3rd-party cookie data
   • Enrich: removing bot accesses, geolocation from IP addresses, parsing User-Agent, JOINing user attributes to event logs
   • Model: A/B testing, funnel analysis, segmentation analysis, machine learning
   • Load: creating indexes, data partitioning, data compression, statistics collection
   • Utilize: recommendation APIs, realtime ad bidding, visualization using BI applications
8. Traditional "false" solution:

       #!/bin/bash
       ./run_mysql_query.sh
       ./load_facebook_data.sh
       ./rsync_apache_logs.sh
       ./start_emr_cluster.sh
       for query in emr/*.sql; do
         ./run_emr_hive $query
       done
       ./shutdown_emr_cluster.sh
       ./run_redshift_queries.sh
       ./call_finish_notification.sh

   > Poor error handling
   > Write once, nobody reads
   > No alerts on failure
   > No alerts on too-long runs
   > No retrying on errors
   > No resuming
   > No parallel execution
   > No distributed execution
   > No log collection
   > No visualized monitoring
   > No modularization
   > No parameterization
9. Solution: a multi-cloud workflow engine, which solves every problem in the list above.
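   As a hedged sketch of what the shell script from slide 8 could become (script names are reused from that slide; _error, _retry, _parallel, and for_each> are standard Digdag constructs, but the layout and the notification details are illustrative):

       # pipeline.dig (hypothetical rewrite of the shell script)
       _error:
         +alert:                        # alert on failure
           mail>: failure.txt           # hypothetical body template
           to: [oncall@example.com]
           subject: workflow failed

       +ingest:
         _parallel: true                # the three loads run in parallel
         +mysql:
           sh>: ./run_mysql_query.sh
         +facebook:
           sh>: ./load_facebook_data.sh
         +apache:
           sh>: ./rsync_apache_logs.sh

       +emr_queries:
         _retry: 3                      # retry on transient errors
         for_each>:
           query: [emr/first.sql, emr/second.sql]
         _do:
           sh>: ./run_emr_hive ${query}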
10. Example in our case: 1. Dump data to BigQuery; 2. Load all tables to Treasure Data; 3. Run queries; 4. Create reports on Tableau Server (on-premises); 5. Notify on Slack.
11. Workflow constructs
12. Unite engineering & analytics teams

       +wait_for_arrival:
         s3_wait>: |
           bucket/www_${session_date}.csv

       +load_table:
         redshift>: scripts/copy.sql

   Powerful for engineers: comfortable for advanced users. Friendly for analysts: still straightforward for analysts to understand & leverage workflows.
13. Unite engineering & analytics teams (continued). In the snippet above: +name declares a task, name>: invokes an operator, and ${...} is a variable.
14. Operator library

       _export:
         td:
           database: workflow_temp

       +task1:
         td>: queries/open.sql
         create_table: daily_open

       +task2:
         td>: queries/close.sql
         create_table: daily_close

   Standard libraries:
   > redshift>: runs Amazon Redshift queries
   > emr>: creates/shuts down a cluster & runs steps
   > s3_wait>: waits until a file is put on S3
   > pg>: runs PostgreSQL queries
   > td>: runs Treasure Data queries
   > td_for_each>: repeats a task for each result row
   > mail>: sends an email

   Open-source libraries: you can release & use open-source operator libraries.
15. Parallel execution

       +load_data:
         _parallel: true

         +load_users:
           redshift>: copy/users.sql

         +load_items:
           redshift>: copy/items.sql

   Tasks under the same group run in parallel if the _parallel option is set to true.
16. Loops & parameters

       +send_email_to_active_users:
         td_for_each>: list_active.sql
         _do:
           +send:
             mail>: template.txt
             to: ${td.for_each.addr}

   Loop: generate subtasks dynamically so that Digdag applies the same set of operators to different data sets.
   Parameter: a task can propagate parameters to the tasks that follow it.
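   A small sketch of that parameter propagation, assuming td>'s store_last_results option, which stores the first row of the query results into the ${td.last_results.*} variables; the query file and the column name cnt are hypothetical:

       +count_active_users:
         td>: queries/count_active.sql
         store_last_results: true     # expose the first result row

       +report:
         echo>: active users: ${td.last_results.cnt}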
17. Grouping workflows... Without grouping, the Ingest → Enrich → Model → Load → Utilize pipeline appears as one flat list of +task entries.
18. Grouping workflows: with grouping, the same tasks nest under stage-level groups such as +ingest, +enrich, +model (containing +basket_analysis and +learn), and +load, matching the Ingest → Enrich → Model → Load → Utilize stages. A sketch of the nesting follows below.
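   A sketch of that nesting (group names follow the slide; the leaf tasks and script names are illustrative):

       +ingest:
         +load_logs:
           sh>: ./ingest.sh

       +enrich:
         +remove_bots:
           sh>: ./remove_bots.sh

       +model:
         +basket_analysis:
           sh>: ./basket_analysis.sh
         +learn:
           sh>: ./learn.sh

       +load:
         +build_indexes:
           sh>: ./build_indexes.sh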
19. Pushing workflows to a server with a Docker image

       schedule:
         daily>: 01:30:00
       timezone: Asia/Tokyo

       _export:
         docker:
           image: my_image:latest

       +task:
         sh>: ./run_in_docker

   Digdag server:
   > Develop on a laptop, push it to a server.
   > Workflows run periodically on the server.
   > Backfill
   > Web editor & monitor

   Docker:
   > Install scripts & dependencies in a Docker image, not on the server.
   > Workflows can run anywhere, including a developer's laptop.
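   The push itself is two client commands; a sketch assuming the workflow above is saved as daily.dig inside a project directory named my_analytics (both names are illustrative):

       $ digdag run daily.dig            # run locally on the laptop first
       $ digdag push my_analytics -r v1  # upload the project to the Digdag server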
20. Demo
21. Digdag is production-ready. It's just like a web application: the Digdag client talks to the Digdag server (API & scheduler & executor, with a visual UI), and PostgreSQL holds all task state.
22. Digdag is production-ready: HA. Stateless servers + a replicated DB: the Digdag client goes through an HTTP load balancer to multiple Digdag servers (API & scheduler & executor, visual UI), backed by replicated PostgreSQL that holds all task state.
23. Digdag is production-ready: isolating API and execution for reliability. The Digdag client goes through an HTTP load balancer to API-only Digdag servers, while separate scheduler & executor servers run the tasks; replicated PostgreSQL (HA) holds all task state.
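   A minimal sketch of the server-side wiring behind these diagrams, using Digdag's standard PostgreSQL settings; hostnames and credentials are placeholders, not from the deck:

       # server.properties (placeholder values)
       database.type = postgresql
       database.user = digdag
       database.password = secret
       database.host = db.example.com
       database.port = 5432
       database.database = digdag

       $ digdag server --config server.properties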
24. Digdag at Treasure Data: 3,600 workflows run every day; 28,000 tasks run every day; 850 active workflows; 400,000 workflow executions in total.
25. Digdag & Open Source
26. Learning from my OSS projects • Make it pluggable!
   > 700+ plugins in 6 years
   > 200+ plugins in 3 years
   > 70+ implementations in 8 years
   Plugin types span input/output, parser/formatter, decoder/encoder, filter, and executor.
27. Digdag also has a plugin architecture: 32 operators, 7 schedulers, 2 command executors, and 1 error notification module.
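   Operator plugins can be fetched from a Maven repository at run time via the _export: plugin: section; a sketch with a hypothetical artifact and operator name:

       _export:
         plugin:
           repositories:
             - https://jitpack.io                          # example repository
           dependencies:
             - com.example:digdag-operator-example:0.1.0   # hypothetical plugin

       +use_plugin:
         example>: hello    # operator provided by the hypothetical plugin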
28. Sadayuki Furuhashi. Visit my website: https://digdag.io
