Workflow Hacks #1 - dots. Tokyo

Workflow Hacks #1

Published in: Engineering

  1. Workflow Hacks! #1 • Taro L. Saito, leo@treasure-data.com • Dec. 14, 2015 • dots., Tokyo, Japan
  2. Workflow Hacks! #1
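  3. Questionnaire • A questionnaire will be sent by email after the event • Questions • What kind of system are you currently using? • What problems do you want to solve with workflows? • Respondents will be entered in a drawing for a Treasure Data hoodie!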
  4. About Me: Taro L. Saito • 2007: Ph.D., University of Tokyo • XML DBMS, Transaction Processing • Relational-Style XML Query [SIGMOD 2008] • ~2014: Assistant Professor at University of Tokyo • Genome Science Research • Big Data Processing • Distributed Computing • 2014.03~: Treasure Data, Inc., Tokyo • 2015.07~: Treasure Data, Inc., Mountain View, CA
  5. Cloud Platform for Data Analytics • Importing 1,000,000+ records / sec. • Presto (Distributed SQL engine) • 50,000+ queries / day • Processing 10 trillion records / day • http://qiita.com/xerial/items/a9093b60062f2c613fda • [Figure labels: Import, Store, Analyze with Presto/Hive (Distributed SQL Engine), Export, Enterprise Data, BI]
  6. Workflow Fundamental Features • Dependency management • task1 -> task2 -> task3 … • Scheduling • Execution monitoring • State management • Error handling • Easy access to logs • Notification
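
The dependency-management bullet (task1 -> task2 -> task3) can be illustrated with a minimal Python sketch. It relies on the standard library's graphlib.TopologicalSorter (Python 3.9+). The function run_workflow and the task names are hypothetical and are not taken from any tool mentioned in the talk; scheduling, monitoring, and error handling are deliberately left out.

```python
from graphlib import TopologicalSorter


def run_workflow(dependencies, actions):
    """Run every task after all of its upstream tasks; return the run order."""
    order = []
    # static_order() yields a task only after every task it depends on.
    for name in TopologicalSorter(dependencies).static_order():
        actions[name]()
        order.append(name)
    return order


log = []
# Each task maps to the set of tasks it depends on: task1 -> task2 -> task3.
dependencies = {"task1": set(), "task2": {"task1"}, "task3": {"task2"}}
# Hypothetical task bodies that only record that they ran.
actions = {name: (lambda n=name: log.append(f"ran {n}")) for name in dependencies}

order = run_workflow(dependencies, actions)
print(order)
```

A real workflow manager would add the other features on the slide (state, retries, logs, notification) around this loop.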
  7. Workflow Tools • Workflow Management Tools • Python: Luigi, Airflow, Pinball • For Hadoop: Oozie (XML) • Script-based: Makefile, Azkaban • Biological Science: Galaxy (Web UI), Nextflow • Domestic (Japan): JP1, Hinemos • Dataflow DSL • Spark, Flink, DryadLINQ, TensorFlow • Cascading (Java -> MR), Scalding (Scala -> MR)
  8. Dataflow DSL • Translate this data processing program into a cluster computing program • [Figure: a dataflow A -> f -> B -> g -> C and its partitioned form (A0, A1, A2; B0, B1, B2; C) with map and reduce stages]
  9. Redbook: Dataflow Engines • Chapter 5: Large-Scale Dataflow Engines, by Peter Bailis • http://www.redbook.io/ch5-dataflow.html • DryadLINQ • Most influential interface for dataflow DSL • SQL-like operation • Functional style • Spark • Spark SQL • 70% of Spark accesses • Dataset API • Shift to the DataFrame-based API
  10. Dataflow -> Execution Plan • Example - Hive: SQL to MapReduce • Mapping SQL stages into a MapReduce program • SELECT page, count(*) FROM weblog GROUP BY page • [Figure: HDFS input and output with split, map, reduce, and merge stages; plan: TableScan(weblog) -> GroupBy(hash(page)) -> count(weblog of a page) -> result]
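
As a rough illustration of how the query above maps onto map and reduce stages, here is a small in-memory Python sketch. It is not Hive's actual planner or execution code; the function names, the sample weblog rows, and the number of reducers are all assumptions made for the example.

```python
from collections import defaultdict


def map_phase(records):
    # TableScan(weblog): emit (page, 1) for every row.
    for record in records:
        yield record["page"], 1


def shuffle(pairs, num_reducers):
    # GroupBy(hash(page)): route every key to exactly one reducer partition.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def reduce_phase(partitions):
    # count(*): sum the ones collected for each page, then merge the partitions.
    result = {}
    for partition in partitions:
        for key, values in partition.items():
            result[key] = sum(values)
    return result


# Hypothetical sample rows standing in for the weblog table.
weblog = [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]
counts = reduce_phase(shuffle(map_phase(weblog), num_reducers=2))
print(counts)
```

Because each page hashes to a single partition, every reducer can count its pages independently of the others.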
  11. Workflows • [Figure: workflow graph with nodes A, B, C, D, E, F, G and functions f, g]
  12. Hadoop is not enough • C. Olston et al. [SIGMOD 2011] • Continuous processing • Independent scheduling • Incremental processing • Google Percolator [OSDI 2010] • Naiad - Differential Dataflow, Microsoft [SOSP 2013]
  13. Continuous Processing • The Dataflow Model • Akidau et al., Google [VLDB 2015] • Unbounded data processing • Late-arriving data • Integration of • batch processing • accumulation
  14. Cluster Computing with Dryad • M. Budiu, 2008
  15. Cluster Computing with Dryad • M. Budiu, 2008 • Workflow Hacks!
  16. Airflow
  17. Airflow • Best practices with Airflow - An open source platform for workflows & schedules (Nov 2015) • At Silicon Valley Data Engineering Meetup • https://youtu.be/dgaoqOZlvEA
  18. Workflow Development • Programmatic • Generate workflows by code • Configuration as Code • Workflow reuse/override • Object-oriented • Parameterization
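
The ideas on this slide (workflows generated by code, reuse and override through classes, parameterization) might look like the following Python sketch. The classes DailyReport and HourlyReport, their methods, and the table name are hypothetical; this is not the API of Airflow, Luigi, or any other tool named in the talk.

```python
class DailyReport:
    """A workflow expressed as a class; each method describes one task."""

    def __init__(self, date, table="weblog"):
        # Parameterization: the same workflow runs for any date or table.
        self.date = date
        self.table = table

    def extract(self):
        return f"extract {self.table} for {self.date}"

    def aggregate(self):
        return f"aggregate {self.table} by page"

    def tasks(self):
        return [self.extract(), self.aggregate()]


class HourlyReport(DailyReport):
    """Reuse the daily workflow and override a single task."""

    def aggregate(self):
        return f"aggregate {self.table} by page and hour"


# Generate workflows by code: one instance per date.
workflows = [DailyReport(date=f"2015-12-{day:02d}") for day in (13, 14)]
hourly_tasks = HourlyReport(date="2015-12-14").tasks()
print(hourly_tasks)
```

Subclassing keeps the shared steps in one place, so a variant workflow only states what differs.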
  19. Luigi • Workflow management with Luigi (article in Japanese) • http://qiita.com/k24d/items/fb9bed08423e6249d376
  20. Nextflow • http://www.nextflow.io/
  21. Dataflow DSL vs Workflow DSL • Dataflow • A -> B -> C -> … • Data dependencies • Workflow • Task A -> Task B -> Task C -> … • Task dependencies • Data transfer is optional (through file or DB) • + Scheduling • + Task names • For monitoring, redo, etc.
  22. Weavelet (wvlet) • Object-oriented workflow DSL for Scala • Workflow reuse, extension, override • Parameterization • Function := Task, Workflow := Class
  23. Isolating DAG generation from its execution • Alternatives to MapReduce • Tez • Pig on Spark https://issues.apache.org/jira/browse/PIG-4059 • Asakusa on Hadoop, Spark • [Figure: the DSL generates a DAG; backends Local, Hadoop, and Spark produce the result]
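
A minimal Python sketch of this separation, under stated assumptions: the DSL only builds a description of the computation, and separate functions inspect or execute it. Node, plan, and run_local are hypothetical names, and the example uses a linear chain, the simplest kind of DAG. A Hadoop or Spark backend would be another function over the same Node structure.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Node:
    """One step of the DAG: a name, a function, and a link to its input."""
    name: str
    fn: Optional[Callable] = None
    parent: Optional["Node"] = None

    def map(self, name, fn):
        # Building the DAG does not run anything; it only records the step.
        return Node(name, fn, self)


def plan(node):
    """Return the step names from source to sink without executing them."""
    names = []
    while node is not None:
        names.append(node.name)
        node = node.parent
    return list(reversed(names))


def run_local(node, data):
    """One possible backend: evaluate the DAG in-process on a Python list."""
    if node.parent is None:
        return list(data)
    return [node.fn(x) for x in run_local(node.parent, data)]


dag = Node("A").map("f", lambda x: x + 1).map("g", lambda x: x * 2)
print(plan(dag))
print(run_local(dag, [1, 2, 3]))
```

Because the DAG is plain data, the same definition can be explained, tested locally, or handed to a different execution engine.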
  24. Stream DSL • Add "moving stream" support to Dataflow DSL • "Moving" streams and "resting" datasets • Example • Spark Streaming • Spark DSL + Micro-batch for stream • Microsoft Azure Stream SQL • Windowing support for moving data • Norikra • Stream processing with SQL • Reactive programming • ReactiveX (Netflix), Akka Streams (beta) <- Stream DSL (DAG) • Back-pressure support • Controlling data transfer speed from the receiver side
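
Back-pressure, as described in the last bullet, can be sketched in Python with a bounded buffer: the producer blocks whenever the receiver has not yet drained the queue, so the receiver sets the pace. This is only one simple form of back-pressure and does not reproduce the demand-signaling protocols of ReactiveX or Akka Streams; the buffer size and function names are assumptions.

```python
import queue
import threading


def produce(q, items):
    for item in items:
        q.put(item)  # blocks while the queue is full, slowing the sender
    q.put(None)      # sentinel marking the end of the stream


def consume(q):
    received = []
    while True:
        item = q.get()  # each get() frees one slot for the producer
        if item is None:
            break
        received.append(item)
    return received


# At most two items may be in flight between producer and consumer.
buffer = queue.Queue(maxsize=2)
producer = threading.Thread(target=produce, args=(buffer, range(10)))
producer.start()
received = consume(buffer)
producer.join()
print(received)
```

With an unbounded queue the producer could run arbitrarily far ahead; the maxsize bound is what ties its speed to the consumer.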
  25. Task Execution Retry • Design patterns for retries and idempotency (article in Japanese) • http://frsyuki.hatenablog.com/entry/2014/06/09/164559 • System failures • Process is not responding • Network, hardware failures • Middleware failures • Provisioning failures, missing components • User failures • Wrong configuration • Programming error
  26. Retry Example • Example: Task calling a REST API /create/xxx • Client: First attempt • Server returns 200 Success • But the client fails to receive the status code • Client retries the task • Gets a 409 Conflict error (entry xxx has already been created) • Solution (application side) • Handle the 409 error as success in the client (idempotent execution) • Stricter approach • Make xxx unique for each request
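
The retry scenario above can be simulated with a short Python sketch. FakeServer is a hypothetical in-memory stand-in for the REST service that loses its first response, and create_idempotent is an assumed client helper; neither comes from the talk or from a real library.

```python
class FakeServer:
    """In-memory stand-in for the /create/xxx endpoint (hypothetical)."""

    def __init__(self):
        self.entries = set()
        self.lose_first_response = True

    def create(self, name):
        if name in self.entries:
            return 409  # Conflict: the entry already exists
        self.entries.add(name)
        if self.lose_first_response:
            self.lose_first_response = False
            # The entry was created, but the client never sees the 200.
            raise ConnectionError("response lost")
        return 200


def create_idempotent(server, name, max_attempts=3):
    """Return True once the entry is known to exist on the server."""
    for _ in range(max_attempts):
        try:
            status = server.create(name)
        except ConnectionError:
            continue  # outcome unknown, so retry
        if status in (200, 409):
            return True  # a 409 on retry means an earlier attempt succeeded
        raise RuntimeError(f"unexpected status {status}")
    return False


server = FakeServer()
created = create_idempotent(server, "xxx")
print(created, server.entries)
```

Treating 409 as success is only safe when no other client can create the same entry, which is why the slide's stricter approach makes xxx unique for each request.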
  27. Fault Tolerance • Presto: Distributed query engine developed by Facebook • Uses HTTP data transfer • No fault tolerance • 99.5% of queries finish without any failure • For queries processing 10 billion or more rows, this drops to 85% • [Figure: split, map, reduce, and merge stages; plan: TableScan(weblog) -> GroupBy(hash(page)) -> count(weblog of a page) -> result]
  28. Summary • Recent workflow tools • Driven by the Python community • Because of this book! [book image on the slide] • Airflow, Luigi, etc. • Workflow manager • Handles system failures, monitoring • Workflow development • DAG-based DSL (dataflow, workflow, stream processing) -> Execution • Does not cover application logic errors • Idempotent execution • Requires splitting large tasks into smaller ones
