SlideShare a Scribd company logo
1 of 31
Download to read offline
Workflow Hacks! #1
Taro L. Saito

leo@treasure-data.com
Dec. 14, 2015
dots. Tokyo, Japan
Workflow Hacks! #1
2
アンケート
• 終了後 メールにてアンケートを送付します
• 質問内容
• 現在、どのようなシステムを使っているか?
• ワークフローでどのような問題を解決したいか?
• 回答いただいた方に、抽選でTreasure Dataパーカー
をプレゼント!
3
About Me: Taro L. Saito
4
2007 University of Tokyo. Ph.D.
XML DBMS, Transaction Processing
Relational-Style XML Query [SIGMOD 2008]
~ 2014 Assistant Professor at University of Tokyo
Genome Science Research
- Big Data Processing
- Distributed Computing
2014.03~ Treasure Data, Inc. Tokyo
2015.07~ Treasure Data, Inc. 

Mountain View, CA
Cloud Platform for Data Analytics
8
• Importing 1,000,000~ records / sec.
• Presto (Distributed SQL engine)
• 50,000~ queries / day
• Processing 10 trillion records / day
• http://qiita.com/xerial/items/a9093b60062f2c613fda
Import Export
Store
Analyze with
Presto/Hive
(Distributed SQL Engine)
Enterp
Enterprise
Data
BI
Workflow Fundamental Features
• Dependency management
• task1 -> task2 -> task3 …
• Scheduling
• Execution monitoring
• State management
• Error handling
• Easy access to logs
• Notification
9
Workflow Tools
• Workflow Management Tools
• Python: Luigi, Airflow, pinball
• For Hadoop: Oozie (XML)
• Script-based: Makefile, Azkaban
• Biological Science: Galaxy (Web UI), nextflow
• Domestic: JP1, Hinemos
• Dataflow DSL
• Spark, Flink, DriadLINQ, TensorFlow
• Cascading (Java -> MR), Scalding (Scala -> MR)
10
Dataflow DSL
• Translate this data processing program
• into a cluster computing program
11
A B
A0
A1
A2
B1
B2
f
B0
C
C
g
map reduce
f g
Redbook: Dataflow Engines
• Chapter 5: Large-Scale Dataflow Engine, by Peter Bailis
• http://www.redbook.io/ch5-dataflow.html
• DryadLINQ
• Most influential interface

for dataflow DSL
• SQL-like operation
• Functional style
• Spark
• SparkSQL
• 70% of Spark accesses
• Dataset API
• Shift to the dataframe based API
12
Dataflow -> Execution Plan
• Example - Hive: SQL to MapReduce
• Mapping SQL stages into MapReduce program
• SELECT page, count(*) FROM weblog

GROUP BY page
13
HDFS
A0
B0
A1
A2
B
B1
B2
B3
A
map reduce mergesplit
HDFS
TableScan(weblog)
GroupBy(hash(page))
count(weblog of a page)
result
Workflows
14
A
f
B C
g
D E
F
G
Hadoop is not enough
• C. Olston et al. [SIGMOD 2011]
• continuous processing
• independent scheduling
• Incremental processing
• Google Parcolator [OSDI 2010]
• Naiad - Differential Workflow

Microsoft [SOSP 2013]
15
Continuous Processing
• The Dataflow Model
• Akidau et al., Google [VLDB2015]
• Unbounded data processing
• late-coming data
• Integration of
• batch processing
• accumulation
16
Cluster Computing with Dryad 

M. Budiu, 2008
Cluster Computing with Dryad 

M. Budiu, 2008
Workflow Hacks!
Airflow
19
Airflow
• Best practices with Airflow - An open source
platform for workflows & schedules (Nov 2015)
• At Silicon Valley Data Engineering Meetup
• https://youtu.be/dgaoqOZlvEA
20
Workflow Development
• Programmatic
• Generate workflows by code
• Configuration as Code
• Workflow reuse/overwrite
• object oriented
• Parameterization
21
Luigi
• Luigiによるワークフロー管理
• http://qiita.com/k24d/items/
fb9bed08423e6249d376
22
Nextflow
• http://www.nextflow.io/
23
Dataflow DSL vs Workflow DSL
• Dataflow
• A -> B -> C -> …
• Data dependencies
• Workflow
• Task A -> Task B -> Task C -> …
• Task dependencies
• Data transfer is optional (through file or DB)
• + Scheduling
• + Task names
• For monitoring, redo, etc.
24
Weavelet (wvlet)
• Object-oriented workflow DSL for Scala
• Workflow reuse, extension, override
• Parameterization
• Function := Task, Workflow := Class
25
Isolating DAG generation and its execution
• Alternatives of MR
• Tez
• Pig on Spark https://issues.apache.org/jira/browse/PIG-4059
• Asakusa on Hadoop, Spark
26
Local
Hadoop
Spark
Result
DSL generates DAG
Stream DSL
• Add “moving stream” support to Dataflow DSL
• ”moving" streams and "resting" datasets
• Example
• Spark Streaming
• Spark DSL + Micro-batch for stream
• Microsoft Azure Stream SQL
• Windowing support for moving data
• Norikra
• Stream processing with SQL
• Reactive programming
• ReactiveX (Netflix), Akka Streaming (beta)  <- Stream DSL (DAG)
• Back-pressure support
• Controlling data transfer speed from receiver side
27
Task Execution Retry
• リトライと冪等性のデザインパターン
• http://frsyuki.hatenablog.com/entry/2014/06/09/164559
• System failures
• Process is not responding
• network, hardware failures
• Middleware failures
• provisioning failures, missing components
• User failures
• Wrong configuration
• Programming error
28
Retry Example
• Example: Task calling a REST API /create/xxx
• Client: First attempt
• Server returns 200 Success
• But failed to get the status code
• Client retries the task
• Get 409 conflict error (entry xxx is already created)
• Solution (Application side)
• Handle 409 error as success in the client (idempotent
execution)
• More strict approach
• Making xxx unique for each request
29
Fault Tolerance
• Presto: Distributed query engine developed by Facebook
• Uses HTTP data transfer
• No fault-tolerance
• 99.5% of queries finishes without any failure
• For queries processing 10 billions or more rows => Drops to 85%
30
A0
B0
A1
A2
B
B1
B2
B3
A
map reduce mergesplit
TableScan(weblog)
GroupBy(hash(page))
count(weblog of a page)
result
Summary
• Recent workflow tools
• Driven by Python community
• Because of this book! (=>)
• Airflow, Luigi, etc.
• Workflow manager
• Handle system failures, monitoring
• Workflow development
• DAG based DSL (dataflow, workflow, stream processing) -> Execution
• Does not cover application logic errors
• Idempotent execution
• Requires splitting large tasks into smaller ones
31

More Related Content

What's hot

Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes
Ziemowit Jankowski
 

What's hot (20)

Introduction to Presto at Treasure Data
Introduction to Presto at Treasure DataIntroduction to Presto at Treasure Data
Introduction to Presto at Treasure Data
 
20140120 presto meetup_en
20140120 presto meetup_en20140120 presto meetup_en
20140120 presto meetup_en
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
 
Presto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @FacebookPresto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @Facebook
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real time
 
Spark Summit EU 2015: Reynold Xin Keynote
Spark Summit EU 2015: Reynold Xin KeynoteSpark Summit EU 2015: Reynold Xin Keynote
Spark Summit EU 2015: Reynold Xin Keynote
 
Presto at Twitter
Presto at TwitterPresto at Twitter
Presto at Twitter
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
 
Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes
 
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander ZaitsevWebinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
 
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
Visualizing big data in the browser using spark
Visualizing big data in the browser using sparkVisualizing big data in the browser using spark
Visualizing big data in the browser using spark
 
Building Data Pipelines in Python
Building Data Pipelines in PythonBuilding Data Pipelines in Python
Building Data Pipelines in Python
 
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan ZhangExperiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of Music
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
Functional architectural patterns
Functional architectural patternsFunctional architectural patterns
Functional architectural patterns
 
Presto updates to 0.178
Presto updates to 0.178Presto updates to 0.178
Presto updates to 0.178
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure Data
 

Viewers also liked

Rd gatewayによるwindowsインスタンスへの接続
Rd gatewayによるwindowsインスタンスへの接続Rd gatewayによるwindowsインスタンスへの接続
Rd gatewayによるwindowsインスタンスへの接続
Amazon Web Services Japan
 

Viewers also liked (11)

Apache Airflow入門 (マーケティングデータ分析基盤技術勉強会)
Apache Airflow入門  (マーケティングデータ分析基盤技術勉強会)Apache Airflow入門  (マーケティングデータ分析基盤技術勉強会)
Apache Airflow入門 (マーケティングデータ分析基盤技術勉強会)
 
xrdpを使ったお手軽BYOD環境の構築
xrdpを使ったお手軽BYOD環境の構築xrdpを使ったお手軽BYOD環境の構築
xrdpを使ったお手軽BYOD環境の構築
 
Apache Hbase バルクロードの使い方
Apache Hbase バルクロードの使い方Apache Hbase バルクロードの使い方
Apache Hbase バルクロードの使い方
 
Yahoo!Japan北米DCでOCPのツボをみせてもらってきました - OpenStack最新情報セミナー 2016年5月
Yahoo!Japan北米DCでOCPのツボをみせてもらってきました - OpenStack最新情報セミナー 2016年5月Yahoo!Japan北米DCでOCPのツボをみせてもらってきました - OpenStack最新情報セミナー 2016年5月
Yahoo!Japan北米DCでOCPのツボをみせてもらってきました - OpenStack最新情報セミナー 2016年5月
 
xrdpで変える!社内のPC環境
xrdpで変える!社内のPC環境xrdpで変える!社内のPC環境
xrdpで変える!社内のPC環境
 
並列データベースシステムの概念と原理
並列データベースシステムの概念と原理並列データベースシステムの概念と原理
並列データベースシステムの概念と原理
 
Z Lab社におけるOpenStack × Kubernetesの活用 〜アプリケーション開発者からみた課題解決 - OpenStack最新情報セミナー...
Z Lab社におけるOpenStack × Kubernetesの活用 〜アプリケーション開発者からみた課題解決  - OpenStack最新情報セミナー...Z Lab社におけるOpenStack × Kubernetesの活用 〜アプリケーション開発者からみた課題解決  - OpenStack最新情報セミナー...
Z Lab社におけるOpenStack × Kubernetesの活用 〜アプリケーション開発者からみた課題解決 - OpenStack最新情報セミナー...
 
OCP, Kubernetes ハイパースケールアーキテクチャ 導入の道のり - OpenStack最新情報セミナー(2016年7月)
OCP, Kubernetes  ハイパースケールアーキテクチャ 導入の道のり - OpenStack最新情報セミナー(2016年7月)OCP, Kubernetes  ハイパースケールアーキテクチャ 導入の道のり - OpenStack最新情報セミナー(2016年7月)
OCP, Kubernetes ハイパースケールアーキテクチャ 導入の道のり - OpenStack最新情報セミナー(2016年7月)
 
EmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤とEmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤と
 
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
 
Rd gatewayによるwindowsインスタンスへの接続
Rd gatewayによるwindowsインスタンスへの接続Rd gatewayによるwindowsインスタンスへの接続
Rd gatewayによるwindowsインスタンスへの接続
 

Similar to Workflow Hacks #1 - dots. Tokyo

Similar to Workflow Hacks #1 - dots. Tokyo (20)

Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
 
Workshop on Google Cloud Data Platform
Workshop on Google Cloud Data PlatformWorkshop on Google Cloud Data Platform
Workshop on Google Cloud Data Platform
 
JavaOne_2010
JavaOne_2010JavaOne_2010
JavaOne_2010
 
Background processing with hangfire
Background processing with hangfireBackground processing with hangfire
Background processing with hangfire
 
Intro to Big Data - Spark
Intro to Big Data - SparkIntro to Big Data - Spark
Intro to Big Data - Spark
 
Data lake – On Premise VS Cloud
Data lake – On Premise VS CloudData lake – On Premise VS Cloud
Data lake – On Premise VS Cloud
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 
The Autobahn Has No Speed Limit - Your XPages Shouldn't Either!
The Autobahn Has No Speed Limit - Your XPages Shouldn't Either!The Autobahn Has No Speed Limit - Your XPages Shouldn't Either!
The Autobahn Has No Speed Limit - Your XPages Shouldn't Either!
 
SharePoint Connections Conference Amsterdam - Pitfalls and success factors of...
SharePoint Connections Conference Amsterdam - Pitfalls and success factors of...SharePoint Connections Conference Amsterdam - Pitfalls and success factors of...
SharePoint Connections Conference Amsterdam - Pitfalls and success factors of...
 
Presto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkPresto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talk
 
Untangling - fall2017 - week 9
Untangling - fall2017 - week 9Untangling - fall2017 - week 9
Untangling - fall2017 - week 9
 
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
 
UK Community day 20180427 Microsoft Flow hackathon
UK Community day 20180427 Microsoft Flow hackathonUK Community day 20180427 Microsoft Flow hackathon
UK Community day 20180427 Microsoft Flow hackathon
 
DOTNET8.pptx
DOTNET8.pptxDOTNET8.pptx
DOTNET8.pptx
 
Datastage Online Training
Datastage Online TrainingDatastage Online Training
Datastage Online Training
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
Practical automation for beginners
Practical automation for beginnersPractical automation for beginners
Practical automation for beginners
 
PGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan Pachenko
PGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan PachenkoPGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan Pachenko
PGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan Pachenko
 
SharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi VončinaSharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi Vončina
 

More from Taro L. Saito

Spark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in JapanSpark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in Japan
Taro L. Saito
 

More from Taro L. Saito (20)

Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
 
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
 
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
 
Airframe RPC
Airframe RPCAirframe RPC
Airframe RPC
 
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
 
Airframe Meetup #3: 2019 Updates & AirSpec
Airframe Meetup #3: 2019 Updates & AirSpecAirframe Meetup #3: 2019 Updates & AirSpec
Airframe Meetup #3: 2019 Updates & AirSpec
 
Presto At Arm Treasure Data - 2019 Updates
Presto At Arm Treasure Data - 2019 UpdatesPresto At Arm Treasure Data - 2019 Updates
Presto At Arm Treasure Data - 2019 Updates
 
Reading The Source Code of Presto
Reading The Source Code of PrestoReading The Source Code of Presto
Reading The Source Code of Presto
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
 
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018
 
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17
 
Tips For Maintaining OSS Projects
Tips For Maintaining OSS ProjectsTips For Maintaining OSS Projects
Tips For Maintaining OSS Projects
 
Learning Silicon Valley Culture
Learning Silicon Valley CultureLearning Silicon Valley Culture
Learning Silicon Valley Culture
 
Scala at Treasure Data
Scala at Treasure DataScala at Treasure Data
Scala at Treasure Data
 
Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例
 
JNuma Library
JNuma LibraryJNuma Library
JNuma Library
 
Treasure Dataを支える技術 - MessagePack編
Treasure Dataを支える技術 - MessagePack編Treasure Dataを支える技術 - MessagePack編
Treasure Dataを支える技術 - MessagePack編
 
Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo
Weaving Dataflows with Silk - ScalaMatsuri 2014, TokyoWeaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo
Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo
 
Spark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in JapanSpark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in Japan
 
Silkによる並列分散ワークフロープログラミング
Silkによる並列分散ワークフロープログラミングSilkによる並列分散ワークフロープログラミング
Silkによる並列分散ワークフロープログラミング
 

Recently uploaded

Query optimization and processing for advanced database systems
Query optimization and processing for advanced database systemsQuery optimization and processing for advanced database systems
Query optimization and processing for advanced database systems
meharikiros2
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
AldoGarca30
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ssuser89054b
 
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
HenryBriggs2
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
mphochane1998
 

Recently uploaded (20)

Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
 
Post office management system project ..pdf
Post office management system project ..pdfPost office management system project ..pdf
Post office management system project ..pdf
 
Linux Systems Programming: Inter Process Communication (IPC) using Pipes
Linux Systems Programming: Inter Process Communication (IPC) using PipesLinux Systems Programming: Inter Process Communication (IPC) using Pipes
Linux Systems Programming: Inter Process Communication (IPC) using Pipes
 
Query optimization and processing for advanced database systems
Query optimization and processing for advanced database systemsQuery optimization and processing for advanced database systems
Query optimization and processing for advanced database systems
 
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
 
Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...
 
Worksharing and 3D Modeling with Revit.pptx
Worksharing and 3D Modeling with Revit.pptxWorksharing and 3D Modeling with Revit.pptx
Worksharing and 3D Modeling with Revit.pptx
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
UNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxUNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptx
 
AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech students
 
Computer Graphics Introduction To Curves
Computer Graphics Introduction To CurvesComputer Graphics Introduction To Curves
Computer Graphics Introduction To Curves
 
Path loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata ModelPath loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata Model
 
Introduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdfIntroduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdf
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 

Workflow Hacks #1 - dots. Tokyo

  • 1. Workflow Hacks! #1 Taro L. Saito
 leo@treasure-data.com Dec. 14, 2015 dots. Tokyo, Japan
  • 3. アンケート • 終了後 メールにてアンケートを送付します • 質問内容 • 現在、どのようなシステムを使っているか? • ワークフローでどのような問題を解決したいか? • 回答いただいた方に、抽選でTreasure Dataパーカー をプレゼント! 3
  • 4. About Me: Taro L. Saito 4 2007 University of Tokyo. Ph.D. XML DBMS, Transaction Processing Relational-Style XML Query [SIGMOD 2008] ~ 2014 Assistant Professor at University of Tokyo Genome Science Research - Big Data Processing - Distributed Computing 2014.03~ Treasure Data, Inc. Tokyo 2015.07~ Treasure Data, Inc. 
 Mountain View, CA
  • 5.
  • 6.
  • 7.
  • 8. Cloud Platform for Data Analytics 8 • Importing 1,000,000~ records / sec. • Presto (Distributed SQL engine) • 50,000~ queries / day • Processing 10 trillion records / day • http://qiita.com/xerial/items/a9093b60062f2c613fda Import Export Store Analyze with Presto/Hive (Distributed SQL Engine) Enterp Enterprise Data BI
  • 9. Workflow Fundamental Features • Dependency management • task1 -> task2 -> task3 … • Scheduling • Execution monitoring • State management • Error handling • Easy access to logs • Notification 9
  • 10. Workflow Tools • Workflow Management Tools • Python: Luigi, Airflow, pinball • For Hadoop: Oozie (XML) • Script-based: Makefile, Azkaban • Biological Science: Galaxy (Web UI), nextflow • Domestic: JP1, Hinemos • Dataflow DSL • Spark, Flink, DriadLINQ, TensorFlow • Cascading (Java -> MR), Scalding (Scala -> MR) 10
  • 11. Dataflow DSL • Translate this data processing program • into a cluster computing program 11 A B A0 A1 A2 B1 B2 f B0 C C g map reduce f g
  • 12. Redbook: Dataflow Engines • Chapter 5: Large-Scale Dataflow Engine, by Peter Bailis • http://www.redbook.io/ch5-dataflow.html • DryadLINQ • Most influential interface
 for dataflow DSL • SQL-like operation • Functional style • Spark • SparkSQL • 70% of Spark accesses • Dataset API • Shift to the dataframe based API 12
  • 13. Dataflow -> Execution Plan • Example - Hive: SQL to MapReduce • Mapping SQL stages into MapReduce program • SELECT page, count(*) FROM weblog
 GROUP BY page 13 HDFS A0 B0 A1 A2 B B1 B2 B3 A map reduce mergesplit HDFS TableScan(weblog) GroupBy(hash(page)) count(weblog of a page) result
  • 15. Hadoop is not enough • C. Olston et al. [SIGMOD 2011] • continuous processing • independent scheduling • Incremental processing • Google Parcolator [OSDI 2010] • Naiad - Differential Workflow
 Microsoft [SOSP 2013] 15
  • 16. Continuous Processing • The Dataflow Model • Akidau et al., Google [VLDB2015] • Unbounded data processing • late-coming data • Integration of • batch processing • accumulation 16
  • 17. Cluster Computing with Dryad 
 M. Budiu, 2008
  • 18. Cluster Computing with Dryad 
 M. Budiu, 2008 Workflow Hacks!
  • 20. Airflow • Best practices with Airflow - An open source platform for workflows & schedules (Nov 2015) • At Silicon Valley Data Engineering Meetup • https://youtu.be/dgaoqOZlvEA 20
  • 21. Workflow Development • Programmatic • Generate workflows by code • Configuration as Code • Workflow reuse/overwrite • object oriented • Parameterization 21
  • 24. Dataflow DSL vs Workflow DSL • Dataflow • A -> B -> C -> … • Data dependencies • Workflow • Task A -> Task B -> Task C -> … • Task dependencies • Data transfer is optional (through file or DB) • + Scheduling • + Task names • For monitoring, redo, etc. 24
  • 25. Weavelet (wvlet) • Object-oriented workflow DSL for Scala • Workflow reuse, extension, override • Parameterization • Function := Task, Workflow := Class 25
  • 26. Isolating DAG generation and its execution • Alternatives of MR • Tez • Pig on Spark https://issues.apache.org/jira/browse/PIG-4059 • Asakusa on Hadoop, Spark 26 Local Hadoop Spark Result DSL generates DAG
  • 27. Stream DSL • Add “moving stream” support to Dataflow DSL • ”moving" streams and "resting" datasets • Example • Spark Streaming • Spark DSL + Micro-batch for stream • Microsoft Azure Stream SQL • Windowing support for moving data • Norikra • Stream processing with SQL • Reactive programming • ReactiveX (Netflix), Akka Streaming (beta)  <- Stream DSL (DAG) • Back-pressure support • Controlling data transfer speed from receiver side 27
  • 28. Task Execution Retry • リトライと冪等性のデザインパターン • http://frsyuki.hatenablog.com/entry/2014/06/09/164559 • System failures • Process is not responding • network, hardware failures • Middleware failures • provisioning failures, missing components • User failures • Wrong configuration • Programming error 28
  • 29. Retry Example • Example: Task calling a REST API /create/xxx • Client: First attempt • Server returns 200 Success • But failed to get the status code • Client retries the task • Get 409 conflict error (entry xxx is already created) • Solution (Application side) • Handle 409 error as success in the client (idempotent execution) • More strict approach • Making xxx unique for each request 29
  • 30. Fault Tolerance • Presto: Distributed query engine developed by Facebook • Uses HTTP data transfer • No fault-tolerance • 99.5% of queries finishes without any failure • For queries processing 10 billions or more rows => Drops to 85% 30 A0 B0 A1 A2 B B1 B2 B3 A map reduce mergesplit TableScan(weblog) GroupBy(hash(page)) count(weblog of a page) result
  • 31. Summary • Recent workflow tools • Driven by Python community • Because of this book! (=>) • Airflow, Luigi, etc. • Workflow manager • Handle system failures, monitoring • Workflow development • DAG based DSL (dataflow, workflow, stream processing) -> Execution • Does not cover application logic errors • Idempotent execution • Requires splitting large tasks into smaller ones 31