Weaving Dataflows with Silk 
Taro L. Saito 
Treasure Data, Inc. 
leo@xerial.org 
 
September 6th, 2014  
ScalaMatsuri @ Tokyo 
xerial.org/silk
About Me 
Treasure Data Console 
Processing Job Table 
Functional Style Writing 
Need an Optimization? 
Procedural Style Writing 
• Describes how to process data.
Declarative Style Writing 
• Less programming
• The system decides how to optimize the code
• Hash joins, Bloom filters, and various other optimization techniques become available.
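As a rough illustration of why the declarative style leaves room for optimization, the sketch below computes the same join twice: once with the strategy hard-coded as nested loops, and once as a hash join that a system would be free to choose. The names Person, Order, nestedLoopJoin and hashJoin are hypothetical and are not part of Silk.

```scala
object JoinSketch {
  // Hypothetical record types, used only for this illustration.
  final case class Person(id: Int, name: String)
  final case class Order(personId: Int, item: String)

  // Procedural style: the code itself fixes the strategy (nested loops).
  def nestedLoopJoin(ps: Seq[Person], os: Seq[Order]): Seq[(String, String)] =
    for (p <- ps; o <- os if p.id == o.personId) yield (p.name, o.item)

  // One strategy a system could pick for the same join: build a hash table
  // on the join key, then probe it once per person.
  def hashJoin(ps: Seq[Person], os: Seq[Order]): Seq[(String, String)] = {
    val byId = os.groupBy(_.personId)
    ps.flatMap(p => byId.getOrElse(p.id, Seq.empty).map(o => (p.name, o.item)))
  }
}
```

Both functions return the same pairs; only a program that states "join on id" rather than spelling out the loops lets the system swap one strategy for the other.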
Weaving Silk 
• Making data processing code independent of the execution method!
[Figure: Silk[A] (operation DAG) → Weave (execute) → Silk product (result). Weavers: in-memory weaver, cluster weaver (Spark?), MapReduce weaver, your own weaver (using TD?)]
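A minimal sketch of the idea, assuming a much smaller model than the real Silk API: the operation DAG only describes the computation, and a weaver decides how to run it. The types FromSeq, MapOp, FilterOp, Weaver, InMemoryWeaver and ChunkedWeaver below are illustrative stand-ins, not Silk's actual classes.

```scala
object WeaverSketch {
  // A tiny operation DAG: it describes what to compute, not how.
  sealed trait Silk[A]
  final case class FromSeq[A](data: Seq[A]) extends Silk[A]
  final case class MapOp[A, B](input: Silk[A], f: A => B) extends Silk[B]
  final case class FilterOp[A](input: Silk[A], p: A => Boolean) extends Silk[A]

  // A weaver is one way of executing a Silk.
  trait Weaver {
    def weave[A](silk: Silk[A]): Seq[A]
  }

  // Evaluates everything eagerly in local memory.
  object InMemoryWeaver extends Weaver {
    def weave[A](silk: Silk[A]): Seq[A] = silk match {
      case FromSeq(data)      => data
      case MapOp(input, f)    => weave(input).map(f)
      case FilterOp(input, p) => weave(input).filter(p)
    }
  }

  // Processes map steps chunk by chunk, standing in for partitioned execution.
  final class ChunkedWeaver(chunkSize: Int) extends Weaver {
    require(chunkSize > 0, "chunkSize must be positive")
    def weave[A](silk: Silk[A]): Seq[A] = silk match {
      case FromSeq(data)      => data
      case MapOp(input, f)    => weave(input).grouped(chunkSize).flatMap(_.map(f)).toSeq
      case FilterOp(input, p) => weave(input).filter(p)
    }
  }
}
```

The same plan object can be handed to either weaver and yields the same result, which is the independence the slide asks for.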
Cluster Weaver: Logical Plan to Physical Plan on Cluster 
• Physical plan on the cluster
[Figure: physical plan for Silk[people]: scatter → partition (hashing) → serialization → shuffle → deserialization → merge sort; nodes labeled I1–I3, P1–P3, S1–S3, R1–R3]
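The stages in the figure can be sketched on a single machine as follows. This is only an approximation under stated assumptions: serialization and network transfer are left out, and the function names partition, merge and shuffle are made up for the example rather than taken from Silk.

```scala
object ShuffleSketch {
  // Split one input slice into n partitions by hashing a key.
  def partition[A](input: List[A], n: Int)(key: A => Any): Vector[List[A]] = {
    val groups = input.groupBy(a => Math.floorMod(key(a).hashCode, n))
    Vector.tabulate(n)(i => groups.getOrElse(i, Nil))
  }

  // Merge two already-sorted lists into one sorted list.
  def merge[A](xs: List[A], ys: List[A])(implicit ord: Ordering[A]): List[A] =
    (xs, ys) match {
      case (Nil, _) => ys
      case (_, Nil) => xs
      case (x :: xt, y :: yt) =>
        if (ord.lteq(x, y)) x :: merge(xt, ys) else y :: merge(xs, yt)
    }

  // Partition and sort every input slice, then let "reducer" r collect
  // partition r from each slice and merge the sorted runs.
  def shuffle[A](inputs: List[List[A]], n: Int)(key: A => Any)(
      implicit ord: Ordering[A]): Vector[List[A]] = {
    val sortedRuns = inputs.map(in => partition(in, n)(key).map(_.sorted))
    Vector.tabulate(n) { r =>
      sortedRuns.map(_(r)).foldLeft(List.empty[A])((acc, run) => merge(acc, run))
    }
  }
}
```

Each output partition holds all values whose key hashes to it, in sorted order, regardless of which input slice they came from.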
DAG-based Data Processing Engines 
• Spark
  • Creates a task schedule for distributed processing
• Summingbird
  • Integrates stream and batch data processing
  • e.g., running Scalding and Storm at the same time
• Apache Tez
  • Creates a DAG schedule for optimizing MapReduce pipelines
• GNU Make (Makefile)
  • Describes a pipeline of UNIX commands
Why do we need another framework?
Challenge: Isolate Code Writing and Its Execution 
• Why can't we run the program until we finish writing it?
• How can we depart from the compile-then-run paradigm?
[Figure: Silk[A] (operation DAG) → Weave (execute) → Silk product; a weaver produces the results]
Genome Science Is a Big Data Science
• By sequencing, we can find 3 million SNPs for each person
• To find the cause of a disease (one or a few SNPs), we need to sequence as many samples as possible to narrow down the candidate SNPs

• Input: FASTQ file(s), 500 GB (50x coverage, 200 million entries)
  • DNA sequencer (Illumina, PacBio, etc.)
• f: an alignment program
• Output: alignment results, 750 GB (sequence + alignment data)
• Total storage space required: 1.2 TB
[Figure: Input → f → Output; University of Tokyo Genome Browser (UTGB)]
Human Genome Data Processing Workflows in Silk 
• c"(UNIX command)"
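A c"..." prefix like the one above can be provided by a custom string interpolator. The sketch below shows the general mechanism with a hypothetical CommandOp node that only records the command line and does not run it. It is an assumption about the shape of such an API, not Silk's implementation, and the command and file name are placeholders.

```scala
object CommandInterpolation {
  // Hypothetical operation node: holds a command line without running it.
  final case class CommandOp(commandLine: String)

  // Adds a c"..." interpolator to string literals.
  implicit class CommandHelper(private val sc: StringContext) extends AnyVal {
    // Substitute the $-arguments the same way the standard s"..." does.
    def c(args: Any*): CommandOp = CommandOp(sc.s(args: _*))
  }

  // Placeholder values for illustration.
  val ref = "hg19.fa"
  val index: CommandOp = c"bwa index $ref"
}
```

Because the interpolator returns an operation object instead of executing the command, the command becomes one more node that a weaver can schedule later.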
Human Genome Data Processing Workflows 
• Makefile: the result ($@) is stored in a file
• Silk: the result is stored in a variable
• Each command may take an hour or more to compute
SBT: A Good Hint 
• SBT
  • Supports incremental compilation and testing
  • sbt ~test-only
    • Monitors source code changes
    • Runs specific tests
  • sbt ~test-quick
    • Runs failed tests only

[Figure: a dataflow DAG with nodes A–G and functions f and g]
• How do we compute the not-yet-started part of a Scala program?
• We need to know:
  • A-B and D-E are running
  • If B is finished, we can start B-C
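The bookkeeping described above can be sketched as a status table over the steps of the DAG. The Step and Status types and the ready function are illustrative assumptions; how Silk tracks task status internally is not shown on the slide.

```scala
object ReadySteps {
  sealed trait Status
  case object Pending extends Status
  case object Running extends Status
  case object Finished extends Status

  // A step "A-B" means: compute B from A.
  final case class Step(input: String, output: String)

  // A step can start when it is still pending and its input is either
  // a source (produced by no step) or the output of a finished step.
  def ready(steps: Seq[Step], status: Map[Step, Status]): Seq[Step] = {
    val finishedOutputs = steps.filter(s => status(s) == Finished).map(_.output).toSet
    val producedBySomeStep = steps.map(_.output).toSet
    steps.filter { s =>
      status(s) == Pending &&
      (finishedOutputs(s.input) || !producedBySomeStep(s.input))
    }
  }
}
```

While A-B and D-E are running, nothing else is ready; once A-B is marked finished, B-C becomes the next step to start.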
Writing a Dataflow
• Apply function f to the input A to produce the output B
• This step may take more than an hour in big data analysis

val B = A.map(f)

[Figure: program v1: A → f → B]
Distribution and Recovery 
• Resume only B2 = A2.map(f)
[Figure: program v1 (A → f → B) distributed as A0 → B0, A1 → B1, A2 → B2; after a failure, only the failed partition is retried]
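A minimal sketch of partition-level recovery, assuming the input is already split into partitions and that a failure surfaces as an exception. The helper names mapPartitions and retryFailed are hypothetical.

```scala
import scala.util.{Failure, Success, Try}

object PartitionRetry {
  // Apply f to each partition independently and keep per-partition outcomes.
  def mapPartitions[A, B](parts: Vector[Seq[A]])(f: A => B): Vector[Try[Seq[B]]] =
    parts.map(p => Try(p.map(f)))

  // Re-run only the partitions whose previous attempt failed;
  // successful results are reused as they are.
  def retryFailed[A, B](parts: Vector[Seq[A]], previous: Vector[Try[Seq[B]]])(
      f: A => B): Vector[Try[Seq[B]]] =
    parts.zip(previous).map {
      case (_, ok @ Success(_)) => ok
      case (p, Failure(_))      => Try(p.map(f))
    }
}
```

With three partitions where only the last one fails, the retry calls f on the elements of that partition alone.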
Extending Dataflows 
• While program v1 is running, we may want to add more code (program v2)
• We need to know that variable B is currently being processed
[Figure: A → f → B (program v1), extended with B → g → C in program v2]
Labeling Program with Variable Names 
• Store intermediate results using variable names
  • variable names := program markers
• But variable names are lost after compilation
• Solution: extract variable names from the AST at compile time
  • Using Scala macros (since Scala 2.10)

val B = A.map(f)
val C = B.map(g)

[Figure: A → f → B (program v1), extended with B → g → C in program v2]
Scala Program (AST) to DAG Schedule (Logical Plan) 
• Translate a program into a set of Silk operation objects
  • val B = MapOp(input:A, output:"B", function:f)
  • val C = MapOp(input:B, output:"C", function:g)
• Operations in Silk form a DAG
  • val C = MapOp(input:MapOp(input:A, output:"B", function:f), output:"C", function:g)
[Figure: A → f → B (program v1), extended with B → g → C in program v2]
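The nesting of operation objects can be written out in plain Scala roughly as below. This is a simplified sketch: Source is a made-up leaf node, functions are represented only by their names as strings, and schedule is a hypothetical helper that lists the steps in dependency order.

```scala
object DagSketch {
  // Each operation knows the name of the variable it is assigned to.
  sealed trait Op { def output: String }
  final case class Source(output: String) extends Op
  final case class MapOp(input: Op, output: String, function: String) extends Op

  // Walk from the final operation back to its inputs and
  // list the steps in the order they have to run.
  def schedule(op: Op): List[String] = op match {
    case Source(name)       => List(s"load $name")
    case MapOp(in, out, fn) => schedule(in) :+ s"$out = ${in.output}.map($fn)"
  }

  val A = Source("A")
  val B = MapOp(input = A, output = "B", function = "f")
  val C = MapOp(input = B, output = "C", function = "g")
}
```

Since C holds a reference to B, and B to A, the value C alone is enough to recover the whole plan.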
Using Scala Macros 
• Produce operation objects with Scala macros
  • map(f: A => B) produces MapOp[A, B](…)
• Why do we need a macro here?
  • To extract the FContext (target variable name, enclosing method, class, etc.) from the AST.
Extract target variable name and enclosing method 
Finding Target Variable 
• Translate a program into a set of Silk operation objects
  • val B = MapOp(input:A, output:"B", function:f)
  • val C = MapOp(input:B, output:"C", function:g)
• Silk uses these variable names to store the intermediate data
[Figure: A → f → B (program v1), extended with B → g → C in program v2]
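To illustrate what storing intermediate data under variable names buys, here is a small sketch in which the names are passed by hand. In Silk they would come from the macro-extracted context; the NamedResults object and its materialize method are assumptions made for this example.

```scala
import scala.collection.mutable

object NamedResults {
  // Hypothetical store: variable name -> materialized result.
  private val store = mutable.Map.empty[String, Seq[Int]]

  // Counts how many steps were actually computed (not served from the store).
  var evaluations = 0

  // Compute the step only if no result is stored under this name yet.
  def materialize(name: String)(compute: => Seq[Int]): Seq[Int] =
    store.getOrElseUpdate(name, { evaluations += 1; compute })
}
```

When program v2 asks for B again, the stored result is reused and only the new step C is computed.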
• Silk defines various types of operations
Object-Oriented Dataflow Programming 
• Reusing and overriding dataflows
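One way to picture reuse and overriding is to package a dataflow as a class whose steps are methods. The sketch below uses plain Scala collections instead of Silk, and the WordCount example is hypothetical. It only shows the object-oriented pattern, not Silk's API.

```scala
object OoDataflow {
  // A dataflow packaged as a class: each step is a method,
  // so a subclass can replace one step and keep the rest.
  class WordCount(lines: Seq[String]) {
    def tokenize(line: String): Seq[String] = line.split("\\s+").toSeq
    def words: Seq[String] = lines.flatMap(tokenize)
    def counts: Map[String, Int] =
      words.groupBy(identity).map { case (w, ws) => w -> ws.size }
  }

  // Reuses the whole flow and overrides only the tokenization step.
  class CaseInsensitiveWordCount(lines: Seq[String]) extends WordCount(lines) {
    override def tokenize(line: String): Seq[String] =
      super.tokenize(line.toLowerCase)
  }
}
```

The subclass inherits words and counts unchanged, so a variant of an analysis needs only the step that differs.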
Summary 
• Declarative-style coding is necessary for creating a DAG schedule
• DAG schedules are labeled with variable names using Scala macros
• Weaver: an abstraction of how to execute the code
• The weaver manages the running and finished parts of the code
[Figure: Silk[A] (operation DAG) → Weave (execute) → Silk product; weavers such as the cluster weaver produce the results]
http://xerial.org/silk 
Copyright ©2014 Treasure Data. All Rights Reserved.
WE ARE HIRING!
www.treasuredata.com
Silk[A] 
Weaving Silk materializes objects 
Resource Table 
(CPU, memory) 
User program 
builds workflows 
Static optimization 
DAG Schedule 
• read file, toSilk 
• map, reduce, join, 
• groupBy 
• UNIX commands 
• etc. 
• Register ClassBox 
• Submit schedule 
Silk Master 
dispatch 
Silk Client 
ZooKeeper 
Node Table 
Slice Table 
Task Scheduler 
Task Status 
Task Executor 
Resource Monitor 
Silk Client 
Task Scheduler 
Task Executor 
Resource Monitor 
ensemble mode 
(at least 3 ZK instances) 
• Leader election 
• Collects locations of slices 
and ClassBox jars 
• Watches active nodes 
• Watches available resources 
• Submits tasks 
• Run-time optimization 
• Resource allocation 
• Monitoring resource usage 
• Launches Web UI 
• Manages assigned task status 
• Object serialization/deserialization 
• Serves slice data 
Local ClassBox 
classpaths  local jar files 
ClassBox Table 
weave 
• Dispatches tasks to clients 
• Manages master resource table 
• Authorizes resource allocation 
• Automatic recovery by 
leader election in ZK 
Data Server 
Data Server 
Silk[A]: SilkSingle[A], SilkSeq[A]
• weave: SilkSingle[A] → A (single object)
• weave: SilkSeq[A] → Seq[A] (sequence of objects)
Local machine 
Cluster 
Integrating Varieties of Data Sources 
• WormTSS: http://wormtss.utgenome.org/
• Integrating various data sources
Varieties of Data Analysis 
• Using R, JFreeChart, etc.
• We need an automated pipeline to redo the entire analysis in order to answer the paper review within a month.
Makefile 
• Describes dependencies between commands through files
• Good: we can resume and update the dataflow processing
• Bad: the Makefile for the WormTSS analysis exceeds 1,000 lines
Splitting Data Analysis Into Command Modules 
• We added a new command whenever we needed a new analysis or data-processing step
• The result:
  • Hundreds of commands!
  • The number of files limits the parallelism
