Silk is a framework for building and running complex workflows of distributed data processing. It allows describing dataflows in Scala in a type safe and concise syntax. Silk translates Scala programs into logical plans and schedules the distributed execution through various "weavers" like an in-memory weaver or Hadoop weaver. It performs static and run-time optimizations of dataflows and supports features like fault tolerance, resource monitoring, and UNIX command integration. The goal of Silk is to enable distributed data analysis for all data scientists through an object-oriented programming model.
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Streaming Distributed Data Processing with Silk #deim2014
1. Streaming Distributed Data
Processing with Silk
Taro L. Saito
University of Tokyo
leo@xerial.org
March 3rd, 2014
DEIM2014
xerial.org/silk Twitter @taroleo
1
2. Distributed Data Processing
Streaming Distributed Data Processing with Silk
Translate this data processing program
A
g
f
B
C
into a cluster computing program
g
f
A0
B0
A1
B1
A2
B2
map
xerial.org/silk Twitter @taroleo
C
reduce
2
3. Streaming Distributed Data Processing
Streaming Distributed Data Processing with Silk
What is streaming?
A
f
g
B
F
C
G
D
E
Silk: A framework for building and running complex
workflows of distributed data processing
xerial.org/silk Twitter @taroleo
3
4. Problem Definition
Streaming Distributed Data Processing with Silk
How do we run the distributed data processing while
extending the program?
A
f
g
B
F
C
G
D
xerial.org/silk Twitter @taroleo
E
4
5. Silk
Streaming Distributed Data Processing with Silk
Describing Dataflows in Scala
A dataflow in Silk is a sequence of function calls
Type safe and concise syntax, easy to learn.
Silk[A] : Set of type A object
xerial.org/silk Twitter @taroleo
5
7. Big Data Volumes in Human Genome Analysis
Streaming Distributed Data Processing with Silk
Input: FASTQ file(s) 500GB (50x coverage, 200 million entries)
DNA Sequencer (Illumina, PacBio, etc.)
f: An alignment program
Output: Alignment results 750GB (sequence + alignment data)
Total storage space required: 1.2TB
Computational time required: 1 days (using hundreds of CPUs)
Input
f
Output
University of Tokyo Genome Browser (UTGB)
xerial.org/silk Twitter @taroleo
7
8. Varieties of Scientific Data and Analysis
Streaming Distributed Data Processing with Silk
WormTSS: http://wormtss.utgenome.org/
Integrating various data sources, hundreds of data analysis…
xerial.org/silk Twitter @taroleo
8
9. Produced Thousands of Data Analysis Charts
Streaming Distributed Data Processing with Silk
Using R, JFreeChart, etc.
Need a automated
pipeline to redo the entire
analysis for answering the
paper review within a
month.
xerial.org/silk Twitter @taroleo
9
10. Writing A Dataflow
Streaming Distributed Data Processing with Silk
a Program v1
f
A
B
val B = A.map(f)
Apply function f to the input A, then produce the output B
This step may take more than 1 hours in big data analysis
xerial.org/silk Twitter @taroleo
10
11. Distribution and Fault Tolerance
Streaming Distributed Data Processing with Silk
Resume only B2 = A2.map(f)
a Program v1
f
A
B
f
A0
B0
A1
B1
A2
B2
Failure!
xerial.org/silk Twitter @taroleo
Retry
11
12. Extending Dataflows
Streaming Distributed Data Processing with Silk
Program v2
Program v1
A
f
g
B
C
While running program v1, adding another code (program v2)
How do we reuse the already computed result (B) to generate C?
xerial.org/silk Twitter @taroleo
12
13. Marking to A Program
Streaming Distributed Data Processing with Silk
Program v2
Program v1
A
f
g
B
C
val B = A.map(f)
val C = B.map(g)
Storing intermediate results using variable names
variable names := program markers!!
But, we lost variable names after compilation
Extracting AST and variable names upon compile time
Using Scala Macros (Since Scala 2.10)
xerial.org/silk Twitter @taroleo
13
14. Scala Program (AST) to DAG Schedule (Logical Plan)
Streaming Distributed Data Processing with Silk
Program v2
Program v1
A
g
B
C
Translating a program (AST) into a set of Silk operations (DAG)
f
val B = MapOp(input:A, output:B, function:f)
val C = MapOp(input:B, output:C, function:g)
Operations in Silk can be nested
val C = MapOp(input:MapOp(input:A, output:B, function:f), output:C, function:g)
xerial.org/silk Twitter @taroleo
14
15. Weaving Silks
Streaming Distributed Data Processing with Silk
In-memory weaver
Cluster weaver
Result
Hadoop weaver
Silk[A]
(operation DAG)
Weave
Output
Data analysis code is independent from weavers
xerial.org/silk Twitter @taroleo
15
17. Local machine
Local ClassBox
User program
builds workflows
Weaving Silk materializes objects
classpaths & local jar files
•
•
•
•
•
Silk[A]
Silk[A]
read file, toSilk
map, reduce, join,
groupBy
UNIX commands
etc.
SilkSingle[A]
SilkSeq[A]
weave
weave
Static optimization
A
DAG Schedule
single object
•
•
Cluster
•
•
•
•
Dispatches tasks to clients
Manages master resource table
Authorizes resource allocation
Automatic recovery by
leader election in ZK
Register ClassBox
Submit schedule
ZooKeeper
ensemble mode
(at least 3 ZK instances)
Silk Master
•
•
dispatch
•
•
Silk Client
Silk Client
Task Scheduler
Task Scheduler
Task Executor
Task Executor
Resource Monitor
Resource Monitor
Data Server
Data Server
Leader election
Collects locations of slices
and ClassBox jars
Watches active nodes
Watches available resources
Seq[A]
sequence of objects
Node Table
Slice Table
Task Status
Resource Table
(CPU, memory)
ClassBox Table
•
•
•
•
•
•
•
•
Submits tasks
Run-time optimization
Resource allocation
Monitoring resource usage
Launches Web UI
Manages assigned task status
Object serialization/deserialization
Serves slice data
xerial.org/silk Twitter @taroleo
17
18. Static Optimization
Streaming Distributed Data Processing with Silk
Tree transformation
map(f).map(g) => map(g・f)
(Function composition)
map(f).filter(p) => mapWithFilter(f, p) (Reduces intermediate
data)
Pushing-down selection
Retrieves only accessed fields in an object
Analyzing the byte code of functions with ASM
Rewriting logical plans using pattern matching in Scala
Easy to add optimization rules
xerial.org/silk Twitter @taroleo
18
19. Run-time Optimization
Streaming Distributed Data Processing with Silk
Adjusting the number of data splits
According to the available cluster resources.
Multi-core execution
Omega-based task scheduler
Sharing the cluster resource table between nodes
Each node determines how to use the resource
Monitoring actual CPU/memory resources periodically
xerial.org/silk Twitter @taroleo
19
20. UNIX Command Workflows in Silk
Streaming Distributed Data Processing with Silk
c”(UNIX Command)”
xerial.org/silk Twitter @taroleo
20
22. Summary
Streaming Distributed Data Processing with Silk
Silk
A framework for distributed data processing for all data scientists
Object-oriented data processing programming
Similar to query optimization in DBMS
Analyze Data as You Write Programs!
Reuse, override and mix-in
Optimizing data flow programs
including non-experts in distributed data processing (e.g. Biologists)
Database research now enters program optimization.
In Future
Workflow queries
Making queries against dataflow program
Monitoring intermediate results
Multi-user program execution
xerial.org/silk Twitter @taroleo
22