Streaming Distributed Data Processing with Silk #deim2014

Taro L. Saito
University of Tokyo
leo@xerial.org
March 3rd, 2014
DEIM2014

xerial.org/silk Twitter @taroleo


Distributed Data Processing



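Translate this data processing program
[Figure: A → f → B → g → C]

into a cluster computing program
[Figure: A and B are split into partitions A0/B0, A1/B1, A2/B2; f is applied to each partition (map), and g combines the B partitions into C (reduce)]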

Streaming Distributed Data Processing



What is streaming?

[Figure: a dataflow graph connecting datasets A, B, C, D, and E through functions f, g, F, and G]

Silk: A framework for building and running complex
workflows of distributed data processing


Problem Definition



How do we run the distributed data processing while
extending the program?
[Figure: a dataflow graph connecting datasets A, B, C, D, and E through functions f, g, F, and G]


Silk



Describing Dataflows in Scala


A dataflow in Silk is a sequence of function calls




Type safe and concise syntax, easy to learn.

Silk[A]: a set of objects of type A
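
A minimal sketch of this programming style, using Scala's standard `Seq[A]` as a stand-in for `Silk[A]`. The `Person` type and the sample data are hypothetical, and the actual Silk API may differ:

```scala
// Stand-in sketch: Seq[A] plays the role of Silk[A]; Person and the data are hypothetical.
object DataflowSketch {
  case class Person(name: String, age: Int)

  val people: Seq[Person] =
    Seq(Person("Ann", 34), Person("Bob", 17), Person("Cid", 52))

  // The dataflow is a chain of function calls; every step is statically typed.
  val adults: Seq[Person] = people.filter(_.age >= 18)
  val names: Seq[String]  = adults.map(_.name)
}
```

Here `DataflowSketch.names` holds the names of the two adults, and a type error such as mapping `_.name` to an `Int` would be caught at compile time.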



Object-Oriented Dataflow Programming



Reusing and overriding dataflow programs
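
One way to picture reuse and overriding is to put each dataflow step in a trait member. The sketch below is an illustration only: `WordStats`, `normalize`, and the inputs are hypothetical names, and `Seq` again stands in for Silk:

```scala
// Hypothetical sketch: Seq stands in for Silk; none of these names come from Silk.
object OverrideSketch {
  trait WordStats {
    def input: Seq[String]
    // Each step of the dataflow is a member that subclasses can override.
    def normalize(w: String): String = w.toLowerCase
    def words: Seq[String] = input.map(normalize)
    def counts: Map[String, Int] =
      words.groupBy(identity).map { case (w, ws) => (w, ws.size) }
  }

  // Reuse: the whole dataflow runs on a new input.
  object Basic extends WordStats {
    val input = Seq("Silk", "silk", "Scala")
  }

  // Override: one step is replaced; the downstream steps are reused unchanged.
  object Trimmed extends WordStats {
    val input = Seq(" Silk", "silk ", "Scala")
    override def normalize(w: String): String = super.normalize(w.trim)
  }
}
```

Both `Basic.counts` and `Trimmed.counts` produce the same word counts, because `Trimmed` only changes how a word is normalized.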



Big Data Volumes in Human Genome Analysis

Input: FASTQ file(s) 500GB (50x coverage, 200 million entries)



DNA Sequencer (Illumina, PacBio, etc.)






f: An alignment program
Output: Alignment results 750GB (sequence + alignment data)



Total storage space required: 1.2TB
Computational time required: 1 day (using hundreds of CPUs)

[Figure: Input → f → Output]

University of Tokyo Genome Browser (UTGB)


Varieties of Scientific Data and Analysis



WormTSS: http://wormtss.utgenome.org/

Integrating various data sources, hundreds of data analysis…



Produced Thousands of Data Analysis Charts

Using R, JFreeChart, etc.
Need an automated pipeline to redo the entire analysis in order to answer the paper review within a month.



Writing A Dataflow

Program v1
[Figure: A → f → B]

val B = A.map(f)



Apply function f to the input A, then produce the output B


This step may take more than an hour in big data analysis



Distribution and Fault Tolerance



Program v1
[Figure: A → f → B runs as three partitions, A0 → B0, A1 → B1, and A2 → B2; the task producing B2 fails and is retried]

Resume only B2 = A2.map(f)
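
The idea of resuming only the failed partition can be sketched locally as follows. This is an illustration under stated assumptions, not Silk's recovery mechanism: `mapPartitions`, `flaky`, and the retry limit are hypothetical, and a thrown exception stands in for a node failure:

```scala
import scala.util.{Try, Success, Failure}

object RetrySketch {
  // Apply f to each partition independently; a failed partition is retried
  // on its own (up to maxRetries times) without touching the others.
  def mapPartitions[A, B](parts: Seq[Seq[A]], maxRetries: Int)(f: A => B): Seq[Seq[B]] =
    parts.map { part =>
      def attempt(remaining: Int): Seq[B] =
        Try(part.map(f)) match {
          case Success(result)             => result
          case Failure(_) if remaining > 0 => attempt(remaining - 1)
          case Failure(e)                  => throw e
        }
      attempt(maxRetries)
    }

  // Demo: f fails once on element 5 (which lives in partition A2), then succeeds.
  var failuresLeft = 1
  var calls = 0
  def flaky(x: Int): Int = {
    calls += 1
    if (x == 5 && failuresLeft > 0) {
      failuresLeft -= 1
      throw new RuntimeException("simulated node failure")
    }
    x * 10
  }

  val a = Seq(Seq(1, 2), Seq(3, 4), Seq(5, 6))    // A0, A1, A2
  val b = mapPartitions(a, maxRetries = 2)(flaky) // B0, B1, B2
}
```

In this run `flaky` is called 7 times: 4 calls for A0 and A1, 1 failed call on A2, and 2 calls for the retry of A2. B0 and B1 are never recomputed.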


Extending Dataflows

[Figure: Program v1 computes A → f → B; Program v2 extends it with B → g → C]

While program v1 is running, we add more code (program v2).
How do we reuse the already computed result (B) to generate C?



Adding Markers to a Program

[Figure: Program v1 computes A → f → B; Program v2 extends it with B → g → C]

val B = A.map(f)
val C = B.map(g)


Storing intermediate results using variable names


variable names := program markers!!



But variable names are lost after compilation



Extracting the AST and variable names at compile time


Using Scala Macros (Since Scala 2.10)
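
A rough sketch of how variable names can act as markers for stored results. The macro step is deliberately left out: here the names are passed by hand as strings, which is an assumption made to keep the example self-contained. `named`, `store`, and `evaluations` are hypothetical helpers, not part of Silk:

```scala
import scala.collection.mutable

object MarkerSketch {
  // Results are stored under marker names. In Silk the names would be
  // extracted by a macro; here they are written by hand (an assumption).
  private val store = mutable.Map.empty[String, Any]
  var evaluations = 0

  def named[T](name: String)(compute: => T): T =
    store.get(name) match {
      case Some(v) => v.asInstanceOf[T]   // already computed: reuse it
      case None =>
        evaluations += 1
        val v = compute
        store(name) = v
        v
    }

  val A = Seq(1, 2, 3)
  def f(x: Int): Int = x + 1
  def g(x: Int): Int = x * 2

  // Program v1
  def B: Seq[Int] = named("B")(A.map(f))
  // Program v2, added later: looks up the stored B instead of recomputing it
  def C: Seq[Int] = named("C")(B.map(g))
}
```

Evaluating `B` and then `C` runs each computation once: when `C` asks for `B`, the stored result is returned.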


Scala Program (AST) to DAG Schedule (Logical Plan)

[Figure: Program v1 computes A → f → B; Program v2 extends it with B → g → C]

Translating a program (AST) into a set of Silk operations (DAG)


val B = MapOp(input:A, output:B, function:f)
val C = MapOp(input:B, output:C, function:g)

Operations in Silk can be nested


val C = MapOp(input:MapOp(input:A, output:B, function:f), output:C, function:g)
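
The nested form above can be modeled with ordinary case classes. This is a hypothetical sketch: the class names echo the slide, but the field types (functions are represented by their names as strings) and the `schedule` helper are assumptions, not Silk's actual classes:

```scala
object PlanSketch {
  // Hypothetical logical-plan nodes; not Silk's actual classes.
  sealed trait Op { def output: String }
  case class Source(output: String) extends Op
  case class MapOp(input: Op, output: String, function: String) extends Op

  // val B = A.map(f); val C = B.map(g), written as nested operations
  val c: Op = MapOp(
    input = MapOp(input = Source("A"), output = "B", function = "f"),
    output = "C",
    function = "g")

  // Walk the nested plan from the source to the final output.
  def schedule(op: Op): List[String] = op match {
    case Source(name)       => List(name)
    case MapOp(in, out, fn) => schedule(in) :+ s"$out = ${in.output}.map($fn)"
  }
}
```

Calling `PlanSketch.schedule(PlanSketch.c)` lists the steps in dependency order: the source `A`, then the step producing `B`, then the step producing `C`.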



Weaving Silks

[Figure: Silk[A] (operation DAG) → Weave (in-memory weaver, cluster weaver, or Hadoop weaver) → Result / Output]

Data analysis code is independent of weavers
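
The separation between analysis code and weavers can be sketched as an interface with interchangeable evaluators. Everything here is hypothetical (`Weaver`, `InMemoryWeaver`, `ChunkedWeaver`, and the `Int`-only operation types); the chunked variant is only a local stand-in for a distributed weaver:

```scala
object WeaverSketch {
  // Hypothetical operation DAG over Int data; not Silk's actual types.
  sealed trait Op
  case class Source(data: Seq[Int]) extends Op
  case class MapOp(input: Op, f: Int => Int) extends Op

  // A weaver turns an operation DAG into concrete results.
  trait Weaver { def weave(op: Op): Seq[Int] }

  // Evaluates the whole DAG in the local JVM.
  object InMemoryWeaver extends Weaver {
    def weave(op: Op): Seq[Int] = op match {
      case Source(data) => data
      case MapOp(in, f) => weave(in).map(f)
    }
  }

  // Same plan, different strategy: processes data in fixed-size chunks,
  // a local stand-in for splitting work across nodes.
  class ChunkedWeaver(chunkSize: Int) extends Weaver {
    def weave(op: Op): Seq[Int] = op match {
      case Source(data) => data
      case MapOp(in, f) => weave(in).grouped(chunkSize).flatMap(_.map(f)).toList
    }
  }

  // The analysis code only builds the DAG and never mentions a weaver.
  val plan: Op = MapOp(MapOp(Source(Seq(1, 2, 3, 4, 5)), _ + 1), _ * 2)
}
```

The same `plan` gives identical results under either weaver, so the choice of weaver can be deferred until execution time.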



Cluster Weaver: Logical Plan to Physical Plan on Cluster



Logical plan

GroupByOp(in:people, out:g, key: {_.dept.id})

Physical plan
[Figure: Silk[people] is read as inputs I1 to I3. Stages: partition (hashing) into P1 to P3, scatter into slices S1 to S3, serialization, shuffle, deserialization, and merge sort into results R1 to R3]
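
A single-process sketch of the stages named in the physical plan: hash partitioning on each input split, gathering each partition at one reducer, and ordering by key. It is an illustration only. `partition`, `groupByKey`, and the sample records are hypothetical, serialization is omitted, and `sortBy` stands in for the merge sort:

```scala
object GroupBySketch {
  case class Dept(id: Int)
  case class Person(name: String, dept: Dept)

  // Partition: assign each record of an input split to a reducer by hashing its key.
  def partition[A, K](split: Seq[A], reducers: Int)(key: A => K): Map[Int, Seq[A]] =
    split.groupBy(a => Math.floorMod(key(a).hashCode, reducers))

  // Scatter/shuffle: reducer r gathers partition r from every split,
  // then groups the records and orders the groups by key.
  def groupByKey[A, K: Ordering](splits: Seq[Seq[A]], reducers: Int)(key: A => K): Seq[(K, Seq[A])] = {
    val partitioned = splits.map(s => partition(s, reducers)(key))
    (0 until reducers).flatMap { r =>
      val received = partitioned.flatMap(_.getOrElse(r, Seq.empty[A]))
      received.groupBy(key).toSeq.sortBy(_._1)
    }
  }

  val people = Seq(
    Seq(Person("Ann", Dept(1)), Person("Bob", Dept(2))), // input split I1
    Seq(Person("Cid", Dept(1)), Person("Dee", Dept(3)))  // input split I2
  )

  // Equivalent of GroupByOp(in:people, out:g, key: {_.dept.id})
  val g: Map[Int, Seq[String]] =
    groupByKey(people, reducers = 3)(_.dept.id)
      .map { case (k, ps) => (k, ps.map(_.name)) }
      .toMap
}
```

Because all records with the same key hash to the same reducer, each group is complete after the shuffle even though its members started on different splits.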


[Figure: Silk system architecture]

Local machine
• User program builds workflows: Silk[A]
  • read file, toSilk
  • map, reduce, join, groupBy
  • UNIX commands
  • etc.
• Static optimization produces a DAG schedule
• Weaving Silk materializes objects
  • SilkSingle[A] → weave → A (single object)
  • SilkSeq[A] → weave → Seq[A] (sequence of objects)
• Local ClassBox: classpaths & local jar files
• Register ClassBox, submit schedule

Cluster

Silk Master
• Dispatches tasks to clients
• Manages master resource table
• Authorizes resource allocation
• Automatic recovery by leader election in ZK

ZooKeeper: ensemble mode (at least 3 ZK instances)
• Leader election
• Collects locations of slices and ClassBox jars
• Watches active nodes
• Watches available resources

Tables
• Node Table
• Slice Table
• Task Status
• Resource Table (CPU, memory)
• ClassBox Table

Silk Clients, each with a Task Scheduler, Task Executor, Resource Monitor, and Data Server
• Submits tasks
• Run-time optimization
• Resource allocation
• Monitoring resource usage
• Launches Web UI
• Manages assigned task status
• Object serialization/deserialization
• Serves slice data



Static Optimization



Tree transformation






map(f).map(g) => map(g ∘ f) (function composition)
map(f).filter(p) => mapWithFilter(f, p) (reduces intermediate data)
Pushing down selection
Retrieves only accessed fields in an object




Analyzing the byte code of functions with ASM

Rewriting logical plans using pattern matching in Scala


Easy to add optimization rules
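
The two rewrite rules listed above can be sketched as cases of a Scala pattern match over a small plan type. The plan classes, `optimize`, and `eval` are hypothetical and restricted to `Int` data for brevity; they are not Silk's optimizer:

```scala
object RewriteSketch {
  // Hypothetical logical-plan nodes over Int data.
  sealed trait Op
  case class Source(data: Seq[Int]) extends Op
  case class MapOp(input: Op, f: Int => Int) extends Op
  case class FilterOp(input: Op, p: Int => Boolean) extends Op
  case class MapWithFilterOp(input: Op, f: Int => Int, p: Int => Boolean) extends Op

  // Each optimization rule is one case of a pattern match, applied bottom-up.
  def optimize(op: Op): Op = op match {
    case MapOp(in, g) => optimize(in) match {
      case MapOp(in2, f) => MapOp(in2, f andThen g)      // map(f).map(g) => map(g ∘ f)
      case other         => MapOp(other, g)
    }
    case FilterOp(in, p) => optimize(in) match {
      case MapOp(in2, f) => MapWithFilterOp(in2, f, p)   // map(f).filter(p) => mapWithFilter(f, p)
      case other         => FilterOp(other, p)
    }
    case MapWithFilterOp(in, f, p) => MapWithFilterOp(optimize(in), f, p)
    case s: Source                 => s
  }

  // Reference evaluator used to check that rewriting preserves results.
  def eval(op: Op): Seq[Int] = op match {
    case Source(d)                 => d
    case MapOp(in, f)              => eval(in).map(f)
    case FilterOp(in, p)           => eval(in).filter(p)
    case MapWithFilterOp(in, f, p) => eval(in).iterator.map(f).filter(p).toList
  }

  val plan: Op = FilterOp(MapOp(MapOp(Source(Seq(1, 2, 3, 4)), _ + 1), _ * 3), _ % 2 == 0)
  val optimized: Op = optimize(plan)
}
```

The three-step plan collapses into a single `MapWithFilterOp` directly over the source, and both versions evaluate to the same result. Adding a rule means adding one more `case`.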



Run-time Optimization



Adjusting the number of data splits


According to the available cluster resources.



Multi-core execution



Omega-based task scheduler


Sharing the cluster resource table between nodes




Each node determines how to use the resource

Monitoring actual CPU/memory resources periodically



UNIX Command Workflows in Silk


c"(UNIX Command)"
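
The `c"..."` syntax can be built with Scala's custom string interpolation (available since Scala 2.10). The sketch below only shows that mechanism. `Command`, `commandLine`, and the sample file name are hypothetical, the command is constructed but never executed, and Silk's own implementation may differ:

```scala
object CommandSketch {
  // Hypothetical command node; Silk's own representation may differ.
  case class Command(template: Seq[String], args: Seq[Any]) {
    // Render the command line by interleaving literal parts with argument values.
    def commandLine: String = {
      val sb = new StringBuilder(template.head)
      template.tail.zip(args).foreach { case (part, arg) =>
        sb.append(arg.toString).append(part)
      }
      sb.toString
    }
  }

  // Enables the c"..." syntax through a StringContext extension.
  implicit class CommandInterpolator(val sc: StringContext) extends AnyVal {
    def c(args: Any*): Command = Command(sc.parts, args)
  }

  val input = "reads.fastq"          // hypothetical file name
  val cmd: Command = c"wc -l $input" // builds a command description, does not run it
}
```

Keeping the template and the arguments separate, rather than producing a plain string, leaves room for a framework to track which inputs a command depends on.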



Buffer Management




Silk frequently uses distributed memory (like Spark)
LArray[A]







Immediate memory deallocation (free)




To eliminate OutOfMemoryError and GC stalls

Fast memory allocation




Allocating off-heap memory (outside the JVM heap)
sun.misc.Unsafe
GitHub: https://github.com/xerial/larray

Skips zero-filling

Object Serialization


Extending msgpack





Scala Pickling
Injects serialization/deserialization code

Off-heap objects


Summary



Silk

A framework for distributed data processing for all data scientists
  including non-experts in distributed data processing (e.g. biologists)

Object-oriented data processing programming
  Reuse, override and mix-in

Optimizing data flow programs
  Similar to query optimization in DBMS
  Database research now enters program optimization.

Analyze Data as You Write Programs!

In Future

Workflow queries
  Making queries against dataflow programs
  Monitoring intermediate results

Multi-user program execution


http://xerial.org/silk

