Streaming Distributed Data
Processing with Silk
Taro L. Saito
University of Tokyo
leo@xerial.org
March 3rd, 2014
DEIM2014

xerial.org/silk Twitter @taroleo

Distributed Data Processing



Translate this data processing program
[Figure: data sets A, B, and C connected by functions f and g]

into a cluster computing program
[Figure: partitions A0, A1, A2 and B0, B1, B2 connected by f and g, with a map step and a reduce step producing C]

Streaming Distributed Data Processing



What is streaming?

[Figure: workflow graph; labels A, B, C, D, E, f, g, F, G]

Silk: A framework for building and running complex
workflows of distributed data processing
Problem Definition



How do we run the distributed data processing while
extending the program?
[Figure: workflow graph; labels A, B, C, D, E, f, g, F, G]

Silk



Describing Dataflows in Scala


A dataflow in Silk is a sequence of function calls




Type safe and concise syntax, easy to learn.

Silk[A]: a set of objects of type A
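
As a rough illustration of "a dataflow is a sequence of function calls", the sketch below uses plain Scala collections in place of Silk[A]. The `Person` type and the sample data are made up for the example, and Silk's actual API may differ.

```scala
object DataflowSketch {
  // Hypothetical record type, used only for this illustration
  case class Person(name: String, age: Int)

  // Seq[Person] stands in for Silk[Person]
  val people: Seq[Person] =
    Seq(Person("alice", 34), Person("bob", 19), Person("carol", 52))

  // The dataflow: each step is an ordinary, type-checked function call
  val adults: Seq[Person] = people.filter(_.age >= 20)
  val names: Seq[String] = adults.map(_.name)
}
```

Passing a function of the wrong type to `map` or `filter` here is rejected by the compiler, which is the kind of type safety the slide refers to.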

Object-Oriented Dataflow Programming



Reusing and overriding dataflow programs
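
One way to picture reuse and overriding is with ordinary Scala traits. The sketch below is only an analogy built on standard collections; `WordCount`, `CaseInsensitive`, and `Job` are hypothetical names, not part of Silk.

```scala
object OverrideSketch {
  // A reusable dataflow: subclasses and mix-ins may replace individual steps
  trait WordCount {
    def input: Seq[String]
    def tokenize(line: String): Seq[String] = line.split(" ").toSeq
    def counts: Map[String, Int] =
      input.flatMap(tokenize).groupBy(identity).map { case (w, ws) => (w, ws.size) }
  }

  // Overrides one step and reuses the rest of the dataflow unchanged
  trait CaseInsensitive extends WordCount {
    override def tokenize(line: String): Seq[String] =
      super.tokenize(line.toLowerCase)
  }

  class Job(val input: Seq[String]) extends WordCount

  val plain: Map[String, Int] = new Job(Seq("Silk silk", "weave")).counts
  val folded: Map[String, Int] =
    (new Job(Seq("Silk silk", "weave")) with CaseInsensitive).counts
}
```

Only `tokenize` changes between the two runs; the counting step is inherited as written.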

Big Data Volumes in Human Genome Analysis

Input: FASTQ file(s) 500GB (50x coverage, 200 million entries)



DNA Sequencer (Illumina, PacBio, etc.)






f: An alignment program
Output: Alignment results 750GB (sequence + alignment data)



Total storage space required: 1.2TB
Computational time required: 1 day (using hundreds of CPUs)

[Figure: Input → f → Output]

University of Tokyo Genome Browser (UTGB)
Varieties of Scientific Data and Analysis



WormTSS: http://wormtss.utgenome.org/

Integrating various data sources, hundreds of data analysis…

Produced Thousands of Data Analysis Charts

Using R, JFreeChart, etc.
Need an automated pipeline to redo the entire analysis in order to answer the paper review within a month.

Writing a Dataflow

[Figure: Program v1: A → f → B]
val B = A.map(f)



Apply function f to the input A, then produce the output B


This step may take more than an hour in big data analysis

Distribution and Fault Tolerance



Resume only B2 = A2.map(f)
[Figure: Program v1 (A → f → B) split into A0 → B0, A1 → B1, A2 → B2; the B2 task fails ("Failure!") and is retried]
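
The idea of resuming only the failed partition can be sketched with `scala.util.Try` over in-memory partitions. This is a simplified local model, assuming each partition is an independent `Seq`; `runAll` and `retryFailed` are hypothetical helper names, not Silk functions.

```scala
import scala.util.{Failure, Success, Try}

object RetrySketch {
  // Apply f to each partition independently; a failure affects only that partition
  def runAll[A, B](parts: Seq[Seq[A]], f: A => B): Seq[Try[Seq[B]]] =
    parts.map(p => Try(p.map(f)))

  // Re-run only the partitions that failed, keeping results that already succeeded
  def retryFailed[A, B](
      parts: Seq[Seq[A]],
      results: Seq[Try[Seq[B]]],
      f: A => B
  ): Seq[Try[Seq[B]]] =
    parts.zip(results).map {
      case (_, ok @ Success(_)) => ok
      case (p, Failure(_))      => Try(p.map(f))
    }
}
```

With three partitions where only the last one fails, `retryFailed` recomputes that single partition and leaves the first two results untouched.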
Extending Dataflows

[Figure: Program v1 computes B from A with f; Program v2 extends it to compute C from B with g]

While program v1 is running, we add more code (program v2).
How do we reuse the already computed result (B) to generate C?

Marking a Program

[Figure: Program v1 computes B from A with f; Program v2 extends it to compute C from B with g]

val B = A.map(f)
val C = B.map(g)


Storing intermediate results using variable names


variable names := program markers!!



But variable names are lost after compilation



Extracting the AST and variable names at compile time


Using Scala Macros (Since Scala 2.10)
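
To make the role of markers concrete, the sketch below stores intermediate results under a name and skips recomputation when a result for that name already exists. It is an assumption-laden simplification: the name is passed by hand instead of being extracted by a macro, the store is a local in-memory map, and `marked`, `programV1`, and `programV2` are hypothetical names.

```scala
import scala.collection.mutable

object MarkerSketch {
  // Results of earlier runs, keyed by the marker (the variable name)
  private val store = mutable.Map.empty[String, Any]

  // Counts how often a step was actually computed
  var evaluations = 0

  // Compute `body` only if no result is stored under `name`.
  // The slides obtain the name at compile time with macros; here it is passed by hand.
  def marked[T](name: String)(body: => T): T =
    store.getOrElseUpdate(name, { evaluations += 1; body }).asInstanceOf[T]

  def programV1(a: Seq[Int]): Seq[Int] = marked("B")(a.map(_ + 1))

  def programV2(a: Seq[Int]): Seq[Int] = {
    val b = marked("B")(a.map(_ + 1)) // reused if program v1 already ran
    marked("C")(b.map(_ * 2))
  }
}
```

Running `programV1` and then `programV2` on the same input evaluates the step for B once and the step for C once.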
Scala Program (AST) to DAG Schedule (Logical Plan)

[Figure: Program v1 computes B from A with f; Program v2 extends it to compute C from B with g]

Translating a program (AST) into a set of Silk operations (DAG)






val B = MapOp(input:A, output:B, function:f)
val C = MapOp(input:B, output:C, function:g)

Operations in Silk can be nested


val C = MapOp(input:MapOp(input:A, output:B, function:f), output:C, function:g)
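
The nested operation above can be modeled with case classes. The sketch below is a minimal stand-in, not Silk's real operation types: functions are represented by their names as strings, `Source` is a hypothetical leaf node, and `schedule` simply walks the nesting from the innermost operation outwards.

```scala
object PlanSketch {
  // A minimal logical plan: every node records the marker it produces
  sealed trait Op { def output: String }
  case class Source(output: String) extends Op
  case class MapOp(input: Op, output: String, function: String) extends Op

  // Nested form of: val B = A.map(f); val C = B.map(g)
  val c: Op = MapOp(MapOp(Source("A"), "B", "f"), "C", "g")

  // Walk the nesting from the innermost operation outwards to get an execution order
  def schedule(op: Op): List[String] = op match {
    case Source(name)       => List(name)
    case MapOp(in, out, fn) => schedule(in) :+ s"$out = ${in.output}.map($fn)"
  }
}
```

Because each node carries its output marker, a scheduler can check which markers already have stored results before running a step.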

Weaving Silks

[Figure: Silk[A] (operation DAG) → Weave (in-memory weaver, cluster weaver, or Hadoop weaver) → Result / Output]

Data analysis code is independent from weavers
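
The separation between analysis code and weavers can be sketched as an interface with interchangeable implementations. Everything here is hypothetical and local: `Pipeline` is a one-step stand-in for an operation DAG, and `SlicedWeaver` only imitates partitioned execution by processing fixed-size slices in sequence.

```scala
object WeaverSketch {
  // A dataflow description: input data plus the function to apply
  case class Pipeline[A, B](input: Seq[A], f: A => B)

  // A weaver decides how a description is executed
  trait Weaver {
    def weave[A, B](p: Pipeline[A, B]): Seq[B]
  }

  // Evaluates everything in one pass on the local machine
  object InMemoryWeaver extends Weaver {
    def weave[A, B](p: Pipeline[A, B]): Seq[B] = p.input.map(p.f)
  }

  // Imitates a partitioned run: process fixed-size slices, then concatenate the results
  class SlicedWeaver(sliceSize: Int) extends Weaver {
    def weave[A, B](p: Pipeline[A, B]): Seq[B] =
      p.input.grouped(sliceSize).toList.flatMap(slice => slice.map(p.f))
  }

  // The analysis code mentions no weaver at all
  val squares: Pipeline[Int, Int] = Pipeline(Seq(1, 2, 3, 4, 5), (x: Int) => x * x)
}
```

The same `squares` description yields the same result under either weaver, which is the property the slide states.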

Cluster Weaver: Logical Plan to Physical Plan on Cluster



Logical plan




GroupByOp(in:people, out:g, key: {_.dept.id})

Physical plan
[Figure: physical plan for the GroupByOp above. Silk[people] is divided into inputs I1, I2, I3; the labeled stages are partition (hashing), scatter, serialization, shuffle, deserialization, and merge sort, with slices labeled P1-P3 and S1-S3, producing results R1, R2, R3.]
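
The hashing step of such a plan can be sketched locally. The code below only shows why hash partitioning makes a distributed groupBy possible; it leaves out scatter, serialization, shuffle, and merge sort entirely. `Dept`, `Person`, and both helper functions are hypothetical.

```scala
object GroupBySketch {
  case class Dept(id: Int)
  case class Person(name: String, dept: Dept)

  // Partition step: route each record to one of n partitions by hashing its key
  def partition[A, K](in: Seq[A], n: Int)(key: A => K): Map[Int, Seq[A]] =
    in.groupBy(a => Math.floorMod(key(a).hashCode, n))

  // Each partition holds every record for its keys, so it can be grouped on its own
  def groupBy[A, K](in: Seq[A], n: Int)(key: A => K): Map[K, Seq[A]] =
    partition(in, n)(key).values.flatMap(_.groupBy(key)).toMap
}
```

Since records with equal keys always hash to the same partition, grouping each partition separately and merging the maps gives the same answer as grouping the whole input at once.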
[Figure: Silk system architecture]

Local machine
• Local ClassBox: classpaths & local jar files
• User program builds workflows (Silk[A]): read file, toSilk; map, reduce, join, groupBy; UNIX commands; etc.
• Static optimization produces a DAG schedule
• Weaving Silk materializes objects: SilkSingle[A] weaves to A (single object), SilkSeq[A] weaves to Seq[A] (sequence of objects)
• Register ClassBox and submit schedule to the cluster

Cluster

ZooKeeper, ensemble mode (at least 3 ZK instances)
• Leader election
• Collects locations of slices and ClassBox jars
• Watches active nodes
• Watches available resources
• Tables: Node Table, Slice Table, Task Status, Resource Table (CPU, memory), ClassBox Table

Silk Master
• Dispatches tasks to clients
• Manages master resource table
• Authorizes resource allocation
• Automatic recovery by leader election in ZK

Silk Client (Task Scheduler, Task Executor, Resource Monitor, Data Server)
• Submits tasks
• Run-time optimization
• Resource allocation
• Monitoring resource usage
• Launches Web UI
• Manages assigned task status
• Object serialization/deserialization
• Serves slice data
Static Optimization



Tree transformation






map(f).map(g) => map(g ∘ f) (function composition)
map(f).filter(p) => mapWithFilter(f, p) (reduces intermediate data)
Pushing-down selection
Retrieves only accessed fields in an object




Analyzing the byte code of functions with ASM

Rewriting logical plans using pattern matching in Scala


Easy to add optimization rules
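
A rewrite rule such as map(f).map(g) => map(g ∘ f) can be written as a pattern match over a plan tree. The sketch below uses a deliberately small, untyped plan (`Any => Any` functions) to keep the rule readable; it is not Silk's optimizer, and `Input`, `MapOp`, `optimize`, `depth`, and `run` are hypothetical.

```scala
object RewriteSketch {
  // Untyped logical plan, kept deliberately small for the illustration
  sealed trait Plan
  case class Input(data: Seq[Any]) extends Plan
  case class MapOp(in: Plan, f: Any => Any) extends Plan

  // Rule: map(f).map(g) => map(g ∘ f), applied bottom-up over the tree
  def optimize(p: Plan): Plan = p match {
    case MapOp(in, g) =>
      optimize(in) match {
        case MapOp(inner, f) => MapOp(inner, f.andThen(g))
        case other           => MapOp(other, g)
      }
    case other => other
  }

  // Number of map operations left in the plan
  def depth(p: Plan): Int = p match {
    case MapOp(in, _) => 1 + depth(in)
    case Input(_)     => 0
  }

  // Reference interpreter, used to check that rewriting preserves results
  def run(p: Plan): Seq[Any] = p match {
    case Input(data)  => data
    case MapOp(in, f) => run(in).map(f)
  }
}
```

Adding another rule means adding another `case` to `optimize`, which is the sense in which pattern matching makes rules easy to add.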

Run-time Optimization



Adjusting the number of data splits


According to the available cluster resources.



Multi-core execution



Omega-based task scheduler


Sharing the cluster resource table between nodes




Each node determines how to use the resource

Monitoring actual CPU/memory resources periodically

UNIX Command Workflows in Silk


c"(UNIX Command)"
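
A `c"..."` syntax like this can be built in Scala as a custom string interpolator. The sketch below shows one possible way to define such an interpolator on top of `scala.sys.process`; it is a guess at the mechanism, not Silk's implementation, and `Command` and `CommandInterpolator` are hypothetical names. Running a command assumes a `sh` executable is available.

```scala
import scala.sys.process._

object CommandSketch {
  // A command line built by the hypothetical c"..." interpolator
  case class Command(line: String) {
    // Runs the command through sh and returns its standard output
    def run(): String = Seq("sh", "-c", line).!!
  }

  implicit class CommandInterpolator(val sc: StringContext) {
    // Splice the arguments into the literal parts, like the standard s"..." interpolator
    def c(args: Any*): Command = Command(sc.s(args: _*))
  }

  val file = "input.txt"
  val wc: Command = c"wc -l $file"
}
```

The interpolator only builds a description of the command; nothing is executed until `run()` is called, which is what lets a command take part in a larger dataflow.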

Buffer Management




Silk frequently uses distributed memory (like Spark)

LArray[A]
• Allocates off-heap memory (outside the JVM heap)
  • sun.misc.Unsafe
  • GitHub: https://github.com/xerial/larray
• Immediate memory deallocation (free)
  • To eliminate OutOfMemoryError and GC stalls
• Fast memory allocation
  • Skips zero-filling

Object Serialization
• Extending msgpack
• Scala Pickling
  • Inject ser/dser code
• Off-heap objects
Summary



Silk
• A framework for distributed data processing for all data scientists
  • including non-experts in distributed data processing (e.g. biologists)
• Object-oriented data processing programming
  • Reuse, override and mix-in
• Optimizing data flow programs
  • Similar to query optimization in DBMS
  • Database research now enters program optimization.

Analyze Data as You Write Programs!

In Future
• Workflow queries
  • Making queries against dataflow programs
  • Monitoring intermediate results
• Multi-user program execution
http://xerial.org/silk
Streaming Distributed Data Processing with Silk

xerial.org/silk Twitter @taroleo

23

More Related Content

What's hot

H2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks CloudH2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks CloudSri Ambati
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RDatabricks
 
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkFlink Forward
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesDatabricks
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Spark Summit
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapItai Yaffe
 
Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ...
Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ...Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ...
Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ...Databricks
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Icebergkbajda
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applicationsLars Albertsson
 
Building Data Pipelines in Python
Building Data Pipelines in PythonBuilding Data Pipelines in Python
Building Data Pipelines in PythonC4Media
 
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...Flink Forward
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaDatabricks
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Wes McKinney
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Databricks
 
Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes Ziemowit Jankowski
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsDatabricks
 
The Future of Real-Time in Spark
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in SparkDatabricks
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeItai Yaffe
 
Graph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DBGraph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DBMohamed Taher Alrefaie
 

What's hot (20)

H2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks CloudH2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks Cloud
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean Downes
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
 
Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ...
Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ...Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ...
Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ...
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
 
Building Data Pipelines in Python
Building Data Pipelines in PythonBuilding Data Pipelines in Python
Building Data Pipelines in Python
 
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
 
Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new Directions
 
The Future of Real-Time in Spark
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in Spark
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real time
 
Graph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DBGraph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DB
 

Similar to Streaming Distributed Data Processing with Silk #deim2014

Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo
Weaving Dataflows with Silk - ScalaMatsuri 2014, TokyoWeaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo
Weaving Dataflows with Silk - ScalaMatsuri 2014, TokyoTaro L. Saito
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowDatabricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesDatabricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesSpark Summit
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldRob Gillen
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Building Applications with Streams and Snapshots
Building Applications with Streams and SnapshotsBuilding Applications with Streams and Snapshots
Building Applications with Streams and SnapshotsJ On The Beach
 
Exploring SharePoint with F#
Exploring SharePoint with F#Exploring SharePoint with F#
Exploring SharePoint with F#Talbott Crowell
 
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...Flink Forward
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformEva Tse
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Paulo Gutierrez
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services
 

Similar to Streaming Distributed Data Processing with Silk #deim2014 (20)

Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo
Weaving Dataflows with Silk - ScalaMatsuri 2014, TokyoWeaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo
Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo
 
Flink internals web
Flink internals web Flink internals web
Flink internals web
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache Arrow
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The Field
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Building Applications with Streams and Snapshots
Building Applications with Streams and SnapshotsBuilding Applications with Streams and Snapshots
Building Applications with Streams and Snapshots
 
Exploring SharePoint with F#
Exploring SharePoint with F#Exploring SharePoint with F#
Exploring SharePoint with F#
 
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 

More from Taro L. Saito

Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Taro L. Saito
 
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020Taro L. Saito
 
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020Taro L. Saito
 
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020Taro L. Saito
 
Airframe Meetup #3: 2019 Updates & AirSpec
Airframe Meetup #3: 2019 Updates & AirSpecAirframe Meetup #3: 2019 Updates & AirSpec
Airframe Meetup #3: 2019 Updates & AirSpecTaro L. Saito
 
Presto At Arm Treasure Data - 2019 Updates
Presto At Arm Treasure Data - 2019 UpdatesPresto At Arm Treasure Data - 2019 Updates
Presto At Arm Treasure Data - 2019 UpdatesTaro L. Saito
 
Reading The Source Code of Presto
Reading The Source Code of PrestoReading The Source Code of Presto
Reading The Source Code of PrestoTaro L. Saito
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataTaro L. Saito
 
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018Taro L. Saito
 
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17Taro L. Saito
 
Tips For Maintaining OSS Projects
Tips For Maintaining OSS ProjectsTips For Maintaining OSS Projects
Tips For Maintaining OSS ProjectsTaro L. Saito
 
Learning Silicon Valley Culture
Learning Silicon Valley CultureLearning Silicon Valley Culture
Learning Silicon Valley CultureTaro L. Saito
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure DataTaro L. Saito
 
Scala at Treasure Data
Scala at Treasure DataScala at Treasure Data
Scala at Treasure DataTaro L. Saito
 
Introduction to Presto at Treasure Data
Introduction to Presto at Treasure DataIntroduction to Presto at Treasure Data
Introduction to Presto at Treasure DataTaro L. Saito
 
Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015Taro L. Saito
 
Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例Taro L. Saito
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringTaro L. Saito
 

Streaming Distributed Data Processing with Silk #deim2014

  • 1. Streaming Distributed Data Processing with Silk. Taro L. Saito, University of Tokyo, leo@xerial.org. March 3rd, 2014, DEIM2014. xerial.org/silk, Twitter @taroleo
  • 2. Distributed Data Processing: Translate this data processing program into a cluster computing program. [Diagram: f maps A to B and g maps B to C; on the cluster, f runs over partitions A0, A1, A2 to produce B0, B1, B2 (map), and g produces C (reduce).]
  • 3. Streaming Distributed Data Processing: What is streaming? [Diagram: a dataflow graph over datasets A to G, with functions f and g.] Silk: a framework for building and running complex workflows of distributed data processing.
  • 4. Problem Definition: How do we run distributed data processing while extending the program? [Diagram: a dataflow graph over datasets A to G, with functions f and g.]
  • 5. Silk: Describing dataflows in Scala. A dataflow in Silk is a sequence of function calls. Type-safe and concise syntax, easy to learn. Silk[A]: a set of objects of type A.
  • 6. Object-Oriented Dataflow Programming: Reusing and overriding dataflow programs.
  • 7. Big Data Volumes in Human Genome Analysis: Input: FASTQ file(s), 500GB (50x coverage, 200 million entries). DNA sequencer (Illumina, PacBio, etc.). f: an alignment program. Output: alignment results, 750GB (sequence + alignment data). Total storage space required: 1.2TB. Computational time required: 1 day (using hundreds of CPUs). [Diagram: Input, f, Output.] University of Tokyo Genome Browser (UTGB).
  • 8. Varieties of Scientific Data and Analysis: WormTSS (http://wormtss.utgenome.org/). Integrating various data sources and hundreds of data analyses…
  • 9. Produced Thousands of Data Analysis Charts: Using R, JFreeChart, etc. We need an automated pipeline to redo the entire analysis in order to respond to the paper review within a month.
  • 10. Writing a Dataflow: Program v1: val B = A.map(f). Apply function f to the input A, then produce the output B. This step may take more than 1 hour in big data analysis.
  • 11. Distribution and Fault Tolerance: Program v1 applies f to A to produce B; on the cluster, f runs on A0, A1, A2 to produce B0, B1, B2. On a failure, retry: resume only B2 = A2.map(f).
  • 12. Extending Dataflows: [Diagram: program v1 computes B from A with f; program v2 adds g to compute C from B.] While program v1 is running, we add more code (program v2). How do we reuse the already computed result (B) to generate C?
  • 13. Marking a Program: val B = A.map(f); val C = B.map(g). Store intermediate results under their variable names: variable names := program markers! But variable names are lost after compilation, so the AST and variable names are extracted at compile time using Scala macros (available since Scala 2.10).
  • 14. Scala Program (AST) to DAG Schedule (Logical Plan): Translating a program (AST) into a set of Silk operations (DAG): val B = MapOp(input:A, output:B, function:f); val C = MapOp(input:B, output:C, function:g). Operations in Silk can be nested: val C = MapOp(input:MapOp(input:A, output:B, function:f), output:C, function:g).
  • 15. Weaving Silks: [Diagram: Silk[A] (operation DAG) is woven by an in-memory weaver, a cluster weaver, or a Hadoop weaver to produce the result.] Data analysis code is independent of the weavers.
  • 16. Cluster Weaver: Logical Plan to Physical Plan on Cluster: Logical plan: GroupByOp(in:people, out:g, key: {_.dept.id}). Physical plan: [Diagram: Silk[people] with inputs I1 to I3, partitions P1 to P3, slices S1 to S3, and results R1 to R3; stage labels: partition (hashing), scatter, serialization, shuffle, deserialization, merge sort.]
  • 17. [Architecture diagram.] Local machine: the user program builds workflows (read file, toSilk; map, reduce, join, groupBy; UNIX commands; etc.). Local ClassBox: classpaths and local jar files. Weaving a Silk materializes objects: SilkSingle[A] weaves to A (a single object), SilkSeq[A] weaves to Seq[A] (a sequence of objects). Static optimization produces a DAG schedule; the local machine registers the ClassBox and submits the schedule. Cluster: Silk Master: dispatches tasks to clients, manages the master resource table, authorizes resource allocation, automatic recovery by leader election in ZK. ZooKeeper in ensemble mode (at least 3 ZK instances): leader election, collects locations of slices and ClassBox jars, watches active nodes, watches available resources; holds the Node Table, Slice Table, Task Status, Resource Table (CPU, memory), and ClassBox Table. Silk Clients (each with a Task Scheduler, Task Executor, Resource Monitor, and Data Server): submits tasks, run-time optimization, resource allocation, monitoring resource usage, launches Web UI, manages assigned task status, object serialization/deserialization, serves slice data.
  • 18. Static Optimization: Tree transformation: map(f).map(g) => map(g ∘ f) (function composition); map(f).filter(p) => mapWithFilter(f, p) (reduces intermediate data). Pushing down selection: retrieves only the accessed fields of an object, by analyzing the byte code of functions with ASM. Rewriting logical plans using pattern matching in Scala: easy to add optimization rules.
  • 19. Run-time Optimization: Adjusting the number of data splits according to the available cluster resources. Multi-core execution. Omega-based task scheduler: the cluster resource table is shared between nodes, and each node determines how to use the resources. Monitoring actual CPU/memory resources periodically.
  • 20. UNIX Command Workflows in Silk: c"(UNIX Command)"
  • 21. Buffer Management: Silk frequently uses distributed memory (like Spark). LArray[A] (GitHub: https://github.com/xerial/larray): allocates off-heap memory (outside the JVM heap) with sun.misc.Unsafe; immediate memory deallocation (free), to eliminate OutOfMemoryError and GC stalls; fast memory allocation that skips zero-filling. Object serialization: extending msgpack; Scala Pickling; injecting ser/deser code; off-heap objects.
  • 22. Summary: Silk: a framework for distributed data processing for all data scientists, including non-experts in distributed data processing (e.g., biologists). Object-oriented data processing programming: reuse, override, and mix-in. Optimizing dataflow programs: similar to query optimization in DBMS. Analyze data as you write programs! Database research now enters program optimization. In the future: workflow queries (making queries against dataflow programs), monitoring intermediate results, multi-user program execution.
  • 23. http://xerial.org/silk
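
The tree transformation map(f).map(g) => map(g ∘ f) from slide 18, applied to nested operations like those on slide 14, can be sketched as a pattern-matching rewrite over a small logical plan. This is only an illustration under stated assumptions, not Silk's actual code: the Op, Source, and MapOp types and the fuse, run, and depth helpers below are hypothetical, and the plan is deliberately untyped (Any) to keep the sketch short.

```scala
// Hypothetical, simplified logical-plan nodes (not Silk's real classes).
sealed trait Op
final case class Source(data: Seq[Any]) extends Op
final case class MapOp(input: Op, f: Any => Any) extends Op

// Rewrite rule: map(f).map(g) => map(g compose f), applied until no
// two MapOp nodes are directly nested.
def fuse(op: Op): Op = op match {
  case MapOp(MapOp(in, f), g) => fuse(MapOp(in, f.andThen(g)))
  case MapOp(in, f)           => MapOp(fuse(in), f)
  case s: Source              => s
}

// A trivial in-memory evaluator, used here only to check that the
// rewritten plan computes the same result as the original one.
def run(op: Op): Seq[Any] = op match {
  case Source(data) => data
  case MapOp(in, f) => run(in).map(f)
}

// Number of MapOp stages in a plan.
def depth(op: Op): Int = op match {
  case Source(_)    => 0
  case MapOp(in, _) => 1 + depth(in)
}

val inc: Any => Any = (x: Any) => x.asInstanceOf[Int] + 1
val dbl: Any => Any = (x: Any) => x.asInstanceOf[Int] * 2

// Plan for A.map(inc).map(dbl), then its fused form.
val plan  = MapOp(MapOp(Source(Seq(1, 2, 3)), inc), dbl)
val fused = fuse(plan)
```

Here fused contains a single MapOp stage, so no intermediate collection is built between the two functions, while run(fused) and run(plan) return the same values. Further rules would be added as extra cases in the same match expression.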