Streaming Distributed Data Processing with Silk #deim2014

A framework written in Scala for describing distributed data processing programs.

Published in: Technology

Streaming Distributed Data Processing with Silk #deim2014

  1. 1. Streaming Distributed Data Processing with Silk
     Taro L. Saito, University of Tokyo, leo@xerial.org
     March 3rd, 2014, DEIM2014
     xerial.org/silk, Twitter @taroleo
  2. 2. Distributed Data Processing
     • Translate this data processing program (A → f → B → g → C) into a cluster computing program
     • [Diagram: f is applied to partitions A0, A1, A2 to produce B0, B1, B2 (map); g combines them into C (reduce)]
  3. 3. Streaming Distributed Data Processing
     • What is streaming?
     • [Diagram: a dataflow over data sets A to E connected by functions f, g, F, and G]
     • Silk: a framework for building and running complex workflows of distributed data processing
  4. 4. Problem Definition
     • How do we keep running distributed data processing while the program is being extended?
     • [Diagram: a dataflow over data sets A to E connected by functions f, g, F, and G]
  5. 5. Silk
     • Describing dataflows in Scala
     • A dataflow in Silk is a sequence of function calls
     • Type-safe and concise syntax, easy to learn
     • Silk[A]: a set of objects of type A
  6. 6. Object-Oriented Dataflow Programming
     • Reusing and overriding dataflow programs
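
The slide only names the idea of reusing and overriding dataflow programs, so the following is a minimal sketch of what that could look like. It does not use the Silk API: `Seq[Int]` stands in for `Silk[Int]`, and the names `Analysis`, `AnalysisV1`, and `AnalysisV2` are hypothetical.

```scala
object DataflowReuse {
  // Seq[Int] is a stand-in for Silk[Int]; all names here are illustrative.
  trait Analysis {
    def input: Seq[Int]
    def f(x: Int): Int = x + 1
    def g(x: Int): Int = x * 2
    // The dataflow is a sequence of function calls: input -> f -> b -> g -> c.
    def b: Seq[Int] = input.map(f)
    def c: Seq[Int] = b.map(g)
  }

  // Reuse: the dataflow definition is inherited unchanged.
  class AnalysisV1(val input: Seq[Int]) extends Analysis

  // Override: only step g is replaced; step f and the wiring are reused.
  class AnalysisV2(in: Seq[Int]) extends AnalysisV1(in) {
    override def g(x: Int): Int = x * 10
  }
}
```

Because the steps are ordinary methods, a subclass can swap one step while the rest of the flow stays shared, which is one plausible reading of "object-oriented" here.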
  7. 7. Big Data Volumes in Human Genome Analysis
     • Input: FASTQ file(s), 500GB (50x coverage, 200 million entries)
       • DNA sequencer (Illumina, PacBio, etc.)
     • f: an alignment program
     • Output: alignment results, 750GB (sequence + alignment data)
     • Total storage space required: 1.2TB
     • Computational time required: 1 day (using hundreds of CPUs)
     • [Diagram: Input → f → Output]
     • University of Tokyo Genome Browser (UTGB)
  8. 8. Varieties of Scientific Data and Analysis
     • WormTSS: http://wormtss.utgenome.org/
     • Integrating various data sources, hundreds of data analyses…
  9. 9. Produced Thousands of Data Analysis Charts
     • Using R, JFreeChart, etc.
     • Need an automated pipeline to redo the entire analysis in order to answer the paper review within a month
  10. 10. Writing a Dataflow
     • Program v1: val B = A.map(f)
     • Applies function f to the input A to produce the output B
     • This step may take more than an hour in big data analysis
  11. 11. Distribution and Fault Tolerance
     • Program v1: B = A.map(f), distributed as A0 → B0, A1 → B1, A2 → B2
     • When the task for A2 fails, retry it: resume only B2 = A2.map(f)
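
The "resume only B2" idea can be sketched as follows, assuming results are tracked per partition index. This is not Silk's scheduler: `runPass` and its result map are hypothetical names, and everything runs in one process.

```scala
import scala.util.{Failure, Success, Try}

object PartitionRetry {
  // One pass over the partitions. `done` holds outputs from earlier passes,
  // keyed by partition index. Completed partitions are never recomputed;
  // failed ones stay absent so that a later pass can retry them.
  def runPass[A, B](inputs: Vector[Seq[A]], done: Map[Int, Seq[B]])(f: A => B): Map[Int, Seq[B]] =
    inputs.indices.foldLeft(done) { (acc, i) =>
      if (acc.contains(i)) {
        acc
      } else {
        Try(inputs(i).map(f)) match {
          case Success(out) => acc + (i -> out)
          case Failure(_)   => acc
        }
      }
    }
}
```

Calling `runPass` again with the previous result map re-executes only the partitions that are still missing, so a failure in A2 does not cost a recomputation of B0 and B1.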
  12. 12. Extending Dataflows
     • Program v1: A → f → B; program v2: B → g → C
     • While program v1 is running, more code (program v2) is added
     • How do we reuse the already computed result (B) to generate C?
  13. 13. Marking a Program
     • Program v1: val B = A.map(f)
     • Program v2: val C = B.map(g)
     • Store intermediate results under their variable names
       • variable names := program markers!!
     • But variable names are lost after compilation
     • Extract the AST and variable names at compile time
       • Using Scala Macros (since Scala 2.10)
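
To illustrate why variable names work as markers, here is a minimal sketch that stores intermediate results under a name and reuses them. It deliberately skips the macro step: the name is passed in by hand, whereas the slide says Silk extracts it at compile time. `NamedResults`, `named`, and `evaluations` are hypothetical names.

```scala
import scala.collection.mutable

object NamedResults {
  // In-memory stand-in for stored intermediate results, keyed by variable name.
  private val store = mutable.Map.empty[String, Any]

  // Counts how many results were actually computed, to make reuse observable.
  var evaluations = 0

  // Returns the stored result for `name`, computing it only on first use.
  def named[T](name: String)(compute: => T): T =
    store.getOrElseUpdate(name, { evaluations += 1; compute }).asInstanceOf[T]
}
```

With this, a later program version that asks for `named("B")` gets the stored value instead of re-running `A.map(f)`, and only the new step `C` is computed.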
  14. 14. Scala Program (AST) to DAG Schedule (Logical Plan)
     • Translating a program (AST) into a set of Silk operations (DAG)
       • val B = MapOp(input:A, output:B, function:f)
       • val C = MapOp(input:B, output:C, function:g)
     • Operations in Silk can be nested
       • val C = MapOp(input:MapOp(input:A, output:B, function:f), output:C, function:g)
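
The nested `MapOp` notation on this slide can be modelled with plain case classes. The sketch below is an assumption about the shape of such a plan, not Silk's actual classes: `SilkOp`, `SourceOp`, and `lineage` are hypothetical, and the slide's `output` field is represented by `name`.

```scala
object LogicalPlan {
  // Each operation carries the variable name it produces (the slide's `output`).
  sealed trait SilkOp[A] { def name: String }
  final case class SourceOp[A](name: String, data: Seq[A]) extends SilkOp[A]
  final case class MapOp[A, B](name: String, input: SilkOp[A], function: A => B) extends SilkOp[B]

  // Walks the nested operations and lists names from the source to the output.
  def lineage(op: SilkOp[_]): List[String] = op match {
    case SourceOp(name, _)     => List(name)
    case MapOp(name, input, _) => lineage(input) :+ name
  }
}
```

Since each operation holds a reference to its input, the value bound to `C` already contains the whole chain back to `A`, which is what makes the plan a DAG that a scheduler can inspect.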
  15. 15. Weaving Silks
     • Silk[A] (operation DAG) → weave → result / output
     • Weavers: in-memory weaver, cluster weaver, Hadoop weaver
     • Data analysis code is independent of weavers
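
A minimal sketch of the separation between an operation DAG and the weaver that evaluates it, assuming a tiny operation set. The `Weaver` trait, `InMemoryWeaver`, and `ChunkedWeaver` are hypothetical names; `ChunkedWeaver` only imitates partitioned execution inside one process and is not a cluster weaver.

```scala
object Weaving {
  sealed trait Op[A]
  final case class Source[A](data: Seq[A]) extends Op[A]
  final case class MapOp[A, B](input: Op[A], f: A => B) extends Op[B]
  final case class FilterOp[A](input: Op[A], p: A => Boolean) extends Op[A]

  // A weaver turns an operation DAG into concrete data.
  trait Weaver { def weave[A](op: Op[A]): Seq[A] }

  // Evaluates every operation directly on in-memory sequences.
  object InMemoryWeaver extends Weaver {
    def weave[A](op: Op[A]): Seq[A] = op match {
      case Source(data)       => data
      case MapOp(input, f)    => weave(input).map(f)
      case FilterOp(input, p) => weave(input).filter(p)
    }
  }

  // Evaluates map steps chunk by chunk, as a stand-in for partitioned execution.
  class ChunkedWeaver(chunkSize: Int) extends Weaver {
    def weave[A](op: Op[A]): Seq[A] = op match {
      case Source(data)       => data
      case MapOp(input, f)    => weave(input).grouped(chunkSize).flatMap(chunk => chunk.map(f)).toList
      case FilterOp(input, p) => weave(input).filter(p)
    }
  }
}
```

The same plan value can be handed to either weaver, which is the sense in which the analysis code does not depend on how it is executed.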
  16. 16. Cluster Weaver: Logical Plan to Physical Plan on Cluster
     • Logical plan
       • GroupByOp(in:people, out:g, key: {_.dept.id})
     • Physical plan
       • [Diagram: the input slices I1 to I3 of Silk[people] are each partitioned by hashing into P1 to P3; the partitions are scattered (serialization, shuffle, deserialization) into S1 to S3, then merge-sorted into the results R1 to R3]
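
The partition, scatter, and merge steps of the physical plan can be sketched in a single process as below. This is an illustration under assumptions: `Person`, `partition`, and `groupByKey` are hypothetical names, serialization and network transfer are omitted, and the final step groups records per key instead of performing the merge sort shown on the slide.

```scala
object HashShuffle {
  // Illustrative record type; the slide keys people by `_.dept.id`.
  final case class Person(name: String, deptId: Int)

  // Partition step: route each record of one input slice by the hash of its key.
  def partition[A, K](slice: Seq[A], numPartitions: Int)(key: A => K): Map[Int, Seq[A]] =
    slice.groupBy(a => Math.floorMod(key(a).##, numPartitions))

  // Scatter and merge: for every partition id, collect the pieces produced by
  // all input slices, then group the collected records by key.
  def groupByKey[A, K](slices: Seq[Seq[A]], numPartitions: Int)(key: A => K): Map[K, Seq[A]] = {
    val pieces = slices.map(s => partition(s, numPartitions)(key))
    val merged = (0 until numPartitions).map { p =>
      pieces.flatMap(_.getOrElse(p, Seq.empty)).groupBy(key)
    }
    // Equal keys always hash to the same partition, so the maps never overlap.
    merged.foldLeft(Map.empty[K, Seq[A]])(_ ++ _)
  }
}
```

Hashing guarantees that all records with the same key meet in one partition, which is why each partition can be grouped independently of the others.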
  17. 17. [Architecture diagram]
     • Local machine
       • User program builds workflows
       • Local ClassBox: classpaths & local jar files
       • Silk[A] operations: read file, toSilk; map, reduce, join, groupBy; UNIX commands; etc.
       • Static optimization produces a DAG schedule
       • Weaving a Silk materializes objects: weaving SilkSingle[A] yields A (a single object); weaving SilkSeq[A] yields Seq[A] (a sequence of objects)
     • Cluster (the local machine registers its ClassBox and submits the schedule)
       • Silk Master
         • Dispatches tasks to clients
         • Manages master resource table
         • Authorizes resource allocation
         • Automatic recovery by leader election in ZK
       • ZooKeeper in ensemble mode (at least 3 ZK instances)
         • Leader election
         • Collects locations of slices and ClassBox jars
         • Watches active nodes
         • Watches available resources
         • Tables: Node Table, Slice Table, Task Status, Resource Table (CPU, memory), ClassBox Table
       • Silk Clients, each with a Task Scheduler, Task Executor, Resource Monitor, and Data Server
         • Submits tasks
         • Run-time optimization
         • Resource allocation
         • Monitoring resource usage
         • Launches Web UI
         • Manages assigned task status
         • Object serialization/deserialization
         • Serves slice data
  18. 18. Static Optimization
     • Tree transformation
       • map(f).map(g) => map(g ∘ f) (function composition)
       • map(f).filter(p) => mapWithFilter(f, p) (reduces intermediate data)
     • Pushing down selection
       • Retrieves only the accessed fields of an object
       • Analyzing the byte code of functions with ASM
     • Rewriting logical plans using pattern matching in Scala
       • Easy to add optimization rules
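
The two tree transformations listed on this slide can be written as pattern-matching rewrite rules. The sketch below is a simplified model, not Silk's optimizer: the plan is untyped (`Any => Any`) to keep the rules short, and `Plan`, `Scan`, `MapWithFilter`, and `optimize` are hypothetical names.

```scala
object PlanRewrite {
  sealed trait Plan
  final case class Scan(name: String) extends Plan
  final case class MapOp(input: Plan, f: Any => Any) extends Plan
  final case class FilterOp(input: Plan, p: Any => Boolean) extends Plan
  final case class MapWithFilter(input: Plan, f: Any => Any, p: Any => Boolean) extends Plan

  // Rewrites bottom-up: the input is optimized first, then the current node
  // is fused with it when one of the two rules applies.
  def optimize(plan: Plan): Plan = plan match {
    case MapOp(input, g) =>
      optimize(input) match {
        case MapOp(inner, f) => MapOp(inner, f.andThen(g)) // map(f).map(g) => map(g ∘ f)
        case other           => MapOp(other, g)
      }
    case FilterOp(input, p) =>
      optimize(input) match {
        case MapOp(inner, f) => MapWithFilter(inner, f, p) // map(f).filter(p) => mapWithFilter(f, p)
        case other           => FilterOp(other, p)
      }
    case MapWithFilter(input, f, p) => MapWithFilter(optimize(input), f, p)
    case scan: Scan                 => scan
  }
}
```

Adding a rule means adding one more `case`, which matches the slide's point that optimization rules are easy to add.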
  19. 19. Run-time Optimization
     • Adjusting the number of data splits
       • According to the available cluster resources
     • Multi-core execution
     • Omega-based task scheduler
       • Sharing the cluster resource table between nodes
       • Each node determines how to use the resources
       • Monitoring actual CPU/memory resources periodically
  20. 20. UNIX Command Workflows in Silk
     • c"(UNIX command)"
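
The `c"..."` notation looks like a custom Scala string interpolator. The following sketch shows how such an interpolator can be defined with the standard `StringContext` mechanism. It is an assumption about the approach rather than Silk's implementation: `ShellCommand` and `CommandInterpolator` are hypothetical names, and the command runs locally through `sh -c` instead of becoming a node in a Silk workflow.

```scala
object CommandSketch {
  import scala.sys.process._

  // A command line that has been built but not yet run.
  final case class ShellCommand(line: String) {
    // Runs the command with `sh -c` and returns its standard output as lines.
    // Throws if the command exits with a non-zero status.
    def lines: List[String] = Seq("sh", "-c", line).!!.linesIterator.toList
  }

  // Enables the c"..." syntax; interpolated values are spliced in as text.
  implicit class CommandInterpolator(private val sc: StringContext) {
    def c(args: Any*): ShellCommand = ShellCommand(sc.s(args: _*))
  }
}
```

After `import CommandSketch._`, an expression such as `c"seq 1 $n"` builds a `ShellCommand` without running it, so a framework could defer execution until the result is needed.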
  21. 21. Buffer Management
     • Silk frequently uses distributed memory (like Spark)
     • LArray[A] (GitHub: https://github.com/xerial/larray)
       • Allocating off-heap memory (outside the JVM heap) with sun.misc.Unsafe
       • Immediate memory deallocation (free), to eliminate OutOfMemoryError and GC stalls
       • Fast memory allocation: skips zero-filling
     • Object serialization
       • Extending msgpack
       • Scala Pickling
       • Injecting serialization/deserialization code
       • Off-heap objects
  22. 22. Summary
     • Silk
       • A framework for distributed data processing for all data scientists, including non-experts in distributed data processing (e.g. biologists)
       • Object-oriented data processing programming: reuse, override and mix-in
       • Optimizing dataflow programs, similar to query optimization in DBMS
       • Database research now enters program optimization
     • Analyze Data as You Write Programs!
     • In future
       • Workflow queries: making queries against dataflow programs
       • Monitoring intermediate results
       • Multi-user program execution
  23. 23. http://xerial.org/silk
