Weaving Dataflows with Silk - ScalaMatsuri 2014, Tokyo

Silk is a framework for building dataflows in Scala. In Silk, users write data processing code with collection operators (e.g., map, filter, reduce, join). Silk uses Scala Macros to construct a DAG of dataflows whose nodes are annotated with the variable names used in the program. By using these variable names as markers in the DAG, Silk can support interrupting and resuming dataflows and querying intermediate data. By separating dataflow descriptions from their computation, Silk lets us switch executors, called weavers, between in-memory and cluster computing without modifying the code. In this talk, we will show how Silk helps you run data-processing pipelines as you write the code.
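
The separation between describing a dataflow and executing it can be illustrated with a minimal sketch. This is not Silk's actual API: `Flow`, `Source`, `FilterOp`, `Weaver`, `InMemoryWeaver`, and `ChunkedWeaver` are hypothetical names, the `MapOp` signature is assumed, and node names are passed by hand here, whereas Silk obtains them from variable names with Scala Macros. The sketch only shows that the same operation DAG can be handed to different executors.

```scala
// Hypothetical sketch; none of these names are taken from Silk's real API.
object SilkSketch {
  // A dataflow is described as data (an operation DAG), not executed eagerly.
  sealed trait Flow[A] { def name: String }
  final case class Source[A](name: String, data: Seq[A]) extends Flow[A]
  final case class MapOp[A, B](name: String, in: Flow[A], f: A => B) extends Flow[B]
  final case class FilterOp[A](name: String, in: Flow[A], p: A => Boolean) extends Flow[A]

  // A weaver decides how a described dataflow is executed.
  trait Weaver { def weave[A](flow: Flow[A]): Seq[A] }

  // Evaluates the whole DAG on the local collection.
  object InMemoryWeaver extends Weaver {
    def weave[A](flow: Flow[A]): Seq[A] = flow match {
      case Source(_, data)    => data
      case MapOp(_, in, f)    => weave(in).map(f)
      case FilterOp(_, in, p) => weave(in).filter(p)
    }
  }

  // Stand-in for a cluster weaver: processes the input slice by slice,
  // then concatenates the slices. chunkSize must be positive.
  final class ChunkedWeaver(chunkSize: Int) extends Weaver {
    require(chunkSize > 0, "chunkSize must be positive")
    def weave[A](flow: Flow[A]): Seq[A] = slices(flow).flatten
    private def slices[A](flow: Flow[A]): Seq[Seq[A]] = flow match {
      case Source(_, data)    => data.grouped(chunkSize).toSeq
      case MapOp(_, in, f)    => slices(in).map(_.map(f))
      case FilterOp(_, in, p) => slices(in).map(_.filter(p))
    }
  }

  // The dataflow description is written once ...
  val people = Source("people", Seq(("alice", 31), ("bob", 17), ("carol", 45)))
  val adults = FilterOp("adults", people, (p: (String, Int)) => p._2 >= 18)
  val names  = MapOp("names", adults, (p: (String, Int)) => p._1)

  // ... and woven by whichever executor is chosen.
  val local   = InMemoryWeaver.weave(names)
  val chunked = new ChunkedWeaver(2).weave(names)
}
```

Both weavers yield the same result for `names`; only the execution strategy differs, which is the property the abstract attributes to weavers.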

Published in: Technology

  1. Weaving Dataflows with Silk
     Taro L. Saito, Treasure Data, Inc. (leo@xerial.org)
     September 6th, 2014, ScalaMatsuri @ Tokyo
     xerial.org/silk
  2. About Me
  3. Treasure Data Console
  4. Processing Job Table
  5. Functional Style Writing
  6. Need an Optimization?
  7. Procedural Style Writing
     • Describes how to process data.
  8. Declarative Style Writing
     • Less programming
     • The system decides how to optimize the code
     • Hash joins, Bloom filters, and various other optimization techniques are now available.
  9. Weaving Silk
     • Making data processing code independent from the execution method!
     [Diagram labels: Silk[A] (operation DAG); Weave (Execute); Silk Product; Result; in-memory weaver; cluster weaver (Spark?); MapReduce weaver; your own weaver (using TD?)]
  10. Cluster Weaver: Logical Plan to Physical Plan on Cluster
      • Physical plan on cluster
      [Figure labels: Silk[people]; scatter; partition (hashing); serialization; shuffle; deserialization; merge sort; nodes I1 to I3, P1 to P3, S1 to S3, R1 to R3]
  11. DAG-based Data Processing Engines
      • Spark: Creates a task schedule for distributed processing
      • Summingbird: Integrates stream and batch data processing (e.g., running Scalding and Storm at the same time)
      • Apache Tez: Creates a DAG schedule for optimizing MapReduce pipelines
      • GNU Makefile: Describes a pipeline of UNIX commands
      Why do we need another framework?
  12. Challenge: Isolate Code Writing and Its Execution
      • Why can't we run the program until we finish writing it?
      • How can we depart from the compile-then-run paradigm?
      [Diagram labels: Silk[A] (operation DAG); Weave (Execute); Silk Product; weaver; Result]
  13. [Slide text not captured in the transcript]
  14. Genome Science is a Big Data Science
      • By sequencing, we can find 3 million SNPs for each person
      • To find the cause of a disease (one or a few SNPs), we need to sequence as many samples as possible to narrow down the candidate SNPs
      • Input: FASTQ file(s), 500 GB (50x coverage, 200 million entries)
      • DNA sequencer (Illumina, PacBio, etc.)
      • f: An alignment program
      • Output: Alignment results, 750 GB (sequence + alignment data)
      • Total storage space required: 1.2 TB
      [Figure labels: Input; f; Output; University of Tokyo Genome Browser (UTGB)]
  15. Human Genome Data Processing Workflows in Silk
      • c"(UNIX Command)"
  16. Human Genome Data Processing Workflows
      • Makefile: The result ($@) is stored in a file
      • Silk: The result is stored in a variable
      • Each command may take an hour or more to compute
  17. SBT: A Good Hint
      • SBT supports incremental compilation and testing
      • sbt ~test-only: monitors source code changes and runs specific tests
      • sbt ~test-quick: runs failed tests only
      • How do we compute the not-yet-started part of a Scala program?
      • We need to know:
      • A-B and D-E are running
      • If B is finished, we can start B-C
      [Diagram labels: nodes A, B, C, D, E, F, G; functions f, g]
  18. Writing a Dataflow
      • Apply function f to the input A, then produce the output B
      • This step may take more than 1 hour in big data analysis
      Program v1: val B = A.map(f)
      [Diagram labels: A; f; B]
  19. Distribution and Recovery
      • Resume only B2 = A2.map(f)
      [Diagram labels: Program v1 (A, f, B); partitions A0, A1, A2 and B0, B1, B2; Failure!; Retry]
  20. Extending Dataflows
      • While running program v1, we may want to add more code (program v2)
      • We need to know that variable B is currently being processed
      [Diagram labels: Program v1 (A, f, B); Program v2 (g, C)]
  21. Labeling Program with Variable Names
      • Storing intermediate results using variable names
      • variable names := program markers
      • But we lose the variable names after compilation
      • Solution: Extract variable names from the AST at compile time
      • Using Scala Macros (since Scala 2.10)
      Program v1: val B = A.map(f)
      Program v2: val C = B.map(g)
  22. Scala Program (AST) to DAG Schedule (Logical Plan)
      • Translate a program into a set of Silk operation objects
      • val B = MapOp(input:A, output:"B", function:f)
      • val C = MapOp(input:B, output:"C", function:g)
      • Operations in Silk form a DAG
      • val C = MapOp(input:MapOp(input:A, output:"B", function:f), output:"C", function:g)
  23. Using Scala Macros
      • Produce operation objects with Scala Macros
      • map(f: A => B) produces MapOp[A, B](…)
      • Why do we need to use a macro here?
      • To extract FContext (target variable name, enclosing method, class, etc.) from the AST.
  24. [Slide text not captured in the transcript]
  25. Extract target variable name and enclosing method
  26. Finding Target Variable
  27. • Translate a program into a set of Silk operation objects
      • val B = MapOp(input:A, output:"B", function:f)
      • val C = MapOp(input:B, output:"C", function:g)
      • Silk uses these variable names to store the intermediate data
  28. • Silk defines various types of operations
  29. Object-Oriented Dataflow Programming
      • Reusing and overriding dataflows
  30. Summary
      • Declarative-style coding is necessary for creating a DAG schedule
      • DAG schedules are labeled with variable names using Scala Macros
      • Weaver: An abstraction of how to execute the code
      • The weaver manages the running and finished parts of the code
      [Diagram labels: Silk[A] (operation DAG); Weave (Execute); Silk Product; weaver; cluster weaver; Result]
  31. http://xerial.org/silk
  32. Copyright ©2014 Treasure Data. All Rights Reserved. WE ARE HIRING! www.treasuredata.com
  33. [Architecture diagram; labels as extracted]
      • Silk[A]: Weaving Silk materializes objects
      • User program builds workflows: read file, toSilk; map, reduce, join, groupBy; UNIX commands; etc.
      • Static optimization; DAG Schedule
      • weave: Register ClassBox; Submit schedule
      • Components: Silk Master; Silk Client (Task Scheduler, Task Executor, Resource Monitor); Data Server; ZooKeeper in ensemble mode (at least 3 ZK instances)
      • Tables: Resource Table (CPU, memory); Node Table; Slice Table; Task Status; ClassBox Table
      • Local ClassBox: classpaths, local jar files
      • Leader election; Collects locations of slices and ClassBox jars; Watches active nodes; Watches available resources
      • Submits tasks; Run-time optimization; Resource allocation; Monitoring resource usage; Launches Web UI
      • Manages assigned task status; Object serialization/deserialization; Serves slice data
      • Dispatches tasks to clients; Manages master resource table; Authorizes resource allocation; Automatic recovery by leader election in ZK
      • Silk[A] types: SilkSingle[A] weaves to a single object; SilkSeq[A] weaves to Seq[A], a sequence of objects
      • Execution targets: local machine; cluster
  34. Integrating Varieties of Data Sources
      • WormTSS: http://wormtss.utgenome.org/
      • Integrating various data sources
  35. Varieties of Data Analysis
      Using R, JFreeChart, etc.
      We need an automated pipeline to redo the entire analysis in order to answer the paper review within a month.
  36. Makefile
      • Describes dependencies of commands through files
      • Good: We can resume and update the dataflow processing
      • Bad: The Makefile of the WormTSS analysis exceeds 1,000 lines
  37. Splitting Data Analysis Into Command Modules
      • We added a new command whenever we needed a new analysis or data processing step
      • The result: hundreds of commands!
      • The number of files limits the parallelism
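
The labeling idea from slides 20 to 22 and 27, storing intermediate results under variable names so that an extended program can pick up where an earlier one left off, can be sketched roughly as follows. This is an assumption-laden illustration, not Silk's implementation: `Op`, `Input`, and `CachingWeaver` are hypothetical, the `MapOp` fields only loosely follow the slide notation, and the names "A", "B", and "C" are written by hand where Silk would extract them from the AST with Scala Macros.

```scala
import scala.collection.mutable

// Hypothetical sketch; names and signatures are illustrative only.
object ResumeSketch {
  sealed trait Op[A] { def name: String }
  final case class Input[A](name: String, data: Seq[A]) extends Op[A]
  final case class MapOp[A, B](name: String, input: Op[A], f: A => B) extends Op[B]

  // A weaver that stores every intermediate result under its variable name.
  final class CachingWeaver {
    private val store = mutable.Map.empty[String, Seq[Any]]
    // Names of the operations that were actually computed, in order.
    val computed = mutable.ListBuffer.empty[String]

    def weave[A](op: Op[A]): Seq[A] = store.get(op.name) match {
      // The result was stored under this name earlier, so reuse it.
      case Some(saved) => saved.asInstanceOf[Seq[A]]
      case None =>
        val result: Seq[A] = op match {
          case Input(_, data)  => data
          case MapOp(_, in, f) => weave(in).map(f)
        }
        store(op.name) = result
        computed += op.name
        result
    }
  }

  val A = Input("A", Seq(1, 2, 3))
  val B = MapOp("B", A, (x: Int) => x * 2) // program v1
  val C = MapOp("C", B, (x: Int) => x + 1) // added by program v2

  val weaver = new CachingWeaver
  val v1 = weaver.weave(B) // computes A, then B
  val v2 = weaver.weave(C) // finds B in the store, computes only C
}
```

Because the store is keyed by name, weaving `C` after `B` does not re-run `f`. A real system would also need to persist the store and detect when the definition behind a name has changed, which this sketch leaves out.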
