
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in Spark, Matteo Interlandi, PostDoc, UCLA


Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark.



  1. 1. Interactive Programs Debugging and Development in Apache Spark
  2. 2. Outline ‣ Motivating Scenario ‣ Titian Programming Interface ‣ Internals ‣ Vega ‣ Conclusions
  3. 3. Outline ‣ Motivating Scenario ‣ Titian Programming Interface ‣ Internals ‣ Vega ‣ Conclusions
  4. 4. ๏ Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is difficult ๏ Analysis tools are still in their “infancy” ๏ Today’s large-scale jobs are black boxes: • Job submitted to a cluster • Results come back minutes to hours later • No visibility into the running algorithm Big Data Debugging
  5. 5. Big Data Debugging - State of the Art
  6. 6. Big Data Debugging - State of the Art
  7. 7. Big Data Debugging - State of the Art
  8. 8. Big Data Debugging - State of the Art
  9. 9. Big Data Debugging - State of the Art
  10. 10. ๏ Easy-to-use GDB-like debugger [ICSE 16] (not covered in this talk) ๏ Visibility of data into the running workflow • E.g., what (input) data led to this (outlier) result? ๏ Selectively replaying a portion of the data processing steps on subsets of intermediate data leading to outlier results ๏ Interactive program analysis Big Data Debugging - Desiderata
  11. 11. ๏ Visibility of data -> Tracking the dependencies between the individual inputs and outputs records ๏ Selective replay -> Storage of intermediate results: • Dataset shared among running job and analysis tool ๏ Interactivity -> Implementation Constraints: • Latency constraint - In memory computation • Programming interface constraint - Integration with Spark DSL Big Data Debugging - Challenges
  12. 12. ๏ Well known technique in databases ๏ Two granularities of provenance • Transformation (coarse-grained) provenance – Records the complete workflow of the derivation of a dataset – Spark RDD lineage is an example of this form of provenance • Data (fine-grained) provenance – Records data dependencies between input and output records – The type of provenance Titian focuses on Data Provenance (Lineage)
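A quick point of contrast, not from the slides: stock Spark already exposes the transformation-level lineage through RDD.toDebugString, which prints the chain of transformations behind an RDD. A minimal sketch, assuming an existing SparkContext sc:

  // Coarse-grained (transformation) provenance in plain Spark: toDebugString
  // prints the RDD lineage graph, i.e. the transformations deriving `counts`.
  val counts = sc.textFile("hdfs://...")
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  println(counts.toDebugString)
  // Which *input records* produced a given output record (data provenance)
  // is not recorded here; that record-level view is what Titian adds.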
  13. 13. Tuple-ID Time Sensor-ID Temperature T1 11AM 1 34 T2 11AM 2 35 T3 11AM 3 35 T4 12PM 1 35 T5 12PM 2 35 T6 12PM 3 100 T7 1PM 1 35 T8 1PM 2 35 T9 1PM 3 80 SELECT AVG(temp),time
 FROM sensors
 GROUP BY time Sensors Result-ID Time AVG(temp) ID-1 11AM 34.6 ID-2 12PM 56.6 ID-3 1PM 50 Data Provenance - Example
  14. 14. Tuple-ID Time Sensor-ID Temperature T1 11AM 1 34 T2 11AM 2 35 T3 11AM 3 35 T4 12PM 1 35 T5 12PM 2 35 T6 12PM 3 100 T7 1PM 1 35 T8 1PM 2 35 T9 1PM 3 80 SELECT AVG(temp),time
 FROM sensors
GROUP BY time Sensors Result-ID Time AVG(temp) ID-1 11AM 34.6 ID-2 12PM 56.6 ID-3 1PM 50 Outlier Outlier Why do ID-2 and ID-3 have such high averages? Data Provenance - Example
  15. 15. Tuple-ID Time Sensor-ID Temperature T1 11AM 1 34 T2 11AM 2 35 T3 11AM 3 35 T4 12PM 1 35 T5 12PM 2 35 T6 12PM 3 100 T7 1PM 1 35 T8 1PM 2 35 T9 1PM 3 80 SELECT AVG(temp),time
 FROM sensors
GROUP BY time Sensors Result-ID Time AVG(temp) ID-1 11AM 34.6 ID-2 12PM 56.6 ID-3 1PM 50 Outlier Outlier Why do ID-2 and ID-3 have such high averages? Data Provenance - Example
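(Verification from the table, not on the slide: the 12PM group is {T4, T5, T6}, so AVG = (35 + 35 + 100) / 3 ≈ 56.6, and the 1PM group is {T7, T8, T9}, so AVG = (35 + 35 + 80) / 3 = 50. Fine-grained provenance is what lets us trace ID-2 and ID-3 back to the anomalous readings T6 = 100 and T9 = 80 from sensor 3.)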
  16. 16. ๏ They use external storage systems (HDFS in RAMP [CIDR-11], DBMS in Newt [SOCC-13]) to retain lineage data ๏ Data provenance queries are supported in a separate programming interface Previous Data Provenance DISC Systems
  17. 17. ๏ They use external storage systems (HDFS in RAMP [CIDR-11], DBMS in Newt [SOCC-13]) to retain lineage data ๏ Data provenance queries are supported in a separate programming interface High overhead Previous Data Provenance DISC Systems
  18. 18. ๏ They use external storage systems (HDFS in RAMP [CIDR-11], DBMS in Newt [SOCC-13]) to retain lineage data ๏ Data provenance queries are supported in a separate programming interface High overhead Low interactivity Previous Data Provenance DISC Systems
  19. 19. ๏ Word Count job ๏ RAMP is up to 4X slower than Spark ๏ Newt is up to 86X slower Experience with Newt and RAMP [Chart: Time (s) vs. Dataset Size (GB) for Spark, Newt, and RAMP]
  20. 20. Outline ‣ Motivating Scenario ‣ Titian Programming Interface ‣ Internals ‣ Vega ‣ Conclusions
  21. 21. Loads error messages from a log, counts the number of error occurrences and returns a report containing the description of each error lc = new LineageContext(sc) lines = lc.textFile("hdfs://...") errors = lines.filter(_.startsWith("error")) codes = errors.map(_.split("\t")(1)) pairs = codes.map(word => (word, 1)) counts = pairs.reduceByKey(_ + _) reports = counts.map(kv => (dscr(kv._1), kv._2)) reports.collect.foreach(println) Example: Log Analysis
  22. 22. Given the result of the previous example, select the most frequent error and trace back to the input lines containing it Example: Backward Tracing
  23. 23. Given the result of the previous example, select the most frequent error and trace back to the input lines containing it frequentPair = reports.sortBy(_._2, false).take(1)(0) frequent = reports.filter(_ == frequentPair) lineage = frequent.getLineage() input = lineage.goBackAll() input.collect().foreach(println) Example: Backward Tracing
  24. 24. Return the error codes generated from the network sub-system (indicated in the log by a “NETWORK” tag) Example: Forward Tracing
  25. 25. Return the error codes generated from the network sub-system (indicated in the log by a “NETWORK” tag) network = errors.filter(_.contains("NETWORK")) lineage = network.getLineage() output = lineage.goNextAll() output.collect().foreach(println) Example: Forward Tracing
  26. 26. Return the error distribution without the ones caused by the Guest user Example: Selective Replay
  27. 27. Return the error distribution without the ones caused by the Guest user lineage = reports.getLineage() inputLines = lineage.goBackAll() noGuest = inputLines.filter(l => !l.contains("Guest") && l.startsWith("error")) newCodes = noGuest.map(_.split("\t")(1)) newPairs = newCodes.map(word => (word, 1)) newCounts = newPairs.reduceByKey(_ + _) newRep = newCounts.map(kv => (dscr(kv._1), kv._2)) newRep.collect Example: Selective Replay
  28. 28. Outline ‣ Motivating Scenario ‣ Titian Programming Interface ‣ Internals ‣ Vega ‣ Conclusions
  29. 29. ๏ LineageContext wraps SparkContext • Providing visibility into the submitted job ๏ LineageRDDs are instrumented at stage boundaries • Wrap native RDDs • Specific LineageRDD implementations based on the instrumented transformation ๏ Provenance data is buffered inside LineageRDDs • Saved into the Spark BlockManager for querying Provenance Capturing
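To make the capture mechanism concrete, here is a minimal sketch, illustrative only: the object and method names below are not Titian's, the lineage table is returned as a plain RDD rather than buffered and stored in the BlockManager, and only the simple one-to-one case is shown.

  import org.apache.spark.rdd.RDD

  // Hypothetical sketch of record-level lineage capture for a narrow
  // (one-to-one) transformation.
  object LineageCaptureSketch {
    // Tag every input line with a synthetic record ID (its position in the RDD).
    def tag(lines: RDD[String]): RDD[(Long, String)] =
      lines.zipWithIndex().map { case (rec, id) => (id, rec) }

    // Apply a one-to-one transformation and also emit a lineage table of
    // (input ID, output ID) pairs; for a one-to-one map the ID can be reused.
    def mapWithLineage(tagged: RDD[(Long, String)], f: String => String)
        : (RDD[(Long, String)], RDD[(Long, Long)]) = {
      val out     = tagged.mapValues(f)                       // transformed data, same IDs
      val lineage = tagged.map { case (id, _) => (id, id) }   // input ID -> output ID
      (out, lineage)
    }
  }

Aggregating transformations would instead record a set of input IDs per output ID, which is exactly the shape of the Combiner table shown on the next slides.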
  30. 30. counts pairs codes errors lines Stage 1 Stage 2 reports lines = sc.textFile("hdfs://...") errors = lines.filter(_.startsWith("error")) codes = errors.map(_.split("\t")(1)) pairs = codes.map(word => (word, 1)) counts = pairs.reduceByKey(_ + _) reports = counts.map(kv => (dscr(kv._1), kv._2)) Spark Stage DAG
  31. 31. Instrumented Spark Stage DAG Combiner LineageRDD Reducer LineageRDD Hadoop LineageRDD counts pairs codes errors lines Stage 1 Stage 2 reports Stage LineageRDD
  32. 32. Instrumented Workflow Combiner LineageRDD Reducer LineageRDD Hadoop LineageRDD counts pairs codes errors lines Stage 1 Stage 2 reports Stage LineageRDD Input ID Output ID { id1, id3 } 400 { id2 } 4 Input ID Output ID offset1 id1 offset2 id2 offset3 id3 Input ID Output ID [p1, p2] 400 [ p1 ] 4 Input ID Output ID 400 id1 4 id2
  33. 33. Lineage Capture Runtime Overheads [Chart: Time (s) vs. Dataset Size (GB) for Spark, Titian, Newt, and RAMP] ๏ Same Word Count job ๏ Titian is on average 1.3X slower than Spark
  34. 34. Input ID Output ID offset1 id1 offset2 id2 offset3 id3 Input ID Output ID { id1, id3 } 400 { id2 } 4 Input ID Output ID [p1, p2] 400 [ p1 ] 4 Input ID Output ID 400 id1 4 id2 Hadoop Combiner Reducer Stage Example: Captured Data Lineage
  35. 35. Input ID Output ID offset1 id1 offset2 id2 offset3 id3 Input ID Output ID { id1, id3 } 400 { id2 } 4 Input ID Output ID [p1, p2] 400 [ p1 ] 4 Input ID Output ID 400 id1 4 id2 Hadoop Combiner Reducer Stage Example: Trace Back
  36. 36. Input ID Output ID offset1 id1 offset2 id2 offset3 id3 Input ID Output ID { id1, id3 } 400 { id2 } 4 Input ID Output ID [p1, p2] 400 [ p1 ] 4 Input ID Output ID 400 id1 4 id2 Hadoop Combiner Reducer Stage Example: Trace Back Stage.Input ID = Reducer.Output ID
  37. 37. Reducer.Output ID = Combiner.Output ID Example: Trace Back Input ID Output ID offset1 id1 offset2 id2 offset3 id3 Input ID Output ID { id1, id3 } 400 { id2 } 4 Input ID Output ID [p1, p2] 400 [ p1 ] 4 Input ID Output ID 400 id1 4 id2 Hadoop Combiner Reducer Stage
  38. 38. Example: Trace Back Input ID Output ID offset1 id1 offset2 id2 offset3 id3 Input ID Output ID { id1, id3 } 400 { id2 } 4 Input ID Output ID [p1, p2] 400 [ p1 ] 4 Input ID Output ID 400 id1 4 id2 Hadoop Combiner Reducer Stage Combiner.Input ID = Hadoop.Output ID Now let's do it for real!
  39. 39. Worker1 Worker2 Worker3 Input ID Output ID [p1, p2] 400 [ p1 ] 4 Input ID Output ID 400 id1 4 id2 Reducer Stage Input ID Output ID offset1 id1 offset2 id2 offset3 id3 Input ID Output ID { id1, id3 } 400 { id2 } 4 Hadoop Combiner Input ID Output ID offset1 id1 … … Input ID Output ID { id1, …} 400 Hadoop Combiner Example: Trace Back
  40. 40. Example: Trace Back Worker1 Worker2 Worker3 Input ID Output ID [p1, p2] 400 [ p1 ] 4 Input ID Output ID 400 id1 4 id2 Reducer Stage Input ID Output ID offset1 id1 offset2 id2 offset3 id3 Input ID Output ID { id1, id3 } 400 { id2 } 4 Hadoop Combiner Input ID Output ID offset1 id1 … … Input ID Output ID { id1, …} 400 Hadoop Combiner
  41. 41. Example: Trace Back Worker3 Input ID Output ID [p1, p2] 400 [ p1 ] 4 Input ID Output ID 400 id1 4 id2 Reducer Stage
  42. 42. Example: Trace Back Worker3 Input ID Output ID [p1, p2] 400 [ p1 ] 4 Input ID Output ID 400 id1 4 id2 Reducer Stage Worker3 Input ID Output ID [p1, p2] 400 [ p1 ] 4 Input ID Output ID 400 id1 4 id2 Reducer Stage Stage.Input ID = Reducer.Output ID
  43. 43. Example: Trace Back Worker1 Worker2 Worker3 Input ID Output ID [p1, p2] 400 [ p1 ] 4 Input ID Output ID 400 id1 4 id2 Reducer Stage Input ID Output ID offset1 id1 offset2 id2 offset3 id3 Input ID Output ID { id1, id3 } 400 { id2 } 4 Hadoop Combiner Input ID Output ID offset1 id1 … … Input ID Output ID { id1, …} 400 Hadoop Combiner
  44. 44. Example: Trace Back Worker1 Worker2 Worker3 Input ID Output ID [p1, p2] 400 [ p1 ] 4 Input ID Output ID 400 id1 4 id2 Reducer Stage Input ID Output ID offset1 id1 offset2 id2 offset3 id3 Input ID Output ID { id1, id3 } 400 { id2 } 4 Hadoop Combiner Input ID Output ID offset1 id1 … … Input ID Output ID { id1, …} 400 Hadoop Combiner Worker1 Worker2 Worker3 Input ID Output ID [p1, p2] 400 [ p1 ] 4 Input ID Output ID 400 id1 4 id2 Reducer Stage Input ID Output ID offset1 id1 offset2 id2 offset3 id3 Input ID Output ID { id1, id3 } 400 { id2 } 4 Hadoop Combiner Input ID Output ID offset1 id1 … … Input ID Output ID { id1, …} 400 Hadoop Combiner Input ID Output ID p1 400 Input ID Output ID p1 400 Targeted Shuffle
  45. 45. Example: Trace Back Worker1 Worker2 Worker3 Input ID Output ID 400 id1 4 id2 Stage Input ID Output ID offset1 id1 offset2 id2 offset3 id3 Input ID Output ID { id1, id3 } 400 { id2 } 4 Hadoop Combiner Input ID Output ID offset1 id1 … … Input ID Output ID { id1, …} 400 Hadoop Combiner Input ID Output ID p1 400 Input ID Output ID p1 400
  46. 46. Example: Trace Back Worker1 Worker2 Input ID Output ID offset1 id1 offset2 id2 offset3 id3 Input ID Output ID { id1, id3 } 400 { id2 } 4 Hadoop Combiner Input ID Output ID offset1 id1 … … Input ID Output ID { id1, …} 400 Hadoop Combiner Input ID Output ID p1 400 Input ID Output ID p1 400 Combiner.Output ID = Reducer.Output ID Combiner.Output ID = Reducer.Output ID
  47. 47. Example: Trace Back Hadoop Combiner Worker1 Worker2 Input ID Output ID offset1 id1 offset2 id2 offset3 id3 Input ID Output ID { id1, id3 } 400 { id2 } 4 Hadoop Combiner Input ID Output ID offset1 id1 … … Input ID Output ID { id1, …} 400 Hadoop Combiner
  48. 48. Example: Trace Back Hadoop Combiner Worker1 Worker2 Input ID Output ID offset1 id1 offset2 id2 offset3 id3 Input ID Output ID { id1, id3 } 400 { id2 } 4 Hadoop Combiner Input ID Output ID offset1 id1 … … Input ID Output ID { id1, …} 400 Hadoop Combiner Worker1 Worker2 Input ID Output ID offset1 id1 offset2 id2 offset3 id3 Input ID Output ID { id1, id3 } 400 { id2 } 4 Hadoop Combiner Input ID Output ID offset1 id1 … … Input ID Output ID { id1, …} 400 Hadoop Combiner Combiner.Input ID = Hadoop.Output ID Combiner.Input ID = Hadoop.Output ID
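Summarizing the walk-through in code: a minimal sketch of the recursive backward trace, illustrative only. It uses plain Scala maps standing in for the lineage tables above, whereas Titian runs these steps as distributed joins over LineageRDDs and uses the Reducer table's partition IDs ([p1, p2]) for the targeted shuffle.

  // Hypothetical sketch: each lineage table maps an output ID back to the
  // input IDs it was derived from; one trace step looks the current frontier
  // up in the next table upstream.
  object TraceBackSketch {
    val stage    = Map("id1" -> Set("400"), "id2" -> Set("4"))          // Stage: output record -> key
    val combiner = Map("400" -> Set("id1", "id3"), "4" -> Set("id2"))   // Combiner: key -> input record IDs
    val hadoop   = Map("id1" -> Set("offset1"), "id2" -> Set("offset2"),
                       "id3" -> Set("offset3"))                         // Hadoop: record ID -> input offset

    def goBack(frontier: Set[String], table: Map[String, Set[String]]): Set[String] =
      frontier.flatMap(id => table.getOrElse(id, Set.empty[String]))

    def main(args: Array[String]): Unit = {
      val start   = Set("id1")                                          // output record to explain
      val offsets = goBack(goBack(goBack(start, stage), combiner), hadoop)
      println(offsets)                                                  // Set(offset1, offset3)
    }
  }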
  49. 49. Tracing Performance ๏ Word Count job ๏ Tracing one record backward in < 1 sec for datasets < 100GB ๏ 18 sec for the 500GB dataset
  50. 50. Vega: Optimizations for Selective Replay Matteo Interlandi, Sai Deep Tetali, Muhammad Ali Gulzar, Joseph Noor, Miryung Kim, Todd Millstein, Tyson Condie Under Submission
  51. 51. Debugging workflow ๏ Run program ๏ Understand the cause for bugs / outliers: • Lineage • Breakpoints/watchpoints • Crash culprit ๏ Fix bug • Fast selective replay } } Titian [VLDB 2016] BigDebug [ICSE 2016]
  52. 52. First Strategy Convert changes in code to changes in data
  53. 53. Incremental Plan Input aa b c aa c Map (aa, 1) (b, 1) (c, 1) (aa, 1) (c, 1) Shuffle (aa, [1, 1]) (b, 1) (c, [1, 1]) Reduce (aa, 2) (b, 1) (c, 2) counts pairs lines Stage 1 Stage 2 shuffle input .map(x=>(x,1)) .reduceByKey(_+_)
  54. 54. Incremental Plan Inject a filter in the workflow counts pairs lines Stage 1 Stage 2 shuffle filter input .filter(x=>x!=‘c’).map(x=>(x,1)) .reduceByKey(_+_)
  55. 55. Input aa b c aa c Map (aa, 1) (b, 1) (aa, 1) Shuffle (aa, [1, 1]) (b, 1) Reduce (aa, 2) (b, 1) Filter aa b aa counts pairs lines Stage 1 Stage 2 shuffle filter input .filter(x=>x!=‘c’).map(x=>(x,1)) .reduceByKey(_+_) Incremental Plan
  56. 56. Incremental Plan Input aa b c aa c Map (aa, 1) (b, 1) (c, 1) (aa, 1) (c, 1) Shuffle (aa, [1, 1]) (b, 1) (c, [1, 1]) Reduce (aa, 2) (b, 1) (c, 2) Filter aa b c aa c
  57. 57. Input aa b c aa c Map (aa, 1) (b, 1) (c, 1) (aa, 1) (c, 1) Shuffle (aa, [1, 1]) (b, 1) (c, [1, 1]) Reduce (aa, 2) (b, 1) (c, 2) Filter aa b c aa c δFilter —c —c Incremental Plan
  58. 58. Incremental Plan Input aa b c aa c Map (aa, 1) (b, 1) (c, 1) (aa, 1) (c, 1) Shuffle (aa, [1, 1]) (b, 1) (c, [1, 1]) Reduce (aa, 2) (b, 1) (c, 2) Filter aa b c aa c δFilter —c —c ∆Map —(c, 1) —(c, 1)
  59. 59. Incremental Plan Input aa b c aa c Map (aa, 1) (b, 1) (c, 1) (aa, 1) (c, 1) Shuffle (aa, [1, 1]) (b, 1) (c, [1, 1]) Reduce (aa, 2) (b, 1) (c, 2) Filter aa b c aa c δFilter —c —c ∆Map —(c, 1) —(c, 1) ∆Shuffle (c, [—1, —1])
  60. 60. Incremental Plan Input aa b c aa c Map (aa, 1) (b, 1) (c, 1) (aa, 1) (c, 1) Shuffle (aa, [1, 1]) (b, 1) (c, [1, 1]) Reduce (aa, 2) (b, 1) (c, 2) Filter aa b c aa c δFilter —c —c ∆Map —(c, 1) —(c, 1) ∆Shuffle (c, [—1, —1]) ∆Reduce —(c, 2)
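A minimal sketch of the delta idea, illustrative only (plain Scala collections; Vega propagates such deltas through the actual Spark stages and merges them with saved intermediate results): the injected filter becomes negative deltas for every removed 'c', and only those deltas flow through map and reduce before being applied to the old output.

  // Hypothetical sketch of incremental (delta) replay for word count.
  object DeltaSketch {
    def main(args: Array[String]): Unit = {
      val input = Seq("aa", "b", "c", "aa", "c")

      // Saved result of the original run: word -> count.
      val original = input.map(w => (w, 1)).groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

      // The injected filter turns into data deltas: -1 for every removed 'c'.
      val deltas      = input.filter(_ == "c").map(w => (w, -1))
      val deltaCounts = deltas.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

      // Merge the deltas into the saved result; drop keys whose count reaches zero.
      val updated = (original.keySet ++ deltaCounts.keySet).flatMap { w =>
        val c = original.getOrElse(w, 0) + deltaCounts.getOrElse(w, 0)
        if (c != 0) Some(w -> c) else None
      }.toMap

      println(updated)   // Map(aa -> 2, b -> 1): same as rerunning with the filter
    }
  }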
  61. 61. Performance [Chart: Time (s) vs. Input data size (GB)] About 10X faster
  62. 62. Performance ๏ Good up to a certain point ๏ Two factors dominate: • Space utilization • Time to shuffle deltas ๏ Insight: • The more downstream the filter is placed, the better the incremental performance • Especially beneficial if we can place it past the shuffle
  63. 63. Second Strategy Push code changes downstream
  64. 64. Commutative Rewrite Input aa b c aa c Map (aa, 1) (b, 1) (c, 1) (aa, 1) (c, 1) Shuffle (aa, [1, 1]) (b, 1) (c, [1, 1]) Reduce (aa, 2) (b, 1) (c, 2) Filter aa b c aa c filter(x=>x!=‘c’)
  65. 65. Commutative Rewrite Input aa b c aa c Map (aa, 1) (b, 1) (c, 1) (aa, 1) (c, 1) Shuffle (aa, [1, 1]) (b, 1) (c, [1, 1]) Reduce (aa, 2) (b, 1) (c, 2) Filter aa b c aa c filter(x=>x!=‘c’) But the input to the filter is (word, 1) We cannot use the filter anymore
  66. 66. Commutative Rewrite Input aa b c aa c Map (aa, 1) (b, 1) (c, 1) (aa, 1) (c, 1) Shuffle (aa, [1, 1]) (b, 1) (c, [1, 1]) Reduce (aa, 2) (b, 1) (c, 2) Filter aa b c aa c filter(x=>x!=‘c’) Observe that the map is invertible We can use the old filter by using the inverse of the map
  67. 67. Commutative Rewrite Input aa b c aa c Map (aa, 1) (b, 1) (c, 1) (aa, 1) (c, 1) Shuffle (aa, [1, 1]) (b, 1) (c, [1, 1]) Reduce (aa, 2) (b, 1) (c, 2) Filter’ aa b c aa c filter’((x, o)=>x!=‘c’) Rewritten filter
  68. 68. Commutative Rewrite Input aa b c aa c Map (aa, 1) (b, 1) (c, 1) (aa, 1) (c, 1) Shuffle (aa, [1, 1]) (b, 1) (c, [1, 1]) Reduce (aa, 2) (b, 1) (c, 2) Filter’ (aa, 2) (b, 1) filter((x, o)=>x!=‘c’) Shuffle and Reduce operations preserve keys
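A minimal sketch of the rewrite, illustrative only (plain Scala; not Vega's rewrite engine): because the map x => (x, 1) is invertible on the key and the shuffle/reduce preserve keys, the original filter can be re-expressed over the keys of the saved output instead of over the input.

  // Hypothetical sketch of the commutative rewrite for word count.
  object RewriteSketch {
    def main(args: Array[String]): Unit = {
      val input = Seq("aa", "b", "c", "aa", "c")

      // Saved output of the original run: word -> count.
      val original: Map[String, Int] =
        input.map(w => (w, 1)).groupBy(_._1).map { case (w, ps) => (w, ps.size) }

      val oldFilter: String => Boolean = _ != "c"

      // Rewritten filter operating on (key, count) pairs of the output.
      val rewritten: ((String, Int)) => Boolean = { case (k, _) => oldFilter(k) }

      val updated = original.filter(rewritten)
      println(updated)   // Map(aa -> 2, b -> 1): no map/shuffle/reduce rerun needed
    }
  }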
  69. 69. Performance [Chart: Time vs. Input data size (GB)] About 1000X faster
  70. 70. Why does it scale so well? ๏ Runtime is on the order of the output size ๏ Output size depends on the number of unique words ๏ Unique words << total words
  71. 71. Combining Strategies ๏ Push the changed transform past as many shuffles as possible with rewrites • The new transform can be placed only after materialization points • By default we materialize shuffle output • Efficient because Spark already saves shuffle output for fault tolerance ๏ Use delta computation for the remaining workflow
  72. 72. Vega ๏ Built on Spark and Spark SQL (only filter rewrite) ๏ Spark SQL API is unchanged ๏ Vega's extended Spark API includes: • Functions with inverses (for maps) • Inverse values (for incremental reduce) ๏ Automatically rewrites workflows using commutativity and incremental evaluation
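The slides do not show the API itself; purely as a guess at its shape, the traits below sketch what “functions with inverses (for maps)” and “inverse values (for incremental reduce)” could look like. Every name and signature here is hypothetical, not Vega's.

  // Hypothetical, illustrative only.
  // An invertible map lets a downstream predicate be rewritten in terms of the
  // map's output; an inverse value lets a reduce retract a prior contribution.
  trait InvertibleMap[A, B] {
    def apply(a: A): B
    def invert(b: B): A            // assumed total on values the map produces
  }

  trait IncrementalReduce[V] {
    def combine(a: V, b: V): V     // the original reduce function
    def inverse(v: V): V           // e.g. negation for sums, so deltas can retract
  }

  // Example instances for word count: map word => (word, 1), reduce by sum.
  object WordCountOps {
    val wordToPair: InvertibleMap[String, (String, Int)] =
      new InvertibleMap[String, (String, Int)] {
        def apply(w: String) = (w, 1)
        def invert(p: (String, Int)) = p._1
      }
    val sum: IncrementalReduce[Int] =
      new IncrementalReduce[Int] {
        def combine(a: Int, b: Int) = a + b
        def inverse(v: Int)         = -v
      }
  }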
  73. 73. ๏ Titian provides Spark users the ability to trace through program execution ๏ Features: • Intermediate results are shared in memory • Tight integration with the Spark API (LineageRDD) • Low job overhead • Efficient lineage queries ๏ Vega provides 1–3 orders of magnitude performance gains over rerunning the computation from scratch ๏ Both provide results in a few seconds for many workflows, allowing interactive usage } Transformation provenance Conclusions
  74. 74. Thank you
  75. 75. Outline ‣ Motivating Scenario ‣ Titian Programming Interface ‣ Internals ‣ Performance ‣ Conclusions
  76. 76. Configuration ๏ Two sets of experiments: • Unstructured - grep and word count • Structured - PigMix queries ๏ Datasets: • Unstructured: files from 500MB to 500GB containing words generated using a Zipf distribution over a dictionary of 8000 words • Structured: we used the PigMix generator to create datasets of sizes ranging from 1GB to 1TB ๏ Configuration: • 16 machines, each with 4 cores (2 hyperthreads per core), 32GB of RAM, and a 1TB disk • Spark 1.2.1
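As a side note on the unstructured datasets, a minimal sketch of generating Zipf-distributed words from a fixed 8000-word dictionary, illustrative only (not the generator actually used in the experiments):

  import scala.util.Random

  // Hypothetical sketch: sample words with a Zipf(1.0) distribution over
  // ranks 1..8000, as described for the unstructured word-count inputs.
  object ZipfWordsSketch {
    def main(args: Array[String]): Unit = {
      val dictSize = 8000
      val rnd      = new Random(42)

      // Normalized cumulative Zipf probabilities for ranks 1..dictSize.
      val weights = (1 to dictSize).map(r => 1.0 / r)
      val total   = weights.sum
      val cdf     = weights.map(_ / total).scanLeft(0.0)(_ + _).tail

      def sampleWord(): String = {
        val u    = rnd.nextDouble()
        val idx  = cdf.indexWhere(u <= _)
        val rank = if (idx >= 0) idx + 1 else dictSize   // guard against rounding at the tail
        s"word$rank"
      }

      // Print a small sample; a real run would write many GB of such lines.
      (1 to 20).foreach(_ => println(sampleWord()))
    }
  }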
  77. 77. Lineage Capture Runtime Overheads
  78. 78. Tracing Performance
  79. 79. ๏ Titian provides Spark users the ability to trace through program execution at interactive speed ๏ Features: • Intermediate results are shared in memory • Tight integration with the Spark API (LineageRDD) • Low job overhead • Efficient lineage queries ๏ We believe Titian will open the door to program logic debugging, iterative data (and program) cleaning, and exploratory analysis } Transformation provenance Titian: Data Provenance in Spark
  80. 80. Combiner LineageRDD Reducer LineageRDD Instrumented Workflow Hadoop LineageRDD counts pairs codes errors lines Stage 1 Stage 2 reports Stage LineageRDD
  81. 81. Capturing: HadoopLineageRDD Hadoop LineageRDD lines Input records Output records Input ID Output ID TaskContext
  82. 82. Capturing: HadoopLineageRDD Input records Output records Input ID Output ID Hadoop LineageRDD lines offset1, “error 400 …” TaskContext
  83. 83. Capturing: HadoopLineageRDD Input records Output records Input ID Output ID Get input Id Hadoop LineageRDD lines offset1, “error 400 …” TaskContext
  84. 84. Capturing: HadoopLineageRDD Input records Output records Input ID Output ID offset1 Get input Id Hadoop LineageRDD lines offset1, “error 400 …” TaskContext
  85. 85. Capturing: HadoopLineageRDD Input records Output records Input ID Output ID offset1 “error 400 …” Hadoop LineageRDD lines TaskContext
  86. 86. Capturing: HadoopLineageRDD Input records Output records Input ID Output ID offset1 Get output Id Hadoop LineageRDD lines “error 400 …” TaskContext
  87. 87. Capturing: HadoopLineageRDD Input records Output records Input ID Output ID offset1 id1 Get output Id Hadoop LineageRDD lines “error 400 …” TaskContext
  88. 88. Capturing: HadoopLineageRDD Input records Output records Input ID Output ID offset1 id1 Save Hadoop LineageRDD lines “error 400 …” TaskContext
  89. 89. Capturing: HadoopLineageRDD Input records Output records Input ID Output ID offset1 id1 Save Hadoop LineageRDD lines “error 400 …” TaskContext id1
  90. 90. Capturing: HadoopLineageRDD Input records Output records Hadoop LineageRDD lines Input ID Output ID offset1 id1 offset2 offset2, “error 4 …” TaskContext id1
  91. 91. Capturing: HadoopLineageRDD Input records Output records Hadoop LineageRDD lines Input ID Output ID offset1 id1 offset2 id2 “error 4 …” TaskContext id2
  92. 92. Capturing: HadoopLineageRDD Input records Output records Hadoop LineageRDD lines Input ID Output ID offset1 id1 offset2 id2 offset3 offset3, “error 400 …” TaskContext id2
  93. 93. Capturing: HadoopLineageRDD Input records Output records Hadoop LineageRDD lines Input ID Output ID offset1 id1 offset2 id2 offset3 id3 “error 400 …” TaskContext id3
  94. 94. Combiner LineageRDD Reducer LineageRDD Instrumented Workflow Hadoop LineageRDD counts pairs codes errors lines Stage 1 Stage 2 reports Stage LineageRDD
  95. 95. Combiner LineageRDD Combiner Build Phase Hadoop LineageRDD pairs codes errors lines Stage 1 offset1, “error 400 …” Key Input IDs Key Agg Value
  96. 96. Combiner LineageRDD Combiner Build Phase Hadoop LineageRDD pairs codes errors lines Stage 1 Input ID Output ID offset1 id1 “error 400 …” Key Agg Value Key Input IDs TaskContext id1
  97. 97. Combiner LineageRDD Combiner Build Phase Hadoop LineageRDD pairs codes errors lines Stage 1 Input ID Output ID offset1 id1 “error 400 …” Key Agg Value Key Input IDs TaskContext id1
  98. 98. Combiner LineageRDD Combiner Build Phase Hadoop LineageRDD pairs codes errors lines Stage 1 Input ID Output ID offset1 id1 400 Key Agg Value Key Input IDs TaskContext id1
  99. 99. Combiner LineageRDD Combiner Build Phase Hadoop LineageRDD pairs codes errors lines Stage 1 Input ID Output ID offset1 id1 (400, 1) Key Agg Value Key Input IDs TaskContext id1
  100. 100. Combiner LineageRDD Combiner Build Phase Hadoop LineageRDD pairs codes errors lines Stage 1 Input ID Output ID offset1 id1 Key Agg Value 400 1 Key Input IDs 400 TaskContext id1
  101. 101. Combiner LineageRDD Combiner Build Phase Hadoop LineageRDD pairs codes errors lines Stage 1 Input ID Output ID offset1 id1 Key Agg Value 400 1 Key Input IDs 400 TaskContext id1
  102. 102. Combiner LineageRDD Combiner Build Phase Hadoop LineageRDD pairs codes errors lines Stage 1 Input ID Output ID offset1 id1 Key Agg Value 400 1 Key Input IDs 400 { id1 } TaskContext id1
  103. 103. Combiner LineageRDD Combiner Build Phase Hadoop LineageRDD pairs codes errors lines Stage 1 Key Agg Value 400 1 Key Input IDs 400 { id1 } offset2, “error 4 …” Input ID Output ID offset1 id1 TaskContext id1
  104. 104. Combiner LineageRDD Combiner Build Phase Hadoop LineageRDD pairs codes errors lines Stage 1 Input ID Output ID offset1 id1 offset2 id2 Key Agg Value 400 1 4 1 Key Input IDs 400 { id1 } 4 { id2 } TaskContext id2
  105. 105. Combiner LineageRDD Combiner Build Phase Hadoop LineageRDD pairs codes errors lines Stage 1 Input ID Output ID offset1 id1 offset2 id2 Key Agg Value 400 1 4 1 Key Input IDs 400 { id1 } 4 { id2 } TaskContext id2 offset3, “error 400 …”
  106. 106. Combiner LineageRDD Combiner Build Phase Hadoop LineageRDD pairs codes errors lines Stage 1 Input ID Output ID offset1 id1 offset2 id2 offset3 id3 Key Agg Value 400 2 4 1 Key Input IDs 400 { id1, id3 } 4 { id2 } TaskContext id3
  107. 107. Combiner Probe Phase Input records Output records Input ID Output ID Combiner LineageRDD pairs TaskContext id3 Key Input IDs 400 { id1, id3 } 4 { id2 } Key Agg Value 400 2 4 1
  108. 108. Combiner Probe Phase Input records Output records Input ID Output ID Combiner LineageRDD pairs TaskContext id3 Key Input IDs 400 { id1, id3 } 4 { id2 } Key Agg Value 400 2 4 1 (400, 2)
  109. 109. Combiner Probe Phase Input records Output records Input ID Output ID Combiner LineageRDD pairs TaskContext id3 Key Input IDs 400 { id1, id3 } 4 { id2 } Key Agg Value 400 2 4 1 (400, 2) Get output Id
  110. 110. Combiner Probe Phase Input records Output records Input ID Output ID { id1, id3 } 400 Combiner LineageRDD pairs TaskContext id3 Key Input IDs 400 { id1, id3 } 4 { id2 } Key Agg Value 400 2 4 1 (400, 2) Get output Id
  111. 111. Combiner Probe Phase Input records Output records Input ID Output ID { id1, id3 } 400 { id2 } 4 Combiner LineageRDD pairs TaskContext id3 Key Input IDs 400 { id1, id3 } 4 { id2 } Key Agg Value 400 2 4 1 (4, 1)
  112. 112. Combiner LineageRDD Reducer LineageRDD Instrumented Workflow Hadoop LineageRDD counts pairs codes errors lines Stage 1 Stage 2 reports Stage LineageRDD Input ID Output ID offset1 id1 TaskContext Id1 Input ID Output ID { id1, id3 } 400 { id2 } 4 (400, 2) (4, 1)
  113. 113. Combiner LineageRDD Reducer LineageRDD Instrumented Workflow Hadoop LineageRDD counts pairs codes errors lines Stage 1 Stage 2 reports Stage LineageRDD Input ID Output ID offset1 id1 TaskContext Id1 Input ID Output ID { id1, id3 } 400 { id2 } 4 (400, (2, p1)) (4, (1, p1))
  114. 114. Combiner LineageRDD Reducer LineageRDD Instrumented Workflow Hadoop LineageRDD counts pairs codes errors lines Stage 1 Stage 2 reports Stage LineageRDD Input ID Output ID offset1 id1 TaskContext Id1 Input ID Output ID { id1, id3 } 400 { id2 } 4 (400, (2, p1)) (4, (1, p1)) (400, (5, p2)) …
  115. 115. Combiner LineageRDD Reducer LineageRDD Instrumented Workflow Hadoop LineageRDD counts pairs codes errors lines Stage 1 Stage 2 reports Stage LineageRDD Input ID Output ID offset1 id1 TaskContext Id1 Input ID Output ID { id1, id3 } 400 { id2 } 4 (400, (2, p1)) (4, (1, p1)) (400, (5, p2)) …
  116. 116. Combiner LineageRDD Reducer LineageRDD Instrumented Workflow Hadoop LineageRDD counts pairs codes errors lines Stage 1 Stage 2 reports Stage LineageRDD Input ID Output ID offset1 id1 TaskContext Id1 Input ID Output ID { id1, id3 } 400 { id2 } 4 (400, (2, p1)) (4, (1, p1)) (400, (5, p2)) … TaskContext 400 Input ID Output ID [ p1, p2 ] 400
  117. 117. Capturing: StageLineageRDD Stage LineageRDD Input records Output records Input ID Output ID TaskContext 400
  118. 118. Capturing: StageLineageRDD Stage LineageRDD Input records Output records Input ID Output ID TaskContext 400 (Bad request, 7)
  119. 119. Capturing: StageLineageRDD Stage LineageRDD Input records Output records Input ID Output ID TaskContext 400 Get input Id (Bad request, 7)
  120. 120. Capturing: StageLineageRDD Stage LineageRDD Input records Output records Input ID Output ID TaskContext 400 Get input Id (Bad request, 7)
  121. 121. Capturing: StageLineageRDD Stage LineageRDD Input records Output records Input ID Output ID 400 TaskContext 400 Get input Id (Bad request, 7)
  122. 122. Capturing: StageLineageRDD Stage LineageRDD Input records Output records Input ID Output ID 400 TaskContext 400 (Bad request, 7)
  123. 123. Capturing: StageLineageRDD Stage LineageRDD Input records Output records Input ID Output ID 400 TaskContext 400 Get output Id (Bad request, 7)
  124. 124. Get output Id Capturing: StageLineageRDD Stage LineageRDD Input records Output records Input ID Output ID 400 id1 TaskContext 400 (Bad request, 7)
  125. 125. Capturing: StageLineageRDD Stage LineageRDD Input records Output records Input ID Output ID 400 id1 4 TaskContext 4 (Failure, 1)
  126. 126. Capturing: StageLineageRDD Stage LineageRDD Input records Output records Input ID Output ID 400 id1 4 id2 TaskContext 4 (Failure, 7)
