Interactive Programs Debugging and Development in Apache Spark
Outline
‣ Motivating Scenario
‣ Titian Programming Interface
‣ Internals
‣ Vega
‣ Conclusions
๏ Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is difficult
๏ Analysis tools are still in their “infancy”
๏ Today’s large-scale jobs are black boxes:
• A job is submitted to a cluster
• Results come back minutes to hours later
• There is no visibility into the running algorithm
Big Data Debugging
Big Data Debugging - State of the Art
๏ An easy-to-use, GDB-like debugger [ICSE 16] (not covered in this talk)
๏ Visibility into the data of a running workflow
• E.g., what (input) data led to this (outlier) result?
๏ Selective replay of a portion of the data processing steps on the subsets of intermediate data leading to outlier results
๏ Interactive program analysis
Big Data Debugging - Desiderata
๏ Visibility of data -> tracking the dependencies between individual input and output records
๏ Selective replay -> storage of intermediate results:
• Datasets shared between the running job and the analysis tool
๏ Interactivity -> implementation constraints:
• Latency constraint: in-memory computation
• Programming interface constraint: integration with the Spark DSL
Big Data Debugging - Challenges
๏ A well-known technique in databases
๏ Two granularities of provenance:
• Transformation (coarse-grained) provenance
– Records the complete workflow of the derivation of a dataset
– Spark’s RDD lineage is an example of this form of provenance
• Data (fine-grained) provenance
– Records the data dependencies between input and output records
– The type of provenance Titian focuses on
Data Provenance (Lineage)
Sensors
Tuple-ID  Time  Sensor-ID  Temperature
T1        11AM  1           34
T2        11AM  2           35
T3        11AM  3           35
T4        12PM  1           35
T5        12PM  2           35
T6        12PM  3          100
T7        1PM   1           35
T8        1PM   2           35
T9        1PM   3           80

SELECT AVG(temp), time FROM sensors GROUP BY time

Result-ID  Time  AVG(temp)
ID-1       11AM  34.6
ID-2       12PM  56.6
ID-3       1PM   50
Data Provenance - Example
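To make the idea concrete, here is a plain-Python sketch (illustrative only, not Titian or DBMS code) of capturing fine-grained provenance for the query above: alongside each output group we record which input Tuple-IDs contributed to it, so an outlier average can be traced back to its raw readings.

```python
# Toy model of fine-grained data provenance for:
#   SELECT AVG(temp), time FROM sensors GROUP BY time
# Data is the Sensors table from the slide.

sensors = [
    ("T1", "11AM", 1, 34), ("T2", "11AM", 2, 35), ("T3", "11AM", 3, 35),
    ("T4", "12PM", 1, 35), ("T5", "12PM", 2, 35), ("T6", "12PM", 3, 100),
    ("T7", "1PM", 1, 35),  ("T8", "1PM", 2, 35),  ("T9", "1PM", 3, 80),
]

groups = {}       # time -> list of temperatures
provenance = {}   # time -> set of contributing Tuple-IDs
for tid, time, sensor, temp in sensors:
    groups.setdefault(time, []).append(temp)
    provenance.setdefault(time, set()).add(tid)

result = {t: round(sum(v) / len(v), 1) for t, v in groups.items()}

# The 12PM outlier (avg ~56.7) traces back to inputs T4, T5, T6;
# inspecting them reveals the bad reading T6 (temperature 100).
print(result["12PM"], sorted(provenance["12PM"]))
```

The point of the sketch: the provenance table is built as a side effect of the normal group-by, which is exactly the kind of capture Titian piggybacks onto Spark's stages.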
Outlier question: why do ID-2 and ID-3 have those high averages?
Data Provenance - Example
๏ They use external storage systems (HDFS in RAMP [CIDR-11], a DBMS in Newt [SOCC-13]) to retain lineage data -> high overhead
๏ Data provenance queries are supported in a separate programming interface -> low interactivity
Previous Data Provenance DISC Systems
๏ Word Count job
๏ RAMP takes up to 4X the runtime of Spark
๏ Newt up to 86X
Experience with Newt and RAMP
[Chart: running time in seconds (log scale) vs. dataset size in GB, for Spark, Newt, and RAMP]
Outline
‣ Motivating Scenario
‣ Titian Programming Interface
‣ Internals
‣ Vega
‣ Conclusions
Load error messages from a log, count the number of occurrences of each error, and return a report containing the description of each error

lc = new LineageContext(sc)
lines = lc.textFile("hdfs://...")
errors = lines.filter(_.startsWith("error"))
codes = errors.map(_.split("\t")(1))
pairs = codes.map(word => (word, 1))
counts = pairs.reduceByKey(_ + _)
reports = counts.map(kv => (dscr(kv._1), kv._2))
reports.collect.foreach(println)
Example: Log Analysis
Given the result of the previous example, select the most frequent error and trace back to the input lines that produced it

frequentPair = reports.sortBy(_._2, false).take(1)
frequent = reports.filter(_ == frequentPair(0))
lineage = frequent.getLineage()
input = lineage.goBackAll()
input.collect().foreach(println)
Example: Backward Tracing
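A plain-Python model of what `goBackAll()` computes (illustrative, not Titian's implementation): backward tracing is a chain of joins over the per-agent lineage tables that Titian captures, from the output ids back to the input offsets. The table contents mirror the captured-lineage example later in the talk; the Reducer partition table, which drives the targeted shuffle in the distributed case, is carried along but elided from this single-node sketch.

```python
# Lineage tables, one per capture point (Input ID -> Output ID):
hadoop   = [("offset1", "id1"), ("offset2", "id2"), ("offset3", "id3")]
combiner = [(frozenset({"id1", "id3"}), 400), (frozenset({"id2"}), 4)]
reducer  = [(["p1", "p2"], 400), (["p1"], 4)]   # partitions per key (unused here)
stage    = [(400, "id1"), (4, "id2")]

def go_back_all(output_ids):
    # Join Stage.Input ID with Reducer/Combiner.Output ID (the record key)
    keys = {key for key, out in stage if out in output_ids}
    # Collect the Combiner's input-id sets for those keys
    ids = set()
    for in_ids, key in combiner:
        if key in keys:
            ids |= in_ids
    # Join Combiner.Input ID with Hadoop.Output ID to reach input offsets
    return sorted(off for off, oid in hadoop if oid in ids)

# Trace the result with id "id1" (key 400) back to its input offsets.
print(go_back_all({"id1"}))   # ['offset1', 'offset3']
```

Forward tracing (`goNextAll()`) is the same idea with the joins run in the opposite direction.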
Return the error codes generated by the network sub-system (indicated in the log by a "NETWORK" tag)

network = errors.filter(_.contains("NETWORK"))
lineage = network.getLineage()
output = lineage.goNextAll()
output.collect().foreach(println)
Example: Forward Tracing
Return the error distribution without the errors caused by the Guest user

lineage = reports.getLineage()
inputLines = lineage.goBackAll()
noGuest = inputLines.filter(l => !l.contains("Guest") && l.startsWith("error"))
newCodes = noGuest.map(_.split("\t")(1))
newPairs = newCodes.map(word => (word, 1))
newCounts = newPairs.reduceByKey(_ + _)
newRep = newCounts.map(kv => (dscr(kv._1), kv._2))
newRep.collect
Example: Selective Replay
Outline
‣ Motivating Scenario
‣ Titian Programming Interface
‣ Internals
‣ Vega
‣ Conclusions
๏ LineageContext wraps SparkContext
• Provides visibility into the submitted job
๏ LineageRDDs are instrumented at stage boundaries
• They wrap the native RDDs
• The specific LineageRDD implementation depends on the instrumented transformation
๏ Provenance data is buffered inside LineageRDDs
• Then saved into the Spark BlockManager for querying
Provenance Capturing
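An illustrative sketch of the capture mechanism described above: a wrapper around a record iterator that buffers (input id, output id) pairs as records stream through, and hands the current id downstream through a stand-in for Spark's TaskContext. All names here are hypothetical, not Titian's actual classes.

```python
# Toy LineageRDD-style capture point. `task_context` is a plain dict
# standing in for Spark's TaskContext, which Titian uses to pass the
# current record id from one capture point to the next.

class CaptureIterator:
    def __init__(self, records, task_context):
        self.records = iter(records)   # stream of (input_id, payload) pairs
        self.ctx = task_context
        self.table = []                # buffered (input id, output id) rows
        self.count = 0

    def __iter__(self):
        return self

    def __next__(self):
        input_id, payload = next(self.records)   # raises StopIteration at end
        self.count += 1
        output_id = f"id{self.count}"            # assign an id to the output record
        self.table.append((input_id, output_id)) # buffer the lineage row
        self.ctx["current_id"] = output_id       # propagate id downstream
        return payload                           # the payload flows on unchanged

ctx = {}
it = CaptureIterator([("offset1", "error 400 ..."), ("offset2", "error 4 ...")], ctx)
payloads = list(it)
print(it.table)   # [('offset1', 'id1'), ('offset2', 'id2')]
```

The key property: the data plane sees only the payloads, while the lineage table accumulates as a side effect and can later be saved (in Titian, into the BlockManager).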
[DAG: lines -> errors -> codes -> pairs (Stage 1) => shuffle => counts -> reports (Stage 2)]
lines = sc.textFile("hdfs://...")
errors = lines.filter(_.startsWith("error"))
codes = errors.map(_.split("\t")(1))
pairs = codes.map(word => (word, 1))
counts = pairs.reduceByKey(_ + _)
reports = counts.map(kv => (dscr(kv._1), kv._2))
Spark Stage DAG
Instrumented Spark Stage DAG
[Instrumented workflow: a HadoopLineageRDD is inserted before lines; a CombinerLineageRDD captures the end of Stage 1 (after pairs); a ReducerLineageRDD captures the start of Stage 2 (after the shuffle, before counts); a StageLineageRDD captures the stage output (after reports)]
Instrumented Workflow
Lineage Capture Runtime Overheads
[Chart: running time in seconds (log scale) vs. dataset size in GB, for Spark, Titian, Newt, and RAMP]
๏ Same Word Count job
๏ Titian is on average 1.3X slower than Spark
Lineage tables (Input ID -> Output ID):
Hadoop:   offset1 -> id1, offset2 -> id2, offset3 -> id3
Combiner: { id1, id3 } -> 400, { id2 } -> 4
Reducer:  [p1, p2] -> 400, [p1] -> 4
Stage:    400 -> id1, 4 -> id2
Example: Captured Data Lineage
Example: Trace Back
Tracing proceeds backward by joining adjacent lineage tables:
1. Stage.Input ID joined with Reducer.Output ID — e.g., output id1 joins to key 400
2. Reducer.Output ID joined with Combiner.Output ID — key 400 joins to the Combiner's input set { id1, id3 }
3. Combiner.Input ID joined with Hadoop.Output ID — { id1, id3 } joins to input offsets offset1 and offset3
Now let’s do it for real!
Example: Trace Back (distributed)
Each worker holds the lineage tables for its own partitions: Worker1 and Worker2 store Hadoop and Combiner tables, while Worker3 stores the Reducer and Stage tables.
1. On Worker3, join Stage.Input ID with Reducer.Output ID: output id1 joins to key 400, whose Reducer entry lists the upstream partitions [p1, p2]
2. Targeted shuffle: the pairs (p1 -> 400) and (p2 -> 400) are shipped only to the workers hosting partitions p1 and p2, instead of broadcasting to every worker
3. On Worker1 and Worker2, join Combiner.Output ID with Reducer.Output ID (key 400), then Combiner.Input ID with Hadoop.Output ID, reaching the input offsets on each worker (e.g., offset1 and offset3 on Worker1)
Tracing Performance
๏ Word Count job
๏ Tracing one record backward takes < 1 sec for datasets up to 100GB
๏ 18 sec for a 500GB dataset
Vega: Optimizations for Selective Replay
Matteo Interlandi, Sai Deep Tetali, Muhammad Ali Gulzar, Joseph Noor, Miryung Kim, Todd Millstein, Tyson Condie
Under Submission
Debugging workflow
๏ Run the program
๏ Understand the cause of bugs / outliers — Titian [VLDB 2016], BigDebug [ICSE 2016]:
• Lineage
• Breakpoints / watchpoints
• Crash culprit
๏ Fix the bug
• Fast selective replay
First Strategy
Convert changes in code to changes in data
Incremental Plan

input .map(x => (x, 1)) .reduceByKey(_ + _)
[Stage 1: lines -> pairs | shuffle | Stage 2: counts]

Input:   aa, b, c, aa, c
Map:     (aa, 1), (b, 1), (c, 1), (aa, 1), (c, 1)
Shuffle: (aa, [1, 1]), (b, 1), (c, [1, 1])
Reduce:  (aa, 2), (b, 1), (c, 2)
Inject a filter in the workflow:

input .filter(x => x != 'c') .map(x => (x, 1)) .reduceByKey(_ + _)
[Stage 1: lines -> filter -> pairs | shuffle | Stage 2: counts]

Input:   aa, b, c, aa, c
Filter:  aa, b, aa
Map:     (aa, 1), (b, 1), (aa, 1)
Shuffle: (aa, [1, 1]), (b, 1)
Reduce:  (aa, 2), (b, 1)
Incremental Plan
Incremental Plan
Rather than rerunning the job, keep the cached results of the original run and propagate only the deltas introduced by the new filter:

Input:    aa, b, c, aa, c
δFilter:  —c, —c (the deletions the new filter introduces)
∆Map:     —(c, 1), —(c, 1)
∆Shuffle: (c, [—1, —1])
∆Reduce:  —(c, 2)

The cached output (aa, 2), (b, 1), (c, 2) is then patched by removing (c, 2).
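The delta propagation above can be modeled in a few lines of plain Python (a sketch of the idea, not Vega's implementation): only the deleted records flow through the map/shuffle/reduce chain, and the cached output is patched at the end.

```python
# Toy model of Vega's incremental plan for word count after injecting
# filter(x => x != 'c'): propagate deletions (deltas) instead of rerunning.
from collections import Counter

inputs = ["aa", "b", "c", "aa", "c"]
cached = Counter(inputs)                  # cached Reduce output: aa->2, b->1, c->2

# δFilter: the records the new filter removes from the original input
deltas = [x for x in inputs if x == "c"]  # [-c, -c]

# ∆Map / ∆Shuffle / ∆Reduce: aggregate the deletions per key, then
# subtract them from the cached counts.
delta_counts = Counter(deltas)            # {c: 2}
for key, n in delta_counts.items():
    cached[key] -= n
    if cached[key] == 0:
        del cached[key]                   # key disappears entirely

print(dict(cached))   # {'aa': 2, 'b': 1}
```

Note the cost is proportional to the size of the deltas, not the size of the input, which is where the speedup over a full rerun comes from.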
Performance
[Chart: running time in seconds vs. input data size in GB]
About 10X faster
Performance
๏ Good up to a certain point
๏ Two factors dominate:
• Space utilization
• Time to shuffle the deltas
๏ Insight:
• The further downstream the filter is placed, the better the incremental performance
• Especially beneficial if we can place it past the shuffle
Second Strategy
Push code changes downstream
Commutative Rewrite
The changed transform is filter(x => x != 'c'), originally applied to the raw input. Can we instead apply it downstream, after the cached Reduce output (aa, 2), (b, 1), (c, 2)?
• But the input to the filter there is (word, count) — we cannot use the filter as-is
• Observe that the map x => (x, 1) is invertible: we can use the old filter by composing it with the inverse of the map
• Rewritten filter: filter'((x, o) => x != 'c'), applied directly to the Reduce output, yields (aa, 2), (b, 1)
• The rewrite is sound because Shuffle and Reduce operations preserve keys
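A plain-Python sketch of the rewrite's effect (illustrative, not Vega code): because shuffle and reduce preserve keys, the rewritten filter runs on the cached, output-sized data instead of the input.

```python
# Toy model of the commutative rewrite for word count: the user's
# filter(x => x != 'c') is pushed past the shuffle and reduce, becoming
# filter'((x, o) => x != 'c') over (key, count) pairs.
from collections import Counter

inputs = ["aa", "b", "c", "aa", "c"]
cached = Counter(inputs)   # cached Reduce output of the original run

# The rewritten filter inspects only the key of each output record,
# so it never has to touch the (much larger) input.
rewritten = {k: v for k, v in cached.items() if k != "c"}

print(rewritten)   # {'aa': 2, 'b': 1}
```

This is why the rewrite scales with the number of unique words rather than the total number of words.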
Performance
[Chart: running time vs. input data size in GB]
About 1000X faster
Why does it scale so well?
๏ Runtime is on the order of the output size
๏ The output size depends on the number of unique words
๏ Unique words << total words
Combining Strategies
๏ Push the changed transform past as many shuffles as possible using rewrites
• The new transform can be placed only after materialization points
• By default we materialize shuffle output
• Efficient because Spark already saves shuffle output for fault tolerance
๏ Use delta computation for the remaining workflow
Vega
๏ Built on Spark and Spark SQL (filter rewrite only)
๏ The Spark SQL API is unchanged
๏ The Spark API additions include:
• Functions with inverses (for maps)
• Inverse values (for incremental reduce)
๏ Automatically rewrites workflows using commutativity and incremental evaluation
๏ Titian gives Spark users the ability to trace through program execution
๏ Features:
• Intermediate results are shared in memory
• Tight integration with the Spark API (LineageRDD)
• Low job overhead
• Efficient lineage queries
๏ Vega provides 1–3 orders of magnitude performance gains over rerunning the computation from scratch
๏ Both provide results in a few seconds for many workflows, allowing interactive usage
Conclusions
Thank you
Outline
‣ Motivating Scenario
‣ Titian Programming Interface
‣ Internals
‣ Performance
‣ Conclusions
Configuration
๏ Two sets of experiments:
• Unstructured: grep and word count
• Structured: PigMix queries
๏ Datasets:
• Unstructured: 500MB to 500GB files containing words generated with a Zipf distribution from a dictionary of 8000 words
• Structured: we used the PigMix generator to create datasets of sizes ranging from 1GB to 1TB
๏ Configuration:
• 16 machines, each with 4 cores (2 hyper-threads per core), 32GB of RAM, and a 1TB disk
• Spark 1.2.1
Lineage Capture Runtime Overheads
Tracing Performance
๏ Titian gives Spark users the ability to trace through program execution at interactive speed
๏ Features:
• Intermediate results are shared in memory
• Tight integration with the Spark API (LineageRDD)
• Low job overhead
• Efficient lineage queries
๏ We believe Titian will open the door to program logic debugging, iterative data (and program) cleaning, and exploratory analysis
Titian: Data Provenance in Spark
Instrumented Workflow
[Diagram: HadoopLineageRDD inserted before lines; CombinerLineageRDD after pairs at the end of Stage 1; ReducerLineageRDD at the start of Stage 2; StageLineageRDD after reports]
Capturing: HadoopLineageRDD
The HadoopLineageRDD wraps the input of lines. For each record streaming through (e.g., (offset1, "error 400 ...")):
1. Get the input id (the record's offset, e.g., offset1)
2. Get the output id assigned to the produced record (e.g., id1)
3. Save the pair (offset1, id1) into its lineage table
4. Propagate the output id downstream through the TaskContext
Repeating for every record yields the table: offset1 -> id1, offset2 -> id2, offset3 -> id3.
Combiner Build Phase
During Stage 1's combine, the CombinerLineageRDD maintains a (Key, Input IDs) table alongside the combiner's (Key, Agg Value) hash table. For each incoming record:
1. Read the current record id from the TaskContext (e.g., id1 for (offset1, "error 400 ..."))
2. Let the record be mapped to its key and aggregated as usual (e.g., key 400, count 1)
3. Append the record id to the input-id set of that key
After processing the three records, the tables hold: Agg Value 400 -> 2, 4 -> 1; Input IDs 400 -> { id1, id3 }, 4 -> { id2 }.
Combiner Probe Phase
When the combiner emits its aggregated records (e.g., (400, 2) and (4, 1)), the probe phase looks up each key in the (Key, Input IDs) table and records the key as the output id, writing the lineage rows ({ id1, id3 }, 400) and ({ id2 }, 4) into the combiner lineage table.
Instrumented Workflow (shuffle)
Before the shuffle, each combiner output record is tagged with its partition id: (400, 2) from partition p1 is sent as (400, (2, p1)). After the shuffle, the reducer for key 400 sees (400, (2, p1)), (400, (5, p2)), ...; the ReducerLineageRDD records which partitions contributed to each key, e.g., [p1, p2] -> 400.
Capturing: StageLineageRDD
The StageLineageRDD wraps the final output. For each output record (e.g., (Bad request, 7)):
1. Get the input id from the TaskContext (the record's key, e.g., 400)
2. Get the output id assigned to the emitted record (e.g., id1)
3. Save the pair (400, id1); similarly, (Failure, 1) with key 4 yields (4, id2)
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in Spark, Matteo Interlandi, PostDoc, UCLA