Interactive Programs Debugging and Development in Apache Spark
Outline
‣ Motivating Scenario
‣ Titian Programming Interface
‣ Internals
‣ Vega
‣ Conclusions
๏ Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is difficult
๏ Analysis tools are still in their “infancy”
๏ Today’s large-scale jobs are black boxes:
• A job is submitted to a cluster
• Results come back minutes to hours later
• There is no visibility into the running algorithm
Big Data Debugging
Big Data Debugging - State of the Art
๏ An easy-to-use, GDB-like debugger [ICSE 16] (not covered in this talk)
๏ Visibility into the data of a running workflow
• E.g., what (input) data led to this (outlier) result?
๏ Selective replay of a portion of the data processing steps on the subsets of intermediate data leading to outlier results
๏ Interactive program analysis
Big Data Debugging - Desiderata
๏ Visibility of data -> tracking the dependencies between individual input and output records
๏ Selective replay -> storage of intermediate results:
• Datasets shared between the running job and the analysis tool
๏ Interactivity -> implementation constraints:
• Latency constraint: in-memory computation
• Programming interface constraint: integration with the Spark DSL
Big Data Debugging - Challenges
๏ A well-known technique in databases
๏ Two granularities of provenance:
• Transformation (coarse-grained) provenance
– Records the complete workflow of the derivation of a dataset
– Spark’s RDD lineage is an example of this form of provenance
• Data (fine-grained) provenance
– Records the data dependencies between input and output records
– The type of provenance Titian focuses on
Data Provenance (Lineage)
Sensors
Tuple-ID  Time  Sensor-ID  Temperature
T1        11AM  1           34
T2        11AM  2           35
T3        11AM  3           35
T4        12PM  1           35
T5        12PM  2           35
T6        12PM  3          100
T7        1PM   1           35
T8        1PM   2           35
T9        1PM   3           80

SELECT AVG(temp), time FROM sensors GROUP BY time

Result-ID  Time  AVG(temp)
ID-1       11AM  34.6
ID-2       12PM  56.6
ID-3       1PM   50
Data Provenance - Example
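To make the idea concrete, here is a plain-Python sketch (illustrative only, not Titian or DBMS code) of capturing fine-grained provenance for the query above: alongside each output group we record which input Tuple-IDs contributed to it, so an outlier average can be traced back to its raw readings.

```python
# Toy model of fine-grained data provenance for:
#   SELECT AVG(temp), time FROM sensors GROUP BY time
# Data is the Sensors table from the slide.

sensors = [
    ("T1", "11AM", 1, 34), ("T2", "11AM", 2, 35), ("T3", "11AM", 3, 35),
    ("T4", "12PM", 1, 35), ("T5", "12PM", 2, 35), ("T6", "12PM", 3, 100),
    ("T7", "1PM", 1, 35),  ("T8", "1PM", 2, 35),  ("T9", "1PM", 3, 80),
]

groups = {}       # time -> list of temperatures
provenance = {}   # time -> set of contributing Tuple-IDs
for tid, time, sensor, temp in sensors:
    groups.setdefault(time, []).append(temp)
    provenance.setdefault(time, set()).add(tid)

result = {t: round(sum(v) / len(v), 1) for t, v in groups.items()}

# The 12PM outlier (avg ~56.7) traces back to inputs T4, T5, T6;
# inspecting them reveals the bad reading T6 (temperature 100).
print(result["12PM"], sorted(provenance["12PM"]))
```

The point of the sketch: the provenance table is built as a side effect of the normal group-by, which is exactly the kind of capture Titian piggybacks onto Spark's stages.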
Outlier question: why do ID-2 and ID-3 have those high averages?
Data Provenance - Example
๏ They use external storage systems (HDFS in RAMP [CIDR-11], a DBMS in Newt [SOCC-13]) to retain lineage data -> high overhead
๏ Data provenance queries are supported in a separate programming interface -> low interactivity
Previous Data Provenance DISC Systems
๏ Word Count job
๏ RAMP takes up to 4X the runtime of Spark
๏ Newt up to 86X
Experience with Newt and RAMP
[Chart: running time in seconds (log scale) vs. dataset size in GB, for Spark, Newt, and RAMP]
Outline
‣ Motivating Scenario
‣ Titian Programming Interface
‣ Internals
‣ Vega
‣ Conclusions
Load error messages from a log, count the number of occurrences of each error, and return a report containing the description of each error

lc = new LineageContext(sc)
lines = lc.textFile("hdfs://...")
errors = lines.filter(_.startsWith("error"))
codes = errors.map(_.split("\t")(1))
pairs = codes.map(word => (word, 1))
counts = pairs.reduceByKey(_ + _)
reports = counts.map(kv => (dscr(kv._1), kv._2))
reports.collect.foreach(println)
Example: Log Analysis
Given the result of the previous example, select the most frequent error and trace back to the input lines that produced it

frequentPair = reports.sortBy(_._2, false).take(1)
frequent = reports.filter(_ == frequentPair(0))
lineage = frequent.getLineage()
input = lineage.goBackAll()
input.collect().foreach(println)
Example: Backward Tracing
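A plain-Python model of what `goBackAll()` computes (illustrative, not Titian's implementation): backward tracing is a chain of joins over the per-agent lineage tables that Titian captures, from the output ids back to the input offsets. The table contents mirror the captured-lineage example later in the talk; the Reducer partition table, which drives the targeted shuffle in the distributed case, is carried along but elided from this single-node sketch.

```python
# Lineage tables, one per capture point (Input ID -> Output ID):
hadoop   = [("offset1", "id1"), ("offset2", "id2"), ("offset3", "id3")]
combiner = [(frozenset({"id1", "id3"}), 400), (frozenset({"id2"}), 4)]
reducer  = [(["p1", "p2"], 400), (["p1"], 4)]   # partitions per key (unused here)
stage    = [(400, "id1"), (4, "id2")]

def go_back_all(output_ids):
    # Join Stage.Input ID with Reducer/Combiner.Output ID (the record key)
    keys = {key for key, out in stage if out in output_ids}
    # Collect the Combiner's input-id sets for those keys
    ids = set()
    for in_ids, key in combiner:
        if key in keys:
            ids |= in_ids
    # Join Combiner.Input ID with Hadoop.Output ID to reach input offsets
    return sorted(off for off, oid in hadoop if oid in ids)

# Trace the result with id "id1" (key 400) back to its input offsets.
print(go_back_all({"id1"}))   # ['offset1', 'offset3']
```

Forward tracing (`goNextAll()`) is the same idea with the joins run in the opposite direction.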
Return the error codes generated by the network sub-system (indicated in the log by a "NETWORK" tag)

network = errors.filter(_.contains("NETWORK"))
lineage = network.getLineage()
output = lineage.goNextAll()
output.collect().foreach(println)
Example: Forward Tracing
Return the error distribution without the errors caused by the Guest user

lineage = reports.getLineage()
inputLines = lineage.goBackAll()
noGuest = inputLines.filter(l => !l.contains("Guest") && l.startsWith("error"))
newCodes = noGuest.map(_.split("\t")(1))
newPairs = newCodes.map(word => (word, 1))
newCounts = newPairs.reduceByKey(_ + _)
newRep = newCounts.map(kv => (dscr(kv._1), kv._2))
newRep.collect
Example: Selective Replay
Outline
‣ Motivating Scenario
‣ Titian Programming Interface
‣ Internals
‣ Vega
‣ Conclusions
๏ LineageContext wraps SparkContext
• Provides visibility into the submitted job
๏ LineageRDDs are instrumented at stage boundaries
• They wrap the native RDDs
• The specific LineageRDD implementation depends on the instrumented transformation
๏ Provenance data is buffered inside LineageRDDs
• Then saved into the Spark BlockManager for querying
Provenance Capturing
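An illustrative sketch of the capture mechanism described above: a wrapper around a record iterator that buffers (input id, output id) pairs as records stream through, and hands the current id downstream through a stand-in for Spark's TaskContext. All names here are hypothetical, not Titian's actual classes.

```python
# Toy LineageRDD-style capture point. `task_context` is a plain dict
# standing in for Spark's TaskContext, which Titian uses to pass the
# current record id from one capture point to the next.

class CaptureIterator:
    def __init__(self, records, task_context):
        self.records = iter(records)   # stream of (input_id, payload) pairs
        self.ctx = task_context
        self.table = []                # buffered (input id, output id) rows
        self.count = 0

    def __iter__(self):
        return self

    def __next__(self):
        input_id, payload = next(self.records)   # raises StopIteration at end
        self.count += 1
        output_id = f"id{self.count}"            # assign an id to the output record
        self.table.append((input_id, output_id)) # buffer the lineage row
        self.ctx["current_id"] = output_id       # propagate id downstream
        return payload                           # the payload flows on unchanged

ctx = {}
it = CaptureIterator([("offset1", "error 400 ..."), ("offset2", "error 4 ...")], ctx)
payloads = list(it)
print(it.table)   # [('offset1', 'id1'), ('offset2', 'id2')]
```

The key property: the data plane sees only the payloads, while the lineage table accumulates as a side effect and can later be saved (in Titian, into the BlockManager).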
[DAG: lines -> errors -> codes -> pairs (Stage 1) => shuffle => counts -> reports (Stage 2)]
lines = sc.textFile("hdfs://...")
errors = lines.filter(_.startsWith("error"))
codes = errors.map(_.split("\t")(1))
pairs = codes.map(word => (word, 1))
counts = pairs.reduceByKey(_ + _)
reports = counts.map(kv => (dscr(kv._1), kv._2))
Spark Stage DAG
Instrumented Spark Stage DAG
[Instrumented workflow: a HadoopLineageRDD is inserted before lines; a CombinerLineageRDD captures the end of Stage 1 (after pairs); a ReducerLineageRDD captures the start of Stage 2 (after the shuffle, before counts); a StageLineageRDD captures the stage output (after reports)]
Instrumented Workflow
Lineage Capture Runtime Overheads
[Chart: running time in seconds (log scale) vs. dataset size in GB, for Spark, Titian, Newt, and RAMP]
๏ Same Word Count job
๏ Titian is on average 1.3X slower than Spark
Lineage tables (Input ID -> Output ID):
Hadoop:   offset1 -> id1, offset2 -> id2, offset3 -> id3
Combiner: { id1, id3 } -> 400, { id2 } -> 4
Reducer:  [p1, p2] -> 400, [p1] -> 4
Stage:    400 -> id1, 4 -> id2
Example: Captured Data Lineage
Example: Trace Back
Tracing proceeds backward by joining adjacent lineage tables:
1. Stage.Input ID joined with Reducer.Output ID — e.g., output id1 joins to key 400
2. Reducer.Output ID joined with Combiner.Output ID — key 400 joins to the Combiner's input set { id1, id3 }
3. Combiner.Input ID joined with Hadoop.Output ID — { id1, id3 } joins to input offsets offset1 and offset3
Now let’s do it for real!
Example: Trace Back (distributed)
Each worker holds the lineage tables for its own partitions: Worker1 and Worker2 store Hadoop and Combiner tables, while Worker3 stores the Reducer and Stage tables.
1. On Worker3, join Stage.Input ID with Reducer.Output ID: output id1 joins to key 400, whose Reducer entry lists the upstream partitions [p1, p2]
2. Targeted shuffle: the pairs (p1 -> 400) and (p2 -> 400) are shipped only to the workers hosting partitions p1 and p2, instead of broadcasting to every worker
3. On Worker1 and Worker2, join Combiner.Output ID with Reducer.Output ID (key 400), then Combiner.Input ID with Hadoop.Output ID, reaching the input offsets on each worker (e.g., offset1 and offset3 on Worker1)
Tracing Performance
๏ Word Count job
๏ Tracing one record backward takes < 1 sec for datasets up to 100GB
๏ 18 sec for a 500GB dataset
Vega: Optimizations for Selective Replay
Matteo Interlandi, Sai Deep Tetali, Muhammad Ali Gulzar, Joseph Noor, Miryung Kim, Todd Millstein, Tyson Condie
Under Submission
Debugging workflow
๏ Run the program
๏ Understand the cause of bugs / outliers — Titian [VLDB 2016], BigDebug [ICSE 2016]:
• Lineage
• Breakpoints / watchpoints
• Crash culprit
๏ Fix the bug
• Fast selective replay
First Strategy
Convert changes in code to changes in data
Incremental Plan

input .map(x => (x, 1)) .reduceByKey(_ + _)
[Stage 1: lines -> pairs | shuffle | Stage 2: counts]

Input:   aa, b, c, aa, c
Map:     (aa, 1), (b, 1), (c, 1), (aa, 1), (c, 1)
Shuffle: (aa, [1, 1]), (b, 1), (c, [1, 1])
Reduce:  (aa, 2), (b, 1), (c, 2)
Inject a filter in the workflow:

input .filter(x => x != 'c') .map(x => (x, 1)) .reduceByKey(_ + _)
[Stage 1: lines -> filter -> pairs | shuffle | Stage 2: counts]

Input:   aa, b, c, aa, c
Filter:  aa, b, aa
Map:     (aa, 1), (b, 1), (aa, 1)
Shuffle: (aa, [1, 1]), (b, 1)
Reduce:  (aa, 2), (b, 1)
Incremental Plan
Incremental Plan
Rather than rerunning the job, keep the cached results of the original run and propagate only the deltas introduced by the new filter:

Input:    aa, b, c, aa, c
δFilter:  —c, —c (the deletions the new filter introduces)
∆Map:     —(c, 1), —(c, 1)
∆Shuffle: (c, [—1, —1])
∆Reduce:  —(c, 2)

The cached output (aa, 2), (b, 1), (c, 2) is then patched by removing (c, 2).
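The delta propagation above can be modeled in a few lines of plain Python (a sketch of the idea, not Vega's implementation): only the deleted records flow through the map/shuffle/reduce chain, and the cached output is patched at the end.

```python
# Toy model of Vega's incremental plan for word count after injecting
# filter(x => x != 'c'): propagate deletions (deltas) instead of rerunning.
from collections import Counter

inputs = ["aa", "b", "c", "aa", "c"]
cached = Counter(inputs)                  # cached Reduce output: aa->2, b->1, c->2

# δFilter: the records the new filter removes from the original input
deltas = [x for x in inputs if x == "c"]  # [-c, -c]

# ∆Map / ∆Shuffle / ∆Reduce: aggregate the deletions per key, then
# subtract them from the cached counts.
delta_counts = Counter(deltas)            # {c: 2}
for key, n in delta_counts.items():
    cached[key] -= n
    if cached[key] == 0:
        del cached[key]                   # key disappears entirely

print(dict(cached))   # {'aa': 2, 'b': 1}
```

Note the cost is proportional to the size of the deltas, not the size of the input, which is where the speedup over a full rerun comes from.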
Performance
[Chart: running time in seconds vs. input data size in GB]
About 10X faster
Performance
๏ Good up to a certain point
๏ Two factors dominate:
• Space utilization
• Time to shuffle the deltas
๏ Insight:
• The further downstream the filter is placed, the better the incremental performance
• Especially beneficial if we can place it past the shuffle
Second Strategy
Push code changes downstream
Commutative Rewrite
The changed transform is filter(x => x != 'c'), originally applied to the raw input. Can we instead apply it downstream, after the cached Reduce output (aa, 2), (b, 1), (c, 2)?
• But the input to the filter there is (word, count) — we cannot use the filter as-is
• Observe that the map x => (x, 1) is invertible: we can use the old filter by composing it with the inverse of the map
• Rewritten filter: filter'((x, o) => x != 'c'), applied directly to the Reduce output, yields (aa, 2), (b, 1)
• The rewrite is sound because Shuffle and Reduce operations preserve keys
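A plain-Python sketch of the rewrite's effect (illustrative, not Vega code): because shuffle and reduce preserve keys, the rewritten filter runs on the cached, output-sized data instead of the input.

```python
# Toy model of the commutative rewrite for word count: the user's
# filter(x => x != 'c') is pushed past the shuffle and reduce, becoming
# filter'((x, o) => x != 'c') over (key, count) pairs.
from collections import Counter

inputs = ["aa", "b", "c", "aa", "c"]
cached = Counter(inputs)   # cached Reduce output of the original run

# The rewritten filter inspects only the key of each output record,
# so it never has to touch the (much larger) input.
rewritten = {k: v for k, v in cached.items() if k != "c"}

print(rewritten)   # {'aa': 2, 'b': 1}
```

This is why the rewrite scales with the number of unique words rather than the total number of words.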
Performance
[Chart: running time vs. input data size in GB]
About 1000X faster
Why does it scale so well?
๏ Runtime is on the order of the output size
๏ The output size depends on the number of unique words
๏ Unique words << total words
Combining Strategies
๏ Push the changed transform past as many shuffles as possible using rewrites
• The new transform can be placed only after materialization points
• By default we materialize shuffle output
• Efficient because Spark already saves shuffle output for fault tolerance
๏ Use delta computation for the remaining workflow
Vega
๏ Built on Spark and Spark SQL (filter rewrite only)
๏ The Spark SQL API is unchanged
๏ The Spark API additions include:
• Functions with inverses (for maps)
• Inverse values (for incremental reduce)
๏ Automatically rewrites workflows using commutativity and incremental evaluation
๏ Titian gives Spark users the ability to trace through program execution
๏ Features:
• Intermediate results are shared in memory
• Tight integration with the Spark API (LineageRDD)
• Low job overhead
• Efficient lineage queries
๏ Vega provides 1–3 orders of magnitude performance gains over rerunning the computation from scratch
๏ Both provide results in a few seconds for many workflows, allowing interactive usage
Conclusions
Thank you
Outline
‣ Motivating Scenario
‣ Titian Programming Interface
‣ Internals
‣ Performance
‣ Conclusions
Configuration
๏ Two sets of experiments:
• Unstructured: grep and word count
• Structured: PigMix queries
๏ Datasets:
• Unstructured: 500MB to 500GB files containing words generated with a Zipf distribution from a dictionary of 8000 words
• Structured: we used the PigMix generator to create datasets of sizes ranging from 1GB to 1TB
๏ Configuration:
• 16 machines, each with 4 cores (2 hyper-threads per core), 32GB of RAM, and a 1TB disk
• Spark 1.2.1
Lineage Capture Runtime Overheads
Tracing Performance
๏ Titian gives Spark users the ability to trace through program execution at interactive speed
๏ Features:
• Intermediate results are shared in memory
• Tight integration with the Spark API (LineageRDD)
• Low job overhead
• Efficient lineage queries
๏ We believe Titian will open the door to program logic debugging, iterative data (and program) cleaning, and exploratory analysis
Titian: Data Provenance in Spark
Instrumented Workflow
[Diagram: HadoopLineageRDD inserted before lines; CombinerLineageRDD after pairs at the end of Stage 1; ReducerLineageRDD at the start of Stage 2; StageLineageRDD after reports]
Capturing: HadoopLineageRDD
The HadoopLineageRDD wraps the input of lines. For each record streaming through (e.g., (offset1, "error 400 ...")):
1. Get the input id (the record's offset, e.g., offset1)
2. Get the output id assigned to the produced record (e.g., id1)
3. Save the pair (offset1, id1) into its lineage table
4. Propagate the output id downstream through the TaskContext
Repeating for every record yields the table: offset1 -> id1, offset2 -> id2, offset3 -> id3.
Combiner Build Phase
During Stage 1's combine, the CombinerLineageRDD maintains a (Key, Input IDs) table alongside the combiner's (Key, Agg Value) hash table. For each incoming record:
1. Read the current record id from the TaskContext (e.g., id1 for (offset1, "error 400 ..."))
2. Let the record be mapped to its key and aggregated as usual (e.g., key 400, count 1)
3. Append the record id to the input-id set of that key
After processing the three records, the tables hold: Agg Value 400 -> 2, 4 -> 1; Input IDs 400 -> { id1, id3 }, 4 -> { id2 }.
Combiner Probe Phase
When the combiner emits its aggregated records (e.g., (400, 2) and (4, 1)), the probe phase looks up each key in the (Key, Input IDs) table and records the key as the output id, writing the lineage rows ({ id1, id3 }, 400) and ({ id2 }, 4) into the combiner lineage table.
Instrumented Workflow (shuffle)
Before the shuffle, each combiner output record is tagged with its partition id: (400, 2) from partition p1 is sent as (400, (2, p1)). After the shuffle, the reducer for key 400 sees (400, (2, p1)), (400, (5, p2)), ...; the ReducerLineageRDD records which partitions contributed to each key, e.g., [p1, p2] -> 400.
Capturing: StageLineageRDD
The StageLineageRDD wraps the final output. For each output record (e.g., (Bad request, 7)):
1. Get the input id from the TaskContext (the record's key, e.g., 400)
2. Get the output id assigned to the emitted record (e.g., id1)
3. Save the pair (400, id1); similarly, (Failure, 1) with key 4 yields (4, id2)
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in Spark, Matteo Interlandi, PostDoc, UCLA