Presented in Hadoop Summit 2011.

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
351
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
14
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Transcript of "RAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows"

  1. RAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
     Hyunjung Park, Robert Ikeda, Jennifer Widom
     Stanford University
  2. MapReduce Workflow
     • Directed acyclic graph composed of MapReduce jobs
     • Popular for large-scale data processing
       – Hadoop
       – Higher-level platforms: Pig, Hive, Jaql, …
     • Debugging can be difficult
  3. Provenance
     • Where data came from, how it was processed, …
     • Uses: drill-down, verification, debugging, …
     • RAMP: fine-grained provenance of data elements
       – Backward tracing: find the input subsets that contributed to a given output element
       – Forward tracing: determine which output elements were derived from a particular input element
  4. Outline of Talk
     • MapReduce Provenance: definition and examples
     • RAMP System: overview and some details
     • RAMP Experimental Evaluation
     • Provenance-enabled Pig using RAMP
     • Conclusion
  5. MapReduce Provenance
     • Mapper: M(I) = ⋃_{i ∈ I} M({i})
       – Provenance of an output element o ∈ M(I) is the input element i ∈ I that produced o, i.e., o ∈ M({i})
     • Reducer (and Combiner): R(I) = ⋃_{1 ≤ j ≤ n} R(I_j)
       – I_1, …, I_n: partition of I on the reduce key
       – Provenance of an output element o ∈ R(I) is the group I_k ⊆ I that produced o, i.e., o ∈ R(I_k)
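The two definitions above can be sketched in Python (an illustration only, not RAMP's implementation; element IDs here are plain list indices rather than RAMP's (filename, offset) pairs):

```python
def map_with_provenance(mapper, inputs):
    """Run the mapper element by element; each map output records the
    single input element i that produced it (o in M({i}))."""
    out = []
    for i, elem in enumerate(inputs):
        for kv in mapper(elem):
            out.append((kv, {i}))            # provenance = {i}
    return out

def reduce_with_provenance(reducer, map_out):
    """Group by reduce key; a reduce output's provenance is the whole
    input group I_k that produced it."""
    groups = {}
    for (k, v), prov in map_out:
        vals, ids = groups.setdefault(k, ([], set()))
        vals.append(v)
        ids.update(prov)
    out = []
    for k, (vals, ids) in groups.items():
        for kv in reducer(k, vals):
            out.append((kv, set(ids)))
    return out

# Wordcount on the five input lines from the example slides
tokenize = lambda line: [(w, 1) for w in line.split()]
int_sum = lambda k, vals: [(k, sum(vals))]

lines = ["Apache Hadoop", "Hadoop MapReduce", "Apache Pig",
         "Hadoop Summit", "MapReduce Tutorial"]
result = reduce_with_provenance(int_sum, map_with_provenance(tokenize, lines))
# ('Hadoop', 3) traces back to the three lines containing "Hadoop"
```

Backward tracing an output element then amounts to looking up its recorded input-element set.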
  6. MapReduce Provenance
     • MapReduce workflows
       – Intuitive recursive definition
     • “Replay” property
       – Replay the entire workflow with the provenance of an output element o
       – Does the result include the element o?
       – Usually YES, but not always
     [Diagram: a workflow of mappers M and reducers R run on the full input I producing o ∈ O, then replayed on a provenance subset I* ⊆ I]
  7. MapReduce Provenance: Wordcount
     [Diagram: wordcount pipeline on two input splits]
     Input: “Apache Hadoop”, “Hadoop MapReduce”, “Apache Pig” | “Hadoop Summit”, “MapReduce Tutorial”
     Mapper (Tokenizer): (Apache, 1) (Hadoop, 1) (Hadoop, 1) (MapReduce, 1) (Apache, 1) (Pig, 1) | (Hadoop, 1) (Summit, 1) (MapReduce, 1) (Tutorial, 1)
     Combiner (IntSum): (Apache, 2) (Hadoop, 2) (MapReduce, 1) (Pig, 1) | (Hadoop, 1) (Summit, 1) (MapReduce, 1) (Tutorial, 1)
     Reducer (IntSum): (Apache, 2) (Hadoop, 3) (MapReduce, 2) (Pig, 1) (Summit, 1) (Tutorial, 1)
  8. Replay Property: Wordcount
     [Diagram: the wordcount pipeline from the previous slide, replayed on the provenance of one output element; the replayed result includes that element, while the counts for words outside its provenance differ from the original run]
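The replay check illustrated on this slide can be sketched with a tiny wordcount (a sketch under the assumption that provenance of a count is the set of lines containing the word, per the definitions earlier):

```python
# Take one output element, replay the whole wordcount on just its
# provenance, and test whether the element reappears in the result.

def wordcount(lines):
    counts = {}
    for line in lines:
        for w in line.split():
            counts[w] = counts.get(w, 0) + 1
    return set(counts.items())

full_input = ["Apache Hadoop", "Hadoop MapReduce", "Apache Pig",
              "Hadoop Summit", "MapReduce Tutorial"]

# Provenance of ('Hadoop', 3): the three lines containing "Hadoop"
prov = [line for line in full_input if "Hadoop" in line]

replay_holds = ('Hadoop', 3) in wordcount(prov)   # replay reproduces o
```

Note that replaying a subset changes the counts of the other words, which is why the property is about the traced element itself, not the whole output.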
  9. System Overview
     • Built as an extension (library) to Hadoop 0.21
       – Some changes in the core part as well
     • Generic wrapper for provenance capture
       – Captures provenance for one MapReduce job at a time
       – Transparent
         ◦ Provenance stored separately from the input/output data
         ◦ Retains Hadoop’s parallel execution and fault tolerance
       – Wrapped components: RecordReader, Mapper, Combiner, Reducer, RecordWriter
  10. System Overview
      • Pluggable schemes for provenance capture
        – Element ID scheme
        – Provenance storage scheme (OutputFormat)
        – Applicable to any MapReduce workflow
      • Provenance tracing program
        – Stand-alone
        – Depends on the pluggable schemes
  11. Provenance Capture (map side)
      [Diagram: Without RAMP, the RecordReader reads (ki, vi) from the input and the Mapper emits (km, vm) to the map output. With RAMP, a wrapped RecordReader attaches an element ID p, passing (ki, 〈vi, p〉); the Mapper wrapper strips p, calls the Mapper on (ki, vi), and tags each map output (km, vm) as (km, 〈vm, p〉).]
  12. Provenance Capture (reduce side)
      [Diagram: Without RAMP, the Reducer consumes a group (km, [vm1, …, vmn]) and the RecordWriter writes (ko, vo) to the output. With RAMP, the Reducer wrapper strips the IDs from (km, [〈vm1, p1〉, …, 〈vmn, pn〉]), records (kmID, pj) pairs, calls the Reducer on (km, [vm1, …, vmn]), and tags each output (ko, vo) as (ko, 〈vo, kmID〉); the RecordWriter wrapper writes (ko, vo) to the output and (q, kmID) to the provenance store, where q is the output element ID.]
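The map-side wrapping on the two capture slides can be sketched as follows (a Python illustration, not RAMP's Java code; the function names and the list-index element IDs are assumptions for the example):

```python
def wrap_record_reader(records):
    """Turn (k, v) records into (k, (v, p)), where p is the element ID
    attached by the wrapped RecordReader."""
    for p, (k, v) in enumerate(records):
        yield k, (v, p)

def wrap_mapper(mapper):
    """Strip p, call the unmodified ('pure') mapper on (k, v), then
    re-attach p to every map output, as the Mapper wrapper does."""
    def wrapped(k, v_and_p):
        v, p = v_and_p
        for km, vm in mapper(k, v):
            yield km, (vm, p)
    return wrapped

tokenizer = lambda k, line: [(w, 1) for w in line.split()]
wrapped = wrap_mapper(tokenizer)
map_output = [kv
              for rec in wrap_record_reader([(0, "Apache Hadoop"),
                                             (1, "Apache Pig")])
              for kv in wrapped(*rec)]
# Each (word, 1) now carries the ID of the input record it came from.
```

The point of the wrapper design is transparency: the mapper itself is unchanged, which is what lets RAMP retain Hadoop's parallel execution and fault tolerance.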
  13. Default Scheme for File Input/Output
      • Element ID: (filename, offset)
        – Element ID increases as elements are appended
          ◦ Reduce provenance stored in ascending key order
          ◦ Efficient backward tracing without special indexes
      • Provenance storage
        – Reduce provenance: offset → reduce group ID
        – Map provenance: reduce group ID → (filename, offset)
      • Compaction
        – Filename: replaced by file ID
        – Integer ID, offset: variable-length encoded
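The "variable-length encoded" integers in the compaction step can be sketched with standard base-128 varints (an assumption — the slide does not specify the exact encoding RAMP uses):

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint: 7 data bits
    per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def decode_varint(buf, pos=0):
    """Decode one varint from buf starting at pos; return (value, new pos)."""
    n, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            return n, pos
        shift += 7
```

Small file IDs and offset deltas then take one or two bytes instead of a fixed eight, which is what makes the compacted provenance files cheap to store and scan.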
  14. Experimental Evaluation
      • 51 large EC2 instances (Thank you, Amazon!)
      • Two MapReduce “workflows”
        – Wordcount
          ◦ Many-one with large fan-in
          ◦ Input sizes: 100, 300, 500 GB
        – Terasort
          ◦ One-one
          ◦ Input sizes: 93, 279, 466 GB
  15. Experimental Results: Wordcount
      [Bar charts: execution time, map finish time, and average reduce task time (seconds), plus map output, intermediate, and output data sizes (GB), for 100/300/500 GB inputs, each shown without provenance and with the added provenance-capture overhead]
  16. Experimental Results: Terasort
      [Bar charts: execution time, map finish time, and average reduce task time (seconds), plus map output, intermediate, and output data sizes (GB), for 93/279/466 GB inputs, each shown without provenance and with the added provenance-capture overhead]
  17. Experimental Results: Summary
      • Provenance capture
        – Wordcount: 76% time overhead; space overhead depends directly on fan-in
        – Terasort: 20% time overhead, 21% space overhead
      • Backward tracing
        – Wordcount: 1, 3, 5 minutes (for 100, 300, 500 GB input sizes)
        – Terasort: 1.5 seconds
  18. Instrumenting Pig for Provenance
      • Can we run real-world MapReduce workflows on top of RAMP/Hadoop?
      • Pig 0.8
        – Added (file, offset) based element ID scheme: ~100 LOC
        – Default provenance storage scheme
        – Default provenance tracing program
  19. Pig Script: Sentiment Analysis
      • Input data sets
        – Tweets collected in 2009
        – 478 highest-grossing movie titles from IMDb
      • For each tweet:
        – Infer a 1–5 overall sentiment rating
        – Generate all n-grams and join them with movie titles
      • Output data set
        – (Movie title, Rating, #Tweets in November, #Tweets in December)
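The n-gram/title-matching step described above can be illustrated as follows (InferRating and GenerateNGram are Pig UDFs not shown in the talk; this Python sketch only mimics matching lowercased tweet n-grams against lowercased titles):

```python
def ngrams(text, max_n=3):
    """All 1..max_n word n-grams of text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)}

# Illustrative two-title subset of the 478 movie titles
titles = {"avatar", "star trek"}

tweet = "Just watched Star Trek again"
matched = ngrams(tweet) & titles   # titles mentioned in the tweet
```

In the Pig script this intersection is expressed as a replicated JOIN of the flattened n-grams against the (small) title relation.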
  20. Pig Script: Sentiment Analysis
      raw_movie = LOAD 'movies.txt' USING PigStorage('\t') AS (title: chararray, year: int);
      movie = FOREACH raw_movie GENERATE LOWER(title) as title;
      raw_tweet = LOAD 'tweets.txt' USING PigStorage('\t') AS (datetime: chararray, url, tweet: chararray);
      tweet = FOREACH raw_tweet GENERATE datetime, url, LOWER(tweet) as tweet;
      rated = FOREACH tweet GENERATE datetime, url, tweet, InferRating(tweet) as rating;
      ngramed = FOREACH rated GENERATE datetime, url, flatten(GenerateNGram(tweet)) as ngram, rating;
      ngram_rating = DISTINCT ngramed;
      title_rating = JOIN ngram_rating BY ngram, movie BY title USING 'replicated';
      title_rating_month = FOREACH title_rating GENERATE title, rating, SUBSTRING(datetime, 5, 7) as month;
      grouped = GROUP title_rating_month BY (title, rating, month);
      title_rating_month_count = FOREACH grouped GENERATE flatten($0), COUNT($1);
      november_count = FILTER title_rating_month_count BY month eq '11';
      december_count = FILTER title_rating_month_count BY month eq '12';
      outer_joined = JOIN november_count BY (title, rating) FULL OUTER, december_count BY (title, rating);
      result = FOREACH outer_joined GENERATE (($0 is null) ? $4 : $0) as title, (($1 is null) ? $5 : $1) as rating, (($3 is null) ? 0 : $3) as november, (($7 is null) ? 0 : $7) as december;
      STORE result INTO '/sentiment-analysis-result' USING PigStorage();
  21. Backward Tracing: Sentiment Analysis
  22. Limitations
      • Definition of MapReduce provenance
        – Reducer treated as a black box
        – Many-many reducers
          ◦ Provenance may contain more input elements than desired
          ◦ E.g., identity reducer for sorting records with duplicates
      • Wrapper-based approach for provenance capture
        – Requires “pure” mapper, combiner, and reducer
        – Requires “standard” input/output channels
  23. Conclusion
      • RAMP transparently captures provenance with reasonable time and space overhead
      • RAMP provides convenient and efficient means of drilling down into and verifying output elements