Automated Debugging of Big Data Analytics in Apache Spark Using BigSift with Muhammad Ali Gulzar and Miryung Kim

Developing big data analytics often involves trial-and-error debugging, due to the unclean nature of datasets or wrong assumptions about the data. Data scientists typically write code that implements a data processing pipeline and test it on their local workstation with a small data sample downloaded from a TB-scale data warehouse. They cross their fingers and hope that the program works in the expensive production cloud. When a job fails or they get a suspicious result, data scientists spend hours guessing at the source of the error and digging through post-mortem logs.

In such cases, the data scientists may want to pinpoint the root cause of errors by investigating a subset of corresponding input records. In this talk, we present BigSift, an automated debugger for Apache Spark that data engineers and scientists can use. It takes an Apache Spark program, a user-defined test oracle function, and a dataset as input and outputs a minimum set of input records that reproduces the same test failure.
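
The idea of shrinking an input to a minimal failure-reproducing subset can be sketched with a simplified form of classic delta debugging (ddmin). This is a minimal sketch in plain Scala, not BigSift's implementation: the object name `DeltaDebugSketch`, the function `ddmin`, and the oracle parameter `fails` are hypothetical names introduced for illustration, and the oracle is assumed to be deterministic.

```scala
object DeltaDebugSketch {
  // Simplified ddmin: returns a subset of `input` that still fails and is
  // 1-minimal, i.e. removing any single record makes the failure disappear.
  // `fails` is a hypothetical test oracle: true when the subset reproduces the failure.
  def ddmin[A](input: Vector[A])(fails: Vector[A] => Boolean): Vector[A] = {
    require(fails(input), "the full input must reproduce the failure")

    @annotation.tailrec
    def loop(current: Vector[A], n: Int): Vector[A] =
      if (current.size < 2) current
      else {
        // Split the current candidate into (at most) n chunks.
        val chunkSize = math.ceil(current.size.toDouble / n).toInt
        val chunks = current.grouped(chunkSize).toVector
        // Complement i is the candidate with chunk i removed.
        val complements = chunks.indices.toVector.map { i =>
          chunks.patch(i, Nil, 1).flatten
        }
        chunks.find(fails) match {
          case Some(chunk) => loop(chunk, 2) // a single chunk still fails
          case None =>
            complements.find(fails) match {
              case Some(rest) => loop(rest, math.max(n - 1, 2)) // a complement still fails
              case None if n < current.size =>
                loop(current, math.min(2 * n, current.size)) // split more finely
              case None => current // cannot shrink further
            }
        }
      }

    loop(input, 2)
  }
}
```

For example, `DeltaDebugSketch.ddmin(records)(subset => jobFailsOn(subset))` would rerun a (hypothetical) `jobFailsOn` check on progressively smaller subsets. Every probe is a full rerun of the job on a subset, which is why the number of probes matters for large inputs.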

BigSift combines insights from automated fault isolation in software engineering and data provenance in database systems to find a minimum set of failure-inducing inputs. It redefines data provenance for the purpose of debugging using a test oracle function and implements several unique optimizations specifically geared towards the iterative nature of automated debugging workloads. BigSift exposes an interactive web interface where a user can monitor a big data analytics job running remotely in the cloud, write a user-defined test oracle function, and then trigger the automated debugging process. BigSift also provides a set of predefined test oracle functions, which can be used to explain common types of anomalies in big data analytics. This debugging effort is led by UCLA Professors Miryung Kim and Tyson Condie and has produced several research papers in top software engineering and database conferences.
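
To make the pairing of a test oracle with data provenance concrete, here is a minimal sketch on plain Scala collections (no Spark), using the snowfall records and the `delta < 6000` test function from the slides. It is not BigSift's or Titian's implementation. The names `SnowfallProvenanceSketch`, `stateOf`, `toMillimeters`, `keyed`, `failingLineage`, and `overlap` are hypothetical, `Double` is used in place of the slide's `Float`, and the inch-conversion bug is an assumption chosen only because it reproduces the slide value 70in → 21336.

```scala
object SnowfallProvenanceSketch {
  // Hypothetical zip-to-state lookup covering only the zip codes on the slides.
  val stateOf: Map[String, String] = Map("99504" -> "AK", "90031" -> "CA")

  // The seven input records from the slides; a record's index stands in for its file offset.
  val sample: Vector[String] = Vector(
    "99504 , 01/01/1992 , 1ft", "99504 , 03/01/1992 , 0.1ft",
    "99504 , 01/01/1993 , 70in", "99504 , 03/01/1993 , 145mm",
    "99504 , 01/01/1994 , 245mm", "99504 , 01/01/1993 , 85mm",
    "90031 , 02/01/1991 , 0mm")

  // Converts a reading such as "1ft" to millimeters.
  def toMillimeters(reading: String): Double = {
    val (amount, unit) = (reading.dropRight(2).toDouble, reading.takeRight(2))
    unit match {
      case "ft" => amount * 304.8
      case "in" => amount * 304.8 // ASSUMED bug (should be 25.4): inches scaled like feet
      case _    => amount         // "mm"
    }
  }

  // Test oracle from the slides: an output record passes when its delta is below 6000.
  def test(key: String, delta: Double): Boolean = delta < 6000

  // FlatMap step: each record yields a (state, day) and a (state, year) entry,
  // tagged with the offset of the record that produced it (its lineage).
  def keyed(offset: Int, line: String): Seq[(String, (Int, Double))] = {
    val parts = line.split(",").map(_.trim)
    val date = parts(1).split("/")
    val state = stateOf(parts(0))
    val mm = toMillimeters(parts(2))
    Seq((s"$state,${date(0)}/${date(1)}", (offset, mm)), (s"$state,${date(2)}", (offset, mm)))
  }

  // Backward trace: for every output key failing the test, the offsets in its key group.
  def failingLineage(lines: Seq[String]): Map[String, Set[Int]] =
    lines.zipWithIndex
      .flatMap { case (line, offset) => keyed(offset, line) }
      .groupBy(_._1)
      .collect {
        case (key, entries)
            if !test(key, entries.map(_._2._2).max - entries.map(_._2._2).min) =>
          (key, entries.map(_._2._1).toSet)
      }

  // Overlapping several backward traces narrows the set of suspect inputs.
  def overlap(lineage: Map[String, Set[Int]]): Set[Int] =
    lineage.values.reduceOption(_ intersect _).getOrElse(Set.empty[Int])
}
```

On `sample`, the trace marks whole key groups as suspect (the over-approximation the slides attribute to plain data provenance), and the overlap of the two failing groups still contains a harmless record alongside the 70in record. Closing that gap is where a delta-debugging pass over the remaining records comes in.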

Published in: Data & Analytics

  1. 1. Automated Debugging of Big Data Analytics in Apache Spark Using BigSift. Muhammad Ali Gulzar and Miryung Kim, University of California, Los Angeles. #Res3SAIS
  2. 2. Big Data Debugging in the Dark. Develop locally, hope it works, run in the cloud, hit a bug, fall back on guesswork. [Diagram: Map and Reduce stages]
  3. 3. BigDebug: Debugging Primitives for Interactive Big Data Processing. ICSE 2016. Muhammad Ali Gulzar, Matteo Interlandi, Seunghyun Yoo, Sai Deep Tetali, Tyson Condie, Todd Millstein, Miryung Kim. Presented at Spark Summit 2017.
  4. 4. Interactive Debugging with BigDebug: (1) Simulated Breakpoint, (2) On-Demand Guarded Watchpoint, (3) Crash Culprit Identification, (4) Backward and Forward Tracing.
  5. 5. Titian: Data Provenance Support in Spark. VLDB 2015. Matteo Interlandi, Kshitij Shah, Sai Deep Tetali, Muhammad Ali Gulzar, Seunghyun Yoo, Miryung Kim, Todd Millstein, Tyson Condie.
  6. 6. Data Provenance – in SQL
     Sensors table:
       Tuple-ID | Time | Sensor-ID | Temperature
       T1 | 11AM | 1 | 34
       T2 | 11AM | 2 | 35
       T3 | 11AM | 3 | 35
       T4 | 12PM | 1 | 35
       T5 | 12PM | 2 | 35
       T6 | 12PM | 3 | 100
       T7 | 1PM | 1 | 35
       T8 | 1PM | 2 | 35
       T9 | 1PM | 3 | 80
     Query: SELECT time, AVG(temp) FROM sensors GROUP BY time
     Result:
       Result-ID | Time | AVG(temp)
       ID-1 | 11AM | 34.6
       ID-2 | 12PM | 56.6 (outlier)
       ID-3 | 1PM | 50 (outlier)
     Why do ID-2 and ID-3 have such high values?
  7. 7. Step 1: Instrument Workflow in Spark
     Workflow RDDs: lines, errors, codes, pairs, counts, reports, split across Stage 1 and Stage 2.
     Lineage tables (Input ID → Output ID):
       Hadoop LineageRDD: offset1 → id1; offset2 → id2; offset3 → id3
       Combiner LineageRDD: { id1, id3 } → 400; { id2 } → 4
       Reducer LineageRDD: [p1, p2] → 400; [p1] → 4
       Stage LineageRDD: 400 → id1; 4 → id2
  8. 8. Step 2: Example Backward Tracing. [Diagram: Hadoop, Combiner, Reducer, and Stage lineage tables on Worker1, Worker2, and Worker3; labels: Stage.Input ID, Reducer.Output ID; highlighted Reducer entry: p1 → 400]
  9. 9. Step 2: Example Backward Tracing. [Diagram: Hadoop and Combiner lineage tables on Worker1 and Worker2; labels: Combiner.Output ID, Reducer.Output ID; highlighted Reducer entry: p1 → 400]
  10. 10. Step 2: Example Backward Tracing. [Diagram: Hadoop and Combiner lineage tables on Worker1 and Worker2; labels: Combiner.Input ID, Hadoop.Output ID]
  11. 11. BigSift: Automated Debugging of Big Data Analytics in DISC Applications. SoCC 2017. Muhammad Ali Gulzar, Matteo Interlandi, Xueyuan Han, Tyson Condie, Miryung Kim.
  12. 12. Motivating Example for Automated Debugging with BigSift • Alice writes a Spark program that identifies, for each state in the US, the delta between the minimum and the maximum snowfall reading for each day of any year and for any particular year. • An input data record that measures 1 foot of snowfall on January 1st, 1992, in the 99504 zip code (Anchorage, AK) area appears as: 99504 , 01/01/1992 , 1ft
  13. 13. Debugging Incorrect Output
      TextFile (input records):
        99504 , 01/01/1992 , 1ft
        99504 , 03/01/1992 , 0.1ft
        99504 , 01/01/1993 , 70in
        99504 , 03/01/1993 , 145mm
        99504 , 01/01/1994 , 245mm
        99504 , 01/01/1993 , 85mm
        90031 , 02/01/1991 , 0mm
      FlatMap:
        AK , 01/01 , 304.8 | AK , 1992 , 304.8
        AK , 03/01 , 30.5 | AK , 1992 , 30.5
        AK , 01/01 , 21336 | AK , 1993 , 21336
        AK , 03/01 , 145 | AK , 1993 , 145
        AK , 01/01 , 245 | AK , 1994 , 245
        ….
      GroupByKey:
        AK , 01/01 , [304.8, 21336, 245, 85]
        AK , 03/01 , [30.5 , 145]
        AK , 1992 , [304.8 , 30.5]
        AK , 1993 , [21336, 145, 85]
        AK , 1994 , [245]
        CA , 02/01 , [0]
        CA , 1991 , [0]
      Map (Output):
        AK , 01/01 , 21251
        AK , 03/01 , 114.5
        AK , 1992 , 274.3
        AK , 1993 , 21251
        AK , 1994 , 0
        CA , 02/01 , 0
        CA , 1991 , 0
      It is challenging because the input is very large and the program involves computing a min and a max and a unit conversion.
  14. 14. Data Provenance with Titian [VLDB '15]. It over-approximates the scope of failure-inducing inputs, i.e. records in the faulty key-group are all marked as faulty. [Same TextFile, FlatMap, GroupByKey, Map, and Output data as slide 13]
  15. 15. Delta Debugging • Alice tests the output from different input configurations using a test function: def test(key:String, delta: Float) : Boolean = { delta < 6000 } • Run 1 fails the test. [Same TextFile, FlatMap, GroupByKey, Map, and Output data as slide 13, with the input marked 1 and 2]
  16. 16. Delta Debugging • DD is a systematic debugging procedure guided by a test function. DD is inefficient because it does not eliminate irrelevant input records upfront. (This bullet repeats on slides 17 to 23.)
      Input shown on slides 16 to 23: 99504 , 01/01/1992 , 1ft | 99504 , 03/01/1992 , 0.1ft | 99504 , 01/01/1993 , 70in
      Run 2 fails the test.
      FlatMap: AK , 01/01 , 304.8 | AK , 1992 , 304.8 | AK , 03/01 , 30.5 | AK , 1992 , 30.5 | AK , 01/01 , 21336 | AK , 1993 , 21336
      GroupByKey: AK , 01/01 , [304.8, 21336] | AK , 03/01 , [30.5] | AK , 1992 , [304.8 , 30.5] | AK , 1993 , [21336]
      Output: AK , 01/01 , 21031.2 | AK , 03/01 , 0 | AK , 1992 , 274.3 | AK , 1993 , 0
  17. 17. Delta Debugging: Run 3 passes the test.
      FlatMap: AK , 01/01 , 304.8 | AK , 1992 , 304.8 | AK , 03/01 , 30.5 | AK , 1992 , 30.5
      GroupByKey: AK , 01/01 , [304.8] | AK , 03/01 , [30.5] | AK , 1992 , [304.8 , 30.5]
      Output: AK , 01/01 , 0 | AK , 03/01 , 0 | AK , 1992 , 274.3
  18. 18. Delta Debugging: Run 4 passes the test.
      FlatMap: AK , 01/01 , 21336 | AK , 1993 , 21336
      GroupByKey: AK , 01/01 , [21336] | AK , 1993 , [21336]
      Output: AK , 01/01 , 0 | AK , 1993 , 0
  19. 19. Delta Debugging: Run 5 passes the test.
      FlatMap: AK , 01/01 , 304.8 | AK , 1992 , 304.8
      GroupByKey: AK , 01/01 , [304.8] | AK , 1992 , [304.8]
      Output: AK , 01/01 , 0 | AK , 1992 , 0
  20. 20. Delta Debugging: Run 6 passes the test.
      FlatMap: AK , 03/01 , 30.5 | AK , 1992 , 30.5
      GroupByKey: AK , 03/01 , [30.5] | AK , 1992 , [30.5]
      Output: AK , 03/01 , 0 | AK , 1992 , 0
  21. 21. Delta Debugging: Run 7 passes the test.
      FlatMap: AK , 01/01 , 21336 | AK , 1993 , 21336
      GroupByKey: AK , 01/01 , [21336] | AK , 1993 , [21336]
      Output: AK , 01/01 , 0 | AK , 1993 , 0
  22. 22. Delta Debugging: Run 8 passes the test.
      FlatMap: AK , 03/01 , 30.5 | AK , 1992 , 30.5 | AK , 01/01 , 21336 | AK , 1993 , 21336
      GroupByKey: AK , 03/01 , [30.5] | AK , 1992 , [30.5] | AK , 01/01 , [21336] | AK , 1993 , [21336]
      Output: AK , 03/01 , 0 | AK , 1992 , 0 | AK , 01/01 , 0 | AK , 1993 , 0
  23. 23. Delta Debugging: Run 9 fails the test.
      FlatMap: AK , 01/01 , 304.8 | AK , 1992 , 304.8 | AK , 01/01 , 21336 | AK , 1993 , 21336
      GroupByKey: AK , 01/01 , [304.8 , 21336] | AK , 1992 , [304.8] | AK , 1993 , [21336]
      Output: AK , 01/01 , 21031.2 | AK , 1992 , 0 | AK , 1993 , 0
  24. 24. Automated Debugging in DISC with BigSift. Input: a Spark program and a test function. Output: minimum fault-inducing input records. Approach: data provenance + delta debugging, with three optimizations: test predicate pushdown, prioritizing backward traces, and bitmap-based test memoization.
  25. 25. Optimization 1: Test Predicate Pushdown • Observation: During backward tracing, data provenance traces through all the partitions even though only a few partitions are faulty. • If applicable, BigSift pushes down the test function to test the output of combiners in order to isolate the faulty partitions.
  26. 26. Optimization 2: Prioritizing Backward Traces • Observation: The same faulty input record may contribute to multiple output records failing the test. • In case of multiple faulty outputs, BigSift overlaps two backward traces to minimize the scope of fault-inducing input records. [Same TextFile, FlatMap, GroupByKey, Map, and Output data as slide 13]
  27. 27. Optimization 3: Bitmap-Based Test Memoization • Observation: Delta debugging may redundantly run a program on the same subset of input. • BigSift uses a bitmap to compactly encode the offsets of the original input that make up an input subset, and memoizes the test outcome per bitmap to avoid redundant testing of the same input subset. [Diagram: input data bitmap 0 1 0 1 0 0 0 0 1 1 mapped to a test outcome]
  28. 28. Demo
  29. 29. Debugging Time
      Subject Program | Fault | Original Job Running Time (sec) | DD Debugging Time (sec) | BigSift Debugging Time (sec) | Improvement
      Movie Histogram | Code | 56.2 | 232.8 | 17.3 | 13.5X
      Inverted Index | Code | 107.7 | 584.2 | 13.4 | 43.6X
      Rating Histogram | Code | 40.3 | 263.4 | 16.6 | 15.9X
      Sequence Count | Code | 356.0 | 13772.1 | 208.8 | 66.0X
      Rating Frequency | Code | 77.5 | 437.9 | 14.9 | 29.5X
      College Student | Data | 53.1 | 235.3 | 31.8 | 7.4X
      Weather Analysis | Data | 238.5 | 999.1 | 89.9 | 11.1X
      Transit Analysis | Code | 45.5 | 375.8 | 20.2 | 18.6X
      BigSift provides up to a 66X speedup in isolating the precise fault-inducing input records, in comparison to the baseline DD.
  30. 30. Debugging Time. [Same table as slide 29] On average, BigSift takes 62% less time to debug a single faulty output than the time taken for a single run on the entire data.
  31. 31. Debugging Time. [Figure: Sequence Count; x-axis: fault localization time (s), 0 to 14000; y-axis: # of fault-inducing input records, log scale; series: Delta Debugging, BigSift, Test Driven Data Provenance, Data Provenance] On average, BigSift takes 62% less time to debug a single faulty output than the time taken for a single run on the entire data.
  32. 32. Fault Localizability over Data Provenance. [Figure: bar chart; y-axis: # of fault-inducing input records, log scale; programs: Movie Histogram, Inverted Index, Rating Histogram, Sequence Count, Rating Frequency, College Students, Weather Analysis; series: Data Provenance, Test Driven Data Provenance, BigSift & DD] BigSift leverages DD after DP to continue fault isolation, achieving several orders of magnitude (10^3 to 10^7) better precision.
  33. 33. Summary • BigSift is the first piece of work in automated debugging of big data analytics in DISC. • BigSift provides 10^3X to 10^7X more precision than data provenance in terms of fault localizability. • It provides up to a 66X speedup in debugging time over baseline Delta Debugging. • In our evaluation we have observed that, on average, BigSift finds the faulty input in 62% less time than the original job execution time.
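
The bitmap-based test memoization of slide 27 can be illustrated with a small sketch. This is an assumption-laden illustration in plain Scala, not BigSift's code: the class `MemoizedTest`, its `runTest` parameter, and the `runs` counter are hypothetical names, and the standard library's immutable `BitSet` stands in for whatever bitmap encoding BigSift actually uses.

```scala
import scala.collection.immutable.BitSet
import scala.collection.mutable

// Caches test outcomes keyed by the bitmap of selected input offsets, so a
// subset that delta debugging revisits is not executed a second time.
final class MemoizedTest[A](input: Vector[A], runTest: Vector[A] => Boolean) {
  private val cache = mutable.Map.empty[BitSet, Boolean]

  // Number of times the underlying (expensive) test actually ran.
  var runs: Int = 0

  // `offsets` holds the positions of the original input that form the subset.
  def apply(offsets: BitSet): Boolean =
    cache.getOrElseUpdate(offsets, {
      runs += 1
      runTest(offsets.toVector.map(input)) // materialize the subset, then test it
    })
}
```

Two requests for the same subset, such as `BitSet(0, 2)` and `BitSet(2, 0)`, share one cache entry because `BitSet` equality is structural, so the second request returns the stored outcome without rerunning the job.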
