Assisting Developers of Big Data Analytics Applications When Deploying on Hadoop Clouds

Hadi Hemmati, Bram Adams, Weiyi Shang, Zhen Ming Jiang, Ahmed E. Hassan, Patrick Martin
  
What are Big Data Analytics Applications (BDA Apps)?
  
Many fields today rely on BDA Apps to make decisions

Software engineering research, especially Mining Software Repositories.
And…
  
Under the hood of BDA Apps

Hardware Infrastructure
Software Platform
BDA Apps
  
Discrepancy between scale of development and deployment

Small sample data and pseudo cloud
Big data and real-life cloud

“Analysts moved back and forth from local machines to cloud-based systems.” (ACM Interactions 2012)
  
Many things can go wrong when scaling

BDA App: Step 1, Step 2, …, Step n

Large-scale intermediate data generated by each step can fill up the disk space!
  
How to verify the deployment of BDA Apps?

Small sample data and pseudo cloud
Big data and real-life cloud
How to verify?
  
Traditional approach for verifying BDA Apps

Keyword scan
  
Limitations of traditional approach

Many false positives!
Large results, too much effort to manually examine
  
Not all kills are bad: “speculative execution”

Slow task identified
Duplicate the task to other machines
The results of the first finished task are saved; the other tasks are killed!
  
A smarter approach is needed
  
Execution sequences provide context information of log lines

Kill task t on node A.
Assign task t on node A.
Assign task t on node B.
Task t finished on node B.
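
One way to read the example above: a kill is only worth inspecting when the killed task never finishes anywhere. The following is a minimal sketch of that idea, not the authors' implementation. The line patterns and the names `parse` and `suspicious_kills` are assumptions made for illustration, based only on the four example lines on the slide.

```python
import re
from collections import defaultdict

# Hypothetical patterns matching the four example log lines above.
ACTION = re.compile(r"^(Kill|Assign) task (\w+) on node (\w+)\.$")
FINISH = re.compile(r"^Task (\w+) finished on node (\w+)\.$")


def parse(line):
    """Return (task, action, node), or None for an unrecognised line."""
    m = ACTION.match(line)
    if m:
        return m.group(2), m.group(1).lower(), m.group(3)
    m = FINISH.match(line)
    if m:
        return m.group(1), "finish", m.group(2)
    return None


def suspicious_kills(lines):
    """List (task, node) kills whose task never finished on any node."""
    by_task = defaultdict(list)
    for line in lines:
        parsed = parse(line)
        if parsed:
            task, action, node = parsed
            by_task[task].append((action, node))
    flagged = []
    for task, events in by_task.items():
        finished = any(action == "finish" for action, _ in events)
        for action, node in events:
            if action == "kill" and not finished:
                flagged.append((task, node))
    return flagged


slide_lines = [
    "Kill task t on node A.",
    "Assign task t on node A.",
    "Assign task t on node B.",
    "Task t finished on node B.",
]
print(suspicious_kills(slide_lines))
```

A plain keyword scan for "Kill" would report the first line; with the surrounding sequence taken into account, task t is seen to finish on node B, so nothing is flagged.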
  
Log abstraction reduces the amount of data to examine

Kill task t1 on node A.
Kill task t2 on node B.
Kill task t3 on node C.
Kill task t4 on node A.
Kill task t5 on node D.
Kill task t6 on node B.
Kill task t7 on node A.
Kill task t8 on node C.

Large results, too much effort to manually examine

Kill task $t on node $n.
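
A minimal sketch of this abstraction, assuming the dynamic parts of a line are the word after `task` and the word after `node`. The regular expressions and the function name `abstract` are illustrative and not taken from the talk.

```python
import re
from collections import Counter


def abstract(line):
    """Replace the task and node identifiers with the placeholders $t and $n."""
    line = re.sub(r"\btask \w+", "task $t", line)
    return re.sub(r"\bnode \w+", "node $n", line)


log_lines = [
    "Kill task t1 on node A.",
    "Kill task t2 on node B.",
    "Kill task t3 on node C.",
    "Kill task t4 on node A.",
    "Kill task t5 on node D.",
    "Kill task t6 on node B.",
    "Kill task t7 on node A.",
    "Kill task t8 on node C.",
]

# Count how many raw lines collapse into each abstracted template.
templates = Counter(abstract(line) for line in log_lines)
for template, count in templates.items():
    print(count, template)
```

The eight raw lines collapse into a single template, so a reviewer inspects one entry instead of eight.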
  
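Overview of our approach

[Figure: the application runs on the underlying platform twice, once with small sample data on a pseudo cloud and once with big data on a real-life cloud. The logs of each run go through log abstraction, log linking and sequence simplification to produce execution sequences; comparing the two sets of execution sequences gives the execution sequence delta.]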
  
Step 1: Log abstraction reduces the size of logs

Steps: log abstraction, log linking, simplifying sequences

Table 1: Example of log lines
#   Log line
1   time=1, Task=Trying to launch, TaskID=01A
2   time=2, Task=Trying to launch, TaskID=077
3   time=3, Task=JVM, TaskID=01A
4   time=4, Task=Reduce, TaskID=01A
5   time=5, Task=JVM, TaskID=077
6   time=6, Task=Reduce, TaskID=01A
7   time=7, Task=Reduce, TaskID=01A
8   time=8, Task=Progress, TaskID=077
9   time=9, Task=Done, TaskID=077
10  time=10, Task=Commit Pending, TaskID=01A
11  time=11, Task=Done, TaskID=01A

Table 2: Execution events
Event  Event template                               Log line #
E1     time=$t, Task=Trying to launch, TaskID=$id   1,2
E2     time=$t, Task=JVM, TaskID=$id                3,5
E3     time=$t, Task=Reduce, TaskID=$id             4,6,7
E4     time=$t, Task=Progress, TaskID=$id           8
E5     time=$t, Task=Commit Pending, TaskID=$id     10
E6     time=$t, Task=Done, TaskID=$id               9,11

Jiang et al., JSME 2008
  
Step 2: Log linking provides context for logs

Log lines that contain the same TaskID are linked into one sequence. Figure 2-c shows the resulting sequences after abstracting the log lines and linking them using the TaskID values: events E1, E2, E3, E5 and E6 are linked together (note that event E3 has been executed more than once), and events E1, E2, E4 and E6 are linked together, since each group shares the same TaskID value.

Table 3: Execution sequence
TaskID  Event sequence
01A     E1, E2, E3, E3, E3, E5, E6
077     E1, E2, E4, E6
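
The linking step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' tool: the `EVENT_IDS` mapping simply copies the event numbering of Table 2, the input is assumed to be already ordered by time, and the names `LINE` and `link` are hypothetical.

```python
import re
from collections import defaultdict

# Event numbering copied from Table 2 (keyed by the value of the Task field).
EVENT_IDS = {
    "Trying to launch": "E1",
    "JVM": "E2",
    "Reduce": "E3",
    "Progress": "E4",
    "Commit Pending": "E5",
    "Done": "E6",
}

# Assumed line layout, taken from the example log lines of Table 1.
LINE = re.compile(r"time=(\d+), Task=(.+), TaskID=(\w+)")

LOG_LINES = [
    "time=1, Task=Trying to launch, TaskID=01A",
    "time=2, Task=Trying to launch, TaskID=077",
    "time=3, Task=JVM, TaskID=01A",
    "time=4, Task=Reduce, TaskID=01A",
    "time=5, Task=JVM, TaskID=077",
    "time=6, Task=Reduce, TaskID=01A",
    "time=7, Task=Reduce, TaskID=01A",
    "time=8, Task=Progress, TaskID=077",
    "time=9, Task=Done, TaskID=077",
    "time=10, Task=Commit Pending, TaskID=01A",
    "time=11, Task=Done, TaskID=01A",
]


def link(lines):
    """Group abstracted events into one sequence per TaskID, in input order."""
    sequences = defaultdict(list)
    for line in lines:
        m = LINE.fullmatch(line)
        if m is None:
            continue  # skip lines that do not follow the assumed layout
        _, task, task_id = m.groups()
        sequences[task_id].append(EVENT_IDS[task])
    return dict(sequences)


for task_id, events in link(LOG_LINES).items():
    print(task_id, ", ".join(events))
```

Run on the eleven lines of Table 1, the sketch produces the two sequences listed in Table 3.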

Eliminating repetitions

There can be event repetitions in the existing sequences caused by loops. For example, in sequences about reading data from a remote node, there are repeated events for fetching the data. Similar log sequences that contain the same events a different number of times are considered different sequences, although they indicate the same system behaviour in essence. These repeated events need to be suppressed to ease the analysis. We use regular expression techniques to detect and suppress the repetitions. For the example shown in Figure 2, given the sequence “E1 E2 E3 E3 E5 E6”, our technique would detect the repetition of E3 and compress this sequence into “E1 E2 E3 E5 E6”.
  
Step 3: Sequence simplification deals with repeated logs

Table 4: Execution sequence after eliminating looping
TaskID  Event sequence
01A     E1, E2, E3, E5, E6
077     E1, E2, E4, E6

Repeated logs:
task t1 read file A.
task t1 read file A.
task t1 read file A.

Remove repetition and order of events
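
A minimal sketch of removing repetition with a regular expression, assuming a sequence is written as space-separated event IDs of the form `E<number>`. The pattern and the name `simplify` are illustrative. The sketch only collapses immediate repeats of a single event; it does not handle loops that span several events, and it leaves the order of events untouched.

```python
import re

# One event ID followed by one or more immediate copies of itself.
REPEATED_EVENT = re.compile(r"\b(E\d+)(?: \1\b)+")


def simplify(sequence):
    """Collapse each run of an immediately repeated event into one occurrence."""
    return REPEATED_EVENT.sub(r"\1", sequence)


print(simplify("E1 E2 E3 E3 E3 E5 E6"))
print(simplify("E1 E2 E4 E6"))
```

The word boundary after the back-reference keeps `E1 E11` from being mistaken for a repetition of `E1`.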
  
Comparing small and large runs

Logs from testing run with small data
Event sequence:
E1, E2, E3, E5, E6

Logs from run with large data
Event sequences:
E1, E2, E3, E5, E6
E1, E2, E3, E7, E5, E6

Event sequence delta:
E1, E2, E3, E7, E5, E6
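
The comparison above amounts to keeping the sequences that appear only in the large run. A minimal sketch, assuming each run has already been reduced to a list of simplified event sequences; the function name `sequence_delta` is hypothetical.

```python
def sequence_delta(small_run, large_run):
    """Return the sequences of the large run that never occur in the small run."""
    seen = {tuple(sequence) for sequence in small_run}
    delta = []
    for sequence in large_run:
        key = tuple(sequence)
        if key not in seen:
            seen.add(key)  # report each new sequence only once
            delta.append(list(sequence))
    return delta


small_run = [["E1", "E2", "E3", "E5", "E6"]]
large_run = [
    ["E1", "E2", "E3", "E5", "E6"],
    ["E1", "E2", "E3", "E7", "E5", "E6"],
]
print(sequence_delta(small_run, large_run))
```

Only the sequence containing E7 is left for manual inspection, which matches the delta shown on the slide.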
  
Case study: subject systems

System     Source                  Domain
WordCount  official example        File processing
PageRank   developed from scratch  Social network
JACK       migrated from Perl      Log analysis

Precision: How precise is our approach?
Effort reduction: How much effort reduction does our approach provide?
  
Our approach reduces the logs for manual inspection by over 86%

[Figure: bar chart for WordCount, JACK and PageRank showing # log sequences, # unique log events and # log lines; y-axis from 0 to 2000. Our approach compared with keyword search: 86%, 91% and 95% reduction.]
  
Effort reduction: reduce logs for manual inspection by over 86%
  	
  
We manually inject 3 common failures

Machine failure
Missing supporting library
Lack of disk space

We measure the number of log lines and log sequences caused by injected failures.

Subject systems: WordCount, PageRank, JACK

Cola et al., Euro-Par 2005
  
Our approach generates fewer false positives than the traditional approach

[Figure: bar chart for WordCount, JACK and PageRank; y-axis from 0 to 40. False positive ratio between keyword search and our approach: 1:29, 1:8, 1:36.]
  
Effort reduction: reduce logs for manual inspection by over 86%
Precision: fewer false positives and additional context information to assist in manual inspection
  
Under the hood of BDA Apps

Physical Infrastructure
Underlying Platform
BDA Apps
  
Our approach can be used in migration of BDA Apps

One of the common migrations:
Hadoop generates more job sequences and task sequences.
PIG automatically optimizes the application by grouping jobs and reducing tasks.

Manually browsing logs to find the differences can be time-consuming.
  
We use our approach to compare the execution sequences of PageRank on both platforms
  

Human Factors of XR: Using Human Factors to Design XR Systems
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

ICSE2013

  • 1. Assisting Developers of Big Data Analytics Applications When Deploying on Hadoop Clouds. Hadi Hemmati, Bram Adams, Weiyi Shang, Zhen Ming Jiang, Ahmed E. Hassan, Patrick Martin
  • 2. What are Big Data Analytics Applications (BDA Apps)?
  • 3. Many fields today rely on BDA Apps to make decisions: software engineering research, especially Mining Software Repositories. And…
  • 4. Under the hood of BDA Apps: Hardware Infrastructure; Software Platform; BDA Apps
  • 5. Discrepancy between scale of development and deployment: small sample data and pseudo cloud vs. big data and real-life cloud
  • 6. ACM Interactions 2012: “Analysts moved back and forth from local machines to cloud-based systems.”
  • 7. Many things can go wrong when scaling. BDA App: Step 1, Step 2, …, Step n. Large-scale intermediate data generated by each step can fill up the disk space!!!
  • 8. How to verify the deployment of BDA Apps? From small sample data and pseudo cloud to big data and real-life cloud: how to verify?
  • 9. Traditional approach for verifying BDA Apps: keyword scan
  • 10. Limitations of traditional approach: many false positives!! Large results, too much effort to manually examine
  • 11. Not all kills are bad: “speculative execution”. Slow task identified. Duplicate the task to other machines. The results of the first finished task are saved; the other tasks are killed!!
  • 12. A smarter approach is needed
  • 13. Execution sequences provide context information of log lines:
       Kill task t on node A.
       Assign task t on node A.
       Assign task t on node B.
       Task t finished on node B.
  • 14. Log abstraction reduces the amount of data to examine
       Kill task t1 on node A.
       Kill task t2 on node B.
       Kill task t3 on node C.
       Kill task t4 on node A.
       Kill task t5 on node D.
       Kill task t6 on node B.
       Kill task t7 on node A.
       Kill task t8 on node C.
       (Large results, too much effort to manually examine)
       Abstracted event: Kill task $t on node $n.
  • 15. Overview of our approach. Inputs: logs of the underlying platform from the run with small sample data on the pseudo cloud, and from the run with big data on the real-life cloud. Steps: log abstraction, log linking, sequence simplification. Outputs: execution sequences for each run, then the execution sequence delta
  • 16. Step 1: Log abstraction reduces the size of logs (Jiang et al., JSME 2008)
       Pipeline steps: log abstraction, log linking, simplifying sequences
       Table 1: Example of log lines
         #    Log line
         1    time=1, Task=Trying to launch, TaskID=01A
         2    time=2, Task=Trying to launch, TaskID=077
         3    time=3, Task=JVM, TaskID=01A
         4    time=4, Task=Reduce, TaskID=01A
         5    time=5, Task=JVM, TaskID=077
         6    time=6, Task=Reduce, TaskID=01A
         7    time=7, Task=Reduce, TaskID=01A
         8    time=8, Task=Progress, TaskID=077
         9    time=9, Task=Done, TaskID=077
         10   time=10, Task=Commit Pending, TaskID=01A
         11   time=11, Task=Done, TaskID=01A
       Table 2: Execution events
         Event   Event template                                Log lines (#)
         E1      time=$t, Task=Trying to launch, TaskID=$id    1, 2
         E2      time=$t, Task=JVM, TaskID=$id                 3, 5
         E3      time=$t, Task=Reduce, TaskID=$id              4, 6, 7
         E4      time=$t, Task=Progress, TaskID=$id            8
         E5      time=$t, Task=Commit Pending, TaskID=$id      10
         E6      time=$t, Task=Done, TaskID=$id                9, 11
  • 17. Step 2: Log linking provides context for logs
       Pipeline steps: log abstraction, log linking, simplifying sequences
       Shown again: Table 1 (example of log lines) and Table 2 (execution events)
       Table 3: Execution sequence
         TaskID   Event sequence
         01A      E1, E2, E3, E3, E3, E5, E6
         077      E1, E2, E4, E6
       Paper excerpt (linking): Figure 2-c shows the result sequence after abstracting the logs and linking them into sequences using the TaskID values. In the event linking result in Figure 2-c, events E1, E2, E3, E5 and E6 are linked together (note that event E3 has been executed twice) and events E1, E2, E4, E6 are linked together since the same TaskID values are shared.
       Paper excerpt (eliminating repetitions): There can be event repetitions in the existing sequences caused by loops. For example, for sequences about reading data from a remote node, there would be repeated events about keeping fetching the data. Similar log sequences that include different times of the same events are considered different sequences, although they indicate the same system behaviour in essence. These repeated events need to be suppressed to ease the analysis. We use regular expression techniques to detect and suppress the repetitions. For the example shown in Figure 2, the sequence “E1 E2 E3 E3 E5 E6”, our technique would detect the repetition of E3 and suppress this sequence into “E1 E2 E3 E5 E6”.
  • 18. Step 3: Sequence simplification deals with repeated logs
       Pipeline steps: log abstraction, log linking, simplifying sequences
       Repeated logs:
         task t1 read file A.
         task t1 read file A.
         task t1 read file A.
       Remove repetition and order of events
       Table 3: Execution sequence
         TaskID   Event sequence
         01A      E1, E2, E3, E3, E3, E5, E6
         077      E1, E2, E4, E6
       Table 4: Execution sequence after eliminating looping
         TaskID   Event sequence
         01A      E1, E2, E3, E5, E6
         077      E1, E2, E4, E6
       Paper excerpt (3.4 Failure detection): Intuitively, if any failure exists, the cloud computing platform would generate extra logs. The extra logs contain event sequences indicating the process of error message and fault recovery. Therefore, different event sequences, which reflect different system behaviours, should be recovered between different runs of an application with and without failures.
  • 19. Comparing small and large runs
       Logs from testing run with small data, event sequence: E1, E2, E3, E5, E6
       Logs from run with large data, event sequences: E1, E2, E3, E5, E6 and E1, E2, E3, E7, E5, E6
       Event sequence delta: E1, E2, E3, E7, E5, E6
  • 20. Case study: subject systems
       System      Source                   Domain
       WordCount   official example         File processing
       PageRank    developed from scratch   Social network
       JACK        migrated from Perl       Log analysis
  • 21. Effort reduction: how much effort reduction does our approach provide? Precision: how precise is our approach?
  • 22. Our approach reduces the logs for manual inspection by over 86%. [Bar chart, scale 0 to 2000: # log sequences, # unique log events and # log lines for WordCount, JACK and PageRank; our approach vs. keyword search; labelled reductions of 86%, 91% and 95%]
  • 23. Effort reduction: how much effort reduction does our approach provide? Reduce logs for manual inspection by over 86%. Precision: how precise is our approach?
  • 24. We manually inject 3 common failures (Cola et al., Euro-Par 2005): machine failure, missing supporting library, lack of disk space. We measure the number of log lines and log sequences caused by injected failures. Subjects: WordCount, PageRank, JACK
  • 25. Our approach generates fewer false positives than the traditional approach. [Bar chart, scale 0 to 40: false positive ratio between keyword search and our approach for WordCount, JACK and PageRank; labelled ratios 1:29, 1:8 and 1:36]
  • 26. Effort reduction: how much effort reduction does our approach provide? Reduce logs for manual inspection by over 86%. Precision: how precise is our approach? Fewer false positives and additional context information to assist in manual inspection
  • 28. Under the hood of BDA Apps: Physical Infrastructure; Underlying Platform; BDA Apps
  • 29. Our approach can be used in migration of BDA Apps. One of the common migrations: Hadoop generates more job sequences and task sequences. PIG automatically optimizes the application by grouping jobs and reducing tasks. Manually browsing logs to find the differences can be time-consuming. We use our approach to compare the execution sequences of PageRank on both platforms
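The three steps on slides 16 to 18 and the comparison on slide 19 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' tool: the function names (`abstract`, `link`, `simplify`, `sequence_delta`) are hypothetical, the only dynamic fields are assumed to be `time` and `TaskID`, the templates of Table 2 are listed up front so that the event IDs match the slides, and only immediate repetitions of a single event are collapsed.

```python
import re
from collections import defaultdict

# Log lines copied from Table 1 of the slides (assumed to be in time order).
LOG_LINES = [
    "time=1, Task=Trying to launch, TaskID=01A",
    "time=2, Task=Trying to launch, TaskID=077",
    "time=3, Task=JVM, TaskID=01A",
    "time=4, Task=Reduce, TaskID=01A",
    "time=5, Task=JVM, TaskID=077",
    "time=6, Task=Reduce, TaskID=01A",
    "time=7, Task=Reduce, TaskID=01A",
    "time=8, Task=Progress, TaskID=077",
    "time=9, Task=Done, TaskID=077",
    "time=10, Task=Commit Pending, TaskID=01A",
    "time=11, Task=Done, TaskID=01A",
]

# Templates from Table 2, in the slide's order so the IDs come out as E1..E6.
TEMPLATES = [
    "time=$t, Task=Trying to launch, TaskID=$id",
    "time=$t, Task=JVM, TaskID=$id",
    "time=$t, Task=Reduce, TaskID=$id",
    "time=$t, Task=Progress, TaskID=$id",
    "time=$t, Task=Commit Pending, TaskID=$id",
    "time=$t, Task=Done, TaskID=$id",
]
EVENT_ID = {tpl: f"E{i}" for i, tpl in enumerate(TEMPLATES, start=1)}

# Assumption: these two fields carry the dynamic values of a log line.
PLACEHOLDER = {"time": "$t", "TaskID": "$id"}
DYNAMIC_FIELD = re.compile(r"\b(time|TaskID)=(\w+)")


def abstract(line):
    """Step 1: replace dynamic values by placeholders; return (event, task id)."""
    fields = dict(DYNAMIC_FIELD.findall(line))
    template = DYNAMIC_FIELD.sub(
        lambda m: m.group(1) + "=" + PLACEHOLDER[m.group(1)], line
    )
    return EVENT_ID[template], fields["TaskID"]


def link(lines):
    """Step 2: group abstracted events into one sequence per TaskID."""
    sequences = defaultdict(list)
    for line in lines:
        event, task_id = abstract(line)
        sequences[task_id].append(event)
    return dict(sequences)


def simplify(events):
    """Step 3: collapse immediate repetitions of one event with a regex."""
    collapsed = re.sub(r"\b(\w+)( \1\b)+", r"\1", " ".join(events))
    return tuple(collapsed.split())


def sequence_delta(small_run, large_run):
    """Sequences seen in the large run but not in the small run."""
    return set(large_run) - set(small_run)


linked = link(LOG_LINES)
simplified = {task: simplify(seq) for task, seq in linked.items()}

# Slide 19 example: the large run contains one extra sequence with event E7.
small = {("E1", "E2", "E3", "E5", "E6")}
large = small | {("E1", "E2", "E3", "E7", "E5", "E6")}
delta = sequence_delta(small, large)
```

With the Table 1 input, `linked` reproduces Table 3 and `simplified` reproduces Table 4, while `delta` holds only the sequence containing E7. Sequences are stored as tuples so that they can be compared as set members; the "order of events" part of the simplification mentioned on slide 18 is not covered by this sketch.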