Slide 1: Automated Detection of Performance Regressions Using Regression Models on Clustered Performance Counters
Weiyi Shang and Ahmed E. Hassan (Queen's University, Kingston, Canada)
Mohamed Nasser and Parminder Flora (BlackBerry, Waterloo, Canada)
ICPE 2015
  
Slide 2: What is a performance regression?
Given an old version and a new version of a system: does the new version have worse performance than the old version?
  
Slide 3: How to detect a performance regression?
The old version and the new version handle the same requests, and performance counters are collected from each version.
  
Slide 4: Large software systems generate large amounts of performance counters.
  
Slide 5: Performance engineers pick counters to compare using a t-test.
Are these counters significantly different between the old and new versions? Picking the wrong counters will not detect the performance regression!
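As a minimal sketch (not from the talk), assuming Python with SciPy and hypothetical CPU-utilization samples from the two load tests, the counter-by-counter comparison on this slide could look like this:

```python
# Compare one performance counter between the old and new versions with an
# unpaired t-test, as a performance engineer might do for a hand-picked counter.
from scipy import stats

# Hypothetical CPU-utilization samples (%) collected during the two load tests.
cpu_old = [41.2, 39.8, 40.5, 42.1, 40.9, 41.7]
cpu_new = [44.0, 45.3, 43.8, 46.1, 44.9, 45.2]

# Welch's t-test (does not assume equal variances between the two runs).
t_stat, p_value = stats.ttest_ind(cpu_old, cpu_new, equal_var=False)
if p_value < 0.05:
    print(f"CPU differs significantly between versions (p = {p_value:.3f})")
else:
    print(f"No significant CPU difference detected (p = {p_value:.3f})")
```

The risk the slide points out remains: if the hand-picked counter is not the one affected by the regression, the test reports nothing.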
  
Slide 6: Performance engineers build a model using all counters.
Target counter: CPU. Independent counters: Memory, I/O read, I/O write.
  
Slide 7: Model-based performance regression detection.
A target counter is selected, a model is built from the old-version counters, and the model is verified against the new-version counters to obtain a prediction error. The target counter is selected by experience!
  
Slide 8: Selecting a wrong target counter will fail to detect the performance regression.
I/O write has a high correlation with CPU, but memory has a low correlation with CPU. Memory-related performance regressions will not be detected by this model!
  
Slide 9: Our approach.
We first reduce the counters and then cluster the reduced counters. For each cluster, we select a target counter, build a model from the old-version counters, and verify the model on the new-version counters to obtain a prediction error.
  
Slide 10: Step 1: Reduce counters.
We remove zero-variance counters (for example, "available CPU cores" is 4 in every observation) and redundant counters (for example, I/O write bytes/sec = a * I/O write ops/sec + b * CPU).
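A minimal sketch of this reduction step, assuming Python with pandas/NumPy (tooling not named in the talk) and using pairwise correlation as a simplified stand-in for the redundancy analysis implied by the linear-combination example:

```python
import numpy as np
import pandas as pd

def reduce_counters(counters: pd.DataFrame, corr_threshold: float = 0.95) -> pd.DataFrame:
    """Drop zero-variance and (approximately) redundant performance counters."""
    # Zero-variance counters carry no information, e.g. "available CPU cores"
    # that is 4 in every observation.
    counters = counters.loc[:, counters.std() > 0]
    # Treat a counter as redundant when it is almost perfectly correlated with
    # another counter, e.g. I/O write bytes/sec that tracks I/O write ops/sec.
    corr = counters.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return counters.drop(columns=redundant)
```

`reduce_counters` and its threshold are hypothetical names and values for illustration; the talk only describes the two removal criteria.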
  
Slide 11: Step 2: Cluster counters.
We apply hierarchical clustering to the reduced counters and cut the dendrogram to obtain clusters of counters (e.g., CPU, I/O, and memory counters).
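A minimal sketch of the clustering step, assuming SciPy's hierarchical clustering with a correlation-based distance and average linkage (the talk does not state the distance metric, linkage, or cut height):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_counters(counters, cut_height: float = 0.5) -> dict:
    """Group the reduced counters (a pandas DataFrame) into clusters."""
    # Use 1 - |correlation| as the distance between two counters, so counters
    # that move together (e.g. the I/O counters) end up in the same cluster.
    distance = 1 - counters.corr().abs()
    condensed = squareform(distance.values, checks=False)
    tree = linkage(condensed, method="average")
    # "Dendrogram cutting": merge everything joined below cut_height.
    labels = fcluster(tree, t=cut_height, criterion="distance")
    clusters: dict = {}
    for name, label in zip(counters.columns, labels):
        clusters.setdefault(label, []).append(name)
    return clusters  # e.g. {1: ["CPU", ...], 2: ["I/O read", "I/O write"], ...}
```

`cut_height` is a hypothetical parameter; in practice the cut level controls how many clusters (CPU, I/O, memory, ...) are obtained.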
  
Slide 12: Step 3: Select the target counter.
For each counter in a cluster, we compare the old-version and new-version distributions (for example, CPU from the old version against CPU from the new version) using the Kolmogorov–Smirnov test. We select the counter with the smallest p-value, i.e., the counter that differs significantly between the two versions with the highest certainty.
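A minimal sketch of the target-counter selection, assuming SciPy's two-sample Kolmogorov-Smirnov test and old/new counter tables indexed by counter name (a hypothetical data layout for illustration):

```python
from scipy.stats import ks_2samp

def select_target_counter(old_counters, new_counters, cluster: list) -> str:
    """Pick the counter in a cluster whose old-vs-new difference is most certain."""
    p_values = {
        name: ks_2samp(old_counters[name], new_counters[name]).pvalue
        for name in cluster
    }
    # Smallest p-value = significant difference with the highest certainty.
    return min(p_values, key=p_values.get)
```

For example, `select_target_counter(old_df, new_df, ["CPU", "Memory", "I/O read"])` might return `"CPU"` when the CPU counter shifts most clearly between versions.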
  
Slide 13: Step 4: Build model.
Within a cluster, we build a linear regression model that explains the target counter using the remaining counters, for example: CPU = a * Memory + b * I/O read + c * I/O write.
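A minimal sketch of the model-building step, assuming an ordinary least-squares fit with NumPy (the talk only says that a linear regression model is built):

```python
import numpy as np

def build_model(old_counters, target: str, predictors: list) -> np.ndarray:
    """Fit target ~ predictors (+ intercept) on the old-version counters."""
    # e.g. CPU = a*Memory + b*(I/O read) + c*(I/O write) + intercept
    X = np.column_stack(
        [old_counters[p].to_numpy() for p in predictors] + [np.ones(len(old_counters))]
    )
    y = old_counters[target].to_numpy()
    coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefficients  # predictor weights followed by the intercept
```

Here `old_counters` is assumed to be a pandas DataFrame of the old-version run; the predictors are the other counters left in the same cluster.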
  
Slide 14: Step 5: Verify model.
For each cluster, the linear regression model built from the old version is applied to the new-version counters left in that cluster, and the prediction error is calculated. If the mean prediction error is below the threshold, no regression is reported; if it is above the threshold, a performance regression is flagged.
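A minimal sketch of the verification step, reusing the coefficient layout from the previous sketch; the error measure (mean absolute percentage error) and the threshold value are assumptions for illustration, since the talk leaves the threshold as a tunable parameter:

```python
import numpy as np

def flag_regression(new_counters, target: str, predictors: list,
                    coefficients: np.ndarray, threshold: float = 0.20) -> bool:
    """Apply the old-version model to new-version data and compare the error to a threshold."""
    X = np.column_stack(
        [new_counters[p].to_numpy() for p in predictors] + [np.ones(len(new_counters))]
    )
    actual = new_counters[target].to_numpy()
    predicted = X @ coefficients
    mean_error = np.mean(np.abs(predicted - actual) / np.abs(actual))
    # Mean prediction error above the threshold -> possible performance
    # regression in this cluster; below the threshold -> no regression reported.
    return mean_error > threshold
```

The slides later report that the detection result is not heavily impacted by the exact threshold choice.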
  
Slide 15: Case study: subject systems.
DELL DVD Store: CPU overhead, memory overhead, I/O overhead, removing a text index, removing a column index.
Enterprise Application: a real-life performance regression.
  
Slide 16:
Applicability: How many target counters does our approach pick?
Accuracy: Can our approach detect performance regressions?
Comparison: Is our approach better than traditional approaches?
  
Slide 17: Our approach picks a small number of target counters.
DELL DVD Store: our approach picks 2 to 4 target counters.
Performance engineers cannot pick target counters based on experience: the picked target counters are different across runs of our approach.
  
Slide 18:
Applicability: Our approach picks a small number of target counters.
Accuracy: Can our approach detect performance regressions?
Comparison: Is our approach better than traditional approaches?
  
Slide 19: (identical to slide 18)
  
Slide 20: Our approach can detect performance regressions.
DELL DVD Store: max prediction error of 4% to 11% without a regression, and 24% to 1622% with a regression.
Our approach is not heavily impacted by the choice of threshold value.
Max prediction error without a performance regression: 3% to 9%. Max prediction error with a performance regression: 2% to 916%.
  
Slide 21:
Applicability: Our approach picks a small number of target counters.
Accuracy: Our approach can detect performance regressions and is not heavily impacted by the threshold value.
Comparison: Is our approach better than traditional approaches?
  
Slide 22: (identical to slide 21)
  
Slide 23: Our approach outperforms picking one target counter to build a model.
DELL DVD Store: 0% without a performance regression and 0% with a performance regression. Building one model using memory as the target counter may fail to detect the performance regression.
  
Slide 24: The t-test does not perform well in detecting performance regressions.
DELL DVD Store: without a performance regression, 3 to 6 counters are flagged.
Enterprise Application: without a performance regression, 32% of the counters are flagged.
A large number of counters show significant differences in the t-test results even though no regressions exist.
  
Slide 25:
Applicability: Our approach picks a small number of target counters.
Accuracy: Our approach can detect performance regressions and is not heavily impacted by the threshold value.
Comparison: Our approach outperforms traditional approaches.
  
Slide 26: (recap of slide 3: how to detect a performance regression?)
Slide 27: (no text)
Slide 28: (recap of slide 7: model-based performance regression detection)
Slide 29: (no text)
Slide 30: (recap of slide 9: our approach)
Slide 31: (no text)
  
Slide 32: (identical to slide 25)
  
Slide 33: weiyishang.wordpress.com, swy@cs.queensu.ca
  
