An Experience Report on Scaling Tools for MSR Studies Using MapReduce

Weiyi Shang, Bram Adams, Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL)
School of Computing, Queen’s University
  
Mining Software Repositories: Propagating code changes
•  Method A is changed. Method A calls Method B; Method C calls Method A. Change methods B and C.
•  Method A is changed. When method A is changed, 90% of the time method D is changed. Change method D.
Not enough
History helps!
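The "90% of the time" figure on this slide is the kind of number a co-change analysis of the version history produces. A minimal sketch of how such a confidence value could be computed follows; the function name `co_change_confidence` and the ten-commit history are hypothetical and only illustrate the idea, they are not taken from the talk.

```python
from collections import Counter
from itertools import permutations


def co_change_confidence(commits):
    """Estimate P(b changed | a changed) for every ordered pair of methods."""
    changed = Counter()   # number of commits that touch each method
    together = Counter()  # number of commits that touch both a and b
    for methods in commits:
        unique = set(methods)
        changed.update(unique)
        together.update(permutations(unique, 2))
    return {(a, b): n / changed[a] for (a, b), n in together.items()}


# Hypothetical history: each entry is the set of methods changed in one commit.
history = [{"A", "D"}] * 9 + [{"A"}]
confidence = co_change_confidence(history)
print(confidence[("A", "D")])  # 0.9
```

With this made-up history, method D changes in 9 of the 10 commits that change method A, which is the situation the slide describes.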
  
Traditional pipeline for MSR studies
•  Software repositories: source code history, bug database, mailing list, system log (continues to grow)
•  Data preparation (ETL): Extraction, Transformation, Loading
•  Data Warehouse
•  Data Analysis (more complex algorithms)
MSR studies must scale
  
Existing solutions to scale
•  powerful machines
•  ad hoc distributed computing
•  multi-threaded and multi-core
EXPENSIVE
LARGE PROGRAMMING EFFORT
NOT RE-USABLE
  
Example: D-CCFinder Clone Detector
•  40 days on 1 PC
•  52 hours on an 80-machine cluster
  
Web Analysis is similar to MSR studies
•  Large-scale data
•  Scan-centric
•  Rapidly evolving
  
Web-scale platforms
We believe that the MSR field can benefit from web-scale platforms to overcome the limitations of current approaches.
  
	
  
In our previous research
Feasibility study using Hadoop to scale a software evolution study on Eclipse.
Hadoop is up to 3 times faster on a 4-machine cluster
  
	
  
In this paper
1.  Does MapReduce scale to other MSR studies and larger clusters?
2.  What are the challenges and experiences of scaling MSR studies?
  
An example of MapReduce
Counting the frequency of word lengths

Map (key = word length):
Data      Key
good      4
hello     5
fish      4
cat       3
school    6
night     5
happy     5
dog       3

Pairs grouped by key:
Key    Value
3      dog
3      cat
4      fish
4      good
5      hello
5      night
5      happy
6      school

Reduce (value = number of words of that length):
Key    Value
3      2
4      2
5      3
6      1
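The word-length example can be sketched in plain Python to make the three stages explicit. This is only an in-memory illustration of the map, group-by-key, and reduce steps; the function names `map_phase`, `shuffle`, and `reduce_phase` are assumptions for this sketch and do not correspond to the Hadoop API.

```python
from collections import defaultdict


def map_phase(words):
    """Emit one (key, value) pair per word: key = word length, value = word."""
    return [(len(word), word) for word in words]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Count the words in each group: key = word length, value = frequency."""
    return {key: len(values) for key, values in sorted(groups.items())}


words = ["good", "hello", "fish", "cat", "school", "night", "happy", "dog"]
result = reduce_phase(shuffle(map_phase(words)))
print(result)  # {3: 2, 4: 2, 5: 3, 6: 1}
```

On a real cluster the map and reduce functions would run in parallel on different machines, and the grouping step would be handled by the framework rather than by user code.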
Three large-scale MSR studies
•  Software evolution study
– J-REX: code-change information abstractor for Java from line level to program entity level
•  Code clone detection
– CC-Finder: code clone detection tool
•  Log analysis
– JACK: log analysis tool for detecting system anomalies during load testing
  
Experimental environment

CPU type                            #machines    Memory size    Operating system
Intel Quad Core Q6600 (2.40 GHz)    18           3 GB           Ubuntu 8.04
8 Xeon (3.0 GHz)                    10           8 GB           CentOS 5.2
  
Input data

Data              Size       Data type         #Files
Eclipse           10.4 GB    CVS repository    189,156
Datatools         227 MB     CVS repository    10,629
FreeBSD           5.1 GB     source code       317,740
Log files No.1    9.9 GB     execution log     54
Log files No.2    2.1 GB     execution log     54
  
1. Does MapReduce scale to other MSR studies and larger clusters?
  
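Software Evolution & Log analysis
[Two bar charts of running time in minutes, labelled J-REX and JACK, each comparing 1 machine with SHARCNET (×10). One chart shows 580 min on 1 machine and 98 min on SHARCNET; the other shows 755 min on 1 machine and 80 min on SHARCNET. Speed-up labels: ×9 and ×6.]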
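The ×6 and ×9 labels are consistent with dividing the single-machine running time by the cluster running time for the two pairs of values on this slide (580 and 98 minutes, 755 and 80 minutes). A small sketch of that arithmetic, assuming the smaller value in each pair is the SHARCNET run; the helper name `speedup` is hypothetical.

```python
def speedup(single_machine_minutes, cluster_minutes):
    """Ratio of single-machine running time to cluster running time."""
    return single_machine_minutes / cluster_minutes


# Values read from the two charts (minutes).
print(round(speedup(580, 98), 1))  # 5.9
print(round(speedup(755, 80), 1))  # 9.4
```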
  
Code clone detection
Can MapReduce scale up CCFinder?
Yes! 58 hours on an 18-machine cluster.
  
2. What are the challenges and experiences of scaling MSR studies?
  
Challenge 1: Locality of MSR analysis
[Diagram: a spectrum from local analysis through semi-local analysis to global analysis, with one "Web" marker and three "MSR" markers.]
  
Challenge 2: Granularity of MSR analysis
[Diagram: a spectrum from fine-grained analysis to coarse-grained analysis, with one "Web" marker and two "MSR" markers.]
•  Web community experience:
– #Map tasks: 10 to 100 × #machines
– #Reduce tasks: 0.95 or 1.75 × #CPU cores
•  MSR experience:
– #Reduce tasks = #CPU cores (fine-grained analysis)
– #Reduce tasks = #input records (coarse-grained analysis)
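The rules of thumb on this slide can be turned into a small calculator. The sketch below is an assumption-laden illustration: the function names, the choice of the lower bound of 10 map tasks per machine, the 0.95 factor as defaults, and rounding down to a whole number of tasks are all choices made here, not settings reported in the talk.

```python
import math


def web_task_counts(machines, cpu_cores, maps_per_machine=10, reduce_factor=0.95):
    """Web-community heuristic: 10 to 100 maps per machine, 0.95 or 1.75 reduces per core."""
    return {
        "map": maps_per_machine * machines,
        "reduce": math.floor(reduce_factor * cpu_cores),  # rounding down is an assumption
    }


def msr_reduce_tasks(cpu_cores, input_records, fine_grained):
    """MSR experience: one reduce task per core (fine-grained) or per input record (coarse-grained)."""
    return cpu_cores if fine_grained else input_records


# Example: 18 quad-core machines, assuming all 72 cores are available to Hadoop.
print(web_task_counts(machines=18, cpu_cores=72))  # {'map': 180, 'reduce': 68}
print(msr_reduce_tasks(cpu_cores=72, input_records=1000, fine_grained=True))  # 72
```

The 1000 input records in the usage example are a made-up value; for a coarse-grained analysis the function would return that record count instead of the core count.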
  
Challenges of migrating MSR studies to MapReduce
1.  Locality of MSR analysis
2.  Granularity of MSR analysis
3.  Locating a suitable cluster
4.  Managing data during analysis
5.  Recovering from errors
  
Questions?
  

ASE2010
