An Experience Report on Scaling Tools for MSR Studies Using MapReduce
Weiyi Shang, Bram Adams, Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL)
School of Computing, Queen’s University
  
Mining Software Repositories: Propagating code changes
•  Method A is changed. Method A calls Method B, and Method C calls Method A: change methods B and C. Not enough.
•  Method A is changed. When method A is changed, 90% of the time method D is changed: change method D. History helps!
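The "90% of the time" figure is the kind of co-change statistic that can be mined from version history. A minimal sketch, assuming each commit is available as the set of methods it changed; the function name and the sample history are hypothetical and are not taken from the slides:

```python
from collections import Counter
from itertools import permutations


def co_change_confidence(commits):
    """Map (a, b) to the fraction of commits changing a that also change b."""
    changed = Counter()   # number of commits that touch each method
    together = Counter()  # number of commits that touch both a and b
    for methods in commits:
        unique = set(methods)
        changed.update(unique)
        together.update(permutations(unique, 2))
    return {pair: n / changed[pair[0]] for pair, n in together.items()}


# Hypothetical history: A changes in 10 commits, 9 of which also change D.
history = [{"A", "D"}] * 9 + [{"A"}] + [{"B"}]
confidence = co_change_confidence(history)
print(confidence[("A", "D")])  # 0.9
```

Method B never changes together with A in this sample history, so a purely history-based rule would not suggest it; the call-graph rule and the history rule complement each other.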
  
Traditional pipeline for MSR studies
•  Software repositories: source code history, bug database, mailing list, system log
•  Data preparation (ETL): extraction, transformation, loading
•  Data warehouse
•  Data analysis
The data continues to grow and the algorithms become more complex: MSR studies must scale.
  
Existing solutions to scale
•  Powerful machines
•  Ad hoc distributed computing
•  Multi-threaded and multi-core
Drawbacks: expensive, large programming effort, not re-usable.
  
Example: D-CCFinder Clone Detector
•  40 days on 1 PC
•  52 hours on an 80-machine cluster
  
Web analysis is similar to MSR studies
•  Large-scale data
•  Scan-centric
•  Rapidly evolving
  
Web-scale platforms
We believe that the MSR field can benefit from web-scale platforms to overcome the limitations of current approaches.
  
In our previous research
Feasibility study using Hadoop to scale a software evolution study on Eclipse.
Hadoop is up to 3 times faster on a 4-machine cluster.
  
In this paper
1.  Does MapReduce scale to other MSR studies and larger clusters?
2.  What are the challenges and experiences of scaling MSR studies?
  
An example of MapReduce: counting the frequency of word lengths

Map (each input word gets its length as key):
Data      Key
good      4
hello     5
fish      4
cat       3
school    6
night     5
happy     5
dog       3

Map output, ordered by key:
Value     Key
dog       3
cat       3
fish      4
good      4
hello     5
night     5
happy     5
school    6

Reduce (number of words per key):
Value     Key
2         3
2         4
3         5
1         6
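
The word-length example can be reproduced in a few lines. This is a single-process sketch of the map, group-by-key, and reduce steps, not Hadoop code; the function names are illustrative assumptions:

```python
from collections import defaultdict


def map_word(word):
    """Map step: emit (key, value) = (word length, word)."""
    yield len(word), word


def reduce_length(length, words):
    """Reduce step: count the words that share one length."""
    return length, len(words)


def run_mapreduce(records):
    """Apply map, group the emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_word(record):
            groups[key].append(value)
    return [reduce_length(key, groups[key]) for key in sorted(groups)]


words = ["good", "hello", "fish", "cat", "school", "night", "happy", "dog"]
counts = run_mapreduce(words)
print(counts)  # [(3, 2), (4, 2), (5, 3), (6, 1)]
```

On a real cluster the framework performs the grouping between the two steps, and the map and reduce calls run in parallel on different machines.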
Three large-scale MSR studies
•  Software evolution study
– J-REX: code-change information abstractor for Java from line level to program entity level
•  Code clone detection
– CC-Finder: code clone detection tool
•  Log analysis
– JACK: log analysis tool for detecting system anomalies during load testing
  
Experimental environment
CPU type                            #machines    Memory size    Operating system
Intel Quad Core Q6600 (2.40 GHz)    18           3GB            Ubuntu 8.04
8 Xeon (3.0 GHz)                    10           8GB            CentOS 5.2
  
Input data
                  Data size    Data type         #Files
Eclipse           10.4 GB      CVS repository    189,156
Datatools         227 MB       CVS repository    10,629
FreeBSD           5.1 GB       source code       317,740
Log files No.1    9.9 GB       execution log     54
Log files No.2    2.1 GB       execution log     54
  
1.  Does MapReduce scale to other MSR studies and larger clusters?
  
Software Evolution & Log analysis
[Figure: two bar charts of running time (min), 1 machine vs. SHARCNET(×10). Plotted values: 580 vs. 98, and 755 vs. 80. Labels: J-REX ×9, JACK ×6.]
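The ×6 and ×9 labels are consistent with the ratios of the running times plotted on this slide (580 vs. 98 minutes, and 755 vs. 80 minutes). A quick arithmetic check; the function name is illustrative, and the pairing of each single-machine time with its cluster time follows the order of the chart values:

```python
def speedup(single_machine_minutes, cluster_minutes):
    """Ratio of the single-machine running time to the cluster running time."""
    return single_machine_minutes / cluster_minutes


# (1 machine, SHARCNET) running times in minutes, as read from the two charts.
runs = [(580, 98), (755, 80)]
factors = [round(speedup(single, cluster), 1) for single, cluster in runs]
print(factors)  # [5.9, 9.4]
```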
  
Code clone detection
Can MapReduce scale up CCFinder? Yes! 58 hours on an 18-machine cluster.
  
2.  What are the challenges and experiences of scaling MSR studies?
  
Challenge 1: Locality of MSR analysis
•  Local analysis
•  Semi-local analysis
•  Global analysis
[Figure: the three kinds of analysis, annotated with the labels Web (once) and MSR (three times).]
  
Challenge 2: Granularity of MSR analysis
•  Web community experience:
– #Map: 10 ~ 100 × #machines
– #Reduce: 0.95 or 1.75 × #CPU cores
•  MSR experience:
– #Reduce tasks = #CPU cores (fine-grained analysis)
– #Reduce tasks = #input records (coarse-grained analysis)
[Figure: fine-grained analysis and coarse-grained analysis, annotated with the labels Web (once) and MSR (twice).]
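The rules of thumb on this slide can be written down as two small helpers. This is a sketch only: the function names are illustrative, the example cluster (18 quad-core machines, 72 cores) is derived from the experimental environment slide, and the 54 input records are an assumed value for illustration:

```python
def web_task_counts(machines, cpu_cores):
    """Web community rule of thumb for the number of map and reduce tasks."""
    return {
        # 10 to 100 map tasks per machine, given as a (low, high) range
        "map_range": (10 * machines, 100 * machines),
        # either 0.95 or 1.75 reduce tasks per CPU core (two alternatives)
        "reduce_options": (round(0.95 * cpu_cores), round(1.75 * cpu_cores)),
    }


def msr_reduce_tasks(cpu_cores, input_records, coarse_grained):
    """MSR experience: one reduce task per input record for coarse-grained
    analysis, one per CPU core for fine-grained analysis."""
    return input_records if coarse_grained else cpu_cores


# Assumed cluster: 18 quad-core machines, i.e. 72 CPU cores.
print(web_task_counts(machines=18, cpu_cores=72))
# Assumed workload: 54 input records.
print(msr_reduce_tasks(cpu_cores=72, input_records=54, coarse_grained=True))
```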
  
Challenges of migrating MSR studies to MapReduce
1.  Locality of MSR analysis
2.  Granularity of MSR analysis
3.  Locating a suitable cluster
4.  Managing data during analysis
5.  Recovering from errors
  
Questions?
  

ASE2010

  • 1. An  Experience  Report  on  Scaling  Tools   for  MSR  Studies  Using  MapReduce    Weiyi  Shang,  Bram  Adams,  Ahmed  E.  Hassan   So2ware  Analysis  and  Intelligence  Lab  (SAIL)   School  of  CompuCng,  Queen’s  University  
  • 2. Mining  So<ware  Repositories:   Propaga@ng  code  changes   2   Method   A  is   changed   Method   A  calls   Method   B   Method   C  calls   Method   A   Change   methods   B  and  C   Method   A  is   changed   When  method  A  is   changed,  90%  of  the   Cme  method  D  is   changed.     Change   method   D   Not  Enough   History   helps!  
  • 3. Tradi@onal  pipeline  for  MSR  studies   So<ware   repositories   Data  prepara@on  (ETL)   Extrac@on   Transforma@on   Loading   Data   Warehouse   Data  Analysis   3   Source   code   history   Bug   database   Mailing   list   System   log   Con@nues   to  grow   More  complex   algorithms   MSR  studies  must  scale  
  • 4. Exis@ng  solu@ons  to  scale   powerful  machines   ad  hoc  distributed  compuCng   mulC-­‐threaded  and  mulC-­‐core   EXPENSIVE   LARGE   PROGRAMMING  EFFORT   NOT  RE-­‐USABLE   4  
  • 5. Example:  D-­‐CCFinder  Clone  Detector   40  days  on  1  pc  machine   52  hours  on  80-­‐ machines  cluster   5  
  • 6. Web  Analysis  is  similar  to  MSR   studies Large-­‐scale  data   Scan-­‐centric   Rapidly  evolving   6  
  • 7. Web-­‐scale  plaSorms   7   We  believe  that  the  MSR  field  can  benefit   from  web-­‐scale  plaSorms  to  overcome   the  limita@ons  of  current  approaches.    
  • 8. In  our  previous  research   8   Hadoop  is  up  to  3  Cmes  faster   on  a  4-­‐machine  cluster   Feasibility  study  using  Hadoop  to  scale  a   so2ware  evoluCon  study  on  Eclipse.    
  • 9. In  this  paper   9     1.  Does  MapReduce  scale  to   other  MSR  studies  and  larger   clusters?   2.  What  are  the  challenges  and   experiences  of  scaling  MSR   studies?  
  • 10. Reduce  Map   An  example  of  MapReduce   Data good hello fish cat school night happy dog ValueKey dog3 cat3 fish4 good4 hello5 night5 happy5 school6 ValueKey 23 24 35 16 Coun@ng  the  frequency  of  word  lengths   10   Key 4 5 4 3 6 5 5 3
  • 11. Three  large-­‐scale  MSR  studies •  So<ware  evolu@on  study   – J-­‐REX:  code-­‐change  informaCon  abstractor  for   Java  from  line  level  to  program  enCty  level   •  Code  clone  detec@on   – CC-­‐Finder:  code  clone  detecCon  tool   •  Log  analysis   – JACK:  log  analysis  tool  for  detecCng  system   anomalies  during  load  tesCng   11  
  • 12. Experimental  environment   CPU  type #machines   Memory   size Opera@ng   system Intel  Quad   Core  Q6600   (2.40  GHz) 18 3GB Ubuntu  8.04 8  Xeon  (3.0   GHZ) 10 8GB CentOS  5.2 12  
  • 13. Input  data Data  Size Data  type #Files Eclipse   Datatools 10.4  GB   227  MB CVS  repository   CVS  repository   189,156   10,629 FreeBSD 5.1  GB source  code 317,740 Log  files  No.1   Log  files  No.2 9.9  GB   2.1  GB execuCon  log   execuCon  log 54   54 13  
  • 14. 1.  Does  MapReduce  scale  to  other   MSR  studies  and  larger  clusters?     14  
  • 15. 98   580   0   100   200   300   400   500   600   700   SHARCNET(×10)   1  machine   min 80   755   0   100   200   300   400   500   600   700   800   SHARCNET(×10)   1  machine   So<ware  Evolu@on  &  Log  analysis   J-­‐REX     JACK   ×9     ×6     min 15  
  • 16. Code  clone  detec@on Can  MapReduce  scale  up  CCFinder  ?   Yes!   58  hours  on  an  18-­‐machine   cluster.   16  
  • 17. 2.  What  are  the  challenges  and   experiences  of  scaling  MSR  studies?   17  
  • 18. Challenge  1:  Locality  of  MSR  analysis   18   Local   analysis   Semi-­‐local   analysis   Global   analysis   Web   MSR   MSR   MSR  
  • 19. Challenge  2:  Granularity  of  MSR  analysis   19   Fine-­‐grained   analysis   Coarse-­‐grained   analysis   •  Web  community  experience:   – #Map:  10  ~  100  ×  #   machines   – #Reduce:  0.95  or  1.75  ×   #CPU  cores       •  MSR  experience:   – #Reduce  tasks=  #CPU  cores   (fine-­‐grained  analysis)   – #Reduce  task=  #input   records  (coarse-­‐grained   analysis)     Web   MSR   MSR  
  • 20. Challenges  of  migra@ng  MSR  studies  to   MapReduce   1.  Locality  of  MSR  analysis   2.  Granularity  of  MSR  analysis   3.  Loca@ng  a  suitable  cluster   4.  Managing  data  during  analysis   5.  Recovering  from  errors       20