The document discusses performing large-scale software engineering studies. It outlines how empirical research is currently done, including issues with small sample sizes, lack of experiment replication, and unavailable tools and data. The document then proposes a platform for software engineering research to address these issues. The platform would provide pre-processed data in standard formats, shared tools and results, and large-scale processing capabilities to enable more rigorous empirical studies.
17. Researcher's view
Time
1. Go to project.org
2. Download SVN, Mail, Bug; ask for IRC logs
3. .....?
4. Publish research
18. Researcher's view / (some) Project's view
Time
Researcher's view:
1. Go to project.org
2. Download SVN, Mail, Bug; ask for IRC logs
3. .....?
4. Publish research
(some) Project's view:
1. Hm, a new visitor
2. Hey, she is mirroring our bugzilla
3. .....?
4. Ban her!
28. In our research
1. We examined the current situation
2. We propose a platform for large-scale research
3. We validated its design with 2 case studies
29. Empirical Study = Model + Data + Metrics/Tools + Analysis Methods + Results analysis
33. Research methods
[Bar chart: Number of Works (0-40) per research method used: EXP, FCS, ECS, CCS, SUR]
34. What sources of data are in use?
[Bar chart: Number of Works (0-45) per data source: BTS, SRC, SF, ECT, SCM]
35. What is the examined data size (in number of projects)?
[Histogram: Number of Works (0-30) per sample size: 1, 2, 3, 4, 5, 6, 8, 10, 25, 50 projects]
36. Findings
• Sample sizes are very small
  • How can we extract generic results?
• No experiment replication
  • Do we believe each other's work or just ignore it?
• We did not check the stats...
37. Only 20% of the tools and data reported in ICSE papers could be retrieved a year after publication
41. A software engineering research platform
• Ready-made tools
• Formalised data formats
• Easily extensible
• Pre-processed data
• Large-scale processing
• Researcher community
49. Data
• Raw data: Mailing Lists, BTS, SCM
• Processed raw data
• Metadata
• Tool results
50. Mirroring Root
/
├── Project 1/
├── Project 2/
└── Project 3/
    ├── project.properties
    ├── git/    (standard GIT format)
    ├── svn/    (standard SVN format)
    ├── mails/
    │   ├── List 1/
    │   └── List 2/
    │       └── tmp/ cur/ new/   (each holding messageid.eml files)
    └── bugs/
        └── bug<id>.xml
55. Tools
[Architecture diagram: Alitheia Core services (Job Scheduler, Metadata Updater, Web Services, Cluster Service, Logging Service, Metric Plug-ins, Plug-in Activator, DB Service, Messaging, Web Admin, Plug-in Admin, Parser Service, Data Access, Security) running on SQO-OSS nodes, on top of the project mirror (svn, mails with List 1/List 2 maildirs in tmp/cur/new, bugs) and the metadata storage]
58. Tools
@MetricDeclarations(metrics = {
    @MetricDecl(mnemonic="MNOF", activators={ProjectDirectory.class},
        descr="Number of Source Code Files in Module"),
    @MetricDecl(mnemonic="MNOL", activators={ProjectDirectory.class},
        descr="Number of lines in module", dependencies={"Wc.loc"}),
    @MetricDecl(mnemonic="AMS", activators={ProjectVersion.class},
        descr="Average Module Size"),
    @MetricDecl(mnemonic="ISSRCMOD", activators={ProjectDirectory.class},
        descr="Mark for modules containing source files")
})
public class ModuleMetricsImplementation extends AbstractMetric {
    public void run(ProjectFile pf) throws AlreadyProcessingException {[...]}
    public void run(ProjectVersion pv) throws AlreadyProcessingException {[...]}
    public List<Result> getResult(ProjectFile pf, Metric m) {
        return getResult(pf, ProjectFileMeasurement.class, m, Result.ResultType.INTEGER);
    }
    public List<Result> getResult(ProjectVersion pv, Metric m) {
        return getResult(pv, ProjectVersionMeasurement.class, m, Result.ResultType.FLOAT);
    }
}
59. Tools
public void run(ProjectFile pf) {
    // We do not support directories
    if (pf.getIsDirectory()) {
        return;
    }
    // Create an input stream from the project file's content
    InputStream in = fds.getFileContents(pf);
    if (in == null) {
        return;
    }
    try {
        // Measure the number of lines in the project file
        LineNumberReader lnr =
            new LineNumberReader(new InputStreamReader(in));
        int lines = 0;
        while (lnr.readLine() != null) {
            lines++;
        }
        lnr.close();
        // Store the results
        Metric metric = Metric.getMetricByMnemonic("LOC");
        ProjectFileMeasurement locm = new ProjectFileMeasurement();
        locm.setMetric(metric);
        locm.setProjectFile(pf);
        locm.setWhenRun(new Timestamp(System.currentTimeMillis()));
        locm.setResult(String.valueOf(lines));
        db.addRecord(locm);
        markEvaluation(metric, pf.getProjectVersion().getProject());
    } catch (IOException e) {
        log.error(this.getClass().getName() + " IO Error <" + e
            + "> while measuring: " + pf.getFileName());
    }
}
67. How do we identify intense discussions?
[Two histograms: occurrences per number of messages per thread (0-25), and occurrences per level of thread depth (0-25)]
68. Hypotheses
• H1: Number of messages and thread depth are dependent variables
• H2: We can identify intense discussions by identifying threads in the top depth and messages-per-thread quartiles
• H3: Intense discussions affect the repository's source line intake
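H2 boils down to a simple quartile rule. As a hedged sketch (not the actual plug-in code; the class and method names, and the nearest-rank percentile choice, are assumptions), the rule can be expressed in a few lines of Java:

```java
import java.util.Arrays;

// Illustrative sketch of the H2 rule: a thread counts as an "intense
// discussion" when both its message count and its depth fall in the
// top quartile of their respective distributions.
class IntenseThreads {
    // 75th-percentile threshold (nearest-rank method) of a sample.
    static int topQuartile(int[] values) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(0.75 * sorted.length) - 1;
        return sorted[rank];
    }

    static boolean isIntense(int messages, int depth,
                             int msgThreshold, int depthThreshold) {
        return messages >= msgThreshold && depth >= depthThreshold;
    }

    public static void main(String[] args) {
        // Toy data: (messages, depth) per thread; the real input would
        // come from the mailing-list metadata stored in Alitheia Core.
        int[] msgs  = {2, 3, 5, 40, 4, 35, 6, 3};
        int[] depth = {1, 2, 3, 12, 2, 10, 4, 2};
        int mq = topQuartile(msgs);
        int dq = topQuartile(depth);
        for (int i = 0; i < msgs.length; i++) {
            if (isIntense(msgs[i], depth[i], mq, dq)) {
                System.out.println("thread " + i + " is intense");
            }
        }
    }
}
```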
69. Method
• Import projects in Alitheia Core
• Develop metric plug-ins to count the variables we are interested in (3 metrics)
• Emails from 60 projects: ~1.2 × 10^6 emails, 679,427 threads
• Plug-in size: 270 lines of code
70. [Scatter plot: Number of Messages (0-100) vs. Thread depth (0-50), with fit function 0.55 + 1.59x]
H1: R^2 = 0.70
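The fitted line on this slide is an ordinary least-squares regression of message count on thread depth. A minimal sketch of how such coefficients and R^2 are computed (toy data below; the slide's 0.55 + 1.59x and R^2 = 0.70 come from the real 679,427-thread data set, and whether the study used exactly this procedure is an assumption):

```java
// Ordinary least squares for y ~ a + b*x, plus the coefficient of
// determination R^2, computed from the closed-form normal equations.
class DepthFit {
    // Returns {intercept, slope, r2}.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double a = (sy - b * sx) / n;
        // R^2 = 1 - SS_res / SS_tot
        double ssRes = 0, ssTot = 0, mean = sy / n;
        for (int i = 0; i < n; i++) {
            double e = y[i] - (a + b * x[i]);
            ssRes += e * e;
            ssTot += (y[i] - mean) * (y[i] - mean);
        }
        return new double[]{a, b, 1 - ssRes / ssTot};
    }

    public static void main(String[] args) {
        // Toy (depth, messages) pairs, not the study's data.
        double[] depth = {1, 2, 3, 5, 8, 13};
        double[] msgs  = {2, 4, 5, 9, 13, 22};
        double[] f = fit(depth, msgs);
        System.out.printf("y = %.2f + %.2fx, R^2 = %.2f%n", f[0], f[1], f[2]);
    }
}
```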
76. Hypotheses
• H1: Number of programmers affects code maintainability at the project level
• H2: Number of programmers affects code maintainability at the directory level
77. Method
• Per language
  • C & Java (risky)
• Plug-ins that calculate:
  • Number of developers per period of time
  • Halstead & McCabe (700 lines)
  • Oman's Maintainability Index (240 lines)
• Data from 213 projects
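For reference, the classic Oman/Hagemeister Maintainability Index combines Halstead volume, cyclomatic complexity, and module size. Whether the plug-in mentioned above implements exactly this 3-metric variant is an assumption; the formula itself is the standard one:

```java
// The classic 3-metric Maintainability Index (one common variant):
//   MI = 171 - 5.2*ln(V) - 0.23*G - 16.2*ln(LOC)
// where V is mean Halstead volume, G mean cyclomatic complexity, and
// LOC mean lines of code per module. Higher MI means more maintainable.
class MaintainabilityIndex {
    static double mi(double v, double g, double loc) {
        return 171 - 5.2 * Math.log(v) - 0.23 * g - 16.2 * Math.log(loc);
    }

    public static void main(String[] args) {
        // Toy module profile, not data from the study.
        System.out.printf("MI = %.1f%n", mi(1000.0, 10.0, 200.0));
    }
}
```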
Software engineering is an empirical science, as it tries to explain phenomena that occur in software development using data that result from it.
This statement is not mine; it belongs to Vic Basili and the people who said or implied it before him, worked with him, or followed his recommendation and started a new and exciting research area, MSR. Being an empirical science, software engineering has to follow the steps of the scientific method.
As required by the scientific method, we observe the behaviour of the data...
We formulate hypotheses...
And we build models...
We validate or invalidate those hypotheses by running them against data from the real world. As an empirical science, we are in constant need of data.
While in other empirical fields it is difficult and expensive to get empirical data (consider medicine, for example), in software engineering we have the OSS movement, which produces vast quantities of it. So the question that comes to mind is: "can we use all that free data to do research with?"
This may sound like a rhetorical question, but the data shows that it is not.
In a systematic review we conducted, most software engineering papers (even those in good journals and conferences) validate their hypotheses with data from just a couple of projects.
Trying to explain why, let's look at some project numbers first. The largest project repository that one can work with is KDE, at ~50 GB of data. In a talk, Audris Mockus said that he had collected more than 1 TB of data.
If we compare that to other empirical sciences...
The other great problem is that of original data disparity. Most research so far was done with CVS and Bugzilla; in reality, there are many tools that store process data.
Then there is the problem of non-cooperating projects, which is understandable: the projects run and maintain the infrastructures, and researchers almost never give back to the community.
So let's see how research is done in more mature disciplines.
They have large data sets that are pre-processed: most of the time researchers do not have to download new data and massage it before conducting experiments. The Flossmole project has shown the way.
Researchers share their results in the form of workshops and conferences, competitions and, more significantly, the tools that produce them.
They also do not take everything published for granted: they replicate the studies and the findings. Medicine is the example here.
And most important of all, most other empirical research disciplines have research platforms: shared tools that let researchers conduct research using standardised input and output formats on standardised datasets.
But most of them are already using machines like this (and getting funding) to do their experiments on shared research infrastructures.
Or requesting huge amounts of funding to build machines like this.
By reading more than 300 papers and conducting a systematic review of more than 70 randomly selected papers...
Our results are reinforced by a similar study that Carlo Ghezzi did for his ICSE 2009 keynote.
When we say "platform", several requirements appear in front of us.
All those reasons led our research group to the SQO-OSS project, which produced the Alitheia Core tool. The project's aim was to produce software quality analysis tools, but the original targets strayed towards creating infrastructure rather than the tools themselves.
This is the interface each metric plug-in must satisfy.
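A hedged sketch of what such a contract might look like (the interface, type, and method names below are illustrative, not the exact Alitheia Core API): a plug-in declares what it measures, runs on an activator object, and hands back stored results on demand.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative plug-in contract, parameterised on the activator type
// (file, version, mail thread, ...).
interface MetricPlugin<T> {
    String mnemonic();                    // short unique metric name, e.g. "LOC"
    void run(T activator);                // measure one artifact and store the result
    List<Integer> getResult(T activator); // fetch previously stored results
}

// Toy implementation: counts the lines of an in-memory "file" (a String).
class LineCount implements MetricPlugin<String> {
    private final List<Integer> store = new ArrayList<>();
    public String mnemonic() { return "LOC"; }
    public void run(String file) { store.add(file.split("\n", -1).length); }
    public List<Integer> getResult(String file) { return store; }
}

class PluginSketch {
    public static void main(String[] args) {
        LineCount lc = new LineCount();
        lc.run("a\nb\nc");
        System.out.println(lc.mnemonic() + " = " + lc.getResult("a\nb\nc"));
        // prints "LOC = [3]"
    }
}
```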
This is what a metric plug-in looks like.
This is the implementation of the line-counting plug-in I've presented earlier, minus some bureaucracy (constructors, imports etc.): just about 20 LOC to retrieve a file from a revision, read its lines, and store the results in a database. This is comparable to a shell or Python script, but much faster. The abstractions Alitheia Core provides are very high level and cross-platform, which is why the overhead it adds to algorithm implementation is minimal.
This is our cluster: on the left, a 3-thread processing node plus the project server; in the middle, a 6-thread processing core; at the bottom, mirroring and storage of raw data, the file server, the database, and 8 processing threads to keep the CPUs busy during I/O; on the right, the web server plus 16 slow processing threads. To scale, we just need more processing nodes.
Another example of the cluster on its knees, and a display of linear scalability: the screen of the database server running just the database while other nodes start connecting. Queries per second increase almost linearly and the load equals the number of processor cores: the machine is saturated.