Creating and Analyzing Source Code Repository Models - A Model-based Approach to Mining Software Repositories

Model-based Analysis of Source Code Repositories
Markus Scheidgen, Martin Schmidt, Joachim Fischer
{scheidge,schmidma,fischer}@informatik.hu-berlin.de
Humboldt Universität zu Berlin

Agenda
▶ Software Evolution, Reverse Engineering, and Mining Software Repositories
▶ Model-based Mining of Software Repositories
▶ srcrepo: a framework for model-based analysis of software repositories
▶ Experiments with Eclipse’s software repositories
▶ Conclusions

Software Evolution
[Figure: software engineering (user, problem, requirements; models M, code {C}, system S), software maintenance over the repository revisions R0 … RHEAD, and software modernization through reverse engineering, transformation, and forward engineering.]
▶ Software maintenance
■ quality assessment
■ implicit dependencies
▶ Mining software repositories [1]
▶ Software modernization: OMG’s ADM (Architecture-Driven Modernization)
■ AST Meta-Model (ASTM)
■ Knowledge Discovery Meta-Model (KDM)
■ Software Metrics Meta-Model (SMM)
1. H. Kagdi, M.L. Collard, J.I. Maletic: A survey and taxonomy of approaches for mining software repositories in the context of software evolution; Journal of Software Maintenance and Evolution: Research and Practice; Vol. 19/Nr. 2/2007

Mining Software Repositories – In General
▶ The term mining software repositories (MSR) has been coined to describe a broad class of investigations into the examination of software repositories.
▶ The premise of MSR is that empirical and systematic investigations of repositories will shed new light on the process of software evolution. [1]
▶ Different scopes, e.g. single software projects vs. many software projects
▶ Different goals, e.g. quality assessments and implicit dependencies vs. generalizations about software evolution
1. H. Kagdi, M.L. Collard, J.I. Maletic: A survey and taxonomy of approaches for mining software repositories in the context of software evolution; Journal of Software Maintenance and Evolution: Research and Practice; Vol. 19/Nr. 2/2007

Model-based Mining of Software Repositories
[Figure: reverse engineering relates a system (S) and its source code ({C}) to models (M); applied along the repository history R0 … RHEAD, each code state {C0} … {Cn-1}, {Cn} gets its own model.]
1. E.J. Chikofsky, J.H. Cross: Reverse engineering and design recovery: A taxonomy; IEEE Software; Vol. 7/Nr. 1/1990
2. R. Lincke, J. Lundberg, W. Löwe: Comparing Software Metrics Tools; 8th International Symp. on Software Testing and Analysis; 2008

▶ Scope
■ depends on the concrete MSR application and its goals
■ number of software projects: single repositories, large repositories, ultra-large repositories
■ sources as text and text-based metrics, e.g. LOC
■ declarations only: packages, classes, methods, but no statements, expressions, etc.
■ full AST with or without cross-references


▶ MSR tools are already “model-based”, but in a proprietary manner
▶ Idea: use an existing reverse engineering framework with corresponding standard meta-models and modeling frameworks instead of proprietary solutions
▶ Goals
■ deal with heterogeneity (different version control systems, different languages)
■ reuse existing meta-models, transformations, and languages
■ interoperate with existing analysis tools
■ retain meaningful scalability

Model-based MSR Strategies
[Figure: a version control system stores an initial state A1, B1 and the changes A1-A2 and B1-B3. Checkout(r) produces one snapshot per revision r (A1 B1; A2 B1; A2 B3). Parse(d) is applied to each compilation unit d ∈ CUs(r), giving the models M1, M2, M3, and Analysis(r) runs on each model.]
Total effort of three strategies, summed over the revisions R and the compilation units CUs:
■ $\sum_{R} \left( \mathit{Checkout} + \sum_{CUs} \mathit{Parse} + \mathit{Analysis} \right)$
■ $\sum_{R} \left( \mathit{Checkout} + \sum_{CUs} (\mathit{Parse} + \mathit{Merge}) + \mathit{Analysis} \right)$
■ $\sum_{R} \left( \sum_{CUs} (\mathit{Load} + \mathit{Merge}) + \mathit{Analysis} \right)$

[Figure, next build: Merge(r) is added. Parse(d) for d ∈ CUs(r) yields models of individual compilation units, and Merge(r) combines them into the revision models M1, M2, M3; the figure marks a reference from f to B.f across compilation units.]

[Figure, final builds: Save(r) and Load(r) are added, so that parsed models are persisted and loaded again for later analyses; Analysis′(r) is added as an analysis that works on the loaded compilation units.]
Total effort of the fourth strategy:
■ $\sum_{R} \sum_{CUs} \left( \mathit{Load} + \mathit{Analysis}' \right)$
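The difference between re-parsing every snapshot and parsing only the changed compilation units can be illustrated with a small sketch. It is written in Python rather than with srcrepo, and it only mimics the toy history from the figure (A1, B1, then A1-A2, then B1-B3); the `parse` stand-in, the dictionary "models", and both function names are assumptions made for illustration.

```python
# Toy history taken from the figure: each revision lists the compilation
# units (CUs) it changes. All names and structures here are illustrative.
history = [
    {"A": "A1", "B": "B1"},  # initial revision
    {"A": "A2"},             # change A1-A2
    {"B": "B3"},             # change B1-B3
]

def parse(content):
    """Stand-in for a real parser: wraps the content in a 'model' dict."""
    return {"model_of": content}

def snapshot_strategy(history):
    """Check out every revision in full and parse all of its CUs again."""
    snapshot, models, parses = {}, [], 0
    for changes in history:
        snapshot = {**snapshot, **changes}       # Checkout(r)
        model = {}
        for cu, content in snapshot.items():     # Parse(d) for every CU of the snapshot
            model[cu] = parse(content)
            parses += 1
        models.append(model)
    return models, parses

def incremental_strategy(history):
    """Parse only the CUs changed in a revision and merge them into the previous model."""
    model, models, parses = {}, [], 0
    for changes in history:
        delta = {}
        for cu, content in changes.items():      # Parse(d) for d in CUs(r)
            delta[cu] = parse(content)
            parses += 1
        model = {**model, **delta}               # Merge(r)
        models.append(model)
    return models, parses

snap_models, snap_parses = snapshot_strategy(history)
incr_models, incr_parses = incremental_strategy(history)
print(snap_parses, incr_parses)  # 6 4
```

Both strategies end up with the same three revision models; the incremental one needs four instead of six parse calls on this toy history, and the gap grows with the number of unchanged compilation units per revision.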

Research Questions
▶ Assumptions
■ Developing MSR applications based on models, transformation languages, and standardized meta-models is favorable
■ Some MSR applications need to analyze source code on a deep (AST) level
■ MSR analysis is performed iteratively
▶ Hypotheses
■ Models of source code repositories can be created and persisted
■ Traversing existing persistent models of source code repositories is much faster than traversing transient models that are created from the version control system on the fly

srcrepo – A Framework for Model-based MSR
▶ Eclipse’s MoDisco as reverse engineering framework
■ reverse engineering for Java, based on EMF
■ support for many JRE-based languages: Java, Xtext, JSP, XML
■ creates instances of a Java EMF meta-model that corresponds to the handwritten JDT AST model
■ provides transformations to language-independent artifacts, e.g. KDM
▶ EMF-Fragments [1] to store very large models
■ uses NoSQL databases and stores larger model fragments within database entries
■ in contrast to object-by-object stores such as the ORM-based CDO or the NoSQL-based Morsa or Neo4j
▶ Xtend programming with higher-order functions to mimic OCL-style definition of software metrics [2]
1. M. Scheidgen, A. Zubow, J. Fischer, T.H. Kolbe: Automated and Transparent Model Fragmentation for Persisting Large Models; ACM/IEEE 15th International Conference on Model Driven Engineering Languages & Systems (MODELS); Innsbruck; 2012
2. M. Scheidgen, J. Fischer: Model-based Mining of Software Repositories; 8th Systems and Modeling Conference, Valencia, Spain, September 29th, 2014
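The contrast between fragment-wise and object-by-object storage can be pictured with a minimal sketch. It is plain Python with a dictionary standing in for the key-value database, not the EMF-Fragments API; the `fragment_root` marker, the toy model, and both function names are assumptions for illustration only.

```python
import json

# A tiny containment tree: a package with two compilation units.
# The "fragment_root" flag is a hypothetical marker for where a fragment starts.
model = {
    "id": "pkg", "children": [
        {"id": "A.java", "fragment_root": True, "children": [
            {"id": "class A", "children": [{"id": "method a()", "children": []}]}]},
        {"id": "B.java", "fragment_root": True, "children": [
            {"id": "class B", "children": [{"id": "method f()", "children": []}]}]},
    ],
}

def store_object_by_object(node, store):
    """One store entry per model object; children are referenced by key."""
    store[node["id"]] = json.dumps({"children": [c["id"] for c in node["children"]]})
    for child in node["children"]:
        store_object_by_object(child, store)

def store_fragments(node, store):
    """One store entry per fragment; a fragment is cut where a child is a fragment root."""
    def cut(n):
        inline, refs = [], []
        for c in n["children"]:
            if c.get("fragment_root"):
                refs.append(c["id"])          # keep only a reference ...
                store_fragments(c, store)     # ... and store the child as its own entry
            else:
                inline.append(cut(c))         # keep the child inside this fragment
        return {"id": n["id"], "children": inline, "fragment_refs": refs}
    store[node["id"]] = json.dumps(cut(node))

object_store, fragment_store = {}, {}
store_object_by_object(model, object_store)
store_fragments(model, fragment_store)
print(len(object_store), len(fragment_store))  # 7 3
```

Loading one compilation unit then costs a single database read in the fragment store, whereas the object-by-object store needs one read per contained object.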

“OCL” to Calculate Metrics of AST-Models
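// Weighted number of methods per class (WMC).
// `weight` maps a method body to its weight; methods without a body count as 1.
// `sum` is assumed to be a helper extension that adds up the lambda results.
def wmc(AbstractTypeDeclaration type, (Block)=>int weight) {
  type.bodyDeclarations.filter(AbstractMethodDeclaration)
    .sum[if (body != null) weight.apply(body) else 1]
}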
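For readers without an Xtend setup, the same idea can be sketched in Python: a WMC function that takes the weight as a higher-order function, here combined with Halstead length (total operator plus total operand occurrences) as one possible weight. The `Block`, `Method`, and `TypeDeclaration` classes are simplified stand-ins, not the MoDisco meta-model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Block:
    operators: List[str] = field(default_factory=list)
    operands: List[str] = field(default_factory=list)

@dataclass
class Method:
    name: str
    body: Optional[Block] = None   # None models a method without a body

@dataclass
class TypeDeclaration:
    name: str
    methods: List[Method] = field(default_factory=list)

def halstead_length(block: Block) -> int:
    """Halstead length N = N1 + N2: operator occurrences plus operand occurrences."""
    return len(block.operators) + len(block.operands)

def wmc(type_decl: TypeDeclaration, weight: Callable[[Block], int]) -> int:
    """Sum the weight of each method body; methods without a body count as 1."""
    return sum(weight(m.body) if m.body is not None else 1 for m in type_decl.methods)

example = TypeDeclaration("Example", [
    Method("add", Block(operators=["return", "+"], operands=["a", "b"])),
    Method("describe"),  # no body
])
print(wmc(example, halstead_length))    # 5
print(wmc(example, lambda block: 1))    # 2
```

Passing the weight as a function keeps one traversal reusable for several metrics: a constant weight gives the plain method count, a Halstead-based weight gives the size-weighted variant.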

Experiments
▶ Eclipse Foundation sources, i.e. the Eclipse platform and plug-ins (a large-scale software repository)
▶ Organized in a couple hundred different projects: jdt, cdt, emf, ...
▶ Available via GitHub
▶ Git repositories can be gathered automatically via GitHub’s RESTful API
▶ 200 largest Eclipse repositories that actually contained Java code: 6.6 GB Git, 400 MLOC, 250 GB model with 4 billion objects.
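Gathering the repositories could look roughly like the following Python sketch. It uses GitHub's public "list organization repositories" endpoint and the `clone_url` and `language` fields of its response; selecting repositories by their reported primary language is an assumption of this sketch and not necessarily how the repositories for the experiment were chosen, and the organization name passed to `all_java_clone_urls` is left to the caller.

```python
import json
import urllib.request

API = "https://api.github.com/orgs/{org}/repos?per_page=100&page={page}"

def java_clone_urls(repos):
    """Keep repositories whose primary language is reported as Java."""
    return [r["clone_url"] for r in repos if r.get("language") == "Java"]

def fetch_page(org, page):
    """Fetch one page of an organization's repository list (needs network access)."""
    request = urllib.request.Request(
        API.format(org=org, page=page),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def all_java_clone_urls(org):
    """Walk the paginated listing until an empty page is returned."""
    urls, page = [], 1
    while True:
        repos = fetch_page(org, page)
        if not repos:
            return urls
        urls.extend(java_clone_urls(repos))
        page += 1

# Offline example payload with only the fields used above.
sample = [
    {"name": "x", "clone_url": "https://example.org/x.git", "language": "Java"},
    {"name": "y", "clone_url": "https://example.org/y.git", "language": "C"},
]
print(java_clone_urls(sample))
```

The returned clone URLs can then be handed to `git clone`; unauthenticated requests are rate-limited, so a larger crawl would need an access token.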

Example Plot: Halstead-length for each Revision; Eclipse CDT
[Plot: WMC with Halstead length (×10^6, axis 0 to 10) for each revision over time (years, 2004 to 2016).]

Model Creation vs. Analysis Times
[Figure, left: total time in hours (axis 0 to 14) per repository (jdt.ui, xtext, eclipselink, jdt.core, swt, cdt, ocl, ptp, org.aspectj, cdo), split into checkout, parse, save, load, merge/increment, and udf.]
[Figure, right: average time per revision in ms (axis 0 to 800) for checkout, parse, save, load, merge/increment, and udf.]

Disk Space
[Figure: Git repository size vs. model size in GB (axis 0 to 15) for jdt.core, cdt, jdt.ui, cdo, f.emfstore.core, emf, jdt.debug, emf.compare, emf.texo, .diffmerge.core.]

Delta-Compression [1]
[Figure: effects of delta-compression for cdt, webtools, gmf-tooling, rap, and jdt.core. Panels:
■ Named element matching, in GB (axis 0 to 14): uncompressed, initial revisions, delta-models
■ Meta-class matching, in GB (axis 0 to 14): uncompressed, initial revisions, delta-models
■ Line matching, in MLines (axis 0 to 100): uncompressed, initial revisions, delta-lines
■ Compressed relative to full size, in % (axis 0 to 80): initial revisions, delta models for meta-class, delta lines, delta models for named elements
■ Avg. execution times, avg. time per revision in ms, on a linear (0 to 1500) and a logarithmic axis: parse, compress named elements, compress meta-class, decompr. meta-class, decompr. named elements]
1. M. Scheidgen: Evaluation of Model Comparison for Delta-Compression in Model Persistence; BigMDE 2016 (at STAF 2016), Vienna, Austria, July 6-7, 2016
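The principle behind storing an initial revision plus delta-models can be shown with a small Python sketch that matches elements by qualified name. This is not the model comparison evaluated in [1]; the nested dictionaries standing in for models and the functions `flatten`, `diff`, and `patch` are assumptions for illustration.

```python
def flatten(model, prefix=""):
    """Map each named element to its value, keyed by qualified name."""
    flat = {}
    for name, value in model.items():
        qualified = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten(value, qualified))
        else:
            flat[qualified] = value
    return flat

def diff(old, new):
    """Delta between two flattened revisions, matched by qualified name."""
    changed = {k: v for k, v in new.items() if old.get(k, object()) != v}
    removed = [k for k in old if k not in new]
    return {"set": changed, "del": removed}

def patch(old, delta):
    """Rebuild a revision from its predecessor and a delta."""
    new = {k: v for k, v in old.items() if k not in delta["del"]}
    new.update(delta["set"])
    return new

# Three toy revisions of a model with classes A and B.
r1 = {"B": {"f": "body v1", "g": "body v1"}, "A": {"a": "calls B.f"}}
r2 = {"B": {"f": "body v2", "g": "body v1"}, "A": {"a": "calls B.f"}}
r3 = {"B": {"f": "body v2"}, "A": {"a": "calls B.f", "b": "new"}}

revisions = [flatten(r) for r in (r1, r2, r3)]
initial = revisions[0]                                          # stored in full
deltas = [diff(a, b) for a, b in zip(revisions, revisions[1:])]  # stored instead of r2, r3

restored = initial
for delta in deltas:                                            # decompression
    restored = patch(restored, delta)
print(restored == revisions[-1])  # True
```

Only the first revision is kept in full; every later revision shrinks to the elements that changed, at the price of replaying the deltas when a revision is loaded.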

Model Creation vs. Analysis with Delta-Compression
[Figure: two stacked bar charts, without compression and with compression, of total time in hours (axis 0 to 14) per repository (jdt.ui, xtext, eclipselink, jdt.core, swt, cdt, ocl, ptp, org.aspectj, cdo), split into checkout, parse, save, load, merge/increment, and udf, plus compress where compression is used.]

Conclusions
▶ MSR can support software evolution and helps to understand it
▶ Traversing a source code repository to gather information (MSR) is very time-consuming, especially with iterative analysis
▶ Most of this time can be saved by persisting the data in its model state, at the cost of comparably large models that need to be stored
▶ The savings in MSR analysis execution time are considerable
