Model-based Analysis of
Source Code Repositories
Markus Scheidgen
1
Martin Schmidt Joachim Fischer
{scheidge,schmidma,fischer@informatik.hu-berlin.de
Humboldt Universität zu Berlin
Agenda
▶ Software Evolution, Reverse Engineering, and Mining Software
Repositories
▶ Model-based Mining of Software Repositories
▶ srcrepo a framework for model-bases analysis of software
repositories
▶ Experiments with Eclipse’s software repositories
▶ Conclusions
2
software maintenance
RHEAD
…
R0
■ quality assessment
■ implicit dependencies
software engineering
Software Evolution
3
requirements
M
M {C}
S
user
problem
M
M {C}
S
user
mining software repositories1
software modernization
OMG’s ADM
(Achitecture-Driven Modernization)
‣AST Meta-Model (ASTM)
‣Knowledge Discovery Meta-
Model (KDM)
‣Software Metrics Meta-Model
(SMM)
M
SS
M
M
M
{C} {C}
reverseengineering
forwardengineering
transformation
1.H. Kagdi, M.L. Collard, J.I. Maletic: A survey and taxonomy of approaches for mining software repositories in the context of software
evolution; Journal of Software Maintenance and Evolution: Research and Practice; Vol.19/Nr.2/2007
Mining Software Repositories – In General
▶ The term mining software repositories (MSR) has been coined
to describe a broad class of investigations into the examination
of software repositories.
▶ The premise of MSR is that empirical and systematic
investigations of repositories will shed new light on the process
of software evolution. [1]
▶ Different scopes, e.g. single software projects vs. many
software projects
▶ Different goals, e.g. quality assessments and implicit
dependencies vs. generalizations about software evolution
4
1.H. Kagdi, M.L. Collard, J.I. Maletic: A survey and taxonomy of approaches for mining software repositories in the context of software
evolution; Journal of Software Maintenance and Evolution: Research and Practice; Vol.19/Nr.2/2007
Model-based Mining of Software Repositories
5
MS M{C} reverse engineering
1.E.J. Chikofsky, J.H. Cross: Reverse engineering and design recovery: A taxonomy; IEEE Software; Vol.7/Nr.1/1990
2.R. Lincke, J. Lundberg, W. Löwe: Comparing Software Metrics Tools; 8th International Symp. on Software Testing and Analysis; 2008
Model-based Mining of Software Repositories
5
MS M{C}
MM{Cn}
RHEAD
…
R
0
reverse engineering
1.E.J. Chikofsky, J.H. Cross: Reverse engineering and design recovery: A taxonomy; IEEE Software; Vol.7/Nr.1/1990
2.R. Lincke, J. Lundberg, W. Löwe: Comparing Software Metrics Tools; 8th International Symp. on Software Testing and Analysis; 2008
Model-based Mining of Software Repositories
5
MS M{C}
MM{Cn}
RHEAD
…
R
0
{Cn-1} MM
reverse engineering
1.E.J. Chikofsky, J.H. Cross: Reverse engineering and design recovery: A taxonomy; IEEE Software; Vol.7/Nr.1/1990
2.R. Lincke, J. Lundberg, W. Löwe: Comparing Software Metrics Tools; 8th International Symp. on Software Testing and Analysis; 2008
Model-based Mining of Software Repositories
5
MS M{C}
MM{Cn}
RHEAD
…
R
0
{Cn-1} MM
{C0}
…
MM
…
…
…
reverse engineering
1.E.J. Chikofsky, J.H. Cross: Reverse engineering and design recovery: A taxonomy; IEEE Software; Vol.7/Nr.1/1990
2.R. Lincke, J. Lundberg, W. Löwe: Comparing Software Metrics Tools; 8th International Symp. on Software Testing and Analysis; 2008
Model-based Mining of Software Repositories
5
MS M{C}
MM{Cn}
RHEAD
…
R
0
{Cn-1} MM
{C0}
…
MM
…
…
…
reverse engineering
1.E.J. Chikofsky, J.H. Cross: Reverse engineering and design recovery: A taxonomy; IEEE Software; Vol.7/Nr.1/1990
2.R. Lincke, J. Lundberg, W. Löwe: Comparing Software Metrics Tools; 8th International Symp. on Software Testing and Analysis; 2008
Model-based Mining of Software Repositories
▶ Scope
■ depends on concreter MSR-application and its goals
■ number of software projects: single repositories, large
repositories, ultra-large repositories
■ Sources as text and text based metrics, e.g. LOC
■ Declarations only: packages, classes, methods, but no
statements, expressions, etc.
■ Full AST with or without cross-references
6
Model-based Mining of Software Repositories
▶ Scope
■ depends on concreter MSR-application and its goals
■ number of software projects: single repositories, large
repositories, ultra-large repositories
■ Sources as text and text based metrics, e.g. LOC
■ Declarations only: packages, classes, methods, but no
statements, expressions, etc.
■ Full AST with or without cross-references
7
Model-based Mining of Software Repositories
▶ MSR tools are already “model-based”, but in a proprietary
manner
▶ Idea: existing reverse engineering framework and
corresponding standard meta-models and modeling
frameworks instead of proprietary solutions
▶ Goals
■ deal with heterogeneity (different version control systems,
different languages)
■ reuse of existing meta-models, transformations, and languages
■ interoperability with existing analysis tools
■ retaining meaningful scalability
8
Model-based MSR Strategies
9
versioncontrolsystem
A1-A2
A1 B1
B1-B3
Model-based MSR Strategies
9
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
Model-based MSR Strategies
9
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
A2 B3
Model-based MSR Strategies
9
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
A2 B3
Model-based MSR Strategies
9
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
X
d2CUs(r)
Parse(d)
snapshot
A2 B3
Model-based MSR Strategies
9
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
X
d2CUs(r)
Parse(d)
snapshot
snapshot
snapshot
A2 B3
Model-based MSR Strategies
9
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
X
d2CUs(r)
Parse(d)
snapshot
snapshot
snapshot
A2 B3
Model-based MSR Strategies
9
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
X
d2CUs(r)
Parse(d)
snapshot
M1
Analysis(r)
snapshot
snapshot
A2 B3
Model-based MSR Strategies
9
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
X
d2CUs(r)
Parse(d)
snapshot
M1
Analysis(r)
M2
snapshot
snapshot
A2 B3
Model-based MSR Strategies
9
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
X
d2CUs(r)
Parse(d)
snapshot
M1
Analysis(r)
M3
M2
snapshot
snapshot
A2 B3
Model-based MSR Strategies
9
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
X
d2CUs(r)
Parse(d)
snapshot
M1
Analysis(r)
M3
M2
X
R
Checkout +
X
CUs
Parse + Analysis
!
X
R
Checkout +
X
CUs
(Parse + Merge) + Analysis
!
X
R
X
CUs
(Load + Merge) + Analysis
!
X X
snapshot
snapshot
A2 B3
Model-based MSR Strategies
10
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
X
d2CUs(r)
Parse(d)
snapshot
M1
Analysis(r)
M3
M2
snapshot
snapshot
A2 B3
Model-based MSR Strategies
10
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
X
d2CUs(r)
Parse(d)
snapshot
M1
Analysis(r)
M3
M2
f
B.f
fB.f
Parse(d)
X
d2 CUs(r)
snapshot
snapshot
A2 B3
Model-based MSR Strategies
10
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
X
d2CUs(r)
Parse(d)
snapshot
M1
Analysis(r)
M3
M2
Merge(r)
f
B.f
fB.f
Parse(d)
X
d2 CUs(r)
snapshot
snapshot
A2 B3
Model-based MSR Strategies
10
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
X
d2CUs(r)
Parse(d)
snapshot
M1
Analysis(r)
M3
M2
Merge(r)
f
B.f
fB.f
Parse(d)
X
d2 CUs(r)
X
R
Checkout +
X
CUs
Parse + Analysis
!
X
R
Checkout +
X
CUs
(Parse + Merge) + Analysis
!
X X
!
snapshot
snapshot
A2 B3
Model-based MSR Strategies
11
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
X
d2CUs(r)
Parse(d)
snapshot
M1
Analysis(r)
M3
M2
Merge(r)
f
B.f
fB.f
Parse(d)
X
d2 CUs(r)
snapshot
snapshot
A2 B3
Model-based MSR Strategies
11
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
X
d2CUs(r)
Parse(d)
snapshot
M1
Analysis(r)
M3
M2
Merge(r)
f
B.f
fB.f
Parse(d)
X
d2 CUs(r)
Load(r)Save(r)
snapshot
snapshot
A2 B3
Model-based MSR Strategies
11
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
X
d2CUs(r)
Parse(d)
snapshot
M1
Analysis(r)
M3
M2
Merge(r)
f
B.f
fB.f
Parse(d)
X
d2 CUs(r)
Load(r)Save(r)
X
R
Checkout +
X
CUs
Parse + Analysis
!
X
R
Checkout +
X
CUs
(Parse + Merge) + Analysis
!
X
R
X
CUs
(Load + Merge) + Analysis
!
X
R
X
CUs
(Load + Analysis0
)
snapshot
snapshot
A2 B3
Model-based MSR Strategies
12
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
X
d2CUs(r)
Parse(d)
snapshot
M1
Analysis(r)
M3
M2
Merge(r)
f
B.f
fB.f
Parse(d)
X
d2 CUs(r)
Load(r)Save(r)
X
R
Checkout +
X
CUs
Parse + Analysis
!
X
R
Checkout +
X
CUs
(Parse + Merge) + Analysis
!
X
R
X
CUs
(Load + Merge) + Analysis
!
X
R
X
CUs
(Load + Analysis0
)
Analysis0
(r)
snapshot
snapshot
A2 B3
Model-based MSR Strategies
12
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
X
d2CUs(r)
Parse(d)
snapshot
M1
Analysis(r)
M3
M2
Merge(r)
f
B.f
fB.f
Parse(d)
X
d2 CUs(r)
Load(r)Save(r)
Analysis0
(r)
X
R
Checkout +
X
CUs
Parse + Analysis
!
X
R
Checkout +
X
CUs
(Parse + Merge) + Analysis
!
X
R
X
CUs
(Load + Merge) + Analysis
!
X
R
X
CUs
(Load + Analysis0
)
snapshot
snapshot
A2 B3
Model-based MSR Strategies
13
snapshot
A1 B1
Checkout(r)
versioncontrolsystem
A1-A2
A1 B1
B1-B3
snapshot
A2 B1
snapshot
X
d2CUs(r)
Parse(d)
snapshot
M1
Analysis(r)
M3
M2
Merge(r)
f
B.f
fB.f
Parse(d)
X
d2 CUs(r)
Load(r)Save(r)
Analysis0
(r)
Research Questions
▶ Assumptions
■ Development of MSR-applications based on models,
transformation languages and standardized meta-models is
favorable
■ Some MSR-applications need to analyze source code on a deep
(AST) level
■ MSR-analysis is performed iteratively
▶ Hypotheses
■ Models of source code repositories can be created and persisted
■ Traversing existing persistent models of source code repositories
is much faster than traversing transient models that are created
from version control system on the fly
14
srcrepo – A Framework for Model-based MSR
▶ Eclipse’s MoDisco as reverse engineering framework
■ reverse engineering for Java, based on EMF
■ Support for many JRE-ased languages: Java, xText, JSP, XML
■ creates instances of a Java EMF meta-model that corresponds to the handwritten
JDT AST-model
■ provides transformation to language independent artifacts, e.g. KDM
▶ EMF-Fragments1 to store very large-models
■ uses No-SQL databases and stores larger model fragments within database entries
■ in contrast to object-by-object stores such as ORM-based CDO or No-SQL-based
Morsa or Neo4J
▶ Xtend programming with higher order functions to mimic OCL-style
definition of software metrics2
15
1.M.Scheidgen, A.Zubow,J.Fischer,T.H.Kolbe: Automated and Transparent Model Fragmentation for Persisting Large Models; ACM/IEEE
15th International Conference on Model Driven Engineering Languages & Systems (MODELS); Innsbruck; 2012
2. M.Scheidgen, J.Fischer: Model-based Mining of Software Repositories; 8th Systems and Modeling Conference, Valencia, Spain,
September 29th, 2014
“OCL” to Calculate Metrics of AST-Models
16
// Weighted number of methods per class.
def wmc(AbstractTypeDeclaration type,(Block) int weight) {
type.bodyDeclarations.sum[if (body != null) weight.apply(it.body) else 1]
}
Experiments
▶ Eclipse Foundation sources, i.e. Eclipse platform and plug-
ins (large scale software repository)
▶ Organized in different (couple hundred) projects: jdt, cdt,
emf, ...
▶ Available via GITHub
▶ GIT repositories can be gathered automated via GITHub’s
REST-ful API
▶ 200 largest Eclipse repositories that actually contained Java
code: 6.6 GB Git, 400 MLOC, 250 GB model with 4 billion
objects.
17
Example Plot: Halstead-length for each Revision; Eclipse CDT
18
2004 2006 2008 2010 2012 2014 2016
0246810
time (years)
WMCwithHalsteadlength(x106
)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++
+++++
+++++ ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++
++++++++
+++++++++++++++++++++
++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
++++
++++
++++++++++++++++++ ++
++++++++++++++++++
++++++++++++++++++++ ++++ +
+++++++++++++ ++
+++++++++++++++++
+
++++++++ +
++++++++
++++++++
++++++++++ +
+
++++
++
+++
+ ++ + +++++ +++++++++++++++++++++++++++++++++++++++
Model Create v Analysis Times
19
jdt.ui
xtext
eclipselink
jdt.core
swt
cdt
ocl
ptp
org.aspectj
cdo
udf
merge/increment
load
save
parse
checkout
time(hours)
0
2
4
6
8
10
12
14
checkout parse save load merge/
increment
udf
0200400600800
avgtimeperrevision(ms)
jdt.core
cdt
jdt.ui
cdo
f.emfstore.core
emf
jdt.debug
emf.compare
emf.texo
.diffmerge.core
GIT size
Model size
GIT repository vs model size
inGB
0
5
10
15
Diskspace
20
Delta-Compression1
21
cdt
webtools
gmf−tooling
rap
jdt.core
delta-models
initial revisions
uncompressed
Named element matching
GB
0
2
4
6
8
10
12
14
cdt
webtools
gmf−tooling
rap
jdt.core
detla-models
initial revisions
uncompressed
Meta-class matching
GB
0
2
4
6
8
10
12
14
cdt
webtools
gmf−tooling
rap
jdt.core
delta-lines
initial revisions
uncompressed
Line matching
MLines
0
20
40
60
80
100
initial
revisions
delta
models for
meta-class
delta
lines
delta models
for named
elements
020406080
Compressed relative to full size
(%)
parse compress
named
elements
compress
meta-class
decompr.
meta-class
decompr.
named
elements
parse compress
named
elements
compress
meta-class
decompr.
meta-class
decompr.
named
elements
050010001500
Avg. execution times
avg.timeperrevision(ms)
10502001000
Avg. execution times (logarithmic)
avg.timeperrevision(ms)
1.M.Scheidgen: Evaluation of Model Comparison for Delta-Compression in Model Persistence; BigMDE 2016 (at STAF 2016),Vienna,
Austria, July 6-7, 2016
Model Creation v Analysis with Delta-Compression
22
jdt.ui
xtext
eclipselink
jdt.core
swt
cdt
ocl
ptp
org.aspectj
cdo
udf
merge/increment
load
save
compress
parse
checkout
time(hours)
0
2
4
6
8
10
12
14
jdt.ui
xtext
eclipselink
jdt.core
swt
cdt
ocl
ptp
org.aspectj
cdo
udf
merge/increment
load
save
parse
checkout
time(hours)
0
2
4
6
8
10
12
14
without compression with compression
Conclusions
▶ MSR can support software evolution and helps to
understand software evolution
▶ Traversing a source code repository to gather information
(MSR) is very time consuming, especially with iterative
analysis
▶ It is possible to save most of this time via saving data in its
model state, at the cost of comparably large models that
need to be persisted
▶ The MSR analysis execution time savings are considerable
23

Creating and Analyzing Source Code Repository Models - A Model-based Approach to Mining Software Repositories

  • 1.
    Model-based Analysis of SourceCode Repositories Markus Scheidgen 1 Martin Schmidt Joachim Fischer {scheidge,schmidma,fischer@informatik.hu-berlin.de Humboldt Universität zu Berlin
  • 2.
    Agenda ▶ Software Evolution,Reverse Engineering, and Mining Software Repositories ▶ Model-based Mining of Software Repositories ▶ srcrepo a framework for model-bases analysis of software repositories ▶ Experiments with Eclipse’s software repositories ▶ Conclusions 2
  • 3.
    software maintenance RHEAD … R0 ■ qualityassessment ■ implicit dependencies software engineering Software Evolution 3 requirements M M {C} S user problem M M {C} S user mining software repositories1 software modernization OMG’s ADM (Achitecture-Driven Modernization) ‣AST Meta-Model (ASTM) ‣Knowledge Discovery Meta- Model (KDM) ‣Software Metrics Meta-Model (SMM) M SS M M M {C} {C} reverseengineering forwardengineering transformation 1.H. Kagdi, M.L. Collard, J.I. Maletic: A survey and taxonomy of approaches for mining software repositories in the context of software evolution; Journal of Software Maintenance and Evolution: Research and Practice; Vol.19/Nr.2/2007
  • 4.
    Mining Software Repositories– In General ▶ The term mining software repositories (MSR) has been coined to describe a broad class of investigations into the examination of software repositories. ▶ The premise of MSR is that empirical and systematic investigations of repositories will shed new light on the process of software evolution. [1] ▶ Different scopes, e.g. single software projects vs. many software projects ▶ Different goals, e.g. quality assessments and implicit dependencies vs. generalizations about software evolution 4 1.H. Kagdi, M.L. Collard, J.I. Maletic: A survey and taxonomy of approaches for mining software repositories in the context of software evolution; Journal of Software Maintenance and Evolution: Research and Practice; Vol.19/Nr.2/2007
  • 5.
    Model-based Mining ofSoftware Repositories 5 MS M{C} reverse engineering 1.E.J. Chikofsky, J.H. Cross: Reverse engineering and design recovery: A taxonomy; IEEE Software; Vol.7/Nr.1/1990 2.R. Lincke, J. Lundberg, W. Löwe: Comparing Software Metrics Tools; 8th International Symp. on Software Testing and Analysis; 2008
  • 6.
    Model-based Mining ofSoftware Repositories 5 MS M{C} MM{Cn} RHEAD … R 0 reverse engineering 1.E.J. Chikofsky, J.H. Cross: Reverse engineering and design recovery: A taxonomy; IEEE Software; Vol.7/Nr.1/1990 2.R. Lincke, J. Lundberg, W. Löwe: Comparing Software Metrics Tools; 8th International Symp. on Software Testing and Analysis; 2008
  • 7.
    Model-based Mining ofSoftware Repositories 5 MS M{C} MM{Cn} RHEAD … R 0 {Cn-1} MM reverse engineering 1.E.J. Chikofsky, J.H. Cross: Reverse engineering and design recovery: A taxonomy; IEEE Software; Vol.7/Nr.1/1990 2.R. Lincke, J. Lundberg, W. Löwe: Comparing Software Metrics Tools; 8th International Symp. on Software Testing and Analysis; 2008
  • 8.
    Model-based Mining ofSoftware Repositories 5 MS M{C} MM{Cn} RHEAD … R 0 {Cn-1} MM {C0} … MM … … … reverse engineering 1.E.J. Chikofsky, J.H. Cross: Reverse engineering and design recovery: A taxonomy; IEEE Software; Vol.7/Nr.1/1990 2.R. Lincke, J. Lundberg, W. Löwe: Comparing Software Metrics Tools; 8th International Symp. on Software Testing and Analysis; 2008
  • 9.
    Model-based Mining ofSoftware Repositories 5 MS M{C} MM{Cn} RHEAD … R 0 {Cn-1} MM {C0} … MM … … … reverse engineering 1.E.J. Chikofsky, J.H. Cross: Reverse engineering and design recovery: A taxonomy; IEEE Software; Vol.7/Nr.1/1990 2.R. Lincke, J. Lundberg, W. Löwe: Comparing Software Metrics Tools; 8th International Symp. on Software Testing and Analysis; 2008
  • 10.
    Model-based Mining ofSoftware Repositories ▶ Scope ■ depends on concreter MSR-application and its goals ■ number of software projects: single repositories, large repositories, ultra-large repositories ■ Sources as text and text based metrics, e.g. LOC ■ Declarations only: packages, classes, methods, but no statements, expressions, etc. ■ Full AST with or without cross-references 6
  • 11.
    Model-based Mining ofSoftware Repositories ▶ Scope ■ depends on concreter MSR-application and its goals ■ number of software projects: single repositories, large repositories, ultra-large repositories ■ Sources as text and text based metrics, e.g. LOC ■ Declarations only: packages, classes, methods, but no statements, expressions, etc. ■ Full AST with or without cross-references 7
  • 12.
    Model-based Mining ofSoftware Repositories ▶ MSR tools are already “model-based”, but in a proprietary manner ▶ Idea: existing reverse engineering framework and corresponding standard meta-models and modeling frameworks instead of proprietary solutions ▶ Goals ■ deal with heterogeneity (different version control systems, different languages) ■ reuse of existing meta-models, transformations, and languages ■ interoperability with existing analysis tools ■ retaining meaningful scalability 8
  • 13.
  • 14.
    Model-based MSR Strategies 9 snapshot A1B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3
  • 15.
    Model-based MSR Strategies 9 snapshot A1B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1
  • 16.
    snapshot A2 B3 Model-based MSRStrategies 9 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1
  • 17.
    snapshot A2 B3 Model-based MSRStrategies 9 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1 snapshot X d2CUs(r) Parse(d)
  • 18.
    snapshot A2 B3 Model-based MSRStrategies 9 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1 snapshot X d2CUs(r) Parse(d) snapshot
  • 19.
    snapshot snapshot A2 B3 Model-based MSRStrategies 9 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1 snapshot X d2CUs(r) Parse(d) snapshot
  • 20.
    snapshot snapshot A2 B3 Model-based MSRStrategies 9 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1 snapshot X d2CUs(r) Parse(d) snapshot M1 Analysis(r)
  • 21.
    snapshot snapshot A2 B3 Model-based MSRStrategies 9 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1 snapshot X d2CUs(r) Parse(d) snapshot M1 Analysis(r) M2
  • 22.
    snapshot snapshot A2 B3 Model-based MSRStrategies 9 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1 snapshot X d2CUs(r) Parse(d) snapshot M1 Analysis(r) M3 M2
  • 23.
    snapshot snapshot A2 B3 Model-based MSRStrategies 9 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1 snapshot X d2CUs(r) Parse(d) snapshot M1 Analysis(r) M3 M2 X R Checkout + X CUs Parse + Analysis ! X R Checkout + X CUs (Parse + Merge) + Analysis ! X R X CUs (Load + Merge) + Analysis ! X X
  • 24.
    snapshot snapshot A2 B3 Model-based MSRStrategies 10 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1 snapshot X d2CUs(r) Parse(d) snapshot M1 Analysis(r) M3 M2
  • 25.
    snapshot snapshot A2 B3 Model-based MSRStrategies 10 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1 snapshot X d2CUs(r) Parse(d) snapshot M1 Analysis(r) M3 M2 f B.f fB.f Parse(d) X d2 CUs(r)
  • 26.
    snapshot snapshot A2 B3 Model-based MSRStrategies 10 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1 snapshot X d2CUs(r) Parse(d) snapshot M1 Analysis(r) M3 M2 Merge(r) f B.f fB.f Parse(d) X d2 CUs(r)
  • 27.
    snapshot snapshot A2 B3 Model-based MSRStrategies 10 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1 snapshot X d2CUs(r) Parse(d) snapshot M1 Analysis(r) M3 M2 Merge(r) f B.f fB.f Parse(d) X d2 CUs(r) X R Checkout + X CUs Parse + Analysis ! X R Checkout + X CUs (Parse + Merge) + Analysis ! X X !
  • 28.
    snapshot snapshot A2 B3 Model-based MSRStrategies 11 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1 snapshot X d2CUs(r) Parse(d) snapshot M1 Analysis(r) M3 M2 Merge(r) f B.f fB.f Parse(d) X d2 CUs(r)
  • 29.
    snapshot snapshot A2 B3 Model-based MSRStrategies 11 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1 snapshot X d2CUs(r) Parse(d) snapshot M1 Analysis(r) M3 M2 Merge(r) f B.f fB.f Parse(d) X d2 CUs(r) Load(r)Save(r)
  • 30.
    snapshot snapshot A2 B3 Model-based MSRStrategies 11 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1 snapshot X d2CUs(r) Parse(d) snapshot M1 Analysis(r) M3 M2 Merge(r) f B.f fB.f Parse(d) X d2 CUs(r) Load(r)Save(r) X R Checkout + X CUs Parse + Analysis ! X R Checkout + X CUs (Parse + Merge) + Analysis ! X R X CUs (Load + Merge) + Analysis ! X R X CUs (Load + Analysis0 )
  • 31.
    snapshot snapshot A2 B3 Model-based MSRStrategies 12 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1 snapshot X d2CUs(r) Parse(d) snapshot M1 Analysis(r) M3 M2 Merge(r) f B.f fB.f Parse(d) X d2 CUs(r) Load(r)Save(r) X R Checkout + X CUs Parse + Analysis ! X R Checkout + X CUs (Parse + Merge) + Analysis ! X R X CUs (Load + Merge) + Analysis ! X R X CUs (Load + Analysis0 ) Analysis0 (r)
  • 32.
    snapshot snapshot A2 B3 Model-based MSRStrategies 12 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1 snapshot X d2CUs(r) Parse(d) snapshot M1 Analysis(r) M3 M2 Merge(r) f B.f fB.f Parse(d) X d2 CUs(r) Load(r)Save(r) Analysis0 (r) X R Checkout + X CUs Parse + Analysis ! X R Checkout + X CUs (Parse + Merge) + Analysis ! X R X CUs (Load + Merge) + Analysis ! X R X CUs (Load + Analysis0 )
  • 33.
    snapshot snapshot A2 B3 Model-based MSRStrategies 13 snapshot A1 B1 Checkout(r) versioncontrolsystem A1-A2 A1 B1 B1-B3 snapshot A2 B1 snapshot X d2CUs(r) Parse(d) snapshot M1 Analysis(r) M3 M2 Merge(r) f B.f fB.f Parse(d) X d2 CUs(r) Load(r)Save(r) Analysis0 (r)
  • 34.
    Research Questions ▶ Assumptions ■Development of MSR-applications based on models, transformation languages and standardized meta-models is favorable ■ Some MSR-applications need to analyze source code on a deep (AST) level ■ MSR-analysis is performed iteratively ▶ Hypotheses ■ Models of source code repositories can be created and persisted ■ Traversing existing persistent models of source code repositories is much faster than traversing transient models that are created from version control system on the fly 14
  • 35.
    srcrepo – AFramework for Model-based MSR ▶ Eclipse’s MoDisco as reverse engineering framework ■ reverse engineering for Java, based on EMF ■ Support for many JRE-ased languages: Java, xText, JSP, XML ■ creates instances of a Java EMF meta-model that corresponds to the handwritten JDT AST-model ■ provides transformation to language independent artifacts, e.g. KDM ▶ EMF-Fragments1 to store very large-models ■ uses No-SQL databases and stores larger model fragments within database entries ■ in contrast to object-by-object stores such as ORM-based CDO or No-SQL-based Morsa or Neo4J ▶ Xtend programming with higher order functions to mimic OCL-style definition of software metrics2 15 1.M.Scheidgen, A.Zubow,J.Fischer,T.H.Kolbe: Automated and Transparent Model Fragmentation for Persisting Large Models; ACM/IEEE 15th International Conference on Model Driven Engineering Languages & Systems (MODELS); Innsbruck; 2012 2. M.Scheidgen, J.Fischer: Model-based Mining of Software Repositories; 8th Systems and Modeling Conference, Valencia, Spain, September 29th, 2014
  • 36.
    “OCL” to CalculateMetrics of AST-Models 16 // Weighted number of methods per class. def wmc(AbstractTypeDeclaration type,(Block) int weight) { type.bodyDeclarations.sum[if (body != null) weight.apply(it.body) else 1] }
  • 37.
    Experiments ▶ Eclipse Foundationsources, i.e. Eclipse platform and plug- ins (large scale software repository) ▶ Organized in different (couple hundred) projects: jdt, cdt, emf, ... ▶ Available via GITHub ▶ GIT repositories can be gathered automated via GITHub’s REST-ful API ▶ 200 largest Eclipse repositories that actually contained Java code: 6.6 GB Git, 400 MLOC, 250 GB model with 4 billion objects. 17
  • 38.
    Example Plot: Halstead-lengthfor each Revision; Eclipse CDT 18 2004 2006 2008 2010 2012 2014 2016 0246810 time (years) WMCwithHalsteadlength(x106 ) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++++++ +++++ +++++ ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++++ ++++++++ +++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + ++++ ++++ ++++++++++++++++++ ++ ++++++++++++++++++ ++++++++++++++++++++ ++++ + +++++++++++++ ++ +++++++++++++++++ + ++++++++ + ++++++++ ++++++++ ++++++++++ + + ++++ ++ +++ + ++ + +++++ +++++++++++++++++++++++++++++++++++++++
  • 39.
    Model Create vAnalysis Times 19 jdt.ui xtext eclipselink jdt.core swt cdt ocl ptp org.aspectj cdo udf merge/increment load save parse checkout time(hours) 0 2 4 6 8 10 12 14 checkout parse save load merge/ increment udf 0200400600800 avgtimeperrevision(ms)
  • 40.
  • 41.
    Delta-Compression1 21 cdt webtools gmf−tooling rap jdt.core delta-models initial revisions uncompressed Named elementmatching GB 0 2 4 6 8 10 12 14 cdt webtools gmf−tooling rap jdt.core detla-models initial revisions uncompressed Meta-class matching GB 0 2 4 6 8 10 12 14 cdt webtools gmf−tooling rap jdt.core delta-lines initial revisions uncompressed Line matching MLines 0 20 40 60 80 100 initial revisions delta models for meta-class delta lines delta models for named elements 020406080 Compressed relative to full size (%) parse compress named elements compress meta-class decompr. meta-class decompr. named elements parse compress named elements compress meta-class decompr. meta-class decompr. named elements 050010001500 Avg. execution times avg.timeperrevision(ms) 10502001000 Avg. execution times (logarithmic) avg.timeperrevision(ms) 1.M.Scheidgen: Evaluation of Model Comparison for Delta-Compression in Model Persistence; BigMDE 2016 (at STAF 2016),Vienna, Austria, July 6-7, 2016
  • 42.
    Model Creation vAnalysis with Delta-Compression 22 jdt.ui xtext eclipselink jdt.core swt cdt ocl ptp org.aspectj cdo udf merge/increment load save compress parse checkout time(hours) 0 2 4 6 8 10 12 14 jdt.ui xtext eclipselink jdt.core swt cdt ocl ptp org.aspectj cdo udf merge/increment load save parse checkout time(hours) 0 2 4 6 8 10 12 14 without compression with compression
  • 43.
    Conclusions ▶ MSR cansupport software evolution and helps to understand software evolution ▶ Traversing a source code repository to gather information (MSR) is very time consuming, especially with iterative analysis ▶ It is possible to save most of this time via saving data in its model state, at the cost of comparably large models that need to be persisted ▶ The MSR analysis execution time savings are considerable 23