The Mining Software Repositories (MSR) field analyzes the rich data available in source code repositories (SCR) to uncover interesting and actionable information about software system evolution. Major obstacles in MSR are the heterogeneity of software projects and the amount of data that has to be processed. Model-driven software engineering (MDSE) can deal with heterogeneity through abstraction, its core strength, but only recent efforts in adopting NoSQL databases for persisting and processing very large models have made MDSE a feasible approach for MSR. This paper is a work-in-progress report on srcrepo, a model-based MSR system. Srcrepo uses the NoSQL-based EMF-model persistence layer EMF-Fragments and Eclipse’s MoDisco reverse engineering framework to create EMF-models of whole SCRs that comprise all code of all revisions at the abstract syntax tree (AST) level. An OCL-like language is used as an accessible way to gather information such as software metrics from these SCR models.
Model-based Mining of Software Repositories
1. Model-based Mining of Source Code Repositories
Markus Scheidgen, Joachim Fischer
{scheidge,fischer}@informatik.hu-berlin.de
Tuesday, 30. September 2014
2. Agenda
▶ Mining Software Repositories (MSR) and Software Evolution Research (SER)
▶ srcrepo – a model-based MSR system
■ components and mining process
■ distributed storage of very large models with emf-fragments
■ a meta-model for source code repositories
■ gathering software metrics with an OCL-like internal Scala DSL
▶ work in progress - discussion of remaining problems and limitations
4. Relevant Research Fields
Mining Software Repositories (MSR)
The term mining software repositories (MSR) has been coined to describe a broad class of techniques for the examination of software repositories. The premise of MSR is that empirical and systematic investigations of repositories will shed new light on the process of software evolution. [1]
Software Metrics
A software metric is a mathematical definition mapping the entities of a software system to numeric metrics values. [...] to express features of software with numbers in order to facilitate software quality assessment. [2]
Reverse Engineering
Reverse engineering is the process of analyzing a subject system to (1) identify the system’s components and their interrelationships and (2) create representations of the system in another form or at a higher level of abstraction. [3]
Software Evolution Research (SER)
■ (dis-)proving Lehman’s laws of software evolution
■ empirical investigations of software repositories through statistical analysis of software metrics and software change metrics over the evolutionary course of many software systems
Model-based Mining Software Repositories (with srcrepo)
Addressing heterogeneity and accessibility by raising the level of abstraction, while ensuring scalability and retaining meaningful information depth.
1. H. Kagdi, M.L. Collard, J.I. Maletic: A survey and taxonomy of approaches for mining software repositories in the context of software evolution; Journal of Software Maintenance and Evolution: Research and Practice; Vol.19/Nr.2/2007
2. R. Lincke, J. Lundberg, W. Löwe: Comparing Software Metrics Tools; 8th International Symposium on Software Testing and Analysis; 2008
3. E.J. Chikofsky, J.H. Cross: Reverse engineering and design recovery: A taxonomy; IEEE Software; Vol.7/Nr.1/1990
13. srcrepo – Components and Process
[Figure: srcrepo components]
▶ large scale software repositories (e.g. GitHub, SourceForge) hosting software projects
■ source code repository (e.g. controlled by Git, SVN, CVS)
■ source code (e.g. Java, C++, Eclipse*)
■ issue tracker, mailing lists, wiki
▶ srcrepo storage (EMF-models via EMF-Fragments, e.g. on MongoDB)
■ import: revision tree with AST-models of new and changed CUs
▶ srcrepo runtime (headless Eclipse RCP)
■ analysis: revision tree with fully resolved snapshot models
18. srcrepo – Components and Process
[Figure: srcrepo mining process]
▶ import: AST-models of new and changed CUs for each revision of the source code repository
▶ analysis: fully resolved snapshot models (S) for each revision
■ OCL on snapshot models yields metrics (M)
■ EMF-Compare on consecutive revisions yields diff-models (D); OCL on diff-models yields further metrics
▶ store: timelines of metrics
▶ export: statistics software (e.g. R, Matlab)
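The last two steps of the process (store timelines of metrics, export to statistics software) can be illustrated with a minimal Scala sketch. All names here (`Measurement`, `timeline`, `toCsv`) and the CSV layout are assumptions made for illustration; they are not part of srcrepo.

```scala
object TimelineSketch {
  // Hypothetical record: one metric value for one revision.
  final case class Measurement(revision: String, value: Double)

  // Applies a metric to every snapshot, keeping the revision order of the input.
  def timeline[S](snapshots: List[(String, S)])(metric: S => Double): List[Measurement] =
    snapshots.map { case (revision, snapshot) => Measurement(revision, metric(snapshot)) }

  // Renders a timeline as CSV text; the column names are made up for this sketch.
  def toCsv(measurements: List[Measurement]): String =
    ("revision,value" :: measurements.map(m => s"${m.revision},${m.value}")).mkString("\n")

  def main(args: Array[String]): Unit = {
    // Snapshots are stood in for by lists of class names; the metric counts them.
    val snapshots = List("r1" -> List("A"), "r2" -> List("A", "B"), "r3" -> List("A", "B", "C"))
    println(toCsv(timeline(snapshots)(classes => classes.size.toDouble)))
  }
}
```

A file written this way could be read by R or Matlab with their standard CSV import functions.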
20. Distributed Model Store with emf-fragments
[Figure: model objects connected by two kinds of references]
■ normal references (*)
■ fragmenting references (*), marked «fragments»
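The distinction between normal and fragmenting references can be sketched in a few lines of Scala. The sketch assumes that an object reached via a «fragments» reference starts a new storage entry, while objects reached via normal references are stored together with their container. `Obj` and `fragments` are hypothetical names; this illustrates the idea only and is not the EMF-Fragments API.

```scala
object FragmentSketch {
  // Hypothetical model object: children are held either via normal
  // or via fragmenting containment references.
  final case class Obj(id: String, normal: List[Obj] = Nil, fragmenting: List[Obj] = Nil)

  // Splits a containment tree into fragments. The result maps the id of each
  // fragment root to the ids of all objects stored in that fragment.
  def fragments(root: Obj): Map[String, List[String]] = {
    // Objects sharing a fragment with `o`: o plus everything reachable via normal references.
    def members(o: Obj): List[Obj] = o :: o.normal.flatMap(members)
    val own = members(root)
    // Every fragmenting reference found inside this fragment opens a new fragment.
    val nested = own.flatMap(_.fragmenting).flatMap(child => fragments(child))
    (nested :+ (root.id -> own.map(_.id))).toMap
  }

  def main(args: Array[String]): Unit = {
    val repo = Obj("repo",
      normal = List(Obj("meta")),
      fragmenting = List(Obj("rev1", normal = List(Obj("cuA"))), Obj("rev2")))
    fragments(repo).foreach { case (key, ids) => println(s"$key -> ${ids.mkString(", ")}") }
  }
}
```

Under this assumption each map entry corresponds to one value in a key-value store, so loading one revision does not require loading the whole model.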
21. A Meta-Model for Source Code Repositories
[Figure: meta-model for source code repositories, combining a revision tree with MoDisco source code models; references marked «fragments» are fragmenting references]
25. An OCL-like internal Scala DSL for Computing Metrics
▶ OCL-like internal Scala DSL, analogous to our internal Scala model transformation language [1]
▶ OCL collection operations mapped to Scala’s higher-order functions [2]:
pure OCL
    context Model:
      self.ownedElements->collect(p|p.ownedElements)->size
OCL-like expression in Scala
    def numberOfFirstPackageLevelTypes(self: Model): Int =
      self.getOwnedElements().collect(p=>p.getOwnedElements()).size()
[Figure: meta-model excerpt. Model has ownedElements (*) of type Package; Package has ownedPackages (*) of type Package and ownedElements (*) of type AbstractTypeDeclaration]
1. L. George, A. Wider, M. Scheidgen: Type-Safe Model Transformation Languages as Internal DSLs in Scala; Theory and Practice of Model Transformations - 5th International Conference, ICMT; 2012
2. Filip Krikava: Enriching EMF Models with Scala; Slideshare
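The mapping of OCL collection operations to Scala's higher-order functions can be tried without the DSL. The following sketch uses plain case classes as hypothetical stand-ins for the meta-model classes (it does not use the MoDisco API or the srcrepo DSL). OCL's `collect` applies an expression to each element and flattens the results, which corresponds to `flatMap` on Scala collections.

```scala
object CollectSketch {
  // Hypothetical stand-ins for the meta-model classes shown on the slide.
  final case class TypeDeclaration(name: String)
  final case class Pkg(name: String, ownedElements: List[TypeDeclaration])
  final case class Model(ownedElements: List[Pkg])

  // OCL: self.ownedElements->collect(p|p.ownedElements)->size
  // collect = map each element to a collection, then flatten; in Scala: flatMap.
  def numberOfFirstPackageLevelTypes(self: Model): Int =
    self.ownedElements.flatMap(p => p.ownedElements).size

  def main(args: Array[String]): Unit = {
    val model = Model(List(
      Pkg("a", List(TypeDeclaration("A1"), TypeDeclaration("A2"))),
      Pkg("b", List(TypeDeclaration("B1")))))
    println(numberOfFirstPackageLevelTypes(model)) // prints 3
  }
}
```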
30. Complex Example: Average Weighted Methods per Class (WMC)
▶ WMC is the first CK-metric [1]. There are different commonly used weights; here we use cyclomatic complexity.
    // all classes of the model: walk the package tree, then descend into nested classes
    def classes(model:Model):OclCollection[ClassDeclaration] =
      model.getOwnedElements()
        .collectClosure(pkg=>pkg.getOwnedPackages())
        .collectAll(pkg=>pkg.getOwnedElements())
        .collectClosure(typeDcl=>
          typeDcl.getBodyDeclarations()
            .selectOfType[ClassDeclaration]())

    // average over all classes of the summed complexity of each class's methods
    def WMC(model:Model):Double =
      classes(model).stats(clazz=>
        clazz.getBodyDeclarations()
          .selectOfType[MethodDeclaration]()
          .sum(method=>cyclomaticComplexity(method))).average

    def cyclomaticComplexity(method:MethodDeclaration):Int =
      ...
[Figure: meta-model excerpt with the classes Model, Package, AbstractTypeDeclaration, AbstractBodyDeclaration, ClassDeclaration, MethodDeclaration, AbstractMethodInvocation, and TypeAccess, and the references ownedElements, ownedPackages, bodyDeclarations, superInterfaces, superClass, returnType, type, usagesInTypeAccess, and method]
1. S.R. Chidamber, C.F. Kemerer: A Metrics Suite for Object Oriented Design; IEEE Transactions on Software Eng.; Vol.20/Nr.6/1994
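The structure of the WMC computation can be reproduced on a toy model with plain Scala collections. In the sketch below, `Method`, `Clazz`, `Pkg`, and `closure` are hypothetical stand-ins; the complexity of each method is given as a number rather than computed from an AST, and `closure` is assumed to include its start elements, as OCL's closure does.

```scala
object WmcSketch {
  // Hypothetical toy model; `complexity` stands in for a computed cyclomatic complexity.
  final case class Method(name: String, complexity: Int)
  final case class Clazz(name: String, methods: List[Method], nested: List[Clazz] = Nil)
  final case class Pkg(name: String, classes: List[Clazz], subPackages: List[Pkg] = Nil)

  // Transitive closure over a tree: the start elements plus everything reachable via `next`.
  // No cycle detection, which is sufficient for containment trees.
  def closure[A](start: List[A])(next: A => List[A]): List[A] =
    start.flatMap(a => a :: closure(next(a))(next))

  // All classes: every package (including sub-packages), then every nested class.
  def classes(roots: List[Pkg]): List[Clazz] =
    closure(closure(roots)(_.subPackages).flatMap(_.classes))(_.nested)

  // WMC of a single class: the sum of its method weights.
  def wmc(c: Clazz): Int = c.methods.map(_.complexity).sum

  // Average WMC over all classes; 0.0 for a model without classes.
  def averageWmc(roots: List[Pkg]): Double = {
    val values = classes(roots).map(wmc)
    if (values.isEmpty) 0.0 else values.sum.toDouble / values.size
  }

  def main(args: Array[String]): Unit = {
    val inner = Clazz("Inner", List(Method("run", 3)))
    val outer = Clazz("Outer", List(Method("a", 1), Method("b", 4)), List(inner))
    val util = Clazz("Util", List(Method("c", 2)))
    val model = List(Pkg("root", List(outer), List(Pkg("root.sub", List(util)))))
    println(averageWmc(model))
  }
}
```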
31. Implementation of the OCL-Collection Operations
▶ Just-in-time, iterator-based implementation rather than straightforward aggregation of result collections.
[Figure: object tree of a :RepositoryModel with :Rev objects, :Par... objects, and :Di... objects below them; numbered 1 to 4]
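The difference between an iterator-based and an aggregating implementation can be shown with Scala's standard `Iterator`. This is a sketch of the general technique under assumed names (`Node`, `closure`, `closureEager`, and the `onVisit` callback are made up for illustration); it is not the srcrepo implementation.

```scala
object LazyClosureSketch {
  final case class Node(name: String, children: List[Node] = Nil)

  // Iterator-based closure (depth first, pre-order): a node's children are
  // expanded only when the consumer asks for further elements.
  // `onVisit` is called for every node that is actually touched.
  def closure(start: Iterator[Node], onVisit: Node => Unit): Iterator[Node] =
    start.flatMap { n =>
      onVisit(n)
      Iterator.single(n) ++ closure(n.children.iterator, onVisit)
    }

  // Aggregating variant: builds the complete result list before returning.
  def closureEager(start: List[Node], onVisit: Node => Unit): List[Node] =
    start.flatMap { n =>
      onVisit(n)
      n :: closureEager(n.children, onVisit)
    }

  def main(args: Array[String]): Unit = {
    val tree = Node("root", List(Node("a", List(Node("a1"), Node("a2"))), Node("b")))
    var lazyVisits = 0
    val lazyFirstTwo = closure(Iterator(tree), _ => lazyVisits += 1).take(2).map(_.name).toList
    var eagerVisits = 0
    val eagerFirstTwo = closureEager(List(tree), _ => eagerVisits += 1).take(2).map(_.name)
    println(s"lazy: $lazyFirstTwo after $lazyVisits visits")
    println(s"eager: $eagerFirstTwo after $eagerVisits visits")
  }
}
```

With a fragmented model store, touching fewer objects also means loading fewer fragments, which is where the iterator-based variant pays off for queries that stop early.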
32. Future Work, Remaining Problems, and Limitations
Scalability
■ very large compilation units
■ incremental snapshot creation
■ batching OCL execution
■ experiments with large scale repository (e.g. git.eclipse.org)
Heterogeneity
■ MoDisco for different programming languages
■ common metrics meta-model (e.g. OMG, KDM)
■ VCS abstraction and support for different VCS
Accessibility
■ relating results to software repository entities
■ persisting and exporting results
Information Depth
■ diff-models from comparison of compilation units
▶ Very large compilation units (CU): e.g. a 3 MB, 600 kLOC CU in org.eclipse.emf
■ tends to have lots of dependencies ➞ changes often ➞ makes the problem even bigger
■ CUs are the smallest common denominator between the text-based VCS view and the syntax-based AST view
■ smaller units require model comparison or text-to-AST mappings
▶ Support for different programming languages: either abstraction, parallel meta-models, or a mixed approach
■ MoDisco is extensible, but only Java support exists; other languages need to be implemented ➞ parallel meta-models
■ A reasonable abstraction for multiple (or all) programming languages probably does not exist.
■ A shared abstract meta-model that all language meta-models extend could be a sensible compromise.
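The compromise of a shared abstract meta-model can be sketched with Scala traits. Every name below (`CodeUnit`, `Callable`, `JavaMethod`, `CFunction`, and so on) is hypothetical, and the weight "branch points + 1" is the common simplified form of cyclomatic complexity, used here as an assumption. Language-specific classes keep their own details while metrics are written once against the shared layer.

```scala
object SharedMetaModelSketch {
  // Shared abstract layer: only concepts assumed to exist in every language.
  trait CodeUnit {
    def name: String
    def children: List[CodeUnit]
  }
  trait Callable extends CodeUnit {
    def branchPoints: Int
  }

  // Parallel, language-specific meta-models extend the shared layer.
  final case class JavaMethod(name: String, branchPoints: Int, isSynchronized: Boolean) extends Callable {
    def children: List[CodeUnit] = Nil
  }
  final case class JavaClass(name: String, children: List[CodeUnit]) extends CodeUnit
  final case class CFunction(name: String, branchPoints: Int, isStatic: Boolean) extends Callable {
    def children: List[CodeUnit] = Nil
  }
  final case class CFile(name: String, children: List[CodeUnit]) extends CodeUnit

  // All callables contained in a unit, including the unit itself if it is one.
  def callables(unit: CodeUnit): List[Callable] = {
    val own = unit match {
      case c: Callable => List(c)
      case _ => Nil
    }
    own ++ unit.children.flatMap(callables)
  }

  // A metric defined once on the shared layer works for every language.
  def totalComplexity(unit: CodeUnit): Int =
    callables(unit).map(_.branchPoints + 1).sum

  def main(args: Array[String]): Unit = {
    val javaClass = JavaClass("A", List(JavaMethod("m", 2, isSynchronized = false), JavaMethod("n", 0, isSynchronized = true)))
    val cFile = CFile("f.c", List(CFunction("main", 1, isStatic = false)))
    println(s"Java: ${totalComplexity(javaClass)}, C: ${totalComplexity(cFile)}")
  }
}
```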