Dr. Markus Scheidgen
Model-based Analysis of
Large Scale Software
Repositories
■ problem
■ creating models of software repositories
■ the means for analyzing such models
■ example analysis
Problem
Is Software Engineering a Science?
■ Def.: Science (from Latin scientia) is a systematic enterprise that
builds and organizes knowledge in the form of testable
explanations and predictions about the universe.
■ Testable? Example theses:
★ DSLs allow domain experts to develop software effectively and more
efficiently than with GPLs.
★ Static type systems lead to safer programming and fewer bugs.
★ Functional programming leads to less performant programs.
★ Scrum allows programs to be developed faster.
★ My framework allows developing ... more, faster ... with less, fewer ...
■ Methods for quantitative measures of software properties
(metrics) are mostly used to assess the state of software projects,
and rarely for empirical studies on software engineering itself
Reasons
inaccessibility
  • new methods have to be used first to produce data
  • industry cooperation necessary
  • open-source repositories are a possibility
data quality
  • not easy to distinguish between written code, generated code, and test code
  • there are maintained projects, developed projects, aborted projects
heterogeneity
  • different project structures
  • different paradigms
  • different languages
  • different APIs
amounts of data
  • SourceForge hosts >350,000 projects
  • current snapshot of the Linux kernel contains 10^8 AST nodes
  • EMF's 50 MB Git repository takes 20 GB of binary encoded AST data
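A quick back-of-the-envelope calculation for the last bullet, using only the two sizes stated there and assuming decimal units (1 GB = 1000 MB). The object and value names are made up for illustration:

```scala
object AstBlowUp {
  val gitRepositoryMB = 50.0      // size of EMF's Git repository, as stated above
  val astDataMB = 20.0 * 1000     // 20 GB of binary encoded AST data, assuming 1 GB = 1000 MB
  // how many times larger the AST data is than the repository it was derived from
  val factor = astDataMB / gitRepositoryMB

  def main(args: Array[String]): Unit =
    println(s"AST data is ${factor.toInt} times the size of the Git repository") // 400 times
}
```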
Relevant Fields with Partial Solutions
Mining Software Repositories (MSR)
  analysis of rich data contained in software engineering related
  repositories such as version control systems, mailing lists, and
  bug-tracking systems
  • guiding software development
  • defect detection, prediction, resolution
  • gaining actionable knowledge about software projects and software
    engineering methodologies

  • static, language independent
  • syntax based
  • scale: single projects, large scale (Eclipse, Apache), ultra large
    scale (SourceForge, GitHub)

Software Metrics
  definition, acquisition, and analysis of quantitative measures of
  certain software properties
  • assessment of engineering costs for development, change,
    maintenance, etc.
  • comparative analysis of software systems or analysis of software
    evolution
  • comparative analysis of software engineering methodologies

  • language independent (e.g. LOC)
  • syntax based (e.g. McCabe)
  • static, dynamic (evolution)

Reverse Engineering
  analyzing existing code bases to create representations at a higher
  level of abstraction (models)
  • understanding existing software for development, change,
    maintenance, etc.
  • derive AST, UML, or KDM models from software

  • syntax (structure, behavior)
  • semantics
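The two metric examples named under Software Metrics differ in the input they need: LOC only needs text, while McCabe's cyclomatic complexity needs syntax. A minimal sketch, assuming LOC is counted as non-blank lines and using a made-up miniature AST (the Stmt types are hypothetical and not part of any tool mentioned here):

```scala
object MetricKinds {
  // language independent: lines of code, here counted as non-blank lines of any text
  def loc(source: String): Int = source.split("\n").count(_.trim.nonEmpty)

  // syntax based: a made-up miniature AST, just enough to count decision points
  sealed trait Stmt
  case class Simple(text: String) extends Stmt
  case class If(thenBranch: List[Stmt], elseBranch: List[Stmt]) extends Stmt
  case class While(body: List[Stmt]) extends Stmt

  // number of branching statements in a list of statements, including nested ones
  def decisions(stmts: List[Stmt]): Int = stmts.map {
    case Simple(_)   => 0
    case If(t, e)    => 1 + decisions(t) + decisions(e)
    case While(body) => 1 + decisions(body)
  }.sum

  // McCabe's cyclomatic complexity of a single method body: decision points + 1
  def mccabe(body: List[Stmt]): Int = decisions(body) + 1

  // the same loop once as text and once as AST
  val source = "while (i < n) {\n  if (a(i) > max) {\n    max = a(i)\n  }\n\n  i += 1\n}"
  val ast = List(While(List(If(List(Simple("max = a(i)")), Nil), Simple("i += 1"))))

  def main(args: Array[String]): Unit =
    println(s"LOC = ${loc(source)}, McCabe = ${mccabe(ast)}") // LOC = 6, McCabe = 3
}
```

The LOC function works on any language, but it cannot tell a loop from a comment. The McCabe function needs a parser for each language first.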
Problem Statement: Everything is there,
but ...
1. Missing abstractions:
■ no general abstractions that cover multiple languages/repositories
are used
■ only proprietary solutions and systems tailored for specific
algorithms/databases, languages, repositories
2. Scalability is an issue:
■ for ultra large scale repositories only VCS meta-data is used
■ for large scale repositories only language independent analysis
at file-level granularity is possible
■ only for single software projects is language dependent analysis
at AST-level detail feasible
Proposed Solution: Scalable Model-based
Framework
■ Meta-model and reverse engineering based approach to
analyze code models on different, well-defined levels of
abstraction instead of the code itself.
■ Query and transformation languages as well as model
persistence based on the Map/Reduce BigData paradigm.
■ Target: AST-level analysis of large-scale repositories, e.g.
git.eclipse.org (>300 projects)
SrcRepo: A Framework for Large
Scale Repository Analysis
Model-based Analysis of Large Scale
Software Repositories
[Figure: VCS, Model, Metrics, and "Better Understanding Software Engineering", connected by the following steps]
1) Reverse Engineering to create AST-level models of software and its evolution
2) Transformations based on MSR algorithms to derive implicit dependencies
2) Queries to perform measurements based on structural, causal, and implicit dependencies
3) Statistical analysis
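The steps in the figure can be read as a chain of functions from one artifact to the next. A minimal sketch with toy stand-ins for every stage: all type and function names are hypothetical, and the stage bodies (counting the word "class", taking differences, taking a mean) are placeholders, not what SrcRepo does:

```scala
object PipelineSketch {
  // toy stand-ins for the artifacts in the figure
  case class Vcs(revisions: List[String])                                    // source text per revision
  case class Model(classesPerRevision: List[Int], growth: List[Int] = Nil)
  case class Metrics(values: List[Double])

  // 1) reverse engineering: VCS => Model (placeholder: count "class" keywords per revision)
  val reverseEngineer: Vcs => Model =
    vcs => Model(vcs.revisions.map(src => src.split("\\s+").count(_ == "class")))

  // 2) transformation: Model => Model (placeholder: derive the growth between consecutive revisions)
  val transform: Model => Model =
    m => m.copy(growth = m.classesPerRevision.zip(m.classesPerRevision.drop(1)).map { case (a, b) => b - a })

  // 2) query: Model => Metrics
  val query: Model => Metrics = m => Metrics(m.growth.map(_.toDouble))

  // 3) statistical analysis: Metrics => one number (placeholder: mean)
  val analyze: Metrics => Double =
    ms => if (ms.values.isEmpty) 0.0 else ms.values.sum / ms.values.size

  val pipeline: Vcs => Double = reverseEngineer andThen transform andThen query andThen analyze

  val vcs = Vcs(List("class A", "class A class B", "class A class B class C class D"))

  def main(args: Array[String]): Unit = println(pipeline(vcs)) // 1.5
}
```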
1) Reverse Engineering Software in Version
Control Systems (VCS)
[Figure: panels "Code in a VCS" and "Software Model"; labels: code, revisions, files, causal relations, structural relations]
1) Models of Source Repositories
(github.com/markus1978/srcrepo)
[Figure: components SrcRepo, EMF Compare, JGit, MoDisco, EMF/EMF-Fragments; input: Git repository with Java sources]
1) Models of Source Repositories
(github.com/markus1978/srcrepo)
[Figure: model and metamodel; example revisions B1.R1 to B1.R4, B2.R1, B2.R2 with labels A, B, C, D; metamodel elements Repository, Revision, Diff, Compilation Unit, Model, Package, Class, Package Access; relations prev, next, usageIn, package, extends; stereotypes «relation», «fragmentation»; tools JGit, MoDisco]
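Read as plain data types, the element names in the figure could look roughly like this. A sketch only: Repository, Revision, Diff and CompilationUnit are names from the figure, ClassDecl and PackageDecl stand for its Class and Package, and every field, type and multiplicity below is an assumption. The slides build on EMF, not on Scala case classes:

```scala
object RepositoryModelSketch {
  // code model of one file (the element names the figure shows next to MoDisco)
  case class ClassDecl(name: String, superClass: Option[String] = None)     // "extends"
  case class PackageDecl(name: String, classes: List[ClassDecl])
  case class CompilationUnit(path: String, packages: List[PackageDecl])

  // repository model (the element names the figure shows next to JGit)
  case class Diff(path: String, unit: Option[CompilationUnit])              // None: file deleted (assumption)
  case class Revision(id: String, parents: List[String], diffs: List[Diff]) // parents by id ("prev")
  case class Repository(revisions: List[Revision]) {
    // "next": all revisions that name the given revision as a parent
    def next(id: String): List[Revision] = revisions.filter(_.parents.contains(id))
  }

  // made-up history: two revisions follow B1.R1
  val a = CompilationUnit("A.java", List(PackageDecl("p", List(ClassDecl("A")))))
  val b = CompilationUnit("B.java", List(PackageDecl("p", List(ClassDecl("B", Some("A"))))))
  val repo = Repository(List(
    Revision("B1.R1", Nil, List(Diff("A.java", Some(a)))),
    Revision("B1.R2", List("B1.R1"), List(Diff("B.java", Some(b)))),
    Revision("B2.R1", List("B1.R1"), List(Diff("A.java", None)))
  ))

  def main(args: Array[String]): Unit =
    println(repo.next("B1.R1").map(_.id)) // List(B1.R2, B2.R1)
}
```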
1) Models of Source Repositories: Scalability
SrcRepo is based on EMF-Fragments
(https://github.com/markus1978/emf-fragments)
[Figure: map/reduce (Hadoop); “Share Nothing” nodes cluster; DFS (HDFS); key-value store (EMF-resources) (HBase); structured data (EMF-model); model transformations]
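One way to picture a model kept in a key-value store: the model is cut into fragments, each fragment is one value under its own key, and a query only loads the fragments it reaches. A minimal sketch of that idea with an in-memory map in place of HBase. Fragment, FragmentStore, the key scheme and the load-on-demand behavior are assumptions for illustration, not the EMF-Fragments API:

```scala
import scala.collection.mutable

object FragmentStoreSketch {
  // one fragment: a piece of the model plus the keys of the fragments contained in it
  case class Fragment(classCount: Int, children: List[String])

  // stand-in for the key-value store; records which fragments were actually loaded
  class FragmentStore(data: Map[String, Fragment]) {
    val loaded = mutable.ListBuffer.empty[String]
    def load(key: String): Fragment = { loaded += key; data(key) }
  }

  // number of classes below a fragment; only touches fragments reachable from `key`
  def countClasses(store: FragmentStore, key: String): Int = {
    val fragment = store.load(key)
    fragment.classCount + fragment.children.map(child => countClasses(store, child)).sum
  }

  // made-up content: a repository fragment with two revisions, one with a nested fragment
  def newStore(): FragmentStore = new FragmentStore(Map(
    "repo"          -> Fragment(0, List("repo/rev1", "repo/rev2")),
    "repo/rev1"     -> Fragment(3, Nil),
    "repo/rev2"     -> Fragment(4, List("repo/rev2/gen")),
    "repo/rev2/gen" -> Fragment(10, Nil)
  ))

  def main(args: Array[String]): Unit = {
    val store = newStore()
    println(countClasses(store, "repo/rev2")) // 14
    println(store.loaded.toList)              // List(repo/rev2, repo/rev2/gen)
  }
}
```

Because a query on one revision never touches the fragments of the other, fragments can be processed independently of each other, which is what a map/reduce job on a share-nothing cluster needs.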
2) Scala for queries and transformations:
Syntax (internal DSL: from OCL to Scala)
Filip Krikava: Enriching EMF Models with Scala (quick overview), Eclipse Summit, Oct 24 2012
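To illustrate what "from OCL to Scala" can look like: an OCL iterator expression and a Scala expression with function literals have nearly the same shape, but the Scala version is ordinary host-language code that the Scala compiler type-checks. A small sketch on made-up types. The OCL line appears only as a comment, and the Scala line uses standard collection methods rather than the SrcRepo API:

```scala
object OclToScala {
  // made-up model types
  case class Clazz(name: String, isAbstract: Boolean)
  case class Pkg(ownedClasses: List[Clazz])

  // OCL:   self.ownedClasses->select(c | c.isAbstract)->collect(c | c.name)
  // Scala: the same query; filter and map are the standard-library counterparts of select and collect
  def abstractClassNames(self: Pkg): List[String] =
    self.ownedClasses.filter(c => c.isAbstract).map(c => c.name)

  val pkg = Pkg(List(Clazz("Shape", true), Clazz("Circle", false), Clazz("Figure", true)))

  def main(args: Array[String]): Unit =
    println(abstractClassNames(pkg)) // List(Shape, Figure)
}
```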
2) Scala for Queries: Syntax
def exists(predicate: (E) => Boolean): Boolean
def forAll(predicate: (E) => Boolean): Boolean
def select(predicate: (E) => Boolean): Collection[E]
def reject(predicate: (E) => Boolean): Collection[E]
def collect[R](expr: (E) => R): Collection[R]
def collectAll[R](expr: (E) => Collection[R]): Collection[R]
def closure(expr: (E) => Collection[E]): Collection[E]
def aggregate[R](expr: (E) => R, start: () => R, aggr: (R, R) => R): R
def sum(expr: (E) => Double): Double
def product(expr: (E) => Double): Double
def max(expr: (E) => Double): Double
def min(expr: (E) => Double): Double
def average(expr: (E) => Double): Double
...
def run(runnable: (E) => Unit): Unit
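How such methods can be made available on existing collections: a minimal sketch that adds a subset of the signatures above to any java.util.Collection through a Scala implicit class. The names OclLike and Queries are made up, Collection is taken to be java.util.Collection, and closure is assumed to include its start elements as in OCL. The implementation used by SrcRepo may differ:

```scala
import java.util.{ArrayList, Arrays, Collection, LinkedHashSet}

object OclLike {
  // adds query methods to every java.util.Collection where this implicit class is in scope
  implicit class Queries[E](self: Collection[E]) {
    def run(runnable: E => Unit): Unit = {
      val it = self.iterator()
      while (it.hasNext()) runnable(it.next())
    }
    def select(predicate: E => Boolean): Collection[E] = {
      val result = new ArrayList[E]()
      run(e => if (predicate(e)) result.add(e))
      result
    }
    def collectAll[R](expr: E => Collection[R]): Collection[R] = {
      val result = new ArrayList[R]()
      run(e => result.addAll(expr(e)))
      result
    }
    // the start elements plus everything reachable through expr, each element once
    def closure(expr: E => Collection[E]): Collection[E] = {
      val seen = new LinkedHashSet[E]()
      def visit(e: E): Unit = if (seen.add(e)) new Queries(expr(e)).run(visit)
      run(visit)
      seen
    }
    def aggregate[R](expr: E => R, start: () => R, aggr: (R, R) => R): R = {
      var result = start()
      run(e => result = aggr(result, expr(e)))
      result
    }
    def sum(expr: E => Double): Double = aggregate[Double](expr, () => 0.0, _ + _)
    def average(expr: E => Double): Double = sum(expr) / self.size()
  }

  // toy data: an even number refers to its half, e.g. 12 -> 6 -> 3
  def half(n: Int): Collection[Int] =
    if (n % 2 == 0) Arrays.asList(n / 2) else new ArrayList[Int]()

  val numbers: Collection[Int] = Arrays.asList(12, 5)
  val all = numbers.closure(half)
  val odd = all.select(n => n % 2 == 1)
  val mean = all.average(n => n.toDouble)

  def main(args: Array[String]): Unit = {
    println(all)  // [12, 6, 3, 5]
    println(odd)  // [3, 5]
    println(mean) // 6.5
  }
}
```

Since every method returns a java.util.Collection again, calls can be chained in the same way as OCL iterator expressions.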
2) Scala for Queries: Syntax
■ example SrcRepo query: “average number of methods per
class”
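def avgMethodsPerClass(self: Model): Double = {
  // all packages, including nested ones (closure is assumed to include its start elements, as in OCL)
  val packages = self.getOwnedPackages().
    closure((p) => p.getOwnedPackages())
  // all classes in those packages plus their inner classes;
  // collectAll merges the per-package class collections into one collection
  val classes = packages.collectAll((p) => p.getOwnedClasses()).
    closure((c) => c.getInnerClasses())
  // average over the number of methods each class owns
  classes.average((c) => c.getOwnedMethods().size())
}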
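To see what the query computes, here is the same calculation on a made-up model of two packages, written with plain Scala collections instead of the SrcRepo API. Pkg, Clazz and all numbers are invented for illustration, and both closure steps are assumed to include their start elements, as in OCL:

```scala
object AvgMethodsExample {
  case class Clazz(name: String, methods: Int, innerClasses: List[Clazz] = Nil)
  case class Pkg(name: String, classes: List[Clazz], subPackages: List[Pkg] = Nil)

  // counterpart of closure(p => p.getOwnedPackages()): the packages and all nested packages
  def allPackages(ps: List[Pkg]): List[Pkg] = ps.flatMap(p => p :: allPackages(p.subPackages))
  // counterpart of closure(c => c.getInnerClasses()): the classes and all inner classes
  def allClasses(cs: List[Clazz]): List[Clazz] = cs.flatMap(c => c :: allClasses(c.innerClasses))

  def avgMethodsPerClass(model: List[Pkg]): Double = {
    val classes = allClasses(allPackages(model).flatMap(_.classes))
    classes.map(_.methods.toDouble).sum / classes.size
  }

  // two top-level packages, one nested package, one inner class
  val model = List(
    Pkg("core",
      classes = List(Clazz("Parser", 6, List(Clazz("Parser.State", 2)))),
      subPackages = List(Pkg("core.util", List(Clazz("Strings", 4))))),
    Pkg("ui", List(Clazz("Window", 8)))
  )

  def main(args: Array[String]): Unit =
    println(avgMethodsPerClass(model)) // (6 + 2 + 4 + 8) / 4 = 5.0
}
```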
2) Scala and internal DSLs: Semantics
■ Three different semantics, one interface
★ immediate collection
★ lazy iterator
★ Map/Reduce database
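One way to picture "three semantics, one interface": the same query runs unchanged against three sources that evaluate it differently. A minimal sketch with a one-method interface. Source, Immediate, Lazy and MapReduce are made-up names, and the map/reduce variant only imitates the idea on in-memory partitions, without Hadoop or a database:

```scala
object ThreeSemantics {
  // the one interface: a query maps every element to a number and adds the numbers up
  trait Source[E] { def sum(expr: E => Double): Double }

  // 1) immediate collection: all elements are in memory, the result is computed at once
  class Immediate[E](elements: List[E]) extends Source[E] {
    def sum(expr: E => Double): Double = elements.map(expr).sum
  }
  // 2) lazy iterator: elements are produced one by one, only while the query consumes them
  class Lazy[E](elements: () => Iterator[E]) extends Source[E] {
    def sum(expr: E => Double): Double = elements().foldLeft(0.0)((acc, e) => acc + expr(e))
  }
  // 3) map/reduce: every partition is processed on its own (map), partial results are merged (reduce)
  class MapReduce[E](partitions: List[List[E]]) extends Source[E] {
    def sum(expr: E => Double): Double =
      partitions.map(p => p.map(expr).sum).foldLeft(0.0)(_ + _)
  }

  // one query, written once against the interface
  def totalLength(names: Source[String]): Double = names.sum(n => n.length.toDouble)

  val names = List("Repository", "Revision", "Diff", "Model")
  val results = List(
    totalLength(new Immediate(names)),
    totalLength(new Lazy(() => names.iterator)),
    totalLength(new MapReduce(List(names.take(2), names.drop(2))))
  )

  def main(args: Array[String]): Unit = println(results) // List(27.0, 27.0, 27.0)
}
```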
Example Analysis
First Example Case Study: Design Structure
Matrices (DSM) and Propagation Cost
Alan MacCormack, John Rusnak, Carliss Baldwin: Exploring the Structure of Complex Software
Designs, Journal of the institute of operations research and management science, 2006
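A minimal sketch of how a propagation cost can be computed from a DSM, assuming the usual reading of the terms: the DSM marks direct dependencies, its transitive closure (the visibility matrix) adds indirect ones, and the propagation cost is the share of cells set in that closure. The six-element system is made up, and counting each element as visible to itself is an assumption. The exact definition in the cited paper may differ in such details:

```scala
object PropagationCostSketch {
  // made-up system of six elements: element -> elements it directly depends on
  val names = Vector("A", "B", "C", "D", "E", "F")
  val dependsOn = Map("A" -> Set("B", "C"), "B" -> Set("D"), "C" -> Set("E"), "E" -> Set("F"))
  val n = names.size

  // DSM: dsm(i)(j) is true when element i directly depends on element j
  val dsm: Vector[Vector[Boolean]] =
    Vector.tabulate(n, n)((i, j) => dependsOn.getOrElse(names(i), Set.empty[String]).contains(names(j)))

  // visibility: direct and indirect dependencies (transitive closure, Warshall's algorithm);
  // every element is also counted as visible to itself (assumption)
  val visible: Array[Array[Boolean]] = {
    val v = Array.tabulate(n, n)((i, j) => i == j || dsm(i)(j))
    for (k <- 0 until n; i <- 0 until n; j <- 0 until n)
      if (v(i)(k) && v(k)(j)) v(i)(j) = true
    v
  }

  val visibleCells = visible.map(row => row.count(cell => cell)).sum
  // propagation cost: share of cells that are set in the visibility matrix
  val propagationCost = visibleCells.toDouble / (n * n)

  def main(args: Array[String]): Unit =
    println(s"$visibleCells of ${n * n} cells, propagation cost ${math.round(propagationCost * 1000) / 10.0}%")
    // 15 of 36 cells, propagation cost 41.7%
}
```

In this toy system a change to F can reach E, C and A, although only E depends on F directly. The metric is meant to capture this kind of indirect reach.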
Second Example Case Study: Detecting
Cross-Cutting Concerns
Silvia Breu, Thomas Zimmermann, Christian Lindig: Mining Eclipse for Cross-Cutting Concerns,
MSR'06, Shanghai, 2006
■ The same set of methods being called from different locations
within the same transaction (commits in a small time window
by the same committer) indicates the introduction
of a cross-cutting concern.
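The heuristic can be sketched in two steps: group commits into transactions, then look for sets of methods that are called from several locations within one transaction. A simplified sketch under stated assumptions: the input records (Commit, Call), the 200-second window and the minimum of two locations are all made up, and only locations with exactly the same set of callees are matched, which may be cruder than the technique in the cited paper:

```scala
object CrossCuttingSketch {
  // hypothetical input: the method calls each commit introduces, as (location, callee) pairs
  case class Call(location: String, callee: String)
  case class Commit(committer: String, timestamp: Long, calls: Set[Call])

  val window = 200L // seconds; made-up value for "a small time window"

  // commits by the same committer that follow each other within `window` form one transaction
  def transactions(commits: List[Commit]): List[List[Commit]] =
    commits.groupBy(_.committer).values.toList.flatMap { own =>
      own.sortBy(_.timestamp).foldLeft(List.empty[List[Commit]]) {
        case (current :: done, c) if c.timestamp - current.head.timestamp <= window =>
          (c :: current) :: done
        case (done, c) => List(c) :: done
      }
    }

  // sets of callees that are called from at least `minLocations` locations within one transaction
  def candidates(tx: List[Commit], minLocations: Int = 2): Map[Set[String], Set[String]] = {
    val calleesAt = tx.flatMap(_.calls).groupBy(_.location)
      .map { case (location, calls) => location -> calls.map(_.callee).toSet }
    calleesAt.groupBy(_._2)
      .map { case (callees, byLocation) => callees -> byLocation.keySet }
      .filter { case (_, locations) => locations.size >= minLocations }
  }

  val commits = List(
    Commit("anna", 1000L, Set(Call("Editor.save", "Log.enter"), Call("Editor.save", "Log.exit"))),
    Commit("anna", 1100L, Set(Call("Editor.open", "Log.enter"), Call("Editor.open", "Log.exit"))),
    Commit("anna", 9000L, Set(Call("Parser.parse", "Cache.get"))),
    Commit("ben", 1050L, Set(Call("Ui.paint", "Log.enter")))
  )
  val found = transactions(commits).flatMap(tx => candidates(tx))

  def main(args: Array[String]): Unit =
    for ((callees, locations) <- found)
      println(s"${callees.toList.sorted.mkString(", ")} called from ${locations.toList.sorted.mkString(", ")}")
    // Log.enter, Log.exit called from Editor.open, Editor.save
}
```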
Summary
[Figure: VCS, Model, Metrics, and "Better Understanding Software Engineering", connected by the following steps]
1) Reverse Engineering to create AST-level models of software and its evolution
2) Transformations based on MSR algorithms to derive implicit dependencies
2) Queries to perform measurements based on structural, causal, and implicit dependencies
Statistical analysis
