Dr. Markus Scheidgen
Model-based Analysis of
Large Scale Software
Repositories

  1. Dr. Markus Scheidgen: Model-based Analysis of Large Scale Software Repositories
     ■ problem
     ■ creating models of software repositories
     ■ the means for analyzing such models
     ■ example analysis
  2. Problem
  3. Is Software Engineering a Science?
     ■ Def.: Science (from Latin scientia) is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.
     ■ Testable? Example theses:
       ★ DSLs allow domain experts to develop software more effectively and more efficiently than with GPLs.
       ★ Static type systems lead to safer programming and fewer bugs.
       ★ Functional programming leads to less performant programs.
       ★ Scrum allows programs to be developed faster.
       ★ My framework allows you to develop ... more, faster ... with less, fewer ...
     ■ Methods for quantitative measurement of software properties (metrics) are mostly used to assess the state of software projects, and rarely for empirical studies of software engineering itself.
  4. Reasons
     inaccessibility
       • new methods have to be used first to produce data
       • industry cooperation is necessary
       • open-source repositories are a possibility
     data quality
       • it is not easy to distinguish between written code, generated code, and test code
       • there are maintained projects, projects under development, and aborted projects
     heterogeneity
       • different project structures
       • different paradigms
       • different languages
       • different APIs
     amounts of data
       • SourceForge hosts >350,000 projects
       • a current snapshot of the Linux kernel contains 10^8 AST nodes
       • EMF's 50 MB Git repository takes 20 GB as binary-encoded AST data
  5. Relevant Fields with Partial Solutions
     Mining Software Repositories (MSR): analysis of the rich data contained in software-engineering-related repositories such as version control systems, mailing lists, and bug-tracking systems
       • guiding software development
       • defect detection, prediction, resolution
       • gaining actionable knowledge about software projects and software engineering methodologies
       • static, language independent
       • syntax based
       • scale: single projects, large scale (Eclipse, Apache), ultra large scale (SourceForge, GitHub)
     Software Metrics: definition, acquisition, and analysis of quantitative measures of certain software properties
       • assessment of engineering costs for development, change, maintenance, etc.
       • comparative analysis of software systems or analysis of software evolution
       • comparative analysis of software engineering methodologies
       • language independent (e.g. LOC)
       • syntax based (e.g. McCabe)
       • static, dynamic (evolution)
     Reverse Engineering: analyzing existing code bases to create representations at a higher level of abstraction (models)
       • understanding existing software for development, change, maintenance, etc.
       • deriving AST, UML, or KDM models from software
       • syntax (structure, behavior)
       • semantics
  6. Problem Statement: Everything is there, but ...
     1. Missing abstractions:
       ■ no general abstractions that cover multiple languages/repositories are used
       ■ only proprietary solutions and systems tailored to specific algorithms/databases, languages, and repositories
     2. Scalability is an issue:
       ■ for ultra large scale repositories, only VCS meta-data is used
       ■ for large scale repositories, only language-independent analysis at file-based granularity is possible
       ■ only for single software projects is language-dependent analysis at AST-level detail feasible
  7. Proposed Solution: Scalable Model-based Framework
     ■ Meta-model and reverse-engineering based approach to analyze code models on different and well-defined levels of abstraction instead of the code itself.
     ■ Query and transformation languages as well as model persistence based on the Map/Reduce BigData paradigm.
     ■ Target: AST-level analysis of large-scale repositories, e.g. git.eclipse.org (>300 projects)
  8. SrcRepo: A Framework for Large Scale Repository Analysis
  14. Model-based Analysis of Large Scale Software Repositories
      [Diagram labels: VCS, Model, Metrics, Better Understanding Software Engineering]
      1) Reverse engineering to create AST-level models of software and its evolution
      2) Transformations based on MSR algorithms to derive implicit dependencies
      2) Queries to perform measurements based on structural, causal, and implicit dependencies
      3) Statistical analysis
  15. 1) Reverse Engineering Software in Version Control Systems (VCS)
      [Figure: "Code in a VCS" next to "Software Model". Labels: code, revisions, files, causal relations, structural relations]
  16. 1) Models of Source Repositories (github.com/markus1978/srcrepo)
      [Figure: architecture diagram. Labels: SrcRepo, EMF/EMF-Fragments, EMF Compare, JGit, MoDisco, git repository with Java sources]
  17. 1) Models of Source Repositories (github.com/markus1978/srcrepo)
      [Figure: repository example with branches and revisions (B1.R1 to B1.R4, B2.R1, B2.R2) and the meta-model. Meta-model elements: Repository, Revision (prev/next), Diff, Compilation Unit Model, Package, Class, ...; references: usageIn, Package Access, package, extends; stereotypes: «relation», «fragmentation»; sources: JGit, MoDisco; model/metamodel]
  18. 1) Models of Source Repositories: Scalability
      SrcRepo is based on EMF-Fragments (https://github.com/markus1978/emf-fragments)
      [Figure: technology stack. Labels: model transformations, map/reduce (Hadoop); structured data (EMF model); key-value store (EMF resources) (HBase); DFS (HDFS); "Share Nothing" nodes cluster]
  19. 2) Scala for queries and transformations: Syntax (internal DSL: from OCL to Scala)
      Filip Krikava: Enriching EMF Models with Scala (quick overview), Eclipse Summit, Oct 24 2012
  20. 2) Scala for Queries: Syntax
          def exists(predicate: (E) => Boolean): Boolean
          def forAll(predicate: (E) => Boolean): Boolean
          def select(predicate: (E) => Boolean): Collection[E]
          def reject(predicate: (E) => Boolean): Collection[E]
          def collect[R](expr: (E) => R): Collection[R]
          def collectAll[R](expr: (E) => Collection[R]): Collection[R]
          def closure(expr: (E) => Collection[E]): Collection[E]
          def aggregate[R](expr: (E) => R, start: () => R, aggr: (R, R) => R): R
          def sum(expr: (E) => Double): Double
          def product(expr: (E) => Double): Double
          def max(expr: (E) => Double): Double
          def min(expr: (E) => Double): Double
          def average(expr: (E) => Double): Double
          ...
          def run(runnable: (E) => Unit): Unit
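To make the operator list above more concrete, here is a minimal sketch of how a few of these signatures could be backed by an ordinary Scala `List`. The names `OclQueries` and `Query` are hypothetical and chosen only for this illustration; this is not the SrcRepo or EMF implementation, and it evaluates every operator eagerly.

```scala
object OclQueries {
  // Hypothetical eager wrapper around a List (an assumption for illustration,
  // not the SrcRepo API). Each operator delegates to the standard library.
  final case class Query[E](elems: List[E]) {
    def exists(predicate: E => Boolean): Boolean = elems.exists(predicate)
    def forAll(predicate: E => Boolean): Boolean = elems.forall(predicate)
    def select(predicate: E => Boolean): Query[E] = Query(elems.filter(predicate))
    def reject(predicate: E => Boolean): Query[E] = Query(elems.filterNot(predicate))
    def collect[R](expr: E => R): Query[R] = Query(elems.map(expr))
    // collectAll flattens the per-element results into one collection.
    def collectAll[R](expr: E => Query[R]): Query[R] =
      Query(elems.flatMap(e => expr(e).elems))
    // aggregate maps every element with expr and folds the results with aggr.
    def aggregate[R](expr: E => R, start: () => R, aggr: (R, R) => R): R =
      elems.foldLeft(start())((acc, e) => aggr(acc, expr(e)))
    def sum(expr: E => Double): Double = aggregate[Double](expr, () => 0.0, _ + _)
    // By choice of this sketch, the average of an empty collection is 0.0.
    def average(expr: E => Double): Double =
      if (elems.isEmpty) 0.0 else sum(expr) / elems.size
  }
}
```

Expressing `sum` through `aggregate` mirrors the shape of the `aggregate` signature on the slide: a per-element mapping plus an associative combination step.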
  21. 2) Scala for Queries: Syntax
      ■ Example SrcRepo query: "average number of methods per class"
          def avgMethodsPerClass(self: Model): Double = {
            // all packages, including nested ones
            val packages = self.getOwnedPackages().closure((p) => p.getOwnedPackages())
            // collectAll (not collect) flattens the per-package class collections
            // before the closure follows inner classes
            val classes = packages.collectAll((p) => p.getOwnedClasses())
              .closure((c) => c.getInnerClasses())
            classes.average((c) => c.getOwnedMethods().size())
          }
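The query above depends on the EMF model classes, so it cannot be run on its own. The following sketch reproduces the same computation on a small in-memory model. `Clazz`, `Pkg`, and the `closure` helper are hypothetical stand-ins introduced here; the closure is assumed to return the start elements plus everything reachable from them, as in OCL.

```scala
object AvgMethods {
  // Hypothetical stand-ins for the model elements used in the query.
  final case class Clazz(name: String, methods: List[String], inner: List[Clazz] = Nil)
  final case class Pkg(name: String, classes: List[Clazz], subPackages: List[Pkg] = Nil)

  // OCL-style closure: the roots plus everything reachable from them via expr.
  // Elements are compared with ==, so structurally equal elements count once.
  def closure[E](roots: List[E])(expr: E => List[E]): List[E] = {
    @annotation.tailrec
    def loop(frontier: List[E], seen: List[E]): List[E] = frontier match {
      case Nil => seen.reverse
      case head :: tail =>
        if (seen.contains(head)) loop(tail, seen)
        else loop(expr(head) ::: tail, head :: seen)
    }
    loop(roots, Nil)
  }

  def avgMethodsPerClass(topLevel: List[Pkg]): Double = {
    // all packages, including nested ones
    val packages = closure(topLevel)(_.subPackages)
    // all classes of those packages, including inner classes
    val classes = closure(packages.flatMap(_.classes))(_.inner)
    if (classes.isEmpty) 0.0
    else classes.map(_.methods.size.toDouble).sum / classes.size
  }
}
```

For a root package with a class of three methods that has an inner class of one method, plus a sub-package with a class of two methods, the result is 6 / 3 = 2.0.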
  22. 2) Scala and internal DSLs: Semantics
      ■ Three different semantics, one interface
        ■ immediate collection
        ■ lazy iterator
        ■ Map/Reduce database
  23. Example Analysis
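The idea of "three different semantics, one interface" can be sketched as follows. All names here (`Query`, `Immediate`, `Lazy`, `Partitioned`) are assumptions made for this illustration. The map/reduce variant is only simulated in memory with a list of partitions; it does not use Hadoop or HBase.

```scala
object QuerySemantics {
  // One hypothetical interface shared by all three evaluation strategies.
  trait Query[E] {
    def select(predicate: E => Boolean): Query[E]
    def collect[R](expr: E => R): Query[R]
    def sum(expr: E => Double): Double
  }

  // Immediate: every operator materializes a new list.
  final class Immediate[E](elems: List[E]) extends Query[E] {
    def select(predicate: E => Boolean): Query[E] = new Immediate[E](elems.filter(predicate))
    def collect[R](expr: E => R): Query[R] = new Immediate[R](elems.map(expr))
    def sum(expr: E => Double): Double = elems.map(expr).sum
  }

  // Lazy: operators only compose iterators; elements are visited when sum runs.
  final class Lazy[E](elems: () => Iterator[E]) extends Query[E] {
    def select(predicate: E => Boolean): Query[E] = new Lazy[E](() => elems().filter(predicate))
    def collect[R](expr: E => R): Query[R] = new Lazy[R](() => elems().map(expr))
    def sum(expr: E => Double): Double = elems().map(expr).sum
  }

  // Map/reduce style, simulated: each partition yields a partial sum (map),
  // and the partial sums are then combined (reduce).
  final class Partitioned[E](partitions: List[List[E]]) extends Query[E] {
    def select(predicate: E => Boolean): Query[E] =
      new Partitioned[E](partitions.map(_.filter(predicate)))
    def collect[R](expr: E => R): Query[R] =
      new Partitioned[R](partitions.map(_.map(expr)))
    def sum(expr: E => Double): Double =
      partitions.map(p => p.map(expr).sum).foldLeft(0.0)(_ + _)
  }

  // A query written once against the interface runs under all three semantics.
  def totalLengthOfLongNames(q: Query[String]): Double =
    q.select(_.length > 3).collect(_.length).sum(_.toDouble)
}
```

The query code stays the same; only the object it is applied to decides whether evaluation is eager, deferred, or split across partitions.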
  24. First Example Case Study: Design Structure Matrices (DSM) and Propagation Cost
      Alan MacCormack, John Rusnak, Carliss Baldwin: Exploring the Structure of Complex Software Designs, Management Science (INFORMS), 2006
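The slides only name the metric, so the following is a sketch under stated assumptions: the dependency matrix is given as a Boolean matrix, the visibility matrix is its transitive closure including each element's dependency on itself, and propagation cost is the density of that visibility matrix. The names `PropagationCost`, `visibility`, and `propagationCost` are hypothetical, and this is not the SrcRepo implementation.

```scala
object PropagationCost {
  // deps(i)(j) == true means element i directly depends on element j.
  def visibility(deps: Vector[Vector[Boolean]]): Vector[Vector[Boolean]] = {
    val n = deps.length
    // Start from direct dependencies plus self-visibility (path length 0).
    val reach = Array.tabulate(n, n)((i, j) => i == j || deps(i)(j))
    // Warshall's algorithm computes the transitive closure in place.
    for (k <- 0 until n; i <- 0 until n; j <- 0 until n)
      if (reach(i)(k) && reach(k)(j)) reach(i)(j) = true
    reach.map(_.toVector).toVector
  }

  // Fraction of element pairs (i, j) where i can be affected by a change to j.
  def propagationCost(deps: Vector[Vector[Boolean]]): Double = {
    val n = deps.length
    if (n == 0) 0.0
    else visibility(deps).map(_.count(identity)).sum.toDouble / (n * n)
  }
}
```

For a chain A depends on B, B depends on C, the visibility matrix has 3 + 2 + 1 = 6 entries set out of 9, so the propagation cost is 6/9.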
  26. Second Example Case Study: Detecting Cross-Cutting Concerns
      Silvia Breu, Thomas Zimmermann, Christian Lindig: Mining Eclipse for Cross-Cutting Concerns, MSR '06, Shanghai, 2006
      ■ The same set of methods being called from different locations within the same transaction (commits in a small time window by the same committer) indicates the introduction of a cross-cutting concern.
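The heuristic stated on this slide can be sketched in two steps: group added method calls into transactions, then look for identical callee sets that appear at several locations within one transaction. The record type `AddedCall`, the sliding-window rule, and the exact-set comparison are simplifying assumptions made for this illustration; they are not taken from the cited paper or from SrcRepo.

```scala
object CrossCutting {
  // Hypothetical input: one record per method call added by a commit.
  final case class AddedCall(committer: String, timestamp: Long, location: String, callee: String)

  // Transactions: calls by the same committer whose timestamps are at most
  // windowSeconds apart from the previous call (sliding time window).
  def transactions(calls: List[AddedCall], windowSeconds: Long): List[List[AddedCall]] =
    calls.groupBy(_.committer).values.toList.flatMap { byCommitter =>
      byCommitter.sortBy(_.timestamp).foldLeft(List.empty[List[AddedCall]]) {
        case (current :: done, call) if call.timestamp - current.head.timestamp <= windowSeconds =>
          (call :: current) :: done
        case (done, call) => List(call) :: done
      }
    }

  // Within one transaction: callee sets that were added at minLocations or
  // more locations, mapped to those locations.
  def candidates(tx: List[AddedCall], minLocations: Int): Map[Set[String], Set[String]] = {
    val calleesByLocation: Map[String, Set[String]] =
      tx.groupBy(_.location).map { case (loc, cs) => loc -> cs.map(_.callee).toSet }
    calleesByLocation.toList
      .groupBy { case (_, callees) => callees }
      .map { case (callees, locs) => callees -> locs.map(_._1).toSet }
      .filter { case (_, locs) => locs.size >= minLocations }
  }
}
```

A candidate such as the pair `Lock.acquire` and `Lock.release` added to two different methods in one transaction would be reported as a possible cross-cutting concern.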
  30. Summary
      [Diagram labels: VCS, Model, Metrics, Better Understanding Software Engineering]
      1) Reverse engineering to create AST-level models of software and its evolution
      2) Transformations based on MSR algorithms to derive implicit dependencies
      2) Queries to perform measurements based on structural, causal, and implicit dependencies
      3) Statistical analysis