Model-based Analysis of Large Scale Software Repositories
Transcript

  • 1. Dr. Markus Scheidgen: Model-based Analysis of Large Scale Software Repositories ■ problem ■ creating models of software repositories ■ the means for analyzing such models ■ example analysis
  • 2. Problem
  • 3. Is Software Engineering a Science? ■ Def.: Science (from Latin scientia) is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe. ■ Testable? Example theses: ★ DSLs allow domain experts to develop software more effectively and more efficiently than with GPLs. ★ Static type systems lead to safer programming and fewer bugs. ★ Functional programming leads to less performant programs. ★ Scrum allows programs to be developed faster. ★ My framework allows you to develop ... more, faster ... with less, fewer ... ■ Methods for quantitatively measuring software properties (metrics) are mostly used to assess the state of software projects, and rarely for empirical studies of software engineering itself.
  • 4. Reasons ■ inaccessibility: • new methods have to be used first to produce data • industry cooperations necessary • open-source repositories are a possibility ■ data quality: • not easy to distinguish between written code, generated code, and test code • there are maintained projects, developed projects, and aborted projects ■ heterogeneity: • different project structures • different paradigms • different languages • different APIs ■ amounts of data: • SourceForge hosts >350,000 projects • the current snapshot of the Linux kernel contains 10^8 AST nodes • EMF's 50 MB Git repository takes 20 GB of binary-encoded AST data
  • 5. Relevant Fields with Partial Solutions ■ Mining Software Repositories (MSR): analysis of rich data contained in software-engineering-related repositories such as version control systems, mailing lists, and bug-tracking systems • guiding software development • defect detection, prediction, resolution • gaining actionable knowledge about software projects and software engineering methodologies • static, language independent • syntax based • scale: single projects, large scale (Eclipse, Apache), ultra large scale (SourceForge, GitHub) ■ Software Metrics: definition, acquisition, and analysis of quantitative measures of certain software properties • assessment of engineering costs for development, change, maintenance, etc. • comparative analysis of software systems or analysis of software evolution • comparative analysis of software engineering methodologies • language independent (e.g. LOC) • syntax based (e.g. McCabe) • static, dynamic (evolution) ■ Reverse Engineering: analyzing existing code bases to create representations at a higher level of abstraction (models) • understanding existing software for development, change, maintenance, etc. • deriving AST, UML, or KDM models from software • syntax (structure, behavior) • semantics
  • 6. Problem Statement: Everything is there, but ... 1. Missing abstractions: ■ no general abstractions covering multiple languages/repositories are used ■ only proprietary solutions and systems tailored to specific algorithms/databases, languages, or repositories 2. Scalability is an issue: ■ for ultra-large-scale repositories only VCS metadata is used ■ for large-scale repositories only language-independent analysis at file-based granularity is possible ■ language-dependent analysis at AST-level detail is feasible only for single software projects
  • 7. Proposed Solution: Scalable Model-based Framework ■ Meta-model and reverse engineering based approach to analyze code models on different and well-defined levels of abstraction instead of the code itself. ■ Query and transformation languages as well as model persistence based on the Map/Reduce BigData paradigm. ■ Target: AST-level analysis of large-scale repositories, e.g. git.eclipse.org (>300 projects)
  • 8. SrcRepo: A Framework for Large Scale Repository Analysis
  • 14. Model-based Analysis of Large Scale Software Repositories ■ VCS → Model → Metrics ■ 1) Reverse engineering to create AST-level models of software and its evolution ■ 2) Transformations based on MSR algorithms to derive implicit dependencies ■ 2) Queries to perform measurements based on structural, causal, and implicit dependencies ■ 3) Statistical analysis ■ Goal: a better understanding of software engineering
  • 15. 1) Reverse Engineering Software in Version Control Systems (VCS) [Figure: "Code in a VCS", code units arranged by revisions and files and linked by causal relations and structural relations, next to the resulting "Software Model"]
  • 16. 1) Models of Source Repositories (github.com/markus1978/srcrepo) [Figure: components SrcRepo, EMF/EMF-Fragments, EMF Compare, JGit, and MoDisco, applied to a git repository with Java sources]
  • 17. 1) Models of Source Repositories (github.com/markus1978/srcrepo) [Figure: example model with compilation units A, B, C, D across revisions B1.R1 to B1.R4 and B2.R1 to B2.R2; metamodel with Repository, Revision (prev/next), Diff, Compilation Unit, Model, Package, Class, ..., and Package Access (usageIn, package), combining JGit and MoDisco; references annotated «relation» and/or «fragmentation»]
  • 18. 1) Models of Source Repositories: Scalability ■ SrcRepo is based on EMF-Fragments (https://github.com/markus1978/emf-fragments) ■ [Figure: technology stack: map/reduce (Hadoop), key-value store (HBase) holding EMF resources, DFS (HDFS), on a cluster of “share nothing” nodes; structured data (EMF model) and model transformations on top]
  • 19. 2) Scala for queries and transformations: Syntax (internal DSL: from OCL to Scala) ■ Filip Krikava: Enriching EMF Models with Scala (quick overview), Eclipse Summit, Oct 24 2012
  • 20. 2) Scala for Queries: Syntax
        def exists(predicate: (E) => Boolean): Boolean
        def forAll(predicate: (E) => Boolean): Boolean
        def select(predicate: (E) => Boolean): Collection[E]
        def reject(predicate: (E) => Boolean): Collection[E]
        def collect[R](expr: (E) => R): Collection[R]
        def collectAll[R](expr: (E) => Collection[R]): Collection[R]
        def closure(expr: (E) => Collection[E]): Collection[E]
        def aggregate[R](expr: (E) => R, start: () => R, aggr: (R, R) => R): R
        def sum(expr: (E) => Double): Double
        def product(expr: (E) => Double): Double
        def max(expr: (E) => Double): Double
        def min(expr: (E) => Double): Double
        def average(expr: (E) => Double): Double
        ...
        def run(runnable: (E) => Unit): Unit
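To make the intent of these signatures concrete, here is a minimal sketch of how a few of them could be implemented over an in-memory collection. The `Query` wrapper class is a hypothetical name chosen for illustration and is not the SrcRepo API; it also assumes that `closure` returns the source elements together with everything reachable from them, which is how the example query on the next slide appears to use it.

```scala
// Illustrative only: `Query` is a hypothetical wrapper, not the SrcRepo API.
final class Query[E](val elements: Vector[E]) {
  def exists(predicate: E => Boolean): Boolean = elements.exists(predicate)
  def forAll(predicate: E => Boolean): Boolean = elements.forall(predicate)
  def select(predicate: E => Boolean): Query[E] = new Query(elements.filter(predicate))
  def reject(predicate: E => Boolean): Query[E] = new Query(elements.filterNot(predicate))
  def collect[R](expr: E => R): Query[R] = new Query(elements.map(expr))
  // Like collect, but flattens the per-element results into one collection.
  def collectAll[R](expr: E => Iterable[R]): Query[R] = new Query(elements.flatMap(expr))

  // Assumption: the result contains the source elements plus everything
  // reachable by repeatedly applying `expr`; each element appears once.
  def closure(expr: E => Iterable[E]): Query[E] = {
    val seen = scala.collection.mutable.LinkedHashSet.empty[E]
    var frontier = elements
    while (frontier.nonEmpty) {
      val fresh = frontier.filter(e => seen.add(e)) // add is true only for new elements
      frontier = fresh.flatMap(expr)
    }
    new Query(seen.toVector)
  }

  def sum(expr: E => Double): Double = elements.map(expr).sum
  // Assumption: the average of an empty collection is reported as 0.0.
  def average(expr: E => Double): Double =
    if (elements.isEmpty) 0.0 else sum(expr) / elements.size
}
```

Tracking visited elements in `closure` keeps the traversal finite even when the navigated structure contains cycles, which matters for dependency graphs.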
  • 21. 2) Scala for Queries: Syntax ■ example SrcRepo query: “average number of methods per class”
        def avgMethodsPerClass(self: Model) = {
          // all packages, including nested ones
          val packages = self.getOwnedPackages().
            closure((p) => p.getOwnedPackages());
          // all classes of those packages (flattened), including inner classes
          val classes = packages.collectAll((p) => p.getOwnedClasses()).
            closure((c) => c.getInnerClasses());
          // the last expression is the result
          classes.average((c) => c.getOwnedMethods().size());
        }
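The query above can be tried out against a toy model. The following sketch uses plain Scala collections (`flatMap` and a small recursive helper) in place of the slide's query operators. All type names (`Method`, `Clazz`, `Pkg`, `Model`) and the `closure` helper are hypothetical stand-ins for the MoDisco Java metamodel, and the helper assumes tree-shaped containment with no cycles.

```scala
object AvgMethodsExample {
  // Hypothetical, heavily simplified stand-ins for the Java code model.
  final case class Method(name: String)
  final case class Clazz(name: String, methods: Seq[Method], innerClasses: Seq[Clazz] = Nil)
  final case class Pkg(name: String, subPackages: Seq[Pkg] = Nil, classes: Seq[Clazz] = Nil)
  final case class Model(packages: Seq[Pkg])

  // Roots plus everything reachable via `next` (assumes a tree, no cycles).
  def closure[A](roots: Seq[A])(next: A => Seq[A]): Seq[A] =
    roots.flatMap(r => r +: closure(next(r))(next))

  def avgMethodsPerClass(model: Model): Double = {
    val packages = closure(model.packages)(_.subPackages)
    val classes  = closure(packages.flatMap(_.classes))(_.innerClasses)
    if (classes.isEmpty) 0.0
    else classes.map(_.methods.size).sum.toDouble / classes.size
  }
}
```

For a model with classes owning 3, 1, and 2 methods, one of them an inner class and one in a nested package, the function returns 2.0.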
  • 22. 2) Scala and internal DSLs: Semantics ■ Three different semantics, one interface ■ immediate collection ■ lazy iterator ■ Map/Reduce database
  • 23. Example Analysis
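The idea of one interface with several execution semantics can be sketched as a trait with interchangeable implementations. The names below (`Query`, `Immediate`, `Lazy`, `totalEvenSquares`) are hypothetical and only the first two semantics are shown; a Map/Reduce-backed variant would, under the same assumption, implement the same trait by translating each step into a distributed job.

```scala
object SemanticsSketch {
  // One query interface, written once by the analyst.
  trait Query[E] {
    def select(predicate: E => Boolean): Query[E]
    def collect[R](expr: E => R): Query[R]
    def sum(expr: E => Double): Double
  }

  // Immediate semantics: every step materializes a full collection.
  final class Immediate[E](elems: Vector[E]) extends Query[E] {
    def select(predicate: E => Boolean): Query[E] = new Immediate(elems.filter(predicate))
    def collect[R](expr: E => R): Query[R] = new Immediate(elems.map(expr))
    def sum(expr: E => Double): Double = elems.map(expr).sum
  }

  // Lazy semantics: steps are chained on an iterator factory and only run in `sum`.
  final class Lazy[E](elems: () => Iterator[E]) extends Query[E] {
    def select(predicate: E => Boolean): Query[E] = new Lazy(() => elems().filter(predicate))
    def collect[R](expr: E => R): Query[R] = new Lazy(() => elems().map(expr))
    def sum(expr: E => Double): Double = elems().map(expr).sum
  }

  // The same query text works against either implementation.
  def totalEvenSquares(q: Query[Int]): Double =
    q.select(_ % 2 == 0).collect(n => n * n).sum(_.toDouble)
}
```

The lazy variant never holds an intermediate collection in memory, which is the property that matters when a model is far larger than the heap.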
  • 24. First Example Case Study: Design Structure Matrices (DSM) and Propagation Costs ■ Alan MacCormack, John Rusnak, Carliss Baldwin: Exploring the Structure of Complex Software Designs, Management Science (INFORMS), 2006
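As a rough illustration of the metric, the sketch below assumes the usual definition of propagation cost as the density of the visibility matrix, that is, the transitive closure of the direct-dependency matrix with every element also counted as visible to itself. The object and function names are hypothetical, and the closure is computed with Warshall's algorithm rather than by summing matrix powers.

```scala
object PropagationCost {
  // dsm(i)(j) == true means element i depends directly on element j.
  def visibility(dsm: Vector[Vector[Boolean]]): Vector[Vector[Boolean]] = {
    val n = dsm.length
    // Start from direct dependencies plus the diagonal (paths of length 0).
    val reach = Array.tabulate(n, n)((i, j) => i == j || dsm(i)(j))
    // Warshall's algorithm: allow element k as an intermediate step.
    for (k <- 0 until n; i <- 0 until n; j <- 0 until n)
      if (reach(i)(k) && reach(k)(j)) reach(i)(j) = true
    reach.map(_.toVector).toVector
  }

  // Fraction of non-empty cells in the visibility matrix.
  def propagationCost(dsm: Vector[Vector[Boolean]]): Double = {
    val n = dsm.length
    if (n == 0) 0.0
    else visibility(dsm).map(_.count(identity)).sum.toDouble / (n.toDouble * n)
  }

  // Convenience constructor from a set of (from, to) dependency edges.
  def fromEdges(n: Int, edges: Set[(Int, Int)]): Vector[Vector[Boolean]] =
    Vector.tabulate(n, n)((i, j) => edges.contains((i, j)))
}
```

For a six-element system with dependencies 0→1, 0→2, 1→3, 2→4, and 4→5, the visibility matrix has 15 filled cells out of 36, so the propagation cost is about 0.42: on average, a change to one element can affect roughly 42% of the system.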
  • 26. Second Example Case Study: Detecting Cross-Cutting Concerns ■ Silvia Breu, Thomas Zimmermann, Christian Lindig: Mining Eclipse for Cross-Cutting Concerns, MSR'06, Shanghai, 2006 ■ The same set of methods being called from different locations within the same transaction (commits in a small time window by the same committer) indicates the introduction of a cross-cutting concern.
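The heuristic can be sketched in a few lines over a flat list of added method calls. This is a simplified illustration, not the cited authors' algorithm: the `AddedCall` record, the gap-based grouping of commits into transactions, and the requirement that callee sets match exactly are all assumptions made for the example.

```scala
object CrossCuttingSketch {
  // One added method call, as it might be extracted from a revision diff (hypothetical record).
  final case class AddedCall(committer: String, timestamp: Long, caller: String, callee: String)

  // Transactions: calls by the same committer whose consecutive timestamps
  // are at most `windowSeconds` apart (an assumed notion of "small time window").
  def transactions(calls: Seq[AddedCall], windowSeconds: Long): Seq[Seq[AddedCall]] =
    calls.groupBy(_.committer).values.toSeq.flatMap { own =>
      own.sortBy(_.timestamp).foldLeft(Vector.empty[Vector[AddedCall]]) { (groups, call) =>
        groups.lastOption match {
          case Some(last) if call.timestamp - last.last.timestamp <= windowSeconds =>
            groups.init :+ (last :+ call)
          case _ => groups :+ Vector(call)
        }
      }
    }

  // Candidates: (callee set, caller locations) where the same callee set was
  // added at `minLocations` or more distinct callers within one transaction.
  def candidates(calls: Seq[AddedCall], windowSeconds: Long,
                 minLocations: Int): Set[(Set[String], Set[String])] =
    transactions(calls, windowSeconds).flatMap { tx =>
      val calleesByCaller =
        tx.groupBy(_.caller).map { case (caller, cs) => caller -> cs.map(_.callee).toSet }
      calleesByCaller.groupBy(_._2).collect {
        case (callees, byCaller) if byCaller.size >= minLocations => (callees, byCaller.keySet)
      }
    }.toSet
}
```

With a 60-second window and `minLocations = 2`, a committer who adds `Lock.acquire` and `Lock.release` calls to both `A.foo` and `B.bar` in quick succession yields one candidate, the callee set {`Lock.acquire`, `Lock.release`} at locations {`A.foo`, `B.bar`}.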
  • 30. Summary ■ VCS → Model → Metrics ■ 1) Reverse engineering to create AST-level models of software and its evolution ■ 2) Transformations based on MSR algorithms to derive implicit dependencies ■ 2) Queries to perform measurements based on structural, causal, and implicit dependencies ■ Statistical analysis ■ Goal: a better understanding of software engineering