Model Comparison for Delta-Compression
Markus Scheidgen
scheidge@informatik.hu-berlin.de
@mscheidgen
BigMDE at STAF 2016, Vienna
Agenda
▶ Motivation for Delta-Compression
▶ Model Comparison: Approaches
▶ Experiments
▶ Conclusions
Motivation – Delta-Compression
▶ What it is: Only store the differences of similar models
▶ Where do we have a lot of similar models:
■ Model Versioning
■ Model-based Mining of Source Repositories with reverse
engineering
▶ Why: Storage space and (indirectly) execution time for
persistence operations (I/O, etc.)
Model Versioning – Approaches
(or versioning in general)
▶ state-based (r0, r1, r2, r3): e.g. models stored in regular version control systems (VCS)
▶ change-based (or operation-based): e.g. EMF-store, requires recording or inferring operations from the editing environment (or compare?)
▶ hybrid (persist changes, appear state-based): e.g. Git: you only see whole files, internally uses pack-files with delta-compression
1. Altmanninger, K., Seidl, M., Wimmer, M.: A survey on model versioning approaches. International Journal of Web Information Systems 5(3), 271–304 (2009)
Model Versioning – Architecture
[Figure: user environment, interface, state representation, compression, persistence; annotations: "show existing diff", "create diff"]
Model Comparison – Tradeoffs
▶ Comparison Quality
▶ Comparison Time
▶ Difference Model Usability
▶ Extraction Time
▶ Storage Space
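Delta-Compression – Tradeoffs
[Figure: Comparison Quality, Comparison Time, Difference Model Usability, Extraction Time, Storage Space, with the following legend]
▶ priority for showing diffs to users [model comparison]
▶ priority for using diffs in persistence [compression]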
Model Comparison
▶ We have known how to compare lists (e.g. lines of code) for a long time
■ Myers' algorithm: O(N*D)
▶ Models aren’t lists, but graphs (with spanning trees)
■ but each feature in each model element is a “list” of values
■ we can compare two model elements, feature by feature
■ but which pairs of elements should we compare?
■ we need a prior step to establish pairs of (supposedly) matching elements
Myers, E.W.: An O(ND) difference algorithm and its variations. Algorithmica 1(1–4), 251–266 (1986)
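The O(N*D) bound can be illustrated with a minimal Python sketch of the greedy algorithm from the cited paper. It only computes the length D of the shortest edit script (insertions plus deletions), not the script itself; the function name `myers_distance` is chosen here for illustration and is not part of any tool mentioned in these slides.

```python
def myers_distance(a, b):
    """Length D of the shortest edit script (inserts + deletes) turning a into b."""
    n, m = len(a), len(b)
    # v[k] holds the furthest x reached so far on diagonal k (k = x - y).
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # Decide whether to extend from diagonal k+1 (insert) or k-1 (delete).
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]
            else:
                x = v[k - 1] + 1
            y = x - k
            # Follow the "snake": consume equal items at no cost.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m
```

The same function works on lists of lines or on the value list of a single model feature, since it only relies on indexing and equality.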
Model Matching
[Figure: model 1 and model 2, connected by matches, from which the differences are derived]
▶ Matching determines the quality of the comparison
▶ Different strategies for matching model elements
■ signatures (cheap): just meta-class, [name, parameter types], parent
■ similarity (expensive): lots of different criteria and heuristics
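A minimal sketch of signature-based matching, assuming a signature built from the parent's signature, the meta-class, the name, and the parameter types. The `Element` class and all function names are hypothetical and do not reflect the EMF-Compress API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Element:
    # Hypothetical stand-in for a model element (not an EMF EObject).
    meta_class: str
    name: Optional[str] = None
    param_types: tuple = ()
    children: list = field(default_factory=list)


def signature(element, parent_sig=None):
    """Signature = parent signature + meta-class + name + parameter types."""
    return (parent_sig, element.meta_class, element.name, element.param_types)


def index_by_signature(root):
    """Map each signature in the containment tree to one element."""
    index = {}
    stack = [(root, None)]
    while stack:
        element, parent_sig = stack.pop()
        sig = signature(element, parent_sig)
        index.setdefault(sig, element)  # on a collision, the first element wins
        stack.extend((child, sig) for child in element.children)
    return index


def match(old_root, new_root):
    """Return (matched pairs, removed elements, added elements)."""
    old_index = index_by_signature(old_root)
    new_index = index_by_signature(new_root)
    matched = [(old_index[s], new_index[s]) for s in old_index if s in new_index]
    removed = [e for s, e in old_index.items() if s not in new_index]
    added = [e for s, e in new_index.items() if s not in old_index]
    return matched, removed, added
```

Matching is a pair of dictionary lookups per element, which is why signatures are the cheap option; a similarity-based matcher would have to score candidate pairs instead.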
Comparison Representation
[Figure: model 1, model 2, matches, differences]
Comparison Representation for Compression
model 1 + Δ(1,2) = model 2
model 2 + Δ(2,3) = model 3
...
model n + Δ(n,n+1) = model n+1
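The delta chain can be sketched in a few lines of Python. As a simplifying assumption, a model is a flat dictionary from element id to value; `diff`, `patch`, `compress`, and `restore` are illustrative names, not the EMF-Compress difference meta-model.

```python
def diff(old, new):
    """Delta(old, new): what must be removed and what must be set."""
    removed = [key for key in old if key not in new]
    changed = {k: v for k, v in new.items() if k not in old or old[k] != v}
    return {"removed": removed, "set": changed}


def patch(model, delta):
    """model + delta = next model; the input model is left untouched."""
    result = {k: v for k, v in model.items() if k not in delta["removed"]}
    result.update(delta["set"])
    return result


def compress(revisions):
    """Keep only the first model in full, plus one delta per later revision."""
    base = revisions[0]
    deltas = [diff(a, b) for a, b in zip(revisions, revisions[1:])]
    return base, deltas


def restore(base, deltas, n):
    """Rebuild revision n (0-based) by applying the first n deltas in order."""
    model = base
    for delta in deltas[:n]:
        model = patch(model, delta)
    return model
```

Restoring revision n costs n patch operations, which is the extraction-time side of the storage-space tradeoff.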
EMF-Compress
▶ We built a comparison framework for compression
■ signature-based matching
■ difference meta-model that allows patching
▶ http://github.com/markus1978/emf-compress
Experiments
▶ Reverse-engineered Git repositories with Java code using MoDisco
▶ Eclipse Foundation sources, i.e. Eclipse platform and plug-ins
▶ organized in different projects: JDT, CDT, EMF, ...
▶ available via GitHub
▶ Git repositories can be gathered automatically via GitHub’s REST API
▶ We used the 200 largest Eclipse repositories that actually contained Java code: 6.6 GB Git, 400 MLOC, 250 GB (binary) models with 4 billion objects.
Experiment 1: EMF-Compare vs EMF-Compress
▶ only the first 1000 revisions of the 100 largest (by Git size) repos; only CUs with fewer than 20k elements: ~300k individual comparisons
[Figure: bar charts "Differences model size" (%), "Number of matches" (%), and "Avg. execution times (log)" (avg. time per compilation unit, ms), with bars labelled signature, similarity, lines, and parse]
[Figure: execution time (s) over size (#objects, 0 to 20000), panels "Similarity-based" and "Signature-based"]
Experiment 2: Partial Comparison
▶ Problem: not all elements have a meaningful signature
▶ Two signature matching strategies:
■ Named-only: only match named elements, use equality for the
contents of named elements
■ Meta-class: match named elements based on their signature,
match the contents of named elements based on their parent
and meta-class only
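The difference between the two strategies can be sketched as follows. This is only an illustration under simplifying assumptions: the contents of one named element are a list of `(meta_class, value)` pairs, removals are ignored, and contents of the same meta-class are paired in document order. Function names are hypothetical.

```python
from collections import defaultdict

_MISSING = object()  # marks "no old counterpart found"


def named_only_delta(old_contents, new_contents):
    """Named-only: contents are compared by equality as a whole.

    If anything differs, all new contents have to be stored.
    """
    return [] if old_contents == new_contents else list(new_contents)


def meta_class_delta(old_contents, new_contents):
    """Meta-class: pair contents by meta-class, store only what changed."""
    old_by_class = defaultdict(list)
    for meta_class, value in old_contents:
        old_by_class[meta_class].append(value)
    stored = []
    for meta_class, value in new_contents:
        candidates = old_by_class[meta_class]
        old_value = candidates.pop(0) if candidates else _MISSING
        if old_value != value:
            stored.append((meta_class, value))
    return stored
```

For a method body in which one of two statements changed, the named-only variant stores both statements again, while the meta-class variant stores only the changed one.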
[Figure: per repository (cdt, cdo, ...ompare, emf, ...e.core), bars for Delta, UC, and Full. Panels: "Size - Named-only Matcher" (GB), "Size - Meta Class Matcher" (GB), "Lines" (MLines), "All vs Matched - Named-only Matcher" (MObjects), "All vs Matched - Meta Class Matcher" (MObjects), "All vs Matched Lines" (MLines)]
Discussion
▶ Only reverse-engineered Java models; different results for
other meta-models are possible
▶ Similarity-based matching != similarity-based matching:
more evaluation with different qualities of similarity-based
matching is necessary
▶ Signatures are not necessarily meta-model independent
▶ We need a better understanding of the relationship between
comparison runtime and comparison quality
Conclusions
▶ EMF Compare is tailored for difference model usability, not
for model-compression
■ insufficient execution times
■ wrong representation of differences
▶ EMF-Compress is an alternative: http://github.com/markus1978/emf-compress
▶ Better analysis of matching strategies necessary:
■ evaluation of more matching strategies
■ evaluation with models in different languages
Example Use-Case: Model-based MSR
[Figure: revisions R0 … RHEAD with compilation-unit sets {C0} … {Cn-1}, {Cn}; labels MS, M{C}, and MM]
Model-based Mining of Software Repositories
▶ MSR tools are already “model-based”, but in a proprietary
manner
▶ Idea: use existing reverse engineering frameworks and
corresponding standard meta-models and modeling
frameworks instead of proprietary solutions
▶ Goals
■ deal with heterogeneity (different version control systems,
different languages)
■ reuse of existing meta-models, transformations, and languages
■ interoperability with existing analysis tools
■ retaining meaningful scalability
Model-based Mining of Software Repositories
▶ Scope
■ depends on the concrete MSR application and its goals
■ number of software projects: single repositories, large
repositories, ultra-large repositories
■ Sources as text and text based metrics, e.g. LOC
■ Declarations only: packages, classes, methods, but no
statements, expressions, etc.
■ Full AST with or without cross-references
Reverse Engineering with MoDisco
▶ Model Discovery
▶ reverse engineering for Java, based on EMF
▶ discovery, i.e. finding sources (so-called compilation units)
within projects, source folders, and packages
▶ uses Eclipse’s workspace and Java Development Tools (JDT)
▶ provides
■ discoverers for many languages: Java, Xtext, JSP, XML
■ instances of a Java EMF meta-model that corresponds to
the handwritten JDT AST model
■ transformations to language-independent artifacts, e.g.
KDM
From Source Code- to Model-Repository
[Figure: a version control system stores A1 B1 and the changes A1-A2 and B1-B3; Checkout(r) yields the snapshots (A1 B1), (A2 B1), (A2 B3); d2CUs(r) and Parse(d) yield the models M1, M2, M3; further operations: Merge(r), Save(r), Load(r), Analysis(r); a reference f to B.f]
Cost expressions over revisions (R) and compilation units (CUs):
▶ Checkout + Parse + Analysis
▶ Checkout + (Parse + Merge) + Analysis
▶ (Load + Merge) + Analysis
▶ (Load + Analysis)
Model-based MSR Strategies
[Figure, built up in steps: the version control system stores A1 B1 and the changes A1-A2 and B1-B3. Checkout(r) produces the snapshots (A1 B1), (A2 B1), (A2 B3); d2CUs(r) and Parse(d) produce the models M1, M2, M3 for Analysis(r). Later steps add Merge(r), a reference f to B.f, and Load(r)/Save(r).]
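The difference between re-parsing whole snapshots and parsing only changed compilation units can be sketched as below. This is a toy illustration, not the author's implementation: `parse` is a stand-in for a real parser such as MoDisco, a revision is a dictionary of changed CUs, and file deletions are not handled.

```python
def parse(cu_source):
    """Stand-in for a real parser: returns a 'model' of one compilation unit."""
    return {"source": cu_source, "size": len(cu_source)}


def snapshot_strategy(revisions):
    """Checkout + Parse: re-parse every CU of every full snapshot."""
    models, parses, snapshot = [], 0, {}
    for changes in revisions:
        snapshot = {**snapshot, **changes}  # Checkout(r): full working copy
        models.append({name: parse(src) for name, src in snapshot.items()})
        parses += len(snapshot)
    return models, parses


def incremental_strategy(revisions):
    """Parse + Merge: parse only changed CUs, merge into the previous model."""
    models, parses, model = [], 0, {}
    for changes in revisions:
        parsed = {name: parse(src) for name, src in changes.items()}
        parses += len(changes)
        model = {**model, **parsed}  # Merge(r): reuse unchanged CU models
        models.append(model)
    return models, parses
```

With the revisions from the figure, `[{"A": "A1", "B": "B1"}, {"A": "A2"}, {"B": "B3"}]`, both strategies produce the same three models, but the snapshot strategy parses six CUs and the incremental one four.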
Importing and Traversing Source Code Repositories
[Figure, built up in steps: revisions R1, R2, R3 with compilation units A1, A2, B1, B3 and a reference f; import (persistent/storage) versus traverse (transient/runtime); annotations such as "A1 B1!fB.f?", "A2 B.f?", and "B3!f"; ✓ and ✗ marks on individual snapshots]
