Graph operations in Git version control system
how the performance was improved (for large repositories),
how can it be further improved
dr Jakub Narębski
Nicolaus Copernicus University in Toruń, Poland
presented on December 3, 2019
dr J. Narębski (UMK, Toruń) Graph operations in Git presented on 03.12.2019 (v1.2)
Table of contents
1 Introduction
Motivation
Graphs in Git
2 Operations on graphs
3 Methods for improving performance
Bitmap index
Generation number
Algorithm for finding common ancestors
Algorithm for topological sorting
4 Future work
Corrected commit creation date
Other graph labels
Motivation: scaling Git up
in the presence of the increasing size of repositories
Git repositories are growing with respect to the number of commits
examples:
Linux kernel: 740 000 commits (2018)
MS Windows: 1 700 000 commits (2018)
Android (AOSP): 874 000 commits (2019)
Chromium: 772 000 commits (2019)
. . .
Git: 55 000 commits (2019)
noticeable slowdown of Git operations (now taking seconds)
gitk and git log --graph
git push --force-with-lease
git status --ahead-behind
. . .
serialized commit-graph, since Git 2.18 (Derrick Stolee, Microsoft)
it provides space for storing auxiliary labels (reachability indices), such as the generation number
commit counts updated for 2019:
Linux kernel: 826 000 commits (2019)
MS Windows: 3 100 000 commits (2019)
Object graph
Git repository as a content-addressed object database
data in Git repositories is stored as a Directed Acyclic Graph (DAG),
that is, edges are directed and there are no cycles
nodes (vertices) in this graph are objects of the following types:
commit: represents a revision, stores project history
tree: snapshot of project files at a given point in time, also represents subdirectories (in a hierarchical way)
blob: stores file contents at a given version
tag: represents an annotated or signed version of a project
edges between nodes represent relationships:
commit → commit: "based on" relationship, the second one is the parent of the first
(each revision has zero or more parent commits)
commit → tree: project contents at a given revision
tree → tree and tree → blob: filesystem hierarchy
tag → object (usually a commit): symbolic name of the object
xkcd.com/1597/
Visualization of the objects graph in the Git repository
[Figures: object graph and contents deduplication; hierarchical file structure]
Object graph in a Git repository (object model of a repository)
[Figure from: Derrick Stolee, Advanced Git for Beginners, https://stolee.dev/docs/git.pdf]
Object graph and external references to it: branches, tags, the index, ...
[Figure: example from https://github.com/sensorflo/git-draw/wiki]
Commit graph
Representation of project history in Git
(Almost) every commit (revision) has a reference to its parent, that is, the revision it was based on
Some revisions (commits), being the result of a merge operation, have two parents (rarely more: so-called octopus merges)
At least one (initial) revision has no parents
Each commit object includes a reference to the tree representing the snapshot of files in the repository
Branches and tags are external references to the commit graph
HEAD is a symbolic reference to the current branch (a detached HEAD points directly to a commit)
[Figure: example commit graph with commits c7cd3 (master, HEAD), f30ab, 34ac2 (v0.9), 98ca9, 23b88, d77af]
Object addressing
SHA-1 / SHA-256 of object representation as object identifiers
Each object in the repository's object database is referenced using the SHA-1 hash of the representation of the object contents
(the switch to SHA-256, aka NewHash, is in progress)
Examples: object representation of a commit:
$ git cat-file -p HEAD^
tree 49f152498cfde082f3223e20338ca64517991cd7
parent bdd1cc20929e9f7631dd29ff70426eea53f69443
author Junio C Hamano <gitster@pobox.com> 1452640851 -0800
committer Junio C Hamano <gitster@pobox.com> 1452640851 -0800

First batch for post 2.7 cycle

Signed-off-by: Junio C Hamano <gitster@pobox.com>
http://shafiul.github.io/gitbook
Examples: object representation of a tree:
$ git cat-file -p 49f152498cfde082f3223e20338ca64517991cd7
100644 blob 5e98806c6cc246acef5f539ae191710a0c06ad3f .gitattributes
100644 blob 1c2f8321386f89ef8c03d11159c97a0f194c4423 .gitignore
100644 blob e5b4126bec557db55924b7b60ed70349626ea2c4 .mailmap
100644 blob c3bf9c6d4d1c6049dd38a5a861dceb4c8e1b7e99 .travis.yml
100644 blob 536e55524db72bd2acf175208aef4f3dfc148d42 COPYING
040000 tree 370ffbfd689a5d307a5d99be32006bef1b0d33ed Documentation
100755 blob 5873f163e51684ff4fc53119c59fe2c4fc24fabe GIT-VERSION-GEN
100644 blob ffb071e9f03a79a052beaa4372fa790ecbabbb7b INSTALL
...
Examples: object representation of a tag:
$ git cat-file -p v2.7.0
object 754884255bb580df159e58defa81cdd30b5c430c
type commit
tag v2.7.0
tagger Junio C Hamano <gitster@pobox.com> 1451945292 -0800

Git 2.7
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
...
-----END PGP SIGNATURE-----
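How an object identifier follows from the object representation can be sketched in a few lines of Python. This is a minimal illustration, assuming the SHA-1 object format in which Git hashes a header of the form "<type> <size>" plus a NUL byte, followed by the raw contents; the function name git_object_id is made up for this example.

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Return the SHA-1 object name for an object of the given type.

    Assumption: the hashed bytes are "<type> <size>\\0" followed by the payload.
    """
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


# The empty blob and the empty tree have well-known identifiers.
EMPTY_BLOB_ID = git_object_id("blob", b"")
EMPTY_TREE_ID = git_object_id("tree", b"")
```

Because the identifier depends only on the type and contents, two files with identical contents map to the same blob, which is what makes the deduplication shown earlier possible.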
Git commands working directly on the object graph
object exchange (object transfer)
git fetch
git push
git clone
server and client negotiate to send only those (new) objects that are necessary (that are missing from the other side)
garbage collection
git repack
git gc
removing unreachable objects (results of git commit --amend, git rebase, multiple git add <file>, etc.)
Git commands working directly on the commit graph (1/2)
breakdown into categories based on the type of the result
commands returning boolean: true/false value
git merge-base --is-ancestor A B
is A an ancestor of B, that is, is A reachable from B
commands returning subset (of larger set)
git branch --contains (or git tag ...)
branches/tags from which given commit is reachable
git branch --merged (or git tag ...)
branches/tags reachable from given commit
git branch --no-contains (or git tag ...)
branches/tags from which given commit is unreachable
git branch --no-merged (or git tag ...)
branches/tags unreachable from given commit
autofollowing tags during git fetch
(see documentation of remote.<name>.tagOpt)
commands finding a node or nodes in the commit graph
git merge-base --all A B
finding lowest (closest) common ancestors
Git commands working directly on the commit graph (2/2)
breakdown based on the type of the result, continued
commands returning path / subgraph
git log A..B ≡ git log B --not A
reachable from B and unreachable from A
git log A...B (symmetric difference)
A...B ≡ A B --not $(git merge-base --all A B)
reachable from either A or B, but not from both
git log --ancestry-path A..B
commits directly on a path leading from B to A (inclusive)
topological sorting options (and equivalent)
git log --topo-order / --graph, gitk, etc.
additionally those try to keep related revisions together
iterative bisection of graph (to find regression)
git bisect
[Figure: diagrams of the ranges A..B and A...B, with O the common ancestor of A and B]
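The two range notations can be read as plain set operations on reachable commits. The sketch below assumes a toy history stored as a dictionary from commit name to list of parents; the names PARENTS, ancestors, two_dot and three_dot are illustrative only and say nothing about how Git implements the walk.

```python
def ancestors(parents, tip):
    """All commits reachable from `tip`, including `tip` itself."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def two_dot(parents, a, b):
    """A..B: reachable from B and unreachable from A."""
    return ancestors(parents, b) - ancestors(parents, a)


def three_dot(parents, a, b):
    """A...B: reachable from exactly one of A and B (symmetric difference)."""
    return ancestors(parents, a) ^ ancestors(parents, b)


# Hypothetical history: O is the fork point, A and B are the branch tips.
PARENTS = {"O": [], "X": ["O"], "A": ["X"], "Y": ["O"], "B": ["Y"]}
```

With this history, two_dot(PARENTS, "A", "B") contains only B and Y, while three_dot(PARENTS, "A", "B") contains A, X, B and Y; the shared commit O appears in neither.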
Definition of reachability in a directed acyclic graph (DAG)
Definition of reachability (graph theory)
Let's assume that we have a directed acyclic graph $G = (V, E)$, where $V$ is a finite set of vertices (nodes), and $E \subset V^2$ is a finite set of directed edges.
For all $(u, v) \in V^2$ we say that $v$ is reachable from $u$, which we denote as $r(u, v)$ or as $u \leadsto v$, if and only if $u = v$ or $\exists (u, w) \in E \land r(w, v)$.
Properties of this relation
$\forall v \in V : r(v, v)$
$r(u, w) \land r(w, v) \implies r(u, v)$
$r(u, v) \land r(v, u) \implies u = v$
The reachability relation imposes a partial order on the nodes of the graph
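The recursive definition translates directly into a graph walk. The following Python sketch is one possible reading of it, assuming the graph is given as a set of (u, w) edge pairs; the names reachable, NODES and EDGES are invented for the illustration.

```python
def reachable(edges, u, v):
    """r(u, v): True if u == v, or some edge (u, w) exists with r(w, v)."""
    stack, seen = [u], set()
    while stack:
        node = stack.pop()
        if node == v:
            return True
        if node in seen:
            continue
        seen.add(node)
        # Follow every outgoing edge (node, w).
        stack.extend(w for (x, w) in edges if x == node)
    return False


# Hypothetical DAG: c -> b -> a and d -> b.
NODES = ["a", "b", "c", "d"]
EDGES = {("c", "b"), ("b", "a"), ("d", "b")}
```

On this graph a is reachable from c, but c is not reachable from a, and c and d are not comparable at all, which is why the order is only partial.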
Lowest (closest) common ancestors, or merge base
Lowest common ancestors are used when merging (via git merge) two branches A and B
a lowest (closest) common ancestor, like P and Q, is reachable both from A and from B
it is not reachable from any other common ancestor of A and B
https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-ii-file-format/
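A direct, unoptimized reading of this definition is sketched below: collect the common ancestors, then drop every one that is reachable from another common ancestor. The criss-cross history and the names ancestors, merge_bases and CRISS_CROSS are assumptions made for the example; Git's own algorithm avoids walking the whole history like this.

```python
def ancestors(parents, tip):
    """All commits reachable from `tip`, including `tip` itself."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not reachable from another common ancestor."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {
        c for c in common
        if not any(c != d and c in ancestors(parents, d) for d in common)
    }


# Hypothetical criss-cross merge: both A and B merge P and Q, which share the root R.
CRISS_CROSS = {"R": [], "P": ["R"], "Q": ["R"], "A": ["P", "Q"], "B": ["Q", "P"]}
```

Here merge_bases(CRISS_CROSS, "A", "B") yields both P and Q, while R is excluded because it is reachable from P.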
Topological sorting of directed acyclic graph (DAG)
Definition of topological sorting
Topological sorting of a directed acyclic graph: any linear (total) ordering $\prec$ of nodes (vertices) for which
$(u, v) \in E \implies u \prec v$
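One standard way to produce such an ordering is Kahn's algorithm, sketched here in Python. It is a textbook technique named for illustration, not a description of the algorithm Git uses; edges point from a commit to its parent, so children come out before parents. The names topo_sort, NODES and EDGES are made up for the example.

```python
from collections import deque


def topo_sort(nodes, edges):
    """Return the nodes in an order where u precedes v for every edge (u, v)."""
    indegree = {n: 0 for n in nodes}
    outgoing = {n: [] for n in nodes}
    for u, v in edges:
        outgoing[u].append(v)
        indegree[v] += 1
    # Start from nodes that nothing points to (branch tips in a commit graph).
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in outgoing[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle, no topological order exists")
    return order


# Hypothetical history: merge commit m has parents b and c, which both descend from r.
NODES = ["r", "a", "b", "c", "m"]
EDGES = [("a", "r"), ("b", "a"), ("c", "r"), ("m", "b"), ("m", "c")]
ORDER = topo_sort(NODES, EDGES)
```

Several valid orderings usually exist; the definition only requires that every edge points forward in the chosen order.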
Bitmap indices: solving the problem of object exchange
here bitmap means simply a bit vector (vector of 0/1 values)
image: Scott Chacon, Pro Git, https://git-scm.com/book
objects shown in the figure: 1a410e, cac0ca, fdf4fc, 3c439c, 0155eb, d8329f, 1f7a7a, fa49b0, 93baae
reachability bitmaps:
1a410e  1 1 1 1 1 1 1 1 1
cac0ca  0 1 1 0 1 1 1 1 1
fdf4fc  0 0 1 0 0 1 0 0 1
bit location corresponds to the position of the object in the packfile
bit 1 in the bitmap for a given revision means that the object at the given position is reachable from it
bit 0 in the bitmap: not reachable
objects to be transferred: want AND NOT have
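The "want AND NOT have" rule can be tried out with Python integers standing in for bit vectors. This sketch assumes that the object names listed on the slide are in packfile order and that bit i corresponds to the i-th name; PACK_ORDER, BITMAPS and objects_to_send are illustrative names, and real bitmaps are stored compressed rather than as plain integers.

```python
# Assumption: abbreviated object names from the slide, in packfile order.
PACK_ORDER = ["1a410e", "cac0ca", "fdf4fc", "3c439c", "0155eb",
              "d8329f", "1f7a7a", "fa49b0", "93baae"]


def to_bits(row):
    """Turn a row such as "0 1 1" into an integer with bit i set for column i."""
    return sum(1 << i for i, flag in enumerate(row.split()) if flag == "1")


BITMAPS = {
    "1a410e": to_bits("1 1 1 1 1 1 1 1 1"),
    "cac0ca": to_bits("0 1 1 0 1 1 1 1 1"),
    "fdf4fc": to_bits("0 0 1 0 0 1 0 0 1"),
}


def objects_to_send(want, have):
    """Objects reachable from the wanted commits but not from the ones already held."""
    want_bits = 0
    for commit in want:
        want_bits |= BITMAPS[commit]
    have_bits = 0
    for commit in have:
        have_bits |= BITMAPS[commit]
    missing = want_bits & ~have_bits
    return [name for i, name in enumerate(PACK_ORDER) if (missing >> i) & 1]
```

A client that has cac0ca and wants 1a410e needs only two objects: the new commit 1a410e itself and the object 3c439c.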
Bitmap indices: finding objects that we have (and don't need to fetch)
http://githubengineering.com/counting-objects/
$ GIT_TRACE_PACKET=1 git pull
[...]
... fetch-pack want a595...
... fetch-pack want a4c7...
... fetch-pack want d1c7...
[...]
... fetch-pack 0000
... fetch-pack have cc3f...
... fetch-pack have 5bd5...
[...]
... fetch-pack 0000
Bitmap index for the packfile
Selected details of bitmap index implementation in Git
We cannot create the reachability bitmap (bit vector) for every revision, because it would take too much space (it would amount to storing the transitive closure of the object graph)
use a heuristic algorithm to select which commits will have a bitmap
let the newest versions have a bitmap, which is needed for cloning
the deeper in the commit graph (earlier in history), the less frequently reachability bitmaps are added (down to every 3000 revisions)
Minimize space taken by bitmaps by using RLE compression
EWAH Bitmaps: Daniel Lemire (implementations in Java, C#, C++)
patent free (which is unfortunately not the case for every bitmap compression algorithm)
JGit support added by Shawn Pearce and Colby Ranger (Google)
ported to C as libewok by Vicent Martí (GitHub)
good enough compression level with fast decompression
some operations on bitmaps do not require decompression
Trick: store the result of XOR with a bitmap for an earlier commit
EWAH (Enhanced Word-Aligned Hybrid) format
word-aligned format
different variants: 32-bit, 64-bit
speed at the cost of compression ratio
clean words: run of 0s or of 1s
RLE (run-length) encoding
how many repeating 0s or 1s
dirty words: mixed 0s and 1s
literal encoding
EWAH: marker words
length and type of sequence
bit operations on compressed bitmaps without decompressing them (symmetric operations only)
Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes. http://arxiv.org/abs/0901.3751
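The split into clean and dirty words can be illustrated with a deliberately simplified model. The sketch below is not the EWAH binary layout (which packs run lengths and dirty-word counts into marker words); it only shows the idea of run-length encoding clean 64-bit words and storing dirty words literally. All names are hypothetical.

```python
WORD_BITS = 64
ALL_ONES = (1 << WORD_BITS) - 1


def compress(words):
    """Group words into ("clean", bit, run_length) and ("dirty", [literal words]) chunks."""
    chunks = []
    for word in words:
        if word in (0, ALL_ONES):
            bit = 1 if word else 0
            if chunks and chunks[-1][0] == "clean" and chunks[-1][1] == bit:
                # Extend the current run of identical clean words.
                chunks[-1] = ("clean", bit, chunks[-1][2] + 1)
            else:
                chunks.append(("clean", bit, 1))
        elif chunks and chunks[-1][0] == "dirty":
            chunks[-1][1].append(word)
        else:
            chunks.append(("dirty", [word]))
    return chunks


def decompress(chunks):
    """Inverse of compress(): expand runs and copy literal words back."""
    words = []
    for chunk in chunks:
        if chunk[0] == "clean":
            words.extend([ALL_ONES if chunk[1] else 0] * chunk[2])
        else:
            words.extend(chunk[1])
    return words


SAMPLE = [0, 0, 0, 0b1010, ALL_ONES, ALL_ONES, 7]
```

Long runs of all-zero or all-one words, which are common in reachability bitmaps, collapse into a single chunk, while mixed words cost exactly one word each.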
Bitmap index: References
Vicent Martí (GitHub).
Counting Objects. GitHub Engineering, 22 Sep 2015
http://githubengineering.com/counting-objects/
https://github.blog/2015-09-22-counting-objects/
Shawn Pearce (Google).
Scaling Up JGit. EclipseCon 2013, Boston.
Daniel Lemire.
All About Bitmap Indexes... And Sorting Them
slides presented at BDA'08 and DOLAP'08, 12 Feb 2009.
http://lemire.me/talks/uqamtalk.pdf
Vicent Martí.
GIT bitmap v1 format.
Documentation/technical/bitmap-format.txt
Extracting information about edges in the commit graph
Following the edges requires:
finding the object (packed or loose)
object decompression (zlib)
possibly resolving deltas
parsing the commit object
commit object representation:
tree 49f152498cfde082f3223e20338ca64517991cd7
parent bdd1cc20929e9f7631dd29ff70426eea53f69443
author A U Thor a@ex.com 1452640851 -0800
committer C O Mitter c@ex.us 1452640851 -0800
First batch for post 2.7 cycle
Signed-off-by: A U Thor a@ex.com
[Figure: loose format (one file per object) and packed format (multiple objects); from Junio Hamano, Git Chronicles, talk at GitTogether 2008]
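The last step, parsing the commit object to get its outgoing edges, might look like the sketch below. It assumes the already decompressed text form shown above, with headers separated from the message by a blank line; parse_commit and COMMIT_TEXT are names chosen for the example.

```python
def parse_commit(text):
    """Extract (tree, parents, message) from the text of a commit object."""
    headers, _, message = text.partition("\n\n")
    tree, parents = None, []
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "tree":
            tree = value
        elif key == "parent":
            # One "parent" header per edge in the commit graph.
            parents.append(value)
    return tree, parents, message


COMMIT_TEXT = """\
tree 49f152498cfde082f3223e20338ca64517991cd7
parent bdd1cc20929e9f7631dd29ff70426eea53f69443
author A U Thor a@ex.com 1452640851 -0800
committer C O Mitter c@ex.us 1452640851 -0800

First batch for post 2.7 cycle

Signed-off-by: A U Thor a@ex.com
"""
```

Even this small parser only runs after the object has been located, decompressed and possibly reconstructed from deltas, which is the cost the following slides set out to avoid.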
Two ways of storing objects in Git: an outline
[Figure from: Junio Hamano, Git Chronicles, talk at GitTogether 2008]
Commit-graph file format (storing serialized commit graph)
Following the edges required:
finding the object (packed or loose)
object decompression (zlib)
possibly resolving deltas
parsing the commit object
This can take a long time, especially in large repositories, where it needs to be done 1000s of times
Git 2.18 and later supports the commit-graph file, which stores this DAG information in a compact form:
fanout table: binary search seeds
list of commits sorted by the commit ID
commit data, with parents as positions in the list
list of 3rd and later parents: octopus edges
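How a fanout table seeds the binary search can be sketched as follows. The sketch assumes that entry i of the table holds the number of commit IDs whose first byte is at most i, and it works on an in-memory list of raw 20-byte IDs rather than on the actual file; build_fanout, lookup, OIDS and FANOUT are illustrative names.

```python
import hashlib
from bisect import bisect_left


def build_fanout(sorted_oids):
    """fanout[i] = number of IDs whose first byte is <= i (cumulative counts)."""
    fanout = [0] * 256
    for oid in sorted_oids:
        fanout[oid[0]] += 1
    for i in range(1, 256):
        fanout[i] += fanout[i - 1]
    return fanout


def lookup(sorted_oids, fanout, oid):
    """Return the position of `oid` in the sorted list, or None if it is absent."""
    first = oid[0]
    lo = fanout[first - 1] if first else 0
    hi = fanout[first]
    # Binary search only within the slice sharing the same first byte.
    pos = bisect_left(sorted_oids, oid, lo, hi)
    return pos if pos < hi and sorted_oids[pos] == oid else None


# Made-up commit IDs, just to have something to search in.
OIDS = sorted(hashlib.sha1(str(i).encode()).digest() for i in range(1000))
FANOUT = build_fanout(OIDS)
```

The position returned by the lookup is what the commit data can then use to refer to parents, instead of repeating full commit IDs.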
Commit-graph file format schema (serialized commit graph)
[Figure from: https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-ii-file-format/]
Making Git faster with commit-graph file (serialized commit graph)
Linux kernel (around 750 000 revisions / commits)
command                        before (s)  after (s)  change
git merge-base master topic          0.52       0.06    -88%
git branch --contains               76.20       0.04    -99%
git tag --contains                   5.30       0.03    -99%
git tag --merged                     6.30       1.50    -76%
git log --graph -10                  5.90       0.74    -87%
master: 032b4cc884490c4bc7c4ef8c91e6d
topic: 62d18ecfa64137349fac9c5817784fb
where the topic branch is 30 986 revisions before the master branch, and the master branch can reach 722 849 revisions
Git code repository (around 50 000 revisions)
command                        before (s)  after (s)  change
git merge-base master topic          0.10       0.04    -60%
git branch --contains                0.76       0.03    -96%
git tag --contains                   0.70       0.03    -96%
git tag --merged                     0.74       0.12    -84%
git log --graph -10                  0.44       0.05    -89%
master: b50d82b00a8fc9d24e41ae7dc30185
topic: e144d126d74f5d2702870ca9423743
where master is 2032 revisions behind the topic branch, and can reach 49 361 commits
MS Windows, with GVFS (around 1 700 000 commits)
command                        before (s)  after (s)  change
git status --ahead-behind           14.30       4.70    -67%
git merge-base master topic         11.40       1.80    -84%
git branch --contains                9.40       1.60    -83%
git log --graph -10                 24.30       5.30    -78%
where master includes 2 214 796 reachable commits, and the local version of master is 81 776 revisions behind origin/master, which affects the speed of git status; such a value is typical for development there
Serialized commit graph: References
Derrick Stolee (Microsoft)
Supercharging the Git Commit Graph series, Azure DevOps Blog, 2018
https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph/
https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-ii-file-format/
Johannes Schindelin (Microsoft), Derrick Stolee (Microsoft)
Making Git for Windows: starting from 15:00, Git Merge 2018
httpsXGGyoutuF˜eGoywziWVQmwctaWHS
Documentation/technical/commit-graph-format.txt
Git commit graph format
The creation date of the commit object as a heuristic
tree 49f152498cfde082f3223e20338ca64517991cd7
parent bdd1cc20929e9f7631dd29ff70426eea53f69443
author A U Thor <thor@example.com> 1452640851 -0800
committer C O Mitter <ter@example.us> 1452640851 -0800

First batch for post 2.7 cycle

Signed-off-by: A U Thor <thor@example.com>
Revision dates
authordate is the creation date of the changes (authorship)
committerdate is the date those changes were added to the repository (creating the commit object)
commit object data includes the date and time of its creation
revisions based on it must have been created later with regard to a global time
unfortunately, because of the lack of clock synchronization, we cannot entirely rely on this data
it can however be used as a heuristic: as a stop condition, and for the order of traversal
Git stops searching after finding 5 (SLOP) revisions that are older than a boundary
Generation number / topological level
Definition of level in graph / of generation number
The level $l_v$ of a vertex $v$ in the graph $G = (V, E)$ is defined as its depth, that is, the length of the longest path from $v$ to a root
if $v$ has no parents (no outgoing edges), i.e. if $v$ is a root, then $l_v = 0$
otherwise $l_v = \max_{u : (v, u) \in E} \{ l_u \} + 1$
Properties:
if $u \leadsto v$ and $u \neq v$, then $l_u > l_v$
if $u \leadsto v$, then $l_u \geq l_v$ (weaker condition)
[Figure: example DAG with topological levels]
In graph-theoretic terms, a vertex with no predecessors (no outgoing edges) is a leaf, so the level of $v$ is equivalently the length of the longest path from $v$ to a leaf, with $l_v = 0$ for leaves
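The recursive definition of the level can be evaluated bottom-up, parents before children. The Python sketch below does this with an explicit stack, so that long histories do not hit the recursion limit; it assumes every commit appears as a key of the parents dictionary, and the names topological_levels, root_level and HISTORY are chosen for the example.

```python
def topological_levels(parents, root_level=0):
    """Map each commit to its level: root_level for roots, else 1 + max over its parents."""
    level = {}
    for start in parents:
        stack = [start]
        while stack:
            v = stack[-1]
            if v in level:
                stack.pop()
                continue
            pending = [p for p in parents[v] if p not in level]
            if pending:
                # Parents first; v stays on the stack and is revisited afterwards.
                stack.extend(pending)
            else:
                level[v] = 1 + max((level[p] for p in parents[v]),
                                   default=root_level - 1)
                stack.pop()
    return level


# Hypothetical history: merge commit m has parents b and c.
HISTORY = {"r": [], "a": ["r"], "b": ["a"], "c": ["r"], "m": ["b", "c"]}
LEVELS = topological_levels(HISTORY)
```

For this history the levels are r = 0, a = 1, b = 2, c = 1 and m = 3; passing root_level=1 shifts every value by one.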
Generation numbers (levels) in the commit graph
[Figures: example graph of revisions, and the same graph annotated with generation numbers]
Derrick Stolee, Supercharging the Git Commit Graph III: Generations and Graph Algorithms, July 2018
https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/
Historical approaches to adding generation number: 2011 (or earlier)
Idea 1: adding new header to the commit
tree 49f152498cfde082f3223e20338ca64517991cd7
parent bdd1cc20929e9f7631dd29ff70426eea53f69443
author A U Thor a@ex.com 1452640851 -0800
committer C O Mitter c@ex.us 1452640851 -0800
generation 32145
First batch for post 2.7 cycle
Signed-off-by: A U Thor a@ex.com
problems with backward compatibility
questions about ensuring correctness
repositories with and without it
possibly copying unknown headers by cherry-pick, revert, etc.
Idea 2: using git notes as a cache
git notes technique
allows adding notes to any object: blob, commit, ...
notes are split into namespaces, e.g. the default refs/notes/commits
the textconv mechanism can be configured to use them as a cache
questions about ensuring correctness
performance: the notes mechanism is not intended for a very large number of notes, O(commits)
Storing generation numbers in the commit-graph file
Commit Data chunk: CDAT
includes a column (field) intended for the commit creation date (committerdate)
if the column used a signed 32-bit integer: Y2038 problem
therefore a 64-bit wide column is used (two 4-byte words):
the 30 most significant bits of the first 4 bytes are used for the generation number
the 32 bits of the second 4 bytes and the 2 least significant bits of the previous 4-byte word are used for committerdate, which together makes 34 bits to store the datetime as a Unix timestamp
Definition of generation number in the commit-graph file
The generation number $\operatorname{gen}(A)$ of revision $A$ of a project is defined in the following way:
if $A$ has no parents, i.e. it is a root commit, then $\operatorname{gen}(A) = 1$
otherwise its generation number is one more than the maximum generation number among all its parents: $\operatorname{gen}(A) = \max \{ \operatorname{gen}(P) : P \in \operatorname{parent}(A) \} + 1$
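The bit layout described above can be checked with a small packing routine. This is a sketch under two assumptions: the two 4-byte words are written in big-endian (network) byte order, and the 2 extra date bits are the top bits of the 34-bit timestamp. The function names are invented for the example.

```python
import struct


def pack_generation_and_date(generation, commit_time):
    """Pack a 30-bit generation number and a 34-bit timestamp into 8 bytes."""
    if not 0 <= generation < (1 << 30):
        raise ValueError("generation number does not fit in 30 bits")
    if not 0 <= commit_time < (1 << 34):
        raise ValueError("commit time does not fit in 34 bits")
    # First word: generation in the upper 30 bits, date bits 33..32 in the lower 2.
    word1 = (generation << 2) | (commit_time >> 32)
    # Second word: the low 32 bits of the date.
    word2 = commit_time & 0xFFFFFFFF
    return struct.pack(">II", word1, word2)


def unpack_generation_and_date(data):
    """Inverse of pack_generation_and_date()."""
    word1, word2 = struct.unpack(">II", data)
    generation = word1 >> 2
    commit_time = ((word1 & 0x3) << 32) | word2
    return generation, commit_time
```

Packing generation 32145 with the timestamp 1452640851 from the earlier examples and unpacking it again returns the same pair; timestamps past the 32-bit limit survive the round trip as well.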
Corner cases of generation number in Git
Corner cases:
for historical reasons, commits that were present in the commit-graph file before the generation number was added to Git had $\operatorname{gen}(C) = 0$
that is why $\operatorname{gen}(\text{root commits}) = 1$
newly created commits, not present in the commit-graph, have gen(C) greater than maximal representable generation number: gen(C) = 0x3FFFFFFF
for such $A$ and $B$ we have $\operatorname{gen}(A) = \operatorname{gen}(B)$, but nothing is known about their reachability relation
gen(C) properties:
if $A \leadsto B$ and $A \neq B$, then $\operatorname{gen}(A) > \operatorname{gen}(B)$ (except for the corner cases)
if $A \leadsto B$, then $\operatorname{gen}(A) \geq \operatorname{gen}(B)$ (weaker; including the corner cases)
equivalently, if $\operatorname{gen}(A) < \operatorname{gen}(B)$, then $A \not\leadsto B$ (including the corner cases)
Conclusion: gen(C) can be used as a cutoff, even if new revisions are not present in the commit-graph file
Definition and properties of generation number
Definition of generation number / level of node
The generation number of a revision (commit graph node) is defined in the following way:
If a commit has no parents, then its generation number is 1.
Otherwise, its generation number is 1 greater than the maximum generation number of its parents
Properties of generation numbers / topological levels:
If a commit B is reachable from a different commit A (and both are present in the commit-graph file), then the generation number of B is smaller than the generation number of A
$A \leadsto B \land A \neq B \implies \operatorname{gen}(A) > \operatorname{gen}(B)$
Therefore, if the generation number of B is greater than or equal to the generation number of A, then B is not reachable from A (if both are present in the commit-graph file)
Definition and properties of topological level
The same definition in graph terms: if node $x$ has no outgoing edges, then $\operatorname{gen}(x) = 1$; otherwise $\operatorname{gen}(x) = \max_{v : (x, v) \in E} \{ \operatorname{gen}(v) \} + 1$, where $(x, v) \in E$ means that there exists an edge from $x$ to $v$
The properties of generation numbers stated for commits then hold for any nodes A and B of the graph
Gains from using generation numbers
Using generation numbers improves the performance of the following commands:
git branch/git tag --contains commit
which finds all branches / tags from which commit is reachable
git branch/git tag --merged commit
which finds all branches / tags reachable from commit
git push with the push.followTags config variable set to true
git push --force and git push --force-with-lease
checking if a given tag points to any transferred commit
Generation numbers can also be used to speed up (not always true for real-life repositories):
computing lowest / closest common ancestors (merge bases)
with git merge-base (or indirectly by git merge and git log A...B)
topological sorting (outputting a revision before its parents)
in git log --graph (or directly with --topo-order)
The negated forms of these commands benefit in the same way:
git branch/git tag --no-contains commit
which finds all branches / tags from which commit is unreachable
git branch/git tag --no-merged commit
which finds all branches / tags unreachable from commit
Using generation numbers for reachability queries
Is commit T reachable from commit A?
[Figure: commit graph laid out by generation number (1 to 9), with commits R, T, A and B marked]
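A minimal Python sketch of how such a query can use generation numbers as a negative-cut filter. The names `can_reach`, `parents` and `gen` are chosen for illustration, and the toy history is hypothetical (it is not the graph from the figure). It assumes `gen` holds topological levels, so every commit has a higher value than each of its parents.

```python
def can_reach(start, target, parents, gen):
    """Return True if `target` is reachable from `start` via parent links."""
    if gen[start] < gen[target]:
        return False  # negative cut: a lower level can never reach a higher one
    stack, seen = [start], {start}
    while stack:
        commit = stack.pop()
        if commit == target:
            return True
        for parent in parents[commit]:
            # Commits below the target's level cannot lead to it, so skip them.
            if parent not in seen and gen[parent] >= gen[target]:
                seen.add(parent)
                stack.append(parent)
    return False


# Hypothetical toy history: two branches growing from a common root.
parents = {"root": [], "t": ["root"], "x": ["root"], "a": ["x"], "b": ["t"]}
gen = {"root": 1, "t": 2, "x": 2, "a": 3, "b": 3}
print(can_reach("b", "t", parents, gen))  # True
print(can_reach("a", "t", parents, gen))  # False
```

The walk never descends below the level of the target, which is where the saving over a plain graph search would come from.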
Using generation numbers to compute git merge-base --all
Lowest common ancestors of A and B (reachable from both A and B) are P and Q
[Figure: commit graph laid out by generation number (1 to 9), with commits R, P, Q, A and B marked]
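One way to compute all lowest common ancestors is to walk from both commits at once, always taking the commit with the highest generation number next. The Python sketch below is an illustration under that assumption, not Git's actual code; the names `merge_bases`, `SIDE_A`, `SIDE_B` and `STALE` are hypothetical, and `gen` is assumed to hold topological levels.

```python
import heapq

SIDE_A, SIDE_B, STALE = 1, 2, 4
BOTH = SIDE_A | SIDE_B


def merge_bases(a, b, parents, gen):
    """Return all lowest common ancestors of commits `a` and `b`."""
    flags = {a: SIDE_A}
    flags[b] = flags.get(b, 0) | SIDE_B
    queue = [(-gen[c], c) for c in flags]  # max-heap on generation number
    heapq.heapify(queue)
    queued = set(flags)
    bases = []
    # Stop as soon as only ancestors of already found bases are left.
    while any(not (flags[c] & STALE) for _, c in queue):
        _, commit = heapq.heappop(queue)
        queued.discard(commit)
        if (flags[commit] & BOTH) == BOTH and not (flags[commit] & STALE):
            bases.append(commit)
            flags[commit] |= STALE  # ancestors of a base are not "lowest"
        for parent in parents[commit]:
            merged = flags.get(parent, 0) | flags[commit]
            if merged != flags.get(parent, 0):
                flags[parent] = merged
                if parent not in queued:
                    queued.add(parent)
                    heapq.heappush(queue, (-gen[parent], parent))
    return bases
```

Because children always have a higher level than their parents, every commit is taken from the queue only after all of its visited descendants, so its flags are final at that point.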
Topological sorting: Kahn's algorithm
Assumption: we walk the edges according to their direction
1 Compute the in-degree of each node;
  it is the number of its incoming edges
2 Walk the graph, selecting a node with an in-degree of zero,
  and decreasing the in-degree of its parents
Q can be a queue, a priority queue, a stack, etc.; each choice gives a different topological order
Q ← queue with the nodes that have an in-degree of 0
while Q is not empty do
    remove node n from the beginning of queue Q (of independent nodes)
    add n to the end of list L of topologically sorted nodes
    for each node m such that there exists an edge e from n to m do
        remove edge e from the graph (which decreases the in-degree of m)
        if there are no incoming edges leading to m then
            add node m to Q
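The pseudocode above can be sketched in Python as follows. The function name `kahn_topological_sort` and the input format (a dict mapping each node to the nodes its outgoing edges lead to, for commits: to its parents) are assumptions made for this example. Instead of removing edges, the sketch only decrements the stored in-degrees.

```python
from collections import deque


def kahn_topological_sort(edges_from):
    """edges_from[n] lists the nodes m for which an edge n -> m exists."""
    # Step 1: compute the in-degree of every node.
    indegree = {n: 0 for n in edges_from}
    for n in edges_from:
        for m in edges_from[n]:
            indegree[m] = indegree.get(m, 0) + 1
    # Step 2: repeatedly output a node whose in-degree dropped to zero.
    queue = deque(n for n in indegree if indegree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in edges_from.get(n, ()):
            indegree[m] -= 1  # stands in for "remove edge e from the graph"
            if indegree[m] == 0:
                queue.append(m)
    if len(order) != len(indegree):
        raise ValueError("the graph contains a cycle")
    return order


# A merge commit c with parents b1 and b2, which share the parent a.
history = {"c": ["b1", "b2"], "b1": ["a"], "b2": ["a"], "a": []}
print(kahn_topological_sort(history))  # ['c', 'b1', 'b2', 'a']
```

Replacing the `deque` with a stack or a priority queue changes which of the valid topological orders is produced.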
Advantages and disadvantages of using Kahn's algorithm in Git
advantages of Kahn's algorithm
one can select an order that keeps commits on a branch
together (which is needed for git log --graph)
easy inclusion of graph traversal limits,
like git log --topo-order A...B
it is possible to terminate the second step early,
for example after showing the first full page of results;
at least in theory
disadvantages / limitations (of the unmodified algorithm)
the whole graph needs to be traversed in the first step
to find all independent nodes,
those with an in-degree of zero
1 limit_list()
2 sort_in_topological_order()
3 get_revision_1()
git log --graph etc. may need only the first
page of results (output goes to the pager)
Using generation numbers during topological sorting
INDEGREE(generation number cutoff)
    while priority queue IN-Q is not empty
          and maximum gen(x) ≥ cutoff
    do
        remove commit C with highest gen(C) from IN-Q
        add 1 to in-degree of each of C's parents
        if in-degree of C is 0, add it to TOPO-Q
priority queue IN-Q (INDEGREE_QUEUE):
    ordered by maximum generation number (level)
⇒
TOPO
    TOPO-Q ← priority queue of commits with in-degree = 0
    cutoff ← generation number of the first commit in TOPO-Q
    while priority queue TOPO-Q is not empty do
        remove commit C from the start of TOPO-Q
        add C at the end of sorted list L (output it)
        for each parent P of commit C do
            if gen(P) is lower than cutoff then
                set cutoff to gen(P)
                walk INDEGREE(cutoff)
            decrement in-degree of commit P
            if in-degree of P is equal to 0 then
                insert P into priority queue TOPO-Q
priority queue TOPO-Q:
    ordered by the selected output sorting order
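The two cooperating walks can be sketched as a Python generator. This is an illustration under stated assumptions, not the code from revision.c: the name `topo_order` is hypothetical, `gen` is assumed to hold topological levels, `tips` must be non-empty, and a plain FIFO queue stands in for the TOPO-Q priority queue. In-degrees are counted only down to the current generation number cutoff, so the first commits can be output without walking the whole history.

```python
import heapq
from collections import deque


def topo_order(tips, parents, gen):
    """Yield commits reachable from `tips`, each one before its parents."""
    tips = list(dict.fromkeys(tips))  # drop duplicates, keep order
    indegree = {}
    in_q = [(-gen[c], c) for c in tips]  # IN-Q: max-heap on generation number
    heapq.heapify(in_q)
    queued = set(tips)

    def indegree_walk(cutoff):
        # Count children for every commit whose generation number >= cutoff.
        while in_q and -in_q[0][0] >= cutoff:
            _, commit = heapq.heappop(in_q)
            for parent in parents[commit]:
                indegree[parent] = indegree.get(parent, 0) + 1
                if parent not in queued:
                    queued.add(parent)
                    heapq.heappush(in_q, (-gen[parent], parent))

    cutoff = min(gen[c] for c in tips)
    indegree_walk(cutoff)
    topo_q = deque(c for c in tips if indegree.get(c, 0) == 0)
    while topo_q:
        commit = topo_q.popleft()
        yield commit
        for parent in parents[commit]:
            if gen[parent] < cutoff:
                cutoff = gen[parent]
                indegree_walk(cutoff)  # make the in-degree of `parent` final
            indegree[parent] -= 1
            if indegree[parent] == 0:
                topo_q.append(parent)
```

For example, `itertools.islice(topo_order(tips, parents, gen), 100)` would take only the first 100 commits, which corresponds to the "first page of results" case.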
Using generation numbers during topological sorting with limits
EXPLORE(generation number cutoff)
    while priority queue EXPLORE-Q is not empty
          and maximum gen(x) ≥ cutoff do
        take into account limits of the A..B type
        add interesting parents to EXPLORE-Q
⇓
INDEGREE(generation number cutoff)
    while priority queue IN-Q is not empty
          and maximum gen(x) ≥ cutoff do
        remove commit C with highest gen(C) from IN-Q
        EXPLORE(gen(C))
        add 1 to in-degree of each of C's parents
        if in-degree of C is 0, add it to TOPO-Q
priority queues EXPLORE-Q and IN-Q:
    ordered by maximum generation number (level)
⇒
TOPO
    TOPO-Q ← priority queue of commits with in-degree = 0
    cutoff ← generation number of the first commit in TOPO-Q
    while priority queue TOPO-Q is not empty do
        remove commit C from the start of TOPO-Q
        add C at the end of sorted list L (output it)
        for each parent P of commit C do
            if gen(P) is lower than cutoff then
                set cutoff to gen(P)
                walk INDEGREE(cutoff)
            decrement in-degree of commit P
            if in-degree of P is equal to 0 then
                insert P into priority queue TOPO-Q
priority queue TOPO-Q:
    ordered by the selected output sorting order
Improving topological sorting performance with generation numbers
Linux kernel (2018)
Test: git rev-list --topo-order -100 HEAD
setup                                      time [s]   change
without commit-graph                         6.80       -
with commit-graph (old algorithm)            0.77     -88.7%
with commit-graph and generation number      0.02     -99.7%
Test: git rev-list --topo-order -100 HEAD -- tools
setup                                      time [s]   change
without commit-graph                         9.63       -
with commit-graph (old algorithm)            6.06     -37.1%
with commit-graph and generation number      0.06     -99.4%
taken from the commit message of "revision.c: generation-based topo-order algorithm"
Generation number: References
Derrick Stolee (Microsoft)
Supercharging the Git Commit Graph III: Generations and Graph Algorithms,
Azure DevOps Blog, 9 July 2018
https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations
Developer Homepage of Derrick Stolee
https://stolee.dev/
John Briggs (Microsoft)
Technical contributions towards scaling for Windows, Git Merge 2019
httpsXGGwwwFyoutu˜eF™omGw—t™hcvav—tWU—VgHoH
▶ Documentation/technical/commit-graph.txt
Git Commit Graph Design Notes
Trouble with using topological level as generation number
generation number (topological level) serves as a reachability index
gen(A) < gen(B) ⟹ A ⇝̸ B
elimination of unreachable revisions (negative-cut filter)
before it, the committer date was used for this purpose as a heuristic
it can be incorrect due to, for example, clock desynchronization
used as a cutoff threshold with slop (5 revisions in a row) in order to speed up calculations
in most of the new algorithms that make use of the generation number
commit objects are sorted according to this value (in a priority queue),
optionally using the committer date (commit creation date) to resolve ties
it turned out that in some cases we can get worse performance
when using generation numbers as compared to the committer date heuristic
the algorithm using the generation number always returns the correct result
generation number used as a sort key selects the longest paths first (?)
Examples of decreased performance (using number of visited commits)
Test: git merge-base A B
(columns "date" and "generation": commits visited when ordering by committer date and by generation number)
repository     A              B              date      generation
Android-base   53c1972bc8f    92f18ac3e39     81 999     109 025
Linux          69973b830859   c470abd4fde4    44 984      47 457
Linux          c8d2bc9bc39e   69973b830859   167 468     635 579
TypeScript     35ea2bea76     123edced90        3464        3439
https://github.com/derrickstolee/gen-test
partial solution
sort by committer date only when there is no generation number cutoff provided
accept the possibility of performance regression for some rare history topologies
(more of a problem for git merge-base than for git log --topo-order A..B)
alternative solution: using a generation number other than the topological level
Alternative sorting orders / generation numbers
V0: (minimal) generation number / topological level
gen(C) is 1 greater than the maximum of gen(P) over the parents P of C
gen(C) for a commit with no parents is 1
computable locally and incrementally, immutable
V1: (epoch, commit creation date) pair
epoch is not smaller than the maximum of the epochs of the parents,
increased by 1 if the current commit is not dated later than its parents
computable locally, immutable, compatible with V0
V2: maximal generation number / reverse topological order (almost)
gen(C) for a commit without children is set to the number of commits in the graph
otherwise gen(C) is 1 less than the minimum among its children
not computable incrementally, compatible with V0
V3: corrected commit creation date
gen(C) is the maximum of C's committer date and the corrected dates of its parents (+1)
computable locally and incrementally, immutable, incompatible with V0
best performance: V2, V3
if incremental computation is more important: V3
version number field in the commit-graph format
https://github.com/derrickstolee/gen-test
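As an illustration, V0 and V3 can both be computed in a single pass over the commits, which is what makes them computable locally and incrementally. The Python sketch below assumes the commits are listed with parents before children; the function names and the toy data (including the skewed clock on commit `x`) are hypothetical.

```python
def topological_levels(commits, parents):
    """V0: 1 for a root, otherwise 1 + the maximum level of the parents."""
    level = {}
    for c in commits:  # parents must come before their children
        level[c] = 1 + max((level[p] for p in parents[c]), default=0)
    return level


def corrected_commit_dates(commits, parents, date):
    """V3: max(own committer date, corrected date of each parent + 1)."""
    corrected = {}
    for c in commits:  # parents must come before their children
        floor = max((corrected[p] + 1 for p in parents[c]), default=0)
        corrected[c] = max(date[c], floor)
    return corrected


# Hypothetical history: x was committed with a clock running behind.
commits = ["r", "x", "m"]
parents = {"r": [], "x": ["r"], "m": ["r", "x"]}
date = {"r": 100, "x": 90, "m": 95}
print(topological_levels(commits, parents))            # {'r': 1, 'x': 2, 'm': 3}
print(corrected_commit_dates(commits, parents, date))  # {'r': 100, 'x': 101, 'm': 102}
```

Both labels only depend on a commit and its parents, so adding new commits never changes the values already stored for old ones.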
Incremental update of the commit-graph file
rewriting the commit-graph file to add information about new
commits is time consuming
done during garbage collection
a low-cost automatic update would be preferred,
for example updating during git fetch
solution: a chain of commit-graph files
the lowest layer is self-sufficient (closed)
higher layers can reference commits in lower layers
set limits (conditions)
each higher layer is at least X times smaller than the layer below it
maximum layer size (except for the lowest (base) layer)
merging layers if needed to fulfill the above conditions
good amortized time is assured,
taking into account the time to merge layers
[Figure: three-layer commit-graph]
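A sketch of one way the two conditions could be enforced when a new layer is added. Only commit counts are modeled here; the function name `add_layer`, the parameter names `size_multiple` and `max_commits`, and their default values are assumptions made for this example.

```python
def add_layer(layers, new_commits, size_multiple=2, max_commits=64000):
    """Return the layer sizes (base layer first) after adding a new tip layer.

    The new tip is merged with the layer below it for as long as it is too
    large relative to that layer, or larger than the maximum layer size.
    """
    layers = list(layers)
    tip = new_commits
    while layers and (size_multiple * tip > layers[-1] or tip > max_commits):
        tip += layers.pop()  # merge the tip with the layer below it
    layers.append(tip)
    return layers


print(add_layer([1000], 10))        # [1000, 10]
print(add_layer([1000, 10], 10))    # [1000, 20]
print(add_layer([1000, 400], 300))  # [1700]
```

Small updates usually touch only the small top layers, and the expensive rewrite of the base layer happens rarely, which is where the good amortized time comes from.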
The chain of commit-graph files
[Figures: commit-graph chain file format (CDAT chunk); three-layer commit-graph]
https://devblogs.microsoft.com/devops/updates-to-the-git-commit-graph-feature/
Problems with changing the definition of generation number
incremental update of commit-graph files
requires that the new gen(C) be locally updateable,
which, in addition to performance requirements,
means corrected commit creation date
the commit-graph format includes a version number,
but when creating the incremental update code
it turned out that Git stops operation (hard fail)
if the commit-graph version is newer than supported,
instead of not using the commit-graph
solution: a variant of the corrected commit date
the column stores the corrected commit date offset
its value is chosen to be at least 1 more than the maximal offset of the parents of the commit,
but also in such a way that date plus offset is strictly monotonic (strictly increasing)
this gives incremental updates, immutability and backward compatibility
however, it is not implemented yet...
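A possible reading of the offset variant, sketched in Python. The function name is hypothetical, commits are assumed to be listed with parents before children, and giving root commits an offset of 1 is an assumption of this sketch (the slide does not say which value roots get).

```python
def corrected_date_offsets(commits, parents, date):
    """Offsets such that both the offset and date + offset grow along edges."""
    offset = {}
    for c in commits:  # parents must come before their children
        ps = parents[c]
        if not ps:
            offset[c] = 1  # assumption: roots start at offset 1
            continue
        # At least 1 more than the maximal offset of the parents ...
        by_offset = max(offset[p] for p in ps) + 1
        # ... and large enough that date + offset exceeds that of every parent.
        by_date = max(date[p] + offset[p] for p in ps) + 1 - date[c]
        offset[c] = max(by_offset, by_date)
    return offset


commits = ["r", "x", "m"]
parents = {"r": [], "x": ["r"], "m": ["r", "x"]}
date = {"r": 100, "x": 90, "m": 95}
print(corrected_date_offsets(commits, parents, date))  # {'r': 1, 'x': 12, 'm': 13}
```

Under this reading the stored column alone still increases from parent to child, so code that treats it as a plain generation number keeps working, while date plus offset gives the corrected commit date.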
New generation number: References
Incremental update of commit-graph files
Derrick Stolee (Microsoft)
Updates to the Git Commit Graph Feature, Azure DevOps Blog, 11 Nov 2019
https://devblogs.microsoft.com/devops/updates-to-the-git-commit-graph-feature/
Christian Couder, Jakub Narębski, Markus Jansen, Gabriel Alcaras, et al.
Git Rev News: Edition 52 (June 28th, 2019), Reviews section
[PATCH 00/17] [RFC] Commit-graph: Write incremental files
https://git.github.io/rev_news/2019/06/28/edition-52/
The need for a new generation number and its choice
Christian Couder, Jakub Narębski, Markus Jansen, Gabriel Alcaras, et al.
Git Rev News: Edition 45 (November 21st, 2018), Support and Reviews sections
commit-graph is cool and [RFC] Generation Number v2
https://git.github.io/rev_news/2018/11/21/edition-45/
Other graphs and other uses of reachability queries
Other types of graph data
social networks
WWW/Internet
XML documents
biological/chemical networks
RDF ontologies
Categories of graphs
large: |V| ≥ 100 000
sparse: |E|/|V| ≤ 2
Reachability relations: graph of strongly connected components
⟹ reachability in a DAG
Practical use
social networks: influence flow
citations: impact of an article
internet: link structure analysis
security: finding possible connections between suspects
biological data: is a given protein related, directly
or indirectly, to a given gene expression?
chemical reactions: can you get a given compound
starting from a given substance?
Reachability queries in general:
classical graph theory problem
primitive operation used in other algorithms
(like pattern matching)
Graph of Strongly Connected Components (SCC)
Division of algorithms for solving the reachability problem
Extreme approaches:
Computing the transitive closure of the graph upfront
build time: O(|V|∗|E|)
constant time queries: O(1)
quadratic memory use: O(|V|²)
Online graph search: BFS, bidirectional BFS, DFS
no build time: O(1)
query answering time: O(|V|+|E|)
no additional memory needed: O(1)
Algorithm types:
Label-Only
answers queries using labels only
nonlinear or unbounded index size
Label+G (label + graph)
requires [augmented] graph search if
labeling could not answer the reachability
query by itself
linear and bounded index build time
and index size
Types of labels in augmented online search algorithms
negative-cut filter (eliminating unreachable nodes)
if u ⇝ v and u ≠ v, then the labels of u and v fulfill the condition e(u) ⪯ e(v)
therefore if this condition is not met, then u cannot reach v
the reverse is not always true: false positives
e(u) ⪯̸ e(v) ⟹ u ⇝̸ v
examples: (minimal) generation number, aka topological level
positive-cut filter (finding reachable nodes)
if for the labels of u and v we have e′(u) ⪯ e′(v), then u ⇝ v, that is u can reach v
node v can be reachable from u even if
the condition for labels is not met: false negatives
e′(u) ⪯ e′(v) ⟹ u ⇝ v
examples: min-post interval labeling for the spanning tree (see next slides)
Reachability algorithms like FELINE or BFL often use many different labels
Positive-cut: spanning tree
Spanning tree / Spanning forest
Given a directed acyclic graph, a spanning forest (tree)
is a subgraph of it for which the following is true:
it includes all nodes of the original (full) graph
there is at most one incoming edge per node
Properties:
if there exists a path from u to v in the spanning tree,
then u ⇝ v in the full graph
but a path from u to v could require going
through edges outside the spanning tree
a ⇝ h in the graph and in the tree
b ⇝ h, but not in the tree
Positive-cut: min-post interval labels in the spanning tree
min-post interval labels
For each vertex (node) u in the graph we define the min-post
interval L_u = [s_u, e_u] in the following way:
e_u is defined as e_u = post(u),
the post-order value (back traversal)
s_u = e_u for leaf nodes (no outgoing edges),
otherwise s_u = min{s_x : x ∈ children(u)}
Properties:
a path from u to v in the spanning tree exists if and
only if L_v ⊆ L_u
if L_v ⊆ L_u, then u ⇝ v (in the full graph)
The same condition is true for similar post-max intervals
[3,3] ⊆ [1,5], therefore a ⇝ h
[3,3] ⊈ [7,9], but b ⇝ h
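The definition above can be sketched in Python as follows: a post-order walk of the spanning forest assigns the intervals, and a query first tries the positive-cut test and falls back to a search of the full graph when the intervals alone give no answer. All names and the toy graph are hypothetical (the graph is not the one from the figure), and the recursive walk is only suitable for small examples.

```python
def min_post_labels(roots, tree_children):
    """Assign the min-post interval (s, e) to every node of a spanning forest."""
    labels = {}
    counter = 0

    def visit(u):
        nonlocal counter
        starts = [visit(x) for x in tree_children.get(u, ())]
        counter += 1  # post-order number: assigned after all children
        s = min(starts) if starts else counter
        labels[u] = (s, counter)
        return s

    for r in roots:
        visit(r)
    return labels


def tree_reaches(labels, u, v):
    """Positive cut: the interval of v lies within the interval of u."""
    (su, eu), (sv, ev) = labels[u], labels[v]
    return su <= sv and ev <= eu


def reaches(u, v, labels, children):
    """Search the full graph, answering early whenever the labels allow it."""
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        if tree_reaches(labels, x, v):  # also true when x == v
            return True
        for y in children.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False


# Toy graph: the edge b -> h exists in the graph but not in the spanning forest.
children = {"a": ["c", "d"], "c": ["h"], "b": ["e", "h"]}
tree_children = {"a": ["c", "d"], "c": ["h"], "b": ["e"]}
labels = min_post_labels(["a", "b"], tree_children)
print(labels["a"], labels["h"], labels["b"])  # (1, 4) (1, 1) (5, 6)
```

Here `tree_reaches(labels, "a", "h")` succeeds from the labels alone, while for `b` and `h` the intervals are disjoint and only the graph search finds the connection.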
Commit graph with the spanning forest
Nodes in the graph are labeled with post(v) value: post-visit order in depth-first search (DFS)
[Figure: commit graph laid out by topological level (1 to 9), nodes labeled with post(v) values 1 to 23, spanning forest marked]
Commit graph with the spanning forest and min-post intervals
This is the same graph as on the previous slide, just drawn differently (post-visit order vs level)
[Figure: the same commit graph plotted with post-visit order post(v) on one axis and topological level l_v (1 to 9) on the other; an example min-post interval L_u = [1, 18] is marked]
Linux kernel repository commit graph
Interval labeling for reachability queries: References (1/2)
Hilmi Yildirim, Vineet Chaoji, Mohammed J. Zaki
GRAIL: scalable reachability index for large graphs,
Proceedings of the VLDB Endowment 3(1):276-284 (2010)
https://www.researchgate.net/publication/220538786_GRAIL_Scalable_reachability_index_for_l
Florian Merz, Peter Sanders
PReaCH: A Fast Lightweight Reachability Index
using Pruning and Contraction Hierarchies (2014)
section 3.3 Pruning Based on DFS Numbering
https://arxiv.org/abs/1404.4465
Renê R. Veloso, Loïc Cerf, Wagner Meira Jr, Mohammed J. Zaki
Reachability Queries in Very Large Graphs: A Fast Refined Online Search Approach
Proc. 17th International Conference on Extending Database Technology (EDBT),
March 24-28, Athens, Greece (2014)
section 3.4.1 Positive-Cut Filter in 3.4 Optimizations
http://openproceedings.org/EDBT/2014/paper_166.pdf
Interval labeling for reachability queries: References (2/2)
Stephan Seufert, Avishek Anand, Srikanta J. Bedathur, Gerhard Weikum
FERRARI: Flexible and Efficient Reachability Range Assignment for Graph Indexing.
Proceedings of the 29th IEEE International Conference on Data Engineering (ICDE'13),
Brisbane, Australia. IEEE (2013).
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.2894&rep=rep1&type=pdf
https://github.com/steps/Ferrari
Stephan Seufert, Avishek Anand, Srikanta J. Bedathur, Gerhard Weikum
High-Performance Reachability Query Processing under Index Size Restrictions (2012)
https://arxiv.org/abs/1211.3375
Jakub Narębski
Reachability labels for version control graphs, Google Colaboratory Jupyter Notebook
httpsXGG™ol—˜Frese—r™hFgoogleF™omGdriveGI†E…U•sluSQsSiiiwpuhvˆt—xƒuSxyzg
Problems to solve
how to add additional labels to use in graph
algorithms in Git?
there can be only one order in the priority queue (which
would be the new generation number)
would adding positive-cut improve the performance of
graph operations, and if so which ones?
returning true/false reachability query result
selecting and returning a subset
finding a node (commit) in a graph
finding a path / subgraph
topological sorting
which labels can be computed incrementally?
the graph of revisions (commit graph) has specific
properties and a specific way of growing (dynamics)
Online search algorithms
Tree+SSPI (2005)
GRIPP (2007)
GRAIL (2010)
(Graph Reachability indexing via rAndomized Interval Labeling)
FERRARI (2013)
(Flexible and Efficient Reachability Range Assignment for gRaph Indexing)
FELINE (2014)
(Fast rEfined onLINE search)
IP (2014) and BFL (2016)
(Independent Permutations labeling)
(Bloom Filter Labeling)
Incremental update of min-post interval labels
1 Starting point: commit graph at some point in the past, with 3 branch tips
  adjustments: [1, 10] + 0, [11, 14] + 0, [15, 17] + 0
2 Beginning of an incremental update of labels, starting from one of the new branch tips
3 Adjusting values of labels: adding a constant value for subtrees (starting from old branch tips)
  adjustments: [1, 10] + 0, [11, 14] + 5, [15, 17] − 4
4 Continuation of the incremental update of labels, walking from the second of the new branch tips
5 Final step: updated labels, walking only new commits; giving O(changes) update time
6 This results in a different spanning forest and different labels than when computed from scratch
[Figures: for each step, the commit graph laid out by level (1 to 9) with post(v) labels on the nodes]
Advantages and disadvantages of incremental update of interval labels
[Figures: the original graph (adjustments [1, 10] + 0, [11, 14] + 0, [15, 17] + 0), the incrementally updated graph (adjustments [1, 10] + 0, [11, 14] + 5, [15, 17] − 4), and the graph labeled from scratch]
advantages
computing the update by walking O(changes) commits
updating post(v) labels is not more costly than
updating graph positions (lexicographical order in
the updated graph)
disadvantages
possibly suboptimal reachability labeling
spanning forest and interval labels depend on when
the commit-graph file was updated
question
is the obtained result of an incremental update
good enough (for improving performance)?
Incremental commit-graph format and interval labels
[Figure (top): data as stored in the commit-graph file chain, split into a layer = 0 and a layer = 1 commit-graph file, with adjustments [1, 10] + 0, [11, 14] + 5, [15, 17] − 4]
[Figure (bottom): effective post(v) labels, as visible from the top layer]
Each layer in the commit-graph chain includes corrections (adjustments) to post(v) labels
for the previous layer in the chain (original values shown)
Possible solution:
For each layer in the commit-graph chain
store (in relevant chunks):
min-post interval labels
list of tips (heads) in the graph
possibly also list of their intervals
for each layer in the chain, except for the base
layer, store adjustments for the previous
layer (only needed for tips)
Example of large graph: CiteseerX citation network
6 540 401 nodes
15 011 260 edges
2.295 average degree
567 149 roots
5 740 710 leaves
4.07 × 10⁻⁴ R-ratio (probability that two nodes are connected)
59 max. path length, that is max. level
dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 68 / 68


Graph operations in Git version control system

  • 1. Graph operations in Git version control system how the performance was improved (for large repositories), how can it be further improved dr Jakub Nar¦bski Nicolaus Copernicus University in Toru«, Poland presented on December 3, 2019 dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 1 / 68
  • 2. Table of contents 1 Introduction Motivation Graphs in Git 2 Operations on graphs 3 Methods for improving performance Bitmap index Generation number Algorithm for nding common ancestors Algorithm for topological sorting 4 Future work Corrected commit creation date Other graph labels dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 2 / 68
  • 3. Table of contents 1 Introduction Motivation Graphs in Git 2 Operations on graphs 3 Methods for improving performance Bitmap index Generation number Algorithm for nding common ancestors Algorithm for topological sorting 4 Future work Corrected commit creation date Other graph labels dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 3 / 68
  • 4. Motivation: scaling Git up in the presence of the increasing size of repositories Git repositories are growing with respect to the number of commits examples: Linux kernel: 740 000 commits (2018) MS Windows: 1 700 000 commits (2018) Android (AOSP): 874 000 commits (2019) Chromium: 772 000 commits (2019) . . . Git: 55 000 commits (2019) noticeable slowdown of Git operations (taking now seconds) gitk i git log --graph git push --force-with-lease git status --ahead-behind . . . serialized commit-graph, since Git 2.18 (Derrick Stolee, Microsoft) space for storing auxiliary labels / reachability indices, such as e.g. the generation number dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 4 / 68
  • 5. Motivation: scaling Git up in the presence of the increasing size of repositories Git repositories are growing with respect to the number of commits examples: Linux kernel: 826 000 commits (2019) MS Windows: 3 100 000 commits (2019) Android (AOSP): 874 000 commits (2019) Chromium: 772 000 commits (2019) . . . Git: 55 000 commits (2019) noticeable slowdown of Git operations (taking now seconds) gitk i git log --graph git push --force-with-lease git status --ahead-behind . . . serialized commit-graph, since Git 2.18 (Derrick Stolee, Microsoft) space for storing auxiliary labels / reachability indices, such as e.g. the generation number dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 4 / 68
  • 6. Motivation: scaling Git up in the presence of the increasing size of repositories Git repositories are growing with respect to the number of commits examples: Linux kernel: 826 000 commits (2019) MS Windows: 3 100 000 commits (2019) Android (AOSP): 874 000 commits (2019) Chromium: 772 000 commits (2019) . . . Git: 55 000 commits (2019) noticeable slowdown of Git operations (taking now seconds) gitk i git log --graph git push --force-with-lease git status --ahead-behind . . . serialized commit-graph, since Git 2.18 (Derrick Stolee, Microsoft) space for storing auxiliary labels / reachability indices, such as e.g. the generation number dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 4 / 68
  • 7. Object graph Git repository as contentaddressed object database data in Git repositories is stored as Direct Acyclic Graph (DAG), that is, egdes are directed and there are no loops nodes (vertices) in this graph are objects of the following types commit representing revisions, store project history tree snapshot of project les at given point of time representing subdirectories (in a hierarchical way) blob store le contents at given version of it tag represents annotated or signed version of a project edges between nodes represents relationships commit → commit: based on relationship, the second one is parent of the rst (each revision has zero or more parent commits) commit → tree: project repository contents at given revision tree → tree and tree → blob: lesystem hierarchy tag → object (usually to commit): symbolic name of the object xkcd.com/1597/ dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 5 / 68
  • 8. Visualization of the objects graph in the Git repository Object graph and contents deduplication Hierarchical le structure dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 6 / 68
  • 9. Visualization of the objects graph in the Git repository Object graph and contents deduplication Hierarchical le structure dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 6 / 68
  • 10. Object graph in a Git repository (object model of a repository) Derrick Stolee Advanced Git for Beginners httpsXGGstoleeFdevGdo™sGgitFpdf dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 7 / 68
  • 11. Object graph and external references to it: branches, tags, the index,. . . example from httpsXGGgithu˜F™omGsensorfloGgitEdr—wGwiki dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 8 / 68
  • 12. Commit graph Representation of project history in Git (Almost) every commit (revision) has reference to its parent, that is the revision before it was based on Some revisions (commits), being result of merge operation, have two parents (rarely more so called octopus merges) At least one (initial) revision has no parents Each commit object includes reference to the tree representing the snapshot of les in the repository Branches and tags are external references to the commit graph HEAD is a symbolic reference to the current branch (detached HEAD directly points to a commit) c7cd3 master HEAD f30ab 34ac2 v0.9 98ca9 23b88 d77af dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 9 / 68
  • 13. Object addressing SHA-1 / SHA-256 of object representation as object identiers Each object in the repository's object database is referenced using the SHA-1 hash of the object contents representation (switch to SHA-256 aka NewHash is in progress) Examples: object representation of a commit: 6 git ™—tEfile Ep rieh¢ tree RWfISPRWV™fdeHVPfQPPQePHQQV™—TRSIUWWI™dU p—rent ˜ddI™™PHWPWeWfUTQIddPWffUHRPTee—SQfTWRRQ —uthor tunio g r—m—no `gitsterdpo˜oxF™omb IRSPTRHVSI EHVHH ™ommitter tunio g r—m—no `gitsterdpo˜oxF™omb IRSPTRHVSI EHVHH pirst ˜—t™h for post PFU ™y™le ƒignedEoffE˜yX tunio g r—m—no `gitsterdpo˜oxF™omb http://shafiul.github.io/gitbook dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 10 / 68
  • 14. Object addressing SHA-1 / SHA-256 of object representation as object identiers Each object in the repository's object database is referenced using the SHA-1 hash of the object contents representation (switch to SHA-256 aka NewHash is in progress) Examples: object representation of a tree: 6 git ™—tEfile Ep RWfISPRWV™fdeHVPfQPPQePHQQV™—TRSIUWWI™dU IHHTRR ˜lo˜ SeWVVHT™T™™PRT—™efSfSQW—eIWIUIH—H™HT—dQf Fgit—ttri˜utes IHHTRR ˜lo˜ I™PfVQPIQVTfVWefV™HQdIIISW™WU—HfIWR™RRPQ Fgitignore IHHTRR ˜lo˜ eS˜RIPT˜e™SSUd˜SSWPR˜U˜THedUHQRWTPTe—P™R Fm—ilm—p IHHTRR ˜lo˜ ™Q˜fW™TdRdI™THRWddQV—S—VTId™e˜R™VeI˜UeWW Ftr—visFyml IHHTRR ˜lo˜ SQTeSSSPRd˜UP˜dP—™fIUSPHV—efRfQdf™IRVdRP gy€‰sxq HRHHHH tree QUHff˜fdTVW—SdQHU—SdWW˜eQPHHT˜efI˜HdQQed ho™ument—tion IHHUSS ˜lo˜ SVUQfITQeSITVRffRf™SQIIW™SWfeP™Rf™PRf—˜e qs„E†i‚ƒsyxEqix IHHTRR ˜lo˜ ff˜HUIeWfHQ—UW—HSP˜e——RQUPf—UWHe™˜—˜˜˜U˜ sxƒ„evv FFF http://shafiul.github.io/gitbook dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 10 / 68
  • 15. Object addressing SHA-1 / SHA-256 of object representation as object identiers Each object in the repository's object database is referenced using the SHA-1 hash of the object contents representation (switch to SHA-256 aka NewHash is in progress) Examples: object representation of a tag: 6 git ™—tEfile Ep vPFUFH o˜je™t USRVVRPSS˜˜SVHdfISWeSVdef—VI™ddQH˜S™RQH™ type ™ommit t—g vPFUFH t—gger tunio g r—m—no `gitsterdpo˜oxF™omb IRSIWRSPWP EHVHH qit PFU !!Efiqsx €q€ ƒsqxe„…‚i!!E †ersionX qnu€q vI FFF !!Eixh €q€ ƒsqxe„…‚i!!E http://shafiul.github.io/gitbook dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 10 / 68
  • 16. Table of contents 1 Introduction Motivation Graphs in Git 2 Operations on graphs 3 Methods for improving performance Bitmap index Generation number Algorithm for nding common ancestors Algorithm for topological sorting 4 Future work Corrected commit creation date Other graph labels dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 11 / 68
  • 17. Git commands working directly on the object graph object exchange (object transfer) git fetch git push git clone server and clients perform negotiation to send only those (new) objects that are necessary (that are missing from the other side) garbage collection git repack git gc removing unreachable objects (results of git ™ommit EE—mend, git re˜—se, multiple git —dd `file b, etc) dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 12 / 68
  • 18. Git commands working directly on the commit graph (1/2) breakdown into categories based on the type of the result commands returning boolean: true/false value git merge-base --is-ancestor A B is B reachable from A commands returning subset (of larger set) git branch --contains (or git tag ...) branches/tags from which given commit is reachable git branch --merged (or git tag ...) branches/tags reachable from given commit autofollowing tags during git fetch (see documentation of the remote.name.tagOpt) commands nding node or nodes in the commit graph git merge-base --all A B nding lowest (closest) common ancestors dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 13 / 68
  • 19. Git commands working directly on the commit graph (1/2) breakdown into categories based on the type of the result commands returning boolean: true/false value git merge-base --is-ancestor A B is B reachable from A commands returning subset (of larger set) git branch --no-contains (or git tag ...) branches/tags from which given commit is unreachable git branch --no-merged (or git tag ...) branches/tags unreachable from given commit autofollowing tags during git fetch (see documentation of the remote.name.tagOpt) commands nding node or nodes in the commit graph git merge-base --all A B nding lowest (closest) common ancestors dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 13 / 68
  • 20. Git commands working directly on the commit graph (2/2) breakdown based on the type of the result, continued commands returning path / subgraph git log A..B ≡ git log B --not A reachable from B and unreachable from A git log A...B (symmetrical dierence) A...B ≡ A B --not $(git merge-base --all A B) exclusively reachable from either A or from B git log --ancestry-path A..B commits directly on path leading from B to A (inclusive) topological sorting options (and equivalent) git log --topo-sort / --graph, gitk, etc. additionally those try to keep related revisions together iterative bisection of graph (to nd regression) git bisect A..B A B A B A...B A O B dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 14 / 68
  • 21. Denition of reachability in a directed acyclic graph (DAG) Denition of reachability (graph theory) Let's assume that we have directed acyclic graph G = (V ,E), where V is nite set of vertices (nodes), and E ⊂ V 2 is nite set of directed edges. ∀(u,v) ∈ V 2 we say that v is reachable from u, which we denote as r(u,v) or as u ⇝ v, if and only if u = v or ∃(u,w) ∈ E ∧r(w,v). Properties of this relation ∀v ∈ V : r(v,v) r(u,w)∧r(w,v) =⇒ r(u,v) r(u,v)∧r(v,u) =⇒ u = v Reachability relation imposes partial order for nodes in the graph dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 15 / 68
  • 22. Lowest (closest) common ancestors, or merge base Lowest common ancestor(s) is used when merging (via git merge) two branches A and B lowest (closest) common ancestor, like P and Q, is reachable both from A and from B it is not reachable from any other revision reachable from both A and from B httpsXGGdev˜logsFmi™rosoftF™omGdevopsGsuper™h—rgingEtheEgitE™ommitEgr—phEiiEfileEform—tG dr J. Nar¦bski (UMK, Toru«) Graph operations in Git presented on 03.12.2019 (v1.2) 16 / 68
  • 23. Topological sorting of a directed acyclic graph (DAG)
    Definition of topological sorting: a topological sorting of a directed acyclic graph is any linear (total) ordering $\prec$ of its nodes (vertices) for which $(u, v) \in E \implies u \prec v$.
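The definition can be checked mechanically: an ordering is topological when every edge points forward in it. A minimal Python sketch, assuming the graph is given as a list of `(u, v)` edge pairs; the example edges are hypothetical.

```python
def is_topological_order(edges, order):
    """True if u comes before v in `order` for every edge (u, v)."""
    position = {node: index for index, node in enumerate(order)}
    return all(position[u] < position[v] for u, v in edges)


# Hypothetical history; each edge points from a commit to one of its parents.
example_edges = [("C", "B"), ("B", "A"), ("D", "A")]
```

With edges pointing from child to parent, a valid order lists every commit before its parents, which is the order `git log --topo-order` is documented to produce.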
  • 26. Bitmap indices: solving the problem of object exchange
    Here "bitmap" means simply a bit vector (a vector of 0/1 values).
    Image: Scott Chacon, Pro Git, https://git-scm.com/book

               1a410e  cac0ca  fdf4fc  3c439c  0155eb  d8329f  1f7a7a  fa49b0  93baae
      1a410e     1       1       1       1       1       1       1       1       1
      cac0ca     0       1       1       0       1       1       1       1       1
      fdf4fc     0       0       1       0       0       1       0       0       1

    - the bit location corresponds to the position of the object in the packfile
    - bit 1 in the bitmap for a given revision means that the object at the given position is reachable from it
    - bit 0 in the bitmap: not reachable
    - objects to be transferred: want AND NOT have
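The "want AND NOT have" rule can be illustrated with plain integers used as bit vectors. This is a sketch only: it reuses two rows of the table above as the `want` and `have` bitmaps, treats the leftmost column as position 0, and the helper names are hypothetical.

```python
def bitmap_from_string(bits: str) -> int:
    """Parse '0 1 1 0 ...' into an int; column i of the string becomes bit i."""
    value = 0
    for position, bit in enumerate(bits.split()):
        if bit == "1":
            value |= 1 << position
    return value


def positions(bitmap: int):
    """Positions (indices of objects in the packfile) whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


want = bitmap_from_string("1 1 1 1 1 1 1 1 1")  # objects reachable from what we want
have = bitmap_from_string("0 1 1 0 1 1 1 1 1")  # objects reachable from what we have
to_send = want & ~have                          # want AND NOT have
```

Here `positions(to_send)` yields the two object positions that are set in `want` but not in `have`.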
  • 27. Bitmap indices: finding objects that we have (and don't need to fetch)
    http://githubengineering.com/counting-objects/
      $ GIT_TRACE_PACKET=1 git pull
      [...]
      ... fetch-pack want a595...
      ... fetch-pack want a4c7...
      ... fetch-pack want d1c7...
      [...]
      ... fetch-pack 0000
      ... fetch-pack have cc3f...
      ... fetch-pack have 5bd5...
      [...]
      ... fetch-pack 0000
  • 28. Bitmap index for the packfile
    Selected details of the bitmap index implementation in Git.
    We cannot create a reachability bitmap (bit vector) for every revision, because it would take too much space (it would store the transitive closure of the object graph):
    - a heuristic algorithm selects which commits get a bitmap
    - the newest versions get a bitmap, which is needed for cloning
    - the deeper in the commit graph (the earlier in history), the less frequently reachability bitmaps are added (down to one every 3000 revisions)
    The space taken by bitmaps is minimized by using RLE compression: EWAH Bitmaps by Daniel Lemire (implementations in Java, C#, C++)
    - patent free (which is unfortunately not the case for every bitmap compression algorithm)
    - JGit support added by Shawn Pearce and Colby Ranger (Google)
    - ported to C as libewok by Vicent Martí (GitHub)
    - good enough compression level with fast decompression
    - some operations on bitmaps can be performed without decompression
    Trick: store the result of XOR with the bitmap of some earlier commit.
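The XOR trick works because the bitmap of a commit usually differs from the bitmap of a nearby earlier commit in only a few positions, so the XOR result is mostly zeros and compresses well under run-length encoding. A minimal Python sketch with made-up bitmaps; the function names are hypothetical and this does not reproduce Git's on-disk layout.

```python
def store_as_xor(bitmap: int, base: int) -> int:
    """What gets stored: the bits in which `bitmap` differs from `base`."""
    return bitmap ^ base


def restore_from_xor(stored: int, base: int) -> int:
    """XOR is its own inverse, so applying it again recovers the bitmap."""
    return stored ^ base


earlier = 0b111110110   # made-up reachability bitmap of an earlier commit
newer = 0b111111111     # made-up bitmap of a later commit: two more objects
stored = store_as_xor(newer, earlier)
```

Only two bits are set in `stored`, compared with nine in `newer`.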
  • 29. EWAH (Enhanced Word-Aligned Hybrid) format
    - word-aligned format; different variants: 32-bit, 64-bit; speed at the cost of compression ratio
    - clean words (a run of 0s or of 1s): RLE (run-length) encoding of how many repeating 0s or 1s
    - dirty words (0s mixed with 1s): literal encoding
    - EWAH marker words: length and type of sequence
    - bit operations on compressed bitmaps without decompressing them (symmetric operations only)
    Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes. http://arxiv.org/abs/0901.3751
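The split into clean and dirty words can be sketched as a toy encoder. This is a simplified illustration of word-aligned run-length encoding, not the actual EWAH layout: it uses Python tuples instead of marker words, and an 8-bit word size chosen only to keep the example readable.

```python
WORD_BITS = 8                      # toy word size; real variants use 32 or 64 bits
ALL_ONES = (1 << WORD_BITS) - 1


def compress(words):
    """Encode a list of words as ('clean', bit, run_length) / ('dirty', word) tokens."""
    tokens = []
    for word in words:
        if word in (0, ALL_ONES):
            bit = 1 if word == ALL_ONES else 0
            if tokens and tokens[-1][0] == "clean" and tokens[-1][1] == bit:
                # extend the current run of identical clean words
                tokens[-1] = ("clean", bit, tokens[-1][2] + 1)
            else:
                tokens.append(("clean", bit, 1))
        else:
            tokens.append(("dirty", word))   # mixed 0s and 1s: stored literally
    return tokens


def decompress(tokens):
    """Inverse of compress()."""
    words = []
    for token in tokens:
        if token[0] == "clean":
            words.extend([ALL_ONES if token[1] else 0] * token[2])
        else:
            words.append(token[1])
    return words


sample = [0, 0, 0, 0x5A, 0xFF, 0xFF, 0]
```

Runs of clean words collapse into a single token, while dirty words cost one token each.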
  • 30. Bitmap index: References
    - Vicent Martí (GitHub). Counting Objects. GitHub Engineering, 22 Sep 2015. http://githubengineering.com/counting-objects/ https://github.blog/2015-09-22-counting-objects/
    - Shawn Pearce (Google). Scaling Up JGit. EclipseCon 2013, Boston.
    - Daniel Lemire. All About Bitmap Indexes... And Sorting Them. Slides presented at BDA'08 and DOLAP'08, 12 Feb 2009. http://lemire.me/talks/uqamtalk.pdf
    - Vicent Martí. GIT bitmap v1 format. Documentation/technical/bitmap-format.txt
  • 31. Extracting information about edges in the commit graph
    Following the edges requires:
    - finding the object (packed or loose)
    - object decompression (zlib)
    - possibly resolving deltas
    - parsing the commit object
    Commit object representation:
      tree 49f152498cfde082f3223e20338ca64517991cd7
      parent bdd1cc20929e9f7631dd29ff70426eea53f69443
      author A U Thor <a@ex.com> 1452640851 -0800
      committer C O Mitter <c@ex.us> 1452640851 -0800

      First batch for post 2.7 cycle

      Signed-off-by: A U Thor <a@ex.com>
    [Figure: loose format (one file per object) and packed format (multiple objects); from Junio Hamano, Git Chronicles, talk at GitTogether 2008]
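The decompress-and-parse steps for a loose commit object can be sketched in Python as follows. The sketch builds a zlib-compressed object in memory rather than reading `.git/objects`, skips delta resolution (which applies only to packed objects), and the function names are hypothetical.

```python
import zlib


def read_loose(data: bytes):
    """Decompress a loose object and split off its '<type> <size>' header."""
    raw = zlib.decompress(data)
    header, _, content = raw.partition(b"\x00")
    kind, size = header.split(b" ")
    assert int(size) == len(content)
    return kind.decode(), content


def parse_commit(content: bytes):
    """Extract the tree, the list of parents (graph edges) and the message."""
    headers, _, message = content.decode("utf-8").partition("\n\n")
    tree, parents = None, []
    for line in headers.split("\n"):
        key, _, value = line.partition(" ")
        if key == "tree":
            tree = value
        elif key == "parent":
            parents.append(value)
    return tree, parents, message


body = (
    b"tree 49f152498cfde082f3223e20338ca64517991cd7\n"
    b"parent bdd1cc20929e9f7631dd29ff70426eea53f69443\n"
    b"author A U Thor <a@ex.com> 1452640851 -0800\n"
    b"committer C O Mitter <c@ex.us> 1452640851 -0800\n"
    b"\n"
    b"First batch for post 2.7 cycle\n"
)
loose = zlib.compress(b"commit %d\x00" % len(body) + body)
kind, content = read_loose(loose)
tree, parents, message = parse_commit(content)
```

Every edge lookup done this way pays for decompression and text parsing, which is the cost that a precomputed, serialized graph avoids.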
  • 32. Two ways of storing objects in Git: an outline
    [Figure from Junio Hamano, Git Chronicles, talk at GitTogether 2008]
  • 33. Commit-graph file format (storing the serialized commit graph)
    Following the edges required:
    - finding the object (packed or loose)
    - object decompression (zlib)
    - possibly resolving deltas
    - parsing the commit object
    This can take a long time, especially in large repositories, where it needs to be done thousands of times.
    Git 2.18 and later supports the commit-graph file, which stores this DAG information in a compact form:
    - fanout table: binary search seeds
    - list of commits, sorted by commit ID
    - commit data, with parents given as positions on that list
    - list of 3rd and later parents: octopus edges
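The role of the fanout table as a source of "binary search seeds" can be sketched in memory. The sketch below assumes a 256-entry table in which entry i counts the commits whose ID starts with a byte less than or equal to i; the IDs are shortened, made-up hex strings and the function names are hypothetical.

```python
from bisect import bisect_left


def build_fanout(sorted_ids):
    """fanout[i] = number of IDs whose first byte is <= i."""
    fanout = [0] * 256
    for oid in sorted_ids:
        fanout[int(oid[:2], 16)] += 1
    for i in range(1, 256):
        fanout[i] += fanout[i - 1]
    return fanout


def find_position(sorted_ids, fanout, oid):
    """Position of `oid` in the sorted list, or None if it is not present."""
    first = int(oid[:2], 16)
    lo = fanout[first - 1] if first > 0 else 0   # seeds for the binary search
    hi = fanout[first]
    pos = bisect_left(sorted_ids, oid, lo, hi)
    if pos < hi and sorted_ids[pos] == oid:
        return pos
    return None


ids = sorted(["49f1", "bdd1", "0a11", "0aff"])   # made-up, shortened commit IDs
fanout = build_fanout(ids)
# Commit data: parents stored as positions in `ids` instead of full IDs.
parent_positions = {find_position(ids, fanout, "49f1"): [find_position(ids, fanout, "bdd1")]}
```

Once a commit's position is known, its parents are found by index, with no object lookup, decompression or parsing.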
  • 34. Commit-graph file format schema (serialized commit graph)
    [Figure from https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-ii-file-format/]
  • 35. Making Git faster with the commit-graph file (serialized commit graph)
    Linux kernel (around 750 000 revisions / commits)
      command                        before [s]   after [s]   change
      git merge-base master topic       0.52         0.06      -88%
      git branch --contains            76.20         0.04      -99%
      git tag --contains                5.30         0.03      -99%
      git tag --merged                  6.30         1.50      -76%
      git log --graph -10               5.90         0.74      -87%
    master: 032b4cc884490c4bc7c4ef8c91e6d, topic: 62d18ecfa64137349fac9c5817784fb,
    where the topic branch is 30 986 revisions behind the master branch, and the master branch can reach 722 849 revisions.
    Git code repository (around 50 000 revisions)
      command                        before [s]   after [s]   change
      git merge-base master topic       0.10         0.04      -60%
      git branch --contains             0.76         0.03      -96%
      git tag --contains                0.70         0.03      -96%
      git tag --merged                  0.74         0.12      -84%
      git log --graph -10               0.44         0.05      -89%
    master: b50d82b00a8fc9d24e41ae7dc30185, topic: e144d126d74f5d2702870ca9423743,
    where master is 2032 revisions behind the topic branch, and can reach 49 361 commits.
  • 36. Making Git faster with the commit-graph file (serialized commit graph), continued
    MS Windows, with GVFS (around 1 700 000 commits)
      command                        before [s]   after [s]   change
      git status --ahead-behind        14.30         4.70      -67%
      git merge-base master topic      11.40         1.80      -84%
      git branch --contains             9.40         1.60      -83%
      git log --graph -10              24.30         5.30      -78%
    Here master includes 2 214 796 reachable commits, and the local version of master is 81 776 revisions behind origin/master, which affects the speed of git status; such a value is typical for development there.
  • 37. Serialized commit graph: References
    - Derrick Stolee (Microsoft). Supercharging the Git Commit Graph series, Azure DevOps Blog, 2018. https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph/ https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-ii-file-format/
    - Johannes Schindelin (Microsoft), Derrick Stolee (Microsoft). Making Git for Windows: starting from 15:00, Git Merge 2018. httpsXGGyoutuF˜eGoywziWVQmwctaWHS
    - Documentation/technical/commit-graph-format.txt: Git commit graph format
  • 38. The creation date of the commit object as a heuristic
      tree 49f152498cfde082f3223e20338ca64517991cd7
      parent bdd1cc20929e9f7631dd29ff70426eea53f69443
      author A U Thor <thor@example.com> 1452640851 -0800
      committer C O Mitter <ter@example.us> 1452640851 -0800

      First batch for post 2.7 cycle

      Signed-off-by: A U Thor <thor@example.com>
    Revision dates:
    - authordate is the creation date of the changes (authorship)
    - committerdate is the date those changes were added to the repository (creating the commit object)
    The commit object data includes the date and time of its creation:
    - revisions based on it must have been created later with regard to a global time
    - unfortunately, because of the lack of clock synchronization, we cannot entirely rely on this data
    - it can however be used as a heuristic: as a stop condition, and for the order of traversal
    - Git stops searching after finding 5 (SLOP) revisions that are older than a boundary
  • 40. Generation number / topological level
    Definition of level in a graph / of generation number: the level $l_v$ of vertex $v$ in graph $G = (V, E)$ is defined as its depth, that is, the length of the longest path from $v$ to a root:
    - if $v$ has no parents (outgoing edges), i.e. if $v$ is a root, then $l_v = 0$
    - otherwise $l_v = \max_{u : (v, u) \in E} \{ l_u \} + 1$
    Properties:
    - if $u \rightsquigarrow v$ and $u \neq v$, then $l_u > l_v$
    - if $u \rightsquigarrow v$, then $l_u \geq l_v$ (weaker condition)
    [Figure: example DAG with topological levels]
  • 41. Generation number / topological level
    Definition of level in a graph / of generation number: the level $l_v$ of vertex $v$ in graph $G = (V, E)$ is defined as its depth, that is, the length of the longest path from $v$ to a leaf:
    - if $v$ has no predecessors (outgoing edges), i.e. if $v$ is a leaf, then $l_v = 0$
    - otherwise $l_v = \max_{u : (v, u) \in E} \{ l_u \} + 1$
    Properties:
    - if $u \rightsquigarrow v$ and $u \neq v$, then $l_u > l_v$
    - if $u \rightsquigarrow v$, then $l_u \geq l_v$ (weaker condition)
    [Figure: example DAG with topological levels]
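The recursive definition of the level can be evaluated bottom-up. The following Python sketch assumes a hypothetical `parents` dictionary (commit to list of parents, i.e. its outgoing edges) and uses an explicit stack so that deep histories do not exhaust the recursion limit; it follows the slide's convention that a commit with no parents has level 0.

```python
def topological_levels(parents):
    """Level of each commit: 0 without parents, else 1 + max level of parents."""
    levels = {}
    for start in parents:
        stack = [start]
        while stack:
            commit = stack[-1]
            if commit in levels:
                stack.pop()
                continue
            missing = [p for p in parents[commit] if p not in levels]
            if missing:
                stack.extend(missing)      # compute the parents first
            else:
                # default=-1 makes a commit with no parents come out at level 0
                levels[commit] = 1 + max(
                    (levels[p] for p in parents[commit]), default=-1
                )
                stack.pop()
    return levels


# Hypothetical history with a merge commit M.
graph = {"A": [], "B": ["A"], "C": ["B"], "D": ["A"], "M": ["C", "D"]}
```

The merge commit M gets its level from the longer of its two parent chains, which is why the level equals the length of the longest path down to a commit without parents.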
  • 42. Generation numbers (levels) in the commit graph
    [Figure: example graph of revisions]
    Derrick Stolee, Supercharging the Git Commit Graph III: Generations and Graph Algorithms, July 2018. https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/
  • 43. Generation numbers (levels) in the commit graph
    [Figure: generation numbers for this graph]
    Derrick Stolee, Supercharging the Git Commit Graph III: Generations and Graph Algorithms, July 2018. https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/
  • 44. Historical approaches to adding generation number: 2011 (or earlier)
    Idea 1: adding a new header to the commit
      tree 49f152498cfde082f3223e20338ca64517991cd7
      parent bdd1cc20929e9f7631dd29ff70426eea53f69443
      author A U Thor <a@ex.com> 1452640851 -0800
      committer C O Mitter <c@ex.us> 1452640851 -0800
      generation 32145

      First batch for post 2.7 cycle

      Signed-off-by: A U Thor <a@ex.com>
    - problems with backward compatibility
    - questions about ensuring correctness: repositories with and without it; possibly copying of unknown headers by cherry-pick, revert, etc.
    Idea 2: using git notes as a cache
    - the git notes technique allows adding notes to any object: blob, commit, ...
    - notes are split into namespaces, e.g. the default refs/notes/commits
    - the textconv mechanism can be configured to use them as a cache
    - questions about ensuring correctness
    - performance: the notes mechanism is not intended for a very large number of notes, O(commits)
  • 45. Storing generation numbers in the commit-graph file
    Commit Data chunk: CDAT
    - includes a column (field) intended for the commit creation date (committerdate)
    - if the column used a signed 32-bit integer, it would run into the Y2038 problem
    - therefore a 64-bit wide column is used (two 4-byte words):
      the 30 most significant bits of the first 4 bytes are used for the generation number;
      the 32 bits of the second 4 bytes and the 2 least significant bits of the previous 4-byte word are used for committerdate, which together makes 34 bits to store the datetime as a Unix timestamp
    Definition of generation number in the commit-graph file: the generation number gen(A) of revision A of a project is defined in the following way:
    - if A has no parents, i.e. it is a root commit, then $\mathrm{gen}(A) = 1$
    - otherwise its generation number is one more than the maximum generation number among all its parents: $\mathrm{gen}(A) = \max \{ \mathrm{gen}(P) : P \in \mathrm{parents}(A) \} + 1$
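The bit layout described above (30 bits of generation number, then 34 bits of timestamp, spread over two 4-byte words) can be worked through with a few shifts and masks. This Python sketch assumes big-endian (network) byte order for the two words; the function names are hypothetical and it is not Git's actual reader or writer.

```python
import struct


def pack_generation_and_date(generation: int, timestamp: int) -> bytes:
    """Pack a 30-bit generation number and a 34-bit timestamp into 8 bytes."""
    if not 0 <= generation < (1 << 30):
        raise ValueError("generation number does not fit in 30 bits")
    if not 0 <= timestamp < (1 << 34):
        raise ValueError("timestamp does not fit in 34 bits")
    first = (generation << 2) | (timestamp >> 32)   # top 30 bits + 2 high date bits
    second = timestamp & 0xFFFFFFFF                 # low 32 bits of the date
    return struct.pack(">II", first, second)


def unpack_generation_and_date(data: bytes):
    """Inverse of pack_generation_and_date()."""
    first, second = struct.unpack(">II", data)
    generation = first >> 2
    timestamp = ((first & 0x3) << 32) | second
    return generation, timestamp


packed = pack_generation_and_date(32145, 1452640851)
```

With 34 bits the timestamp can exceed 2^32, so a date past the year 2106 still round-trips, which is the point of borrowing two bits from the first word.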
  • 47. Corner cases of generation number in Git
    Corner cases:
    - for historical reasons, commits that were present in the commit-graph file before the generation number was added to Git have gen(C) = 0; that is why gen(root commit) = 1
    - newly created commits, not present in the commit-graph, have gen(C) greater than the maximal representable generation number, 0x3FFFFFFF
    - for two such commits A and B we have gen(A) = gen(B), but nothing is known about their reachability relation
    Properties of gen(C):
    - if $A \rightsquigarrow B$ and $A \neq B$, then $\mathrm{gen}(A) > \mathrm{gen}(B)$, except for the corner cases
    - if $A \rightsquigarrow B$, then $\mathrm{gen}(A) \geq \mathrm{gen}(B)$ (weaker), including the corner cases
    - equivalently, if $\mathrm{gen}(A) < \mathrm{gen}(B)$, then $A \not\rightsquigarrow B$, including the corner cases
    Conclusion: gen(C) can be used as a cutoff, even if new revisions are not present in the commit-graph file.
  • 48. Definition and properties of generation number
    Definition of generation number / level of node: the generation number of a revision (commit graph node) is defined in the following way:
    - if a commit has no parents, then its generation number is 1
    - otherwise, its generation number is 1 greater than the maximum generation number of its parents
    Properties of generation numbers / topological levels:
    - if a commit B is reachable from A, $A \neq B$ (and both are present in the commit-graph file), then the generation number of B is smaller than the generation number of A: $A \rightsquigarrow B \implies \mathrm{gen}(A) > \mathrm{gen}(B)$
    - therefore, if the generation number of B is greater than or equal to the generation number of A, then B is not reachable from A (if both are present in the commit-graph file)
  • 49. Definition and properties of topological level
    Definition of generation number / level of node: the level of a revision (commit graph node) is defined in the following way:
    - if node $x$ has no outgoing edges, then $\mathrm{gen}(x) = 1$
    - otherwise $\mathrm{gen}(x) = \max_{v : (x, v) \in E} \{ \mathrm{gen}(v) \} + 1$, i.e. the maximum is taken over nodes $v$ for which there exists an edge $x \to v$
    Properties of generation numbers / topological levels:
    - if a node B is reachable from A, $A \neq B$, then the generation number of B is smaller than the generation number of A: $A \rightsquigarrow B \implies \mathrm{gen}(A) > \mathrm{gen}(B)$
    - therefore, if the generation number of B is greater than or equal to the generation number of A, then B is not reachable from A
  • 50. Gains from using generation numbers
    Using generation numbers improves the performance of the following commands:
    - git branch / git tag --contains commit, which finds all branches / tags from which the commit is reachable
    - git branch / git tag --merged commit, which finds all branches / tags reachable from the commit
    - git push with the push.followTags config variable set to true; git push --force and git push --force-with-lease: checking if a given tag points to any transferred commit
    Generation numbers can also be used to speed up (not always true for real-life repositories):
    - computing lowest / closest common ancestors (merge bases) with git merge-base (or indirectly by git merge and git log A...B)
    - topological sorting (outputting a revision before its parents) in git log --graph (or directly with --topo-order)
  • 51. Gains from using generation numbers
    Using generation numbers improves the performance of the following commands:
    - git branch / git tag --no-contains commit, which finds all branches / tags from which the commit is unreachable
    - git branch / git tag --no-merged commit, which finds all branches / tags unreachable from the commit
    - git push with the push.followTags config variable set to true; git push --force and git push --force-with-lease: checking if a given tag points to any transferred commit
    Generation numbers can also be used to speed up (not always true for real-life repositories):
    - computing lowest / closest common ancestors (merge bases) with git merge-base (or indirectly by git merge and git log A...B)
    - topological sorting (outputting a revision before its parents) in git log --graph (or directly with --topo-order)
  • 54. Using generation numbers for reachability queries
    Is commit T reachable from commit A?
    [Figure: commit graph laid out by generation number (1 to 9), with commits A, R, T and B marked]
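A reachability query can use generation numbers to stop walking early: a commit other than the target whose generation number is not greater than the target's cannot reach it. The Python sketch below assumes every commit has a proper generation number (none of the corner cases with unknown values), uses hypothetical `parents` and `gen` dictionaries, and also reports how many commits were expanded.

```python
def can_reach(parents, gen, a, b):
    """Return (is b reachable from a, number of commits expanded)."""
    if gen[a] < gen[b]:
        return False, 0            # negative cut: no walk needed at all
    stack, seen, expanded = [a], set(), 0
    while stack:
        commit = stack.pop()
        if commit == b:
            return True, expanded
        # commit != b with gen(commit) <= gen(b) cannot reach b: prune here
        if commit in seen or gen[commit] <= gen[b]:
            continue
        seen.add(commit)
        expanded += 1
        stack.extend(parents[commit])
    return False, expanded


# Hypothetical history: R is the root, T sits on a side branch off R.
parents = {"R": [], "X": ["R"], "Y": ["X"], "T": ["R"], "A": ["Y"]}
gen = {"R": 1, "X": 2, "Y": 3, "T": 2, "A": 4}
```

For the query "is T reachable from A" the walk stops at X, whose generation number equals that of T, and never visits R; without the cutoff the whole history below A would be walked before answering no.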
  • 55. Using generation numbers to compute git merge-base --all
    The lowest common ancestors of A and B (reachable from both A and B) are P and Q.
    [Figure: commit graph laid out by generation number (1 to 9), with commits P, A, R, B and Q marked]
  • 56. Topological sorting: Kahn's algorithm
    Assumption: we walk the edges according to their direction.
    1. Compute the in-degree of each node; it is the number of its incoming edges.
    2. Walk the graph, selecting a node with an in-degree of zero, and decreasing the in-degree of its parents.
    Q can be a queue, a priority queue, a stack, etc., which gives different topological orders.
      Q ← queue with nodes that have an in-degree of 0
      while Q is not empty do
        remove node n from the beginning of queue Q (of independent nodes)
        add n to the end of list L of topologically sorted nodes
        for each node m where there exists an edge e from n to m do
          remove edge e from the graph (which decreases the in-degree of m)
          if there are no incoming edges leading to m then
            add node m to Q
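The pseudocode above can be written as runnable Python. This sketch uses a plain FIFO queue for Q, keeps an in-degree counter instead of physically removing edges, and assumes a hypothetical `parents` dictionary in which every commit appears as a key.

```python
from collections import deque


def kahn_topological_sort(parents):
    """Return commits in an order where each commit precedes its parents."""
    # Step 1: in-degree = number of incoming edges (here: number of children).
    indegree = {commit: 0 for commit in parents}
    for commit in parents:
        for parent in parents[commit]:
            indegree[parent] += 1

    # Step 2: repeatedly output a node with in-degree zero.
    queue = deque(c for c in parents if indegree[c] == 0)
    order = []
    while queue:
        commit = queue.popleft()
        order.append(commit)
        for parent in parents[commit]:
            indegree[parent] -= 1          # "remove" the edge commit -> parent
            if indegree[parent] == 0:
                queue.append(parent)
    if len(order) != len(parents):
        raise ValueError("graph contains a cycle")
    return order


# Hypothetical history with a merge commit M.
history = {"A": [], "B": ["A"], "C": ["B"], "D": ["A"], "M": ["C", "D"]}
```

Replacing the `deque` with a priority queue or a stack changes which of the valid topological orders is produced, as the slide notes. Step 1 touches every commit before anything can be output.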
  • 57. Advantages and disadvantages of using Kahn's algorithm in Git
    Advantages of Kahn's algorithm:
    - one can select the order which keeps commits on a branch together (which is needed for git log --graph)
    - easy inclusion of graph traversal limits, like git log --topo-order A...B
    - it is possible to terminate the second step early, for example after showing the first full page of results; at least in theory
    Disadvantages / limitations (for the unmodified algorithm):
    - the whole graph needs to be traversed in the first step to find all independent nodes, those with an in-degree of zero:
      1. limit_list()
      2. sort_in_topological_order()
      3. get_revision_1()
    - git log --graph etc. may need only the first page of results (the output goes to the pager)
  • 58. Using generation numbers during topological sorting
    INDEGREE(generation number cutoff):
      while priority queue IN-Q is not empty and its maximum gen(x) is at least the cutoff do
        remove commit C with the highest gen(C) from IN-Q
        add 1 to the in-degree of each of C's parents
        if the in-degree of C is 0, add it to TOPO-Q
    Priority queue IN-Q (INDEGREE_QUEUE): ordered by maximum generation number (level).
    ⇒
    TOPO:
      TOPO-Q ← priority queue of commits with in-degree = 0
      cutoff ← generation number of the first commit in TOPO-Q
      while priority queue TOPO-Q is not empty do
        remove commit C from the start of TOPO-Q
        add C at the end of the sorted list L (output it)
        for each parent P of commit C do
          if gen(P) is lower than the cutoff then
            set the cutoff to gen(P)
            walk INDEGREE(cutoff)
          decrement the in-degree of commit P
          if the in-degree of P equals 0 then
            insert P into priority queue TOPO-Q
    Priority queue TOPO-Q: ordered by the selected output sorting order.
  • 60. Using generation numbers during topological sorting with limits
    EXPLORE(generation number cutoff):
      while priority queue EXPLORE-Q is not empty and its maximum gen(x) is at least the cutoff do
        take into account limits of the A..B type
        add interesting parents to EXPLORE-Q
    ⇓
    INDEGREE(generation number cutoff):
      while priority queue IN-Q is not empty and its maximum gen(x) is at least the cutoff do
        remove commit C with the highest gen(C) from IN-Q
        EXPLORE(gen(C))
        add 1 to the in-degree of each of C's parents
        if the in-degree of C is 0, add it to TOPO-Q
    Priority queues EXPLORE-Q and IN-Q: ordered by maximum generation number (level).
    ⇒
    TOPO:
      TOPO-Q ← priority queue of commits with in-degree = 0
      cutoff ← generation number of the first commit in TOPO-Q
      while priority queue TOPO-Q is not empty do
        remove commit C from the start of TOPO-Q
        add C at the end of the sorted list L (output it)
        for each parent P of commit C do
          if gen(P) is lower than the cutoff then
            set the cutoff to gen(P)
            walk INDEGREE(cutoff)
          decrement the in-degree of commit P
          if the in-degree of P equals 0 then
            insert P into priority queue TOPO-Q
    Priority queue TOPO-Q: ordered by the selected output sorting order.
  • 61. Improving topological sorting performance with generation numbers
    Linux kernel (2018)
    Test: git rev-list --topo-order -100 HEAD
      setup                                        time [s]   change
      without commit-graph                           6.80
      with commit-graph (old algorithm)              0.77     -88.7%
      with commit-graph and generation number        0.02     -99.7%
    Test: git rev-list --topo-order -100 HEAD -- tools
      setup                                        time [s]   change
      without commit-graph                           9.63
      with commit-graph (old algorithm)              6.06     -37.1%
      with commit-graph and generation number        0.06     -99.4%
    Taken from the commit message of "revision.c: generation-based topo-order algorithm".
  • 62. Generation number: References
    - Derrick Stolee (Microsoft). Supercharging the Git Commit Graph III: Generations and Graph Algorithms, Azure DevOps Blog, 9 July 2018. https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations
    - Developer Homepage of Derrick Stolee. https://stolee.dev/
    - John Briggs (Microsoft). Technical contributions towards scaling for Windows, Git Merge 2019. httpsXGGwwwFyoutu˜eF™omGw—t™hcvav—tWU—VgHoH
    - Documentation/technical/commit-graph.txt: Git Commit Graph Design Notes
  • 64. Trouble with using topological level as generation number
    - the generation number (topological level) serves as a reachability index: $\mathrm{gen}(A) < \mathrm{gen}(B) \implies A \not\rightsquigarrow B$; it eliminates unreachable revisions (a negative-cut filter)
    - before it, committerdate was used as a heuristic for this purpose; it can be incorrect due to, for example, clock desynchronization, so it is used as a cutoff threshold with slop (5 revisions in a row)
    - in order to speed up calculations, in most of the new algorithms that make use of the generation number, commit objects are sorted according to this value (in a priority queue), optionally using committerdate (commit creation date) to resolve ties
    - it turned out that in some cases we can get worse performance when using generation numbers than with the committerdate heuristic
    - the algorithm using the generation number always returns the correct result
    - the generation number used as a sort key selects the longest paths first (?)
  • 65. Examples of decreased performance (using number of visited commits)
    Test: git merge-base A B (commits visited when ordering by date vs. by generation number)
      repository     A              B               date       generation
      Android-base   53c1972bc8f    92f18ac3e39      81 999     109 025
      Linux          69973b830859   c470abd4fde4     44 984      47 457
      Linux          c8d2bc9bc39e   69973b830859    167 468     635 579
      TypeScript     35ea2bea76     123edced90        3 464       3 439
    https://github.com/derrickstolee/gen-test
    Partial solution:
    - sort by committerdate only when there is no generation number cutoff provided
    - accept the possibility of a performance regression for some rare history topologies (more of a problem for git merge-base than for git log --topo-order A..B)
    Alternative solution: using a generation number other than the topological level.
  • 66. Alternative sorting orders / generation numbers
    V0: (minimal) generation number / topological level
    - gen(C) is 1 greater than the maximum of gen(P) of the parents; gen(C) for a commit with no parents is 1
    - computable locally and incrementally, immutable
    V1: (epoch, commit creation date) pair
    - the epoch is not smaller than the maximum of the epochs of the parents, increased by 1 if a parent has an earlier date than the current commit
    - computable locally, immutable, compatible with V0
    V2: maximal generation number / reverse topological order (almost)
    - gen(C) for a commit without children is set to the number of commits in the graph; otherwise gen(X) is 1 greater than the minimum among the children
    - not computable incrementally, compatible with V0
    V3: corrected commit creation date
    - gen(C) is the maximum of C's committerdate and the corrected dates of its parents (+1)
    - computable locally and incrementally, immutable, incompatible with V0
    Best performance: V2, V3. Incremental computation is more important: V3.
    Version number field in the commit-graph format.
    https://github.com/derrickstolee/gen-test
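The V3 definition (corrected commit creation date) can be worked through on a small history with a skewed clock. The Python sketch below reads "(+1)" as applying to the parents' corrected dates, i.e. corrected(C) = max(date(C), max over parents of corrected(P) + 1); the `parents` and `dates` dictionaries are hypothetical.

```python
def corrected_commit_dates(parents, dates):
    """corrected(C) = max(date(C), max(corrected(P) + 1 for parents P of C))."""
    corrected = {}
    for start in parents:
        stack = [start]
        while stack:
            commit = stack[-1]
            if commit in corrected:
                stack.pop()
                continue
            missing = [p for p in parents[commit] if p not in corrected]
            if missing:
                stack.extend(missing)      # parents must be computed first
            else:
                floor = max(
                    (corrected[p] + 1 for p in parents[commit]),
                    default=dates[commit],
                )
                corrected[commit] = max(dates[commit], floor)
                stack.pop()
    return corrected


# Hypothetical history: B was committed on a machine whose clock ran behind.
parents = {"A": [], "B": ["A"], "C": ["B"]}
dates = {"A": 100, "B": 90, "C": 120}
corrected = corrected_commit_dates(parents, dates)
```

Only B's date needs correcting (from 90 to 101); A and C keep their committer dates, so the value stays close to the date heuristic while restoring the guarantee that a child is always greater than its parents.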
  • 67. Incremental update of the commit-graph file
    - rewriting the commit-graph file to add information about a new commit is time consuming; it is done during garbage collection
    - a low-cost automatic update would be preferred, for example updating during git fetch
    - solution: a chain of commit-graph files
      the lowest layer is self-sufficient (closed)
      higher layers can reference commits in lower layers
    - set limits (conditions):
      higher layers are down to X times smaller than lower ones
      maximum layer size (except for the lowest one, the base)
    - merging layers if needed to fulfill the above conditions
    - good amortized time is assured, taking into account the time to merge layers
    [Figure: three-layer commit-graph]
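The layer limits described above can be illustrated with a small simulation. This is a sketch of one possible merge policy, not Git's actual code: layers are represented only by their commit counts, and the names `add_layer`, `size_multiple` and `max_layer_size` (and their default values) are assumptions made for this example.

```python
def add_layer(layers, new_commits, size_multiple=2, max_layer_size=64_000):
    """Add a new top layer to a chain and merge layers until the limits hold.

    `layers` is a list of commit counts, base layer first.
    """
    layers = list(layers)
    top = new_commits
    # Merge the new top layer into the one below it while either limit is
    # violated: the lower layer is not at least `size_multiple` times larger,
    # or the top layer is too big to be anything but the base layer.
    while layers and (layers[-1] < size_multiple * top or top > max_layer_size):
        top += layers.pop()
    layers.append(top)
    return layers

chain = [1000]
for fetched in (10, 3, 4):
    chain = add_layer(chain, fetched)
    print(chain)
# [1000, 10]
# [1000, 10, 3]
# [1000, 17]
```

Most updates only write a small top layer; the occasional merge touches more commits, which is where the amortized-time argument comes from.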
  • 70. The chain of commit-graph files
    [Figure: commit-graph chain file format (CDAT chunk)]
    [Figure: three-layer commit-graph]
    Source: https://devblogs.microsoft.com/devops/updates-to-the-git-commit-graph-feature/
  • 71. Problems with changing the definition of generation number
    - incremental update of commit-graph files requires that the new gen(C) be locally updateable
      - which, in addition to the performance requirements, points to the corrected commit creation date
    - the commit-graph format includes a version number
      - but while writing the incremental update code it turned out that Git stops operation (hard fail) if the commit-graph version is newer than supported, instead of simply not using the commit-graph
    - solution: a variant of the corrected commit date
      - the column stores the corrected commit date offset
      - its value is chosen to be at least 1 more than the maximal offset of the parents of the commit
      - but also in such a way that date plus offset is strictly monotonic (strictly increasing)
      - this gives incremental updates, immutability and backward compatibility
      - however, it is not implemented yet...
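The two conditions on the corrected commit date offset can be made concrete with a short sketch. It is an illustration of the rule as stated on the slide, not an existing Git feature: the function name, the input dictionaries and the choice of 1 as the smallest offset for root commits are assumptions.

```python
from graphlib import TopologicalSorter

def corrected_date_offsets(parents, date):
    """Smallest offsets such that, for every parent P of a commit C:
    offset(C) >= offset(P) + 1 and date(C) + offset(C) > date(P) + offset(P).
    """
    offset = {}
    for c in TopologicalSorter(parents).static_order():  # parents come first
        need = 1  # assumed minimum offset for a commit without parents
        for p in parents[c]:
            need = max(need,
                       offset[p] + 1,                       # offsets increase
                       date[p] + offset[p] + 1 - date[c])   # date + offset increases
        offset[c] = need
    return offset

# Hypothetical history: D merges B and C; B has a skewed (too early) clock.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
date = {"A": 100, "B": 90, "C": 110, "D": 105}

offsets = corrected_date_offsets(parents, date)
print(offsets)  # {'A': 1, 'B': 12, 'C': 2, 'D': 13}
```

Because the offset alone already grows from parent to child, a reader that treats the stored column as an ordinary generation number would still see a valid one, which is the backward-compatibility argument made above.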
  • 75. New generation number: References
    Incremental update of commit-graph files
      - Derrick Stolee (Microsoft), "Updates to the Git Commit Graph Feature", Azure DevOps Blog, 11 Nov 2019
        https://devblogs.microsoft.com/devops/updates-to-the-git-commit-graph-feature/
      - Christian Couder, Jakub Narębski, Markus Jansen, Gabriel Alcaras, et al., Git Rev News: Edition 52 (June 28th, 2019), Reviews section: "[PATCH 00/17] [RFC] Commit-graph: Write incremental files"
        https://git.github.io/rev_news/2019/06/28/edition-52/
    The need for a new generation number and its choice
      - Christian Couder, Jakub Narębski, Markus Jansen, Gabriel Alcaras, et al., Git Rev News: Edition 45 (November 21st, 2018), Support and Reviews sections: "commit-graph is cool" and "[RFC] Generation Number v2"
        https://git.github.io/rev_news/2018/11/21/edition-45/
  • 76. Other graphs and other uses of reachability queries
    Other types of graph data:
      - social networks
      - WWW/Internet
      - XML documents
      - biological/chemical networks
      - RDF ontologies
    Categories of graphs:
      - large: |V| ≳ 100 000
      - sparse: |E|/|V| ≲ 2
    Reachability relations:
      - graph of strongly connected components ⟹ reachability in a DAG
    Practical use:
      - social networks: influence flow
      - citations: impact of an article
      - internet: link structure analysis
      - security: finding possible connections between suspects
      - biological data: is a given protein related, directly or indirectly, to the expression of a given gene?
      - chemical reactions: can a given compound be obtained starting from a given substance?
    Reachability queries in general:
      - a classical graph theory problem
      - a primitive operation used in other algorithms (like pattern matching)
  • 77. Graph of Strongly Connected Components (SCC)
    [Figure]
  • 78. Division of algorithms for solving the reachability problem
    Extreme approaches:
      - Computing the transitive closure of the graph upfront
        - build time: O(|V|·|E|)
        - constant-time queries: O(1)
        - quadratic memory use: O(|V|²)
      - Online graph search: BFS, bidirectional BFS, DFS
        - no build time: O(1)
        - query answering time: O(|V|+|E|)
        - no additional memory needed: O(1)
    Algorithm types:
      - Label-Only
        - answers queries using labels only
        - nonlinear or unbounded index size
      - Label+G (label + graph)
        - requires an [augmented] graph search if the labeling cannot answer the reachability query by itself
        - linear and bounded index build time and index size
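The two extreme approaches can be compared on a toy graph. The sketch below is illustrative only: the function names and the example graph are made up, the closure is stored as Python sets rather than a bit matrix, and the recursive closure assumes a DAG small enough for Python's recursion limit.

```python
from collections import deque

def transitive_closure(succ):
    """Precompute reach[u] = every node reachable from u (including u itself)."""
    reach = {}

    def visit(u):
        if u not in reach:
            reach[u] = {u}
            for v in succ[u]:
                reach[u] |= visit(v)   # safe because the graph is acyclic
        return reach[u]

    for u in succ:
        visit(u)
    return reach

def bfs_reachable(succ, u, v):
    """Answer a single query by searching the graph; nothing is precomputed."""
    seen, queue = {u}, deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        for y in succ[x]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False

# Hypothetical DAG, edges point from a node to its successors.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["d"]}
closure = transitive_closure(succ)

print("d" in closure["a"])            # True: set lookup after the build step
print(bfs_reachable(succ, "e", "a"))  # False: found by walking the graph
```

Label+G methods sit between these two: a small per-node label answers many queries directly, and the search is used only for the rest.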
  • 80. Types of labels in augmented online search algorithms
    Negative-cut filter (eliminating unreachable nodes)
      - if u ⇝ v and u ≠ v, then the labels of u and v fulfill the condition e(u) ⪯ e(v)
      - therefore, if this condition is not met, u cannot reach v
      - the reverse is not always true: false positives
      - e(u) ⋠ e(v) ⟹ ¬(u ⇝ v)
      - examples: (minimal) generation number, aka topological level
    Positive-cut filter (finding reachable nodes)
      - if the labels of u and v satisfy e′(u) ⪯ e′(v), then u ⇝ v, that is, u can reach v
      - node v can be reachable from u even if the condition on the labels is not met: false negatives
      - e′(u) ⪯ e′(v) ⟹ u ⇝ v
      - examples: min-post interval labeling for the spanning tree (see next slides)
    Reachability algorithms like FELINE or BFL often use many different labels
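A negative-cut filter based on the generation number can be added to an ordinary graph walk as follows. This is a minimal sketch, assuming commits point to their parents and that `gen` holds topological levels (a child's level is always greater than its parents'); the function name and the example history are hypothetical.

```python
def can_reach(parents, gen, u, v):
    """Is commit v reachable from commit u by following parent links?

    A commit is considered to reach itself.
    """
    if gen[u] < gen[v]:
        return False            # negative cut: answered from the labels alone
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for p in parents[x]:
            # Negative cut: a commit other than v can reach v only if its
            # generation number is strictly greater than gen[v].
            if p not in seen and (p == v or gen[p] > gen[v]):
                seen.add(p)
                stack.append(p)
    return False

# Hypothetical history: main line A-B-C-D and a side branch A-X-Y.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "X": ["A"], "Y": ["X"]}
gen = {"A": 1, "B": 2, "C": 3, "D": 4, "X": 2, "Y": 3}

print(can_reach(parents, gen, "D", "B"))  # True
print(can_reach(parents, gen, "Y", "C"))  # False, without walking down to A
```

The filter never produces a wrong answer; it only stops the walk early, so a pair that passes the label test still has to be confirmed by the search.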
  • 81. Positive-cut: spanning tree
    Spanning tree / spanning forest
      Given a directed acyclic graph, a spanning forest (tree) is a subgraph for which the following is true:
        - it includes all nodes of the original (full) graph
        - there is at most one incoming edge per node
    Properties:
      - if there exists a path from u to v in the spanning tree, then u ⇝ v in the full graph
      - but a path from u to v may require going through edges outside the spanning tree
    [Figure: a ⇝ h in the graph and in the tree; b ⇝ h in the graph, but not in the tree]
  • 82. Positive-cut: min-post interval labels in the spanning tree
    Min-post interval labels
      For each vertex (node) u in the graph we define the min-post interval L_u = [s_u, e_u] in the following way:
        - e_u is defined as e_u = post(u), the post-order value (assigned when backing out of the node during traversal)
        - s_u = e_u for leaf nodes (no outgoing edges), otherwise s_u = min{s_x : x ∈ children(u)}
    Properties:
      - a path from u to v in the spanning tree exists if and only if L_v ⊆ L_u
      - if L_v ⊆ L_u, then u ⇝ v (in the full graph)
    The same condition holds for the similar post-max intervals
    [Figure: [3,3] ⊆ [1,5], so a ⇝ h; [3,3] ⊄ [7,9], but b ⇝ h]
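The min-post labeling can be computed with a single depth-first walk that builds the spanning forest as it goes. The sketch below is an illustration only: the function names and the small example graph are made up (they are not the graph from the figure), and the recursion assumes a graph small enough for Python's recursion limit.

```python
def min_post_labels(succ, roots):
    """Return {node: (s, e)} min-post intervals for a DFS spanning forest."""
    labels = {}
    visited = set()
    counter = 0

    def dfs(u):
        nonlocal counter
        s = None
        for v in succ[u]:
            if v not in visited:        # (u, v) becomes a spanning-tree edge
                visited.add(v)
                child_s = dfs(v)
                s = child_s if s is None else min(s, child_s)
        counter += 1                    # post-order number of u
        if s is None:                   # leaf of the spanning tree
            s = counter
        labels[u] = (s, counter)
        return s

    for r in roots:
        if r not in visited:
            visited.add(r)
            dfs(r)
    return labels

def positive_cut(labels, u, v):
    """True if L_v is contained in L_u, which guarantees that u reaches v."""
    (su, eu), (sv, ev) = labels[u], labels[v]
    return su <= sv and ev <= eu

# Hypothetical DAG: the edge b -> d is left outside the spanning forest.
succ = {"a": ["c", "d"], "b": ["d", "e"], "c": [], "d": ["h"], "e": [], "h": []}
labels = min_post_labels(succ, roots=["a", "b"])

print(labels["a"], labels["h"])        # (1, 4) (2, 2)
print(positive_cut(labels, "a", "h"))  # True: a reaches h inside the tree
print(positive_cut(labels, "b", "h"))  # False, although b reaches h via d
```

The last query is a false negative of the kind described above: the labels cannot confirm it, so a graph search would still be needed.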
  • 83. Commit graph with the spanning forest
    Nodes in the graph are labeled with the post(v) value: the post-visit order in depth-first search (DFS)
    [Figure: commit graph with its spanning forest; node labels run from 1 to 23]
  • 84. Commit graph with the spanning forest and min-post intervals
    This is the same graph as on the previous slide, just drawn differently (post-visit order vs. level)
    [Figure: axes are topological level l_v and post-visit order post(v); an example min-post interval L_u = [1, 18] is marked]
  • 85. Linux kernel repository commit graph
    [Figure]
  • 86. Interval labeling for reachability queries: References (1/2)
    - Hilmi Yildirim, Vineet Chaoji, Mohammed J. Zaki, "GRAIL: scalable reachability index for large graphs", Proceedings of the VLDB Endowment 3(1):276-284 (2010)
      https://www.researchgate.net/publication/220538786_GRAIL_Scalable_reachability_index_for_l
    - Florian Merz, Peter Sanders, "PReaCH: A Fast Lightweight Reachability Index using Pruning and Contraction Hierarchies" (2014), section 3.3 "Pruning Based on DFS Numbering"
      https://arxiv.org/abs/1404.4465
    - Renê R. Veloso, Loïc Cerf, Wagner Meira Jr, Mohammed J. Zaki, "Reachability Queries in Very Large Graphs: A Fast Refined Online Search Approach", Proc. 17th International Conference on Extending Database Technology (EDBT), March 24-28, Athens, Greece (2014), section 3.4.1 "Positive-Cut Filter" in 3.4 "Optimizations"
      http://openproceedings.org/EDBT/2014/paper_166.pdf
  • 87. Interval labeling for reachability queries: References (2/2)
    - Stephan Seufert, Avishek Anand, Srikanta J. Bedathur, Gerhard Weikum, "FERRARI: Flexible and Efficient Reachability Range Assignment for Graph Indexing", Proceedings of the 29th IEEE International Conference on Data Engineering (ICDE'13), Brisbane, Australia, IEEE (2013)
      http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.2894&rep=rep1&type=pdf
      https://github.com/steps/Ferrari
    - Stephan Seufert, Avishek Anand, Srikanta J. Bedathur, Gerhard Weikum, "High-Performance Reachability Query Processing under Index Size Restrictions" (2012)
      https://arxiv.org/abs/1211.3375
    - Jakub Narębski, "Reachability labels for version control graphs", Google Colaboratory Jupyter Notebook
      https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaNSu5xyzg
  • 88. Problems to solve
    - how to add additional labels for use in graph algorithms in Git?
      - there can be only one order in the priority queue (which would be the new generation number)
    - would adding a positive-cut filter improve the performance of graph operations, and if so, which ones?
      - returning a true/false reachability query result
      - selecting and returning a subset
      - finding a node (commit) in a graph
      - finding a path / subgraph
      - topological sorting
    - which labels can be computed incrementally?
      - the graph of revisions (commit graph) has specific properties and a specific way of growing (dynamics)
    Online search algorithms:
      - Tree+SSPI (2005)
      - GRIPP (2007)
      - GRAIL (2010) (Graph Reachability indexing via rAndomized Interval Labeling)
      - FERRARI (2013) (Flexible and Efficient Reachability Range Assignment for gRaph Indexing)
      - FELINE (2014) (Fast rEfined onLINE search)
      - IP (2014) (Independent Permutations labeling) and BFL (2016) (Bloom Filter Labeling)
  • 90. Incremental update of min-post interval labels
    [Figure sequence, one frame per step; node numbers omitted]
    1. Starting point: commit graph at some point in the past, with 3 branch tips; subtree intervals and adjustments: [1, 10] + 0, [11, 14] + 0, [15, 17] + 0
    2. Beginning of an incremental update of labels, starting from one of the new branch tips
    3. Adjusting values of labels: adding a constant value for whole subtrees (starting from old branch tips); adjustments become [1, 10] + 0, [11, 14] + 5, [15, 17] − 4
    4. Continuation of the incremental update of labels, walking from the second of the new branch tips
    5. Final step: updated labels, walking only new commits, which gives O(changes) update time
    6. This results in a different spanning forest and different labels than those computed from scratch
  • 96. Advantages and disadvantages of incremental update of interval labels
    [Figure: three graphs side by side: the starting graph with adjustments [1, 10] + 0, [11, 14] + 0, [15, 17] + 0; the incrementally updated graph with adjustments [1, 10] + 0, [11, 14] + 5, [15, 17] − 4; and the graph labeled from scratch]
    Advantages:
      - the update is computed by walking O(changes) commits
      - updating post(v) labels is no more costly than updating graph positions (lexicographical order in the updated graph)
    Disadvantages:
      - possibly suboptimal reachability labeling
      - the spanning forest and interval labels depend on when the commit-graph file was updated
    Question:
      - is the result of an incremental update good enough (for improving performance)?
  • 97. Incremental commit-graph format and interval labels
    [Figure: layer = 0 commit-graph file and layer = 1 commit-graph file, with adjustments [1, 10] + 0, [11, 14] + 5, [15, 17] − 4]
    Each layer in the commit-graph chain includes corrections (adjustments) to the post(v) labels of the previous layer in the chain (original values shown)
  • 98. Incremental commit-graph format and interval labels
    [Figure, top: layer = 0 commit-graph file and layer = 1 commit-graph file, with adjustments [1, 10] + 0, [11, 14] + 5, [15, 17] − 4]
    [Figure, bottom: the same graph with effective labels]
    Possible solution:
      - for each layer in the commit-graph chain store (in relevant chunks):
        - min-post interval labels
        - list of tips (heads) in the graph
        - possibly also the list of their intervals
      - for each layer in the chain, except for the base layer, store adjustments for the previous layer (only needed for tips)
    The top figure shows the data as stored in the commit-graph file chain; the bottom figure shows the effective post(v) labels, as visible from the top layer.
  • 99. Example of large graph: CiteseerX citation network
      6 540 401      nodes
      15 011 260     edges
      2.295          average degree
      567 149        roots
      5 740 710      leaves
      4.07 × 10^−4   R-ratio (probability that two nodes are connected)
      59             max. path length, that is, max. level