4. “Classic” way of storing data in a database
ID  NAME                             IS_ARCHIVED  SPEAKER_ID  ADDRESS_ID
1   Die zahnärztliche Niederlassung  false        15          84
2   Psychosomatik I                  true         301         12
5. “Event sourced” way
ID  COURSE_ID  EVENT_TYPE            EVENT_DATA
1   1          CourseCreated         { "name": "Psychosomatik 1", "address_id": 84, "speaker_id": 12 }
2   1          CourseSpeakerChanged  { "new_speaker_id": 15 }
3   1          CourseNameChanged     { "new_name": "Psychosomatik I" }
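A minimal Python sketch of how the current state could be rebuilt from the three events in the table. The handler function and the `is_archived` default are assumptions for illustration; the slides don't show the real implementation.

```python
# Events for course 1, copied from the table above (ordered by ID).
EVENTS = [
    ("CourseCreated", {"name": "Psychosomatik 1", "address_id": 84, "speaker_id": 12}),
    ("CourseSpeakerChanged", {"new_speaker_id": 15}),
    ("CourseNameChanged", {"new_name": "Psychosomatik I"}),
]


def apply_event(state, event_type, data):
    """Return a new state dict with one event applied (hypothetical handler)."""
    if event_type == "CourseCreated":
        # Assumption: a freshly created course is not archived.
        return {**data, "is_archived": False}
    if event_type == "CourseSpeakerChanged":
        return {**state, "speaker_id": data["new_speaker_id"]}
    if event_type == "CourseNameChanged":
        return {**state, "name": data["new_name"]}
    raise ValueError(f"unknown event type: {event_type}")


def replay(events):
    """Rebuild the current state by applying all events in order."""
    state = {}
    for event_type, data in events:
        state = apply_event(state, event_type, data)
    return state


current = replay(EVENTS)
print(current)
```

The state only ever exists in memory; the database holds nothing but the event rows.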
7. Pros and cons
• Audit log for free
• Version tracking for free
• Course at time X
But:
• Performance with large amount of events
• Querying is not as easy because we never store the current state in
the database (Find all archived courses)
• It can make some operational tasks harder
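Both sides of this list can be sketched in a few lines of Python. The event rows, the `CourseArchived` handler, and the function names below are made up for illustration. "Course at time X" is approximated here as "state after event ID X", because the event table on the slide has no timestamp column.

```python
# Hypothetical event rows: (event_id, course_id, event_type, event_data).
EVENTS = [
    (1, 1, "CourseCreated", {"name": "Psychosomatik 1"}),
    (2, 2, "CourseCreated", {"name": "Die zahnärztliche Niederlassung"}),
    (3, 1, "CourseNameChanged", {"new_name": "Psychosomatik I"}),
    (4, 1, "CourseArchived", {}),
]


def apply_event(state, event_type, data):
    """Apply a single event to a state dict (illustrative handlers only)."""
    if event_type == "CourseCreated":
        return {"name": data["name"], "is_archived": False}
    if event_type == "CourseNameChanged":
        return {**state, "name": data["new_name"]}
    if event_type == "CourseArchived":
        return {**state, "is_archived": True}
    raise ValueError(f"unknown event type: {event_type}")


def state_at(course_id, up_to_event_id):
    """Pro: any historic state is just a shorter replay."""
    state = None
    for event_id, cid, event_type, data in EVENTS:
        if cid == course_id and event_id <= up_to_event_id:
            state = apply_event(state, event_type, data)
    return state


def archived_course_ids():
    """Con: filtering by current state means replaying every stream."""
    last_id = EVENTS[-1][0]
    course_ids = {cid for _, cid, _, _ in EVENTS}
    return sorted(cid for cid in course_ids if state_at(cid, last_id)["is_archived"])


print(state_at(1, 2))
print(archived_course_ids())
```

In the classic table, "find all archived courses" is a single WHERE clause; here it costs one replay per course.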
13. Learning goals
• A little bit about the history of git
• Internal storage mechanisms
• What’s in the .git/ folder?
• Which data structures are used by git?
15. git (/ɡɪt/)
“I'm an egoistical bastard, and I name all my projects after myself. First 'Linux', now 'git'.”
Linus Torvalds
Source: https://www.urbandictionary.com/define.php?term=Git
16. Facts
• Development started in April 2005
• Linux kernel team was using BitKeeper (but the owner withdrew
free use of the product)
• Linus Torvalds wanted a DVCS but none met his needs
19.
01 Create a new file
02 Add the file to the index
03 Create a new commit
04 We can see the commit in the log (incl. its hash)
22.
01 Switch to a new branch that starts at our initial commit
02 Create a copy of our hello.txt file and name it hello3.txt
03 Create a new commit with it
04 We can see all commits of this branch but not the ones from master
24. Learnings so far …
• .git/refs/heads contains one text file for each branch (named
like the branch itself) = “branch pointer”
e.g. .git/refs/heads/master, .git/refs/heads/feature/a
• HEAD is a text file that always contains the path (relative to
.git/) to the currently checked out branch
e.g. ref: refs/heads/feature/a
• Creating new branches is very easy (we only need to store a
reference to the commit)
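These learnings can be checked with a short Python sketch that follows HEAD to the branch pointer and reads the commit hash from it. The helper name and the all-zero placeholder hash are made up, and the demo builds a throwaway .git/ layout instead of touching a real repository. The sketch ignores packed refs, where git may keep branch pointers in a single .git/packed-refs file.

```python
import tempfile
from pathlib import Path


def current_branch_commit(git_dir):
    """Follow HEAD -> branch pointer -> commit hash."""
    git_dir = Path(git_dir)
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        # Detached HEAD: the file holds a commit hash instead of a path.
        return None, head
    ref = head[len("ref: "):]
    commit_hash = (git_dir / ref).read_text().strip()
    return ref, commit_hash


# Demo with a fake repository layout (the hash is a placeholder, not a real commit).
with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs/heads/feature").mkdir(parents=True)
    (git_dir / "refs/heads/feature/a").write_text("0" * 40 + "\n")
    (git_dir / "HEAD").write_text("ref: refs/heads/feature/a\n")
    result = current_branch_commit(git_dir)

print(result)
```

Creating a branch boils down to writing one more small text file like refs/heads/feature/a.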
25. How does git know which commits
belong to a branch?
26.
01 Get the commit hashes of master …
02 … and feature/a
03 Print out the contents of both commits with git cat-file
27. Learnings
• git cat-file -p <hash> allows us to look at the contents of a
commit
• Each commit contains
• Information about its author
• The commit message
• A timestamp when it was created
• and a reference to its direct ancestor commit
• All commits of a single branch form a linked list that can be
traversed back to the initial/first commit
• .git/refs/heads/… points to the head of this linked list
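The linked-list idea can be sketched in Python. The commit texts below mimic the layout that git cat-file -p prints for a commit, but the short IDs (c1, t1, …), the author, and the timestamps are invented placeholders. The walk follows only the first parent, so merge commits with several parents are out of scope here.

```python
# Hypothetical commit objects, keyed by a made-up short hash.
COMMITS = {
    "c1": (
        "tree t1\n"
        "author Jane Doe <jane@example.com> 1700000000 +0000\n"
        "committer Jane Doe <jane@example.com> 1700000000 +0000\n"
        "\n"
        "Initial commit\n"
    ),
    "c2": (
        "tree t2\n"
        "parent c1\n"
        "author Jane Doe <jane@example.com> 1700000100 +0000\n"
        "committer Jane Doe <jane@example.com> 1700000100 +0000\n"
        "\n"
        "Add hello2.txt\n"
    ),
    "c3": (
        "tree t3\n"
        "parent c2\n"
        "author Jane Doe <jane@example.com> 1700000200 +0000\n"
        "committer Jane Doe <jane@example.com> 1700000200 +0000\n"
        "\n"
        "Add hello3.txt\n"
    ),
}


def parse_commit(text):
    """Split a commit into its header fields and its message."""
    header, _, message = text.partition("\n\n")
    fields = {"parents": []}
    for line in header.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parents"].append(value)
        else:
            fields[key] = value
    fields["message"] = message.strip()
    return fields


def history(branch_tip):
    """Walk from the branch pointer back to the first commit."""
    hashes = []
    current = branch_tip
    while current is not None:
        hashes.append(current)
        parents = parse_commit(COMMITS[current])["parents"]
        current = parents[0] if parents else None
    return hashes


print(history("c3"))
```

A branch needs no list of its commits: the pointer to the newest commit is enough, and everything else is reachable through the parent references.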
28. Q: How is the commit hash generated?
A: By hashing its contents with SHA-1.
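A small Python sketch of that hashing step. One detail that is not on the slides: git puts a short header ("<type> <size>" followed by a NUL byte) in front of the content before hashing it. The function name and the shortened commit bodies are made up for illustration.

```python
import hashlib


def git_object_hash(obj_type, body):
    """SHA-1 over '<type> <size>\\0' followed by the raw object content."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Matches what `echo 'hello world' | git hash-object --stdin` prints.
blob_hash = git_object_hash("blob", b"hello world\n")
print(blob_hash)

# Changing anything in a commit (here: the author's email) changes its hash.
# These bodies are shortened, made-up examples, not complete commit objects.
commit_a = b"author Jane Doe <jane@example.com> 1700000000 +0000\n\nInitial commit\n"
commit_b = b"author Jane Doe <jane@example.org> 1700000000 +0000\n\nInitial commit\n"
print(git_object_hash("commit", commit_a) != git_object_hash("commit", commit_b))
```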
32. Let’s create a subfolder
01 Move hello.txt into a newly created subfolder
02 Create a new commit with this change
03 The tree object of the root folder now contains a reference to another tree object
34. There are THREE different kinds of “objects”
commit
• Author
• Commit message
• Reference to the previous commit
• Timestamp
tree
• One per folder (incl. one for the root folder)
• Contains the names of all files in the folder and references to their corresponding blob objects
blob
• File contents
Hash = SHA-1(object)
40. Advantages
• Efficient storage/transfer because objects with the same content
are only stored once (same hash)
• If you fucked something up the chances are very high that it can
be fixed 🎉
46. Further reading
Building Git, James Coglan
https://shop.jcoglan.com/building-git/
Pro Git, Scott Chacon
https://git-scm.com/book/de/v2
Modifying the data is as simple as updating the row in the database
“Event sourcing”: we don’t store the current state of the object but a stream of events that modified it: CourseCreated, CourseChanged, CourseArchived.
Later, we can “reconstruct” the current state of the object by applying all events in order again.
If we want to archive the course, we need to insert a new event into the stream.
Reconstructing the current state later happens in code, where we iterate over all events in the stream (from the database) and apply the corresponding changes.
e.g. CourseArchived event is handled by setting tha
You store the timestamp of the event and also the user who executed the operation.
CQRS
Snapshots to cache intermediate states
Indeed, if we type git log -p:
For each commit you can see all changes that have been made to the working directory (creating files, modifying files, deleting files)
The current state of the working directory can always be derived by applying all changes/change events since the beginning.
Did you already know that git works like this?
https://www.polleverywhere.com/multiple_choice_polls/VnIwdXx7qmTsGepodbekY
What has git to do with these two persons (Linus and Angela Merkel), and a shopping cart?
Bingo: but please only in your head (don’t shout “bingo” ;))
Applying patches should be fast
“Take Concurrent Versions System (CVS) as an example of what not to do; if in doubt, make the exact opposite decision.”
Include very strong safeguards against corruption, either accidental or malicious.
Support a distributed, BitKeeper-like workflow.
Learning goal of today: to know what’s inside the .git folder in a repository folder
Git already created a .git/ folder with some files and subfolders. But most of it is still empty.
Only the hooks folder contains some example hooks, and the refs folder contains two empty sub-directories called “heads” and “tags”.
I’ve also published this sample repository on GitHub (so you can follow along if you want to).
We take all of that and hash it with SHA-1
Any change to any of this data also means that a new hash is generated (changing the email of the author for example)
And the folder structure of the working directory at point in time X (commit X)
Let’s go back to our two commits and what git stores about them
If you looked carefully you might already have noticed that there is another piece of information stored that I didn’t mention yet
We can again use our friend git cat-file to find out what’s behind the hash
Why do they all contain the same hash?
Because the hashes for blobs are calculated the same way as those of commits: by hashing their content
Files that have the same content are stored under the same hash
Because we just copied the file we initially created, they all have the same hash
Only references
We still don’t know where the actual contents of the objects (e.g. files) are stored
We didn’t look into the objects directory yet.
That’s what we’re going to do now.
Where key is the object’s hash
The objects folder contains sub-directories named after the first two characters of the hex hash (i.e. its first byte)
Every time you run “git add”, a new blob object is added to the objects KVS (unless there already is an object with the same contents)
Objects with the same hash are only stored once
All objects (blobs, trees, and commits) are stored with a small header and then deflated with zlib; tree objects use a binary format
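A rough Python sketch of such a key-value store on disk: the key is the hash, the first two hex characters become the sub-directory, and the value is the zlib-deflated object. The function names are made up, and the demo writes into a temporary folder rather than a real .git/objects directory. Packing is left out entirely.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_object(objects_dir, obj_type, body):
    """Store an object under <first 2 hex chars>/<remaining 38> and return its hash."""
    store = f"{obj_type} {len(body)}".encode() + b"\0" + body
    digest = hashlib.sha1(store).hexdigest()
    path = Path(objects_dir) / digest[:2] / digest[2:]
    if not path.exists():  # same content -> same hash -> stored only once
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(zlib.compress(store))
    return digest


def read_object(objects_dir, digest):
    """Look an object up by its hash and return (type, content)."""
    path = Path(objects_dir) / digest[:2] / digest[2:]
    store = zlib.decompress(path.read_bytes())
    header, _, body = store.partition(b"\0")
    obj_type, _size = header.decode().split(" ")
    return obj_type, body


with tempfile.TemporaryDirectory() as tmp:
    first = write_object(tmp, "blob", b"hello\n")
    second = write_object(tmp, "blob", b"hello\n")  # e.g. a copied file
    file_count = len([p for p in Path(tmp).rglob("*") if p.is_file()])
    round_trip = read_object(tmp, first)

print(first == second, file_count, round_trip)
```

Writing the copied file a second time creates nothing new on disk, which is the "only stored once" point from the notes.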
Objects are never removed (as long as there is a reference to them somewhere)
Garbage collector removes unreferenced objects
Packing algorithm that compresses objects
Unless it was already sent to the remote; even then it’s more of a communication problem
Say you accidentally created a squashed merge commit instead of a “normal” merge commit
Select the **true** statements about the way _git_ works.
https://www.polleverywhere.com/multiple_choice_polls/QH0fsFke1EnN0cjmZHgDR
Three side tasks: What has git to do with these two persons, and a shopping cart?
Named after its inventor Ralph Merkle
Also called hash tree
Leaf nodes contain actual content
Inner nodes contain the hashes of their children
Cryptographic properties:
Finding differences in two trees is really easy/efficient
Used in Bitcoin (Merkle proof), where a user can verify that a transaction has taken place by exchanging only hashes
Trust based on probabilities
An attacker couldn’t realistically come up with a chain of hashes that produces the exact same Merkle root
If we change a file (hello3.txt) and create a new commit
This blob will get a new hash (because its contents have changed)
In turn also the tree object will get a new hash (because its contents, the references to the blobs/trees, have changed)
And we also get a new commit hash (not only because of the modified tree hash but also because of the timestamp)
If we later (git log -p) want to find out where these two commits differ (and where not), we only need to traverse the trees
-> subtrees with the same hash have not changed
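The "same hash means unchanged subtree" shortcut can be sketched in Python. Folders are modelled as dicts and files as bytes; the hashing format below is a simplification made up for this example, not git's real tree encoding. Unlike git, which stores the hashes, the sketch recomputes them on every call.

```python
import hashlib


def node_hash(node):
    """Hash a file (bytes) or a folder (dict of name -> node), Merkle-style."""
    if isinstance(node, bytes):
        return hashlib.sha1(b"blob\0" + node).hexdigest()
    # A folder's hash depends on the names and hashes of its children.
    entries = "".join(
        f"{name} {node_hash(child)}\n" for name, child in sorted(node.items())
    )
    return hashlib.sha1(b"tree\0" + entries.encode()).hexdigest()


def changed_paths(old, new, prefix=""):
    """List the paths that differ, skipping subtrees whose hashes match."""
    if node_hash(old) == node_hash(new):
        return []  # identical hash: nothing below this node has changed
    if isinstance(old, bytes) or isinstance(new, bytes):
        return [prefix.rstrip("/")]
    paths = []
    for name in sorted(set(old) | set(new)):
        if name not in old or name not in new:
            paths.append(prefix + name)  # added or deleted entry
        else:
            paths.extend(changed_paths(old[name], new[name], prefix + name + "/"))
    return paths


old = {
    "hello.txt": b"Hello\n",
    "hello3.txt": b"Hello\n",
    "sub": {"hello.txt": b"Hello\n"},
}
new = {
    "hello.txt": b"Hello\n",
    "hello3.txt": b"Hello again\n",
    "sub": {"hello.txt": b"Hello\n"},
}

print(changed_paths(old, new))
```

Changing hello3.txt changes the root hash, but the "sub" folder keeps its hash, so the comparison never descends into it.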