How git works

How Git Works?
Let’s GIT Deeper :)

Git Folder Structure
It’s all in the .git folder; this is where
all the magic happens. Every item on the
list here has a definite role in the Git
magic, but the objects and refs folders
hold the most important roles regarding
stored data.

Git Data Structure
Git is a content-addressable file system. What
does that mean? It is simply a key-value store
where the key is a hash and the value is a blob,
or a group of blobs referenced by their hashes.
Git uses cryptographic hashes for keys. A
cryptographic hash is an algorithm that constructs
a short, fixed-size digest from a sequence of
bytes of any length. Examples: SHA-1 => 160 bits,
SHA-256 (SHA-2 family) => 256 bits, Skein => 256
bits or more. Git uses SHA-1 by default.
* The bigger the digest, the less likely a collision.
When you commit a file into Git, it calculates
and remembers the hash of the contents of the
file. When you retrieve it, Git can verify that
the hash of the data being retrieved exactly
matches the hash that was computed when it was
stored.
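To make the key-value idea concrete, here is a minimal sketch of a content-addressable store. The ContentStore class and its put/get methods are hypothetical names made up for this illustration, and the sketch hashes the raw bytes only (real Git also prepends a small header before hashing).

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is the SHA-1 of the value."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The key is derived from the content itself.
        key = hashlib.sha1(data).hexdigest()
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._objects[key]
        # Verify on retrieval: the stored bytes must still hash to the key.
        if hashlib.sha1(data).hexdigest() != key:
            raise ValueError("corrupted object " + key)
        return data
```

Because the key is computed from the value, nobody chooses names for the stored data, and any corruption shows up as a hash mismatch on read.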

Let’s Git Objective :)
Git uses different kinds of objects
to store the data related to
revisions: the basic hash (blob)
object, the tree, the commit, and the
tag, which is a hybrid kind: a
lightweight tag is only a reference,
while an annotated tag is a real
object.
Everything in Git is organised
around those objects, so most of the
plumbing tools in Git deal with
those objects directly to handle
saving the revisions behind the
scenes.

Git Hash Object
The hash (blob) object is the basic unit in Git. It is used in every revision, either
as a standalone object or as part of a group of objects. Important note: hash-object only
shows you the hash and stores the data as a blob; no other information (not even the file
name) is saved.
All objects, including annotated tags, are stored in the objects folder; lightweight tags
are plain references under refs/tags.
git hash-object (“blob object”) -> used to hash and save a single unnamed data object
in the objects folder
$ echo 'test content' | git hash-object -w --stdin
d670460b4b4aece5915caf5c68d12f560a9fe3e4
* Hashing side effect: when Git stores the contents of somefile.txt and sees, by comparing
hashes, that it already has a copy of that data, there is no need to store it again.
This process is called deduplication.
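The deduplication side effect can be sketched in a few lines. This is only an illustration: blob_hash and save are hypothetical helper names, and the in-memory dict stands in for the objects folder. The hash is computed over a "blob <size>" header, a NUL byte, and the content.

```python
import hashlib

# In-memory stand-in for the objects folder (an assumption of this sketch).
store = {}


def blob_hash(content: bytes) -> str:
    """SHA-1 of 'blob <size>' + NUL + content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def save(content: bytes):
    """Store content under its hash; report whether anything new was written."""
    key = blob_hash(content)
    is_new = key not in store
    if is_new:
        store[key] = content
    return key, is_new
```

Saving the same bytes twice returns the same key and writes nothing the second time, no matter which file name the bytes came from.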

That Blob: What Exactly Does It Look Like?
You know that Git stores revisions as objects, but what does such an object actually
look like? Let's find out.
Object creation steps:
1 - content, i.e. $content = "Hello , I’m content".
2 - header, i.e. $header = "blob " . strlen($content) . "\0" (the type, a space, the size in bytes, then a NUL byte).
3 - concatenate header and content into a store, i.e. $store = $header . $content.
4 - calculate the hash, i.e. $hash = sha1($store).
5 - compress the store with zlib, i.e. $compressed = gzcompress($store, 9).
6 - use the hash as the name for the object storage, using the first two characters as
the folder name inside the objects folder and the rest as the file name:
$path = '.git/objects/' . substr($hash, 0, 2) . '/' . substr($hash, 2)
7 - write out the compressed store to that location.
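The seven steps above can be sketched end to end in Python. This is a hedged illustration, not Git's own code: write_blob and read_object are hypothetical names, and git_dir is whatever folder you pass in (a temporary folder is the safe choice when experimenting).

```python
import hashlib
import os
import zlib


def write_blob(git_dir: str, content: bytes) -> str:
    """Follow steps 1-7 and return the path of the written object file."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"  # step 2
    store = header + content                                        # step 3
    digest = hashlib.sha1(store).hexdigest()                        # step 4
    compressed = zlib.compress(store, 9)                            # step 5
    folder = os.path.join(git_dir, "objects", digest[:2])           # step 6
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, digest[2:])
    with open(path, "wb") as fh:                                    # step 7
        fh.write(compressed)
    return path


def read_object(path: str):
    """Reverse the process: decompress, then split header from content."""
    with open(path, "rb") as fh:
        store = zlib.decompress(fh.read())
    header, _, content = store.partition(b"\0")
    kind, size = header.split(b" ")
    return kind.decode("ascii"), int(size), content
```

Calling write_blob(some_dir, b"test content\n") should produce a file under objects/d6/, matching the hash-object example shown earlier.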

Git Tree Object*
Git stores content in a manner similar to a
UNIX filesystem, but a bit simplified. All the
content is stored as tree and blob objects,
with trees corresponding to UNIX directory
entries and blobs corresponding more or less to
inodes or file contents. A single tree object
contains one or more entries, each of which is
the SHA-1 hash of a blob or subtree with its
associated mode, type, and filename.
Trees can be chained: to keep an older revision
inside a new tree, read the old tree into the
index (for example with git read-tree --prefix)
before writing the new tree.
$ git cat-file -p master^{tree}
100644 blob a906cb2a4a904a152e80877d4088654daad0c859    README
100644 blob 8f94139338f9404f26296befa88755fc2598c289    Rakefile
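What cat-file pretty-prints is stored in a compact binary form. The sketch below builds that form under a simplifying assumption: entries are sorted by plain name (real Git has a special rule for sorting directories). The tree_object function is a hypothetical helper for illustration.

```python
import hashlib


def tree_object(entries):
    """Build a tree body and its hash.

    entries: iterable of (mode, name, sha1_hex) tuples, for example
    ("100644", "README", "<40 hex characters>").
    """
    body = b""
    # Simplification: sort by name only.
    for mode, name, sha in sorted(entries, key=lambda e: e[1]):
        # Each entry is "<mode> <name>", a NUL byte, then the 20 raw hash bytes.
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(sha)
    store = b"tree " + str(len(body)).encode("ascii") + b"\0" + body
    return hashlib.sha1(store).hexdigest(), body
```

Note that the entry stores the hash as 20 raw bytes, not as 40 hex characters, and that the entry type (blob or tree) is implied by the mode rather than written out.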

Git Commit Object
Now you have saved your data, but that’s it: you
must remember all the SHA-1 values in order to recall
the snapshots. You also don’t have any information about
who saved the snapshots, when they were saved, or
why they were saved. This is the basic information
that the commit object stores for you.
$ echo 'First commit' | git commit-tree d8329f
fdf4fc3344e67ab068f836878b6c4951e3b15f3d
The format for a commit object is simple: it
specifies the top-level tree for the snapshot
of the project at that point; the parent
commits if any (the commit object described
above does not have any parents); the
author/committer information (which uses your
user.name and user.email configuration settings
and a timestamp); a blank line, and then the
commit message
And Then Commit Chain
Log Miracle was Birthed
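As a rough sketch of the commit format and of how a log falls out of the parent chain, the snippet below builds commit objects in memory and walks them. The function names, the fixed +0000 timezone, and the dict used as an object store are assumptions made for this illustration.

```python
import hashlib


def commit_object(tree, parents, author, timestamp, message):
    """Return (hash, body) for a commit with the given fields."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    ident = f"{author} {timestamp} +0000"  # timezone fixed for simplicity
    lines += ["author " + ident, "committer " + ident, "", message]
    body = ("\n".join(lines) + "\n").encode("utf-8")
    store = b"commit " + str(len(body)).encode("ascii") + b"\0" + body
    return hashlib.sha1(store).hexdigest(), body


def log(objects, head):
    """Follow first parents from head and collect the commit messages."""
    messages = []
    current = head
    while current is not None:
        headers, _, message = objects[current].decode("utf-8").partition("\n\n")
        messages.append(message.strip())
        parents = [line.split(" ", 1)[1]
                   for line in headers.split("\n")
                   if line.startswith("parent ")]
        current = parents[0] if parents else None
    return messages
```

Each commit only knows its parents, so walking backwards from the newest commit is enough to reconstruct the history.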

Hmmm, What About Branching?
This is where the refs folder comes into play. You can use the hashes to
navigate to commits, but you have to remember them, which is hard. So Git
provides an easier way to access those values: references, or refs for short,
are files whose names mirror the branches, e.g. .git/refs/heads/master.
Reference types:
Local refs: stored locally; they can be accessed using git log and the name of the reference, which usually corresponds to
the name of the branch.
$ git log --pretty=oneline master
The HEAD: a symbolic reference to the branch you are currently on. Git uses it to determine the SHA-1 of the last commit on
that branch and to know which ref to update (with update-ref) when you commit. HEAD can also point directly at the
SHA-1 of an object when you check out a tag, a commit, or a remote branch*.
Remotes: the same as local refs, but they track another location, usually the upstream version of
the same repo. They are read-only, but they can be navigated in the same way.
$ cat .git/refs/remotes/origin/master
Tag: technically an object, but in practice a permanent reference to a single commit that contains extra
information such as a date and a message.
$ git tag -a v1.1 1a410efbd13591db07496601ebc7a059dd55cfe9 -m 'Test tag'
* This is called detached HEAD mode; it can be used to move manually to a specific commit/tag.
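How a symbolic ref turns into a SHA-1 can be sketched as a short recursive lookup. This is a simplified illustration: resolve_ref is a hypothetical helper, and it only reads loose ref files (it does not look at packed refs).

```python
import os


def resolve_ref(git_dir: str, name: str = "HEAD") -> str:
    """Follow 'ref: ...' indirections until a plain SHA-1 is found."""
    with open(os.path.join(git_dir, name)) as fh:
        value = fh.read().strip()
    if value.startswith("ref: "):
        # Symbolic reference: HEAD names a branch file, so look that up next.
        return resolve_ref(git_dir, value[len("ref: "):])
    # Plain value: either a branch file's content or a detached HEAD.
    return value
```

With HEAD containing "ref: refs/heads/master" the function returns whatever hash is in refs/heads/master; in detached HEAD mode it returns the hash stored in HEAD itself.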

But I Want to Talk to My Friends Outside!
Using remotes we can move changesets between multiple Git repositories.
A remote uses a format called a refspec, which is stored in the .git/config
file. You can list the remotes and their URLs using git remote -v.
Refspec section example:
[remote "origin"]
url = https://github.com/schacon/simplegit-progit -> remote url
fetch = +refs/heads/*:refs/remotes/origin/* -> +<src>:<dst>
There are two main operations that use a refspec, fetch and push:
Fetch:
We can filter by a single branch:
fetch = +refs/heads/master:refs/remotes/origin/master
The + sign is optional; it tells Git to update the reference even if
the update isn’t a fast-forward.
You can use multiple fetch lines in the same section, or you can use the
git fetch command.
You can use a glob in the fetch line to create a pattern, which can be used
to namespace and partition the code.
Push:
push = refs/heads/master:refs/heads/qa/master
You can add a push line to the section to push automatically, or use the git push command.
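To show how a +<src>:<dst> refspec maps names, here is a small sketch. It handles only a single * per side, and parse_refspec and map_ref are hypothetical helpers written for this illustration, not part of Git.

```python
def parse_refspec(spec: str):
    """Split '+<src>:<dst>' into (force, src, dst)."""
    force = spec.startswith("+")
    if force:
        spec = spec[1:]
    src, dst = spec.split(":", 1)
    return force, src, dst


def map_ref(spec: str, ref: str):
    """Return the destination name for ref, or None if the refspec does not match."""
    _, src, dst = parse_refspec(spec)
    if "*" not in src:
        return dst if ref == src else None
    prefix, suffix = src.split("*", 1)
    long_enough = len(ref) >= len(prefix) + len(suffix)
    if long_enough and ref.startswith(prefix) and ref.endswith(suffix):
        # Whatever the * matched on the source side is reused on the destination side.
        matched = ref[len(prefix):len(ref) - len(suffix)]
        return dst.replace("*", matched, 1)
    return None
```

With the glob refspec from the example section, refs/heads/master maps to refs/remotes/origin/master, while the single-branch form matches only that one branch.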

But That’s a Lot of Objects
Well, Git got your pack, literally :)
That’s where packfiles come in. The more revisions you create, the
more objects are created. Since we don't work alone, those objects
will at some point have to travel somewhere else, and sending a lot
of objects one by one seems like a recipe for failure, so packfiles
were created.
So what’s a packfile?
A packfile is basically a compressed bundle of all the committed
objects, stored as two files, .pack and .idx. The pack file contains
the objects and the index file contains the offsets of the objects
in the pack file.
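As a small peek inside a .pack file, the sketch below builds and parses just its 12-byte header: the signature PACK, a version number, and the object count, all in network byte order. The function names are hypothetical, and the object entries that follow the header in a real packfile are not covered here.

```python
import struct


def build_pack_header(object_count: int, version: int = 2) -> bytes:
    """4-byte signature, 4-byte version, 4-byte object count (big-endian)."""
    return struct.pack(">4sII", b"PACK", version, object_count)


def parse_pack_header(data: bytes):
    """Return (version, object_count) from the first 12 bytes of a packfile."""
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a packfile")
    return version, count
```

The object count in the header is what lets a receiver know how many entries to expect before it starts reading them.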

What Happens When Travelling Up There
When we talked about how Git saves revisions as objects, we were describing what is called the loose
object format, which means each object is stored as a single file in the objects folder. But as we said
before, when sending your changesets from one place to another, having thousands of objects isn’t very
efficient. So Git has the gc command, or as we know it in computer science lingo, a garbage collector.
While a garbage collector is usually used to clean objects from memory, git gc is used to pack the objects
into a neat format that can be compressed and sent anywhere. Git packs objects automatically when there are
too many loose objects around or when you push to a remote, and you can also run git gc locally to clean
up every now and then.
Example:
$ git cat-file -s b042a60ef7dff760008df33cee372b945b6e884e
22054
$ git gc
Counting objects: 18, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (14/14), done.
Writing objects: 100% (18/18), done.
Total 18 (delta 3), reused 0 (delta 0)
$ find .git/objects -type f
.git/objects/bd/9dbf5aae1a3862dd1526723246b20206e5fc37
.git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
.git/objects/info/packs
.git/objects/pack/pack-978e03944f5c581011e6998cd0e9e30000905586.idx
.git/objects/pack/pack-978e03944f5c581011e6998cd0e9e30000905586.pack
git verify-pack: a plumbing command that
allows you to see what was packed up.

Cool, but How Do We Actually Travel?
Git uses two ways to send and receive data from the outside world:
Dumb Protocol:
The dumb protocol is likely to be used when a repository is set up to be served read-only over HTTP.
This protocol is called “dumb” because it requires no Git-specific code on the server side during the transport
process; the fetch process is a series of HTTP GET requests.
With git clone, the client starts by downloading info/refs (a file written by the update-server-info command), then
HEAD to know what to check out, and then it fetches the objects reachable from the listed refs in the loose object format.
Warning: this method is hard to secure and is rarely used today, but it can still be useful in rare situations.
Smart Protocol:
The smart protocol is a more common method of transferring data, but it requires a process on the remote end that
is intelligent about Git: it can read local data, figure out what the client has and needs, and generate a packfile
for it.
There are two sets of processes for transferring data: a pair for uploading data and a pair for downloading data.
Uploading Data: PUSH
To upload data to a remote process, Git uses the send-pack and receive-pack processes. The send-pack process runs
on the client and connects to a receive-pack process on the remote side.
Downloading Data: FETCH
When you download data, the fetch-pack and upload-pack processes are involved. The client initiates a fetch-pack
process that connects to an upload-pack process on the remote side to negotiate what data will be transferred
down.
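The "figure out what the client has and needs" step can be pictured as a set difference over the commit graph. This is a deliberately simplified sketch: it looks at commits only (a real transfer also covers trees and blobs), and reachable and objects_to_send are hypothetical names.

```python
def reachable(parents, tips):
    """All commits reachable from tips; parents maps a commit to its parent list."""
    seen = set()
    stack = list(tips)
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


def objects_to_send(parents, wants, haves):
    """Commits the client wants minus everything it already has."""
    return reachable(parents, wants) - reachable(parents, haves)
```

If the client already has the first commit of a three-commit history and asks for the newest one, only the two missing commits end up in the generated packfile.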

I Lost Some Data, What Should I Do?
Git provides two ways to access missing revisions:
1- git reflog / git log -g
Git silently records what your HEAD is every time you
change it. Each time you commit or change branches,
the reflog is updated. The reflog is also updated by
the git update-ref command, which is another reason to
use it instead of just writing the SHA-1 value to your
ref files
2- git fsck --full
The git fsck utility checks your database for
integrity. It shows any orphaned object by
printing "dangling" before the type of the object
and its hash.
Both ways give you the hash value, which you can
then attach to a new branch:
$ git branch recover-branch ab1afef
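If the fsck report is long, picking out the recoverable commits can be scripted. The sketch below assumes the report has already been captured as text and that each finding is one line of the form "dangling <type> <hash>"; dangling_commits is a hypothetical helper.

```python
def dangling_commits(fsck_output: str):
    """Return the hashes on lines of the form 'dangling commit <hash>'."""
    found = []
    for line in fsck_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "dangling" and parts[1] == "commit":
            found.append(parts[2])
    return found
```

Each returned hash is a candidate to pass to git branch, as in the recover-branch command above.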
