
Working with Git - The Theory


Introduction

Welcome to this series of blog posts about Git! Today, we’re going to dive into a bit of Git theory.
Git has become one of the most widely used tools for versioning files, helped by the popularity of GitHub.
It is a very powerful tool with a lot of commands and concepts. It works for simple everyday usage as well as for more advanced use cases that can look a bit obscure to newcomers.
Today, we keep things as simple as possible and cover a few topics:

  • Centralized versus Distributed Version Control Systems
  • How Git works?
    • Git snapshot storage system
    • Git file integrity
    • Git states
  • Git branching model

Centralized vs Distributed System

There are some big differences between a centralized version control system (CVCS) and a distributed version control system (DVCS). To better understand why Git is so popular, and before digging into “How Git Works?”, we need to understand what those differences are.

Centralized version control system

When we work with a centralized version control system, we usually have a dedicated main server that hosts our repository. This server is responsible for the whole repository history and all its revisions. When a developer wants a local working copy of the remote repository, they have to pull it from the main server. Once this is done, they are ready to implement the requested feature. When the feature is ready, it is pushed back to the main server, which merges the new feature code into its current revision of the code base.

There are several drawbacks with this approach:

  • A permanent connection with the main server is needed in order to:
    • retrieve and view repository history
    • retrieve a file at a specific revision
    • keep our local repository up-to-date
  • If we are offline, we can still keep our modifications, but we lose the repository history view and access to specific revisions
  • When we push back our work, it can be a large amount of modifications in one shot, which can be hard to review

Fortunately, there are a few advantages with this approach:

  • Everything is centralized on a main server
  • Backup and revision of the code are simple and clear
  • The workflow is simple and easy to learn

Distributed version control system

When we work with a distributed version control system, we do the same things as we would with a centralized one. We start by pulling a local copy of the remote repository from the main server. Then we can work on our new feature locally without being connected to the server. Once our feature is ready to be merged back into the main repository, we push our changes back to the main server.

One of the biggest advantages is that once a local copy of the repository has been pulled from the main server, we can work locally and offline. The local copy contains the whole repository history, so we can retrieve any version of a specific file, and we have a local history that we can use to commit our changes step by step. It means that our local history can diverge from the remote one, and pushing our changes back to the main server amounts to a merge of both histories.

There are a few drawbacks:

  • The workflow can be a bit more complex because we have a local and a remote history that can diverge from each other
  • The evolution of the project history can be harder to follow

There are several advantages:

  • The complete repository history is available inside our local copy
  • The local history lets us build a feature step by step by creating small local revisions (no huge commit)
  • No more limitations when working offline: it is possible to work without a connection to the remote server
  • Less network usage, as most operations do not need the remote server

Note: Most distributed version control systems can be used in a serverless way, using only the local history. It is just more convenient to use such a system with a main server that keeps all the changes made to a project code base. This is what we do when we use a platform such as GitHub.

How Git Works?

We have seen the differences between centralized and distributed version control systems and, as you might have guessed, Git is a distributed version control system. We can work completely offline, without the need for a remote server.

Git offers the following advantages:

  • A local repository with its own local history
  • The ability to create branches with ease
  • The ability to add one or more remote repositories and synchronize with them
  • A useful tool that helps to find which commit is responsible for a regression (this will be covered in another blog post)
  • Great performance
  • Low storage consumption

We will discuss all of these features in future blog posts.
Being able to work entirely on our own machine is great but, as you might have guessed, we still need some kind of main repository if we want to easily share and work with other developers. That is why, nowadays, we rely on platforms such as GitHub or GitLab to host our remote Git repositories. This kind of platform acts as the main repository and provides a way to manage, review and collaborate easily on a project. These platforms are widely used by the open source community and by companies all over the world.

Git Snapshot storage system

Git has a very powerful storage system that relies on file tree snapshots instead of file differences (deltas). When we put files under Git version control, Git takes what we call a snapshot of the repository file tree. It can be viewed as the state of the repository file tree at a specific moment in time.

Here is an example of a repository with three files to illustrate a Git snapshot:

  • __init__.py
  • main.py
  • model.py

When we add them under Git version control, a new snapshot is created as in the picture below:

The newly created snapshot contains an exact copy of the repository file tree. One thing to keep in mind is that Git is file-based: it tracks files, and a directory only exists in a snapshot through the files it contains. The only way to keep an otherwise empty directory is to put a placeholder file such as “.keepme” inside it. This can sound like a limitation but it really is not, and there are numerous ways to handle it, such as the one described here.

Another advantage of Git is that when a snapshot is created, only the files we have changed are stored again. Every other file that is identical to its version in the previous snapshot is not stored as is, but rather as a link to that version. We can think of such a link as what the UNIX world calls a symbolic link. Let’s have a look at the following picture, which shows how Git handles file storage as a stream of snapshots:

Snapshot 1 is the first one; it corresponds to our initial file tree, which contains our three files. After a while, we modified the “model.py” file and put the change under version control. Inside Snapshot 2, you can see that the “__init__.py” and “main.py” files are not stored as is but as links to their previous versions inside Snapshot 1. Only the “model.py” file is stored inside Snapshot 2, because it is the only file that has changed since Snapshot 1. After that, we modified the “main.py” file and put it under version control, which gives Snapshot 3. If you look closely, you see that the file “__init__.py” is linked to its original version inside Snapshot 1 and the file “model.py” is linked to its version in Snapshot 2. In Snapshot 4, we made another modification to the “main.py” file. The other files are just links to their previous versions.

At the end, we have:

  • __init__.py which is equal to its version in Snapshot 1
  • main.py which is a new version of itself in Snapshot 4
  • model.py which is equal to its version in Snapshot 2

With that in mind, we can see that the disk space used by Git is optimized, with only a little overhead (internal data). It is slightly less than what we would have with a centralized version control system such as Subversion (SVN), for example.
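To make the idea of "links" more concrete, here is a minimal Python sketch of a snapshot store. It is only an illustration under simplifying assumptions, not Git's actual implementation: the names `take_snapshot` and `store` are hypothetical, file contents are made up, and each content is identified by a plain SHA-1 of its bytes.

```python
import hashlib


def take_snapshot(files, store):
    """Store each file content once, keyed by its hash, and return a
    snapshot mapping file names to those keys (the "links")."""
    snapshot = {}
    for name, content in files.items():
        key = hashlib.sha1(content).hexdigest()
        store.setdefault(key, content)  # identical content is not stored again
        snapshot[name] = key
    return snapshot


store = {}  # hypothetical stand-in for the Git database
tree = {
    "__init__.py": b"",
    "main.py": b"print('v1')\n",
    "model.py": b"class Model: pass\n",
}
snapshot_1 = take_snapshot(tree, store)

# Only model.py changes before the second snapshot.
tree["model.py"] = b"class Model:\n    name = 'v2'\n"
snapshot_2 = take_snapshot(tree, store)

print(snapshot_1["main.py"] == snapshot_2["main.py"])    # True: same link
print(snapshot_1["model.py"] == snapshot_2["model.py"])  # False: new content
print(len(store))  # 4 contents stored for 2 snapshots of 3 files
```

Two snapshots of three files end up storing only four contents, because the unchanged files are shared between both snapshots.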

Git file integrity

We have talked a lot about snapshots but, in fact, the correct term used by Git is “commit”. A commit is a snapshot of your repository file tree at a specific moment in time. To ensure that this file tree reflects exactly what it was at that moment, Git provides a SHA-1 checksum that guarantees the integrity of the file tree. This SHA-1 checksum is a string of 40 hexadecimal characters, computed from the contents of the file / directory tree. Such a checksum is part of every commit, as you can see in the following picture:

In the above picture, we have a commit that contains:

  • a file tree snapshot
  • a SHA-1 checksum of the snapshot

Note 1: The SHA-1 is usually displayed in an abbreviated form, by default its first 7 characters (more when needed to stay unique), in order to make it more readable and easier to use, so the full “f3a2a38da20733af263cefb52e1a381dcf14cac7” is displayed as “f3a2a38”.
Note 2: A commit contains additional information such as an author, a date and a convenient message that describes the changes.
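As an illustration of how such a checksum depends on content, the sketch below reproduces in Python the way Git identifies the content of a single file (a "blob"): a SHA-1 over a small header followed by the bytes of the file. The function name `git_blob_sha1` is our own; trees and commits get their identifiers in a similar way with their own headers, which is not shown here.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Checksum Git assigns to a file content: SHA-1 of the header
    "blob <size>" + NUL byte, followed by the content itself."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty = git_blob_sha1(b"")
print(empty)       # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(len(empty))  # 40
print(empty[:7])   # e69de29 (abbreviated form)

# Changing a single byte gives a completely different checksum.
print(git_blob_sha1(b"a") == git_blob_sha1(b"b"))  # False
```

If you have Git installed, the first value should match what `git hash-object` reports for an empty file.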

Git states

Now that we have a clear view of Git’s file storage mechanism, we can dive into the three file states provided by Git:

  • Modified: a file has been modified in our repository tree
  • Staged: a file has been marked in its current version as part of the next commit
  • Committed: a file is now stored inside the Git database (snapshot)

From my point of view, there is a fourth state worth considering: the “Untracked” state, which means that a file is present in the repository file tree but is neither part of the next commit nor under version control. It must be viewed as a special case because when you create a new file inside your Git repository, it is always considered “untracked” at first.

Let’s review each state and its possible transitions:

  • Untracked -> Staged: An untracked file can be staged in order to be part of the next commit
  • Staged -> Untracked: A file can be un-staged and, if it was not yet part of your Git database, it goes back to untracked
  • Modified -> Staged: A modified file can be staged in order to be part of the next commit
  • Staged -> Modified: A staged file can go back to the modified state if it is no longer part of the next commit
  • Staged -> Committed: A staged file can be committed to the Git database

The most used workflows are Untracked -> Staged -> Committed and Modified -> Staged -> Committed.
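The transitions listed above can be sketched as a tiny state machine. This is only a model of the list in this post: the names `TRANSITIONS` and `move` are hypothetical, and Git itself does not expose file states this way.

```python
# Allowed transitions, exactly as listed above: (from_state, to_state).
TRANSITIONS = {
    ("untracked", "staged"),
    ("staged", "untracked"),
    ("modified", "staged"),
    ("staged", "modified"),
    ("staged", "committed"),
}


def move(state, new_state):
    """Return the new state, or raise if the transition is not allowed."""
    if (state, new_state) not in TRANSITIONS:
        raise ValueError(f"cannot go from {state} to {new_state}")
    return new_state


# The most common workflow for a new file: Untracked -> Staged -> Committed.
state = "untracked"
for step in ("staged", "committed"):
    state = move(state, step)
print(state)  # committed
```

Trying `move("untracked", "committed")` raises a `ValueError`, which mirrors the fact that a file has to be staged before it can be committed.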

Git branching model

The branching model of Git is one of its most powerful features and also one of the most used.
When we create a new Git repository, a default branch is created; it is named “main” in our case.
In this section, we first take a closer look at commits, then we talk about branches and how they work in the Git ecosystem.

Inside a commit

A commit has the following fields:

  • a SHA-1 checksum of the snapshot
  • an author
  • a creation date
  • a convenient message that describes the changes
  • a file tree snapshot
  • one or more parents (none for the very first commit)

We have already seen all of these fields except one that we haven’t talked about before: parents.
The first commit has no parent at all; it is considered the oldest ancestor.
A standard commit has one parent, which is just a link to its predecessor, as in the example below:

In this example, “Commit 2” has a link to “Commit 1”, which is its parent. This way, Git can easily identify and manage snapshots of the repository file tree. It also helps Git to easily re-create the state of the repository contained inside a snapshot.
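The parent link can be pictured with a very small data structure. The following Python sketch is a simplified model, not Git's real object format: the `Commit` class and the `history` helper are hypothetical names, and the other fields (checksum, author, date, message, snapshot) are left out to focus on parents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    name: str
    parents: tuple = ()  # empty for the very first commit


def history(commit):
    """Walk back from a commit to the oldest ancestor, following the
    first parent of each commit."""
    names = []
    while commit is not None:
        names.append(commit.name)
        commit = commit.parents[0] if commit.parents else None
    return names


commit_1 = Commit("Commit 1")               # no parent: oldest ancestor
commit_2 = Commit("Commit 2", (commit_1,))  # one parent: its predecessor

print(history(commit_2))  # ['Commit 2', 'Commit 1']
```

Starting from any commit and following the parent links is enough to rebuild the whole chain of snapshots that led to it.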

Definition of a branch

If we continue to make commits, we will end up with something like this:

Remember that Git created a default branch, named main in our case, when we initialized our repository. A branch can be viewed as a stream of commits where Git keeps a pointer (the branch name) to the last commit. In our example, we have 4 commits on the “main” branch, and the last one, which the label main points to, is Commit 4.

Multiple branches

Creation of a new branch

As you might have guessed, you can create and manage multiple branches inside your repository.
From the point of view of Git, when we create a new branch, a new pointer is created with the branch name as its label, and it points to the last commit of the branch we were on or to the last commit of the specified parent branch.

In the above example, the new feature branch pointer is linked to Commit 4 of the main branch.

One of the benefits of this approach is that the new branch is created at almost no cost, because it is just a new pointer.
A branch really starts to live its own life after the first commit on it, because that is when it starts to diverge from its parent branch.
In the example below, the feature branch has diverged from the main branch and each now has its own life.

This is very convenient because you can work on a specific feature inside a dedicated branch without compromising the main branch.
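Here is a small Python sketch of branches as pointers. It is a toy model with hypothetical names (`parents`, `branches`, `commit`), where a commit is reduced to its name; it only aims to show why creating a branch costs almost nothing.

```python
parents = {}               # commit name -> parent commit name (None for the first)
branches = {"main": None}  # branch name -> last commit on that branch


def commit(branch, name):
    """Create a commit on top of the branch and move the branch pointer."""
    parents[name] = branches[branch]
    branches[branch] = name


for n in range(1, 5):
    commit("main", f"Commit {n}")

# Creating a branch only copies a pointer: no commit or file is duplicated.
branches["feature"] = branches["main"]
print(branches)  # {'main': 'Commit 4', 'feature': 'Commit 4'}

# The first commit on the new branch makes it diverge from main.
commit("feature", "Commit 5")
print(branches)             # {'main': 'Commit 4', 'feature': 'Commit 5'}
print(parents["Commit 5"])  # Commit 4
```

After the commit on feature, main still points to Commit 4, so work on the feature does not affect the main branch.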

Merge of a branch

When the implementation of our feature is done, we want to put our work back into the main branch. This process is what we call a “merge” in Git terminology: two distinct branches are merged back together. It can be done in different ways, but we are going to have a look at the two most common merge strategies:

  • fast-forward: when the destination branch has not diverged, make the commits of the source branch part of the destination branch by simply moving the destination branch pointer forward
  • recursive non fast-forward: always create a dedicated “merge commit” that marks the fusion of the two branches (source and destination)

Recursive non fast-forward merge

With the recursive non fast-forward strategy, a merge commit is created when we make the fusion of our feature branch with the main branch:

Let’s summarize a bit:
The “feature” branch has diverged by two commits:

  • Commit 5
  • Commit 6

The main branch pointer was on Commit 4 before the merge.
After the merge, Commit 7 has two parents:

  • Commit 4
  • Commit 6

Remember that a standard commit has only one parent, but Commit 7 is what we call a merge commit and thus it can have more than one parent.
This commit marks the fusion of the snapshots from the main branch and the feature branch. We may have to resolve some conflicts during this process, because it is possible that the same file was edited in both branches.
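The situation summarized above can be sketched in a few lines of Python. This is a toy model with hypothetical names (`parents`, `branches`, `merge_no_ff`); it only records the parent links and ignores the snapshot contents and any conflict resolution.

```python
# Each commit maps to the list of its parents.
parents = {
    "Commit 4": ["Commit 3"],
    "Commit 5": ["Commit 4"],
    "Commit 6": ["Commit 5"],
}
branches = {"main": "Commit 4", "feature": "Commit 6"}


def merge_no_ff(target, source, name):
    """Create a merge commit whose parents are the last commits of both
    branches, then move the target branch pointer to it."""
    parents[name] = [branches[target], branches[source]]
    branches[target] = name


merge_no_ff("main", "feature", "Commit 7")

print(parents["Commit 7"])  # ['Commit 4', 'Commit 6']
print(branches)             # {'main': 'Commit 7', 'feature': 'Commit 6'}
```

The feature pointer stays on Commit 6, while main now points to the merge commit that ties both histories together.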

Fast-forward merge

With the fast-forward strategy, Git makes the commits of the source branch (“feature” in our case) part of the main branch, as if they had been made there directly, as in the example below:

Let’s summarize a bit:
The “feature” branch has diverged by two commits:

  • Commit 5
  • Commit 6

The main branch pointer was on Commit 4 before the merge.

As Commit 4 is the common ancestor of both the main branch and the feature branch, and main has received no commit of its own since then, Commit 5 already has Commit 4 as its parent. Git therefore has nothing to copy or re-create: it simply moves the main branch pointer forward from Commit 4 to Commit 6. In the end, everything looks as if all the commits had been made directly on the main branch. This approach is often used to keep a linear and clean main branch history, but it is up to you.
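In a toy model, a fast-forward can be sketched as a pointer move guarded by an ancestry check. The names below (`parents`, `branches`, `is_ancestor`, `fast_forward`) are hypothetical, and the sketch assumes single-parent commits only.

```python
# Each commit maps to its single parent (None for the first commit).
parents = {
    "Commit 1": None,
    "Commit 2": "Commit 1",
    "Commit 3": "Commit 2",
    "Commit 4": "Commit 3",
    "Commit 5": "Commit 4",
    "Commit 6": "Commit 5",
}
branches = {"main": "Commit 4", "feature": "Commit 6"}


def is_ancestor(ancestor, commit):
    """Return True if `ancestor` is reachable from `commit` via parent links."""
    while commit is not None:
        if commit == ancestor:
            return True
        commit = parents[commit]
    return False


def fast_forward(target, source):
    """Move the target pointer to the source's last commit, but only if the
    target has not diverged (its last commit is an ancestor of the source)."""
    if not is_ancestor(branches[target], branches[source]):
        return False
    branches[target] = branches[source]
    return True


print(fast_forward("main", "feature"))  # True
print(branches["main"])                 # Commit 6
```

No new commit is created here: only the main pointer changes. If main had received commits of its own, `is_ancestor` would return False and a fast-forward would not be possible.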

ℹ️ Be aware that if Git cannot fast-forward (for instance because the main branch has received new commits in the meantime), we have to fall back on a recursive merge or use another tool that Git provides, called “rebase”. We will explore this tool in another blog post because it is not a “basic” tool.

Conclusion

In this blog post, we started with an overview of both Centralized and Distributed Version Control Systems. We looked at the differences between them and then we dived into “How Git works?”.
From there, we looked at how Git manages its file storage using snapshots of the repository file tree. We also looked at how Git guarantees the integrity of its snapshots, and we ended with a view of Git file states.
Finally, we looked at the Git branching model to understand what a branch is and how it works.

It has been a long path of theory and I hope you enjoyed it. Next time, we will dive into some basic Git usage, such as how to set up Git, set up a repository, make changes and commit them.

Thank you for reading! If you have any questions or remarks, feel free to contact me on Twitter or by e-mail.


Written by KokutoSan, Software Engineer
