[git] git internal principle

Git internal principle

1. Summary

As a software development engineer, we rarely fight alone in our daily work. It will certainly involve multiple people working together to complete a project. How can we give the code we develop to other developers? Via U SB flash drive? Or write it in word and send it to each other? This is obviously not elegant enough. CVS, SVN, GIT and other tools with version control function perfectly solve this problem. Developers can realize the interworking of code and other files through SVN or GIT.

The main difference between Git and other version control systems (including Subversion and approximation tools) is the way Git treats data. Conceptually, most other systems store information in the form of file change lists (CVS, Subversion, Perforce, Bazaar, etc.) treat the information they store as a set of basic files and the differences that each file gradually accumulates over time (they are commonly referred to as delta based version control).

I think everyone is familiar with how to use Git, but how does Git realize version control, multi person collaboration and file merging? Next, we will go inside Git and deeply explore how Git does it. We should not only know what it is, but also know why it is.

2. Git object

Git is actually a content addressable file system, which means that the core of Git is actually a key value based database. We can insert any type of content into git warehouse, and it will return a unique key, which can retrieve the content at any time.

Since Git is a file system, it should be able to act as the concepts of "file" and "folder", which git collectively refers to as "object". That is, any type of content inserted into the GIT repository is called an object.

2.1. blob object

The data object acts as the concept of "file" in the file system (but does not contain the file name). The data object contains a key value pair, where the value is the file content, and the key is the data to be stored plus a header information (header) the checksum obtained by performing SHA-1 check operation together. The first two characters of the checksum are used to name the subdirectory, and the remaining 38 characters are used as the file name.

Use the underlying command git hash object to demonstrate:

1. After git init initializes a git repository, you can see that git automatically creates an objects directory and two subdirectories info and pack:

$ git init git-test
Initialized empty Git repository in E:/demo-projects/git-test/.git/

$ cd git-test/
$ find .git/objects/
.git/objects/
.git/objects/info
.git/objects/pack

2. Next, create a new blob object with git hash object and store it in the hash database:

$ echo 'content one two' | git hash-object -w --stdin
2938b4de55b3da15112c00deadf244dd6d3ef073

$ find .git/objects/
.git/objects/
.git/objects/29
.git/objects/29/38b4de55b3da15112c00deadf244dd6d3ef073
.git/objects/info
.git/objects/pack

3. The Git cat file command can be used to fetch data from Git:

$ git cat-file -p 2938b4de55b3da15112c00deadf244dd6d3ef073
content one two

$ git cat-file -t 2938b4de55b3da15112c00deadf244dd6d3ef073
blob

$ git cat-file -s 2938b4de55b3da15112c00deadf244dd6d3ef073
16

Cat file parameters mainly use the following

-t                    show object type
-s                    show object size
-p                    pretty-print object's content

4. Start simple version control: create a file and add it to the Git database, then modify the file content and store it in the database to see what changes will occur:

$ echo 'version 1' > test.txt
$ git hash-object -w test.txt
warning: LF will be replaced by CRLF in test.txt.
The file will have its original line endings in your working directory
83baae61804e65cc73a7201a7252750c76066a30


$ echo 'version 2' > test.txt
$ git hash-object -w test.txt
warning: LF will be replaced by CRLF in test.txt.
The file will have its original line endings in your working directory
1f7a7a472abf3dd9643fd615f6da379c4acb3e3a

$ find .git/objects/ -type f
.git/objects/1f/7a7a472abf3dd9643fd615f6da379c4acb3e3a
.git/objects/29/38b4de55b3da15112c00deadf244dd6d3ef073
.git/objects/83/baae61804e65cc73a7201a7252750c76066a30

As you can see, the test is added twice Txt are saved in the database. We can use cat file to retrieve the file content corresponding to the specified SHA-1 value. Note that since the blob object does not specify a file name, the file contents can theoretically be taken out to any file through cat file.

$ git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30 > test-new.txt
$ cat test-new.txt
version 1

2.2. tree object

Although git cat file is very powerful, it is unrealistic to remember the SHA-1 value of each file in daily work. Git introduces tree objects to assist management, which are similar to the concept of folders in the file system.

Usually, Git creates and records a corresponding tree object according to the state represented by the temporary storage area (i.e. index area, the same below) at a certain time. It records the permission, file type, SHA-1 information and snapshot of file name at a certain time of each file / subfolder.

We use the Git update index command to update the test new Txt add to Git staging area:

$ git update-index --add --cacheinfo 100644 83baae61804e65cc73a7201a7252750c76066a30 test-new.txt

Note: 100644 here refers to the file mode, where 100 refers to the file and 644 refers to the permission of the file. Other options for file mode include:

100755: executable

120000: symbolic links

040000: folder / tree object

At this point, we found that An index directory appears under Git directory, and it is found through tortoise git that test new Txt file has become the status tracked by Git. Use git status to verify:

$ git status
On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
        new file:   test-new.txt

Next, continue with the topic, write the contents of the temporary storage area into a tree object with git write tree, and then use cat file to view the contents contained in the tree object:

$ git write-tree
a37e53c0b0e7c2520d1d594f9b5246ee148de814

$ git cat-file -p a37e53c0b0e7c2520d1d594f9b5246ee148de814
100644 blob 83baae61804e65cc73a7201a7252750c76066a30    test-new.txt

Now that we have a snapshot, let's create the next tree object: including a new file new Txt and the previously added test Txt (version 2).

$ echo "new file" > new.txt
$ git update-index --add new.txt
warning: LF will be replaced by CRLF in new.txt.
The file will have its original line endings in your working directory

$ git update-index --add --cacheinfo 100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a test-new.txt

$ git write-tree
8821d4c684e63569c0bf448affd5faef50919338

$ git cat-file -p 8821d4c684e63569c0bf448affd5faef50919338
100644 blob fa49b077972391ad58037050f2a75f74e3671e92    new.txt
100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a    test-new.txt

After adding, use git status to view the status of the staging area: you can see it clearly

$ git status
On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
        new file:   new.txt
        new file:   test-new.txt

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   test-new.txt

Next, you can use the read tree command to add the first tree object to the second tree object to construct a hierarchy:

$ git read-tree --prefix=bak a37e53c0b0e7c2520d1d594f9b5246ee148de814
$ git write-tree
cdd0e09fa699af925dfb6747986a60b69c1a0c05

$ git cat-file -p cdd0e09fa699af925dfb6747986a60b69c1a0c05
040000 tree a37e53c0b0e7c2520d1d594f9b5246ee148de814    bak
100644 blob fa49b077972391ad58037050f2a75f74e3671e92    new.txt
100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a    test-new.txt


#######################################################################
# Verify test new under different tree objects Txt

$ git cat-file -p a37e53c0b0e7c2520d1d594f9b5246ee148de814
100644 blob 83baae61804e65cc73a7201a7252750c76066a30    test-new.txt

$ git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30
version 1

$ git cat-file -p 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
version 2

At this time, the structure of the new tree object cdd0e09fa699af925dfb6747986a60b69c1a0c05 is as follows:

2.3. commit object

Here, we find that even if the tree object is introduced, it still needs to use the SHA-1 value to refer to each tree object, and we can't completely remember which tree was created at which time and what content. Just like now, do you remember the SHA-1 value of each tree created in this article? Let's review:

1. The first tree, with SHA-1 value of a37e53c0b0e7c2520d1d594f9b5246ee148de814, contains a file test new with "version 1" txt

$ git cat-file -p a37e53c0b0e7c2520d1d594f9b5246ee148de814
100644 blob 83baae61804e65cc73a7201a7252750c76066a30    test-new.txt

$ git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30
version 1

2. The second tree, with the SHA-1 value of 8821d4c684e63569c0bf448affd5faef50919338, contains a file test new with the content of "version 2" Txt and a file with the content "new file" txt

$ git cat-file -p 8821d4c684e63569c0bf448affd5faef50919338
100644 blob fa49b077972391ad58037050f2a75f74e3671e92    new.txt
100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a    test-new.txt

$ git cat-file -p 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
version 2

$ git cat-file -p fa49b077972391ad58037050f2a75f74e3671e92
new file

3. The third tree, whose SHA-1 value is cdd0e09fa699af925dfb6747986a60b69c1a0c05, contains the second tree and the first tree. The contents of the first tree are placed in the bak directory, as follows:

$ git cat-file -p cdd0e09fa699af925dfb6747986a60b69c1a0c05
040000 tree a37e53c0b0e7c2520d1d594f9b5246ee148de814    bak
100644 blob fa49b077972391ad58037050f2a75f74e3671e92    new.txt
100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a    test-new.txt

$ git cat-file -p a37e53c0b0e7c2520d1d594f9b5246ee148de814
100644 blob 83baae61804e65cc73a7201a7252750c76066a30    test-new.txt

$ git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30
version 1

$ git cat-file -p 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
version 2

After reviewing the existing three tree objects, a commit object will be introduced to record the submitter, submission time and possible parent submissions:

$ echo 'first commit' | git commit-tree a37e53c0b0e7c2520d1d594f9b5246ee148de814
1580971f914fe3bed9c1ccdb117387db3525cf63

$ git cat-file -p 1580971f914fe3bed9c1ccdb117387db3525cf63
tree a37e53c0b0e7c2520d1d594f9b5246ee148de814
author Lu Hao <luhao@tp-link.com.cn> 1638934182 +0800
committer Lu Hao <luhao@tp-link.com.cn> 1638934182 +0800

first commit

Next, create two more submissions and concatenate the three tree objects:

$ echo 'second commit' | git commit-tree 8821d4c684e63569c0bf448affd5faef50919338 -p 1580971f914fe3bed9c1ccdb117387db3525cf63
bb6d6733199b5dbe16b4d5247999d6c9487cbd15

$ echo 'third commit'  | git commit-tree cdd0e09fa699af925dfb6747986a60b69c1a0c05 -p bb6d6733199b5dbe16b4d5247999d6c9487cbd15
ff9152a5392a6f0f1f290cd8928d41b308b14fa9

At this time, use the git log --stat command on the last submission to find that there is a complete submission log: This is actually what git add and git commit do: save the rewritten file as a data object, update the staging area, record the tree object, and finally create a submission object indicating the top-level tree object and the parent submission.

$ git log --stat ff9152a5392a6f0f1f290cd8928d41b308b14fa9
commit ff9152a5392a6f0f1f290cd8928d41b308b14fa9
Author: Lu Hao <luhao@tp-link.com.cn>
Date:   Wed Dec 8 11:46:29 2021 +0800

    third commit

 bak/test-new.txt | 1 -
 1 file changed, 1 deletion(-)

commit bb6d6733199b5dbe16b4d5247999d6c9487cbd15
Author: Lu Hao <luhao@tp-link.com.cn>
Date:   Wed Dec 8 11:32:13 2021 +0800

    second commit

 bak/test-new.txt | 1 +
 new.txt          | 1 +
 test-new.txt     | 2 +-
 3 files changed, 3 insertions(+), 1 deletion(-)

commit 1580971f914fe3bed9c1ccdb117387db3525cf63
Author: Lu Hao <luhao@tp-link.com.cn>
Date:   Wed Dec 8 11:29:42 2021 +0800

    first commit

 test-new.txt | 1 +
 1 file changed, 1 insertion(+)

3. Git reference

HEAD, branch, tag, etc. are essentially references. You can see their existence in the. git/refs directory. git references are files that help users identify the corresponding SHA-1 value. We can create a new reference to record the location of the latest submission, which is actually the essence of branches.

$ echo ff9152a5392a6f0f1f290cd8928d41b308b14fa9 > .git/refs/heads/master

$ git log --pretty=oneline master
ff9152a5392a6f0f1f290cd8928d41b308b14fa9 (HEAD -> master) third commit
bb6d6733199b5dbe16b4d5247999d6c9487cbd15 second commit
1580971f914fe3bed9c1ccdb117387db3525cf63 first commit

At this point, you can use the advanced command git log to view the submission history:

$ git log
commit ff9152a5392a6f0f1f290cd8928d41b308b14fa9 (HEAD -> master)
Author: Lu Hao <luhao@tp-link.com.cn>
Date:   Wed Dec 8 11:46:29 2021 +0800

    third commit

commit bb6d6733199b5dbe16b4d5247999d6c9487cbd15
Author: Lu Hao <luhao@tp-link.com.cn>
Date:   Wed Dec 8 11:32:13 2021 +0800

    second commit

commit 1580971f914fe3bed9c1ccdb117387db3525cf63
Author: Lu Hao <luhao@tp-link.com.cn>
Date:   Wed Dec 8 11:29:42 2021 +0800

    first commit

HEAD reference is a symbolic reference, that is, a pointer to other references. HEAD generally points to the current branch. It plays a key role in indicating the latest submitted SHA-1 value when performing other operations such as creating a new branch, switching branches, submitting, etc.

Tag reference is a branch reference that never moves. It always points to the same submission object, which is called * * tag object**

The remote reference is located at git/object/origin directory stores remote branches, labels and other information. It is a read-only reference. The HEAD reference can never point to the remote reference. Therefore, the remote reference can never be updated with commit, but can only be push ed to the remote by updating the local reference.

4. Git package

As mentioned earlier, every time a new object is created, it will be created in the A series of files are generated in the git/objects directory, and Git is a "snapshot based" version control system. Each submission will generate a complete snapshot. As time goes by, if no operation is done, the volume of GIT database will grow faster and faster. Therefore, GIT has designed a compression and packaging mechanism to prevent too many objects.

Before compression, the format used by Git to initially store objects to disk is called "loose" object format, and Git will package these loose objects into Git packages through compression and packaging to realize incremental storage.

git gc is such a command, which will be automatically triggered when executing pull and other commands. Before execution, check the files in the objects Directory:

$ find .git/objects/ -type f
.git/objects/15/80971f914fe3bed9c1ccdb117387db3525cf63
.git/objects/1f/7a7a472abf3dd9643fd615f6da379c4acb3e3a
.git/objects/29/38b4de55b3da15112c00deadf244dd6d3ef073
.git/objects/83/baae61804e65cc73a7201a7252750c76066a30
.git/objects/88/21d4c684e63569c0bf448affd5faef50919338
.git/objects/a3/7e53c0b0e7c2520d1d594f9b5246ee148de814
.git/objects/bb/6d6733199b5dbe16b4d5247999d6c9487cbd15
.git/objects/cd/d0e09fa699af925dfb6747986a60b69c1a0c05
.git/objects/fa/49b077972391ad58037050f2a75f74e3671e92
.git/objects/ff/9152a5392a6f0f1f290cd8928d41b308b14fa9

Then execute git gc: you will find that many object s have disappeared and replaced by a pair of files in the pack (idx and pack). Look at the data in the package with verify pack and find that all the three submissions we made earlier are packaged.

$ git gc
Enumerating objects: 9, done.
Counting objects: 100% (9/9), done.
Delta compression using up to 6 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (9/9), done.
Total 9 (delta 0), reused 5 (delta 0), pack-reused 0

$ find .git/objects/ -type f
.git/objects/29/38b4de55b3da15112c00deadf244dd6d3ef073
.git/objects/info/commit-graph
.git/objects/info/packs
.git/objects/pack/pack-dcaadb1de954542edabb220467ede8e288a9461e.idx
.git/objects/pack/pack-dcaadb1de954542edabb220467ede8e288a9461e.pack

$ git verify-pack -v .git/objects/pack/pack-dcaadb1de954542edabb220467ede8e288a9461e.idx
ff9152a5392a6f0f1f290cd8928d41b308b14fa9 commit 219 152 12
bb6d6733199b5dbe16b4d5247999d6c9487cbd15 commit 220 152 164
1580971f914fe3bed9c1ccdb117387db3525cf63 commit 171 123 316
83baae61804e65cc73a7201a7252750c76066a30 blob   10 19 439
fa49b077972391ad58037050f2a75f74e3671e92 blob   9 18 458
1f7a7a472abf3dd9643fd615f6da379c4acb3e3a blob   10 19 476
cdd0e09fa699af925dfb6747986a60b69c1a0c05 tree   105 105 495
a37e53c0b0e7c2520d1d594f9b5246ee148de814 tree   40 50 600
8821d4c684e63569c0bf448affd5faef50919338 tree   75 77 650
non delta: 9 objects
.git/objects/pack/pack-dcaadb1de954542edabb220467ede8e288a9461e.pack: ok

However, it should be noted that git gc will not package dangling data. For example, if SHA-1 of the submitted object is not stored in the reference, the submitted object will not be packaged during packaging, as follows:

$ git gc
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Writing objects: 100% (5/5), done.
Total 5 (delta 0), reused 0 (delta 0), pack-reused 0

$ find .git/objects/ -type f
.git/objects/15/80971f914fe3bed9c1ccdb117387db3525cf63
.git/objects/29/38b4de55b3da15112c00deadf244dd6d3ef073
.git/objects/88/21d4c684e63569c0bf448affd5faef50919338
.git/objects/bb/6d6733199b5dbe16b4d5247999d6c9487cbd15
.git/objects/ff/9152a5392a6f0f1f290cd8928d41b308b14fa9
.git/objects/info/packs
.git/objects/pack/pack-44648cd90240ab687547eee0989f2dde2aee9a36.idx
.git/objects/pack/pack-44648cd90240ab687547eee0989f2dde2aee9a36.pack

$ git verify-pack -v .git/objects/pack/pack-44648cd90240ab687547eee0989f2dde2aee9a36.idx
83baae61804e65cc73a7201a7252750c76066a30 blob   10 19 12
fa49b077972391ad58037050f2a75f74e3671e92 blob   9 18 31
1f7a7a472abf3dd9643fd615f6da379c4acb3e3a blob   10 19 49
cdd0e09fa699af925dfb6747986a60b69c1a0c05 tree   105 105 68
a37e53c0b0e7c2520d1d594f9b5246ee148de814 tree   40 50 173
non delta: 5 objects
.git/objects/pack/pack-44648cd90240ab687547eee0989f2dde2aee9a36.pack: ok

Through the gc mechanism, Git also turns files into incremental storage. Every time you submit an update or save the project status, Git will create a snapshot of all the files at that time and save the index of the snapshot. If the file is not modified, Git will no longer re store the file, but only keep a link to the previously stored file. After gc is executed, Git will recognize the change of the response file according to the file name and file size, and save the difference.

5. Git transport protocol

Git can transfer data between version libraries in two main ways: dumb protocol and smart protocol

5.1. Agreement

In the transmission process, the server does not need to have Git specific code; The fetching process is a series of HTTP GET requests. In this case, the client can infer the layout of the server Git warehouse.

The git warehouse distribution is inferred through a series of operations such as pulling info/refs, obtaining HEAD, reading tree objects, etc. this method is almost no longer used.

5.2 intelligent protocol

Intelligent protocol is a more commonly used method of transmitting data. It needs to run a process on the server to read local data, understand what the client has and needs, and generate an appropriate package file for it. A total of two groups of processes are used to transmit data, which are responsible for uploading and downloading data respectively.

5.2. 1. Upload data

The two processes started by uploading data are send pack and receive pack. The send pack process running on the client is connected to the receive pack process running on the remote end.

When executing the git push origin master command, the GIT receive pack command will immediately send a line of response for each reference it has. When the client receives the response, it will know the status of the server. Your send pack process will judge which submission records it has but the server does not have, Send pack will inform receive pack of the references that will be updated in this push.

5.2. 2. Download data

The two processes started by downloading data are fetch pack and upload pack. The client starts the fetch pack process and connects to the remote upload pack process to negotiate the subsequent transmitted data.

The fetch pack process looks at the objects it owns and responds to "want" and the SHA-1 value of the objects it needs. It also sends "have" and SHA-1 values for all objects it already owns. At the end of the list, it also sends "done" to inform the upload pack process that it can start sending the package file of the objects it needs.

Keywords: git github

Added by Lirpa on Sat, 01 Jan 2022 13:49:08 +0200