In-Depth Analysis of Git Internals: Maintenance and Data Recovery

1, Maintenance

Occasionally, Git automatically runs a command called "auto gc". Most of the time, this command does nothing. However, if there are too many loose objects (objects not stored in a packfile) or too many packfiles, Git launches a full git gc command. "gc" stands for garbage collection, and the command does several things: it gathers all the loose objects and places them in packfiles, it consolidates multiple packfiles into one big packfile, and it removes stale objects that are not reachable from any commit.
You can run auto gc manually as follows:

$ git gc --auto

As mentioned above, this command usually does nothing. Git starts a real gc only when there are roughly 7,000 loose objects or more than 50 packfiles. You can change these limits with the gc.auto and gc.autoPackLimit config settings, respectively.
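As a minimal sketch of adjusting these two settings, the snippet below works in a throwaway repository created with mktemp so that no real repository is touched. The values 1000 and 20 are arbitrary examples chosen for illustration, not recommendations.

```shell
# Throwaway repository so that no real repository is modified.
repo=$(mktemp -d)
git init -q "$repo"

# If nothing is configured, Git falls back to its built-in defaults.
git -C "$repo" config --get gc.auto || echo "gc.auto is not set explicitly"

# Lower both thresholds for this repository only (example values).
git -C "$repo" config gc.auto 1000
git -C "$repo" config gc.autoPackLimit 20

auto_limit=$(git -C "$repo" config --get gc.auto)
pack_limit=$(git -C "$repo" config --get gc.autoPackLimit)
echo "gc.auto=$auto_limit gc.autoPackLimit=$pack_limit"

# Show how many loose objects the repository currently holds.
git -C "$repo" count-objects
```

Setting gc.auto to 0 disables automatic garbage collection altogether.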
The other thing gc does is pack your references into a single file. Suppose the repository contains the following branches and tags:

$ find .git/refs -type f
.git/refs/heads/experiment
.git/refs/heads/master
.git/refs/tags/v1.0
.git/refs/tags/v1.1

If you run git gc, these files no longer exist in the refs directory. For efficiency, Git moves them into a file named .git/packed-refs, which looks like this:

$ cat .git/packed-refs
# pack-refs with: peeled fully-peeled
cac0cab538b970a37ea1e769cbbde608743bc96d refs/heads/experiment
ab1afef80fac8e34258ff41fc1b867c702daa24b refs/heads/master
cac0cab538b970a37ea1e769cbbde608743bc96d refs/tags/v1.0
9585191f37f7b0fb9444f35a9bf50de191beadc2 refs/tags/v1.1
^1a410efbd13591db07496601ebc7a059dd55cfe9

If you update a reference, Git does not edit this file; instead, it writes a new file under refs/heads. To get the SHA-1 for a given reference, Git first looks for that reference in the refs directory and then falls back to the packed-refs file. So if you cannot find a reference in the refs directory, it is probably in packed-refs.
Note the last line of the file, which begins with ^. It means that the tag on the line directly above is an annotated tag, and that the line with ^ is the commit the annotated tag points to.
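The lookup order described above can be observed in a scratch repository. This is a sketch that assumes the default "files" ref storage (a repository using a different ref backend has no packed-refs file); the identity variables, file name and tag message are placeholders. It uses git pack-refs --all to pack the refs directly, without running a full gc.

```shell
# Placeholder identity so commits and tags work in a clean environment.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master

echo one > file.txt
git add file.txt
git commit -q -m "first commit"
git tag -a v1.1 -m "annotated tag"
commit_sha=$(git rev-parse HEAD)

# Pack all refs into .git/packed-refs and drop the loose files.
git pack-refs --all

if [ -f .git/refs/heads/master ]; then loose_after_pack=yes; else loose_after_pack=no; fi

# The name still resolves, now through packed-refs.
resolved=$(git rev-parse refs/heads/master)

# The annotated tag is followed by a "^<commit>" line (its peeled value).
if grep -q "^\^$commit_sha" .git/packed-refs; then peeled=yes; else peeled=no; fi

# Updating the branch writes a loose file again; packed-refs is not edited.
echo two > file.txt
git commit -q -am "second commit"
if [ -f .git/refs/heads/master ]; then loose_after_commit=yes; else loose_after_commit=no; fi

echo "loose after pack: $loose_after_pack, peeled line: $peeled, loose after commit: $loose_after_commit"
```

Because the loose file is consulted first, the stale master entry left in packed-refs after the second commit is simply shadowed.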

2, Data Recovery

At some point while using Git, you may accidentally lose a commit. Typically, this happens because you force-delete a branch that had work on it and it turns out you wanted the branch after all, or you hard-reset a branch and thereby abandon commits you still needed. If this happens, how can you get the commits back?
The following example hard-resets the master branch of a test repository to an older commit and then recovers the lost commits. First, let's see where the repository stands:

$ git log --pretty=oneline
ab1afef80fac8e34258ff41fc1b867c702daa24b modified repo a bit
484a59275031909e19aadb7c92262719cfcdf19a added repo.rb
1a410efbd13591db07496601ebc7a059dd55cfe9 third commit
cac0cab538b970a37ea1e769cbbde608743bc96d second commit
fdf4fc3344e67ab068f836878b6c4951e3b15f3d first commit

Now hard reset the master branch to the third commit:

$ git reset --hard 1a410efbd13591db07496601ebc7a059dd55cfe9
HEAD is now at 1a410ef third commit
$ git log --pretty=oneline
1a410efbd13591db07496601ebc7a059dd55cfe9 third commit
cac0cab538b970a37ea1e769cbbde608743bc96d second commit
fdf4fc3344e67ab068f836878b6c4951e3b15f3d first commit

The top two commits are now effectively lost: no branch can reach them. You need to find the SHA-1 of the latest commit and then add a branch that points to it. The trick is finding that SHA-1, since you probably have not memorized it.
Often, the quickest way is the git reflog tool. As you work, Git silently records what HEAD points to every time it changes. Each time you commit or switch branches, the reflog is updated. The reflog is also updated by the git update-ref command, which is another reason to use that command, as discussed in the article on Git references, rather than writing the SHA-1 value directly into the ref files. You can see where you have been at any time by running git reflog:

$ git reflog
1a410ef HEAD@{0}: reset: moving to 1a410ef
ab1afef HEAD@{1}: commit: modified repo.rb a bit
484a592 HEAD@{2}: commit: added repo.rb

Here we can see the two commits that we had checked out, but there is not much information. To see the same information in a more useful form, run git log -g, which prints the reflog in the normal log format:

$ git log -g
commit 1a410efbd13591db07496601ebc7a059dd55cfe9
Reflog: HEAD@{0} (Scott Chacon <schacon@gmail.com>)
Reflog message: updating HEAD
Author: Scott Chacon <schacon@gmail.com>
Date:   Fri May 22 18:22:37 2009 -0700

    third commit

commit ab1afef80fac8e34258ff41fc1b867c702daa24b
Reflog: HEAD@{1} (Scott Chacon <schacon@gmail.com>)
Reflog message: updating HEAD
Author: Scott Chacon <schacon@gmail.com>
Date:   Fri May 22 18:15:24 2009 -0700

    modified repo.rb a bit

It looks like the second entry is the commit you lost, so you can recover it by creating a new branch at that commit. For example, you can start a branch named recover-branch at that commit (ab1afef):

$ git branch recover-branch ab1afef
$ git log --pretty=oneline recover-branch
ab1afef80fac8e34258ff41fc1b867c702daa24b modified repo a bit
484a59275031909e19aadb7c92262719cfcdf19a added repo.rb
1a410efbd13591db07496601ebc7a059dd55cfe9 third commit
cac0cab538b970a37ea1e769cbbde608743bc96d second commit
fdf4fc3344e67ab068f836878b6c4951e3b15f3d first commit
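
The same recovery can be replayed end to end in a scratch repository. The sketch below is only an illustration: the commit messages, the file name and the identity variables are placeholders, and it relies on the reflog syntax HEAD@{1}, which names the value HEAD had one move ago.

```shell
# Placeholder identity so commits work in a clean environment.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master

for n in first second third fourth; do
  echo "$n" > file.txt
  git add file.txt
  git commit -q -m "$n commit"
done
lost_tip=$(git rev-parse HEAD)

# Throw away the newest commit, as in the hard reset above.
git reset -q --hard HEAD~1

# HEAD@{0} is the reset itself; HEAD@{1} is where HEAD was just before it.
candidate=$(git rev-parse 'HEAD@{1}')

# Point a new branch at the lost commit to make it reachable again.
git branch recover-branch "$candidate"
recovered=$(git rev-parse recover-branch)
git log --pretty=oneline recover-branch
```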

Now there is a branch named recover-branch that points to where the master branch used to be, which makes the top two commits reachable again. Next, suppose the lost commits were not in the reflog for some reason. You can simulate that by removing recover-branch and deleting the reflog, after which the top two commits are not reachable from anything:

$ git branch -D recover-branch
$ rm -Rf .git/logs/

Because the reflog data is kept in the .git/logs/ directory, you now effectively have no reflog. How can you recover the commit at this point? One way is to use the git fsck utility, which checks the database for integrity. If you run it with the --full option, it shows all objects that are not pointed to by another object:

$ git fsck --full
Checking object directories: 100% (256/256), done.
Checking objects: 100% (18/18), done.
dangling blob d670460b4b4aece5915caf5c68d12f560a9fe3e4
dangling commit ab1afef80fac8e34258ff41fc1b867c702daa24b
dangling tree aea790b9a58f6cf6f2804eeac9f0abbe9631e4c9
dangling blob 7108f7ecb345ee9d0084193f147cdad4d2998293

In this case, the missing commit appears after the string "dangling commit". You can recover it the same way as before, by adding a branch that points to that SHA-1.
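
A scripted version of this fsck-based recovery might look like the following sketch, run in a scratch repository with placeholder names. One detail is an addition to the walkthrough above: git reset also records the old tip in .git/ORIG_HEAD, so the sketch deletes that file as well to make sure nothing at all points at the lost commit.

```shell
# Placeholder identity so commits work in a clean environment.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master

for n in first second third fourth; do
  echo "$n" > file.txt
  git add file.txt
  git commit -q -m "$n commit"
done
lost_tip=$(git rev-parse HEAD)

# Lose the newest commit, then erase every remaining pointer to it.
git reset -q --hard HEAD~1
rm -rf .git/logs        # the reflog
rm -f .git/ORIG_HEAD    # written by git reset; also names the old tip

# fsck prints lines such as "dangling commit <sha>"; keep only the SHA-1.
dangling=$(git fsck --full 2>/dev/null \
  | awk '$1 == "dangling" && $2 == "commit" { print $3 }')

git branch recover-branch "$dangling"
recovered=$(git rev-parse recover-branch)
echo "recovered $recovered"
```

In a repository with several dangling commits, the awk filter returns more than one SHA-1, and each candidate has to be inspected (for example with git show) before a branch is created.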

3, Removing Objects

Git has many great features, but one that can cause problems is that git clone downloads the entire history of the project, including every version of every file. This is fine if everything is source code, because Git is highly optimized to compress that data efficiently. However, if someone at any point in the history of the project added a single huge file, every clone is forced to download that large file, even if it was later removed from the project. Because the file is reachable from the history, it will always be there.
This can be a serious problem when you convert Subversion or Perforce repositories to Git. Because those systems do not download the whole history, adding a huge file has few consequences there. If you imported from another system and find that your repository is much larger than it should be, you need to find and remove these large objects.
Warning: this technique is destructive to your commit history. It rewrites every commit object, starting from the earliest tree that has to be modified to remove the reference to the large file. If you do this immediately after an import, before anyone has started to base work on those commits, you are fine. Otherwise, you have to notify all contributors that they must rebase their work onto the new commits.
To demonstrate, we will add a large file to the test repository, remove it in the next commit, and then find it and remove it permanently from the repository. First, add a large object to the history:

$ curl https://www.kernel.org/pub/software/scm/git/git-2.1.0.tar.gz > git.tgz
$ git add git.tgz
$ git commit -m 'add git tarball'
[master 7b30847] add git tarball
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 git.tgz

It turns out that the project does not need this huge tarball. Remove it:

$ git rm git.tgz
rm 'git.tgz'
$ git commit -m 'oops - removed large tarball'
[master dadf725] oops - removed large tarball
 1 file changed, 0 insertions(+), 0 deletions(-)
 delete mode 100644 git.tgz

Now run gc on the database and see how much space it uses:

$ git gc
Counting objects: 17, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (13/13), done.
Writing objects: 100% (17/17), done.
Total 17 (delta 1), reused 10 (delta 0)

You can also run the count-objects command to quickly see how much space you are using:

$ git count-objects -v
count: 7
size: 32
in-pack: 17
packs: 1
size-pack: 4868
prune-packable: 0
garbage: 0
size-garbage: 0

The size-pack entry is the size of your packfiles in kilobytes, so the repository is using almost 5MB. Before the last commit, it used about 2KB. Clearly, removing the file in a later commit did not remove it from the history. Every time anyone clones this repository, they have to download all 5MB just to get this tiny project, because a large file was accidentally added. Let's get rid of it for good.
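To make the arithmetic explicit, here is a small sketch that converts the size-pack line to MiB. The helper name pack_size_mib is made up for this example, and the input is the sample count-objects output shown above rather than a live repository.

```shell
# Hypothetical helper: read `git count-objects -v` output on stdin and
# print the size-pack value (reported in KiB) converted to MiB.
pack_size_mib() {
  LC_ALL=C awk '/^size-pack:/ { printf "%.2f\n", $2 / 1024 }'
}

# Feed it the sample numbers from the text above.
sample_mib=$(printf 'count: 7\nsize: 32\nin-pack: 17\npacks: 1\nsize-pack: 4868\n' | pack_size_mib)
echo "size-pack is $sample_mib MiB"
```

Inside a real repository the same helper would be used as git count-objects -v | pack_size_mib. Alternatively, git count-objects -v -H prints the sizes in human-readable units directly.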
First you have to find it. In this case, you already know which file it is. But suppose you did not: how would you identify which file or files take up so much space? If you run git gc, all the objects are in a packfile, and you can identify the big ones by running the plumbing command git verify-pack and sorting on the third field of its output, which is the object size. You can also pipe the result through tail, because only the last few, largest objects are of interest:

$ git verify-pack -v .git/objects/pack/pack-29...69.idx \
  | sort -k 3 -n \
  | tail -3
dadf7258d699da2c8d89b09ef6670edb7d5f91b4 commit 229 159 12
033b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5 blob   22044 5792 4977696
82c99a3e86bb1267b236a4b6eff7868d97489af1 blob   4975916 4976258 1438

The big object is at the bottom: 5MB. To find out which file it is, use the rev-list command. If you pass --objects to rev-list, it lists all the commit SHA-1s as well as the blob SHA-1s with the file paths associated with them. You can use this to find the name of your blob:

$ git rev-list --objects --all | grep 82c99a3
82c99a3e86bb1267b236a4b6eff7868d97489af1 git.tgz
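
The two lookups can also be combined into a single pipeline. The sketch below swaps in a different technique from the verify-pack approach above: it feeds the rev-list output to git cat-file --batch-check, which reports each object's type and uncompressed size and echoes the path back through the %(rest) placeholder, so it works whether or not the objects have been packed. The scratch repository, the file big.bin and its 200000-byte size are assumptions made for the demonstration.

```shell
# Placeholder identity so commits work in a clean environment.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master

echo "hello" > README
git add README
git commit -q -m "first commit"

# Add a large file, then delete it again in the next commit.
head -c 200000 /dev/urandom > big.bin
git add big.bin
git commit -q -m "add big file"
git rm -q big.bin
git commit -q -m "remove big file"

# List every object with its path, annotate it with type and size,
# keep only blobs, and sort by size so the largest one comes last.
largest=$(git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
  | awk '$1 == "blob"' \
  | sort -k 3 -n \
  | tail -n 1)

largest_size=$(echo "$largest" | awk '{ print $3 }')
largest_path=$(echo "$largest" | cut -d' ' -f4-)
echo "largest blob: $largest_path ($largest_size bytes)"
```

The file is reported even though it no longer exists in the latest commit, which is exactly the situation described above.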

Now you need to remove this file from all trees in your past. You can easily see which commits modified this file:

$ git log --oneline --branches -- git.tgz
dadf725 oops - removed large tarball
7b30847 add git tarball

You must rewrite all the commits downstream from 7b30847 to fully remove this file from your Git history. To do so, use the filter-branch command:

$ git filter-branch --index-filter \
  'git rm --ignore-unmatch --cached git.tgz' -- 7b30847^..
Rewrite 7b30847d080183a1ab7d18fb202473b3096e9f34 (1/2)rm 'git.tgz'
Rewrite dadf7258d699da2c8d89b09ef6670edb7d5f91b4 (2/2)
Ref 'refs/heads/master' was rewritten

The --index-filter option is similar to the --tree-filter option covered in the article on rewriting commit history, except that instead of passing a command that modifies files checked out on disk, you modify the staging area (index) each time.
Rather than removing the file with something like rm file, you have to remove it with git rm --cached, because it must be removed from the index, not from disk. The reason to do it this way is speed: Git does not have to check out each revision to disk before running the filter, so the process can be much faster. You can accomplish the same task with --tree-filter if you want. The --ignore-unmatch option to git rm tells it not to report an error if the pattern you are trying to remove is not there. Finally, you ask filter-branch to rewrite the history only from the 7b30847 commit onward, because you know that is where the problem started. Otherwise, it would start from the oldest commit and take unnecessarily long.
Your history no longer contains a reference to that file. However, your reflog and a new set of refs that Git added under .git/refs/original when you ran filter-branch still do, so you have to remove them and then repack the database. You need to get rid of anything that holds a pointer to those old commits before you repack:

$ rm -Rf .git/refs/original
$ rm -Rf .git/logs/
$ git gc
Counting objects: 15, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (11/11), done.
Writing objects: 100% (15/15), done.
Total 15 (delta 1), reused 12 (delta 0)

Let's see how much space is saved:

$ git count-objects -v
count: 11
size: 4904
in-pack: 15
packs: 1
size-pack: 8
prune-packable: 0
garbage: 0
size-garbage: 0

The packed repository size is down to 8KB, which is much better than 5MB. The size value shows that the big object is still among your loose objects, so it is not gone; but it will not be transferred on a push or a subsequent clone, which is what matters. If you really want to, you can remove the object completely by running git prune with the --expire option:

$ git prune --expire now
$ git count-objects -v
count: 0
size: 0
in-pack: 15
packs: 1
size-pack: 8
prune-packable: 0
garbage: 0
size-garbage: 0
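
For reference, the whole procedure in this section can be replayed in a scratch repository with the sketch below. It deviates from the walkthrough in a few places, all of which are choices made for this example: instead of deleting .git/refs/original and .git/logs/ by hand, it removes the backup refs with git update-ref and empties the reflog with git reflog expire; it runs git gc --prune=now instead of a separate git prune; and it sets FILTER_BRANCH_SQUELCH_WARNING=1 because recent Git versions otherwise print a warning and pause before filter-branch starts. The file big.bin and the identity variables are placeholders.

```shell
# Placeholder identity so commits work in a clean environment.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
# Skip the warning and delay that newer Git prints for filter-branch.
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master

echo "hello" > README
git add README
git commit -q -m "first commit"

# Accidentally commit a large file, then remove it in the next commit.
head -c 200000 /dev/urandom > big.bin
git add big.bin
git commit -q -m "add big file"
add_sha=$(git rev-parse HEAD)
big_blob=$(git rev-parse HEAD:big.bin)
git rm -q big.bin
git commit -q -m "oops - removed big file"

git gc -q
before=$(git count-objects -v | awk '/^size-pack:/ { print $2 }')

# Rewrite every commit from the one that introduced the file onward.
git filter-branch --index-filter \
  'git rm --ignore-unmatch --cached -q big.bin' -- "$add_sha^.." >/dev/null 2>&1

# Remove everything that still points at the old commits, then repack.
git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin
git reflog expire --expire=now --all
git gc -q --prune=now

after=$(git count-objects -v | awk '/^size-pack:/ { print $2 }')
if git cat-file -e "$big_blob" 2>/dev/null; then blob_gone=no; else blob_gone=yes; fi
echo "size-pack before: $before KiB, after: $after KiB, blob gone: $blob_gone"
```

The git filter-branch documentation itself now recommends the separately installed git filter-repo tool for this kind of history rewrite; filter-branch is used here only to stay close to the walkthrough above.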

Added by Viper76 on Mon, 27 Sep 2021 20:32:50 +0300
