[Python] De-duplicating and filtering files with MD5

Here are some code notes.

  • This time I was helping a friend recover a hard disk. The scan found many partitions, and after exporting the data from them I found that many files were duplicates. So I decided to use Python to de-duplicate them.
  • First, put the picture files from all partitions into a folder A. If there is a naming conflict, choose "Let me decide for each file" in the Windows 10 dialog and tick both checkboxes. Windows 10 then adds a suffix such as (1) to the conflicting files (which is why the de-duplication discards the copy with the longer file name). The duplicates can then be removed easily with the de-duplication code below.
  • Because my friend's hard disk also holds many newer pictures, the second script removes from folder A the files that already exist on that hard disk.
  • After writing the second script, I thought of a better approach and of a way to show a progress bar, so I wrote one more script for verification.

Duplicate removal

De-duplicate by MD5 and keep the copy with the shorter file name (mainly to drop the trailing (1) marker). To be safe, the duplicates are only moved rather than deleted, so nothing is lost if the code has a bug. This is strongly recommended.

import os
import hashlib
import shutil

hash_dict = {}

def get_md5(file_name):
    with open(file_name, "rb") as f:
        r = f.read()
        m = hashlib.md5()
        m.update(r)
        return m.hexdigest()

A_file_list = os.listdir("A")
os.makedirs("B", exist_ok=True)  # duplicates are moved here rather than deleted

t_num = len(A_file_list)
print(f"Total number of files: {t_num}")
cnt = 0
for file in A_file_list:
    md5_str = get_md5('A/' + file)
    if md5_str in hash_dict:  # Duplicate file
        print(f"duplicate: {hash_dict[md5_str]} | {file}")
        if len(file) < len(hash_dict[md5_str]):
            # This file has the shorter name: move the earlier copy out and keep this one
            shutil.move('A/' + hash_dict[md5_str], 'B/' + hash_dict[md5_str])
            # os.remove('A/' + hash_dict[md5_str])
            hash_dict[md5_str] = file
        else:
            shutil.move('A/' + file, 'B/' + file)
            # os.remove('A/' + file)
    else:
        hash_dict[md5_str] = file
    cnt += 1
    print(f"{cnt}/{t_num}")

print("done.")
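
The get_md5 function above reads each file into memory in one go, which is fine for pictures but could become a problem with very large files. As a rough sketch (not part of the original script), the hash can also be fed in chunks. The name get_md5_chunked and the 1 MiB default chunk size are my own assumptions.

```python
import hashlib


def get_md5_chunked(file_name, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading chunk_size bytes at a time."""
    m = hashlib.md5()
    with open(file_name, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            m.update(chunk)
    return m.hexdigest()
```

The result is the same digest as hashing the whole file at once, so it could be swapped in for get_md5 without changing the rest of the script.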

Recursive file search, comparison, and filtering

Building on the previous step, the following code first computes the MD5 of every file in folder A, then walks the files on disk F and computes their MD5 as well. If a file on disk F matches one in folder A, the copy in folder A is moved to folder B.

import os
import hashlib
import shutil

hash_dict = {}
f_file_cnt = 0

def get_md5(file_name):
    with open(file_name, "rb") as f:
        r = f.read()
        m = hashlib.md5()
        m.update(r)
        return m.hexdigest()
        
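def fun(path):
    # Recursively hash every file under path; files in folder A with a matching MD5 are moved to folder B
    global f_file_cnt
    try:
        os.chdir(path)
    except OSError:  # the directory cannot be entered (for example, access denied)
        return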
    file_list = os.listdir()
    for file in file_list:
        if os.path.isdir(file):
            fun(file)
        elif os.path.isfile(file):
            if os.path.getsize(file) > 73056832:
                continue
            md5_str = get_md5(file)
            # with open("D://F.md5", 'a') as f:
                # f.write(md5_str+"\n")
            if md5_str in hash_dict:
                try:
                    shutil.move("D://Restore/A/" + hash_dict[md5_str], "D://Restore/B/" + hash_dict[md5_str])
                    print("D://Restore/A/" + hash_dict[md5_str])
                except Exception as e:
                    print(e)
            f_file_cnt += 1
            if (f_file_cnt % 50) == 0:
                print(f"{(f_file_cnt/25524)*100}% - {md5_str}")

    os.chdir("..")

A_file_list = os.listdir("A")

t_num = len(A_file_list)

print(f"Total number of files: {t_num}")

cnt = 0

# Load the MD5 of every file in folder A; files on disk F are then checked against these hashes to decide which files to move out of folder A
print("Loading MD5 of folder A")
for file in A_file_list:
    md5_str = get_md5('A/' + file)
    hash_dict[md5_str] = file
    cnt += 1
    if (cnt % 50) == 0:
        print(f"{(cnt/t_num)*100}%")
print("Folder A MD5 loaded\nStarting traversal of disk F")


fun("F:/")

print (f_file_cnt)
print ("done.")
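
The commented-out lines in fun hint at exporting the MD5 of every file on disk F to a text file. As a rough sketch of that export step on its own, the version below uses os.walk instead of os.chdir and recursion, which is a different technique from the script above. The names export_md5_list and max_size are my own assumptions, and the output file should live outside the folder being walked so that it is not hashed itself.

```python
import hashlib
import os


def get_md5(file_name):
    with open(file_name, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def export_md5_list(root, out_path, max_size=None):
    """Write one MD5 per line for every file under root; return the number of files hashed."""
    count = 0
    with open(out_path, "w") as out:
        # os.walk visits every sub-directory without changing the working directory
        for dir_path, _dir_names, file_names in os.walk(root):
            for name in file_names:
                full_path = os.path.join(dir_path, name)
                try:
                    if max_size is not None and os.path.getsize(full_path) > max_size:
                        continue  # skip files above the size limit
                    out.write(get_md5(full_path) + "\n")
                except OSError:
                    continue  # skip files that cannot be read
                count += 1
    return count
```

With the paths and size limit from the script above, the call would be something like export_md5_list("F:/", "D://F.md5", max_size=73056832).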

Final inspection

By modifying the code above (see the commented-out lines), you can walk the files on disk F once and save their MD5 values to a list, then compare that list with the recovered files using the code below (which checks folder B and moves the matches to folder C). This saves a lot of time, because the external hard disk is very slow. If I had done this earlier, I would have filtered out the necessary files sooner...

import os
import hashlib
import shutil
from tqdm import tqdm

f_hash_set = set()

def get_md5(file_name):
    with open(file_name, "rb") as f:
        r = f.read()
        m = hashlib.md5()
        m.update(r)
        return m.hexdigest()


# Load the MD5 list exported from disk F; a set makes each lookup fast
with open("D://F.md5") as f:
    f_hash_set = {line.strip() for line in f}

file_list = os.listdir("B")
os.makedirs("C", exist_ok=True)  # verified files are moved here

pbar = tqdm(total=len(file_list))

for file in file_list:
    md5_str = get_md5("B/"+file)
    if md5_str in f_hash_set:
        try:
            shutil.move("B/"+file, "C/"+file)
        except Exception as e:
            print (e)
    pbar.update(1)

pbar.close()
print("done.")

The tqdm library is also used here to show a progress bar, and I think it works very well. However, you can't print anything else while it runs, otherwise multiple progress bars appear. I don't yet know how to solve this, but since console output seriously slows the program down anyway, it doesn't matter if nothing else is printed.
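
One possible way around the multiple-progress-bar issue is tqdm's own tqdm.write, which is documented as printing a message without overlapping the bars. The sketch below is only an illustration under that assumption. The function check_names and its parameters are hypothetical names and are not part of the script above.

```python
from tqdm import tqdm


def check_names(names, known_names):
    """Return the names that appear in known_names, reporting each hit as it is found."""
    hits = []
    for name in tqdm(names):
        if name in known_names:
            # tqdm.write prints the message without overlapping the progress bar
            tqdm.write(f"match: {name}")
            hits.append(name)
    return hits
```

In the verification script, replacing print(e) with tqdm.write(str(e)) should have the same effect.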

Keywords: Python

Added by fatywombat on Thu, 27 Jan 2022 13:38:52 +0200