Use Python to sort documents by specific keywords contained in file names

Ramblings

As one of the first people in China to come into contact with the Internet, I have countless very old pictures on my hard disk, many of them downloaded from forums that are not very popular.
As we all know, in phpwind v7.5, a widely used Chinese forum system, each user id corresponds to a string of numbers, and uploaded pictures are named according to the forum section they were posted in, so the file name format is very regular.
However, sorting newly downloaded pictures according to the ids I already have is a difficulty.
I used to classify, search, move, copy and paste manually, one by one. That is very inefficient and error-prone. So my first motivation for learning Python was not to write crawlers, but to classify and archive the mass of pictures I had downloaded by hand according to the id of the poster.
For example, I already have a poster whose id is jw11. The file names of all his posts are in the following format:

82_25158_03405c1ea40bbd4.jpg
82_25158_03448e17837f098.jpg
82_25158_03806c963523233.jpg
82_25158_0392c5a9be503f8.jpg
82_25158_053b6be42c4002c.jpg
82_25158_059c497f183e2c1.jpg
82_25158_063d1c887f2b3f8.jpg
82_25158_0648bdd9358a3a1.jpg

So I already know that 25158 is jw11. Similarly, the pile of pictures I downloaded carries various ids, corresponding to the various posters.
Before I learned Python, I would create a new folder called jw11, search for the keyword 25158 in the download folder, select all, cut, switch to the jw11 folder, and paste. This is very inefficient.

Computers exist to free us from repetitive work. So here is the question:

How can Python process these files and automatically move each one into the corresponding folder, according to the id number contained in its file name?

It's a pity that although CSDN, Jianshu and Zhihu have countless articles teaching you how to play with Python, from introductory to advanced, this and that, my need is apparently so unusual that nobody on earth would teach me how to do it.

What could I do? I had to start from scratch, learn little by little, then summarize and experiment by myself. After countless failures, one day it succeeded!
I could hardly believe it, and it is really fast.
I am writing it down here for the benefit of whoever is destined to see this article:

Main text

First, the code

import os
import shutil

'''
Put each author's folder in the id folder and make sure every one of
them contains at least one photo from that author.
Put all the other photos to be classified in the img folder.

The script then moves every photo into the matching author folder.
'''
folder_id = r'D:\TRY\id'
folder_img = r'D:\TRY\img'

id_raw = list(os.walk(folder_id))
id_files = id_raw[1:]

auid = dict()

for n in id_files:
    author = n[0]
    aid = n[2][0][3:9]
    auid.update({aid: author})

img_files = os.listdir(folder_img)
for i in img_files:
    if i[3:9] in auid.keys():
        orin = os.path.join(folder_img, i)
        dest = os.path.join(auid[i[3:9]], i)
        shutil.move(orin, dest)

Code walkthrough

If you just want to use the code without understanding it in depth, copy it and close this article. I don't care about copyright or whether you like it, but there is one situation I can't bear:
That is, you credit me as the source of this code and this article, but you change it beyond recognition, dropping a bit here and a bit there, so that other readers can neither use it nor understand it, and in the end they think my level is low. That is not good. I care about my reputation. Remember that.

First:

import os
import shutil

To let Python move files and work with folders, two libraries, os and shutil, need to be imported. Both are part of Python's standard library, so there is nothing to install separately; just import them.

Next you need to declare

folder_id = r'D:\TRY\id'
folder_img = r'D:\TRY\img'

where your files come from and where they go.
The two folders specified here are id and img.

id is the folder that holds one subfolder per author, named after the author's id name. For example, jw11's folder sits inside this id folder.
In addition, to give the program a reference so that it knows jw11 is 25158, at least one picture belonging to jw11 must be placed in the jw11 folder, that is, any picture named like 82_25158_03405c1ea40bbd4.jpg, as long as the name contains 25158.

The img folder is much simpler. Throw all pictures of the same naming format into it. After the program runs, every picture whose id has a matching folder will have been moved out of this folder. Whatever remains is what needs your attention.
Generally, as long as the preparation is done properly, this folder is empty after the program runs.

The r in front of the quotation marks is there because I am too lazy to write double backslashes. With the r prefix, Python treats the backslashes inside the quotation marks literally. Without it, every \ would have to be escaped as \\, which is unsightly and unintuitive.
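As a small illustration of the r prefix, the sketch below writes the same path twice, once as a raw string and once with escaped backslashes, and compares the two. The variable names raw_path and escaped_path are made up for this example.

```python
# Two spellings of the same Windows path.
raw_path = r'D:\TRY\id'        # raw string: backslashes are taken literally
escaped_path = 'D:\\TRY\\id'   # ordinary string: each backslash is doubled

# Both literals produce exactly the same string object contents.
print(raw_path == escaped_path)
# Each backslash counts as a single character in the resulting string.
print(len(raw_path), len(escaped_path))
```

Either spelling works; the raw string is simply easier to read when a path contains many backslashes.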

id_raw = list(os.walk(folder_id))
id_files = id_raw[1:]

The purpose of this snippet is to use the os.walk() function to traverse the id directory and generate a list describing each folder and the file names in it.
The slicing operation [1:] that follows drops the first entry, which describes the id folder itself, and keeps one entry per author folder together with the file names inside it.

For example, if the id_raw obtained by direct traversal is printed with the print function, it looks like this:

[('D:\\TRY\\id', ['jw11', 'coshi', 'liuxiao', 'qqqsssiw', 'Jiang Xiao', 'Wolf snow', 'Lily', 'Know a leaf', 'Elephant man shot', 'Tieman'], []), ('D:\\TRY\\id\\jw11', [], ['88_25158_e7755e644d79483.jpg']), ('D:\\TRY\\id\\coshi', [], ['88_293501_3d3ebf722eb7943.jpg']), ('D:\\TRY\\id\\liuxiao', [], ['88_278271_a114aedde9cf86a.png']), ('D:\\TRY\\id\\qqqsssiw', [], ['88_130878_6894267b8ac7571.jpg']), ('D:\\TRY\\id\\Jiang Xiao', [], ['88_226990_c882fbb9f4e62a4.jpg']), ('D:\\TRY\\id\\Wolf snow', [], ['88_246203_c07bb391177f4d3.jpg']), ('D:\\TRY\\id\\Lily', [], ['88_65202_d9fa5aad0397e5f.jpg']), ('D:\\TRY\\id\\Know a leaf', [], ['88_217146_4e4e62e48f19b09.jpg']), ('D:\\TRY\\id\\Elephant man shot', [], ['88_293734_ebc4a29a509bdc2.jpg']), ('D:\\TRY\\id\\Tieman', [], ['88_16661_237b863406986e0.jpg'])]

For those who have just come into contact with Python: what on earth is this? But friends, this is a list containing the author ids we want and the file names that correspond to them!
Next, let's apply the slice id_files = id_raw[1:] and have a look at the result:

[('D:\\TRY\\id\\jw11', [], ['88_25158_e7755e644d79483.jpg']), ('D:\\TRY\\id\\coshi', [], ['88_293501_3d3ebf722eb7943.jpg']), ('D:\\TRY\\id\\liuxiao', [], ['88_278271_a114aedde9cf86a.png']), ('D:\\TRY\\id\\qqqsssiw', [], ['88_130878_6894267b8ac7571.jpg']), ('D:\\TRY\\id\\Jiang Xiao', [], ['88_226990_c882fbb9f4e62a4.jpg']), ('D:\\TRY\\id\\Wolf snow', [], ['88_246203_c07bb391177f4d3.jpg']), ('D:\\TRY\\id\\Lily', [], ['88_65202_d9fa5aad0397e5f.jpg']), ('D:\\TRY\\id\\Know a leaf', [], ['88_217146_4e4e62e48f19b09.jpg']), ('D:\\TRY\\id\\Elephant man shot', [], ['88_293734_ebc4a29a509bdc2.jpg']), ('D:\\TRY\\id\\Tieman', [], ['88_16661_237b863406986e0.jpg'])]

Look carefully. Do you notice anything? Yes: the first item in the list, the one containing all the folder names, ('D:\\TRY\\id', ['jw11', 'coshi', 'liuxiao', 'qqqsssiw', 'Jiang Xiao', 'Wolf snow', 'Lily', 'Know a leaf', 'Elephant man shot', 'Tieman'], []), has been discarded. That is the purpose of the slice.
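To see the shape of what os.walk() returns without touching real data, here is a minimal sketch that builds a throwaway folder tree in a temporary directory and walks it. The temporary directory stands in for D:\TRY\id; the two author folders and file names are borrowed from the listing above, and everything else is an assumption made for the demo.

```python
import os
import tempfile

# Build a small throwaway tree that mimics the id folder.
with tempfile.TemporaryDirectory() as root:
    samples = [('jw11', '88_25158_e7755e644d79483.jpg'),
               ('coshi', '88_293501_3d3ebf722eb7943.jpg')]
    for author, sample in samples:
        os.mkdir(os.path.join(root, author))
        # An empty file is enough: only the file name matters here.
        open(os.path.join(root, author, sample), 'w').close()

    id_raw = list(os.walk(root))
    id_files = id_raw[1:]

    # The first tuple describes the root itself: (path, subfolders, files).
    root_path, subfolders, root_files = id_raw[0]
    print(sorted(subfolders), root_files)

    # Every remaining tuple describes one author folder.
    for path, _, files in id_files:
        print(os.path.basename(path), files)
```

Note that the order in which os.walk() visits the author folders is not guaranteed to be alphabetical, which does not matter here because each folder is handled independently.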

img_files = os.listdir(folder_img)

The os.listdir() function is much simpler. All you get is a list of the names in the folder, similar to:

['82_25158_fe2f6e2c96aef1a.jpg', '82_25158_fe56c4415872950.jpg', '82_25158_fe5aac922ec1be2.jpg', '82_25158_febe6e5b82780e0.jpg', '82_25158_fec8f31d64ae59c.jpg', '82_25158_fee0c5cf16ceaa1.jpg', '82_25158_ff27a5397954489.jpg', '82_25158_ff2e52901689b3b.jpg', '82_25158_ff51fe758e511e8.jpg', '82_25158_ff58bb59e08aac0.jpg', '82_25158_ff7d27b587d5a2f.jpg', '82_25158_fff81d10d18eff9.jpg']
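One detail worth knowing: os.listdir() returns the names of subfolders as well as files. The sketch below, which uses a temporary directory and a hypothetical subfolder called leftovers, shows one way to keep only regular files. It is an optional refinement, not part of the original script.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder_img:
    # One picture (an empty placeholder file) and one subfolder.
    open(os.path.join(folder_img, '82_25158_fe2f6e2c96aef1a.jpg'), 'w').close()
    os.mkdir(os.path.join(folder_img, 'leftovers'))  # hypothetical subfolder

    # listdir returns every entry name, files and folders alike.
    names = os.listdir(folder_img)

    # Keep only the entries that are regular files.
    img_files = [n for n in names
                 if os.path.isfile(os.path.join(folder_img, n))]

    print(sorted(names))
    print(img_files)
```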

With these two lists, we can use Python's dict() function with a loop to construct a dictionary.
First construct an empty dictionary, then fill it in the loop:

auid = dict()
for n in id_files:
    author = n[0]
    aid = n[2][0][3:9]
    auid.update({aid: author})

The author here is the path of the folder. The format is as follows:

D:\TRY\id\jw11
D:\TRY\id\coshi
D:\TRY\id\liuxiao
D:\TRY\id\qqqsssiw
D:\TRY\id\Jiang Xiao
D:\TRY\id\Wolf snow

aid is obtained by slicing the first file name in each folder, and the format is as follows:

246203
65202_
254624
217146
293734
16661_

As for why the slice is [3:9], count the characters of an original file name:
82_25158_fff81d10d18eff9.jpg
0123456789012345678901234567

The second line is my counting, showing the last digit of each index. There are 28 characters, indexed 0 to 27.
[3:9] in Python means to start at index 3, that is, the 2 that begins 25158, and stop at index 9. Note that the end index is not included, so the slice actually ends between index 8 and index 9.
In other words, [3:9] takes the six characters at indexes 3, 4, 5, 6, 7 and 8, which here are 25158_: the id number followed by an underscore.

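Some readers may ask why the trailing _ is kept. Ids do not all have the same length: five-digit ids such as 25158 sit next to six-digit ids such as 293501. A five-character slice would cut off the last digit of every six-digit id and produce wrong matches, so the slice takes six characters, and a five-digit id simply picks up the underscore that follows it. Because the same slice is applied to the sample file and to the files being sorted, the extra underscore does no harm.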
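The fixed slice can be checked directly, and compared with an alternative that splits the name on underscores. The split-based helper below is an assumption added for illustration, not the article's method; it would also cope with ids of other lengths or a section prefix that is not two digits long. Both function names are made up.

```python
def id_by_slice(name):
    # The article's approach: six characters starting at index 3.
    return name[3:9]


def id_by_split(name):
    # Alternative sketch: take the text between the first and second
    # underscore, whatever its length. Returns None if the name does
    # not have the expected section_id_hash shape.
    parts = name.split('_')
    return parts[1] if len(parts) >= 3 else None


for name in ['82_25158_fff81d10d18eff9.jpg',
             '88_293501_3d3ebf722eb7943.jpg']:
    print(name, repr(id_by_slice(name)), repr(id_by_split(name)))
```

If the split-based version were used, it would have to replace the slice in both places, when building the dictionary and when testing each file, so that the keys still match.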

After explaining this paragraph, let's continue to look at the dictionary constructed by the loop:

auid.update({aid: author})

Because an empty dictionary was constructed above, the update() method is used here to fill it, turning the empty dictionary into a useful one with entries like 25158_: D:\TRY\id\jw11
The part before the colon is the key of the dictionary and the part after it is the value.
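One weakness of the loop is that n[2][0] raises an IndexError if an author folder contains no picture at all. The sketch below wraps the same logic in a hypothetical helper, build_id_map, that skips empty folders instead. It runs against a temporary directory, and the folder name empty_author is made up for the demo.

```python
import os
import tempfile


def build_id_map(folder_id):
    """Map the id slice of each author's first file to that author's folder."""
    auid = {}
    for path, _, files in list(os.walk(folder_id))[1:]:
        if not files:
            # No sample picture in this folder, so there is no id to read.
            continue
        auid[files[0][3:9]] = path
    return auid


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'jw11'))
    open(os.path.join(root, 'jw11', '88_25158_e7755e644d79483.jpg'), 'w').close()
    os.mkdir(os.path.join(root, 'empty_author'))  # hypothetical empty folder

    auid = build_id_map(root)
    # Show only the folder name instead of the full temporary path.
    print({key: os.path.basename(path) for key, path in auid.items()})
```

Assigning with auid[key] = value has the same effect as auid.update({key: value}) for a single entry.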

for i in img_files:
    if i[3:9] in auid.keys():
        orin = os.path.join(folder_img, i)
        dest = os.path.join(auid[i[3:9]], i)
        shutil.move(orin, dest)

Finally, the last and most exciting part: test each file and move it.
Loop again, using i as a temporary variable, and test each name in img_files to see whether its id slice is in the dictionary built in the previous step:

for i in img_files:
    if i[3:9] in auid.keys():

If so, use the shutil.move function to move the file from the orin path to the dest path.
Here the os.path.join function from the os library combines the folder path and the file name.
After all, to move a file you have to provide a complete path, similar to:
move
D:\TRY\img\82_25158_ff7d27b587d5a2f.jpg
to the
D:\TRY\id\jw11
folder
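Before running the script on a real collection, the move step can be tried on throwaway folders. The sketch below puts the loop into a hypothetical function, sort_images, and exercises it in a temporary directory with one file that matches jw11 and one file whose id (99999, invented for the demo) has no folder.

```python
import os
import shutil
import tempfile


def sort_images(folder_img, auid):
    """Move every file whose id slice is a key of auid; return the moved names."""
    moved = []
    for name in os.listdir(folder_img):
        key = name[3:9]
        if key in auid:
            # Build full source and destination paths, then move the file.
            shutil.move(os.path.join(folder_img, name),
                        os.path.join(auid[key], name))
            moved.append(name)
    return sorted(moved)


with tempfile.TemporaryDirectory() as root:
    author_dir = os.path.join(root, 'id', 'jw11')
    folder_img = os.path.join(root, 'img')
    os.makedirs(author_dir)
    os.makedirs(folder_img)
    for name in ['82_25158_ff7d27b587d5a2f.jpg',
                 '82_99999_0000000000000aa.jpg']:
        open(os.path.join(folder_img, name), 'w').close()

    moved = sort_images(folder_img, {'25158_': author_dir})
    remaining = os.listdir(folder_img)
    in_author = os.listdir(author_dir)
    print(moved, remaining, in_author)
```

The unmatched file stays in img, which is the "what remains needs your attention" behaviour described earlier. Writing key in auid is equivalent to key in auid.keys().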

The above is my small experience in sorting out folders as a Python beginner. I am sharing it with you. Thank you.

Keywords: Python Programming Operating System

Added by mynameisbob on Sat, 05 Mar 2022 05:15:02 +0200