Using Python and shell to splice long lines in text processing

Recently, because of a new business requirement, our platform has to convert data provided by supplier G into another format and then pass it on to customer K. The awkward part is that supplier G delivers the data as Excel files created on Windows, while the format customer K previously agreed on with our integration staff is a UTF-8 encoded txt file. In addition, because of how customer K's program processes the data, a matching check file must also be generated for the data file being transmitted.

The main steps are as follows:

1. First, the .xlsx file must be saved as a .csv file so that it can be opened on Linux.

2. Because files saved on Windows here are usually GBK-encoded, they need to be transcoded to UTF-8 to display correctly.

The command iconv -f gbk -t utf8 -c -o to_file from_file can be used for transcoding. After transcoding, the file looks roughly as follows:
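The same transcoding can also be sketched in Python, in case iconv is not available. This is only a minimal sketch and not part of the original workflow: the function name transcode_gbk_to_utf8 is hypothetical, and errors='ignore' is assumed to be an acceptable stand-in for the -c flag of iconv (both drop input that cannot be converted).

```python
def transcode_gbk_to_utf8(from_file, to_file):
    """Rewrite a GBK-encoded text file as UTF-8 (hypothetical helper)."""
    # errors='ignore' silently drops bytes that are not valid GBK,
    # roughly like the -c flag of iconv.
    # newline='' leaves the original line endings untouched.
    with open(from_file, 'r', encoding='gbk', errors='ignore', newline='') as src, \
            open(to_file, 'w', encoding='utf-8', newline='') as dst:
        for line in src:
            dst.write(line)
```

Calling transcode_gbk_to_utf8('from_file', 'to_file') would then play the role of the iconv command above.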

Account number, bank, name, ID card number, mobile phone number, login mailbox, suspected fraudulent account using equipment, request type, whether queried by multiple public security organs
000167342xxx, Shenzhen Agricultural and Commercial Company, Shenzhen XX Warehousing Service Co., Ltd,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
000195557xxx, Shenzhen Agricultural and Commercial Company, Shenzhen XXX Shoe Material Co., Ltd,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
000251484xxx, Shenzhen Agricultural and Commercial Corporation, Shenzhen XXX Electronics Co., Ltd,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

In fact, there are only nine logical columns of data: the eighth, ninth and tenth fields come from merged cells and together form a single column.

3. Customer K requires the 9 columns of data to be separated by "|".

As the transcoded file above shows, each row has 11 fields separated by ",". Fields 8, 9 and 10 must stay separated by "," (they form one column), while all other fields must be separated by "|". As far as I know, there are two ways to do this.

One way is to use Python, which is undoubtedly the better choice. The other is to splice the fields in the shell with a for loop (there may be a better shell approach, but I did not think of one), which is far too slow: the file provided by supplier G has about 120,000 rows (12W), and the shell version takes nearly 20 minutes to finish. Here are the two ways:

The Python script is as follows:

import os
import sys


def readfile(rfilename, wfilename):
    with open(rfilename, 'r', encoding='utf-8') as fr, \
            open(wfilename, 'a+', encoding='utf-8') as wfile:
        # wfile.write('Account number|Bank|Full name|ID number|Cell-phone number|Login Mailbox|Equipment used in suspected fraudulent accounts|Request type|Is it inquired by many public security organs?\n')
        # The previous line writes the file header. If it stays commented out,
        # processing starts from the first line of the input, so the header
        # row is converted like any other row.
        for line in fr:
            # Strip the original line ending first. Otherwise the last field
            # keeps its own "\n", and writing another "\n" after it produces
            # a large number of blank lines in the new file.
            line = line.rstrip('\r\n')
            words = line.split(',')
            # Skip empty lines, short rows and rows without an account number.
            if len(words) < 11 or words[0] == '':
                continue
            # Fields 1-7 and 11 are separated by "|"; fields 8-10 stay
            # comma-separated inside a single column.
            wstr = '|'.join(words[0:7] + [','.join(words[7:10]), words[10]])
            wfile.write(wstr + '\n')


if __name__ == '__main__':
    inpath = sys.argv[1]
    outpath = sys.argv[2]
    # Path and name of the input file
    rfilename = os.path.join(inpath, '1111.csv')
    # Path and name of the output file
    wfilename = os.path.join(outpath, '3333.csv')
    readfile(rfilename, wfilename)

#Run the script
[root@A opt]# python $python_file $inpath $outpath

It is very fast: about 120,000 lines are processed in roughly a second.
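One caveat with line.split(','): when Excel exports a field that itself contains a comma, it wraps the field in double quotes, and a plain split would cut that field in two. The following is a minimal sketch of the same conversion using Python's standard csv module, assuming that case can occur in the supplier's data. The function name convert_row and the sample row are made up for illustration and are not part of the original script.

```python
import csv
import io


def convert_row(fields):
    """Join 11 CSV fields into 9 "|"-separated columns (hypothetical helper).

    Fields 8-10 (indexes 7-9) stay comma-separated inside one column.
    """
    return '|'.join(fields[0:7] + [','.join(fields[7:10]), fields[10]])


# Made-up sample row. csv.reader understands quoted fields, so the comma
# inside "A Co., Ltd" does not start a new column.
sample = 'acc1,bankX,"A Co., Ltd",,,,,stop,freeze,query,yes\r\n'
for fields in csv.reader(io.StringIO(sample, newline='')):
    if len(fields) >= 11 and fields[0] != '':
        print(convert_row(fields))
```

For a real file, the io.StringIO object would be replaced by a file opened with newline='' and encoding='utf-8'.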

The shell approach, compared with the Python script, is rather clumsy, at least the way I came up with it.

#Preparation: first change every "," in the file to "|"
#Read line by line, so that spaces inside a field do not split a row
while IFS= read -r line
do
    echo "`echo "$line" | awk -F "|" 'BEGIN{OFS="|"} {print $1,$2,$3,$4,$5,$6,$7}'`|`echo "$line" | awk -F"|" 'BEGIN{OFS=","} {print $8,$9,$10}'`|`echo "$line" | awk -F "|" 'BEGIN{OFS="|"} {print $11}'`" >> 4444.txt
done < 3333.txt


#If resources permit, the loop body can also be run in parallel
#(background jobs finish in any order, so the row order is not guaranteed)
while IFS= read -r line
do
 {
    echo "`echo "$line" | awk -F "|" 'BEGIN{OFS="|"} {print $1,$2,$3,$4,$5,$6,$7}'`|`echo "$line" | awk -F"|" 'BEGIN{OFS=","} {print $8,$9,$10}'`|`echo "$line" | awk -F "|" 'BEGIN{OFS="|"} {print $11}'`" >> 4444.txt
 }&
done < 3333.txt
wait

In testing, there was no big difference between the parallel and non-parallel versions: splicing 120,000 rows takes about 20 minutes either way, with the parallel version only slightly faster.

The data processed by the above two methods are as follows:

Account | Bank | Name | Identity Card | Mobile Phone | Login Mailbox | Suspected Fraudulent Account Use Equipment | Request Type | Is It Inquired by Several Public Security Organs
000167342xxx | Shenzhen Agricultural and Commercial Company | Shenzhen XX Warehousing Service Co., Ltd | | | | | Stop payment, freeze, detailed inquiry | Yes
000195557xxx | Shenzhen Agricultural Merchants | Shenzhen XXX Shoe Material Co., Ltd | | | | Stop payment, freeze, detailed inquiry | Yes
000251484xxx | Shenzhen Agricultural and Commercial Company | | | | | | | | Stop payment, freeze, detailed inquiry | Yes
001980099990xxx | Agricultural Bank | Unknown | | | | | |, Detailed Query | Yes
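Since customer K expects exactly 9 columns, a quick sanity check on the converted file can be useful before sending it. The following is a minimal sketch, assuming that no field itself contains "|"; the function name bad_rows is hypothetical.

```python
def bad_rows(lines, expected=9):
    """Return (line_number, column_count) for rows without `expected` columns."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip('\r\n')
        if not line:
            continue  # blank lines are skipped, not reported
        count = len(line.split('|'))
        if count != expected:
            problems.append((number, count))
    return problems
```

It accepts any iterable of lines, so an open file works directly, for example: with open('3333.csv', encoding='utf-8') as f: print(bad_rows(f)). An empty list means every non-blank row has 9 columns.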

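4. Generating the check file is easy. You can use MD5, which produces a 16-byte (128-bit) digest, or one of the SHA hashes: SHA-1, which produces a 20-byte (160-bit) digest, or SHA-224, SHA-256 or SHA-384. Note that these are hash digests, not encryption.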

#Command example
[root@A opt]# md5sum 2222.csv
d6b37d6921b0153079ef6bb976872f01  2222.csv
[root@A opt]# sha1sum 2222.csv
c9e780381f756308362d44172e06e46ee8758ecf  2222.csv
[root@A opt]# sha224sum 2222.csv
1f79435e1f5eefc91b1fabf66df1a25391478e0fa137a526e6bdf66e  2222.csv
[root@A opt]# sha256sum 2222.csv
bf9e8b0b25807e9b31026a56d8dc4040dd4c90e7a468b1a4d91cc3b6866dbb13  2222.csv

#Generate check files
[root@A opt]# md5sum 2222.csv >2222_md5.txt

[root@A opt]# sha1sum 2222.csv >2222_sha1.txt

#Check file integrity
[root@A opt]# md5sum -c 2222_md5.txt
2222.csv: OK
[root@A opt]# sha1sum -c 2222_sha1.txt
2222.csv: OK
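For reference, the same digests can be computed in Python with the standard hashlib module. This is a minimal sketch: the function names hex_digest_of_file and write_check_file are hypothetical, and the line written mimics the "digest, two spaces, file name" layout printed by md5sum and sha1sum above, so the result should be readable by md5sum -c or sha1sum -c.

```python
import hashlib


def hex_digest_of_file(path, algorithm='md5'):
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            digest.update(chunk)
    return digest.hexdigest()


def write_check_file(data_path, check_path, algorithm='md5'):
    """Write one "digest  filename" line for data_path into check_path."""
    line = '{}  {}\n'.format(hex_digest_of_file(data_path, algorithm), data_path)
    with open(check_path, 'w', encoding='utf-8') as out:
        out.write(line)
```

For example, write_check_file('2222.csv', '2222_md5.txt') would correspond to the md5sum redirect above, and passing algorithm='sha1' to the sha1sum one.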

For more information on the generation of validation files, see https://www.jb51.net/LINUXjishu/156064.html

Keywords: Linux Python Windows Mobile shell

Added by Shadow Hatake on Tue, 20 Aug 2019 14:45:14 +0300
