Using Redis to record a large file list

Requirement

We have a file list stored as plain text. Querying it directly in the text file is slow and inefficient, so we want to import it into Redis and work with it from there. This article records how to handle that.

Functional requirements

The following functions need to be implemented

  • Import
  • Traversal
  • Deletion

A point query is just a GET on the key, so the function that actually needs implementing is traversal.
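For reference, a point query is a single command; a minimal sketch (the path below is a made-up example key):

import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

# GET returns the stored value (as bytes) or None if the key is absent,
# so checking whether a file is in the list is a single round trip
print(r.get('/data/example/file.txt'))     # hypothetical key
print(r.exists('/data/example/file.txt'))  # 1 if present, 0 if not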

Implementation of the import function

Single-process import

The single-process version reads the text file line by line and inserts each line into the database. Let's see how long that takes.
First, the number of entries in the file:

[root@lab102 ssd]# cat /root/chuli/file.list|wc -l
983040
[root@lab102 ssd]# time python3 filetoredis-rclone.py  /root/chuli/file.list
0
 Start time is : Thu Nov 25 11:31:10 2021
 End time is : Thu Nov 25 11:33:31 2021

real	2m20.554s
user	1m25.535s
sys	0m31.026s

The code is as follows:

#! /usr/bin/env python3
# -*- coding:utf-8 -*-
import sys
import redis
import time

starttime = time.asctime(time.localtime(time.time()))
inputfile = sys.argv[1]

r = redis.StrictRedis(host='localhost', port=6379, db=0)

def filekeytoredis(inputfile):
    # One SET per line: the file path becomes the key, the value is just 0
    for line in open(inputfile):
        r.set(line.strip('\n'), 0)

filekeytoredis(inputfile)
endtime = time.asctime(time.localtime(time.time()))
print("Start time is :", starttime)
print("End time is :", endtime)

Importing the 980,000 entries took 2 minutes and 20 seconds.
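A quick sanity check after the import (my addition, not part of the original timing) is DBSIZE, whose result should match the wc -l count above:

import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)
print(r.dbsize())  # expect 983040 after a complete import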

Now let's look at the multi-process implementation.

Multi-process import

[root@lab102 pget]# time python3 filetoredis.py /root/chuli/file.list
 Start time is : Thu Nov 25 11:43:24 2021
 End time is : Thu Nov 25 11:43:51 2021

real	0m27.962s
user	2m18.184s
sys	0m52.835s

The same 980,000 entries imported in 27 seconds, which is much faster.

The contents of the filetoredis.py script are as follows:

#! /usr/bin/env python3
# -*- coding:utf-8 -*-
import sys
import redis
import time
import multiprocessing

starttime = time.asctime(time.localtime(time.time()))
inputfile = sys.argv[1]

def initialize():
    # Each worker process opens its own connection; a redis connection
    # must not be shared across process boundaries
    global r
    r = redis.StrictRedis(host='localhost', port=6379, db=0)

def redisset(key):
    r.set(key, 0)

def flush(keys):
    # Hand the current batch of keys to a pool of 20 worker processes
    pool = multiprocessing.Pool(20, initialize)
    pool.map(redisset, keys)
    pool.close()
    pool.join()

def filekeytoredis(inputfile):
    linekeys = []
    for line in open(inputfile):
        linekeys.append(line.strip('\n'))
        if len(linekeys) == 20000:
            flush(linekeys)
            linekeys.clear()
    if linekeys:
        flush(linekeys)  # flush the final partial batch

filekeytoredis(inputfile)
endtime = time.asctime(time.localtime(time.time()))
print("Start time is :", starttime)
print("End time is :", endtime)

The multi-process version reads the file list in the main process, collects the lines into batches of 20,000 keys, and hands each batch to a pool of 20 worker processes to insert.
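The article parallelizes with processes, but redis-py pipelining is another way to cut round trips: commands are buffered and sent to the server in batches over a single connection. A minimal single-process sketch, assuming the same localhost server and the same 20,000-entry batch size:

#! /usr/bin/env python3
import sys
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

pipe = r.pipeline(transaction=False)  # plain batching, no MULTI/EXEC
count = 0
for line in open(sys.argv[1]):
    pipe.set(line.strip('\n'), 0)
    count += 1
    if count % 20000 == 0:
        pipe.execute()  # one round trip for 20000 SETs
pipe.execute()  # flush the final partial batch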

Implementation of the traversal function

Single-process traversal

[root@lab102 pget]# time python3 checkredis.py 0

real	0m6.825s
user	0m5.334s
sys	0m0.135s

The script implementation is as follows:

#! /usr/bin/env python3
# -*- coding:utf-8 -*-
# This script traverses a redis database; pass the db number as the argument
import sys
import redis
import time

starttime = time.asctime(time.localtime(time.time()))
dbnumber = int(sys.argv[1])

r = redis.StrictRedis(host='localhost', port=6379, db=dbnumber)

# SCAN walks the keyspace incrementally; the server returns a new cursor
# with each batch, and a cursor of 0 means the iteration is complete
cursor = 0
while True:
    cursor, keys = r.scan(cursor, match="*", count=20000)
    for key in keys:
        #print(key)
        pass
    if cursor == 0:
        break

The traversal of all 980,000 keys completed in about 6 seconds.
It is not obvious how to parallelize this step: each SCAN call returns the cursor needed for the next one, so advancing the cursor, which is where the time goes, is inherently sequential.
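For reference, redis-py also ships scan_iter, which drives the same cursor loop internally; a minimal sketch equivalent to the script above:

import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

# scan_iter advances the SCAN cursor internally and yields keys one by one
for key in r.scan_iter(match="*", count=20000):
    pass  # same no-op body as the timing test above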

Implementation of the deletion function

Single-process deletion

[root@lab102 pget]# time python3 cleanredis.py 0

real	2m14.760s
user	1m21.221s
sys	0m30.077s

The cleanredis.py deletion script is as follows:

#! /usr/bin/env python3
# -*- coding:utf-8 -*-
# This script cleans out a redis database; pass the db number as the argument
import sys
import redis
import time

starttime = time.asctime(time.localtime(time.time()))
dbnumber = int(sys.argv[1])

r = redis.StrictRedis(host='localhost', port=6379, db=dbnumber)

# Walk the keyspace with SCAN and issue one DEL per key
cursor = 0
while True:
    cursor, keys = r.scan(cursor, match="*", count=10000)
    for key in keys:
        r.delete(key)
    if cursor == 0:
        break

Deleting the 980,000 keys single-process took 2 minutes and 14 seconds.
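Before reaching for multiprocessing, note (my addition, not timed in the article) that redis-py's delete accepts multiple keys, so each SCAN batch can be removed in a single DEL round trip:

cursor = 0
while True:
    cursor, keys = r.scan(cursor, match="*", count=10000)
    if keys:
        r.delete(*keys)  # one DEL command for the whole batch
    if cursor == 0:
        break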

Multi-process deletion

[root@lab102 rclone]# time python3 cleanredis.py  0

real	0m38.574s
user	1m39.278s
sys	0m37.553s

The script is as follows:

#! /usr/bin/env python3
# -*- coding:utf-8 -*-
# This script cleans out a redis database; pass the db number as the argument
import sys
import redis
import time
import multiprocessing

starttime = time.asctime(time.localtime(time.time()))
dbnumber = int(sys.argv[1])

# Connection used by the main process to drive the SCAN cursor
r = redis.StrictRedis(host='localhost', port=6379, db=dbnumber)

def initialize():
    # Give each worker its own connection; a socket inherited from the
    # parent must not be shared between processes
    global r
    r = redis.StrictRedis(host='localhost', port=6379, db=dbnumber)

def keydelete(key):
    r.delete(key)

cursor = 0
while True:
    cursor, keys = r.scan(cursor, match="*", count=10000)
    pool = multiprocessing.Pool(10, initialize)
    pool.map(keydelete, keys)
    pool.close()
    pool.join()
    if cursor == 0:
        break

Deleting the 980,000 entries with multiple processes took 38 seconds.
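Also worth noting (my addition, not timed in the article): if the goal is simply to empty an entire database rather than delete selected keys, FLUSHDB does it server-side in one command:

import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)
r.flushdb()  # removes every key in the currently selected db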

Summary

Storing this kind of data in Redis is genuinely fast. This article only records the basic operations: import, traversal, and deletion. Since the data is plain key/value, a point query is just a GET on the key; only searching across keys requires a traversal.

Each operation was implemented both single-process and multi-process, and as the timings show, the multi-process versions still deliver a clear speedup.
