gevent file read/write pitfall

Scenario and symptoms

The project uses the Flask gevent framework and periodically (every 5 seconds) writes a JSON file to the file system

The file size is 40 MB

The symptom is that gevent does not switch to any other coroutine until the file has been written

Principle

When reading or writing a regular file, Linux blocks the calling thread for the duration of the call (regardless of whether the file's fd is set to blocking or non-blocking)

Because a regular file is always reported as ready for I/O, the thread simply calls write and waits for it to return. Network I/O is different: its ready state can be detected asynchronously
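
The "always ready" behaviour can be checked directly. The following is a minimal sketch, assuming Python 3 on a POSIX system; the scratch file name and the 1 MB payload are illustrative and not part of the original test. It opens a regular file with O_NONBLOCK, polls it with select, and then writes to it. The file is reported writable immediately, so there is no "not ready" moment at which an event loop could switch to something else

```python
import os
import select
import tempfile

# Hypothetical scratch file; any regular file on a local file system will do.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'nonblock_demo.bin')

# O_NONBLOCK is accepted for a regular file, but it does not make write() asynchronous.
fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC | os.O_NONBLOCK, 0o644)
try:
    # Zero timeout means "just poll". A regular file is reported writable right away.
    _, writable, _ = select.select([], [fd], [], 0)
    file_is_ready = fd in writable

    payload = b'x' * (1024 * 1024)
    # The call returns only once the kernel has accepted the data;
    # the calling thread does nothing else in the meantime.
    written = os.write(fd, payload)
finally:
    os.close(fd)
    os.remove(path)
    os.rmdir(tmp_dir)

print('select says writable:', file_is_ready)
print('bytes accepted by one write call:', written)
```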

If the project is a single-threaded program that relies on gevent coroutines for concurrency, then as soon as one coroutine writes a file, the only thread in the process is blocked and gevent gets no opportunity to schedule other coroutines. As a result, all coroutines stall
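
The stall can be reproduced without gevent. The toy round-robin scheduler below is a sketch for illustration only and is not how gevent is implemented; the names ticker, blocking_writer and run are hypothetical, and time.sleep stands in for a blocking write() on a regular file. It assumes Python 3

```python
import time


def ticker(log, count):
    # Cooperative task: does a little work, then hands control back.
    for i in range(count):
        log.append('tick {}'.format(i))
        yield  # plays the role of gevent.sleep(...)


def blocking_writer(log):
    # There is no yield between "begin" and "complete",
    # so no other task can run in between.
    log.append('begin write')
    time.sleep(0.05)  # stand-in for a blocking write() call
    log.append('complete write')
    yield


def run(tasks):
    # Single-threaded round-robin: resume each task until it yields or finishes.
    pending = list(tasks)
    while pending:
        for task in list(pending):
            try:
                next(task)
            except StopIteration:
                pending.remove(task)


log = []
run([ticker(log, 3), blocking_writer(log)])
print(log)
# ['tick 0', 'begin write', 'complete write', 'tick 1', 'tick 2']
```

"begin write" and "complete write" end up next to each other in the log: while the writer is inside its blocking call, the scheduler itself is not running, so the ticker cannot be resumed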

Test

# coding=utf-8
 
from gevent.monkey import patch_all
 
patch_all()
 
import gevent
import json
import os
import sys
import redis
 
AMOUNT = 300000
OUTPUT_FILE = 'test.json'
 
r = redis.Redis(host='127.0.0.1', port=6379, decode_responses=True)
 
dict = {}
 
 
# Generate large json
def generate_dict():
    print 'begin generate dict of {} subject'.format(AMOUNT)
    for i in xrange(0, AMOUNT):
        dict[i] = {"avatar": "/static/upload/photo/2019-11-02/v2_0cc3325d6467d8ebadde2edc8f3c92aab409b87c.jpg",
                   "birthday": None,
                   "create_time": 1572669596,
                   "department": "QA",
                   "description": "",
                   "end_time": None,
                   "entry_date": None,
                   "extra_id": None,
                   "groups": [
                       0
                   ],
                   "id": 12268,
                   "interviewee": "",
                   "interviewee_pinyin": "",
                   "inviter_id": None,
                   "job_number": "",
                   "name": "40735",
                   "remark": "",
                   "start_time": None,
                   "subject_type": 0,
                   "title": "",
                   "wg_number": ""
                   }
 
    print 'complete generate dict'
 
 
# Another gevent coroutine: prints a counter and yields to gevent every 0.1 s
def foo():
    for i in xrange(0, 30):
        gevent.sleep(0.1)
        print i
        sys.stdout.flush()
 
 
# Write file in blocking mode
def write_file1():
    print 'begin write {}'.format(OUTPUT_FILE)
    with open(OUTPUT_FILE, 'w') as fp:
        json.dump(dict, fp)
    print 'complete write {}'.format(OUTPUT_FILE)
 
 
# Write file in non blocking mode
def write_file2():
    print 'begin write {}'.format(OUTPUT_FILE)
    b = json.dumps(dict)
    # O_TRUNC so that a shorter payload does not leave stale bytes from a previous run
    fd = os.open(OUTPUT_FILE, os.O_CREAT | os.O_WRONLY | os.O_TRUNC | os.O_NONBLOCK)
    os.write(fd, b)
    os.close(fd)
    print 'complete write {}'.format(OUTPUT_FILE)
 
 
# Write the same payload to redis (network I/O over a socket)
def write_redis():
    print 'begin write redis'
    b = json.dumps(dict)
    r.set('storage_test', b)
    print 'complete write redis'
 
 
if __name__ == '__main__':
    generate_dict()
    g1 = gevent.spawn(foo)
    gevent.sleep(1)
    # Enable exactly one g2 line per run to compare the three writers
    # g2 = gevent.spawn(write_file1)
    # g2 = gevent.spawn(write_file2)
    g2 = gevent.spawn(write_redis)
    gevent.joinall([g1, g2])

The test program has two coroutines. One is the foo function, which prints numbers in a loop (each gevent.sleep call is very short and gives gevent a chance to schedule other coroutines)

The other is the I/O coroutine (writing a file or writing to redis)

Here are the test results

Whether you use write_file1 (blocking fd) or write_file2 (non-blocking fd), no numbers are printed between "begin write" and "complete write" in the log, which shows that the thread was blocked and no other coroutine was scheduled

With write_redis, however, numbers are printed between "begin write" and "complete write" in the log, which shows that network I/O can be performed in a non-blocking way (epoll is actually used). When the I/O is not ready, the thread does not block and gevent goes on scheduling other coroutines
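
The difference on the socket side can be seen with a local socket pair. This is a minimal sketch, assuming Python 3 on a POSIX system: socket.socketpair stands in for the redis connection, and select is used instead of epoll only to keep the example portable. Unlike a regular file, the socket is reported as not ready while no data has arrived, and a non-blocking recv fails immediately instead of waiting, which is the point where an event loop can run another coroutine

```python
import select
import socket

# A connected pair of local sockets stands in for the redis connection.
left, right = socket.socketpair()
left.setblocking(False)
right.setblocking(False)
try:
    # Nothing has been sent yet, so the kernel reports "not readable".
    readable, _, _ = select.select([left], [], [], 0)
    ready_before = left in readable

    # A non-blocking recv fails fast instead of waiting for data.
    try:
        left.recv(16)
        would_block = False
    except BlockingIOError:
        would_block = True

    right.send(b'PONG')

    # Once data has arrived, the socket is reported readable.
    readable, _, _ = select.select([left], [], [], 1.0)
    ready_after = left in readable
    data = left.recv(16) if ready_after else b''
finally:
    left.close()
    right.close()

print(ready_before, would_block, ready_after, data)
```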

Conclusion

Try to avoid writing large files, and do not dump data to a file periodically. Store the data in redis or mysql instead
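
If a file really has to be written, one possible mitigation (not tested in this article) is to write it in chunks and give the scheduler a turn between chunks. The sketch below assumes Python 3; write_in_chunks and its yield_fn parameter are hypothetical names, and under gevent yield_fn could be lambda: gevent.sleep(0). Each chunk still blocks the thread while it is written, so this only shortens the longest pause, and serializing the whole dict with json.dumps remains a single CPU-bound call that does not yield

```python
import os
import tempfile


def write_in_chunks(path, data, chunk_size=1024 * 1024, yield_fn=None):
    """Write bytes to path in chunks, calling yield_fn() after each chunk.

    yield_fn is a hypothetical hook; under gevent it could be
    lambda: gevent.sleep(0) so that other coroutines get a turn.
    Returns the number of chunks written.
    """
    chunks = 0
    with open(path, 'wb') as fp:
        for start in range(0, len(data), chunk_size):
            fp.write(data[start:start + chunk_size])
            chunks += 1
            if yield_fn is not None:
                yield_fn()
    return chunks


# A counter replaces gevent.sleep(0) here so the example runs without gevent.
yields = []
tmp_dir = tempfile.mkdtemp()
target = os.path.join(tmp_dir, 'chunked.json')
payload = b'{"k": "v"}' * 300000  # 3,000,000 bytes of illustrative data
count = write_in_chunks(target, payload, yield_fn=lambda: yields.append(1))
size = os.path.getsize(target)
os.remove(target)
os.rmdir(tmp_dir)

print(count, len(yields), size)
```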

References

Explains why gevent's monkey patch does not automatically set blocking file descriptors to non-blocking
https://github.com/gevent/gevent/issues/1070

How to open a non-blocking file descriptor
https://stackoverflow.com/questions/9259380/how-to-write-to-a-file-using-non-blocking-ioI

Is the write() function in C blocking or non-blocking? It depends on the flags used when the file descriptor is created
https://stackoverflow.com/questions/42449987/is-the-write-function-in-c-blocking-or-non-blocking

If the fd refers to a regular file, reads and writes still block even if it was opened as non-blocking
https://www.remlab.net/op/nonblock.shtm

Keywords: Python Flask

Added by yanivkalfa on Thu, 27 Jan 2022 07:08:22 +0200