Article catalog
- Problem background
- System environment
- Optimization process
- A quick analysis
- Stage 1: calling torch.cuda.empty_cache() directly
- Stage 2: loading the model and training in a child process
- Stage 3: global thread pool + GPU release
- Summary
- References
Problem background
The project exposes a training interface for automatically generating ancient poetry. The interface trains the generation model with PyTorch, and a GPU is used to speed training up, but the GPU memory is not released after training completes. The goal of the optimization is therefore to release the GPU once the poetry generation (training) is done. The web service is built with Flask, gunicorn is used to start the application so the server can handle concurrent requests, the training code is written in Python, and the project is deployed on a CentOS server.
System environment
Software | Version |
---|---|
Flask | 0.12.2 |
gunicorn | 19.9.0 |
CentOS | 6.6 (server with a GPU; no extra machines can be added) |
PyTorch | 1.7.0+cpu |
For particular reasons, only the server described above is available, so adding more machines is not an option.
Optimization process
Before training, PyTorch needs to load the model and data. If a GPU is available, they can be moved into GPU memory for acceleration; otherwise only the CPU can be used. Because loading the model and data is relatively slow, I originally put the loading step at project startup.
```python
import os

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# GPT2LMHeadModel and base_path are defined elsewhere in the project
model = GPT2LMHeadModel.from_pretrained(os.path.join(base_path, "model"))
model.to(device)
model.eval()
```
This step takes about 6 seconds; `device = "cuda"` means torch runs on the GPU. After loading, the model and data occupy roughly 1370MB of GPU memory. The optimization goal is to release that memory once training is finished.
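For reference, the load time and the memory footprint can be measured directly with PyTorch's counters. This is only a sketch: `GPT2LMHeadModel` is assumed to come from the `transformers` package, and `base_path` is the project variable used above.

```python
import os
import time

import torch
from transformers import GPT2LMHeadModel  # assumed import; the project may use a different package

start = time.time()
device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained(os.path.join(base_path, "model"))  # base_path as above
model.to(device)
model.eval()
print("load time: {:.1f}s".format(time.time() - start))  # about 6 seconds in this project

if device == "cuda":
    # memory_allocated: memory held by live tensors; memory_reserved: memory cached by the allocator
    print("allocated: {:.0f} MB".format(torch.cuda.memory_allocated() / 1024 ** 2))  # ~1370MB here
    print("reserved:  {:.0f} MB".format(torch.cuda.memory_reserved() / 1024 ** 2))
```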
A quick analysis
The current situation is this: the model and data are loaded at project startup, so if they are released from the GPU after training, there is nothing left for the next training run. Releasing the GPU therefore means the model has to be reloaded every time. In other words, the model and data cannot be loaded at startup, only when the training interface is called. Since loading is slow, it has to run asynchronously in a worker thread or child process, so the first attempt was to put the loading and training steps into a separate thread.
Stage 1: calling torch.cuda.empty_cache() directly
The GPU is not being released, so just release it, right? How hard can it be? A quick Baidu search for how PyTorch releases GPU memory turns up plenty of answers.


Click through and the answer is right there: call torch.cuda.empty_cache() after training. I added it, ran the code, and it did nothing! The first copy-paste fix failed. Why? The analysis comes later.
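In other words, the Stage 1 change was essentially just the following, with the model still loaded at project startup. This is only a sketch with a hypothetical wrapper `train_and_cleanup`; `start_train` is the project's own training function shown in the later snippets.

```python
import torch

def train_and_cleanup(model, n_ctx, device):
    # Train with the model that was loaded at project startup
    result_list = start_train(model, n_ctx, device)
    # Ask the caching allocator to hand unused cached blocks back to the driver.
    # This did not help here: the model's tensors are still referenced,
    # so their memory still counts as allocated and cannot be returned.
    torch.cuda.empty_cache()
    return result_list
```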
Stage 2: loading the model and training in a child process
Since loading the model and training in a worker thread does not release the GPU, how about a change of approach: create a child process to load the model data and train, and when training finishes, kill the child process; the resources it occupies (mainly GPU memory) should be released with it. Nothing seems wrong with this idea, so let's just do it.
- Define the method that loads the model data and trains (code for reference only):
```python
def training(queue):
    manage.app.app_context().push()
    current_app.logger.error('Model loading start')
    with manage.app.app_context():
        device = "cuda" if torch.cuda.is_available() else "cpu"
        current_app.logger.error('device1111')
        # Load the model inside the child process
        model = GPT2LMHeadModel.from_pretrained(os.path.join(base_path, "model"))
        model.to(device)
        current_app.logger.error('device2222')
        model.eval()
        n_ctx = model.config.n_ctx
        current_app.logger.error('Model loading completed')
        # Training method
        result_list = start_train(model, n_ctx, device)
        current_app.logger.error('Finish training')
        # Put the results returned by training into the queue
        queue.put(result_list)
```
- Create a child process, run the training method in it, and then obtain the training results by blocking on the process:
```python
import os
import signal

from torch import multiprocessing as mp

def sub_process_train():
    # Define a queue to get the training results back from the child process
    train_queue = mp.Queue()
    training_process = mp.Process(target=training, args=(train_queue,))
    training_process.start()
    current_app.logger.error('Subprocess started')
    # Wait for the training to complete
    training_process.join()
    current_app.logger.error('Execution complete')
    # Get the training results
    result_list = train_queue.get()
    current_app.logger.error('Got data')
    if training_process.is_alive():
        current_app.logger.error('The child process is still alive')
        # Kill the child process
        os.kill(training_process.pid, signal.SIGKILL)
        current_app.logger.error('Killed child process')
    return result_list
```
- Because sub_process_train blocks until the child process finishes, it has to run in its own thread so that the training interface can still return normally:
```python
import threading

threading.Thread(target=sub_process_train).start()
```
The code is written, so it's time to witness a miracle. First start the project with python manage.py and look at the results. An error is reported; the message says that CUDA cannot be re-initialized in a forked child process: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method.

The question here is: what are fork and spawn? To answer that, look at how the child process is created: torch.multiprocessing.Process(target=training, args=(train_queue,)). fork and spawn are the two ways of building a child process, and they differ as follows:
1. fork: apart from the necessary startup resources, all other variables, packages and data are inherited from the parent process; the child shares some of the parent's memory pages, so it starts quickly. But because most of its data comes directly from the parent, it is not a fully independent, safe child process.
2. spawn: the child process is built from scratch. The parent's data is copied into the child's own address space, the child gets its own Python interpreter, and all of the parent's packages have to be re-imported. Startup is therefore slow, but since the data belongs to the child itself, it is safer.
Back to the error above: why does it complain about re-initialization? Under Python 3, sharing CUDA tensors between processes is only supported with the spawn start method, while multiprocessing creates its children with fork, which the CUDA runtime does not support. So mp.set_start_method('spawn') has to be called before the child process is created, namely:
```python
def sub_process_train(prefix, length):
    try:
        # The start method can only be set once per process; ignore repeated calls
        mp.set_start_method('spawn')
    except RuntimeError:
        pass
    train_queue = mp.Queue()
    training_process = mp.Process(target=training, args=(train_queue,))
    # ... other code omitted
```
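As a standalone aside (not project code), the difference between the two start methods can be seen with a tiny script that builds one child with each of them via `multiprocessing.get_context`:

```python
import multiprocessing as mp
import os

def worker(start_method):
    print("child {} started via {}".format(os.getpid(), start_method))

if __name__ == "__main__":
    # fork: the child inherits the parent's memory pages, so it starts quickly,
    # but a CUDA context already initialized in the parent cannot be reused in it
    fork_ctx = mp.get_context("fork")
    p = fork_ctx.Process(target=worker, args=("fork",))
    p.start()
    p.join()

    # spawn: the child gets a fresh interpreter and re-imports its modules,
    # so startup is slower, but it is the start method the CUDA runtime supports
    spawn_ctx = mp.get_context("spawn")
    p = spawn_ctx.Process(target=worker, args=("spawn",))
    p.start()
    p.join()
```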
Run the project again with python manage.py. The results are shown in Figure 1 and Figure 2: the GPU memory is used correctly during training, and the GPU is released after training completes.


Everything looks perfect. But, but. After starting the project with gunicorn and calling the interface again, the following result appears.

The problem: when the project is started with gunicorn, the child process is simply never executed. Without mp.set_start_method('spawn') the model data cannot be loaded, and with it the child process does not run under gunicorn at all. A real headache.
Stage 3: global thread pool + GPU release
The child-process approach is out, so we go back to threads. Previously, a new thread was created directly for every request; when too many threads run at the same time, the GPU fills up and the application crashes. Instead, a global thread pool is used to create and manage the threads, and resources are released once each thread finishes.
- Create a global thread pool at project startup with a size of 2, to make sure there is always spare GPU memory:
```python
from multiprocessing.pool import ThreadPool

pool = ThreadPool(processes=2)
```
- Submit the training job through the thread pool:
```python
pool.apply_async(func=async_produce_poets)
```
- In the worker thread, load the model, train, and release the GPU:
```python
def async_produce_poets():
    try:
        print("Worker thread start " + str(os.getpid()) + " " + str(threading.current_thread().ident))
        start_time = int(time.time())
        manage.app.app_context().push()
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model = GPT2LMHeadModel.from_pretrained(os.path.join(base_path, "model"))
        model.to(device)
        model.eval()
        n_ctx = model.config.n_ctx
        result_list = start_train(model, n_ctx, device)
        # Move the model back to the CPU
        model = model.to('cpu')
        # Deleting the model only drops the reference
        del model
        # Release the cached GPU memory
        torch.cuda.empty_cache()
        train_seconds = int(time.time() - start_time)
        current_app.logger.info('The total training time is={0}'.format(str(train_seconds)))
    except Exception as e:
        manage.app.app_context().push()
```
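Putting the pieces together, the training endpoint only has to submit the job to the global pool and return immediately. The route and response below are hypothetical, just to show the wiring; the real interface presumably returns something more useful than a plain status message.

```python
from flask import Flask, jsonify
from multiprocessing.pool import ThreadPool

app = Flask(__name__)
# Global pool created once at startup; size 2 keeps concurrent trainings from exhausting the GPU
pool = ThreadPool(processes=2)

@app.route('/train', methods=['POST'])
def train():
    # Submit the load-train-release job defined above and return right away
    pool.apply_async(func=async_produce_poets)
    return jsonify({'status': 'training started'})
```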
After this operation, the ideal effect was finally achieved.
Since gunicorn is used to start the project, some gunicorn-related knowledge is essential; on a machine with limited CPU, the sync worker type is the ideal choice. See my brief summary of gunicorn for details.
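For completeness, a minimal gunicorn configuration of the kind assumed here might look like the following; the bind address, worker count and module name are assumptions, not the project's actual settings.

```python
# gunicorn.conf.py - a minimal sketch, started with: gunicorn -c gunicorn.conf.py manage:app
bind = "0.0.0.0:5000"   # assumed address and port
workers = 2             # a small number of workers for a CPU-limited machine
worker_class = "sync"   # the default sync worker type discussed above
```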
As for the Stage 1 problem: calling torch.cuda.empty_cache() directly failed to release the GPU because the model was never deleted, so its data was still loaded in GPU memory.
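The distinction can be checked with PyTorch's memory counters. This is a sketch, assuming `model` is the one loaded above; the ~1370MB is "allocated" memory that only goes away once the model reference is dropped.

```python
import torch

print(torch.cuda.memory_allocated())  # large: the model's parameters are still referenced

torch.cuda.empty_cache()               # only returns cached, *unused* blocks to the driver
print(torch.cuda.memory_allocated())   # essentially unchanged

model = model.to('cpu')                # move the parameters off the GPU
del model                              # drop the last reference to them
torch.cuda.empty_cache()               # now the freed blocks can actually be handed back
print(torch.cuda.memory_allocated())   # close to zero
```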
Summary
This article started from the optimization of a real project and recorded each step of the process. I hope it is helpful to readers.
References
multiprocessing fork() vs spawn()