Principle and implementation of Linux native asynchronous IO (Native AIO)

What is asynchronous IO?

Asynchronous IO: when an application initiates an IO operation, the call returns before the result is available. Once the kernel completes the IO operation, it notifies the caller through a signal or a callback.

The difference between asynchronous IO and synchronous IO is shown in the figure:

As the figure above shows, a synchronous IO call does not return until the kernel has completed the IO operation. An asynchronous IO call does not wait: it hands the IO operation to the kernel and returns immediately. When the kernel completes the operation, it notifies the application, for example with a signal.

Linux native AIO principle

Linux native AIO is the asynchronous IO interface implemented by the Linux kernel itself. Why the word "native"? Because several other asynchronous IO libraries exist on Linux, such as libeio and glibc AIO. To distinguish it from them, the asynchronous IO provided by the Linux kernel is called native asynchronous IO.

Many third-party asynchronous IO libraries are not truly asynchronous; they simulate asynchronous IO with multiple threads. libeio, for example, works this way.

This article focuses on the principle and implementation of Linux native AIO, so it does not analyze the third-party asynchronous IO libraries. Let's start with the principle of Linux native AIO.

As shown in the figure:

Linux native AIO processing flow:

  • When the application calls the io_submit system call to initiate an asynchronous IO operation, the kernel adds an IO task to its IO task queue and returns success.
  • The kernel processes the tasks in the IO task queue in the background and stores each result in the corresponding IO task.
  • The application calls the io_getevents system call to obtain the results of asynchronous IO operations. If no operation has completed yet, the call reports that no result is available; otherwise it returns the IO results.

As can be seen from the above process, the asynchronous IO operation of Linux mainly consists of two steps:

  1. Call io_submit to initiate an asynchronous IO operation.
  2. Call io_getevents to get the result of the asynchronous IO operation.

Linux native AIO implementation

Generally speaking, using Linux native AIO takes three steps:

  1. Call io_setup to create an asynchronous IO context.
  2. Call io_submit to submit asynchronous IO operations to the kernel.
  3. Call io_getevents to get the results of the asynchronous IO operations.

Therefore, we can understand the implementation of Linux native AIO by analyzing how these three functions are implemented.

Linux native AIO is implemented in the source file fs/aio.c.
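Before diving into the kernel side, here is a minimal user-space sketch of the three steps. It assumes a Linux system with the kernel headers installed and goes through syscall(2) directly; the helper names aio_setup, aio_submit, aio_getevents, aio_destroy and aio_read_file are made up for this example and are not part of any library.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Thin wrappers around the raw system calls (hypothetical helper names). */
static long aio_setup(unsigned nr, aio_context_t *ctx)
{
    return syscall(SYS_io_setup, nr, ctx);
}

static long aio_submit(aio_context_t ctx, long nr, struct iocb **iocbs)
{
    return syscall(SYS_io_submit, ctx, nr, iocbs);
}

static long aio_getevents(aio_context_t ctx, long min_nr, long nr,
                          struct io_event *events, struct timespec *timeout)
{
    return syscall(SYS_io_getevents, ctx, min_nr, nr, events, timeout);
}

static long aio_destroy(aio_context_t ctx)
{
    return syscall(SYS_io_destroy, ctx);
}

/* Read up to len bytes from the start of path through native AIO.
 * Returns the byte count, or a negative value on failure. */
long aio_read_file(const char *path, void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    aio_context_t ctx = 0;          /* must be zero before io_setup */
    if (aio_setup(1, &ctx) < 0)     /* step 1: create the context */
        return -1;

    long result = -1;
    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        struct iocb cb;
        memset(&cb, 0, sizeof(cb)); /* unused fields stay zero */
        cb.aio_lio_opcode = IOCB_CMD_PREAD;
        cb.aio_fildes = (uint32_t)fd;
        cb.aio_buf = (uintptr_t)buf;
        cb.aio_nbytes = len;
        cb.aio_offset = 0;

        struct iocb *list[1] = { &cb };
        struct io_event ev;
        if (aio_submit(ctx, 1, list) == 1 &&           /* step 2: submit */
            aio_getevents(ctx, 1, 1, &ev, NULL) == 1)  /* step 3: wait for one result */
            result = (long)ev.res;  /* bytes read, or a negative errno */
        close(fd);
    }
    aio_destroy(ctx);
    return result;
}
```

Calling aio_read_file("/etc/hostname", buf, sizeof(buf)) would fill buf much like pread does. For a file opened without O_DIRECT the kernel may carry out the read inside io_submit itself, so this sketch shows the calling sequence rather than a performance benefit.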

Create asynchronous IO context

To use Linux native AIO, you first need to create an asynchronous IO context. In the kernel, the asynchronous IO context is represented by the kioctx structure, which is defined as follows:

struct kioctx {
    atomic_t                users;    // Reference counter
    int                     dead;     // Is it closed
    struct mm_struct        *mm;      // Corresponding memory management object

    unsigned long           user_id;  // A unique ID that identifies the current context and returns it to the user
    struct kioctx           *next;

    wait_queue_head_t       wait;     // Waiting queue
    spinlock_t              ctx_lock; // lock

    int                     reqs_active; // Number of asynchronous IO requests in progress
    struct list_head        active_reqs; // Asynchronous IO request object in progress
    struct list_head        run_list;

    unsigned                max_reqs;  // Maximum IO requests

    struct aio_ring_info    ring_info; // Ring buffer

    struct work_struct      wq;
};

In the kioctx structure, the most important members are active_reqs and ring_info. active_reqs holds all asynchronous IO operations in progress, while ring_info stores the results of completed asynchronous IO operations.

The kioctx structure is shown in the figure:

As shown in the figure above, the queue held by the active_reqs member is made up of kiocb structures, while the ring_info member is a ring buffer described by an aio_ring_info structure.

So let's first look at the definitions of the kiocb and aio_ring_info structures:

struct kiocb {
    ...
    struct file         *ki_filp;      // File object for asynchronous IO operation
    struct kioctx       *ki_ctx;       // Points to the asynchronous IO context to which it belongs
    ...
    struct list_head    ki_list;       // Used to connect all asynchronous IO operation objects in progress
    __u64               ki_user_data;  // User provided data pointer (which can be used to distinguish asynchronous IO operations)
    loff_t              ki_pos;        // File offset for asynchronous IO operations
    ...
};

The kiocb structure is simple. It mainly stores information about one asynchronous IO operation, such as:

  • ki_filp: the file object on which the asynchronous IO is performed.
  • ki_ctx: points to the asynchronous IO context the operation belongs to.
  • ki_list: links all IO operation objects in the current asynchronous IO context.
  • ki_user_data: reserved for the user, for example to distinguish asynchronous IO operations or to store a pointer to a callback function.
  • ki_pos: the file offset of the asynchronous IO operation.

The aio_ring_info structure implements a ring buffer and is defined as follows:

struct aio_ring_info {
    unsigned long       mmap_base;     // Virtual memory address of ring buffer
    unsigned long       mmap_size;     // Size of ring buffer

    struct page         **ring_pages;  // An array of memory pages used by the ring buffer
    spinlock_t          ring_lock;     // Spin lock for protecting ring buffer
    long                nr_pages;      // Number of memory pages occupied by ring buffer

    unsigned            nr, tail;

    // If the ring buffer is no more than 8 memory pages
    // ring_pages points to internal_pages field
#define AIO_RING_PAGES  8
    struct page         *internal_pages[AIO_RING_PAGES]; 
};

This ring buffer stores the results of completed asynchronous IO operations. Each result is represented by an io_event structure, as shown in the figure:

In the figure, head marks the start of the ring buffer (the next result to read) and tail marks the end (the next free slot). If head differs from tail, completed results are available to read. If head equals tail, the ring buffer is empty and no unread result exists.

The head and tail positions of the ring buffer are stored in the aio_ring structure, which is defined as follows:

struct aio_ring {
    unsigned    id;
    unsigned    nr;    // Number of io_event entries the ring buffer can hold
    unsigned    head;  // Start position of ring buffer
    unsigned    tail;  // End of ring buffer
    ...
};
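The head and tail bookkeeping can be sketched in plain C. This is a simplified user-space model, not the kernel code: struct ring_model and the two helper functions are names invented for this illustration, and the wrap-around count is an addition to show why "tail greater than head" is not the right test once the indices wrap.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the head/tail fields of struct aio_ring. */
struct ring_model {
    unsigned nr;    /* number of slots in the ring */
    unsigned head;  /* next slot to read */
    unsigned tail;  /* next free slot to write */
};

/* The ring is empty when both indices meet. */
static bool ring_is_empty(const struct ring_model *r)
{
    return r->head == r->tail;
}

/* Number of unread results, taking wrap-around into account. */
static unsigned ring_pending(const struct ring_model *r)
{
    assert(r->nr > 0 && r->head < r->nr && r->tail < r->nr);
    return (r->tail + r->nr - r->head) % r->nr;
}
```

With nr = 8, head = 6 and tail = 1, the tail is numerically smaller than the head, yet three results (slots 6, 7 and 0) are waiting to be read.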

The data structures above were introduced to make the following source code analysis easier to follow.

Now let's analyze how an asynchronous IO context is created. The context is created by calling io_setup, which enters the kernel function sys_io_setup, implemented as follows:

asmlinkage long sys_io_setup(unsigned nr_events, aio_context_t *ctxp)
{
    struct kioctx *ioctx = NULL;
    unsigned long ctx;
    long ret;
    ...
    ioctx = ioctx_alloc(nr_events);  // Call ioctx_alloc to create the asynchronous IO context
    ret = PTR_ERR(ioctx);
    if (!IS_ERR(ioctx)) {
        ret = put_user(ioctx->user_id, ctxp); // Returns the identifier of the asynchronous IO context to the caller
        if (!ret)
            return 0;
        io_destroy(ioctx);
    }
out:
    return ret;
}

The implementation of sys_io_setup is relatively simple. It first calls ioctx_alloc to allocate an asynchronous IO context object, and then returns the identifier of that object to the caller.
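The PTR_ERR and IS_ERR calls in the listing rely on a kernel idiom: a function that returns a pointer can encode a small negative error code in the pointer value itself. The following is a user-space sketch of that idiom under the assumption of a platform where long and pointers have the same width; every model_ and ctx_model name is hypothetical and only imitates the shape of sys_io_setup.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_ERRNO 4095  /* error codes occupy the top 4095 addresses */

static inline void *model_err_ptr(long err) { return (void *)err; }
static inline long model_ptr_err(const void *p) { return (long)p; }

static inline bool model_is_err(const void *p)
{
    return (unsigned long)p >= (unsigned long)-MODEL_MAX_ERRNO;
}

struct ctx_model { unsigned max_reqs; };

/* Hypothetical allocator: returns a usable context or an encoded error. */
static struct ctx_model *ctx_model_alloc(struct ctx_model *slot, unsigned nr_events)
{
    if (nr_events == 0)
        return model_err_ptr(-EINVAL);
    slot->max_reqs = nr_events;
    return slot;
}

/* Same control flow as the listing: take the error first, then overwrite
 * it with 0 once the pointer turns out to be valid. */
static long ctx_model_setup(struct ctx_model *storage, unsigned nr_events)
{
    struct ctx_model *ctx = ctx_model_alloc(storage, nr_events);
    long ret = model_ptr_err(ctx);
    if (!model_is_err(ctx))
        ret = 0;
    return ret;
}
```

The benefit of this convention is that a single return value carries both the object and the reason for failure, so no extra output parameter is needed.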

So the core of sys_io_setup is the call to ioctx_alloc. Let's continue with the implementation of ioctx_alloc:

static struct kioctx *ioctx_alloc(unsigned nr_events)
{
    struct mm_struct *mm;
    struct kioctx *ctx;
    ...
    ctx = kmem_cache_alloc(kioctx_cachep, GFP_KERNEL); // Request a kioctx object
    ...
    INIT_LIST_HEAD(&ctx->active_reqs);                 // Initialize asynchronous IO operation queue
    ...
    if (aio_setup_ring(ctx) < 0)                       // Initialize ring buffer
        goto out_freectx;
    ...
    return ctx;
    ...
}

ioctx_alloc mainly does the following:

  • Calls kmem_cache_alloc to allocate an asynchronous IO context object from the kernel.
  • Initializes the members of the asynchronous IO context, such as the asynchronous IO operation queue.
  • Calls aio_setup_ring to initialize the ring buffer.

The implementation of the ring buffer initialization function aio_setup_ring is somewhat complicated and mostly involves memory management, so we skip it here.

Submit asynchronous IO operations

Asynchronous IO operations are submitted through the io_submit function. io_submit takes an array of pointers to iocb structures, each describing one asynchronous IO operation to perform. Let's first look at the definition of the iocb structure:

struct iocb {
    __u64   aio_data;       // User-defined data, for example a pointer to a callback function
    ...
    __u16   aio_lio_opcode; // Type of IO operation, such as read (IOCB_CMD_PREAD) or write (IOCB_CMD_PWRITE) operation
    __s16   aio_reqprio;
    __u32   aio_fildes;     // File handle for IO operation
    __u64   aio_buf;        // Buffer for IO operations (such as data written to files for write operations)
    __u64   aio_nbytes;     // Buffer size
    __s64   aio_offset;     // File offset for IO operation
    ...
};
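As an illustration of how these fields are typically filled in from user space, the sketch below prepares an iocb for a positional read and uses aio_data to carry a pointer to a per-request record with a callback. It assumes the struct iocb and struct io_event definitions from <linux/aio_abi.h>; struct request_model, prep_pread, dispatch_event and store_result are hypothetical names for this example.

```c
#include <assert.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-request record; its address travels through aio_data. */
struct request_model {
    void (*on_done)(struct request_model *self, long long res);
    long long result;
};

/* Fill an iocb describing a read of len bytes at the given file offset. */
static void prep_pread(struct iocb *cb, int fd, void *buf, size_t len,
                       long long offset, struct request_model *req)
{
    memset(cb, 0, sizeof(*cb));          /* leave unused fields zeroed */
    cb->aio_data = (uintptr_t)req;       /* echoed back in io_event.data */
    cb->aio_lio_opcode = IOCB_CMD_PREAD;
    cb->aio_fildes = (uint32_t)fd;
    cb->aio_buf = (uintptr_t)buf;
    cb->aio_nbytes = len;
    cb->aio_offset = offset;
}

/* Route a completion event to the callback stored with the request. */
static void dispatch_event(const struct io_event *ev)
{
    struct request_model *req = (struct request_model *)(uintptr_t)ev->data;
    assert(req != NULL && req->on_done != NULL);
    req->on_done(req, (long long)ev->res);
}

/* Example callback: remember the result of the operation. */
static void store_result(struct request_model *self, long long res)
{
    self->result = res;
}
```

Storing a pointer rather than an index in aio_data lets the completion path find its request without any lookup table, at the cost of having to keep the record alive until the event has been handled.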

io_submit eventually calls the kernel function sys_io_submit to submit the asynchronous IO operations. Let's analyze the implementation of sys_io_submit:

asmlinkage long
sys_io_submit(aio_context_t ctx_id, long nr, 
              struct iocb __user **iocbpp)
{
    struct kioctx *ctx;
    long ret = 0;
    int i;
    ...
    ctx = lookup_ioctx(ctx_id); // Get asynchronous IO context object through asynchronous IO context identifier
    ...
    for (i = 0; i < nr; i++) {
        struct iocb __user *user_iocb;
        struct iocb tmp;

        if (unlikely(__get_user(user_iocb, iocbpp+i))) {
            ret = -EFAULT;
            break;
        }

        // Copy asynchronous IO operations from user space to kernel space
        if (unlikely(copy_from_user(&tmp, user_iocb, sizeof(tmp)))) {
            ret = -EFAULT;
            break;
        }

        // Call io_submit_one to submit the asynchronous IO operation
        ret = io_submit_one(ctx, user_iocb, &tmp);
        if (ret)
            break;
    }

    put_ioctx(ctx);
    return i ? i : ret;
}
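The final statement, return i ? i : ret, gives the call its partial-success behavior: if at least one request was accepted, the count of accepted requests is returned; only when the very first request fails does the caller see the error code. A small user-space model of that convention follows; submit_batch, submit_one_fn and reject_negative are invented for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-request hook: returns 0 on success or a negative errno. */
typedef long (*submit_one_fn)(int request);

/* Same loop shape and return convention as the listing above. */
static long submit_batch(const int *requests, long nr, submit_one_fn submit_one)
{
    assert(requests != NULL || nr == 0);

    long ret = 0;
    long i;
    for (i = 0; i < nr; i++) {
        ret = submit_one(requests[i]);
        if (ret)
            break;              /* stop at the first failure */
    }
    return i ? i : ret;         /* accepted count, else the first error */
}

/* Example hook: reject negative request values. */
static long reject_negative(int request)
{
    return request < 0 ? -EINVAL : 0;
}
```

A caller therefore has to compare the return value with the number of requests it passed in and resubmit the remainder if the count is smaller.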

The implementation of sys_io_submit is relatively simple: it copies the description of each asynchronous IO operation from user space to kernel space and then calls io_submit_one to submit it. Let's focus on the implementation of io_submit_one:

int io_submit_one(struct kioctx *ctx, 
                  struct iocb __user *user_iocb,
                  struct iocb *iocb)
{
    struct kiocb *req;
    struct file *file;
    ssize_t ret;
    char *buf;
    ...
    file = fget(iocb->aio_fildes);      // Get file object through file handle
    ...
    req = aio_get_req(ctx);             // Get an asynchronous IO operation object
    ...
    req->ki_filp = file;                // File object for asynchronous IO
    req->ki_user_obj = user_iocb;       // iocb object pointing to user space
    req->ki_user_data = iocb->aio_data; // Set user-defined data
    req->ki_pos = iocb->aio_offset;     // Sets the file offset for asynchronous IO operations

    buf = (char *)(unsigned long)iocb->aio_buf; // Data buffer for asynchronous IO operation

    // Different processing is performed according to different asynchronous IO operation types
    switch (iocb->aio_lio_opcode) {
    case IOCB_CMD_PREAD: // Asynchronous read operation
        ...
        ret = -EINVAL;
        // When an asynchronous IO operation is initiated, different functions will be called according to different file systems:
        // For example, ext3 file system will call generic_file_aio_read function
        if (file->f_op->aio_read)
            ret = file->f_op->aio_read(req, buf, iocb->aio_nbytes, req->ki_pos);
        break;
    ...
    }
    ...
    // The asynchronous IO operation may already have completed inside aio_read, or it may have been added to the IO request queue.
    // If it was added to the IO request queue, return directly
    if (likely(-EIOCBQUEUED == ret)) return 0;

    aio_complete(req, ret, 0); // The IO operation has already completed, so call aio_complete to finish up
    return 0;
}

The code above is annotated in detail. To summarize, io_submit_one mainly does the following:

  • Calls fget to get the file object corresponding to the file handle.
  • Calls aio_get_req to get an asynchronous IO operation object of type kiocb, the structure analyzed earlier. aio_get_req also adds the object to the active_reqs queue of the asynchronous IO context.
  • Performs different processing depending on the type of asynchronous IO operation. For example, an asynchronous read calls the aio_read method of the file object. Each file system provides its own aio_read implementation; in the ext3 file system, for example, aio_read points to generic_file_aio_read.
  • If the asynchronous IO operation was added to the kernel's IO request queue, returns directly. Otherwise the IO operation has already completed, so aio_complete is called to finish up.

The flow of io_submit_one is shown in the figure:

So the main task of io_submit_one is to submit IO requests to the kernel.
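The branch on -EIOCBQUEUED is the heart of this function, and it can be modeled in a few lines. The sketch below is a user-space illustration only: struct op_model, submit_one_model, complete_model and the two start functions are hypothetical, and MODEL_EIOCBQUEUED is a stand-in for the kernel-internal constant.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_EIOCBQUEUED 529  /* stand-in for the kernel-internal EIOCBQUEUED */

struct op_model {
    /* Returns bytes transferred, a negative errno, or -MODEL_EIOCBQUEUED. */
    long (*start)(struct op_model *op);
    int completed;  /* set once a completion has been recorded */
    long result;
};

/* Plays the role of aio_complete: record the result of the operation. */
static void complete_model(struct op_model *op, long res)
{
    op->completed = 1;
    op->result = res;
}

/* Start the operation and complete it right away unless it was queued. */
static int submit_one_model(struct op_model *op)
{
    assert(op != NULL && op->start != NULL);

    long ret = op->start(op);
    if (ret == -MODEL_EIOCBQUEUED)
        return 0;               /* queued: completion is reported later */
    complete_model(op, ret);    /* finished or failed synchronously */
    return 0;
}

/* Example: data already available, so the read finishes immediately. */
static long start_cached(struct op_model *op) { (void)op; return 4096; }

/* Example: the request has to wait for the device. */
static long start_queued(struct op_model *op) { (void)op; return -MODEL_EIOCBQUEUED; }
```

In both cases the submitter sees success; the difference is only whether the completion has already been recorded when the submit call returns.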

Asynchronous IO operation completed

When an asynchronous IO operation completes, the kernel calls aio_complete to put the result into ring_info, the ring buffer of the asynchronous IO context. Let's analyze the implementation of aio_complete:

int aio_complete(struct kiocb *iocb, long res, long res2)
{
    struct kioctx *ctx = iocb->ki_ctx;
    struct aio_ring_info *info;
    struct aio_ring *ring;
    struct io_event *event;
    unsigned long flags;
    unsigned long tail;
    int ret;
    ...
    info = &ctx->ring_info; // Ring buffer object

    spin_lock_irqsave(&ctx->ctx_lock, flags);         // Lock asynchronous IO context
    ring = kmap_atomic(info->ring_pages[0], KM_IRQ1); // Virtual memory address mapping for memory pages

    tail = info->tail;                           // The next free location in the ring buffer
    event = aio_ring_event(info, tail, KM_IRQ0); // Get the free location from the ring buffer and save the result
    tail = (tail + 1) % info->nr;                // Update next free location

    // Save asynchronous IO results to ring buffer
    event->obj = (u64)(unsigned long)iocb->ki_user_obj;
    event->data = iocb->ki_user_data;
    event->res = res;
    event->res2 = res2;
    ...
    info->tail = tail;
    ring->tail = tail; // Update the next free location of the ring buffer

    put_aio_ring_event(event, KM_IRQ0); // Unmap virtual memory address
    kunmap_atomic(ring, KM_IRQ1);       // Unmap virtual memory address

    // Release asynchronous IO object
    ret = __aio_put_req(ctx, iocb);
    spin_unlock_irqrestore(&ctx->ctx_lock, flags);
    ...
    return ret;
}

The iocb parameter of aio_complete is the asynchronous IO object submitted by io_submit_one, and the res and res2 parameters are the results returned by the kernel after the IO operation completes.

aio_complete mainly does the following:

  • Gets a free io_event object at the tail position of the ring buffer and saves the result of the IO operation in it.
  • Advances the tail position of the ring buffer by one, wrapping around at the end, so that it points to the next free slot.
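The two steps can be sketched as a user-space producer. This is a simplified model with invented names (struct done_ring, struct done_event, ring_complete); it leaves out the locking and page mapping of the real function and performs no overflow check, on the assumption that the caller bounds the number of outstanding requests.

```c
#include <assert.h>
#include <stdint.h>

#define DONE_RING_SLOTS 4

/* Reduced version of io_event: user data plus result. */
struct done_event {
    uint64_t data;
    int64_t res;
};

struct done_ring {
    unsigned nr;    /* number of usable slots */
    unsigned head;  /* next slot to read */
    unsigned tail;  /* next free slot to write */
    struct done_event events[DONE_RING_SLOTS];
};

/* Store one result at tail, then advance tail with wrap-around. */
static void ring_complete(struct done_ring *r, uint64_t data, int64_t res)
{
    assert(r->nr > 0 && r->nr <= DONE_RING_SLOTS && r->tail < r->nr);

    struct done_event *ev = &r->events[r->tail];
    ev->data = data;
    ev->res = res;
    r->tail = (r->tail + 1) % r->nr;
}
```

Writing the event before moving tail matters: a reader that sees the new tail must also be able to see the finished event in the slot before it.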

Once the result of an asynchronous IO operation has been saved to the ring buffer, user space can call io_getevents to read it. io_getevents eventually calls the kernel function sys_io_getevents.

Let's analyze the implementation of sys_io_getevents:

asmlinkage long sys_io_getevents(aio_context_t ctx_id,
                                 long min_nr,
                                 long nr,
                                 struct io_event *events,
                                 struct timespec *timeout)
{
    struct kioctx *ioctx = lookup_ioctx(ctx_id);
    long ret = -EINVAL;
    ...
    if (likely(NULL != ioctx)) {
        // Call read_events to read the results of the IO operations
        ret = read_events(ioctx, min_nr, nr, events, timeout);
        put_ioctx(ioctx);
    }
    return ret;
}

As the code above shows, sys_io_getevents mainly calls read_events to read the results of asynchronous IO operations. Next, let's analyze read_events:

static int read_events(struct kioctx *ctx,
                      long min_nr, long nr,
                      struct io_event *event,
                      struct timespec *timeout)
{
    long start_jiffies = jiffies;
    struct task_struct *tsk = current;
    DECLARE_WAITQUEUE(wait, tsk);
    int ret;
    int i = 0;
    struct io_event ent;
    struct aio_timeout to;

    memset(&ent, 0, sizeof(ent));
    ret = 0;

    while (likely(i < nr)) {
        ret = aio_read_evt(ctx, &ent); // Read an IO processing result from the ring buffer
        if (unlikely(ret <= 0))        // If the ring buffer has no IO processing result, exit the loop
            break;

        ret = -EFAULT;
        // Copy IO processing results to user space
        if (unlikely(copy_to_user(event, &ent, sizeof(ent)))) {
            break;
        }

        ret = 0;
        event++;
        i++;
    }

    if (min_nr <= i)
        return i;
    if (ret)
        return ret;
    ...
}

read_events mainly calls aio_read_evt to read the results of asynchronous IO operations from the ring buffer. Each result that is read successfully is copied to user space.

aio_read_evt reads one result from the ring buffer. Its implementation is as follows:

static int aio_read_evt(struct kioctx *ioctx, struct io_event *ent)
{
    struct aio_ring_info *info = &ioctx->ring_info;
    struct aio_ring *ring;
    unsigned long head;
    int ret = 0;

    ring = kmap_atomic(info->ring_pages[0], KM_USER0);

    // If the head pointer of the ring buffer is equal to the tail pointer, it means that the ring buffer is empty, so it is returned directly
    if (ring->head == ring->tail) 
        goto out;

    spin_lock(&info->ring_lock);

    head = ring->head % info->nr;
    if (head != ring->tail) {
        // Read the result from the ring buffer according to the head pointer of the ring buffer
        struct io_event *evp = aio_ring_event(info, head, KM_USER1);

        *ent = *evp;                  // Save the results to the ent parameter
        head = (head + 1) % info->nr; // Move the head pointer of the ring buffer to the next position
        ring->head = head;            // Save the head pointer of the ring buffer
        ret = 1;
        put_aio_ring_event(evp, KM_USER1);
    }

    spin_unlock(&info->ring_lock);

out:
    kunmap_atomic(ring, KM_USER0);
    return ret;
}

The main job of aio_read_evt is to check whether the ring buffer is empty. If it is not, it reads one asynchronous IO result from the ring buffer, saves it to the ent parameter, and moves the head position of the ring buffer to the next slot.
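The consumer side can be modeled the same way in user space. The sketch below uses invented names (struct read_ring, struct read_event, ring_read_evt) and omits the spin lock and page mapping of the kernel function; it only shows the empty check and the head update.

```c
#include <assert.h>
#include <stdint.h>

#define READ_RING_SLOTS 4

/* Reduced version of io_event: user data plus result. */
struct read_event {
    uint64_t data;
    int64_t res;
};

struct read_ring {
    unsigned nr;    /* number of usable slots */
    unsigned head;  /* next slot to read */
    unsigned tail;  /* next free slot to write */
    struct read_event events[READ_RING_SLOTS];
};

/* Returns 1 and fills *out if a result was available, 0 if the ring is empty. */
static int ring_read_evt(struct read_ring *r, struct read_event *out)
{
    assert(r->nr > 0 && r->nr <= READ_RING_SLOTS);

    if (r->head == r->tail)
        return 0;                   /* nothing has completed yet */

    unsigned head = r->head % r->nr;
    *out = r->events[head];         /* copy the result out */
    r->head = (head + 1) % r->nr;   /* free the slot for the producer */
    return 1;
}
```

Calling ring_read_evt in a loop until it returns 0 corresponds to the draining loop in read_events.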

Summary

This article analyzed the principle and implementation of Linux native AIO. To avoid getting lost in implementation details, it did not cover how the disk IO itself is carried out. Disk IO is nevertheless an indispensable part of the AIO implementation, so interested readers can study it by reading the Linux source code.

Keywords: C++ Linux

Added by snday on Fri, 04 Mar 2022 18:18:52 +0200