Linux kernel eBPF Foundation: perf (5) perf_event_open system call kernel source code analysis

Rong Tao May 19, 2021

1. perf_event_open system call

For details, see <Linux kernel eBPF Foundation: perf (4) perf_event_open system call and user manual>.

#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>

int perf_event_open(struct perf_event_attr *attr,
                   pid_t pid, int cpu, int group_fd,
                   unsigned long flags);

1.1. pid

The parameter pid allows events to be attached to processes in various ways.

  • If pid is 0, the measurement is performed on the calling thread;
  • If pid is greater than 0, the process indicated by pid is measured;
  • If pid is -1, all processes are measured.

1.2. cpu

The CPU parameter allows the measurement to be CPU specific.

  • If cpu >= 0, the measurement is restricted to the specified CPU;
  • If cpu == -1, the event is measured on all CPUs.

Note that the combination of pid == -1 and cpu == -1 is invalid.

  • The pid > 0 and cpu == -1 setting measures the specified process on whatever CPU it is scheduled to. Per-process events can be created by any user.
  • The pid == -1 and cpu >= 0 setting is per-CPU and measures all processes on the specified CPU. A per-CPU event requires the CAP_SYS_ADMIN capability or a /proc/sys/kernel/perf_event_paranoid value of less than 1 (a sketch for checking this follows the list). See the chapter on perf_event-related configuration files.
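
A quick way to check the current setting from user space is to read the procfs file directly. A minimal sketch (the helper name is ours, not from the kernel or the perf tool):

#include <stdio.h>

/* Read /proc/sys/kernel/perf_event_paranoid; a value below 1 allows
 * per-CPU (pid == -1, cpu >= 0) events without CAP_SYS_ADMIN. */
static int perf_event_paranoid(void)
{
    int val = 2; /* a common default, used if the file cannot be read */
    FILE *f = fopen("/proc/sys/kernel/perf_event_paranoid", "r");

    if (f) {
        if (fscanf(f, "%d", &val) != 1)
            val = 2;
        fclose(f);
    }
    return val;
}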

1.3. group_fd

The group_fd parameter allows the creation of event groups. An event group has one event that is the group leader. The leader is created first, with group_fd = -1; the remaining group members are created by subsequent perf_event_open() calls with group_fd set to the fd of the group leader. (An event created alone with group_fd = -1 is considered a group with a single member.) An event group is scheduled onto the CPU as a unit: it is put on the CPU only if all events in the group can be placed there. This means the values of the member events can be meaningfully compared with each other, added, divided (to obtain ratios), and so on, because they counted events over the same set of executed instructions. A minimal sketch of the pattern follows.
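
A minimal sketch of the leader/member pattern (the helper and the event choices are ours for illustration; the full working example is in section 2):

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Open one hardware counter; group_fd == -1 makes it a group leader,
 * otherwise it joins the leader's group. */
static int open_counter(__u64 config, int group_fd)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = (group_fd == -1); /* only the leader starts disabled */
    return syscall(__NR_perf_event_open, &attr, 0 /*pid*/, -1 /*cpu*/,
                   group_fd, 0 /*flags*/);
}

/* Usage:
 *   int leader = open_counter(PERF_COUNT_HW_CPU_CYCLES, -1);
 *   int member = open_counter(PERF_COUNT_HW_INSTRUCTIONS, leader);
 */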

1.4. flags

#define PERF_FLAG_FD_NO_GROUP		(1UL << 0)
#define PERF_FLAG_FD_OUTPUT		(1UL << 1)
#define PERF_FLAG_PID_CGROUP		(1UL << 2) /* pid=cgroup id, per-cpu mode only */
#define PERF_FLAG_FD_CLOEXEC		(1UL << 3) /* O_CLOEXEC */

These are explained in the system call man page:

  • PERF_FLAG_FD_NO_GROUP: this flag allows an event to be created as part of an event group without a leader. It is unclear why this is useful.
  • PERF_FLAG_FD_OUTPUT: this flag reroutes the event's output to the group leader.
  • PERF_FLAG_PID_CGROUP: this flag activates system-wide monitoring per container. A container is an abstraction that isolates a set of resources for finer-grained control (CPU, memory, etc.). In this mode, the event is measured only if the thread running on the monitored CPU belongs to the specified container (cgroup). The cgroup is identified by passing a file descriptor opened on its directory in the cgroupfs file system. For example, if the cgroup to be monitored is called test, a file descriptor opened on /dev/cgroup/test (assuming cgroupfs is mounted on /dev/cgroup) must be passed as the pid parameter. cgroup monitoring applies only to system-wide events, so additional permissions may be required. (Container-related content is not discussed in this article.)
  • PERF_FLAG_FD_CLOEXEC: equivalent to opening a file with the O_CLOEXEC flag, i.e. the same effect as setting FD_CLOEXEC with fcntl(): the fd is closed in a forked child before the child loads a new executable with an exec-family system call. (When tracing with strace perf stat ls, perf_event_open is called with PERF_FLAG_FD_CLOEXEC; a minimal usage sketch follows.)
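
For illustration, a call in the style the perf tool uses (assuming attr is an initialized struct perf_event_attr as in the earlier sketch):

    /* Same as the calls above, but the returned fd is close-on-exec. */
    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1,
                     PERF_FLAG_FD_CLOEXEC);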

2. Test example

https://github.com/Rtoax/test/tree/master/c/glibc/linux/perf_event

/*
https://stackoverflow.com/questions/42088515/perf-event-open-how-to-monitoring-multiple-events

perf stat -e cycles,faults ls

*/
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/mman.h>   /* mlock() */
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
#include <asm/unistd.h>
#include <errno.h>
#include <stdint.h>
#include <inttypes.h>

struct read_format {
    uint64_t nr;
    struct {
        uint64_t value;
        uint64_t id;
    } values[];
};

void do_malloc() {

    int i;
    char* ptr;
    int len = 2*1024*1024;
    ptr = malloc(len);
    
    mlock(ptr, len);
    
    for (i = 0; i < len; i++) {
        ptr[i] = (char) (i & 0xff); // pagefault
    }
    
    free(ptr);
}

void do_ls() {
    system("/bin/ls");
}

void do_something(int something) {
    
    switch(something) {
    case 1:
        do_ls();
        break;
    case 0:
    default:
        do_malloc();
        break;
    }
}

int create_hardware_perf(int grp_fd, enum perf_hw_id hw_ids, uint64_t *ioc_id)
{
    if(PERF_COUNT_HW_MAX <= hw_ids || hw_ids < 0) {
        printf("Unsupport enum perf_hw_id.\n");
        return -1;
    }
    
    struct perf_event_attr pea;
    
    memset(&pea, 0, sizeof(struct perf_event_attr));
    pea.type = PERF_TYPE_HARDWARE;
    pea.size = sizeof(struct perf_event_attr);
    pea.config = hw_ids;
    pea.disabled = 1;
    pea.exclude_kernel = 1;
    pea.exclude_hv = 1;
    pea.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
    int fd = syscall(__NR_perf_event_open, &pea, 0, -1, grp_fd>2?grp_fd:-1, 0);
    ioctl(fd, PERF_EVENT_IOC_ID, ioc_id);

    return fd;
}

int create_software_perf(int grp_fd, enum perf_sw_ids sw_ids, uint64_t *ioc_id)
{
    if(PERF_COUNT_SW_MAX <= sw_ids || sw_ids < 0) {
        printf("Unsupport enum perf_sw_ids.\n");
        return -1;
    }

    struct perf_event_attr pea;
    
    memset(&pea, 0, sizeof(struct perf_event_attr));
    pea.type = PERF_TYPE_SOFTWARE;
    pea.size = sizeof(struct perf_event_attr);
    pea.config = sw_ids;
    pea.disabled = 1;
    pea.exclude_kernel = 1;
    pea.exclude_hv = 1;
    pea.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
    int fd = syscall(__NR_perf_event_open, &pea, 0, -1, grp_fd>2?grp_fd:-1 /*!!!*/, 0);
    ioctl(fd, PERF_EVENT_IOC_ID, ioc_id);

    return fd;
}


int main(int argc, char* argv[]) 
{
    struct perf_event_attr pea;
    
    int group_fd, fd2, fd3, fd4, fd5;
    uint64_t id1, id2, id3, id4, id5;
    uint64_t val1, val2, val3, val4, val5;
    char buf[4096];
    struct read_format* rf = (struct read_format*) buf;
    int i;

    group_fd = create_hardware_perf(-1, PERF_COUNT_HW_CPU_CYCLES, &id1);
    
    fd2 = create_hardware_perf(group_fd, PERF_COUNT_HW_CACHE_MISSES, &id2);
    fd3 = create_software_perf(group_fd, PERF_COUNT_SW_PAGE_FAULTS, &id3);
    fd4 = create_software_perf(group_fd, PERF_COUNT_SW_CPU_CLOCK, &id4);
    fd5 = create_software_perf(group_fd, PERF_COUNT_SW_TASK_CLOCK, &id5); /* task clock, matching the printout */

    printf("ioctl %ld, %ld, %ld, %ld, %ld\n", id1, id2, id3, id4, id5);

    ioctl(group_fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(group_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

    do_something(-1);
    
    ioctl(group_fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);


    read(group_fd, buf, sizeof(buf));
    for (i = 0; i < rf->nr; i++) {
        if (rf->values[i].id == id1) {
            val1 = rf->values[i].value;
        } else if (rf->values[i].id == id2) {
            val2 = rf->values[i].value;
        } else if (rf->values[i].id == id3) {
            val3 = rf->values[i].value;
        } else if (rf->values[i].id == id4) {
            val4 = rf->values[i].value;
        } else if (rf->values[i].id == id5) {
            val5 = rf->values[i].value;
        }
    }

    printf("cpu cycles:     %"PRIu64"\n", val1);
    printf("cache misses:   %"PRIu64"\n", val2);
    printf("page faults:    %"PRIu64"\n", val3);
    printf(" cpu clock:     %"PRIu64"\n", val4);
    printf("task clock:     %"PRIu64"\n", val5);

    close(group_fd);
    close(fd2);
    close(fd3);
    close(fd4);
    close(fd5);

    return 0;
}

Program output:

[rongtao@localhost perf_event]$ ./a.out 
ioctl 2640, 2641, 2642, 2643, 2644
cpu cycles:     12996145
cache misses:   1135
page faults:    518
 cpu clock:     5417726
task clock:     5393251

As the program shows, five fds are created, and the last four are attached to the first to form a group monitoring the counters above. The source analysis in the next chapter introduces how this information is obtained from the kernel. Note that the memory locking in the code has no effect on the page-fault count, since the 2 MiB buffer still has to be faulted in, page by page, inside the measured region:

    mlock(ptr, len);
    
    for (i = 0; i < len; i++) {
        ptr[i] = (char) (i & 0xff); // pagefault
    }

3. perf_event_open source analysis

3.1. How to call perf_event_open

The code is as follows:

    struct perf_event_attr pea;
    uint64_t id1;
    
    memset(&pea, 0, sizeof(struct perf_event_attr));
    pea.type = PERF_TYPE_HARDWARE;
    pea.size = sizeof(struct perf_event_attr);
    pea.config = PERF_COUNT_HW_CPU_CYCLES;
    pea.disabled = 1;
    pea.exclude_kernel = 1;
    pea.exclude_hv = 1;
    pea.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
    int fd = syscall(__NR_perf_event_open, &pea, getpid(), -1, -1, 0);
    ioctl(fd, PERF_EVENT_IOC_ID, &id1);

The pid here is the PID of the current process (0 could also be used), cpu is -1, group_fd is -1, and flags is 0. (In the perf command, flags is PERF_FLAG_FD_CLOEXEC; in the kernel source:)

	if (flags & PERF_FLAG_FD_CLOEXEC)
		f_flags |= O_CLOEXEC;

So the input parameters above measure the current process on whatever CPU the process is scheduled to (pid > 0 and cpu == -1). The syscall then traps into kernel mode.

3.2. How to handle input parameters

3.2.1. struct perf_event_attr

First, copy from user space to kernel space:

err = perf_copy_attr(attr_uptr, &attr);

Internally it accounts for the differing sizes of struct perf_event_attr across kernel versions:

	if (!size)
		size = PERF_ATTR_SIZE_VER0;
	if (size < PERF_ATTR_SIZE_VER0 || size > PAGE_SIZE)
		goto err_size;

Then copy_struct_from_user() performs the copy, after which the parameters are checked:

	if (attr->__reserved_1 || attr->__reserved_2 || attr->__reserved_3)
		return -EINVAL;

	if (attr->sample_type & ~(PERF_SAMPLE_MAX-1))
		return -EINVAL;

	if (attr->read_format & ~(PERF_FORMAT_MAX-1))
		return -EINVAL;
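
For user space, the practical consequence is to zero the structure and set attr.size to the compiled sizeof: an older kernel rejects a larger struct only if the extra bytes are nonzero (E2BIG), and a newer kernel zero-extends a smaller one. A sketch of the convention (comments are ours):

    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));             /* unused fields must be 0 */
    attr.size = sizeof(struct perf_event_attr); /* lets the kernel handle
                                                   version differences */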

Next the sample_type field is checked (see <Linux kernel eBPF Foundation: perf (4) perf_event_open system call and user manual>). Some notable bits:

  • PERF_SAMPLE_BRANCH_STACK: (since Linux 3.4) provides a record of recent branches, as supplied by CPU branch-sampling hardware (such as Intel Last Branch Record). Not all hardware supports this feature. For how to filter the reported branches, see the branch_sample_type field.
  • PERF_SAMPLE_REGS_USER: (since Linux 3.7) records the current user-level CPU register state (the values in the process before it called into the kernel).
  • PERF_SAMPLE_STACK_USER: (since Linux 3.7) records the user-level stack, allowing the stack to be unwound.
  • PERF_SAMPLE_REGS_INTR
  • PERF_SAMPLE_CGROUP

3.2.2. pid and cpu

If the PERF_FLAG_PID_CGROUP flag is set (it activates system-wide monitoring per container), then neither pid nor cpu may be -1.

	/*
	 * In cgroup mode, the pid argument is used to pass the fd
	 * opened to the cgroup directory in cgroupfs. The cpu argument
	 * designates the cpu on which to monitor threads from that
	 * cgroup.
	 */
	if ((flags & PERF_FLAG_PID_CGROUP) && (pid == -1 || cpu == -1))
		return -EINVAL;

If PERF_FLAG_PID_CGROUP is not set and pid != -1, the task_struct is looked up:

static struct task_struct *
find_lively_task_by_vpid(pid_t vpid)    /* Get task_struct */
{
	struct task_struct *task;

	rcu_read_lock();
	if (!vpid)
		task = current; /* Current process */
	else
		task = find_task_by_vpid(vpid); /* lookup */
	if (task)
		get_task_struct(task);  /* Reference count */
	rcu_read_unlock();

	if (!task)
		return ERR_PTR(-ESRCH);

	return task;
}

3.2.3. group_fd

If group_fd != -1 is set, get the struct fd structure:

static inline int perf_fget_light(int fd, struct fd *p)
{
	struct fd f = fdget(fd);
	if (!f.file)
		return -EBADF;

	if (f.file->f_op != &perf_fops) {
		fdput(f);
		return -EBADF;
	}
	*p = f;
	return 0;
}

Note that the file operations are perf_fops (this will be useful later):

static const struct file_operations perf_fops = {
	.llseek			= no_llseek,
	.release		= perf_release,
	.read			= perf_read,
	.poll			= perf_poll,
	.unlocked_ioctl		= perf_ioctl,
	.compat_ioctl		= perf_compat_ioctl,
	.mmap			= perf_mmap,
	.fasync			= perf_fasync,
};

3.2.4. flags

A series of permission checks is not discussed in detail here; see <Linux kernel eBPF Foundation: perf (4) perf_event_open system call and user manual>.

#define PERF_FLAG_FD_NO_GROUP		(1UL << 0)
#define PERF_FLAG_FD_OUTPUT		(1UL << 1)
#define PERF_FLAG_PID_CGROUP		(1UL << 2) /* pid=cgroup id, per-cpu mode only */
#define PERF_FLAG_FD_CLOEXEC		(1UL << 3) /* O_CLOEXEC */

These flags are explained in the system call man page; see their full descriptions in section 1.4 above.

3.3. perf_event_alloc

	event = perf_event_alloc(&attr, cpu, task, group_leader, NULL,
				 NULL, NULL, cgroup_fd);

A struct perf_event is allocated with kzalloc, followed by a series of initializations. Points to note:

  • overflow_handler is NULL here, which determines the write direction of the ring buffer.
	if (overflow_handler) {
		event->overflow_handler	= overflow_handler;
		event->overflow_handler_context = context;
	} else if (is_write_backward(event)){/* Write ring buffer from end to beginning */
		event->overflow_handler = perf_event_output_backward;
		event->overflow_handler_context = NULL;
	} else {
		event->overflow_handler = perf_event_output_forward;
		event->overflow_handler_context = NULL;
	}
  • The event state, event->state:
/*
 * Initialize event state based on the perf_event_attr::disabled.
 */
static inline void perf_event__state_init(struct perf_event *event)
{
	event->state = event->attr.disabled ? PERF_EVENT_STATE_OFF :
					      PERF_EVENT_STATE_INACTIVE;
}
  • and so on.

3.3.1. perf_init_event

3.3.1.1. perf_try_init_event

This function calls the PMU's corresponding operator pmu->event_init(event); (see <Linux kernel eBPF Foundation: perf (2): registration of perf performance management unit PMU>), for example:

//kernel/events/core.c
static struct pmu/* Performance monitoring unit */ perf_swevent = {
	.task_ctx_nr	= perf_sw_context,

	.capabilities	= PERF_PMU_CAP_NO_NMI,

	.event_init	= perf_swevent_init,
	.add		= perf_swevent_add,
	.del		= perf_swevent_del,
	.start		= perf_swevent_start,
	.stop		= perf_swevent_stop,
	.read		= perf_swevent_read,
};

perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);
//kernel/events/core.c
static struct pmu perf_cpu_clock = {
	.task_ctx_nr	= perf_sw_context,

	.capabilities	= PERF_PMU_CAP_NO_NMI,

	.event_init	= cpu_clock_event_init,
	.add		= cpu_clock_event_add,
	.del		= cpu_clock_event_del,
	.start		= cpu_clock_event_start,
	.stop		= cpu_clock_event_stop,
	.read		= cpu_clock_event_read,
};

perf_pmu_register(&perf_cpu_clock, NULL, -1);

3.3.1.1.1. perf_swevent->perf_swevent_init

See <Linux kernel eBPF Foundation: perf (2): registration of perf performance management unit PMU>.

3.4. find_get_context

Get the target context, a struct perf_event_context, from the task or from a per-CPU variable. If the task passed in is NULL:

	if (!task) {
		/* Must be root to operate on a CPU event: */
		err = perf_allow_cpu(&event->attr);
		if (err)
			return ERR_PTR(err);

		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
		ctx = &cpuctx->ctx;
		get_ctx(ctx);
		++ctx->pin_count;

		return ctx;
	}

3.4.1. alloc_perf_context

static struct perf_event_context *
alloc_perf_context(struct pmu *pmu, struct task_struct *task)
{
	struct perf_event_context *ctx;

	ctx = kzalloc(sizeof(struct perf_event_context), GFP_KERNEL);
	if (!ctx)
		return NULL;

	__perf_event_init_context(ctx);
	if (task)
		ctx->task = get_task_struct(task);
	ctx->pmu = pmu;

	return ctx;
}

3.4.1.1. __perf_event_init_context

/*
 * Initialize the perf_event context in a task_struct:
 */
static void __perf_event_init_context(struct perf_event_context *ctx)   /* Initialize CPU ctx */
{
	raw_spin_lock_init(&ctx->lock);
	mutex_init(&ctx->mutex);
	INIT_LIST_HEAD(&ctx->active_ctx_list);
	perf_event_groups_init(&ctx->pinned_groups);
	perf_event_groups_init(&ctx->flexible_groups);
	INIT_LIST_HEAD(&ctx->event_list);
	INIT_LIST_HEAD(&ctx->pinned_active);
	INIT_LIST_HEAD(&ctx->flexible_active);
	refcount_set(&ctx->refcount, 1);
}

Here our group_fd is -1, so none of the if (group_leader) branches are taken; the next system call will pass in an fd as the leader.

3.5. perf_event_set_output

Since our output_event is NULL, this is skipped; it is only shown here first. This interface mainly deals with the ring buffer, struct perf_buffer.

	if (output_event) {
		err = perf_event_set_output(event, output_event);
		if (err)
			goto err_context;
	}

3.6. anon_inode_getfile

Create a new struct file instance, attach it to the anonymous inode, and a dentry to describe it. The perf_fops file operations are used here as well; this interface is not detailed further. The file is bound to an fd, which is returned to user space.

3.7. perf_event_validate_size

static bool perf_event_validate_size(struct perf_event *event)
{
	/*
	 * The values computed here will be over-written when we actually
	 * attach the event.
	 */
	__perf_event_read_size(event, event->group_leader->nr_siblings + 1);
	__perf_event_header_size(event, event->attr.sample_type & ~PERF_SAMPLE_READ);
	perf_event__id_header_size(event);

	/*
	 * Sum the lot; should not exceed the 64k limit we have on records.
	 * Conservative limit to allow for callchains and other variable fields.
	 */
	if (event->read_size + event->header_size +
	    event->id_header_size + sizeof(struct perf_event_header) >= 16*1024)
		return false;

	return true;
}

3.7.1. __perf_event_read_size

See <Linux kernel eBPF Foundation: perf (4) perf_event_open system call and user manual>:

static void __perf_event_read_size(struct perf_event *event, int nr_siblings)
{
	int entry = sizeof(u64); /* value */
	int size = 0;
	int nr = 1;

	if (event->attr.read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
		size += sizeof(u64);

	if (event->attr.read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
		size += sizeof(u64);

	if (event->attr.read_format & PERF_FORMAT_ID)
		entry += sizeof(u64);

	if (event->attr.read_format & PERF_FORMAT_GROUP) {
		nr += nr_siblings;
		size += sizeof(u64);
	}

	size += entry * nr;
	event->read_size = size;
}
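
For the read_format used in our test program (PERF_FORMAT_GROUP | PERF_FORMAT_ID) with one leader and four siblings, this works out to: entry = 8 + 8 = 16 bytes (value plus id), nr = 1 + 4 = 5, and size = 8 (the nr field contributed by PERF_FORMAT_GROUP) + 5 * 16 = 88 bytes, matching the struct read_format layout read in section 2.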

3.7.2. __perf_event_header_size

static void __perf_event_header_size(struct perf_event *event, u64 sample_type)
{
	struct perf_sample_data *data;
	u16 size = 0;

	if (sample_type & PERF_SAMPLE_IP)
		size += sizeof(data->ip);

	if (sample_type & PERF_SAMPLE_ADDR)
		size += sizeof(data->addr);

	if (sample_type & PERF_SAMPLE_PERIOD)
		size += sizeof(data->period);

	if (sample_type & PERF_SAMPLE_WEIGHT)
		size += sizeof(data->weight);

	if (sample_type & PERF_SAMPLE_READ)
		size += event->read_size;

	if (sample_type & PERF_SAMPLE_DATA_SRC)
		size += sizeof(data->data_src.val);

	if (sample_type & PERF_SAMPLE_TRANSACTION)
		size += sizeof(data->txn);

	if (sample_type & PERF_SAMPLE_PHYS_ADDR)
		size += sizeof(data->phys_addr);

	if (sample_type & PERF_SAMPLE_CGROUP)
		size += sizeof(data->cgroup);

	event->header_size = size;
}

3.7.3. perf_event__id_header_size

static void perf_event__id_header_size(struct perf_event *event)
{
	struct perf_sample_data *data;
	u64 sample_type = event->attr.sample_type;
	u16 size = 0;

	if (sample_type & PERF_SAMPLE_TID)
		size += sizeof(data->tid_entry);

	if (sample_type & PERF_SAMPLE_TIME)
		size += sizeof(data->time);

	if (sample_type & PERF_SAMPLE_IDENTIFIER)
		size += sizeof(data->id);

	if (sample_type & PERF_SAMPLE_ID)
		size += sizeof(data->id);

	if (sample_type & PERF_SAMPLE_STREAM_ID)
		size += sizeof(data->stream_id);

	if (sample_type & PERF_SAMPLE_CPU)
		size += sizeof(data->cpu_entry);

	event->id_header_size = size;
}

3.7.4. Size limit

The total must not exceed 16*1024 bytes:

	if (event->read_size + event->header_size +
	    event->id_header_size + sizeof(struct perf_event_header) >= 16*1024)
		return false;

3.8. perf_install_in_context

	/*
	 * Precalculate sample_data sizes; do while holding ctx::mutex such
	 * that we're serialized against further additions and before
	 * perf_install_in_context() which is the point the event is active and
	 * can use these values.
	 */
	perf_event__header_size(event);
	perf_event__id_header_size(event);

	event->owner = current;

	perf_install_in_context(ctx, event, event->cpu);
	perf_unpin_context(ctx);

	if (move_group)
		perf_event_ctx_unlock(group_leader, gctx);
	mutex_unlock(&ctx->mutex);

	if (task) {
		up_read(&task->signal->exec_update_lock);
		put_task_struct(task);
	}

3.8.1. __perf_install_in_context

TODO

3.9. Add to process linked list

	mutex_lock(&current->perf_event_mutex);
	list_add_tail(&event->owner_entry, &current->perf_event_list);
	mutex_unlock(&current->perf_event_mutex);

3.10. Return to user space

At this point the system call returns to user space. This fd is the group_fd in our test case; subsequent fds will join it to form a perf group.

3.11. Calling perf_event_open again

    struct perf_event_attr pea;
    uint64_t id2;
    
    memset(&pea, 0, sizeof(struct perf_event_attr));
    pea.type = PERF_TYPE_HARDWARE; /* cache misses is a hardware event */
    pea.size = sizeof(struct perf_event_attr);
    pea.config = PERF_COUNT_HW_CACHE_MISSES;
    pea.disabled = 1;
    pea.exclude_kernel = 1;
    pea.exclude_hv = 1;
    pea.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
    int fd = syscall(__NR_perf_event_open, &pea, getpid(), -1, grp_fd, 0);
    ioctl(fd, PERF_EVENT_IOC_ID, &id2);

This time, in the system call, group_leader is not empty:

	if (group_fd != -1) {
		err = perf_fget_light(group_fd, &group);
		if (err)
			goto err_fd;
		group_leader = group.file->private_data;

The differences this time are:

  • In perf_event_alloc, event->group_leader = group_leader is assigned;
  • Branches that involve group_leader:
    • pmu = group_leader->ctx->pmu; (distinguishing software vs. hardware events);
    • time accounting is attached to the leader;
    • operations on the sibling events;
    • perf_event_set_output.

4. File operations perf_fops

static const struct file_operations perf_fops = {
	.llseek			= no_llseek,
	.release		= perf_release,
	.read			= perf_read,
	.poll			= perf_poll,
	.unlocked_ioctl	= perf_ioctl,
	.compat_ioctl	= perf_compat_ioctl,
	.mmap			= perf_mmap,
	.fasync			= perf_fasync,
};
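
For example, perf_poll is what backs poll(2) on an event fd: with sampling configured, user space can block until the kernel signals that new data is available in the mmap'd ring buffer. A minimal sketch (assuming an already-opened, sampling event fd):

#include <poll.h>

/* Block up to timeout_ms for new data on a perf event fd.
 * Returns >0 if readable, 0 on timeout, <0 on error. */
static int wait_for_samples(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    return poll(&pfd, 1, timeout_ms);
}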

5. File operation ioctl

The operations supported by ioctl are:

/*
 * Ioctls that can be done on a perf event fd:
 */
#define PERF_EVENT_IOC_ENABLE			_IO ('$', 0)
#define PERF_EVENT_IOC_DISABLE			_IO ('$', 1)
#define PERF_EVENT_IOC_REFRESH			_IO ('$', 2)
#define PERF_EVENT_IOC_RESET			_IO ('$', 3)
#define PERF_EVENT_IOC_PERIOD			_IOW('$', 4, __u64)
#define PERF_EVENT_IOC_SET_OUTPUT		_IO ('$', 5)
#define PERF_EVENT_IOC_SET_FILTER		_IOW('$', 6, char *)
#define PERF_EVENT_IOC_ID			_IOR('$', 7, __u64 *)
#define PERF_EVENT_IOC_SET_BPF			_IOW('$', 8, __u32)
#define PERF_EVENT_IOC_PAUSE_OUTPUT		_IOW('$', 9, __u32)
#define PERF_EVENT_IOC_QUERY_BPF		_IOWR('$', 10, struct perf_event_query_bpf *)
#define PERF_EVENT_IOC_MODIFY_ATTRIBUTES	_IOW('$', 11, struct perf_event_attr *)

arg currently has only one item:

enum perf_event_ioc_flags {
	PERF_IOC_FLAG_GROUP		= 1U << 0,
};

In our test example, ioctl is used as follows:

    ioctl(group_fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(group_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
    do_something(-1);
    ioctl(group_fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

Its path in the kernel is (for 32-bit callers the entry point is perf_fops->compat_ioctl, i.e. perf_compat_ioctl):

ioctl
    perf_fops->unlocked_ioctl
        perf_ioctl
            _perf_ioctl

For our example, the simplified _perf_ioctl code is:

static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned long arg)
{
	void (*func)(struct perf_event *);
	u32 flags = arg;

	switch (cmd) {
	case PERF_EVENT_IOC_ENABLE:
		func = _perf_event_enable;
		break;
	case PERF_EVENT_IOC_DISABLE:
		func = _perf_event_disable;
		break;
	case PERF_EVENT_IOC_RESET:
		func = _perf_event_reset;
		break;
	default:
		return -ENOTTY;
	}
	if (flags & PERF_IOC_FLAG_GROUP)
		perf_event_for_each(event, func);
	else
		perf_event_for_each_child(event, func);

	return 0;
}

The flags we pass in include PERF_IOC_FLAG_GROUP, so perf_event_for_each is executed; compared with perf_event_for_each_child, this function has one extra level of sibling traversal:

static void perf_event_for_each_child(struct perf_event *event,
					void (*func)(struct perf_event *))
{
	struct perf_event *child;

	WARN_ON_ONCE(event->ctx->parent_ctx);

	mutex_lock(&event->child_mutex);
	func(event);
	list_for_each_entry(child, &event->child_list, child_list)
		func(child);
	mutex_unlock(&event->child_mutex);
}

static void perf_event_for_each(struct perf_event *event,
				  void (*func)(struct perf_event *))
{
	struct perf_event_context *ctx = event->ctx;
	struct perf_event *sibling;

	lockdep_assert_held(&ctx->mutex);

	event = event->group_leader;

	perf_event_for_each_child(event, func);
	for_each_sibling_event(sibling, event)
		perf_event_for_each_child(sibling, func);
}

Finally, func will be called directly, corresponding to:

  • PERF_EVENT_IOC_RESET: _perf_event_reset
  • PERF_EVENT_IOC_ENABLE: _perf_event_enable
  • PERF_EVENT_IOC_DISABLE: _perf_event_disable

In <Linux kernel eBPF Foundation: perf (2): registration of perf performance management unit PMU> I introduced how the underlying kernel path invokes the performance management unit (PMU).

5.1. _perf_event_reset

5.1.1. perf_event_update_userpage

For an introduction to struct perf_event_mmap_page, see <Linux kernel eBPF Foundation: perf (4) perf_event_open system call and user manual>.
In short, this function fills in struct perf_event_mmap_page. Note that arch_perf_update_userpage calls cyc2ns_read_begin to obtain the cycle-to-nanosecond clock data from the CPU.

5.2. _perf_event_enable

At its core, this calls event_function_call(event, __perf_event_enable, NULL); as described in <Linux kernel eBPF Foundation: perf (2): registration of perf performance management unit PMU>, the add function of the corresponding event's PMU is eventually called:

perf_event_enable
    _perf_event_enable
        __perf_event_enable
            ctx_sched_in
                ctx_flexible_sched_in|ctx_pinned_sched_in
                    merge_sched_in
                        group_sched_in
                            event_sched_in
                                event->pmu->add(event, PERF_EF_START)

Take perf_swevent as an example: ioctl(group_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP); ends up invoking event->pmu->add, and for PERF_TYPE_SOFTWARE events that is perf_swevent_add.

//kernel/events/core.c
static struct pmu/* Performance monitoring unit */ perf_swevent = {
	.task_ctx_nr	= perf_sw_context,

	.capabilities	= PERF_PMU_CAP_NO_NMI,

	.event_init	= perf_swevent_init,
	.add		= perf_swevent_add,
	.del		= perf_swevent_del,
	.start		= perf_swevent_start,
	.stop		= perf_swevent_stop,
	.read		= perf_swevent_read,
};

perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);

As noted in the section above, our group_fd is of type PERF_TYPE_HARDWARE.

5.3. _perf_event_disable

See <Linux kernel eBPF Foundation: perf (2): registration of perf performance management unit PMU> for details:

perf_event_disable
    _perf_event_disable
        __perf_event_disable
            group_sched_out
                event_sched_out
                    event->pmu->del(event, 0);

6. File operation mmap

The mmap bit of struct perf_event_attr enables recording of mmap events. The following structures are worth introducing first:

6.1. perf_mmap_vmops

static const struct vm_operations_struct perf_mmap_vmops = {
	.open		= perf_mmap_open,
	.close		= perf_mmap_close, /* non mergeable */
	.fault		= perf_mmap_fault,
	.page_mkwrite	= perf_mmap_fault,
};

6.2. Allocate memory

perf_mmap
    rb_alloc
        kzalloc
        vmalloc_user
        ring_buffer_init
    ring_buffer_attach
        list_add_rcu
    perf_event_init_userpage
    perf_event_update_userpage
        calc_timer_values
            __perf_update_times
        arch_perf_update_userpage
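
Seen from user space, the buffer that rb_alloc creates is obtained by calling mmap on the event fd with a length of 1 + 2^n pages: the first page is the struct perf_event_mmap_page metadata that perf_event_init_userpage/perf_event_update_userpage fill in, followed by the data pages. A minimal sketch (8 data pages chosen arbitrarily):

#include <sys/mman.h>
#include <unistd.h>
#include <linux/perf_event.h>

/* Map the metadata page plus 2^3 = 8 data pages of an event fd. */
static struct perf_event_mmap_page *map_ring(int fd)
{
    long page = sysconf(_SC_PAGESIZE);
    void *addr = mmap(NULL, (1 + 8) * page, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);

    if (addr == MAP_FAILED)
        return NULL;
    /* mp->data_head is where the kernel will write next; user space
     * updates mp->data_tail after consuming records. */
    return (struct perf_event_mmap_page *)addr;
}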

7. File operation read

perf_read

perf_read
    __perf_read
        perf_read_group
            __perf_read_group_add
                perf_event_read
                    __perf_event_read_cpu
                        topology_physical_package_id
                            (cpu_data(cpu).phys_proc_id)
                                per_cpu(cpu_info, cpu)
                    __perf_event_read
                        __get_cpu_context
                        perf_event_update_time
                            __perf_update_times
                        perf_event_update_sibling_time
                        pmu->start_txn(pmu, PERF_PMU_TXN_READ);    ->x86_pmu_start_txn
                        pmu->read(event);    -> x86_pmu_read
                        pmu->commit_txn(pmu); -> x86_pmu_commit_txn
            copy_to_user
        perf_read_one
            TODO
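
perf_read_one corresponds to reading an event that was not opened with PERF_FORMAT_GROUP: the read returns a single u64 counter value (plus optional time fields if requested). A sketch of the matching user-space read (assuming read_format = 0):

#include <stdint.h>
#include <unistd.h>

/* With read_format = 0, read() on the event fd yields one u64 count. */
static int read_count(int fd, uint64_t *count)
{
    return read(fd, count, sizeof(*count)) == sizeof(*count) ? 0 : -1;
}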

8. Page missing statistics PERF_COUNT_SW_PAGE_FAULTS

In earlier kernel versions the function handling page faults was do_page_fault; in 5.10.13 it is renamed exc_page_fault and defined with DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault). The perf_event-related code path in the 5.10.13 page-fault handler is as follows:

exc_page_fault
    handle_page_fault
        do_kern_addr_fault    - kernel-address faults are not counted (no perf hook found there)
        do_user_addr_fault
            perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); ######### perf entry
                if (static_key_false(&perf_swevent_enabled[event_id]))
                    __perf_sw_event(event_id, nr, regs, addr);
                        ___perf_sw_event
                            perf_sample_data_init
                                data->addr = addr;
                                data->raw  = NULL;
                                data->br_stack = NULL;
                                data->period = period;
                                data->weight = 0;
                                data->data_src.val = PERF_MEM_NA;
                                data->txn = 0;
                            do_perf_sw_event
                                perf_swevent_event
                                    local64_add(nr, &event->count);
                                    perf_swevent_overflow
                                        perf_swevent_set_period
                                        __perf_event_overflow
                                            __perf_event_account_interrupt #General event, overflow handling, sampling
                                                perf_adjust_period
                                                    perf_calculate_period
                                                        pmu->stop(event, PERF_EF_UPDATE);
                                                        pmu->start(event, PERF_EF_RELOAD);
                                            irq_work_queue
  • Function perf_swevent_set_period: event->count is incremented directly, and a second value is kept in event->hw.period_left to count the interval. This period value stays in the range [-sample_period, 0], so that the sign can be used as the trigger (a simplified sketch follows this list).
  • Function __perf_event_account_interrupt: generic event overflow handling and sampling.
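
A simplified model of the sign trick described above (illustration only, not the kernel code):

#include <stdint.h>

/* period_left is kept in [-sample_period, 0]; adding the event count nr
 * and watching for a sign change tells us how many periods elapsed. */
static int count_overflows(int64_t *period_left, uint64_t nr,
                           uint64_t sample_period)
{
    int overflows = 0;

    *period_left += nr;
    while (*period_left >= 0) {            /* sign change = trigger */
        *period_left -= sample_period;     /* arm the next period */
        overflows++;
    }
    return overflows;
}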

9. Summary

That concludes this article's analysis. Because perf is comparatively large and complex (relative to tracepoints, kprobes, etc.), the analysis of specific functions will be covered separately in follow-up articles.
