Rong Tao May 19, 2021
- Relevant note code of this article: https://github.com/Rtoax/linux-5.10.13
- Linux kernel performance architecture: perf_event
1. perf_event_open system call
For details, see "Linux kernel eBPF Foundation: perf (4): perf_event_open system call and user manual".
```c
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>

int perf_event_open(struct perf_event_attr *attr,
                    pid_t pid, int cpu, int group_fd,
                    unsigned long flags);
```
1.1. pid
The parameter pid allows events to be attached to processes in various ways.
- If pid is 0, the measurement is performed on the current thread;
- If pid is greater than 0, the process indicated by pid is measured;
- If pid is -1, all processes are counted.
1.2. cpu
The cpu parameter allows the measurement to be CPU-specific.
- If cpu >= 0, the measurement is restricted to the specified CPU;
- If cpu == -1, events are measured on all CPUs.
Note that the combination of pid == -1 and cpu == -1 is invalid.
- A pid > 0 and cpu == -1 setting measures the given process on whatever CPU the process is scheduled to. Per-process events like this can be created by any user.
- A pid == -1 and cpu >= 0 setting is per-CPU and measures all processes on the specified CPU. Per-CPU events require the CAP_SYS_ADMIN capability or a /proc/sys/kernel/perf_event_paranoid value of less than 1. See the chapter on perf_event related configuration files. A sketch of the combinations follows.
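As a small illustration, the sketch below wraps these combinations in a hypothetical helper open_counter() (the helper name and the choice of an instruction counter are mine, not from the kernel sources):

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: open one PERF_COUNT_HW_INSTRUCTIONS counter
 * for a given pid/cpu combination; returns the perf fd or -1. */
static int open_counter(pid_t pid, int cpu)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;

    return syscall(__NR_perf_event_open, &attr, pid, cpu, -1, 0);
}

/* open_counter(0, -1)    current thread, on any CPU
 * open_counter(1234, -1) pid 1234, followed across CPUs
 * open_counter(-1, 0)    all processes on CPU 0 (CAP_SYS_ADMIN or
 *                        perf_event_paranoid < 1 required)
 * open_counter(-1, -1)   invalid combination, fails with EINVAL  */
```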
1.3. group_fd
The group_fd parameter allows the creation of event groups. An event group has one event that is the group leader. The leader is created first, with group_fd = -1. The remaining group members are created by subsequent perf_event_open() calls with group_fd set to the leader's fd. (An event created alone with group_fd = -1 is considered to be a group with only one member.) An event group is scheduled onto the CPU as a unit: it is put onto the CPU only if all events in the group can be placed there. This means the values of the member events can be meaningfully compared with each other, added, divided (to obtain ratios), and so on, because they have counted events for the same set of executed instructions.
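For example, a cycles leader with an instructions sibling gives a meaningful IPC ratio, since both counters cover exactly the same instruction stream. A minimal sketch (error handling omitted; the variable names are illustrative):

```c
struct perf_event_attr attr;

memset(&attr, 0, sizeof(attr));
attr.type = PERF_TYPE_HARDWARE;
attr.size = sizeof(attr);
attr.config = PERF_COUNT_HW_CPU_CYCLES;
attr.disabled = 1;

/* The leader is created with group_fd = -1 ... */
int leader = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

/* ... and the sibling passes the leader's fd as group_fd. */
attr.config = PERF_COUNT_HW_INSTRUCTIONS;
int sibling = syscall(__NR_perf_event_open, &attr, 0, -1, leader, 0);

/* After enabling, running, and reading both counts,
 * instructions / cycles is a valid IPC figure. */
```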
1.4. flags
```c
#define PERF_FLAG_FD_NO_GROUP   (1UL << 0)
#define PERF_FLAG_FD_OUTPUT     (1UL << 1)
#define PERF_FLAG_PID_CGROUP    (1UL << 2) /* pid=cgroup id, per-cpu mode only */
#define PERF_FLAG_FD_CLOEXEC    (1UL << 3) /* O_CLOEXEC */
```
The system call man page explains these as follows:
- PERF_FLAG_FD_NO_GROUP: this flag allows an event to be created as part of an event group without a leader. It is unclear why this is useful.
- PERF_FLAG_FD_OUTPUT: this flag reroutes the event's output to the group leader.
- PERF_FLAG_PID_CGROUP: this flag activates system-wide monitoring per container. A container is an abstraction that isolates a set of resources (CPU, memory, etc.) for finer control. In this mode, the event is measured only if the thread running on the monitored CPU belongs to the specified container (cgroup). The cgroup is identified by passing a file descriptor opened on its directory in the cgroupfs file system. For example, if the cgroup to monitor is called test, a file descriptor opened on /dev/cgroup/test (assuming cgroupfs is mounted on /dev/cgroup) must be passed as the pid parameter. cgroup monitoring is only applicable to system-wide events, so additional permissions may be required. (Container-related content is not discussed in this article.)
- PERF_FLAG_FD_CLOEXEC: in Linux, opening a file with the O_CLOEXEC flag has the same effect as setting FD_CLOEXEC with fcntl(): an fd inherited by a fork()ed child is closed before the child loads a new executable with an exec-family system call. (Tracing perf stat ls with strace shows that perf calls perf_event_open with PERF_FLAG_FD_CLOEXEC; a one-line example follows.)
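For instance, reusing the attribute setup from the examples below, only the last argument changes (a sketch):

```c
/* Identical to the plain call, but the returned fd is closed
 * automatically across execve(), like O_CLOEXEC on open(2). */
int fd = syscall(__NR_perf_event_open, &pea, 0, -1, -1,
                 PERF_FLAG_FD_CLOEXEC);
```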
2. Test example
https://github.com/Rtoax/test/tree/master/c/glibc/linux/perf_event
```c
/* https://stackoverflow.com/questions/42088515/perf-event-open-how-to-monitoring-multiple-events
 * perf stat -e cycles,faults ls */
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h> /* mlock() */
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
#include <asm/unistd.h>
#include <errno.h>
#include <stdint.h>
#include <inttypes.h>

struct read_format {
    uint64_t nr;
    struct {
        uint64_t value;
        uint64_t id;
    } values[];
};

void do_malloc() {
    int i;
    char* ptr;
    int len = 2*1024*1024;

    ptr = malloc(len);
    mlock(ptr, len);
    for (i = 0; i < len; i++) {
        ptr[i] = (char) (i & 0xff); // pagefault
    }
    free(ptr);
}

void do_ls() {
    system("/bin/ls");
}

void do_something(int something) {
    switch(something) {
    case 1:
        do_ls();
        break;
    case 0:
    default:
        do_malloc();
        break;
    }
}

int create_hardware_perf(int grp_fd, enum perf_hw_id hw_ids, uint64_t *ioc_id)
{
    if(PERF_COUNT_HW_MAX <= hw_ids || hw_ids < 0) {
        printf("Unsupport enum perf_hw_id.\n");
        return -1;
    }

    struct perf_event_attr pea;

    memset(&pea, 0, sizeof(struct perf_event_attr));
    pea.type = PERF_TYPE_HARDWARE;
    pea.size = sizeof(struct perf_event_attr);
    pea.config = hw_ids;
    pea.disabled = 1;
    pea.exclude_kernel = 1;
    pea.exclude_hv = 1;
    pea.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;

    int fd = syscall(__NR_perf_event_open, &pea, 0, -1, grp_fd>2?grp_fd:-1, 0);
    ioctl(fd, PERF_EVENT_IOC_ID, ioc_id);

    return fd;
}

int create_software_perf(int grp_fd, enum perf_sw_ids sw_ids, uint64_t *ioc_id)
{
    if(PERF_COUNT_SW_MAX <= sw_ids || sw_ids < 0) {
        printf("Unsupport enum perf_sw_ids.\n");
        return -1;
    }

    struct perf_event_attr pea;

    memset(&pea, 0, sizeof(struct perf_event_attr));
    pea.type = PERF_TYPE_SOFTWARE;
    pea.size = sizeof(struct perf_event_attr);
    pea.config = sw_ids;
    pea.disabled = 1;
    pea.exclude_kernel = 1;
    pea.exclude_hv = 1;
    pea.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;

    int fd = syscall(__NR_perf_event_open, &pea, 0, -1, grp_fd>2?grp_fd:-1 /*!!!*/, 0);
    ioctl(fd, PERF_EVENT_IOC_ID, ioc_id);

    return fd;
}

int main(int argc, char* argv[])
{
    struct perf_event_attr pea;
    int group_fd, fd2, fd3, fd4, fd5;
    uint64_t id1, id2, id3, id4, id5;
    uint64_t val1, val2, val3, val4, val5;
    char buf[4096];
    struct read_format* rf = (struct read_format*) buf;
    int i;

    group_fd = create_hardware_perf(-1, PERF_COUNT_HW_CPU_CYCLES, &id1);
    fd2 = create_hardware_perf(group_fd, PERF_COUNT_HW_CACHE_MISSES, &id2);
    fd3 = create_software_perf(group_fd, PERF_COUNT_SW_PAGE_FAULTS, &id3);
    fd4 = create_software_perf(group_fd, PERF_COUNT_SW_CPU_CLOCK, &id4);
    fd5 = create_software_perf(group_fd, PERF_COUNT_SW_CPU_CLOCK, &id5);

    printf("ioctl %ld, %ld, %ld, %ld, %ld\n", id1, id2, id3, id4, id5);

    ioctl(group_fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(group_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

    do_something(-1);

    ioctl(group_fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

    read(group_fd, buf, sizeof(buf));
    for (i = 0; i < rf->nr; i++) {
        if (rf->values[i].id == id1) {
            val1 = rf->values[i].value;
        } else if (rf->values[i].id == id2) {
            val2 = rf->values[i].value;
        } else if (rf->values[i].id == id3) {
            val3 = rf->values[i].value;
        } else if (rf->values[i].id == id4) {
            val4 = rf->values[i].value;
        } else if (rf->values[i].id == id5) {
            val5 = rf->values[i].value;
        }
    }

    printf("cpu cycles:   %"PRIu64"\n", val1);
    printf("cache misses: %"PRIu64"\n", val2);
    printf("page faults:  %"PRIu64"\n", val3);
    printf(" cpu clock:   %"PRIu64"\n", val4);
    printf("task clock:   %"PRIu64"\n", val5);

    close(group_fd);
    close(fd2);
    close(fd3);
    close(fd4);
    close(fd5);

    return 0;
}
```
Program output:
```
[rongtao@localhost perf_event]$ ./a.out
ioctl 2640, 2641, 2642, 2643, 2644
cpu cycles: 12996145
cache misses: 1135
page faults: 518
cpu clock: 5417726
task clock: 5393251
```
As the program shows, five fds are created, and the last four are attached to the first fd to form a group that monitors the counters above. The source analysis in the next chapter introduces how this information is obtained from the kernel. Note that the memory locking in the code has no effect on the page faults:
```c
mlock(ptr, len);
for (i = 0; i < len; i++) {
    ptr[i] = (char) (i & 0xff); // pagefault
}
```
3. perf_event_open source analysis
3.1. How to call perf_event_open
The code is as follows:
```c
struct perf_event_attr pea;
uint64_t id1;

memset(&pea, 0, sizeof(struct perf_event_attr));
pea.type = PERF_TYPE_HARDWARE;
pea.size = sizeof(struct perf_event_attr);
pea.config = PERF_COUNT_HW_CPU_CYCLES;
pea.disabled = 1;
pea.exclude_kernel = 1;
pea.exclude_hv = 1;
pea.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;

int fd = syscall(__NR_perf_event_open, &pea, getpid(), -1, -1, 0);
ioctl(fd, PERF_EVENT_IOC_ID, &id1);
```
Here pid is the PID of the current process (0 would also do), cpu is -1, group_fd is -1, and flags is 0. In the perf command, flags is PERF_FLAG_FD_CLOEXEC; in the source code:
```c
if (flags & PERF_FLAG_FD_CLOEXEC)
    f_flags |= O_CLOEXEC;
```
Therefore, the input parameters above measure the current process on any CPU, following the process wherever the scheduler places it (pid > 0 and cpu == -1). The syscall then traps into kernel mode.
3.2. How to handle input parameters
3.2.1. struct perf_event_attr
First, copy from user space to kernel space:
```c
err = perf_copy_attr(attr_uptr, &attr);
```
Internally it accounts for the different lengths of the struct perf_event_attr structure across versions:
```c
if (!size)
    size = PERF_ATTR_SIZE_VER0;
if (size < PERF_ATTR_SIZE_VER0 || size > PAGE_SIZE)
    goto err_size;
```
The structure is then copied with copy_struct_from_user(). Then the parameters are checked:
```c
if (attr->__reserved_1 || attr->__reserved_2 || attr->__reserved_3)
    return -EINVAL;

if (attr->sample_type & ~(PERF_SAMPLE_MAX-1))
    return -EINVAL;

if (attr->read_format & ~(PERF_FORMAT_MAX-1))
    return -EINVAL;
```
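These checks are easy to trigger from user space; a hedged sketch (the attribute setup mirrors the test program, and only the bogus read_format is the point here):

```c
/* Sketch: undefined read_format bits are rejected by the checks above. */
struct perf_event_attr bad;

memset(&bad, 0, sizeof(bad));
bad.type = PERF_TYPE_SOFTWARE;
bad.size = sizeof(bad);
bad.config = PERF_COUNT_SW_CPU_CLOCK;
bad.read_format = ~0ULL;            /* sets bits >= PERF_FORMAT_MAX */

int fd = syscall(__NR_perf_event_open, &bad, 0, -1, -1, 0);
/* expected: fd == -1 with errno == EINVAL */
```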
Next comes the check of the sample_type field (see "Linux kernel eBPF Foundation: perf (4): perf_event_open system call and user manual"), covering in particular:
- PERF_SAMPLE_BRANCH_STACK: (since Linux 3.4) provides a record of recent branches, as supplied by CPU branch-sampling hardware (such as Intel Last Branch Record). Not all hardware supports this feature. For information on how to filter the reported branches, see the branch_sample_type field.
- PERF_SAMPLE_REGS_USER: (since Linux 3.7) records the current user-level CPU register state (the values in the process before the kernel was called).
- PERF_SAMPLE_STACK_USER: (since Linux 3.7) records the user-level stack, which allows the stack to be unwound.
- PERF_SAMPLE_REGS_INTR
- PERF_SAMPLE_CGROUP
3.2.2. pid and cpu
If the flag bit PERF_FLAG_PID_CGROUP is set (this flag activates system-wide monitoring per container), then neither pid nor cpu may be -1.
```c
/*
 * In cgroup mode, the pid argument is used to pass the fd
 * opened to the cgroup directory in cgroupfs. The cpu argument
 * designates the cpu on which to monitor threads from that
 * cgroup.
 */
if ((flags & PERF_FLAG_PID_CGROUP) && (pid == -1 || cpu == -1))
    return -EINVAL;
```
If PERF_FLAG_PID_CGROUP is not set and pid != -1, the task_struct structure is looked up:
```c
static struct task_struct *
find_lively_task_by_vpid(pid_t vpid) /* Get task_struct */
{
    struct task_struct *task;

    rcu_read_lock();
    if (!vpid)
        task = current;                  /* Current process */
    else
        task = find_task_by_vpid(vpid);  /* Lookup */
    if (task)
        get_task_struct(task);           /* Reference count */
    rcu_read_unlock();

    if (!task)
        return ERR_PTR(-ESRCH);

    return task;
}
```
3.2.3. group_fd
If group_fd != -1 is passed, the struct fd structure is fetched:
```c
static inline int perf_fget_light(int fd, struct fd *p)
{
    struct fd f = fdget(fd);
    if (!f.file)
        return -EBADF;

    if (f.file->f_op != &perf_fops) {
        fdput(f);
        return -EBADF;
    }
    *p = f;
    return 0;
}
```
Note that the file operations are perf_fops (this will be useful later):
```c
static const struct file_operations perf_fops = {
    .llseek         = no_llseek,
    .release        = perf_release,
    .read           = perf_read,
    .poll           = perf_poll,
    .unlocked_ioctl = perf_ioctl,
    .compat_ioctl   = perf_compat_ioctl,
    .mmap           = perf_mmap,
    .fasync         = perf_fasync,
};
```
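One user-visible consequence: group_fd must itself be a perf event fd, otherwise perf_fget_light() rejects it. A quick sketch (pea as in the test program):

```c
/* Sketch: stdin's f_op is not perf_fops, so the lookup fails. */
int fd = syscall(__NR_perf_event_open, &pea, 0, -1,
                 STDIN_FILENO /* not a perf fd */, 0);
/* expected: fd == -1 with errno == EBADF */
```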
3.2.4. flags
A series of permission checks follows and will not be discussed in detail here; see "Linux kernel eBPF Foundation: perf (4): perf_event_open system call and user manual".
The flag definitions and their man-page descriptions are the same as those listed in section 1.4 above.
3.3. perf_event_alloc
```c
event = perf_event_alloc(&attr, cpu, task, group_leader, NULL,
                         NULL, NULL, cgroup_fd);
```
The struct perf_event structure is allocated with kzalloc(), followed by a series of initializations. Points worth noting:
- overflow_handler is NULL here, which determines the write direction of the ring buffer:
```c
if (overflow_handler) {
    event->overflow_handler = overflow_handler;
    event->overflow_handler_context = context;
} else if (is_write_backward(event)) { /* Write ring buffer from end to beginning */
    event->overflow_handler = perf_event_output_backward;
    event->overflow_handler_context = NULL;
} else {
    event->overflow_handler = perf_event_output_forward;
    event->overflow_handler_context = NULL;
}
```
- The state field event->state (see the note after this list):
```c
/*
 * Initialize event state based on the perf_event_attr::disabled.
 */
static inline void perf_event__state_init(struct perf_event *event)
{
    event->state = event->attr.disabled ? PERF_EVENT_STATE_OFF :
                                          PERF_EVENT_STATE_INACTIVE;
}
```
- more
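Since the test program sets attr.disabled = 1, every event starts in PERF_EVENT_STATE_OFF and counts nothing until user space enables it, as the test program does via ioctl:

```c
/* attr.disabled = 1  =>  event->state == PERF_EVENT_STATE_OFF.
 * Enabling moves it to INACTIVE; it becomes ACTIVE once the
 * scheduler puts the event group on a CPU. */
ioctl(group_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
```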
3.3.1. perf_init_event
3.3.1.1. perf_try_init_event
This function calls the PMU's operator pmu->event_init(event) (see "Linux kernel eBPF Foundation: perf (2): registration of perf performance management unit PMU"), for example:
```c
//kernel/events/core.c
static struct pmu perf_swevent = { /* Performance monitoring unit */
    .task_ctx_nr  = perf_sw_context,
    .capabilities = PERF_PMU_CAP_NO_NMI,
    .event_init   = perf_swevent_init,
    .add          = perf_swevent_add,
    .del          = perf_swevent_del,
    .start        = perf_swevent_start,
    .stop         = perf_swevent_stop,
    .read         = perf_swevent_read,
};
perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);
```
```c
//kernel/events/core.c
static struct pmu perf_cpu_clock = {
    .task_ctx_nr  = perf_sw_context,
    .capabilities = PERF_PMU_CAP_NO_NMI,
    .event_init   = cpu_clock_event_init,
    .add          = cpu_clock_event_add,
    .del          = cpu_clock_event_del,
    .start        = cpu_clock_event_start,
    .stop         = cpu_clock_event_stop,
    .read         = cpu_clock_event_read,
};
perf_pmu_register(&perf_cpu_clock, NULL, -1);
```
3.3.1.1.1. perf_swevent->perf_swevent_init
See "Linux kernel eBPF Foundation: perf (2): registration of perf performance management unit PMU".
3.4. find_get_context
Gets the target context, a struct perf_event_context structure, from the task or from a per-CPU variable. If the task passed in is NULL:
```c
if (!task) {
    /* Must be root to operate on a CPU event: */
    err = perf_allow_cpu(&event->attr);
    if (err)
        return ERR_PTR(err);

    cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
    ctx = &cpuctx->ctx;
    get_ctx(ctx);
    ++ctx->pin_count;

    return ctx;
}
```
3.4.1. alloc_perf_context
```c
static struct perf_event_context *
alloc_perf_context(struct pmu *pmu, struct task_struct *task)
{
    struct perf_event_context *ctx;

    ctx = kzalloc(sizeof(struct perf_event_context), GFP_KERNEL);
    if (!ctx)
        return NULL;

    __perf_event_init_context(ctx);
    if (task)
        ctx->task = get_task_struct(task);
    ctx->pmu = pmu;

    return ctx;
}
```
3.4.1.1. __perf_event_init_context
```c
/*
 * Initialize the perf_event context in a task_struct:
 */
static void __perf_event_init_context(struct perf_event_context *ctx) /* Initialize CPU ctx */
{
    raw_spin_lock_init(&ctx->lock);
    mutex_init(&ctx->mutex);
    INIT_LIST_HEAD(&ctx->active_ctx_list);
    perf_event_groups_init(&ctx->pinned_groups);
    perf_event_groups_init(&ctx->flexible_groups);
    INIT_LIST_HEAD(&ctx->event_list);
    INIT_LIST_HEAD(&ctx->pinned_active);
    INIT_LIST_HEAD(&ctx->flexible_active);
    refcount_set(&ctx->refcount, 1);
}
```
Here our group_fd was passed as -1, so none of the if (group_leader) branches are taken; the next system call will pass an fd in as the leader.
3.5. perf_event_set_output
Since our output_event is NULL, this is only mentioned in passing here. This interface mainly deals with the ring buffer, struct perf_buffer.
```c
if (output_event) {
    err = perf_event_set_output(event, output_event);
    if (err)
        goto err_context;
}
```
3.6. anon_inode_getfile
Creates a new struct file instance, hooks it into the anonymous inode, together with a dentry describing it. The file operations used here are again perf_fops; this interface will not be explained in detail. The file is bound to an fd, which is returned to user space.
3.7. perf_event_validate_size
```c
static bool perf_event_validate_size(struct perf_event *event)
{
    /*
     * The values computed here will be over-written when we actually
     * attach the event.
     */
    __perf_event_read_size(event, event->group_leader->nr_siblings + 1);
    __perf_event_header_size(event, event->attr.sample_type & ~PERF_SAMPLE_READ);
    perf_event__id_header_size(event);

    /*
     * Sum the lot; should not exceed the 64k limit we have on records.
     * Conservative limit to allow for callchains and other variable fields.
     */
    if (event->read_size + event->header_size +
        event->id_header_size + sizeof(struct perf_event_header) >= 16*1024)
        return false;

    return true;
}
```
3.7.1. __perf_event_read_size
See "Linux kernel eBPF Foundation: perf (4): perf_event_open system call and user manual".
```c
static void __perf_event_read_size(struct perf_event *event, int nr_siblings)
{
    int entry = sizeof(u64); /* value */
    int size = 0;
    int nr = 1;

    if (event->attr.read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
        size += sizeof(u64);

    if (event->attr.read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
        size += sizeof(u64);

    if (event->attr.read_format & PERF_FORMAT_ID)
        entry += sizeof(u64);

    if (event->attr.read_format & PERF_FORMAT_GROUP) {
        nr += nr_siblings;
        size += sizeof(u64);
    }

    size += entry * nr;
    event->read_size = size;
}
```
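For the test program in chapter 2 (read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID, one leader plus four siblings), the computation works out as follows:

```c
/* PERF_FORMAT_ID:    entry = 8 (value) + 8 (id) = 16 bytes
 * PERF_FORMAT_GROUP: nr = 1 + 4 siblings = 5; size += 8 (the nr field)
 *
 *   read_size = 8 + 16 * 5 = 88 bytes
 *
 * which matches the struct read_format used in the test program:
 *   uint64_t nr;                                  8 bytes
 *   struct { uint64_t value, id; } values[5];    80 bytes
 */
```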
3.7.2. __perf_event_header_size
```c
static void __perf_event_header_size(struct perf_event *event, u64 sample_type)
{
    struct perf_sample_data *data;
    u16 size = 0;

    if (sample_type & PERF_SAMPLE_IP)
        size += sizeof(data->ip);

    if (sample_type & PERF_SAMPLE_ADDR)
        size += sizeof(data->addr);

    if (sample_type & PERF_SAMPLE_PERIOD)
        size += sizeof(data->period);

    if (sample_type & PERF_SAMPLE_WEIGHT)
        size += sizeof(data->weight);

    if (sample_type & PERF_SAMPLE_READ)
        size += event->read_size;

    if (sample_type & PERF_SAMPLE_DATA_SRC)
        size += sizeof(data->data_src.val);

    if (sample_type & PERF_SAMPLE_TRANSACTION)
        size += sizeof(data->txn);

    if (sample_type & PERF_SAMPLE_PHYS_ADDR)
        size += sizeof(data->phys_addr);

    if (sample_type & PERF_SAMPLE_CGROUP)
        size += sizeof(data->cgroup);

    event->header_size = size;
}
```
3.7.3. perf_event__id_header_size
```c
static void perf_event__id_header_size(struct perf_event *event)
{
    struct perf_sample_data *data;
    u64 sample_type = event->attr.sample_type;
    u16 size = 0;

    if (sample_type & PERF_SAMPLE_TID)
        size += sizeof(data->tid_entry);

    if (sample_type & PERF_SAMPLE_TIME)
        size += sizeof(data->time);

    if (sample_type & PERF_SAMPLE_IDENTIFIER)
        size += sizeof(data->id);

    if (sample_type & PERF_SAMPLE_ID)
        size += sizeof(data->id);

    if (sample_type & PERF_SAMPLE_STREAM_ID)
        size += sizeof(data->stream_id);

    if (sample_type & PERF_SAMPLE_CPU)
        size += sizeof(data->cpu_entry);

    event->id_header_size = size;
}
```
3.7.4. Size limit
The total cannot exceed 16*1024 bytes:
```c
if (event->read_size + event->header_size +
    event->id_header_size + sizeof(struct perf_event_header) >= 16*1024)
    return false;
```
3.8. perf_install_in_context
```c
/*
 * Precalculate sample_data sizes; do while holding ctx::mutex such
 * that we're serialized against further additions and before
 * perf_install_in_context() which is the point the event is active and
 * can use these values.
 */
perf_event__header_size(event);
perf_event__id_header_size(event);

event->owner = current;

perf_install_in_context(ctx, event, event->cpu);
perf_unpin_context(ctx);

if (move_group)
    perf_event_ctx_unlock(group_leader, gctx);
mutex_unlock(&ctx->mutex);

if (task) {
    up_read(&task->signal->exec_update_lock);
    put_task_struct(task);
}
```
3.8.1. __perf_install_in_context
TODO
3.9. Add to process linked list
```c
mutex_lock(&current->perf_event_mutex);
list_add_tail(&event->owner_entry, &current->perf_event_list);
mutex_unlock(&current->perf_event_mutex);
```
3.10. Return to user space
At this point, the system call returns to user space. In our test case this fd is group_fd; the subsequent fds will join this group_fd to form a perf group.
3.11. Calling perf_event_open again
```c
struct perf_event_attr pea;
uint64_t id2;

memset(&pea, 0, sizeof(struct perf_event_attr));
pea.type = PERF_TYPE_HARDWARE; /* matches PERF_COUNT_HW_CACHE_MISSES, as in the test program */
pea.size = sizeof(struct perf_event_attr);
pea.config = PERF_COUNT_HW_CACHE_MISSES;
pea.disabled = 1;
pea.exclude_kernel = 1;
pea.exclude_hv = 1;
pea.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;

int fd = syscall(__NR_perf_event_open, &pea, getpid(), -1, grp_fd, 0);
ioctl(fd, PERF_EVENT_IOC_ID, &id2);
```
This time, in the system call, group_leader is not empty:
```c
if (group_fd != -1) {
    err = perf_fget_light(group_fd, &group);
    if (err)
        goto err_fd;
    group_leader = group.file->private_data;
```
The differences are:
- In perf_event_alloc(), event->group_leader = group_leader is assigned;
- The branches that depend on group_leader:
  - pmu = group_leader->ctx->pmu; (distinguishing software vs. hardware events);
  - adding time to the leader;
  - operations on the sibling events;
- perf_event_set_output;
4. File operations perf_fops
```c
static const struct file_operations perf_fops = {
    .llseek         = no_llseek,
    .release        = perf_release,
    .read           = perf_read,
    .poll           = perf_poll,
    .unlocked_ioctl = perf_ioctl,
    .compat_ioctl   = perf_compat_ioctl,
    .mmap           = perf_mmap,
    .fasync         = perf_fasync,
};
```
5. File operation ioctl
The operations supported by ioctl are:
```c
/*
 * Ioctls that can be done on a perf event fd:
 */
#define PERF_EVENT_IOC_ENABLE            _IO ('$', 0)
#define PERF_EVENT_IOC_DISABLE           _IO ('$', 1)
#define PERF_EVENT_IOC_REFRESH           _IO ('$', 2)
#define PERF_EVENT_IOC_RESET             _IO ('$', 3)
#define PERF_EVENT_IOC_PERIOD            _IOW('$', 4, __u64)
#define PERF_EVENT_IOC_SET_OUTPUT        _IO ('$', 5)
#define PERF_EVENT_IOC_SET_FILTER        _IOW('$', 6, char *)
#define PERF_EVENT_IOC_ID                _IOR('$', 7, __u64 *)
#define PERF_EVENT_IOC_SET_BPF           _IOW('$', 8, __u32)
#define PERF_EVENT_IOC_PAUSE_OUTPUT      _IOW('$', 9, __u32)
#define PERF_EVENT_IOC_QUERY_BPF         _IOWR('$', 10, struct perf_event_query_bpf *)
#define PERF_EVENT_IOC_MODIFY_ATTRIBUTES _IOW('$', 11, struct perf_event_attr *)
```
arg currently supports only one flag:
```c
enum perf_event_ioc_flags {
    PERF_IOC_FLAG_GROUP = 1U << 0,
};
```
In our test example, ioctl is used as follows:
```c
ioctl(group_fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
ioctl(group_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

do_something(-1);

ioctl(group_fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
```
Its path in the kernel is:
```
ioctl
  perf_fops->compat_ioctl
    perf_compat_ioctl
      perf_ioctl
        _perf_ioctl
```
For our example, the simplified _perf_ioctl code is:
```c
static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned long arg)
{
    void (*func)(struct perf_event *);
    u32 flags = arg;

    switch (cmd) {
    case PERF_EVENT_IOC_ENABLE:
        func = _perf_event_enable;
        break;
    case PERF_EVENT_IOC_DISABLE:
        func = _perf_event_disable;
        break;
    case PERF_EVENT_IOC_RESET:
        func = _perf_event_reset;
        break;
    default:
        return -ENOTTY;
    }

    if (flags & PERF_IOC_FLAG_GROUP)
        perf_event_for_each(event, func);
    else
        perf_event_for_each_child(event, func);

    return 0;
}
```
The flags we pass in are PERF_IOC_FLAG_GROUP, so perf_event_for_each() is executed; compared with perf_event_for_each_child(), this function has one more layer of sibling traversal:
```c
static void perf_event_for_each_child(struct perf_event *event,
                                      void (*func)(struct perf_event *))
{
    struct perf_event *child;

    WARN_ON_ONCE(event->ctx->parent_ctx);

    mutex_lock(&event->child_mutex);
    func(event);
    list_for_each_entry(child, &event->child_list, child_list) {
        func(child);
    }
    mutex_unlock(&event->child_mutex);
}

static void perf_event_for_each(struct perf_event *event,
                                void (*func)(struct perf_event *))
{
    struct perf_event_context *ctx = event->ctx;
    struct perf_event *sibling;

    lockdep_assert_held(&ctx->mutex);

    event = event->group_leader;

    perf_event_for_each_child(event, func);
    for_each_sibling_event(sibling, event) {
        perf_event_for_each_child(sibling, func);
    }
}
```
Finally, func is called directly, corresponding to:
- PERF_EVENT_IOC_RESET: _perf_event_reset
- PERF_EVENT_IOC_ENABLE: _perf_event_enable
- PERF_EVENT_IOC_DISABLE: _perf_event_disable
In "Linux kernel eBPF Foundation: perf (2): registration of perf performance management unit PMU" I introduced how the underlying kernel path calls into the performance management unit, the PMU.
5.1. _perf_event_reset
5.1.1. perf_event_update_userpage
For an introduction to struct perf_event_mmap_page, refer to "Linux kernel eBPF Foundation: perf (4): perf_event_open system call and user manual".
Generally speaking, it fills in struct perf_event_mmap_page. Note that arch_perf_update_userpage() calls cyc2ns_read_begin() to obtain the starting clock from the CPU.
5.2. _perf_event_enable
At its core it calls event_function_call(event, __perf_event_enable, NULL); as shown in "Linux kernel eBPF Foundation: perf (2): registration of perf performance management unit PMU", the add callback of the event's PMU is finally called:
```
perf_event_enable
  _perf_event_enable
    __perf_event_enable
      ctx_sched_in
        ctx_flexible_sched_in | ctx_pinned_sched_in
          merge_sched_in
            group_sched_in
              event_sched_in
                event->pmu->add(event, PERF_EF_START)
```
Take perf_swevent as an example: ioctl(group_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) ends up calling event->pmu->add, and when the event is PERF_TYPE_SOFTWARE this is perf_swevent_add():
```c
//kernel/events/core.c
static struct pmu perf_swevent = { /* Performance monitoring unit */
    .task_ctx_nr  = perf_sw_context,
    .capabilities = PERF_PMU_CAP_NO_NMI,
    .event_init   = perf_swevent_init,
    .add          = perf_swevent_add,
    .del          = perf_swevent_del,
    .start        = perf_swevent_start,
    .stop         = perf_swevent_stop,
    .read         = perf_swevent_read,
};
perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);
```
As mentioned in the section above, our group_fd is of type PERF_TYPE_HARDWARE.
5.3. _perf_event_disable
See "Linux kernel eBPF Foundation: perf (2): registration of perf performance management unit PMU":
```
perf_event_disable
  _perf_event_disable
    __perf_event_disable
      group_sched_out
        event_sched_out
          event->pmu->del(event, 0);
```
6. File operation mmap
The mmap bit of struct perf_event_attr enables recording of mmap events. The following structure must be introduced first:
6.1. perf_mmap_vmops
```c
static const struct vm_operations_struct perf_mmap_vmops = {
    .open         = perf_mmap_open,
    .close        = perf_mmap_close, /* non mergeable */
    .fault        = perf_mmap_fault,
    .page_mkwrite = perf_mmap_fault,
};
```
6.2. Allocate memory
```
perf_mmap
  rb_alloc
    kzalloc
    vmalloc_user
    ring_buffer_init
  ring_buffer_attach
    list_add_rcu
  perf_event_init_userpage
  perf_event_update_userpage
    calc_timer_values
      __perf_update_times
    arch_perf_update_userpage
```
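From user space, this path is reached by mmap()ing the perf fd: the first page is struct perf_event_mmap_page, followed by the data pages. A minimal sketch (8 data pages chosen arbitrarily; the data area must be a power-of-two number of pages):

```c
#include <sys/mman.h>

/* 1 metadata page + 2^n data pages (here n = 3). */
size_t len = (1 + 8) * sysconf(_SC_PAGESIZE);
void *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

struct perf_event_mmap_page *mp = base;
/* The kernel advances mp->data_head as it writes records;
 * perf_event_update_userpage() keeps the time fields fresh. */
```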
7. File operation read
perf_read
```
perf_read
  __perf_read
    perf_read_group
      __perf_read_group_add
        perf_event_read
          __perf_event_read_cpu
            topology_physical_package_id
              (cpu_data(cpu).phys_proc_id)
              per_cpu(cpu_info, cpu)
          __perf_event_read
            __get_cpu_context
            perf_event_update_time
              __perf_update_times
            perf_event_update_sibling_time
            pmu->start_txn(pmu, PERF_PMU_TXN_READ);  -> x86_pmu_start_txn
            pmu->read(event);                        -> x86_pmu_read
            pmu->commit_txn(pmu);                    -> x86_pmu_commit_txn
      copy_to_user
    perf_read_one
      TODO
```
8. Page fault statistics: PERF_COUNT_SW_PAGE_FAULTS
In the page-fault handling of the 5.10.13 kernel, the perf_event related code path is as follows. In earlier kernel versions, the function handling page faults was do_page_fault; in 5.10.13 it was renamed exc_page_fault and is defined with DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault). The perf related code path:
```
exc_page_fault
  handle_page_fault
    do_kern_addr_fault        /* kernel faults are not counted (I did not find it) */
    do_user_addr_fault
      perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);   /* perf entry */
        if (static_key_false(&perf_swevent_enabled[event_id]))
            __perf_sw_event(event_id, nr, regs, addr);
        ___perf_sw_event
          perf_sample_data_init
            data->addr = addr;
            data->raw = NULL;
            data->br_stack = NULL;
            data->period = period;
            data->weight = 0;
            data->data_src.val = PERF_MEM_NA;
            data->txn = 0;
          do_perf_sw_event
            perf_swevent_event
              local64_add(nr, &event->count);
              perf_swevent_overflow
                perf_swevent_set_period
                __perf_event_overflow
                  __perf_event_account_interrupt   /* generic overflow handling, sampling */
                    perf_adjust_period
                      perf_calculate_period
                      pmu->stop(event, PERF_EF_UPDATE);
                      pmu->start(event, PERF_EF_RELOAD);
                  irq_work_queue
```
- perf_swevent_set_period(): event->count is incremented directly, and a second value is kept in event->hw.period_left to count the interval. This periodic event stays within the range [-sample_period, 0], so that the sign can be used as the trigger.
- __perf_event_account_interrupt(): generic event overflow handling and sampling. A standalone sketch that exercises this path follows.
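To exercise this path in isolation, a single PERF_COUNT_SW_PAGE_FAULTS counter is enough; a sketch reusing do_malloc() from the test program (no error handling, plain u64 read since PERF_FORMAT_GROUP is not set):

```c
struct perf_event_attr attr;
uint64_t faults;

memset(&attr, 0, sizeof(attr));
attr.type = PERF_TYPE_SOFTWARE;
attr.size = sizeof(attr);
attr.config = PERF_COUNT_SW_PAGE_FAULTS;
attr.disabled = 1;

int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
do_malloc();                        /* faults in a 2 MiB buffer */
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

read(fd, &faults, sizeof(faults));  /* each fault went through perf_sw_event() */
printf("page faults: %" PRIu64 "\n", faults);
close(fd);
```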
9. Summary
That concludes this article's analysis. Because perf is relatively large and complex (compared with tracepoint, kprobe, etc.), the analysis of specific functions will be studied separately in subsequent articles.
10. Related links
- Comment source code: https://github.com/Rtoax/linux-5.10.13
- Linux kernel eBPF Foundation: perf (1): initialization of perf_event in the kernel
- Linux kernel eBPF Foundation: perf (2): registration of perf performance management unit PMU
- Linux kernel eBPF Foundation: perf (3): user-space instruction analysis
- Linux kernel eBPF Foundation: perf (4) perf_event_open system call and user manual
- Linux kernel perf architecture
- Linux perf 1.1,perf_event kernel framework
- Linux kernel performance architecture: perf_event
- https://www.kernel.org/doc/man-pages/