Linux kernel tracing with eBPF: the bpftrace one-liner tutorial

This tutorial teaches bpftrace through 12 simple sections. Each section is a one-line command that you can run immediately to see its effect. The series introduces the concepts of bpftrace; for a complete reference, see the bpftrace Reference Guide.
This tutorial was contributed by Brendan Gregg, Netflix (2018), based on his DTrace Tutorial series.

1. List all probes

bpftrace -l 'tracepoint:syscalls:sys_enter_*'

"bpftrace -l" lists all probe points and can add search terms.

  • A probe is a detection point used to capture event data.
  • The provided search terms support wildcards, such as * /?
  • "bpftrace -l" can also be piped to grep for complete regular expression search.

2. Hello World

# bpftrace -e 'BEGIN { printf("hello world\n"); }'
Attaching 1 probe...
hello world

This prints a welcome message. After running it, press Ctrl-C to finish.

  • BEGIN is a special probe that fires at the start of the program. You can use it to set variables and print headers.
  • An action can be associated with a probe; actions are placed inside {}.
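BEGIN has a counterpart, END, which fires when bpftrace exits. A minimal sketch combining both (the message text is illustrative):

# bpftrace -e 'BEGIN { printf("tracing...\n"); } END { printf("done\n"); }'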

3. File open

# bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
Attaching 1 probe...
snmp-pass /proc/cpuinfo
snmp-pass /proc/stat
snmpd /proc/net/dev
snmpd /proc/net/if_inet6

Here we print the process name and path name as files are opened.

  • The command begins with tracepoint:syscalls:sys_enter_openat: this is a tracepoint probe (kernel static tracing) that fires when the openat() syscall is entered. Tracepoints are preferred over kprobes (kernel dynamic tracing, covered in section 6) because tracepoints have a stable API. Note: on modern Linux systems (glibc >= 2.26), open(2) always calls the openat() syscall.
  • comm is a builtin variable holding the current process's name. Other similar builtins are pid and tid, the process ID and thread ID.
  • args is a pointer to the tracepoint's arguments, a struct that bpftrace generates automatically from the tracepoint definition. Its members can be found with the command bpftrace -vl tracepoint:syscalls:sys_enter_openat. args->filename accesses the filename member.
  • str() turns a char * pointer into a string.
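The -vl listing mentioned above prints the tracepoint's argument struct. For sys_enter_openat it looks roughly like this (field names come from the kernel's tracepoint format file and may vary by kernel version):

# bpftrace -vl tracepoint:syscalls:sys_enter_openat
tracepoint:syscalls:sys_enter_openat
    int __syscall_nr;
    int dfd;
    const char * filename;
    int flags;
    umode_t mode;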

4. System call count statistics of process

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
Attaching 1 probe...

@[bpftrace]: 6
@[systemd]: 24
@[snmp-pass]: 96
@[sshd]: 125

Press Ctrl-C to print the syscall counts per process.

  • @: denotes a special variable type called a map, which can store and summarize data in various ways. You can add an optional variable name after @ for readability, or to distinguish between multiple maps.
  • []: optional brackets set the map's key(s), much like an associative array.
  • count(): a map function that counts how often it is called. Since the count is saved in a map keyed on comm, the result is a tally of syscalls per process name.

Maps are printed automatically when bpftrace ends (e.g., on Ctrl-C).
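As the bullets above note, the map can be given a name to label the output. A variant of the same one-liner (the name @num here is arbitrary):

# bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @num[comm] = count(); }'

The output lines then begin with @num[...] instead of @[...].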

5. Read distribution statistics

# bpftrace -e 'tracepoint:syscalls:sys_exit_read /pid == 18644/ { @bytes = hist(args->ret); }'
Attaching 1 probe...

[0, 1]                12 |@@@@@@@@@@@@@@@@@@@@                                |
[2, 4)                18 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
[4, 8)                 0 |                                                    |
[8, 16)                0 |                                                    |
[16, 32)               0 |                                                    |
[32, 64)              30 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)             19 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                    |
[128, 256)             1 |@                                                   |

Here we trace the read() syscall for process ID 18644 and print a histogram of its return values.

  • /.../: a filter (predicate); the actions in {} execute only when the filter expression is true. Here only process ID 18644 is traced. Filter expressions also support Boolean operators, such as "&&" and "||".
  • args->ret: the syscall's return value. For sys_read, -1 indicates an error; other values are the number of bytes successfully read.
  • @bytes: a map as in the previous section, but here with no [] key; the name "bytes" decorates the output.
  • hist(): a map function that summarizes its argument as a power-of-two histogram. Each output row begins with an interval, such as [128, 256), meaning the value was greater than or equal to 128 and less than 256, followed by the count of values in that interval and an ASCII bar chart. The chart can be studied to understand the distribution. Other map functions include lhist() (linear histogram), count(), sum(), avg(), min(), and max().
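The other map functions listed above can be swapped in directly. For instance, a sketch tracking the average successful read size per process (same tracepoint, untested variant):

# bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ { @avg_bytes[comm] = avg(args->ret); }'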

6. The kernel dynamically tracks the number of bytes of read()

# bpftrace -e 'kretprobe:vfs_read { @bytes = lhist(retval, 0, 2000, 200); }'
Attaching 1 probe...

(...,0]                0 |                                                    |
[0, 200)              66 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[200, 400)             2 |@                                                   |
[400, 600)             3 |@@                                                  |
[600, 800)             0 |                                                    |
[800, 1000)            5 |@@@                                                 |
[1000, 1200)           0 |                                                    |
[1200, 1400)           0 |                                                    |
[1400, 1600)           0 |                                                    |
[1600, 1800)           0 |                                                    |
[1800, 2000)           0 |                                                    |
[2000,...)            39 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |

The kernel dynamic tracing technology is used to display the histogram of the number of bytes returned by read().

  • kretprobe:vfs_read: this is a kretprobe-type probe (kernel dynamic tracing of a function's return value) on the kernel function vfs_read(). There is also a kprobe type, for tracing function entry. These are powerful probe types that let us trace tens of thousands of kernel functions. However, they are "unstable" probe types: since they can instrument any kernel function, there is no guarantee that a given kprobe or kretprobe keeps working across kernel versions, where function names, arguments, and return values may change. And because they trace low-level kernel internals, you may need to browse the kernel source to understand what a probe's arguments and return value mean.
  • lhist(): a linear histogram; its arguments are value, minimum, maximum, and step. The first argument (retval) is vfs_read()'s return value: the number of bytes read.

7. Time of read() call

# bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; } kretprobe:vfs_read /@start[tid]/ { @ns[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }'
Attaching 2 probes...

[0, 1]                 0 |                                                    |
[2, 4)                 0 |                                                    |
[4, 8)                 0 |                                                    |
[8, 16)                0 |                                                    |
[16, 32)               0 |                                                    |
[32, 64)               0 |                                                    |
[64, 128)              0 |                                                    |
[128, 256)             0 |                                                    |
[256, 512)            27 |@@@@@@@@@                                           |
[512, 1k)            125 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[1k, 2k)              22 |@@@@@@@                                             |
[2k, 4k)               1 |                                                    |
[4k, 8k)              10 |@@@                                                 |
[8k, 16k)              1 |                                                    |
[16k, 32k)             3 |@                                                   |
[32k, 64k)           144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64k, 128k)            7 |@@                                                  |
[128k, 256k)          28 |@@@@@@@@@@                                          |
[256k, 512k)           2 |                                                    |
[512k, 1M)             3 |@                                                   |
[1M, 2M)               1 |                                                    |

The time spent in read() calls is shown as histograms keyed by process name, in nanoseconds.

  • @start[tid]: this uses the thread ID as a key. There may be many reads in flight, and we want to store a start timestamp for each. How? We could construct a unique identifier for each read and use that as the key. But because a kernel thread can only be executing one syscall at a time, the thread ID serves as that unique identifier.
  • nsecs: nanoseconds since boot. This is a high-resolution timestamp, useful for timing events.
  • /@start[tid]/: this filter checks that the start time was recorded. Without it, we might catch only the end of a read() that began before tracing started, and compute now − zero instead of now − start.
  • delete(@start[tid]): frees the map element.
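The same one-liner can be written as a multi-line script, which is easier to read. A sketch (save it under any name, e.g. the hypothetical readlat.bt, and run it with bpftrace readlat.bt):

kprobe:vfs_read
{
	@start[tid] = nsecs;
}

kretprobe:vfs_read
/@start[tid]/
{
	@ns[comm] = hist(nsecs - @start[tid]);
	delete(@start[tid]);
}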

8. Statistical process level events

# bpftrace -e 'tracepoint:sched:sched* { @[probe] = count(); } interval:s:5 { exit(); }'
Attaching 25 probes...
@[tracepoint:sched:sched_wakeup_new]: 1
@[tracepoint:sched:sched_process_fork]: 1
@[tracepoint:sched:sched_process_exec]: 1
@[tracepoint:sched:sched_process_exit]: 1
@[tracepoint:sched:sched_process_free]: 2
@[tracepoint:sched:sched_process_wait]: 7
@[tracepoint:sched:sched_wake_idle_without_ipi]: 53
@[tracepoint:sched:sched_stat_runtime]: 212
@[tracepoint:sched:sched_wakeup]: 253
@[tracepoint:sched:sched_waking]: 253
@[tracepoint:sched:sched_switch]: 510

This counts process-level events for five seconds, then prints them.

  • sched: the sched probe category instruments high-level scheduler and process events: fork, exec, context switches, and so on.
  • probe: a builtin variable holding the full name of the current probe.
  • interval:s:5: an interval probe that fires once (on one CPU) every 5 seconds. It is used for creating script-level intervals and timeouts.
  • exit(): exits bpftrace.

9. Analyze the kernel real-time function stack

# bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'
Attaching 1 probe...

]: 12
]: 22122

This samples kernel call stacks at 99 Hz and counts how often each is seen. (The stack traces themselves are truncated in the output above.)

  • profile:hz:99: this samples on all CPUs at 99 Hz, recording the kernel stack. Why 99 and not 100 or 1000? We want enough detail about what the kernel is executing, but not so much frequency that sampling costs performance. 100 Hz is enough, but exactly 100 Hz could sample in lockstep with timed kernel activity, so 99 Hz is used.
  • kstack: returns the kernel call stack. It is used here as a map key, so the count per unique stack is kept. This output can be visualized as a flame graph. There is also ustack, for sampling user-level stacks.
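To sample user-level stacks instead, ustack can be substituted, optionally filtering to a single process (the PID shown is illustrative):

# bpftrace -e 'profile:hz:99 /pid == 189/ { @[ustack] = count(); }'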

10. Scheduler tracking

# bpftrace -e 'tracepoint:sched:sched_switch { @[kstack] = count(); }'

]: 73
]: 305

This counts context switches by kernel stack. The output above is truncated; only the two most frequent stacks are shown.

  • sched: the sched category of tracepoints traces scheduler events: sched_switch, sched_wakeup, sched_migrate_task, etc.
  • sched_switch: fires when a thread leaves the CPU. This is likely a blocking event: for example, waiting on I/O, a timer, paging/swapping, or a lock.
  • kstack: the kernel stack trace; prints the call stack.
  • sched_switch fires in the context of the thread that is leaving the CPU, so the printed stack belongs to that thread. When using other probes, pay attention to context: builtins such as comm, pid, and kstack may not refer to the target of the probe.

11. Block level I/O tracking

# bpftrace -e 'tracepoint:block:block_rq_issue { @ = hist(args->bytes); }'
Attaching 1 probe...

[0, 1]                 1 |@@                                                  |
[2, 4)                 0 |                                                    |
[4, 8)                 0 |                                                    |
[8, 16)                0 |                                                    |
[16, 32)               0 |                                                    |
[32, 64)               0 |                                                    |
[64, 128)              0 |                                                    |
[128, 256)             0 |                                                    |
[256, 512)             0 |                                                    |
[512, 1K)              0 |                                                    |
[1K, 2K)               0 |                                                    |
[2K, 4K)               0 |                                                    |
[4K, 8K)              24 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)              2 |@@@@                                                |
[16K, 32K)             6 |@@@@@@@@@@@@@                                       |
[32K, 64K)             5 |@@@@@@@@@@                                          |
[64K, 128K)            0 |                                                    |
[128K, 256K)           1 |@@                                                  |

The above is a histogram of block I/O request bytes.

  • tracepoint:block: the block category of tracepoints traces block-level (device) I/O events.
  • block_rq_issue: fires when an I/O request is issued to the block device.
  • args->bytes: the bytes member of the block_rq_issue tracepoint arguments: the size of the I/O request in bytes.

The context of this probe matters: it fires when the I/O is issued to the block device. This often happens in process context, where the process name is available via the comm builtin, but it may also happen in kernel context (e.g., readahead), in which case pid and comm will not show the process you expect.
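With the context caveat above in mind, the histogram can still be keyed by process name to see who is issuing block I/O (a sketch; comm may show a kernel thread for asynchronous I/O):

# bpftrace -e 'tracepoint:block:block_rq_issue { @[comm] = hist(args->bytes); }'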

12. Kernel structure tracking

# cat path.bt
#include <linux/path.h>
#include <linux/dcache.h>

kprobe:vfs_open
{
	printf("open path: %s\n", str(((struct path *)arg0)->dentry->d_name.name));
}

# bpftrace path.bt
Attaching 1 probe...
open path: dev
open path: if_inet6
open path: retrans_time_ms

Here we use kernel dynamic tracing to trace the vfs_open() kernel function, which takes a (struct path *) as its first argument.

  • kprobe: as mentioned earlier, this is the kernel dynamic tracing probe type, which traces kernel function calls (the kretprobe type traces their return values).
  • arg0 is a builtin variable holding the probe's first argument; its meaning depends on the probe type. For kprobe probes, it is the first argument of the traced function. Other arguments are accessed as arg1, ..., argN.
  • ((struct path *)arg0)->dentry->d_name.name: this casts arg0 to a struct path *, then dereferences dentry and its name.
  • #include: header files containing the struct path and struct dentry declarations we need.

bpftrace supports kernel struct tracing in the same way as bcc, by using kernel header files. This means most structs are available, but not all, and sometimes struct declarations must be added by hand; for an example, see the dcsnoop tool, which includes a declaration of struct nameidata. In the future, bpftrace will use the kernel's BTF support, and then all structs will be available. By now you understand most of bpftrace's capabilities and can begin using it and writing powerful one-liners. See the Reference Guide for more.

Added by alecks on Mon, 07 Mar 2022 22:43:12 +0200