BPF tracing with SPDK
Recently, in the v21.07 release, SPDK gained support for BPF tracing. It consists of a set of scripts and of static probes defined within the SPDK libraries, giving users yet another way to inspect SPDK-based applications. SPDK has had a tracing library, libtrace, for a long time; BPF tracing is not meant to replace it, but rather to complement it. The two have slightly different capabilities: libtrace has significantly lower overhead, but is less flexible. BPF traces, on the other hand, are more costly (they require a kernel trap), but are more flexible and can even be attached dynamically, without having to define the probes in code.
User-level tracing
Firstly, let’s discuss what BPF tracing is and how SPDK makes use of it. BPF (Berkeley Packet Filter) is a technology that was originally developed to analyze and filter network traffic. It allows the user to run user-supplied programs inside the operating system kernel, in a special BPF virtual machine. Later on, in Linux, it was extended (hence eBPF, extended BPF) to support more use cases, among others tracing. It lets users leverage the Linux tracing subsystem to attach probes at certain places in a program and, once a probe is hit, run an eBPF program tied to it [1]. For user-space applications, this works by placing an instruction (int3 on x86) that generates a SIGTRAP, which is then caught by the kernel and triggers the execution of the eBPF program. SPDK uses bpftrace to define and attach probes. This tool uses a high-level scripting language (described by its authors as a mix of awk and C) to describe the actions to be taken when a given probe is triggered.
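As a quick, standalone illustration of that syntax (not tied to SPDK in any way), the short program below counts read() syscalls per process name; bpftrace prints the resulting map when it exits:
// count read() syscalls per process name
tracepoint:syscalls:sys_enter_read {
    @reads[comm] = count();
}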
The probes can be divided into two categories: statically and dynamically defined. Static probes, called USDT (User Statically-Defined Tracing), are placed by the programmer at various critical locations in the code (using one of the SPDK_DTRACE_PROBE* macros). Dynamic probes, on the other hand, can be enabled at the start of any function without having to change the code or recompile the application. There’s one caveat though: some function calls (in particular to static functions) might get inlined by the compiler, in which case the probe won’t fire when expected.
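In bpftrace terms, the two kinds of probes use different probe specifiers. The sketch below shows the general shape of each; the binary path, probe name, and function name are placeholders rather than real SPDK probes (bpftrace’s -l option can be used to list the probes actually available in a binary):
// static USDT probe compiled into the application (placeholder probe name)
usdt:/path/to/app:probe_name {
    printf("static probe hit by %s\n", comm);
}
// dynamic probe placed at the entry of an arbitrary function (placeholder name)
uprobe:/path/to/app:some_function {
    printf("some_function() was called\n");
}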
Tools and examples
In order for the USDT probes to be compiled in, SPDK must be configured with --with-usdt. It’s also recommended to build a recent version of bpftrace from source, as we’ve encountered a number of issues with the packaged versions on both Ubuntu 20.04 and Fedora 33. scripts/bpftrace.sh can be used to attach and display the traces. It expects two parameters: the PID of the process to trace and the bpftrace script to attach. These scripts are capable of gathering and displaying a variety of different information and statistics. For instance, it’s possible to print the parameters of a function each time it’s called, count how many times a given code path is executed, or even create and display a histogram from a set of values. SPDK already provides several useful scripts, located in the scripts/bpf directory.
Let’s take a look at one of them, scripts/bpf/send_msg.bt, to see how it works and how it can be used to inspect an SPDK application. The goal of this script is to count the number of times each function is executed through spdk_thread_send_msg() and spdk_for_each_channel(). The contents of the script are pretty straightforward:
uprobe:__EXE__:spdk_thread_send_msg {
@send_msg[usym(arg1)] = count();
}
uprobe:__EXE__:spdk_for_each_channel {
@for_each_channel[usym(arg1)] = count();
}
It defines two dynamic probes (via the uprobe keyword) that save the number of calls to each function in a map (called send_msg and for_each_channel, respectively). The key in each map is arg1, which in bpftrace is the traced function’s second argument (arguments are numbered starting from arg0); in both cases it is the pointer to the function to be executed. The script uses two helper functions: usym(), which returns the name of the symbol at a given address, and count(), which counts the number of times the probe has fired for a given key. For more information on the available functions and the general syntax of the scripts, consult the bpftrace manual.
There’s one extra special variable: __EXE__. It is SPDK-specific and is replaced by scripts/bpftrace.sh with the name of the executable (which is what bpftrace expects). There’s also __PID__, which is replaced by the PID of the traced process. Both of them offer a convenient way of using the same scripts with different applications. More details can be found in the SPDK documentation.
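For example, __PID__ can be used in a predicate to restrict a probe to the traced process, which matters when more than one instance of the same executable is running. The snippet below is only a sketch of that pattern, not one of the in-tree scripts:
// hypothetical: count spdk_thread_send_msg() callees, but only for the traced PID
uprobe:__EXE__:spdk_thread_send_msg /pid == __PID__/ {
    @send_msg[usym(arg1)] = count();
}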
Once a script like send_msg.bt is attached, it’ll run until the application exits (or the script is stopped). When that happens, it’ll print out the name of each function along with the number of times it was called. The sample output below was gathered by attaching to an NVMe/TCP target:
@for_each_channel[nvmf_ctrlr_disconnect_io_qpairs_on_pg]: 1
@for_each_channel[_nvmf_tgt_add_transport]: 1
@for_each_channel[subsystem_state_change_on_pg]: 5
@send_msg[_nvmf_ctrlr_add_admin_qpair]: 1
@send_msg[_nvmf_subsystem_add_ctrlr]: 1
@send_msg[_nvmf_ctrlr_destruct]: 1
@send_msg[_finish_unregister]: 1
@send_msg[put_io_channel]: 4
@send_msg[_call_completion]: 7
@send_msg[_call_channel]: 7
@send_msg[nvmf_ctrlr_add_io_qpair]: 24
@send_msg[_nvmf_ctrlr_add_io_qpair]: 24
@send_msg[_nvmf_ctrlr_free_from_qpair]: 25
@send_msg[_nvmf_poll_group_add]: 25
@send_msg[_nvmf_transport_qpair_fini]: 25
The other available scripts can be used to track system calls (syscalls.bt), NVMe-oF subsystem state transitions (nvmf.bt), and readv() calls together with the number of bytes read (readv.bt). These, on their own, provide a lot of information, but they can also serve as a guideline for writing scripts aimed at tracking other properties of an application.
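As an example of such a custom script, the sketch below builds a histogram of the time spent in each call to spdk_thread_poll(). The choice of function is just an assumption made for illustration; any non-inlined function visible in the binary would work the same way:
// record the entry timestamp of each spdk_thread_poll() call
uprobe:__EXE__:spdk_thread_poll {
    @start[tid] = nsecs;
}
// on return, add the elapsed time (in nanoseconds) to a histogram
uretprobe:__EXE__:spdk_thread_poll /@start[tid]/ {
    @poll_ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}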
SPDK and BPF traces together
As stated before, one of the main goals of the SPDK trace library was to have minimal overhead, which means that each tracepoint is allowed to record only the most critical data. That’s what makes it possible to place tracepoints in the I/O path without a major effect on performance. Now, with BPF traces, we can put bpftrace probes in the places where objects are created, gather all kinds of their properties, and then use that data to annotate SPDK traces.
For instance, in the NVMe/RDMA I/O path we record the pointer to the qpair on which a request is executed. We can use bpftrace to record information about that qpair, such as its queue ID, thread ID, subsystem NQN, and host NQN, and display it alongside the pointer when showing SPDK traces. This is what the scripts/bpf/trace.py script is responsible for.
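Conceptually, a probe placed where a qpair is set up only needs to emit the pointer and the qpair’s properties once; the post-processing step can then join that record with every SPDK trace entry carrying the same pointer. The snippet below is a made-up illustration of the idea (the probe name and arguments are assumptions, not the actual probes used by the in-tree scripts):
// hypothetical probe: print a qpair's pointer and queue ID once, at creation time
usdt:__EXE__:create_qpair {
    printf("qpair=0x%llx qid=%d\n", arg0, arg1);
}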
Let’s see how it works in practice. First, the bpftrace data has to be recorded (assuming we’re tracing the spdk_tgt app exposing an NVMe/RDMA target):
$ scripts/bpf/trace.py --record `pidof spdk_tgt` > bpftraces
Then, we can use it to annotate the traces:
$ build/bin/spdk_trace -f /dev/shm/spdk_tgt_trace.pid`pidof spdk_tgt` -j | scripts/bpf/trace.py --bpftrace bpftraces
This will transform the output from:
0 39100254.090 RDMA_REQ_COMPLETING id: r10 time: 49.692 qpair: 0x1178af0
0 39100320.354 RDMA_REQ_COMPLETED id: r10 time: 115.956 qpair: 0x1178af0
0 39100494.100 RDMA_REQ_NEW id: r11 qpair: 0x11790a0
0 39100494.454 RDMA_REQ_NEED_BUFFER id: r11 time: 0.354 qpair: 0x11790a0
0 39100494.684 RDMA_REQ_RDY_TO_EXECUTE id: r11 time: 0.584 qpair: 0x11790a0
0 39100500.546 RDMA_REQ_EXECUTING id: r11 time: 6.446 qpair: 0x11790a0
0 39100516.727 RDMA_REQ_EXECUTED id: r11 time: 22.627 qpair: 0x11790a0
0 39100516.903 RDMA_REQ_RDY_TO_COMPL id: r11 time: 22.803 qpair: 0x11790a0
0 39100517.333 RDMA_REQ_COMPLETING id: r11 time: 23.233 qpair: 0x11790a0
0 39100550.895 RDMA_REQ_COMPLETED id: r11 time: 56.795 qpair: 0x11790a0
to:
0 39100254.090 RDMA_REQ_COMPLETING id: r10 time: 49.692 qpair(ptr=0x1178af0, thread=2, qid=1, subnqn=nqn.2016-06.io.spdk:cnode0, hostnqn=nqn.2014-08.org.nvmexpress:uuid:8da29bed555a45e8b9bc378e115f4c2)
0 39100320.354 RDMA_REQ_COMPLETED id: r10 time: 115.956 qpair(ptr=0x1178af0, thread=2, qid=1, subnqn=nqn.2016-06.io.spdk:cnode0, hostnqn=nqn.2014-08.org.nvmexpress:uuid:8da29bed555a45e8b9bc378e115f4c2)
0 39100494.100 RDMA_REQ_NEW id: r11 qpair(ptr=0x11790a0, thread=2, qid=2, subnqn=nqn.2016-06.io.spdk:cnode0, hostnqn=nqn.2014-08.org.nvmexpress:uuid:8da29bed555a45e8b9bc378e115f4c2)
0 39100494.454 RDMA_REQ_NEED_BUFFER id: r11 time: 0.354 qpair(ptr=0x11790a0, thread=2, qid=2, subnqn=nqn.2016-06.io.spdk:cnode0, hostnqn=nqn.2014-08.org.nvmexpress:uuid:8da29bed555a45e8b9bc378e115f4c2)
0 39100494.684 RDMA_REQ_RDY_TO_EXECUTE id: r11 time: 0.584 qpair(ptr=0x11790a0, thread=2, qid=2, subnqn=nqn.2016-06.io.spdk:cnode0, hostnqn=nqn.2014-08.org.nvmexpress:uuid:8da29bed555a45e8b9bc378e115f4c2)
0 39100500.546 RDMA_REQ_EXECUTING id: r11 time: 6.446 qpair(ptr=0x11790a0, thread=2, qid=2, subnqn=nqn.2016-06.io.spdk:cnode0, hostnqn=nqn.2014-08.org.nvmexpress:uuid:8da29bed555a45e8b9bc378e115f4c2)
0 39100516.727 RDMA_REQ_EXECUTED id: r11 time: 22.627 qpair(ptr=0x11790a0, thread=2, qid=2, subnqn=nqn.2016-06.io.spdk:cnode0, hostnqn=nqn.2014-08.org.nvmexpress:uuid:8da29bed555a45e8b9bc378e115f4c2)
0 39100516.903 RDMA_REQ_RDY_TO_COMPL id: r11 time: 22.803 qpair(ptr=0x11790a0, thread=2, qid=2, subnqn=nqn.2016-06.io.spdk:cnode0, hostnqn=nqn.2014-08.org.nvmexpress:uuid:8da29bed555a45e8b9bc378e115f4c2)
0 39100517.333 RDMA_REQ_COMPLETING id: r11 time: 23.233 qpair(ptr=0x11790a0, thread=2, qid=2, subnqn=nqn.2016-06.io.spdk:cnode0, hostnqn=nqn.2014-08.org.nvmexpress:uuid:8da29bed555a45e8b9bc378e115f4c2)
0 39100550.895 RDMA_REQ_COMPLETED id: r11 time: 56.795 qpair(ptr=0x11790a0, thread=2, qid=2, subnqn=nqn.2016-06.io.spdk:cnode0, hostnqn=nqn.2014-08.org.nvmexpress:uuid:8da29bed555a45e8b9bc378e115f4c2)
For now, only RDMA qpairs can be annotated, but expect more to come in the future.
Conclusion
Summing up, SPDK now has another robust way of inspecting its applications via BPF tracing. It consists of a set of bpftrace programs, USDT probes defined in SPDK libraries, and several scripts that simplify BPF tracing and make it possible to combine it with regular SPDK traces.
Stay tuned for more blog posts on this topic. Happy tracing!