NVMe Driver

In this document

Introduction

The NVMe driver is a C library that may be linked directly into an application that provides direct, zero-copy data transfer to and from NVMe SSDs. It is entirely passive, meaning that it spawns no threads and only performs actions in response to function calls from the application itself. The library controls NVMe devices by directly mapping the PCI BAR into the local process and performing MMIO. I/O is submitted asynchronously via queue pairs and the general flow isn't entirely dissimilar from Linux's libaio.

More recently, the library has been improved to also connect to remote NVMe devices via NVMe over Fabrics. Users may now call spdk_nvme_probe() on both local PCI busses and on remote NVMe over Fabrics discovery services. The API is otherwise unchanged.

Examples

Getting Start with Hello World

There are a number of examples provided that demonstrate how to use the NVMe library. They are all in the examples/nvme directory in the repository. The best place to start is hello_world.

Running Benchmarks with Fio Plugin

SPDK provides a plugin to the very popular fio tool for running some basic benchmarks. See the fio start up guide for more details.

Running Benchmarks with Perf Tool

NVMe perf utility in the examples/nvme/perf is one of the examples which also can be used for performance tests. The fio tool is widely used because it is very flexible. However, that flexibility adds overhead and reduces the efficiency of SPDK. Therefore, SPDK provides a perf benchmarking tool which has minimal overhead during benchmarking. We have measured up to 2.6 times more IOPS/core when using perf vs. fio with the 4K 100% Random Read workload. The perf benchmarking tool provides several run time options to support the most common workload. The following examples demonstrate how to use perf.

Example: Using perf for 4K 100% Random Read workload to a local NVMe SSD for 300 seconds

perf -q 128 -o 4096 -w randread -r 'trtype:PCIe traddr:0000:04:00.0' -t 300

Example: Using perf for 4K 100% Random Read workload to a remote NVMe SSD exported over the network via NVMe-oF

perf -q 128 -o 4096 -w randread -r 'trtype:RDMA adrfam:IPv4 traddr:192.168.100.8 trsvcid:4420' -t 300

Example: Using perf for 4K 70/30 Random Read/Write mix workload to all local NVMe SSDs for 300 seconds

perf -q 128 -o 4096 -w randrw -M 70 -t 300

Example: Using perf for extended LBA format CRC guard test to a local NVMe SSD, users must write to the SSD before reading the LBA from SSD

perf -q 1 -o 4096 -w write -r 'trtype:PCIe traddr:0000:04:00.0' -t 300 -e 'PRACT=0,PRCKH=GUARD'
perf -q 1 -o 4096 -w read -r 'trtype:PCIe traddr:0000:04:00.0' -t 200 -e 'PRACT=0,PRCKH=GUARD'

Public Interface

Key Functions Description
spdk_nvme_probe() Enumerate the bus indicated by the transport ID and attach the userspace NVMe driver to each device found if desired.
spdk_nvme_ctrlr_alloc_io_qpair() Allocate an I/O queue pair (submission and completion queue).
spdk_nvme_ctrlr_get_ns() Get a handle to a namespace for the given controller.
spdk_nvme_ns_cmd_read() Submits a read I/O to the specified NVMe namespace.
spdk_nvme_ns_cmd_readv() Submit a read I/O to the specified NVMe namespace.
spdk_nvme_ns_cmd_read_with_md() Submits a read I/O to the specified NVMe namespace.
spdk_nvme_ns_cmd_write() Submit a write I/O to the specified NVMe namespace.
spdk_nvme_ns_cmd_writev() Submit a write I/O to the specified NVMe namespace.
spdk_nvme_ns_cmd_write_with_md() Submit a write I/O to the specified NVMe namespace.
spdk_nvme_ns_cmd_write_zeroes() Submit a write zeroes I/O to the specified NVMe namespace.
spdk_nvme_ns_cmd_dataset_management() Submit a data set management request to the specified NVMe namespace.
spdk_nvme_ns_cmd_flush() Submit a flush request to the specified NVMe namespace.
spdk_nvme_qpair_process_completions() Process any outstanding completions for I/O submitted on a queue pair.
spdk_nvme_ctrlr_cmd_admin_raw() Send the given admin command to the NVMe controller.
spdk_nvme_ctrlr_process_admin_completions() Process any outstanding completions for admin commands.
spdk_nvme_ctrlr_cmd_io_raw() Send the given NVM I/O command to the NVMe controller.
spdk_nvme_ctrlr_cmd_io_raw_with_md() Send the given NVM I/O command with metadata to the NVMe controller.

NVMe Driver Design

NVMe I/O Submission

I/O is submitted to an NVMe namespace using nvme_ns_cmd_xxx functions. The NVMe driver submits the I/O request as an NVMe submission queue entry on the queue pair specified in the command. The function returns immediately, prior to the completion of the command. The application must poll for I/O completion on each queue pair with outstanding I/O to receive completion callbacks by calling spdk_nvme_qpair_process_completions().

See also
spdk_nvme_ns_cmd_read, spdk_nvme_ns_cmd_write, spdk_nvme_ns_cmd_dataset_management, spdk_nvme_ns_cmd_flush, spdk_nvme_qpair_process_completions

Scaling Performance

NVMe queue pairs (struct spdk_nvme_qpair) provide parallel submission paths for I/O. I/O may be submitted on multiple queue pairs simultaneously from different threads. Queue pairs contain no locks or atomics, however, so a given queue pair may only be used by a single thread at a time. This requirement is not enforced by the NVMe driver (doing so would require a lock), and violating this requirement results in undefined behavior.

The number of queue pairs allowed is dictated by the NVMe SSD itself. The specification allows for thousands, but most devices support between 32 and 128. The specification makes no guarantees about the performance available from each queue pair, but in practice the full performance of a device is almost always achievable using just one queue pair. For example, if a device claims to be capable of 450,000 I/O per second at queue depth 128, in practice it does not matter if the driver is using 4 queue pairs each with queue depth 32, or a single queue pair with queue depth 128.

Given the above, the easiest threading model for an application using SPDK is to spawn a fixed number of threads in a pool and dedicate a single NVMe queue pair to each thread. A further improvement would be to pin each thread to a separate CPU core, and often the SPDK documentation will use "CPU core" and "thread" interchangeably because we have this threading model in mind.

The NVMe driver takes no locks in the I/O path, so it scales linearly in terms of performance per thread as long as a queue pair and a CPU core are dedicated to each new thread. In order to take full advantage of this scaling, applications should consider organizing their internal data structures such that data is assigned exclusively to a single thread. All operations that require that data should be done by sending a request to the owning thread. This results in a message passing architecture, as opposed to a locking architecture, and will result in superior scaling across CPU cores.

NVMe Driver Internal Memory Usage

The SPDK NVMe driver provides a zero-copy data transfer path, which means that there are no data buffers for I/O commands. However, some Admin commands have data copies depending on the API used by the user.

Each queue pair has a number of trackers used to track commands submitted by the caller. The number trackers for I/O queues depend on the users' input for queue size and the value read from controller capabilities register field Maximum Queue Entries Supported(MQES, 0 based value). Each tracker has a fixed size 4096 Bytes, so the maximum memory used for each I/O queue is: (MQES + 1) * 4 KiB.

I/O queue pairs can be allocated in host memory, this is used for most NVMe controllers, some NVMe controllers which can support Controller Memory Buffer may put I/O queue pairs at controllers' PCI BAR space, SPDK NVMe driver can put I/O submission queue into controller memory buffer, it depends on users' input and controller capabilities. Each submission queue entry (SQE) and completion queue entry (CQE) consumes 64 bytes and 16 bytes respectively. Therefore, the maximum memory used for each I/O queue pair is (MQES + 1) * (64 + 16) Bytes.

NVMe over Fabrics Host Support

The NVMe driver supports connecting to remote NVMe-oF targets and interacting with them in the same manner as local NVMe SSDs.

Specifying Remote NVMe over Fabrics Targets

The method for connecting to a remote NVMe-oF target is very similar to the normal enumeration process for local PCIe-attached NVMe devices. To connect to a remote NVMe over Fabrics subsystem, the user may call spdk_nvme_probe() with the trid parameter specifying the address of the NVMe-oF target.

The caller may fill out the spdk_nvme_transport_id structure manually or use the spdk_nvme_transport_id_parse() function to convert a human-readable string representation into the required structure.

The spdk_nvme_transport_id may contain the address of a discovery service or a single NVM subsystem. If a discovery service address is specified, the NVMe library will call the spdk_nvme_probe() probe_cb for each discovered NVM subsystem, which allows the user to select the desired subsystems to be attached. Alternatively, if the address specifies a single NVM subsystem directly, the NVMe library will call probe_cb for just that subsystem; this allows the user to skip the discovery step and connect directly to a subsystem with a known address.

NVMe Multi Process

This capability enables the SPDK NVMe driver to support multiple processes accessing the same NVMe device. The NVMe driver allocates critical structures from shared memory, so that each process can map that memory and create its own queue pairs or share the admin queue. There is a limited number of I/O queue pairs per NVMe controller.

The primary motivation for this feature is to support management tools that can attach to long running applications, perform some maintenance work or gather information, and then detach.

Configuration

DPDK EAL allows different types of processes to be spawned, each with different permissions on the hugepage memory used by the applications.

There are two types of processes:

  1. a primary process which initializes the shared memory and has full privileges and
  2. a secondary process which can attach to the primary process by mapping its shared memory regions and perform NVMe operations including creating queue pairs.

This feature is enabled by default and is controlled by selecting a value for the shared memory group ID. This ID is a positive integer and two applications with the same shared memory group ID will share memory. The first application with a given shared memory group ID will be considered the primary and all others secondary.

Example: identical shm_id and non-overlapping core masks

./perf options [AIO device(s)]...
[-c core mask for I/O submission/completion]
[-i shared memory group ID]
./perf -q 1 -o 4096 -w randread -c 0x1 -t 60 -i 1
./perf -q 8 -o 131072 -w write -c 0x10 -t 60 -i 1

Limitations

  1. Two processes sharing memory may not share any cores in their core mask.
  2. If a primary process exits while secondary processes are still running, those processes will continue to run. However, a new primary process cannot be created.
  3. Applications are responsible for coordinating access to logical blocks.
See also
spdk_nvme_probe, spdk_nvme_ctrlr_process_admin_completions

NVMe Hotplug

At the NVMe driver level, we provide the following support for Hotplug:

  1. Hotplug events detection: The user of the NVMe library can call spdk_nvme_probe() periodically to detect hotplug events. The probe_cb, followed by the attach_cb, will be called for each new device detected. The user may optionally also provide a remove_cb that will be called if a previously attached NVMe device is no longer present on the system. All subsequent I/O to the removed device will return an error.
  2. Hot remove NVMe with IO loads: When a device is hot removed while I/O is occurring, all access to the PCI BAR will result in a SIGBUS error. The NVMe driver automatically handles this case by installing a SIGBUS handler and remapping the PCI BAR to a new, placeholder memory location. This means I/O in flight during a hot remove will complete with an appropriate error code and will not crash the application.
See also
spdk_nvme_probe