NVMe over Fabrics Target Programming Guide

Target Audience

This programming guide is intended for developers authoring applications that use the SPDK NVMe-oF target library (lib/nvmf). It is intended to provide background context, architectural insight, and design recommendations. This guide will not cover how to use the SPDK NVMe-oF target application. For a guide on how to use the existing application as-is, see NVMe over Fabrics Target.

Introduction

The SPDK NVMe-oF target library is located in lib/nvmf. The library implements all logic required to create an NVMe-oF target application. It is used in the implementation of the example NVMe-oF target application in app/nvmf_tgt, but is intended to be consumed independently.

This guide is written assuming that the reader is familiar with both NVMe and NVMe over Fabrics. The best way to become familiar with those is to read their specifications.

Primitives

The library exposes a number of primitives - basic objects that the user creates and interacts with. They are:

struct spdk_nvmf_tgt: An NVMe-oF target. This concept, surprisingly, does not appear in the NVMe-oF specification. SPDK defines this to mean the collection of subsystems with the associated namespaces, plus the set of transports and their associated network connections. This will be referred to throughout this guide as a target.

struct spdk_nvmf_subsystem: An NVMe-oF subsystem, as defined by the NVMe-oF specification. Subsystems contain namespaces and controllers and perform access control. This will be referred to throughout this guide as a subsystem.

struct spdk_nvmf_ns: An NVMe-oF namespace, as defined by the NVMe-oF specification. Namespaces are bdevs. See Block Device User Guide for an explanation of the SPDK bdev layer. This will be referred to throughout this guide as a namespace.

struct spdk_nvmf_qpair: An NVMe-oF queue pair, as defined by the NVMe-oF specification. These map 1:1 to network connections. This will be referred to throughout this guide as a qpair.

struct spdk_nvmf_transport: An abstraction for a network fabric, as defined by the NVMe-oF specification. The specification is designed to allow for many different network fabrics, so the code mirrors that and implements a plugin system. Currently, only the RDMA transport is available. This will be referred to throughout this guide as a transport.

struct spdk_nvmf_poll_group: An abstraction for a collection of network connections that can be polled as a unit. This is an SPDK-defined concept that does not appear in the NVMe-oF specification. Often, network transports have facilities to check for incoming data on groups of connections more efficiently than checking each one individually (e.g. epoll), so poll groups provide a generic abstraction for that. This will be referred to throughout this guide as a poll group.

struct spdk_nvmf_listener: A network address at which the target will accept new connections.

struct spdk_nvmf_host: An NVMe-oF NQN representing a host (initiator) system. This is used for access control.

The Basics

A user of the NVMe-oF target library begins by creating a target using spdk_nvmf_tgt_create(), setting up a set of addresses to accept connections on by calling spdk_nvmf_tgt_listen(), then creating a subsystem using spdk_nvmf_subsystem_create().

Subsystems begin in an inactive state and must be activated by calling spdk_nvmf_subsystem_start(). Subsystems may be modified at run time, but only when in the paused or inactive state. A running subsystem may be paused by calling spdk_nvmf_subsystem_pause() and resumed by calling spdk_nvmf_subsystem_resume().

Namespaces may be added to the subsystem by calling spdk_nvmf_subsystem_add_ns() when the subsystem is inactive or paused. Namespaces are bdevs. See Block Device User Guide for more information about the SPDK bdev layer. A bdev may be obtained by calling spdk_bdev_get_by_name().
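The lifecycle rules above can be pictured as a small state machine. The following is a minimal, self-contained sketch, not the library's implementation: `subsys_model` and every function name in it are hypothetical stand-ins for the real `spdk_nvmf_subsystem_*()` calls, and it only models which transitions and modifications the text says are allowed.

```c
#include <stdbool.h>

/* Hypothetical model of the subsystem lifecycle; not the SPDK definitions. */
enum subsys_state { SUBSYS_INACTIVE, SUBSYS_ACTIVE, SUBSYS_PAUSED };

struct subsys_model {
    enum subsys_state state;
    int num_ns; /* number of namespaces attached so far */
};

/* Models spdk_nvmf_subsystem_start(): only an inactive subsystem can start. */
bool subsys_start(struct subsys_model *s)
{
    if (s->state != SUBSYS_INACTIVE) {
        return false;
    }
    s->state = SUBSYS_ACTIVE;
    return true;
}

/* Models spdk_nvmf_subsystem_pause(): only a running subsystem can pause. */
bool subsys_pause(struct subsys_model *s)
{
    if (s->state != SUBSYS_ACTIVE) {
        return false;
    }
    s->state = SUBSYS_PAUSED;
    return true;
}

/* Models spdk_nvmf_subsystem_resume(): only a paused subsystem can resume. */
bool subsys_resume(struct subsys_model *s)
{
    if (s->state != SUBSYS_PAUSED) {
        return false;
    }
    s->state = SUBSYS_ACTIVE;
    return true;
}

/* Models the rule that namespaces may only be added while the
 * subsystem is inactive or paused. */
bool subsys_add_ns(struct subsys_model *s)
{
    if (s->state == SUBSYS_ACTIVE) {
        return false;
    }
    s->num_ns++;
    return true;
}
```

In this model, an attempt to add a namespace to a running subsystem is rejected until the caller pauses it first.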

Once a subsystem exists and the target is listening on an address, new connections may be accepted by polling spdk_nvmf_tgt_accept().

All I/O to a subsystem is driven by a poll group, which polls for incoming network I/O. Poll groups may be created by calling spdk_nvmf_poll_group_create(). They automatically request to begin polling upon creation on the thread from which they were created. Most importantly, a poll group may only be accessed from the thread it was created on.

When spdk_nvmf_tgt_accept() detects a new connection, it constructs a new struct spdk_nvmf_qpair object and calls the user-provided new_qpair_fn callback for each new qpair. In response to this callback, the user must assign the qpair to a poll group by calling spdk_nvmf_poll_group_add(). Remember, a poll group may only be accessed from the thread it was created on, so calling spdk_nvmf_poll_group_add() may require passing a message to the appropriate thread.

Access Control

Access control is performed at the subsystem level by adding allowed listen addresses and hosts to a subsystem (see spdk_nvmf_subsystem_add_listener() and spdk_nvmf_subsystem_add_host()). By default, a subsystem will not accept connections from any host or over any established listen address. Listeners and hosts may only be added to inactive or paused subsystems.
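The default-deny behavior can be illustrated with a small allow-list check. This is a sketch under stated assumptions: `acl_model` and its functions are hypothetical and do not reflect how the library stores hosts or listeners, and the NQN and address strings in the usage note are made-up examples. It only shows that a connection is accepted when both the host and the listen address were explicitly added.

```c
#include <stdbool.h>
#include <string.h>

#define ACL_MAX_ENTRIES 8

/* Hypothetical per-subsystem allow lists; both start empty (deny all). */
struct acl_model {
    const char *hosts[ACL_MAX_ENTRIES];     /* allowed host NQNs */
    int num_hosts;
    const char *listeners[ACL_MAX_ENTRIES]; /* allowed listen addresses */
    int num_listeners;
};

static bool acl_list_contains(const char *const *list, int n, const char *key)
{
    for (int i = 0; i < n; i++) {
        if (strcmp(list[i], key) == 0) {
            return true;
        }
    }
    return false;
}

/* Stand-in for spdk_nvmf_subsystem_add_host(). */
bool acl_add_host(struct acl_model *a, const char *nqn)
{
    if (a->num_hosts == ACL_MAX_ENTRIES) {
        return false;
    }
    a->hosts[a->num_hosts++] = nqn;
    return true;
}

/* Stand-in for spdk_nvmf_subsystem_add_listener(). */
bool acl_add_listener(struct acl_model *a, const char *addr)
{
    if (a->num_listeners == ACL_MAX_ENTRIES) {
        return false;
    }
    a->listeners[a->num_listeners++] = addr;
    return true;
}

/* A connection is allowed only if the host AND the listen address
 * it arrived on were both added to the subsystem. */
bool acl_connection_allowed(const struct acl_model *a,
                            const char *host_nqn, const char *listen_addr)
{
    return acl_list_contains(a->hosts, a->num_hosts, host_nqn) &&
           acl_list_contains(a->listeners, a->num_listeners, listen_addr);
}
```

For example, after adding host "nqn.2016-06.io.spdk:host1" and listener "192.168.0.10:4420" to a zero-initialized `acl_model`, only that host connecting over that address passes the check.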

Discovery Subsystems

A discovery subsystem, as defined by the NVMe-oF specification, is automatically created for each NVMe-oF target constructed. Connections to the discovery subsystem are handled in the same way as any other subsystem - new qpairs are created in response to spdk_nvmf_tgt_accept() and they must be assigned to a poll group.

Transports

The NVMe-oF specification defines multiple network transports (the "Fabrics" in NVMe over Fabrics) and has an extensible system for adding new fabrics in the future. The SPDK NVMe-oF target library implements a plugin system for network transports to mirror the specification. The API a new transport must implement is located in lib/nvmf/transport.h. As of this writing, only an RDMA transport has been implemented.

The SPDK NVMe-oF target is designed to be able to process I/O from multiple fabrics simultaneously.

Choosing a Threading Model

The SPDK NVMe-oF target library does not strictly dictate a threading model, but poll groups do all of their polling and I/O processing on the thread they are created on. Given that, it almost always makes sense to create one poll group per thread used in the application. New qpairs created in response to spdk_nvmf_tgt_accept() can be handed out round-robin to the poll groups. This is how the SPDK NVMe-oF target application currently functions.
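Round-robin assignment needs nothing more than a cursor over the poll groups. The sketch below is illustrative only: `rr_assigner` and `rr_next_group()` are hypothetical names, and the returned index stands for "the poll group (and thread) that should receive the new qpair". In a real application, the new_qpair_fn callback would then pass a message to that thread, which calls spdk_nvmf_poll_group_add().

```c
#include <stddef.h>

/* Hypothetical round-robin cursor over the application's poll groups. */
struct rr_assigner {
    size_t num_groups; /* one poll group per thread; must be > 0 */
    size_t next;       /* index of the group that gets the next qpair */
};

/* Returns the poll group index for a new qpair and advances the cursor. */
size_t rr_next_group(struct rr_assigner *rr)
{
    size_t chosen = rr->next;

    rr->next = (rr->next + 1) % rr->num_groups;
    return chosen;
}
```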

More advanced algorithms for distributing qpairs to poll groups are possible. For instance, a NUMA-aware algorithm would be an improvement over basic round-robin, where NUMA-aware means assigning qpairs to poll groups running on CPU cores that are on the same NUMA node as the network adapter and storage device. Load-aware algorithms may also be beneficial.
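One way to combine the NUMA-aware and load-aware ideas is to prefer the least-loaded poll group on the adapter's NUMA node and fall back to the least-loaded group overall. This is a sketch of that policy, not something the library provides: `group_info`, `numa_pick_group()`, and the use of qpair count as the load metric are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical per-poll-group bookkeeping kept by the application. */
struct group_info {
    int numa_node;     /* NUMA node of the core running this poll group */
    size_t num_qpairs; /* qpairs currently assigned (the load metric here) */
};

/* Picks the least-loaded group on preferred_node; if no group runs on
 * that node, picks the least-loaded group overall. Requires n > 0.
 * Returns the chosen index and records the assignment. */
size_t numa_pick_group(struct group_info *groups, size_t n, int preferred_node)
{
    size_t best_local = n; /* n means "no local group found yet" */
    size_t best_any = 0;

    for (size_t i = 0; i < n; i++) {
        if (groups[i].num_qpairs < groups[best_any].num_qpairs) {
            best_any = i;
        }
        if (groups[i].numa_node == preferred_node &&
            (best_local == n ||
             groups[i].num_qpairs < groups[best_local].num_qpairs)) {
            best_local = i;
        }
    }

    size_t chosen = (best_local != n) ? best_local : best_any;
    groups[chosen].num_qpairs++;
    return chosen;
}
```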

Scaling Across CPU Cores

Incoming I/O requests are picked up by the poll group polling their assigned qpair. For regular NVMe commands such as READ and WRITE, the I/O request is processed on the initial thread from start to the point where it is submitted to the backing storage device, without interruption. Completions are discovered by polling the backing storage device and also processed to completion on the polling thread. Regular NVMe commands (READ, WRITE, etc.) do not require any cross-thread coordination, and therefore take no locks.

NVMe ADMIN commands, which are used for managing the NVMe device itself, may modify global state in the subsystem. For instance, an NVMe ADMIN command may perform namespace management, such as shrinking a namespace. For these commands, the subsystem will temporarily enter a paused state by sending a message to each thread in the system. All new incoming I/O on any thread targeting the subsystem will be queued during this time. Once the subsystem is fully paused, the state change will occur, and messages will be sent to each thread to release queued I/O and resume. Management commands are rare, so this style of coordination is preferable to forcing all commands to take locks in the I/O path.
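The pause, modify, resume sequence can be modeled in a few lines. The sketch below is a single-threaded simulation under stated assumptions: all names are hypothetical, the loops over `threads[]` stand in for sending a message to each thread, and I/O is reduced to counters. It only illustrates that I/O arriving during the pause is held and then released, so the I/O path itself never takes a lock.

```c
#include <stdbool.h>

#define PAUSE_NUM_THREADS 4

/* Hypothetical per-thread view of one subsystem. */
struct pause_thread_ctx {
    bool paused;
    int queued;    /* I/O held back while the subsystem is paused */
    int processed; /* I/O submitted to the backing device */
};

struct pause_model {
    struct pause_thread_ctx threads[PAUSE_NUM_THREADS];
    int ns_size; /* global state that an ADMIN command may change */
};

/* I/O path: checks only thread-local state, so no lock is needed. */
void submit_io(struct pause_model *m, int thread)
{
    struct pause_thread_ctx *t = &m->threads[thread];

    if (t->paused) {
        t->queued++;
    } else {
        t->processed++;
    }
}

/* Stand-in for messaging every thread to enter the paused state. */
void pause_all(struct pause_model *m)
{
    for (int i = 0; i < PAUSE_NUM_THREADS; i++) {
        m->threads[i].paused = true;
    }
}

/* Stand-in for messaging every thread to release queued I/O and resume. */
void resume_all(struct pause_model *m)
{
    for (int i = 0; i < PAUSE_NUM_THREADS; i++) {
        struct pause_thread_ctx *t = &m->threads[i];

        t->paused = false;
        t->processed += t->queued;
        t->queued = 0;
    }
}

/* The full sequence for a state-changing ADMIN command. */
void admin_resize_ns(struct pause_model *m, int new_size)
{
    pause_all(m);          /* 1. every thread stops submitting I/O */
    m->ns_size = new_size; /* 2. apply the change while fully paused */
    resume_all(m);         /* 3. release queued I/O on every thread */
}
```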

Zero Copy Support

For the RDMA transport, data is transferred from the RDMA NIC to host memory and then from host memory to the SSD (or vice versa), without any intermediate copies. Data is never moved from one location in host memory to another. Other transports in the future may require data copies.

RDMA

The SPDK NVMe-oF RDMA transport is implemented on top of the libibverbs and rdmacm libraries, which are packaged and available on most Linux distributions. It does not use a user-space RDMA driver stack through DPDK.

In order to scale to large numbers of connections, the SPDK NVMe-oF RDMA transport allocates a single RDMA completion queue per poll group. All new qpairs assigned to the poll group are given their own RDMA send and receive queues, but share this common completion queue. This allows the poll group to poll a single queue for incoming messages instead of iterating through each one.

Each RDMA request is handled by a state machine that walks the request through a number of states. This keeps the code organized and makes corner cases easier to identify and handle.

RDMA SEND, READ, and WRITE operations are ordered with respect to one another, but RDMA RECVs are not necessarily ordered with SEND acknowledgements. For instance, it is possible to detect an incoming RDMA RECV message containing a new NVMe-oF capsule prior to detecting the acknowledgement of a previous SEND containing an NVMe completion. This is problematic at full queue depth because there may not yet be a free request structure. To handle this, the RDMA request structure is broken into two parts - an rdma_recv and an rdma_request. New RDMA RECVs will always grab a free rdma_recv, but may need to wait in a queue for a SEND acknowledgement before they can acquire a full rdma_request object.
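The two-part split can be sketched with counters. This is a simplified model, not the transport's code: the struct, field, and function names are hypothetical, and the sketch assumes that an rdma_recv returns to its pool when the response SEND is posted while the rdma_request returns only when that SEND is acknowledged. Under that assumption, a new capsule can arrive while every rdma_request is still busy, and it waits until an acknowledgement frees one.

```c
#include <stdbool.h>

/* Hypothetical counters for one qpair at a given queue depth. */
struct rdma_model {
    int free_recvs;    /* rdma_recv objects available */
    int free_requests; /* rdma_request objects available */
    int waiting;       /* RECVs holding an rdma_recv, awaiting an rdma_request */
    int executing;     /* requests being processed */
    int completing;    /* responses posted, SEND not yet acknowledged */
};

void rdma_model_init(struct rdma_model *m, int queue_depth)
{
    m->free_recvs = queue_depth;
    m->free_requests = queue_depth;
    m->waiting = 0;
    m->executing = 0;
    m->completing = 0;
}

/* A new capsule arrived. Returns false if no rdma_recv is free, which
 * would mean the host exceeded the negotiated queue depth. */
bool recv_arrived(struct rdma_model *m)
{
    if (m->free_recvs == 0) {
        return false;
    }
    m->free_recvs--;
    if (m->free_requests > 0) {
        m->free_requests--;
        m->executing++;
    } else {
        m->waiting++; /* wait for a SEND acknowledgement */
    }
    return true;
}

/* The response SEND was posted; the rdma_recv is free again (assumption). */
bool response_posted(struct rdma_model *m)
{
    if (m->executing == 0) {
        return false;
    }
    m->executing--;
    m->completing++;
    m->free_recvs++;
    return true;
}

/* The SEND was acknowledged; the rdma_request is released and handed
 * to a waiting RECV if there is one. */
bool send_acked(struct rdma_model *m)
{
    if (m->completing == 0) {
        return false;
    }
    m->completing--;
    if (m->waiting > 0) {
        m->waiting--;
        m->executing++;
    } else {
        m->free_requests++;
    }
    return true;
}
```

With a queue depth of 2, the sequence two RECVs, one posted response, and one more RECV leaves the third capsule in `waiting` until `send_acked()` is called.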

Further, RDMA NICs expose different queue depths for READ/WRITE operations than they do for SEND/RECV operations. The RDMA transport reports available queue depth based on SEND/RECV operation limits and will queue in software as necessary to accommodate (usually lower) limits on READ/WRITE operations.
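Software queueing against a lower READ/WRITE limit can be sketched as a simple counter with an overflow queue. The names below (`rw_limiter`, `rw_submit()`, `rw_complete()`) are hypothetical, and pending transfers are reduced to a count; a real transport would keep a list of request objects instead.

```c
#include <stdbool.h>

/* Hypothetical limiter for RDMA READ/WRITE operations on one qpair. */
struct rw_limiter {
    int max_rw_depth; /* NIC limit on outstanding READ/WRITE operations */
    int outstanding;  /* operations currently posted to the NIC */
    int pending;      /* transfers queued in software */
};

/* Returns true if the transfer was posted immediately, false if it
 * had to be queued in software because the NIC limit was reached. */
bool rw_submit(struct rw_limiter *l)
{
    if (l->outstanding < l->max_rw_depth) {
        l->outstanding++;
        return true;
    }
    l->pending++;
    return false;
}

/* Called when a READ/WRITE finishes; posts the next queued transfer,
 * if any, into the slot that was just freed. */
void rw_complete(struct rw_limiter *l)
{
    if (l->outstanding == 0) {
        return; /* nothing was outstanding; ignore */
    }
    l->outstanding--;
    if (l->pending > 0) {
        l->pending--;
        l->outstanding++;
    }
}
```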