This document is intended to provide an overview of how Vhost works behind the scenes. Code snippets used in this document might have been simplified for the sake of readability and should not be used as an API or implementation reference.
Reading from the Virtio specification:
The purpose of virtio and [virtio] specification is that virtual environments and guests should have a straightforward, efficient, standard and extensible mechanism for virtual devices, rather than boutique per-environment or per-OS mechanisms.
Virtio devices use virtqueues to transport data efficiently. Virtqueue is a set of three different single-producer, single-consumer ring structures designed to store generic scatter-gatter I/O. Virtio is most commonly used in QEMU VMs, where the QEMU itself exposes a virtual PCI device and the guest OS communicates with it using a specific Virtio PCI driver. With only Virtio involved, it's always the QEMU process that handles all I/O traffic.
Vhost is a protocol for devices accessible via inter-process communication. It uses the same virtqueue layout as Virtio to allow Vhost devices to be mapped directly to Virtio devices. This allows a Vhost device, exposed by an SPDK application, to be accessed directly by a guest OS inside a QEMU process with an existing Virtio (PCI) driver. Only the configuration, I/O submission notification, and I/O completion interruption are piped through QEMU. See also SPDK optimizations
The initial vhost implementation is a part of the Linux kernel and uses ioctl interface to communicate with userspace applications. What makes it possible for SPDK to expose a vhost device is Vhost-user protocol.
The Vhost-user specification describes the protocol as follows:
[Vhost-user protocol] is aiming to complement the ioctl interface used to control the vhost implementation in the Linux kernel. It implements the control plane needed to establish virtqueue sharing with a user space process on the same host. It uses communication over a Unix domain socket to share file descriptors in the ancillary data of the message.
The protocol defines 2 sides of the communication, master and slave. Master is the application that shares its virtqueues, in our case QEMU. Slave is the consumer of the virtqueues.
In the current implementation QEMU is the Master, and the Slave is intended to be a software Ethernet switch running in user space, such as Snabbswitch.
Master and slave can be either a client (i.e. connecting) or server (listening) in the socket communication.
SPDK vhost is a Vhost-user slave server. It exposes Unix domain sockets and allows external applications to connect.
One of major Vhost-user use cases is networking (DPDK) or storage (SPDK) offload in QEMU. The following diagram presents how QEMU-based VM communicates with SPDK Vhost-SCSI device.
All initialization and management information is exchanged using Vhost-user messages. The connection always starts with the feature negotiation. Both the Master and the Slave exposes a list of their implemented features and upon negotiation they choose a common set of those. Most of these features are implementation-related, but also regard e.g. multiqueue support or live migration.
After the negotiation, the Vhost-user driver shares its memory, so that the vhost device (SPDK) can access it directly. The memory can be fragmented into multiple physically-discontiguous regions and Vhost-user specification puts a limit on their number - currently 8. The driver sends a single message for each region with the following data:
The Master will send new memory regions after each memory change - usually hotplug/hotremove. The previous mappings will be removed.
Drivers may also request a device config, consisting of e.g. disk geometry. Vhost-SCSI drivers, however, don't need to implement this functionality as they use common SCSI I/O to inquiry the underlying disk(s).
Afterwards, the driver requests the number of maximum supported queues and starts sending virtqueue data, which consists of:
If multiqueue feature has been negotiated, the driver has to send a specific ENABLE message for each extra queue it wants to be polled. Other queues are polled as soon as they're initialized.
The Master sends I/O by allocating proper buffers in shared memory, filling the request data, and putting guest addresses of those buffers into virtqueues.
A Virtio-Block request looks as follows.
And a Virtio-SCSI request as follows.
Virtqueue generally consists of an array of descriptors and each I/O needs to be converted into a chain of such descriptors. A single descriptor can be either readable or writable, so each I/O request consists of at least two (request + response).
Legacy Virtio implementations used the name vring alongside virtqueue, and the name vring is still used in virtio data structures inside the code. Instead of
struct virtq_desc, the
struct vring_desc is much more likely to be found.
The device after polling this descriptor chain needs to translate and transform it back into the original request struct. It needs to know the request layout up-front, so each device backend (Vhost-Block/SCSI) has its own implementation for polling virtqueues. For each descriptor, the device performs a lookup in the Vhost-user memory region table and goes through a gpa_to_vva translation (guest physical address to vhost virtual address). SPDK enforces the request and response data to be contained within a single memory region. I/O buffers do not have such limitations and SPDK may automatically perform additional iovec splitting and gpa_to_vva translations if required. After forming the request structs, SPDK forwards such I/O to the underlying drive and polls for the completion. Once I/O completes, SPDK vhost fills the response buffer with proper data and interrupts the guest by doing an eventfd_write on the call descriptor for proper virtqueue. There are multiple interrupt coalescing features involved, but they are not be discussed in this document.
Due to its poll-mode nature, SPDK vhost removes the requirement for I/O submission notifications, drastically increasing the vhost server throughput and decreasing the guest overhead of submitting an I/O. A couple of different solutions exist to mitigate the I/O completion interrupt overhead (irqfd, vDPA), but those won't be discussed in this document. For the highest performance, a poll-mode Virtio driver can be used, as it suppresses all I/O completion interrupts, making the I/O path to fully bypass the QEMU/KVM overhead.