raid5f: the SPDK RAID 5 implementation

RAID 5 is a popular RAID level which, similarly to RAID 1 (mirroring), protects the data in the array against the failure of a member drive and can improve read performance by reading from multiple drives in parallel. It is often preferred over mirroring because it “wastes” less storage capacity - the array can have many member drives, but the total available capacity is always reduced only by the size of one member. RAID 1, in contrast, exposes the capacity of only one member - 50% of the raw capacity in the most common two-way mirror case, which provides the same level of protection as RAID 5: surviving the failure of one drive.

The biggest downside of RAID 5 has always been write performance. Every write to the array causes writes to at least two member drives, because apart from the actual data, the parity also has to be updated. Parity is the redundant data that allows recovery after a drive failure, and it must stay in sync with the data on the other drives. But the additional write itself is not the problem - after all, RAID 1 also has to duplicate every write. Computing the parity adds some overhead, but it is still much faster than accessing storage. The real performance hit comes when a write triggers a read-modify-write cycle: to compute the updated parity, some data has to be read from at least one member drive first. This happens when a RAID stripe is only partially updated.
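
To make this concrete, here is a minimal, self-contained C sketch (illustrative only, not the SPDK implementation) of the two cases: for a full stripe write the parity is just the XOR of all data strips and nothing has to be read, while updating a single strip uses new_parity = old_parity ^ old_data ^ new_data, which forces the old data and old parity to be read first. The helper names are made up for this example.

    #include <stddef.h>
    #include <stdint.h>

    /* Full stripe write: the parity strip is simply the XOR of all data
     * strips, so nothing has to be read from the member drives first. */
    static void
    compute_parity(const uint8_t **data_strips, size_t num_data_strips,
                   uint8_t *parity, size_t strip_len)
    {
        for (size_t i = 0; i < strip_len; i++) {
            parity[i] = 0;
            for (size_t j = 0; j < num_data_strips; j++) {
                parity[i] ^= data_strips[j][i];
            }
        }
    }

    /* Partial stripe write: updating a single data strip requires the old
     * data and the old parity to be read first (read-modify-write):
     *     new_parity = old_parity ^ old_data ^ new_data */
    static void
    update_parity_partial(const uint8_t *old_data, const uint8_t *new_data,
                          uint8_t *parity, size_t strip_len)
    {
        for (size_t i = 0; i < strip_len; i++) {
            parity[i] ^= old_data[i] ^ new_data[i];
        }
    }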

A stripe is the data plus parity interleaved across the member drives. The size of the part of a stripe stored on a single member is configurable and is known as the strip (a.k.a. chunk) size. The amount of data that a single stripe can hold is equal to strip_size * (n - 1), where n is the number of members in the array. Here is a diagram showing an example layout of an array of 4 drives with the strip size set to 4 blocks, so each stripe holds 12 blocks of data:

block     drive 0        drive 1        drive 2        drive 3
     +--------------+--------------+--------------+--------------+
 0   | data strip 0 | data strip 1 | data strip 2 | parity strip | stripe 0
 1   |              |              |              |              |
 2   | RAID blocks  | RAID blocks  | RAID blocks  |              | RAID blocks
 3   | 0-3          | 4-7          | 8-11         |              | 0-11
     +--------------+--------------+--------------+--------------+
 4   | data strip 0 | data strip 1 | parity strip | data strip 2 | stripe 1
 5   |              |              |              |              |
 6   | RAID blocks  | RAID blocks  |              | RAID blocks  | RAID blocks
 7   | 12-15        | 16-19        |              | 20-23        | 12-23
     +--------------+--------------+--------------+--------------+
 8   | data strip 0 | parity strip | data strip 1 | data strip 2 | stripe 2
 9   |              |              |              |              |
10   | RAID blocks  |              | RAID blocks  | RAID blocks  | RAID blocks
11   | 24-27        |              | 28-31        | 32-35        | 24-35
     +--------------+--------------+--------------+--------------+
12   | parity strip | data strip 0 | data strip 1 | data strip 2 | stripe 3
13   |              |              |              |              |
14   |              | RAID blocks  | RAID blocks  | RAID blocks  | RAID blocks
15   |              | 36-39        | 40-43        | 44-47        | 36-47
     +--------------+--------------+--------------+--------------+
16   | data strip 0 | data strip 1 | data strip 2 | parity strip | stripe 4
17   |              |              |              |              |
18   | RAID blocks  | RAID blocks  | RAID blocks  |              | RAID blocks
19   | 48-51        | 52-55        | 56-59        |              | 48-59
     +--------------+--------------+--------------+--------------+
20   | data strip 0 | data strip 1 | parity strip | data strip 2 | stripe 5
21   |              |              |              |              |
...
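
To tie the diagram back to the strip_size * (n - 1) formula, here is a small illustrative C sketch (the names are hypothetical, not taken from the SPDK sources) that maps a RAID block number to its stripe, member drive, and on-drive block offset, assuming the parity rotation shown above (parity starting on the last drive and moving one drive to the left with every stripe):

    #include <stdint.h>
    #include <stdio.h>

    struct raid_location {
        uint64_t stripe;        /* stripe index */
        uint8_t  drive;         /* member drive holding the data block */
        uint64_t drive_block;   /* block offset on that member drive */
        uint8_t  parity_drive;  /* member drive holding the parity strip */
    };

    /* Map a RAID (logical) block number to its location, assuming the parity
     * rotation from the diagram. Illustrative only. */
    static struct raid_location
    raid5_map_block(uint64_t raid_block, uint32_t strip_size, uint8_t num_drives)
    {
        uint64_t data_per_stripe = (uint64_t)strip_size * (num_drives - 1);
        uint64_t idx, data_strip;
        struct raid_location loc;

        loc.stripe = raid_block / data_per_stripe;
        idx = raid_block % data_per_stripe;
        data_strip = idx / strip_size;

        loc.parity_drive = (num_drives - 1) - (loc.stripe % num_drives);
        /* Data strips fill the member drives in order, skipping the parity drive. */
        loc.drive = data_strip < loc.parity_drive ? data_strip : data_strip + 1;
        loc.drive_block = loc.stripe * strip_size + idx % strip_size;

        return loc;
    }

    int
    main(void)
    {
        /* RAID block 20 should land in stripe 1 on drive 3, block 4, with
         * parity on drive 2 - compare with the diagram above. */
        struct raid_location loc = raid5_map_block(20, 4, 4);

        printf("stripe %llu, drive %u, block %llu, parity on drive %u\n",
               (unsigned long long)loc.stripe, (unsigned)loc.drive,
               (unsigned long long)loc.drive_block, (unsigned)loc.parity_drive);
        return 0;
    }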

Writing less than a full stripe causes a read-modify-write cycle, is bad for performance, and can even lead to silent data corruption when a dirty shutdown is combined with a drive failure - a phenomenon known as the RAID write hole. For these reasons, it is recommended to optimize the workload to avoid partial stripe writes and use full stripe writes whenever possible.

With raid5f we went beyond recommending full stripe writes and require them outright. That’s what the “f” at the end of raid5f stands for: full stripe writes, to differentiate it from standard RAID 5. Without having to support partial stripe updates, the code can be much simpler and potentially faster. The stripe size is set as the bdev’s write unit size and is enforced by the bdev layer. This value can be retrieved with the API call spdk_bdev_get_write_unit_size(). If a write to a raid5f bdev is not aligned to a stripe boundary or its length is not a multiple of the stripe size, the I/O will fail. Reads are handled normally, without such restrictions.
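
As an illustration, here is a hedged C sketch of the check an application could perform before submitting I/O. It assumes the raid5f bdev has already been opened with spdk_bdev_open_ext() on an SPDK thread, and the helper name write_is_full_stripes() is made up for this example.

    #include "spdk/bdev.h"

    /* Hypothetical helper: returns true if a write of num_blocks starting at
     * offset_blocks begins on a stripe boundary and covers a whole number of
     * stripes, i.e. it is a sequence of full stripe writes that raid5f will
     * accept. The write unit size is expressed in logical blocks. */
    static bool
    write_is_full_stripes(struct spdk_bdev_desc *desc, uint64_t offset_blocks,
                          uint64_t num_blocks)
    {
        struct spdk_bdev *bdev = spdk_bdev_desc_get_bdev(desc);
        uint32_t write_unit = spdk_bdev_get_write_unit_size(bdev);

        return num_blocks > 0 &&
               offset_blocks % write_unit == 0 &&
               num_blocks % write_unit == 0;
    }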

Obviously, the requirement to use only full stripe writes puts an additional burden on the application. In some cases it may not be a problem, in others it will require big changes. An interesting option, and initially the main motivation behind raid5f, is to combine it with the FTL bdev. The FTL bdev can work on top of raid5f and, among other benefits, removes the full stripe write requirement from the application, thanks to its logical-to-physical (L2P) block mapping.

If you would like to try raid5f, configure SPDK using the --with-raid5f option. You need at least 3 bdevs of any kind, but they must have the same block size and metadata format. Their sizes do not have to be equal - the capacity used from each member will be limited to that of the smallest base bdev. An example command to create a raid5f bdev is:

$ scripts/rpc.py bdev_raid_create -n raid_bdev0 -z 128 -r 5f -b "malloc0 malloc1 malloc2"

This creates a raid5f array named raid_bdev0 from the bdevs malloc0, malloc1 and malloc2, with the strip size set to 128 KiB. With three members, each stripe holds two data strips, so the resulting write unit size is 2 * 128 KiB = 256 KiB.
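
To show what a conforming write to this array might look like from C, here is a hedged sketch. The helpers write_one_stripe() and write_done() are made up for this example, and the code assumes it runs inside a started SPDK application, with raid_bdev0 already opened via spdk_bdev_open_ext() and an I/O channel obtained with spdk_bdev_get_io_channel().

    #include "spdk/bdev.h"
    #include "spdk/env.h"
    #include "spdk/log.h"

    static void
    write_done(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
    {
        spdk_dma_free(cb_arg);          /* release the stripe buffer */
        spdk_bdev_free_io(bdev_io);
        SPDK_NOTICELOG("full stripe write %s\n", success ? "completed" : "failed");
    }

    /* Hypothetical helper: writes exactly one stripe (the bdev's write unit)
     * at the beginning of the raid5f bdev. Assumes it is called on an SPDK
     * thread, with the bdev already opened (desc) and an I/O channel (ch). */
    static int
    write_one_stripe(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch)
    {
        struct spdk_bdev *bdev = spdk_bdev_desc_get_bdev(desc);
        uint32_t write_unit = spdk_bdev_get_write_unit_size(bdev);  /* in blocks */
        uint32_t block_size = spdk_bdev_get_block_size(bdev);
        void *buf;
        int rc;

        /* For the example array above this is 2 data strips * 128 KiB = 256 KiB. */
        buf = spdk_dma_zmalloc((uint64_t)write_unit * block_size, 0x1000, NULL);
        if (buf == NULL) {
            return -ENOMEM;
        }

        /* offset_blocks (0) and num_blocks (write_unit) are both multiples of
         * the write unit size, so raid5f accepts this as one full stripe. */
        rc = spdk_bdev_write_blocks(desc, ch, buf, 0, write_unit, write_done, buf);
        if (rc != 0) {
            spdk_dma_free(buf);
        }
        return rc;
    }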

SPDK RAID modules are still in active development. Some big features, like rebuild and superblock support, have recently been merged and are available in SPDK release 24.01. More are coming in the future - we encourage you to review, share feedback, and submit your changes!