Last month, Bill Martin, SNIA Technical Council Co-Chair, presented a detailed update on what’s happening in the development and deployment of the NVMe Key-Value standard. Bill explained where Key Value fits within an architecture, why it’s important, and the standards work that is being done between NVM Express and SNIA. The webcast was one of our highest rated. If you missed it, it’s available on-demand along with the webcast slides. Attendees at the live event had many great questions, which Bill Martin has answered here:
Q. Two of the most common KV storage mechanisms in use today are AWS S3 and RocksDB. How does the NVMe KV standard align with or differ from them? How difficult would it be to map the APIs and semantics of those other technologies to NVMe KV devices?
A. KV Storage is intended as a storage layer that would support these and other object storage mechanisms. KVRocks, a RocksDB-compatible key value store and MyRocks-compatible storage engine designed for KV SSDs, is publicly available on GitHub, and a Ceph object storage design is also available. These example implementations can help an implementer get to an efficient use of NVMe KV storage.
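To make the mapping concrete, here is a minimal sketch of how a RocksDB-style Put could be shimmed onto an NVMe KV device. The kv_store() wrapper and the key-derivation scheme are illustrative assumptions, not functions or formats defined by the NVMe or SNIA specifications:

```c
/* Sketch: a RocksDB-style Put shimmed onto an NVMe KV device.
 * kv_store() is a hypothetical wrapper around the NVMe KV Store
 * command; it is not a function defined by the specifications. */
#include <stdint.h>
#include <string.h>

#define NVME_KV_MAX_KEY_LEN 16   /* key is passed within the NVMe command */

int kv_store(const void *key, size_t key_len,
             const void *value, size_t value_len);

/* Derive a 16-byte device key from the application key by zero-padded
 * truncation (a real shim would hash and handle collisions). */
int db_put(const char *user_key, const void *value, size_t value_len)
{
    uint8_t dev_key[NVME_KV_MAX_KEY_LEN] = {0};
    size_t n = strlen(user_key);
    memcpy(dev_key, user_key,
           n < sizeof(dev_key) ? n : sizeof(dev_key));
    return kv_store(dev_key, sizeof(dev_key), value, value_len);
}
```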
Q. At which layer will my app stack need to change to take advantage of KV storage? Will VMware or Linux or Windows need to change at the driver level? Or do the apps need to be changed to treat data differently? If the apps don’t need to change, doesn’t this just take the data layout tables and move them up the stack into the server?
A. The application stack needs to change at the point where it interfaces to a filesystem: the filesystem interface is replaced with a KV storage interface. Whether the application itself must change depends on its current interface. If the application talks to RocksDB or a similar interface, the driver underneath can simply be swapped out so the app talks directly to Key Value storage; in that case the application does not care about the API or the underlying storage. If the application currently interfaces to a filesystem, then the application itself would indeed need to change, and the KV API provides a standardized interface that multiple vendors can support, supplying both the necessary libraries and access to a Key Value storage device. The OS will also need to change to provide a kernel-layer driver for the NVMe KV device. If the application keeps using an existing driver stack that goes through a filesystem, it cannot take advantage of KV storage; but if the application changes, or already has an object storage interface, the kernel filesystem and mapping functions can be removed from the data path.
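As a rough illustration of which layer changes, the sketch below contrasts the two data paths for the same operation; kv_store() is a hypothetical wrapper for the NVMe KV Store command, not a function defined by the specifications:

```c
/* Sketch: the same "save a record" operation through the filesystem
 * stack versus a direct KV interface. kv_store() is a hypothetical
 * wrapper for the NVMe KV Store command. Only the KV path removes the
 * filesystem and block-mapping layers from the data path. */
#include <stddef.h>
#include <stdio.h>

int kv_store(const void *key, size_t key_len,
             const void *value, size_t value_len);

/* Block path: the filesystem maps the name to blocks under the hood. */
int save_via_fs(const char *name, const void *data, size_t len)
{
    FILE *f = fopen(name, "wb");
    if (!f) return -1;
    size_t n = fwrite(data, 1, len, f);
    fclose(f);
    return n == len ? 0 : -1;
}

/* KV path: the name becomes a 16-byte key; the device itself maps
 * the key to a physical location. */
int save_via_kv(const void *key16, const void *data, size_t len)
{
    return kv_store(key16, 16, data, len);
}
```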
Q. Is there a limit to the length of a key or value in the KV Architecture?
A. There are limits to the key and value sizes in the current NVMe standard. The current implementation limits the key to 16 bytes, due to the desire to pass the key within the NVMe command itself. The broader architectural limit is the key-length field, which allows keys of up to 255 bytes; utilizing that would require an alternative mechanism for passing the key to the device. For the value, the size limit is 4 GBytes.
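A host-side sanity check for these limits might look like the following sketch; the constants simply restate the limits above, and treating the 4 GByte value limit as a 32-bit length field is an assumption:

```c
/* Sketch: host-side validation of the limits described above. The
 * 16-byte key limit comes from passing the key inside the command;
 * the 32-bit value length reflects the 4 GByte limit (assumed). */
#include <stdint.h>
#include <stddef.h>

#define KV_MAX_KEY_LEN   16u           /* current command-embedded key */
#define KV_MAX_VALUE_LEN 0xFFFFFFFFull /* ~4 GBytes (assumed 32-bit field) */

int kv_args_valid(size_t key_len, uint64_t value_len)
{
    return key_len > 0 && key_len <= KV_MAX_KEY_LEN
        && value_len <= KV_MAX_VALUE_LEN;
}
```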
Q. Are there any atomicity guarantees (e.g. for overwrites)?
A. The current specification mandates atomicity at the KV level. In other words, if a KV Store command overwrites an existing KV pair and there is a power failure, you get either all of the original value or all of the new value.
Q. Is KV storage for a special class of storage called computational storage, or can it be used for general-purpose storage?
A. This is for any application that benefits from storing objects as opposed to storing blocks. It is unrelated to computational storage, but may be of use in computational storage applications. One application that has been considered is a filesystem: rather than using the device to store blocks and maintaining a mapping from each file handle to the set of blocks that contain the file contents, you would use KV storage, where the file handle is the key and the value holds the file contents.
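The contrast in host-side state can be sketched as follows; the types are illustrative only, not from any specification:

```c
/* Sketch: host-side state in each model. With block storage the
 * filesystem tracks which LBAs hold each file; with KV storage the
 * handle-derived key is all the host keeps, and the device owns the
 * key-to-location mapping. */
#include <stdint.h>

/* Block-based filesystem: per-file map of block extents. */
struct fs_extent { uint64_t start_lba; uint32_t block_count; };
struct fs_file   { uint64_t handle; struct fs_extent *extents; uint32_t n; };

/* KV-based filesystem: the file handle padded into a 16-byte key. */
struct kv_file   { uint8_t key[16]; };
```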
Q. What are the most frequently used devices to use the KV structure?
A. If what is being asked is which devices provide a KV structure, then the answer is that we expect the most common devices using the KV structure to be KV SSDs.
Q. Does the NVMe KV interface require two accesses in order to get a value (i.e., one access to get the value size in order to allocate the buffer, and then a second access to read the value)?
A. If you know the size of the object, or if you can pre-allocate enough space for your maximum-size object, then you can do a single access. This is no different from current implementations, where you have to specify how much data you are retrieving from the storage device by giving a starting LBA and a length. If you do not know the size of the value and need it in order to retrieve the value, then you would indeed need to submit two commands to the NVMe KV storage device.
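The two patterns might look like this in a host library; kv_retrieve() and kv_value_size() are hypothetical wrappers, the latter standing in for whatever size-query mechanism an implementation exposes:

```c
/* Sketch: single-access retrieval when the size is known or bounded,
 * versus the two-command pattern when it is not. */
#include <stddef.h>
#include <stdlib.h>

int kv_retrieve(const void *key, size_t klen, void *buf, size_t buf_len);
int kv_value_size(const void *key, size_t klen, size_t *out_len);

/* One access: caller supplies a buffer sized for the maximum value. */
int get_bounded(const void *key, size_t klen, void *buf, size_t max_len)
{
    return kv_retrieve(key, klen, buf, max_len);
}

/* Two accesses: query the size, allocate exactly, then retrieve. */
void *get_unknown(const void *key, size_t klen, size_t *out_len)
{
    if (kv_value_size(key, klen, out_len) != 0)
        return NULL;
    void *buf = malloc(*out_len);
    if (buf && kv_retrieve(key, klen, buf, *out_len) != 0) {
        free(buf);
        buf = NULL;
    }
    return buf;
}
```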
Q. Does the device know whether an object was compressed, and if not how can a previously compressed object be stored?
A. The hardware knows whether it performs compression automatically, and therefore whether it should decompress the object. If the storage device supports compression and the no-compress option, the device stores metadata with the KV pair indicating whether no-compress was specified at store time, so that it can return the appropriate data. If the KV storage device does not perform compression, it can simply store and retrieve previously compressed objects. If the device performs its own compression and is given a previously compressed object to store without the no-compress option, it will recompress the value (which typically won’t yield any space savings); with the no-compress option, it will store the value without attempting additional compression.
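A store-time hint might be used like this; KV_OPT_NO_COMPRESS and kv_store_opts() are hypothetical names standing in for the option described above:

```c
/* Sketch: passing a no-compress hint at store time for data that is
 * already compressed. These names are hypothetical. */
#include <stddef.h>
#include <stdint.h>

#define KV_OPT_NONE        0u
#define KV_OPT_NO_COMPRESS (1u << 0)  /* value is already compressed */

int kv_store_opts(const void *key, size_t klen,
                  const void *value, size_t vlen, uint32_t opts);

int store_gzipped(const void *key, size_t klen,
                  const void *gz_data, size_t gz_len)
{
    /* Recompressing compressed data rarely saves space, so ask the
     * device to store the value verbatim. */
    return kv_store_opts(key, klen, gz_data, gz_len, KV_OPT_NO_COMPRESS);
}
```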
Q. On flash, erased blocks are fixed sizes, so how does Key Value handle defrag after a lot of writes and deletes?
A. This is implementation specific and depends on the size of the values being stored. Garbage collection is most efficient for values that are approximately the device’s erase-block size, since such a value can be stored in an erase block by itself and, when deleted, the erase block can simply be erased. For smaller values, an implementation needs to manage garbage collection as values are deleted and, when appropriate, move the values remaining in a mostly empty erase block into a new erase block before erasing it. This is no different from current garbage collection. The NVMe KV standard also provides a mechanism for the device to report its optimal value size to the host, to help manage this.
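A host might use the reported optimal value size along these lines; kv_optimal_value_size() is a hypothetical wrapper around that reporting mechanism:

```c
/* Sketch: sizing writes using the device-reported optimal value size.
 * Batching small values toward the optimal size lets whole erase
 * blocks free up together, easing garbage collection. */
#include <stddef.h>

int kv_optimal_value_size(size_t *out_bytes);

/* Choose a write-buffer size, preferring the device's reported optimum. */
size_t choose_batch_size(size_t requested)
{
    size_t opt = 0;
    if (kv_optimal_value_size(&opt) != 0 || opt == 0)
        return requested;              /* not reported: keep caller's size */
    return requested < opt ? opt : requested;
}
```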
Q. What about encryption? Supported now or will there be SED versions of [key value] drives released down the road?
A. There is no reason that a product could not support encryption with the current definition of key value storage. The release of SED (self-encrypting drive) products is vendor specific.
Q. What are considered to be best use cases for this technology? And for those use cases – what’s the expected performance improvement vs. current NVMe drives + software?
A. The initial use case is for database applications where the database is already storing key/value pairs. In this use case, experimentation has shown that a 6x performance improvement from RocksDB to a KV SSD implementing KVRocks is possible.
Q. Since writes are complete (the value must be written in its entirety), does this mean values are restricted to NVMe’s MDTS?
A. Yes. Values are limited by MDTS (maximum data transfer size). A KV device may set this value to something greater than a block storage device does in order to support larger value sizes.
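As a worked example of the arithmetic: per the NVMe base specification, MDTS is expressed as a power-of-two multiple of the controller's minimum memory page size (2^(12 + CAP.MPSMIN) bytes), with 0 meaning no limit. So MDTS = 7 with 4 KiB pages permits transfers, and hence values, of up to 512 KiB:

```c
/* Worked example: computing the MDTS limit in bytes. MDTS = 7 with
 * 4 KiB pages (MPSMIN = 0) gives 4096 << 7 = 512 KiB. */
#include <stdint.h>

uint64_t max_transfer_bytes(uint8_t mdts, uint8_t mpsmin)
{
    if (mdts == 0)
        return UINT64_MAX;                 /* controller reports no limit */
    uint64_t page = 1ull << (12 + mpsmin); /* minimum memory page size */
    return page << mdts;                   /* 2^MDTS pages per transfer */
}
```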
Q. How do protection schemes work with key value (erasure coding/RAID/…)?
A. Since key value deals with complete values, as opposed to the blocks that make up user data, RAID and erasure coding are usually not applicable to key value systems. The most appropriate data-protection scheme for key value storage devices would be a mirrored scheme. If a storage solution performed erasure coding on the data first, it could store the resulting EC fragments or symbols on key value SSDs.
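That erasure-code-first layering could be sketched as follows; kv_store() is a hypothetical wrapper, and the key-derivation scheme (a 15-byte object key plus a one-byte fragment index) is an illustrative choice:

```c
/* Sketch: erasure-code on the host first, then store each fragment
 * under a key derived from the object key plus its fragment index.
 * Fragments are assumed to come from any host-side EC library
 * (k data + m parity). */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

int kv_store(const void *key, size_t klen, const void *val, size_t vlen);

int store_ec_fragments(const uint8_t base_key[15], uint8_t **frags,
                       size_t frag_len, int k, int m)
{
    for (int i = 0; i < k + m; i++) {
        uint8_t key[16];
        memcpy(key, base_key, 15);
        key[15] = (uint8_t)i;     /* fragment index in the final byte */
        if (kv_store(key, sizeof(key), frags[i], frag_len) != 0)
            return -1;            /* caller handles cleanup/retry */
    }
    return 0;
}
```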
Q. So Key Value is not something built on top of block like Object and NFS are? Object and NFS data are still stored on disks that operate on sectors, so object and NFS are layers on top of block storage? KV is drastically different, uses different drive firmware and drive layout? Or do the drives still work the same and KV is another way of storing data on them alongside block, object, NFS?
A. Today, there is only one storage paradigm at the drive level: block. Object and NFS are mechanisms in the host that map data models onto block storage. Key Value storage is a mechanism for the storage device itself to map from an address (a key) to the physical location where the value is stored, avoiding a host-side translation from the key/value pair to a set of block addresses that are then mapped to physical locations. A device may have one namespace that stores blocks and another namespace that stores key value pairs. There is no difference in the low-level storage mechanism, only in the mapping process from address to physical location. Another difference from block storage is that the value stored is not a fixed size.
Q. Could you explain more about how tx/s is increased with KV?
A. The increase in transfers per second occurs for two reasons: first, the translation layer in the host from key/value to block storage is removed; second, the commands over the bus are reduced to a single transfer for the entire key value pair. The latency savings from the second reduction are less significant than the savings from removing the translation operations that had to happen in the host.
Keep up-to-date on work SNIA is doing on the Key Value Storage API Specification at the SNIA website.