Notable Questions on NVMe-oF 1.1

At our recent SNIA Networking Storage Forum (NSF) webcast, Notable Updates in NVMe-oF™ 1.1, we explored the latest features of NVMe over Fabrics (NVMe-oF), discussing what’s new in the NVMe-oF 1.1 release, support for CMB and PMR, managing and provisioning NVMe-oF devices with SNIA Swordfish™, and FC-NVMe-2. If you missed the live event, you can watch it here. Our presenters received many interesting questions on NVMe-oF, and here are answers to them all:

Q. Is there an implementation of NVMe-oF with direct CMB access?

A. The Controller Memory Buffer (CMB) was introduced in NVMe 1.2 and first supported in the NVMe-oF 1.0 specification. It’s supported if the storage vendor has implemented this within the hardware and the network supports it. We recommend that you ask your favorite vendor if they support the feature.

Q. What is the difference between PMR in an NVMe device and persistent memory in general?

A. The Persistent Memory Region (PMR) is a region within the SSD controller that is reserved for system-level persistent memory and exposed to the host. Just like a Controller Memory Buffer (CMB), the PMR may be used to store command data, but because it’s persistent, its content remains even after power cycles and resets. Going further into this answer would require a follow-up webinar.

Q. Are any special actions required on the host side over Controller Memory Buffers to maintain the data consistency?

A. To prevent possible disruption and to maintain data consistency, the control address range must first be configured so that addresses will not overlap, as described in the latest specification. There is also a flush command so that persistent memory can be cleared (also described in the specification).

Q. Is there a field to know the size of CMB and PMR supported by the controller? What is the general size of CMB/PMR in current devices?

A. The general size of CMB/PMR is vendor-specific, but the specification defines a size register field for each.
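
As a rough illustration of how a host might read the CMB size, the sketch below decodes a raw 32-bit value of the Controller Memory Buffer Size (CMBSZ) register. The bit layout used here (size units in bits 11:8, size in bits 31:12, units growing from 4 KiB in steps of 16x) reflects our understanding of the NVMe base specification and should be checked against the revision your device implements. The function name decode_cmbsz and the sample register value are hypothetical, chosen only for this example.

```python
def decode_cmbsz(cmbsz: int) -> dict:
    """Decode a raw 32-bit CMBSZ register value (layout assumed, see lead-in)."""
    szu = (cmbsz >> 8) & 0xF        # Size Units field, bits 11:8
    sz = (cmbsz >> 12) & 0xFFFFF    # Size field, bits 31:12
    if szu > 6:
        raise ValueError(f"reserved size-unit value: {szu}")
    unit_bytes = 4096 * (16 ** szu)  # 4 KiB, 64 KiB, 1 MiB, ... 64 GiB
    return {
        "submission_queues_supported": bool(cmbsz & 0x01),
        "completion_queues_supported": bool(cmbsz & 0x02),
        "read_data_supported": bool(cmbsz & 0x08),
        "write_data_supported": bool(cmbsz & 0x10),
        "size_bytes": sz * unit_bytes,
    }


# Hypothetical register value: 16 units of 1 MiB, all capability bits set.
sample = (16 << 12) | (2 << 8) | 0x1F
info = decode_cmbsz(sample)
print(info["size_bytes"] // (1024 * 1024), "MiB")
```

The actual size any given device reports remains vendor-specific, as noted in the answer above.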

Q. Does having PMR guarantee that write requests to the PMR region have been committed to media, even though they have not been acknowledged before the power fail? Is there a max time limit in spec, within which NVMe drive should recover after power fail?

A. The implementation must ensure that the previous write has completed and that it is persistent. The time limit is vendor-specific.

Q. What is the average latency of an unladen swallow using NVMe-oF 1.1?

A. Average latency will depend on the media, the network and the way the devices are implemented. It also depends on whether or not the swallow is African or European (African swallows are non-migratory).

Q. Doesn’t RDMA provide an ‘implicit’ queue on the controller side (negating the need for CMB for queues)? Can the CMB also be used for data?

A. Yes, the CMB can be used to hold both commands and command data and the queues are managed by RDMA within host memory or within the adapter. By having the queue in the CMB you can gain performance advantages.

Q. What is a ballpark latency difference between CMB and PMR access? Can you provide a number, assuming that both are accessed over an RDMA fabric?

A. When using CMB, latency goes down but there are no specific latency numbers available as of this writing.

Q. What is the performance of NVMe/TCP in terms of IOPS as compared to NVMe/RDMA? (Good implementation assumed)

A. This is heavily implementation dependent as the network adapter may provide offloads for TCP. NVMe/RDMA generally will have lower latency.

Q. If there are several sequence-level errors, how can we correct the errors in an appropriate order?

Q. How could we control the right order for the error corrections in FC-NVMe-2?

A. These two questions are related, and the response below applies to both.

As mentioned in the presentation, Sequence-level error recovery provides the ability to detect and recover from lost commands, lost data, and lost status responses. For Fibre Channel, a Sequence consists of one or more frames: e.g., a Sequence containing an NVMe command, a Sequence containing data, or a Sequence containing a status response.

The order for error correction is based on information returned from the target on the given state of the Exchange compared to the state of the Exchange at the initiator. To do this, and from a high-level overview, upon sending an Exchange containing an NVMe command, a timer is started at the initiator. The default value for this timer is 2 seconds, and if a response is not received for the Exchange before the timer expires, a message is sent from the initiator to the target to determine the status of the Exchange.

Also, a response from the target may be received before the information on the Exchange is obtained from the target. If this occurs, the command simply continues as normal, the timer is restarted if the Exchange is still in progress, and all is good. Otherwise, if no response from the target has been received since sending the Exchange information message, then one of two actions usually takes place:

a) If the information returned from the target indicates the Exchange is not known, then the Exchange resources are cleaned up and released, and the Exchange containing the NVMe command is re-transmitted; or

b) If the information returned from the target indicates the Exchange is known and the target is still working on the command, then no error recovery is needed; the timer is restarted, and the initiator continues to wait for a response from the target.

An example of this behavior is a format command, where it may take a while for the command to complete, and the status response to be sent.

Other typical information returned from the target in response to the Exchange status query is handled as follows:

  1. If the information returned from the target indicates the Exchange is known, and a ready-to-receive data message was sent by the target (e.g., a write operation), then the initiator requests the target to re-transmit the ready-to-receive data message, and the write operation continues at the transport level;
  2. If the information returned from the target indicates the Exchange is known, and data was sent by the target (e.g., a read operation), then the initiator requests the target to re-transmit the data and the status response, and the read operation continues at the transport level; and
  3. If the information returned from the target indicates the Exchange is known, and the status response was sent by the target, then the initiator requests the target to re-transmit the status response and the command completes accordingly at the transport level.
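
The decision logic described in this answer can be summarized as a small lookup. The following is only a minimal sketch of that description, not code from the FC-NVMe-2 standard: the enum names, the function choose_recovery_action, and its parameters are all hypothetical and were invented for this illustration. Only the 2-second default timer value and the mapping of target states to initiator actions come from the text above.

```python
from enum import Enum, auto

# Default initiator timer mentioned in the answer, in seconds.
DEFAULT_TIMER_SECONDS = 2.0


class TargetReport(Enum):
    """Hypothetical labels for what the target reports about the Exchange."""
    EXCHANGE_UNKNOWN = auto()
    STILL_WORKING = auto()
    READY_TO_RECEIVE_SENT = auto()
    DATA_SENT = auto()
    STATUS_SENT = auto()


class InitiatorAction(Enum):
    """Hypothetical labels for the initiator's next step."""
    CONTINUE_AS_NORMAL = auto()
    CLEAN_UP_AND_RETRANSMIT_COMMAND = auto()
    RESTART_TIMER_AND_WAIT = auto()
    REQUEST_READY_TO_RECEIVE_AGAIN = auto()
    REQUEST_DATA_AND_STATUS_AGAIN = auto()
    REQUEST_STATUS_AGAIN = auto()


# Mapping of the target's reported state to the action described in the text.
_ACTIONS = {
    TargetReport.EXCHANGE_UNKNOWN: InitiatorAction.CLEAN_UP_AND_RETRANSMIT_COMMAND,
    TargetReport.STILL_WORKING: InitiatorAction.RESTART_TIMER_AND_WAIT,
    TargetReport.READY_TO_RECEIVE_SENT: InitiatorAction.REQUEST_READY_TO_RECEIVE_AGAIN,
    TargetReport.DATA_SENT: InitiatorAction.REQUEST_DATA_AND_STATUS_AGAIN,
    TargetReport.STATUS_SENT: InitiatorAction.REQUEST_STATUS_AGAIN,
}


def choose_recovery_action(report: TargetReport,
                           response_already_received: bool = False) -> InitiatorAction:
    """Pick the initiator's next step after the timer expires and the target is queried."""
    # If a normal response arrived before the query result, no recovery is needed.
    if response_already_received:
        return InitiatorAction.CONTINUE_AS_NORMAL
    return _ACTIONS[report]


# Example: a long-running format command that the target is still processing.
print(choose_recovery_action(TargetReport.STILL_WORKING).name)
```

A real initiator would drive this from timer expiry and the actual Exchange status messages; the table form is used here only to make the ordering of the cases easy to see.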

For further information, detailed informative Sequence-level error recovery diagrams are provided in Annex E of the FC-NVMe-2 standard, available via INCITS.
