Our live SNIA-ESF Webcast, “Under the Hood with NVMe over Fabrics,” generated more questions than we anticipated, proving to us that this topic is worthy of future discussions. Here are answers to both the questions we took during the live event as well as those we didn’t have time for.
Q. So fabric is an alternative to PCIe, for those of us familiar with PCIe-attached NVMe devices, yes?
A. Yes, fabric is the term used in the specification that represents a variety of physical interconnects and transports for NVM Express.
Q. How are the namespaces shared in a fabric?
A. Namespaces are NVM subsystem resources and are accessible by all controllers in the NVM subsystem. Multi-host access may be coordinated using reservations.
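For readers coming from SCSI, the flow looks familiar: hosts register a key against the namespace, and one of them then acquires the reservation. Here is a minimal C sketch of that register-then-acquire pattern; the helper functions are hypothetical stand-ins for the actual NVMe reservation commands, though the reservation type values come from the NVMe specification.

```c
/* Conceptual sketch (not from the webcast) of the register-then-acquire flow
 * two hosts could use to coordinate access to a shared namespace. The helper
 * functions are hypothetical stand-ins for the Reservation Register and
 * Reservation Acquire commands. */
#include <stdint.h>
#include <stdio.h>

enum nvme_resv_type {                       /* reservation types defined by NVMe */
    NVME_RESV_WRITE_EXCLUSIVE      = 1,
    NVME_RESV_EXCLUSIVE_ACCESS     = 2,
    NVME_RESV_WRITE_EXCL_REG_ONLY  = 3,
    NVME_RESV_EXCL_ACCESS_REG_ONLY = 4,
    NVME_RESV_WRITE_EXCL_ALL_REG   = 5,
    NVME_RESV_EXCL_ACCESS_ALL_REG  = 6,
};

/* Hypothetical helpers; a real host would issue the corresponding NVMe commands. */
static void resv_register(uint32_t nsid, uint64_t key) {
    printf("register key 0x%llx on namespace %u\n", (unsigned long long)key, nsid);
}
static void resv_acquire(uint32_t nsid, uint64_t key, enum nvme_resv_type type) {
    printf("acquire reservation type %d on namespace %u with key 0x%llx\n",
           type, nsid, (unsigned long long)key);
}

int main(void) {
    resv_register(1, 0xA11CE);                           /* each host registers its key */
    resv_acquire(1, 0xA11CE, NVME_RESV_WRITE_EXCLUSIVE); /* one host takes the reservation */
    return 0;
}
```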
Q. If there are multiple subsystems accessing the same NVMe devices over the fabric, how is the namespace shared?
A. The mapping of fabric NVM subsystem resources (namespaces and controllers) to PCIe NVMe device subsystems is implementation specific. They may be mapped 1-to-1 or N-to-1, depending on the functionality of the NVMe bridge.
Q. Are namespace reservations similar to SCSI reservations?
A. Yes.
Q. Are there plans for defining bindings for the Intel Omni-Path fabric?
A. Intel Omni-Path is a good candidate fabric for NVMe over Fabrics.
Q. Is hybrid attachment allowed? Could a single namespace be attached to a fabric and PCIe (through two controllers) concurrently?
A. At this moment, such a hybrid configuration is not permitted within the specification.
Q. Is an NVM subsystem purpose-built or commodity server hardware?
A. This is a difficult question to answer. At the time of this writing there are not enough “off-the-shelf” commodity components to be able to construct NVMe over Fabric subsystems.
Q. Does NVMeoF use the same NVMe PCIe controller register map?
A. A subset of the NVMe controller register mapping was retained for fabrics but renamed to “Properties” to avoid confusion.
Q. So does NVMe over Fabric act like an extension of the PCIe bus? Meaning that I see the same MMIO registers and queues remotely? Or is it a completely different protocol that is solely message based? Will current NVMe host drivers work on the fabric or does it really require a different driver stack?
A. Fabrics is not an extension of PCIe; it's an extension of NVMe. It uses the same NVMe Submission and Completion Queue model and Descriptors as PCIe NVMe. Most of the original NVMe host driver stack is retained and shared between PCIe and Fabrics; the bottom side was modified to allow for multiple transports.
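To illustrate what is shared, here is a minimal sketch of the two queue-entry formats as C structs, following the 64-byte submission entry and 16-byte completion entry layouts in the NVMe base specification (field names abbreviated).

```c
/* Minimal sketch of the queue-entry formats common to PCIe NVMe and
 * NVMe over Fabrics, per the NVMe base specification. */
#include <stdint.h>
#include <assert.h>

struct nvme_sqe {                 /* Submission Queue Entry */
    uint8_t  opcode;              /* command opcode */
    uint8_t  flags;               /* FUSE and PSDT bits */
    uint16_t command_id;          /* matched against the completion */
    uint32_t nsid;                /* namespace identifier */
    uint64_t rsvd2;
    uint64_t metadata;            /* metadata pointer */
    uint64_t dptr[2];             /* data pointer: PRPs (PCIe) or an SGL (fabrics) */
    uint32_t cdw10[6];            /* command-specific dwords 10-15 */
};

struct nvme_cqe {                 /* Completion Queue Entry */
    uint32_t result;              /* command-specific result */
    uint32_t rsvd;
    uint16_t sq_head;             /* how far the controller has consumed the SQ */
    uint16_t sq_id;               /* which submission queue this completes */
    uint16_t command_id;          /* identifies the completed command */
    uint16_t status;              /* status field plus phase bit */
};

static_assert(sizeof(struct nvme_sqe) == 64, "SQE is 64 bytes");
static_assert(sizeof(struct nvme_cqe) == 16, "CQE is 16 bytes");
```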
Q. Does NVMe over Fabrics support immediate data for writes, or must write data always be fetched by the NVMe controller?
A. Yes, immediate data is termed “in-capsule” and is used to send the NVMe command data with the NVMe submission entry.
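As a rough sketch, a fabrics command capsule can be pictured as the fixed 64-byte submission entry followed immediately by the write payload; the 4 KiB payload size below is purely illustrative, since the controller advertises the actual in-capsule data limit it supports.

```c
/* Rough sketch of a fabrics command capsule carrying in-capsule write data:
 * the 64-byte submission entry travels first, followed by the payload, so a
 * small write needs no separate data fetch by the controller. */
#include <stdint.h>

#define IN_CAPSULE_DATA_MAX 4096  /* illustrative; the controller advertises the real limit */

struct nvmf_command_capsule {
    uint8_t sqe[64];                    /* NVMe submission queue entry */
    uint8_t data[IN_CAPSULE_DATA_MAX];  /* optional in-capsule data for writes */
};
```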
Q. As far as I know, Linux introduced a multi-queue model at the block layer recently. Is it the same thing you are mentioning?
A. No, but NVMe uses the Linux Block-MQ layer. NVMe Multi-Queue is used between the host and the NVMe controller for both PCIe and fabric based controllers.
Q. Are there situations where you might want to have more than one queue pair per CPU? What are they?
A. Queue-Pairs are matched up by CPU cores, not CPUs, which allows the creation of multiple namespace entities per CPU. This, in turn, is very useful for virtualization and application separation.
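A minimal sketch of that per-core selection, assuming a Linux host and using sched_getcpu() to find the submitting core; the queue count is simply whatever the controller granted.

```c
/* Sketch: pick a submission/completion queue pair based on the CPU core the
 * submitting thread runs on, so each core can issue I/O without sharing a
 * queue. sched_getcpu() is Linux-specific. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static unsigned pick_queue(unsigned num_io_queues)
{
    int core = sched_getcpu();              /* core the caller is running on */
    if (core < 0)
        core = 0;                           /* fall back if the call fails */
    return (unsigned)core % num_io_queues;  /* one queue pair per core (wraps if fewer queues) */
}

int main(void)
{
    printf("this thread would submit on queue %u\n", pick_queue(8));
    return 0;
}
```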
Q. What are three mandatory commands? Do they refer to read/write/sync cache?
A. Actually, there are 13 required commands. Kevin Marks has a very good presentation from the Flash Memory Summit that provides a list of these commands within the broader NVMe context. You can download it here.
Q. Please talk about queue depths? Arbitrary? Limited?
A. Queue depths are controller-defined, up to a maximum of 64K entries.
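As a sketch of where that limit comes from: the controller's CAP property reports MQES (Maximum Queue Entries Supported) as a zero-based 16-bit value, so the usable depth is MQES + 1, capping out at 64K. The CAP value below is made up.

```c
/* Sketch: derive the maximum queue depth from the low 16 bits of the
 * controller's CAP property (CAP.MQES, zero-based). */
#include <stdint.h>
#include <stdio.h>

static uint32_t max_queue_depth(uint64_t cap)
{
    uint16_t mqes = (uint16_t)(cap & 0xFFFF);  /* CAP.MQES, zero-based */
    return (uint32_t)mqes + 1;                 /* actual entry count */
}

int main(void)
{
    uint64_t cap = 0x3FF;                      /* example: MQES = 0x3FF -> depth 1024 */
    printf("max queue depth: %u entries\n", max_queue_depth(cap));
    return 0;
}
```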
Q. Where will SQs and CQs be physically located? Are they on host memory or SSD memory?
A. For fabrics, the SQ is located on the controller side to avoid the inefficiency of having to pull SQEs across a fabric. CQs reside on the host.
Q. How do you create ordering guarantee when that is needed for correctness?
A. For commands that require sequencing, there is a concept called “Fused Commands” which get sent as a single unit.
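A minimal sketch of how a fused Compare-and-Write pair might be built: the FUSE bits in each entry's flags byte mark the first and second halves, and the controller executes them as one atomic unit. Opcode and FUSE values follow the NVMe NVM command set; the trimmed struct is illustrative.

```c
/* Sketch: building a fused Compare-and-Write pair. The write executes only
 * if the compare succeeds, and both entries must be submitted back to back
 * on the same submission queue. */
#include <stdint.h>
#include <string.h>

#define NVME_OPC_WRITE    0x01
#define NVME_OPC_COMPARE  0x05
#define NVME_FUSE_FIRST   0x01   /* flags bits 1:0 = 01b: first half of a fused pair */
#define NVME_FUSE_SECOND  0x02   /* flags bits 1:0 = 10b: second half */

/* Trimmed view of a submission entry; only the fields set below are shown. */
struct sqe_view {
    uint8_t  opcode;
    uint8_t  flags;
    uint32_t nsid;
};

static void build_fused_compare_write(struct sqe_view *cmp, struct sqe_view *wr,
                                      uint32_t nsid)
{
    memset(cmp, 0, sizeof(*cmp));
    cmp->opcode = NVME_OPC_COMPARE;
    cmp->flags  = NVME_FUSE_FIRST;   /* compare runs first */
    cmp->nsid   = nsid;

    memset(wr, 0, sizeof(*wr));
    wr->opcode = NVME_OPC_WRITE;
    wr->flags  = NVME_FUSE_SECOND;   /* write executes only if the compare passes */
    wr->nsid   = nsid;
}
```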
Q. In NVMeoF how are devices discovered?
A. NVMeoF devices are discoverable via a couple of different means, depending on whether you are using Fibre Channel (which has its own discovery and login process) or an iSCSI-like name server. Mike Shapiro goes over the discovery mechanism in considerable detail in this BrightTALK Webcast.
Q. I guess all new drivers will be required for NVMeoF?
A. Yes, new drivers are being written and will be required for NVMeoF.
Q. Why can’t the doorbell+ communication model apply to PCIe? I mean, why doesn’t PCIe use doorbell+?
A. NVMe 1.2 defines controller resident buffers that can be used for pushing SQ Entries from the host to the controller. Doorbells are still required for PCIe to inform the controller about the new SQ entries.
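For reference, this is roughly what that doorbell write looks like on PCIe: after queuing new entries, the host writes the submission queue's tail index to a register whose offset follows the spec's 0x1000 + (2*qid) * (4 << CAP.DSTRD) formula. The MMIO pointer here is notional.

```c
/* Sketch of a PCIe SQ tail doorbell write: the host tells the controller how
 * far it has advanced the submission queue so the controller can fetch the
 * new entries. */
#include <stdint.h>

static void ring_sq_doorbell(volatile uint8_t *bar0, uint32_t qid,
                             uint32_t dstrd, uint32_t new_tail)
{
    /* SQ y tail doorbell lives at 0x1000 + (2*y) * (4 << CAP.DSTRD). */
    uintptr_t offset = 0x1000 + (uintptr_t)(2 * qid) * (4u << dstrd);
    *(volatile uint32_t *)(bar0 + offset) = new_tail;  /* controller fetches the new SQEs */
}
```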
Q. If there are two hosts connected to the same subsystem, will the NVMe controller have two queues, one for each host?
A. Yes.
Q. So with your command and data description, does NVMe over Fabric require RDMA or does it have a “Data Ready” type message to tell the host when to send write data?
A. Data transfer operations are fabric dependent. RDMA uses RDMA_READ, another transport may use some form of Data Ready model.
Q. Can you quantify the protocol translation overhead? In reality, it does not look that big from a performance perspective.
A. Submission Queue entries are 64 bytes and Completion Queue entries are 16 bytes. These are sufficiently small for block storage traffic, which is typically in 4K+ size requests.
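A quick back-of-the-envelope check of that point: per command the queue-entry overhead is 64 + 16 = 80 bytes, which works out to roughly 2% of a 4 KiB payload.

```c
/* Sanity check: queue-entry overhead relative to a typical 4 KiB request. */
#include <stdio.h>

int main(void)
{
    const double sqe = 64, cqe = 16, payload = 4096;
    printf("queue-entry overhead: %.1f%% of a 4 KiB request\n",
           100.0 * (sqe + cqe) / payload);   /* roughly 2% */
    return 0;
}
```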
Q. Do Dual Port SSDs need to support two Admin Qs since they have two paths to the same host?
A. Dual-Port or multi-path capable NVM subsystems require using two NVMe controllers each with one AdminQ and one or more IO queues.
Q. For a Dual Port SSD, does each port need to have its Submission Q on a different CPU core in the host? I assume the SQs for the two ports cannot be on the same CPU core.
A. The mapping of controller queues to host CPU cores is typically per controller. If the host were connected to two controllers, there would be two queues per core: one queue to controller 1 and one queue to controller 2 per host core.
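A tiny sketch of that mapping, with made-up core and controller counts: each core ends up owning one queue pair per controller.

```c
/* Sketch: per-core, per-controller queue-pair ownership on a dual-controller
 * host. Core and controller counts are illustrative. */
#include <stdio.h>

#define NUM_CORES        4
#define NUM_CONTROLLERS  2

struct queue_pair { int controller; int core; };

int main(void)
{
    struct queue_pair qp[NUM_CONTROLLERS][NUM_CORES];

    for (int c = 0; c < NUM_CONTROLLERS; c++)
        for (int core = 0; core < NUM_CORES; core++)
            qp[c][core] = (struct queue_pair){ .controller = c, .core = core };

    /* Each core owns one queue pair to controller 0 and one to controller 1. */
    printf("core 0 uses queues to controller %d and controller %d\n",
           qp[0][0].controller, qp[1][0].controller);
    return 0;
}
```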
Q. As you mentioned, there is currently LBA addressing in the standard. What will happen when Intel goes to market with new media (3D XPoint), which is announced to be byte addressable?
A. The NVMe NVM command set is block based and is independent of the type and access method of the NVM media used in a subsystem implementation.
Q. Is there a real benefit of this architecture in a NAS environment?
A. There is a natural advantage to making any storage access more efficient. A network-attached system still requires block access at the lower levels, and NVMe (either local or over a Fabric) can improve NAS design and flexibility immensely. This is particularly true for pNFS and scale-out SMB paradigms.
Q. How do you handle authentication across many servers (hosts) on the fabric? How do you decide what host can access what part of each device? Does it have to be namespace specific?
A. The fabrics specification defines an authentication model and also defines the naming format for NVM subsystems and hosts. A target implementation can choose to provision NVM subsystems to specific hosts based on the naming format.
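To make the naming concrete: hosts and subsystems are identified by NVMe Qualified Names of the form nqn.yyyy-mm.&lt;reverse-domain&gt;:&lt;string&gt;, and a target can match a connecting host's NQN against its provisioning policy. The NQNs and the access-control table below are made up for illustration.

```c
/* Sketch of NQN-based provisioning: a target checks whether a connecting
 * host's NQN is allowed to see a given subsystem NQN. All names are
 * illustrative; only the NQN shape follows the specification. */
#include <stdio.h>
#include <string.h>

struct allow_entry { const char *host_nqn; const char *subsys_nqn; };

static const struct allow_entry acl[] = {
    { "nqn.2016-06.io.example:host1", "nqn.2016-06.io.example:subsystem-a" },
};

static int host_allowed(const char *host_nqn, const char *subsys_nqn)
{
    for (size_t i = 0; i < sizeof(acl) / sizeof(acl[0]); i++)
        if (!strcmp(acl[i].host_nqn, host_nqn) && !strcmp(acl[i].subsys_nqn, subsys_nqn))
            return 1;
    return 0;
}

int main(void)
{
    printf("%s\n", host_allowed("nqn.2016-06.io.example:host1",
                                "nqn.2016-06.io.example:subsystem-a")
                   ? "access granted" : "access denied");
    return 0;
}
```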
Q. Does having the same structure at all layers mean that, at the transport layer of a flash appliance, we should also maintain the Submission and Completion Queue model, with these mapped to the physical queues of the NVMe sub-controller?
A. The NVMe Submission Queue and Completion Queue entries are common between fabrics and PCIe NVMe. This simplifies the steps required to bridge between NVMe fabrics and NVMe PCIe. An implementation may choose to map the fabrics SQ directly to a PCIe NVMe SSD SQ to provide a very efficient, simple NVMe transport bridge.
Q. With an RDMA based transport, how will each host discover the NVME controller(s) that it has been granted access to?
A. Please see the answer above.
Q. Traditionally, SAS supports SAS expanders for scaling purposes. How does NVMe over Fabrics solve this issue, given that there is no expander concept in the NVMe world?
A. Recall that SAS expanders compensate for SCSI’s inherent lack of scalability. NVMe perpetuates the multi-queue model (which does not exist for SCSI) natively, so SAS expander-like pieces are not required for scale-out.
Update: If you missed the live event, it’s now available on-demand. You can also download the webcast slides.
Update: Want to learn more about NVMe? Check out these SNIA ESF webcasts:
- The Performance Impact of NVMe and NVMe over Fabrics (Download webcast slides)
- Under the Hood with NVMe over Fabrics (Download webcast slides)
- How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics (Download webcast slides)