Author of NVMe™/TCP Spec Answers Your Questions

900 people have already watched our SNIA Networking Storage Forum webcast, "What NVMe™/TCP Means for Networked Storage?" where Sagi Grimberg, lead author of the NVMe/TCP specification, and J Metz, Board Member for SNIA, explained what NVMe/TCP is all about. If you haven't seen the webcast yet, check it out on-demand.

Like any new technology, this one generates no shortage of potential confusion and questions. In this FAQ blog, we try to clear up both.

Q. Who is responsible for updating the NVMe host driver?

A. We assume you are referring to the Linux host driver (independent OS software vendors are responsible for developing their own drivers). Like any device driver and/or subsystem in Linux, the responsibility for maintenance lies with the maintainer(s) listed in the MAINTAINERS file. The responsibility for contributing is shared by all the community members.
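For reference, here is a minimal sketch of how to look up the currently listed maintainers of the NVMe/TCP host driver from within a Linux kernel source tree, using the standard get_maintainer.pl helper (paths may move between kernel versions):

```python
# Sketch: query the Linux MAINTAINERS database for the NVMe/TCP host driver.
# Assumes you are running inside a Linux kernel source tree; the helper script
# and the drivers/nvme/host/tcp.c path may differ across kernel versions.
import subprocess

result = subprocess.run(
    ["./scripts/get_maintainer.pl", "-f", "drivers/nvme/host/tcp.c"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # maintainer names/emails plus the linux-nvme mailing list
```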

Q. What is the realistic timeframe to see a commercially available NVMe over TCP driver for targets? Is one year from now (2020) fair?

A. Commercial products are coming to market even this year. The work started before the spec was fully ratified, but now that it has been, we expect wider NVMe/TCP support to become available.

Q. Does NVMe/TCP work with 400GbE infrastructure?

A. As of this writing, there is no reason to believe that upper layer protocols such as NVMe/TCP will not work with faster Ethernet physical layers like 400GbE.

Q. Why is NVMe CQ in the controller and not on the Host?

A. The example that was shown in the webcast assumed that the fabrics controller had an NVMe backend. So the controller backend NVMe device had a local completion queue, and on the host sat the “transport completion queue” (in NVMe/TCP case this is the TCP stream itself).

Q. So, SQ and CQ streams run asynchronously from each other, with variable ordering depending on the I/O latency of a request?

A. Correct. For a given NVMe/TCP connection, stream delivery is in-order, but commands and completions can arrive (and be processed by the NVMe controller) in any order.
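To make that concrete, here is a minimal sketch (not the actual driver code) of how a host can match out-of-order completions back to outstanding commands by their command identifier (CID):

```python
# Simplified sketch of per-queue command tracking: completions may arrive in
# any order, so the host matches them to outstanding commands by CID.
pending = {}          # CID -> description of the outstanding command
next_cid = 0

def submit(command):
    """Queue a command and remember it under a fresh CID."""
    global next_cid
    cid = next_cid
    next_cid = (next_cid + 1) & 0xFFFF   # CIDs are 16-bit in NVMe
    pending[cid] = command
    return cid

def complete(cid, status):
    """Handle a completion that arrived on the (in-order) TCP stream."""
    command = pending.pop(cid)           # works even if completions are reordered
    print(f"completed {command!r} (CID {cid}) with status {status:#x}")

# Example: three writes submitted, completions arrive out of order.
cids = [submit(f"write block {i}") for i in range(3)]
for cid in (cids[2], cids[0], cids[1]):
    complete(cid, 0x0)
```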

Q. What TCP ports are used? Since we have many NVMe queues, I bet we need a lot of TCP ports.

A. Each NVMe queue will consume a unique source TCP port. Common NVMe host implementations will create a number of NVMe queues on the same order of magnitude as the number of CPU cores.
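As a rough back-of-the-envelope sketch (the actual queue count depends on the driver, its settings, and the controller's advertised limits), the connection count per controller is typically one admin queue plus roughly one I/O queue per CPU core:

```python
# Rough estimate of TCP connections (and therefore source ports) one host uses
# per NVMe/TCP controller: one admin queue plus ~one I/O queue per CPU core.
# Actual counts depend on the driver, any nr-io-queues setting, and the
# controller's advertised limits.
import os

cpu_cores = os.cpu_count() or 1
io_queues = cpu_cores            # common default; often capped by the controller
admin_queues = 1

connections_per_controller = admin_queues + io_queues
print(f"{cpu_cores} cores -> about {connections_per_controller} TCP connections "
      f"(unique source ports) per controller")
```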

Q. What is the max size of Data PDU supported? Are there any restrictions in parallel writes?

A. The maximum size of an H2CData PDU (MAXH2CDATA) is negotiated and can be as large as 4GB. It is recommended that it be no less than 4,096 bytes.

Q. Is immediate data negotiated between host and target?

A. The in-capsule data size (IOCCSZ) is negotiated at the NVMe level. In NVMe/TCP, the admin queue command capsule size is 8K by default. In addition, the maximum size of the H2CData PDU is negotiated during connection initialization.
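To illustrate how these two negotiated limits interact, here is a simplified sketch (the limit values are illustrative placeholders, not spec defaults): a small write can travel in-capsule, while a larger write is carried in H2CData PDUs no larger than the negotiated MAXH2CDATA.

```python
# Simplified sketch of how a host might split a write given the negotiated
# limits: the IOCCSZ-derived in-capsule data size and MAXH2CDATA. The numbers
# below are illustrative placeholders, not values mandated by the specification.
def plan_write(length, in_capsule_limit=8192, max_h2cdata=65536):
    """Return a human-readable plan for transferring `length` bytes of write data."""
    if length <= in_capsule_limit:
        return [f"in-capsule: {length} bytes"]
    pdus = []
    offset = 0
    while offset < length:
        chunk = min(max_h2cdata, length - offset)
        pdus.append(f"H2CData PDU: offset={offset}, length={chunk}")
        offset += chunk
    return pdus

for step in plan_write(200_000):
    print(step)
```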

Q. Is NVMe/TCP hardware infrastructure cost lower?

A. This can vary widely, but we assume you are referring to Ethernet hardware infrastructure. NVMe/TCP does not require an RDMA-capable NIC, so the variety of implementations is usually wider, which typically drives down cost.

Q. What are the plans for the major OS suppliers to support NVMe over TCP (Windows, Linux, VMware)?

A. Unfortunately, we cannot comment on their behalf, but Linux already supports NVMe/TCP, which should find its way into the various distributions soon. We are working with others to support NVMe/TCP, but suggest asking them directly.

Q. Where does the overhead occur for NVMe/TCP packetization, is it dependent on the CPU, or does the network adapter offload that heavy lifting? And what is the impact of numerous, but extremely small transfers?

A. Indeed, a software NVMe/TCP implementation will introduce overhead resulting from TCP stream processing. However, you are correct that common stateless offloads such as Large Receive Offload (LRO) and TCP Segmentation Offload (TSO) are extremely useful both for large transfers and for small 4K transfers.
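If you want to check whether those offloads are enabled on a given Linux NIC, one quick way is to parse the output of `ethtool -k`; the sketch below assumes ethtool is installed and that the interface name matches your system.

```python
# Sketch: check whether TSO and LRO/GRO are enabled on a Linux interface by
# parsing `ethtool -k` output. Assumes ethtool is installed and the interface
# name (here "eth0") matches your system.
import subprocess

IFACE = "eth0"
out = subprocess.run(["ethtool", "-k", IFACE],
                     capture_output=True, text=True, check=True).stdout

for line in out.splitlines():
    line = line.strip()
    if line.startswith(("tcp-segmentation-offload:",
                        "large-receive-offload:",
                        "generic-receive-offload:")):
        print(line)
```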

Q. What do you mean Absolute Latency is higher than RDMA by “several” microseconds? <10us, tens of microseconds, or 100s of microseconds?

A. That depends on various aspects such as the CPU model, the network infrastructure, the controller implementation, the services running on top, etc. In early testing with Linux, remote access to raw NVMe devices over TCP was measured to add between 20 and 35 microseconds, but the degree of variability will affect this.

Q. Will Wireshark support NVMe/TCP soon? Is an implementation in progress?

A. We most certainly hope so; it shouldn't be difficult, but we are not aware of an implementation in progress.

Q. Are there any NVMe TCP drivers out there?

A. Yes, Linux and SPDK both support NVMe/TCP out-of-the-box, see: https://nvmexpress.org/welcome-nvme-tcp-to-the-nvme-of-family-of-transports/
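For a quick hands-on test with the in-kernel Linux initiator, a sketch like the one below (wrapping nvme-cli) loads the transport module, discovers, and connects. The target address and NQN are placeholders, and the ports shown are the commonly used NVMe/TCP defaults, which your target may override.

```python
# Sketch: exercising the in-kernel Linux NVMe/TCP initiator via nvme-cli.
# Run as root. The target address and NQN below are placeholders; the ports
# are the commonly used defaults (8009 for discovery, 4420 for I/O).
import subprocess

TARGET_ADDR = "192.168.1.100"             # placeholder target IP
TARGET_NQN = "nqn.2019-01.example:sub1"   # placeholder subsystem NQN

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["modprobe", "nvme-tcp"])                                   # load the transport
run(["nvme", "discover", "-t", "tcp", "-a", TARGET_ADDR, "-s", "8009"])
run(["nvme", "connect", "-t", "tcp", "-a", TARGET_ADDR, "-s", "4420",
     "-n", TARGET_NQN])
run(["nvme", "list"])                                           # new namespaces appear here
```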

Q. Do you recommend a dedicated IP network for the storage traffic or can you use the same corporate network with all other LAN traffic?

A. This really depends on the use case, the network utilization and other factors. Obviously if the network bandwidth is fully utilized to begin with, it won’t be very efficient to add the additional NVMe/TCP “load” on the network, but that alone might not be the determining factor. Otherwise it can definitely make sense to share the same network and we are seeing customers choosing this route.

It might be useful to consider the best practices for TCP-based storage networks (iSCSI has taught valuable lessons), and we anticipate that many of the same principles will apply to NVMe/TCP.

AQM, buffer, and other tuning settings are very dependent on the traffic pattern and need to be developed based on the requirements. The base configuration is determined by the vendors.

Q. On slide 28: No, TCP needs congestion feedback, but it doesn't have to be a drop (it could be ECN, latency variance, etc.).

A. Yes, you are correct. The question refers to how that feedback is received, though, and in the most common (traditional) TCP methods it’s done via drops.

Q. How can you find out/check what TCP stack (drop vs. zero-buffer) your network is using?

A. The use/support of DCTCP is mostly driven by the OS. The network needs to support ECN and have it enabled and correctly configured for the traffic of interest. So the best way to figure this out is to talk to the network team. The use of ECN, etc., needs to be worked out between the server and network teams.
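On the host side, a quick sketch of what to look at on Linux (these are the standard sysctl paths; whether DCTCP/ECN is actually effective still depends on the network configuration described above):

```python
# Sketch: inspect the host-side TCP congestion control and ECN settings on
# Linux by reading the standard sysctl files. This only shows what the host is
# configured to do; ECN marking still has to be enabled end-to-end in the network.
from pathlib import Path

SYSCTLS = [
    "net/ipv4/tcp_congestion_control",            # e.g. "cubic", "dctcp", "bbr"
    "net/ipv4/tcp_available_congestion_control",
    "net/ipv4/tcp_ecn",                           # 0=off, 1=request ECN, 2=accept only
]

for name in SYSCTLS:
    value = Path("/proc/sys", name).read_text().strip()
    print(f"{name.replace('/', '.')} = {value}")
```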

Q. On slide 33: drop is a signal of an overloaded network; congestion onset is when there is a standing queue (latency already increases). The current state of the art is to always overload the network (switches).

A. ECN is used to signal congestion before a drop happens, which makes it more efficient.

Q. Is it safe to assume that most current switches on the market today support DCTCP/ECN and that we can mix/match switches from vendors across product families?

A. Most modern ASICs support ECN today. Mixing different product lines needs to be carefully planned and tested; AQM, buffers, etc. need to be fine-tuned across the platforms.

Q. Is there a substantial cost saving in implementing all of what is needed to support NVMe over TCP versus just sticking with RDMA? Much like staying with Fibre Channel instead of risking performance with iSCSI not being (and staying) implemented correctly. Building the separately supported network just seems the best route.

A. By “sticking with RDMA” you mean that you have already selected RDMA, which means you already made the investments to make it work for your use case. We agree that changing what currently works reliably and meets the targets might be an unnecessary risk. NVMe/TCP brings a viable option for Ethernet fabrics which is easily scalable and allows you to utilize a wide variety of both existing and new infrastructure while still maintaining low latency NVMe access.

Q. With multiple flavors of TCP, and especially of congestion management (DCTCP, DCQCN?), is there a plan for commonality in the ecosystem to support a standard way to handle congestion management? Is that required in the switches, or also in the HBAs?

A. DCTCP is an approach to L3-based congestion management, whereas DCQCN is a combination of PFC and ECN for RoCEv2 (UDP)-based communication, so these are two different approaches.

Q. Who are the major players in terms of marketing this technology among storage vendors?

A. The key organization for finding out about NVMe/TCP (or, in fact, all NVMe-related material) is NVM Express®.

Q. Can I compare the NVMe over TCP to iSCSI?

A. Easily: you can download the upstream kernel and test both of the in-kernel implementations (iSCSI and NVMe/TCP). Alternatively, you can reach out to a vendor that supports either of the two to test it as well. You should expect NVMe/TCP to run substantially faster for pretty much any workload.
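A hedged sketch of how such a comparison might be run once both block devices are attached (the device paths are placeholders, and the fio parameters are only a starting point, not a tuned benchmark):

```python
# Sketch: run the same fio 4K random-read job against an iSCSI-attached device
# and an NVMe/TCP-attached device and compare results. Run as root; the device
# paths are placeholders, and runtime/queue depth should be tuned for your setup.
import subprocess

DEVICES = {
    "iscsi": "/dev/sdX",         # placeholder: block device backed by iSCSI
    "nvme-tcp": "/dev/nvme0n1",  # placeholder: namespace attached over NVMe/TCP
}

for name, dev in DEVICES.items():
    subprocess.run([
        "fio", "--name", name, "--filename", dev,
        "--rw", "randread", "--bs", "4k", "--iodepth", "32",
        "--ioengine", "libaio", "--direct", "1",
        "--runtime", "60", "--time_based", "--group_reporting",
    ], check=True)
```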

Q. Is network segmentation crucial as a "go-to" architecture, with a host-to-storage proximity objective, to accomplish the objective of managed/throttled, close-to-lossless connectivity?

A.  There is a lot to unpack in this question. Let’s see if we can break it down a little.

Generally speaking, best practice is to keep the storage as close to the host as possible (and is reasonable). Not only does this reduce latency, but it reduces the variability in latency (and bandwidth) that can occur at longer distances.

In cases where storage traffic shares bandwidth (i.e., links) with other types of traffic, the variable nature of different applications (some are bursty, others are more long-lived) can create unpredictability. Since storage – particularly block storage – doesn’t “like” unpredictability, different methods are used to regain some of that stability as scales increase.

A common and well-understood best practice is to isolate storage traffic from "regular" Ethernet traffic. As different workloads tend to be not only "North-South" but increasingly "East-West" across network topologies, this network segmentation becomes more important. Of course, it's been a typical best practice for many years with protocols such as iSCSI, so this is not new.

In environments where the variability of congestion can have a profound impact on storage performance, network segmentation will indeed become crucial as a "go-to" architecture. Proper techniques at L2 and L3, as well as properly configured QoS mechanisms across the network, will help determine how close to a "lossless" environment can be achieved.

As a general rule of thumb, though, network segmentation is a very powerful tool to have for reliable storage delivery.

Q. How close are we to shared NVMe storage, either over Fibre Channel or TCP?

A. There are several shared storage products available on the market for NVMe over Fabrics, but as of this writing (only 3 months after the ratification of the protocol) no major vendors have announced NVMe over TCP shared storage capabilities.

A good place to look for updates is on the NVM Express website for interoperability and compliance products. [https://nvmexpress.org/products/]

Q. AQM -> DualQ work in the IETF for coexisting L4S (DCTCP) and legacy TCP. There is ongoing work at the chip merchants.

A. Indeed, there are a lot of advancements around making TCP evolve as speeds and feeds increase. This is yet another example of why NVMe/TCP is, and will remain, relevant in the future.

Q. Are there any major vendors who are pushing products based on these technologies?

A. We cannot comment publicly on any vendor plans. You would need to ask a vendor directly for a concrete timeframe for the technology. However, several startups have made public announcements on supporting NVMe/TCP.
