NVMe®/TCP Q&A

The SNIA Networking Storage Forum (NSF) had an outstanding response to our live webinar, “NVMe/TCP: Performance, Deployment, and Automation.” If you missed the session, you can watch it on-demand and download a copy of the presentation slides at the SNIA Educational Library. Our live audience gave the presentation a 4.9 rating on a scale of 1-5, and they asked a lot of detailed questions, which our presenter, Erik Smith, Vice Chair of SNIA NSF, has answered here.

Q: Does the Centralized Discovery Controller (CDC) layer also provide drive access control or is it simply for discovery of drives visible on the network?

A: As defined in TP8010, the CDC only provides transport layer discovery. In other words, the CDC will allow a host to discover transport layer information (IP, Port, NQN) about the subsystem ports (on the array) that each host has been allowed to communicate with. Provisioning storage volumes to a particular host is additional functionality that COULD be added to an implementation of the CDC (e.g., Dell has a CDC implementation that we refer to as SmartFabric Storage Software (SFSS)).
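
For context, once a host knows the CDC's transport information, a plain nvme-cli discovery against it looks something like the sketch below. This is only illustrative; 10.10.23.2 is the example CDC address used in the slides, and 8009 is the default NVMe discovery service port:

  # Ask the CDC's discovery controller which subsystem ports this host may access
  nvme discover --transport=tcp --traddr=10.10.23.2 --trsvcid=8009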

Q: Can you provide some examples of companies that provide CDC and drive access control functionalities?

A: To the best of my knowledge the only CDC implementation currently available is Dell’s SFSS.

Q: You addressed the authentication piece of the security picture, but what about the other half – encryption. Are there encryption solutions available or in the works?

A: I was running out of time and flew through that section. Both Authentication (DH-HMAC-CHAP) and Secure Channels (TLS 1.3) may be used per the specification. Dell does not support either of these yet, but we are working on it.
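
For anyone experimenting with in-band authentication on Linux, recent nvme-cli releases expose options along these lines. Treat this as a hedged sketch: the exact flags and their availability depend on your nvme-cli and kernel versions, and the NQN, address, and key shown are placeholders.

  # Generate a host DH-HMAC-CHAP secret (hypothetical host NQN)
  nvme gen-dhchap-key --hmac=1 --nqn=nqn.2014-08.org.nvmexpress:uuid:example-host
  # Connect using that secret; --tls additionally requests a TLS secure channel where supported
  nvme connect -t tcp -a <subsystem IP> -s 4420 -n <subsystem NQN> \
      --dhchap-secret=DHHC-1:01:<generated key> --tls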

Q: I believe NVMe/Fibre Channel is widely deployed as well. Is that true?

A: Not based on what I’m seeing. NVMe/FC has been around for a while; it works well, and Dell does support it. However, adoption has been slow. Again, based on what I’m seeing, NVMe/TCP seems to be gaining more traction.

Q: Is nvme-stas an “in-box” solution, EPEL solution, or prototype solution?

A: It currently depends on the distro (a minimal enablement sketch follows the list below).

  • SLES 15 SP4 and SP5 – Inbox
  • RHEL 9.X – Inbox (Tech Preview) [RHEL 8.X: not available]
  • Ubuntu 22.04 – Universe (Community support)
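
Where the package is available, enabling nvme-stas is typically just an install plus starting its two daemons, stafd (the discovery daemon) and stacd (the connector daemon). A minimal sketch, assuming a systemd-based distro; substitute zypper or dnf for apt as appropriate:

  # Install the discovery client (package availability per the list above)
  sudo apt install nvme-stas
  # Enable and start the discovery (stafd) and connector (stacd) services
  sudo systemctl enable --now stafd stacd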

Q: Regarding the slide comparing iSCSI, NVMe-oF, and FC speeds, how do these numbers compare to RDMA transports over Ethernet or InfiniBand (iSCSI Extensions for RDMA (iSER) or NVMe-oF RDMA)? Closer to the FC NVMe-oF numbers? Did you consider NVMe-oF RoCE, or is there not enough current or perceived future adoption? As a follow-on, do you see the same pitfalls with connectivity/hops as seen with FCoE?

A: When we first started looking at NVMe over Fabrics, we spent quite a bit of time working with RoCE, iWARP, NVMe/TCP, and NVMe/FC. Some of those test results were presented during a previous webinar, “NVMe-oF: Looking Beyond Performance Hero Numbers.” The RoCE performance numbers were actually amazing, especially at 100GbE, and were much better than anything else we looked at, with the exception of NVMe/TCP when hardware offload was used. The downsides to RoCE are described in the Hero Numbers webinar referenced above, but the short version is that, the last time I worked with it, it was difficult to configure and troubleshoot. I know NVIDIA has done a lot of work to make this better recently, but I think most end users will eventually end up using NVMe/TCP for general-purpose IP SAN connectivity to external storage.

Q: Can you have multiple CDCs arranged in a tree? For example, a CDC in each area of LAN-segregated subnets that reports to (or is managed by) a higher-level CDC manager, so that you effectively have one centralized CDC with a presence in each of the storage networks accessible by the segregated servers?

A: Theoretically, yes; we have worked out the protocol details to provide this functionality. However, today we could provide the same functionality with a single CDC instance that has multiple network interfaces, with each interface connected to a different subnet. It would be a bit of work to configure, but it would get you out of needing to maintain multiple CDC instances.

 Q: Does NVMe/TCP provide a block level or file level access to the storage?

 A: Block. More information can be found in the blog post titled Storage Protocol Stacks for NVMe.

Q: Which will give better performance: NVMe/TCP on 40GbE or NVMe/FC on 32GFC?

 A: It’s impossible to say without knowing the implementation we are talking about. I have also not seen any performance testing results for NVMe/TCP over 40GbE.

Q: OK, but creating two Ethernet fabrics for SAN A and SAN B goes against the long-standing single-fabric approach to Ethernet network deployment… Besides, wouldn’t this require ripping out Fibre Channel and replacing it with Ethernet?

A: I agree. Air-gapped SAN A and SAN B fabrics built on Ethernet do not go over very well with IP networking teams. A compromise could be to have the networking team allocate two VLANs (one for SAN A and the other for SAN B). This mostly side-steps the concerns I have. With regards to ripping out FC and replacing it with Ethernet, I think absolutely nobody will replace their existing FC SAN with an Ethernet-based one; it doesn’t make sense from an economics perspective. However, I do think that as end users plan to deploy new applications or environments, using Ethernet as a substitute for FC would make sense. This is mainly because the provisioning process we defined for NVMe/TCP was modeled on the FC provisioning process, specifically so that legacy FC customers can move to Ethernet as painlessly as possible should they need to migrate off of FC.
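
To illustrate the VLAN compromise on the host side, each SAN could simply be a tagged interface in its own subnet. The interface names, VLAN IDs, and addresses below are hypothetical:

  # SAN A on VLAN 100
  ip link add link eth0 name eth0.100 type vlan id 100
  ip addr add 192.168.100.10/24 dev eth0.100
  ip link set eth0.100 up
  # SAN B on VLAN 200
  ip link add link eth1 name eth1.200 type vlan id 200
  ip addr add 192.168.200.10/24 dev eth1.200
  ip link set eth1.200 up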

Q: Can you share the scripts again that you used to connect?

 A: Please refer to slide 47. The scripts are available here: https://github.com/dell/SANdbox/tree/main/Toolkit

Q: Any commitment from Microsoft for a Windows NVMe/TCP driver to be developed?

 A: I can’t comment on another company’s product roadmap. I would highly recommend that you reach out to Microsoft directly.

Q: There is a typo in that slide: shouldn’t 10.10.23.2 be 10.10.3.2?

 A: 10.10.23.2 is the IP Address of the CDC in that diagram. The “mDNS response” is telling the host that a CDC is available at 10.10.23.2.
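
If you want to confirm what a host is learning over mDNS, one option on a Linux host with Avahi installed is to browse for the NVMe discovery service type directly (this assumes the environment advertises the _nvme-disc._tcp service, which is what discovery clients such as nvme-stas listen for):

  # Browse for NVMe discovery controllers (e.g., the CDC) advertised via mDNS
  avahi-browse --resolve --terminate _nvme-disc._tcp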

Q: What is the difference between -1500 and -9000?

 A: This is the MTU (Maximum Transmission Unit) size.
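
For reference, the MTU is set per interface on the host and must match end to end across the data path. A quick way to switch between the two values on Linux (eth0 is a placeholder):

  # Jumbo frames
  ip link set dev eth0 mtu 9000
  # Standard frames
  ip link set dev eth0 mtu 1500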

Q: When will TP-8010 be ratified?

 A: It was ratified in February of 2022.

Q: Does CDC sit at end storage (end point) or in fabric?

 A: The CDC can theoretically reside anywhere. Dell’s CDC implementation (SFSS) can currently be deployed as a VM (or on an EC2 instance in AWS). Longer term, you can expect to see SFSS running on a switch.

Q: In FC-NVMe it was 32Gb adapters. What was used for testing Ethernet/NVMe over TCP?

A: We used Intel E810 adapters that were set to 25GbE.

Q: Will a higher speed Ethernet adapter give better results for NVMe over TCP as 100Gb Ethernet adapters are more broadly available and 128Gb FC is still not a ratified standard?

A: A higher speed Ethernet adapter will give better results for NVMe/TCP. A typical modern host should be able to drive a pair of 100GbE adapters to near line rate with NVMe/TCP I/O. The problem is that doing so consumes a lot of CPU and could negatively impact the amount of CPU left for applications/VMs, unless offloads in the NIC are used to reduce that overhead. Also, the 128GFC standard was ratified earlier this year.

Q: Will CDC be a separate device? Appliance?

 A: The CDC currently runs as a VM on a server. We also expect CDCs to be deployed on a switch.

Q: What storage system was used for this testing?

 A: The results were for Dell PowerStore. The testing results will vary depending on the storage platform being used.

Q: Slides 20-40: Who are you expecting to do this configuration work, the server team, the network team, or the storage team?

 A: These slides were intended to show the work that needs to be done, not which team needs to do it.  That said, the fully automated solution could be driven by the storage admin with only minimal involvement from the networking and server teams.

Q: Are the CPU utilization results for the host or array?

 A: Host

Q: What was the HBA card & Ethernet NIC used for the testing?

 A: HBA = QLE2272. NIC = Intel E810

Q: What were the FC HBA & NIC speeds?

 A: HBA was running at 32GFC. Ethernet was running at 25GbE.

Q: How do you approach multi-site redundancy or single-site redundancy?

A: Single-site redundancy can be accomplished by deploying more than one CDC instance and setting up a SAN A / SAN B type of configuration. Multi-site redundancy depends on the scope of the admin domain. If the admin domain spans both sites, then a single CDC instance COULD provide discovery services for both sites. If there is one admin domain per site, then it would currently require one CDC instance per site/admin domain.

 Q: When a host admin decides to use the “direct connect” discovery approach instead of “Centralized Discovery”, what functionality is lost?

A: This configuration works fine up to a point (roughly tens of endpoints), but it results in full-mesh discovery, which can waste resources on both the host and the storage.
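
For completeness, direct-connect discovery is just nvme-cli pointed at each subsystem's own discovery controller instead of at the CDC, which is why every host ends up talking to every array. The address and port below are illustrative:

  # Discover and connect to everything this one array exposes to the host
  nvme connect-all --transport=tcp --traddr=192.168.100.20 --trsvcid=8009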

Q: Are there also test results with larger / more regular block sizes?

 A: Yes. Please see the Transport Performance Comparison White paper.

Q: Is auto discovery supported natively within, for example, VMware ESXi?

 A: Yes. With ESXi 8.0u1, dynamic discovery is fully supported.

Q: So NVMe/TCP supports LAG, whereas iSCSI does not?

 A: LAG can be supported for both NVMe/TCP and iSCSI. There are some limitations with ESXi and these are described in the SFSS Deployment Guide.

Q: So NVMe/TCP does support routing?

 A: Yes. I was showing how to automate the configuration of routing information that would typically need to be done on the host to support NVMe/TCP over L3.
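
As an example of what that host-side configuration looks like without automation, each host needs a route toward the remote storage subnet. An interface route (no gateway specified) is one way to express it, and it is this form that leads to the proxy ARP question further down. Names and addresses are illustrative:

  # Send traffic for the remote IP SAN subnet out the local SAN interface
  ip route add 192.168.200.0/24 dev eth0.100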

Q: You are referring to Dell Open Source Host Software; do other vendors also have the same multipathing / storage path handling concept?

 A: I do not believe there is another discovery client available right now.  Dell has gone to great lengths to make sure that the discovery client can be used by any vendor.

Q: FC has moved to 64G as the new standard. Does this solution work well with mixed end-device speeds, as most environments have today, or was the testing conducted with all devices running at the same NIC and storage speeds?

A: We’ve tested with mixed speeds and have not encountered any issues. That said, mixed speed configurations have not gotten anywhere near the amount of testing that homogeneous speed configurations have.

Q: Any reasoning on why a lower MTU produces better performance results than a jumbo MTU on NVMe/TCP? This seems to go against the conventional thinking of enabling 9K MTU when a SAN is involved.

 A: That is a great question that I have not been able to answer up to this point. We also found it counterintuitive.

Q: Is CDC a vendor thing or a protocol thing?

A: It is a protocol thing. The CDC is defined in the NVM Express® standard; see TP8010 for more information.

Q: Do you see any issues with virtual block storage behind NVMe-oF? Specifically, ZFS zvols in my case vs. raw NVMe disks. Is this already something that is done with iSCSI?

A: As long as the application does not use SCSI commands (e.g., vendor-unique SCSI commands for array management purposes) to perform a specialized task, it will not know whether the underlying storage volume is NVMe or SCSI based.

Q: In your IOPS comparison, were there significant hardware offloads specific to NVMe/TCP, or just general IP/TCP offload?

 A: There were no HW offloads used for NVMe/TCP testing. It was all software based.

Q: Is IPv6 supported for NVMe/TCP? If so, is there any improvement for response times (on the same subnet)?

 A: Yes, IPv6 is supported and it does not impact performance.
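
If you want to try it, nvme-cli accepts an IPv6 address in the same places it accepts IPv4 (the address and NQN below are placeholders):

  nvme connect --transport=tcp --traddr=fd00:10:10:3::2 --trsvcid=4420 --nqn=<subsystem NQN>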

Q: The elephant in the room between Link Aggregation and Multipath is that only the latter actually aggregates the effective bandwidth between any two devices in a reliable manner…

 A: I am not sure I would go that far, but I do agree they are both important and can be combined if both the network and storage teams want to make sure both cases are covered. I personally would be more inclined to use Multipathing because I am more concerned about inadvertently causing a data unavailability (DU) event, rather than making sure I get the best possible performance.

Q: Effective performance is likely to be limited to only the bandwidth of a single link too… MPIO is the way to go.

 A: I think this is heavily dependent on the workload, but I agree that multipathing is the best way to go overall if you have to choose one or the other.

Q: You’d need ProxyARP for the interface routes to work in this way, correct?

 A: YES! And thank you for mentioning this. You do need proxy ARP enabled on the switches in order to bypass the routing table issue with the L3 IP SAN.
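
The exact command is switch-vendor specific, but for anyone reproducing the L3 IP SAN behavior with a Linux box acting as the router, the equivalent knob is a per-interface sysctl (the interface name is a placeholder):

  # Answer ARP requests on behalf of hosts reachable through other interfaces
  sysctl -w net.ipv4.conf.eth0.proxy_arp=1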

Q: Were the tests on 1500 byte frames?  Can we do jumbo frames?

 A: The test results included MTU of 1500 and 9000.

Q: Seems like a lot of configuration steps for discovery. How does this compare to Fibre Channel in terms of complexity of configuration?

A: When we first started working on discovery automation for NVMe/TCP, we made a conscious decision to ensure that the user experience of provisioning storage to a host via NVMe/TCP was as close as possible to the process used to provision storage to a host over an FC SAN. We included concepts like a name server and zoning to make it as easy as possible for legacy FC customers to work with NVMe/TCP. I think we successfully met our goal.

Make sure you know about all of the SNIA NSF’s upcoming webinars by following us on Twitter @SNIANSF.
