RoCE vs. iWARP Q&A

In our RoCE vs. iWARP webcast, experts from the SNIA Ethernet Storage Forum (ESF) had a friendly debate on two commonly known remote direct memory access (RDMA) protocols that run over Ethernet: RDMA over Converged Ethernet (RoCE) and the IETF-standard iWARP. It turned out to be another very popular addition to our “Great Storage Debate” webcast series. If you haven’t seen it yet, it’s now available on-demand along with a PDF of the presentation slides.

We received A LOT of questions related to Performance, Scalability and Distance, Multipathing, Error Correction, Windows and SMB Direct, DCB (Data Center Bridging), PFC (Priority Flow Control), lossless networks, Congestion Management, and more. Here are answers to them all.

Q. Are RDMA NICs and TOE NICs the same? What are the differences?

A. No, they are not, though some RNICs include a TOE. An RNIC based on iWARP uses a TOE (TCP Offload Engine), since iWARP itself is fundamentally an upper-layer protocol relative to TCP/IP (it is encapsulated in TCP/IP). The iWARP-based RNIC may or may not expose the TOE. If the TOE is exposed, it can be used for other purposes/applications that require TCP/IP acceleration. However, most of the time the TOE is hidden under the iWARP verbs API and thus is only used to accelerate TCP for iWARP. An RNIC based on RoCE usually does not have a TOE in the first place and is thus not capable of statefully offloading TCP/IP, though many such RNICs do offer stateless TCP offloads.

Q. Does RDMA use the TCP/UDP/IP protocol stack?

A.  RoCE uses UDP/IP while iWARP uses TCP/IP. Other RDMA protocols like OmniPath and InfiniBand don’t use Ethernet.
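To an application, that difference is largely invisible: both RoCE and iWARP RNICs are programmed through the same RDMA verbs API, and only the transport underneath changes. As a rough illustration (a minimal sketch, not from the webcast, assuming Linux with libibverbs installed), a program can enumerate the RDMA devices in a host and print which transport each one reports:

```c
/* Minimal sketch: list RDMA-capable devices and the transport each reports.
 * Assumes Linux with libibverbs; build with: gcc list_rnics.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num_devices; i++) {
        const char *transport =
            devs[i]->transport_type == IBV_TRANSPORT_IWARP ? "iWARP (RDMA over TCP/IP)" :
            devs[i]->transport_type == IBV_TRANSPORT_IB    ? "IB verbs (InfiniBand or RoCE)" :
                                                             "other/unknown";
        printf("%s: %s\n", ibv_get_device_name(devs[i]), transport);
    }

    ibv_free_device_list(devs);
    return 0;
}
```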

Q. Can Software Defined Networking features like VxLANs be implemented on RoCE/iWARP NICs?

A. Yes, most RNICs can also support VxLAN. An RNIC combines all the functionality of a regular NIC (such as VxLAN offloads, checksum offloads, etc.) with RDMA functionality.

Q. Do the BSD OSes (e.g. FreeBSD) support RoCE and iWARP?

A. FreeBSD supports both iWARP and RoCE.

Q. Any comments on NVMe over TCP?

A. The NVMe over TCP standard is not yet finalized. Once the specification is finalized, SNIA ESF will host a webcast on BrightTALK to discuss NVMe over TCP. Follow us @SNIAESF for notification of all our upcoming webcasts.

Q. What layers in the OSI model would the RDMAP, DDP, and MPA map to for iWARP?

A. RDMAP/DDP/MPA are stacked on top of TCP, so these protocols sit above Layer 4, the Transport Layer, in the OSI model.

Q. What’s the deployment percentages between RoCE and iWARP? Which has a bigger market share support and by how much?

A. SNIA does not have this market share information. Today multiple networking vendors support both RoCE and iWARP. Historically more adapters supporting RoCE have been shipped than adapters supporting iWARP, but not all the iWARP/RoCE-capable Ethernet adapters deployed are used for RDMA.

Q. Who will win RoCE or iWARP or InfiniBand? What shall we as customers choose if we want to have this today?

A.  As a vendor-neutral forum, SNIA cannot recommend any specific RDMA technology or vendor. Note that RoCE and iWARP run on Ethernet while InfiniBand (and OmniPath) do not use Ethernet.

Q. Are there any best practices identified for running higher-level storage protocols (iSCSI/NFS/SMB etc.), on top of RoCE or iWARP?

A. Congestion caused by dropped packets and retransmissions can degrade performance for higher-level storage protocols whether using RDMA or regular TCP/IP. To prevent this from happening a best practice would be to use explicit congestion notification (ECN), or better yet, data center bridging (DCB) to minimize congestion and ensure the best performance. Likewise, designing a fully non-blocking network fabric will also assist in preventing congestion and guarantee the best performance. Finally, by prioritizing the data flows that are using RoCE or iWARP, the network administrators can ensure bandwidth is available for the flows that require it the most.

iWARP provides RDMA functionality over TCP/IP and inherits loss resilience and congestion management from the underlying TCP/IP layer. Thus, it does not require specific best practices beyond those already in use for TCP/IP: it needs no specific host or switch configuration and works out of the box across LAN/MAN/WAN networks.
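As one concrete (and purely illustrative) example of the flow-prioritization advice above: on Linux hosts using the librdmacm connection manager, an application can set the IP TOS/DSCP byte on its RDMA connection so that switches can classify the flow into a priority class. This is a sketch only; the DSCP value and the switch-side priority mapping are assumptions a network administrator would choose, not a recommended configuration.

```c
/* Illustrative sketch only: mark an RDMA connection's packets with a DSCP
 * value so switches can classify/prioritize the flow. Assumes librdmacm;
 * the DSCP value (AF31 here) is an arbitrary example, not a recommendation.
 * Build with: gcc rdma_tos.c -lrdmacm */
#include <stdio.h>
#include <stdint.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    struct rdma_event_channel *ec = rdma_create_event_channel();
    struct rdma_cm_id *id = NULL;

    if (!ec || rdma_create_id(ec, &id, NULL, RDMA_PS_TCP)) {
        perror("rdma_create_id");
        return 1;
    }

    /* TOS byte = DSCP << 2; the switch would be configured (by the network
     * admin) to map this DSCP into a priority/ETS class reserved for RDMA. */
    uint8_t tos = 26 << 2;   /* DSCP 26 (AF31), example value only */
    if (rdma_set_option(id, RDMA_OPTION_ID, RDMA_OPTION_ID_TOS, &tos, sizeof(tos)))
        perror("rdma_set_option(TOS)");

    /* ... resolve address/route, create a QP, and connect as usual ... */

    rdma_destroy_id(id);
    rdma_destroy_event_channel(ec);
    return 0;
}
```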

Q. On slide #14 of the RoCE vs. iWARP presentation, the slide showed SCM being 1,000 times faster than NAND Flash, but the presenter stated 100 times faster. Those are both higher than I have heard. Which is correct?

A. Research on the Internet shows that both Intel and Micron have been boasting that 3D XPoint memory is 1,000 times as fast as NAND flash. However, their tests also compared a standard NAND flash-based PCIe SSD to a similar SSD based on 3D XPoint memory, which was only 7-8 times faster. Because of this, we dug in a little further and found a great article by Jim Handy, “Why 3D XPoint SSDs Will Be Slow,” that could help explain the difference.

Q. What is the significance of BTH+ and GRH header?

A. BTH+ and GRH are both used within InfiniBand for RDMA implementations. With RoCE implementations of RDMA, packets are marked with an EtherType header that indicates the packets are RoCE, and the ip.protocol_number within the IP header is used to indicate that the packet is UDP. Both of these will identify packets as RoCE packets.
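For reference, here is a rough sketch (purely illustrative, assuming untagged Ethernet/IPv4 frames) of how such packets can be recognized on the wire: RoCE v1 frames carry the IBTA-assigned EtherType 0x8915, while RoCE v2 packets are ordinary UDP/IP datagrams addressed to the IANA-assigned UDP destination port 4791.

```c
/* Conceptual classifier, assuming untagged Ethernet/IPv4 frames:
 * RoCE v1 is identified by its EtherType; RoCE v2 by its UDP port. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_ROCE_V1 0x8915u  /* IBTA-assigned RoCE (v1) EtherType       */
#define ROCE_V2_UDP_PORT  4791u    /* IANA-assigned UDP dest port for RoCE v2 */

static bool is_roce_frame(const uint8_t *frame, size_t len)
{
    if (len < 14)
        return false;

    unsigned ethertype = (frame[12] << 8) | frame[13];
    if (ethertype == ETHERTYPE_ROCE_V1)
        return true;                              /* RoCE v1: GRH/BTH follow directly */

    if (ethertype == 0x0800 && len >= 14 + 20 + 8) {      /* IPv4 */
        const uint8_t *ip = frame + 14;
        size_t ihl = (size_t)(ip[0] & 0x0f) * 4;  /* IPv4 header length in bytes */
        if (ip[9] == 17 && len >= 14 + ihl + 8) { /* protocol 17 = UDP           */
            unsigned dport = (ip[ihl + 2] << 8) | ip[ihl + 3];
            return dport == ROCE_V2_UDP_PORT;     /* RoCE v2 over UDP/IP         */
        }
    }
    return false;
}

int main(void)
{
    uint8_t frame[64] = {0};   /* an all-zero frame is neither RoCE v1 nor v2 */
    return is_roce_frame(frame, sizeof(frame)) ? 1 : 0;
}
```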

Q. What sorts of applications are unique to the workstation market for RDMA, versus the server market?

A. All major OEM vendors are shipping servers with CPU platforms that include integrated iWARP RDMA, as well as offering adapters that support iWARP and/or RoCE. The main applications of RDMA are still in the server space at this time. At the time of this writing, workstation operating systems such as Windows 10 or Linux can use RDMA when running I/O-intensive applications such as video post-production, oil/gas, and computer-aided design applications, for high-speed access to storage.

DCB, PFC, lossless networks, and Congestion Management

Q. Is slide #26 correct? I thought RoCE v1 was PFC/DCB and RoCE v2 was ECN/DCB subset. Did I get it backwards?

A.  Sorry for the confusion, you’ve got it correct. With newer RoCE-capable adapters, customers may choose to use ECN or PFC for RoCE v2.

Q. I thought RoCE v2 did not need any DCB enabled network, so why this DCB congestion management for RoCE v2?

A. RoCE v2 running on modern RNICs is known as Resilient RoCE because it does not need a lossless network. Instead, a RoCE congestion control mechanism is used to minimize packet loss by leveraging Explicit Congestion Notification (ECN). ECN allows switches to notify hosts when congestion is likely to happen, and the end nodes adjust their data transmission speeds to prevent congestion before it occurs. RoCE v2 takes advantage of ECN to avoid congestion and packet loss. ECN-capable switches detect when a port is getting too busy and mark outbound packets from that port with the Congestion Experienced (CE) bit. The receiving NIC sees the CE indication and notifies the sending NIC with a Congestion Notification Packet (CNP). In turn, the sending NIC backs off its sending rate temporarily to prevent congestion from occurring. Once the risk of congestion declines sufficiently, the sender resumes full-speed data transmission (this is what makes it Resilient RoCE).
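A highly simplified toy model of that sender-side reaction (purely illustrative; this is not DCQCN or any vendor's actual firmware algorithm, and the back-off and ramp-up constants are made up) might look like this:

```c
/* Toy model of the Resilient RoCE reaction described above: cut the sending
 * rate when a CNP arrives, then ramp back up once CNPs stop arriving.
 * Purely illustrative; constants and steps are invented for clarity. */
#include <stdio.h>

#define LINE_RATE_GBPS 100.0

static double rate_gbps = LINE_RATE_GBPS;

/* Called when the NIC receives a CNP echoed back by the remote RNIC. */
static void on_cnp_received(void)
{
    rate_gbps *= 0.5;                     /* back off; real NICs use finer steps */
    if (rate_gbps < 1.0)
        rate_gbps = 1.0;                  /* keep a minimum sending rate         */
}

/* Called periodically while no CNPs arrive. */
static void on_recovery_timer(void)
{
    rate_gbps += 0.05 * LINE_RATE_GBPS;   /* ramp back toward line rate */
    if (rate_gbps > LINE_RATE_GBPS)
        rate_gbps = LINE_RATE_GBPS;
}

int main(void)
{
    on_cnp_received();                    /* congestion signalled: 100 -> 50 Gb/s */
    printf("after CNP:      %.1f Gb/s\n", rate_gbps);

    for (int i = 0; i < 10; i++)
        on_recovery_timer();              /* congestion clears: resume full speed */
    printf("after recovery: %.1f Gb/s\n", rate_gbps);
    return 0;
}
```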

Q. Is iWARP a lossless or lossy protocol?

A. iWARP utilizes the underlying TCP/IP layer for loss resilience. This happens at silicon speeds for iWARP adapters with embedded TCP/IP offload engine (TOE) functionality.

Q. So it looks to me that iWARP can use an existing Ethernet network without modifications and RoCEv2 would need some fine-tuning. Is this correct?

A. Generally, iWARP does not require any modification to the Ethernet switches, while RoCE requires the use of either PFC or ECN (depending on the RNICs used for RoCE). However, all RDMA networking will benefit from a network setup that minimizes latency, packet loss, and congestion. iWARP delivers RDMA on top of the TCP/IP protocol, so TCP provides congestion management and loss resilience for iWARP, which, as a result, does not require a lossless Ethernet network. This is particularly useful in congested networks or over long-distance links.

Q. Is this statement correct? Please clarify — RoCE v1 requires ECN and PFC, but RoCE v2 requires only ECN or PFC?

A. Remember, we called this presentation a “Great Storage Debate”? Here is an area where there are two schools of thought.

Answer #1: It’s recommended to deploy RoCE (v1) with PFC which is part of the Ethernet Data Center Bridging (DCB) specification to implement a lossless network. With the release of RoCEv2, an alternative mechanism to avoid packet loss was introduced which leverages Explicit Congestion Notification (ECN). ECN allows switches to notify hosts when congestion is likely to happen, and the end nodes adjust their data transmission speeds to prevent congestion before it occurs.

Answer #2: Generally this is correct: iWARP does not require any modification to the Ethernet switches, while RoCE requires the use of either PFC or ECN (depending on the RNICs used for RoCE), plus DCB. As such, and this is very important, an iWARP installation on a storage or server node is decoupled from any switch infrastructure upgrade. However, all RDMA networking will benefit from a network setup that minimizes latency, packet loss, and congestion, though in the case of an iWARP adapter this benefit is insignificant, since all loss recovery and congestion management happen at the silicon speed of the underlying TOE.

Q. Does RoCE v2 also require PFC or how will it handle lossy networks?

A. RoCE v2 does not require PFC, but it performs better with either PFC or ECN activated. See the following question and answer for more details.

Q. Can a RoCEv2 lossless network be achieved with ECN only (no PFC)?

A. RoCE has built-in error correction and retransmission mechanisms, so it does not require a lossless network. With modern RoCE-capable adapters, it only requires the use of ECN. ECN in and of itself does not guarantee a lossless connection, but it can be used to minimize congestion and thus minimize packet loss. However, even with RoCE v2, a lossless connection (using PFC/DCB) can provide better performance and is often implemented with RoCE v2 deployments, either instead of ECN or alongside ECN.

Q. In order to guarantee lossless, does ECN and PFC both have to be used?

A. ECN can be used to avoid most packet loss, but PFC (part of DCB) is required for a truly lossless network.

Q. Are there real deployments that use “Resilient RoCE” without PFC configured?

A.  To achieve better performance, PFC alone or both ECN and PFC are deployed in most iterations of RoCE in real deployments today. However, there are a growing number of deployments using Resilient RoCE with ECN alone that maintain high levels of performance.

Q. For RoCEv2, can ECN be implemented without PFC?

A. Yes, ECN can be implemented on its own within a RoCE v2 implementation without the need for PFC.

Q. RoCE needs to have Converged Ethernet, but iWARP does not, correct?

A. Correct. iWARP was standardized in the IETF and built upon standard TCP/IP over Ethernet, so the “Converged Ethernet” requirement doesn't apply to iWARP.

Q. It’s not clear from the diagram if TCP/IP is still needed for RoCE and iWARP. Is it?

A. RoCE uses IP (UDP/IP) but not TCP. iWARP uses TCP/IP.

Q. On slide #10, does this require any support on the switch?

A. Yes, an enterprise switch with support for DCB would be required. Most enterprise switches do support DCB today.

Q. Will you cover congestion mechanisms and which one ROCEv2 or iWARP work better for different workloads?

A. With multiple vendors supporting RoCEv2 and iWARP at different speeds (10, 25, 40, 50, and 100Gb/s), we’d likely see a difference in performance from each adapter across different workloads. An apples-to-apples test of the specific workload would be required to provide an answer. If you are working with a specific vendor or OEM, we would suggest you ask the vendor/OEM for comparison data on the workload you plan on deploying.

Performance, Scalability and Distance

Q. For storage related applications, could you add a performance based comparison of Ethernet based RoCE / iWARP to FC-NVMe with similar link speeds (32Gbps FC to 40GbE for example)?

A. We would like to see the results of this testing as well, and given the overwhelming requests for data comparing RoCE and iWARP, this is something we will try to provide in the future.

Q. Do you have some performance measurements which compare iWARP and RoCE?

A. Nothing is available from SNIA ESF but a search on Google should provide you with the information you are looking for. For example, you can find this Microsoft blog.

Q. Are there performance benchmarks between RoCE vs. iWARP?

A.  Debating which one is faster is beyond the scope of this webcast.

Q. Can RoCE scale to 1000’s of Ceph nodes, assuming each node hosts 36 disks?

A.  RoCE has been successfully tested with dozens of Ceph nodes. It’s unknown if RoCE with Ceph can scale to 1000s of Ceph nodes.

Q. Is ROCE limited in number of hops?

A. No, there is no limit on the number of hops, but as more hops are included, latency increases and performance may become an issue.

Q. Does RoCEv2 support long distance (100km) operation or is it only iWARP?

A. Today the practical limit of RoCE while maintaining high performance is about 40km. As different switches and optics come to market, this distance limit may increase in the future. iWARP has no distance limit but with any high-performance networking solution, increasing distance leads to increasing latency due to the speed of light and/or retransmission hops. Since it is a protocol on top of basic TCP/IP, it can transfer data over wireless links to satellites if need be.

Multipathing, Error Correction

Q. Isn’t the Achilles heel of iWARP the handling of congestion on the switch? Sure TCP/IP doesn’t require lossless but doesn’t one need DCTCP, PFC, ETS to handle buffers filling up both point to point as well as from receiver to sender? Some vendors offload any TCP/IP traffic and consider RDMA “limited” but even if that’s true don’t they have to deal with the same challenges on the switch in regards to congestion management?

A. TCP itself uses a congestion-avoidance algorithm, like TCP New Reno (RFC 6582), together with slow start and a congestion window, to avoid congestion. These mechanisms are not dependent on switches, so iWARP's performance under network congestion should closely match that of TCP.
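For readers less familiar with those TCP mechanics, here is a simplified, textbook-style illustration (not the actual New Reno code in any operating system) of how the congestion window grows exponentially during slow start, grows linearly during congestion avoidance, and shrinks when a loss is detected:

```c
/* Simplified, textbook-style model of TCP congestion window behavior:
 * slow start, congestion avoidance, and a New Reno-style halving on loss. */
#include <stdio.h>

int main(void)
{
    double cwnd = 1.0;        /* congestion window, in segments (MSS)        */
    double ssthresh = 64.0;   /* slow-start threshold, in segments           */

    for (int rtt = 1; rtt <= 12; rtt++) {
        if (cwnd < ssthresh)
            cwnd *= 2.0;      /* slow start: exponential growth per RTT      */
        else
            cwnd += 1.0;      /* congestion avoidance: linear growth per RTT */

        if (rtt == 8) {       /* pretend a loss is detected on this RTT      */
            ssthresh = cwnd / 2.0;
            cwnd = ssthresh;  /* fast recovery, greatly simplified           */
        }
        printf("RTT %2d: cwnd=%5.1f ssthresh=%5.1f\n", rtt, cwnd, ssthresh);
    }
    return 0;
}
```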

Q. If you are using RoCE v2 with UDP, how is error correction implemented?

A. Error correction is done by the RoCE protocol running on top of UDP.

Q. How does multipathing work with RDMA?

A. For single-port RNICs, multipathing, being network-based (Equal-Cost Multi-Path routing, ECMP), is transparent to the RDMA application. Both RoCE and iWARP transports achieve good network load balancing under ECMP. For multi-port RNICs, the RDMA client application can explicitly load-balance its traffic across multiple local ports. Some multi-port RNICs support link aggregation (a.k.a. bonding), in which case the RNIC transparently spreads connection load amongst physical ports.
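To illustrate why ECMP is transparent to the application: the switch or router, not the host, hashes each flow's 5-tuple and uses the result to pick one of the equal-cost next hops, so different RDMA connections naturally spread across the available links. The sketch below is purely conceptual; real switches use their own, often proprietary, hash functions.

```c
/* Conceptual sketch of ECMP path selection: hash the flow's 5-tuple and use
 * the result to choose one of N equal-cost paths. Illustrative hash only. */
#include <stdint.h>
#include <stdio.h>

struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  protocol;            /* 6 = TCP (iWARP), 17 = UDP (RoCE v2) */
};

static unsigned pick_path(const struct five_tuple *ft, unsigned num_paths)
{
    uint32_t h = ft->src_ip;      /* simple multiplicative mix, for illustration */
    h = h * 31 + ft->dst_ip;
    h = h * 31 + ft->src_port;
    h = h * 31 + ft->dst_port;
    h = h * 31 + ft->protocol;
    return h % num_paths;
}

int main(void)
{
    struct five_tuple flow = { 0x0a000001, 0x0a000002, 49152, 4791, 17 };
    printf("flow mapped to path %u of 4\n", pick_path(&flow, 4));
    return 0;
}
```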

Q. Do RoCE and iWARP work with bonded NICs?

A. The short answer is yes, but it will depend on the individual NIC vendor's implementation.

Windows and SMB Direct

Q. What is SMB Direct?

A. SMB Direct is a special version of the SMB 3 protocol. It supports both RDMA and multiple active-active connections. You can find the official definition of SMB (Server Message Block) in the SNIA Dictionary.

Q. Is there iSER support in Windows?

A. Today iSER is supported in Linux and VMware but not in Windows. Windows does support both iWARP and RoCE for SMB Direct. Chelsio is now providing an iSER (iWARP) Initiator for Windows as part of the driver package, which is available at service.chelsio.com. The current driver is considered a beta, but will go GA by the end of September 2018.

Q. When will iWARP or RoCE for NVMe-oF be supported on Windows?

A.  Windows does not officially support NVMe-oF yet, but if and when Windows does support it, we believe it will support it over both RoCE and iWARP.

Q. Why is iWARP better for Storage Spaces Direct?

A. iWARP is based on TCP, which deals with flow control and congestion management, so iWARP is scalable and ideal for a hyper-converged storage solution like Storage Spaces Direct. iWARP is also the recommended configuration from Microsoft in some circumstances.

We hope that answers all your questions! We encourage you to check out the other “Great Storage Debate” webcasts in this series. To date, our experts have had friendly, vendor-neutral debates on File vs. Block vs. Object Storage, Fibre Channel vs. iSCSI, FCoE vs. iSCSI vs. iSER, and Centralized vs. Distributed Storage. Happy debating!