Optimizing NVMe over Fabrics Performance Q&A

Almost 800 people have already watched our webcast “Optimizing NVMe over Fabrics Performance with Different Ethernet Transports: Host Factors” where SNIA experts covered the factors impacting different Ethernet transport performance for NVMe over Fabrics (NVMe-oF) and provided data comparisons of NVMe over Fabrics tests with iWARP, RoCEv2 and TCP. If you missed the live event, watch it on-demand at your convenience.

The session generated a lot of questions, all answered here in this blog. In fact, many of the questions have prompted us to continue this discussion with future webcasts on NVMe-oF performance. Please follow us on Twitter @SNIANSF for upcoming dates.

Q. What factors will affect the performance of NVMe over RoCEv2 and TCP when the network between host and target is longer than in a typical data center environment, i.e., RTT > 100ms?

A. For a large deployment over long distances, congestion management and flow control are the most critical considerations to ensure performance. In a very large deployment, network topology, bandwidth subscription to the storage target, and connection ratio are all important factors that impact NVMe-oF performance.

Q. Were the RoCEv2 tests run on ‘lossless’ Ethernet and the TCP tests run on ‘lossy’ Ethernet?

A. Both the iWARP and RoCEv2 tests were run in a back-to-back configuration without a switch in the middle, but with Link Flow Control turned on.

Q. Just to confirm, this is with pure RoCEv2? No TCP, right? RoCEv2 end to end (initiator to target)?

A. Yes, for the RoCEv2 test, it was a RoCEv2 initiator to a RoCEv2 target.

Q. How are the drives being preconditioned? Is it based on I/O size or MTU size? 

A. Storage is preconditioned by the I/O size and type of the selected workload; MTU size is not relevant. The selected workload is applied until performance changes are time invariant, i.e. until performance stabilizes within a range known as steady state. Generally, the workload is tracked by specific I/O size and type to remain within a data excursion of 20% and a slope of 10%.
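For illustration, here is a minimal Python sketch of the kind of steady-state check described above. The five-round measurement window and the way the excursion and slope are normalized to the window average are assumptions for this example, not the Calypso CTS implementation.

```python
# Minimal sketch of a steady-state check on a rolling measurement window.
# The 5-round window and the normalization to the window average are
# illustrative assumptions, not the CTS implementation.

def is_steady_state(samples, excursion_limit=0.20, slope_limit=0.10):
    """Return True when the last five samples (e.g. IOPS per round) sit
    within +/-20% of the window average and the best-fit slope across the
    window amounts to less than 10% of that average."""
    window = samples[-5:]
    if len(window) < 5:
        return False
    avg = sum(window) / len(window)

    # Data excursion: every point within +/- excursion_limit of the average.
    if any(abs(y - avg) > excursion_limit * avg for y in window):
        return False

    # Least-squares slope over the window, expressed as the total change
    # across the window relative to the average.
    n = len(window)
    x_mean = (n - 1) / 2
    slope = sum((x - x_mean) * (y - avg) for x, y in enumerate(window)) / \
            sum((x - x_mean) ** 2 for x in range(n))
    return abs(slope * (n - 1)) <= slope_limit * avg

# The workload would be applied round by round until this returns True, e.g.:
#     while not is_steady_state(iops_per_round):
#         iops_per_round.append(run_one_round())   # run_one_round() is hypothetical
```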

Q. Are the 6 SSDs in a single namespace, or multiple? If multiple, how many namespaces were used?

A. Single namespace.

Q. What I/O generation tool was used for the test?

A. The Calypso CTS I/O stimulus generator, which is based on libaio. CTS has the same engine as fio and applies I/Os at the block I/O level. Note that vdbench (Java-based) and Iometer operate at the file system level, higher in the software stack.
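To make the distinction concrete, below is a minimal Python sketch of what "block-level" I/O means: opening the raw device with O_DIRECT and bypassing the file system. The device path is hypothetical, and this synchronous single read is for illustration only; it is not how CTS or fio dispatch asynchronous I/O through libaio.

```python
# Minimal sketch of a single block-level read (Linux), bypassing the file
# system via O_DIRECT. The device path is hypothetical; real tools such as
# CTS or fio keep many such I/Os in flight asynchronously via libaio.
import mmap
import os

DEVICE = "/dev/nvme0n1"   # hypothetical block device under test
BLOCK = 4096              # O_DIRECT requires aligned sizes and buffers

buf = mmap.mmap(-1, BLOCK)               # anonymous mmap is page-aligned
fd = os.open(DEVICE, os.O_RDONLY | os.O_DIRECT)
try:
    nread = os.preadv(fd, [buf], 0)      # read one block at offset 0
    print(f"read {nread} bytes from {DEVICE}")
finally:
    os.close(fd)
```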

Q. Given that NVMe SSD performance is high with low latency, is it not that the performance bottleneck is shifted to the storage controller?

A. Test I/Os are applied to the logical storage seen by the host on the target server, in an attempt to normalize the host and target and assess NIC-wire-NIC performance. The storage controller is beneath this layer and not applicable to this test. If we test the storage directly on the target – not over the wire – then we can see the impact of the controller and controller-related issues (such as garbage collection, over-provisioning, table structures, etc.).

Q. What are the specific characteristics of RoCEv2 that restrict it to ‘rack’ scale deployments?  In other words, what is restricting it from larger scale deployments?

A. RoCEv2 can, and does, scale beyond the rack if you have one of three things:

  1. A lossless network with DCB (priority flow control)
  2. Congestion management with solutions like ECN
  3. Newer RoCEv2-capable adapters that support out of order packet receive and selective re-transmission

Your mileage will vary based on the features offered by different network vendors.

Q. Is there an option to use some caching mechanism on host side?

A. The host side has RAM cache per the platform setup, but it is held constant across these tests.

Q. Was there caching in the host?

A. The test used host memory for NVMe over Fabrics.

Q. Were all these topics from the description covered?  In particular, #2?
We will cover the variables:

  1. How many CPU cores are needed (I’m willing to give)?
  2. Optane SSD or 3D NAND SSD?
  3. How deep should the Q-Depth be?
  4. Why do I need to care about MTU?

A. Cores – see the TC/QD sweep to find the optimal OIO; core usage/requirements can be inferred from this. Note the incongruity of TC/QD to OIO of 8, 16, 32, 48 in this case.

  1. The test used a dual-socket server on the target with an Intel® Xeon® Platinum 8280L processor with 28 cores. The target server only used one processor so that all the workloads were on a single NUMA node. The 1-4% CPU utilization is the average across 28 cores.
  2. SSD-1 is an Optane SSD, SSD-2 is 3D NAND.
  3. Normally QD is set to 32.
  4. You do not need to care much about MTU; at least in our test, we saw minimal performance differences.

Q. Is the 1~4% CPU utilization result on the target based on a single SSD? Do you expect to see much higher CPU utilization as the number of SSDs increases?

A. The CPU % is for the target server with the 6-SSD LUN.

Q. Is there any difference between the different transports and the sensitivity of lost packets?

A. Theoretically, iWARP and TCP are more tolerant of packet loss. iWARP is based on TCP/IP; TCP provides flow control and congestion management that can still perform in a congested environment. In the event of packet loss, iWARP supports selective re-transmission and out-of-order packet receive, which can further improve performance in a lossy network. The standard RoCEv2 implementation, by contrast, does not tolerate packet loss; it requires a lossless network and experiences performance degradation when packet loss happens.

Q. 1. When you say offload TCP, is this on both the initiator and target side, or just the host initiator side?
2. Do you see any improvement with ADQ on TCP?

A. The iWARP RDMA used in the test has a complete TCP offload engine on the network adapter on both the initiator and target sides. Application Device Queues (ADQ) can significantly improve throughput, latency and, most importantly, latency jitter, with dedicated CPU cores allocated for NVMe-oF solutions.

Q. Since the CPU utilization is extremely low on the host, any comments about the CPU role in NVMe-oF and the impact of offloading?

A. NVMe-oF was designed to reduce the CPU load on the target, as shown in the test. On the initiator side the CPU load will be a little higher. RDMA, as an offloaded technology, requires fairly minimal CPU utilization. NVMe over TCP still uses the TCP stack in the kernel to do all the work, so the CPU still plays an important role. Also, the test was done with a high-end Intel® Xeon® processor with very powerful processing capability; if a processor with less processing power were used, CPU utilization would be higher.

Q. 1. What should be the ideal encapsulated data (inline data) size for best performance in a real-world scenario? 2. How could one optimize buffer copies at the block level in NVMe-oF?

A. 1. There is no simple answer to this question. The impact of encapsulated data size on performance in a real-world scenario is more complicated, as the switch plays a critical role in the whole network. Whether there is a shallow-buffer or deep-buffer switch, switch settings like policy, congestion management, etc. all impact overall performance. 2. There are multiple explorations into improving NVMe-oF performance by reducing or optimizing buffer copies. One possible option is to use the Controller Memory Buffer introduced in NVMe Specification 1.2.

Q. Is it possible to combine any of the NVMe-oF technologies with SPDK – user-space processing?

A. SPDK currently supports all of these Ethernet-based transports: iWARP, RoCEv2 and TCP.

Q. You indicated that TCP is non-offloaded, but doesn’t it still use the ‘pseudo-standard’ offloads like Checksum, LSO, RSS, etc?  It just doesn’t have the entire TCP stack offloaded?

A. Yes, stateless offloads are supported and used.

Q. What is the real idea in using 4 different SSDs? Why didn’t you use 6 or 8 or 10? What is the message you are trying to relay? I understand that SSD1 is higher/better performing than SSD2.

A. We used a six-SSD LUN for both SSD-1 and SSD-2. We compared higher-performance, lower-capacity Optane to lower-performance, higher-capacity 3D NAND NVMe. Note the 3D NAND NVMe is 10X the capacity of the Optane.

Q. It looks like one of the key takeaways is that SSD specs matter. Can you explain (without naming brands) the main differences between SSD-1 and SSD-2?

A. Manufacturer specs are only a starting point; actual performance depends on the workload. Large differences are seen for small-block random write workloads and large-block sequential read workloads.

Q. What is the impact to the host CPU and memory during the tests? I am wondering what minimum CPU and memory are necessary to achieve peak NVMe-oF performance, which would help describe how much application workload one might be able to achieve.

A. The test did not limit CPU cores or memory to find the minimal configuration that achieves peak NVMe-oF performance. This might be an interesting topic to cover in a future presentation. (We measured target server CPU usage, not host/initiator CPU usage.)

Q. Did you let the tests run for 2 hours and then take results? (basically, warm up the cache/SSD characterization)?

A. We precondition with the TC/QD sweep test and then run the remaining three tests back to back to take advantage of the preconditioning done in the first test.

Q. How do you check outstanding IOs?

A. We use OIO = TC x QD in the test settings and populate each thread with QD jobs. We do not look at in-flight OIO, but wait for all OIOs to complete and measure response times.
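As a worked example of that relationship, here is a small Python sketch; the TC/QD pairs shown are illustrative only, not the actual sweep settings used in the test.

```python
# Worked example of OIO = TC x QD: total outstanding I/O equals the thread
# count times the queue depth each thread keeps populated. The TC/QD pairs
# below are illustrative, not the actual test matrix.

def outstanding_io(thread_count: int, queue_depth: int) -> int:
    return thread_count * queue_depth

for tc, qd in [(1, 8), (2, 8), (4, 8), (8, 4), (16, 2)]:
    print(f"TC={tc:2d}  QD={qd:2d}  ->  OIO={outstanding_io(tc, qd)}")
```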

Q. Where can we get the performance test specifications as defined by SNIA?

A. You can find the test specification on the SNIA website here.

Q. Have these tests been run using FC-NVMe. If so, how did they fare?

A. We have not yet run these tests using NVMe over Fibre Channel.

Q. What tests did you use? FIO, VDBench, IOZone, or just DD or IOMeter? What was the CPU peak utilization? and what CPUs did you use?

A. The CTS I/O generator, which is similar to fio in that both are based on libaio and test at the block level. Vdbench (Java-based), IOzone and Iometer operate at the file system level. DD is direct but lacks complex scripting. Fio allows complex scripting but not multiple variables per loop, i.e. it requires iterative tests and post-test compilation, vs. CTS, which has multi-variable, multi-loop concurrency.

Q. What test suites did you use for testing?

A. Calypso CTS tests

Q. I heard that iWARP is dead?

A. No, iWARP is not dead. There are multiple Ethernet network adapter vendors supporting iWARP now. The adapter used in the test supports iWARP, RoCEv2 and TCP at the same time.

Q. Can you post some recommendation on the switch setup and congestion?

A. The tests discussed in this presentation used a back-to-back configuration without a switch. We will have a presentation in the near future that takes switch settings into account and will share more information at that time. Don't forget to follow us on Twitter @SNIANSF for dates of upcoming webcasts.
