In our most recent SNIA Networking Storage Forum (NSF) webcast, Extending RDMA for Persistent Memory over Fabrics, our expert speakers, Tony Hurson and Rob Davis, outlined extensions to RDMA protocols that confirm persistence and can additionally order successive writes to different memories within the target system. Hundreds of people have seen the webcast and have given it a 4.8 rating on a scale of 1-5! If you missed it, you can watch it on-demand at your convenience. The webcast slides are also available for download.
We had several interesting questions during the live event. Here are answers from our presenters:
Q. For the RDMA Message Extensions, does the client have to qualify a WRITE completion with only Atomic Write Response and not with Commit Response?
A. If an Atomic Write must be confirmed persistent, it must be followed by an additional Commit Request. Built-in confirmation of persistence was dropped from the Atomic Request because it adds latency and is not needed for some application streams.
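That split between delivery and persistence can be sketched in a short simulation. This is an illustrative toy model, not the IBTA wire protocol: the class and message names (`SimulatedTarget`, `"commit-response"`, and so on) are invented here. The point it demonstrates is that an Atomic Write alone lands data in the target's (possibly volatile) write path, and only the subsequent Commit Response confirms persistence.

```python
# Toy model: Atomic Write places data in the target's write path;
# a separate Commit is what makes persistence observable to the writer.
class SimulatedTarget:
    def __init__(self):
        self.write_path = {}   # accepted by the RNIC, not yet persistent
        self.persistent = {}   # guaranteed to survive power loss

    def atomic_write(self, addr, value):
        self.write_path[addr] = value
        return "ack"           # delivery ack only, no persistence guarantee

    def commit(self):
        self.persistent.update(self.write_path)  # flush all prior writes
        self.write_path.clear()
        return "commit-response"                 # persistence now confirmed

target = SimulatedTarget()
target.atomic_write(0x1000, b"log record")
assert 0x1000 not in target.persistent   # written, but not yet persistent
resp = target.commit()
assert resp == "commit-response" and target.persistent[0x1000] == b"log record"
```

This also shows why the built-in confirmation was dropped from the Atomic Request itself: streams that do not need persistence per write can skip `commit()` entirely and avoid its round-trip latency.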
Q. Why do you need confirmation for writes? From my point of view, the only thing required is ordering.
A. Agreed, but only if the entire target system is non-volatile! Explicit confirmation of persistence is required to cover the “gap” between the Write completing in the network and the data reaching persistence at the target.
Q. Where are these messages being generated? Does NIC know when the data is flushed or committed?
A. They are generated by the application that has reserved the memory window on the remote node. It can write using RDMA writes to that window all it wants, but to guarantee persistence it must send a flush.
Q. How is RPM presented on the client host?
A. The application using it sees it as memory it can read and write.
Q. Does this RDMA Commit Response implicitly ACK any previous RDMA Sends/Writes to the same or a different MR?
A. Yes, the new Commit (and Verify and Atomic Write) Responses have the same acknowledgement coalescing properties as the existing Read Response. That is, a Commit Response is explicit (non-coalesced); but it coalesces/implies acknowledgement of prior Write and/or Send Requests.
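The coalescing rule above can be sketched as follows. This is a simplification (the sequence numbering stands in for RC-QP PSN handling, and the class is hypothetical), but it captures the behavior: a response with sequence number N retires that request and every earlier outstanding Write or Send on the connection.

```python
# Sketch of coalesced acknowledgement: one explicit response
# implicitly completes all earlier outstanding requests.
class Requester:
    def __init__(self):
        self.outstanding = {}   # psn -> request type

    def post(self, psn, kind):
        self.outstanding[psn] = kind

    def on_response(self, psn):
        # Coalesced ack: this request and all earlier ones complete.
        acked = sorted(p for p in self.outstanding if p <= psn)
        for p in acked:
            del self.outstanding[p]
        return acked

q = Requester()
q.post(1, "RDMA Write")
q.post(2, "RDMA Write")
q.post(3, "Commit")
assert q.on_response(3) == [1, 2, 3]   # Commit Response acks prior Writes
assert q.outstanding == {}
```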
Q. Does this one still have the current RDMA Write ACK?
A. See previous general answer. Yes. A Commit Response implicitly acknowledges prior Writes.
Q. With respect to the race hazard explained to show the need for an explicit completion response, wouldn’t this be the case even with volatile memory, if the data were to be stored in volatile memory? Why is this completion status required only in the non-volatile case?
A. Most networked applications that write over the network to volatile memory do not require explicit confirmation at the writer endpoint that the data has actually arrived; when they do, additional handshake messages are usually exchanged between the endpoint applications. A writer to PERSISTENT memory across a network, on the other hand, almost always needs assurance that the data has reached persistence, hence the new extension.
Q. What if you are using multiple RNIC with multiple ports to multiple ports on a 100Gb fabric for server-to-server RDMA? How is order kept there…by CPU software or ‘NIC teaming plus’?
A. This would depend on the RNIC vendor and their implementation.
Q. What is the time frame for these new RDMA messages to be available in verbs API?
A. This depends on the IBTA standards approval process, which is not completely predictable, but roughly sometime in the first half of 2019.
Q. Where could I find more details about the three new verbs (what are the arguments)?
A. Please poll/contact/Google the IBTA and IETF organizations towards the end of calendar year 2018, when first drafts of the extension documents are expected to be available.
Q. Do you see this technology used in a way similar to Hyperconverged systems now use storage or could you see this used as a large shared memory subsystem in the network?
A. High-speed persistent memory, in either NVDIMM or SSD form factor, has enormous potential for speeding up hyperconverged write replication. It will, however, require a substantial rewrite of such storage stacks, moving, for example, from traditional three-phase block storage protocols (command/data/response) to an RDMA write/confirm model. More generally, the RDMA extensions are useful for distributed shared PERSISTENT memory applications.
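The contrast between the two models can be made concrete with a schematic message count. The message names below are invented for illustration and do not correspond to any specific wire protocol; the point is that the write/confirm model streams writes without per-I/O replies and pays for a single persistence confirmation per batch.

```python
# Schematic comparison: per-I/O three-phase exchange vs. streamed
# writes with one trailing confirmation of persistence.
def block_storage_replicate(payloads):
    msgs = []
    for p in payloads:
        msgs += ["command", "data:" + p, "response"]  # 3 messages per I/O
    return msgs

def rdma_write_confirm_replicate(payloads):
    msgs = ["write:" + p for p in payloads]  # writes stream, no replies
    msgs.append("commit")                    # one persistence request
    msgs.append("commit-response")           # one confirmation for the batch
    return msgs

a = block_storage_replicate(["x", "y", "z"])
b = rdma_write_confirm_replicate(["x", "y", "z"])
assert len(a) == 9 and len(b) == 5   # fewer messages for the same batch
```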
Q. What would be the most useful performance metrics to debug performance issues in such environments?
A. Within the RNIC, basic counts for the new message types would be a baseline. These plus total stall times encountered by the RNIC awaiting Commit Responses from the local CPU subsystem would be useful. Within the CPU platform basic counts of device write and read requests targeting persistent memory would be useful.
Q. Do all RDMA NICs have to update their firmware to support these new verbs? What is the expected performance improvement with the new Commit message?
A. Both answers would depend on the RNIC vendor and their implementation.
Q. Will the three new verbs be implemented in the RNIC alone, or will they require changes in other places (processor, memory controllers, etc.)?
A. The new Commit request requires the CPU platform and its memory controllers to confirm that prior write data has reached persistence. The new Atomic Write and Verify messages, however, may be executed entirely within the RNIC.
Q. What about the future of NVMe over TCP – this would be much simpler for people to implement. Is this a good option?
A. Again, this would depend on the NIC vendor and their implementation. Different vendors have run various performance tests; readers are encouraged to do their own due diligence.
Q. In regard to the race hazard vs. latency requirements, would it be an idea to fit the RNIC with non-volatile memory, so that it can return a solid ack straight away?
A. Yes, good idea! One can imagine particular target architectures where confirmation of persistence occurs with the simple RDMA Write’s ACK alone. In such cases, one can picture a remote client software option where, knowing these target properties, the additional Commit message is omitted. The new extension remains necessary, however, for targets based on today’s general-purpose servers, where persistent memory is physically distributed and, to varying degrees, physically removed from the local network card.