The SNIA Networking Storage Forum celebrated St. Patrick’s Day by hosting a live webcast, “Ethernet-attached SSDs – Brilliant Idea or Storage Silliness?” Even though we didn’t serve green beer during the event, the response was impressive, with hundreds of live attendees who asked many great questions – 25 to be exact. Our expert presenters have answered them all here:
Q. Has a prototype drive been built today that includes the Ethernet controller inside the NVMe SSD?
A. There is an interposer board that extends the drive length by a small amount. Integrated functionality will come with volume and a business case. Some SSD vendors have plans to offer SSDs with fully integrated Ethernet controllers.
Q. Costs seem to be the initial concern… is this a true apples-to-apples comparison with a JBOF?
A. The difference comes down to a PCIe switch versus an Ethernet switch. Ethernet switches usually cost more but provide more bandwidth than PCIe switches. An EBOF might cost more than a JBOF with the same number of SSDs and the same capacity, but the EBOF is likely to provide more performance than the JBOF.
Q. What are the specification names and numbers? Which standards groups are involved?
A. The Native NVMe-oF Drive Specification from the SNIA is the primary specification. Within that specification, multiple other standards are referenced from SFF, NVMe, and DMTF.
Q. How is this different from the “Kinetic” object storage effort of a few years ago? Is there any true production-quality open source available or planned? If so, when, by whom, and where?
A. Kinetic drives were hard disks and thus did not need high speed Ethernet. In fact, new lower-speed Ethernet was developed for this case. The pins chosen for Kinetic would not accommodate the higher Ethernet speeds that SSDs need, so the new standard re-uses the same lanes defined for PCIe for use by Ethernet. Kinetic was a brand-new protocol and application interface rather than leveraging an existing standard interface such as NVMe-oF.
Q. Can Open-Channel SSDs be used in an EBOF?
A. To the extent that Open-Channel can work over NVMe-oF, it should work.
Q. Define the Signal Integrity challenges of routing Ethernet at these speeds compared to PCIe.
A. The signal integrity of the SFF-8639 connector is considered good through 25Gb Ethernet. The SFF-TA-1002 connector has been tested to 50Gb speeds with good signal integrity and may go higher. Ethernet is able to carry data with good signal integrity much farther than a PCIe connection of similar speed.
Q. Is there a way to expose Intel Optane DC Persistent Memory through NVMe-oF?
A. For now, it would need to be a block-based NVMe device. Byte addressability might be available in the future.
Q. Will there be an interposer to send block IO directly over the switch?
A. For the Ethernet Drive itself, there is a dongle available for standard PCIe SSDs to become an Ethernet Drive that supports block IO over NVMe-oF.
Q. Do NVMe drives fail? Where is HA implemented? I never saw VROC from Intel adopted. So, does the user add latency when adding their own HA?
A. Drive reliability is not impacted by the fact that it uses Ethernet. HA can be implemented by dual-port versions of Ethernet drives; dual-port dongles are available today. For host- or network-based data protection, the fact that Ethernet Drives can act as a secondary location for multiple hosts makes data protection easier.
Q. Ethernet is a contention protocol and TCP has overhead to deliver reliability. Is there any work going on to package something like Fibre Channel/QUIC or other solutions to eliminate the downsides of Ethernet and TCP?
A. FC-NVMe has been approved as a standard since 2017 and is available and maturing as a solution. NVMe-oF on Ethernet can run on RoCE or TCP with the option to use lossless Ethernet and/or congestion management to reduce contention, or to use accelerator NICs to reduce TCP overhead. QUIC is growing in popularity for web traffic but it’s not clear yet if QUIC will prove popular for storage traffic.
Q. Are Lenovo or other OEMs building standard EBOF storage servers? Does OCP have a work group on EBOF supporting hardware architecture and specifications?
A. Currently, Lenovo does not offer an EBOF. However, many ODMs are offering JBOFs and a few are offering EBOFs. OCP is currently focusing on NVMe SSD specifics, including form factor. While several JBOFs have been introduced into OCP, we are not aware of an OCP EBOF specification per se. There are OCP initiatives to optimize the form factors of SSDs and there are also OCP storage designs for JBOF that could probably evolve into an Ethernet SSD enclosure with minimal changes.
Q. Is this an accurate statement on SAS latency? Where are you getting and quoting your data?
A. SAS is a transaction model, meaning the preceding transaction must complete before the next transaction can be started (QD does ameliorate this to some degree but end points still have to wait). With the initiator and target having to wait for the steps to complete, overall throughput slows. SAS HDD = milliseconds per IO (governed by seek and rotation); SAS SSD = 100s of microseconds (governed by transaction nature); NVMe SSD = 10s of microseconds (governed by queuing paradigm).
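To make the orders of magnitude concrete, here is a minimal, illustrative sketch. The latency values are rough mid-points of the ranges quoted above, not measurements; the point is simply that inverting per-command latency shows how few IOs a single outstanding command stream can complete per second, which is why NVMe’s deep queuing model matters.

```python
# Illustrative only: invert approximate per-command latency to get the IO rate
# achievable by a single outstanding command (queue depth 1).
LATENCIES_S = {
    "SAS HDD (seek + rotation)":   5e-3,    # milliseconds per IO
    "SAS SSD (transaction model)": 200e-6,  # hundreds of microseconds per IO
    "NVMe SSD (queued model)":     20e-6,   # tens of microseconds per IO
}

for device, latency in LATENCIES_S.items():
    print(f"{device}: ~{1 / latency:,.0f} IOs per second per outstanding command")
```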
Q. Regarding performance & scaling, a 50GbE has less bandwidth than a PCIe Gen3 x4 connection. How is converting to Ethernet helping performance of the array? Doesn’t it face the same bottleneck of the NICs connecting the JBOF/eBOF to the rest of the network?
A. It eliminates the JBOF’s CPU and NIC(s) from the data path and replaces them with an Ethernet switch. The math: one 50GbE port provides roughly 5 GB/s of usable bandwidth, while one PCIe Gen3 x4 port provides roughly 4 GB/s, because PCIe Gen3 delivers about 8 Gb/s per lane. That is why a single 25GbE NIC is usually connected to 4 lanes of PCIe Gen3 and a single 50GbE NIC is usually connected to 8 lanes of PCIe Gen3 (or 4 lanes of PCIe Gen4). But that is only half of the story; there are two other dimensions to consider. First, getting all this bandwidth out of a JBOF versus an EBOF. Second, at the solution level, all these ports (connectivity) and scaling (bandwidth) present their own challenges.
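For readers who want to reproduce the arithmetic, here is a rough back-of-the-envelope sketch. The per-lane and per-port figures are nominal link rates; real throughput will be somewhat lower once encoding and protocol overheads are included (which is why the answer above quotes about 5 GB/s usable for 50GbE).

```python
# Rough link-bandwidth comparison in GB/s (nominal figures, ignoring protocol overhead).
BITS_PER_BYTE = 8
PCIE_GEN3_GBIT_PER_LANE = 8  # approximate usable rate per PCIe Gen3 lane

def pcie_gen3_gb_per_s(lanes: int) -> float:
    """Approximate usable bandwidth of a PCIe Gen3 link in GB/s."""
    return lanes * PCIE_GEN3_GBIT_PER_LANE / BITS_PER_BYTE

def ethernet_gb_per_s(gbe_speed: int) -> float:
    """Nominal bandwidth of an Ethernet port in GB/s."""
    return gbe_speed / BITS_PER_BYTE

print(f"PCIe Gen3 x4: ~{pcie_gen3_gb_per_s(4):.2f} GB/s")  # ~4 GB/s
print(f"PCIe Gen3 x8: ~{pcie_gen3_gb_per_s(8):.2f} GB/s")  # ~8 GB/s
print(f"25GbE:        ~{ethernet_gb_per_s(25):.2f} GB/s")  # ~3.1 GB/s
print(f"50GbE:        ~{ethernet_gb_per_s(50):.2f} GB/s")  # ~6.25 GB/s nominal
```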
Q. What about Persistent Memory? Can you present Optane DC through NVMe-oF?
A. Interesting idea! Today persistent memory DIMMs sit on the memory bus, so they would not benefit directly from an Ethernet architecture. But with the advent of CXL and PCIe Gen 5, there may be a place for persistent memory in “bays” for a more NUMA-like architecture.
Q. For those of us that use Ceph, this might be an interesting vertical integration, but it feels like there’s more to the latency of “finding” and “balancing” the data on arrays of Ethernet-attached NVMe. Have any software suites been released to accompany these hardware changes, and are whitepapers published?
A. Ceph nodes are generally slower (for like-to-like HW) than non-Ceph storage solutions, so Ceph might be less likely to benefit from Ethernet SSDs, especially NVMe-oF SSDs. That said, if the cost model for ESSDs works out (really cheap Ceph nodes to overcome “throwing HW at the problem”), one could look at Ceph solutions using ESSDs, either via NVMe-oF or by creating ESSDs with a key-value interface that can be accessed directly by Ceph.
Q. Can the traditional array functions be moved to the LAN switch layer, either included in the switch (like the Cisco MDS and IBM SVC “experiment”) or by connecting the controller functionality to the LAN switch backbone with the SSDs in a separate VLAN?
A. Many storage functions are software/firmware driven. Certainly, a LAN switch with a rich X86 complex could do this…or…a server with a switch subsystem could. I can see low level storage functions (RAID XOR, compression, maybe snapshots) translated to switch HW, but I don’t see a clear path for high level functions (dedupe, replication, etc) translated to switch HW. However, since hyperscale does not perform many high-level storage functions at the storage node, perhaps enough can be moved to switch HW over time.
Q. ATA over Ethernet has been working for nearly 18 years now. What is the difference?
A. ATA over Ethernet is more of a work group concept and has never gone mainstream (to be honest, your question is the first time I have heard of it since 2001). In any event, ATA does not take advantage of the queuing nature of NVMe, so it is still held hostage by transaction latency. Also, there is no high availability (HA) in ATA (at least I am not aware of any HA standards for ATA), which presents a challenge because HA at the box or storage controller level does NOT solve the SPOF problem at the drive level.
Q. Request for comment – Ethernet 10G, 25G, 50G, 100G per lane (all available today), and Ethernet MAC speeds of 10G, 25G, 40G, 50G, 100G, 200G, 400G (all available today), Ethernet is far more scalable compared to PCIe. Comparing Ethernet Switch relative cost to PCIe switch, Ethernet Switch is far more economical. Why shouldn’t we switch?
A. Yes, Ethernet is more scalable than PCIe, but three things need to happen. 1) Solution-level orchestration has to happen (putting an EBOF behind an RBOF is okay, but only a first step); 2) The Ethernet world has to start understanding how storage works (multipathing, ACLs, baseband drive management, etc.); 3) Lower cost needs to be proven; the jury is still out on cost (on paper it’s a no-brainer, but the cost of the Ethernet switch in the I/O Module can rival an X86 complex). Note that Ethernet with 100Gb/s per lane is not yet broadly available as of Q2 2020.
Q. We’ve seen issues with single network infrastructure from an availability perspective. Why would anyone put their business at risk in this manner? Second question is how will this work with multiple vendor hosts or drive vendors, each having different specifications?
A. Customers already connect their traditional storage arrays to either single or dual fabrics, depending on their need for redundancy, and an Ethernet drive can do the same, so there is no rule that an Ethernet SSD must rely on a single network infrastructure. Some large cloud customers use data protection and recovery at the application level that spans multiple drives (or multiple EBOFs), providing high levels of data availability without needing dual fabric connections to every JBOF or to every Ethernet drive. For the second part of the question, it seems likely that all the Ethernet drives will support a standard Ethernet interface and most of them will support the NVMe-oF standard, so multiple host and drive vendors will interoperate using the same specifications. This has already been happening through UNH plugfests at the NIC/switch level. Areas where Ethernet SSDs might use different specifications include a key-value or object interface, computational storage APIs, and management tools (if the host or drive maker doesn’t follow one of the emerging SNIA specifications).
Q. Will there be a Plugfest or certification test for Ethernet SSDs?
A. Those Ethernet SSDs that use the NVMe-oF interface will be able to join the existing UNH IOL plugfests for NVMe-oF. Whether there are plugfests for any other aspects of Ethernet SSDs–such as key-value or computational storage APIs–likely depends on how many customers want to use those aspects and how many SSD vendors support them.
Q. Do you anticipate any issues with mixing control (Redfish/Swordfish) and data over the same ports?
A. No, it should be fine to run control and data over the same Ethernet ports. The only reason to run management outside of the data connection would be to diagnose or power cycle an SSD that is still alive but not responding on its Ethernet interface. If out-of-band management of power connections is required, it could be done with a separate management Ethernet connection to the EBOF enclosure.
Q. We will require more switch ports; does that mean more investment? Also, how is the management of Ethernet SSDs done?
A. Deploying Ethernet SSDs will require more Ethernet switch ports, though it will likely decrease the needed number of other switch or repeater ports (PCIe, SAS, Fibre Channel, InfiniBand, etc.). Also, there are models showing that Ethernet SSDs have certain cost advantages over traditional storage arrays even after including the cost of the additional Ethernet switch ports. Management of the Ethernet SSDs can be done via standard Ethernet mechanisms (such as SNMP), through NVMe commands (for NVMe-oF SSDs), and through the evolving DMTF Redfish/SNIA Swordfish management frameworks mentioned by Mark Carlson during the webcast. You can find more information on SNIA Swordfish here.
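As a rough illustration of what Redfish/Swordfish-based management can look like, here is a minimal Python sketch that walks an enclosure’s chassis collection and prints health status. The enclosure address and credentials are placeholders, and the exact resources exposed (and whether drive health appears under the chassis collection) depend on the vendor’s Redfish/Swordfish implementation.

```python
# Minimal sketch of polling an EBOF's Redfish/Swordfish service (illustrative only).
import requests

BASE = "https://ebof.example.com"   # hypothetical enclosure management address
AUTH = ("admin", "password")        # placeholder credentials

def get(path: str) -> dict:
    """Fetch a Redfish resource and return its JSON body."""
    resp = requests.get(BASE + path, auth=AUTH, verify=False, timeout=10)
    resp.raise_for_status()
    return resp.json()

service_root = get("/redfish/v1/")                       # standard Redfish service root
chassis_coll = get(service_root["Chassis"]["@odata.id"])  # chassis collection
for member in chassis_coll["Members"]:
    chassis = get(member["@odata.id"])
    health = chassis.get("Status", {}).get("Health", "Unknown")
    print(f"{chassis.get('Id', '?')}: {health}")
```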
Q. Is it assumed that Ethernet connected SSDs need to implement/support congestion control management, especially for cases of overprovision in EBOF (i.e. EBOF bandwidth is less than sum of the underlying SSDs under it)? If so – is that standardized?
A. Yes, but both NVMe/TCP and NVMe/RoCE protocols have congestion management as part of the protocol, so it is baked in. The eSSDs can connect to either a switch inside the EBOF enclosure or to an external Top-of-Rack (ToR) switch. That Ethernet switch may or may not be oversubscribed, but either way the protocol-based congestion management on the individual Ethernet SSDs will kick in if needed. But if the application does not access all the eSSDs in the enclosure at the same time, the aggregate throughput from the SSDs being used might not exceed the throughput of the switch. If most or all of the SSDs in the enclosure will be accessed simultaneously, then it could make sense to use a non-blocking switch (that will not be oversubscribed) or rely on the protocol congestion management.
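To make the oversubscription point concrete, here is a simple sketch with hypothetical numbers (24 SSDs at 25GbE each behind 100GbE uplinks); the enclosure sizes are assumptions chosen only to show when the protocol-level congestion management would come into play.

```python
# Hypothetical EBOF oversubscription check: aggregate SSD-facing bandwidth vs. uplink bandwidth.
def oversubscription_ratio(num_ssds: int, ssd_gbe: int,
                           num_uplinks: int, uplink_gbe: int) -> float:
    """Ratio > 1.0 means the SSDs can offer more bandwidth than the uplinks can carry."""
    return (num_ssds * ssd_gbe) / (num_uplinks * uplink_gbe)

print(oversubscription_ratio(24, 25, 6, 100))  # 1.0 -> non-blocking
print(oversubscription_ratio(24, 25, 4, 100))  # 1.5 -> relies on congestion management
```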
Q. Are the industry/standards groups developing application protocols (OSI layers 5 through 7) to allow customers to use existing OS/apps without modification? If so, when will these be available and via what delivery to the market, such as a new IETF application protocol, consortium, etc.?
A. Applications that can directly use individual SSDs can access an NVMe-oF Ethernet SSD directly as block storage, without modification and without using any other protocols. There are also software-defined storage solutions that already manage and virtualize access to NVMe-oF arrays, and they could be modified to allow applications to access multiple Ethernet SSDs without modifications to the applications. At higher levels of the stack, the computational storage standard under development within SNIA or a key-value storage API could be other solutions to allow applications to access Ethernet SSDs, though in some cases the applications might need to be modified to support the new computational storage and/or key-value APIs.
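As a small illustration of the first point: once the host has connected to an NVMe-oF Ethernet SSD and its namespace shows up as a local block device, an unmodified application can read it with ordinary block I/O. The device path below is an assumption for illustration only.

```python
# Minimal sketch: read the first 4 KiB of an NVMe-oF namespace that the host has
# already connected and exposed as a block device (device path is hypothetical).
import os

DEVICE = "/dev/nvme1n1"   # hypothetical namespace created by an NVMe-oF connect
BLOCK_SIZE = 4096

fd = os.open(DEVICE, os.O_RDONLY)
try:
    first_block = os.pread(fd, BLOCK_SIZE, 0)   # read BLOCK_SIZE bytes at offset 0
    print(f"Read {len(first_block)} bytes from {DEVICE}")
finally:
    os.close(fd)
```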
Q. In an eSSD implementation, what system element implements advanced features like data streaming and IO determinism? Maybe a better question is: does the standard support this at the drive level?
A. Any features such as these that are already part of NVMe will work on Ethernet drives.