Hyperscalers Take on NVMe™ Cloud Storage Questions - SNIA on Data, Networking & Storage

Our recent webcast on how the hyperscalers Facebook and Microsoft are working together to merge their SSD requirements generated a lot of interesting questions. If you missed “How Facebook & Microsoft Leverage NVMe Cloud Storage,” you can watch it on-demand. As promised at our live event, here are answers to the questions we received.

Q. How do Facebook and Microsoft see Zoned Namespaces being used?

A. Zoned Namespaces are how we will consume QLC NAND broadly. The ability to write to the NAND sequentially in large increments that lay out nicely on the media allows for very little write amplification in the device.
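To make the sequential-write idea concrete, here is a minimal sketch of a single zone in Python. It is a toy model, not the NVMe Zoned Namespace command set: the `Zone` class, its method names, and the block counts are all hypothetical. It only illustrates the rule that writes must land at the zone's write pointer and that space is reclaimed by resetting a whole zone, which is why the device has little reason to relocate data internally.

```python
class Zone:
    """Toy model of one zone (hypothetical, for illustration only)."""

    def __init__(self, capacity_blocks):
        self.capacity_blocks = capacity_blocks
        self.write_pointer = 0  # next block that may be written
        self.blocks = []

    def write(self, start_block, data_blocks):
        """Accept a write only if it starts exactly at the write pointer."""
        if start_block != self.write_pointer:
            raise ValueError("write must start at the zone's write pointer")
        if self.write_pointer + len(data_blocks) > self.capacity_blocks:
            raise ValueError("write would exceed zone capacity")
        self.blocks.extend(data_blocks)
        self.write_pointer += len(data_blocks)
        return self.write_pointer

    def reset(self):
        """Discard the whole zone at once; there are no in-place overwrites."""
        self.blocks.clear()
        self.write_pointer = 0
```

In this model an attempt to rewrite block 0 after the pointer has advanced raises an error; the host has to reset the zone and write it again from the start.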

Q. How high a priority is firmware malware? Are there automated & remote management methods for detection and fixing at scale?

A. Security in the data center is one of the highest priorities. There are tools to monitor and manage the fleet including firmware checking and updating.

Q. If I understood correctly, the need for NVMe stemmed from the need to communicate at faster speeds with different components in the network. At what speed will NVMe stop benefiting from higher speeds because of the latencies in individual components? Which component is most gating/concerning at this point?

A. In today’s SSDs, the NAND latency dominates. This can be mitigated by adding backend channels to the controller and optimizing data placement across the media. There are applications that connect directly to the CPU, where performance scales very well with PCIe lane speeds and network latencies do not come into play.
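As a rough illustration of why more backend channels help even when per-read NAND latency stays fixed, the sketch below applies Little's law (throughput equals operations in flight divided by latency). The function name and the numbers in the usage note are hypothetical, and the model deliberately ignores channel transfer time, controller limits, and the host interface, so treat it as an upper bound under stated assumptions rather than a performance prediction.

```python
def max_read_iops(channels, dies_per_channel, nand_read_latency_us):
    """Upper bound on read IOPS if every die services one read at a time.

    Assumes reads are spread evenly across all dies and that NAND read
    latency is the only cost (an idealization).
    """
    parallel_reads = channels * dies_per_channel
    return parallel_reads * 1_000_000 / nand_read_latency_us
```

For example, with an assumed 100 µs read latency, going from 8 to 16 channels of 4 dies each doubles the bound while the latency of any single read is unchanged.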

Q. Where does Zipline fit? Does Microsoft expect Azure to default to Zipline at both ends of the Azure network?

A. Microsoft has donated the RTL for the Zipline compression ASIC to Open Compute so that multiple endpoints can take advantage of “bump in the wire” inline compression.

Q. What other protocols exist that are competing with NVMe? What are the pros and cons for these to be successful?

A. SATA and SAS are the legacy protocols that NVMe was designed to replace. These protocols still have their place in HDD deployments.

Q. Where do you see U.2 form factor for NVMe?

A. Many enterprise solutions use U.2 in their 2U offerings. Hyperscale servers are mostly focused on 1U server form factors, where the compact heights of E1.S and E1.L allow for vertical placement on the front of the server.

Q. Is E1.L form factor too big (32 drives) for failure domain in a single node as a storage target?

A. E1.L allows for very high density storage. The storage application must take into account the possibility of device failure via redundancy (mirroring, erasure coding, etc.) and rapid rebuild. In the future, the ability for the SSD to slowly lose capacity over time will be required.
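As a small illustration of the redundancy idea, the sketch below uses single XOR parity, the simplest form of erasure coding, to rebuild the contents of one failed device from the survivors. The function names are hypothetical, and production systems typically use stronger codes (for example Reed-Solomon) that tolerate more than one failure; this is only a sketch of the principle.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute one parity block over the data blocks."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])
```

If three drives hold `b"abcd"`, `b"efgh"`, and `b"ijkl"` and the second one fails, XOR-ing the two survivors with the parity block returns `b"efgh"`.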

Q. What have been the biggest pain points in using NVMe SSDs since inception/adoption, especially since Microsoft and Facebook started using them?

A. As discussed in the live Q&A, in the early days of NVMe the lack of standard drivers for both Windows and Linux hampered adoption. This has since been resolved with standard in-box driver offerings.

Q. Has FB or Microsoft considered allowing drives to lose data if they lose power on an edge server? If the server is rebuilt on a power down, this can reduce SSD costs.

A. There are certainly interesting use cases where Power Loss Protection is not needed.

Q. Do Zoned Namespaces make the Denali spec obsolete, or cause Microsoft to drop it? How do they impact or compete with Open Channel initiatives by Facebook?

A. Zoned Namespaces incorporate probably 75% of the Denali functionality in an NVMe standardized way.

Q. How stable are NVMe PCIe hot plug devices (unmanaged hot plug)?

A. Quite stable.

Q. How do you see Ethernet SSDs impacting cloud storage adoption?

A. It is not yet clear whether Ethernet is the right connection mechanism for storage disaggregation. CXL is becoming interesting.

Q. Thoughts on E3? What problems are being solved with E3?

A. E3 is meant more for 2U servers.

Q. ZNS has a lot of QoS implications as we load up so many dies on the E1.L form factor. Given that challenge, how does ZNS address regular cloud performance requirements?

A. With QLC, the end-to-end system needs to be designed to meet the application’s requirements. This is not limited to the ZNS device itself, but needs to take into account the entire system.

If you’re looking for more resources on any of the topics addressed in this blog, check out the SNIA Educational Library where you’ll find over 2,000 vendor-neutral presentations, white papers, videos, technical specifications, webcasts and more.
