The SNIA Networking Storage Forum recently hosted another webcast in our Great Storage Debate webcast series. This time, our SNIA experts debated three competing visions about how storage should be done: Hyperconverged Infrastructure (HCI), Disaggregated Storage, and Centralized Storage. If you missed the live event, it’s available on-demand. Questions from the webcast attendees made the panel debate quite lively. As promised, here are answers to those questions.
Q. Can you imagine a realistic scenario where the different storage types are used as storage tiers? How interoperable are they?
A. Most HCI solutions already have a tiering/caching structure built-in. However, a user could use HCI for hot to warm data, and also tier less frequently accessed data out to a separate backup/archive. Some of the HCI solutions have close partnerships with backup/archive vendor solutions just for this purpose.
Q. Is there a possibility that two or more classifications of storage can co-exist or be deployed together? Examples, please?
A. Often IT organizations have multiple types of storage deployed in their data centers, particularly over time with various types of legacy systems. Also, HCI solutions that support iSCSI can interface with these legacy systems to enable better sharing of data to avoid silos.
Q. Does Hyperconverged (HCI) use primarily object storage with erasure coding for managing the distributed storage, such as VMware vSAN for VxRail (from Dell)?
A. That is accurate for VMware vSAN, but other HCI solutions are not necessarily object based. Even if object-based, the object interface is rarely exposed. Erasure coding is a common method of distributing the data across the cluster for increased durability with efficient space sharing.
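To make the space-efficiency point concrete, here is a minimal Python sketch (illustrative only, not any particular vendor’s implementation) comparing the raw-capacity overhead of N-way replication with a k+m erasure code; in the 3-copy vs. 4+2 case both tolerate two failures:

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes consumed per byte of user data with N-way replication."""
    return float(copies)

def erasure_coding_overhead(k: int, m: int) -> float:
    """Raw bytes consumed per byte of user data with a k+m erasure code
    (k data fragments plus m parity fragments spread across nodes)."""
    return (k + m) / k

# 3-way replication and a 4+2 erasure code both survive two failures,
# but the erasure code cuts the raw-capacity overhead in half.
print(replication_overhead(3))        # 3.0
print(erasure_coding_overhead(4, 2))  # 1.5
```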
Q. How would you classify HPC deployments, given that they rely more on distributed file systems and converged storage? Do they need a new classification?
A. Often HPC storage is deployed on large, distributed file systems (e.g. Lustre), which I would classify as distributed, scale-out storage, but not hyperconverged, as the compute is still on separate servers.
Q. A lot of HCI solutions already allow heterogeneous nodes within a cluster. What about these “new” disaggregated HCI solutions that use “traditional” storage arrays in the solution (thus not using a software-defined storage solution)? Doesn’t that sound like a step backward? It seems most of the innovation comes from the software.
A. The solutions marketed as disaggregated HCI are not using a traditional HCI design. They are traditional servers and storage combined in a chassis. This would meet the definition of converged, but not hyperconverged.
Q. Why is HCI growing so quickly and seems so popular of late? It seems to be one of the fastest growing “data storage” use cases.
A. HCI has many advantages, as I shared in the slides up front. The #1 reason for the growth and popularity is the ease of deployment and management. Any IT person who is familiar with deploying and managing a VM can now easily deploy and manage the storage with the VM. No specialized storage system skillsets are required, which makes better use of limited IT staff resources and reduces OpEx.
Q. Where do you categorize newer deployments like Vast Data? Is that considered NAS since it presents as NFS and CIFS?
A. I would categorize Vast Data as scale-out, software-defined storage. HCI is also a type of scale-out, software-defined storage, but with compute as well, so that is the key difference.
Q. So what happens when HCI works with ANY storage, including centralized solutions? What is HCI then?
A. I believe this question is referencing iSCSI interface support. HCI solutions that support iSCSI can interface with other types of storage systems to enable better sharing of data to avoid silos.
Q. With NVMe/RoCE becoming more available, delivering DAS-like performance while massively reducing CPU usage on the hosts and potentially saving license costs (we are only in the pilot phase), does the ball swing back towards disaggregated?
A. I’m not sure I fully understand the question, but RDMA can be used to streamline the inter-node traffic across the HCI cluster. Network performance becomes more critical as the size of the cluster, and therefore the traffic between nodes, increases, and RDMA can reduce any network bottlenecks. RoCEv2 is popular, and some HCI solutions also support iWARP. Therefore, as HCI solutions adopt RDMA, the ability to support RDMA is not by itself a driver toward disaggregation.
Q. HCI was initially targeted at SMB and had difficulty scaling beyond 16 nodes. Why would HCI be the choice for large scale enterprise implementations?
A. HCI has proven itself capable of running a broad range of workloads in small to large data center environments at this point. Each HCI solution can scale to a different number of nodes, but usage data shows that single clusters rarely exceed about 12 nodes, at which point users start a new cluster. The reasons are mixed: concerns about the size of failure domains and departmental or remote-site deployment size requirements play a part, but often it’s the software license fees for the applications running on the HCI infrastructure that limit typical cluster sizes in practice. As a result, large enterprises often implement HCI in multiple clusters.
Q. SPC (Storage Performance Council) benchmarks are still the gold standard (maybe?), and my understanding is they typically use an FC SAN. Is that changing? I understand that the underlying hardware is what determines performance, but I’m not aware of SPC benchmarks using anything other than a SAN.
A. Myriad benchmarks are used to measure HCI performance across a cluster. I/O benchmarks that are variants of fio are commonly used to measure storage performance, and compute performance is often measured using other benchmarks, such as TPC benchmarks for database performance, Login VSI for VDI performance, etc.
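As an illustration of the kind of fio-based storage test mentioned above, a 4K random-read job might be launched like this (a sketch; the target path, size, and queue depth are assumptions you would tune for your own cluster):

```python
import subprocess

# Hypothetical 4K random-read microbenchmark; adjust the filename, size,
# and queue depth for your environment. All flags are standard fio options.
cmd = [
    "fio",
    "--name=randread",
    "--ioengine=libaio",   # Linux asynchronous I/O engine
    "--rw=randread",       # random reads
    "--bs=4k",             # 4 KiB block size
    "--iodepth=32",
    "--numjobs=4",
    "--size=1G",
    "--runtime=60",
    "--time_based",
    "--group_reporting",
    "--filename=/mnt/hci-datastore/fio.test",  # assumed datastore mount
]
subprocess.run(cmd, check=True)
```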
Q. What is the current implementation mix ratio in the industry? What is the long-term projected mix ratio?
A. Today the enterprise is dominated by centralized storage with HCI in second place and growing more rapidly. Large cloud service providers and hyperscalers are dominated by disaggregated storage, but also use some centralized storage and some have their own customized HCI implementations for specific workloads. HPC and AI customers use a mix of disaggregated and centralized storage. In the long-term, it’s possible that disaggregated will have the largest overall share since cloud storage is growing the most, with centralized storage and HCI splitting the rest.
Q. Is the latency high for HCI vs. disaggregated vs. centralized?
A. It depends on the implementation. HCI and disaggregated might have slightly higher latency than centralized storage if they distribute writes across nodes before acknowledging them or if they must retrieve reads from multiple nodes. But HCI and disaggregated storage can also be implemented in a way that offers the same latency as centralized.
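As a rough model of why distributing writes before acknowledging them can add latency: with synchronous replication, the acknowledgment waits for the slowest path. A minimal sketch with made-up latency figures:

```python
def sync_write_latency_us(local_write_us: float,
                          replica_rtt_us: list[float],
                          remote_write_us: float) -> float:
    """The write lands on the local node and on each replica in parallel;
    the acknowledgment waits for the slowest path (a simplified model)."""
    replica_paths = [rtt + remote_write_us for rtt in replica_rtt_us]
    return max([local_write_us] + replica_paths)

# Made-up figures: a 20 us local NVMe write, two replicas 15 us and 40 us
# away round-trip, and a 20 us write on each remote node.
print(sync_write_latency_us(20.0, [15.0, 40.0], 20.0))  # 60.0 us
```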
Q. What about GPUDirect?
A. GPUDirect Storage allows GPUs to access storage more directly to reduce latency. Currently it is supported by some types of centralized and disaggregated storage. In the future, it might be supported with HCI as well.
Q. Splitting so many hairs here. Each of the three storage types are more about HOW the storage is consumed by the user/application versus the actual architecture.
A. Yes, that is largely correct, but the storage architecture can also affect how it’s consumed.
Q. Besides technical qualities, is there a financial differentiator between solutions? For example, OpEx and CapEx, ROI?
A. For very large-scale storage implementations, disaggregated generally has the lowest CapEx and OpEx because the higher initial cost of managing distributed storage software is amortized across many nodes and many terabytes. For medium to large implementations, centralized storage usually has the best CapEx and OpEx. For small to medium implementations, HCI usually has the lowest CapEx and OpEx because it’s easy and fast to acquire and deploy. However, it always depends on the specific type of storage and the skill set or expertise of the team managing the storage.
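As a back-of-the-envelope illustration of that crossover, here is a simple five-year cost comparison; every figure below is a hypothetical placeholder, not vendor pricing:

```python
def five_year_tco(capex: float, opex_per_tb_year: float, capacity_tb: float,
                  years: int = 5) -> float:
    """Simplest possible TCO: upfront CapEx plus a yearly per-TB OpEx."""
    return capex + opex_per_tb_year * capacity_tb * years

# Hypothetical placeholders: disaggregated carries a higher fixed cost for
# the distributed storage software but a lower per-TB OpEx, so it only
# pulls ahead at large scale.
for tb in (50, 500, 5000):
    hci = five_year_tco(capex=100_000, opex_per_tb_year=120, capacity_tb=tb)
    disagg = five_year_tco(capex=600_000, opex_per_tb_year=40, capacity_tb=tb)
    print(f"{tb} TB -> HCI: ${hci:,.0f}  Disaggregated: ${disagg:,.0f}")
```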
Q. Why wouldn’t disaggregating storage, compute, and memory be the next trend? The hyperscalers have already done it. What are we waiting for?
A. Disaggregating compute is indeed happening, supported by VMs, containers, and faster network links. However, disaggregating memory across different physical machines is more challenging because even today’s very fast network links have much higher latency than memory. For now, memory disaggregation is largely limited to being done “inside the box” or within one rack with links like PCIe, or to cases where the compute and memory stick together and are disaggregated as a unit.
Q. Storage lends itself as the first choice for disaggregation, as mentioned before. What about disaggregation of other resources (such as networking, GPU, memory) in the future, and how do you believe it will impact the selection of centralized vs. disaggregated storage? Will Ethernet stay the first choice for the fabric for disaggregation?
A. See the above answer about disaggregating memory. Networking can be disaggregated within a rack by using a very low-latency fabric, for example PCIe, but usually networking is used to support disaggregation of other resources. GPUs can be disaggregated but normally still travel with some CPU and memory in the same box, though this could change in the near future. Ethernet will indeed remain the first networking choice for disaggregation, but other network types will also be used (InfiniBand, Fibre Channel, Ethernet with RDMA, etc.)
Q. How does storage tiering fit in the various solutions?
A. All three architectures can support storage tiering, either within one system/node or by using different systems with different performance/capacity/price characteristics. Disaggregated storage might be less likely to use tiering within one system than HCI or centralized storage, but a large deployment could have different disaggregated storage pools and move data between those pools for tiering.
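For intuition, an access-frequency tiering policy of the kind any of the three architectures can implement looks roughly like this (tier names and thresholds are assumptions, not from any particular product):

```python
from dataclasses import dataclass

@dataclass
class Extent:
    name: str
    reads_per_day: float

def choose_tier(extent: Extent) -> str:
    """Keep hot data on fast media, demote cold data to cheaper media.
    Thresholds and tier names are illustrative, not from any product."""
    if extent.reads_per_day > 100:
        return "nvme"            # hot tier
    if extent.reads_per_day > 1:
        return "ssd"             # warm tier
    return "hdd-or-object"       # cold/archive tier

print(choose_tier(Extent("db-log", 5000.0)))    # nvme
print(choose_tier(Extent("old-backup", 0.01)))  # hdd-or-object
```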
Q. With the rapid adoption of computational storage and inline 2:1 compression/dedupe, EDSFF and 500TB to 1PB capable 1U servers, Smart NICs, significant increases in processor performance, and 100/400Gb networking, where do you think traditional centralized storage will compete in the future?
A. These technologies will enable faster, denser, more scalable deployments of both HCI and disaggregated storage, but many of them can also be used to benefit centralized storage. Traditional centralized storage will continue to play a role in enterprise environments, though as mentioned during the webcast, it seems likely that the use of HCI and disaggregated storage will grow more quickly than the use of centralized storage.
Q. Good discussion. It’s interesting we are debating one versus the other. Is it not also the case that both solutions coexist in many infrastructures today? That basically there is a place for both. Can you speak a bit to how both have their place in many environments?
A. It is true that many infrastructures use more than one storage architecture for different use cases. Many enterprises might use a combination of centralized storage and HCI, while many cloud service providers might use a combination of HCI and distributed storage.
Q. What is the performance gap between “best of breed” storage vs. HCI? Performance-wise we see positives for both, but disaggregated tends to always win at the high end for us.
A. HCI storage can be faster or slower than centralized storage or disaggregated storage. There are many variables that affect the performance of all three types of storage, including drive type, data protection method, storage access, network, etc. It is true that customers and designers of disaggregated storage and centralized storage often put more emphasis on storage performance and capacity scale than customers and vendors of HCI. HCI systems often devote more of their performance capability to the apps running in the compute VMs than purely the storage. In general, a virtualized solution like HCI does add overhead that could impact performance. However, users often find the benefits in efficiency, ease of management and scalability outweigh these drawbacks.
Q. What is the typical storage utilization for centralized vs. disaggregated vs. HCI?
A. Centralized storage typically has higher (better) utilization than HCI or disaggregated storage, especially for small and medium storage deployments. As deployments get larger, utilization usually increases (improves) for HCI and disaggregated storage but rarely reaches the same level as centralized storage, due to the need to keep some free space on each node in the HCI or disaggregated storage cluster. That said, centralized storage is more likely to use RAID at the drive level and/or mirroring of entire volumes, which can reduce total storage usage efficiency. For increased replication efficiency, HCI and disaggregated storage solutions often support erasure coding.
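A simple way to see the utilization effect: usable capacity in a scale-out cluster is the raw capacity minus the per-node free-space reserve, divided by the data-protection overhead. A sketch with assumed example numbers:

```python
def usable_tb(nodes: int, raw_tb_per_node: float,
              node_reserve: float, protection_overhead: float) -> float:
    """node_reserve: fraction of each node kept free for rebuilds and
    rebalancing; protection_overhead: 3.0 for 3-way replication, 1.5 for
    a 4+2 erasure code. All inputs here are assumed example values."""
    raw = nodes * raw_tb_per_node * (1.0 - node_reserve)
    return raw / protection_overhead

print(usable_tb(8, 100, 0.25, 3.0))  # 3-way replication: 200.0 TB usable
print(usable_tb(8, 100, 0.25, 1.5))  # 4+2 erasure coding: 400.0 TB usable
```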
Q. How do you manage fault domains with disaggregation? If a single storage subsystem fails, it will impact perhaps several compute nodes. It seems that as you support more compute nodes, you will need to spread data across more and more storage subsystems in order to mitigate risks of a single storage system bringing down multiple compute nodes.
A. Disaggregated storage usually creates fault domains consisting of multiple nodes, though it could be as small as a single drive or as large as an entire rack or row of racks in the data center. The redundancy can be managed by the storage software or by the application. It is correct that storage solutions supporting more compute nodes should have a higher level of availability than those that support fewer compute nodes, but that doesn’t mean the number of storage nodes used for one fault domain must forever increase in proportion to the number of servers supported. It is possible to have the number of storage nodes scale independently from the number of servers supported.
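To illustrate spreading redundancy across fault domains, here is a minimal placement sketch that puts each copy in a distinct domain (domain and node names are hypothetical, and a real system would also balance by free capacity and health):

```python
import itertools

def place_replicas(nodes_by_domain: dict[str, list[str]], copies: int) -> list[str]:
    """Pick one node from each of `copies` distinct fault domains so that
    losing any single domain (drive, node, rack) costs at most one copy.
    Simplified: always takes the first node in each chosen domain."""
    if copies > len(nodes_by_domain):
        raise ValueError("not enough fault domains for the requested copies")
    chosen = itertools.islice(nodes_by_domain.items(), copies)
    return [nodes[0] for _, nodes in chosen]

cluster = {"rack1": ["n1", "n2"], "rack2": ["n3"], "rack3": ["n4", "n5"]}
print(place_replicas(cluster, 3))  # ['n1', 'n3', 'n4']
```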
Q. A comment rather than a question: Definitely centralized.
A. We agree: centralized storage is definitely centralized at the system level, though not all centralized storage systems are centralized in the same data center; they could be distributed across multiple data centers.
Don’t forget to check out our other great storage debates, including: File vs. Block vs. Object Storage, Fibre Channel vs. iSCSI, FCoE vs. iSCSI vs. iSER, RoCE vs. iWARP, and Centralized vs. Distributed. You can view them all on our SNIAVideo YouTube Channel.