Centralized vs. Distributed Storage FAQ

To date, thousands have watched our “Great Storage Debate” webcast series. Our most recent installment of this friendly debate (where no technology actually emerges as a “winner”) was Centralized vs. Distributed Storage. If you missed it, it’s now available on-demand. The live event generated several excellent questions which our expert presenters have thoughtfully answered here:

Q. Which performs faster, centralized or distributed storage?

A. The answer depends on the type of storage, the type of connections to the storage, and whether the compute is distributed or centralized. The stereotype is that centralized storage performs faster if the compute is local, that is, if it’s in the same data center as the centralized storage.

Distributed storage often uses different (less expensive) storage media and is designed for slower WAN connections, but it doesn’t have to be that way. Distributed storage can be built with the fastest storage and connected with the fastest networking, though it rarely is. It can also outperform centralized storage if the compute is distributed in a similar way, letting each compute node access data from a local node of the distributed storage.

Q. What about facilities costs in either environment? Ultimately the data has to physically “land” somewhere and use power/cooling/floor space. There is an economy of scale in centralized data centers; how does that compare with distributed?

A. One big difference is the cost of power between various data centers. Typically, data centers tend to be in the places where businesses have had traditional office space and accommodation for staff. Unfortunately, these are also areas of power scarcity, and are consequently expensive to run. Distributed data centers can be in much cheaper locations; there are a number in Iceland, for instance, where geothermally generated electricity is very cheap and environmental cooling is effectively free. Plus, the thermal cost per byte can be substantially lower in distributed data centers by efficiently packing drives to near capacity with compressed data. Learn more about data centers in Iceland here.

Another difference is that distributed storage might consume less space if its data protection method (such as erasure coding) is more efficient than the data protection method used by centralized storage (typically RAID or triple replication). While centralized storage can also use erasure coding, compression, and deduplication, it’s sometimes easier to apply these storage efficiency technologies to distributed storage.
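
To make that efficiency difference concrete, here is a back-of-envelope sketch (our illustration, not a figure from the webcast; the 8+2 erasure-coding layout is an assumed example, not a recommendation):

```python
# Rough storage-overhead comparison: triple replication vs. erasure coding.
# The 8+2 layout below is an illustrative assumption; real systems choose
# data/parity counts to match their failure domains.

def replication_overhead(copies: int) -> float:
    """Raw bytes stored per byte of user data under n-way replication."""
    return float(copies)

def erasure_overhead(data_frags: int, parity_frags: int) -> float:
    """Raw bytes stored per byte of user data under k+m erasure coding."""
    return (data_frags + parity_frags) / data_frags

print(f"3x replication:     {replication_overhead(3):.2f}x raw capacity")  # 3.00x
print(f"8+2 erasure coding: {erasure_overhead(8, 2):.2f}x raw capacity")   # 1.25x
```

Both layouts in this example tolerate the loss of any two devices, but the erasure-coded one does so at 1.25x raw capacity rather than 3x, which is why erasure coding can consume considerably less space for comparable protection.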

Q. What is sharding?

A. Sharding is the process of breaking up a dataset, typically a database, into a number of partitions, and then putting these pieces, or shards, on separate storage devices or systems. The partitioning is normally horizontal; that is, each row of the database remains complete within a shard, and some criterion (often a key range) determines which shard each row lands on. Sharding is often used to improve performance, as the data is spread across multiple devices that can be accessed in parallel.
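
As a minimal sketch of what horizontal, key-range sharding looks like (the shard boundaries and rows here are hypothetical; a real database or routing layer handles this internally):

```python
# Toy horizontal sharding by key range: each complete row lands on exactly
# one shard, chosen by where its key falls in the (assumed) range boundaries.
import bisect

# Upper bounds (exclusive) of each shard's key range: shard 0 holds keys
# below 1000, shard 1 holds 1000-1999, shard 2 holds 2000-2999.
SHARD_UPPER_BOUNDS = [1000, 2000, 3000]

def shard_for_key(key: int) -> int:
    """Return the index of the shard whose key range contains `key`."""
    idx = bisect.bisect_right(SHARD_UPPER_BOUNDS, key)
    if idx >= len(SHARD_UPPER_BOUNDS):
        raise KeyError(f"key {key} is outside all shard ranges")
    return idx

# Rows stay whole; only their placement is partitioned.
rows = [(42, "alice"), (1500, "bob"), (2999, "carol")]
shards = {i: [] for i in range(len(SHARD_UPPER_BOUNDS))}
for row in rows:
    shards[shard_for_key(row[0])].append(row)

print(shards)  # {0: [(42, 'alice')], 1: [(1500, 'bob')], 2: [(2999, 'carol')]}
```

Because each shard holds complete rows, a query whose key range maps to one shard touches only one device, while scans across the whole key space can fan out to all shards in parallel.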

Sharding should not be confused with the erasure coding used for data protection. Although erasure coding also breaks data into smaller pieces and spreads them across multiple devices, each piece is encoded, and the data can only be understood once a minimum number of the fragments have been read and the data has been reconstituted on the system that requested it.
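
To illustrate the contrast, here is a toy 2+1 code using simple XOR parity (far simpler than the Reed-Solomon codes real systems use, and assuming even-length data for brevity): any two of the three fragments can rebuild the data, but no single fragment is readable on its own.

```python
# Toy 2+1 erasure code: two data fragments plus one XOR parity fragment.
# Losing any one fragment is survivable; reading fewer than the minimum
# number of fragments (two) cannot reconstitute the original data.

def encode(data: bytes) -> list[bytes]:
    half = len(data) // 2                   # assumes even-length input
    d0, d1 = data[:half], data[half:]
    parity = bytes(a ^ b for a, b in zip(d0, d1))
    return [d0, d1, parity]

def decode(d0, d1, parity) -> bytes:
    """Rebuild the original data from any two surviving fragments."""
    if d0 is None:
        d0 = bytes(a ^ b for a, b in zip(d1, parity))
    elif d1 is None:
        d1 = bytes(a ^ b for a, b in zip(d0, parity))
    return d0 + d1

frags = encode(b"centralize")               # -> [b'centr', b'alize', parity]
print(decode(None, frags[1], frags[2]))     # b'centralize', despite losing d0
```

Unlike a shard, the parity fragment is meaningless by itself, and even a lost data fragment can be rebuilt, which is the data-protection property sharding alone does not provide.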

Q. What is the preferred or recommended choice of NVMe over Fabrics (NVMe-oF) for centralized vs. distributed storage systems, for prioritized use-case scenarios such as data integrity, latency, number of retries for read/write, or resource utilization?

A. This is a straightforward cost vs. performance question. This kind of solution only makes sense if the compute is very close to the data; so either a centralized SAN, or a (well-defined) distributed system in one location with co-located compute would make sense. Geographically dispersed data centers or compute on remote data adds too much latency, and often bandwidth issues can add to the cost.

Q. Is there a document that has catalogued the impact of latency on the many data types? When designing storage, I would start with how much latency an application can withstand.

A. We are not aware of any single document that has done so, but many applications (along with their vendors, integrators, and users) have documented their storage bandwidth and latency needs. Other documents show the impact of differing storage latencies on application performance. Generally speaking, one could say the following about latency requirements, though exceptions exist to each one:

  • Block storage wants lower latency than file storage, which wants lower latency than object storage
  • Large I/O and sequential workloads tolerate latency better than small I/O and random workloads (see the back-of-envelope sketch after this list)
  • One-way streaming media, backup, monitoring and asynchronous replication care more about bandwidth than latency. Two-way streaming (e.g. videoconferencing or IP telephony), database updates, interactive monitoring, and synchronous replication care more about latency than bandwidth.
  • Real-time applications (remote control surgery, multi-person gaming, remote AR/VR, self-driving cars, etc.) require lower latency than non-real-time ones, especially if the real-time interaction goes both ways on the link.
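
To see why the second bullet holds, here is some back-of-envelope arithmetic (our illustration, with assumed numbers; it ignores transfer time and treats per-I/O latency as the only bottleneck):

```python
# Upper bound on throughput when each I/O must wait out a fixed round-trip
# latency before the next one is issued (queue depth 1, transfer time ignored).

def max_throughput_mb_s(io_size_kb: float, latency_us: float) -> float:
    ios_per_second = 1_000_000 / latency_us
    return ios_per_second * io_size_kb / 1024

# 4 KiB random reads, one at a time, over an assumed 500 us round trip:
print(f"{max_throughput_mb_s(4, 500):.1f} MB/s")     # ~7.8 MB/s
# 1 MiB sequential I/Os over the same 500 us round trip:
print(f"{max_throughput_mb_s(1024, 500):.1f} MB/s")  # ~2000 MB/s
```

Large or sequential transfers amortize each round trip over far more data, so the same latency costs them proportionally much less; small random I/O at low queue depth is gated almost entirely by latency.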

One thing to note is that many factors affect the performance of a storage system. You may want to take a look at our excellent Performance Benchmark webcast series to find out more.

Q. Computation faces an analogous debate between distributed compute vs. centralized compute. Please comment on how the computation debate relates to the storage debate. Typically, distributed computation will work best with distributed storage. Ditto for centralized computation and storage. Are there important applications where a user would go for centralized compute and distributed storage? Or distributed compute and centralized storage?

A. That’s a very good question, to which there is a range of not so very good answers! Here are some application scenarios that require different thinking about centralized vs. distributed storage.

Video surveillance is best with distributed storage (and perhaps a little local compute to do things like motion detection or object recognition) combined with centralized compute (for object identification or consolidating multiple feeds). Robotics requires lots of distributed compute; think of self-driving cars, where the analysis of a scene and the motion of the vehicle must be handled locally, but where data on traffic volumes and road conditions needs to be gathered from multiple sources and processed centrally. There are lots of other (often less exciting but just as important) applications with similar requirements; consider retail food sales with smart checkouts (that part is all local) alongside stock management and shipping systems (that part is heavily centralized).

In essence, sometimes it’s easier to process the data where it’s born, rather than move it somewhere else. Data is “sticky”, and that sometimes dictates that the compute should be where the data lies. Equally, it’s also true that sometimes the only way of making sense of distributed data is to centralize it; weather stations can’t do weather forecasting, so the data needs to be unstuck, collected up & transmitted, and then computed centrally.

We hope you enjoyed this unbiased, vendor-neutral debate. You can check out the others in this series below:

Follow us @SNIAESF for more upcoming webcasts.

We’re Debating Again: Centralized vs. Distributed Storage

We hope you’ve been following the SNIA Ethernet Storage Forum (ESF) “Great Storage Debates” webcast series. We’ve done four so far, and they have been incredibly popular, with 4,000 live and on-demand views to date and counting. Check out the links to all of them at the end of this blog.

Although we have “versus” in the title of these presentations, the goal of this series is not to have a winner emerge, but rather to provide a “compare and contrast” that educates attendees on how the technologies work and the advantages of each, and to explore common use cases.

That’s exactly what we plan to do on September 11, 2018 when we host “Centralized vs. Distributed Storage.” In the history of enterprise storage there has been a trend to move from local storage to centralized, networked storage. Customers found that networked storage provided higher utilization, centralized and hence cheaper management, easier failover, and simplified data protection, among many other advantages, which drove the move to FC-SAN, iSCSI, NAS, and object storage.

Recently, however, distributed storage has become more popular: the storage lives in multiple locations but can still be shared over a LAN (Local Area Network) and/or WAN (Wide Area Network). The advantages of distributed storage include the ability to scale out capacity. In the hyperconverged use case, for example, enterprises can use each node for both compute and storage, scaling out as more resources are needed.

What does this all mean?

Register for this live webcast to find out, where my ESF colleagues and I will discuss:

  • Pros and cons of centralized vs. distributed storage
  • Typical use cases for centralized and distributed storage
  • How SAN, NAS, parallel file systems, and object storage fit in these different environments
  • How hyperconverged has introduced a new way of consuming storage

It’s sure to be another unbiased, vendor-neutral look at a storage topic many are debating within their own organizations. I hope you’ll join us on September 11th. In the meantime, I encourage you to watch our on-demand debates:

Learn about the work SNIA is doing to lead the storage industry worldwide in developing and promoting vendor-neutral architectures, standards, and educational services that facilitate the efficient management, movement, and security of information by visiting snia.org.