Centralized vs. Distributed Storage FAQ

To date, thousands have watched our “Great Storage Debate” webcast series. Our most recent installment of this friendly debate (where no technology actually emerges as a “winner”) was Centralized vs. Distributed Storage. If you missed it, it’s now available on-demand. The live event generated several excellent questions which our expert presenters have thoughtfully answered here:

Q. Which performs faster, centralized or distributed storage?

A. The answer depends on the type of storage, the type of connections to the storage, and whether the compute is distributed or centralized. The stereotype is that centralized storage performs faster if the compute is local, that is, if it’s in the same data center as the centralized storage.

Distributed storage often uses different (less expensive) storage media and is designed for slower WAN connections, but it doesn’t have to be that way. Distributed storage can be built with the fastest storage and connected with the fastest networking, but it rarely is. It can also outperform centralized storage if the compute is distributed in a similar way to the distributed storage, letting each compute node access the data from a local node of the distributed storage.

Q. What about facilities costs in either environment? Ultimately the data has to physically “land” somewhere and use power/cooling/floor space. There is an economy of scale in centralized data centers, how does that compare with distributed?

A. One big difference is the cost of power across data centers. Traditionally, data centers tend to be located where businesses have had office space and accommodation for staff. Unfortunately, these are also areas of power scarcity and are consequently expensive to run. Distributed data centers can be in much cheaper locations; there are a number in Iceland, for instance, where geothermally generated electricity is very cheap and environmental cooling is effectively free. Plus, the thermal cost per byte can be substantially lower in distributed data centers by efficiently packing drives to near capacity with compressed data. Learn more about data centers in Iceland here.

Another difference is that distributed storage might consume less space if its data protection method (such as erasure coding) is more efficient than the data protection method used by centralized storage (typically RAID or triple replication). While centralized storage can also use erasure coding, compression, and deduplication, it’s sometimes easier to apply these storage efficiency technologies to distributed storage.
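
To make that efficiency difference concrete, here is a minimal back-of-the-envelope sketch in Python. The layouts shown (triple replication, an 8+2 code, a 10+4 code) are illustrative assumptions, not figures for any particular product:

```python
# Back-of-the-envelope storage overhead: raw bytes stored per usable byte.
# The layouts below are illustrative examples, not vendor specifications.

def replication_overhead(copies: int) -> float:
    """N-way replication keeps N full copies of every byte."""
    return float(copies)

def erasure_coding_overhead(data_shards: int, parity_shards: int) -> float:
    """A k+m erasure code stores (k + m) / k raw bytes per usable byte."""
    return (data_shards + parity_shards) / data_shards

print(f"Triple replication: {replication_overhead(3):.2f}x raw capacity")
print(f"8+2 erasure code:   {erasure_coding_overhead(8, 2):.2f}x raw capacity")
print(f"10+4 erasure code:  {erasure_coding_overhead(10, 4):.2f}x raw capacity")
```

A 10+4 code survives four simultaneous fragment losses at 1.40x overhead, whereas triple replication survives only two failures at 3.00x, which is why wide erasure coding is attractive in large distributed systems.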

Q. What is sharding?

A. Sharding is the process of breaking up a dataset, typically a database, into a number of partitions, and then putting these pieces or shards on separate storage devices or systems. The partitioning is normally horizontal; that is, each row of the database remains complete within a shard, and some criterion (often a key range) determines which shard a row belongs to. Sharding is often used to improve performance, as the data is spread across multiple devices which can be accessed in parallel.
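
As a minimal sketch, here is what horizontal, key-range sharding can look like in Python. The shard boundaries and rows are hypothetical; real systems derive boundaries from the data distribution and rebalance them as data grows:

```python
import bisect
from collections import defaultdict

# Upper bounds (exclusive) of the key ranges: keys before "g" go to shard 0,
# "g".."m" to shard 1, "n".."s" to shard 2, everything else to shard 3.
SHARD_UPPER_BOUNDS = ["g", "n", "t"]  # hypothetical boundaries

def shard_for_key(key: str) -> int:
    """Map a row key to a shard by key range (horizontal partitioning)."""
    return bisect.bisect_right(SHARD_UPPER_BOUNDS, key[0].lower())

shards = defaultdict(list)
for row in [("alice", "ops"), ("mary", "dev"), ("zoe", "qa")]:
    shards[shard_for_key(row[0])].append(row)  # the whole row lands on one shard

for shard_id, rows in sorted(shards.items()):
    print(f"shard {shard_id}: {rows}")
```

Because each shard is independent, queries that touch disjoint key ranges can be served by different devices in parallel, which is where the performance benefit comes from.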

Sharding should not be confused with erasure coding, which is used for data protection. Although erasure coding also breaks data into smaller pieces and spreads it across multiple devices, each fragment is encoded, and the data can only be understood once a minimum number of the fragments have been read and the data has been reconstituted on the system that requested it.
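
To illustrate the difference, here is a toy 2+1 erasure code using simple XOR parity (conceptually like RAID 5). It is only a sketch of the “minimum number of fragments” idea; production systems typically use Reed-Solomon or similar codes with more parity fragments:

```python
# Toy 2+1 erasure code: split data into two fragments plus one XOR parity
# fragment; any two of the three fragments can reconstruct the original.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data = b"centralized!"              # 12 bytes, split into two 6-byte fragments
d0, d1 = data[:6], data[6:]
parity = xor_bytes(d0, d1)          # the third fragment, stored elsewhere

# Simulate losing fragment d1 and rebuilding it from the surviving two.
rebuilt_d1 = xor_bytes(d0, parity)  # d0 XOR (d0 XOR d1) == d1
assert d0 + rebuilt_d1 == data
print("reconstructed:", (d0 + rebuilt_d1).decode())
```

Note that this toy keeps the data fragments in the clear for simplicity; real codes typically transform every fragment, so no single fragment is directly readable on its own.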

Q. What is the preferred or recommended choice of NVMe over Fabrics (NVMe-oF) for centralized vs. distributed storage systems, for prioritized use-case scenarios such as data integrity, latency, number of retries for read/write, and resource utilization?

A. This is a straightforward cost vs. performance question. This kind of solution only makes sense if the compute is very close to the data, so either a centralized SAN or a (well-defined) distributed system in one location with co-located compute would make sense. Geographically dispersed data centers, or compute operating on remote data, add too much latency, and bandwidth issues can often add to the cost.

Q. Is there a document that has catalogued the impact of latency on the many data types? When designing storage, I would start with how much latency an application can withstand.

A. We are not aware of any single document that has done so, but many applications (along with their vendors, integrators, and users) have documented their storage bandwidth and latency needs. Other documents show the impact of differing storage latencies on application performance. Generally speaking, one could say the following about latency requirements, though exceptions exist to each one (a short arithmetic sketch after this list shows why latency matters so much for small I/O):

  • Block storage wants lower latency than file storage, which wants lower latency than object storage
  • Large I/O and sequential workloads tolerate latency better than small I/O and random workloads
  • One-way streaming media, backup, monitoring and asynchronous replication care more about bandwidth than latency. Two-way streaming (e.g. videoconferencing or IP telephony), database updates, interactive monitoring, and synchronous replication care more about latency than bandwidth.
  • Real-time applications (remote control surgery, multi-person gaming, remote AR/VR, self-driving cars, etc.) require lower latency than non-real-time ones, especially if the real-time interaction goes both ways on the link.
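
As a rough sketch of the small-I/O point above, Little’s law says the throughput a single queue can sustain is roughly the number of outstanding I/Os divided by the round-trip latency. The latency figures below are hypothetical round numbers, chosen only to illustrate the effect:

```python
# Little's law sketch: achievable IOPS ~ outstanding I/Os / round-trip latency.
# The latencies below are illustrative round numbers, not measured values.

QUEUE_DEPTH = 32                 # outstanding I/Os kept in flight
IO_SIZE_BYTES = 4 * 1024         # small random 4 KiB reads

for label, latency_s in [("local NVMe (~100 us)", 100e-6),
                         ("same-site SAN (~500 us)", 500e-6),
                         ("cross-WAN (~20 ms)", 20e-3)]:
    iops = QUEUE_DEPTH / latency_s
    mib_per_s = iops * IO_SIZE_BYTES / (1024 * 1024)
    print(f"{label:26s} {iops:>10,.0f} IOPS  {mib_per_s:8.1f} MiB/s")
```

The same 20 ms that cripples 4 KiB random I/O barely matters to a large sequential stream, which can hide latency behind big transfers and deep pipelines; that is the intuition behind the bullets above.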

One thing to note is that many factors affect the performance of a storage system. You may want to take a look at our excellent Performance Benchmark webinar series to find out more.

Q. Computation faces an analogous debate between distributed compute vs. centralized compute. Please comment on how the computation debate relates to the storage debate. Typically, distributed computation will work best with distributed storage. Ditto for centralized computation and storage. Are there important applications where a user would go for centralized compute and distributed storage? Or distributed compute and centralized storage?

A. That’s a very good question, to which there is a range of not-so-very-good answers! Here are some application scenarios that require different thinking about centralized vs. distributed storage.

Video surveillance works best with distributed storage (plus perhaps a little local compute for things like motion detection or object recognition) combined with centralized compute (for object identification or consolidation of multiple feeds). Robotics requires lots of distributed compute; think of self-driving cars, where the analysis of a scene and the motion of the vehicle must be handled locally, but where data on traffic volumes and road conditions from multiple sources needs to be processed centrally. There are lots of other (often less exciting but just as important) applications with similar requirements; consider retail food sales with smart checkouts (that part is all local) alongside stock management and shipping systems (that part is heavily centralized).

In essence, sometimes it’s easier to process the data where it’s born rather than move it somewhere else. Data is “sticky”, and that sometimes dictates that the compute should be where the data lies. Equally, it’s also true that sometimes the only way of making sense of distributed data is to centralize it; weather stations can’t do weather forecasting, so their data needs to be unstuck, collected up and transmitted, and then computed centrally.

We hope you enjoyed this unbiased, vendor-neutral debate. You can check out the others in this series below:

Follow us @SNIAESF for more upcoming webcasts.
