An FAQ on Data Reduction Fundamentals - SNIA on Data, Networking & Storage

There’s a fair amount of confusion when it comes to data reduction terminology and techniques. That’s why the SNIA Networking Storage Forum (NSF) hosted a live webcast, “Everything You Wanted to Know About Storage But Were Too Proud to Ask: Data Reduction.” It was a 101-level lesson on the fundamentals of data reduction, which can be performed in different places and at different stages of the data lifecycle. The goal was to clear up confusion around different data reduction and data compression techniques and set the stage for deeper dive webcasts on this topic (see the end of this blog for info on those).

As promised during the webcast, here are answers to the questions we didn’t have time to address during the live event.

Q. Does block level compression have any direct advantage over file level compression?

A. One significant advantage is not requiring the entire object (the file, database or whatever we’re storing) to be compressed and decompressed as a unit. Treating it as a single unit would almost certainly increase read latency and, for large files, require quite a bit of caching. In the case of blocks, a single block can be the compression unit, even if it’s part of a file, database or other larger data structure. Compressing a block is much faster and computationally less intensive, which is reflected in reduced latency overhead and cache impacts.
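To make the difference concrete, here is a minimal Python sketch using the standard `zlib` module. The 4 KiB block size and the helper names (`compress_blocks`, `read_block`, `read_from_whole_file`) are assumptions chosen for illustration, not how any particular storage system does it.

```python
import zlib
from typing import List

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Block-level: compress each fixed-size block independently."""
    return [
        zlib.compress(data[offset:offset + block_size])
        for offset in range(0, len(data), block_size)
    ]


def read_block(compressed_blocks: List[bytes], index: int) -> bytes:
    """Decompress only the requested block; the others are never touched."""
    return zlib.decompress(compressed_blocks[index])


def read_from_whole_file(compressed_file: bytes, index: int,
                         block_size: int = BLOCK_SIZE) -> bytes:
    """File-level: the whole file is decompressed just to reach one block."""
    data = zlib.decompress(compressed_file)
    return data[index * block_size:(index + 1) * block_size]


# 64 blocks of sample data, each block filled with a different byte value.
data = b"".join(bytes([i % 251]) * BLOCK_SIZE for i in range(64))
blocks = compress_blocks(data)
whole = zlib.compress(data)

# Both paths return the same bytes; only the amount of work differs.
assert read_block(blocks, 10) == read_from_whole_file(whole, 10)
```

In this sketch, reading block 10 from the block-compressed copy decompresses one small unit, while the file-level path has to inflate all 64 blocks first.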

Q. You made it sound like thin provisioning had no overhead but on-demand allocation is an overhead and can be quite bad at the worst time. Do you agree?

A. Finding free space when the system is at capacity may be an issue, and this may indeed cause significant slowdowns. This is an undesirable situation, and the advice is never to run so close to the capacity wire that thin provisioning impacts performance or jeopardizes successfully writing the data. In a system with adequate amounts of free space, caching can make the normally small overhead of thin provisioning very small to unmeasurable.
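As a rough illustration of where that overhead comes from, the following Python sketch models a toy thin-provisioned volume. The `ThinVolume` class and its fields are hypothetical; real systems allocate in larger extents and keep this mapping in persistent metadata.

```python
class ThinVolume:
    """Toy thin-provisioned volume: physical blocks are allocated on first write."""

    def __init__(self, logical_blocks, free_pool):
        self.logical_blocks = logical_blocks  # size advertised to the host
        self.free_pool = free_pool            # shared list of free physical block ids
        self.mapping = {}                     # logical block -> physical block
        self.store = {}                       # physical block -> data

    def write(self, lba, data):
        if not 0 <= lba < self.logical_blocks:
            raise IndexError("LBA outside the advertised size")
        if lba not in self.mapping:
            # On-demand allocation: this is the extra step thin provisioning adds.
            if not self.free_pool:
                raise RuntimeError("pool exhausted: write cannot be completed")
            self.mapping[lba] = self.free_pool.pop()
        self.store[self.mapping[lba]] = data

    def read(self, lba):
        phys = self.mapping.get(lba)
        # Blocks that were never written read back as zeros in this sketch.
        return self.store[phys] if phys is not None else b"\x00"


pool = [0, 1]                      # only two physical blocks actually exist
vol = ThinVolume(100, pool)        # but the volume advertises 100 blocks
vol.write(5, b"a")                 # first write to block 5 allocates
vol.write(5, b"b")                 # overwrite reuses the same physical block
vol.write(7, b"c")                 # second allocation empties the pool
```

With plenty of free blocks, the allocation step is a cheap lookup. Once the pool is empty, the next write to a new block fails, which is the situation the advice above is meant to avoid.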

Q. Will migration to SSD zoning vs. HDD based block/pages impact data compression?

A. It shouldn’t, since compression is done at a level where zoning isn’t an issue. Compression is only applicable to blocks or files.

Q. Does compressing blocks on computational storage devices have the disadvantage of not reducing the PCIe bandwidth since raw data has to be transferred over to the storage devices?

A. Yes. But the same is true of any storage device; so computational storage is no worse in respect of the transfer of the data, but it provides much more apparent storage on the device once it gets there. A computational storage device requires no application changes to do this.

Q. How do we measure performance in out-of-line data reduction?

A. Data reduction techniques like compression and deduplication can be done in-line (that is, while writing the data) or out-of-line (at a later point in time). Out-of-line shifts the compute required from now, where big horsepower is required if there’s to be no impact on storage performance, to later, where smaller processors can take their time. The tradeoff is that out-of-line data reduction requires more space to store the data, as it’s unreduced when it’s written. These tradeoffs also have impacts on performance (both back-end latency and bandwidth), and all of this affects the total cost of the system. We know how to measure the performance of in-line vs. out-of-line, so the question is less about measuring both and declaring one a winner, and more about whether the system provides the needed performance at the right cost. That’s a purchasing decision, not a technology one.

Q. How do customers (or vendors) decide how wide their deduplication net should be, i.e. one disk, per file, across one file system, one storage system, or multiple storage systems?

A. By testing and balancing the savings vs. the cost. One thing is true: the balance right now is very definitely in favor of deduplicating at every level where possible. Vendors can demonstrate huge space savings advantages by doing so. Consumers, as indicated by my answer to the previous question, need to look at the whole system and its cost vs. performance, and buy on that basis.
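A toy Python sketch shows why a wider net finds more duplicates. The sample "volumes" and the helper names are made up for illustration, and SHA-256 fingerprints stand in for whatever a real system uses.

```python
import hashlib


def unique_blocks(blocks):
    """Count distinct blocks by fingerprint (SHA-256 here, as an assumption)."""
    return len({hashlib.sha256(block).digest() for block in blocks})


def stored_blocks(volumes, global_scope):
    """Blocks kept after deduplication, per volume or across all volumes."""
    if global_scope:
        return unique_blocks([block for volume in volumes for block in volume])
    return sum(unique_blocks(volume) for volume in volumes)


# Two hypothetical volumes that share a common "os" block.
volumes = [
    [b"os", b"os", b"app-a"],
    [b"os", b"app-b", b"app-b"],
]

per_volume_count = stored_blocks(volumes, global_scope=False)  # 2 + 2 = 4
global_count = stored_blocks(volumes, global_scope=True)       # os, app-a, app-b = 3
```

The wider scope stores fewer blocks, but the fingerprint index has to cover every volume in that scope, which is part of the cost side of the balance.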

Q. Is compression like doing deduplication on a very small and very local scale?

A. You could think of it as bit-level deduplication, and then realize that you can stretch an analogy to breaking point…

Q. Are some blocks or files so small that it’s not worth doing deduplication or cloning because the extra metadata will be larger than the block/file space savings?

A. Yes. They’re often stored as is – but they do need metadata to say that they’re raw and not reduced.
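One simple way to picture that metadata is a one-byte flag in front of each stored block. The sketch below uses compression as the stand-in reduction step; the flag values and the `pack`/`unpack` helpers are assumptions for illustration only.

```python
import zlib

RAW, COMPRESSED = b"\x00", b"\x01"  # hypothetical one-byte metadata flags


def pack(block: bytes) -> bytes:
    """Store the reduced form only if it is actually smaller than the original."""
    candidate = zlib.compress(block)
    if len(candidate) < len(block):
        return COMPRESSED + candidate
    return RAW + block  # not worth reducing: keep as is, but say so


def unpack(stored: bytes) -> bytes:
    """Use the flag to decide whether the payload needs decompressing."""
    flag, payload = stored[:1], stored[1:]
    return zlib.decompress(payload) if flag == COMPRESSED else payload


tiny = b"hi"
large = b"a" * 4096

assert pack(tiny)[:1] == RAW          # compressed form would be larger than 2 bytes
assert pack(large)[:1] == COMPRESSED  # highly repetitive data shrinks
assert unpack(pack(tiny)) == tiny and unpack(pack(large)) == large
```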

Q. Do cloning and snapshots operate only at the block level or can they operate at the file or object level too?

A. Cloning and snapshots can operate at the file or object level, as long as there is an efficient way of extracting and storing the differences. Sometimes it’s cheaper and simpler just to copy the whole thing, especially for small files or objects.

Q. Why does VDO (Virtual Data Optimizer) do dedupe before compression if the other way is preferable? Why is it better to compress then deduplicate?

A. That’s a decision that the designers of VDO felt gave them the best storage efficiencies and reasonable compute overheads. (It’s also not the only system that uses this order.) But the dedupe scope of VDO is relatively small. Compression then deduplication allows in-line compression with out-of-line and much broader deduplication across very large sets of data, and there are many systems that use this order for that reason.
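The two orderings can be sketched in a few lines of Python. This is not how VDO or any other product is implemented; the function names are hypothetical, and the sketch assumes a deterministic compressor (`zlib` with fixed settings), so identical blocks compress to identical bytes.

```python
import hashlib
import zlib


def dedupe_then_compress(blocks):
    """Fingerprint raw blocks first; compress only the unique ones."""
    store, compress_calls = {}, 0
    for block in blocks:
        key = hashlib.sha256(block).digest()
        if key not in store:
            store[key] = zlib.compress(block)
            compress_calls += 1
    return store, compress_calls


def compress_then_dedupe(blocks):
    """Compress every block at write time; deduplicate the results in a later pass."""
    written = [zlib.compress(block) for block in blocks]  # in-line step
    compress_calls = len(written)
    store = {}
    for compressed in written:                            # out-of-line step
        store.setdefault(hashlib.sha256(compressed).digest(), compressed)
    return store, compress_calls


blocks = [b"a" * 4096, b"b" * 4096, b"a" * 4096]  # one duplicate block
store_dc, calls_dc = dedupe_then_compress(blocks)
store_cd, calls_cd = compress_then_dedupe(blocks)

# Both orders end up keeping the same two compressed blocks.
assert sorted(store_dc.values()) == sorted(store_cd.values())
```

In this toy case the end result is the same; what changes is when the work happens. Deduplicating first avoids compressing duplicates, while compressing first lets the deduplication pass run later and over a much larger set of already-compressed data.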

Q. There’s also so much stuff because we (as an industry) have enabled storing so much stuff (cheaply/affordably) 🙂. Today’s business and storage market would look and act differently if costs were different. Data reduction’s interaction with encryption (e.g. proper ordering) could be useful to mention. Or a topic for another presentation!

A. We’ll consider it!

Remember I said we were taking a deeper dive on the topic of data reduction? We have two more webcasts in this series – one on compression and the other on data deduplication. You can access them here:

Compression: Putting the Squeeze on Storage – Available on-demand
Not Again! Data Deduplication for Storage Systems – Live on November 10, 2020, on-demand after that date
