Data Compression Q&A

Everyone is looking to squeeze more efficiency from storage. That’s why the SNIA Networking Storage Forum hosted a live webcast last month, “Compression: Putting the Squeeze on Storage.” The audience asked many great questions on compression techniques. Here are answers from our expert presenters, John Kim and Brian Will:

Q. When multiple unrelated entities are likely to compress the data, how do they understand that the data is already compressed and so skip the compression?

A. Often they can tell from the file extension or header that the file has already been compressed. Otherwise, each entity that wants to compress the data will try to compress it and then discard the result if it turns out larger than the original (a sign the data was already compressed).
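
As a rough illustration, here is a minimal Python sketch of both checks: skipping data whose header indicates a known compressed format, and falling back to a trial compression. The magic-byte list and the maybe_compress helper are illustrative, not taken from any particular product.

```python
import zlib

# Magic bytes of a few common already-compressed formats (illustrative, not exhaustive).
COMPRESSED_MAGICS = [
    b"\x1f\x8b",          # gzip
    b"PK\x03\x04",        # zip
    b"\xfd7zXZ\x00",      # xz
    b"\x28\xb5\x2f\xfd",  # zstd
    b"\xff\xd8\xff",      # jpeg
]

def maybe_compress(data: bytes) -> bytes:
    """Return compressed data only if it is worth storing; otherwise the original."""
    # Skip data whose header says it is already compressed.
    if any(data.startswith(m) for m in COMPRESSED_MAGICS):
        return data
    compressed = zlib.compress(data)
    # Discard the result if compression made the data larger.
    return compressed if len(compressed) < len(data) else data
```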

Q. I’m curious about the storage efficiency of data reduction techniques (compression, thin provisioning, etc.) on certain database/server workloads where they end up being more of a hindrance, for example Oracle ASM, which does not perform very well under any form of storage efficiency method. In such scenarios, what would be the recommendation to ensure storage is judiciously utilized?

A. Compression works well for some databases but not others, depending both on how much data repetition occurs within the database and on how the database tables are structured. Database compression can be done at the row, column, or page level, depending on the method and the database structure. Thin provisioning generally works best if multiple applications using the storage system (such as the database application) want to reserve or allocate more space than they actually need. If your database system does not like the use of external (storage-based, OS-based, or file-system-based) space efficiency techniques, you should check whether it supports its own internal compression options.

Q. What is a DPU?

A. A DPU is a data processing unit that specializes in moving, analyzing and processing data as it moves in and out of servers, storage, or other devices. DPUs usually combine network interface card (NIC) functionality with programmable CPU and/or FPGA cores. Some possible DPU functions include packet forwarding, encryption/decryption, data compression/decompression, storage virtualization/acceleration, executing SDN policies, running a firewall agent, etc. 

Q. What’s the difference between compression and compaction?

A. Compression replaces repeated data with either shorter symbols or pointers that represent the original data but take up less space. Compaction eliminates empty space between blocks or inside of files, often by moving real data closer together. For example, if you store multiple 4KB chunks of data in a storage system that uses 32KB blocks, the default storage solution might consume one 32KB storage block for each 4KB of data. Compaction could put 5 to 8 of those 4KB data chunks into one 32KB storage block to recover wasted free space.
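
To make the compaction side concrete, here is a simplified Python sketch that packs small chunks into fixed-size blocks; a real storage system would also track chunk offsets and metadata, which this example omits.

```python
BLOCK_SIZE = 32 * 1024  # storage block size
CHUNK_SIZE = 4 * 1024   # logical data chunk size

def compact(chunks: list[bytes]) -> list[bytes]:
    """Pack small chunks into as few fixed-size blocks as possible.

    Assumes every chunk is no larger than BLOCK_SIZE.
    """
    blocks, current = [], b""
    for chunk in chunks:
        if len(current) + len(chunk) > BLOCK_SIZE:
            blocks.append(current)
            current = b""
        current += chunk
    if current:
        blocks.append(current)
    return blocks

# Eight 4KB chunks fit into one 32KB block instead of occupying eight.
chunks = [bytes([i]) * CHUNK_SIZE for i in range(8)]
print(len(compact(chunks)))  # 1
```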

Q. Is data encryption at odds with data compression?  That is, is data encryption a problem for data compression?

A. If you encrypt data first, it usually makes compression of the encrypted data difficult or impossible, depending on the encryption algorithm. (A simple substitution cipher would still allow compression but wouldn’t be very secure.) In most cases, the answer is to compress the data first and then encrypt it; on the way back, decrypt first and then decompress.
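
As a minimal sketch of that ordering, the following Python example uses the standard zlib module together with the third-party cryptography package (an assumption on our part; any authenticated cipher would serve the same purpose):

```python
import zlib
from cryptography.fernet import Fernet  # third-party 'cryptography' package

key = Fernet.generate_key()
f = Fernet(key)

def store(data: bytes) -> bytes:
    # Compress first: plaintext has exploitable redundancy, ciphertext does not.
    return f.encrypt(zlib.compress(data))

def load(token: bytes) -> bytes:
    # Reverse the order on the way out: decrypt, then decompress.
    return zlib.decompress(f.decrypt(token))

data = b"hello " * 1000
assert load(store(data)) == data
```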

Q. How do we choose the binary form codes 00, 01, 101, 110, etc.?

A. These will be used as the final symbol representations written into the output data stream. The table shown in the presentation is only illustrative; the algorithm documented in the DEFLATE RFC (RFC 1951) is a complete method for representing symbols in a compacted binary form.

Q. Is there a resource for different algorithms vs CPU requirements vs compression ratios?

A. A good resource to see the cost versus ratio trade-offs of different algorithms is on GitHub here. This utility covers a wide range of compression algorithms, implementations, and levels. The data shown on its GitHub page is benchmarked against the Silesia corpus, which represents a number of different data sets.

Q. Do these operations occur on individual data blocks, or is this across the entire compression job?

A. Assuming you mean the compression operations, they typically occur across multiple data blocks within the compression window. The compression window almost always spans more than one data block but usually does not span the entire file or disk/SSD, unless it’s a small file.
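
You can see the effect of the window size directly with Python’s zlib, whose wbits parameter sets the DEFLATE window anywhere from 2^9 to 2^15 bytes. The sample data below is made up purely for illustration:

```python
import zlib

# Repetitive sample data whose repeating unit (8KB) exceeds the smallest window.
data = (b"0123456789abcdef" * 256 + b"X" * 4096) * 8

# A larger window lets the compressor find matches across more data blocks.
for wbits in (9, 12, 15):
    c = zlib.compressobj(level=6, wbits=wbits)
    out = c.compress(data) + c.flush()
    print(f"window 2**{wbits} bytes -> {len(out)} compressed bytes")
```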

Q. How do we guarantee that important information is not lost during lossy compression?

A. Lossy compression is not my current area of expertise, but there is a significant area of information theory called rate-distortion theory, used for example in the quantization of images for compression, that may be of interest. In addition, lossy compression is typically used only for files/data where it’s known the users of that data can tolerate the loss, such as images or video. The user or application can typically adjust the compression ratio to ensure an acceptable level of data loss.
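
As a hedged illustration of that last point, the sketch below uses the third-party Pillow library (an assumption) to re-save a hypothetical image at several JPEG quality settings, trading file size against visual fidelity:

```python
from PIL import Image  # third-party Pillow package

img = Image.open("photo.jpg")  # hypothetical input image

# Higher quality keeps more detail but compresses less; where to draw the
# line is a user/application decision, in the spirit of rate-distortion theory.
for quality in (95, 75, 40):
    img.save(f"photo_q{quality}.jpg", quality=quality)
```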

Q. Do you see any advantage in performing the compression on the same CPU controller that is managing the flash (running the FTL, etc.)?

A. There may be cache benefits from running compression and flash management on the same CPU, depending on the size of transactions. If the CPU is on the SSD controller itself, running compression there could offload the work from the main system CPU, allowing it to spend more cycles running applications instead of doing compression/decompression.

Q. Before compressing data, is there a method to check if the data is a good candidate for compression?

A. Some compression systems can run a quick scan of a file to estimate the likely compression ratio. Other systems look at the extension and/or header of the file and skip attempts to compress it if it looks like it’s already compressed, as with most image and video files. Another solution is to actually attempt to compress the file and then discard the compressed version if it’s larger than the original file.
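
One simple form of such a quick scan is to trial-compress only the first portion of the file and use that ratio as an estimate. Here is a minimal Python sketch; the estimate_ratio helper and the 64KB sample size are illustrative choices, not a standard:

```python
import zlib

def estimate_ratio(path: str, sample_size: int = 64 * 1024) -> float:
    """Estimate compressibility by trial-compressing the first chunk of a file."""
    with open(path, "rb") as fh:
        sample = fh.read(sample_size)
    if not sample:
        return 1.0
    # Fast level 1 keeps the probe cheap; a ratio near 1.0 means "skip this file".
    return len(zlib.compress(sample, level=1)) / len(sample)
```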

Q. If we were to compress on a storage device (SSD), what do you think are the top challenges? Error propagation? Latency/QoS? Something else?

A. Compressing on a storage device could mean higher latency for the storage device, both when writing files (if compression is inline) and when reading files back (as they are decompressed). But this latency would likely otherwise exist somewhere else in the system if the files were compressed and decompressed anywhere other than on the storage device. Compressing (and decompressing) on the storage device also means the data is transmitted to (and from) the storage uncompressed, which could consume more bandwidth. Finally, if an SSD does post-compression (i.e. compression after the file is stored, not inline as the file is being stored), it would likely cause more wear on the SSD because each file is written twice.

Q. Are all these CPU-based compression analyses?

A. Yes, these are CPU-based compression analyses.

Q. Can you please characterize the performance difference between, say, LZ4 and DEFLATE in terms of microseconds or nanoseconds?

A. Extrapolating from the data available here, an 8KB request using LZ4 fast level 3 (lz4fast 1.9.2 -3) would take 9.78 usec for compression and 1.85 usec for decompression, while an 8KB request using zlib level 1 takes 68.8 usec for compression and 21.39 usec for decompression. Another aspect to note is that while LZ4 fast level 3 takes significantly less time, its compression ratio is 50.52% versus 36.45% for zlib level 1, showing that better compression ratios can come at a significant cost.
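
If you want numbers for your own hardware and data, a rough micro-benchmark is easy to write. The sketch below times zlib level 1 from the Python standard library and, as an assumption, the third-party lz4 package; absolute results will of course differ from the figures quoted above:

```python
import time
import zlib

import lz4.frame  # third-party 'lz4' package (assumed installed)

# One 8KB request of mildly repetitive data (illustrative payload).
data = (b"example payload with some repetition " * 300)[:8192]

def bench(fn, reps=1000):
    """Return average microseconds per call over reps iterations."""
    start = time.perf_counter()
    for _ in range(reps):
        fn(data)
    return (time.perf_counter() - start) / reps * 1e6

print(f"zlib level 1: {bench(lambda d: zlib.compress(d, 1)):.2f} usec")
print(f"lz4 frame:    {bench(lz4.frame.compress):.2f} usec")
```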

Q. How important is the compression ratio when you are using specialty products?

A. The compression ratio is a very important result for any compression algorithm or implementation.

Q. In slide #15, how do we choose the binary code form for the characters?

A. The binary code form in this example is entirely controlled by the frequency of occurrence of the symbol within the data stream: the higher the symbol frequency, the shorter the binary code assigned. The algorithm used here is just for illustrative purposes and would not be used (at least in this manner) in a standard. Huffman encoding in DEFLATE is a good example of a defined encoding algorithm.
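
For a concrete feel of frequency-driven code assignment, here is a small illustrative Huffman construction in Python; note that DEFLATE itself uses canonical Huffman codes as specified in RFC 1951, not this exact procedure:

```python
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict[int, str]:
    """Assign shorter binary codes to more frequent symbols (illustrative Huffman)."""
    # Heap entries: (frequency, tie-breaker, tree) where tree is a symbol or a pair.
    heap = [(freq, i, sym) for i, (sym, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    count = len(heap)
    # Repeatedly merge the two least frequent subtrees.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    codes = {}
    def walk(node, prefix=""):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2])
    return codes

print(huffman_codes(b"aaaabbbcc"))  # 'a' (most frequent) gets the shortest code
```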

This webcast was part of a SNIA NSF series on data reduction. Please check out the other two sessions in the series.
