Introduction
VAST provides advanced data reduction techniques to reduce the cost of storage. VAST 2.1 and earlier include local compression and global deduplication; VAST 3.0 will add an even more advanced technique known as global similarity-based data reduction. Similarity reduction detects data blocks that are similar but not identical, and stores data in a way that only the deltas between those similar blocks are kept. This can save a great deal of space in unexpected ways, since a lot of data is similar but not identical.
We will show that VAST's data reduction can achieve substantial space savings above and beyond client-side compression, and that the savings vary depending on how clients compress their data. In this article we use Virtual Machine images compressed in different ways, but the same considerations apply to any client-side compression. Significantly, VAST's similarity-based data reduction and application compression together can reduce the stored size of Virtual Machines (VMs) by 27-86%, with the greatest reduction coming from letting VAST do all of the work transparently.
Test Overview
In this case study, we explore the effect of VAST Data’s global, similarity-based data reduction on a data set of 9 Windows 7 VM images when the images are subjected to a variety of application-level compression scenarios.
This study consisted of three runs of the VAST Probe (a data reduction estimation tool that is available for customer use) on this data set (a sketch of how these inputs might be prepared appears after the list):
- The first run was performed on the uncompressed set of VMs
- The second run was performed on a set of VMs that had been compressed with LZ4
- The third run was performed on a set of VMs that had been compressed with GZIP
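For readers who want to reproduce this kind of test, the following is a minimal sketch of how the three probe inputs could be prepared. The directory layout and file extension are hypothetical, and the sketch uses Python's standard gzip module plus the third-party lz4 package; it is not part of the VAST Probe itself.

```python
import gzip
import shutil
from pathlib import Path

import lz4.frame  # third-party "lz4" package

SRC = Path("vm_images/raw")  # hypothetical location of the Windows 7 images
DST = Path("vm_images")

for sub in ("uncompressed", "lz4", "gzip"):
    (DST / sub).mkdir(parents=True, exist_ok=True)

for img in SRC.glob("*.vhd"):  # hypothetical image file extension
    # Run 1: the images as-is
    shutil.copy(img, DST / "uncompressed" / img.name)
    # Run 2: LZ4-compressed copies (LZ matching only; literals stay verbatim)
    with img.open("rb") as src, lz4.frame.open(DST / "lz4" / (img.name + ".lz4"), "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Run 3: GZIP-compressed copies (LZ77 plus Huffman coding)
    with img.open("rb") as src, gzip.open(DST / "gzip" / (img.name + ".gz"), "wb") as dst:
        shutil.copyfileobj(src, dst)
```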
The main difference between LZ4 and GZIP is that GZIP applies both Lempel-Ziv (LZ) encoding and Huffman coding to compress data, while LZ4 employs only LZ encoding. We explain this difference in more detail in the appendix; for now, know that Huffman coding substantially reduces the similarity between data sets that were similar before compression.
Test Results
The three treatments of the Virtual Machine data set were each run through the VAST Probe, configured to simulate the data reduction that will be available in VAST Data release 3.0 (available early 2020). This similarity-based data reduction hashes blocks as they enter the system and compares them against pre-existing reference blocks in the storage cluster. All data on Flash is compressed with the ZSTD compression algorithm created by Facebook (a conceptual sketch of the three reduction paths follows the list):
- Blocks that are 100% identical are, in effect, deduplicated against their reference blocks (100% match below)
- Blocks that are not 100% identical will be compressed against the reference block in a cluster of like data, called a similarity cluster, using ZSTD (similarity below)
- Blocks that exhibit no similarity are still locally compressed using ZSTD (non-similar below)
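To make the three paths concrete, here is a conceptual sketch in Python. It is an illustration of the logic described above, not VAST's implementation: the block size, the exact-match index, and the similarity_index.lookup() helper are hypothetical placeholders, and the zstandard package stands in for the cluster's ZSTD engine.

```python
import hashlib
import zstandard as zstd  # third-party "zstandard" package

BLOCK_SIZE = 32 * 1024  # hypothetical block size


def reduce_block(block: bytes, exact_index: dict, similarity_index) -> tuple:
    """Classify one incoming block into one of the three reduction paths."""
    digest = hashlib.sha256(block).digest()
    if digest in exact_index:
        # 100% match: deduplicate, storing only a reference to the existing block
        return ("dedup", exact_index[digest])
    reference = similarity_index.lookup(block)  # placeholder similarity-hash lookup
    if reference is not None:
        # Similar block: ZSTD-compress it against the reference block of its
        # similarity cluster, so effectively only the delta is stored
        cctx = zstd.ZstdCompressor(dict_data=zstd.ZstdCompressionDict(reference))
        return ("similar", cctx.compress(block))
    # Non-similar block: plain local ZSTD compression
    return ("local", zstd.ZstdCompressor().compress(block))
```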
| Compression Option | Original Size (GB) | Probe Estimate (GB) | 100% Match Gain (GB) | Similar-Block Match Gain (GB) | Non-Similar Block Gain (GB) | Total Gain (GB) | Reduction Ratio vs. Original Dataset |
|---|---|---|---|---|---|---|---|
| No Compression | 254.82 | 43.4 | 116.15 | 55.35 | 39.93 | 211.43 | 86% |
| LZ4 | 194.24 | 55.89 | 0 | 131.02 | 7.36 | 138.38 | 56% |
| GZIP | 180.15 | 112.99 | 0 | 67.03 | 0.1 | 67.13 | 27% |
Table 1. VAST Probe Run Results
The following chart shows that even though the original size of the pre-compressed images is smaller, the resulting size after applying VAST data reduction is larger for those images than for the uncompressed set.
The next chart shows the space savings gained from each data reduction method the probe applies when running on the various data sets.
Analysis
VAST Data applies similarity-based data reduction globally across files, meaning that the VAST storage system is able to identify similar data chunks and compress them together to achieve space savings that other storage systems cannot.
This specific data set is composed of virtual machine images, which are well known to benefit from deduplication, as the first run shows. On top of this, applying VAST Data’s similarity-based data reduction delivers an additional 21% gain for the set of uncompressed images (55.35 GB of similar-block savings on the 254.82 GB original).
Once Lempel-Ziv compression is applied to the VM images, simple deduplication can no longer find 100% block matches because of the subtle changes compression introduces across blocks. Because VAST Data’s similarity-based data reduction finds and benefits from partial matches at a level smaller than a block, it still extracts significant savings from the LZ4-compressed dataset and recovers most of the reduction that exact-match deduplication could no longer provide.
Once Huffman coding is applied, the bit-level encoding creates noise in the on-disk format that makes it harder to find opportunities for similarity detection across blocks, leading to significantly lower storage savings.
Application Data Reduction vs Storage Data Reduction
When considering where best to employ data reduction, two tradeoffs come to mind. The first is the CPU required to compress and decompress the data at the application level. The second is the network bandwidth required to transfer uncompressed data between the clients and the storage system.
Applications also have better knowledge of the data they store, which in some cases enables them to implement format-specific compression (e.g. JPEG for images). However, many applications simply use general-purpose compression such as GZIP.
In the past, storage systems could only perform local compression, which exhibited no benefit over what applications can typically achieve, so the best option was simply to compress at the application level and save network bandwidth. Newer storage systems introduced deduplication, which gives storage-based data reduction an advantage over application-based data reduction because of its global nature. On the other hand, storage savings from deduplication are limited by block-level matching (which is sensitive to small changes across blocks) and by the fact that many implementations do not scale beyond a single controller and/or restrict deduplication realms to a single volume. Even in the era of deduplicating storage, many users see so little benefit from deduplication on unstructured data that they still prefer to compress at the application level to save network bandwidth.
VAST Data offers a way to break these tradeoffs with its next-generation similarity-based data reduction. Customers can compress their data at the application level using LZ4 compression, which provides most of the network savings while still allowing VAST's backend data reduction to save significant additional space at the storage level and at scale.
On the other hand, in environments where network bandwidth has far outpaced data growth and I/O requirements, it may make more sense to let VAST apply its full data reduction capabilities by not compressing data in the application at all. At a time when 100Gb networking is a commodity, the challenges of storing uncompressed data are not as severe as they once were.
Appendix
LZ4 vs GZIP Compression
As shown above, LZ4-compressed data achieves much higher levels of data reduction in VAST than GZIP-compressed data. For those who want to understand why, here is a deeper explanation of the differences between the two and why LZ4 is easier to reduce.
The LZ algorithm (used by both LZ4 and GZIP) achieves compression by replacing repeated occurrences of data with references to a single copy of that data earlier in the uncompressed stream. A match is encoded as a pair of numbers called a length-distance pair that points back into already-seen data, resulting in a shorter stream because matches are stored as references rather than repeated bytes (a toy illustration follows).
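The following toy example shows what a length-distance pair looks like in principle; it is not the real LZ4 or DEFLATE token format, just an illustration of back-referencing.

```python
# "to be or not to be": the second "to be" is emitted as a back-reference
# (distance=13, length=5) into data that has already been decoded.
raw = b"to be or not to be"
tokens = [b"to be or not ", ("copy", 13, 5)]  # literal bytes, then a back-reference


def expand(tokens):
    out = bytearray()
    for t in tokens:
        if isinstance(t, bytes):
            out += t  # literal bytes pass through unchanged
        else:
            _, distance, length = t
            start = len(out) - distance
            out += out[start:start + length]  # copy from earlier output
    return bytes(out)


assert expand(tokens) == raw
```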
Huffman coding (used by GZIP and ZSTD), on the other hand, is a form of variable-length encoding. Huffman codes exploit the fact that some characters occur more frequently than others to represent the same data in fewer bits. The compressor assigns a variable number of bits to each character depending on its frequency in the byte stream: some characters end up taking one bit, others two bits, others three, and so on. Because the most common symbols are encoded in the fewest bits, the resulting data stream is smaller than the original (a minimal sketch of the construction follows).
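As a minimal sketch of the idea (not the canonical Huffman tables that GZIP and ZSTD actually emit), the following builds a Huffman code from symbol frequencies; the most frequent symbol ends up with the shortest code.

```python
import heapq
from collections import Counter


def huffman_codes(data: bytes) -> dict:
    """Build a Huffman code by repeatedly merging the two least frequent subtrees."""
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]


codes = huffman_codes(b"abracadabra")
# 'a' (5 occurrences) gets a 1-bit code; the rare symbols get 3-bit codes.
print({chr(sym): code for sym, code in sorted(codes.items())})
```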
With respect to storage-level considerations:
- With LZ encoding, bytes of data that have only a single occurrence are untransformed in the data stream and remain unchanged as the data structure is laid down onto a storage device.
- The Huffman approach to bit-level variable-length encoding has a more disruptive effect on storage-level data reduction engines, as the compressor transforms all of the symbols in a data stream such that unique blocks are no longer recognizable. As a result, any correlative data reduction system will struggle to find commonality across compressed blocks, even though those blocks may have a high degree of commonality once the Huffman-induced ‘noise’ is stripped off. As such, GZIP and ZSTD (not shown here) are more difficult to reduce. A short demonstration follows.
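A quick way to see this difference is to embed a unique marker string in an otherwise repetitive buffer and compress it both ways: LZ4 emits unmatched bytes as plain literals, so the marker survives verbatim in the compressed output, while GZIP's Huffman pass re-encodes every byte and the marker disappears. This is a rough, self-contained illustration (using Python's gzip module and the third-party lz4 package), not the VAST Probe.

```python
import gzip

import lz4.frame  # third-party "lz4" package

marker = b"The quick brown fox jumps over the lazy dog"
data = b"A" * 100_000 + marker + b"A" * 100_000

for name, compress in (("LZ4 ", lz4.frame.compress), ("GZIP", gzip.compress)):
    out = compress(data)
    print(f"{name}: marker appears verbatim in compressed output: {marker in out}")
    # Expected with typical settings: True for LZ4, False for GZIP.
```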