As a System Admin, your goal is simple – you want to reduce backup storage capacity using the best technology available, but without breaking the bank. Unfortunately, this oftentimes isn’t as simple as it should be. With backup products that incorporate differently designed data deduplication technology, it can be challenging to know what’s really effective and what isn’t.
And, unfortunately, some use technology that was designed by cutting corners, or the deduplication technology itself doesn’t live up to expectations (e.g. only deduplicating one job at a time vs. across the entirety of your data set).
We believe it’s important to be fully transparent about how data deduplication works, so you get the most bang for your buck. When you break down the way the technology varies, it becomes easier to see which products deliver value and which deliver excuses.
Deduplication performance: inline deduplication vs. post-process deduplication
When it comes to the type of data deduplication technology you should use, the main consideration lies in the size of the data blocks to be deduplicated. As the block size increases (e.g. 250 KB, 512 KB, 1024 KB), deduplication becomes less efficient, because a larger block is less likely to exactly match one that is already stored. Smaller blocks find more duplicates, but they also produce more hash signatures to compute and compare, which highlights a direct trade-off between deduplication efficiency and compute resources.
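To make the block-size trade-off concrete, here is a minimal sketch that splits the same data into fixed-size blocks at two different sizes and counts how many unique blocks remain. The function name `dedup_stats`, the SHA-256 hash, and the sample data are assumptions chosen for illustration; they are not taken from any particular backup product.

```python
import hashlib

KB = 1024


def dedup_stats(data: bytes, block_size: int) -> tuple[int, int]:
    """Return (total_blocks, unique_blocks) for fixed-size blocks."""
    seen = set()
    total = 0
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        # One hash signature per block; identical blocks share a signature.
        seen.add(hashlib.sha256(block).digest())
        total += 1
    return total, len(seen)


# Hypothetical sample: five distinct 4 KB pages, with page A repeated.
A, B, C, D, E = (bytes([c]) * (4 * KB) for c in b"ABCDE")
sample = A + B + A + C + A + D + A + E

small = dedup_stats(sample, 4 * KB)  # 8 signatures to manage, 5 unique blocks
large = dedup_stats(sample, 8 * KB)  # 4 signatures to manage, 4 unique blocks
```

With 4 KB blocks the repeated page is detected and only 5 of 8 blocks need to be stored, at the cost of twice as many hash signatures. With 8 KB blocks every block looks unique, so nothing is saved.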
But there is an additional challenge which centers on managing all of the resulting hash signatures. Managing thousands of hash signatures is a simple matter compared to managing millions of hash signatures.
To achieve sub-millisecond access to the Hash Index Table, the technology requires very sophisticated data management structures. Unbeknownst to many, this technology is not simply available “off-the-shelf.” To develop this type of data deduplication technology, vendors must design their own hash table management scheme.
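As a rough illustration of what a hash index does, the sketch below maps each block's hash signature to a block ID and a reference count using an in-memory dictionary. The `HashIndex` class and its `lookup_or_add` method are hypothetical names; a production index holding millions of signatures would need far more than this (on-disk structures, caching, concurrency control), which is exactly the hard part described above.

```python
import hashlib


class HashIndex:
    """Toy in-memory hash index: signature -> [block_id, reference_count]."""

    def __init__(self) -> None:
        self._entries: dict[bytes, list[int]] = {}
        self._next_id = 0

    def lookup_or_add(self, block: bytes) -> tuple[int, bool]:
        """Return (block_id, is_new). Known blocks only gain a reference."""
        signature = hashlib.sha256(block).digest()
        entry = self._entries.get(signature)
        if entry is not None:
            entry[1] += 1          # duplicate: bump the reference count
            return entry[0], False
        block_id = self._next_id   # new block: assign the next free ID
        self._next_id += 1
        self._entries[signature] = [block_id, 1]
        return block_id, True

    def __len__(self) -> int:
        return len(self._entries)


index = HashIndex()
results = [index.lookup_or_add(b) for b in (b"alpha", b"beta", b"alpha")]
```

The third call finds the signature for `b"alpha"` already present, so it returns the existing block ID instead of creating a new entry.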
Fortunately, you can quickly identify a vendor’s level of hash table management sophistication by looking at the number of hash signatures they manage. If the deduplication method only supports large block sizes (e.g. 512 KB or 1024 KB), it manages relatively few hash signatures, and that’s a good indication that the deduplication is limited to a single backup job or storage volume.
On the flip side of the coin, the negative impact of having too many hash signatures is performance. If the time to compare hash values delays the disk write time, the application suffers. The alternative is to defer deduplication processing until after the disk write is complete. This method is called post-process deduplication, and it requires extra storage space because the full, undeduplicated data is written to disk first.
Many vendors will tell you that post-processing is advantageous, but the truth is, inline deduplication delivers more efficiency because data is processed before being written to disk – meaning that it only has to write data to disk once.
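The difference between the two approaches can be sketched by tracking how much disk space each one needs at its peak versus at the end. Both functions below are simplified assumptions for illustration (the names `inline_backup` and `post_process_backup` are hypothetical, and the “disk” is just a byte counter), not a description of how any specific product works.

```python
import hashlib


def inline_backup(blocks: list[bytes]) -> tuple[int, int]:
    """Hash each block before writing; only unseen blocks reach disk.

    Returns (peak_bytes_on_disk, final_bytes_on_disk).
    """
    index = set()
    written = 0
    for block in blocks:
        signature = hashlib.sha256(block).digest()
        if signature not in index:
            index.add(signature)
            written += len(block)
    return written, written


def post_process_backup(blocks: list[bytes]) -> tuple[int, int]:
    """Write everything first, then deduplicate in a later pass.

    Returns (peak_bytes_on_disk, final_bytes_on_disk).
    """
    staged = list(blocks)                       # full copy lands on disk
    peak = sum(len(b) for b in staged)
    unique = {hashlib.sha256(b).digest(): b for b in staged}
    final = sum(len(b) for b in unique.values())
    return peak, final


# Hypothetical job: four 4 KB blocks, three of them identical.
job = [b"x" * 4096, b"y" * 4096, b"x" * 4096, b"x" * 4096]
inline_result = inline_backup(job)
post_result = post_process_backup(job)
```

Both approaches end up storing the same two unique blocks, but the post-process version needs room for all four blocks before its cleanup pass runs.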
Key takeaway: It’s wise to consider the differences between post-process and inline deduplication before choosing a vendor, as a vendor may favor the less efficient method simply because it hasn’t developed a high-performance Hash Table Index management process. To save on storage space, ignore the argument that post-processing is advantageous and always demand inline processing for the data deduplication technology you ultimately choose.
Reducing data redundancy: target-side deduplication vs. source-side deduplication
When considering target-side or source-side deduplication, many vendors will claim that one method is preferable to the other, but won’t explain their reasoning.
The truth is, the real differences between the two methods are based on how they manage the Hash Table Index. In the case of target deduplication, the processing is performed on the backup appliance. In theory, this wouldn’t be an issue, except that it puts a significant strain on your network, since full backups, redundant data included, travel between the clients and the backup device. For this reason, target deduplication is often considered “old school,” and is quickly being replaced by source-side deduplication.
Like the name says, source-side deduplication is performed at the “source,” before data is transferred to the backup device. This method reduces redundancies before the data is transferred over the network, which, as you’d imagine, yields dramatic savings in bandwidth, required storage, and corresponding storage costs. Just imagine how many times you back up your Windows Server. Doesn’t it make more sense to send over only new data versus data that is redundant?
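The idea of sending only new data can be sketched as a simple exchange: the client hashes its blocks, asks the backup device which signatures it is missing, and uploads only those blocks. The `BackupTarget` class and `source_side_backup` function are hypothetical, and the sketch counts only block payload bytes (the small signatures themselves also cross the network in practice).

```python
import hashlib


class BackupTarget:
    """Toy backup device that stores blocks keyed by hash signature."""

    def __init__(self) -> None:
        self.store: dict[bytes, bytes] = {}

    def missing(self, signatures: list[bytes]) -> list[bytes]:
        """Return the signatures the device does not hold yet."""
        return [s for s in signatures if s not in self.store]

    def upload(self, blocks_by_signature: dict[bytes, bytes]) -> None:
        self.store.update(blocks_by_signature)


def source_side_backup(blocks: list[bytes], target: BackupTarget) -> int:
    """Send only blocks the target lacks; return payload bytes sent."""
    by_signature = {hashlib.sha256(b).digest(): b for b in blocks}
    needed = target.missing(list(by_signature))
    payload = {s: by_signature[s] for s in needed}
    target.upload(payload)
    return sum(len(b) for b in payload.values())


# Hypothetical server: OS and application blocks unchanged between runs.
os_block, app_block = b"O" * 4096, b"A" * 4096
monday = [os_block, app_block, b"1" * 4096]
tuesday = [os_block, app_block, b"2" * 4096]

target = BackupTarget()
first_run = source_side_backup(monday, target)    # everything is new
second_run = source_side_backup(tuesday, target)  # only the changed block
```

On the second run, only the one changed block is transferred, which is where the bandwidth savings come from.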
Here’s the caveat: source-side deduplication requires that data deduplication technology be added to each backup agent for physical machines, as well as integration with the local hypervisor for host-based agentless backup. More importantly, source-side deduplication requires a sophisticated workflow to optimize replication between the source client and the backup device. This is tough technology to develop, and one that not every vendor has.
Finally, global deduplication refers to the process of multiple backup devices federating (“sharing”) the Hash Table Index for maximum deduplication efficiency. This federation is not an easy task, and requires sophisticated merging algorithms to keep the Hash Index Tables in sync.
Knowing this, it makes sense why some vendors claim that source-side or global source-side deduplication consumes too many client compute resources, or that it’s only meant for VMware (not for physical machines or other virtual systems). That said, data deduplication technology that has been developed to perform global source-side deduplication well is available.
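In its simplest form, federating indexes means merging the signature tables of several devices so that a block stored on one device is recognized as a duplicate everywhere. The sketch below is a deliberately naive assumption (a one-shot union where the first device to hold a block stays its owner); the names `merge_indexes`, `device-a`, and `device-b` are hypothetical, and real federation also has to handle concurrent updates, deletions, and conflicts.

```python
import hashlib


def signature(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


def merge_indexes(*indexes: dict[str, str]) -> dict[str, str]:
    """Union several {signature: owning_device} tables.

    When two devices hold the same block, the first one listed stays
    the owner, so the block is stored only once across the federation.
    """
    merged: dict[str, str] = {}
    for index in indexes:
        for sig, device in index.items():
            merged.setdefault(sig, device)
    return merged


# Hypothetical per-device indexes; both hold the same OS image block.
site_a = {signature(b"os-image"): "device-a",
          signature(b"db-dump"): "device-a"}
site_b = {signature(b"os-image"): "device-b",
          signature(b"mail-store"): "device-b"}

global_index = merge_indexes(site_a, site_b)
```

The merged table has three entries rather than four, because the shared OS image block is counted once and attributed to `device-a`.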
Key Takeaway: When developed correctly, source-side or global source-side deduplication doesn’t consume too many resources, and can be performed across physical machines and virtual systems. Finding a vendor that has truly designed this data deduplication technology well can make or break your backup and recovery plan.
Data deduplication technology – the bottom line
Every backup benefits from deduplication, and it shouldn’t be viewed as a simple checkbox in a list of features. Integrating this powerful technology into your backup and recovery plan can deliver significant storage efficiency and reduce network traffic (and costs) if you leverage deduplication from a vendor that’s developed it well.