Optimizing Backup and Recovery: A Deep Dive into Data Deduplication and Compression

MARCH 21ST, 2024
Aftab Alam
Executive Vice President, Product Management

Your data is more than just an asset—it’s the lifeblood of your business. This is true for almost any business, with Statista forecasting global data creation will grow to more than 180 zettabytes by 2025. Statista also found that over 72 percent of companies worldwide were affected by ransomware in 2023. 

This increasing reliance on data, combined with an ever-evolving threat landscape, underscores that a robust backup and recovery plan is vital for business survival. An effective plan ensures business continuity in the face of threats, system failures, natural disasters, and cyberattacks. Without a comprehensive backup and disaster recovery strategy, your business risks severe financial, operational, and reputational damage.

With growing data volumes and complex IT requirements, deduplication and compression are two technologies that can help you optimize storage requirements and improve the efficiency of your backup and recovery plans.

Deduplication and Compression: Functionality and Benefits

Deduplication and compression are designed to reduce the storage footprint of data backups, making data protection more efficient and cost-effective. By eliminating redundant data and reducing data footprints, these technologies help you optimize your storage resources, reduce costs, and improve recovery time and recovery point objectives (RTOs/RPOs).

Deduplication identifies and removes duplicate copies of data across your storage environment. Instead of storing multiple copies of the same data, deduplication retains a single copy and creates pointers to the original data for any subsequent duplicates. Deduplication is especially effective in backup environments where the same data may be stored across multiple backup sets.

Compression reduces the size of data files by algorithmically eliminating redundancy within each file. It can be applied to a wide range of data types, making it a versatile tool for reducing storage requirements and accelerating data transfer rate, which is crucial for efficient disaster recovery.

How Backup Deduplication Works

Deduplication can be deployed at different levels: file level, block level, and byte level, with block level being the most common. In block-level deduplication, data is divided into unique blocks, which are then analyzed for redundancies. If a block is identical to one already stored, a reference to the existing block is created instead of keeping another copy.

This process depends on sophisticated indexing mechanisms to track all unique data blocks and their references to ensure quick access and restoration of data when needed. 

There are two primary types of deduplication: post-process deduplication, where data is first stored in its original form and then deduplicated, and inline deduplication—which Arcserve incorporates into its solutions—where data is deduplicated in real time as it's being written to the storage system. Inline deduplication is more efficient in terms of storage space but does require more processing power.

Here’s how data deduplication works:

Data Splitting
When a backup job starts, Arcserve Unified Data Protection (UDP) deduplicates data by segmenting it into blocks, with the default deduplication block size being 4KB. However, you can adjust this block size, with available options including 4KB, 8KB, 16KB, 32KB, and 64KB, depending on specific requirements and your desired balance between deduplication efficiency and resource utilization.

Hash Calculation
Each block is assigned a hash value, a unique identifier calculated from the data within that block.

Hash Comparison
The hash values are sent to the Recovery Point Server (RPS), where they’re compared to existing hashes in the backup repository. This step identifies redundant data by finding matching hashes.

Filtering and Backup
If a hash match is found, it indicates a duplicate block, which is then excluded from the backup. Only blocks with unique hashes — those with new or changed data — are sent to the RPS for storage. The RPS updates its database with each new entry, ensuring future backups are compared to the latest dataset. 
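The four steps above can be sketched in a few lines of Python. This is a simplified illustration, not Arcserve's actual implementation: the block store, the SHA-256 choice, and the "recipe" of hashes are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4 * 1024  # default deduplication block size (4KB)

class BlockStore:
    """Toy model of a deduplicating backup repository (illustrative only)."""

    def __init__(self):
        self.blocks = {}       # hash -> unique block data
        self.stored_bytes = 0  # bytes actually kept on disk

    def backup(self, data: bytes) -> list:
        """Split data into blocks, hash each, and store only unseen blocks.

        Returns the ordered list of hashes needed to reconstruct the data.
        """
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()  # hash calculation
            if digest not in self.blocks:               # hash comparison
                self.blocks[digest] = block             # unique: store the block
                self.stored_bytes += len(block)
            recipe.append(digest)                       # duplicate: store a reference only
        return recipe

    def restore(self, recipe: list) -> bytes:
        """Reassemble the original data by following the stored references."""
        return b"".join(self.blocks[h] for h in recipe)

store = BlockStore()
payload = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE  # three identical blocks plus one unique
recipe = store.backup(payload)
assert store.restore(recipe) == payload
# Four logical blocks were backed up, but only two unique blocks occupy storage.
```

Notice that the three identical blocks consume the space of one: the repository keeps a single copy plus lightweight references, which is where the storage savings come from.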

Arcserve UDP employs data deduplication to deliver faster backups by eliminating redundant data. It also streamlines the merge process to minimize performance impacts on your systems. You can deduplicate data across multiple agents to further enhance storage efficiency and backup speed on a global scale. You can also count on optimized and more reliable replication that ensures your data is quickly and efficiently mirrored to offsite locations for disaster recovery purposes. 

How Compression Works

Data compression further optimizes backup processes by reducing data traffic and storage requirements. Compression algorithms reduce the size of data blocks before they're stored. With Arcserve UDP, you can choose from several lossless compression levels, from no compression (to maximize performance) to maximum compression (to maximize storage efficiency). That allows you to balance your storage optimization needs against available CPU resources and your desired backup speed.

Here's how data compression works:

Pre-Compression Analysis
Before compressing data, Arcserve UDP evaluates the data to determine the potential compression ratio and ensure that applying compression will save substantial storage space.

Compression Process
Data is compressed using efficient algorithms that reduce its size while maintaining its integrity. The algorithm maps recurring sequences and redundancies within the data and replaces them with more concise and compact symbols or codes. 

Decompression works by reversing this process. The algorithm reads the compressed file and uses the stored mapping to look up each symbol or code, replacing it with the original sequence or pattern it represents. This process continues sequentially until all symbols or codes have been replaced with their original data.

A simple example would be a string that includes "BBBBB", which might be compressed to "5B", where "5" represents the number of times "B" repeats.
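That count-and-symbol scheme is known as run-length encoding, one of the simplest lossless compression techniques. A minimal sketch in Python (illustrative only; production backup software uses far more sophisticated algorithms) shows both the compression mapping and its reversal:

```python
def rle_encode(text: str) -> str:
    """Run-length encode a string: 'BBBBB' -> '5B'."""
    out = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                       # extend the run of repeated symbols
        out.append(f"{j - i}{text[i]}")  # emit count + symbol
        i = j
    return "".join(out)

def rle_decode(encoded: str) -> str:
    """Reverse the mapping: expand each count-symbol pair back to the original run."""
    out = []
    count = ""
    for ch in encoded:
        if ch.isdigit():
            count += ch                  # accumulate multi-digit counts
        else:
            out.append(ch * int(count))  # replace the code with the original run
            count = ""
    return "".join(out)

assert rle_encode("BBBBB") == "5B"
assert rle_decode(rle_encode("AAABBBBBCC")) == "AAABBBBBCC"
```

Decoding restores the data exactly, which is what makes the scheme lossless: no information is discarded, only redundancy.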

Post-Compression Data Handling
Once compressed, the data is prepared for storage or transmission. Compressed data requires less bandwidth for offsite replication and less storage space.

Get All the Benefits of Arcserve UDP

Find out just how effective Arcserve UDP is for data protection, backup, and disaster recovery by requesting a demo.

For expert help, choose an Arcserve technology partner.
