What is Data Deduplication and Why Does it Matter for Backup?
The Problem Deduplication Solves
When you back up data every day, you are not backing up 100% new data each time. Most files remain unchanged between backups. A document that hasn't been edited, a database that hasn't changed significantly, a virtual machine image that is largely the same as yesterday — all of these contain vast amounts of data that is identical to what was already backed up.
Without deduplication, every backup stores that redundant data again in full. With deduplication, the system identifies data that has already been stored and replaces it with a small reference pointer. Only genuinely new or changed data is stored in full.
The result is significantly reduced storage consumption — and lower costs.
How Deduplication Works
Deduplication works by dividing data into chunks (either fixed-size or variable-size) and calculating a unique fingerprint — typically a hash — for each chunk. When a new backup runs, each chunk is fingerprinted and compared against the existing database of stored chunks.
If the chunk already exists in storage, only the reference pointer is written. If it is new, the chunk is stored and its fingerprint added to the database.
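The chunk-and-fingerprint process above can be sketched in a few lines of Python. This is a minimal illustration, not any particular product's implementation: it assumes fixed-size chunks and SHA-256 fingerprints, and uses an in-memory dictionary as the chunk store.

```python
# Minimal sketch of fixed-size chunking with hash fingerprints.
# Chunk size, hash choice, and store structure are all illustrative.
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks; real systems often use variable sizes

def deduplicate(data: bytes, store: dict) -> list:
    """Split data into chunks; store only chunks not already seen.

    Returns the backup as a list of fingerprints (reference pointers).
    """
    pointers = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in store:   # new chunk: store it in full
            store[fingerprint] = chunk
        pointers.append(fingerprint)   # duplicate chunk: pointer only
    return pointers

store = {}
backup1 = deduplicate(b"A" * 8192, store)                # two identical chunks
backup2 = deduplicate(b"A" * 4096 + b"B" * 4096, store)  # one old, one new
print(len(store))  # 2 unique chunks stored across both backups
```

Note that both backups are fully restorable from their pointer lists, yet only two chunks of actual data are held in storage.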
Inline vs Post-Process Deduplication
Inline deduplication happens as data is being written. The system checks each chunk against the existing database before deciding whether to write it. This saves storage immediately but adds some processing overhead during the backup window.
Post-process deduplication writes data first, then deduplicates it afterwards. This is faster during the backup itself but requires more temporary storage and uses resources after the backup completes.
Source-Side vs Target-Side Deduplication
Source-side deduplication happens on the machine being backed up, before data is transmitted. This reduces both storage consumption and the amount of data sent over the network — particularly valuable for remote sites with limited bandwidth.
Target-side deduplication happens at the backup storage destination. More data travels over the network, but the processing burden is off the source machine.
Deduplication Ratios
The effectiveness of deduplication depends heavily on the type of data being backed up. Typical ratios:
- Virtual machine backups: 10:1 to 50:1 — VMs often share large amounts of common OS data
- File server backups: 5:1 to 20:1 — documents, spreadsheets, and emails often contain repeated content
- Database backups: 2:1 to 5:1 — databases tend to change more significantly between backups
- Already-compressed files (video, images, ZIP): near 1:1 — these contain little redundancy to remove
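Why the ratios vary so widely can be demonstrated with a toy experiment. The sketch below (chunk size and test data are invented for illustration) measures the ratio of logical chunks to unique chunks: highly repetitive data, like shared OS blocks across VMs, deduplicates dramatically, while random data, which behaves like already-compressed files, barely deduplicates at all.

```python
# Toy demonstration: repetitive data deduplicates well, random data does not.
import hashlib
import os

def dedup_ratio(data: bytes, chunk_size: int = 4096) -> float:
    """Ratio of logical chunks to unique (actually stored) chunks."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    unique = {hashlib.sha256(c).hexdigest() for c in chunks}
    return len(chunks) / len(unique)

vm_like = b"common OS block".ljust(4096, b"\0") * 100  # highly repetitive
random_like = os.urandom(4096 * 100)                   # no redundancy

print(dedup_ratio(vm_like))      # 100.0 — every chunk is identical
print(dedup_ratio(random_like))  # 1.0  — nothing to deduplicate
```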
Deduplication vs Compression
These two technologies are often used together but work differently. Deduplication eliminates redundant chunks across multiple backups. Compression reduces the size of individual chunks by encoding them more efficiently.
Both reduce storage consumption. Combined, they can dramatically reduce the total footprint of a backup repository compared to storing raw data.
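The combination can be sketched by compressing each unique chunk before storing it. This is an illustrative example, not a real backup engine: zlib stands in for whatever codec a product actually uses, and the byte counting is simplified.

```python
# Sketch of deduplication + compression: deduplicate first, then
# compress only the unique chunks. zlib stands in for any codec.
import hashlib
import zlib

def store_backup(data: bytes, store: dict, chunk_size: int = 4096) -> int:
    """Store data deduplicated and compressed; return bytes actually written."""
    written = 0
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in store:
            store[fp] = zlib.compress(chunk)  # compression shrinks each chunk
            written += len(store[fp])
    return written

store = {}
raw = b"the same log line\n" * 10_000  # ~180 KB of highly redundant data
written = store_backup(raw, store)
print(written, "bytes written instead of", len(raw))
```

Deduplication removes the repeated chunks across the data, and compression then shrinks the few unique chunks that remain, which is why the two together cut storage far more than either alone.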
Why It Matters for MSPs
For MSPs managing backup for multiple clients, deduplication directly affects the economics of the service. Less storage consumed means lower costs to the MSP, which either improves margin or allows more competitive pricing.
It also affects backup windows. Less data to transfer means faster backups, which is particularly important for clients with limited internet bandwidth or tight backup windows.
BOBcloud's backup platform includes built-in deduplication and compression, ensuring efficient storage use across all client backup sets. Find out more about our MSP backup platform.