What is Deduplication?
The amount of data organisations need to collect and store on a daily basis is increasing exponentially, making the task of managing and protecting it harder and harder. This task is further complicated by the fact that much of the data is likely to be duplicated. Data deduplication allows organisations to operate more effectively by reducing the amount of duplicate data, increasing the efficiency of storage and backup systems and reducing storage costs.
Although data deduplication systems differ in the way they carry out the deduplication process, they all work by comparing data segments to find those that are repeated and can be replaced with a reference or pointer. One method, used in file-by-file comparison systems, compares two versions of the same file or fileset, looking for unique data. A second method, employed in block-based systems, works by segmenting a data stream into blocks and writing them to disk. In doing so, it creates a digital signature that acts as a unique fingerprint for each data segment and adds these signatures to an index. The index provides a reference list that allows data deduplication software to determine whether a block already exists in stored data. When it finds a duplicate block, rather than storing it again, the software inserts a pointer to the original block. If a block appears more than once, an additional pointer is created for each occurrence. As pointers are much smaller than blocks, they require far less disk storage space.
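The block-based method can be illustrated with a minimal Python sketch. It is not any particular product's implementation: the `DedupStore` class, the fixed block size and the choice of SHA-256 as the fingerprint are all assumptions made for illustration, and the "disk" is simply a list held in memory.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real systems may segment differently


class DedupStore:
    """Toy block store: unique blocks are kept once, repeats become pointers."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.index = {}   # fingerprint -> position of the block in self.blocks
        self.blocks = []  # unique blocks actually "written to disk"

    def write(self, data):
        """Store a byte stream and return its recipe, a list of pointers."""
        recipe = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            fingerprint = hashlib.sha256(block).hexdigest()
            if fingerprint not in self.index:
                # New block: store it and record its fingerprint in the index.
                self.index[fingerprint] = len(self.blocks)
                self.blocks.append(block)
            # New or duplicate, the stream only keeps a pointer to the block.
            recipe.append(self.index[fingerprint])
        return recipe

    def read(self, recipe):
        """Rebuild the original stream by following the pointers."""
        return b"".join(self.blocks[pointer] for pointer in recipe)


store = DedupStore(block_size=4)  # tiny blocks so the example is easy to follow
recipe = store.write(b"AAAABBBBAAAACCCCAAAA")
print(recipe)             # [0, 1, 0, 2, 0]
print(len(store.blocks))  # 3
```

Five blocks arrive but only three are stored; the two repeats of `AAAA` cost one pointer each, and `store.read(recipe)` returns the original bytes.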
In both systems the ultimate goal is to eliminate duplicate or redundant information, so that only unique data segments are stored and repeated ones are only referenced. When several versions of similar data sets are to be stored, either technology can provide very powerful savings and support highly optimised replication, in which only the unique segments need to be transmitted over a network. However, as checking for duplicate data takes longer than simply writing data to disk, the inevitable by-product of the deduplication process is some level of system overhead. So, one consideration when employing a deduplication approach is when this overhead occurs.
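The replication saving can be sketched in the same spirit. The function below is a hypothetical helper, not a real replication protocol: it assumes the target site can report the fingerprints it already holds, and it uses SHA-256 as the fingerprint purely for illustration.

```python
import hashlib


def fingerprint(block):
    """Assumed fingerprint function: SHA-256 of the block contents."""
    return hashlib.sha256(block).hexdigest()


def plan_replication(source_blocks, target_fingerprints):
    """Return the blocks that must cross the network, each at most once."""
    known = set(target_fingerprints)
    to_send = []
    for block in source_blocks:
        fp = fingerprint(block)
        if fp not in known:
            to_send.append(block)
            known.add(fp)  # a block repeated within this backup is sent only once
    return to_send


monday = [b"alpha", b"beta", b"gamma"]             # already replicated
tuesday = [b"alpha", b"beta", b"delta", b"delta"]  # mostly unchanged data
target = {fingerprint(block) for block in monday}

to_send = plan_replication(tuesday, target)
print(to_send)  # [b'delta']
```

Of Tuesday's four blocks only one is transmitted; the rest are already present at the target and need only be referenced there.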
Two approaches to deduplication exist, ‘Inline’ and ‘Deferred’. Each has its own merits and neither offers a single best solution, so selection should be based on which methodology is best suited to a specific backup job.
Inline deduplication
Deduplication is carried out during ingest (whilst the backup is running) and before data is written to a storage location.
Advantages
- Uses the least amount of disk capacity
- Can allow the deduplication process to begin whilst the backup is still running, so that deduplication finishes sooner.
Disadvantage
- Deduplication rates are linked to backup speeds. Where data arrives faster than it can be deduplicated, for instance if there is a high volume of new data to be processed, or where there are backup bursts or very fast servers, the backup may take longer to complete, lengthening the backup window.
Deferred or post-process deduplication
Widely used on commercial systems, this method copies all the data to disk first and then, after the backup ingest is complete, deduplicates the data as a deferred or post-data-transfer process.
Advantage
- Allows a short backup window, as there is no deduplication overhead to slow down the backup. Application servers can be back in service as soon as possible.
Disadvantages
- More disk space is required, as there must be enough to hold an entire backup job (with some systems requiring enough space for two backups to be held).
- The replication of unique data is delayed until deduplication starts, so replication takes longer to complete.
It is very likely that the combination of ingest and deduplication in a post process will take longer than ingesting and deduplicating at the same time. So, a post-process system is likely to be preferred when the most important issue is the length of the initial backup window.
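The trade-off can be put into rough numbers with a simple timing model. This is a sketch under stated assumptions, not a benchmark: it assumes constant ingest and deduplication rates, that an inline system runs at the slower of the two rates, and that a deferred system runs the two stages strictly one after the other. The function names and the figures used are hypothetical.

```python
def inline_times(data_gb, ingest_rate, dedup_rate):
    """Return (backup window, time until fully deduplicated) in hours.

    Assumption: inline throughput is capped by the slower of the two rates.
    """
    window = data_gb / min(ingest_rate, dedup_rate)
    return window, window


def deferred_times(data_gb, ingest_rate, dedup_rate):
    """Return (backup window, time until fully deduplicated) in hours.

    Assumption: deduplication starts only once the whole backup has landed.
    """
    window = data_gb / ingest_rate
    total = window + data_gb / dedup_rate
    return window, total


# Made-up example figures: 1200 GB backup, 400 GB/h ingest, 300 GB/h deduplication.
print(inline_times(1200, 400, 300))    # (4.0, 4.0)
print(deferred_times(1200, 400, 300))  # (3.0, 7.0)
```

With these illustrative figures the deferred system releases the application servers an hour earlier, but the data is not fully deduplicated, and so not ready for replication, until three hours after the inline system would have finished.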
Not all data storage jobs benefit significantly from deduplication. These include data that has been pre-compressed or encrypted, or that consists of randomised data in which segment patterns do not recur; some specialised image files, such as satellite images, fall into this category. In these cases it makes more sense to use native virtual tape libraries (VTL) or network attached storage (NAS).
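A quick way to see the difference is to measure how many blocks in a data set are actually unique. The sketch below is illustrative only: `dedup_ratio` is a hypothetical helper, the block size is assumed to be fixed, and random bytes from `os.urandom` stand in for encrypted or otherwise randomised data.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed fixed block size


def dedup_ratio(data, block_size=BLOCK_SIZE):
    """Return total blocks divided by unique blocks (1.0 means no saving)."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    if not blocks:
        return 1.0
    unique = {hashlib.sha256(block).digest() for block in blocks}
    return len(blocks) / len(unique)


# Five identical "full backups" of the same 64 KiB payload.
payload = os.urandom(16 * BLOCK_SIZE)
repeated_backups = payload * 5

# The same volume of data with no recurring segments.
randomised = os.urandom(80 * BLOCK_SIZE)

print(dedup_ratio(repeated_backups))
print(dedup_ratio(randomised))
```

The repeated backups should reduce by a factor of five, while the randomised data should show a ratio of 1.0: every block is unique, so the fingerprinting and index work brings no saving in storage.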