
Published: 3 January 2024
Contributors: Phill Powell, Ian Smalley

What is data deduplication?

Data deduplication is a streamlining process in which redundant data is reduced by eliminating extra copies of the same information. The goal of data deduplication, or “dedupe” as it’s commonly shortened, is to lessen an organization’s ongoing storage needs.

Organizations can implement data deduplication processes and techniques to make sure that only one unique instance of data exists within their storage system. Duplicate or redundant data is removed, and users are pointed to the single remaining instance of the data.

When data deduplication is successful, it can improve an organization’s overall storage utilization and help reduce costs.


Why is data deduplication needed?

So, why would a company create duplicate data in the first place? There are several valid reasons, including the following:

  • An organization or one of its departments may need to repurpose original data, so new data copies are created.
  • A company might want to retain duplicate copies as part of its backup system in case of a data-loss event.
  • An organization could find it has kept multiple copies of the same data but stored in different formats.

Another key reason for data duplication is simply that it occurs naturally in most multidepartment organizations. Data is regularly created or re-created as an accepted, organic function of doing business within a modern context. Therefore, neither data creation nor replication is the actual problem; excessive data proliferation is.

If there were no extra financial burdens associated with it, data proliferation might seem like less of a problem than it is. An organization could simply store data at various locations within its IT architecture and not worry about the redundancies.

But the fact is that a company does incur financial penalties by maintaining a large number of data redundancies, in the form of extra storage costs. Organizations that can’t stop creating data redundancies must allocate more labor and budget to new storage solutions and data management, whether through new hardware purchases or added cloud storage.

Benefits of data deduplication

The most obvious benefit of data deduplication techniques is that weeding out extraneous data lessens the total amount of data an organization must store and manage. This effectively increases the organization’s storage capacity because less data occupies the available space.

Aside from reduced storage costs, data deduplication offers other key advantages, such as streamlining data backup plans and strengthening disaster recovery efforts.

Another plus is improved data integrity: removing “deadweight” data helps make sure that the remaining data has been properly cleansed. Deduplicated systems also tend to run more efficiently and consume less energy.

Another benefit of data deduplication is how well it works with virtual desktop infrastructure (VDI) deployments, because the virtual hard disks behind a VDI’s remote desktops operate nearly identically. Popular Desktop as a Service (DaaS) products include Microsoft’s Azure Virtual Desktop and its Windows VDI. These products rely on virtual machines (VMs) that are created during the server virtualization process; in turn, those virtual machines power the VDI technology.

How does data deduplication work?

At its most basic level, data deduplication uses automated functions to identify duplicate data blocks and remove them. Working at the block level, the deduplication software analyzes chunks of data and preserves each unique chunk. When the software later detects a repetition of the same data block, it removes the repetition and puts a reference to the original data in its place.
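
To make the block-level mechanics concrete, here is a minimal Python sketch. It assumes fixed-size 4 KB blocks and an in-memory dictionary standing in for the block store; real products use techniques such as variable-size chunking, persistent indexes and verification of hash matches.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size, in bytes

# Maps each block's SHA-256 digest to the single stored copy of that block.
block_store: dict[str, bytes] = {}

def dedupe_blocks(data: bytes) -> list[str]:
    """Store each unique block once; describe the data as a list of references."""
    references = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in block_store:
            block_store[digest] = block  # first occurrence: keep the block
        references.append(digest)        # repeated blocks become references only
    return references

def rebuild(references: list[str]) -> bytes:
    """Reassemble the original data from its block references."""
    return b"".join(block_store[digest] for digest in references)
```

Passing the same file through dedupe_blocks a second time adds no new blocks to the store; only the lightweight list of references grows.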

An alternate method of data deduplication operates at the file level. Single-instance storage compares full copies of files within the file system rather than chunks or blocks of data. Like its block-level counterpart, file deduplication depends on keeping the original file and removing extra copies.
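
A file-level pass can be sketched in a few lines of Python: hash every file in full and group paths by digest. This is only an illustration; a real single-instance storage system would replace the duplicates with links or pointers rather than merely reporting them, and would hash large files in chunks instead of reading them whole.

```python
import hashlib
from pathlib import Path

def find_duplicate_files(root: str) -> dict[str, list[Path]]:
    """Group files by the SHA-256 digest of their full contents; any group
    with more than one path is a candidate for single-instance storage."""
    groups: dict[str, list[Path]] = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups.setdefault(digest, []).append(path)
    return {d: paths for d, paths in groups.items() if len(paths) > 1}
```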

Deduplication techniques do not work in quite the same manner as data compression algorithms (for example, LZ77, LZ78), although it’s true that both pursue the same general goal of reducing data redundancies. Deduplication techniques achieve this on a larger, macro scale than compression algorithms, whose goal is less about replacing identical files with shared copies and more about efficiently encoding data redundancies.
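
A toy comparison illustrates the difference. Compression encodes redundancy inside a single byte stream, while deduplication keeps one physical copy of a repeated chunk plus lightweight references, regardless of where the copies live. The figures printed below depend on the made-up data and are for illustration only.

```python
import hashlib
import zlib

block = bytes(range(256)) * 16  # one 4 KB block of data
stream = block * 1000           # the same block repeated 1,000 times

# Compression: efficiently encodes redundancy inside a single byte stream.
compressed = zlib.compress(stream)

# Deduplication: keeps one physical copy plus lightweight references.
digest = hashlib.sha256(block).hexdigest()
store = {digest: block}       # a single stored block
references = [digest] * 1000  # the logical view of the stream

print(f"raw: {len(stream):,} bytes")
print(f"compressed: {len(compressed):,} bytes")
print(f"deduplicated: {len(block):,} bytes + {len(references):,} references")
```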

Types of data deduplication

There are two basic types of data deduplication, distinguished by when the process occurs.

Inline deduplication

This form of data deduplication occurs in real time as data flows within the system. The system carries less data traffic because it neither transfers nor stores duplicate data, which can reduce the total bandwidth the organization needs.
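
In a sketch of the inline approach, the hash check sits directly on the write path, so a duplicate block is never written at all. The names and the in-memory store below are hypothetical stand-ins for a storage device.

```python
import hashlib

store: dict[str, bytes] = {}  # digest -> the single stored copy

def inline_write(block: bytes) -> str:
    """Hash each block as it arrives; a duplicate is never written at all."""
    digest = hashlib.sha256(block).hexdigest()
    if digest not in store:
        store[digest] = block  # unique data pays the write cost exactly once
    return digest              # the caller keeps only this reference
```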

Post-process deduplication

This type of deduplication takes place after data has been written and placed on some type of storage device.

Both types of data deduplication depend on the hash calculations inherent to the process. These cryptographic calculations identify repeated patterns in data. During inline deduplication, the calculations are performed in the moment, which can be computationally intensive and temporarily tax system performance. In post-process deduplication, the hash calculations can be performed at any time after the data is written.
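
By contrast with the inline sketch above, a post-process pass runs after the data is already on disk: it scans the written blocks, hashes each one and collapses duplicates into references. The dictionary of address-to-block pairs below is a hypothetical stand-in for a storage volume.

```python
import hashlib

def post_process_scan(volume: dict[str, bytes]) -> dict[str, str]:
    """Scan already-written blocks and collapse duplicates into references."""
    seen: dict[str, str] = {}       # digest -> address of the kept copy
    redirects: dict[str, str] = {}  # duplicate address -> canonical address
    for address, block in list(volume.items()):
        digest = hashlib.sha256(block).hexdigest()
        if digest in seen:
            redirects[address] = seen[digest]  # point at the original copy
            del volume[address]                # reclaim the duplicate's space
        else:
            seen[digest] = address             # first copy stays in place
    return redirects
```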

The subtle differences between deduplication types don’t end there. A second way to classify deduplication types is based on where such processes occur.

Source deduplication

This form of deduplication takes place near where new data is generated. The system scans that area and detects new copies of files, which are then removed.

Target deduplication

Target deduplication is basically an inversion of source deduplication. In target deduplication, the system deduplicates any copies that are found in areas other than where the original data was created.
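
The distinction matters most in backup scenarios. A hypothetical source-side routine might look like the following, where target_index stands in for a query against the backup target’s hash index; under target deduplication, the same check would instead run at the destination after all the data had been transferred.

```python
import hashlib

def source_side_backup(chunks: list[bytes], target_index: set[str]) -> list[bytes]:
    """Hash each chunk where it is generated; transfer only unseen chunks."""
    to_send = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in target_index:  # the target has never stored this chunk
            to_send.append(chunk)
            target_index.add(digest)
        # duplicates never leave the source, saving transfer bandwidth
    return to_send
```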

Because different deduplication methods are practiced, organizations must make careful, considered decisions about the type of deduplication they choose, balancing that method against the company’s particular needs.

In many use cases, an organization’s deduplication method of choice comes down to internal variables, such as the following:

  • How many and what type of data sets are being created
  • The organization’s primary storage system
  • Which virtual environments are in use
  • Which apps the company relies upon

Related solutions
IBM® Storage FlashSystem

Minimize the potential for operational disruptions and isolate workloads from ransomware attacks and other cyberthreats. Add speed to your cyber-resilience posture so your company can suffer less loss and return to normal operations faster.

Explore IBM Storage FlashSystem

IBM Storage Protect

Bring power to data backup and recovery with IBM Storage Protect. Meet software that enhances the data resilience of physical file servers, providing extra efficiencies and a scalable solution for governing billions of objects per backup server.

Explore IBM Storage Protect

IBM Storage as a Service

Slash storage infrastructure costs with an on-premises data storage solution. You bring the data; IBM supplies the storage system. FlashSystem and IBM DS8900F hardware give you a flexible, consumption-based STaaS model that operates like a cloud.

Explore IBM Storage as a Service
Resources

What is data storage?

Explore the basics of data storage, including storage device types and different formats of data storage.

What is data migration?

Gain a better grasp of how data flows from one storage system or computing environment to another.

What is data architecture?

See why successful data management begins with a solid blueprint in the form of a data architecture.

What is data security?

There’s no more urgent topic in computing or business. Get the fundamentals about data protection.

Take the next step

Simplify data and infrastructure management with IBM Storage FlashSystem, a high-performance, all-flash storage solution that streamlines administration and operational complexity across on-premises, hybrid cloud, virtualized and containerized environments.

Explore FlashSystem storage