
Published: 18 January 2024
Contributors: Phill Powell, Ian Smalley

What is data reduction?

Data reduction is the process by which an organization limits the amount of data it’s storing.

Data reduction techniques seek to lessen the redundancy in the original data set so that large amounts of originally sourced data can be stored more efficiently as reduced data.

At the outset, it should be stressed that the term “data reduction” does not automatically equate to a loss of information. In many instances, data reduction simply means that data is now being stored in a smarter fashion, perhaps after going through an optimization process and then being reassembled with related data in a more practical configuration.

Nor is data reduction the same thing as data deduplication, in which extra copies of the same data are purged for streamlining purposes; a sketch of that idea follows below. More accurately, data reduction combines aspects of several activities, such as data deduplication and data consolidation, to achieve its goals.
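
To make the deduplication piece concrete, here is a minimal Python sketch of content-hash deduplication. The `deduplicate_blocks` helper, the block-level granularity and the sample data are illustrative assumptions, not a description of any particular product’s implementation.

```python
import hashlib

def deduplicate_blocks(blocks: list[bytes]) -> tuple[list[bytes], list[int]]:
    """Keep one copy of each unique block; return the unique blocks plus
    an index list mapping every original block to its stored copy."""
    store: dict[str, int] = {}      # content hash -> index into unique_blocks
    unique_blocks: list[bytes] = []
    index_map: list[int] = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = len(unique_blocks)
            unique_blocks.append(block)
        index_map.append(store[digest])
    return unique_blocks, index_map

# Three blocks, two identical: only two copies are actually stored.
blocks = [b"alpha", b"beta", b"alpha"]
unique, refs = deduplicate_blocks(blocks)
print(len(unique), refs)  # 2 [0, 1, 0]
```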

A more comprehensive view of data

When data is discussed in the context of data reduction, we’re often speaking of data in its singular form, as individual data points, rather than in the collective sense typically used. One aspect of data reduction, for example, deals with defining the actual physical dimensions of individual data points.

There’s a considerable amount of data science involved in data-reduction activities. The material can be fairly complex and difficult to summarize concisely, a dilemma that has spawned its own term: interpretability, the degree to which a person can understand how a particular machine learning model works.

Grasping the meanings of some of these terms can be challenging because this is data seen from a near-microscopic perspective. We usually discuss data in its “macro” form, but in data reduction we often speak of data in its most “micro” sense. In practice, most discussions of this topic require some discussion at the macro level and some at the micro end of the scale.

Benefits of data reduction

When an organization reduces the volume of data it’s carrying, it typically realizes substantial financial savings in the form of lower storage costs, since less storage space is consumed.

Data reduction methods provide other advantages as well, such as increased data efficiency. Once data reduction has been achieved, the resulting data is easier for artificial intelligence (AI) methods to use in a variety of ways, including sophisticated data analytics applications that can greatly streamline decision-making tasks.

For example, when storage virtualization is used successfully, it helps coordinate server and desktop environments, enhancing their overall efficiency and making them more reliable.

Data reduction efforts also play a key role in data mining: data must be as clean and well prepared as possible before it’s mined and used for data analysis.

Types of data reduction

The following are some of the methods organizations can use to achieve data reduction.

Dimensionality reduction

The notion of data dimensionality underpins this entire concept. Dimensionality refers to the number of attributes (or features) in a single dataset. However, there’s a tradeoff at work here: the greater the dimensionality, the more data storage that dataset demands. Furthermore, the higher the dimensionality, the sparser the data tends to be, which complicates any necessary outlier analysis.

Dimensionality reduction counters these problems by limiting the “noise” in the data and enabling better visualization of it. A prime example of dimensionality reduction is the wavelet transform method, which assists image compression by maintaining the relative distance between objects at various resolution levels.

Feature extraction is another possible transformation for data, one that converts original data into numeric features and works in conjunction with machine learning. It differs from principal component analysis (PCA), another means of reducing the dimensionality of large data sets, in which a sizable set of variables is transformed into a smaller set that still retains most of the information from the original.
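
As a concrete illustration of dimensionality reduction, here is a minimal PCA sketch built on NumPy’s singular value decomposition. The `pca_reduce` helper, the random data and the choice of two components are assumptions for demonstration only.

```python
import numpy as np

def pca_reduce(X: np.ndarray, k: int) -> np.ndarray:
    """Project n-by-d data onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)                  # center each feature
    # SVD of the centered data; rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T                     # n-by-k reduced data

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                       # 100 points, 10 features
X_reduced = pca_reduce(X, k=2)                       # keep only 2 dimensions
print(X.shape, "->", X_reduced.shape)                # (100, 10) -> (100, 2)
```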

Numerosity reduction

Numerosity reduction involves selecting a smaller, less data-intensive format for representing data. There are two types of numerosity reduction: that based on parametric methods and that based on non-parametric methods. Parametric methods such as regression store only the parameters of a fitted model rather than the data itself. Similarly, a log-linear model might be employed that focuses on subspaces within the data. Meanwhile, non-parametric methods (like histograms, which show how numerical data is distributed) don’t rely upon models at all. A sketch of each appears below.
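
Here is a brief sketch of both flavors on assumed synthetic data: the parametric path keeps just two regression parameters in place of a thousand observations, while the non-parametric path keeps a ten-bucket histogram.

```python
import numpy as np

# Synthetic data: 1,000 noisy observations along a line.
x = np.arange(1000, dtype=float)
y = 3.2 * x + 5.0 + np.random.default_rng(1).normal(scale=2.0, size=1000)

# Parametric: fit a regression line and store only its two parameters
# instead of the 1,000 raw y values.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"stored parameters: slope={slope:.2f}, intercept={intercept:.2f}")

# Non-parametric: summarize the same values as a 10-bucket histogram.
counts, edges = np.histogram(y, bins=10)
print(counts)        # 10 counts stand in for 1,000 observations
```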

Data cube aggregation

Data cubes are a visual way to store data. The term “data cube” is almost misleading in its implied singularity, because it really describes a large, multidimensional cube composed of smaller, organized cuboids. Each cuboid represents some aspect of the total data within the data cube, specifically measurements summarized along particular dimensions. Data cube aggregation, therefore, is the consolidation of data into this multidimensional format, which reduces data size by summarizing detailed records at coarser levels within a container specifically built for that purpose.
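
As a rough illustration, the following sketch builds one small cuboid with pandas, aggregating a hypothetical sales table over its quarter dimension; the column names and figures are invented for the example.

```python
import pandas as pd

# Detailed records: three dimensions (region, product, quarter)
# plus one measure (revenue).
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "product": ["A", "B", "A", "B", "A", "A"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "revenue": [100, 150, 120, 90, 110, 130],
})

# Aggregate into a cuboid: one summarized cell per (region, product),
# rolled up over all quarters.
cube = sales.pivot_table(index="region", columns="product",
                         values="revenue", aggfunc="sum")
print(cube)
```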

Data discretization

Another method enlisted for data reduction is data discretization, in which a continuous range of data values is divided into a defined set of intervals, with each interval corresponding to a single representative data value.
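
A minimal sketch of that idea in Python: ages are mapped into assumed intervals, and each value is replaced by a representative value chosen for its interval. Both the boundaries and the representatives here are arbitrary, for illustration only.

```python
import numpy as np

ages = np.array([3, 17, 25, 41, 58, 73, 89])
bins = [0, 18, 35, 60, 120]                      # interval boundaries
labels = np.digitize(ages, bins)                 # interval index per value
representatives = {1: 10, 2: 27, 3: 48, 4: 80}   # one value per interval
discretized = [representatives[i] for i in labels]
print(discretized)   # [10, 10, 27, 48, 48, 80, 80]
```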

Data compression

To limit file size and achieve successful data compression, various types of encoding can be used. In general, data compression techniques fall into two groups: lossless compression and lossy compression. In lossless compression, data size is reduced through encoding techniques and algorithms, and the complete original data can be restored if needed. Lossy compression uses other methods to shrink data further, and although its processed data may still be worth retaining, it will not be an exact copy of the original, as you would get with lossless compression.
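
Here is a short sketch of the distinction, using Python’s standard zlib module for the lossless half and simple rounding as a stand-in for a lossy scheme; the payload and the precision kept are arbitrary choices.

```python
import zlib

original = b"data reduction " * 200          # highly repetitive payload

# Lossless: the complete original can be restored exactly.
compressed = zlib.compress(original, level=9)
restored = zlib.decompress(compressed)
print(len(original), "->", len(compressed), "bytes;",
      "exact copy:", restored == original)

# A toy "lossy" step: rounding measurements shrinks what must be
# stored, but the discarded precision cannot be recovered.
readings = [3.14159, 2.71828, 1.41421]
lossy = [round(r, 2) for r in readings]
print(lossy)                                  # [3.14, 2.72, 1.41]
```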

Data preprocessing

Some data needs to be cleaned, treated and processed before it undergoes the data analysis and data reduction processes. Part of that transformation may involve converting data from analog to digital form. Binning is another example of data preprocessing, in which values are replaced by the medians of the bins they fall into, smoothing the data and helping to ensure data integrity across the board.
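
To make binning concrete, here is a minimal sketch of smoothing by bin medians in NumPy; the `smooth_by_bin_medians` helper, the equal-width bins and the sample values are assumptions for illustration.

```python
import numpy as np

def smooth_by_bin_medians(values: np.ndarray, n_bins: int) -> np.ndarray:
    """Sort values into equal-width bins and replace each value with
    the median of its bin (smoothing by bin medians)."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Assign each value to a bin (clip so the max lands in the last bin)
    bin_ids = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
    smoothed = values.astype(float).copy()
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            smoothed[mask] = np.median(values[mask])
    return smoothed

data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
print(smooth_by_bin_medians(data, n_bins=3))
# [ 6.   6.  21.  21.  21.  26.5 26.5 26.5 26.5]
```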
