Data observability refers to the practice of monitoring, managing and maintaining data in a way that ensures its quality, availability and reliability across various processes, systems and pipelines within an organization.
Data observability is about truly understanding the health of your data and its state across your data ecosystem. It includes a variety of activities that go beyond traditional monitoring, which only describes a problem. Data observability can help identify, troubleshoot and resolve data issues in near-real time.
Using data observability tools is essential for getting ahead of bad data, which strikes at the heart of data reliability. These tools enable automated monitoring, triage alerting, tracking, comparisons, root cause analysis, logging, data lineage and service level agreement (SLA) tracking, all of which work together to help practitioners understand end-to-end data quality, including data reliability.
Implementing a data observability solution is especially important for modern data teams, where data is used to gain insights, develop machine learning models and drive innovation. It ensures that data remains a valuable asset rather than a potential liability.
Data observability needs to be infused consistently throughout the end-to-end data lifecycle. That way, all data management activities involved are standardized and centralized across teams for a clear and uninterrupted view of issues and impacts across the organization.
Data observability is the natural evolution of the data quality movement, and it is what makes the practice of data operations (DataOps) possible.
Plain and simple, most organizations believe their data is unreliable: industry surveys from Precisely1 and Capgemini2 point to widespread concerns among data leaders about the trustworthiness of their data.
The impact of this bad data can’t be overestimated. In May 2022, Unity Software discovered it had been ingesting bad data from a large customer, which led to a 30% plunge in the company’s stock3 and ultimately cost the business USD 110 million in lost revenue4.
Traditionally, it’s been difficult to identify bad data until it’s too late. Unlike when an application goes down and it affects thousands of users immediately, businesses can operate on bad data unknowingly for quite some time. For example, a sales team would know right away if a Salesforce dashboard wasn’t loading, but there’s no telling how long it would take them to figure out that a dashboard was showing incorrect data.
Data observability is the best defense against bad data getting through. It monitors data pipelines to ensure complete, accurate and timely delivery of data so that data teams can prevent data downtime, meet data SLAs and maintain the business’s trust in the data it sees.
Modern data systems provide a wide variety of functionality, allowing users to store and query their data in many different ways. Of course, the more functionality you add, the more complicated it becomes to ensure that your system works correctly. Several factors contribute to this complexity.
In the past, data infrastructure was built to handle small amounts of data—usually operational data from a few internal data sources—and the data was not expected to change very much. Now, many data products rely on data from internal and external sources, and the sheer volume and velocity at which this data is collected can cause unexpected drift, schema changes, transformations and delays.
More data ingested from external data sources means you need to transform, structure and aggregate data that arrives in many different formats to make it usable. Even worse, if those formats change at all, it causes a domino effect of failures downstream as strictly coded logic fails to adapt to the new schema.
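To make this concrete, here is a minimal sketch of a defensive schema check that could run at ingestion time, written in Python with pandas. The expected column names and types are illustrative assumptions; in practice they would come from a data contract or schema registry.

```python
# A minimal sketch of a defensive schema check at ingestion time.
# The expected columns and dtypes below are illustrative assumptions.
import pandas as pd

EXPECTED_COLUMNS = {"order_id": "int64", "amount": "float64", "created_at": "object"}

def check_schema(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable schema problems (an empty list means OK)."""
    problems = []
    for column, expected_dtype in EXPECTED_COLUMNS.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != expected_dtype:
            problems.append(f"{column}: expected {expected_dtype}, got {df[column].dtype}")
    for column in df.columns:
        if column not in EXPECTED_COLUMNS:
            problems.append(f"unexpected column: {column}")
    return problems

# A payload whose format has drifted: amounts arrive as strings, created_at is gone.
incoming = pd.DataFrame({"order_id": [1, 2], "amount": ["19.99", "5.00"]})
for issue in check_schema(incoming):
    print("schema drift:", issue)  # fail fast instead of letting downstream logic break
```

Catching the drift at the point of ingestion keeps a single format change from cascading into every downstream transformation.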
Complex ingestion pipelines have created a market for tools that simplify this end-to-end process, mostly by automating the ingestion and extract, transform, load (ETL)/extract, load, transform (ELT) processes. Combined, these tools form a data platform the analytics industry has dubbed the “modern data stack,” or MDS. The goal of the MDS is to reduce the amount of time it takes for data to be made usable by end users (typically analysts) so they can start leveraging that data faster. However, the more automation you have, the less control you have over how data is delivered. Organizations that need stronger guarantees often build out custom data pipelines so they can better ensure data is delivered as expected.
Data operations (DataOps) is a workflow that enables an agile delivery pipeline and feedback loop so that businesses can create and maintain their products more efficiently. DataOps allows companies to use the same tools and strategies throughout all phases of their analytics projects, from prototyping to product deployment.
The DataOps cycle outlines the fundamental activities needed to improve data management within the DataOps workflow. This cycle consists of three distinct stages: detection, awareness and iteration.
It’s important that this cycle starts with detection, because data quality is the bedrock of the DataOps movement.
The first stage of the DataOps cycle, detection, is validation-focused. It relies on the same data quality checks that have been used since the inception of the data warehouse: column-level schema checks and row-level validations. Essentially, you are ensuring that all datasets adhere to the business rules defined in your data system.
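As a rough illustration, the sketch below expresses a few such business rules as row-level checks over a table, using Python and pandas. The table and the rules themselves are hypothetical examples, not a prescribed rule set.

```python
# A minimal sketch of detection-stage validation: business rules expressed as
# row-level checks. The orders table and the rules are illustrative only.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [101, 102, 102, 104],
    "amount": [25.0, -3.0, 12.5, None],
    "status": ["shipped", "pending", "pending", "unknown"],
})

rules = {
    "order_id is unique": ~orders["order_id"].duplicated(),
    "amount is not null": orders["amount"].notna(),
    "amount is non-negative": orders["amount"].fillna(0) >= 0,
    "status is a known value": orders["status"].isin(["pending", "shipped", "delivered"]),
}

for name, passed in rules.items():
    failures = int((~passed).sum())
    print(f"{'PASS' if failures == 0 else 'FAIL'} {name} ({failures} bad rows)")
```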
The data quality framework that lives in the detection stage is important, but it is reactive by nature. It tells you whether the data that’s already stored in your data lake or data warehouse (and likely already being used) is in the form you expect.
It’s also important to note that you are validating datasets and following business rules you know. If you don’t know the causes of issues, you cannot establish new business rules for your engineers to follow. This realization fuels the demand for a continuous data observability approach that ties directly into all stages of your data lifecycle, starting with your source data.
Awareness is the visibility-focused stage of the DataOps cycle. This is where the conversation around data governance comes into the picture and a metadata-first approach is introduced. Centralizing and standardizing pipeline and dataset metadata across your data ecosystem gives teams visibility into issues across the entire organization.
The centralization of metadata is crucial to giving the organization awareness of the end-to-end health of its data, and it allows you to move toward a more proactive approach to solving data issues. If bad data enters your “domain,” you can trace the error to a specific point upstream in your data system. For example, Data Engineering Team A can look at Data Engineering Team B’s pipelines, understand what’s going on and collaborate with them to fix the issue.
The reverse also applies: Data Engineering Team B can detect an issue and trace the impact it will have on downstream dependencies, so Data Engineering Team A knows the issue is coming and can take whatever measures are necessary to contain it.
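One way to picture this is as a dependency graph over datasets. The sketch below, using hypothetical dataset names, walks such a graph to list everything downstream of a problem; the same structure, traversed in the other direction, supports tracing an error back upstream.

```python
# A minimal sketch of impact analysis over centralized lineage metadata.
# The dataset names and dependency graph are purely illustrative.
from collections import deque

# dataset -> datasets that consume it (its downstream dependencies)
lineage = {
    "raw.orders": ["staging.orders_clean"],
    "staging.orders_clean": ["marts.daily_revenue", "marts.customer_ltv"],
    "marts.daily_revenue": ["dashboards.exec_kpis"],
    "marts.customer_ltv": [],
    "dashboards.exec_kpis": [],
}

def downstream_of(dataset: str) -> set[str]:
    """Breadth-first walk of everything that depends on `dataset`."""
    impacted, queue = set(), deque([dataset])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

# Team B spots a problem in raw.orders and can warn every affected owner.
print(sorted(downstream_of("raw.orders")))
# ['dashboards.exec_kpis', 'marts.customer_ltv', 'marts.daily_revenue', 'staging.orders_clean']
```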
In the iteration stage, teams focus on data-as-code. This stage of the cycle is process-focused: teams ensure that they have repeatable, sustainable standards that are applied to all data development, so that the same trustworthy data comes out at the end of those pipelines.
The gradual improvement of the data platform’s overall health is now made possible by the detection of issues, awareness of the upstream root causes and efficient processes for iteration.
A well-executed data observability strategy can deliver a range of benefits that contribute to better data quality, decision-making, reliability and overall organizational performance. These include:
Data observability allows teams to detect issues such as missing values, duplicate records or inconsistent formats early on, before they affect downstream dependencies. With higher-quality data, organizations can make better, data-driven decisions that lead to improved operations, customer satisfaction and overall performance.
Data observability enables teams to swiftly identify errors or deviations in data through anomaly detection, real-time monitoring and alerts. Faster troubleshooting and issue resolution help minimize the cost and severity of data downtime.
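As an illustration of the idea, the sketch below flags a daily row count that deviates sharply from recent history and raises an alert. The z-score threshold and the sample counts are assumptions chosen for the example.

```python
# A minimal sketch of metric-based anomaly detection on a daily row count.
# The history, today's count and the z-score threshold are illustrative.
import statistics

def is_anomalous(history: list[int], latest: int, z_threshold: float = 3.0) -> bool:
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1.0  # guard against a perfectly flat history
    return abs(latest - mean) / stdev > z_threshold

row_counts = [10_120, 9_980, 10_340, 10_050, 10_210, 9_890, 10_150]
todays_count = 4_500  # e.g. an upstream job silently dropped half the data

if is_anomalous(row_counts, todays_count):
    # in practice this would page an on-call channel or open a ticket
    print(f"ALERT: today's row count {todays_count} deviates sharply from recent history")
```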
By using shared dashboards offered by data observability platforms, various stakeholders can gain visibility into the status of critical datasets, which can foster better collaboration across teams.
Data observability tools help pinpoint bottlenecks and performance issues, enabling engineers to optimize their systems for better resource usage and quicker processing times. In addition, automation reduces the time and effort required to maintain the health of your data, allowing data engineers, analysts and data scientists to focus their efforts on deriving value from the data.
Data observability can help organizations in highly regulated industries such as finance, healthcare and telecommunications ensure that their data meets the necessary standards for accuracy, consistency and security. This reduces the risk of non-compliance and associated penalties.
High-quality data is essential for understanding customer needs, preferences and behaviors, which, in turn, enables organizations to deliver more personalized, relevant experiences. Data observability can help organizations maintain accurate, up-to-date customer data, leading to improved customer satisfaction and loyalty.
By improving data quality through observability, organizations can unlock new insights, identify trends and discover potential revenue-generating opportunities. By making the most of their data assets, organizations can increase revenue and accelerate growth.
Together, the five pillars of data observability provide valuable insight into the quality and reliability of your data.
Freshness describes how up to date your data is and how frequently it is updated. Data staleness occurs when there are important gaps in time when the data has not been updated. Often, when data pipelines break it is due to a freshness issue.
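A freshness check can be as simple as comparing a table’s last update time against the cadence you expect, as in the Python sketch below; the refresh interval and timestamp are assumptions for illustration.

```python
# A minimal sketch of a freshness check. The expected refresh interval and
# the last-updated timestamp are illustrative; the latter would normally be
# read from warehouse metadata.
from datetime import datetime, timedelta, timezone

EXPECTED_INTERVAL = timedelta(hours=1)  # assume the table should refresh hourly
last_updated_at = datetime(2024, 5, 1, 6, 15, tzinfo=timezone.utc)

lag = datetime.now(timezone.utc) - last_updated_at
if lag > EXPECTED_INTERVAL:
    print(f"Freshness breach: data is {lag} old, expected an update every {EXPECTED_INTERVAL}")
```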
An indicator of your data’s field-level health, distribution refers to whether or not the data falls within an accepted range. Deviations from the expected distribution might indicate data quality issues, errors or changes in the underlying data sources.
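For instance, a field-level distribution check might verify that values stay within an accepted range and that the null rate has not spiked, as in this sketch using pandas; the column name and thresholds are illustrative.

```python
# A minimal sketch of a field-level distribution check using pandas.
# The accepted range and null-rate threshold are illustrative assumptions.
import pandas as pd

payments = pd.DataFrame({"amount": [12.5, 18.0, None, 250_000.0, 22.4]})

out_of_range = int(((payments["amount"] < 0) | (payments["amount"] > 10_000)).sum())
null_rate = payments["amount"].isna().mean()

if out_of_range > 0 or null_rate > 0.01:
    print(f"Distribution alert: {out_of_range} out-of-range values, null rate {null_rate:.1%}")
```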
Volume refers to the amount of data being generated, ingested, transformed and moved through various processes and pipelines. It also refers to the completeness of your data tables. Volume is a key indicator as to whether or not your data intake is meeting expected thresholds.
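A basic volume check compares the latest load against what recent history suggests is normal; the row counts and the 50% floor below are illustrative assumptions.

```python
# A minimal sketch of a volume check against recent load history.
# The counts and the 50% floor are illustrative assumptions.
recent_loads = [48_200, 51_900, 50_400, 49_700]
todays_load = 12_300

expected_minimum = 0.5 * (sum(recent_loads) / len(recent_loads))
if todays_load < expected_minimum:
    print(f"Volume alert: {todays_load} rows ingested, expected at least {expected_minimum:.0f}")
```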
Schema describes the organization of your data. Schema changes often result in broken data. Data observability helps ensure that your data is organized consistently, is compatible across different systems and maintains its integrity throughout its lifecycle.
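One lightweight way to watch for schema changes is to diff snapshots of a table’s column definitions between runs, as sketched below; the columns and types are hypothetical, and real snapshots would come from the warehouse’s information schema.

```python
# A minimal sketch of schema-change detection by diffing two column snapshots.
# Column names and types are hypothetical.
yesterday = {"order_id": "BIGINT", "amount": "DECIMAL(10,2)", "created_at": "TIMESTAMP"}
today = {"order_id": "BIGINT", "amount": "VARCHAR", "updated_at": "TIMESTAMP"}

added = today.keys() - yesterday.keys()
removed = yesterday.keys() - today.keys()
retyped = {c for c in today.keys() & yesterday.keys() if today[c] != yesterday[c]}

print(f"added={sorted(added)} removed={sorted(removed)} type_changed={sorted(retyped)}")
# added=['updated_at'] removed=['created_at'] type_changed=['amount']
```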
Lineage’s purpose is to answer the question, “Where?” when data breaks. It looks at the data from its source to its end location and notes any changes, including what changed, why it changed and how it changed along the way. Lineage is most often represented visually.
Data observability supports data quality, but the two are different aspects of managing data.
While data observability practices can point out quality problems in data sets, they can’t on their own guarantee good data quality. That requires efforts to fix data issues and to prevent them from occurring in the first place. On the other hand, an organization can have strong data quality even if it doesn’t implement a data observability initiative.
Data quality monitoring measures whether the condition of data sets is good enough for their intended uses in operational and analytics applications. To make that determination, data is examined based on various dimensions of quality, such as accuracy, completeness, consistency, validity, reliability and timeliness.
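To show how such dimensions can be measured in practice, the sketch below scores a small table on completeness, validity and uniqueness; the dataset and the rule behind each score are illustrative assumptions.

```python
# A minimal sketch of scoring a dataset against a few quality dimensions.
# The customer table and the rule behind each dimension are illustrative.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "not-an-email"],
})

scores = {
    "completeness (email present)": customers["email"].notna().mean(),
    "validity (email contains '@')": customers["email"].fillna("").str.contains("@").mean(),
    "uniqueness (customer_id)": (~customers["customer_id"].duplicated()).mean(),
}
for dimension, score in scores.items():
    print(f"{dimension}: {score:.0%}")
```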
Data observability and data governance are complementary processes that support each other.
Data governance aims to ensure that an organization’s data is available, usable, consistent and secure, and that it’s used in compliance with internal standards and policies. Governance programs often incorporate or are closely tied to data quality improvement efforts.
A strong data governance program helps eliminate the data silos, data integration problems and poor data quality that can limit the value of data observability practices.
Data observability can aid the governance program by monitoring changes in data quality, availability and lineage.
Not all data observability is created equal. The level of context you can achieve depends on what metadata you can collect and provide visibility into, a progression known as the hierarchy of data observability. Each level is a foundation for the next and gives you an increasingly fine-grained view of your data.
Getting visibility into your operational and dataset health is a sound foundation for any data observability framework.
Monitoring dataset health refers to monitoring your dataset as a whole. You gain awareness of the state of your data while it sits in a fixed location, which is referred to as “data at rest.”
Dataset monitoring answers questions such as: Was the table updated on schedule? Does it contain the expected number of rows? Do the values fall within expected ranges?
Operational monitoring refers to monitoring the state of your pipelines. This type of monitoring gives you awareness of the state of your data while it is being transformed and moved through your pipelines, a state referred to as “data in motion.”
Pipeline monitoring answers questions such as: Did the job run? Did it complete successfully and on time? How long did it take compared to previous runs?
While dataset and data pipeline monitoring are usually separated into two different activities, it’s essential to keep them coupled to achieve a solid foundation of observability. These two states are highly interconnected and dependent on each other. Siloing out these two activities into different tools or teams makes it more challenging to get a high-level view of your data’s health.
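One practical way to keep the two coupled is to have every pipeline run emit a single observation that carries both its run status and the health metrics of the dataset it produced, as in the sketch below; the field names and the idea of writing JSON to a shared metadata store are assumptions for illustration.

```python
# A minimal sketch of coupling pipeline ("data in motion") and dataset
# ("data at rest") signals in one observation record. Field names are
# illustrative; the record would normally land in a shared metadata store.
import json
from datetime import datetime, timezone

def emit_observation(pipeline: str, run_status: str, table: str,
                     row_count: int, null_rates: dict) -> str:
    observation = {
        "observed_at": datetime.now(timezone.utc).isoformat(),
        "pipeline": pipeline,      # data in motion: which job produced this
        "run_status": run_status,  # succeeded / failed / late
        "table": table,            # data at rest: what the job left behind
        "row_count": row_count,
        "null_rates": null_rates,
    }
    return json.dumps(observation)

print(emit_observation("orders_daily_load", "succeeded", "marts.daily_revenue",
                       48_211, {"amount": 0.0, "region": 0.02}))
```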
Column-level profiling is the next level in this hierarchy. Once that monitoring foundation has been laid, column-level profiling gives you the insights you need to establish new business rules for your organization and enforce existing ones at the column level, as opposed to just the row level.
This level of awareness allows you to improve your data quality framework in a very actionable way.
It allows you to answer questions such as: What range of values does each column contain? How is the distribution of those values shifting over time? Are null or duplicate values creeping in?
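As a simple illustration, the sketch below profiles each column of a small table for null rate, distinct count and, for numeric fields, min and max; the sample data is hypothetical.

```python
# A minimal sketch of column-level profiling with pandas. The sample data
# is hypothetical; real profiles would be stored and compared over time.
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "DE", "DE", None, "FR"],
    "amount": [10.0, 42.5, 42.5, 7.0, 980.0],
})

profile = {}
for column in df.columns:
    series = df[column]
    stats = {
        "null_rate": round(series.isna().mean(), 3),
        "distinct": int(series.nunique()),
    }
    if pd.api.types.is_numeric_dtype(series):
        stats.update(min=float(series.min()), max=float(series.max()))
    profile[column] = stats

print(profile)
```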
From here, you can move up to the final level of observability: row-level validation. This looks at the data values in each row and validates that they are accurate.
This type of observability looks at the values in individual rows: for example, whether each field matches its expected format, falls within valid bounds and satisfies the business rules that apply to that record.
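The sketch below shows the flavor of such a check, surfacing the specific rows that fail and why; the event table and the rules are hypothetical.

```python
# A minimal sketch of row-level validation that reports failing rows and
# the reason each one failed. The events table and rules are hypothetical.
import pandas as pd

events = pd.DataFrame({
    "event_id": ["e1", "e2", "e3"],
    "user_id": ["u1", None, "u3"],
    "duration_ms": [120, 95, -40],
})

failures = []
for _, row in events.iterrows():
    reasons = []
    if pd.isna(row["user_id"]):
        reasons.append("user_id is null")
    if row["duration_ms"] < 0:
        reasons.append("duration_ms is negative")
    if reasons:
        failures.append({"event_id": row["event_id"], "reasons": reasons})

print(failures)
# [{'event_id': 'e2', 'reasons': ['user_id is null']},
#  {'event_id': 'e3', 'reasons': ['duration_ms is negative']}]
```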
When organizations get tunnel vision on row-level validation, it becomes difficult to see the forest for the trees. By building an observability framework that starts with operational and dataset monitoring, you can get big-picture context on the health of your data while still homing in on the root cause of issues and their downstream impacts.
Building a successful observability pipeline typically involves a series of steps that integrate various tools and technologies and require collaboration among different teams within an organization.
Building an observability pipeline is a continuous process of learning and refinement. It’s crucial to start small, learn from the experience and incrementally expand your observability capabilities.
1 Data Integrity Trends: Chief Data Officer Perspectives in 2021, Precisely, June 2021
2 The data powered enterprise: Why organizations must strengthen their data mastery, Capgemini, February 2021
3 Unity Software's stock plunges nearly 30% on weak revenue guidance, MarketWatch, 10 May 2022
4 2 Reasons Unity Software’s Virtual World is Facing a Stark Reality, The Motley Fool, 17 July 2022