Data-Driven Enterprise? What’s so difficult?

In this second part of our series, we want to review requirements and challenges of a Data-Driven Enterprise in more detail. 

In Part 1 we looked at the overarching goals of a Data-Driven Enterprise. Part 3 will describe approaches and methodologies to become data-driven and adapt culture, organization, and processes accordingly. 

Requirements of a Data-Driven Enterprise

Considering the business goals laid out in Part 1, what are some of the most important requirements and challenges when attempting to implement a Data-Driven Enterprise?

Figure 1 – Requirements and Challenges of a Data-Driven Enterprise

Effective data collection, preparation, and integration

Data volume has grown exponentially and will continue to grow at even faster rates. Potentially relevant data resides in a plethora of locations across on-premises, cloud, and edge environments. Additionally, data is usually formatted and optimized for the process that needs it, not necessarily for analysis or for training machine learning models. Enterprises need cost-effective approaches for data collection or access, data preparation, and integration into the lifecycle of data usage. Collecting, copying, and aggregating data centrally is often too costly or too slow. Therefore, alternatives need to be considered, such as distributed analysis including all required pre-aggregation, remotely connecting to and federating data, or virtualizing data access.
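As an illustration of the pre-aggregation idea, each location can reduce its raw records to a small summary so that only the summaries travel to a central place. The following Python sketch is a minimal example under stated assumptions: the site names, the readings, and the helper functions are hypothetical and not taken from any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialAggregate:
    """Summary a site ships instead of its raw records."""
    count: int
    total: float


def pre_aggregate(values):
    """Runs locally at each site (on-premises, cloud, or edge)."""
    values = list(values)
    return PartialAggregate(count=len(values), total=float(sum(values)))


def merge(partials):
    """Runs centrally; combines summaries without seeing raw data."""
    partials = list(partials)
    return PartialAggregate(
        count=sum(p.count for p in partials),
        total=float(sum(p.total for p in partials)),
    )


def global_mean(partials):
    """Mean over all sites, computed from the merged summaries only."""
    merged = merge(partials)
    if merged.count == 0:
        raise ValueError("no records at any site")
    return merged.total / merged.count


# Hypothetical readings held at three separate locations.
site_data = {
    "on_prem": [10.0, 20.0, 30.0],
    "cloud": [40.0],
    "edge": [50.0, 60.0],
}
summaries = [pre_aggregate(values) for values in site_data.values()]
mean = global_mean(summaries)
```

Only two numbers per site cross the network here, regardless of how many raw records each site holds. The same pattern works for any aggregate that can be merged from partial results, such as counts, sums, minima, and maxima.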

Agility and Resilience

For the past two or three years, the business environment for every enterprise has seen drastic changes and challenges. At the same time, data volume and complexity keep growing and data keeps changing. It is not just the data itself that changes: the number of sources from which relevant data needs to be acquired is growing daily, a challenge that classical data integration approaches can hardly manage. A Data-Driven Enterprise needs to be able to react quickly and adapt to changes in business environments and underlying data.

Speed and Time-to-value

Even if all relevant data in arbitrary locations can be collected and changes to data can be handled, money is lost or decisions become obsolete if insights cannot be derived from that data in a timely manner. A Data-Driven Enterprise needs to automate all data processing operations (data pipelines) to acquire data, or push analytics “down” to the data, e.g., to edge locations or edge devices. Despite that growing complexity, performance and scale need to be continuously improved to provide timely insights and save cost.

Leveraging Domain Knowledge

Data is originally created and processed within the context of a specific business domain, e.g., product engineering or controlling. Data-Driven Enterprises need to implement solutions that properly leverage domain knowledge while using data enterprise-wide. Refining and preparing data for analytical or machine learning usage, within or outside of a domain, requires a domain-level understanding of the data. In this context, Data-Driven Enterprises need to ensure that meaningful metadata is associated with data products so that other domains can correctly interpret and leverage the domain knowledge reflected in the data.

Self-Service Access to Governed Data

In the past, data engineering, analytics, and machine learning were carried out by a small number of highly skilled experts within an enterprise. Well-defined processes, usually involving IT, granted or revoked access to data for these experts. But today, product and business model innovation require many business users and roles within an enterprise to have access to data, e.g., sales or HR professionals, marketers, finance, and operations specialists. Requesting and gaining access to data needs to be as simple as ordering office supplies. Instead of involving IT or incurring other manual effort, it needs to be fully self-service. At the same time, proper access control requirements need to be consistently applied. Data-Driven Enterprises therefore need governed, self-service access to data. Usage of data needs to be tracked (data lineage) so that data owners can understand the impact of changing data type, format, or quality.
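To make the lineage point concrete, the impact of a change can be estimated by walking a lineage graph from the changed asset to everything that consumes it, directly or indirectly. The Python sketch below assumes lineage is available as a simple mapping from each asset to its direct consumers; the asset names are hypothetical, and real data catalogs expose lineage through their own interfaces.

```python
from collections import deque


def downstream_impact(lineage, asset):
    """Return every asset that directly or indirectly consumes `asset`.

    `lineage` maps an asset name to the names of its direct consumers.
    """
    seen = {asset}
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for consumer in lineage.get(current, ()):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    # The changed asset itself is not part of its own impact set.
    return seen - {asset}


# Hypothetical lineage: who reads from whom.
lineage = {
    "crm.customers": ["sales.customer_360"],
    "sales.customer_360": ["marketing.churn_model", "finance.revenue_report"],
    "marketing.churn_model": [],
}
impacted = downstream_impact(lineage, "crm.customers")
```

A data owner planning to change a column type in `crm.customers` would, in this example, learn that three downstream assets need to be checked before the change ships.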

Data Quality

Self-service access to data requires additional means to ensure and demonstrate the quality of data. Data consumers need to know whether certain data assets are fit for purpose, and they want to trace how data quality is computed to gain transparency. To establish trust in the data available, Data-Driven Enterprises need to ensure key quality criteria such as relevance, accuracy, correctness, completeness, consistency, timeliness, and uniqueness.
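Some of these criteria can be computed directly and published alongside a data asset. The Python sketch below shows one possible definition of completeness and uniqueness for a list of records; the definitions, field names, and sample records are assumptions for illustration, and each enterprise will define its own metrics and thresholds.

```python
def is_missing(value):
    """Treat None and the empty string as missing."""
    return value is None or value == ""


def completeness(records, field):
    """Share of records in which `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if not is_missing(r.get(field)))
    return filled / len(records)


def uniqueness(records, key):
    """Share of non-missing `key` values that are distinct."""
    values = [r[key] for r in records if not is_missing(r.get(key))]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Hypothetical customer records with one empty email, one missing
# email, and one duplicated id.
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 2, "email": "b@example.com"},
    {"id": 4, "email": None},
]
email_completeness = completeness(customers, "email")
id_uniqueness = uniqueness(customers, "id")
```

Because the computation is a few lines of transparent code, a data consumer can trace exactly how a published quality score was derived.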

Data Security and Compliance

While data quality will mostly impact the value an enterprise can gain from data, data security and compliance with legal regulations can have an immediate impact on an enterprise’s bottom line. The average cost of a data breach exceeded $4M in 2022 (2), and the impact of violating data privacy or other relevant regulations around data usage and protection can easily exceed that number. Data-Driven Enterprises need to ensure end-to-end data security and compliance with all applicable regulations across the full lifecycle. In addition to protecting and managing data and models according to regulations, approaches and processes need to be auditable to prove full compliance. Data security and compliance are especially challenging when data assets or products are owned by individual business units. In this case, policies need to be enforced and techniques such as data masking and obfuscation need to be applied.
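As a rough illustration of masking and obfuscation, the Python sketch below applies a per-field policy to a record before it leaves the owning domain. The policy format, the field names, and the key are hypothetical; the keyed hash uses Python's standard `hmac` and `hashlib` modules, and a production setup would obtain the key from a secrets manager instead of hard-coding it.

```python
import hashlib
import hmac


def mask_email(email):
    """Keep the first character and the domain, hide the rest."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return "***"
    return local[0] + "***@" + domain


def pseudonymize(value, secret_key):
    """Keyed hash: stable for joins, not reversible without the key."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def apply_policy(record, policy, secret_key):
    """Return a copy of `record` with the policy applied per field."""
    protected = {}
    for field, value in record.items():
        action = policy.get(field, "keep")
        if action == "mask_email":
            protected[field] = mask_email(value)
        elif action == "pseudonymize":
            protected[field] = pseudonymize(str(value), secret_key)
        elif action == "drop":
            continue
        else:
            protected[field] = value
    return protected


# Hypothetical policy and record; the key is a placeholder.
policy = {"email": "mask_email", "customer_id": "pseudonymize", "iban": "drop"}
record = {
    "customer_id": 4711,
    "email": "jane.doe@example.com",
    "iban": "DE00 0000 0000 0000 0000 00",
    "country": "DE",
}
safe = apply_policy(record, policy, secret_key=b"demo-key-only")
```

Pseudonymized identifiers stay consistent across records, so other domains can still join on them without ever seeing the original value.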

Fair and Ethical Data Usage

In addition to data quality, security, and compliance, using data to leverage the benefits of artificial intelligence further demands compliance with regulations and policies for fair and ethical usage. Decisions made by AI models need to be explainable and continuously monitored for bias and drift to avoid unethical or unfair results. In addition to monitoring models across their lifecycle, processes for approval and deployment need to be auditable to prove compliance with relevant regulations. Data-Driven Enterprises need to ensure compliance with policies and regulations today, but also be prepared to consider new regulations in the future.
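One simple bias signal that can be monitored continuously is the gap in positive-decision rates between groups, often called the demographic parity difference. The Python sketch below computes it from logged decisions; the group labels and decisions are made up, and this is only one of several fairness metrics, chosen here for illustration and not as a recommendation.

```python
from collections import defaultdict


def selection_rates(decisions):
    """Positive-decision rate per group.

    `decisions` is an iterable of (group, approved) pairs.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, approved in decisions:
        totals[group] += 1
        positives[group] += int(bool(approved))
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_difference(decisions):
    """Gap between the highest and lowest group selection rate."""
    rates = selection_rates(decisions)
    if not rates:
        return 0.0
    return max(rates.values()) - min(rates.values())


# Hypothetical logged model decisions for two groups.
logged = [
    ("A", True), ("A", True), ("A", True), ("A", False),
    ("B", True), ("B", False), ("B", False), ("B", False),
]
rates = selection_rates(logged)
gap = demographic_parity_difference(logged)
```

A monitoring job could recompute this gap on each new batch of decisions and raise an alert when it crosses a threshold agreed with the responsible governance role.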

Openness

Managing and providing access to data based upon the requirements above is not sufficient by itself to gain the benefits of a Data-Driven Enterprise. Discovering available data is just one part of the equation. Making data accessible for downstream usage through open standard interfaces enables the usage of data assets and derived insights in a variety of tools and contexts, from data management to AI inference. Open standards play a key role both in terms of standard data formats (Parquet, Iceberg, etc.) and in terms of protocols and interfaces (REST, JDBC, etc.).

Gaining insights from properly governed and accessible data requires integration with best-of-breed business intelligence and analytics tools and platforms as well as data science tools. Data-Driven Enterprises need to be able to flexibly choose the best tool for the job and seamlessly integrate such tools into the management of data and models. Openness does not end with integrating best-of-breed tools; it also requires the ability to replace tools at a later point in time to avoid getting locked into a single-provider solution. Open source communities for data governance, data privacy, analytics, data science, etc. are a key engine for innovation, and Data-Driven Enterprises need to be able to leverage this engine.

Elasticity and Sustainability

Considering the IT infrastructure and platforms required to deliver upon the requirements laid out above, Data-Driven Enterprises need to scale the required infrastructure based upon demand and right-size the investment to save cost and energy. Whether such infrastructure is provisioned on-premises, in public clouds, or even in edge locations, it needs to be tailored to the actual demand and grow with increased load.

Collaboration

The intersection of the requirements above makes it necessary for any Data-Driven Enterprise to build a data platform that allows the different roles and units involved to collaborate across the full data and AI lifecycle (acquiring, preparing, and analyzing data, building and training models, and finally infusing AI into applications). Without collaboration, other requirements such as self-service access and leveraging domain knowledge across silos will not be able to provide their full value.

Key Challenges for a Data-Driven Enterprise

Goals and requirements for Data-Driven Enterprises aren’t static, but over the past years a reasonably good understanding has been established. With known requirements and goals, why is it so difficult to deliver upon the promise of a Data-Driven Enterprise? The following paragraphs look at some of the key challenges Data-Driven Enterprises are facing today.

Structured data within an enterprise is mostly generated and processed within systems of record. These systems of record implement all the required business logic to ensure accuracy and consistency of the data. In many cases, data from several systems of record needs to be combined to analyze specific aspects or train models for decision support. Other structured, semi-structured, or unstructured data originates from systems of engagement, social media, the Internet of Things, and other sources.

Data-Driven Enterprises have attempted to centrally collect, transform, and store all required data for usage in analytics and AI. Such an approach requires data to be copied from source systems of record into data warehouses, data lakes, data lakehouses, etc.

Massive data pipelines may be required to extract, transform (cleanse, enrich, merge, etc.), and load (ETL) or ingest data into target data platforms. This poses several challenges for Data-Driven Enterprises:

Proliferation of Data Sources and Complex Data Processing

The ever-growing number of systems or applications creating and processing relevant data will require additional data pipelines to be included to cover additional source systems across on-premises, multiple clouds, and edge locations. 

Implementing the required data processing for each individual source and target is complex. It also requires constant updates whenever the underlying data or its schemas change, and these updates must not impair the ability to act quickly at an enterprise level.
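Part of this maintenance burden can be reduced by detecting schema changes automatically before a pipeline run fails or silently produces wrong results. The Python sketch below compares the schema a pipeline expects with the schema a source currently delivers; representing a schema as a plain column-to-type mapping is a simplifying assumption, and the column names are hypothetical.

```python
def schema_drift(expected, observed):
    """Compare two {column: type} mappings and report differences."""
    shared = set(expected) & set(observed)
    return {
        "added": sorted(set(observed) - set(expected)),
        "removed": sorted(set(expected) - set(observed)),
        "retyped": sorted(c for c in shared if expected[c] != observed[c]),
    }


def has_drift(report):
    """True if any category of the drift report is non-empty."""
    return any(report.values())


# Hypothetical schemas: what the pipeline was built for versus
# what the source system delivers today.
expected = {"order_id": "int", "amount": "decimal", "currency": "str"}
observed = {"order_id": "int", "amount": "float", "channel": "str"}
drift = schema_drift(expected, observed)
```

A report like this can be attached to an alert for the owning domain, so that the team that understands the data decides how the pipeline should adapt.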

Data Duplication, Cost and Lock-In

Storing copies of all data will significantly increase the cost for required infrastructure for data storage. The lack of control over the extensive ETL jobs may result in even more copies of source data. Often, central data platforms are located within public cloud infrastructures based on the promise of lower cost. Risks of cost overruns and vendor lock-in are often neglected early in the process. 

Some data may not even be appropriate for storing in public cloud infrastructure (regulatory compliance), which adds further complexity to the overall architecture.

Data Ownership across its Lifecycle

Division of labor may provide cost benefits for required tools and skills, but centralized data engineering teams will lack domain understanding of the underlying data to be processed. The lack of understanding of domain data will not only impact the efficiency of processing data from source to a central platform, but it will also further impact the usability of data in downstream analytics, AI models or applications. 

Instead, data ownership across the full lifecycle is required, including future updates and the propagation of continuous domain events. Data-Driven Enterprises need to solve this trade-off and operationalize data curation and ownership at a domain level without losing the economies of scale at an enterprise level.

Breaking Silos

At the same time, Data-Driven Enterprises need to break silos that hinder the value generation from data beyond the borders of an individual domain. Breaking these silos is not only a technical challenge. Individual organizations need to be motivated or compensated to share, maintain, and serve data assets for others to benefit from them. Why should one business unit invest effort and cost into maintaining or extending a data asset just for the benefit of another unit’s interest and their goals? 

Balanced Decentralization

Even though data ownership should be located within the originating domains, compliance with data regulations or enterprise policies must be consistently enforced across such a distributed architecture. Data is owned locally, in a decentralized way, while the rules and policies applied to it are ideally defined and maintained centrally. The same is true for common or consistent metadata and glossaries across individual data products. A carefully balanced approach with distributed ownership of data products and a centrally administered set of policies, metadata, and standards is needed.

Skills at Scale

With unlimited resources and skills available, many of the above challenges could be solved more easily. But data engineering, analytics, and data science skills are a scarce resource today. Business insights from domain subject matter experts are still needed as well, so data platforms and tools need to cater for different roles and skill profiles.

Specific skills such as knowledge about regulatory compliance are even harder to build and attract. An inability to enforce governance and privacy regulations will increase the risk of violations and penalties for Data-Driven Enterprises. It is simply impossible to train every role involved or establish the required knowledge across the board. Data-Driven Enterprises need to address these skill gaps through automation and clearly defined ownership and responsibilities.

Degree of Automation

The challenge of attracting sufficient talent with appropriate skills is further intensified by the ever-increasing diversity described above (e.g., data formats, heterogeneity of tools and platforms, …). Maintaining an overview of all the different processes and jobs across an end-to-end data lifecycle is becoming a daunting challenge. Data catalogs or knowledge graphs can help address this problem. Keeping them up to date at all times requires a high degree of automation. Automation needs to be established as a core design principle within any Data-Driven Enterprise.

Where to go from here?

To gain the advantages of a Data-Driven Enterprise, ideally all the requirements and most challenges laid out in the above chapters need to be addressed. Some can be addressed by technology, some require cultural, organizational, and structural changes within an enterprise.

Part 3 of this series will provide an overview of technologies and approaches available today. Based on our experiences, we will describe the ideal journey towards becoming data-driven.

Part 4 will serve as a technology glossary and describe many technological concepts, such as Data Warehousing, Data Lakes, Delta Lakes, Data Mesh, and Data Fabric. Where appropriate, we will also look at their advantages and shortcomings in the context of a Data-Driven Enterprise.

We are hoping to initiate a lively discussion with these articles, so please leave a comment and share your experience or perspectives on these topics. Do you agree with this point of view? Have you had or witnessed different experiences?

IBM Distinguished Engineer, Technical Lead Data and AI DACH

Sascha Slomka

Senior Client Engineering Solution Architect

Jan Bode

Information Systems Management Master@IBM and TU Berlin
