
IBM’s Data Platform for Data-Driven Enterprises


What technology does IBM have to offer to help you become, or strengthen your position as, a data-driven enterprise? IBM recognizes that most enterprises don’t start on a greenfield, but instead already have a landscape of data stores and analytical systems grown over many years. Therefore, IBM’s approach to a modern data platform focuses on three key areas: providing a unified experience through integration, automation, and openness.

Openness in this context has several dimensions:

Openness…

  1. … concerning the infrastructure on which the data platform is deployed
  2. … through the usage of open-source software and open data formats
  3. … concerning the management of the data platform
  4. … by being able to integrate with systems in an existing landscape
  5. … by integrating with a whole ecosystem of 3rd party products

Openness with regard to the infrastructure is based on IBM’s hybrid cloud and multi-cloud strategy. Hybrid cloud means that the platform can be deployed not only in a public cloud but also on different kinds of on-premises infrastructure at a customer site. Multi-cloud means that the platform is not tied to any specific public cloud but can be deployed on any of them. This is achieved by building the data platform on top of Red Hat OpenShift.

Openness also addresses the need to benefit from the immense innovation potential that open source provides. Many popular open-source programs and libraries have become available for common tasks within a data platform, e.g., for machine learning. In addition, many open data formats have been defined and agreed upon, which are optimized for specific tasks and can be used by different applications. IBM’s data platform is specifically designed and built to leverage the leading open-source projects, libraries, and data formats.

Some customers want full control over their data platform, while others just want to use one because they don’t have the skills to manage platforms themselves. Therefore, IBM’s data platform is open to different consumption and delivery models: it can be consumed as Software as a Service (SaaS) or be completely self-managed.

Another aspect of the openness of the data platform is its easy integration with an existing data landscape and with 3rd-party tools. IBM’s platform comes with various connectors and adapters for 3rd-party and IBM technology. This is important for enterprise customers, who need to be able to leverage investments they have already made.

Integration – The second focus area is providing an integrated and thoroughly tested platform that covers all aspects of data analysis: from the storage of the necessary data, through the preparation and governance of data, all the way to the usage of data to gain insights and to operationalize those insights.

Unified Experience – One aspect of the integrated platform is that it provides a unified experience: access to data, metadata, projects, and more is provided within the same user interfaces and with the same look & feel. The unified experience also extends to administrators of the platform, for example for managing users and roles or defining governance policies.

Automation – The third focus area of IBM’s data platform is automation, which entails several aspects. The first is automation of the deployment of services and components, which aims at enabling self-service deployments. But it also means automating different levels of governance: for example, data lineage can be derived automatically from existing metadata. Data classification automatically assigns semantic data types and thus controls how data is governed and visualized (e.g., whether a character or numeric field is a zip code or a credit card number) and which data requires special protection. Last but not least, statistics about the quality of data can be derived automatically.
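To illustrate the idea behind automated data classification, here is a minimal Python sketch that assigns a data class to a column based on simple pattern matching. The patterns and threshold are assumptions for illustration; a real classification service uses much richer rules, reference data, and ML-based matching.

```python
import re

# Simplified, assumed patterns; not IBM's actual classification rules.
DATA_CLASSES = {
    "credit_card": re.compile(r"^\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}$"),
    "zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def classify_column(values, threshold=0.8):
    """Assign a data class if most non-empty values match one pattern."""
    non_empty = [v for v in values if v]
    for data_class, pattern in DATA_CLASSES.items():
        if non_empty and sum(bool(pattern.match(v)) for v in non_empty) / len(non_empty) >= threshold:
            return data_class
    return "unclassified"

print(classify_column(["10115", "80331", "50667"]))  # zip_code
print(classify_column(["4111-1111-1111-1111", ""]))  # credit_card
```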

Architecture of the IBM Data Platform 

A high-level component architecture of IBM’s data platform is shown in the diagram below. The data flows from various sources on the left to various consumers on the right. In between data can be persisted, transformed, integrated, filtered, and aggregated by different types of engines and finally accessed by analytics and AI tools. The architecture consists of the following main building blocks:

  • Data Sources: These can be external data sources as well as data sources that are deployed on the IBM data platform.
  • Ingestion and Integration: This block provides the movement of data from data sources to data management and storage. It also covers any transformation of data that is required so that the data can afterwards be queried efficiently for analytics.
  • Data Management and Storage: The purpose of this set of services is the storage of data, optimized for efficient analysis. Therefore, several kinds of database management systems, data stores, and query engines are part of this component of the platform.
  • Data Access: This component is responsible for providing access to data that is stored in the data management or storage services, so that analytics and AI tools and services have consistent access to data.
  • Analytics & AI: This functional block includes analytics capabilities like reporting, dashboarding, and analytical queries. The AI services include both classical machine learning as well as support for generative AI and foundation models.
  • Data in Motion: While the integration and ingestion services, the data management and storage components, and the data access services provide “data at rest” for analytics and AI, this set of services provides capabilities to handle “data in motion”, so that data flowing into the data platform can be processed in near real time.
  • Governance: Often underestimated, or even neglected, data governance and AI model governance play a major role in modern data architectures. Providing self-service and democratized access to data requires strong governance capabilities for an enterprise to remain fully compliant.
  • Security: The services that are running on the platform (for example data source services) as well as the platform itself provide different types of security mechanisms in each layer.

While not depicted in the diagram, Red Hat OpenShift serves as the hybrid cloud foundation on which the overall IBM data platform is built and deployed. Relying on Red Hat OpenShift allows IBM to make its data platform available anywhere Red Hat OpenShift is supported. Clients can choose to deploy the platform on-premises on their own choice of hardware, deploy it on public cloud infrastructure, or consume Red Hat OpenShift as a managed service on all major public clouds.

The functional components (services) that run on the platform are modular and pre-integrated, both in the front end and in the back end. Customers can deploy the services independently or in combination with each other.

Front-end integration enables UI-based workflows across multiple services. For example, data engineers can use the UI to add data assets to a project, prepare data by creating ETL flows, enrich data assets with metadata, and publish the data assets to a self-service catalog. End users may search and access the assets in the catalog and add them to their projects, for example, to refine them or to perform analytics or machine learning tasks. In addition, the platform provides workflows, for example, to request new data assets or to request access to assets that are already available in the catalog. In the back end, there are also integration points between services. For example, IBM’s data fabric approach is based on a tight integration between the data virtualization service, which enables simplified and unified access to heterogeneous data sources, and the data governance capabilities, which, for example, enable the definition of data protection and data masking rules.

Typically, two deployment options are supported for each service – self-managed software or Software as a Service (SaaS). Most services are based on Red Hat OpenShift as the underlying container orchestration platform. Since OpenShift is available both on-premises and on all major clouds, IBM enables a true hybrid and multi-cloud approach.

Data Sources

Data sources may be located inside or outside of the enterprise, both on-premises and in multiple clouds. IBM provides connectors to access a wide range of data sources, for example, relational databases, NoSQL databases, data lakes, and data lakehouses. In addition, connectors are available for various open table formats like Apache Iceberg, Hudi, and Delta Lake, as well as file formats like CSV, JSON, Parquet, ORC, and Avro.

Ingestion and Integration

There are three options for ingesting data into the IBM data platform:

  • ETL (Extract, Transform, and Load) and ELT (Extract, Load, and Transform)
  • Federation
  • Replication

ETL executes the transformations in the ingestion tool, while ELT pushes the transformations down into an engine in the data management and storage layer, making use of the resources available there.
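To make the distinction concrete, here is a minimal Python sketch contrasting the two approaches, assuming a hypothetical sales_raw.csv file and a reachable Db2 warehouse; the connection string, table, and column names are placeholders, not a definitive implementation:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical Db2 warehouse connection (requires the ibm_db_sa dialect);
# host, credentials, file, and table names are placeholders.
engine = create_engine("db2+ibm_db://user:pwd@host:50000/BLUDB")

# ETL: transform inside the ingestion tool, then load only the result.
raw = pd.read_csv("sales_raw.csv")
raw["revenue"] = raw["quantity"] * raw["unit_price"]
daily = raw.groupby("order_date", as_index=False)["revenue"].sum()
daily.to_sql("daily_revenue", engine, if_exists="replace", index=False)

# ELT: load the raw data first, then push the transformation down into
# the warehouse engine, using the compute resources available there.
raw.to_sql("sales_raw", engine, if_exists="replace", index=False)
with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE daily_revenue_elt AS (
            SELECT order_date, SUM(quantity * unit_price) AS revenue
            FROM sales_raw
            GROUP BY order_date
        ) WITH DATA
    """))
```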

ETL/ELT data flows can be defined through a web-based UI and compiled and executed in multiple cloud environments. Alternatively, data flows can be defined programmatically. An example of the first kind of tool is DataStage (https://www.ibm.com/products/datastage); an example of the latter is Apache Spark (available as the Analytics Engine, https://www.ibm.com/cloud/analytics-engine, in the IBM data platform).
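As a hedged sketch of such a programmatic flow on Spark, the following reads raw CSV files from object storage, cleans them, and writes curated Parquet; the paths and column names are illustrative assumptions:

```python
from pyspark.sql import SparkSession, functions as F

# A minimal programmatic data flow; object storage paths are placeholders.
spark = SparkSession.builder.appName("ingest-orders").getOrCreate()

orders = spark.read.option("header", True).csv("cos://my-bucket/landing/orders/")
cleaned = (
    orders.dropna(subset=["order_id"])                         # drop incomplete rows
          .withColumn("order_ts", F.to_timestamp("order_ts"))  # normalize timestamps
)
cleaned.write.mode("overwrite").parquet("cos://my-bucket/curated/orders/")
```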

ETL and ELT introduce two kinds of time delays: first for the development of the ETL/ELT jobs, and then during execution, since it is done in a batch-oriented way. Both federation and replication avoid these delays, since they provide data from the data sources to the data platform in real time. The difference lies in the initiator of the data transfer. In the case of federation, the initiator is a data store from the data management and storage layer, like Db2 Warehouse (https://www.ibm.com/products/db2/warehouse) or the Presto (https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-presto_overview) or Spark engine of watsonx.data (https://www.ibm.com/products/watsonx-data), while in the case of replication, the initiator is the data source, which sends data to the data platform as soon as it is generated. While federation accesses just the data that is needed, replication sends all data, whether there is currently a demand for it or not.
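As a hedged illustration of federation, the following sketch issues a single federated query through Presto's Python client (presto-python-client); the host, catalogs, schemas, and table names are all hypothetical:

```python
import prestodb  # presto-python-client; assumed here as the access path

conn = prestodb.dbapi.connect(
    host="presto.example.com", port=8080,
    user="analyst", catalog="hive", schema="sales",
)
cur = conn.cursor()
# A single federated query: the engine pulls only the rows it needs from
# each source. Catalog, schema, and table names are hypothetical.
cur.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM iceberg.sales.orders AS o
    JOIN db2.crm.customers AS c ON o.customer_id = c.id
    GROUP BY c.name
""")
for row in cur.fetchall():
    print(row)
```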

Data Management and Storage

IBM’s data platform supports a variety of data store services that can be deployed on the platform. This includes IBM products, 3rd-party products, and open-source offerings. There are relational data stores like Db2 (https://www.ibm.com/products/db2), Postgres (https://www.ibm.com/products/postgres-enterprise), Informix (https://www.ibm.com/products/informix), and Netezza (https://www.ibm.com/products/netezza), execution engines like Presto, NoSQL data stores like MongoDB (https://marketplace.redhat.com/en-us/products/mongodb-enterprise-advanced-from-ibm) or Cassandra (https://www.ibm.com/products/datastax-enterprise), and pure storage like object storage (https://www.ibm.com/cloud/object-storage).

The data stores can be used to implement data warehouses, data marts, data lakes, and data lakehouses. In general, a data management system consists of three components:

  • storage
  • metadata, and
  • engine

Storage determines where the data is stored, e.g., on block storage backed by SSDs, and how it is stored, e.g., in which format.

Metadata in this context is not metadata in the general sense used in the governance of the whole data platform (e.g., for describing data lineage); instead, it is used for making data on the storage accessible as data objects (e.g., a table with columns of specific types) to the engine. Therefore, the term “database catalog” is also used for this kind of metadata.

The engine uses the metadata to query data from and insert data into the storage. Engines therefore serve as the interface through which a data consumer interacts with the data management system.

Traditionally, storage, metadata, and engine were tightly coupled in one data management solution, not only for relational database management systems like Db2, Informix, Netezza, or Postgres, but also for NoSQL database management systems like MongoDB. To gain more flexibility and allow independent scaling of the three components, this fixed relationship was lifted. Open metastores like the Hive metastore allow different kinds of storage — e.g., HDFS as well as object storage — and different kinds of open data formats, e.g., CSV, ORC, Avro, and Parquet. This allows independent scaling of compute and storage, but also enables different engines to use the data on the storage via the metadata; e.g., IBM Big SQL (https://www.ibm.com/products/db2/big-sql) can be used instead of Hive to query the same data in a Cloudera (https://www.ibm.com/analytics/partners/cloudera) data lake. Also, traditional relational engines like Db2 Warehouse and Netezza have been extended to not just use their own metadata but also to access data described by other metadata stores.

The final step in the separation of storage, metadata, and engine is the IBM lakehouse, watsonx.data. There, different kinds of storage and metadata can be combined with different engines to support all kinds of workloads with the right combination. In addition to the open data formats, watsonx.data also supports open table formats (initially Apache Iceberg) in the metadata layer. Initially, Presto and Spark are supported as internal lakehouse engines, but Db2 Warehouse and Netezza have also been extended to act as external engines of the lakehouse, accessing the same metadata and data. The openness to different engines, the open data and table formats, and the openness regarding the storage used all foster the above-mentioned goal of overall openness.
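A minimal sketch of this separation, assuming a Spark session with the Apache Iceberg runtime on the classpath and a reachable Hive metastore; the catalog name, metastore URI, and table names are illustrative, not watsonx.data specifics:

```python
from pyspark.sql import SparkSession

# Spark (the engine) resolves tables through an Iceberg catalog (the
# metadata) over object storage (the storage). Names/URIs are assumptions.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "hive")
    .config("spark.sql.catalog.lakehouse.uri", "thrift://metastore-host:9083")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.sales.orders (
        order_id BIGINT, order_date DATE, amount DECIMAL(10,2)
    ) USING iceberg
""")
spark.sql("INSERT INTO lakehouse.sales.orders VALUES (1, DATE '2023-09-12', 99.90)")
# Any other engine registered against the same metastore (e.g., Presto)
# can now read this table without going through Spark.
```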

Data Access

For analytics and AI applications to use the data in the data management layer, the data access layer has to provide access. It does so through APIs: examples are REST and GraphQL (https://www.ibm.com/products/stepzen), but also the SQL interfaces provided via, e.g., ODBC or JDBC.
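As a small illustration of SQL-based access from an application, the following sketch uses IBM's ibm_db Python driver; the connection string, schema, and table are placeholders:

```python
import ibm_db  # IBM's Python database driver; DSN values are placeholders

conn = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=db-host;PORT=50000;PROTOCOL=TCPIP;"
    "UID=user;PWD=secret;", "", ""
)
stmt = ibm_db.exec_immediate(
    conn, "SELECT order_id, amount FROM sales.orders FETCH FIRST 5 ROWS ONLY"
)
row = ibm_db.fetch_assoc(stmt)  # fetch rows as column-name/value dicts
while row:
    print(row)
    row = ibm_db.fetch_assoc(stmt)
```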

The goal of the data access layer is to provide uniform access to the different kinds of data stores in the data management layer. These data stores might be managed by the IBM data platform, or they may reside outside of it. Data Virtualization (https://www.ibm.com/products/watson-query) is an approach to achieve this kind of uniform access. Connectors can be defined for many different kinds of data stores, like relational or NoSQL database systems, or even plain files. The connectors map all these data sources to the relational model so that data virtualization can combine data from different sources in a single SQL query. Performance optimizations in data virtualization, like caching and push-down of operations to the sources, improve the experience for the consumers. Data virtualization also enables real-time data integration without data duplication.

Data in Motion

The data provided via the three layers just described — ingestion & integration, data management & storage, and data access — is characterized by the fact that it is prepared and stored in advance, before an actual query is executed. This makes it easy to query the same data in different ways several times. Therefore, this approach is called using “data at rest”. But sometimes the data has to be queried as soon as it is created, and there may be no need to access it again later.

Such highly perishable data, which has to be consumed while it is fresh, can be analyzed with the “data in motion” approach. In this approach, the data flows in a stream, and a query engine works on this stream, analyzing and aggregating the data. Data that might still be analyzed again later can be written by the query engine into the data management layer. Typical tools used for data in motion are Flink (part of the Cloudera platform), Spark Streaming, IBM Streams (https://www.ibm.com/cloud/streaming-analytics), and IBM Event Streams, aka Kafka (https://www.ibm.com/cloud/event-streams).
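A minimal “data in motion” sketch with Spark Structured Streaming: it reads from Kafka, aggregates per-minute counts on the stream, and persists finalized windows for later “data at rest” analysis. The broker, topic, and paths are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

# Requires the spark-sql-kafka package; broker/topic/paths are placeholders.
spark = SparkSession.builder.appName("clickstream").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clickstream")
    .load()
)

# Count events per one-minute window while the data is still in motion.
counts = (
    events.withWatermark("timestamp", "5 minutes")
          .groupBy(F.window("timestamp", "1 minute"))
          .count()
)

# Persist finalized windows to the data management layer so they can be
# analyzed again later as data at rest.
query = (
    counts.writeStream.outputMode("append").format("parquet")
    .option("path", "cos://my-bucket/curated/click_counts/")
    .option("checkpointLocation", "cos://my-bucket/checkpoints/click_counts/")
    .start()
)
query.awaitTermination()
```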

Analytics & AI

This building block provides analytics and AI capabilities. Analytics capabilities help to explore data interactively to find patterns and insights, to graphically visualize data, and to create reports. An integrated planning solution helps to automate planning, budgeting, and forecasting and drives more intelligent workflows. A decision optimization service helps to make optimal business decisions by evaluating millions of possibilities to find the most appropriate prescriptive solutions.

The AI capabilities are intended to operationalize the use of AI and comprise both classical machine learning (ML) and generative AI capabilities. The platform includes capabilities to manage the whole lifecycle of machine learning models, including tools to build, train, and deploy them. Experienced data scientists can use an integrated development environment to create new ML models. Machine learning pipelines can also be built in a graphical UI. Inexperienced users can use a no-code tool to find the best ML model for their particular use case. Finished ML models are moved into deployment spaces, where they can be used by applications, for example via REST API calls. Optionally, a machine learning accelerator can be used for distributed deep learning as well as hyper-parameter search and optimization, to simplify and accelerate the training of large deep learning models across a cluster.
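As a hedged sketch of how an application might call a model in a deployment space over REST: the code below follows the general pattern of online-scoring APIs, but the URL, token, deployment ID, and field names are placeholders that would come from your own deployment:

```python
import requests

# Placeholders modeled on typical online-scoring REST APIs.
URL = ("https://<region>.ml.cloud.ibm.com/ml/v4/deployments/"
       "<deployment_id>/predictions?version=2023-09-12")
HEADERS = {"Authorization": "Bearer <access_token>",
           "Content-Type": "application/json"}
payload = {
    "input_data": [{
        "fields": ["age", "income", "tenure"],  # the model's input schema
        "values": [[42, 55000, 7]],             # one row to score
    }]
}

response = requests.post(URL, headers=HEADERS, json=payload, timeout=30)
response.raise_for_status()
print(response.json())  # predictions; exact format depends on the service
```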

The platform also enables the usage of generative AI for business applications, which is an important differentiator from many other generative AI offerings in the market. Generative AI is a form of machine learning that is able to create new content. Examples of supported generative AI tasks include Q&A, summarization, content generation, and fact extraction. Customers benefit from very large models that are pre-trained by IBM with curated data. This ensures high-quality outcomes and eliminates the burden (required skills and necessary hardware resources) of model training. Nevertheless, models can be tuned with customer-specific or industry-domain-specific data (aka prompt tuning). The output of the tuned models is explainable, which is a prerequisite for most business applications.

Finally, the platform provides further capabilities like a development environment for chatbots, voice interaction capabilities, and an AI-powered search and text analytics platform that helps to find valuable information in business documents. It automates search and content analysis, so that experts can make faster decisions and have more time for high-value work.

Governance 

This building block comprises both data governance and AI governance:

AI governance is a framework to manage an organization’s use of ML models. This includes the following capabilities:

  • Tracking of ML models across their life cycle in the organization: This comprises, for example, the tracking of model deployments, input data sets for model training, model descriptions, etc.
  • Analysis and monitoring of ML model bias and model drift: Model bias occurs when a model generates results that are systematically prejudiced, for example, due to low-quality training data. Model drift happens when a model’s performance on new data differs from its performance on the training data it was built on (a simple drift metric is sketched right after this list).
  • Management of financial and other risks as well as regulatory challenges related to ML models: ML model bias, for example, can pose major threats and financial risks for companies.
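As an illustration of how drift can be quantified, here is a minimal sketch using the population stability index (PSI), a common generic drift measure; this is an assumption for illustration, not IBM's specific drift algorithm:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Compare two score distributions; larger values mean more drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(42)
train_scores = rng.normal(0.5, 0.10, 10_000)  # model scores at training time
live_scores = rng.normal(0.6, 0.12, 10_000)   # model scores in production
psi = population_stability_index(train_scores, live_scores)
print(f"PSI = {psi:.3f} (rule of thumb: > 0.2 suggests significant drift)")
```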

Data governance includes the following capabilities:

  • A data catalog that provides self-service access to data
  • Data discovery capabilities to automatically extract metadata from connected data sources
  • Data quality management, for example, data quality alerting functions and data quality analysis across multiple dimensions
  • Data classification (for example, data can be classified as confidential, personally identifiable information, etc.)
  • Data enrichment (for example, the automatic or manual assignment of data classes and business terms)
  • Data lineage to understand where data comes from, how it is transformed, and where it is used (reports, ML models, etc.)
  • A knowledge graph that reflects the relationships between different types of assets (both metadata and data assets)
  • Policies and rules (for example data access and data masking rules)
  • Metadata management (for example, the management of data classes, business terms, reference data, hierarchies, etc.)

Security 

The IBM data platform is based on the Red Hat OpenShift container platform (OCP). In combination with the underlying Red Hat operating systems, OCP provides advanced container security mechanisms. This includes for example automated patching, the evaluation of container content and sources for vulnerabilities, access control to containers through authentication and authorization, and securing networking and attached storage.

The primary goal of Red Hat Enterprise Linux CoreOS (RHCOS) is to provide a secure operating system platform for running Kubernetes, OpenShift services, and the containerized workloads that are running on OCP.

Finally, the available IBM data management solutions include data encryption mechanisms (both for data in motion and data at rest) as well as data protection mechanisms (for example, data access control, data masking, and data anonymization).
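To make these protection mechanisms tangible, here is a small, self-contained Python sketch of masking and pseudonymization; in the platform itself such rules are defined declaratively in the governance layer rather than written as application code:

```python
import hashlib

def mask_credit_card(number: str) -> str:
    """Reveal only the last four digits of a card number."""
    digits = [c for c in number if c.isdigit()]
    return "*" * (len(digits) - 4) + "".join(digits[-4:])

def pseudonymize(value: str, salt: str = "per-tenant-salt") -> str:
    """Stable pseudonym: the same input always yields the same token."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]

print(mask_credit_card("4111 1111 1111 1111"))  # ************1111
print(pseudonymize("alice@example.com"))        # deterministic token
```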

Conclusion

In conclusion, IBM’s data platform provides a comprehensive and versatile solution that enables organizations to transform into data-driven enterprises. It emphasizes integration, automation, and openness throughout its modular architecture.

Key features include seamless data source connectivity, flexible data ingestion methods, adaptable data management and storage options, unified data access, real-time data processing, robust analytics and AI capabilities, and comprehensive governance and security measures.

With support for hybrid cloud and multi-cloud deployment, open-source technologies, and various consumption models, IBM’s platform caters to diverse organizational needs. It provides the tools and infrastructure for organizations to leverage data, encourage innovation, and ensure data privacy and regulatory compliance in today’s data-driven landscape.

Andreas Weininger

Principal Information Architecture Technical Specialist

Andreas Christian

Senior Information Architecture Technical Specialist

Hardy Gröger

IBM Distinguished Engineer, Technical Lead Data and AI DACH
