February 7, 2024 By Ashley Bassman
Anson Kokkat
4 min read

The urgency for businesses to leverage data and AI for competitive advantage has never been greater, yet today’s leaders are still grappling with unprecedented data challenges in scaling AI. Data volumes continue to grow, and new formats of unstructured data are growing 30–60% annually [1]. Data silos and data complexity are multiplying across more locations and applications, preventing data from being accessed, enriched and used effectively. To make use of large volumes of unstructured data for analytics and AI, organizations turned to Hadoop data lakes for cost-effective storage, open formats and flexibility. But as data volumes grow, these traditional data lakes are constrained by limited performance, weak governance and the complexity of maintaining them.

To scale trusted analytics and AI workloads, organizations are adopting an open data lakehouse approach, which combines the performance and governance of data warehouses with the flexibility of data lakes on low-cost object storage. IBM® and Cloudera share this vision, and our strategic collaboration opens a new era of AI possibilities.

What is Cloudera Data Platform? 

Five years ago, IBM and Cloudera strategically partnered to bring Cloudera’s data-in-motion capabilities, spanning analytics for IoT, event stream processing, monitoring and management, to IBM customers. This integration is designed to work seamlessly with our data fabric architecture, fostering a holistic approach to data and AI strategy. The Cloudera Data Platform (CDP) solution enables customers to run rapid analytics on unstructured data anywhere. It is built on Hadoop-based solutions and incorporates over 30 open-source and proprietary components.

However, given today’s challenges in maintaining, governing and transforming data stored in Hadoop data lakes for AI, IBM and Cloudera have built a unique integration to share and prepare data for trusted AI workloads. The IBM watsonx™ and Cloudera Data Platform (CDP) integration enables customers to augment their Hadoop data lake with warehouse-like performance, optimize for cost with simple object storage and multiple query engines, and scale AI across the enterprise with trusted data. This collaboration not only simplifies data management for Cloudera customers but also enables them to bring those data lake workloads into a modern environment that is ready for generative AI use cases.

Better together: IBM watsonx and Cloudera 

The IBM watsonx™ AI and data platform includes three core components and a set of AI assistants designed to help you scale and accelerate the impact of AI with trusted data across your business. The core components are a studio for new foundation models, generative AI and machine learning; a fit-for-purpose data store built on an open data lakehouse architecture; and a toolkit to accelerate AI workflows that are built with responsibility, transparency and explainability.

Let’s take a closer look at how these components work with CDP data to address the complexities of unstructured data and enable customers to scale AI with trust.  

1. Streamlining data for AI without compromise 

One of the critical challenges in AI implementation lies in efficiently sharing data without duplication or migration. The watsonx.data open data lakehouse, in tandem with Cloudera, addresses this challenge by allowing users to access diverse data sources and data types across the hybrid cloud within minutes. The integration supports sharing data in Apache Iceberg table format, which augments existing Hadoop data lakes with warehouse-like performance and robust governance, security and lineage. Because CDP supports the Iceberg open table format, a single copy of data can be shared with customers’ tools of choice, including Hive, Impala, the engines in watsonx.data (such as Presto, Spark, Db2 and Netezza) and any other query engine that supports Iceberg. Customers can pick the engine with the best price-performance for each workload without resorting to migration, duplication or extensive ETL processes.
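
The single-copy idea can be illustrated with a small Python sketch. It is a toy model, not the Iceberg, CDP or watsonx.data API: the `Catalog`, `Snapshot` and `plan_scan` names and the `s3://example-bucket/...` paths are hypothetical. It assumes only the general open-table-format design, in which a shared catalog points each table at its current snapshot and every engine resolves that snapshot to the same data files.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of a table: an id plus the data files it covers."""
    snapshot_id: int
    data_files: tuple


@dataclass
class Catalog:
    """Hypothetical shared catalog mapping table names to current snapshots."""
    tables: dict = field(default_factory=dict)

    def commit(self, table_name, snapshot):
        # Committing swaps the pointer; no data files are copied or moved.
        self.tables[table_name] = snapshot

    def current_snapshot(self, table_name):
        return self.tables[table_name]


def plan_scan(engine_name, catalog, table_name):
    """Return the files an engine would read for the table's current snapshot."""
    snapshot = catalog.current_snapshot(table_name)
    return {
        "engine": engine_name,
        "snapshot_id": snapshot.snapshot_id,
        "files": list(snapshot.data_files),
    }


catalog = Catalog()
catalog.commit(
    "sales",
    Snapshot(
        snapshot_id=1,
        data_files=(
            "s3://example-bucket/sales/part-0001.parquet",  # hypothetical path
            "s3://example-bucket/sales/part-0002.parquet",  # hypothetical path
        ),
    ),
)

# Two different engines plan a scan against the same single copy of the data.
hive_plan = plan_scan("hive", catalog, "sales")
presto_plan = plan_scan("presto", catalog, "sales")
```

Both plans list identical files, which is why adding another engine in this model requires no migration or ETL step.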

2. Accelerating data insights with natural language capabilities

Watsonx and Cloudera empower users to accelerate data insights without the need for complex SQL queries. Watsonx.data introduces a conversational interface that allows CDP users to use natural language to discover, enrich and transform data. The AI functionality within the system recommends relevant data sets based on user queries and automatically generates semantics for easy identification of data. This approach not only streamlines the data exploration process but also enhances user experience by making data insights more accessible. With AI-driven recommendations, organizations can uncover hidden patterns and correlations within their data, driving informed decision-making. 

3. Bringing governed data to AI applications of choice 

Data governance, lineage and reproducibility are essential elements for any organization aiming to harness the power of AI responsibly. Watsonx and Cloudera facilitate the unification, discovery and preparation of CDP data for AI applications. CDP users can store, query and search vector embeddings in watsonx.data with integrated vector database capabilities. Watsonx.data introduces the Milvus vector store (in tech preview), enabling users to collect, curate and prepare unstructured data from CDP as vectors for retrieval-augmented generation (RAG) use cases in watsonx.ai™ and to reduce generative AI model hallucinations. Furthermore, the collaboration introduces the watsonx.ai studio, enabling AI builders to scale both traditional ML and new generative AI use cases powered by CDP data. With watsonx.governance™, an end-to-end toolkit for AI governance across the entire model lifecycle, CDP customers can deploy responsible, transparent and explainable AI workflows and navigate the complex landscape of AI ethics and compliance.
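
To make the retrieval step of RAG concrete, here is a minimal, self-contained Python sketch. It does not use the Milvus or watsonx.ai APIs: the `embed`, `retrieve` and `build_prompt` helpers and the sample documents are hypothetical, and a bag-of-words count stands in for a real embedding model. The point is only the shape of the workflow: embed documents as vectors, rank them by similarity to the question, and pass the best matches to the model as grounding context.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vector, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Assemble a prompt that grounds the model in the retrieved context."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical unstructured records, standing in for data prepared from a lake.
documents = [
    "Invoices are archived after ninety days.",
    "Sensor readings arrive every five seconds.",
    "Support tickets are triaged by severity.",
]

question = "how often do sensor readings arrive"
top_match = retrieve(question, documents)[0]
prompt = build_prompt(question, documents)
```

Grounding the prompt in retrieved records is what helps reduce hallucinations: the model is asked to answer from supplied context instead of from memory alone. A vector store takes over the storage and similarity search that this sketch does in memory.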

Watch this demo on how to use CDP data for RAG use cases with watsonx.ai.

 

Embracing the future of AI with watsonx and Cloudera 

As organizations grapple with the challenges of scaling AI, the integration of watsonx and Cloudera presents a compelling solution. By addressing the complexities of data sharing, accelerating data insights through natural language capabilities and ensuring the governance of data for AI applications, this collaboration sets a new standard for the industry. In the pursuit of unlocking the true potential of AI, leaders can now leverage the combined strengths of Cloudera Data Platform and IBM watsonx. The future of AI is not just about algorithms and models; it’s about empowering organizations to extract meaningful insights from their data, responsibly and efficiently. With watsonx and Cloudera, that future is now within reach. 

Ready to get started? Try watsonx for free today.


Learn more about IBM and Cloudera

[1] Gartner 2022 Strategic Roadmap for Storage, Gartner
