IBM acquires StreamSets, a leading real-time data integration company
1 July 2024

We are thrilled to announce that IBM has acquired StreamSets, a real-time data integration company specializing in streaming structured, unstructured and semistructured data across hybrid multicloud environments.

StreamSets was acquired from Software AG along with webMethods. This strategic acquisition expands IBM’s already robust data integration capabilities, helping to solidify our position as a leader in the data integration market and enhancing IBM Data Fabric’s delivery of secure, high-quality data for artificial intelligence (AI).

According to a Forrester study conducted on behalf of IBM, 87% of organizations require data to be ingested and analyzed within one day or less. As data variety, volume and velocity continue to rise, a real-time data integration tool such as StreamSets helps reduce the staleness of data produced by traditional data pipelines, allowing for real-time insights and decision-making.

Unlocking the power of real-time data integration

The acquisition of StreamSets extends the breadth and depth of IBM Data Fabric’s data integration capabilities by enabling the design of real-time data pipelines. This helps users ingest, enrich and harness the potential of streaming data through features such as offset handling and delivery guarantees.

The goal is to enable continuous, real-time processing, integration and transfer of data as soon as it becomes available, reducing latency and data staleness. StreamSets is available today as a SaaS offering across major hyperscalers.
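
To make the ideas of offset handling and delivery guarantees more concrete, here is a minimal sketch in plain Python. It is not StreamSets code and does not use any StreamSets API; the names `OffsetStore`, `run_pipeline` and `deliver` are hypothetical and exist only for illustration. It assumes an in-memory list as the source and shows one common pattern: commit the offset only after a batch has been delivered, which gives at-least-once delivery.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class OffsetStore:
    """Hypothetical store for the last committed offset (in memory here)."""
    committed: int = 0  # index of the next record to read


def run_pipeline(source: List[str], store: OffsetStore,
                 deliver: Callable[[str], None], batch_size: int = 2) -> None:
    """Read from the committed offset, deliver each batch, then commit.

    Committing only after the whole batch is delivered gives at-least-once
    semantics: a failure mid-batch means the batch is re-read on restart,
    so records can be duplicated but are never skipped.
    """
    while store.committed < len(source):
        start = store.committed
        batch = source[start:start + batch_size]
        for record in batch:
            deliver(record)  # may raise; the offset is then left unchanged
        store.committed = start + len(batch)  # commit after successful delivery
```

If `deliver` fails partway through a batch, calling `run_pipeline` again resumes from the last committed offset. Records delivered before the failure are sent a second time, which is why at-least-once pipelines are usually paired with idempotent destinations.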

Enabling generative AI use cases with enhanced data integration capabilities for IBM Data Fabric

The emergence of generative AI has heightened the significance of data. According to IDC, stored data is set to increase by 250% by 2025, propagating rapidly on premises and across clouds, applications and locations, often with compromised quality.

With growth comes complexity. Data is becoming more diverse, distributed and dynamic, stored across multiple systems and repositories in hybrid and multicloud environments. The resulting mix of applications, formats and data silos makes it harder for organizations to use all their data for AI and to curate high-quality data assets for analytics and AI use cases.

Effectively managing data quality within a distributed data landscape stands as a major obstacle for organizations striving to become more data-driven and embrace cutting-edge generative AI technologies. By prioritizing seamless integration between products and emphasizing strategic architectural decisions, organizations can gain a significant competitive edge with their AI implementations.

The real-time and streaming capabilities provided by StreamSets, coupled with IBM Data Fabric’s top-tier data integration services through IBM® DataStage® for bulk processing and data observability through IBM® Databand®, enable us to comprehensively address modern data pipeline workloads.

With StreamSets’ innovative visual-oriented approach to building real-time data pipelines, we can now offer our clients the ability to capture and stream data in real time, regardless of its structure or complexity. This means that our clients can respond faster to changing business conditions, make more informed decisions and drive greater innovation.

Other features of StreamSets include:

  • Change data capture (CDC) support: Generate a feed of events by using transaction-based capture.
  • Hybrid cloud support: Integrate data across multiple cloud platforms and on-premises systems with StreamSets hybrid control and data planes, enabling workloads to run in the same physical location where data resides.
  • Reduce data drift with inflight transformation: Apply filtering and quality checks during ingestion. Automatically detect and alert on changes in data structures and schemas, seamlessly adapting to evolving business requirements with zero downtime.
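
As a rough illustration of the drift detection and inflight quality checks described in the list above, the following Python sketch compares each incoming record against an expected set of field names, raises an alert when fields appear or disappear, and filters out records that fail a simple check. It is a generic example, not StreamSets functionality; `detect_drift`, `process`, the `alerts` list and the "records need an `id`" rule are all assumptions made for the example.

```python
from typing import Any, Dict, Iterable, Iterator, List, Set, Tuple

Record = Dict[str, Any]


def detect_drift(expected: Set[str], record: Record) -> Tuple[Set[str], Set[str]]:
    """Return (added, missing) field names relative to the expected schema."""
    fields = set(record)
    return fields - expected, expected - fields


def process(records: Iterable[Record], expected: Set[str],
            alerts: List[str]) -> Iterator[Record]:
    """Alert on schema drift and filter bad records while data is in flight."""
    known = set(expected)  # copy so the caller's schema is not mutated
    for rec in records:
        added, missing = detect_drift(known, rec)
        if added:
            alerts.append(f"new fields: {sorted(added)}")
            known |= added  # adapt to the new shape instead of stopping
        if missing:
            alerts.append(f"missing fields: {sorted(missing)}")
        # Illustrative quality check: drop records without a usable id.
        if rec.get("id") is None:
            continue
        yield rec
```

Because `process` is a generator, records flow through one at a time and the pipeline keeps running when the schema changes; only the alert list and the known field set are updated.
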

The future of IBM Data Integration

IBM data integration solutions play an essential role in organizations’ data architectures, enabling the connectivity, transformation and enrichment of data across various locations for productive and trusted use. It is crucial to choose the right integration style to fit the organization’s use case, whether it involves batch extract, transform and load (ETL) or extract, load and transform (ELT), data virtualization, change data capture or real-time streaming.

With the acquisition of StreamSets, our aim is to simplify how organizations approach streaming data use cases. We firmly believe that data should be the driving force behind innovation and growth, and we are dedicated to providing our customers with the necessary tools for success.

Existing StreamSets clients continue to receive the same high level of support and service they have come to expect. Moreover, this acquisition brings increased investment to extend connectivity and CDC support, integrate with data lineage and data observability, and focus on end-to-end data pipeline orchestration.

For existing IBM customers, this acquisition expands and complements our data integration capabilities. It provides data engineers with an integrated suite of tools that cater to multiple data integration patterns, such as batch, CDC and real-time patterns, all infused with data observability capabilities.

At IBM, we are committed to innovation and providing our customers with the hybrid multicloud, AI and data tools they need to succeed. The acquisition of StreamSets is a testament to this commitment, and we are excited to bring its innovative technology to our clients.

Author
Scott Brokaw, Director, Product Management, Data Integration, IBM