Apache Kafka is an open-source, distributed, event-streaming platform used to publish, store, process and consume real-time data streams. It is based on a publish-subscribe messaging model and is designed to support fault-tolerant, scalable, high-throughput and low-latency data pipelines, event-driven applications and stream-processing systems.
Today, billions of data sources continuously produce streams of information, often in the form of events, which are foundational data structures that record occurrences in a system or environment.
Typically, an event represents an action that drives another action as part of a process. A customer placing an order, choosing a seat on a flight, or submitting a registration form are all examples of events. An event doesn’t have to involve a person, for instance, a connected thermostat’s report of the temperature at a given time is also an event.
Event streaming enables applications to respond instantly to new information. Streaming data platforms like Apache Kafka allow developers to build systems that consume, process and act on data as it arrives while maintaining the order and reliability of each event.
Kafka has evolved into the most widely adopted event-streaming platform, capable of ingesting and processing trillions of records per day in large-scale deployments while maintaining high throughput and low latency.
Over 80% of Fortune 500 organizations use Kafka, including Target, Microsoft, Airbnb and Netflix, to support real-time, data-driven applications and customer experiences.
In 2011, LinkedIn developed Apache Kafka to meet the company’s growing need for a high-throughput, low-latency system capable of handling massive volumes of real-time event data. Built using Java and Scala, Kafka was later open-sourced and donated to the Apache Software Foundation.
Several of Kafka’s original creators later founded Confluent to help organizations adopt and operate Kafka at enterprise scale, contributing additional tooling, cloud services and governance capabilities around the open-source project.
While traditional message queue systems (such as AWS’s Amazon SQS) focus on delivering messages between producers and consumers, Kafka introduced something fundamentally different—a distributed event-streaming architecture built around durable event logs and configurable message retention.
Unlike many message queues, Kafka stores messages after they are consumed rather than immediately removing them. Messages remain available for a configurable retention period, allowing multiple consumers to read the same data independently.
This enables multiple applications to consume, replay, and process the same stream of data, making Kafka ideal for publish-subscribe messaging, event sourcing, stream processing, streaming analytics, and real-time data pipelines.
Today, Kafka has become the de facto standard for real-time event streaming. Organizations across finance, e-commerce, telecommunications, transportation and other industries use Kafka to process large volumes of data in real time and build event-driven applications.
For example, companies such as Uber, British Gas and LinkedIn use Kafka to support real-time analytics, operational systems, monitoring and other data-intensive workloads.
Stay up to date on the most important—and intriguing—industry trends on AI, automation, data and beyond with the Think newsletter. See the IBM Privacy Statement.
Apache Kafka is a distributed system built around several core components that work together to move and process data streams:
Together, these components allow organizations to reliably stream, process and distribute high-volume data in real time across complex, distributed environments.
Learn more about these key concepts in Apache Kafka Fundamentals from IBM Developer.
Kafka is a distributed event streaming platform that runs as a fault-tolerant, highly available cluster that can span multiple servers and even multiple data centers.
Its architecture can be understood as a flow of data between its key components:
Producers write records to Kafka topics named logs that store the records in the order they occurred relative to one another.
Each topic is then split into partitions and distributed across a cluster of Kafka brokers (servers) for scalability and parallelism.
Within each partition, Kafka maintains strict ordering of the records and stores them durably on disk for a configurable retention period. Ordering is guaranteed only within a partition, not across partitions.
Based on the application’s needs, consumers can read records from these partitions independently in real time or from a specific offset.
Kafka ensures reliability through partition replication. Each partition has a leader on one broker and one or more follower replicas on other brokers. This replication helps tolerate node failures without data loss.
Consumers typically operate as part of a consumer group, which coordinates processing across multiple instances by distributing partitions among them and tracking progress using offsets.
Historically, Kafka relied on Apache ZooKeeper, a centralized coordination service for distributed brokers. ZooKeeper ensured Kafka brokers remained synchronized, even if some brokers failed.
In 2021, Kafka introduced KRaft (Kafka Raft Protocol) mode, eliminating the need for ZooKeeper by consolidating these tasks into the Kafka brokers themselves. This shift reduces external dependencies, simplifies architecture and makes Kafka clusters more fault-tolerant and easier to manage and scale.
Developers can leverage Kafka’s capabilities through four primary application programming interfaces (APIs):
The Producer API enables applications to publish records (events) to Kafka topics. After a record is written to a topic, it becomes part of Kafka’s append-only log. Records are typically retained according to configured retention policies and cannot be modified in place.
The Consumer API enables applications to subscribe to one or more topics and to consume, process and react to the records stored within them. Consumers can process records as they arrive in real time or replay historical records by reading from earlier offsets in a topic.
This API builds on the Producer and Consumer APIs by adding stream-processing capabilities that enable applications to perform continuous, front-to-back stream processing in real time.
Applications built with Kafka Streams can consume records from one or more topics, perform filtering, aggregations, joins and transformations, and publish the resulting streams to downstream topics or applications.
While the Producer and Consumer APIs can be used for basic stream processing, the Streams API enables the development of more sophisticated data- and event-streaming applications.
This API lets developers build connectors, which are reusable source or sink components that simplify and automate the integration of external systems with Kafka.
Source connectors ingest data into Kafka topics, while sink connectors export data from Kafka topics to external systems such as databases, data warehouses, cloud services, and enterprise applications.
Kafka’s core function is event streaming, which developers primarily use to create two kinds of applications:
This use case is for applications designed specifically to move millions and millions of data or event records between enterprise systems, at scale and in real-time. The apps must move data reliably, without risk of corruption, duplication or other problems that typically occur when moving such large volumes of data at high speeds.
For example, financial institutions use Kafka to stream thousands of transactions per second across payment gateways, fraud detection services and accounting systems, ensuring accurate, real-time data flow without duplication or loss.
Applications that are driven by record or event streams and that generate streams of their own. In the digitally driven world, we encounter these apps every day.
Examples include e-commerce sites that update product availability in real-time or platforms that deliver personalized content and ads based on live user activity. Kafka drives these experiences by streaming user interactions directly into analytics and recommendation engines.
Kafka is used for real-time data pipelines and streaming applications, but where and how it is applied in system design can vary. Common examples of how Kafka is used in real systems include:
Kafka integrates with several technologies, many of which are part of the Apache Software Foundation. Organizations typically use these technologies in larger event-driven architectures, stream processing or big data analytics solutions.
The Apache Kafka ecosystem includes:
Apache Spark is an analytics engine for large-scale data processing. You can use Spark to perform analytics on streams delivered by Apache Kafka and to produce real-time stream-processing applications, such as clickstream analysis.
Apache NiFi is a data-flow management system with a visual, drag-and-drop interface. Because NiFi can run as a Kafka producer and a Kafka consumer, it’s an ideal tool for managing data-flow challenges that Kafka can’t address.
Apache Flink is an engine for performing large-scale computations on event streams with consistently high speed and low latency. Flink can ingest streams as a Kafka consumer, perform real-time operations based on these streams, and publish the results to Kafka or another application.
Apache Hadoop is a distributed software framework that lets you store massive amounts of data in a cluster of computers for use in big data analytics, machine learning, data mining and other data-driven applications that process structured and unstructured data. Kafka is often used to create a real-time streaming data pipeline to a Hadoop cluster.
Apache Camel is an integration framework with a rule-based routing and mediation engine. It supports Kafka as a component, enabling easy data integration with other systems (e.g., databases, messaging queues), thus allowing Kafka to become part of a larger event-driven architecture.
Apache Cassandra is a highly scalable NoSQL database designed to handle large amounts of data across many commodity servers without any single point of failure. Kafka is commonly used to stream data to Cassandra for real-time data ingestion and for building scalable, fault-tolerant applications.
Enterprise Kafka platforms and managed service providers extend the open-source Apache Kafka ecosystem with additional features for real-time data processing at scale. These offerings build on Kafka’s core features, adding capabilities such as a schema registry, data governance, security controls and monitoring tools.
For example, Confluent provides a managed Kafka platform that extends Apache Kafka with enterprise-grade operational and governance features. IBM Event Streams integrates Apache Kafka to deliver a managed event streaming service designed for scalable, production workloads.
RabbitMQ is an open-source message broker that enables applications, systems and services to communicate by translating messaging protocols. While Kafka and RabbitMQ are often compared, they serve different purposes. Kafka is designed for large-scale event streaming, whereas RabbitMQ focuses on flexible routing and low-latency messaging.
See the table below for key differences between Kafka and RabbitMQ:
| Category | Kafka | RabbitMQ |
|---|---|---|
| Architecture | Distributed log system | Message queue broker |
| Delivery Model | Multiple consumers per topic | Single-consumer message delivery |
| Message Storage | Persistent storage | Ephemeral (deleted after consumption) |
Integrating Apache Kafka and open-source AI transforms how organizations handle real-time data and artificial intelligence. When combined with open-source AI tools, Kafka enables the application of pre-trained AI models to live data, supporting real-time decision-making and automation.
Open-source AI has made artificial intelligence more accessible, and Kafka provides the infrastructure needed to process data in real-time. This setup eliminates the need for batch processing, allowing businesses to act on data immediately as it’s produced.
For example, an e-commerce company might use Kafka to stream customer interactions, such as clicks or product views, as they happen. Pre-trained AI models then process this data in real-time, providing personalized recommendations or targeted offers. Kafka manages the data flow, while the AI models adapt based on incoming data, improving customer engagement.
By combining real-time data processing with AI models, organizations can make quicker decisions in fraud detection, predictive maintenance or dynamic pricing, enabling more responsive and efficient systems.
Developers and architects use Kafka to build scalable, high-throughput real-time data streaming applications and pipelines. But there are many available technologies that can achieve that.
So, why Kafka? Its popularity can be attributed to the following benefits:
While Kafka is powerful, the learning curve and operational overhead can be steep, especially as use cases grow. This is often referred to as the “Kafka Tax.” It encompasses the following challenges:
With both its benefits and limitations in mind, when should you use Apache Kafka?
Kafka excels at event streaming, data integration and large-scale data processing, but is usually not the best choice for database storage, simple point-to-point messaging or low-volume workloads.
IBM Event Streams is an event streaming software built on open source Apache Kafka. It is available as a fully managed service on IBM Cloud or for self-hosting.
Unlock business potential with IBM integration solutions, connecting applications and systems to access critical data quickly and securely.
Unlock new capabilities and drive business agility with IBM cloud consulting services.