What is Apache Kafka?

Apache Kafka, defined

Apache Kafka is an open-source, distributed, event-streaming platform used to publish, store, process and consume real-time data streams. It is based on a publish-subscribe messaging model and is designed to support fault-tolerant, scalable, high-throughput and low-latency data pipelines, event-driven applications and stream-processing systems.

Today, billions of data sources continuously produce streams of information, often in the form of events, which are foundational data structures that record occurrences in a system or environment. 

Typically, an event represents an action that drives another action as part of a process. A customer placing an order, choosing a seat on a flight, or submitting a registration form are all examples of events. An event doesn’t have to involve a person. For instance, a connected thermostat’s report of the temperature at a given time is also an event.

Event streaming enables applications to respond instantly to new information. Streaming data platforms like Apache Kafka allow developers to build systems that consume, process and act on data as it arrives while maintaining the order and reliability of each event.

Kafka has evolved into the most widely adopted event-streaming platform, capable of ingesting and processing trillions of records per day in large-scale deployments while maintaining high throughput and low latency.

According to the Apache Kafka project, more than 80% of Fortune 100 companies use Kafka. Organizations such as Target, Microsoft, Airbnb and Netflix rely on it to support real-time, data-driven applications and customer experiences.

The origin of Apache Kafka

LinkedIn developed Kafka to meet the company’s growing need for a high-throughput, low-latency system capable of handling massive volumes of real-time event data. Built using Java and Scala, Kafka was open-sourced in 2011 and donated to the Apache Software Foundation, where it became a top-level project in 2012.

Several of Kafka’s original creators later founded Confluent to help organizations adopt and operate Kafka at enterprise scale, contributing additional tooling, cloud services and governance capabilities around the open-source project.

Why Apache Kafka matters

While traditional message queue systems (such as Amazon SQS) focus on delivering messages between producers and consumers, Kafka introduced something fundamentally different: a distributed event-streaming architecture built around durable event logs and configurable message retention.

Unlike many message queues, Kafka stores messages after they are consumed rather than immediately removing them. Messages remain available for a configurable retention period, allowing multiple consumers to read the same data independently.

This enables multiple applications to consume, replay, and process the same stream of data, making Kafka ideal for publish-subscribe messaging, event sourcing, stream processing, streaming analytics, and real-time data pipelines.
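The difference between "deleted on read" and "retained for a period" can be sketched with a toy in-memory log. This is not Kafka's implementation (Kafka applies retention to on-disk log segments rather than to individual records), and the class name `RetainedLog` and its methods are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RetainedLog:
    """Toy log: reads never delete records; only retention expiry does."""

    retention_seconds: float
    records: list = field(default_factory=list)  # (timestamp, value) pairs

    def append(self, timestamp: float, value: str) -> None:
        self.records.append((timestamp, value))

    def read_all(self) -> list:
        # Reading leaves the log untouched, so any number of readers can call this.
        return [value for _, value in self.records]

    def enforce_retention(self, now: float) -> None:
        # Drop records that are older than the retention window.
        cutoff = now - self.retention_seconds
        self.records = [(ts, v) for ts, v in self.records if ts >= cutoff]


log = RetainedLog(retention_seconds=60)
log.append(0, "order-created")
log.append(45, "order-paid")

billing_view = log.read_all()    # first consumer
analytics_view = log.read_all()  # second consumer sees the same records

log.enforce_retention(now=90)    # "order-created" is now older than 60 seconds
```

Both readers receive the full history, and only the retention step removes anything.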

Today, Kafka has become the de facto standard for real-time event streaming. Organizations across finance, e-commerce, telecommunications, transportation and other industries use Kafka to process large volumes of data in real time and build event-driven applications.

For example, companies such as Uber, British Gas and LinkedIn use Kafka to support real-time analytics, operational systems, monitoring and other data-intensive workloads.

Key Apache Kafka components

Apache Kafka is a distributed system built around several core components that work together to move and process data streams:

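  • Messages: A unit of data composed mainly of a key and a value, along with a timestamp and optional headers. The key typically identifies what the message is about (for example, a customer ID) and is commonly used to choose a partition, while the value is the body of the message. Kafka uses the terms “message” and “record” interchangeably.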

  • Producers: Applications that publish records (events or messages) to Kafka topics. A producer can publish to one or more topics and can optionally choose the partition that stores the data.

  • Topics: Named categories or streams of messages. Applications write data to and read data from topics.

  • Partitions: Subdivisions of a topic that distribute records across multiple brokers. Partitions let Kafka scale horizontally while preserving record order within each partition.

  • Brokers: Kafka servers that store topic partitions, handle client requests and replicate data across the cluster for fault tolerance.

  • Consumers: Applications that subscribe to topics and read records from partitions.

  • Consumer groups: Sets of consumers that work together to process records from a topic. Kafka assigns each partition to one consumer in the group, so the work is divided without duplication.

  • Offsets: Unique sequential identifiers assigned to records within a partition. Consumers use offsets to track their reading position and replay data when needed.

Together, these components allow organizations to reliably stream, process and distribute high-volume data in real time across complex, distributed environments.

Learn more about these key concepts in Apache Kafka Fundamentals from IBM Developer.

How Apache Kafka works

Kafka is a distributed event streaming platform that runs as a fault-tolerant, highly available cluster that can span multiple servers and even multiple data centers.

Its architecture can be understood as a flow of data between its key components:

  • Producers publish data to topics
  • Topics are split into partitions and distributed across brokers
  • Partitions maintain ordering and store data durably
  • Consumers read data from partitions using offsets
  • Kafka replicates partitions for reliability
  • Consumer groups coordinate parallel processing
  • Cluster coordination (ZooKeeper and KRaft)

Producers publish data to topics

Producers write records to Kafka topics. Each topic acts as an append-only log that stores records in the order they were written.

Topics are split into partitions and distributed across brokers

Each topic is then split into partitions and distributed across a cluster of Kafka brokers (servers) for scalability and parallelism.
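As a rough illustration of how a record's key can decide its partition, here is a toy topic that hashes the key and takes the result modulo the partition count. The `Topic` class is hypothetical, and `zlib.crc32` is only a stand-in: Kafka's Java client uses a murmur2 hash for keyed records.

```python
import zlib


class Topic:
    """Toy topic: a list of partitions, each an append-only list of records."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: list[list[tuple[str, str]]] = [[] for _ in range(num_partitions)]

    def partition_for(self, key: str) -> int:
        # Stand-in hash; a real Kafka client uses a different hash function.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def publish(self, key: str, value: str) -> tuple[int, int]:
        """Append a record and return its (partition, offset)."""
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1


orders = Topic("orders", num_partitions=3)
p1, o1 = orders.publish("customer-42", "created")
p2, o2 = orders.publish("customer-42", "paid")
```

Because both records share a key, they land in the same partition with consecutive offsets, which is what keeps events for one customer in order.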

Partitions maintain ordering and store data durably

Within each partition, Kafka maintains strict ordering of the records and stores them durably on disk for a configurable retention period. Ordering is guaranteed only within a partition, not across partitions.

Consumers read data from partitions using offsets

Based on the application’s needs, consumers can read records from these partitions independently in real time or from a specific offset.
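Offset-based reading can be sketched as a reader that remembers its own position in a partition. The `PartitionReader` class below is a hypothetical simplification, not a Kafka client; its `poll` and `seek` methods only loosely echo the corresponding consumer operations.

```python
class PartitionReader:
    """Toy consumer that tracks its own read position in one partition."""

    def __init__(self, partition: list[str], start_offset: int = 0) -> None:
        self.partition = partition
        self.position = start_offset

    def poll(self, max_records: int = 10) -> list[str]:
        """Return up to max_records from the current position and advance."""
        batch = self.partition[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        """Move the read position, for example to replay earlier records."""
        self.position = offset


partition = ["e0", "e1", "e2", "e3"]
reader = PartitionReader(partition)

first_batch = reader.poll(max_records=3)  # reads offsets 0 to 2
partition.append("e4")                    # a new record arrives
live_batch = reader.poll()                # continues from offset 3
reader.seek(1)
replayed = reader.poll(max_records=2)     # re-reads offsets 1 and 2
```

The partition itself never changes when it is read; only the reader's position moves.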

Kafka replicates partitions for reliability

Kafka ensures reliability through partition replication. Each partition has a leader on one broker and one or more follower replicas on other brokers. This replication helps tolerate node failures without data loss.
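A heavily simplified model of leader and follower replicas might look like the following. It assumes every write is copied to all replicas before returning, whereas real followers fetch from the leader and Kafka tracks which replicas are in sync. The `ReplicatedPartition` class and broker names are hypothetical.

```python
class ReplicatedPartition:
    """Toy partition with one leader and follower replicas on other brokers."""

    def __init__(self, brokers: list[str]) -> None:
        self.replicas: dict[str, list[str]] = {b: [] for b in brokers}
        self.leader = brokers[0]

    def append(self, record: str) -> None:
        # Simplification: copy the record to every replica before returning.
        for replica_log in self.replicas.values():
            replica_log.append(record)

    def fail_broker(self, broker: str) -> None:
        """Remove a broker; if it was the leader, promote a surviving replica."""
        del self.replicas[broker]
        if broker == self.leader:
            self.leader = next(iter(self.replicas))

    def read(self) -> list[str]:
        return list(self.replicas[self.leader])


partition = ReplicatedPartition(["broker-1", "broker-2", "broker-3"])
partition.append("payment-received")
partition.append("payment-settled")
partition.fail_broker("broker-1")  # the leader goes down
```

After the failure, a follower takes over as leader and still holds every record.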

Consumer groups coordinate parallel processing

Consumers typically operate as part of a consumer group, which coordinates processing across multiple instances by distributing partitions among them and tracking progress using offsets.
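How a group divides partitions among its members can be sketched with a simple round-robin assignment. Kafka offers several assignment strategies, and this function is only an illustrative stand-in for them. The name `assign_partitions` and the worker names are hypothetical.

```python
def assign_partitions(partitions: list[int], members: list[str]) -> dict[str, list[int]]:
    """Spread partitions over group members round-robin; each partition gets one owner."""
    assignment: dict[str, list[int]] = {m: [] for m in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3, 4, 5]
before = assign_partitions(partitions, ["worker-a", "worker-b", "worker-c"])

# worker-c leaves the group, so its partitions are redistributed.
after = assign_partitions(partitions, ["worker-a", "worker-b"])
```

With three workers each one owns two partitions. When one leaves, the remaining two take three each, so no partition is left unread.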

Cluster coordination (ZooKeeper and KRaft)

Historically, Kafka relied on Apache ZooKeeper, a centralized coordination service for distributed brokers. ZooKeeper ensured Kafka brokers remained synchronized, even if some brokers failed.

In 2021, Kafka introduced KRaft (Kafka Raft) mode as an early-access feature. KRaft eliminates the need for ZooKeeper by moving cluster metadata management into Kafka itself, where a quorum of controller nodes coordinates through a Raft-based consensus protocol. Kafka 4.0 removed ZooKeeper support entirely. This shift reduces external dependencies, simplifies the architecture and makes Kafka clusters easier to manage and scale.

Apache Kafka APIs

Developers can leverage Kafka’s capabilities through four primary application programming interfaces (APIs):

  1. Producer API
  2. Consumer API
  3. Streams API
  4. Connect API

Producer API

The Producer API enables applications to publish records (events) to Kafka topics. After a record is written to a topic, it becomes part of Kafka’s append-only log. Records are typically retained according to configured retention policies and cannot be modified in place.

Consumer API

The Consumer API enables applications to subscribe to one or more topics and to consume, process and react to the records stored within them. Consumers can process records as they arrive in real time or replay historical records by reading from earlier offsets in a topic.

Streams API

This API builds on the Producer and Consumer APIs by adding stream-processing capabilities that enable applications to perform continuous, end-to-end stream processing in real time.

Applications built with Kafka Streams can consume records from one or more topics, perform filtering, aggregations, joins and transformations, and publish the resulting streams to downstream topics or applications.

While the Producer and Consumer APIs can be used for basic stream processing, the Streams API enables the development of more sophisticated data- and event-streaming applications.
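The filter, transform and aggregate steps described above can be illustrated with plain Python generators. Kafka Streams itself is a Java library, so this is only a conceptual sketch of the processing shape, not its API. The event fields and function names are assumptions made for the example.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def only_purchases(events: Iterable[dict]) -> Iterator[dict]:
    """Filter step: keep purchase events and drop everything else."""
    return (e for e in events if e["type"] == "purchase")


def to_amounts(events: Iterable[dict]) -> Iterator[tuple[str, float]]:
    """Transform step: reduce each event to a (user, amount) pair."""
    return ((e["user"], e["amount"]) for e in events)


def total_by_user(pairs: Iterable[tuple[str, float]]) -> dict[str, float]:
    """Aggregate step: sum the amounts per user."""
    totals: dict[str, float] = defaultdict(float)
    for user, amount in pairs:
        totals[user] += amount
    return dict(totals)


# A list stands in for the input topic.
input_topic = [
    {"type": "view", "user": "ana", "amount": 0.0},
    {"type": "purchase", "user": "ana", "amount": 20.0},
    {"type": "purchase", "user": "ben", "amount": 5.0},
    {"type": "purchase", "user": "ana", "amount": 2.5},
]
output = total_by_user(to_amounts(only_purchases(input_topic)))
```

In a real streaming application the aggregate would be updated continuously as records arrive rather than computed once over a finite list.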

Connect API

This API lets developers build connectors, which are reusable source or sink components that simplify and automate the integration of external systems with Kafka.

Source connectors ingest data into Kafka topics, while sink connectors export data from Kafka topics to external systems such as databases, data warehouses, cloud services, and enterprise applications. 
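The two connector directions can be pictured with a pair of small functions. This is a conceptual sketch only; it does not use the Kafka Connect API, and the function names, the list standing in for a database table, and the dict standing in for a search index are all hypothetical.

```python
from typing import Callable, Iterable


def run_source(read_rows: Callable[[], Iterable[dict]], topic: list[dict]) -> int:
    """Source side: pull rows from an external system and append them to a topic."""
    count = 0
    for row in read_rows():
        topic.append(row)
        count += 1
    return count


def run_sink(topic: list[dict], write_row: Callable[[dict], None], start: int = 0) -> int:
    """Sink side: push records from a topic to an external system; return the next offset."""
    for record in topic[start:]:
        write_row(record)
    return len(topic)


# Hypothetical external systems on either side of the topic.
source_table = [{"id": 1, "name": "kettle"}, {"id": 2, "name": "toaster"}]
search_index: dict[int, str] = {}
topic: list[dict] = []


def index_row(record: dict) -> None:
    search_index[record["id"]] = record["name"]


copied = run_source(lambda: source_table, topic)
next_offset = run_sink(topic, index_row)
```

The topic sits in the middle, so the source and the sink never need to know about each other.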

What is Apache Kafka used for?

Kafka’s core function is event streaming, which developers primarily use to create two kinds of applications:

  • Real-time streaming data pipelines
  • Real-time streaming applications

Real-time streaming data pipelines

This use case covers applications designed specifically to move millions of data or event records between enterprise systems, at scale and in real time. These applications must move data reliably, without the corruption, duplication or other problems that typically occur when moving such large volumes of data at high speeds.

For example, financial institutions use Kafka to stream thousands of transactions per second across payment gateways, fraud detection services and accounting systems, ensuring accurate, real-time data flow without duplication or loss.
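One way to avoid duplicates when a producer retries a send is to tag each record with a producer ID and a sequence number and have the log reject sequences it has already accepted. Kafka's idempotent producer is built on a similar idea, but the sketch below is a toy model under that assumption, with a hypothetical `DedupingLog` class and made-up transaction values.

```python
class DedupingLog:
    """Toy log that drops retried records using per-producer sequence numbers."""

    def __init__(self) -> None:
        self.records: list[str] = []
        self._last_seq: dict[str, int] = {}  # producer id -> highest sequence accepted

    def append(self, producer_id: str, seq: int, value: str) -> bool:
        """Store the record unless this producer already delivered that sequence."""
        if seq <= self._last_seq.get(producer_id, -1):
            return False  # duplicate caused by a retry; ignore it
        self._last_seq[producer_id] = seq
        self.records.append(value)
        return True


log = DedupingLog()
log.append("gateway-1", 0, "debit $10")
log.append("gateway-1", 1, "credit $10")

# The acknowledgement for sequence 1 was lost, so the producer sends it again.
accepted_retry = log.append("gateway-1", 1, "credit $10")
```

The retry is rejected, so the transaction appears in the log exactly once.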

Real-time streaming applications

Real-time streaming applications are driven by record or event streams and generate streams of their own. In a digitally driven world, we encounter these apps every day.

Examples include e-commerce sites that update product availability in real time or platforms that deliver personalized content and ads based on live user activity. Kafka drives these experiences by streaming user interactions directly into analytics and recommendation engines.

Core Apache Kafka use cases

Kafka is used for real-time data pipelines and streaming applications, but where and how it is applied in system design can vary. Common examples of how Kafka is used in real systems include:

  • Microservices: Kafka facilitates communication between microservices by enabling asynchronous, event-driven messaging. This feature allows services to trigger actions across other services without being tightly coupled, supporting scalable and decoupled system architectures.
  • Containerized cloud-native environments: Kafka integrates seamlessly with cloud-native platforms using Docker for containerization and Kubernetes for container orchestration. This setup supports scalable, fault-tolerant, event-driven communication while minimizing the need for manual infrastructure management. Kafka can automatically scale and recover from failures within Kubernetes, making it ideal for dynamic cloud computing environments that run diverse application workloads.
  • Data lakes and data warehousing: Kafka acts as a real-time data pipeline between data sources and storage platforms, such as data lakes or data warehouses. This feature enables streaming large data volumes for timely ingestion and analysis, which is essential for modern analytics and business intelligence workflows.
  • IoT (Internet of Things) data handling: Kafka is well-suited for processing continuous data streams from IoT devices, enabling real-time routing of high-throughput, low-latency data to destinations like databases, analytics engines or monitoring tools. This capability supports time-sensitive applications in industries like manufacturing and healthcare.

The Apache Kafka ecosystem

Kafka integrates with several technologies, many of which are part of the Apache Software Foundation. Organizations typically use these technologies in larger event-driven architectures, stream processing or big data analytics solutions.

The Apache Kafka ecosystem includes:

Apache Spark

Apache Spark is an analytics engine for large-scale data processing. You can use Spark to perform analytics on streams delivered by Apache Kafka and to produce real-time stream-processing applications, such as clickstream analysis.

Apache NiFi

Apache NiFi is a data-flow management system with a visual, drag-and-drop interface. Because NiFi can run as a Kafka producer and a Kafka consumer, it’s an ideal tool for managing data-flow challenges that Kafka can’t address.

Apache Flink

Apache Flink is an engine for performing large-scale computations on event streams with consistently high speed and low latency. Flink can ingest streams as a Kafka consumer, perform real-time operations based on these streams, and publish the results to Kafka or another application.

Apache Hadoop

Apache Hadoop is a distributed software framework that lets you store massive amounts of data in a cluster of computers for use in big data analytics, machine learning, data mining and other data-driven applications that process structured and unstructured data. Kafka is often used to create a real-time streaming data pipeline to a Hadoop cluster.

Apache Camel

Apache Camel is an integration framework with a rule-based routing and mediation engine. It supports Kafka as a component, enabling easy data integration with other systems (e.g., databases, messaging queues), thus allowing Kafka to become part of a larger event-driven architecture.

Apache Cassandra

Apache Cassandra is a highly scalable NoSQL database designed to handle large amounts of data across many commodity servers without any single point of failure. Kafka is commonly used to stream data to Cassandra for real-time data ingestion and for building scalable, fault-tolerant applications.

Enterprise Kafka platforms and managed service providers

Enterprise Kafka platforms and managed service providers extend the open-source Apache Kafka ecosystem with additional features for real-time data processing at scale. These offerings build on Kafka’s core features, adding capabilities such as a schema registry, data governance, security controls and monitoring tools.

For example, Confluent provides a managed Kafka platform that extends Apache Kafka with enterprise-grade operational and governance features. IBM Event Streams integrates Apache Kafka to deliver a managed event streaming service designed for scalable, production workloads.

Kafka vs. RabbitMQ

RabbitMQ is an open-source message broker that enables applications, systems and services to communicate by translating messaging protocols. While Kafka and RabbitMQ are often compared, they serve different purposes. Kafka is designed for large-scale event streaming, whereas RabbitMQ focuses on flexible routing and low-latency messaging.

See the table below for key differences between Kafka and RabbitMQ:

Category        | Kafka                        | RabbitMQ
Architecture    | Distributed log system       | Message queue broker
Delivery model  | Multiple consumers per topic | Single-consumer message delivery
Message storage | Persistent storage           | Ephemeral (deleted after consumption)

Apache Kafka and open-source AI

Integrating Apache Kafka and open-source AI transforms how organizations handle real-time data and artificial intelligence. When combined with open-source AI tools, Kafka enables the application of pre-trained AI models to live data, supporting real-time decision-making and automation.

Open-source AI has made artificial intelligence more accessible, and Kafka provides the infrastructure needed to process data in real time. This setup reduces reliance on batch processing, allowing businesses to act on data as soon as it’s produced.

For example, an e-commerce company might use Kafka to stream customer interactions, such as clicks or product views, as they happen. Pre-trained AI models then process this data in real time, providing personalized recommendations or targeted offers. Kafka manages the data flow, while the AI models respond to incoming data, improving customer engagement.

By combining real-time data processing with AI models, organizations can make quicker decisions in fraud detection, predictive maintenance or dynamic pricing, enabling more responsive and efficient systems.

Benefits of Apache Kafka

Developers and architects use Kafka to build scalable, high-throughput real-time data streaming applications and pipelines. But there are many available technologies that can achieve that.

So, why Kafka? Its popularity can be attributed to the following benefits:

  • Low latency: Kafka can deliver a high volume of messages using a cluster of machines with latencies as low as 2 milliseconds. Low latency is crucial for real-time data processing and immediate responses to data streams.

  • High throughput: Kafka architecture can handle high-velocity and high-volume streams of data, which makes it ideal for applications requiring real-time data processing and integration across multiple servers.

  • High scalability: Kafka topics are partitioned and replicated in such a way that they can horizontally scale to handle large volumes of events and concurrent consumers without impacting performance.

  • High availability: Kafka ensures high availability through data replication across multiple brokers. Messages remain available even if a node fails, and the system continues running without data loss or downtime.

  • Permanent storage: Kafka stores data streams in a distributed, durable and fault-tolerant cluster, which ensures they can be accessed even when servers fail. Data is kept for as long as the configured retention policy allows. Combined with idempotent producers and transactions, the partitioned log model also lets Kafka support exactly-once processing guarantees.

  • Operational observability: Kafka provides operational visibility through metrics exposed via JMX (Java Management Extensions), allowing teams to track throughput, latency, broker health and consumer lag, which are critical for effective monitoring and capacity planning.

  • Broad integration ecosystem: With a wide range of connectors and client APIs, Kafka supports easy integration with databases, file systems and cloud services. For instance, Kafka Connect facilitates seamless data movement between systems, while Kafka Streams provides a library for building real-time stream processing applications.
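Consumer lag, one of the metrics mentioned above, is simply the gap between what has been written to a partition and what a consumer group has processed. A minimal sketch of the arithmetic, using hypothetical offset snapshots rather than values read from a real cluster:

```python
def consumer_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Lag per partition: records written but not yet processed by the group."""
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}


# Hypothetical snapshot: next offset to be written vs. next offset the group will read.
end_offsets = {0: 1200, 1: 980, 2: 1500}
committed = {0: 1200, 1: 950, 2: 900}

lag = consumer_lag(end_offsets, committed)
total_lag = sum(lag.values())
```

Here partition 0 is fully caught up, while partition 2 is 600 records behind, which would be the place to start investigating.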

Challenges of Apache Kafka

While Kafka is powerful, the learning curve and operational overhead can be steep, especially as use cases grow. This is often referred to as the “Kafka Tax.” It encompasses the following challenges:

  • Operational complexity: Running and maintaining clusters, especially at scale, requires specialized expertise and carries a high operational burden. Even small misconfigurations can lead to outages, degraded performance, replication issues or data durability risks.

  • Monitoring and troubleshooting: Identifying the root cause of issues can be challenging because Kafka generates many operational metrics. Issues such as unbalanced partitions or network constraints can be difficult to detect and diagnose yet significantly impact performance and reliability.

  • Scaling: Scaling Kafka often involves carefully redistributing partitions and rebalancing workloads across brokers, which can be resource-intensive and temporarily affect performance. Multi-region or multi-cloud deployments introduce additional architectural complexity around replication, latency, failover and consistency.

  • Security: Kafka provides built-in support for encryption, authentication and access control. However, organizations often require additional tooling and processes for governance, audit logging, compliance and enterprise-scale security management.

With both its benefits and limitations in mind, when should you use Apache Kafka?

Kafka excels at event streaming, data integration and large-scale data processing, but is usually not the best choice for database storage, simple point-to-point messaging or low-volume workloads.

Authors

Stephanie Susnjara

Staff Writer

IBM Think

Ian Smalley

Staff Editor

IBM Think

Alexandra Jonker

Staff Editor

IBM Think
