Apache Kafka is a high-performance, highly scalable event streaming platform. To unlock Kafka’s full potential, you need to carefully consider the design of your application. It’s all too easy to write Kafka applications that perform poorly or eventually hit a scalability brick wall. Since 2015, IBM has provided the IBM Event Streams service, which is a fully-managed Apache Kafka service running on IBM Cloud®. Since then, the service has helped many customers, as well as teams within IBM, resolve scalability and performance problems with the Kafka applications they have written.
This article describes some common problems that Apache Kafka applications run into and provides recommendations for how you can avoid hitting scalability limits with your own applications.
Certain Kafka operations work by the client sending data to the broker and waiting for a response. A whole round-trip might take 10 milliseconds, which sounds speedy, but limits you to at most 100 operations per second. For this reason, it’s recommended that you try to avoid these kinds of operations whenever possible. Fortunately, Kafka clients provide ways for you to avoid waiting on these round-trip times. You just need to ensure that you’re taking advantage of them.
Tips to maximize throughput:

- Don’t wait for each send to complete before starting the next. The Kafka producer’s send method is asynchronous and returns a future; calling get on that future for every message serializes every send behind a network round trip. Use the completion callback, or check the futures in batches, instead.
- Don’t commit consumer offsets after every message. A synchronous commit is a round trip; commit periodically, or use asynchronous commits.
- Let the clients batch. Both the producer and the consumer are designed to group many messages into each network request, amortizing the round-trip cost.
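The arithmetic behind the round-trip limit, and why pipelining requests escapes it, can be sketched with a back-of-envelope model (plain arithmetic, not a Kafka API call; the in-flight count is an illustrative assumption):

```python
# Back-of-envelope model of why per-operation round-trips cap throughput.
round_trip_ms = 10                       # one client->broker->client round trip

# Synchronous style: wait for each response before sending the next request.
sync_ops_per_sec = 1000 // round_trip_ms
print(sync_ops_per_sec)                  # 100

# Pipelined style: keep many requests in flight so the waits overlap.
in_flight = 50                           # e.g. asynchronous sends, batched commits
pipelined_ops_per_sec = (1000 // round_trip_ms) * in_flight
print(pipelined_ops_per_sec)             # 5000
```

The round-trip time hasn’t changed; only how much useful work happens during each one.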
If you read the above and thought, “Uh oh, won’t that make my application more complex?” — the answer is yes, it likely will. There is a trade-off between throughput and application complexity. What makes network round-trip time a particularly insidious pitfall is that once you hit this limit, it can require extensive application changes to achieve further throughput improvements.
One helpful feature of Kafka is that it monitors the “liveness” of consuming applications and disconnects any that might have failed. This works by having the broker track when each consuming client last called “poll” (Kafka’s terminology for asking for more messages). If a client doesn’t poll frequently enough, the broker to which it is connected concludes that it must have failed and disconnects it. This is designed to allow the clients that are not experiencing problems to step in and pick up work from the failed client.
Unfortunately, with this scheme the Kafka broker can’t distinguish between a client that is taking a long time to process the messages it received and a client that has actually failed. Consider a consuming application that loops: 1) calls poll and gets back a batch of messages; 2) processes each message in the batch, taking 1 second per message.
If this consumer is receiving batches of 10 messages, then it’ll be approximately 10 seconds between calls to poll. By default, Kafka will allow up to 300 seconds (5 minutes) between polls before disconnecting the client — so everything would work fine in this scenario. But what happens on a really busy day when a backlog of messages starts to build up on the topic that the application is consuming from? Rather than just getting 10 messages back from each poll call, your application gets 500 messages (by default this is the maximum number of records that can be returned by a call to poll). That would result in enough processing time for Kafka to decide the application instance has failed and disconnect it. This is bad news.
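The numbers in this scenario can be checked directly against Kafka’s defaults:

```python
# Worked example from the scenario above: compare the time spent between
# poll() calls against Kafka's default max.poll.interval.ms (5 minutes).
max_poll_interval_s = 300         # default: 300000 ms
per_message_s = 1                 # processing time per message

quiet_day_batch = 10              # typical batch on a quiet day
busy_day_batch = 500              # default max.poll.records

quiet_ok = quiet_day_batch * per_message_s <= max_poll_interval_s
busy_ok = busy_day_batch * per_message_s <= max_poll_interval_s
print(quiet_ok, busy_ok)          # True False -> the busy client is disconnected
```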
You’ll be delighted to learn that it can get worse. It is possible for a kind of feedback loop to occur. As Kafka starts to disconnect clients because they aren’t calling poll frequently enough, there are fewer instances of the application to process messages. The likelihood of there being a large backlog of messages on the topic increases, leading to an increased likelihood that more clients will get large batches of messages and take too long to process them. Eventually all the instances of the consuming application get into a restart loop, and no useful work is done.
What steps can you take to avoid this happening to you?

- Limit how much work each poll can return by lowering “max.poll.records”, so that even a busy day’s batch can be processed within the poll interval.
- If your processing is legitimately slow, raise “max.poll.interval.ms” — but remember this also delays how quickly Kafka detects a client that genuinely has failed.
- For long-running work, hand messages off to separate processing threads and keep calling poll (pausing partitions if necessary) so the consumer stays live.
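The first two suggestions amount to keeping the worst-case batch inside the poll interval. A sketch, using a plain dict of the kind you might pass to a Kafka client library — the numbers are illustrative assumptions, not recommendations:

```python
# Illustrative consumer settings: keep the worst-case batch processing time
# inside the poll interval. All numbers are assumptions for this example.
per_message_s = 1                        # measured processing time per message

consumer_config = {
    "max.poll.records": 100,             # smaller batches (default: 500)
    "max.poll.interval.ms": 600_000,     # more slack between polls (default: 300000)
}

# Sanity check: the largest possible batch must fit inside the poll interval.
worst_case_ms = consumer_config["max.poll.records"] * per_message_s * 1000
assert worst_case_ms < consumer_config["max.poll.interval.ms"]
print(worst_case_ms)                     # 100000 ms, well inside the interval
```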
We’ll return to the topic of consumer failures later in this article, when we look at how they can trigger consumer group re-balancing and the disruptive effect this can have.
Under the hood, the protocol used by the Kafka consumer to receive messages works by sending a “fetch” request to a Kafka broker. As part of this request the client indicates what the broker should do if there aren’t any messages to hand back, including how long the broker should wait before sending an empty response. By default, Kafka consumers instruct the brokers to wait up to 500 milliseconds (controlled by the “fetch.max.wait.ms” consumer configuration) for at least 1 byte of message data to become available (controlled with the “fetch.min.bytes” configuration).
Waiting for 500 milliseconds doesn’t sound unreasonable, but if your application has consumers that are mostly idle, and scales to say 5,000 instances, that’s potentially thousands of fetch requests per second that accomplish absolutely nothing. Each of these requests takes CPU time on the broker to process, and at the extreme this can impact the performance and stability of the Kafka clients that want to do useful work.
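A rough model of the idle-fetch load, assuming every idle consumer issues a new fetch as soon as the previous one returns empty after “fetch.max.wait.ms”:

```python
# Rough model of idle-fetch load. Assumption: each idle consumer sends a new
# fetch the moment the previous one returns empty after fetch.max.wait.ms.
instances = 5_000
fetch_max_wait_ms = 500                       # Kafka's default

fetches_per_sec = instances * (1000 // fetch_max_wait_ms)
print(fetches_per_sec)                        # 10000 empty fetches per second
```

And as the next paragraph explains, each client actually fetches from every broker leading a partition it consumes from, so the real total can be higher still.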
Normally Kafka’s approach to scaling is to add more brokers, and then evenly re-balance topic partitions across all the brokers, both old and new. Unfortunately, this approach might not help if your clients are bombarding Kafka with needless fetch requests. Each client will send fetch requests to every broker leading a topic partition that the client is consuming messages from. So it is possible that even after scaling the Kafka cluster, and re-distributing partitions, most of your clients will be sending fetch requests to most of the brokers.
So, what can you do?

- Increase “fetch.max.wait.ms” so that idle consumers poll the brokers less often. The trade-off is slightly higher latency when a message does finally arrive.
- Increase “fetch.min.bytes” so the broker waits for a meaningful amount of data before replying, rather than responding as soon as a single byte is available.
- Reconsider whether you really need that many mostly-idle consumers, for example by consolidating lightly used topics.
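As a sketch of the first two knobs (the values here are illustrative assumptions, not recommendations):

```python
# Illustrative tuning for mostly-idle consumers: make empty fetches rarer by
# waiting longer and asking for more data before the broker replies.
consumer_config = {
    "fetch.max.wait.ms": 5_000,    # wait up to 5 s for data (default: 500)
    "fetch.min.bytes": 16_384,     # wait for at least 16 KB (default: 1 byte)
}

default_idle_rate = 1000 / 500                                 # fetches/s per idle client
tuned_idle_rate = 1000 / consumer_config["fetch.max.wait.ms"]
print(round(default_idle_rate / tuned_idle_rate))              # ~10x fewer empty fetches
```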
If you come to Kafka from a background with other publish–subscribe systems (for example Message Queuing Telemetry Transport, or MQTT for short) then you might expect Kafka topics to be very lightweight, almost ephemeral. They are not. Kafka is much more comfortable with a number of topics measured in thousands. Kafka topics are also expected to be relatively long lived. Practices such as creating a topic to receive a single reply message, then deleting the topic, are uncommon with Kafka and do not play to Kafka’s strengths.
Instead, plan for topics that are long lived. Perhaps they share the lifetime of an application or an activity. Also aim to limit the number of topics to the hundreds or perhaps low thousands. This might require taking a different perspective on what messages are interleaved on a particular topic.
A related question that often arises is, “How many partitions should my topic have?” Traditionally, the advice is to overestimate, because adding partitions after a topic has been created doesn’t change the partitioning of existing data held on the topic (and hence can affect consumers that rely on partitioning to offer message ordering within a partition). This is good advice; however, we’d like to suggest a few additional considerations:

- Partitions aren’t free. Each one consumes broker resources (file handles, memory, replication traffic), so a wildly overestimated partition count has a real cost.
- The partition count caps how many consumers in a group can do useful work, so size for the consumer parallelism you expect to need at peak.
- If your messages are keyed, adding partitions later changes which partition a given key maps to — another reason to get the estimate roughly right up front.
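One common back-of-envelope approach to sizing (an illustration, not an official formula — all the throughput figures are assumptions you would replace with your own measurements):

```python
import math

# Back-of-envelope partition estimate: enough partitions that both the
# producer side and the consumer side can reach the target throughput.
target_mb_per_s = 100      # throughput the topic must sustain (assumption)
producer_mb_per_s = 20     # measured per-producer throughput to one partition
consumer_mb_per_s = 10     # measured per-consumer throughput from one partition

partitions = max(math.ceil(target_mb_per_s / producer_mb_per_s),
                 math.ceil(target_mb_per_s / consumer_mb_per_s))
print(partitions)          # 10 -> the slower (consumer) side dictates the count
```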
Most Kafka applications that consume messages take advantage of Kafka’s consumer group capabilities to coordinate which clients consume from which topic partitions. If your recollection of consumer groups is a little hazy, here’s a quick refresher on the key points:

- Consumers that share the same “group.id” form a consumer group, and the partitions of the topics they subscribe to are divided among them.
- Each partition is assigned to at most one consumer in the group at a time, which is how Kafka preserves message ordering within a partition.
- Whenever the group’s membership changes — a client joins, leaves, or is considered failed — the partitions are re-assigned across the members. This is called a re-balance.
As Kafka has matured, increasingly sophisticated re-balancing algorithms have been (and continue to be) devised. In early versions of Kafka, when a consumer group re-balanced, all the clients in the group had to stop consuming, the topic partitions would be redistributed amongst the group’s new members and all the clients would start consuming again. This approach has two drawbacks (don’t worry, these have since been improved):

- It is “stop the world”: every client in the group stops consuming for the duration of the re-balance, even clients whose partition assignments end up unchanged.
- Clients discard any message data they have buffered for partitions that are taken away from them, wasting the network bandwidth and broker effort spent fetching it.
More recent re-balance algorithms have made significant improvements by, to use Kafka’s terminology, adding “stickiness” and “cooperation”:

- Stickiness means the assignor tries to leave partitions with the clients that already own them, so a re-balance moves as few partitions as possible.
- Cooperation means clients keep consuming from the partitions they retain while the re-balance is in progress; only the partitions that actually move are revoked.
Despite these enhancements to more recent re-balancing algorithms, if your application is frequently subject to consumer group re-balances, you will still see an impact on overall messaging throughput and waste network bandwidth as clients discard and re-fetch buffered message data. Here are some suggestions about what you can do:

- Use the cooperative sticky assignor (via “partition.assignment.strategy”) so that re-balances move as few partitions as possible.
- Use static group membership (via “group.instance.id”) so that a planned restart of a client doesn’t trigger a re-balance at all, provided it rejoins within the session timeout.
- Address the causes of unwanted re-balances: clients that poll too infrequently (see above), or session timeouts that are too aggressive for your environment.
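The first two suggestions can be sketched as client configuration (the instance id is a made-up example; each application instance needs its own stable value):

```python
# Illustrative settings for reducing re-balance disruption.
consumer_config = {
    # Incremental ("cooperative") sticky assignment, available in Kafka 2.4+.
    "partition.assignment.strategy":
        "org.apache.kafka.clients.consumer.CooperativeStickyAssignor",
    # Static membership (Kafka 2.3+): a restart that completes within the
    # session timeout does not trigger a re-balance at all.
    "group.instance.id": "example-instance-1",   # assumption: one stable id per instance
    "session.timeout.ms": 30_000,
}
print(sorted(consumer_config))
```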
You’re now an expert in scaling Kafka applications. You’re invited to put these points into practice and try out the fully-managed Kafka offering on IBM Cloud. If you run into any challenges getting set up, see the Getting Started Guide and FAQs.