Monitoring Kafka

The Kafka sensor is automatically deployed and installed after you install the Instana agent.

Support information

To make sure that the Kafka sensor is compatible with your current setup, verify the following support information sections:

Supported versions and support policy

All Kafka metrics that Instana collects are available for every version of Apache Kafka, Cloudera Kafka, and Confluent Kafka, except the Consumer group lag and the Consumer/Producer Byte Rate/Throttling metrics. IBM® Event Streams, which is built on open source Apache Kafka, is supported from IBM Event Streams 11.0.4 (IBM Event Streams Operator 3.0.5) and later versions.

The following table shows the latest supported version and support policy:

Table 1. Latest supported version and support policy
Technology	Support policy	Latest version	Latest supported version
Apache Kafka	45 days	3.9	3.9
Cloudera Kafka	45 days	4.1.x	4.1.x
Confluent Kafka	45 days	7.8.0	7.8.0
IBM Event Streams	On demand	11.6	11.6

For more information about the support policy, see Support strategy for sensors.

Additional support information

Consumer group lag metrics are available for the following versions:

Apache Kafka versions from 0.11.x.x to 3.x.x
Cloudera Kafka versions from 3.x.x to 4.1.x
Confluent Kafka versions from 3.3.x to 7.8.0
IBM Event Streams 11.0.4 (IBM Event Streams Operator 3.0.5) and later versions

Consumer/Producer Byte Rate/Throttling metrics are available for Java Kafka clients only and for the following versions:

Apache Kafka versions from 1.1.x to 3.x.x
Cloudera Kafka versions from 4.0.x to 4.1.x
Confluent Kafka versions from 4.1.x to 7.8.0
IBM Event Streams 11.0.4 (IBM Event Streams Operator 3.0.5) and later versions

Supported client-side tracing

Instana supports client-side tracing for the following languages and runtimes:

Configuration

The Instana agent automatically detects running Kafka brokers, so no configuration is required.

Instana collects the first 400 topics that are sorted by topic name.

To filter topics, configure the Kafka plugin in the agent configuration file <agent_install_dir>/etc/instana/configuration.yaml as shown in the following example:

com.instana.plugin.kafka:
  ...
  topicsRegex: '<OPTIONAL_REGEX_HERE>'
  brokerPropertiesFilePath: '/path/to/server.properties'
  collectLagData: '' # true or false. The default value is true

topicsRegex: Optional regular expression to select up to 400 topics by name. If the value is empty or does not exist, Instana collects the first 400 topics that are sorted by name.
brokerPropertiesFilePath: The path to the broker server.properties file that the agent uses to obtain information about the broker network and security protocol settings.
collectLagData: Flag that enables or disables lag data collection (enabled by default).
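
The selection rule for topicsRegex can be illustrated with a short Python sketch. This is not the agent's implementation: the function name select_topics is hypothetical, and the choice to match the pattern anywhere in the topic name (re.search) and to filter before applying the 400-topic limit are assumptions, because the matching semantics are not described here.

```python
import re
from typing import Iterable, List, Optional

MAX_TOPICS = 400  # limit stated in this documentation


def select_topics(topic_names: Iterable[str],
                  topics_regex: Optional[str] = None) -> List[str]:
    """Return at most MAX_TOPICS topic names, sorted by name.

    Assumptions (not confirmed by the documentation): an empty or missing
    regex means "no filter", and a match anywhere in the name counts.
    """
    names = sorted(topic_names)
    if topics_regex:
        pattern = re.compile(topics_regex)
        names = [name for name in names if pattern.search(name)]
    return names[:MAX_TOPICS]
```

For example, with 500 topics named orders-0000 to orders-0499 plus payments and audit, calling select_topics without a regex keeps the first 400 names in sorted order, whereas the pattern ^(payments|audit)$ keeps only those two topics.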

If the path to the broker properties is not specified, the agent tries to find server.properties in the following places:

Kafka broker process arguments
KAFKA_SERVER_PROPERTIES environment variable
Predefined paths: /path_to_kafka_home/config/server.properties or, for Confluent Kafka, /path_to_kafka_home/etc/kafka/server.properties

If server.properties is not found in any of these places, the agent uses /opt/kafka/config/server.properties as the default path.
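
The lookup order described above can be sketched in Python as follows. The function resolve_server_properties and its parameters are hypothetical. How the agent parses the broker process arguments, and whether it checks that a predefined path exists, are assumptions made only for this illustration.

```python
import os
from typing import Callable, Mapping, Optional, Sequence

DEFAULT_PATH = "/opt/kafka/config/server.properties"  # documented fallback


def resolve_server_properties(
    configured_path: Optional[str],
    broker_args: Sequence[str],
    env: Mapping[str, str],
    kafka_home: Optional[str],
    exists: Callable[[str], bool] = os.path.exists,
) -> str:
    # 1. An explicit brokerPropertiesFilePath from configuration.yaml wins.
    if configured_path:
        return configured_path
    # 2. A broker process argument that names a server.properties file
    #    (assumption: a simple suffix check is enough to recognize it).
    for arg in broker_args:
        if arg.endswith("server.properties"):
            return arg
    # 3. The KAFKA_SERVER_PROPERTIES environment variable.
    env_path = env.get("KAFKA_SERVER_PROPERTIES")
    if env_path:
        return env_path
    # 4. Predefined locations under the Kafka home directory
    #    (assumption: a candidate is used only if the file exists).
    if kafka_home:
        for relative in ("config/server.properties",
                         "etc/kafka/server.properties"):
            candidate = os.path.join(kafka_home, relative)
            if exists(candidate):
                return candidate
    # 5. Nothing found: fall back to the documented default path.
    return DEFAULT_PATH
```

Passing the existence check as a parameter keeps the sketch testable without touching the file system.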

SSL/TLS support

If your Kafka broker requires SSL client connections, configure the Instana agent in <agent_install_dir>/etc/instana/configuration.yaml so that it can collect consumer group lag metrics, as shown in the following example:

com.instana.plugin.kafka:
  ...
  sslTrustStore: '/path/to/truststore.jks'
  sslTrustStorePassword: 'kafkaTsPassword'
  sslKeyStore: '/path/to/sslKeyStoreFile.jks'
  sslKeyStorePassword: 'kafkaKsPassword'

Make sure that the keys are in the Java KeyStore (JKS) format. Use the keytool utility to create the keys.

This configuration enables the Instana agent to connect to the Kafka broker over SSL and collect consumer group lag metrics.

JMX authentication support

If JMX authentication is enabled for your Kafka broker, configure the JMX credentials for the Instana agent in <agent_install_dir>/etc/instana/configuration.yaml as shown in the following example:

com.instana.plugin.kafka:
  jmxUsername: ''
  jmxPassword: ''
  jmxPort: '' # default jmx port is 1099

If JMX is not secured, Instana begins monitoring by connecting to the default JMX port 1099.

Kafka node - metrics collection

Kafka node metrics collection gathers and analyzes data about the performance and health of individual nodes within a Kafka cluster.

Configuration data

You need the following details to configure a Kafka node:

Version
Zookeeper Connects
Process ID
Node ID
Topics/Partitions

Performance metrics

The following table contains the performance metrics details:

Table 1. Performance metrics of Kafka node
Metric	Description	Granularity
Total Produce Time	Total time, in milliseconds, to serve Produce requests. Collected from `kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce`.	1 second
Total Fetch Consumer Time	Total time, in milliseconds, to serve FetchConsumer requests. Collected from `kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer`.	1 second
Total Fetch Follower Time	Total time, in milliseconds, to serve FetchFollower requests. Collected from `kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower`.	1 second

Broker traffic

The following table contains the broker traffic details:

Table 2. Broker traffic of Kafka node
Metric	Description	Granularity
In	Aggregate incoming byte rate. Collected from `kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec`.	1 second
Out	Aggregate outgoing byte rate. Collected from `kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec`.	1 second
Rejected	Aggregate rejected byte rate. Collected from `kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec`.	1 second

Broker messages in

The following table contains the broker messages in details:

Table 3. Broker messages in of Kafka node
Metric	Description	Granularity
Count	Aggregate incoming message rate. Collected from `kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec`.	1 second

Produce requests

The following table contains the produce requests details:

Table 4. Produce requests of Kafka node
Metric	Description	Granularity
Count	Request rate. Collected from `kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce`.	1 second
Mean Latency	Average latency, calculated by dividing the total time in milliseconds to serve Produce requests (collected from `kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce`) by Count (the preceding row).	1 second
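
The Mean Latency definition can be read as total serving time divided by the number of requests. The following Python sketch shows that arithmetic only. The function name mean_latency_ms is hypothetical, and returning 0.0 when no requests were served is an assumption, not documented agent behavior.

```python
def mean_latency_ms(total_time_ms: float, request_count: float) -> float:
    """Average time per request in milliseconds.

    Returns 0.0 when no requests were served, to avoid dividing by zero
    (an assumption made for this sketch).
    """
    if request_count <= 0:
        return 0.0
    return total_time_ms / request_count
```

For example, 1200 ms of total Produce time spread over 300 requests gives mean_latency_ms(1200.0, 300) == 4.0. The same calculation applies to the FetchConsumer and FetchFollower request types.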

Fetch consumer requests

The following table contains the fetch consumer requests details:

Table 5. Fetch consumer requests of Kafka node
Metric	Description	Granularity
Count	Request rate. Collected from `kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchConsumer`.	1 second
Mean Latency	Average latency, calculated by dividing the total time in milliseconds to serve FetchConsumer requests (collected from `kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer`) by Count (the preceding row).	1 second

Fetch follower requests

The following table contains the fetch follower requests details:

Table 6. Fetch follower requests of Kafka node
Metric	Description	Granularity
Count	Request rate. Collected from `kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchFollower`.	1 second
Mean Latency	Average latency, calculated by dividing the total time in milliseconds to serve FetchFollower requests (collected from `kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower`) by Count (the preceding row).	1 second

Average idle time

The following table contains the average idle time details:

Table 7. Average idle time of Kafka node
Metric	Description	Granularity
Network Processor	The average fraction of time that the network processor threads are idle. Values range from 0% (all resources are used) to 100% (all resources are available). Collected from `kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent`.	1 second
Request Handler	The average fraction of time that the request handler threads are idle. Values range from 0% (all resources are used) to 100% (all resources are available). Collected from `kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent`.	1 second

Broker failures

The following table contains the broker failures details:

Table 8. Broker failures of Kafka node
Metric	Description	Granularity
Fetch	Rate of fetch requests that failed. Collected from `kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec`.	1 second
Produce	Rate of produce requests that failed. Collected from `kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec`.	1 second

Broker state metrics

The following table contains the broker state metrics details:

Table 9. Broker state metrics of Kafka node
Metric	Description	Granularity
Under-replicated Partitions	The number of under-replicated partitions (ISR < all replicas). Collected from `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`.	1 second
Offline Partitions	The number of partitions that do not have an active leader and are therefore neither writable nor readable. Collected from `kafka.controller:type=KafkaController,name=OfflinePartitionsCount`.	1 second
Leader Elections	Leader election rate and latency. Collected from `kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs`.	1 second
Unclean Leader Elections	Unclean leader election rate. Collected from `kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec`.	1 second
ISR Shrinks	ISR shrink rate. If a broker goes down, the ISR for some of the partitions shrinks. When that broker is up again, the ISR expands after the replicas are fully caught up. Otherwise, the expected value for both the ISR shrink rate and the ISR expansion rate is 0. Collected from `kafka.server:type=ReplicaManager,name=IsrShrinksPerSec`.	1 second
ISR Expansions	ISR expansion rate. When a broker is brought up after a failure, it starts catching up by reading from the leader. After it is caught up, it is added back to the ISR. Collected from `kafka.server:type=ReplicaManager,name=IsrExpandsPerSec`.	1 second
Active controller count	The number of active controllers in the cluster. Collected from `kafka.controller:type=KafkaController,name=ActiveControllerCount`.	1 second

Partitions

The following table contains the partition details:

Table 10. Partitions of Kafka node
Metric	Description	Granularity
Count	Total number of partitions on this broker. This count should be roughly even across all brokers. Collected from `kafka.server:type=ReplicaManager,name=PartitionCount`.	1 second

Log flushing

The following table contains the log flushing details:

Table 11. Log flushing of Kafka node
Metric	Description	Granularity
Mean	Log flush rate. Collected from `kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs`.	1 second
Flushes	Log flush count. Collected from `kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs`.	1 second

Topics

The following table contains the topic details:

Table 12. Topics of Kafka node
Metric	Description	Granularity
Name	The name of the topic.	1 second
Partitions	The number of partitions of the topic.	1 second
Bytes In	Aggregate incoming byte rate for the topic. Collected from `kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec`.	1 second
Bytes Out	Aggregate outgoing byte rate for the topic. Collected from `kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec`.	1 second
Bytes Rejected	Aggregate rejected byte rate for the topic. Collected from `kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec`.	1 second
Messages In	Aggregate incoming message rate for the topic. Collected from `kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec`.	1 second
In-Sync Replicas	In-sync replica count. Collected from `kafka.cluster:type=Partition,name=InSyncReplicasCount`.	1 second

Kafka cluster - metrics collection

Kafka cluster metrics collection gathers and analyzes data about the performance and health of the entire Apache Kafka cluster, rather than individual nodes.

Configuration data

You need the following details to configure a Kafka cluster:

Cluster Name
Zookeeper
Nodes (Name, Version)
Topics/Partitions

Performance metrics

The following table contains the performance metrics details:

Table 13. Performance metrics of Kafka cluster
Metric	Description	Granularity
All Brokers Messages In	The sum of the broker messages in metric from all nodes.	1 second
Rejected Traffic	The sum of the broker traffic rejected metric from all nodes.	1 second
Total Fetch Consumer Time	The sum of the total fetch consumer time metric from all nodes.	1 second
Total Fetch Follower Time	The sum of the total fetch follower time metric from all nodes.	1 second

Average request latency versus throughput

The following table contains the average request latency versus throughput details:

Table 14. Average request latency versus throughput of Kafka cluster
Metric	Description	Granularity
Produce Throughput	The sum of the produce requests count metric from all nodes.	1 second
Fetch Consumer Throughput	The sum of the fetch consumer requests count metric from all nodes.	1 second
Fetch Follower Throughput	The sum of the fetch follower requests count metric from all nodes.	1 second
Total Produce Time	The sum of the total produce time from all nodes.	1 second
Total Fetch Consumer Time	The sum of the total fetch consumer time from all nodes.	1 second
Total Fetch Follower Time	The sum of the total fetch follower time from all nodes.	1 second

All brokers traffic

The following table contains the all brokers traffic details:

Table 15. All brokers traffic of Kafka cluster
Metric	Description	Granularity
In	The sum of the broker traffic In metric from all nodes.	1 second
Out	The sum of the broker traffic Out metric from all nodes.	1 second
Rejected	The sum of the broker traffic Rejected metric from all nodes.	1 second

All brokers failures

The following table contains the all brokers failures details:

Table 16. All brokers failures of Kafka cluster
Metric	Description	Granularity
Fetch	The sum of the broker failures Fetch metric from all nodes.	1 second
Produce	The sum of the broker failures Produce metric from all nodes.	1 second

All brokers state metrics

The following table contains the all brokers state metrics details:

Table 17. All brokers state metrics of Kafka cluster
Metric	Description	Granularity
Under-replicated Partitions	The sum of the Under-replicated Partitions broker state metric from all nodes.	1 second
Offline Partitions	The sum of the Offline Partitions broker state metric from all nodes.	1 second
Leader Elections	The sum of the Leader Elections broker state metric from all nodes.	1 second
Unclean Leader Elections	The sum of the Unclean Leader Elections broker state metric from all nodes.	1 second
ISR Shrinks	The sum of the ISR Shrinks broker state metric from all nodes.	1 second
ISR Expansions	The sum of the ISR Expansions broker state metric from all nodes.	1 second
Active controller count	The sum of the Active controller count broker state metric from all nodes.	1 second

Average idle time percentage

The following table contains the average idle time percentage details:

Table 18. Average idle time percentage of Kafka cluster
Metric	Description	Granularity
Network Processor	The average of the Network Processor average idle time across all nodes.	1 second
Request Handler	The average of the Request Handler average idle time across all nodes.	1 second
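
The cluster tables describe most metrics as sums over all nodes, while the idle-time values are averages. The following Python sketch illustrates that difference. The dictionary keys and the function aggregate_cluster are hypothetical, and the use of an unweighted mean is an assumption, because the weighting is not specified here.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical per-node metric names used only for this illustration.
SUMMED = ("messages_in", "bytes_in", "bytes_out", "bytes_rejected")
AVERAGED = ("network_processor_idle_pct", "request_handler_idle_pct")


def aggregate_cluster(nodes: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-node values into cluster-level values."""
    if not nodes:
        return {}
    # Throughput-style metrics add up across brokers.
    cluster = {key: sum(node[key] for node in nodes) for key in SUMMED}
    # Idle percentages are averaged (assumption: unweighted mean).
    cluster.update(
        {key: mean(node[key] for node in nodes) for key in AVERAGED}
    )
    return cluster
```

With two nodes that report bytes_in values of 100.0 and 250.0 and request handler idle values of 80.0 and 60.0, the cluster values are 350.0 and 70.0.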

Log flushing

The following table contains the log flushing details:

Table 19. Log flushing of Kafka cluster
Metric	Description	Granularity
Mean	The sum of the log flushing Mean metric from all nodes.	1 second
Flushes	The sum of the log flushing Flushes metric from all nodes.	1 second

Cluster nodes

The following table contains the cluster nodes details:

Table 20. Cluster nodes of Kafka cluster
Metric	Description	Granularity
Controller	Whether the node is the controller: Yes or No.	1 second
Messages In	Chart of the broker Messages In count.	1 second
Bytes In	Chart of the broker Bytes In count.	1 second
Bytes Out	Chart of the broker Bytes Out count.	1 second
Average Response Time	Chart of the broker average response time.	1 second
Health	The node health indicator.	1 second

Cluster topics

The following table contains the cluster topics details:

Table 21. Cluster topics of Kafka cluster
Metric	Description	Granularity
Partitions	The total number of partitions.	10 minutes
Bytes In	Chart with the count of the topic bytes in.	1 second
Bytes Out	Chart with the count of the topic bytes out.	1 second
Bytes Rejected	Chart with the count of the topic bytes rejected.	1 second
Messages In	Chart with the count of the topic messages in.	1 second

Consumer group lag

The following table contains the consumer group lag details:

Table 22. Consumer group lag of Kafka cluster
Metric	Description	Granularity
Lag	Consumer group lag per topic.	60 seconds
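
Consumer group lag is commonly defined as the difference between the log end offset of a partition and the offset that the group has committed for it. This document does not state how Instana computes the value, so the following Python sketch shows only the common definition, summed per topic. The function lag_per_topic is hypothetical, and treating a partition without a committed offset as starting from offset 0 is an assumption.

```python
from collections import defaultdict
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition number)


def lag_per_topic(
    end_offsets: Dict[TopicPartition, int],
    committed: Dict[TopicPartition, int],
) -> Dict[str, int]:
    """Sum (log end offset - committed offset) over the partitions of each topic."""
    lag: Dict[str, int] = defaultdict(int)
    for (topic, partition), end in end_offsets.items():
        # Assumption: no committed offset means nothing was consumed yet.
        done = committed.get((topic, partition), 0)
        # Clamp at zero so a stale end offset never yields negative lag.
        lag[topic] += max(end - done, 0)
    return dict(lag)
```

For example, if the orders topic has end offsets 100 and 50 on two partitions and the group has committed 90 and 50, the lag for orders is 10.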

Consumers

The following table contains the consumers details:

Table 23. Consumers of Kafka cluster
Metric	Description	Granularity
Byte Rate	The total number of bytes consumed per second.	1 second
Throttling	The total average throttle time.	1 second
Latency	The total average fetch latency.	1 second

Producers

The following table contains the producers details:

Table 24. Producers of Kafka cluster
Metric	Description	Granularity
Byte Rate	The total number of outgoing bytes sent per second.	1 second
Throttling	The total average throttle time.	1 second
Latency	The total average request latency.	1 second

To enable the Instana agent client to query the Kafka broker for lag-related data, add the PLAINTEXT security protocol for localhost socket connections within the Kafka broker configuration file.

Health Signatures

For each sensor, a knowledge base of health signatures is evaluated continuously against the incoming metrics and is used to raise issues or incidents depending on user impact.

Built-in events trigger issues or incidents based on failing health signatures on entities, and custom events trigger issues or incidents based on the thresholds of an individual metric of any given entity.

For more information about built-in events for Kafka node and cluster, see Built-in events reference.

Troubleshooting

SSL not configured

Monitoring issue type: kafka_ssl_not_configured

To resolve this issue, configure the Kafka SSL truststore location and password as described in SSL/TLS support.

SSL client authentication not configured

Monitoring issue type: kafka_ssl_client_not_configured

To resolve this issue, configure Kafka SSL client authentication (keystore location and password) as described in SSL/TLS support.

JMX authentication not configured

Monitoring issue type: kafka_invalid_jmx_credentials

To resolve this issue, configure the JMX authentication credentials as described in JMX authentication support.