Kafka

Supported Versions

Monitoring of all metrics is supported for every version of Kafka and Confluent Kafka, apart from the Consumer group lag and the Consumer/Producer Byte Rate/Throttling metrics.

Monitoring of Consumer group lag metrics is supported for Kafka versions from 0.11.x.x to 2.3.x, as well as Confluent Kafka versions from 3.3.x to 5.3.x.

Monitoring of Consumer/Producer Byte Rate/Throttling metrics is supported only for Java Kafka clients, with Kafka versions from 1.1.x to 2.3.x or Confluent Kafka versions from 4.1.x to 5.3.x.

Configuration

The Instana agent automatically detects the running Kafka instance, so no configuration is required.

Instana collects the first 400 topics, sorted by topic name.

To filter topics, configure the following in the Instana agent configuration file, <agent_install_dir>/etc/instana/configuration.yaml:

com.instana.plugin.kafka:
  topicsRegex: '.*'
  • topicsRegex: Optional regex pattern used for filtering topics by name. If the value is empty or not set, Instana collects all topics (max. 400).
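
As a rough illustration of the selection rule described above, the following Python sketch applies an optional regex, sorts the remaining topics by name, and keeps the first 400. The helper name `select_topics`, the order of the steps, and the use of full-string matching are assumptions made for illustration; the agent's actual matching semantics are not specified here.

```python
import re

MAX_TOPICS = 400  # limit stated in the documentation above


def select_topics(topic_names, topics_regex=None, limit=MAX_TOPICS):
    """Hypothetical helper: filter by regex, sort by name, cap at `limit`."""
    if topics_regex:
        pattern = re.compile(topics_regex)
        # Assumption: the whole topic name must match the pattern.
        topic_names = [t for t in topic_names if pattern.fullmatch(t)]
    # An empty or missing pattern keeps every topic, subject to the cap.
    return sorted(topic_names)[:limit]
```

A pattern such as `orders\..*` would, under these assumptions, keep only topics whose names start with `orders.`.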

Kafka Node - Metrics collection

Configuration data

  • Version
  • Zookeeper Connect
  • Process ID
  • Node ID
  • Topics/Partitions

Performance metrics

  • Produce Latency
  • Fetch Consumer Latency
  • Fetch Follower Latency

Broker Traffic

Metric | Description | Granularity
In | Aggregate incoming byte rate. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec. | 1 second
Out | Aggregate outgoing byte rate. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec. | 1 second
Rejected | Aggregate rejected byte rate. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec. | 1 second
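
The metric sources in this document are JMX MBean names of the form `domain:key=value,key=value`. The following Python sketch splits such a name into its parts, which can help when mapping a metric back to its source. The helper `parse_mbean_name` is hypothetical and handles only the simple, unquoted names that appear in this document.

```python
def parse_mbean_name(object_name):
    """Split 'domain:key=value,key=value' into (domain, {key: value}).

    Simplified sketch: quoted values and wildcard patterns are not handled.
    """
    domain, _, props = object_name.partition(":")
    properties = dict(item.split("=", 1) for item in props.split(",") if item)
    return domain, properties


domain, properties = parse_mbean_name(
    "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec"
)
print(domain, properties["type"], properties["name"])
```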

Broker Messages In

Metric | Description | Granularity
Count | Aggregate incoming message rate. Collected from kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec. | 1 second

Produce Requests

Metric | Description | Granularity
Count | Request rate. Collected from kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce. | 1 second
Mean Latency | Average latency, calculated as the total time in ms to serve the specified request (collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce) divided by Count (above). | 1 second
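
One plausible reading of the Mean Latency description is the total time spent serving requests divided by the number of requests in the same interval. The Python sketch below shows that arithmetic. The function name and the choice to report 0.0 when no requests were served are assumptions, not a description of how the agent computes the value.

```python
def mean_latency_ms(total_time_ms, request_count):
    """Hypothetical helper: average time per request over one interval."""
    if request_count <= 0:
        # Assumption: report zero when no requests were served,
        # rather than dividing by zero.
        return 0.0
    return total_time_ms / request_count
```

For example, 250 ms of total time over 50 requests gives a mean latency of 5 ms per request.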

Fetch Consumer Requests

Metric | Description | Granularity
Count | Request rate. Collected from kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchConsumer. | 1 second
Mean Latency | Average latency, calculated as the total time in ms to serve the specified request (collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer) divided by Count (above). | 1 second

Fetch Follower Requests

Metric | Description | Granularity
Count | Request rate. Collected from kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchFollower. | 1 second
Mean Latency | Average latency, calculated as the total time in ms to serve the specified request (collected from kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower) divided by Count (above). | 1 second

Average Idle Time

Metric | Description | Granularity
Network Processor | Average fraction of time the network processor threads are idle. Values range from 0% (all resources are used) to 100% (all resources are available). Collected from kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent. | 1 second
Request Handler | Average fraction of time the request handler threads are idle. Values range from 0% (all resources are used) to 100% (all resources are available). Collected from kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent. | 1 second

Broker Failures

Metric | Description | Granularity
Fetch | Rate of fetch requests that failed. Collected from kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec. | 1 second
Produce | Rate of produce requests that failed. Collected from kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec. | 1 second

Broker State Metrics

Metric | Description | Granularity
Under-replicated Partitions | Number of under-replicated partitions (ISR < all replicas). Collected from kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions. | 1 second
Offline Partitions | Number of partitions that don’t have an active leader and are therefore neither writable nor readable. Collected from kafka.controller:type=KafkaController,name=OfflinePartitionsCount. | 1 second
Leader Elections | Leader election rate and latency. Collected from kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs. | 1 second
Unclean Leader Elections | Unclean leader election rate. Collected from kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec. | 1 second
ISR Shrinks | If a broker goes down, the ISR for some of the partitions shrinks. When that broker is up again, the ISR expands once the replicas are fully caught up. Otherwise, the expected value for both the ISR shrink rate and the expansion rate is 0. Collected from kafka.server:type=ReplicaManager,name=IsrShrinksPerSec. | 1 second
ISR Expansions | When a broker is brought up after a failure, it starts catching up by reading from the leader. Once it has caught up, it is added back to the ISR. Collected from kafka.server:type=ReplicaManager,name=IsrExpandsPerSec. | 1 second
Active controller count | Number of active controllers in the cluster. Collected from kafka.controller:type=KafkaController,name=ActiveControllerCount. | 1 second
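
The under-replication condition above (ISR < all replicas) can be illustrated with a small Python sketch. The partition records and their `replicas` and `isr` keys are a hypothetical data shape chosen for this example; the broker itself exposes only the resulting count through the MBean listed in the table.

```python
def count_under_replicated(partitions):
    """Count partitions whose in-sync replica set is smaller than the replica set."""
    return sum(1 for p in partitions if len(p["isr"]) < len(p["replicas"]))


# Hypothetical sample data: broker 3 has fallen out of the ISR for partition 1.
partitions = [
    {"topic": "orders", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 2, 3]},
    {"topic": "orders", "partition": 1, "replicas": [1, 2, 3], "isr": [1, 2]},
]
print(count_under_replicated(partitions))
```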

Partitions

Metric | Description | Granularity
Count | Number of partitions on this broker, which should be roughly even across all brokers. Collected from kafka.server:type=ReplicaManager,name=PartitionCount. | 1 second

Log Flushing

Metric | Description | Granularity
Mean | Log flush rate. Collected from kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs. | 1 second
Flushes | Log flush count. Collected from kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs. | 1 second

Topics

Metric | Description | Granularity
Name | Name of the topic. | 1 second
Partitions | Number of partitions of the topic. | 1 second
Bytes In | Aggregate incoming byte rate for the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec. | 1 second
Bytes Out | Aggregate outgoing byte rate for the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec. | 1 second
Bytes Rejected | Aggregate rejected byte rate for the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec. | 1 second
Messages In | Aggregate incoming message rate for the topic. Collected from kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec. | 1 second
In-Sync Replicas | In-sync replica count. Collected from kafka.cluster:type=Partition,name=InSyncReplicasCount. | 1 second

Kafka Cluster - Metrics collection

Configuration data

  • Cluster Name
  • Zookeeper
  • Nodes (Name, Version)
  • Topics/Partitions

Performance metrics

  • All Brokers Messages In
  • Rejected Traffic
  • Fetch Consumer Latency
  • Fetch Follower Latency

Average Request Latency vs Throughput

Metric | Description | Granularity
Produce Throughput | Sum of the Produce Requests Count from all nodes. | 1 second
Fetch Consumer Throughput | Sum of the Fetch Consumer Requests Count from all nodes. | 1 second
Fetch Follower Throughput | Sum of the Fetch Follower Requests Count from all nodes. | 1 second
Produce Latency | Sum of the Produce Requests Latency from all nodes. | 1 second
Fetch Consumer Latency | Sum of the Fetch Consumer Requests Latency from all nodes. | 1 second
Fetch Follower Latency | Sum of the Fetch Follower Requests Latency from all nodes. | 1 second

All Brokers Traffic

Metric | Description | Granularity
In | Sum of the Broker Traffic In from all nodes. | 1 second
Out | Sum of the Broker Traffic Out from all nodes. | 1 second
Rejected | Sum of the Broker Traffic Rejected from all nodes. | 1 second

All Brokers Failures

Metric | Description | Granularity
Fetch | Sum of the Broker Failures Fetch from all nodes. | 1 second
Produce | Sum of the Broker Failures Produce from all nodes. | 1 second

All Brokers State Metrics

Metric | Description | Granularity
Under-replicated Partitions | Sum of the Broker State Metrics Under-replicated Partitions from all nodes. | 1 second
Offline Partitions | Sum of the Broker State Metrics Offline Partitions from all nodes. | 1 second
Leader Elections | Sum of the Broker State Metrics Leader Elections from all nodes. | 1 second
Unclean Leader Elections | Sum of the Broker State Metrics Unclean Leader Elections from all nodes. | 1 second
ISR Shrinks | Sum of the Broker State Metrics ISR Shrinks from all nodes. | 1 second
ISR Expansions | Sum of the Broker State Metrics ISR Expansions from all nodes. | 1 second
Active controller count | Sum of the Broker State Metrics Active controller count from all nodes. | 1 second

Average Idle Time Percentage

Metric | Description | Granularity
Network Processor | Average of the Average Idle Time Network Processor from all nodes. | 1 second
Request Handler | Average of the Average Idle Time Request Handler from all nodes. | 1 second
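
The cluster-level metrics in this section are described as either sums or averages of the corresponding per-node values. The Python sketch below shows both aggregations over a mapping of node name to value. The function names, the node names, and the sample numbers are all hypothetical.

```python
from statistics import mean


def cluster_sum(per_node_values):
    """Sum a per-node metric, as described for traffic, failures and state metrics."""
    return sum(per_node_values.values())


def cluster_average(per_node_values):
    """Average a per-node metric, as described for the idle time percentages."""
    values = list(per_node_values.values())
    # Assumption: report 0.0 when no nodes have reported a value.
    return mean(values) if values else 0.0


bytes_in = {"broker-1": 100, "broker-2": 50}      # hypothetical byte rates
idle_pct = {"broker-1": 80.0, "broker-2": 60.0}   # hypothetical idle percentages
print(cluster_sum(bytes_in), cluster_average(idle_pct))
```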

Log Flushing

Metric | Description | Granularity
Mean | Sum of the Log Flushing Mean from all nodes. | 1 second
Flushes | Sum of the Log Flushing Flushes from all nodes. | 1 second

Cluster Nodes

Metric | Description | Granularity
Controller | Whether the node is the controller (Yes/No). | 1 second
Messages In | Chart with the count of the Broker Messages In. | 1 second
Bytes In | Chart with the count of the Broker Bytes In. | 1 second
Bytes Out | Chart with the count of the Broker Bytes Out. | 1 second
Average Response Time | Chart with the count of the Broker Average Response Time. | 1 second
Health | The node health indicator. | 1 second

Cluster Topics

Metric | Description | Granularity
Partitions | Number of partitions. | 10 minutes
Bytes In | Chart with the count of the Topic Bytes In. | 1 second
Bytes Out | Chart with the count of the Topic Bytes Out. | 1 second
Bytes Rejected | Chart with the count of the Topic Bytes Rejected. | 1 second
Messages In | Chart with the count of the Topic Messages In. | 1 second

Consumer group lag

Metric | Description | Granularity
Lag | Consumer group lag per topic. | 120 seconds
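
Consumer lag is conventionally defined as the difference between a partition's log end offset and the consumer group's committed offset. The Python sketch below sums that difference per topic. It illustrates the general definition only and is not a description of how the agent obtains the value; the function name, the input shape, and the treatment of partitions without a committed offset are assumptions.

```python
from collections import defaultdict


def lag_per_topic(end_offsets, committed_offsets):
    """Sum per-partition lag by topic.

    Both arguments map (topic, partition) -> offset.
    """
    lag = defaultdict(int)
    for (topic, partition), end in end_offsets.items():
        # Assumption: a partition with no committed offset counts from offset 0.
        committed = committed_offsets.get((topic, partition), 0)
        # Clamp at zero in case the committed offset is read after the end offset.
        lag[topic] += max(end - committed, 0)
    return dict(lag)


# Hypothetical offsets for one consumer group.
end_offsets = {("orders", 0): 100, ("orders", 1): 50, ("audit", 0): 10}
committed = {("orders", 0): 90, ("orders", 1): 50, ("audit", 0): 4}
print(lag_per_topic(end_offsets, committed))
```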

Consumer Byte Rate/Throttling

Metric | Description | Granularity
Byte Rate | The number of bytes consumed per second. | 1 second
Throttling | Average throttle time. | 1 second

Producer Byte Rate/Throttling

Metric | Description | Granularity
Byte Rate | The number of outgoing bytes sent per second. | 1 second
Throttling | Average throttle time. | 1 second

Note: To enable the Instana agent client to query the Kafka broker for lag-related data, add the PLAINTEXT security protocol for localhost socket connections in the Kafka broker configuration file.

Health Signatures

For each sensor, there is a curated knowledge base of health signatures that are evaluated continuously against the incoming metrics and are used to raise issues or incidents depending on user impact.

Built-in events trigger issues or incidents based on failing health signatures on entities, and custom events trigger issues or incidents based on the thresholds of an individual metric of any given entity.

For information about built-in events for Kafka Node and Cluster, see the Built-in events reference.