Contents
- Why Monitor Kafka
- Key Broker Metrics
- Key Producer Metrics
- Key Consumer Metrics
- Consumer Lag
- JMX Monitoring
- Monitoring with Spring Boot Actuator
- Prometheus & Grafana Setup
- Alerting Rules
- Common Issues & Troubleshooting
Kafka is the backbone of many event-driven architectures. When a broker goes silent, a consumer group falls behind, or a topic runs out of disk, the impact cascades across every downstream system. Proactive monitoring addresses three core needs:
Reliability
Monitoring under-replicated partitions and ISR (In-Sync Replica) counts lets you detect replication failures before they become data-loss events. A healthy cluster keeps all replicas in sync; any deviation signals a broker that is overloaded, network-partitioned, or running out of disk.
Early Warning
Consumer lag is the single most important indicator of processing health. A growing lag means consumers cannot keep up with the rate of incoming records. Catching a lag spike early gives you time to scale consumers, increase partitions, or investigate a slow downstream dependency before end users notice delays.
Capacity Planning
Tracking bytes-in/out per broker, request handler idle percentage, and disk utilization over time reveals long-term growth trends. These metrics feed directly into decisions about adding brokers, expanding storage, or re-partitioning high-volume topics.
Brokers expose hundreds of metrics through JMX. The following are the most critical for day-to-day operations.
UnderReplicatedPartitions
Reports the number of partitions on this broker whose follower replicas have fallen out of the ISR. A non-zero value means at least one replica is not keeping up with the leader. Persistent values above zero point to an unhealthy broker, disk bottlenecks, or network problems.
ActiveControllerCount
Exactly one broker in the cluster should be the active controller. If the sum across all brokers is not 1, you have either no controller (cluster is leaderless) or multiple controllers (split-brain). Both situations are critical.
RequestHandlerAvgIdlePercent
Measures how much time the broker's request handler threads spend idle. A value near 1.0 means the broker has ample capacity. Below 0.3 indicates the broker is overloaded and requests are queuing up.
BytesInPerSec / BytesOutPerSec
Tracks the inbound and outbound data rate per broker. Use these metrics to detect traffic imbalance across brokers and to project when you will hit network or disk throughput limits.
Critical Broker Metrics Table
| Metric | MBean Path | Healthy Value | Alert Threshold |
|---|---|---|---|
| UnderReplicatedPartitions | kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | 0 | > 0 for > 5 min |
| ActiveControllerCount | kafka.controller:type=KafkaController,name=ActiveControllerCount | 1 (cluster-wide sum) | != 1 |
| RequestHandlerAvgIdlePercent | kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent | > 0.7 | < 0.3 |
| BytesInPerSec | kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec | Baseline-dependent | > 80% NIC capacity |
| BytesOutPerSec | kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec | Baseline-dependent | > 80% NIC capacity |
| PartitionCount | kafka.server:type=ReplicaManager,name=PartitionCount | Evenly distributed | Skew > 20% |
| IsrShrinksPerSec | kafka.server:type=ReplicaManager,name=IsrShrinksPerSec | 0 | > 0 sustained |
| LogFlushRateAndTimeMs | kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs | < 10 ms (p99) | > 50 ms (p99) |
Producer metrics are exposed per client instance. They tell you whether the producer is sending records successfully and how fast.
record-send-rate
The average number of records sent per second. A sudden drop indicates the producer is blocked (buffer full, broker unreachable) or the upstream data source has slowed down.
record-error-rate
The average number of records per second that resulted in errors. Any non-zero sustained value requires investigation. Common causes include serialization failures, authorization errors, or topic auto-creation being disabled.
request-latency-avg
The average time in milliseconds for a produce request round-trip. High latency suggests broker saturation, network congestion, or that acks=all is waiting on slow follower replicas.
| Metric | What It Tells You | Typical Healthy Range |
|---|---|---|
| record-send-rate | Throughput — records per second | Matches expected ingest rate |
| record-error-rate | Failed sends per second | 0 |
| request-latency-avg | Round-trip time for produce requests | < 50 ms (acks=1), < 200 ms (acks=all) |
| batch-size-avg | Average bytes per batch | Close to configured batch.size |
| buffer-available-bytes | Remaining producer buffer memory | > 50% of buffer.memory |
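These client metrics can also be read in-process through the producer's built-in Metrics API, which is handy for wiring them into your own logging or registry. The sketch below assumes an already configured producer instance named producer; the class and method names are illustrative:

```java
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

import java.util.Map;

public final class ProducerMetricsLogger {

    // Logs the send and error rates from the producer's built-in metrics registry.
    // The producer instance is assumed to be created and configured elsewhere.
    public static void logHealth(Producer<?, ?> producer) {
        Map<MetricName, ? extends Metric> metrics = producer.metrics();
        metrics.forEach((name, metric) -> {
            if ("record-send-rate".equals(name.name()) || "record-error-rate".equals(name.name())) {
                System.out.printf("%s = %s%n", name.name(), metric.metricValue());
            }
        });
    }
}
```

The consumer exposes the same metrics() method, so the consumer metrics in the next section can be read the same way.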
Consumer metrics reveal whether records are being fetched and processed in a timely manner. The most actionable metrics live under the consumer's fetch manager group (kafka.consumer:type=consumer-fetch-manager-metrics).
records-consumed-rate
Records consumed per second across all assigned partitions. Compare this against the producer's record-send-rate; if consumption persistently trails production, lag is accumulating.
records-lag-max
The maximum lag in terms of number of records across all partitions assigned to this consumer. This is the most direct indicator of consumer health — a value that trends upward over time means the consumer is falling behind.
fetch-latency-avg
Average time in milliseconds for a fetch request. High fetch latency may indicate broker overload, large fetch sizes, or network issues between the consumer and the broker.
| Metric | What It Tells You | Action When Unhealthy |
|---|---|---|
| records-consumed-rate | Consumption throughput | Scale consumers or tune fetch.min.bytes / max.poll.records |
| records-lag-max | Worst-case consumer lag | Add consumers, increase partitions, optimize processing |
| fetch-latency-avg | Broker response time for fetches | Check broker load, network, and fetch size settings |
| commit-rate | Offset commits per second | Verify auto-commit interval or manual commit logic |
| rebalance-rate-per-hour | How often rebalances occur | Tune session.timeout.ms and max.poll.interval.ms |
Consumer lag is the difference between the latest offset in a partition (the log-end offset) and the last committed offset of a consumer group. It represents how many records a consumer has yet to process.
Why It Matters
Lag directly translates to data freshness. In an event-driven system, a lag of 100,000 records on a topic producing 10,000 records per second means consumers are 10 seconds behind real-time. For use cases like fraud detection, payment processing, or live dashboards, even a few seconds of lag can be unacceptable.
Measuring Lag with kafka-consumer-groups.sh
Kafka ships with a built-in CLI tool to inspect consumer group lag across all assigned partitions.
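A minimal invocation looks like this (the bootstrap server and the group name payments-group are placeholders):

```bash
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group payments-group
```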
The output includes columns for each assigned partition:
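(An illustrative excerpt with hypothetical group, topic, and consumer IDs; the LAG values match the example discussed next.)

```
GROUP            TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID
payments-group   payments   0          152400          152410          10   consumer-payments-1-...
payments-group   payments   1          98302           99200           898  consumer-payments-2-...
payments-group   payments   2          87455           87455           0    consumer-payments-3-...
```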
In this example, partition 1 has a lag of 898 records while partition 2 is fully caught up. Investigate why partition 1's consumer is slower — it could be processing-heavy records, a slow downstream call, or uneven partition distribution.
Monitoring Lag Programmatically
For automated monitoring, use the Kafka AdminClient API to read a consumer group's committed offsets and compare them with the latest (log-end) offsets of the same partitions.
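A minimal sketch with the Java AdminClient; the group name and bootstrap server are placeholders:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ConsumerLagChecker {

    public static void main(String[] args) throws Exception {
        String groupId = "payments-group"; // hypothetical consumer group

        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for every partition the group has consumed
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(groupId)
                         .partitionsToOffsetAndMetadata().get();

            // Latest (log-end) offsets for the same partitions
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(committed.keySet().stream()
                            .collect(Collectors.toMap(Function.identity(), tp -> OffsetSpec.latest())))
                         .all().get();

            // Lag = log-end offset minus committed offset, per partition
            committed.forEach((tp, meta) -> {
                long lag = latest.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```

Running this on a schedule and pushing the results into your metrics system gives you lag visibility even for consumers you do not own.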
Kafka exposes all internal metrics through Java Management Extensions (JMX). Every broker, producer, and consumer publishes MBeans that can be read by any JMX-compatible tool.
Enabling JMX on the Broker
Set the JMX_PORT environment variable before starting the broker; Kafka's startup scripts pick it up and open a JMX listener on that port:
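(A sketch; port 9999 is arbitrary, and this form is unauthenticated, so keep it on trusted networks only.)

```bash
export JMX_PORT=9999
bin/kafka-server-start.sh config/server.properties
```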
For remote access with authentication:
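(The port, hostname, and credential file paths below are placeholders; adjust them to your environment.)

```bash
export JMX_PORT=9999
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.authenticate=true \
  -Dcom.sun.management.jmxremote.password.file=/etc/kafka/jmxremote.password \
  -Dcom.sun.management.jmxremote.access.file=/etc/kafka/jmxremote.access \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Djava.rmi.server.hostname=kafka-1.example.com"
bin/kafka-server-start.sh config/server.properties
```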
Key MBean Paths
| Category | MBean Path Pattern | Description |
|---|---|---|
| Broker | kafka.server:type=BrokerTopicMetrics,name=* | Throughput metrics (bytes/messages in/out) |
| Replication | kafka.server:type=ReplicaManager,name=* | ISR shrinks/expands, under-replicated partitions |
| Controller | kafka.controller:type=KafkaController,name=* | Active controller, leader elections |
| Network | kafka.network:type=RequestMetrics,name=TotalTimeMs,request=* | Request latency by type (Produce, Fetch, etc.) |
| Log | kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs | Log flush rate and time |
| Producer | kafka.producer:type=producer-metrics,client-id=* | Send rate, error rate, latency |
| Consumer | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=* | Consumption rate, lag, fetch latency |
Connecting with JConsole
JConsole ships with the JDK and provides a quick way to browse Kafka MBeans interactively:
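(The hostname is a placeholder; the port matches the JMX_PORT set above.)

```bash
jconsole kafka-1.example.com:9999
```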
Navigate to the MBeans tab and expand the kafka.server, kafka.controller, and kafka.network domains to browse the broker's metrics.
Spring Boot applications using spring-kafka get Kafka client metrics bound to Micrometer automatically; adding Actuator and a Prometheus registry makes them scrapable over HTTP with minimal configuration.
Dependencies
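For a Maven build, the relevant dependencies are the ones below (versions are managed by the Spring Boot BOM):

```xml
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
</dependency>
```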
Application Configuration
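A minimal application.yml that exposes the health and Prometheus endpoints (trim the exposure list to match your security posture):

```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  endpoint:
    health:
      show-details: always
```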
Auto-Exposed Kafka Metrics
Spring Boot auto-configures Micrometer bindings for the Kafka client. Once the application starts, the /actuator/prometheus endpoint exposes metrics such as:
- kafka_consumer_records_consumed_total — total records consumed
- kafka_consumer_records_lag — current lag per partition
- kafka_consumer_fetch_manager_fetch_latency_avg — average fetch latency
- kafka_producer_record_send_total — total records sent
- kafka_producer_record_error_total — total send errors
- kafka_producer_request_latency_avg — average produce latency
Custom Kafka Health Indicator
You can add a custom health check that verifies Kafka connectivity and reports consumer lag:
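(A minimal sketch: it verifies connectivity by describing the cluster and could be extended with the AdminClient lag check shown earlier. It assumes spring-kafka's KafkaAdmin bean is on the application context.)

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.kafka.core.KafkaAdmin;
import org.springframework.stereotype.Component;

import java.util.concurrent.TimeUnit;

@Component
public class KafkaHealthIndicator implements HealthIndicator {

    private final KafkaAdmin kafkaAdmin;

    public KafkaHealthIndicator(KafkaAdmin kafkaAdmin) {
        this.kafkaAdmin = kafkaAdmin;
    }

    @Override
    public Health health() {
        // Create a short-lived AdminClient from the same properties spring-kafka uses
        try (AdminClient client = AdminClient.create(kafkaAdmin.getConfigurationProperties())) {
            DescribeClusterResult cluster = client.describeCluster();
            String clusterId = cluster.clusterId().get(5, TimeUnit.SECONDS);
            int nodeCount = cluster.nodes().get(5, TimeUnit.SECONDS).size();
            return Health.up()
                    .withDetail("clusterId", clusterId)
                    .withDetail("nodeCount", nodeCount)
                    .build();
        } catch (Exception e) {
            // Any timeout or connection failure marks the application as DOWN
            return Health.down(e).build();
        }
    }
}
```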
The industry-standard approach for Kafka observability is to expose JMX metrics as Prometheus-scrapable endpoints using the JMX Exporter, then visualize them in Grafana.
Step 1 — JMX Exporter Agent
The Prometheus JMX Exporter runs as a Java agent alongside the Kafka broker, converting JMX MBeans into Prometheus format on an HTTP endpoint.
Create a configuration file that defines which MBeans to export:
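(The file name and rule set below are illustrative; with no rules at all the exporter publishes every MBean, which is simpler but heavier.)

```yaml
# kafka-broker-jmx.yml - illustrative JMX Exporter rules; extend as needed
lowercaseOutputName: true
rules:
  # Broker throughput meters (bytes/messages in and out)
  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec|MessagesInPerSec)><>OneMinuteRate'
    name: kafka_server_brokertopicmetrics_$1
    type: GAUGE
  # Replication health
  - pattern: 'kafka.server<type=ReplicaManager, name=(UnderReplicatedPartitions|PartitionCount)><>Value'
    name: kafka_server_replicamanager_$1
    type: GAUGE
  # Controller state
  - pattern: 'kafka.controller<type=KafkaController, name=(ActiveControllerCount|OfflinePartitionsCount)><>Value'
    name: kafka_controller_$1
    type: GAUGE
  # Request handler idle ratio
  - pattern: 'kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate'
    name: kafka_server_requesthandleravgidlepercent
    type: GAUGE
```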
Attach the agent to the Kafka broker:
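(The agent jar path, the listening port 7071, and the config file location are placeholders; adjust them to your installation.)

```bash
export KAFKA_OPTS="$KAFKA_OPTS -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx_exporter/kafka-broker-jmx.yml"
bin/kafka-server-start.sh config/server.properties
```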
Step 2 — Prometheus Scrape Configuration
Add Kafka broker targets to your prometheus.yml scrape configuration:
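(Hostnames and the exporter port 7071 below are placeholders.)

```yaml
scrape_configs:
  - job_name: "kafka-brokers"
    scrape_interval: 15s
    static_configs:
      - targets:
          - "kafka-1.example.com:7071"
          - "kafka-2.example.com:7071"
          - "kafka-3.example.com:7071"
```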
Step 3 — Grafana Dashboards
Import community dashboards or build custom ones for different monitoring perspectives:
- Broker Overview — BytesIn/Out, messages/sec, request latency, under-replicated partitions per broker
- Consumer Lag — lag per consumer group and partition, consumption rate vs. production rate
- Topic Detail — per-topic message rate, bytes, partition count, replication status
- Producer Performance — send rate, error rate, batch utilization, buffer availability
Example PromQL Queries
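A few starting points. The broker metric names follow the illustrative exporter rules above, and kafka_consumer_records_lag is the Micrometer name exposed by the Spring Boot application; adjust them to match your own naming:

```promql
# Cluster-wide inbound throughput (bytes/sec)
sum(kafka_server_brokertopicmetrics_bytesinpersec)

# Under-replicated partitions per broker
kafka_server_replicamanager_underreplicatedpartitions

# Should always equal 1 across the cluster
sum(kafka_controller_activecontrollercount)

# Worst consumer lag per topic, as reported by the Spring Boot application
max by (topic) (kafka_consumer_records_lag)
```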
Alerts turn metrics into action. Define Prometheus alerting rules for the most critical Kafka failure modes and route them to PagerDuty, Slack, or your on-call system via Alertmanager.
Prometheus Alert Rules
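A sketch of rules matching the thresholds summarized below; the metric names again assume the exporter configuration and Micrometer naming shown earlier:

```yaml
groups:
  - name: kafka-alerts
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Under-replicated partitions on {{ $labels.instance }}"

      - alert: KafkaNoActiveController
        expr: sum(kafka_controller_activecontrollercount) != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster does not have exactly one active controller"

      - alert: KafkaRequestHandlerSaturated
        expr: kafka_server_requesthandleravgidlepercent < 0.25
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Request handler threads saturated on {{ $labels.instance }}"

      - alert: KafkaConsumerLagHigh
        expr: kafka_consumer_records_lag > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag above 10k records on {{ $labels.topic }}"
```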
Critical Thresholds Summary
| Condition | Severity | Threshold | For Duration |
|---|---|---|---|
| Under-replicated partitions | Critical | > 0 | 5 minutes |
| No active controller | Critical | != 1 | 1 minute |
| Request handler saturation | Warning | < 0.25 idle | 10 minutes |
| Consumer lag | Warning | > 10,000 records | 10 minutes |
| ISR shrinking | Warning | rate > 0 | 5 minutes |
| Offline partitions | Critical | > 0 | Immediate |
When Kafka metrics signal trouble, use the following playbooks to diagnose and resolve the most common problems.
High Consumer Lag
- Symptom: records-lag-max climbing steadily across partitions.
- Check processing time — if the consumer's poll() loop is blocked by slow downstream calls (database writes, HTTP calls), records queue up. Profile the processing code and consider async processing or batching.
- Check partition count vs. consumer count — if you have more partitions than consumers in the group, some consumers handle multiple partitions. Add consumers (up to the partition count) to parallelize work.
- Check max.poll.records — lowering this value reduces the number of records per poll, giving the consumer more time per batch and preventing max.poll.interval.ms timeouts that trigger rebalances.
- Check for frequent rebalances — rebalances pause all consumption. Monitor rebalance-rate-per-hour and increase session.timeout.ms if consumers are being evicted prematurely.
Broker Imbalance
- Symptom: One broker shows significantly higher BytesInPerSec or PartitionCount than others.
- Run partition reassignment — use kafka-reassign-partitions.sh to redistribute partitions evenly across brokers (see the command sketch after this list).
- Check preferred leader election — run kafka-leader-election.sh --election-type preferred to restore leaders to their preferred brokers after a broker restart.
- Review topic partitioning — high-volume topics with few partitions can create hot spots. Increase partitions for topics with skewed load.
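A sketch of the commands referenced above; newer Kafka versions accept --bootstrap-server for both tools, and the JSON file names are placeholders:

```bash
# Generate a reassignment plan for the topics listed in topics.json across brokers 1,2,3
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --topics-to-move-json-file topics.json --broker-list "1,2,3" --generate

# Execute a plan saved to reassignment.json
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --execute

# Restore preferred leaders for all partitions after a broker restart
bin/kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type preferred --all-topic-partitions
```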
ISR Shrinks
- Symptom: IsrShrinksPerSec is non-zero and UnderReplicatedPartitions is rising.
- Check broker disk I/O — slow disks cause followers to lag behind leaders. Monitor disk latency with iostat and consider moving log directories to faster storage (SSD).
- Check network between brokers — replication relies on inter-broker network throughput. Use iperf to verify bandwidth between broker nodes.
- Check replica.lag.time.max.ms — this setting controls how long a replica can be out of sync before being removed from the ISR. The default is 30 seconds. Do not lower it aggressively or you will get ISR flapping.
- Check GC pauses — long garbage collection pauses on a follower broker prevent it from fetching from the leader. Review GC logs and tune JVM heap settings.
Quick Diagnostic Checklist
| Symptom | First Check | Second Check | Resolution |
|---|---|---|---|
| Consumer lag growing | Processing time per record | Consumer count vs. partitions | Scale consumers, optimize processing |
| Under-replicated partitions | Broker disk I/O | Network between brokers | Replace disks, fix network, restart broker |
| No active controller | ZooKeeper / KRaft connectivity | Broker logs for controller election | Restart ZK or affected broker |
| Producer send errors | Broker availability | Topic authorization / ACLs | Fix broker, update ACLs |
| High request latency | Request handler idle % | Disk flush time | Add brokers, move to SSD |
| Frequent rebalances | Consumer heartbeat logs | Processing time vs. max.poll.interval.ms | Increase session.timeout.ms, reduce max.poll.records |