
Kafka is the backbone of many event-driven architectures. When a broker goes silent, a consumer group falls behind, or a topic runs out of disk, the impact cascades across every downstream system. Proactive monitoring addresses three core needs:

Reliability

Monitoring under-replicated partitions and ISR (In-Sync Replica) counts lets you detect replication failures before they become data-loss events. A healthy cluster keeps all replicas in sync; any deviation signals a broker that is overloaded, network-partitioned, or running out of disk.

Early Warning

Consumer lag is the single most important indicator of processing health. A growing lag means consumers cannot keep up with the rate of incoming records. Catching a lag spike early gives you time to scale consumers, increase partitions, or investigate a slow downstream dependency before end users notice delays.

Capacity Planning

Tracking bytes-in/out per broker, request handler idle percentage, and disk utilization over time reveals long-term growth trends. These metrics feed directly into decisions about adding brokers, expanding storage, or re-partitioning high-volume topics.

A good monitoring stack covers all three layers: broker infrastructure (disk, CPU, network), Kafka-internal metrics (replication, controller, request queues), and application-level metrics (producer send rate, consumer lag, error rates).

Brokers expose hundreds of metrics through JMX. The following are the most critical for day-to-day operations.

UnderReplicatedPartitions

Reports the number of partitions on this broker whose follower replicas have fallen out of the ISR. A non-zero value means at least one replica is not keeping up with the leader. Values that stay above zero point to an unhealthy broker, a disk bottleneck, or network problems.

# JMX MBean path
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
ActiveControllerCount

Exactly one broker in the cluster should be the active controller. If the sum across all brokers is not 1, you have either no controller (cluster is leaderless) or multiple controllers (split-brain). Both situations are critical.

# Must be 1 on exactly one broker, 0 on all others
kafka.controller:type=KafkaController,name=ActiveControllerCount
RequestHandlerAvgIdlePercent

Measures how much time the broker's request handler threads spend idle. A value near 1.0 means the broker has ample capacity. Below 0.3 indicates the broker is overloaded and requests are queuing up.

kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
BytesInPerSec / BytesOutPerSec

Tracks the inbound and outbound data rate per broker. Use these metrics to detect traffic imbalance across brokers and to project when you will hit network or disk throughput limits.

kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
Critical Broker Metrics Table
| Metric | MBean Path | Healthy Value | Alert Threshold |
| --- | --- | --- | --- |
| UnderReplicatedPartitions | kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | 0 | > 0 for > 5 min |
| ActiveControllerCount | kafka.controller:type=KafkaController,name=ActiveControllerCount | 1 (cluster-wide sum) | != 1 |
| RequestHandlerAvgIdlePercent | kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent | > 0.7 | < 0.3 |
| BytesInPerSec | kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec | Baseline-dependent | > 80% of NIC capacity |
| BytesOutPerSec | kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec | Baseline-dependent | > 80% of NIC capacity |
| PartitionCount | kafka.server:type=ReplicaManager,name=PartitionCount | Evenly distributed | Skew > 20% |
| IsrShrinksPerSec | kafka.server:type=ReplicaManager,name=IsrShrinksPerSec | 0 | > 0 sustained |
| LogFlushRateAndTimeMs | kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs | < 10 ms (p99) | > 50 ms (p99) |
A sustained UnderReplicatedPartitions > 0 combined with rising IsrShrinksPerSec is a red flag that a broker is about to fail. Investigate disk I/O and network connectivity immediately.

Producer metrics are exposed per client instance. They tell you whether the producer is sending records successfully and how fast.

record-send-rate

The average number of records sent per second. A sudden drop indicates the producer is blocked (buffer full, broker unreachable) or the upstream data source has slowed down.

# MBean
kafka.producer:type=producer-metrics,client-id=*,name=record-send-rate
record-error-rate

The average number of records per second that resulted in errors. Any non-zero sustained value requires investigation. Common causes include serialization failures, authorization errors, or topic auto-creation being disabled.

kafka.producer:type=producer-metrics,client-id=*,name=record-error-rate
request-latency-avg

The average time in milliseconds for a produce request round-trip. High latency suggests broker saturation, network congestion, or that acks=all is waiting on slow replicas.

kafka.producer:type=producer-metrics,client-id=*,name=request-latency-avg
| Metric | What It Tells You | Typical Healthy Range |
| --- | --- | --- |
| record-send-rate | Throughput — records per second | Matches expected ingest rate |
| record-error-rate | Failed sends per second | 0 |
| request-latency-avg | Round-trip time for produce requests | < 50 ms (acks=1), < 200 ms (acks=all) |
| batch-size-avg | Average bytes per batch | Close to configured batch.size |
| buffer-available-bytes | Remaining producer buffer memory | > 50% of buffer.memory |
If buffer-available-bytes drops to zero, the producer's send() call blocks until space frees up or max.block.ms expires. Monitor this metric to avoid producer stalls.
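If you want these numbers in-process rather than over JMX, the Java producer exposes the same values through KafkaProducer.metrics(). Below is a minimal sketch that prints record-error-rate and buffer-available-bytes; the bootstrap address and the choice of metrics to filter on are illustrative.

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerMetricsProbe {
    public static void main(String[] args) {
        // Assumed broker address; serializers are required to construct a producer
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // metrics() exposes the same values the producer-metrics MBeans publish
            for (Map.Entry<MetricName, ? extends Metric> entry : producer.metrics().entrySet()) {
                MetricName name = entry.getKey();
                if ("producer-metrics".equals(name.group())
                        && ("record-error-rate".equals(name.name())
                            || "buffer-available-bytes".equals(name.name()))) {
                    System.out.printf("%s = %s%n", name.name(), entry.getValue().metricValue());
                }
            }
        }
    }
}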

Consumer metrics reveal whether records are being fetched and processed in a timely manner. The most actionable metrics live under the consumer-fetch-manager-metrics MBean group.

records-consumed-rate

Records consumed per second across all assigned partitions. Compare this against the producer's record-send-rate to gauge whether consumers are keeping pace.

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*,name=records-consumed-rate
records-lag-max

The maximum lag in terms of number of records across all partitions assigned to this consumer. This is the most direct indicator of consumer health — a value that trends upward over time means the consumer is falling behind.

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*,name=records-lag-max
fetch-latency-avg

Average time in milliseconds for a fetch request. High fetch latency may indicate broker overload, large fetch sizes, or network issues between the consumer and the broker.

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*,name=fetch-latency-avg
| Metric | What It Tells You | Action When Unhealthy |
| --- | --- | --- |
| records-consumed-rate | Consumption throughput | Scale consumers or tune max.poll.records |
| records-lag-max | Worst-case consumer lag | Add consumers, increase partitions, optimize processing |
| fetch-latency-avg | Broker response time for fetches | Check broker load, network, fetch.min.bytes |
| commit-rate | Offset commits per second | Verify auto-commit interval or manual commit logic |
| rebalance-rate-per-hour | How often rebalances occur | Tune session.timeout.ms and max.poll.interval.ms |
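
The consumer exposes these values in-process through KafkaConsumer.metrics(), which is handy for logging or exporting a custom gauge next to your processing loop. A small sketch, assuming an already-configured consumer instance, that looks up records-lag-max from the consumer-fetch-manager-metrics group:

import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public final class ConsumerLagProbe {
    // Returns the current records-lag-max reported by this consumer instance,
    // or -1.0 if the metric has not been populated yet.
    public static double recordsLagMax(KafkaConsumer<?, ?> consumer) {
        for (Map.Entry<MetricName, ? extends Metric> entry : consumer.metrics().entrySet()) {
            MetricName name = entry.getKey();
            if ("consumer-fetch-manager-metrics".equals(name.group())
                    && "records-lag-max".equals(name.name())) {
                Object value = entry.getValue().metricValue();
                return (value instanceof Number) ? ((Number) value).doubleValue() : -1.0;
            }
        }
        return -1.0;
    }
}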

Consumer lag is the difference between the latest offset in a partition (the log-end offset) and the last committed offset of a consumer group. It represents how many records a consumer has yet to process.

Why It Matters

Lag directly translates to data freshness. In an event-driven system, a lag of 100,000 records on a topic producing 10,000 records per second means consumers are 10 seconds behind real-time. For use cases like fraud detection, payment processing, or live dashboards, even a few seconds of lag can be unacceptable.

Measuring Lag with kafka-consumer-groups.sh

Kafka ships with a built-in CLI tool to inspect consumer group lag across all assigned partitions.

# Describe a consumer group — shows lag per partition
bin/kafka-consumer-groups.sh \
  --bootstrap-server broker1:9092 \
  --describe \
  --group my-consumer-group

The output includes columns for each assigned partition:

GROUP              TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID        HOST       CLIENT-ID
my-consumer-group  orders  0          4582190         4582195         5    consumer-1-abc123  /10.0.1.5  consumer-1
my-consumer-group  orders  1          3891002         3891900         898  consumer-2-def456  /10.0.1.6  consumer-2
my-consumer-group  orders  2          5120300         5120300         0    consumer-3-ghi789  /10.0.1.7  consumer-3

In this example, partition 1 has a lag of 898 records while partition 2 is fully caught up. Investigate why partition 1's consumer is slower — it could be processing-heavy records, a slow downstream call, or uneven partition distribution.

Monitoring Lag Programmatically

For automated monitoring, use the AdminClient API to fetch consumer group offsets and compare them against partition end offsets.

try (AdminClient admin = AdminClient.create(props)) {
    // Get committed offsets for the group
    Map<TopicPartition, OffsetAndMetadata> committed =
        admin.listConsumerGroupOffsets("my-consumer-group")
             .partitionsToOffsetAndMetadata().get();

    // Get end offsets for the same partitions
    Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
        admin.listOffsets(
            committed.keySet().stream().collect(
                Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())
            )
        ).all().get();

    // Calculate lag per partition
    committed.forEach((tp, offsetMeta) -> {
        long lag = endOffsets.get(tp).offset() - offsetMeta.offset();
        System.out.printf("Partition %s lag: %d%n", tp, lag);
    });
}

Consumer lag is a lagging indicator — by the time lag is large, the problem has been building for a while. Combine lag monitoring with records-consumed-rate to detect slowdowns before they accumulate significant lag.

Kafka exposes all internal metrics through Java Management Extensions (JMX). Every broker, producer, and consumer publishes MBeans that can be read by any JMX-compatible tool.

Enabling JMX on the Broker

Set the JMX_PORT environment variable before starting the broker. Optionally configure authentication and SSL for production environments.

# Enable JMX on port 9999 (no authentication — dev only)
export JMX_PORT=9999
bin/kafka-server-start.sh config/server.properties

For remote access with authentication:

export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9999 \
  -Dcom.sun.management.jmxremote.rmi.port=9999 \
  -Dcom.sun.management.jmxremote.authenticate=true \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.password.file=/etc/kafka/jmx.password \
  -Dcom.sun.management.jmxremote.access.file=/etc/kafka/jmx.access"
bin/kafka-server-start.sh config/server.properties
Key MBean Paths
| Category | MBean Path Pattern | Description |
| --- | --- | --- |
| Broker | kafka.server:type=BrokerTopicMetrics,name=* | Throughput metrics (bytes/messages in/out) |
| Replication | kafka.server:type=ReplicaManager,name=* | ISR shrinks/expands, under-replicated partitions |
| Controller | kafka.controller:type=KafkaController,name=* | Active controller, leader elections |
| Network | kafka.network:type=RequestMetrics,name=*,request=* | Request latency by type (Produce, Fetch, etc.) |
| Log | kafka.log:type=LogFlushStats,name=* | Log flush rate and time |
| Producer | kafka.producer:type=producer-metrics,client-id=* | Send rate, error rate, latency |
| Consumer | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=* | Consumption rate, lag, fetch latency |
Connecting with JConsole

JConsole ships with the JDK and provides a quick way to browse Kafka MBeans interactively:

# Connect to a local broker's JMX port
jconsole localhost:9999

Navigate to the MBeans tab and expand the kafka.server domain to browse broker metrics. This is useful for ad-hoc investigation but not suitable for continuous monitoring — use Prometheus and Grafana for that.

In containerized deployments (Docker, Kubernetes), make sure the JMX port is exposed and the RMI hostname is set to the container's advertised address using -Djava.rmi.server.hostname.
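
For one-off scripted checks, the same MBeans can also be read programmatically with the standard javax.management API. A minimal sketch that reads UnderReplicatedPartitions from a broker; the host and port (broker1:9999) and the absence of authentication are assumptions matching the dev-only setup above.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        // Assumes the broker exposes JMX on broker1:9999 without authentication
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            ObjectName mbean = new ObjectName(
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
            // Kafka gauge MBeans expose their current reading via the "Value" attribute
            Object value = mbsc.getAttribute(mbean, "Value");
            System.out.println("UnderReplicatedPartitions = " + value);
        }
    }
}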

Spring Boot applications using spring-kafka automatically expose Kafka producer and consumer metrics through Micrometer when Actuator is on the classpath. This gives you zero-configuration observability for Spring-based Kafka applications.

Dependencies
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
</dependency>
Application Configuration
# Expose Prometheus endpoint
management.endpoints.web.exposure.include=health,info,prometheus,metrics
management.endpoint.prometheus.enabled=true
management.metrics.export.prometheus.enabled=true

# Kafka connection
spring.kafka.bootstrap-servers=broker1:9092,broker2:9092
spring.kafka.consumer.group-id=my-app-group
Auto-Exposed Kafka Metrics

Spring Boot auto-configures Micrometer bindings for the Kafka clients it creates. Once the application starts, the /actuator/prometheus endpoint includes Kafka client metrics (producer send and error rates, consumer fetch rates and lag) alongside the standard JVM and HTTP metrics.
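
Clients created outside Spring's factories are not bound automatically. You can apply the same binding yourself with the KafkaClientMetrics binder from micrometer-core; the sketch below shows the wiring, with the surrounding class being illustrative.

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.kafka.KafkaClientMetrics;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualKafkaMetricsBinding {
    // Registers the consumer's metrics as Micrometer meters (kafka.consumer.* by default)
    // so a hand-built client appears in the same registry as the auto-configured ones.
    public static KafkaClientMetrics bind(KafkaConsumer<?, ?> consumer, MeterRegistry registry) {
        KafkaClientMetrics metrics = new KafkaClientMetrics(consumer);
        metrics.bindTo(registry);
        return metrics; // close() this binder when the consumer is closed
    }
}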

Custom Kafka Health Indicator

You can add a custom health check that verifies Kafka connectivity and reports consumer lag:

@Component
public class KafkaHealthIndicator implements HealthIndicator {

    private final KafkaAdmin kafkaAdmin;

    public KafkaHealthIndicator(KafkaAdmin kafkaAdmin) {
        this.kafkaAdmin = kafkaAdmin;
    }

    @Override
    public Health health() {
        try (AdminClient client = AdminClient.create(kafkaAdmin.getConfigurationProperties())) {
            DescribeClusterResult cluster = client.describeCluster();
            int nodeCount = cluster.nodes().get(5, TimeUnit.SECONDS).size();
            return Health.up()
                    .withDetail("clusterId", cluster.clusterId().get(5, TimeUnit.SECONDS))
                    .withDetail("nodeCount", nodeCount)
                    .build();
        } catch (Exception e) {
            return Health.down(e).build();
        }
    }
}

Spring Boot 3.x uses Micrometer 1.10+, which provides richer Kafka metric bindings out of the box. If you are on Spring Boot 2.x, consider upgrading or adding micrometer-core with explicit Kafka binder configuration.

The industry-standard approach for Kafka observability is to expose JMX metrics as Prometheus-scrapable endpoints using the JMX Exporter, then visualize them in Grafana.

Step 1 — JMX Exporter Agent

The Prometheus JMX Exporter runs as a Java agent alongside the Kafka broker, converting JMX MBeans into Prometheus format on an HTTP endpoint.

# Download the JMX Exporter agent JAR
curl -LO https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar

Create a configuration file that defines which MBeans to export:

# jmx-exporter-config.yml
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  - pattern: kafka.server<type=BrokerTopicMetrics, name=(.+)><>(Count|OneMinuteRate)
    name: kafka_server_broker_topic_metrics_$1_$2
    type: GAUGE
  - pattern: kafka.server<type=ReplicaManager, name=(.+)><>Value
    name: kafka_server_replica_manager_$1
    type: GAUGE
  - pattern: kafka.controller<type=KafkaController, name=(.+)><>Value
    name: kafka_controller_$1
    type: GAUGE
  - pattern: kafka.network<type=RequestMetrics, name=(.+), request=(.+)><>(Mean|99thPercentile)
    name: kafka_network_request_metrics_$1_$3
    labels:
      request: $2
    type: GAUGE

Attach the agent to the Kafka broker:

export KAFKA_OPTS="-javaagent:/opt/kafka/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka/jmx-exporter-config.yml"
bin/kafka-server-start.sh config/server.properties
Step 2 — Prometheus Scrape Configuration

Add Kafka broker targets to your prometheus.yml:

# prometheus.yml
scrape_configs:
  - job_name: 'kafka-brokers'
    scrape_interval: 15s
    static_configs:
      - targets:
          - 'broker1:7071'
          - 'broker2:7071'
          - 'broker3:7071'
        labels:
          cluster: 'production'
  - job_name: 'kafka-spring-apps'
    scrape_interval: 15s
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets:
          - 'app-server1:8080'
          - 'app-server2:8080'
Step 3 — Grafana Dashboards

Import community dashboards or build custom ones for different monitoring perspectives, such as a cluster-level overview for operators, per-topic throughput for capacity planning, and per-group consumer lag for application teams.

Example PromQL Queries
# Total bytes in per broker (rate over a 5-minute window)
rate(kafka_server_broker_topic_metrics_bytesinpersec_count[5m])

# Under-replicated partitions across all brokers
sum(kafka_server_replica_manager_underreplicatedpartitions)

# Consumer lag by group and topic
sum by (group, topic) (kafka_consumer_records_lag)

# Request handler idle percent (should be > 0.7)
kafka_server_request_handler_avg_idle_percent

The Grafana community dashboard ID 7589 (Kafka Overview) is a popular starting point. Import it via Grafana's dashboard import feature and customize panels for your specific cluster topology.

Alerts turn metrics into action. Define Prometheus alerting rules for the most critical Kafka failure modes and route them to PagerDuty, Slack, or your on-call system via Alertmanager.

Prometheus Alert Rules
# kafka-alerts.yml
groups:
  - name: kafka_broker_alerts
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: sum(kafka_server_replica_manager_underreplicatedpartitions) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Under-replicated partitions detected"
          description: "{{ $value }} partitions are under-replicated for more than 5 minutes."
      - alert: KafkaNoActiveController
        expr: sum(kafka_controller_activecontrollercount) != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No active Kafka controller"
          description: "Cluster has {{ $value }} active controllers (expected 1)."
      - alert: KafkaBrokerRequestHandlerSaturated
        expr: kafka_server_request_handler_avg_idle_percent < 0.25
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Broker request handlers saturated"
          description: "Request handler idle percent is {{ $value }} on {{ $labels.instance }}."
      - alert: KafkaHighConsumerLag
        expr: sum by (group, topic) (kafka_consumer_records_lag) > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High consumer lag on {{ $labels.group }}/{{ $labels.topic }}"
          description: "Consumer group {{ $labels.group }} has {{ $value }} records of lag on topic {{ $labels.topic }}."
      - alert: KafkaIsrShrinking
        expr: rate(kafka_server_replica_manager_isrshrinkspersec_count[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ISR is shrinking on {{ $labels.instance }}"
          description: "Replicas are falling out of sync on broker {{ $labels.instance }}."
Critical Thresholds Summary
| Condition | Severity | Threshold | For Duration |
| --- | --- | --- | --- |
| Under-replicated partitions | Critical | > 0 | 5 minutes |
| No active controller | Critical | != 1 | 1 minute |
| Request handler saturation | Warning | < 0.25 idle | 10 minutes |
| Consumer lag | Warning | > 10,000 records | 10 minutes |
| ISR shrinking | Warning | rate > 0 | 5 minutes |
| Offline partitions | Critical | > 0 | Immediate |
Set consumer lag thresholds relative to your topic's throughput. A lag of 10,000 records on a topic producing 1,000/sec means only 10 seconds behind, but the same lag on a topic producing 10/sec means over 16 minutes behind. Tune thresholds per consumer group.
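
One way to make the threshold throughput-relative is to alert on estimated time behind rather than raw record count. A small illustrative sketch of that conversion; in practice you would feed it the measured lag and a recent production or consumption rate for the topic.

public final class LagSeconds {
    // lagRecords: current lag for a consumer group on a topic
    // recordsPerSecond: recent production (or consumption) rate for that topic
    public static double secondsBehind(long lagRecords, double recordsPerSecond) {
        if (recordsPerSecond <= 0) {
            return Double.POSITIVE_INFINITY; // no throughput: the lag never drains
        }
        return lagRecords / recordsPerSecond;
    }

    public static void main(String[] args) {
        System.out.println(secondsBehind(10_000, 1_000)); // 10.0   -> about 10 seconds behind
        System.out.println(secondsBehind(10_000, 10));    // 1000.0 -> over 16 minutes behind
    }
}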

When Kafka metrics signal trouble, use the following playbooks to diagnose and resolve the most common problems.

High Consumer Lag

Use the kafka-consumer-groups.sh output shown earlier to check whether the lag is concentrated on a few partitions or spread across all of them, then look at per-record processing time and whether the consumer count matches the partition count (see the checklist below).
Broker Imbalance
# Trigger preferred leader election for all topics
bin/kafka-leader-election.sh \
  --bootstrap-server broker1:9092 \
  --election-type preferred \
  --all-topic-partitions
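
Before forcing an election, it can help to confirm that leadership is actually skewed. A minimal AdminClient sketch that counts partition leaders per broker; the allTopicNames() call requires a 3.1+ client (older clients can use all() instead), and the topic list is supplied by the caller.

import java.util.Collection;
import java.util.Map;
import java.util.TreeMap;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class LeaderSkewCheck {
    // Counts how many partition leaders each broker currently holds for the given topics.
    public static Map<Integer, Integer> leadersPerBroker(AdminClient admin, Collection<String> topics)
            throws Exception {
        Map<Integer, Integer> counts = new TreeMap<>();
        Map<String, TopicDescription> descriptions =
                admin.describeTopics(topics).allTopicNames().get();
        descriptions.values().forEach(td ->
                td.partitions().forEach(p -> {
                    if (p.leader() != null) { // partitions can be leaderless (offline)
                        counts.merge(p.leader().id(), 1, Integer::sum);
                    }
                }));
        return counts; // a heavily uneven distribution suggests a preferred leader election is due
    }
}
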
ISR Shrinks
# Check disk I/O on the broker host
iostat -xm 2 5

# Verify inter-broker network throughput
iperf3 -c broker2 -t 10
Quick Diagnostic Checklist
| Symptom | First Check | Second Check | Resolution |
| --- | --- | --- | --- |
| Consumer lag growing | Processing time per record | Consumer count vs. partitions | Scale consumers, optimize processing |
| Under-replicated partitions | Broker disk I/O | Network between brokers | Replace disks, fix network, restart broker |
| No active controller | ZooKeeper / KRaft connectivity | Broker logs for controller election | Restart ZK or affected broker |
| Producer send errors | Broker availability | Topic authorization / ACLs | Fix broker, update ACLs |
| High request latency | Request handler idle % | Disk flush time | Add brokers, move to SSD |
| Frequent rebalances | max.poll.interval.ms | Consumer heartbeat logs | Increase timeout, reduce max.poll.records |
When multiple symptoms appear simultaneously (ISR shrinks + high lag + elevated latency), the root cause is almost always a single broker in distress. Identify it by comparing per-broker metrics and focus your investigation there.