Contents
- Why Monitor Kafka
- Key Broker Metrics
- Key Producer Metrics
- Key Consumer Metrics
- Consumer Lag
- JMX Monitoring
- Monitoring with Spring Boot Actuator
- Prometheus & Grafana Setup
- Alerting Rules
- Common Issues & Troubleshooting
Kafka is the backbone of many event-driven architectures. When a broker goes silent, a consumer group falls behind, or a topic runs out of disk, the impact cascades across every downstream system. Proactive monitoring addresses three core needs:
Reliability
Monitoring under-replicated partitions and ISR (In-Sync Replica) counts lets you detect replication failures before they become data-loss events. A healthy cluster keeps all replicas in sync; any deviation signals a broker that is overloaded, network-partitioned, or running out of disk.
Early Warning
Consumer lag is the single most important indicator of processing health. A growing lag means consumers cannot keep up with the rate of incoming records. Catching a lag spike early gives you time to scale consumers, increase partitions, or investigate a slow downstream dependency before end users notice delays.
Capacity Planning
Tracking bytes-in/out per broker, request handler idle percentage, and disk utilization over time reveals long-term growth trends. These metrics feed directly into decisions about adding brokers, expanding storage, or re-partitioning high-volume topics.
Brokers expose hundreds of metrics through JMX. The following are the most critical for day-to-day operations.
UnderReplicatedPartitions
Reports the number of partitions on this broker whose follower replicas have fallen out of the ISR. A non-zero value means at least one replica is not keeping up with the leader. Persistent values above zero point to an unhealthy broker, disk bottlenecks, or network problems.
ActiveControllerCount
Exactly one broker in the cluster should be the active controller. If the sum across all brokers is not 1, you have either no controller (cluster is leaderless) or multiple controllers (split-brain). Both situations are critical.
RequestHandlerAvgIdlePercent
Measures how much time the broker's request handler threads spend idle. A value near 1.0 means the broker has ample capacity. Below 0.3 indicates the broker is overloaded and requests are queuing up.
BytesInPerSec / BytesOutPerSec
Tracks the inbound and outbound data rate per broker. Use these metrics to detect traffic imbalance across brokers and to project when you will hit network or disk throughput limits.
Critical Broker Metrics Table
| Metric | MBean Path | Healthy Value | Alert Threshold |
|---|---|---|---|
| UnderReplicatedPartitions | kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | 0 | > 0 for > 5 min |
| ActiveControllerCount | kafka.controller:type=KafkaController,name=ActiveControllerCount | 1 (cluster-wide sum) | != 1 |
| RequestHandlerAvgIdlePercent | kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent | > 0.7 | < 0.3 |
| BytesInPerSec | kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec | Baseline-dependent | > 80% NIC capacity |
| BytesOutPerSec | kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec | Baseline-dependent | > 80% NIC capacity |
| PartitionCount | kafka.server:type=ReplicaManager,name=PartitionCount | Evenly distributed | Skew > 20% |
| IsrShrinksPerSec | kafka.server:type=ReplicaManager,name=IsrShrinksPerSec | 0 | > 0 sustained |
| LogFlushRateAndTimeMs | kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs | < 10 ms (p99) | > 50 ms (p99) |
Producer metrics are exposed per client instance. They tell you whether the producer is sending records successfully and how fast.
record-send-rate
The average number of records sent per second. A sudden drop indicates the producer is blocked (buffer full, broker unreachable) or the upstream data source has slowed down.
record-error-rate
The average number of records per second that resulted in errors. Any non-zero sustained value requires investigation. Common causes include serialization failures, authorization errors, or topic auto-creation being disabled.
request-latency-avg
The average time in milliseconds for a produce request round-trip. High latency suggests broker saturation, network congestion, or that acks=all is waiting on slow follower replicas.
| Metric | What It Tells You | Typical Healthy Range |
|---|---|---|
| record-send-rate | Throughput — records per second | Matches expected ingest rate |
| record-error-rate | Failed sends per second | 0 |
| request-latency-avg | Round-trip time for produce requests | < 50 ms (acks=1), < 200 ms (acks=all) |
| batch-size-avg | Average bytes per batch | Close to configured batch.size |
| buffer-available-bytes | Remaining producer buffer memory | > 50% of buffer.memory |
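These client metrics can also be read in-process through the producer's built-in Metrics API, which is handy for wiring them into your own logging or registry. The sketch below assumes an already configured producer instance named producer; the class and method names are illustrative:

```java
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

import java.util.Map;

public final class ProducerMetricsLogger {

    // Logs the send and error rates from the producer's built-in metrics registry.
    // The producer instance is assumed to be created and configured elsewhere.
    public static void logHealth(Producer<?, ?> producer) {
        Map<MetricName, ? extends Metric> metrics = producer.metrics();
        metrics.forEach((name, metric) -> {
            if ("record-send-rate".equals(name.name()) || "record-error-rate".equals(name.name())) {
                System.out.printf("%s = %s%n", name.name(), metric.metricValue());
            }
        });
    }
}
```

The consumer exposes the same metrics() method, so the consumer metrics in the next section can be read the same way.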
Consumer metrics reveal whether records are being fetched and processed in a timely manner. The most actionable metrics live under the consumer's fetch manager group (kafka.consumer:type=consumer-fetch-manager-metrics).
records-consumed-rate
Records consumed per second across all assigned partitions. Compare this against the producer's record-send-rate; if consumption persistently trails production, lag is accumulating.
records-lag-max
The maximum lag in terms of number of records across all partitions assigned to this consumer. This is the most direct indicator of consumer health — a value that trends upward over time means the consumer is falling behind.
fetch-latency-avg
Average time in milliseconds for a fetch request. High fetch latency may indicate broker overload, large fetch sizes, or network issues between the consumer and the broker.
| Metric | What It Tells You | Action When Unhealthy |
|---|---|---|
| records-consumed-rate | Consumption throughput | Scale consumers or tune fetch.min.bytes / max.poll.records |
| records-lag-max | Worst-case consumer lag | Add consumers, increase partitions, optimize processing |
| fetch-latency-avg | Broker response time for fetches | Check broker load, network, and fetch size settings |
| commit-rate | Offset commits per second | Verify auto-commit interval or manual commit logic |
| rebalance-rate-per-hour | How often rebalances occur | Tune session.timeout.ms and max.poll.interval.ms |
Consumer lag is the difference between the latest offset in a partition (the log-end offset) and the last committed offset of a consumer group. It represents how many records a consumer has yet to process.
Why It Matters
Lag directly translates to data freshness. In an event-driven system, a lag of 100,000 records on a topic producing 10,000 records per second means consumers are 10 seconds behind real-time. For use cases like fraud detection, payment processing, or live dashboards, even a few seconds of lag can be unacceptable.
Measuring Lag with kafka-consumer-groups.sh
Kafka ships with a built-in CLI tool to inspect consumer group lag across all assigned partitions.
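A minimal invocation looks like this (the bootstrap server and the group name payments-group are placeholders):

```bash
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group payments-group
```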
The output includes columns for each assigned partition:
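(An illustrative excerpt with hypothetical group, topic, and consumer IDs; the LAG values match the example discussed next.)

```
GROUP            TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID
payments-group   payments   0          152400          152410          10   consumer-payments-1-...
payments-group   payments   1          98302           99200           898  consumer-payments-2-...
payments-group   payments   2          87455           87455           0    consumer-payments-3-...
```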
In this example, partition 1 has a lag of 898 records while partition 2 is fully caught up. Investigate why partition 1's consumer is slower — it could be processing-heavy records, a slow downstream call, or uneven partition distribution.
Monitoring Lag Programmatically
For automated monitoring, use the Kafka AdminClient API to read a consumer group's committed offsets and compare them with the latest (log-end) offsets of the same partitions.
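A minimal sketch with the Java AdminClient; the group name and bootstrap server are placeholders:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ConsumerLagChecker {

    public static void main(String[] args) throws Exception {
        String groupId = "payments-group"; // hypothetical consumer group

        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for every partition the group has consumed
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(groupId)
                         .partitionsToOffsetAndMetadata().get();

            // Latest (log-end) offsets for the same partitions
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(committed.keySet().stream()
                            .collect(Collectors.toMap(Function.identity(), tp -> OffsetSpec.latest())))
                         .all().get();

            // Lag = log-end offset minus committed offset, per partition
            committed.forEach((tp, meta) -> {
                long lag = latest.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```

Running this on a schedule and pushing the results into your metrics system gives you lag visibility even for consumers you do not own.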
Kafka exposes all internal metrics through Java Management Extensions (JMX). Every broker, producer, and consumer publishes MBeans that can be read by any JMX-compatible tool.
Enabling JMX on the Broker
Set the JMX_PORT environment variable before starting the broker; Kafka's startup scripts pick it up and open a JMX listener on that port:
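(A sketch; port 9999 is arbitrary, and this form is unauthenticated, so keep it on trusted networks only.)

```bash
export JMX_PORT=9999
bin/kafka-server-start.sh config/server.properties
```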
For remote access with authentication:
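(The port, hostname, and credential file paths below are placeholders; adjust them to your environment.)

```bash
export JMX_PORT=9999
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.authenticate=true \
  -Dcom.sun.management.jmxremote.password.file=/etc/kafka/jmxremote.password \
  -Dcom.sun.management.jmxremote.access.file=/etc/kafka/jmxremote.access \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Djava.rmi.server.hostname=kafka-1.example.com"
bin/kafka-server-start.sh config/server.properties
```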
Key MBean Paths
| Category | MBean Path Pattern | Description |
|---|---|---|
| Broker | kafka.server:type=BrokerTopicMetrics,name=* | Throughput metrics (bytes/messages in/out) |
| Replication | kafka.server:type=ReplicaManager,name=* | ISR shrinks/expands, under-replicated partitions |
| Controller | kafka.controller:type=KafkaController,name=* | Active controller, leader elections |
| Network | kafka.network:type=RequestMetrics,name=TotalTimeMs,request=* | Request latency by type (Produce, Fetch, etc.) |
| Log | kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs | Log flush rate and time |
| Producer | kafka.producer:type=producer-metrics,client-id=* | Send rate, error rate, latency |
| Consumer | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=* | Consumption rate, lag, fetch latency |
Connecting with JConsole
JConsole ships with the JDK and provides a quick way to browse Kafka MBeans interactively:
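(The hostname is a placeholder; the port matches the JMX_PORT set above.)

```bash
jconsole kafka-1.example.com:9999
```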
Navigate to the MBeans tab and expand the kafka.server, kafka.controller, and kafka.network domains to browse the broker's metrics.
Spring Boot applications using spring-kafka get Kafka client metrics bound to Micrometer automatically; adding Actuator and a Prometheus registry makes them scrapable over HTTP with minimal configuration.
Dependencies
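For a Maven build, the relevant dependencies are the ones below (versions are managed by the Spring Boot BOM):

```xml
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
</dependency>
```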
Application Configuration
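A minimal application.yml that exposes the health and Prometheus endpoints (trim the exposure list to match your security posture):

```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  endpoint:
    health:
      show-details: always
```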
Auto-Exposed Kafka Metrics
Spring Boot auto-configures Micrometer bindings for the Kafka client. Once the application starts, the /actuator/prometheus endpoint exposes metrics such as:
- kafka_consumer_records_consumed_total — total records consumed
- kafka_consumer_records_lag — current lag per partition
- kafka_consumer_fetch_manager_fetch_latency_avg — average fetch latency
- kafka_producer_record_send_total — total records sent
- kafka_producer_record_error_total — total send errors
- kafka_producer_request_latency_avg — average produce latency
Custom Kafka Health Indicator
You can add a custom health check that verifies Kafka connectivity and reports consumer lag:
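(A minimal sketch: it verifies connectivity by describing the cluster and could be extended with the AdminClient lag check shown earlier. It assumes spring-kafka's KafkaAdmin bean is on the application context.)

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.kafka.core.KafkaAdmin;
import org.springframework.stereotype.Component;

import java.util.concurrent.TimeUnit;

@Component
public class KafkaHealthIndicator implements HealthIndicator {

    private final KafkaAdmin kafkaAdmin;

    public KafkaHealthIndicator(KafkaAdmin kafkaAdmin) {
        this.kafkaAdmin = kafkaAdmin;
    }

    @Override
    public Health health() {
        // Create a short-lived AdminClient from the same properties spring-kafka uses
        try (AdminClient client = AdminClient.create(kafkaAdmin.getConfigurationProperties())) {
            DescribeClusterResult cluster = client.describeCluster();
            String clusterId = cluster.clusterId().get(5, TimeUnit.SECONDS);
            int nodeCount = cluster.nodes().get(5, TimeUnit.SECONDS).size();
            return Health.up()
                    .withDetail("clusterId", clusterId)
                    .withDetail("nodeCount", nodeCount)
                    .build();
        } catch (Exception e) {
            // Any timeout or connection failure marks the application as DOWN
            return Health.down(e).build();
        }
    }
}
```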
The industry-standard approach for Kafka observability is to expose JMX metrics as Prometheus-scrapable endpoints using the JMX Exporter, then visualize them in Grafana.
Step 1 — JMX Exporter Agent
The Prometheus JMX Exporter runs as a Java agent alongside the Kafka broker, converting JMX MBeans into Prometheus format on an HTTP endpoint.
Create a configuration file that defines which MBeans to export:
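(The file name and rule set below are illustrative; with no rules at all the exporter publishes every MBean, which is simpler but heavier.)

```yaml
# kafka-broker-jmx.yml - illustrative JMX Exporter rules; extend as needed
lowercaseOutputName: true
rules:
  # Broker throughput meters (bytes/messages in and out)
  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec|MessagesInPerSec)><>OneMinuteRate'
    name: kafka_server_brokertopicmetrics_$1
    type: GAUGE
  # Replication health
  - pattern: 'kafka.server<type=ReplicaManager, name=(UnderReplicatedPartitions|PartitionCount)><>Value'
    name: kafka_server_replicamanager_$1
    type: GAUGE
  # Controller state
  - pattern: 'kafka.controller<type=KafkaController, name=(ActiveControllerCount|OfflinePartitionsCount)><>Value'
    name: kafka_controller_$1
    type: GAUGE
  # Request handler idle ratio
  - pattern: 'kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate'
    name: kafka_server_requesthandleravgidlepercent
    type: GAUGE
```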
Attach the agent to the Kafka broker:
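(The agent jar path, the listening port 7071, and the config file location are placeholders; adjust them to your installation.)

```bash
export KAFKA_OPTS="$KAFKA_OPTS -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx_exporter/kafka-broker-jmx.yml"
bin/kafka-server-start.sh config/server.properties
```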
Step 2 — Prometheus Scrape Configuration
Add Kafka broker targets to your prometheus.yml scrape configuration:
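(Hostnames and the exporter port 7071 below are placeholders.)

```yaml
scrape_configs:
  - job_name: "kafka-brokers"
    scrape_interval: 15s
    static_configs:
      - targets:
          - "kafka-1.example.com:7071"
          - "kafka-2.example.com:7071"
          - "kafka-3.example.com:7071"
```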
Step 3 — Grafana Dashboards
Import community dashboards or build custom ones for different monitoring perspectives:
- Broker Overview — BytesIn/Out, messages/sec, request latency, under-replicated partitions per broker
- Consumer Lag — lag per consumer group and partition, consumption rate vs. production rate
- Topic Detail — per-topic message rate, bytes, partition count, replication status
- Producer Performance — send rate, error rate, batch utilization, buffer availability
Example PromQL Queries
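A few starting points. The broker metric names follow the illustrative exporter rules above, and kafka_consumer_records_lag is the Micrometer name exposed by the Spring Boot application; adjust them to match your own naming:

```promql
# Cluster-wide inbound throughput (bytes/sec)
sum(kafka_server_brokertopicmetrics_bytesinpersec)

# Under-replicated partitions per broker
kafka_server_replicamanager_underreplicatedpartitions

# Should always equal 1 across the cluster
sum(kafka_controller_activecontrollercount)

# Worst consumer lag per topic, as reported by the Spring Boot application
max by (topic) (kafka_consumer_records_lag)
```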
Alerts turn metrics into action. Define Prometheus alerting rules for the most critical Kafka failure modes and route them to PagerDuty, Slack, or your on-call system via Alertmanager.
Prometheus Alert Rules
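A sketch of rules matching the thresholds summarized below; the metric names again assume the exporter configuration and Micrometer naming shown earlier:

```yaml
groups:
  - name: kafka-alerts
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Under-replicated partitions on {{ $labels.instance }}"

      - alert: KafkaNoActiveController
        expr: sum(kafka_controller_activecontrollercount) != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster does not have exactly one active controller"

      - alert: KafkaRequestHandlerSaturated
        expr: kafka_server_requesthandleravgidlepercent < 0.25
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Request handler threads saturated on {{ $labels.instance }}"

      - alert: KafkaConsumerLagHigh
        expr: kafka_consumer_records_lag > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag above 10k records on {{ $labels.topic }}"
```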
Critical Thresholds Summary
| Condition | Severity | Threshold | For Duration |
|---|---|---|---|
| Under-replicated partitions | Critical | > 0 | 5 minutes |
| No active controller | Critical | != 1 | 1 minute |
| Request handler saturation | Warning | < 0.25 idle | 10 minutes |
| Consumer lag | Warning | > 10,000 records | 10 minutes |
| ISR shrinking | Warning | rate > 0 | 5 minutes |
| Offline partitions | Critical | > 0 | Immediate |
When Kafka metrics signal trouble, use the following playbooks to diagnose and resolve the most common problems.
High Consumer Lag
- Symptom: records-lag-max climbing steadily across partitions.
- Check processing time — if the consumer's poll() loop is blocked by slow downstream calls (database writes, HTTP calls), records queue up. Profile the processing code and consider async processing or batching.
- Check partition count vs. consumer count — if you have more partitions than consumers in the group, some consumers handle multiple partitions. Add consumers (up to the partition count) to parallelize work.
- Check max.poll.records — lowering this value reduces the number of records per poll, giving the consumer more time per batch and preventing max.poll.interval.ms timeouts that trigger rebalances.
- Check for frequent rebalances — rebalances pause all consumption. Monitor rebalance-rate-per-hour and increase session.timeout.ms if consumers are being evicted prematurely.
Broker Imbalance
- Symptom: One broker shows significantly higher BytesInPerSec or PartitionCount than others.
- Run partition reassignment — use kafka-reassign-partitions.sh to redistribute partitions evenly across brokers (see the command sketch after this list).
- Check preferred leader election — run kafka-leader-election.sh --election-type preferred to restore leaders to their preferred brokers after a broker restart.
- Review topic partitioning — high-volume topics with few partitions can create hot spots. Increase partitions for topics with skewed load.
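A sketch of the commands referenced above; newer Kafka versions accept --bootstrap-server for both tools, and the JSON file names are placeholders:

```bash
# Generate a reassignment plan for the topics listed in topics.json across brokers 1,2,3
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --topics-to-move-json-file topics.json --broker-list "1,2,3" --generate

# Execute a plan saved to reassignment.json
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --execute

# Restore preferred leaders for all partitions after a broker restart
bin/kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type preferred --all-topic-partitions
```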
ISR Shrinks
- Symptom: IsrShrinksPerSec is non-zero and UnderReplicatedPartitions is rising.
- Check broker disk I/O — slow disks cause followers to lag behind leaders. Monitor disk latency with iostat and consider moving log directories to faster storage (SSD).
- Check network between brokers — replication relies on inter-broker network throughput. Use iperf to verify bandwidth between broker nodes.
- Check replica.lag.time.max.ms — this setting controls how long a replica can be out of sync before being removed from the ISR. The default is 30 seconds. Do not lower it aggressively or you will get ISR flapping.
- Check GC pauses — long garbage collection pauses on a follower broker prevent it from fetching from the leader. Review GC logs and tune JVM heap settings.
Quick Diagnostic Checklist
| Symptom | First Check | Second Check | Resolution |
|---|---|---|---|
| Consumer lag growing | Processing time per record | Consumer count vs. partitions | Scale consumers, optimize processing |
| Under-replicated partitions | Broker disk I/O | Network between brokers | Replace disks, fix network, restart broker |
| No active controller | ZooKeeper / KRaft connectivity | Broker logs for controller election | Restart ZK or affected broker |
| Producer send errors | Broker availability | Topic authorization / ACLs | Fix broker, update ACLs |
| High request latency | Request handler idle % | Disk flush time | Add brokers, move to SSD |
| Frequent rebalances | Consumer heartbeat logs | Processing time vs. max.poll.interval.ms | Increase session.timeout.ms, reduce max.poll.records |