Contents
- What is Kafka?
- Core Components
- Topics and Partitions
- Producers and Consumers
- Use Cases
- Kafka vs Traditional Message Queues
Kafka acts as a distributed commit log. Events (messages) are appended to topics and retained for a configurable period — consumers can replay from any offset. This differs from traditional queues where messages are deleted after consumption.
- Publish-Subscribe — many producers publish events; many consumer groups read independently.
- Durable — messages are written to disk and replicated across brokers.
- Scalable — topics are split into partitions distributed across a cluster.
- High-throughput — sequential disk writes + batching achieve millions of events/second.
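The commit-log model above can be sketched in a few lines of plain Python. The names here (`CommitLog`, `append`, `read_from`) are hypothetical, for illustration only; the point is that consumption never deletes anything, so any consumer can replay from any offset.

```python
class CommitLog:
    """Toy model of a Kafka partition: an append-only, replayable log."""

    def __init__(self):
        self._messages = []  # append-only; nothing is removed on consumption

    def append(self, message):
        offset = len(self._messages)  # each message gets a sequential offset
        self._messages.append(message)
        return offset

    def read_from(self, offset):
        # Replay: reading does not remove messages, so this can be called
        # repeatedly, from any starting offset.
        return self._messages[offset:]


log = CommitLog()
for event in ["signup", "login", "purchase"]:
    log.append(event)

print(log.read_from(0))  # full replay from the beginning
print(log.read_from(2))  # resume mid-stream from a later offset
```

A traditional queue, by contrast, would have to hand each message to exactly one consumer and then discard it.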
| Component | Role |
|---|---|
| Broker | A Kafka server that stores and serves messages. A cluster has multiple brokers for fault tolerance. |
| Topic | A named log/category where messages are published. Analogous to a database table. |
| Partition | A topic is split into one or more ordered, immutable partitions. Parallelism unit. |
| Producer | Application that publishes messages to a topic. |
| Consumer | Application that reads messages from a topic. |
| Consumer Group | A group of consumers that together consume a topic — each partition is assigned to exactly one consumer in the group. |
| Offset | A unique sequential ID for each message within a partition. Consumers track their position via offsets. |
| ZooKeeper / KRaft | Coordinates cluster metadata (controller election, topic configs). Kafka 3.x+ supports KRaft (built-in, removes the ZooKeeper dependency). |
A topic is a logical channel. A partition is an ordered log stored on a broker's disk (as a set of segment files). Partitions enable parallelism: multiple consumers in a group can read different partitions simultaneously.
- Messages within a partition are strictly ordered.
- Across partitions, there is no global ordering guarantee.
- A replication factor of N means N copies exist — one leader + (N-1) followers. If the leader fails, a follower takes over.
- Messages are routed to partitions by key hash (same key → same partition) or round-robin.
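Key-based routing can be sketched as a hash modulo the partition count. Real Kafka clients use murmur2 hashing; `zlib.crc32` stands in here as a stable hash, and the partition count is an assumed value.

```python
import zlib

NUM_PARTITIONS = 3  # illustrative; real topics choose this at creation time

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stable hash of the key, modulo the partition count: the same key
    # always routes to the same partition, preserving per-key ordering.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Same key, same partition, every time within (and across) runs:
assert partition_for("user-42") == partition_for("user-42")
print(partition_for("user-42"), partition_for("user-99"))
```

Messages with no key skip this step and are spread round-robin (or, in newer clients, in "sticky" batches) across partitions, which is why keyless traffic has no per-key ordering guarantee.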
Producers write messages to topics. Key settings:
- acks — acknowledgment level: 0 (fire and forget), 1 (leader ack), all/-1 (leader + all in-sync replicas).
- retries — automatic retry on transient failures.
- batch.size — batch messages together for higher throughput.
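As a concrete sketch, these settings can be expressed as a config map whose keys match Kafka's producer configuration names; the values below are illustrative, not recommendations.

```python
# Producer settings keyed by Kafka's producer config names.
# Values are example choices, not tuned defaults.
producer_config = {
    "acks": "all",        # leader + all in-sync replicas must acknowledge
    "retries": 5,         # retry transient send failures automatically
    "batch.size": 32768,  # bytes batched per partition before a send
}

for name, value in producer_config.items():
    print(f"{name} = {value}")
```

Client libraries expose the same knobs, sometimes with spelling adapted to the host language (e.g. `batch_size` in Python clients).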
Consumers read messages from topics. Key concepts:
- Consumers belong to a consumer group. Each partition is assigned to exactly one consumer per group.
- Adding consumers to a group increases parallelism (up to the number of partitions).
- Consumers commit offsets to track their position. On restart, they resume from the committed offset.
- Commit strategies: auto-commit (simple but risk of reprocessing/skipping), manual sync/async commit (more control).
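The group mechanics above can be sketched with two small helpers, both hypothetical names: one divides partitions among group members (each partition to exactly one consumer, here via simple round-robin; real clients use pluggable assignors), and one resumes a consumer from its committed offset.

```python
def assign_partitions(partitions, consumers):
    # Round-robin assignment: each partition goes to exactly one consumer.
    # Consumers beyond len(partitions) would simply receive nothing.
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


# Last committed offset per (topic, partition), as a stand-in for the
# broker-side __consumer_offsets topic.
committed = {("events", 0): 128}

def resume_position(topic, partition):
    # On restart, a consumer resumes from its committed offset,
    # or from the beginning if nothing was ever committed.
    return committed.get((topic, partition), 0)


print(assign_partitions([0, 1, 2, 3], ["c1", "c2"]))
print(resume_position("events", 0))  # resumes where it left off
print(resume_position("events", 1))  # no commit yet: starts at 0
```

This also shows why parallelism is capped at the partition count: with four partitions, a fifth consumer in the group would sit idle.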
- Real-time analytics — stream user clicks, transactions, sensor data to analytics pipelines.
- Microservices communication — decouple services via events instead of synchronous REST calls.
- Log aggregation — centralise application and system logs from many services.
- Event sourcing — store every state change as an immutable event log.
- Data integration (ETL) — move data between databases, data warehouses, and search indexes.
- Stream processing — Kafka Streams or Apache Flink for real-time transformations.
| Feature | Kafka | Traditional Queue (RabbitMQ, SQS) |
|---|---|---|
| Message retention | Configurable (hours/days/forever) | Deleted after consumption |
| Replay | ✅ Yes — reset offset to re-read | ❌ No |
| Consumers | Multiple independent groups | Competing consumers (each message delivered to one) |
| Ordering | Per partition | Per queue (often) |
| Throughput | Very high (millions/sec) | Lower |
| Best for | Event streaming, analytics, log pipelines | Task queues, job dispatching |