Kafka: a Distributed Messaging System for Log Processing

Authors: Jay Kreps, Neha Narkhede, Jun Rao (LinkedIn folks)

Date: 2011

Link: PDF


  1. There is a large amount of “log” data generated at any sizable internet company:
  2. Every day, China Mobile collects 5–8TB of phone call records and Facebook gathers almost 6TB of various user activity events.
  3. Many early systems for processing this kind of data relied on physically scraping log files off production servers for analysis.
  4. Issues with traditional enterprise messaging systems for log processing:
  5. Issues with specialized log aggregators like Facebook Scribe:
  6. Kafka is a novel messaging system for log processing that combines the benefits of traditional log aggregators and messaging systems.
  7. Kafka benefits:
  8. Kafka basic concepts:
  9. Sample producer code (not the exact API):
producer = new Producer(...);
message = new Message("test message str".getBytes());
set = new MessageSet(message);
producer.send("topic1", set);
  1. Sample consumer code:
streams[] = Consumer.createMessageStreams("topic1", 1);
for (message : streams[0]) {
    bytes = message.payload();
    // do something with the bytes
}
  1. A Kafka cluster typically consists of multiple brokers. To balance the load, a topic is divided into multiple partitions, and each broker stores one or more of those partitions.
  2. Each producer can publish a message to either a randomly selected partition or a partition semantically determined by a partitioning key and a partitioning function.
  3. Design decisions:
  4. Kafka has the concept of consumer groups. Each consumer group consists of one or more consumers that jointly consume a set of subscribed topics, i.e., each message is delivered to only one of the consumers within the group. Different consumer groups each independently consume the full set of subscribed messages and no coordination is needed across consumer groups.
  5. Kafka uses Zookeeper for the following tasks:
  6. Kafka only guarantees at-least-once delivery.
  7. Kafka guarantees that messages from a single partition are delivered to a consumer in order. However, there is no guarantee on the ordering of messages coming from different partitions.
  8. To detect log corruption, Kafka stores a CRC for each message in the log.
  9. The authors benchmarked Kafka against ActiveMQ and RabbitMQ and found that Kafka producers and consumers are faster for the use cases they were designed for, thanks to the design decisions outlined above.
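
The two publishing strategies mentioned in the list above (a randomly selected partition, or a partition derived from a partitioning key) can be sketched roughly as follows. This is only an illustration: `PartitionChooser` and its methods are hypothetical names, and hashing the key bytes modulo the partition count is an assumed partitioning function, not necessarily the one Kafka uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; not part of any Kafka API.
final class PartitionChooser {

    // Key-based choice: the same key always maps to the same partition,
    // so all messages for that key stay in one partition.
    static int forKey(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // floorMod keeps the result in [0, numPartitions) even for negative hashes.
        return Math.floorMod(hash, numPartitions);
    }

    // Random choice: spreads messages across partitions with no key semantics.
    static int random(int numPartitions) {
        return ThreadLocalRandom.current().nextInt(numPartitions);
    }

    public static void main(String[] args) {
        System.out.println("keyed:  " + forKey("user-42", 8));
        System.out.println("keyed:  " + forKey("user-42", 8)); // same partition as the line above
        System.out.println("random: " + random(8));
    }
}
```

Key-based selection matters when a consumer needs all messages for one key in order, since ordering is only guaranteed within a partition.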
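
One way to get the consumer-group behaviour described above (each message goes to only one consumer within a group) is to give every partition exactly one owner inside the group. The sketch below assumes that approach and uses a simple round-robin split chosen for brevity; `GroupAssignment` and `assign` are hypothetical names, and the actual rebalancing scheme may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not Kafka's actual rebalancing code.
final class GroupAssignment {

    // Gives each partition exactly one owner among the consumers of one group.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        if (consumers.isEmpty()) {
            throw new IllegalArgumentException("a consumer group needs at least one consumer");
        }
        // Sort so that every consumer computes the same assignment independently.
        List<String> sorted = new ArrayList<>(consumers);
        Collections.sort(sorted);

        Map<String, List<Integer>> owned = new LinkedHashMap<>();
        for (String consumer : sorted) {
            owned.put(consumer, new ArrayList<>());
        }
        // Round-robin: partition p goes to consumer number (p mod group size).
        for (int p = 0; p < numPartitions; p++) {
            owned.get(sorted.get(p % sorted.size())).add(p);
        }
        return owned;
    }

    public static void main(String[] args) {
        List<String> group = new ArrayList<>();
        group.add("c1");
        group.add("c2");
        System.out.println(assign(group, 5));
    }
}
```

A second consumer group would run the same computation over its own members, so it receives every partition again without coordinating with the first group.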
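
The per-message CRC mentioned above can be illustrated with the JDK's `java.util.zip.CRC32`. This is a minimal sketch of the idea only: `CrcCheck` is a hypothetical name, and the real on-disk message layout and exact checksum handling are not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical illustration of storing and verifying a per-message checksum.
final class CrcCheck {

    // Computes the CRC-32 of a message payload.
    static long checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }

    // A message is considered intact if its payload still matches the stored CRC.
    static boolean isIntact(byte[] payload, long storedCrc) {
        return checksum(payload) == storedCrc;
    }

    public static void main(String[] args) {
        byte[] payload = "test message str".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(payload);               // written alongside the message
        System.out.println(isIntact(payload, stored)); // prints true

        payload[0] ^= 0x01;                            // simulate a single flipped bit
        System.out.println(isIntact(payload, stored)); // prints false
    }
}
```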