Apache Kafka Overview

Apache Kafka is a community distributed event streaming platform. It is capable of handling trillions of events every day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log.

Messages in Kafka are variable-size byte arrays. Use any format your application requires like include free-form text, JSON, and Avro. There is no explicit limit on message size. Kafka is not suitable for fewer messages of larger size. Kafka is more suitable for large volume of messages of smaller size(<1MB). Kafka is written in Java and Scala.(80% of the code in Java & 20% in Scala)

Kafka retains all messages for a defined time period and/or total size.Kafka discards messages automatically after the retention period or total size is exceeded (whichever limit is reached first). Default retention is one week, can be increased to one year or longer. Retention can be specified on global or per-topic basis

Key Terminology

Message A single data record passed by Kafka
Topic: A named log or feed of messages within Kafka
Producer: A program that writes messages to Kafka
Consumer: A program that reads messages from Kafka

Topic is called Messaging Destination. It temporarily holds the message before it is delivered to the receiver. A topic is divided into multiple partitions. Each partition holds a set of related messages. In a cluster, a single topic spreads across multiple servers(brokers). Each broker is a kafka server instance. Each partition is replicated in multiple brokers based on the replication factor. Assumer that there is a topic with 4 partitions(0,1,2 and 3) and replication factor of 3 in a cluster with 5 brokers. Replication factor denotes in how many brokers , a partition is replicated.

	broker-1		0,1
	broker-2		2,3,0
	broker-3		1,2
	broker-4		3,0
	broker-5		1,2,3

Publish + Subscribe

Immutable commit log, and from there you can subscribe to it, and publish data to any number of systems or real-time applications. Unlike messaging queues, Kafka is a highly scalable, fault tolerant distributed system, allowing it to be deployed for applications like managing passenger and driver matching at Uber, providing real-time analytics and predictive maintenance for British Gas' smart home, and performing numerous real-time services across all of LinkedIn. This unique performance makes it perfect to scale from one app to company-wide use.

Store

An abstraction of a distributed commit log commonly found in distributed databases, Apache Kafka provides durable storage. Kafka can act as a 'source of truth', being able to distribute data across multiple nodes for a highly available deployment within a single data center or across multiple availability zones.

Process

An event streaming platform would not be complete without the ability to manipulate that data as it arrives. The Streams API within Apache Kafka is a powerful, lightweight library that allows for on-the-fly processing, letting you aggregate, create windowing parameters, perform joins of data within a stream, and more. Perhaps best of all, it is built as a Java application on top of Kafka, keeping your workflow intact with no extra clusters to maintain.