Video Summary

Apache Kafka 101: Kafka Streams (2023)

Confluent

Main takeaways
01

Kafka consumer code grows complex once it takes on windowing, joins, enrichment, and aggregation; Kafka Streams removes most of that boilerplate.

02

Kafka Streams is a Java library (not separate infrastructure) that exposes primitives like filter, group, aggregate, and join.

03

Kafka Streams manages state off-heap, persisting it to local disk and to internal Kafka topics for recovery and fault tolerance.

04

Kafka Streams fits naturally inside microservices: stream processing runs alongside REST APIs and other service responsibilities, and scales via consumer groups.

Key moments
Questions answered

Why do Kafka consumers tend to become more complex than producers?

Consumers often evolve from simple stateless transforms to stateful operations like aggregation, enrichment, windowing and handling late or out-of-order messages, which require extensive framework and state management code.

What shortcomings of the consumer API does Kafka Streams address?

Kafka Streams provides built-in primitives for filtering, grouping, aggregating, joins, time windows and handling out-of-order or late messages, removing the need to write custom framework code on top of the basic consumer API.

How does Kafka Streams manage application state and ensure fault tolerance?

Streams manages state off-heap, persists it to local disk for fast restart, and replicates it to internal Kafka topics so state can be restored after failures or when rebalancing nodes.

Can Kafka Streams be used inside a microservice, and how does it scale?

Yes. Kafka Streams is a library you embed in your service, so stream processing coexists with other functions (e.g., REST APIs). It scales using consumer-group semantics, distributing processing across instances.

The Complexity of Kafka Consumers 00:19

"In a growing Kafka-based application, consumers tend to grow in complexity; maybe producers don't too much, but consumers definitely do."

  • As Kafka applications expand, the complexity of consumers often increases significantly while producers tend to remain simpler.

  • Tasks that began as basic stateless transformations, such as masking personally identifiable information or formatting messages, evolve into complex operations involving aggregation and enrichment.

The Limitations of the Consumer API 00:41

"There isn't a lot of support in the consumer API for advanced operations like handling TimeWindows and out-of-order messages."

  • The consumer API provides minimal support for advanced stream processing needs, requiring developers to write considerable framework code for handling features such as time windows and late or out-of-order messages.

  • Aggregation and enrichment add a further burden: they are stateful, so the application must hold and manage that state itself, which costs memory.
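
To make the bookkeeping described above concrete, here is a minimal plain-Java sketch of a tumbling-window count that tolerates out-of-order records. It uses no Kafka API: the class name, the grace-period rule, and the in-memory maps are all assumptions chosen for illustration, and the late-record rule is a simplification of what a real stream processor does.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical hand-rolled windowed counter; not part of any Kafka API. */
class TumblingWindowCounter {
    private final long windowSizeMs;
    private final long graceMs;
    // windowStart -> (key -> count). This is the state the application must own.
    private final Map<Long, Map<String, Long>> windows = new HashMap<>();
    // Highest event timestamp seen so far, used to decide when a window is closed.
    private long streamTimeMs = Long.MIN_VALUE;

    TumblingWindowCounter(long windowSizeMs, long graceMs) {
        this.windowSizeMs = windowSizeMs;
        this.graceMs = graceMs;
    }

    /** Counts the record in its window; returns false if it arrived too late and was dropped. */
    boolean add(String key, long timestampMs) {
        streamTimeMs = Math.max(streamTimeMs, timestampMs);
        long windowStart = timestampMs - Math.floorMod(timestampMs, windowSizeMs);
        long windowEnd = windowStart + windowSizeMs;
        // Assumed rule: a window closes once stream time passes its end plus the grace period.
        if (windowEnd + graceMs <= streamTimeMs) {
            return false;
        }
        windows.computeIfAbsent(windowStart, w -> new HashMap<>()).merge(key, 1L, Long::sum);
        return true;
    }

    /** Current count for a key in the window that starts at windowStart. */
    long count(String key, long windowStart) {
        Map<String, Long> counts = windows.get(windowStart);
        return counts == null ? 0L : counts.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        TumblingWindowCounter counter = new TumblingWindowCounter(10_000, 5_000);
        counter.add("sensor-1", 1_000);
        counter.add("sensor-1", 12_000);
        boolean accepted = counter.add("sensor-1", 9_000); // out of order, still within grace
        System.out.println(accepted + " " + counter.count("sensor-1", 0));
    }
}
```

Even this small version has to track stream time, window boundaries, and per-window state, and it still ignores expiry of old windows and persistence. That is the framework code the summary refers to.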

The Need for Persistence in Stream Processing 01:15

"If your stream processing application goes down, its state goes with it, unless you've devised a scheme to persist that state somewhere."

  • In-memory state is a fault-tolerance risk: if the application crashes, that state is lost unless it has been persisted somewhere else.

  • A stateful consumer therefore needs a persistence scheme of its own so that it can resume from its previous state after a restart, and the need grows as the application scales.

Advantages of Kafka Streams 01:50

"Apache Kafka provides a stream processing API called Kafka Streams, which gives you easy access to the computational primitives of stream processing."

  • Kafka Streams is a Java API that exposes the core primitives of stream processing: filtering, grouping, aggregating, and joining.

  • With Kafka Streams, developers no longer write extensive framework code on top of the consumer API and can concentrate on the data processing logic itself.
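
The shape of a filter, group, aggregate pipeline can be sketched with `java.util.stream` over an in-memory list. This is only a batch analogy, not Kafka Streams code: the `Order` type, its fields, and the `"PAID"` status are hypothetical. In the Kafka Streams DSL the corresponding steps are expressed with methods such as `KStream#filter`, `KStream#groupByKey`, and `KGroupedStream#aggregate`, applied to an unbounded topic rather than a finite list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Batch analogy for the stream-processing primitives; all names are illustrative. */
class PrimitivesSketch {
    /** Hypothetical event type. */
    static final class Order {
        final String customerId;
        final String status;
        final long amountCents;

        Order(String customerId, String status, long amountCents) {
            this.customerId = customerId;
            this.status = status;
            this.amountCents = amountCents;
        }
    }

    /** filter -> group -> aggregate: total paid amount per customer, sorted by customer id. */
    static Map<String, Long> paidTotalByCustomer(List<Order> orders) {
        return orders.stream()
                .filter(o -> "PAID".equals(o.status))              // filter
                .collect(Collectors.groupingBy(                     // group by key
                        o -> o.customerId,
                        TreeMap::new,
                        Collectors.summingLong(o -> o.amountCents))); // aggregate
    }

    public static void main(String[] args) {
        List<Order> orders = Arrays.asList(
                new Order("alice", "PAID", 500),
                new Order("bob", "PENDING", 700),
                new Order("alice", "PAID", 250));
        System.out.println(paidTotalByCustomer(orders));
    }
}
```

The difference in a streaming setting is that the input never ends, so the aggregate is a continuously updated table rather than a final result. That is why the state handling covered in the next section matters.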

State Management with Kafka Streams 03:37

"Streams manages state off-heap, which is an important feature when dealing with distributed stream processing."

  • Kafka Streams keeps distributed state off-heap and persists it both to local disk and to internal Kafka topics, which is what makes the state fault tolerant.

  • State can be restored from local disk or from the internal topics after a failure, or when stream processing nodes are added to or removed from the cluster.
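
The recovery idea can be modelled in a few lines of plain Java: every write goes to a local map and is also appended to a log, and a fresh instance rebuilds its map by replaying that log. This is a conceptual sketch only. The `ChangelogBackedStore` class is hypothetical, the `List` stands in for an internal Kafka topic, and the `HashMap` stands in for the local store; it does not reflect how Kafka Streams implements state stores internally.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a store that can be rebuilt from a changelog. */
class ChangelogBackedStore {
    /** One changelog entry: a value written for a key. */
    static final class Change {
        final String key;
        final long value;

        Change(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, Long> local = new HashMap<>(); // stands in for the local store
    private final List<Change> changelog;                    // stands in for the internal topic

    ChangelogBackedStore(List<Change> changelog) {
        this.changelog = changelog;
    }

    /** Writes locally and records the change so another instance can replay it. */
    void put(String key, long value) {
        local.put(key, value);
        changelog.add(new Change(key, value));
    }

    /** Returns the current value, or null if the key was never written. */
    Long get(String key) {
        return local.get(key);
    }

    /** Rebuilds a store on a new instance by replaying the changelog in order. */
    static ChangelogBackedStore restore(List<Change> changelog) {
        ChangelogBackedStore store = new ChangelogBackedStore(changelog);
        for (Change change : changelog) {
            store.local.put(change.key, change.value); // later entries overwrite earlier ones
        }
        return store;
    }

    public static void main(String[] args) {
        List<Change> log = new ArrayList<>();
        ChangelogBackedStore original = new ChangelogBackedStore(log);
        original.put("orders-by-alice", 1);
        original.put("orders-by-alice", 2);
        // Simulate a crash: discard `original` and rebuild from the log alone.
        ChangelogBackedStore recovered = ChangelogBackedStore.restore(log);
        System.out.println(recovered.get("orders-by-alice"));
    }
}
```

Replaying the full log is the slow path. Keeping a copy on local disk, as the summary notes, lets a restarted instance skip most of that replay.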

Integrating Stream Processing in Microservices 04:51

"In a typical microservice, stream processing is one function among other operational capabilities, like exposing a REST API for synchronous key lookups."

  • In a microservice, stream processing is usually one function among several, alongside responsibilities such as serving a REST API for synchronous key lookups.

  • For example, a shipment notification service can integrate multiple streams, combining shipment events, product information, and customer data while also serving client requests through HTTP endpoints.
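
A rough plain-Java sketch of the shipment example follows, assuming a single customer table for brevity (product information would be joined the same way). The `ShipmentEnricher` class, its method names, and the message format are hypothetical. The join is modelled as an inner join against the latest customer record, and `lookupCustomer` stands in for the synchronous key lookup that an HTTP handler could delegate to.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical service core: enriches shipment events and answers key lookups. */
class ShipmentEnricher {
    // customerId -> latest known customer name, kept up to date from a customer stream.
    private final Map<String, String> customerNames = new HashMap<>();

    /** Applies an update from the customer stream; the latest value per key wins. */
    void onCustomerUpdate(String customerId, String name) {
        customerNames.put(customerId, name);
    }

    /** Joins a shipment event with the current customer record; empty if the customer is unknown. */
    Optional<String> onShipment(String shipmentId, String customerId) {
        String name = customerNames.get(customerId);
        if (name == null) {
            return Optional.empty(); // inner-join behaviour: no match, no notification
        }
        return Optional.of("Shipment " + shipmentId + " is on its way to " + name);
    }

    /** Synchronous lookup that a REST endpoint in the same service could call. */
    Optional<String> lookupCustomer(String customerId) {
        return Optional.ofNullable(customerNames.get(customerId));
    }

    public static void main(String[] args) {
        ShipmentEnricher enricher = new ShipmentEnricher();
        enricher.onCustomerUpdate("c-42", "Ada");
        System.out.println(enricher.onShipment("s-1", "c-42").orElse("no notification"));
    }
}
```

The point of the sketch is the co-location: the same in-process state serves both the stream join and the request path, which is what embedding a stream processing library in the service makes possible.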

Kafka Streams as a Library, Not Infrastructure 06:53

"Kafka Streams is a library that you add to an application that you already need, enhancing existing scaling capabilities without requiring a separate infrastructure."

  • Kafka Streams is a library added to an application you already run; it does not require standalone infrastructure dedicated to stream processing.

  • The application is therefore deployed and scaled with the practices already in place, while gaining scalable, fault-tolerant stream processing.
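
Scaling through consumer groups comes down to dividing a topic's partitions among the running instances, with each partition owned by exactly one instance at a time. The sketch below models that with a simple round-robin split. It is a deliberate simplification: the `PartitionAssignmentSketch` class and the instance names are hypothetical, and real Kafka clients use pluggable assignment strategies rather than this exact rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical round-robin model of how partitions are spread over service instances. */
class PartitionAssignmentSketch {
    /** Assigns partitions 0..partitionCount-1 to instances in round-robin order. */
    static Map<String, List<Integer>> assign(List<String> instances, int partitionCount) {
        if (instances.isEmpty()) {
            throw new IllegalArgumentException("at least one instance is required");
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String instance : instances) {
            assignment.put(instance, new ArrayList<>());
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = instances.get(partition % instances.size());
            assignment.get(owner).add(partition); // each partition has exactly one owner
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two instances share six partitions; starting a third triggers a new split.
        System.out.println(assign(Arrays.asList("svc-1", "svc-2"), 6));
        System.out.println(assign(Arrays.asList("svc-1", "svc-2", "svc-3"), 6));
    }
}
```

Starting another copy of the service changes the split without any code change. Because a partition has only one owner, instances beyond the partition count would receive no work in this model.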