
Kafka for Data Integration

In modern, microservice-based development, each application selects its data storage mechanism according to its use case, with little consideration for a unified database. As a result, organizations end up with many data sources spread across different kinds of databases, event streams, application logs, and so on. Add to that the recent explosion in data handling frameworks, and the result is a disjointed and scattered data environment.

Real-time streaming data integration platforms can make a difference in this environment. Such tools help aggregate data from multiple sources, transform it, and push it to the locations where it is accessed. This article will explore how Kafka can be a do-it-all data integration platform and help you quickly derive value from your data.

Kafka Features

A data integration platform connects with different data sources and destinations to help organizations derive value from scattered data. Kafka is a strong option for data integration for the following reasons.

Feature	Description
Durability, fault tolerance, and scaling capabilities	Kafka uses partitions for scale and multiple replicas for resilience.
Extensive and easy data routing	Connector support, flexible routing strategies, and stream processing abilities make Kafka a good choice for data integration.
Flexible data ingestion	Kafka can connect with numerous data sources, like databases, SaaS APIs, and microservice events, which facilitates data integration.
Highly tunable for different use cases	Fine-tuning parameters related to partitions, transactions, topic creation, and logging helps produce more value from your Kafka deployment.

Understanding Kafka

Kafka is an open-source, scalable, fault-tolerant, and durable real-time streaming platform that helps integrate data from a variety of sources and sinks. It is built around five concepts: producers, brokers, consumers, topics, and partitions. Producers emit streams of messages, and consumers read them, process them, and write the results to various destinations. Topics let users segregate messages by source, destination, or processing logic. Brokers are the servers that store messages from producers and serve them to consumers.

Each message in a topic is stored on disk in a partition and is identified by its partition ID and a sequential offset within that partition, which lets consumers fetch precisely the messages they need. To “checkpoint” how far it has read into a topic partition, a consumer regularly commits the position of the latest message it has processed, known as the consumer offset. Kafka brokers keep these committed offsets for each consumer group in an internal topic named __consumer_offsets.
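To make the offset mechanics concrete, here is a minimal sketch in plain Python. It is a toy model and not the Kafka client API: the class names ToyPartitionLog and ToyOffsetStore, the "billing" group, and the event values are all made up for illustration. It assumes the committed value is the position of the next record to read, which is how Kafka consumers treat committed offsets.

```python
from collections import defaultdict


class ToyPartitionLog:
    """An append-only list standing in for one topic partition on a broker."""

    def __init__(self):
        self.records = []

    def append(self, value):
        # The offset is simply the record's position in the log.
        self.records.append(value)
        return len(self.records) - 1

    def read_from(self, offset):
        # Return (offset, value) pairs starting at the given offset.
        return list(enumerate(self.records))[offset:]


class ToyOffsetStore:
    """Stands in for __consumer_offsets: (group, partition) -> next offset."""

    def __init__(self):
        self.committed = defaultdict(int)

    def commit(self, group, partition, next_offset):
        self.committed[(group, partition)] = next_offset

    def position(self, group, partition):
        return self.committed[(group, partition)]


log = ToyPartitionLog()
for event in ["order-created", "order-paid", "order-shipped"]:
    log.append(event)

offsets = ToyOffsetStore()

# The hypothetical "billing" group processes the first two records, then commits.
for offset, value in log.read_from(offsets.position("billing", 0))[:2]:
    pass  # process the record here
offsets.commit("billing", 0, 2)

# After a restart, the group resumes from its committed position.
resumed = log.read_from(offsets.position("billing", 0))
```

Because the committed position is stored per consumer group, a second group reading the same partition would start from its own position without affecting the first.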

One of the main challenges of working with a large amount of data is establishing an architecture that inherently provides durability, which means avoiding data loss and corruption. Kafka runs multiple broker instances and replicates the data of each partition across them. With this architecture, a replicated message survives the failure of a single broker and remains available for consumers to read.

Another important consideration with streaming data is fault tolerance, which describes the ability of the system to recover from failures. Kafka achieves this through partitions and replicas: it stores replicas (copies) of each partition's messages on multiple nodes to avoid any single point of failure.

Kafka scales through partitions. Giving a topic multiple partitions allows multiple consumers to read from it in parallel, one consumer per partition, which improves throughput. Within a consumer group, Kafka ensures that each partition is consumed by only one consumer.
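The one-consumer-per-partition rule can be sketched with a simple round-robin assignment. This is only an illustration: the function assign_partitions and the consumer names are hypothetical, and Kafka's real assignment strategies are more involved than this.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment: each partition goes to exactly one consumer."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Two consumers in one group split the four partitions between them.
two_consumers = assign_partitions(partitions, ["c1", "c2"])

# With more consumers than partitions, the extra consumers sit idle.
five_consumers = assign_partitions(partitions, ["c1", "c2", "c3", "c4", "c5"])
idle = [c for c, owned in five_consumers.items() if not owned]
```

The second case shows why the partition count caps the read parallelism of a single consumer group: the fifth consumer receives nothing to read.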

Kafka provides various message delivery guarantee options that users can choose based on their performance requirements and use cases. It supports at-most-once, at-least-once, and exactly-once guarantees.
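On the consumer side, the difference between at-most-once and at-least-once largely comes down to whether the offset is committed before or after a record is processed. The sketch below simulates a single crash to show the effect. It is a toy model in plain Python with made-up names (consume, run_with_restart), not Kafka client code, and it does not model exactly-once, which relies on idempotent producers and transactions.

```python
def consume(records, committed, commit_before_processing, crash_at):
    """Process records starting at `committed`, simulating one crash.

    Returns (processed_values, committed_offset_at_exit).
    """
    processed = []
    for offset in range(committed, len(records)):
        if commit_before_processing:
            committed = offset + 1              # at-most-once: commit first
            if offset == crash_at:
                break                           # crash before processing
            processed.append(records[offset])
        else:
            processed.append(records[offset])   # at-least-once: process first
            if offset == crash_at:
                break                           # crash before committing
            committed = offset + 1
    return processed, committed


records = ["a", "b", "c"]


def run_with_restart(commit_before_processing):
    # First run crashes at offset 1; second run resumes from the committed offset.
    first, committed = consume(records, 0, commit_before_processing, crash_at=1)
    second, _ = consume(records, committed, commit_before_processing, crash_at=None)
    return first + second


at_most_once = run_with_restart(True)    # "b" is skipped after the crash
at_least_once = run_with_restart(False)  # "b" is processed twice
```

Committing first risks losing a message; processing first risks handling it twice, so at-least-once consumers usually need idempotent processing downstream.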


Why Kafka for Data Integration?

A data integration platform connects with numerous data sources and destinations, acts as a broker between source-destination combinations, and transforms and routes data according to custom logic between them. Let us see how Kafka supports each of these requirements.

Support for Connectors

Kafka handles connections to data sources and sinks through a separate module called Kafka Connect. Kafka Connect exposes an API that abstracts away much of the complexity of interacting with the low-level APIs of data sources and sinks, providing a quick and easy way for developers to create custom connectors. The Kafka Connect API greatly simplifies connector development, deployment, and management. Connectors can also apply essential data transformations before the data reaches a Kafka topic. Many ready-to-use Kafka connectors are available from organizations like Confluent and Aiven.
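As a rough illustration of what a connector definition looks like, the sketch below builds the JSON body that could be submitted to the Kafka Connect REST API (POST /connectors on a Connect worker). It uses the FileStream example source connector and the HoistField and InsertField single message transforms from Apache Kafka. The file path, topic name, and field values are hypothetical, and depending on your Kafka version the FileStream connector may need to be added to the worker's plugin path first.

```python
import json

connector = {
    "name": "local-file-source",
    "config": {
        "connector.class": "FileStreamSource",
        "tasks.max": "1",
        "file": "/tmp/orders.txt",   # hypothetical input file
        "topic": "orders-raw",       # hypothetical destination topic
        # Two chained transforms applied before records reach the topic:
        # wrap each text line in a map, then add a static field to it.
        "transforms": "MakeMap,InsertSource",
        "transforms.MakeMap.type":
            "org.apache.kafka.connect.transforms.HoistField$Value",
        "transforms.MakeMap.field": "line",
        "transforms.InsertSource.type":
            "org.apache.kafka.connect.transforms.InsertField$Value",
        "transforms.InsertSource.static.field": "data_source",
        "transforms.InsertSource.static.value": "orders-file",
    },
}

# Serialized request body for the Connect REST API.
payload = json.dumps(connector, indent=2)
```

The point of the sketch is that a connector is declared through configuration rather than code: the source, the destination topic, and the lightweight transformations all live in one document.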

Routing Features

A data integration platform often needs to route messages between different services. For example, let’s say you are working on an ecommerce website that uses events to coordinate multiple microservices. All the events that deal with payments will be routed to a specific gateway, where a separate consumer processes them. Topics are Kafka’s way of dealing with such requirements. Kafka topics help to organize messages into logical groups or categories.

In Kafka, static routing, or predefined rule-based routing, is relatively easy to accomplish by splitting a stream into branches and sending each branch to a different topic. For dynamic routing, Kafka Streams provides the TopicNameExtractor interface, which lets developers derive the destination topic name from message data. Kafka has configurations that allow automatic topic creation, but careless use of this feature can lead to topic explosion, so it is strongly discouraged.
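The idea behind dynamic routing can be sketched in a few lines of plain Python. This is not the Kafka Streams API: extract_topic only mimics the role a topic name extractor plays, and the topic names and event shapes are assumptions made for the example. The allow-list shows one way to avoid topic explosion, by falling back to a catch-all topic instead of creating a topic for every unexpected value.

```python
from collections import defaultdict

# Hypothetical set of topics that already exist on the cluster.
ALLOWED_TOPICS = {"payments", "shipping", "orders-other"}


def extract_topic(event):
    """Choose a destination topic from the event's own data."""
    candidate = event.get("type", "").split(".")[0]
    # Fall back to a catch-all topic instead of creating new topics on the fly.
    return candidate if candidate in ALLOWED_TOPICS else "orders-other"


def route(events):
    """Group events by the topic chosen for each of them."""
    topics = defaultdict(list)
    for event in events:
        topics[extract_topic(event)].append(event)
    return dict(topics)


events = [
    {"type": "payments.captured", "order_id": 1},
    {"type": "shipping.label_created", "order_id": 1},
    {"type": "reviews.posted", "order_id": 2},
]
routed = route(events)
```

Here the payment and shipping events land in their own topics, while the unexpected "reviews" event goes to the catch-all topic.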

To handle routing logic that needs external data access, Kafka allows one to ingest data from external databases as a Kafka stream. You can then apply static or dynamic rules to facilitate the routing.

Kafka routing is fairly advanced and is one of the reasons why Kafka does so well as a data integration platform.


Stream Processing

Kafka Streams provides a way to process incoming data. Stream processing applications are real-time applications that react to the incoming data stream as events arrive. Kafka Streams offers a high-level, domain-specific language (DSL) and a low-level stream processing API (the Processor API) to build such applications. It can use either the event time or the ingestion time as the time reference while processing events, and it supports windowing functions to aggregate streams based on time.
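A time-based window aggregation can be sketched without any Kafka dependency. The example below counts page views in one-minute tumbling windows keyed by event time. The event fields (page, event_time_ms) and the window size are assumptions for illustration, and real stream processors also have to decide how long to wait for late events, which this toy ignores.

```python
from collections import Counter

WINDOW_MS = 60_000  # one-minute tumbling windows


def window_start(timestamp_ms):
    """Return the start of the tumbling window containing the timestamp."""
    return timestamp_ms - (timestamp_ms % WINDOW_MS)


def count_per_window(events):
    """Count events per (page, window start), using each event's own time."""
    counts = Counter()
    for event in events:
        counts[(event["page"], window_start(event["event_time_ms"]))] += 1
    return dict(counts)


events = [
    {"page": "/cart", "event_time_ms": 5_000},
    {"page": "/cart", "event_time_ms": 59_999},
    {"page": "/cart", "event_time_ms": 60_000},
    # Arrives last but belongs to the first window by event time.
    {"page": "/cart", "event_time_ms": 30_000},
]
counts = count_per_window(events)
```

Using event time means the late-arriving event is still counted in the window in which it actually happened, rather than the window in which it was received.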

Since requirements often mandate enriching stream data from other sources, Kafka Streams treats tables as integral to the paradigm, and streams and tables can be converted into each other. For example, a table can be turned into a stream by capturing the changes made to it, while a stream can be turned into a table by aggregating it with functions like sum or count. A table can always be defined in terms of its underlying change stream. This table-stream duality is embedded in the design of the Kafka Streams API, which helps developers quickly build applications by combining streams with databases.
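The duality is easy to see in a small plain-Python sketch. The function names and the purchase data below are made up for illustration and this is not the Kafka Streams API: aggregating a stream yields a table, recording every update to that table yields a change stream, and replaying the change stream rebuilds the same table.

```python
def stream_to_table(stream):
    """Aggregate a stream of (key, amount) events into a table of sums."""
    table = {}
    for key, amount in stream:
        table[key] = table.get(key, 0) + amount
    return table


def table_to_changelog(stream):
    """Emit the table's change stream: one (key, new_value) per update."""
    table, changelog = {}, []
    for key, amount in stream:
        table[key] = table.get(key, 0) + amount
        changelog.append((key, table[key]))
    return changelog


def replay(changelog):
    """Rebuild the table by keeping the latest value per key."""
    table = {}
    for key, value in changelog:
        table[key] = value
    return table


purchases = [("alice", 10), ("bob", 5), ("alice", 7)]
totals = stream_to_table(purchases)
changelog = table_to_changelog(purchases)
rebuilt = replay(changelog)
```

The rebuilt table matches the aggregated one, which is the sense in which a table can be defined by its underlying change stream.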

If the built-in processing API cannot meet your requirements, Kafka also supports integration with other processing frameworks. Spark Streaming and Spark Batch are often used with Kafka to augment its stream processing abilities.

How Kafka Fits into Large Data Integration Pipelines

A typical enterprise organization will have data scattered across multiple applications and databases. For example, an ecommerce organization would typically have the following data sources:

Event data as a result of website operation. This data consists of user action events like searching, adding to the cart, order confirmation, and shipping.
Website tracking events from services like Google Analytics.
Customer and product databases.
Logs from hundreds of microservices.
Analytics data coming from third-party services, such as chatbots built on the Azure Bot Framework or Dialogflow.

All such data needs to be processed separately and pushed to a unified data lake or warehouse. A distributed file system like HDFS or completely managed services like S3 are used to implement data lakes and warehouses. Kafka can act as a broker among these to facilitate integration.

Kafka-based architecture for integrating data in an ecommerce organization.

The architecture above integrates data from multiple sources, including on-premises and cloud-based sources, through Kafka. The data gets pushed to data lakes, warehouses, monitoring applications, and processing applications. For example, a processing application could be a shipping provider integrator service that responds every time an event about a confirmed order is triggered. Here, a monitoring application could be a log monitoring app that generates alerts based on rules from logs. Kafka sits in the middle of these and reliably routes and processes messages to facilitate integration.

Now that we are clear about how Kafka acts as a data integrator in large, high-volume data pipelines, let’s explore some more relevant use cases of Kafka-based data integration.


Implementing Clickstream Analytics Using Kafka

Clickstream analytics use cases involve ingesting the clicks generated on websites and then analyzing them to derive insights about customer behavior. Such data is instrumental in building product recommendation systems. Kafka is a popular choice for implementing clickstream analytics; let’s see it in action in the architecture below.

In this example, a script embedded in the website front-end application keeps sending events about user actions. The data generally contains an event type, page link, access time, and user identifier, if available. The script sends the data to an API that pushes data to Kafka, which processes the data, enriches it with operational data like product details, and moves it to raw data and processed storage accordingly. Analysts and data scientists then work on this data to develop models or rules that can help the company influence user behavior.
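The enrichment step in this flow can be sketched as follows. It is a plain-Python illustration with hypothetical names: PRODUCTS stands in for operational data such as a product database, and the two returned lists stand in for the raw and processed storage locations.

```python
# Hypothetical operational data, for example loaded from a product database.
PRODUCTS = {
    "sku-1": {"name": "Trail Shoes", "category": "footwear"},
}


def enrich(click, products):
    """Return a copy of the click event with product details attached."""
    product = products.get(click.get("product_id"))
    return {**click, "product": product}  # product is None when unknown


def process(clicks, products):
    """Keep each original event for raw storage and an enriched copy."""
    raw, processed = [], []
    for click in clicks:
        raw.append(click)
        processed.append(enrich(click, products))
    return raw, processed


clicks = [
    {"event": "add_to_cart", "page": "/p/sku-1", "product_id": "sku-1", "user": "u42"},
    {"event": "search", "page": "/search", "user": None},
]
raw, processed = process(clicks, PRODUCTS)
```

Keeping the raw events untouched means the enrichment logic can be changed later and rerun over the original data.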

Implementing Change Data Capture Using Kafka

Change data capture (CDC) involves synchronizing every change in a database with another database. CDC helps facilitate use cases like replication, disaster recovery, or even syncing data from an operational database to a data warehouse. Kafka’s event streaming features help process incoming events, translate them to a format understood by the target database, and insert them. Typically, a CDC connector like Debezium is used to get the events from the source database to Kafka.

The architecture below gives you an overview of how a CDC implementation using Kafka would look.

Kafka-based architecture for integrating data for change data capture.

In this case, we implement CDC between a PostgreSQL database and Nexla, a popular data lakehouse platform. Debezium, a popular CDC connector, connects with PostgreSQL and uses the write-ahead logs from PostgreSQL to read events and stream them to the Kafka topic. A Kafka consumer then reads these events, transforms them into insert statements for Nexla, and calls the required APIs.
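The consumer-side transformation can be sketched in plain Python. The example assumes a simplified Debezium-style change event with op, before, and after fields, where op is "c" for create, "r" for a snapshot read, "u" for update, and "d" for delete. The to_statement function, the customers table, and the id key column are hypothetical, and a real target may expect API calls rather than SQL, so treat this as a sketch of the mapping only.

```python
def to_statement(event, table, key="id"):
    """Map one change event to a parameterized SQL statement and its values."""
    op, before, after = event["op"], event.get("before"), event.get("after")
    if op in ("c", "r"):  # create, or snapshot read
        cols = sorted(after)
        placeholders = ", ".join(["%s"] * len(cols))
        sql = f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
        return sql, [after[c] for c in cols]
    if op == "u":
        cols = sorted(c for c in after if c != key)
        assignments = ", ".join(f"{c} = %s" for c in cols)
        sql = f"UPDATE {table} SET {assignments} WHERE {key} = %s"
        return sql, [after[c] for c in cols] + [after[key]]
    if op == "d":
        return f"DELETE FROM {table} WHERE {key} = %s", [before[key]]
    raise ValueError(f"unsupported op: {op!r}")


created = {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}}
updated = {"op": "u", "before": {"id": 1}, "after": {"id": 1, "email": "b@example.com"}}
deleted = {"op": "d", "before": {"id": 1}, "after": None}

statements = [to_statement(e, "customers") for e in (created, updated, deleted)]
```

Values are passed separately from the SQL text so that the target driver can bind them safely, and applying the statements in offset order keeps the target consistent with the source.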

Kafka’s durable, scalable, and fault-tolerant design helps protect the pipeline against data loss and corruption.

Best Practices

Now that we are clear about Kafka’s use as a high-volume data integration framework, let’s explore some of the best practices while implementing Kafka in production. Kafka has multiple moving parts in terms of its producers, consumers, brokers, partitions, and topics; getting maximum performance from Kafka is a balancing act of configuring all of them correctly. We will touch upon some of the critical factors here.

Kafka’s partitions are an essential part of its scalability and, together with replication, of its fault tolerance. We recommend using random partitioning by default unless there are architectural considerations that demand a custom partition strategy. Random partitioning will ensure a uniform distribution of data and better load balancing while using multiple consumers.
The choice of delivery guarantee is a vital factor to consider while deploying Kafka-based integration to production. Stronger guarantees, like exactly-once, have a cost associated with them because they rely on transactions. The overhead depends on the number of transactions and not the number of messages, so transactions need to be sufficiently large to optimize throughput. In Kafka Streams, tuning the commit.interval.ms parameter can help with this.
Kafka consumer performance is often dependent on the socket buffer size used while fetching messages (the receive.buffer.bytes consumer setting). The default value for this is 64 KB. We recommend optimizing this value based on your use case for maximum throughput in production.
Kafka logging often takes up a lot of disk space. While deploying in production, configure the logging parameters to avoid unnecessary logs. Logging is expensive, and excessive logging can clog CPU resources.
Defining metrics based on data accumulation in partitions will help determine the scaling configuration for your Kafka cluster. Defining these values early in the development cycle will help avoid surprises while going to production.
The topic creation strategy requires close supervision while deploying Kafka pipelines to production. It is best not to use automatic topic creation since that can easily result in a topic explosion if something goes wrong.
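The point above about transaction size can be checked with simple arithmetic. The sketch below assumes a fixed overhead per transaction plus a small cost per message. The specific numbers (0.1 ms per message, 10 ms per transaction) are hypothetical and chosen only to show the shape of the trade-off, not to describe any real cluster.

```python
def throughput(messages_per_tx, per_message_ms, per_tx_overhead_ms):
    """Messages per second when each transaction pays a fixed overhead."""
    tx_time_ms = per_tx_overhead_ms + messages_per_tx * per_message_ms
    return messages_per_tx / tx_time_ms * 1000


# Hypothetical costs: 0.1 ms per message, 10 ms per transaction commit.
small = throughput(10, 0.1, 10)      # 10 messages in about 11 ms
large = throughput(1_000, 0.1, 10)   # 1,000 messages in about 110 ms
speedup = large / small
```

Under these assumptions, small transactions spend most of their time on the fixed overhead, while larger transactions amortize it across many messages and achieve roughly ten times the throughput. The cost of larger transactions is higher end-to-end latency, since messages become visible to transactional readers only after the commit.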



Conclusion

We have now learned about the basics of Kafka and how it is a good option for real-time integration of scattered data sources. Kafka provides an easy-to-use Connect API to connect all kinds of sources and sinks. Its fault tolerance and durability are major reasons why many organizations treat it as their de facto streaming platform.