ETL Batch Processing With Kafka?

Tom Kaszuba
Published in The Startup · 6 min read · Oct 2, 2020


Kafka is most likely not the first platform you reach for when thinking of processing batched data. You've most likely heard of Kafka being used to process millions of continuous real time events such as Twitter or IoT feeds, not to run end of day batches from old mainframes. Does that mean Kafka should not be used to process batched data? As with everything in software engineering: it depends.

What Kafka isn’t

First, I think it's wise to point out some misconceptions about Kafka. Kafka isn't just a message queue; it's much more than that. It's a continuous stream of change events, a change log. At the heart of every traditional relational database you also have this change log: the transaction log. A transaction log allows you to replay events to arrive at a certain state of the data. Kafka is therefore essentially a sharded, distributed database. Everything that you can do in a database can also be done in Kafka. Kafka has one advantage over traditional databases, though: it has the time dimension built in.

Real time vs batched data

Is batched data any different from real time data? No. A real time data feed is unbounded in time, while batched data is a snapshot of that real time data. Batched data is therefore a subset of real time data. You can go from unbounded real time data to bounded batched data, but you can't go from batched to real time data.

Real time vs batch streaming

Stateless vs Stateful processing

Real time streams are usually discrete, well defined events that can be processed individually. To process a batch as a single discrete event you would need to send out one giant message with all events attached. While technically possible, this defeats the point of streaming, and technologies like FTP or C:D would be more suitable. To stream a batch you need to somehow break the granularity down into smaller events, which can then be grouped together and processed as one. Such operations require state to be kept, as sketched below.
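As a rough illustration, here is a minimal Kafka Streams sketch of such stateful grouping. The topic names and the assumption that each record is keyed by a batch id are mine, purely for illustration:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class BatchGroupingSketch {

    // Hypothetical topology: individual records of a batch arrive keyed by a
    // batch id; keeping a running count per batch is the simplest form of the
    // state needed to later treat the batch as one unit.
    public static void buildTopology(StreamsBuilder builder) {
        KStream<String, String> records = builder.stream("batch-records");

        KTable<String, Long> recordsPerBatch = records
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .count(Materialized.as("records-per-batch")); // fault tolerant state store

        recordsPerBatch.toStream()
                .to("batch-progress", Produced.with(Serdes.String(), Serdes.Long()));
    }
}
```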

Event vs Wall clock timers

Most real time streams provide a continuous stream of events. When doing stateful operations it is therefore possible to rely solely on new events to trigger the completion of an operation. Batches come in bursts, so wall clock time triggers might have to be used to push data downstream. Most of the Kafka Streams DSL is designed around event timers, so some work extending the DSL, using custom processors and transformers, has to be done.
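Below is a sketch of what such an extension could look like: a Transformer that buffers records in a state store and uses a wall clock punctuator to flush them downstream. The store name, the flush interval and the pass-through string types are assumptions for the sake of the example:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class WallClockFlushTransformer implements Transformer<String, String, KeyValue<String, String>> {

    private KeyValueStore<String, String> buffer;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        buffer = (KeyValueStore<String, String>) context.getStateStore("batch-buffer");

        // Wall clock punctuation fires every 30 seconds whether or not new
        // events arrive, which is exactly what a bursty batch feed needs.
        context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            List<String> flushed = new ArrayList<>();
            try (KeyValueIterator<String, String> it = buffer.all()) {
                while (it.hasNext()) {
                    KeyValue<String, String> entry = it.next();
                    context.forward(entry.key, entry.value);
                    flushed.add(entry.key);
                }
            }
            flushed.forEach(buffer::delete);
        });
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        buffer.put(key, value); // buffer instead of forwarding immediately
        return null;            // nothing is emitted on the event path
    }

    @Override
    public void close() {
    }
}
```

The store ("batch-buffer" here) would be registered with StreamsBuilder#addStateStore and wired in via KStream#transform(WallClockFlushTransformer::new, "batch-buffer").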

Data Integrity

Since batched data has a clear beginning and an end, it is possible to ensure data integrity without having to rely on real time methods such as watermarking. This can be done by using:

  • end of batch events
  • control messages
  • cyclic redundancy checks
  • reconciliation

Of course, all these methods rely on the upstream feeding system providing some sort of control data to assert integrity against. If the upstream system does not provide this information, then Kafka can't differentiate a batched data feed from a real time data feed.

Ex: an upstream database sends a batch as individual events

There are ways to deal with such scenarios, but they are inherently the same ways you would deal with real time streaming data:

  • Windowing
  • Watermarking
  • Business Rules

For systems that need to guarantee data integrity, such as financial systems, batch data integrity checks might be a necessity.
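To make the control message idea from the list above a bit more concrete, here is a minimal sketch. It assumes a hypothetical convention where the upstream system appends a record whose value starts with "EOB:" and carries the expected record count; the topic names are also made up:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class EndOfBatchSketch {

    public static void buildTopology(StreamsBuilder builder) {
        // Relies on default String serdes being configured.
        KStream<String, String> feed = builder.stream("batch-feed");

        @SuppressWarnings("unchecked")
        KStream<String, String>[] branches = feed.branch(
                (key, value) -> value != null && value.startsWith("EOB:"), // control messages
                (key, value) -> true);                                     // ordinary data

        // The control branch can be used to reconcile the declared record
        // count against what was actually consumed; the data branch flows on.
        branches[0].to("batch-control");
        branches[1].to("batch-data");
    }
}
```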

It is also possible to work around the data integrity issue by modeling the data in such a way as to break up the batch into smaller discrete events in the upstream system. By denormalizing data into concrete entities, as you would do in a DWH, you no longer need to provide any guarantees. But as with any denormalization you lose flexibility and might need to provide different views on the same data in endless variations, or worse yet, you end up with very large events that try to contain all the required data for every scenario. Depending on the event this might not even be possible in the upstream system.

Abstractions

The Kafka Streams DSL provides abstractions for handling both unbounded and bounded data streams: unbounded streams through the KStream abstraction and bounded ones through the KTable abstraction. KTables are local state store caches backed by Kafka topics to make them fault tolerant. The default persistent state store is RocksDB, but practically any database could be used. While unbounded operations are very well covered in Kafka, bounded operations have only recently started getting more focus. Recent additions such as foreign-key joins and co-grouping are moving bounded data stream operations closer to what traditional databases offer. Two bounded operations are still missing, though: sorting and indexing. Hopefully these will arrive in a future release, as their absence is a major pain point currently.
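As a small illustration of the foreign-key join mentioned above (available since Kafka 2.4), here is a sketch. The topic names and the assumption that the customer id is the first comma separated field of the order value are purely illustrative:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

public class ForeignKeyJoinSketch {

    public static void buildTopology(StreamsBuilder builder) {
        KTable<String, String> orders = builder.table("orders");        // key = orderId
        KTable<String, String> customers = builder.table("customers");  // key = customerId

        // Join each order to its customer via a foreign key extracted from the
        // order value, much like a relational join on a non-key column.
        KTable<String, String> enriched = orders.join(
                customers,
                orderValue -> orderValue.split(",")[0],                        // extract customerId
                (orderValue, customerValue) -> orderValue + "|" + customerValue);

        enriched.toStream().to("orders-enriched"); // relies on default String serdes
    }
}
```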

Advantages vs Disadvantages

So it’s clear that Kafka can be used to process batched data. But should it be?

Disadvantages

Complexity

Kafka has one major disadvantage when compared to traditional batch processing systems such as ETL tools or RDBMS: it is very low level and requires the developer to implement a lot of functionality that is already provided by traditional tools. Before undertaking this endeavor it is wise to make sure that the people implementing such pipelines are able and willing to write low level boilerplate code. Perhaps in the future more tools will show up that mimic how the traditional ETL tools work, but as of now nothing like that has arrived (as far as I know).

Cost

Setting up and running a Kafka cluster brings back memories of installing and running an Oracle 12 RAC cluster, and not in a good way. Kafka has a LOT of configuration settings, and the admin is expected to know how to tweak them to make everything run smoothly. RDBMS have evolved a lot in the last 30 years, and even a junior can set up failover on an MSSQL DB in less than an hour. If you want to manage Kafka yourself you'll need to invest in training your admins or go for a hosted solution, of which there aren't a lot of options currently. I've also been told that storing data in Kafka is a lot more expensive than in a traditional database in the cloud. This doesn't really make a lot of sense to me, since a disk is a disk, but perhaps some key facts were left out that I am not aware of. Either way, the cost of storage can be a factor and should be investigated.

Advantages

Scalability

Since partitioning is already built in, it is much simpler to split batches into smaller batches. With traditional ETL tools this usually had to be implemented by the developer and resulted in memory hungry, monolithic pipelines that would route data between distributed shards. Kafka does this by design.

Flexibility

Everything can be customized to how the developer wants it to be. For example, instead of having to process an entire batch to create an aggregate of a subset of records, you can split the aggregation across several time windows, reducing latency and resource cost. You can also emit error messages when a join is not fulfilled instead of just dropping events. Kafka does not enforce any particular way of using it.
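A minimal sketch of that windowed aggregation idea, assuming string-keyed records, made-up topic names and a five minute window size picked arbitrarily:

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class WindowedAggregateSketch {

    // Instead of waiting for the whole batch, aggregate over 5-minute windows
    // so partial results become available while the batch is still arriving.
    public static void buildTopology(StreamsBuilder builder) {
        KStream<String, String> records = builder.stream("batch-records");

        records
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
                .count()
                .toStream((windowedKey, count) -> windowedKey.key()) // drop the window for the output key
                .to("partial-counts", Produced.with(Serdes.String(), Serdes.Long()));
    }
}
```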

Fault Tolerance

The Kafka Streams DSL provides fault tolerant persistent state stores coupled with exactly-once semantics (EOS). This means that if the pipeline goes down for whatever reason, processing resumes from where it left off. Coupled with k8s, this results in a self healing, fault tolerant ETL pipeline. Traditional ETL tools usually require a restart of the entire process (unless recovery logic is built in by the developer), which can result in hours of lost processing time.
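Enabling EOS is essentially a configuration switch. A minimal sketch, with a placeholder application id and bootstrap servers:

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceConfigSketch {

    // The key line is the processing guarantee, which turns on exactly-once
    // semantics; the other values are placeholders.
    public static Properties streamsConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "batch-etl-pipeline");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        return props;
    }
}
```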

Testing

The Kafka Streams DSL provides wonderful tools to perform unit and integration testing on pipelines. I can't rave enough about how good the TopologyTestDriver and MockProcessorContext are for testing topologies and custom processors. I have yet to meet an ETL tool or an RDBMS that takes testing so seriously. In terms of ease of use in producing quality, tested code, Kafka has the traditional tools beat.
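As a taste of what that looks like, here is a minimal TopologyTestDriver sketch against a trivial pass-through topology; everything runs in memory, no broker required. The topic names and the pass-through topology are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.TestInputTopic;
import org.apache.kafka.streams.TestOutputTopic;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.TopologyTestDriver;

public class TopologyTestSketch {

    public static void main(String[] args) {
        // A trivial pass-through topology stands in for a real pipeline.
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input").to("output");
        Topology topology = builder.build();

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topology-test");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234"); // never contacted
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        try (TopologyTestDriver driver = new TopologyTestDriver(topology, props)) {
            TestInputTopic<String, String> in =
                    driver.createInputTopic("input", new StringSerializer(), new StringSerializer());
            TestOutputTopic<String, String> out =
                    driver.createOutputTopic("output", new StringDeserializer(), new StringDeserializer());

            in.pipeInput("batch-1", "some-record");
            System.out.println(out.readKeyValue()); // expect KeyValue(batch-1, some-record)
        }
    }
}
```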

Monitoring

The Kafka Streams DSL provides hundreds of built-in metrics that can be hooked into dashboards, making tuning a lot easier.
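As a small illustration, the same metrics that feed dashboards via JMX can also be read programmatically from a running KafkaStreams instance:

```java
import org.apache.kafka.streams.KafkaStreams;

public class MetricsSketch {

    // Prints every built-in metric of a running KafkaStreams instance;
    // the same values are also exposed over JMX for dashboards.
    public static void dumpMetrics(KafkaStreams streams) {
        streams.metrics().forEach((metricName, metric) ->
                System.out.println(metricName.group() + "/" + metricName.name()
                        + " = " + metric.metricValue()));
    }
}
```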

Verdict?

Clearly, for small batch loads, using traditional ETL tools is less complicated and much simpler to implement. But if the ETL pipeline needs to handle large amounts of data and scale, Kafka wins hands down.

For information regarding the streaming concepts mentioned in this entry, please consult this excellent Google Research paper:

Fault-Tolerant Stream Processing at Internet Scale
