How to sink a Kafka roll out

Nov 7, 2022

Has management started to flirt with the idea of doing event sourcing and rolling out a Kafka cluster? Have you read up on the technology and decided you want nothing to do with it? What is the best way to nip this rollout in the bud? Is arguing whether Kafka is a database or not the answer? Maybe arguing that long-term storage of data in Kafka is a ticket to hell (it’s not)? Perhaps arguing that there are no snapshots for replays in Kafka Streams (there are)? Maybe arguing that you can do event streaming with Postgres? Sure, all of these could be good enough reasons, but they can be argued either way if you encounter a heavyweight opponent. So why risk it? There is an easier way to sink a Kafka rollout. It’s called GOVERNANCE. Or should I say, a complete lack of Governance.

So what exactly do I mean by Governance? I think the best way to explain it is to look at the use cases that pop up once you have a reasonably sized project rolled out into an enterprise production environment.

What is the best way to roll out changes to Kafka in production?

How do you reset a consumer group or an application id? How do you create a new topic? Perhaps you would like to change the retention on an existing topic, what is the best way of achieving this?

If you have spent some time with Kafka, your first answer will be the command line tools that come with Kafka. Well, yes, but no. Since we are working in an enterprise environment, security has (hopefully) put up firewalls that prevent you from running these command line tools from your local machine. So what do you do now?

You can describe what to do in the release notes, but for many commands this will require a lot of documentation that can be easily misinterpreted by the implementers.
You can try to script it using bash commands and include that with the release.
How about your own Kafka installer? Why not! You’ve got deep pockets, right?
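One middle ground between release notes and an installer is to describe the desired topics as data and let a small tool work out what has to change. Here is a minimal sketch of that idea. `TopicSpec` and `plan_changes` are names I made up for illustration, not part of any Kafka client, and the code deliberately never talks to a broker: it only produces a plan that someone (or something) behind the firewall can review and apply.

```python
from dataclasses import dataclass, field


@dataclass
class TopicSpec:
    """Desired or observed state of one topic (hypothetical shape, not a Kafka type)."""
    name: str
    partitions: int
    config: dict = field(default_factory=dict)


def plan_changes(desired, actual):
    """Compare two {topic name: TopicSpec} maps and return the actions needed.

    The result is a plain list of tuples so it can be printed, reviewed and
    attached to a release before anything is executed.
    """
    actions = []
    for name, want in sorted(desired.items()):
        have = actual.get(name)
        if have is None:
            actions.append(("create", name, want.partitions, dict(want.config)))
            continue
        if want.partitions > have.partitions:
            actions.append(("add_partitions", name, want.partitions))
        elif want.partitions < have.partitions:
            # Kafka does not support shrinking a topic, so flag it instead.
            actions.append(("error", name, "partition count cannot be reduced"))
        changed = {k: v for k, v in want.config.items() if have.config.get(k) != v}
        if changed:
            actions.append(("alter_config", name, changed))
    return actions


if __name__ == "__main__":
    desired = {
        "orders": TopicSpec("orders", 6, {"retention.ms": "604800000"}),
        "payments": TopicSpec("payments", 3),
    }
    actual = {"orders": TopicSpec("orders", 3, {"retention.ms": "86400000"})}
    for action in plan_changes(desired, actual):
        print(action)
```

How the plan gets executed (admin client, CLI wrapper, pipeline job) is left open on purpose, since that is exactly the part your security team will have opinions about.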

What is the best way to bring up a new environment?

Business has a new client and they need a new testing environment fast. What is the best way to bring up this new environment? Do you:

Go through every release note and recreate the environment from scratch? Most likely not since this will take weeks.
Maintain central scripts that can create an environment up to a certain version? Could work, but every release will require an update to the scripts, so it’ll take some effort.
Copy an existing environment? Might work if you are environment agnostic but most likely there is that one mischievous environment variable out there that foils your plans. Or even better, what if you prefix the topic names with the environment? Come on now, it’s ok to admit it, I know you do, we all do.
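If you do prefix topic names with the environment, it helps to keep the prefix in exactly one place so an environment can be copied by renaming. A small sketch, assuming a `<env>.<logical name>` convention (the convention, the function names and the environment names are all my assumptions):

```python
def topic_name(env, logical_name, separator="."):
    """Build the physical topic name from an environment and a logical name."""
    if not env or separator in env:
        raise ValueError(f"invalid environment name: {env!r}")
    return f"{env}{separator}{logical_name}"


def logical_name(env, physical_name, separator="."):
    """Strip the environment prefix again, failing loudly on foreign topics."""
    prefix = f"{env}{separator}"
    if not physical_name.startswith(prefix):
        raise ValueError(f"{physical_name!r} does not belong to environment {env!r}")
    return physical_name[len(prefix):]


def rename_for(source_env, target_env, physical_names):
    """Translate a list of topic names from one environment to another."""
    return [
        topic_name(target_env, logical_name(source_env, name))
        for name in physical_names
    ]


if __name__ == "__main__":
    print(rename_for("uat", "client-a", ["uat.orders.v1", "uat.payments.v1"]))
```

The same trick does nothing for that one mischievous environment variable, of course.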

How do you copy data across environments?

This one is easy, right? MirrorMaker. Except when your data is somehow tied to your environment. For example, you have different schema registries and hence different schema ids in each environment. Confluent of course provides tools that can deal with this, but if you don’t have a Confluent license, well, I guess you need to grow your own. Since I’m not 100% sure about the need for that license, for the sake of argument let’s say there is something specific added to your message header that identifies the environment for auditing. How do you handle this now? Does Kafka provide you a means to copy environment-specific data across? Sure, you can grow your own with Kafka Connect or Apache NiFi or whatever your heart desires.
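To make the header example concrete, here is roughly what the "grow your own" part could look like for one record. It assumes headers arrive as a list of `(key, bytes)` pairs, and the header names `x-environment` and `x-copied-from` are invented for this sketch. Wiring it into an actual copy job is not shown.

```python
def retarget_headers(headers, target_env,
                     env_header="x-environment",
                     origin_header="x-copied-from"):
    """Return a new header list with the environment header pointing at target_env.

    The original environment is kept under `origin_header` so the audit trail
    still shows where the record came from. Records without an environment
    header pass through unchanged.
    """
    rewritten = []
    for key, value in headers:
        if key == env_header:
            rewritten.append((origin_header, value))
            rewritten.append((env_header, target_env.encode("utf-8")))
        else:
            rewritten.append((key, value))
    return rewritten


if __name__ == "__main__":
    source = [("trace-id", b"abc"), ("x-environment", b"prod")]
    print(retarget_headers(source, "uat"))
```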

How do you do impact analysis?

How do you track the changes in your environment? If you make a change to a topic, who needs to be informed? During a release, which consumers will be affected? Does Kafka provide any sort of data lineage facilities out of the box? Kind of: it provides a topology graph at the level of Streams, but not for a plain vanilla consumer/producer. Of course, to make this information searchable you will need to dump it into some sort of graph DB and maintain it, or rely on cloud providers, like Azure, who offer to infer this for you by looking at the published metrics. How deep are those pockets again?
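Before reaching for a graph DB, the core of impact analysis is just a graph walk. The sketch below assumes you already have, from somewhere, a map of which application reads and writes which topic. Collecting that map is the hard part and is not solved here, and the application and topic names are made up.

```python
from collections import deque


def affected_by(topic, reads, writes):
    """Find everything downstream of `topic`.

    reads:  {application: set of topics it consumes}
    writes: {application: set of topics it produces}
    Returns (affected applications, affected downstream topics).
    """
    affected_apps = set()
    affected_topics = {topic}
    queue = deque([topic])
    while queue:
        current = queue.popleft()
        for app, inputs in reads.items():
            if current in inputs and app not in affected_apps:
                affected_apps.add(app)
                # Whatever this application produces may change as well.
                for output in writes.get(app, ()):
                    if output not in affected_topics:
                        affected_topics.add(output)
                        queue.append(output)
    affected_topics.discard(topic)
    return affected_apps, affected_topics


if __name__ == "__main__":
    reads = {"billing": {"orders"}, "reporting": {"invoices"}, "search": {"products"}}
    writes = {"checkout": {"orders"}, "billing": {"invoices"}}
    print(affected_by("orders", reads, writes))
```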

How do you manage topics?

The rollout is a great success. It is in fact so successful that everyone wants to use Kafka. Suddenly you have over 10k topics. How do you keep track of who owns a topic? Is the data in the topic a golden source that can’t be lost? Perhaps the topic is a legacy topic and needs to be kept around only for audit purposes? Does that topic contain GDPR data? Can you enforce a naming convention? Can you delete that dev topic from that one developer who does all their development directly in the dev environment? (Shout-out to Bobby!) What if you need to maintain multiple versions of the same topic? Can you link ACLs to a business unit like you can in Active Directory? It’s a good thing that Kafka provides you the means for maintaining topics out of the box. Oh wait, it doesn’t; you need to grow your own.
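A home-grown topic registry can start embarrassingly small: a naming rule plus a few mandatory metadata fields, checked before a topic is ever created. The pattern and the field names below are assumptions for the sake of the example, not a standard.

```python
import re

# Hypothetical convention: <env>.<domain>.<entity>.v<version>
NAME_PATTERN = re.compile(r"^(dev|uat|prod)\.[a-z0-9-]+\.[a-z0-9-]+\.v[0-9]+$")

# Hypothetical metadata every topic must carry.
REQUIRED_FIELDS = ("owner", "classification", "contains_personal_data")


def validate_topic(name, metadata):
    """Return a list of problems; an empty list means the topic is acceptable."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append("name does not match <env>.<domain>.<entity>.v<version>")
    for field_name in REQUIRED_FIELDS:
        if field_name not in metadata:
            problems.append(f"missing metadata field: {field_name}")
    return problems


if __name__ == "__main__":
    good = {"owner": "team-sales", "classification": "golden-source",
            "contains_personal_data": True}
    print(validate_topic("prod.sales.orders.v2", good))
    print(validate_topic("bobby_test", {"owner": "bobby"}))
```

Where the metadata lives (a repo, a database, a compacted topic) is a separate decision; the point is only that somebody has to own it.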

How can you check the state of the environment?

Did you know that creating and deleting topics is an asynchronous operation? Those release bash scripts work great, except a developer decided to clean a topic by first deleting it and then recreating it. (I understand this is the worst way of doing this, please don’t spam me in the comments ;)) The script ran without errors, except the topic was never recreated because the delete happened after the create script ran. Or better yet, how about checking a schema for compatibility before you roll out a new version of a service instead of just crashing in production? Maybe there is a way to do some sort of a sanity check on the environment to see that it is in an expected state? Of course! But you need to grow your own.
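For the delete-then-recreate race specifically, the fix is to wait until the topic is really gone before creating it again. A minimal sketch of that wait loop follows. `list_topics` is a hypothetical callable you would supply yourself (for example, a thin wrapper around whatever admin client you use) that returns the current set of topic names.

```python
import time


def wait_until_absent(topic, list_topics, timeout_s=30.0, poll_s=0.5,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until `topic` no longer shows up, or give up after `timeout_s`.

    Returns True once the topic is gone and False on timeout. `sleep` and
    `clock` are injectable so the loop can be tested without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if topic not in list_topics():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


if __name__ == "__main__":
    # Simulated broker view: the topic disappears on the third poll.
    snapshots = iter([{"orders"}, {"orders"}, set()])
    print(wait_until_absent("orders", lambda: next(snapshots), poll_s=0.01))
```

A release script would call this between the delete and the create, and fail the release if it returns False rather than carrying on.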

How do you migrate schemas?

Who needs to migrate schemas when every attribute is optional, right Google? But perhaps you’re an old-school DB guy who likes tight schemas that don’t require all your downstream consumers to implement the same data quality checks over and over again. How then do you migrate the schema and the data in production? DBs have Flyway scripts; since Kafka is a “DB”, it must have some sort of Flyway scripts as well, right? Sure, if you grow your own.
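The essence of a Flyway-style tool is small: numbered migrations, a record of which ones have already run, and a runner that applies the rest in order. The sketch below shows only that skeleton. `Migration` and `migrate` are hypothetical names, the migration bodies are stand-ins, and where the list of applied versions is stored is left as an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Migration:
    version: int
    description: str
    apply: Callable[[], None]  # whatever changes the environment


def migrate(migrations, applied_versions, target_version=None):
    """Apply all pending migrations in version order.

    `applied_versions` is what the environment has already seen. The updated
    list is returned so the caller can persist it again.
    """
    applied = list(applied_versions)
    for migration in sorted(migrations, key=lambda m: m.version):
        if migration.version in applied:
            continue
        if target_version is not None and migration.version > target_version:
            break
        migration.apply()
        applied.append(migration.version)
    return applied


if __name__ == "__main__":
    steps = [
        Migration(1, "create orders topic", lambda: print("V1 applied")),
        Migration(2, "register orders schema v2", lambda: print("V2 applied")),
        Migration(3, "raise retention on orders", lambda: print("V3 applied")),
    ]
    print(migrate(steps, applied_versions=[1], target_version=2))
```

Bringing up a fresh environment is then the same call with an empty `applied_versions`, which is most of the appeal.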

Hey it’s Saul Goodman! Protobuf is here!

You see a common trend here, right? Of course, you could object that most of these issues also exist in the traditional DB world, and you would be correct. The big difference between Kafka and the traditional DB world is that nobody talks about Governance in Kafka. When you first encounter a Kafka Governance issue you ask yourself: hmm, this seems like a pretty common issue, somebody must have already solved it or at least mentioned it, right? Well, no. It seems that most rollouts either hack their way around the shortcomings or have already grown their own tools. Which is exactly the case with my current client and their own Flyway-like mechanism for maintaining Kafka environments. Hopefully in the future I can write about it, as it’s pretty simple but very powerful.

So what can you do about Governance in Kafka if you don’t want to invest the time and money in growing your own tools? Well, as it turns out, Confluent has recently released Governance tools that deal with some of the issues I’ve mentioned above, but as far as I know they are only available in Confluent Cloud or with a special Confluent Center build. Did nobody tell you that you would need to migrate to Confluent Cloud after 6 months in production? Silly you. Maybe there’s still time to sink that rollout!

ps: I love Kafka and everything about Kafka. If you do take this beautiful journey do it with a clear head so that there are no hard feelings down the line. I’d like to not have to argue with you about whose definition of what a database is, is right.
pps: Ooh, a new contender has entered the ring. I know nothing about the license costs, but a nice UI to manage ACLs and topic metadata looks very nice.