Avro Schema Evolution Strategies on Kafka

Tom Kaszuba
7 min readDec 25, 2020

--

If you have spent any significant time with Avro (or Protbuf) and are using the Confluent Schema Registry you probably have encountered a breaking schema change characterized by the following mysterious exception.

{
“error_code”: 409,
“message”: “Schema being registered is incompatible with an earlier schema”
}

What happens is the schema registry validates the schema sent by the producer against the “version” that is stored in the schema registry for whatever schema evolution “strategy” that you have set, be it either Forward, Backward or None. The reason that I have version and strategy in quotes is because there is no actual real versioning done by the schema registry. All that the schema registry does is to do a validation check against a previously registered schema to see if you haven’t made a breaking schema change. This is all the schema registry does. If you think this seems like a heavy handed approach to schema validation then stayed tuned, it gets better.

So you have a breaking schema change? What should you do? Oddly enough there is very little information on the net (that I found) on how to handle this properly. Even Confluent seems oddly silent when I asked them the correct way of tackling this. Most of the answers I have found are along the lines of avoid making breaking schema changes. Now, if you told a database designer to design his schema in such a way as to never need to change it, they’d show you the door. So why is schema evolution not talked about in streaming systems? Probably, because just like in the DB world, it’s hard, maybe even harder? So here are my strategies on how to handle this problem in increasing complexity; with advantages and disadvantages to each approach. Some of them have been tested but some are theories and might not be viable or even correct so don’t hold it against me. :)

Migration

Before getting to the strategies a note on how to migrate between schemas. In order to do this you will need to maintain multiple Avro schema versions either in the same schema using unions or in the same or different libraries using namespaces. Version info will either need to be appended to the name or to the package info or embedded as separate versioned objects. Some way for both schemas to live in the same context.

Ex:

  • my.company.schema1.1
  • my.company.v1_1.schema

If you plan on doing any sort of migration this needs to be thought about from the beginning before rolling the system out to third parties.

Strategies

1. No Versioning, No Migration, No Problem

If data loss is of no concern and versioning information does not need to be maintained then this is by far the simplest approach. Every time a breaking change is encountered a script is run that deletes the old schema in the schema registry. If you have new data coming in with the old schema then it will fail to deserialize and will be moved to the DLQ.

If you are also storing data as Avro in local state stores, they will also need need to be cleaned up as well since a restart of the service will replay the data which will cause an incompatibility exception to be thrown. This for me is a serious flaw in the design of the Confluent Serdes. There is no point in having an internally maintained domain model be the subject of a schema validation check used to validate interface endpoints. I wish Confluent would allow you to override this behavior.

Advantages

  • Easy to automate
  • Only the latest version of the schema is maintained
  • Simplicity

Disadvantages

  • Potential Data loss
  • Upstream and down stream interfaces need to change at the same time as the breaking schema change occurs

2. Multiple Versions in One Schema, No Migration

By using unions in Avro it is possible to extend the Avro Schema by attaching new field versions. Think of it like an “or” statement. In this way you can work around the compatibility checks of the schema registry, the downstream systems can then decide on which action to take given which version of the data is present. This concept is very similar to the approach taken by Protobuf where you are not allowed to delete fields and just add new ones. It is described here: https://softwaremill.com/schema-registry-and-topic-with-multiple-message-types/

Advantages

  • No data loss
  • Gives times for upstream and downstream systems to adapt to the change

Disadvantages

  • Breaks the one schema, one version contract in one topic
  • The Schema can grow to be very large

3. Topic Naming Versioning, No Migration

Versioning is maintained by adding the version information to the topic name and this topic is wired to the new version of the service.

Ex: my_topic_0_1_20201002

This has the advantage of maintaining the old data in the old topic giving time for this data to expire or be archived, it also gives other systems time to adapt to the changes. If you don’t have a lot of topics then this might be a viable strategy.

Advantages

  • No data loss
  • Gives times for upstream and downstream systems to adapt to the change
  • No schema mixing in the same topic

Disadvantages

  • Topic governance most likely has to be done centrally through some sort of registry
  • Possible explosion of topics
  • Could be difficult to support

4. Topic Naming Versioning, Topology Migration

This builds on the previous strategy but this time with the addition of migration of data from the previous schema/topic. Migration is done by adding the old schema/topic to the topology as the input along with the new schema/topic. The old data is then migrated to the new schema in code and then merged to the normal input stream. Such a strategy would be beneficial if an upstream system can’t be easily migrated so multiple ingest channels need to be maintained.

public KStream<SomeKey, SomeValue> process(KStream<SomeKey_v1, SomeValue_v1> oldStream, KStream<SomeKey_v2, SomeValue_v2> newStream)
{
var migratedOldStream = oldStream.mapValues(…)
return newStream.merge(migratedOldStream)
}

Advantages

  • As above
  • Helps to migrate upstream data for migrated downstream services

Disadvantages

  • As above
  • Doesn’t help in migrating data in local state stores
  • More complicated service configuration
  • For topologies that have multiple inputs it will lead to an explosion of input topics
  • Migration code hard coded in the topology, cleanup of old migrations will require a new release of the micro service
  • Ordering not maintained

5. Topic Naming Versioning, Dedicated Migration Service

Similar to the previous strategy but this time there is a dedicated service deployed that migrates data between the topics/schemas. Such an approach helps in not polluting business logic with migration logic. Once a migration is finished only this micro service needs to be updated and the business micro service does not need to be redeployed.

Advantages

  • No data loss
  • Gives times for upstream and downstream systems to adapt to the change
  • No schema mixing in the same topic
  • No pollution of business logic with migration logic, therefore two different teams can work independently
  • Business service untouched and deals only with the latest schema
  • Migration service can easily be brought down or changed

Disadvantages

  • Topic governance most likely has to be done centrally through some sort of registry
  • Explosion of topics
  • Doesn’t help in migrating data in state stores
  • Migration service configuration complicated
  • Ordering not maintained

6. Schema Naming Versioning, Topology migration

Similar to approach 2, this approach mixes different schemas in the same topic. The Kafka application has to then deal with different schemas itself and not assume that a certain type of schema will always arrive.

Mixing schemas in the same topic is very well described in this blog entry: https://www.confluent.de/blog/put-several-event-types-kafka-topic/

The downsides of schema mixing in the same topic is described here: https://groups.google.com/forum/#!topic/confluent-platform/XQTjNJd-TrU

The process method must either use a generic record or a specific record base abstraction to allow various Avro message types. In the process method it either calls instanceOf or looks for a specific field in the schema to determine what message arrived. Migration is done after identifying the specific schemas.

public KStream<SomeKey, SomeValue> process(KStream<SpecificRecordBase, SpecificRecordBase> stream)
{
var schemaSplit = stream.branch((k,v) -> v instanceOf Schemav2, (k,v) -> v.getSchema().getField(“someField”) != null)

var migratedOldSchema = schemaSplit[1].mapValues(…)
var schemaSplit[0].merge(migratedOldSchema)
}

Advantages

  • No data loss
  • Gives times for upstream and downstream systems to adapt to the change
  • No need for topic governance
  • No explosion of topics
  • Service configuration doesn’t change
  • Ordering maintained

Disadvantages

  • Schema mixing in the same topic
  • Doesn’t help in migrating data in state stores
  • Migration code hard coded in the topology, cleanup of old migrations will require a new release of the service
  • Migration code can’t easily be outsourced to a separate service since there is only one input topic

7. Schema Naming Versioning, Serde migration

Similar to the previous approach except migration is done on the level of message serialization and deserialization (Serde). This approach separates business logic from migration logic. The Serde contains information about the old schemas and performs the migration to the latest schema on deserialization or serialization. It has the advantage of allowing migration of backup topic data in state stores between schemas. Since the Serde also has access to the the schema registry it is possible to not hard code schema version information and work on the level of the Generic Record and do migrations based on which fields exist in the schema. This makes the migration implementation more flexible but more difficult to test and error prone. The Confluent Serdes are not extensible due to private members, therefore the Serdes need to be created from scratch.

Advantages

  • No data loss
  • Gives times for upstream and downstream systems to adapt to the change
  • No need for topic governance
  • No explosion of topics
  • Ordering maintained
  • State Stores can be migrated
  • Migration code isolated from business logic
  • Migration logic can be turned on and off on the level of configuration
  • More flexible generic migration

Disadvantages

  • Schema mixing in the same topic
  • Micro service configuration a bit more complicated due to using multiple Serdes instead of one SpecificAvroSerde
  • More difficult code than in the topology migration
  • Migration code included in service so removal requires a new version to be rolled out
  • Migration code can’t easily be outsourced to a separate service since there is only one input topic

Schema Registry

So with all the hassle of versioning and migration that needs to be developed what exactly is the point of the schema registry? It doesn’t really do much, it increases the architecture complexity and it sometimes fails with connection timeouts when creating multiple topics due to it’s one writer architecture.

I think the only place that the schema registry makes sense is when controlling 3rd party connections or in simple Kafka architectures. When you start rewriting Serdes or start doing migrations in your services you quickly realize that you need to detect breaking schema changes anyway, so why have a centralized schema registry?

--

--

Tom Kaszuba

Java, Scala and .Net Consultant with over 20 years experience in the Financial Industry, specializing in Big Data and Integration... and especially Kafka.