Ordering, Transactions and Exactly-Once Semantics in Data Integrations

This post is the ninth part in a series about building and adopting a modern streaming data integration platform. In this post I will discuss three important aspects of robust data integrations when using the logging metaphor: global ordering of messages, multi-message transactionality, and exactly-once semantics. The focus is on Apache Kafka, a common platform for implementing streaming data integrations.


Apache Kafka is a commonly used platform for building streaming data integrations. On the surface, Kafka seems very suitable for this purpose. Confluent, the company driving Kafka's development, was founded on the same vision I am discussing in this series of blog posts.

Kafka achieves high performance and scalability through parallelism. In Kafka, each topic is divided into partitions, and message order is guaranteed only within a single partition. Each message has a key, which determines the partition it is written to. The downside of this approach is that Kafka cannot guarantee global ordering of events across multiple topics, or across multiple partitions of a topic. If you do not have global ordering of events, then you do not have a change log in the strict sense. In other words, you are not fully invested in the logging metaphor.
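
To make the partitioning behaviour concrete, below is a minimal producer sketch using the Java client. The broker address, topic name, and keys are placeholder assumptions; the point is that only messages sharing a key are guaranteed to keep their relative order.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Messages with the same key hash to the same partition, so their
            // relative order is preserved for consumers of that partition.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));
            producer.send(new ProducerRecord<>("orders", "customer-42", "order paid"));

            // A different key may land in a different partition; Kafka gives no
            // ordering guarantee between this message and the two above.
            producer.send(new ProducerRecord<>("orders", "customer-7", "order created"));
        }
    }
}
```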

Kafka has nominal support for transactions. In practice this means that Kafka allows a producer to send a batch of messages to multiple partitions so that either all messages in the batch eventually become visible to a consumer reading with the read_committed isolation level, or none of them ever do.
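
A sketch of what this looks like with the Java producer API is below. The broker address, topic names, key, and transactional.id are placeholders, and error handling is simplified; for example, a fenced producer should be closed rather than aborted.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionalProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // A stable transactional.id is required; this value is just an example.
        props.put("transactional.id", "order-integration-producer-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // These records may go to different topics and partitions, yet a
                // read_committed consumer sees either all of them or none of them.
                producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));
                producer.send(new ProducerRecord<>("order-lines", "customer-42", "order line 1"));
                producer.commitTransaction();
            } catch (Exception e) {
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```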

However, sink systems do not know where transactions start and end, so they can read “transactionally” only individual messages. Without support for multi-message transactions on the consuming side, it becomes difficult to maintain referential integrity when related entities arrive in separate messages.
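
On the consuming side, the only transaction-related setting a plain consumer has is the isolation level. A minimal configuration sketch (broker address and group id are assumptions) illustrates the limitation:

```java
import java.util.Properties;

public class SinkConsumerConfigSketch {
    public static Properties readCommittedConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "sink-loader");               // assumed consumer group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // read_committed hides records from aborted transactions, but the consumer
        // still receives plain individual records: there is no marker telling a
        // sink where a producer's transaction started or ended.
        props.put("isolation.level", "read_committed");
        return props;
    }
}
```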

Kafka also has optional support for exactly-once semantics. In practice this means that Kafka can guarantee that a successful consume-transform-produce iteration is executed exactly once. However, as the transform step may call external systems or cause other side effects, the overall result may still be non-deterministic.
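
Below is a rough sketch of such a consume-transform-produce loop with the Java clients, assuming the consumer uses read_committed isolation with auto-commit disabled and the producer has a transactional.id, as in the earlier snippets. The topic names and the trivial transform are made up; the essential part is committing the consumed offsets inside the producer transaction. Error handling is omitted.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class ExactlyOnceLoopSketch {
    public static void run(KafkaConsumer<String, String> consumer,
                           KafkaProducer<String, String> producer) {
        consumer.subscribe(Collections.singletonList("orders"));  // assumed input topic
        producer.initTransactions();

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            if (records.isEmpty()) {
                continue;
            }
            producer.beginTransaction();
            Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
            for (ConsumerRecord<String, String> record : records) {
                // The transform step: if it called external systems or had other
                // side effects, those effects would not be covered by the transaction.
                String transformed = record.value().toUpperCase();
                producer.send(new ProducerRecord<>("orders-out", record.key(), transformed));
                offsets.put(new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1));
            }
            // Committing the consumed offsets inside the same transaction is what
            // makes the iteration take effect exactly once from Kafka's point of view.
            producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
            producer.commitTransaction();
        }
    }
}
```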

For the data integration platform development project I am describing in this series of blog posts, we ended up choosing Apache Kafka as the central architectural component. We did not understand all the consequences of the technology choice. In a later post I will discuss the lessons I learned from this.

There are severe consequences to the lack of referential integrity. You will likely encounter complex data issues that are difficult to solve. At a minimum, applications using the data need to take eventual consistency into account, which pushes the complexity to the application developers. Much of this complexity could be handled by the data integration platform.

The benefit of this approach is high and scalable throughput. I recommend that you seriously consider your requirements regarding performance and data integrity. The current trend appears to favor performance and scalability. In typical enterprise environments this may be a bad trade-off, even if there are no strict regulatory requirements for data integrity.

What are the alternatives? There seem to be two possibilities. The first is to use IBM MQ, which appears popular in industries with high data integrity requirements. The second is to build your own solution based on database features. This sounds like a huge effort, but ultimately the implementation is quite straightforward, and it can support features that are useful when you are using the logging metaphor.
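
As a rough illustration of the second alternative, the sketch below is entirely my own assumption rather than a design from this project: the change log is an append-only table whose database-generated sequence number provides a global order, and an ordinary database transaction makes a batch of related messages visible atomically.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class DatabaseChangeLogSketch {

    // Assumed schema (hypothetical names):
    //   CREATE TABLE change_log (
    //     seq        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- global order
    //     entity     VARCHAR(100) NOT NULL,
    //     payload    TEXT NOT NULL,
    //     created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    //   );

    public static void appendBatch(Connection connection, String[] entities, String[] payloads)
            throws SQLException {
        String insert = "INSERT INTO change_log (entity, payload) VALUES (?, ?)";
        boolean previousAutoCommit = connection.getAutoCommit();
        connection.setAutoCommit(false);
        try (PreparedStatement statement = connection.prepareStatement(insert)) {
            for (int i = 0; i < entities.length; i++) {
                statement.setString(1, entities[i]);
                statement.setString(2, payloads[i]);
                statement.addBatch();
            }
            statement.executeBatch();
            // All related messages become visible together, preserving referential
            // integrity between them, and the sequence gives readers a global order.
            connection.commit();
        } catch (SQLException e) {
            connection.rollback();
            throw e;
        } finally {
            connection.setAutoCommit(previousAutoCommit);
        }
    }
}
```

A real implementation would also need to deal with readers polling the table, and with the fact that sequence numbers are not necessarily committed in order under concurrent writers, but the core mechanism stays this simple.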


If you need consulting related to system architectures in general, or data integrations in particular, please do not hesitate to contact Mikko Ahonen through the contact page.