This post is second part in a series about building and adopting a modern streaming
data integration platform. Here we discuss the abstraction to store changes to your data instead of
the latest state of your data.
Using logging to record data changes is an underappreciated abstraction, useful in distributed systems in general, and
enterprise system-of-systems in particular.
The basic idea behind logging is not new. Earliest found examples of logging are from 10 000 years ago. Ancient
Sumerians recorded economic activities on clay tablets for the record keeping, in other words they were doing accounting.
In accounting, the original records are never modified, only appended. In case of errors, a correction record
is appended to the books. This is the essence of data logging.
This abstraction might seem too simple to be useful. In fact, it is actually very useful abstraction in wide range of
applications. The best known example is the database transaction log, which is used to maintain data integrity in
case of database server crashes.
When data changes in the database, the modifications are recorded both to the in-memory structures and also to the transaction log
residing on disk. During checkpoint, the in-memory structures are flushed to disk. If the database crashes,
the database reads structures stored on the disk (data containing the latest checkpoint), and the applies all the changes
recorded in the transaction log since the last checkpoint. This is a performant way to maintain data integrity.
Most modern databases support some kind of data replication as well. The most common method for implementing data replication is
some form of Change Data Capture (CDC). This is very similar to transaction log. All the changes to the database are
recorded in a seperate change log. The change log is transmitted to another system. When all the changes in the change log
have been applied to the sink databse, it will have the same state as the source database.
There is an interesting relationship between tables (representing the current state) and logs (ordered record of all the changes
to the state). You could say that log is actually a more fundamental data structure than table, because you can generate the
table (get the recent state) from the log by just applying all the changes, but you cannot get the log from the table. Actually,
from the full log, you can get any previous state as well.
There are some strict requirements as well. As Jay Kreps describes in his
pamphlet I ♥ Logs,
the process of applying the changes needs to happen in exactly the same order,
and processes of applying changes need to be deterministic. Kreps actually used the log
abstraction very succesfully to drive enterprise architecture at LinkedIn. He is also one of the founders of Confluent,
which develops Kafka with a similar vision.
The logging abstraction appears to be more common in financial organizations than elsewhere. This may be because
in financial systems, it is often not enough to know the current state, we often need to know how we ended up there as well.
But the log abstraction is very useful in enteprise environments beyond this requirement.
If we have made data available as a change log, creating near real-time read copies of your data becomes cheap and reliable.
For example, you can create a central cache of your data for web-facing operative applications, increasing performance and
The price is that you need to have a consistent way for extracting data from your data sources. This may require lots
of work if the existing systems have not been created with this requirement in mind.
If you need consulting related to system architectures in general, or data integrations in
particular, please do not hesitate to contact Mikko Ahonen through the contact page.