Ordering, Transactions and Exactly-Once Semantics in Data Integrations
There are three important aspects of robust data integrations when using queues: global ordering of messages, multi-message transactionality, and exactly-once semantics.
This post is the first part in a series about building and adopting a modern streaming data integration platform. Here we discuss the motivation for building your own, instead of using existing solutions.
Larger enterprises often have a complex system-of-systems. Typically there are modern systems, based on microservice architectures, perhaps running in public or private cloud. There may be older systems that are probably monolithic. Some of them may still be running on a mainframe. There are often commercial off-the-shelf (COTS) products. And so on.
Often we need to use the data from these systems in other systems. Reporting and analysis need data, and for this we often employ analytical databases. More recently, data clouds, such as Snowflake, have emerged to fill this need. These analytical databases often have licensing and performance characteristics that make them unsuitable for operative application use.
Data may be needed in operative applications as well. For example, we might want to show the data to employees, customers, and business partners. The freshness requirements in operative applications may be different from what is needed in reports or analysis. This also limits reusing the analytical data integrations for operative applications.
You also need to understand whether your data is “tall” or “wide”. If your data is “tall”, you operate with fewer kinds of entities, but you have more of each. For example, technology-oriented companies often have “tall” data. Enterprises, on the other hand, more often have “wide” data: there are more kinds of entities, but the number of each kind is smaller.
This has multiple consequences for data integration architectures. For example, if the operative data needs are “wide”, and applications are built using point-to-point integrations, the number of dependencies between systems increases more quickly. This may create a combinatorial explosion in the dependency graph, creating a maintenance nightmare. In such a case, a pure microservice architecture (with very small services) may also become less attractive or even unsustainable.
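To make the growth concrete, here is a rough sketch of the arithmetic. It assumes the worst case, where any pair of systems may end up with a direct integration, and compares that with a hypothetical layout where each system has a single link to a shared integration layer. The function names and system counts are illustrative only.

```python
def point_to_point_links(n_systems: int) -> int:
    """Worst case: every pair of systems has its own direct integration."""
    return n_systems * (n_systems - 1) // 2


def shared_layer_links(n_systems: int) -> int:
    """Hypothetical alternative: each system connects once to a shared layer."""
    return n_systems


# Illustrative system counts, not measurements from any real enterprise.
for n in (5, 20, 50):
    print(f"{n} systems: {point_to_point_links(n)} point-to-point links, "
          f"{shared_layer_links(n)} links via a shared layer")
```

Under these assumptions, 50 systems give up to 1225 pairwise integrations, compared with 50 links through a shared layer. Real landscapes rarely reach the worst case, but the quadratic trend is what makes “wide” data expensive to integrate point to point.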
One approach is to use backend systems directly, through APIs. This has the benefit that the data is always fresh, and changes may be easily supported as well. However, the availability and performance characteristics of backend systems might not match the requirements of the end-user-facing applications.
If your data is “wide”, even your landing page may need to fetch data from tens of systems, making your application take perhaps tens of seconds for initial page load. This is probably too much for your end users.
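As a back-of-the-envelope sketch of this effect, the following assumes a landing page that needs data from 30 backend systems, 29 of which answer in 800 ms and one in 3 seconds. These numbers are made up for illustration. The sketch compares calling the backends one after another with calling them all in parallel.

```python
def sequential_load_ms(latencies_ms: list[int]) -> int:
    """Backends called one after another: the delays add up."""
    return sum(latencies_ms)


def parallel_load_ms(latencies_ms: list[int]) -> int:
    """Backends called concurrently: the slowest one sets the pace."""
    return max(latencies_ms)


# Hypothetical latencies: 29 systems at 800 ms and one slow system at 3000 ms.
latencies_ms = [800] * 29 + [3000]

print(sequential_load_ms(latencies_ms))  # 26200 ms, about 26 seconds
print(parallel_load_ms(latencies_ms))    # 3000 ms
```

Even with parallel calls, the page is only as fast as the slowest backend, and only as available as the least available one.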
If you use data sources directly, you also increase system interdependency. If you have lots of legacy backend systems, every time you use an interface of a legacy system in a new application, you make it more difficult to replace the legacy system later.
You may alleviate this issue by creating an interface layer in front of the legacy system. This is sometimes called the strangler pattern. But unless you put effort into designing this interface properly, the issue is not really solved. There is just an intermediate interface with its own format that needs to be deprecated later.
Commercial integration platforms, sometimes called Hybrid Integration Platforms (HIP), offer some help. Integration Platform as a Service (IPaaS) offerings are also becoming more popular. Many of the HIPs have evolved from Enterprise Service Bus (ESB) products. They are useful in various scenarios, such as when you need data from a system that has an existing connector. The main selling point of these integration platforms is that they can reduce the implementation costs.
A big part of integration costs is typically not the integration implementation itself, but everything else: for example, getting to know what data you need, finding which systems contain the data, negotiating the explicit (or implicit) data model with various parties, and specifying data formats. The implementation might represent only 20% of the integration costs.
What integration platform vendors typically do not emphasize is that the integration platforms are less helpful with the expensive part of building integrations. The anticipated cost savings might not really materialize.
Another selling point of integration platforms is that they allow controlling integrations. That may sound useful on the surface, but it is not always clear what that really entails. For instance, if you want to have a single integration catalog or a single place for administering integration permissions, it may require you to reimplement all your integrations using the integration platform. That is not really feasible in an enterprise environment, as there may be hundreds or thousands of existing integrations.
There are some hidden long-term costs related to integration platforms as well. The important one is employee satisfaction. Many senior developers dislike integration platforms, as they smell like the ESBs that never really took off.
One reason why senior developers dislike these platforms is that they typically have their own language for defining mappings between systems and data formats, with more limited expressive power than full-blown programming languages, such as Java. Having an additional platform also increases the complexity, by adding more layers and components. Typically there is also a performance penalty in using integration platforms.
Management typically likes the idea of integration platforms, as one implicit benefit is that integration work can be done by junior developers. However, the specification phase (data models, etc.) is typically the most important part and involves the most work. To do it well, you probably need senior developers anyway. Having two people is probably going to offset the cost savings of the integration.
The integration platforms typically offer the most help with the simpler problems related to integration, such as when you can use an existing connector. The more complex problems related to integration might not be supported. Examples include masking persistent error situations when source or target systems are down, providing more reliable and performant access to data than the source systems can support, helping with data reconciliation, and ensuring data integrity over multiple systems.
In our analysis of existing integration platforms, we could not find one that would have helped us to provide sufficiently reliable, performant and consistent access to data for operative applications.
If you need consulting related to system architectures in general, or data integrations in particular, please do not hesitate to contact Mikko Ahonen through the contact page.