This post is the first part in a series about building and adopting a modern streaming data integration platform. Here we discuss the motivation for building your own platform instead of using existing solutions.
Larger enterprises often have a complex system-of-systems. Typically there are modern systems based on microservice architectures, perhaps running in a public or private cloud. There are usually also older, often monolithic systems, some of which may still run on mainframes. There are often commercial off-the-shelf (COTS) products. And so on.
Often we need to use the data from these systems in other systems. Reporting and analysis need data, and for this we often employ analytical databases. More recently, data clouds such as Snowflake have emerged to fill this need. These analytical databases often have licensing and performance characteristics that make them unsuitable for operational application use.
Data may be needed in operational applications as well. For example, we might want to show the data to employees, customers, and business partners. The freshness requirements in operational applications may differ from what is needed in reports or analysis, which also limits reusing the analytical data integrations for operational applications.
You also need to understand whether your data is “tall” or “wide”. If your data is “tall”, you operate with fewer kinds of entities, but you have more instances of each. For example, technology-oriented companies often have “tall” data. Enterprises, on the other hand, more often have “wide” data: there are many different kinds of entities, but the number of instances of each kind is smaller.
This has multiple consequences for data integration architectures. For example, if the operational data needs are “wide”, and applications are built using point-to-point integrations, the number of dependencies between systems increases quickly. This may create a combinatorial explosion in the dependency graph, creating a maintenance nightmare. In such cases, a pure microservice architecture (with very small services) may also become less attractive or even unsustainable.
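A rough back-of-the-envelope calculation shows why: with n systems wired point to point, the number of possible links grows as n(n-1)/2. Ten systems can already mean up to 45 integrations to build and maintain; fifty systems can mean up to 1,225.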
One approach is to use backend systems directly, through APIs. This has the benefit that the data is always fresh, and changes may be easily supported as well. However, the availability and performance characteristics of backend systems might not match the requirements of end-user-facing applications. If your data is “wide”, even your landing page may need to fetch data from tens of systems, which can push the initial page load to tens of seconds. This is probably too much for your end users.
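As a minimal sketch of the problem (the BackendClient interface and its fetchPanelData method are assumptions for illustration, not a real API), even a fully parallel fan-out is bounded by the slowest backend, and per-call timeouts only trade latency for missing data:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical landing-page aggregation over many backend APIs. Fully
// parallel, total latency is the maximum of the individual latencies;
// run sequentially it would be the sum, easily tens of seconds when
// tens of systems are involved.
public class LandingPageAggregator {

    // Stand-in for a real backend API client; fetchPanelData is an assumption.
    public interface BackendClient {
        String fetchPanelData();
    }

    public List<String> loadPanels(List<BackendClient> backends) {
        List<CompletableFuture<String>> futures = backends.stream()
            .map(b -> CompletableFuture.supplyAsync(b::fetchPanelData)
                // Guard each call so one slow legacy system cannot stall
                // the whole page; the trade-off is a missing panel instead.
                .completeOnTimeout("unavailable", 2, TimeUnit.SECONDS))
            .toList();

        // join() blocks until every backend has answered or timed out.
        return futures.stream().map(CompletableFuture::join).toList();
    }
}
```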
If you use data sources directly, you also increase system interdependency. If you have lots of legacy backend systems, every time you use an interface of a legacy system in a new application, you make it more difficult to replace the legacy system later.
You may alleviate this issue by creating an interface layer in front of the legacy system. This is sometimes called the strangler pattern. But unless you put effort into designing this interface properly, the issue is not really solved. There is just an intermediate interface with its own format that needs to be deprecated later.
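To make the distinction concrete, here is a rough sketch (all names are hypothetical) of an interface layer that exposes a deliberately designed domain model instead of echoing the legacy format:

```java
// Hypothetical sketch: the interface layer in front of a legacy system
// should expose a designed domain model, not mirror legacy field layouts.
public class StranglerFacadeSketch {

    // Domain model designed for consumers, independent of the legacy format.
    public record Customer(String customerNumber, String displayName) {}

    // The interface new applications depend on.
    public interface CustomerDirectory {
        Customer findByCustomerNumber(String customerNumber);
    }

    // Stand-ins for the legacy system and its record format (assumptions).
    public record LegacyRecord(String custNo, String name) {}
    public interface LegacyMainframeGateway {
        LegacyRecord lookup(String custNo);
    }

    // Adapter that hides the legacy system. When the legacy system is
    // replaced, only this class changes; consumers of CustomerDirectory
    // are untouched.
    public static class LegacyCustomerDirectory implements CustomerDirectory {
        private final LegacyMainframeGateway gateway;

        public LegacyCustomerDirectory(LegacyMainframeGateway gateway) {
            this.gateway = gateway;
        }

        @Override
        public Customer findByCustomerNumber(String customerNumber) {
            // Translate the legacy record here, instead of leaking
            // mainframe field names into every new application.
            LegacyRecord rec = gateway.lookup(customerNumber);
            return new Customer(rec.custNo(), rec.name());
        }
    }
}
```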
Commercial integration platforms, sometimes called Hybrid Integration Platforms (HIP), offer some help. Integration Platform as a Service (iPaaS) offerings are also becoming more popular. Many of the HIPs have evolved from Enterprise Service Bus (ESB) products. They are useful in various scenarios, such as when you need data from a system that has an existing connector. The main selling point of these integration platforms is that they can reduce implementation costs.
A big part of integration costs is typically not the integration implementation itself, but everything else: getting to know what data you need, finding which systems contain the data, negotiating the explicit (or implicit) data model with various parties, specifying data formats, and so on. The implementation might represent only 20% of the integration costs.
What integration platform vendors typically do not emphasize is that these platforms are less helpful with the expensive parts of building integrations. The anticipated cost savings might not really materialize.
Another selling point of integration platforms is that they allow controlling integrations. That may sound useful on the surface, but it is not always clear what it really entails. For instance, if you want a single integration catalog or a single place for administering integration permissions, it may require you to reimplement all your integrations on the integration platform. That is not really feasible in an enterprise environment, as there may be hundreds or thousands of existing integrations.
There are some hidden long-term costs related to integration platforms as well. An important one is employee satisfaction. Many senior developers dislike integration platforms, as they smell like the ESBs that never really took off. One reason senior developers dislike these platforms is that they typically have their own language for defining mappings between systems and data formats, with more limited expressive power than full-blown programming languages such as Java. Having an additional platform also increases complexity by adding more layers and components. Typically there is also a performance penalty in using integration platforms.
Management typically likes the idea of integration platforms, as one implicit benefit is that integration work can be done by junior developers. However, the specification phase (data models, etc.) is typically the most important part and involves the most work. To do it well, you probably need senior developers anyway. Having two people involved is likely to offset the anticipated cost savings of the integration.
Integration platforms typically offer the most help with the simpler integration problems, such as when you can use an existing connector. The harder problems might not be supported: masking persistent error situations when source or target systems are down, providing more reliable and performant access to data than the source systems themselves can support, helping with data reconciliation, ensuring data integrity across multiple systems, and so on.
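As an illustration of the first of these, a minimal sketch of masking a source-system outage by serving last-known-good data; the class and its API are assumptions for illustration, not a feature of any specific product:

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of masking source-system outages: remember the last
// successful response and serve it when the source fails. A real
// platform would add staleness limits, metrics, and persistence.
public class LastKnownGoodCache<K, V> {

    private final ConcurrentHashMap<K, V> lastGood = new ConcurrentHashMap<>();

    // Stand-in for a source system that may be down (throws on failure).
    public interface Source<K, V> {
        V fetch(K key) throws Exception;
    }

    public Optional<V> get(K key, Source<K, V> source) {
        try {
            V fresh = source.fetch(key);
            lastGood.put(key, fresh);   // remember the latest good value
            return Optional.of(fresh);
        } catch (Exception e) {
            // Source is down: mask the error by serving stale data, if any.
            return Optional.ofNullable(lastGood.get(key));
        }
    }
}
```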
In our analysis of existing integration platforms, we could not find one that would have helped us provide sufficiently reliable, performant, and consistent access to data for operational applications.
If you need consulting related to system architectures in general, or data integrations in
particular, please do not hesitate to contact Mikko Ahonen through the contact page.