This post is the third part in a series about building and adopting a modern streaming data integration platform. Here we discuss different methods for data integration, from Extract-Transform-Load (ETL) to data streaming.
Extract-Transform-Load (ETL) is the traditional and most widely adopted approach to data integrations. It is common, for example, when using data marts for data analytics. ETL is a well-established technology, and proven tools exist for it. It is also relatively simple to build reliable data replication with it.
ETL has downsides as well. If the source system data model is used directly, it encourages tight coupling between source and sink systems. In such cases, the sink system users also need to understand the source system data models.
Another downside is that ETL typically uses batch processing with a full copy of the relevant tables in the source database. This limits replication frequency, especially as database sizes grow. If you need to add new tables, the whole data pipeline has to change.
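To make the batch nature concrete, here is a minimal sketch of what such an ETL job can look like, assuming SQLite databases on both ends; the table names (`customers`, `dim_customer`) and the transformation rule are made up for illustration, not taken from any real pipeline.

```python
import sqlite3

# Minimal batch ETL sketch: full extract of one source table,
# a simple transformation, and a load into the target (data mart).
# Table names and the transformation are hypothetical.

def run_etl_batch(source_path: str, target_path: str) -> None:
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(target_path)

    # Extract: a full copy of the relevant source table on every run.
    rows = source.execute(
        "SELECT customer_id, name, country FROM customers"
    ).fetchall()

    # Transform: map the source model into the reporting model.
    transformed = [
        (customer_id, name.strip().title(), country.upper())
        for customer_id, name, country in rows
    ]

    # Load: replace the target table wholesale, as batch ETL often does.
    target.execute("DELETE FROM dim_customer")
    target.executemany(
        "INSERT INTO dim_customer (customer_id, name, country) VALUES (?, ?, ?)",
        transformed,
    )
    target.commit()
```

Because every run re-extracts the full table, adding a new table or column means touching the extract, transform, and load steps together.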
Wide adoption of data marts has been criticized for contributing to the siloing of data. The data warehouse, a single repository for structured data, was introduced as an answer to data siloing. The ETL method is typically used in data warehouses as well.
Later, new products, such as Hadoop, contributed to the emergence of what is called a data lake. These products allowed ingestion of unstructured data alongside structured data. As organizations may have lots of unstructured data, the data lake gives access to more of the data in the organization.
| Extract-Transform-Load | Extract-Load-Transform | API-based | Data Streaming |
|---|---|---|---|
| Traditional method for reporting and analytics | More recent method for reporting and analytics | Typical method in operative applications | More recent method for operative applications |
| Has good tooling, well-understood | Cloud tooling still under active development | Has good tooling, well-understood; integration platforms may help | Integration platform must support data streaming |
| Often batch processing | Often batch processing with cloud data pipelines | Data stays in the source system | Near real-time data streaming |
| For data marts and data warehouses | For data lakes | For operative applications | For cost-efficient, performant and reliable operative applications in enterprise environments |
| Data freshness challenging | Data freshness challenging, data marts expensive | Backend systems may limit availability and performance, multiple copies of data models | Not suitable for highly interactive use |
Data lakes use a newer integration method, Extract-Load-Transform (ELT). All the data in the source systems is extracted and loaded into the data lake. Only the data that is needed somewhere is transformed, and only before consumption. This has the benefit that only the transformation phase of the data pipeline needs to be changed when new data is needed.
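As a rough sketch of the ELT idea, the example below loads raw rows untouched and defers all shaping to a view defined at consumption time. The table and view names (`raw_customers`, `customer_report`) are hypothetical, and a real data lake would of course not be SQLite.

```python
import sqlite3

# Minimal ELT sketch: raw source rows are loaded as-is into the lake,
# and transformation happens later, at consumption time, as a view.

def load_raw(lake: sqlite3.Connection, raw_rows: list[tuple]) -> None:
    # Load: store the extracted rows without any transformation.
    lake.executemany(
        "INSERT INTO raw_customers (customer_id, name, country) VALUES (?, ?, ?)",
        raw_rows,
    )
    lake.commit()

def define_consumption_view(lake: sqlite3.Connection) -> None:
    # Transform: only the data a consumer needs is shaped, and only here,
    # so adding new data usually touches just this step of the pipeline.
    lake.execute(
        """
        CREATE VIEW IF NOT EXISTS customer_report AS
        SELECT UPPER(country) AS country, COUNT(*) AS customer_count
        FROM raw_customers
        GROUP BY UPPER(country)
        """
    )
```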
Many organizations learned that the data lake easily becomes a dumping ground for unstructured, semi-structured and structured data, and turns into a data swamp. The formats for different kinds of data are naturally different – think of a video vs. a database table. Data quality is difficult to manage, meaning quality issues flow downstream. It is also difficult to find data in a data lake, which means data is often duplicated in multiple places.
Both ETL and ELT typically use batch processing. If data sizes are large, the data may be synced only once per day. For reporting and analytics this is usually not a problem – it may even be beneficial. But operative applications may have different needs for data freshness. If an end customer signs up for a service, the signup should be visible on the web service almost immediately.
Therefore, direct APIs are often used for data integrations in operative applications. The data needed by an operative application may reside in multiple backend systems. In a typical modern architecture, the data is mapped to a more suitable data model, often inside the operative application itself. If you have many operative applications, each will naturally evolve its own internal data model. Ultimately, you end up with multiple custom data models for the same data. This encourages tight coupling, making data model changes difficult in the source systems. Some backend systems might also not support the performance and availability needs of operative applications.
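The sketch below illustrates the API-based pattern under assumed names: a hypothetical backend endpoint and a made-up internal `Customer` model that the operative application maps the response into.

```python
import json
import urllib.request
from dataclasses import dataclass

# Sketch of API-based integration: the operative application calls a
# backend API directly and maps the response into its own internal model.
# The URL, response fields and Customer model are hypothetical.

@dataclass
class Customer:
    customer_id: str
    display_name: str

def fetch_customer(customer_id: str) -> Customer:
    # Each operative application tends to repeat this kind of mapping,
    # which couples it to the backend system's data model.
    url = f"https://backend.example.com/api/customers/{customer_id}"
    with urllib.request.urlopen(url, timeout=5) as response:
        payload = json.load(response)
    return Customer(
        customer_id=payload["id"],
        display_name=f'{payload["firstName"]} {payload["lastName"]}',
    )
```

Note that every call hits the backend system directly, which is why its availability and performance become limiting factors for the operative application.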
Another approach to data integrations is data streaming. If you have established a log of all the changes to the source data, then applying those changes to another system allows you to maintain a near real-time copy of the original data.
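Here is a minimal sketch of that change-log idea, with an assumed, simplified event format (`op`, `key`, `value`); a real platform would consume such events from a streaming system rather than a Python list.

```python
from typing import Iterable

# Sketch of applying a change log to maintain a near real-time copy.
# Replaying every change in order reproduces the source data in the replica.

def apply_change_log(events: Iterable[dict], replica: dict) -> dict:
    for event in events:
        if event["op"] in ("insert", "update"):
            replica[event["key"]] = event["value"]
        elif event["op"] == "delete":
            replica.pop(event["key"], None)
    return replica

# Usage: an in-memory replica of customer records built from a change stream.
changes = [
    {"op": "insert", "key": "c1", "value": {"name": "Alice"}},
    {"op": "update", "key": "c1", "value": {"name": "Alice A."}},
    {"op": "delete", "key": "c1", "value": None},
]
print(apply_change_log(changes, {}))  # -> {}
```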
The benefit of this method is that access to near real-time data becomes possible. Also, as creating additional copies of the data is inexpensive, it becomes possible to serve some of the more demanding needs of operative applications, such as the performance and availability requirements of web services.
This approach requires investment as well. The source systems need to be modified to support data streaming. You also need a data integration platform that supports streaming data. Some operative applications might also not be a good fit for the data streaming method; for example, when the use is highly interactive, data streaming might not be suitable.
If you need consulting related to system architectures in general, or data integrations in particular, please do not hesitate to contact Mikko Ahonen through the contact page.