This post is sixth part in a series about building and adopting a modern streaming data integration platform. In this post
I will describe a lightweight option for storing subset of reference data, called code sets.
Many larger business applications have something that can be called a code
service, used to store and administer various code sets used by the
application. Technically, the code service might be a microservice, a set of
database tables or it might be just a set of configuration files. In this post,
I will use the term code service regardless of technology used to implement it.
The code service typically has many code sets, and each code set consists of a list of codes,
with corresponding names. If the system supports localization, there may be multiple translations
for the names.
The codes are used in the application, for example for allowed values in dropdowns. Some of
these code sets might be editable by the admin user, while others require changes to the
Also, many enterprises have some reference data in data warehouses. Examples include
list of countries or list of possible languages. Some reference data may
follow similar simple structure as the code sets, while other reference data may
be more complex.
The list of countries is well-defined by ISO standards. When you dig deeper, however,
there is more complexity. For example, there exists multiple (two-letter, three-letter, numeric)
codes for each country, and not all countries have all the codes. Some codes have been
reassigned. There are also political disputes about countries and their borders.
Another example of more complex reference data is the relationship between
postal codes and municipalities. There may also be a hierarchy between the code
sets. One typical example of a hierarchy is list of industries and the related sub-industries.
Many data quality issues are related to reference data. Some of them are obvious,
such as when the reference data ownership is not clear, or organizational procedures to update
the reference are not working properly.
Another source of data quality issues is the misuse of code sets. Unless you are
very clear about the usage, code sets easily get reused for other purpose, which appears to
be close, but may be semantically very different. For example, list of existing
countries might be reused as a list of countries of birth.
Such semantical data quality issues may be compounded by quick fixes. If
a customer reports that they cannot add their country of birth, the reason might be that the country
no longer exists. If the generic list of countries is the same as the list of
birth countries, then the missing country will be added to the list, which will
break the code sets both semantically and practically. Updating the data based on ISO data source
changes becomes unfeasible or at least difficult.
Adding new values to code sets may also cause more subtle issues. There are
some code sets, where adding a new value invalidates the old values. Gender is
one example. If the original options for choosing your gender are male and female, and then a new
option is added, this logically invalidates the old data. We cannot know what the user would have
chosen if they also had the new option. This may sound a little bit theoretical, but are also practical
issues related to adding new options. When adding new values, all the integrations that map from that code
set need to be updated as well.
It is important to establish the ownership of reference data in the organization. One way is to
link ownership to data domains. Additionally, it is very helpful to link the reference data in general
and code sets in particular to the conceptual data models.
Commercial reference data products exist. I don’t have personal experience using such
systems, so I will not comment on them.
However, I have seen multiple custom implementations of centralized code services. As code sets
seem deceptively simple on the surface, it seems like a good idea to implement your own.
However, like always, the god is in the details. Just two commonly requested features,
versioning and links between code sets, can make the service complex. If you allow
adding custom attributes to codes, application logic easily slips into your reference data.
If you allow additional flexibility in the data models to support custom data types needed by more
complex reference data, the service can easily become too generic. I have seen
an effort to create a code service to fail from the complexity.
There is a lightweight approach to code sets that has proven succesful in
practice. The idea is to store code sets in a git repository using a standard JSON
format. Only simple code sets, with list of code and their corresponding localized names
are supported. Other type of reference data is excluded from this service.
By using git, you automatically have support for versioning. Integration
is also simple, as JSON format can be directly incorporated into application builds.
This means that in many cases, you do not even need a seperate REST call, although in this case you
lose the flexibility of modifying codes on the fly.
You can also create a publisher that broadcasts the reference data changes to interested parties and systems
if your git service provider (GitHub, GitLab etc.) has a method for listening of change events to the
repository. Alternatively, you can build your own publisher by detecting the changes periodically.
I have found it useful to define the codes as typed and human-readable format
instead of numeric values. In a corporate environment, debugging code values
from logs and integration messages becomes easier. If you want, you can use the URN (Uniform
Resource Name) standard (RFC 2141), or you can invent your own syntax.
An example of a value for “lang” code set value for Finnish could be “urn:x-xyz:code:lang:fin”, where xyz
would be your organization name. The method supports hierarchical values almost out of the box, as
you can just add more parts to the end of the URNs.
To help adoption of such a code service, you should probably have client libraries. They should provide
the developers with enumerations (such as Java enum) of possible values of each code set. You can create
build processes that automatically generate these libraries from the code sets. Just make sure that the
invalid values fail gracefully.
You should have a validity checker for the code sets to ensure data quality. This can be implemented
as a git pre-commit hook. The checker should verify the JSON format, required fields, required translations,
duplicates, that the URN naming follows conventions etc.
It is helpful to provide the business users with a client for editing code set translations,
adding new codes and deprecating old codes. If you use git, this frontend should create merge
requests from the changes. This way developers can maintain control of the changes.
Code set values should never be removed, because there may always be some system or temporary
storage that contains those values. Instead, mark the codes that are no longer valid as deprecated.
While this provides you with a lightweight code service, this is only the technical aspect. The difficult part
is the governance model, including the workflows for updating the reference data, adding new reference data etc.
If you need consulting related to system architectures in general, or data integrations in
particular, please do not hesitate to contact Mikko Ahonen through the contact page.