At ABN AMRO, we are quite familiar with the data mesh. As a matter of fact, we’ve been working on this new type of architecture for quite some time: our data journey started four years ago. With this blog post, it is my pleasure to give you a deeper look inside.
The term data mesh was originally coined by Zhamak Dehghani. It describes a paradigm shift from traditional data architectures, like enterprise data warehouses and data lakes, towards a modern distributed architecture, using concepts like Domain-Driven Design, platform- and self-service thinking, and treating data as products.
One of the biggest problems many enterprises are dealing with is getting value out of their current enterprise data architectures. The majority of data architectures use a monolithic design (either an enterprise data warehouse or a data lake) and manage and distribute data centrally. In a highly distributed environment, these architectures won’t fulfill future needs.
Enterprise Data Warehouse and Business Intelligence
The first-generation data architectures are based on data warehousing and business intelligence. The philosophy is that there is one central integrated data repository, containing years of detailed and stable data, for the entire organization. This architecture comes with several downsides.
Enterprise data unification is an incredibly complex process and takes many years to complete. Chances are relatively high that the meaning of data differs across different domains, departments, and systems. Data elements can have the same names, but their meaning and definitions differ, so we either end up creating many variations or just accepting the differences and inconsistencies. The more data we add, and the more conflicts and inconsistencies in definitions that arise, the more difficult it will be to harmonize. Chances are you end up with a unified context that is meaningless to everybody. For advanced analytics, such as machine learning, leaving context out can be a big problem because if the data is meaningless, it is impossible to correctly predict the future.
Enterprise data warehouses (EDWs) behave like integration databases. They act as data stores for multiple data-consuming applications. This means that they are a point of coupling between all the applications that want to access them. Changes need to be carried out carefully because of the many cross-dependencies between different applications. Some changes can also trigger a ripple effect of other changes. When this happens, you’ve created a big ball of mud.
As data volumes and the need for faster insights grew, engineers started to work on other concepts. Data lakes emerged as an alternative for access to raw and higher volumes of data. By providing data as is, without having to structure it first, any consumer can decide how to use it and how to transform and integrate it. Data lakes, just like data warehouses, are considered centralized (monolithic) data repositories, but they differ from warehouses because they store data before it has been transformed, cleansed, and structured. Schemas therefore are often determined when reading data. This differs from data warehouses, which use a predefined and fixed structure. Data lakes also provide a higher data variety by supporting multiple formats: structured, semi-structured, and unstructured.
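The idea of determining the schema when reading can be illustrated with a minimal Python sketch. The records, field names, and the `read_with_schema` helper are hypothetical and exist only for illustration; they are not part of any ABN AMRO platform.

```python
import json

# Hypothetical raw records, landed as-is (no schema is enforced on write).
RAW_LAKE = [
    '{"customer_id": "42", "amount": "19.95", "currency": "EUR"}',
    '{"customer_id": "43", "amount": "5", "note": "manual entry"}',
]


def read_with_schema(raw_lines, schema):
    """Schema-on-read: field selection and typing happen at read time."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({
            name: cast(record[name]) if name in record else None
            for name, cast in schema.items()
        })
    return rows


# Two consumers read the same raw data through different schemas.
payments_view = read_with_schema(RAW_LAKE, {"customer_id": int, "amount": float})
audit_view = read_with_schema(RAW_LAKE, {"customer_id": str, "note": str})
```

In a warehouse, by contrast, the casting and field selection would happen once, before the data is stored, and every consumer would see the same fixed structure.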
Many of the lakes collect pure, unmodified, raw data from the original source systems. Dumping in raw application structures (exact copies) is fast and allows data analysts and scientists quick access. However, the complexity with raw data is that use cases always require reworking the data. Data quality problems have to be sorted out, aggregations and abstractions are required, and enrichments with other data are needed to bring the data into context. This introduces a lot of repetitive work and is another reason why data lakes are typically combined with data warehouses. Data warehouses, in this combination, act like high-quality repositories of cleansed and harmonized data, while data lakes act like (ad hoc) analytical environments, holding a large variety of raw data to facilitate analytics.
Designing data lakes, just like data warehouses, is a challenge. Gartner analyst Nick Heudecker tweeted that he sees a data-lake-implementation failure rate of more than 60%. Data lake implementations typically fail, in part, because of their immense complexity, difficult maintenance, and shared dependencies.
Data distribution and integration at scale
The solution to these siloed-data complexity problems, as we learned within ABN AMRO, is an architecture that allows domains or teams to change and exchange data more independently in a federated and self-service model. Domains, in our architecture, own pieces of the overall architecture and act as providers and consumers. They provide data or integrate applications by complying with high-quality standards, using centrally provided building blocks for interoperability. By ensuring that everything flows through the same single logical layer, we create maximum transparency and increase the speed of consumption. Within ABN AMRO, we deployed a wide range of data management capabilities in this single logical layer, for security, observability, discoverability, lineage and linkage, quality monitoring, orchestration, notification, and so on. Lastly, we have set additional principles and developed an architecture for what we call an accelerated growth of data-intensiveness. We see that the read-versus-write ratio is changing significantly, because of the high variety of use cases and the read patterns that come with them. Analytical models that are constantly retrained, for example, constantly read large volumes of data.
Our architecture has many similarities with a data mesh. At the edges we have distributed domains, which we organize using the principles of Domain-Driven Design. In addition, we apply a touch of business capability thinking to draw the boundaries carefully. Within these boundaries, teams are responsible for their applications and the data that comes with them. We also embraced data product owner thinking as a method for ensuring high data quality, as well as for providing data or services in an optimized way. Here we promote human readability and interpretability, using the business language of the domain. Additionally, we encourage teams to optimize for data-intensiveness. This, for example, could mean incorporating enterprise identifiers or encapsulating security metadata within the data.
At the heart of our data architecture is the mesh model. This is the central place of decoupling and the control center for distributing data to any location. When domains, for example, want to access each other’s datasets, they must use the mesh for their data distribution. Within this multifaceted mesh we have blended the philosophies of service-oriented architecture (SOA), event-driven architecture (EDA), and Command Query Responsibility Segregation (CQRS).
For a clearer view, we grouped our patterns for distributing data and integrating applications into three categories:
- DIAL patterns: DIAL is an ABN AMRO-specific architecture whose philosophy is to capture all original records and allow other consuming applications to (intensively) read these from Read Data Stores (RDSs). This architecture shares many similarities with CQRS, but also comes with standards for data life cycle management to manage and retain historical data.
- API patterns: APIs come from SOA and focus on real-time communication and can be used for communication between legacy systems as well as modern applications. APIs can provide both data and business functionality (behavioral). APIs mainly use a request-response model, as commonly seen in client-server architectures and use HTTP methods, such as POST and GET.
- EDA patterns: EDA is a software architecture paradigm, utilizing event brokers and message queues, for promoting the production, detection, consumption of, and reaction to events. Providers transmit either messages (e.g., via queues) or events (e.g., via streams). Consumers listen to incoming events in the platform and retrieve them through event delivery. Consumers may use the same platform to transmit events back, triggering actions from others in a chain.
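To make the DIAL idea of capturing all original records and serving intensive reads from a separate store more concrete, here is a minimal in-memory sketch in Python. It is not the actual DIAL implementation; the `ReadDataStore` class, its methods, and the sample account data are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    key: str
    payload: dict
    valid_from: date


class ReadDataStore:
    """Append-only store: records are never updated or deleted in place."""

    def __init__(self):
        self._records = []

    def append(self, key, payload, valid_from):
        # Copy the payload so later changes by the caller cannot alter history.
        self._records.append(Record(key, dict(payload), valid_from))

    def history(self, key):
        """All versions of a key, oldest first."""
        versions = [r for r in self._records if r.key == key]
        return sorted(versions, key=lambda r: r.valid_from)

    def as_of(self, key, moment):
        """State of a key as it was known at a given date."""
        versions = [r for r in self.history(key) if r.valid_from <= moment]
        return versions[-1].payload if versions else None

    def latest(self, key):
        versions = self.history(key)
        return versions[-1].payload if versions else None


rds = ReadDataStore()
rds.append("account-1", {"status": "open", "limit": 1000}, date(2020, 1, 1))
rds.append("account-1", {"status": "open", "limit": 2500}, date(2020, 6, 1))
```

Because nothing is overwritten, consumers can query both the current state (`latest`) and the state at any earlier point in time (`as_of`) without touching the providing application.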
As you can see, our data mesh does more than distribute data. The architecture is also an integration mesh, because it facilitates different styles of application integration. Application integration has a wider scope than data distribution: it also covers the command side and allows business functionality to be shared across domains.
Our data mesh is also an integration mesh
Our mesh is distributed and spans across different environments. Although you see a simple logical representation of the interaction patterns, it utilizes many technologies: middleware platforms and (self-service) capabilities are assembled into architectural patterns to accommodate the dimensions of volume, velocity, consistency, and variety. Conceptually, one abstract and simplified layer is seen, but under the hood, three architectures work closely together to persist, route, transform, manipulate, and replicate the data to the various endpoints. Let’s zoom in further, and discover what’s inside.
On the left you see providers, acting as nodes. They have the ability to choose their style of distribution or integration. Providers in this model should weigh the nuances and work closely together with their consumers to satisfy their needs. Let’s walk through every option:
1.1 Reading data intensively from RDSs is recommended when performing data processing at scale. The RDSs, which also retain historical data, act as provider query models by storing data in an immutable fashion. Although you see one logical representation, under the hood RDSs can utilize different database types, such as relational, document-oriented, key-value, and so on. Lastly, different RDSs can optionally work together to distribute data between cloud environments.
2.1 APIs, operated by the API management plane, are meant for strongly consistent reads and commands. The communication in this model goes directly between providers and consumers, and is facilitated by either an ESB, API Gateway or Service Mesh. This pattern can be used for both data distribution and application integration.
2.2 APIs, which are provided by DIAL, are for reading eventually consistent data. This is because there’s a slight delay between the state of the application and the RDSs. This pattern is extremely useful for scaling up data-intensive applications that require API-based data access.
3.1 Event brokers are most suitable for processing, distributing, and routing messages, such as event notifications, state change detections, and so on. This pattern, just like 2.1, can be used for both data distribution and application integration.
3.2 Message queues facilitate the mediator topology: requests go through a central mediator, which posts messages to queues. This is more useful when you want to apply event sourcing, need to orchestrate a complex series of events in a workflow, or when error handling and transactional integrity are more important.
3.3 DIAL’s event brokers are best suited for event-carried state transfer and building up history. This can be useful when applications want to access larger volumes of other applications’ data without calling the source, or to comply with more complex security requirements.
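Two of the ideas in this walkthrough, event-carried state transfer (3.3) and the slight delay that makes reads eventually consistent (2.2), can be shown with a small Python sketch. The `EventBroker` class, the topic name, and the customer data are hypothetical stand-ins, not a description of the actual middleware.

```python
from collections import defaultdict


class EventBroker:
    """In-memory stand-in for an event broker with asynchronous delivery."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self._pending = []

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Events are queued, not delivered immediately.
        self._pending.append((topic, event))

    def deliver(self):
        """Deliver all queued events; returns the number delivered."""
        pending, self._pending = self._pending, []
        for topic, event in pending:
            for handler in self._subscribers[topic]:
                handler(event)
        return len(pending)


# Provider state (the source application) and a consumer-side replica.
source = {}
replica = {}


def apply_to_replica(event):
    # The event carries the full new state, so the consumer never calls the source.
    replica[event["key"]] = event["state"]


def update_source(broker, key, state):
    source[key] = state
    broker.publish("customer-changed", {"key": key, "state": dict(state)})


broker = EventBroker()
broker.subscribe("customer-changed", apply_to_replica)
update_source(broker, "c-1", {"name": "Alice", "segment": "retail"})

stale = "c-1" not in replica       # True: the replica lags behind the source
delivered = broker.deliver()       # after delivery the replica catches up
```

Between `publish` and `deliver` the replica is out of date, which is the window of eventual consistency; consumers that cannot tolerate it would use the strongly consistent API pattern (2.1) instead.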
These different patterns don’t stand on their own, but can also be combined to provide richer experiences for consumers. Let me provide an example.
Providers can help scale up use cases by applying both CQRS and business logic at the same time. Providers, in this example, onboard their data to create read-optimized copies of it. During this process, metadata for the authorization is encapsulated within the data (1). For example, if transaction data needs to be filtered using specific arguments, any required metadata is expected to be part of the data.
Providers can enable fine-grained data authorization by applying column-level security, row-level security, and dynamic data masking. For this they need data sharing agreements to be in place, with the filtering logic documented in metadata. The enforcement, for example, can be a view (2) that only selects the relevant data by taking arguments from the consumer.
When complex business logic or orchestration is required, an additional component (3) can be deployed. This component is owned by the provider, uses the underlying DIAL API endpoint, and can, if required, retrieve additional context or perform complex comparisons and calculations. With this separation we keep the architecture clean and prevent domain logic from creeping into the mesh.
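The enforcement step (2) can be sketched as a function that behaves like a filtering view. This is a simplified Python illustration under assumed names: `SharingAgreement`, `secured_view`, `mask`, and the sample transactions are hypothetical, and a real implementation would more likely be a database view or policy rather than application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharingAgreement:
    consumer: str
    columns: tuple              # columns the consumer may see
    masked: tuple = ()          # columns returned with dynamic masking
    filter_on: str = "country"  # security metadata carried in each row


def mask(value, visible=4):
    """Hide everything except the last few characters."""
    text = str(value)
    return "*" * max(len(text) - visible, 0) + text[-visible:]


def secured_view(rows, agreement, **arguments):
    """Row-level filter plus column-level projection and masking."""
    wanted = arguments[agreement.filter_on]  # argument supplied by the consumer
    result = []
    for row in rows:
        if row[agreement.filter_on] != wanted:
            continue
        result.append({
            col: mask(row[col]) if col in agreement.masked else row[col]
            for col in agreement.columns
        })
    return result


transactions = [
    {"iban": "NL00EXAMPLE0000001234", "amount": 25.0, "country": "NL"},
    {"iban": "DE00EXAMPLE0000005678", "amount": 99.0, "country": "DE"},
]
agreement = SharingAgreement("marketing", columns=("iban", "amount"), masked=("iban",))
view = secured_view(transactions, agreement, country="NL")
```

Note that the filter only works because the `country` attribute travels with each row, which is the point of encapsulating authorization metadata within the data during onboarding (1).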
Meta model — the critical glue that binds everything together
All three architectures discussed are fully metadata-driven to help define data services that can be reused for other consuming applications. They also provide insights into the lineage of distribution, consumption requirements, data quality, meaning, and so on. When distributing data or integrating, we also rolled out a large set of additional data principles. Let’s see what this means in practice.
When onboarding datasets, we make use of a sophisticated meta model with many underlying principles. For the sake of simplicity, I abstracted the model by showing only the core entities that are most relevant for data governance.
Data ownership and governance are important aspects of providing transparency and trust around data. In the figure, you see datasets, which are technology-agnostic representations used to classify data and link it to data owners and elements; data elements are atomic units of information that act as the glue linking physical data, interface, and data-modeling metadata. This information is kept abstract for the sake of flexibility, and allows us to push down our controls to all places within the multifaceted mesh, regardless of whether the data is duplicated or replicated. Why keep ownership information abstract and not link it directly to the physical data attributes? To avoid tight coupling. If ownership were linked directly and owners, business entities, or classifications changed, all corresponding physical metadata would have to change as well. By decoupling and linking to the dataset’s elements, we made the architecture flexible.
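The effect of this decoupling can be shown with a minimal Python sketch of the three core entities. The class names, fields, and sample values are assumptions chosen for illustration and are much simpler than the real meta model.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    owner: str
    classification: str


@dataclass
class DataElement:
    name: str
    dataset: Dataset


@dataclass
class PhysicalAttribute:
    location: str          # e.g. a column in a flat file or a field in an API
    element: DataElement


def owner_of(attribute):
    # Ownership is resolved through the element and dataset, never stored physically.
    return attribute.element.dataset.owner


payments = Dataset("payments", owner="Domain A", classification="confidential")
amount = DataElement("transaction amount", payments)
copies = [
    PhysicalAttribute("payments_2020.csv:amount", amount),
    PhysicalAttribute("rds.payments.amount", amount),
]

payments.owner = "Domain B"   # one change at the dataset level
owners = {owner_of(copy) for copy in copies}
```

A single update on the dataset is enough for every physical copy to resolve to the new owner, however many times the data has been duplicated or replicated.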
The entities in blue represent all physical representations of data within the mesh. They are used to guarantee data integrity and transform data to different technical formats. Although this model can work for any interface type, I’ve simplified it here, with only an offline batch pattern using flat files.
In the middle, shown in purple, are the entities that capture all data sharing agreement details. These are formal metadata contracts that document which datasets and data elements are shared between domains and how they can be consumed. They capture information about scope, privacy, purpose limitations, and additional fine-grained filters.
Data Marketplace and Metadata Lake
Our metadata blueprint is an integrated and unified model that works across all enterprise architecture and data management disciplines. We have decided not to focus on a narrow catalog or commercial metadata solution, but to develop our linkage capability ourselves. We strongly believe that providing insight into observability, the correlation between information, and semantic consistency must be a central capability. Therefore, we introduced a central construct, which we call our ‘Metadata Lake’. Its main purpose is to collect essential metadata from our digital ecosystem. Our mesh, in that respect, is just one of the sources that provide information about the interoperability of data.
The Metadata Lake is the backend for two applications, which we have named Data Marketplace and Clarity.
- Data Marketplace is our self-service portal that shows all data that is available at our fingertips. It’s the central portal to get rapid access to data. It shows data quality, ownership and lineage over time, and enables us to demonstrate internal and external compliance at any given point in time.
- Clarity is about decision making and allows looking at our organization using a helicopter view. It uses our architecture models, connects the dots and provides detailed information about data, processes, organization and technology. It tells, for example, what centralization or decentralization is applied.
Our data journey has just begun and we’re in the middle of exploring many new opportunities. We plan to develop ML-powered intelligence to free data scientists from the tasks of finding, cleansing, and querying data. We want to use our business data models to discover semantic consistency and leverage telemetry data to make our platforms self-healing and more efficient.
The architecture that we discussed throughout this blog post helps us manage data at scale. If you are curious to learn more, I encourage you to have a look at the book Data Management at Scale.