ABN AMRO’s Data Integration Architecture
With this blog post it's my pleasure to give you a look at the cool initiatives we are working on at ABN AMRO. I want to emphasize that our Data Integration Architecture addresses a common problem area. The ideas discussed in this article are not restricted to ABN AMRO and are not industry-specific. The audience is anyone with an interest in data: data architects, data engineers, solution designers, data professionals, etc.
The architecture that we have developed helps our architects, engineers and solution designers pick the right building blocks for delivering value for our business and customers. With this architecture we strongly believe we can improve our agility, while at the same time retaining control over our data integration and distribution channels. We also believe that this pioneering work will advance the practice of our enterprise architecture. Interested? Please keep on reading!
The concept of connecting Ecosystems
Before I start talking about our future state architecture and initiatives, it's good to go back to the point where our journey started. Almost two years ago, my colleagues and I were challenged to transform the current Data Warehousing Architecture. It had to meet the latest and future business intelligence and analytical requirements, and support the data-driven culture within ABN AMRO in an agile way. But at the same time it also had to respect our strict data management principles: governance, legal obligations and our internal policies. These requirements typically place constraints around streaming data, usage of unstructured and external data, embedding of analytics in our operational processes, API usage and so on.
The modern architecture had to meet all these different requirements, but at the same time it also had to work in an ecosystem with FinTechs, external data providers and data consumers. This led us to the belief that an "architecture for ecosystems" is key to achieving success in the area of data integration. We strongly believe that future data will be distributed, especially as we move to the Cloud and interact and collaborate more and more with external parties. Most likely we'll reach a point in the future where most of the data we use is "external" by nature. To keep control of our data distribution and usage at large, we need an "architecture for data distribution at scale".
Current situation
When we started, we followed the good practice of identifying and examining the existing situation. Our existing architecture is similar to what most enterprises have and is best illustrated by the conceptual diagram below:
Besides the opportunity to leverage the latest Big Data and analytical technology trends, we also must be able to stay in control. Data lineage is an important objective. Whenever the data schema changes, we need to be able to track the data and the changes. Judging the truthfulness and quality of the data is also very important. Typical (Enterprise) Data Warehouse designs are based on the 1990s principle of bringing all the data together and integrating everything first. With this approach, taking vast amounts of external and unstructured data into account is not possible. Data governance is also a concern. We want to have clear insight into ownership and allow data providers to have control over the distribution and consumption of their data. Most importantly, we wanted to improve our agility with fewer shared dependencies.
Our ‘every application has a database’ hypothesis
Before we move on I would like to share a hypothesis. The first hypothesis is that every application (at least in the context of a banking application) that creates data, has a 'database' (organized collection of application data). In this view, even stateless applications that create data have 'databases'. Data, in these types of scenarios, typically sits in RAM or temporary files. Following this hypothesis, it is logical that when there are two applications, there are two databases.
The second hypothesis is that data integration is always around the corner. When the interoperability between these two applications takes place, we expect data to be moved from one application to another.
Common understanding of integration
A second problem is context and schema transformation. An application database schema is designed to meet specific business requirements. Since requirements differ, applications differ as well. Consequently, schemas are expected to differ, and data integration is always required when moving data around.
Whether you do ETL (Extract, Transform and Load) or ELT (Extract, Load and Transform), virtual or physical, batch or real-time, there’s no escape from the data integration dilemma. Data interoperability and integration will frame the new architecture.
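To make the transformation step a little more concrete, here is a minimal, hypothetical sketch. The field names, schemas and mapping rules are invented for illustration only: a provider exposes customer records in its own context, and the consumer maps them into the schema its own requirements dictate.

```python
# Hypothetical example: a provider record mapped into a consumer-specific schema.
# Field names and mapping rules are illustrative, not actual ABN AMRO schemas.

provider_record = {
    "cust_id": "C-1001",
    "full_name": "J. Jansen",
    "dob": "1985-04-12",          # provider stores ISO dates as strings
    "segment_code": "RET",
}

def to_consumer_schema(record: dict) -> dict:
    """Transform the provider's context into the consumer's own context."""
    return {
        "customerId": record["cust_id"],
        "displayName": record["full_name"],
        "birthDate": record["dob"],
        "segment": {"RET": "Retail", "COM": "Commercial"}.get(record["segment_code"], "Unknown"),
    }

print(to_consumer_schema(provider_record))
```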
Data Provider and Data Consumer
For our integration architecture we have adopted and adapted the philosophy of 'Service Orientation' and TOGAF's Integrated Information Infrastructure Reference Model (III-RM). An application or a system is either a data provider (producer of data) or a data consumer. Because we're part of a larger ecosystem, we expect data providers and data consumers to be external parties as well.
With this fundamental concept of data providers and data consumers, we defined a set of principles:
- Clear application and data ownership
- Data Quality is maintained at the source, which is a data provider’s responsibility.
- Understandable data, which means definitions, labels and proper metadata is available.
- No data consumption without a purpose.
- Don’t do integration when it’s not required.
- If no data integration with other sources is required, solve the issue on your own side.
- Data consumers can become data providers when distributing data. If so, they must adhere to the same principles. (Additionally, don’t distribute data that you do not own.)
- Integration is pushed close to the data consumers. They take accountability for the data transformation, since they set the requirements.
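To give a feeling for how a few of these principles could be enforced in practice, here is a small, hypothetical check. It is purely illustrative (the dataset names and structures are invented, not our actual tooling): consumption without a stated purpose is refused, and only the owning provider may distribute a dataset.

```python
# Hypothetical sketch of two principles: "no data consumption without a purpose"
# and "don't distribute data that you do not own". Names are illustrative.

DATASET_OWNERS = {"payments.transactions": "PaymentsDomain"}

def approve_consumption(dataset: str, provider: str, purpose: str) -> bool:
    if not purpose:
        raise ValueError("No data consumption without a purpose.")
    if DATASET_OWNERS.get(dataset) != provider:
        raise PermissionError(f"{provider} does not own {dataset} and may not distribute it.")
    return True

print(approve_consumption("payments.transactions", "PaymentsDomain", "fraud reporting"))
```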
First view of our ‘Digital Integration and Access Layer’ Architecture
With the concept of data providers and data consumers, how does the integration and interoperability work? In the middle we have the 'solution', which we call our Digital Integration and Access Layer (DIAL). Let's walk through some of the aspects.
Access: Data consumers ideally want a single place, a layer, where they can 'explore', access and query all data in a consistent manner, at any time and at any speed. 'Make data available' is the motto. Data consumers shouldn't have to worry about availability. Whether the data must be physically present in this layer depends on the non-functional requirements.
Integration: A crucial part of the architecture is about transformation/integration. In our DIAL architecture we set a hard requirement that the data transformation between data providers and data consumers is done only once. So, no initial transformation to an enterprise model: no 'IBM Information FrameWork' or 'Esperanto'-style canonical languages. In our approach, data is in either the context of a provider or a consumer. Consumers set the requirements. We accept harmonisation of data and that an additional step might be needed in case data heavily overlaps. But in the new model, creating additional layers is only allowed at the domain level. By doing so, we strongly believe agility will increase significantly. By letting the enterprise model go, data providers and consumers can change at their own speed. Everything is decoupled.
Metadata: How do data consumers understand what data means, if no enterprise model is used? This is where our metadata comes into the picture. Metadata is the critical glue. For the 'understandable data' principle we require providers and consumers to deliver business metadata centrally. For integrations and transformations, we require lineage metadata. By making our architecture metadata-driven, it supports an approach in which data can be 'reused' by other consuming applications, by providing insight. This logically means that all data consumers have access to the enterprise metadata catalogue, from which they can see the available data, schemas, definitions, lineage, ownership, quality, list of sources, etc.
The final part is data security: The architecture uses data delivery agreements between data providers and data consumers. Data is routed based on metadata agreements and labels. This allows data providers to stay in control of the distribution, because whenever the metadata labels and classification change, the routing changes.
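A simplified, hypothetical illustration of that metadata-driven routing (the labels, dataset and agreement below are invented for illustration): each dataset carries classification labels, each data delivery agreement states which classifications a consumer may receive, and the routing decision follows from comparing the two. When the labels change, the decision changes with them.

```python
# Hypothetical sketch: route a dataset to a consumer only if the data delivery
# agreement covers the dataset's current classification labels.

dataset_metadata = {
    "name": "customer_profile",
    "labels": {"PII", "Confidential"},   # maintained by the data provider
}

delivery_agreement = {
    "consumer": "MarketingAnalytics",
    "allowed_labels": {"Confidential"},  # PII is not covered by this agreement
}

def may_route(dataset: dict, agreement: dict) -> bool:
    # Distribution is allowed only if every label on the dataset is covered.
    return dataset["labels"] <= agreement["allowed_labels"]

print(may_route(dataset_metadata, delivery_agreement))  # False: the PII label blocks distribution
```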
Digital: The way we want to implement our metadata requirements is to apply some intelligence to it: machine learning and AI. Eventually, our architecture should become a self-learning semantic layer, allowing parties to interact with it. This layer will also include a taxonomic service for seeing the true potential and value of the data, including its relations to our business capability models, technical models and so forth.
Engineering the Architecture with Architecture building blocks
Let’s move on and look at some of the engineering aspects. To distribute and move data around we distinguish between the following capabilities and patterns:
Read Data Stores (RDS) Architecture
Read Data Stores act as 'read-only caches' from which data can be taken. See it as a 'Command Query Responsibility Segregation' extension of the operational system. Because we don't do any upfront transformation, the format is 'domain data', which means that the context is inherited (based on architecture guidelines) from the domain. With 'domain data' we also imply that no new data can be created and that the context of the data cannot be changed. For adding new business logic we expect data consumers to first extract and transform the data; consequently, new data ownership is required.
The benefit of the RDSs is that developers can refactor their operational systems without having to change their RDS. Queries against RDSs won't add load to the operational systems. RDSs must be kept up to date, which can be done via ingestion techniques like batch, Change Data Capture (CDC), micro batches, replication or synchronization.
For our data governance practice it is important to let data providers own and control the data in the RDSs. This also includes data access: determining who or which application has access. Here we use the concept of Data Delivery Agreements. If consumers take a dependency on, for example, a table in one of the RDSs, a contract is created and the data provider knows that the dependency exists. Providers can still engineer or refactor their operational systems, but backwards compatibility must be guaranteed.
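As a minimal sketch of the Change Data Capture idea (purely illustrative; a real implementation would use CDC tooling against the operational database's log, and the event format below is invented): change events from the operational system are applied to the read-only store, so consumers never have to query the operational system directly.

```python
# Hypothetical CDC-style sketch: apply change events from an operational system
# to a Read Data Store acting as a read-only cache.

read_data_store: dict[str, dict] = {}

def apply_change(event: dict) -> None:
    """Apply a single change event (insert/update/delete) to the RDS."""
    key, op = event["key"], event["op"]
    if op in ("insert", "update"):
        read_data_store[key] = event["row"]
    elif op == "delete":
        read_data_store.pop(key, None)

changes = [
    {"op": "insert", "key": "ACC-1", "row": {"balance": 250.0}},
    {"op": "update", "key": "ACC-1", "row": {"balance": 300.0}},
]
for change in changes:
    apply_change(change)

print(read_data_store)  # {'ACC-1': {'balance': 300.0}}
```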
We envision multiple RDSs for different use cases, and different RDSs can also share the same technological platform. Our Read Data Store is technology agnostic. This means that, depending on the data structure, an RDS can be a relational system or a document store. Cassandra, Hadoop (HIVE2), MongoDB and RedShift, to mention a few, are all valid platforms. A data provider can extend itself to one or more of these environments. This makes the RDS a perfect place for consumers, because there's variation in the way data needs to be consumed: for example, large volumes vs. small volumes at a higher speed. In a scenario where multiple RDSs sit in the same environment, it's also logical to bring the master and reference data to such a shared environment. Metadata again is the critical glue, which is responsible for distinguishing transactional, master and reference data. Environments with multiple RDSs are also the place where coherence and integrity checks across systems can be performed. This, as expected, will positively impact data quality.
Service Oriented Architecture
For real-time application-to-application communication, or for volumes that can be served directly from applications or systems, we use lightweight integration components: 'service orientation'. Request-response is the most well-known pattern. Decoupling and integration happen by putting a communication bus or an integration component between our applications, for example Enterprise Service Buses or API Gateways. The metadata in our real-time architecture is again collected centrally. All services must be registered in the central service registry. They are always owned by data providers. For APIs that are being consumed we use the same Data Delivery Agreements philosophy.
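To illustrate the registry idea (again a hypothetical sketch, not our actual gateway configuration; service and party names are invented): every service is registered centrally with its owning data provider, and a consumer's API usage is only allowed once a data delivery agreement for that service exists.

```python
# Hypothetical sketch: a central service registry with ownership and delivery agreements.

service_registry = {}

def register_service(name: str, owner: str) -> None:
    service_registry[name] = {"owner": owner, "agreements": set()}

def add_agreement(service: str, consumer: str) -> None:
    # A Data Delivery Agreement between the owning provider and a consumer.
    service_registry[service]["agreements"].add(consumer)

def may_call(service: str, consumer: str) -> bool:
    entry = service_registry.get(service)
    return bool(entry and consumer in entry["agreements"])

register_service("GetAccountBalance", owner="AccountsDomain")
add_agreement("GetAccountBalance", consumer="MobileApp")
print(may_call("GetAccountBalance", "MobileApp"))      # True
print(may_call("GetAccountBalance", "UnknownParty"))   # False
```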
Event-driven architecture
The last pattern is event streaming. Whenever a state change happens, an event or trigger is created. These are forwarded to one of our streaming platforms, from which the data distribution towards data consumers can take place. Data can also be persisted for a longer period. In our DIAL architecture these 'state stores' are also called Read Data Stores. For streaming we follow the same principles. Topics and subscriptions are always controlled via metadata.
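A toy illustration of this pattern (no real streaming platform involved; the topic, consumer names and event format are invented): events are published to a topic, the topic retains events and so doubles as a 'state store', and subscriptions are only accepted when the metadata says the consumer has an agreement in place.

```python
# Hypothetical sketch of an event topic whose subscriptions are controlled by metadata.

class Topic:
    def __init__(self, name: str, allowed_consumers: set):
        self.name = name
        self.allowed_consumers = allowed_consumers  # driven by metadata / agreements
        self.events = []                            # retained events act as a state store
        self.subscribers = {}

    def subscribe(self, consumer: str, handler) -> None:
        if consumer not in self.allowed_consumers:
            raise PermissionError(f"{consumer} has no agreement for topic {self.name}")
        self.subscribers[consumer] = handler

    def publish(self, event: dict) -> None:
        self.events.append(event)
        for handler in self.subscribers.values():
            handler(event)

payments = Topic("payments.events", allowed_consumers={"FraudDetection"})
payments.subscribe("FraudDetection", lambda e: print("received:", e))
payments.publish({"type": "PaymentInitiated", "amount": 42.0})
```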
Combining the different patterns
We acknowledge there might be a slight overlap between the patterns, but when combining them, the Integration Architecture looks like this:
The patterns in DIAL are also complementary. Events can be ingested in Read Data Stores. Commands and queries can be separated from each other by API Gateways. We have also worked out generic consumption patterns for each layer. For example, for Read Data Stores the consumption patterns can be ETL, ODBC access, pull after poll, push subscription on files, or direct access via Business Intelligence solutions.
Domain Data Store
On the right side of the DIAL Architecture you see the data consumers' solutions and the concept of a 'Domain Data Store' or 'Integrated Data Store'. In our future architecture we want all new applications to rely on a single application database only. Databases that store data for multiple applications must be avoided.
The DDS stands for a large variety of use cases, e.g. Business Intelligence, Analytics, operational applications, etc. The application database is set up specifically to address the specific business requirements. Consequently, the data model is expected to be very specific. The data model can vary from highly normalized to dimensional. The integration can also vary, from a single integration step to a situation where additional steps are required (data cleansing, additional harmonisation, etc.).
All these new types of applications, which directly consume, transform and store the data, we call Integrated Data Stores. Since the data schema has changed, we expect new ownership and consider the new data to be new 'golden' data. All these new applications are required to be registered in our metadata repository, which we call the List of Golden Sources (LoGS). We have also set this principle of registration for our transactional/operational systems.
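A hypothetical sketch of that registration step (the registry structure, store and team names are invented for illustration): the consumer transforms the data into its own schema, thereby becoming the owner of new 'golden' data, and the resulting store is recorded in a LoGS-style registry together with its lineage back to the providing sources.

```python
# Hypothetical sketch: a consumer-side Integrated Data Store registering itself
# as a new golden source, with its own ownership, in a LoGS-style registry.

list_of_golden_sources = []

def register_golden_source(store_name: str, owner: str, derived_from: list) -> None:
    list_of_golden_sources.append({
        "store": store_name,
        "owner": owner,                # new ownership: the schema has changed
        "derived_from": derived_from,  # lineage back to the providing sources
    })

register_golden_source(
    store_name="risk_reporting_dds",
    owner="RiskAnalyticsTeam",
    derived_from=["payments_rds.transactions", "customers_rds.profiles"],
)
print(list_of_golden_sources)
```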
We acknowledge that the DIAL Architecture abandons the Enterprise Data Model in favour of higher agility. This is also in line with our 'connecting ecosystems' thinking. The Enterprise Data Model in our architecture has made way for disciplines like metadata management, master data management, data governance and data quality. By having Data Management in place, we keep supervision over data distribution and foster data reusability.
Data movement across chains
Let’s go back to our DIAL reference architecture. Here we also see an arrow on the right side, which goes all the way back to the Digital Integration and Access layer. This is because a data consumer can also become a data provider, when distributing data again. So, when applications want to share/distribute data with other applications, the patterns of the DIAL architecture must be used again. To illustrate the data interoperability between applications, see the picture below:
Interoperability between applications always takes place via the DIAL architecture. Since the architecture relies on metadata, we can keep track of the data, no matter what pattern is used. The architecture guidelines make an exception for data distribution within the 'bounded context': application boundaries that share the same business concerns. But when the context or responsibilities change, decoupling using DIAL is required.
Metadata components
To flesh out the details of the metadata stream, please have a look at the picture below.
All RDSs are required to sync their metadata schema information to the central metadata repository. The DIAL Architecture has 'centrally managed' ETL capabilities, which write the lineage to the metadata repository automatically, but data consumers can also favour their own patterns. In those cases the metadata lineage is delivered manually. Although only RDSs have been visualized in the image above, we use the same approach for Service Orientation and Streaming.
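As a final, hypothetical sketch of this flow (the repository structure, store and job names are invented): an RDS pushes its schema to the central metadata repository, and an ETL job records its lineage there, so consumers can always trace where data came from and how it was transformed.

```python
# Hypothetical sketch: syncing RDS schema metadata and ETL lineage
# to a central metadata repository.

metadata_repository = {"schemas": {}, "lineage": []}

def sync_schema(rds: str, table: str, columns: dict) -> None:
    metadata_repository["schemas"][f"{rds}.{table}"] = columns

def record_lineage(source: str, target: str, transformation: str) -> None:
    metadata_repository["lineage"].append(
        {"source": source, "target": target, "transformation": transformation}
    )

sync_schema("payments_rds", "transactions", {"tx_id": "string", "amount": "decimal"})
record_lineage(
    source="payments_rds.transactions",
    target="risk_reporting_dds.daily_totals",
    transformation="aggregate amounts per day",
)
print(metadata_repository["lineage"])
```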
What is the relation to MicroServices?
You might wonder: what is the relation between microservices and the DIAL architecture? The microservice architecture is an application architecture, while our DIAL architecture is an integration architecture between applications. A microservice is an independently deployable unit, which is part of an application. Many microservices together form an application. The bounded context is where we draw the lines. Within the boundaries of an application, developers have certain freedom, but when spanning across applications, we require the DIAL patterns for decoupling to be used.
Wrapping up
For ABN AMRO, the Digital Integration & Access Layer is the architecture that guides our architects, developers, engineers and solution designers in delivering the highest value for the business. To summarize, the main advantages are:
- Clear insight in the data supply chain
- Insight for both data providers and consumers into data consumption and consumers' requirements and responsibilities
- Much higher agility, since we cut out the additional integration step needed in the current architecture and remove dependencies on other domains
- Much easier to access and find data
- Insight in the meaning of data, quality and ownership
- Much better security, because with labels at the attribute level we can enforce attribute-based access
- The opportunity to leverage much quicker from the latest trends and developments
My name is Piethein Strengholt and I'm a Technology Architect at ABN AMRO. I'm part of a high-performing team of technology enthusiasts with a passion for the latest developments and trends. Are you interested in joining me on this journey? Please let me know!
Many thanks to Bernard Faber, Henk Vinke, Dave van Wingerde, Fabian Dekker and Reijer Klopman for their cooperation.