Data Integration and Data Modelling demystified

This article consolidates a large amount of content from the book Data Management at Scale. This content didn’t make it into the final book, because various external reviewers found it unchallenging. So, if you believe you already have a good understanding of data management, the content discussed in this article might sound familiar. However, I have noticed that less experienced data professionals find the data integration and context perspectives difficult to digest. Therefore, I decided to make this content freely available to all of my followers.

Data integration is considered to be part of data management, but since it is such a fundamental area, I have put extra emphasis on it in this article. To help you understand it better, I’ll discuss what data integration is using a layer-by-layer approach and a practical metaphor. Before we discuss data integration, we first must agree on a common viewpoint on what data exactly is, how it is structured, and how it is stored and used in applications. Because of this, data modelling and data integration will be discussed together.

Demystifying data

Before we dive deep into the content, let’s first untangle data and try to define it better. Since the definition of data varies among business readers and technical readers, I find it important to show three viewpoints.

Business viewpoint: Business professionals usually use the term data to mean information used for their daily processes and work. Documents, files, folders, pictures and even non-digital reports are all considered to be “data”. Within a particular context, typically a department, they usually share the same definition of what is in the data. In many cases this is poorly documented, or not documented at all. For data management it might be difficult to rely only on this viewpoint, because physical documents, pictures and prints are something different from electronic data. Non-digital reports, for example, cannot be integrated with other data. In order to integrate non-digital reports, we need to digitize them first.

Technical viewpoint: True technicians will state that everything computers or electronic devices can process is called data. This implies all the zeroes and ones (bits and bytes), such as 11011000. The difficulty with the technical view is that it also includes application code, supporting libraries, frameworks (higher-level abstractions) and operating systems. Even the firmware of electronic devices falls under the technical viewpoint. While this viewpoint is valid, it makes it hard to distinguish data and difficult to bring data into the context of its usage or business operation.

Application viewpoint: A preferable way to look at data is to use the application viewpoint. Applications are (computer) programs, in other words software, designed to perform coordinated functions or tasks in order to solve particular user or business problems. In general they are constructed from (packaged) application code for the business functionality, supporting libraries (precompiled routines or frameworks) and data. If we took apart an application that uses the widespread three-tier architecture, we would see three different layers or tiers:

The presentation tier: The presentation tier, or user interface tier, as pictured above, is responsible for user interaction and operations. It presents the application’s results and interacts with users in a user-friendly way. Modern applications often interact via the web browser or mobile apps. The underlying technologies are then web technologies, such as HTML5, JavaScript and Cascading Style Sheets (CSS).

The business logic tier: The business logic tier has the role of transferring the information (data) between the presentation tier and the data tier. This interaction can include complex validation and business logic, such as decision rules, transformations, calculations, aggregations and complex statistical models. This tier can be developed with many different techniques and programming languages. The selected technologies often come from the preferences of the developers and engineers.

The data tier: The data (storage) tier, also known as the application data store, is responsible for persisting (storing) the application’s data. In general databases are used, but a simple file holding the application data might do the job as well. The type of database engine depends on the data structures the application uses, the required flexibility, and non-functional requirements such as performance.
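
To make the three tiers a bit more tangible, here is a minimal sketch of how they could fit together inside one application. All names are illustrative assumptions, and the text-based “presentation tier” simply stands in for a web or mobile front end:

```python
import sqlite3

# --- Data tier: persists the application's data ---
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, amount REAL)")

# --- Business logic tier: validation, rules and calculations ---
def register_invoice(amount: float) -> int:
    if amount <= 0:
        raise ValueError("An invoice amount must be positive")
    cur = db.execute("INSERT INTO invoices (amount) VALUES (?)", (amount,))
    db.commit()
    return cur.lastrowid

# --- Presentation tier: user interaction (here a simple text interface) ---
def handle_user_input(raw_amount: str) -> str:
    invoice_id = register_invoice(float(raw_amount))
    return f"Invoice {invoice_id} registered"

print(handle_user_input("1200.50"))
```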

When looking through the application lens, the data tier is a collection of all the values the application works and interacts with. When users interact with the presentation tier, calls are made to the business logic tier, which in turn communicates with the data tier. The data that sits in the data tier is generally specific to the context and purpose for which the application is used. Additionally, the data is usually managed only by the application itself, which means that there are clear application boundaries around the data.

Note: The application state represents the totality of everything necessary to keep your application running. The application state includes the application data, objects in memory, the libraries and the code required to operate the application. Some people use the term application state to refer only to the application data, but according to this definition that is not correct.

Not all data sits in application data tiers, because data can also be created in the “surrounding space” of applications. Logging information, temporary files and in-memory data are typical examples. This surrounding data can also be used to generate insights, but isn’t directly part of the data tier. From the log data, for example, we can retrieve which users accessed the application and what actions they performed. These audit trails might be of value to the security department.

Note: What defines the scope and boundaries of an application? The way application boundaries are drawn depends on the viewpoint that is taken, because “applications” mean different things to different people. A viewpoint you can use is how application code and data are bundled, deployed together and isolated from other applications. Another viewpoint can be the virtual address spaces, such as virtual machines, in which the different processes run. Yet another viewpoint can be integrated code that shares the same language runtime and talks to other integrated code via service layers.

Business viewpoints can be used as well: an application can be all the code and data that are used together to fulfil a particular business need. Such a viewpoint can draw boundaries that span different physical environments, runtime environments and platforms. Boundaries can also follow (political) ownership and organizational structures within the company. Briefly summarised, the application boundary depends highly on the viewpoint you take.

Since we’ll be discussing the aspects of data management, but especially data integration, the application viewpoint is the one we will use throughout this article. I can also define data more specifically as: “any values from an application that can be transformed into facts and eventually information”. Now that we have a clear definition of what data exactly is, we can have a look at how data is created.

Data origination

One of the key aspects of data is its origination (creation). Data in applications normally comes from users or processes using the applications. The usage (interaction) of the application leads to the creation of new data within the (application) data tier. This can happen via normal human interaction, like filling in forms, pressing buttons on websites, or other interaction through web browsers. The communication and data creation in these examples take place via the presentation tier of the application.

Another way data can be created is via processes that run inside the application. These processes typically utilize functionality from the business logic tier to spin up new processes or to analyse and process data, thereby creating new data. A typical example would be a financial process that calculates and stores the total sums of revenues and expenses at the end of the day. This data creation brings us to the concepts of the Golden Source and Golden Dataset.
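
Before moving on, here is a minimal sketch of such an end-of-day process creating new data from existing data. The table names, columns and amounts are illustrative assumptions only:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE transactions (id INTEGER PRIMARY KEY, kind TEXT, amount REAL, booked_on TEXT);
CREATE TABLE daily_totals (booked_on TEXT, kind TEXT, total REAL);
INSERT INTO transactions (kind, amount, booked_on) VALUES
  ('revenue', 1200.0, '2023-01-31'),
  ('expense',  300.0, '2023-01-31');
""")

def close_of_day(day: str) -> None:
    # A business logic tier process: derives new data (the totals) and persists it.
    db.execute("""
        INSERT INTO daily_totals (booked_on, kind, total)
        SELECT booked_on, kind, SUM(amount) FROM transactions
        WHERE booked_on = ? GROUP BY booked_on, kind
    """, (day,))
    db.commit()

close_of_day("2023-01-31")
print(db.execute("SELECT * FROM daily_totals").fetchall())
```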

Golden Source

The Golden Source and Golden Dataset are foundational concepts and will be referred to many times. They overlap with other industry definitions as well, so I’ll first provide all the definitions and explain the differences afterwards:

Golden Source: Golden sources are the authoritative applications where all originally created data is managed within a particular context. They are made up of one or more golden datasets. Golden sources are described in more detail in the book “Mastering Your Data” by Andy Graham.

Golden Dataset: A golden dataset is the authoritative original data created. This dataset is unique and must be accurate, complete and known.

Golden records: The golden source and golden dataset have similarities with what people call golden records. Tech Target defines a golden record as “the ‘single version of the truth’, where ‘truth’ is understood to mean the reference to which data users can turn when they want to ensure that they have the correct version of a piece of information. The golden record encompasses all the data in every system of record (SOR) within a particular organization.”

System of Record: Some people, like Bill Inmon (The System of Record in the Global Data Warehouse, 2003), use the term System of Record for the original data. Systems of record classify systems or databases as the authoritative data source for given data or information.

I personally see systems of record as the transactional or operational systems. Although these systems are accurate, it is likely that they also exchange and retrieve data from other systems. Determining what the original records are within such a system consequently requires an additional exercise. Golden records aren’t precise either, since they refer to all the data within the systems of record. I therefore find the golden source and golden dataset stricter terms for determining in which applications data is managed, and what data is truly unique.

Viewpoint of the layers

By stating “any values” in the definition of data, we don’t limit ourselves to only the application data tier or data layer. As we already discussed, data isn’t only stored in databases. Files, logging information, temporary files and caches (in-memory) can be of interest as well.

From purely the application standpoint, the operating system’s data isn’t directly of any interest, because we wouldn’t consider the operating system to be an application. Operating systems are used to support and keep applications running. The operating system’s components, code and all other operating system data aren’t directly relevant, since this data doesn’t add value within the business context in which applications are used. Data is only important within a specific business context.

But if we change the angle towards the operations or IT department, operating systems suddenly become the IT department’s applications for running and managing other applications. The IT department probably looks at application monitoring, data movements, auditing and configuration items. Suddenly the data generated by the operating system becomes the source of truth from which the IT department generates its facts. I have visualised the different viewpoints and the most common technology layers.

What we have just learned is that the viewpoint, or business context in which applications are used, is crucial for defining what data is. Viewpoints change the application scope and therefore the corresponding data we take into account.

Note: “Are there applications without any data?” Applications can exist without a database, but in such a situation data would be stored in memory, in files or caches. Application functions can interact with other applications, but still this requires some form of input or data exchange. When data comes in, it must be stored somewhere before the application (function) can start processing it. So applications always have some form of data storage, and thus always have an application data store.

Data Modelling

Now that we know what applications are and how data is created, I want to go deeper into data and application design. This first area of data management is called Data Modelling. Data modelling focuses on analysing requirements, scoping the application and making requirements concrete for designing and building the application’s database model. It is an area of data management that is important to understand.

A tale of two applications

Let’s look at a hypothetical example to understand how data integration and data modelling work, and also make the roles of the providing and consuming applications clear.

Finance department

Imagine two (fictitious) departments within a brand new, successful company. Each department has an essential role to fulfil. They have been set up completely separately from each other, although they belong to the same company.

The first department is the finance department, which processes all the customers’ invoices. Since all invoicing is currently done via email, the department decides to develop a modern web application.

The usual process of developing an application is to first start thinking about and determining what the business requirements and needs are. Cash receipts, invoices received, customers and suppliers are important concepts (terminologies) used within the daily operations of the finance department. Some of the concepts have additional characteristics: invoices typically have purchase order numbers, ordered items and amounts. Some concepts have strong relationships, while other concepts don’t directly relate to each other. This conceptual thinking is required before the actual application development can start. In the example of the finance department the concepts are captured on paper with a sketch.

In the process of planning this out, all concepts and business terminologies are specific to the perspective of the finance department. It is expected that the application and database design (structure or schema) will be tailored to these requirements. The table and column names in the database will likely receive the same names as the concepts that were captured. So, a receipts table and an invoices table will most likely be created. A counterparty table is created as well. A counterparty is the party on the other side of a transaction, which in the case of the finance department can be linked to a receipt or an invoice. Each table has a set of columns, which inherit their names from the concepts as well. So we likely end up with column names like unique_identification_code and order_number. One row will be inserted for every receipt or invoice.
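
A minimal sketch of what such a finance database structure could look like follows below. The tables, columns and classification codes are illustrative assumptions, not an actual schema from the book:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE counterparty (
    unique_identification_code TEXT PRIMARY KEY,
    name                       TEXT NOT NULL,
    classification_code        TEXT NOT NULL   -- e.g. RETAIL, CORP, LARGE_CORP
);

CREATE TABLE invoices (
    invoice_id        INTEGER PRIMARY KEY,
    order_number      TEXT,
    ordered_items     TEXT,
    amount            REAL,
    counterparty_code TEXT REFERENCES counterparty(unique_identification_code)
);

CREATE TABLE receipts (
    receipt_id      INTEGER PRIMARY KEY,
    invoice_id      INTEGER REFERENCES invoices(invoice_id),
    amount_received REAL,
    received_on     TEXT
);
""")

# One row is inserted for every receipt or invoice.
conn.execute("INSERT INTO counterparty VALUES ('CP-001', 'Acme Ltd', 'CORP')")
conn.execute("INSERT INTO invoices VALUES (1, 'PO-2041', 'licences', 1200.0, 'CP-001')")
conn.commit()
```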

The table and column names in the database can be created by hand, but another possibility is to generate the database structure with tools. I’ll come back to this aspect later, but in this case the database structure is created manually.

Since data integration is about combining different sources of data, we need another application from a second department.

Marketing department

Another department I want you to imagine is the marketing department, which has the task of influencing customer satisfaction and determining competitive product prices. This task involves a lot of research on what customers want. The research is currently done by hand and is a time-consuming process, so the department decides to develop an application with the business objective of better understanding the correlation between product prices, promotions and customer satisfaction. Again, this terminology is specific to the marketing department.

The application development in this example will be outsourced to an external company, so ideas and business requirements have to be well documented. The external company examines these and asks the marketing department to validate whether the diagrams correctly reflect the entities, dependencies and relationships of all concepts. The final result is a logical diagram that will instruct engineers to develop the application correctly.

Just like in the other example, all table and column names in the database will inherit the naming convention. So the marketing database will contain tables which might be given names like promotions, customers and channels. The difference in this situation is that the database structure isn’t created manually, but with a data modelling and design tool.

Building applications and designing databases, as you can see, is supported by a process of thinking, analysing business requirements, designing, and ordering concepts and needs. There are different phases in this design process. Each phase is represented by a data model, which has a certain degree of abstraction. Let’s examine each phase in more detail in the next sections.

Data Modelling and Design

Capturing requirements and translating these into a data design is essential for successfully delivering an application. The different design phases, and the process that supports them, are called ‘data modelling’. The disciplines and activities to capture, structure and organize the different structures and representations are consequently called data modelling management.

Data modelling management is an important area within data management, because it reflects reality and gives a representation of how data is used within a particular context and within the application. Capturing the business context and the different data models allows other people to gain insight into how the business requirements have been translated into the actual application and database implementation.

What you have observed in the metaphor of the finance and marketing departments is that there are different phases in the design of the application and database. Let’s have a detailed look at the different phases of data model creation. These concepts are widely used within the industry.

Conceptual Data Model

The first phase of data modelling is the conceptual design, in which business requirements and thoughts are captured and structured as concepts. In this phase it is important to define how concepts relate to other concepts, since this influences the other phases. For this, conceptual models are used. They are a structured business view of the data required to support business processes and organizational goals. Conceptual models are often established with entities, their attributes, such as definitions, and their relationships.

Conceptual data models, business data models or (business) information models are terms used by the industry for these types of data models. There aren’t any rules for the size and representation of these models. They can stay abstract or high-level, but they are also allowed to be detailed, carefully describing all attributes, dependencies, relationships and definitions.

There are a number of reasons why conceptual data models are often missing. One reason is that conceptual data modelling is unknown to a lot of business users and IT professionals. Many conceptual models are created implicitly, for example as a sketch on a whiteboard or in a document, but not digitally stored. Conceptual models are also often confused with other data models. Expertise and deep knowledge of conceptual modelling are in general rare. Others perceive it as too abstract or difficult, because it only supports the development approach and isn’t well suited for both business and IT professionals. Capturing is also difficult because, as of today, no good and easy tools exist to capture all conceptual data models in correlation with all applications. The majority of the existing tools cover only a narrow range of technologies.

Another reason why conceptual data models are missing is that people often question the added value of creating them. Within agile development or under time pressure, I have seen people decide not to capture any of the implicit thoughts or concepts at all. In these situations business users, developers and engineers take shortcuts to deliver the application on time.

A conceptual data model is not limited to a single application; in fact it can cover many applications or multiple databases. Multiple applications can work together to accomplish a specific business need. This means that the relation between a conceptual data model and an application (logical) data model is a one-to-many relationship. The conceptual data model can also capture more concepts than those representing the application itself. It can include concepts that are used to make the context clearer, but don’t play a direct role (yet) in the application. It can also include concepts that represent future requirements, but haven’t yet found their way into the actual design of the application.
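
Even a very lightweight, digital capture of a conceptual model is better than a whiteboard photo. The sketch below records concepts, their definitions and their relationships as plain data; the concepts themselves are illustrative assumptions based on the finance example:

```python
# Concepts, definitions and relationships captured as plain, storable data.
conceptual_model = {
    "concepts": {
        "Customer": "A party that orders products and receives invoices",
        "Invoice": "A request for payment for delivered products",
        "Household": "All people registered at the same legal address",
    },
    "relationships": [
        ("Customer", "receives", "Invoice"),
        ("Customer", "belongs to", "Household"),
    ],
}

for subject, verb, obj in conceptual_model["relationships"]:
    print(f"{subject} {verb} {obj}")
```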

Application Logical Data Model

The second phase of data modelling is to come up with a more formal representation of the data structure. Concepts in this stage are more concrete, which should allow users to better understand the application and database design, and to make sure all the requirements are met. In this design phase a more functional or logical structure is created with diagrams or objects. It is therefore called the application logical data model (ALDM), or the logical design of the application. When the designs are finished, the characteristics of the data are supposed to be clear. Concepts have made room for entities, attributes and properties. These entities, attributes and properties have primary and foreign keys and data types, and the relationships are clear (many-to-many, many-to-one or one-to-one). Some concepts might end up in lookup or reference tables. Since this model is more concrete, it is often called a functional or logical data model.
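
One way to make this concrete is to express the logical model in code. The sketch below uses SQLAlchemy’s declarative mapping (assuming SQLAlchemy is installed); the entities, keys and relationship are illustrative assumptions:

```python
from sqlalchemy import Column, ForeignKey, Integer, Numeric, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Counterparty(Base):
    __tablename__ = "counterparty"
    id = Column(Integer, primary_key=True)          # primary key
    name = Column(String(100), nullable=False)      # attribute with a data type

class Invoice(Base):
    __tablename__ = "invoice"
    id = Column(Integer, primary_key=True)
    order_number = Column(String(20))
    amount = Column(Numeric(12, 2))
    counterparty_id = Column(Integer, ForeignKey("counterparty.id"))  # foreign key
    counterparty = relationship("Counterparty")     # many-to-one relationship

print([c.name for c in Invoice.__table__.columns])
```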

Note: What are canonical models exactly, and are they the same as conceptual (data) models? A canonical model is a design pattern used to communicate between different data formats, regardless of the technology used. The term canonical, when used in data modelling, simply means “one”. Generally, when architects talk about canonical data models, they mean that there is one canonical (rule) language for the whole enterprise.

The application logical data model is also normalized. Normalization is the process of identifying the one best place where each fact belongs. It’s a design approach to minimize data redundancy and optimize data structures by systematically and properly placing data elements into the appropriate groupings. The process of normalization, as introduced by Edgar F. Codd, comes with normalization forms and degrees, such as Third Normal Form (3NF).
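
As a small sketch of what normalization achieves: in the denormalized design below the counterparty facts are repeated on every invoice row, while in the normalized (3NF-style) design each fact lives in exactly one place. All names are illustrative assumptions:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Denormalized: counterparty name and segment repeated per invoice row.
db.execute("""CREATE TABLE invoices_denormalized (
    invoice_id           INTEGER PRIMARY KEY,
    counterparty_name    TEXT,
    counterparty_segment TEXT,
    amount               REAL)""")

# Normalized (3NF): each counterparty fact is stored once and referenced.
db.executescript("""
CREATE TABLE counterparty (
    counterparty_id INTEGER PRIMARY KEY,
    name            TEXT NOT NULL,
    segment         TEXT NOT NULL);
CREATE TABLE invoices (
    invoice_id      INTEGER PRIMARY KEY,
    counterparty_id INTEGER NOT NULL REFERENCES counterparty(counterparty_id),
    amount          REAL);
""")
```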

The logical application and database design is the first concrete step towards the selected technology, which means that the technology choice also impacts the design. When choosing relational databases, the logical model in general will contain database tables, columns and keys, while for XML documents, the model will contain XML elements, attributes and references. Typically the chosen technology is a dominant factor in how the logical design looks.

Just like conceptual data models, ALDMs are optional as well. In agile development, for example, I have seen people not create any logical models. We also saw this in the example of the finance department. In some cases the creation of the ALDM is enforced by the chosen application development approach. Let me give two examples:

  • Certain data modelling tools allow you to create, maintain and distribute logical models on a small or even larger scale. Especially when the team is large, maintaining the logical models with good tools helps in understanding the complexity. Some well-known examples are Erwin Data Modeler, SAP PowerDesigner, Toad Data Modeler and Oracle Designer. Many of these data modelling tools generate the physical data model from the ALDM automatically. This is called forward-translation. The ALDM in this case is translated into a database-specific schema, e.g. for Oracle, MySQL, SQL Server, etc.
  • Popular development frameworks allow, or even force, the logical model to be designed and maintained in code. Many Model-View-Controller (MVC) frameworks allow “models” to be “documented” in code. These MVC frameworks can automatically generate the underlying physical model, which we will discuss in a minute. They are also flexible and not limited to a single database type. Additionally, these MVC frameworks support ‘schema evolution’ and ‘evolutionary design’. Rather than creating the whole logical model at once, designs can be created in iterations. For any change to the logical model, the framework manages the underlying physical model and handles the migration from the old to the new physical model (a minimal sketch of such a migration step follows below).
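
To illustrate what such a framework-managed migration boils down to under the hood, here is a minimal sketch of one evolutionary step. The table, column and migration name are illustrative assumptions, not the output of any specific framework:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Iteration 1 of the logical model: a promotion only has a name.
db.execute("CREATE TABLE promotions (id INTEGER PRIMARY KEY, name TEXT)")

def migrate_0002_add_discount(connection: sqlite3.Connection) -> None:
    # Iteration 2 adds a 'discount' attribute to the logical model; the
    # migration brings the physical model in line with it.
    connection.execute("ALTER TABLE promotions ADD COLUMN discount REAL DEFAULT 0")
    connection.commit()

migrate_0002_add_discount(db)
print([row[1] for row in db.execute("PRAGMA table_info(promotions)")])
```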

Current logical data modelling practice has several shortcomings as well. Forward-translation isn’t supported well enough for newer database technologies, which use modern syntaxes and exotic database designs that aren’t well covered by traditional data modelling solutions. Another shortcoming comes from modern development frameworks. They differ significantly from how traditional data modelling tools manage ALDMs: modern development frameworks manage the models internally, in the framework or code.

Consequently, and depending on the variety of databases and development frameworks, ALDMs become scattered. This makes it difficult to extract end-to-end insight into how business requirements have been translated and how databases have been designed.

Application Physical Data Model

The last phase of data modelling is the creation of the application physical data model (APDM). This phase isn’t optional, because it touches upon the true implementation. The physical data model, or application database design, is specific to the underlying database technology or product used. Every database system has its own unique characteristics, such as internal storage structures or indexing techniques. Some databases only allow uppercase field names, while others have restrictions on the field name length. The internal storage structure typically varies as well. Relational databases use tables, columns and rows. Document stores, on the other hand, use document structures (attribute-value pairs). These documents are similar to the objects used within programming.
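
As a small sketch of how the physical representation differs per database type, the same (illustrative) invoice is shown below as a relational row and as a document of attribute-value pairs:

```python
import json
import sqlite3

invoice = {"invoice_id": 1, "order_number": "PO-2041",
           "counterparty": {"code": "CP-001", "name": "Acme Ltd"},
           "amount": 1200.0}

# Relational physical model: tables, columns and rows (nested data is flattened).
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE invoices (
    invoice_id INTEGER PRIMARY KEY, order_number TEXT,
    counterparty_code TEXT, amount REAL)""")
db.execute("INSERT INTO invoices VALUES (?, ?, ?, ?)",
           (invoice["invoice_id"], invoice["order_number"],
            invoice["counterparty"]["code"], invoice["amount"]))

# Document-store physical model: the whole structure stored as one document.
document = json.dumps(invoice)
print(document)
```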

Security, integrity and performance requirements play an important role in the design of the database schema. Logically there can be a relation between two objects, while for performance reasons the two objects are stored in a single table. The consequence of this is that the physical database design can be quite different from the logical model. It also means that we can have the same logical model, but completely different implementations of the APDM. The same logical model can be used to implement a transactional system or a reporting system, each using its own physical design.

Note: The application physical data model is still a design, and could differ from what has actually been implemented. In order to create and modify the structure of database objects in a database, you typically need to make use of a Data Definition Language (DDL). Many of the data modelling tools create DDLs for you, but you need to be aware that the APDMs and the DDLs aren’t the same. DDL statements can also be used to enforce data scrambling (masking or obfuscation) or encryption, which makes it impossible for other users to read the database, including its structure. Consequently the true physical implementation and the physical design can differ.

Designing and implementing ALDMs and APDMs is a widely known practice, with many techniques and flavours. This discipline has also been pushed by the major traditional database vendors, who provide off-the-shelf packages with prefabricated data models to accelerate the implementation of specific solutions.

Linkage

The conceptual, logical and physical data models might stand on their own, but in an ideal situation these models are linked. In a perfect world you would be able to travel top-down from the conceptual data model all the way to the APDM(s). Concepts from conceptual data models are linked to entities of the ALDM(s), which in turn are connected to the entities and attributes of the APDM(s). Having these relationships would allow you to quickly find the underlying data and databases. You can also extract and reuse insights by seeing how concepts have been interpreted and translated into real-world implementations. It also helps to link the impact of data quality, data privacy and data security to specific parts of the business. If, for example, we know the physical data is of bad quality and that it relates to the concepts of households and promotions used within marketing, we can better determine the business impact.
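
A minimal sketch of such a linkage, often called vertical lineage, is shown below: a business concept linked to its logical entity and to the physical columns that implement it, so that an impact question can be answered quickly. All names are illustrative assumptions:

```python
lineage = {
    "concept": "Household",
    "logical_entity": "Household",
    "physical_objects": [
        "marketing_db.households.household_id",
        "marketing_db.customers.household_id",
    ],
}

def impacted_objects(concept: str, links: list[dict]) -> list[str]:
    # Given a concept with a known data quality issue, find the physical data.
    return [obj for link in links if link["concept"] == concept
            for obj in link["physical_objects"]]

print(impacted_objects("Household", [lineage]))
```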

The industry has started to recognize the importance of quickly finding data based on the different models as well. Processes like reverse engineering or scanning try to detect what the logical model might look like, based on the physical data structures and the data examined. Tools that support this process are emerging. Terminologies like vertical lineage or linkage are used for connecting the different models.

Data modelling helps organizations to better understand and find data, which is important given the enormous growth of data. Data modelling also enables data to be reused more easily, because translations from business concepts to the actual data can be “replayed”. It also plays an important role in designing applications for data-intensive usage. The way data is stored, structured and processed determines the speed at which data can be retrieved from applications. Mastering the skills of data modelling is essential for engineering a modern data architecture, as well as for practicing software architecture. A good database design is an essential part of the overall application design.

The area of data modelling is a nice bridge to data integration. These areas have a strong relationship, because data modelling is a core task in the data integration work of combining, transforming and storing different sources of data.

Data Integration

Now that we know how data is created and how applications and databases are designed, we also have to consider that data can come from other applications. The origination of data in these situations takes place outside the usual boundaries of the application. Consequently, if data is coming from other applications, a form of interfacing is required, because data isn’t transferred and copied from one application to another by itself. When multiple applications communicate, there is data interoperability. Additionally, there is the process of combining and consolidating data, which is called data integration. This process includes all the activities and procedures for moving, combining and transforming data.

Note: Some people use data interoperability and data integration interchangeably, but these terms are not exactly the same. Data interoperability is the capability to communicate, invoke functions, or transfer data among various applications in a way that requires little or no knowledge of the application characteristics. Data integration, on the other hand, is about consolidating data from different (multiple) sources into a unified view. This process is often supported by additional tools, such as replication and ETL tools.

Let’s refer back to the finance and marketing departments and continue our story. This time we’ll focus on connecting the departments’ applications. Both applications are based on unique departmental requirements. There isn’t any relationship or correlation between the two departments, because they haven’t been working together.

Someone from the finance department meets another person from the marketing department at the coffee machine. They explain their roles and responsibilities and become enthusiastic about working together. Data from finance can help marketing to better target the right customers. The marketer and accountant agree to schedule a meeting about how to establish an interface between both applications in order to exchange the data.

As long as the two applications aren’t connected, there isn’t any interface.

This situation changes when the departments create an interface between their applications.

Data Provider & Data Consumer

There are two fundamental application roles that frame data integration. One department has clear needs and sets the requirements. The other department creates data and has been asked to deliver it. There is a data provider and a data consumer. In the example, the marketing department has the data consumer role and the finance department the data provider role. The data provider is positioned on the left in the image above and the data consumer on the right.

Data Provider: The data provider, or the providing application, is where the data is produced (data origination) and provided from.

Data Consumer: The data consumer, or the consuming application, is where the data is transferred for further consumption.

The roles of the data provider and data consumer are essential and will frame any architecture! Now that the roles of the applications and users have become clear, the interpretation of the data can start.

Conflicting (business) contexts

To ease data usage, both departments agree to write down all their concepts to get a better understanding of the different viewpoints. Both departments start documenting their data. Finance starts with the definitions of concepts like receipts, invoices and addresses, but also the counterparty domain values like retail customers, corporate customers, large corporate customers, etc. Marketing does the same and ends up with a list of customers and promotions, but also the types of promotions, e.g. late-night promotions, double discount promotions, half-price promotions, etc.

After both departments have done their homework, they come together to present their lists of definitions to each other. There’s a lot of overlap in the terms they use. Customers, addresses and identification numbers exist on both sides. A customer, for example, is from a marketing perspective only a retail customer, but from the finance perspective a customer includes both the retail and corporate dimensions. Concepts conflict, and concepts with the same name have a different meaning.

This is a simple example, but I want to illustrate the situation with another, more complex, example. Finance is legally obliged to register all customers’ identity information. So whenever a customer orders a product or receives an invoice, the department asks for their formal identity, which includes formal names and legal addresses. It also uses the term household: all people registered at the same legal address are grouped together and linked to the same household.

Marketing uses data from finance, so they copy all the customer data into their marketing database. This data is extended with data from the online web channels, operations departments and customer care systems. The objective is to keep track of which customers have been contacted and interacted with. When marketing analyses all the results, they discover hidden patterns and new individuals in the dataset. The marketers also discover that people have been contacted at different addresses, or that people submitted forms using addresses that differ from the customer administration. Consequently, the marketers make adjustments in the marketing database. The concept of the household changes, because missing individuals are added to a household. Some people are moved from one household to another.

Based on this example you could argue that either the customer administration or the marketing administration is invalid, because customers have been adjusted. But in this example both the finance and marketing viewpoints are valid. Finance looks at the data from a legal perspective. Marketing started with the same data, but enriched and adjusted it based on their marketing viewpoint. Both viewpoints are completely correct. Some people might argue that within the company you must try to create one single version of the truth, but my personal experience is that bringing the legal and marketing departments together ends up in endless debates. Unification across the entire company is extremely difficult, if not impossible.

The usage of viewpoints is recognized by the industry. Some people refer to domains, subject areas or bounded contexts to group everything that belongs to a specific viewpoint.

Let’s continue our story with the two departments and focus on the data distribution method and interface between the two applications, which allows the data to be exchanged.

Push or Pull

Finance and marketing have to make a number of choices about how to exchange data between applications. The first important choice they will have to make is whether the data will be ‘pulled’ or ‘pushed’ from the finance application.

Push: Once the data provider identifies that it has data to be sent to the other party, the data is forwarded from the data provider to the data consumer. The data provider in this model has an active role and the freedom to determine when to initiate the transmission. It is up to the data consumer to handle the received data.

Pull: The data provider waits for the data consumer to reach out. The data provider then has to acknowledge the request and start delivering the data. In this model the data consumer has an active role and also the freedom to decide when to start requesting the data.

Deciding whether to push or pull the data requires a number of considerations. Pulling the data requires the data provider to be ready. There might be situations where pulling data out is difficult: if systems are under a lot of stress (load) and a large data set is required, it might be difficult to handle the request as a whole. Large data requests can overload systems, which creates the risk of unpredictable behaviour, crashes, unavailability or, even worse, data corruption.

Pushing out the data, on the other hand, requires data consumers to be ready. If data providers push data straight into the applications of the data consumers, the same unpredictable situations can occur, because the consuming applications can also be put under stress.

A comfortable method of distribution is to make data extracts and place these copies in a separate location. This method gives both applications the convenience of starting to process when ready. It also decouples (or loosely couples) both applications. One application doesn’t have to wait for the other, and vice versa. When pulling or pushing data directly, applications are tightly coupled, because they both need to process the data at the same time.
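
A minimal sketch of this decoupled hand-over is shown below: the provider pushes an extract to a separate staging location, and the consumer pulls it whenever it is ready. The file name and fields are illustrative assumptions:

```python
import csv
import pathlib
import tempfile

staging = pathlib.Path(tempfile.mkdtemp())  # the separate, shared location

def provider_push_extract() -> pathlib.Path:
    # The data provider decides when to write the extract (push).
    extract = staging / "finance_counterparties.csv"
    with extract.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["unique_identification_code", "name", "classification_code"])
        writer.writerow(["CP-001", "Acme Ltd", "CORP"])
    return extract

def consumer_pull_extract(extract: pathlib.Path) -> list[dict]:
    # The data consumer decides when to read and process the extract (pull).
    with extract.open(newline="") as f:
        return list(csv.DictReader(f))

path = provider_push_extract()
print(consumer_pull_extract(path))
```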

Data in transit, Data at rest & Data in use

The push or pull transfer method typically uses the network for the data interoperability. Copying data via the network is called data in transit. Once data is copied and stored, it typically sits in a database or file. We call this data at rest. Another option is that data sits in memory at the moment it is being used or processed inside the application. This is called data in use.

Some people also refer to data in motion. Data in motion is considered to be slightly different, because data is moved around within the boundaries of the application infrastructure itself, not transiting off the server or leaving the physical infrastructure or hardware. Sometimes the servers have a direct private connection and share the underlying storage. Data can then be copied via the underlying shared (network) storage environment, not over the traditional network. This pattern also counts as data in motion.

Data Transformation

After the two departments have selected the data exchange pattern, the real data integration work starts. The marketing department is unpleasantly surprised when examining the data from the finance department for the first time. The structure of the data is not what was expected, because the naming convention for tables and columns is completely different. Also, the codes and values used are difficult to interpret, because the marketing department isn’t familiar with counterparty classification. Both departments come from different contexts with different perspectives. The marketing department needs to interpret the data and adjust it to its own needs before using it.

Note: What you have discovered in this situation of data interpretation is essential to understand. Every application has been designed to address specific business needs. Applications are unique and so is data.

Every application is used within a specific, unique context. Schema designs, column names, field names and data values are all designed and created from this unique context. Applications are never the same. When applications exchange data, a data transformation is always the next necessary step. There is no escape from this pattern.

The data integration step of getting data out, transforming it and loading it into the target application conceptually sits in between the applications. In the image below I have illustrated how this could work for the two applications.

Note: Can we truly not escape from the data integration dilemma? When I say “you can’t escape from data transformation”, people of course always try to come up with exceptions to this rule. Yes, there might be a situation where tables are copied identically into another application. In such a situation the (context) transformation is expected to happen inside the application, within the code and on the fly. Or a backup of an application is created and data is restored, but in that case it is the same application. These examples are really the exceptions. The fundamental reason for requiring a data transformation is the difference in context in which applications are designed, developed, created and used.

This whole process of translating the data based on the new requirements and making the data compatible with the new (target) structure is called a data transformation. The combined steps of Extracting, Transforming and Loading the data into the target database are what the industry calls ETL. People use different terminologies for similar patterns, so we will work with the following definitions:

Data integration: Data integration comprises the activities, techniques and tools required to consolidate and harmonize data from different (multiple) sources into a unified view. The processes of extract, transform and load (ETL) are part of this discipline.

Data transformation: Data transformation is the process of making the selected data compatible with the structure of the target database. Examples are: format changes, structure changes, semantic or context changes, de-duping and re-ordering.

Data mapping: Data mapping is the process of creating data element mappings between two distinct data models. This activity is considered to be part of data integration.

Data interoperability: Data interoperability is the ability for multiple applications to communicate (and exchange data).

Extract, Load and Transform: The difference with ETL is that the target structure is created after the data has been extracted and loaded into the target database.

Schema on Write, Schema on Read: With Schema on Write the (target) schema is known and must be created before any data is written to the database. Schema on Read differs from Schema on Write, because the schema is applied at the moment of reading the data and doesn’t have to exist beforehand.
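
To tie these definitions together, here is a minimal ETL sketch for the two departments. It reuses the illustrative finance schema from earlier and the table, column and classification names are assumptions, not actual schemas:

```python
import sqlite3

# Provider: the finance application's data store.
finance_db = sqlite3.connect(":memory:")
finance_db.executescript("""
CREATE TABLE counterparty (unique_identification_code TEXT, name TEXT, classification_code TEXT);
INSERT INTO counterparty VALUES ('CP-001', 'Jane Doe', 'RETAIL'), ('CP-002', 'Acme Ltd', 'CORP');
""")

# Consumer: the marketing application's data store, with its own structure and naming.
marketing_db = sqlite3.connect(":memory:")
marketing_db.execute("CREATE TABLE customers (customer_id TEXT, display_name TEXT)")

# Extract
rows = finance_db.execute(
    "SELECT unique_identification_code, name, classification_code FROM counterparty").fetchall()

# Transform: rename to the marketing context and keep retail customers only.
transformed = [(code, name) for code, name, classification in rows
               if classification == "RETAIL"]

# Load
marketing_db.executemany("INSERT INTO customers VALUES (?, ?)", transformed)
marketing_db.commit()
print(marketing_db.execute("SELECT * FROM customers").fetchall())
```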

When the finance department and the marketing department attempt to connect their applications, the marketing department already struggles. Now imagine taking many applications and data sources into account. Data integration and data modelling can be extremely intense and involve an awful lot of work. Some projects take months or even years. The hard work sits in interpreting, translating and mapping the data. Documentation and insight into the data modelling effort will lighten the pain, but there’s no escape, since every application is developed within a unique context.

Business logic and mappings

For building data transformation logic, there are additional considerations to be made. One is on which side to place the transformation logic. Is it better to transform on the data provider or consumer side? To answer this question, let’s go back to the two departments and continue the story.

Via the interface the data becomes available in the marketing application, which allows marketing to easily identify counterparties and their prospects. Marketing, however, finds it easier to filter out all non-retail counterparties with a business rule. Marketing argues that finance should take care of the mapping and filtering logic, because the counterparty classification codes originated from their application. Finance agrees and places a mapping table (see image below) inside their application. From now on, the finance department’s application filters all counterparties before transferring them to the marketing application.
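
A minimal sketch of what such a provider-side mapping table and filter could look like is shown below; the classification codes and field names are illustrative assumptions:

```python
# Mapping table maintained by the finance (provider) application.
counterparty_mapping = {
    "RETAIL": "retail",
    "CORP": "non-retail",
    "LARGE_CORP": "non-retail",
}

def filter_for_marketing(counterparties: list[dict]) -> list[dict]:
    # Applied inside the provider's application before data is transferred.
    return [cp for cp in counterparties
            if counterparty_mapping.get(cp["classification_code"]) == "retail"]

sample = [{"code": "CP-001", "classification_code": "RETAIL"},
          {"code": "CP-002", "classification_code": "CORP"}]
print(filter_for_marketing(sample))
```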

There are a number of consequences and considerations in building business logic and adding mapping rules. Building consumers’ business requirements into provider applications requires the providers to always make the changes. If the pace of change is high, this maintenance becomes a burden. Providers might be busy and have no time, which forces consumers to wait. I’ve included a few scenarios that might influence who should be responsible for maintaining these changes.

Pace of change

In the example you observed that consumer logic has been placed in the data provider’s application. Imagine the following situation: a data consumer asks multiple data providers to conform and incorporate consumer-specific logic inside each application. In such a situation multiple data providers must process any changes coming from the data consumer. Effectively distributing these changes is a major coordination challenge. All the data providers need to change and conform simultaneously, which is a problem because the pace of change can be totally different between all parties. If data providers have no time, the delivered data might be out of sync, leading to incorrect data.

To better keep up with the changes, it would be better to have all complexity around managing definitions, transformations, mappings, conflicting requirements, and so on, maintained in a single place. The same argument applies within software architecture: duplicating and distributing the same code or business logic is seen as an anti-pattern, a common response to a recurring problem that is usually ineffective and risks being highly counterproductive.

On the other hand, if the pace of change on the data provider’s side is high and the mapping is done on the data consumer’s side, then it might be a challenge for the data consumer to catch up. The data consumer has to be made aware of upcoming changes from the data provider and has to respond immediately. My recommendation is that data consumers and data providers work closely together and decide what abstractions work best given the different paces of change on each side.

Application specific knowledge

Another common integration dilemma I want to bring up is the situation where interpretation of the data is really difficult and requires an awful lot of application knowledge. This is the case when the data is not a clean representation, but instead uses complex codes or hard-to-understand values. Exposing this data and the underlying technical physical structure is probably not what data consumers will like. Adding a form of abstraction will help data consumers understand. Typically in complex systems with thousands of tables, lots of programming logic and complex data, it is reasonable to ask the system owner to add additional functionality so users can better understand the data. This is often called an abstraction layer, because it hides the complexity from the users.
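
One common way to provide such an abstraction layer is a database view that translates cryptic internal codes into understandable values, so consumers never have to touch the raw structure. A minimal sketch, with all table, view and code names as illustrative assumptions:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- The 'complex' internal structure with cryptic names and coded values.
CREATE TABLE cp_mstr (cp_cd TEXT, cp_cls INTEGER, cp_nm TEXT);
INSERT INTO cp_mstr VALUES ('CP-001', 1, 'Jane Doe'), ('CP-002', 2, 'Acme Ltd');

-- The abstraction layer: a view exposing understandable names and values.
CREATE VIEW counterparties_readable AS
SELECT cp_cd AS counterparty_code,
       cp_nm AS counterparty_name,
       CASE cp_cls WHEN 1 THEN 'retail' WHEN 2 THEN 'corporate' ELSE 'unknown' END AS segment
FROM cp_mstr;
""")
print(db.execute("SELECT * FROM counterparties_readable").fetchall())
```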

Responsibility falls to data consumers

A good recommendation for mappings or business logic implementations is to group based on what belongs and changes together. If the data consumer consumes data from a data provider, it is highly likely that the data must be adjusted to the data consumer’s requirements. Otherwise there wouldn’t be a need to get data from the data provider. Since in general data consumers set the requirements, and the pace of change on the data consumer’s side is higher than on the data provider’s side, it is valid to do the heavy lifting and most of the transformation work on the data consumer’s side. A ground rule is that data providers should not conform their data model to the specific needs of data consumers, because that would require the provider to adapt their application too much. So, data consumers are responsible for setting requirements and therefore also responsible for the data transformation into the new context. Data transformations, consequently, will be pushed as close as possible to the data consumers.

The only exception might be situations where the pace of change on the data provider’s side is high. In such a situation it is difficult for data consumers to be aware of all the dynamics of the data provider(s). Therefore it is reasonable to ask the data provider to take care of some of the data consumer’s business logic or mapping rules. The data can be provided together with the mapping table, which would give the data consumer control. Or an abstraction to a more stable value range can be provided.

Multiple applications using one database

Instead of having two applications with two databases (as is the case with the finance and marketing departments), why not use a single shared database for the two applications? Multiple applications would then share one single database with integrated data structures. You could argue that both the data provider and the consumer would benefit, because they can read directly from each other’s tables.

First of all, using a shared database only works if all applications can technically share the same database. Databases have different characteristics, protocols and data structures. Trade-offs have been made in terms of performance, integrity and consistency. Some applications require a specific vendor to work with. Many vendors have extended the SQL syntax with additional parameters and functions. Applications might rely on these specific (proprietary) extensions vendors provide, which again makes it impossible to share a database. Sharing databases between applications only works if all these different criteria match up.

Second, sharing databases also means sharing integrity. Changes in data need to be precisely coordinated across all other applications, with the risk of breaking them. This shared model leads to a tight coupling of all applications. Each application development project has its own heartbeat. With a shared database model the different application development teams have to wait on each other, resulting in delays and wasted time.

Third, a form of unification is required, since teams have to agree on common concepts and language, while requirements can be completely different or even conflict. This can cause problems if the relationships conflict: for example, one application might want to enforce a ‘one-to-one’ relationship between objects, while another application requires a ‘one-to-many’ or ‘many-to-many’ relationship.

Fourth, linking data to business objectives will be complex, because different applications share the same database. Consequently, assigning data to owners and enforcing accountability might be difficult.

Bottom line: a shared database model for multiple applications with different business concerns will eventually lead to serious problems and must be avoided at all times!

Instead of applications sharing databases, there is another possible scenario of sharing: different databases share the same hosted (physical) infrastructure. Each application is expected to have its own dedicated and isolated database. Databases in this scenario aren’t directly shared between applications. Every application can only access its own database. There is tight coupling with the underlying hardware, so multiple applications fail if the shared hardware goes down, but between the databases and the other applications there isn’t any direct coupling.

One application with multiple databases

Contrasting with multiple applications using a single database is the scenario of a single application with multiple databases, an idea that is growing in popularity. This mix of databases, and possibly multiple programming languages, used within the application is called a polyglot design; the idea is that different languages and databases are suited to solving different problems. The advantage is that the application can delegate specific work (problems) to optimized databases, because certain databases are better positioned to handle specific workloads.

Polyglot design also enables developers to work faster, because less work has to be performed at the application level, since specialized databases take care of the complex problems more easily. There’s also another concept called microservices, in which many application components work closely together. These components communicate in a similar way to how applications communicate.

To summarize, we can classify three different design patterns for using databases:

Dedicated database design

An application has its own dedicated isolated database, which is a single instance. No other application accesses this dedicated database directly. The application and its dedicated database are responsible for maintaining data integrity.

Polyglot database design

A polyglot database design, also known as polyglot persistence, means that applications use a mix of languages and database types, taking advantage of the fact that different languages and databases are better positioned for tackling different problems. Each database has a specific purpose (e.g. integrity, performance, scalability, data structures, etc.).
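
As a minimal sketch of polyglot persistence, the application below delegates different workloads to different stores: a relational database for orders (integrity, transactions) and a simple in-process key-value structure standing in for a document or cache store. The store choices and names are illustrative assumptions:

```python
import json
import sqlite3

orders_db = sqlite3.connect(":memory:")   # relational store: integrity and transactions
orders_db.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

product_catalog: dict[str, str] = {}      # key-value/document style: flexible structures

def place_order(customer: str, amount: float) -> None:
    orders_db.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", (customer, amount))
    orders_db.commit()

def cache_product(sku: str, attributes: dict) -> None:
    product_catalog[sku] = json.dumps(attributes)

place_order("Jane Doe", 49.95)
cache_product("SKU-1", {"name": "Widget", "tags": ["promo", "new"]})
print(orders_db.execute("SELECT * FROM orders").fetchall(), product_catalog)
```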

Shared database design

A single database is used by multiple applications. The integrated data structure and consequently the integrity is shared.

Summary

Data integration is an inescapable pattern. Each application has its own unique context that requires data to be transformed when it is exchanged from application to application. I strongly recommend that you delineate a clear scope for all applications, and avoid sharing databases directly. Instead, each application must use its own dedicated and isolated database(s).

I believe that the key to mastering data integration is connecting it tightly to the practice of data modelling, which provides insights on both the business and technical levels. Gaining these insights will allow you to find data quickly and visualize how business concepts will translate into your application’s real-world implementation.

Countless companies fail to implement data management properly, and that’s understandable given the changing data landscape, the rapidly increasing amount of data, and the accompanying integration challenges. My view and perspective on data management and integration differ from others’. If you are curious, I encourage you to have a look at my book Data Management at Scale. This book will allow you to take a deep dive, far beyond what you’ll find on the internet, into disciplines like data integration, data governance, data security, data quality, master data management, and metadata management.
