Data warehousing and business intelligence play an important role in many, if not all, large organizations working on turning data into insights that drive meaningful business value. Data warehousing is the process of collecting and managing data from varied sources to provide meaningful business insights. Business intelligence comprises the strategies and technologies used for providing these insights. Both areas can be considered part of data management. They are heavily intertwined with the other data management areas and depend strongly on data integration.
Data warehousing became popular during the nineties and started as a common practice for collecting and integrating data into a harmonized form, with the objective of creating a consistent version of the truth for the organization. This consistent version of the truth became an important source for business decision-making within the company.
Such decision-making was supported by another trend: business intelligence. Business intelligence, as defined by Gartner, is “an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance”. Business intelligence started with producing data and simple reports in a nicely presentable way, but at a later stage self-service was introduced, with more advanced capabilities, such as in-memory processing and predictive functions.
Many enterprise organizations rely heavily on data warehouses, many of which have been developed during the last decade. They are considered critical assets, are used in many daily processes and fulfill an important role in data consumption and distribution within the company. In this post I will explain some of the popular approaches and common characteristics and unravel the enterprise data warehouse concept, which differs in scale.
WARNING: I need to give you a warning as we continue. It is my strong belief that the enterprise data warehouse (EDW) will soon become extinct. Not the underlying concepts of data warehousing themselves, because the need for data harmonization, for bringing large amounts of data into a particular context, will always remain. What will become extinct is using a data warehouse for enterprise-wide data integration, consumption and distribution. I’ll explain why in this post.
Many companies are still investing heavily in data warehouses. The size and scale in which data warehousing architectures are used however can differ significantly. Before I tell you my viewpoint on this architecture let’s step back and start from the beginning.
What is a Data Warehouse?
What is a data warehouse? The core concept of data warehousing was introduced in 1988 by IBM researchers Barry Devlin and Paul Murphy. As computer systems became more complex and the amounts of data increased, an architectural model was required for the flow of data from operational systems to support decision environments. Bill Inmon, a few years later, published a book and gave a more concrete definition about what data warehouses are meant for: “a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.”
As data warehouses gained popularity, variations of the model with different definitions started to emerge. I like to define a data warehouse as “a system which harmonizes (historical) data from different sources into a central repository and manages it, with the main objective to support the decision-making process”. From this sentence we can extract a number of important characteristics. From the word central you can conclude that in a data warehouse data is brought together and centrally stored. In many cases this is indeed true, but some technology trends, such as data virtualization, allow a data warehouse to be developed in other ways. From the word harmonize we can derive that data is integrated, unified and made consistent. This means additional components are needed to collect, cleanse, fix, transform, and store data. By stating different sources we make explicit that the number is plural: a data warehouse isn’t used to collect data from only a single source.
Bill Inmon once said that “one of the great appeals of the data warehousing concept is there is a single version of the truth”. This single version of the truth means that there’s a single vocabulary used, and all users must agree with the meaning and definitions for all data used within the data warehouse. We will cover this single-vocabulary aspect in more detail later.
This sounds comprehensive, but why would you want a data warehouse in the first place? What are the benefits? To better understand why data warehouses are useful, we need to step back, explain some of the key drivers and first look more closely at how transactional and analytical systems work.
OLTP (Online transaction processing)
The space of applications and databases traditionally has been divided into two worlds. Many applications started as transactional or operational systems, as the world of computers began with the requirement to handle transactions and store records. These systems made processes more efficient, as they were capable of replacing traditional card catalogs and many manual processes: banking systems, airline ticket booking systems, order management systems, customer systems, telecom systems, healthcare systems, shopping systems and so on. These applications are called online transaction processing (OLTP) systems, given their crucial operational role.
What’s important for OLTP systems is that both consistency and stability must be guaranteed. Imagine you run a large regional telecom company and the telecom system goes down. The impact can be tremendous. Customers can no longer make phone calls, answer their phones, receive text messages, etc. If the system stays down for too long, the business model will be impacted. Customers lose trust and walk away. OLTP systems, therefore, are designed for data integrity, system stability and availability. The large majority are ACID compliant.
An observation we can make about OLTP systems is that the (operational) workloads are usually quite predictable. We know how OLTP systems are used and what typical loads to expect. Queries are relatively straightforward and the retrieved data is relatively low in volume: read a record, update a record, delete a record, etc. The underlying physical data model is designed based on these predictable queries. Tables in OLTP systems are usually normalized; ideally every single attribute is stored only once.
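To make the normalization idea concrete, here is a minimal sketch using SQLite. The schema and data are hypothetical: each attribute (such as a customer’s city) lives in exactly one place, and the typical OLTP query is a short, predictable point lookup by key.

```python
import sqlite3

# Hypothetical order-management schema: a minimal sketch of normalization,
# where each attribute (e.g. a customer's city) is stored exactly once.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    );
""")
conn.execute("INSERT INTO customers VALUES (1, 'Alice', 'Amsterdam')")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 25.0), (11, 1, 40.0)])

# Typical OLTP access: a short, predictable point lookup by primary key.
row = conn.execute(
    "SELECT name, city FROM customers WHERE customer_id = ?", (1,)
).fetchone()
print(row)  # ('Alice', 'Amsterdam')
```

Note that the two orders reference the customer by key instead of repeating the name and city, which is exactly what keeps writes cheap and consistent.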
This normalized design comes with some drawbacks: operational systems are not designed to provide a representative, comprehensive view of what is happening in the domain or business. Pulling data out of heavily normalized data models for complex questions is often difficult and performance intensive. Complex questions require more data and combinations of data. Writing such a query requires many tables to be joined or grouped together. These types of queries are typically fairly performance intensive, so executing too many of them introduces the risk of reaching performance limits. If that happens, the system becomes unpredictable, and that is the last thing we want to see happening with OLTP systems, given the trust we need to provide.
The consequence of having high integrity and high performance for availability is that OLTP systems are expensive. They are in general not positioned to retain large amounts of data, so data life cycle management is important to keep them healthy and fit for purpose. Typically unused data is either removed or moved to a secondary location, from which it can always be placed back if it becomes relevant again.
OLTP systems are one part of how databases traditionally have been divided. The other part are OLAP systems, which will be discussed in the next section.
OLAP (Online Analytical Processing)
Business users want to analyze data for their business decisions by examining large and more complex collections of data. Since OLTP systems are expensive and fulfill different purposes, the common best practice always has been to take data out, to bring it over to another environment. This other environment (different systems and databases) will be used for online analytical processing (OLAP): complex and analytical processing. Since offline analyses are usually less business critical, the integrity and availability requirements can be less stringent. Considering this, you could say that OLAP systems are usually less expensive than OLTP systems.
Data in OLTP systems is stored and optimized for integrity and low redundancy, but in OLAP we optimize for analytical performance. As we mainly do repeated reads and barely any writes, it is common to optimize for reading the data more intensively. Data can be duplicated to facilitate different read patterns for various analytical scenarios. Tables in OLAP databases are usually not heavily normalized, but preprocessed and stored in denormalized structures: tables are flattened, sparse, and contain more redundant copies. Data is stored repeatedly. This sounds inefficient, but when data is logically grouped and physically stored together, it is easier for systems to process it and deliver results quicker. Since OLAP systems are less expensive, we can afford to use more data storage.
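The denormalized trade-off can be sketched in a few lines. In this hypothetical example the customer attributes are deliberately repeated in every row of a single flat table, so a typical analytical aggregation needs no joins at all:

```python
import sqlite3

# A minimal sketch of the OLAP trade-off: facts flattened into one wide
# table so analytical reads need no joins. Customer attributes (name,
# city) are deliberately repeated per row; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_flat (
        order_id  INTEGER,
        customer  TEXT,
        city      TEXT,
        amount    REAL
    )
""")
conn.executemany("INSERT INTO sales_flat VALUES (?, ?, ?, ?)", [
    (10, 'Alice', 'Amsterdam', 25.0),
    (11, 'Alice', 'Amsterdam', 40.0),   # redundant copy of name and city
    (12, 'Bob',   'Rotterdam', 30.0),
])

# Typical OLAP access: scan and aggregate, no joins required.
totals = conn.execute(
    "SELECT city, SUM(amount) FROM sales_flat GROUP BY city ORDER BY city"
).fetchall()
print(totals)  # [('Amsterdam', 65.0), ('Rotterdam', 30.0)]
```

The storage cost of the repeated values is the price paid for simpler, faster analytical scans.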
Note: Some systems combine OLTP and OLAP. These systems are called hybrid transactional/analytical processing (HTAP), a term coined by Gartner. Although these systems look like emerging architectures on the outside, on the inside there are, in general, still two databases: one designed for many small transactions with a high fraction of updates, and another (in-memory) database that handles the complex queries for the analytical workloads. This approach of separating the commands from the queries is quite similar to CQRS.
OLAP systems typically are also used to facilitate data life cycle management by storing the unused or less frequently used data from the OLTP systems. This is done for two reasons. The first reason is that historical data is of value, because for analysis we often look to the past. Performance is the second reason. After OLTP systems have copied their data to OLAP systems, they clean up by removing the redundant copies. This maintenance activity makes OLTP systems more effective, because if tables contain less data, the lookups and queries will run faster.
In OLAP systems data from multiple OLTP systems is often brought together, because business users and data analysts often want to make combinations of data. This requires a form of harmonization or unification, because systems use different contexts and have different data structures.
What are Operational Data Stores? Operational Data Stores (ODSs) are, in general, used for operational analysis and reporting. ODSs carry characteristics from both OLTP and OLAP systems. The word “operational” positions ODSs closer to OLTP systems, because their purpose is to provide insight into operational performance and activities. They mainly inherit the context from OLTP systems.
The characteristics that come from OLAP systems are that ODSs are aimed at taking away the analytical workloads that would be caused by ad-hoc, less predictable analysis and reporting. Another characteristic is that ODSs in general retain more historical data than OLTP systems. One more characteristic is that ODSs can integrate small proportions of data from other systems. This means there could also be an integration or harmonization aspect involved when designing an ODS.
In general, however, ODSs stay closer to the design of the primary OLTP system they are positioned to work with; table structures are often similar. ODSs, as a result, differ from data warehouses, because they typically use data from only one single OLTP system, unlike data warehouses, which house data from multiple sources.
The harmonization of multiple source systems is a nice bridge to data warehouses, because they also take care of the historical requirements, bring data together into a harmonized form and facilitate the analytical activities. Data warehouses therefore can be classified as OLAP as well.
How do Data Warehouses work?
Now that we know that data warehouses harmonize data from multiple sources into an integrated data model (a single version of the truth), it is time to go a little more in depth. In the next sections I want to discuss some of the common characteristics and styles of building and operating data warehouses. The most popular styles were developed by industry leaders Bill Inmon and Ralph Kimball.
Bill Inmon’s view is that any piece of data that could possibly be useful should be stored in a single universal data model that acts as a “single source of truth” for the enterprise. This source of truth uses a consistent and storage-efficient model, typically a 3NF structure. From this single source of truth, selections (subsets) are made for specific projects, use cases or departments. Such a selected subset, which is optimized for the read performance of the use case, is called a data mart.
Ralph Kimball’s view is that a data warehouse must be a conglomerate or collection of dimensional data, which are copies of the transaction data from the source systems. Because a data warehouse is used for various use cases, Kimball has the concept of ‘conformed dimensions’, which are the key dimensions that are shared across and used by different consumers.
In the next sections we will look more closely at the two styles, together with some generic components that are always needed when designing and building data warehouses. We’ll start with the staging layer, and move on to cover the remaining parts of a data warehouse.
To engineer a data warehouse, a number of additional components are required. In both the Inmon and Kimball approaches, data must be extracted and stored (staged) in an intermediate storage area first, before processing can take place. This environment for letting the data “land” first is called a Staging Area and has been visualized in the image below.
Staging areas always sit between the data sources and the data targets, which are often data warehouses. They are used to decouple systems and play an important role during the extract, transform and load (ETL) processes. Staging areas can be implemented in different ways, ranging from relational databases to file stores. The way data is captured can also vary: pushing out data, pulling data or using change data capture (CDC). Data is typically delivered in the raw operational format, although there can be variations to this model as well. I will come back to this later.
Staging areas are also typically used to retain historical copies. This is useful for reprocessing scenarios, in cases where the data warehouse gets corrupted and needs to be rebuilt from a longer period of history. The number of older data deliveries (historical copies) can vary between staging areas. I have seen use cases where all the data deliveries, including corrections, had to be kept for audits for several years. In other use cases I have seen the staging area emptied (non-persistent) after successful processing or after a fixed period of time. Cleaning up saves storage and costs.
Staging areas also play an important role in the area of data quality. Before data ingestion into the data warehouse starts, it is recommended to first validate all data. If any of the sources are missing or data is invalid, processing can be temporarily stopped or paused. Sources can be asked to redeliver, or corrections to the data can be made. Once all acceptance criteria are met, processing and ingesting the data into the integration layer of the data warehouse can truly start.
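Such an acceptance gate can be sketched as a simple function. The source names and validity rules below are purely illustrative; real staging checks would be driven by delivery contracts and data quality rules.

```python
# A hypothetical acceptance check for a staging area: before loading into
# the integration layer, verify that all expected source deliveries
# arrived and that records pass basic validity rules.
EXPECTED_SOURCES = {"crm", "billing"}

def validate_staging(deliveries: dict) -> list:
    """Return a list of issues; an empty list means processing may start."""
    issues = []
    missing = EXPECTED_SOURCES - deliveries.keys()
    for source in sorted(missing):
        issues.append(f"missing delivery: {source}")
    for source, records in deliveries.items():
        for record in records:
            if record.get("customer_id") is None:
                issues.append(f"{source}: record without customer_id")
    return issues

deliveries = {"crm": [{"customer_id": 1}, {"customer_id": None}]}
print(validate_staging(deliveries))
# ['missing delivery: billing', 'crm: record without customer_id']
```

Only when the returned list is empty would the load into the integration layer be allowed to proceed; otherwise the run is paused and a redelivery or correction is requested.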
After all data has been successfully inspected and all acceptance criteria of the staging layer are met, data is ready to be integrated into the Integration Layer. This is where all cleansed, corrected, enriched and transformed data is stored together into an integrated and common model, using a unified context. It is harmonized, which means unification has been applied to formats, types, naming, structures and relations. Also the historical data is expected to be in this layer, although the difference with the staging area is that data is transformed and no longer raw.
The integration layer and how data is stored and structured is also where the Inmon and Kimball styles differ.
In the Inmon style all data first lands in a (central) integration layer, which has a normalized relational model. This means the data is optimized for low redundancy. You could argue that this layer is similar to the data models we discussed for OLTP, and you’d be right. It represents the transactional systems, but in a harmonized way, which means that unifications are applied to formats, field names, data types, relations and so on. This implementation requires a big upfront data modelling exercise to align all the different source system structures and transform them into a harmonized and unified data model. The step from staging to the integration layer usually leads to specific and complex ETL logic.
The Inmon style also introduces data marts. After data has been stored in the integration layer, additional layers are created for the consumers. These (presentation) layers are called data marts or dimensional data marts. They typically only contain a subset of the integration layer’s data. They are also specific for a use case, group or set of users. Data in data marts typically is organized in dimensional structures, because it has been optimized for reading performance. Data marts are often implemented using cubes, although they can also be implemented using relational databases (using star and snowflake schemas). The additional layer of a data mart has a benefit: data marts are decoupled from the integration layer. Changes to the integration layer, if made carefully, shouldn’t impact the data marts and thus the data consumers.
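The Inmon flow can be sketched end to end in a few statements: a normalized integration layer feeds a read-optimized data mart that contains only the subset one department needs. Table and column names here are illustrative assumptions.

```python
import sqlite3

# A minimal sketch of the Inmon flow: a normalized integration layer
# feeding a denormalized data mart for one specific consumer group.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Integration layer: normalized, harmonized tables.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sale (sale_id INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer(customer_id),
                       amount REAL);
    INSERT INTO customer VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO sale VALUES (100, 1, 50.0), (101, 2, 75.0);

    -- Data mart: a denormalized subset for the EMEA sales team only.
    CREATE TABLE mart_emea_sales AS
    SELECT s.sale_id, c.region, s.amount
    FROM sale s JOIN customer c USING (customer_id)
    WHERE c.region = 'EMEA';
""")
mart = conn.execute("SELECT * FROM mart_emea_sales").fetchall()
print(mart)  # [(100, 'EMEA', 50.0)]
```

Because the mart is materialized separately, careful changes to the integration layer tables need not break the consumers reading `mart_emea_sales`.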
The Inmon style has several drawbacks. The first is that development time is typically long. Designing a consistent and normalized model takes time, because all the different source systems need to be mapped carefully. Integrity and consistency are important drivers. As the size and number of sources increase, the number of relationships and dependencies increases heavily. This can lead to cascading effects within the design: endless parent-child complexities or structures.
Another drawback is that if new data has to be added to a data mart, the data always has to be added to the integration layer first. Since development takes time and changes to the integration layer have to be made carefully, users requiring new data may have to wait a long time. The list below summarizes the pros and cons of the Inmon style.
- Very low data redundancy because of the (heavily) normalized model of the integration layer.
- Data marts are decoupled from the integration layer.
- Consistent design, well suited for historical data.
- Most engineers are familiar with 3NF normalization principles.
- Longer development time. Complex ETL logic is required to map the data to the heavy normalized model of the integration layer.
- Higher coupling of data ingestion, which introduces dependencies and requires all sources to be ingested at the same time. This reduces parallelism and flexibility in ingestion.
- Growth of new relationships. Huge number of ETL load dependencies because of referential integrity across all tables.
- Referential integrity is a problem if sources disappear or change heavily.
The Inmon style differs from the Kimball style mainly on the integration layer.
In the Kimball style, data in the integration layer differs from the way data is stored in the transactional systems. When data is copied into the integration layer, it is already optimized for read performance. The data is more flattened and sparse, and thus has similarities with the data mart structures from the Inmon style. The difference, compared to Inmon, is that the integration layer is a conglomerate of dimensional tables, which are the ingredients for the data marts. A data mart is just a logical selection of the different tables, with some additional tables to link everything together. The data warehouse, in this approach, is the combination of all individual data marts. The benefit is that these dimensional tables are better understood, because the overwhelming number of parent-child structures has been reduced.
An important aspect of the Kimball style is that data is logically grouped into what are called facts and dimensions. Facts are quantitative numbers: sales figures, amounts, counts, etc. Dimensions represent the business entities: customers, products, employees, contracts, etc. By linking facts and dimensions together, a star schema is created. Conceptually these star schemas are the same as data marts, with the difference that they sit in the same integration layer and share the same underlying database technology. The benefit of the Kimball model is that this style has a higher short-term agility than Inmon: typically users don’t need to wait for months or years before consumption can start.
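A minimal star-schema sketch makes the fact/dimension split concrete. The table and column names are illustrative: one fact table holds the quantitative measures, and the dimension tables hold the business entities it links to.

```python
import sqlite3

# A minimal star-schema sketch in the Kimball style: a central fact table
# with measures, surrounded by dimension tables for business entities.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        product_key  INTEGER REFERENCES dim_product(product_key),
        quantity     INTEGER,
        amount       REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Alice');
    INSERT INTO dim_product  VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO fact_sales   VALUES (1, 1, 2, 20.0), (1, 2, 1, 15.0);
""")

# A typical dimensional query: join facts to a dimension and aggregate.
result = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p USING (product_key)
    GROUP BY p.name ORDER BY p.name
""").fetchall()
print(result)  # [('Gadget', 15.0), ('Widget', 20.0)]
```

A dimension such as `dim_customer` would be a conformed dimension when it is shared across several star schemas rather than rebuilt per data mart.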
Although the data is organized by subject area and easier to understand for users, the Kimball style also has some drawbacks. Since each data mart is created for a specific user group, typically many additional tables are needed to create more specific star schemas. As users have more conflicting requirements, more additional “helper” tables will be needed.
Within Kimball there is also a higher chance of coupling: users can start reusing integration logic or helper tables from other use cases or users. So, as the size grows and the number of users increases, the complexity grows as well. Consequently, even small changes to structures can have a high impact on all the users.
Another implication is that when loading the data, issues can occur with transactions (facts) appearing before the dimensions are loaded. Lastly, updates and deletes can be performance intensive due to the high level of denormalization applied to the tables. The list below summarizes the high-level benefits and downsides of the Kimball style.
- Better suited for multi dimensional analysis.
- Organized along business subject area, which makes it easier to understand for non-IT users.
- Faster development time compared to Inmon, because there’s a single ETL process that loads data into the final data model.
- More flexibility in terms of linking the aggregation points.
- No decoupling of the presentation layer. Reuse or sharing helper tables leads to tighter coupling.
- Issues with referential integrity when loading transactions (facts) before dimensional tables.
- More expensive for updates and deletes because of denormalization. Integrity can also become a problem if data is not correctly updated.
- Larger data volumes (compared to Inmon).
Based on the advantages and disadvantages of both styles, engineers started to experiment and combine approaches. An example is a core data warehouse model that is Inmon-based, but extended with dimensional tables that are Kimball-based. Complementary to these alternative approaches is the technique of Data Vault.
Data Vault style
Data Vault was originally created by Dan Linstedt and is especially popular in Northern Europe. Linstedt defines a data vault as a detail-oriented, historical-tracking and uniquely linked set of normalized tables that support one or more functional areas of business. Absent from this definition is any attempt to use the data vault for cleansed or integrated data. No effort is made to reconcile data model differences across source systems. Instead, data vault depends on downstream architectural components for presenting analysis-ready data.
Data vault addresses the inflexibilities of both Inmon and Kimball via hubs, satellites and link tables. Via separated tables, data vault is more decoupled, allowing for parallel data loads and more flexibility when carrying out changes. Relationships can easily be dropped and recreated on-the-fly, and new systems can be integrated into the existing model without any disruption.
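The hub/satellite/link separation can be sketched as follows. This is an illustrative simplification, not a full data vault implementation: hubs hold only business keys, satellites hold the historized descriptive attributes, and links record relationships.

```python
import sqlite3

# A minimal data vault sketch: hubs (business keys), satellites
# (historized attributes) and links (relationships). Names illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY,  -- hash key
                               customer_bk TEXT);             -- business key
    CREATE TABLE sat_customer (customer_hk TEXT,
                               name TEXT, load_ts TEXT);      -- historized
    CREATE TABLE hub_order (order_hk TEXT PRIMARY KEY, order_bk TEXT);
    CREATE TABLE link_customer_order (customer_hk TEXT, order_hk TEXT);

    INSERT INTO hub_customer VALUES ('h1', 'CUST-001');
    INSERT INTO sat_customer VALUES ('h1', 'Alice', '2020-01-01'),
                                    ('h1', 'Alice B.', '2020-06-01');
    INSERT INTO hub_order VALUES ('o1', 'ORD-100');
    INSERT INTO link_customer_order VALUES ('h1', 'o1');
""")

# Reassembling even this simple view already takes four tables, which is
# why a presentation layer usually sits on top of a data vault.
rows = conn.execute("""
    SELECT c.customer_bk, s.name, o.order_bk
    FROM hub_customer c
    JOIN sat_customer s ON s.customer_hk = c.customer_hk
    JOIN link_customer_order l ON l.customer_hk = c.customer_hk
    JOIN hub_order o ON o.order_hk = l.order_hk
    WHERE s.load_ts = (SELECT MAX(load_ts) FROM sat_customer
                       WHERE customer_hk = c.customer_hk)
""").fetchall()
print(rows)  # [('CUST-001', 'Alice B.', 'ORD-100')]
```

Because each satellite and link is a separate table, sources can be loaded in parallel and a new relationship is just a new link table, without restructuring what already exists.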
These flexibility improvements also come with a downside: the data vault itself is not directly suited for ad-hoc queries and analysis, because querying all the hubs, satellites and links and joining them together is performance intensive. In most cases it requires an additional presentation or data mart layer.
The number of tables within a data vault is significantly higher than with 3NF modelling, and a higher number of ETL jobs is needed to ensure the segregation and referential integrity between the different tables. This drives up the complexity and makes it more difficult for engineers to understand how the data warehouse is designed. Tools from companies like WhereScape can help address these problems by generating the ETL jobs from metadata.
One last aspect is that there is no strict or formal standard. Data vault has been well described conceptually, but people use implementation variations. Dan Linstedt, Kent Graziano and Hans Hultgren, all experts in this field, each advocate slightly different data vault styles, which introduces the risk of conflicting styles being applied in larger projects.
The list below contains the high-level benefits and downsides from the data vault.
- More adaptable for changes. Higher flexibility
- Suited for real-time. Decoupling enables parallel processing
- More flexible and incremental development approaches can be used
- Massive number of joins are needed when using the Data Vault modelling technique. This can be a real performance hit. An additional ‘presentation layer’ or ‘data mart layer’ is required.
- Requires more experienced engineers and strong agreements on guidelines, since best practices and approaches differ
- A higher number of ETL jobs and tables is required. This increases complexity and reduces overview. A hybrid approach also invites engineers to be more creative, which introduces additional complexity and inconsistency in the longer term
Data warehouses distinguish themselves from other applications by their integration layers and complex data models. They consume, combine, harmonize and integrate data. These integration techniques require data-modeling expertise, which is typically different from designing and developing regular applications.
Capturing the data
Data warehouses cannot be created without pulling data from different sources. In the previous sections we learned about the role of staging areas to facilitate data to come together in different formats. What we haven’t discussed are the formats or structures in which these data sources are coming in.
A commonly seen format for delivering the data is the raw (identical) format of the source systems. Typically full collections of data are delivered, so what you end up with are complete raw extracts of all tables in the form of flat files. A drawback of this delivery style is that there is high coupling with the inner data structures of the source system(s), since the extracts are the same representations. If a source system changes its structure, processing the data might no longer work.
An alternative method of delivering data to data warehouses is to agree on some form of abstraction that reduces the coupling risks. Structures can be agreed to be simpler, or to have a fixed format. These agreements at least prevent processes from breaking whenever source systems change their structure.
A more extreme form of abstraction is asking source systems to conform their data deliveries to the format of the integration layer. In this method the files from the source systems must be delivered in the exact same format as the data warehouse. I consider this form of abstraction high coupling as well, because it requires all sources to change simultaneously whenever the integration layer changes.
Note: Most commercial vendors (Oracle, SAP, Siebel, etc.) use highly specialized and normalized data models for their operational systems, which change with every release cycle. These data models are difficult to interpret. People make a decent living knowing how to interpret and map these data models into data warehouses.
To make data integration easier, typically a standardized format is chosen. Comma-separated values (CSV) and Parquet are two of the most popular choices, because the majority of databases and ETL tools can work with these formats. XML and JSON can also be used, but these formats in general produce more overhead, since a lot of metadata is included.
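The metadata overhead of record-oriented formats is easy to demonstrate with the standard library: serializing the same records as CSV and as JSON shows that JSON repeats the field names in every record, while CSV states them once in the header.

```python
import csv, json, io

# A small illustration of format overhead: the same tabular records
# serialized as CSV and as JSON. The record shape is illustrative.
records = [{"id": i, "amount": i * 10.0} for i in range(100)]

csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["id", "amount"])
writer.writeheader()
writer.writerows(records)
csv_size = len(csv_buf.getvalue())

json_size = len(json.dumps(records))

# Field names appear once in CSV but in every JSON record.
print(csv_size < json_size)  # True
```

Columnar formats such as Parquet go further still, storing each column contiguously with compression, which is why they are popular for analytical pipelines.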
An alternative to using full extracts is requesting or pulling data based on changes, or deltas: only changed data is delivered. A popular design pattern is change data capture (CDC), which is well supported by the different vendors and can work with various mechanisms: timestamp-based, version-based, indicator-based, and so on.
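The timestamp-based variant can be sketched in a few lines. The row shape and field names are illustrative; production CDC tooling often reads the database transaction log instead of comparing timestamps.

```python
from datetime import datetime

# A minimal sketch of timestamp-based change data capture: only rows
# modified after the last successful extraction are delivered.
source_rows = [
    {"id": 1, "amount": 10.0, "updated_at": datetime(2020, 1, 1)},
    {"id": 2, "amount": 20.0, "updated_at": datetime(2020, 2, 1)},
    {"id": 3, "amount": 30.0, "updated_at": datetime(2020, 3, 1)},
]

def extract_delta(rows, last_run: datetime) -> list:
    """Return only rows changed since the previous extraction run."""
    return [r for r in rows if r["updated_at"] > last_run]

delta = extract_delta(source_rows, last_run=datetime(2020, 1, 15))
print([r["id"] for r in delta])  # [2, 3]
```

A known weakness of this mechanism is that hard deletes leave no timestamp behind, which is one reason log-based CDC is often preferred.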
The next step, after all data has been collected, is to integrate data into the integration layer. Data needs to be cleansed, mapped and transformed into the unified data model of the data warehouse. This requires ETL tooling and can be done via several approaches.
The most common way to pick up data from staging areas, transform it and load it into the integration layer is to use a commercial ETL tool. Informatica, SQL Server Integration Services, Ab Initio, InfoSphere Information Server, Talend, Pentaho and Oracle Data Integrator are popular choices, seen within many large enterprises. These tools have a wide range of features, a large number of database connectors, debugging and testing capabilities and often an intuitive look and feel. They can extract the data into their own local environment or delegate the work to the target database. A commonly heard drawback of these commercial tools is tight coupling: once you choose an ETL vendor for your data warehouse, you’re married to that vendor forever.
The alternative to ETL tools is to build an ETL framework yourself. I have seen a number of companies that generate ETL code (Java, Oracle’s PL/SQL, Teradata’s SQL, etc.) from metadata repositories. Coding and storing all the integration logic in a source code repository, such as Git, is also not uncommon. Some companies try to achieve what commercial vendors are providing, including intuitive web-based frontends, debugging capabilities, track and trace of data, lineage, etc. A benefit of building an ETL tool yourself is looser coupling: as long as you adhere to the commonly used open standard ANSI SQL, you could potentially replace the database technology of the data warehouse at any time. The “do it yourself” scenario, however, requires highly skilled engineers for maintenance and development, and there is the risk of seeing those engineers leave over time. Resources with ETL knowledge of the commercial vendors are in general easier to attract.
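The metadata-driven approach can be sketched as follows. The mapping format and the generated SQL are hypothetical assumptions: the point is only that mappings live as data, and code is generated from them rather than written by hand.

```python
# A hypothetical sketch of generating ETL code from a metadata repository:
# mapping definitions are stored as data, and SQL is generated from them.
mapping = {
    "target": "dwh.customer",
    "source": "staging.crm_customer",
    # target column -> source expression (illustrative transformations)
    "columns": {"customer_id": "id", "full_name": "UPPER(name)"},
}

def generate_insert(m: dict) -> str:
    """Render one INSERT..SELECT load statement from a mapping definition."""
    targets = ", ".join(m["columns"].keys())
    sources = ", ".join(m["columns"].values())
    return (f"INSERT INTO {m['target']} ({targets}) "
            f"SELECT {sources} FROM {m['source']};")

sql = generate_insert(mapping)
print(sql)
```

Changing a mapping then only means editing metadata; regenerating and versioning the resulting code in Git keeps the whole pipeline reviewable.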
ETL processing within a data warehouse ecosystem typically doesn’t allow for a lot of variation. In general a single product or framework is used. It is also highly likely, due to complexity, that ETL jobs are scheduled to run in fixed time windows. Processing times are typically long, for performance reasons. Another reason why the ecosystem is static is that many enterprises are terrified to make big changes to ETL: the code is often large and enormously complex, because it has evolved through years of development.
In the introduction of this blogpost I mentioned that some technology trends have changed the way data warehouses can be built. One of these technologies is data virtualization: data remains in place, and sources can be accessed and viewed as a single database, without the batch-oriented need for physically extracting and duplicating data. In detail, there is a data virtualization layer, which presents itself as an integration layer and caches data by extracting, transforming and integrating data virtually and at runtime. This technology comes with an application component: a virtualization engine that handles all incoming queries, determines the best strategy for caching, knows how to access different data store technologies, knows how to integrate data, and so on. Under the hood there is a lot of metadata for building the abstraction. I have visualized this technique in the image below.
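The idea can be sketched with a toy federation layer. The class and source names are illustrative: in-memory dicts stand in for remote systems, and a naive cache stands in for the engine’s caching strategy.

```python
# A minimal sketch of the data virtualization idea: a federation layer
# that answers queries by fetching from sources at runtime and caching
# results, instead of physically copying data into a warehouse.
class VirtualLayer:
    def __init__(self, sources: dict):
        self.sources = sources   # name -> callable returning rows
        self.cache = {}

    def query(self, source_name: str):
        # Serve from cache when possible to protect the underlying system.
        if source_name not in self.cache:
            self.cache[source_name] = self.sources[source_name]()
        return self.cache[source_name]

    def join_customers_orders(self):
        # Integrate two sources virtually, at query time.
        customers = {c["id"]: c for c in self.query("crm")}
        return [{"customer": customers[o["customer_id"]]["name"],
                 "amount": o["amount"]}
                for o in self.query("orders")]

layer = VirtualLayer({
    "crm":    lambda: [{"id": 1, "name": "Alice"}],
    "orders": lambda: [{"customer_id": 1, "amount": 25.0}],
})
print(layer.join_customers_orders())
# [{'customer': 'Alice', 'amount': 25.0}]
```

Even in this toy version the drawbacks below are visible: every uncached query lands on the operational source, and a structural change in `crm` immediately breaks the join logic.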
Although this technique looks attractive, it has potential drawbacks:
- Pulling out too much data could cause OLTP systems to go down or become unstable. Data virtualization mitigates this risk by 'caching' data, which means a copy is kept in the virtualization engine, but this pattern doesn't entirely eliminate the risk: when routing too many queries you can still harm the OLTP systems. Alternatively you can scale up these systems, but that can be very costly, since operational systems are relatively more expensive than analytical systems.
- The data virtualization layer leads to tighter coupling. If source systems change, changes to the data virtualization layer are immediately required as well. Views or additional layers can help, but any change still requires close coordination between the source system owners and the engineers maintaining the data virtualization layer.
- Consumers are directly exposed to the data quality of the sources: data virtualization does not allow you to create new data, which would be required if you want to make corrections.
- Data virtualization relies on the network. In a highly distributed environment with many networks and hops (passing additional network devices), latency is expected to go up. Additionally there is coupling: if the network is down, the integration layer is broken.
- Data virtualization is not complementary to data life cycle management. It cannot move irrelevant data out of the underlying systems; it requires all historical data to stay in the OLTP systems, which in the long term makes data virtualization an expensive solution. Alternatively, you can move data out to a secondary location and combine it again in the virtualization layer, but this approach increases complexity because you need to develop and maintain several abstraction layers.
- Data virtualization is limited by the underlying supporting technology, which is typically a relational database management system. Although data virtualization can read many database systems, it in general does not allow you to create document, key-value, columnar and graph database endpoints, if required.
- For intensive and repeatable queries, data virtualization uses more computing power, because transformations take place in real time when data is queried. Caching techniques can reduce this, but the amount of computing power will always be comparatively higher than when using data that has already been physically transformed and stored.
Note: Some database vendors provide a database (virtual) query layer, which is also called a data virtualization layer. This layer abstracts the database and optimizes the data for better read performance. Another reason to abstract is to intercept queries for better security. An example is Amazon Athena.
Despite the drawbacks, data virtualization can help in migration scenarios. Making the database virtual and part of the new capability can ease the pain. Data virtualization is also real-time, and thus faster for relatively small amounts of data, compared to ETL tooling, which needs to process and persist all data before consumption can take place.
The last capability we haven't discussed in much detail yet is Business Intelligence (BI) tools, which are used to generate insights and present results to support business decisions. They became quite popular as data warehouses started to emerge with more and more data, and they have expanded their capabilities throughout the years. The first versions only included simple reporting and dashboarding functions, but later generations offer strong visualization capabilities, self-service, intelligent caching techniques and predictive analytics. Business intelligence thus doesn't exclude (predictive and prescriptive) advanced analytics: both disciplines are complementary and can strengthen each other.
Modern business intelligence tools are also quite flexible in the number of sources they can work with. Traditionally BI tools were closely coupled to data warehouses, but later generations are capable of pulling out and combining data from a larger and more varied set of source systems. BI tools can also pull out only the changes and persist data longer in what are called Cubes. These cubes have some similarities with data virtualization, because they also abstract the underlying sources and can pull data out directly or cache it.
What are the differences between OLAP, ROLAP, MOLAP, and HOLAP? Cubes, also known as OLAP cubes, are pre-processed and pre-summarized collections of data that drastically improve query time. OLAP (Online Analytical Processing) refers to specialized systems or tools that make data easily accessible for analytical decision-making. OLAP cubes are logical structures as defined by their metadata. Multidimensional Expressions (MDX) is a popular metadata-based query language that lets you query these OLAP cubes. Vendors offer a variety of OLAP products that you can group into three categories:
ROLAP (Relational Online Analytical Processing): ROLAP products work closely with relational databases to support OLAP. Often a star schema is used to extend and adapt an underlying relational database structure so that it presents itself as an OLAP server.
MOLAP (Multidimensional Online Analytical Processing): MOLAP products provide multi-dimensional analyses of data by putting it in a cube structure. Data in this model is highly optimized to maximize query performance.
HOLAP (Hybrid Online Analytical Processing): HOLAP products combine MOLAP and ROLAP, by using a relational database for most of the data, and a separate multi-dimensional database for the most dense data, which is typically a small proportion of the data.
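The pre-summarization idea behind these cubes can be illustrated with a small sketch. The dimensions, facts and rollup logic below are purely illustrative (real MOLAP engines are far more sophisticated), but they show why cube queries become simple lookups instead of scans. The `"*"` value plays the role of the "all" member of a dimension, as in a rollup.

```python
from itertools import product

# Tiny fact table with two dimensions (region, year) and one measure.
facts = [
    {"region": "EU", "year": 2023, "sales": 100},
    {"region": "EU", "year": 2024, "sales": 120},
    {"region": "US", "year": 2023, "sales": 200},
]

def build_cube(rows, dims, measure):
    # Pre-aggregate every combination of dimension members,
    # including the "*" (all) member for each dimension.
    cube = {}
    for row in rows:
        keys = [(row[d], "*") for d in dims]
        for cell in product(*keys):
            cube[cell] = cube.get(cell, 0) + row[measure]
    return cube

cube = build_cube(facts, ["region", "year"], "sales")
# Queries are now lookups instead of scans over the fact table:
print(cube[("EU", "*")])   # total EU sales: 220
print(cube[("*", 2023)])   # total 2023 sales: 300
```

This is the trade-off OLAP products make: extra pre-processing and storage in exchange for fast, predictable query times.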
Modern BI tools can also keep data close to them, by allowing you to integrate (ETL) and persist data inside their "storage layer", rather than integrating and storing it inside the data warehouse. My personal recommendation is to store data in the data warehouse only when it is subject to reuse. For shorter-term and one-off integration exercises, integrating inside the business intelligence solution is recommended.
You have learned a lot about how data warehouses work, how the integration layers are organized, and how data is ingested and consumed. In the next section we will discuss the data lake, which some people call the second generation of data repositories.
As data volumes and the need for faster insights grew, engineers started to work on other concepts. Data lakes emerged as an alternative for access to raw and higher volumes of data. By providing data as-is, without having to structure it first, any consumer can decide how to use it, and how to transform and integrate it.
Data lakes and data warehouses are both considered centralized (monolithic) data repositories, but they differ because data lakes store data before it has been transformed, cleansed and structured. Schemas are therefore often determined when reading data, rather than when loading data into a fixed structure. Data lakes also support a wider variety of formats: structured, semi-structured, and unstructured.
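The schema-on-read principle can be sketched as follows. The record layout, field names and defaults are hypothetical; the point is that raw records land as-is, and each consumer applies its own typing and defaults only at read time.

```python
import json

# Raw events as they would land in the lake, untouched.
raw_events = [
    '{"id": 1, "amount": "12.50", "ts": "2024-01-05"}',
    '{"id": 2, "amount": "7.25"}',  # missing field: tolerated on write
]

def read_with_schema(lines):
    # The consumer decides, at read time, how to type and default fields.
    for line in lines:
        rec = json.loads(line)
        yield {
            "id": int(rec["id"]),
            "amount": float(rec["amount"]),
            "ts": rec.get("ts", "1970-01-01"),  # this consumer's own default
        }

events = list(read_with_schema(raw_events))
print(events[1])  # {'id': 2, 'amount': 7.25, 'ts': '1970-01-01'}
```

A warehouse, by contrast, would have rejected or fixed the second record at load time (schema-on-write).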
A bigger difference between data warehouses and data lakes is the underlying technology used. Data warehouses are usually engineered with relational database systems, while data lakes are typically engineered with distributed databases or NoSQL systems. Hadoop is a popular choice, because it runs on commodity hardware and can therefore be cost effective. It is also highly scalable, open-source, and modular, with a large variety of database types and analytical frameworks.
Using public cloud services is also a popular choice for building data lakes. Recently, distributed and fully managed cloud-scale databases, running on top of container infrastructure, have simplified the task of managing centralized data repositories at scale, while adding advantages in elasticity and cost.
Many data lakes, as pictured in the image above, collect pure, unmodified, raw data from the original source systems. Dumping in raw data (exact copies) is fast and gives data analysts and scientists quick access. However, the complexity with raw data is that use cases always require reworking it: data quality problems have to be sorted out, aggregations are required, and enrichments with other data are needed to bring the data into a new context. This introduces a lot of repeatable work and is also why data lakes are typically combined with data warehouses.
Data warehouses, in this combination, act as high-quality repositories of cleansed and harmonized data, while data lakes act as analytical environments, holding a large variety of raw data to facilitate analysis such as data discovery, data science and the processing of unstructured data. When combined, outcomes or insights flow back into the data warehouse, while the data lake holds the raw data and is (manually) operated.
Now that we know what a data lake is, we can move to the next section, in which I want to discuss the scale at which data warehouses and data lakes can be applied within organizations.
The Pitfalls of Enterprise Data Warehouses and Data Lakes
As people started to see the benefits of data warehouses, the initiatives and scopes became larger. Companies sent people to training facilities to prepare them to start building something massive. Business users were asked to deliver all their requirements to the central data warehousing engineering teams. Large programs were initiated to deliver Enterprise Data Warehouses (EDWs). An EDW is a central data warehouse with the objective of creating a single version of the truth and servicing all reporting and data analysis needs of the entire organization.
Although the image below is a high-level representation of an enterprise data warehouse, you can see the complexity of the many integration points, integration steps and dependencies. Once enterprise data warehouses start growing, their complexity grows exponentially. Before you know it, you end up in an uncomfortable situation where changes are seen as bottlenecks.
What I want to do in the next sections is discuss a number of widespread observations and failure modes. Before we continue, I would like to ask you to take a deep breath and put your biases aside. The need for data harmonization, bringing amounts of data into a particular context, always remains, but something we have to consider is the scale at which we want to apply this discipline. Is it really the best way to bring all data centrally before it can be consumed by any user or application?
Single canonical models and unnecessary schema translations
EDWs require data providers and data consumers to agree on a single (enterprise) canonical data model. This single version of the truth is, to my mind, the biggest reason why companies can't scale. In large ecosystems many different contexts exist. Building an integration layer that accommodates everybody's needs is a big challenge, because it requires everybody to agree. The larger the company, the more conflicts you will see, and the longer it takes to agree and align.
My second biggest problem with harmonization at scale is that value is lost. Engineers are required to make interpretations when building up data warehouses. For example, what is the start date of a contract? There are quite a number of possible answers: the legal signing date of the contract, the system's booking date, the approval date given by the legal department, the date usage started, or the date the first payment was made. Chances are relatively high that the meaning of these definitions differs across departments and is implemented specifically within systems. We either end up creating many variations or accept the differences and inconsistencies. The more data we add, and the more conflicts and inconsistencies in definitions arise, the more difficult it becomes to harmonize. Chances are that you end up with a unified context that is meaningless to everybody.
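To make this concrete, here is a small illustration with entirely hypothetical systems and fields, showing how each departmental view of the contract "start date" yields a different answer, and why picking one of them for the harmonized model loses context for the other consumers.

```python
# One contract, as three hypothetical source systems see it.
contract = {
    "legal":   {"signing_date": "2024-01-15"},
    "booking": {"booked_on": "2024-01-20"},
    "billing": {"first_payment": "2024-02-01"},
}

def harmonized_start_date(c, rule):
    # Each department would choose a different rule; whichever rule the
    # warehouse picks, the other interpretations are lost downstream.
    rules = {
        "legal_view":   c["legal"]["signing_date"],
        "ops_view":     c["booking"]["booked_on"],
        "finance_view": c["billing"]["first_payment"],
    }
    return rules[rule]

print(harmonized_start_date(contract, "legal_view"))    # 2024-01-15
print(harmonized_start_date(contract, "finance_view"))  # 2024-02-01
```

Multiply this by thousands of attributes and hundreds of systems, and the scale of the harmonization problem becomes clear.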
Another problem is that data, and thus context, is thrown away during the integration process. Very specific value ranges from one domain are mapped to ranges with fewer details in favor of another domain, aggregations are made out of details, or fields are left out. For advanced analytics, such as machine learning, leaving these details out is a big problem, because analytical models work more precisely with detailed data.
Additional transformation step
A second concern, when using data warehouses at large scale, is the need for an additional transformation step. Data always has to be integrated into the data warehouse first, before it can be used. This step means a lot of waiting time, because in a data warehouse there is inheritance coupling between almost all components: tables have references to other tables, and tables have dependencies on ETL jobs. A change in one table usually forces a ripple effect of changes in other tables and ETL jobs. This complexity requires a lot of work and increases waiting time, which starts to make people creative.
What is a big ball of mud? A "big ball of mud" is a haphazardly structured, sprawling, sloppy, duct-tape-and-baling-wire, spaghetti-code jungle. The term was first coined by Brian Foote and Joseph Yoder. A big ball of mud describes a system architecture that is monolithic, difficult to understand, hard to maintain, and tightly coupled because of its many dependencies. The image below shows a dependency diagram that illustrates this; each line represents a relationship between two software components.
Data warehouses, with their integration layer, countless tables, relationships, scripts, ETL jobs and scheduling flows, often end up as a chaotic web of dependencies. These complexities are such that, after a while, you often end up with what is called a big ball of mud.
An engineer might propose, for the sake of time, to bypass the integration layer by directly mapping data from a source system to a data mart. Another might suggest building a data mart on top of two different, already existing, data marts, since changing the integration layer takes too much time. This technical debt (future rework) will cause problems later: the architecture becomes more complex, and people lose track of all the creativity and shortcuts used to deliver on time.
Data warehouses are characterized by their exponential complexity in terms of the number of sources and consumers. The more sources you add, the more difficult it becomes to make changes. The same principle applies to the number of data consumer entry points. If you have many data marts sitting on top of the integration layer, making changes to the integration layer can be quite impactful, because those changes affect the data marts and need to be coordinated closely with all data consumers.
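A back-of-the-envelope sketch (with entirely made-up numbers) illustrates how change impact grows with the number of dependent objects and consumers: a change to one integration-layer table ripples to every dependent table and its ETL jobs, and each affected data mart requires coordination with all of its consumers.

```python
def change_impact(dependent_tables, etl_jobs_per_table, marts, consumers_per_mart):
    # Objects that must be modified for a single integration-layer change,
    # and the coordination paths to data consumers that change triggers.
    objects_to_change = dependent_tables * (1 + etl_jobs_per_table)
    coordination_paths = marts * consumers_per_mart
    return objects_to_change, coordination_paths

# A modest warehouse...
print(change_impact(10, 2, 5, 8))    # (30, 40)
# ...versus an enterprise-scale one: impact grows multiplicatively.
print(change_impact(50, 3, 25, 20))  # (200, 500)
```

The numbers are fictitious, but the multiplicative relationship is the point: doubling both sides far more than doubles the coordination burden.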
What I have also seen is that there is a 'tipping point': once you reach a certain number of sources and data marts, agility drops to almost zero. Integrating a huge number of sources requires tremendous coordination and unification. Release cycles of several months are no exception for complex data warehouse systems.
Endless stakeholder discussions
Because of the tight coupling and long release cycles for changes, stakeholders are queued up and have to agree on priorities. I have seen long discussions between engineers and business users about the priorities of new data sourced into the data warehouse or changes to be made to the integration layer. These discussions don't add any direct business value.
The agility problems of monolithic platforms introduce a consumer habit of impatience. Adding data to EDWs and data lakes can take a long time, so users start asking for all data to be consumed. Having access to all data takes away the risk of having to wait again later. I have seen data marts that contain almost all data from the integration layer. The sum of the storage space of all data marts becomes a multiple of the storage required by the integration layer. This doesn't make the architecture cost effective.
I have also seen business and IT users accessing both data marts and the integration layer at the same time, building views on top of data marts and integration layers to get their needs quickly fulfilled. These views were exposed again to other users. Eventually an endless cascade of layering emerges, and users pull out data without having any clue where it exactly comes from.
Often no requirements are set by consumers. Data is provided because business users demand it; otherwise escalations will follow. If timelines aren't met, point-to-point interfaces are often created.
Pressure on coordination
Another problem of operating data warehouses and data lakes at large scale is that engineers and IT users are required to coordinate data deliveries. Sources might change and interfaces get broken, which requires escalations to get things fixed. I have also seen situations where engineers took care of the problems themselves, for example by fixing data in the staging layer in order to have it properly loaded into the data warehouse. These fixes became permanent, and over time hundreds of additional scripts had to be applied before data processing could start. These scripts aren't part of the trusted ETL process and can't be traced back, which has a significant negative effect on lineage and transparency.
Data Quality and Governance
Data governance is typically a problem too, because who owns the data in data warehouses? Who is responsible if source systems mess up and data gets corrupted? Many source system engineers point to the data warehouse engineers when something goes wrong, because the data has been transformed into something the system owners do not recognize. Vice versa, data warehousing engineers blame source system owners for delivering incorrect data.
Data quality is another point of ownership discussion, because after the data has been transformed it is unrecognizable to the source system owners. Source system owners blame the DWH engineers for bad data quality. In many of the implementations I have seen, the data warehouse engineers take care of data quality: data is fixed before or after loading it into the data warehouse. Especially for integrity reasons it is important to have the relationships between the tables set up properly. But which data is the truth? Users start comparing figures and results from operational systems with data warehouse figures, and nobody knows what the truth exactly is. Data quality between operational systems and data warehouses might diverge so badly that nobody trusts the data anymore.
Data life cycle management of historical data is also an issue. Because EDWs are seen as the archive of truth, operational systems clean up irrelevant data, knowing it will be retained in the warehouse. But what happens if an operational system needs to produce an analytical report based on its own historical data? In many cases an ODS would fulfill this need, but I have also seen companies using EDWs for this purpose. This brings up an interesting pattern: first, operational data is moved and transformed into a harmonized model; second, that same data is transferred and transformed back (reverse engineered) into its original context.
For operational advanced analytics, historical data is often a requirement. It is expected to be available quickly, because windows of opportunity are often fleeting. Using a data warehouse for this purpose is often a problem, given that data warehouses typically need many hours to process data.
Data leaves the centralized platform
Keeping and managing data centrally is a problem, because organizational needs are very diverse: different types of solutions, teams of different sizes, different types of knowledge, different requirements varying from defensive to offensive, and so on. Keeping everything together is a big problem, because of this proliferation.
EDWs and data lakes are tightly coupled with the underlying chosen solution or technology, meaning consumers requiring different read patterns are always required to export data to other environments. As the vendor landscape changes and new types of databases pop up, these monolithic platforms get more scattered, forced to export data again and again. This trend undermines the concept of efficiently using a 'single central repository'. Consequently, point solutions are created, and the underlying hardware of the data warehouse is only used for ETL processing and data persistence. This is a true waste of money, because data warehouse systems in general use very expensive hardware, optimized for intensive querying.
After data has been carried out and enriched, it is also typically distributed further. This means data consumers start acting as data providers. For data consumers in a large-scale environment this situation can be quite confusing, because where did the data originate? Some consumers start consuming without being aware that the data is also being consumed indirectly. This makes the chain longer, and therefore more fragile.
This further distribution and proliferation of data also introduces another problem: many data consumers don't know where to find the right data, because data distribution is scattered throughout the organization. Is the operational system the right place to look? The enterprise data warehouse? The data lake? Or perhaps another data consumer, who may have already prepared the data slightly better? Users take the quickest route, the one with the shortest distance and least resistance, to the data. EDWs and data lakes grow into a spaghetti architecture; others call it a big ball of mud. Uncoordinated changes, unregulated growth, and distribution of data harm the overall enterprise data architecture. Consistency, insight, and most importantly agility, are lost.
Data Providers have no insight
EDWs and data lakes often lack insight into ad-hoc consumption and further distribution, especially when data is carried out of the DWH ecosystem. With new regulations, such as GDPR and CCPA, insight into consumption and distribution is important, because you want to be able to explain what personal data has been consumed by whom and for what purpose. Logically, responsibilities should start where the creation and origination of the data starts. The problem with enterprise data warehouses is that data providers have no control and limited insight: there is no way to control data consumption, nor is there insight into the further distribution of the data or the purposes for which it is being used.
Service Orientation is implemented separately
Within many organizations the creation of APIs for service orientation is handled via an enterprise canonical model by a separate team. This is strange, because both service orientation and the EDW use and rely on the same operational systems. A separate implementation creates two different versions of the truth. This can result in two camps in the organization and two different vocabularies with different meanings of data, even though the original sources and their context are still the same.
Obsessive Evangelists and Classical Thinkers
This may seem like a bit of an exaggeration, but enterprise data warehouses also have evangelists. Most data initiatives started with many good intentions, but when times were less busy, engineers continued mapping data. Integrating data was no longer done for the sake of data consumption, but only for its own sake: data integration became a hobby. The side effect is that various data warehouses became unnecessarily complex and too difficult to maintain. Some of these evangelists only discussed modelling techniques, rather than paying attention to business users and their requirements. Business users felt not fully understood, which eroded trust.
Another problem is how the enterprise data warehouse and data lake teams are assembled from traditional software engineers. Many come from a database administrator (DBA) background and are used to designing monoliths. They are trapped in the past and lack today's modern skillsets, like DataOps, domain-driven design, distributed and evolutionary design thinking, data versioning, CI/CD and automated-testing experience, and so on. Scaling up and fulfilling today's modern requirements is consequently a challenge.
Many EDWs are built upon 'prefabricated', off-the-shelf, industry-specific data models. IBM, Teradata, Oracle and Microsoft, for example, provide these data models for different industries. The problem I have with these industry-specific data models is that they are the vendor's version of the truth, not yours. Many of these models are too detailed and full of assumptions. These details require you to fill up many empty tables with only keys. Consequently the integration layer becomes extremely large and complex.
Logical Data Warehouse
There is also an approach of combining EDWs and data lakes with data virtualization. They are "complemented" with an additional virtual layer of abstraction: underlying integration complexity is abstracted away, and new integrated views of data are created, which can span operational systems and many other applications as well. Henry Cook, research director at Gartner, calls this design pattern a Logical Data Warehouse. Although not having to add new data to the enterprise data warehouse gives short-term benefits in terms of agility, in the long term it will be a catastrophic failure.
First, it is tempting to substitute complex integration logic with abstraction, but all the underlying complexity is still there, including all the operational and technology overhead. Second, it expands the data platform with direct bypasses to operational systems and ad-hoc data sources, while directly reusing parts of the data platform. Many of these operational systems, as you have learned, aren't optimized for intensive read access and can't retain larger amounts of historical data. Third, there is tight coupling with the operational systems, the data platform and a virtualization product: changes to the underlying physical structures of data platforms or operational systems that are not carefully coordinated will immediately break the logical layer. Lastly, all data integration is still "funneled", because data virtualization is just another ETL tool for maintaining the enterprise data model.
Although the technology has changed, the whole concept of combining data virtualization with data lakes and enterprise data warehouses, and funneling all ETL through it, is the root of the problem. Data is moved to a monolith that everybody has to wait for and in which all changes must be coordinated precisely, without allowing use cases to optimize and choose the fit-for-purpose technology that suits their requirements best.
Operationalizing use cases from the Data Lake
Many companies have put all their hope in big data ecosystems, which pull together raw and unstructured data from various source systems, often with EDWs attached to them as well. Analytical use cases are expected to run in the lake, and consolidated views are expected to land in the EDWs. This approach carries a number of challenges:
- Data pulled into a data lake is often raw and highly likely a complex representation of all the different applications. It can include (tens of) thousands of tables, unintelligible values and application logic encapsulated in the data. Additionally, there is tight coupling with the source systems, since the inherited structure is an identical copy. Application changes risk immediately breaking the data lake environment.
- Analytical models in the data lake are often trained on both raw (unstructured) and harmonized data. It is not unthinkable that data scientists are technically plumbing, creating data and operating these data pipelines and models by hand or within their data science project. Data lakes therefore carry substantial risks.
- Analytical outcomes in many cases need to be brought back into the EDW for reporting. This requires us to deal with two different data models: the raw data models and the unified, harmonized version of the EDW. Since these structures are completely different, another pipeline is required to fix all the inconsistencies.
- Data lakes are often a single platform shared by many different (other) use cases. Due to their tight coupling with the hardware, compatibility challenges, and shared libraries and configurations, these platforms are very hard to maintain.
The challenges I summed up are just a few reasons why the failure rate of big data projects is so high. Management resistance, internal politics, lack of expertise, and security and governance challenges are a few of the other reasons why analytics never make it to production.
Knowledge isolation and centralized design
One of the biggest problems with building a centralized data platform, like an EDW or a data lake, is how the teams are organized: people with data engineering skills are separated from the people with domain and business knowledge. Those people are, in many cases, also the people producing and knowing the data.
By siloing all the data professionals, one stovepipe is created. All the knowledge is siloed, leaving no other teams able to leverage the data. Data life cycles and heartbeats have to be adjusted to what is dictated centrally. One team owns all the infrastructure, which is only available to that central team. One single team has to change everything, run everything, maintain everything, fix everything, and so on. It is this stovepipe that hampers organizations from scaling up.
Data warehouses are here to stay, because the need to harmonize data from different sources within a particular context will always remain. Patterns of decoupling by staging data won't disappear, nor will the steps of cleansing, fixing and transforming schemas. Either the Inmon or the Kimball style can be applied well, depending on the needs of the use case. Technology trends, like metadata-driven ELT, data virtualization, cloud, distributed processing, and machine learning for enrichment, will change data warehousing, but not in a negative way.
The way we want to manage our data is something we have to reconsider. Big silos, like enterprise data warehouses, will become extinct, because they are unable to scale. Tightly coupled integration layers, loss of context and intense data consumption will force companies to look for alternatives. Data lake architectures, which pull in raw data, are the other extreme. Raw, polluted data, which can change at any time, will cause experiments and use cases to never make it into production. Raw data in itself also carries a lot of repeatable work with it.
Siloed and monolithic thinking is also a problem that keeps companies from scaling up. Siloing data professionals creates stovepipes, and separating them from the people with domain and business knowledge results in coordination problems.
What is needed is a balanced and well-governed data management environment, which allows for variation by using a mix of technologies. It must give domains the insight and flexibility to adapt and distribute data, while staying decoupled at the same time. How the modern trends are affecting data integration and what type of architecture you need for scalability is something I've outlined in a series of articles (https://medium.com/@piethein/data-management-at-scale-91118a1a7d83) and in the book Data Management at Scale, 2nd edition.