The content of this blogpost builds up on my previous article: Data Integration and Data Modelling demystified and consolidates a large amount of content from the book Data Management at Scale. This content didn’t make it to the final book, because various external reviewers found it unchallenging. So, if you believe you have a good understanding of data and database management already, the content discussed in this article might sound familiar.
A crucial application component that I haven’t dived deep into the previous article is the application data layer, or database layer, of the application. The values or data of the application can be stored in various ways, but modern applications usually use databases. Databases are everywhere and used in our routines every day: When we make payments, check social media, shop online, view media, book a hotel, etc. They even sit on all of today’s modern smartphones. They play an essential role in the way value from data is delivered, therefore it is important to look at Database Management, and dive deeper into aspects like RDBMs, NoSQL, NewSQL, ACID, CAP, Consistency levels, and other design patterns.
Relational Database Management Systems
Databases started to emerge in the 70s, based on Edgar Codd’s idea to store values in tables, rather than using the linked list methods. By storing values into relational structures we can insert, delete and edit more easily, without the need to constantly rewrite all links and pointers to the other values. These relational data structures are called tables, whereas each set of values is a column, and each line item inside a table is called a row. The structure and dependencies between these relational tables are called the (database) schema.
Relational databases became extremely popular and resulted in wide usage and deployment of relational database management systems (RDBMs). The majority of RDBMs use SQL (Structured Query Language) for organizing and retrieving the data.
When the first databases emerged the only way to scale these databases was to use more powerful hardware and infrastructure: faster servers, with more capacity. Adding more power (processing power, memory) to an existing system to reach a desired state of performance is called scale-up or vertical-scaling. Because vertical-scaling is limited to the upper capacity of a single server, the industry started to work on another kind of scaling, which is horizontal-scaling. Horizontal-scaling, or scaling-out, means adding more servers (machines) to a distributed network of databases.
Distributed databases, contrasting to single-machine databases, distribute (or copy) data to multiple instances (network of nodes or machines). Instead of using the computing power of only one computer, the distributed model allows to spread the tasks over multiple machines. This improves performance. Double the amount of hardware resources typically means double the amount of size and performance. Linear scalability is the term used to achieve an equal percentage increase of output when the input is doubled.
Tip: If you want to learn more about distributed databases and systems, I highly encourage you to read the book “Designing Data-Intensive Applications”, written by Martin Kleppmann (2017).
Non-distributed and distributed databases exist in many variations and typically differ in functionality, performance, availability and consistency. The most popular databases I have categorized by type and name in the overview below. Some of these databases also do a form of data distribution, although they don’t scale performance horizontally. For that reason they have been grouped in a separate category.
- Non-distributed relational databases are based on a predefined relational model using tables with rows and columns. Non-distributed relational database share all the storage with the same processing unit. The default query language for relational database management systems is SQL.
- Distributed databases are databases in which storage devices are not all attached to a common processing unit and which is controlled by a distributed database management system. The distributed cluster of shared-nothing nodes uses a relational data model which can handle SQL.
- File based data stores work with files, which are are stored on file systems. Indexes can be used to more quickly retrieve results.
Distributed databases can run faster, but another important reason for distributing data across multiple instances is to achieve better resilience. By copying and duplicating the data, a database instance (node) is allowed to fail, because another copy of the data is still available. The most common aspects and trade-offs of replication for consistency, availability and fault tolerance in database architectures will be discussed later. First, though, let’s look at how to make databases more trustworthy with transactional logs and ACID.
Many databases use a transaction log, which is used for additional integrity and audibility of the system. The transaction log records all transactions (database modifications) into a separate file or database. If the database fails, the transaction log can be used to inspect, recover or restore the data.
This same transaction log can also be used to start reading from and replicating all changes. Rather than a daily copy of the full database you can start delivering data based on changes that were made to the database within a certain time window. For streaming and event-driven architectures this pattern is extremely useful, since you don’t have to query databases for changes, but can directly start reading from the transaction log.
Note: Change Data Capture (CDC) is a set of software design patterns used to determine (and track) the data that has changed so that action can be taken using the changed data. CDC often uses the database transaction log to populate the delta’s, although it can also query the database directly.
The transition log makes databases more reliable and trustworthy. It can be used to guarantee ACID characteristics over hardware failures and database crashes.
Many databases adhere to the standards of ACID. ACID stands for Atomicity, Consistency, Isolation, Durability.
- Atomicity ensures that a transaction is either fully completed, or is not begun at all.
- Consistency enforces that at the end of any transaction the system is in a valid state.
- Isolation guarantees that the system only carries out one action at a time.
- Durability means that once the transaction has been successfully completed, the change is permanent.
The ACID standards ensure that databases can operate safely and recover, because they guarantee validity even in the event of interruptions, errors, power failures and so on. ACID enforces that transactions are seen as single (atomic) operations, which either succeeds completely, or fails completely. Transactions only have one valid state and are isolated at the moment of writing or updating. Some databases however much more value scalability and performance over ACID. This brings us to another database category: NoSQL.
Relational databases are superior at storing the rows into tables and making sure data is stored safely and correctly. This model of high integrity and (enforced) consistency is at the same time a performance and flexibility dilemma. Because of that engineers started to work on different database models, which compromise consistency in favor of more speed, higher availability and/or better flexibility. The flexibility, for example, resulted in allowing the insertion of data without a predefined relational schema, rather than having the (enforced) schema determined upfront. These non-relational databases increased their dominance with the popularity of big data, real-time web applications and unstructured data. During this time frame also the name was coined: NoSQL
Note: The term NoSQL initially coined in 1998 by Carlo Strozzi to state that his lightweight, open-source relational database did not use the standard Structured Query Language (SQL). Ten years later Johan Oskarsson used the term again to describe non-relational databases. The term NoSQL thus can mean either “No SQL” or “Not only SQL”.
The majority of these non-relation databases follow a philosophy, which is different than the relational world. Many are highly scalable and favor the BASE (BASE stands for Basically Available, Soft State, Eventual Consistency) consistency model, which is less assurance than ACID. The term “database elasticity” is also often used to indicate that these databases are extremely variable and can quickly scale out. Another commonality is that the majority of the NoSQL systems use simplistic interfaces for their non-relational structures. These interfaces deviate from the traditional SQL-like interface languages, which consequently results that data modelling works different and is non-comparable with data modelling as seen in the relational world.
NoSQL systems in general are optimized for different type of workloads and use specific data structures: key-value store, document-oriented, columnar, graph and time-series.
While NoSQL has strengths and is popular, it doesn’t work in all situations. Schemaless designs give a higher degree of flexibility, but also carry an integrity challenge at the same time. To enforce data quality controls, such as constraints, mandatory fields, specific data types or allowed value ranges, developers have typically take care of the controls at the application-side, which requires additional code and causes increased complexity.
Another problem is the eventual consistency model most NoSQL systems work with. In a mission-critical or transactional system you don’t want to have records created twice in the database or get an answer back without knowing what the exact truth is.
NoSQL databases also use many different programming and interfacing standards. There isn’t a reliable standard and simple tools to access a large variety of NoSQL systems are missing. In a large organization with an enormous variation in NoSQL technologies, the amount of different interfaces and application languages programmers have to learn can be a concern.
Because of the several drawbacks NoSQL systems have, new databases started to emerge, which provide much of the functionality of traditional SQL relational databases, while offering the benefits of NoSQL technologies. These databases are called NewSQL.
NewSQL databases try to offer the best of both worlds: scalability and speed of NoSQL, combined with the relational data model and ACID transactional consistency. Whereas NoSQL is a good choice for extreme availability, faster response times and eventual consistency, NewSQL emphasizes consistency over availability, while using a distributed architecture.
Some of the NewSQL database products also combine multiple NoSQL schema characteristics and provide several ways for data to be stored and manipulated. These are called multi-schema or multi-model databases. An example is MarkLogic, which combines many different database design patterns into a single database. For example you can have strong-consistent relational, document, key-value, and graph all at the same time. Although the core idea is elegant, this implementation leads to increased complexity and lower performance.
When data in a single data model is mapped and represented by several other models, the complexity significantly increases, because any change to the database design will impact all representations. A change to the core structure could impact all the different endpoints to look different. Developers consequently have to work harder to keep all applications working consistently with the various database endpoints. These challenges are similar to what you see in integration databases.
Another problem is performance: multi-model databases are basically under the hood running multiple different optimized databases. The more representations you have, the more places data needs to be replicated to, the more pre-processing optimizations and indexes are needed, the more storage is needed and the more bandwidth is required, and so on. Some vendors solve this by not replicating any data, but by transforming data real-time at runtime. This approach sounds nice, but performing complex SQL statements, while under the hood transforming a NoSQL into a SQL data model requires incredible performance, which consequently significantly drives up the hardware costs. Multi-schema databases that replicate data internally consequently suffer, just like any distributed database, from the CAP theorem principles, to which we now turn.
When you dive more deep into distributed architectures, there is high chance that you will encounter the CAP theorem (or Eric Brewer’s theorem). CAP stands for Consistency, Availability and Partition tolerance and the theory says that achieving all three in a distributed system is not possible, and you have to choose two of three dimensions. To help you better understand I first want to give you example to show how the dimensions work together.
Let’s first have a look how availability and consistency are impacting each other. The (strong) consistency dimension ensures that all reads get the same data for any given point of time. When data is updated in a distributed system, it has to be replicated between the different nodes of the cluster. In this time window of updating and distributing, the data is locked (database locking is a technique of preventing simultaneous access to data in a database, to prevent inconsistent results delivered or incorrect values to be stored) and will only be unlocked after all the nodes have successfully processed the data and mutually agree that the new data is the same. Any other process, which tries to access or update the same data at the moment of processing, will have to wait till the update process has finished. The benefit is that processes are guaranteed to always get the most accurate data. When consistency is favored over availability, the system will return either an error or a time-out if data cannot be guaranteed to be consistent. For example, when some of the nodes are down, data cannot be consistently written to all nodes.
Database consistency: The database community uses the word, consistency, with different meanings. A “weak consistent” or “eventual consistent” read means that no locks on the table rows are set, and therefore other processes can modify data at the same time reads are performed. Consequently, data you receive might not reflect the actual results of a recently completed write operation. When reads are “strong consistent”, data is locked and processes have to wait, so the most up-to-date data is always returned. Write consistency has various meanings as well. For distributed systems “strong consistent” writes mean you are guaranteed that write operations are successfully and all (or at least two) respective nodes in the distributed system are updated correctly. “Eventual consistent” writes work as the opposite. You are not guaranteed that, in case the system goes down, all nodes have written the database operations successfully.
Availability can also be traded in for favoring it over consistency. Locking consequently is less strict and the system is configured to always process requests and tries to answer them, no matter what happens with the nodes inside the cluster. Even if nodes in cluster are down or other processes get inconsistent data, the system always tries to stay responsive. Risks are that if more nodes go down, the system will get down and different nodes hold different versions of the data. Eventually you end up with a corrupt system holding inconsistent data.
In practice strong consistency models are an expensive guarantee, because they requires a lot of network coordination between the nodes. Strong consistency drives up the latency which will make distributed systems slower to respond. In order to solve this some distributed systems choose to process the changes asynchronously. This consequently means if the connectivity stops between the nodes they still remain available, however not all changes appear on the other nodes. Obviously, you might get different answers depending on which node you query.
Note: Cloud vendors make a difference between ‘strong consistency’ and ‘eventual consistency’. Strong consistency guarantees the data to be written exactly once. It also means it cannot provide 100% availability, so it may return an error if something goes wrong, which requires the client to resubmit the data later. Choosing consistency means you have to accept that the most non-consistent (accurate) value is returned or that during a system failure the data might become inconsistent.
The partition tolerance is the third dimension and means that the cluster continues to operate even if there is a “partition” communication break between the nodes. In theory you can sacrifice partition tolerance for availability and consistency, but in practice this is misleading, because we cannot design a system that will never lose any communication in a distributed environment. Having only one partition will make every node a single point of failure. With this in mind the choice is more between consistency and availability.
The CAP theorem is important to understand, because data distribution, strong consistency and low latency can’t always be balanced perfectly. Building applications and architectures at scale is one of today’s biggest challenges. Any architecture suffers from CAP theorem, so you have to provide best practices and design principles for engineers to make the right considerations when distributing data across systems and environments. Depending on the situation you’re dealing with you have to choose to favor (strong) consistency over low latency, or vice versa.
Database limitations and design patterns
Although the industry has seen an exponential growth in CPU, memory, and disk density, there is a bottleneck of what databases can deliver in throughput. This bottleneck makes every database break or start to hamper at a certain point. This breaking happens when the database load (number of queries) is too high. Consequently, application developers and system engineers are using design patterns to overcome the performance limitations all databases have. In the overview below I have listed the most known design patterns.
- CQRS: Command and Query Responsibility Segregation segregates the operations that read data (Queries) from the operations that update data (Commands) by using separate interfaces.
- Data offloading: Data that is not used as often is moved to another system (possibly with a different storage technology). Analysis on the remaining data will be faster because less data is processed.
- Append Only design: Appends data at the end of the file or database. Updates or transactions are not changed in ‘the middle’. The advantage is that writes to the database should be faster and an audit trail is saved. Drawback is that all changes will be added to the database as a new records, resulting in higher storage usage.
- In-memory: Data is stored (entirely) in main memory. Main memory is faster than writing to and reading from a file system.
- Caching: Frequently-accessed records are placed or held in an area (memory or optimized storage) for faster access.
- Materialized views: A (optimized) database object that contains the results of a query. Materialized views can be considered as a form of caching.
- Database indexes: An index or database index is a data structure which is used to quickly locate and access the data in a database table.
- Usage of specialized hardware: Using dedicated or specialized hardware, which is focused on better taking care of specific workloads. For example, graphics processing units (GPUs) are known to process complex structures of data in parallel faster than the traditional central processing units (CPUs).
- Workload management: Ensures that workloads or queries are distributed/divided in a manner that minimizes over-utilization and under-utilization while maximizing resource usage and efficiency.
- Query optimizers: Determines the most efficient way to execute a given query by considering possible query plans. Indexes and materialized views are also part of this approach.
- Optimized database engines: Improve performance by using a different engine. A column-oriented database, for example, is faster aggregating results than a row-oriented database.
- Data Grid: Data is shared across different systems working together (cluster). This pattern is often combined with the in-memory pattern.
- (Read only) partitions: Partitioning is the spreading of data across multiple files across a cluster to balance large amounts of data across disks or nodes. Read only partitions make a read-only table-space that prevents updates on all tables in the table-space. Other patterns can be applied on this table-space in order to improve performance.
- Sharding: Dividing data into separate data stores or horizontal partitions, based on some logical entity, such as region, customer segments, etc. This pattern can improve scalability when storing and accessing large volumes of data.
- Consistency model: Changing the consistency model from strong to eventually, as you learned, is way to improve the performance.
Within large enterprises with many databases the operational management is important. Support and guidance must be developed, so teams know what database to select and how to use the database properly. Database administrators need to be trained on data storage and operations. Version and product life cycle management and maintenance are important, but so are backup, monitoring and auditing. Building database components with potential for reuse can help to accelerate projects and become more effective building applications.
Databases determine the way applications interact with the data and deliver value. They impact the velocity of how data is moved and distributed. Databases also have gone through an evolution and exist in many different variations. Because of that the data integration becomes important and higher variety of integration patterns is required.
Knowing all database characteristics and engineering best practices is expertise on itself. Many large organizations recognize this and have bundled or grouped all knowledge into an organizational unit or department, usually called a Center of Excellence or Center of Expertise.
Lastly, I recommend making choices about how many and what data stores you want to offer to your organization. You might want to define a common set of reusable database technologies or data stores and patterns to ensure you leverage the strength of each data store. For example, mission-critical and transitional applications might only be allowed to go with strong consistency models, or business intelligence and reporting might only be allowed with stores that provide fast SQL access. Additionally, you might want to vary between on-premises and cloud. You will probably end up with a list of several database technologies and data stores that will facilitate the majority of use cases.
In this post you learned the importance of database management and gained an understanding of its core ingredients. Databases determine the way applications interact with the data and deliver value. They impact the velocity of how data is moved and distributed. Databases also have gone through an evolution and exist in many different variations. Because of that the data integration becomes important and higher variety of integration patterns is required.
You’ve also learned some of the most crucial concepts of how databases can help your organization to become data-driven on a larger scale. Although the world of technology is changing, data management’s core concepts and principles continue to make a difference in keeping data valuable. These areas are the joint responsibility of management, business, and IT.
Countless companies fail to implement data management properly-and that’s understandable, given the changing data landscape, the rapidly increasing amount of data, and the accompanying integration challenges. My view and perspective on data management and integration differs from others. If you are curious, I engage you to have a look at my book Data Management at Scale. This book will allow you to take a deep dive far beyond what you’ll find on the internet, into disciplines like data integration, data governance, data security, data quality, master data management, and metadata management.