Data Management at Scale

Piethein Strengholt
19 min read · Jul 11, 2023

Over the last few years, decentralized architectures have emerged as the new paradigm for managing data at large. They are meant to scale distribution of data between teams, while aiming for higher value and a faster time to market.

In this article, I would like to unpack how to implement such a federated design. We’ll begin with a short reflection on your data strategy, and whether you should start with a centralized or decentralized approach. Then we’ll go through the phases of implementing a data architecture, from setting the strategic direction, to laying the foundation, to professionalizing your capabilities.

Note that much of the content comes from the book Data Management at Scale, 2nd edition. If you would like to learn more or go deeper, I encourage you to read the full version of the excerpt below.

A Brief Reflection on Your Data Journey

Before you jump on the data-driven bandwagon, ensure you have a data strategy in place. Whether you’re starting small or have a large set of use cases to implement, without a plan you’re doomed to fail. I see countless enterprises fail because they’re unable to bring everybody on board or to articulate their strategy, because they don’t include business users, or because they lack support from senior leadership. I can’t emphasize this enough: before you start implementing any change, ensure you have a balcony view and a clear map guiding you in the right direction.

After establishing a vision that clearly articulates your ambition and the path ahead of you, it’s time to enter the next phase. Your next steps are about communicating your vision, building the right team, optimizing your architecture, making execution plans, defining processes, and selecting use cases. Again, it’s important to get everybody aligned and committed to your objectives. During the initial stages of development, start small with the implementation, but at the same time keep the big picture in mind because your target state architecture must be inclusive for all use cases. You’ll need data governance capabilities for implementing roles, processes, policies, procedures, and standards to govern your most critical data. You’ll need master data and data quality management capabilities for ensuring consistency and trust. You need metadata for tracking lineage, capturing business context, and linking to physical data. You need integration and analytical services for building data products and turning data into value. The recommended approach for achieving all these objectives is to look holistically at the building blocks you need. These building blocks will remain stable throughout your journey, while the underlying technologies will change over the years to come.

After you’ve identified your critical ingredients, you must secure top-down commitment. In addition, communicate and engage with your business stakeholders: create awareness and excitement. Make sure your intended approach, business objectives, and goals are clear and well understood throughout the organization. A good approach is to get started by compiling a short list of use cases with the greatest potential for impact. Ensure each use case is aligned with your data strategy, as well as your business strategy. Benchmark the feasibility for each use case in terms of complexity, financial costs, commercial added value, risks, and operational manageability. After that, select the best candidates to start with. Begin with the low-hanging fruit: your first use case shouldn’t be too difficult to implement, but at the same time should deliver enough value to justify the work. It should set an example for the rest of the organization. The rest of the use cases will come later.
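The benchmarking step described above can be sketched as a simple weighted score. This is only an illustration: the five criteria come from the text, while the 1-to-5 scale, the weights, and the two example use cases are assumptions made up for the sketch.

```python
from dataclasses import dataclass

# Hypothetical weights: negative values penalize a criterion, positive values reward it.
WEIGHTS = {
    "complexity": -1.0,
    "cost": -1.0,
    "value": 2.0,
    "risk": -1.0,
    "manageability": 1.0,
}


@dataclass
class UseCase:
    name: str
    scores: dict  # criterion -> score on an assumed 1 (low) to 5 (high) scale


def feasibility(use_case):
    """Weighted sum of the criterion scores for one use case."""
    return sum(WEIGHTS[criterion] * score for criterion, score in use_case.scores.items())


def rank(use_cases):
    """Order use cases from the most to the least attractive candidate."""
    return sorted(use_cases, key=feasibility, reverse=True)


# Invented example candidates.
candidates = [
    UseCase("churn dashboard",
            {"complexity": 2, "cost": 2, "value": 4, "risk": 1, "manageability": 4}),
    UseCase("real-time pricing",
            {"complexity": 5, "cost": 4, "value": 5, "risk": 4, "manageability": 2}),
]
shortlist = rank(candidates)
```

A score like this is a conversation starter rather than a decision: the weights should be agreed with your business stakeholders, and the low-hanging fruit is usually the candidate that combines decent value with low complexity and risk.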

Centralized or Decentralized?

After you’ve prepared for the potential challenges and identified your first workloads, it’s time to get started with the actual implementation. At this point, you also need to make a decision about whether a more centralized or decentralized approach is suitable. If your company has a lower level of data management maturity, a centralized approach in the beginning is more appropriate. Why? A complex transformation requires a cultural shift, upskilling, building mature self-service capabilities, breaking down silos and political boundaries, and sharing knowledge. The complexity of these activities shouldn’t be underestimated. So, if your organization doesn’t have the maturity or scale yet to tackle them, consider centralization over decentralization. As you progress in your journey, onboard more use cases, and enhance your data management and self-service capabilities, allow federation to happen next to centralization.

If your company already has a high level of data management maturity or is decentrally organized, then you can begin with a more decentralized approach to data management. However, to align your decentralized teams, you will need to set standards and principles and make technology choices for shared capabilities. These activities need to happen at a central level and require superb leaders and good architects. I’ll come back to these points toward the end of this article, when discussing the role of enterprise architects.

Besides the starting point, there are other aspects to take into consideration with regard to centralization and decentralization. First, you should determine your goals for the end of your journey. If your intended end state is a decentralized architecture, but you’ve decided to start centrally, the engineers building the architecture should be aware of this from the beginning. With the longer-term vision in mind, engineers can make capabilities more loosely coupled, allowing for easier decentralization at a later point in time. Second, you should determine which capabilities you intend to remain under central control and which ones you plan to decentralize later. Making this clear up front will save a lot of arguments and political infighting on the way forward.

Making It Real

After you’ve articulated your data strategy and ambitions at an organizational level, your next activities are aimed at making things real! My recommendation here, based on my experiences and success stories from customers, is to use a phased approach and start small: build a hypothesis for value-add, begin with only one or a limited number of use cases, get some experience with what building data products is like, put principles in place and address operational inefficiencies, and determine what “just enough” governance means for your organization.

Note that the discussion here follows the “happy path.” So, it doesn’t represent a nasty multiheaded complex beast with many tails. Nor does it go too deep into highly context-specific subjects. My aim is to describe the main stages of a general data journey to give you an idea of what to expect.

Opportunistic Phase: Set Strategic Direction

For onboarding your first domains, it’s not essential to map out all your business capabilities, delineate your data domains, and align them with the organizational structure. Instead, your goals during this phase are centered around learning how concepts work and can be translated into practice. For this purpose, I recommend selecting a fairly simple use case with one or two source systems as a starting point. You’ll use these sources as input for data product development, then serve these data products directly to consumers for turning data into value. You don’t need to focus on data value creation capabilities yet, though; those will come later. For now, you should concentrate on the source system side of your new architecture.

This first phase is about figuring out what mindset your teams need to have for taking data ownership and building the first series of data products. During this stage, the central team watches over other teams, coaching and delivering expertise while the domain teams write data pipelines, debug problems, and fix data quality issues at the source. All teams need to work closely together during this phase to get an understanding of the shared needs because many collective decisions must be made about storage services, ETL tools, security, the data catalog, and so on.

Your first steps in this phase will be as follows:

  • Select the first use case and identify a business team that is a good candidate to be the starting team for building the first data products. Preferably, this business team would have all the engineering skills necessary to develop the first deliverables. If not, the data platform team can assist the business team.
  • Define a project that has a clear but limited scope, and preferably can be finished in a few months. Set up a small program board with senior representatives for the needed top-down support and coordination.
  • Identify (potential) members for your data platform team, who are dedicated to the overall process. This team will be available to assist the domain team, which will be responsible for building the first data pipelines for producing data products.

From a data architecture point of view, I encourage you to start with only one (cloud infrastructure) landing zone and one data management landing zone. Next, provision only the services that are needed for capturing and storing, transforming, and cataloging data. Put aside what you don’t need, and focus on what’s essential.

For your solution design, consider adopting a lakehouse architecture. While this design initially might look like centralization, it brings many benefits: it’s a proven, common, and well-understood pattern, it’s easy to set up, and it simplifies management of the infrastructure. The figure below shows an example of how the architecture might look in the beginning. It’s a small architecture, designed for a single domain.

Initial lakehouse architecture (Image design by Piethein Strengholt)

When you’re just starting your journey, the architecture only uses a few essential services. During the initial phase, data engineers from the domain engage with members of the data platform team. Together, they determine the scope and analyze what services are needed for building the first data products.

The intent of this design is to create a foundation for building data products at scale, supporting the objectives of data ownership and self-service with computational data governance. The first domain team, the product owner, and the data engineers, work closely together to make the data available. They start by extracting data from the source systems and ingesting it into the Bronze layer. For this, the team uses integration services offered by the central platform team. Next, they select the relevant data and transform it into several user-friendly datasets, for example using notebooks or data transformation services. During these activities, the data may pass through additional layers: Silver for intermediate data; Gold for functionally cohesive and consumable data. After all the transformations are performed, the data catalog scans all the data. Optionally, it may also scan the integration services for lineage. The configuration for scanning happens on a central level, so it’s handled by the central platform team or data governance team. The final step is to make the data available. After that, other teams can use this data as input for their analytical use cases.
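To make the flow through the layers described above more concrete, here is a minimal sketch in plain Python. It assumes in-memory lists instead of real storage and integration services; the sample records, the cleaning rules, and the function names `to_silver` and `to_gold` are invented for illustration.

```python
# Bronze: raw records as they might arrive from a source system (invented sample data).
bronze = [
    {"customer_id": "17", "country": " nl ", "amount": "120.50"},
    {"customer_id": "17", "country": "NL", "amount": "30.00"},
    {"customer_id": "", "country": "DE", "amount": "15.00"},   # missing key
    {"customer_id": "42", "country": "de", "amount": "n/a"},   # unparseable amount
]


def to_silver(rows):
    """Clean and type the raw rows; drop rows that cannot be repaired."""
    cleaned = []
    for row in rows:
        if not row["customer_id"]:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        cleaned.append({
            "customer_id": int(row["customer_id"]),
            "country": row["country"].strip().upper(),
            "amount": amount,
        })
    return cleaned


def to_gold(rows):
    """Aggregate cleaned rows into a consumable dataset: revenue per country."""
    revenue = {}
    for row in rows:
        revenue[row["country"]] = revenue.get(row["country"], 0.0) + row["amount"]
    return revenue


silver = to_silver(bronze)
gold = to_gold(silver)
```

In a real lakehouse each layer would be persisted, and rejected rows would typically be logged or quarantined rather than silently dropped, so the domain team can fix the quality issues at the source.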

Importance of Data Catalog

Do not underestimate the complexity of organizing your catalog. When you start implementing your data catalog, do not scan all your sources or domains at once. Instead, only scan domains that are part of your new ecosystem. Onboard one domain at a time, one after another. What you’ll learn in practice is that the alignment between business domains and application domains depends on whether business capabilities are shared or used exclusively. To set up a good structure, first divide and group your data sources and data platform(s) in a logical way. The key point here is that every asset from each data source or application can only be stored in a single location within your catalog. This means that, on a technical level, you need to relate data assets to application domains. Then, on this application domain level, you align the responsibilities, assigning administrator, data source admin, and curator roles to the respective people who manage databases, applications, data products, data pipelines, and other services. Next, you’ll need to perform the same grouping and ordering activities for information that describes the meaning of data, applications, and domains in a business sense. Identify your business domains by studying business capabilities and finding people who work together on common business goals. Assign glossary owner and data steward roles for managing metadata such as descriptions and glossary terms. Finally, align your application and business domains, asking the domain users to create relationships between metadata that is managed on an application domain level and metadata that is managed on a business domain level.

During the initial phase, your capabilities won’t have a high degree of maturity. It’s also likely that you’ll have many manual processes, which you’ll need to automate or make self-service as you progress to the next stages. Metadata also becomes more important during the next stages, because what worked in Excel will no longer work at scale.

After you’ve implemented your first use case, working from the bottom up, it’s time to conquer the hearts and minds of your first (business) stakeholders by showing your results to the rest of the organization. Don’t be shy about showcasing your success and demonstrating the added value and the benefits of using data. These activities shouldn’t be underestimated. They are essential for securing a top-down commitment from the higher-level executives. As you progress further, your program should become a role model that empowers you to decommission and clean up legacy data management platforms or systems, and stop similar kinds of projects that don’t align with the new data management strategy. With all this accomplished, you’re ready to collect new use case requirements for the second phase.

Transformation Phase: Lay Out the Foundation

After you deliver your first use case(s) into production, it’s all about scaling up, adding more data domains, and refining your architecture. At this stage, it’s important to have a sharp, complete picture of your landscape. Thus, by now your business capabilities should be clear, including the alignment with people, processes, and technology. You should know which domains own which applications, and what they are responsible for. You should also know what new use cases can be served by what potential new data products (for more on this topic, or a refresher on how to identify your business domains, consider consulting my blogpost on data domains). Additionally, during this phase, you will work on budgeting plans, road maps, added value for the business, and operating models. These activities are important as you gradually scale up.

With the high-level target state in mind, your next steps are about defining what domain and landing zone topologies are best suited to your organization. For whichever topology you choose, I recommend you harmonize blueprints that include services for processing, storing, and cataloging data; publishing metadata; enforcing policies; and so on. Next, you should study data traffic flows between domains. Based on your analysis, you’ll need to make several design decisions. For example, if many domains require data from many other domains, then a highly decentralized or fine-grained domain topology is not recommended, as it leads to complexity and management overhead. In that situation, a governed topology that uses a centralized location for managing shared data products is usually a better option. If the amount of data flowing between domains varies significantly, you could also implement a hybrid approach of centrally managed and peer-to-peer distributed data. In that case, you should make choices about how you will manage data products within databases and storage account services between and within domains.
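One way to study the data traffic flows mentioned above is to record which domain provides data to which other domain and measure how densely connected the domains are. The sketch below is a rough heuristic under stated assumptions: the 0.5 threshold, the function names, and the example domains are invented, and a real analysis would also weigh volumes and criticality.

```python
def flow_density(flows, domains):
    """Share of all possible provider -> consumer pairs that actually exchange data."""
    pairs = {(provider, consumer) for provider, consumer in flows if provider != consumer}
    possible = len(domains) * (len(domains) - 1)
    return len(pairs) / possible if possible else 0.0


def suggest_topology(flows, domains, threshold=0.5):
    """Hypothetical rule of thumb; the threshold is an assumption, not a standard."""
    if flow_density(flows, domains) >= threshold:
        return "governed (centrally shared data products)"
    return "fine-grained (peer-to-peer sharing)"


# Invented example: (provider, consumer) pairs between three domains.
domains = ["sales", "finance", "logistics"]
flows = [
    ("sales", "finance"),
    ("sales", "logistics"),
    ("finance", "sales"),
    ("logistics", "finance"),
]
suggestion = suggest_topology(flows, domains)
```

With four of the six possible pairs in use, the example lands on the governed option. A sparse graph, such as a single flow from sales to finance, would point toward peer-to-peer sharing instead.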

After your first domains have been onboarded, it’s time to consider the lessons learned from the initial phase. You will probably find that your domain teams want to be more proficient and self-supporting. Your first data products weren’t that complicated: you took incoming raw data, fed it through a few data pipelines, and turned it into more accessible data for consumers. The difficulty arises when you must do this repeatedly and in a governed and controlled manner. For this, you need to add automation and more advanced capabilities to your data product development process. For example, you might want to add a data quality framework for validating schemas and data quality, or set a standard for ETL services that your domains must adhere to when building new data products. Instead of writing more notebooks or using one-off, nonreusable solutions, you’ll need to provide blueprints and patterns for implementing this standard functionality.
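As an illustration of what a small data quality framework for validating schemas and data quality could look like, consider the sketch below. The expected schema, the two quality rules, and all function names are assumptions for the example; a production setup would more likely rely on a dedicated data quality service or library.

```python
# Hypothetical schema for an "orders" data product: column name -> expected Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "currency": str}


def validate_schema(row, schema=EXPECTED_SCHEMA):
    """Report missing columns and columns with an unexpected type."""
    errors = []
    for column, expected_type in schema.items():
        if column not in row:
            errors.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            errors.append(f"{column}: expected {expected_type.__name__}")
    return errors


def validate_quality(row):
    """Apply example business rules to values that passed the type check."""
    errors = []
    if isinstance(row.get("amount"), float) and row["amount"] < 0:
        errors.append("amount must not be negative")
    if isinstance(row.get("currency"), str) and len(row["currency"]) != 3:
        errors.append("currency must be a 3-letter code")
    return errors


def validate(rows):
    """Return {row_index: [errors]} for every row with at least one problem."""
    report = {}
    for index, row in enumerate(rows):
        errors = validate_schema(row) + validate_quality(row)
        if errors:
            report[index] = errors
    return report


report = validate([
    {"order_id": 1, "amount": 9.99, "currency": "EUR"},
    {"order_id": 2, "amount": -5.0, "currency": "EURO"},
    {"order_id": "3", "amount": 1.0},
])
```

Offering the checks as one shared function, rather than letting every domain write its own notebook, is what turns them into a reusable standard.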

The image below illustrates how the architecture might look during this next stage. In this updated architecture, you can see that real-time data ingestion capabilities, a metadata-driven framework, a logging database, and virtualized access have been added.

Updated lakehouse architecture (Image design by Piethein Strengholt)

Next, you’ll need to make similar improvements to your data pipelines. The most efficient way to scale and have flexibility is to use parameterized, metadata-driven pipelines. So, instead of reengineering and manually creating new pipelines (with their own data sources, selections, transformations, and output) using hardcoded values, you can create a common and configurable pipeline that will retrieve the configuration at runtime from, for example, a database or code repository. Your parameterized workflows will thus use metadata inputs and conditions provided by your domain teams. The role of the central platform team is to provide all these capabilities using an as-a-service model by making everything part of your infrastructure blueprints.
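The idea of a single configurable pipeline that retrieves its configuration at runtime can be sketched as follows. Everything here is simplified and hypothetical: the configuration is an inline JSON string rather than a database or repository, the sources are in-memory lists, and the configuration keys (`source`, `filter`, `columns`) are invented for the example.

```python
import json

# Configuration that domain teams might register; in practice it could be read
# from a database or a code repository at runtime.
PIPELINE_CONFIG = json.loads("""
[
  {"name": "orders_eu", "source": "orders",
   "filter": {"region": "EU"}, "columns": ["order_id", "amount"]},
  {"name": "customers_all", "source": "customers",
   "filter": {}, "columns": ["customer_id"]}
]
""")

# Stand-in for the source systems (invented data).
SOURCES = {
    "orders": [
        {"order_id": 1, "amount": 10.0, "region": "EU"},
        {"order_id": 2, "amount": 20.0, "region": "US"},
    ],
    "customers": [{"customer_id": 7, "name": "Ada"}],
}


def run_pipeline(config, sources):
    """One generic pipeline: select rows and columns as described by a config entry."""
    rows = sources[config["source"]]
    selected = [
        row for row in rows
        if all(row.get(key) == value for key, value in config["filter"].items())
    ]
    return [{column: row[column] for column in config["columns"]} for row in selected]


# The same code serves every data product; only the metadata differs.
outputs = {config["name"]: run_pipeline(config, SOURCES) for config in PIPELINE_CONFIG}
```

Adding a new data product then means adding a configuration entry instead of engineering a new pipeline, which is what lets the platform team offer the capability as a service.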

Cooperation between teams

You might also observe that your domain teams are looking for more proactive cooperation between teams. They don’t want to rely on the central platform team(s) as a middleman passing messages back and forth between providers and consumers, for example, when a new consumer wants to consume a provider’s data, or when a provider has data quality issues or will be delivering its data late. Instead, the platform team should implement central monitoring services and a control framework that enables action-oriented interaction between providers and consumers. For instance, you may want to add a central logging store, or a good monitoring service that detects issues and proactively notifies users using alert rules, notifications, or events. An example of a multidomain architecture can be seen in the image below.
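A central monitoring service with alert rules, as described above, could in its simplest form look like the sketch below. The `Delivery` record, the 30-minute grace period, and the subscriber list are assumptions for illustration; a real setup would read from the central logging store and push notifications through your alerting or eventing service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Delivery:
    product: str
    expected_at: datetime
    delivered_at: Optional[datetime]  # None while the delivery is still outstanding
    failed_checks: int = 0


def evaluate(delivery, now, grace=timedelta(minutes=30)):
    """Apply two example alert rules: late delivery and failed data quality checks."""
    alerts = []
    if delivery.delivered_at is None and now > delivery.expected_at + grace:
        alerts.append(f"{delivery.product}: delivery is late")
    if delivery.failed_checks > 0:
        alerts.append(f"{delivery.product}: {delivery.failed_checks} data quality check(s) failed")
    return alerts


def notify(alerts, subscribers, product):
    """Pair each alert with every consumer subscribed to the product."""
    return [(consumer, alert) for alert in alerts for consumer in subscribers.get(product, [])]


# Invented example: the orders data product was due at 08:00 and has not arrived by 09:00.
now = datetime(2023, 7, 11, 9, 0)
late = Delivery("orders", expected_at=datetime(2023, 7, 11, 8, 0), delivered_at=None)
alerts = evaluate(late, now)
notifications = notify(alerts, {"orders": ["finance", "marketing"]}, "orders")
```

Because consumers are notified directly, the platform team no longer has to relay messages between providers and consumers.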

To support the addition of new domains, complement your architecture with standardized services for data observability, data quality, data transformation, and automation (Image design by Piethein Strengholt)

The recommended approach when adding domains is to focus on the source system side of your architecture for a while, before scaling up the consuming side. Why? Typically, there are many more data consumers than data providers. In addition, consumer-oriented analytical services are complex and draw a lot of attention. It is therefore essential to guarantee stable and scalable delivery of new data products before you add large numbers of consumers. If you shift your focus to the consuming side too quickly, you risk seeing all your teams trying to fix the same data engineering problems again and again. This can cause your entire organization to lose trust in the architecture, undermining your data ambitions. So, for every step you make, make sure you’re adding business value, while at the same time not losing trust from the business in the effectiveness of the data platform.

After you’ve made data product development more efficient, implement the first set of computational data governance controls. For example, don’t allow data to be shared without first linking it to a data owner. For this, you’ll need to use workflows that are provided by your data governance solution. The goal of this exercise is to put guardrails in place, allowing your domain teams to be more self-supporting without unknowingly causing themselves harm. The role of the central team during this phase changes. They oversee what help is needed and may step in when required. Thus, instead of executing all data governance–related activities themselves, the central team trains, coaches, and guides other teams. In addition, this is the point where the central team needs to start thinking, “Control is good; trust is better.” Platform policies and audit reports act as automated declarative guardrails.
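The example control above, not allowing data to be shared without first linking it to a data owner, can be expressed as a guardrail in code. This is a minimal sketch: the `GovernanceCatalog` class and its methods are hypothetical and stand in for the workflows your data governance solution provides.

```python
class GovernanceError(Exception):
    """Raised when an action would violate a governance control."""


class GovernanceCatalog:
    """Hypothetical stand-in for the registry of a data governance solution."""

    def __init__(self):
        self._owners = {}   # data product -> data owner
        self._shares = {}   # data product -> set of consumers

    def assign_owner(self, product, owner):
        self._owners[product] = owner

    def share(self, product, consumer):
        """Share a data product, but only if it has been linked to an owner."""
        if product not in self._owners:
            raise GovernanceError(f"{product} has no data owner; sharing is blocked")
        self._shares.setdefault(product, set()).add(consumer)
        return self._owners[product]

    def consumers(self, product):
        return sorted(self._shares.get(product, set()))
```

A domain team that calls `share` before `assign_owner` gets an immediate and explanatory error, which is the kind of guardrail that lets teams be self-supporting without unknowingly causing harm.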

Migration or legacy scenarios

While onboarding new use cases and domains into your architecture, it’s likely that you will run into migration or legacy scenarios. For example, you may encounter situations in which consumers demand historical data from several years back. When building up your new data product architecture, you’ll learn that historical data is only available from the moment you onboarded your first data products into the new architecture. So, if you need historical data from before the onboarding period, you’ll have a gap. To solve this problem, make an extract or one-off copy of the historical data from your other environment(s). For example, if your data warehouse retains data from the past seven years, you can use that data to build a legacy data product, then combine that legacy data product with incoming data that feeds into the new architecture. This will give your domains the full picture. Note that combining historical data with new data isn’t always that easy, however; you’ll often need to match fields, delete duplicates, clean the data, or write business logic. When scaling up further, you’ll need to interconnect domains by enabling them to exchange or directly share data products. For this, you need to set interoperability standards and implement query services. Consider popular file formats such as Parquet or Delta, and (serverless) SQL services to allow other domains to access and browse the data products.
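Combining a legacy data product with newly onboarded data, as described above, typically involves mapping fields and removing duplicates. The sketch below shows one possible approach on in-memory records; the legacy column names, the field mapping, and the rule that the new architecture's row wins on overlap are all assumptions for the example.

```python
# Assumed mapping from legacy warehouse column names to the new data product's names.
FIELD_MAP = {"CUST_NO": "customer_id", "ORD_DT": "order_date", "AMT": "amount"}


def harmonize(legacy_row):
    """Rename mapped legacy fields and drop fields the new data product does not know."""
    return {FIELD_MAP[name]: value for name, value in legacy_row.items() if name in FIELD_MAP}


def combine(legacy_rows, new_rows, key=("customer_id", "order_date")):
    """Union both sets; when a key occurs in both, the row from the new architecture wins."""
    merged = {}
    for row in [harmonize(r) for r in legacy_rows] + list(new_rows):
        merged[tuple(row[k] for k in key)] = row
    return sorted(merged.values(), key=lambda r: (r["order_date"], r["customer_id"]))


# Invented sample data: one order exists in both environments with different amounts.
legacy = [
    {"CUST_NO": 1, "ORD_DT": "2019-03-01", "AMT": 50.0, "LEGACY_FLAG": "Y"},
    {"CUST_NO": 1, "ORD_DT": "2023-01-05", "AMT": 70.0},
]
new = [
    {"customer_id": 1, "order_date": "2023-01-05", "amount": 75.0},
    {"customer_id": 2, "order_date": "2023-02-01", "amount": 20.0},
]
full_history = combine(legacy, new)
```

Which side wins on overlap is a business decision, so it is worth agreeing on that rule with the data owner before building the legacy data product.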

Optimization Phase: Professionalize Capabilities

After the foundation has been established, it’s time to iterate on prioritized business use cases and further professionalize your capabilities. One of the key objectives for this stage is to carry over all the supporting activities from the central team to your domain teams. Look for inefficiencies, and try to solve these with self-service and automation. To strengthen your organization, guide your teams so they become more efficient in managing their data and corresponding data pipelines. Allow them to self-onboard and self-subscribe to data products. Deploy services that allow for self-registration and maintenance of metadata. For example, deploy APIs that can be easily integrated with the CI/CD processes of your domains.

For the next iteration of your architecture, you will focus on real-time data processing, consumption-readiness, security, MDM, and distribution of curated data. Try to standardize the consuming side of your architecture with blueprints and services. Remember that data usage is diverse: many variations are possible. To remain in control, gradually expand by launching one new service at a time. For each new service, evaluate the need before handing it out to your domains. The image below shows an abstract example of an updated design.

As the adoption of your architecture grows, new services will be added (Image design by Piethein Strengholt)

One key thing about the solution architecture is that many of the services don’t yet support a federated way of working. Your central team must focus on closing these gaps by integrating services and delivering self-service and automation capabilities. They are the ones to arbitrate between deviation, tools, and technical requirements. They own all choices related to programming frameworks, ETL services, and storage-related services.

For managing data reusability concerns and consistency, look for data products that have the highest levels of usage. What you’ve most likely encountered is multiple domains complaining about data that is too hard to integrate and combine with data from other domains. So, look for repetitive harmonization and quality improvement activities that are being performed across teams. If you see many of these overlapping activities, the data products in question might be candidates for master data management. Alternatively, you can separate out generic (repeatable) integration logic and ask one data product team to take ownership of the data using a customer/supplier model.

As part of the data sharing experience, your data product teams should be able to describe which data products can be used for which purposes. Domains should be able to manage, in close collaboration with the central governance team, data access controls on data they own. Unfortunately, at the time of writing, I’m not aware of any out-of-the-box solutions that will give your teams a great experience. If you want to support your teams with easy-to-follow workflow processes and automated creation of policies, consider building a small data contract framework yourself that deploys consumer-oriented (secure) views using a workflow process. This framework may include pointers to business semantics, as well as data quality and service level agreements.
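A very small version of such a data contract framework might look like the sketch below: a contract describes the consumer, purpose, and columns, and an approval step renders a consumer-oriented view. The `DataContract` fields, the view naming convention, and the generic SQL are assumptions for illustration; the sketch also trusts its inputs, so a real implementation would need to validate identifiers and predicates before generating SQL.

```python
from dataclasses import dataclass


@dataclass
class DataContract:
    product: str          # table or view that backs the data product
    consumer: str
    purpose: str          # what the consumer is allowed to use the data for
    columns: list         # columns exposed to this consumer
    row_filter: str = ""  # optional SQL predicate (trusted input in this sketch)


def view_name(contract):
    """Hypothetical naming convention: one view per product and consumer."""
    return f"{contract.product}_for_{contract.consumer}"


def render_view(contract):
    """Render a generic CREATE VIEW statement that exposes only the agreed columns."""
    sql = (
        f"CREATE VIEW {view_name(contract)} AS "
        f"SELECT {', '.join(contract.columns)} FROM {contract.product}"
    )
    if contract.row_filter:
        sql += f" WHERE {contract.row_filter}"
    return sql


def approve_and_render(contract, approved_by_owner):
    """Workflow step: only an owner-approved contract results in a view."""
    if not approved_by_owner:
        raise PermissionError("contract must be approved by the data owner")
    return render_view(contract)


contract = DataContract(
    product="orders",
    consumer="marketing",
    purpose="campaign analysis",
    columns=["order_id", "region"],
    row_filter="region = 'EU'",
)
statement = approve_and_render(contract, approved_by_owner=True)
```

Keeping the contract as data means it can also carry the pointers to business semantics, data quality expectations, and service level agreements mentioned above.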

When scaling up further, it is important to have your structures for data governance clear. So, you need to move away from poorly defined data roles to a clear structure with well-aligned processes. Depending on your organization’s size, you may have multiple governing bodies and data product teams that interact.

An example of how different data governance bodies and domain teams can work together (Image design by Piethein Strengholt)

At the top, your governance bodies manage strategic oversight, working together to further the enterprise’s vision and goals. Thus, decisions on high-level designs, roadmaps, and large programs have a huge dependency on the new architecture. In these bodies you usually expect representation from the chief data officer, lead architects, senior domain owners, and other executive management members.

At the bottom, your domain teams are grouped together in domain bodies for managing use cases and dependencies, onboarding new data, and reworking issues. While doing this, they might receive feedback and requirements from other teams. Thus, between all bodies there is interaction. For example, one domain team might require mediation from another domain team or a strategic decision from the program management team.

After you’ve worked out your governance structures and set up your meeting cadence for synchronizing and coordinating the planning and feedback loops, it’s time for the icing on the cake: intelligent data fabric and data marketplace capabilities. Here, too, there are gaps to fill, so you’ll need to wait, conform, or address the gaps yourself with homegrown application components.

DataOps and Governance

The shift to DataOps and good governance is mostly a cultural one. For this transition, you need a central team that takes the lead for setting standards and developing configuration templates and blueprints. This same team is also responsible for setting up coaching, creating training materials, organizing walk-in sessions, and so on. Don’t underestimate the weight of these activities; they can require additional staffing.

All of these changes, in addition to the architectural changes, are significant. Enterprise architecture plays an important part in aligning all of these activities. Enterprise architects must be seen as leaders who can guide development teams through the implementation of future-state architectures. They must be pragmatic and realistic, yet lay out the vision and inspire everyone to follow. They must breathe technology and excel in different areas, such as security, cloud infrastructure, software architecture, integration, and data management. Enterprise architects must also have a very deep understanding of business boundaries and know how to decouple them using modern integration patterns and business architecture.

Your enterprise architecture practice must find a balance between longer-term objectives and practicality. This need is contributing to the demise of enterprise architecture frameworks (such as The Open Group’s TOGAF standard) because their logic of formalizing longstanding, static future states doesn’t fit well into the new world of DevOps and DataOps. Yes, you need to paint the big picture, but you also need to be comfortable letting go of the “policing” mindset. A modern enterprise architect should become a community leader, taking the initiative to define minimum viable products, organizing design and whiteboard sessions, and discussing and translating customer needs. Leave the details to the teams, but be an authoritative expert when things go wrong.

Last Words

Future-generation enterprise data landscapes will be organized in completely new ways. Data, in my view, will be much more distributed in the years to come — and so will data ownership. Enterprises need to learn how best to balance the imperatives of centralization and decentralization. This change, as you will soon start to experience for yourself, requires trial and error, and a new vision of data management.

A last word of advice: don’t be nervous, flying is the safest form of travel! There’s plenty of fun to be discovered in the wonderful world of data. We’re only getting started.

If you want to learn more, feel free to check out my book Data Management at Scale, 2nd edition.

Piethein Strengholt

Hands-on data management professional. Working @Microsoft.