Many organizations ask me how I see Microsoft Purview in relation to data mesh or a federated way of working. They ask me to share best practices for establishing a domain-oriented architecture. In this blog post, let’s explore how Purview can support your data governance ambitions.
Purview is Microsoft’s unified data governance service that enables organizations to manage their data at scale. It uses names such as glossary, collections and assets for organizing metadata. Let’s find out what these mean by starting with the basics. For this, I propose a high-level design that shows an overview of the different metadata areas that are managed within Purview.
At the top, you see business metadata, which is about providing (business) context to users. For managing business metadata, Purview provides a Glossary: a vocabulary for business users. At first sight, a glossary is a list of business terms with definitions and relations to other items. The glossary is important for maintaining and organizing information about your data. It captures domain knowledge that is commonly used, communicated, and shared in organizations as they conduct business.
There aren’t any rules for the precise size and representation of glossaries. They can stay abstract or high-level, but can also be detailed, carefully describing attributes, dependencies, relationships and definitions. A glossary isn’t limited to a single domain; in fact, it can cover countless applications or multiple databases from many domains. From this point of view, multiple applications can work together to accomplish a specific business need. It means that the relation between a glossary and data attributes is a one-to-many relationship.
A glossary can also capture more terms than the concepts representing the application or database itself. It can include concepts that are used to make the context clearer, but don’t (yet) play a direct role in the application or database design. It may include concepts that represent future requirements, but haven’t found their way into the actual design of the application or database yet. Thus, a general best practice is to advise your teams on how the glossary should be used.
When implementing your glossary, it’s important to consider how to structure your business terms and definitions. For example, you can use hierarchies and align these with your business domains, or introduce naming standards or term templates for capturing additional information about your business metadata. You could also use relationships for linking to contact persons or to other business terms, such as Acronyms, Related Terms and Synonyms. These relationships avoid creating terms with duplicated names and lower the management overhead.
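To make the idea of hierarchies and relationships a bit more tangible, here is a minimal sketch in Python. It is not Purview’s glossary schema or API: the `Term` and `Glossary` classes and all field names are my own assumptions, chosen only to show how synonyms and acronyms can resolve to one canonical term instead of producing duplicates.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Term:
    """A business term. Field names are illustrative, not Purview's schema."""
    name: str
    definition: str
    parent: Optional[str] = None                      # hierarchy, e.g. a business domain
    synonyms: set[str] = field(default_factory=set)
    acronyms: set[str] = field(default_factory=set)


class Glossary:
    """Keeps one canonical term per label, so aliases never become duplicates."""

    def __init__(self) -> None:
        self._terms: dict[str, Term] = {}
        self._aliases: dict[str, str] = {}            # lowercased label -> canonical name

    def add(self, term: Term) -> None:
        if term.parent is not None and term.parent not in self._terms:
            raise KeyError(f"unknown parent term '{term.parent}'")
        labels = {term.name, *term.synonyms, *term.acronyms}
        # Reject the whole term if any of its labels is already taken.
        for label in labels:
            owner = self._aliases.get(label.lower())
            if owner is not None:
                raise ValueError(f"'{label}' already resolves to '{owner}'")
        for label in labels:
            self._aliases[label.lower()] = term.name
        self._terms[term.name] = term

    def resolve(self, label: str) -> Term:
        """Return the canonical term for a name, synonym or acronym."""
        return self._terms[self._aliases[label.lower()]]

    def path(self, label: str) -> str:
        """Return the hierarchical path of a term, root first."""
        term = self.resolve(label)
        parts = [term.name]
        while term.parent is not None:
            term = self._terms[term.parent]
            parts.append(term.name)
        return " > ".join(reversed(parts))


glossary = Glossary()
glossary.add(Term("Sales", "Everything related to selling products."))
glossary.add(Term("Customer", "A party that buys products.", parent="Sales",
                  synonyms={"Client"}, acronyms={"CUST"}))
```

Looking up "Client" or "CUST" returns the Customer term, and trying to add a new term named "Client" fails, which is the behavior you want from synonym relationships.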
Technical metadata is about providing (technical) information about the solution space. It’s about how applications and databases are implemented. For technical metadata, Purview uses Collections.
When planning your Azure Purview deployment and aligning your data governance activities, you need to define how technical metadata, such as data asset information, will be managed together. This grouping and the granularity of your technical metadata is what Collections are for. A Collection is a logical container or a boundary in which your metadata, such as information about your data sources, will be managed. A collection is also used for security, allowing users to access or manage metadata.
When creating collections and placing them in a hierarchy, you need to make different considerations, such as your security requirements, governance structure and democratization needs. For instance, a more centralized-aligned style of data management could lead to a different collection structure than a more domain-oriented style of data management.
For a domain-oriented or federated way of working, it’s recommended to clearly scope your business and technical metadata, delegate ownership of it, and align it with your data domains. Technical metadata is used for organizing data assets and sources. You can cluster technical metadata within a collection, use this as a boundary and align this with a particular domain.
Business terms and information from your collections can also be linked together. Bringing information from your source systems together by linking it to the same business terms gives a richer experience to your users. It helps them to understand how concepts and business terms have been translated to technical designs. It also enables you to correlate data across different systems and applications. Customer data, for example, is sometimes stored across different systems. By linking to business terms, you can better oversee and more efficiently manage your data landscape.
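A small sketch may help to picture a collection as both a container and a security boundary. The code below is a simplified model in Python, not the Purview API: the `Collection` class is hypothetical, the collection and group names are made up, and the assumption that a role assigned on a parent also applies to its children is a simplification of how permissions are inherited.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Collection:
    """A logical container for metadata that doubles as a security boundary."""
    name: str
    parent: Optional["Collection"] = None
    # role name -> principals assigned directly on this collection
    roles: dict[str, set[str]] = field(default_factory=dict)

    def assign(self, role: str, principal: str) -> None:
        self.roles.setdefault(role, set()).add(principal)

    def has_role(self, role: str, principal: str) -> bool:
        """Check this collection first, then walk up to the root (inheritance)."""
        node: Optional[Collection] = self
        while node is not None:
            if principal in node.roles.get(role, set()):
                return True
            node = node.parent
        return False

    def path(self) -> str:
        """Return the position in the hierarchy, root first."""
        parts = []
        node: Optional[Collection] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))


# Hypothetical layout: one root collection and one collection per domain.
root = Collection("contoso")
sales = Collection("sales", parent=root)
finance = Collection("finance", parent=root)

root.assign("data reader", "all-employees")
sales.assign("data source admin", "sales-engineers")
```

With this layout, the sales engineers manage sources only within their own domain boundary, while readers assigned at the root can browse metadata everywhere.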
Lineage, or data lineage, is about providing insight into how data moves over time. For example, when extracting data and copying it from one application into another, you want insight into all the selections, transformations and data movements. For lineage, Purview uses datasets and processes. Datasets are also referred to as nodes, while processes are also called edges:
- Dataset (Node): A dataset (structured or unstructured) provided as an input to a process. For example, a SQL Table, Azure blob, and files (such as .csv and .xml), are all considered datasets. In the lineage section of Microsoft Purview, datasets are represented by rectangular boxes.
- Process (Edge): An activity or transformation performed on a dataset is called a process. For example, ADF Copy activity, Data Share snapshot and so on. In the lineage section of Microsoft Purview, processes are represented by round-edged boxes.
Lineage is typically captured from services that extract, transform and load data. These services are, for example, Azure Data Factory, Azure Data Share, and Power BI. In Purview, lineage is automatically captured when scanning. Additionally, Purview supports manual or custom lineage. Custom lineage is lineage that you create yourself, for example by uploading metadata using Purview’s Atlas hooks or REST APIs.
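To illustrate what custom lineage looks like as metadata, here is a minimal sketch that builds a payload of two datasets (nodes) and one process (edge). It follows the Apache Atlas entity model as I understand it, using the generic `DataSet` and `Process` base types. The helper functions and all qualified names are hypothetical, and the sketch deliberately stops before any upload, so verify the exact payload shape against the API documentation before sending it anywhere.

```python
import json


def dataset(qualified_name: str, name: str, type_name: str = "DataSet") -> dict:
    """Describe a dataset (a node in the lineage graph)."""
    return {
        "typeName": type_name,
        "attributes": {"qualifiedName": qualified_name, "name": name},
    }


def reference(entity: dict) -> dict:
    """Point at an entity through its unique qualified name."""
    return {
        "typeName": entity["typeName"],
        "uniqueAttributes": {
            "qualifiedName": entity["attributes"]["qualifiedName"]
        },
    }


def lineage_payload(qualified_name: str, name: str,
                    inputs: list[dict], outputs: list[dict]) -> dict:
    """Bundle the datasets and the process (the edge) that connects them."""
    process = {
        "typeName": "Process",
        "attributes": {
            "qualifiedName": qualified_name,
            "name": name,
            "inputs": [reference(e) for e in inputs],
            "outputs": [reference(e) for e in outputs],
        },
    }
    return {"entities": [*inputs, *outputs, process]}


# Hypothetical example: a nightly job copies a CRM table into the data lake.
source = dataset("crm://sales/customers", "customers")
target = dataset("lake://sales/customers.parquet", "customers.parquet")
payload = lineage_payload("job://sales/copy-customers", "copy-customers",
                          [source], [target])
print(json.dumps(payload, indent=2))
```

The important part is that the process only references its inputs and outputs. The datasets stay independent entities, so the same dataset can take part in many lineage paths.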
During Ignite 2022, Microsoft announced that it will extend Purview with a business metamodel. On an abstract level, a metamodel reflects the basic meaning of how your metadata and the relationships to other metadata are managed in your data governance solution. Let me explain how this works in practice.
So far, we mainly looked at metadata that has been predefined by Purview. You learned about business terms for describing context, collections for managing technical metadata and assets for cataloging real sources. But what if you want to extend this metadata or enrich it with your own metadata? This is what the metamodel is about. For example, you can add entities like “business process”, “business application”, “business unit” or “data domain”. You can also use these entities for creating relationships to other entities, allowing your users to relate information or enforcing that relationships are set. For example, a data domain could belong to another data domain, to make clear it’s a subdomain. Or a business process must always be linked to a department (see image below).
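The two examples above, a subdomain relationship and a mandatory link between a business process and a department, can be sketched in a few lines of Python. This is not how Purview implements its metamodel. The classes, relationship names and entity names below are assumptions that only illustrate the idea of typed entities with allowed and required relationships.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RelationshipRule:
    """Allows (and optionally requires) a relationship between two entity types."""
    source_type: str
    name: str
    target_type: str
    required: bool = False


@dataclass
class Entity:
    type_name: str
    name: str
    links: dict[str, list["Entity"]] = field(default_factory=dict)


class Metamodel:
    def __init__(self, rules: list[RelationshipRule]) -> None:
        self._rules = {(r.source_type, r.name): r for r in rules}

    def relate(self, source: Entity, name: str, target: Entity) -> None:
        """Create a relationship, but only if the metamodel allows it."""
        rule = self._rules.get((source.type_name, name))
        if rule is None or rule.target_type != target.type_name:
            raise ValueError(
                f"'{source.type_name}' -[{name}]-> '{target.type_name}' is not allowed"
            )
        source.links.setdefault(name, []).append(target)

    def violations(self, entity: Entity) -> list[str]:
        """List the required relationships that are still missing."""
        return [
            rule.name
            for rule in self._rules.values()
            if rule.required
            and rule.source_type == entity.type_name
            and not entity.links.get(rule.name)
        ]


metamodel = Metamodel([
    RelationshipRule("business process", "performed by", "department", required=True),
    RelationshipRule("data domain", "subdomain of", "data domain"),
])

onboarding = Entity("business process", "Customer onboarding")
missing_before = metamodel.violations(onboarding)   # the department link is missing
metamodel.relate(onboarding, "performed by", Entity("department", "Sales"))
missing_after = metamodel.violations(onboarding)    # nothing is missing anymore
```

The design choice here is to keep the rules separate from the entities, so the same set of rules can both guide users when they relate information and validate entries afterwards.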
The benefit of organizing your metadata using a metamodel is that it enables your users to manage your metadata in a controlled and structured way. It’s your single pane of glass for data management from which you can oversee all data domains, data products, physical data, lineage, and so on. At the same time your metamodel could act as a source for sharing metadata to other applications and processes. So, you could integrate your metadata with other applications, such as data security or ETL tools.
You can use the metamodel for defining data products as well. So, instead of seeing data products as physical representations, you could choose to manage these as logical entities from which you draw relationships to the underlying technology architecture. This change requires a data product to be first defined as technology-agnostic. Consequently, by making it a logical unit, you include a logical dataset name, the relationships to the originating domain, its unique elements, business terms, the owner of the dataset, and references to the physical data, that is, the actual data itself. You can see an example of how this works below.
The motivation for this approach is that you can keep a data product semantically consistent for the business, while simultaneously allowing it to have multiple representations and different shapes on a physical level. For example, imagine a situation in which you have two semantically consistent datasets. One dataset is stored in a Parquet file format and another one in Delta. These datasets are in fact the same data and should be managed as such! So, if multiple physical datasets contain semantically identical data, you want to link all of these physical datasets to the same data product. By doing so, you ensure the ownership for all these datasets is clearly set. The screenshot below visualizes this.
Another motivation for defining a data product as a logical entity is that you can more easily connect a workflow to the process of data product creation. So, each time your users instantiate a data product, you require them to always define a data owner first. Next, your users will use this entity to draw relationships to a data domain, an owner and the underlying data. Finally, someone signs off on the process to ensure that your newly defined data product becomes available as a certified asset.
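The sketch below shows one way to think about a data product as a logical entity with a sign-off step. It is an illustration in Python, not a Purview feature or workflow definition. The `DataProduct` class, its fields and the storage paths are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataProduct:
    """A logical, technology-agnostic data product."""
    name: str
    owner: Optional[str] = None
    domain: Optional[str] = None
    business_terms: list[str] = field(default_factory=list)
    # References to physical data; several formats may hold the same data.
    physical_assets: list[str] = field(default_factory=list)
    certified: bool = False
    approved_by: Optional[str] = None

    def missing(self) -> list[str]:
        """Return what still has to be filled in before sign-off."""
        gaps = []
        if not self.owner:
            gaps.append("owner")
        if not self.domain:
            gaps.append("domain")
        if not self.physical_assets:
            gaps.append("physical_assets")
        return gaps

    def certify(self, approver: str) -> None:
        """Sign off the data product; refuse while mandatory links are missing."""
        gaps = self.missing()
        if gaps:
            raise ValueError(f"cannot certify '{self.name}', missing: {', '.join(gaps)}")
        self.certified = True
        self.approved_by = approver


# One logical product, two physical representations of the same data.
customers = DataProduct(name="Customers", owner="jane.doe", domain="Sales")
customers.business_terms.append("Customer")
customers.physical_assets += [
    "lake://sales/customers.parquet",   # Parquet representation
    "lake://sales/customers_delta",     # Delta representation
]
customers.certify(approver="data-governance-board")
```

Because both physical datasets hang off the same logical entity, ownership is set once and applies to every representation of the data.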
Domain-oriented way of working
Enough about the basics, so let’s go back to the core question from the start of this blog post. What best practices and workarounds are needed for federating responsibilities? Below are the items you should consider when implementing Purview within your organization:
- A good starting point is making a proper domain decomposition of your organization before assigning roles and responsibilities. In addition to that: slowly scale up. Start with only a few domains. Evaluate, before you continue adding more domains.
- For managing technical metadata, the general best practice is to establish collections and align these with your domain teams or domain structure. If so, technical users from each domain team get data source admin rights on a collection level for managing technical information about their data sources. More information about structuring collections can be found here: https://learn.microsoft.com/en-us/azure/purview/concept-best-practices-collections
- Domains that are king at DataOps should consider using Purview’s APIs. Service principals can be added on a collection level, so they’re great for automatically triggering scans after releasing a data pipeline or pushing metadata automatically into Purview.
- Be cautious about too quickly adding many sources. Strike a balance between scanning all data and only data that truly matters. As of now, there are no features to hide any metadata that is brought into Purview. Thus, for now, start with only data that matters. For example, start with data that is distributed between domains or data that is used downstream in analytical solutions. Gradually extend your scope over time by adding more sources. This approach ensures your catalog won’t become a mess.
- When scanning operational applications and data products, consider using additional collections for segregating the concerns of managing metadata about the inner architecture of your applications and metadata about data that is being distributed between domains. For example, you could think about setting up a collection per domain, and another more generic collection for all data products. Below this generic data product collection, you again add collections for each domain. The benefit of such a structure is that you can better support the different life cycles of application development and data product development. Additionally, such a structure allows you to better navigate and zoom into data that truly matters.
- For aligning glossaries and domains, the best practice is to use the multi-glossary feature. In this approach, each domain uses its own glossary for building up business metadata. On each of these, you configure fine-grained rights for business users and other users.
- To prevent glossary terms from being managed by the wrong domains, consider adding workflows that require approval from the respective domain owners. For achieving more complex and end-to-end orchestration, consider using an HTTP connector within your workflow.
- For data lake services (ADLS Gen2) that are shared between domains, the recommendation is to register these resources at the parent level of your collection structure. Then, on this parent level, you create scans (using scoped scans) that are specific to your data domains (assuming they follow container-level segregation for data domains). So, for example, if all data products are managed in containers within a single (shared) data lake service, perform a scoped scan for each domain. Within this scoped scan, select the container that holds the essential data of the domain. You must repeat these steps for all of your domains.
- Azure Data Factory and Power BI, for example, ingest all of their metadata at the root. You might encounter similar issues when scanning shared or combined data lake services. For addressing these issues, consider bulk movements between collections: https://medium.com/@piethein/purview-bulk-collection-mover-6e8a9309ba3a
- For the scanning and managing services, consider aligning your architecture with the Cloud Adoption Framework. Use data landing zones for your different domains. For IaaS, the best practice is to deploy self-hosted integration runtimes inside your domain VNETs. For PaaS, consider using managed private endpoints. More information can be found here: https://learn.microsoft.com/en-us/azure/purview/concept-best-practices-network
- Classifications are best managed by a central team. Here, take it slow: don’t overclassify, and try to automate using regular expressions.
- Many customers that I talk to use Databricks. For capturing lineage, consult the Azure Databricks to Purview Lineage Connector, which is based on OpenLineage. For the metadata within Databricks itself: use Hive or wait for future announcements.
- Some organizations implemented a metamodel in Purview using custom type definitions. For example, a data product is defined as a logical entity in Purview via the PyApacheAtlas SDK. My general advice for this approach is to wait for the metamodel to become final. The SDK is complex and doesn’t allow for easy self-service. A custom metamodel is probably a heavy investment, and by the time you’re ready, the new business metamodel feature may well be available.
- Stay away from the pattern of implementing many Purview instances. Purview is positioned to be the central, enterprise-wide data governance service for managing your data end-to-end. If you insist on implementing multiple instances, position them clearly and avoid overlap!
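The advice on classifications in the list above, automate with regular expressions without overclassifying, can be made concrete with a small sketch. This does not reproduce Purview’s classification engine. The `classify_column` function, the threshold of 60 percent and the customer number pattern are assumptions for illustration only.

```python
import re


def classify_column(values: list[str], pattern: str,
                    min_match_ratio: float = 0.6) -> bool:
    """Classify a column only if enough of its non-empty values match fully.

    Requiring a full match and a minimum ratio guards against
    overclassifying columns that merely mention a matching value.
    """
    non_empty = [v for v in values if v]
    if not non_empty:
        return False
    regex = re.compile(pattern)
    hits = sum(1 for v in non_empty if regex.fullmatch(v))
    return hits / len(non_empty) >= min_match_ratio


# Hypothetical pattern for an internal customer number such as CUST-000123.
CUSTOMER_NUMBER = r"CUST-\d{6}"

id_column = ["CUST-000123", "CUST-004567", "n/a", "CUST-999999"]
notes_column = ["call CUST-000123 back", "ok", "", "sent invoice"]

is_id_classified = classify_column(id_column, CUSTOMER_NUMBER)
is_notes_classified = classify_column(notes_column, CUSTOMER_NUMBER)
```

The identifier column is classified because three out of four values match completely. The free-text notes column is not, even though one note mentions a customer number.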
The items listed in this blog post are considerations to take into account when implementing a federated way of working. Please be aware that Purview is a relatively new product. It was launched in September 2021. More updates are expected soon, so stay tuned! If you have more recommendations or best practices, please feel free to leave a comment.