Building the Medallion foundation with Azure Databricks and Unity Catalog
In the upcoming book, Building Medallion Architectures, I initially planned to extensively explore the process of building a Medallion architecture using various services. However, after careful consideration, I decided to adopt a more vendor-neutral approach, focusing on Delta Lake and Spark with Microsoft Fabric as the foundational technology.
Despite this pivot, I believe the insights and methodologies applicable to Azure Databricks are too valuable not to share. Therefore, I have decided to publish the foundational setup process for Azure Databricks and Unity Catalog as complementary book content via this blog post. This allows readers who are keen on using Azure Databricks to implement the strategies discussed, creating a seamless learning experience. They can then easily transition back to the book and continue their exercises on the platform that best suits their needs. Please note that, to ensure the wide applicability and reliability of all content in the book, I have thoroughly tested all code snippets and examples on various services.
In the following sections, I will walk you through the initial setup of Azure Databricks, which is crucial for building a Medallion Architecture. After completing this setup, you will be well-prepared to rejoin the broader discussions in the book and explore the implementation of various layers of the architecture.
Azure Databricks
For those unfamiliar, Azure Databricks is a data platform that provides a robust and collaborative environment. It is ideally suited for engineers, data scientists, and business analysts working together on scalable data models. The platform prominently features Spark and Delta Lake, enhancing its capability to handle complex data operations.
Azure Databricks features a user-friendly, managed web interface and boasts seamless integration with a wide array of Azure services, enhancing both its utility and ease of use. A prime example of this integration is with Microsoft Entra ID (previously called Active Directory), which streamlines user authentication. Users can log into Databricks using their Entra ID credentials, and administrators have the option to bolster security by configuring the system to require multi-factor authentication and other safety measures.
Now that we’ve quickly set the scene, let’s dive into the practical steps required to deploy Azure Databricks effectively. In this blog, I will guide you through each phase of the deployment, ensuring that you can fully leverage Azure Databricks in your data architecture endeavors. Along the way, I’ll also share reflections on how things fit into the broader context.
Get started: Azure Databricks
To begin using Azure Databricks, the initial step is to deploy it, which involves several stages and requires an Azure subscription. If you want to experience this firsthand, you’ll need to log into the Azure Portal and create a resource group. This resource group will act as a container for various services, including Azure Databricks, facilitating efficient access control management and cost reporting.
Once your resource group is ready, the next step is to deploy an Azure Databricks Workspace within that resource group. This workspace is an interactive and collaborative working environment where you can perform tasks like data discovery, exploration, and development of new workloads. I recommend having separate workspaces for different teams, projects, lines of business, or for development and testing activities, to ensure better organization and resource management.
However many workspaces you deploy, it’s important to understand that each Databricks environment comprises two essential components, as illustrated in the image below.
Firstly, the Databricks control plane ensures reliable and secure management of your workspace, including notebooks, clusters, and jobs. This component is managed by Microsoft and operates on their infrastructure backbone.
Secondly, within your Azure subscription, you have control over compute resources and storage accounts. These services require you to take charge of their deployment and customization to meet your specific data processing and storage needs. Let’s explore how you can effectively deploy and manage these services yourself.
Create an Azure Databricks Workspace
To create an Azure Databricks workspace, start by logging into the Azure Portal. Then, search the Marketplace for “Azure Databricks”, and click on “Create a resource.” A new dialog will appear. See the image below for an example of this dialog.
During the Azure Databricks deployment, you will be asked for a workspace name and to choose between a Standard, Premium (+ Role-based access controls), or Trial pricing tier.
For this exercise, we will choose the Premium version, knowing that it is ideal for organizations that require advanced security and compliance features. It offers additional capabilities such as role-based access controls, Azure AD credential passthrough, and more. Once you have selected the pricing tier, you will need to choose the region where you want to deploy your workspace. It is recommended to choose a region that is close to your data sources to minimize latency. After selecting the region, click Next to move on to the network settings.
At this stage, you will also need to specify the network configuration for the Azure Databricks workspace. In this example, we won’t deploy it into a virtual network, but you can choose to do so if needed. This might be required when you, for example, need to access on-premises data sources or other services within other virtual networks.
After specifying the network settings, you will also be prompted to specify a Managed Resource Group Name. This name identifies a new resource group that Azure Databricks will automatically create to manage the underlying resources of the workspace. If you do not specify a name, Azure Databricks will generate a random name for you. Note that this resource group cannot be the same resource group where you are deploying the workspace, so choose a unique name and be aware that it will be created alongside your other resource group(s).
Once you have completed these steps, you can click on Review + Create. This will validate your settings and create the Azure Databricks workspace. After a few minutes, your workspace will be ready for use. You can access it by clicking on the “Go to resource” button in the Azure Portal. This will take you to the Azure Databricks workspace, where you can start creating notebooks, clusters, and jobs. For a visual representation of the Azure Databricks welcome screen, see the image below.
Manually deploying Azure Databricks through the Azure Portal offers direct control over the setup process, enabling you to tailor configurations to specific requirements. Manual deployment can require careful attention, particularly when managing multiple workspaces concurrently. To streamline this, Azure offers ARM and Terraform templates that automate the deployment process. These templates facilitate a quick and error-free deployment of services like Azure Databricks, significantly simplifying the management of your data architecture projects.
Templates
If you’re looking to streamline your deployment process and save time, templates are the way to go. By using templates, you can automate your deployment process and take advantage of the infrastructure-as-code practice. This means that you define the necessary infrastructure directly in the code, simplifying the setup process and reducing the likelihood of errors.
One of the benefits of using templates is that they use a declarative syntax. This means you can state what you want to deploy without having to detail every programming command. You simply need to specify which resources to deploy, their properties, and the resource group they belong to. This efficiency makes it easy to deploy services like Azure Databricks quickly and with a single action. To see a practical example of templates and how they work, you can check out the following pages:
- Databricks Terraform provider overview and Terraform Docs.
- Terraform Templates, which are available via Terraform Modules.
- Azure Databricks templates for ARM are available via this link.
To effectively manage costs, technology deviations, and security, you should consider using a base template that can be customized for different environments, such as development, testing, and production, within your infrastructure continuous integration and continuous delivery (CI/CD) pipelines. Templates make it easy to provide multiple teams with consistent environments. Best practices suggest keeping templates in a single repository and managing them with a unified team. This helps in standardizing processes and ensuring compliance across all deployments.
For now, our focus shifts to manually deploying services. This hands-on approach will give us clearer insights into how the different components of the architecture interact together. In the next sections, we will continue our deployment process with the creation of compute resources, which are crucial for running data processing jobs.
Create Compute Resources
In Azure Databricks, you have the option to use Serverless Compute — such as Serverless Compute for Notebooks, Workflows, and Delta Live Tables, or Databricks SQL — or to create clusters for running your data processing jobs. For those opting to deploy a cluster, it is important to understand that a cluster consists of virtual machines that execute Spark jobs. Depending on the specific requirements of your project, you can choose from various types of clusters. For instance, you might select a cost-effective cluster for development tasks and opt for a high-performance cluster for complex production workloads.
To get started, log into your Azure Databricks workspace and click on the Compute tab on the left side. From there, you can create a new cluster by entering details like the cluster name, type, Spark version, and node type. You can also set the number of worker nodes and configure the cluster’s autoscaling settings. Please pay attention to the following configuration options:
- Access Mode: When setting up a cluster, you have three access modes to choose from: single user, shared, and no isolation shared. A single user cluster is typically used by one user, a shared cluster by multiple users, and a no isolation shared cluster allows access by multiple users without isolation between them. Remember, only the single user and shared access modes can integrate with Unity Catalog. For this exercise, select the shared access mode.
- Databricks runtime: This configuration determines the version of Spark your cluster will use. It’s usually best to go with the latest Databricks Runtime version, so you have access to the most up-to-date features and fixes.
- Worker type: This choice affects the type of virtual machines your cluster will use. There are several options available based on your needs. For this example, you should go with the Standard_DS4_v2, which handles most production workloads well.
- Number of worker nodes: This number should match the scale of your data and the complexity of your processing jobs. It’s wise to start with a few nodes and scale up as necessary. Take your time to explore and tweak these settings to best fit your data processing needs.
If you follow the guidance above to configure your new cluster, the dialog should appear as shown in the image below.
When you are done, click on Create compute. Once your cluster is set up, you can start running your data processing jobs. Azure Databricks will automatically set up the necessary resources within the managed resource group you specified during the deployment of your workspace.
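If you prefer to script the cluster creation rather than click through the UI, the sketch below shows roughly how this could look with the Databricks SDK for Python. Treat it as a hedged example, not the exact setup used in the book: the cluster name and autoscaling values are assumptions to adapt, and it assumes the databricks-sdk package is installed and authenticated against your workspace.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()  # assumes authentication to the workspace is already configured

# Create a shared-access-mode cluster roughly matching the settings described above (illustrative values)
cluster = w.clusters.create_and_wait(
    cluster_name="medallion-dev-cluster",  # hypothetical name
    spark_version=w.clusters.select_spark_version(latest=True),
    node_type_id="Standard_DS4_v2",
    autoscale=compute.AutoScale(min_workers=2, max_workers=4),
    data_security_mode=compute.DataSecurityMode.USER_ISOLATION,  # corresponds to the shared access mode
)
print(cluster.cluster_id)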
With the workspace and compute clusters up and running, we can now move on to setting up Azure Data Lake Storage, which will serve as the storage layer for our data architecture.
Setting up Azure Data Lake Storage
Microsoft Azure provides a cloud-based storage service called Azure Data Lake Storage (ADLS) Gen2. It’s a distributed file system tailored for big data storage, emphasizing scalability, performance, and security. ADLS is compatible with Hadoop and other frameworks that utilize the Apache Hadoop Distributed File System (HDFS). Additionally, Hadoop distributions and Azure Databricks come equipped with the Azure Blob File System (ABFS) driver. This driver allows numerous applications and frameworks to directly access data stored in ADLS.
To let Azure Databricks effectively use ADLS, the next step is to deploy and set up an ADLS account. This account will be part of the Medallion architecture and is used to store the raw data ingested by another service, which will be deployed later. For this deployment, we’ll use the ADLS deployment guide. During the creation of the storage account, make sure to select the Hierarchical Namespace option. This feature creates a tree-like structure that is vital for integrating with the vast array of Hadoop software. Additionally, Azure Databricks relies on this structure to organize its data into folders and subfolders, making it a crucial step.
After the deployment succeeds, navigate to your newly deployed ADLS account. Here, you should create three containers named bronze, silver, and gold. These containers will serve as the storage layers for your data architecture, each representing a different stage of data processing and refinement. Additionally, I recommend creating a landing container for more flexibility. Once you have followed all these steps, your ADLS account should look similar to the image below.
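If you’d rather script the container creation instead of using the portal, here is a minimal sketch using the Azure SDK for Python. It assumes the azure-identity and azure-storage-file-datalake packages, a signed-in identity with the right storage permissions, and a placeholder account name you should replace with your own.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account name; replace with your own ADLS Gen2 account
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Create the four containers that act as the layers of the Medallion architecture
for container in ["landing", "bronze", "silver", "gold"]:
    service.create_file_system(file_system=container)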
By now, we have successfully set up our foundational services. We’ve established essential compute resources and configured the storage accounts for Azure Databricks. Let’s take a moment to review and reflect on the three steps we’ve taken and the design of our setup.
- Azure Databricks Workspace: We started by manually deploying an Azure Databricks workspace. This workspace will serve as the primary working environment for development, running data processing jobs and collaborating on data projects.
- Compute Resources: We set up a compute cluster within Azure Databricks to run data processing jobs. This cluster will be used to execute Spark jobs and process data.
- Azure Data Lake Storage: We deployed an Azure Data Lake Storage account and created four containers: landing, bronze, silver, and gold. These containers will serve as the storage layers for our data architecture, each representing a different stage of data processing and refinement.
The considerations for workspaces, compute resources, and storage aspects are detailed exclusively in the book. Therefore, readers keen to expand their knowledge in these areas are recommended to first consult the book before continuing with this exercise. After you have enhanced your understanding of the design choices, you can proceed with the remaining part of this exercise, which focuses on the Unity Catalog. This section also includes configuring credentials and managed locations.
Unity Catalog
Databricks Unity Catalog is a unified data governance solution designed for organizations to manage and secure their data and AI assets. As part of the Databricks platform, Unity Catalog offers a centralized metadata management system that simplifies data discovery, governance, and access control, ensuring that data across an enterprise is consistent, compliant, and securely accessible. With features like fine-grained access control and comprehensive data lineage, Unity Catalog ensures that data management can be effectively implemented at an operational level.
Note that Databricks began to enable new workspaces for Unity Catalog automatically on November 9, 2023, with a rollout proceeding gradually across accounts. If you have an existing workspace, you can verify if Unity Catalog is enabled by consulting this documentation: Unity Catalog: Get Started or the video: Unity Catalog Setup for Azure Databricks. This means that the guidance below might not be applicable to your setup. If you already have Unity Catalog enabled, you can skip the next section and proceed to the section that touches upon external locations. However, the next sections will still provide valuable insights into the setup process.
Setting up Unity Catalog
To build a Medallion architecture using Unity Catalog, you first need to set it up and link it to a separate ADLS (Azure Data Lake Storage) account that Unity Catalog will use exclusively for storing its metadata. Start by creating this new account through the Azure Portal. Make sure to select the Hierarchical Namespace option during the setup process.
Next, you need to establish a new service in the Azure Portal. Search for the “Access Connector for Azure Databricks” service, name it something like UnityCatalog, and pick a region. Then, turn on the “System assigned managed identity” option in the “Managed Identity” tab. After reviewing your settings, hit “Create.”
Now that the service is active and has a managed identity, you need to give this identity the proper permissions to access the ADLS account. Go to the IAM settings of the storage account and assign the “Storage Blob Data Contributor” role to the newly created managed identity.
After that, you need to create a Metastore in the Azure Databricks workspace. This is where all metadata will be registered for Unity Catalog. To create a Metastore, go to the Manage Account option in the upper right corner of your workspace. Only account admins can perform this action. Once you access the account console, proceed to the “Data” section and select “Create Metastore.” Follow the on-screen instructions, which are similar to those shown in the image below.
Now, let’s continue setting up your Metastore in Unity Catalog. Start by naming the Metastore, then specify the location of your newly created ADLS account, which is dedicated to Unity Catalog. Next, input the resource ID of the newly created Access Connector for Azure Databricks.
After entering this information, create the Metastore. Once it is created successfully, link it to your workspace by assigning it to one of the available workspaces. Assign the Metastore to the workspace you set up earlier.
After completing this assignment, click the ‘enable’ button to activate Unity Catalog. By enabling it, you have successfully integrated it into your workspace, allowing for centralized management of data and enhanced collaboration across the various workspaces in your organization.
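To double-check that the Metastore is attached to your workspace, you can run a quick query from a notebook on a Unity Catalog-enabled cluster. A minimal sketch:

# Returns the ID of the metastore currently assigned to this workspace
display(spark.sql("SELECT current_metastore()"))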
Now that Unity Catalog is in place, you’re ready to set up external locations within your Azure Databricks workspace. This step is crucial for enabling access to external storage accounts, such as ADLS. Let’s explore this process in the next section.
External Locations in Unity Catalog
To safely access ADLS accounts using Unity Catalog, you need to deploy another “Access Connector for Azure Databricks” service. This service serves as a bridge, allowing Databricks and Unity Catalog to interact with your storage accounts via managed identities.
To deploy this service, start by heading to the Azure Portal and searching for “Access Connector for Azure Databricks.” Select the service, then enter a name and choose a region under the first tab. Next, check the third tab labeled “Managed Identity” and ensure the “System assigned managed identity” is switched to On. After reviewing your settings, click “Create.”
Next, go to your ADLS account where you have your containers (e.g., Bronze) stored. In the IAM settings, assign the “Storage Blob Data Contributor” role to the managed identity of your Access Connector. This step is important because it grants the necessary permissions for the connector to interact with the storage account.
Once your Azure service objects are set up, it’s time to configure the mapping of storage objects in Databricks. To do this, go to the Catalog Explorer in Databricks and click the + button in the upper right corner. You’ll see options to add a catalog, an external location, or a storage credential. Start by clicking “Add a storage credential.” You’ll see an overview that looks like the dialog below.
Select “Azure Managed Identity” for the “Credential Type” and input the Access Connector’s resource ID that you noted earlier, then click “Create.” Your Databricks workspace is now equipped with a storage credential, enabling access to the storage accounts.
With the credentials in place, the next step is to link them with the storage location. From the same overview, choose “Create new external location” and fill in the details as illustrated in the dialog below.
To create an external location, start by entering “landing” as the name and selecting the storage credential you just created. Then, input the storage location URL into the URL field:
abfss://<container>@<storage-account>.dfs.core.windows.net/
Finally, click the Create button to define this external location. Repeat this process for all other containers that you want to link. Just make sure to update the URL to accurately reflect your container and layer names.
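As an alternative to the dialog, external locations can also be created with SQL from a notebook on a Unity Catalog-enabled cluster. A minimal sketch, assuming a storage credential named landing_credential (replace the credential, container, and account names with your own):

spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS landing
  URL 'abfss://<container>@<storage-account>.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL `landing_credential`)
""")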
In addition to managing external locations, you can also create volumes that point to your external locations.
Volumes in Unity Catalog
Volumes replace previously used mount points and add governance over external locations. They also help you avoid writing hardcoded file locations within your notebooks and scripts, as these volumes serve as a reference point. In addition, they provide a more secure way to access external storage locations, because you can set permissions at this level. Lastly, an external volume always requires an external location to be created first. More information on volumes can be found here: Announcing Public Preview of Volumes in Databricks Unity Catalog.
Volumes are of two types: managed and external. Managed volumes reside within the managed storage area of the schema they belong to and are controlled by Unity Catalog, meaning you don’t need to specify a location for them. Essentially, they are a pointer to a storage space managed by Unity Catalog.
On the other hand, external volumes link to a directory in an external location, set up using storage credentials governed by Unity Catalog.
To set up an external volume, use this command:
CREATE EXTERNAL VOLUME prod_sales.sources.landing
LOCATION 'abfss://<container>@<storage-account>.dfs.core.windows.net/';
Note that a volume in Unity Catalog sits in the hierarchy at the same level as a table. So, it’s the third level: catalog_name.schema_name.volume_name. This means that to create a volume, you need to use or create a schema first. In the example above, the volume is created in the sources schema, which is part of the prod_sales catalog. If you plan on using multiple catalogs and volumes, I recommend creating a consistent naming convention to avoid confusion.
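Once a volume exists, you can address files through the /Volumes path instead of hardcoding abfss URLs. A small sketch of what that could look like in a notebook (the file name is a placeholder, and the format options are an assumption to adjust to your data):

# List the files that the external volume points to
display(dbutils.fs.ls("/Volumes/prod_sales/sources/landing/"))

# Read a file from the volume without referencing the storage account directly
df = spark.read.format("csv").option("header", "true").load("/Volumes/prod_sales/sources/landing/<file>.csv")
display(df)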
With the external locations and volumes configured, you are now ready to establish external and managed tables using these volumes. This setup will allow you to access and process data stored in your ADLS account using Azure Databricks. In the next section, we’ll explore some considerations when working with external locations in Azure Databricks.
External Locations Considerations
When working with external locations in Azure Databricks, it’s important to remember that the storage paths you want to reach must be registered as external locations (or external volumes) within Unity Catalog to prevent errors. Without this registration, Azure Databricks lacks the required credentials to access the external storage, potentially causing access problems.
Furthermore, Databricks no longer permits the direct creation of an external table using an ADLS abfss path or mount path; if the path is not registered in Unity Catalog (for example, as an external location), the operation will raise an exception. This behavior should be considered when deciding on mounting paths and distributing data to non-Databricks services. Essentially, with Unity Catalog you have two options:
- Managed Tables: These tables are stored as Delta tables in storage managed by Databricks. It is not recommended to manipulate them directly using tools external to Databricks. Furthermore, the locations of these managed tables are designated by Globally Unique Identifiers (GUIDs) in the storage area. As a result, pinpointing which folder corresponds to a specific table can be difficult without utilizing Unity Catalog.
- External tables: These tables can only be created after defining an external location in the Unity Catalog. Once the external location is defined, you can create external tables at those locations.
Understanding the differences and consequences of using managed tables compared to external tables is essential. Managed tables are seamlessly integrated within Databricks, ensuring an integrated and secure data handling experience. On the other hand, external tables offer greater flexibility by connecting with external storage locations. This allows you to utilize pre-existing data infrastructures and facilitates the sharing of tables directly with others. However, it’s important to note that when you access external tables directly, you bypass any security policies established in Unity Catalog.
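To make the difference concrete, the sketch below creates one table of each type. It assumes the prod_sales catalog and a bronze_adventureworks schema already exist (they are created in the next section), and that the external path falls under a registered external location; table names, columns, and paths are illustrative.

# Managed table: Unity Catalog decides where the underlying Delta files live
spark.sql("""
  CREATE TABLE IF NOT EXISTS prod_sales.bronze_adventureworks.customer_managed (
    customer_id INT,
    customer_name STRING
  )
""")

# External table: the Delta files live at a path registered as an external location
spark.sql("""
  CREATE TABLE IF NOT EXISTS prod_sales.bronze_adventureworks.customer_external (
    customer_id INT,
    customer_name STRING
  )
  LOCATION 'abfss://<container>@<storage-account>.dfs.core.windows.net/bronze/customer/'
""")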
Note that both types of tables can be accessed through various methods including Databricks SQL, Delta Sharing, SDKs (for Python, Go, Node, CLI, and JDBC), Mirrored Azure Databricks Unity Catalog, as well as through clients that support the Iceberg REST APIs, and more.
In the following sections, we will explore the Unity Catalog object model and how to set up external locations within your Azure Databricks workspace.
Unity Catalog Object Model
The Metastore we created in the previous steps, depicted in the image below, serves as the top-level container for metadata within Unity Catalog. It records metadata concerning data and AI assets, along with the permissions that regulate access to these assets.
In Unity Catalog, Catalogs sit right below the Metastore and play a crucial role in organizing your data assets. They often represent the highest level in your data isolation framework, allowing you to segregate data assets by team, project, or environment. Consider an example company. They could set up different catalogs for each team and purpose. One catalog might manage production data for the sales team, while another handles their development and test data. For instance, the sales team could create a catalog for their production data using the statement:
CREATE CATALOG prod_sales;
Similarly, the marketing team could have separate catalogs for production and for development and testing. This setup ensures that each team accesses only the data relevant to them, keeping operations well organized.
A level further down are schemas, also known as databases. Schemas house tables, views, volumes, AI models, and functions, encapsulating all the technical details. You could use these schemas with naming conventions to separate and classify different layers within the same catalog. For example, you could use the prefix “bronze_” followed by the source system or data source name, such as bronze_adventureworks.
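For instance, a bronze schema for the AdventureWorks source could be created like this (a sketch; adjust the catalog and schema names to your own convention):

spark.sql("CREATE SCHEMA IF NOT EXISTS prod_sales.bronze_adventureworks")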
When creating a new catalog, you have the option to select the storage location for your managed objects. It’s important to ensure that this path is specified in an external location configuration, and that you possess the CREATE MANAGED STORAGE privilege for that configuration. To initiate a new catalog, use the command below:
CREATE CATALOG <catalog-name>
MANAGED LOCATION 'abfss://<container>@<storage-account>.dfs.core.windows.net/<path>';
When you set a managed location for a catalog or schema, that location is used for all its managed tables and volumes. Therefore, if you create a schema without a specific managed location, it automatically adopts the managed location of the catalog it’s part of.
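If a schema should keep its managed tables and volumes in a different location than its catalog, you can also set a managed location at the schema level. A sketch, assuming the path is covered by an external location configuration and you hold the required privilege (the schema name and path are placeholders):

spark.sql("""
  CREATE SCHEMA IF NOT EXISTS prod_sales.silver_adventureworks
  MANAGED LOCATION 'abfss://<container>@<storage-account>.dfs.core.windows.net/<path>'
""")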
When using Unity Catalog in Azure Databricks, setting up and managing the default catalog can make your data operations more efficient and align your queries and data manipulations with the right datasets. Here’s how you can select or set the default catalog in Azure Databricks with Unity Catalog:
1. Notebook-Level: Utilize Spark SQL within your notebook by executing the command:
spark.sql(f"USE CATALOG {catalog}")
2. Cluster-Level: Configure your cluster with a Spark setting by using:
spark.conf.set("spark.databricks.sql.initial.catalog.name", "prod_sales")
3. Workspace-Level: Via the admin settings (filter for “default catalog”).
These options allow you to tailor the setup to your preferences and the needs of your project. Choose the method that integrates best with your workflow!
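Whichever method you pick, you can verify which catalog is active from any notebook; a minimal check:

# Shows the catalog that unqualified table names will resolve against
display(spark.sql("SELECT current_catalog()"))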
Catalogs help organize data assets into logical groups, making data management smoother. Implementing separate catalogs for different teams or departments can be particularly beneficial, especially for streamlining data processes. For more best practices, I recommend reading the article Unity Catalog best practices.
With Unity Catalog in place, and your credentials, newly created catalog, external locations, and volumes configured, you’ve established a solid foundation for data processing and governance. The next step is to onboard your first data source using Azure Data Factory. This integration will enhance your ability to manage and process data efficiently. Note that it is essential to follow the steps carefully, as the order is important.
Now that Azure Databricks, Unity Catalog, and ADLS are up and running, I recommend deploying another set of services to enhance the capabilities of your Azure Databricks workspace. These services include Azure Data Factory and an Azure SQL instance.
Key Takeaways
For the main points from this section, refer to Chapter 3 of the book.