Construct the Bronze Layer with Azure Databricks

Piethein Strengholt
12 min read · Dec 17, 2024

In the upcoming book, Building Medallion Architectures, there is an exercise demonstrating how to build a Medallion Architecture from start to finish using Microsoft Fabric. However, readers can also follow the learning path using Azure Databricks. If you choose this route, you may notice slight differences in some of the configuration dialogs when using Azure Data Factory. For this reason, I have posted the initial configuration steps for Chapter 4 on this page.

Data Ingestion

For ingesting and copying data into our platform with Azure Databricks, you have several options.

One of those options is Databricks LakeFlow Connect. This is a solution that’s been designed by Databricks to provide all the tools you need to build and operate data pipelines that are suitable for production use. One of the great things about this solution is that it’s fully integrated with Databricks itself, meaning it can orchestrate and schedule all operations within the platform.

Another option you could consider is Azure Data Factory. This tool has been created to automate and orchestrate data integration workflows across Azure. It supports a wide range of data sources and destinations, including Azure Databricks. So, if you’re already using Azure for other services, this could be the way to go.

Ultimately, the choice between these two and other options will depend on the complexity of your data integration requirements, as well as where you prefer to orchestrate your end-to-end operations. Take into account other platforms and services you may be using when making your decision.

Although both tools serve similar purposes, Azure Data Factory and, later on, Airflow are used extensively throughout the book's exercises, because they facilitate workflow orchestration across various Azure services.

Deploying Azure Data Factory

To get started with Azure Data Factory, deploy an instance using the getting started guide, which can be found here: Quickstart: Create a data factory using the Azure portal. After you're done with the deployment, you can start building your first pipeline to ingest data into the Bronze layer. For that, we need to deploy another service: Azure SQL.

Deploying AdventureWorks Sample Database

For this exercise we will use the AdventureWorks sample database as a showcase for onboarding the first data source. The AdventureWorks sample is a widely used database for demonstration and training purposes, making it an excellent choice to showcase our new environment using a real data source. We will use this source database as a valuable resource for studying data onboarding, fixing data quality issues, and building data integration activities.

The AdventureWorks sample is included in Azure SQL, which is a cloud-based relational database service provided by Microsoft Azure. This service is designed to manage and store structured data, offering high security, scalability, and availability. For organizations looking to create new cloud-native applications, Azure SQL is a popular choice.

To deploy the AdventureWorks sample database, follow the instructions in the Quickstart: Create a single database and AdventureWorks Sample Database. These guides will walk you through the process of deploying the AdventureWorks sample database in Azure SQL. Once you’ve completed the deployment, you can start building your first pipeline using Azure Data Factory.

In the upcoming sections, we will use Azure Data Factory to load data into the Bronze layer and partition it by datetime. Once the data is loaded, we will trigger Azure Databricks to create our Bronze tables using the same pipeline. Let’s get started!

Onboarding First Data Source with Azure Data Factory

To onboard data into our Bronze layer, we need to follow several steps. First, we will set up the configuration for the AdventureWorks sample database that we created at the beginning of this chapter. This database will serve as our initial data source, providing a structured dataset for our first data ingestion process. In addition, we will develop a pipeline in Azure Data Factory to transfer tables from the AdventureWorks sample database into the Bronze layer. A pipeline is a logical grouping of activities that together perform a task. This pipeline will include several activities, such as Lookup, ForEach, CopyTables, and Notebook. The intermediate result for this part is depicted in the image below.

Overview of the pipeline in Azure Data Factory

In the upcoming sections, we will explore how to onboard data from the AdventureWorks sample database into our system by discussing each step in more detail. To effectively build the Lookup and ForEach activities in Azure Data Factory, I recommend watching this instructional video: Building Lookup and ForEach Activities in ADF. This video provides a clear tutorial and will be particularly helpful for visual learners. In the next sections, we will delve into the specifics of different activities and how they contribute to the data ingestion process. Before we can do that, let’s first set up the necessary linked services and datasets in Azure Data Factory.

Linked Services and Datasets

In order to read data from the AdventureWorks sample database and write it to the landing zone in ADLS, we need to set up the necessary linked services and Datasets in Azure Data Factory. Linked Services are the connections between Azure Data Factory and external data sources or destinations. They define the connection properties and credentials required to connect to these sources or destinations.

Note that usually each data source comes with its own unique challenges, so you may find yourself using different setups, Datasets, and activities than those described here. Additionally, the naming convention in this example follows the AdventureWorks sample database. Feel free to adjust it to better match your specific data source.

To get started, ensure you have added your ADLS and Azure SQL services as linked services in ADF. Detailed guidance on adding linked services can be found here: Azure Data Factory Linked Services. During this setup, you'll need to configure connection details and authentication methods, preferably using managed identities for enhanced security.

In addition to Linked Services, we need to configure Datasets, which are the logical data structures that represent the data you want to move or transform within Azure Data Factory. They define the schema and the location of the data, whether it’s in a file, a database table, or a data lake. Datasets can be used as inputs and outputs in activities, allowing you to move data between different data stores.

To get started with the AdventureWorks example, you will first need to create three Datasets:

  • Lookup Dataset: This Dataset retrieves all table names from the Azure SQL AdventureWorks database, which saves us the hassle of hardcoding them. You don't need to configure the linked service in any specific way.
  • Dynamic Dataset: This dataset is used to handle schema names dynamically during the data processing. It utilizes the same Linked Service to Azure SQL, although with different parameters. We’ll come back to this point later.
  • Landing Zone Dataset: This Dataset is for writing data to the designated landing zone in your storage architecture. We’ll use the ADLS Linked Service for this Dataset. You might name this Dataset AdventureWorks_Landingzone.

Start by setting up the first Dataset, which connects to the Azure SQL Database using the linked service. This Dataset primarily serves to look up all the table names required for subsequent data operations. Configuring this Dataset correctly is crucial as it sets the foundation for the data extraction process in the pipeline. The image below shows the details of the AdventureWorks_Lookup dataset.

Details of the AdventureWorks_Lookup Dataset

The second Dataset, which also utilizes the Linked Service to Azure SQL, is designed for copying all tables from the database. You might name this Dataset AdventureWorks_Schemas. This Dataset is particularly crucial as it handles the dynamic configuration of schema and table names, which is essential for accurately mapping and transferring data.

To dynamically configure schema and table names within this Dataset, you should use the following two parameters:

@dataset().SchemaName
@dataset().TableName

These parameters allow the Dataset to adapt dynamically to the schema and table names during the data copying process, ensuring that the data flow can accommodate variations in table structures or schema definitions without manual intervention. For more detailed guidance on configuring these parameters, refer to the image below.

Details of the AdventureWorks_Schemas Dataset

Also ensure that the parameters for this dataset are set accordingly, as in the image below.

Parameter details of the AdventureWorks_Schemas dataset

After setting up the Azure SQL Datasets, create the third Dataset for the landing zone, which you might name AdventureWorks_Landingzone. This Dataset stores the data extracted from the AdventureWorks sample database in the designated landing zone of your storage architecture. It uses a dynamic FilePath location to partition the data by date in the YYYYMMDD format.

@concat('landing/adventureworks')
@dataset().FilePath
@dataset().FileName

With the above configuration, the AdventureWorks data will be stored in the landing zone using the following format: landing/adventureworks/YYYYMMDD/TableName.parquet. For more details on configuring this Dataset, refer to the image below.

Details of the AdventureWorks_Landingzone Dataset

Once you have configured your Linked Services and Datasets, you can start your project by creating a new pipeline within Azure Data Factory. This pipeline will form the foundation for future activities, including Lookup, ForEach, CopyTables, and a Databricks Notebook, which you will add later.

Creating a New Pipeline

To create a new pipeline, navigate to "Pipelines", select "Create a new pipeline", and name it, for instance, "AdventureWorks". Begin by dragging a Lookup activity into the pipeline. Select AdventureWorks_Lookup as its Dataset to display all table names. Then, move to the details section and use a query to fetch all relevant table names. Ensure that "First row only" is unchecked to allow the retrieval of all rows from the table. For more details, refer to the image below.

Query used for retrieving only the relevant schema information.

For the query, you can use the SQL statement below to retrieve all functional table names from the INFORMATION_SCHEMA view. This view provides an internal, system table-independent view of the metadata in the database. Thus, it offers a more standardized way to retrieve information about database objects such as tables, columns, domains, check constraints, privileges, and so on.

SELECT * FROM INFORMATION_SCHEMA.TABLES 
WHERE TABLE_TYPE = 'BASE TABLE'
AND TABLE_SCHEMA = 'SalesLT'

To ensure everything is functioning correctly, click the "Preview Data" button. If everything works as expected, continue the development of the pipeline by introducing a ForEach activity.

Building the ForEach Loop

An Azure Data Factory ForEach activity is a control flow activity that iterates over a collection of items and executes an operation for each item in the collection. This activity is commonly used to perform a set of operations on a group of files or tables. In our example, we will use the ForEach activity to iterate over the system objects information returned from the AdventureWorks example database. This means we use the query to retrieve the metadata about database objects such as tables, columns, constraints, etc., in a standardized manner, and then loop through the results.

To iterate over the system objects information returned from the AdventureWorks example database, add the following information to the Items option within the ForEach activity:

@activity('ListTables').output.value

For more details on configuring this ForEach activity, see the image below.

Iterating over all tables using a ForEach activity

When configuring the ForEach activity, ensure that the input name precisely matches the name of the previous activity. For example, if you follow the example from the image above, the input should link directly to the ListTables activity. Correct configuration allows the ForEach activity to extract schema and table names from the query results, which it then uses as arguments within the loop.
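
To make this concrete, the sketch below shows roughly what the output of the ListTables Lookup looks like, and therefore what each item inside the loop contains. It is illustrative only: the database name and the number of rows are placeholders, and the column names mirror the INFORMATION_SCHEMA query from the previous section.

# Illustrative shape of the ListTables (Lookup) output that the ForEach activity iterates over
list_tables_output = {
    "count": 2,
    "value": [
        {"TABLE_CATALOG": "<your-database>", "TABLE_SCHEMA": "SalesLT", "TABLE_NAME": "Customer", "TABLE_TYPE": "BASE TABLE"},
        {"TABLE_CATALOG": "<your-database>", "TABLE_SCHEMA": "SalesLT", "TABLE_NAME": "Product", "TABLE_TYPE": "BASE TABLE"},
    ],
}

# Inside the loop, @item() refers to one of these objects, so @item().table_schema and
# @item().table_name resolve to values such as "SalesLT" and "Customer" (property names
# in ADF expressions are matched case-insensitively)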

Configuring the Copy Activity

Next, integrate a CopyTables activity within the ForEach loop. For the Source, select the AdventureWorks_Schemas dataset, ensuring you specify both the TableName and SchemaName as input parameters. Now, proceed to input the following details:

@item().table_name
@item().table_schema

For more details on configuring this CopyTables activity, see below.

Details of the Source Dataset properties for the CopyTables activity

Next, head over to the Sink and select the AdventureWorks_Landingzone Dataset that we created earlier. In the Parameters dialog, we'll add two properties: FileName and FilePath. Use the following details:

@concat(item().table_name,'.parquet')
@formatDateTime(utcnow(), 'yyyyMMdd')
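
If you want to double-check what these two expressions resolve to, the short Python sketch below illustrates it. This is for illustration only and assumes the Customer table together with the folder convention defined on the AdventureWorks_Landingzone Dataset:

from datetime import datetime, timezone

table_name = "Customer"                                     # supplied by @item().table_name
file_name = f"{table_name}.parquet"                         # @concat(item().table_name,'.parquet')
file_path = datetime.now(timezone.utc).strftime("%Y%m%d")   # @formatDateTime(utcnow(), 'yyyyMMdd')

# Combined with the folder defined on the AdventureWorks_Landingzone Dataset, the file lands at:
print(f"landing/adventureworks/{file_path}/{file_name}")    # e.g. landing/adventureworks/20241217/Customer.parquet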

For more details on configuring this Sink Dataset, see the image below.

Details of the sink Dataset properties for the CopyTables activity.

The FileName will use the table name provided as input from the ForEach activity. The FilePath will use today's date in the YYYYMMDD format. Once everything is configured, hit the 'Add trigger' button and monitor the process. If everything works as expected, you should see a new folder within your Bronze container. The folder, illustrated in the image below, should be named with today's date.

Overview of data stored in ADLS

Each folder is organized using the YYYYMMDD partitioning scheme and contains a set of tables in the Parquet file format. This method offers the advantage of archiving all data within a daily time window, ensuring a structured and time-sensitive data storage approach. By systematically historizing the data, we create a well-prepared input for our Databricks jobs. This approach also supports auditing purposes and provides a historical record for data recovery in case of errors.
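
Because every load is preserved under its own date folder, you can always return to the data as it was extracted on a specific day. As a small illustration, with placeholder storage account and container names, re-reading one table for one day from a Databricks notebook could look like this:

# Illustrative only: re-read the extract of a specific day from the landing zone,
# for example for auditing or for recovering from a failed downstream run
landing_root = "abfss://<container>@<storageaccount>.dfs.core.windows.net/landing/adventureworks"
df = spark.read.parquet(f"{landing_root}/20241217/Customer.parquet")
display(df)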

With these last steps completed successfully, we have reached the end of building the foundation with Azure Databricks. We have set up the necessary resources, linked storage to Databricks, and established Unity Catalog. We have also onboarded our first data source into our landing zone using Azure Data Factory. Before we continue with the next steps of discussing ingestion patterns, let's review the key takeaways from these sections.

Invoke Scripts

Up until now, we’ve focused on setting up a batch process. This means we’ve been processing large volumes of data all at once instead of handling each data record individually or continuously. Specifically, we used Azure Data Factory to transfer tables from the AdventureWorks sample database to the landing zone in ADLS, saving them in the Parquet file format. However, this step alone doesn’t make the data ready for querying and analysis.

To get the data ready for queries, we need to take an additional step: creating Delta tables in Azure Databricks.

To create Delta tables in Azure Databricks, we need to create and call a script that invokes Azure Databricks from Azure Data Factory. This script plays a major role in our pipeline. To establish communication between Azure Data Factory and Databricks, we need to create a linked service in Azure Data Factory. This will allow Azure Data Factory to connect with Databricks and execute the necessary commands. For detailed instructions, check out the documentation on the Databricks Notebook Activity in Azure Data Factory. If you would like to use a managed identity as the authentication type, grant the Contributor role to both identities in the Azure Databricks resource's Access control menu. If you prefer to use an access token, you can generate one from the Azure Databricks workspace. You can find the steps here: https://docs.databricks.com/administration-guide/access-control/tokens.html

After setting up the linked service, ensure its functionality by clicking the ‘Test Connection’ button. Once you’ve verified the connection, go back to your pipeline. In the ForEach activity, integrate a new Databricks Notebook activity by dragging it into your workflow. After this, update the settings by using the following configuration details for Base Parameters:

  • schemaName: adventureworks
  • tableName: @item().table_name
  • filePath: @formatDateTime(utcnow(), 'yyyyMMdd')

For more details on configuring the Databricks Notebook activity, refer to the image below.

To proceed with your Databricks integration, head over to Azure Databricks and create a new Notebook. After a few seconds, a new notebook containing a single cell will open. Notebooks are made up of one or more cells that can contain code or markdown (formatted text). When the notebook opens, rename it to CreateBronzeTables by selecting the Notebook text at the top left of the notebook and entering the new name. Next, paste the following code into the cell:

# Fetch parameters from Azure Data Factory
schemaName = dbutils.widgets.get("schemaName")
tableName = dbutils.widgets.get("tableName")
filePath = dbutils.widgets.get("filePath")

After you have copy-pasted the code, toggle "Toggle parameter cell" using the three dots at the top right of the cell. This allows you to set the parameters in the cell. Once you have done this, create a new code block by clicking the "+" icon at the bottom of the cell. In this new code block, add the code that creates the Bronze tables from the files in the landing zone.
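
The book's Chapter 4 walks through the full implementation of this cell. As a placeholder, the minimal sketch below shows what it could look like, assuming the landing zone layout created earlier and placeholder storage account and container names:

# Minimal sketch only; the book covers the full implementation, including naming
# conventions and more robust loading patterns. Adjust the placeholders to your environment.
landing_root = "abfss://<container>@<storageaccount>.dfs.core.windows.net/landing/adventureworks"
source_path = f"{landing_root}/{filePath}/{tableName}.parquet"

# Read the Parquet extract for this table and date from the landing zone
df = spark.read.parquet(source_path)

# Make sure the target schema exists, then (re)write the Bronze Delta table
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {schemaName}")
df.write.mode("overwrite").saveAsTable(f"{schemaName}.{tableName.lower()}")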

To delve deeper into the various methods of configuring tables, rejoin the book in Chapter 4 at the section "Implementation of Lakehouse Tables". This chapter covers the creation of Delta tables, external tables, data accumulation, data merging, and Auto Loader, among other topics.

Written by Piethein Strengholt

Hands-on Chief Data Officer. Working @Microsoft.