Integrating Azure Databricks and Microsoft Fabric
Disclaimer: This article reflects my personal experiences and viewpoints, not an official stance from Microsoft or Databricks.
This article plunges into a hot topic often brought up during customer interactions — the combination and integration of Azure Databricks and Microsoft Fabric. Both services are top tier in their respective fields. Azure Databricks excels in scaling data engineering, data science, and machine learning workloads. Similarly, Microsoft Fabric shines with its simplicity and self-service features for a wide array of data usage. The burning question that usually arises is: how do we integrate these two powerhouses?
Currently, there are five options to consider. Keep in mind that this article may evolve as new features are introduced.
- Enhance a Databricks-enabled architecture by adding a reporting and analysis layer.
- Complement a Databricks-enabled architecture by incorporating a OneLake gold layer.
- Make Databricks write all data to OneLake. Though not recommended, it’s worth discussing.
- Extend Databricks with a V-Order-enabled consumption layer.
- Enhance Databricks and Microsoft Fabric’s data processing efficiency by adding extra components. This is more of a personal touch.
The options available today will be thoroughly examined in the following sections. I will provide nuances, weigh the pros and cons, and refer to relevant documentation. But before that, it’s beneficial to understand why organizations choose to utilize both of these powerful tools.
Why the combination?
Organizations choose to combine Azure Databricks and Microsoft Fabric due to the unique capabilities this combination provides.
Azure Databricks, a comprehensive data processing, analytics, and data science platform, is favored by organizations of various sizes. With its long-standing reputation and successful adoption across numerous organizations, it has secured its place as a trusted platform. Founded by the creators of Apache Spark, Databricks primarily caters to engineers, offering them a platform to manage Spark workloads, write notebooks, and handle complex tasks at scale.
The appeal of Microsoft Fabric lies in its simplicity. Launched in 2023, it evolved from Power BI, suggesting an easy transition for existing Power BI users. With its user-friendly interface, unified self-service features, and seamless integration with Microsoft 365, it appeals particularly to business users. Microsoft Fabric is designed to democratize data usage and lower entry barriers, making it an accessible platform for all.
In essence, the combination of Azure Databricks and Microsoft Fabric offers a comprehensive solution that caters to both technical and business needs, making it a popular choice among many organizations.
Now that we know why organizations often choose this combination, let’s go back to the burning question: how do we integrate these two services?
Enhance a Databricks-enabled architecture by adding a reporting and analysis layer
The first design consideration involves enhancing the typical Azure Databricks Medallion Lakehouse architecture, which leverages services like Azure Data Lake Storage (ADLS) gen2, Azure Data Factory, and Azure Databricks. In this setup, Databricks manages all aspects of data ingestion, processing, validation, and enrichment. PowerBI typically takes care of the remaining tasks, including reporting and delivering analytical insights.
Expanding the Databricks-focused architecture to include Microsoft Fabric is a commonly used strategy to enhance self-service functions and improve the user experience for business users. Think of it as keeping Databricks in place while giving Power BI a makeover: equipping it with a fresh suite of features and capabilities for a more engaging and efficient experience.
Microsoft has recently introduced a new feature called ‘shortcuts’ for Microsoft Fabric. Shortcuts provide a lightweight form of data virtualization: they reference data in various source locations, eliminating the need for data duplication and enabling direct data usage. For example, when using Power BI, you can access the required data instantly without having to copy or import it into Power BI.
Relating back to the Databricks-focused design we talked about earlier, we can use the ADLS Gen2 shortcut feature, given that Databricks writes all its data to ADLS. However, there are several important considerations to keep in mind:
- Shortcuts necessitate a Fabric Lakehouse. If you don’t already have one, be sure to create one.
- Shortcuts to tables can only be used to access data in Delta Lake format.
- Use shortcuts on external tables whenever possible, rather than Databricks managed tables. I’ll come back to this point later when discussing the next design consideration.
- Each shortcut can only reference a single Delta folder. Therefore, if you need to access data from multiple Delta folders, you’ll need to create individual shortcuts for each folder.
- Don’t manipulate files directly in these table directories from Fabric; treat the Delta files in ADLS as read-only. In this approach, ADLS acts as an intermediate store: you aren’t reading tables directly from Databricks.
- Creating shortcuts in your Lakehouse must be done manually via the Fabric UI. Alternatively, you can provision all shortcuts programmatically using the REST API (see the sketch after this list). Here’s the link to a tutorial and Notebook script.
- When data is read directly from ADLS, the data access policies from Unity Catalog’s security model are not applied.
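To illustrate the programmatic route, below is a minimal sketch that creates an ADLS Gen2 shortcut in a Fabric Lakehouse through the OneLake Shortcuts REST API. The workspace, item, and connection IDs as well as the storage paths are hypothetical placeholders, and the exact payload shape and authentication scope should be verified against the current API reference.

```python
# Minimal sketch: creating an ADLS Gen2 shortcut in a Fabric Lakehouse via the
# OneLake Shortcuts REST API. All IDs and paths below are hypothetical placeholders.
import requests

FABRIC_API = "https://api.fabric.microsoft.com/v1"
workspace_id = "<fabric-workspace-guid>"
lakehouse_id = "<fabric-lakehouse-item-guid>"
token = "<bearer-token-for-the-fabric-api>"

payload = {
    "path": "Tables",                  # create the shortcut under the Tables section
    "name": "sales_orders",            # name of the shortcut (table) in the Lakehouse
    "target": {
        "adlsGen2": {
            "connectionId": "<fabric-connection-guid>",
            "location": "https://<storageaccount>.dfs.core.windows.net/gold",
            "subpath": "/sales_orders"  # one shortcut per Delta folder
        }
    },
}

response = requests.post(
    f"{FABRIC_API}/workspaces/{workspace_id}/items/{lakehouse_id}/shortcuts",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.status_code)  # a 201 response indicates the shortcut was created
```

Running a call like this once per Delta folder mirrors the one-shortcut-per-folder constraint mentioned above.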
For integrating Databricks and Microsoft Fabric, there are exciting developments underway! These features were announced during the Microsoft Build 2024 Conference. Soon, you’ll be able to integrate Azure Databricks Unity Catalog with Fabric. Using the Fabric portal, you’ll have the ability to create and configure a new Azure Databricks Unity Catalog item. Following this step, all tables managed in Unity Catalog can be surfaced as shortcuts in Fabric. This forthcoming integration will dramatically streamline the unification of Azure Databricks data in Fabric, enabling smooth operation across all Fabric workloads. A demonstration of this new feature can be found here: https://www.youtube.com/watch?v=BYob0cGW0Nk&t=4434s
The expanded Databricks-centric architecture, which now includes Microsoft Fabric for data usage, is commonly observed among customers who are exceptionally satisfied with Databricks. These customers have already invested a significant amount of time and resources in establishing a Lakehouse using Databricks and plan to continue leveraging it. Microsoft Fabric recognizes the strength and versatility of the Lakehouse approach using the Delta format, allowing organizations to augment their existing Databricks-centric setup with an additional layer designed specifically for data consumption.
Complement a Databricks-enabled architecture by incorporating a OneLake gold layer
The second design modifies the initial design pattern by incorporating a OneLake gold layer into the architecture. This is feasible because the Azure Blob Filesystem (ABFS) driver used by Azure Databricks supports both ADLS and OneLake. You can see an illustration of this approach below and find Notebook examples on the MS Learn pages here.
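To make this concrete, here is a minimal sketch of what writing a Gold table from a Databricks notebook to OneLake could look like. The workspace name MyFabricWorkspace and Lakehouse name GoldLakehouse are hypothetical, and the exact OneLake URI pattern should be checked against the MS Learn examples referenced above.

```python
# Minimal sketch of a Databricks notebook cell writing a Gold table to OneLake
# through the ABFS driver. Workspace, lakehouse, and table names are hypothetical.
gold_df = spark.table("silver.sales_orders")  # any enriched DataFrame to publish

onelake_path = (
    "abfss://MyFabricWorkspace@onelake.dfs.fabric.microsoft.com/"
    "GoldLakehouse.Lakehouse/Tables/sales_orders"
)

(gold_df.write
    .format("delta")
    .mode("overwrite")
    .save(onelake_path))  # writing under /Tables lets Fabric surface it as a Lakehouse table
```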
Within this architecture, the overall workflow and data processing steps (ingestion, processing, validation, and enrichment) remain largely unchanged. Everything is managed within Azure Databricks. The key difference is that data for consumption is now closer to Microsoft Fabric because Databricks writes its data to a Gold layer stored in OneLake. You might wonder: is this a best practice, and why would it be beneficial?
It is important to note that this style of integration is not officially supported by Databricks, which has implications for data management that I will delve into next. For more information, please refer to the Databricks documentation.
Databricks distinguishes between two types of tables: managed tables and external tables. Managed tables are created by default and are managed by Unity Catalog, which also handles their lifecycle and file layout. It is not recommended to manipulate files directly in these tables using external tools. In contrast, external tables store data outside of the managed storage location specified for the metastore, catalog, or schema.
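The difference is easiest to see in the DDL. The following sketch, with hypothetical catalog, schema, and storage names, contrasts a managed table with an external table created from a Databricks notebook; it assumes a Unity Catalog external location has already been configured for the storage path.

```python
# Minimal sketch of the managed vs. external distinction in Databricks.
# Catalog, schema, and storage paths are hypothetical placeholders.

# Managed table: Unity Catalog controls the file layout and lifecycle;
# dropping the table also removes the underlying files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.gold.customers_managed
    AS SELECT * FROM main.silver.customers
""")

# External table: data lives outside the metastore-managed storage location,
# so dropping the table leaves the Delta files in place and another engine
# (for example Fabric) can take over their management.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.gold.customers_external
    LOCATION 'abfss://gold@<storageaccount>.dfs.core.windows.net/customers'
    AS SELECT * FROM main.silver.customers
""")
```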
So, based on the guidance provided in the documentation, all tables created by writing directly to OneLake using this approach should be treated as external tables, because the data is managed outside the scope of the metastore. As a result, the management of these tables should be done elsewhere, such as within Fabric. The motivation for this approach might be the following:
First, storing data physically in OneLake leads to improved performance within Microsoft Fabric. This is due to the fact that OneLake tables are optimized for performance, particularly for queries involving joins and aggregations. In contrast, if you’re reading data from ADLS Gen2 via shortcuts, you might encounter slower performance for queries that involve these operations.
Second, managing data in OneLake is useful for applying security measures within Microsoft Fabric. For example, OneLake tables can be secured using role-based access control (RBAC), simplifying the process of managing data access. However, if you were to use ADLS Gen2, you would need to handle the permissions for the ADLS Gen2 storage account, which could be a more complex task.
Thirdly, OneLake tables can be governed by policies, which makes it easier to ensure that the data is used in a compliant manner. For instance, when (externally) sharing tables with domains that reside elsewhere.
Besides merely reading data, you might want to consider generating new data within Microsoft Fabric. If this is part of your plan, an upcoming feature could be of great interest. Soon, Fabric users will be able to access data items, like lakehouses, via the Unity Catalog in Azure Databricks. Even though the data will remain in OneLake, you’ll have the ability to access and view its lineage and other metadata directly in Azure Databricks. This enhancement will facilitate reading data back from Fabric to Databricks. For instance, if you’re planning to leverage AI using Azure Databricks’ Mosaic AI, you’ll be able to do so by reading back from Microsoft Fabric. The technology for this is likely Lakehouse Federation. More information can be seen in this part of the video: https://youtu.be/BYob0cGW0Nk?t=4125
In conclusion, the strategy of handling all integration and data processing within Databricks, and having a consumption layer managed in Fabric, offers organizations the convenience of leveraging the best features from each application area. This approach ensures optimal performance and security in data handling.
Make Databricks write all data to OneLake (Not Recommended)
Given our experience integrating Databricks with OneLake, we know that OneLake supports the same APIs as ADLS Gen2. With this in mind, let’s consider a hypothetical design possibility: storing all Medallion layers in OneLake. Could you make this work? Let’s find out.
The incentive for this approach could stem from greenfield deployments. The goal here is to leverage Databricks’ native features to efficiently scale data engineering tasks, while advocating design simplicity and self-service for data usage and consumption across all layers using Microsoft Fabric.
Regrettably, this design is not well suited for efficient data management. This configuration may result in administrative overhead due to an increasing number of workspaces, as each Medallion layer requires its own Lakehouse item in Microsoft Fabric. This proliferation could give rise to additional challenges such as governance, metadata management, and collaboration overhead when sharing data. Additionally, Databricks does not support this approach when using managed tables. Hence, while this architecture may appear attractive in theory, I strongly discourage its use as a best practice.
Extend Databricks with a V-Order-enabled consumption layer
The next design consideration revolves around putting more weight on using Microsoft Fabric and utilizing the V-Order feature. This feature is a write-time optimization for the parquet file format, enabling fast data reads under Microsoft Fabric compute engines such as Power BI.
Both Databricks and Microsoft have chosen to adopt Delta Lake, an open-source table format built on top of the columnar Parquet file format. However, Microsoft has incorporated an added layer of V-Order compression, which offers up to 50% more compression. V-Order is fully compliant with the open-source Parquet format; all Parquet engines can read V-Ordered files like regular Parquet files.
Please note that you can apply V-Order to tables that lack it by using Fabric’s table maintenance feature.
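As a sketch, assuming a Fabric Spark notebook and a hypothetical table name, enabling V-Order for the session and rewriting an existing Delta table could look like this; the configuration key and the OPTIMIZE ... VORDER command follow the Fabric documentation and should be verified for your runtime version.

```python
# Minimal sketch for a Microsoft Fabric Spark notebook; the table name is a
# hypothetical placeholder.

# Enable V-Order for writes in this Spark session (often on by default in Fabric).
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# Rewrite an existing Delta table so its Parquet files become V-Ordered,
# e.g. a table that was originally written by Databricks without V-Order.
spark.sql("OPTIMIZE gold_lakehouse.sales_orders VORDER")
```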
V-Order provides significant advantages for Microsoft Fabric, especially to components like Power BI and SQL endpoints. For instance, it allows Power BI to connect directly to live data using Direct Lake mode while maintaining high performance during data queries. Since there’s no import process, changes in the data source are instantly reflected in Power BI, eliminating the need to wait for a refresh.
It’s crucial to note that the use of V-Order optimized tables is currently exclusive to Microsoft Fabric. Databricks has not yet incorporated this feature. Therefore, until that happens, you’ll need to utilize a service within Microsoft Fabric for leveraging V-Order optimized tables.
Note that one could argue that the processing step with Databricks between the Silver and Gold stages remains relevant when V-Order optimization is not required. While this may seem like a repeat of the earlier designs, it is a viable option that allows for continued data processing with Databricks.
Another noteworthy reason why organizations opt for this design is transactional consistency across multiple tables. Maintaining such consistency, especially in Gold, is crucial. Today, Spark only supports transactions on individual tables, so any data inconsistencies across tables need to be resolved through compensatory measures. For example, you want to commit inserts to multiple tables together, or to none of them if an error arises. If you’re changing details about a purchase order that affects three tables, you can group those changes into a single transaction. That means when those tables are queried, they either reflect all the changes or none of them. This integrity concern highlights the importance of an environment that can manage complex transactions across numerous tables. Microsoft Fabric Warehouse is the only platform capable of supporting this atop Delta Lake. You can learn more about this here.
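As an illustration, the sketch below wraps changes to three hypothetical Gold tables in a single transaction against a Fabric Warehouse using pyodbc; the server, database, and authentication details are placeholders, and any SQL Server-compatible client would work, since the Warehouse exposes a T-SQL endpoint.

```python
# Minimal sketch of a multi-table transaction against a Microsoft Fabric Warehouse.
# Connection details, table names, and keys are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-warehouse-sql-endpoint>;"
    "Database=<your_warehouse>;"
    "Authentication=ActiveDirectoryInteractive;",
    autocommit=False,  # group the statements below into one transaction
)

try:
    cursor = conn.cursor()
    # Changes to a purchase order touch several tables...
    cursor.execute("UPDATE gold.purchase_orders SET status = 'shipped' WHERE order_id = 42;")
    cursor.execute("INSERT INTO gold.order_events (order_id, event) VALUES (42, 'shipped');")
    cursor.execute("UPDATE gold.inventory SET on_hand = on_hand - 1 WHERE product_id = 7;")
    conn.commit()      # ...and become visible to readers all at once
except Exception:
    conn.rollback()    # on any error, none of the changes are applied
    raise
finally:
    conn.close()
```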
In the updated architecture, depicted in the image above, Synapse Data Engineering now serves as the processing engine from Silver to Gold. This approach guarantees that all tables are V-Order optimized. In addition, Synapse Data Warehouse has been added for use cases that require transactional capabilities. However, these architectural changes mean that data engineers will need to navigate distinct data processing services, so it’s crucial to provide clear guidance to all teams. For example, you could establish principles for Bronze and Silver that utilize Databricks’ native features, such as ingestion tracking with Auto Loader and data quality validations with Delta Live Tables, while for Gold you focus on building consumption-specific integration logic solely with Microsoft Fabric.
Enhance Databricks and Microsoft Fabric’s data processing efficiency by adding extra components
In our previous discussion, we addressed the challenge faced by engineers who have to navigate different data processing services. This issue can be resolved by adopting a metadata-driven approach and a templating framework, such as DBT, for data processing. In the updated architecture illustrated below, I’ve augmented both Databricks and Microsoft Fabric with additional components. Let’s delve into these changes.
On the Databricks side, I’ve added a metadata-driven framework (metadata store), Great Expectations, and the Data Build Tool (DBT). The metadata-driven framework can significantly reduce the amount of code you need to write and maintain. Instead of creating multiple notebooks, this approach enables a universal pipeline for ingesting and validating all data with another open-source framework called Great Expectations. This approach is achieved by reading from the metadata store and dynamically invoking different scripts. If you’re interested in learning more about this approach, I recommend reading another blogpost on this subject.
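A minimal sketch of this pattern is shown below. The metadata table, its columns, and the apply_expectations() helper are hypothetical; the actual Great Expectations call is left as a placeholder because its API differs between versions.

```python
# Minimal sketch of a metadata-driven ingestion and validation pipeline in a
# Databricks notebook. Table names, columns, and the helper are hypothetical.

def apply_expectations(df, suite_name: str) -> bool:
    """Hypothetical placeholder for a Great Expectations validation run."""
    # Replace with your Great Expectations (or other framework) integration.
    return True

# One row per source: where it lands and which expectation suite applies.
metadata = spark.table("control.ingestion_metadata").collect()

for row in metadata:
    source_df = (spark.read
                 .format(row.source_format)      # e.g. "parquet" or "csv"
                 .load(row.source_path))         # Bronze landing path

    if not apply_expectations(source_df, row.expectation_suite):
        raise ValueError(f"Validation failed for {row.source_path}")

    (source_df.write
        .format("delta")
        .mode("append")
        .saveAsTable(row.target_table))          # e.g. "silver.sales_orders"
```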
Next, let’s discuss DBT. This open-source command-line tool, also known as Data Build Tool, is written in Python. Its strength lies in providing a universal interface for defining transformations using templates, with a syntax like SQL’s SELECT statements. Databricks is supported through the dbt-databricks package. For more information on using DBT and Databricks, I suggest reading another blogpost on this subject.
On the Microsoft Fabric side, DBT can play a significant role too. We have the option to use either dbt-fabric for Synapse Data Warehouse or dbt-fabricspark for Synapse Spark within Microsoft Fabric. The benefit of this templating approach is that you leverage both services while developers only need to familiarize themselves with a single front-end for all data transformation use cases. This methodology streamlines the process and increases efficiency.
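As a sketch of that single front-end, dbt-core (1.5 and later) exposes a programmatic runner that can drive both projects from one script; the project directories and target names below are hypothetical, and each project’s profile decides whether dbt-databricks, dbt-fabric, or dbt-fabricspark executes the models.

```python
# Minimal sketch: one script orchestrating dbt projects on both platforms.
# Project directories and target names are hypothetical placeholders.
from dbt.cli.main import dbtRunner

runner = dbtRunner()

# Silver models, executed on Databricks via the dbt-databricks adapter.
runner.invoke(["run", "--project-dir", "transform/silver", "--target", "databricks"])

# Gold models, executed on Fabric via the dbt-fabric adapter.
runner.invoke(["run", "--project-dir", "transform/gold", "--target", "fabric"])
```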
Conclusion
The integration of Azure Databricks and Microsoft Fabric presents a myriad of benefits and possibilities for organizations. The flexibility and scalability of Azure Databricks, combined with the simplicity and user-friendly features of Microsoft Fabric, can significantly enhance data usage and management across all layers. There are several architectural design choices available, from enhancing a Databricks-centric architecture with a Microsoft Fabric layer to incorporating a OneLake gold layer into the architecture for better performance and security.
Furthermore, the introduction of V-Order optimization in Microsoft Fabric and the use of additional components can significantly streamline and enhance data processing efficiency. However, such combinations require careful consideration, as they might involve navigating multiple services and balancing flexibility, data security, and isolation.
In conclusion, the integration of Azure Databricks and Microsoft Fabric, coupled with the exciting advancements announced at the Microsoft Build 2024 Conference, signifies a promising frontier for big data processing workloads.