Using DBT to build a Medallion Lakehouse architecture (Azure Databricks + Delta + DBT)
There’s a lot of buzz going around the data build tool (DBT), and it has gained a lot of traction in the data community. I’ve seen several customers and projects where DBT had already been chosen as the tool to use. But what is it? Where does DBT fit in your data landscape? What does it do, and how does it work?
This blog post focuses on answering these questions. I’ll demonstrate how DBT works by building a Databricks medallion architecture. Be warned: it will be a long read. Before configuring anything, let’s first take a closer look at DBT.
What is the data build tool?
DBT is a transformation tool in the ELT process. It is an open-source command-line tool written in Python. DBT focuses on the T in ELT (Extract, Load, Transform), so it doesn’t extract or load data; it only transforms data.
DBT comes in two flavors: “DBT Core”, the open-source CLI version, and a paid version, “DBT Cloud”. In this blog I’ll use the free CLI version. We’ll run it after our extraction pipeline in Azure Data Factory (ADF) has landed the data. From that point on, we’ll transform using DBT.
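To give a feel for what that CLI workflow looks like in practice, here is a minimal sketch (the project name is a placeholder; the `dbt-databricks` adapter is one of several adapters you could install):

```bash
# Install DBT Core together with the Databricks adapter
pip install dbt-databricks

# Scaffold a new project (project name is a placeholder)
dbt init my_lakehouse_project

# Verify the connection profile, then run all transformations and tests
dbt debug
dbt run
dbt test
```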
DBT’s strength lies in defining transformations as templates. The syntax is similar to SQL SELECT statements. In addition, the whole flow is constructed as a directed acyclic graph (DAG), which DBT visualizes in its generated documentation, including data lineage. A minimal model could look like the sketch below.
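The model and column names here are hypothetical; the `ref()` function is what wires models together into the DAG:

```sql
-- models/silver_customers.sql (hypothetical model name)
{{ config(materialized='table') }}

select
    customer_id,
    trim(first_name) as first_name,
    trim(last_name)  as last_name,
    lower(email)     as email
-- ref() creates the DAG edge to the upstream model and resolves its relation
from {{ ref('bronze_customers') }}
```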
DBT differs from other tools because it’s template-based and CLI-driven. So, instead of visually designing ETL or writing code in notebooks, you configure your transformations using SQL boilerplate templates. A benefit of this approach is that you aren’t strongly dependent on the underlying database, so you can more easily switch from one database vendor to another. In addition, you don’t have to learn many database dialects: DBT generates the code needed to transform data on the target platform.
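As an illustration of that vendor independence, DBT’s Jinja templating lets a single model adapt to whichever database you target. The sketch below (a hypothetical model, assuming an upstream `stg_orders` model) branches on DBT’s built-in `target.type` variable:

```sql
-- models/orders_by_month.sql (hypothetical model)
select
    order_id,
    {% if target.type == 'databricks' %}
    date_format(order_date, 'yyyy-MM') as order_month   -- Spark SQL syntax
    {% else %}
    to_char(order_date, 'YYYY-MM') as order_month       -- e.g. Postgres syntax
    {% endif %}
from {{ ref('stg_orders') }}
```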
Use Case
For building the use case, we’ll use an Azure SQL database configured with sample data: AdventureWorks. This database plays the role of the source from which we’ll get the data. For storing and progressively improving the structure of our newly onboarded data, we’ll use services like Azure Data Lake Storage (ADLS), Azure Data Factory, Azure Databricks, and DBT. The end goal is to build a simple and…