Using DBT for building a Medallion Lakehouse architecture (Azure Databricks + Delta + DBT)
There’s a lot of buzz around the data build tool (DBT). It has gained a lot of traction in the data community, and I’ve seen several customers and projects where DBT was already chosen as the transformation tool. But what is it? Where does DBT fit in your data landscape? What does it do and how does it work?
This blog post focuses on answering these questions. I’ll demonstrate how DBT works by building a Databricks medallion architecture. Be warned! It will be a long read. Before configuring anything, let’s first look at DBT itself.
What is the data build tool?
DBT is a transformation tool in the ELT process. It is an open-source command line tool written in Python. DBT focuses on the T in ELT (Extract, Load, Transform), so it doesn’t extract or load data; it only transforms data that has already landed in your platform.
DBT comes in two flavors: “DBT Core”, the open-source CLI version, and a paid, hosted version: “DBT Cloud”. In this blog I’ll use the free CLI version. We’ll run it after our extraction pipeline in Azure Data Factory (ADF) has loaded the raw data. From that point onward, we’ll transform using DBT.
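To give a feel for the CLI workflow, here is a sketch of the typical commands. The `dbt-databricks` adapter package name is the standard adapter for Databricks; the project name `medallion_demo` is just an example for this blog.

```shell
# Install DBT Core with the Databricks adapter
pip install dbt-databricks

# Scaffold a new DBT project (example name)
dbt init medallion_demo

# Run all models (the transformations)
dbt run

# Run the data tests defined in the project
dbt test

# Generate the documentation site, including lineage
dbt docs generate
```

In a production setup these commands would typically be triggered from an orchestrator such as ADF or a CI/CD pipeline, right after the extraction step completes.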
DBT’s strength lies in defining transformations as templated SQL: each model is essentially a SELECT statement, enriched with Jinja templating. From the references between models, DBT constructs the whole flow as a Directed Acyclic Graph (DAG), which it can render as documentation, including data lineage.
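As a minimal sketch of what such a templated model looks like: the model below promotes data from a bronze to a silver layer. The model and column names (`bronze_customers`, `customer_id`, etc.) are hypothetical examples, not from the project in this post; `ref()` is DBT’s built-in function for referencing another model, and it is exactly these `ref()` calls that DBT uses to build the DAG.

```sql
-- models/silver/silver_customers.sql
-- Hypothetical silver-layer model: deduplicate and clean bronze data.

{{ config(materialized='table') }}

with source as (

    -- ref() points at another DBT model; DBT resolves it to the
    -- actual table/view and records the dependency in the DAG
    select * from {{ ref('bronze_customers') }}

),

deduplicated as (

    select
        customer_id,
        trim(customer_name)   as customer_name,
        lower(email)          as email,
        loaded_at,
        row_number() over (
            partition by customer_id
            order by loaded_at desc
        ) as rn
    from source

)

select
    customer_id,
    customer_name,
    email,
    loaded_at
from deduplicated
where rn = 1
```

Note that the file contains no DDL: DBT wraps the SELECT in the appropriate `CREATE TABLE AS` (or view) statement, based on the chosen materialization.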