Modern Data Pipelines with Azure Synapse Analytics and Azure Purview
Processing data within modern data platforms, for example data lakehouses, is a complicated task. The variety and multitude of sources quickly make data pipelines chaotic and difficult to handle. A solution for this problem is to follow the mantra of ‘code once use often’ by making your pipelines dynamic and configurable.
With this blogpost I want to demonstrate how Azure Synapse Analytics and Azure Purview can be integrated together to build a framework for intelligently processing data. Azure Synapse is positioned as the analytics platform and Azure Purview as the foundation for metadata management. This blogpost will be a long read, but at the end you will see how Spark is used for data processing, data quality validation and building up historical datasets (slowly changing dimensions). Purview in this respect will hold all metadata for dynamically driving these data pipelines.
Prerequisites
To get started, you must have an Azure Purview account and the following services configured:
- Azure Purview, including a service principal account for making API REST calls: https://piethein.medium.com/use-azure-purviews-rest-apis-for-creating-custom-lineage-ad8efacc6230
- Azure Synapse, including a Spark Pool.
- Azure SQL Server, including the AdventureWorks sample database.