Designing a metadata-driven processing framework for Azure Synapse and Azure Purview
--
In the blogpost Modern Data Pipelines, you learned about data pipelines and some of their difficulties. You also read about an approach that combines Azure Synapse Analytics and Azure Purview to build a framework for intelligently processing data.
In this article I want to take the data processing approach one step further by making data pipelines metadata-driven. A metadata-driven framework allows for more scalability. It speeds up development, provides better maintainability, reusability and visibility. For example, you can process thousands of tables and apply a variety of processing steps without designing all data flows by hand.
When turning your data pipeline processing into a metadata-driven approach, you will notice that data lineage doesn't show up out of the box, because all processing logic sits in a metadata repository. To overcome this problem, I'll hook up a data lineage registration component.
Before you continue reading, it is important to stress that this blogpost is about demonstrating concepts. It's about showing how to get started! The end result isn't complete; it lacks essential features such as error handling and advanced transformation techniques. This blogpost will be a long read. At the end you will see how Azure Synapse retrieves its metadata from an external database, how a Notebook is called dynamically, and how a simple web app is invoked to register lineage in Azure Purview.
Prerequisites
To get started, you must have the following services configured:
- Azure Purview, including a service principal for making REST API calls: https://piethein.medium.com/use-azure-purviews-rest-apis-for-creating-custom-lineage-ad8efacc6230
- Azure Synapse, including a Spark Pool.
- Azure SQL Server, including the AdventureWorks sample database. I recommend making extracts as Parquet files beforehand, for example to the storage account that comes with Azure Synapse.
Create metadata database
Your first step when designing a metadata-driven processing framework is setting up a database for your metadata configuration. A metadata-driven process is a different approach from building or designing your transformations in an ETL tool, for example Azure Synapse Data Flows. Instead, you store your specifications or instructions in an external database, which feeds that information to an engine such as Spark.
In this blogpost I use an Azure SQL database, which can easily be deployed using these steps: Deploy single SQL database. I recommend changing the Gen5 option and going for a Basic tier instead. After deployment you should "allow Azure services and resources to access this server". You can find this option under Firewalls and virtual networks.
After deploying your SQL database, complete the setup by creating the database structures for your pipeline configuration using the following metadata.sql script. This initial framework has four tables:
- Systems: this table describes all sources or systems.
- Pipelines: this is a placeholder for your data pipelines, including name and description
- Flows: flows are the series of data processing steps. These include references to notebooks, storage account names, container names, folder names, primary key information, and so on.
- PipelineLog: table used for storing logging results
Note: only a few tables are provided. In a comprehensive framework you would expect more metadata tables for validations, scheduling, triggering information, security, data usage, data contracts and so on.
Don't forget to define your storage account, container and folder locations. In the metadata.sql script you will find the values I used in my own environment. Replace the INSERT statements with your own locations for the AdventureWorks data.
After everything has been configured and loaded in, you should be able to query your database. I created a view that integrates systems, pipelines and flows. This view will be used as input for executing data processing.
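If you want to double-check the configuration from outside the portal, a few lines of Python will do. The sketch below is just that, a sketch: it assumes the pyodbc package, placeholder connection details, and column names based on the tables described above, so adjust it to your own environment.

import pyodbc  # assumed to be installed; any SQL client works just as well

# Placeholder connection details - replace with your own server, database and credentials
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=tcp:<your-sql-server>.database.windows.net,1433;"
    "DATABASE=<your-metadata-db>;UID=<user>;PWD=<password>"
)

# The view joins systems, pipelines and flows into one record per processing step
for row in conn.cursor().execute("SELECT * FROM [dbo].[SourceToTargetView]"):
    # Column names are assumptions; use the ones defined in metadata.sql
    print(row.PipelineName, row.NotebookName, row.SourceSystemName)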
Create dynamic notebooks
Imagine you have different teams. All teams manage many applications and databases. All teams want to consistently apply historization on their extracted datasets using a standard file format. The same technology toolset is used, but all extracted datasets are stored in different locations: storage accounts, containers and folders. A scalable approach would be to submit parameters to a common notebook that dynamically finds and processes these datasets.
For dynamically processing data, the plan is to use Synapse Spark and Notebooks. I created a simple notebook that transforms Parquet files into Delta files and, at the same time, compares and merges the historical data. Continue by creating a new script. For the purpose of this demo, the script must be called DataProcessing. The script name also refers to the NotebookName used in the pipeline flows, as shown in the screenshot above. If you prefer many scripts, just change the NotebookName references in the metadata database. If you're unsure about these steps, please consider reading my previous blogpost.
Note: In a typical setup you would expect many notebooks for handling different types of operations. You want integrity checks and technical transformations from one format to another. There might be scripts for situations in which you want to apply historization to your data, or perhaps for transforming data using complex logic.
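To make this more concrete, here is a minimal sketch of what such a DataProcessing notebook could look like. It is not my full notebook: the parameter names are illustrative and assumed to be injected by the pipeline (the parameters cell that defines them is shown a bit further down), and it uses the Delta Lake library that ships with Synapse Spark pools.

from delta.tables import DeltaTable

# These variables are assumed to be provided by the pipeline via base parameters
source_path = f"abfss://{ContainerName}@{StorageAccountName}.dfs.core.windows.net/{FolderName}"
target_path = f"abfss://{ContainerName}@{StorageAccountName}.dfs.core.windows.net/delta/{FolderName}"

updates = spark.read.parquet(source_path)  # the extracted Parquet files

if DeltaTable.isDeltaTable(spark, target_path):
    # Compare and merge the new extract with the historical data,
    # assuming a single primary key column
    (DeltaTable.forPath(spark, target_path).alias("t")
        .merge(updates.alias("s"), f"t.{PrimaryKey} = s.{PrimaryKey}")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
else:
    # First run: create the Delta table from the initial extract
    updates.write.format("delta").save(target_path)

The pattern is what matters here: one generic notebook, driven entirely by parameters from the metadata database.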
Pipeline configuration
After setting up your metadata database and first script, it's time to configure your first pipeline. Open the integration section within Synapse. Create a new pipeline and start by dragging in a Lookup activity. For the source dataset, you must select your metadata database. If you haven't configured a linked service yet, now is the time to do so. For the query we will use the SourceToTargetView:
SELECT * FROM [dbo].[SourceToTargetView]
You can validate the results by previewing data. When everything works as expected, extend your pipeline by dragging in a ForEach operation. You will use this operation to loop over all records fetched from your metadata database.
Note: If you prefer multiple pipelines, I recommend adding WHERE clauses to query specific pipelines. If you need a particular sequence or order, add an ORDER BY to your results.
For the ForEach operation, it's important to submit the metadata records as input arguments for all subsequent processing steps. Use the Settings tab to pass the output values from your metadata database into the ForEach operation:
@activity('GetPipelines').output.value
Continue and open your ForEach operation. Drag in a Notebook activity. The name of the Notebook must be dynamic: it uses the arguments provided by our metadata database.
@item().NotebookName
Remember the name DataProcessing? That value comes from the NotebookName column and determines which notebook is executed. For passing arguments to your notebooks, use the base parameters section. Provide parameters such as @item().SourceSystemName for passing in the SourceSystemName.
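Inside the notebook, these base parameters arrive through a parameters cell: a cell you toggle as "Parameters cell" in Synapse, whose default values are overridden at runtime by the values coming from the pipeline. A minimal sketch, with illustrative names that should match the columns of your Flows table:

# Parameters cell of the DataProcessing notebook (toggle "Parameters cell" in Synapse).
# The defaults below are only used when running interactively; at runtime the
# ForEach activity overrides them through the base parameters.
SourceSystemName = "AdventureWorks"
StorageAccountName = "<your-storage-account>"
ContainerName = "<your-container>"
FolderName = "<your-folder>"
PrimaryKey = "<primary-key-column>"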
Deploy a NodeJS web app for registering Purview Lineage
A metadata-driven ingestion framework is also about providing insight into your data movements. In this blogpost, Azure Purview is used as the foundation for lineage registration. It offers a REST API for submitting lineage.
Unfortunately, the Azure Purview APIs come with overhead. Making an API call from Synapse isn't easy. First, you need to authenticate. Second, you need to validate whether objects already exist. You might want to clean things up, and you need to retrieve unique identifiers for creating lineage objects. To reduce this overhead, I've decided to use an abstraction: a small web app that takes arguments from your data pipelines, looks up the data assets, cleans up, and registers the source, the target and a lineage process step. So instead of calling Azure Purview directly, you call a web app that handles the registration.
In order to deploy this NodeJS web app you need to have the following:
- Both Node.js and npm installed. Run the command node --version to verify that Node.js is installed.
- Visual Studio Code installed
- The Azure App Service extension for Visual Studio Code installed
Clone the purview-nodejs-lineage-registration repository and run the following commands:
npm install
npm start
Open http://localhost:8080, and verify your application is running. When you see a welcome message, deploy your web app to Azure by following these steps:
- In the App Service explorer, within VSCode, select the Deploy to Web App icon.
- Select Create new Web App. A Linux container is recommended.
- Type a globally unique name for your web app and press Enter.
- In Select a runtime stack, select the Node.js version 16 LTS.
- In Select a pricing tier, select Free (F1) and wait for the resources to be provisioned in Azure.
- In the popup "Always deploy the workspace 'myExpressApp' to <app-name>", select Yes.
- While Visual Studio Code provisions the Azure resources and deploys the code, it shows progress notifications.
- Once deployment completes, select Browse Website in the notification popup. The browser should display the welcome page.
To validate the deployment, you can use Postman to submit your first lineage registration. Within the repository there's a sample.json message. With a single API call it registers a source, a target and a lineage process step. When everything works as expected, you should see a Status 200 response.
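If you prefer scripting over Postman, the same validation can be done with a few lines of Python. Note that both the endpoint path and the field names below are placeholders; take the real ones from the sample.json message in the repository.

import requests  # assumed to be installed

# Illustrative payload - mirror the structure of sample.json from the repository
lineage = {
    "NotebookName": "DataProcessing",
    "SourceSystemName": "AdventureWorks",
    "SourceStorageAccountName": "<source-storage-account>",
    "TargetStorageAccountName": "<target-storage-account>",
}

# Use your locally running app (npm start) or the deployed App Service URL
response = requests.post("http://localhost:8080/<lineage-endpoint>", json=lineage)
print(response.status_code)  # expect 200 when the registration succeeded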
After your web app has been successfully deployed, head back to Azure Synapse.
Complete pipeline configuration
Extend your ForEach step by adding a Web step. Navigate to Settings and paste the URL of your lineage registration app. Select POST as the method. For the body, type in:
@item()
By using @item() you will submit a JSON holding all arguments from your metadata database, including the Notebook Name, storage account and container information, source, target and operation type. With some imagination you could scale this up by also providing all transformation logic.
Note: If your metadata database holds complex transformation statements, you might consider concatenating those and pushing all statements along as description information. This enables you to not only see the lineage step, but also the logic being executed.
To complete your pipeline, you can use a stored procedure for submitting logging data. This captures the output of the pipeline. Under Settings, use the linked service of your metadata database. Select the InsertPipelineLog stored procedure. Click Import to retrieve all parameters. I provide arguments like @{activity('ExecuteNotebook').output.executionDuration} for logging the duration, and parameters like @item().PipelineID and @item().RunID for registering the pipeline and log id.
When ready, commit and publish all code and hit Add Trigger. When everything works as expected, navigate to your storage accounts and validate that your data has been transformed. Then head to Azure Purview and look up your lineage object, which shows what data has been used as source and target.
Further steps for improvement
The framework from this blogpost can be a starting point for building something more sophisticated. You can scale up the framework by sharing it across different teams within your organization. In that respect, allow your teams to contribute. Add pointers for different workspaces, allowing different teams to share a central metadata repository for data pipeline orchestration and monitoring. For scaling up further I encourage you to work on these improvements:
- Add extraction activities: https://docs.microsoft.com/en-us/azure/data-factory/copy-data-tool-metadata-driven
- Add self-service, allowing teams to maintain their own metadata.
- Add technical validations or data quality support.
- Add support for different file formats (CSV, TXT, XML and so on).
- Add support for additional processing types like append and overwrite.
- Add support for complex data transformation. I’ve seen companies including JSON files describing all transformation steps. I also see companies using SQL or source to target mapping tables.
- Add support for consumption, for example by deploying secure views in Synapse Serverless SQL Pools.
Conclusion
A metadata-driven processing framework allows for scalability. It speeds up development and allows for better maintainability, reusability and visibility. As you've seen, Azure Synapse Analytics and Azure Purview can be at the heart of such a framework. They complement each other and can easily be integrated.