Use Azure Purview’s REST APIs for creating custom lineage

Piethein Strengholt
6 min read · Sep 22, 2021

Azure Purview is a unified data governance service that helps organizations to manage and govern their data estate. It provides a holistic, up-to-date map of your data landscape with automated data discovery, sensitive data classification, and end-to-end data lineage.

One of the great things about Azure Purview is its openness. Azure Purview uses the Apache Atlas Open API ecosystem, with some enhancements and additions by Microsoft. In this tutorial, you will learn how to use these APIs by creating custom lineage.

Prerequisites

To get started, you must have an existing Azure Purview account. If you don’t have a catalog yet, see the quickstart for creating an Azure Purview account.

Create a service principal (application)

To invoke Purview’s REST API, you must first register an application (i.e. a service principal) that will act as the identity that the Azure Purview platform recognizes and is configured to trust.

  1. Sign in to the Azure portal, navigate to Azure Active Directory > App registrations, and click New registration.
  2. Provide a name for the application, select an account type, and click Register.
  3. Copy the following values for later use: Application (client) ID and Directory (tenant) ID.
  4. Next, you need to create a secret. Navigate to Certificates & secrets and click New client secret.
  5. Provide a Description, set the expiration to In 2 years, and click Add.
  6. Copy the client secret value for later use.

Set up authentication using service principal

Once the service principal is created, you need to assign it data plane roles on your Purview account. This is required in order to call the APIs. Follow the steps below to assign the roles and establish trust between the service principal and the Purview account.

  1. Navigate to your Purview Studio.
  2. Select the Data Map in the left menu.
  3. Select Collections.
  4. Select the root collection in the collections menu. This will be the top collection in the list, and will have the same name as your Purview account.
  5. Select the Role assignments tab.
  6. Assign the following roles to the service principal created above to give it access to the various data planes in Purview:
     • ‘Data Curator’ role to access the Catalog data plane.
     • ‘Data Source Administrator’ role to access the Scanning data plane.
     • ‘Collection Admin’ role to access the Account data plane.

Get token

Next, you need to acquire an access token, which is needed for accessing Purview’s REST APIs. In order to obtain this token you need to send a POST request to the following URL.

https://login.microsoftonline.com/{your-tenant-id}/oauth2/token

The following parameters need to be passed to the above URL.

  • client_id: the client ID of the application that is registered in Azure Active Directory and has been assigned a data plane role for the Purview account.
  • client_secret: the client secret created for the above application.
  • grant_type: This should be ‘client_credentials’.
  • resource: This should be ‘https://purview.azure.net’.

In my case, I’m using curl for invoking the REST API:

curl --location --request POST 'https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/token' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'grant_type=client_credentials' \
--data-urlencode 'client_id=ddb25b53-4e6f-41da-b646-579c742ab8ec' \
--data-urlencode 'client_secret=NFl7Q~vOkFtEqqBx~lSLk4EFLc74P7Ood2LAT' \
--data-urlencode 'resource=https://purview.azure.net'

If everything goes well, you will see an access_token returned. Copy this value for later use.
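The same token request can be sketched in Python using only the standard library. This is an illustration rather than an official client: the function names `build_token_request` and `fetch_access_token` are hypothetical, while the URL and form fields mirror the curl call above.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(tenant_id, client_id, client_secret):
    """Build (but do not send) the POST request for a client-credentials token."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": "https://purview.azure.net",
    }
    # urlencode takes care of escaping characters such as '~' or '/' in the secret.
    body = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_access_token(request):
    """Send the request and return the access_token field of the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
```

Calling `fetch_access_token(build_token_request(tenant_id, client_id, client_secret))` with your own values would perform the actual network call. Reading the secret from an environment variable rather than hard-coding it keeps it out of scripts and shell history.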

Next, you need to obtain the Atlas API endpoint. Navigate back to the Azure portal, open the Azure Purview account, navigate to Properties and find the Atlas endpoint. Copy the Atlas endpoint for later use.

Next, you will validate the Atlas endpoint by requesting all type definitions. Paste the access_token into the request below and make the API call:

curl --location --request GET 'https://purview-piethein.catalog.purview.azure.com/api/atlas/v2/types/typedefs' \
--header 'Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6Imwzc1EtNTBjQ0g0eEJWWkxIVEd3blNSNzY4MCIsImtpZCI6Imwzc1EtNTBjQ0g0eEJWWkxIVEd3blNSNzY4MCJ9.eyJhdWQiOiJodHRwczovL3B1cnZpZXcuYXp1cmUubmV0IiwiaXNzIjoiaHR0cHM6Ly9zdHMud2luZG93cy5uZXQvNzJmOTg4YmYtODZmMS00MWFmLTkxYWItMmQ3Y2QwMTFkYjQ3'

If everything goes well, all the type definitions will be returned. Congratulations! Let’s continue our journey by creating new objects in Purview.
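The typedefs response can be large, so a quick summary helps confirm that it contains what you expect. The sketch below assumes the response body is a JSON object whose values are lists of definitions (for example `entityDefs` or `classificationDefs`, as in the Apache Atlas typedefs model). The helper names and the sample data are made up for illustration.

```python
def summarize_typedefs(typedefs):
    """Return {category: number of definitions} for every list-valued key."""
    return {
        category: len(definitions)
        for category, definitions in typedefs.items()
        if isinstance(definitions, list)
    }


def entity_type_names(typedefs):
    """Return the sorted names of all entity type definitions."""
    return sorted(d["name"] for d in typedefs.get("entityDefs", []))


# Tiny hand-written stand-in for a parsed typedefs response.
sample_typedefs = {
    "enumDefs": [],
    "classificationDefs": [{"name": "EXAMPLE_CLASSIFICATION"}],
    "entityDefs": [{"name": "Process"}, {"name": "DataSet"}],
}

summary = summarize_typedefs(sample_typedefs)
names = entity_type_names(sample_typedefs)
```

Checking that `DataSet` and `Process` appear among the entity type names is a useful sanity test, because the rest of this tutorial relies on both types.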

Creating objects in Purview

To upload lineage to Purview, we will use the bulk entity endpoint, which creates multiple datasets at once. In this tutorial you will create two input datasets and one output dataset. Adjust the code below by changing the endpoint URL and the access token.

curl --location --request POST 'https://purview-piethein.catalog.purview.azure.com/api/atlas/v2/entity/bulk' \
--header 'Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6Imwzc1EtNTBjQ0g0eEJWWkxIVEd3blNSNzY4MCIsImtpZCI6Imwzc1EtNTBjQ0g0eEJWWkxIVEd3blNSNzY4MCJ9.eyJhdWQiOiJodHRwczovL3B1cnZpZXcuYXp1cmUubmV0IiwiaXNzIjoiaHR0cHM6Ly9zdHMud2luZG93cy5uZXQvNzJmOTg4YmYtODZmMS00MWFmLTkxYWItMmQ3Y2QwMTFkYjQ3' \
--header 'Content-Type: application/json' \
--data-raw '{
"entities": [
{
"meanings": [
],
"status": "ACTIVE",
"version": 0,
"typeName": "DataSet",
"attributes": {
"qualifiedName": "system://input_01",
"name": "input_table01",
"description": "Input table",
"objectType": null
}
},
{
"meanings": [
],
"status": "ACTIVE",
"version": 0,
"typeName": "DataSet",
"attributes": {
"qualifiedName": "system://input_02",
"name": "input_table02",
"description": "Input table",
"objectType": null
}
},
{
"meanings": [
],
"status": "ACTIVE",
"version": 0,
"typeName": "DataSet",
"attributes": {
"qualifiedName": "system://output_01",
"name": "output_table01",
"description": "Output table",
"objectType": null
}
}
]
}'
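The request body above is repetitive, so it can also be generated instead of typed by hand. The following is a minimal sketch that reproduces the same JSON structure; `make_dataset` and `make_bulk_payload` are hypothetical helper names, not part of any Purview or Atlas library.

```python
import json


def make_dataset(qualified_name, name, description):
    """Return one DataSet entity shaped like the entries in the curl body above."""
    return {
        "meanings": [],
        "status": "ACTIVE",
        "version": 0,
        "typeName": "DataSet",
        "attributes": {
            "qualifiedName": qualified_name,
            "name": name,
            "description": description,
            "objectType": None,  # serialized as JSON null
        },
    }


def make_bulk_payload(entities):
    """Wrap a list of entities in the body used for the entity/bulk endpoint."""
    return {"entities": list(entities)}


payload = make_bulk_payload([
    make_dataset("system://input_01", "input_table01", "Input table"),
    make_dataset("system://input_02", "input_table02", "Input table"),
    make_dataset("system://output_01", "output_table01", "Output table"),
])

# This string can be used as the request body of the bulk call.
body = json.dumps(payload, indent=2)
```

Generating the payload this way keeps the qualified names in one place, which matters because the qualifiedName is the identifying attribute of each dataset.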

If everything goes well, you should have three new datasets created within your Purview collection.

It is important to capture the guidAssignments from the response. These are the unique references that we need for creating our lineage object. Copy the information for later use.

"guidAssignments":{"-184464159039":"938fab2e-e270-4fc2-ad84-d752c1dd6560","-184464159038":"4c2115f9-a80d-467a-b7df-e93ec59505e1","-184464159037":"0589b96c-0cc6-454f-9251-32ee4aabefc0"}

Next we can make a lineage object by making another call. Use the code below and change the unique identifiers using the output from the previous call:

curl --location --request POST 'https://purview-piethein.catalog.purview.azure.com/api/atlas/v2/entity' \
--header 'Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6Imwzc1EtNTBjQ0g0eEJWWkxIVEd3blNSNzY4MCIsImtpZCI6Imwzc1EtNTBjQ0g0eEJWWkxIVEd3blNSNzY4MCJ9.eyJhdWQiOiJodHRwczovL3B1cnZpZXcuYXp1cmUubmV0IiwiaXNzIjoiaHR0cHM6Ly9zdHMud2luZG93cy5uZXQvNzJmOTg4YmYtODZmMS00MWFmLTkxYWItMmQ3Y2QwMTFkYjQ3' \
--header 'Content-Type: application/json' \
--data-raw '{
"entity": {
"status": "ACTIVE",
"version": 0,
"typeName": "Process",
"attributes": {
"inputs": [
{"guid": "938fab2e-e270-4fc2-ad84-d752c1dd6560"},
{"guid": "4c2115f9-a80d-467a-b7df-e93ec59505e1"}
],
"outputs": [
{"guid": "0589b96c-0cc6-454f-9251-32ee4aabefc0"}
],
"qualifiedName": "apacheatlas://customlineage01",
"name": "lineage01"
}
}
}'
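Copying GUIDs between the two calls by hand is error-prone, so the hand-off can be scripted. The following is a minimal sketch, assuming the bulk response is JSON containing the `guidAssignments` object shown earlier; `assigned_guids` and `make_process` are hypothetical helper names. It also assumes the assignments appear in the same order as the submitted entities, which the example output suggests but which should be verified in Purview before relying on it.

```python
import json

# Sample bulk response, using the guidAssignments values shown earlier.
bulk_response = json.dumps({
    "guidAssignments": {
        "-184464159039": "938fab2e-e270-4fc2-ad84-d752c1dd6560",
        "-184464159038": "4c2115f9-a80d-467a-b7df-e93ec59505e1",
        "-184464159037": "0589b96c-0cc6-454f-9251-32ee4aabefc0",
    }
})


def assigned_guids(response_body):
    """Return the newly assigned GUIDs in the order they appear in the response."""
    assignments = json.loads(response_body).get("guidAssignments") or {}
    return list(assignments.values())


def make_process(qualified_name, name, input_guids, output_guids):
    """Return a Process entity body that links input datasets to output datasets."""
    return {
        "entity": {
            "status": "ACTIVE",
            "version": 0,
            "typeName": "Process",
            "attributes": {
                "inputs": [{"guid": guid} for guid in input_guids],
                "outputs": [{"guid": guid} for guid in output_guids],
                "qualifiedName": qualified_name,
                "name": name,
            },
        }
    }


guids = assigned_guids(bulk_response)
# Assumption: the first two GUIDs are the inputs and the last one is the output.
process = make_process(
    "apacheatlas://customlineage01", "lineage01", guids[:2], guids[2:]
)
```

Serializing `process` with `json.dumps` yields the same body as the curl command above.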

If everything goes well you can see the lineage created within Purview.
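Both POST calls in this section share the same shape: a bearer token in the Authorization header and a JSON body. A small standard-library helper can capture that pattern. This is only a sketch; `build_json_post` and `send` are hypothetical names, and the endpoint URL you pass in must be your own Atlas endpoint.

```python
import json
import urllib.request


def build_json_post(url, access_token, payload):
    """Build (but do not send) an authenticated POST request with a JSON body."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Separating request construction from sending makes it possible to inspect the URL, headers, and body before anything is written to the catalog.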

Next steps

In this tutorial we learned to easily create datasets and custom lineage. You can extend these objects with additional metadata, like relationships to terms, new entities and definitions.

Registration can be done manually via the Purview portal but, as you learned, also programmatically via Purview’s REST API. The big benefit is that you can apply customizations, register new types, or transfer metadata from other repositories. To simplify this process, I recommend checking out PyApacheAtlas, which allows bulk uploading using Excel templates.

Azure Purview REST APIs are largely based on the open source Apache Atlas project. Therefore many additional resources are available. The Atlas documentation is a great resource. This documentation is also provided by Microsoft: PurviewCatalogAPISwagger.zip.

Lastly, there is a CLI and a great video from the Azure Purview team. The video explains how the Purview metadata repository works and how the API can be used: https://www.youtube.com/watch?v=4qzjnMf1GN4
