Apache Spark in Azure Synapse
Connecting Decube to Apache Spark specifically for Azure Synapse.
This guide outlines the steps necessary to integrate Decube with an Apache Spark instance in Azure Synapse Analytics Workspace. The process is divided into two primary sections: configuring the Decube platform and setting up the necessary components within the Azure Synapse Analytics Workspace.
Minimum Requirements: Credentials and Access
Before beginning the setup, ensure that you have the following credentials and access:
Decube Data Source:
Name of the Source
Client - Azure Synapse Workspace:
Must have WRITE access to Azure Synapse Workspace Libraries.
Required roles include:
Synapse Administrator
Synapse Apache Spark Administrator
Synapse Contributor
Synapse Artifact Publisher
How to Connect
The connection process is divided into two main parts:
Decube platform configuration
Azure Synapse Analytics Workspace setup
Part 1: Decube Platform Configuration
1. Create Spark Data Source
Input the name of the connector. This will be the identifier for your data source within Decube.
2. Copy Credentials
After creating the data source, Decube will provide you with credentials. These credentials are essential for the integration process and will be used later in the Azure Synapse setup.
Important: Keep these credentials secure as they are required for the integration with the Azure Synapse Apache Spark instance.
Part 2: Azure Synapse Analytics Workspace Setup
1. Set Up OpenLineage as a Workspace Package
To track and manage data lineage within Spark jobs, the OpenLineage library must first be added as a workspace package so that it can later be installed in the Apache Spark Pool.
Steps:
Download the OpenLineage Binary:
Go to https://mvnrepository.com/artifact/io.openlineage/openlineage-spark and download the appropriate version of the OpenLineage binary. For most use cases, version 1.20.5 - Scala 2.13 is recommended.
Make sure that the Scala version of the OpenLineage library matches the Scala version of your Apache Spark pool.
Versions before 1.8.0 do not support Scala 2.13, so only Scala 2.12 is recommended for them.
Example of downloading the OpenLineage binary from Maven: after choosing version 1.20.5 - Scala 2.13, click the jar label to download the binary.
Upload the Binary to Azure Synapse:
Go to Synapse Studio.
In the left panel, navigate to Manage > Configurations + Libraries > Workspace Packages > Upload.
Upload the OpenLineage jar downloaded in the Download the OpenLineage Binary step.
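The Scala version rule from the download step can be sketched as a small Python check that returns the jar file name to look for. This is only an illustration: the helper name expected_jar_name is hypothetical, the suffixed file name follows the usual Maven naming for Scala artifacts, and the unsuffixed name used for releases before 1.8.0 is an assumption that should be confirmed on the Maven page.

```python
def _version_tuple(version: str) -> tuple:
    """Turn a dotted version such as '1.20.5' into (1, 20, 5) for comparison."""
    return tuple(int(part) for part in version.split("."))


def expected_jar_name(openlineage_version: str, scala_version: str) -> str:
    """Return the jar file name expected for this OpenLineage/Scala pair.

    Hypothetical helper, not part of Decube or OpenLineage.
    """
    # Reduce a full Scala version (e.g. '2.13.8') to its binary version ('2.13').
    scala_binary = ".".join(scala_version.split(".")[:2])

    if _version_tuple(openlineage_version) < (1, 8, 0):
        # Per the guide: releases before 1.8.0 do not support Scala 2.13.
        if scala_binary != "2.12":
            raise ValueError(
                f"OpenLineage {openlineage_version} supports only Scala 2.12, "
                f"got Scala {scala_binary}"
            )
        # Assumption: older releases carry no Scala suffix in the file name.
        return f"openlineage-spark-{openlineage_version}.jar"

    if scala_binary not in ("2.12", "2.13"):
        raise ValueError(f"Unexpected Scala version: {scala_version}")
    return f"openlineage-spark_{scala_binary}-{openlineage_version}.jar"


print(expected_jar_name("1.20.5", "2.13.8"))
```

In a Synapse PySpark notebook, one way to read the pool's Scala version is spark.sparkContext._jvm.scala.util.Properties.versionNumberString(); its result can be passed as scala_version.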
2. Install OpenLineage in the Apache Spark Pool Environment
Now that OpenLineage is a workspace package, it needs to be installed in the default Apache Spark Pool environment.
Steps:
In Synapse Studio, go to Manage > Apache Spark Pools.
Select the pool to configure and navigate to Packages.
Select the OpenLineage jar that was uploaded as a workspace package in the previous step and install it.
Note: The installation process may take some time as the package is being integrated into the Spark environment.
3. Configure Apache Spark with Decube Credentials
To finalize the setup, you need to configure Apache Spark with the necessary settings to communicate with Decube.
Set up the Apache Spark configuration for OpenLineage using the credentials copied in the Copy Credentials step of the data source setup. The credentials can also be viewed afterwards on the data source's Modify screen.
Configuration Parameters:
Decube-URL: The URL where the Decube platform is hosted.
Api-key: The API key copied from the Decube platform.
Webhook-uuid: The webhook UUID copied from the Decube platform.
[Namespace]: A custom namespace determined by the client.
For full reference on how to add the configurations above, see https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-create-spark-configuration
Add to Existing Spark Pool Configuration
Enter the configuration properties in the Apache Spark configuration, replacing each placeholder with the required value.
For Example:
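As a minimal sketch, the properties could be assembled in Python before pasting them into the Spark configuration. The property names are the OpenLineage Spark integration's settings for an HTTP transport with API-key authentication. Two things are assumptions: the helper names are hypothetical, and how Decube expects the Webhook-uuid to appear in the endpoint is not stated in this guide, so the endpoint value below is a placeholder to be replaced with the exact value Decube provides.

```python
def build_openlineage_conf(decube_url: str, api_key: str,
                           endpoint: str, namespace: str) -> dict:
    """Build Spark properties that point OpenLineage at Decube.

    Hypothetical helper for illustration only.
    """
    return {
        # Registers the OpenLineage listener with Spark.
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        # Send lineage events over HTTP to the Decube URL.
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": decube_url,
        # Assumption: the endpoint path carries the Webhook-uuid; use the
        # exact path shown with your Decube credentials.
        "spark.openlineage.transport.endpoint": endpoint,
        # Authenticate with the API key copied from Decube.
        "spark.openlineage.transport.auth.type": "api_key",
        "spark.openlineage.transport.auth.apiKey": api_key,
        # Custom namespace chosen by the client.
        "spark.openlineage.namespace": namespace,
    }


def render_properties(conf: dict) -> str:
    """Render the properties as one 'key value' pair per line."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


conf = build_openlineage_conf(
    decube_url="<Decube-URL>",
    api_key="<Api-key>",
    endpoint="<endpoint path containing Webhook-uuid>",  # placeholder, not a real path
    namespace="<Namespace>",
)
print(render_properties(conf))
```

Keeping the values as function arguments makes it harder to leave a placeholder unfilled by accident, and the API key never needs to be hard-coded in a notebook.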
Expected output
Once Spark has been set up successfully, the Data Jobs appear in the Catalog, named after the app name in the Spark configuration.
You will also be able to see lineages extracted from the workflow.
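If no Data Jobs appear, a quick check of the active Spark settings may help narrow things down. The sketch below assumes the OpenLineage property names for an HTTP transport; the helper name and the list of keys it checks are hypothetical and should be adjusted to match the properties you actually configured.

```python
# Settings this sketch expects to find (an assumed list, not an official one).
REQUIRED_KEYS = (
    "spark.extraListeners",
    "spark.openlineage.transport.type",
    "spark.openlineage.transport.url",
    "spark.openlineage.namespace",
)


def missing_openlineage_settings(conf: dict) -> list:
    """Return the expected keys that are absent, empty, or still a placeholder."""
    problems = []
    for key in REQUIRED_KEYS:
        value = conf.get(key, "")
        # Treat values such as '<Decube-URL>' as unfilled placeholders.
        is_placeholder = value.startswith("<") and value.endswith(">")
        if not value or is_placeholder:
            problems.append(key)
    return problems


example = {
    "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
    "spark.openlineage.transport.type": "http",
    "spark.openlineage.transport.url": "<Decube-URL>",  # left unfilled on purpose
}
print(missing_openlineage_settings(example))
```

In a Synapse PySpark notebook, the live settings can be passed in as dict(spark.sparkContext.getConf().getAll()).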