Apache Spark in Azure Synapse

Connecting Decube to an Apache Spark instance in Azure Synapse.

This guide outlines the steps necessary to integrate Decube with an Apache Spark instance in Azure Synapse Analytics Workspace. The process is divided into two primary sections: configuring the Decube platform and setting up the necessary components within the Azure Synapse Analytics Workspace.

Minimum Requirements: Credentials and Access

Before beginning the setup, ensure that you have the following credentials and access:

  • Decube Data Source:

    • Name of the Source

  • Client - Azure Synapse Workspace:

    • Must have WRITE access to Azure Synapse Workspace Libraries.

  • Required roles include:

    • Synapse Administrator

    • Synapse Apache Spark Administrator

    • Synapse Contributor

    • Synapse Artifact Publisher

How to Connect

The connection process is divided into two main parts:

  • Decube platform configuration

  • Azure Synapse Analytics Workspace setup

Part 1: Decube Platform Configuration

1. Create Spark Data Source

  • Input the name of the connector. This will be the identifier for your data source within Decube.

2. Copy Credentials

  • After creating the data source, Decube will provide you with credentials. These credentials are essential for the integration process and will be used later in the Azure Synapse setup.

Important: Keep these credentials secure as they are required for the integration with the Azure Synapse Apache Spark instance.

Part 2: Azure Synapse Analytics Workspace Setup

1. Set Up OpenLineage as a Workspace Package

To track and manage data lineage within Spark jobs, the OpenLineage library must first be added as a workspace package so that it can later be installed on the Apache Spark pool.

Steps:

Download the OpenLineage Binary:

Navigate to https://mvnrepository.com/artifact/io.openlineage/openlineage-spark and download the appropriate version of the OpenLineage binary. For most use cases, version 1.20.5 - Scala 2.13 is recommended.

Make sure that the Scala version of your Apache Spark pool matches the Scala version of the OpenLineage library.

OpenLineage versions before 1.8.0 do not support Scala 2.13; for those versions, use the Scala 2.12 build.
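If you are unsure which Scala build your Apache Spark pool runs, one way to check is from a Synapse notebook. This is a quick sketch that reads the JVM's Scala properties through PySpark's internal py4j gateway (the spark session is predefined in Synapse notebooks):

# Print the Spark and Scala versions of the attached Spark pool
print(spark.version)
print(spark.sparkContext._jvm.scala.util.Properties.versionString())  # e.g. "version 2.12.17"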

To download the OpenLineage binary from Maven:

  • Choose version 1.20.5 - Scala 2.13.

  • Click the jar label to download the binary. A scripted alternative is sketched below.
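If you prefer to script the download instead of using the Maven site, here is a minimal Python sketch. It assumes the Maven Central coordinates io.openlineage:openlineage-spark_2.13:1.20.5; adjust the version and Scala suffix to match your pool.

# Download the OpenLineage Spark agent jar from Maven Central
import urllib.request

version = "1.20.5"
scala = "2.13"
jar = f"openlineage-spark_{scala}-{version}.jar"
url = ("https://repo1.maven.org/maven2/io/openlineage/"
       f"openlineage-spark_{scala}/{version}/{jar}")
urllib.request.urlretrieve(url, jar)
print(f"Saved {jar}")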

Upload the Binary to Azure Synapse:

  • Go to Synapse Studio.

  • In the left panel, navigate to Manage > Configurations + Libraries > Workspace Packages > Upload.

  • Upload the OpenLineage jar binary downloaded in the earlier step.

2. Install OpenLineage in the Apache Spark Pool Environment

Now that OpenLineage is a workspace package, it needs to be installed in the default Apache Spark Pool environment.

Steps:

  1. In Synapse Studio, go to Manage > Apache Spark Pools.

  2. Select the pool to configure and navigate to Packages.

  3. Select the OpenLineage jar that was uploaded as a workspace package and install it.

Note: The installation process may take some time as the package is being integrated into the Spark environment.

3. Configure Apache Spark with Decube Credentials

To finalize the setup, you need to configure Apache Spark with the necessary settings to communicate with Decube.

  • Set up the Apache Spark configuration for OpenLineage using the credentials copied during the Copy Credentials step of the data source setup. Note that the credentials can also be viewed after setup on the data source's Modify screen.

Configuration Parameters:

spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type http
spark.openlineage.transport.url https://integrations.decube.io
spark.openlineage.transport.auth.type api_key
spark.openlineage.transport.auth.apiKey <api-key>
spark.openlineage.transport.endpoint /integrations/apache_spark/webhook/<webhook-uuid>
spark.openlineage.namespace [Namespace]

<api-key>: The API key copied from the Decube platform.

<webhook-uuid>: The webhook UUID copied from the Decube platform.

[Namespace]: A custom namespace determined by the client.

4. Add to Existing Spark Pool Configuration

  • Enter the configuration provided above into the Apache Spark pool configuration, with the placeholders replaced by your values.

For Example:
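A filled-in configuration might look like the following. The API key, webhook UUID, and namespace below are hypothetical placeholders; use the values copied from your own Decube data source.

spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type http
spark.openlineage.transport.url https://integrations.decube.io
spark.openlineage.transport.auth.type api_key
spark.openlineage.transport.auth.apiKey 0a1b2c3d-example-api-key
spark.openlineage.transport.endpoint /integrations/apache_spark/webhook/123e4567-e89b-12d3-a456-426614174000
spark.openlineage.namespace synapse-production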

Expected output

Once your Spark instance has been successfully set up, you should be able to see the Data Jobs in the Catalog (named after the app name in the Spark config).

You will also be able to see lineages extracted from the workflow.
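If jobs do not appear, a quick way to confirm that the pool picked up the OpenLineage settings is to read them back from a Synapse notebook. This is only a verification sketch; the keys are the ones listed in the configuration above and the spark session is predefined in Synapse notebooks.

# Check that the OpenLineage listener and transport settings are present
conf = spark.sparkContext.getConf()
for key in ("spark.extraListeners",
            "spark.openlineage.transport.url",
            "spark.openlineage.namespace"):
    print(key, "=", conf.get(key, "not set"))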
