# Apache Spark in Azure Synapse

This guide outlines the steps necessary to integrate Decube with an Apache Spark instance in Azure Synapse Analytics Workspace. The process is divided into two primary sections: configuring the Decube platform and setting up the necessary components within the Azure Synapse Analytics Workspace.

## **Minimum Requirements: Credentials and Access**

Before beginning the setup, ensure that you have the following credentials and access:

* **Decube Data Source:**
  * Name of the Source
* **Client - Azure Synapse Workspace:**
  * Must have **WRITE** access to Azure Synapse Workspace Libraries.
  * Required roles include:
    * Synapse Administrator
    * Synapse Apache Spark Administrator
    * Synapse Contributor
    * Synapse Artifact Publisher

## **How to Connect**

The connection process is divided into two main parts:

* *Decube* platform configuration
* *Azure Synapse Analytics Workspace* setup

**Part 1: Decube Platform Configuration**

**1. Create Spark Data Source**

* Input the name of the connector. This will be the identifier for your data source within Decube.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-ee311e7a9c150b5aad3c05704d4e144902da8d92%2Fimage.png?alt=media" alt=""><figcaption><p>Spark</p></figcaption></figure>

**2. Copy Credentials**

* After creating the data source, Decube will provide you with credentials. These credentials are essential for the integration process and will be used later in the Azure Synapse setup.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-7f4acd30c5a67e7443208cb5adfbff627bdb9ec2%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
**Important:** Keep these credentials secure as they are required for the integration with the Azure Synapse Apache Spark instance.
{% endhint %}

**Part 2: Azure Synapse Analytics Workspace Setup**

1. **Setup OpenLineage as a Workspace Package**

To track and manage data lineage within Spark jobs, the OpenLineage library must be added as a workspace-wide package so that it can later be attached to the *Apache Spark Pool* as a package.

**Steps:**

**Download the OpenLineage Binary:**

Navigate to <https://mvnrepository.com/artifact/io.openlineage/openlineage-spark> and download the appropriate version of the OpenLineage binary. For most use cases, version `1.20.5 - Scala 2.13` is recommended.

{% hint style="info" %}
Make sure that the Scala versions of your Apache Spark pool and the OpenLineage library match.

Versions before `1.8.0` do not support `Scala 2.13`; for those versions, only `Scala 2.12` is available.
{% endhint %}

**Below is an example of how to download the OpenLineage binary from Maven:**

* Choose the version `1.20.5 - Scala 2.13`.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-811279f4022bf3fa4cb55630000e8462e2e3f77c%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

* Click on the `jar` label to download the binary.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-e408922226bc2207e829b78107aa2f5ffa35f209%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>
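As an alternative to downloading through the browser, the jar can be fetched directly from Maven Central. The sketch below assumes Maven Central's standard repository layout and the Scala-suffixed artifact name (`openlineage-spark_2.13`) used by recent OpenLineage releases; verify the resulting URL against the Maven page above before relying on it.

```python
from urllib.request import urlretrieve


def openlineage_jar_url(version: str, scala: str) -> str:
    """Build the Maven Central download URL for the OpenLineage Spark agent.

    Assumes the standard Maven repository layout and the Scala-suffixed
    artifact name (openlineage-spark_<scala>) used by releases >= 1.8.0.
    """
    artifact = f"openlineage-spark_{scala}"
    return (
        "https://repo1.maven.org/maven2/io/openlineage/"
        f"{artifact}/{version}/{artifact}-{version}.jar"
    )


url = openlineage_jar_url("1.20.5", "2.13")
# Uncomment to download the jar into the current directory:
# urlretrieve(url, "openlineage-spark_2.13-1.20.5.jar")
```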

**Upload the Binary to Azure Synapse:**

* Go to Synapse Studio.
* In the left panel, navigate to **Manage** > **Configurations + Libraries** > **Workspace Packages** > **Upload**.
* Upload the OpenLineage jar downloaded in **Download the OpenLineage Binary** above.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-8184f028eb974a140eaefa327913ab50e6a79d58%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

2. **Install OpenLineage in Apache Spark Pool Environment**

Now that OpenLineage is a workspace package, it needs to be installed in the default Apache Spark Pool environment.

**Steps:**

1. In Synapse Studio, go to **Manage** > **Apache Spark Pools**.
2. Select the pool to configure and navigate to **Packages**.
3. Choose the OpenLineage jar uploaded in the previous step and install it.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-fc2547778eb23b8a8421c588e4b62c9ff18b2e1d%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-5502fd324e480e3cca5a948693f2417d4f505f11%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

* Select the `OpenLineage` jar that was uploaded as a workspace package.

{% hint style="info" %}
**Note:** The installation process may take some time as the package is being integrated into the Spark environment.
{% endhint %}

**3. Configure Apache Spark with Decube Credentials**

To finalize the setup, you need to configure Apache Spark with the necessary settings to communicate with Decube.

* Set up the Apache Spark configuration for `OpenLineage` using the credentials copied during the **Copy Credentials** step of the data source setup. Note that the credentials can also be viewed later on the data source's **Modify** screen.

**Configuration Parameters:**

```
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type http
spark.openlineage.transport.url https://integrations.<Region>.decube.io
spark.openlineage.transport.auth.type api_key
spark.openlineage.transport.auth.apiKey <api-key>
spark.openlineage.transport.endpoint /integrations/openlineage/webhook/<webhook-uuid>
spark.openlineage.namespace [Namespace]
```

**`<Region>`**: The region of your Decube workspace.

**`<api-key>`**: The API key copied from the Decube platform.

**`<webhook-uuid>`**: The webhook UUID copied from the Decube platform.

**`[Namespace]`**: A custom namespace determined by the client.

{% hint style="info" %}
For full reference on how to add the configurations above, see <https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-create-spark-configuration>
{% endhint %}
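To avoid typos when filling in the placeholders, the configuration text can be generated from the credential values. Below is a minimal sketch; the region, key, UUID, and namespace values shown are hypothetical and must be replaced with the credentials copied from Decube.

```python
def decube_spark_conf(region: str, api_key: str, webhook_uuid: str, namespace: str) -> str:
    """Render the Spark configuration lines with the Decube placeholders filled in."""
    settings = {
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": f"https://integrations.{region}.decube.io",
        "spark.openlineage.transport.auth.type": "api_key",
        "spark.openlineage.transport.auth.apiKey": api_key,
        "spark.openlineage.transport.endpoint": f"/integrations/openlineage/webhook/{webhook_uuid}",
        "spark.openlineage.namespace": namespace,
    }
    return "\n".join(f"{key} {value}" for key, value in settings.items())


# Hypothetical example values -- substitute your own credentials.
print(decube_spark_conf("us", "my-api-key", "123e4567-e89b-12d3-a456-426614174000", "synapse-prod"))
```

Paste the rendered output into the Spark configuration screen in Synapse Studio.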

4. **Add to Existing Spark Pool Configuration**

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-2f447120a8305291197125ef38ff9e90227d2852%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

* **Input the configurations provided in** [**Apache Spark**](https://docs.decube.io/transformation-tools/apache-spark) **with the placeholders replaced by the required values.**

For Example:

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-6436316f2c286b328a8dee525f1315086459838b%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

### Expected output

Once Spark has been set up successfully, you should be able to see the Data Jobs in the Catalog (named after the app name in the Spark config).

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-0c4594850d0bb5dbf1ab4ba4f0e8270640a43705%2FCatalog%20of%20Data%20Jobs.png?alt=media" alt=""><figcaption><p>Example Data Jobs in Catalog</p></figcaption></figure>

You will also be able to see lineage extracted from the workflow.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-18d0313997e9b4070029357c151cd411a83d251c%2FExample%20Lineage.png?alt=media" alt=""><figcaption><p>Example of 2 csv tables joined onto a parquet file in ADLS.</p></figcaption></figure>
