Apache Spark in Azure Synapse

Connecting Decube to Apache Spark running in Azure Synapse.

This guide outlines the steps necessary to integrate Decube with an Apache Spark instance in Azure Synapse Analytics Workspace. The process is divided into two primary sections: configuring the Decube platform and setting up the necessary components within the Azure Synapse Analytics Workspace.

Minimum Requirements: Credentials and Access

Before beginning the setup, ensure that you have the following credentials and access:

  • Decube Data Source:

    • Name of the Source

  • Client - Azure Synapse Workspace:

    • Must have WRITE access to Azure Synapse Workspace Libraries.

    • Required roles include:

      • Synapse Administrator

      • Synapse Apache Spark Administrator

      • Synapse Contributor

      • Synapse Artifact Publisher

How to Connect

The connection process is divided into two main parts:

  • Decube platform configuration

  • Azure Synapse Analytics Workspace setup

Part 1: Decube Platform Configuration

1. Create Spark Data Source

  • Input the name of the connector. This will be the identifier for your data source within Decube.

2. Copy Credentials

  • After creating the data source, Decube will provide you with credentials. These credentials are essential for the integration process and will be used later in the Azure Synapse setup.

Important: Keep these credentials secure as they are required for the integration with the Azure Synapse Apache Spark instance.

Part 2: Azure Synapse Analytics Workspace Setup

  1. Set Up OpenLineage as a Workspace Package

To track and manage data lineage within Spark jobs, the OpenLineage library must be added as a workspace package so that it can later be installed in the Apache Spark pool.

Steps:

Download the OpenLineage Binary:

Navigate to the OpenLineage Maven repository (https://mvnrepository.com/artifact/io.openlineage/openlineage-spark) and download the appropriate version of the OpenLineage binary. For most use cases, version 1.20.5 - Scala 2.13 is recommended.

Make sure that the Scala version of the OpenLineage library matches the Scala version of your Apache Spark pool. OpenLineage versions before 1.8.0 do not support Scala 2.13; for those versions, use Scala 2.12.

To download the binary from Maven:

  • Choose the version 1.20.5 - Scala 2.13.

  • Click on the jar label to download the binary.
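
If you prefer to script the download instead of using the Maven website, the sketch below fetches the jar from Maven Central. The repository path and the Scala-suffixed artifact name are assumptions based on the standard Maven layout, so verify the URL against the Maven page linked above.

# A minimal sketch (not from the official docs) of fetching the OpenLineage
# Spark jar directly from Maven Central, assuming the standard repository
# layout and the Scala-suffixed artifact name used by recent releases.
import urllib.request

version = "1.20.5"
scala_version = "2.13"
artifact = f"openlineage-spark_{scala_version}"
jar_name = f"{artifact}-{version}.jar"
url = (
    "https://repo1.maven.org/maven2/io/openlineage/"
    f"{artifact}/{version}/{jar_name}"
)

# Save the jar to the current directory; this is the file you upload to the
# Synapse workspace packages in the next step.
urllib.request.urlretrieve(url, jar_name)
print(f"Downloaded {jar_name}")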

Upload the Binary to Azure Synapse:

  • Go to Synapse Studio.

  • In the left panel, navigate to Manage > Configurations + Libraries > Workspace Packages > Upload.

  • Upload the OpenLineage jar binary downloaded in the Download the OpenLineage Binary step.

  2. Install OpenLineage in Apache Spark Pool Environment

Now that OpenLineage is a workspace package, it needs to be installed in the default Apache Spark Pool environment.

Steps:

  1. In Synapse Studio, go to Manage > Apache Spark Pools.

  2. Select the pool to configure and navigate to Packages.

  3. Choose the OpenLineage jar uploaded in the previous step and install it.


Note: The installation process may take some time as the package is being integrated into the Spark environment.

3. Configure Apache Spark with Decube Credentials

To finalize the setup, you need to configure Apache Spark with the necessary settings to communicate with Decube.

  • Set up the Apache Spark configuration for OpenLineage using the credentials copied during the Copy Credentials step of the data source setup. Note that the credentials can also be viewed later on the data source's Modify screen.

Configuration Parameters:

spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type http
spark.openlineage.transport.url https://integrations.<Region>.decube.io
spark.openlineage.transport.auth.type api_key
spark.openlineage.transport.auth.apiKey <api-key>
spark.openlineage.transport.endpoint /integrations/openlineage/webhook/<webhook-uuid>
spark.openlineage.namespace [Namespace]

<api-key>: The API key copied from the Decube platform.

<webhook-uuid>: The webhook UUID copied from the Decube platform.

[Namespace]: A custom namespace determined by the client.

  4. Add to Existing Spark Pool Configuration

Create an Apache Spark configuration for your Spark pool and input the configuration parameters above, with the placeholders filled in with the required values. For full reference on how to add the configurations, see https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-create-spark-configuration.
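
As an illustrative aside (not a replacement for the Synapse Spark pool configuration), the same properties can also be passed to a standalone PySpark session when testing the OpenLineage transport outside Synapse. The app name, namespace value, and package coordinates below are placeholder assumptions.

# A hedged sketch: the OpenLineage properties applied to a standalone PySpark
# session for testing. Replace <Region>, <api-key>, <webhook-uuid> and the
# namespace with your own values from the Decube data source.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("decube_openlineage_test")  # Data Jobs in the Catalog take this app name
    # Pull the OpenLineage jar automatically when running outside Synapse.
    .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.13:1.20.5")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "https://integrations.<Region>.decube.io")
    .config("spark.openlineage.transport.auth.type", "api_key")
    .config("spark.openlineage.transport.auth.apiKey", "<api-key>")
    .config("spark.openlineage.transport.endpoint", "/integrations/openlineage/webhook/<webhook-uuid>")
    .config("spark.openlineage.namespace", "my_namespace")
    .getOrCreate()
)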

Expected output

Once your Spark pool has been successfully set up, you should be able to see the Data Jobs in the Catalog (named after the app name in the Spark configuration).

You will also be able to see lineages extracted from the workflow, for example two CSV tables joined onto a parquet file in ADLS.
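
For reference, a job of the kind described above might look like the sketch below; the storage account, container names, file paths, and join key are hypothetical.

# A hypothetical Spark job: two CSV tables joined and written as a parquet
# file to ADLS, producing the kind of lineage shown in the Catalog.
# 'spark' is a session running on a pool configured as described above.
orders = spark.read.option("header", True).csv(
    "abfss://raw@examplestorage.dfs.core.windows.net/orders.csv"
)
customers = spark.read.option("header", True).csv(
    "abfss://raw@examplestorage.dfs.core.windows.net/customers.csv"
)

# Join the two CSV tables and write the result as parquet; OpenLineage reports
# the inputs, the output, and the job itself to Decube.
orders.join(customers, on="customer_id", how="inner") \
    .write.mode("overwrite") \
    .parquet("abfss://curated@examplestorage.dfs.core.windows.net/enriched_orders")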
