
AWS Glue

View catalogued assets within your AWS Glue, or leverage AWS Athena to add data observability capabilities and monitor Iceberg tables.


Last updated 1 day ago

Supported Capabilities

Data Quality
Capability | Supported
Freshness | ✅
Volume | ✅
Schema Drift | ✅
Field Health | ✅
Custom SQL | ✅
Job Failure | ✅

Catalog
Capability | Supported
Data Profiling | ✅
Data Preview | ✅

Data Recon
Capability | Supported
Add Recon | ❌

*Only available when AWS Athena compute is selected.

Minimum Requirement

To connect your AWS Glue to decube, we will need the following information:

Choose an authentication method:

a. AWS Identity:

  • Select AWS Identity

  • Customer AWS Role ARN

  • Region

  • Enable Athena (Optional) - Read more in the Enable Athena for Data Observability section. If Athena is enabled, also provide:

    • Workgroup

    • Bucket Name

  • Data source name

b. AWS Access Key:

  • Access Key ID

  • Secret Access Key

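  • Region

  • Enable Athena (Optional) - Read more in the Enable Athena for Data Observability section. If Athena is enabled, also provide: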

    • Workgroup

    • Bucket Name

  • Data source name
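
The two field lists above can be checked for completeness before filling in the connection form. The sketch below is only an illustration: the `GlueConnection` class and its field names are hypothetical and are not part of any Decube API. It assumes exactly one authentication method is used and that Workgroup and Bucket Name are needed only when Athena is enabled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GlueConnection:
    """Hypothetical container mirroring the fields on the connection form."""
    region: str
    data_source_name: str
    # Option a: AWS Identity (role-based)
    role_arn: Optional[str] = None
    # Option b: AWS Access Key
    access_key_id: Optional[str] = None
    secret_access_key: Optional[str] = None
    # Optional Athena settings
    enable_athena: bool = False
    workgroup: Optional[str] = None
    bucket_name: Optional[str] = None

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means the form is complete."""
        found = []
        has_role = bool(self.role_arn)
        has_keys = bool(self.access_key_id and self.secret_access_key)
        if has_role == has_keys:
            found.append("provide either a role ARN or an access key pair")
        if self.enable_athena and not self.workgroup:
            found.append("Workgroup is required when Athena is enabled")
        if self.enable_athena and not self.bucket_name:
            found.append("Bucket Name is required when Athena is enabled")
        return found
```

Calling `problems()` on a filled-in instance returns an empty list when nothing is missing.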

Connection Options:

a. AWS Role

This section creates a Customer AWS Role within your AWS account that has the right set of permissions to access your data sources.

  • Step 1: Go to your AWS Account → IAM Module → Roles

  • Step 2: Click on Create role

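  • Step 3: Choose Custom trust policy

  • Step 4: Specify the following as the trust policy, replacing <DECUBE-AWS-IDENTITY-ARN> and <EXTERNAL-ID> with the values from Generating a Decube AWS Identity.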

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "<DECUBE-AWS-IDENTITY-ARN>"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "<EXTERNAL-ID>"
                }
            }
        }
    ]
}

  • Step 5: Click Next to proceed to attach a policy

  • Step 6: Click on Attach policies directly, search for AWSGlueServiceRole, and add this policy

  • Step 7: Click Next and specify a role name. In this documentation the name is assumed to be CustomerAWSRole, but it can be set to any value.
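
As a convenience, the trust policy in the steps above can be generated from the two values rather than edited by hand. This is a minimal sketch using only the Python standard library; the function name `build_trust_policy` is hypothetical, and the arguments stand in for the Decube identity ARN and external ID.

```python
import json


def build_trust_policy(decube_identity_arn: str, external_id: str) -> str:
    """Return the custom trust policy as a JSON string ready to paste into IAM."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The Decube identity that is allowed to assume this role.
                "Principal": {"AWS": decube_identity_arn},
                "Action": "sts:AssumeRole",
                # The external ID must match for the assume-role call to succeed.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }
    return json.dumps(policy, indent=4)


if __name__ == "__main__":
    # Placeholders only; substitute the real values before use.
    print(build_trust_policy("<DECUBE-AWS-IDENTITY-ARN>", "<EXTERNAL-ID>"))
```

Generating the document this way avoids stray quotes or missing braces when the placeholders are replaced.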

b. AWS IAM User

  • Step 1: Log in to the AWS Console and go to IAM > Users > Create user

  • Step 2: Click on Attach policies directly, search for AWSGlueServiceRole, and add this policy

  • Step 3: Review and create your user

  • Step 4: Navigate to the newly created user and click on Create access key

  • Step 5: Choose Application running outside AWS

  • Step 6: Save the provided access key and secret access key. You will not be able to retrieve these keys again.

Enable Athena for Data Observability

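This section applies if you intend to enable monitors on your AWS Glue source. This includes monitoring Iceberg tables, because AWS Athena is required to query Iceberg tables.

AWS Glue, by itself, does not provide native support for data quality monitoring. To address this, Decube uses AWS Athena, a serverless, interactive query service, to query the data that Glue produces and stores in AWS S3. Because of this, Decube requires additional policies to be attached to the IAM user created in the AWS IAM User section.

Ensure that the steps in the AWS IAM User section have been completed before starting this section.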

Configuring AWS Athena

You will need to set up these items:

  1. An S3 bucket to store Athena query results

  2. An Athena workgroup

  3. Optional - Athena data source name

Athena saves the results of queries in an S3 bucket. The location of this bucket is referenced in one of the policies attached in the Adding policies to IAM User section. The Athena workgroup is also referenced by some of the policies attached to the IAM user.

Creating an s3 bucket

  1. Go to S3 > Bucket and click on Create bucket

  2. For bucket name, we suggest decube-athena-query-results.

  3. For Object Ownership, select ACLs disabled.

  4. Click on Create bucket.

  5. Take note of the ARN for the bucket. We will refer to the bucket as decube-athena-query-results in the following sections when setting up Athena.
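
The bucket name is reused in three forms later on: the bucket ARN, the ARN for its objects, and the s3:// location given to the workgroup. The small sketch below derives all three from the name; the helper `athena_bucket_values` is hypothetical and only formats strings, it does not call AWS.

```python
def athena_bucket_values(bucket_name: str) -> dict:
    """Derive the strings needed later from the query-results bucket name."""
    return {
        # Bucket-level ARN, used for actions such as s3:ListBucket.
        "bucket_arn": f"arn:aws:s3:::{bucket_name}",
        # Object-level ARN, used for actions such as s3:GetObject and s3:PutObject.
        "objects_arn": f"arn:aws:s3:::{bucket_name}/*",
        # Location entered in the workgroup's query result configuration.
        "query_result_location": f"s3://{bucket_name}",
    }


if __name__ == "__main__":
    for key, value in athena_bucket_values("decube-athena-query-results").items():
        print(f"{key}: {value}")
```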

Creating an Athena Workgroup

  1. Go to Amazon Athena > Administration > Workgroups

  2. Click on Create Workgroup

  3. Fill in Workgroup name. The recommended name is decube-athena-workgroup.

  4. Select Athena SQL as Analytics engine.

  5. Select Manual for Upgrade query engine.

  6. Select Athena engine version 3 as Query engine version.

  7. For Authentication, select AWS Identity and Access Management (IAM).

  8. For Query result configuration, under Location of query result, fill in the location of the bucket from the previous section. If you're following the naming convention, it would be s3://decube-athena-query-results.

  9. Click on Create workgroup.

  10. Take note of the name of the workgroup.

Adding policies to IAM User

  1. Go to IAM > Users, search for the user previously created for Decube to ingest Glue, and click on it.

  2. On the Permissions tab, click on Add permissions > Create inline policy.

  3. On the Policy editor tab, click on JSON.

  4. Copy and paste the policy below into the form provided, replacing each {placeholder} in the Resource blocks with your own values. For example, the Athena resources could be arn:aws:athena:*:1234567:datacatalog/* and arn:aws:athena:*:1234567:workgroup/decube-athena-workgroup, a bucket to be monitored by Athena could be arn:aws:s3:::decube-glue_results/* and arn:aws:s3:::decube-glue_results, and the query results bucket could be arn:aws:s3:::decube-athena-query-results/* and arn:aws:s3:::decube-athena-query-results.

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "DecubeAthenaS3Ingest",
			"Effect": "Allow",
			"Action": [
				"s3:GetObject",
				"s3:GetBucketLocation",
				"athena:GetTableMetadata",
				"athena:StartQueryExecution",
				"athena:GetQueryResults",
				"athena:GetDatabase",
				"athena:GetDataCatalog",
				"athena:ListQueryExecutions",
				"athena:GetWorkGroup",
				"athena:StopQueryExecution",
				"athena:GetQueryResultsStream",
				"athena:ListDatabases",
				"athena:GetQueryExecution",
				"athena:ListTableMetadata",
				"athena:BatchGetQueryExecution"
			],
			"Resource": [
				"arn:aws:athena:*:{account id}:datacatalog/{specify a data catalog or *}",
				"arn:aws:athena:*:{account id}:workgroup/{workgroup_name or *}",
				"arn:aws:s3:::{bucket to be monitored by Athena}/*",
				"arn:aws:s3:::{bucket to be monitored by Athena}"
			]
		},
		{
			"Sid": "DecubeS3AthenaOutput",
			"Effect": "Allow",
			"Action": [
				"s3:PutObject",
				"s3:GetObject",
				"s3:ListBucketMultipartUploads",
				"s3:AbortMultipartUpload",
				"s3:ListBucket",
				"s3:GetBucketLocation",
				"s3:ListMultipartUploadParts"
			],
			"Resource": [
				"arn:aws:s3:::{Athena query results bucket}/*",
				"arn:aws:s3:::{Athena query results bucket}"
			]
		}
	]
}

  5. Click Next. We recommend naming the policy decube-athena-s3. Finally, click on Create policy.
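
Because the policy above has several Resource entries to adjust, it can help to render it from a few inputs so the result is always valid JSON. The following is a sketch under stated assumptions: `build_policy` and its parameters are hypothetical names, the action lists are copied from the policy above, and every monitored bucket is assumed to need both its bucket-level and object-level ARN.

```python
import json

# Actions copied from the two statements in the policy above.
INGEST_ACTIONS = [
    "s3:GetObject", "s3:GetBucketLocation", "athena:GetTableMetadata",
    "athena:StartQueryExecution", "athena:GetQueryResults", "athena:GetDatabase",
    "athena:GetDataCatalog", "athena:ListQueryExecutions", "athena:GetWorkGroup",
    "athena:StopQueryExecution", "athena:GetQueryResultsStream",
    "athena:ListDatabases", "athena:GetQueryExecution",
    "athena:ListTableMetadata", "athena:BatchGetQueryExecution",
]
OUTPUT_ACTIONS = [
    "s3:PutObject", "s3:GetObject", "s3:ListBucketMultipartUploads",
    "s3:AbortMultipartUpload", "s3:ListBucket", "s3:GetBucketLocation",
    "s3:ListMultipartUploadParts",
]


def bucket_arns(bucket):
    """Return the object-level and bucket-level ARNs for one bucket."""
    return [f"arn:aws:s3:::{bucket}/*", f"arn:aws:s3:::{bucket}"]


def build_policy(account_id, workgroup, monitored_buckets, results_bucket,
                 data_catalog="*"):
    """Render the inline policy as a JSON string with the placeholders filled in."""
    ingest_resources = [
        f"arn:aws:athena:*:{account_id}:datacatalog/{data_catalog}",
        f"arn:aws:athena:*:{account_id}:workgroup/{workgroup}",
    ]
    for bucket in monitored_buckets:
        ingest_resources.extend(bucket_arns(bucket))
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DecubeAthenaS3Ingest",
                "Effect": "Allow",
                "Action": INGEST_ACTIONS,
                "Resource": ingest_resources,
            },
            {
                "Sid": "DecubeS3AthenaOutput",
                "Effect": "Allow",
                "Action": OUTPUT_ACTIONS,
                "Resource": bucket_arns(results_bucket),
            },
        ],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    # Example values only; the account ID and monitored bucket are made up.
    print(build_policy("123456789012", "decube-athena-workgroup",
                       ["my-data-bucket"], "decube-athena-query-results"))
```

The printed JSON can be pasted directly into the JSON policy editor.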

OpenLineage with AWS Glue

This section applies if you intend to view lineage from your AWS Glue jobs. OpenLineage is an open framework for data lineage collection and analysis. At its core is an extensible specification that systems can use to interoperate with lineage metadata.

Follow the steps below to enable OpenLineage on AWS Glue:

  1. Specify the OpenLineage JAR URL

  • In the Job details tab, navigate to Advanced properties → Libraries → Dependent Jars path

  • Use the URL directly from Maven Central openlineage-spark

  • Ensure you select the version for Scala 2.12, as Glue Spark is compiled with Scala 2.12 and version 2.13 won't be compatible.

  • On the Maven Central page for the specific OpenLineage version for Scala 2.12, copy the URL of the jar file from the Files row and use it in Glue.

  • Alternatively, upload the jar to an S3 bucket and use its URL. The URL should use the s3 scheme: s3://<your bucket>/path/to/openlineage-spark_2.12-<version>.jar

  2. Add OpenLineage configuration in Job Parameters

    In the same Job details tab, add a new property under Job parameters:

    • Use the format param1=value1 --conf param2=value2 ... --conf paramN=valueN.

    • Make sure every parameter except the first has an extra --conf in front of it.

    • Example: spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=http --conf spark.openlineage.transport.url=https://integrations.<Region>.decube.io --conf spark.openlineage.transport.endpoint=/integrations/openlineage/webhook/<webhook-uuid> --conf spark.openlineage.transport.auth.type=api_key --conf spark.openlineage.transport.auth.apiKey=<webhook-key>

  3. Set User Jars First Parameter

  • Add the --user-jars-first parameter and set its value to true
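
The chained format described above, where every setting except the first is preceded by --conf, is easy to get wrong by hand. This minimal sketch joins an ordered set of Spark settings into that single value. The function name `build_conf_value` is hypothetical, the settings are the ones from the example above with their placeholders left unfilled, and the key under which the value is entered in Glue is not stated in this guide, so it is not assumed here.

```python
def build_conf_value(settings: dict) -> str:
    """Join key=value pairs so every pair except the first is preceded by --conf."""
    pairs = [f"{key}={value}" for key, value in settings.items()]
    return " --conf ".join(pairs)


# Settings taken from the example above; placeholders must be replaced before use.
OPENLINEAGE_SETTINGS = {
    "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
    "spark.openlineage.transport.type": "http",
    "spark.openlineage.transport.url": "https://integrations.<Region>.decube.io",
    "spark.openlineage.transport.endpoint": "/integrations/openlineage/webhook/<webhook-uuid>",
    "spark.openlineage.transport.auth.type": "api_key",
    "spark.openlineage.transport.auth.apiKey": "<webhook-key>",
}

if __name__ == "__main__":
    print(build_conf_value(OPENLINEAGE_SETTINGS))
```

Python dictionaries preserve insertion order, so the listener setting stays first and is the only one without a leading --conf.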

Verification

  • To confirm that OpenLineage registration has been successful, check the logs for the following entry:

INFO SparkContext: Registered listener io.openlineage.spark.agent.OpenLineageSparkListener

  • If you see this log message, it indicates that OpenLineage has been correctly registered with your AWS Glue job.
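
If the job logs have been downloaded as text, the check above can be automated. The sketch below simply searches for the registration message; `listener_registered` is a hypothetical helper, and how the log text is obtained is left to the reader.

```python
EXPECTED = "Registered listener io.openlineage.spark.agent.OpenLineageSparkListener"


def listener_registered(log_text: str) -> bool:
    """Return True if any log line contains the OpenLineage registration message."""
    return any(EXPECTED in line for line in log_text.splitlines())


if __name__ == "__main__":
    sample = (
        "INFO SparkContext: Running Spark\n"
        "INFO SparkContext: Registered listener "
        "io.openlineage.spark.agent.OpenLineageSparkListener\n"
    )
    print(listener_registered(sample))
```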

  1. Enter the access key, secret access key, and region in the connection form, then test the connection. If the test succeeds, add the data source name and connect to the data source.
