AWS Glue
View catalogued assets in your AWS Glue Data Catalog, or use AWS Athena to add data observability capabilities and monitor Iceberg tables.
Supported Capabilities
Capability       Supported
Freshness        ✅*
Volume           ✅*
Schema Drift     ✅*
Field Health     ✅*
Custom SQL       ✅*
Job Failure      ✅
Data Profiling   ✅*
Data Preview     ✅*
* Only available when AWS Athena compute is selected.
Minimum Requirement
To connect AWS Glue to Decube, you will need the following information, depending on the authentication method you choose:
a. AWS Identity:
   Select AWS Identity
   Customer AWS Role ARN
   Region
   Enable Athena (optional; see Enable Athena for Data Observability). If Athena is enabled:
      Workgroup
      Bucket Name
      Data source name
b. AWS Access Key:
   Access Key ID
   Secret Access Key
   Region
   Enable Athena (optional; see Enable Athena for Data Observability). If Athena is enabled:
      Workgroup
      Bucket Name
      Data source name

Connection Options:
a. AWS Role
Step 1: Go to your AWS Account → IAM Module → Roles
Step 2: Click on Create role

Step 3: Choose Custom trust policy

Step 4: Specify the following as the trust policy, replacing <DECUBE-AWS-IDENTITY-ARN> and <EXTERNAL-ID> with the values from Generating a Decube AWS Identity:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<DECUBE-AWS-IDENTITY-ARN>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<EXTERNAL-ID>"
        }
      }
    }
  ]
}
Step 5: Click Next to proceed to attaching a policy.
Step 6: Click Attach policies directly, search for AWSGlueServiceRole, and add this policy.

Step 7: Click Next and specify a role name. This documentation assumes the name CustomerAWSRole, but you can use any value.
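If you prefer to generate the trust policy from Step 4 rather than edit the JSON by hand, the following is a minimal sketch in Python. The function name build_trust_policy and the sample ARN and external ID are hypothetical placeholders, not values issued by Decube; substitute the values shown in Generating a Decube AWS Identity.

```python
import json


def build_trust_policy(decube_identity_arn, external_id):
    """Return the custom trust policy from Step 4 as a Python dict."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The Decube identity that is allowed to assume this role.
                "Principal": {"AWS": decube_identity_arn},
                "Action": "sts:AssumeRole",
                # The external ID must match on every AssumeRole call.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    policy = build_trust_policy(
        "arn:aws:iam::123456789012:role/example-decube-identity",
        "example-external-id",
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the Custom trust policy editor as-is.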
b. AWS IAM User
Step 1: Log in to the AWS Console and go to IAM > Users > Create user.

Step 2: Click Attach policies directly and search for AWSGlueServiceRole.

Step 3: Review and create your user.

Step 4: Navigate to the newly created user and click Create access key.

Step 5: Choose Application running outside AWS.

Step 6: Save the provided access key and secret access key. You will not be able to retrieve these keys again.

Enable Athena for Data Observability
AWS Glue by itself does not provide native support for data quality monitoring. To address this, Decube uses AWS Athena, a serverless, interactive query service, to query the data that Glue produced and stored in AWS S3. For this reason, Decube requires additional policies to be attached to the IAM user created in the AWS IAM User steps above.
Configuring AWS Athena
You will need to set up these items:
   An S3 bucket to store Athena query results
   An Athena Workgroup
   Optional: an Athena Data Source name
Athena saves query results in an S3 bucket. The location of this bucket is referenced in one of the policies in the Adding policies to IAM User section. The Athena Workgroup is also required by some of the policies attached to the IAM user.
Creating an s3 bucket
Step 1: Go to S3 > Buckets and click Create bucket.
Step 2: For the bucket name, we suggest decube-athena-query-results.
Step 3: For Object Ownership, select ACLs disabled.
Step 4: Click Create bucket.
Step 5: Take note of the ARN of the bucket. The following sections refer to this bucket as decube-athena-query-results when setting up Athena.
Creating an Athena Workgroup
Step 1: Go to Amazon Athena > Administration > Workgroups.
Step 2: Click Create Workgroup.

Step 3: Fill in the Workgroup name. The recommended name is decube-athena-workgroup.
Step 4: Select Athena SQL as the Analytics engine.
Step 5: Select Manual for Upgrade query engine.
Step 6: Select Athena engine version 3 as the Query engine version.

Step 7: For Authentication, select AWS Identity and Access Management (IAM).
Step 8: For Query result configuration, specifically Location of query result, fill in the location of the bucket from the previous section. If you followed the naming convention, it is s3://decube-athena-query-results.
Step 9: Click Create workgroup.
Step 10: Take note of the name of the workgroup.
Adding policies to IAM User
Make sure you have completed the AWS IAM User setup in the earlier section before starting this one.
Step 1: Go to IAM > Users, search for the user previously created for Decube to ingest Glue, and click on it.
Step 2: On the Permissions tab, click Add permissions > Create inline policy.

Step 3: On the Policy editor tab, click JSON.

Step 4: Copy and paste the following policy into the editor, and change the Resource blocks to match your account. The lines starting with // are examples only: IAM policies must be valid JSON, so remove the comments (and any trailing comma left behind) before saving.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DecubeAthenaS3Ingest",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetBucketLocation",
        "athena:GetTableMetadata",
        "athena:StartQueryExecution",
        "athena:GetQueryResults",
        "athena:GetDatabase",
        "athena:GetDataCatalog",
        "athena:ListQueryExecutions",
        "athena:GetWorkGroup",
        "athena:StopQueryExecution",
        "athena:GetQueryResultsStream",
        "athena:ListDatabases",
        "athena:GetQueryExecution",
        "athena:ListTableMetadata",
        "athena:BatchGetQueryExecution"
      ],
      "Resource": [
        "arn:aws:athena:*:{account id}:datacatalog/{specify a data catalog or *}",
        "arn:aws:athena:*:{account id}:workgroup/{workgroup_name or *}",
        // example
        // "arn:aws:athena:*:123456789012:datacatalog/*",
        // "arn:aws:athena:*:123456789012:workgroup/decube-athena-workgroup",
        // example - all buckets to be monitored by Athena
        // "arn:aws:s3:::decube-glue_results/*",
        // "arn:aws:s3:::decube-glue_results"
      ]
    },
    {
      "Sid": "DecubeS3AthenaOutput",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucketMultipartUploads",
        "s3:AbortMultipartUpload",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": [
        // example - ARN of the Athena query results bucket created earlier
        // "arn:aws:s3:::decube-athena-query-results/*",
        // "arn:aws:s3:::decube-athena-query-results"
      ]
    }
  ]
}
Click Next. We recommend naming the policy decube-athena-s3. Finally, click Create policy.
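Because the policy template above mixes placeholders with example lines, it can be easier to generate the final JSON than to edit it by hand. The following is a minimal sketch in Python that fills in the Resource blocks for both statements. The function names, the account ID 123456789012, and the bucket name example-glue-data-bucket are hypothetical and used only for illustration; the actions and Sid values are copied from the policy above.

```python
import json

# Actions copied from the DecubeAthenaS3Ingest statement.
INGEST_ACTIONS = [
    "s3:GetObject", "s3:GetBucketLocation",
    "athena:GetTableMetadata", "athena:StartQueryExecution",
    "athena:GetQueryResults", "athena:GetDatabase",
    "athena:GetDataCatalog", "athena:ListQueryExecutions",
    "athena:GetWorkGroup", "athena:StopQueryExecution",
    "athena:GetQueryResultsStream", "athena:ListDatabases",
    "athena:GetQueryExecution", "athena:ListTableMetadata",
    "athena:BatchGetQueryExecution",
]

# Actions copied from the DecubeS3AthenaOutput statement.
OUTPUT_ACTIONS = [
    "s3:PutObject", "s3:GetObject", "s3:ListBucketMultipartUploads",
    "s3:AbortMultipartUpload", "s3:ListBucket", "s3:GetBucketLocation",
    "s3:ListMultipartUploadParts",
]


def bucket_arns(bucket_name):
    """Return the object-level and bucket-level ARNs for one S3 bucket."""
    return [f"arn:aws:s3:::{bucket_name}/*", f"arn:aws:s3:::{bucket_name}"]


def build_athena_policy(account_id, workgroup, monitored_buckets,
                        results_bucket, data_catalog="*"):
    """Build the inline policy with every Resource placeholder filled in."""
    ingest_resources = [
        f"arn:aws:athena:*:{account_id}:datacatalog/{data_catalog}",
        f"arn:aws:athena:*:{account_id}:workgroup/{workgroup}",
    ]
    # Add every bucket that Athena should be able to read.
    for bucket in monitored_buckets:
        ingest_resources.extend(bucket_arns(bucket))
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "DecubeAthenaS3Ingest", "Effect": "Allow",
             "Action": INGEST_ACTIONS, "Resource": ingest_resources},
            {"Sid": "DecubeS3AthenaOutput", "Effect": "Allow",
             "Action": OUTPUT_ACTIONS,
             "Resource": bucket_arns(results_bucket)},
        ],
    }


if __name__ == "__main__":
    policy = build_athena_policy(
        account_id="123456789012",                     # placeholder
        workgroup="decube-athena-workgroup",
        monitored_buckets=["example-glue-data-bucket"],  # placeholder
        results_bucket="decube-athena-query-results",
    )
    print(json.dumps(policy, indent=2))
```

The output contains no comments or trailing commas, so it should be accepted by the JSON policy editor directly.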

OpenLineage with AWS Glue
This section is applicable if you intend to view lineages from your AWS Glue jobs. OpenLineage is an open framework for data lineage collection and analysis. At its core is an extensible specification that systems can use to interoperate with lineage metadata.
Follow the steps below to enable OpenLineage on AWS Glue:
In the Job details tab, navigate to Advanced properties → Libraries → Dependent Jars path

Use the URL directly from Maven Central openlineage-spark
Ensure you select the version for Scala 2.12, as Glue Spark is compiled with Scala 2.12, and version 2.13 won't be compatible.
On the page, for the specific OpenLineage version for Scala 2.12, copy the URL of the jar file from the Files row and use it in Glue.
Alternatively, upload the jar to an S3 bucket and use its URL. The URL should use the s3 scheme: s3://<your bucket>/path/to/openlineage-spark_2.12-<version>.jar
Add OpenLineage configuration in Job Parameters
In the same Job details tab, add a new property under Job parameters:
Use the format param1=value1 --conf param2=value2 ... --conf paramN=valueN.
Make sure every parameter except the first has an extra --conf in front of it.
Example:
spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=http --conf spark.openlineage.transport.url=https://integrations.<Region>.decube.io --conf spark.openlineage.transport.endpoint=/integrations/openlineage/webhook/<webhook-uuid> --conf spark.openlineage.transport.auth.type=api_key --conf spark.openlineage.transport.auth.apiKey=<webhook-key>
Add the --user-jars-first parameter and set its value to true
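The "every parameter except the first" rule is easy to get wrong when editing the string by hand. The following is a minimal sketch in Python that assembles the value from key/value pairs. The function names and the sample region, webhook UUID, and key are hypothetical placeholders. The sketch also assumes that the resulting string is entered as the value of the Job parameter described above (presumably under the key --conf, which the format implies; confirm this in your Glue console).

```python
def build_conf_value(params):
    """Join key=value pairs so that every pair except the first is
    prefixed with --conf, as the Job parameter format requires."""
    return " --conf ".join(f"{key}={value}" for key, value in params.items())


def openlineage_params(region, webhook_uuid, webhook_key):
    """Return the OpenLineage Spark settings from the example above,
    in the same order, with the placeholders filled in."""
    return {
        "spark.extraListeners":
            "io.openlineage.spark.agent.OpenLineageSparkListener",
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url":
            f"https://integrations.{region}.decube.io",
        "spark.openlineage.transport.endpoint":
            f"/integrations/openlineage/webhook/{webhook_uuid}",
        "spark.openlineage.transport.auth.type": "api_key",
        "spark.openlineage.transport.auth.apiKey": webhook_key,
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    params = openlineage_params("example-region", "example-uuid", "example-key")
    print(build_conf_value(params))
```

Keeping the settings in a dictionary also makes it harder to drop or duplicate a --conf prefix when a setting is added later.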

Verification
To confirm that OpenLineage registration has been successful, check the logs for the following entry:
INFO SparkContext: Registered listener io.openlineage.spark.agent.OpenLineageSparkListener
If you see this log message, it indicates that OpenLineage has been correctly registered with your AWS Glue job.
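If you have exported the job's driver logs to text, the check can be scripted. The following is a minimal sketch in Python; the function name listener_registered and the sample log lines are hypothetical, and how you obtain the log lines (for example, from your log viewer's export) is left as an assumption.

```python
# The text that Spark logs once the OpenLineage listener is registered.
LISTENER_MARKER = (
    "Registered listener io.openlineage.spark.agent.OpenLineageSparkListener"
)


def listener_registered(log_lines):
    """Return True if any log line reports the OpenLineage listener."""
    return any(LISTENER_MARKER in line for line in log_lines)


if __name__ == "__main__":
    # Hypothetical sample lines standing in for real driver logs.
    sample = [
        "INFO SparkContext: Running Spark version (sample line)",
        "INFO SparkContext: " + LISTENER_MARKER,
    ]
    print("registered" if listener_registered(sample) else "not registered")
```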
In the connection form, enter the access key, secret key, and region, then test the connection. If the test succeeds, you can add a name and connect to the data source.