AWS Glue
View catalogued assets in AWS Glue, or use AWS Athena to add data observability capabilities and monitor Iceberg tables.
Freshness
Volume
Schema Drift
Field Health
Custom SQL
Job Failure
Data Profiling
Data Preview
Add Recon
*only available when AWS Athena compute is selected.
To connect AWS Glue to Decube, we will need the following information:
An IAM user created for Decube with the AWSGlueServiceRole policy attached
IAM user's Access Key
IAM user's Secret Key
Glue Region
Enable Athena (Optional). If Athena is enabled, also provide:
Workgroup
Bucket Name
Log in to the AWS Console and proceed to IAM > User > Create User
Select Attach policies directly and search for AWSGlueServiceRole
Review and create your user
Navigate to the newly created user and click on Create access key
Choose Application running outside AWS
Save the provided access key and secret access key. You will not be able to retrieve these keys again.
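The console steps above can also be scripted. The following is a minimal sketch, assuming the `iam` argument is a boto3 IAM client (for example `boto3.client("iam")`) created with credentials that are allowed to manage users. The default user name `decube-glue-user` is a hypothetical placeholder, not a name Decube requires.

```python
# AWS-managed policy named in the steps above.
GLUE_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def create_glue_user(iam, user_name="decube-glue-user"):
    """Create an IAM user, attach AWSGlueServiceRole, and return its keys.

    `iam` is expected to be a boto3 IAM client, e.g. boto3.client("iam").
    `user_name` is a hypothetical default chosen for illustration.
    """
    iam.create_user(UserName=user_name)
    iam.attach_user_policy(UserName=user_name, PolicyArn=GLUE_POLICY_ARN)
    key = iam.create_access_key(UserName=user_name)["AccessKey"]
    # The secret access key is only returned once, so store it securely now.
    return key["AccessKeyId"], key["SecretAccessKey"]
```

Passing the client in as an argument keeps the function free of credential handling and makes it easy to exercise with a stub before running it against a real account.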
You will need to set up these items:
An S3 bucket to store Athena query results
An Athena workgroup
Optional - Athena Data Source name
Athena saves query results in an S3 bucket. The location of this bucket is referenced in one of the policies attached in a later section. The Athena workgroup is also required by some of the policies that we will attach to the IAM user.
Go to S3 > Bucket and click on Create bucket.
For bucket name, we suggest decube-athena-query-results.
For Object Ownership, select ACLs disabled.
Click on Create bucket.
Take note of the ARN for the bucket. We will refer to the bucket as decube-athena-query-results in the following sections when setting up Athena.
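For readers who prefer scripting the bucket setup, here is a rough sketch of the equivalent request. It assumes boto3's S3 `create_bucket` call and only builds the keyword arguments; the helper names `bucket_request` and `bucket_arn` are made up for this example.

```python
def bucket_request(bucket_name, region):
    """Build keyword arguments for boto3's S3 `create_bucket` call."""
    kwargs = {
        "Bucket": bucket_name,
        # "BucketOwnerEnforced" is the API value that corresponds to
        # "ACLs disabled" in the console.
        "ObjectOwnership": "BucketOwnerEnforced",
    }
    # us-east-1 is the default region and must not be sent as a
    # LocationConstraint; every other region has to be named explicitly.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def bucket_arn(bucket_name):
    """Return the ARN of an S3 bucket, as used in IAM policy resources."""
    return f"arn:aws:s3:::{bucket_name}"
```

With a real client this would be used as `boto3.client("s3", region_name=region).create_bucket(**bucket_request(name, region))`.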
Go to Amazon Athena > Administration > Workgroups.
Click on Create Workgroup.
Fill in Workgroup name. Recommended name here is: decube-athena-workgroup.
Select Athena SQL as Analytics engine.
Select Manual for Upgrade query engine.
Select Athena engine version 3 as Query engine version.
For Authentication, select AWS Identity and Access Management (IAM).
For Query result configuration, specifically Location of query result, fill in the location of the bucket from the previous section. If you are following the naming convention, it would be s3://decube-athena-query-results.
Click on Create workgroup.
Take note of the name of the workgroup.
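The workgroup settings above map onto a single API request. The sketch below assumes boto3's Athena `create_work_group` call and only assembles its arguments; `workgroup_request` is a hypothetical helper name. The authentication choice from the console is not represented here.

```python
def workgroup_request(name, bucket_name):
    """Build keyword arguments for boto3's Athena `create_work_group` call."""
    return {
        "Name": name,
        "Configuration": {
            # Where Athena writes query results (the bucket created earlier).
            "ResultConfiguration": {"OutputLocation": f"s3://{bucket_name}"},
            # Pinning a version corresponds to choosing "Manual" upgrades.
            "EngineVersion": {"SelectedEngineVersion": "Athena engine version 3"},
        },
    }
```

With a real client this would be used as `boto3.client("athena").create_work_group(**workgroup_request("decube-athena-workgroup", "decube-athena-query-results"))`.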
Go to IAM > Users and search for the user previously created for Decube to ingest Glue and click on it.
On the Permissions tab, click on Add permissions > Create inline policy.
On the Policy editor tab, click on JSON.
Click Next.
Copy and paste these policies into the form provided, changing the Resource block to match your own bucket and workgroup.
Click Next. We recommend naming the policy decube-athena-s3. Finally, click on Create policy.
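The policy text itself is not reproduced on this page, so the following is only an illustrative sketch of what an inline policy scoped to the workgroup and results bucket might look like. The list of actions is an assumption based on common Athena and S3 usage, not the policy Decube supplies, and the function name and its parameters are hypothetical. Use the policy provided by Decube where it differs.

```python
import json


def athena_s3_policy(bucket_name, workgroup, region, account_id):
    """Return an example IAM policy document (assumed actions, see note above)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Assumed set of Athena actions for running queries.
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                    "athena:StopQueryExecution",
                    "athena:GetWorkGroup",
                ],
                "Resource": f"arn:aws:athena:{region}:{account_id}:workgroup/{workgroup}",
            },
            {
                # Assumed set of S3 actions on the query results bucket.
                "Effect": "Allow",
                "Action": [
                    "s3:GetBucketLocation",
                    "s3:ListBucket",
                    "s3:GetObject",
                    "s3:PutObject",
                ],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder region and account ID; replace with your own values.
    doc = athena_s3_policy(
        "decube-athena-query-results", "decube-athena-workgroup",
        "us-east-1", "123456789012",
    )
    print(json.dumps(doc, indent=2))
```

The printed JSON can be pasted into the policy editor, or passed as `PolicyDocument=json.dumps(doc)` to boto3's IAM `put_user_policy` call together with `PolicyName="decube-athena-s3"`.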
This section is applicable if you intend to view lineages from your AWS Glue jobs. OpenLineage is an open framework for data lineage collection and analysis. At its core is an extensible specification that systems can use to interoperate with lineage metadata.
In the Job details tab, navigate to Advanced properties → Libraries → Dependent Jars path
Ensure you select the version for Scala 2.12, as Glue Spark is compiled with Scala 2.12, and version 2.13 won't be compatible.
On the page, for the specific OpenLineage version for Scala 2.12, copy the URL of the jar file from the Files row and use it in Glue.
Alternatively, upload the jar to an S3 bucket and use its URL. The URL should use the s3 scheme: s3://<your bucket>/path/to/openlineage-spark_2.12-<version>.jar
In the same Job details tab, add a new property under Job parameters:
Use the format param1=value1 --conf param2=value2 ... --conf paramN=valueN.
Make sure every parameter except the first has an extra --conf in front of it.
Example: spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=http --conf spark.openlineage.transport.url=https://integrations.<Region>.decube.io --conf spark.openlineage.transport.endpoint=/integrations/openlineage/webhook/<webhook-uuid> --conf spark.openlineage.transport.auth.type=api_key --conf spark.openlineage.transport.auth.apiKey=<webhook-key>
Add the --user-jars-first parameter and set its value to true
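Because the value format is easy to get wrong by hand, a small helper can assemble it. This is a sketch of the joining rule described above; `build_conf_value` is a hypothetical name, and the placeholders in the usage example are the same ones used in the example value and must be replaced with your own.

```python
def build_conf_value(params):
    """Join Spark properties into one Glue job parameter value.

    The first property is written as key=value; every later property is
    prefixed with " --conf ", matching the format described above.
    """
    parts = [f"{key}={value}" for key, value in params.items()]
    return " --conf ".join(parts)


if __name__ == "__main__":
    value = build_conf_value({
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": "https://integrations.<Region>.decube.io",
        "spark.openlineage.transport.endpoint": "/integrations/openlineage/webhook/<webhook-uuid>",
        "spark.openlineage.transport.auth.type": "api_key",
        "spark.openlineage.transport.auth.apiKey": "<webhook-key>",
    })
    print(value)
```

Dictionaries preserve insertion order in Python 3.7 and later, so the properties appear in the order they are written.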
To confirm that OpenLineage registration has been successful, check the logs for the following entry:
If you see this log message, it indicates that OpenLineage has been correctly registered with your AWS Glue job.
Enter the access key, secret key, and region in the connection form, then test the connection. If the test succeeds, you can add a name and connect the data source.
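Before filling in the form, it can help to confirm that the credentials can reach Glue at all. The sketch below is one way to do that and is not how Decube itself tests the connection. It assumes `glue` is a boto3 Glue client, for example `boto3.client("glue", region_name=..., aws_access_key_id=..., aws_secret_access_key=...)`, and the function name is hypothetical.

```python
def can_list_glue_databases(glue):
    """Return True if the client can list Glue databases, else False.

    `glue` is expected to be a boto3 Glue client built with the access key,
    secret key, and region that will be entered in the connection form.
    """
    try:
        # One result is enough to show the credentials and region work.
        glue.get_databases(MaxResults=1)
    except Exception as exc:  # broad on purpose: any failure means "not ready"
        print(f"Glue check failed: {exc}")
        return False
    return True
```

A `False` result usually points to a wrong region, a mistyped key, or a missing policy on the IAM user.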
AWS Glue does not natively support data quality monitoring. To address this, Decube uses AWS Athena, a serverless, interactive query service, to query the data that Glue produces and stores in Amazon S3. Decube therefore requires additional policies to be attached to the IAM user created earlier.
Complete the IAM user setup before starting this section.
Follow below steps to :
Use the URL directly from