# AWS Glue

## Supported Capabilities

{% tabs %}
{% tab title="Supported Capabilities" %}
**General**

* **Metadata** — metadata extraction and display of asset information (tables, columns, schemas). Types collected: Schema, Table, Column, Data Job, Data Run, Data Task
* **Profiling \*** — data profiling on the Profiler tab
* **Preview \*** — sample data preview
* **Data Quality \*** — data quality monitoring and observability
* **Configurable Collection** — selective ingestion of schemas/workspaces in Data Source Management

**Data Quality Monitors**

* Freshness \*
* Volume \*
* Field Health \*
* Custom SQL \*
* Schema Drift \*

\*Only available when AWS Athena compute is selected.
{% endtab %}

{% tab title="Not Supported" %}
**General**

* External Table
* View Table
* Stored Procedure

**Data Quality Monitors**

* Job Failure
{% endtab %}
{% endtabs %}

## Minimum Requirement

To connect AWS Glue to decube, provide the following information.

Choose an authentication method:

a. [**AWS Identity**](#a.-aws-role):

* Select AWS Identity
* Customer AWS Role ARN
* Region
* Enable Athena (Optional). Read more in this [section](#enable-athena-for-data-observability). If Athena is enabled, also provide:
  * Workgroup
  * Bucket Name
* Data source name

b. **AWS Access Key**:

* Access Key ID
* Secret Access Key
* Region
* Enable Athena (Optional). Read more in this [section](#enable-athena-for-data-observability). If Athena is enabled, also provide:
  * Workgroup
  * Bucket Name
* Data source name

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-301a75f11f81290a9a74c63050d41479c8ace394%2Fimage.png?alt=media" alt=""><figcaption><p>AWS Glue</p></figcaption></figure>

## Connection Options:

### a. AWS Role

{% hint style="info" %}
This section creates a Customer AWS Role within your AWS account that has the right set of permissions to access your data sources.
{% endhint %}

* Step 1: Go to your AWS Account → IAM Module → Roles
* Step 2: Click on **Create role**

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-bc6d8296019ea69d4e3edd1cd421cdb472d2a77b%2Fimage%20(444).png?alt=media" alt=""><figcaption></figcaption></figure>

* Step 3: Choose **Custom trust policy**

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-c6ecfb6e4b9ba1639d62502abe968593c95482da%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

* Step 4: Specify the following as the trust policy, replacing `DECUBE-AWS-IDENTITY-ARN` and `EXTERNAL-ID` with values from [#generating-a-decube-aws-identity](https://docs.decube.io/security-and-connectivity/aws-identities#generating-a-decube-aws-identity "mention")

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "<DECUBE-AWS-IDENTITY-ARN>"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "<EXTERNAL-ID>"
                }
            }
        }
    ]
}
```

* Step 5: Click **Next** to proceed to attaching a policy
* Step 6: Click on **Attach policies directly**, search for `AWSGlueServiceRole`, and add this policy

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-b897a4646a8f1d082a17912f7dcfaafbe07e52e8%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

* Step 7: Click **Next** and specify a role name. This documentation assumes the name `CustomerAWSRole`, but you can set it to any value.
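The trust policy from Step 4 can also be generated rather than edited by hand, which avoids leaving a placeholder unfilled. The sketch below is only an illustration: `build_trust_policy` is a hypothetical helper name, and the ARN and external ID passed to it are made-up sample values, not real Decube identifiers.

```python
import json


def build_trust_policy(decube_identity_arn: str, external_id: str) -> dict:
    """Return the Step 4 trust policy with both placeholders filled in."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": decube_identity_arn},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"sts:ExternalId": external_id}
                },
            }
        ],
    }


# Sample values only; use the ones generated for your Decube AWS Identity.
policy = build_trust_policy(
    "arn:aws:iam::111122223333:role/example-decube-identity",
    "example-external-id",
)
print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the **Custom trust policy** editor.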

### b. AWS IAM User

* Step 1: Log in to the AWS Console and go to IAM > Users > Create user

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-e9dee9adfac4da63dd032e91f338f5b4d4824621%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

* Step 2: Click on **Attach policies directly** and search for `AWSGlueServiceRole`

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-a173a1a40783d6684bec224fc6913d5b475bf5e3%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

* Step 3: Review and create your user

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-9c6e173c8133d4c6ad9d0758e7b0a41a8ab9c7be%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

* Step 4: Navigate to the newly created user and click on `Create access key`

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-cac6e3821dd5ed2dd8a0453efd9962878d0d4125%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

* Step 5: Choose `Application running outside AWS`

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-10149a602ca6ea53030d5a5a85d9abc26d103f30%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

* Step 6: Save the provided access key ID and secret access key. You will not be able to retrieve the secret access key again.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-f1f1d0cc0396846e9fea891a2984da3517a3e069%2Fimage%20(62)%20(1).png?alt=media" alt=""><figcaption></figcaption></figure>

## Enable Athena for Data Observability

{% hint style="info" %}
This section applies if you intend to enable monitors on your AWS Glue source. This includes monitors on Iceberg tables, because AWS Athena is required to query Iceberg tables.
{% endhint %}

AWS Glue does not natively support data quality monitoring. To address this, Decube uses AWS Athena, a serverless interactive query service, to query the data produced by Glue and stored in Amazon S3. For this reason, Decube requires additional policies to be attached to the IAM user created in the [AWS IAM User](#aws-iam-user) section.

### Configuring AWS Athena

You will need to set up the following:

1. An S3 bucket to store Athena query results
2. An Athena workgroup
3. (Optional) An Athena data source name

Athena saves query results in an S3 bucket. The location of this bucket is referenced in one of the policies in a later section. The Athena workgroup is also required by some of the policies attached to the IAM user.

### Creating an S3 bucket

1. Go to `S3` > `Bucket` and click on `Create bucket`.
2. For bucket name, we suggest `decube-athena-query-results`.
3. For `Object Ownership`, select `ACLs disabled`.
4. Click on `Create bucket`.
5. Take note of the ARN for the bucket. The following sections refer to the bucket as `decube-athena-query-results` when setting up Athena.
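The bucket name is reused in three forms in the later sections: the bucket ARN, the ARN covering its objects, and the `s3://` location entered in the workgroup. As a rough sketch, the hypothetical helper below derives all three from the name. It assumes the standard `aws` partition; other partitions use a different ARN prefix.

```python
def bucket_references(bucket_name: str) -> dict:
    """Derive the ARN and URI forms of an S3 bucket name.

    Assumes the standard "aws" partition.
    """
    return {
        # Used in the "Resource" block of the IAM policy (bucket level).
        "bucket_arn": f"arn:aws:s3:::{bucket_name}",
        # Used in the "Resource" block of the IAM policy (object level).
        "objects_arn": f"arn:aws:s3:::{bucket_name}/*",
        # Used as the query result location in the Athena workgroup.
        "query_result_location": f"s3://{bucket_name}",
    }


refs = bucket_references("decube-athena-query-results")
for name, value in refs.items():
    print(f"{name}: {value}")
```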

### Creating an Athena Workgroup

1. Go to `Amazon Athena` > `Administration` > `Workgroups`
2. Click on `Create Workgroup`

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-e2dc978223083ab2c27f1219f6a4475322b0557e%2Fimage%20(1)%20(1)%20(2).png?alt=media" alt=""><figcaption><p>Create workgroup details.</p></figcaption></figure>

3. Fill in the workgroup name. We recommend `decube-athena-workgroup`.
4. Select `Athena SQL` as `Analytics engine`.
5. Select `Manual` for `Upgrade query engine`.
6. Select `Athena engine version 3` as `Query engine version`.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-5e9712f974b11f78e365b91019e3130be5ffd48a%2Fimage.png?alt=media" alt=""><figcaption><p>Select AWS IAM and add location of query result.</p></figcaption></figure>

7. For Authentication, select `AWS Identity and Access Management (IAM)`.
8. For `Query result configuration`, under `Location of query result`, fill in the location of the bucket from the previous section. If you followed the naming convention, it is `s3://decube-athena-query-results`.
9. Click on `Create workgroup`.
10. Take note of the name of the workgroup.

### Adding policies to IAM User

{% hint style="warning" %}
Complete the steps in the [previous section](#aws-iam-user) to set up the IAM user before starting this section.
{% endhint %}

1. Go to IAM > Users and search for the user previously created for Decube to ingest Glue and click on it.
2. On the `Permissions` tab, click on `Add permissions` > `Create inline policy`.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-62a80e76ef773cde72d967c0d9422d185ce04f0d%2Fpermissions.png?alt=media" alt=""><figcaption><p>Go to Create inline policy.</p></figcaption></figure>

3. On the `Policy editor` tab, click on `JSON`. Click `Next`.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-b85aea3205052f048eba396d7030e0c11b89950a%2Fimage.png?alt=media" alt=""><figcaption><p>Select JSON</p></figcaption></figure>

4. Copy and paste the policy below into the form provided, **replacing the entries in each `Resource` block** with your own values. For example, with account ID `1234567`, the Athena entries would be `arn:aws:athena:*:1234567:datacatalog/*` and `arn:aws:athena:*:1234567:workgroup/decube-athena-workgroup`. In the first statement, list every bucket to be monitored by Athena, for example `arn:aws:s3:::decube-glue_results/*` and `arn:aws:s3:::decube-glue_results`. In the second statement, use the ARN of the Athena query results bucket.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DecubeAthenaS3Ingest",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetBucketLocation",
                "athena:GetTableMetadata",
                "athena:StartQueryExecution",
                "athena:GetQueryResults",
                "athena:GetDatabase",
                "athena:GetDataCatalog",
                "athena:ListQueryExecutions",
                "athena:GetWorkGroup",
                "athena:StopQueryExecution",
                "athena:GetQueryResultsStream",
                "athena:ListDatabases",
                "athena:GetQueryExecution",
                "athena:ListTableMetadata",
                "athena:BatchGetQueryExecution"
            ],
            "Resource": [
                "arn:aws:athena:*:{account id}:datacatalog/{specify a data catalog or *}",
                "arn:aws:athena:*:{account id}:workgroup/{workgroup_name or *}",
                "arn:aws:s3:::{bucket to be monitored by Athena}/*",
                "arn:aws:s3:::{bucket to be monitored by Athena}"
            ]
        },
        {
            "Sid": "DecubeS3AthenaOutput",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucketMultipartUploads",
                "s3:AbortMultipartUpload",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": [
                "arn:aws:s3:::decube-athena-query-results/*",
                "arn:aws:s3:::decube-athena-query-results"
            ]
        }
    ]
}
```
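Because the IAM policy editor only accepts strict JSON, it can help to generate the policy with the `Resource` entries already filled in. The following is a minimal sketch under that assumption: `build_athena_policy` and its parameter names are hypothetical, the action lists are copied from the policy above, and the account ID and bucket names in the usage lines are sample values.

```python
import json

INGEST_ACTIONS = [
    "s3:GetObject", "s3:GetBucketLocation", "athena:GetTableMetadata",
    "athena:StartQueryExecution", "athena:GetQueryResults",
    "athena:GetDatabase", "athena:GetDataCatalog",
    "athena:ListQueryExecutions", "athena:GetWorkGroup",
    "athena:StopQueryExecution", "athena:GetQueryResultsStream",
    "athena:ListDatabases", "athena:GetQueryExecution",
    "athena:ListTableMetadata", "athena:BatchGetQueryExecution",
]
OUTPUT_ACTIONS = [
    "s3:PutObject", "s3:GetObject", "s3:ListBucketMultipartUploads",
    "s3:AbortMultipartUpload", "s3:ListBucket", "s3:GetBucketLocation",
    "s3:ListMultipartUploadParts",
]


def build_athena_policy(account_id, workgroup, monitored_buckets,
                        results_bucket, data_catalog="*"):
    """Fill in the Resource blocks of the inline policy shown above."""
    ingest_resources = [
        f"arn:aws:athena:*:{account_id}:datacatalog/{data_catalog}",
        f"arn:aws:athena:*:{account_id}:workgroup/{workgroup}",
    ]
    # Each monitored bucket needs an object-level and a bucket-level ARN.
    for bucket in monitored_buckets:
        ingest_resources.append(f"arn:aws:s3:::{bucket}/*")
        ingest_resources.append(f"arn:aws:s3:::{bucket}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DecubeAthenaS3Ingest",
                "Effect": "Allow",
                "Action": INGEST_ACTIONS,
                "Resource": ingest_resources,
            },
            {
                "Sid": "DecubeS3AthenaOutput",
                "Effect": "Allow",
                "Action": OUTPUT_ACTIONS,
                "Resource": [
                    f"arn:aws:s3:::{results_bucket}/*",
                    f"arn:aws:s3:::{results_bucket}",
                ],
            },
        ],
    }


# Sample values only; replace them with your account ID and bucket names.
policy = build_athena_policy(
    account_id="1234567",
    workgroup="decube-athena-workgroup",
    monitored_buckets=["decube-glue_results"],
    results_bucket="decube-athena-query-results",
)
print(json.dumps(policy, indent=4))
```

Serializing with `json.dumps` guarantees the output has no comments or trailing commas, both of which the policy editor rejects.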

5. Click `Next`. We recommend naming the policy `decube-athena-s3`. Finally, click on `Create policy`.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-17476df236bd2935065f6c59a70ec53b2e0b62c9%2Fpolicy%20name.png?alt=media" alt=""><figcaption><p>Create policy name.</p></figcaption></figure>

## OpenLineage with AWS Glue

This section is applicable if you intend to view lineages from your AWS Glue jobs. OpenLineage is an open framework for data lineage collection and analysis. At its core is an extensible specification that systems can use to interoperate with lineage metadata.

Follow the steps below to [enable OpenLineage on AWS Glue](https://openlineage.io/docs/integrations/spark/quickstart/quickstart_glue/):

1. **Specify the OpenLineage JAR URL**

* In the **Job details** tab, navigate to **Advanced properties** → **Libraries** → **Dependent Jars path**

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-5ab7b62261ea2fc3cc7534e8f5452203926bc9b2%2Fimage%20(2)%20(3).png?alt=media" alt=""><figcaption></figcaption></figure>

* Use the URL directly from [**Maven Central openlineage-spark**](https://mvnrepository.com/artifact/io.openlineage/openlineage-spark)
* **Select the build for Scala 2.12.** Glue Spark is compiled with Scala 2.12, so the Scala 2.13 build is not compatible.
* On the page for the OpenLineage version you want (Scala 2.12 build), copy the URL of the jar file from the **Files** row and use it in Glue.
* **Alternatively**, upload the jar to an **S3 bucket** and use its URL. The URL should use the `s3` scheme: `s3://<your bucket>/path/to/openlineage-spark_2.12-<version>.jar`

2. **Add OpenLineage configuration in Job Parameters**

   In the same **Job details** tab, add a new property under **Job parameters**:

   * Use the format **`param1=value1 --conf param2=value2 ... --conf paramN=valueN`**.
   * Make sure every parameter except the first has an extra **`--conf`** in front of it.
   * Example: `spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=http --conf spark.openlineage.transport.url=https://integrations.<Region>.decube.io --conf spark.openlineage.transport.endpoint=/integrations/openlineage/webhook/<webhook-uuid> --conf spark.openlineage.transport.auth.type=api_key --conf spark.openlineage.transport.auth.apiKey=<webhook-key>`
3. **Set User Jars First Parameter**

* Add the `--user-jars-first` parameter and set its value to `true`.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-891c8daec4a3d7edb31aafc72c5e19e3f65cfdc9%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>
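The `param1=value1 --conf param2=value2` format from step 2 is easy to get wrong by hand, since every setting except the first needs a leading `--conf`. As an illustrative sketch, the hypothetical helper below assembles the value from a dictionary of Spark settings. The setting names are taken from the example in step 2, and the `<Region>`, `<webhook-uuid>`, and `<webhook-key>` placeholders are left for you to fill in.

```python
def build_conf_value(settings: dict) -> str:
    """Join Spark settings into the single-string job parameter format.

    The first pair has no prefix; every later pair is preceded by "--conf".
    """
    pairs = [f"{key}={value}" for key, value in settings.items()]
    return " --conf ".join(pairs)


# Placeholders in angle brackets must be replaced with your own values.
openlineage_settings = {
    "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
    "spark.openlineage.transport.type": "http",
    "spark.openlineage.transport.url": "https://integrations.<Region>.decube.io",
    "spark.openlineage.transport.endpoint": "/integrations/openlineage/webhook/<webhook-uuid>",
    "spark.openlineage.transport.auth.type": "api_key",
    "spark.openlineage.transport.auth.apiKey": "<webhook-key>",
}
print(build_conf_value(openlineage_settings))
```

Python dictionaries preserve insertion order, so the settings appear in the output in the order they are written.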

## Verification

* To confirm that OpenLineage registration has been successful, check the logs for the following entry:

```
INFO SparkContext: Registered listener io.openlineage.spark.agent.OpenLineageSparkListener
```

* If you see this log message, it indicates that OpenLineage has been correctly registered with your AWS Glue job.
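If you have downloaded the job logs as text, the check above can be scripted. This is a small sketch, assuming the registration entry appears on a single line of the log. `listener_registered` is a hypothetical helper name, and the sample text is the entry shown above.

```python
LISTENER = "io.openlineage.spark.agent.OpenLineageSparkListener"


def listener_registered(log_text: str) -> bool:
    """Return True if any log line reports the OpenLineage listener."""
    return any(
        "Registered listener" in line and LISTENER in line
        for line in log_text.splitlines()
    )


sample_log = (
    "INFO SparkContext: Registered listener "
    "io.openlineage.spark.agent.OpenLineageSparkListener"
)
print(listener_registered(sample_log))
```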

Finally, enter the access key ID, secret access key, and region in the connection form, then test the connection. If the test succeeds, add a data source name and connect to the data source.
