# AWS S3

## Supported Capabilities

{% tabs %}
{% tab title="Supported Capabilities" %}
**General**

* **Metadata** — metadata extraction and display of asset information (tables, columns, schemas). Types collected: Schema, Table, Column
* **Preview** — sample data preview
{% endtab %}

{% tab title="Not Supported" %}
**General**

* Profiling
* Data Quality
* Configurable Collection
* External Table
* View Table
* Stored Procedure
{% endtab %}
{% endtabs %}

To connect your AWS S3 data lake to Decube, we will need the following information.

Choose an authentication method:

a. [**AWS Identity**](#a.-aws-roles):

* Select AWS Identity
* Customer AWS Role ARN
* Region
* Path Specs
* Data source name

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-facec369126c3aa1be54eb13df34282951def39d%2Fimage.png?alt=media" alt=""><figcaption><p>S3 Datalake using AWS Identity<br></p></figcaption></figure>

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-ef4e7ecf42080c5f926e1c8c033ff68b2306d89b%2Fimage.png?alt=media" alt=""><figcaption><p>Continuation of S3 Data Lake setup using AWS Identity</p></figcaption></figure>

b. [**AWS Access Key**](#b.-aws-iam-user):

* Access Key ID
* Secret Access Key
* Region
* Path Specs
* Data source name

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-07b5f13a4f4551559ad1cc33f6d0f285f405789b%2Fimage.png?alt=media" alt=""><figcaption><p>S3 Datalake using AWS Access Key</p></figcaption></figure>

## Connection Options

#### a. AWS Roles

{% hint style="info" %}
This section walks you through creating a **Customer AWS Role** in your AWS account with the right set of permissions to access your data sources.
{% endhint %}

* Step 1: Go to your AWS Account > IAM Module > Roles
* Step 2: Click on **Create role**

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-bc6d8296019ea69d4e3edd1cd421cdb472d2a77b%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

* Step 3: Choose **Custom trust policy**

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-c6ecfb6e4b9ba1639d62502abe968593c95482da%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

* Step 4: Specify the following as the trust policy, replacing `DECUBE-AWS-IDENTITY-ARN` and `EXTERNAL-ID` with values from [#generating-a-decube-aws-identity](https://docs.decube.io/security-and-connectivity/aws-identities#generating-a-decube-aws-identity "mention").

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "<DECUBE-AWS-IDENTITY-ARN>"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "<EXTERNAL-ID>"
                }
            }
        }
    ]
}
```

* Step 5: Click **Next** to proceed to attaching a policy.
* Step 6: Click **Create policy** and choose the **JSON** editor. Input the following policy (replacing `{bucket-name}` with the name of the bucket you want to connect), press **Next**, enter a policy name of your choice, and press **Create policy**.

```json
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "VisualEditor0",
			"Effect": "Allow",
			"Action": [
				"s3:GetObject",
				"s3:ListBucket",
				"s3:ListAllMyBuckets"
			],
			"Resource": [
				"arn:aws:s3:::{bucket-name}",
				"arn:aws:s3:::{bucket-name}/*"
			]
		}
	]
}
```
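
* Step 7: Back in the role creation flow, attach the policy you just created, give the role a name, and create it. The role's ARN is the **Customer AWS Role ARN** requested in the connection form.

For reference, this is roughly how a client such as Decube uses the role you just created. A minimal boto3 sketch, with a hypothetical role ARN and bucket name; note that only the Decube identity named in the trust policy is allowed to assume the role:

```python
import boto3

# Hypothetical values for illustration only.
ROLE_ARN = "arn:aws:iam::123456789012:role/decube-s3-role"
EXTERNAL_ID = "<EXTERNAL-ID>"  # the external ID from the Decube app

sts = boto3.client("sts")

# Assume the customer role; the ExternalId must match the trust policy.
creds = sts.assume_role(
    RoleArn=ROLE_ARN,
    RoleSessionName="decube-s3-session",
    ExternalId=EXTERNAL_ID,
)["Credentials"]

# Use the temporary credentials to access the bucket.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(s3.list_objects_v2(Bucket="bucket-name", MaxKeys=5))
```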

#### b. AWS IAM User

* Step 1: Log in to the AWS Console and proceed to IAM > Users > Create user

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-72c6607405d6ef0fd07aa6f6fbb13cd3b093f4c0%2FUntitled%20(1).png?alt=media" alt=""><figcaption></figcaption></figure>

* Step 2: Click **Create policy**, choose the **JSON** editor, and input the following policy (replacing `{bucket-name}` with the name of the bucket you want to connect). Press **Next**, enter a policy name of your choice, and press **Create policy**.

```json
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "VisualEditor0",
			"Effect": "Allow",
			"Action": [
				"s3:GetObject",
				"s3:ListBucket",
				"s3:ListAllMyBuckets"
			],
			"Resource": [
				"arn:aws:s3:::{bucket-name}",
				"arn:aws:s3:::{bucket-name}/*"
			]
		}
	]
}
```

* Step 3: Search for the policy you just created, select it, and press **Next**.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-63a8b92d16f2ae24e3999f027a0f05346693834a%2FUntitled%20(2).png?alt=media" alt=""><figcaption></figcaption></figure>

* Step 4: Press **Create user**

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-7dc32b5ec8f18563c4423a5527a0dbab0ce64f76%2FUntitled%20(3).png?alt=media" alt=""><figcaption></figcaption></figure>

* Step 5: Navigate to the newly created user and click on `Create access key`

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-ab35322a4d2d240f0ce9887b6a8378a51813dc3d%2FScreenshot%202023-10-27%20at%203.35.08%20PM.png?alt=media" alt=""><figcaption></figcaption></figure>

* Step 6: Choose `Application running outside AWS`

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-8f21befd15e4eee216d90ada034f5027fd2afda4%2FUntitled%20(4).png?alt=media" alt=""><figcaption></figcaption></figure>

* Step 7: Save the provided access key and secret access key; you will not be able to retrieve them again. (A quick way to verify the keys is shown in the sketch after the screenshot below.)

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-e3bf74ebb4be39411655b9cbbe271106c5b61329%2FUntitled%20(5).png?alt=media" alt=""><figcaption></figcaption></figure>
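
If you want to sanity-check the new keys before entering them into Decube, here is a minimal boto3 sketch (the bucket name and region are placeholders):

```python
import boto3

# Placeholders: paste the access key pair saved in Step 7.
s3 = boto3.client(
    "s3",
    aws_access_key_id="<ACCESS-KEY-ID>",
    aws_secret_access_key="<SECRET-ACCESS-KEY>",
    region_name="us-east-1",
)

# Both calls are covered by the policy attached above.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
for obj in s3.list_objects_v2(Bucket="bucket-name", MaxKeys=5).get("Contents", []):
    print(obj["Key"])
```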

#### AWS KMS

If the bucket you intend to connect to Decube is encrypted with a customer-managed KMS key, you will need to add the AWS IAM user created above to the key policy statement.

1. Log in to the AWS Console and proceed to AWS KMS > Customer-managed keys.
2. Find the key that was used to encrypt the AWS S3 bucket.
3. On the **Key policy** tab, click `Edit`.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-1673b13426cd4a536e5e70442e1b0c3bfe2d264d%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

4. Update the key policy, assuming the user created above is named `decube-s3-datalake` (replace `<AWSAccountID>` with your AWS account ID).

a. If there is no existing policy attached to the key:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Allow decube to use key",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<AWSAccountID>:user/decube-s3-datalake"
                ]
            },
            "Action": "kms:Decrypt",
            "Resource": "*"
        }
    ]
}
```

b. If there is an existing policy, append the following statement to the existing `Statement` array:

```json
{
    "Sid": "Allow decube to use key",
    "Effect": "Allow",
    "Principal": {
        "AWS": [
            "arn:aws:iam::<AWSAccountID>:user/decube-s3-datalake"
        ]
    },
    "Action": "kms:Decrypt",
    "Resource": "*"
}
```

5. Click `Save changes`.
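
To confirm the key policy took effect, you can fetch one SSE-KMS-encrypted object as the IAM user; S3 calls `kms:Decrypt` on your behalf, so a missing key policy entry surfaces as an access-denied error. A minimal boto3 sketch with placeholder names:

```python
import boto3

# Credentials of the IAM user added to the key policy (placeholders).
s3 = boto3.client(
    "s3",
    aws_access_key_id="<ACCESS-KEY-ID>",
    aws_secret_access_key="<SECRET-ACCESS-KEY>",
)

# GetObject on an SSE-KMS object triggers kms:Decrypt behind the scenes.
obj = s3.get_object(Bucket="bucket-name", Key="path/to/encrypted-file.csv")
print(obj["Body"].read(100))  # first 100 bytes, proving decryption worked
```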

## Path Specs

Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects, where each individual `path_spec` represents one or more datasets. A path specification is the formatted path to the dataset in your S3 bucket that Decube uses to ingest and catalog the data.

The provided path specification MUST end with `*.*` or `*.[ext]` to represent the leaf (file) level, where `*` matches a single file or folder name. If `*.[ext]` is provided, only files with the specified extension will be scanned; "`.[ext]`" can be any of the supported file types listed below.

Each `path_spec` represents only one file type (e.g., only `*.csv` or only `*.parquet`). To ingest multiple file types, add multiple `path_spec` entries (see Example 4 below).

* **SingleFile path spec**: a path spec without `{table}` (each matching file becomes a dataset).
* **MultiFile path spec**: a path spec with `{table}` (each matching folder becomes a dataset).

**Include only datasets that match this pattern**: If the path spec contains `{table}` and a regex is provided, only datasets whose names match the regex will be included.
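
The connector's matching logic is internal to Decube, but the following sketch approximates the semantics described above: `*` matches a single file or folder name, `{table}` captures the folder that becomes the dataset name, and an optional regex filters the captured names. The `match_path_spec` helper is hypothetical, for illustration only:

```python
import re

def match_path_spec(path_spec, key, table_filter=None):
    """Return the dataset (table) name for a matching S3 key, else None."""
    pattern = re.escape(path_spec)
    pattern = pattern.replace(r"\{table\}", r"(?P<table>[^/]+)")  # named capture
    pattern = pattern.replace(r"\*", r"[^/]+")  # one path segment
    m = re.fullmatch(pattern, key)
    if not m:
        return None
    # SingleFile path specs have no {table}; the file itself is the dataset.
    table = m.groupdict().get("table") or key.rsplit("/", 1)[-1]
    if table_filter and not re.search(table_filter, table):
        return None  # filtered out by "include only datasets that match"
    return table

print(match_path_spec("s3://test-bucket/{table}/*.csv",
                      "s3://test-bucket/offers/1.csv"))   # -> offers
print(match_path_spec("s3://test-bucket/*.csv",
                      "s3://test-bucket/employees.csv"))  # -> employees.csv
```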

### File Format Settings

* **CSV** (illustrated in the sketch after this list)
  * `delimiter` (default: `,`)
  * `escape_char` (default: `\\`)
  * `quote_char` (default: `"`)
  * `has_headers` (default: true)
  * `skip_n_line` (default: 0)
  * `file_encoding` (default: UTF-8; supported: ASCII, UTF-8, UTF-16, UTF-32, Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN, EUC-JP, SHIFT\_JIS, CP932, ISO-2022-JP, EUC-KR, ISO-2022-KR, Johab, KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251, MacRoman, ISO-8859-7, windows-1253, ISO-8859-8, windows-1255, TIS-620)
* **Parquet**
  * No options
* **JSON/JSONL**
  * `file_encoding` (default: UTF-8; see above for supported encodings)
* **Delta Table**
  * When selecting Format = `Delta Table`, the path spec MUST include the named token `{table}`. The connector expects the `{table}` token in the path spec so it can discover Delta table roots. No per-file options are required.
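
As a rough illustration of how the CSV settings map onto a typical parser, here is a pandas sketch using the defaults listed above (the file path is a placeholder, and this is not the connector's actual implementation):

```python
import pandas as pd

# Each keyword mirrors one CSV setting above (shown with its default value).
df = pd.read_csv(
    "s3://test-bucket/employees.csv",  # reading s3:// paths requires s3fs
    sep=",",           # delimiter
    escapechar="\\",   # escape_char
    quotechar='"',     # quote_char
    header=0,          # has_headers: use header=None for headerless files
    skiprows=0,        # skip_n_line
    encoding="utf-8",  # file_encoding
)
print(df.head())
```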

**Additional points to note**

* Folder names should not contain `{`, `}`, `*`, or `/`.
* The named variable `{folder}` is reserved for internal use; please do not use it as a named variable.

### Example Path Specs

#### Example 1 - Individual file as Dataset (SingleFile pathspec)

Bucket structure:

```
test-bucket
├── employees.csv
├── departments.json
└── food_items.csv
```

Path specs config to ingest `employees.csv` and `food_items.csv` as datasets:

```yaml
path_specs:
    - s3://test-bucket/*.csv
```

This will automatically ignore the `departments.json` file. To include it, use `*.*` instead of `*.csv`.

#### Example 2 - Folder of files as Dataset (without Partitions)

Bucket structure:

```
test-bucket
└── offers
     ├── 1.csv
     └── 2.csv
```

Path specs config to ingest folder `offers` as dataset:

```yaml
path_specs:
    - s3://test-bucket/{table}/*.csv
```

`{table}` represents the folder for which a dataset will be created.

#### Example 3 - Folder of files as Dataset (with Partitions)

Bucket structure:

```
test-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── returns
    └── year=2021
        └── month=2
            └── 1.parquet

```

Path specs config to ingest folders `orders` and `returns` as datasets:

```yaml
path_specs:
    - s3://test-bucket/{table}/*/*/*.parquet
```

#### Example 4 - Advanced - Either Individual file OR Folder of files as Dataset

Bucket structure:

```
test-bucket
├── customers
│   ├── part1.json
│   ├── part2.json
│   ├── part3.json
│   └── part4.json
├── employees.csv
├── food_items.csv
├── tmp_10101000.csv
└── orders
    └── year=2022
        └── month=2
            ├── 1.parquet
            ├── 2.parquet
            └── 3.parquet

```

Path specs config:

```yaml
path_specs:
    - path_spec_1: s3://test-bucket/*.csv
    - path_spec_2: s3://test-bucket/{table}/*.json
    - path_spec_3: s3://test-bucket/{table}/*/*/*.parquet

```

The above config has 3 path\_specs and will ingest the following datasets:

* `employees.csv` - Single File as Dataset
* `food_items.csv` - Single File as Dataset
* `customers` - Folder as Dataset
* `orders` - Folder as Dataset

**Valid path\_specs.include**

```
s3://my-bucket/foo/tests/bar.csv # single file table
s3://my-bucket/foo/tests/*.* # multiple file-level tables
s3://my-bucket/foo/tests/{table}/*.parquet # table without partition
s3://my-bucket/foo/tests/{table}/*/*.csv # table where partitions are not specified
s3://my-bucket/foo/tests/{table}/*.* # table where neither partitions nor file type are specified
s3://my-bucket/{dept}/tests/{table}/*.parquet # specifying keywords to be used in display name
```

#### Example 5 - Delta Table (S3)

For Delta Table support, include `{table}` in the path. A simple example path spec for S3:

```yaml
path_specs:
    - s3://my-bucket/{table}/
```

The connector will interpret `{table}` as the table root for each Delta table.
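
For instance, a bucket laid out as follows (illustrative) would be discovered as a single `events` table; a Delta table root is recognizable by its `_delta_log/` directory:

```
my-bucket
└── events
    ├── _delta_log
    │   ├── 00000000000000000000.json
    │   └── 00000000000000000001.json
    ├── part-00000.snappy.parquet
    └── part-00001.snappy.parquet
```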

#### Supported file types

* CSV (`*.csv`)
* JSON (`*.json`)
* JSONL (`*.jsonl`)
* Parquet (`*.parquet`)
* Delta (`Delta Table`)

### Notes

* Data Quality Monitoring is no longer supported for S3 sources. Only cataloging is available.
* For advanced dataset structures, add multiple `path_spec` entries as needed, each with its own file type and settings.

