AWS S3

Connect your S3 to see your S3 datasets and files within the Catalog.

Supported Capabilities

Capability

Supported

Metadata Extraction

✅

Metadata Types Collected

Schema, Table, Column

Data Profiling

❌

Data Preview

✅

Data Quality

❌

Configurable Collection

❌

External Table

❌

View Table

❌

Stored Procedure

❌

Minimum Requirement

To connect your AWS S3 to Decube, we will need the following information.

Choose authentication method:

a. AWS Identity:

Select AWS Identity
Customer AWS Role ARN
Region
Path Specs
Data source name

b. AWS Access Key:

Access Key ID
Secret Access Key
Region
Path Specs
Data source name

Connection Options:

a. AWS Roles

This section will create a Customer AWS Role within your AWS account that has the right set of permission to access your data sources.

Step 1: Go to your AWS Account > IAM Module > Roles
Step 2: Click on Create role

Step 3: Choose Custom trust policy

Step 4: Specify the following as the trust policy, replacing DECUBE-AWS-IDENTITY-ARN and EXTERNAL-ID with values from Generating a Decube AWS Identity.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "<DECUBE-AWS-IDENTITY-ARN>"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "<EXTERNAL-ID>"
                }
            }
        }
    ]
}

Step 5: Click next to proceed to attach policy.
Step 6: Click on Attach Policies and Create Policy and choose JSON Editor. Input the following policy and press next, input the policy name of your choice and press Create Policy.

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "VisualEditor0",
			"Effect": "Allow",
			"Action": [
				"s3:GetObject",
				"s3:ListBucket",
				"s3:ListAllMyBuckets"
			],
			"Resource": [
				"arn:aws:s3:::{bucket-name}",
				"arn:aws:s3:::{bucket-name}/*"
			]
		}
	]
}

b. AWS IAM User

Step 1: Login to AWS Console and proceed to IAM > User > Create User

Step 2: Click on Attach Policies and Create Policy and choose JSON Editor input the following policy and press next, input the policy name of your choice and press Create Policy

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "VisualEditor0",
			"Effect": "Allow",
			"Action": [
				"s3:GetObject",
				"s3:ListBucket",
				"s3:ListAllMyBuckets"
			],
			"Resource": [
				"arn:aws:s3:::{bucket-name}",
				"arn:aws:s3:::{bucket-name}/*"
			]
		}
	]
}

Step 3: Search for the policy you created just now, select it and press Next.

Step 4: Press Create user

Step 5: Navigate to the newly created user and click on Create access key

Step 6: Choose Application running outside AWS

Step 7: Save the provided access key and secret access key. You will not be able to retrieve these keys again

AWS KMS

If the bucket intended to be connected to Decube is encrypted using a customer managed KMS key, you will need to add the AWS IAM user created above to the key policy statement.

Login to AWS Console and proceed to AWS KMS > Customer-managed keys.
Find the key that was used to encrypt the AWS S3 bucket.
On the Key policy tab, click on Edit

Assuming the user created is decube-s3-datalake

a. If there is not an existing policy attached to the key

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Allow decube to use key",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<AWSAccountID>:user/{decube-s3-datalake}"
                ]
            },
            "Action": "kms:Decrypt",
            "Resource": "*"
        }
    ]
}

b. If there is an existing policy, append this section to the Statement array:

{
    "Statement": [
        {
            "Sid": "Allow decube to use key",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<AWSAccountID>:user/{decube-s3-datalake}"
                ]
            },
            "Action": "kms:Decrypt",
            "Resource": "*"
        }
    ]
}

Save Changes

Path Specs

Path Specs (path_specs) is a list of Path Spec (path_spec) objects where each individual path_spec represents one or more datasets. Providing a path specification represents the formatted path to the dataset in your S3 bucket which Decube will use to ingest and catalog the data.

The provided path specification MUST end with *.* or *.[ext] to represent leaf level. (Note that here '*' is not a wildcard symbol). If *.[ext] is provided then files with only specified extension type will be scanned. ".[ext]" can be any of the supported file types listed below.

Each path_spec represents only one file type (e.g., only *.csv or only *.parquet). To ingest multiple file types, add multiple path_spec entries.

SingleFile pathspec: PathSpec without {table} (targets individual files).MultiFile pathspec: PathSpec with {table} (targets folders as datasets).

Include only datasets that match this pattern: If the path spec specifies a table, and the regex is provided, only datasets that match the regex will be included.

File Format Settings

CSV
- delimiter (default: ,)
- escape_char (default: \\)
- quote_char (default: ")
- has_headers (default: true)
- skip_n_line (default: 0)
- file_encoding (default: UTF-8; supported: ASCII, UTF-8, UTF-16, UTF-32, Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN, EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP, EUC-KR, ISO-2022-KR, Johab, KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251, MacRoman, ISO-8859-7, windows-1253, ISO-8859-8, windows-1255, TIS-620)
Parquet
- No options
JSON/JSONL
- file_encoding (default: UTF-8; see above for supported encodings)
Delta Table
- When selecting Format = Delta Table, the path spec MUST include the named token {table}. The connector expects the {table} token in the path spec so it can discover Delta table roots. No per-file options are required.

Additional points to note

Folder names should not contain {, }, *, / in their names.
Named variable {folder} is reserved for internal working. please do not use in named variables.

Example Path Specs

Example 1 - Individual file as Dataset (SingleFile pathspec)

Bucket structure:

test-bucket
├── employees.csv
├── departments.json
└── food_items.csv

Path specs config to ingest employees.csv and food_items.csv as datasets:

path_specs:
    - s3://test-bucket/*.csv

This will automatically ignore departments.json file. To include it, use *.* instead of *.csv.

Example 2 - Folder of files as Dataset (without Partitions)

Bucket structure:

test-bucket
└──  offers
     ├── 1.csv
     └── 2.csv

Path specs config to ingest folder offers as dataset:

path_specs:
    - s3://test-bucket/{table}/*.csv

{table} represents folder for which dataset will be created.

Example 3 - Folder of files as Dataset (with Partitions)

Bucket structure:

test-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── returns
    └── year=2021
        └── month=2
            └── 1.parquet

Path specs config to ingest folders orders and returns as datasets:

path_specs:
    - s3://test-bucket/{table}/*/*/*.parquet

Example 4 - Advanced - Either Individual file OR Folder of files as Dataset

Bucket structure:

test-bucket
├── customers
│   ├── part1.json
│   ├── part2.json
│   ├── part3.json
│   └── part4.json
├── employees.csv
├── food_items.csv
├── tmp_10101000.csv
└──  orders
    └── year=2022
        └── month=2
            ├── 1.parquet
            ├── 2.parquet
            └── 3.parquet

Path specs config:

path_specs:
    - path_spec_1: s3://test-bucket/*.csv
    - path_spec_2: s3://test-bucket/{table}/*.json
    - path_spec_3: s3://test-bucket/{table}/*/*/*.parquet

Above config has 3 path_specs and will ingest following datasets

employees.csv - Single File as Dataset
food_items.csv - Single File as Dataset
customers - Folder as Dataset
orders - Folder as Dataset

Valid path_specs.include

s3://my-bucket/foo/tests/bar.csv # single file table
s3://my-bucket/foo/tests/*.* # mulitple file level tables
s3://my-bucket/foo/tests/{table}/*.parquet #table without partition
s3://my-bucket/foo/tests/{table}/*/*.csv #table where partitions are not specified
s3://my-bucket/foo/tests/{table}/*.* # table where no partitions as well as data type specified
s3://my-bucket/{dept}/tests/{table}/*.parquet # specifying keywords to be used in display namepartition key and value format

Example 5 - Delta Table (S3)

For Delta Table support, include {table} in the path. Example simple path spec for S3:

path_specs:
    - s3://my-bucket/{table}/

The connector will interpret {table} as the table root for each Delta table.

Supported file types

CSV (*.csv)
JSON (*.json)
JSONL (*.jsonl)
Parquet (*.parquet)
Delta (Delta Table)

Notes

Data Quality Monitoring is no longer supported for S3 sources. Only cataloging is available.
For advanced dataset structures, add multiple path_spec entries as needed, each with its own file type and settings.

PreviousPowerBI NextAzure Data Lake Storage (ADLS)

Last updated 1 month ago