AWS S3
Connect your AWS S3 to see your S3 datasets and files within the Catalog.
Supported Capabilities
- Metadata Extraction: ✅
- Metadata Types Collected: Schema, Table, Column
- Data Profiling: ❌
- Data Preview: ✅
- Data Quality: ❌
- Configurable Collection: ❌
- External Table: ❌
- View Table: ❌
- Stored Procedure: ❌
Minimum Requirement
To connect your AWS S3 to Decube, we will need the following information.
Choose an authentication method (an illustrative example of these fields follows the two lists below):
a. AWS Identity:
- Select AWS Identity 
- Customer AWS Role ARN 
- Region 
- Path Specs 
- Data source name 


b. AWS Access Key:
- Access Key ID 
- Secret Access Key 
- Region 
- Path Specs 
- Data source name 
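
For illustration only, a completed configuration for the AWS Identity option might look like the following; the account ID, role name, bucket, and data source name shown here are hypothetical placeholders, not values you should copy:
Customer AWS Role ARN: arn:aws:iam::123456789012:role/decube-s3-access
Region: us-east-1
Path Specs: s3://my-bucket/{table}/*.parquet
Data source name: my-s3-datalake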

Connection Options:
a. AWS Roles
- Step 1: Go to your AWS Account > IAM Module > Roles 
- Step 2: Click on Create role 

- Step 3: Choose Custom trust policy 

- Step 4: Specify the following as the trust policy, replacing <DECUBE-AWS-IDENTITY-ARN> and <EXTERNAL-ID> with the values from Generating a Decube AWS Identity.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "<DECUBE-AWS-IDENTITY-ARN>"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "<EXTERNAL-ID>"
                }
            }
        }
    ]
}
- Step 5: Click Next to proceed to attaching a policy.
- Step 6: Click on Attach Policies and Create Policy, then choose the JSON editor. Input the following policy and press Next, enter a policy name of your choice, and press Create Policy.
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "VisualEditor0",
			"Effect": "Allow",
			"Action": [
				"s3:GetObject",
				"s3:ListBucket",
				"s3:ListAllMyBuckets"
			],
			"Resource": [
				"arn:aws:s3:::{bucket-name}",
				"arn:aws:s3:::{bucket-name}/*"
			]
		}
	]
}

b. AWS IAM User
- Step 1: Log in to the AWS Console and proceed to IAM > User > Create User

- Step 2: Click on Attach Policies and Create Policy, then choose the JSON editor. Input the following policy and press Next, enter a policy name of your choice, and press Create Policy.
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "VisualEditor0",
			"Effect": "Allow",
			"Action": [
				"s3:GetObject",
				"s3:ListBucket",
				"s3:ListAllMyBuckets"
			],
			"Resource": [
				"arn:aws:s3:::{bucket-name}",
				"arn:aws:s3:::{bucket-name}/*"
			]
		}
	]
}
- Step 3: Search for the policy you just created, select it, and press Next.

- Step 4: Press Create user 

- Step 5: Navigate to the newly created user and click on Create access key

- Step 6: Choose Application running outside AWS

- Step 7: Save the provided access key and secret access key. You will not be able to retrieve these keys again.

AWS KMS
If the bucket intended to be connected to Decube is encrypted using a customer managed KMS key, you will need to add the AWS IAM user created above to the key policy statement.
- Login to AWS Console and proceed to AWS KMS > Customer-managed keys. 
- Find the key that was used to encrypt the AWS S3 bucket. 
- On the Key policy tab, click on Edit

- Assuming the user created is decube-s3-datalake:
a. If there is no existing policy attached to the key:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Allow decube to use key",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<AWSAccountID>:user/{decube-s3-datalake}"
                ]
            },
            "Action": "kms:Decrypt",
            "Resource": "*"
        }
    ]
}
b. If there is an existing policy, append this section to the Statement array:
{
    "Statement": [
        {
            "Sid": "Allow decube to use key",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<AWSAccountID>:user/{decube-s3-datalake}"
                ]
            },
            "Action": "kms:Decrypt",
            "Resource": "*"
        }
    ]
}
- Save Changes
Path Specs
Path Specs (path_specs) is a list of Path Spec (path_spec) objects, where each individual path_spec represents one or more datasets. A path specification is the formatted path to the dataset in your S3 bucket, which Decube will use to ingest and catalog the data.
The provided path specification MUST end with *.* or *.[ext] to represent the leaf level. (Note that here '*' is not a wildcard symbol.) If *.[ext] is provided, only files with the specified extension type will be scanned. ".[ext]" can be any of the supported file types listed below.
Each path_spec represents only one file type (e.g., only *.csv or only *.parquet). To ingest multiple file types, add multiple path_spec entries.
SingleFile pathspec: a path spec without {table} (targets individual files).
MultiFile pathspec: a path spec with {table} (targets folders as datasets).
Include only datasets that match this pattern: If the path spec specifies a table and a regex is provided, only datasets that match the regex will be included, as in the example below.
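For illustration, suppose the MultiFile path spec from Example 2 below is combined with a hypothetical include pattern of sales_.* (standard regular expression syntax; the field name and layout shown here are illustrative and may differ in the Decube form):
path_specs:
    - s3://test-bucket/{table}/*.csv    # candidate {table} folders: sales_2023, sales_archive, orders
include pattern: sales_.*               # only sales_2023 and sales_archive become datasets; orders is skipped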
File Format Settings
- CSV (an illustrative example of these settings follows this list)
  - delimiter (default: ,)
  - escape_char (default: \\)
  - quote_char (default: ")
  - has_headers (default: true)
  - skip_n_line (default: 0)
  - file_encoding (default: UTF-8; supported: ASCII, UTF-8, UTF-16, UTF-32, Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN, EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP, EUC-KR, ISO-2022-KR, Johab, KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251, MacRoman, ISO-8859-7, windows-1253, ISO-8859-8, windows-1255, TIS-620)
- Parquet
  - No options
- JSON/JSONL
  - file_encoding (default: UTF-8; see above for supported encodings)
- Delta Table
  - When selecting Format = Delta Table, the path spec MUST include the named token {table}. The connector expects the {table} token in the path spec so it can discover Delta table roots. No per-file options are required.
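As a sketch of how the CSV settings above fit together, the following lists each documented option with its default value alongside a CSV path spec. The file_format grouping is purely illustrative; the exact placement of these settings in the Decube connection form may differ.
path_specs:
    - s3://test-bucket/{table}/*.csv
file_format:                 # illustrative grouping only
    delimiter: ","
    escape_char: "\\"
    quote_char: "\""
    has_headers: true
    skip_n_line: 0
    file_encoding: UTF-8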
 
Additional points to note
- Folder names should not contain {, }, *, or /. 
- The named variable {folder} is reserved for internal use; please do not use it as a named variable. 
Example Path Specs
Example 1 - Individual file as Dataset (SingleFile pathspec)
Bucket structure:
test-bucket
├── employees.csv
├── departments.json
└── food_items.csv
Path specs config to ingest employees.csv and food_items.csv as datasets:
path_specs:
    - s3://test-bucket/*.csv
This will automatically ignore the departments.json file. To include it, use *.* instead of *.csv.
Example 2 - Folder of files as Dataset (without Partitions)
Bucket structure:
test-bucket
└──  offers
     ├── 1.csv
     └── 2.csv
Path specs config to ingest folder offers as dataset:
path_specs:
    - s3://test-bucket/{table}/*.csv
{table} represents the folder for which the dataset will be created.
Example 3 - Folder of files as Dataset (with Partitions)
Bucket structure:
test-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── returns
    └── year=2021
        └── month=2
            └── 1.parquet
Path specs config to ingest folders orders and returns as datasets:
path_specs:
    - s3://test-bucket/{table}/*/*/*.parquet
Example 4 - Advanced - Either Individual file OR Folder of files as Dataset
Bucket structure:
test-bucket
├── customers
│   ├── part1.json
│   ├── part2.json
│   ├── part3.json
│   └── part4.json
├── employees.csv
├── food_items.csv
├── tmp_10101000.csv
└──  orders
    └── year=2022
        └── month=2
            ├── 1.parquet
            ├── 2.parquet
            └── 3.parquet
Path specs config:
path_specs:
    - path_spec_1: s3://test-bucket/*.csv
    - path_spec_2: s3://test-bucket/{table}/*.json
    - path_spec_3: s3://test-bucket/{table}/*/*/*.parquet
The above config has 3 path_specs and will ingest the following datasets:
- employees.csv - Single File as Dataset
- food_items.csv - Single File as Dataset
- customers - Folder as Dataset
- orders - Folder as Dataset
Valid path_specs.include
s3://my-bucket/foo/tests/bar.csv # single file table
s3://my-bucket/foo/tests/*.* # multiple file level tables
s3://my-bucket/foo/tests/{table}/*.parquet # table without partition
s3://my-bucket/foo/tests/{table}/*/*.csv # table where partitions are not specified
s3://my-bucket/foo/tests/{table}/*.* # table where neither partitions nor data type are specified
s3://my-bucket/{dept}/tests/{table}/*.parquet # specifying keywords to be used in display name
Example 5 - Delta Table (S3)
For Delta Table support, include {table} in the path. A simple example path spec for S3:
path_specs:
    - s3://my-bucket/{table}/
The connector will interpret {table} as the table root for each Delta table.
Supported file types
- CSV (*.csv)
- JSON (*.json)
- JSONL (*.jsonl)
- Parquet (*.parquet)
- Delta (Delta Table)
Notes
- Data Quality Monitoring is no longer supported for S3 sources. Only cataloging is available. 
- For advanced dataset structures, add multiple path_spec entries as needed, each with its own file type and settings.