AWS S3
Connect your AWS S3 buckets to see your S3 datasets and files within the Catalog.
Supported Capabilities
Data Preview: ✅
Minimum Requirements
To connect your AWS S3 to Decube, we will need the following information.
Choose an authentication method:
a. AWS Identity:
Select AWS Identity
Customer AWS Role ARN
Region
Path Specs
Data source name


b. AWS Access Key:
Access Key ID
Secret Access Key
Region
Path Specs
Data source name

Connection Options:
a. AWS Roles
Step 1: Go to your AWS Account > IAM Module > Roles
Step 2: Click on Create role

Step 3: Choose Custom trust policy

Step 4: Specify the following as the trust policy, replacing DECUBE-AWS-IDENTITY-ARN and EXTERNAL-ID with the values from Generating a Decube AWS Identity.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<DECUBE-AWS-IDENTITY-ARN>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<EXTERNAL-ID>"
        }
      }
    }
  ]
}
Step 5: Click Next to proceed to attaching a policy.
Step 6: Click on Attach Policies, then Create Policy, and choose the JSON editor. Input the following policy and press Next, enter a policy name of your choice, and press Create Policy. (A scripted alternative to Steps 2-6 is sketched after the policy below.)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListAllMyBuckets"
      ],
      "Resource": [
        "arn:aws:s3:::{bucket-name}",
        "arn:aws:s3:::{bucket-name}/*"
      ]
    }
  ]
}
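If you prefer to script this setup rather than use the console, the sketch below shows roughly how Steps 2-6 could be done with boto3. It is a minimal sketch, not an official Decube script: the role name, policy name, and bucket name are placeholders you would replace, and <DECUBE-AWS-IDENTITY-ARN> / <EXTERNAL-ID> still come from Generating a Decube AWS Identity.

# Minimal sketch (assumes boto3 is installed and configured with IAM admin permissions).
# Placeholders: role/policy/bucket names and the Decube identity ARN / external ID.
import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "<DECUBE-AWS-IDENTITY-ARN>"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "<EXTERNAL-ID>"}},
    }],
}

s3_read_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "VisualEditor0",
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket", "s3:ListAllMyBuckets"],
        "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    }],
}

# Steps 2-4: create the role with the Decube trust policy.
role = iam.create_role(
    RoleName="decube-s3-access",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Steps 5-6: create the S3 read policy and attach it to the role.
policy = iam.create_policy(
    PolicyName="decube-s3-read",
    PolicyDocument=json.dumps(s3_read_policy),
)
iam.attach_role_policy(
    RoleName="decube-s3-access",
    PolicyArn=policy["Policy"]["Arn"],
)

# The printed role ARN is the "Customer AWS Role ARN" entered when connecting in Decube.
print(role["Role"]["Arn"])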
b. AWS IAM User
Step 1: Log in to the AWS Console and proceed to IAM > Users > Create user

Step 2: Click on Attach Policies, then Create Policy, and choose the JSON editor. Input the following policy and press Next, enter a policy name of your choice, and press Create Policy.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListAllMyBuckets"
      ],
      "Resource": [
        "arn:aws:s3:::{bucket-name}",
        "arn:aws:s3:::{bucket-name}/*"
      ]
    }
  ]
}
Step 3: Search for the policy you created just now, select it and press Next.

Step 4: Press Create user

Step 5: Navigate to the newly created user and click on Create access key.

Step 6: Choose Application running outside AWS.

Step 7: Save the provided access key and secret access key. You will not be able to retrieve these keys again.
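Optionally, you can sanity-check the new credentials before adding them to Decube. The snippet below is a minimal sketch, assuming boto3 is installed; the region, bucket name, and key-pair values are placeholders you replace with your own.

import boto3

# Placeholders: fill in the key pair saved in Step 7 plus your region and bucket.
s3 = boto3.client(
    "s3",
    aws_access_key_id="<ACCESS-KEY-ID>",
    aws_secret_access_key="<SECRET-ACCESS-KEY>",
    region_name="<REGION>",
)

# s3:ListAllMyBuckets / s3:ListBucket: both calls should succeed without AccessDenied.
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
resp = s3.list_objects_v2(Bucket="<BUCKET-NAME>", MaxKeys=5)
print([obj["Key"] for obj in resp.get("Contents", [])])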

AWS KMS
If the bucket intended to be connected to Decube is encrypted using a customer-managed KMS key, you will need to add the AWS IAM user created above to the key policy statement.
Log in to the AWS Console and proceed to AWS KMS > Customer-managed keys.
Find the key that was used to encrypt the AWS S3 bucket.
On the Key policy tab, click on Edit.

Assuming the user created is decube-s3-datalake:
a. If there is no existing policy attached to the key:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Allow decube to use key",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::<AWSAccountID>:user/{decube-s3-datalake}"
        ]
      },
      "Action": "kms:Decrypt",
      "Resource": "*"
    }
  ]
}
b. If there is an existing policy, append this section to the Statement array:
{
  "Statement": [
    {
      "Sid": "Allow decube to use key",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::<AWSAccountID>:user/{decube-s3-datalake}"
        ]
      },
      "Action": "kms:Decrypt",
      "Resource": "*"
    }
  ]
}
Click Save changes to apply the updated key policy.
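To confirm the key policy change took effect, you can try reading an object from the encrypted bucket with the decube-s3-datalake credentials. This is a minimal sketch, assuming boto3 and placeholder bucket/object names; get_object only succeeds on an SSE-KMS object if the caller has kms:Decrypt on the key.

import boto3

# Use the decube-s3-datalake user's access key pair (placeholders below).
s3 = boto3.client(
    "s3",
    aws_access_key_id="<ACCESS-KEY-ID>",
    aws_secret_access_key="<SECRET-ACCESS-KEY>",
    region_name="<REGION>",
)

# head_object shows whether the object is SSE-KMS encrypted and with which key.
meta = s3.head_object(Bucket="<BUCKET-NAME>", Key="<OBJECT-KEY>")
print(meta.get("ServerSideEncryption"), meta.get("SSEKMSKeyId"))

# get_object will fail with a KMS access error if kms:Decrypt was not granted.
body = s3.get_object(Bucket="<BUCKET-NAME>", Key="<OBJECT-KEY>")["Body"].read(100)
print(len(body), "bytes read successfully")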
Path Specs
Path Specs (path_specs) is a list of Path Spec (path_spec) objects, where each individual path_spec represents one or more datasets. A path specification is the formatted path to the dataset in your S3 bucket which Decube will use to ingest and catalog the data.
The provided path specification MUST end with *.* or *.[ext] to represent the leaf level. (Note that here '*' is not a wildcard symbol.) If *.[ext] is provided, only files with the specified extension type will be scanned. .[ext] can be any of the supported file types listed below.
Each path_spec represents only one file type (e.g., only *.csv or only *.parquet). To ingest multiple file types, add multiple path_spec entries.
SingleFile pathspec: a PathSpec without {table} (targets individual files).
MultiFile pathspec: a PathSpec with {table} (targets folders as datasets).
Include only datasets that match this pattern: If the path spec specifies a table and a regex is provided, only datasets whose names match the regex will be included (for example, a pattern such as orders_.* would include only tables whose names start with orders_).
File Format Settings
CSV
delimiter (default: ,)
escape_char (default: \\)
quote_char (default: ")
has_headers (default: true)
skip_n_line (default: 0)
file_encoding (default: UTF-8; supported: ASCII, UTF-8, UTF-16, UTF-32, Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN, EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP, EUC-KR, ISO-2022-KR, Johab, KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251, MacRoman, ISO-8859-7, windows-1253, ISO-8859-8, windows-1255, TIS-620)
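To illustrate roughly what these CSV options control, here is a short sketch using Python's csv module. It is not Decube's parser; the sample data, pipe delimiter, and comments mapping each option are only illustrative.

import csv, io

# Hypothetical pipe-delimited file with one comment line before the header.
sample = "# exported file\nid|name|price\n1|Apple|0.50\n2|Bread|2.10\n"

skip_n_line = 1                 # skip_n_line: drop this many lines before parsing
has_headers = True              # has_headers: treat the first remaining row as the header
lines = sample.splitlines()[skip_n_line:]

reader = csv.reader(
    io.StringIO("\n".join(lines)),
    delimiter="|",              # delimiter (Decube default: ,)
    quotechar='"',              # quote_char (default: ")
    escapechar="\\",            # escape_char (default: \)
)
rows = list(reader)
header, data = (rows[0], rows[1:]) if has_headers else (None, rows)
print(header)                   # ['id', 'name', 'price']
print(data)                     # [['1', 'Apple', '0.50'], ['2', 'Bread', '2.10']]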
Parquet
No options
JSON/JSONL
file_encoding (default: UTF-8; see above for supported encodings)
Additional points to note
Folder names should not contain {, }, *, or / characters.
The named variable {folder} is reserved for internal use; please do not use it as a named variable.
Example Path Specs
Example 1 - Individual file as Dataset (SingleFile pathspec)
Bucket structure:
test-bucket
├── employees.csv
├── departments.json
└── food_items.csv
Path specs config to ingest employees.csv and food_items.csv as datasets:
path_specs:
  - s3://test-bucket/*.csv
This will automatically ignore the departments.json file. To include it, use *.* instead of *.csv.
Example 2 - Folder of files as Dataset (without Partitions)
Bucket structure:
test-bucket
└── offers
    ├── 1.csv
    └── 2.csv
Path specs config to ingest the folder offers as a dataset:
path_specs:
  - s3://test-bucket/{table}/*.csv
{table} represents the folder for which a dataset will be created.
Example 3 - Folder of files as Dataset (with Partitions)
Bucket structure:
test-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── returns
    └── year=2021
        └── month=2
            └── 1.parquet
Path specs config to ingest the folders orders and returns as datasets:
path_specs:
  - s3://test-bucket/{table}/*/*/*.parquet
Example 4 - Advanced - Either Individual file OR Folder of files as Dataset
Bucket structure:
test-bucket
├── customers
│   ├── part1.json
│   ├── part2.json
│   ├── part3.json
│   └── part4.json
├── employees.csv
├── food_items.csv
├── tmp_10101000.csv
└── orders
    └── year=2022
        └── month=2
            ├── 1.parquet
            ├── 2.parquet
            └── 3.parquet
Path specs config:
path_specs:
  - path_spec_1: s3://test-bucket/*.csv
  - path_spec_2: s3://test-bucket/{table}/*.json
  - path_spec_3: s3://test-bucket/{table}/*/*/*.parquet
The above config has 3 path_specs and will ingest the following datasets:
employees.csv - Single File as Dataset
food_items.csv - Single File as Dataset
customers - Folder as Dataset
orders - Folder as Dataset
Valid path_specs.include
s3://my-bucket/foo/tests/bar.csv # single file table
s3://my-bucket/foo/tests/*.* # multiple file level tables
s3://my-bucket/foo/tests/{table}/*.parquet # table without partition
s3://my-bucket/foo/tests/{table}/*/*.csv # table where partitions are not specified
s3://my-bucket/foo/tests/{table}/*.* # table where neither partitions nor data type are specified
s3://my-bucket/{dept}/tests/{table}/*.parquet # specifying keywords to be used in display name
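To make the matching rules above concrete, here is a toy sketch of how a path spec could map S3 keys to datasets. This is not Decube's implementation; it simply assumes each '*' (or '*.[ext]') matches exactly one path component and {table} captures the folder that becomes the dataset.

from typing import Optional

def match_path_spec(path_spec: str, s3_uri: str) -> Optional[str]:
    """Return the dataset name for s3_uri under path_spec, or None if it does not match."""
    spec_parts = path_spec.removeprefix("s3://").split("/")
    key_parts = s3_uri.removeprefix("s3://").split("/")
    if len(spec_parts) != len(key_parts):
        return None

    table = None
    for spec, part in zip(spec_parts, key_parts):
        if spec == "{table}":
            table = part                     # folder that becomes the dataset
        elif spec in ("*", "*.*"):
            continue                         # any single folder or file name
        elif spec.startswith("*."):
            if not part.endswith(spec[1:]):  # e.g. '*.parquet' -> must end with '.parquet'
                return None
        elif spec != part:
            return None

    # MultiFile spec: the {table} folder is the dataset; SingleFile spec: the file itself.
    return table if table is not None else key_parts[-1]

# Example: the MultiFile spec from Example 3 above.
spec = "s3://test-bucket/{table}/*/*/*.parquet"
print(match_path_spec(spec, "s3://test-bucket/orders/year=2022/month=2/1.parquet"))  # orders
print(match_path_spec(spec, "s3://test-bucket/employees.csv"))                       # None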
Supported file types
CSV (*.csv)
JSON (*.json)
JSONL (*.jsonl)
Parquet (*.parquet)
Notes
Data Quality Monitoring is no longer supported for S3 sources. Only cataloging is available.
For advanced dataset structures, add multiple path_spec entries as needed, each with its own file type and settings.