AWS S3

Connect your S3 to see your S3 datasets and files within the Catalog.

Supported Capabilities

  • Catalog

  • Data Preview

Minimum Requirement

To connect your AWS S3 to Decube, we will need the following information.

Choose an authentication method (an illustrative set of example values follows the two lists below):

a. AWS Identity:

  • Select AWS Identity

  • Customer AWS Role ARN

  • Region

  • Path Specs

  • Data source name

S3 Datalake using AWS Identity
Continuation of S3 Data Lake setup using AWS Identity

b. AWS Access Key:

  • Access Key ID

  • Secret Access Key

  • Region

  • Path Specs

  • Data source name

S3 Datalake using AWS Access Key
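
For reference, a hypothetical set of values for the two methods might look like the sketch below. This is not a Decube configuration format, only an illustration of what the connection form asks for; every ARN, key, bucket, and name shown is a made-up placeholder.

# Hypothetical values for illustration only -- substitute your own.
aws_identity_connection = {
    "authentication": "AWS Identity",
    "customer_aws_role_arn": "arn:aws:iam::111122223333:role/decube-s3-access",  # role created in the steps below
    "region": "ap-southeast-1",
    "path_specs": ["s3://my-data-bucket/{table}/*.parquet"],  # see the Path Specs section
    "data_source_name": "s3-datalake-prod",
}

aws_access_key_connection = {
    "authentication": "AWS Access Key",
    "access_key_id": "<ACCESS-KEY-ID>",
    "secret_access_key": "<SECRET-ACCESS-KEY>",
    "region": "ap-southeast-1",
    "path_specs": ["s3://my-data-bucket/*.csv"],
    "data_source_name": "s3-datalake-dev",
}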

Connection Options:

a. AWS Roles

This section walks you through creating a Customer AWS Role within your AWS account with the right set of permissions to access your data sources.

  • Step 1: Go to your AWS Account > IAM Module > Roles

  • Step 2: Click on Create role

  • Step 3: Choose Custom trust policy

  • Step 4: Paste the trust policy below into the policy editor, replacing <DECUBE-AWS-IDENTITY-ARN> and <EXTERNAL-ID> with the values provided by Decube during setup.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "<DECUBE-AWS-IDENTITY-ARN>"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "<EXTERNAL-ID>"
                }
            }
        }
    ]
}
  • Step 5: Click Next to proceed to attaching a policy.

  • Step 6: On the Attach policies page, click Create Policy and choose the JSON editor. Input the following policy and press Next, then enter a policy name of your choice and press Create Policy.

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "VisualEditor0",
			"Effect": "Allow",
			"Action": [
				"s3:GetObject",
				"s3:ListBucket",
				"s3:ListAllMyBuckets"
			],
			"Resource": [
				"arn:aws:s3:::{bucket-name}",
				"arn:aws:s3:::{bucket-name}/*"
			]
		}
	]
}
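
For context, the trust policy above enables the standard cross-account AssumeRole flow with an external ID: only the Decube identity named as the Principal, presenting the matching sts:ExternalId, can assume the role. The boto3 sketch below illustrates roughly what such a call looks like; the role ARN, external ID, and bucket name are placeholders, and the call only succeeds when made by a principal listed in the trust policy.

import boto3

# Placeholders -- substitute the role ARN you created and the external ID from the trust policy.
ROLE_ARN = "arn:aws:iam::111122223333:role/decube-s3-access"
EXTERNAL_ID = "<EXTERNAL-ID>"

# Assume the role; this is the STS call the trust policy above is written to allow.
sts = boto3.client("sts")
resp = sts.assume_role(
    RoleArn=ROLE_ARN,
    RoleSessionName="decube-s3-check",
    ExternalId=EXTERNAL_ID,
)
creds = resp["Credentials"]

# Use the temporary credentials to confirm the attached permission policy can list the bucket.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(s3.list_objects_v2(Bucket="my-data-bucket", MaxKeys=5).get("KeyCount", 0))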

b. AWS IAM User

  • Step 1: Log in to the AWS Console and proceed to IAM > Users > Create user

  • Step 2: Click on Attach Policies and Create Policy, choose the JSON editor, and input the following policy. Press Next, enter a policy name of your choice, and press Create Policy.

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "VisualEditor0",
			"Effect": "Allow",
			"Action": [
				"s3:GetObject",
				"s3:ListBucket",
				"s3:ListAllMyBuckets"
			],
			"Resource": [
				"arn:aws:s3:::{bucket-name}",
				"arn:aws:s3:::{bucket-name}/*"
			]
		}
	]
}
  • Step 3: Search for the policy you created just now, select it and press Next.

  • Step 4: Press Create user

  • Step 5: Navigate to the newly created user and click on Create access key

  • Step 6: Choose Application running outside AWS

  • Step 7: Save the provided access key and secret access key. You will not be able to retrieve these keys again
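
Before entering the keys into Decube, you can optionally confirm that they grant the intended access. A minimal boto3 sketch, where the bucket name, region, and key pair are placeholders for your own values:

import boto3

# Placeholders -- replace with the access key pair created above and your bucket/region.
s3 = boto3.client(
    "s3",
    aws_access_key_id="<ACCESS-KEY-ID>",
    aws_secret_access_key="<SECRET-ACCESS-KEY>",
    region_name="ap-southeast-1",
)

# ListBucket is one of the actions granted in the policy above; a successful listing
# confirms the policy is attached to the user.
resp = s3.list_objects_v2(Bucket="my-data-bucket", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"])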

AWS KMS

If the bucket you intend to connect to Decube is encrypted using a customer-managed KMS key, you will need to add the AWS IAM user created above to the key policy.

  1. Login to AWS Console and proceed to AWS KMS > Customer-managed keys.

  2. Find the key that was used to encrypt the AWS S3 bucket.

  3. On the Key policy tab, click on Edit

  4. Add a statement for the IAM user to the key policy. The examples below assume the user created is decube-s3-datalake.

a. If there is no existing policy attached to the key:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Allow decube to use key",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<AWSAccountID>:user/{decube-s3-datalake}"
                ]
            },
            "Action": "kms:Decrypt",
            "Resource": "*"
        }
    ]
}

b. If there is an existing policy, append this section to the Statement array:

{
    "Statement": [
        {
            "Sid": "Allow decube to use key",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<AWSAccountID>:user/{decube-s3-datalake}"
                ]
            },
            "Action": "kms:Decrypt",
            "Resource": "*"
        }
    ]
}
  5. Save changes.
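
To confirm the key policy change took effect, reading any object from the encrypted bucket with the decube-s3-datalake user's credentials should now succeed, because S3 performs kms:Decrypt on your behalf during GetObject. A minimal boto3 sketch; the bucket name, object key, region, and credentials are placeholders.

import boto3

# Placeholders -- use the decube-s3-datalake user's access key pair and your own bucket/object.
s3 = boto3.client(
    "s3",
    aws_access_key_id="<ACCESS-KEY-ID>",
    aws_secret_access_key="<SECRET-ACCESS-KEY>",
    region_name="ap-southeast-1",
)

# GetObject on an SSE-KMS encrypted object implicitly requires kms:Decrypt on the customer-managed key.
obj = s3.get_object(Bucket="my-encrypted-bucket", Key="employees.csv")
print(obj["Body"].read(100))  # a KMS access-denied error here means the key policy is still missing the user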

Path Specs

Path Specs (path_specs) is a list of Path Spec (path_spec) objects, where each individual path_spec represents one or more datasets. A path specification is the formatted path to the dataset in your S3 bucket that Decube will use to ingest and catalog the data.

The provided path specification MUST end with *.* or *.[ext] to represent the leaf level. (Note that here '*' is not a wildcard symbol.) If *.[ext] is provided, only files with the specified extension type will be scanned. [ext] can be any of the supported file types listed below.

Each path_spec represents only one file type (e.g., only *.csv or only *.parquet). To ingest multiple file types, add multiple path_spec entries.

SingleFile pathspec: a path spec without {table}, which targets individual files as datasets.

MultiFile pathspec: a path spec with {table}, which targets folders as datasets.

Include only datasets that match this pattern: if the path spec contains {table} and a regex is provided, only datasets whose names match the regex will be included.
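
To make the {table} and * semantics concrete, the sketch below converts a path spec into a regular expression and groups S3 keys into datasets. This is only an illustration of the matching behaviour described above and in the examples further down, not Decube's actual implementation; the bucket and keys are placeholders.

import re
from collections import defaultdict

def path_spec_to_regex(path_spec: str) -> re.Pattern:
    # '{table}' captures the dataset (folder) name; '*' matches a single path segment
    # or file-name part and does not cross folder boundaries.
    pattern = re.escape(path_spec)
    pattern = pattern.replace(re.escape("{table}"), r"(?P<table>[^/]+)")
    pattern = pattern.replace(re.escape("*"), r"[^/]+")
    return re.compile(pattern + r"$")

def group_datasets(path_spec: str, keys: list[str]) -> dict[str, list[str]]:
    regex = path_spec_to_regex(path_spec)
    datasets = defaultdict(list)
    for key in keys:
        match = regex.match(key)
        if match:
            # MultiFile pathspec: the {table} folder becomes the dataset;
            # SingleFile pathspec: the file itself does.
            name = match.group("table") if "table" in regex.groupindex else key.rsplit("/", 1)[-1]
            datasets[name].append(key)
    return dict(datasets)

keys = [
    "s3://test-bucket/employees.csv",
    "s3://test-bucket/offers/1.csv",
    "s3://test-bucket/offers/2.csv",
]
print(group_datasets("s3://test-bucket/*.csv", keys))          # {'employees.csv': [...]}
print(group_datasets("s3://test-bucket/{table}/*.csv", keys))  # {'offers': [...]}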

File Format Settings

  • CSV (a parsing sketch using these options follows this list)

    • delimiter (default: ,)

    • escape_char (default: \\)

    • quote_char (default: ")

    • has_headers (default: true)

    • skip_n_line (default: 0)

    • file_encoding (default: UTF-8; supported: ASCII, UTF-8, UTF-16, UTF-32, Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN, EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP, EUC-KR, ISO-2022-KR, Johab, KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251, MacRoman, ISO-8859-7, windows-1253, ISO-8859-8, windows-1255, TIS-620)

  • Parquet

    • No options

  • JSON/JSONL

    • file_encoding (default: UTF-8; see above for supported encodings)
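
The CSV options above map naturally onto standard parsing parameters. The sketch below is illustrative only and is not Decube's internal reader; the file name and settings are placeholders showing how delimiter, quote_char, escape_char, has_headers, skip_n_line, and file_encoding would be applied.

import csv

# Placeholder settings mirroring the CSV options listed above.
settings = {
    "delimiter": ",",
    "escape_char": "\\",
    "quote_char": '"',
    "has_headers": True,
    "skip_n_line": 0,
    "file_encoding": "UTF-8",
}

with open("employees.csv", encoding=settings["file_encoding"], newline="") as f:
    # Skip the first skip_n_line lines before parsing.
    for _ in range(settings["skip_n_line"]):
        next(f)
    reader = csv.reader(
        f,
        delimiter=settings["delimiter"],
        quotechar=settings["quote_char"],
        escapechar=settings["escape_char"],
    )
    header = next(reader) if settings["has_headers"] else None
    for row in reader:
        print(dict(zip(header, row)) if header else row)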

Additional points to note

  • Folder names should not contain {, }, *, or / characters.

  • The named variable {folder} is reserved for internal use; please do not use it as a named variable.

Example Path Specs

Example 1 - Individual file as Dataset (SingleFile pathspec)

Bucket structure:

test-bucket
├── employees.csv
├── departments.json
└── food_items.csv

Path specs config to ingest employees.csv and food_items.csv as datasets:

path_specs:
    - s3://test-bucket/*.csv

This will automatically ignore the departments.json file. To include it, use *.* instead of *.csv.

Example 2 - Folder of files as Dataset (without Partitions)

Bucket structure:

test-bucket
└──  offers
     ├── 1.csv
     └── 2.csv

Path specs config to ingest folder offers as dataset:

path_specs:
    - s3://test-bucket/{table}/*.csv

{table} represents the folder for which a dataset will be created.

Example 3 - Folder of files as Dataset (with Partitions)

Bucket structure:

test-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── returns
    └── year=2021
        └── month=2
            └── 1.parquet

Path specs config to ingest folders orders and returns as datasets:

path_specs:
    - s3://test-bucket/{table}/*/*/*.parquet

Example 4 - Advanced - Either Individual file OR Folder of files as Dataset

Bucket structure:

test-bucket
├── customers
│   ├── part1.json
│   ├── part2.json
│   ├── part3.json
│   └── part4.json
├── employees.csv
├── food_items.csv
├── tmp_10101000.csv
└──  orders
    └── year=2022
        └── month=2
            ├── 1.parquet
            ├── 2.parquet
            └── 3.parquet

Path specs config:

path_specs:
    - path_spec_1: s3://test-bucket/*.csv
    - path_spec_2: s3://test-bucket/{table}/*.json
    - path_spec_3: s3://test-bucket/{table}/*/*/*.parquet

The above config has 3 path_specs and will ingest the following datasets:

  • employees.csv - Single File as Dataset

  • food_items.csv - Single File as Dataset

  • customers - Folder as Dataset

  • orders - Folder as Dataset

Valid path_specs.include

s3://my-bucket/foo/tests/bar.csv # single file table
s3://my-bucket/foo/tests/*.* # multiple file level tables
s3://my-bucket/foo/tests/{table}/*.parquet # table without partition
s3://my-bucket/foo/tests/{table}/*/*.csv # table where partitions are not specified
s3://my-bucket/foo/tests/{table}/*.* # table where neither partitions nor data type are specified
s3://my-bucket/{dept}/tests/{table}/*.parquet # specifying keywords to be used in display name

Supported file types

  • CSV (*.csv)

  • JSON (*.json)

  • JSONL (*.jsonl)

  • Parquet (*.parquet)

Notes

  • Data Quality Monitoring is no longer supported for S3 sources. Only cataloging is available.

  • For advanced dataset structures, add multiple path_spec entries as needed, each with its own file type and settings.
