AWS S3
Connect your S3 to see your S3 datasets and files within the Catalog.
To connect your AWS S3 to Decube, we will need the following information:
IAM user's Access Key
IAM user's Secret Access Key
S3 Region
Log in to the AWS Console and proceed to IAM > Users > Create user.
Click on Attach policies and Create policy, choose the JSON editor, and input the policy (an example is shown below). Press Next, enter a policy name of your choice, and press Create policy.
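For reference, a minimal read-only policy might look like the sketch below. The bucket name your-bucket is a placeholder, and the exact actions Decube requires may differ, so treat this as an illustration rather than the definitive policy.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DecubeS3ReadOnlyExample",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket",
        "arn:aws:s3:::your-bucket/*"
      ]
    }
  ]
}
```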
Search for the policy you just created, select it, and press Next.
Press Create user.
Navigate to the newly created user and click on Create access key.
Choose Application running outside AWS.
Save the provided access key and secret access key. You will not be able to retrieve these keys again.
If the bucket intended to be connected to Decube is encrypted using a customer-managed KMS key, you will need to add the AWS IAM user created above to the key policy statement.
Login to AWS Console and proceed to AWS KMS > Customer-managed keys.
Find the key that was used to encrypt the AWS S3 bucket.
On the Key policy tab, click on Edit
Assuming the user created is decube-s3-datalake:
a. If there is no existing policy attached to the key, create one containing the statement shown below.
b. If there is an existing policy, append the statement shown below to the Statement array.
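For illustration, the statement to add might look like the following. The account ID 111122223333 is a placeholder, and the listed actions (kms:Decrypt, kms:DescribeKey) are an assumption based on the permissions typically needed to read objects encrypted with a customer-managed key. For case (a), wrap this statement in a new key policy together with the default statement that grants your account root user access to the key.

```json
{
  "Sid": "AllowDecubeUserToDecryptExample",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::111122223333:user/decube-s3-datalake"
  },
  "Action": [
    "kms:Decrypt",
    "kms:DescribeKey"
  ],
  "Resource": "*"
}
```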
Save Changes
Path Specs (path_specs) is a list of Path Spec (path_spec) objects, where each individual path_spec represents one or more datasets. The include path (path_spec.include) is a formatted path to the dataset. This path must end with *.* or *.[ext] to represent the leaf level. If *.[ext] is provided, only files with the specified extension type will be scanned. ".[ext]" can be any of the supported file types. Refer to example 1 below for more details.
All folder levels need to be specified in the include path. You can use /*/ to represent a folder level and avoid specifying the exact folder name. To map a folder as a dataset, use the {table} placeholder to represent the folder level for which the dataset is to be created. Refer to examples 2 and 3 below for more details.
Exclude paths (path_spec.exclude) can be used to ignore paths that are not relevant to the current path_spec. This path cannot contain named variables ({}). An exclude path can use ** to represent multiple folder levels. Refer to example 4 below for more details.
Refer to example 5 if your bucket has a more complex dataset representation.
Additional points to note:
Folder names should not contain {, }, *, or /.
The named variable {folder} is reserved for internal use; please do not use it in named variables.
Bucket structure:
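For illustration, a hypothetical layout matching this example (the bucket name test-bucket is an assumption):

```
test-bucket
├── employees.csv
├── departments.json
└── food_items.csv
```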
Path specs config to ingest employees.csv and food_items.csv as datasets:
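A sketch of the corresponding config, assuming the hypothetical bucket above:

```yaml
path_specs:
  - include: s3://test-bucket/*.csv
```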
This will automatically ignore the departments.json file. To include it, use *.* instead of *.csv.
Bucket structure:
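Again for illustration, a hypothetical layout where the folder offers holds the data files:

```
test-bucket
└── offers
    ├── part_1.csv
    └── part_2.csv
```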
Path specs config to ingest folder offers as dataset:
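A sketch of the corresponding config, with {table} matching the offers folder:

```yaml
path_specs:
  - include: s3://test-bucket/{table}/*.csv
```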
{table} represents the folder for which the dataset will be created.
Bucket structure:
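A hypothetical layout where orders and returns are partitioned by year/month/day (names and folder depth are assumptions):

```
test-bucket
├── orders
│   └── 2023
│       └── 01
│           ├── 01
│           │   └── data.parquet
│           └── 02
│               └── data.parquet
└── returns
    └── 2023
        └── 01
            └── 01
                └── data.parquet
```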
Path specs config to ingest folders orders and returns as datasets:
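A sketch of the corresponding config; each /*/ stands for one partition folder level in the assumed layout above:

```yaml
path_specs:
  - include: s3://test-bucket/{table}/*/*/*/*.parquet
```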
Bucket structure:
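A hypothetical layout where tmp_orders mirrors orders but should not be ingested:

```
test-bucket
├── orders
│   └── 2023
│       └── 01
│           └── 01
│               └── data.parquet
└── tmp_orders
    └── 2023
        └── 01
            └── 01
                └── data.parquet
```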
Path specs config to ingest folder orders as dataset but not folder tmp_orders:
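A sketch of the corresponding config; the quoted exclude pattern uses ** to match the tmp_orders folder at any depth:

```yaml
path_specs:
  - include: s3://test-bucket/{table}/*/*/*/*.parquet
    exclude:
      - "**/tmp_orders/**"
```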
Bucket structure:
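A hypothetical layout combining single-file datasets, folder datasets, and a temporary file to be ignored:

```
test-bucket
├── employees.csv
├── food_items.csv
├── customers
│   ├── part_1.json
│   └── part_2.json
└── orders
    └── 2023
        └── 01
            ├── 01
            │   └── data.csv
            └── 02
                ├── data.csv
                └── tmp_10101000.csv
```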
Path specs config:
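A sketch of a config covering that layout, under the same assumptions as the previous examples:

```yaml
path_specs:
  - include: s3://test-bucket/*.csv
  - include: s3://test-bucket/{table}/*.json
  - include: s3://test-bucket/{table}/*/*/*/*.csv
    exclude:
      - "**/tmp_10101000.csv"
```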
The above config has 3 path_specs and will ingest the following datasets:
employees.csv - Single file as dataset
food_items.csv - Single file as dataset
customers - Folder as dataset
orders - Folder as dataset; the file tmp_10101000.csv will be ignored
Valid path_specs.include:
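For illustration, include patterns of this shape would be valid (bucket and folder names are placeholders):

```
s3://my-bucket/foo/*.*                  # every file in the folder becomes its own dataset
s3://my-bucket/foo/*.csv                # only CSV files, each as its own dataset
s3://my-bucket/foo/{table}/*.csv        # each folder under foo becomes a dataset of CSV files
s3://my-bucket/foo/{table}/*/*.parquet  # folder as dataset with one partition level below it
s3://my-bucket/*/{table}/*.*            # /*/ skips one folder level without naming it
```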
Valid path_specs.exclude:
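And, for illustration, exclude patterns of this shape (again with placeholder names):

```
**/tests/**              # ignore any folder named tests at any depth
s3://my-bucket/tmp/**    # ignore everything under the tmp/ prefix
**/tmp_*.csv             # ignore temporary CSV files at any depth
```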
Supported file types:
CSV (*.csv)
TSV (*.tsv)
JSON (*.json)
JSONL (*.jsonl)
Parquet (*.parquet)
Avro (*.avro) [beta]
Table format:
Apache Iceberg [beta]
Delta table [beta]
Schemas for Parquet and Avro files are extracted as provided.
Schemas for schemaless formats (CSV, TSV, JSON) are inferred. For CSV, TSV, and JSONL files, we consider the first 100 rows by default. JSON file schemas are inferred on the basis of the entire file (given the difficulty of extracting only the first few objects of the file), which may impact performance.