AWS S3
Connect your AWS S3 to see your S3 datasets and files within the Catalog.
Minimum Requirements
To connect your AWS S3 to Decube, we will need the following information:
IAM user's Access Key
IAM user's Secret Access Key
S3 Region
AWS IAM User
Log in to the AWS Console and proceed to IAM > Users > Create user
Click on Attach policies and Create policy, choose the JSON editor, input the following policy, and press Next. Enter a policy name of your choice and press Create policy.
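The exact policy to paste is not shown here; a minimal read-only sketch that grants Decube list and read access to a single bucket (the bucket name and the exact action list are illustrative assumptions, not Decube's published policy) would look like:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
```

Replace `your-bucket-name` with the bucket you want Decube to scan.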
Search for the policy you just created, select it, and press Next.
Press Create user
Navigate to the newly created user and click on Create access key. Choose Application running outside AWS.
Save the provided access key and secret access key. You will not be able to retrieve these keys again.
Path Specs
Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects, where each individual `path_spec` represents one or more datasets. The include path (`path_spec.include`) is a formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent the leaf level. If `*.[ext]` is provided, only files with the specified extension type will be scanned. "`.[ext]`" can be any of the supported file types. Refer to example 1 below for more details.
All folder levels need to be specified in the include path. You can use `/*/` to represent a folder level and avoid specifying the exact folder name. To map a folder as a dataset, use the `{table}` placeholder to represent the folder level for which the dataset is to be created. Refer to examples 2 and 3 below for more details.
Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to the current `path_spec`. This path cannot have named variables (`{}`). Exclude paths can have `**` to represent multiple folder levels. Refer to example 4 below for more details.
Refer to example 5 if your bucket has a more complex dataset representation.
Additional points to note
Folder names should not contain `{`, `}`, `*`, or `/`.
The named variable `{folder}` is reserved for internal use; please do not use it as a named variable.
Path Specs - Examples
Example 1 - Individual file as Dataset
Bucket structure:
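An illustrative layout for this example (the bucket name `test-bucket` is an assumption; the file names come from the example itself):

```
test-bucket
├── employees.csv
├── departments.json
└── food_items.csv
```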
Path specs config to ingest `employees.csv` and `food_items.csv` as datasets:
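A sketch of such a config, assuming a YAML `path_specs` list and the illustrative bucket name `test-bucket`:

```yaml
path_specs:
  # Each .csv file directly under the bucket root becomes its own dataset.
  - include: "s3://test-bucket/*.csv"
```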
This will automatically ignore the `departments.json` file. To include it, use `*.*` instead of `*.csv`.
Example 2 - Folder of files as Dataset (without Partitions)
Bucket structure:
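An illustrative layout for this example (the bucket name and file names inside `offers` are assumptions):

```
test-bucket
└── offers
    ├── 1.csv
    └── 2.csv
```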
Path specs config to ingest the folder `offers` as a dataset:
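A sketch of such a config, assuming the illustrative bucket name `test-bucket`:

```yaml
path_specs:
  # {table} maps each folder under the bucket root (here: offers) to one dataset.
  - include: "s3://test-bucket/{table}/*.csv"
```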
`{table}` represents the folder for which the dataset will be created.
Example 3 - Folder of files as Dataset (with Partitions)
Bucket structure:
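An illustrative layout for this example (the bucket name and the `year=`/`month=` partition scheme are assumptions; the folder names `orders` and `returns` come from the example):

```
test-bucket
├── orders
│   └── year=2022
│       ├── month=1
│       │   └── 1.parquet
│       └── month=2
│           └── 1.parquet
└── returns
    └── year=2022
        └── month=1
            └── 1.parquet
```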
Path specs config to ingest the folders `orders` and `returns` as datasets:
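A sketch of such a config, assuming two partition levels (e.g. year and month) and the illustrative bucket name `test-bucket`:

```yaml
path_specs:
  # {table} maps each top-level folder to a dataset; each /*/ matches one
  # partition folder level (e.g. year=2022, month=1).
  - include: "s3://test-bucket/{table}/*/*/*.parquet"
```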
Example 4 - Folder of files as Dataset (with Partitions), and Exclude Filter
Bucket structure:
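An illustrative layout for this example (the bucket name and partition scheme are assumptions; `orders` and `tmp_orders` come from the example):

```
test-bucket
├── orders
│   └── year=2022
│       └── month=1
│           └── 1.parquet
└── tmp_orders
    └── year=2022
        └── month=1
            └── 1.parquet
```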
Path specs config to ingest the folder `orders` as a dataset, but not the folder `tmp_orders`:
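A sketch of such a config, assuming the illustrative bucket name `test-bucket`; the `**` in the exclude path matches any number of folder levels:

```yaml
path_specs:
  - include: "s3://test-bucket/{table}/*/*/*.parquet"
    exclude:
      # Skip anything under a tmp_orders folder at any depth.
      - "**/tmp_orders/**"
```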
Example 5 - Advanced - Either Individual file OR Folder of files as Dataset
Bucket structure:
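An illustrative layout for this example (the bucket name and the file types inside `customers` and `orders` are assumptions; the file and folder names come from the example):

```
test-bucket
├── employees.csv
├── food_items.csv
├── customers
│   ├── part1.json
│   └── part2.json
└── orders
    ├── part1.parquet
    ├── part2.parquet
    └── tmp_10101000.csv
```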
Path specs config:
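A sketch of such a config, assuming the illustrative layout above; because the last `path_spec` only includes `*.parquet`, the file `tmp_10101000.csv` inside `orders` is skipped by the extension filter:

```yaml
path_specs:
  # Each .csv at the bucket root becomes a single-file dataset.
  - include: "s3://test-bucket/*.csv"
  # Each folder containing .json files (here: customers) becomes a dataset.
  - include: "s3://test-bucket/{table}/*.json"
  # Each folder containing .parquet files (here: orders) becomes a dataset.
  - include: "s3://test-bucket/{table}/*.parquet"
```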
The above config has 3 `path_spec`s and will ingest the following datasets:
`employees.csv` - Single File as Dataset
`food_items.csv` - Single File as Dataset
`customers` - Folder as Dataset
`orders` - Folder as Dataset (and will ignore the file `tmp_10101000.csv`)
Valid `path_specs.include`
Valid `path_specs.exclude`
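Some illustrative include patterns (bucket, folder, and file names are assumptions):

```
s3://my-bucket/foo/bar.csv              # a single file as a dataset
s3://my-bucket/foo/*.*                  # every file in the folder as its own dataset
s3://my-bucket/foo/{table}/*.parquet    # each folder under foo as a dataset
s3://my-bucket/foo/{table}/*/*.parquet  # each folder as a dataset, with one partition level
```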
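Some illustrative exclude patterns (bucket and folder names are assumptions); note that excludes may use `**` but not named variables like `{table}`:

```
**/tests/**                     # any tests folder, at any depth
s3://my-bucket/hr/**            # everything under the hr prefix
*/tmp/*.csv                     # csv files in any second-level tmp folder
s3://my-bucket/foo/*/my_table/**
```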
Supported file types
CSV (*.csv)
TSV (*.tsv)
JSON (*.json)
JSONL (*.jsonl)
Parquet (*.parquet)
Avro (*.avro) [beta]
Table format:
Apache Iceberg [beta]
Delta table [beta]
Schemas for Parquet and Avro files are extracted as provided.
Schemas for schemaless formats (CSV, TSV, JSON) are inferred. For CSV, TSV, and JSONL files, we consider the first 100 rows by default. JSON file schemas are inferred from the entire file (given the difficulty of extracting only the first few objects of the file), which may impact performance.