AWS S3
Connect your S3 to see your S3 datasets and files within the Catalog.
To connect your AWS S3 to Decube, we will need the following information:
- IAM user's Access Key
- IAM user's Secret Access Key
- S3 Region

1. Log in to the AWS Console and proceed to IAM > Users > Create user.
2. Click Attach Policies, then Create Policy. Choose the JSON editor, input the following policy, and press Next. Enter a policy name of your choice and press Create Policy.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListAllMyBuckets"
      ],
      "Resource": [
        "arn:aws:s3:::{bucket-name}",
        "arn:aws:s3:::{bucket-name}/*"
      ]
    }
  ]
}
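For illustration, here is the same policy with the placeholder filled in for a hypothetical bucket named acme-datalake; replace {bucket-name} in both Resource entries with the name of the bucket you want Decube to scan:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListAllMyBuckets"
      ],
      "Resource": [
        "arn:aws:s3:::acme-datalake",
        "arn:aws:s3:::acme-datalake/*"
      ]
    }
  ]
}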
3. Search for the policy you just created, select it, and press Next.
4. Press Create user.
5. Navigate to the newly created user and click on Create access key.

6. Choose Application running outside AWS.
7. Save the provided access key and secret access key. You will not be able to retrieve these keys again.

Path Specs (path_specs) is a list of Path Spec (path_spec) objects, where each individual path_spec represents one or more datasets.

The include path (path_spec.include) is the formatted path to the dataset. This path must end with *.* or *.[ext] to represent the leaf level. If *.[ext] is provided, only files with the specified extension type will be scanned. ".[ext]" can be any of the supported file types. Refer to example 1 below for more details.

All folder levels need to be specified in the include path. You can use /*/ to represent a folder level and avoid specifying an exact folder name. To map a folder as a dataset, use the {table} placeholder to represent the folder level for which the dataset is to be created. Refer to examples 2 and 3 below for more details.

Exclude paths (path_spec.exclude) can be used to ignore paths that are not relevant to the current path_spec. These paths cannot have named variables ({}). Exclude paths can have ** to represent multiple folder levels. Refer to example 4 below for more details.

Additional points to note:
- Folder names should not contain {, }, *, or / in their names.
- The named variable {folder} is reserved for internal use; please do not use it in your named variables.
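Putting these pieces together, here is a minimal sketch of a single path_spec. The bucket name and folder layout are hypothetical, assuming a bucket named example-bucket whose data folder holds one sub-folder per dataset:

path_specs:
  - include: s3://example-bucket/data/{table}/*/*.parquet # {table} names the dataset; /*/ matches one folder level without naming it
    exclude:
      - "**/tmp/**" # ignore scratch folders at any depth

The worked examples below walk through each of these elements in turn.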
Example 1

Bucket structure:
test-bucket
├── employees.csv
├── departments.json
└── food_items.csv
Path specs config to ingest employees.csv and food_items.csv as datasets:

path_specs:
  - include: s3://test-bucket/*.csv
This will automatically ignore the departments.json file. To include it, use *.* instead of *.csv.

Example 2

Bucket structure:
test-bucket
└── offers
    ├── 1.csv
    └── 2.csv
Path specs config to ingest folder offers as a dataset:

path_specs:
  - include: s3://test-bucket/{table}/*.csv

{table} represents the folder for which the dataset will be created.

Example 3

Bucket structure:
test-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── returns
    └── year=2021
        └── month=2
            └── 1.parquet
Path specs config to ingest folders orders and returns as datasets:

path_specs:
  - include: s3://test-bucket/{table}/*/*/*.parquet
Example 4

Bucket structure:
test-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── tmp_orders
    └── year=2021
        └── month=2
            └── 1.parquet
Path specs config to ingest folder orders as a dataset but not folder tmp_orders:

path_specs:
  - include: s3://test-bucket/{table}/*/*/*.parquet
    exclude:
      - "**/tmp_orders/**"
Example 5

Bucket structure:
test-bucket
├── customers
│   ├── part1.json
│   ├── part2.json
│   ├── part3.json
│   └── part4.json
├── employees.csv
├── food_items.csv
├── tmp_10101000.csv
└── orders
    └── year=2022
        └── month=2
            ├── 1.parquet
            ├── 2.parquet
            └── 3.parquet
Path specs config:
path_specs:
  - include: s3://test-bucket/*.csv
    exclude:
      - "**/tmp_10101000.csv"
  - include: s3://test-bucket/{table}/*.json
  - include: s3://test-bucket/{table}/*/*/*.parquet
The above config has 3 path_specs and will ingest the following datasets:
- employees.csv - Single File as Dataset
- food_items.csv - Single File as Dataset
- customers - Folder as Dataset
- orders - Folder as Dataset
It will ignore the file tmp_10101000.csv.
Valid path_specs.include:

s3://my-bucket/foo/tests/bar.csv # single file table
s3://my-bucket/foo/tests/*.* # multiple file level tables
s3://my-bucket/foo/tests/{table}/*.parquet # table without partition
s3://my-bucket/foo/tests/{table}/*/*.csv # table where partitions are not specified
s3://my-bucket/foo/tests/{table}/*.* # table where no partition as well as data type is specified
s3://my-bucket/{dept}/tests/{table}/*.parquet # specifying keywords to be used in display name
s3://my-bucket/{dept}/tests/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet # specifying partition key and value format
Valid path_specs.exclude:
- */tests/**
- s3://my-bucket/hr/**
- */tests/*.csv
- s3://my-bucket/foo/*/my_table/**
Supported file types are as follows:
- CSV (*.csv)
- TSV (*.tsv)
- JSON (*.json)
- JSONL (*.jsonl)
- Parquet (*.parquet)
Schemas for Parquet and Avro files are extracted as provided.
Schemas for schemaless formats (CSV, TSV, JSON) are inferred. For CSV, TSV, and JSONL files, we consider the first 100 rows by default. JSON file schemas are inferred on the basis of the entire file (given the difficulty of extracting only the first few objects of the file), which may impact performance.
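Since whole-file JSON inference can be slow on large files, one option is to scope a path_spec to the row-sampled formats instead. A minimal sketch, assuming a hypothetical bucket named example-bucket where each dataset folder holds JSONL files:

path_specs:
  - include: s3://example-bucket/{table}/*.jsonl # JSONL schemas are inferred from the first 100 rows, unlike *.json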