Azure Data Lake Storage (ADLS)
Azure Data Lake Storage (ADLS) is a scalable and secure data lake solution from Microsoft Azure designed to handle the vast amounts of data generated by modern applications.
To connect your ADLS to Decube, the following information is required:
Tenant ID
Client ID
Client Secret
On the Azure Home Page, go to Azure Active Directory. The Tenant ID can be copied from the Basic information section.
Go to App registrations.
Click on New registration.
Click Register.
Save the Application (client) ID and Directory (tenant) ID.
Click Add a certificate or secret.
Go to Client secrets and click + New client secret.
Click + Add.
Copy and save the Value for the client secret.
Click on the storage account you wish to connect with Decube.
Click on Access Control (IAM), then click + Add and select Add role assignment.
Find the role called Storage Blob Data Reader, select it, and click Next.
On the next page, search for the name of the application that you just created in Microsoft Entra ID.
Assign it to the role.
Path Specs (path_specs) is a list of Path Spec (path_spec) objects, where each individual path_spec represents one or more datasets. The include path (path_spec.include) is the formatted path to the dataset. This path must end with *.* or *.[ext] to represent the leaf level. If *.[ext] is provided, only files with the specified extension type will be scanned. ".[ext]" can be any of the supported file types. Refer to example 1 below for more details.
All folder levels need to be specified in the include path. You can use /*/ to represent a folder level and avoid specifying an exact folder name. To map a folder as a dataset, use the {table} placeholder to represent the folder level for which the dataset is to be created. Refer to examples 2 and 3 below for more details.
Exclude paths (path_spec.exclude) can be used to ignore paths that are not relevant to the current path_spec. This path cannot have named variables ({}). An exclude path can have ** to represent multiple folder levels. Refer to example 4 below for more details.
Identify the storage account you want to connect to Decube.
Take note of:
Storage account name
Container name
Folder path
Follow this schema when building a path spec:
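The exact URI form depends on how your storage endpoint is exposed; as a rough sketch, assuming the https://&lt;storage-account&gt;.blob.core.windows.net endpoint style, an include path follows this pattern:

```
https://{storage-account-name}.blob.core.windows.net/{container-name}/{folder-path}/*.*
```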
Bucket structure:
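For illustration only (the container name data is hypothetical), assume the container is laid out as follows:

```
data/
├── employees.csv
├── departments.json
└── food_items.csv
```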
Path specs config to ingest employees.csv and food_items.csv as datasets:
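A minimal sketch of such a config, assuming the hypothetical storage account mystorageaccount and the container layout above (shown in YAML-style notation):

```
path_specs:
  - include: https://mystorageaccount.blob.core.windows.net/data/*.csv
```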
This will automatically ignore the departments.json file. To include it, use *.* instead of *.csv.
Bucket structure:
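For illustration (the container and file names are hypothetical):

```
data/
└── offers/
    ├── offers_jan.csv
    └── offers_feb.csv
```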
Path specs config to ingest the folder offers as a dataset:
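A sketch of such a config against the hypothetical layout above:

```
path_specs:
  - include: https://mystorageaccount.blob.core.windows.net/data/{table}/*.csv
```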
{table} represents the folder for which the dataset will be created.
Bucket structure:
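For illustration (the intermediate folder 2024 and the file names are hypothetical):

```
data/
└── 2024/
    ├── orders/
    │   ├── orders_01.csv
    │   └── orders_02.csv
    └── returns/
        └── returns_01.csv
```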
Path specs config to ingest the folders orders and returns as datasets:
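A sketch of such a config; /*/ skips the intermediate folder level without naming it, and {table} maps each matched folder (orders, returns) to a dataset:

```
path_specs:
  - include: https://mystorageaccount.blob.core.windows.net/data/*/{table}/*.csv
```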
Bucket structure:
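For illustration (file names are hypothetical):

```
data/
├── orders/
│   └── orders_01.csv
└── tmp_orders/
    └── orders_99.csv
```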
Path specs config to ingest the folder orders as a dataset but not the folder tmp_orders:
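A sketch of such a config; the exclude pattern is an assumption of how the temporary folder could be filtered out:

```
path_specs:
  - include: https://mystorageaccount.blob.core.windows.net/data/{table}/*.csv
    exclude:
      - "**/tmp_orders/**"
```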
Bucket structure:
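For illustration (the sub-folders crm and sales and the file names are hypothetical):

```
data/
├── employees.csv
├── food_items.csv
├── crm/
│   └── customers/
│       ├── customers_01.csv
│       └── customers_02.csv
└── sales/
    └── orders/
        ├── orders_01.csv
        └── tmp_10101000.csv
```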
Path specs config:
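A sketch of a three-part config against the hypothetical layout above:

```
path_specs:
  # Top-level CSV files as single-file datasets
  - include: https://mystorageaccount.blob.core.windows.net/data/*.csv
  # The customers folder as a dataset
  - include: https://mystorageaccount.blob.core.windows.net/data/crm/{table}/*.csv
  # The orders folder as a dataset, excluding the temporary file
  - include: https://mystorageaccount.blob.core.windows.net/data/sales/{table}/*.csv
    exclude:
      - "**/tmp_10101000.csv"
```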
The above config has 3 path_specs and will ingest the following datasets:
employees.csv - Single File as Dataset
food_items.csv - Single File as Dataset
customers - Folder as Dataset
orders - Folder as Dataset, ignoring the file tmp_10101000.csv
Valid path_specs.include
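The following patterns are illustrative only (the account, container, and folder names are placeholders), but they follow the rules described above:

```
https://mystorageaccount.blob.core.windows.net/data/*.*
https://mystorageaccount.blob.core.windows.net/data/*.csv
https://mystorageaccount.blob.core.windows.net/data/{table}/*.parquet
https://mystorageaccount.blob.core.windows.net/data/*/{table}/*.*
```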
Valid path_specs.exclude
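Illustrative exclude patterns (no named variables; ** can represent multiple folder levels):

```
**/tmp_orders/**
**/archive/**
**/tmp_10101000.csv
```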
Supported file types:
CSV (*.csv)
TSV (*.tsv)
JSON (*.json)
JSONL (*.jsonl)
Parquet (*.parquet)
Avro (*.avro) [beta]
Table format:
Apache Iceberg [beta]
Delta table [beta]
Schemas for Parquet and Avro files are extracted as provided.
Schemas for schemaless formats (CSV, TSV, JSON) are inferred. For CSV, TSV, and JSONL files, we consider the first 100 rows by default. JSON file schemas are inferred from the entire file (given the difficulty of extracting only the first few objects of the file), which may impact performance.