Azure Data Lake Storage (ADLS)
Azure Data Lake Storage (ADLS) is a scalable and secure data lake solution from Microsoft Azure designed to handle the vast amounts of data generated by modern applications.
Supported Capabilities
| Capability | Supported |
| --- | --- |
| Data Preview | ✅ |
Minimum Requirements
To connect your ADLS to Decube, the following information is required:
Tenant ID
Client ID
Client Secret
Potential Data Egress
Under the SaaS deployment model, data must be transferred from the storage container to the Data Plane to inspect files, retrieve schema information, and perform data quality monitoring. If this is not preferable, you may opt for a Self-Hosted deployment model or bring your own Azure Function (see Azure Function for Metadata).
Credentials Needed
Follow the steps under Setup on Microsoft Azure below to generate these credentials.

Setup on Microsoft Azure
1. On the Azure Home Page, go to Azure Active Directory. The Tenant ID can be copied from the Basic information section.
2. Go to App registrations.
3. Click on New registration.
4. Click Register.
5. Save the Application (client) ID and Directory (tenant) ID.
6. Click Add a certificate or secret.
7. Go to Client secrets and click + New client secret.
8. Click + Add.
9. Copy and save the Value for the client secret.
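Before entering these values in Decube, you can sanity-check them. Below is a minimal sketch assuming the `azure-identity` Python package is installed; all placeholder values are hypothetical:

```python
# Minimal sketch: verify the Tenant ID, Client ID, and Client Secret work.
# Assumes `pip install azure-identity`; placeholder values are hypothetical.
from azure.identity import ClientSecretCredential

credential = ClientSecretCredential(
    tenant_id="<tenant-id>",          # Directory (tenant) ID
    client_id="<client-id>",          # Application (client) ID
    client_secret="<client-secret>",  # the secret Value saved above
)

# Requesting a token for the storage scope fails fast if any value is wrong.
token = credential.get_token("https://storage.azure.com/.default")
print("Credentials OK; token expires at", token.expires_on)
```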
Assigning Role to Credentials
From Azure Services, find and click on Storage Accounts. You should see the option for Access control (IAM) on the left sidebar.

1. Click on Access control (IAM).
2. Click + Add, then Add role assignment.
3. Find the role called Storage Blob Data Reader, select it, and click Next.
4. On the next page, search for the name of the application that you just created in Microsoft Entra ID.
5. Assign the role to it.
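Role assignments can take a minute or two to propagate. A minimal sketch to confirm the application can now read the container, assuming the `azure-identity` and `azure-storage-file-datalake` packages (all names below are placeholders):

```python
# Minimal sketch: confirm the app registration can list the container after the
# Storage Blob Data Reader role is assigned. All names are placeholders.
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

credential = ClientSecretCredential("<tenant-id>", "<client-id>", "<client-secret>")
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=credential,
)

# Listing paths raises an authorization error if the role has not propagated.
file_system = service.get_file_system_client("<container-name>")
for path in file_system.get_paths():
    print(path.name)
```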

Path Specs
Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects, where each individual `path_spec` represents one or more datasets for cataloging in ADLS.
The provided path specification MUST end with `*.*` or `*.[ext]` to represent the leaf level. (Note: here `*` is not a wildcard symbol.) If `*.[ext]` is provided, only files with the specified extension will be scanned. `.[ext]` can be any of the supported file types listed below.
Each `path_spec` represents only one file type (e.g., only `*.csv` or only `*.parquet`). To ingest multiple file types, add multiple `path_spec` entries.
SingleFile pathspec: a PathSpec without `{table}` (targets individual files).
MultiFile pathspec: a PathSpec with `{table}` (targets folders as datasets).
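To illustrate the difference, here is a sketch of the matching semantics only (not Decube's actual implementation):

```python
# Illustrative sketch only: a SingleFile spec ("*.csv") yields one dataset per
# file, while a MultiFile spec ("{table}/*.csv") yields one dataset per folder.
import re

files = ["offers/1.csv", "offers/2.csv", "employees.csv"]

multi_file = re.compile(r"^(?P<table>[^/]+)/[^/]+\.csv$")  # {table}/*.csv
single_file = re.compile(r"^[^/]+\.csv$")                  # *.csv

datasets = set()
for f in files:
    m = multi_file.match(f)
    if m:
        datasets.add(m.group("table"))  # the folder becomes the dataset
    elif single_file.match(f):
        datasets.add(f)                 # the file itself is the dataset

print(datasets)  # {'offers', 'employees.csv'}
```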
PathSpec Structure
Take note of the following parameters when building a path spec:
Storage account name
Container name
Folder path

Follow this schema when building a path spec:
"abfs://{container name}@{storage account name}.dfs.core.windows.net/{folder path}"
Example:
"abfs://{container name}@{storage account name}.dfs.core.windows.net/second/*.*"
Include only datasets that match this pattern: If the path spec specifies a table and a regex is provided, only datasets that match the regex will be included.
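For example (a hypothetical pattern, shown only to illustrate the regex semantics):

```python
# Hypothetical sketch: only table names matching the configured regex are kept.
import re

tables = ["orders", "returns", "tmp_10101000"]
pattern = re.compile(r"^(orders|returns)$")  # hypothetical include pattern

included = [t for t in tables if pattern.match(t)]
print(included)  # ['orders', 'returns']
```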
File Format Settings
CSV
delimiter (default: `,`)
escape_char (default: `\\`)
quote_char (default: `"`)
has_headers (default: true)
skip_n_line (default: 0)
file_encoding (default: UTF-8; supported: ASCII, UTF-8, UTF-16, UTF-32, Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN, EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP, EUC-KR, ISO-2022-KR, Johab, KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251, MacRoman, ISO-8859-7, windows-1253, ISO-8859-8, windows-1255, TIS-620)
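For intuition, the CSV defaults roughly correspond to the following pandas read (an illustration only; Decube applies these settings internally):

```python
# Illustration only: how the CSV defaults above map onto a pandas read_csv call.
import pandas as pd

df = pd.read_csv(
    "employees.csv",     # hypothetical file
    sep=",",             # delimiter
    escapechar="\\",     # escape_char
    quotechar='"',       # quote_char
    header=0,            # has_headers: true (first line is the header)
    skiprows=0,          # skip_n_line
    encoding="utf-8",    # file_encoding
)
print(df.head())
```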
Parquet
No options
JSON/JSONL
file_encoding (default: UTF-8; see above for supported encodings)
Additional points to note
Folder names should not contain `{`, `}`, `*`, or `/`.
The named variable `{folder}` is reserved for internal use; do not use it as a named variable.
Example Path Specs
Example 1 - Individual file as Dataset (SingleFile pathspec)
Bucket structure:
test-bucket
├── employees.csv
├── departments.json
└── food_items.csv
Path specs config to ingest `employees.csv` and `food_items.csv` as datasets:
path_specs:
- include: abfs://test-bucket@{storage account name}.dfs.core.windows.net/*.csv
This will automatically ignore the `departments.json` file. To include it, use `*.*` instead of `*.csv`.
Example 2 - Folder of files as Dataset (without Partitions)
Bucket structure:
test-bucket
└── offers
├── 1.csv
└── 2.csv
Path specs config to ingest folder `offers` as a dataset:
path_specs:
- include: abfs://test-bucket@{storage account name}.dfs.core.windows.net/{table}/*.csv
`{table}` represents the folder for which the dataset will be created.
Example 3 - Folder of files as Dataset (with Partitions)
Bucket structure:
test-bucket
├── orders
│ └── year=2022
│ └── month=2
│ ├── 1.parquet
│ └── 2.parquet
└── returns
└── year=2021
└── month=2
└── 1.parquet
Path specs config to ingest folders `orders` and `returns` as datasets:
path_specs:
- include: abfs://test-bucket@{storage account name}.dfs.core.windows.net/{table}/*/*/*.parquet
Example 4 - Advanced - Either Individual file OR Folder of files as Dataset
Bucket structure:
test-bucket
├── customers
│ ├── part1.json
│ ├── part2.json
│ ├── part3.json
│ └── part4.json
├── employees.csv
├── food_items.csv
├── tmp_10101000.csv
└── orders
└── year=2022
└── month=2
├── 1.parquet
├── 2.parquet
└── 3.parquet
Path specs config:
path_specs:
- path_spec_1: abfs://test-bucket@{storage account name}.dfs.core.windows.net/*.csv
- path_spec_2: abfs://test-bucket@{storage account name}.dfs.core.windows.net/{table}/*.json
- path_spec_3: abfs://test-bucket@{storage account name}.dfs.core.windows.net/{table}/*/*/*.parquet
The above config has 3 path_specs and will ingest the following datasets:
employees.csv - Single File as Dataset
food_items.csv - Single File as Dataset
customers - Folder as Dataset
orders - Folder as Dataset
It will ignore the file tmp_10101000.csv.
Valid path_specs.include
abfs://{container name}@{storage account name}.dfs.core.windows.net/foo/tests/bar.csv # single file table
abfs://{container name}@{storage account name}.dfs.core.windows.net/foo/tests/*.* # multiple file level tables
abfs://{container name}@{storage account name}.dfs.core.windows.net/foo/tests/{table}/*.parquet # table without partition
abfs://{container name}@{storage account name}.dfs.core.windows.net/tests/{table}/*/*.csv # table where partitions are not specified
abfs://{container name}@{storage account name}.dfs.core.windows.net/tests/{table}/*.* # table with no partitions and no data type specified
abfs://{container name}@{storage account name}.dfs.core.windows.net/{dept}/tests/{table}/*.parquet # specifying keywords to be used in display name
Supported file types
CSV (`*.csv`)
JSON (`*.json`)
JSONL (`*.jsonl`)
Parquet (`*.parquet`)
Notes
Data Quality Monitoring is no longer supported for ADLS sources. Only cataloging is available.
For advanced dataset structures, add multiple `path_spec` entries as needed, each with its own file type and settings.