Azure Data Lake Storage (ADLS)
Azure Data Lake Storage (ADLS) is a scalable and secure data lake solution from Microsoft Azure designed to handle the vast amounts of data generated by modern applications.
Supported Capabilities
Metadata Extraction: ✅
Metadata Types Collected: Schema, Table, Column
Data Profiling: ❌
Data Preview: ✅
Data Quality: ❌
Configurable Collection: ❌
External Table: ❌
View Table: ❌
Stored Procedure: ❌
Minimum Requirements
To connect your ADLS to Decube, the following information is required:
Tenant ID
Client ID
Client Secret
Firewall and connectivity configuration
By default, Azure Storage accounts may allow access from all networks. However, if your organization requires Public network access to be disabled for security compliance, you must explicitly whitelist Decube's IP addresses to allow our connectors to access your Data Lake.
Follow the steps below to configure your firewall settings.
1. Navigate to Networking Settings
Log in to the Azure Portal and navigate to your Storage Account.
In the left-hand sidebar, under Security + networking, select Networking.
Under the Firewalls and virtual networks tab, locate the Public network access setting.
2. Enable Access for Selected Networks
To allow Decube to connect while keeping the storage account private from the general public:
Select Enabled from selected virtual networks and IP addresses.
This option enables the Firewall section below it, where you can specify allowed IP addresses.

3. Whitelist Decube IP Addresses
In the Firewall section, add the IP addresses corresponding to the region where your Decube SaaS instance is hosted. See the IP Whitelisting section for the list of IP addresses.
Credentials setup
Setup on Microsoft Azure
1. On the Azure Home Page, go to Microsoft Entra ID (formerly Azure Active Directory). The Tenant ID can be copied from the Basic information section.
2. Go to App registrations.
3. Click on New registration.
4. Click Register.
5. Save the Application (client) ID and Directory (tenant) ID.
6. Click Add a certificate or secret, then go to Client secrets and click + New client secret.
7. Click + Add.
8. Copy and save the Value for the client secret.
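Under the hood, these three values are used for the standard Microsoft identity platform client-credentials (service principal) flow. The sketch below builds the token request a client would send to authenticate against Azure Storage; Decube performs this step for you, and the IDs shown are hypothetical placeholders.

```python
from urllib.parse import urlencode

def build_token_request(tenant_id: str, client_id: str, client_secret: str):
    """Build the Azure AD OAuth2 client-credentials token request.
    Illustrative only -- the connector handles authentication itself."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # Scope for Azure Storage data-plane access
        "scope": "https://storage.azure.com/.default",
    })
    return url, body

# Hypothetical values -- substitute the IDs and secret saved above.
url, body = build_token_request("my-tenant-id", "my-client-id", "my-secret")
print(url)
```

If this flow succeeds, the returned bearer token is what grants access to the storage account, which is why the role assignment in the next section is required.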
Assigning Role to Credentials
From Azure Services, find and click on Storage Accounts. You should be able to see the option for Access control (IAM) on the left sidebar.

Click Access control (IAM) -> click + Add -> click Add role assignment.

Find the role called Storage Blob Data Reader, select it, and click Next.
On the next page, search for the name of the application that you just created on Microsoft Entra ID.
Assign it to the role.

Path Specs
Path Specs (path_specs) is a list of Path Spec (path_spec) objects where each individual path_spec represents one or more datasets for cataloging in ADLS.
The provided path specification MUST end with *.* or *.[ext] to represent the leaf level. (Note: here * is not a wildcard symbol.) If *.[ext] is provided, only files with the specified extension will be scanned. .[ext] can be any of the supported file types listed below.
Each path_spec represents only one file type (e.g., only *.csv or only *.parquet). To ingest multiple file types, add multiple path_spec entries.
SingleFile path spec: a path spec without {table} (targets individual files).
MultiFile path spec: a path spec with {table} (targets folders as datasets).
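The leaf-level rule and the SingleFile/MultiFile distinction can be sketched as a small validator. This is a hypothetical helper (not part of Decube's API) that mirrors the rules above; the example URLs use made-up account and container names.

```python
import re

SUPPORTED_EXTS = {"csv", "json", "jsonl", "parquet"}

def validate_path_spec(spec: str) -> str:
    """Check the leaf-level rule and classify a path spec.
    Returns 'SingleFile' or 'MultiFile'; raises on invalid specs."""
    leaf = spec.rsplit("/", 1)[-1]
    # The leaf must be *.* or *.[ext] for a supported extension
    m = re.fullmatch(r"\*\.(\*|[a-z]+)", leaf)
    if not m or (m.group(1) != "*" and m.group(1) not in SUPPORTED_EXTS):
        raise ValueError(f"path spec must end with *.* or *.[ext]: {spec}")
    # A {table} token means folders are treated as datasets
    return "MultiFile" if "{table}" in spec else "SingleFile"

print(validate_path_spec("https://acct.blob.core.windows.net/data/hr/*.csv"))          # SingleFile
print(validate_path_spec("https://acct.blob.core.windows.net/data/{table}/*.parquet")) # MultiFile
```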
PathSpec Structure
Take note of the following parameters when building a path spec:
Storage account name
Container name
Folder path

Follow this schema when building a path spec:
Include only datasets that match this pattern: If the path spec specifies a table and a regex is provided, only datasets that match the regex will be included.
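The include pattern can be thought of as a filter over discovered table (folder) names. The sketch below is a hypothetical illustration assuming full-string regex matching; the exact matching semantics are determined by the connector.

```python
import re

def filter_tables(table_names, include_regex=None):
    """Apply the optional include regex: only table (folder) names
    matching the pattern become datasets."""
    if include_regex is None:
        return list(table_names)
    pattern = re.compile(include_regex)
    return [t for t in table_names if pattern.fullmatch(t)]

# Only folders matching 'sales_<digits>' are ingested as datasets.
print(filter_tables(["sales_2023", "sales_2024", "tmp_scratch"], r"sales_\d+"))
# → ['sales_2023', 'sales_2024']
```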
File Format Settings
CSV
delimiter (default: ,)
escape_char (default: \\)
quote_char (default: ")
has_headers (default: true)
skip_n_line (default: 0)
file_encoding (default: UTF-8; supported: ASCII, UTF-8, UTF-16, UTF-32, Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN, EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP, EUC-KR, ISO-2022-KR, Johab, KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251, MacRoman, ISO-8859-7, windows-1253, ISO-8859-8, windows-1255, TIS-620)
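To see how these settings interact, the snippet below parses a small CSV with Python's standard csv module using the same defaults, plus skip_n_line set to 1 to skip a banner line. The file content is invented for illustration; it does not reflect Decube's internal parser.

```python
import csv
import io

# Hypothetical CSV content: one banner line, then a header and two rows.
raw = 'exported 2024-01-01\nid,name\n1,"Lee, A"\n2,Kim\n'

settings = {"delimiter": ",", "quote_char": '"',
            "has_headers": True, "skip_n_line": 1}

buf = io.StringIO(raw)
for _ in range(settings["skip_n_line"]):   # skip banner lines first
    next(buf)
reader = csv.reader(buf, delimiter=settings["delimiter"],
                    quotechar=settings["quote_char"])
header = next(reader) if settings["has_headers"] else None
rows = list(reader)
print(header)  # ['id', 'name']
print(rows)    # [['1', 'Lee, A'], ['2', 'Kim']]
```

Note how quote_char keeps the delimiter inside "Lee, A" from splitting the field.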
Parquet
No options
JSON/JSONL
file_encoding (default: UTF-8; see the CSV section above for supported encodings)
Delta Table
When selecting Format = Delta Table, the path spec MUST include the named token {table}. The connector expects the {table} token in the path spec so it can discover Delta table roots. No per-file options are required.
Additional points to note
Folder names must not contain the characters {, }, *, or /.
The named variable {folder} is reserved for internal use; do not use it in your path specs.
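These constraints can be enforced up front with a small check. This is a hypothetical validator mirroring the rules above, not part of the connector:

```python
# Characters that would break path-spec parsing if used in folder names
FORBIDDEN = set("{}*/")

def check_folder_name(name: str) -> None:
    """Raise if a folder name contains forbidden characters."""
    bad = FORBIDDEN & set(name)
    if bad:
        raise ValueError(
            f"folder name {name!r} contains forbidden characters: {sorted(bad)}")

check_folder_name("sales_2024")      # fine
try:
    check_folder_name("sales*backup")
except ValueError as e:
    print(e)
```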
Example Path Specs
Example 1 - Individual file as Dataset (SingleFile pathspec)
Bucket structure:
Path specs config to ingest employees.csv and food_items.csv as datasets:
This will automatically ignore departments.json file. To include it, use *.* instead of *.csv.
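The original example config is not reproduced here; a path spec for this layout might look like the following, where the storage account name (mystorageacct), container (mycontainer), and folder (mydir) are hypothetical placeholders:

```
https://mystorageacct.blob.core.windows.net/mycontainer/mydir/*.csv
```

Because this is a SingleFile path spec (no {table} token), each matching .csv file becomes its own dataset.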
Example 2 - Folder of files as Dataset (without Partitions)
Bucket structure:
Path specs config to ingest folder offers as dataset:
{table} represents folder for which dataset will be created.
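The example config itself is not reproduced here; an illustrative MultiFile path spec for this layout, with a hypothetical account and container name, might look like:

```
https://mystorageacct.blob.core.windows.net/mycontainer/{table}/*.*
```

Here the {table} token would match the offers folder, so the folder (not its individual files) is cataloged as one dataset.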
Example 3 - Folder of files as Dataset (with Partitions)
Bucket structure:
Path specs config to ingest folders orders and returns as datasets:
Example 4 - Advanced - Either Individual file OR Folder of files as Dataset
Bucket structure:
Path specs config:
Above config has 3 path_specs and will ingest following datasets
employees.csv - Single File as Dataset
food_items.csv - Single File as Dataset
customers - Folder as Dataset
orders - Folder as Dataset
The file tmp_10101000.csv will be ignored.
Valid path_specs.include
Example 5 - Delta Table (ADLS)
For Delta Table support, include {table} in the path. Example simple path spec for ADLS:
The connector will interpret {table} as the table root for each Delta table.
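As an illustration, a Delta path spec could look like the following; the account name, container, and folder are hypothetical, and whether your setup uses the abfss:// or https:// scheme depends on how your storage endpoint is addressed:

```
abfss://mycontainer@mystorageacct.dfs.core.windows.net/delta/{table}
```

Each folder matched by {table} under delta/ would be treated as one Delta table root.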
Supported file types
CSV (*.csv)
JSON (*.json)
JSONL (*.jsonl)
Parquet (*.parquet)
Delta (Delta Table)
Notes
Data Quality Monitoring is no longer supported for ADLS sources. Only cataloging is available.
For advanced dataset structures, add multiple path_spec entries as needed, each with its own file type and settings.