Azure Data Lake Storage (ADLS)

Azure Data Lake Storage (ADLS) is a scalable and secure data lake solution from Microsoft Azure designed to handle the vast amounts of data generated by modern applications.

Supported Capabilities

| Capability | Supported |
| --- | --- |
| Metadata Extraction | Yes |
| Metadata Types Collected | Schema, Table, Column |
| Data Profiling | No |
| Data Preview | No |
| Data Quality | No (see Notes) |
| Configurable Collection | Yes |
| External Table | No |
| View Table | No |
| Stored Procedure | No |

Minimum Requirements

To connect your ADLS to Decube, the following information is required:

  • Tenant ID

  • Client ID

  • Client Secret

Potential Data Egress

Under the SaaS deployment model, data must be transferred from the storage container to the Data Plane to inspect files, retrieve schema information, and perform data quality monitoring. If this is not preferable, you may opt for a Self-Hosted deployment model or bring your own Azure Function (see Azure Function for Metadata).

Firewall and connectivity configuration

By default, Azure Storage accounts may allow access from all networks. However, if your organization requires Public network access to be disabled for security compliance, you must explicitly whitelist Decube's IP addresses to allow our connectors to access your Data Lake.

Follow the steps below to configure your firewall settings.

1. Navigate to Networking Settings

  1. Log in to the Azure Portal and navigate to your Storage Account.

  2. In the left-hand sidebar, under Security + networking, select Networking.

  3. Under the Firewalls and virtual networks tab, locate the Public network access setting.

2. Enable Access for Selected Networks

To allow Decube to connect while keeping the storage account private from the general public:

  1. Select Enabled from selected virtual networks and IP addresses.

  2. This option enables the Firewall section below it, where you can specify allowed IP addresses.

3. Whitelist Decube IP Addresses

In the Firewall section, add the IP addresses corresponding to the region where your Decube SaaS instance is hosted. See the IP Whitelisting section for the list of IP addresses.
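If you prefer to script these steps, the Azure CLI can apply the same settings. A minimal sketch, assuming a storage account named mystorageaccount in resource group my-rg, with 203.0.113.10 standing in for one of Decube's published IPs:

```bash
# Restrict public access to selected networks only
az storage account update \
  --name mystorageaccount \
  --resource-group my-rg \
  --default-action Deny

# Whitelist one Decube IP (repeat for each address in the IP Whitelisting list)
az storage account network-rule add \
  --account-name mystorageaccount \
  --resource-group my-rg \
  --ip-address 203.0.113.10
```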

IP Whitelisting

Credentials setup

Setup on Microsoft Azure

  1. On the Azure Home Page, go to Azure Active Directory (Microsoft Entra ID). The Tenant ID can be copied from the Basic information section.

  2. Go to App registrations.

  3. Click on New registration and give the application a name.

  4. Click Register.

  5. Save the Application (client) ID and Directory (tenant) ID.

  6. Click Add a certificate or secret.

  7. Go to Client secrets and click + New client secret.

  8. Click + Add.

  9. Copy and save the Value for the client secret.
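The same registration can be scripted with the Azure CLI. A minimal sketch, where the display name decube-connector is an assumption and <app-id> is the appId returned by the first command:

```bash
# Register the application; note the appId (Client ID) in the output
az ad app create --display-name decube-connector

# Create a client secret; the "password" field in the output is the secret Value
az ad app credential reset --id <app-id> --append

# Print the Tenant ID
az account show --query tenantId -o tsv
```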

Assigning Role to Credentials

  1. From Azure Services, find and click on Storage Accounts, then select your storage account. You should see the option for Access control (IAM) in the left sidebar.

  2. Click on Access control (IAM) -> click + Add -> click Add role assignment.

  3. Find the role called Storage Blob Data Reader, select it, and click Next.

  4. On the next page, search for the name of the application that you just created in Microsoft Entra ID.

  5. Assign it to the role.
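Equivalently, the role can be granted with the Azure CLI. A sketch, with placeholder subscription, resource group, storage account, and app ID values:

```bash
# Grant the registered app read access to blob data on the storage account
az role assignment create \
  --assignee <app-id> \
  --role "Storage Blob Data Reader" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
```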

Path Specs

Path Specs (path_specs) is a list of Path Spec (path_spec) objects where each individual path_spec represents one or more datasets for cataloging in ADLS.

The provided path specification MUST end with *.* or *.[ext] to represent the leaf level. (Note: here * is not a wildcard symbol.) If *.[ext] is provided, only files with the specified extension will be scanned. .[ext] can be any of the supported file types listed below.

Each path_spec represents only one file type (e.g., only *.csv or only *.parquet). To ingest multiple file types, add multiple path_spec entries.

  • SingleFile pathspec: a path spec without {table} (targets individual files).

  • MultiFile pathspec: a path spec with {table} (targets folders as datasets).

PathSpec Structure

  • Take note of the following parameters when building a path spec:

    • Storage account name

    • Container name

    • Folder path

Follow this schema when building a path spec:
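A sketch of the general shape, where the angle-bracket segments are placeholders for the parameters above and {table} is the optional named token (the exact URI scheme shown in your Decube console may differ):

```
https://<storage-account>.blob.core.windows.net/<container>/<folder-path>/{table}/*.<ext>
```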

Include only datasets that match this pattern: If the path spec specifies a table and a regex is provided, only datasets that match the regex will be included.

File Format Settings

  • CSV

    • delimiter (default: ,)

    • escape_char (default: \\)

    • quote_char (default: ")

    • has_headers (default: true)

    • skip_n_line (default: 0)

    • file_encoding (default: UTF-8; supported: ASCII, UTF-8, UTF-16, UTF-32, Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN, EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP, EUC-KR, ISO-2022-KR, Johab, KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251, MacRoman, ISO-8859-7, windows-1253, ISO-8859-8, windows-1255, TIS-620)

  • Parquet

    • No options

  • JSON/JSONL

    • file_encoding (default: UTF-8; see above for supported encodings)

  • Delta Table

When selecting Format = Delta Table, the path spec MUST include the named token {table}; the connector uses it to discover Delta table roots. No per-file options are required.
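As an illustration of the CSV options above, the defaults can be written out as a config-style snippet (the surrounding key: value syntax is illustrative; in Decube these settings are configured per path spec):

```
format: csv
delimiter: ","
escape_char: "\\"
quote_char: "\""
has_headers: true
skip_n_line: 0
file_encoding: UTF-8
```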

Additional points to note

  • Folder names should not contain {, }, *, or /.

  • The named variable {folder} is reserved for internal use; please do not use it as a named variable.

Example Path Specs

Example 1 - Individual file as Dataset (SingleFile pathspec)

Bucket structure:
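Assuming a container named data-container holding the files named in this example:

```
data-container
├── employees.csv
├── departments.json
└── food_items.csv
```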

Path specs config to ingest employees.csv and food_items.csv as datasets:
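A sketch of the include path, with a placeholder storage account and the assumed container name:

```
path_specs:
  - include: https://<storage-account>.blob.core.windows.net/data-container/*.csv
```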

This will automatically ignore the departments.json file. To include it, use *.* instead of *.csv.

Example 2 - Folder of files as Dataset (without Partitions)

Bucket structure:
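Assuming the same container with a single first-level folder:

```
data-container
└── offers
    ├── 1.csv
    └── 2.csv
```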

Path specs config to ingest folder offers as dataset:
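A sketch, again with placeholder names:

```
path_specs:
  - include: https://<storage-account>.blob.core.windows.net/data-container/{table}/*.csv
```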

{table} represents the folder for which the dataset will be created.

Example 3 - Folder of files as Dataset (with Partitions)

Bucket structure:
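Assuming year/month partition folders (the partition names are assumptions for illustration):

```
data-container
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── returns
    └── year=2022
        └── month=2
            └── 1.parquet
```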

Path specs config to ingest folders orders and returns as datasets:
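A sketch where the two wildcard segments stand for the two partition levels:

```
path_specs:
  - include: https://<storage-account>.blob.core.windows.net/data-container/{table}/*/*/*.parquet
```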

Example 4 - Advanced - Either Individual file OR Folder of files as Dataset

Bucket structure:
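A hypothetical layout combining individual files and folders (the file types per folder are assumptions, chosen so that each path_spec carries a single file type):

```
data-container
├── employees.csv
├── food_items.csv
├── customers
│   ├── part1.json
│   └── part2.json
└── orders
    ├── part1.csv
    ├── part2.csv
    └── tmp_10101000.csv
```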

Path specs config:
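A sketch with three path_spec entries (the include/exclude syntax is illustrative):

```
path_specs:
  # 1) individual .csv files at the container root
  - include: https://<storage-account>.blob.core.windows.net/data-container/*.csv

  # 2) each first-level folder of .json files as a dataset (customers)
  - include: https://<storage-account>.blob.core.windows.net/data-container/{table}/*.json

  # 3) each first-level folder of .csv files as a dataset (orders), skipping the temp file
  - include: https://<storage-account>.blob.core.windows.net/data-container/{table}/*.csv
    exclude:
      - '**/tmp_10101000.csv'
```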

The above config has 3 path_specs and will ingest the following datasets:

  • employees.csv - Single File as Dataset

  • food_items.csv - Single File as Dataset

  • customers - Folder as Dataset

  • orders - Folder as Dataset; the file tmp_10101000.csv will be ignored

Valid path_specs.include
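Illustrative patterns consistent with the rules above (angle brackets are placeholders):

```
https://<storage-account>.blob.core.windows.net/<container>/*.*
https://<storage-account>.blob.core.windows.net/<container>/*.csv
https://<storage-account>.blob.core.windows.net/<container>/<folder>/{table}/*.parquet
https://<storage-account>.blob.core.windows.net/<container>/<folder>/{table}/*/*/*.csv
```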

Example 5 - Delta Table (ADLS)

For Delta Table support, include {table} in the path. Example simple path spec for ADLS:
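A sketch with placeholder names, where {table} marks each Delta table's root folder (whether a trailing file pattern is also accepted may depend on your console version):

```
path_specs:
  - include: https://<storage-account>.blob.core.windows.net/<container>/delta/{table}
```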

The connector will interpret {table} as the table root for each Delta table.

Supported file types

  • CSV (*.csv)

  • JSON (*.json)

  • JSONL (*.jsonl)

  • Parquet (*.parquet)

  • Delta (Delta Table)

Notes

  • Data Quality Monitoring is no longer supported for ADLS sources. Only cataloging is available.

  • For advanced dataset structures, add multiple path_spec entries as needed, each with its own file type and settings.
