Azure Data Lake Storage (ADLS)

Azure Data Lake Storage (ADLS) is a scalable and secure data lake solution from Microsoft Azure designed to handle the vast amounts of data generated by modern applications.

Supported Capabilities

  • Catalog

  • Data Preview

Minimum Requirement

To connect your ADLS to Decube, the following information is required:

  • Tenant ID

  • Client ID

  • Client Secret

Potential Data Egress

Under the SaaS deployment model, data must be transferred from the storage container to the Data Plane to inspect files, retrieve schema information, and perform data quality monitoring. If this is not preferable, you may opt for a Self-Hosted deployment model or bring your own Azure Function for Metadata.

Setup on Microsoft Azure

  1. On the Azure Home Page, go to Azure Active Directory. The Tenant ID can be copied from the Basic information section.

  2. Go to App registrations.

  3. Click on New registration.

  4. Click Register.

  5. Save the Application (client) ID and Directory (tenant) ID.

  6. Click Add a certificate or secret.

  7. Go to Client secrets and click + New client secret.

  8. Click + Add.

  9. Copy and save the Value for the client secret.

Assigning Role to Credentials

  1. From Azure Services, find and click on Storage Accounts. You should be able to see the option for Access control (IAM) on the left sidebar.

  2. Click on Access control (IAM) -> click + Add -> click Add role assignment.

  3. Find the role called Storage Blob Data Reader, click on it, and click Next.

  4. On the next page, search for the name of the application that you just created on Microsoft Entra ID.

  5. Assign it to the role.
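
Once the role is assigned, you can optionally sanity-check the credentials outside of Decube. The sketch below is not part of the product and uses placeholder values; it relies on the azure-identity and azure-storage-file-datalake packages to authenticate with the Tenant ID, Client ID, and Client Secret and list containers, which should succeed once Storage Blob Data Reader has propagated.

from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders: substitute the values saved during app registration.
credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)

# Placeholder storage account; Storage Blob Data Reader must be assigned on it.
service = DataLakeServiceClient(
    account_url="https://<storage-account-name>.dfs.core.windows.net",
    credential=credential,
)

# Listing containers (file systems) confirms the credentials and role assignment work.
for file_system in service.list_file_systems():
    print(file_system.name)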

Path Specs

Path Specs (path_specs) is a list of Path Spec (path_spec) objects where each individual path_spec represents one or more datasets for cataloging in ADLS.

The provided path specification MUST end with *.* or *.[ext] to represent the leaf level. (Note: here * is not a wildcard symbol.) If *.[ext] is provided, only files with the specified extension will be scanned. .[ext] can be any of the supported file types listed below.

Each path_spec represents only one file type (e.g., only *.csv or only *.parquet). To ingest multiple file types, add multiple path_spec entries.

  • SingleFile pathspec: a path spec without {table} (targets individual files).

  • MultiFile pathspec: a path spec with {table} (targets folders as datasets).
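
For intuition, here is a small, hypothetical Python sketch (not part of Decube; the container and storage account names are placeholders) that applies these rules to an include string: it checks that the path ends with *.* or *.[ext] for a supported extension and classifies the spec as SingleFile or MultiFile depending on whether {table} is present.

import re

SUPPORTED_EXTS = {"csv", "json", "jsonl", "parquet"}  # supported file types (see below)

def classify_path_spec(include: str) -> str:
    """Hypothetical helper: validate and classify a path_spec include string."""
    leaf = include.rsplit("/", 1)[-1]
    # The leaf level must be "*.*" or "*.[ext]" with a supported extension.
    if leaf != "*.*":
        match = re.fullmatch(r"\*\.(\w+)", leaf)
        if not match or match.group(1) not in SUPPORTED_EXTS:
            raise ValueError(f"path spec must end with *.* or *.[ext], got {leaf!r}")
    # A spec with {table} treats folders as datasets; without it, each file is a dataset.
    return "MultiFile (folder as dataset)" if "{table}" in include else "SingleFile (file as dataset)"

print(classify_path_spec("abfs://test-bucket@storageaccount.dfs.core.windows.net/{table}/*.csv"))
print(classify_path_spec("abfs://test-bucket@storageaccount.dfs.core.windows.net/*.*"))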

PathSpec Structure

  • Take note of the following parameters when building a path spec:

    • Storage account name

    • Container name

    • Folder path

Follow this schema when building a path spec:

"abfs://{container name}@{storage account name}.dfs.core.windows.net/{folder path}"
example
"abfs://[email protected]/second/*.*"// Some code

Include only datasets that match this pattern: if the path spec contains {table} and a regex is provided, only datasets whose names match the regex will be included. For example, a regex such as orders.* would include a dataset named orders but skip one named returns.

File Format Settings

  • CSV

    • delimiter (default: ,)

    • escape_char (default: \\)

    • quote_char (default: ")

    • has_headers (default: true)

    • skip_n_line (default: 0)

    • file_encoding (default: UTF-8; supported: ASCII, UTF-8, UTF-16, UTF-32, Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN, EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP, EUC-KR, ISO-2022-KR, Johab, KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251, MacRoman, ISO-8859-7, windows-1253, ISO-8859-8, windows-1255, TIS-620)

  • Parquet

    • No options

  • JSON/JSONL

    • file_encoding (default: UTF-8; see above for supported encodings)
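
As a rough illustration of what the CSV settings above control, the snippet below maps them onto equivalent pandas reader options. This is only a sketch of the semantics, not how Decube ingests files; the file name is a placeholder.

import pandas as pd

# Illustrative only: mapping the CSV settings onto pandas' reader.
df = pd.read_csv(
    "employees.csv",
    sep=",",           # delimiter
    escapechar="\\",   # escape_char (a single backslash)
    quotechar='"',     # quote_char
    header=0,          # has_headers: true -> first row is the header
    skiprows=0,        # skip_n_line
    encoding="utf-8",  # file_encoding
)
print(df.head())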

Additional points to note

  • Folder names should not contain the characters {, }, *, or /.

  • The named variable {folder} is reserved for internal use. Please do not use it as a named variable.

Example Path Specs

Example 1 - Individual file as Dataset (SingleFile pathspec)

Bucket structure:

test-bucket
├── employees.csv
├── departments.json
└── food_items.csv

Path specs config to ingest employees.csv and food_items.csv as datasets:

path_specs:
    - include: abfs://test-bucket@storageaccount.dfs.core.windows.net/*.csv

This will automatically ignore the departments.json file. To include it, use *.* instead of *.csv.

Example 2 - Folder of files as Dataset (without Partitions)

Bucket structure:

test-bucket
└──  offers
     ├── 1.csv
     └── 2.csv

Path specs config to ingest folder offers as dataset:

path_specs:
    - include: abfs://test-bucket@storageaccount.dfs.core.windows.net/{table}/*.csv

{table} represents the folder for which the dataset will be created.

Example 3 - Folder of files as Dataset (with Partitions)

Bucket structure:

test-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── returns
    └── year=2021
        └── month=2
            └── 1.parquet

Path specs config to ingest folders orders and returns as datasets:

path_specs:
    - include: abfs://test-bucket@storageaccount.dfs.core.windows.net/{table}/*/*/*.parquet

Example 4 - Advanced - Either Individual file OR Folder of files as Dataset

Bucket structure:

test-bucket
├── customers
│   ├── part1.json
│   ├── part2.json
│   ├── part3.json
│   └── part4.json
├── employees.csv
├── food_items.csv
├── tmp_10101000.csv
└──  orders
    └── year=2022
        └── month=2
            ├── 1.parquet
            ├── 2.parquet
            └── 3.parquet

Path specs config:

path_specs:
    - include: abfs://test-bucket@storageaccount.dfs.core.windows.net/*.csv
    - include: abfs://test-bucket@storageaccount.dfs.core.windows.net/{table}/*.json
    - include: abfs://test-bucket@storageaccount.dfs.core.windows.net/{table}/*/*/*.parquet

The above config has 3 path_specs and will ingest the following datasets:

  • employees.csv - Single File as Dataset

  • food_items.csv - Single File as Dataset

  • customers - Folder as Dataset

  • orders - Folder as Dataset (the file tmp_10101000.csv will be ignored)

Valid path_specs.include

abfs://test-bucket@storageaccount.dfs.core.windows.net/foo/tests/bar.csv # single file table
abfs://test-bucket@storageaccount.dfs.core.windows.net/foo/tests/*.* # multiple file level tables
abfs://test-bucket@storageaccount.dfs.core.windows.net/foo/tests/{table}/*.parquet # table without partition
abfs://test-bucket@storageaccount.dfs.core.windows.net/tests/{table}/*/*.csv # table where partitions are not specified
abfs://test-bucket@storageaccount.dfs.core.windows.net/tests/{table}/*.* # table where neither partitions nor file type are specified
abfs://test-bucket@storageaccount.dfs.core.windows.net/{dept}/tests/{table}/*.parquet # specifying keywords to be used in display name

Supported file types

  • CSV (*.csv)

  • JSON (*.json)

  • JSONL (*.jsonl)

  • Parquet (*.parquet)

Notes

  • Data Quality Monitoring is no longer supported for ADLS sources. Only cataloging is available.

  • For advanced dataset structures, add multiple path_spec entries as needed, each with its own file type and settings.
