Azure Data Lake Storage (ADLS)

Azure Data Lake Storage (ADLS) is a scalable and secure data lake solution from Microsoft Azure designed to handle the vast amounts of data generated by modern applications.

Minimum Requirements

To connect your ADLS to Decube, the following information is required:

  • Tenant ID

  • Client ID

  • Client Secret

Credentials Needed

Setup on Microsoft Azure

  1. On the Azure Home Page, go to Azure Active Directory. The Tenant ID can be copied from the Basic information section.

  2. Go to App registrations.

  3. Click on New registration and provide a name for the application.

  4. Click Register.

  5. Save the Application (client) ID and Directory (tenant) ID.

  6. Click Add a certificate or secret.

  7. Go to Client secrets and click + New client secret.

  8. Click +Add.

  9. Copy and save the Value for the client secret.

Assigning Role to Credentials

  1. Click on the storage account you wish to connect with Decube.

  2. Click on Access Control (IAM) -> click + Add -> click Add role assignment.

  3. Find the role called Storage Blob Data Reader, click on it, and click Next.

  4. On the next page, search for the name of the application that you just created on Microsoft Entra ID.

  5. Assign it to the role.

Path Specs

Path Specs (path_specs) is a list of Path Spec (path_spec) objects, where each individual path_spec represents one or more datasets. The include path (path_spec.include) is a formatted path to the dataset. This path must end with *.* or *.[ext] to represent the leaf level. If *.[ext] is provided, only files with the specified extension type will be scanned. ".[ext]" can be any of the supported file types. Refer to example 1 below for more details.

All folder levels need to be specified in the include path. You can use /*/ to represent a folder level and avoid specifying an exact folder name. To map a folder as a dataset, use the {table} placeholder to represent the folder level for which the dataset is to be created. Refer to examples 2 and 3 below for more details.

Exclude paths (path_spec.exclude) can be used to ignore paths that are not relevant to the current path_spec. These paths cannot contain named variables ( {} ). Exclude paths can use ** to represent multiple folder levels. Refer to example 4 below for more details.
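Putting this together, a minimal sketch of a path_specs list with both an include and an exclude might look like the following, assuming path_specs, include, and exclude map directly to config keys as described above (the folder layout here is hypothetical; see the examples below for concrete cases):

    path_specs:
      - include: abfs://{table}/*/*.csv
        exclude:
          - "**/tmp/**"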

Building Path Spec for ADLS

  • Identify the storage account you want to connect to Decube.

  • Take note of:

    • Storage account name

    • Container name

    • Folder path

Follow this schema when building a path spec:

"abfs://{container name}@{storage account name}{folder path}"
"abfs://*.*"// Some code

Path Specs - Examples

Example 1 - Individual file as Dataset

Bucket structure:

├── employees.csv
├── departments.json
└── food_items.csv

Path specs config to ingest employees.csv and food_items.csv as datasets:

    - include: abfs://*.csv

This will automatically ignore the departments.json file. To include it, use *.* instead of *.csv, as shown below.
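For instance, to ingest all three files regardless of their extension:

    - include: abfs://*.*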

Example 2 - Folder of files as Dataset (without Partitions)

Bucket structure:

└──  offers
     ├── 1.csv
     └── 2.csv

Path specs config to ingest folder offers as dataset:

    - include: abfs://{table}/*.csv

{table} represents the folder for which the dataset will be created.

Example 3 - Folder of files as Dataset (with Partitions)

Bucket structure:

├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── returns
    └── year=2021
        └── month=2
            └── 1.parquet

Path specs config to ingest folders orders and returns as datasets:

    - include: abfs://{table}/*/*/*.parquet

Example 4 - Folder of files as Dataset (with Partitions), and Exclude Filter

Bucket structure:

├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── tmp_orders
    └── year=2021
        └── month=2
            └── 1.parquet

Path specs config to ingest folder orders as dataset but not folder tmp_orders:

    - include: abfs://{table}/*/*/*.parquet
      exclude:
        - "**/tmp_orders/**"

Example 5 - Advanced - Either Individual file OR Folder of files as Dataset

Bucket structure:

├── customers
│   ├── part1.json
│   ├── part2.json
│   ├── part3.json
│   └── part4.json
├── employees.csv
├── food_items.csv
├── tmp_10101000.csv
└──  orders
    └── year=2022
        └── month=2
            ├── 1.parquet
            ├── 2.parquet
            └── 3.parquet

Path specs config:

    - include: abfs://*.csv
      exclude:
        - "**/tmp_10101000.csv"
    - include: abfs://{table}/*.json
    - include: abfs://{table}/*/*/*.parquet

The above config has 3 path_specs and will ingest the following datasets:

  • employees.csv - Single File as Dataset

  • food_items.csv - Single File as Dataset

  • customers - Folder as Dataset

  • orders - Folder as Dataset and will ignore file tmp_10101000.csv

Valid path_specs.include

abfs:// # single file table
abfs://*.* # multiple file level tables
abfs://{table}/*.parquet # table without partition
abfs://{table}/*/*.csv # table where partitions are not specified
abfs://{table}/*.* # table where neither partitions nor file type are specified
abfs://{dept}/tests/{table}/*.parquet # specifying keywords to be used in display name

Valid path_specs.exclude

- */tests/**
- abfs://**
- */tests/*.csv
- abfs://*/my_table/**

Supported file types

  • CSV (*.csv)

  • TSV (*.tsv)

  • JSON (*.json)

  • JSONL (*.jsonl)

  • Parquet (*.parquet)

  • Avro (*.avro) [beta]

Table format:

  • Apache Iceberg [beta]

  • Delta table [beta]

Schemas for Parquet and Avro files are extracted as provided.

Schemas for schemaless formats (CSV, TSV, JSON) are inferred. For CSV, TSV, and JSONL files, we consider the first 100 rows by default. JSON file schemas are inferred on the basis of the entire file (given the difficulty of extracting only the first few objects of the file), which may impact performance.
