# Azure Data Lake Storage (ADLS)

## Supported Capabilities

{% tabs %}
{% tab title="Supported Capabilities" %}
**General**

* **Metadata** — metadata extraction and display of asset information (tables, columns, schemas). Types collected: Schema, Table, Column
* **Preview** — sample data preview
{% endtab %}

{% tab title="Not Supported" %}
**General**

* Profiling
* Data Quality
* Configurable Collection
* External Table
* View Table
* Stored Procedure
{% endtab %}
{% endtabs %}

To connect your ADLS to Decube, the following information is required:

* Tenant ID
* Client ID
* Client Secret
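
If you want to sanity-check these values before configuring the connector, one quick test (assuming the Azure CLI is installed) is to sign in as the service principal you will register below:

```bash
# Signs in with the service principal; fails fast if any of the three values is wrong.
az login --service-principal \
  --username "<client-id>" \
  --password "<client-secret>" \
  --tenant "<tenant-id>"
```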

{% hint style="info" %}
**Potential Data Egress**

Under the SaaS deployment model, data must be transferred from the storage container to the Data Plane to inspect files and retrieve schema information.\
\
If this is not preferable, you may opt for the Self-Hosted deployment model or bring your own Azure Function: [azure-function-for-metadata](https://docs.decube.io/datalake/azure-data-lake-storage-adls/azure-function-for-metadata "mention")
{% endhint %}

## **Firewall and connectivity configuration**

By default, Azure Storage accounts may allow access from all networks. However, if your organization requires public network access to be restricted for security compliance, you must explicitly whitelist Decube's IP addresses so that our connectors can access your Data Lake.

Follow the steps below to configure your firewall settings.

**1. Navigate to Networking Settings**

1. Log in to the Azure Portal and navigate to your Storage Account.
2. In the left-hand sidebar, under Security + networking, select Networking.
3. Under the Firewalls and virtual networks tab, locate the Public network access setting.

**2. Enable Access for Selected Networks**

To allow Decube to connect while keeping the storage account private from the general public:

1. Select Enabled from selected virtual networks and IP addresses.
2. This option enables the Firewall section below it, where you can specify allowed IP addresses.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2FpL0S65wcsaRiZb2DXxsZ%2Fimage.png?alt=media&#x26;token=86122e53-b2ce-4108-b593-3ca5834b1484" alt=""><figcaption></figcaption></figure>
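
If you manage the account with the Azure CLI rather than the portal, the equivalent of this option is to set the account's default network action to deny (a sketch, assuming the Azure CLI is installed and you are signed in):

```bash
# Deny traffic that does not match a firewall or virtual-network rule.
az storage account update \
  --resource-group "<resource-group>" \
  --name "<storage-account>" \
  --default-action Deny
```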

**3. Whitelist Decube IP Addresses**

In the Firewall section, add the IP addresses corresponding to the region where your Decube SaaS instance is hosted. See the IP Whitelisting section for the list of IP addresses.

{% content-ref url="../security-and-connectivity/ip-whitelisting" %}
[ip-whitelisting](https://docs.decube.io/security-and-connectivity/ip-whitelisting)
{% endcontent-ref %}
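
As a sketch of the CLI route (again assuming the Azure CLI), each whitelisted IP becomes a network rule on the account:

```bash
# Add one Decube IP to the storage account firewall; repeat for each IP
# listed on the IP Whitelisting page.
az storage account network-rule add \
  --resource-group "<resource-group>" \
  --account-name "<storage-account>" \
  --ip-address "<decube-ip>"
```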

## **Credentials setup**

### Setup on Microsoft Azure

1. On the Azure Home Page, go to `Microsoft Entra ID` (formerly `Azure Active Directory`). The **Tenant ID** can be copied from the Basic information section.\\

   <figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-e42ca7b69914fba95fff8e9b8005db7e2a458400%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>
2. Go to `App registrations`.\\

   <figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-d151ddc9c13e94fbdb9275168b7852c54ee45ba7%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>
3. Click on `New registration`.\\

   <figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-dc8c73b2a9b8b12e8ae3a7b22f82de46c607eadb%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>
4. Click `Register`.

   <figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-6cc6d24f7c9f77c665e7d61a311f6b3ab538727e%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>
5. Save the `Application (client) ID` and `Directory (tenant) ID`.
6. Click `Add a certificate or secret`.
7. Go to `Client secrets` and click `+ New client secret`.\\

   <figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-196f0fc37d4b2e75758d5e502ed03c13d59b9ef6%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>
8. Click `+ Add`.\\

   <figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-d479ff4c05c04bbe7b42b62f834214ce02fc4e6d%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>
9. Copy and save the `Value` for the **client secret**.\\

   <figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-a232b183dfdc07c1cafda562e30f936e836fc696%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>
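
If you prefer scripting, the portal steps above can be approximated with the Azure CLI (a sketch; the display name is just an example):

```bash
# Register the application and print the Application (client) ID.
az ad app create --display-name "decube-adls-connector" --query appId -o tsv

# Create a client secret; the "password" field in the output is the secret
# Value to save. Note: this resets any existing credentials on the app.
az ad app credential reset --id "<application-client-id>"

# Print the Directory (tenant) ID of the signed-in account.
az account show --query tenantId -o tsv
```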

#### **Assigning Role to Credentials**

1. From Azure Services, find and click on Storage Accounts. You should be able to see the option for Access control (IAM) on the left sidebar.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-17ac52c3719a063e082fedb03d54868d6d03840e%2FSCR-20250429-nzhp.png?alt=media" alt=""><figcaption></figcaption></figure>

2. Click **Access control (IAM)**, then **+ Add**, then **Add role assignment**.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-2f354269d3c0435c57aba32cf887aa334b9c4135%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

3. Find the role called **Storage Blob Data Reader**, select it, and click **Next**.
4. On the next page, search for the name of the application you just created in Microsoft Entra ID.
5. Assign the role to it.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-d1daec31b920fae00e653ccabda7010945648737%2Fimage%20(65).png?alt=media" alt=""><figcaption></figcaption></figure>
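
The same assignment can be done with the Azure CLI (a sketch; create the app's service principal first with `az ad sp create --id "<application-client-id>"` if it does not exist yet):

```bash
# Grant the app read access to blob data on the storage account.
az role assignment create \
  --assignee "<application-client-id>" \
  --role "Storage Blob Data Reader" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
```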

## Path Specs

Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects where each individual `path_spec` represents one or more datasets for cataloging in ADLS.

The provided path specification MUST end with `*.*` or `*.[ext]` to represent the leaf level. (Note: here `*` is not a wildcard symbol.) If `*.[ext]` is provided, only files with the specified extension will be scanned. `.[ext]` can be any of the supported file types listed below.

Each `path_spec` represents only one file type (e.g., only `*.csv` or only `*.parquet`). To ingest multiple file types, add multiple `path_spec` entries.

* **SingleFile pathspec**: a path spec without `{table}` (targets individual files).
* **MultiFile pathspec**: a path spec with `{table}` (targets folders as datasets).

### PathSpec Structure

Take note of the following parameters when building a path spec:

* Storage account name
* Container name
* Folder path

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2Fgit-blob-13145342f6e8def56f81c2e9589e08d969483a2f%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

Follow this schema when building a path spec:

```bash
# Schema
"abfs://{container name}@{storage account name}.dfs.core.windows.net/{folder path}"

# Example
"abfs://first@decubetestadls.dfs.core.windows.net/second/*.*"
```

**Include only datasets that match this pattern**: if the path spec specifies a table and a regex is provided, only datasets whose names match the regex will be included.
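
As an illustrative sketch only (the regex is supplied through the Decube configuration form; the `table_pattern` key below is a hypothetical stand-in, not a documented option):

```
path_specs:
    - include: abfs://test-container@test-storage-account.dfs.core.windows.net/{table}/*.csv
      # Hypothetical key: with a regex of "order.*", only tables whose
      # names start with "order" would be ingested.
      table_pattern: "order.*"
```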

### File Format Settings

* **CSV**
  * `delimiter` (default: `,`)
  * `escape_char` (default: `\\`)
  * `quote_char` (default: `"`)
  * `has_headers` (default: true)
  * `skip_n_line` (default: 0)
  * `file_encoding` (default: UTF-8; supported: ASCII, UTF-8, UTF-16, UTF-32, Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN, EUC-JP, SHIFT\_JIS, CP932, ISO-2022-JP, EUC-KR, ISO-2022-KR, Johab, KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251, MacRoman, ISO-8859-7, windows-1253, ISO-8859-8, windows-1255, TIS-620)
* **Parquet**
  * No options
* **JSON/JSONL**
  * `file_encoding` (default: UTF-8; see above for supported encodings)
* **Delta Table**
  * When selecting Format = `Delta Table`, the path spec MUST include the named token `{table}`; the connector uses it to discover Delta table roots. No per-file options are required.
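
For orientation, here is a hedged sketch of how the CSV options above might sit alongside a path spec (option names come from the list above; the exact nesting is determined by the Decube configuration form and is illustrative here):

```
path_specs:
    - include: abfs://test-container@test-storage-account.dfs.core.windows.net/{table}/*.csv
      # CSV settings (defaults shown); nesting is illustrative.
      delimiter: ","
      quote_char: "\""
      has_headers: true
      skip_n_line: 0
      file_encoding: "UTF-8"
```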

**Additional points to note**

* Folder names must not contain `{`, `}`, `*`, or `/`.
* The named variable `{folder}` is reserved for internal use; do not use it as a named variable.

### Example Path Specs

#### Example 1 - Individual file as Dataset (SingleFile pathspec)

Bucket structure:

```
test-bucket
├── employees.csv
├── departments.json
└── food_items.csv
```

Path specs config to ingest `employees.csv` and `food_items.csv` as datasets:

```
path_specs:
    - include: abfs://test-container@test-storage-account.dfs.core.windows.net/*.csv
```

This will automatically ignore the `departments.json` file. To include it, use `*.*` instead of `*.csv`.

#### Example 2 - Folder of files as Dataset (without Partitions)

Bucket structure:

```
test-bucket
└──  offers
     ├── 1.csv
     └── 2.csv
```

Path specs config to ingest folder `offers` as dataset:

```
path_specs:
    - include: abfs://test-container@test-storage-account.dfs.core.windows.net/{table}/*.csv
```

`{table}` represents the folder for which the dataset will be created.

#### Example 3 - Folder of files as Dataset (with Partitions)

Bucket structure:

```
test-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── returns
    └── year=2021
        └── month=2
            └── 1.parquet

```

Path specs config to ingest folders `orders` and `returns` as datasets:

```
path_specs:
    - include: abfs://test-container@test-storage-account.dfs.core.windows.net/{table}/*/*/*.parquet
```

#### Example 4 - Advanced - Either Individual file OR Folder of files as Dataset

Bucket structure:

```
test-bucket
├── customers
│   ├── part1.json
│   ├── part2.json
│   ├── part3.json
│   └── part4.json
├── employees.csv
├── food_items.csv
├── tmp_10101000.csv
└──  orders
    └── year=2022
        └── month=2
            ├── 1.parquet
            ├── 2.parquet
            └── 3.parquet

```

Path specs config:

```
path_specs:
    - include: abfs://test-container@test-storage-account.dfs.core.windows.net/*.csv
    - include: abfs://test-container@test-storage-account.dfs.core.windows.net/{table}/*.json
    - include: abfs://test-container@test-storage-account.dfs.core.windows.net/{table}/*/*/*.parquet
```

The above config has 3 path specs and will ingest the following datasets:

* `employees.csv` - Single File as Dataset
* `food_items.csv` - Single File as Dataset
* `customers` - Folder as Dataset
* `orders` - Folder as Dataset; the file `tmp_10101000.csv` will be ignored

**Valid path\_specs.include**

```python
abfs://test-container@test-storage-account.dfs.core.windows.net/foo/tests/bar.csv # single file table
abfs://test-container@test-storage-account.dfs.core.windows.net/foo/tests/*.* # multiple file level tables
abfs://test-container@test-storage-account.dfs.core.windows.net/foo/tests/{table}/*.parquet # table without partition
abfs://test-container@test-storage-account.dfs.core.windows.net/tests/{table}/*/*.csv # table where partitions are not specified
abfs://test-container@test-storage-account.dfs.core.windows.net/tests/{table}/*.* # table where neither partitions nor a file type is specified
abfs://test-container@test-storage-account.dfs.core.windows.net/{dept}/tests/{table}/*.parquet # specifying keywords to be used in display name
```

#### Example 5 - Delta Table (ADLS)

For Delta Table support, include `{table}` in the path. A simple example path spec for ADLS:

```
path_specs:
    - include: abfss://datalake@account.dfs.core.windows.net/{table}/
```

The connector will interpret `{table}` as the table root for each Delta table.

#### Supported file types

* CSV (`*.csv`)
* JSON (`*.json`)
* JSONL (`*.jsonl`)
* Parquet (`*.parquet`)
* Delta (`Delta Table`)

### Notes

* Data Quality Monitoring is no longer supported for ADLS sources. Only cataloging is available.
* For advanced dataset structures, add multiple `path_spec` entries as needed, each with its own file type and settings.
