# OpenLineage (BETA)

## Supported Capabilities

{% tabs %}
{% tab title="Supported Capabilities" %}
**General**

* **Metadata** — metadata extraction and display of asset information (tables, columns, schemas). Types collected: Schema, Virtual Table, Virtual Column, Data Job, Data Task, Data Run

**Data Quality Monitors**

* Job Failure
{% endtab %}

{% tab title="Not Supported" %}
**General**

* Profiling
* Preview
* Data Quality
* Configurable Collection
* External Table
* View Table
* Stored Procedure

**Data Quality Monitors**

* Freshness
* Volume
* Field Health
* Custom SQL
* Schema Drift
{% endtab %}
{% endtabs %}

## Connection Requirements

* Step 1: Go to **My Account** and click on the **Integrations** tab
* Step 2: Go to the **Connect a new data source** section
* Step 3: Click on the **OpenLineage** icon.
* Step 4: Enter a **name** for the data source and click **Submit**.

Exclusion filters can be added to keep specific tables and lineage paths from being ingested. [See more here](#exclusion-filters).

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2FXlumVp0M5D6kBXbCL3n5%2Fimage.png?alt=media&#x26;token=2e10ea98-a743-4e49-b1cf-29a9191e7185" alt=""><figcaption></figcaption></figure>

Step 5: A **Webhook UUID** and an **API Key** will be provided. **Copy** them into your connector’s configuration settings.

<figure><img src="https://1779874722-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FTw0qpCVzfrIXqS4FEg4T%2Fuploads%2FYqKVjlZUNWSr8fXP4YPw%2Fimage.png?alt=media&#x26;token=478bab66-3bcc-467a-b81d-a46a430bade3" alt=""><figcaption></figcaption></figure>

## Webhook Endpoint

The payload must be submitted to the following endpoint:

```
https://integrations.<region>.decube.io/integrations/openlineage/webhook/<webhook-uuid>
```

## Submitting Payload to OpenLineage Webhook

If you're using any of the following tools, follow the respective documentation on the OpenLineage website.

| Tool         | Documentation                                                                                                                    |
| ------------ | -------------------------------------------------------------------------------------------------------------------------------- |
| Airflow      | [openlineage.io/docs/integrations/airflow/usage](https://openlineage.io/docs/integrations/airflow/usage)                         |
| Apache Spark | [openlineage.io/docs/integrations/spark/configuration/usage](https://openlineage.io/docs/integrations/spark/configuration/usage) |
| Apache Flink | [openlineage.io/docs/integrations/flink/configuration](https://openlineage.io/docs/integrations/flink/configuration)             |

## Custom Integration

If you want to create your own integration for your tools, follow these steps:

* Submit the webhook payload to the endpoint above.
* Authenticate with a Bearer token by passing your **API Key** in the `Authorization` header.

Example request:

```
curl -X POST \
   -H "Authorization: Bearer <api-key>" \
   -H "Content-Type: application/json" \
   --data '{}' \
   https://integrations.<region>.decube.io/integrations/openlineage/webhook/<webhook-uuid>
```
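The same request can also be made from code. The sketch below uses only the Python standard library and is an illustrative example, not an official client: the `RunEvent` body is a minimal OpenLineage event, and the region, webhook UUID, and API key are placeholders you must replace with the values from your data source settings.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib import request


def build_run_event(job_namespace: str, job_name: str, event_type: str = "COMPLETE") -> dict:
    """Build a minimal OpenLineage RunEvent payload."""
    return {
        "eventType": event_type,  # START, COMPLETE, FAIL, or ABORT
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [],
        "outputs": [],
        # Identifies the tool producing the event; use your own URL here.
        "producer": "https://example.com/my-custom-integration",
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
    }


def submit_event(event: dict, region: str, webhook_uuid: str, api_key: str) -> int:
    """POST the event to the decube OpenLineage webhook; returns the HTTP status code."""
    url = f"https://integrations.{region}.decube.io/integrations/openlineage/webhook/{webhook_uuid}"
    req = request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with request.urlopen(req) as resp:
        return resp.status
```

Richer events (with `inputs`/`outputs` datasets and facets) follow the same shape; see the OpenLineage specification for the full schema.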

## Exclusion Filters

Exclusion filters let you exclude specific tables and lineage paths from being ingested by Decube. This is useful when your OpenLineage jobs produce metadata for staging tables, temporary paths, or other assets you do not want tracked in your catalog.

You configure exclusion filters directly in the Decube UI on your OpenLineage data source settings page. Each filter field expects a Python-compatible regular expression.

### Supported Filter Types

{% tabs %}
{% tab title="ADLS Gen2 Path" %}
Matches tables with an ADLS Gen2 URI in the format:

```
abfss://<container-name>@<service-name>.dfs.core.windows.net/<path>
```

| Field          | Description                        |
| -------------- | ---------------------------------- |
| container-name | The ADLS container name            |
| service-name   | The ADLS storage account name      |
| path           | The file path within the container |

**Example** — exclude everything under the `discard/` path in the `decube` container across all storage accounts:

| Field          | Value        |
| -------------- | ------------ |
| container-name | `decube`     |
| service-name   | `.*`         |
| path           | `discard/.*` |

This excludes `abfss://decube@test.dfs.core.windows.net/discard/test/file` but not `abfss://decube@test.dfs.core.windows.net/nodiscard/test/file`.
{% endtab %}

{% tab title="Snowflake" %}
Matches Snowflake tables by their fully qualified identifier.

| Field              | Description                                                           |
| ------------------ | --------------------------------------------------------------------- |
| account-identifier | Your Snowflake account in `<organization-name>-<account-name>` format |
| database           | The database name                                                     |
| schema             | The schema name                                                       |
| table              | The table name                                                        |

**Example** — exclude all tables in the `TEST` schema of `WORKDATABASE` on the `decube-test` account:

| Field              | Value          |
| ------------------ | -------------- |
| account-identifier | `decube-test`  |
| database           | `WORKDATABASE` |
| schema             | `TEST`         |
| table              | `.*`           |

This excludes `snowflake://decube-test/WORKDATABASE.TEST.TABLE` but not `snowflake://decube-test/PROD.TEST.TABLE`.
{% endtab %}

{% tab title="S3" %}
Matches tables stored in Amazon S3.

| Field       | Description                             |
| ----------- | --------------------------------------- |
| bucket-name | The S3 bucket name                      |
| object-key  | The object key (path) within the bucket |

**Example** — exclude all CSV files under `raw/data/` in the `decube-datalake` bucket:

| Field       | Value              |
| ----------- | ------------------ |
| bucket-name | `decube-datalake`  |
| object-key  | `raw/data/.*\.csv` |

This excludes `s3://decube-datalake/raw/data/report.csv` but not `s3://decube-datalake/raw/data/report.json` or `s3://decube-datalake/bronze/report.csv`.
{% endtab %}

{% tab title="Generic Regex" %}
Use the generic regex filter when the table format does not match any of the specific filter types above. The regex is matched against the full table identifier regardless of type.

| Field | Description                                              |
| ----- | -------------------------------------------------------- |
| regex | A regular expression matched against the full table path |

**Example** — exclude any table whose path contains `/ignored/`:

| Field | Value           |
| ----- | --------------- |
| regex | `.*/ignored/.*` |
{% endtab %}
{% endtabs %}

{% hint style="info" %}
You can add multiple exclusion filters of different types on the same data source. Each filter is evaluated independently — a table is excluded if it matches any filter.
{% endhint %}
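To sanity-check a pattern before saving it, you can reproduce the matching locally. The sketch below assumes each field is anchored as a full match (Python `re.fullmatch`) against the corresponding URI component; the connector's exact matching semantics are an assumption here, and the filter values are taken from the ADLS Gen2 example above.

```python
import re

# Field patterns from the ADLS Gen2 example: exclude everything under
# discard/ in the `decube` container across all storage accounts.
FILTER = {
    "container-name": "decube",
    "service-name": ".*",
    "path": "discard/.*",
}

# Split an abfss:// URI into its container, service, and path components.
ADLS_URI = re.compile(
    r"abfss://(?P<container>[^@]+)@(?P<service>[^.]+)\.dfs\.core\.windows\.net/(?P<path>.*)"
)


def is_excluded(uri: str, flt: dict) -> bool:
    """Return True if every field pattern fully matches its URI component."""
    m = ADLS_URI.match(uri)
    if not m:
        return False
    return (
        re.fullmatch(flt["container-name"], m.group("container")) is not None
        and re.fullmatch(flt["service-name"], m.group("service")) is not None
        and re.fullmatch(flt["path"], m.group("path")) is not None
    )


print(is_excluded("abfss://decube@test.dfs.core.windows.net/discard/test/file", FILTER))    # True
print(is_excluded("abfss://decube@test.dfs.core.windows.net/nodiscard/test/file", FILTER))  # False
```

The second URI is not excluded because `nodiscard/test/file` does not fully match `discard/.*`; with an unanchored search it would match, which is why anchoring matters when writing these patterns.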
