Lineage Relationship

In this document, we explore various types of data lineage and their significance in understanding data relationships.

Overview of Lineage Types

Each data source supports different types of lineage, ensuring transparency in data transformations, relationships, and reporting:

  • Foreign Key Lineage – Tracks relationships between tables via primary and foreign keys.

  • ETL Lineage – Shows how data moves and transforms through ETL processes.

  • SQL Lineage – Tracks data movement through SQL queries (SELECT, JOIN, INSERT, etc.).

  • BI Lineage – Traces data from sources to reports, dashboards, and BI tools.

  • View Lineage – Tracks virtual tables (views) and their data dependencies.

  • External (Table) Lineage – Tracks the direct relationship between raw files in cloud storage (S3/ADLS) and the virtualized relational schema.

circle-check

Foreign Key

What it is:

A foreign key is a column in one table that refers to the primary key of another table, establishing a relationship between them. This lineage type visualizes table relationships in relational databases, ensuring referential integrity.

Example:

Consider a scenario where a sale is linked to a credit card transaction. The id column in the credit_card table serves as a foreign key for the payment_card_id column in the sale table. This relationship allows us to track transactions based on the payment method used.

Why it matters:

• Helps in understanding one-to-many or many-to-many table relationships.

• Ensures data consistency and integrity across tables.

ETL Lineage

What it is:

ETL (Extract, Transform, Load) lineage tracks how data is processed in ETL pipelines. It ensures transparency in data movement and transformation—helping users understand how raw data evolves into its final state.

Example:

In dbt (Data Build Tool), transformation jobs are frequently used to reshape data within a data warehouse. ETL lineage helps visualize these changes, showing the step-by-step evolution of data.

Why it matters:

• Provides a clear transformation path for debugging and optimization.

• Ensures trust in analytics and reporting by making data workflows transparent.

SQL Lineage

What it is:

SQL lineage maps the movement of data between tables and columns as SQL queries execute. It tracks data transformations performed by SQL commands such as SELECT, JOIN, INSERT, UPDATE, and DELETE.

Example:

If a report is generated based on a JOIN operation between multiple tables, SQL lineage will track which columns were used, modified, or derived in the process.

Why it matters:

• Helps in tracing data changes for auditing and compliance.

• Improves query optimization and debugging.

BI Lineage

What it is:

BI (Business Intelligence) lineage tracks how data moves from raw sources to reports, dashboards, and visualizations in a BI tool. It provides insights into data transformations within BI platforms.

Example:

If data from Snowflake (a data warehouse) is used in Looker Studio (a BI tool), BI lineage helps track the source-to-report journey, ensuring transparency in insights.

Why it matters:

• Helps users understand the accuracy of reports.

• Improves troubleshooting of incorrect metrics.

View Lineage

What it is:

A view is a virtual table created using SQL queries. View lineage is similar to SQL lineage but focuses specifically on data transformations within views.

Example:

If a database view is built using a SELECT query that joins multiple tables, view lineage tracks the origin of each field displayed in that view.

Why it matters:

• Ensures data consistency in virtual tables.

• Helps in understanding dependencies between views and base tables.

What it is:

External table lineage tracks the "schema-on-read" relationship between physical files in cloud storage (S3/ADLS) and the virtualized tables in your data platform. Unlike standard ingestion, this lineage captures data accessed by reference, mapping the link between the storage URI and the relational metadata without moving the underlying data.

Example:

If an external table in Snowflake or Synapse queries files in S3 or ADLS, the lineage captures:

  • The Physical Origin: The specific cloud storage bucket, container, and file path.

  • The Access Layer: The credentialed connection (Storage Integration or Linked Service) and the metadata pointer (Stage or External Data Source).

  • The Virtual Schema: The external table definition that maps raw file formats (Parquet, CSV, JSON) into queryable columns.

  • Downstream Consumption: Any views or permanent tables populated using the external table as a source.

Why it matters:

  • Security & Compliance: Audits exactly which external cloud folders are being exposed to the data platform.

  • Impact Analysis: Identifies which downstream reports will break if a file format or folder structure changes in the data lake.

  • Data Provenance: Provides a complete "Source-to-BI" map, even when data remains in its raw format outside the primary database.

External Table Lineage (Cloud Storage to Data Platform)

What it is:

External table lineage tracks the "schema-on-read" relationship between physical files in cloud storage (S3/ADLS) and the virtualized tables in your data platform. Unlike standard ingestion, this lineage captures data accessed by reference, mapping the link between the storage URI and the relational metadata without moving the underlying data.

Example:

If an external table in Snowflake or Synapse queries files in S3 or ADLS, the lineage captures:

  • The Origin: The specific cloud storage bucket, container, and file path.

  • The Virtual Schema: The external table definition that maps raw file formats (Parquet, CSV, JSON) into queryable columns.

Why it matters:

  • Security & Compliance: Audits exactly which external cloud folders are being exposed to the data platform.

  • Impact Analysis: Identifies which downstream reports will break if a file format or folder structure changes in the data lake.

  • Data Provenance: Provides a complete "Source-to-BI" map, even when data remains in its raw format outside the primary database.

Last updated