Lineage Relationship
In this document, we explore various types of data lineage and their significance in understanding data relationships.
Overview of Lineage Types
Each data source supports different types of lineage, ensuring transparency in data transformations, relationships, and reporting:
Foreign Key Lineage – Tracks relationships between tables via primary and foreign keys.
ETL Lineage – Shows how data moves and transforms through ETL processes.
SQL Lineage – Tracks data movement through SQL queries (SELECT, JOIN, INSERT, etc.).
BI Lineage – Traces data from sources to reports, dashboards, and BI tools.
View Lineage – Tracks virtual tables (views) and their data dependencies.
External Lineage – Tracks data movement across external platforms (e.g., AWS S3 to Snowflake).
Foreign Key
What it is:
A foreign key is a column in one table that refers to the primary key of another table, establishing a relationship between them. This lineage type visualizes table relationships in relational databases, ensuring referential integrity.
Example:
Consider a scenario where a sale is linked to a credit card transaction. The id column in the credit_card table serves as a foreign key for the payment_card_id column in the sale table. This relationship allows us to track transactions based on the payment method used.
Why it matters:
• Helps in understanding one-to-many or many-to-many table relationships.
• Ensures data consistency and integrity across tables.
ETL Lineage
What it is:
ETL (Extract, Transform, Load) lineage tracks how data is processed in ETL pipelines. It ensures transparency in data movement and transformation—helping users understand how raw data evolves into its final state.
Example:
In dbt (Data Build Tool), transformation jobs are frequently used to reshape data within a data warehouse. ETL lineage helps visualize these changes, showing the step-by-step evolution of data.
Why it matters:
• Provides a clear transformation path for debugging and optimization.
• Ensures trust in analytics and reporting by making data workflows transparent.
SQL Lineage
What it is:
SQL lineage maps the movement of data between tables and columns as SQL queries execute. It tracks data transformations performed by SQL commands such as SELECT, JOIN, INSERT, UPDATE, and DELETE.
Example:
If a report is generated based on a JOIN operation between multiple tables, SQL lineage will track which columns were used, modified, or derived in the process.
Why it matters:
• Helps in tracing data changes for auditing and compliance.
• Improves query optimization and debugging.
BI Lineage
What it is:
BI (Business Intelligence) lineage tracks how data moves from raw sources to reports, dashboards, and visualizations in a BI tool. It provides insights into data transformations within BI platforms.
Example:
If data from Snowflake (a data warehouse) is used in Looker Studio (a BI tool), BI lineage helps track the source-to-report journey, ensuring transparency in insights.
Why it matters:
• Helps users understand the accuracy of reports.
• Improves troubleshooting of incorrect metrics.
View Lineage
What it is:
A view is a virtual table created using SQL queries. View lineage is similar to SQL lineage but focuses specifically on data transformations within views.
Example:
If a database view is built using a SELECT query that joins multiple tables, view lineage tracks the origin of each field displayed in that view.
Why it matters:
• Ensures data consistency in virtual tables.
• Helps in understanding dependencies between views and base tables.
External Lineage (S3 to Snowflake)
What it is:
External lineage refers to data flow between different platforms and environments. This is crucial for tracking external data ingestion, ensuring an end-to-end view of data movement.
Example:
If data is ingested from AWS S3 into Snowflake, external lineage will capture:
✔ The source location in S3
✔ How data is processed before entering Snowflake
✔ Any transformations performed during ingestion
Why it matters:
• Helps in governance and security by tracking external data sources.
• Ensures completeness of data flow visualization.
Last updated