dbt Core

Connect your decube platform to dbt Core to see all data jobs in the Catalog and see end-to-end lineage.

This page explains how to add a data source connection for dbt Core, the open-source framework for dbt. If you want to connect to your dbt Cloud instance instead, please check out this documentation for dbt Cloud.

Integrating dbt Core with decube involves reading files from an AWS S3 bucket, so the setup is similar to how AWS S3 itself connects to the platform.

A summary of the steps to set up dbt Core:

  1. Set up an S3 bucket following the same procedure outlined in our documentation for AWS S3.

  2. Define folder partitions (details will be provided in the following section).

  3. Upload the necessary files to those partitions.

Following these steps, the metadata collector will connect to the S3 bucket and retrieve the data.

Retrieving Access Keys from AWS

Folder partition

decube supports ingesting information from multiple dbt projects. You need to structure the bucket using a format that we define, based on the current date.

Given that base_path for a single project uses the following format:

  • base_path = "${year}/${month}/${day}" where:

    • year = $(date +%Y)

    • month = $(date +%B)

    • day = $(date +%d)

  • Example of a folder partition on your S3 - s3://your-bucket/${base_path}

    • Where the full path of the folder could be s3://your-bucket/2024/May/01/
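The format above can be sketched in shell as follows. This is a minimal illustration rather than decube's own tooling: your-bucket is the placeholder from the example, and pinning LC_ALL=C and TZ=UTC is an assumption made here so that %B yields an English month name and the date does not depend on the machine's time zone.

```shell
#!/bin/sh
# Assumption: pin locale and time zone so the month name is English
# (e.g. "May") and the date is the same on every machine.
export LC_ALL=C
export TZ=UTC

year=$(date +%Y)    # four-digit year, e.g. 2024
month=$(date +%B)   # full month name, e.g. May
day=$(date +%d)     # zero-padded day of the month, e.g. 01

base_path="${year}/${month}/${day}"

# "your-bucket" is a placeholder bucket name.
echo "s3://your-bucket/${base_path}/"
```

Run on 1 May 2024 (UTC), this would print the example path shown above.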

After setting up the format based on the current date partition, you can proceed to define your own structure.

decube currently supports reading bucket structures up to two levels deep, so you can decide how to upload project files into separate directories.

All of the following are valid bucket paths:

Example 1 - Multiple Projects

  • project_a

    • year=2024

      • month=May

        • day=01

          • [location of project files]

  • project_b

    • Same as project_a

  • project_c

    • Same as project_a

Example 2 - Multiple Projects with Environments

  • dev

    • project_a

      • year=2024

        • month=May

          • day=01

            • [location of project files]

    • project_b

      • Same as project_a

    • project_c

      • Same as project_a

  • prod

    • project_a_prod

    • project_b_prod

Example 3 - Single Project

  • project_a

    • year=2024

      • month=May

        • day=01

          • [location of project files]

Example 4 - No Project

  • year=2024

    • month=May

      • day=01

        • [location of project files]
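The four layouts above differ only in how many directory levels come before the date partition. A minimal sketch of that idea, assuming a hypothetical helper named build_prefix (not part of decube or dbt) that takes the date partition as-is and joins up to two optional levels in front of it:

```shell
#!/bin/sh
# build_prefix DATE_PARTITION [LEVEL1] [LEVEL2]
# Hypothetical helper: joins up to two optional directory levels
# (for example an environment and a project name) in front of the
# date partition, and rejects anything deeper than two levels.
build_prefix() {
    date_partition=$1
    shift
    if [ "$#" -gt 2 ]; then
        echo "error: at most two directory levels are supported" >&2
        return 1
    fi
    prefix=""
    for level in "$@"; do
        prefix="${prefix}${level}/"
    done
    echo "${prefix}${date_partition}"
}

# Example 2 (environment and project), Example 3 (single project),
# and Example 4 (no project), in that order.
build_prefix "year=2024/month=May/day=01" dev project_a
build_prefix "year=2024/month=May/day=01" project_a
build_prefix "year=2024/month=May/day=01"
```

The date partition is passed in unchanged, so the helper makes no claim about how that part of the path should be spelled.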

Upload project files

After your dbt command has finished, upload the following files from the target/ directory into the bucket.

  • manifest.json, which is generated by any command that parses your project. Here is an example of a command that generates the file:

    • dbt run --full-refresh

      • This single file contains a full representation of your dbt project's resources (models, tests, macros, etc), including all node configurations and resource properties.

  • run_results.json, which is generated by several commands, including build, compile, and run (refer to the documentation for the full list). Here is an example of a command that generates the file:

    • dbt build

      • This file contains information about a completed invocation of dbt, including timing and status info for each node (model, test, etc) that was executed.

  • catalog.json, which is produced only by docs generate. It is optional, but it is required if you want to acquire column metadata. The command can be run like so:

    • dbt docs generate

      • This file contains information from your data warehouse about the tables and views produced and defined by the resources in your project.

To ensure the collector runs successfully, upload the files in one of the following combinations:

  • (as a pair) manifest.json and run_results.json, or

  • (as a triplet) manifest.json, run_results.json, and catalog.json.
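Before uploading, it can help to confirm that the target/ directory holds one of these combinations. The following is a minimal sketch, assuming a hypothetical helper named check_artifacts (not part of dbt or decube) that only inspects the local directory:

```shell
#!/bin/sh
# check_artifacts TARGET_DIR
# Hypothetical helper: fails if manifest.json or run_results.json is
# missing; otherwise prints "triplet" when catalog.json is also
# present and "pair" when it is not.
check_artifacts() {
    target_dir=$1
    for required in manifest.json run_results.json; do
        if [ ! -f "${target_dir}/${required}" ]; then
            echo "missing ${required} in ${target_dir}" >&2
            return 1
        fi
    done
    if [ -f "${target_dir}/catalog.json" ]; then
        echo "triplet"
    else
        echo "pair"
    fi
}
```

For example, check_artifacts /path/to/target could be called at the start of an upload script, which would then stop if the function returns a non-zero status.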

Additional Notes

For uploading the project files, you may choose to do the following:

  • Only upload the latest project files, so that the bucket holds only one set of manifest.json and run_results.json for that folder partition at any time.

    • Caution: With this approach, you may lose information about runs that occurred before the latest project files were processed.

  • Retain a series of project files, each named with the timestamp of its run. For example, append the timestamp to the filename for each run:

    • Do: manifest_20240503142827.json

    • Do not: 20240503142827_manifest.json

    • The timestamped project file in this example was generated as follows:

      • Using timestamp=$(date +%Y%m%d%H%M%S) to create manifest_${timestamp}.json
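The naming rule above (timestamp after the base name, before the extension) can be sketched as follows. The stamped_name helper is hypothetical and only illustrates the string handling; it assumes the input filename ends in .json.

```shell
#!/bin/sh
# stamped_name FILE TIMESTAMP
# Hypothetical helper: inserts the timestamp between the base name and
# the .json extension, e.g. manifest.json -> manifest_<timestamp>.json
stamped_name() {
    file=$1
    ts=$2
    echo "${file%.json}_${ts}.json"
}

# Use one timestamp for the whole run so the files stay grouped together.
timestamp=$(TZ=UTC date +%Y%m%d%H%M%S)
for artifact in manifest.json run_results.json catalog.json; do
    stamped_name "$artifact" "$timestamp"
done
```

Generating the timestamp once and reusing it for every file in the run keeps manifest, run_results, and catalog files from the same invocation under the same suffix.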

Note: To ensure that each project is successfully collected by our metadata collector, we recommend uploading the manifest.json and run_results.json in the same folder. If you want to include column metadata, make sure you include catalog.json as well.

Sample Script

Here is a sample script for uploading the project files:

#!/bin/bash

# Project name
project_name=some_project

# Generate timestamp
export TZ=UTC
timestamp=$(date +%Y%m%d%H%M%S)

# Generate date-based directory structure
year=$(date +%Y)
month=$(date +%B)
day=$(date +%d)

# Define the base path for S3
base_path="${project_name}/${year}/${month}/${day}"

# Copy project files to S3 with the new structured path
aws s3 cp /path/to/target/manifest.json "s3://some-bucket/${base_path}/manifest_${timestamp}.json"
aws s3 cp /path/to/target/run_results.json "s3://some-bucket/${base_path}/run_results_${timestamp}.json"

# catalog.json is optional; upload it only if dbt docs generate produced it
if [ -f /path/to/target/catalog.json ]; then
  aws s3 cp /path/to/target/catalog.json "s3://some-bucket/${base_path}/catalog_${timestamp}.json"
fi

You may modify and integrate this into your existing workflows.

Connecting DBT Core with Decube

After following the steps above, you can start ingesting metadata from your dbt Core bucket into decube by navigating to My Account > Data Sources Tab > Connect A New Data Source > DBT Core.

Here, 'Path' follows one of these formats: s3://some-bucket or s3://some-bucket/path-to-dbt-core

Provide the required credentials and click "Test This Connection" to verify them. Then assign a name to your data source and select "Connect This Data Source" to establish the connection between dbt Core and decube.

Currently, only S3 storage is supported for DBT Core under the "Storage" dropdown.

Additional configuration for lineage

Once you have connected dbt Core, you need to map the connection sources to the data sources on the decube platform. See this documentation for how to do that.
