dbt Core
Connect dbt Core to your decube platform to see all data jobs in the Catalog and view end-to-end lineage.
Supported Capabilities
Freshness: ❌
Volume: ❌
Schema Drift: ❌
Field Health: ❌
Custom SQL: ❌
Job Failure: ✅
Data Profiling: ❌
Data Preview: ❌
This documentation covers how to add a data source connection for dbt Core, the open-source framework for dbt. If you are interested in connecting to your dbt Cloud instance instead, please check out the documentation for the dbt Cloud version.
Integrating dbt Core with Decube involves reading files from an AWS S3 bucket, so the setup is similar to how AWS S3 itself connects to the platform.
A summary of the steps to set up dbt Core:
Set up an S3 bucket following the same procedure outlined in our documentation for AWS S3.
Define folder partitions (details will be provided in the following section).
Upload the necessary files to those partitions.
Following these steps, the metadata collector will connect to the S3 bucket and retrieve the data.
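For reference, here is a minimal sketch of creating the bucket and making a first date-partitioned upload with the AWS CLI; the bucket and project names are placeholders, and the full bucket configuration is covered in the AWS S3 documentation referenced above:
#!/bin/bash
# Create the bucket that will hold the dbt artifacts (placeholder name)
aws s3 mb s3://your-dbt-artifacts-bucket
# Folder partitions are created implicitly when objects are uploaded, e.g.:
aws s3 cp target/manifest.json s3://your-dbt-artifacts-bucket/project_a/2024/May/01/manifest.json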
Minimum Requirements
To connect your dbt Core to decube, we will need the following information:
Choose authentication method:
a. AWS Identity:
Select AWS Identity
Customer AWS Role ARN
Path
Region
Storage Type
Data source name
b. AWS Access Key:
Access Key ID
Secret Access Key
Path
Region
Storage Type
Data source name

Connection Options:
a. AWS Roles
Step 1: Go to your AWS Account → IAM Module → Roles
Step 2: Click on Create role.

Step 3: Choose Custom trust policy.

Step 4: Specify the following as the trust policy, replacing DECUBE-AWS-IDENTITY-ARN and EXTERNAL-ID with values from Generating a Decube AWS Identity.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<DECUBE-AWS-IDENTITY-ARN>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<EXTERNAL-ID>"
        }
      }
    }
  ]
}
Step 5: Click Next to proceed to attaching a policy.
Step 6: Click on Attach Policies, then Create Policy, and choose the JSON editor. Input the following policy and press Next, then input a policy name of your choice and press Create Policy.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListAllMyBuckets"
      ],
      "Resource": [
        "arn:aws:s3:::{bucket-name}",
        "arn:aws:s3:::{bucket-name}/*"
      ]
    }
  ]
}
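If you prefer the AWS CLI over the console, the same role setup can be sketched as follows; the role name, policy name, and file names are placeholders, assuming the trust policy from Step 4 and the S3 policy from Step 6 are saved locally:
#!/bin/bash
# Create the role using the trust policy from Step 4 (saved as trust-policy.json)
aws iam create-role \
  --role-name decube-dbt-core-role \
  --assume-role-policy-document file://trust-policy.json

# Create the S3 read policy from Step 6 (saved as decube-s3-read.json) and attach it to the role
aws iam create-policy \
  --policy-name decube-dbt-s3-read \
  --policy-document file://decube-s3-read.json
aws iam attach-role-policy \
  --role-name decube-dbt-core-role \
  --policy-arn arn:aws:iam::<AWSAccountID>:policy/decube-dbt-s3-read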
b. Retrieving Access Keys from AWS
Step 1: Log in to the AWS Console and proceed to IAM > Users > Create user

Extra Step: Click on Attach Policies, then Create Policy, and choose the JSON editor. Input the following policy and press Next, then input a policy name of your choice and press Create Policy.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListAllMyBuckets"
      ],
      "Resource": [
        "arn:aws:s3:::{bucket-name}",
        "arn:aws:s3:::{bucket-name}/*"
      ]
    }
  ]
}
Step 2: Search for the policy you created just now, select it and press Next.

Step 3: Review and Create user.

Step 4: Navigate to the newly created user and click on Create access key.

Step 5: Choose Application running outside AWS.

Step 6: Save the provided access key and secret access key. You will not be able to retrieve these keys again.
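If you prefer the AWS CLI, here is a minimal sketch of the same flow; the user name and the policy ARN are placeholders matching the examples in this guide:
#!/bin/bash
# Create the IAM user and attach the S3 read policy created above
aws iam create-user --user-name decube-s3-datalake
aws iam attach-user-policy \
  --user-name decube-s3-datalake \
  --policy-arn arn:aws:iam::<AWSAccountID>:policy/decube-dbt-s3-read

# Generate the access key pair; store the output securely, as it is shown only once
aws iam create-access-key --user-name decube-s3-datalake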

AWS KMS
If the bucket intended to be connected to Decube is encrypted using a customer managed KMS key, you will need to add the AWS IAM user created above to the key policy statement.
Login to AWS Console and proceed to AWS KMS > Customer-managed keys.
Find the key that was used to encrypt the AWS S3 bucket.
On the Key policy tab, click on Edit.

Assuming the user created is decube-s3-datalake:
a. If there is not an existing policy attached to the key:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Allow decube to use key",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::<AWSAccountID>:user/{decube-s3-datalake}"
        ]
      },
      "Action": "kms:Decrypt",
      "Resource": "*"
    }
  ]
}
b. If there is an existing policy, append this section to the Statement array:
{
  "Statement": [
    {
      "Sid": "Allow decube to use key",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::<AWSAccountID>:user/{decube-s3-datalake}"
        ]
      },
      "Action": "kms:Decrypt",
      "Resource": "*"
    }
  ]
}
Save Changes
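If you manage key policies through the AWS CLI instead of the console, here is a minimal sketch; the key ID is a placeholder, and the uploaded file must contain the full merged key policy (the existing policy plus the statement above):
#!/bin/bash
# Download the current key policy, merge in the decube statement, then upload the result
aws kms get-key-policy --key-id <key-id> --policy-name default --output text > key-policy.json
# ...edit key-policy.json to include the "Allow decube to use key" statement...
aws kms put-key-policy --key-id <key-id> --policy-name default --policy file://key-policy.json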
Folder partition
Decube supports ingesting information from multiple dbt projects. You would need to structure the bucket using a format that we define based on the current date.
The base_path for a single project uses the following format:
base_path = "${year}/${month}/${day}"
where:
year = $(date +%Y)
month = $(date +%B)
day = $(date +%d)
Example of a folder partition on your S3 bucket:
s3://your-bucket/${base_path}
where the full path of the folder could be:
s3://your-bucket/2024/May/01/
After setting up the date-based partition format, you can proceed to define your own structure on top of it. decube currently supports reading a bucket structure up to two levels deep, so you can decide how to upload project files into separate directories. All of the following are valid bucket paths; refer to the examples below (a concrete path sketch follows the examples).
Example 1 - Multiple Projects
  project_a
    year=2024
      month=May
        day=01
          [location of project files]
  project_b
    Same as project_a
  project_c
    Same as project_a
Example 2 - Multiple Projects with Environments
  dev
    project_a
      year=2024
        month=May
          day=01
            [location of project files]
    project_b
      Same as project_a
    project_c
      Same as project_a
  prod
    project_a_prod
    project_b_prod
    …
Example 3 - Single Project
  project_a
    year=2024
      month=May
        day=01
          [location of project files]
Example 4 - No Project
  year=2024
    month=May
      day=01
        [location of project files]
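As a concrete sketch of how these layouts map to the base_path format defined above, the daily path for project_a in Example 1 could be built like this (the bucket and project names are placeholders):
#!/bin/bash
# Build the date-based partition for a project directory (Example 1 layout)
year=$(date +%Y)
month=$(date +%B)
day=$(date +%d)
echo "s3://your-bucket/project_a/${year}/${month}/${day}/"
# e.g. s3://your-bucket/project_a/2024/May/01/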
Upload project files
You will need to upload specific files from the target/ directory into the bucket after your dbt command has concluded.
manifest.json, which is generated by any command that parses your project. Here is an example of a command that generates the file: dbt run --full-refresh
This single file contains a full representation of your dbt project's resources (models, tests, macros, etc.), including all node configurations and resource properties.
run_results.json, which is generated by several commands such as build, compile, and run (you can refer to the dbt documentation for the full list). Here is an example of a command that generates the file: dbt build
This file contains information about a completed invocation of dbt, including timing and status info for each node (model, test, etc.) that was executed.
catalog.json, which is only produced by dbt docs generate and is optional. It is required if you want to acquire column metadata. The command can be run like so: dbt docs generate
This file contains information from your data warehouse about the tables and views produced and defined by the resources in your project.
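For reference, a typical sequence that produces all three artifacts in the target/ directory could look like this; the exact commands and flags depend on your project and workflow:
#!/bin/bash
# Parse and run the project; writes manifest.json and run_results.json to target/
dbt build
# Optional: generate catalog.json if you want column metadata
dbt docs generate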
To ensure the collector runs successfully, you will need to upload the files in one of the following combinations:
(as a pair) manifest.json and run_results.json, or
(as a triplet) manifest.json, run_results.json, and catalog.json.
Additional Notes
For uploading the project files, you may choose to do the following:
Only upload the latest project files to the specified bucket, so that there is only one set of manifest.json and run_results.json in that folder partition at any time. Caution: if you do it this way, you may lose information about runs made before the latest project files were processed.
Retain a series of project files based on the timestamp of when each run occurred. For example, for each run, append a timestamp to the filename:
Do: manifest_20240503142827.json
Do not: 20240503142827_manifest.json
The timestamped project file in this example was generated using timestamp=$(date +%Y%m%d%H%M%S) to create manifest_${timestamp}.json.
Note: To ensure that each project is successfully collected by our metadata collector, we recommend uploading manifest.json and run_results.json to the same folder. If you want to include column metadata, make sure you include catalog.json as well.
Sample Script
Here is a sample script for uploading the project files:
#!/bin/bash
# Project name
project_name=some_project
# Generate timestamp
export TZ=UTC
timestamp=$(date +%Y%m%d%H%M%S)
# Generate date-based directory structure
year=$(date +%Y)
month=$(date +%B)
day=$(date +%d)
# Define the base path for S3
base_path="${project_name}/${year}/${month}/${day}"
# Copy project files to S3 with the new structured path
aws s3 cp /path/to/target/manifest.json s3://some-bucket/${base_path}/manifest_${timestamp}.json
aws s3 cp /path/to/target/run_results.json s3://some-bucket/${base_path}/run_results_${timestamp}.json
aws s3 cp /path/to/target/catalog.json s3://some-bucket/${base_path}/catalog_${timestamp}.json
You may modify and integrate this into your existing workflows.
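For example, assuming the script above is saved as upload_dbt_artifacts.sh, a CI job could run it right after the dbt invocation:
dbt build && dbt docs generate && ./upload_dbt_artifacts.sh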
Connecting dbt Core with Decube
After following the above steps, you may start ingesting the metadata from your dbt Core bucket into decube by navigating to My Account > Data Sources Tab > Connect A New Data Source > DBT Core.
The 'Path' field follows one of these formats: s3://some-bucket or s3://some-bucket/path-to-dbt-core

Please provide the required credentials and click "Test this connection" to verify their validity. Afterward, assign a name to your data source and select "Connect This Data Source" to establish the connection between dbt Core and Decube.
Additional configuration for lineage
Once you have connected your dbt Core data source, you will then need to map the connection sources to the data sources on the decube platform. Refer to this documentation on how to do that.