Profiler source support and limitations

Understand profiler support, metric availability, and source limitations

Profiler behavior varies by source. This page explains how Decube calculates row counts, which metric types are available, and which source-specific limitations can affect profiling results.

How Decube profiles a table

When you generate a profile, Decube:

  1. Retrieves the table row count.

  2. Determines how to sample rows from the table based on the source and table size.

  3. Calculates profiling metrics from that sample.

  4. Stops the run if the profiling job exceeds the execution limit.

circle-info

Profiling jobs have a 30-minute execution limit.

Very wide tables can also hit profiling limits at around 250 columns. This is a rough guideline because numeric columns generate more metrics than non-numeric columns.

Row count modes

The table row count shown in Profiler can be exact or estimated, depending on the data source.

Exact row count

An exact row count is calculated directly from the table at the time the profile runs. This gives the most accurate result, but it can be more expensive on large tables.

Estimated row count

An estimated row count is retrieved from source metadata instead of scanning the full table. This is faster, but it depends on the source's metadata being up to date.

If the source metadata is stale, the reported total row count can be lower than the sampling row count.

Metric availability by data type

Profiler always shows metric availability based on both the source and the column type.

Common metrics

These metrics are available for supported column types across most sources:

  • Null count

  • Null percentage

  • Unique count

  • Unique percentage

Numeric metrics

Numeric columns can also include:

  • Minimum value

  • Maximum value

  • Mean

  • Median

  • Standard deviation

  • Zero count

  • Zero percentage

Text metrics

Text columns can also include:

  • Minimum length

  • Maximum length

  • Mean length

  • Empty string count

  • Empty string percentage

Boolean metrics

Boolean columns can also include:

  • True count

  • True percentage

  • False count

  • False percentage

Date and time metrics

Date, time, and datetime columns can include:

  • Minimum value

  • Maximum value

Source support and limitations

PostgreSQL

  • Row count mode: Estimated

  • If PostgreSQL has not refreshed table statistics, the row count can return -1

  • Views are supported, but they always return a row count of 0

Redshift

  • Row count mode: Estimated

  • If Redshift has not refreshed table statistics, the row count can return -1

  • Views are supported, but they always return a row count of 0

  • MEDIAN is not supported for numeric metrics

  • Unique count and unique percentage are not supported because they require expensive queries.

    • This may change in a future update where you can select the columns to be profiled to avoid full table query.

BigQuery

  • Row count mode: Exact

  • Views always return a row count of 0

Microsoft SQL Server

  • Row count mode: Estimated

  • TEXT, NTEXT, and IMAGE columns do not support count-based common metrics

  • Profiling is supported on SQL Server 2016 and later.

Azure Synapse

  • Row count mode: Exact

  • BIT is supported as a boolean type

  • MEDIAN is not supported for numeric metrics

  • General wide-table guidance is around 250 columns, but Azure Synapse supports up to 150 columns because of engine limits

circle-info

Azure Synapse also requires additional permissions for profiling. For setup steps, see Azure Synapse.

Databricks

  • Row count mode: Exact

  • Interval data types are not supported for profiling

AWS Glue

  • Row count mode: Exact

  • Row count is always reported as 0

  • Decube does not run row count queries against Glue sources because files can be too large.

MySQL

  • Row count mode: Estimated

  • Boolean is not supported as a native data type

  • MEDIAN is not supported for numeric metrics

SingleStore

SingleStore inherits the same profiling behavior as MySQL.

Oracle

  • Row count mode: Estimated

  • num_rows is only populated after Oracle statistics are collected with DBMS_STATS

  • CLOB and NCLOB map to string types but are not supported for profiling

  • Boolean is not supported

Snowflake

  • Row count mode: Exact

  • Unique count and unique percentage are not supported because they require expensive queries.

    • This may change in a future update where you can select the columns to be profiled to avoid full table query.

  • Profiling speed depends on warehouse size, table size, and column count

  • Large profiling jobs can time out before completion

FAQ

Why is the total row count lower than the sampling row count?

This usually means the source uses an estimated row count and the metadata has not been refreshed recently.

Why do some metrics show as not supported?

Metric support depends on the source engine, the column data type, and the cost of calculating the metric on that source.

Why did my profile run time out?

Large tables, wide schemas, and source-specific engine limits can increase query time. As a rough guideline, tables with around 250 columns can hit profiling limits, especially when many columns are numeric and generate more metrics. If the run exceeds the profiling execution limit, Decube stops the job.

Last updated