# Profiler source support and limitations

Profiler behavior varies by source. This page explains how Decube calculates row counts, which metric types are available, and which source-specific limitations can affect profiling results.

## How Decube profiles a table

When you generate a profile, Decube:

1. Retrieves the table row count.
2. Determines how to sample rows from the table based on the source and table size.
3. Calculates profiling metrics from that sample.
4. Stops the run if the profiling job exceeds the execution limit.

{% hint style="info" %}
Profiling jobs have a 30-minute execution limit.
{% endhint %}

Very wide tables can also hit profiling limits at around 250 columns. This is a rough guideline because numeric columns generate more metrics than non-numeric columns.

## Row count modes

The table row count shown in Profiler can be exact or estimated, depending on the data source.

### Exact row count

An exact row count is calculated directly from the table at the time the profile runs. This gives the most accurate result, but it can be more expensive on large tables.

### Estimated row count

An estimated row count is retrieved from source metadata instead of scanning the full table. This is faster, but it depends on the source's metadata being up to date.

The **Total row count** metric is based on table metadata and is not recalculated for date/time filtered runs.

If the source metadata is stale, the reported total row count can be lower than the sampling row count.

## Metric availability by data type

Profiler always shows metric availability based on both the source and the column type.

### Common metrics

These metrics are available for supported column types across most sources:

* Null count
* Null percentage
* Unique count
* Unique percentage

### Numeric metrics

Numeric columns can also include:

* Minimum value
* Maximum value
* Mean
* Median
* Standard deviation
* Zero count
* Zero percentage

### Text metrics

Text columns can also include:

* Minimum length
* Maximum length
* Mean length
* Empty string count
* Empty string percentage

### Boolean metrics

Boolean columns can also include:

* True count
* True percentage
* False count
* False percentage

### Date and time metrics

Date, time, and datetime columns can include:

* Minimum value
* Maximum value

## Sampling strategy support

When you configure a profile run, the sampling strategies available depend on the source and whether the asset is a table or a view.

### Tables

| Source        | Auto | Percentage | Target Row Count | Full |
| ------------- | :--: | :--------: | :--------------: | :--: |
| PostgreSQL    |   ✅  |      ✅     |         ✅        |   ✅  |
| MySQL         |   ✅  |      ❌     |         ✅        |   ✅  |
| SingleStore   |   ✅  |      ❌     |         ✅        |   ✅  |
| Redshift      |   ✅  |      ❌     |         ✅        |   ✅  |
| Snowflake     |   ✅  |      ✅     |         ✅        |   ✅  |
| BigQuery      |   ✅  |      ✅     |         ✅        |   ✅  |
| Databricks    |   ✅  |      ✅     |         ✅        |   ✅  |
| SQL Server    |   ✅  |      ✅     |         ✅        |   ✅  |
| Azure Synapse |   ✅  |      ✅     |         ✅        |   ✅  |
| Oracle        |   ✅  |      ✅     |         ✅        |   ✅  |
| AWS Glue      |   ✅  |      ✅     |         ✅        |   ✅  |

### Views

Views generally have no reliable row count metadata, which limits the sampling strategies available on some sources.

| Source        | Auto | Percentage | Target Row Count | Full |
| ------------- | :--: | :--------: | :--------------: | :--: |
| PostgreSQL    |   ✅  |      ❌     |         ✅        |   ✅  |
| MySQL         |   ✅  |      ❌     |         ✅        |   ✅  |
| SingleStore   |   ✅  |      ❌     |         ✅        |   ✅  |
| Redshift      |   ✅  |      ❌     |         ✅        |   ✅  |
| Snowflake     |   ✅  |      ✅     |         ✅        |   ✅  |
| BigQuery      |   ✅  |      ✅     |         ✅        |   ✅  |
| Databricks    |   ✅  |      ✅     |         ✅        |   ✅  |
| SQL Server    |   ✅  |      ❌     |         ✅        |   ✅  |
| Azure Synapse |   ✅  |      ❌     |         ✅        |   ✅  |
| Oracle        |   ✅  |      ❌     |         ✅        |   ✅  |
| AWS Glue      |   ✅  |      ✅     |         ✅        |   ✅  |

{% hint style="info" %}
When a date/time filter is active, the sampling options switch to **All matching rows** and **Limit row count** for all sources. The table above applies only when no date/time filter is set.
{% endhint %}

## Source support and limitations

### PostgreSQL

* Row count mode: Estimated
* If PostgreSQL has not refreshed table statistics, the row count can return `-1`
* Views are supported, but they always return a row count of `0`

### Redshift

* Row count mode: Estimated
* If Redshift has not refreshed table statistics, the row count can return `-1`
* Views are supported, but they always return a row count of `0`
* `MEDIAN` is not supported for numeric metrics
* Unique count and unique percentage are not supported because they require expensive queries.
  * This may change in a future update where you can select the columns to be profiled to avoid full table query.

### BigQuery

* Row count mode: Exact
* Views always return a row count of `0`

### Microsoft SQL Server

* Row count mode: Estimated
* `TEXT`, `NTEXT`, and `IMAGE` columns do not support count-based common metrics
* Profiling is supported on SQL Server 2016 and later.

### Azure Synapse

* Row count mode: Exact
* `BIT` is supported as a boolean type
* `MEDIAN` is not supported for numeric metrics
* General wide-table guidance is around 250 columns, but Azure Synapse supports up to 150 columns because of engine limits

{% hint style="info" %}
Azure Synapse also requires additional permissions for profiling. For setup steps, see [Azure Synapse](/warehouses/azure-synapse.md).
{% endhint %}

### Azure Synapse Serverless

Profiling is not currently supported for Azure Synapse Serverless. Support may be added in a future release.

{% hint style="warning" %}
**Total row count will never be supported for Azure Synapse Serverless.** The serverless engine does not expose row count metadata, so this metric cannot be calculated regardless of profiling support status.
{% endhint %}

### Databricks

* Row count mode: Exact
* Interval data types are not supported for profiling

### AWS Glue

* Row count mode: Exact
* Row count is always reported as `0`
* Decube does not run row count queries against Glue sources because files can be too large.

### MySQL

* Row count mode: Estimated
* Boolean is not supported as a native data type
* `MEDIAN` is not supported for numeric metrics

### SingleStore

SingleStore inherits the same profiling behavior as MySQL.

### Oracle

* Row count mode: Estimated
* `num_rows` is only populated after Oracle statistics are collected with `DBMS_STATS`
* `CLOB` and `NCLOB` map to string types but are not supported for profiling
* Boolean is not supported

### Snowflake

* Row count mode: Exact
* Unique count and unique percentage are not supported because they require expensive queries.
  * This may change in a future update where you can select the columns to be profiled to avoid full table query.
* Profiling speed depends on warehouse size, table size, and column count
* Large profiling jobs can time out before completion

{% hint style="info" %}
If you see frequent profile timeouts on Snowflake, scope the run to a smaller subset of columns or apply a date/time filter to reduce the data scanned. See [Configure a profile run](/catalog/profiler/configure-profile-run.md) for details.
{% endhint %}

## FAQ

### Why is the total row count lower than the sampling row count?

This usually means the source uses an estimated row count and the metadata has not been refreshed recently.

### Why do some metrics show as unsupported?

Metric support depends on the source engine, the column data type, and the cost of calculating the metric on that source.

### Why did my profile run time out?

Large tables, wide schemas, and source-specific engine limits can increase query time. As a rough guideline, tables with around 250 columns can hit profiling limits, especially when many columns are numeric and generate more metrics. If the run exceeds the profiling execution limit, Decube stops the job.

If you see this happening repeatedly, reduce the scope of the profile run:

* **Select specific columns** — use column selection in the [Configure profile run](/catalog/profiler/configure-profile-run.md) modal to profile only the columns you need, rather than the full table.
* **Apply a date/time filter** — restrict profiling to a recent time window (for example, the last 7 or 30 days) to reduce the number of rows scanned.

On Snowflake specifically, verify that the Decube warehouse has `statement_timeout_in_seconds` set to at least `1800`. If it is set lower, the Snowflake warehouse will terminate the profiling query before Decube's own 30-minute execution limit is reached. See the [Snowflake setup guide](/warehouses/snowflake.md) for the recommended warehouse configuration.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.decube.io/catalog/profiler/profiler-source-support-and-limitations.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
