Categorical distribution

Categorical reference statistics between two datasets.

Validator overview

The categorical distribution Validator verifies that only the expected number of categories is added, removed, or changed over time.

Configuration

| Step | Required | Parameters | Options |
|---|---|---|---|
| Validator type | | Categorical distribution | |
| Config | | Metric | Categories Added; Categories Removed; Categories Changed; Relative Entropy |
| Config | | Backfill | Initialize with backfill (checkbox) |
| Source config | | Field | List of source fields with string data type |
| Source config | | Segmentation | Select a configured Segmentation, or Unsegmented (default) |
| Source config | | Window | Select a configured Window |
| Source config | | Filter | No filter (default); Boolean; Enum; Null (*1); String; Threshold filter |
| Reference source config | | Source | Select a Source to use as reference source |
| Reference source config | | Field | List of reference source fields with string data type |
| Reference source config | | Window | Select a configured Window |
| Reference source config | | Window offset | Select how many Windows you want to offset by |
| Reference source config | | Number of windows | Select the number of Windows to include |
| Reference source config | | Filter | No filter (default); Boolean; Enum; Null (*1); String; Threshold filter |
| Threshold | | Threshold type | Fixed threshold; Dynamic threshold |
| Threshold | ✅ (*2) | Operator | Less than; Less than or equal; Equal; Not equal; Greater than; Greater than or equal |
| Threshold | ✅ (*2) | Value | Numeric value to validate threshold on |
| Threshold | ✅ (*3) | Sensitivity | Enter a numeric value |
| Threshold | ✅ (*3) | Decision bounds type | Upper; Lower; Upper and lower (default) |

*1 Only applicable for nullable columns.

*2 Only applicable for Fixed thresholds.

*3 Only applicable for Dynamic thresholds.

Configuration details

Categories added

This metric validates the number of new categories in the source dataset against a reference dataset.

The following example shows all values of the monitored categorical field in each dataset:

| Categories in the reference dataset | Categories in the source dataset |
|---|---|
| A | |
| B | |
| C | C |
| D | D |
| E | E |
| | F |

Compared to the reference dataset, the source dataset has one new categorical value, F. The number of new categories is thus 1.

Categories removed

Using our example, two categorical values present in the reference dataset are missing from the source dataset: A and B. The number of removed categories is thus 2.

Categories changed

Using our example again, the number of changed categories is the sum of new and removed categories. In this case, 1 + 2 = 3.
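The three count metrics can be reproduced with plain set arithmetic. The following is a minimal sketch of the definitions above, using the example values. The function name `category_changes` is hypothetical and is not part of Validio.

```python
def category_changes(reference, source):
    """Count added, removed, and changed categories between two datasets.

    Hypothetical helper illustrating the metric definitions; not Validio's code.
    """
    ref_set, src_set = set(reference), set(source)
    added = src_set - ref_set      # present in source, absent from reference
    removed = ref_set - src_set    # present in reference, absent from source
    return {
        "added": len(added),
        "removed": len(removed),
        "changed": len(added) + len(removed),  # sum of added and removed
    }


reference = ["A", "B", "C", "D", "E"]
source = ["C", "D", "E", "F"]
result = category_changes(reference, source)
print(result)  # {'added': 1, 'removed': 2, 'changed': 3}
```

Only the distinct values matter for these metrics, so how often each category occurs has no effect on the counts.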

Relative entropy

In Validio, relative entropy is adapted from the Kullback-Leibler divergence.

Relative entropy is presented as a percentage where:

  • 0% means identical empirical distributions.
  • 100% means maximal difference in empirical distributions.
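For intuition, the sketch below computes the standard Kullback-Leibler divergence between the empirical category distributions of two datasets. It is not Validio's implementation. The direction (source relative to reference), the small smoothing constant, and the function names are assumptions made for illustration. How Validio adapts the value to the 0% to 100% scale is not described here, so that step is not reproduced.

```python
import math
from collections import Counter


def empirical_distribution(values, categories, smoothing=1e-9):
    """Relative frequency of each category, lightly smoothed to avoid log(0)."""
    counts = Counter(values)
    total = len(values) + smoothing * len(categories)
    return {c: (counts[c] + smoothing) / total for c in categories}


def kl_divergence(source_values, reference_values):
    """Standard KL divergence D(source || reference), in nats.

    Assumption: source is compared against reference. The result is
    unbounded above, unlike the percentage that Validio reports.
    """
    categories = sorted(set(source_values) | set(reference_values))
    p = empirical_distribution(source_values, categories)
    q = empirical_distribution(reference_values, categories)
    return sum(p[c] * math.log(p[c] / q[c]) for c in categories)


identical = kl_divergence(["A", "B", "B"], ["A", "B", "B"])
shifted = kl_divergence(["A", "A", "A", "B"], ["A", "B", "B", "B"])
print(identical, shifted)
```

Identical empirical distributions give a divergence of zero, and the value grows as the category frequencies drift apart, which matches the direction of the percentage scale described above.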

You can use relative entropy to validate distribution shifts in your data over time, or to compare the distributions of two data sets.

Reference source

For information on how to configure the reference source, refer to Reference source.

Sensitivity

Higher sensitivity means that the accepted range of values is narrower, which identifies more anomalies. Conversely, lower sensitivity values imply a wider range of accepted values, which identifies fewer anomalies.
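The relationship between sensitivity and the accepted range can be illustrated with a toy model. This sketch is purely illustrative and does not reflect how Validio computes dynamic thresholds: the band formula (mean plus or minus the standard deviation divided by sensitivity), the function names, and the `bounds_type` strings are all assumptions.

```python
import statistics


def decision_bounds(history, sensitivity, bounds_type="upper_and_lower"):
    """Hypothetical accepted range that narrows as sensitivity increases."""
    if sensitivity <= 0:
        raise ValueError("sensitivity must be positive")
    center = statistics.mean(history)
    # Assumption: half-width is inversely proportional to sensitivity.
    half_width = statistics.stdev(history) / sensitivity
    check_lower = bounds_type in ("lower", "upper_and_lower")
    check_upper = bounds_type in ("upper", "upper_and_lower")
    lower = center - half_width if check_lower else float("-inf")
    upper = center + half_width if check_upper else float("inf")
    return lower, upper


def count_anomalies(values, bounds):
    """Count values that fall outside the accepted range."""
    lower, upper = bounds
    return sum(1 for v in values if v < lower or v > upper)


history = [2, 3, 2, 4, 3, 2, 3]   # e.g. past "categories changed" counts
new_values = [3, 5, 1, 4]
low = count_anomalies(new_values, decision_bounds(history, sensitivity=0.5))
high = count_anomalies(new_values, decision_bounds(history, sensitivity=2.0))
print(low, high)
```

In this toy model, the higher sensitivity produces a narrower band and therefore flags at least as many values as the lower one. Restricting `bounds_type` to `"upper"` or `"lower"` ignores deviations on the other side.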