Validator overview

The Categorical distribution Validator checks that the number of categories added, removed, or changed over time stays within the expected range.

Configuration

Step	Required	Parameters	Options
Validator type	✅	Categorical distribution
Config	✅	Metric	Categories Added, Categories Removed, Categories Changed, Relative Entropy
Config		Backfill	Initialize with backfill (checkbox)
Source config	✅	Field	List of source fields with string data type
Source config	✅	Segmentation	Select a configured Segmentation, or Unsegmented (default)
Source config	✅	Window	Select a configured Window
Source config		Filter	No filter (default), or a Boolean, Enum, Null (*1), String, or Threshold filter
Reference source config	✅	Source	Select a Source to use as reference source
Reference source config	✅	Field	List of reference source fields with string data type
Reference source config	✅	Window	Select a configured Window
Reference source config	✅	Window offset	Select how many Windows you want to offset by
Reference source config	✅	Number of windows	Select the number of Windows to include
Reference source config		Filter	No filter (default), or a Boolean, Enum, Null (*1), String, or Threshold filter
Threshold	✅	Threshold type	Fixed threshold, Dynamic threshold
Threshold	✅(*2)	Operator	Less than, Less than or equal, Equal, Not equal, Greater than, Greater than or equal
Threshold	✅(*2)	Value	Numeric value to validate threshold on
Threshold	✅(*3)	Sensitivity	Enter a numeric value
Threshold	✅(*3)	Decision bounds type	Upper, Lower, Upper and lower (default)

*1 Only applicable for nullable columns.

*2 Only applicable for Fixed thresholds.

*3 Only applicable for Dynamic thresholds.
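
As an illustration of how a Fixed threshold combines an Operator and a Value, the following minimal Python sketch maps the operator names from the table onto comparisons. The function name `compare_to_threshold` and the `OPERATORS` mapping are hypothetical and not part of Validio. The sketch only evaluates the comparison; how Validio acts on the result is not described in this page.

```python
import operator

# Hypothetical mapping from the operator names in the table to Python comparisons.
OPERATORS = {
    "Less than": operator.lt,
    "Less than or equal": operator.le,
    "Equal": operator.eq,
    "Not equal": operator.ne,
    "Greater than": operator.gt,
    "Greater than or equal": operator.ge,
}


def compare_to_threshold(metric_value, operator_name, threshold_value):
    """Return the result of `metric_value <operator> threshold_value`."""
    return OPERATORS[operator_name](metric_value, threshold_value)


# Example: a metric value of 3 compared with "Greater than" and Value 2.
result = compare_to_threshold(3, "Greater than", 2)
print(result)  # True
```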

Configuration details

Categories added

Validates the number of new categories in the source dataset compared with a reference dataset.

This example shows all values of the monitored categorical field in the respective datasets:

Categories in the reference dataset	Categories in the source dataset
A
B
C	C
D	D
E	E
	F

Compared to the reference dataset, the source dataset has one new categorical value: F. The number of new categories is thus 1.

Categories removed

Using our example, two categorical values from the reference dataset are missing in the source dataset: A and B. The number of removed categories is thus 2.

Categories changed

Using our example again, the number of changed categories is the sum of new and removed categories. In this case, 1 + 2 = 3.
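
The three counts above can be reproduced with plain set operations. This is a minimal Python sketch of the arithmetic in the example, not Validio's implementation; the function name `category_changes` is hypothetical.

```python
def category_changes(reference_values, source_values):
    """Count categories added, removed, and changed between two datasets."""
    reference = set(reference_values)
    source = set(source_values)
    added = source - reference      # in the source but not in the reference
    removed = reference - source    # in the reference but not in the source
    return {
        "added": len(added),
        "removed": len(removed),
        "changed": len(added) + len(removed),
    }


# Values from the example table.
reference_values = ["A", "B", "C", "D", "E"]
source_values = ["C", "D", "E", "F"]

counts = category_changes(reference_values, source_values)
print(counts)  # {'added': 1, 'removed': 2, 'changed': 3}
```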

Relative entropy

In Validio, relative entropy is adapted from the Kullback-Leibler divergence.

Relative entropy is presented as a percentage where:

0% means identical empirical distributions.
100% means maximal difference in empirical distributions.

Note: You can use relative entropy to validate distribution shifts in your data over time, or to compare the distributions of two datasets.
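
For intuition, the standard Kullback-Leibler divergence between two empirical categorical distributions can be sketched in Python as follows. This is not Validio's adapted implementation: the function names, the direction (source relative to reference), and the small smoothing constant that avoids division by zero for unseen categories are all assumptions. The standard divergence is unbounded, and this page does not specify how Validio scales it to the 0% to 100% range, so the sketch returns the raw value.

```python
import math
from collections import Counter


def empirical_distribution(values, categories, smoothing=1e-9):
    """Relative frequency of each category, with additive smoothing (assumption)."""
    counts = Counter(values)
    total = len(values) + smoothing * len(categories)
    return {c: (counts[c] + smoothing) / total for c in categories}


def kl_divergence(source_values, reference_values, smoothing=1e-9):
    """Standard KL divergence D(source || reference) in nats."""
    categories = sorted(set(source_values) | set(reference_values))
    p = empirical_distribution(source_values, categories, smoothing)
    q = empirical_distribution(reference_values, categories, smoothing)
    return sum(p[c] * math.log(p[c] / q[c]) for c in categories)


same = kl_divergence(["A", "B", "B"], ["A", "B", "B"])
shifted = kl_divergence(["A", "A", "A", "B"], ["A", "B", "B", "B"])
print(same)     # 0.0 for identical empirical distributions
print(shifted)  # a positive value for differing distributions
```

A value of zero corresponds to the 0% case described above. Larger values indicate a larger difference between the two distributions.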

Reference source

For information on how to configure the reference source, refer to the Reference source documentation.

Sensitivity

Higher sensitivity means that the accepted range of values is narrower, which identifies more anomalies. Conversely, lower sensitivity values imply a wider range of accepted values, which identifies fewer anomalies.
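
The relationship between sensitivity and the accepted range can be illustrated with a simplified Python sketch. This is not Validio's Dynamic threshold algorithm, which is not described in this page. It assumes, purely for illustration, that the accepted range is the historical mean plus or minus the standard deviation divided by the sensitivity. The function names `decision_bounds` and `is_anomaly` are hypothetical.

```python
from statistics import mean, stdev


def decision_bounds(history, sensitivity, bounds_type="Upper and lower"):
    """Hypothetical accepted range: higher sensitivity gives a narrower range."""
    if sensitivity <= 0:
        raise ValueError("sensitivity must be positive")
    center = mean(history)
    half_width = stdev(history) / sensitivity  # illustrative assumption only
    lower = float("-inf")
    upper = float("inf")
    if bounds_type in ("Lower", "Upper and lower"):
        lower = center - half_width
    if bounds_type in ("Upper", "Upper and lower"):
        upper = center + half_width
    return lower, upper


def is_anomaly(value, bounds):
    """A value is anomalous when it falls outside the accepted range."""
    lower, upper = bounds
    return value < lower or value > upper


# Made-up history of a metric such as "Categories Added" per window.
history = [0, 1, 0, 2, 1, 0, 1]

high = is_anomaly(2, decision_bounds(history, sensitivity=1.0))
low = is_anomaly(2, decision_bounds(history, sensitivity=0.5))
print(high)  # True: the narrower range flags the value
print(low)   # False: the wider range accepts the value
```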