Validator Overview

The Categorical Distribution validator verifies that only the expected number of categories is added, removed, or changed over time. It can also validate the relative entropy between the source and reference distributions.

Metric Options	Description
Categories added	Validates the number of new categories in the source dataset against a reference dataset.
Categories removed	Validates the number of missing categories in the source dataset against a reference dataset.
Categories changed	Validates the number of new and removed categories in the source dataset against a reference dataset.
Relative entropy	Validates distribution shifts in your data over time.

Relative Entropy

You can use relative entropy to validate distribution shifts in your data over time, or to compare the distributions of two datasets. Relative entropy is presented as a percentage, where:

0% means identical empirical distributions.
100% means maximal difference in empirical distributions.

Note: In Validio, relative entropy is adapted from the implementation of the Kullback-Leibler divergence.
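To make the underlying idea concrete, here is a minimal sketch of the standard Kullback-Leibler divergence between two empirical categorical distributions. It is not Validio's implementation: the function names, the `epsilon` floor for categories missing from the reference, and the choice to measure the source against the reference are all assumptions made for illustration. How Validio adapts the value into the 0% to 100% range is not described here, so the sketch returns the raw divergence in nats.

```python
import math
from collections import Counter


def empirical_distribution(values, categories):
    """Return the relative frequency of each category in `values`."""
    counts = Counter(values)
    total = len(values)
    return {c: counts[c] / total for c in categories}


def kl_divergence(source_values, reference_values, epsilon=1e-9):
    """Standard KL divergence D(source || reference), in nats.

    `epsilon` is a hypothetical floor that avoids division by zero when a
    category appears in the source but not in the reference.
    """
    categories = set(source_values) | set(reference_values)
    p = empirical_distribution(source_values, categories)
    q = empirical_distribution(reference_values, categories)
    return sum(
        p[c] * math.log(p[c] / max(q[c], epsilon))
        for c in categories
        if p[c] > 0  # terms with zero source probability contribute nothing
    )


identical = kl_divergence(["a", "b"], ["a", "b"])
shifted = kl_divergence(["a", "a", "b"], ["a", "b", "b"])
```

Identical empirical distributions give a divergence of zero, which corresponds to the 0% end of the scale described above. Larger values indicate a larger shift.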

Metric Configuration Parameters

The following parameters are used in the Metric configuration step of creating a Categorical Distribution validator.

Parameter	Description	Options
Metric	Select the metric to calculate.	Categories Added, Categories Removed, Categories Changed, Relative Entropy
Field	Select a source field to use for the calculation.	List of available fields with data type string.
Reference Field	Select a reference source field to use for the calculation.	List of available fields with data type string.
Filter	(Optional) Use filters to specify which records to include in the calculation.	List of existing filters or create a new filter.
Reference Filter	(Optional) Use filters to specify which reference records to include in the calculation.	List of existing filters or create a new filter.
Window	Use windows to define the time-range over which the data is aggregated.	List of existing windows or create a new window.
Reference Window Offset	The number of windows you want to offset the aggregation.	Enter a number.
Number of Reference Windows	The number of windows to include.	Enter a number.
Segmentation	Use segmentation to break the data into separate groups for analysis.	List of existing segmentations, Unsegmented (default), or create a new segmentation.
Initialize using historic data	Start the validator with historical data to prime the anomaly detection algorithms.

Metric Calculation Example

The following example illustrates how the Categorical Distribution validator calculates the different metrics. The table shows all values of the monitored categorical field in the reference and source datasets:

Categories in the reference dataset	Categories in the source dataset
A
B
C	C
D	D
E	E
	F

Metric	Example Result
Categories added	In the example, compared to the reference dataset, the source dataset has one new categorical value, `F`. The number of new categories is `1`.
Categories removed	In the example, two categorical values that exist in the reference dataset are missing from the source dataset: `A` and `B`. The number of removed categories is `2`.
Categories changed	In the example, the number of changed categories is the sum of new and removed categories. In this case, `1`+`2`=`3`.
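
The three count-based results above can be reproduced with plain set operations. The following is a minimal sketch rather than Validio's own code. The function name `category_changes` and the returned dictionary keys are hypothetical.

```python
def category_changes(reference, source):
    """Count categories added, removed, and changed between two datasets."""
    ref, src = set(reference), set(source)
    added = src - ref      # in the source but not in the reference
    removed = ref - src    # in the reference but not in the source
    return {
        "added": len(added),
        "removed": len(removed),
        "changed": len(added) + len(removed),
    }


# Values from the example table above.
result = category_changes(
    reference=["A", "B", "C", "D", "E"],
    source=["C", "D", "E", "F"],
)
print(result)  # {'added': 1, 'removed': 2, 'changed': 3}
```

Using sets means duplicate values within a window do not affect the counts. Only the presence or absence of a category matters for these three metrics.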