
Categorical distribution

Categorical reference statistics between two datasets.

Validator overview

The categorical distribution Validator verifies that only the expected number of categories are added, removed, or changed over time.

Configuration

| Step | Required | Parameters | Options |
| --- | --- | --- | --- |
| Validator type | | Categorical distribution | - |
| Config | | Metric | Categories Added; Categories Removed; Categories Changed; Relative Entropy |
| Config | | Backfill | Initialize with backfill (checkbox) |
| Source config | | Field | Select a valid field from your Source |
| Source config | | Segmentation | 1. Select a configured Segmentation, or 2. Unsegmented (default) |
| Source config | | Window | Select a configured Window |
| Source config | | Filter | No filter (default); Boolean; Enum; Null (*1); String; Threshold filter |
| Reference source config | | Source | Select a Source to use as reference source |
| Reference source config | | Field | Select a valid field from your reference source |
| Reference source config | | Window | Select a configured Window |
| Reference source config | | Window offset | Select how many Windows you want to offset by |
| Reference source config | | Number of windows | Select the number of Windows to include |
| Reference source config | | Filter | No filter (default); Boolean; Enum; Null (*1); String; Threshold filter |
| Threshold | | Threshold type | Fixed threshold; Dynamic threshold; Monotonic threshold |
| Threshold | ✅ (*2) | Operator | Less than; Less than or equal; Equal; Not equal; Greater than; Greater than or equal |
| Threshold | | Value | Numeric value to threshold on |

*1 Only applicable for nullable columns.

*2 Only applicable for Fixed thresholds.

Configuration details

Categories added

This metric validates the number of categories that appear in the source dataset but not in the reference dataset.

This example shows all values of the monitored categorical field in each dataset:

| Source dataset (records and values in categorical feature being monitored) | Reference dataset (records and values in categorical feature being monitored) |
| --- | --- |
| C | A |
| D | B |
| E | C |
| F | D |
| | E |

Compared to the reference dataset, the source dataset has one new categorical value, F. The number of new categories is therefore 1.

Categories removed

Using our example, two categorical values that exist in the reference dataset are missing from the source dataset: A and B. The number of removed categories is therefore 2.

Categories changed

Using our example again, the number of changed categories is simply the sum of new and removed categories. In this case, 1+2=3.
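The three count metrics above can be sketched with plain set operations. This is a minimal illustration of the arithmetic in the example, not Validio's implementation; the function name `category_counts` and the dictionary keys are hypothetical.

```python
def category_counts(source_values, reference_values):
    """Count added, removed, and changed categories between two datasets."""
    source = set(source_values)
    reference = set(reference_values)

    added = source - reference    # present in source, absent from reference
    removed = reference - source  # present in reference, absent from source

    return {
        "added": len(added),
        "removed": len(removed),
        # "changed" is defined above as the sum of added and removed
        "changed": len(added) + len(removed),
    }


# Values from the example table above
source = ["C", "D", "E", "F"]
reference = ["A", "B", "C", "D", "E"]

counts = category_counts(source, reference)
print(counts)  # {'added': 1, 'removed': 2, 'changed': 3}
```

Duplicated records do not affect these counts, because only the distinct category values are compared.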

Relative entropy

In Validio, relative entropy is adapted from the Kullback-Leibler divergence.

Relative entropy is presented as a percentage where:

  • 0% means identical empirical distributions.
  • 100% means maximal difference in empirical distributions.

You can use relative entropy to validate distribution shifts in your data over time, or to compare the distributions of two data sets.
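The exact adaptation Validio applies is not described here. As an illustration only, the sketch below uses the Jensen-Shannon divergence, a symmetric measure built from two Kullback-Leibler terms that is bounded between 0 and 1 when computed with base-2 logarithms, and scales it to a percentage. Choosing Jensen-Shannon is an assumption made for this example, and all function names are hypothetical; results from Validio may differ.

```python
import math
from collections import Counter


def empirical_distribution(values, categories):
    """Relative frequency of each category, in the order given by `categories`."""
    if not values:
        raise ValueError("dataset must contain at least one record")
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; zero-probability terms in p are skipped."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence_percent(source_values, reference_values):
    """Jensen-Shannon divergence (assumed stand-in, not Validio's formula) as a percentage."""
    categories = sorted(set(source_values) | set(reference_values))
    p = empirical_distribution(source_values, categories)
    q = empirical_distribution(reference_values, categories)
    # Mixture distribution; it is non-zero wherever p or q is non-zero,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    js = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return 100.0 * js


# Values from the example table above
source = ["C", "D", "E", "F"]
reference = ["A", "B", "C", "D", "E"]
print(f"{js_divergence_percent(source, reference):.1f}%")
```

With this stand-in, two datasets with identical category frequencies give 0%, and two datasets with no categories in common give 100%, which matches the interpretation of the two endpoints above.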

Reference source

For information on how you configure the reference source, refer to reference source.