Categorical distribution
Categorical reference statistics between two datasets.
Validator overview
The categorical distribution Validator verifies that only the expected number of categories is added, removed, or changed over time.
Configuration
Step | Required | Parameters | Options |
---|---|---|---|
Validator type | ✅ | Categorical distribution | - |
Config | ✅ | Metric | Categories Added, Categories Removed, Categories Changed, Relative Entropy |
Config | | Backfill | Initialize with backfill (checkbox) |
Source config | ✅ | Field | List of source fields with string data type |
Source config | ✅ | Segmentation | 1. Select a configured Segmentation, or 2. Unsegmented (default) |
Source config | ✅ | Window | Select a configured Window |
Source config | | Filter | No filter (default), Boolean, Enum, Null (*1), String, Threshold filter |
Reference source config | ✅ | Source | Select a Source to use as reference source |
Reference source config | ✅ | Field | List of reference source fields with string data type |
Reference source config | ✅ | Window | Select a configured Window |
Reference source config | ✅ | Window offset | Select how many Windows you want to offset by |
Reference source config | ✅ | Number of windows | Select the number of Windows to include |
Reference source config | | Filter | No filter (default), Boolean, Enum, Null (*1), String, Threshold filter |
Threshold | ✅ | Threshold type | Fixed threshold, Dynamic threshold |
Threshold | ✅(*2) | Operator | Less than, Less than or equal, Equal, Not equal, Greater than, Greater than or equal |
Threshold | ✅(*2) | Value | Numeric value to validate threshold on |
Threshold | ✅(*3) | Sensitivity | Enter a numeric value |
Threshold | ✅(*3) | Decision bounds type | Upper, Lower, Upper and lower (default) |
*1 Only applicable for nullable columns.
*2 Only applicable for Fixed thresholds.
*3 Only applicable for Dynamic thresholds.
Configuration details
Categories added
Validates the number of new categories in the source dataset against a reference dataset.
The following example shows all values of the monitored categorical field in each dataset:
Categories in the reference dataset | Categories in the source dataset |
---|---|
A | |
B | |
C | C |
D | D |
E | E |
| F |
Compared to the reference dataset, the source dataset has one new categorical value, F. The number of new categories is thus 1.
Categories removed
Using our example, two categorical values that exist in the reference dataset are missing from the source dataset: A and B. The number of removed categories is thus 2.
Categories changed
Using our example again, the number of changed categories is the sum of new and removed categories. In this case, 1 + 2 = 3.
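The three count metrics from the example can be reproduced with plain set arithmetic. The following is a minimal sketch, not Validio's implementation; the function name `category_changes` and the returned dictionary keys are chosen here for illustration only.

```python
def category_changes(source_categories, reference_categories):
    """Count categories added, removed, and changed between two datasets.

    Hypothetical helper for illustration; not part of any Validio API.
    """
    source = set(source_categories)
    reference = set(reference_categories)

    added = source - reference      # present in source, absent from reference
    removed = reference - source    # present in reference, absent from source

    return {
        "added": len(added),
        "removed": len(removed),
        # "changed" is defined as the sum of added and removed categories
        "changed": len(added) + len(removed),
    }


# Values taken from the example table above.
reference = ["A", "B", "C", "D", "E"]
source = ["C", "D", "E", "F"]

result = category_changes(source, reference)
print(result)
```

With the values from the table, this yields 1 added, 2 removed, and 3 changed categories.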
Relative entropy
In Validio, relative entropy is adapted from the Kullback-Leibler divergence.
Relative entropy is presented as a percentage where:
0% means identical empirical distributions.
100% means maximal difference in empirical distributions.
You can use relative entropy to validate distribution shifts in your data over time, or to compare the distributions of two datasets.
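To make the idea concrete, the sketch below computes the standard Kullback-Leibler divergence between the empirical category distributions of a source and a reference sample. Several parts are assumptions rather than documented Validio behavior: the direction of the divergence (source relative to reference), the small smoothing constant that avoids division by zero for unseen categories, and above all the mapping to a percentage, `100 * (1 - exp(-KL))`, which is just one possible way to bound the value between 0% and 100%. How Validio actually adapts the divergence is not specified here.

```python
import math
from collections import Counter


def kl_divergence(source_values, reference_values, smoothing=1e-9):
    """Standard KL divergence D(source || reference) over category frequencies.

    The smoothing term is an assumption made for this sketch so that
    categories missing from one dataset do not produce log(0) or x/0.
    """
    p_counts = Counter(source_values)
    q_counts = Counter(reference_values)
    categories = set(p_counts) | set(q_counts)

    p_total = sum(p_counts.values()) + smoothing * len(categories)
    q_total = sum(q_counts.values()) + smoothing * len(categories)

    divergence = 0.0
    for category in categories:
        p = (p_counts[category] + smoothing) / p_total
        q = (q_counts[category] + smoothing) / q_total
        divergence += p * math.log(p / q)
    return divergence


def as_percentage(divergence):
    """Hypothetical bounded mapping to 0-100%; not Validio's formula."""
    return 100.0 * (1.0 - math.exp(-divergence))


reference = ["A", "A", "B", "B", "C", "C"]
same = list(reference)
shifted = ["A", "A", "A", "A", "B", "C"]

print(as_percentage(kl_divergence(same, reference)))
print(as_percentage(kl_divergence(shifted, reference)))
```

Identical distributions give a divergence of zero, so they map to 0%. Distributions with no overlapping categories give a very large divergence, which this mapping pushes towards 100%.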
Reference source
For information on how you configure the reference source, refer to reference source.
Sensitivity
Higher sensitivity means that the accepted range of values is narrower, which identifies more anomalies. Conversely, lower sensitivity values imply a wider range of accepted values, which identifies fewer anomalies.
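The relationship between sensitivity and the accepted range can be illustrated with a toy model. This is purely a sketch under stated assumptions: the accepted range is modeled as the historical mean plus or minus a number of standard deviations, and the width is assumed to scale as `3 / sensitivity`. Neither the formula nor the names `accepted_range` and `is_anomaly` come from Validio; the actual dynamic threshold algorithm is not described in this section.

```python
from statistics import mean, stdev


def accepted_range(history, sensitivity):
    """Toy accepted range: mean +/- (3 / sensitivity) standard deviations.

    Hypothetical model for illustration only; higher sensitivity
    produces a narrower range.
    """
    if sensitivity <= 0:
        raise ValueError("sensitivity must be positive")
    centre = mean(history)
    half_width = (3.0 / sensitivity) * stdev(history)
    return centre - half_width, centre + half_width


def is_anomaly(value, history, sensitivity):
    """Flag a value that falls outside the toy accepted range."""
    lower, upper = accepted_range(history, sensitivity)
    return value < lower or value > upper


# Hypothetical history of a "Categories added" metric per window.
history = [0, 1, 0, 2, 1, 0, 1, 1]
new_value = 2

for sensitivity in (1.0, 3.0):
    lower, upper = accepted_range(history, sensitivity)
    flagged = is_anomaly(new_value, history, sensitivity)
    print(f"sensitivity={sensitivity}: range=({lower:.2f}, {upper:.2f}), anomaly={flagged}")
```

In this toy model the same value passes at the lower sensitivity and is flagged at the higher one, which mirrors the behavior described above.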