Categorical Distribution
Categorical reference statistics between two datasets.
Validator Overview
The categorical distribution validator verifies that the number of categories added, removed, or changed over time stays within the expected range. It can also monitor the relative entropy between the source and reference distributions.
| Metric Options | Description |
|---|---|
| Categories added | Validates the number of new categories in the source dataset against a reference dataset. |
| Categories removed | Validates the number of missing categories in the source dataset against a reference dataset. |
| Categories changed | Validates the number of new and removed categories in the source dataset against a reference dataset. |
| Relative entropy | Validates distribution shifts in your data over time. |
Categorical distribution validators support reference source configuration for data validation. For more information, see Reference Source Validation.
Relative Entropy
You can use relative entropy to validate distribution shifts in your data over time, or to compare the distributions of two datasets. Relative entropy is presented as a percentage, where:
- 0% means identical empirical distributions.
- 100% means maximal difference in empirical distributions.
In Validio, relative entropy is an adaptation of the Kullback-Leibler divergence.
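As a rough illustration of the underlying idea, the sketch below computes the plain Kullback-Leibler divergence between the empirical distributions of a source field and a reference field. This page does not describe how Validio adapts the divergence or scales it to a percentage, so the sketch stops at the unscaled value in nats. The function names and the `epsilon` smoothing (used to avoid dividing by zero when a category is missing on one side) are assumptions made for this example, not Validio's implementation.

```python
import math
from collections import Counter


def empirical_distribution(values, categories):
    """Return the share of each category among the observed values."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    total = len(values)
    return {category: counts[category] / total for category in categories}


def kl_divergence(source_values, reference_values, epsilon=1e-9):
    """Kullback-Leibler divergence D(source || reference), in nats.

    epsilon is a hypothetical smoothing term added to every probability
    so that categories missing on one side do not cause a division by zero.
    """
    categories = sorted(set(source_values) | set(reference_values))
    p = empirical_distribution(source_values, categories)
    q = empirical_distribution(reference_values, categories)

    divergence = 0.0
    for category in categories:
        p_c = p[category] + epsilon
        q_c = q[category] + epsilon
        divergence += p_c * math.log(p_c / q_c)
    return divergence


# Identical empirical distributions give a divergence of zero.
same = kl_divergence(["a", "b", "b"], ["a", "b", "b"])

# A shift in category shares gives a positive divergence.
shifted = kl_divergence(["a", "a", "b"], ["a", "b", "b"])
```

The unscaled divergence has no upper bound, which is one reason a bounded, percentage-based adaptation is easier to set thresholds on.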
Metric Configuration Parameters
Configure the validator metric calculation with the following parameters:
| Parameters | Description |
|---|---|
| Metric | Select the metric to calculate. See the Metric Options table. |
| Field | Select a source field from a list of available fields with string data type. |
| Reference field | Select a reference field from a list of available fields with string data type. |
| Filter | (Optional) Select from a list of filters or create a new filter to specify which records to include. |
| Reference filter | (Optional) Select from a list of filters or create a new filter to specify which reference records to include. |
| Window | Select from a list of windows or create a new window to specify how to aggregate the data. |
| Reference window offset | Enter a number to specify how many windows to shift back in time to compare against the current window. |
| Number of reference windows | Enter a number to specify the number of windows to include in the aggregation. |
| Segmentation | (Optional) Apply segmentation to analyze data in separate groups. Default is Unsegmented. |
Metric Calculation Example
The following example illustrates how the categorical distribution validator calculates the different metrics. The table shows all values of the monitored categorical fields in the respective datasets:
| Categories in the reference dataset | Categories in the source dataset |
|---|---|
| A | |
| B | |
| C | C |
| D | D |
| E | E |
| | F |
| Metric | Example Result |
|---|---|
| Categories added | In the example, compared to the reference dataset, the source dataset has one new categorical value F. The number of new categories is 1. |
| Categories removed | In the example, two categorical values, A and B, are present in the reference dataset but missing from the source dataset. The number of removed categories is 2. |
| Categories changed | In the example, the number of changed categories is the sum of new and removed categories. In this case, 1+2=3. |
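The example results above can be reproduced with set operations. This is a minimal sketch of the arithmetic only; the function name `category_changes` and the returned dictionary keys are hypothetical and do not correspond to a Validio API.

```python
def category_changes(source_categories, reference_categories):
    """Count categories added, removed, and changed relative to the reference."""
    source = set(source_categories)
    reference = set(reference_categories)

    added = source - reference      # present in source, absent in reference
    removed = reference - source    # present in reference, absent in source

    return {
        "added": len(added),
        "removed": len(removed),
        "changed": len(added) + len(removed),
    }


# Category values taken from the example table above.
reference_dataset = ["A", "B", "C", "D", "E"]
source_dataset = ["C", "D", "E", "F"]

result = category_changes(source_dataset, reference_dataset)
print(result)  # {'added': 1, 'removed': 2, 'changed': 3}
```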