About Validators
Validators are the components responsible for validating the data in your Sources. A Validator is always attached to a Source, and validation is done over a Window of data. You can use Filters to exclude data from the validation, and Segmentations to partition the validation on one or several dimensions, such as Age or Country.
For details on specific Validators, refer to Validator types.
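To make the idea of a Segmentation concrete, here is a minimal sketch in plain Python that partitions one window of records by Country and computes a mean of Age per segment. The record layout and the helper name `mean_by_segment` are assumptions made for illustration only; this is not Validio's API or implementation.

```python
from collections import defaultdict
from statistics import mean


def mean_by_segment(records, segment_field, value_field):
    """Group records by a segment field and return the mean of a value field per group.

    Hypothetical helper for illustration; not part of any Validio library.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[segment_field]].append(record[value_field])
    return {segment: mean(values) for segment, values in groups.items()}


# Illustrative data for a single window.
window = [
    {"Country": "SE", "Age": 30},
    {"Country": "SE", "Age": 40},
    {"Country": "NO", "Age": 50},
]

segment_means = mean_by_segment(window, "Country", "Age")
print(segment_means)
```

Each segment gets its own metric value, so a problem that affects only one Country can surface even when the overall mean looks normal.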
What does a Validator monitor?
A Validator monitors a metric calculated using one or more fields over a Window. For example:
- Mean of field Age
- Unique values in field Country
- Number of records in table Customers
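As a rough illustration of the three metrics listed above, the following sketch computes them over a single window of records in plain Python. The record layout and the function name `window_metrics` are assumptions for the example; the sketch does not reflect how Validio computes metrics internally.

```python
from statistics import mean


def window_metrics(records):
    """Compute example metrics over one window of records (hypothetical helper)."""
    # Skip missing values so they do not distort the mean or the distinct count.
    ages = [r["Age"] for r in records if r.get("Age") is not None]
    countries = {r["Country"] for r in records if r.get("Country") is not None}
    return {
        "mean_age": mean(ages) if ages else None,
        "unique_countries": len(countries),
        "record_count": len(records),
    }


# Illustrative window of Customers records.
window = [
    {"Age": 31, "Country": "SE"},
    {"Age": 45, "Country": "NO"},
    {"Age": 26, "Country": "SE"},
]

metrics = window_metrics(window)
print(metrics)
```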
A Validator can also track reference metrics (statistics). A reference metric is produced by comparing fields from two different datasets. For example:
- Ratio between two means: mean of table1_age and mean of table2_age
- Relative entropy between table1_basket_size and table2_basket_size, used to check distribution shifts. For more information, refer to Numeric distribution or Categorical distribution Validator types.
- Number of new categories in table1_country compared to table2_country
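The three reference metrics above can be sketched in plain Python as follows. The function names are hypothetical, and the relative entropy here is a simple Kullback-Leibler divergence over empirical value frequencies, with a small `epsilon` floor for values missing from the reference. That floor is an assumption chosen to keep the example short, and Validio's own calculation may differ.

```python
import math
from collections import Counter


def ratio_of_means(source, reference):
    """Mean of the source values divided by the mean of the reference values."""
    return (sum(source) / len(source)) / (sum(reference) / len(reference))


def relative_entropy(source, reference, epsilon=1e-9):
    """KL divergence D(source || reference) between empirical value frequencies.

    epsilon is an assumed floor that avoids division by zero when a value
    appears in the source but not in the reference.
    """
    p_counts, q_counts = Counter(source), Counter(reference)
    divergence = 0.0
    for value, count in p_counts.items():
        p = count / len(source)
        q = max(q_counts[value] / len(reference), epsilon)
        divergence += p * math.log(p / q)
    return divergence


def new_categories(source, reference):
    """Number of distinct source values that never occur in the reference."""
    return len(set(source) - set(reference))


# Illustrative data only.
print(ratio_of_means([30, 40, 50], [20, 40, 60]))
print(relative_entropy([1, 1, 2], [1, 2, 2]))
print(new_categories(["SE", "NO", "DK"], ["SE", "NO"]))
```

A relative entropy of zero means the two empirical distributions are identical; larger values indicate a larger shift.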
A Validator tracks a scalar value or a string value (for example, the mode of a categorical field). The value is calculated from one or several fields, as in the preceding examples.
How do I configure a Validator?
You can configure a Validator in two ways:
- Configure Validators according to your specifications.
- Let Validio set up recommended Validators for you.
Recommended Validators
Based on your Source and Windows, Validio can set up recommended Validators to help you get started with your data monitoring.
What do I configure for a Validator?
Configuration steps look different depending on Validator type. For details on configuration parameters, refer to the respective Validator type.
Source and reference source
- Source config: specify which parts of a dataset the Validator monitors from your Source.
- Reference source config: specify a reference source to compare your metric against.
Thresholds
A Threshold defines what values of the calculated Validator metric should be considered an incident.
You can set up either manual or smart Thresholds to identify incidents. All metric values that breach your Threshold are flagged as incidents, which can be collected and sent as notifications.
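The sketch below illustrates the idea of a manual Threshold with fixed lower and upper bounds: each metric value is checked against the bounds, and values outside them are collected as incidents. The class `ManualThreshold`, the function `find_incidents`, and the sample values are all hypothetical and serve only to illustrate the concept. Smart Thresholds are not modeled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ManualThreshold:
    """Fixed bounds for a metric; a bound set to None is not enforced."""
    lower: Optional[float] = None
    upper: Optional[float] = None

    def is_breached(self, value: float) -> bool:
        below = self.lower is not None and value < self.lower
        above = self.upper is not None and value > self.upper
        return below or above


def find_incidents(metric_values, threshold):
    """Return the (window, value) pairs whose value breaches the threshold."""
    return [(window, value) for window, value in metric_values
            if threshold.is_breached(value)]


# Illustrative metric values, one per daily window.
daily_mean_age = [
    ("2024-01-01", 34.2),
    ("2024-01-02", 35.1),
    ("2024-01-03", 52.7),
]

incidents = find_incidents(daily_mean_age, ManualThreshold(lower=30.0, upper=40.0))
print(incidents)
```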
Filters
Filters allow you to determine what raw data is validated.
You can use Filters to exclude certain datapoints from the metric calculation for a Validator.
Example on using filters
The following example illustrates how you can apply a filter in your Validator:
price | price between 100-1000? |
---|---|
14 | FALSE |
455 | TRUE |
324 | TRUE |
39 | FALSE |
5589 | FALSE |
This is a conceptual model of what happens in the filtering stage. Note: No actual column is created in your data; this model is for explanation purposes only.
Only datapoints that pass the filter, which in this case means a price between 100 and 1000, are included in the metric calculation. Whether it is a row count or a mean Validator, datapoints that do not pass the filter logic are excluded from the metric calculation.
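The table above can be reproduced with a few lines of plain Python. This is a conceptual sketch only: the helper `passes_filter` is hypothetical, and treating the bounds 100 and 1000 as inclusive is an assumption.

```python
from statistics import mean

# Prices from the example table.
prices = [14, 455, 324, 39, 5589]


def passes_filter(price, low=100, high=1000):
    """Return True when the price lies between low and high (assumed inclusive)."""
    return low <= price <= high


# Only datapoints that pass the filter take part in the metric calculation.
included = [price for price in prices if passes_filter(price)]

row_count = len(included)
mean_price = mean(included)

print(included, row_count, mean_price)
```

Without the filter, the mean would be pulled up sharply by the outlier 5589; with it, both the row count and the mean are computed from the two remaining datapoints.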
Backfill
Backfill is used to read and view historical data in your validations. It is also used to train algorithms on historical data, so that validators can provide value from day one.
Typically, Validators are backfilled with historical data when you start a Source for the first time, if historical data is available. If you want to load historical data into new Validators in a Source that is already started, you can select the backfill option.
Note
If you select the backfill option when you create a Validator on a Source after it has started, the Validator is put into the Pending Backfill state. You must then trigger a manual backfill on the source to get the data in the Validator.
Reset
Resetting a source is equivalent to deleting it, then creating a new source with the same configuration, and finally backfilling all its validators.
These are some common scenarios when it can be useful to reset a source:
- Underlying historical data has changed, and you want to re-validate that historical data.
- You have changed sensitivity on a dynamic threshold and want to re-validate all historical data with the new sensitivity.