About Validators

Validators are the components responsible for validating the data in your Sources. A Validator is always attached to a Source, and validation is done over a Window of data. You can use Filters to exclude data from the validation, and Segmentations to partition the validation on one or several dimensions, such as Age or Country.

An example of a numeric Validator.

For details on specific Validators, refer to Validator types.

What does a Validator monitor?

A Validator monitors a metric calculated using one or more fields over a Window. For example:

  • Mean of field Age
  • Unique values in field Country
  • Number of records in table Customers
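As a rough illustration of the three metrics listed above, the following sketch computes them over a single Window of records. It is not Validio code: the record layout and the helper names (`mean_of_field`, `unique_values`, `record_count`) are hypothetical and chosen only for this example.

```python
from statistics import mean

# Hypothetical records that fall inside one Window of a "Customers" table.
window_records = [
    {"Age": 31, "Country": "SE"},
    {"Age": 45, "Country": "DE"},
    {"Age": 28, "Country": "SE"},
    {"Age": 52, "Country": "US"},
]

def mean_of_field(records, field):
    """Mean of a numeric field over the Window."""
    return mean(r[field] for r in records)

def unique_values(records, field):
    """Number of distinct values of a field over the Window."""
    return len({r[field] for r in records})

def record_count(records):
    """Number of records in the Window."""
    return len(records)

metrics = {
    "mean_age": mean_of_field(window_records, "Age"),
    "unique_countries": unique_values(window_records, "Country"),
    "record_count": record_count(window_records),
}
```

Each metric reduces the Window to a single value, which is what the Validator then tracks over time.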

A Validator can also track reference metrics (statistics). A reference metric is produced by comparing fields from two different datasets. For example:

  • Ratio between two means: mean of table1_age and mean of table2_age
  • Relative entropy between table1_basket_size and table2_basket_size, used to check distribution shifts. For more information, refer to Numeric distribution or Categorical distribution Validator types.
  • Number of new categories in table1_country compared to table2_country
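The reference metrics listed above can be sketched in a few lines. This is a simplified illustration, not the calculation Validio performs: the function names are hypothetical, relative entropy is computed here as the Kullback-Leibler divergence over discrete value frequencies, and the `eps` floor for categories missing from the reference is an assumption made to avoid division by zero.

```python
import math
from collections import Counter
from statistics import mean

def ratio_of_means(source_values, reference_values):
    """Ratio between the mean of a source field and the mean of a reference field."""
    return mean(source_values) / mean(reference_values)

def relative_entropy(source_values, reference_values, eps=1e-9):
    """KL divergence D(source || reference) over discrete value frequencies."""
    p_counts = Counter(source_values)
    q_counts = Counter(reference_values)
    p_total = len(source_values)
    q_total = len(reference_values)
    divergence = 0.0
    for category, count in p_counts.items():
        p = count / p_total
        # Assumption: floor the reference probability so unseen categories
        # produce a large but finite contribution.
        q = max(q_counts[category] / q_total, eps)
        divergence += p * math.log(p / q)
    return divergence

def new_categories(source_values, reference_values):
    """Number of categories present in the source but absent from the reference."""
    return len(set(source_values) - set(reference_values))
```

For instance, `new_categories(["SE", "DE", "NO"], ["SE", "DE"])` counts one new category, and `relative_entropy` of a dataset against itself is zero because the two distributions are identical.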

A Validator tracks either a scalar value or a string value (for example, the mode of a categorical field). In both cases the value is calculated from one or several fields, as in the preceding examples.

How do I configure a Validator?

You can configure a Validator in two ways:

  1. Configure Validators according to your specifications.
  2. Let Validio set up recommended Validators for you.

Recommended Validators

Based on your Source and Windows, Validio can set up recommended Validators to help you get started with your data monitoring.

What do I configure for a Validator?

Configuration steps look different depending on Validator type. For details on configuration parameters, refer to the respective Validator type.

Source and reference source

Thresholds

A Threshold defines what values of the calculated Validator metric should be considered an incident.

You can set up either manual or smart Thresholds to identify your incidents. All metric values that breach your Threshold are flagged as incidents, which can be collected and sent as notifications.
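To make the idea of a breached Threshold concrete, here is a minimal sketch of a manual Threshold with fixed bounds. The function `breaches_threshold`, the bounds, and the metric values are all hypothetical; smart Thresholds are not shown because their logic is not described here.

```python
def breaches_threshold(value, lower=None, upper=None):
    """Return True if a metric value falls outside the given fixed bounds."""
    if lower is not None and value < lower:
        return True
    if upper is not None and value > upper:
        return True
    return False

# Hypothetical metric values, one per Window (for example, mean of field Age).
metric_values = [39.0, 41.5, 72.3, 38.2]

# Values outside the assumed bounds of 30 to 60 are flagged as incidents.
incidents = [v for v in metric_values if breaches_threshold(v, lower=30, upper=60)]
```

With these assumed bounds, only the value 72.3 is flagged.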

Filters

Filters allow you to determine what raw data is validated.

You can use filters to exclude certain datapoints in the metrics calculation for a Validator.

Example on using filters

The following example illustrates how you can apply a filter in your Validator:

price    price between 100-1000?
14       FALSE
455      TRUE
324      TRUE
39       FALSE
5589     FALSE

This is a conceptual model of what happens in the filtering stage. Note: No actual column is created in your data. The model is for explanation purposes only.

Only datapoints that pass the filter, which in this case requires a price between 100 and 1000, are included in the metric calculation. Whether it is a row count or a mean Validator, datapoints that do not pass the filter logic are excluded from the metric calculation.
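The filtering stage can be sketched with the price values from the table above and a filter for prices between 100 and 1000. This is a conceptual illustration only: `passes_filter` is a hypothetical helper, and treating the bounds as inclusive is an assumption.

```python
from statistics import mean

# Price values from the example table.
prices = [14, 455, 324, 39, 5589]

def passes_filter(price, low=100, high=1000):
    """Filter logic: keep prices between low and high (bounds assumed inclusive)."""
    return low <= price <= high

# Only datapoints that pass the filter reach the metric calculation.
filtered_prices = [p for p in prices if passes_filter(p)]

row_count = len(filtered_prices)    # metric for a row count Validator
mean_price = mean(filtered_prices)  # metric for a mean Validator
```

Here only 455 and 324 pass the filter, so the row count is 2 and the mean is 389.5, compared with 5 rows and a mean of 1284.2 on the unfiltered data.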

Backfill

Backfill is used to read and view historical data in your validations. It is also used to train algorithms on historical data, so that validators can provide value from day one.

Typically, backfill of Validators with historical data occurs when you start a Source for the first time, if historical data is available. If you want to load historical data into new Validators in a Source that is already started, you can select the backfill option.

Pending backfill on an already started Source

If you select the backfill option when you create a Validator, the Validator is put into the Pending Backfill state. You must then trigger a manual backfill on the source to get the data in the Validator.

Reset

Resetting a source is equivalent to deleting it, then creating a new source with the same configuration, and finally backfilling all its validators.

These are some common scenarios when it can be useful to reset a source:

  1. Underlying historical data has changed, and you want to re-validate that historical data.
  2. You have changed sensitivity on a dynamic threshold and want to re-validate all historical data with the new sensitivity.