Segmentation splits the data from a source into subsets so that metrics can be validated on each subset.

With Segments, Validio can validate metrics on specific subsets of your data. For example, if a Segmentation is specified for the fields Country and Age, metrics are validated separately for each combination of values in those fields. If you validate the mean of Price on this Segmentation, the mean of Price for Country = Sweden and Age = 10 is validated independently from the mean of Price for Country = USA and Age = 10.
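The per-segment behaviour described above can be illustrated with a minimal sketch in plain Python. It is not Validio's API: the `records` data, the `segmented_mean` helper, and the field values are made up for illustration, and only the field names Country, Age and Price come from the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; only the field names come from the example in the text.
records = [
    {"Country": "Sweden", "Age": 10, "Price": 100.0},
    {"Country": "Sweden", "Age": 10, "Price": 120.0},
    {"Country": "USA", "Age": 10, "Price": 10.0},
    {"Country": "USA", "Age": 10, "Price": 14.0},
]


def segmented_mean(rows, segment_fields, metric_field):
    """Group rows by their values in segment_fields and average metric_field per group."""
    groups = defaultdict(list)
    for row in rows:
        # The segment key is the tuple of values for the chosen fields.
        key = tuple(row[field] for field in segment_fields)
        groups[key].append(row[metric_field])
    return {key: mean(values) for key, values in groups.items()}


means = segmented_mean(records, ["Country", "Age"], "Price")
for segment, value in means.items():
    print(segment, value)
```

Each key in `means` is one segment, so a check on the (Sweden, 10) mean never sees the USA rows, and vice versa.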

Data segmented by categorical field values


Segmentation allows you to subset data on categorical field values, and to validate metrics on each of the segments.

Multi-field segments

You can create segments using multiple fields. However, for technical reasons the number of segments is limited, and that number is determined by how many fields you use and the cardinality of each field:

#Segments = [#distinct categorical values in field 1] x [#distinct categorical values in field 2] x … x [#distinct categorical values in field N]

Note: This is the upper bound on the number of segments. The actual number can be lower if some combinations of field values never occur in the data.
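To make the difference between the upper bound and the actual segment count concrete, here is a small sketch in plain Python. The rows and the `segment_counts` helper are hypothetical and only mirror the formula above. They do not show how Validio computes segments internally.

```python
from math import prod

# Hypothetical rows of (Country, Gender, Marital status) values.
rows = [
    ("Sweden", "F", "Single"),
    ("Sweden", "M", "Married"),
    ("USA", "F", "Married"),
    ("USA", "F", "Single"),
]


def segment_counts(rows):
    """Return per-field cardinalities, their product, and the observed segment count."""
    n_fields = len(rows[0])
    # Cardinality = number of distinct values seen in each field.
    cardinalities = [len({row[i] for row in rows}) for i in range(n_fields)]
    upper_bound = prod(cardinalities)
    # Only combinations that actually occur in the data become segments.
    observed = len(set(rows))
    return cardinalities, upper_bound, observed


cardinalities, upper_bound, observed = segment_counts(rows)
print(cardinalities, upper_bound, observed)
```

With three fields of two distinct values each, the upper bound is 2 x 2 x 2 = 8, but only 4 combinations occur in this sample, so only 4 segments would exist.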

Validio has customer production deployments with thousands of Segmentations, and the number of supported Segmentations is continuously increasing.

In this example we create a Segmentation on the data using three fields, `Country`, `Gender` and `Marital status`, tracking median `Annual salary` for each of the segments.

In this example we create a Segmentation on the data using the three fields: Country, Gender and Marital status, tracking average Annual salary for each of the segments.

Why should you use Segmentation?

Aggregating data loses information. Conversely, segmenting data allows more granular analysis.

For instance, consider a retail organization that works with customers across the globe. They want to validate their price data to make sure that items are properly priced.

For validation purposes, they want to use the fields price and currency. Validating datapoints from the price column alone makes little sense: because of differences in currency, the prices differ by orders of magnitude. Before running a data quality validation to check for anomalies, the data must be segmented by currency.

Consider the difference in order of magnitude when the very same price for a specific item is expressed in USD versus Iranian Rial, at a conversion rate of roughly 1 USD = 40 000 Iranian Rial.

If Segmentation isn’t applied to validate price data, the retail organization would be comparing apples with cars.
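The effect can be sketched with a toy example in plain Python. The observations are invented, and the `flag_outliers` rule (flag anything more than three times above or below the median) is a deliberately simple stand-in chosen for illustration. It is an assumption, not the anomaly detection Validio uses.

```python
from collections import defaultdict
from statistics import median

# Invented (currency, price) observations; the last IRR price is ten times too high.
observations = [
    ("USD", 10.0), ("USD", 11.0), ("USD", 9.5), ("USD", 10.5),
    ("IRR", 400000.0), ("IRR", 420000.0), ("IRR", 410000.0), ("IRR", 4100000.0),
]


def flag_outliers(values, factor=3.0):
    """Illustrative rule only: flag values more than `factor` times above or below the median."""
    mid = median(values)
    return [v for v in values if v > mid * factor or v < mid / factor]


# Without segmentation: all prices are pooled regardless of currency.
pooled_flags = flag_outliers([price for _, price in observations])

# With segmentation: the same rule is applied to each currency separately.
by_currency = defaultdict(list)
for currency, price in observations:
    by_currency[currency].append(price)
segmented_flags = {cur: flag_outliers(prices) for cur, prices in by_currency.items()}

print("pooled:", pooled_flags)
print("segmented:", segmented_flags)
```

Pooled, the rule flags every USD price as too low alongside the one bad IRR price, so the real anomaly is buried in noise. Segmented by currency, only the inflated IRR price is flagged.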


Anomaly detected in the Currency = IRR segment.