Partitioning Pipelines

Partitioning is one of our most-used features. It lets you easily monitor and validate subsets of your data.

What is pipeline partitioning?

Pipeline partitioning means dividing your data into partitions so that you can monitor and validate metrics on each subset of your data.

Data is partitioned by categorical feature values.

Partitioning lets you subset data on categorical feature values, then monitor and validate metrics on each partition.

Multi-feature partitions

You can also partition by multiple features, creating multi-feature partitions.

In this example, we partition the data on three features, ‘Country’, ‘Gender’ and ‘Marital status’, and track the median ‘Annual salary’ for each partition.
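To make the idea concrete, here is a minimal sketch of multi-feature partitioning in plain Python. It is not Validio's implementation or API: the sample records, the field names and the `median_by_partition` helper are all hypothetical and exist only to illustrate how one metric is tracked per combination of feature values.

```python
from collections import defaultdict
from statistics import median

# Hypothetical sample records; the field names mirror the example above.
records = [
    {"country": "SE", "gender": "F", "marital_status": "Married", "annual_salary": 52000},
    {"country": "SE", "gender": "F", "marital_status": "Married", "annual_salary": 58000},
    {"country": "SE", "gender": "M", "marital_status": "Single", "annual_salary": 49000},
    {"country": "US", "gender": "F", "marital_status": "Single", "annual_salary": 71000},
    {"country": "US", "gender": "F", "marital_status": "Single", "annual_salary": 65000},
    {"country": "US", "gender": "F", "marital_status": "Single", "annual_salary": 90000},
]


def median_by_partition(rows, partition_features, metric_feature):
    """Group rows by the tuple of partition feature values and
    return the median of the metric feature for each group."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in partition_features)
        groups[key].append(row[metric_feature])
    return {key: median(values) for key, values in groups.items()}


medians = median_by_partition(
    records, ["country", "gender", "marital_status"], "annual_salary"
)
for key, value in sorted(medians.items()):
    print(key, value)
```

Each key in `medians` is one partition, so the six records above collapse into three partitions, each with its own median salary.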

Validio lets you partition on almost as many features as you want. For technical reasons, the practical limit is the total number of partitions, which is driven by the number of features and the cardinality of each feature:

#Partitions = [#distinct categorical values in feature 1] x [#distinct categorical values in feature 2] x … x [#distinct categorical values in feature N]

Note that this is an upper bound on the number of partitions. The actual number can be lower if some combinations of feature values never occur in the data.
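The difference between the upper bound and the number of partitions actually present can be sketched as follows. This is an illustrative calculation only: the rows and the two helper functions are hypothetical and not part of Validio.

```python
from math import prod

# Hypothetical rows: (country, gender, marital_status)
FEATURES = ("country", "gender", "marital_status")
rows = [
    {"country": "SE", "gender": "F", "marital_status": "Married"},
    {"country": "SE", "gender": "M", "marital_status": "Single"},
    {"country": "US", "gender": "F", "marital_status": "Single"},
    {"country": "DE", "gender": "M", "marital_status": "Married"},
    {"country": "US", "gender": "F", "marital_status": "Single"},
]


def partition_upper_bound(rows, features):
    """Product of the number of distinct values per feature."""
    return prod(len({row[f] for row in rows}) for f in features)


def observed_partitions(rows, features):
    """Number of distinct feature-value combinations present in the data."""
    return len({tuple(row[f] for f in features) for row in rows})


upper = partition_upper_bound(rows, FEATURES)   # 3 countries x 2 genders x 2 statuses
observed = observed_partitions(rows, FEATURES)  # combinations that actually occur
print(f"upper bound: {upper}, observed: {observed}")
```

With this sample, the upper bound is 12 while only 4 combinations occur, which shows why adding a high-cardinality feature can multiply the potential partition count quickly even when the data is sparse.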

Validio currently has customer production deployments with tens of thousands of partitions, and the number of partitions supported is continuously increasing.

Why should you partition pipelines?

Aggregating data loses information. Conversely, partitioning data enables more granular analysis.

For instance, let’s say there's a retail organization that works with customers across the globe. They may want to monitor their price data to ensure that items are properly priced. One column holds the feature “price” and another holds “currency”. Monitoring the price column as a whole makes very little sense, since prices in different currencies differ by orders of magnitude. Before running a data quality validation to check for anomalies, the data must be partitioned by currency. Just think about the difference in magnitude when the very same price for a specific item is expressed in USD versus Iranian Rial, where the conversion rate is roughly 1 USD = 40 000 Iranian Rial. Without partitioning, we’d be comparing apples with cars.

Anomaly detected in the IRR partition.
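The price and currency example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Validio detects anomalies: the price values are made up, and the outlier rule (a modified z-score based on the median absolute deviation, with the commonly used 3.5 cutoff) is chosen here only because it is simple to show.

```python
from collections import defaultdict
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score (based on the median
    absolute deviation, MAD) exceeds the threshold."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:  # all values (nearly) identical: nothing to flag
        return []
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


def outliers_by_partition(rows, partition_feature, metric_feature):
    """Run the same outlier check separately inside each partition."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_feature]].append(row[metric_feature])
    return {key: mad_outliers(values) for key, values in groups.items()}


# Hypothetical prices; the IRR price of 40000 is ten times too low.
usd = [10, 12, 11, 9, 10, 11]
irr = [400000, 440000, 420000, 380000, 410000, 40000]
prices = [{"currency": "USD", "price": p} for p in usd]
prices += [{"currency": "IRR", "price": p} for p in irr]

pooled = mad_outliers([row["price"] for row in prices])
partitioned = outliers_by_partition(prices, "currency", "price")

print("pooled:", pooled)
print("partitioned:", partitioned)
```

On this sample, the pooled check flags the ordinary IRR prices simply because they are large compared with the USD prices, and it misses the mispriced item. The partitioned check flags only the mispriced item in the IRR partition.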

Set up pipeline partitions

Pipeline partitions are set up individually for each pipeline. Configuring them is the last step of the Dataset Pipeline Configuration and Datapoint Pipeline Configuration setup wizards.

Pipeline partitions in the UI

Partition views are available on the Monitor and Filter dashboard, where you can easily toggle between the partitions you created during pipeline setup.