Map of the territory
This page explains the concept of pipelines in Validio and provides useful context; it does not give practical instructions for setting up pipelines.
A pipeline is the conceptual unit where Notifications, Monitors and Filters live. It is also where you define what constitutes a ‘batch’ or ‘micro-batch’, and where you partition your datasets if you want to monitor subsets of your data.
There are two pipeline types, Dataset and Datapoint. In essence, the difference between them is this: Dataset pipelines are used for aggregate metrics such as mean, max, min and standard deviation, while Datapoint pipelines evaluate single records/datapoints.
It’s usually easier to first understand the difference between Dataset and Datapoint pipelines before diving deeper into commonalities between the two.
When tracking metrics on a Dataset pipeline, you need to define a dataset batch for the aggregate calculation, i.e. which set of records the mean, max or min is calculated on. A batch can be defined in four ways:
- Streaming Datasets: A batch defined by a specified number of records/datapoints
- Data Warehouse Datasets: Data divided into batches by a specified time-based session window, e.g. records within a five-second window belong to the same batch
- Cron Datasets: Schedule batches based on cron expressions
- Object Store Datasets: Logical batches by file/BLOB, e.g. one CSV file would be a batch
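To make the first two batch definitions concrete, here is a minimal Python sketch. It is not Validio's implementation: the function names `count_batches` and `session_batches` are hypothetical, and the session rule (a new batch starts when the pause since the previous record exceeds a gap) is one common reading of "session window" that may differ from Validio's exact behaviour.

```python
from datetime import datetime, timedelta
from statistics import mean


def count_batches(records, size):
    """Split records into consecutive batches of at most `size` items."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def session_batches(events, gap):
    """Group (timestamp, value) events into session windows.

    Assumption: a new batch starts when the time since the previous
    event exceeds `gap`; Validio's exact windowing rule may differ.
    """
    batches = []
    for ts, value in sorted(events, key=lambda e: e[0]):
        if batches and ts - batches[-1][-1][0] <= gap:
            batches[-1].append((ts, value))
        else:
            batches.append([(ts, value)])
    return batches


# Aggregate metric (mean) per count-based batch of two records.
batch_means = [mean(b) for b in count_batches([1, 2, 3, 4, 5], 2)]

# Events 2 s apart share a batch; a 10 s pause starts a new one.
t0 = datetime(2024, 1, 1)
events = [(t0, 1.0), (t0 + timedelta(seconds=2), 2.0),
          (t0 + timedelta(seconds=12), 3.0)]
sessions = session_batches(events, gap=timedelta(seconds=5))
```

In both cases the aggregate metric is computed once per batch, which is why the batch definition has to exist before a Monitor can produce a value.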
Want to know what types of metrics can be monitored on a Dataset pipeline? Check out the Monitor overview to learn more.
A Datapoint pipeline validates, as the name suggests, individual datapoints, as opposed to aggregate metrics such as mean or standard deviation, which are the job of Dataset pipelines. Below we outline some of the features and settings specific to Datapoint pipelines.
A batch concept also exists in Datapoint pipelines to accommodate certain Filters, e.g. the Smart filter, which detects outliers. It is also used to determine how often datapoint metrics should be calculated.
In the case of a Smart filter, the batching logic is used to divide ingested data into (1) training data, from which outlier boundaries are derived, and (2) live data, where each datapoint is evaluated against the outlier boundaries to detect anomalies.
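The training/live split can be sketched as follows. This is an illustration only: the source does not say how the Smart filter derives its boundaries, so the mean ± k standard deviations rule below is an assumed stand-in, and `outlier_bounds` and `flag_outliers` are hypothetical names.

```python
from statistics import mean, stdev


def outlier_bounds(training, k=3.0):
    """Derive (lower, upper) boundaries from a training batch.

    Assumption: boundaries are mean +/- k sample standard deviations.
    The actual Smart filter algorithm is not described in the source.
    """
    mu = mean(training)
    sigma = stdev(training)
    return mu - k * sigma, mu + k * sigma


def flag_outliers(live, bounds):
    """Return the live datapoints that fall outside the boundaries."""
    lower, upper = bounds
    return [x for x in live if not lower <= x <= upper]


training_batch = [10, 11, 9, 10, 12, 8, 10]   # used only to fit boundaries
live_batch = [10, 14, 5, 13]                  # each point evaluated individually

bounds = outlier_bounds(training_batch)
anomalies = flag_outliers(live_batch, bounds)
```

The point of the sketch is the separation of roles: the training batch is never evaluated for anomalies itself, and the live datapoints never influence the boundaries they are checked against.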
Datapoint pipeline metric calculations, for instance the number of passed datapoints or the percentage of total datapoints that passed, also require the concept of a batch (or a cut-off point), e.g. the number of datapoints that passed in the latest batch.
Datapoint batches are governed by a cron trigger that you set up during the Datapoint Pipeline Configuration.
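As a rough sketch of why a cut-off point is needed, the snippet below computes the two example metrics for one batch of filter results. The function name `batch_metrics` and the result format (one boolean per datapoint, `True` meaning it passed) are assumptions made for illustration; the cron trigger itself is not modelled here.

```python
def batch_metrics(results):
    """Summarise pass/fail results for the datapoints in one batch.

    `results` holds one boolean per datapoint (True = passed).
    This input format is hypothetical, chosen for illustration.
    """
    total = len(results)
    passed = sum(results)
    return {
        "passed": passed,
        "failed": total - passed,
        # A pass rate is undefined for an empty batch.
        "pass_rate": passed / total if total else None,
    }


# Results collected between two consecutive cron triggers.
latest_batch = [True, True, False, True]
metrics = batch_metrics(latest_batch)
```

Without the batch boundary, "percentage of datapoints that passed" would have no denominator; the cron trigger supplies it by closing one batch and opening the next.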
Alert fatigue is real. Imagine monitoring and validating 1 billion records a day with a historical ‘normal’ error rate of 0.1%: that is 1 million errors per day. Getting notified about 1 million individual datapoints would be infeasible. Validio tackles this by allowing you to sink out (write) bad datapoints to a destination in real time.
Validio evaluates every record/datapoint in a Datapoint pipeline against the Filter(s) you set; a single datapoint can be evaluated by multiple Filters. If a record or datapoint is caught by any of the Filters, Validio can write the erroneous record to a destination.
Because Validio catches errors in real time, you can save all the erroneous records and review them later, or integrate them into a live dashboard in a BI tool of your choice. Set up a Destination Connector to configure where erroneous datapoints should be sent.
Anomalies are written to the chosen destination in real time as Validio identifies them, independently of the cron batching logic.
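The evaluate-and-sink flow described above might look roughly like this. Everything named here is hypothetical: `route_bad_records`, the filter names, and the use of a plain Python list as the destination are stand-ins, not Validio's API.

```python
def route_bad_records(records, filters, sink):
    """Evaluate every record against every filter; sink the ones caught.

    `filters` maps a filter name to a function returning True when the
    record is erroneous. `sink` is any callable that writes one record
    to a destination. Both are assumptions made for this sketch.
    """
    for record in records:
        caught_by = [name for name, is_bad in filters.items() if is_bad(record)]
        if caught_by:
            # Written immediately, per record, not once per batch.
            sink({"record": record, "caught_by": caught_by})


filters = {
    "null_age": lambda r: r.get("age") is None,
    "negative_age": lambda r: r.get("age") is not None and r["age"] < 0,
}

destination = []  # stand-in for a real destination such as a table or topic
records = [{"age": 30}, {"age": None}, {"age": -1}]
route_bad_records(records, filters, destination.append)
```

Writing each caught record as soon as it is found, rather than raising one alert per record, is what keeps the 1-million-errors-a-day scenario manageable: the destination absorbs the volume and can be reviewed or visualised later.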
Validio defines a source as one table in a DWH, one topic (or equivalent) in a streaming source, or one bucket in an object store. Each source, whether a table in Snowflake or a topic in Kafka, has its own pipeline in Validio.
Each Monitor (Dataset pipeline) or Filter (Datapoint pipeline) tracks one metric computed on one or several features (e.g. metric: mean on feature: age) - create one Monitor or Filter for every metric you want to monitor.
Each pipeline has an optional Notification rule attached to it, which configures how alerts are grouped and which channel notifications are sent to.
Partitions are made on the pipeline level, allowing you to monitor subsets of your data.
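Conceptually, partitioning means the same metric is computed separately for each subset of the data. The sketch below assumes partitions are formed by the value of one field; the function name `mean_by_partition` and the `country` and `age` fields are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from statistics import mean


def mean_by_partition(records, partition_key, feature):
    """Compute the mean of `feature` separately for each partition.

    Assumption: a partition is the set of records sharing one value
    of `partition_key`.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[partition_key]].append(record[feature])
    return {key: mean(values) for key, values in groups.items()}


records = [
    {"country": "SE", "age": 30},
    {"country": "SE", "age": 40},
    {"country": "NO", "age": 50},
]
per_country = mean_by_partition(records, "country", "age")
```

A shift that affects only one subset can be invisible in the overall mean but obvious in that subset's own metric, which is the motivation for monitoring partitions separately.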