Map of the territory
This page explains the concept of pipelines in Validio and provides useful context, rather than instructions for how to practically set up pipelines in Validio.
A pipeline is the conceptual unit where Notifications, Monitors, and Filters live. This is also where you define the logic for what constitutes a ‘batch’ or ‘micro-batch’. Lastly, this is where you partition your datasets in case you want to monitor subsets of your data.
In essence, the difference between the two pipeline types is: Dataset pipelines are used for aggregate metrics such as mean, max, min, standard deviation etc., while Datapoint pipelines evaluate single records/datapoints.
It’s usually easier to first understand the difference between Dataset and Datapoint pipelines before diving deeper into commonalities between the two.
When tracking metrics on a Dataset pipeline, you need to define a dataset batch for the aggregate calculation - e.g. what dataset/batch to take the mean on. A batch can be defined in four ways:
- Streaming Datasets: A batch defined by a specified number of records/datapoints
- Data Warehouse Datasets: Data divided into batches by a specified time-based session window, e.g. records within a five-second window pertain to the same batch
- Cron Datasets: Batches scheduled based on cron expressions
- Object Store Datasets: Logical batches by file/BLOB, e.g. one CSV file would be a batch
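To make the first batching strategy concrete, here is a minimal sketch of count-based batching as used for Streaming Datasets, with an aggregate metric (mean) computed per batch. The function name and types are illustrative and not part of Validio's API.

```python
from typing import Any, Iterable, Iterator


def count_based_batches(records: Iterable[Any], batch_size: int) -> Iterator[list[Any]]:
    """Group a stream of records into fixed-size batches (streaming-dataset style)."""
    batch: list[Any] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit any trailing partial batch
        yield batch


# Aggregate metrics such as the mean are then computed per batch, not per record:
batches = list(count_based_batches(range(10), batch_size=4))
means = [sum(b) / len(b) for b in batches]
# batches -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
# means   -> [1.5, 5.5, 8.5]
```

The other batch definitions (session windows, cron schedules, one file per batch) follow the same pattern: they only change how records are grouped before the aggregate is computed.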
Want to know what types of metrics can be monitored on a Dataset pipeline? Check out the Monitor overview to learn more.
A Datapoint pipeline validates, as the name suggests, individual datapoints, as opposed to aggregate metrics such as mean and standard deviation, which are the job of Dataset pipelines. Below we outline some of the features and settings specific to Datapoint pipelines.
To give an example of when a sessionized datapoint batch is needed, consider the Duplicate filter. Each new datapoint is evaluated against a set (or batch) of datapoints to determine whether duplicates exist. This ‘reference set’ of datapoints to be validated against is governed by the datapoint batch logic.
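The duplicate check described above can be sketched as follows. This is a simplified illustration of the idea, not Validio's implementation; the `key` parameter (which field identifies a duplicate) is an assumption for the example.

```python
from typing import Any, Callable, Iterable


def find_duplicates(
    reference_batch: Iterable[Any],
    new_points: Iterable[Any],
    key: Callable[[Any], Any] = lambda p: p,
) -> list[Any]:
    """Flag each new datapoint that already exists in the reference set."""
    seen = {key(p) for p in reference_batch}
    flagged = []
    for point in new_points:
        k = key(point)
        if k in seen:
            flagged.append(point)
        seen.add(k)  # accepted points join the reference set for later checks
    return flagged


# 3 duplicates the reference batch; the second 4 duplicates the first 4
dups = find_duplicates([1, 2, 3], [3, 4, 4, 5])
# dups -> [3, 4]
```

The key point is that the size and lifetime of `reference_batch` are exactly what the datapoint batch logic controls.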
In the case of a Smart filter, the batch logic is used to divide ingested data into training data, from which outlier boundaries are derived, and live data, where each datapoint is evaluated against the outlier boundaries to detect anomalies.
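As a rough sketch of this training/live split, the example below derives boundaries as mean ± k standard deviations. The actual boundary model the Smart filter uses is not specified here; this particular formula is an assumption chosen for illustration.

```python
import statistics


def derive_boundaries(training: list[float], k: float = 3.0) -> tuple[float, float]:
    """Derive outlier boundaries from training data as mean +/- k standard deviations."""
    mu = statistics.mean(training)
    sigma = statistics.stdev(training)
    return mu - k * sigma, mu + k * sigma


def detect_anomalies(live: list[float], bounds: tuple[float, float]) -> list[float]:
    """Flag live datapoints that fall outside the trained boundaries."""
    low, high = bounds
    return [x for x in live if x < low or x > high]


bounds = derive_boundaries([10, 11, 9, 10, 10, 12, 9, 11])  # training data
anomalies = detect_anomalies([10, 50, 11], bounds)          # live data
# anomalies -> [50]
```

The batch logic decides which datapoints land in the training set and which are treated as live.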
The ‘Timeout (seconds)’ and ‘Maximum session time (seconds)’ parameters in the pipeline setup govern the sessionized datapoint batch logic. Check out Datapoint Pipeline Configuration to learn more about the parameters.
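To illustrate how these two parameters interact, here is a hypothetical sessionizer over (timestamp, value) events: a session closes when the gap since the previous event exceeds the timeout, or when the session's total duration would exceed the maximum session time. The exact closing semantics in Validio may differ; see the configuration docs referenced above.

```python
from typing import Any


def sessionize(
    events: list[tuple[float, Any]],
    timeout: float,
    max_session: float,
) -> list[list[tuple[float, Any]]]:
    """Group time-ordered (timestamp, value) events into sessions."""
    sessions: list[list[tuple[float, Any]]] = []
    current: list[tuple[float, Any]] = []
    for ts, value in events:
        gap_exceeded = current and ts - current[-1][0] > timeout
        too_long = current and ts - current[0][0] > max_session
        if gap_exceeded or too_long:
            sessions.append(current)  # close the session batch
            current = []
        current.append((ts, value))
    if current:
        sessions.append(current)
    return sessions


events = [(0, "a"), (1, "b"), (10, "c"), (11, "d")]
groups = sessionize(events, timeout=5, max_session=60)
# groups -> [[(0, "a"), (1, "b")], [(10, "c"), (11, "d")]]
```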
Alert fatigue is real. Imagine monitoring and validating 1 billion records a day with a historical ‘normal’ error rate of 0.1% - that is 1 million errors per day. Getting notified about 1 million individual data points would be, well, infeasible. Validio tackles this by allowing you to sink out (write) bad data points to a destination.
Validio evaluates every record/datapoint in a Datapoint pipeline according to the Filter(s) you set - a single datapoint can be evaluated by multiple filters. If a record or datapoint gets caught by any of the filters, Validio can write the erroneous record to a destination.
This allows you to save all the erroneous records and review them at a later point, or integrate them into a live dashboard in a BI tool of your choice, since Validio catches errors in real time. Set up a Destination Connector to configure where erroneous data points should be sent.
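The evaluate-and-sink flow can be sketched like this. The filter names and the list standing in for a destination are made up for the example; in practice the destination would be whatever your Destination Connector points at.

```python
from typing import Any, Callable


def validate_and_sink(
    records: list[dict[str, Any]],
    filters: dict[str, Callable[[dict[str, Any]], bool]],
    sink: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Evaluate each record against every filter; sink records caught by any filter."""
    clean = []
    for record in records:
        failed = [name for name, check in filters.items() if not check(record)]
        if failed:
            sink.append({"record": record, "failed_filters": failed})
        else:
            clean.append(record)
    return clean


filters = {
    "non_negative_age": lambda r: r["age"] >= 0,
    "has_name": lambda r: bool(r.get("name")),
}
bad_rows: list[dict[str, Any]] = []  # stand-in for a real destination (table, topic, ...)
ok = validate_and_sink(
    [{"name": "Ada", "age": 36}, {"name": "", "age": -1}],
    filters,
    bad_rows,
)
# ok       -> [{"name": "Ada", "age": 36}]
# bad_rows -> one entry, caught by both filters
```

Storing the failed records together with the names of the filters they failed is what makes later review or dashboarding practical.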
Validio defines a source as one table in a DWH, one topic (or equivalent) in a streaming source, or a bucket in object stores. Each source, whether a table in Snowflake or a topic in Kafka, has its own pipeline in Validio.
Each Monitor (Dataset pipeline) or Filter (Datapoint pipeline) tracks one metric computed on one or several features (e.g. metric: mean on feature: age) - create one Monitor or Filter for every metric you want to monitor.
Each pipeline has an optional Notification rule attached to it, where you configure how alerts should be grouped and which channel notifications should be sent to.
Partitions are made on the pipeline level, allowing you to monitor subsets of your data.