Introduction

Depending on whether you’re monitoring aggregate metrics on a dataset or metrics on individual datapoints, Validio allows you to create two types of pipelines: Dataset pipelines and Datapoint pipelines.

📘

Map of the territory

This page explains the concept of pipelines in Validio and provides useful context, rather than step-by-step instructions for setting up pipelines.

A pipeline is the conceptual unit where Notification rules, Monitors, and Filters live. This is also where you define what constitutes a ‘batch’ and a ‘micro-batch’. Lastly, this is where you partition your datasets if you want to monitor subsets of your data.


Dataset- and Datapoint pipelines connect to a source via a Source Connector. Notification rules, Monitors, Filters, and Partitions all live in a pipeline.

In essence, the difference between the two pipeline types is this: Dataset pipelines are used for aggregate metrics such as mean, max, min, standard deviation, etc., while Datapoint pipelines evaluate single records/datapoints.

It’s usually easier to first understand the difference between Dataset and Datapoint pipelines before diving deeper into commonalities between the two.

Dataset pipeline - defining what a dataset (batch) is

When tracking metrics in a Dataset pipeline, you need to define a dataset batch for the aggregate calculation, e.g. which dataset/batch to compute the mean over. A batch can be defined in four ways (a conceptual sketch of two of them follows the list below):

  • Streaming Datasets: A batch is defined by a specified number of records/datapoints
  • Data Warehouse Datasets: Data is divided into batches by a specified time-based session window, e.g. records within a five-second window belong to the same batch
  • Cron Datasets: Batches are scheduled based on cron expressions
  • Object Store Datasets: Logical batches are defined by file/BLOB, e.g. one CSV file constitutes one batch
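
To make the batching concept concrete, here is a minimal sketch in plain Python of two of the definitions above: count-based batches (the streaming case) and one-file-per-batch (the object store case). It is illustrative only, not Validio’s implementation.

```python
# Conceptual sketch only, not Validio's implementation.
from itertools import islice
import csv

def count_batches(records, batch_size):
    """Streaming-style batching: a batch is a fixed number of records."""
    it = iter(records)
    while chunk := list(islice(it, batch_size)):
        yield chunk

def file_batches(csv_paths):
    """Object-store-style batching: each CSV file is one logical batch."""
    for path in csv_paths:
        with open(path, newline="") as f:
            yield list(csv.DictReader(f))

# Example: 10 records split into batches of 4 -> batch sizes 4, 4, 2.
print([len(batch) for batch in count_batches(range(10), batch_size=4)])
```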

📘

Want to know what types of metrics can be monitored in a Dataset pipeline? Check out the Monitor overview to learn more.

Datapoint pipeline - validating individual datapoints

A Datapoint pipeline validates, as the name suggests, individual datapoints, as opposed to aggregate metrics such as mean, standard deviation, etc., which are the job of Dataset pipelines. Below we outline some of the features and settings specific to Datapoint pipelines.

Sessionized datapoint batches

A batch concept exists in Datapoint pipelines as well, to accommodate certain Filters, e.g. the Duplicate filter, which detects duplicates, and the Smart filter, which detects outliers.

To see why a sessionized datapoint batch is needed, consider the Duplicate filter. Each new datapoint is evaluated against a set (or batch) of datapoints to determine whether duplicates exist. This ‘reference set’ of datapoints to validate against is governed by the datapoint batch logic.
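
As a rough illustration (plain Python, not Validio’s Duplicate filter), checking new records against a reference batch could look like the sketch below, where `key_fields` is a hypothetical choice of fields that define uniqueness.

```python
# Conceptual sketch only, not Validio's Duplicate filter: flag records in a
# reference batch whose (hypothetical) key fields repeat an earlier record.
def find_duplicates(batch, key_fields):
    seen = set()
    duplicates = []
    for record in batch:
        key = tuple(record[field] for field in key_fields)
        if key in seen:
            duplicates.append(record)
        else:
            seen.add(key)
    return duplicates

# Example: two records share the same order_id within one datapoint batch.
batch = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": 5.5},
    {"order_id": 1, "amount": 10.0},
]
print(find_duplicates(batch, key_fields=("order_id",)))  # -> the third record
```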

In the case of a Smart filter, the batch logic is used to divide ingested data into training data, from which outlier boundaries are derived, and live data, where each datapoint is evaluated against the outlier boundaries to detect anomalies.
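
A minimal sketch of the training/live split, assuming a simple mean ± k·stdev boundary rather than whatever model the Smart filter actually uses:

```python
# Conceptual sketch only, not Validio's Smart filter: derive outlier
# boundaries from a training batch, then evaluate live datapoints against them.
import statistics

def boundaries(training_values, k=3.0):
    """Simple mean +/- k * stdev boundaries derived from training data."""
    mean = statistics.fmean(training_values)
    stdev = statistics.pstdev(training_values)
    return mean - k * stdev, mean + k * stdev

training = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
low, high = boundaries(training)

live = [10.1, 9.7, 42.0]  # 42.0 falls outside the derived boundaries
outliers = [value for value in live if not (low <= value <= high)]
print(outliers)  # -> [42.0]
```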

The ‘Timeout (seconds)’ and ‘Maximum session time (seconds)’ parameters in the pipeline setup govern the sessionized datapoint batch logic. Check out Datapoint Pipeline Configuration to learn more about the parameters.
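
Below is a conceptual sketch of gap-based sessionization, assuming ‘Timeout (seconds)’ closes a batch when no new datapoint arrives within that gap and ‘Maximum session time (seconds)’ caps how long a session can stay open; refer to Datapoint Pipeline Configuration for the actual semantics.

```python
# Conceptual sketch only, assuming gap-based session windows: a batch is
# closed when no datapoint arrives within `timeout` seconds, or when the
# session has been open for more than `max_session` seconds.
def sessionize(events, timeout, max_session):
    """Group (timestamp, record) pairs, sorted by timestamp, into batches."""
    batches, current = [], []
    session_start = last_seen = None
    for ts, record in events:
        if current and (ts - last_seen > timeout or ts - session_start > max_session):
            batches.append(current)
            current, session_start = [], None
        if session_start is None:
            session_start = ts
        current.append(record)
        last_seen = ts
    if current:
        batches.append(current)
    return batches

events = [(0.0, "a"), (2.0, "b"), (30.0, "c")]
print(sessionize(events, timeout=10.0, max_session=300.0))  # -> [['a', 'b'], ['c']]
```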

Sinking out bad data points

Alert fatigue is real. Imagine monitoring and validating 1 billion records a day with a historical ‘normal’ error rate of 0.1%: that is 1 million errors per day. Getting notified about 1 million individual data points would be, well, infeasible. Validio tackles this by allowing you to sink out (write) bad data points to a destination.

Validio evaluates every single record/datapoint in a Datapoint pipeline according to the Filter(s) you set - a single datapoint can be evaluated by multiple filters. If a record or datapoint gets caught by any of the filters, Validio can write the erroneous record to a destination.

This allows you to save all the erroneous records and review them later, or integrate them into a live dashboard in a BI tool of your choice, since Validio catches errors in real time. Set up a Destination Connector to configure where erroneous data points should be sent.


Validio allows you to sink out all anomalies at the datapoint level. Here, the egress output is shown in AWS Redshift.
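
As a rough sketch of the idea (plain Python, not Validio’s Filters or Destination Connectors), evaluating records against a set of checks and writing anything caught to an error sink could look like this; the filter names, rules, and sink path are hypothetical.

```python
# Conceptual sketch only: run each record through every filter and append
# anything caught, plus the names of the filters that caught it, to a sink.
import json

def validate_and_sink(records, filters, sink_path="bad_records.jsonl"):
    with open(sink_path, "a") as sink:
        for record in records:
            failed = [name for name, check in filters.items() if not check(record)]
            if failed:
                sink.write(json.dumps({"record": record, "failed_filters": failed}) + "\n")

# Hypothetical filters: names and rules are illustrative only.
filters = {
    "non_negative_amount": lambda r: r.get("amount", 0) >= 0,
    "has_customer_id": lambda r: r.get("customer_id") is not None,
}
validate_and_sink([{"amount": -3, "customer_id": None}], filters)
```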

Commonalities in Dataset- and Datapoint pipelines

One pipeline for each source

Validio defines a source as one table in a DWH, one topic (or equivalent) in a streaming source, or one bucket in an object store. Each source, whether a table in Snowflake or a topic in Kafka, has its own pipeline in Validio.

Create as many Monitors or Filters as you want on a pipeline

Each Monitor (Dataset pipeline) or Filter (Datapoint pipeline) tracks one metric computed on one or several features (e.g. metric: mean on feature: age) - create one Monitor or Filter for every metric you want to monitor.
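
Conceptually, a Monitor boils down to a metric computed on a feature over a batch; a hypothetical plain-Python rendering (not Validio’s API) of the “metric: mean on feature: age” example:

```python
# Conceptual sketch only: each monitor pairs one metric with one feature.
import statistics

monitors = [
    {"metric": statistics.fmean, "feature": "age"},
    {"metric": max, "feature": "purchase_amount"},
]

batch = [
    {"age": 31, "purchase_amount": 12.5},
    {"age": 45, "purchase_amount": 99.0},
]

for monitor in monitors:
    values = [row[monitor["feature"]] for row in batch]
    print(monitor["feature"], monitor["metric"](values))  # age 38.0, purchase_amount 99.0
```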

One Notification rule for each pipeline

Each pipeline has an optional Notification rule attached to it, which configures how alerts should be grouped and which channel notifications should be sent to.
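
As a conceptual sketch (not Validio’s notification rules), grouping alerts before sending one message per group to a channel could look like this; the field names and the `send` callable are illustrative only.

```python
# Conceptual sketch only: collapse raw alerts into one message per group.
from collections import defaultdict

def group_alerts(alerts):
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[(alert["pipeline"], alert["severity"])].append(alert)
    return grouped

def notify(grouped, send):  # `send` could post to Slack, email, etc.
    for (pipeline, severity), items in grouped.items():
        send(f"[{severity}] {pipeline}: {len(items)} alert(s)")

notify(group_alerts([
    {"pipeline": "orders", "severity": "high"},
    {"pipeline": "orders", "severity": "high"},
]), print)  # -> "[high] orders: 2 alert(s)"
```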

Partitions are defined on pipeline level

Partitions are made on the pipeline level, allowing you to monitor subsets of your data.
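
A minimal illustration of the idea (not Validio’s partitioning mechanism): partition records by a key, then compute a metric per subset; the `country` and `latency_ms` fields are hypothetical.

```python
# Conceptual sketch only: monitor a metric per partition instead of globally.
from collections import defaultdict
import statistics

def partition_by(records, key):
    partitions = defaultdict(list)
    for record in records:
        partitions[record[key]].append(record)
    return partitions

records = [
    {"country": "SE", "latency_ms": 120},
    {"country": "SE", "latency_ms": 95},
    {"country": "US", "latency_ms": 300},
]

for country, rows in partition_by(records, "country").items():
    print(country, statistics.fmean(r["latency_ms"] for r in rows))  # SE 107.5, US 300.0
```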

