Getting Started with Validio

Set up your first data validation in 5 minutes.

1. Install Validio

You can either install Validio in your VPC or host it with us, depending on your preferences and needs. For more information, read about our Customer virtual private cloud and our Managed solution.

Until Validio is available on the GCP and AWS marketplaces, you can reach out to us at [email protected]. We provide demo sessions where you can learn more about the key features relevant to your data stack.


2. Connect Validio to your data

Once Validio is running in your environment, authenticate and connect it so that it can read data from your Source.

  1. Click on + New Source to start the Source configuration wizard.

In this example, we create a Validio Demo Source.

Our Validio Demo Source with credentials and Source name "Demo source connector".

The schema we want to infer from our Source.

📘

Credential and configuration parameters look different depending on Source type

For more information, refer to Credentials and Sources.

2.1 Create a Window

Configure a Window to define the batch of data in your Source over which metrics are validated. For example, a Window determines which set of datapoints a mean is calculated and validated on.

  1. Select which Window type you want to create in the source configuration wizard.

In this example, we select the Window type Fixed batch window and specify the following:

  • Name (of the window): Batches of 256 datapoints on "event_time"
  • Data-time field: event_time
  • Fixed batch size: 256

This triggers a window every 256 datapoints. The datapoints are ordered by the event_time field.

A fixed batch window on the event_time data-time field with batch size 256.
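To make the windowing behavior concrete, here is a minimal Python sketch of a fixed batch window. It is an illustration only, not Validio's implementation: the function name `fixed_batch_windows` and the demo rows are hypothetical, and we assume that an incomplete trailing batch is held back until it fills up.

```python
from datetime import datetime, timedelta
from typing import Iterator


def fixed_batch_windows(
    datapoints: list[dict], time_field: str, batch_size: int
) -> Iterator[list[dict]]:
    """Yield complete batches of `batch_size` datapoints, ordered by `time_field`."""
    ordered = sorted(datapoints, key=lambda row: row[time_field])
    # Assumption: only full batches trigger a window; leftovers wait for more data.
    complete = len(ordered) - len(ordered) % batch_size
    for begin in range(0, complete, batch_size):
        yield ordered[begin:begin + batch_size]


# Hypothetical demo data: 600 datapoints, one per second, delivered out of order.
start = datetime(2024, 1, 1)
rows = [{"event_time": start + timedelta(seconds=i), "Age": 20 + i % 40} for i in range(600)]
rows.reverse()

windows = list(fixed_batch_windows(rows, "event_time", 256))
print(len(windows))  # number of complete 256-datapoint windows
```

With 600 datapoints and a batch size of 256, the sketch produces two windows and leaves 88 datapoints waiting for the next one.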


3. Create Segmentation

Configure Segmentation to define segments in your source to validate metrics on. By default, Validio uses the unsegmented setting. For example, if we create Segmentation on Country, the metrics for Country = USA are validated independently from the metrics for Country = Sweden.

  1. Click on + New Segmentations to start the Segmentation configuration wizard.

In this example, we create Segmentation on the field Gender, which in this case contains either the value Male or Female.

A Segmentation created on the field `Gender`.
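The idea behind Segmentation can be sketched in a few lines of Python. This is a simplified illustration using the Country example above, not Validio's implementation; the helper `segment_rows` and the sample rows are hypothetical.

```python
from collections import defaultdict


def segment_rows(rows: list[dict], field: str) -> dict[str, list[dict]]:
    """Partition rows into independent segments keyed by the value of `field`."""
    segments: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        segments[row[field]].append(row)
    return dict(segments)


# Hypothetical sample rows
rows = [
    {"Country": "USA", "Age": 30},
    {"Country": "Sweden", "Age": 44},
    {"Country": "USA", "Age": 50},
]

segments = segment_rows(rows, "Country")
# Each segment gets its own metric, computed without looking at the other segments.
mean_age = {
    country: sum(r["Age"] for r in members) / len(members)
    for country, members in segments.items()
}
print(mean_age)
```

Because each segment is handled separately, an unusual value in one country does not affect the metric for another.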


4. Create a Validator

Configure a Validator to define metrics to validate on specified fields in your source. You can use a created segmentation and window, or add filters, to specify the behavior of the Validator.

  1. Click on + New Validator to start the Validator configuration wizard.

In this example, we create a Numeric Validator with the following settings:

Config:

  • Metric: Mean
  • Source field(s): Age
  • We do not use the Initialize with Backfill option

Source config:

  • Segmentation = Gender
  • Window = Batches of 256 datapoints on "event_time"

Filter:

  • Filter type = Threshold filter
  • Field: Working_hours_weekly
  • Operator: equal, Value = 40

This Validator calculates the mean of Age, segmented by Gender, after reading 256 datapoints.

Our Threshold filter only includes rows where an individual is working exactly 40 hours per week. All other rows are excluded from the validation.

Validator type configuration.

A Numeric Validator calculating Mean on the Source field `Age`, with a Threshold filter on `Working_hours_weekly` where the operator is `Equal` and the value is `40`.

Metrics in our example

The Validator in our example yields two mean metrics:

  1. Mean of Age for Male, working exactly 40 hours per week.
  2. Mean of Age for Female, working exactly 40 hours per week.
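The logic of this Validator can be sketched in Python as follows. This is a simplified illustration for one window of data, not Validio's implementation; the function `validate_window` and the sample rows are hypothetical.

```python
from statistics import mean


def validate_window(
    window: list[dict],
    metric_field: str,
    segment_field: str,
    filter_field: str,
    filter_value: int,
) -> dict[str, float]:
    """Return the mean of `metric_field` per segment, using only rows that pass the filter."""
    values: dict[str, list[float]] = {}
    for row in window:
        # Threshold filter with operator "equal": all other rows are excluded.
        if row[filter_field] != filter_value:
            continue
        values.setdefault(row[segment_field], []).append(row[metric_field])
    return {segment: mean(vals) for segment, vals in values.items()}


# Hypothetical window contents (a real window in our example holds 256 datapoints).
window = [
    {"Age": 30, "Gender": "Male", "Working_hours_weekly": 40},
    {"Age": 50, "Gender": "Male", "Working_hours_weekly": 40},
    {"Age": 25, "Gender": "Male", "Working_hours_weekly": 20},
    {"Age": 35, "Gender": "Female", "Working_hours_weekly": 40},
    {"Age": 47, "Gender": "Female", "Working_hours_weekly": 40},
    {"Age": 60, "Gender": "Female", "Working_hours_weekly": 45},
]

metrics = validate_window(window, "Age", "Gender", "Working_hours_weekly", 40)
print(metrics)
```

The rows with 20 and 45 weekly working hours are dropped by the filter, so each mean is calculated from two rows only.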

4.1 Threshold

You can set up a Threshold when configuring your Validator. A Threshold identifies which values of the metric are considered data quality incidents.

📘

Define your own threshold or let Validio do it for you

You can either define your own rules or let Validio set the bounds for you with a Dynamic threshold.

In this example, we configure the following Threshold:

  • Threshold type: Dynamic threshold
  • Sensitivity: 2
  • Decision bounds type: Upper and lower

This Threshold is applied to the two mean metrics in our example. For each metric, dynamic bounds are calculated based on historical data, and unusually large or small means are identified as incidents.

A Dynamic threshold configured with Sensitivity `2` where the Decision bound type is set to `Upper and lower`.
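To build intuition for dynamic bounds, here is a minimal Python sketch. This page does not describe how Validio calculates Dynamic thresholds, so everything below is an assumption made for illustration: we place the bounds at the historical mean plus or minus a multiple of the standard deviation, and we treat the Sensitivity value as that multiple (so a larger value widens the bounds here, which may differ from how the product defines Sensitivity). The function names and the history values are hypothetical.

```python
from statistics import mean, stdev


def dynamic_bounds(history: list[float], sensitivity: float) -> tuple[float, float]:
    """Assumed rule: bounds = historical mean +/- sensitivity * standard deviation."""
    center = mean(history)
    spread = stdev(history)
    return center - sensitivity * spread, center + sensitivity * spread


def is_incident(value: float, history: list[float], sensitivity: float) -> bool:
    """Flag values outside the bounds ("Upper and lower" decision bounds)."""
    lower, upper = dynamic_bounds(history, sensitivity)
    return value < lower or value > upper


# Hypothetical history of the "Mean of Age" metric for one segment.
history = [40.0, 41.0, 39.0, 40.5, 39.5]

lower, upper = dynamic_bounds(history, sensitivity=2)
print(round(lower, 2), round(upper, 2))
print(is_incident(40.8, history, sensitivity=2))  # close to the historical values
print(is_incident(45.0, history, sensitivity=2))  # unusually large mean
```

Each segment would keep its own history, so the bounds for one segment are independent of the bounds for the other.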


5. Begin your data validation

You can now start your source to begin the data validation:

  1. Click Start to start Demo source.
Start your source from the Source details page.

  1. Alternatively, navigate to Sources and start the Source connector from the action menu.
Start your Source from the Sources overview page.

📘

Backfill option

Use the Backfill option if you want to read all available historical data in your validation. Otherwise, Validio only reads data available after Source start.

  1. Click on a specific Validator in the Source details page to view details and graphs:
Validator details for "`Mean` of `Age` where `Working_hours_weekly` equal `40`, segmented on `Gender`"


6. Done

Congratulations, you have now set up your first data validation!

