Getting started

Set up your first data validation in 5 minutes.

1. Install Validio

You can install Validio in your own VPC or have us host it for you, depending on your needs. For more information, read about our Customer Virtual Private Cloud and our Managed Solution.

We plan to make Validio available on the GCP and AWS marketplaces soon. In the meantime, you can reach out to us at [email protected] to book a demo session and learn more about the key features relevant to your data stack.

2. Connect Validio to your data

Once you have Validio running in your environment, you need to connect it to your source.

📘

Finish the set-up before you start your source

Configure your Validators before you start your Source in Validio, to avoid premature reading of data.

  1. Click on + New Source to start the Source configuration wizard.

In this example, we create a Source of the type Validio Demo Source.

Our Validio Demo Source with credentials and Source name "Demo source connector".

The schema we want to infer from our Source.

📘

Credential and configuration parameters look different depending on the Source

For more information, refer to Credentials and Sources.

2.1 Create a Window

Configure a Window to define the batch of datapoints over which metrics are validated in your Source. For example, a Window determines which set of datapoints a mean is calculated and validated on.

  1. Select which Window type you want to create in the Source configuration wizard.

In this example, we select the Window type Fixed batch window and specify the following:

  • Name (of the window): fixed-batch-150
  • Data-time field: event_time
  • Fixed batch size: 150

This triggers a window every 150 datapoints. The datapoints are ordered by the event_time field.

A fixed batch window on the event_time data-time field with batch size 150.
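To make the windowing behavior concrete, here is a minimal Python sketch of a fixed batch window. It is an illustration only and not Validio's implementation: the function name `fixed_batch_windows` and the sample rows are hypothetical, and the sketch assumes that only complete batches trigger a window.

```python
from typing import Iterator


def fixed_batch_windows(
    rows: list[dict], time_field: str, batch_size: int
) -> Iterator[list[dict]]:
    """Yield consecutive windows of exactly batch_size rows, ordered by time_field."""
    ordered = sorted(rows, key=lambda row: row[time_field])
    # Assumption: a trailing partial batch does not trigger a window yet.
    for start in range(0, len(ordered) - batch_size + 1, batch_size):
        yield ordered[start:start + batch_size]


# Hypothetical data: 400 datapoints that arrive in reverse event_time order.
rows = [{"event_time": t, "Age": 20 + t % 40} for t in range(399, -1, -1)]

windows = list(fixed_batch_windows(rows, time_field="event_time", batch_size=150))
print(len(windows))  # 2 complete windows; the last 100 datapoints wait for more data
```

With 400 datapoints and a batch size of 150, two windows are triggered and the remaining 100 datapoints stay pending under the assumption above.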

3. Create Segmentation

Configure Segmentation to define segments in your source to validate metrics on. By default, Validio uses the unsegmented setting. For example, if we create Segmentation on Country, the metrics for Country = USA are validated independently from the metrics for Country = Sweden.

  1. Click on + New Segmentations to start the Segmentation configuration wizard.

In this example, we create Segmentation on the field Gender, which in this case contains either the value Male or Female.

A Segmentation created on the field Gender.
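Conceptually, Segmentation splits the datapoints into independent groups, one per distinct value of the chosen field. The following Python sketch shows the idea under that assumption; the helper `segment_rows` and the sample rows are hypothetical and are not part of Validio.

```python
from collections import defaultdict


def segment_rows(rows: list[dict], field: str) -> dict:
    """Group rows by the value of the given field."""
    segments = defaultdict(list)
    for row in rows:
        segments[row[field]].append(row)
    return dict(segments)


# Hypothetical datapoints with a Gender field.
rows = [
    {"Gender": "Male", "Age": 30},
    {"Gender": "Female", "Age": 40},
    {"Gender": "Male", "Age": 50},
]

segments = segment_rows(rows, "Gender")
# Each segment can now be validated independently of the others.
segment_sizes = {value: len(group) for value, group in segments.items()}
print(segment_sizes)
```

Any metric computed on the Male segment is unaffected by rows in the Female segment, and the other way around.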

4. Create a Validator

Configure a Validator to define metrics to validate on specified fields in your source. When you create a Validator you must use a Window. You can also use a created Segmentation and add filters to specify the behavior of the Validator.

  1. Click on + New Validator to start the Validator configuration wizard.

In this example, we create a Numeric Validator with the following settings:

Config:

  • Metric: Mean
  • Source field(s): Age and Yearly_wage_USD
  • We do not use the Initialize with Backfill option

Source config:

  • Segmentation = Gender
  • Window = fixed-batch-150

Filter:

  • Filter type = Threshold filter
  • Field: Working_hours_weekly
  • Operator: equal, Value = 40

This Validator calculates the mean of both Age and Yearly_wage_USD, segmented by Gender, after reading 150 datapoints.

Our Threshold filter only includes rows where an individual is working exactly 40 hours per week. All other rows are excluded from the validation.

Validator type configuration.

A Numeric Validator calculating Mean using Source field(s) Age and Yearly_wage_USD, with a Threshold filter on Working_hours_weekly where the operator is Equal and the value is 40.


Metrics in our example

The Validator in our example yields four mean metrics:

  1. Mean of Age for Male, working exactly 40 hours per week.
  2. Mean of Age for Female, working exactly 40 hours per week.
  3. Mean of Yearly_wage_USD for Male, working exactly 40 hours per week.
  4. Mean of Yearly_wage_USD for Female, working exactly 40 hours per week.
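The way the filter, the Segmentation, and the two Source fields combine into four metrics can be sketched in Python as follows. This is an illustration of the logic described above, not Validio's implementation; the function `mean_metrics` and the sample rows are hypothetical.

```python
from statistics import mean

# Hypothetical datapoints from one window.
rows = [
    {"Gender": "Male", "Age": 30, "Yearly_wage_USD": 50000, "Working_hours_weekly": 40},
    {"Gender": "Male", "Age": 50, "Yearly_wage_USD": 70000, "Working_hours_weekly": 40},
    {"Gender": "Male", "Age": 90, "Yearly_wage_USD": 10000, "Working_hours_weekly": 20},
    {"Gender": "Female", "Age": 40, "Yearly_wage_USD": 60000, "Working_hours_weekly": 40},
    {"Gender": "Female", "Age": 20, "Yearly_wage_USD": 40000, "Working_hours_weekly": 40},
    {"Gender": "Female", "Age": 70, "Yearly_wage_USD": 90000, "Working_hours_weekly": 35},
]


def mean_metrics(window, fields, segment_field, filter_field, filter_value):
    """Return {(field, segment): mean} for rows that pass the equality filter."""
    # Filter: keep only rows where filter_field equals filter_value.
    kept = [row for row in window if row[filter_field] == filter_value]
    metrics = {}
    for segment_value in sorted({row[segment_field] for row in kept}):
        segment_rows = [row for row in kept if row[segment_field] == segment_value]
        for field in fields:
            metrics[(field, segment_value)] = mean(row[field] for row in segment_rows)
    return metrics


metrics = mean_metrics(
    rows,
    fields=["Age", "Yearly_wage_USD"],
    segment_field="Gender",
    filter_field="Working_hours_weekly",
    filter_value=40,
)
for key, value in metrics.items():
    print(key, value)
```

Two fields times two segments give four metrics, and the rows with 20 and 35 weekly working hours do not contribute to any of them.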

4.1 Threshold

You can set up a Threshold when configuring your Validator. A Threshold identifies which values of the metric are considered data quality incidents.

📘

Define your own threshold or let Validio do it for you

You can either define your own threshold rules or let Validio calculate the bounds for you with a Dynamic threshold.

In this example, we configure the following Threshold:

  • Threshold type: Dynamic threshold
  • Sensitivity: 2
  • Decision bounds type: Upper and lower

This Threshold is applied to the four mean metrics in our example. For each metric, Validio calculates dynamic bounds based on historical data and identifies unusually large or small means as incidents.

A Dynamic threshold configured with Sensitivity 2 where the Decision bounds type is set to Upper and lower.
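The source does not describe how Validio calculates dynamic bounds, so the following Python sketch only illustrates the general idea of bounds derived from historical data. It assumes, purely for illustration, that the bounds are the historical mean plus or minus the sensitivity times the sample standard deviation, and that a larger sensitivity value widens the bounds. The functions `dynamic_bounds` and `is_incident` and the sample history are hypothetical.

```python
from statistics import mean, stdev


def dynamic_bounds(history: list[float], sensitivity: float) -> tuple[float, float]:
    """Return (lower, upper) bounds from past metric values.

    Assumption for illustration only: bounds = mean +/- sensitivity * stdev.
    This is not Validio's actual algorithm.
    """
    center = mean(history)
    spread = stdev(history)
    return center - sensitivity * spread, center + sensitivity * spread


def is_incident(value: float, history: list[float], sensitivity: float = 2.0) -> bool:
    """Flag a value outside the upper or lower bound as an incident."""
    lower, upper = dynamic_bounds(history, sensitivity)
    return value < lower or value > upper


# Hypothetical history of one metric, for example the mean of Age for one segment.
history = [40.0, 41.0, 39.0, 40.0, 40.0]

print(is_incident(40.5, history))  # within the bounds
print(is_incident(45.0, history))  # unusually large
print(is_incident(35.0, history))  # unusually small
```

In the example setup, a check like this would run separately for each of the four mean metrics, each with its own history.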

5. Begin your data validation

You can now start your Source to begin the data validation:

  1. Click Start to start the Demo source connector.

Start your Source from the Source details page.

  1. Alternatively, navigate to Sources and start the Source connector from the action menu.

Start your Source from the Sources overview page.

📘

Backfill option

Use the Backfill option if you want to read all available historical data in your validation. Otherwise, Validio only reads data available after Source start.

  1. Click on a specific Validator in the Source details page to view details and graphs:

Validator details for "Mean of Age where Working_hours_weekly equal 40, segmented on Gender"

6. Done

Congratulations, you have now set up your first data validation!

What's next? Explore: