About Sources
Validio can validate structured and semi-structured data from many data sources, including data warehouses, data streams, query engines, and object storage. A source in Validio defines the specific dataset or table that you want to monitor and validate.
Supported Data Sources
The following data sources are supported. Instructions for connecting Validio to each source are available on its dedicated page.
Configuring a Source
You need to configure a source to enable Validio to read its data for validation and monitoring. If you already have credentials configured, you can also convert discovered assets into sources from the Catalog or Lineage pages.
Only Admin users can add new sources.
Specify the Dataset
To configure a source, you need to specify which data asset you want Validio to read. You can either configure the dataset manually or choose from the listed suggestions, if available.
- Available datasets -- If Validio has permission to read the associated datasets within a project, you can select from existing datasets or tables on your source. The configuration parameters differ depending on the source type.
- Custom SQL query -- For data warehouse and query engine sources, you can specify the dataset or table by writing custom SQL queries directly in the source configuration. For more information, see About Custom SQL Sources.
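For instance, a custom SQL source can restrict validation to the rows and columns you care about. The sketch below is purely illustrative; the schema, table, and column names are hypothetical, and the exact SQL dialect depends on your warehouse:

```sql
-- Hypothetical custom SQL source: validate only completed orders.
-- Schema, table, and column names are illustrative.
SELECT
  order_id,
  customer_id,
  amount,
  created_at
FROM analytics.orders
WHERE status = 'completed'
```

Filtering in the query keeps validators focused on the subset of data you actually want to monitor.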
Define a Polling Schedule
For data warehouse and object storage sources, you must set the polling interval parameter to specify how often Validio reads data from the source. For data streams, you do not configure polling, since data is read as soon as it is available from the stream processor.
You can configure polling in the UI using cron expressions, or you can configure polling manually using the Validio CLI. However, you cannot do both on one source. For help with cron schedule expressions, refer to a cron editor such as https://crontab.guru/.
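For example, the cron expression `0 * * * *` polls at the start of every hour, and `30 6 * * 1-5` polls at 06:30 on weekdays; choose an interval that matches how often new data lands in the source.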
Schema Detection and Inference
Validio derives a schema for every source using metadata or schema inference. If data exists in the source, Validio detects the schema automatically.
- Schema from metadata: For most structured data types, Validio reads the schema from the metadata in the data source. For example, Validio can read the metadata from `INFORMATION_SCHEMA` in a data warehouse source (see the example query after this list).
- Schema from inference: When a pre-defined schema does not exist, such as for unstructured and semi-structured data types (like JSON), Validio infers the schema from the existing data.
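To illustrate the kind of metadata such a schema is derived from, the following standard query lists column names, data types, and nullability for a table. It is only a sketch; the table name is hypothetical, and the exact `INFORMATION_SCHEMA` views and required qualifiers vary between warehouses:

```sql
-- Illustrative only: column metadata that a warehouse exposes.
-- The table name is hypothetical.
SELECT
  column_name,
  data_type,
  is_nullable
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'orders'
ORDER BY ordinal_position
```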
Depending on the data source and data types, you can manually configure the schema by selecting fields (including nested fields) to validate.
Nullable Fields
For inferred schemas, if the inferred schema does not match your expectations for incoming data, you can change the nullability and data types of fields.
- Nullable fields and metric validation: Check the `Nullable` option to include datapoints with `NULL` values in validation. If a `NULL` value occurs in a field where the option is not selected, that datapoint is not included in the validator metrics.
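As a concrete illustration (the field name is hypothetical): if `amount` is not marked as `Nullable` and a record arrives with `amount = NULL`, that record is skipped when metrics are computed over `amount`. This is analogous to how SQL aggregate functions ignore `NULL` values:

```sql
-- Analogy only: SQL aggregates skip NULL values, much like datapoints with
-- NULL in a non-nullable field are excluded from validator metrics.
-- Table and column names are hypothetical.
SELECT
  COUNT(*)      AS total_rows,     -- all rows, including NULL amounts
  COUNT(amount) AS non_null_rows,  -- only rows where amount is not NULL
  AVG(amount)   AS mean_amount     -- average over non-NULL amounts
FROM analytics.orders
```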
Schema Change Validation
Validio automatically validates schema changes for structured data types in data warehouses and files in object storage. Schema checks are executed hourly, and any detected schema changes are reported as incidents. For more information, see About Validator Incidents.
Semi-structured and Complex Data Types
In addition to structured data, Validio supports semi-structured and other complex data types. You can select these fields, or specific nested fields within them, when you configure a source (see the example after the table below).
The following table lists the supported data types:
| Source system | Data type |
| --- | --- |
| Athena | ARRAY, MAP, STRUCT |
| ClickHouse | Tuple (and named Tuples), Nested, Array. For more information, see ClickHouse integration. |
| GCS | Parquet |
| Google BigQuery | JSON, ARRAY, STRUCT |
| Kafka | JSON, protobuf, Avro |
| Kinesis | JSON, protobuf, Avro |
| PostgreSQL | JSON, JSONB, array types |
| Pub/Sub | JSON, protobuf, Avro |
| Pub/Sub Lite | JSON, protobuf, Avro |
| Redshift | SUPER |
| S3 | Parquet |
| Snowflake | ARRAY, OBJECT, VARIANT |
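For example, a BigQuery STRUCT column can expose its sub-fields for individual selection; assuming a hypothetical `customer` STRUCT with an `id` sub-field, you could select the nested field `customer.id` for validation while leaving the other sub-fields unmonitored.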
JSONPath Expressions and Arrays
Currently Validio does not support data validation within an array. However, you can validate the size of an array.
Validio uses JSONPath expressions to represent data structures. For each array, Validio adds a computed numeric field named `some_array.length()` to represent the size of the array.
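For example, assuming a hypothetical array field `line_items`, a record in which `line_items` contains three elements is represented by the computed field `line_items.length()` with the value 3, which you can then use to validate the array's size.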