About Sources

Figure: Global Sources page
Validio can validate structured and semi-structured data from many data sources, including data warehouses, data streams, query engines, and object storage. A source in Validio defines the specific dataset or table that you want to monitor and validate.
Supported Data Sources
You can find instructions for connecting Validio to each supported source on its dedicated page.
Configuring a Source
You need to configure a source to enable Validio to read its data for validation and monitoring. If you already have credentials configured, you can also convert discovered assets into sources from the Catalog or Lineage pages.
Only Admin users can add new sources.
Specify the Dataset
To configure a source, you need to specify which data asset you want Validio to read. You can either configure the dataset manually or choose from listed suggestions, if available.
- Available datasets -- If Validio has permission to read associated datasets within a project, you can select from existing datasets or tables on your source. Configuration parameters look different depending on the source type.
- Custom SQL query -- For data warehouse and query engine sources, you can specify the dataset or table by writing custom SQL queries directly in the source configuration. For more information, see About Custom SQL Sources.
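For example, a custom SQL query can restrict validation to a filtered slice of a table. The schema, table, and column names below are hypothetical, and the exact dialect depends on your warehouse or query engine:

```sql
-- Illustrative only: schema, table, and column names are hypothetical,
-- and the SQL dialect depends on the underlying warehouse or query engine.
SELECT
    order_id,
    customer_id,
    order_total,
    created_at
FROM analytics.orders
WHERE status = 'completed'
```

Validio then treats the result of the query as the dataset to monitor.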
Define a Polling Schedule
For data warehouse and object storage sources, you must set the polling interval parameter to specify how often Validio reads data from the source. For data stream sources, you do not configure polling because data is read as soon as it becomes available from the stream processor.
You can configure polling in the UI using cron expressions, or you can configure polling manually using the Validio CLI. However, you cannot do both on the same source. For help with cron schedule expressions, refer to a cron editor such as https://crontab.guru/.
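For example, the following are standard five-field cron expressions (illustrative schedules only, not Validio defaults):

```text
0 * * * *       run at the start of every hour
0 */6 * * *     run every six hours
30 2 * * 1-5    run at 02:30 on weekdays
```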
Assign a Priority
The priority setting helps you and other users understand the importance of any issues or incidents that are detected on a source or validator. The priority can be None, Low, Medium, High, or Critical. You can set the priority on a source or validator during configuration. After the resource is configured, you can change its priority at any time from the source details or validator details pages.
Validator priorities override the source priority. Incidents detected on a source or validator inherit the priority assigned to the validator; if the validator does not have a priority, the incident inherits the source priority. For example, if a source is set to High but one of its validators is set to Low, incidents from that validator are assigned Low priority. You cannot change the priority on an incident or incident group.
Schema Detection and Inference
Validio reads the schema from metadata in the source for most structured data types and infers the schema from existing data for most semi-structured data types. Schema checks run hourly, and detected schema changes are reported as incidents.
For more information, see Schema Detection.
Semi-structured and Complex Data Types
In addition to structured data, Validio supports semi-structured and other complex data types. You can select these fields or certain nested fields when you configure a source.
The following table lists the supported data types:
| Source system | Data types |
|---|---|
| Athena | ARRAY, MAP, STRUCT |
| ClickHouse | Tuple (and named Tuples), Nested, Array. For more information, see the ClickHouse integration page. |
| Databricks | STRUCT, ARRAY, VARIANT |
| Google BigQuery | JSON, ARRAY, STRUCT |
| Kafka | JSON, protobuf, Avro |
| Kinesis | JSON, protobuf, Avro |
| PostgreSQL | JSON, JSONB, array types |
| Pub/Sub | JSON, protobuf, Avro |
| Redshift | SUPER |
| Snowflake | ARRAY, OBJECT, VARIANT |
JSONPath Expressions and Arrays
Currently, Validio does not support data validation within an array. However, you can validate the size of an array.
Validio uses JSONPath expressions to represent data structures. For each array, Validio adds a computed numeric field named `some_array.length()` to represent the size of the array.
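As a hypothetical illustration, consider a record that contains an array field named `items` (the record and field names are invented for this example):

```json
{
  "order_id": 42,
  "items": [
    { "sku": "A-1", "qty": 2 },
    { "sku": "B-7", "qty": 1 }
  ]
}
```

Validio cannot validate the individual elements inside `items`, but it exposes the computed numeric field `items.length()` (value 2 in this example), so you can validate the size of the array.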