About Sources
Validio can validate structured and semi-structured data from many data sources, including data warehouses, data streams, query engines, and object storage. A source in Validio defines the specific dataset or table that you want to monitor and validate.
Supported Data Sources
The following data sources are supported. Instructions for connecting Validio to each source are available on its dedicated page.
Configuring a Source
You need to configure a source to enable Validio to read its data for validation and monitoring. If you already have credentials configured, you can also convert discovered assets into sources from the Catalog or Lineage pages.
Only Admin users can add new sources.
Specify the Dataset
To configure a source, you need to specify which data asset you want Validio to read. You can either configure the dataset manually or choose from the listed suggestions, if available.
- Available datasets -- If Validio has permission to read the associated datasets within a project, you can select from existing datasets or tables on your source. The configuration parameters differ depending on the source type.
- Custom SQL query -- For data warehouse and query engine sources, you can specify the dataset or table by writing custom SQL queries directly in the source configuration. For more information, see About Custom SQL Sources.
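For instance, a custom SQL source can restrict validation to the rows and columns you care about. The sketch below is purely illustrative; the schema, table, and column names are hypothetical, and the exact SQL dialect depends on your warehouse:

```sql
-- Hypothetical custom SQL source: validate only completed orders.
-- Schema, table, and column names are illustrative.
SELECT
  order_id,
  customer_id,
  amount,
  created_at
FROM analytics.orders
WHERE status = 'completed'
```

Filtering in the query keeps validators focused on the subset of data you actually want to monitor.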
Define a Polling Schedule
For data warehouse and object storage sources, you must set the polling interval parameter to specify how often Validio reads data from the source. For data streams, you do not configure polling, since data is read as soon as it is available from the stream processor.
You can configure polling in the UI using cron expressions, or you can configure polling manually using the Validio CLI. However, you cannot do both on one source. For help with cron schedule expressions, refer to a cron editor such as https://crontab.guru/.
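For example, the cron expression `0 * * * *` polls at the start of every hour, and `30 6 * * 1-5` polls at 06:30 on weekdays; choose an interval that matches how often new data lands in the source.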
Schema Detection and Inference
Validio derives a schema for every source using metadata or schema inference. If data exists in the source, Validio detects the schema automatically.
- Schema from metadata: For most structured data types, Validio reads the schema from the metadata in the data source. For example, Validio can read the metadata from `INFORMATION_SCHEMA` in a data warehouse source (see the example query after this list).
- Schema from inference: When a pre-defined schema does not exist, such as for unstructured and semi-structured data types (like JSON), Validio infers the schema from the existing data.
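To illustrate the kind of metadata such a schema is derived from, the following standard query lists column names, data types, and nullability for a table. It is only a sketch; the table name is hypothetical, and the exact `INFORMATION_SCHEMA` views and required qualifiers vary between warehouses:

```sql
-- Illustrative only: column metadata that a warehouse exposes.
-- The table name is hypothetical.
SELECT
  column_name,
  data_type,
  is_nullable
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'orders'
ORDER BY ordinal_position
```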
Depending on the data source and data types, you can manually configure the schema by selecting fields (including nested fields) to validate.
Nullable Fields
For inferred schemas, if the inferred schema does not match your expectations for incoming data, you can change the nullability and data types of fields.
- Nullable fields and metric validation: Check the `Nullable` option to include datapoints with `NULL` values in validation. If a `NULL` value occurs in a field where the option is not selected, that datapoint is not included in the validator metrics.
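As a concrete illustration (the field name is hypothetical): if `amount` is not marked as `Nullable` and a record arrives with `amount = NULL`, that record is skipped when metrics are computed over `amount`. This is analogous to how SQL aggregate functions ignore `NULL` values:

```sql
-- Analogy only: SQL aggregates skip NULL values, much like datapoints with
-- NULL in a non-nullable field are excluded from validator metrics.
-- Table and column names are hypothetical.
SELECT
  COUNT(*)      AS total_rows,     -- all rows, including NULL amounts
  COUNT(amount) AS non_null_rows,  -- only rows where amount is not NULL
  AVG(amount)   AS mean_amount     -- average over non-NULL amounts
FROM analytics.orders
```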
Schema Change Validation
Validio automatically validates schema changes for structured data types in data warehouses and files in object storage. Schema checks are executed hourly, and any detected schema changes are reported as incidents. For more information, see About Validator Incidents.
Semi-structured and Complex Data Types
In addition to structured data, Validio supports semi-structured and other complex data types. You can select these fields, or specific nested fields within them, when you configure a source (see the example after the table below).
The following table lists the supported data types:
| Source system | Data type |
| --- | --- |
| Athena | ARRAY, MAP, STRUCT |
| ClickHouse | Tuple (and named Tuples), Nested, Array. For more information, see ClickHouse integration. |
| GCS | Parquet |
| Google BigQuery | JSON, ARRAY, STRUCT |
| Kafka | JSON, protobuf, Avro |
| Kinesis | JSON, protobuf, Avro |
| PostgreSQL | JSON, JSONB, array types |
| Pub/Sub | JSON, protobuf, Avro |
| Pub/Sub Lite | JSON, protobuf, Avro |
| Redshift | SUPER |
| S3 | Parquet |
| Snowflake | ARRAY, OBJECT, VARIANT |
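For example, a BigQuery STRUCT column can expose its sub-fields for individual selection; assuming a hypothetical `customer` STRUCT with an `id` sub-field, you could select the nested field `customer.id` for validation while leaving the other sub-fields unmonitored.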
JSONPath Expressions and Arrays
Currently Validio does not support data validation within an array. However, you can validate the size of an array.
Validio uses JSONPath expressions to represent data structures. For each array, Validio adds a computed numeric field named `some_array.length()` to represent the size of the array.
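For example, assuming a hypothetical array field `line_items`, a record in which `line_items` contains three elements is represented by the computed field `line_items.length()` with the value 3, which you can then use to validate the array's size.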