Configuring a Source

Configure a Source connector so that Validio can read data from the source for validation and monitoring. This page covers the general source configuration steps. For instructions on specific source integrations, see their dedicated pages.

📘

Converting Assets to Sources

If you already have credentials configured, you can also create a source by converting discovered assets on the Catalog and Lineage pages. For more information, see Using Catalog and Using Lineage.

To configure a Source:

  1. Navigate to Sources and click + New source.
  2. Under Source type, select the type of source you want to connect.
  3. Under Config:
    1. Select a valid Credential, or create a new one, to authenticate your connection to the data source.
    2. Specify which dataset you want Validio to read, or where on the source the data comes from. You can select the table or use a SQL query to specify the source table. For more information, refer to Configure the Dataset.
    3. Set the Polling schedule, using a cron expression to specify how frequently the validators on the source will check for changes.
  4. Under Schema, select the fields to include in the schema. For more information, refer to Configure the Schema.
  5. Under Source details:
    1. Add Tags to help group related sources or to use for routing notifications.
    2. Add an Owner who will be the contact for incident notifications.
  6. Click Continue to create the source.
    Source names are generated automatically and are displayed when the source creation completes. If there are more than five sources, you will see the names of the first five and a count of the remaining sources.

Configure the Dataset

Specify which data asset you want Validio to read. You can either configure the assets manually or choose from the listed suggestions, if any are available. The configuration parameters differ depending on the type of Source you configure.

  • Available datasets--If Validio has permission to read associated datasets within a project, you can select these datasets from your Source.
  • Custom SQL query--You can write a valid SQL query to specify the source table. The source table template function is required in your query.

Polling Schedule

For Data warehouses and Object storage, you must set the polling interval parameter to specify how often Validio reads the data.

You can set the polling interval parameter by choosing one of the presets or by typing a cron expression into the cron expression field. For help writing cron expressions, refer to a cron editor, such as https://crontab.guru/.
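As a rough illustration of what a cron expression encodes, the Python sketch below splits an expression into named fields. It assumes the common five-field cron format, and `describe_cron` is a hypothetical helper written for this example, not part of Validio. Check the cron expression field in the product for the exact format it accepts.

```python
# Field names of the common five-field cron format (an assumption for this sketch).
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression: str) -> dict[str, str]:
    """Split a five-field cron expression into named fields (hypothetical helper)."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


# Every 15 minutes.
print(describe_cron("*/15 * * * *"))
# Every day at 06:00.
print(describe_cron("0 6 * * *"))
```

The helper only checks the number of fields. It does not validate the value ranges inside each field.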

For Data streams, you do not configure polling, because data is read as soon as it is available from the stream processor.

❗️

Configure Polling with the UI or the CLI

You can configure polling to run on a schedule using cron presets or expressions, or you can configure polling manually using the web interface or the CLI. However, you cannot do both on the same source.

Configure the Schema

Configure the schema for the data source you want to validate. Depending on the data source and data types, you can select which fields, including nested fields, you want to validate and set nullability for the data in the source. Depending on your Source type, inferring the schema might take a few seconds.

If data exists in the source, Validio can infer the schema:

  • Schema from metadata: Validio reads the schema from the metadata in the data source, for example, from INFORMATION_SCHEMA in a Data Warehouse. This is true for most structured data types.
  • Schema from inference: Validio infers the schema from the existing data when no pre-defined schema exists. This is true for most semi-structured data types, for example JSON.
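To make the idea of inferring a schema from existing data more concrete, here is a minimal Python sketch that derives field types and nullability from a few sample JSON records. It is an illustration only: `infer_schema` is a hypothetical function, and the logic is an assumption for this example, not Validio's actual inference.

```python
import json


def infer_schema(records):
    """Infer a flat {field: (type_name, nullable)} mapping from sample records."""
    schema = {}
    for record in records:
        for field, value in record.items():
            type_name, nullable = schema.get(field, (None, False))
            if value is None:
                # A NULL value marks the field as nullable.
                nullable = True
            else:
                # Keep the first non-null type seen for the field.
                type_name = type_name or type(value).__name__
            schema[field] = (type_name, nullable)
    # A field that is missing from some records is also treated as nullable.
    for field, (type_name, _) in list(schema.items()):
        if any(field not in record for record in records):
            schema[field] = (type_name, True)
    return schema


rows = [json.loads(line) for line in (
    '{"id": 1, "email": "a@example.com"}',
    '{"id": 2, "email": null}',
)]
print(infer_schema(rows))
# {'id': ('int', False), 'email': ('str', True)}
```

In this sketch, a single NULL in the sample is enough to mark a field as nullable. Inference of this kind depends on the data that happens to exist in the source.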

Nested Fields

Validio supports semi-structured and complex data types, including arrays and nested fields. You can select all nested fields or specify which ones you want to include for further validation.

📘

JSONPath Expressions

Validio uses JSONPath to represent data structures. For example, the JSONPath expression some_array.length() represents the size of an array.
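As an illustration of how nested fields can be addressed by path, the Python sketch below walks a sample JSON document and lists a path for each leaf field. The function `list_field_paths` and the simplified dotted notation with `[*]` for array elements are assumptions made for this example. They may not match the exact JSONPath syntax that Validio displays.

```python
def list_field_paths(value, prefix=""):
    """List simplified dotted paths to every leaf field in a nested structure."""
    paths = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else key
            paths.extend(list_field_paths(child, child_prefix))
    elif isinstance(value, list):
        # Inspect only the first element and use a wildcard for all elements.
        if value:
            paths.extend(list_field_paths(value[0], f"{prefix}[*]"))
    else:
        paths.append(prefix)
    return paths


document = {
    "user": {"id": 7, "address": {"city": "Oslo"}},
    "items": [{"sku": "A1", "qty": 2}],
}
print(list_field_paths(document))
# ['user.id', 'user.address.city', 'items[*].sku', 'items[*].qty']
```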

Nullable Fields and Data Types

For inferred schemas, you can change nullability and data types. This is useful when the inferred schema does not match what you expect of the incoming data.

🚧

Null Fields and Metrics

If a datapoint contains NULL in a field whose nullable checkbox is not selected, that datapoint is not included in the Validator metrics.

For example, in a row count Validator, the datapoint is ignored. You must select the nullable checkbox to validate null values, for example to measure the share of null.
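The effect on metrics can be sketched in a few lines of Python. This is a simplified model of the behavior described above, not Validio's implementation, and `included_datapoints` is a hypothetical helper introduced for this example.

```python
def included_datapoints(rows, field, nullable):
    """Return the rows that count toward metrics for the given field."""
    if nullable:
        return list(rows)
    # NULL in a non-nullable field: the datapoint is left out of the metrics.
    return [row for row in rows if row.get(field) is not None]


rows = [{"amount": 10}, {"amount": None}, {"amount": 30}, {"amount": None}]

for nullable in (False, True):
    kept = included_datapoints(rows, "amount", nullable)
    null_count = sum(1 for row in kept if row["amount"] is None)
    print(f"nullable={nullable}: row count={len(kept)}, share of null={null_count / len(kept)}")
# nullable=False: row count=2, share of null=0.0
# nullable=True: row count=4, share of null=0.5
```

With the field marked as not nullable, the two NULL datapoints drop out of the row count, and a share-of-null metric has nothing to measure.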