Custom SQL
For data from data warehouse or query engine sources, you can leverage SQL to create validators customized for your business needs. For example, you can use a custom SQL validator to count duplicates of values or records, compare fields or records between tables, apply custom filters to your data, and more.
Note
This article discusses the updated Custom SQL validator configuration and syntax. For the previous version of the Custom SQL validator, refer to Custom SQL (deprecated). For migration guidelines between the syntax versions, see Migrating Custom SQL Syntax.
Configuring a Custom SQL Validator
To configure a Custom SQL Validator,
- Navigate to the specific source where you want to add the validator.
- Click + New Validator to start the configuration wizard.
- Under Source Type, select Custom SQL.
- Under Segmentation, segment your data into subsets for more granular validation. Check one of the following options:
- Existing Segmentation–to select a pre-configured segmentation from the list.
- Create Segmentation–to select fields to include in the new segmentation.
- Under Window, specify the amount of data or time over which the validator operates. Check one of the following options:
- Existing Window–to select a pre-configured window from the list.
- Create Window–to configure a new window, which can be a Global window or Tumbling window. You will see different configuration parameters depending on the window type. For more information, see About Windows.
- Under SQL Query,
- Enter a valid SQL expression. The SQL expression must return a field named
validio_metric
and columns that match the segmentation fields configured for the validator. If you are using a tumbling window, the SQL expression can use Template Functions and must also return a field namedvalidio_window_id
. For more information, see Mandatory Fields, Segmentation Fields, and Template Functions. - (Recommended) Click Test Query to verify that your SQL expression is valid. For more information, see Testing the SQL Query.
- Enter a valid SQL expression. The SQL expression must return a field named
- Under Config, check Initialize with Backfill if you want to include historical data for validation.
- Under Thresholds, select the type of threshold to use to detect incidents. You can configure a Fixed threshold, Dynamic threshold, or a Difference threshold. For more information, see About Thresholds.
- Enter a Validator Name.
- Click Continue to create the Custom SQL Validator.
For examples of Custom SQL validator configurations, refer to Custom SQL Validator Recipes.
Testing the SQL Query
You can test the SQL query for your custom validator to verify that your SQL expression is valid:
- Test Query–(Recommended) Test that your SQL query is valid.
- If the query executes properly, the query returns a table with the first 20 records.
- If the query did not execute properly, the query returns an error message propagated from the data source.
- Skip Test–Continue configuring the validator without testing.
Recommendation
Because syntax errors are common in custom SQL, we recommend always testing the query so that you can discover and correct errors early, instead of during source polling.
Mandatory Fields
The SQL query must return the mandatory fields validio_metric
and, if the validator uses a tumbling window, validio_window_id
.
Note
The
validio_metric
andvalidio_window_id
fields should be written in lower case, which requires using quotation marks for some data sources, such as double quotation marks for Snowflake.
Validation Metric
The Custom SQL query must return a field named validio_metric
to instruct Validio which field to use for validation.
Validation Window
You can configure a Custom SQL validator to use either Tumbling or Global windows.
- With Tumbling windows, you can write the SQL query using Validio template functions The SQL query must return the
validio_metric
field as well as a field namedvalidio_window_id
. This ID is used to map the metrics for each tumbling window to their respective time ranges. For more information, see Template Functions. - With Global windows, you can use regular SQL syntax (which must return the
validio_metric
field).
Note
A tumbling window uses a cursor field to keep track of what metrics have already been processed and what has not. For more information, see Tumbling Windows.
Segmentation Fields
The Custom SQL query must return columns that match the Segmentation fields configured for the validator. For example, if segmentation on Country is used, then the SQL query should return a column called Country. All returned segmentation columns must be of type string, and if necessary you can use casting to ensure the type. For multi-field segmentations, each field is displayed as a separate column.
Aliases for the returned segmentation columns must follow the following naming conventions:
Naming Conventions | Example of Segmentation Field → Column Name |
---|---|
Match the name and case of the underlying column. For some data sources, you will need to use quotation marks around the column name. | country → country ZipCode → ZipCode |
Non-alphanumeric characters should be replaced with an underscore. | foo:bar.baz → foo_bar_baz |
For semi-structured segmentations using arrays, the array access should be replaced with underscore followed by the index. | foo:bar.baz[0].qux → foo_bar_baz_0_qux |
If the segmentation is using the array length field, the alias should be suffixed with _length. | foo:bar.baz[0].qux.length() → foo_bar_baz_0_qux_length |
Template Functions
You can write your custom SQL Query to use placeholders and template functions to reference Source, Segmentations, and Windows to help integrate your custom SQL logic with Validio tumbling windows.
Important
Template functions are only available if your Custom SQL Validator is configured to use tumbling windows.
Placeholder | Description |
---|---|
{{ validio_window_id(<identifier>) }} | (Required) Used to generate an ID for all data points that belong to the same window. Must be placed in the SELECT clause. For more information, see Window ID. |
{{ validio_window_filter(<identifier>) }} | (Required) Used to process only the data that has not already been processed. Must be placed in the WHERE clause. For more information, see Window Filter . |
{{ validio_reference_window_filter(<identifier>, <lookback>) }} | (Optional) Used when data in the current window needs to be compared to data in previous windows. Must be placed in the WHERE clause. For more information, see Reference Window Filter. |
{{ validio_segmentation_filter() }} | (Optional) Used when a segmentation filter is specified on the segmentation attached to a validator. Must be placed in the WHERE clause. For more information, see Segmentation Filter. |
Window ID
The required function validio_window_id(<identifier>)
is used to generate an ID for all data points that belong to the same window. The function must be placed in the SELECT
clause.
The function expands to a formula that calculates which window the datetime in <identifier>
belongs to. Then it counts the number of windows the calculated window is from a reference timestamp to get a monotonically increasing numerical ID for each window in the query. (The reference timestamp is often the timestamp for the oldest data point for the current poll.)
This column must be included in the GROUP BY
clause to aggregate over all data points in the same window. Because it is used to link metrics to the correct window time ranges, the column must be returned as the name validio_window_id
(matching the case).
Window Filter
The required function validio_window_filter(<identifier>)
expands to a filter condition that limits the query to the active time range. For example, during a backfill poll the function will expand to a time range that covers the lookback period. During a normal poll the function will span back to where the last poll completed, which will be the time range for the latest window (or windows). The <identifier>
should be the timestamp column, which is usually the cursor column.
The function expands to the following filter condition:
<identifier> >= <first_window_start_timestamp> AND
<identifier < <last_window_end_timestamp>
This function must be used on the source table, but can also be used on other tables, such as when comparing data from two different tables. (You can see this demonstrated in the Tumbling Reference Count recipe in Custom SQL Validator Recipes.) The window filter function works as an optimization to process only the data that is relevant for the current poll.
Reference Window Filter
The optional function validio_reference_window_filter(<identifier>, <lookback>)
is useful when doing reference validation, where data in the current window needs to be compared to data in previous windows. This function is similar to the Window Filter function, but takes a second argument that can be used to extend the time range with a <lookback>
amount of windows.
The function expands to the following filter condition:
<identifier> >= <first_window_start_timestamp - lookback windows> AND
<identifier < <last_window_end_timestamp>
You can see an example of a reference window filter function in the Segmented Tumbling Mean Ratio recipe in Custom SQL Validator Recipes.
Segmentation Filter
The function validio_segmentation_filter()
must be used in place of a segmentation filter expression specified on the segmentation attached to the SQL validator. The function automatically expands to the filter expression before polling for new data on the validator.
Important
For the SQL validator to be valid, segmentation filtering for the validator must also be applied to its query. Otherwise polling will fail and the SQL validator will not be created if there is a segmentation filter present without a matching placeholder in the query.
Note
It is valid to use the placeholder in a query when a segmentation filter is not used.
Custom SQL Query Example
This example demonstrates how to use template functions to build a custom SQL query. The validator is configured with a daily tumbling window and segmentation on country
.
The following custom SQL query uses template functions:
SELECT
`country`,
COUNT(*) AS `validio_metric`,
{{ validio_window_id(`created_at`) }} AS `validio_window_id`
FROM
`project`.`dataset`.`table`
WHERE
{{ validio_window_filter(`created_at`) }}
GROUP BY
`country`,
-- Note: not all warehouses support grouping by alias like this.
-- In those cases the template function can be used here as well.
`validio_window_id`
The following shows the full SQL query after the template functions have been expanded:
SELECT
`country`,
COUNT(*) AS `validio_metric`,
-- Expanded {{ validio_window_id(`created_at`) }}
FLOOR(
CAST(
TIMESTAMP_DIFF(
`created_at`,
'2024-02-26T00:00:00Z', —- Reference timestamp
SECOND
) AS FLOAT64
) / 86400
) AS `validio_window_id`,
FROM
`project`.`dataset`.`table`
WHERE
-- Expanded {{ validio_window_filter(`created_at`) }}
-- In this example it is a backfill poll for a source with 7 days lookback
`created_at` >= '2024-02-26T00:00:00Z' AND
`created_at` < '2024-03-04T00:00:00Z'
GROUP BY
`country`,
`validio_window_id`
The following table shows the output of the SQL query:
validio_metric | validio_window_id | country | window_start 1 | window_end 1 |
---|---|---|---|---|
10 | 0 | SE | 2024-02-26T00:00:00Z | 2024-02-27T00:00:00Z |
15 | 0 | US | 2024-02-26T00:00:00Z | 2024-02-27T00:00:00Z |
8 | 1 | SE | 2024-02-27T00:00:00Z | 2024-02-28T00:00:00Z |
13 | 1 | US | 2024-02-27T00:00:00Z | 2024-02-28T00:00:00Z |
1 The window_start
and window_end
columns are shown for illustrative purposes and are not part of the SQL validator query.
For more examples of validator configurations, refer to Custom SQL Validator Recipes.
Updated 2 months ago