Custom SQL Validator Query Syntax
Learn the syntax for writing the SQL query to define the metric calculation in a Custom SQL validator.
The SQL query for a Custom SQL validator defines the metric calculation and should return a single numeric value.
A valid SQL expression includes mandatory fields, segmentation fields, and template functions depending on the window type.
- Tumbling windows: Use template functions and return both
validio_metricandvalidio_window_id - Global windows: Use regular SQL syntax and return
validio_metric
Mandatory Fields
Your SQL query must return specific fields:
validio_metric(Required for all SQL validators) - The field containing the validation metricvalidio_window_id(Required for tumbling windows only) - Maps metrics for each tumbling window to their respective time ranges
Both the
validio_metricandvalidio_window_idfields should be written in lower case, which may require quotation marks for some data sources, for example, double quotes for Snowflake.
Segmentation Fields
The SQL query must return columns with names that have the data type STRING and match the segmentation fields configured for the validator. For example, if segmentation on country is used, then the SQL query should return a column called country. If necessary, you can use casting to ensure the data type. For multi-field segmentations, each field is displayed as a separate column.
Aliases for the returned segmentation columns must follow the following naming conventions:
Naming Conventions | Example Segmentation Field → Column Name |
|---|---|
Match the name and case of the underlying column. For some data sources, you will need to use quotation marks around the column name. | country → country ZipCode → ZipCode |
Non-alphanumeric characters should be replaced with an underscore. | foo:bar.baz → foo_bar_baz |
For semi-structured segmentations using arrays, the array access should be replaced with underscore followed by the index. | foo:bar.baz[0].qux → foo_bar_baz_0_qux |
If the segmentation is using the array length field, the alias should be suffixed with _length. | foo:bar.baz[0].qux.length() → foo_bar_baz_0_qux_length |
Template Functions
Write your custom SQL Query to use placeholders and template functions that reference Source, Segmentations, and Windows to help integrate your custom SQL logic with Validio tumbling windows.
Template functions are only valid for SQL validators with tumbling windows.
Placeholder | Description |
|---|---|
| (Required) Used to generate an ID for all data points that belong to the same window. Must be placed in the SELECT clause. For more information, see Window ID. |
| (Required) Used to process only the data that has not already been processed. Must be placed in the WHERE clause. For more information, see Window Filter . |
| (Optional) Used when data in the current window needs to be compared to data in previous windows. Must be placed in the WHERE clause. For more information, see Reference Window Filter. |
| (Optional) Used when a segmentation filter is specified on the segmentation attached to a validator. Must be placed in the WHERE clause. For more information, see Segmentation Filter. |
| (Required for SQL sources) Used as a convenience function to get the source table identifier. You must use this function if the source is a SQL source (meaning that it is configured using a custom SQL query). For more information, see Source Table. |
Window ID
Use the required function validio_window_id(<identifier>) to generate an ID for all data points that belong to the same window. The function must be placed in the SELECT clause.
The function expands to a formula that calculates which window the datetime in <identifier> belongs to. Then it counts the number of windows the calculated window is from a reference timestamp to get a monotonically increasing numerical ID for each window in the query. (The reference timestamp is often the timestamp for the oldest data point for the current poll.)
This column must be included in the GROUP BY clause to aggregate over all data points in the same window. Because it is used to link metrics to the correct window time ranges, the column must be returned as the name validio_window_id (matching the case).
Window Filter
Use the required function validio_window_filter(<identifier>) to process data that has not already been processed. The <identifier> should be the timestamp column, which is usually the cursor column.
The function expands to a filter condition that limits the query to the active time range. For example, during a backfill poll the function will expand to a time range that covers the lookback period. During a normal poll the function will span back to where the last poll completed, which will be the time range for the latest window (or windows).
<identifier> >= <first_window_start_timestamp> AND
<identifier < <last_window_end_timestamp>This function must be used on the source table, but can also be used on other tables, such as when comparing data from two different tables. (You can see this demonstrated in the Tumbling Reference Count recipe in Custom SQL Validator Recipes.) The window filter function works as an optimization to process only the data that is relevant for the current poll.
Reference Window Filter
Use the optional function validio_reference_window_filter(<identifier>, <lookback>) when doing reference window validation, where data in the current window needs to be compared to data in previous windows. This function is similar to the Window Filter function, but takes a second argument that can be used to extend the time range with a <lookback> amount of windows.
The function expands to the following filter condition:
<identifier> >= <first_window_start_timestamp - lookback windows> AND
<identifier < <last_window_end_timestamp>You can see an example of a reference window filter function in the Segmented Tumbling Mean Ratio recipe in Custom SQL Validator Recipes.
Segmentation Filter
Use the function validio_segmentation_filter() in place of a segmentation filter expression specified on the segmentation attached to the SQL validator. The function automatically expands to the filter expression before polling for new data on the validator.
You can also use the segmentation filter placeholder in a query when a segmentation filter is not used.
For the SQL validator to be valid, segmentation filtering for the validator must also be applied to its query. Otherwise polling will fail and the SQL validator will not be created if there is a segmentation filter present without a matching placeholder in the query.
Source Table
When configuring a data warehouse source, this function is required if use custom SQL to specify the source table.
Use the function validio_source_table() in place of manually writing the source table identifier.
This convenience function should be used in the FROM clause and it will expand to the quoted identifier for the table. For example, a BigQuery source expands into:
`project`.`dataset`.`table`
SQL Validator and SQL SourcesIf the SQL validator is used on a SQL source then the
validio_source_table()function must be used as the table identifier since the SQL source query definition will be added as a CTE to the validator query during runtime. In this case the function expands to the name of the CTE.For more information, see About Custom SQL Sources.
Custom SQL Query Example
This example demonstrates how to use template functions to build a custom SQL query. The validator is configured with a daily tumbling window and segmentation on country.
The following custom SQL query uses template functions:
SELECT
`country`,
COUNT(*) AS `validio_metric`,
{{ validio_window_id(`created_at`) }} AS `validio_window_id`
FROM
`project`.`dataset`.`table`
WHERE
{{ validio_window_filter(`created_at`) }}
GROUP BY
`country`,
-- Note: not all warehouses support grouping by alias like this.
-- In those cases the template function can be used here as well.
`validio_window_id`The following shows the full SQL query after the template functions have been expanded:
SELECT
`country`,
COUNT(*) AS `validio_metric`,
-- Expanded {{ validio_window_id(`created_at`) }}
FLOOR(
CAST(
TIMESTAMP_DIFF(
`created_at`,
'2024-02-26T00:00:00Z', —- Reference timestamp
SECOND
) AS FLOAT64
) / 86400
) AS `validio_window_id`,
FROM
`project`.`dataset`.`table`
WHERE
-- Expanded {{ validio_window_filter(`created_at`) }}
-- In this example it is a backfill poll for a source with 7 days lookback
`created_at` >= '2024-02-26T00:00:00Z' AND
`created_at` < '2024-03-04T00:00:00Z'
GROUP BY
`country`,
`validio_window_id`The following table shows the output of the SQL query:
|
|
|
1 |
1 |
|---|---|---|---|---|
10 | 0 | SE | 2024-02-26T00:00:00Z | 2024-02-27T00:00:00Z |
15 | 0 | US | 2024-02-26T00:00:00Z | 2024-02-27T00:00:00Z |
8 | 1 | SE | 2024-02-27T00:00:00Z | 2024-02-28T00:00:00Z |
13 | 1 | US | 2024-02-27T00:00:00Z | 2024-02-28T00:00:00Z |
1 The window_start and window_end columns are shown for illustrative purposes and are not part of the SQL validator query.
For more examples of validator configurations, refer to Custom SQL Validator Recipes.
Updated 8 days ago