Custom SQL Validator Recipes
The following Custom SQL validator examples are based on BigQuery SQL. You can use these recipes as a starting point when writing Custom SQL validators for other data sources, such as Redshift or Athena.
For information about configuring Custom SQL validators, see Custom SQL.
Duplicate Count
This example validator counts the number of occurrences for a specific column with a window and sums the total count. The resulting value represents the number of duplicates for that particular value.
SELECT
`validio_window_id`,
SUM(duplicates) as `validio_metric`
FROM (
SELECT
{{ validio_window_id(`time`) }} AS `validio_window_id`,
COUNT(*) - 1 AS duplicates
FROM
`project`.`dataset`.`table`
WHERE
{{ validio_window_filter(`time`) }}
GROUP BY
`validio_window_id`, `column_to_check_for_duplicates`
)
GROUP BY
`validio_window_id`
For the following example data, the result would be 3, where the duplicate values are b, b, and NULL.
column_to_check_for_duplicates |
---|
a |
b |
b |
b |
NULL |
NULL |
Referential Integrity
This example validator computes how many non-NULL values exist in a foreign key table which is not present in the primary key table.
SELECT
{{ validio_window_id(`foreign_key_table`.`time`) }} AS `validio_window_id`,
COUNTIF(`primary_key_table`.`foreign_key` IS NULL) AS `validio_metric`
FROM
`project`.`dataset`.`foreign_key_table`
LEFT JOIN
`project`.`dataset`.`primary_key_table`
ON
`foreign_key_table`.`foreign_key` = `primary_key_table`.`foreign_key`
WHERE
{{ validio_window_filter(`foreign_key_table`.`time`) }} AND
`foreign_key_table`.`foreign_key` IS NOT NULL
GROUP BY
`validio_window_id`
For the following example tables, the result would be 2, where the missing keys are 1003 and 1004.
foreign_key_table.foreign_key |
---|
1001 |
1002 |
1003 |
1004 |
NULL |
primary_key_table.primary_key |
---|
1001 |
1002 |
Global Row Count
This example validator uses a global window to count the total number of records in the table. No template functions are needed since the full table will be scanned during each poll. The only requirement is that the query must return the validio_metric
column.
SELECT
COUNT(*) AS `validio_metric`
FROM
`project`.`dataset`.`table`
Segmented Tumbling Row Count
This example validator is configured with a tumbling window on the field created_at
and a segmentation on the field country
. The SQL query uses the window template functions to calculate the number of records per country and window. The query must return the validio_metric
, validio_window
, and country
columns.
SELECT
{{ validio_window_id(`created_at`) }} AS `validio_window_id`,
`country`
COUNT(*) AS `validio_metric`
FROM
`project`.`dataset`.`table`
WHERE
{{ validio_window_filter(`created_at`) }}
GROUP BY
`validio_window_id`,
`country`
Segmented Tumbling Mean Ratio
This example validator is configured with a tumbling window on the field created_at
and a segmentation on the field city
. The SQL query uses the window template functions to calculate the mean ratio by comparing the mean of amount
in the current window against the two previous windows. The query must return the validio_metric
, validio_window
, and city
columns.
SELECT
`target`.`validio_window_id`,
`target`.`city`,
`target`.`target_mean` / NULLIF(
(
-- Calculate the mean for the previous two windows
SUM(`reference`.`reference_amount`) /
NULLIF(COUNT(CAST(`reference`.`reference_amount` AS FLOAT64)), 0)
),
0
) AS `validio_metric`
FROM (
-- Calculate the mean for each window + segment
SELECT
{{ validio_window_id(`created_at`)}} AS `validio_window_id`,
`city`,
AVG(amount) AS `target_mean`
FROM
`project`.`dataset`.`table`
WHERE
{{ validio_window_filter(`created_at`) }}
GROUP BY
`validio_window_id`,
`city`
) `target`
JOIN (
SELECT
{{ validio_window_id(`created_at`)}} AS `validio_window_id`,
`city`,
`amount` AS `reference_amount`
FROM
`project`.`dataset`.`table`
WHERE
-- Use the reference version of the filter function
-- to include records from the two previous windows.
{{ validio_reference_window_filter(`created_at`, 2) }}
) `reference` ON (
-- Join the current window + segment with records from the two previous windows
`reference`.`validio_window_id` >= `target`.`validio_window_id` - 2 AND
`reference`.`validio_window_id` <= `target`.`validio_window_id` - 1 AND
`reference`.`city` = `target`.`city`
)
GROUP BY
`target`.`validio_window_id`,
`target`.`city`,
`target`.`target_mean`
Tumbling Reference Count
This example validator compares the number of records in two tables using hourly tumbling windows.
WITH ref AS (
-- count web records in aggregation table per window
SELECT
count(*) as `count`,
{{ validio_window_id(`created_at`) }} as `validio_window_id`
FROM
`project`.`dataset`.`traffic_agg`
WHERE
{{ validio_window_filter(`created_at`) }} AND
`platform` = 'web'
GROUP BY
`validio_window_id`
), target AS (
-- count records in web table per window
SELECT
count(*) as `count`,
{{ validio_window_id(`created_at`) }} as `validio_window_id`
FROM
`project`.`dataset`.`traffic_web`
WHERE
{{ validio_window_filter(`created_at`) }}
GROUP BY
`validio_window_id`
)
SELECT
`target`.`validio_window_id`,
(COALESCE(`ref`.`count`, 0) / `target`.`count`) * 100 AS `validio_metric`
FROM
`target`
LEFT JOIN
`ref` ON (
`ref`.`validio_window_id` = `target`.`validio_window_id`
)
For the following example tables, the result is: window 10:00 = 100, window 11:00 = 50.
traffic_web.created_at |
---|
10:00 |
10:01 |
11:00 |
11:01 |
traffic_agg.created_at | traffic_agg.platform |
---|---|
10:00 | web |
10:01 | web |
11:01 | web |
Duration Between Records
This example validator is configured with an hourly tumbling window on created_at
and calculates the maximum duration in seconds between the records timestamp in each window.
SELECT
`validio_window_id`,
MAX(`duration`) AS `validio_metric`,
FROM (
SELECT
{{ validio_window_id(`created_at`) }} AS `validio_window_id`,
COALESCE(
TIMESTAMP_DIFF(
`created_at`,
LAG(`created_at`) OVER(PARTITION BY {{ validio_window_id(`created_at`) }} ORDER BY `created_at`),
SECOND
),
0) AS `duration`
FROM
`project`.`dataset`.`table`
WHERE
{{ validio_window_filter(`created_at`) }}
)
GROUP BY
`validio_window_id`
For the following example data, the result is: window 10:00 = 9s, window 11:00 = 5s.
traffic_web.created_at |
---|
10:00:00 |
10:00:01 |
11:00:10 |
11:00:10 |
11:00:15 |
Updated 8 days ago