Custom SQL Validator Recipes
The following Custom SQL validator examples are based on BigQuery SQL. You can use these recipes as a starting point when writing Custom SQL validators for other data sources, such as Redshift or Athena.
For information about configuring Custom SQL validators, see Custom SQL.
Duplicate Count
This example validator counts the number of occurrences for a specific column with a window and sums the total count. The resulting value represents the number of duplicates for that particular value.
SELECT
`validio_window_id`,
SUM(duplicates) as `validio_metric`
FROM (
SELECT
{{ validio_window_id(`time`) }} AS `validio_window_id`,
COUNT(*) - 1 AS duplicates
FROM
`project`.`dataset`.`table`
WHERE
{{ validio_window_filter(`time`) }}
GROUP BY
`validio_window_id`, `column_to_check_for_duplicates`
)
GROUP BY
`validio_window_id`
For the following example data, the result would be 3, where the duplicate values are b, b, and NULL.
column_to_check_for_duplicates |
---|
a |
b |
b |
b |
NULL |
NULL |
Referential Integrity
This example validator computes how many non-NULL values exist in a foreign key table which is not present in the primary key table.
SELECT
{{ validio_window_id(`foreign_key_table`.`time`) }} AS `validio_window_id`,
COUNTIF(`primary_key_table`.`foreign_key` IS NULL) AS `validio_metric`
FROM
`project`.`dataset`.`foreign_key_table`
LEFT JOIN
`project`.`dataset`.`primary_key_table`
ON
`foreign_key_table`.`foreign_key` = `primary_key_table`.`foreign_key`
WHERE
{{ validio_window_filter(`foreign_key_table`.`time`) }} AND
`foreign_key_table`.`foreign_key` IS NOT NULL
GROUP BY
`validio_window_id`
For the following example tables, the result would be 2, where the missing keys are 1003 and 1004.
foreign_key_table.foreign_key |
---|
1001 |
1002 |
1003 |
1004 |
NULL |
primary_key_table.primary_key |
---|
1001 |
1002 |
Global Row Count
This example validator uses a global window to count the total number of records in the table. No template functions are needed since the full table will be scanned during each poll. The only requirement is that the query must return the validio_metric
column.
SELECT
COUNT(*) AS `validio_metric`
FROM
`project`.`dataset`.`table`
Segmented Tumbling Row Count
This example validator is configured with a tumbling window on the field created_at
and a segmentation on the field country
. The SQL query uses the window template functions to calculate the number of records per country and window. The query must return the validio_metric
, validio_window
, and country
columns.
SELECT
{{ validio_window_id(`created_at`) }} AS `validio_window_id`,
`country`
COUNT(*) AS `validio_metric`
FROM
`project`.`dataset`.`table`
WHERE
{{ validio_window_filter(`created_at`) }}
GROUP BY
`validio_window_id`,
`country`
Segmented Tumbling Mean Ratio
This example validator is configured with a tumbling window on the field created_at
and a segmentation on the field city
. The SQL query uses the window template functions to calculate the mean ratio by comparing the mean of amount
in the current window against the two previous windows. The query must return the validio_metric
, validio_window
, and city
columns.
SELECT
`target`.`validio_window_id`,
`target`.`city`,
`target`.`target_mean` / NULLIF(
(
-- Calculate the mean for the previous two windows
SUM(`reference`.`reference_amount`) /
NULLIF(COUNT(CAST(`reference`.`reference_amount` AS FLOAT64)), 0)
),
0
) AS `validio_metric`
FROM (
-- Calculate the mean for each window + segment
SELECT
{{ validio_window_id(`created_at`)}} AS `validio_window_id`,
`city`,
AVG(amount) AS `target_mean`
FROM
`project`.`dataset`.`table`
WHERE
{{ validio_window_filter(`created_at`) }}
GROUP BY
`validio_window_id`,
`city`
) `target`
JOIN (
SELECT
{{ validio_window_id(`created_at`)}} AS `validio_window_id`,
`city`,
`amount` AS `reference_amount`
FROM
`project`.`dataset`.`table`
WHERE
-- Use the reference version of the filter function
-- to include records from the two previous windows.
{{ validio_reference_window_filter(`created_at`, 2) }}
) `reference` ON (
-- Join the current window + segment with records from the two previous windows
`reference`.`validio_window_id` >= `target`.`validio_window_id` - 2 AND
`reference`.`validio_window_id` <= `target`.`validio_window_id` - 1 AND
`reference`.`city` = `target`.`city`
)
GROUP BY
`target`.`validio_window_id`,
`target`.`city`,
`target`.`target_mean`
Tumbling Reference Count
This example validator compares the number of records in two tables using hourly tumbling windows.
WITH ref AS (
-- count web records in aggregation table per window
SELECT
count(*) as `count`,
{{ validio_window_id(`created_at`) }} as `validio_window_id`
FROM
`project`.`dataset`.`traffic_agg`
WHERE
{{ validio_window_filter(`created_at`) }} AND
`platform` = 'web'
GROUP BY
`validio_window_id`
), target AS (
-- count records in web table per window
SELECT
count(*) as `count`,
{{ validio_window_id(`created_at`) }} as `validio_window_id`
FROM
`project`.`dataset`.`traffic_web`
WHERE
{{ validio_window_filter(`created_at`) }}
GROUP BY
`validio_window_id`
)
SELECT
`target`.`validio_window_id`,
(COALESCE(`ref`.`count`, 0) / `target`.`count`) * 100 AS `validio_metric`
FROM
`target`
LEFT JOIN
`ref` ON (
`ref`.`validio_window_id` = `target`.`validio_window_id`
)
For the following example tables, the result is: window 10:00 = 100, window 11:00 = 50.
traffic_web.created_at |
---|
10:00 |
10:01 |
11:00 |
11:01 |
traffic_agg.created_at | traffic_agg.platform |
---|---|
10:00 | web |
10:01 | web |
11:01 | web |
Duration Between Records
This example validator is configured with an hourly tumbling window on created_at
and calculates the maximum duration in seconds between the records timestamp in each window.
SELECT
`validio_window_id`,
MAX(`duration`) AS `validio_metric`,
FROM (
SELECT
{{ validio_window_id(`created_at`) }} AS `validio_window_id`,
COALESCE(
TIMESTAMP_DIFF(
`created_at`,
LAG(`created_at`) OVER(PARTITION BY {{ validio_window_id(`created_at`) }} ORDER BY `created_at`),
SECOND
),
0) AS `duration`
FROM
`project`.`dataset`.`table`
WHERE
{{ validio_window_filter(`created_at`) }}
)
GROUP BY
`validio_window_id`
For the following example data, the result is: window 10:00 = 9s, window 11:00 = 5s.
traffic_web.created_at |
---|
10:00:00 |
10:00:01 |
11:00:10 |
11:00:10 |
11:00:15 |
Moving Average for Metric
This validator computes a moving average over the specified metric field (revenue in this example) across a 14-day rolling period. The validator is configured to use a tumbling window on the Date timestamp.
SELECT
`validio_window_id`,
AVG(revenue) OVER(ORDER BY `validio_window_id` RANGE BETWEEN 13 PRECEDING AND CURRENT ROW) AS `validio_metric`
FROM (
SELECT
{{ validio_window_id(`Date`) }} AS `validio_window_id`,
AVG(revenue) AS `revenue`
FROM
`project`.`dataset`.`table`
WHERE
{{ validio_reference_window_filter(`Date`, 13) }}
GROUP BY
{{ validio_window_id(`Date`) }}
)
You can customize this example by changing the metric field, window length, and timestamp field as needed:
- Metric Field - Replace revenue with the desired metric field.
- Window Length - Adjust the number of windows over which the moving average is calculated (the example shown uses 14 days). For an X-window moving average, set the reference_window_filter function to include X-1 past windows to capture the current window as part of the calculation.
- Timestamp Field - Replace Date with the relevant timestamp field for the metric.
Trend Alert
This validator detects significant drifts in a metric over time. It compares the current value of the metric with its value seen previously Y period ago (for example, 30 days when Y=30 for daily windows) and triggers an alert if the change exceeds X%.
The recipe uses the LAG
function to compare the current metric value with its value from the previous Y past window. It calculates the percentage change and returns 1 if the change is above a specified threshold of X%.
WITH current_and_past AS (
SELECT
Date,
Sales as current_metric,
LAG(Sales, Y) OVER (ORDER BY {{ validio_window_id(`Date`)}}) AS past_metric
FROM
`project`.`dataset`.`table`
WHERE
{{ validio_reference_window_filter(`Date`, Y) }}
)
SELECT
{{ validio_window_id(`Date`) }} AS `validio_window_id`,
CASE WHEN ABS(100 \* (SAFE_DIVIDE(current_and_past.current_metric, current_and_past.past_metric) -1)) > X THEN 1 ELSE 0 END AS `validio_metric`
FROM
current_and_past
You can customize this example by changing the metric field, the timestamp field, the Y parameter as needed:
- Metric Field - Replace the metric field (for example, Sales, Profit) to track trends.
- Timestamp Field - Replace Date with the relevant timestamp field for the metric.
- Y - Adjust the Y parameter to define the number of past windows for comparison.
Combine this with a Fixed Threshold (>= 1) to alert on metric changes when the value changes by more than X%. For more information, see Configuring Fixed Thresholds.
Numeric Anomaly
This validator identifies anomalies in numeric data by counting the number of datapoints, within a given window, that deviate from expected values.
The validator uses a reference source to define the acceptable range of values as (ref_mean ± 1/precision * ref_std
). Higher precision means that the accepted range of values is narrower, which identifies more anomalies. Conversely, lower precision values imply a wider range of accepted values, which identifies fewer anomalies.
The recipe uses the REF_WIN_LEN
parameter to define number of windows in the reference source and the OFFSET_WIN_LEN
parameter to define number of offset windows from current window. For more information. refer to Reference Source.
WITH
reference_source AS (
SELECT
`validio_window_id`,
ref_mean - 1/precision * ref_std AS lower_bound,
ref_mean + 1/precision * ref_std AS upper_bound
FROM (
SELECT
`validio_window_id`,
AVG(metric_mean) OVER(ORDER BY `validio_window_id` ROWS BETWEEN REF_WIN_LEN+OFFSET_WIN_LEN PRECEDING AND OFFSET_WIN_LEN PRECEDING) ref_mean,
AVG(metric_std) OVER(ORDER BY `validio_window_id` ROWS BETWEEN REF_WIN_LEN+OFFSET_WIN_LEN PRECEDING AND OFFSET_WIN_LEN PRECEDING) ref_std
FROM (
SELECT
{{ validio_window_id(`timestamp`) }} AS `validio_window_id`,
AVG(metric) AS metric_mean,
STDDEV(metric) AS metric_std,
COUNT(metric) metric_count
FROM
`project`.`dataset`.`table`
WHERE
{{ validio_reference_window_filter(`timestamp`,REF_WIN_LEN+OFFSET_WIN_LEN) }}
GROUP BY
`validio_window_id`
)
WHERE
metric_count >= 100 -- Minimum number of reference datapoints
)
)
SELECT
{{ validio_window_id(a.`timestamp`) }} AS `validio_window_id`,
SUM(CASE
WHEN a.metric NOT BETWEEN ref.lower_bound AND ref.upper_bound THEN 1
ELSE 0 END
) AS `validio_metric`,
FROM
`project`.`dataset`.`table` AS a
LEFT JOIN
reference_source AS ref
ON
({{ validio_window_id(a.`timestamp`) }} = ref.`validio_window_id`)
WHERE
{{ validio_window_filter(a.`timestamp`) }}
GROUP BY
`validio_window_id`
You can customize this example by changing the metric field, the timestamp field, and other parameters as needed:
- Metric Field - Replace the
metric
field with your relevant field. - Timestamp Field - Replace
timestamp
with the relevant timestamp field. - Adjust the
REF_WIN_LEN
andOFFSET_WIN_LEN
parameters to define the number of past windows included in the reference calculations. - Adjust the
precision
parameter to control the width of acceptable range of values. - Combine this Custom SQL validator with a Dynamic Threshold to alert on deviation of datapoints from the reference source.
Updated 13 days ago