Documentation
HomeRequest DemoContact

Custom SQL Validator Recipes

The following Custom SQL validator examples are based on BigQuery SQL. You can use these recipes as a starting point when writing Custom SQL validators for other data sources, such as Redshift or Athena.

For information about configuring Custom SQL validators, see Custom SQL.

Duplicate Count

This example validator counts the number of occurrences for a specific column with a window and sums the total count. The resulting value represents the number of duplicates for that particular value.

SELECT
    `validio_window_id`,
    SUM(duplicates) as `validio_metric`
FROM (
    SELECT
        {{ validio_window_id(`time`) }} AS `validio_window_id`,
        COUNT(*) - 1 AS duplicates
    FROM 
        `project`.`dataset`.`table`
    WHERE
        {{ validio_window_filter(`time`) }}
    GROUP BY
        `validio_window_id`, `column_to_check_for_duplicates`
)
GROUP BY
    `validio_window_id`

For the following example data, the result would be 3, where the duplicate values are b, b, and NULL.

column_to_check_for_duplicates
a
b
b
b
NULL
NULL

Referential Integrity

This example validator computes how many non-NULL values exist in a foreign key table which is not present in the primary key table.

SELECT
    {{ validio_window_id(`foreign_key_table`.`time`) }} AS `validio_window_id`,
    COUNTIF(`primary_key_table`.`foreign_key` IS NULL) AS `validio_metric`
FROM
    `project`.`dataset`.`foreign_key_table`
LEFT JOIN
    `project`.`dataset`.`primary_key_table`
ON
    `foreign_key_table`.`foreign_key` = `primary_key_table`.`foreign_key`
WHERE
    {{ validio_window_filter(`foreign_key_table`.`time`) }} AND
    `foreign_key_table`.`foreign_key` IS NOT NULL
GROUP BY
    `validio_window_id`

For the following example tables, the result would be 2, where the missing keys are 1003 and 1004.

foreign_key_table.foreign_key
1001
1002
1003
1004
NULL
primary_key_table.primary_key
1001
1002

Global Row Count

This example validator uses a global window to count the total number of records in the table. No template functions are needed since the full table will be scanned during each poll. The only requirement is that the query must return the validio_metric column.

SELECT
	COUNT(*) AS `validio_metric`
FROM
	`project`.`dataset`.`table`

Segmented Tumbling Row Count

This example validator is configured with a tumbling window on the field created_at and a segmentation on the field country. The SQL query uses the window template functions to calculate the number of records per country and window. The query must return the validio_metric, validio_window, and country columns.

SELECT
    {{ validio_window_id(`created_at`) }} AS `validio_window_id`,
    `country`
    COUNT(*) AS `validio_metric`
FROM
    `project`.`dataset`.`table`
WHERE
    {{ validio_window_filter(`created_at`) }}
GROUP BY
    `validio_window_id`,
    `country`

Segmented Tumbling Mean Ratio

This example validator is configured with a tumbling window on the field created_at and a segmentation on the field city. The SQL query uses the window template functions to calculate the mean ratio by comparing the mean of amount in the current window against the two previous windows. The query must return the validio_metric, validio_window, and city columns.

SELECT
    	`target`.`validio_window_id`,
    	`target`.`city`,
    	`target`.`target_mean` / NULLIF(
        	(
            	-- Calculate the mean for the previous two windows 
            	SUM(`reference`.`reference_amount`) /
            	NULLIF(COUNT(CAST(`reference`.`reference_amount` AS FLOAT64)), 0)
        	),
        	0
    	) AS `validio_metric`
FROM (
    	-- Calculate the mean for each window + segment
    	SELECT
            	{{ validio_window_id(`created_at`)}} AS `validio_window_id`,
            	`city`,
            	AVG(amount) AS `target_mean`
    	FROM
            	`project`.`dataset`.`table`
    	WHERE
            	{{ validio_window_filter(`created_at`) }}
    	GROUP BY
            	`validio_window_id`,
            	`city`
) `target`
JOIN (
    	SELECT
            	{{ validio_window_id(`created_at`)}} AS `validio_window_id`,
            	`city`,
            	`amount` AS `reference_amount`
    	FROM
            	`project`.`dataset`.`table`
    	WHERE
            	-- Use the reference version of the filter function
            	-- to include records from the two previous windows.
            	{{ validio_reference_window_filter(`created_at`, 2) }}
) `reference` ON (
    	-- Join the current window + segment with records from the two previous windows
    	`reference`.`validio_window_id` >= `target`.`validio_window_id` - 2 AND
    	`reference`.`validio_window_id` <= `target`.`validio_window_id` - 1 AND
    	`reference`.`city` = `target`.`city`
)
GROUP BY
    	`target`.`validio_window_id`,
    	`target`.`city`,
    	`target`.`target_mean`

Tumbling Reference Count

This example validator compares the number of records in two tables using hourly tumbling windows.

WITH ref AS (
	-- count web records in aggregation table per window
	SELECT
    		count(*) as `count`,
    		{{ validio_window_id(`created_at`) }} as `validio_window_id`
	FROM
    		`project`.`dataset`.`traffic_agg`
	WHERE
    		{{ validio_window_filter(`created_at`) }} AND
    		`platform` = 'web'
	GROUP BY
    		`validio_window_id`
), target AS (
	-- count records in web table per window
	SELECT
    		count(*) as `count`,
    		{{ validio_window_id(`created_at`) }} as `validio_window_id`
	FROM
    		`project`.`dataset`.`traffic_web`
	WHERE
    		{{ validio_window_filter(`created_at`) }}
	GROUP BY
    		`validio_window_id`
)
SELECT
	`target`.`validio_window_id`,
	(COALESCE(`ref`.`count`, 0) / `target`.`count`) * 100 AS `validio_metric`
FROM
	`target`
LEFT JOIN
	`ref` ON (
    		`ref`.`validio_window_id` = `target`.`validio_window_id`
	)

For the following example tables, the result is: window 10:00 = 100, window 11:00 = 50.

traffic_web.created_at
10:00
10:01
11:00
11:01
traffic_agg.created_attraffic_agg.platform
10:00web
10:01web
11:01web

Duration Between Records

This example validator is configured with an hourly tumbling window on created_at and calculates the maximum duration in seconds between the records timestamp in each window.

SELECT
	`validio_window_id`,
	MAX(`duration`) AS `validio_metric`,
FROM (
	SELECT
    		{{ validio_window_id(`created_at`) }} AS `validio_window_id`,
    		COALESCE(
        		TIMESTAMP_DIFF(
            			`created_at`,
            			LAG(`created_at`) OVER(PARTITION BY {{ validio_window_id(`created_at`) }} ORDER BY `created_at`),
            			SECOND
        		),
    		0) AS `duration`
	FROM
    		`project`.`dataset`.`table`
	WHERE
    		{{ validio_window_filter(`created_at`) }}
)
GROUP BY
	`validio_window_id`

For the following example data, the result is: window 10:00 = 9s, window 11:00 = 5s.

traffic_web.created_at
10:00:00
10:00:01
11:00:10
11:00:10
11:00:15

Moving Average for Metric

This validator computes a moving average over the specified metric field (revenue in this example) across a 14-day rolling period. The validator is configured to use a tumbling window on the Date timestamp.

SELECT  
    `validio_window_id`,  
    AVG(revenue) OVER(ORDER BY `validio_window_id` RANGE BETWEEN 13 PRECEDING AND CURRENT ROW) AS `validio_metric`  
FROM (  
    SELECT  
        {{ validio_window_id(`Date`) }} AS `validio_window_id`,  
        AVG(revenue) AS `revenue`  
    FROM  
        `project`.`dataset`.`table`  
    WHERE  
        {{ validio_reference_window_filter(`Date`, 13) }}  
    GROUP BY  
    {{ validio_window_id(`Date`) }}  
)

You can customize this example by changing the metric field, window length, and timestamp field as needed:

  • Metric Field - Replace revenue with the desired metric field.
  • Window Length - Adjust the number of windows over which the moving average is calculated (the example shown uses 14 days). For an X-window moving average, set the reference_window_filter function to include X-1 past windows to capture the current window as part of the calculation.
  • Timestamp Field - Replace Date with the relevant timestamp field for the metric.

Trend Alert

This validator detects significant drifts in a metric over time. It compares the current value of the metric with its value seen previously Y period ago (for example, 30 days when Y=30 for daily windows) and triggers an alert if the change exceeds X%.

The recipe uses the LAG function to compare the current metric value with its value from the previous Y past window. It calculates the percentage change and returns 1 if the change is above a specified threshold of X%.

WITH current_and_past AS (  
SELECT  
    Date,  
    Sales as current_metric,  
    LAG(Sales, Y) OVER (ORDER BY {{ validio_window_id(`Date`)}}) AS past_metric  
    FROM  
        `project`.`dataset`.`table`  
WHERE  
    {{ validio_reference_window_filter(`Date`, Y) }}  
)  
SELECT  
    {{ validio_window_id(`Date`) }} AS `validio_window_id`,  
    CASE WHEN ABS(100 \* (SAFE_DIVIDE(current_and_past.current_metric, current_and_past.past_metric) -1)) > X THEN 1 ELSE 0  END AS `validio_metric`  
FROM  
    current_and_past

You can customize this example by changing the metric field, the timestamp field, the Y parameter as needed:

  • Metric Field - Replace the metric field (for example, Sales, Profit) to track trends.
  • Timestamp Field - Replace Date with the relevant timestamp field for the metric.
  • Y - Adjust the Y parameter to define the number of past windows for comparison.

Combine this with a Fixed Threshold (>= 1) to alert on metric changes when the value changes by more than X%. For more information, see Configuring Fixed Thresholds.

Numeric Anomaly

This validator identifies anomalies in numeric data by counting the number of datapoints, within a given window, that deviate from expected values.

The validator uses a reference source to define the acceptable range of values as (ref_mean ± 1/precision * ref_std). Higher precision means that the accepted range of values is narrower, which identifies more anomalies. Conversely, lower precision values imply a wider range of accepted values, which identifies fewer anomalies.

The recipe uses the REF_WIN_LEN parameter to define number of windows in the reference source and the OFFSET_WIN_LEN parameter to define number of offset windows from current window. For more information. refer to Reference Source.

WITH
  reference_source AS (
  SELECT
    `validio_window_id`,
    ref_mean - 1/precision * ref_std AS lower_bound,
    ref_mean + 1/precision * ref_std AS upper_bound
  FROM (
    SELECT
      `validio_window_id`,
      AVG(metric_mean) OVER(ORDER BY `validio_window_id` ROWS BETWEEN REF_WIN_LEN+OFFSET_WIN_LEN PRECEDING AND OFFSET_WIN_LEN PRECEDING) ref_mean,
      AVG(metric_std) OVER(ORDER BY `validio_window_id` ROWS BETWEEN REF_WIN_LEN+OFFSET_WIN_LEN PRECEDING AND OFFSET_WIN_LEN PRECEDING) ref_std
    FROM (
      SELECT
        {{ validio_window_id(`timestamp`) }} AS `validio_window_id`,
        AVG(metric) AS metric_mean,
        STDDEV(metric) AS metric_std,
        COUNT(metric) metric_count
      FROM
        `project`.`dataset`.`table`
      WHERE
        {{ validio_reference_window_filter(`timestamp`,REF_WIN_LEN+OFFSET_WIN_LEN) }}
      GROUP BY
        `validio_window_id` 
      )
    WHERE
      metric_count >= 100 -- Minimum number of reference datapoints
    )
  )

SELECT
  {{ validio_window_id(a.`timestamp`) }} AS `validio_window_id`,
  SUM(CASE
        WHEN a.metric NOT BETWEEN ref.lower_bound AND ref.upper_bound THEN 1
        ELSE 0 END
    ) AS `validio_metric`,
FROM
  `project`.`dataset`.`table` AS a
LEFT JOIN
  reference_source AS ref
ON
  ({{ validio_window_id(a.`timestamp`) }} = ref.`validio_window_id`)
WHERE
  {{ validio_window_filter(a.`timestamp`) }}
GROUP BY
  `validio_window_id`

You can customize this example by changing the metric field, the timestamp field, and other parameters as needed:

  • Metric Field - Replace the metric field with your relevant field.
  • Timestamp Field - Replace timestamp with the relevant timestamp field.
  • Adjust the REF_WIN_LEN and OFFSET_WIN_LEN parameters to define the number of past windows included in the reference calculations.
  • Adjust the precision parameter to control the width of acceptable range of values.
  • Combine this Custom SQL validator with a Dynamic Threshold to alert on deviation of datapoints from the reference source.