Custom SQL

Create your own customized validator, using SQL

Validator overview

📘

Supported data sources

The Custom SQL validator currently only supports Data Warehouses and Query Engines.

Using the Custom SQL validator you can leverage the power of SQL to create any custom validation of your data.

Configuration

StepRequiredParametersOptions
Validator typeCustom SQL-
Source configSegmentation1. Select a configured Segmentation

Or

2. Unsegmented (default)
Source configWindowSelect a configured Window
SQL querySQL queryEnter the SQL query to be executed
ConfigFieldInitialize with backfill (checkbox)
ThresholdThreshold typeFixed threshold
Dynamic threshold
Threshold✅(*1)OperatorLess than
Less than or equal
Equal
Not equal
Greater than
Greater than or equal
Threshold✅(*1)ValueNumeric value to validate threshold on
Threshold✅(*2)SensitivityEnter a numeric value
Threshold✅(*2)Decision bounds typeUpper
Lower
Upper and lower (default)
Validator nameNameEnter a validator name

*1 Only applicable for Fixed thresholds.

*2 Only applicable for Dynamic thresholds.

Configuration details

Placeholders

Placeholders are needed for Validio to be able to reference Source, Segmentations and Windows.

PlaceholderDescriptionScope
{{ table }}Instructs Validio where to reference the Source table.Always mandatory.
{{ select_columns }}Instructs Validio where to reference SELECT-fields, handling Windowing and Segmentation.Not applicable when combining Global window and Unsegmented.
{{ group_by_columns }}Instructs Validio where to reference GROUP BY-fields, handling Windowing and Segmentation.Not applicable when combining Global window and Unsegmented.

Mandatory fields

The Custom SQL query must include a return field named validio_metric, to instruct Validio which field to use as validation metric.

📘

Note that validio_metric should be written in lower case, which requires using quotation marks for some data sources, such as double quotation marks for Snowflake.

Test SQL

You can test your query by clicking Test SQL:

  • If the query executed properly, you will see a table with the first 20 records returned.
  • If the query did not execute properly, you will see the error message propagated from the data source.

In both cases you are able to expand the executed query by clicking See raw query.

Custom SQL recipes

Below are several useful recipes for common use cases. Each recipe is written in four variants: BigQuery, Snowflake, BigQuery G/U, Snowflake G/U.

👍

G/U indicates the combination of Global Window and Unsegmented. This means data is not grouped into windows or segments, which in turn means the placeholders {{ select_columns }} and {{ group_by_columns }} are not applicable for the query.

Each recipe is followed by example data and corresponding result. For simplicity, these examples do not include windows nor segments. Hence you can think of each example as a G/U example. Or alternatively, you can think about it as the subset of data for one specific window/segment.

You can use these recipes as a starting point when writing Custom SQL validators for other data sources, such as Redshift or Athena.


Duplicate values

Counts duplicate values

SELECT
  {{ select_columns }},
  SUM(duplicates) as `validio_metric`
FROM (
  SELECT 
    {{ select_columns }}, 
    COUNT(*) - 1 AS duplicates
  FROM 
    {{ table }}
  GROUP BY {{ group_by_columns }}, column_to_check_for_duplicates
)
GROUP BY
  {{ group_by_columns }}
SELECT
  {{ select_columns }},
  SUM(duplicates) as "validio_metric"
FROM (
  SELECT 
    {{ select_columns }}, 
    COUNT(*) - 1 AS duplicates
  FROM 
    {{ table }}
  GROUP BY {{ group_by_columns }}, column_to_check_for_duplicates
)
GROUP BY
  {{ group_by_columns }}
SELECT
  SUM(duplicates) as `validio_metric`
FROM (
  SELECT 
    COUNT(*) - 1 AS duplicates
  FROM 
    {{ table }}
  GROUP BY column_to_check_for_duplicates
)
SELECT
  SUM(duplicates) as "validio_metric"
FROM (
  SELECT 
    COUNT(*) - 1 AS duplicates
  FROM 
    {{ table }}
  GROUP BY column_to_check_for_duplicates
)

Example data:

column_to_check_for_duplicates
a
b
b
b
NULL
NULL

Example result: 3 (the duplicate values are b, b, NULL)


Duplicate records

Counts duplicate records

SELECT
  {{ select_columns }},
  SUM(duplicates) as `validio_metric`
FROM (
  SELECT 
    {{ select_columns }}, 
    COUNT(*) - 1 AS duplicates
  FROM 
    {{ table }} AS t
  GROUP BY {{ group_by_columns }}, FARM_FINGERPRINT(TO_JSON_STRING(t))
)
GROUP BY
  {{ group_by_columns }}
SELECT
  {{ select_columns }},
  SUM(duplicates) as "validio_metric"
FROM (
  SELECT 
    {{ select_columns }}, 
    COUNT(*) - 1 AS duplicates
  FROM 
    {{ table }}
  GROUP BY {{ group_by_columns }}, HASH(*)
)
GROUP BY
  {{ group_by_columns }}
SELECT
  SUM(duplicates) as `validio_metric`
FROM (
  SELECT 
    COUNT(*) - 1 AS duplicates
  FROM 
    {{ table }} AS t
  GROUP BY FARM_FINGERPRINT(TO_JSON_STRING(t))
)
SELECT
  SUM(duplicates) as "validio_metric"
FROM (
  SELECT 
    COUNT(*) - 1 AS duplicates
  FROM 
    {{ table }}
  GROUP BY HASH(*)
)

Example data:

some_columnsome_other_column
ax
bx
bx
by
NULLy
NULLy

Example result: 2 (the duplicate records are [b, x], [NULL, y])


Referential integrity

Counts foreign_key_table.foreign_key not matching a primary_key_table.primary_key

SELECT
    {{select_columns}},
    COUNTIF(`primary_key_table`.`foreign_key` IS NULL) AS `validio_metric`
FROM
    {{table}} as `foreign_key_table`
LEFT JOIN
    `primary_key_table` ON `primary_key_table`.`foreign_key` = `foreign_key_table`.`foreign_key`
group by
    {{ group_by_columns }}
SELECT
    {{select_columns}},
    COUNT_IF("primary_key_table"."foreign_key" IS NULL) AS "validio_metric"
FROM
    {{table}} as "foreign_key_table"
LEFT JOIN
    "primary_key_table" ON "primary_key_table"."foreign_key" = "foreign_key_table"."foreign_key"
group by
    {{ group_by_columns }}
SELECT
    COUNTIF(`primary_key_table`.`foreign_key` IS NULL) AS `validio_metric`
FROM
    {{table}} as `foreign_key_table`
LEFT JOIN
    `primary_key_table` ON `primary_key_table`.`foreign_key` = `foreign_key_table`.`foreign_key`
SELECT
    COUNT_IF("primary_key_table"."foreign_key" IS NULL) AS "validio_metric"
FROM
    {{table}} as "foreign_key_table"
LEFT JOIN
    "primary_key_table" ON "primary_key_table"."foreign_key" = "foreign_key_table"."foreign_key"

Example data:

foreign_key_table.foreign_key
1001
1002
1003
1004
NULL
primary_key_table.primary_key
1001
1002

Example result: 2 (the missing primary keys are 1003, 1004)


Calculated field

For example, summing up the difference between two fields

SELECT 
  {{ select_columns }},
  SUM(some_column - some_other_column) AS `validio_metric`
FROM
  {{ table }}
GROUP BY
  {{ group_by_columns }}
SELECT 
  {{ select_columns }},
  SUM(some_column - some_other_column) AS "validio_metric"
FROM
  {{ table }}
GROUP BY
  {{ group_by_columns }}
SELECT 
  SUM(some_column - some_other_column) AS `validio_metric`
FROM
  {{ table }}
SELECT 
  SUM(some_column - some_other_column) AS "validio_metric"
FROM
  {{ table }}

Example data:

some_columnsome_other_column
105
104

Example result: 11 (10-5 + 10-4 = 11)


Identical tables

Counts different records between two tables

WITH a AS (
    SELECT 
      FARM_FINGERPRINT(TO_JSON_STRING(t)) AS h, 
      COUNT(*) AS c 
    FROM 
      {{ table }} AS t 
    GROUP BY 
      h
), b AS (
    SELECT 
      FARM_FINGERPRINT(TO_JSON_STRING(t)) AS h, 
      COUNT(*) AS c 
    FROM 
      `some_other_table` AS t 
    GROUP BY 
      h
)
SELECT 
  SUM(ABS(IFNULL(a.c, 0) - IFNULL(b.c, 0))) AS `validio_metric`
FROM a 
  FULL OUTER JOIN b
    ON b.h = a.h
WITH a AS (
    SELECT 
      HASH(*) AS h, 
      COUNT(*) AS c 
    FROM 
      {{ table }} AS t 
    GROUP BY 
      h
), b AS (
    SELECT 
      HASH(*) AS h, 
      COUNT(*) AS c 
    FROM 
      some_other_table AS t 
    GROUP BY 
      h
)
SELECT 
  SUM(ABS(IFNULL(a.c, 0) - IFNULL(b.c, 0))) AS "validio_metric"
FROM a 
  FULL OUTER JOIN b
    ON b.h = a.h

Example data:

some_table.some_columnsome_table.some_other_column
NULLNULL
ax
by
some_other_table.some_columnsome_other_table.some_other_column
NULLNULL
ax
ax

Example result: 2 (the differing records are [b, y] in some_table and one of the [a, x]-records in some_other_table)


Identical tables - alternative approach

Counts different records between two tables. This is an alternative approach which supports more data sources. However, this method can give result 0 even if tables are different, in the special case of different number of duplicates, but equal number of records

SELECT (
    SELECT COUNT(*) FROM (
        SELECT * FROM {{ table }}
        UNION DISTINCT
        SELECT * FROM `some_other_table`
    )) -
    (
        SELECT COUNT(*) FROM (
        SELECT * FROM {{ table }}
        INTERSECT DISTINCT
        SELECT * FROM `some_other_table`
    )) + ABS
    (
        (SELECT COUNT(*) FROM {{ table }}) -
        (SELECT COUNT(*) FROM `some_other_table`)
    ) AS `validio_metric`
SELECT (
    SELECT COUNT(*) FROM (
        SELECT * FROM {{ table }}
        UNION
        SELECT * FROM some_other_table
    )) -
    (
        SELECT COUNT(*) FROM (
        SELECT * FROM {{ table }}
        INTERSECT
        SELECT * FROM some_other_table
    )) + ABS
    (
        (SELECT COUNT(*) FROM {{ table }}) -
        (SELECT COUNT(*) FROM some_other_table)
    ) AS "validio_metric"

Example data:

some_table.some_columnsome_table.some_other_column
NULLNULL
ax
by
some_other_table.some_columnsome_other_table.some_other_column
NULLNULL
ax
ax

Example result: 1 (this method only identifies [b, y] as differing record, since this method ignores the duplicate [a, x]-record)


Duration between records

For example, calculates maximum duration in seconds, between records' timestamp fields

SELECT
	{{ select_columns }},
	MAX(duration) AS `validio_metric`
FROM (
	SELECT
		*,
		COALESCE(TIMESTAMP_DIFF(ts, LAG(ts) OVER(PARTITION BY {{ group_by_columns }} ORDER BY ts), SECOND), 0) AS duration
	FROM
		{{ table }}
)
GROUP BY
	{{ group_by_columns }}
SELECT
	{{ select_columns }},
	MAX(duration) AS "validio_metric"
FROM (
	SELECT
		*,
		COALESCE(TIMESTAMPDIFF(SECOND, LAG(ts) OVER(PARTITION BY {{ group_by_columns }} ORDER BY ts), ts), 0) AS duration
	FROM
		{{ table }}
)
GROUP BY
	{{ group_by_columns }}
SELECT
	MAX(duration) AS `validio_metric`
FROM (
	SELECT
		*,
		COALESCE(TIMESTAMP_DIFF(ts, LAG(ts) OVER(ORDER BY ts), SECOND), 0) AS duration
	FROM
		{{ table }}
)
SELECT
	MAX(duration) AS "validio_metric"
FROM (
	SELECT
		*,
		COALESCE(TIMESTAMPDIFF(SECOND, LAG(ts) OVER(PARTITION BY {{ group_by_columns }} ORDER BY ts), ts), 0) AS duration
	FROM
		{{ table }}
)

Example data:

some_column
10:00:00
10:00:01
10:00:10
10:00:30

Example result: 20 (max duration is between 10:00:10 and 10:00:30)


Fuzzy matching

For example, count suspiciously similar adressess. Note that fuzzy matching uses heavy calculations, which can imply long query times, especially for Global Window or when backfilling data.

-- This variant will compare values in current segment/window with values in all other segments/windows
SELECT
    {{ select_columns }},
    COUNT(*) as `validio_metric`
FROM {{ table }} AS t1 
    JOIN (SELECT some_column FROM {{ table }}) AS t2
        ON t1.some_column <> t2.some_column
            AND EDIT_DISTANCE(t1.some_column, t2.some_column) / GREATEST(LENGTH(t1.some_column), LENGTH(t2.some_column)) < 0.3
GROUP BY
    {{ group_by_columns }}
-- This variant will compare values in current segment/window with values in all other segments/windows
SELECT
    {{ select_columns }},
    COUNT(*) as "validio_metric"
FROM {{ table }} AS t1 
    JOIN (SELECT some_column FROM {{ table }}) AS t2
        ON t1.some_column <> t2.some_column
            AND EDITDISTANCE(t1.some_column, t2.some_column) / GREATEST(LENGTH(t1.some_column), LENGTH(t2.some_column)) < 0.3
GROUP BY
    {{ group_by_columns }}
SELECT
    COUNT(*) as `validio_metric`
FROM {{ table }} AS t1 
    JOIN {{ table }} AS t2
        ON t1.some_column <> t2.some_column
            AND EDIT_DISTANCE(t1.some_column, t2.some_column) / GREATEST(LENGTH(t1.some_column), LENGTH(t2.some_column)) < 0.3
SELECT
    COUNT(*) as "validio_metric"
FROM {{ table }} AS t1 
    JOIN {{ table }} AS t2
        ON t1.some_column <> t2.some_column
            AND EDITDISTANCE(t1.some_column, t2.some_column) / GREATEST(LENGTH(t1.some_column), LENGTH(t2.some_column)) < 0.3

Example data:

some_column
1789 Maple Avenue, Suite 4C, Springfield, IL 62704
1780 maple avenue, 4 C, springfields, IL-62704
4567 Oak Lane, Apt. 5B, Madison, WI 53703

Example result: 2 (the two adressess starting with 17 are counted as similar, given set sensitivity of 0.3)


Custom SQL filter

Any Custom SQL query can leverage the SQL filter recipes in a filtering clause (such as WHERE, HAVING, QUALIFY, or CASE WHEN).

Below is a simple example of a Custom SQL query, where you can paste a SQL filter recipe into the WHERE clause.

SELECT
    {{ select_columns }},
    count(*) as `validio_metric`
FROM
    {{ table }}
WHERE
		<<< paste SQL filter recipe here >>>
GROUP BY
    {{ group_by_columns }}
SELECT
    {{ select_columns }},
    count(*) as "validio_metric"
FROM
    {{ table }}
WHERE
		<<< paste SQL filter recipe here >>>
GROUP BY
    {{ group_by_columns }}
SELECT
    count(*) as `validio_metric`
FROM
    {{ table }}
WHERE
		<<< paste SQL filter recipe here >>>
SELECT
    count(*) as "validio_metric"
FROM
    {{ table }}
WHERE
		<<< paste SQL filter recipe here >>>