Custom SQL
Create your own customized validator, using SQL
Validator overview
Supported data sources
The Custom SQL validator currently only supports Data Warehouses and Query Engines.
Using the Custom SQL validator you can leverage the power of SQL to create any custom validation of your data.
Configuration
Step | Required | Parameters | Options |
---|---|---|---|
Validator type | ✅ | Custom SQL | - |
Source config | ✅ | Segmentation | 1. Select a configured Segmentation Or 2. Unsegmented (default) |
Source config | ✅ | Window | Select a configured Window |
SQL query | ✅ | SQL query | Enter the SQL query to be executed |
Config | ✅ | Field | Initialize with backfill (checkbox) |
Threshold | ✅ | Threshold type | Fixed threshold Dynamic threshold |
Threshold | ✅(*1) | Operator | Less than Less than or equal Equal Not equal Greater than Greater than or equal |
Threshold | ✅(*1) | Value | Numeric value to validate threshold on |
Threshold | ✅(*2) | Sensitivity | Enter a numeric value |
Threshold | ✅(*2) | Decision bounds type | Upper Lower Upper and lower (default) |
Validator name | ✅ | Name | Enter a validator name |
*1 Only applicable for
Fixed thresholds
.*2 Only applicable for
Dynamic thresholds
.
Configuration details
Placeholders
Placeholders are needed for Validio to be able to reference Source, Segmentations and Windows.
Placeholder | Description | Scope |
---|---|---|
{{ table }} | Instructs Validio where to reference the Source table. | Optional when combining Global window and Unsegmented. For other cases, always mandatory. |
{{ select_columns }} | Instructs Validio where to reference SELECT-fields, handling Windowing and Segmentation. | Not applicable when combining Global window and Unsegmented. |
{{ group_by_columns }} | Instructs Validio where to reference GROUP BY-fields, handling Windowing and Segmentation. | Not applicable when combining Global window and Unsegmented. |
Mandatory fields
The Custom SQL query must include a return field named validio_metric
, to instruct Validio which field to use as validation metric.
Note that
validio_metric
should be written in lower case, which requires using quotation marks for some data sources, such as double quotation marks for Snowflake.
Test SQL
You can test your query by clicking Test SQL:
- If the query executed properly, you will see a table with the first 20 records returned.
- If the query did not execute properly, you will see the error message propagated from the data source.
In both cases you are able to expand the executed query by clicking See raw query.
Custom SQL recipes
Below are several useful recipes for common use cases. Each recipe is written in four variants: BigQuery, Snowflake, BigQuery G/U, Snowflake G/U.
G/U indicates the combination of Global Window and Unsegmented. This means data is not grouped into windows or segments, which in turn means the placeholders
{{ select_columns }}
and{{ group_by_columns }}
are not applicable for the query.
Each recipe is followed by example data and corresponding result. For simplicity, these examples do not include windows nor segments. Hence you can think of each example as a G/U example. Or alternatively, you can think about it as the subset of data for one specific window/segment.
You can use these recipes as a starting point when writing Custom SQL validators for other data sources, such as Redshift or Athena.
Duplicate values
Counts duplicate values
SELECT
{{ select_columns }},
SUM(duplicates) as `validio_metric`
FROM (
SELECT
{{ select_columns }},
COUNT(*) - 1 AS duplicates
FROM
{{ table }}
GROUP BY {{ group_by_columns }}, column_to_check_for_duplicates
)
GROUP BY
{{ group_by_columns }}
SELECT
{{ select_columns }},
SUM(duplicates) as "validio_metric"
FROM (
SELECT
{{ select_columns }},
COUNT(*) - 1 AS duplicates
FROM
{{ table }}
GROUP BY {{ group_by_columns }}, column_to_check_for_duplicates
)
GROUP BY
{{ group_by_columns }}
SELECT
SUM(duplicates) as `validio_metric`
FROM (
SELECT
COUNT(*) - 1 AS duplicates
FROM
{{ table }}
GROUP BY column_to_check_for_duplicates
)
SELECT
SUM(duplicates) as "validio_metric"
FROM (
SELECT
COUNT(*) - 1 AS duplicates
FROM
{{ table }}
GROUP BY column_to_check_for_duplicates
)
Example data:
column_to_check_for_duplicates |
---|
a |
b |
b |
b |
NULL |
NULL |
Example result: 3 (the duplicate values are b, b, NULL)
Duplicate records
Counts duplicate records
SELECT
{{ select_columns }},
SUM(duplicates) as `validio_metric`
FROM (
SELECT
{{ select_columns }},
COUNT(*) - 1 AS duplicates
FROM
{{ table }} AS t
GROUP BY {{ group_by_columns }}, FARM_FINGERPRINT(TO_JSON_STRING(t))
)
GROUP BY
{{ group_by_columns }}
SELECT
{{ select_columns }},
SUM(duplicates) as "validio_metric"
FROM (
SELECT
{{ select_columns }},
COUNT(*) - 1 AS duplicates
FROM
{{ table }}
GROUP BY {{ group_by_columns }}, HASH(*)
)
GROUP BY
{{ group_by_columns }}
SELECT
SUM(duplicates) as `validio_metric`
FROM (
SELECT
COUNT(*) - 1 AS duplicates
FROM
{{ table }} AS t
GROUP BY FARM_FINGERPRINT(TO_JSON_STRING(t))
)
SELECT
SUM(duplicates) as "validio_metric"
FROM (
SELECT
COUNT(*) - 1 AS duplicates
FROM
{{ table }}
GROUP BY HASH(*)
)
Example data:
some_column | some_other_column |
---|---|
a | x |
b | x |
b | x |
b | y |
NULL | y |
NULL | y |
Example result: 2 (the duplicate records are [b, x], [NULL, y])
Referential integrity
Counts foreign_key_table.foreign_key
not matching a primary_key_table.primary_key
SELECT
{{select_columns}},
COUNTIF(`primary_key_table`.`foreign_key` IS NULL) AS `validio_metric`
FROM
{{table}} as `foreign_key_table`
LEFT JOIN
`primary_key_table` ON `primary_key_table`.`foreign_key` = `foreign_key_table`.`foreign_key`
group by
{{ group_by_columns }}
SELECT
{{select_columns}},
COUNT_IF("primary_key_table"."foreign_key" IS NULL) AS "validio_metric"
FROM
{{table}} as "foreign_key_table"
LEFT JOIN
"primary_key_table" ON "primary_key_table"."foreign_key" = "foreign_key_table"."foreign_key"
group by
{{ group_by_columns }}
SELECT
COUNTIF(`primary_key_table`.`foreign_key` IS NULL) AS `validio_metric`
FROM
{{table}} as `foreign_key_table`
LEFT JOIN
`primary_key_table` ON `primary_key_table`.`foreign_key` = `foreign_key_table`.`foreign_key`
SELECT
COUNT_IF("primary_key_table"."foreign_key" IS NULL) AS "validio_metric"
FROM
{{table}} as "foreign_key_table"
LEFT JOIN
"primary_key_table" ON "primary_key_table"."foreign_key" = "foreign_key_table"."foreign_key"
Example data:
foreign_key_table.foreign_key |
---|
1001 |
1002 |
1003 |
1004 |
NULL |
primary_key_table.primary_key |
---|
1001 |
1002 |
Example result: 2 (the missing primary keys are 1003, 1004)
Calculated field
For example, summing up the difference between two fields
SELECT
{{ select_columns }},
SUM(some_column - some_other_column) AS `validio_metric`
FROM
{{ table }}
GROUP BY
{{ group_by_columns }}
SELECT
{{ select_columns }},
SUM(some_column - some_other_column) AS "validio_metric"
FROM
{{ table }}
GROUP BY
{{ group_by_columns }}
SELECT
SUM(some_column - some_other_column) AS `validio_metric`
FROM
{{ table }}
SELECT
SUM(some_column - some_other_column) AS "validio_metric"
FROM
{{ table }}
Example data:
some_column | some_other_column |
---|---|
10 | 5 |
10 | 4 |
Example result: 11 (10-5 + 10-4 = 11)
Identical tables
Counts different records between two tables
WITH a AS (
SELECT
FARM_FINGERPRINT(TO_JSON_STRING(t)) AS h,
COUNT(*) AS c
FROM
{{ table }} AS t
GROUP BY
h
), b AS (
SELECT
FARM_FINGERPRINT(TO_JSON_STRING(t)) AS h,
COUNT(*) AS c
FROM
`some_other_table` AS t
GROUP BY
h
)
SELECT
SUM(ABS(IFNULL(a.c, 0) - IFNULL(b.c, 0))) AS `validio_metric`
FROM a
FULL OUTER JOIN b
ON b.h = a.h
WITH a AS (
SELECT
HASH(*) AS h,
COUNT(*) AS c
FROM
{{ table }} AS t
GROUP BY
h
), b AS (
SELECT
HASH(*) AS h,
COUNT(*) AS c
FROM
some_other_table AS t
GROUP BY
h
)
SELECT
SUM(ABS(IFNULL(a.c, 0) - IFNULL(b.c, 0))) AS "validio_metric"
FROM a
FULL OUTER JOIN b
ON b.h = a.h
Example data:
some_table.some_column | some_table.some_other_column |
---|---|
NULL | NULL |
a | x |
b | y |
some_other_table.some_column | some_other_table.some_other_column |
---|---|
NULL | NULL |
a | x |
a | x |
Example result: 2 (the differing records are [b, y] in some_table and one of the [a, x]-records in some_other_table)
Identical tables - alternative approach
Counts different records between two tables. This is an alternative approach which supports more data sources. However, this method can give result 0 even if tables are different, in the special case of different number of duplicates, but equal number of records
SELECT (
SELECT COUNT(*) FROM (
SELECT * FROM {{ table }}
UNION DISTINCT
SELECT * FROM `some_other_table`
)) -
(
SELECT COUNT(*) FROM (
SELECT * FROM {{ table }}
INTERSECT DISTINCT
SELECT * FROM `some_other_table`
)) + ABS
(
(SELECT COUNT(*) FROM {{ table }}) -
(SELECT COUNT(*) FROM `some_other_table`)
) AS `validio_metric`
SELECT (
SELECT COUNT(*) FROM (
SELECT * FROM {{ table }}
UNION
SELECT * FROM some_other_table
)) -
(
SELECT COUNT(*) FROM (
SELECT * FROM {{ table }}
INTERSECT
SELECT * FROM some_other_table
)) + ABS
(
(SELECT COUNT(*) FROM {{ table }}) -
(SELECT COUNT(*) FROM some_other_table)
) AS "validio_metric"
Example data:
some_table.some_column | some_table.some_other_column |
---|---|
NULL | NULL |
a | x |
b | y |
some_other_table.some_column | some_other_table.some_other_column |
---|---|
NULL | NULL |
a | x |
a | x |
Example result: 1 (this method only identifies [b, y] as differing record, since this method ignores the duplicate [a, x]-record)
Duration between records
For example, calculates maximum duration in seconds, between records' timestamp fields
SELECT
{{ select_columns }},
MAX(duration) AS `validio_metric`
FROM (
SELECT
*,
COALESCE(TIMESTAMP_DIFF(ts, LAG(ts) OVER(PARTITION BY {{ group_by_columns }} ORDER BY ts), SECOND), 0) AS duration
FROM
{{ table }}
)
GROUP BY
{{ group_by_columns }}
SELECT
{{ select_columns }},
MAX(duration) AS "validio_metric"
FROM (
SELECT
*,
COALESCE(TIMESTAMPDIFF(SECOND, LAG(ts) OVER(PARTITION BY {{ group_by_columns }} ORDER BY ts), ts), 0) AS duration
FROM
{{ table }}
)
GROUP BY
{{ group_by_columns }}
SELECT
MAX(duration) AS `validio_metric`
FROM (
SELECT
*,
COALESCE(TIMESTAMP_DIFF(ts, LAG(ts) OVER(ORDER BY ts), SECOND), 0) AS duration
FROM
{{ table }}
)
SELECT
MAX(duration) AS "validio_metric"
FROM (
SELECT
*,
COALESCE(TIMESTAMPDIFF(SECOND, LAG(ts) OVER(PARTITION BY {{ group_by_columns }} ORDER BY ts), ts), 0) AS duration
FROM
{{ table }}
)
Example data:
some_column |
---|
10:00:00 |
10:00:01 |
10:00:10 |
10:00:30 |
Example result: 20 (max duration is between 10:00:10 and 10:00:30)
Fuzzy matching
For example, count suspiciously similar adressess. Note that fuzzy matching uses heavy calculations, which can imply long query times, especially for Global Window or when backfilling data.
-- This variant will compare values in current segment/window with values in all other segments/windows
SELECT
{{ select_columns }},
COUNT(*) as `validio_metric`
FROM {{ table }} AS t1
JOIN (SELECT some_column FROM {{ table }}) AS t2
ON t1.some_column <> t2.some_column
AND EDIT_DISTANCE(t1.some_column, t2.some_column) / GREATEST(LENGTH(t1.some_column), LENGTH(t2.some_column)) < 0.3
GROUP BY
{{ group_by_columns }}
-- This variant will compare values in current segment/window with values in all other segments/windows
SELECT
{{ select_columns }},
COUNT(*) as "validio_metric"
FROM {{ table }} AS t1
JOIN (SELECT some_column FROM {{ table }}) AS t2
ON t1.some_column <> t2.some_column
AND EDITDISTANCE(t1.some_column, t2.some_column) / GREATEST(LENGTH(t1.some_column), LENGTH(t2.some_column)) < 0.3
GROUP BY
{{ group_by_columns }}
SELECT
COUNT(*) as `validio_metric`
FROM {{ table }} AS t1
JOIN {{ table }} AS t2
ON t1.some_column <> t2.some_column
AND EDIT_DISTANCE(t1.some_column, t2.some_column) / GREATEST(LENGTH(t1.some_column), LENGTH(t2.some_column)) < 0.3
SELECT
COUNT(*) as "validio_metric"
FROM {{ table }} AS t1
JOIN {{ table }} AS t2
ON t1.some_column <> t2.some_column
AND EDITDISTANCE(t1.some_column, t2.some_column) / GREATEST(LENGTH(t1.some_column), LENGTH(t2.some_column)) < 0.3
Example data:
some_column |
---|
1789 Maple Avenue, Suite 4C, Springfield, IL 62704 |
1780 maple avenue, 4 C, springfields, IL-62704 |
4567 Oak Lane, Apt. 5B, Madison, WI 53703 |
Example result: 2 (the two adressess starting with 17 are counted as similar, given set sensitivity of 0.3)
Custom SQL filter
Any Custom SQL query can leverage the SQL filter recipes in a filtering clause (such as WHERE, HAVING, QUALIFY, or CASE WHEN).
Below is a simple example of a Custom SQL query, where you can paste a SQL filter recipe into the WHERE clause.
SELECT
{{ select_columns }},
count(*) as `validio_metric`
FROM
{{ table }}
WHERE
<<< paste SQL filter recipe here >>>
GROUP BY
{{ group_by_columns }}
SELECT
{{ select_columns }},
count(*) as "validio_metric"
FROM
{{ table }}
WHERE
<<< paste SQL filter recipe here >>>
GROUP BY
{{ group_by_columns }}
SELECT
count(*) as `validio_metric`
FROM
{{ table }}
WHERE
<<< paste SQL filter recipe here >>>
SELECT
count(*) as "validio_metric"
FROM
{{ table }}
WHERE
<<< paste SQL filter recipe here >>>
Updated 7 months ago