User Guide

Complete the following steps to use Custom SQL Sources with warehouse AI functions for master data management (MDM) and unstructured data validation:

  1. Enable AI (LLM) Functions in Your Warehouse
  2. Create a Custom SQL Source
  3. Create Validators on your Custom SQL Source

This guide contains examples for using AI (LLM) functions in different warehouses. Make sure you adapt the examples to match your data schemas and use cases.


Enable AI Functions in Your Warehouse

BigQuery

BigQuery provides ML.GENERATE_TEXT and AI.GENERATE functions for text generation and analysis.

Prerequisites

  1. Enable the Vertex AI API in your Google Cloud project
  2. Create a BigQuery ML remote model pointing to a Gemini model:
    CREATE OR REPLACE MODEL `project.dataset.gemini_model`
    REMOTE WITH CONNECTION `us.my_connection`
    OPTIONS (ENDPOINT = 'gemini-2.5-flash');
  3. Grant the connection's service account the Vertex AI User role:
    1. Go to your connection details to find the service account
    2. Grant roles/aiplatform.user on your project
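
Once the model and permissions are in place, you can verify the setup with a quick test query. This is a minimal smoke test, assuming the gemini_model created above:

-- ML.GENERATE_TEXT is a table function: pass the model and a query that produces a prompt column
SELECT ml_generate_text_llm_result
FROM ML.GENERATE_TEXT(
  MODEL `project.dataset.gemini_model`,
  (SELECT 'Reply with the single word OK.' AS prompt),
  STRUCT(TRUE AS flatten_json_output)
);

If this returns a row, the connection and model are configured correctly.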

Documentation

Refer to the BigQuery generative AI function documentation (ML.GENERATE_TEXT and AI.GENERATE) for more information.

Databricks

Databricks provides ai_query() and task-specific AI functions.

Prerequisites

  1. Workspace in a supported Model Serving region
  2. AI functions are enabled by default for Foundation Model APIs
  3. For external models: Enable "AI_Query for Custom Models and External Models" in Databricks Previews UI

Available Models

  • Databricks-hosted: databricks-meta-llama-3-3-70b-instruct, databricks-mixtral-8x7b-instruct, and more
  • Task-specific functions: ai_analyze_sentiment(), ai_classify(), ai_extract(), ai_summarize()
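
You can verify access with a one-line test. This is a minimal sketch; the endpoint name is one of the hosted models listed above and must be available in your workspace's region:

-- Smoke test for ai_query(): first argument is the serving endpoint name, second is the prompt
SELECT ai_query(
  'databricks-meta-llama-3-3-70b-instruct',
  'Reply with the single word OK.'
) AS response;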

Documentation

Refer to the Databricks AI Functions documentation (ai_query() and the task-specific functions) for more information.

Snowflake

Snowflake Cortex AISQL provides the general-purpose COMPLETE function as well as task-specific AI functions.

Prerequisites

  1. CORTEX_USER database role (granted to PUBLIC by default)
  2. Snowflake account in a supported region
  3. Grant access (if needed):
    -- Grant to specific role
    GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE my_role;

Available Models

  • Snowflake Arctic, Meta Llama 3 (8B/70B), Mistral Large, Mixtral-8x7b, Reka Core, and more
  • Task-specific: CLASSIFY_TEXT(), EXTRACT_ANSWER(), SENTIMENT(), SUMMARIZE(), TRANSLATE()
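
A quick test confirms that your role has Cortex access. This is a minimal sketch; any supported model listed above works:

-- Smoke test for Cortex COMPLETE: model name first, then the prompt
SELECT SNOWFLAKE.CORTEX.COMPLETE(
  'mistral-large',
  'Reply with the single word OK.'
) AS response;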

Documentation

Refer to the Snowflake Cortex AISQL documentation for more information.


Create a Custom SQL Source

Create your source in Validio using Custom SQL instead of creating a Custom SQL Validator with LLM function calls. This lets you define the LLM output as a field on the source, which makes it easier to apply thresholds and debug incidents.

Additional benefits of using a Custom SQL Source include:

  • Test and debug AI queries directly in the source configuration
  • Iterate on prompt engineering independently of validator configuration
  • Better error messages when using the debug query

Creating the Custom SQL Source

In Validio,

  1. Navigate to the Sources page and click + New source
  2. Select your warehouse type: BigQuery, Databricks, or Snowflake
  3. Under "Specify the data", choose Use Custom SQL
  4. Paste your SQL query with the AI function
  5. Set your Polling schedule
  6. Click Continue to infer the schema
  7. Add Tags, Owner, and create the source

Example of Custom SQL Source Definition

-- Snowflake
SELECT
  *,
  SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large2',
    CONCAT('Is "', country, '" a valid ISO 3166-1 country name? Answer only YES or NO.')
  ) AS country_validation_result
FROM database.schema.table
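
LLM responses sometimes include stray whitespace, punctuation, or casing differences, which would break an exact-match filter downstream. A hardened variant of the same source (same table and model assumed) normalizes the output before the validator sees it:

-- Snowflake
SELECT
  *,
  UPPER(TRIM(SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large2',
    CONCAT('Is "', country, '" a valid ISO 3166-1 country name? Answer only YES or NO.')
  ))) AS country_validation_result -- normalized so a filter on exactly 'NO' matches reliably
FROM database.schema.table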

Create Validators on your Custom SQL Source

Now create validators on your Custom SQL Source to monitor the AI-validated results. The following validator configuration counts the number of invalid country names per day.

Example: Count Invalid Country Names

In Validio,

  1. Navigate to your Custom SQL source
  2. Click + New Validator
  3. Select validator type: VolumeCount
  4. Configure:
    1. Field: country_validation_result (or your AI result column)
    2. Filter: Use String filter → Field country_validation_result is exactly NO
    3. Window: Choose appropriate window (e.g., Daily)
  5. Set Threshold:
    1. Fixed Threshold: > 0 to alert on any invalid entries
    2. Dynamic Threshold: To detect anomalies in patterns over time
  6. Add Name, Description, and click Continue
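
Conceptually, this validator computes an aggregate similar to the following query. This is a sketch only: created_at stands in for whatever timestamp field drives your daily window, and custom_sql_source_output stands in for the result of the Custom SQL Source:

-- Rough SQL equivalent of the Volume validator with a daily window and the NO filter
SELECT
  DATE_TRUNC('DAY', created_at) AS window_start,
  COUNT(*) AS invalid_country_count
FROM custom_sql_source_output
WHERE country_validation_result = 'NO'
GROUP BY 1;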

Additional Examples

Product Category Validation

Custom SQL Source:

-- BigQuery (ML.GENERATE_TEXT is a table function: pass the model and a query that produces a prompt column)
SELECT
  * EXCEPT (ml_generate_text_llm_result, ml_generate_text_rai_result, ml_generate_text_status),
  ml_generate_text_llm_result AS category_is_valid
FROM ML.GENERATE_TEXT(
  MODEL `project.dataset.gemini_model`,
  (
    SELECT
      *,
      CONCAT('Is "', category, '" a valid product category from this list: [Electronics, Clothing, Home & Garden, Sports, Books]? Answer YES or NO.') AS prompt
    FROM `project.dataset.products`
  ),
  STRUCT(TRUE AS flatten_json_output)
)

Validator: Count products with invalid categories using a Volume validator with filter category_is_valid = 'NO'.

Product Description Quality

Custom SQL Source:

-- Snowflake
SELECT 
  *,
  TRY_CAST(TRIM(SNOWFLAKE.CORTEX.COMPLETE(
    'mixtral-8x7b',
    CONCAT('Rate this product description quality from 1-5 (5=excellent): "', description, '". Respond with only the number.')
  )) AS INTEGER) AS description_quality_score -- TRY_CAST returns NULL instead of failing when the model responds with extra text
FROM products

Validator: Count low-quality descriptions using a Volume validator with filter description_quality_score <= 2. Responses that cannot be parsed as a number come through as NULL and are worth monitoring separately.