User Guide

Complete the following steps to use Custom SQL Sources with warehouse AI functions for master data management (MDM) and unstructured data validation:

  1. Enable AI (LLM) Functions in Your Warehouse
  2. Create a Custom SQL Source
  3. Create Validators on your Custom SQL Source

This guide contains examples for using AI (LLM) functions in different warehouses. Make sure you adapt the examples to match your data schemas and use cases.


Enable AI Functions in Your Warehouse

BigQuery

BigQuery provides ML.GENERATE_TEXT and AI.GENERATE functions for text generation and analysis.

Prerequisites

  1. Enable the Vertex AI API in your Google Cloud project
  2. Create a BigQuery ML remote model pointing to a Gemini model:
    CREATE OR REPLACE MODEL `project.dataset.gemini_model`
    REMOTE WITH CONNECTION `us.my_connection`
    OPTIONS (ENDPOINT = 'gemini-2.5-flash');
  3. Grant the connection's service account the Vertex AI User role:
    1. Go to your connection details to find the service account
    2. Grant roles/aiplatform.user on your project
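
Once the model and permissions are in place, you can verify the setup with a quick test query. This is a minimal smoke test, assuming the gemini_model created above:

-- ML.GENERATE_TEXT is a table function: pass the model and a query that produces a prompt column
SELECT ml_generate_text_llm_result
FROM ML.GENERATE_TEXT(
  MODEL `project.dataset.gemini_model`,
  (SELECT 'Reply with the single word OK.' AS prompt),
  STRUCT(TRUE AS flatten_json_output)
);

If this returns a row, the connection and model are configured correctly.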

Documentation

Refer to the BigQuery generative AI function documentation (ML.GENERATE_TEXT and AI.GENERATE) for more information.

Databricks

Databricks provides ai_query() and task-specific AI functions.

Prerequisites

  1. Workspace in a supported Model Serving region
  2. AI functions are enabled by default for Foundation Model APIs
  3. For external models: Enable "AI_Query for Custom Models and External Models" in Databricks Previews UI

Available Models

  • Databricks-hosted: databricks-meta-llama-3-3-70b-instruct, databricks-mixtral-8x7b-instruct, and more
  • Task-specific functions: ai_analyze_sentiment(), ai_classify(), ai_extract(), ai_summarize()
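
You can verify access with a one-line test. This is a minimal sketch; the endpoint name is one of the hosted models listed above and must be available in your workspace's region:

-- Smoke test for ai_query(): first argument is the serving endpoint name, second is the prompt
SELECT ai_query(
  'databricks-meta-llama-3-3-70b-instruct',
  'Reply with the single word OK.'
) AS response;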

Documentation

Refer to the Databricks AI Functions documentation (ai_query() and the task-specific functions) for more information.

Snowflake

Snowflake Cortex AISQL provides the general-purpose COMPLETE function as well as task-specific AI functions.

Prerequisites

  1. CORTEX_USER database role (granted to PUBLIC by default)
  2. Snowflake account in a supported region
  3. Grant access (if needed):
    -- Grant to specific role
    GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE my_role;

Available Models

  • Snowflake Arctic, Meta Llama 3 (8B/70B), Mistral Large, Mixtral-8x7b, Reka Core, and more
  • Task-specific: CLASSIFY_TEXT(), EXTRACT_ANSWER(), SENTIMENT(), SUMMARIZE(), TRANSLATE()
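
A quick test confirms that your role has Cortex access. This is a minimal sketch; any supported model listed above works:

-- Smoke test for Cortex COMPLETE: model name first, then the prompt
SELECT SNOWFLAKE.CORTEX.COMPLETE(
  'mistral-large',
  'Reply with the single word OK.'
) AS response;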

Documentation

Refer to the Snowflake Cortex AISQL documentation for more information.


Create a Custom SQL Source

Create your source in Validio using Custom SQL instead of creating a Custom SQL Validator with LLM function calls. This lets you define the LLM output as a field on the source, which makes it easier to apply thresholds and debug incidents.

Additional benefits of using a Custom SQL Source include:

  • Test and debug AI queries directly in the source configuration
  • Iterate on prompt engineering independently of validator configuration
  • Better error messages when using the debug query

Creating the Custom SQL Source

In Validio,

  1. Navigate to the Sources page and click + New source
  2. Select your warehouse type: BigQuery, Databricks, or Snowflake
  3. Under "Specify the data", choose Use Custom SQL
  4. Paste your SQL query with the AI function
  5. Set your Polling schedule
  6. Click Continue to infer the schema
  7. Add Tags, Owner, and create the source

Example of Custom SQL Source Definition

-- Snowflake
SELECT
  *,
  SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large2',
    CONCAT('Is "', country, '" a valid ISO 3166-1 country name? Answer only YES or NO.')
  ) AS country_validation_result
FROM database.schema.table
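
LLM responses sometimes include stray whitespace, punctuation, or casing differences, which would break an exact-match filter downstream. A hardened variant of the same source (same table and model assumed) normalizes the output before the validator sees it:

-- Snowflake
SELECT
  *,
  UPPER(TRIM(SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large2',
    CONCAT('Is "', country, '" a valid ISO 3166-1 country name? Answer only YES or NO.')
  ))) AS country_validation_result -- normalized so a filter on exactly 'NO' matches reliably
FROM database.schema.table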

Create Validators on your Custom SQL Source

Now create validators on your Custom SQL Source to monitor the AI-validated results. The following validator configuration counts the number of invalid country names per day.

Example: Count Invalid Country Names

In Validio,

  1. Navigate to your Custom SQL source
  2. Click + New Validator
  3. Select validator type: VolumeCount
  4. Configure:
    1. Field: country_validation_result (or your AI result column)
    2. Filter: Use String filter → Field country_validation_result is exactly NO
    3. Window: Choose appropriate window (e.g., Daily)
  5. Set Threshold:
    1. Fixed Threshold: > 0 to alert on any invalid entries
    2. Dynamic Threshold: To detect anomalies in patterns over time
  6. Add Name, Description, and click Continue
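
Conceptually, this validator computes an aggregate similar to the following query. This is a sketch only: created_at stands in for whatever timestamp field drives your daily window, and custom_sql_source_output stands in for the result of the Custom SQL Source:

-- Rough SQL equivalent of the Volume validator with a daily window and the NO filter
SELECT
  DATE_TRUNC('DAY', created_at) AS window_start,
  COUNT(*) AS invalid_country_count
FROM custom_sql_source_output
WHERE country_validation_result = 'NO'
GROUP BY 1;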

Additional Examples

Product Category Validation

Custom SQL Source:

-- BigQuery (ML.GENERATE_TEXT is a table function: pass the model and a query that produces a prompt column)
SELECT
  * EXCEPT (ml_generate_text_llm_result, ml_generate_text_rai_result, ml_generate_text_status),
  ml_generate_text_llm_result AS category_is_valid
FROM ML.GENERATE_TEXT(
  MODEL `project.dataset.gemini_model`,
  (
    SELECT
      *,
      CONCAT('Is "', category, '" a valid product category from this list: [Electronics, Clothing, Home & Garden, Sports, Books]? Answer YES or NO.') AS prompt
    FROM `project.dataset.products`
  ),
  STRUCT(TRUE AS flatten_json_output)
)

Validator: Count products with invalid categories using a Volume validator with filter category_is_valid = 'NO'.

Product Description Quality

Custom SQL Source:

-- Snowflake
SELECT 
  *,
  TRY_CAST(TRIM(SNOWFLAKE.CORTEX.COMPLETE(
    'mixtral-8x7b',
    CONCAT('Rate this product description quality from 1-5 (5=excellent): "', description, '". Respond with only the number.')
  )) AS INTEGER) AS description_quality_score -- TRY_CAST returns NULL instead of failing when the model responds with extra text
FROM products

Validator: Count low-quality descriptions using a Volume validator with filter description_quality_score <= 2. Responses that cannot be parsed as a number come through as NULL and are worth monitoring separately.