User Guide
Complete the following steps to use Custom SQL Sources with warehouse AI functions for MDM and unstructured data validation:
- Enable AI (LLM) Functions in Your Warehouse
- Create a Custom SQL Source
- Create Validators on your Custom SQL Source
This guide contains examples of using AI (LLM) functions in different warehouses. Make sure you adapt the examples to match your data schemas and use cases.
Enable AI Functions in Your Warehouse
BigQuery
BigQuery provides the ML.GENERATE_TEXT and AI.GENERATE functions for text generation and analysis.
Prerequisites
- Enable the Vertex AI API in your Google Cloud project
- Create a BigQuery ML remote model pointing to a Gemini model:

```sql
CREATE OR REPLACE MODEL `project.dataset.gemini_model`
  REMOTE WITH CONNECTION `us.my_connection`
  OPTIONS (ENDPOINT = 'gemini-2.5-flash');
```

- Grant the connection's service account the Vertex AI User role:
  - Go to your connection details to find the service account
  - Grant roles/aiplatform.user on your project
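Once the model and permissions are in place, you can sanity-check the setup with a one-row query (a sketch; `project.dataset.gemini_model` is the remote model created above):

```sql
-- Smoke test: ML.GENERATE_TEXT is table-valued, so the prompt is passed via a subquery
SELECT ml_generate_text_llm_result
FROM ML.GENERATE_TEXT(
  MODEL `project.dataset.gemini_model`,
  (SELECT 'Reply with the single word OK.' AS prompt),
  STRUCT(TRUE AS flatten_json_output)
);
```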
Documentation
Refer to the BigQuery generative AI documentation for more information.
Databricks
Databricks provides ai_query() and task-specific AI functions.
Prerequisites
- Workspace in a supported Model Serving region
- AI functions are enabled by default for Foundation Model APIs
- For external models: Enable "AI_Query for Custom Models and External Models" in Databricks Previews UI
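A quick way to confirm AI functions are available in your workspace is a standalone ai_query() call (a sketch; the endpoint name is one of the Databricks-hosted Foundation Model endpoints):

```sql
-- Returns a short completion if the model serving endpoint is reachable
SELECT ai_query(
  'databricks-meta-llama-3-3-70b-instruct',
  'Reply with the single word OK.'
) AS smoke_test;
```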
Available Models
- Databricks-hosted: databricks-meta-llama-3-3-70b-instruct, databricks-mixtral-8x7b-instruct, and more
- Task-specific functions: ai_analyze_sentiment(), ai_classify(), ai_extract(), ai_summarize()
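For simple validation cases, the task-specific functions can replace a hand-written prompt. For example, ai_classify() maps a value onto a fixed label set (a sketch; the table and column names are placeholders, not part of this guide's schema):

```sql
SELECT
  *,
  ai_classify(feedback_text, ARRAY('positive', 'negative', 'neutral')) AS feedback_label
FROM my_catalog.my_schema.customer_feedback
```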
Documentation
Refer to the Databricks AI Functions documentation for more information.
Snowflake
Snowflake Cortex provides COMPLETE and task-specific AI functions via Snowflake Cortex AISQL.
Prerequisites
- CORTEX_USER database role (granted to PUBLIC by default)
- Snowflake account in a supported region
- Grant access (if needed):

```sql
-- Grant to a specific role
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE my_role;
```
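You can verify Cortex access with a standalone call before building a source (a sketch):

```sql
SELECT SNOWFLAKE.CORTEX.COMPLETE('mistral-large2', 'Reply with the single word OK.');
```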
Available Models
- Snowflake Arctic, Meta Llama 3 (8B/70B), Mistral Large, Mixtral-8x7b, Reka Core, and more
- Task-specific: CLASSIFY_TEXT(), EXTRACT_ANSWER(), SENTIMENT(), SUMMARIZE(), TRANSLATE()
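As with Databricks, the task-specific functions cover common checks without prompt engineering (a sketch; the reviews table and review_text column are placeholders):

```sql
SELECT
  *,
  SNOWFLAKE.CORTEX.SENTIMENT(review_text) AS review_sentiment,  -- score between -1 and 1
  SNOWFLAKE.CORTEX.CLASSIFY_TEXT(review_text, ['complaint', 'praise', 'question']) AS review_category
FROM reviews
```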
Documentation
Refer to the Snowflake Cortex AISQL documentation for more information.
Create a Custom SQL Source
Create your source in Validio using Custom SQL instead of creating a Custom SQL Validator with LLM function calls. This allows you to define the LLM output as a field on the source, which will make it easier to apply thresholds and debug incidents.
Additional benefits of using a Custom SQL Source include:
- Test and debug AI queries directly in the source configuration
- Iterate on prompt engineering independently of Validator configuration
- Better error messages when using the debug query
Creating the Custom SQL Source
In Validio,
- Navigate to the Sources page and click + New source
- Select your warehouse type: BigQuery, Databricks, or Snowflake
- Under "Specify the data", choose Use Custom SQL
- Paste your SQL query with the AI function
- Set your Polling schedule
- Click Continue to infer the schema
- Add Tags, Owner, and create the source
Example of Custom SQL Source Definition
```sql
SELECT
  *,
  SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large2',
    CONCAT('Is "', country, '" a valid ISO 3166-1 country name? Answer only YES or NO.')
  ) AS country_validation_result
FROM database.schema.table
```
Create Validators on your Custom SQL Source
Now create validators on your Custom SQL Source to monitor the AI-validated results. The following validator configuration will count the number of irregular country names per day.
Example: Count Invalid Country Names
In Validio,
- Navigate to your Custom SQL source
- Click + New Validator
- Select validator type: Volume → Count
- Configure:
  - Field: country_validation_result (or your AI result column)
  - Filter: Use String filter → Field country_validation_result is exactly NO
  - Window: Choose an appropriate window (e.g., Daily)
- Set Threshold:
  - Fixed Threshold: > 0 to alert on any irregular entries
  - Dynamic Threshold: to detect anomalies in patterns over time
- Add Name, Description, and click Continue
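Before wiring up the validator, it can help to run the equivalent aggregation directly in the warehouse to sanity-check the AI output. The following sketch mirrors the Volume → Count configuration above; it assumes the Snowflake source from this guide plus a created_at timestamp column, which is not part of the original example:

```sql
-- Count invalid country names per day, mirroring the Volume → Count validator
SELECT
  DATE(created_at) AS day,
  COUNT(*) AS invalid_country_count
FROM (
  SELECT
    created_at,
    SNOWFLAKE.CORTEX.COMPLETE(
      'mistral-large2',
      CONCAT('Is "', country, '" a valid ISO 3166-1 country name? Answer only YES or NO.')
    ) AS country_validation_result
  FROM database.schema.table
)
WHERE TRIM(country_validation_result) = 'NO'
GROUP BY day
ORDER BY day;
```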
Additional Examples
Product Category Validation
Custom SQL Source:
```sql
-- BigQuery
-- ML.GENERATE_TEXT is a table-valued function, so the prompt is built in a subquery
SELECT
  * EXCEPT (prompt, ml_generate_text_llm_result, ml_generate_text_rai_result, ml_generate_text_status),
  ml_generate_text_llm_result AS category_is_valid
FROM ML.GENERATE_TEXT(
  MODEL `project.dataset.gemini_model`,
  (
    SELECT
      *,
      CONCAT('Is "', category, '" a valid product category from this list: [Electronics, Clothing, Home & Garden, Sports, Books]? Answer YES or NO.') AS prompt
    FROM `project.dataset.products`
  ),
  STRUCT(TRUE AS flatten_json_output)
)
```
Validator: Count products with invalid categories using a Volume validator with filter category_is_valid = 'NO'.
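The same check can be written for Databricks with ai_query() (a sketch; the endpoint name and the products table are assumptions, not part of the original example):

```sql
-- Databricks
SELECT
  *,
  ai_query(
    'databricks-meta-llama-3-3-70b-instruct',
    CONCAT('Is "', category, '" a valid product category from this list: [Electronics, Clothing, Home & Garden, Sports, Books]? Answer YES or NO.')
  ) AS category_is_valid
FROM products
```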
Product Description Quality
Custom SQL Source:
```sql
-- Snowflake
-- TRY_CAST returns NULL instead of failing if the model replies with non-numeric text
SELECT
  *,
  TRY_CAST(SNOWFLAKE.CORTEX.COMPLETE(
    'mixtral-8x7b',
    CONCAT('Rate this product description quality from 1-5 (5=excellent): "', description, '". Respond with only the number.')
  ) AS INTEGER) AS description_quality_score
FROM products
```
Validator: Count low-quality descriptions using a Volume validator with filter description_quality_score <= 2.