Best Practices
Model Selection
Choose models based on your validation complexity and cost requirements.
General Guidelines
- Simple validation (binary checks, categorization): Use smaller, faster models
- Complex reasoning (extraction, nuanced analysis): Use larger, more capable models or thinking models
- Start small: Begin with faster models, and upgrade only if accuracy is insufficient
Available Models
Refer to your warehouse documentation for current model options and capabilities:
- BigQuery: Supported Gemini Models
- Databricks: Foundation Model Overview
- Snowflake: Supported LLMs
Prompt Engineering
- Be explicit: Use clear, direct instructions for the LLM to follow
- Request structured output: Ask for "YES/NO", binary numbers (0, 1), or floats. This makes it easy to set up a Validator Threshold, dynamic or fixed.
- Keep prompts concise: Reduce token usage and improve response consistency
The following is an example of a good prompt:
Is "USA" a valid ISO 3166-1 country name? Answer only YES or NO.
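Embedded in a warehouse query, the prompt above might look like the following. This is a hedged Snowflake sketch using `SNOWFLAKE.CORTEX.COMPLETE`; the model name, table, and column are placeholders, and BigQuery's AI functions or Databricks' `ai_query` follow the same pattern:

```sql
-- Hypothetical example: model, table, and column names are placeholders.
SELECT
  country_name,
  SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    CONCAT('Is "', country_name,
           '" a valid ISO 3166-1 country name? Answer only YES or NO.')
  ) AS is_valid_country
FROM my_db.my_schema.customers;
```

Because the answer is constrained to YES/NO, the resulting field is easy to aggregate and validate downstream.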
Performance and Cost
- Batch processing: AI functions process data during source polling based on your schedule. Use tumbling windows where possible to only process the rows in the new window and avoid reprocessing the whole table.
- Choose appropriate windows: Balance how often you want to validate your unstructured data with compute costs
- Monitor token usage: Track costs and model token usage through your warehouse's billing dashboards
- Test on samples: Validate queries on small datasets before full deployment
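The last point can be sketched concretely: before wiring an AI query into a scheduled source, run it over a small sample and inspect the responses by hand. A hedged Snowflake sketch, where the model, table, and column names are placeholders:

```sql
-- Sample run before full deployment: score only 100 rows and inspect the
-- output manually (model, table, and column names are placeholders).
SELECT
  review_id,
  review_text,
  SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    CONCAT('Rate the sentiment of this review from 0.0 to 1.0. ',
           'Answer with a single float only. Review: ', review_text)
  ) AS sentiment_score
FROM reviews
LIMIT 100;
```

A `LIMIT` on a sample run also caps token spend while you iterate on the prompt.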
Development Workflow
- Start with a Custom SQL source: Easier to test and debug AI queries. Alternatively, prototype the AI function call outside Validio, in your IDE or data warehouse GUI.
- Test prompts thoroughly: Verify that LLM responses are consistent and accurate
- Begin with simple validators: Use Volume/Count validators on top of a well-defined, LLM-generated field on your SQL source (for example, a numeric scale)
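Putting the workflow together, a Custom SQL source can expose the LLM answer as a numeric field that a simple threshold validator can consume. A hedged Snowflake sketch; the model, table, and column names are placeholders:

```sql
-- Hypothetical Custom SQL source: cast the LLM answer to a number so a
-- threshold validator can consume it (names are placeholders).
SELECT
  ticket_id,
  created_at,
  TRY_CAST(
    SNOWFLAKE.CORTEX.COMPLETE(
      'mistral-large',
      CONCAT('Does this ticket mention a refund? Answer only 1 or 0. Ticket: ',
             ticket_text)
    ) AS INTEGER
  ) AS mentions_refund
FROM support_tickets;
```

`TRY_CAST` returns NULL instead of failing when the model answers with something other than 0 or 1, so malformed responses surface as nulls you can also validate.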