Best Practices
Model Selection
Choose models based on your validation complexity and cost requirements.
General Guidelines
- Simple validation (binary checks, categorization): Use smaller, faster models
- Complex reasoning (extraction, nuanced analysis): Use larger, more capable models or thinking models
- Start small: Begin with faster models, and upgrade only if accuracy is insufficient
Available Models
Refer to your warehouse documentation for current model options and capabilities:
- BigQuery: Supported Gemini Models
- Databricks: Foundation Model Overview
- Snowflake: Supported LLMs
Prompt Engineering
- Be explicit: Use clear, direct instructions for the LLM to follow
- Request structured output: Ask for "YES/NO", binary numbers (0, 1), or floats. This makes it easy to set up a Validator Threshold, dynamic or fixed.
- Keep prompts concise: Reduce token usage and improve response consistency
The following is an example of a good prompt:
Is "USA" a valid ISO 3166-1 country name? Answer only YES or NO.
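Embedded in a warehouse query, the prompt above might look like the following. This is a hedged Snowflake sketch using `SNOWFLAKE.CORTEX.COMPLETE`; the model name, table, and column are placeholders, and BigQuery's AI functions or Databricks' `ai_query` follow the same pattern:

```sql
-- Hypothetical example: model, table, and column names are placeholders.
SELECT
  country_name,
  SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    CONCAT('Is "', country_name,
           '" a valid ISO 3166-1 country name? Answer only YES or NO.')
  ) AS is_valid_country
FROM my_db.my_schema.customers;
```

Because the answer is constrained to YES/NO, the resulting field is easy to aggregate and validate downstream.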
Performance and Cost
- Batch processing: AI functions process data during source polling based on your schedule. Use tumbling windows where possible to only process the rows in the new window and avoid reprocessing the whole table.
- Choose appropriate windows: Balance how often you want to validate your unstructured data with compute costs
- Monitor token usage: Track costs and model token usage through your warehouse's billing dashboards
- Test on samples: Validate queries on small datasets before full deployment
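The last point can be sketched concretely: before wiring an AI query into a scheduled source, run it over a small sample and inspect the responses by hand. A hedged Snowflake sketch, where the model, table, and column names are placeholders:

```sql
-- Sample run before full deployment: score only 100 rows and inspect the
-- output manually (model, table, and column names are placeholders).
SELECT
  review_id,
  review_text,
  SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    CONCAT('Rate the sentiment of this review from 0.0 to 1.0. ',
           'Answer with a single float only. Review: ', review_text)
  ) AS sentiment_score
FROM reviews
LIMIT 100;
```

A `LIMIT` on a sample run also caps token spend while you iterate on the prompt.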
Development Workflow
- Start with a Custom SQL source: Easier to test and debug AI queries. Alternatively, prototype the AI function call outside Validio, in your IDE or data warehouse GUI.
- Test prompts thoroughly: Verify that LLM responses are consistent and accurate
- Begin with simple validators: Use Volume/Count validators on top of a well-defined, LLM-generated field on your SQL source (for example, a numeric scale)
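Putting the workflow together, a Custom SQL source can expose the LLM answer as a numeric field that a simple threshold validator can consume. A hedged Snowflake sketch; the model, table, and column names are placeholders:

```sql
-- Hypothetical Custom SQL source: cast the LLM answer to a number so a
-- threshold validator can consume it (names are placeholders).
SELECT
  ticket_id,
  created_at,
  TRY_CAST(
    SNOWFLAKE.CORTEX.COMPLETE(
      'mistral-large',
      CONCAT('Does this ticket mention a refund? Answer only 1 or 0. Ticket: ',
             ticket_text)
    ) AS INTEGER
  ) AS mentions_refund
FROM support_tickets;
```

`TRY_CAST` returns NULL instead of failing when the model answers with something other than 0 or 1, so malformed responses surface as nulls you can also validate.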