Data Quality for MDM and Unstructured Data
Validate unstructured text, image, audio, and master data records using warehouse-native LLMs in BigQuery, Snowflake, and Databricks with Validio.
Validio validates unstructured data and Master Data Management (MDM) records by combining Custom SQL Sources with warehouse-native AI functions. Large Language Models (LLMs) score, classify, and standardize text, image, and audio data directly inside your warehouse — and Validio's anomaly detection, thresholds, and notifications apply to the LLM output just as they do to any structured field.
This page explains how the approach works, when to use it, and the kinds of MDM and unstructured-data problems it solves. For implementation, see the User Guide, Best Practices, and Troubleshooting guides.
Why Validate Unstructured Data?
For years, data quality focused on structured data — typed rows and columns in a warehouse. But a significant share of business information lives in unstructured formats: free-text fields, product descriptions, customer feedback, support transcripts, images, and documents. As organizations adopt generative AI, the cost of low-quality unstructured input — and the "AI slop" generated downstream from it — grows.
Assessing unstructured data quality is difficult because of its volume, variety, and lack of predefined formats. Rule-based validators cannot reason about meaning, intent, or natural-language equivalence. LLMs can, and Validio brings that capability into the standard data-quality workflow alongside the validators, thresholds, and filters you already use.
How It Works
Validio's Custom SQL Sources and Custom SQL Validators let you write SQL that calls the LLM functions exposed by your data warehouse:
- Snowflake Cortex —
COMPLETE,CLASSIFY_TEXT,EXTRACT_ANSWER,SENTIMENT,SUMMARIZE,TRANSLATE - BigQuery ML —
ML.GENERATE_TEXT,AI.GENERATE,AI.GENERATE_TABLE - Databricks AI Functions —
ai_query,ai_classify,ai_extract,ai_summarize,ai_analyze_sentiment
The LLM call returns a value — a YES/NO answer, a numeric score, a normalized string, an extracted entity — that becomes a field on the source. Validio treats that field like any other column: apply a validator, set a threshold, add a filter or segmentation, and route incidents through your existing notification rules.
The examples in this guide use BigQuery, Snowflake, and Databricks. Any Validio warehouse connector that exposes comparable AI functions follows the same pattern.
Why This Architecture Matters
Your data never leaves your warehouse. LLM calls run inside BigQuery, Snowflake, or Databricks. No rows are sent to Validio, and no rows are sent to a third-party LLM provider. This is essential for privacy, security, and regulatory compliance.
Bring your own model. Use whichever LLM your warehouse exposes — Snowflake Arctic, Gemini, Llama, Mistral, Mixtral, or any other supported model. Swap models without changing your validator configuration. You are not locked into a single provider's model.
Transparent, predictable cost. All LLM processing is billed by your warehouse provider with full per-query visibility. You control compute via model choice, polling schedule, window size, and pre-filters. There is no hidden Validio-side LLM cost.
Anomaly detection on qualitative signals. Once an LLM call produces a numeric metric — sentiment score, PII flag count, normalization mismatch count — Validio's dynamic thresholds learn its normal pattern and alert on drift, spikes, or trend changes without manual rules. See Integration with Anomaly Detection below.
Use Cases
LLM-powered validators cover a broad range of MDM, text, image, and audio quality checks. The examples below are not exhaustive — any check expressible as a prompt and a measurable output can be implemented.
Master Data Management
Master data records — customers, products, suppliers, locations — are routinely inconsistent: spelling variations ("Corp." vs. "Corporation"), different address formats, mismatched categorization. LLM-powered validators help you:
- Deduplicate fuzzy entities. Identify and consolidate duplicate supplier, customer, or product records despite differences in spelling, punctuation, or word order.
- Standardize attributes. Convert addresses, country names, company names, or product categories into a consistent canonical format.
- Detect non-standard variants. Flag entries that fall outside an expected list — for example, country names appearing as "USA", "United States", "U.S.", or "America" — and track them as a quality metric over time.
- Validate categorization. Confirm that products, tickets, or accounts are assigned to a valid category from a fixed taxonomy.
- Check referential integrity across structured and unstructured fields. Verify that entities mentioned in unstructured text (a product name in a review, a supplier referenced in a support email) exist in a structured reference table.
Unstructured Text Quality
Validate any text column for completeness, correctness, sentiment, or compliance with content standards:
- Sentiment scoring. Score customer reviews, survey responses, or support tickets and alert on shifts in average sentiment.
- Text completeness. Check whether free-text fields provide all required information — for example, that a product description mentions key safety features, warranty terms, or required disclosures.
- Toxicity and abusive-language detection. Flag toxic or inappropriate content in user-generated chat, comments, or reviews.
- Style and tone consistency. Verify that text adheres to a defined style guide, brand voice, or tone.
- PII detection. Identify and flag Personally Identifiable Information in free-text fields for compliance and risk monitoring.
- Topic classification. Categorize support tickets, documents, or products and validate the resulting distribution against expected proportions.
- Language identification. Detect content written in unexpected languages.
- Document length and duplication. Surface unusually short, unusually long, or near-duplicate documents.
Image and Audio Quality
LLMs with vision or audio capabilities — where exposed by your warehouse — enable quality checks beyond text:
- Image outlier detection. Flag visual outliers, such as a cat appearing in a dataset of dog product photos or an empty frame where product imagery is expected.
- Image content validation. Check that product photos include required elements — labels, packaging, branding, or regulatory markings.
- Audio transcript analysis. Score customer-service call transcripts for compliance keywords, sentiment, or specific disclosures.
Example: Detecting Inconsistent Country Names
The following BigQuery example normalizes each country name to the ISO 3166 standard and counts the difference between distinct raw names and distinct normalized names. That difference is a direct measure of fuzzy duplicates — for example, "USA", "United States", and "U.S." collapse to a single normalized value.
WITH unique_countries AS (
SELECT DISTINCT country FROM `project.dataset.customers`
),
normalized_countries AS (
SELECT
country,
ML.GENERATE_TEXT(
MODEL `project.dataset.gemini_model`,
STRUCT(
CONCAT('Normalize this country name to ISO 3166 standard: ', country) AS prompt,
TRUE AS flatten_json_output
)
).ml_generate_text_result AS normalized_country
FROM unique_countries
)
SELECT
COUNT(DISTINCT country) - COUNT(DISTINCT normalized_country) AS fuzzy_duplicate_count
FROM normalized_countries;Use this query as the body of a Custom SQL Source, then add a Volume validator on fuzzy_duplicate_count with a dynamic threshold to alert when inconsistencies trend upward.
Example: PII Detection in Text Fields
The following Snowflake example asks the LLM whether each comment contains Personally Identifiable Information and writes the answer to a pii_flag field on the source.
SELECT
*,
SNOWFLAKE.CORTEX.COMPLETE(
'llama3.1-8b',
CONCAT(
'Does the following text contain any Personally Identifiable Information (PII)? ',
'Return only "Yes" or "No". Text: ',
comment
)
) AS pii_flag
FROM database.schema.support_messages;Add a Volume validator on the source, apply a String filter of pii_flag = 'Yes', and set a fixed threshold of 0 so any new PII occurrence raises an incident. See the User Guide for the full configuration walkthrough.
Integration with Anomaly Detection
LLM-powered validators turn qualitative judgements about unstructured content into numeric metrics that Validio tracks over time. The end-to-end flow is:
- Custom SQL Source. An LLM function call produces a field — a flag, score, count, or normalized value — for every row processed.
- Validator. Aggregate that field into a Validio metric: a count, average, percentage, sum, or distinct count.
- Threshold. Apply a fixed threshold for hard rules ("alert on any PII match") or a dynamic threshold to learn the normal pattern of the metric and alert on unexpected shifts.
- Notifications. Route the resulting incident through your existing notification rules — Slack, email, Microsoft Teams, webhooks, or other configured channels.
Because each metric is tracked as a time series, changes that would be invisible to a single-row check become standard Validio incidents:
- A gradual rise in misspelled supplier names over a week.
- A sudden drop in product-description completeness after a catalog import.
- A spike in toxic-comment rate following a campaign launch.
- A shift in average customer-feedback sentiment month over month.
No manual rule maintenance is required — the dynamic threshold adapts to seasonality and trend, and reverts to alerting only when behaviour breaks from the learned pattern.
Cost and Performance
LLM processing costs are billed by your warehouse and depend on model choice, data volume, and prompt complexity. To keep workloads predictable:
- Use tumbling windows so each polling run processes only new rows, not the full table.
- Pre-filter rows with a
WHEREclause before applying the LLM function. - Start with smaller, faster models and upgrade only when accuracy is insufficient.
- Monitor token usage through your warehouse's billing dashboards.
See Best Practices for detailed guidance on model selection, prompt engineering, and cost management.
Next Steps
- User Guide — step-by-step setup across BigQuery, Snowflake, and Databricks.
- Best Practices — model selection, prompt engineering, and cost management.
- Troubleshooting — common AI function errors, slow queries, and inconsistent LLM responses.
Updated 2 days ago