AI Data Transparency

Understand what data Validio AI features send to a Large Language Model, including schema metadata, profiling aggregates, and customer controls.

Validio AI features transmit schema and configuration metadata — not actual data values — to a Large Language Model (LLM). This data transparency summary describes exactly what data is sent, what is never sent, and the controls available to customers, including those operating under a Bring-Your-Own-Model (BYOM) arrangement where the LLM endpoint is the customer's own.


Guiding Principle

Validio's AI features help users describe, recommend, classify, and match data assets. They reason about the structure and configuration of your data — not its contents. Validio sends schema and configuration information to the LLM; it does not send the actual rows or values stored in your tables.

This guiding principle applies to the AI features described in this document. The Validio MCP Server operates under a different trust model and is addressed in its own section below.


What We Send to the LLM

Data sent to the LLM falls into three categories.

1. Always Sent When an AI Feature Runs

  • Table and column names, data types, and the warehouse-native type
  • Column descriptions (see Caveats Worth Knowing below)
  • The source/warehouse type and SQL dialect
  • Validator, window, segmentation, and filter configuration (names, types, thresholds, which columns they reference)
  • The natural-language prompt or guidance the user typed into the feature
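As a rough illustration, the metadata in this category could be assembled into a payload shaped something like the following. This is a hypothetical sketch: the field names and structure are invented for illustration and are not Validio's actual prompt format.

```python
# Hypothetical shape of the metadata-only context described above.
# All field names are illustrative, not Validio's real schema.
payload = {
    "source": {"type": "snowflake", "sql_dialect": "snowflake"},
    "table": "orders",
    "columns": [
        {"name": "order_id", "type": "INTEGER", "native_type": "NUMBER(38,0)",
         "description": "Primary key for orders"},
        {"name": "amount", "type": "FLOAT", "native_type": "FLOAT",
         "description": "Order total in USD"},
    ],
    "validators": [
        {"name": "amount_not_null", "type": "null_count",
         "threshold": {"operator": "eq", "value": 0}, "columns": ["amount"]},
    ],
    "user_prompt": "Alert me when order amounts look wrong",
}

# Note what is absent: no rows, no column values, no credentials.
assert "rows" not in payload
```

Everything in this sketch is structure and configuration; the only free-form text is what the user typed and the descriptions attached to the columns.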

2. Sometimes Sent, Only in Aggregated Form

When data profiling is enabled for a source, the Recommend Validators feature may transmit the following per-column summary statistics:

  • Percentage of null values, count of distinct values
  • True / false percentages (for boolean columns)
  • Minimum, maximum, and mean of numeric columns — these are real values derived from your data, but they are single summary numbers, not samples of rows
  • Earliest and latest timestamps (for date/time columns) — likewise, real values in aggregated form
  • Minimum, maximum, and mean length of text columns — lengths only, never the text values themselves

These statistics are computed by a profiling job that runs inside your warehouse; only the resulting summary numbers are sent onward to the LLM.
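The distinction between aggregates and values can be made concrete. The sketch below (illustrative only, not Validio's implementation) profiles a text column the way the list above describes: the strings are reduced to lengths and counts, and only those numbers survive.

```python
from statistics import mean

# Illustrative sketch, not Validio's implementation: the profiling job
# reduces a text column to summary numbers; the strings themselves
# never appear in the result.
def profile_text_column(values):
    non_null = [v for v in values if v is not None]
    lengths = [len(v) for v in non_null]
    return {
        "null_pct": 100 * (len(values) - len(non_null)) / len(values),
        "distinct_count": len(set(non_null)),
        # Lengths only; the underlying text is discarded.
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
    }

stats = profile_text_column(
    ["alice@example.com", "bob@example.com", None, "c@d.io"]
)
# One null out of four values, three distinct strings,
# and lengths ranging from 6 to 17.
```

Every entry in `stats` is a single number; there is no way to recover an email address from it.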

3. Sent Only in Specific Circumstances

  • SQL text — when the Fix SQL Validator feature repairs a broken validator query, the original query, the compiled query that ran against your warehouse, and the error message returned by the warehouse are sent to the LLM. When Validio generates a description for a lineage edge that was inferred from a SQL query (e.g., CREATE TABLE ... AS SELECT ...), that SQL is sent.
  • Warehouse error messages — as above. Warehouses such as Snowflake and BigQuery can include literal values from an offending row inside their error text (for example, an integer-parse error that quotes the bad value). Validio has no control over the content of these error strings.
  • Business glossary definitions — the full business definition of a glossary term, not just its name, is sent by features that suggest glossary coverage, and by lineage matching when the matched field has a glossary term assigned.
  • Field tag key/value pairs — sent by the lineage matching feature only.
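The warehouse-error caveat above is worth seeing in miniature. Many runtimes behave the way warehouses do here: when a value fails to parse, the error message quotes the offending value verbatim. This toy Python example demonstrates the same leakage pattern (it stands in for warehouse behavior; it is not how any particular warehouse formats its errors).

```python
# The same leakage pattern in miniature: the runtime quotes the
# offending value inside the error text it produces.
try:
    int("abc-123")
except ValueError as err:
    message = str(err)

# The literal bad value now lives inside the error string,
# outside the control of whoever forwards that string.
assert "abc-123" in message
```

This is why the Fix SQL feature can transmit a literal row value without Validio ever selecting one: the value arrives pre-embedded in the text the warehouse returns.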

What We Never Send to the LLM

  • Individual rows, record samples, or extracts of table data
  • Full query result sets
  • The actual values in any column (for text columns, not even individual strings are sent — only aggregated lengths)
  • Enumerations of the distinct values a column contains
  • Full value distributions or histograms (quartiles, medians, and percentile breakdowns are computed but never sent)
  • Credentials, API keys, or any secret material

Customer Controls

  • Bring Your Own Model (BYOM) — Customers configure their own LLM credentials per workspace. The ability to configure credentials is disabled by default and may be enabled only by a workspace administrator. Validio's backend uses the configured credentials to call the customer-selected provider (Anthropic, AWS Bedrock, or an OpenAI-compatible endpoint); it does not retain prompt content or model completions beyond the duration of the request, and does not route traffic through any intermediary AI service.
  • Profiling opt-in — The aggregated statistics described under Sometimes Sent, Only in Aggregated Form above are sent only when profiling is enabled for a source. Disabling profiling removes this category entirely.
  • Heuristic-only mode — The recommendation and lineage matching features each include a heuristic-only path that performs the work locally, without any LLM call. This path is taken when no LLM credential is configured for the feature.
  • Deterministic operation — All AI features run with the model's determinism setting at its strictest, so identical inputs produce identical outputs. This supports audit and reproducibility.
  • Telemetry opt-out — Product analytics about AI usage (described below) can be disabled per installation.
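The heuristic-only control amounts to a credential-gated branch: when no LLM credential is configured for a feature, the local path runs instead. A minimal sketch of that shape, with invented names (this is not Validio's API):

```python
# Hedged sketch of the credential-gated fallback described under
# "Heuristic-only mode". All names here are illustrative.
def recommend_validators(columns, llm_credential=None):
    if llm_credential is None:
        # Local path: no network call, no data leaves the platform.
        return heuristic_recommendations(columns)
    return llm_recommendations(columns, llm_credential)

def heuristic_recommendations(columns):
    # Example heuristic: suggest a null-count validator for every
    # column declared non-nullable.
    return [f"null_count:{c['name']}" for c in columns if not c["nullable"]]

def llm_recommendations(columns, credential):
    raise NotImplementedError("would call the configured BYOM provider")

recs = recommend_validators(
    [{"name": "id", "nullable": False}, {"name": "note", "nullable": True}]
)
```

The practical consequence is that an installation with no LLM credentials configured still gets recommendations and lineage matching, just from the local heuristics alone.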

Caveats Worth Knowing

  • Column descriptions are passed through verbatim — They may come from multiple sources — your dbt models, warehouse DDL comments, auto-imports from the information schema, or edits made inside Validio. If a description contains sensitive text, it will be sent to the LLM as-is. Treat column descriptions, glossary term definitions, and field tags as "visible to the model".
  • User guidance is sent verbatim — When a feature accepts a free-form natural-language prompt from a user, that prompt is forwarded to the model unchanged.
  • Warehouse error messages are outside Validio's control — When a warehouse driver returns an error, the error string produced by the warehouse vendor may include literal values from the offending row. This text is forwarded to the LLM as part of the Fix SQL feature so the model can diagnose the cause.
  • Features can chain — Generating recommendations may trigger SQL generation, which may in turn trigger the Fix SQL feature. Each step follows the same rules described above.

Validio MCP Server — A Different Trust Model

The guarantees above describe Validio's own AI features, where Validio assembles the prompt sent to the LLM and can therefore make categorical commitments about what does and does not leave the platform.

The Validio MCP Server is different. It exposes a Model Context Protocol endpoint that a customer's AI client (for example Claude Desktop, Cursor, or a custom agent) can connect to. That client — not Validio — decides which tools to call, how often, and what context to attach. Tools exposed by the MCP Server may return data that includes, among other things, query results, row previews, and profiling output.

As a result:

  • The "what we never send" commitments above do not extend to MCP tool outputs. Tools that retrieve data return what they are designed to return; the customer's client receives those outputs directly.
  • Validio does not choose the client, the model, or the model's provider. Outputs from MCP tool calls flow into the client, whose LLM and data-handling policies are outside Validio's control.
  • Validio has no visibility into the prompt the client assembles or the model's completion. The client-side interaction between the customer's AI client and its backing model happens outside Validio's network. The client's provider — not Validio — determines retention, training use, and logging for anything the client sends upstream.

Customers adopting the MCP Server should treat it as direct data access for AI purposes, and apply the same diligence they would to any other data access integration: choose clients and models whose data-handling meets their obligations, scope credentials appropriately, and review which tools are exposed in their installation. The MCP endpoint is disabled by default and may be enabled only by a workspace administrator.


Analytics Telemetry

When analytics is enabled for an installation, Validio emits metadata about each AI request to its analytics endpoint. The following is sent:

  • Which AI feature ran, model name, request duration, HTTP status
  • Token counts and cost estimates
  • The workspace and user identifier associated with the request

Prompt contents and model completions are explicitly excluded from telemetry and are never transmitted to the analytics endpoint.
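A telemetry event in this scheme might look like the sketch below. The field names and values are assumed for illustration, not Validio's actual analytics schema; the point is that every field is request metadata, and no prompt or completion field exists at all.

```python
# Illustrative telemetry event shape (field names assumed, not
# Validio's actual analytics schema): metadata about the request only.
event = {
    "feature": "fix_sql_validator",
    "model": "example-model-name",  # assumed value for illustration
    "duration_ms": 1840,
    "http_status": 200,
    "input_tokens": 912,
    "output_tokens": 260,
    "cost_estimate_usd": 0.0061,
    "workspace_id": "ws_123",
    "user_id": "u_456",
}

# Prompt contents and completions never appear in the event.
assert "prompt" not in event and "completion" not in event
```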


At-a-Glance Summary

AI feature                   | Uses an LLM             | Sends profiling aggregates   | Sends SQL text                       | Sends warehouse error text
-----------------------------|-------------------------|------------------------------|--------------------------------------|---------------------------
Generate validator from text | Yes                     | No                           | No                                   | No
Review generated validator   | Yes                     | No                           | Yes (the query under review)         | No
Fix a broken validator       | Yes                     | No                           | Yes (template + compiled)            | Yes
Generate filter from text    | Yes                     | No                           | No                                   | No
Recommend validators         | Yes (or heuristic-only) | Yes (when profiling enabled) | Indirectly, via chained SQL features | Indirectly
Match lineage between tables | Yes (or heuristic-only) | No                           | No                                   | No
Suggest glossary coverage    | Yes                     | No                           | No                                   | No
Generate descriptions        | Yes                     | No                           | Only for SQL-derived lineage edges   | No