Data Warehouse Sources

Validio supports many of the major data warehouses that modern data teams work with today. The following sections cover general considerations for creating data warehouse sources in Validio.

Cursor Field

Validio reads data incrementally. When you configure a data warehouse source, you must specify a cursor field and a lookback time.

The cursor field must be an incremental field that provides a timestamp representing when data was updated or added. The cursor field should not include NULL values. Any records where the cursor field is NULL are ignored.

👍

Recommendation

If possible, use a cursor field that represents when the data was updated, instead of when the data was added. This ensures that all records are used for validation, even when the data arrives late.

The lookback time specifies how far back in time Validio starts reading from your source. If you choose a lookback time that reaches too far into the past, backfilling the data can lead to longer query times and increased costs.
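To make the interaction between the cursor field and the lookback time concrete, the following is a minimal sketch of an incremental read. It is an illustration only and not Validio's implementation: the function `read_increment`, the field name `updated_at`, and the rule that the first read starts at `now - lookback` are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone


def read_increment(records, last_cursor, now, lookback):
    """Return the records to read on this poll and the new cursor position.

    Hypothetical helper: `updated_at` stands in for the configured cursor field.
    """
    # With no stored cursor (first read), start at now - lookback.
    start = last_cursor if last_cursor is not None else now - lookback
    selected = [
        r for r in records
        # Records whose cursor value is NULL (None) are ignored.
        if r["updated_at"] is not None and r["updated_at"] > start
    ]
    # Advance the cursor to the newest value seen; keep it unchanged if nothing was read.
    new_cursor = max((r["updated_at"] for r in selected), default=start)
    return selected, new_cursor


def utc(day, hour=0):
    return datetime(2024, 1, day, hour, tzinfo=timezone.utc)


now = utc(10)
rows = [
    {"id": 1, "updated_at": utc(1)},   # older than the 7-day lookback
    {"id": 2, "updated_at": utc(8)},
    {"id": 3, "updated_at": None},     # NULL cursor value, ignored
    {"id": 4, "updated_at": utc(9)},
]

# First read: only records newer than now - 7 days are picked up.
first, cursor = read_increment(rows, None, now, timedelta(days=7))

# A record arrives later; the second read only picks up what is newer than the cursor.
rows.append({"id": 5, "updated_at": utc(9, 12)})
second, cursor2 = read_increment(rows, cursor, now, timedelta(days=7))
```

In this sketch a longer lookback only affects the first read, which is where the extra query time and cost of a backfill would show up.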

Window Types

When defining Windows on your Data Warehouse sources, Validio recommends using either a Tumbling window or a Global window. Fixed batch windows are supported, but may not perform as expected. For more information, see About Windows.
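As a rough illustration of the two recommended window types, the sketch below groups timestamps the way these terms are generally used in stream processing: a tumbling window splits time into fixed-size, non-overlapping intervals, while a global window treats all data as a single group. The function names are hypothetical and the code is an assumption about the general concepts, not a description of how Validio computes windows; see About Windows for the authoritative definitions.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumbling_window_start(ts, size):
    """Floor a timestamp to the start of its fixed-size, non-overlapping window."""
    return ts - (ts - EPOCH) % size


def count_per_tumbling_window(timestamps, size):
    """Count records per tumbling window, keyed by window start."""
    counts = defaultdict(int)
    for ts in timestamps:
        counts[tumbling_window_start(ts, size)] += 1
    return dict(counts)


def count_global(timestamps):
    """A global window places every record in one group."""
    return len(timestamps)


timestamps = [
    datetime(2024, 1, 1, 0, 10, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 0, 50, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 1, 5, tzinfo=timezone.utc),
]

hourly = count_per_tumbling_window(timestamps, timedelta(hours=1))
total = count_global(timestamps)
```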

Cost and Performance

Validio applies advanced optimizations when it reads and processes data for validations. For example, Validio processes data incrementally, leverages pushdown, consolidates related queries, and optimizes queries for each query engine.

Typically, our customers see a very low impact (about 1%) on the performance and cost of their data warehouse when they use Validio.

When you validate data in a Data Warehouse, we recommend that you:

  • Apply optimizations, such as indexing, partitioning, and clustering, on the specified cursor field.
  • Be aware that querying views or external tables can consume significantly more resources than querying regular tables.
  • Consider the size of the fields you validate. Validating fields that hold large amounts of data, such as text blobs, is more resource intensive than validating fields that hold less data.
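The first recommendation above can be illustrated with a small, self-contained experiment. The sketch uses SQLite from the Python standard library purely as a stand-in, because it runs anywhere; real data warehouses expose different mechanisms (for example partitioning or clustering instead of indexes), and the table `events`, the column `updated_at`, and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, updated_at TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO events (updated_at, payload) VALUES (?, ?)",
    [(f"2024-01-{day:02d}T00:00:00Z", "x") for day in range(1, 31)],
)

# An incremental read filters on the cursor field.
query = "SELECT id FROM events WHERE updated_at >= ?"
params = ("2024-01-25T00:00:00Z",)


def query_plan(connection, sql, args):
    """Return SQLite's plan description for a query (the last column is the detail text)."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, args).fetchall()
    return " | ".join(row[-1] for row in rows)


# Without an index, the filter on the cursor field scans the whole table.
plan_before = query_plan(conn, query, params)

# Index the cursor field so the range filter can seek directly to recent rows.
conn.execute("CREATE INDEX idx_events_updated_at ON events (updated_at)")
plan_after = query_plan(conn, query, params)

matching_ids = [row[0] for row in conn.execute(query, params)]
```

Printing `plan_before` and `plan_after` shows the planner switching from a table scan to a search on the index, which is the effect the recommendation aims for on the cursor field.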