Automatic Lineage details

Details on how Validio automatically infers Lineage from data sources.

Lineage from Data Warehouses and Query Engines

For Data Warehouses and Query Engines, Validio automatically starts adding Lineage once a Credential has been set up. The process runs in the background and usually takes less than 30 minutes. It has two major steps:

1. Generating catalog assets

The first step reads information about datasets and fields from the Information Schema, or its equivalent, to create the catalog assets.

2. Generating edges

The second step derives information about how datasets and fields relate to each other, to create the edges.

Historical SQL queries

Information about how datasets and fields relate to each other exists in historical SQL queries, collected from the Information Schema, or equivalent.

  • Historical SQL queries are collected and parsed daily. This means it can take up to 24 hours from when new Lineage is created until it is visible in Validio.
  • SQL queries older than 30 days are continuously disregarded to ensure Lineage is up-to-date.
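
The 30-day window can be pictured with a minimal Python sketch. This is only an illustration of the rule described above, not Validio's implementation; the function name `queries_in_window`, the sample log, and the `(timestamp, sql)` tuple shape are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Retention window taken from the text above: queries older than 30 days are disregarded.
RETENTION = timedelta(days=30)


def queries_in_window(queries, now):
    """Return the (executed_at, sql) pairs that are at most 30 days old at `now`."""
    cutoff = now - RETENTION
    return [(executed_at, sql) for executed_at, sql in queries if executed_at >= cutoff]


# Hypothetical query log: one recent query and one that has aged out.
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
log = [
    (datetime(2024, 6, 29, tzinfo=timezone.utc), "INSERT INTO t1 (c1) SELECT c2 FROM t2"),
    (datetime(2024, 5, 1, tzinfo=timezone.utc), "INSERT INTO t_old (c1) SELECT c2 FROM t2"),
]
recent = queries_in_window(log, now)
```

Here only the query from 29 June remains in `recent`, so any Lineage that was visible solely through the older query would no longer be derived.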


Example of Lineage parsed from a query log

Parsing the SQL query INSERT INTO t1 (c1) SELECT c2 + c3 FROM t2 WHERE c4 > 10; gives the following lineage information:

  1. Table t2 impacts table t1. This also means table t2 is upstream of table t1.
  2. Columns c2 and c3 directly impact the data in column c1.
  3. Column c4 impacts the whole of table t1, because it filters which rows are inserted.
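
One way to picture the three results above is as a small set of edges. The Python sketch below is illustrative only: the `Edge` class, the `kind` labels, and the `upstream_of` helper are hypothetical names chosen for this example and do not reflect how Validio stores Lineage internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    upstream: str
    downstream: str
    kind: str  # illustrative labels: "table", "direct", or "indirect"


# Edges corresponding to:
#   INSERT INTO t1 (c1) SELECT c2 + c3 FROM t2 WHERE c4 > 10;
query_lineage = {
    Edge("t2", "t1", "table"),          # 1. table t2 is upstream of table t1
    Edge("t2.c2", "t1.c1", "direct"),   # 2. c2 feeds the value of c1
    Edge("t2.c3", "t1.c1", "direct"),   # 2. c3 feeds the value of c1
    Edge("t2.c4", "t1", "indirect"),    # 3. c4 filters the rows written to t1
}


def upstream_of(target, edges):
    """Return the sorted names of everything with an edge pointing at `target`."""
    return sorted(e.upstream for e in edges if e.downstream == target)
```

With this representation, `upstream_of("t1.c1", query_lineage)` returns the two columns that feed c1, while `upstream_of("t1", query_lineage)` returns table t2 together with the filter column c4.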

dbt Manifest JSON file

In addition to Lineage from query logs, Validio can also read lineage from a dbt Manifest JSON file.
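
For context, a dbt manifest records model dependencies in a top-level `parent_map` object, which maps each node's unique ID to the IDs of its direct parents. The Python sketch below shows how dataset-level edges could be read from that object. It is an assumption-laden illustration, not a description of how Validio processes the file; the function name `edges_from_manifest` and the sample project and model names are hypothetical.

```python
def edges_from_manifest(manifest):
    """Return (upstream, downstream) pairs from a parsed dbt manifest dict."""
    edges = set()
    for child, parents in manifest.get("parent_map", {}).items():
        for parent in parents:
            edges.add((parent, child))
    return edges


# Hypothetical fragment of a manifest. A real file would be loaded with
# json.load() from the manifest.json that dbt writes to its target directory.
sample_manifest = {
    "parent_map": {
        "source.my_project.raw.orders": [],
        "model.my_project.stg_orders": ["source.my_project.raw.orders"],
        "model.my_project.orders": ["model.my_project.stg_orders"],
    }
}
dbt_edges = edges_from_manifest(sample_manifest)
```

Each pair in `dbt_edges` reads as "the first node is upstream of the second", which is the same direction as the table-level Lineage derived from query logs.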


How to provide a dbt Manifest JSON file

You currently have to use the Validio CLI to provide a dbt Manifest JSON file.

Using the CLI, run the following command for more information: validio dbt upload --help

Why read Lineage from different sources?

Lineage from historical SQL queries is not guaranteed to be correct. Some Lineage information may not be derivable from historical queries, and some of the Lineage information that is derived may no longer be current. Lineage from a dbt Manifest JSON file, on the other hand, is complete and current, to the extent of the dbt implementation.


Merging Lineages

In cases where a specific dataset exists both in query logs and in the dbt Manifest JSON file, Validio merges the two Lineages into one unified Lineage.
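
Conceptually, such a merge can be thought of as a union of edges in which an edge reported by both sources appears only once. The Python sketch below illustrates that idea under the assumption that edges are plain `(upstream, downstream)` pairs; the exact merge rules Validio applies are not described here, and `merge_lineage` is a hypothetical helper.

```python
def merge_lineage(query_log_edges, dbt_edges):
    """Union two collections of (upstream, downstream) pairs, dropping duplicates."""
    return set(query_log_edges) | set(dbt_edges)


# Hypothetical edges for the same dataset t1, seen from both sources.
from_query_logs = {("t2", "t1"), ("t3", "t1")}
from_dbt = {("t2", "t1"), ("t4", "t1")}

merged = merge_lineage(from_query_logs, from_dbt)
```

In this example `merged` contains three edges: the shared edge from t2 once, plus the edges that only one of the two sources knew about.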

Lineage from Object Storages and Data Streams

Automatic Lineage is currently not supported for Object Storages and Data Streams. Instead, refer to Manual Lineage.