For Data Warehouses and Query Engines, Validio automatically starts adding Lineage when a Credential has been set up. The process runs in the background and usually takes less than 30 minutes. It has two major steps:
The first step is to read information about datasets and fields from the Information Schema, or equivalent, to create the catalog assets.
The second step is to derive information about how datasets and fields relate to each other, to create the edges.
Information about how datasets and fields relate to each other exists in historical SQL queries, collected from the Information Schema, or equivalent.
- Historical SQL queries are collected and parsed daily. This means it can take up to 24 hours from when new Lineage is created until it is visible in Validio.
- SQL queries older than 30 days are continuously disregarded to ensure Lineage is up-to-date.
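The 30-day window can be illustrated with a minimal sketch. This is not Validio's implementation: the function name `within_retention` and the `(executed_at, sql)` tuple shape are assumptions made for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: mirrors the 30-day window described above.
RETENTION = timedelta(days=30)

def within_retention(queries, now):
    """Keep only (executed_at, sql) pairs that are at most 30 days old."""
    cutoff = now - RETENTION
    return [(ts, sql) for ts, sql in queries if ts >= cutoff]

# Hypothetical query history: one recent query and one that has aged out.
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
history = [
    (datetime(2024, 6, 29, tzinfo=timezone.utc), "INSERT INTO t1 (c1) SELECT c2 FROM t2"),
    (datetime(2024, 5, 1, tzinfo=timezone.utc), "INSERT INTO t1 (c1) SELECT c9 FROM t3"),
]
recent = within_retention(history, now)
```

Under this sketch, an edge that is only supported by the older query would no longer be derived once that query falls outside the window.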
Parsing the SQL query
INSERT INTO t1 (c1) SELECT c2 + c3 FROM t2 WHERE c4 > 10;
gives the following lineage information:
- Table t2 impacts table t1. This also means table t2 is upstream of table t1.
- Columns c2 and c3 directly impact the data in column c1.
- Column c4 impacts the whole of table t1.
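The three bullets above can be reproduced with a small sketch. It uses a regular expression that only understands this one query shape, which is an assumption made to keep the example short; it does not represent how Validio parses SQL, and real queries need a proper SQL parser. The function name `simple_lineage` and the result keys are hypothetical.

```python
import re

def simple_lineage(sql):
    """Derive lineage from a query shaped like
    INSERT INTO <target> (<col>) SELECT <expr> FROM <source> [WHERE <predicate>].
    Illustrative only: anything more complex is rejected."""
    pattern = re.compile(
        r"INSERT\s+INTO\s+(?P<target>\w+)\s*\((?P<target_col>\w+)\)\s*"
        r"SELECT\s+(?P<expr>.+?)\s+FROM\s+(?P<source>\w+)"
        r"(?:\s+WHERE\s+(?P<where>.+?))?\s*;?\s*$",
        re.IGNORECASE | re.DOTALL,
    )
    m = pattern.match(sql.strip())
    if m is None:
        raise ValueError("query shape not supported by this sketch")
    ident = re.compile(r"[A-Za-z_]\w*")  # identifiers, not numeric literals
    return {
        # Source table is upstream of the target table.
        "table_edges": [(m["source"], m["target"])],
        # Columns in the SELECT expression feed the target column directly.
        "direct_columns": {m["target_col"]: ident.findall(m["expr"])},
        # Columns in the WHERE clause influence which rows reach the target.
        "indirect_columns": ident.findall(m["where"] or ""),
    }

lineage = simple_lineage("INSERT INTO t1 (c1) SELECT c2 + c3 FROM t2 WHERE c4 > 10;")
```

Here `table_edges` corresponds to the first bullet, `direct_columns` to the second, and `indirect_columns` to the third.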
In addition to Lineage from query logs, Validio can also read Lineage from a dbt Manifest JSON file.
Currently, the dbt Manifest JSON file must be provided through the Validio CLI.
Using the CLI, run the following command for more information:
validio dbt upload --help
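To give a sense of the lineage information a dbt manifest carries, the following sketch reads upstream and downstream pairs from the `nodes` and `depends_on` keys of a manifest. How Validio itself processes the file is not described here, so treat this as an assumption. The function name `edges_from_manifest` and the sample node IDs are hypothetical.

```python
import json

def edges_from_manifest(manifest):
    """Return (upstream, downstream) pairs from a parsed dbt manifest dict."""
    edges = []
    for node_id, node in manifest.get("nodes", {}).items():
        # Each node lists the unique IDs of the nodes it depends on.
        for parent in node.get("depends_on", {}).get("nodes", []):
            edges.append((parent, node_id))
    return edges

# Tiny hand-written stand-in for a manifest.json; real files are much larger.
sample = json.loads("""
{"nodes": {
  "model.shop.orders": {"depends_on": {"nodes": ["source.shop.raw.orders"]}},
  "model.shop.revenue": {"depends_on": {"nodes": ["model.shop.orders"]}}
}}
""")
edges = edges_from_manifest(sample)
```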
Lineage from historical SQL queries is not guaranteed to be correct. Some Lineage information cannot be derived from historical queries, and some Lineage information derived from historical queries may no longer be current. Lineage from a dbt Manifest JSON file, on the other hand, is complete and current, to the extent of the dbt implementation.
When a specific dataset exists both in query logs and in the dbt Manifest JSON file, Validio merges the two Lineages into a single unified Lineage.
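One way to picture such a merge is as a union of edges that remembers where each edge came from. The exact merge rules Validio applies are not documented here, so this sketch is an assumption. The function name `merge_lineage` and the source labels are hypothetical.

```python
def merge_lineage(query_log_edges, dbt_edges):
    """Union two edge collections, recording which source(s) reported each edge."""
    merged = {}
    for source, edges in (("query_logs", query_log_edges), ("dbt", dbt_edges)):
        for edge in edges:
            merged.setdefault(edge, set()).add(source)
    return merged

# Hypothetical (upstream, downstream) edges from each source.
query_log_edges = {("t2", "t1"), ("t3", "t1")}
dbt_edges = {("t2", "t1"), ("t1", "t5")}
merged = merge_lineage(query_log_edges, dbt_edges)
```

In this sketch the edge from `t2` to `t1` appears once in the merged result and is attributed to both sources, while the other two edges are each attributed to a single source.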
Automatic Lineage is currently not supported for Object Storages and Data Streams. Instead, refer to Manual Lineage.