Concepts and Terminology
This page gives an overview of the key concepts used across the Validio platform.
Key Concepts
Concept | Description |
---|---|
Credentials | Used to access and authenticate to data sources. One set of credentials can be used to set up multiple sources. |
Source | Connects Validio to one source system, such as a Data Warehouse, a Data Stream, or an Object Storage. A Source is defined as one table in a Data Warehouse, one topic in a Data Stream, or one specified set of schema-conformant files in an Object Storage. Segmentations, Windows, and Validators are defined for each Source. |
Validator | The component responsible for monitoring and validating the data in your Sources. A Validator specifies which metric to calculate over a subset, or Window, of the data in a Source, which fields to monitor, and which thresholds are considered acceptable. Each Source can have one or more Validators. |
Window | A Window can be defined as a time interval, a fixed-size batch, or a file. There is also a Global Window, which considers all data in the Source. Each Source must have at least one Window, and several Windows can be created for each Source. |
Segmentation | Allows validation per segment, also referred to as a group. You can think of this as a GROUP BY clause in SQL. Each Source has at least one segment. The default Segmentation is called Unsegmented. |
Notification rule | Can be used to send incidents to specified channels, such as Slack. Each notification rule can include incidents from multiple Sources. |
Notification channel | Each notification rule has a notification channel attached. The same channel can be used for multiple notification rules. |
Lineage | Describes how data flows through a data stack, from its origin to its final use. For some source types, Lineage is created automatically based on a Credential or a dbt Manifest file. For others, Lineage can be created manually based on a Source. |
Data Quality Score
The Data Quality score is a measure of data quality and is calculated for each Segment, Validator, and Source. On the Overview page, the quality score is presented as a percentage that shows the overall data quality across all of your Sources, averaged over the past month.
The quality score is based on the number of incidents, weighted by their severity (high = 3, medium = 2, low = 1), divided by the sum of the weighted incidents and the number of non-anomalous results:
quality = 100 * (1 - (highSeverityCount * 3 + mediumSeverityCount * 2 + lowSeverityCount) / (highSeverityCount * 3 + mediumSeverityCount * 2 + lowSeverityCount + nonAnomaliesCount))
A quality score of 100% represents the case where no incidents have occurred during the lookback period. A quality score of 0% represents the case when all monitored metrics are causing incidents with high severity. For intermediate cases, the quality score is calculated as a weighted average of the incident severities.
Because of the severity weighting, high-severity incidents will reduce the quality score more than low-severity ones. Also, the score is normalized by the number of windows, so you will see the score change as you change the timeframe of the incidents graph.
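The formula above can be sketched in a few lines of Python. This is an illustrative sketch only, not Validio's implementation: the function name `quality_score` and its parameter names are hypothetical, and returning 100 when nothing has been evaluated is an assumption made here to avoid a division by zero.

```python
def quality_score(high: int, medium: int, low: int, non_anomalies: int) -> float:
    """Return the data quality score as a percentage between 0 and 100."""
    # Severity weights taken from the documented formula: high=3, medium=2, low=1.
    weighted_incidents = high * 3 + medium * 2 + low
    total = weighted_incidents + non_anomalies
    if total == 0:
        # Assumption (not from the documentation): no evaluated results means no incidents.
        return 100.0
    return 100 * (1 - weighted_incidents / total)


# No incidents during the lookback period.
print(round(quality_score(0, 0, 0, 50), 1))   # 100.0
# 1 high, 2 medium, 3 low incidents and 90 non-anomalous results.
print(round(quality_score(1, 2, 3, 90), 1))   # 90.0
# Only high-severity incidents and no non-anomalous results.
print(round(quality_score(5, 0, 0, 0), 1))    # 0.0
```

Swapping a low-severity incident for a high-severity one raises the weighted incident count by 2, which is why high-severity incidents pull the score down faster.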
Validio assigns the following color scheme when the data quality is displayed in graphs and tables:
Color | Data Quality Score |
---|---|
Green | Score equal to or above 90% |
Yellow | Score of at least 60% and below 90% |
Red | Score below 60% |
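As a rough sketch, the color bands in the table can be expressed as a small lookup. The function name `quality_color` is hypothetical. A score of exactly 60% is treated as Yellow because the table defines Red as strictly below 60%.

```python
def quality_color(score: float) -> str:
    """Map a data quality score (percentage) to its display color."""
    if score >= 90:
        return "Green"
    if score >= 60:
        # Exactly 60 falls here, since Red is defined as strictly below 60.
        return "Yellow"
    return "Red"


print(quality_color(95))    # Green
print(quality_color(90))    # Green
print(quality_color(75))    # Yellow
print(quality_color(59.9))  # Red
```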
Coverage Score
The Coverage score is a measure of how well you are monitoring your critical fields with validators. Validio recommends having both Freshness and Row Count validators (either unsegmented or segmented) to check for data uptime, so together they account for 50% of the score (25% each). The remaining 50% is calculated from the proportion of fields that have a validator. In the formula, has_freshness and has_row_count are 1 if the corresponding validator exists and 0 otherwise:
coverage = (has_freshness + has_row_count) * 0.25 + (number of fields with validator / total fields in schema) * 0.5
Validio assigns the following color scheme to represent the coverage score:
Color | Coverage Score |
---|---|
Green | Score equal to or above 50% |
Yellow | Score of at least 30% and below 50% |
Red | Score below 30% |
The following table illustrates examples of coverage score calculation and color assignment.
Example Scenario | Score | Color |
---|---|---|
Freshness and Row Count validators | 50% (25% each) | Green |
Just Freshness (or just Row Count) validator | 25% | Red |
Freshness, Row Count, and 5 of 16 different fields of the schema have a validator | 66% | Green |
3 Numeric validators out of 11 fields | 14% | Red |
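The example rows above can be reproduced with a short Python sketch. This is illustrative only, not Validio's implementation: the function names `coverage_score` and `coverage_color` and their parameters are hypothetical, and the result is multiplied by 100 so that it reads as a percentage like the table.

```python
def coverage_score(has_freshness: bool, has_row_count: bool,
                   fields_with_validator: int, total_fields: int) -> float:
    """Return the coverage score as a percentage between 0 and 100."""
    if total_fields <= 0:
        raise ValueError("total_fields must be positive")
    # Freshness and Row Count validators contribute 25% each.
    uptime_part = (int(has_freshness) + int(has_row_count)) * 0.25
    # The share of fields with a validator contributes up to 50%.
    field_part = (fields_with_validator / total_fields) * 0.5
    return 100 * (uptime_part + field_part)


def coverage_color(score: float) -> str:
    """Map a coverage score (percentage) to its display color."""
    if score >= 50:
        return "Green"
    if score >= 30:
        return "Yellow"
    return "Red"


# Freshness, Row Count, and 5 of 16 fields with a validator.
score = coverage_score(True, True, 5, 16)
print(round(score), coverage_color(score))   # 66 Green

# 3 fields with a validator out of 11, no Freshness or Row Count validator.
score = coverage_score(False, False, 3, 11)
print(round(score), coverage_color(score))   # 14 Red
```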
Reads, Writes, and Utilization
Validio monitors interaction with all configured sources and tracks the usage and performance as Reads, Writes, and Utilization. You can view these metrics on the Source Overview tab.
Metric | Description |
---|---|
Reads | The number of times the source is accessed, for example the number of SELECT queries, in the last 30 days. (Table views will also count as reads.) |
Writes | The number of times the source is modified in the last 30 days. This includes the following queries: CREATE, UPDATE, DELETE, PUT, INSERT, MERGE, TRUNCATE, and so on. |
Utilization | Calculated from the ratio of Reads to Writes. |