Databricks

Configure a Databricks credential and source

Prerequisites for Databricks

Validio requires certain credentials and permissions to validate data from Databricks.

❗️

Credential Permission Requirements: Validio Credentials require VIEWER access rights when connecting to sources to read and access data. Admins must ensure that they do not provide EDITOR access rights to their credentials.

Create and Configure a Service Principal in Databricks

  1. Create a Service Principal account for Validio: In the Databricks Account Console, go to Admin Settings > Identity and Access > Manage Service Principals > Add service principal to create the account. Save the service principal's application ID. This is the same value as the client ID you will use to configure the Databricks credential in Validio.
  2. Grant the following permissions to the Validio service principal:
    • Enable the "Databricks SQL access entitlement". For more information, refer to "Manage Entitlements" in Databricks documentation.
    • Proceed with the method that corresponds to your metastore, Unity Catalog or Hive Metastore:

Unity Catalog:

GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG <catalog-name> TO <validio_service_principal>;

Hive Metastore:

GRANT USAGE, READ_METADATA, SELECT ON CATALOG <catalog-name> TO <user_or_validio_service_principal>;
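
If you manage several catalogs, it can help to generate these statements rather than typing them by hand. The following is a minimal sketch, not part of Validio or Databricks: the helper names `quote_ident` and `build_grant`, the catalog name, and the application ID are hypothetical. It only builds the statement text using the privileges listed above; you would still run the result in a Databricks SQL editor.

```python
# Privileges copied from the two GRANT statements above.
PRIVILEGES = {
    "unity_catalog": ("USE CATALOG", "USE SCHEMA", "SELECT"),
    "hive_metastore": ("USAGE", "READ_METADATA", "SELECT"),
}


def quote_ident(name: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def build_grant(metastore: str, catalog: str, principal: str) -> str:
    """Return the GRANT statement text for the given metastore type."""
    if metastore not in PRIVILEGES:
        raise ValueError(f"unknown metastore type: {metastore!r}")
    privileges = ", ".join(PRIVILEGES[metastore])
    return (
        f"GRANT {privileges} ON CATALOG {quote_ident(catalog)} "
        f"TO {quote_ident(principal)};"
    )


# Hypothetical catalog name and service principal application ID.
print(build_grant("unity_catalog", "analytics", "11111111-2222-3333-4444-555555555555"))
```

Backtick quoting is used here because application IDs contain hyphens; adjust the quoting if your principal or catalog names follow a different convention.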

📘

Data Lineage

Validio Data Lineage is only supported for data from Unity Catalog.

Authenticate Access to Databricks

You can authenticate Validio's access to Databricks with either OAuth (recommended) or a personal access token. For more information, refer to Authentication and access control in Databricks documentation.

OAuth Authentication

Create an OAuth secret to generate OAuth access tokens for authentication:

  1. Go to the service principal's details page > Secrets tab.
  2. Under OAuth secrets, click Generate secret.
  3. Set the secret's lifetime in days (maximum 730 days).
  4. Copy the displayed secret and client ID. The secret is shown only once. You will use these values to configure the authentication type in your Validio credential. (The client ID is the same as the service principal's application ID.)
  5. Click Done.

For more information, refer to Authorize service principal access to Databricks with OAuth in Databricks documentation.
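
Validio performs the token exchange for you once the client ID and secret are saved in the credential, so the following is optional. If you want to check the secret yourself first, this minimal sketch builds a client-credentials token request using only the Python standard library. It assumes the workspace-level token endpoint `/oidc/v1/token` and the `all-apis` scope as described in the Databricks OAuth documentation; confirm both against your workspace. The function name `build_token_request` and the sample values are hypothetical.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(host: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) an OAuth client-credentials token request."""
    # The client ID and secret are sent as HTTP Basic credentials.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    ).encode()
    return urllib.request.Request(
        url=f"https://{host}/oidc/v1/token",  # assumed endpoint path
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


# Hypothetical host and credentials; nothing is sent over the network here.
request = build_token_request("123456789.1.gcp.databricks.com", "my-client-id", "my-secret")
print(request.get_method(), request.full_url)
```

To actually send the request, pass it to `urllib.request.urlopen`; a successful response is expected to be a JSON document containing an access token.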

Personal Access Token Authentication

📘

Personal Access Token Support

Databricks recommends using OAuth instead of PATs because OAuth provides stronger security. For more information, refer to Authenticate with Databricks personal access tokens (legacy) in Databricks documentation.

To authenticate access to Databricks with a personal access token:

  1. Enable access token permissions for the service principal, following these instructions in Databricks documentation.
  2. Generate an access token for the service principal, following these instructions in Databricks documentation.

Create a Databricks SQL Warehouse for Validio

  1. Create a dedicated SQL Warehouse for Validio to query data for validation. We recommend starting with a 2X-Small Warehouse and setting Auto Stop to minimum. If your validation needs exceed the capabilities of this SQL Warehouse, you can increase the size.
  2. Give the user or Validio service principal the Can Use permission on the SQL Warehouse.
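
To confirm that the service principal can actually use the warehouse, you could run a trivial query through the Databricks SQL Statement Execution API. The sketch below only builds the request and assumes the `/api/2.0/sql/statements/` endpoint with a bearer token (either an OAuth access token or a personal access token). The function name `build_smoke_test_request`, the warehouse ID, and the token value are hypothetical, and this check is not something Validio requires.

```python
import json
import urllib.request


def build_smoke_test_request(host: str, warehouse_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request that runs SELECT 1 on the warehouse."""
    payload = {
        "warehouse_id": warehouse_id,
        "statement": "SELECT 1",
        "wait_timeout": "30s",  # wait up to 30 seconds for the result
    }
    return urllib.request.Request(
        url=f"https://{host}/api/2.0/sql/statements/",  # assumed endpoint path
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values; nothing is sent over the network here.
request = build_smoke_test_request(
    "123456789.1.gcp.databricks.com", "c0aa12c3456c789", "example-token"
)
print(request.get_method(), request.full_url)
```

If the request is sent and the principal lacks the Can Use permission, the API is expected to return an authorization error instead of a result, which points at the permission step above.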

Add a Databricks Credential

To add a credential for Databricks:

  1. In Validio, navigate to Credentials and click + New Credential.
  2. Under Namespace, select a namespace where the resources will be created.
  3. For Credential Type, select Databricks Credential.
  4. Fill in the Configuration parameter fields. Refer to the Databricks Credential Parameters table.
  5. (Optional) Click Test credential to validate that Validio can successfully access the Databricks account. If validation fails, check that you provided the correct parameter values.
  6. (Optional) Check Use for catalog and schema checks to automatically discover assets from this credential and add them to the Catalog page.
  7. Click Create credential.

Validio will automatically start fetching data and you will be able to view Databricks assets and their relationships in the Catalog (if selected) and Lineage pages.

Once the credential is created, you can add a source to monitor Databricks data.

Databricks Credential Parameters

| Field | Description |
| --- | --- |
| Name | A unique identifier, used for creating sources. For example, service_acount_product_staging |
| Host | Databricks account host. For example, 123456789.1.gcp.databricks.com |
| Port | SQL warehouse port. For example, 443 |
| HTTP Path | SQL warehouse URL. For example, /sql/1.0/warehouses/c0aa12c3456c789 |
| Authentication type | Choose either OAuth (recommended) or Personal access token to connect Validio to the Databricks service account. |
| OAuth | Enter or paste the OAuth client secret and OAuth client ID created in Databricks. |
| Access token | Enter or paste the personal access token created in Databricks. |
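
Most failed credential tests come from small formatting slips in these fields, such as pasting a full URL into Host. As an illustration only, the following sketch checks the values locally before you enter them. The class `DatabricksCredentialParams`, the function `validate`, and the HTTP path pattern are assumptions derived from the examples in the table; they are not Validio's API, and other valid path formats may exist.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed shape, based on the example /sql/1.0/warehouses/c0aa12c3456c789.
HTTP_PATH_PATTERN = re.compile(r"^/sql/1\.0/warehouses/[0-9A-Za-z]+$")


@dataclass(frozen=True)
class DatabricksCredentialParams:
    name: str
    host: str
    port: int
    http_path: str


def validate(params: DatabricksCredentialParams) -> List[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    if not params.name.strip():
        problems.append("name must not be empty")
    if "/" in params.host:
        # Catches both a scheme such as https:// and a trailing path.
        problems.append("host should be a bare hostname without scheme or path")
    if not 1 <= params.port <= 65535:
        problems.append("port must be between 1 and 65535")
    if HTTP_PATH_PATTERN.match(params.http_path) is None:
        problems.append("http_path does not look like /sql/1.0/warehouses/<id>")
    return problems


# Example values taken from the table above.
params = DatabricksCredentialParams(
    name="service_acount_product_staging",
    host="123456789.1.gcp.databricks.com",
    port=443,
    http_path="/sql/1.0/warehouses/c0aa12c3456c789",
)
print(validate(params))
```

An empty list only means the values are well formed; the Test credential button remains the way to confirm that Validio can actually reach Databricks.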

Add a Databricks Source

To add a source for Databricks:

  1. In Validio, navigate to Sources and click + New source.
  2. Under Source type, select Databricks.
  3. Under Config:
    1. Select the valid Credential or create a new credential to authenticate your connection to the data warehouse.
    2. You have two options for specifying the dataset and tables to monitor:
      1. Select an existing table: Enter the Database, Schema, and Table to specify where the data comes from. Selecting more than one table will create a new source for each table.
      2. Use Custom SQL: Write a valid SQL query to specify the tables to monitor.
    3. Set the Polling schedule, which is how frequently the validators on the source will check for changes.
  4. Under Schema, click Continue to automatically infer the schema fields from the tables you selected. If you select many tables, this operation can take a few minutes to complete.
  5. Under Source details:
    1. Add Tags to help group related sources or to use for routing notifications.
    2. Add an Owner who will be the contact for incident notifications.
    3. Assign a Priority, which indicates the importance of incidents detected on this source.
  6. Click Continue to create the source. Source names are generated automatically and are displayed when source creation completes. If more than five sources are created, you will see the names of the first five and a count of the remaining sources.