Databricks
Configure a Databricks credential and source
Prerequisites for Databricks
Certain credentials and permissions are required for Validio to validate data from Databricks.
Credential Permission Requirements: Validio Credentials require VIEWER access rights when connecting to sources to read and access data. Admins must ensure that they do not provide EDITOR access rights to their credentials.
Create and Configure a Service Principal in Databricks
- Create a Service Principal account for Validio: In the Databricks Account Console, go to Admin Settings > Identity and Access > Manage Service Principals > Add service principal to create the account. Save the service principal's application ID--this is the same as the client ID you will use to configure the Databricks credential in Validio.
- Grant the following permissions to the Validio service principal:
- Enable the "Databricks SQL access" entitlement. For more information, refer to "Manage Entitlements" in Databricks documentation.
- Proceed with the method that corresponds to your metastore, Unity Catalog or Hive Metastore:
Unity Catalog:
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG <catalog-name> TO <validio_service_principal>;
Hive Metastore:
GRANT USAGE, READ_METADATA, SELECT ON CATALOG <catalog-name> TO <user_or_validio_service_principal>;
Data Lineage: Validio Data Lineage is only supported for data from Unity Catalog.
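As an illustration, the two GRANT statements above can be generated from a small helper when you script the setup. This is a minimal sketch, not part of Validio or Databricks: the function name `build_grant_statement` and the `"unity"`/`"hive"` keys are hypothetical, and the backtick quoting assumes your catalog or principal name may contain characters such as hyphens.

```python
def build_grant_statement(metastore: str, catalog: str, principal: str) -> str:
    """Return the GRANT statement from this guide for the given metastore type.

    `metastore` is a hypothetical selector: "unity" or "hive".
    """
    privileges = {
        "unity": "USE CATALOG, USE SCHEMA, SELECT",
        "hive": "USAGE, READ_METADATA, SELECT",
    }
    key = metastore.strip().lower()
    if key not in privileges:
        raise ValueError(f"Unknown metastore type: {metastore!r}")
    for value in (catalog, principal):
        # Refuse names that would break out of the backtick quoting below.
        if "`" in value:
            raise ValueError(f"Backticks are not allowed in {value!r}")
    # Backticks quote identifiers that contain hyphens or other special characters.
    return f"GRANT {privileges[key]} ON CATALOG `{catalog}` TO `{principal}`;"
```

Run the resulting statement in a Databricks SQL editor or notebook as a user who is allowed to grant privileges on the catalog.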
Authenticate Access to Databricks
You can choose between OAuth (recommended) or a personal access token to authenticate Validio's access to Databricks. For more information, refer to Authentication and access control in Databricks documentation.
OAuth Authentication
Create an OAuth secret to generate OAuth access tokens for authentication:
- Go to the service principal's details page > Secrets tab.
- Under OAuth secrets, click Generate secret.
- Set the secret's lifetime in days (maximum 730).
- Copy the displayed secret and client ID. The secret is shown only once. You will use these values to configure the authentication type in your Validio credential. (The client ID is the same as the service principal's application ID).
- Click Done.
For more information, refer to Authorize service principal access to Databricks with OAuth in Databricks documentation.
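To sanity-check a newly generated secret outside Validio, you can request an access token yourself. The sketch below only builds the request and assumes the workspace-level token endpoint `/oidc/v1/token` with the `all-apis` scope, as described in the Databricks OAuth documentation for service principals. The function name and the example values are hypothetical.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(host: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) a client-credentials token request.

    Assumes the workspace token endpoint is https://<host>/oidc/v1/token.
    """
    url = f"https://{host}/oidc/v1/token"
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    ).encode("ascii")
    # The client ID and secret are sent as HTTP Basic credentials.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` sends it. A JSON response containing an `access_token` field suggests the client ID and secret are valid.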
Personal Access Token Authentication
Personal Access Token Support: Databricks recommends using OAuth instead of PATs because OAuth provides stronger security. For more information, refer to Authenticate with Databricks personal access tokens (legacy) in Databricks documentation.
To authenticate access to Databricks with a personal access token:
- Enable access token permissions for the service principal, following these instructions in Databricks documentation.
- Generate an access token for the service principal, following these instructions in Databricks documentation.
Create a Databricks SQL Warehouse for Validio
- Create a dedicated SQL Warehouse for Validio to query data for validation. We recommend starting with a 2X-Small Warehouse and setting Auto Stop to minimum. If your validation needs exceed the capabilities of this SQL Warehouse, you can increase the size.
- Give the user or Validio service principal Can Use permissions on the SQL Warehouse.
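If you prefer to script these two steps instead of using the UI, the request bodies might look like the sketch below. It assumes the field names used by the Databricks SQL Warehouses and Permissions REST APIs (`cluster_size`, `auto_stop_mins`, `access_control_list`, `CAN_USE`). The warehouse name and the default of 10 minutes are placeholders, since the minimum Auto Stop value depends on your workspace and warehouse type.

```python
import json


def warehouse_payload(name: str = "validio-warehouse", auto_stop_mins: int = 10) -> dict:
    """Request body for creating a small, dedicated SQL Warehouse.

    `auto_stop_mins` is a placeholder: use the lowest value your workspace allows.
    """
    return {
        "name": name,
        "cluster_size": "2X-Small",  # recommended starting size from this guide
        "auto_stop_mins": auto_stop_mins,
        "max_num_clusters": 1,
    }


def can_use_payload(application_id: str) -> dict:
    """Request body granting Can Use on a warehouse to a service principal."""
    return {
        "access_control_list": [
            {
                "service_principal_name": application_id,
                "permission_level": "CAN_USE",
            }
        ]
    }


if __name__ == "__main__":
    # Print the bodies so they can be reviewed before being sent to the REST API.
    print(json.dumps(warehouse_payload(), indent=2))
    print(json.dumps(can_use_payload("<application-id>"), indent=2))
```

Starting small keeps cost low. If validation queries queue up or time out, raise `cluster_size` rather than extending the Auto Stop window.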
Add a Databricks Credential
To add a credential for Databricks:
- In Validio, navigate to Credentials and click + New Credential.
- Under Namespace, select a namespace where the resources will be created.
- For Credential Type, select Databricks Credential.
- Fill in the Configuration parameter fields. Refer to the Databricks Credential Parameters table.
- (Optional) Click Test credential to validate that Validio can successfully access the Databricks account. If validation fails, check that you provided the correct parameter values.
- (Optional) Check Use for catalog and schema checks to automatically discover assets from this credential and add them to the Catalog page.
- Click Create credential.
Validio will automatically start fetching data and you will be able to view Databricks assets and their relationships in the Catalog (if selected) and Lineage pages.
Once the credential is created, you can add a source to monitor Databricks data.
Databricks Credential Parameters
| Field | Description |
|---|---|
| Name | A unique identifier, used for creating sources. For example, service_account_product_staging |
| Host | Databricks account host. For example, 123456789.1.gcp.databricks.com |
| Port | SQL warehouse port. For example, 443 |
| HTTP Path | HTTP path of the SQL warehouse. For example, /sql/1.0/warehouses/c0aa12c3456c789 |
| Authentication type | Choose either OAuth (recommended) or Personal access token to connect Validio to the Databricks service principal. |
| OAuth | Enter or paste the OAuth client ID and OAuth client secret created in Databricks. |
| Access token | Enter or paste the personal access token created in Databricks. |
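Before pasting values into the form, it can help to check them against the shapes shown in the table. The following is a minimal sketch of such a pre-check. The class `DatabricksCredentialParams` and its rules are hypothetical and inferred only from the examples above; they are not Validio's own validation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatabricksCredentialParams:
    """Hypothetical container mirroring the fields in the parameters table."""

    name: str
    host: str
    http_path: str
    port: int = 443
    oauth_client_id: Optional[str] = None
    oauth_client_secret: Optional[str] = None
    access_token: Optional[str] = None

    def problems(self) -> List[str]:
        """Return a list of human-readable issues; an empty list means none were found."""
        issues = []
        # The example host is a bare hostname, so flag schemes and paths.
        if "://" in self.host or "/" in self.host:
            issues.append("Host should be a bare hostname without a scheme or path")
        # Assumption: HTTP paths start with /sql/, as in the table's example.
        if not self.http_path.startswith("/sql/"):
            issues.append("HTTP Path should start with /sql/")
        if not 1 <= self.port <= 65535:
            issues.append("Port must be between 1 and 65535")
        has_oauth = bool(self.oauth_client_id and self.oauth_client_secret)
        has_pat = bool(self.access_token)
        # Exactly one authentication type should be filled in.
        if has_oauth == has_pat:
            issues.append("Provide either OAuth client ID and secret, or an access token")
        return issues
```

For example, constructing the class with the host and HTTP path from the table plus an OAuth client ID and secret should produce no issues, while a host that starts with `https://` is flagged.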
Add a Databricks Source
To add a source for Databricks:
- In Validio, navigate to Sources and click + New source.
- Under Source type, select Databricks.
- Under Config:
- Select the valid Credential or create a new credential to authenticate your connection to the data warehouse.
- You have two options for specifying the dataset and tables to monitor:
- Select an existing table -- Enter the Database, Schema, and Table to specify where the data comes from. Selecting more than one table will create a new source for each table.
- Use Custom SQL -- Write a valid SQL query to specify the tables to monitor.
- Set the Polling schedule, which is how frequently the validators on the source will check for changes.
- Under Schema, click Continue to automatically infer the schema fields from the tables you selected. If you select many tables, this operation can take a few minutes to complete.
- Under Source details:
- Add Tags to help group related sources or to use for routing notifications.
- Add an Owner who will be the contact for incident notifications.
- Assign a Priority, which indicates the importance of incidents detected on this source.
- Click Continue to create the source. Source names are generated automatically and are displayed when source creation completes. If more than five sources are created, you will see the names of the first five and a count of the remaining sources.