Define Environment Variables for Databricks Cluster

You have a Databricks instance and need to configure the environment variables for a Databricks cluster in an automated way, for example from a CI/CD pipeline.

Databricks CLI

The Databricks CLI provides an interface to the Databricks REST APIs. You can find more information on the Databricks CLI documentation page.

Let's do some exploration.

Install Databricks CLI

The Databricks CLI is a Python package. It can be installed using pip:

pip install databricks-cli

The Databricks CLI can be configured in interactive mode. It creates a .databrickscfg file in your home directory and automatically uses the settings defined in that file.

A CI/CD pipeline executes commands in non-interactive mode. To configure the Databricks CLI for non-interactive mode, we have to define the following environment variables:

DATABRICKS_HOST
DATABRICKS_TOKEN

For example:

$Env:DATABRICKS_HOST = 'https://westeurope.azuredatabricks.net'
$Env:DATABRICKS_TOKEN = 'dapi123456789050abcdefghijklmno'
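
In a pipeline it helps to fail early when one of these variables is missing, before any CLI command runs. The following is a minimal sketch of such a check. The function name Get-MissingDatabricksVariables is hypothetical and is not part of the Databricks CLI.

```powershell
# Hypothetical helper (not part of the Databricks CLI):
# returns the names of the required variables that are not set.
function Get-MissingDatabricksVariables {
    [cmdletbinding()]
    param(
        [string[]]$Names = @('DATABRICKS_HOST', 'DATABRICKS_TOKEN')
    )

    foreach ($Name in $Names) {
        # GetEnvironmentVariable returns $null when the variable is not defined
        $Value = [System.Environment]::GetEnvironmentVariable($Name)
        if ([string]::IsNullOrWhiteSpace($Value)) {
            $Name
        }
    }
}

# Example usage at the top of a pipeline script:
#   $Missing = @(Get-MissingDatabricksVariables)
#   if ($Missing.Count -gt 0) { throw "Missing variables: $($Missing -join ', ')" }
```

Wrapping the call in @( ) keeps the result an array even when nothing is missing, so the Count check behaves the same in both cases.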

Get list of clusters

To test our Databricks CLI installation, let's run a command to retrieve the list of clusters:

databricks clusters list

This produces output like the following:

1103-193230-glued638  MyCluster  RUNNING

To get a more detailed list in JSON format, add the --output JSON option:

databricks clusters list --output JSON

This produces output like the following:

{
  "clusters": [
    {
      "cluster_id": "1103-193230-glued638",
      "cluster_name": "MyCluster",
      "spark_version": "7.3.x-scala2.12",
      "node_type_id": "Standard_DS3_v2",
      "driver_node_type_id": "Standard_DS3_v2",
      "spark_env_vars": {
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
      },
      "autotermination_minutes": 30,
      "enable_elastic_disk": true,
      "disk_spec": {},
      "cluster_source": "UI",
      "enable_local_disk_encryption": false,
      "azure_attributes": {
        "first_on_demand": 1,
        "availability": "ON_DEMAND_AZURE",
        "spot_bid_max_price": -1.0
      },
      "state": "PENDING",
      "state_message": "Setting up 2 nodes.",
      "start_time": 1604431951029,
      "last_state_loss_time": 0,
      "num_workers": 1,
      "default_tags": {
        "Vendor": "Databricks",
        "Creator": "ivan.georgiev@gmail.com",
        "ClusterName": "MyCluster",
        "ClusterId": "1103-193230-glued638"
      },
      "creator_user_name": "ivan.georgiev@gmail.com",
      "init_scripts_safe_mode": false
    }
  ]
}
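
Because the output is JSON, PowerShell can parse it with ConvertFrom-Json and work with the fields directly. The sketch below reduces the JSON to the cluster id, name and state. The function name ConvertFrom-ClusterListJson is hypothetical, and the sample input is trimmed from the output above so the example does not need a live workspace.

```powershell
# Hypothetical helper: reduce the JSON from "databricks clusters list --output JSON"
# to the id, name and state of each cluster.
function ConvertFrom-ClusterListJson {
    param(
        [Parameter(Mandatory)]
        [string]$Json
    )

    $Parsed = $Json | ConvertFrom-Json
    foreach ($Cluster in $Parsed.clusters) {
        [pscustomobject]@{
            ClusterId   = $Cluster.cluster_id
            ClusterName = $Cluster.cluster_name
            State       = $Cluster.state
        }
    }
}

# Sample input trimmed from the output above. In a pipeline the JSON would come from:
#   databricks clusters list --output JSON | Out-String
$SampleJson = @'
{
  "clusters": [
    {
      "cluster_id": "1103-193230-glued638",
      "cluster_name": "MyCluster",
      "state": "PENDING"
    }
  ]
}
'@

ConvertFrom-ClusterListJson -Json $SampleJson
```

The result is a stream of objects, so it can be filtered further, for example with Where-Object on the State property.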

Get Cluster Information

To retrieve the information for a single cluster:

databricks clusters get --cluster-id 1103-193230-glued638

The result of this command is cluster information in JSON format.
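
The part of this JSON we care about is the spark_env_vars object. As a small sketch, the variables can be read into a hashtable like this. The function name Get-SparkEnvVarsFromJson is hypothetical, and the sample JSON is a shortened version of the cluster information shown earlier. The sketch assumes spark_env_vars may be absent when no variables are defined.

```powershell
# Hypothetical helper: read spark_env_vars from the JSON returned by
# "databricks clusters get" and return the variables as a hashtable.
function Get-SparkEnvVarsFromJson {
    param(
        [Parameter(Mandatory)]
        [string]$Json
    )

    $ClusterInfo = $Json | ConvertFrom-Json
    $Result = @{}
    # Assumption: spark_env_vars may be absent when no variables are defined
    if ($null -ne $ClusterInfo.spark_env_vars) {
        foreach ($Property in $ClusterInfo.spark_env_vars.PSObject.Properties) {
            $Result[$Property.Name] = $Property.Value
        }
    }
    return $Result
}

# Shortened sample. In a pipeline the JSON would come from:
#   databricks clusters get --cluster-id $ClusterId | Out-String
$SampleJson = '{"cluster_id": "1103-193230-glued638", "spark_env_vars": {"PYSPARK_PYTHON": "/databricks/python3/bin/python3"}}'
$EnvVars = Get-SparkEnvVarsFromJson -Json $SampleJson
$EnvVars['PYSPARK_PYTHON']
```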

Putting It All Together

Now we can create a PowerShell function that sets all variables passed in the Vars argument.

The function will:

Retrieve cluster information using databricks clusters get
Update the environment variable definitions
Apply the cluster information using databricks clusters edit

Here is the definition of the function:

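function Set-DatabricksClusterEnvironmentVariables {
    [cmdletbinding()]
    param(
        [string]$ClusterId,
        [hashtable]$Vars
    )

    Write-Verbose "Get Databricks cluster info"
    $ClusterInfo = (databricks clusters get --cluster-id $ClusterId | ConvertFrom-Json)

    # Create an empty spark_env_vars object if the cluster has none yet,
    # otherwise Add-Member below would receive a null InputObject
    if ($null -eq $ClusterInfo.spark_env_vars) {
        Add-Member -InputObject $ClusterInfo -Name spark_env_vars -MemberType NoteProperty -Value ([pscustomobject]@{}) -Force
    }

    foreach ($VarName in $Vars.Keys) {
        Write-Verbose "Set variable $VarName"
        # -Force overwrites the variable if it is already defined on the cluster
        Add-Member -InputObject $ClusterInfo.spark_env_vars -Name $VarName -MemberType NoteProperty -Value $Vars[$VarName] -Force
    }

    # Write the updated definition to a temporary file as UTF-8 without BOM
    $JsonFilePath = New-TemporaryFile

    $ClusterInfoJson = ($ClusterInfo | ConvertTo-Json -Depth 10)
    $Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
    [System.IO.File]::WriteAllLines($JsonFilePath, $ClusterInfoJson, $Utf8NoBomEncoding)

    Write-Verbose "Update Databricks cluster"
    databricks clusters edit --json-file $JsonFilePath
    Remove-Item $JsonFilePath
}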

Here is an example usage of this function:

$Vars = @{
    DB_CONNECTION_STRING = 'MSSQL;hostname=nowhere;username=ghost;password=purple'
    ENVIRONMENT_NAME = 'Development'
    ENVIRONMENT_CODE = 'dev'
    SECRET_SCOPE = 'my_secrets'
    }
Set-DatabricksClusterEnvironmentVariables -ClusterId 1103-193230-glued638 -Vars $Vars -Verbose

It will define 4 environment variables:

DB_CONNECTION_STRING
ENVIRONMENT_NAME
ENVIRONMENT_CODE
SECRET_SCOPE

I have also added the -Verbose parameter to print additional diagnostic information about the command execution.

Here is the output:

VERBOSE: Get Databricks cluster info
VERBOSE: Set variable ENVIRONMENT_CODE
VERBOSE: Set variable DB_CONNECTION_STRING
VERBOSE: Set variable ENVIRONMENT_NAME
VERBOSE: Set variable SECRET_SCOPE
VERBOSE: Update Databricks cluster

Checking in Databricks confirms that the environment variables are properly set:

PYSPARK_PYTHON=/databricks/python3/bin/python3
SECRET_SCOPE=my_secrets
ENVIRONMENT_CODE=dev
NEW_VAR=SomeNewValue
ENVIRONMENT_NAME=Development
DB_CONNECTION_STRING=MSSQL;hostname=nowhere;username=ghost;password=purple
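
This check can also be scripted, so that a pipeline can confirm the update itself. The sketch below compares the expected values against the spark_env_vars object in the cluster JSON and returns the names that are missing or different. The function name Get-MismatchedSparkEnvVars is hypothetical, and the sample JSON is made up so the example runs without a live workspace.

```powershell
# Hypothetical helper: return the names of expected variables that are
# missing from spark_env_vars or have a different value.
function Get-MismatchedSparkEnvVars {
    param(
        [Parameter(Mandatory)]
        [string]$ClusterJson,
        [Parameter(Mandatory)]
        [hashtable]$Expected
    )

    $Actual = ($ClusterJson | ConvertFrom-Json).spark_env_vars
    foreach ($Name in $Expected.Keys) {
        $Property = $null
        if ($null -ne $Actual) {
            # The indexer returns $null when no property with this name exists
            $Property = $Actual.PSObject.Properties[$Name]
        }
        # -cne compares case-sensitively, which matters for variable values
        if (($null -eq $Property) -or ($Property.Value -cne $Expected[$Name])) {
            $Name
        }
    }
}

# Illustrative sample. In a pipeline the JSON would come from:
#   databricks clusters get --cluster-id $ClusterId | Out-String
$ClusterJson = '{"spark_env_vars": {"ENVIRONMENT_CODE": "dev", "SECRET_SCOPE": "my_secrets"}}'
$Expected = @{ ENVIRONMENT_CODE = 'dev'; ENVIRONMENT_NAME = 'Development' }

# Prints ENVIRONMENT_NAME, the only expected variable not present in the sample
Get-MismatchedSparkEnvVars -ClusterJson $ClusterJson -Expected $Expected
```

An empty result means every expected variable is present with the expected value. Variables that exist on the cluster but are not in the expected set are ignored.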

Conclusion

We created a PowerShell function that scripts the process of updating the cluster environment variables using the Databricks CLI. Since we configured the Databricks CLI through environment variables, the script can be executed in non-interactive mode, for example from a DevOps pipeline.

The same approach can be used for other Databricks tasks, for example to execute notebooks, retrieve the results and publish them in a test management framework. Do you want to learn how? I will tell you the story soon. Stay tuned.