Databricks Delta Load, Data Factory Copy Activity from SQL to Databricks and Staging with Gen2 Lake Store

May 24, 2022

If you have tried the Copy Activity in Azure Data Factory to copy rows of data from SQL Server or SQL Pools into Delta tables in Databricks, then you may have hit the same problem we did: you cannot copy a SQL table to a Delta Lake table directly.

You can only use Avro, CSV, or Parquet for a direct copy. What Data Factory can do instead is stage the data into your Lake Store; Databricks then collects the staged files and loads them into Delta.

The Problem

Data Factory raises an error saying the Storage Account container cannot be accessed when you run a copy activity from SQL to Databricks using the "staging" facility.

Unable to access container dxxxx in account stgukxxxxxx.blob.core.windows.net using anonymous credentials, and no credentials found for them in the configuration.

As you know, the staging option asks you to choose a Storage Account to stage to. Databricks will then need to read from this Storage Account, and therein lies the problem.

The Solution

You need to add the Storage Account key to the Databricks cluster so that it can use this key to authenticate to the Storage Account.

The documentation here came in handy when my team and I were trying to understand why we were getting an error when selecting the “staging” option in a Data Factory copy activity.

The error had to do with the fs.azure.account.key setting not being configured.

The Databricks settings, or so we thought

We started off this process by doing what we thought was right, which was mounting the Storage Account. If you are familiar with mounting drives for Lake Store folders, it looks a bit like this:

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "ee9xxxxxxxxx",
    "fs.azure.account.oauth2.client.secret": "xxxxxxxxxxxxxxxxxxxx",
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/9998877665544/oauth2/token"
    }

That goes in the first notebook cell; then you put this next bit of code in the following cell:

dbutils.fs.mount(
  source = "abfss://bronze@mylakestoreacc.dfs.core.windows.net/",
  mount_point = "/mnt/mylakestoreacc/bronze",
  extra_configs = configs)

This is done in the Notebook, and a Notebook is attached to a cluster.

The Factory

We had Databricks (DBR) ready with a mounted drive, and we also had a Delta table created as a Hive table. This lives in a sort of ephemeral database on our DBR workspace.
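
For reference, registering a Delta table as a Hive table from a notebook looks roughly like this; the table name and mounted path below are made up for illustration, not our exact DDL.

# Minimal sketch: register a Delta table in the workspace Hive metastore,
# pointing at a hypothetical folder under the mount created earlier
spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze_customers
    USING DELTA
    LOCATION '/mnt/mylakestoreacc/bronze/customers'
""")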

The Data Factory has this Copy Activity, which has the source as the SQL Server database table.

[Image: Data Factory copy activity from SQL table to Delta table]

The Sink side of the copy activity is a Delta-type sink dataset.

[Image: Data Factory copy activity sink to Delta]

If you try to copy a SQL table directly to Delta Lake, you will get an error mentioning that you should use staging to complete this task.

You can set up your staging now; select staging from the Settings tab.

[Image: staging setting]

When you run this, you will probably get the error message "Unable to access container dxxxx in account stgukxxxxxx.blob.core.windows.net using anonymous credentials, and no credentials found for them in the configuration." or something similar.

How to fix it: The Cluster has spoken - Add the Storage Account Credentials now!

There is one missing piece if you want this to work, which is to have the Account Key for the Storage Account registered in your Databricks cluster (in the Spark config under Advanced options on your compute).

Here are the steps to make this happen.

Add a Service Principal to your Databricks

You have to set up a Service Principal (SP) for your DBR workspace. You can use the Databricks CLI to do this, or you can use Postman like I did. You first need to create a Personal Access Token in Databricks to allow authentication by third-party apps; you create the token under the User Settings blade in Databricks. You can then add the SP to the Databricks workspace in Postman, using this token to authenticate.
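
If you would rather script it than click through Postman, here is a rough sketch of the same call in Python against the Databricks SCIM API; the workspace URL, token and App ID are placeholders, so treat this as an outline rather than a copy-paste recipe.

import requests

workspace_url = "https://adb-9999988887.7.azuredatabricks.net"   # your workspace URL
pat = "dapiXXXXXXXXXXXXXXXX"                                     # Personal Access Token from User Settings
sp_application_id = "ee9xxxxxxxxx"                               # App ID of your Service Principal

# Add the Service Principal to the workspace via the SCIM endpoint
resp = requests.post(
    f"{workspace_url}/api/2.0/preview/scim/v2/ServicePrincipals",
    headers={"Authorization": f"Bearer {pat}",
             "Content-Type": "application/scim+json"},
    json={
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
        "applicationId": sp_application_id,
        "displayName": "spn-dbr-testcl1-databricks",
    },
)
resp.raise_for_status()
print(resp.json())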

RBAC things

The SP should be added to the Lake Store in the IAM (RBAC) section as a Storage Blob Data Contributor.

Create a Key Vault and make some secrets

In your Key Vault, add your Service Principal App ID value and the Client Secret as separate "secrets". You will use these for client.id and client.secret.

Add a secret to the vault that has the Access Key of the Storage Account that you want to mount in Databricks.
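
If you prefer doing this in code rather than the portal, a small sketch with the azure-keyvault-secrets SDK looks like the following; the vault name "secrets4me" and the secret names simply match the ones used later in this post, so adjust them to your own setup.

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Assumes the vault is called "secrets4me" and you are signed in with an
# identity that has permission to set secrets
client = SecretClient(vault_url="https://secrets4me.vault.azure.net/",
                      credential=DefaultAzureCredential())

client.set_secret("spn-dbr-testcl1-databricks-AppID", "<service principal app id>")
client.set_secret("spn-dbr-testcl1-databricks-Password", "<service principal client secret>")
client.set_secret("lzstorageacckey", "<storage account access key>")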

Connect your Key Vault to your Databricks so that you can use the secrets in the notebook code

Next, create a secret scope in DBR. You do this by adding #secrets/createScope to the end of your Databricks URL, something like this: https://adb-9999988887.7.azuredatabricks.net/#secrets/createScope. Follow the prompts on this page and it will connect to your Key Vault.
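
A quick way to check the scope is wired up is to list its secrets from a notebook; secret values come back redacted when printed, which is expected. The scope name here matches the one used later in this post.

# List the secret names in the Key Vault-backed scope
for secret in dbutils.secrets.list("secrets4me"):
    print(secret.key)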

Now add your Storage Account key, which is kept in a secret, to your Databricks cluster

And the last bit is to put this code in your Spark cluster config area (I like to add both dfs and blob just to be sure):

fs.azure.account.key.{{secrets/secrets4me/lzstorageacc}}.dfs.core.windows.net {{secrets/secrets4me/lzstorageacckey}}
fs.azure.account.key.{{secrets/secrets4me/lzstorageacc}}.blob.core.windows.net {{secrets/secrets4me/lzstorageacckey}}

Just so you have a visual of where this is, have a look at the pic below. This should be all you need to get it to work. There is a lot of configuration, and the next bit of hard work is to CI/CD this piece of infrastructure!

[Image: Spark config]
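
If you only need your own notebook code to reach the staging account, a session-scoped alternative is to set the same keys with spark.conf.set from the notebook; for the Data Factory-driven load, though, it is the cluster-level Spark config above that matters. The account name below is a placeholder.

# Session-level equivalent of the cluster Spark config lines above
storage_account = "stgukxxxxxx"
account_key = dbutils.secrets.get(scope="secrets4me", key="lzstorageacckey")

spark.conf.set(f"fs.azure.account.key.{storage_account}.dfs.core.windows.net", account_key)
spark.conf.set(f"fs.azure.account.key.{storage_account}.blob.core.windows.net", account_key)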

Your upgraded and better notebook code to mount the storage

Your config code can now be smarter too; you can use your Key Vault secrets to hide the SP client secret:

configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope="secrets4me", key="spn-dbr-testcl1-databricks-AppID"),
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope = "secrets4me", key = "spn-dbr-testcl1-databricks-Password"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/xxxx-yourtenant-id-here-xxxx/oauth2/token"}

And better, more clever mount storage code; thanks to my friend Paul Niland for this:

zone = "raw-cgroovedbricks"
mountpoint = f"/mnt/{zone}"

if mountpoint in [mnt.mountPoint for mnt in dbutils.fs.mounts()]:
    dbutils.fs.unmount(mountpoint)

try:
    dbutils.fs.mount(
     source = f"abfss://{zone}@yourstorageaccnamehere.dfs.core.windows.net/"
     , mount_point = mountpoint
     , extra_configs = configs)
    print(f"/mnt/{zone} has been mounted")
except Exception as e:
        if "Directory already mounted" in str(e):
            pass # Ignore error if aready mounted
            print(f"PASS : Mount {zone} already mounted")
        else:
            raise e
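
Once that has run, a quick sanity check from the notebook confirms the mount is readable:

# List the contents of the freshly mounted zone
display(dbutils.fs.ls(mountpoint))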


A Blog by Clint Grove who lives in Cambridge and works for Microsoft. Building useful data resources to help bring data to life. Find me on LinkedIn