Introduction:
In this blog, we will learn how to connect Azure Data Lake Storage to Azure Databricks.
Pre-requisites:
1. A user with a Contributor role in the Azure subscription.
2. An Azure Data Lake Storage account.
Problem Statement:
We have data stored in Azure Data Lake in CSV format and want to analyze it using the Azure Databricks service.
Steps:
1. Register an Azure AD Application.
2. Assign the Contributor and Storage Blob Data Contributor roles to the registered application. Only an account administrator can assign a role.
3. Create an Azure Databricks resource.
4. In Databricks, perform the below steps:
In the left pane, select Workspace. From the Workspace drop-down, select Create > Notebook.
5. In the Create Notebook dialog box, enter a name for the notebook. Select Scala as the language, and then select the Spark cluster that you created earlier.
Select Create.
6. It will present a blank notebook for you.
Run the below code:
val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "<application-id>",
  "fs.azure.account.oauth2.client.secret" -> "<client-secret>",
  "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
Please replace the application-id, client-secret, and tenant-id placeholders with your own values.
You can get these values from the Azure AD application registered in step 1.
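Hardcoding the client secret in a notebook is risky. If you have a Databricks secret scope set up, you can read the secret at runtime instead. The scope and key names below (datalake-scope, sp-client-secret) are just placeholders for this sketch; create your own scope and key first.

// Sketch: pull the client secret from a Databricks secret scope instead of hardcoding it.
// "datalake-scope" and "sp-client-secret" are hypothetical names.
val clientSecret = dbutils.secrets.get(scope = "datalake-scope", key = "sp-client-secret")

val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "<application-id>",
  "fs.azure.account.oauth2.client.secret" -> clientSecret,
  "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<tenant-id>/oauth2/token")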
7. Insert a new cell and run the below code.
This will connect Databricks to Azure Data Lake by mounting the container.
dbutils.fs.mount(
  source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
  mountPoint = "/mnt/custommountpoint",
  extraConfigs = configs)
Kindly replace file-system-name and storage-account-name with the actual values:
storage-account-name: the name of the storage account.
file-system-name: the name of the container (file system).
custommountpoint: any meaningful mount name.
[Screen capture: how to get the storage account name and file system name in the Azure portal.]
Select the block and hit Shift + Enter to execute the code.
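If the mount point already exists, dbutils.fs.mount will fail, so you may want to check for it before mounting. A minimal sketch, assuming the same placeholders and mount point name used above:

// Sketch: mount only if /mnt/custommountpoint is not already mounted.
val mountPoint = "/mnt/custommountpoint"
if (!dbutils.fs.mounts().exists(_.mountPoint == mountPoint)) {
  dbutils.fs.mount(
    source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
    mountPoint = mountPoint,
    extraConfigs = configs)
}
// Quick check that the files are visible through the mount.
display(dbutils.fs.ls(mountPoint))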
8. Create a new cell and run the below code.
val bookings = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/mnt/custommountpoint/<file-name>")
Replace the file-name placeholder with the actual file name.
Our data is stored in CSV format.
Note: If you receive the following error, assign the Storage Blob Data Contributor role to the application at the Data Lake (storage account) level.
StatusCode=403 StatusDescription=This request is not authorized to perform this operation using this permission.
Storage Account (Data Lake) > Access Control (IAM) > Assign a Role > Storage Blob Data Contributor.
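Once the CSV loads without errors, it is worth sanity-checking the inferred schema and a few rows before writing anything. A quick sketch using the bookings DataFrame from above:

// Inspect the schema that inferSchema derived from the CSV header and data.
bookings.printSchema()

// Preview the first rows and count the records.
display(bookings.limit(10))
println(s"Row count: ${bookings.count()}")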
9. The next step is to write the data out in Parquet format.
Kindly enter the below code into a new cell and hit Shift + Enter to execute the code.
Code:
bookings.write.mode("overwrite").parquet("bookings")
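Note that the path "bookings" above writes to the workspace's default DBFS location. If you would rather persist the Parquet output back into the data lake itself, you can point the writer at the mount. The output folder name below is just an example, not part of the original walkthrough:

// Sketch: write the Parquet files back to the mounted ADLS container
// under a hypothetical "curated/bookings" folder.
bookings.write
  .mode("overwrite")
  .parquet("/mnt/custommountpoint/curated/bookings")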
10. Read the data back and display it in the UI.
Kindly enter the below code into a new cell and hit Shift + Enter to execute the code.
val data = sqlContext.read.parquet("bookings")
display(data)
You can execute many Spark SQL commands against the data to generate reports; see https://docs.databricks.com/spark/latest/spark-sql/index.html.
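For example, once the DataFrame is registered as a temporary view, you can query it with plain SQL. A minimal sketch, assuming the bookings data has a column such as country (substitute a column that actually exists in your CSV):

// Register the Parquet-backed DataFrame as a temporary view for SQL queries.
data.createOrReplaceTempView("bookings_view")

// Example report: booking counts per country ("country" is a hypothetical column).
val report = spark.sql(
  "SELECT country, COUNT(*) AS total_bookings FROM bookings_view GROUP BY country ORDER BY total_bookings DESC")
display(report)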
Reference:
https://docs.databricks.com/spark/latest/spark-sql/index.html
https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake-gen2.html
https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/storage/blobs/data-lake-storage-use-databricks-spark.md