Nidhi Gupta
3 min read · Jun 16, 2023

Azure Databricks

Databricks is a company founded by the creators of Apache Spark. It uses Apache Spark to provide a unified analytics platform, and Azure Databricks is that platform offered as a managed service on Microsoft Azure.

Why do we need Azure Databricks?

  1. To use Apache Spark on our own, we need to provision the machines, install Spark and the necessary libraries, and maintain the scaling and availability of those machines.
  2. With Databricks, the entire environment can be provisioned with just a few clicks.

Three main components

  1. Databricks tools, services, and optimization.
  2. Distributed computation
  3. DBFS (Databricks File System) for files

Databricks Infrastructure

An Azure Databricks workspace runs its workloads on clusters, each made up of multiple nodes. Every cluster has the Spark engine and other components installed.

The cluster contains two types of nodes, plus a cluster manager:

1. Worker / Slave Nodes

These nodes are responsible for actually performing the underlying tasks.

2. Driver / Master Node

(i) Entry point of the Spark or PySpark application.

(ii) Responsible for distributing the tasks to the worker nodes.

3. Cluster Manager

Responsible for managing the cluster's resources.
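
To make this division of labour concrete, here is a minimal single-machine sketch in plain Python. It is an analogy, not Spark itself: a driver function splits the data into partitions, a thread pool stands in for the worker nodes, and the pool size stands in for the resources a cluster manager would hand out. The names `driver`, `worker_task`, `split`, and `num_workers` are made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_task(partition):
    """Work a single worker node would do on its slice of the data."""
    return sum(x * x for x in partition)


def split(data, num_partitions):
    """Cut the data into roughly equal partitions, one per task."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def driver(data, num_workers=3):
    """Entry point: distribute tasks to the workers and combine the results."""
    partitions = split(data, num_workers)
    # The pool plays the role of the worker nodes; its size stands in for
    # the resources a cluster manager would grant (an analogy only).
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(worker_task, partitions))
    return sum(partial_sums)


total = driver(list(range(10)))
print(total)
```

In a real cluster the workers are separate machines and Spark handles the partitioning and scheduling, but the flow is the same: the driver hands out tasks, the workers compute, and the driver collects the results.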

In Databricks, we have two types of clusters:

Interactive cluster

We can analyze data with the help of an interactive cluster. Multiple users can use a cluster and collaborate.

There are two types of interactive clusters:

1. Standard cluster

Recommended for a single user. There is no fault isolation, so a fault in one workload can impact the whole cluster. Resources are allocated to a single workload.

2. High concurrency cluster

Cluster for multiple users. Resources are shared across the users, and fault isolation is maintained. Table access control is also available.

Job cluster

Responsible for running jobs. When a job has to run, Azure Databricks starts the cluster. Once the job is completed, the cluster is terminated.
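
As a rough sketch of how a job cluster is requested, the snippet below builds a job definition as a Python dictionary. The field names (`tasks`, `task_key`, `notebook_task`, `new_cluster`, `spark_version`, `node_type_id`, `num_workers`) follow the Databricks Jobs API 2.1 as I understand it, so check the current API reference before relying on them. The helper `build_job_spec`, the job name, the notebook path, and the runtime and VM values are placeholders assumed for illustration. Nothing is sent to Databricks here.

```python
import json


def build_job_spec(name, notebook_path, num_workers=2):
    """Build a job definition whose task runs on a fresh job cluster."""
    return {
        "name": name,
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": notebook_path},
                # "new_cluster" asks for a cluster created for this run and
                # terminated afterwards, rather than an interactive cluster.
                "new_cluster": {
                    "spark_version": "13.3.x-scala2.12",  # placeholder runtime
                    "node_type_id": "Standard_DS3_v2",    # placeholder VM size
                    "num_workers": num_workers,
                },
            }
        ],
    }


# Hypothetical job name and notebook path, for illustration only.
spec = build_job_spec("nightly-load", "/Shared/nightly_load")
print(json.dumps(spec, indent=2))
```

Because the task carries its own `new_cluster` block instead of pointing at an existing interactive cluster, the cluster lives only for the duration of the run, which matches the start-and-terminate behaviour described above.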

Databricks supports languages such as R, SQL, Python, Scala, and Java.

Benefits of Azure Databricks

  1. Quick setup for prototyping experiments.
  2. Includes notebooks similar to Jupyter notebooks.
  3. Intuitive, integrated console.
  4. Apache Spark core committers work at Databricks, which is one of the reasons Databricks performs well.
  5. Granular, integrated security.

Databricks is built on top of Apache Spark, so it has all the features of Apache Spark.

The main difference between the two is that Databricks is a fully managed, distributed compute platform. It also supports ML and provides notebooks that make writing code easier.

Cloud developers mainly use PySpark, the Python API for Spark, to write code in Databricks notebooks.

This article was a basic introduction to Databricks. The next article, to be published soon, will cover PySpark and how to write code in a Databricks notebook using PySpark.

“Thanks for reading 🙏🙏. Do clap 👏👏 and share what you have learned from this article.”

“Keep learning and keep sharing knowledge”

Nidhi Gupta

Azure Data Engineer 👨‍💻. Heading towards cloud technologies expertise ✌️.