Nidhi Gupta
2 min read · Oct 2, 2023



Apache Spark is a distributed processing engine that supports multiple languages, such as Java, Python, Scala, and SQL. This gives programmers the flexibility to write code in any supported language.


In this article we will discuss the APIs provided by Apache Spark, which offer great ease and flexibility when interacting with data.

Spark RDD (Resilient Distributed Dataset):- This API provides the following:

(i) An RDD is a dataset and Spark's fundamental data structure.

(ii) There is no row, column, or schema enforcement.

(iii) "Resilient" refers to the fault tolerance built into the API.

(iv) A lost RDD partition can be recreated and reprocessed anywhere in the cluster, because Spark tracks the lineage of transformations used to build it.

Catalyst Optimizer (Spark SQL engine):- This engine processes every query through the following phases:

(i) Analysis

(ii) Logical Optimization

(iii) Physical Planning

(iv) Code Generation

Note: Please see my separate article on the Catalyst Optimizer for more detail.

Spark SQL:- This API provides the following:

(i) Support for querying data with SQL.

(ii) Standard SQL features such as schemas, tables, views, and functions.

(iii) Stored procedures, however, are not supported in Spark SQL.

(iv) Tables can be external (the table definition and metadata live in DBFS, while the data stays at an external source) or managed (the table definition, metadata, and data all live in DBFS).

Spark Table:- This API provides the following:

(i) Spark tables store their schema in the metadata store.

(ii) A table and its metadata are persistent and available across applications.

(iii) Spark tables are accessed through SQL expressions, not the DataFrame API.

Spark DataFrame:- This API is built on top of Spark RDD and provides the following:

(i) A DataFrame is internally a distributed data structure.

(ii) It is a two-dimensional, table-like data structure with named columns and a well-defined schema.

(iii) A DataFrame schema consists of column names and data types.

(iv) DataFrames support schema-on-read.

(v) Spark DataFrames are accessed through the API, not SQL expressions.

Spark Dataset:- This API provides the following:

(i) A Spark Dataset combines the capabilities of the Spark RDD and the Spark DataFrame: the typed, object-oriented view of an RDD plus the Catalyst optimizations of a DataFrame.

(ii) It is built on top of the Spark DataFrame.

(iii) Spark Dataset = Spark DataFrame + Spark RDD. (Note that the typed Dataset API is available only in Scala and Java, not in Python.)

Thanks for the read 🙏. Do clap 👏👏👏👏 if you found it useful.

🙂🙂 Stay connected!! I will be publishing more articles on Spark.

“Keep learning and keep sharing knowledge”



Nidhi Gupta

Azure Data Engineer 👨‍💻. Heading towards cloud technologies expertise ✌️.