
Nidhi Gupta · Sep 3, 2023

Joins in PySpark

In this article we will explore the reasons for joining tables/DataFrames and the types of joins used in PySpark.

The need for a clearer and more meaningful way to store data in tables led to the concept of data modelling in the world of data warehousing.

Data modelling offers two ways to store data in tables:

(i) Normalized method (snowflake schema): data is split across related tables, so queries must join the tables back together.

(ii) Denormalized method (star schema): data is kept together in wider, denormalized tables, so joins are largely avoided.

Joining tables/DataFrames in PySpark can be categorized into the following types: inner, left (outer), right (outer), full (outer), left semi, left anti, and cross joins.
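As a quick reference, the general join syntax in PySpark is df1.join(df2, on, how), where the how argument selects the join type. A minimal sketch (df1, df2, and key are placeholder names, not the DataFrames built below):

# General form: left_df.join(right_df, on=<join condition>, how=<join type>)
# Common values for how: 'inner', 'left'/'leftouter', 'right'/'rightouter',
# 'full'/'fullouter', 'leftsemi', 'leftanti', 'cross'
joined = df1.join(df2, df1.key == df2.key, 'inner')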

Let’s consider an example to understand the syntax used in each type of join.

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the app name here is just a placeholder
spark = SparkSession.builder.appName("joins").getOrCreate()

# Employee data: id, name, salary, department id
data1 = [(1, 'nidhi', 2000, 2), (2, 'gupta', 3000, 1), (3, 'abcd', 1000, 4)]
schema1 = ['id', 'name', 'salary', 'dep']

# Department data: id, name
data2 = [(1, 'IT'), (2, 'HR'), (3, 'Payroll')]
schema2 = ['id', 'name']

empof = spark.createDataFrame(data=data1, schema=schema1)
depof = spark.createDataFrame(data=data2, schema=schema2)

empof.show()
depof.show()

+---+-----+------+---+
| id| name|salary|dep|
+---+-----+------+---+
| 1|nidhi| 2000| 2|
| 2|gupta| 3000| 1|
| 3| abcd| 1000| 4|
+---+-----+------+---+

+---+-------+
| id| name|
+---+-------+
| 1| IT|
| 2| HR|
| 3|Payroll|
+---+-------+

(i) Inner Join

# INNER JOIN: keep only rows whose join key exists in both DataFrames
empof.join(depof, empof.dep == depof.id, 'inner').show()

+---+-----+------+---+---+----+
| id| name|salary|dep| id|name|
+---+-----+------+---+---+----+
|…
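The full output is cut off above, but with the sample data an inner join should keep only the employees whose dep value has a matching id in depof (employees 1 and 2; employee 3 has dep = 4, which matches no department). Since both DataFrames use the column names id and name, the joined result also repeats those headers; below is a minimal sketch of the same inner join with the department columns renamed first (an alternative not shown in the original, assuming the empof and depof DataFrames defined above):

# Rename depof's columns so the joined output has unambiguous headers
dep_renamed = depof.withColumnRenamed('id', 'dep_id').withColumnRenamed('name', 'dep_name')
empof.join(dep_renamed, empof.dep == dep_renamed.dep_id, 'inner').show()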

