Nidhi Gupta
3 min read · Sep 3, 2023


Joins in PySpark

In this article we will explore the reasons for joining tables/DataFrames and the types of joins available in PySpark.

The need to store data in tables in a clearer and more meaningful way led to the concept of data modelling in the world of data warehousing.

Data modelling offers two ways to store data in tables:

(i) Normalized method (snowflake schema): normalization leads to the need for joining tables.

(ii) Denormalized method (star schema): no joining of tables is required.

Joining tables/DataFrames can be categorized into the join types covered below.

Let’s consider an example to understand the syntax used in each type of join.

from pyspark.sql import SparkSession

# In a notebook or spark-shell a SparkSession named `spark` already exists;
# otherwise create one
spark = SparkSession.builder.getOrCreate()

data1 = [(1, 'nidhi', 2000, 2), (2, 'gupta', 3000, 1), (3, 'abcd', 1000, 4)]
schema1 = ['id', 'name', 'salary', 'dep']

data2 = [(1, 'IT'), (2, 'HR'), (3, 'Payroll')]
schema2 = ['id', 'name']

empof = spark.createDataFrame(data=data1, schema=schema1)
depof = spark.createDataFrame(data=data2, schema=schema2)

empof.show()
depof.show()

+---+-----+------+---+
| id| name|salary|dep|
+---+-----+------+---+
| 1|nidhi| 2000| 2|
| 2|gupta| 3000| 1|
| 3| abcd| 1000| 4|
+---+-----+------+---+

+---+-------+
| id| name|
+---+-------+
| 1| IT|
| 2| HR|
| 3|Payroll|
+---+-------+

(i) Inner Join

# Inner join: keeps only rows that match on both sides
empof.join(depof, empof.dep == depof.id, 'inner').show()

+---+-----+------+---+---+----+
| id| name|salary|dep| id|name|
+---+-----+------+---+---+----+
| 2|gupta| 3000| 1| 1| IT|
| 1|nidhi| 2000| 2| 2| HR|
+---+-----+------+---+---+----+

(ii) Left Join

# Left join: keeps all rows from the left DataFrame
empof.join(depof, empof.dep == depof.id, 'left').show()

+---+-----+------+---+----+----+
| id| name|salary|dep| id|name|
+---+-----+------+---+----+----+
| 2|gupta| 3000| 1| 1| IT|
| 1|nidhi| 2000| 2| 2| HR|
| 3| abcd| 1000| 4|null|null|
+---+-----+------+---+----+----+

(iii) Right Join

# Right join: keeps all rows from the right DataFrame
empof.join(depof, empof.dep == depof.id, 'right').show()

+----+-----+------+----+---+-------+
| id| name|salary| dep| id| name|
+----+-----+------+----+---+-------+
| 2|gupta| 3000| 1| 1| IT|
|null| null| null|null| 3|Payroll|
| 1|nidhi| 2000| 2| 2| HR|
+----+-----+------+----+---+-------+

(iv) Full Outer Join

# Full outer join: keeps all rows from both DataFrames
empof.join(depof, empof.dep == depof.id, 'full').show()

+----+-----+------+----+----+-------+
| id| name|salary| dep| id| name|
+----+-----+------+----+----+-------+
| 2|gupta| 3000| 1| 1| IT|
| 1|nidhi| 2000| 2| 2| HR|
|null| null| null|null| 3|Payroll|
| 3| abcd| 1000| 4|null| null|
+----+-----+------+----+----+-------+

(v) Left Semi Join

# Left semi join: like an inner join, but returns
# only the left DataFrame's columns for matching rows
empof.join(depof, empof.dep == depof.id, 'leftsemi').show()

+---+-----+------+---+
| id| name|salary|dep|
+---+-----+------+---+
| 2|gupta| 3000| 1|
| 1|nidhi| 2000| 2|
+---+-----+------+---+

(vi) Left Anti Join

# Left anti join: the opposite of a left semi join;
# returns left-DataFrame rows that have no match in the right
empof.join(depof, empof.dep == depof.id, 'leftanti').show()

+---+----+------+---+
| id|name|salary|dep|
+---+----+------+---+
| 3|abcd| 1000| 4|
+---+----+------+---+

(vii) Self Join

# Self join: a DataFrame joined with itself
# (no join condition here, so every row is paired with every row)
empof.alias('emp').join(empof.alias('em')).show()

+---+-----+------+---+---+-----+------+---+
| id| name|salary|dep| id| name|salary|dep|
+---+-----+------+---+---+-----+------+---+
| 1|nidhi| 2000| 2| 1|nidhi| 2000| 2|
| 1|nidhi| 2000| 2| 2|gupta| 3000| 1|
| 1|nidhi| 2000| 2| 3| abcd| 1000| 4|
| 2|gupta| 3000| 1| 1|nidhi| 2000| 2|
| 2|gupta| 3000| 1| 2|gupta| 3000| 1|
| 2|gupta| 3000| 1| 3| abcd| 1000| 4|
| 3| abcd| 1000| 4| 1|nidhi| 2000| 2|
| 3| abcd| 1000| 4| 2|gupta| 3000| 1|
| 3| abcd| 1000| 4| 3| abcd| 1000| 4|
+---+-----+------+---+---+-----+------+---+

(viii) Broadcast Join

# Broadcast join: ships the small DataFrame to every executor, avoiding a shuffle
from pyspark.sql.functions import broadcast
empof.join(broadcast(depof), depof["id"] == empof["dep"]).show()
+---+-----+------+---+---+----+
| id| name|salary|dep| id|name|
+---+-----+------+---+---+----+
| 1|nidhi| 2000| 2| 2| HR|
| 2|gupta| 3000| 1| 1| IT|
+---+-----+------+---+---+----+

Thanks for reading 🙂 Do clap 👏👏 if you find it useful 🙂

“Keep learning and keep sharing knowledge”


Nidhi Gupta

Azure Data Engineer 👨‍💻. Heading towards cloud technologies expertise ✌️.