Joins in PySpark
In this article we will explore why tables/DataFrames are joined and the types of joins used in PySpark.
The need for a clearer and more meaningful way to store data in tables led to the concept of data modelling in the data warehousing world.
Data modelling offers two ways to store data in tables:
(i) Normalized method (Snowflake schema): normalization leads to joining of tables.
(ii) Denormalized method (Star schema): no concept of joining tables. A sketch contrasting the two follows.
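To make the contrast concrete, here is a minimal sketch using the same employee/department data as the example below; the table contents are illustrative, not part of the original article:
# Normalized (snowflake-style): two tables linked by a department key,
# so retrieving an employee's department name requires a join.
employees = [(1,'nidhi',2000,2),(2,'gupta',3000,1)]   # (id, name, salary, dep id)
departments = [(1,'IT'),(2,'HR')]                     # (id, name)
# Denormalized (star-style): one wide table with the department name
# embedded, so no join is needed.
employees_wide = [(1,'nidhi',2000,'HR'),(2,'gupta',3000,'IT')]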
Joining tables/DataFrames can be categorized as follows: inner join, left (outer) join, right (outer) join, full (outer) join, left semi join, left anti join, and cross join. All of them share the same general syntax, sketched below.
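Every variant goes through the same DataFrame.join API; only the how argument changes. A minimal sketch of the general form, where df1, df2, and the key columns are placeholders:
# General form of a PySpark join; 'how' selects the join type, e.g.
# 'inner', 'left', 'right', 'full', 'left_semi', 'left_anti', 'cross'.
joined = df1.join(df2, df1.key == df2.key, how='inner')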
Let’s consider an example to understand the syntax used in each type of join.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('joins').getOrCreate()

# Employee data: (id, name, salary, department id)
data1 = [(1,'nidhi',2000,2),(2,'gupta',3000,1),(3,'abcd',1000,4)]
schema1 = ['id','name','salary','dep']
# Department data: (id, name)
data2 = [(1,'IT'),(2,'HR'),(3,'Payroll')]
schema2 = ['id','name']
empof = spark.createDataFrame(data=data1,schema=schema1)
depof = spark.createDataFrame(data=data2,schema=schema2)
empof.show()
depof.show()
+---+-----+------+---+
| id| name|salary|dep|
+---+-----+------+---+
| 1|nidhi| 2000| 2|
| 2|gupta| 3000| 1|
| 3| abcd| 1000| 4|
+---+-----+------+---+
+---+-------+
| id| name|
+---+-------+
| 1| IT|
| 2| HR|
| 3|Payroll|
+---+-------+
(i) Inner Join
An inner join returns only the rows whose keys match in both DataFrames; employee 'abcd' has dep = 4, which matches no department id, so that row is dropped.
#INNER JOIN
empof.join(depof,empof.dep == depof.id,'inner').show()
+---+-----+------+---+---+----+
| id| name|salary|dep| id|name|
+---+-----+------+---+---+----+
|  1|nidhi|  2000|  2|  2|  HR|
|  2|gupta|  3000|  1|  1|  IT|
+---+-----+------+---+---+----+
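Both input DataFrames contribute an id and a name column, so the joined result carries duplicate column names. One way to keep the result unambiguous is to select qualified columns and alias the department name; a minimal sketch (the dep_name alias is illustrative, not part of the original example):
#INNER JOIN with disambiguated columns
empof.join(depof, empof.dep == depof.id, 'inner') \
     .select(empof.id, empof.name, empof.salary, depof.name.alias('dep_name')) \
     .show()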