
Nidhi Gupta · Sep 3, 2023

Joins in PySpark

In this article we will explore the reasons for joining tables/DataFrames and the types of joins used in PySpark.

The need for a clearer and more meaningful way to store data in tables led to the concept of data modelling in the world of data warehousing.

Data modelling offers two ways to store data in tables:

(i) Normalized method (snowflake schema): data is split across related tables, so queries must join the tables back together.

(ii) Denormalized method (star schema): data is kept together in wider, denormalized tables, so joins are largely avoided.

Joining tables/DataFrames in PySpark can be categorized into the following types: inner, left (outer), right (outer), full (outer), left semi, left anti, and cross joins.
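As a quick reference, the general join syntax in PySpark is df1.join(df2, on, how), where the how argument selects the join type. A minimal sketch (df1, df2, and key are placeholder names, not the DataFrames built below):

# General form: left_df.join(right_df, on=<join condition>, how=<join type>)
# Common values for how: 'inner', 'left'/'leftouter', 'right'/'rightouter',
# 'full'/'fullouter', 'leftsemi', 'leftanti', 'cross'
joined = df1.join(df2, df1.key == df2.key, 'inner')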

Let’s consider an example to understand the syntax used in each type of join.

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the app name here is just a placeholder
spark = SparkSession.builder.appName("joins").getOrCreate()

# Employee data: id, name, salary, department id
data1 = [(1, 'nidhi', 2000, 2), (2, 'gupta', 3000, 1), (3, 'abcd', 1000, 4)]
schema1 = ['id', 'name', 'salary', 'dep']

# Department data: id, name
data2 = [(1, 'IT'), (2, 'HR'), (3, 'Payroll')]
schema2 = ['id', 'name']

empof = spark.createDataFrame(data=data1, schema=schema1)
depof = spark.createDataFrame(data=data2, schema=schema2)

empof.show()
depof.show()

+---+-----+------+---+
| id| name|salary|dep|
+---+-----+------+---+
| 1|nidhi| 2000| 2|
| 2|gupta| 3000| 1|
| 3| abcd| 1000| 4|
+---+-----+------+---+

+---+-------+
| id| name|
+---+-------+
| 1| IT|
| 2| HR|
| 3|Payroll|
+---+-------+

(i) Inner Join

# INNER JOIN: keep only rows whose join key exists in both DataFrames
empof.join(depof, empof.dep == depof.id, 'inner').show()

+---+-----+------+---+---+----+
| id| name|salary|dep| id|name|
+---+-----+------+---+---+----+
|…
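The full output is cut off above, but with the sample data an inner join should keep only the employees whose dep value has a matching id in depof (employees 1 and 2; employee 3 has dep = 4, which matches no department). Since both DataFrames use the column names id and name, the joined result also repeats those headers; below is a minimal sketch of the same inner join with the department columns renamed first (an alternative not shown in the original, assuming the empof and depof DataFrames defined above):

# Rename depof's columns so the joined output has unambiguous headers
dep_renamed = depof.withColumnRenamed('id', 'dep_id').withColumnRenamed('name', 'dep_name')
empof.join(dep_renamed, empof.dep == dep_renamed.dep_id, 'inner').show()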

