This is another important and among the most asked topics in the interview process. Partitioning and clustering play an important role when we have a huge amount of data and this huge data needs to be stored in the database or data warehouse.

Partitioning

Partitioning is the process of breaking a large table into smaller tables. Each partition is created based on the partitioning key.

As a huge amount of data is stored, performance and scaling get affected. Partitioning helps in dividing a large table into small tables hence increasing the table scans and reducing memory swap, ultimately increasing the performance.

Ways of partitioning data in a database using partitioning key:

  1. Horizontal Partitioning: It refers to partitioning data horizontally i.e. dividing data based on the rows.
  2. Vertical Partitioning: It refers to partitioning data vertically means dividing data based on the columns.

Clustering

Clustering is the process where data is grouped together based on similarities. Each cluster contains the whole amount of data based on the similarities they are grouped. Clustering is supported only for partitioned tables. Let’s understand clustering.

Clustering in Elasticsearch

In the Elastic search database, each cluster contains exact same amount of data. Each cluster is further divided into multiple nodes. Each node further gets split into multiple shards. Each shard will have its replica in order to save data from data loss.

Clustering in Cassandra

In the Cassandra database, we have a clustering column/key. Each database has a primary key but in Cassandra's database, we have a clustering key and partitioning key. The clustering key is responsible for the re-ordering/re-indexing of the database while the partition key is responsible for creating a partition in the database.

primary key = clustering key + partitioning key

Clustering in RabbitMQ

So, in similar way clustering works in PostgreSQL, MySQL, Oracle, Big Query, etc….

Clustering improves efficiency, but there are some limitations:

  1. Clustering is only supported for partitioned tables.
  2. We can specify the clustering column only while creating a table. We can’t modify it later.
  3. We can specify a maximum of four non-repeated columns for clustering.
  4. Clustering can only be used with standard SQL.

Thanks for the read🙂.Do clap 👏👏 if find it useful.

“Keep learning and keep sharing knowledge”

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Nidhi Gupta

Nidhi Gupta

490 Followers

Database Engineer 👨‍💻.Heading towards cloud technologies✌️.