Partitioning vs Clustering
This is another important and among the most asked topics in the interview process. Partitioning and clustering play an important role when we have a huge amount of data and this huge data needs to be stored in the database or data warehouse.
Partitioning is the process of breaking a large table into smaller tables. Each partition is created based on the partitioning key.
As a huge amount of data is stored, performance and scaling get affected. Partitioning helps in dividing a large table into small tables hence increasing the table scans and reducing memory swap, ultimately increasing the performance.
Ways of partitioning data in a database using partitioning key:
- Horizontal Partitioning: It refers to partitioning data horizontally i.e. dividing data based on the rows.
- Vertical Partitioning: It refers to partitioning data vertically means dividing data based on the columns.
Clustering is the process where data is grouped together based on similarities. Each cluster contains the whole amount of data based on the similarities they are grouped. Clustering is supported only for partitioned tables. Let’s understand clustering.
Clustering in Elasticsearch
In the Elastic search database, each cluster contains exact same amount of data. Each cluster is further divided into multiple nodes. Each node further gets split into multiple shards. Each shard will have its replica in order to save data from data loss.
Primary shards & Replica shards in Elasticsearch
Shards offer the most competitive balance between allocation speed, node balancing, and overall cluster management.
Clustering in Cassandra
In the Cassandra database, we have a clustering column/key. Each database has a primary key but in Cassandra's database, we have a clustering key and partitioning key. The clustering key is responsible for the re-ordering/re-indexing of the database while the partition key is responsible for creating a partition in the database.
primary key = clustering key + partitioning key
Clustering in RabbitMQ
RabbitMQ Server And Clustering Setup on Ubuntu 18.04
Working on one of my recent projects had a requirement to add logs from rabbitMq to the Elasticsearch.
So, in similar way clustering works in PostgreSQL, MySQL, Oracle, Big Query, etc….
Clustering improves efficiency, but there are some limitations:
- Clustering is only supported for partitioned tables.
- We can specify the clustering column only while creating a table. We can’t modify it later.
- We can specify a maximum of four non-repeated columns for clustering.
- Clustering can only be used with standard SQL.
Thanks for the read🙂.Do clap 👏👏 if find it useful.
“Keep learning and keep sharing knowledge”