MongoDB Sharding

What is Sharding in MongoDB?

Sharding is a concept in MongoDB, which splits large data sets into small data sets across multiple MongoDB instances.
Sometimes the data within MongoDB will be so huge, that queries against such big data sets can cause a lot of CPU utilization on the server. To handle this situation, MongoDB has a concept of Sharding, which is basically the splitting of data sets across multiple MongoDB instances.
The collection which could be large in size is actually split across multiple collections or Shards as they are called. Logically all the shards work as one collection.

Basics of MongoDB sharding

Clusters are used to implement the shards, Which are the group of MongoDB instances.
The components of a Shard include

A Shard – This is the basic thing, and this is nothing but a MongoDB instance which holds the subset of the data.

Config server – This is a mongodb instance which holds metadata about the cluster, basically information about the various mongodb instances which will hold the shard data.

A Router – This is a mongodb instance which basically is responsible for re-directing the commands sent by the client to the right servers.

sharding

The above diagram from the official MongoDB docs explains the relationship between each component:

  1. The application communicates with the routers (mongos) about the query to be executed.
  2. The mongos instance consults the config servers to check which shard contains the required data set to send the query to that shard.
  3. Finally, the result of the query will be returned to the application.

Shard Keys

When sharding a MongoDB collection, a shard key gets created as one of the initial steps. The “shard key” is used to split the MongoDB collection’s documents across all the shards. The key consists of a single field or multiple fields in every document. The sharded key is immutable and cannot be changed after sharding. A sharded collection only contains a single shard key.

When sharding a populated collection, the collection must have an index that starts with the shard key. For empty collections that don’t have an appropriate index, MongoDB will create an index for the specified shard key.

The shard key can directly have an impact on the performance of the cluster. Hence can lead to bottlenecks in applications associated with the cluster. To mitigate this, before sharding the collection, the shard key must be created based on:

  • The schema of the data set
  • How the data set is queried

Setup MongoDB Sharding on local ubuntu machine

Now, we understand the concepts of sharding, Lets try to setup it on our local machine (Ununtu 18.04).

  1. Configure Config Server
    We will create one replica set with 3 members as a config server.



    We need to assign a port and set the cluster role as configsvr to mark it as a config server under common replicaSet csrs. So our config file will look like

    We will create a config file for the rest of the 2-members of the config replica set. We just need to change ports.
    So our csrs2.conf and csrs3.conf  are as below



    We have created config files. Now we can start 3 mongod instances for each member.


    To verify all mongod instances are running by checking logs


    Connect to mongo server on port 26001. Add other 2 as members.


    Note : Here PCNAME is our system name. Please replace it with your system name or host.
    Now our config server replica set is configured and running.
  2.  Setting up Shards
    We will deploy two Shards. Each shard will have a 3 Member replica set.

    1. Setting up Shard-1


      Setting up shard replicaset is exactly same what we already did with config replica set. We will just set a clusterRole: shardsvr and we are taking a different replSetName: sh01 that will be common across all the three member’s of same shard replica set.


      We will do the same for the other 2 members. Just create 2 more conf files sh012.conf and sh013.conf. Change the ports to 27012, 27013 and replace the log and db path’s.


      Now we will connect to one of them and configure them to be part of the same replica set.


      Now we have Shard 01 replica set up and running.
    2. Setting up Shard-2
      It’s exactly the same as Shard 01 set-up. Just we have to do find and replace sh01 to sh02 🙂 and assign new port’s for all the configuration and folder structure.


      Create Server Configuration file sh021.conf, sh022.conf and sh023.conf


      Note :- Copy the same file for other two members and change ports to 27022,27023.
      Now we can start the shard server’s. running on port [27021, 27022, 27023]


      Connect with one mebmber and add other members to make them part of same replica set.


      Now we have Shard 02 replica set up and running.
  3. Configuring Router
    Mongo router is a db instance with no database of its own.


    Create router config file mongos.conf. We will point configDB to our config server. We don’t need a storage section as the router does not have a db of it’s own. Our config file will look like:

    Now start the mongos instance using this config and then we connect to our mongos router on port 26000 and add shards to it.

    We can see our sharded clustered environment is up and running.


    All done.. Mongodb sharded Clustered Environment is UP and Running.  Hope this was helpful.

    Note – There are three ways we can split data across shard. To start with we can use range or hashed shard. We can also distribute our data based on zone or location. I will try to cover this up in my next blog.

Comments