Mastering MongoDB - Faster elections during rolling maintenance

Written by shyam.arjarapu | Published 2018/06/15
Tech Story Tags: mongodb | database | database-administration | mastering-mongodb

Maintenance and upgrades on a MongoDB replica set are typically performed in a rolling fashion: you perform the maintenance on one Secondary at a time, and the Primary goes through the maintenance last.
When you stepDown the Primary, all the eligible Secondaries hold an election for a new Primary. Until a new Primary is elected, the database is not available for writes. So, ‘How would you quickly elect a new Primary while performing rolling maintenance or an upgrade?’
This is one of many articles in the multi-part series Mastering MongoDB, created to help you master MongoDB by learning one tip a day. Over a few articles in the series, I would like to give various tips to help you answer the above question.
This article discusses rolling maintenance, the implications of not having a Primary, the steps required to elect a new Primary quickly, and finally the pros and cons of the approach.

Mastering — Rolling maintenance

MongoDB offers redundancy and high availability of the database via replica sets. Replica sets not only help the database recover quickly from node failures and network partitions, but also give you the ability to perform maintenance tasks without affecting high availability.
The key to staying highly available while still performing maintenance is ‘rolling maintenance’, where the maintenance is performed on one Secondary at a time:
  1. Stop the MongoDB process/service on a Secondary
  2. Perform the required maintenance/upgrade on the server
  3. Start the MongoDB process/service on the server
  4. Wait for MongoDB on the server to catch up on the Oplog
  5. Repeat the above on the other secondaries in the replica set
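Step 4 above, waiting for a member to catch up on the Oplog, can be checked from an rs.status() document. The following is a minimal sketch in plain JavaScript, not an official helper: the function names lagSeconds and isCaughtUp and the maxLagSecs threshold are assumptions made for illustration, and the sample document is a trimmed, made-up stand-in for real rs.status() output.

```javascript
// Replication lag, in seconds, of one member behind the Primary,
// computed from an rs.status()-shaped document.
function lagSeconds(status, host) {
  const primary = status.members.find(m => m.stateStr === "PRIMARY");
  const member = status.members.find(m => m.name === host);
  if (!primary || !member) {
    throw new Error("primary or member not found in status");
  }
  // Subtracting two Date objects yields milliseconds.
  return (primary.optimeDate - member.optimeDate) / 1000;
}

// Hypothetical helper: true when the member is a healthy SECONDARY
// and is at most maxLagSecs behind the Primary.
function isCaughtUp(status, host, maxLagSecs) {
  const member = status.members.find(m => m.name === host);
  return Boolean(member) &&
    member.health === 1 &&
    member.stateStr === "SECONDARY" &&
    lagSeconds(status, host) <= maxLagSecs;
}

// Trimmed, illustrative status document (values are made up).
const sampleStatus = {
  members: [
    { name: "mon01:27000", health: 1, stateStr: "PRIMARY",
      optimeDate: new Date("2018-06-15T00:57:21Z") },
    { name: "mon03:27000", health: 1, stateStr: "SECONDARY",
      optimeDate: new Date("2018-06-15T00:57:16Z") }
  ]
};

console.log(lagSeconds(sampleStatus, "mon03:27000"));     // 5
console.log(isCaughtUp(sampleStatus, "mon03:27000", 10)); // true
```

In practice you would poll until the check passes for the member you just restarted before moving on to the next Secondary.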
Given a replica set with 3 MongoDB servers, mon01 (Primary), mon02 (Secondary) and mon03 (Secondary), the rolling maintenance process typically requires you to:
  1. Perform the maintenance on the Secondary server, mon03
  2. Perform the maintenance on the other Secondary server, mon02
  3. stepDown the Primary server, mon01
  4. Wait for new Primary to be elected, let’s say mon02
  5. Perform the maintenance on the former primary server, mon01
For more detailed information on rolling upgrades, please read Your Ultimate Guide to Rolling Upgrades by Bryan Reinero.

Implications of not having a Primary

By default, both read and write operations are executed on the Primary. You may use the secondary read preference to read from one of the Secondaries. However, the Primary is the only member in the replica set that receives write operations. So it is crucial to always have a Primary in your replica set.
When the Primary is not available or not reachable by the majority of Secondaries, all the eligible Secondaries hold an election for a new Primary. Until a new Primary is elected, all writes, and all reads directed at the Primary, originating from your client drivers will either wait for a Primary to become available or time out. So it is important to have a Primary elected quickly, so that the number of operations waiting for the Primary stays low.
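To get a feel for how many operations can pile up, here is a back-of-the-envelope sketch. It assumes a constant arrival rate and that every operation needs the Primary; the function name estimateBacklog and the rate of 10,000 operations per second are illustrative assumptions, not measurements.

```javascript
// Rough estimate of operations that queue up while there is no Primary.
// Assumes a constant arrival rate and that every operation needs the Primary.
function estimateBacklog(opsPerSecond, secondsWithoutPrimary) {
  return opsPerSecond * secondsWithoutPrimary;
}

// Illustrative rate of 10,000 operations per second.
console.log(estimateBacklog(10000, 10)); // 100000
console.log(estimateBacklog(10000, 3));  // 30000
```

Under these assumptions, cutting the time without a Primary from 10 seconds to 3 seconds reduces the backlog from 100,000 to 30,000 operations.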

How to quickly elect a new Primary

If a Primary is unexpectedly terminated or loses network connectivity to the majority of servers, the Secondaries can call an election only after missing its heartbeats for 10 seconds. So an unplanned failover takes some time.

StepDown the Primary

Stepping down the Primary expedites the failover procedure. It is therefore recommended to stepDown the Primary to forcefully trigger the election, rather than shut down the Primary and let the Secondaries discover that it is unreachable. I bet most of you are using this approach already. So, let’s review some of the other tips you could leverage before you stepDown() the Primary.

Make only one Secondary to be electable

If the replication lag on one of your Secondaries is low, you can proactively make it the only Secondary that can win the next election. Typically you choose a Secondary that has:
  1. Low replication lag
  2. Low network latency
  3. The same priority as the current Primary
  4. Or, failing that, the next highest priority
Assuming you want to pin the Secondary server mon02 as the next Primary, you can make the other Secondary server, mon03, ineligible to become Primary for 60 seconds by running rs.freeze(60) on it. This makes the election faster because mon02 is the only electable member when you stepDown mon01.
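With more than two Secondaries, it helps to list exactly which members need rs.freeze(). A minimal sketch in plain JavaScript, assuming an rs.status()-shaped document as input; the function name secondariesToFreeze is hypothetical and the sample document is trimmed to the fields used.

```javascript
// Returns the names of every SECONDARY except the chosen candidate;
// these are the members on which rs.freeze() would be run.
function secondariesToFreeze(status, candidateHost) {
  return status.members
    .filter(m => m.stateStr === "SECONDARY" && m.name !== candidateHost)
    .map(m => m.name);
}

// Trimmed stand-in for rs.status() on the three-member replica set.
const freezeStatus = {
  members: [
    { name: "mon01:27000", stateStr: "PRIMARY" },
    { name: "mon02:27000", stateStr: "SECONDARY" },
    { name: "mon03:27000", stateStr: "SECONDARY" }
  ]
};

console.log(secondariesToFreeze(freezeStatus, "mon02:27000")); // [ 'mon03:27000' ]
```

Note that rs.freeze() only affects the member it is run on, so each host in the returned list needs its own connection.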

Reduce the settings.electionTimeoutMillis

The default time limit for detecting when a replica set’s Primary is unreachable is 10 seconds. By reducing settings.electionTimeoutMillis to, let’s say, 2 seconds, you make the detection, and hence the election, faster.
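The change itself is a one-field edit of the replica set configuration document. The sketch below shows it as a small pure function so the edit can be validated before it is applied; the name withElectionTimeout is an assumption for illustration, and the sample document is a trimmed stand-in for rs.conf() output.

```javascript
// Returns a copy of a replica set config document with a new
// settings.electionTimeoutMillis, leaving the original untouched.
function withElectionTimeout(conf, millis) {
  if (!Number.isInteger(millis) || millis <= 0) {
    throw new RangeError("electionTimeoutMillis must be a positive integer");
  }
  const settings = Object.assign({}, conf.settings, { electionTimeoutMillis: millis });
  return Object.assign({}, conf, { settings: settings });
}

// Trimmed stand-in for the document returned by rs.conf().
const currentConf = {
  _id: "rs0",
  version: 1,
  settings: { heartbeatIntervalMillis: 2000, electionTimeoutMillis: 10000 }
};

const fasterConf = withElectionTimeout(currentConf, 2000);
console.log(fasterConf.settings.electionTimeoutMillis);  // 2000
console.log(currentConf.settings.electionTimeoutMillis); // 10000
```

In the mongo shell, the resulting document would then be passed to rs.reconfig() on the current Primary.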

Summary of steps for faster election

The steps below summarize how to get a faster election during the maintenance period. Please test them before running them in a production environment.
  1. Identify the server you want to be the next Primary
  2. Execute rs.freeze(60) on all other Secondaries
  3. Set settings.electionTimeoutMillis=2000 in the replica set configuration
  4. Execute rs.stepDown() on the current Primary
  5. Wait for the new Primary to be elected
  6. Reset settings.electionTimeoutMillis=10000 on the new Primary
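The steps above can be turned into a per-host checklist. This is a sketch only: buildFailoverPlan and its option names are hypothetical, it works on an rs.status()-shaped document, and it merely lists which shell helper to run on which host rather than executing anything.

```javascript
// Builds an ordered checklist of (host, action) pairs for a fast election.
// Nothing is executed; the actions are descriptions of shell helpers to run.
function buildFailoverPlan(status, candidateHost, opts) {
  const primary = status.members.find(m => m.stateStr === "PRIMARY");
  const secondaries = status.members.filter(m => m.stateStr === "SECONDARY");
  if (!primary) {
    throw new Error("no PRIMARY found in status");
  }
  if (!secondaries.some(m => m.name === candidateHost)) {
    throw new Error(candidateHost + " is not a SECONDARY");
  }

  const steps = [];
  // Step 2: freeze every Secondary except the chosen candidate.
  secondaries
    .filter(m => m.name !== candidateHost)
    .forEach(m => steps.push({ host: m.name, run: "rs.freeze(" + opts.freezeSecs + ")" }));
  // Steps 3 and 4: lower the timeout, then step down the current Primary.
  steps.push({ host: primary.name,
    run: "set settings.electionTimeoutMillis=" + opts.fastTimeoutMillis + ", then rs.reconfig(conf)" });
  steps.push({ host: primary.name, run: "rs.stepDown()" });
  // Steps 5 and 6: wait for the candidate, then restore the timeout there.
  steps.push({ host: candidateHost,
    run: "wait until rs.isMaster().primary is " + candidateHost });
  steps.push({ host: candidateHost,
    run: "set settings.electionTimeoutMillis=" + opts.defaultTimeoutMillis + ", then rs.reconfig(conf)" });
  return steps;
}

// Trimmed stand-in for rs.status() on the three-member replica set.
const planStatus = {
  members: [
    { name: "mon01:27000", stateStr: "PRIMARY" },
    { name: "mon02:27000", stateStr: "SECONDARY" },
    { name: "mon03:27000", stateStr: "SECONDARY" }
  ]
};

const plan = buildFailoverPlan(planStatus, "mon02:27000",
  { freezeSecs: 60, fastTimeoutMillis: 2000, defaultTimeoutMillis: 10000 });
plan.forEach((step, i) => console.log((i + 1) + ". [" + step.host + "] " + step.run));
```

Keeping the plan as data makes it easy to review the order of operations before touching a production replica set.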

Pros & Cons of the approach

Assume all the above suggestions worked out well for you. You may be wondering:
“If having a lower electionTimeoutMillis helps with quicker elections, then why can’t I keep it at a lower number all the time?”
Great question! Your application might be facing reduced traffic during the rolling maintenance period. Most importantly, you are closely monitoring all the servers and manually pinning a Secondary to be the next Primary. So it could be okay for you to have a lower electionTimeoutMillis value at that very moment.
However, setting electionTimeoutMillis to a low value not only results in faster failover but also makes the replica set more sensitive to slowness or spottiness of the Primary node or the network.
This may result in too many elections when there are transient network connectivity issues. Conversely, setting electionTimeoutMillis to a larger value makes your replica set more resilient to transient network interruptions but also results in a slower average failover time.
The bottom line is YMMV: you need to test various electionTimeoutMillis values and choose the one that suits you best. Or leave it at the default value of 10 seconds.
No matter what you do, “Never set the electionTimeoutMillis to a value less than the round-trip network latency time between two of your members.”
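That rule can be sanity-checked against the pingMs values reported by rs.status(). The sketch below is a rough guard, not a guarantee: the function names are hypothetical, pingMs values are treated as plain numbers, and pingMs only covers round trips from the member that produced the status document to the other members, so pairs that exclude that member are not measured.

```javascript
// Largest pingMs reported in an rs.status()-shaped document.
// The member that produced the status has no pingMs and is skipped.
function maxPingMs(status) {
  return status.members
    .map(m => Number(m.pingMs))
    .filter(ms => !Number.isNaN(ms))
    .reduce((a, b) => Math.max(a, b), 0);
}

// True when the proposed timeout exceeds every observed round trip.
function isSafeElectionTimeout(millis, status) {
  return millis > maxPingMs(status);
}

// Illustrative ping values: the local member, a nearby one, a distant one.
const pingStatus = {
  members: [
    { name: "mon01:27000" },
    { name: "mon02:27000", pingMs: 0 },
    { name: "mon03:27000", pingMs: 78 }
  ]
};

console.log(maxPingMs(pingStatus));                   // 78
console.log(isSafeElectionTimeout(2000, pingStatus)); // true
console.log(isSafeElectionTimeout(50, pingStatus));   // false
```

A single snapshot of pingMs can miss latency spikes, so treat a passing check as a lower bound and leave a generous margin.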

Hands-On lab exercises

This lab exercise helps you understand the steps needed to quickly elect a new Primary during a rolling maintenance.

Setup environment

First, you need an environment to play around in. I have created 3 RHEL v7.5 instances in AWS; you may just as well run them all on your localhost with /etc/hosts entries for the servers. If you already have a MongoDB v3.6 replica set environment, you may skip this step.
Download and untar the MongoDB v3.6 binaries, then start the MongoDB server on port 27000, bound to all IP addresses.
# Run these commands on all the 3 servers of yours.
# download v3.6
curl -O https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-rhel70-3.6.5.tgz
BIN_NAME="mongodb-linux-x86_64-rhel70-3.6.5"
BIN_VERSION="v3.6.5"
# create data directory and untar the binaries
tar -xvzf "$BIN_NAME.tgz"
rm "$BIN_NAME.tgz"
mv $BIN_NAME $BIN_VERSION
rm -rf data
mkdir data
# start the mongod on port 27000 with bind_ip_all
$BIN_VERSION/bin/mongod --dbpath data --logpath data/mongod.log --fork --replSet rs0 --port 27000 --bind_ip_all
# about to fork child process, waiting until server is ready for connections.
# forked process: 13442
# child process started successfully, parent exiting
# edit /etc/hosts to have the mon0X entries, of course with your own IPs
tail -3 /etc/hosts
# 35.167.113.204   mon01
# 35.167.113.203   mon02
# 35.167.113.206   mon03
A bash script to download MongoDB v3.6.5 and start mongod on port 27000

Initiate replica set

On server mon01, initiate a MongoDB replica set using the above hosts.
$BIN_VERSION/bin/mongo --port 27000 <<EOF
rs.initiate({
 _id: 'rs0',
 members: [
  { _id: 0, host : 'mon01:27000' },
  { _id: 1, host : 'mon02:27000' },
  { _id: 2, host : 'mon03:27000' }
] })
EOF
# MongoDB shell version v3.6.5
# connecting to: mongodb://127.0.0.1:27000/
# MongoDB server version: 3.6.5
# {
#   "ok" : 1,
#   "operationTime" : Timestamp(1529022819, 1),
#   "$clusterTime" : {
#     "clusterTime" : Timestamp(1529022819, 1),
#     "signature" : {
#       "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
#       "keyId" : NumberLong(0)
#     }
#   }
# }
# bye
$BIN_VERSION/bin/mongo --port 27000
# rs0:PRIMARY>
rs.isMaster().me
# mon01:27000
A bash script to initiate a replica set with 3 hosts we created earlier

Display the replica set config and status

Note the outputs of rs.config() and rs.status(). They show the current settings.electionTimeoutMillis (10000) and help you select a Secondary to be the next Primary based on the values of priority, optime, lastHeartbeat and pingMs.
rs.config()
/*
{
  "_id": "rs0",
  "version": 1,
  "protocolVersion": NumberLong(1),
  "members": [{
      "_id": 0,
      "host": "mon01:27000",
      "arbiterOnly": false,
      "buildIndexes": true,
      "hidden": false,
      "priority": 1,
      "tags": {
      },
      "slaveDelay": NumberLong(0),
      "votes": 1
    },
    {
      "_id": 1,
      "host": "mon02:27000",
      "arbiterOnly": false,
      "buildIndexes": true,
      "hidden": false,
      "priority": 1,
      "tags": {
      },
      "slaveDelay": NumberLong(0),
      "votes": 1
    },
    {
      "_id": 2,
      "host": "mon03:27000",
      "arbiterOnly": false,
      "buildIndexes": true,
      "hidden": false,
      "priority": 1,
      "tags": {
      },
      "slaveDelay": NumberLong(0),
      "votes": 1
    }
  ],
  "settings": {
    "chainingAllowed": true,
    "heartbeatIntervalMillis": 2000,
    "heartbeatTimeoutSecs": 10,
    "electionTimeoutMillis": 10000,
    "catchUpTimeoutMillis": -1,
    "catchUpTakeoverDelayMillis": 30000,
    "getLastErrorModes": {
    },
    "getLastErrorDefaults": {
      "w": 1,
      "wtimeout": 0
    },
    "replicaSetId": ObjectId("5b23096362ac76fdc504e6e1")
  }
}
*/
A JavaScript method to show the replica set configuration settings
rs.status()
/*
{
  "set": "rs0",
  "date": ISODate("2018-06-15T00:57:24.708Z"),
  "myState": 1,
  "term": NumberLong(1),
  "heartbeatIntervalMillis": NumberLong(2000),
  "optimes": {
    "lastCommittedOpTime": {
      "ts": Timestamp(1529024241, 1),
      "t": NumberLong(1)
    },
    "readConcernMajorityOpTime": {
      "ts": Timestamp(1529024241, 1),
      "t": NumberLong(1)
    },
    "appliedOpTime": {
      "ts": Timestamp(1529024241, 1),
      "t": NumberLong(1)
    },
    "durableOpTime": {
      "ts": Timestamp(1529024241, 1),
      "t": NumberLong(1)
    }
  },
  "members": [{
      "_id": 0,
      "name": "mon01:27000",
      "health": 1,
      "state": 1,
      "stateStr": "PRIMARY",
      "uptime": 1837,
      "optime": {
        "ts": Timestamp(1529024241, 1),
        "t": NumberLong(1)
      },
      "optimeDate": ISODate("2018-06-15T00:57:21Z"),
      "electionTime": Timestamp(1529022829, 1),
      "electionDate": ISODate("2018-06-15T00:33:49Z"),
      "configVersion": 1,
      "self": true
    },
    {
      "_id": 1,
      "name": "mon02:27000",
      "health": 1,
      "state": 2,
      "stateStr": "SECONDARY",
      "uptime": 1425,
      "optime": {
        "ts": Timestamp(1529024241, 1),
        "t": NumberLong(1)
      },
      "optimeDurable": {
        "ts": Timestamp(1529024241, 1),
        "t": NumberLong(1)
      },
      "optimeDate": ISODate("2018-06-15T00:57:21Z"),
      "optimeDurableDate": ISODate("2018-06-15T00:57:21Z"),
      "lastHeartbeat": ISODate("2018-06-15T00:57:24.663Z"),
      "lastHeartbeatRecv": ISODate("2018-06-15T00:57:23.502Z"),
      "pingMs": NumberLong(0),
      "syncingTo": "mon01:27000",
      "configVersion": 1
    },
    {
      "_id": 2,
      "name": "mon03:27000",
      "health": 1,
      "state": 2,
      "stateStr": "SECONDARY",
      "uptime": 1425,
      "optime": {
        "ts": Timestamp(1529024241, 1),
        "t": NumberLong(1)
      },
      "optimeDurable": {
        "ts": Timestamp(1529024241, 1),
        "t": NumberLong(1)
      },
      "optimeDate": ISODate("2018-06-15T00:57:21Z"),
      "optimeDurableDate": ISODate("2018-06-15T00:57:21Z"),
      "lastHeartbeat": ISODate("2018-06-15T00:57:24.571Z"),
      "lastHeartbeatRecv": ISODate("2018-06-15T00:57:23.075Z"),
      "pingMs": NumberLong(75),
      "syncingTo": "mon01:27000",
      "configVersion": 1
    }
  ],
  "ok": 1,
  "operationTime": Timestamp(1529024241, 1),
  "$clusterTime": {
    "clusterTime": Timestamp(1529024241, 1),
    "signature": {
      "hash": BinData(0, "AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      "keyId": NumberLong(0)
    }
  }
}
*/
A JavaScript method to show the replica set status information

Choose the potential next Primary

The rs.status() and db.printSlaveReplicationInfo() commands show that both Secondaries, mon02 and mon03, are caught up on the Oplog entries of the Primary, mon01. However, pingMs shows that mon02 is a lot closer to mon01 than mon03 is. So you may choose mon02 as the next potential Primary when stepping down the current Primary.
db.printSlaveReplicationInfo()
/*
source: mon02:27000
  syncedTo: Fri Jun 15 2018 01:29:11 GMT+0000 (UTC)
  0 secs (0 hrs) behind the primary
source: mon03:27000
  syncedTo: Fri Jun 15 2018 01:29:11 GMT+0000 (UTC)
  0 secs (0 hrs) behind the primary
*/

rs.status().members.map(x=>x.pingMs)
// [ undefined, NumberLong(0), NumberLong(78) ]
A JavaScript function to show the database printSlaveReplicationInfo command output
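The same choice can be made programmatically. The following sketch ranks healthy Secondaries by replication lag and then by pingMs; the function name pickCandidate is hypothetical, member priority is deliberately ignored (an assumption that only holds when all members share the same priority), and the sample document is trimmed to the fields the function reads.

```javascript
// Picks the healthy SECONDARY with the lowest replication lag,
// breaking ties by the lowest pingMs. Ignores member priority.
function pickCandidate(status) {
  const primary = status.members.find(m => m.stateStr === "PRIMARY");
  if (!primary) {
    throw new Error("no PRIMARY found in status");
  }
  const ranked = status.members
    .filter(m => m.stateStr === "SECONDARY" && m.health === 1)
    .map(m => ({
      name: m.name,
      lagMs: primary.optimeDate - m.optimeDate,
      pingMs: Number(m.pingMs)
    }))
    .sort((a, b) => (a.lagMs - b.lagMs) || (a.pingMs - b.pingMs));
  return ranked[0]; // undefined when there is no healthy Secondary
}

// Trimmed stand-in for rs.status(): both Secondaries are caught up,
// but one is much closer to the Primary than the other.
const labStatus = {
  members: [
    { name: "mon01:27000", health: 1, stateStr: "PRIMARY",
      optimeDate: new Date("2018-06-15T00:57:21Z") },
    { name: "mon02:27000", health: 1, stateStr: "SECONDARY",
      optimeDate: new Date("2018-06-15T00:57:21Z"), pingMs: 0 },
    { name: "mon03:27000", health: 1, stateStr: "SECONDARY",
      optimeDate: new Date("2018-06-15T00:57:21Z"), pingMs: 75 }
  ]
};

console.log(pickCandidate(labStatus).name); // mon02:27000
```

If your members have different priorities, the priority values from rs.config() would need to be factored into the ranking as well.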

Freeze the other Secondaries

Based on the above pingMs, we would not want the server mon03 to be elected as Primary. So, run the command below on mon03 to keep it from standing in the next election.
// Freeze mon03
rs.isMaster().me
// mon03:27000

rs.freeze(60)
/*
{
  "ok" : 1,
  "operationTime" : Timestamp(1529033711, 1),
  "$clusterTime" : {
    "clusterTime" : Timestamp(1529033711, 1),
    "signature" : {
      "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      "keyId" : NumberLong(0)
    }
  }
}
*/
A JavaScript method invoking rs.freeze to make the replica set member ineligible to become primary

Set electionTimeoutMillis and stepDown the Primary

Reconfigure electionTimeoutMillis in the replica set settings on the current Primary, mon01. Finally, execute rs.stepDown() to forcibly trigger the election, which elects mon02 as the next Primary.
// # rs0:PRIMARY>
rs.isMaster().me
// mon01:27000

var conf = rs.conf()
conf.settings.electionTimeoutMillis=2000
rs.reconfig(conf)
/*
{
  "ok": 1,
  "operationTime": Timestamp(1529025863, 1),
  "$clusterTime": {
    "clusterTime": Timestamp(1529025863, 1),
    "signature": {
      "hash": BinData(0, "AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      "keyId": NumberLong(0)
    }
  }
}
*/

rs.stepDown()
/*
2018-06-15T03:52:42.042+0000 E QUERY    [thread1] Error: error doing query: failed: network error while attempting to run command 'replSetStepDown' on host '127.0.0.1:27000'  :
DB.prototype.runCommand@src/mongo/shell/db.js:168:1
DB.prototype.adminCommand@src/mongo/shell/db.js:186:16
rs.stepDown@src/mongo/shell/utils.js:1341:12
@(shell):1:1
2018-06-15T03:52:42.043+0000 I NETWORK  [thread1] trying reconnect to 127.0.0.1:27000 (127.0.0.1) failed
2018-06-15T03:52:42.043+0000 I NETWORK  [thread1] reconnect 127.0.0.1:27000 (127.0.0.1) ok
*/

// rs0:SECONDARY>
rs.isMaster().primary
// mon02:27000
JavaScript code to set the electionTimeoutMillis to 2 seconds and stepDown the primary
You may notice that the new Primary is elected within ~2 seconds of the stepDown, compared to the default of 10–12 seconds. The mongod.log excerpts below, taken from the individual machines, show that mon02 wins the election at 03:52:43.363 and that its transition to Primary completes, with writes permitted, at 03:52:45.301, about 3 seconds after the stepDown at 03:52:42.
# Server mon02
tail -100 data/mongod.log | grep REPL
# 2018-06-15T03:52:42.304+0000 I REPL     [rsBackgroundSync] could not find member to sync from
# 2018-06-15T03:52:42.305+0000 I REPL     [replexec-28] Member mon01:27000 is now in state SECONDARY
# 2018-06-15T03:52:43.306+0000 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to mon01:27000: InvalidSyncSource: Sync source was cleared. Was mon01:27000
# 2018-06-15T03:52:43.360+0000 I REPL     [replexec-26] Starting an election, since weve seen no PRIMARY in the past 2000ms
# 2018-06-15T03:52:43.360+0000 I REPL     [replexec-26] conducting a dry run election to see if we could be elected. current term: 4
# 2018-06-15T03:52:43.360+0000 I REPL     [replexec-24] VoteRequester(term 4 dry run) received a yes vote from mon01:27000; response message: { term: 4, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1529034758, 1), $clusterTime: { clusterTime: Timestamp(1529034758, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
# 2018-06-15T03:52:43.360+0000 I REPL     [replexec-24] dry election run succeeded, running for election in term 5
# 2018-06-15T03:52:43.363+0000 I REPL     [replexec-28] VoteRequester(term 5) received a yes vote from mon01:27000; response message: { term: 5, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1529034758, 1), $clusterTime: { clusterTime: Timestamp(1529034758, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
# 2018-06-15T03:52:43.363+0000 I REPL     [replexec-28] election succeeded, assuming primary role in term 5
# 2018-06-15T03:52:43.363+0000 I REPL     [replexec-28] transition to PRIMARY from SECONDARY
# 2018-06-15T03:52:43.363+0000 I REPL     [replexec-28] Entering primary catch-up mode.
# 2018-06-15T03:52:43.584+0000 I REPL     [replexec-23] Caught up to the latest optime known via heartbeats after becoming primary.
# 2018-06-15T03:52:43.584+0000 I REPL     [replexec-23] Exited primary catch-up mode.
# 2018-06-15T03:52:45.301+0000 I REPL     [rsSync] transition to primary complete; database writes are now permitted


# Server mon03
tail -100 data/mongod.log | grep REPL
# 2018-06-15T03:52:29.029+0000 I REPL     [rsBackgroundSync] sync source candidate: mon01:27000
# 2018-06-15T03:52:31.631+0000 I REPL     [conn27] 'freezing' for 120 seconds
# 2018-06-15T03:52:42.794+0000 I REPL     [replication-1] Choosing new sync source because our current sync source, mon01:27000, has an OpTime ({ ts: Timestamp(1529034758, 1), t: 4 }) which is not ahead of ours ({ ts: Timestamp(1529034758, 1), t: 4 }), it does not have a sync source, and its not the primary (sync source does not know the primary)
# 2018-06-15T03:52:42.794+0000 I REPL     [replication-1] Canceling oplog query due to OplogQueryMetadata. We have to choose a new sync source. Current source: mon01:27000, OpTime { ts: Timestamp(1529034758, 1), t: 4 }, its sync source index:-1
# 2018-06-15T03:52:42.794+0000 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source mon01:27000 (config version: 4; last applied optime: { ts: Timestamp(1529034758, 1), t: 4 }; sync source index: -1; primary index: -1) is no longer valid
# 2018-06-15T03:52:42.794+0000 I REPL     [rsBackgroundSync] could not find member to sync from
# 2018-06-15T03:52:43.420+0000 I REPL     [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election because I am still waiting for stepdown period to end at 2018-06-09T13:09:29.408+0000 (mask 0x20)
# 2018-06-15T03:52:43.481+0000 I REPL     [ReplicationExecutor] Member mon01:27000 is now in state SECONDARY
# 2018-06-15T03:52:43.584+0000 I REPL     [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election because I am still waiting for stepdown period to end at 2018-06-09T13:09:29.408+0000 (mask 0x20)
# 2018-06-15T03:52:43.584+0000 I REPL     [ReplicationExecutor] Member mon02:27000 is now in state PRIMARY
# 2018-06-15T03:52:43.663+0000 I REPL     [replexec-31] Member mon02:27000 is now in state PRIMARY
# 2018-06-15T03:52:43.817+0000 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to mon01:27000: InvalidSyncSource: Sync source was cleared. Was mon01:27000
# 2018-06-15T03:52:45.795+0000 I REPL     [rsBackgroundSync] sync source candidate: mon02:27000
A bash script to show the transition of mon02 from Secondary to Primary
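To measure the failover time rather than eyeball it, you can subtract the timestamps of two log lines. The sketch below does this for the timestamp format shown in the mongod.log excerpts above; the function names are hypothetical, and the three sample lines are copied from the mon02 log.

```javascript
// Extracts the leading timestamp of a mongod log line as epoch milliseconds.
// The log uses an offset like +0000, which is rewritten to +00:00
// so that Date.parse receives a standard ISO 8601 string.
function logTimestampMs(line) {
  const match = /(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})([+-]\d{2})(\d{2})/.exec(line);
  if (!match) {
    throw new Error("no timestamp found in: " + line);
  }
  return Date.parse(match[1] + match[2] + ":" + match[3]);
}

// Milliseconds elapsed between two log lines.
function elapsedMs(firstLine, secondLine) {
  return logTimestampMs(secondLine) - logTimestampMs(firstLine);
}

// Lines taken from the mon02 log.
const steppedDown = "2018-06-15T03:52:42.305+0000 I REPL     [replexec-28] Member mon01:27000 is now in state SECONDARY";
const elected = "2018-06-15T03:52:43.363+0000 I REPL     [replexec-28] election succeeded, assuming primary role in term 5";
const writable = "2018-06-15T03:52:45.301+0000 I REPL     [rsSync] transition to primary complete; database writes are now permitted";

console.log(elapsedMs(steppedDown, elected));  // 1058
console.log(elapsedMs(steppedDown, writable)); // 2996
```

Measured this way on mon02, the election finished about 1 second after mon01 was seen as a Secondary, and writes were permitted about 3 seconds after it.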

Reset the electionTimeoutMillis on the new primary

Once the new Primary is elected, revert electionTimeoutMillis to its default value to avoid frequent elections during transient network connectivity issues.
rs.isMaster().me
// mon02:27000
// rs0:PRIMARY>
// on the new primary
var conf = rs.conf()
conf.settings.electionTimeoutMillis=10000
rs.reconfig(conf)
/*
{
  "ok": 1,
  "operationTime": Timestamp(1529034252, 1),
  "$clusterTime": {
    "clusterTime": Timestamp(1529034252, 1),
    "signature": {
      "hash": BinData(0, "AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      "keyId": NumberLong(0)
    }
  }
}
*/
JavaScript code to reset the electionTimeoutMillis back to its default value

Summary

One important reminder:
Although the MongoDB database remains available for reads from Secondaries during an election, it is not available for writes until a Primary is elected. So it is important to have a Primary available sooner rather than later to meet your SLA for writes.
With the tips discussed here, you can have a new Primary elected within 3 seconds. If your application was serving about 10,000 operations / second, you have about 30,000 operations waiting on the new Primary. Now, you may wonder: “What measures can I take to ensure that the database server would not cripple when all those 30,000 operations hit the new Primary at the same time?”
Again, it’s a great question, but that’s a topic for another day. Hopefully, you learned something new today as you scale the path to mastering MongoDB, one tip a day.

Previous Articles

Mastering MongoDB — One tip a day series
Series of articles solely created for you to master MongoDB
Tip # 003: Transactions
A long awaited and most requested feature for many, has finally arrived
Tip # 002: createRole
How to prevent someone dropping your collections?
Tip # 001: currentOp
Know the operations currently executing on MongoDB server inside out

Published by HackerNoon on 2018/06/15