The promotion process performs a series of checks on the instances in your environment before replacing the existing database master. The flow diagram below shows the high-level process.
A record of all the actions and commands executed during a database promotion is available in the More Options area of your environment page.
How Does Replica Promotion Work?
Replica promotion is a manual event, triggered using a link displayed on the cloud dashboard next to any replica database. It is not automated, primarily because of the damage that can result from a false positive in whatever check would initiate an automatic failover. For example, many hard failure patterns manifest simply as an unresponsive database or server; however, a deploy of poorly indexed queries that overloads the database can produce the same pattern. In that case, an automatic failover would leave the application in the same overloaded condition, but the redundancy of having a replica would be gone, making the situation worse.
In addition, database replication is asynchronous by nature, so a replica promotion could also result in the loss of some transactional data. When the master is unresponsive, there is also no effective means of validating the health of its attached replicas.
For these reasons, the safest current practice is to insert a human decision into the process. With a proactive monitoring support plan, you do have the option of defining a default action our VNOC should take in these types of scenarios.
In the flow diagram, actions are color-coded by the database they target:
- Light blue actions -- Commands issued against the replica database.
- Light orange actions -- Commands issued against the master database.
We define instance health as:
- The instance is present and we can communicate with it.
- The instance can be written to (that is, I/O is possible). For the database master, we perform a write/delete test in the database process. For replicas, we test I/O in the volume.
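The two write checks above can be illustrated with a minimal sketch. It is not the platform's actual implementation: the function names and the `health_probe` table are hypothetical, and Python's built-in `sqlite3` module stands in for the real database engine purely so the example runs on its own.

```python
import sqlite3
import tempfile


def database_write_delete_test(conn: sqlite3.Connection) -> bool:
    """Master-style check: insert a probe row, then delete it again.

    Hypothetical helper; sqlite3 is only a stand-in for the real database.
    """
    try:
        cur = conn.cursor()
        cur.execute(
            "CREATE TABLE IF NOT EXISTS health_probe (id INTEGER PRIMARY KEY)"
        )
        cur.execute("INSERT INTO health_probe DEFAULT VALUES")
        cur.execute("DELETE FROM health_probe WHERE id = ?", (cur.lastrowid,))
        conn.commit()
        return True
    except sqlite3.Error:
        # Any database error (including a closed connection) fails the check.
        return False


def volume_io_test(directory: str) -> bool:
    """Replica-style check: write and read back a small file on the volume.

    The temporary file is removed automatically when it is closed.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as probe:
            probe.write(b"ok")
            probe.flush()
            probe.seek(0)
            return probe.read() == b"ok"
    except OSError:
        # Missing directory, read-only volume, full disk, and so on.
        return False
```

In this sketch, either function returning `False` would mark the instance as unhealthy; how the real checks report failure is not specified here.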
Replica Current with Master
A replica is considered current when it is in sync with its database master's events. A replication lag greater than 60 seconds means the replica is not current, and the promotion is aborted.
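The 60-second rule can be expressed as a small check. This is a sketch only: the function and constant names are hypothetical, and treating an unmeasurable lag (`None`) as "not current" is an assumption, not something stated above.

```python
from typing import Optional

# Threshold taken from the text above: more than 60 seconds is not current.
MAX_REPLICATION_LAG_SECONDS = 60


def replica_is_current(lag_seconds: Optional[float]) -> bool:
    """Return True when the replica is close enough to the master to promote."""
    # Assumption: if the lag cannot be measured, treat the replica as not current.
    if lag_seconds is None:
        return False
    return lag_seconds <= MAX_REPLICATION_LAG_SECONDS
```

With this definition, a lag of exactly 60 seconds still counts as current, because the text only rules out a lag greater than 60 seconds.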
Locking means that the database instance will not accept writes. We lock the database master as a preventive measure to avoid data loss (if the application is referencing it by hostname).