Failover is a change of peer roles in a replication configuration where the backed up data on a destination peer becomes writeable and the previous source peer ceases to be the replication source.
Failover capabilities are available when a protected path is active between two peers. Failover is initiated from a destination peer in a replication peer configuration. When failover is done, the initiating destination peer changes role to source and the replicated data on that peer is writeable. Client applications can connect to the cluster that had previously performed the destination role in replication and continue operating with that cluster instead of using the original primary cluster.
Note
Switching client applications to the backup cluster is not seamless, because the backup cluster is a separate cluster. Clients must mount the backup cluster, and the backup cluster needs VMS configurations similar to those that were in place on the primary cluster. For details, see Deploying a Failed-Over Replication Peer as a Working Cluster.
There are two types of failover processes, which are used for both failing over and failing back depending on the scenario:
Graceful failover ensures no data is lost. It requires that the peers can communicate with each other. In a graceful failover, data is completely synced between the peers, the replicated data on the initiating destination peer becomes writeable and the data on the source peer becomes read-only.
In a group of more than two replication peers, the peer that wants to become the source sends a request to the current source peer. The source rejects the request if there is a conflicting process, for example if another destination peer is already in the process of becoming the source or if a group member is being added or removed. If the request is approved, the former source peer takes a final snapshot, replicates it to the new source, informs the destination peers that it is no longer the source, and notifies the requesting peer that it can now become the source peer. The new source peer then starts to synchronize with its destination peers, identifying the last common restore point with each peer.
After failover, replication resumes. In one-on-one replication, replication resumes in the reverse direction. In group replication, replication resumes from the new source, with the other group members as destination peers. Clients can resume operating by connecting to views on the backup cluster.
During the graceful failover process, there is a period in which all peers are read-only.
A graceful failover proceeds as follows:
-
In one-on-one replication, a user initiates graceful failover from the destination peer's VMS. In group replication, a user initiates graceful failover from the VMS of one of the destination peers.
-
The source peer receives a request from the destination peer to take over as the source peer. If another destination peer is already in the process of becoming the source, the source peer rejects the request. Otherwise, the source peer accepts the request and continues with the subsequent step.
-
The source peer becomes read-only, while the destination peer that is requesting the source role temporarily remains read-only as well.
-
If replication was already in progress before the failover was initiated, it is completed.
-
Data is synced between the source and destination peers, by transferring any delta between the last snapshot taken on the source and the data on the source at the time of initiating failover.
-
The former source peer communicates to the requesting destination peer that it can now become the source peer.
-
The new source peer negotiates the most recent common restore point with each destination peer.
-
The now synced replica of the data from the established sync point on the new source peer becomes writeable.
-
Replication resumes, with the requesting peer now taking on the source peer role. The former source peer becomes a destination peer and, in group replication, any remaining destination peers remain destination peers.
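The steps above can be sketched as a small in-memory model. This is only an illustration of the role and writeability transitions, not the product's implementation: the `Peer` class, the `graceful_failover` function and the use of a list as a stand-in for path contents are all hypothetical, and the sketch assumes the replica is a prefix of the source data so that the delta is simply the missing tail.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    name: str
    role: str                      # "source" or "destination"
    writeable: bool
    data: list = field(default_factory=list)  # stand-in for the protected path


def graceful_failover(source, requester, others=(), takeover_in_progress=False):
    """Apply the graceful failover steps to the model and return a log."""
    # The source rejects the request if another takeover is already under way.
    if takeover_in_progress:
        return ["rejected"]
    log = ["accepted"]
    # The source becomes read-only while the requester is still read-only.
    source.writeable = False
    log.append("all peers read-only")
    # Complete replication by transferring whatever the requester lacks.
    delta = source.data[len(requester.data):]
    requester.data.extend(delta)
    log.append(f"synced {len(delta)} item(s)")
    # Hand over the role; the synced replica becomes writeable.
    source.role, requester.role = "destination", "source"
    requester.writeable = True
    # Remaining group members stay destinations of the new source.
    for peer in others:
        peer.role = "destination"
    log.append(f"replication resumed from {requester.name}")
    return log


a = Peer("A", "source", True, ["w1", "w2", "w3"])
b = Peer("B", "destination", False, ["w1", "w2"])
print(graceful_failover(a, b))
print(a.role, a.writeable, b.role, b.writeable)
```

The log makes the read-only window visible: it starts when the source is locked and ends only after the delta has been transferred, which is why a smaller delta shortens the downtime.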
The result of graceful failover is:
-
The protected path on the former source peer is now read-only, while the replicated path on the destination peer that took on the source role is now writeable.
-
Replication is enabled in the reverse direction relative to the pre-failover configuration for the former source and the new source.
Note
A protection policy that specifies a destination replication peer is automatically mirrored on that replication peer when it is created. In case of failover, the mirrored protection policy is used to continue replicating in the reverse direction.
In group replication, the protection policies used for replicating after failover are the ones defined for replication from the new source to each destination peer. For example, if peer A replicates to peer B and to peer C and there is a failover to peer B, then after failover B replicates to C in accordance with the protection policy defined between B and C, which can differ from the policy that A used to replicate to C when it was the source.
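The A, B, C example can be pictured as a lookup keyed by the (source, destination) pair. The sketch below is purely illustrative: the `policies` table, the schedule strings and the `active_policies` helper are hypothetical and do not correspond to any VMS object or API.

```python
# Hypothetical policy table keyed by (source, destination) pair.
# The schedule strings are made-up values for illustration only.
policies = {
    ("A", "B"): "every 15 minutes",
    ("A", "C"): "every hour",
    ("B", "A"): "every 15 minutes",  # mirror of the A -> B policy
    ("B", "C"): "every 6 hours",     # defined independently of A -> C
}


def active_policies(source, peers, policies):
    """Return the policy that applies to each destination while `source` is the source."""
    return {
        dst: policies[(source, dst)]
        for dst in peers
        if dst != source and (source, dst) in policies
    }


peers = ["A", "B", "C"]
print(active_policies("A", peers, policies))  # before failover
print(active_policies("B", peers, policies))  # after failover to B
```

After the failover to B, the policy that governs replication to C is the B to C entry, not the A to C entry that applied before.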
Ungraceful failover is an option that is available even if there is no communication between the peers. In an ungraceful failover:
-
The replicated path on the former destination peer that initiated the failover becomes writeable. If the primary cluster is still operating, it also remains writeable.
-
In one-on-one replication, replication is suspended, and the peers change roles from source and destination to a third role called standalone, which reflects that the paths on both are writeable and that there is currently no replication between them. In group replication, one of the destination peers takes the role of source peer and the former source becomes standalone if it is still operating.
An ungraceful failover proceeds as follows:
-
A user initiates ungraceful failover from a destination peer's VMS.
The initiating peer becomes standalone and then locally marks itself as source. It sends RPCs to all destination peers communicating that it is becoming the source. In case of conflict between more than one destination peer becoming the source, an alert is issued telling administrators to choose a source and reattach disconnected destination peers to the chosen source.
-
Replication between the peers is suspended.
-
The replicated path on the established new source peer becomes writeable. In one-on-one replication, the peers remain standalone and the replication remains suspended. In group replication, replication resumes with the writeable peer becoming the source peer and all attached destination peers remaining destination peers.
The result of an ungraceful failover is:
-
The protected path on the former source peer (if still available) and the replicated path on the new source are both writeable.
-
In one-on-one replication, replication between the peers is suspended.
-
Data that was not yet replicated is lost. If you are resuming operations using the protected path on the former destination peer, any data that was not replicated from the former source peer to the former destination peer is lost on the former destination peer.
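The role changes in an ungraceful failover, including the conflict case, can be sketched as follows. This is a simplified model under stated assumptions: the `ungraceful_failover` function, the role strings and the alert text are hypothetical, and the sketch assumes the former source is marked standalone regardless of whether it is reachable.

```python
def ungraceful_failover(roles, initiators):
    """Model role changes after an ungraceful failover.

    roles: dict mapping peer name to "source" or "destination".
    initiators: destination peers on which failover was initiated.
    Returns (new_roles, replication_state, alert).
    """
    old_source = next(p for p, r in roles.items() if r == "source")
    if len(initiators) > 1:
        # More than one peer claims the source role: an administrator must
        # choose a source and reattach the other peers to it.
        alert = "conflict: choose one source among " + ", ".join(sorted(initiators))
        return dict(roles), "suspended", alert
    new_roles = dict(roles)
    new_roles[old_source] = "standalone"
    initiator = initiators[0]
    if len(roles) == 2:
        # One-on-one: both peers are standalone and replication stays suspended.
        new_roles[initiator] = "standalone"
        return new_roles, "suspended", None
    # Group: the initiator becomes the source and replication resumes.
    new_roles[initiator] = "source"
    return new_roles, "resumed", None


print(ungraceful_failover({"A": "source", "B": "destination"}, ["B"]))
print(ungraceful_failover({"A": "source", "B": "destination", "C": "destination"}, ["B"]))
```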
If the primary cluster is still in communication, you can perform a graceful failover.
In order to minimize the effective "downtime" in which the path on both peers is read-only, proceed as follows:
-
If replication is in progress, wait until it completes before starting failover.
-
If no replication has taken place for some time before the planned failover, force a replication by altering the schedule in the protection policy so that a replication takes place in the near future.
-
If replication takes time such that there is likely a significant delta to be synced to the destination peer, force a second replication to take place immediately after replication completes. This will be the quickest possible replication and enable you to start failover with the smallest delta.
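The checklist above can be expressed as a small decision helper. This is a sketch only: the function name, its parameters and the two thresholds (`stale_after` and `long_run`, in minutes) are hypothetical values chosen for illustration, and the right thresholds depend on your own replication schedule and data change rate.

```python
def pre_failover_action(in_progress, minutes_since_last, last_duration_minutes,
                        stale_after=30, long_run=10):
    """Suggest the next step before starting a graceful failover.

    stale_after and long_run are illustrative thresholds, not product defaults.
    """
    if in_progress:
        return "wait for the current replication to complete"
    if minutes_since_last > stale_after:
        return "force a replication by moving the schedule forward"
    if last_duration_minutes > long_run:
        # A long replication leaves time for a new delta to accumulate.
        return "force a second replication immediately"
    return "start failover"


print(pre_failover_action(True, 5, 2))
print(pre_failover_action(False, 120, 2))
print(pre_failover_action(False, 1, 45))
print(pre_failover_action(False, 1, 2))
```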
If the primary cluster loses communication or is destroyed, it is possible to fail over to the destination peer with an ungraceful failover. In this case, operations effectively roll back to the last snapshot that was transferred to the destination peer. Any data that was written to the source peer after that last snapshot was captured is lost.
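To make the rollback concrete, the following sketch lists which writes would be lost given the capture time of the last snapshot that reached the destination peer. The `lost_writes` function, the file names and the timestamps are all made-up sample values for illustration.

```python
from datetime import datetime


def lost_writes(writes, last_replicated_snapshot):
    """Return the names of writes made on the source after the snapshot was captured."""
    return [name for name, written_at in writes if written_at > last_replicated_snapshot]


# Sample data for illustration only.
writes = [
    ("report.csv", datetime(2024, 1, 1, 9, 50)),
    ("model.bin", datetime(2024, 1, 1, 10, 5)),
    ("notes.txt", datetime(2024, 1, 1, 10, 20)),
]
snapshot_captured = datetime(2024, 1, 1, 10, 0)

print(lost_writes(writes, snapshot_captured))
```

Only the write made before the snapshot was captured survives on the destination peer; the two later writes are not recoverable from it.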
How you fail back to the primary cluster depends on how the failover was performed:
Following a graceful failover, you can fail back by performing a graceful failover from the primary cluster, which has become the destination peer. This is a reversal of the original failover process. Data will be synced between the clusters and the original configuration will be restored. During the failover, both clusters are read-only. The length of the downtime depends on the size of the delta between the clusters at the time of failover. You can minimize this downtime by performing the failover as soon as possible after a restore point is created. This might mean modifying the replication schedule in advance of the failover.
Following an ungraceful failover, if the primary cluster comes back up and you want to fail back to it, you can do the following:
-
Resume replication from the backup cluster that you had failed over to. The backup cluster becomes the source peer and the primary cluster becomes the destination peer.
Note
Before resuming replication from the backup cluster, it is advisable to create a local snapshot on the primary cluster, because any data that was not transferred from the primary cluster before it went down will otherwise be lost.
-
After a restore point is completed on the destination peer, perform a graceful failover from the destination peer. This will recover all of the latest written data from the backup cluster and enable you to continue working with the primary cluster.
Alternatively, you can force the primary cluster to regain its role as the source peer and resume replication, choosing to lose any new data that accumulated on the backup cluster while the primary cluster was down.