Protection policies for async replication determine the following parameters:
- A schedule. You will set the timing and frequency for creating snapshots on the source peer. Provided the replication time does not exceed the frequency, each snapshot will be replicated to the destination peer.
- A retention period for local snapshots. You can either keep the snapshots on the source peer for a chosen time period or prune them immediately after replication.
- A retention period for snapshots on the destination peer.
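As an illustration of how these three parameters come together, the following Python sketch creates a protection policy through a REST call to VMS. The endpoint path, field names, and values here are assumptions made for illustration only; consult the VAST REST API reference for the actual schema.

```python
import requests

VMS = "https://vms.example.com"  # placeholder VMS address

# Hypothetical payload illustrating the three parameters a protection
# policy defines; the field names are assumptions, not the real schema.
policy = {
    "name": "pp-db-hourly",
    # Schedule: timing and frequency of snapshots on the source peer.
    "snapshot_frequency": "1h",
    # Local retention: how long snapshots are kept on the source peer.
    # A value of "0" would mean prune immediately after replication.
    "local_retention": "1d",
    # Remote retention: how long restore points are kept on the
    # destination peer.
    "remote_retention": "7d",
}

resp = requests.post(
    f"{VMS}/api/protectionpolicies/",  # assumed endpoint path
    json=policy,
    auth=("admin", "password"),  # placeholder credentials
    verify=False,  # lab convenience only; verify TLS in production
)
resp.raise_for_status()
print(resp.json())
```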
Important
You cannot change which protection policy controls a replication stream once it is assigned. As a workaround, it is advisable to create a dedicated protection policy per replication stream where possible, given that:
- Changes to the snapshot and replication schedules and snapshot retention times of a protected path are supported only through modifying the associated protection policy.
- Modifying a protection policy affects all associated replication streams.
- You may need to modify a policy to control a single replication stream, such as for timing a replication to complete prior to a failover.
Consider how to set these parameters to best meet your objectives. Points of consideration include:
- Recovery Point Objective (RPO). In the worst case, a disastrous loss of the primary cluster forces an ungraceful failover, and any data not yet replicated to the replication peer is lost. When scheduling async replication, consider how large a delta of writes to the protected path, accumulated since the last replicated version, you could tolerate losing. To ensure that data loss in such a scenario does not exceed that limit, keep the time from the creation of each snapshot on the source peer to the completion of the equivalent restore point on the destination peer below it. This time is a function of the frequency you set in the protection policy and of the replication rate, which depends on your connection bandwidth as well as on the size of the delta captured in each snapshot. For a back-of-the-envelope estimate, see the first sketch after this list.
- Recovery Time Objective (RTO). This is the time taken to recover operations following a failover event, such as the destruction of the primary cluster in a disaster. Consider how RTO might be affected in each type of failover:
  - In a graceful failover, neither peer is writable during the delta sync between the last restore point on the destination and the latest writes on the source. Therefore, the greater the discrepancy allowed to build up between the source and destination paths, the longer the failover takes. You can minimize this downtime before a planned graceful failover by temporarily tightening the snapshot schedule so that the delta between the latest completed restore point on the destination peer and the latest data written to the source peer is minimal.
  - Since an ungraceful failover takes place without syncing any data, the duration of the failover event itself is not affected by the snapshot schedule.
Note
Recovering operations after failover also requires ensuring that the VMS configuration is replicated as needed and client applications are connected to the cluster. For more information, see Deploying a Failed-Over Replication Peer as a Working Cluster.
- Capacity usage. A shorter retention period on the destination peer prunes snapshots more often, preserving capacity there for frequent new snapshots. For failover purposes, only the most recent restore point on the destination is used and needed. However, following a failover event in which a backup cluster becomes the primary cluster, any older replicated snapshots effectively become local backups.
Note
Each snapshot contains only the changes to the working data since the last snapshot was taken.
- Snapshot limit per cluster. New snapshots are not created if the limit would be exceeded. Therefore, set the expiration and the schedule in all protection policies in use on all protected paths so that the total number of snapshots on each cluster does not reach the limit. For the maximum numbers of snapshots, protected paths, and protection policies, see VAST Cluster Scale Guidelines. For a way to estimate the steady-state snapshot count, see the second sketch after this list.
Caution
You cannot change which protection policy is used on a given protected path. Any changes you make to a protection policy affect all protected paths that use that policy.
- Efficiency. A higher snapshot frequency may capture transient writes that would otherwise cancel each other out, and hence consumes more bandwidth and capacity over time.
- Bandwidth. The bandwidth of your network connection affects how fast a restore point can be completed on the destination peer. If the bandwidth is too low to keep up with the frequency you set, snapshots are skipped.
- Performance. Replication can impact the performance of regular data IOs.
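To make the RPO consideration above concrete, here is a back-of-the-envelope estimate (a sketch; all figures are illustrative assumptions): the worst-case data loss is roughly the snapshot interval plus the time the snapshot's delta takes to replicate.

```python
# Rough worst-case RPO estimate for an async replication schedule.
# All figures below are illustrative assumptions, not measured values.

snapshot_interval_s = 15 * 60    # policy frequency: a snapshot every 15 minutes
delta_per_snapshot_gib = 40      # change captured in each snapshot (GiB)
link_bandwidth_gibps = 0.25      # effective replication bandwidth (GiB/s)

# Time for one snapshot's delta to reach the destination peer.
replication_time_s = delta_per_snapshot_gib / link_bandwidth_gibps

# Worst case: the disaster strikes just before the next snapshot is taken,
# so the interval plus the replication time of the last snapshot is lost.
worst_case_rpo_s = snapshot_interval_s + replication_time_s

print(f"replication time per snapshot: {replication_time_s:.0f} s")
print(f"worst-case RPO: {worst_case_rpo_s / 60:.1f} min")

# If the delta takes longer to replicate than the interval, replication
# cannot keep up with the schedule and snapshots are skipped.
assert replication_time_s <= snapshot_interval_s, "schedule outpaces bandwidth"
```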
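Similarly, you can sanity-check the total snapshot count your policies imply before it approaches the cluster limit. In this sketch the limit and the policy figures are illustrative assumptions; the steady-state count per protected path is approximately retention divided by snapshot interval.

```python
# Estimate the steady-state snapshot count implied by a set of
# protection policies. Figures are illustrative; see the VAST Cluster
# Scale Guidelines for the real limits.

CLUSTER_SNAPSHOT_LIMIT = 100_000  # hypothetical figure

# (policy name, protected paths using it, interval in hours, retention in hours)
policies = [
    ("pp-hourly", 20, 1, 72),    # 20 paths, hourly snapshots kept for 3 days
    ("pp-daily", 50, 24, 720),   # 50 paths, daily snapshots kept for 30 days
]

total = 0
for name, paths, interval_h, retention_h in policies:
    per_path = retention_h // interval_h  # snapshots alive at any moment
    subtotal = paths * per_path
    total += subtotal
    print(f"{name}: ~{per_path} snapshots/path x {paths} paths = {subtotal}")

print(f"estimated steady-state total: {total} of {CLUSTER_SNAPSHOT_LIMIT}")
```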
Specifically for group replication:
- If you configure the same replication frequency for all protection policies in the group, make sure to set the same start time so that they align.
- Even if the replication frequencies differ between replication streams, we recommend that they converge at some points in time so that naturally occurring sync points exist.
- The sync point guarantee configured on a protected path (when there is more than one replication stream) ensures a minimum duration between sync points. If the frequencies of the policies do not align, the sync point guarantee can force the creation of a sync point. For example, if the policies converge only every 48 hours but the sync point guarantee is 24 hours, extra replication occurs to guarantee the sync point, as the sketch below illustrates.
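As an illustration of the convergence point (the values here are assumptions), two streams replicating every 6 and 16 hours from a common start time align naturally only every lcm(6, 16) = 48 hours, so a 24-hour sync point guarantee forces extra sync points in between:

```python
from math import lcm

# Replication intervals (hours) of the policies in a group, assuming a
# common start time. Illustrative values.
intervals_h = [6, 16]

# Schedules naturally coincide at the least common multiple of the intervals.
natural_sync_h = lcm(*intervals_h)

sync_point_guarantee_h = 24  # configured on the protected path

print(f"natural sync points occur every {natural_sync_h} h")
if natural_sync_h > sync_point_guarantee_h:
    print(f"the {sync_point_guarantee_h} h guarantee forces extra replication "
          f"to create sync points in between")
else:
    print("natural alignment already satisfies the guarantee")
```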