DBox HA is the capability for a cluster to recover from an event where both DNodes in any one of the cluster's DBoxes fail. This capability is disabled by default, is not always recommended and requires a minimum cluster size. It also cannot be disabled after it is enabled.
DNode HA is always enabled.
DBox HA capability enables the cluster to recover lost data and rebuild RAID and NVRAM RAID in the event of a failure of both DNodes in one DBox. During the rebuild, the cluster remains fully functional and supports normal operation. Following a DBox failure event, it is possible to replace the DBox while the cluster continues to operate.
DBox HA requires a minimum of 11 DBoxes in the cluster. The minimum scale requirement is essentially due to the need for RAID striping to use a low number of drives per DBox per stripe in order to allow for DBox redundancy.
DBox HA can be enabled at installation or later, during cluster operation, such as after expanding the cluster to include more DBoxes. When DBox HA is enabled on a running cluster, the data is rewritten with the required stripe layout.
Although DBox HA is supported on clusters with as few as 11 DBoxes, the impact on storage efficiency is significant at such cluster sizes and becomes negligible only on larger clusters. Additionally, enabling the feature on a cluster with persistent data will lead to a rewrite of almost all data on the cluster and will inevitably impact storage media endurance.
If you are considering enabling this feature on a cluster, please read the sections below and assess the impact on the cluster before you proceed.
If DBox HA is enabled on a cluster, the following constraints apply to device selection when data is written to the cluster's storage media devices:
NVRAM writes, which are always duplicated on two NVRAM devices for redundancy, must use NVRAM devices on two different DBoxes.
No more than two drives per DBox may participate in each RAID stripe.
The RAID constraint limits the total number of drives that participate in each stripe to twice the number of DBoxes in the cluster, while a minimum of 4 chunks per stripe are always used for parity. While the maximal stripe size is 150 chunks, clusters with fewer than 75 DBoxes will have fewer chunks per stripe. The chunk size and the number of chunks used for parity remain constant. Therefore, DBox HA increases the RAID overhead most on smaller clusters, and the increase shrinks as the cluster size grows.
Consider the example of a cluster that has 11 DBoxes, which is the minimum number of DBoxes required to support DBox HA:
Without DBox HA, the RAID overhead with 11 DBoxes is 2.5%.
With DBox HA enabled, the number of drives that can participate in each stripe is limited to 22 (2 per DBox), and therefore the RAID overhead is 4 chunks out of every 22 chunks of data written to the drives, or 18%.
For a cluster with 11 DBoxes, therefore, the increase in RAID overhead is very significant. With these figures taken into account, the feature should be enabled only if the higher data protection is valued over higher storage efficiency.
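The arithmetic in this example can be reproduced for other cluster sizes with a minimal sketch like the one below. It uses only the figures stated above (two drives per DBox per stripe, 4 parity chunks, a 150-chunk maximum, an 11-DBox minimum). The function names are hypothetical, and the model ignores any other capacity overhead the cluster may have. It covers the DBox HA case only and does not derive the figure quoted for clusters without DBox HA.

```python
MAX_STRIPE_CHUNKS = 150   # maximal stripe size stated in the text
PARITY_CHUNKS = 4         # parity chunks per stripe stated in the text
MAX_DRIVES_PER_DBOX = 2   # DBox HA limit on drives per DBox per stripe
MIN_DBOXES = 11           # minimum cluster size for DBox HA


def ha_stripe_width(num_dboxes: int) -> int:
    """Chunks per stripe under DBox HA: two drives per DBox, capped at 150."""
    if num_dboxes < MIN_DBOXES:
        raise ValueError(f"DBox HA requires at least {MIN_DBOXES} DBoxes")
    return min(MAX_DRIVES_PER_DBOX * num_dboxes, MAX_STRIPE_CHUNKS)


def ha_raid_overhead(num_dboxes: int) -> float:
    """Parity share of each stripe, as a fraction of all chunks written."""
    return PARITY_CHUNKS / ha_stripe_width(num_dboxes)


if __name__ == "__main__":
    # Print the stripe width and parity overhead for a few cluster sizes.
    for n in (11, 22, 44, 75):
        print(f"{n:>3} DBoxes: {ha_stripe_width(n):>3} chunks per stripe, "
              f"{ha_raid_overhead(n):.1%} RAID overhead")
```

For 11 DBoxes the sketch gives 22 chunks per stripe and an overhead of roughly 18%, matching the example above. From 75 DBoxes upward the stripe width stays at the 150-chunk maximum.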
Enabling DBox HA during cluster operation triggers a drive layout rewrite. The rewrite is necessary to ensure that all data written to NVRAM and flash drives is distributed in conformance with the HA constraints. The rewrite process rewrites stripes and NVRAM sections as needed, ensuring that each NVRAM section has a copy residing on a different DBox and no more than two drives per DBox participate in each RAID stripe.
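The two placement rules that the rewrite enforces can be expressed as simple checks. The sketch below is illustrative only and is not how the cluster implements them. The function names and the DBox identifiers are hypothetical, and each drive or NVRAM device is represented only by the name of the DBox that houses it.

```python
from collections import Counter
from typing import Iterable, Tuple

MAX_DRIVES_PER_DBOX = 2  # DBox HA limit on drives per DBox per stripe


def stripe_conforms(drive_dboxes: Iterable[str]) -> bool:
    """True if no DBox contributes more than two drives to the stripe.

    drive_dboxes holds one DBox identifier per drive in the stripe.
    """
    counts = Counter(drive_dboxes)
    return all(count <= MAX_DRIVES_PER_DBOX for count in counts.values())


def nvram_pair_conforms(pair: Tuple[str, str]) -> bool:
    """True if the two copies of an NVRAM section sit on different DBoxes."""
    first, second = pair
    return first != second


if __name__ == "__main__":
    stripe = ["dbox-1", "dbox-1", "dbox-2", "dbox-2", "dbox-3"]
    print(stripe_conforms(stripe))                    # two drives at most per DBox
    print(stripe_conforms(stripe + ["dbox-1"]))       # third drive on dbox-1
    print(nvram_pair_conforms(("dbox-1", "dbox-2")))  # copies on different DBoxes
```

A stripe or NVRAM section that fails its check is one that the rewrite would need to relocate.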
The following are important points to note about the drive layout rewrite:
Most of the data is typically rewritten during this process, so the impact on storage media endurance is approximately that of deleting all data on the cluster and writing it again.
The rewrite proceeds as a background task that cannot be paused or stopped. In case of severe performance degradation, it may be possible for VAST Support to throttle the process and reduce the performance impact.
The rewrite takes time to complete and may impact workload performance while it is in progress.
The rewrite will increase the RAID overhead on relatively small clusters, so DBox HA should not be enabled unless the cluster has sufficient capacity to hold the additional RAID overhead.
If expansions are planned, they should be done prior to enabling DBox HA so that the rewrite will utilize as many DBoxes as possible and minimize RAID overhead.
DBox expansion is not available while the rewrite is in progress.
Once enabled, DBox HA cannot be disabled.
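The capacity and expansion points above can be roughed out before enabling the feature. The following is a simplified sketch, not a sizing tool. It assumes that parity is the only overhead, that `stored_tb` is the physical data currently on the drives excluding parity, and that a free-space `headroom` of 10% is desirable. The function names, both parameters and the headroom default are hypothetical and do not come from VAST.

```python
PARITY_CHUNKS = 4         # parity chunks per stripe stated in the text
MAX_STRIPE_CHUNKS = 150   # maximal stripe size stated in the text


def usable_after_ha(raw_tb: float, num_dboxes: int) -> float:
    """Capacity left for data once DBox HA parity is set aside (simplified)."""
    width = min(2 * num_dboxes, MAX_STRIPE_CHUNKS)
    return raw_tb * (1 - PARITY_CHUNKS / width)


def fits_after_ha(stored_tb: float, raw_tb: float, num_dboxes: int,
                  headroom: float = 0.10) -> bool:
    """True if the stored data still fits while keeping the given headroom."""
    return stored_tb <= usable_after_ha(raw_tb, num_dboxes) * (1 - headroom)


if __name__ == "__main__":
    # Hypothetical cluster: 11 DBoxes of 100 TB raw each, 850 TB stored.
    print(fits_after_ha(850, 1100, 11))
    # The same data after a hypothetical expansion to 15 DBoxes.
    print(fits_after_ha(850, 1500, 15))
```

Running the check for both the current and the planned DBox counts shows how expanding first leaves more room for the additional RAID overhead.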
To enable DBox HA from the VAST Web UI, open the Cluster tab of the Settings page. You can reach this page by searching at the top left or from the navigation menu on the left of the page.
In the New Features section, click Enable DBox HA.
The following messages and confirmation prompt are displayed:
These changes require rewrite and cannot be undone. Rewrite may impact workloads while it is in progress. Stopping rewrite requires support intervention. DBox expansion will not be available during rewrite. Are you sure you want to proceed?
Click Yes if you are sure you would like to proceed.
The drive layout rewrite begins and a progress bar appears at the top right of the page, reporting the current phase of the rewrite as it progresses and the percentage progress.
When the rewrite is complete, the Enable DBox HA button remains inactive and the tooltip for the info icon next to the button changes to DBox HA enabled.
To enable DBox HA from the VAST CLI, enter the command:
cluster modify --enable-dbox-ha
You are warned:
Enabling DBox HA support triggers a required rewrite of current data. Are you sure you want to proceed?
Enter 'y' to confirm that you want to proceed.
You are then warned:
Rewrite may impact workloads while it is in progress. Stopping rewrite requires support intervention. DBox expansion will not be available during rewrite. Are you sure you want to proceed?
Enter 'y' to confirm again.
The drive layout rewrite begins.
You can now monitor the progress of the rewrite. Enter the command:
cluster show
The command output includes the following fields:
Rewrite-phase. During the rewrite, one of the main phases appears here. The order of the phases is:
Rewrite-progress. This shows the percentage progress of the current phase of the rewrite. When it reaches 100, the DBox HA capability is fully enabled.