HA Operations
Overview
Once SoftNAS® SnapReplicate™ and SNAP HA™ have been configured, day-to-day operations are automated. Automatic Failover is one of the features included with SNAP HA, and once SNAP HA is set up, no additional configuration is required to make it work. Automatic Failover relies on the SoftNAS health monitor: when the health monitor detects a failure or is unable to reach a SoftNAS node, it automatically fails over to the other node and moves all NAS services to it.
However, some actions still require administrative intervention.
This document shows how to perform each of these actions using the SoftNAS StorageCenter™ Administrative Interface.
When a takeover is initiated, the SNAP HA™ Controller ensures that no data is written to a node that is in the process of a switchover. This avoids a split-brain condition.
The HA controller authorizes the switchover, reassigns the IPs, and changes the primary/secondary designation of the SoftNAS® instances.
As part of the takeover, the problematic instance is also shut down.
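The order of events in a takeover can be pictured with a small model. The Python sketch below is illustrative only: the `Node` class, its fields, and the `takeover` function are hypothetical names invented for this example and are not part of any SoftNAS API. It also assumes, for simplicity, a single service IP that follows the primary role.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical model of one HA node; not a SoftNAS data structure.
    name: str
    role: str                      # "primary" or "secondary"
    running: bool = True
    writes_allowed: bool = False
    holds_service_ip: bool = False


def takeover(failed: Node, survivor: Node) -> None:
    """Apply the takeover steps in the order described in the text."""
    # 1. Stop writes to the node being switched away from (avoids split brain).
    failed.writes_allowed = False
    # 2. Reassign the (assumed single) service IP to the surviving node.
    failed.holds_service_ip = False
    survivor.holds_service_ip = True
    # 3. Change the primary/secondary designation.
    failed.role = "secondary"
    survivor.role = "primary"
    # 4. Shut down the problematic instance.
    failed.running = False
    # Only after all of the above does the new primary accept writes.
    survivor.writes_allowed = True


node_a = Node("A", "primary", writes_allowed=True, holds_service_ip=True)
node_b = Node("B", "secondary")
takeover(failed=node_a, survivor=node_b)
print(node_a)
print(node_b)
```

The point of the ordering is that at no moment do both nodes accept writes, which is the condition the HA controller is guarding against.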
Ensure all synchronization activities have been completed before performing the operation.
NOTE: In the example to the right, the current state is listed as DELTASYNC-UNDERWAY. Under NO circumstances should you perform a Takeover or Giveback operation until Deltasync appears as completed.
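If you record or script around the status shown in StorageCenter, the rule in the note above can be written as a simple guard. This is a minimal sketch, not a SoftNAS interface: the function names are hypothetical, and the status value is assumed to be a string you have obtained yourself (for example, by reading it off the screen). This document quotes the status both as DELTASYNC-UNDERWAY and DELTASYNC_UNDERWAY, so the check accepts either spelling.

```python
def deltasync_underway(status: str) -> bool:
    """Return True if the status string indicates a DeltaSync in progress.

    Hyphen and underscore spellings are treated as the same status.
    """
    normalized = status.strip().upper().replace("-", "_")
    return normalized == "DELTASYNC_UNDERWAY"


def require_no_deltasync(status: str) -> None:
    """Raise if a Takeover or Giveback would start during a DeltaSync."""
    if deltasync_underway(status):
        raise RuntimeError(
            "DeltaSync is still underway; do not perform a Takeover or Giveback."
        )


# Example: this call raises RuntimeError because a DeltaSync is in progress.
try:
    require_no_deltasync("DELTASYNC-UNDERWAY")
except RuntimeError as err:
    print(err)
```

The guard only rules out the one state this document names as unsafe. It does not prove that synchronization has completed, so the status shown in StorageCenter remains the authority.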
Takeover
Ensure all synchronization activities have been completed before performing the operation.
After the problematic node has been fixed, bring the node back up.
Giveback
Ensure all synchronization activities have been completed before performing the operation.
Perform a Giveback from the secondary instance so that the SNAP HA™ controller can perform the switchover safely and securely, protecting data integrity.
To recover from a node failure without risking data loss, internal processes must be allowed to complete, and tasks must be performed in a particular order. In this article, we simulate a failover to cover the steps needed to recover from an HA failure, as well as the cues SoftNAS provides to confirm that required processes are complete before you move on to the next task.
First, log into both of your HA nodes (in separate browser tabs) using the IP addresses you assigned to them. For the purposes of this article, we will call the source node SoftNAS01 and the target node SoftNAS02.
Simulating the Failure
Verifying Failover Is Complete
Once the former primary node (SoftNAS01) has been fenced off, the status COMMFAIL appears between the nodes. This is because SoftNAS01 is stopped and no longer accessible.
For AWS:
For Azure:
For VMware:
High Availability is re-established, but SoftNAS02 is now the primary node.
This data discrepancy must be resolved before any giveback operation is performed.
All data configuration changes that occurred while HA was in a degraded state will be lost.
In the near future, SoftNAS will make Takeover and Giveback actions unavailable until all data synchronization is complete.
This shows that the failover is fully completed, and that HA is now in a fully healthy state, with SoftNAS02 as the primary node. Operations can resume as normal in this configuration. However, should you wish to re-establish the original configuration with SoftNAS01 as the primary, you can now safely perform a giveback operation.
Performing the Giveback Operation
For AWS:
For Azure:
For VMware:
If for any reason you wish to simulate a failover again, or to make the secondary node the primary again, ensure that all Deltasync and SnapReplicate operations are complete before performing any takeover or giveback operation.
High Availability Software Update Process
To complete an upgrade, both nodes of the HA pair require a forced synchronization.
Sync & Deactivate Pair
Sync SNAP HA in StorageCenter
Ensure that target and source nodes have been established.
Deactivate
Upgrade Nodes & Transfer Workload
Upgrade Node B
Wait for the confirmation that the update has been successful and allow the browser to refresh itself.
Perform Takeover
Ensure that all synchronization operations have completed before performing the takeover. Here we see the current state of the pairing, and the status reads "DELTASYNC_UNDERWAY". Do not perform the takeover until the status shows as complete.
Upgrade Node A
Wait for the confirmation that the update has been successful and allow the browser to refresh itself.
Restore HA / Reactivate Replication
The system will then automatically synchronize via a forced sync.
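Because the upgrade steps must be carried out in a fixed order, it can help to track them as an ordered checklist. The Python sketch below simply mirrors the headings of this section. The `UpgradeChecklist` class and the step labels are hypothetical conveniences for this example. It does not call or automate anything in SoftNAS.

```python
# Step labels taken from the headings of this section, in document order.
UPGRADE_STEPS = [
    "Sync SNAP HA in StorageCenter",
    "Deactivate",
    "Upgrade Node B",
    "Perform Takeover",
    "Upgrade Node A",
    "Restore HA / Reactivate Replication",
]


class UpgradeChecklist:
    """Hypothetical helper that refuses steps completed out of order."""

    def __init__(self, steps):
        self._steps = list(steps)
        self._done = 0  # number of steps completed so far

    @property
    def remaining(self):
        return self._steps[self._done:]

    def complete(self, step: str) -> None:
        if self._done >= len(self._steps):
            raise ValueError("all steps are already complete")
        expected = self._steps[self._done]
        if step != expected:
            raise ValueError(f"next step is {expected!r}, not {step!r}")
        self._done += 1


checklist = UpgradeChecklist(UPGRADE_STEPS)
checklist.complete("Sync SNAP HA in StorageCenter")
checklist.complete("Deactivate")
print(checklist.remaining)
```

Attempting `checklist.complete("Upgrade Node A")` at this point raises a `ValueError`, since Node B must be upgraded and the takeover performed first.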