Overview

Once SoftNAS® SnapReplicate™ and SNAP HA™ have been configured, day-to-day operations are automated. Automatic Failover is one of the features included with SNAP HA. Once SNAP HA is set up, no additional configuration is required to make Automatic Failover work. SNAP HA Automatic Failover relies on the SoftNAS health monitor. When the health monitor detects a failure or is unable to reach a SoftNAS node, it automatically fails over to the other node and moves all NAS services to that node.

However, some operations require administrative intervention: a manual takeover, a manual giveback, and recovery from a high availability failure.

This document will show how to perform each of these actions using the SoftNAS StorageCenter™ Administrative Interface.

Setting Up Manual Takeover and Giveback

When a takeover is initiated, the SNAP HA™ Controller ensures that no data is written to a node that is in the process of a switch over. This prevents a split-brain condition.

The HA controller will authorize the switch over, reassign the IPs, and change the primary/secondary designation for the SoftNAS® instances.

As part of the takeover, the problematic instance is also shut down.

Ensure all synchronization activities have been completed before performing the operation.

NOTE: If the current state is listed as DELTASYNC-UNDERWAY, do NOT perform a Takeover or Giveback operation under any circumstances. Wait until DeltaSync appears as completed.
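
The rule in the note above can be expressed as a simple guard. The following is a minimal sketch in Python and is not part of SoftNAS: the function name safe_to_switch is hypothetical, and it assumes you have the state text exactly as it is displayed in the SnapReplicate panel. It only knows about the DELTASYNC-UNDERWAY state named in this article, so it does not replace checking the panel yourself.

```python
# Hypothetical helper, not a SoftNAS API. It encodes the rule from the note:
# never start a Takeover or Giveback while DeltaSync is still underway.

# The only blocking state named in this article.
BLOCKING_STATES = {"DELTASYNC-UNDERWAY"}


def safe_to_switch(state: str) -> bool:
    """Return False if the displayed state forbids a Takeover or Giveback."""
    normalized = state.strip().upper()
    if not normalized:
        # No status text available: treat as unsafe rather than guess.
        return False
    return normalized not in BLOCKING_STATES


if __name__ == "__main__":
    for shown in ("DELTASYNC-UNDERWAY", "DELTASYNC-COMPLETE"):
        print(shown, "->", "proceed" if safe_to_switch(shown) else "wait")
```

Treating a missing status as unsafe is a deliberate choice in this sketch: when in doubt, waiting costs time, while switching too early risks data loss.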

Takeover

Ensure all synchronization activities have been completed before performing the operation.

From the SoftNAS StorageCenter™ interface of the good node, navigate to the SnapReplicate™ panel and select Action > Takeover.

Click the Yes button on the Confirm Action prompt.
The takeover process will begin. This process will shut down the source node and allow the target to take over as primary. After the process has completed successfully, the good node will display as the HA Primary.

After the problematic node has been fixed, bring the node back up.

Giveback

Ensure all synchronization activities have been completed before performing the operation.

Perform a Giveback from the secondary instance to allow the SNAP HA™ controller to safely and securely perform the switch over to protect data integrity.

From the SoftNAS StorageCenter™ interface of the good node, navigate to the SnapReplicate™ panel and select Action > Giveback.

Confirm the action by clicking Yes.

Recovering from a High Availability Failure

To recover properly from a node failure without risking data loss, internal processes must be allowed to complete, and tasks must be performed in a particular order. In this article, we will simulate a failover to cover the steps necessary to recover from an HA failure, as well as the cues SoftNAS provides to confirm that required processes are complete before you move on to the next task.

First, log in to both of your HA nodes (in separate browser tabs) using the IP addresses you assigned to them. For the purposes of this article, we will call the source node SoftNAS01, and the target node SoftNAS02.

Simulating the Failure

From the source node (SoftNAS01), open SoftNAS' SnapReplicate/SNAP HA menu from the Storage Administration pane to check on system progress.

In the SnapReplicate/SNAP HA pane, a healthy configuration shows replication as ongoing and active between the nodes, ensuring that in the event of a failure, the secondary node is fully ready for active duty.

On the secondary node, you will see the same status, but with HA Secondary displayed under the center HA symbol.

In order to simulate a failure, we will initiate a takeover from the target node, SoftNAS02.

From the Action menu, select Takeover.

Click Yes on the Confirm Action prompt to confirm that you wish to proceed.

The simulated node failure begins immediately. You will notice that HA is deactivated, and that the target node (SoftNAS02) is now listed as the primary node.

Verifying Failover Is Complete

In the event of an actual failure, this is also what you would see. Likewise, from SoftNAS01, the former primary node, you would note that it is listed as secondary.

External servers and applications continue to have access to the data residing in SoftNAS, but now retrieve this data from SoftNAS02. Now let's dig a little deeper and look at the Replication Control Panel.
Here is where you will see the status of the takeover/replication process. In the event of an actual failure, this will provide the statuses you will use to determine that the takeover process is complete. You will first notice the DeltaSync process. This process tracks changes occurring to the data from the primary node while HA is in a degraded state.

Once the former primary node (SoftNAS01) has been fenced off, you will see the status COMMFAIL appear between the nodes. This is because SoftNAS01 is stopped, and no longer accessible.

If this were an actual failure, this status would require human intervention to resolve. In that case, the failed primary node would need to be rebooted.
To reboot your node, go to the host platform, find the instance or virtual machine, and restart it.

For AWS:

Open the AWS EC2 dashboard.
Right-click the instance in question (SoftNAS01 in this case) and select Start instance.

For Azure:

In the Azure portal, select All resources and search for the virtual machine by name.
Double-click the virtual machine to open it.
Select Start to reboot the virtual machine.

For VMware:

In VMware, find the virtual machine by name, and power it up.
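
For the AWS case, the same restart can be scripted instead of using the EC2 dashboard. The sketch below assumes the boto3 library with credentials already configured. The function name start_instance_and_wait is illustrative, and the instance ID is a placeholder you would replace with the ID of SoftNAS01. It uses the boto3 EC2 client calls start_instances, get_waiter("instance_running"), and describe_instances.

```python
# Sketch only: start a stopped EC2 instance and wait until it is running.
# `ec2_client` is expected to be a boto3 EC2 client, for example the
# object returned by boto3.client("ec2").


def start_instance_and_wait(ec2_client, instance_id: str) -> str:
    """Start the instance, wait until EC2 reports it running, return its state."""
    # Equivalent to choosing "Start instance" in the EC2 dashboard.
    ec2_client.start_instances(InstanceIds=[instance_id])

    # The "instance_running" waiter polls DescribeInstances until the
    # instance reaches the running state (or the waiter gives up and raises).
    waiter = ec2_client.get_waiter("instance_running")
    waiter.wait(InstanceIds=[instance_id])

    # Read the state back as a final confirmation.
    response = ec2_client.describe_instances(InstanceIds=[instance_id])
    return response["Reservations"][0]["Instances"][0]["State"]["Name"]
```

With boto3 installed, a call might look like start_instance_and_wait(boto3.client("ec2"), "<instance-id>"). A running instance only means the virtual machine is up. You should still log in to SoftNAS01 to confirm that the node itself is ready.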

Because the data is still accessible and changing on the second node, the data on SoftNAS01 is now outdated. It will need to be resynced with the content of SoftNAS02 before high availability can be re-established.
Once SoftNAS01 is rebooted, log in to it again to ensure it is up and running.
After ensuring both nodes are running and ready, return to SoftNAS02 and the SnapReplicate/SNAP HA tab. It still shows the COMMFAIL status.
From Actions, select Activate to re-establish the link between the nodes, and high availability.

If activation is successful, a prompt will appear stating Activate completed successfully. Click OK.

The first thing that occurs is a DeltaSync operation to restore all data changes that occurred during the high availability outage on the surviving node (SoftNAS02).

High Availability is re-established, but SoftNAS02 is now the primary node.

You can see the data changes between the systems by investigating Volumes and LUNs on each node.
Here we see that SoftNAS02, now serving as the primary node, shows a total used space of 143 GB.

SoftNAS01 shows 146 GB of data under Total Used Space.

This data discrepancy must be resolved before any giveback operation is performed.

Even though the option is available to perform a giveback operation on SoftNAS02, or a takeover operation from SoftNAS01, to re-establish the original HA configuration, do not perform either action while DeltaSync is underway.

If you do, all data configuration changes that occurred while HA was in a degraded state will be lost.

In the near future, SoftNAS will make Takeover and Giveback actions unavailable until all data synchronization is complete.

On SoftNAS01, no status will be displayed. This is by design, as all operations should be performed on the current primary node.
On SoftNAS02, return to Volumes and LUNs from the Storage Administration pane.
In Volumes and LUNs on SoftNAS02 you will notice a second volume has appeared in the list, labelled EBSvol_DELTACLONE. This volume is created to manage the data changes between volumes on each node.

Continue to refresh until you see a status of DELTASYNC-COMPLETE.
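
The refresh-and-check step above amounts to a polling loop. The following is a minimal sketch in Python under stated assumptions: wait_for_deltasync and its parameters are hypothetical names, and get_state stands in for whatever means you have of reading the status text (this article only covers reading it from the StorageCenter interface). The DELTASYNC-COMPLETE string is the one shown in this article.

```python
import time
from typing import Callable


def wait_for_deltasync(
    get_state: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> bool:
    """Poll until DELTASYNC-COMPLETE is reported; return False on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        # get_state is a caller-supplied function returning the status text.
        if get_state().strip().upper() == "DELTASYNC-COMPLETE":
            return True
        if time.monotonic() >= deadline:
            # Give up without claiming success; the caller decides what to do.
            return False
        time.sleep(poll_seconds)
```

The timeout returns False rather than raising, so a caller can keep waiting or escalate. In neither case should a giveback be attempted before the function returns True.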

With DeltaSync complete, return to the SnapReplicate™/SNAP HA panel on SoftNAS02 and refresh once more. A SnapReplicate operation is now underway. This too needs to complete fully before a giveback operation is performed.

Once the SnapReplicate operation has completed, the failover is fully complete and HA is in a fully healthy state, with SoftNAS02 as the primary node. Operations can resume as normal in this configuration. However, should you wish to re-establish the original configuration with SoftNAS01 as the primary, you can now safely perform a giveback operation.