Recovering from a High Availability Failure

Overview

In order to properly recover from a node failure without risking data loss, internal processes must be allowed to complete, and tasks must be performed in a particular order. In this article, we will simulate a failover to walk through the steps required to recover from an HA failure, as well as the cues SoftNAS provides to ensure that required processes are complete before moving on to the next task.

First, log into both of your HA nodes (in separate browser tabs) via the IP addresses assigned to them. For the purposes of this article, we will call the source node SoftNAS01 and the target node SoftNAS02.

In this document we will cover simulating the failure, verifying that failover is complete, and performing the giveback operation.

Simulating the Failure

From the source node (SoftNAS01), open SoftNAS' SnapReplicate/SNAP HA menu from the Storage Administration pane to check on system progress. 

In the SnapReplicate/SNAP HA pane, a healthy configuration shows replication as ongoing and active between the nodes, ensuring that in the event of failure, the secondary node is fully ready for active duty.

From the secondary node, you will see the same status, except that it will read HA Secondary under the center HA symbol.

To simulate a failure, we will initiate a takeover from the target node, SoftNAS02. From the Actions menu, select Takeover.

Click Yes on the warning prompt to confirm that you wish to proceed.

The simulated node failure begins immediately. You will notice that HA is deactivated, and that the target node (SoftNAS02) is now listed as the primary node.

Verifying Failover Is Complete


In the event of an actual failure, this is also what you would see. Likewise, from SoftNAS01, the former primary node, you would note that it is listed as secondary. 

External servers and applications continue to have access to the data residing in SoftNAS, but now retrieve this data from SoftNAS02. Now let's dig a little deeper and look at the Replication Control Panel. Here is where you will see the status of the takeover/replication process. In the event of an actual failure, this will provide the statuses you will use to determine that the takeover process is complete. You will first notice the DeltaSync process. This process tracks changes occurring to the data from the primary node while HA is in a degraded state. 

Once the former primary node (SoftNAS01) has been fenced off, you will see the status COMMFAIL appear between the nodes. This is because SoftNAS01 is stopped and no longer accessible. If this were an actual failure, this status would require human intervention to resolve: the primary node would need to be rebooted.

To reboot your node, go to the host platform, find the instance or virtual machine, and restart it.

For AWS: 

Open the AWS EC2 dashboard, select the instance in question (SoftNAS01 in this case), and select Actions, Instance State, and Start.
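If you prefer the command line, the same start operation can be issued with the AWS CLI. The instance ID below is a placeholder; substitute the ID of your SoftNAS01 instance:

```shell
# Placeholder instance ID -- replace with the ID of your SoftNAS01 instance.
INSTANCE_ID="i-0123456789abcdef0"

# Start the stopped instance, then block until EC2 reports it as running.
aws ec2 start-instances --instance-ids "$INSTANCE_ID"
aws ec2 wait instance-running --instance-ids "$INSTANCE_ID"
```

The `wait instance-running` step is optional, but it gives you a clear signal that the node is booting before you log back into SoftNAS.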

For Azure: 

In the Azure portal, select All resources and search for the virtual machine by name. Double-click the virtual machine to open it.

Select Start to reboot the virtual machine. 
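Alternatively, the Azure CLI can start the VM directly. The resource group name below is a placeholder; substitute the resource group that contains your SoftNAS node:

```shell
# Placeholder resource group -- replace with your own; the VM name matches the node.
az vm start --resource-group my-softnas-rg --name SoftNAS01
```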

For VMware: 

In VMware, find the virtual machine by name and power it on.
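If you manage vSphere from the command line, the open-source govc tool can power the VM on. This sketch assumes govc is installed and that the GOVC_URL, GOVC_USERNAME, and GOVC_PASSWORD environment variables already point at your vCenter:

```shell
# Power on the SoftNAS01 virtual machine by name.
govc vm.power -on SoftNAS01
```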

Because the data is still accessible and changing on the second node, the data on SoftNAS01 is now outdated. It will need to be resynced with the content of SoftNAS02 before high availability can be re-established.

Once SoftNAS01 is rebooted, log into it to ensure it is up and running. After ensuring both nodes are running and ready, return to SoftNAS02 and the SnapReplicate/SNAP HA tab. It will still show the COMMFAIL status. From the Actions menu, select Activate to re-establish the link between the nodes and restore high availability.

If activation is successful, a prompt will appear stating Activate completed successfully. Click OK.

The first thing that occurs is a DeltaSync operation to restore all data changes that occurred on the surviving node (SoftNAS02) during the high availability outage.

You will note that HA is re-established, but SoftNAS02 is now the primary node. 

You can see the data changes between the systems by investigating Volumes and LUNs on each node. Here we see that SoftNAS02, now serving as the primary node, shows a total used space of 143 GB. 

SoftNAS01 shows 146 GB of data under Total Used Space. This data discrepancy must be resolved before any giveback operation is performed. 

Even though the option is available to perform a giveback operation on SoftNAS02, or a takeover operation from SoftNAS01 to re-establish the original HA configuration, do not perform either action while DeltaSync is underway or you risk data loss. All data and configuration changes that occurred while HA was in a degraded state will be lost.

In the near future, SoftNAS will make Takeover and Giveback actions unavailable until all data synchronization is complete.

On SoftNAS01, no status will be displayed. This is by design, as all operations should be performed on the current primary node. On SoftNAS02, return to Volumes and LUNs from the Storage Administration pane. In Volumes and LUNs on SoftNAS02 you will notice a second volume has appeared in the list, labelled EBSvol_DELTACLONE. This volume is created to manage the data changes between volumes on each node.

Return to SnapReplicate/SNAP HA on SoftNAS02. Note that the status remains unchanged. It is important to click Refresh to ensure that the latest status is presented. Remember, we cannot initiate the giveback operation to make SoftNAS01 the primary until all data changes have been synced.

Continue to refresh until you see a status of DELTASYNC-COMPLETE. 

With DeltaSync complete, return to SnapReplicate/SNAP HA on SoftNAS02 and refresh once more. A SnapReplicate operation is now underway. This too must complete fully before a giveback operation is performed.

Continue to refresh until the status SNAPREPLICATE-COMPLETE appears, and DeltaSync at the far right shows "Not Running" and 100%.

This shows that the failover is fully complete, and that HA is now in a fully healthy state, with SoftNAS02 as the primary node. Operations can resume as normal in this configuration. However, should you wish to re-establish the original configuration with SoftNAS01 as the primary, you can now safely perform a giveback operation.

Performing the Giveback Operation

Once all synchronization operations have been completed, as verified above, you can perform a giveback operation to make SoftNAS01 once again the primary node. In the Actions menu on the current primary node (SoftNAS02) select Giveback.

Once again, HA will be deactivated, and this time SoftNAS02 will have to be rebooted in the same manner as before.

For AWS: 

Open the AWS EC2 dashboard, select the instance in question (SoftNAS02 in this case), and select Actions, Instance State, and Start.


For Azure: 

In the Azure portal, select All resources and search for the virtual machine by name. Double-click the virtual machine to open it.

Select Start to reboot the virtual machine. 

For VMware: 

In VMware, find the virtual machine by name and power it on.

Once SoftNAS02 is verified as rebooted (by logging into it), return to SoftNAS01 and the SnapReplicate/SNAP HA tab. In the Actions menu, select Activate.



Accept the prompts, and high availability will be re-established, with the original node restored as primary. If for any reason you wish to simulate a failover again, or establish the secondary node as primary once more, ensure that all DeltaSync and SnapReplicate operations are complete before performing any takeover or giveback operation.