Overview

Once SoftNAS® SnapReplicate™ and SNAP HA™ have been configured, day-to-day operations are automated. Automatic Failover is one of the features included with SNAP HA. Once SNAP HA is set up, no additional configuration is required to make Automatic Failover work. SNAP HA Automatic Failover relies on the SoftNAS health monitor. When the health monitor detects a failure or is unable to reach a SoftNAS node, it automatically fails over to the other node and moves all NAS services to it.

However, there are occasions when you may want to perform actions that require administrative intervention.

This document shows how to perform each of these actions using the SoftNAS StorageCenter™ administrative interface.

Setting Up Manual Takeover and Giveback

When a takeover is initiated, the SNAP HA™ Controller ensures that data is not being written to a node that is in the process of a switchover. This avoids a split-brain condition.

The HA controller authorizes the switchover, reassigns the IPs, and changes the primary/secondary designation for the SoftNAS® instances. As part of the takeover, the problematic instance is also shut down.
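
The switchover described above can be pictured with a small toy model. This is only an illustrative sketch, not SoftNAS code: the Node class, its fields, and the takeover function are hypothetical names, and the order of the steps inside the function is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one instance of an HA pair."""
    name: str
    role: str                      # "primary" or "secondary"
    running: bool = True
    holds_service_ip: bool = False


def takeover(failed: Node, survivor: Node) -> None:
    """Toy model of a takeover: fence, move the IP, swap the roles."""
    # The problematic instance is shut down so it can no longer accept writes.
    failed.running = False
    # The service IP is reassigned to the surviving node.
    failed.holds_service_ip = False
    survivor.holds_service_ip = True
    # The primary/secondary designation is swapped.
    failed.role, survivor.role = "secondary", "primary"


source = Node("SoftNAS01", "primary", holds_service_ip=True)
target = Node("SoftNAS02", "secondary")
takeover(failed=source, survivor=target)
print(source)
print(target)
```

Because the failed node is stopped before the roles change, at no point in this model are both nodes running as primary, which is the split-brain condition the controller is designed to avoid.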

Note
In the case of any failure, it is very important to ensure that all data has been fully synced and that all synchronization activities have completed before performing a takeover or giveback.


Warning
In the example screenshot, the current state is listed as "DELTASYNC-UNDERWAY". Under NO circumstances should you perform a Takeover or Giveback operation until DeltaSync appears as completed.
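
If you script around these operations, the same rule can be written as a guard. The sketch below is hypothetical: it assumes you already have the status text shown in the SnapReplicate panel (how to obtain it is not covered here), and it conservatively treats only SNAPREPLICATE-COMPLETE, the status this guide later describes as the point where a giveback is safe, as permission to proceed.

```python
def normalize(status: str) -> str:
    """Normalize case and separators so DELTASYNC_UNDERWAY and
    DELTASYNC-UNDERWAY compare equal."""
    return status.strip().upper().replace("_", "-")


def switchover_allowed(status: str) -> bool:
    """Return True only when replication is reported as fully complete.

    Assumption: every other status (DeltaSync underway, COMMFAIL, or
    anything unrecognized) is treated as "not safe yet".
    """
    return normalize(status) == "SNAPREPLICATE-COMPLETE"


for status in ("DELTASYNC-UNDERWAY", "DELTASYNC-COMPLETE", "SNAPREPLICATE-COMPLETE"):
    print(status, switchover_allowed(status))
```

An allowlist is used rather than a list of blocked states so that an unexpected status never results in a takeover or giveback being attempted.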

Takeover

Note
Ensure all synchronization activities have been completed before performing the operation.
  •  From the SoftNAS StorageCenter™ interface of the good node, navigate to the SnapReplicate™ panel and select Action > Takeover.

  •  Click the Yes button on the Confirm Action prompt.

  •  The takeover process will begin. This process will shut down the source node and allow the target to take over as primary. After the process has completed successfully, the good node will display as the HA Primary.

Info
After the problematic node has been fixed, bring the node back up. 



Giveback

Note
Ensure all synchronization activities have been completed before performing the operation.

After rebooting the node shut down by the takeover process, perform a Giveback from the secondary instance to allow the SNAP HA™ controller to safely and securely perform the switchover and protect data integrity.

  •  From the SoftNAS StorageCenter™ interface of the good node, navigate to the SnapReplicate™ panel and select Action > Giveback.

  •  Confirm the action by clicking Yes.

Recovering from a High Availability Failure

To recover properly from a node failure without risking data loss, internal processes must be allowed to complete, and tasks must be performed in a particular order. In this article, we will simulate a failover to cover the steps needed to recover from an HA failure, as well as the cues SoftNAS provides to confirm that required processes are complete before you move on to the next task.

First, log into both of your HA nodes (in separate browser tabs) using the IP addresses you assigned to them. For the purposes of this article, we will call the source node SoftNAS01, and the target node SoftNAS02.

Simulating the Failure

  •  From the source node (SoftNAS01), open the SnapReplicate/SNAP HA menu from the Storage Administration pane to check on system progress.

  •  In the SnapReplicate/SNAP HA pane, a healthy configuration shows that replication is ongoing and active between the nodes, ensuring that in the event of failure, the secondary node is fully ready for active duty.

  •  From the secondary node, you would notice the same status, but it would say HA Secondary under the center HA symbol.

  •  In order to simulate a failure, we will initiate a takeover from the target node, SoftNAS02.

  •  From the Action menu, select Takeover.

  •  Click Yes on the Confirm Action prompt, which asks you to confirm that you wish to proceed.

  •  The simulated node failure will begin immediately. You will notice that HA is deactivated, and that the target node (SoftNAS02) is now listed as the primary node.

Verifying Failover is complete

  •  In the event of an actual failure, this is also what you would see. Likewise, from SoftNAS01, the former primary node, you would note that it is listed as secondary. 


  •  External servers and applications continue to have access to the data residing in SoftNAS, but now retrieve this data from SoftNAS02. Now let's dig a little deeper and look at the Replication Control Panel.
  •  Here is where you will see the status of the takeover/replication process. In the event of an actual failure, this will provide the statuses you will use to determine that the takeover process is complete. You will first notice the DeltaSync process. This process tracks changes occurring to the data from the primary node while HA is in a degraded state. 

  •  Once the former primary node (SoftNAS01) has been fenced off, you will see the status COMMFAIL appear between the nodes. This is because SoftNAS01 is stopped and no longer accessible.

    Info
    If this were an actual failure, this status would require human intervention to resolve. In that case, the primary node would need to be rebooted. 


  •  To reboot the node, go to its host platform, find the instance or virtual machine, and start it.

For AWS: 

  •  Open the AWS EC2 dashboard.

  •  Right-click the instance in question (SoftNAS01 in this case) and select Start instance.

For Azure: 

  •  In the Azure portal, select All resources and search for the virtual machine by name.

  •  Double-click the virtual machine to open it.

  •  Select Start to reboot the virtual machine.

For VMware: 

  •  In VMware, find the virtual machine by name, and power it on.
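
The AWS and Azure restart steps above can also be performed from a command line. As a minimal sketch, assuming the AWS CLI and the Azure CLI are installed and already authenticated, the helper below only builds the command and does not run it. The instance ID, resource group, and virtual machine name are placeholders, not values from this guide.

```python
import shlex
from typing import List


def start_command(platform: str, **ids: str) -> List[str]:
    """Build (but do not run) the CLI command that starts a stopped node."""
    if platform == "aws":
        return ["aws", "ec2", "start-instances",
                "--instance-ids", ids["instance_id"]]
    if platform == "azure":
        return ["az", "vm", "start",
                "--resource-group", ids["resource_group"],
                "--name", ids["name"]]
    raise ValueError(f"no start command sketched for platform: {platform}")


# Placeholder identifiers, for illustration only.
aws_cmd = start_command("aws", instance_id="i-0123456789abcdef0")
azure_cmd = start_command("azure", resource_group="example-rg", name="SoftNAS01")
print(shlex.join(aws_cmd))
print(shlex.join(azure_cmd))
```

To actually start the node, the returned list could be passed to subprocess.run(cmd, check=True). VMware is left out of the sketch because the tooling differs between environments.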

  •  Because the data is still accessible and changing on the second node, the data on SoftNAS01 is now outdated. It will need to be resynced with the content of SoftNAS02 before high availability can be re-established.

  •  Once SoftNAS01 is rebooted, log in to it again to ensure it is up and running.

  •  After ensuring both nodes are running and ready, return to SoftNAS02 and the SnapReplicate/SNAP HA tab. It still shows the COMMFAIL status.

  •  From Actions, select Activate to re-establish the link between the nodes, and high availability. 

  •  If activation is successful, a prompt will appear stating Activate completed successfully. Click OK.

  •  The first thing that occurs is a DeltaSync operation to restore all data changes that occurred during the high availability outage on the surviving node (SoftNAS02).


Note

High Availability is re-established, but SoftNAS02 is now the primary node. 



  •  You can see the data changes between the systems by investigating Volumes and LUNs on each node.

  •  Here we see that SoftNAS02, now serving as the primary node, shows a total used space of 143 GB. 

  •  SoftNAS01 shows 146 GB of data under Total Used Space.
Note
This data discrepancy must be resolved before any giveback operation is performed.

  •  Even though the option is available to perform a giveback operation on SoftNAS02, or a takeover operation from SoftNAS01 to re-establish the original HA configuration, do not perform either action while DeltaSync is underway or you risk data loss.
Warning
All data configuration changes that occurred while HA was in a degraded state will be lost.


Note

In the near future, SoftNAS will make Takeover and Giveback actions unavailable until all data synchronization is complete.



  •  On SoftNAS01, no status will be displayed. This is by design, as all operations should be performed on the current primary node.

  •  On SoftNAS02, return to Volumes and LUNs from the Storage Administration pane.

  •  In Volumes and LUNs on SoftNAS02 you will notice a second volume has appeared in the list, labelled EBSvol_DELTACLONE. This volume is created to manage the data changes between volumes on each node.

  •  Continue to refresh until you see a status of DELTASYNC-COMPLETE.

  •  With DeltaSync complete, return to SnapReplicate™/SNAP HA on SoftNAS02, and refresh once more. A SnapReplicate operation is now underway. This too needs to complete fully before a giveback operation is performed.

  •  Refresh until the status SNAPREPLICATE-COMPLETE appears, and DeltaSync at the far right shows "Not Running" and 100%.

This shows that the failover is fully completed, and that HA is now in a fully healthy state, with SoftNAS02 as the primary node. Operations can resume as normal in this configuration. However, should you wish to re-establish the original configuration with SoftNAS01 as the primary, you can now safely perform a giveback operation.
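
The "refresh until the status appears" steps above amount to polling with a timeout. The following is a minimal sketch of that pattern only: get_status is a hypothetical callable that you would have to supply (this guide does not describe a programmatic way to read the status), and the interval and timeout defaults are arbitrary.

```python
import time
from typing import Callable


def wait_for_status(get_status: Callable[[], str],
                    target: str,
                    interval: float = 30.0,
                    timeout: float = 3600.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll get_status() until it returns target or the timeout expires.

    Returns True if the target status was seen, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_status() == target:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Simulated status sequence standing in for repeated refreshes.
simulated = iter(["DELTASYNC-UNDERWAY", "DELTASYNC-UNDERWAY", "DELTASYNC-COMPLETE"])
done = wait_for_status(lambda: next(simulated), "DELTASYNC-COMPLETE",
                       sleep=lambda seconds: None)
print(done)
```

In the recovery flow described here, you would wait first for DELTASYNC-COMPLETE and then for SNAPREPLICATE-COMPLETE before considering a giveback.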

Performing the Giveback Operation

  •  Once all synchronization operations have been completed, as verified above, you can perform a giveback operation to make SoftNAS01 once again the primary node.

  •  In the Actions menu on the current primary node (SoftNAS02) select Giveback.

  •  Confirm the action by clicking Yes.

  •  Once again, HA will be deactivated, and this time, SoftNAS02 will have to be rebooted in the same manner as before.

For AWS: 

  •  Open the AWS EC2 dashboard.

  •  Right-click the instance in question (SoftNAS02 in this case) and select Start instance.

For Azure: 

  •  In the Azure portal, select All resources and search for the virtual machine by name.

  •  Double-click the virtual machine to open it.

  •  Select Start to reboot the virtual machine.

For VMware: 

  •  In VMware, find the virtual machine by name, and power it on.

  •  Once SoftNAS02 is verified as rebooted (by logging into it), return to SoftNAS01, and the SnapReplicate/SNAP HA tab.

  •  In the Actions menu, select Activate.

  •  Accept the prompts, and high availability will be re-established, with the original node restored as primary.

Note
If for any reason you wish to simulate a failover again, or establish the secondary node as primary again, ensure that all DeltaSync and SnapReplicate operations are complete before performing any takeover or giveback operations.

High Availability Software Update Process

Warning
In order to upgrade, both nodes of the HA pairing will require a forced synchronization to complete the process.
  •  For major SoftNAS StorageCenter upgrades that require downtime, Buurst has provided a way to protect replications and SNAP HA pairings while also keeping storage connectivity and data access uninterrupted.

  •  To check whether an upgrade is required, click Settings > Software Updates in the main menu on the left.

Sync & Deactivate Pair

Sync SNAP HA in StorageCenter

  •  Ensure that both nodes are in sync by forcing the event through SoftNAS StorageCenter.

  •  Sign in to each node of the HA pair to be upgraded in separate browsers to more easily switch between nodes.

Note
Ensure that target and source nodes have been established. For the purposes of this document, the following terminology will be used:


  •  In the SoftNAS StorageCenter interface for Node A, navigate to the SnapReplicate / SNAP HA menu.

  •  Click Action and then Replicate Now.

  •  To verify sync completion, watch the Event Log at the bottom of the UI.

  •  Click Refresh if necessary to make sure the displayed status is current.

Deactivate

  •  In the SoftNAS StorageCenter interface for Node A, navigate to the SnapReplicate / SNAP HA menu.

  •  Click Action and then Deactivate.

Upgrade Nodes & Transfer Workload

  •  This critical section ensures that storage and compute capacities remain at expected levels during a disruptive update.

Upgrade Node B

  •  Navigate to the SoftNAS StorageCenter for Node B.

  •  Then go to Settings > Software Updates.

  •  Click Apply Update.

  •  Click Yes to confirm.
Info
Wait for the confirmation that the update has been successful and allow the browser to refresh itself. 


Perform Takeover

  •  In the SoftNAS StorageCenter interface for Node B, navigate to the SnapReplicate / SNAP HA menu.

  •  Click Action and then Takeover.

  •  Click Yes to confirm.

Note
Ensure that all synchronization operations have been completed prior to performing the takeover operation. If the status of the pairing reads "DELTASYNC_UNDERWAY", do not perform the takeover until the status shows as complete.

Upgrade Node A

  •  Navigate to the SoftNAS StorageCenter for Node A.

  •  Then go to Settings > Software Updates.

  •  Click Apply Update.
Info
Wait for the confirmation that the update has been successful and allow the browser to refresh itself.


Restore HA / Reactivate Replication

  •  In the SoftNAS StorageCenter interface for Node B, navigate to the SnapReplicate / SNAP HA menu.

  •  Click Action and then Activate.

  •  Click Yes to confirm.
Info
The system will then automatically synchronize via a forced sync.
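
The order of operations in this update process matters, so it can help to keep it as an explicit checklist. The sketch below simply restates the steps of this section as data and prints them in order. The UPGRADE_STEPS and checklist names are illustrative only, and nothing here talks to SoftNAS.

```python
from typing import List, Tuple

# (node, action) pairs in the order given in this section.
UPGRADE_STEPS: List[Tuple[str, str]] = [
    ("Node A", "SnapReplicate / SNAP HA > Action > Replicate Now"),
    ("Node A", "Verify sync completion in the Event Log"),
    ("Node A", "SnapReplicate / SNAP HA > Action > Deactivate"),
    ("Node B", "Settings > Software Updates > Apply Update"),
    ("Node B", "SnapReplicate / SNAP HA > Action > Takeover"),
    ("Node A", "Settings > Software Updates > Apply Update"),
    ("Node B", "SnapReplicate / SNAP HA > Action > Activate"),
]


def checklist(steps: List[Tuple[str, str]]) -> List[str]:
    """Render the steps as numbered lines, preserving their order."""
    return [f"{number}. [{node}] {action}"
            for number, (node, action) in enumerate(steps, start=1)]


for line in checklist(UPGRADE_STEPS):
    print(line)
```

Keeping the sequence in one place makes it easier to confirm that Node B is updated and has taken over before Node A is updated.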