Modern Monitoring Tool for SoftNAS
Symptoms
As system usage becomes more complex and resource-intensive, every production workload needs a modern monitoring tool that shows resource utilization at a glance in near real time. Integrating such a tool with SoftNAS is vital for the reasons below:
- Proactive monitoring for customers: The tool provides a powerful, intuitive graphical dashboard that gives customers an in-depth view of resource utilization on their SoftNAS system(s) at a glance. It also helps them take action proactively to rectify potential problems before they become major.
- Helps harness historical data: Customers often run production workloads that are not well suited to the instance types they are running on. Without historical data, there is no way to fully understand how their systems are being utilized or to map out a pattern that would help advise the best instance type to use. We are often left guessing, or fixing the immediate problem while neglecting the root cause.
- Better than the SAR tool: SAR, a standard part of many Linux distributions and the current monitoring tool on SoftNAS, records system events every 10 minutes. Anything that happens in the interval between two recordings is not visible to customers or support, which makes it hard to understand what was going on with the system in the lead-up to an outage.
- Helps Buurst Support: Because the modern monitoring tool is built around a time series database, support can easily go back in time (days, weeks, or even months) and compare data points to see patterns of events that lead to a failover or a system-down scenario. It also gives support a head start on the root cause of an issue: they can browse the dashboard with the customer instead of relying on logs alone, which can often take days to analyze before findings can be shared.
Additional advantages: A richer set of performance information makes this a more attractive and up-to-date product. It should also provide more opportunity to extend the tool to include metrics from the cloud platform (such as CloudWatch for AWS).
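To illustrate why the 10-minute SAR interval can hide short events, here is a minimal Python sketch. It assumes each record reflects an average over its interval. The workload numbers and the interval_averages helper are hypothetical and not part of SoftNAS or SAR.

```python
def interval_averages(series, interval):
    """Average a per-second utilization series over consecutive fixed-size windows."""
    averages = []
    for start in range(0, len(series), interval):
        window = series[start:start + interval]
        averages.append(sum(window) / len(window))
    return averages


# Hypothetical workload: 10 minutes at 10% CPU with a 60-second burst at 100%.
series = [10.0] * 600
series[300:360] = [100.0] * 60

coarse = interval_averages(series, 600)  # one record per 10 minutes
fine = interval_averages(series, 10)     # one record per 10 seconds

print(max(coarse))  # 19.0  -> the burst is diluted into the 10-minute average
print(max(fine))    # 100.0 -> the burst is clearly visible
```

With a single record every 10 minutes the burst looks like a mildly busy system, while finer-grained records show the system was saturated for a full minute.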
Purpose
The following points delve into some of the capabilities of the monitoring tool:
- Quick CPU/Mem/Disk overview
This gives a quick overview, in gauge percentage (%) format, of current system resource utilization: how busy the CPU is, how loaded the system currently is, current RAM usage, whether any swap is being used, total root filesystem usage, and so on. It also shows the total number of CPUs on the system, total RAM, and how long the system has been up, which is critical for knowing whether it was rebooted for any reason, especially in HA events.
- The same CPU/Mem/Disk information is also represented in metrics form, based on how busy the system and CPU are, iowait, and so on.
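As a rough idea of where such gauge values can come from on a Linux system, the Python sketch below derives CPU busy, iowait, RAM, swap, root filesystem, and uptime figures from the standard /proc files. It is only an illustration under stated assumptions: the function names are hypothetical, and it is not how the SoftNAS monitoring tool itself is implemented.

```python
import shutil


def parse_cpu_counters(stat_text):
    """Return the tick counters from the aggregate "cpu" line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def cpu_gauges(prev_stat, curr_stat):
    """Compute busy and iowait percentages between two /proc/stat snapshots."""
    prev = parse_cpu_counters(prev_stat)
    curr = parse_cpu_counters(curr_stat)
    # Use the first eight fields (user .. steal); guest time is already in user/nice.
    delta = [c - p for p, c in zip(prev[:8], curr[:8])]
    delta += [0] * (8 - len(delta))
    total = sum(delta)
    if total <= 0:
        return {"busy_pct": 0.0, "iowait_pct": 0.0}
    idle, iowait = delta[3], delta[4]
    return {
        "busy_pct": 100.0 * (total - idle - iowait) / total,
        "iowait_pct": 100.0 * iowait / total,
    }


def memory_gauges(meminfo_text):
    """Compute RAM and swap usage percentages from /proc/meminfo text."""
    info = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])  # values are reported in kB
    total, available = info["MemTotal"], info["MemAvailable"]
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    swap_pct = 100.0 * (swap_total - swap_free) / swap_total if swap_total else 0.0
    return {"ram_pct": 100.0 * (total - available) / total, "swap_pct": swap_pct}


def root_fs_pct(path="/"):
    """Return the used percentage of the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def uptime_seconds(uptime_text):
    """Return system uptime in seconds from /proc/uptime text."""
    return float(uptime_text.split()[0])
```

On a Linux host, cpu_gauges would be fed two reads of /proc/stat taken a few seconds apart, memory_gauges the contents of /proc/meminfo, and uptime_seconds the contents of /proc/uptime. A sudden drop in uptime between two readings is the kind of signal that points to a reboot.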
A.