Redundant Mon Servers

From etbemon

For reliable operation you must be able to handle the failure of any part of the system, and that includes a Mon server. The best way to deal with this is to have two Mon servers in different availability zones monitoring each other. Then if one Mon server dies, or if it sends an alert that isn't received (maybe the path from the Mon server to the alerting system is blocked), the other Mon server will notify you.

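Here is a section of the configuration on servera. It has a service named monserverb that watches serverb and reports monitoring errors for every service on serverb apart from monservera, which is serverb's own check of servera. This prevents recursive error conditions, where each server reports the other's failing check of itself.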

 service monserverb
   description monitor serverb mon
   interval 5m
   monitor remote.monitor --bigsummary --failure_duration=1800 --exclude serverb:monservera serverb ;;
   period
   numalerts 1
   alert mailxmpp.alert -x user@servera.example.com -m user@example.com
   upalert mailxmpp.alert -x user@servera.example.com -m user@example.com

Here is the corresponding section from the configuration on serverb, which is exactly the same but with the names reversed. You can copy the configuration to your own machines by substituting your own host names, e.g. s/servera/whatever/ and s/serverb/whatever/ .

 service monservera
   description monitor servera mon
   interval 5m
   monitor remote.monitor --bigsummary --failure_duration=1800 --exclude servera:monserverb servera ;;
   period
   numalerts 1
   alert mailxmpp.alert -x user@serverb.example.com -m user@example.com
   upalert mailxmpp.alert -x user@serverb.example.com -m user@example.com

For mailxmpp.alert, each Mon server sends to an account on a different Jabber server (servera.example.com and serverb.example.com in the examples above). Jabber servers can fail too, so we need two of them for reliability. While services like Slack are generally quite reliable and popular for network monitoring, I prefer to use my own Jabber servers to eliminate the possibility of network issues between my servers and the Slack servers. That said, having both systems use Slack should be quite reliable.

I use a --failure_duration of 1800 because in my case I don't expect a Mon server to fail to give alerts, and in the normal case I don't need an immediate second alert. If I had a very low failure duration, the other server would give a second alert before I could log in to fix things. But if a fast reaction in all adverse cases is important to you, use a low value and just deal with the double alerts.

This is just a small part of the monitoring that each server does of the other. Each server also needs to monitor the other for pingability, ssh, etc. Ideally you will notice problems in a failing Mon server before Mon itself becomes unavailable.
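
As a rough sketch of what that extra monitoring might look like on servera: the following assumes the fping.monitor and tcp.monitor scripts are installed, and that these services sit in a watch block whose host group contains serverb, so that the host name is appended to the monitor command. The service names ping and ssh and the 1m interval are illustrative choices, not part of the configuration above.

 service ping
   description ping serverb (illustrative)
   interval 1m
   monitor fping.monitor
   period
   numalerts 1
   alert mailxmpp.alert -x user@servera.example.com -m user@example.com
   upalert mailxmpp.alert -x user@servera.example.com -m user@example.com
 service ssh
   description check the ssh port on serverb (illustrative)
   interval 1m
   monitor tcp.monitor -p 22
   period
   numalerts 1
   alert mailxmpp.alert -x user@servera.example.com -m user@example.com
   upalert mailxmpp.alert -x user@servera.example.com -m user@example.com

The same services with the names reversed would go on serverb. If you add them, check whether they need their own --exclude entries in the remote.monitor line on the other server.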

This configuration relies on the --exclude option to remote.monitor, which has only worked correctly since version 1.3.1. It will not work with earlier versions of Mon.