At work, one of my main projects at the moment is improving monitoring for beta-ict. I am used to mon from the computer science department, but it is showing its age a bit and I wanted to try something newer.
I was mainly looking for a monitoring system that could watch both system variables (free disk space, free memory, system load, whether certain required processes are running) and service availability (is the network reachable, is ldap available, are web servers up and not returning weird error messages).
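To make the difference between the two kinds of checks concrete, here is a minimal sketch in Python. It is not how zabbix or mon implement their checks. The function names `free_disk_fraction` and `tcp_service_up` are my own, and the host and port in the usage part are only placeholders.

```python
import shutil
import socket


def free_disk_fraction(path="/"):
    """System variable: fraction of the filesystem at `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def tcp_service_up(host, port, timeout=3.0):
    """Service availability: can we open a TCP connection to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable or unresolvable: treat all as "down".
        return False


if __name__ == "__main__":
    print("free disk fraction:", free_disk_fraction("/"))
    # Placeholder target: 389 is the standard ldap port, the host is an assumption.
    print("ldap reachable:", tcp_service_up("localhost", 389))
```

A TCP connect only shows that something is listening. Catching a web server that is up but serving error pages would need a real request and a look at the response.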
I chose zabbix. It has an interesting approach: it measures variables, stores the results and trends, and then lets you do things with the stored data. You can check that certain thresholds are not crossed, which covers your normal tests, or do more complicated monitoring of trends or changes. You can also make graphs of that same data, and you can use the triggers to make nice long-term availability reports.
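The idea of measuring first and evaluating later can be sketched in a few lines of Python. This is only an illustration of the concept, not zabbix's data model or trigger syntax. The `Item` class, the helper functions and the sample values are all made up for the example.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Item:
    """One measured variable, e.g. free disk space in percent (hypothetical model)."""
    name: str
    history: list = field(default_factory=list)  # (timestamp, value) pairs

    def record(self, timestamp, value):
        self.history.append((timestamp, value))

    def last(self):
        return self.history[-1][1]

    def average(self, n):
        """Mean of the most recent n stored values."""
        return mean(value for _, value in self.history[-n:])


def below_threshold(item, limit):
    """Simple trigger: fires when the latest value is below `limit`."""
    return item.last() < limit


def availability(item, limit):
    """Fraction of stored samples where the value was at or above `limit`."""
    ok = sum(1 for _, value in item.history if value >= limit)
    return ok / len(item.history)


disk = Item("free disk space (%)")
for t, v in enumerate([40, 35, 30, 8, 25]):  # invented sample data
    disk.record(t, v)

print(below_threshold(disk, 10))  # threshold test on the latest value
print(disk.average(3))            # trend over the last three samples
print(availability(disk, 10))     # report over the whole stored history
```

Because the samples are stored, the threshold test, the trend and the availability figure all come from the same history, and a graph would use that data too.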
One thing I learned is that the suggestion in the manual to use a recent version of postgres (>= 8.3) should be taken seriously. With 8.0, the server running zabbix regularly reached a load of 10 when new systems were added for monitoring, and historical monitoring data was lost for certain time periods. Dumping the database, installing postgres 8.4, importing the data again and continuing with the same setup made everything a lot faster, and no data has been lost since.
Also interesting are the option to use remote proxies to gather data from otherwise firewalled networks and the option to split servers and services into groups. Eventually we may give the first-line service desk their own view of our zabbix server, where they can see whether the main services are available, so they are aware of problems before they need to ask us.