2012-09-07
More work in Zabbix: we got alerts a few times for load averages > 5. But on a 48-core system in use by people doing calculations, that isn't a very useful trigger. My solution is to start monitoring the number of CPUs (a very boring number normally), and create a new trigger:

    {Template_geo_linux:system.cpu.load[all,avg1].last(0)}/{Template_geo_linux:system.cpu.num.last(0)}>3

This makes a lot more sense: a load of more than 3 times the number of cores is an issue, both on a 1-core (virtual) machine and on a 48-core calculating monster. On some of those calculation servers a load of less than 10 means some model crashed and a scientist will be trying to restart it.

And we can now set a trigger on any change in the number of cores. That would be interesting.

    {Template_geo_linux:system.cpu.num.change(0)}>0
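
For checking the ratio by hand on a single machine, outside Zabbix, here is a minimal Python sketch of the same idea: divide the 1-minute load average by the number of cores and compare against 3. The function names load_per_core and is_overloaded are made up for this illustration, and the default threshold of 3 is simply the value from the trigger above. It is not how Zabbix evaluates the trigger internally.

```python
import os


def load_per_core(load_avg, num_cpus):
    """Return the load average divided by the number of CPUs."""
    if num_cpus < 1:
        raise ValueError("num_cpus must be at least 1")
    return load_avg / num_cpus


def is_overloaded(load_avg, num_cpus, threshold=3.0):
    """True when the load per core is strictly above the threshold.

    The default of 3.0 mirrors the '>3' in the trigger expression;
    it is an assumption carried over from the text, not a universal rule.
    """
    return load_per_core(load_avg, num_cpus) > threshold


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    load1, _load5, _load15 = os.getloadavg()
    cpus = os.cpu_count() or 1  # cpu_count() can return None
    print(f"load1={load1:.2f} cpus={cpus} "
          f"per-core={load_per_core(load1, cpus):.2f} "
          f"overloaded={is_overloaded(load1, cpus)}")
```

With this, a load of 5 counts as overloaded on a 1-core machine but not on a 48-core one, which is the behaviour the new trigger is after.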