I enhanced the zabbix system monitoring to also work on aacraid-based controllers. A Google search turned up "How to check the health of an Adaptec RAID array", which shows that the right command-line tool nowadays is arcconf, which can be found at Adaptec support for RAID products. Select the right controller type and click through a few times until you reach the storage manager downloads (not the drivers!). The latest 'adaptec storage manager' includes 'arcconf'. Once installed, arcconf produces a lot of output, but the line I am interested in is easy to find:

    # /usr/StorMan/arcconf GETCONFIG 1 | grep Defunct
    Defunct disk drive count : 0

which is exactly what I want. Again a special UserParameter in zabbix_agentd.conf:

    UserParameter=aacraid.okdisk,/etc/zabbix/external/aacraid.okdisk

A script to do the actual work:

    #!/bin/sh
    # aacraid.okdisk
    sudo /usr/StorMan/arcconf GETCONFIG 1 | awk ' /Defunct disk drive count/ { print $6 } '

And a change in sudoers to allow this. Allowing /usr/StorMan/arcconf as-is did not work because of the capitals, but a more general rule helped. Now I can check for the number of disks with problems and warn accordingly (0 disks with problems is ok, 1 disk is warning, > 1 is disaster).
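The thresholds above (0 is ok, 1 is a warning, more than 1 is a disaster) can be sketched in shell. This is only an illustration: in the setup described here the mapping lives in zabbix triggers, and the function names `defunct_count` and `severity` are made up for the example. The parser reads arcconf output on stdin, so it can be tried on a saved sample without a controller present.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of zabbix or arcconf.

# Read `arcconf GETCONFIG 1` output on stdin and print the defunct disk count.
defunct_count() {
    awk '/Defunct disk drive count/ { print $6 }'
}

# Map a count to the severity levels used in the text.
severity() {
    case "$1" in
        0) echo ok ;;
        1) echo warning ;;
        ''|*[!0-9]*) echo unknown ;;   # empty or non-numeric input
        *) echo disaster ;;
    esac
}

# Example with a sample line instead of a live controller:
# severity "$(printf '%s\n' 'Defunct disk drive count : 0' | defunct_count)"
```

Treating empty or non-numeric output as a separate "unknown" state is a design choice for the sketch: it keeps a failing sudo or a missing arcconf binary from being read as zero broken disks.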
As part of the work on system monitoring I am looking into monitoring RAID units. The beta-ict department uses a number of RAID units and data gets replicated between buildings. I want a warning when a disk goes down. The 3ware disk controller has a nice web interface but I can't integrate that (easily..) into zabbix. What I did was install the tw_cli command-line utility from the 3ware LSI raid controller site (look up your type of controller, find 'support and downloads' and you will see cli utils for lots of unix versions), which makes life easy:
    # tw_cli show
    Ctl   Model        (V)Ports  Drives   Units   NotOpt  RRate   VRate  BBU
    ------------------------------------------------------------------------
    c0    9650SE-16ML  16        15       1       1       1       1      OK

What I want to know is the number of not-optimal disks (yes, indeed one is broken at the moment and needs replacement). I can monitor that in zabbix when I pick up the value with a script:

    #!/bin/sh
    # /etc/zabbix/external/3ware.okdisk
    sudo /usr/local/sbin/tw_cli show | awk ' /^c0/ { print $6 } '

This needs root access via sudo, which means a line in /etc/sudoers that allows the zabbix user to run /usr/local/sbin/tw_cli, and the right setting in zabbix_agentd.conf to bind this script to a user parameter:

    UserParameter=3ware.okdisk,/etc/zabbix/external/3ware.okdisk

Now I can program a trigger on the output: 0 is ok, 1 is warning, > 1 is disaster. I added an extra action on the trigger to mail the output of tw_cli '/c0 show' to the admins so we know which disk is broken.
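The script above only looks at controller c0. For a machine with more than one controller, a variation could add up the NotOpt column over every controller line. This is a sketch, not what runs on our systems: the function name `notopt_total` is made up, and it assumes every controller line starts with `c` followed by a number and has NotOpt as its sixth field, as in the sample output.

```shell
#!/bin/sh
# Hypothetical helper: sum the NotOpt column (field 6) over every
# controller line (c0, c1, ...) in `tw_cli show` output read from stdin.
notopt_total() {
    awk '/^c[0-9]+/ { total += $6 } END { print total + 0 }'
}

# Example against the live tool (needs the same sudo rule as before):
# sudo /usr/local/sbin/tw_cli show | notopt_total
```

The `total + 0` in the END block makes awk print 0 rather than an empty line when no controller line was seen.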
Now to do the same for adaptec (aacraid) based raids.
The XKCD: University Website comic is brilliant. And extremely recognizable. It is exactly the disconnect I now hear passing by in the distance at Utrecht University, but just as much what I heard more than 10 years ago at the Hogeschool van Utrecht. Long before that there were projects that started from what people would be looking for, but around 2000 that all went overboard. In its place came monster websites that were supposed to contain everything that, according to the communications officers, visitors might ever ask for; visitors who would obediently start at the homepage and click their way through all sorts of hierarchical structures. Linking directly to the right topic is, according to those communications officers, definitely not the intention either. I am glad this annoyance is over for me and that I can now only laugh about it from the far sidelines. And, as a user of the website, of course cannot find what I am looking for. But Google helps with that.
Update: Also reported in the Digitale U-Blad under the title Web.. some sense?, in response to the UU webpresence project, which has already been characterized as Web Absence.
At work one of my main projects at the moment is improving monitoring for beta-ict. I am used to mon at the computer science department but that shows its age a bit and I wanted to try something newer. The main requirement for the monitoring system was that it could monitor both system variables (free disk space, free memory, system load, whether certain needed processes were running) and service availability (is the network available, is ldap available, are web servers up and not giving out weird error messages).
I chose zabbix. It has an interesting approach: it measures variables, stores results and trends and then you can do stuff with the stored data. Such as monitoring whether certain thresholds aren't crossed, so you can do your normal tests. Or more complicated monitoring of trends or changes. But you can also make graphs of that same data. And you can use the triggers to make nice long-term availability reports.
One thing I learned is that the suggestion in the manual to use a recent version of postgres (>= 8.3) is to be taken seriously. With 8.0 the server running zabbix regularly got up to a load of 10 when adding new systems to be monitored, and historic monitoring data was lost for certain time periods. Dumping the database, installing postgres 8.4, importing the data again and continuing with the same setup made everything lots faster, and no data has been lost since.
What is also interesting is the option to use remote proxies to gather data from otherwise firewalled networks and the option to split servers / services into groups. Eventually we may give the 1st-line servicedesk their own view of our zabbix server where they can view whether main services are available so they are aware of troubles before they need to ask us.
Slowly but surely the subversion self-service web interface we developed at work is turning into a 2.0 version which will be available as open source. I must say "my boss developed", he did most of the coding. I just threw ideas, designs and criticism at him :) It was our original plan to open-source it, and this plan was woken up again when we got a request about the availability of the source code. Lots of work was done to make structures more flexible and remove hardcoded dependencies on internal infrastructure.
One of the bigger design issues was a good name! For historical reasons we couldn't use the name repoman which was good wordplay on repository-manager in itself. We settled on repocafe. Available for download Real Soon Now™.
In my work I naturally also deal a lot with the university's IT projects. Today I read a wonderful philosophical reflection on the student tracking system: God bestaat en zijn naam is OSIRIS (God exists and his name is OSIRIS).
One of our users at work reported today that he noticed the 'Previous Versions' tab in windows explorer being active and showing what we think of as the snapshots of the NetApp fileserver. I tried it myself on the windows 2008 terminal server and it works as it should. As my boss noted this is a very important step: having snapshots available is one thing, but having them available in the standard interface which (experienced) windows users can use makes quite a difference. Helpdesk page about filesystem snapshots updated.
With all the upcoming changes at work we decided to outsource spam filtering to the surfnet mailfilter service. They are paid to keep the filtering up to date on a daily basis, and we will soon have less time for it. Until now it was of course always our 'own' mail setup and we could keep the spam stats ourselves, and we lose that.
We first converted students.cs.uu.nl and, this morning, cs.uu.nl. In the logs of the student mail server I suddenly noticed that postfix's smtpd rate limiting (anvil) was kicking in on the surfnet mailfilters, so I added the surfnet mailfilter addresses to the smtpd_client_event_limit_exceptions setting in postfix. At cs.uu.nl we use postfix, and have since it was still called vmailer. In sendmail I could set different rate limits for those IP blocks, but postfix apparently only offers the choice between the default limits and no rate limits.
It seems the Turkish provider ttnet.tr fell off the Internet for a few hours today. Since we volunteered ntp.cs.uu.nl for tr.pool.ntp.org the drop in traffic was very, very noticeable.
First peak at 5000 packets/second ntp traffic seen on ntp.cs.uu.nl. Still going strong under this load.
Lots of phishing attempts for webmail accounts are flying by; at the moment it seems popular to use web form hosters to ask for account credentials. I seem to miss some of these, probably because my spam filters are too good or something. But at work there are some people who know I am interested in new and recurring strains of Internet abuse, so I still get interesting stuff forwarded to investigate. The latest catch advertised a dot.tk domain which inlined a web form from a tripod-hosted site; that form was a copy of an emailmeform.com form, used emailmeform.com to process it, and redirected to a generic thank-you form at a New Zealand printer supplies company. It takes a bit of tracing and trying to solve such a puzzle and notify all parties about their role in the abuse.
No license to rdesktop for me: I recently got a really weird error from rdesktop:

    koos@leek:~$ rdesktop -M -g 1200x900 -d something terminalserver
    Autoselected keyboard map en-us
    disconnect: No valid license available.

Some searching found me: License to rdesktop. Indeed, setting a different hostname from my own hostname helps:

    koos@leek:~$ rdesktop -M -g 1200x900 -d something -n leeks terminalserver
    Autoselected keyboard map en-us
    /users/koos/.rdesktop/licence.leeks.new: Permission denied
    WARNING: Remote desktop does not support colour depth 24; falling back to 16

The license file error has to do with another workaround. But maybe the running out of licenses for 'leek' is because I never give licenses back. Why is all this software very busy with making sure money is made for its maker and not busy with helping the user?
We volunteered ntp.cs.uu.nl for extra capacity for the Turkish ntp pool, and the results are quite visible in the ntp.cs.uu.nl statistics. Suddenly peaks are near 5000 packets per second. But ntpd (and the freebsd kernel) deal with it without problems.
I upgraded ntpd on ntp.cs.uu.nl from 4.2.4 to 4.2.6 and suddenly I notice in the output that this has changed the stratum from 2 to 1.

    $ ntpq -c rv ntp.cs.uu.nl
    status=011d leap_none, sync_atomic, 1 event, event_13,
    version="ntpd 4.2.6@1.2089-o Fri Jan 15 14:31:14 UTC 2010 (1)",
    processor="i386", system="FreeBSD/5.4-RELEASE-p13", leap=00, stratum=1,
    precision=-19, rootdelay=0.000, rootdisp=1.456, refid=PPS,
    reftime=cefb066f.cbe638ff Fri, Jan 15 2010 16:21:19.796,
    clock=cefb0693.889dd5ee Fri, Jan 15 2010 16:21:55.533, peer=7047, tc=6,
    mintc=3, offset=-0.001, frequency=15.448, sys_jitter=0.002,
    clk_jitter=0.001, clk_wander=0.002

Which matches the peer list where the PPS stratum is now 0:

    $ ntpq -c peer ntp.cs.uu.nl
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    *huygens.cs.uu.n .PPS.            1 u   23   64  377    0.197    0.009   0.258
    +stardate.cs.uu. .PPS.            1 u   13   64  377    0.998   -0.058   0.033
    +tijger.phys.uu. metronoom.dmz.c  2 u   15   64  376    0.599    0.004   0.185
     LOCAL(0)        .LOCL.          10 l  627   64    0    0.000    0.000   0.002
    oPPS(0)          .PPS.            0 l   49   64  377    0.000   -0.002   0.002
     NTP.MCAST.NET   .MCST.          16 u    -   64    0    0.000    0.000   0.002

I guess some definition of PPS input has changed. Now I wonder how much more ntp traffic this will cause.
I tried to use the --filter option in rsync but I was a bit baffled by the syntax; the manpage is nice but I couldn't understand it. I wanted certain directories completely, other directories excluded by default, and certain files in one directory but not all. After some trial and error and talking to the teddybear:

    rsync -rvv --progress /home/koos/rsyncsource/ /home/koos/rsyncdest --filter='merge /home/koos/rsyncfilter'

And in the filter file name things to include and exclude:

    + /wel/
    - /niet/
    + /random/file
    - /random/*

And the result is what I want:

    $ ~/bin/testrsync
    building file list ...
    [sender] showing directory wel because of pattern /wel/
    [sender] hiding directory niet because of pattern /niet/
    [sender] hiding file random/niet because of pattern /random/*
    [sender] showing file random/file because of pattern /random/file
    7 files to consider
    delta-transmission disabled for local transfer or --whole-file
    random/
    random/file
               0 100%    0.00kB/s    0:00:00 (xfer#1, to-check=4/7)
    wel/
    wel/file1
               0 100%    0.00kB/s    0:00:00 (xfer#2, to-check=2/7)
    wel/file2
               0 100%    0.00kB/s    0:00:00 (xfer#3, to-check=1/7)
    wel/file4
               0 100%    0.00kB/s    0:00:00 (xfer#4, to-check=0/7)
    total: matches=0 hash_hits=0 false_alarms=0 data=0

    sent 319 bytes received 126 bytes 890.00 bytes/sec
    total size is 0 speedup is 0.00

Now to do this on a filesystem with 151000 files.
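What finally made the filter file click for me is that rsync tries the rules from top to bottom and acts on the first one that matches, which is why the specific include for /random/file has to come before the wildcard exclude for /random/*. A small sketch to experiment with this safely: it rebuilds a cut-down version of the test tree in a scratch directory made by mktemp and runs the same four rules against it. The function name `rsync_filter_demo` and the scratch layout are made up for the example.

```shell
#!/bin/sh
# Rebuild a small test tree in a scratch directory and run the filter
# from the post against it. Everything under $work is made up for the demo.
rsync_filter_demo() {
    work=$(mktemp -d) || return 1
    mkdir -p "$work/src/wel" "$work/src/niet" "$work/src/random" "$work/dst"
    touch "$work/src/wel/file1" "$work/src/niet/file1" \
          "$work/src/random/file" "$work/src/random/niet"

    # Rules are tried top to bottom and the first match wins, so the
    # specific include has to come before the wildcard exclude.
    cat > "$work/filter" <<'EOF'
+ /wel/
- /niet/
+ /random/file
- /random/*
EOF

    rsync -r --filter="merge $work/filter" "$work/src/" "$work/dst" || return 1

    # List the files that arrived, relative to the destination, sorted.
    (cd "$work/dst" && find . -type f | sort)
    rm -rf "$work"
}
```

Swapping the last two rules in the here-document is a quick way to see the ordering effect: with the exclude first, random/file no longer arrives.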
Y2.01K problem: SpamAssassin had a rule since 2006 that e-mail with a date in the 'far future' was likely spam. The 'far future' was defined as 2010-2099. So today that rule started firing, leading to missed e-mail. Documentation for SpamAssassin Rule: FH_DATE_PAST_20XX. Time for an update there...
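Why the rule started firing today can be illustrated with a small shell sketch. The pattern below is an assumption modelled on the description above (any year from 2010 to 2099 in the Date header); it is not copied from the SpamAssassin rule file, and `date_looks_far_future` is a made-up name.

```shell
#!/bin/sh
# Hypothetical check in the spirit of the rule described above: succeed when
# the Date header value contains a four-digit year in the range 2010-2099.
date_looks_far_future() {
    printf '%s\n' "$1" | grep -Eq '(^|[^0-9])20[1-9][0-9]([^0-9]|$)'
}

# Example:
# date_looks_far_future 'Fri, 01 Jan 2010 10:00:00 +0100' && echo "flagged"
```

A hardcoded year range like this is fine on the day it is written and silently turns into a bug once the calendar catches up with it.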
It's that xsnow time of year. I wanted to compile it for our students and staff to use and found a major Makefile and a real Imakefile (remember those?):

    $ wc Makefile Imakefile
      957  2413 26799 Makefile
        7    21   172 Imakefile

Trying to find the 'real' problem I managed to reduce all that to:

    xsnow: xsnow.o toon_root.o
            gcc -o xsnow xsnow.o toon_root.o -lm -lXpm -L/usr/X11R6/lib

imake gave us somewhat overkill Makefiles...
I was replacing ssl certificates on a lot of servers and got it working everywhere except on our ldap server. The SSL certificate chain wasn't given out so there was no link between a trusted root and the certificate on the server. I had it configured:

    TLSCACertificateFile /etc/openldap/ssl/cacert.pem
    TLSCertificateFile /etc/openldap/ssl/servercrt.pem
    TLSCertificateKeyFile /etc/openldap/ssl/serverkey.pem

With the certificate in servercrt.pem and the intermediate certificates in cacert.pem. But that was a config from an older server which uses the OpenSSL libraries (libssl). The newer ldap server uses the GnuTLS libraries (libgnutls), which really need:

    TLSCertificateFile /etc/openldap/ssl/servercrt.pem
    TLSCertificateKeyFile /etc/openldap/ssl/serverkey.pem

With the server certificate and the entire chain together in servercrt.pem. Something to keep in mind, so I documented it on our internal wiki.
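Building the combined file is a matter of concatenation. A minimal sketch, assuming the server certificate and the intermediate certificates are available as separate PEM files; the function name `build_chain` and the input file names in the usage line are placeholders. The server certificate goes first, followed by the intermediates.

```shell
#!/bin/sh
# Hypothetical helper: write the server certificate followed by the
# intermediate certificates into one PEM file.
# usage: build_chain OUTPUT SERVER_CERT INTERMEDIATE...
build_chain() {
    out=$1
    shift
    # Write to a temporary file first so a failure leaves the old file intact.
    cat "$@" > "$out.tmp" && mv "$out.tmp" "$out"
}

# Example (input file names are placeholders):
# build_chain /etc/openldap/ssl/servercrt.pem server.pem intermediate.pem
```

Each input file should end with a newline, otherwise the END and BEGIN markers of two certificates end up on the same line.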
Power failure this morning at work, which left us not in the dark (enough
emergency lighting) but with a completely silent server room. When the power
came back we had some hours of work to get everything up and running again.
The worst problem was with a number of Xen-based virtual hosts. Some CentOS
upgrade had suddenly created a network device virbr0, which uses NAT and a
local dhcp pool, and enslaved all Xen domU network interfaces under that
bridge. Because NAT was not set up, the domUs had no access to the 'real'
network and their NFS root mount failed. The details on virbr0:

    virbr0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF
              inet addr:192.168.122.1  Bcast:192.168.122.255

It was a bit hard to disable, but in the end

    ifconfig virbr0 down ; brctl delbr virbr0

gets rid of the weird bridge, and all domUs will start after that.