News items for tag homeserver - Koos van den Hout

2022-07-07 Upgraded the homeserver OS to devuan beowulf and replaced the UPS battery
A few days ago I noticed some interesting messages in the apcupsd log:
2022-07-04 10:14:15 +0200  Battery disconnected.
2022-07-04 10:16:24 +0200  Battery reattached.
2022-07-04 10:19:53 +0200  Battery disconnected.
2022-07-04 10:20:40 +0200  Battery reattached.
Checking the UPS statistics showed me the battery charge was dropping to about 7% of capacity even though mains power was available. Since the battery was over 5 years old I ordered a new one to replace it.
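To see how often the UPS reports this, the events can be counted per day. This is a minimal sketch, assuming the events file uses the format shown above. The default path /var/log/apcupsd.events and the function name count_battery_disconnects are assumptions, not taken from my setup notes.

```shell
#!/bin/sh
# Count "Battery disconnected." events per day in an apcupsd events file.
# The default path is an assumption; pass another file as the first argument.
count_battery_disconnects() {
    events_file="${1:-/var/log/apcupsd.events}"
    # Field 1 of each line is the date (YYYY-MM-DD); print "date count" pairs.
    awk '/Battery disconnected/ { n[$1]++ }
         END { for (d in n) print d, n[d] }' "$events_file" | sort
}
```

Calling count_battery_disconnects without arguments reads the default file. A count that grows from day to day would point at a battery that is on its way out.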

The battery was scheduled to arrive early Wednesday afternoon. I wanted to upgrade the Linux distribution on the main homeserver conway anyway, because devuan ascii is already 'oldoldstable' (but still getting updates).

The homeserver uses 2 disks with the main lvm volume in a raid-1. The /boot and /boot/efi filesystems are mirrored by hand, so the system should still boot when 1 disk is missing.

After the shutdown and replacing the UPS battery I switched the server on again and was greeted by a grub prompt and nothing to boot. After a few tries I got the system booting again, and then I went searching for what went wrong. Eventually I found out the file /boot/efi/EFI/devuan/grub.cfg pointed at a missing filesystem. The best way I found to fix this is to run
# dpkg-reconfigure grub-efi-amd64
twice: once with the /dev/sda filesystems mounted on /boot and /boot/efi, and once with the /dev/sdb filesystems mounted there.
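A check like the following could catch this before the next reboot. It is a minimal sketch and it assumes the stub grub.cfg contains a search.fs_uuid line, as Debian-style grub-efi packages typically write. The function name check_grub_stub and both default paths are assumptions.

```shell
#!/bin/sh
# Check whether the filesystem UUID named in the EFI stub grub.cfg exists.
# Both default paths are assumptions; override them via the arguments.
check_grub_stub() {
    stub="${1:-/boot/efi/EFI/devuan/grub.cfg}"
    uuid_dir="${2:-/dev/disk/by-uuid}"
    # The stub is expected to hold a line like: search.fs_uuid <UUID> root ...
    uuid=$(awk '$1 == "search.fs_uuid" { print $2; exit }' "$stub")
    if [ -z "$uuid" ]; then
        echo "no search.fs_uuid line found in $stub"
        return 2
    fi
    if [ -e "$uuid_dir/$uuid" ]; then
        echo "ok: $uuid is present"
    else
        echo "missing: $uuid not found in $uuid_dir"
        return 1
    fi
}
```

With hand-mirrored boot filesystems the check would have to be run once per disk, with that disk's EFI partition mounted or with its grub.cfg passed as the first argument.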

I was hoping the complete upgrade would make my rcu_sched problems go away, since they have caused serious trouble before, but it did not.

Again I see this in a virtual machine:
[62988.027890] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[62988.036492] rcu:     0-...!: (1 GPs behind) idle=656/1/0x4000000000000002 softirq=592140/600579 fqs=0
[62988.036943] rcu:     (detected by 0, t=2 jiffies, g=2877673, q=701)
[62988.037327] NMI backtrace for cpu 0
But this time I see this on the hardware itself:
[63178.224120] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[63178.224211] ata1.00: failed command: FLUSH CACHE EXT
[63178.224255] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 3
                        res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[63178.224351] ata1.00: status: { DRDY }
[63178.224379] ata1: hard resetting link
[63183.576100] ata1: link is slow to respond, please be patient (ready=0)
[63183.696118] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[63183.696333] ACPI BIOS Error (bug): Could not resolve [\_SB.PCI0.SAT1.SPT0._GTF.DSSP], AE_NOT_FOUND (20180810/psargs-330)
[63183.696400] ACPI Error: Method parse/execution failed \_SB.PCI0.SAT1.SPT0._GTF, AE_NOT_FOUND (20180810/psparse-516)
[63183.696597] ACPI BIOS Error (bug): Could not resolve [\_SB.PCI0.SAT1.SPT0._GTF.DSSP], AE_NOT_FOUND (20180810/psargs-330)
[63183.696634] ACPI Error: Method parse/execution failed \_SB.PCI0.SAT1.SPT0._GTF, AE_NOT_FOUND (20180810/psparse-516)
[63183.696717] ata1.00: configured for UDMA/133
[63183.696732] ata1.00: retrying FLUSH 0xea Emask 0x4
[63183.696772] ata1: EH complete
This suggests I should try a different SATA channel than SATA1 and see whether that changes things.

2022-06-15 Grafana 9.0.0 available, and downgraded back to 8.5.6 and back up...
I saw an upgrade of Grafana available, which turned out to be 9.0.0. After upgrading to 9.0.0 I got:
An unexpected error happened
TypeError: Object(...) is not a function

t@[..]public/plugins/grafana-clock-panel/module.js:2:15615
WithTheme(undefined)
So maybe the grafana-clock-panel plugin isn't compatible with 9.0.0 somehow.

Downgrading to 8.5.6 and reloading everything makes it work again.

Update: I checked the grafana-clock-panel plugin and noticed it hadn't been updated. So I did that update and retried grafana 9.0.0, and that made everything run smoothly again.

2022-05-09 Grafana alerts working again
After reverting to Grafana 8.4.7 for a while because alerts were failing in Grafana 8.5.0, I had a look at the available version today and saw version 8.5.2. I assumed the problem with DatasourceNoData errors was fixed by now and did the upgrade.

Indeed the alerts are seeing data fine now and I trust they will work when needed.

2022-04-23 Grafana alerts failing in 8.5.0
I installed Grafana from their debian repository, so I get updates via the normal apt update / apt dist-upgrade process. Since upgrading to version 8.5.0 all alerts have been firing because of 'DatasourceNoData' errors. According to "Alert Rule returned no data (after upgrade to 8.5.0)" (#48128), other people are seeing this too.

For now I downgraded to version 8.4.7 where things work fine and I'll see if a newer version shows up.

2021-12-27 Raid-1 on the homeserver rebuilt
After seeing read errors on one disk in the raid-1 of the homeserver I ordered a replacement SSD of a different brand and exactly the same size. It arrived today, and I did the work to replace the suspect disk.

First I marked the old disk as failed and removed it from the array. I also noted its complete serial number on a piece of paper to make sure I would remove the faulty disk.

After that the server was shut down, disconnected from a lot of cables and dragged from the homerack in the attic so I could work on it. It took a while to open the side with the SSDs (below the mainboard), and with two identical SSDs I had a 50% chance of picking the right one. After removing the disk tray and unscrewing the SSD from it I was able to read the physical label on the underside, and I had guessed right.

After that the new disk was installed, the case was closed again and dragged back to its place, and the cables were reconnected. After boot everything came up fine.

After bootup I partitioned the new disk, added it to the raid-1 again and set up the EFI and Linux boot partitions on the disk.

The last step was to set up the boot menu with efibootmgr so that both disks are bootable.
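The software side of such a replacement could look roughly like the sketch below. It is not a record of the exact commands used. The partition numbers, the loader path \EFI\devuan\grubx64.efi, copying the partition table with sgdisk, and the helper names run, prepare_removal and rebuild_mirror are all assumptions. By default it only prints the commands; setting DRY_RUN=0 would execute them.

```shell
#!/bin/sh
# Dry-run helper: prints each command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Before shutdown: drop the suspect member and show the disk's serial number.
prepare_removal() {
    array="$1"; member="$2"; disk="$3"
    run mdadm --manage "$array" --fail "$member"
    run mdadm --manage "$array" --remove "$member"
    run smartctl -i "$disk"
}

# After the swap: copy the partition layout from the good disk, re-add the
# raid member and create an EFI boot entry for the new disk.
rebuild_mirror() {
    array="$1"; good_disk="$2"; new_disk="$3"; member="$4"
    run sgdisk --replicate="$new_disk" "$good_disk"
    run sgdisk --randomize-guids "$new_disk"
    run mdadm --manage "$array" --add "$member"
    # Assumes the EFI system partition is partition 1 (hypothetical layout).
    run efibootmgr --create --disk "$new_disk" --part 1 \
        --label "devuan ($new_disk)" --loader '\EFI\devuan\grubx64.efi'
}
```

For example, prepare_removal /dev/md127 /dev/sdb2 /dev/sdb prints the three commands for the first step. The printed form drops shell quoting, so it is meant for review and not for copy and paste. Copying the contents of /boot and /boot/efi to the new disk is not covered here.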

2021-12-21 New ssd for the homeserver ordered
I noticed syslog messages I don't like:
[17200683.290921] md: data-check of RAID array md127
[17200683.291277] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[17200683.291619] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[17200683.291935] md: using 128k window, over a total of 937253184k.
[17201245.784689] ata2.00: exception Emask 0x0 SAct 0x1fe00000 SErr 0x0 action 0x0
[17201245.785175] ata2.00: irq_stat 0x40000008
[17201245.785465] ata2.00: failed command: READ FPDMA QUEUED
[17201245.785766] ata2.00: cmd 60/80:a8:00:52:51/00:00:0c:00:00/40 tag 21 ncq dma 65536 in
                           res 41/40:20:60:52:51/00:00:0c:00:00/00 Emask 0x409 (media error) <F>
[17201245.786402] ata2.00: status: { DRDY ERR }
[17201245.786737] ata2.00: error: { UNC }
[17201245.787281] ata2.00: configured for UDMA/133
[17201245.787619] sd 1:0:0:0: [sdb] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[17201245.787966] sd 1:0:0:0: [sdb] tag#21 Sense Key : Medium Error [current] 
[17201245.788317] sd 1:0:0:0: [sdb] tag#21 Add. Sense: Unrecovered read error - auto reallocate failed
[17201245.788689] sd 1:0:0:0: [sdb] tag#21 CDB: Read(10) 28 00 0c 51 52 00 00 00 80 00
[17201245.789123] blk_update_request: I/O error, dev sdb, sector 206656096
[17201245.789530] ata2: EH complete
And a number of other errors on sdb. Time to replace it! I ordered a new ssd, this time from a different brand. The current configuration has 2 Kingston drives with very close serial numbers, so the other drive may well develop similar issues soon.

The check of the raid1 mirror was also showing differences. I'm waiting for the replacement ssd to show up, and at that moment I will remove the suspect ssd from the array and replace it.

Update 2021-12-24: Writing about the order helped speed things up: I just received notification that the replacement ssd is being sent, although it will not show up until after Christmas. I also noticed the problematic Kingston still has warranty, so maybe I can get a replacement for that one too. The Kingston drives came in about 1.5 years ago when I upgraded the storage on the homeserver.

2021-10-22 Naming interfaces used by libvirt virtual machines
The homeserver conway has an ever growing list of network interfaces, also due to adding a DMZ network.

This was starting to look a bit messy, with things like:
koos@conway:~$ /sbin/brctl show brwireless
bridge name     bridge id               STP enabled     interfaces
brwireless              8000.4ccc6a8efa4b       no              enp10s0.3
                                                        vnet2
                                                        vnet9
Solution: name the interfaces in the VM definitions, like:
    <interface type='bridge'>
      <source bridge='brdmz'/>
      <target dev='dmz-minsky'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
And now names are more logical:
koos@conway:~$ /sbin/brctl show brdmz
bridge name     bridge id               STP enabled     interfaces
brdmz           8000.4ccc6a8efa4b       no              dmz-minsky
                                                        enp10s0.11

2021-10-18 Securing the home network: a separate DMZ network
I have a lot of control over the software that runs on systems at home but there are limits to what I can fix and sometimes things are insecure.

Things like the recent wordpress brute force attacks show that random 'loud' attackers who don't care about the chance of getting noticed will try. I sometimes do worry about the silent and more targeted attackers.

So recently I updated my home network and I now have a DMZ network. At this moment it is a purely virtual network as it doesn't leave the KVM server. Hosts in the DMZ have a default-deny firewall policy to the other inside networks. Specific services on specific hosts have been enabled.

I first moved the development webserver, which allowed me to tune those firewall rules and fix some other errors.

Now the other webservers and other servers offering things to the outside world have moved to the DMZ as well.

2021-07-12 Checking the rcu_sched messages finds repeated mention of cdrom scans
I was going through some rcu_sched messages and noticed kernel routines related to the cdrom drive showed up a few times in the tasks that were 'behind'.
[335894.319961]  [<ffffffffc03d864a>] ? scsi_execute+0x12a/0x1d0 [scsi_mod]
[335894.320702]  [<ffffffffc03da586>] ? scsi_execute_req_flags+0x96/0x100 [scsi_mod]
[335894.321820]  [<ffffffffc04a7703>] ? sr_check_events+0xc3/0x2c0 [sr_mod]
[335894.322551]  [<ffffffffb58224a5>] ? __switch_to_asm+0x35/0x70
[335894.323256]  [<ffffffffb58224b1>] ? __switch_to_asm+0x41/0x70
[335894.323906]  [<ffffffffc047d05a>] ? cdrom_check_events+0x1a/0x30 [cdrom]
[335894.324545]  [<ffffffffc04a8289>] ? sr_block_check_events+0x89/0xe0 [sr_mod]
[335894.325186]  [<ffffffffb551a9a9>] ? disk_check_events+0x69/0x150
Because the virtual machines don't do anything with the virtual cdrom after the first installation, I'm removing the virtual cdrom drives from all virtual machines to see what that does for these messages.
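To find out which virtual machines still have a virtual cdrom, the domain XML can be searched. This is a minimal sketch, assuming libvirt writes each disk element on its own line with device='cdrom' in single quotes, which is how virsh dumpxml normally formats it. The function names count_cdroms and report_cdroms are made up for this example.

```shell
#!/bin/sh
# Count the lines in a libvirt domain XML document (read from stdin)
# that declare a cdrom device.
count_cdroms() {
    grep -c "device='cdrom'"
}

# Print "<domain name> <cdrom count>" for every defined domain.
# Needs a working virsh connection; it is defined here but not called.
report_cdroms() {
    for vm in $(virsh list --all --name); do
        printf '%s %s\n' "$vm" "$(virsh dumpxml "$vm" | count_cdroms)"
    done
}
```

Any domain that reports a count above 0 still has a cdrom device in its definition.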

2021-07-08 Another panic in a virtual machine
At the end of this morning I noticed the root filesystem of the shell server on the homeserver had turned itself read-only. Another DRIVER_TIMEOUT error in the kernel messages. And I didn't want to get to a situation with half of the filesystem in lost+found like the previous time.

This time I decided to use a different approach in the hope of getting back to a working system faster. And it worked this time.
  1. echo s > /proc/sysrq-trigger to force a sync
  2. echo u > /proc/sysrq-trigger to force an unmount of all filesystems
  3. I killed the virtual machine with virsh destroy (the virtualization equivalent of pulling the plug)
  4. I created a snapshot of the virtual machine disk to have a filesystem state to return to in case of problems in the next steps
  5. I booted the virtual machine and it did indeed have filesystem issues
  6. So I rebooted in maintenance mode and did a filesystem check
  7. After that it booted fine and the filesystem was fine, nothing in lost+found
After things ran ok for a while I removed the snapshot. I also changed the configuration to use virtio disks and not IDE emulation. Disks on IDE emulation have a timeout (DRIVER_TIMEOUT) after which the I/O request is given up. The fact that (emulated) I/O hangs for 30 seconds is bad, but it may be related to the rcu_sched messages. Maybe time for some more updates.
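A quick way to see which devices of a virtual machine are still on the emulated IDE bus is to filter the target elements in its domain XML. This is a minimal sketch, assuming the usual libvirt formatting of <target dev='...' bus='...'/> lines. The function name list_ide_targets is hypothetical.

```shell
#!/bin/sh
# Read libvirt domain XML on stdin and print the device name of every
# <target> element that sits on the emulated IDE bus.
list_ide_targets() {
    sed -n "s/.*<target dev='\([^']*\)' bus='ide'.*/\1/p"
}
```

It can be fed from virsh, for example virsh dumpxml minsky | list_ide_targets. Empty output means nothing in that domain uses IDE emulation any more.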


Koos van den Hout, reachable as koos+website@idefix.net. PGP encrypted e-mail preferred. PGP key 5BA9 368B E6F3 34E4.

My opinions are my own, what I write is protected by copyrights. Some publications contain an explicit license statement which allows sharing without asking permission.