2021-05-16
Ending up with half a filesystem in /lost+found
Some visitors may have noticed this website wasn't working for about a day. That's because I had to rebuild the webserver. At some point yesterday a filesystem-related panic caused the main filesystem to be mounted read-only. I assumed I could run fsck on the read-only filesystem to get things back to normal, but this turned out to be wrong: I ended up with an unbootable disk and the complete contents of /etc and /home in /lost+found, mostly with unusable filenames (numbers).

The fastest solution was to rebuild the webserver from scratch and get things running again. This took most of the day. Yes, I need to get backups working again, even without a tape drive.

The weird part is that this filesystem lives in a virtual machine, and the hardware host shows absolutely no problems at that time and has no problems with the disks backing this storage. Another virtual machine also had issues around the same time, but those did not result in disk problems:

    sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
    sd 0:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00 88 19 20 00 00 08 00
    blk_update_request: I/O error, dev sda, sector 8919328
    Buffer I/O error on dev sda1, logical block 1114660, lost async page write

A few days earlier both virtual systems logged a strange timing issue with a hang on all CPUs. I'm also seeing some weird kernel messages on other virtual machines around the same time:

    wozniak kernel: [5150105.764208] rcu: INFO: rcu_sched self-detected stall on CPU

So I guess it is time for some hardware checks.
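As a side note on the I/O error lines above: the numbers in them can be cross-checked against each other. A minimal sketch in Python, assuming the standard SCSI WRITE(10) layout (opcode 0x2a, a 4-byte big-endian LBA in bytes 2 to 5, a 2-byte transfer length in bytes 7 and 8). The function name decode_write10 is made up for this illustration, and the 512-byte sector and 4 KiB block sizes are assumptions, not something the logs state.

```python
def decode_write10(cdb_hex):
    """Return (lba, block_count) from a hex-dumped SCSI WRITE(10) CDB."""
    cdb = bytes.fromhex(cdb_hex)  # fromhex skips the spaces between bytes
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2..5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length
    return lba, count


# CDB copied from the log line above.
lba, count = decode_write10("2a 00 00 88 19 20 00 00 08 00")

# Assumption: 512-byte sectors and 4 KiB buffer blocks, so 8 sectors per block.
SECTORS_PER_BLOCK = 4096 // 512

# "logical block 1114660" is relative to sda1, "sector 8919328" to sda,
# so the difference should be the start sector of the partition.
partition_start = 8919328 - 1114660 * SECTORS_PER_BLOCK

print(lba, count, partition_start)
```

Under those assumptions the decoded LBA matches the sector in the blk_update_request line, the write covers 8 sectors (one 4 KiB block), and the partition offset comes out at 2048 sectors. So all three messages describe the same single failed write, which fits a timeout somewhere below the guest rather than damage spread across the disk.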