Welp. Bet you weren’t expecting this one… It just keeps getting better and better to be a data storage device within the Immortal MC server network!
So, what happened? On Monday 16th February, while trying to write some notes for my master's project, I noticed my Nextcloud app kept failing to sync my changes - in case you didn't know, on server 1 I host a Nextcloud files instance for both personal and server use. Curious what was happening, I tried to SSH into server 1, only to be met with nothing. I thought that was a bit bizarre, so I opened the staff website (also hosted on server 1) and checked the glances page¹, and saw the CPU stuck at very high usage, almost all of it cpu_iowait.

oh dear god. Not again.
cpu_iowait was exactly what I saw when I first noticed the ZFS storage issues in server 2. Surely not, right? Surely this can't be the same thing here? The SSD pool barely even gets used, it only hosts Minecraft! Well, if you read the title of this page you'll know that's not what happened, although the ZFS pool wasn't out of the picture just yet.
So, upon seeing the cpu_iowait spike, my first call was to check the ZFS pool's health, so I hopped back over to the server 1 terminal and… it's not responding. Interesting. I check glances again and the CPU usage is hitting 95+%. Not ideal. So the new plan was to very slowly, and safely, shut down each service running on server 1 and reboot the machine, which should clear up whatever was causing the spike. So I do that, and then it just never turns back on. Ah.
2 days go by and I still can’t connect, so I ring my dad - it turns out the machine was still running, but was so overloaded it just couldn’t handle any tasks. I got him to reboot the machine manually, and then I got back to diagnosing the problem.
So, first things first, I want to sync my files please, let’s open up my nextcloud page and- huh?

That's not good. Why didn't the apache2² service start? Ok, let's see what's going on by running systemctl status apache2, which unfortunately doesn't provide me with any details. Ok, did anything else fail on startup? Let's run systemctl --failed to check:
```
  UNIT                        LOAD   ACTIVE SUB    DESCRIPTION
● apache2.service             loaded failed failed The Apache HTTP Server
● avahi-daemon.service        loaded failed failed Avahi mDNS/DNS-SD Stack
● networkd-dispatcher.service loaded failed failed Dispatcher daemon for systemd-networkd
● snapd.seeded.service        loaded failed failed Wait until snapd is fully seeded
● switcheroo-control.service  loaded failed failed Switcheroo Control Proxy service
● thermald.service            loaded failed failed Thermal Daemon Service
● udisks2.service             loaded failed failed Disk Manager
● [email protected]           loaded failed failed User Manager for UID 1000
● wpa_supplicant.service      loaded failed failed WPA supplicant
● zfs-import-cache.service    loaded failed failed Import ZFS pools by cache file
```
Oh. Oh no. That's, like, a lot. Oh cool, and the ZFS import service is down too. I assumed this must be a ZFS issue, just like in server 2, so I ran zpool status, which returned no pools available - ok, I should've expected that, the pool import service is broken. Let's see if it can find the SSD pool now the machine is booted by running zpool import. And my SSH terminal crashed.
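For anyone debugging something similar: systemctl status being unhelpful usually just means the interesting logs are in the journal, and the unit names from the failed-units table can be fed straight into journalctl. Here's a rough sketch of scripting that - the here-doc is a captured sample standing in for the live command, so the unit names are just the ones from my machine:

```shell
# Pull the unit names out of `systemctl --failed` output so each one can be
# handed to journalctl for its full failure log. On a live machine you'd use:
#   systemctl --failed --no-legend | awk '...' | xargs -I{} journalctl -u {} -b
awk '{ for (i = 1; i <= NF; i++) if ($i ~ /\.service$/) print $i }' <<'EOF'
● apache2.service loaded failed failed The Apache HTTP Server
● zfs-import-cache.service loaded failed failed Import ZFS pools by cache file
EOF
```

The -b flag on journalctl limits output to the current boot, which is exactly what you want when a service failed at startup.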
I now log back in, and check all the disks are definitely present with lsblk, and yep they’re all there, but for some reason the sdc and sdd partitions (the ZFS pool drives) are labelled as mpatha instead of sdc1 and sdd1 like they should be.
So I look it up, and mpatha stands for multipath device A - as in, the machine is presenting these two separate SSDs as a single drive. You'd think that wouldn't be an issue, but the ZFS mirror that keeps them identical needs to see them as two separate drives, precisely so one can fail without taking the data with it. And since they're now counted as one drive, it can't find the pool, as there's technically only one drive present. Luckily the multipath detection is something that runs on every startup, and I can simply uninstall the package since I won't use it anyway. So I uninstall it and - wait, why can't I uninstall it? The uninstall command just hangs the terminal? Oh shit, it crashed.
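For anyone who hits the same mpatha surprise but can't just uninstall multipath-tools (some storage setups genuinely need it), the gentler fix is blacklisting the disks in /etc/multipath.conf so multipathd stops claiming them. The device names below match my machine and are only an example - adjust them for your own disks:

```
# /etc/multipath.conf - tell multipathd to ignore the ZFS mirror SSDs
blacklist {
    devnode "^sd[cd]"
}
```

Then restart multipathd (or reboot, if your terminal survives that long) and the drives should reappear as plain sdc/sdd.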
So a couple more days go by while I wait for another manual restart, and when the server comes back online, the multipath system has been uninstalled, and the ZFS pool is functioning perfectly fine. Ok cool. Let’s check the CPU usage: 0.5%. Ok, awesome. The server seems fine now? So why is the terminal taking so long to respond to my commands?
It can't be a network issue, because I can connect to and use server 2 just fine, and it's clearly not a system resources issue, since RAM and CPU usage are incredibly low. It's also not the ZFS pool, since that's now working without issue. It can't be any of the hosted services, since I haven't launched any of them yet. So, we just have a machine sitting there, and when I tell the operating system to do something, it takes a long time to do it despite not doing anything else.
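A quick way to test the "the disk itself is slow" theory is iostat -x, which reports the average time each I/O request spends waiting, in milliseconds, in its await column. Here's a tiny sketch of flagging an unhealthy drive that way - the sample data is made up, and column positions vary between sysstat versions, so check the header on your own machine first:

```shell
# Flag any device whose average I/O wait exceeds 100 ms. The here-doc is a
# trimmed stand-in for real output; live usage would be something like:
#   iostat -x 5 | awk '...'
awk 'NR > 1 && $2 + 0 > 100 { print $1 }' <<'EOF'
Device  await
sda     843.2
sdb     4.1
EOF
```

A drive spending most of a second on each request while the CPU sits idle is exactly the "nothing is busy but everything is slow" symptom.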
So, I decide to check how the server is doing at reading and writing by running dmesg | grep -iE "error|fail|ata|buffer|I/O", and the sheer number of errors relating to the main hard-drive was insane. Yep, the hard-drive was dying. I tried to back it up to the file hosting drive, but that just crashed the server again, so now it's sitting completely powered down awaiting my return home, whenever that may be…
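If you want a quick sense of scale rather than pages of scrolling output, swapping grep's -E for -cE turns the same filter into a count of matching kernel log lines. A sketch against a couple of made-up log lines (on the real machine the here-doc would just be dmesg):

```shell
# Count kernel log lines that look I/O-related instead of printing them all.
# The sample lines below stand in for `dmesg` output.
grep -icE "error|fail|ata|buffer|I/O" <<'EOF'
[  123.4] ata1.00: failed command: READ FPDMA QUEUED
[  123.5] blk_update_request: I/O error, dev sda, sector 12345
[  124.0] EXT4-fs (sdb1): mounted filesystem
EOF
```

Running it periodically and watching the number climb is a pretty unambiguous sign the drive is on its way out.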
Closing Remarks
I thought I’d subtitle this section just for formatting’s sake.
Overall, I think this recent spate of issues is a combination of my past self's stupidity catching up with me, and the general wear and tear of hardware that's over 6 years old. The hard-drive in server 1 is the original hard-drive from the original server unit I got in 2020, when I first started self-hosting the Minecraft server. It wasn't even new back then - it was a used hard-drive out of a very old computer. All things considered, it's served us very well over the last 6 years.
In the days-long gaps between server 1 needing to be manually rebooted, I started setting up a network status page for myself so I could more easily keep an eye on outages now that they're becoming a little more frequent - and then I thought, why keep this to myself? So you can now see the current network status of various services at https://status.immortalmc.net/. I've also embedded the uptime status for servers 1 and 2 in the Hardware section of this site.
Footnotes
1. Glances is a tool for viewing system resource usage remotely on Linux machines. Essentially, it's task manager. https://nicolargo.github.io/glances/
2. Apache is software for hosting and managing HTTP connections. https://httpd.apache.org/