Another One Bites the Dust

I use Nagios to monitor my home network, including its hardware and software services. When there is an issue with anything being monitored, Nagios sends me an email alert. I recently received an alert that the computer (a Raspberry Pi 1B) that runs my weather station was offline. I checked, and the computer was powered up but not reachable on the network. I rebooted it and it came back online, but none of the software services that normally run on it had started.
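For readers who have not used a monitoring tool, the "reachable on the network" test comes down to something like the sketch below. This is not how Nagios is implemented, only a minimal Python illustration of the idea; the function name `is_reachable` and the default timeout are my own choices for the example.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Illustrative only: a real monitoring system also checks individual
    services, retries before alerting, and sends notifications.
    """
    try:
        # create_connection resolves the host and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False
```

Calling `is_reachable("weatherpi.local", 22)` would test whether SSH answers on a host with that name; both the host name and the port are placeholders, not the values from my network.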

I rebooted the computer several more times and tried to connect to it over the network, with no success. I use an external hard drive with that Raspberry Pi, and it appeared that the Pi would start to boot from the SD card and then fail when it tried to access the external drive. I removed the drive, connected it to another Raspberry Pi, and ran “fsck” on it. When I saw the number of errors scrolling across the screen, I knew the drive was badly corrupted. In the end there were no files left in their original locations on the drive, and everything that could be salvaged was in the lost+found directory.
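Files that fsck moves into lost+found lose their original names (on ext file systems they are typically named after their inode number, such as #12345), so the first job is working out what each one is. The following is a rough Python sketch of that triage, not the exact procedure I followed. It guesses a file's type from its first few bytes. The signature list is a small sample chosen for illustration, and the names `guess_type` and `triage` are hypothetical.

```python
from pathlib import Path

# A few well-known file signatures ("magic bytes"). This list is only a
# sample for illustration; real recovered data may need many more entries.
SIGNATURES = [
    (b"SQLite format 3\x00", "sqlite database"),
    (b"\x1f\x8b", "gzip archive"),
    (b"\x89PNG\r\n\x1a\n", "png image"),
    (b"\xff\xd8\xff", "jpeg image"),
    (b"#!", "script"),
]


def guess_type(path):
    """Guess a file's type from its leading bytes, or return 'unknown'."""
    with open(path, "rb") as f:
        head = f.read(16)
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"


def triage(directory):
    """Map each regular file in directory to a guessed type.

    Recovered subdirectories are skipped here; they would need to be
    walked separately.
    """
    results = {}
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file():
            results[entry.name] = guess_type(entry)
    return results
```

Something like `triage("/mnt/recovered/lost+found")` would return a dictionary of file names and guessed types; that mount point is an assumed example path, and reading lost+found usually requires root. The standard `file` utility does the same job far more thoroughly.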

While the fsck command was running, I decided to reimage the SD card that boots the computer. I needed to update to the most recent version of Raspbian anyway, and this seemed like a good time since I was going to have to create a new bootable hard drive. I booted the Pi from the reimaged SD card and it came back to life. I then connected another external drive to the Pi, but it was not recognized. I tried several other external drives with the same result. All of those drives worked with another Raspberry Pi I had, so the drives themselves weren’t the main problem.

In the end, I connected the Davis Vantage Vue console to another (slightly newer) Raspberry Pi that was already up and running. I installed all of the weather station software and data files that fsck had salvaged on that computer, then brought the weather station service online on the new host. This is most likely a temporary solution, but for now everything is operational again.