I am so happy that I finally solved a problem today. It had existed on my servers for a few months, and in all that time I had absolutely no idea what was causing it. Here is my story:
I built a file server using FreeBSD and ZFS. The hardware is consumer grade: basically a desktop computer with multiple hard drives, connected to the SATA ports on the motherboard. It is a very simple setup.
After I set up a ZFS pool, I started loading data and stress testing the system. I noticed that the ZFS pool went into a FAULTED state after 15 minutes:
    # sudo zpool status -v
      pool: storage
     state: FAULTED
    status: One or more devices could not be opened.  There are insufficient
            replicas for the pool to continue functioning.
    action: Attach the missing device and online it using 'zpool online'.
    config:

            NAME        STATE     READ WRITE CKSUM
            storage     ONLINE       0     0     0
              raidz1-0  ONLINE       0     0     0
                ada4    ONLINE       0     0     0
                ada0    REMOVED      0     0     0
                ada2    ONLINE       0     0     0
                ada5    ONLINE       0     0     0
                ada3    ONLINE       0     0     0

    errors: Permanent errors have been detected in the following files:
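With more drives in a pool, the one bad row is easy to miss. Here is a minimal sketch in Python that picks out every device whose state is not ONLINE. The sample text is a trimmed copy of the status output above, the function name `unhealthy_devices` is my own invention, and the row pattern assumes plain integer error counters, so treat it as an illustration rather than a complete parser.

```python
import re

# Trimmed copy of the `zpool status -v` output shown above.
SAMPLE_STATUS = """\
  pool: storage
 state: FAULTED
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada0    REMOVED      0     0     0
            ada2    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada3    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
"""

# Assumption: a config row is NAME STATE READ WRITE CKSUM with integer counters.
# The header row does not match because READ/WRITE/CKSUM are not digits.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def unhealthy_devices(status_text):
    """Return (name, state) for every config row whose state is not ONLINE."""
    found = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match and match.group(2) != "ONLINE":
            found.append((match.group(1), match.group(2)))
    return found


if __name__ == "__main__":
    for name, state in unhealthy_devices(SAMPLE_STATUS):
        print(f"{name}: {state}")
```

On a live system the text could come from running `zpool status -v` and capturing its output instead of using the sample string.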
In short, a drive was missing, so I checked dmesg:
    Oct 17 20:24:00 kernel: (ada0:ahcich4:0:0:0): lost device
    Oct 17 20:24:00 kernel: (ada0:ahcich4:0:0:0): removing device entry
    Oct 19 20:42:14 kernel: (ada0:ahcich3:0:0:0): lost device
    Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): lost device
    Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): removing device entry
    Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): lost device
    Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): removing device entry
    Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): lost device
    Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): removing device entry
    Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): lost device
    Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): removing device entry
    Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): lost device
    Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): removing device entry
    Oct 19 23:33:02 kernel: (ada0:ahcich4:0:0:0): Synchronize cache failed
    Oct 20 22:53:40 kernel: (ada0:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 30 af 4e 40 7a 00 00 00 00 00
    Oct 20 22:53:40 kernel: (ada0:ahcich4:0:0:0): CAM status: ATA Status Error
    Oct 20 22:53:40 kernel: (ada0:ahcich4:0:0:0): ATA status: 61 (DRDY DF ERR), error: 04 (ABRT )
    Oct 20 22:53:40 kernel: (ada0:ahcich4:0:0:0): RES: 61 04 00 00 00 40 00 00 00 00 00
    Oct 20 22:53:40 kernel: (ada0:ahcich4:0:0:0): Retrying command
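When a log like this gets long, it helps to tally the "lost device" events per device and controller channel, to see whether the drops follow one port or move around. The following is a rough Python sketch of that idea. The sample lines are an excerpt of the log above, the function name `lost_device_counts` is made up for this post, and the pattern assumes the `(device:channel:bus:target:lun)` prefix seen in these messages.

```python
import re
from collections import Counter

# Excerpt of the kernel messages shown above.
SAMPLE_LOG = """\
Oct 17 20:24:00 kernel: (ada0:ahcich4:0:0:0): lost device
Oct 17 20:24:00 kernel: (ada0:ahcich4:0:0:0): removing device entry
Oct 19 20:42:14 kernel: (ada0:ahcich3:0:0:0): lost device
Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): lost device
Oct 19 23:33:02 kernel: (ada0:ahcich4:0:0:0): Synchronize cache failed
"""

# Capture the device name and the controller channel from "lost device" lines.
EVENT = re.compile(r"kernel: \((\w+):(\w+):\d+:\d+:\d+\): lost device\s*$")


def lost_device_counts(log_text):
    """Count 'lost device' events, keyed by (device, channel)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    for (device, channel), count in lost_device_counts(SAMPLE_LOG).items():
        print(f"{device} on {channel}: lost {count} time(s)")
```

In the excerpt, the same device name shows up on two different channels (ahcich4 and ahcich3), which is the kind of pattern that points away from a single bad port.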
I also checked the device node in /dev/:
    # ls -l /dev/ada0
    crw-r----- 1 root operator 0x6f Oct 18 22:10 /dev/ada0
Interestingly, the device was still there. This was very confusing: the system could see the hardware (with some errors), but the application could not. So I decided to try the following:
- Rebooting
- Shutting down completely and powering back on
- Replacing the SATA cables
- Connecting the hard drive to a different SATA port
- Replacing the hard drive with a new one that is certified by the manufacturer
- Replacing the motherboard
- Replacing the hard drive power cable
- Replacing the power supply
- Updating the firmware of the motherboard
Unfortunately, none of these fixed the problem. I also checked the S.M.A.R.T. status of the hard drives, but I didn't see any issues. One thing I noticed was that the error was not consistent. It happened randomly on different SATA ports and hard drives. That didn't make any sense to me, because I saw the same thing on 3 different motherboards. Then I tried something else: I connected the hard drives through a USB port. Guess what, my FreeBSD box stopped complaining.
So I decided to try a different approach. I used the same hardware setup (with the hard drives connected via SATA instead of USB) and installed Windows. The system was perfectly stable, and I didn't experience any issues at all. So I concluded that the problem had nothing to do with the hardware. It had to be a software or settings issue.
Finally, I decided to try one last thing, something I hadn't paid attention to before: the BIOS settings.
AHCI/IDE
Typically there is a BIOS setting that controls how the motherboard talks to the hard drives: IDE or AHCI. I usually stick with the defaults, and on my motherboard the default was IDE. After I changed the setting to AHCI, the problem was gone. Yes, it was gone, and so was my headache.
That was easy!
–Derrick