[FreeBSD] ZFS FAULTED / kernel: (ada0:ahcich4:0:0:0): lost device

I am so happy that I finally solved a problem today. It had existed in my server farms for a few months already, and during that time I had absolutely no idea what was causing it. Here is my story:

I built a file server using FreeBSD and ZFS. The hardware components are consumer grade hardware. Basically, it is a desktop computer with multiple hard drives. The hard drives are connected to the SATA ports on the motherboard. It is a very simple setup.

After I set up a ZFS pool, I started loading data and stress testing the system. I noticed that the ZFS pool went into a FAULTED state after 15 minutes:

#sudo zpool status -v
  pool: storage
 state: FAULTED
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada0    REMOVED      0     0     0 
            ada2    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada3    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
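
With more drives, the one bad row is easy to miss in that table. Here is a minimal sketch of a filter that prints only the rows of the config table whose state is not ONLINE. The function name list_unhealthy_vdevs is made up for this example, and it assumes the column layout shown above (NAME first, STATE second).

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` output on stdin and prints
# "<name> <state>" for every row of the config table that is not ONLINE.
list_unhealthy_vdevs() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }  # table header starts the table
        in_config && NF == 0          { in_config = 0 }         # blank line ends the table
        in_config && $2 != "ONLINE"   { print $1, $2 }
    '
}
```

It would be used as `zpool status storage | list_unhealthy_vdevs`; for the output above it would print the ada0 row with its REMOVED state.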

In short, a drive is missing. So I checked dmesg:

Oct 17 20:24:00 kernel: (ada0:ahcich4:0:0:0): lost device
Oct 17 20:24:00 kernel: (ada0:ahcich4:0:0:0): removing device entry
Oct 19 20:42:14 kernel: (ada0:ahcich3:0:0:0): lost device
Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): lost device
Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): removing device entry
Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): lost device
Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): removing device entry
Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): lost device
Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): removing device entry
Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): lost device
Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): removing device entry
Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): lost device
Oct 19 23:19:18 kernel: (ada0:ahcich4:0:0:0): removing device entry
Oct 19 23:33:02 kernel: (ada0:ahcich4:0:0:0): Synchronize cache failed
Oct 20 22:53:40 kernel: (ada0:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 30 af 4e 40 7a 00 00 00 00 00
Oct 20 22:53:40 kernel: (ada0:ahcich4:0:0:0): CAM status: ATA Status Error
Oct 20 22:53:40 kernel: (ada0:ahcich4:0:0:0): ATA status: 61 (DRDY DF ERR), error: 04 (ABRT )
Oct 20 22:53:40 kernel: (ada0:ahcich4:0:0:0): RES: 61 04 00 00 00 40 00 00 00 00 00
Oct 20 22:53:40 kernel: (ada0:ahcich4:0:0:0): Retrying command
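
To see whether the drops keep hitting the same disk and channel, the "lost device" lines can be counted per device:channel pair. This is only a sketch: count_lost_devices is a made-up name, and it assumes the kernel message format shown above, i.e. "(device:channel:...): lost device".

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log lines on stdin and prints
# "<device>:<channel> <count>" for every pair that reported "lost device".
count_lost_devices() {
    awk '
        / lost device$/ && match($0, /\([a-z]+[0-9]+:[a-z]+[0-9]+:/) {
            # The match looks like "(ada0:ahcich4:"; strip "(" and the last ":".
            pair = substr($0, RSTART + 1, RLENGTH - 2)
            count[pair]++
        }
        END { for (pair in count) print pair, count[pair] }
    ' | sort
}
```

It can be fed from `dmesg | count_lost_devices`, or from a saved log file redirected to stdin.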

I also checked the device node in /dev/:

#ls -l /dev/ada0

crw-r-----  1 root  operator   0x6f Oct 18 22:10 /dev/ada0

Interestingly, the device node was still there. This was very confusing: the system could see the hardware (with some errors), but the application could not. So I decided to try the following:

  • Rebooting
  • Shutting down and booting again
  • Replacing the SATA cables
  • Connecting the hard drive to a different SATA port
  • Replacing the hard drive with a new one that is certified by the manufacturer
  • Replacing the motherboard
  • Replacing the hard drive power cable
  • Replacing the power supply
  • Updating the firmware of the motherboard

Unfortunately, none of these fixed the problem. I also checked the S.M.A.R.T. status of the hard drives, but I didn’t see any issues. One thing I noticed was that the error was not consistent: it happened randomly on different SATA ports and hard drives. That didn’t make any sense to me, because I saw the same thing on 3 different motherboards. Then I tried something else: I connected the hard drives through a USB port. Guess what, my FreeBSD box stopped complaining.

So I decided to try a different approach. I used the same hardware setup (the hard drives were connected using SATA instead of USB) and installed Windows 7. The system was perfectly stable, and I didn’t experience any issues at all. So I believed the problem had nothing to do with the hardware. It had to be a software settings issue.

Finally, I decided to try one last thing, something I hadn’t paid attention to before: the BIOS settings.

AHCI/IDE

Typically there is a setting that controls how the motherboard interacts with the hard drives: IDE or AHCI. I usually stick with the default settings. In my case, the default on my motherboard was IDE. After I changed the setting to AHCI, the problem was gone. Yes, it’s gone and my headache is gone too.
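
On a machine without a display, the controller mode can also be checked from the running system, because the channel names differ by driver: ahci(4) channels show up as ahcichN and legacy ata(4) channels as ataN. The sketch below parses the bus lines of `camcontrol devlist -v`; the function name classify_sata_channels is made up, and it assumes a CAM-based ATA stack (the adaN device names suggest that is the case here).

```shell
#!/bin/sh
# Hypothetical helper: reads `camcontrol devlist -v` output on stdin and
# prints "<channel> <mode>" for each bus line like "scbus0 on ahcich0 bus 0:".
classify_sata_channels() {
    awk '
        $1 ~ /^scbus[0-9]+$/ && $2 == "on" {
            mode = "other"
            if ($3 ~ /^ahcich[0-9]+$/)   mode = "AHCI"        # ahci(4) channel
            else if ($3 ~ /^ata[0-9]+$/) mode = "legacy IDE"  # ata(4) channel
            print $3, mode
        }
    '
}
```

It would be run as `camcontrol devlist -v | classify_sata_channels`, which prints one line per channel.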

That was easy!

–Derrick

2 comments

  1. “So I decided to try another approach. I used the same hardware setup (the hard drives were connected using SATA instead of USB) and loaded a Windows 7.”
    Haha, who would have thought that using this limited OS would save you a lot of time? Good to bear this in mind 😉

    Thanks a lot Derrick for sharing your experiences.

    1. Well, I just hope FreeBSD can be a little bit smarter in terms of hardware compatibility. If Microsoft can do it, there is no reason why FreeBSD cannot.

      By the way, the reason I never thought about the BIOS settings is that the machine did not even have a display port. 🙂
