At approximately 4am PST, two separate database servers (db1 and db16) had RAID failures that caused file system corruption. They kept trying to process traffic, but Linux had switched part of the file system to read-only, so no traffic data was actually being written to the hard drives. This problem lasted from approximately 4am to 7am PST. Unfortunately, the traffic data from that window is gone and unrecoverable.
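For anyone who wants to watch for this failure mode on their own Linux machines, here is a minimal sketch of a check for file systems that have gone read-only. It parses text in the format of /proc/mounts and reports any mount point whose option list contains "ro". The function names and the list of ignored file system types are our own illustrative choices, not part of our actual monitoring.

```python
def read_only_mounts(mounts_text, ignore_fstypes=("proc", "sysfs", "tmpfs", "iso9660")):
    """Return the mount points that are mounted read-only.

    mounts_text is text in /proc/mounts format:
    device, mount point, fs type, options, dump, pass (whitespace separated).
    ignore_fstypes is a hypothetical skip list for file systems we don't care about.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        if fstype in ignore_fstypes:
            continue
        # Options are comma separated; "ro" must match exactly, so an
        # option like "errors=remount-ro" does not count by itself.
        if "ro" in options.split(","):
            found.append(mount_point)
    return found


def check_this_host(path="/proc/mounts"):
    """Read the live mount table and return the read-only mount points."""
    with open(path) as f:
        return read_only_mounts(f.read())
```

Calling check_this_host() from a cron job and alerting when it returns a non-empty list would catch a remount like this one even when RAID notifications are off, since it looks at the symptom rather than the cause.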
We have alert systems set up so that when a significant event occurs, such as a server going offline or a RAID failure, we are alerted immediately. Unfortunately, the RAID notifications on a few servers had recently been disabled while we were performing some maintenance, and wouldn’t you know it, db1 and db16 were among those servers. Because of this, we weren’t notified of the problem and didn’t discover it until we woke up to a flood of emails in our inbox this morning.
There were no problems on other servers that we could find, but if you have a site on a server other than db1 or db16 and it’s experiencing issues, please leave a comment here explaining what’s happening. Be sure to include the site ID.
We apologize for this issue, which we take very seriously. The RAID notifications are all back online, and we will be sure to always re-enable them immediately after this kind of maintenance in the future. Leaving them disabled was just an honest mistake.
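One way to make re-enabling less dependent on someone remembering is to give every silence an expiry time, so notifications come back on their own. The sketch below shows that idea only; the class and method names are hypothetical, and it is not a description of how our alerting actually works.

```python
from datetime import datetime, timedelta, timezone


class MaintenanceSilences:
    """Tracks per-host alert silences that expire automatically (illustrative only)."""

    def __init__(self):
        self._expires = {}  # host name -> UTC time at which the silence ends

    def silence(self, host, duration, now=None):
        """Suppress alerts for host for the given timedelta, starting now."""
        now = now or datetime.now(timezone.utc)
        self._expires[host] = now + duration

    def is_silenced(self, host, now=None):
        """Return True only while an unexpired silence exists for host."""
        now = now or datetime.now(timezone.utc)
        expiry = self._expires.get(host)
        return expiry is not None and now < expiry
```

With this shape, a silence set for a two-hour maintenance window on db1 stops applying after two hours whether or not anyone cleans it up, and servers that were never silenced are unaffected.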
One final note: these RAID failures occurred at the exact same time on two different servers. This has happened once before, although that time it was three servers instead of two and it didn’t cause any corruption. This seems like very strange behavior to us, and we’re not sure what could cause such a thing to happen to separate servers (that don’t talk to each other) at the exact same time. If any sysadmins out there have any ideas, please share.