[Resolved] Servers down on tail21

Issue status: Resolved
Date: 23 Apr 2017 01:37 UTC
Posted by: Matthew Bloch

The tail is now back up and running.

You may need to Restart your server from the control panel for normal service to resume - please get in touch if you're having problems after a reboot.

Issue status: Investigating
Date: 23 Apr 2017 00:57 UTC
Posted by: Matthew Bloch

The copy is 80% done. We have some fixup work to do after that, but VMs should be rebooting by 0300 at the current rate.

Issue status: Investigating
Date: 22 Apr 2017 22:30 UTC
Posted by: Matthew Bloch

(update of resolution time)

Issue status: Investigating
Date: 22 Apr 2017 22:30 UTC
Posted by: Matthew Bloch

We've diagnosed this deployment issue and finished moving one group of customers away.

We're about to take the affected filesystem down and move its data, which will take 90-120 minutes. We'll then be able to bring affected servers back up with confidence.

Issue status: Investigating
Date: 22 Apr 2017 20:58 UTC
Posted by: Matthew Bloch

The filesystem crashed about 25 minutes ago, and we rebooted once more, causing a few more minutes of down time for affected VMs. We are now copying data off disc by disc, and will continue to do so for the next few hours. We've decided it'd be better to try to keep the filesystem up during the process even if there are a few crashes.

For Linux nerds: what has bitten us here is a shortage of btrfs metadata, which we had not previously monitored or seen as a risk factor. Linux just gives up and turns the filesystem read-only when this resource runs out. So we are trying to clear out the filesystem to stabilise it, but normal usage sometimes tips it over the edge while this is happening. We'll keep rebooting as necessary and restarting the migration and anticipate it will find a point of stability as we remove discs.
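For anyone wanting to watch for the same thing, here is a minimal sketch of a metadata check. It is not the monitoring we run; it only parses text in the format that `btrfs filesystem df <mountpoint>` prints and reports how full the allocated metadata space is. The function names, the sample figures and the 75% threshold are all assumptions made up for illustration.

```python
import re

# Multipliers for the binary unit suffixes btrfs prints by default.
# An empty suffix (raw byte output) is treated as plain bytes.
_UNITS = {
    "": 1, "B": 1, "KiB": 1024, "MiB": 1024**2,
    "GiB": 1024**3, "TiB": 1024**4, "PiB": 1024**5,
}

# Matches lines such as "Metadata, RAID1: total=12.00GiB, used=11.40GiB".
_LINE = re.compile(
    r"^(?P<kind>\w+),\s*(?P<profile>[\w-]+):\s*"
    r"total=(?P<total>[\d.]+)(?P<tunit>[A-Za-z]*),\s*"
    r"used=(?P<used>[\d.]+)(?P<uunit>[A-Za-z]*)$"
)


def metadata_usage(df_output):
    """Return (used_bytes, total_bytes) summed over all Metadata lines."""
    used = total = 0.0
    for line in df_output.splitlines():
        match = _LINE.match(line.strip())
        if match and match.group("kind") == "Metadata":
            total += float(match.group("total")) * _UNITS[match.group("tunit")]
            used += float(match.group("used")) * _UNITS[match.group("uunit")]
    if total == 0:
        raise ValueError("no Metadata line found in input")
    return used, total


def metadata_alert(df_output, threshold=0.75):
    """True when allocated metadata is at least `threshold` full.

    The 0.75 default is an arbitrary illustrative value, not a recommendation.
    """
    used, total = metadata_usage(df_output)
    return used / total >= threshold


# Hypothetical sample output; these are not figures from the affected tail.
SAMPLE = """\
Data, RAID1: total=1.80TiB, used=1.62TiB
System, RAID1: total=32.00MiB, used=304.00KiB
Metadata, RAID1: total=12.00GiB, used=11.40GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
"""

if __name__ == "__main__":
    used, total = metadata_usage(SAMPLE)
    print(f"metadata {used / total:.0%} full, alert={metadata_alert(SAMPLE)}")
```

A ratio like this is only part of the picture: btrfs can allocate more metadata chunks while unallocated device space remains, so a real check would also want to look at unallocated space before raising an alarm.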

Issue status: Investigating
Date: 22 Apr 2017 20:25 UTC
Posted by: Jamie Nguyen

So sorry about this. We've got all hands on deck trying to resolve this fully. There may be further isolated periods of downtime tonight, and degraded performance, but we're doing all we can to avoid any further interruption of service.

Issue status: Investigating
Date: 22 Apr 2017 19:03 UTC
Posted by: Matthew Bloch

All servers should be up again, but we may need to do some further maintenance later this evening to ensure they stay that way. We'll warn of any down time that we're going to trigger, but may be able to avoid it.

Issue status: Investigating
Date: 22 Apr 2017 16:50 UTC
Posted by: Matthew Bloch

Apologies, this is requiring another reboot of the host and some previously-fixed servers will go down and up again.

The root cause appears to be a subtle resource limit on a filesystem that we'd not tripped over in 4 years, and which will require monitoring in future. Our priority for the day is fixing this one filesystem for long-term health rather than overall up time.

Issue status: Investigating
Date: 22 Apr 2017 15:02 UTC
Posted by: Matthew Bloch

We're continuing to restore the broken state and bring servers back online - the majority are available again, but we are working on the last few right now.

Issue status: Investigating
Date: 22 Apr 2017 14:15 UTC
Posted by: Jamie Nguyen

If you have a Cloud Server with us, you'll have a virtual hard drive (or several). Behind the scenes, these virtual hard drives are backed by real hard drives on a tail (essentially a physical server with lots of disks).

One of our tails suffered an unexpected failure, making it read-only. This is a repeat of issue 169, which we thought we'd got to the bottom of. I'm currently working on getting service back online as soon as possible, and will be migrating all disks away from this tail. The hardware is possibly suffering intermittent failures.

We are currently trying to ensure the safety of the data while moving your disks to separate hardware. Thanks for your patience, and sorry for the downtime!

Some (hopefully most) servers should be back online now, though performance may be degraded due to heavy load. We're working on getting the remaining servers back online, and on migrating all disks to other hardware.

Issue status: Investigating
Date: 22 Apr 2017 13:32 UTC
Posted by: Jamie Nguyen

If you have a Cloud Server with us, you'll have a virtual hard drive (or several). Behind the scenes, these virtual hard drives are backed by real hard drives on a tail (essentially a physical server with lots of disks).

One of our tails suffered an unexpected failure, making it read-only. This is a repeat of issue 169, which we thought we'd got to the bottom of. I'm currently working on getting service back online as soon as possible, and will be migrating all disks away from this tail. The hardware is possibly suffering intermittent failures.

We are currently trying to ensure the safety of the data while moving your disks to separate hardware. Thanks for your patience, and sorry for the downtime!

Issue status: Investigating
Date: 22 Apr 2017 10:48 UTC
Posted by: Jamie Nguyen

If you have a Cloud Server with us, you'll have a virtual hard drive (or several). Behind the scenes, these virtual hard drives are backed by real hard drives on a tail (essentially a physical server with lots of disks).

One of our tails suffered an unexpected failure, making it read-only. This is a repeat of issue 169, which we thought we'd got to the bottom of. I'm currently working on getting service back online as soon as possible, and will be migrating all disks away from this tail. The hardware is possibly suffering intermittent failures.

Issue still not addressed? Please contact support.