23 Apr 2017
01:37 UTC
Matthew Bloch
The tail is now back up and running.
You may need to Restart your server from the control panel for normal service to resume - please get in touch if you're having problems after a reboot.
23 Apr 2017
00:57 UTC
Matthew Bloch
The copy is 80% done; we just have some fixup work to do after that, but VMs should be rebooting by 03:00 at the current rate.
22 Apr 2017
22:30 UTC
Matthew Bloch
(update of resolution time)
22 Apr 2017
22:30 UTC
Matthew Bloch
We've diagnosed this deployment issue and finished moving one group of customers away.
We're about to take the affected filesystem down and move its data, which will take 90-120 minutes. We'll then be able to bring the affected servers back up with confidence.
22 Apr 2017
20:58 UTC
Matthew Bloch
The filesystem crashed about 25 minutes ago, and we rebooted once more, causing a few more minutes of downtime for affected VMs. We are now copying data off disc by disc, and will continue to do so for the next few hours. We've decided it'd be better to try to keep the filesystem up during the process even if there are a few crashes.
For Linux nerds: what has bitten us here is a shortage of btrfs metadata space, which we had not previously monitored or seen as a risk factor. When this space runs out, Linux just gives up and turns the filesystem read-only. So we are trying to clear out the filesystem to stabilise it, but normal usage sometimes tips it over the edge while this is happening. We'll keep rebooting as necessary, restarting the migration each time, and anticipate it will find a point of stability as we remove discs.
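For anyone who wants to watch for the same problem on their own systems, here's a rough sketch of the kind of check we'll be adding to our monitoring. It's illustrative only - the mount point and threshold are made up, and it simply parses the output of "btrfs filesystem df" rather than feeding a real alerting system:

    #!/usr/bin/env python3
    # Sketch: warn when a btrfs filesystem is close to exhausting its
    # allocated metadata space - the condition that sent this tail read-only.
    # The mount point and threshold below are hypothetical.
    import re
    import subprocess
    import sys

    MOUNT = "/srv/tail"   # hypothetical mount point of the tail's filesystem
    THRESHOLD = 0.90      # warn when metadata space is 90% used

    UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

    # 'btrfs filesystem df' prints one line per block-group type, e.g.
    #   Metadata, DUP: total=2.00GiB, used=1.97GiB
    output = subprocess.check_output(["btrfs", "filesystem", "df", MOUNT],
                                     universal_newlines=True)

    for line in output.splitlines():
        m = re.match(r"Metadata.*total=([\d.]+)([A-Za-z]+), used=([\d.]+)([A-Za-z]+)", line)
        if m:
            total = float(m.group(1)) * UNITS[m.group(2)]
            used = float(m.group(3)) * UNITS[m.group(4)]
            ratio = used / total
            if ratio >= THRESHOLD:
                print("WARNING: btrfs metadata %.0f%% used on %s" % (ratio * 100, MOUNT),
                      file=sys.stderr)
                sys.exit(1)
            print("btrfs metadata %.0f%% used on %s" % (ratio * 100, MOUNT))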
22 Apr 2017
20:25 UTC
Jamie Nguyen
So sorry about this. We've got all hands on deck trying to resolve this fully. There may be further isolated periods of downtime tonight, and degraded performance, but we're doing all we can to avoid any further interruption of service.
22 Apr 2017
19:03 UTC
Matthew Bloch
All servers should be up again, but we may need to do some further maintenance later this evening to ensure they stay that way. We'll warn of any downtime that we're going to trigger, but we may be able to avoid it.
22 Apr 2017
16:50 UTC
Matthew Bloch
Apologies - this requires another reboot of the host, so some previously-fixed servers will go down and come back up again.
The root cause appears to be a subtle resource limit on a filesystem that we'd not tripped over in 4 years, and which will require monitoring in future. Our priority for the day is fixing this one filesystem for long-term health rather than overall uptime.
22 Apr 2017
15:02 UTC
Matthew Bloch
We're continuing to repair the broken state and bring servers back online - the majority are available again, but we are working on the last few right now.
22 Apr 2017
14:15 UTC
Jamie Nguyen
Some (hopefully most) servers should be back online now, though performance may be degraded due to heavy load. We're working on getting the remaining servers back online, and on migrating all disks to other hardware.
22 Apr 2017
13:32 UTC
Jamie Nguyen
We are currently trying to ensure the safety of the data while moving your disks to separate hardware. Thanks for your patience, and sorry for the downtime!
22 Apr 2017
10:48 UTC
Jamie Nguyen
If you have a Cloud Server with us, you'll have a virtual hard drive (or several). Behind the scenes, these virtual hard drives are backed by real hard drives on a tail (essentially a physical server with lots of disks).
One of our tails suffered an unexpected failure, making it read-only. This is a repeat of issue 169, which we thought we'd got to the bottom of. I'm currently working on getting service back online as soon as possible, and will be migrating all disks away from this tail. The hardware is possibly suffering intermittent failures.