23 Mar 2018
09:14 UTC
Ian Chilton
Everything is continuing to be stable so far.
We're continuing to monitor the situation closely and will be working on a full write up as soon as possible.
22 Mar 2018
18:29 UTC
Tom Hill
We believe we've isolated the root cause of the problem, and the perfect storm that led us this far. We'll be writing this up in full detail as soon as possible.
For now, most of our hardware is fine, but a certain number of transceivers will need replacing in due course.
We will be keeping a very keen eye on things in the meantime; however, barring anything unexpected, we should be mostly out of the woods.
Thanks to everyone for all of your patience over the last few days.
Tom
Expected resolution time updated: 23 Mar 2018, 12:00 UTC
22 Mar 2018
16:03 UTC
Ian Chilton
Everything is looking good at the moment.
It's early days, but we are hopeful that we might just have found the cause.
More updates will follow.
22 Mar 2018
13:45 UTC
Ian Chilton
Unfortunately we're seeing a re-occurrence of packet loss to some of the racks [in Manchester].
We're still waiting on a delivery of new optical transceivers, which is where we now believe the problem lies.
We're continuing to investigate in the meantime.
21 Mar 2018
22:34 UTC
Tom Hill
Hello,
Today has been quite a trying day for us all - staff and customers alike. If you're still following along, then let me first say thank you for all of your patience in bearing with us whilst we try and stabilise what has been quite a rocky network.
We've spent almost the entire day screen-sharing with our vendor, showing them exactly where the problems lie. The call included a number of their low-level hardware developers, and so we were able to obtain some excellent debugging information from the switches. It really has been an "all-hands" situation, both at the vendor and at Bytemark.
Together we believe that we have isolated the cause of the issue, which lies with the particular optical transceivers that we're using. It isn't yet clear why they've all suddenly degraded in concert; however, this has given us quite a leg up in our efforts to rectify the situation.
Now, through some trial-and-error, we have managed to re-jig the hardware and software configuration such that we have now stabilised the software element of the network, which was at the centre of the worst of the outages over the last few days.
Regrettably there remains some degree of packet loss to some services, though nowhere near as much as before, and that should hopefully hold together until we can make further inroads with some new transceivers that are presently en route to us.
Once again, please do accept my deepest apologies for the extent of the problems faced over the last 48 hours. Be assured that we are working as hard as possible to restore the full quality of service that you expect from Bytemark's network.
As per my previous update, I am more than happy to field further queries personally if you would like to discuss this at greater length. Pop a 'FAO Tom' message into support, and it will be passed on to me without delay.
Kind regards,
Tom
21 Mar 2018
21:40 UTC
Doug Targett
We are still looking into the networking issues that are causing some packet loss.
Again please accept our apologies for the inconvenience this may be causing.
21 Mar 2018
16:58 UTC
Ian Chilton
We're continuing to see problems off and on and continuing to investigate this.
A conference call with the vendor is currently ongoing.
21 Mar 2018
14:20 UTC
Ian Chilton
All quiet again at the moment.
On each occurrence of problems, we're gaining some more useful information.
Expected resolution time updated: We are currently investigating.
21 Mar 2018
13:53 UTC
Ian Chilton
We're seeing problems to some hosts again, I'm afraid.
More troubleshooting is underway.
21 Mar 2018
13:19 UTC
Ian Chilton
We've made some more configuration changes to continue to try and narrow down the symptoms we're seeing and we're continuing to send debugging information to the vendor.
Everything is stable at the moment but we're monitoring the situation very closely.
21 Mar 2018
11:48 UTC
Andrew Ladlow
We've been alerted to an issue relating to intermittent packet loss to servers in our Manchester data centre again.
Please do accept our apologies for the inconvenience caused. We're investigating and working to restore full service as soon as possible. We'll post updates here as soon as we're able.
If you're experiencing any problems with your services at Bytemark that you think might be related, please do get in touch.
Expected resolution time updated: 21 Mar 2018, 14:00 UTC
21 Mar 2018
01:54 UTC
Tom Hill
Hello,
Further to the software upgrade, whilst that process did not cause any outages itself, the resulting software version is also exhibiting similar symptoms, and so we are back to our vendor for more options.
We've reduced our resiliency in order to try and keep interruptions to a minimum in the meantime.
Kind regards,
Tom
20 Mar 2018
23:53 UTC
Tom Hill
Hello,
We've been advised by our software vendor to upgrade to the latest version of the running software, which includes some very promising fixes. It should be possible to upgrade the switches one at a time, and the process is intended to work that way. Given the very small possibility of service interruption, and the potential fix to avoid another day like today, we intend to start this work very shortly.
Tom
Expected resolution time updated: 21 Mar 2018, 01:00 UTC
20 Mar 2018
20:18 UTC
Doug Targett
Thanks for your patience! We have seen stability and will continue to work towards a resolution to the issues highlighted by our network manager below.
Please do accept our apologies for the inconvenience caused.
If you've got any residual problems or have any questions, please do get in touch.
20 Mar 2018
19:10 UTC
Doug Targett
We have seen alerts again for our systems but they came back shortly afterwards. We are still looking into the issues and are working to regain stability.
If you think this problem is still impacting your services at Bytemark, please do get in touch.
20 Mar 2018
18:18 UTC
Doug Targett
We believe we're stable and are still monitoring the situation. We will continue to monitor for the next hour to ensure our systems are as they should be.
If you think this problem is still impacting your services at Bytemark, please do get in touch.
20 Mar 2018
17:58 UTC
Doug Targett
We believe we're stable and are monitoring the situation.
If you think this problem is still impacting your services at Bytemark, please do get in touch.
20 Mar 2018
17:43 UTC
Paul Cammish
We're experiencing this issue again, and the Network team are already taking steps to resolve it.
20 Mar 2018
17:11 UTC
Paul Cammish
We're fairly certain things are back to (somewhat) normal again - please see below for more information from Tom, our Network Manager.
If you think this problem is still impacting your services at Bytemark, please do get in touch.
Expected resolution time updated: 20 Mar 2018, 17:30 UTC
20 Mar 2018
16:39 UTC
Paul Cammish
We seem to be experiencing the same issue again - the Operations and Network teams are already investigating.
Expected resolution time updated: 20 Mar 2018, 17:30 UTC
20 Mar 2018
16:16 UTC
Tom Hill
Hello,
Please do accept our sincere apologies for prematurely marking this outage as resolved; despite our best intentions and previous experience, our usual approach failed to completely rectify the ongoing problem.
As some will have no doubt determined, it has not been long since a similar interruption to the network took place yesterday, and the two situations are related, as are other prior outages dotted throughout the last four months of the 'new' data centre's operation in Manchester.
There is an ongoing technical case open with our vendor, pertaining to one of the three pairs of 'MLAG' devices that we utilise. We are working extremely hard to track down the origins of this fault, but it has proven to be quite elusive, and we can sometimes go weeks between incidents. Our vendor has had no luck in stressing similar software & hardware to reproduce the problem in their labs, and has never encountered the same behaviour previously. They have provided extra support in the form of their internal developers' time, in order to help us set 'debug traps' for the relevant daemons to try and catch this behaviour at the exact point it occurs, and with a great deal more detail.
Today we caught the bug occurring with the correct debug options in place, which is something of a silver lining. This information is the key to our vendor determining how this is occurring, and should go towards rectifying the situation.
We're hoping to avoid more drastic workarounds that would otherwise hinder any form of debugging. This solution has worked for many other companies with bigger installations than Bytemark, and should work for us further into the future.
Whilst this is not a good situation we find ourselves in, I can only stress that we're working very closely with our vendor to restore the same quality of service that is expected of Bytemark's networking infrastructure. Our usual monitoring processes are of course all ongoing and we are doing all we can to ensure swift action is taken if and when an outage takes place. When we do determine the fix for the bug, workarounds will be applied and/or emergency software upgrades will be scheduled at the earliest possible opportunity.
Nevertheless, please do accept my sincerest apologies for today's interruption to your services. Further queries can be directed to me via the usual support means (FAO Tom Hill) should you wish to discuss the situation further. Beyond this, I have committed to a full write-up of the issue as soon as we have all of the information required to confidently explain the bug, and any lessons learnt.
Regards,
Tom (Network Manager)
20 Mar 2018
15:41 UTC
Ian Chilton
Everything should be ok now - please contact us if you are still having problems.
20 Mar 2018
15:09 UTC
Timothy Frew
The network issue does not appear to be resolved yet. We've got all hands on deck trying to make sure our network is resilient again.
You may be encountering intermittent network connectivity issues to your Manchester cloud servers.
Expected resolution time updated: We are currently investigating.
20 Mar 2018
14:51 UTC
Paul Cammish
We just experienced a problem with the network in our Manchester data centre.
At first glance this looks to have only affected connectivity to Bytemark Cloud servers based in Manchester and likely control of all Bytemark Cloud servers via the API, panel and Bytemark client.
Our network engineers have already restored service, but you may have noticed a short (2-3 minute) drop in connectivity.