Network issue in Manchester data centre

Expected resolution: 23 Mar 2018, 12:00 UTC
Issue status: Resolved
Date: 23 Mar 2018, 09:14 UTC
Posted by: Ian Chilton

Everything is continuing to be stable so far.

We're continuing to monitor the situation closely and will be working on a full write up as soon as possible.

Issue status: Monitoring
Date: 22 Mar 2018, 18:29 UTC
Posted by: Tom Hill

We believe we've isolated the root cause of the problem, and the perfect storm that led us this far. We'll be writing this up in full detail as soon as possible.

For now, most of our hardware is fine, but a certain number of transceivers will need replacing in due course.

We will be keeping a very keen eye on things in the meantime; however, barring anything unexpected, we should be mostly out of the woods.

Thanks to everyone for all of your patience over the last few days.

Tom

Expected resolution time updated: 23 Mar 2018, 12:00 UTC

Issue status: Monitoring
Date: 22 Mar 2018, 16:03 UTC
Posted by: Ian Chilton

Everything is looking good at the moment.

It's early days, but we are hopeful that we might just have found the cause.

More updates will follow.

Issue status: Monitoring
Date: 22 Mar 2018, 13:45 UTC
Posted by: Ian Chilton

Unfortunately, we're seeing a recurrence of packet loss to some of the racks in Manchester.

We're still waiting on a delivery of new optical transceivers, which is where we now believe the problem lies.

We're continuing to investigate in the meantime.

Issue status: Monitoring
Date: 21 Mar 2018, 22:34 UTC
Posted by: Tom Hill

Hello,

Today has been quite a trying day for us all - staff and customers alike. If you're still following along, then let me first say thank you for all of your patience in bearing with us whilst we try and stabilise what has been quite a rocky network.

We've spent almost the entire day screen-sharing with our vendor, showing them exactly where the problems lie. The call included a number of their low-level hardware developers, and so we were able to obtain some excellent debugging information from the switches. It really has been an "all-hands" situation, both at the vendor and at Bytemark.

Together we believe that we have isolated the cause of the issue, which lies with the particular optical transceivers that we're using. The root cause of why they've all suddenly degraded in concert isn't clear; however, this has given us quite a leg up in our efforts to rectify the situation.

Now, through some trial-and-error, we have managed to re-jig the hardware and software configuration such that we have now stabilised the software element of the network, which was at the centre of the worst of the outages over the last few days.

Regrettably, there remains some degree of packet loss to some services, though nowhere near as much as before, and that should hopefully hold together until we can make further inroads with some new transceivers that are presently en route to us.

Once again, please do accept my deepest apologies for the extent of the problems faced over the last 48 hours. Be assured that we are working as hard as possible to restore the full quality of service that you expect from Bytemark's network.

As per my previous update, I am more than happy to field further queries personally if you would like to discuss this at greater length. Pop a 'FAO Tom' message into support, and it will be passed on to me without delay.

Kind regards,

Tom

Issue status: Investigating
Date: 21 Mar 2018, 21:40 UTC
Posted by: Doug Targett

We are still looking into the networking issues that are causing some packet loss.

Again, please accept our apologies for the inconvenience this may be causing.

Issue status: Investigating
Date: 21 Mar 2018, 16:58 UTC
Posted by: Ian Chilton

We're continuing to see problems on and off, and we're still investigating.

A conference call with the vendor is currently ongoing.

Issue status: Investigating
Date: 21 Mar 2018, 14:20 UTC
Posted by: Ian Chilton

All quiet again at the moment.

Each time the problem occurs, we gain some more useful information.

Expected resolution time updated: We are currently investigating.

Issue status: Investigating
Date: 21 Mar 2018, 13:53 UTC
Posted by: Ian Chilton

We're seeing problems reaching some hosts again, I'm afraid.

More troubleshooting is underway.

Issue status: Investigating
Date: 21 Mar 2018, 13:19 UTC
Posted by: Ian Chilton

We've made some more configuration changes to try and narrow down the symptoms we're seeing, and we're continuing to send debugging information to the vendor.

Everything is stable at the moment but we're monitoring the situation very closely.

Issue status: Investigating
Date: 21 Mar 2018, 11:48 UTC
Posted by: Andrew Ladlow

We've been alerted to a recurrence of intermittent packet loss to servers in our Manchester data centre.

Please do accept our apologies for the inconvenience caused. We're investigating and working to restore full service as soon as possible. We'll post updates here as soon as we're able.

If you're experiencing any problems with your services at Bytemark that you think might be related, please do get in touch.

Expected resolution time updated: 21 Mar 2018, 14:00 UTC

Issue status: Monitoring
Date: 21 Mar 2018, 01:54 UTC
Posted by: Tom Hill

Hello,

Further to the software upgrade: whilst that process did not cause any outages itself, the resulting software version is also exhibiting similar symptoms, and so we are back to our vendor for more options.

We've reduced our resiliency in order to try and keep interruptions to a minimum in the meantime.

Kind regards,

Tom

Issue status: Investigating
Date: 20 Mar 2018, 23:53 UTC
Posted by: Tom Hill

Hello,

We've been advised by our software vendor to upgrade to the latest version of the running software, which includes some very promising fixes. The upgrade is designed to be applied to the switches one at a time, and we expect to be able to do it that way. Given the very small possibility of service interruption, and the potential fix to avoid another day like today, we intend to start this work very shortly.

Tom

Expected resolution time updated: 21 Mar 2018, 01:00 UTC

Issue status: Resolved
Date: 20 Mar 2018, 20:18 UTC
Posted by: Doug Targett

Thanks for your patience! The network has been stable, and we will continue to work towards a resolution to the issues highlighted by our network manager below.

Please do accept our apologies for the inconvenience caused.

If you've got any residual problems or have any questions, please do get in touch.

Issue status: Monitoring
Date: 20 Mar 2018, 19:10 UTC
Posted by: Doug Targett

We have seen alerts for our systems again, but they recovered shortly afterwards. We are still looking into the issues and are working to regain stability.

If you think this problem is still impacting your services at Bytemark, please do get in touch.

Issue status: Monitoring
Date: 20 Mar 2018, 18:18 UTC
Posted by: Doug Targett

We believe we're stable and are still monitoring the situation. We will continue to monitor for the next hour to ensure our systems are as they should be.

If you think this problem is still impacting your services at Bytemark, please do get in touch.

Issue status: Monitoring
Date: 20 Mar 2018, 17:58 UTC
Posted by: Doug Targett

We believe we're stable and are monitoring the situation.

If you think this problem is still impacting your services at Bytemark, please do get in touch.

Issue status: Investigating
Date: 20 Mar 2018, 17:43 UTC
Posted by: Paul Cammish

We're experiencing this issue again, and the Network team are already taking steps to resolve it.

Issue status: Monitoring
Date: 20 Mar 2018, 17:11 UTC
Posted by: Paul Cammish

We're fairly certain things are back to (somewhat) normal again - please see below for more information from Tom, our Network Manager.

If you think this problem is still impacting your services at Bytemark, please do get in touch.

Expected resolution time updated: 20 Mar 2018, 17:30 UTC

Issue status: Investigating
Date: 20 Mar 2018, 16:39 UTC
Posted by: Paul Cammish

We seem to be experiencing the same issue again - the Operations and Network teams are already investigating.

Expected resolution time updated: 20 Mar 2018, 17:30 UTC

Issue status: Monitoring
Date: 20 Mar 2018, 16:16 UTC
Posted by: Tom Hill

Hello,

Please do accept our sincere apologies for prematurely marking this outage as resolved; despite our best intentions, and previous experience, our usual approach failed to completely rectify the ongoing problem.

As some will have no doubt determined, it has not been long since a similar interruption to the network took place yesterday, and the two situations are related, as are other prior outages dotted throughout the last four months of the 'new' data centre's operation in Manchester.

There is an ongoing technical case open with our vendor, pertaining to one of the three pairs of 'MLAG' devices that we utilise. We are working extremely hard to track down the origins of this fault, but it has proven to be quite elusive, and we can sometimes go weeks between incidents. Our vendor has had no luck in stressing similar software & hardware to reproduce the problem in their labs, and has never encountered the same behaviour previously. They have provided extra support in the form of their internal developers' time, in order to help us set 'debug traps' for the relevant daemons to try and catch this behaviour at the exact point it occurs, and with a great deal more detail.

Today we caught the bug occurring with the correct debug options in place, which is something of a silver lining. This information is the key to our vendor determining how this is occurring, and should go towards rectifying the situation.

We're hoping to avoid more drastic workarounds that would otherwise hinder any form of debugging. This solution has worked for many other companies with bigger installations than Bytemark, and should work for us further into the future.

Whilst this is not a good situation we find ourselves in, I can only stress that we're working very closely with our vendor to restore the same quality of service that is expected of Bytemark's networking infrastructure. Our usual monitoring processes are of course all ongoing and we are doing all we can to ensure swift action is taken if and when an outage takes place. When we do determine the fix for the bug, workarounds will be applied and/or emergency software upgrades will be scheduled at the earliest possible opportunity.

Nevertheless, please do accept my sincerest apologies for today's interruption to your services. Further queries can be directed to me via the usual support means (FAO Tom Hill) should you wish to discuss the situation further. Beyond this, I have committed to a full write-up of the issue as soon as we have all of the information required to confidently explain the bug, and any lessons learnt.

Regards,

Tom (Network Manager)

Issue status: Resolved
Date: 20 Mar 2018, 15:41 UTC
Posted by: Ian Chilton

Everything should be ok now - please contact us if you are still having problems.

Issue status: Investigating
Date: 20 Mar 2018, 15:09 UTC
Posted by: Timothy Frew

The network issue does not appear to be resolved yet. We've got all hands on deck trying to make sure our network is resilient again.

You may be encountering intermittent network connectivity issues to your Manchester cloud servers.

Expected resolution time updated: We are currently investigating.

Issue status: Monitoring
Date: 20 Mar 2018, 14:51 UTC
Posted by: Paul Cammish

We just experienced a problem with the network in our Manchester data centre.

At first glance, this looks to have only affected connectivity to Bytemark Cloud servers based in Manchester, and likely control of all Bytemark Cloud servers via the API, panel and Bytemark client.

Our network engineers have already restored service, but you may have noticed a short (2-3 minute) drop in connectivity.
