York DC Core Network Loop

Expected resolution: 25 Jan 2017, 16:35 UTC
Issue status: Resolved
Date: 2 Feb 2017, 15:25 UTC
Posted by: Matthew Bloch

Resolved after posting RFO.

Issue status: Investigating
Date: 2 Feb 2017, 15:24 UTC
Posted by: Matthew Bloch

Reason For Outage Report

This is a technical explanation of the reason for the outage on the 25th January, what we did to rectify it, and which areas we are still looking into. Our goal is to ensure that a similar outage is less likely in future, and that knock-on effects are minimised.

The root cause was a configuration error that created a bridging loop, which caused a flood of traffic for around 2 minutes. The knock-on effects were a confusing outage of a firewall pair, and an outage affecting 8-25% of our hosted Cloud Server infrastructure in York.

The "heads", "tails" and "brain" servers are part of our BigV Cloud Server infrastructure, described here.

Timeline (to nearest five minutes)

15:50: We accidentally introduced a bridging loop into the network for an internal VLAN. This was fixed within about 2 minutes. The resulting traffic levels appeared to stop the firewall forwarding IPv6 traffic for this and other VLANs, which we're still investigating. This firewall also handled Bytemark networks that run our internal services, database cluster, and administrative interfaces to BigV.

15:50 - 16:35: We worked to access the firewalls and diagnose the issue. Establishing a connection was hampered by the office access VLANs also being affected. In the end a specific diagnosis wasn't possible, so we failed over and then rebooted the affected firewalls.

16:45: All Cloud Servers on heads 51 and 58 were restarted - the software processes on each head had stopped due to a bug we're looking into. head52 was also affected, but re-established its connection to the brain by itself, so no action was necessary.

18:00: Tails 51, 55 and 64 were restarted as they had been offline, which will have caused I/O freezes and crashes on affected Cloud Servers.

Services affected

Bytemark Cloud customers:

  • heads 51 and 58: outage of approximately 1 hour; head 52: loss of control for approximately 1 hour
  • tails 51, 55 and 64: outage of approximately 2 hours, primarily where a VM's primary disc was on one of these tails
  • *.bigv.io DNS and Cloud Server API access: both unavailable until database connections to the brain were restored.

Based on the number of servers affected, anything between 8% and 25% of Cloud Servers will have seen some slowdown or needed to be rebooted.

Root cause analysis

This was a cascading failure: the bridging loop caused the firewall's IPv6 failure, which in turn interrupted control traffic between the brain and database on one side and the York BigV servers on the other.

Following the reachability issue, some heads did not re-establish their connections to the brain. This was due to a bug that had been diagnosed and fixed in November 2016. The fix was not deployed due to an oversight in our deployment process, which has since been overhauled to catch this kind of problem relating to software dependencies.

The software bug responsible for most of the downtime was that TCP control connections from the heads to the brain were not being torn down when the brain disappeared. This has been mitigated using TCP "keepalive", as well as by ensuring the heads send periodic "ping" messages to the brain to make sure it is still responding.
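
As an illustration of that mitigation, here is a minimal sketch of how a control connection might enable kernel TCP keepalive alongside an application-level heartbeat. It is not BigV's actual head or brain code: the connect_to_brain and ping_loop names, the PING message and the timing values are assumptions made for the example, and the TCP_KEEP* socket options are Linux-specific.

    import socket
    import time

    def connect_to_brain(host, port):
        """Open a control connection with kernel TCP keepalive enabled
        (hypothetical sketch; option values are illustrative only)."""
        sock = socket.create_connection((host, port))
        # Ask the kernel to probe an idle connection so a peer that has
        # silently disappeared is detected and the connection torn down.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30)   # seconds idle before first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before giving up
        return sock

    def ping_loop(sock, interval=60):
        """Application-level heartbeat: send a periodic ping and raise if the
        brain stops answering, so the caller can reconnect promptly."""
        sock.settimeout(interval)       # recv() raises socket.timeout if no reply arrives
        while True:
            sock.sendall(b"PING\n")
            if not sock.recv(1024):     # empty read means the peer closed the connection
                raise ConnectionError("brain closed the connection")
            time.sleep(interval)

Between them, the kernel keepalive and the heartbeat put an upper bound on how long a head can hold on to a connection to a brain that is no longer responding.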

Corrective action so far

  • A review of network maintenance change control procedures, to ensure that configuration changes are more carefully vetted.
  • A review of tail code to ensure its connections to the brain are as robust as those for the heads, and that connections re-establish in a reasonable time.
  • (already in progress) A revision of our software deployment process to remove the bug that allowed a software update to be ignored.

Issue status: Resolved
Date: 26 Jan 2017, 14:53 UTC
Posted by: Jamie Nguyen

Unfortunately, we had a brief network outage in York yesterday afternoon. This was the result of a mistake made during routine network maintenance, which led to a network loop. While the loop itself was spotted and fixed promptly (within a couple of minutes), some of the subsequent problems on our Cloud platform took some time to troubleshoot and fix.

We do apologise sincerely for this outage and the significant inconvenience it caused to you.

  • Any Cloud Servers running on head51 or head58 would have seen an outage of approximately 60 minutes.
  • Any Cloud Servers with disks on two of our tails would have seen an outage of approximately 90 minutes.
  • Any Cloud Servers on head52 remained online but you would have been unable to control them via the Bytemark Panel.

It was a mistake that should have been avoided, and we're looking into mitigations and changing our processes to prevent similar mistakes from recurring. Our developers are continuing to work tirelessly to improve the resilience of our services. Suffice it to say that outages like this should be rare!

There should be no outstanding issues remaining, though please do get in touch at any time: https://www.bytemark.co.uk/help/

Issue status: Investigating
Date: 25 Jan 2017, 18:21 UTC
Posted by: Jamie Nguyen

All affected Cloud Servers should have come back online over the last 45 minutes. There were servers affected on two heads (head51, head58) and two tails. Cloud Servers on head52 should now be accessible again via the panel.

We're deeply sorry for the unplanned period of downtime. We're looking at preventing these issues from ever recurring.

If anyone has any remaining issues, do get in touch via https://www.bytemark.co.uk/help/ and please do use the urgent address if needed tonight.

Issue status: Investigating
Date: 25 Jan 2017, 16:58 UTC
Posted by: Jamie Nguyen

Unfortunately, we encountered some issues with two heads (head51 and head58) following this network problem. We're working on restoring service and affected Cloud Servers should currently be coming back online.

Any Cloud Servers on head52 should be unaffected, but you will currently be unable to control them via the panel.

We're very sorry for the significant inconvenience this is causing. We're working on fixing the remaining issues!

Issue status: Investigating
Date: 25 Jan 2017, 16:33 UTC
Posted by: Nat Lasseter

Services not behind our firewalls should be back online now.

This event has caused TCP connections through our shared firewalls to fail, and services behind those are still affected. We are failing the firewalls over onto their redundant partners now.

Apologies on behalf of both Bytemark and myself.

Nat

Issue status: Investigating
Date: 25 Jan 2017, 16:23 UTC
Posted by: Nat Lasseter

Due to an error caused by myself, we have just suffered a switching loop in the core network in our York data centre.

This happened as the result of an attempted upgrade as part of the expansion of our York site.

We have resolved the root cause now, but we are still experiencing fallout caused by the ensuing broadcast storm.

If you are still having any issues with services hosted in York, please do not hesitate to contact support by the usual methods.

Thanks,
Nat
