Unexpected core router failure, York

Expected resolution: 27 Sep 2016, 16:30 UTC
Issue status: Resolved
Date: 27 Sep 2016, 16:13 UTC
Posted by: Tom Hill

The replacement PSU is now inserted & connected, and the power delivery to cr2.yrk is once again resilient.

At this point we will mark this outage as resolved, but if you do have any follow-up queries, please do let us know. :)

Tom

Issue status: Investigating
Date: 27 Sep 2016, 15:47 UTC
Posted by: Tom Hill

We've received some spare PSUs, so we're going to swap one in and monitor cr2.yrk for any further failures.

There is nothing to suggest a repeat of the freak failure that caused the router to reboot previously.

We'll confirm again once this is resolved.

Issue status: Investigating
Date: 27 Sep 2016, 12:21 UTC
Posted by: Tom Hill

During a routine operation to re-patch the power of cr2.yrk in our York data centre, the core router experienced an unexpected (and improperly handled) fault, resulting in a reboot.

Whilst each core router is commissioned with two PSUs, either of which can handle the full load of the router (and often does), one of the PSUs failed catastrophically after it was re-connected to a power source.

Normally this should not cause any issues at all. However, in this particular instance, the internal PSU fan failed shortly after the reset, but only after the router had tried to transfer load to it.

This caused an improper reboot of the system, which certainly should not happen under normal circumstances; the failed PSU should simply not be utilised, and the system should maintain its load on a single PSU.

What further compounded the issue was the slower control plane on these two routers, which caused some blackholing of various routes in & out of Manchester. This is something that we're working to rectify with the purchase of two brand new core routers (which are already here in York, but not yet connected).

At present, cr2.yrk is still 'at risk' of power failure, given that it is running with only a single power source (albeit a UPS-backed one). We have spare PSUs in another of our facilities, and will have two suitable replacements on site within 4 hours.

For now, nothing further is expected to go wrong, but we will update again when we have the replacement PSU to swap in.

I can only apologise for this spurious failure of our core network; this is an incredibly unfortunate situation, and within a few weeks we should have replaced the older core routers entirely.

Tom

Issue status: Investigating
Date: 27 Sep 2016, 11:41 UTC
Posted by: Tom Hill

One of the two core routers in York has rebooted, and this has manifested as a delay in re-routing - particularly towards Manchester - likely due to slower convergence.

Certain EoMPLS services will also be affected towards some customers.

The router is rebooting now, and we'll write up a full RFS once we're confident that it has returned to service.

