[Resolved] Bytemark sites down

Expected resolution: 30 Apr 2015, 19:26 UTC

Issue status: Resolved

Date: 30 Apr 2015, 15:28 UTC

Posted by: Tom Hill

Hello all,

After picking through various logs and examining the network graphing, I can now see that at around 05:30 (UTC+1) this morning, one of the two physical switches that make up the switch stack serving our York office space failed (and seemingly without warning).

Shortly afterwards, the dead switch module began looping frames in and out. On the VLANs it was present on (almost entirely our own internal-use VLANs), this caused some MAC address black-holing, resulting in extreme packet loss for the services reliant upon them.
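For anyone unfamiliar with how looped frames lead to MAC address black-holing, here is a minimal Python sketch of the general mechanism. It is a toy model of a MAC-learning switch, not a reconstruction of the actual fault: the `LearningSwitch` class, the port numbers and the MAC addresses are all hypothetical and chosen purely for illustration.

```python
class LearningSwitch:
    """Toy model of a MAC-learning Ethernet switch (illustrative only)."""

    def __init__(self):
        self.mac_table = {}  # source MAC -> port it was last seen on

    def receive(self, src_mac, dst_mac, in_port):
        # Learn (or re-learn) which port the source address lives on.
        self.mac_table[src_mac] = in_port
        # Return the port for a known destination, or None to mean "flood".
        return self.mac_table.get(dst_mac)


switch = LearningSwitch()

# Normal operation: a server sits on port 1 and a client on port 2
# (hypothetical addresses and ports).
switch.receive("aa:aa:aa:aa:aa:01", "ff:ff:ff:ff:ff:ff", in_port=1)
before = switch.receive("aa:aa:aa:aa:aa:02", "aa:aa:aa:aa:aa:01", in_port=2)

# A faulty device on port 9 loops one of the server's frames back in,
# so the switch re-learns the server's MAC on the wrong port.
switch.receive("aa:aa:aa:aa:aa:01", "ff:ff:ff:ff:ff:ff", in_port=9)
after = switch.receive("aa:aa:aa:aa:aa:02", "aa:aa:aa:aa:aa:01", in_port=2)

print(before, after)  # prints "1 9"
```

In this sketch, traffic for the server is sent towards port 9 after the looped frame arrives, where nothing is listening for it. Until the server transmits again and the entry is corrected, those frames are simply lost, which shows up as packet loss.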

Once we found out what was happening, I reloaded the entire stack of switches, as I couldn't signal the failed switch to reboot from the stack master. Shortly afterwards, the office users in York were back online, and the disruption to our customer-facing websites had ceased.

Given that we don't normally 'do' stacking in our networks, I am inclined to remove the stack configuration and run the two as separate switches, so we'll need to arrange an evening to do this. At the same time, I'm sure we'll be able to squeeze in an upgrade to a newer software train, too.

Hopefully, there won't be a resurgence of this issue in the meantime, but if there is, we can likely hobble along on a single switch whilst we arrange for some out-of-office-hours downtime.

Tom

Issue status: Resolved

Date: 30 Apr 2015, 15:28 UTC

Posted by: Tom Hill

This should now be solved - the issue wasn't directly related to our firewalls in the end. Will expand on this shortly.

Issue status: Investigating

Date: 30 Apr 2015, 15:27 UTC

Posted by: Ian Chilton

We've identified that one of our internal firewalls is up but seemingly not forwarding traffic properly.

We've failed over to its redundant pair and things are reachable now, but with some packet loss, which we're investigating.

Issue status: Investigating

Date: 30 Apr 2015, 15:26 UTC

Posted by: Ian Chilton

We've currently got an unknown problem causing a lot of alerts in our monitoring.

It looks to be mostly/all internal services, but our websites are down.

I've called one of our network engineers and he's looking into it now.

Sorry this is vague, but we'll update with more information once it becomes clear what's going on and what's affected.

Ian
