30 Apr 2015
15:28 UTC
Tom Hill
Hello all,
After picking through various logs and examining the network graphs, I can now see that at around 05:30 (UTC+1) this morning, one of the two physical switches that make up the switch stack serving our York office space failed, seemingly without warning.
Shortly afterwards, the dead switch module began looping frames in and out. On the VLANs it was present on (almost entirely our own internal-use VLANs), this caused some MAC address black-holing, resulting in extreme packet loss for the services reliant upon them.
As I couldn't signal from the stack master to reboot the failed switch, once we did find out what was happening, I reloaded the entire stack of switches. Shortly afterwards, the office users in York were back online, and the disruption to our customer-facing websites had ceased.
Given that we don't normally 'do' stacking in our networks, I am inclined to remove the stack configuration and run the two as separate switches, so we'll need to arrange an evening to do this. At the same time, I'm sure we'll be able to squeeze in an upgrade to a newer software train, too.
Hopefully, there won't be a resurgence of this issue in the meantime, but if there is, we can likely hobble along on a single switch whilst we arrange for some out-of-office-hours downtime.
Tom
30 Apr 2015
15:28 UTC
Tom Hill
This should now be solved - the issue wasn't directly related to our firewalls in the end. Will expand on this shortly.
30 Apr 2015
15:27 UTC
Ian Chilton
We've identified that one of our internal firewalls is up but seemingly not forwarding traffic properly.
We've failed over to its redundant pair and things are reachable now, but with some packet loss, which we're investigating.
30 Apr 2015
15:26 UTC
Ian Chilton
We've currently got an unknown problem causing a lot of alerts in our monitoring.
It looks to be mostly, if not entirely, internal services, but our websites are down.
I've called one of our network engineers and he's looking into it now.
Sorry this is vague, but we'll update with more information once it becomes clear what's going on and what's affected.
Ian