Wednesday, 12 July 2017

31 May loss of IT services - update

When we build our services, we design resilience and reliability into them from the start, because we know how critical IT services are to the University. This resilience keeps our services running and, in the case of our network, even protects us from issues that affect other universities. However, despite these high levels of resilience, faults can still occur that cause services to fail.

On Wednesday 31 May at approximately 11.40am, the University experienced a fault within our network. We have now taken three important steps to prevent this from happening again and to speed up recovery from similar problems.

Firstly, we have added a third link between the two core routers in the Computing Centre and Brunswick. If the core routers lose their connection with each other over their primary links, the third link provides a "heartbeat" between the two routers, allowing a full or partial failure of one router to be detected. This will help to prevent the fault we saw on 31 May from recurring.

Secondly, if a similar fault were to recur, the devices in the network cabinets in each building will now attempt to reconnect automatically. This will keep any downtime to a minimum.

Thirdly, the router software has been upgraded.

We take the resilience and reliability of our services very seriously and will continue to work to prevent incidents like this in the future.