NIRD and Fram prolonged downtime

In January, users of the national e-infrastructure resources unfortunately experienced prolonged downtime of the NIRD and Fram services.

On Tuesday 12 January 2019, the NIRD and Fram services were taken down for maintenance to perform a critical firmware upgrade on the storage hardware. The work should have lasted one working day, but at the end of the upgrade the systems did not reboot properly, leading to a prolonged downtime of 14 days for NIRD and 16.5 days for Fram. During these periods, computation on Fram and access to NIRD storage were not possible, and services on the NIRD cloud infrastructure were down.

Root cause and solution

The upgrade, followed by a reboot, caused some defective hardware to fail, which required the components to be replaced. Since the spare parts on site were insufficient, new hardware was shipped from the USA. The delivery was further delayed by extraordinary customs operations at the Norwegian border and by ineffective coordination between the international and national shipping agencies. Once the new hardware was installed, NIRD could be restarted, while Fram remained in failure mode. Only when the service provider's engineers arrived on site could the root cause analysis be completed, by identifying an incorrect cabling set-up.

Measures

The unfortunate events have been analyzed and a thorough incident report has been prepared. Several measures have been identified to mitigate the impact should such a failure occur again. In particular, a larger stock of on-site spare parts has been requested and obtained, which avoids waiting for new hardware to be shipped. Furthermore, having the service provider's engineers on site was of paramount importance for completing the root cause analysis, and their presence will be requested immediately in case of similar events. Communication with end users must also be more effective during incidents that lead to unplanned outages, possibly conducted on two levels: a technical one, run by the operations organization through the Metacenter ops-log channel, and a high-level one, handled by Sigma2 through news and mail channels.

Sigma2 sincerely apologizes for the inconvenience this prolonged downtime caused to you and your research.