Delayed delivery of Fram

UNINETT Sigma2 announces delayed delivery of Fram due to hardware problems.

According to plan, UNINETT Sigma2 contractually accepted the acceptance test for Fram at the end of February. After that we started the approval period, installing our own basic software platform (EasyBuild, Slurm etc.). At the end of March, 14 projects started using Fram as pilot users. The approval period was planned to end on 25 April. However, during this period both the pilot users and we, in our own verification testing, encountered instability and low performance. As a consequence, we had to refuse approval of the Fram delivery and suspend the approval period. Since then, we have been working with the vendor in a joint task force to find the reason for the instability and low performance.

It has now been documented that the cause of our problems is a switch failure in the high-speed interconnect network delivered by Mellanox. 60 out of 100 switches are affected by the same failure, resulting in instability, low capacity and even shutdown of switches. The root cause is humidity trapped inside a capacitor, which in turn raises the capacitor's resistance. Only one specific production series is affected, consisting of 240 switches in total, of which 60 have been delivered to us and the rest to another customer. It has also been verified that this failure materializes 3-8 months after installation, which explains why we did not encounter it during the acceptance period. Consequently, it has now been decided that the 60 switches will be replaced during June, and we will resume the approval period after that. This means that we are facing a substantial delay, and as we are approaching the summer holidays, our estimated time for final approval of Fram is now 1 September.

Mitigating actions

The delay means a substantial loss of CPU hours. As a mitigating action we have decided to prolong the delivery of CPU hours from Vilje and Hexagon for 2-3 months beyond 1 July 2017, which was the planned end date for delivery from these machines. New contracts will be established for these deliveries; however, the guaranteed quality of service will be lower due to a reduction in support contracts.

All pilot users will be informed as soon as we have tested Fram after the replacement of the switches, estimated to happen in the first part of July.

The migration of all projects from Hexagon and Vilje is now estimated to start in the last part of August and finish before 1 October. All projects will be informed in due time to allow for the necessary preparation.