Betzy: Ongoing problems with srun and Infiniband

Start: 25.01.21 at 14:22
End: 19.02.21 at 13:30

Betzy is experiencing issues with srun and Infiniband that intermittently affect users. The srun problem first appeared last week; the Infiniband problem has been present longer.

The srun problem shows up as error messages related to send/recv operations. If you see such error messages, you are probably affected by the problem. As a workaround, rerun the srun job until it succeeds.
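
The retry workaround can be automated. The following is a minimal sketch, not an official tool: the function name run_with_retries and the values for max_attempts and delay_seconds are illustrative assumptions. It reruns a command until it exits with code 0 or the attempt limit is reached.

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=5, delay_seconds=30.0):
    """Run cmd, retrying whenever it exits with a non-zero code.

    Returns the 1-based number of the attempt that succeeded,
    or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} failed with exit code {result.returncode}")
        # Wait before the next try, but not after the final attempt.
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return None
```

For example, run_with_retries(["srun", "./my_program"]) would retry a job step, where ./my_program is a placeholder for your own executable. Note that this sketch retries on any non-zero exit code, so a genuine application error is also retried; the attempt limit keeps that from looping forever.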

The Infiniband problem shows up as error messages containing "Transport retry count exceeded". At the moment there is no workaround for this problem.

The expert team is working with the hardware vendors to solve both problems. We are very sorry for the inconvenience and can assure you that the team is working hard on a fix.


For further updates, please see the Opslog entry below.