Procurement project HPC A2

Figure 1: Current Norwegian HPC and data storage infrastructure, including the services for sensitive data (TSD) in Oslo. The infrastructure in Tromsø (HPC and storage) will be phased out, and services will be provided from Trondheim (Saga and Betzy), together with the new A2 system in Måløy (Lefdal Mine Datacenter).
Goals
The project aims to:
- replace the current HPC machines Fram and Saga and cover the forecast growth in usage.
- provide GPU and CPU computing capability for AI/ML and scientific applications.
- procure a system with expandable computing and storage capacity.
Request For Information (RFI)
After exploring the market and defining Sigma2’s needs for the next-generation HPC machine, the project has now published a request for information (RFI). The RFI aims to gather input and information from potential suppliers regarding crucial parts of the procurement documents. The documents are considered drafts, and the final versions are likely to be similar to the drafts unless feedback from the RFI process indicates that changes should be made.
Sigma2 encourages potential suppliers to use this opportunity to provide feedback, as the possibility to make changes after the invitation to tender has been announced is limited in accordance with the public procurement regulations.
If you wish to contribute to the RFI, please submit your response using the Tendsign tendering tool. Please follow the link below to the publication of the HPC A2 RFI:
Preliminary timeline
- The RFI is published.
- We expect to publish the tender in H1.
- Our goal is to have the system in production in H2 2024.
Background
HPC systems generally have a life span of approximately 5 years due to declining energy efficiency compared to newer machines, obsolete parts, lack of support, and the arrival of new technology.
Our procurement strategy has traditionally been two-legged: an A-leg and a B-leg, where we acquire systems with an offset of 2-3 years.
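To make the cadence concrete, here is a minimal sketch (Python, using purely hypothetical procurement years and a three-year offset) of how the two legs interleave so that one leg is always mid-life when the other is due for replacement.

```python
# Minimal sketch of the two-legged procurement cadence.
# All years below are hypothetical and only illustrate the pattern.

LIFESPAN_YEARS = 5  # approximate useful life of an HPC system
OFFSET_YEARS = 3    # offset between the A-leg and the B-leg (2-3 years)

def leg_schedule(name: str, first_year: int, generations: int) -> list[str]:
    """Return human-readable (procure, retire) entries for one leg."""
    entries = []
    for i in range(generations):
        procured = first_year + i * LIFESPAN_YEARS
        retired = procured + LIFESPAN_YEARS
        entries.append(f"{name}{i + 1}: procured {procured}, retired around {retired}")
    return entries

# Hypothetical start years: the B-leg trails the A-leg by OFFSET_YEARS.
for line in leg_schedule("A", 2015, 3) + leg_schedule("B", 2015 + OFFSET_YEARS, 3):
    print(line)
```

With these assumptions, the newest system in at least one of the two legs is always less than about three years old, so capacity and technology are refreshed without replacing everything at once.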
Current Sigma2 hardware and capability
Fram

| What | Specifications |
|---|---|
| System | Lenovo NeXtScale nx360 |
| Nodes | 1006 |
| Cores | 32256 |
| CPU types | Intel E5-2683v4 2.1 GHz, Intel E7-4850v4 2.1 GHz (hugemem) |
| Local disk | SSD |
| Performance | 1.1 PF |
| Total memory | 78 TiB |
| Disk size and type | 2.5 PB, Lustre |
| Interconnect type and topology | InfiniBand EDR (100 Gbit), Fat Tree |
| Queueing system | Slurm |
| Cooling | Water cooled |
Saga

| What | Specifications |
|---|---|
| System | HPE ProLiant XL170r Gen10, ProLiant XL270d Gen10 (GPU nodes) |
| Nodes | 356 (+8 GPU nodes) |
| Cores | 16064 |
| CPU types | 200 × Intel 6138 2.00 GHz, 120 × Intel 6230R 2.10 GHz, 8 × Intel 6126 2.60 GHz (GPU nodes) |
| Local disk | NVMe |
| Performance | 645 TF |
| Total memory | 75 TiB |
| Disk size and type | 6.6 PB, BeeGFS |
| Interconnect type and topology | InfiniBand FDR (56 Gbit), Fat Tree |
| Queueing system | Slurm |
| Cooling | Air cooled |
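Since the goal is for A2 to replace both Fram and Saga, the two tables above give a rough baseline for the capacity the new system has to match. The sketch below (Python) simply sums those published figures; the growth factor is a purely hypothetical illustration of the usage growth forecast mentioned in the goals, not a number from the procurement documents.

```python
# Baseline capability of the two systems A2 is meant to replace,
# using the figures from the Fram and Saga tables above.
fram = {"nodes": 1006, "cores": 32256, "peak_pflops": 1.1, "memory_tib": 78}
saga = {"nodes": 356 + 8, "cores": 16064, "peak_pflops": 0.645, "memory_tib": 75}

baseline = {key: fram[key] + saga[key] for key in fram}
print("Combined Fram + Saga baseline:", baseline)

# Hypothetical growth factor to illustrate sizing beyond a one-to-one replacement.
GROWTH_FACTOR = 1.5  # illustrative assumption only
sized = {key: round(value * GROWTH_FACTOR, 1) for key, value in baseline.items()}
print("Baseline scaled by assumed growth factor:", sized)
```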
Betzy

| What | Specifications |
|---|---|
| System | BullSequana XH2000 |
| Nodes | 1344 CPU nodes, 4 GPU nodes |
| Cores | 172032 |
| CPU types | CPU nodes: AMD Epyc "Rome" 2.25 GHz; GPU nodes: AMD Milan CPUs with 4 × NVIDIA A100 GPUs (NVLink) per node |
| Local disk | No local disk |
| Performance | 6.5 PF |
| Total memory | 336 TiB |
| Disk size and type | 7.7 PB, Lustre |
| Queueing system | Slurm |
| Interconnect type and topology | InfiniBand HDR (100 Gbit per node, 200 Gbit switches), Dragonfly |
| Cooling | Water cooled |
The NIRD storage is based on IBM Elastic Storage System (ESS). The current capacity of 35 PB is shared between file and object storage, and the system is designed for future growth.
LUMI

| What | Specifications |
|---|---|
| System | HPE Cray Shasta |
| Nodes | 1536 (LUMI-C), 2560 (LUMI-G) |
| Cores | 196608 (LUMI-C) |
| CPU types | AMD EPYC |
| GPU types | AMD MI250X |
| Peak performance (Pflop/s) | 552 (LUMI-G), 8 (LUMI-C) |
| Total memory | 440 TiB |
| Disk size and type | 80 PB spinning disk (LUMI-P), 7 PB flash (LUMI-F), 30 PB object storage (LUMI-O) |
| Queueing system | Slurm |
| Interconnect type and topology | Slingshot |
| Cooling | Water cooled |