Procurement project HPC A2
Figure 1: Current Norwegian HPC and Data Storage infrastructure, including sensitive data (TSD) in Oslo. The infrastructure in Tromsø (HPC and storage) will be phased out, and service will be provided from Trondheim (Saga and Betzy), together with the new A2 system in Måløy (Lefdal Mine Datacenter).
The project aims to:
- replace the current HPC machines Fram and Saga (more details below) and cover forecast usage growth.
- provide computing capability for AI/ML and scientific applications through GPUs and CPUs.
- procure a system with expandable computing and storage capacity.
A2 conceptual specs and drawing
The primary function of the A2 machine is to replace the two machines Fram and Saga. This may translate into the following:
| Spec | Value |
|---|---|
| System | ARM, x86, Power |
| CPU types | ARM, x86, Power |
| Local disk (system and scratch for jobs) | NVMe (solid-state) |
| Performance | Benchmarks (HPCG, AI/ML + scientific codes) |
| Total memory | Bandwidth & size |
| Disk size and type | 10 PB (divided into 3 file systems) |
| Interconnect type and topology | Low latency, intermediate bandwidth |
Figure 2: The proposed A2 system in grey, locally connected to the NIRD data storage in LMD, with remote connections to the systems in Kajaani, Tromsø and Trondheim.
Questions for vendors
We would like vendors to answer the following:
- How long can you support the hardware?
- What are your thoughts on the GPU-to-CPU ratio in computational science, now and five years from now?
- What is your current experience with, and what are your future plans for, multi-partition HPC machines?
- What is your roadmap for GPU/CPU integration?
- How would you describe your software ecosystem?
- Do you facilitate cloud integration?
Please do not hesitate to contact us at email@example.com if you wish to get in touch with the project.
We expect to publish an RFI in Q1 2023.
We expect to publish the tender in H1 2023.
Our goal is to have the system in production in H1 2024.
The lifetime of A2 should be at least five years. The expected lifetime is seven years.
HPC systems generally have a life span of approximately five years due to declining energy efficiency relative to newer machines, obsolete parts, lapsing support, and the arrival of new technology.
Our procurement strategy has traditionally been two-legged: an A-leg and a B-leg, where we acquire systems offset by 2-3 years.
Current Sigma2 hardware and capability
Fram

| Spec | Value |
|---|---|
| System | Lenovo NeXtScale nx360 |
| CPU types | Intel E5-2683v4 2.1 GHz; Intel E7-4850v4 2.1 GHz (hugemem) |
| Total memory | 78 TiB |
| Disk size and type | 2.5 PB, Lustre |
| Interconnect type and topology | InfiniBand EDR (100 Gbit), fat tree |
Saga

| Spec | Value |
|---|---|
| System | HPE XL170r Gen10; ProLiant XL270d Gen10 (GPU nodes) |
| Nodes | 356 (+8 GPU nodes) |
| CPU types | 200 × Intel 6138 2.00 GHz; 120 × Intel 6230R 2.10 GHz; 8 × Intel 6126 2.60 GHz (GPU nodes) |
| Total memory | 75 TiB |
| Disk size and type | 6.6 PB, BeeGFS |
| Interconnect type and topology | InfiniBand FDR (56 Gbit), fat tree |
Betzy

| Spec | Value |
|---|---|
| Nodes | 1344 CPU nodes, 4 GPU nodes |
| CPU types | AMD EPYC "Rome" 2.25 GHz (CPU nodes); AMD Milan (GPU nodes) |
| GPU types | 4 × NVIDIA A100 with NVLink per GPU node |
| Local disk | No local disk |
| Total memory | 336 TiB |
| Disk size and type | 7.7 PB, Lustre |
| Interconnect type and topology | InfiniBand HDR (100 Gbit per node, 200 Gbit switches), Dragonfly |
The NIRD storage is based on IBM Elastic Storage Server (ESS). The current capacity of 35 PB is shared between file and object storage and is designed for future growth.
LUMI

| Spec | Value |
|---|---|
| System | HPE Cray Shasta |
| Nodes | 1536 (LUMI-C), 2560 (LUMI-G) |
| Cores | 196 608 (LUMI-C) |
| CPU types | AMD EPYC |
| GPU types | AMD MI250X |
| Storage | 80 PB (LUMI-P), 7 PB (LUMI-F), 30 PB (LUMI-O) |
| Peak performance (Pflop/s) | 552 (LUMI-G), 8 (LUMI-C) |
| Total memory | 440 TiB |
| Disk size and type | Spinning disk (LUMI-P), flash (LUMI-F) |
| Interconnect type and topology | HPE Slingshot |
We recognise that the CPU market is quite diverse these days, and at the same time, we believe that more than 90% of our workload is CPU architecture agnostic and may run well on x86, ARM, and Power CPUs alike. As such, we have no specific CPU demands beyond the requirement that the system must be able to compile and run standard HPC software and applications.
Apart from the partnership in LUMI, Sigma2 does not possess any substantial GPU resources that can seamlessly run CUDA (NVIDIA) software, in particular machine learning and AI codes written for scientific applications. We have some GPUs installed for accelerated HPC workloads.
For the most part, the demand for NVIDIA-compatible GPUs has been met by individual universities and research groups, among them the IDUN GPU cluster at NTNU with a total of 200 GPUs (of which about 100 are A100s). Sigma2 needs a similar capability in the near future, but we have not yet decided its extent (capacity, technology, etc.). This capability will be part of A2, and A2 must be able to integrate such GPUs with its interconnect, storage, and other infrastructure (queueing system, etc.).
We have seen steady growth and expect a significant increase in GPU demand, mainly for AI/ML, but also, to a lesser extent, for accelerated HPC (64-bit floating point).
All our current HPC systems use different generations of Infiniband with different topologies. We are experienced with this interconnect and consider it a mature network technology. For A2, we still consider Infiniband a good choice, but we are looking for stable and proven versions of this interconnect, not state-of-the-art versions that have not had a good production time in reference systems. We consider latency more important than bandwidth.
The interconnect should be configured such that the CPU and GPU computational capacity can be expanded.
Local node disk
For the workload we foresee A2 running, a substantial part of the jobs will benefit from a node-local temp/scratch filesystem. The Saga system has local NVMe, and we would like to equip the A2 nodes with fast solid-state storage (e.g. NVMe) as well.
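As an illustration, the staging pattern such jobs typically follow can be sketched as below. The directory names are hypothetical stand-ins created under a temporary directory; on a real system the node-local path would normally be provided by the batch environment (e.g. $TMPDIR):

```shell
#!/bin/sh
# Sketch of the node-local scratch pattern: stage input onto fast local
# storage, compute against it, then copy results back to shared storage.
# All paths here are stand-ins for illustration only.
set -eu

LOCAL_SCRATCH=$(mktemp -d)    # stand-in for node-local NVMe scratch
PROJECT_DIR=$(mktemp -d)      # stand-in for the shared project filesystem
printf 'input data\n' > "$PROJECT_DIR/input.dat"

# 1) Stage input onto local scratch
cp "$PROJECT_DIR/input.dat" "$LOCAL_SCRATCH/"

# 2) Compute with I/O against the local disk (trivial placeholder step)
tr 'a-z' 'A-Z' < "$LOCAL_SCRATCH/input.dat" > "$LOCAL_SCRATCH/output.dat"

# 3) Stage results out and clean up the local scratch
cp "$LOCAL_SCRATCH/output.dat" "$PROJECT_DIR/"
rm -rf "$LOCAL_SCRATCH"

cat "$PROJECT_DIR/output.dat"   # prints: INPUT DATA
```

The point of the pattern is that the I/O-intensive middle step never touches the shared filesystems, which is exactly what node-local solid-state storage buys.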
Shared file systems
All our current HPC systems have a variant of either Lustre or BeeGFS, and we also have experience with GPFS. All three technologies are acceptable to us and have served us well. One shortcoming in our current setup is that we have one large filesystem to serve all kinds of I/O on the cluster.
For A2, we are considering several shared file systems, each serving a specific purpose without influencing the others.
- Shared scratch: fast, IOPS, good bandwidth (solid-state)
- Shared home and software: good IOPS
- Shared project/data folder: good bandwidth, moderate IOPS
All filesystems must be separately operated and independent, except for using the same interconnect. All parallel file systems must be upgradeable both in performance and capacity.
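To make the purpose split concrete, the sketch below contrasts the two access patterns: many small files, which stresses IOPS (home and software areas), versus one large sequential write, which stresses bandwidth (project/data areas). The directories are temporary stand-ins, not real cluster filesystems:

```shell
#!/bin/sh
# Sketch: the two I/O patterns the separate filesystems are meant to serve.
set -eu
BASE=$(mktemp -d)
mkdir "$BASE/small" "$BASE/large"

# IOPS-heavy pattern: many tiny files, as produced by software
# installations and typical home-directory workloads.
i=0
while [ "$i" -lt 1000 ]; do
    printf 'x' > "$BASE/small/file$i"
    i=$((i + 1))
done

# Bandwidth-heavy pattern: one large sequential write, as produced by
# simulation output and checkpoints in project/data folders.
dd if=/dev/zero of="$BASE/large/checkpoint.dat" bs=1M count=64 2>/dev/null

echo "small files: $(ls "$BASE/small" | wc -l)"
echo "large file bytes: $(wc -c < "$BASE/large/checkpoint.dat")"
```

On a single shared filesystem these two patterns compete; serving them from separate, independently operated filesystems is what the list above is meant to achieve.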
Site facilities, Power, and Cooling
A2 will be placed in a data centre with water-cooling capability. Power is not a limiting factor.
Connectivity to NIRD
A2 will be placed close to the NIRD storage facility (file and object store), at about 50-100 m cable length, within InfiniBand range. We have the option to integrate NIRD storage in several ways.
- Loosely: Either with copying via transport nodes or directly on login nodes (staging)
- Close: Mounting on login or transport nodes (with GPFS)
- Tight: Mounting on all nodes (with GPFS)
If a tight connection is chosen, this may affect the size of the shared cluster file system "shared projects".
We are running Slurm on all our clusters.
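For context, a typical job on our clusters is submitted as a Slurm batch script along the following lines; the account, partition, module, and executable names are hypothetical placeholders, not actual A2 settings:

```shell
#!/bin/bash
#SBATCH --account=nn1234k        # hypothetical project account
#SBATCH --partition=accel        # hypothetical GPU partition name
#SBATCH --gres=gpu:1             # request one GPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=01:00:00

module load SomeApplication/1.0  # placeholder module
srun some_application input.dat  # placeholder executable
```

Any GPU partition integrated into A2 would be addressed through the same mechanism, which is why the queueing-system integration mentioned above matters.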
The A2 system should include a standalone test system, which will be used to minimise downtime on the production system.
The test system should be delivered before the primary system.
We envision that cloud bursting from A2 should be possible when the demand for CPU or GPU-compute resources is high. We are also interested in using the cloud as a gateway to more exotic hardware.
We currently do not have any cloud connectivity on our clusters.
System software here means the ecosystem for building scientific software. Today we use the GNU and Intel toolchains for the most part, and to a lesser extent NVIDIA (formerly PGI) and ARM compilers and libraries.
In addition to the build toolchains, the Intel performance tools, the TotalView debugger, and ARM Forge (of which the ARM Performance Reports tool is in widespread use) are part of this ecosystem.
Our current systems run a diverse software stack of more than 400 scientific applications with continuous growth.