Procurement project HPC A2

The next national HPC and AI/ML Platform

The content on this web page is subject to change. Last updated 22 Nov 2022.

This project's objective is to acquire and put into operation the next-generation national HPC resource (supercomputer) in the Sigma2 e-infrastructure portfolio. The working name for this resource is A2. A2 will be placed in Lefdal Mine Datacenter, the same location as the new national storage infrastructure (NIRD).


Figure 1: Current Norwegian HPC and Data Storage infrastructure, including sensitive data (TSD) in Oslo. The infrastructure in Tromsø (HPC and storage) will be phased out, and service will be provided from Trondheim (Saga and Betzy), together with the new A2 system in Måløy (Lefdal Mine Datacenter).

Goals

The project aims to:

  • replace the current HPC machines Fram and Saga (more details below) and cover the forecast growth in usage.
  • provide computing capability for AI/ML and scientific applications through GPUs and CPUs.
  • procure a system with expandable computing and storage capacity.

A2 conceptual specs and drawing

The primary function of the A2 machine is to replace the two machines Fram and Saga. This may translate into the following:

System: ARM, x86, Power
Nodes: 500-1,000
CPU cores: 80,000-100,000
CPU types: ARM, x86, Power
Local disk (system and scratch for jobs): NVMe (solid-state)
Performance: benchmarks (HPCG, AI/ML and scientific codes)
Total memory: bandwidth and size
Disk size and type: 10 PB (divided into 3 file systems)
Interconnect type and topology: low latency, intermediate bandwidth
Cooling: water

 


Figure 2: The proposed A2 system in grey, locally connected to the NIRD data storage in LMD, with remote connections to the systems in Kajaani, Tromsø and Trondheim.

Questions for vendors

We would like vendors to answer the following:

  • How long can you support the hardware?
  • What are your thoughts on the GPU-to-CPU ratio in computational science today and five years from now?
  • What is your current experience with, and what are your future plans for, multi-partition HPC machines?
  • What is your roadmap for GPU/CPU integration?
  • What does your software ecosystem look like?
  • Do you facilitate cloud integration?


Please do not hesitate to contact the project at a2@sigma2.no.

Preliminary timeline

Q1 2023
Publication of RFI

We expect to publish an RFI (request for information) in Q1 2023.

H1 2023
Publication of tender

We expect to publish the tender in H1 2023.

H1 2024
New HPC system in production

Our goal is to have the system in production in H1 2024.

The lifetime of A2 should be at least five years. The expected lifetime is seven years. 

Background

HPC systems generally have a life span of approximately 5 years due to declining energy efficiency compared to newer machines, obsolete parts, lack of support, and the arrival of new technology.

Our procurement strategy has traditionally been two-legged, with an A-leg and a B-leg, where we acquire systems offset by 2-3 years.

Current Sigma2 hardware and capability

Fram
System: Lenovo NeXtScale nx360
Nodes: 1006
Cores: 32256
CPU types: Intel E5-2683v4 2.1 GHz; Intel E7-4850v4 2.1 GHz (hugemem)
Local disk: SSD
Performance: 1.1 PF
Total memory: 78 TiB
Disk size and type: 2.5 PB, Lustre
Interconnect type and topology: InfiniBand EDR (100 Gbit), fat tree
Queueing system: Slurm
Cooling: water cooled
Saga
System: HPE XL170r Gen10; ProLiant XL270d Gen10 (GPU nodes)
Nodes: 356 (+8 GPU nodes)
Cores: 16064
CPU types: 200 × Intel 6138 2.00 GHz; 120 × Intel 6230R 2.10 GHz; 8 × Intel 6126 2.60 GHz (GPU nodes)
Local disk: NVMe
Performance: 645 TF
Total memory: 75 TiB
Disk size and type: 6.6 PB, BeeGFS
Interconnect type and topology: InfiniBand FDR (56 Gbit), fat tree
Queueing system: Slurm
Cooling: air cooled

Betzy
System: BullSequana XH2000
Nodes: 1344 CPU nodes, 4 GPU nodes
Cores: 172032
CPU types: CPU nodes: AMD Epyc "Rome" 2.25 GHz; GPU nodes: AMD Milan CPUs with 4 × NVIDIA A100 (NVLink) GPUs per node
Local disk: no local disk
Performance: 6.5 PF
Total memory: 336 TiB
Disk size and type: 7.7 PB, Lustre
Queueing system: Slurm
Interconnect type and topology: InfiniBand HDR (100 Gbit per node, 200 Gbit switches), Dragonfly
Cooling: water cooled

The NIRD storage is based on the IBM Elastic Storage System (ESS). The current capacity of 35 PB is shared between file and object storage and is designed for future growth.

LUMI
System: HPE Cray Shasta
Nodes: 1536 (LUMI-C), 2560 (LUMI-G)
Cores: 196 608 (LUMI-C)
CPU types: AMD EPYC
GPU types: AMD MI250X
Peak performance (PFlop/s): 552 (LUMI-G), 8 (LUMI-C)
Total memory: 440 TiB
Disk size and type: 80 PB spinning (LUMI-P), 7 PB flash (LUMI-F), 30 PB object store (LUMI-O)
Queueing system: Slurm
Interconnect type and topology: Slingshot
Cooling: water cooled


CPU architecture

We recognise that the CPU market is quite diverse these days, and at the same time, we believe that more than 90% of our workload is CPU-architecture agnostic and may run well on x86, ARM, or Power CPUs. As such, we do not have any specific CPU demands other than that the system must be able to compile and run standard HPC software and applications.
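
As a simple illustration of what we mean by architecture-agnostic workloads, the sketch below runs unchanged on x86, ARM, or Power nodes and merely reports the CPU architecture seen by each MPI rank. It assumes Python with the mpi4py bindings is available, which is an assumption for the illustration, not a requirement for A2.

```python
# arch_report.py - illustrative sketch of an architecture-agnostic MPI job.
# Assumes an MPI implementation and mpi4py are installed (both are assumptions here).
import platform
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# platform.machine() returns e.g. 'x86_64', 'aarch64' or 'ppc64le',
# so the same script runs without modification on x86, ARM or Power nodes.
info = (rank, platform.machine(), platform.node())
gathered = comm.gather(info, root=0)

if rank == 0:
    for r, arch, node in sorted(gathered):
        print(f"rank {r:4d} on {node}: {arch}")
```

Such a script could be launched with, for example, `mpirun -n 4 python arch_report.py` on any of the candidate architectures.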

GPU architecture

Except for the partnership in LUMI, Sigma2 does not possess any substantial GPU resources that can seamlessly run CUDA (NVIDIA) software, in particular machine learning and AI codes written for scientific applications. We do have some GPUs installed for accelerated HPC workloads.

For the most part, the demand for NVIDIA-compatible GPUs has been met by individual universities and research groups, among them the IDUN GPU cluster at NTNU with a total of 200 GPUs (about 100 of them A100). Sigma2 needs a similar capability in the near future, but we have not yet decided its extent (capacity, technology, etc.). It will be part of A2, but A2 must in any case have the capability to integrate such GPUs with its interconnect, storage, and other infrastructure (queueing system, etc.).

We have seen steady growth in GPU demand and expect a significant further increase, mainly for AI/ML but also, to a lesser extent, for accelerated HPC (64-bit floating point).

Interconnect

All our current HPC systems use different generations of InfiniBand with different topologies. We are experienced with this interconnect and consider it a mature network technology. For A2, we still consider InfiniBand a good choice, but we are looking for stable and proven versions of the interconnect, not state-of-the-art versions that have not yet seen substantial production time in reference systems. We consider latency more important than bandwidth.
The interconnect should be configured such that the CPU and GPU computational capacity can be expanded.

Storage

Local node disk

For the workload we foresee the A2 machine will run, a substantial part of the jobs will benefit from having a node-local temp/scratch filesystem. The Saga system has local NVMe, and we would like to equip A2 with fast solid-state storage (e.g. NVMe) on the nodes as well.
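
To make the intended usage pattern concrete, here is a minimal sketch of a job staging data through node-local scratch. The environment variable LOCALSCRATCH and the file paths are illustrative assumptions, not decided A2 interfaces.

```python
# local_scratch_sketch.py - illustrative only; variable and path names are assumptions.
import os
import shutil
import tempfile

# Assume the batch system exports a node-local NVMe scratch directory,
# e.g. in the environment variable LOCALSCRATCH (hypothetical name).
scratch_root = os.environ.get("LOCALSCRATCH", tempfile.gettempdir())
workdir = tempfile.mkdtemp(prefix="job_", dir=scratch_root)

try:
    # Stage input from the shared file system to fast local storage (hypothetical path).
    shutil.copy("/cluster/projects/myproject/input.dat", workdir)

    # ... run the I/O-intensive part of the job against files in workdir ...

    # Copy results back to the shared project area before the job ends,
    # since node-local scratch is typically wiped after the job.
    shutil.copy(os.path.join(workdir, "output.dat"), "/cluster/projects/myproject/")
finally:
    shutil.rmtree(workdir, ignore_errors=True)
```

The point of the pattern is that I/O-intensive intermediate files live on the local solid-state disk, while only input and final results touch the shared file systems.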

Shared file systems

All our current HPC systems have a variant of either Lustre or BeeGFS, and we also have experience with GPFS. All three technologies are acceptable to us and have served us well. One shortcoming in our current setup is that we have one large filesystem to serve all kinds of I/O on the cluster. 
For A2, we are considering several shared file systems, each serving a specific purpose and not influencing the others:

  • Shared scratch: high IOPS, good bandwidth (solid-state)
  • Shared home and software: good IOPS
  • Shared project/data folder: good bandwidth, moderate IOPS

All filesystems must be separately operated and independent, except for using the same interconnect. All parallel file systems must be upgradeable both in performance and capacity.
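
As a sketch of how such a split could look from a job's point of view, the example below routes different classes of I/O to different mount points. The paths are purely hypothetical and only illustrate the intended division of labour between the file systems.

```python
# Illustrative sketch of separating I/O by purpose; all paths are assumptions.
from pathlib import Path

HOME     = Path("/cluster/home/user")        # software, small configs: good IOPS
SCRATCH  = Path("/cluster/scratch/user")     # temporary job data: high IOPS, good bandwidth
PROJECTS = Path("/cluster/projects/myproj")  # large datasets/results: good bandwidth

def plan_io(job_id: str) -> dict:
    """Return where each class of job I/O should go (sketch only)."""
    return {
        "config":  HOME / "myapp" / "settings.yaml",       # read once at startup
        "workdir": SCRATCH / job_id,                        # many small, fast writes
        "dataset": PROJECTS / "input" / "large_input.h5",   # streamed sequentially
        "results": PROJECTS / "results" / job_id,           # written once at the end
    }

print(plan_io("job42"))
```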

Site facilities, Power, and Cooling

A2 will be placed in a data centre with water-cooling capability. Power is not a limiting factor.

Connectivity to NIRD

A2 will be placed close to the NIRD storage facility (file and object store), about 50-100 m cable length, within InfiniBand range. We have the option to integrate NIRD storage in several ways.

  • Loose: copying via transport nodes or directly on login nodes (staging)
  • Close: Mounting on login or transport nodes (with GPFS)
  • Tight: Mounting on all nodes (with GPFS)

If a tight connection is chosen, this may affect the size of the shared cluster file system "shared projects".
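
As an illustration of the loose option, the sketch below stages data between NIRD and cluster scratch via a transport node using rsync over SSH. The host name and paths are invented placeholders, not existing endpoints.

```python
# Sketch of loose NIRD integration via staging; hostname and paths are invented.
import subprocess

def stage_from_nird(remote_path: str, local_path: str) -> None:
    """Copy data from the NIRD store to cluster scratch via a transport node (sketch)."""
    # 'nird-transport.example.org' is a placeholder, not a real host.
    subprocess.run(
        ["rsync", "-a", f"nird-transport.example.org:{remote_path}", local_path],
        check=True,
    )

def stage_to_nird(local_path: str, remote_path: str) -> None:
    """Copy job results back to NIRD after the job finishes (sketch)."""
    subprocess.run(
        ["rsync", "-a", local_path, f"nird-transport.example.org:{remote_path}"],
        check=True,
    )

if __name__ == "__main__":
    stage_from_nird("/nird/projects/myproj/input/", "/cluster/scratch/myproj/input/")
    # ... run the job ...
    stage_to_nird("/cluster/scratch/myproj/output/", "/nird/projects/myproj/output/")
```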

Queueing system/Scheduler

We are running Slurm on all our clusters.
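
For reference, a typical submission on our current Slurm-based systems looks like the sketch below, where a minimal batch script is generated and handed to sbatch. The account name and executable are placeholders.

```python
# Sketch: submitting a Slurm job from Python; account and executable are placeholders.
import subprocess
import textwrap

script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=example
    #SBATCH --account=nnXXXXk        # placeholder project account
    #SBATCH --time=01:00:00
    #SBATCH --ntasks=4
    #SBATCH --mem-per-cpu=2G

    srun ./my_application            # placeholder executable
""")

with open("job.sbatch", "w") as f:
    f.write(script)

# sbatch prints e.g. "Submitted batch job <id>" on success.
subprocess.run(["sbatch", "job.sbatch"], check=True)
```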

Test system

The A2 system should have a standalone test system. This system will be used to minimize downtime in the production system.

This system should be delivered before the primary system.

Cloud connectivity

We envision that cloud bursting from A2 should be possible when the demand for CPU or GPU-compute resources is high. We are also interested in using the cloud as a gateway to more exotic hardware.

We currently do not have any cloud connectivity on our clusters. 

Software

System software is the ecosystem for building scientific software. Today we mostly use GNU and Intel compilers, and to a lesser extent NVIDIA (formerly PGI) and ARM compilers and libraries.

In addition to the tools for building software, Intel performance tools, the TotalView debugger, and ARM Forge (including ARM Performance Reports) are in widespread use.
Our current systems run a diverse software stack of more than 400 scientific applications with continuous growth.

Contact information

E-mail: a2@sigma2.no