Introduction
High-Performance Linpack (HPL) is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack benchmark.
The algorithm used by HPL can be summarized by the following keywords:
- Two-dimensional block-cyclic data distribution
- Right-looking variant of the LU factorization with row partial pivoting featuring multiple look-ahead depths
- Recursive panel factorization with pivot search and column broadcast combined
- Various virtual panel broadcast topologies
- Bandwidth-reducing swap-broadcast algorithm
- Backward substitution with look-ahead of depth 1
Official website for HPL: https://www.netlib.org/benchmark/hpl
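For reference (a brief sketch, not taken verbatim from the HPL documentation): the benchmark generates a random N x N system A x = b, factors A with row partial pivoting, and solves for x. The reported performance is conventionally derived from the classic LU operation count:
GFLOPS = (2/3 * N^3 + 2 * N^2) / (t * 10^9)
where N is the problem size (Ns in HPL.dat) and t is the measured wall-clock time in seconds.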
Building HPL using Spack
Please refer to Getting Started with Spack using AMD Zen Software Studio for instructions on setting up Spack before building HPL.
# Example for building HPL with AOCC and AOCL
$ spack install hpl +openmp %aocc ^amdblis threads=openmp ^openmpi fabrics=cma,ucx
Explanation of the command options:
Symbol | Meaning |
---|---|
%aocc | Build HPL using the AOCC compiler |
+openmp | Build HPL with OpenMP support enabled |
^amdblis threads=openmp | Use amdblis as the BLAS implementation and enable OpenMP support |
^openmpi fabrics=cma,ucx | Use OpenMPI as the MPI provider and use the CMA network for efficient intra-node communication, falling back to the UCX network fabric, if required. Note: It is advised to specifically set the appropriate fabric for the host system if possible. Refer to Open MPI with AMD Zen Software Studio for more guidance. |
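Before installing, it can be helpful to preview how Spack will concretize the spec, so that the compiler, BLAS provider, and MPI selections are as expected. A minimal sketch using the standard spack spec command (same spec as above):
# Preview the concretized dependency tree without building anything
$ spack spec hpl +openmp %aocc ^amdblis threads=openmp ^openmpi fabrics=cma,ucx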
Running HPL
Recommended steps to run HPL for maximum performance on AMD systems:
- Configure the system with SMT Off
- Create the run_hpl_ccx.sh script (shown below). It binds each MPI rank to the AMD processor Core Complex Die (CCD) or Core Complex (CCX) associated with its local L3 cache.
- Create or update the HPL.dat file based on the underlying machine architecture.
This script launches HPL with 2 MPI ranks per L3 cache, each rank using 4 OpenMP worker threads. To change this behavior, update OMP_NUM_THREADS and the values x, y in the ppr:x:l3cache:pe=y option passed to mpirun.
Note: Some systems should use a different MPI/OpenMP layout (a snippet for checking how many cores share each L3 cache follows this list):
- Some frequency optimized AMD EPYC™ CPUs, such as EPYC™ 72F3 ("F" parts), have fewer than 8 cores per L3 cache. For such CPUs, it is recommended to use a single rank per L3 cache and set OMP_NUM_THREADS to the number of cores per L3 cache.
- For AMD 1st Gen EPYC™ Processors, which have 4 cores per L3 cache rather than 8 cores, it is recommended to use OMP_NUM_THREADS=4 and a single rank per L3 cache.
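To confirm how many cores share each L3 cache on a given host (and therefore how many ranks and threads to use per L3), the Linux sysfs cache topology can be inspected. A minimal sketch, assuming the usual layout where cache index3 corresponds to the L3 cache (hwloc's lstopo reports the same information in a friendlier form):
# One unique line is printed per L3 cache domain, listing the CPUs that share it
$ cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u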
run_hpl_ccx.sh
#! /bin/bash
# Load HPL into environment
# NOTE: If you have built multiple versions of HPL with Spack you may need to be
# more specific about which version to load. Spack will complain if your request
# is ambiguous and could refer to multiple packages.
# Please see: (https://spack.readthedocs.io/en/latest/basic_usage.html#ambiguous-specs)
spack load hpl %aocc
### performance settings ###
# System level tunings (writing to /proc and /sys requires root; see the note after this script)
echo 3 > /proc/sys/vm/drop_caches # Clear caches to maximize available RAM
echo 1 > /proc/sys/vm/compact_memory # Rearrange RAM usage to maximise the size of free blocks
echo 0 > /proc/sys/kernel/numa_balancing # Prevent kernel from migrating threads overzealously
echo 'always' > /sys/kernel/mm/transparent_hugepage/enabled # Enable hugepages for better TLB usage
echo 'always' > /sys/kernel/mm/transparent_hugepage/defrag # Enable page defragmentation and coalescing
TOTAL_CORES=$(nproc)
# OpenMP Settings
export OMP_NUM_THREADS=4 # 4 threads per MPI rank - this means 2 ranks per L3 cache (Zen 2 onwards) or 1 rank per L3 cache (Zen 1)
export OMP_PROC_BIND=TRUE # bind threads to specific resources
export OMP_PLACES="cores" # bind threads to cores
# amdblis (BLAS layer) optimizations
export BLIS_JC_NT=1 # No outer-loop parallelization
export BLIS_IC_NT=$OMP_NUM_THREADS # 2nd-level threads: one per core in the shared L3 cache domain
export BLIS_JR_NT=1 # No 4th-level threads
export BLIS_IR_NT=1 # No 5th-level threads
# Total MPI rank computation
NUM_MPI_RANKS=$(( $TOTAL_CORES / $OMP_NUM_THREADS ))
# For 1st Generation EPYC (Naples) and "F" Parts
# If using an "F" part (e.g. 75F3) also ensure that OMP_NUM_THREADS is set appropriately
# (Recommended OMP_NUM_THREADS= #cores per L3 cache)
#mpirun --map-by ppr:1:l3cache:pe=$OMP_NUM_THREADS \
# numactl --localalloc \
# xhpl
# For 2nd Generation EPYC onwards
mpirun -np $NUM_MPI_RANKS --map-by ppr:2:l3cache:pe=$OMP_NUM_THREADS \
--bind-to core \
       xhpl
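Note that the system-level tunings at the top of the script write to /proc and /sys and therefore require root privileges. If the benchmark is run as an unprivileged user, one option (a sketch; adjust to site policy) is to apply those settings separately beforehand, for example:
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
$ sudo sh -c 'echo always > /sys/kernel/mm/transparent_hugepage/enabled'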
HPL.dat
Please change the following values as per the system configuration:
- Ns - the problem size, which should be chosen based on the system memory. In particular, the problem size should be significantly larger than the total available L3 cache. To calculate a suitable value of Ns for a desired memory footprint, use the formula
Ns = sqrt(M * (1024^3) / 8)
where M is the desired memory usage in gibibytes (GiB) and 8 is the size in bytes of a double-precision value (see the worked example after the table below).
- Ps, Qs - the dimensions of the process grid over which the parallel problem is decomposed. P should be less than or equal to Q, and P*Q must match the total number of MPI ranks being used (not the total number of cores). Some common example configurations are:
MPI Ranks | Ps | Qs |
---|---|---|
16 | 4 | 4 |
24 | 4 | 6 |
32 | 4 | 8 |
48 | 6 | 8 |
64 | 8 | 8 |
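As a worked example (illustrative values, not an official recommendation): the sample Ns = 430080 used below corresponds to a footprint of roughly 430080^2 * 8 bytes, or about 1.48 TB, i.e. most of the 1.5 TB in the sample system. A small shell sketch for sizing Ns from a target footprint M (in GiB, chosen to leave some headroom for the OS):
# M is an example target footprint in GiB; adjust for your system
$ M=1300
$ awk -v m=$M 'BEGIN { printf "Ns ~ %d\n", sqrt(m * 1024^3 / 8) }'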
Sample HPL.dat for a dual-socket AMD 5th Gen EPYC™ 9755 Processor with 256 (2 x 128) cores and 1.5 TB of memory.
HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
430080 Ns <--- Modify this to change the memory footprint
1 # of NBs
456 # NBs
0 MAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
8 Ps <--- Set Ps and Qs to a suitable grid size
8 Qs <--- make sure that Ps * Qs == #MPI Ranks
16.0 threshold
1 # of panel fact
1 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
3 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
0 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
Once the wrapper script (run_hpl_ccx.sh) and a suitable HPL.dat have been created, run HPL by executing the wrapper script:
Running HPL using the wrapper script
$ chmod +x ./run_hpl_ccx.sh # make the script executable
$ ./run_hpl_ccx.sh
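Because device out is set to 6 (stdout) in the sample HPL.dat, results are printed to the terminal. A simple way to keep a copy and confirm that the residual checks passed (a sketch; the exact output layout can vary between HPL versions):
$ ./run_hpl_ccx.sh | tee hpl_run.log
$ grep -E "PASSED|FAILED" hpl_run.log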
Note: The above build and run steps apply to HPL-2.3, AOCC-5.0.0, AOCL-5.1.0 and OpenMPI-5.0.8 on Rocky Linux 9.5 (Blue Onyx) using Spack v1.1.0.dev0 and the builtin repo from spack-packages (commit id: 7824c23443).
For technical support on the tools, benchmarks and applications that AMD offers on this page and related inquiries, reach out to us at toolchainsupport@amd.com.