Introduction
High-Performance Linpack (HPL) is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack benchmark.
The algorithm used by HPL can be summarized by the following keywords:
- Two-dimensional block-cyclic data distribution
- Right-looking variant of the LU factorization with row partial pivoting, featuring multiple look-ahead depths
- Recursive panel factorization with pivot search and column broadcast combined
- Various virtual panel broadcast topologies
- Bandwidth-reducing swap-broadcast algorithm
- Backward substitution with look-ahead of depth 1
Official website for HPL: https://www.netlib.org/benchmark/hpl
Building HPL using Spack
Please refer to Getting Started with Spack using AMD Zen Software Studio before building HPL.
    # Example for building HPL with AOCC and AOCL 
$ spack install hpl +openmp %aocc ^amdblis threads=openmp ^openmpi fabrics=cma,ucx
Explanation of the command options:
| Symbol | Meaning | 
|---|---|
| %aocc | Build HPL using the AOCC compiler | 
| +openmp | Build HPL with OpenMP support enabled | 
| ^amdblis threads=openmp | Use amdblis as the BLAS implementation and enable OpenMP support | 
| ^openmpi fabrics=cma,ucx | Use Open MPI as the MPI provider, with CMA for efficient intra-node communication and the UCX fabric as the fallback where required. Note: It is advisable to set the fabric explicitly for the host system where possible; refer to Open MPI with AMD Zen Software Studio for more guidance. | 
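Optionally, the concretized dependency tree can be reviewed before installation; a quick check using the same spec as the install command above:
    # Optional: preview how Spack will concretize the spec before installing
$ spack spec hpl +openmp %aocc ^amdblis threads=openmp ^openmpi fabrics=cma,ucx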
Running HPL
Recommended steps to run HPL for maximum performance on AMD systems:
- Configure the system with SMT Off
- Create the run_hpl_ccx.sh script. This binds each MPI rank to an AMD processor Core Complex Die (CCD) or Core Complex (CCX), i.e. to the group of cores that shares a local L3 cache.
- Create or update the HPL.dat file based on the underlying machine architecture.
This script will launch HPL with 2 MPI ranks per L3 cache, each rank running 4 OpenMP worker threads. To change this behavior, update OMP_NUM_THREADS and the values x and y in the ppr:x:l3cache:pe=y option passed to mpirun (a small sizing sketch follows the note below).
Note: Some systems should use a different MPI/OpenMP layout:
- Some frequency optimized AMD EPYC™ CPUs, such as EPYC™ 72F3 ("F" parts), have fewer than 8 cores per L3 cache. For such CPUs, it is recommended to use a single rank per L3 cache and set OMP_NUM_THREADS to the number of cores per L3 cache.
- For AMD 1st Gen EPYC™ Processors, which have 4 cores per L3 cache rather than 8 cores, it is recommended to use OMP_NUM_THREADS=4 and a single rank per L3 cache.
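The arithmetic behind these layouts can be expressed as a small helper; a minimal sketch with illustrative, assumed values (not queried from the machine):
    #! /bin/bash
# Illustrative layout sketch (assumed values - adjust CORES_PER_L3 for your CPU).
# Derives OMP_NUM_THREADS and the ppr:<x>:l3cache:pe=<y> mpirun arguments from the
# number of cores sharing an L3 cache and the desired number of ranks per L3 cache.
CORES_PER_L3=8      # 8 on Zen 2 and later; 4 on 1st Gen EPYC; fewer on some "F" parts
RANKS_PER_L3=2      # use 1 on 1st Gen EPYC and on "F" parts with few cores per L3
OMP_THREADS=$(( CORES_PER_L3 / RANKS_PER_L3 ))
echo "OMP_NUM_THREADS=${OMP_THREADS}  --map-by ppr:${RANKS_PER_L3}:l3cache:pe=${OMP_THREADS}"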
run_hpl_ccx.sh
    #! /bin/bash
# Load HPL into environment
# NOTE: If you have built multiple versions of HPL with Spack you may need to be
# more specific about which version to load. Spack will complain if your request
# is ambiguous and could refer to multiple packages.
# Please see: (https://spack.readthedocs.io/en/latest/basic_usage.html#ambiguous-specs)
spack load hpl %aocc
### performance settings ###
# System level tunings
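# NOTE: the writes to /proc and /sys below require root privileges and are not
# persistent across reboots.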
echo 3 > /proc/sys/vm/drop_caches   # Clear caches to maximize available RAM
echo 1 > /proc/sys/vm/compact_memory # Rearrange RAM usage to maximize the size of free blocks
echo 0 > /proc/sys/kernel/numa_balancing # Prevent kernel from migrating threads overzealously
echo 'always' > /sys/kernel/mm/transparent_hugepage/enabled # Enable hugepages for better TLB usage
echo 'always' > /sys/kernel/mm/transparent_hugepage/defrag  # Enable page defragmentation and coalescing
TOTAL_CORES=$(nproc)
# OpenMP Settings
export OMP_NUM_THREADS=4   # 4 threads per MPI rank - this means 2 ranks per CPU L3 cache (Zen 2+) or 1 rank per L3 (Zen 1)
export OMP_PROC_BIND=TRUE  # bind threads to specific resources
export OMP_PLACES="cores"   # bind threads to cores
# amdblis (BLAS layer) optimizations
export BLIS_JC_NT=1  # (No outer loop parallelization)
export BLIS_IC_NT=$OMP_NUM_THREADS # (# of 2nd level threads – one per core in the shared L3 cache domain)
export BLIS_JR_NT=1 # (No 4th level threads)
export BLIS_IR_NT=1 # (No 5th level threads)
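# With the settings above, all BLAS-level parallelism is placed on the BLIS IC loop,
# i.e. one BLIS thread per core within each rank's shared L3 cache domain.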
# Total MPI rank computation
NUM_MPI_RANKS=$(( $TOTAL_CORES / $OMP_NUM_THREADS ))
# For 1st Generation EPYC (Naples) and "F" Parts
# If using an "F" part (e.g 75F3) also ensure that OMP_NUM_THREADS is set appropriately
# (Recommended OMP_NUM_THREADS= #cores per L3 cache)
#mpirun --map-by ppr:1:l3cache:pe=$OMP_NUM_THREADS \
#	numactl --localalloc \
#	xhpl
# For 2nd Generation EPYC onwards
mpirun -np $NUM_MPI_RANKS --map-by ppr:2:l3cache:pe=$OMP_NUM_THREADS \
	--bind-to core \
	xhpl
HPL.dat
Please change the following values as per the system configuration:
- Ns is the problem size and should be calculated based on system memory; choose it so that the problem is significantly larger than the total available L3 cache. To calculate a suitable value of Ns for a desired memory footprint, use the formula
 Ns = sqrt(M * (1024^3) / 8)
 where M is the desired memory usage in gibibytes (GiB). A worked example follows the table below.
- Ps, Qs give the dimensions of the process grid over which the parallel problem is decomposed. P should be less than or equal to Q, and P*Q must match the total number of MPI ranks being used (not the total number of cores). Some common example configurations are:
| MPI Ranks | Ps | Qs | 
|---|---|---|
| 16 | 4 | 4 | 
| 24 | 4 | 6 | 
| 32 | 4 | 8 | 
| 48 | 6 | 8 | 
| 64 | 8 | 8 | 
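As a worked example of the sizing formula above, the sketch below computes Ns for an assumed target footprint M and rounds it down to a multiple of the block size NB; both values are illustrative assumptions and should be adjusted for the target system:
    #! /bin/bash
# Illustrative Ns calculation (assumed values - adjust M and NB for your system).
M=1380     # target memory footprint in GiB, e.g. ~90% of 1.5 TB total RAM
NB=456     # block size; should match the NBs value in HPL.dat
NS=$(awk -v m="$M" -v nb="$NB" 'BEGIN {
    ns = sqrt(m * 1024 * 1024 * 1024 / 8);   # Ns = sqrt(M * 1024^3 / 8)
    print int(ns / nb) * nb;                 # round down to a multiple of NB
}')
echo "Ns = $NS"
Rounding Ns down to a multiple of NB is a common convention rather than a requirement of HPL.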
Sample HPL.dat for dual socket AMD 5th Gen EPYC™ 9755 Processor with 256 (128x2) cores and 1.5 TB of memory:
HPL.dat
    HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out     output file name (if any)
6           device out (6=stdout,7=stderr,file)
1           # of problems sizes (N)
430080      Ns  <--- Modify this to change the memory footprint
1           # of NBs
456         NBs
0           MAP process mapping (0=Row-,1=Column-major)
1           # of process grids (P x Q)
8           Ps <--- Set Ps and Qs to a suitable grid size
8           Qs <--- make sure that Ps * Qs == #MPI Ranks
16.0        threshold
1           # of panel fact
1           PFACTs (0=left, 1=Crout, 2=Right)
1           # of recursive stopping criterium
4           NBMINs (>= 1)
1           # of panels in recursion
2           NDIVs
1           # of recursive panel fact.
1           RFACTs (0=left, 1=Crout, 2=Right)
1           # of broadcast
3           BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1           # of lookahead depth
0           DEPTHs (>=0)
2           SWAP (0=bin-exch,1=long,2=mix)
64          swapping threshold
0           L1 in (0=transposed,1=no-transposed) form
0           U in (0=transposed,1=no-transposed) form
1           Equilibration (0=no,1=yes)
8           memory alignment in double (> 0)
Once the wrapper script (run_hpl_ccx.sh) and a suitable HPL.dat have been created, run HPL by executing the wrapper script:
Running HPL using the wrapper script
    $ chmod +x ./run_hpl_ccx.sh   # make the script executable
$ ./run_hpl_ccx.sh
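To keep a record of the run and locate the result, the output can be captured to a log file; a minimal sketch, assuming the default output settings in the HPL.dat above (device out = 6, i.e. stdout):
    $ ./run_hpl_ccx.sh | tee hpl_run.log    # capture the benchmark output
$ grep -A 3 "Gflops" hpl_run.log         # the performance summary follows this header line
$ grep -E "PASSED|FAILED" hpl_run.log    # residual check status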
Note: The above build and run steps apply to HPL-2.3, AOCC-5.0.0, AOCL-5.1.0 and OpenMPI-5.0.8 on Rocky Linux 9.5 (Blue Onyx) using Spack v1.1.0.dev0 and the builtin repo from spack-packages (commit id: 7824c23443).
For technical support on the tools, benchmarks and applications that AMD offers on this page and related inquiries, reach out to us at toolchainsupport@amd.com.