Introduction
High-Performance Linpack (HPL) is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack benchmark.
The algorithm used by HPL can be summarized by the following keywords:
- Two-dimensional block-cyclic data distribution
- Right-looking variant of the LU factorization with row partial pivoting featuring multiple look-ahead depths
- Recursive panel factorization with pivot search and column broadcast combined
- Various virtual panel broadcast topologies
- Bandwidth-reducing swap-broadcast algorithm
- Backward substitution with look-ahead of depth 1
Official website for HPL: https://www.netlib.org/benchmark/hpl
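For reference (a brief sketch, not taken verbatim from the HPL documentation): the benchmark generates a random N x N system A x = b, factors A with row partial pivoting, and solves for x. The reported performance is conventionally derived from the classic LU operation count:
GFLOPS = (2/3 * N^3 + 2 * N^2) / (t * 10^9)
where N is the problem size (Ns in HPL.dat) and t is the measured wall-clock time in seconds.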
Building HPL using Spack
Please refer to Getting Started with Spack using AMD Zen Software Studio for instructions on setting up Spack before building HPL.
# Example for building HPL with AOCC and AOCL
$ spack install hpl +openmp %aocc ^amdblis threads=openmp ^openmpi fabrics=cma,ucx
Explanation of the command options:
Symbol | Meaning |
---|---|
%aocc | Build HPL using the AOCC compiler |
+openmp | Build HPL with OpenMP support enabled |
^amdblis threads=openmp | Use amdblis as the BLAS implementation and enable OpenMP support |
^openmpi fabrics=cma,ucx | Use OpenMPI as the MPI provider and use the CMA network for efficient intra-node communication, falling back to the UCX network fabric, if required. Note: It is advised to specifically set the appropriate fabric for the host system if possible. Refer to Open MPI with AMD Zen Software Studio for more guidance. |
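Before installing, it can be helpful to preview how Spack will concretize the spec, so that the compiler, BLAS provider, and MPI selections are as expected. A minimal sketch using the standard spack spec command (same spec as above):
# Preview the concretized dependency tree without building anything
$ spack spec hpl +openmp %aocc ^amdblis threads=openmp ^openmpi fabrics=cma,ucx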
Running HPL
Recommended steps to run HPL for maximum performance on AMD systems:
- Configure the system with SMT Off
- Create the run_hpl_ccx.sh script (shown below). It binds each MPI rank to the AMD processor Core Complex Die (CCD) or Core Complex (CCX) associated with its local L3 cache.
- Create or update the HPL.dat file based on the underlying machine architecture.
This script launches HPL with 2 MPI ranks per L3 cache, each rank using 4 OpenMP worker threads. To change this behavior, update OMP_NUM_THREADS and the values x, y in the ppr:x:l3cache:pe=y option passed to mpirun.
Note: Some systems should use a different MPI/OpenMP layout (a snippet for checking how many cores share each L3 cache follows this list):
- Some frequency optimized AMD EPYC™ CPUs, such as EPYC™ 72F3 ("F" parts), have fewer than 8 cores per L3 cache. For such CPUs, it is recommended to use a single rank per L3 cache and set OMP_NUM_THREADS to the number of cores per L3 cache.
- For AMD 1st Gen EPYC™ Processors, which have 4 cores per L3 cache rather than 8 cores, it is recommended to use OMP_NUM_THREADS=4 and a single rank per L3 cache.
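To confirm how many cores share each L3 cache on a given host (and therefore how many ranks and threads to use per L3), the Linux sysfs cache topology can be inspected. A minimal sketch, assuming the usual layout where cache index3 corresponds to the L3 cache (hwloc's lstopo reports the same information in a friendlier form):
# One unique line is printed per L3 cache domain, listing the CPUs that share it
$ cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u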
run_hpl_ccx.sh
#! /bin/bash
# Load HPL into environment
# NOTE: If you have built multiple versions of HPL with Spack you may need to be
# more specific about which version to load. Spack will complain if your request
# is ambiguous and could refer to multiple packages.
# Please see: (https://spack.readthedocs.io/en/latest/basic_usage.html#ambiguous-specs)
spack load hpl %aocc
### performance settings ###
# System level tunings (writing to /proc and /sys requires root; see the note after this script)
echo 3 > /proc/sys/vm/drop_caches # Clear caches to maximize available RAM
echo 1 > /proc/sys/vm/compact_memory # Rearrange RAM usage to maximise the size of free blocks
echo 0 > /proc/sys/kernel/numa_balancing # Prevent kernel from migrating threads overzealously
echo 'always' > /sys/kernel/mm/transparent_hugepage/enabled # Enable hugepages for better TLB usage
echo 'always' > /sys/kernel/mm/transparent_hugepage/defrag # Enable page defragmentation and coalescing
TOTAL_CORES=$(nproc)
# OpenMP Settings
export OMP_NUM_THREADS=4 # 4 threads per MPI rank - this means 2 ranks per L3 cache (Zen 2 onwards) or 1 rank per L3 cache (Zen 1)
export OMP_PROC_BIND=TRUE # bind threads to specific resources
export OMP_PLACES="cores" # bind threads to cores
# amdblis (BLAS layer) optimizations
export BLIS_JC_NT=1 # No outer-loop parallelization
export BLIS_IC_NT=$OMP_NUM_THREADS # 2nd-level threads: one per core in the shared L3 cache domain
export BLIS_JR_NT=1 # No 4th-level threads
export BLIS_IR_NT=1 # No 5th-level threads
# Total MPI rank computation
NUM_MPI_RANKS=$(( $TOTAL_CORES / $OMP_NUM_THREADS ))
# For 1st Generation EPYC (Naples) and "F" Parts
# If using an "F" part (e.g. 75F3) also ensure that OMP_NUM_THREADS is set appropriately
# (Recommended OMP_NUM_THREADS= #cores per L3 cache)
#mpirun --map-by ppr:1:l3cache:pe=$OMP_NUM_THREADS \
# numactl --localalloc \
# xhpl
# For 2nd Generation EPYC onwards
mpirun -np $NUM_MPI_RANKS --map-by ppr:2:l3cache:pe=$OMP_NUM_THREADS \
--bind-to core \
       xhpl
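Note that the system-level tunings at the top of the script write to /proc and /sys and therefore require root privileges. If the benchmark is run as an unprivileged user, one option (a sketch; adjust to site policy) is to apply those settings separately beforehand, for example:
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
$ sudo sh -c 'echo always > /sys/kernel/mm/transparent_hugepage/enabled'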
HPL.dat
Please change the following values as per the system configuration:
- Ns - the problem size, which should be chosen based on the system memory. In particular, the problem size should be significantly larger than the total available L3 cache. To calculate a suitable value of Ns for a desired memory footprint, use the formula
Ns = sqrt(M * (1024^3) / 8)
where M is the desired memory usage in gibibytes (GiB) and 8 is the size in bytes of a double-precision value (see the worked example after the table below).
- Ps, Qs - the dimensions of the process grid over which the parallel problem is decomposed. P should be less than or equal to Q, and P*Q must match the total number of MPI ranks being used (not the total number of cores). Some common example configurations are:
MPI Ranks | Ps | Qs |
---|---|---|
16 | 4 | 4 |
24 | 4 | 6 |
32 | 4 | 8 |
48 | 6 | 8 |
64 | 8 | 8 |
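As a worked example (illustrative values, not an official recommendation): the sample Ns = 430080 used below corresponds to a footprint of roughly 430080^2 * 8 bytes, or about 1.48 TB, i.e. most of the 1.5 TB in the sample system. A small shell sketch for sizing Ns from a target footprint M (in GiB, chosen to leave some headroom for the OS):
# M is an example target footprint in GiB; adjust for your system
$ M=1300
$ awk -v m=$M 'BEGIN { printf "Ns ~ %d\n", sqrt(m * 1024^3 / 8) }'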
Sample HPL.dat for a dual-socket AMD 5th Gen EPYC™ 9755 Processor with 256 (2 x 128) cores and 1.5 TB of memory.
HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
430080 Ns <--- Modify this to change the memory footprint
1 # of NBs
456 # NBs
0 MAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
8 Ps <--- Set Ps and Qs to a suitable grid size
8 Qs <--- make sure that Ps * Qs == #MPI Ranks
16.0 threshold
1 # of panel fact
1 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
3 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
0 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
Once the wrapper script (run_hpl_ccx.sh) and a suitable HPL.dat have been created, run HPL by executing the wrapper script:
Running HPL using the wrapper script
$ chmod +x ./run_hpl_ccx.sh # make the script executable
$ ./run_hpl_ccx.sh
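Because device out is set to 6 (stdout) in the sample HPL.dat, results are printed to the terminal. A simple way to keep a copy and confirm that the residual checks passed (a sketch; the exact output layout can vary between HPL versions):
$ ./run_hpl_ccx.sh | tee hpl_run.log
$ grep -E "PASSED|FAILED" hpl_run.log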
Note: The above build and run steps apply to HPL-2.3, AOCC-5.0.0, AOCL-5.1.0 and OpenMPI-5.0.8 on Rocky Linux 9.5 (Blue Onyx) using Spack v1.1.0.dev0 and the builtin repo from spack-packages (commit id: 7824c23443).
For technical support on the tools, benchmarks and applications that AMD offers on this page and related inquiries, reach out to us at toolchainsupport@amd.com.