AOCL-BLAS: Boosting GEMM Performance for Small Matrices
Aug 12, 2025

Engineering GEMM Efficiency Beyond the Hot Path
In performance engineering, optimizations often focus on large, square matrices and high-throughput paths. However, real-world workloads—such as batched general matrix multiplication (GEMM), AI inference, and scientific simulations—frequently involve small or skinny matrices. Despite their size, these matrices occur so often that they can dominate the overall runtime. This blog focuses on refining AMD Optimizing CPU Libraries – Basic Linear Algebra Subprograms (AOCL-BLAS) GEMM kernels to handle these cases efficiently, leveraging architectural insights from the AMD “Zen 4” and “Zen 5” architectures.
At the heart of most high-performance GEMM implementations lies the 5-loop algorithm, a hierarchical structure that decomposes matrix multiplication into nested loops for blocking, packing, and micro-kernel execution. This design is highly effective for large, square matrices, where cache reuse and compute throughput can be fully exploited.

Figure 1: GotoBLAS Algorithm for Optimizing GEMM Operation
GotoBLAS algorithm for GEMM: The figure above presents the GotoBLAS algorithm for optimizing the GEMM operation. It features a hierarchy of five nested loops surrounding a micro-kernel.
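To make the hierarchy concrete, below is a minimal, unoptimized C sketch of this five-loop blocking structure for C += A * B, assuming column-major storage. The block sizes (NC, KC, MC, NR, MR) are illustrative placeholders rather than the tuned AOCL-BLAS values, and the packing of A and B panels performed by a real implementation is omitted.
// Hedged sketch of the GotoBLAS/BLIS 5-loop structure (illustrative block sizes, no packing)
enum { NC = 4080, KC = 512, MC = 144, NR = 8, MR = 24 };
#define MIN(a, b) ((a) < (b) ? (a) : (b))

static void gemm_5loop_sketch(int M, int N, int K,
                              const double *A, int lda,
                              const double *B, int ldb,
                              double *C, int ldc)
{
    for (int jc = 0; jc < N; jc += NC)                        // partition N into NC-wide panels of B/C
      for (int pc = 0; pc < K; pc += KC)                      // partition K; a real kernel packs the B panel here
        for (int ic = 0; ic < M; ic += MC)                    // partition M; a real kernel packs the A block here
          for (int jr = jc; jr < MIN(jc + NC, N); jr += NR)   // NR-wide micro-panel of C
            for (int ir = ic; ir < MIN(ic + MC, M); ir += MR) // MR-tall micro-tile of C
            {
                // Reference micro-kernel: update the (up to) MR x NR tile of C
                int mr = MIN(MR, M - ir), nr = MIN(NR, N - jr), kc = MIN(KC, K - pc);
                for (int j = 0; j < nr; ++j)
                    for (int i = 0; i < mr; ++i)
                        for (int p = 0; p < kc; ++p)
                            C[(jr + j) * ldc + (ir + i)] +=
                                A[(pc + p) * lda + (ir + i)] * B[(jr + j) * ldb + (pc + p)];
            }
}
It is the two innermost loops over MR x NR micro-tiles, and the kernel invocation inside them, that the optimizations in the rest of this blog target.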
However, this same structure becomes a hidden cost when applied to skinny matrices that are too narrow, short, or simply too small. In these cases, the overhead of packing and blocking can outweigh the benefits, leading to suboptimal performance.
This blog is a story of refinement and precision — how we tuned AOCL-BLAS GEMM kernels for the AMD “Zen 4” and “Zen 5” architectures, not by brute force, but by understanding the architecture, embracing the edge cases, and crafting solutions that shine even in the smallest corners of computation.
System Configuration
All performance measurements presented in this blog were conducted on the following system:
Processor: AMD EPYC™ 9755, 128-Core Processor (2 sockets, 256 cores total)
Base/Boost Frequency: 1.5 GHz (min) – 4.12 GHz (max)
Cache:
L1d: 12 MiB (256 × 48 KiB)
L1i: 8 MiB (256 × 32 KiB)
L2: 256 MiB (256 × 1 MiB)
L3: 1 GiB (32 × 32 MiB)
NUMA Nodes: 8
Memory: DDR5, 5600 MT/s, Total Size: 1.48 TB
Operating System: Red Hat Enterprise Linux 9.3
This configuration was consistently used across all benchmarks and performance charts in the blog.
Repeated Kernel Call & µOp Cache Inefficiency
In earlier implementations of the GEMM kernel within AOCL-BLAS, each micro-tile of the matrix was processed through an invocation of the gemm_kernel function. This was typically done inside the second loop of the 5-loop GEMM algorithm, which iterates over the M dimension. For every iteration, the offsets were recalculated and the function was invoked to handle the current tile.
// gemm_kernel code block
// Loop 2 of the 5-loop GEMM algorithm
// This loop advances in steps of the micro-kernel block size (e.g., 24 for the M dimension)
for (int i = 0; i < M; i += MR) // MR = micro-kernel block size for M (e.g., 24)
{
    // Compute offsets for A and C based on the current block row
    int A_offset = i * lda; // lda = leading dimension of A
    int C_offset = i * ldc; // ldc = leading dimension of C

    // Call the micro-kernel on the current block
    gemm_kernel(A_base_ptr + A_offset,
                B_base_ptr,
                C_base_ptr + C_offset,
                ...); // Additional parameters as needed
}
While this approach is functionally correct, it introduces significant performance penalties. It fragments the instruction stream, which increases pressure on the 32 KB instruction cache; the cache then struggles to retain both the loop control logic and the relatively large body of the gemm_kernel() function.
Moreover, the µOp cache and the loop stream detector (LSD) in the “Zen” architecture, both critical for efficient frontend throughput, are optimized for tight, predictable instruction sequences. Frequent function calls disrupt this predictability, leading to µOp cache fragmentation and reduced LSD engagement. As a result, the frontend pipeline experiences stalls, and the overall throughput of the GEMM operation degrades.

Figure 2: Visual Comparison of DGEMM Kernel Strategies: Baseline vs Optimized Execution
Hierarchical blocking and nested loops surrounding the 24x8 micro-kernel vs. inline assembly kernel: The left side of the image shows the hierarchical blocking and nested loops, while the right side highlights the inline assembly kernel, which implements the M loop directly within the kernel.
To address this, we restructured the kernel by pulling the micro-tile loop directly into the kernel function body using inline assembly. This change eliminated the need for repeated function calls, allowing the entire loop to execute as a contiguous block of instructions.
void gemm_kernel_with_internal_loop(...) {
    begin_asm;
    // Register-based loop
    // Compute offsets, load, FMA, store, loop
    end_asm;
}
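Conceptually, the caller-side change is that the per-tile loop collapses into a single call per M panel, with the MR-step loop now living inside the kernel's assembly body. The sketch below is only illustrative; the function signature and parameter list are assumptions, not the actual AOCL-BLAS interface.
// Illustrative signature -- the real kernel takes additional parameters
// (alpha, beta, packed buffers, strides, etc.).
void gemm_kernel_with_internal_loop(int M, const double *A_base_ptr, int lda,
                                    const double *B_base_ptr,
                                    double *C_base_ptr, int ldc);

void process_m_panel(int M, const double *A_base_ptr, int lda,
                     const double *B_base_ptr,
                     double *C_base_ptr, int ldc)
{
    // One call per M panel: the MR-step loop, offset arithmetic, loads, FMAs,
    // and stores all execute as a contiguous instruction stream inside the kernel.
    gemm_kernel_with_internal_loop(M, A_base_ptr, lda, B_base_ptr, C_base_ptr, ldc);
}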
The benefits were immediate: a contiguous instruction stream improved I-cache locality, µOp cache utilization, and LSD engagement. This change significantly reduced function call overhead, enabling higher front-end throughput.
The performance chart compares the baseline implementation with the optimized Double-precision General Matrix Multiply (DGEMM) kernel, showing a 5% performance improvement. This improvement stems from reducing function call overhead and ensuring a more contiguous instruction stream.

Figure 3: Streamlined GEMM Kernel & µOp Cache Improvement
Performance gain from streamlined GEMM kernel & µOp cache improvements: The performance chart compares the baseline implementation with the optimized DGEMM kernel, showing a 5% performance improvement. This improvement stems from reducing function call overhead and ensuring a more contiguous instruction stream. The x-axis lists various matrix sizes (MxNxK), while the y-axis shows the performance in gigaflops (GFLOPS).
Breaking Fused Multiply-Add (FMA) Dependency Chains
In high-performance GEMM kernels, fused multiply-add (FMA) instructions are the core computational workhorse. However, when multiple FMAs are issued back-to-back using the same destination register, they form a dependency chain. Each instruction must wait for the previous one to complete before it can execute, creating a bottleneck in the out-of-order execution engine. This pattern stalls the instruction scheduler and limits the number of operations that can be issued per cycle. On the AMD “Zen 4” and “Zen 5” architectures, which are designed to exploit instruction-level parallelism, such dependency chains underutilize the available execution resources.
vfmadd231pd zmm0, zmm2, zmm30 // i.e. zmm0 = (zmm2 * zmm30) + zmm0
vfmadd231pd zmm0, zmm3, zmm31 // i.e. zmm0 = (zmm3 * zmm31) + zmm0 ⚠️ stall
vfmadd231pd zmm0, zmm4, zmm30
vfmadd231pd zmm0, zmm5, zmm31
To eliminate this bottleneck, we restructured the FMA sequence by distributing the operations across multiple independent registers. Each FMA now writes to a separate accumulator, allowing the CPU to execute them in parallel. Once the independent FMAs are complete, their results are combined using a series of vaddpd instructions to produce the final output.
vfmadd231pd zmm10, zmm2, zmm30 // i.e. zmm10 = (zmm2 * zmm30) + zmm10
vfmadd231pd zmm11, zmm3, zmm31 // i.e. zmm11 = (zmm3 * zmm31) + zmm11
vfmadd231pd zmm15, zmm4, zmm30
vfmadd231pd zmm16, zmm5, zmm31
vaddpd zmm10, zmm11, zmm10 // i.e. zmm10 = zmm11 + zmm10
vaddpd zmm15, zmm16, zmm15
vaddpd zmm10, zmm15, zmm10
This approach breaks the dependency chain and frees the scheduler to issue multiple FMAs simultaneously. The result is a smoother instruction flow, reduced stalls, and significantly higher throughput—especially in compute-bound inner loops where every cycle counts.
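The same multi-accumulator idea can be expressed with AVX-512 intrinsics, as in the hedged sketch below: four independent accumulators keep the FMAs from serializing on a single destination, and the partial sums are combined at the end, mirroring the vaddpd reduction above. The dot-product-style loop and its assumption (n a multiple of 32) are illustrative, not an AOCL-BLAS kernel.
#include <immintrin.h>

// Sketch: accumulate into four independent registers so consecutive FMAs
// do not depend on one another.
static __m512d multi_accumulator_fma(const double *a, const double *b, int n)
{
    __m512d acc0 = _mm512_setzero_pd(), acc1 = _mm512_setzero_pd();
    __m512d acc2 = _mm512_setzero_pd(), acc3 = _mm512_setzero_pd();

    for (int i = 0; i < n; i += 32)  // assumes n is a multiple of 32
    {
        acc0 = _mm512_fmadd_pd(_mm512_loadu_pd(a + i),      _mm512_loadu_pd(b + i),      acc0);
        acc1 = _mm512_fmadd_pd(_mm512_loadu_pd(a + i + 8),  _mm512_loadu_pd(b + i + 8),  acc1);
        acc2 = _mm512_fmadd_pd(_mm512_loadu_pd(a + i + 16), _mm512_loadu_pd(b + i + 16), acc2);
        acc3 = _mm512_fmadd_pd(_mm512_loadu_pd(a + i + 24), _mm512_loadu_pd(b + i + 24), acc3);
    }

    // Combine the independent partial sums, as the vaddpd sequence does above.
    return _mm512_add_pd(_mm512_add_pd(acc0, acc1), _mm512_add_pd(acc2, acc3));
}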
The performance chart compares the baseline implementation with the optimized DGEMM kernel, revealing an approximate 20% performance gain. This improvement is primarily attributed to breaking FMA dependency chains, enabling better instruction-level parallelism and improved CPU pipeline utilization.

Figure 4: Performance Gains by Breaking FMA Dependency Chain
Performance Gains from Breaking FMA dependency chain in GEMM Kernel: The performance chart compares the baseline implementation with the optimized DGEMM kernel, revealing an approximate 20% performance gain. This improvement is primarily attributed to breaking FMA dependency chains, enabling better instruction-level parallelism and improved CPU pipeline utilization.
Fused Loads + FMAs: Reducing Instruction & µOps Count
In baseline GEMM kernel implementations, scalar values are often loaded from memory, broadcast across a vector register, and then used in an FMA operation. This three-step sequence —load, broadcast, and compute— introduces unnecessary instruction overhead and increases register pressure.
// Load a 64-bit scalar from memory at [rbx] and broadcast it to all 8 lanes of zmm0
vbroadcastsd zmm0, qword ptr [rbx]
vfmadd231pd zmm6, zmm0, zmm1 // i.e. zmm6 = (zmm0 * zmm1) + zmm6
This pattern not only inflates the instruction count but also introduces additional dependencies between operations. Each step must be completed before the next can begin, limiting instruction-level parallelism and increasing decode pressure on the front-end pipeline.
Modern AMD “Zen” microarchitectures, starting with “Zen 4”, support AVX-512 instructions that allow memory operands to be directly fused into FMA operations. By leveraging this capability, we replaced the traditional three-instruction sequence with a single vfmadd231pd that performs the load, broadcast, and computation in one step:
vfmadd231pd zmm6, zmm1, qword ptr [rbx]{1to8} // i.e. zmm6 = (zmm1 * broadcast([rbx])) + zmm6
This fused instruction reduces the number of µOps generated, minimizes register usage, and eliminates intermediate dependencies. The result is a leaner, more efficient pipeline with better decode and fetch characteristics. By collapsing multiple operations into a single instruction, we unlock micro-op fusion benefits and streamline the critical path of the GEMM kernel.
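For readers working from C with intrinsics rather than hand-written assembly, the sketch below shows the pattern that typically maps to this fused form; whether the compiler actually emits the {1to8} embedded-broadcast memory operand depends on the compiler and optimization flags, so treat this as an assumption rather than a guarantee. The function name and arguments are illustrative.
#include <immintrin.h>

// Sketch: c_acc += broadcast(b[k]) * a_vec. With AVX-512 enabled, compilers
// may fold the broadcast into the FMA's memory operand (vfmadd231pd ... {1to8}).
static inline __m512d broadcast_fma(__m512d c_acc, __m512d a_vec, const double *b, int k)
{
    return _mm512_fmadd_pd(a_vec, _mm512_set1_pd(b[k]), c_acc);
}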

Figure 5: Performance Uplift in DGEMM by Fusing Load and FMA
Performance Gains by fusing load and FMA: The performance chart compares the baseline implementation with the optimized DGEMM kernel, demonstrating a 21% performance improvement. This gain is primarily due to the use of fused loads and FMA instructions.
Branch-Free Edge Handling with Masked Instructions
Handling edge cases—such as leftover rows or columns that don’t align with the vector width—is a common challenge in GEMM kernels. Traditionally, these tails are managed using conditional branches or scalar fallbacks. While functionally correct, this approach introduces control flow divergence, disrupts instruction prefetching, and prevents vectorization in the tail paths. Such branching patterns degrade performance by breaking the linearity of the instruction stream. They reduce instruction cache locality and prevent the CPU from leveraging SIMD execution units efficiently.
To address this, we came up with two strategies.
🔹 Strategy 1: AVX-512 Masked Ops
The first technique leverages AVX-512’s masking capabilities. By preparing a mask that represents the number of valid elements in the tail (e.g., 5 out of 8), we can perform masked loads and computations that zero out inactive lanes. This is achieved using MASK_KZ, which is a symbolic macro that typically expands to “{k2}{z}”. For example:
mov rdx, (1 << 5) - 1 // Prepare mask: lower 5 bits set (0b00011111)
kmovw k2, edx // Move the mask from edx into mask register k2
vmovupd zmm1 MASK_KZ(2), [rax] // Masked load: load 5 elements from [rax] into zmm1, zero the rest
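The same masked-tail idea in C with AVX-512 intrinsics might look like the hedged sketch below, assuming a tail of tail_len (at most 8) doubles; the AXPY-style update and the function name are illustrative, not AOCL-BLAS code.
#include <immintrin.h>

// Sketch: branch-free handling of a tail of tail_len (<= 8) doubles.
static void masked_tail_axpy(double *c, const double *a, double scale, int tail_len)
{
    __mmask8 m = (__mmask8)((1u << tail_len) - 1u);       // e.g., tail_len = 5 -> 0b00011111

    __m512d av = _mm512_maskz_loadu_pd(m, a);             // inactive lanes are zeroed
    __m512d cv = _mm512_maskz_loadu_pd(m, c);
    cv = _mm512_fmadd_pd(av, _mm512_set1_pd(scale), cv);  // cv += scale * av
    _mm512_mask_storeu_pd(c, m, cv);                      // only the valid lanes are written back
}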
🔹 Strategy 2: Vector Size Switching
The second technique involves switching between vector widths—ZMM (512-bit), YMM (256-bit), and XMM (128-bit)—based on the number of remaining elements. Since AVX-512 registers overlap with AVX and SSE registers, this allows us to reuse the same register names while adjusting the instruction width. For instance, if 12 elements remain, we can load 8 with a ZMM instruction and the remaining 4 with a YMM instruction, maintaining full vectorization without branching.
AVX-512 registers overlap with AVX registers.

Figure 6: AVX-512 Vector Register Mapping: From XMM to ZMM
AVX-512 Vector Register Overlap: XMM, YMM, ZMM: This image illustrates the overlap among the vector registers XMM, YMM, and ZMM.
The XMM register occupies the lower 128 bits (bits 0–127) of the full vector register.
The YMM register extends XMM to 256 bits (bits 0–255).
The ZMM register further extends YMM to the full 512-bit width (bits 0–511), available with AVX-512.
All three names — XMM, YMM, and ZMM — refer to the same physical register, just at different vector widths.
For example, xmm0 is the lower 128 bits of ymm0, which in turn is the lower 256 bits of zmm0; similarly, ymm1 overlaps with zmm1.
vmovupd zmm0, [rax] // Load 8 double-precision elements (512 bits) from memory into zmm0
vmovupd ymm1, [rax + 64] // Load the next 4 double-precision elements (256 bits) into ymm1
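Expressed with intrinsics, the 12-element case described above might look like the hedged sketch below; a simple copy is used only to show the width split across the overlapping ZMM/YMM registers, and the function is illustrative rather than AOCL-BLAS code.
#include <immintrin.h>

// Sketch: handle a 12-double tail with one 512-bit and one 256-bit access
// instead of falling back to scalar code.
static void copy_tail_12(double *dst, const double *src)
{
    __m512d lo = _mm512_loadu_pd(src);        // elements 0..7  (512 bits)
    __m256d hi = _mm256_loadu_pd(src + 8);    // elements 8..11 (256 bits)
    _mm512_storeu_pd(dst, lo);
    _mm256_storeu_pd(dst + 8, hi);
}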
These strategies are particularly effective on the AMD “Zen 4” and “Zen 5” architectures, where masking is hardware-accelerated and vector width switching is efficiently handled by the decode and execution units. The result is a uniform, cache-friendly execution path that avoids scalar slowdowns and maintains SIMD throughput even in the tail regions.
The performance chart compares the baseline implementation against the optimized DGEMM kernel, showing a 14% performance improvement due to the branch-free handling of edge cases in the GEMM kernel.

Figure 7: Performance Gain from Masked, Branch-Free Edge Case Handling
Performance gain from masked, branch-free edge case handling: The performance chart compares the baseline implementation against the optimized DGEMM kernel, showing a 14% performance improvement due to the branch-free handling of edge cases in the GEMM kernel.
The Payoff
These microarchitectural refinements brought dramatic improvements:
- Substantial speedups in edge-case DGEMM paths
- Reduced instruction stalls and frontend bottlenecks
- Better alignment with Zen’s fetch/decode/memory subsystems
- Smooth execution even in tail and edge conditions
The following performance graph reflects the impact of integrating all four optimization approaches into a single refined DGEMM kernel, showing an overall performance improvement of 20% across matrices of varying sizes and shapes. This combined implementation demonstrates the full potential of architecture-aware tuning on the AMD “Zen 4” and “Zen 5” architectures.

Figure 8: Overall Performance Gain of DGEMM Kernel
Final Thoughts
Great performance isn’t just about big matrices and deep pipelines — it’s about respecting the architecture and designing for the small stuff with surgical precision.
By tuning AOCL-BLAS GEMM kernels with “Zen”-aware optimizations, we’ve unlocked performance from the unlikeliest of places — tiny matrices — with a massive payoff.
To explore AOCL in greater depth, including its full suite of optimized numerical libraries, visit the AMD AOCL Developer Portal. Developers interested in experimenting with AOCL-BLAS can access the source code and contribute via the AOCL-BLIS GitHub repository. Dive in and accelerate your applications with optimized performance on AMD hardware!
