Advancing AI Performance on AMD EPYC CPUs: ZenDNN 5.1 Brings New Optimizations

Aug 19, 2025

AMD ZenDNN 5.1 is here

We’re excited to announce the latest updates to our ZenDNN library and ZenTorch and ZenTF Plugins, bringing significant performance boosts and new features for AI workloads on AMD EPYC™ CPUs. This release continues our commitment to optimizing inference performance for both Large Language Models (LLMs) and Recommender Systems, with a host of enhancements designed to push the boundaries of efficiency and speed.

Key Highlights of the Release

This update focuses on three key areas: framework compatibility, performance optimizations, and ecosystem contributions.

Enhanced Framework Compatibility & New Plugins

We have updated our plugins to maintain full compatibility with the latest AI frameworks, including PyTorch 2.7 and TensorFlow 2.19. This seamless integration allows you to leverage our optimizations with the newest versions of your preferred frameworks.
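As an illustration, the zentorch plugin is typically enabled through PyTorch's torch.compile interface. The sketch below assumes zentorch 5.1 and PyTorch 2.7 are installed; the model itself is purely illustrative.

```python
import torch
import zentorch  # registers the "zentorch" backend for torch.compile

# A small example model; any eval-mode PyTorch module can be used here.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
).eval()

# Compile with the zentorch backend so ZenDNN-optimized kernels and
# graph fusions are applied on AMD EPYC CPUs.
compiled_model = torch.compile(model, backend="zentorch")

with torch.no_grad():
    out = compiled_model(torch.randn(8, 512))
```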

A major addition in this release is the new vLLM + ZenTorch plugin. vLLM is a popular, high-performance library for LLM inference, and our new plugin delivers a significant performance uplift of up to 24% over vLLM-IPEX across a variety of popular models, including Llama 3.2, Phi-4, and Qwen-2.5. The plugin is designed for seamless, “plug-and-play” integration; once installed, it automatically replaces vLLM's default attention mechanism with our highly optimized ZenTorch PagedAttention kernel, requiring zero code changes from the user.
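Because the integration is plug-and-play, existing vLLM code paths stay unchanged. The sketch below is ordinary vLLM offline-inference usage, which the plugin accelerates transparently once zentorch and the CPU build of vLLM are installed; the model name and sampling settings are illustrative.

```python
from vllm import LLM, SamplingParams

# Standard vLLM offline inference; no ZenTorch-specific code is needed.
# With the plugin installed, vLLM's default CPU attention is replaced by
# the optimized ZenTorch PagedAttention kernel automatically.
llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct", dtype="bfloat16")
params = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(["Explain paged attention in one paragraph."], params)
for out in outputs:
    print(out.outputs[0].text)
```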

As a preview, across all tested LLMs and input/output length combinations, vLLM + ZenTorch 5.1.0 consistently delivers higher throughput than vLLM + IPEX 2.7.0 for CPU-based inference:

[Chart: vLLM + zentorch 5.1.0 vs. vLLM + IPEX 2.7.0 throughput speedup]
Detailed run configuration: See Footnote (ZD-058)

We’ve also extended our support for TensorFlow-Java by enabling the PluggableDevice feature, which is essential for our zentf plugin to function effectively. Our team contributed this work directly to the official TensorFlow-Java repository, strengthening the core capabilities of the framework and enabling developers to easily integrate custom hardware accelerators and plugins. Our early testing shows this integration delivers higher performance for models like DIEN and Wide & Deep compared to the native TensorFlow-Java implementation.

Deeper Performance Optimizations

This release introduces several new optimizations that operate at every level, from individual operator kernels to comprehensive graph fusions.

Recommender System (RecSys) Improvements: We have made significant strides in optimizing DLRMv2 and other RecSys models. New "out" variants of the EmbeddingBag operator now write directly into a shared output buffer, eliminating the need for a separate concatenation operation. We also introduced a new graph fusion that merges the concatenation following the Bottom MLP and EmbeddingBag outputs in the DLRMv2 model.
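To illustrate the idea (this is not the ZenDNN operator API itself, just standard PyTorch), the sketch below contrasts concatenating per-table EmbeddingBag outputs with writing each table's result directly into a slice of one preallocated buffer; the "out" variants achieve the latter inside the kernel, so the separate concat disappears.

```python
import torch
import torch.nn.functional as F

num_tables, batch, dim = 4, 8, 16
tables = [torch.randn(100, dim) for _ in range(num_tables)]
indices = [torch.randint(0, 100, (batch, 5)) for _ in range(num_tables)]

# Baseline: each EmbeddingBag produces its own tensor, then a concat copies
# everything into a new buffer.
outs = [F.embedding_bag(idx, tbl, mode="sum") for idx, tbl in zip(indices, tables)]
concat_baseline = torch.cat(outs, dim=1)

# "Out"-style pattern: preallocate the shared buffer once and write each
# table's pooled embeddings into its slice, so no separate concat op is needed.
shared = torch.empty(batch, num_tables * dim)
for i, (idx, tbl) in enumerate(zip(indices, tables)):
    shared[:, i * dim:(i + 1) * dim] = F.embedding_bag(idx, tbl, mode="sum")

assert torch.allclose(concat_baseline, shared)
```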

New Operator Fusions: We’ve added new operator fusions to accelerate common computational patterns found in deep learning models, particularly in RNN and attention layers. These fusions include:

  • MatMul + BiasAdd + Tanh
  • MatMul + BiasAdd + Sigmoid

These fusions reduce memory traffic and kernel launch overhead, leading to significant performance gains: up to 25% uplift for the DIEN BF16 model, as per internal tests.
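For reference, the snippet below shows the eager-mode computation these fusions target; under the zentorch compile path shown earlier, such a sequence can be executed as a single fused kernel rather than three separate ops. Shapes are illustrative.

```python
import torch

x = torch.randn(32, 256, dtype=torch.bfloat16)   # activations
w = torch.randn(256, 512, dtype=torch.bfloat16)  # weights
b = torch.randn(512, dtype=torch.bfloat16)        # bias

# Unfused: MatMul, BiasAdd, and Tanh run as three ops with two intermediate tensors.
y_tanh = torch.tanh(torch.matmul(x, w) + b)

# The sigmoid variant follows the same MatMul + BiasAdd + activation pattern.
y_sigmoid = torch.sigmoid(torch.matmul(x, w) + b)
```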

Kernel and ZenDNN Enhancements: A new kernel for BF16/FP32 MatMul has been introduced to eliminate overheads in less compute-intensive operations. Additionally, we now support Ahead-of-Time (AOT) Reordering for MatMul kernels across various data types (INT8, BF16, and FP32) to further improve efficiency. We have also added support for a MatMul (+fused) Low Overhead API (LOA) to improve performance for small matrix shapes.
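ZenDNN behavior can also be tuned through environment variables. As a hedged illustration, the settings below mirror the values used in the benchmark configuration in Footnote ZD-058; they must be set before the framework initializes ZenDNN, and the exact set of variables may vary by release.

```python
import os

# Example ZenDNN tuning knobs, using the values from the ZD-058 benchmark
# configuration; set these before ZenDNN is initialized by the framework.
os.environ["ZENDNN_MATMUL_ALGO"] = "FP32:4,BF16:0"      # MatMul algorithm selection per data type
os.environ["ZENDNN_WEIGHT_CACHING"] = "1"               # cache reordered weights across calls
os.environ["ZENDNN_PRIMITIVE_CACHE_CAPACITY"] = "1024"  # primitive cache capacity
os.environ["ZENDNN_TENSOR_POOL_LIMIT"] = "1024"         # tensor pool limit

import torch
import zentorch  # import after the environment variables are set
```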

Our Commitment to Open Source

A key part of our strategy is to contribute our work back to the open-source community. We have been actively upstreaming our optimizations directly into the core PyTorch codebase. Similarly, our work on the PluggableDevice feature was contributed and accepted into the TensorFlow-Java repository. These regular contributions strengthen the native performance and capabilities of these frameworks, benefiting all users.

The Result: Real-World Performance Gains

These software enhancements deliver tangible performance improvements for a wide range of AI workloads on AMD EPYC™ CPUs. By optimizing at the kernel, graph, and framework levels, we enable developers to achieve higher throughput and lower latency for their inference tasks without complex configuration.

We encourage you to try out the upgraded plugins (zentorch and zentf) and optimizations, and share your feedback with us on GitHub. For detailed installation instructions and further information, please refer to our documentation and GitHub pages.

 


Footnotes


ZD-058:  

Testing conducted internally by AMD as of 06/27/2025. The environment settings for this configuration are as follows:  
The operating system is Ubuntu 22.04 LTS, running on a 2-socket AMD EPYC™ 9755 128-Core Processor system with SMT enabled and 2 NUMA nodes, Python 3.11.8; zentorch 5.1.0; vLLM 0.9.0+cpu; IPEX 2.7.0; LLVM OpenMP 18.1.8=hf5423f3_1. Core binding is set for 96 cores per instance. The VLLM_CPU_KVCACHE_SPACE is set to 90, and VLLM_CPU_OMP_THREADS_BIND is set to 0-95.  
ZenDNN variables include ZENDNN_TENSOR_POOL_LIMIT=1024, ZENDNN_MATMUL_ALGO=FP32:4,BF16:0, ZENDNN_PRIMITIVE_CACHE_CAPACITY=1024, and ZENDNN_WEIGHT_CACHING=1.  
For model testing, the number of prompts is 512, with a maximum number of sequences of 32.  
Datatype is BFloat16. All performance metrics are based on throughput in tokens per second. 

 

Model | Output Length | Input Length | vLLM + zentorch 5.1.0 (tokens/s) | vLLM + IPEX 2.7.0 (tokens/s) | Speedup
microsoft/phi-4 | 128 | 128 | 151.8294537 | 119.1399245 | 1.27
microsoft/phi-4 | 128 | 512 | 96.863873 | 76.8731383 | 1.26
microsoft/phi-4 | 512 | 128 | 171.4570229 | 136.693467 | 1.25
microsoft/phi-4 | 512 | 512 | 143.4388875 | 114.772913 | 1.25
Qwen/Qwen2.5-3B | 128 | 128 | 244.7217902 | 212.5098186 | 1.15
Qwen/Qwen2.5-3B | 128 | 512 | 200.8506422 | 162.2628558 | 1.24
Qwen/Qwen2.5-3B | 512 | 128 | 256.5140076 | 225.1190505 | 1.14
Qwen/Qwen2.5-3B | 512 | 512 | 238.5087328 | 205.2969589 | 1.16
Qwen/Qwen2.5-7B-Instruct | 128 | 128 | 192.4150592 | 159.5727 | 1.21
Qwen/Qwen2.5-7B-Instruct | 128 | 512 | 143.529983 | 116.2419905 | 1.23
Qwen/Qwen2.5-7B-Instruct | 512 | 128 | 207.351553 | 175.2798501 | 1.18
Qwen/Qwen2.5-7B-Instruct | 512 | 512 | 185.0314871 | 156.7501855 | 1.18
meta-llama/Llama-3.2-3B-Instruct | 128 | 128 | 277.8863782 | 239.084652 | 1.16
meta-llama/Llama-3.2-3B-Instruct | 128 | 512 | 221.9389208 | 171.8899209 | 1.29
meta-llama/Llama-3.2-3B-Instruct | 512 | 128 | 290.0127427 | 254.1709159 | 1.14
meta-llama/Llama-3.2-3B-Instruct | 512 | 512 | 264.4799039 | 227.8941912 | 1.16
EleutherAI/gpt-j-6b | 128 | 128 | 262.4814183 | 191.0104725 | 1.37
EleutherAI/gpt-j-6b | 128 | 512 | 167.1430942 | 109.5319672 | 1.53
EleutherAI/gpt-j-6b | 512 | 128 | 284.5393133 | 214.4331284 | 1.33
EleutherAI/gpt-j-6b | 512 | 512 | 224.6661979 | 166.5562967 | 1.35

 

Results may vary based on system configurations and settings.