Advancing AI Performance on AMD EPYC CPUs: ZenDNN 5.1 Brings New Optimizations

Aug 19, 2025

AMD ZenDNN 5.1 is here

We’re excited to announce the latest updates to our ZenDNN library and ZenTorch and ZenTF Plugins, bringing significant performance boosts and new features for AI workloads on AMD EPYC™ CPUs. This release continues our commitment to optimizing inference performance for both Large Language Models (LLMs) and Recommender Systems, with a host of enhancements designed to push the boundaries of efficiency and speed.

Key Highlights of the Release

This update focuses on three key areas: framework compatibility, performance optimizations, and ecosystem contributions.

Enhanced Framework Compatibility & New Plugins

We have updated our plugins to maintain full compatibility with the latest AI frameworks, including PyTorch 2.7 and TensorFlow 2.19. This seamless integration allows you to leverage our optimizations with the newest versions of your preferred frameworks.
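As a quick-start sketch, the plugins install alongside their host frameworks via pip. The package names (`zentorch`, `zentf`) and version pins below are assumptions based on this release's naming; check the official documentation for the exact commands for your environment:

```shell
# Sketch: install the PyTorch plugin alongside PyTorch 2.7
# (package names and pins assumed; verify against the official docs)
pip install torch==2.7.0 zentorch

# Sketch: install the TensorFlow plugin alongside TensorFlow 2.19
pip install tensorflow==2.19.0 zentf
```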

A major addition in this release is the new vLLM + ZenTorch plugin. vLLM is a popular, high-performance library for LLM inference, and our new plugin delivers a significant performance uplift of up to 24% over vLLM-IPEX across a variety of popular models, including Llama 3.2, Phi-4, and Qwen-2.5. The plugin is designed for seamless, “plug-and-play” integration; once installed, it automatically replaces vLLM's default attention mechanism with our highly optimized ZenTorch PagedAttention kernel, requiring zero code changes from the user.
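The mechanism behind this kind of zero-code-change integration can be illustrated with a toy sketch: at install/import time, the plugin swaps the engine's default kernel for an optimized one, so calling code is untouched. All names here (`Engine`, `optimized_paged_attention`, `install_plugin`) are hypothetical and are not the actual vLLM or ZenTorch APIs:

```python
# Toy illustration of import-time kernel replacement ("plug-and-play" plugin).
# None of these names are real vLLM/ZenTorch APIs; this only shows the pattern.

class Engine:
    """Stand-in for an inference engine with a swappable attention kernel."""
    def __init__(self):
        self.attention = self.default_attention  # default implementation

    def default_attention(self, scores):
        return [s / sum(scores) for s in scores]  # naive normalization

    def generate(self, scores):
        return self.attention(scores)

def optimized_paged_attention(scores):
    # Hypothetical optimized kernel: same math, different implementation path.
    total = sum(scores)
    return [s / total for s in scores]

def install_plugin(engine):
    """What a plugin does at import time: patch the kernel; no user code changes."""
    engine.attention = optimized_paged_attention

engine = Engine()
baseline = engine.generate([1.0, 3.0])
install_plugin(engine)          # the "pip install + import" step in real life
patched = engine.generate([1.0, 3.0])
assert baseline == patched      # numerics unchanged, only the kernel is swapped
```

The user-facing contract is the key point: the call site (`engine.generate`) never changes, which is what "zero code changes" means in practice.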

As a preview of the results, vLLM + ZenTorch 5.1.0 consistently achieved higher throughput than vLLM + IPEX 2.7.0 for CPU-based inference across all tested LLMs and input/output length combinations:

Detailed run configuration: See Footnote (ZD-058)

We’ve also extended our support for TensorFlow-Java by enabling the PluggableDevice feature, which is essential for our zentf plugin to function effectively. Our team contributed this work directly to the official TensorFlow-Java repository, strengthening the core capabilities of the framework and enabling developers to easily integrate custom hardware accelerators and plugins. Our early testing shows this integration delivers higher performance for models like DIEN and Wide & Deep than the native TensorFlow-Java implementation.

Deeper Performance Optimizations

This release introduces several new optimizations that operate at every level, from individual operator kernels to comprehensive graph fusions.

Recommender System (RecSys) Improvements: We have made significant strides in optimizing DLRMv2 and other RecSys models. New "out" variants of the EmbeddingBag operator now write directly to a shared output buffer, eliminating the need for a separate concatenation operation. We also introduced a new fusion that folds the concatenation following the Bottom MLP and EmbeddingBag into those operations for the DLRMv2 model.
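The idea behind the "out" variants can be sketched in plain Python: instead of producing per-table outputs and concatenating them afterwards, each embedding-bag lookup writes its result into its assigned slice of one preallocated buffer. The function names and signatures below are illustrative, not the actual ZenDNN operators:

```python
# Sketch of an "out"-variant EmbeddingBag: write into a shared buffer slice
# instead of concatenating per-table results afterwards.
# Illustrative only; not the actual ZenDNN operator signatures.

def embedding_bag(table, indices):
    """Sum-pool the rows of `table` selected by `indices` (one bag)."""
    dim = len(table[0])
    out = [0.0] * dim
    for i in indices:
        for d in range(dim):
            out[d] += table[i][d]
    return out

def embedding_bag_out(table, indices, buf, offset):
    """'out' variant: accumulate directly into buf[offset:offset+dim]."""
    dim = len(table[0])
    for d in range(dim):
        buf[offset + d] = 0.0
    for i in indices:
        for d in range(dim):
            buf[offset + d] += table[i][d]

tables = [
    [[1.0, 2.0], [3.0, 4.0]],   # embedding table 0, dim 2
    [[0.5, 0.5], [1.5, 2.5]],   # embedding table 1, dim 2
]
bags = [[0, 1], [1]]            # one bag of indices per table

# Baseline: separate per-table outputs, then an explicit concatenation.
concat = []
for t, idx in zip(tables, bags):
    concat += embedding_bag(t, idx)

# "out" variant: preallocate once; the concatenation op disappears.
shared = [0.0] * 4
offset = 0
for t, idx in zip(tables, bags):
    embedding_bag_out(t, idx, shared, offset)
    offset += len(t[0])

assert shared == concat
```

Writing into disjoint slices of one buffer also makes the per-table lookups trivially parallelizable, since no post-hoc copy serializes them.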

New Operator Fusions: We’ve added new operator fusions to accelerate common computational patterns found in deep learning models, particularly in RNN and attention layers. These fusions include:

  • MatMul + BiasAdd + Tanh
  • MatMul + BiasAdd + Sigmoid

These fusions reduce memory traffic and kernel launch overhead, leading to significant performance gains: up to 25% uplift for the DIEN BF16 model, as per internal tests. 
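Conceptually, such a fusion computes the whole chain in one pass over each output element instead of materializing the MatMul and BiasAdd intermediates in memory. A minimal pure-Python sketch of the MatMul + BiasAdd + Tanh pattern (the Sigmoid variant is analogous; this is an illustration of the concept, not the ZenDNN kernel):

```python
import math

# Sketch of a MatMul + BiasAdd + Tanh fusion: one pass per output element,
# with no intermediate matrices written to memory. Illustrative only.

def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def unfused(a, b, bias):
    y = matmul(a, b)                                              # intermediate 1
    y = [[v + bias[j] for j, v in enumerate(row)] for row in y]   # intermediate 2
    return [[math.tanh(v) for v in row] for row in y]

def fused_matmul_bias_tanh(a, b, bias):
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                acc += a[i][k] * b[k][j]
            out[i][j] = math.tanh(acc + bias[j])   # bias + activation applied
            #                                        before the value is stored
    return out

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.1, 0.2], [0.3, 0.4]]
bias = [0.05, -0.05]
assert fused_matmul_bias_tanh(a, b, bias) == unfused(a, b, bias)
```

In a real kernel the win comes from keeping the accumulator in registers: the unfused path writes and re-reads two full matrices, while the fused path touches each output exactly once.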

Kernel and ZenDNN Enhancements: A new kernel for BF16/FP32 MatMul has been introduced to eliminate overheads in less compute-intensive operations. Additionally, we now support Ahead of Time (AOT) Reordering for MatMul kernels across various data types (INT8, BF16, and FP32) to further improve efficiency. We have also added support for a MatMul(+fused) Low Overhead API (LOA) to improve performance for small matrix shapes.
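Ahead-of-time reordering amortizes data-layout conversion: the weight matrix is packed into the kernel's preferred layout once, outside the hot loop, rather than on every call. A toy sketch of the idea, where the "preferred layout" is simply column-major (the function names and the chosen layout are illustrative, not ZenDNN's actual packing):

```python
# Toy sketch of Ahead-of-Time (AOT) weight reordering for MatMul.
# The weights are packed once into column-major order, so each inference
# call skips the per-call layout conversion. Illustrative only.

def pack_weights(b):
    """One-time reorder: row-major matrix -> flat column-major buffer."""
    rows, cols = len(b), len(b[0])
    return [b[k][j] for j in range(cols) for k in range(rows)], rows, cols

def matmul_prepacked(a, packed):
    """MatMul against an already-reordered weight buffer (the hot path)."""
    buf, inner, cols = packed
    return [[sum(row[k] * buf[j * inner + k] for k in range(inner))
             for j in range(cols)]
            for row in a]

weights = [[1.0, 2.0], [3.0, 4.0]]
packed = pack_weights(weights)      # done once, ahead of time

x = [[1.0, 1.0]]
y = matmul_prepacked(x, packed)     # per-call work: no reordering here
assert y == [[4.0, 6.0]]
```

Since inference weights are constant, the packing cost is paid once at load time and every subsequent MatMul reads the buffer in its unit-stride, cache-friendly order.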

Our Commitment to Open Source

A key part of our strategy is to contribute our work back to the open-source community. We have been actively upstreaming our optimizations directly into the core PyTorch codebase. Similarly, our work on the PluggableDevice feature was contributed and accepted into the TensorFlow-Java repository. These regular contributions strengthen the native performance and capabilities of these frameworks, benefiting all users.

The Result: Real-World Performance Gains

These software enhancements deliver tangible performance improvements for a wide range of AI workloads on AMD EPYC™ CPUs. By optimizing at the kernel, graph, and framework levels, we enable developers to achieve higher throughput and lower latency for their inference tasks without complex configuration.

We encourage you to try out the upgraded plugins (zentorch and zentf) and optimizations, and share your feedback with us on GitHub. For detailed installation instructions and further information, please refer to our documentation and GitHub pages.

 


Article By


Sr. Product Marketing Manager, AI Group
