ZenDNN 5.2: Accelerating vLLM V1 Engine and Recommender Systems Inference on AMD EPYC™ CPUs

Mar 13, 2026

zendnn-5.2

AMD is making a bold statement: the future of AI inference is flexible, efficient, and increasingly powered by the CPU you already have in your rack. In the world of artificial intelligence, the narrative is dominated by the GPU - and for good reason: GPUs aren't going anywhere. Their massive parallel processing power remains the gold standard for heavy-lift workloads like high-throughput LLM inferencing. However, the CPU is no longer just a spectator; it is being leveraged as a high-performance engine for LLM inferencing in its own right.

With the latest release of ZenDNN, version 5.2, AMD is shattering expectations for what the x86 architecture can handle in the AI era. We aren't just talking about incremental gains: we are looking at a performance boost of up to 200% over previous versions for AI workloads on AMD EPYC™ processors.

Why This Matters: From Agents to Edge

This isn't just about raw numbers; it’s about enabling new frontiers in computing:

  • Agentic AI: To run autonomous agents effectively, you need low-latency, reliable compute. Optimizations for vLLM integration and INT4 quantization enable sophisticated LLM agents to run directly on CPU infrastructure with plug-and-play ease.
  • Offline and Edge Use-Cases: Privacy and connectivity aren't always guaranteed. By pushing the limits of Weight-Only Quantization (WOQ), AMD allows massive models to run locally and efficiently without the reliance on a dedicated datacenter GPU.
  • Accelerate AI inferencing on hardware you already have: In most standard server deployments, the CPU remains the backbone of the stack and is always present; leveraging it for AI reduces Total Cost of Ownership (TCO).

What’s Under the Hood?

The 5.2 release marks a major architectural shift. AMD has migrated from legacy libraries to the new ZenDNNL, leveraging a Low Overhead API (LOA) that streamlines operator kernels like MatMul and Softmax.

Key Highlights of the Update:

  • Seamless vLLM V1 Integration: The new vLLM-ZenTorch Plugin allows for zero-code-change acceleration, making high-throughput inference more accessible than ever.
  • Quantization Support: Experimental INT4 support for LLMs and specialized UINT4/W8A8 quantization for recommendation systems (DLRM-v2).
  • BFloat16 & Graph Optimizations: Enhanced EPYC™ processor specific kernels and advanced pattern identification ensure that every cycle of the CPU is squeezed for maximum AI performance.
  • Modernized Stack: Full support for TensorFlow 2.20, PyTorch 2.10.0, and Python 3.13.
  • ZenDNN Backend for Llama.cpp: Engineered during the 5.2 development cycle, this integration allows Llama.cpp users to leverage ZenDNN’s low-latency kernels for superior execution on AMD EPYC™ processors.
Figure 1: ZenDNN Framework landscape at a glance

Supercharging vLLM V1 Engine with AMD ZenDNN

Leveraging CPUs for LLM inference is rapidly evolving from a niche alternative into a sophisticated and cost-efficient strategy for production workloads. With the release of ZenDNN 5.2, we’ve upgraded our plug-in to support the state-of-the-art vLLM V1 engine. The team focused on a "zero-code-change" philosophy, offering a true plug-and-play experience for vLLM versions 0.12.0 through 0.15.1 while delivering massive speedups under the hood. In our testing on non-cherry-picked models, the combination of vLLM and ZenTorch delivered up to 239% higher performance compared to running native vLLM on a standard CPU setup.

Beyond the improvements to the software stack, we’ve unlocked further gains by optimizing how the hardware handles data. By deploying multiple vLLM instances using numactl and interleaving memory access for each instance, we’ve effectively maximized DRAM memory bandwidth. This approach ensures that the CPU isn't just processing faster, but is also being fed data more efficiently, leading to a significant boost in total decode throughput.

Implementation: Maximizing the Throughput

Efficient scaling: While modern AMD EPYC™ processors offer incredible core density, simply running a single, massive vLLM instance across all 128 cores of a socket often leads to diminishing returns. When we tested native vLLM using the full 0–127 core range, performance was unexpectedly low due to the complexities of managing such a wide compute fabric and the resulting memory contention.

To solve this, we implemented a more efficient scaling strategy: splitting the workload into two distinct vLLM instances, with 64 cores dedicated to each. By "interleaving" these cores and binding them to their respective local memory pools, we massively increased total throughput. This approach effectively saturates the available DRAM bandwidth and reduces synchronization overhead, allowing the hardware to operate at its peak potential. As shown in our latest benchmarks, this multi-instance configuration is the key to unlocking the true performance of high-core-count CPU architectures.

You can do this by using numactl to pin specific vLLM instances to dedicated CPU cores and their local memory pools. Here is a quick breakdown of how to implement the memory interleaving strategy mentioned above:

  • Install the Plugin: Simply drop the vLLM-ZenTorch plugin into your existing vLLM environment.
  • Bind Instances: Use numactl to bind specific vLLM instances to specific CPU cores.
  • Interleave Memory: Access the physical cores in a non-sequential manner to ensure memory bandwidth is distributed across all available DRAM channels, preventing bottlenecks during the decode phase.

For example, the following command launches a single vLLM instance bound to the even-numbered cores (0, 2, 4... up to 126) and restricts its memory allocation to Socket 0 (Memory Pool 0):

		numactl --physcpubind=$(seq -s, 0 2 126) --membind=0 \
		    vllm bench throughput --model meta-llama/Llama-3.1-8B-Instruct \
		    --random-input-len 128 --random-output-len 128 \
		    --num-prompts 1024 --max-num-seqs 128

By using this pattern, you can scale your deployment by launching additional instances mapped to the remaining cores and sockets. This "siloed" approach prevents different AI workloads from competing for the same cache or memory bandwidth, effectively maximizing the total decode throughput of the entire server.
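As a concrete sketch, the single-instance command above can be extended to two instances on one socket: one bound to the even-numbered cores, one to the odd-numbered cores. The assumption that all 128 physical cores of the socket are numbered 0–127 and map to NUMA node 0 is system-specific - verify your topology with `numactl -H` first. The tool checks are only there to keep the sketch runnable on machines without numactl or vLLM installed:

```shell
# Two interleaved vLLM instances on one 128-core socket (NUMA node 0).
# Assumes physical cores 0-127 all belong to socket 0 -- verify with `numactl -H`.
EVEN_CORES=$(seq -s, 0 2 126)   # instance 1: 0,2,4,...,126
ODD_CORES=$(seq -s, 1 2 127)    # instance 2: 1,3,5,...,127

launch_instance() {
    # $1 = comma-separated core list; memory stays bound to socket 0 (node 0)
    numactl --physcpubind="$1" --membind=0 \
        vllm bench throughput --model meta-llama/Llama-3.1-8B-Instruct \
        --random-input-len 128 --random-output-len 128 \
        --num-prompts 1024 --max-num-seqs 128 &
}

# Guarded so the sketch is a no-op where numactl/vLLM are absent
if command -v numactl >/dev/null 2>&1 && command -v vllm >/dev/null 2>&1; then
    launch_instance "$EVEN_CORES"
    launch_instance "$ODD_CORES"
    wait
fi
```

Each instance then sees 64 dedicated cores and its local memory pool, matching the two-instance configuration benchmarked in Figure 2.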

Furthermore, for better performance during CPU inference, we enabled freezing by setting export TORCHINDUCTOR_FREEZING=1. This environment variable is available from vLLM version 0.12.0 onwards. In brief, freezing allows the runtime to treat model parameters as immutable. This often leads to a reduced memory footprint and improved cache locality: the model data is packed tighter and is more likely to stay within the L3 cache of the AMD EPYC™ processor.
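A minimal sketch of the freezing setup - the environment variable simply needs to be exported before the vLLM process starts (the launch is guarded so the snippet stays runnable where vLLM is not installed):

```shell
# Enable TorchInductor freezing (honored by vLLM 0.12.0 onwards):
# model parameters are treated as immutable for the lifetime of the process.
export TORCHINDUCTOR_FREEZING=1

if command -v vllm >/dev/null 2>&1; then
    vllm bench throughput --model meta-llama/Llama-3.1-8B-Instruct \
        --random-input-len 128 --random-output-len 128 \
        --num-prompts 1024 --max-num-seqs 128
fi
```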

Figure 2: One-socket analysis: One instance of native vLLM consuming all 128 cores of one socket vs 2 instances of vLLM each consuming 64 cores and accelerated with ZenDNN

We have a dual-socket system. Can we do better? Yes!

With a 2P AMD EPYC™ powerhouse at our disposal, we fully 'redlined' the machine, ensuring that every single core was engaged and working at peak efficiency. 
 
The result: 

Figure 3: Dual-socket analysis: In each test case, we spawn 4 instances of vLLM, 2 on each socket, with each instance accessing its cores in an interleaved fashion. The only difference between the two runs is whether ZenDNN is enabled. The speedup graph speaks for itself, showing the value proposition of the software enhancements.
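The four-instance layout from Figure 3 follows the same pattern, extended to the second socket. The core numbering below (socket 1 owning physical cores 128–255 on NUMA node 1) is an assumption about this particular topology - confirm yours with `numactl -H` before pinning:

```shell
# Sketch: 4 vLLM instances on a 2P system, 2 per socket, interleaved cores.
# Assumed topology: socket 0 = cores 0-127 (node 0), socket 1 = cores 128-255 (node 1).
run_on() {
    # $1 = comma-separated core list, $2 = NUMA node for --membind
    numactl --physcpubind="$1" --membind="$2" \
        vllm bench throughput --model meta-llama/Llama-3.1-8B-Instruct \
        --random-input-len 128 --random-output-len 128 \
        --num-prompts 1024 --max-num-seqs 128 &
}

# Guarded so the sketch is a no-op where numactl/vLLM are absent
if command -v numactl >/dev/null 2>&1 && command -v vllm >/dev/null 2>&1; then
    run_on "$(seq -s, 0 2 126)"   0   # socket 0, even cores
    run_on "$(seq -s, 1 2 127)"   0   # socket 0, odd cores
    run_on "$(seq -s, 128 2 254)" 1   # socket 1, even cores
    run_on "$(seq -s, 129 2 255)" 1   # socket 1, odd cores
    wait
fi
```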

Conclusion: Empowering the Future of Flexible AI

The release of ZenDNN 5.2 marks a pivotal moment in how we think about AI infrastructure. While GPUs remain the gold standard for massive parallel workloads, the CPU has evolved from a supporting player into a high-performance engine capable of carrying its own weight. By delivering a performance boost of over 200% and seamless plug-ins and integrations with industry heavyweights like vLLM and Llama.cpp, we are giving developers the freedom to run sophisticated AI wherever it makes the most sense - whether that’s in a high-density data center or at the edge.

Our Ongoing Commitment to the Ecosystem
Just as we did with version 5.1, our strategy remains rooted in the open-source community. We continue to upstream our optimizations into the core PyTorch and TensorFlow codebases, ensuring that the work we do on the ZenDNNL Low Overhead API (LOA) and pluggable devices benefits every developer, regardless of their specific stack. By strengthening the native capabilities of these frameworks, we aren't just improving AMD hardware performance; we are improving the AI ecosystem for everyone.

Real-World Results for Every Rack
The technical shifts we've made - from experimental INT4 quantization to architectural techniques like NUMA-aware memory interleaving - translate directly into tangible business value. You can now achieve higher throughput and lower latency for agentic AI and offline use-cases using the AMD EPYC™ hardware you already own.

Call to Action
We encourage you to experience these gains firsthand. Download the updated AMD ZenDNN Plugin for PyTorch (zentorch) and AMD ZenDNN Plugin for TensorFlow (zentf) - either via pip install or from GitHub - explore our latest optimizations on GitHub, and join us in pushing the boundaries of what the x86 architecture can achieve.

Try it today:

We’d love to hear about your performance gains - open an issue or start a discussion on our GitHub pages!

 

Footnotes

Testing conducted internally by AMD as of March 10, 2026. The environment settings for this configuration are as follows:  

vLLM bench throughput; input:128, output:128, num_prompts:1024, batch_size:128. The operating system is Ubuntu 22.04 LTS, running on a 2-socket AMD EPYC™ 9755 128-Core Processor system with SMT enabled and 2 NUMA nodes, Python 3.13; zenTorch 5.2; vllm 0.15.1+cpu. Datatype is BFloat16.

All performance metrics are based on throughput in tokens per second. 

Results may vary based on system configurations and settings. 


Article By


Sr. Product Marketing Manager, AI Group
