Accelerating GPT-OSS-20B on AMD Ryzen™ AI NPUs: Efficient MoE Inference on Strix and Halo

Mar 16, 2026

Introduction: Enabling 20B LLMs on AMD Ryzen AI

GPT-OSS-20B [1] is a 20B-parameter open-weight language model built for strong instruction following, coding, and general reasoning while remaining practical for local deployment. It uses a Mixture-of-Experts (MoE) architecture, activating only a subset of experts per token to increase effective model capacity without proportionally increasing compute cost. 

GPT-OSS-20B combines global and local attention mechanisms to balance long-context reasoning with computational efficiency. Local attention layers reduce memory bandwidth and latency, while global attention preserves cross-sequence understanding. The original model is trained using the MXFP4 numeric format. For deployment on AMD Ryzen™ AI platforms, we use an INT4-quantized ONNX version of the model. INT4 is natively supported by the Ryzen™ AI NPU, enabling higher throughput and improved power efficiency with minimal quality degradation.

In this blog, we present a case study of deploying GPT-OSS-20B on AMD Ryzen™ AI processors. This includes developing custom ONNX operators to map model parameters to accelerator-specific buffers, executing compute on the AMD NPU, orchestrating MoE expert routing during decode, applying operator fusion to reduce dispatch overhead, and supporting custom attention schemes within the NPU. We also deploy Grouped Query Attention (GQA) on the NPU using a FlashAttention-style implementation with online SoftMax to enable long-context inference. To support execution across a range of Ryzen™ AI systems, we implement a flexible memory allocation strategy that optimizes both peak memory usage and performance.

Quantization Strategy for GPT-OSS-20B

For GPT-OSS-20B, we use a Microsoft-provided quantized ONNX model [2] where all linear layers, the LM head, and the embedding table are quantized to low precision. This significantly reduces model size and memory bandwidth requirements. MMLU results for this quantized ONNX model are as follows:

| Astronomy | Philosophy | Management | Average |
|-----------|------------|------------|---------|
| 33.55     | 27.01      | 33.98      | 30.04   |

On the NPU, activations run in bfloat16 (BF16) to preserve numerical stability while maintaining high performance. This combination of low-bit weights and BF16 activations delivers strong throughput and power efficiency with minimal impact on model quality.
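To make the weight format concrete, here is a minimal numpy sketch of blockwise round-to-nearest (RTN) INT4 quantization and dequantization, assuming a block size of 32 as the `int4-rtn-block-32` model naming suggests. The function names and the per-block symmetric scaling are illustrative only, not the actual Ryzen AI kernels:

```python
import numpy as np

def quantize_int4_block(w, block_size=32):
    """Round-to-nearest INT4 quantization with one scale per block (sketch)."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1) / 7.0        # map max magnitude to +/-7
    scales = np.where(scales == 0, 1.0, scales)      # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(blocks / scales[:, None]), -8, 7).astype(np.int8)
    return q.reshape(-1), scales

def dequantize_int4_block(q, scales, block_size=32):
    """Recover float weights: each 4-bit value times its block's scale."""
    out = q.astype(np.float32).reshape(-1, block_size) * scales.reshape(-1, 1)
    return out.reshape(-1)
```

The per-block reconstruction error is bounded by half the block's scale, which is why RTN with small blocks keeps quality loss low.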

Operator Analysis and NPU Offload Strategy

For GPT-OSS-20B, the prefill phase is dominated by matrix multiplication operations in the Quantized-Mixture-of-Experts (QMoE) [3] layers. As context length grows, the bottleneck gradually shifts to attention, specifically Grouped Query Attention (GQA), since KV-cache operations scale with total sequence length.

During token generation, matrix multiplication cost per token remains fixed and determines the performance ceiling at small context sizes. Attention cost, however, grows with context length, making efficient attention kernels critical for long-context workloads.

QMoE Offload: Accelerating Mixture-of-Experts on Ryzen™ AI

Quantized Mixture-of-Experts (QMoE) layers account for a significant portion of the compute cost in GPT-OSS-20B during both prefill and decode phases. Efficient execution of these layers is critical to achieving high throughput and low latency on client hardware.

  • Traditional hardware-friendly approach: A common accelerator-friendly strategy for MoE models is to execute all experts in parallel and mask out unused experts during the aggregation stage. While this simplifies scheduling and maximizes static hardware utilization, it wastes compute resources because only a small subset of experts is selected per token. On client-class devices, this approach significantly increases latency and power consumption.
  • Prefill behavior: During large prefill workloads, token diversity often results in a broad distribution of expert selections across the batch. In these scenarios, executing many experts in parallel can be efficient because most experts are actively used.
  • Decode behavior: During token-by-token generation (decode), only a small subset of experts is activated per token. Executing all experts in this phase is highly inefficient and directly hurts steady-state token generation throughput.
  • Top-K routing + expert batching: To address this, we implement a hybrid QMoE execution strategy optimized for Ryzen™ AI. Top-K routing is performed on the CPU, where the gating network selects the most relevant experts for each token. Tokens assigned to the same expert are then grouped together, enabling batched matrix multiplications on the NPU. By executing only the selected experts, we eliminate unnecessary computation while maintaining high hardware utilization and preserving the full capacity of the MoE architecture.
  • Hybrid CPU + NPU Orchestration: In this implementation, the CPU manages routing, scheduling, and token grouping, while the NPU accelerates the quantized expert linear layers. Aggregation and residual connections are coordinated through ONNX Runtime with minimal dispatch overhead. This separation of control-intensive and compute-intensive operations ensures efficient use of the heterogeneous Ryzen™ AI platform.
  • Impact on Throughput and Efficiency: The QMoE offload strategy improves both throughput and power efficiency on Ryzen™ AI systems by eliminating redundant expert execution, increasing effective NPU utilization, and reducing decode latency. This design enables GPT-OSS-20B to run efficiently on both Strix and Halo platforms while retaining the architectural advantages of large-scale MoE models.
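The CPU-side routing step above can be sketched in numpy: select the Top-K experts per token, then group tokens by expert so each selected expert runs a single batched matmul. Shapes and names are illustrative, not the shipped implementation:

```python
import numpy as np

def route_and_batch(hidden, gate_logits, k=4):
    """Top-K routing with expert batching (illustrative sketch).

    hidden      : (tokens, dim) activations
    gate_logits : (tokens, num_experts) gating scores
    Returns {expert_id: (token_indices, token_batch, routing_weights)}.
    """
    # Per-token top-k expert indices and softmax-normalized routing weights
    topk = np.argsort(gate_logits, axis=-1)[:, -k:]
    scores = np.take_along_axis(gate_logits, topk, axis=-1)
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)

    batches = {}
    for e in np.unique(topk):
        tok, slot = np.nonzero(topk == e)          # tokens routed to expert e
        batches[e] = (tok, hidden[tok], weights[tok, slot])
    return batches
```

Each dictionary entry corresponds to one batched expert matmul on the NPU; the routing weights are later used to scale and aggregate the expert outputs.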

Memory Allocation Strategy 

Even with INT4 quantization, GPT-OSS-20B has a large memory footprint due to its 20B parameters and QMoE layers. To run the model on a variety of memory-constrained setups, a dynamic memory allocation scheme is used.

  • Dynamic expert loading/unloading: Only the experts needed for the current token batch are loaded into memory, while inactive experts are offloaded. This reduces peak memory usage and enables deployment on lower memory hardware.
  • Configurable tradeoff: Loading and unloading experts introduces a measurable performance cost. The number of layers using this dynamic scheme can be configured to balance memory against performance.
  • Lower memory footprint: This approach allows GPT-OSS-20B to operate on NPUs and other devices with limited memory while preserving the model’s full functionality.
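The load/unload idea can be sketched as a small LRU cache of expert weights. This is a hypothetical illustration of the policy, not the Ryzen AI runtime's actual eviction logic:

```python
from collections import OrderedDict

class ExpertCache:
    """Illustrative LRU cache: keep at most `capacity` experts resident."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # e.g. reads quantized expert weights from disk
        self.resident = OrderedDict()   # expert_id -> weights, in LRU order

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # cache hit: mark recently used
        else:
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)  # unload least-recently-used expert
            self.resident[expert_id] = self.load_fn(expert_id)
        return self.resident[expert_id]
```

Because decode activates only a few experts per token, a small resident set catches most accesses; cache misses are the source of the TTFT/TPS impact shown below.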

This dynamic strategy ensures flexibility across diverse hardware environments, making local deployment feasible without sacrificing essential performance. Users can fine-tune the strategy by experimenting with the following settings:

    "hybrid_opt_qmoe_dynamic_experts": "2",    # by default this is 0
    "hybrid_opt_qmoe_num_dynamic_layers": "0"  # number of QMoE layers to enable this behavior for

For a prompt size of 128 tokens, peak memory can be scaled down by increasing the number of layers that dynamically load/unload experts, as shown in the table below.

| num_dynamic_layers | TTFT (s) | TPS  | Peak Mem (GB) |
|--------------------|----------|------|---------------|
| 0                  | 0.47     | 13.4 | 13.7          |
| 2                  | 0.72     | 9.8  | 13.0          |
| 4                  | 0.92     | 7.7  | 12.1          |
| 8                  | 1.33     | 5.3  | 10.2          |
| 16                 | 2.00     | 3.2  | 6.5           |
| 20                 | 2.35     | 2.7  | 4.6           |
| 24                 | 2.75     | 2.4  | 2.7           |

* There are 24 QMoE layers in GPT-OSS-20B.
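Since fewer dynamic layers means better TTFT and TPS, a reasonable policy is to pick the smallest number of dynamic layers whose peak memory fits the available budget. The helper below is a hypothetical illustration built on the measurements above (128-token prompt):

```python
# Peak-memory measurements from the table above (prompt size 128 tokens)
PEAK_MEM_GB = {0: 13.7, 2: 13.0, 4: 12.1, 8: 10.2, 16: 6.5, 20: 4.6, 24: 2.7}

def pick_num_dynamic_layers(budget_gb):
    """Return the smallest dynamic-layer count that fits the memory budget.

    Fewer dynamic layers keeps TTFT/TPS higher, so prefer the smallest
    count that still fits. Returns None if even 24 layers does not fit.
    """
    for n in sorted(PEAK_MEM_GB):
        if PEAK_MEM_GB[n] <= budget_gb:
            return n
    return None
```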

GQA Acceleration and Long-Context Support

Efficient Grouped Query Attention (GQA) acceleration is critical for scaling long-context workloads.

  • Large context support: The goal is to enable the model’s complete long-context capability, ensuring strong performance for document-level reasoning, codebases, and multi-turn conversations. This is achieved by tiling the attention compute as done in flash-attention implementations.
  • Prefill chunking with OGA: Using OGA’s prefill chunking support, long prompts can be processed in smaller segments rather than as a single large batch. This reduces peak memory pressure and improves stability on constrained NPU systems while still building the full KV cache for long-context inference.
  • KV cache in bfloat16: Supporting bfloat16 inputs in the attention kernel halves the KV cache size compared to float32.

By combining optimized GQA kernels with prefill chunking, the system can maintain high throughput while scaling extended context lengths on local hardware.
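The FlashAttention-style tiling with online SoftMax can be sketched for a single query vector as follows. This numpy version is illustrative only; the NPU kernel operates on BF16 tiles, but the running-max/running-denominator recurrence is the same:

```python
import numpy as np

def attention_online_softmax(q, K, V, tile=128):
    """Single-query attention over a KV cache, processed tile by tile.

    Keeps a running max `m` and running denominator `l` so the full
    score row is never materialized (the online-SoftMax trick).
    """
    d = q.shape[-1]
    m = -np.inf                                  # running max of scores
    l = 0.0                                      # running softmax denominator
    acc = np.zeros_like(V[0], dtype=np.float64)  # running weighted sum of values
    for s in range(0, K.shape[0], tile):
        k, v = K[s:s + tile], V[s:s + tile]
        scores = (k @ q) / np.sqrt(d)            # (tile,)
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)                # rescale previous partial sums
        p = np.exp(scores - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / l
```

Because each tile only updates three running quantities, memory stays constant in sequence length, which is what makes long-context GQA practical on the NPU.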

Results on Strix and Halo Platforms

We evaluated GPT-OSS-20B on Ryzen™ AI Strix and Halo systems using Ryzen AI software release 1.7 with INT4 weights and BF16 activations. The results demonstrate consistent throughput across prompt lengths while maintaining controlled memory growth for extended context sizes. Performance results are shown below.

| Prompt length (tokens) | TTFT (sec) | TPS (tok/sec) | Memory (GB) |
|------------------------|------------|---------------|-------------|
| 128                    | 0.54       | 12.6          | 13.6        |
| 256                    | 0.65       | 12.5          | 13.6        |
| 512                    | 0.92       | 12.3          | 13.7        |
| 1024                   | 1.61       | 11.8          | 13.8        |
| 2048                   | 3.23       | 10.9          | 14.0        |

* TTFT – time to first token, TPS – tokens per second

Running GPT-OSS-20B with Ryzen AI 1.7

Using ONNX Runtime GenAI [4], GPT-OSS-20B can be executed locally on Ryzen™ AI platforms.

    cd onnxruntime-genai/examples/python
    python model-chat.py -m gpt-oss-int4-rtn-block-32-acc-level-4-npu
Example: Instruction Following
		Prompt (Use quit() to exit): Can you explain like I'm 5 why is the sky blue? Try to be succint please. 
 
Output: <|channel|>analysis<|message|>We need to answer simply: explain the sky appears blue because light scattering (Rayleigh scattering) by air molecules; shorter wavelengths scatter more strongly; thus blue. Simple explanation for a child (5-year-old) and succinct. 
We need to keep explanations simple. Keep in simple terms. Example: "Sunlight is made of many colors like rainbow. Blue light is a tiny bit, very small and wiggles around. Air has lots of tiny bits; blue wiggling light get bounced all around, shining in all directions so the sky looks blue." That's basically. 
 
But we should maintain friendly tone. 
 
Ok produce final.<|end|><|start|>assistant<|channel|>final<|message|>The sky looks blue because of how the sunlight works and how the air is made. 
 
1. **The Sun is a Light‑bulb.** 
  All the colors of a rainbow (red, orange, yellow, green, blue, indigo, violet) are mixed together in the sunlight. 
 
2. **The air is full of really small “bouncy‑bits” (air molecules).** 
 
3. **Blue light is super‑small.** 
  Because it’s so small it gets *scattered*—bounced around—by every tiny bit of air. 
 
4. **More of the tiny blue light bounces all around than any other color.** 
  So the sky is filled with that blue light, which is why we see a blue sky when we look up. 
	
Example: Multi-turn Conversation
		Prompt (Use quit() to exit): Alice has no apples and Bob has 5 apples. Bob gives 2 apples to Alice, how many apples does Bob have now? 
 
Output: <|channel|>analysis<|message|>We need to solve: Alice has no apples = 0. Bob has 5 apples. Bob gives 2 apples to Alice: Bob gives 2 apples, so Bob loses 2 apples. Bob now has 5-2=3 apples. So Bob now has 3 apples. Simple. 
 
We answer: Bob has 3 apples now.<|end|><|start|>assistant<|channel|>final<|message|>Bob now has **3** apples. 
 
Prompt (Use quit() to exit): How many does Alice have? 
 
Output: <|channel|>analysis<|message|>Now user asks: "How many does Alice have?" The question: "How many does Alice have?" likely meaning "How many does Alice have?" referring to apples. They just gave Bob gives 2 apples to Alice, Alice initially had none. So Alice had 0 originally. Then she receives 2 apples from Bob. So she had 0 + 2 = 2 apples. 
 
Thus answer: Alice has 2 apples. 
 
We respond accordingly.<|end|><|start|>assistant<|channel|>final<|message|>Alice now has **2** apples. 
	
Example: Code Generation
		Prompt (Use quit() to exit): Please give me simple and efficient c code to count the number of bits set in a 32-bit integer. Dont use any libraries or intrinsics.

Output: <|channel|>analysis<|message|>The user wants simple and efficient C code to count the number of bits set in a 32-bit integer (i.e., population count, popcount). They specifically say: "Dont use any libraries or intrinsics," meaning no built-in functions __builtin_ctz, __builtin_popcount, etc. So we need to provide simple count 1 bits (Hamming weight) for 32-bit integer. We should not use any includes from libraries like "stdio.h" for demonstration? They didn't say no includes, just no library functions or intrinsics. So can just use standard loops and bitwise operations.

We need a simple efficient algorithm. Use Brian Kernighan's trick to drop lowest set bit each loop: while(n) { count++; n &= (n-1);}. It's efficient and widely accepted as efficient. Provide explanation in final.

Make sure to comply with policy: not disallowed. The user is basically requesting a small piece of code; no disallowed content. So we can comply.

We should produce final response containing the efficient simple code in C; maybe we provide two versions: simple but not super efficient or using loop with bitshifts. But they asked for simple and efficient code. So we will provide code snippet:

unsigned count_bits32(unsigned int x){
    unsigned count = 0;
    while(x){
        count++;
        x &= x-1;
    }
    return count;
}

We can also mention compile time constant O(number of set bits). This is efficient typical.

Make sure to mention not using include "stdio.h" - just the function; but includes not used. It's generic.

Return final.<|end|><|start|>assistant<|channel|>final<|message|>Here’s a minimal, self‑contained routine that counts the set bits in a 32‑bit unsigned integer, using only standard C syntax and the well‑known “Brian Kernighan” trick (clearing the lowest‑order‑set‑bit in each iteration).

```c
/* Popcount – count the number of 1‑bits in a 32‑bit unsigned word. */
unsigned int popcount32(unsigned int x)
{
    unsigned int c = 0;

    /* Repeatedly chop off the lowest‑order 1 bit and increment the counter. */
    while (x)
    {
        c++;           // one more “set” bit has been found
        x &= x - 1;    // drop the lowest‑order 1 bit
    }
    return c;
}
```

**Why it is efficient**

* Only a single 32‑bit integer is used (`x`), a running counter (`c`) and one boolean loop.
* Every iteration removes one 1‑bit from `x`.
  Consequently the number of loop iterations equals the number of 1‑bits in the value – the fastest achievable complexity for a pure‑C loop without exploiting special CPU instructions.

**Usage**

```c
#include <stdio.h>

int main(void)
{
    unsigned int v = 0b10110101u;   // example value
    printf("Ones : %u\n", popcount32(v));
    return 0;
}
```

That satisfies the specification – no external libraries, no intrinsics, just plain, portable C.
	

Getting Started

  1. Install Ryzen AI Software https://ryzenai.docs.amd.com/en/latest/inst.html
  2. Download the RyzenAI-optimized gpt-oss 20b model https://huggingface.co/amd/gpt-oss-20b-onnx-ryzenai-npu
  3. Follow the steps in https://github.com/amd/RyzenAI-SW/tree/main/LLM-examples/oga_inference
  4. Activate the conda environment created by the MSI installer.
  5. Use model-chat.py to run gpt-oss with a chat template for best output quality.
Acknowledgments 

Thank you to all the contributors who made this work possible: Namal Rajatheva, Leo Liu, Zhonglin Nian, Hengyan Liu, Chuanliang Xie, Changhao Li, Jicheng Chen, Kun Cao, Zhanxing Pu, Sen Wang, Zhenhong Guo, Tianping Li, Booker Yang, Uday Das, Pooja Ganesh, Rajeev Patwari, Nithin Kumar Guggilla, Aaron Ng, Bader Alam, Ashish Sirasao

References

[1]  GPT-OSS: https://openai.com/index/introducing-gpt-oss/

[2] Quantized ONNX model: https://huggingface.co/onnxruntime/gpt-oss-20b-onnx/tree/main/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4

[3] Quantized Mixture of Experts operator: https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md

[4] ONNX Runtime GenAI: https://github.com/microsoft/onnxruntime-genai

These results demonstrate that large-scale MoE LLMs can execute efficiently on client-class hardware, establishing AMD Ryzen™ AI platforms as a viable foundation for advanced on-device generative AI workloads.
