Accelerating GPT-OSS-20B on AMD Ryzen™ AI NPUs: Efficient MoE Inference on Strix and Halo

Mar 16, 2026

Introduction: Enabling 20B LLMs on AMD Ryzen AI

GPT-OSS-20B [1] is a 20B-parameter open-weight language model built for strong instruction following, coding, and general reasoning while remaining practical for local deployment. It uses a Mixture-of-Experts (MoE) architecture, activating only a subset of experts per token to increase effective model capacity without proportionally increasing compute cost. 

GPT-OSS-20B combines global and local attention mechanisms to balance long-context reasoning with computational efficiency. Local attention layers reduce memory bandwidth and latency, while global attention preserves cross-sequence understanding. The original model is trained using the MXFP4 numeric format. For deployment on AMD Ryzen™ AI platforms, we use an INT4-quantized ONNX version of the model. INT4 is natively supported by the Ryzen™ AI NPU, enabling higher throughput and improved power efficiency with minimal quality degradation.

In this blog, we present a case study of deploying GPT-OSS-20B on AMD Ryzen™ AI processors. This includes developing custom ONNX operators to map model parameters to accelerator-specific buffers, executing compute on the AMD NPU, orchestrating MoE expert routing during decode, applying operator fusion to reduce dispatch overhead, and supporting custom attention schemes within the NPU. We also deploy Grouped Query Attention (GQA) on the NPU using a FlashAttention-style implementation with online SoftMax to enable long-context inference. To support execution across a range of Ryzen™ AI systems, we implement a flexible memory allocation strategy that optimizes both peak memory usage and performance.

Quantization Strategy for GPT-OSS-20B

For GPT-OSS-20B, we use a Microsoft-provided quantized ONNX model [2] where all linear layers, the LM head, and the embedding table are quantized to low precision. This significantly reduces model size and memory bandwidth requirements. MMLU results for this quantized ONNX model are as follows:

| Astronomy | Philosophy | Management | Average |
|-----------|------------|------------|---------|
| 33.55     | 27.01      | 33.98      | 30.04   |

On the NPU, activations run in bfloat16 (BF16) to preserve numerical stability while maintaining high performance. This combination of low-bit weights and BF16 activations delivers strong throughput and power efficiency with minimal impact on model quality.
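To make the weight format concrete, here is a minimal numpy sketch of blockwise round-to-nearest (RTN) INT4 quantization and dequantization, assuming a block size of 32 as the `int4-rtn-block-32` model naming suggests. The function names and the per-block symmetric scaling are illustrative only, not the actual Ryzen AI kernels:

```python
import numpy as np

def quantize_int4_block(w, block_size=32):
    """Round-to-nearest INT4 quantization with one scale per block (sketch)."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1) / 7.0        # map max magnitude to +/-7
    scales = np.where(scales == 0, 1.0, scales)      # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(blocks / scales[:, None]), -8, 7).astype(np.int8)
    return q.reshape(-1), scales

def dequantize_int4_block(q, scales, block_size=32):
    """Recover float weights: each 4-bit value times its block's scale."""
    out = q.astype(np.float32).reshape(-1, block_size) * scales.reshape(-1, 1)
    return out.reshape(-1)
```

The per-block reconstruction error is bounded by half the block's scale, which is why RTN with small blocks keeps quality loss low.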

Operator Analysis and NPU Offload Strategy

For GPT-OSS-20B, the prefill phase is dominated by matrix multiplication operations in the Quantized-Mixture-of-Experts (QMoE) [3] layers. As context length grows, the bottleneck gradually shifts to attention, specifically Grouped Query Attention (GQA), since KV-cache operations scale with total sequence length.

During token generation, matrix multiplication cost per token remains fixed and determines the performance ceiling at small context sizes. Attention cost, however, grows with context length, making efficient attention kernels critical for long-context workloads.

QMoE Offload: Accelerating Mixture-of-Experts on Ryzen™ AI

Quantized Mixture-of-Experts (QMoE) layers account for a significant portion of the compute cost in GPT-OSS-20B during both prefill and decode phases. Efficient execution of these layers is critical to achieving high throughput and low latency on client hardware.

  • Traditional hardware-friendly approach: A common accelerator-friendly strategy for MoE models is to execute all experts in parallel and mask out unused experts during the aggregation stage. While this simplifies scheduling and maximizes static hardware utilization, it wastes compute resources because only a small subset of experts is selected per token. On client-class devices, this approach significantly increases latency and power consumption.
  • Prefill behavior: During large prefill workloads, token diversity often results in a broad distribution of expert selections across the batch. In these scenarios, executing many experts in parallel can be efficient because most experts are actively used.
  • Decode behavior: During token-by-token generation (decode), only a small subset of experts is activated per token. Executing all experts in this phase is highly inefficient and directly hurts steady-state token generation throughput.
  • Top-K routing + expert batching: To address this, we implement a hybrid QMoE execution strategy optimized for Ryzen™ AI. Top-K routing is performed on the CPU, where the gating network selects the most relevant experts for each token. Tokens assigned to the same expert are then grouped together, enabling batched matrix multiplications on the NPU. By executing only the selected experts, we eliminate unnecessary computation while maintaining high hardware utilization and preserving the full capacity of the MoE architecture.
  • Hybrid CPU + NPU Orchestration: In this implementation, the CPU manages routing, scheduling, and token grouping, while the NPU accelerates the quantized expert linear layers. Aggregation and residual connections are coordinated through ONNX Runtime with minimal dispatch overhead. This separation of control-intensive and compute-intensive operations ensures efficient use of the heterogeneous Ryzen™ AI platform.
  • Impact on Throughput and Efficiency: The QMoE offload strategy improves both throughput and power efficiency on Ryzen™ AI systems by eliminating redundant expert execution, increasing effective NPU utilization, and reducing decode latency. This design enables GPT-OSS-20B to run efficiently on both Strix and Halo platforms while retaining the architectural advantages of large-scale MoE models.
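The CPU-side routing step above can be sketched in numpy: select the Top-K experts per token, then group tokens by expert so each selected expert runs a single batched matmul. Shapes and names are illustrative, not the shipped implementation:

```python
import numpy as np

def route_and_batch(hidden, gate_logits, k=4):
    """Top-K routing with expert batching (illustrative sketch).

    hidden      : (tokens, dim) activations
    gate_logits : (tokens, num_experts) gating scores
    Returns {expert_id: (token_indices, token_batch, routing_weights)}.
    """
    # Per-token top-k expert indices and softmax-normalized routing weights
    topk = np.argsort(gate_logits, axis=-1)[:, -k:]
    scores = np.take_along_axis(gate_logits, topk, axis=-1)
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)

    batches = {}
    for e in np.unique(topk):
        tok, slot = np.nonzero(topk == e)          # tokens routed to expert e
        batches[e] = (tok, hidden[tok], weights[tok, slot])
    return batches
```

Each dictionary entry corresponds to one batched expert matmul on the NPU; the routing weights are later used to scale and aggregate the expert outputs.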

Memory Allocation Strategy 

Even with INT4 quantization, GPT-OSS-20B has a large memory footprint due to its 20B parameters and QMoE layers. To run the model on a variety of memory-constrained setups, a dynamic memory allocation scheme is used.

  • Dynamic expert loading/unloading: Only the experts needed for the current token batch are loaded into memory, while inactive experts are offloaded. This reduces peak memory usage and enables deployment on lower memory hardware.
  • Configurable tradeoff: Loading and unloading experts introduces a measurable performance cost. The number of layers using this dynamic scheme can be configured to balance memory against performance.
  • Lower memory footprint: This approach allows GPT-OSS-20B to operate on NPUs and other devices with limited memory while preserving the model’s full functionality.
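The load/unload idea can be sketched as a small LRU cache of expert weights. This is a hypothetical illustration of the policy, not the Ryzen AI runtime's actual eviction logic:

```python
from collections import OrderedDict

class ExpertCache:
    """Illustrative LRU cache: keep at most `capacity` experts resident."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # e.g. reads quantized expert weights from disk
        self.resident = OrderedDict()   # expert_id -> weights, in LRU order

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # cache hit: mark recently used
        else:
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)  # unload least-recently-used expert
            self.resident[expert_id] = self.load_fn(expert_id)
        return self.resident[expert_id]
```

Because decode activates only a few experts per token, a small resident set catches most accesses; cache misses are the source of the TTFT/TPS impact shown below.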

This dynamic strategy ensures flexibility across diverse hardware environments, making local deployment feasible without sacrificing essential performance. Users can fine-tune the strategy by experimenting with the following settings:

    "hybrid_opt_qmoe_dynamic_experts": "2",    # by default this is 0
    "hybrid_opt_qmoe_num_dynamic_layers": "0"  # number of QMoE layers to enable this behavior for

For a prompt size of 128 tokens, peak memory can be scaled down by increasing the number of layers that dynamically load/unload experts, as shown in the table below.

| num_dynamic_layers | TTFT (s) | TPS  | Peak Mem (GB) |
|--------------------|----------|------|---------------|
| 0                  | 0.47     | 13.4 | 13.7          |
| 2                  | 0.72     | 9.8  | 13.0          |
| 4                  | 0.92     | 7.7  | 12.1          |
| 8                  | 1.33     | 5.3  | 10.2          |
| 16                 | 2.00     | 3.2  | 6.5           |
| 20                 | 2.35     | 2.7  | 4.6           |
| 24                 | 2.75     | 2.4  | 2.7           |

* There are 24 QMoE layers in GPT-OSS-20B.
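Since fewer dynamic layers means better TTFT and TPS, a reasonable policy is to pick the smallest number of dynamic layers whose peak memory fits the available budget. The helper below is a hypothetical illustration built on the measurements above (128-token prompt):

```python
# Peak-memory measurements from the table above (prompt size 128 tokens)
PEAK_MEM_GB = {0: 13.7, 2: 13.0, 4: 12.1, 8: 10.2, 16: 6.5, 20: 4.6, 24: 2.7}

def pick_num_dynamic_layers(budget_gb):
    """Return the smallest dynamic-layer count that fits the memory budget.

    Fewer dynamic layers keeps TTFT/TPS higher, so prefer the smallest
    count that still fits. Returns None if even 24 layers does not fit.
    """
    for n in sorted(PEAK_MEM_GB):
        if PEAK_MEM_GB[n] <= budget_gb:
            return n
    return None
```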

GQA Acceleration and Long-Context Support

Efficient Grouped Query Attention (GQA) acceleration is critical for scaling long-context workloads.

  • Large context support: The goal is to enable the model’s complete long-context capability, ensuring strong performance for document-level reasoning, codebases, and multi-turn conversations. This is achieved by tiling the attention compute as done in flash-attention implementations.
  • Prefill chunking with OGA: Using OGA’s prefill chunking support, long prompts can be processed in smaller segments rather than as a single large batch. This reduces peak memory pressure and improves stability on constrained NPU systems while still building the full KV cache for long-context inference.
  • KV cache in bfloat16: Supporting bfloat16 inputs in the attention kernel halves the KV cache size compared to float32.

By combining optimized GQA kernels with prefill chunking, the system can maintain high throughput while scaling extended context lengths on local hardware.
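The FlashAttention-style tiling with online SoftMax can be sketched for a single query vector as follows. This numpy version is illustrative only; the NPU kernel operates on BF16 tiles, but the running-max/running-denominator recurrence is the same:

```python
import numpy as np

def attention_online_softmax(q, K, V, tile=128):
    """Single-query attention over a KV cache, processed tile by tile.

    Keeps a running max `m` and running denominator `l` so the full
    score row is never materialized (the online-SoftMax trick).
    """
    d = q.shape[-1]
    m = -np.inf                                  # running max of scores
    l = 0.0                                      # running softmax denominator
    acc = np.zeros_like(V[0], dtype=np.float64)  # running weighted sum of values
    for s in range(0, K.shape[0], tile):
        k, v = K[s:s + tile], V[s:s + tile]
        scores = (k @ q) / np.sqrt(d)            # (tile,)
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)                # rescale previous partial sums
        p = np.exp(scores - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / l
```

Because each tile only updates three running quantities, memory stays constant in sequence length, which is what makes long-context GQA practical on the NPU.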

Results on Strix and Halo Platforms

We evaluated GPT-OSS-20B on Ryzen™ AI Strix and Halo systems using Ryzen AI software release 1.7 with INT4 weights and BF16 activations. The results demonstrate consistent throughput across prompt lengths while maintaining controlled memory growth for extended context sizes. Performance results are shown below.

| Prompt length (tokens) | TTFT (sec) | TPS (tok/sec) | Memory (GB) |
|------------------------|------------|---------------|-------------|
| 128                    | 0.54       | 12.6          | 13.6        |
| 256                    | 0.65       | 12.5          | 13.6        |
| 512                    | 0.92       | 12.3          | 13.7        |
| 1024                   | 1.61       | 11.8          | 13.8        |
| 2048                   | 3.23       | 10.9          | 14.0        |

* TTFT – time to first token, TPS – tokens per second

Running GPT-OSS-20B with Ryzen AI 1.7

Using ONNX Runtime GenAI [4], GPT-OSS-20B can be executed locally on Ryzen™ AI platforms.

    cd onnxruntime-genai/examples/python
    python model-chat.py -m gpt-oss-int4-rtn-block-32-acc-level-4-npu
Example: Instruction Following
		Prompt (Use quit() to exit): Can you explain like I'm 5 why is the sky blue? Try to be succint please. 
 
Output: <|channel|>analysis<|message|>We need to answer simply: explain the sky appears blue because light scattering (Rayleigh scattering) by air molecules; shorter wavelengths scatter more strongly; thus blue. Simple explanation for a child (5-year-old) and succinct. 
We need to keep explanations simple. Keep in simple terms. Example: "Sunlight is made of many colors like rainbow. Blue light is a tiny bit, very small and wiggles around. Air has lots of tiny bits; blue wiggling light get bounced all around, shining in all directions so the sky looks blue." That's basically. 
 
But we should maintain friendly tone. 
 
Ok produce final.<|end|><|start|>assistant<|channel|>final<|message|>The sky looks blue because of how the sunlight works and how the air is made. 
 
1. **The Sun is a Light‑bulb.** 
  All the colors of a rainbow (red, orange, yellow, green, blue, indigo, violet) are mixed together in the sunlight. 
 
2. **The air is full of really small “bouncy‑bits” (air molecules).** 
 
3. **Blue light is super‑small.** 
  Because it’s so small it gets *scattered*—bounced around—by every tiny bit of air. 
 
4. **More of the tiny blue light bounces all around than any other color.** 
  So the sky is filled with that blue light, which is why we see a blue sky when we look up. 
	
Example: Multi-turn Conversation
		Prompt (Use quit() to exit): Alice has no apples and Bob has 5 apples. Bob gives 2 apples to Alice, how many apples does Bob have now? 
 
Output: <|channel|>analysis<|message|>We need to solve: Alice has no apples = 0. Bob has 5 apples. Bob gives 2 apples to Alice: Bob gives 2 apples, so Bob loses 2 apples. Bob now has 5-2=3 apples. So Bob now has 3 apples. Simple. 
 
We answer: Bob has 3 apples now.<|end|><|start|>assistant<|channel|>final<|message|>Bob now has **3** apples. 
 
Prompt (Use quit() to exit): How many does Alice have? 
 
Output: <|channel|>analysis<|message|>Now user asks: "How many does Alice have?" The question: "How many does Alice have?" likely meaning "How many does Alice have?" referring to apples. They just gave Bob gives 2 apples to Alice, Alice initially had none. So Alice had 0 originally. Then she receives 2 apples from Bob. So she had 0 + 2 = 2 apples. 
 
Thus answer: Alice has 2 apples. 
 
We respond accordingly.<|end|><|start|>assistant<|channel|>final<|message|>Alice now has **2** apples. 
	
Example: Code Generation
		Prompt (Use quit() to exit): Please give me simple and efficient c code to count the number of bits set in a 32-bit integer. Dont use any libraries or intrinsics.

Output: <|channel|>analysis<|message|>The user wants simple and efficient C code to count the number of bits set in a 32-bit integer (i.e., population count, popcount). They specifically say: "Dont use any libraries or intrinsics," meaning no built-in functions __builtin_ctz, __builtin_popcount, etc. So we need to provide simple count 1 bits (Hamming weight) for 32-bit integer. We should not use any includes from libraries like "stdio.h" for demonstration? They didn't say no includes, just no library functions or intrinsics. So can just use standard loops and bitwise operations.

We need a simple efficient algorithm. Use Brian Kernighan's trick to drop lowest set bit each loop: while(n) { count++; n &= (n-1);}. It's efficient and widely accepted as efficient. Provide explanation in final.

Make sure to comply with policy: not disallowed. The user is basically requesting a small piece of code; no disallowed content. So we can comply.

We should produce final response containing the efficient simple code in C; maybe we provide two versions: simple but not super efficient or using loop with bitshifts. But they asked for simple and efficient code. So we will provide code snippet:

unsigned count_bits32(unsigned int x){
    unsigned count = 0;
    while(x){
        count++;
        x &= x-1;
    }
    return count;
}

We can also mention compile time constant O(number of set bits). This is efficient typical.

Make sure to mention not using include "stdio.h" - just the function; but includes not used. It's generic.

Return final.<|end|><|start|>assistant<|channel|>final<|message|>Here’s a minimal, self‑contained routine that counts the set bits in a 32‑bit unsigned integer, using only standard C syntax and the well‑known “Brian Kernighan” trick (clearing the lowest‑order‑set‑bit in each iteration).

```c
/* Popcount – count the number of 1‑bits in a 32‑bit unsigned word. */
unsigned int popcount32(unsigned int x)
{
    unsigned int c = 0;

    /* Repeatedly chop off the lowest‑order 1 bit and increment the counter. */
    while (x)
    {
        c++;           // one more “set” bit has been found
        x &= x - 1;    // drop the lowest‑order 1 bit
    }
    return c;
}
```

**Why it is efficient**

* Only a single 32‑bit integer is used (`x`), a running counter (`c`) and one boolean loop.
* Every iteration removes one 1‑bit from `x`.
  Consequently the number of loop iterations equals the number of 1‑bits in the value – the fastest achievable complexity for a pure‑C loop without exploiting special CPU instructions.

**Usage**

```c
#include <stdio.h>

int main(void)
{
    unsigned int v = 0b10110101u;   // example value
    printf("Ones : %u\n", popcount32(v));
    return 0;
}
```

That satisfies the specification – no external libraries, no intrinsics, just plain, portable C.
	

Getting Started

  1. Install Ryzen AI Software https://ryzenai.docs.amd.com/en/latest/inst.html
  2. Download the RyzenAI-optimized gpt-oss 20b model https://huggingface.co/amd/gpt-oss-20b-onnx-ryzenai-npu
  3. Follow the steps in https://github.com/amd/RyzenAI-SW/tree/main/LLM-examples/oga_inference
  4. Activate the conda environment created by the MSI installer.
  5. Use model-chat.py to run gpt-oss with a chat template for best output quality.
Acknowledgments 

Thank you to all the contributors who made this work possible: Namal Rajatheva, Leo Liu, Zhonglin Nian, Hengyan Liu, Chuanliang Xie, Changhao Li, Jicheng Chen, Kun Cao, Zhanxing Pu, Sen Wang, Zhenhong Guo, Tianping Li, Booker Yang, Uday Das, Pooja Ganesh, Rajeev Patwari, Nithin Kumar Guggilla, Aaron Ng, Bader Alam, Ashish Sirasao

References

[1]  GPT-OSS: https://openai.com/index/introducing-gpt-oss/

[2] Quantized ONNX model: https://huggingface.co/onnxruntime/gpt-oss-20b-onnx/tree/main/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4

[3] Quantized Mixture of Experts operator: https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md

[4] ONNX Runtime GenAI: https://github.com/microsoft/onnxruntime-genai

These results demonstrate that large-scale MoE LLMs can execute efficiently on client-class hardware, establishing AMD Ryzen™ AI platforms as a viable foundation for advanced on-device generative AI workloads.
