AI Inference Acceleration on Ryzen AI NPU with AMD Quark
Dec 19, 2025
AI inference on client devices is rapidly evolving, and AMD Ryzen™ AI PCs sit at the center of this shift. Powered by the AMD XDNA™-based Neural Processing Unit (NPU), Ryzen AI systems deliver high-throughput, low-latency, and energy-efficient acceleration for a wide range of deep learning workloads. With the AMD Ryzen™ AI Software stack, developers can seamlessly execute ONNX models on the NPU or integrated GPU—unlocking significant performance gains while reducing power consumption compared to CPU- or GPU-only pipelines.
While ONNX Runtime provides a popular and flexible inference framework for ONNX models, its built-in quantization support can be limited for developers who need a broader set of data types, fine-grained controls, or state-of-the-art algorithms. This is where AMD Quark excels. Quark enhances the ONNX-to-ONNX quantization workflow, enabling developers to convert full-precision models into low-bit versions optimized for Ryzen AI and deploy them efficiently.
In this blog, we walk through how the ONNX-to-ONNX quantization flow in AMD Quark works, why it matters, and how to use it to accelerate real-world models—using the YOLO family as a practical example. Finally, we show how to deploy the resulting quantized models on a Ryzen AI NPU for dramatic inference speedups.
Why Quantize with AMD Quark?
Quantization Benefits:
Faster inference by reducing computation and improving memory bandwidth usage.
Lower power consumption, which is extremely valuable in production deployments.
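As a rough illustration of the memory savings: an INT8 weight occupies 1 byte versus 4 bytes for an FP32 weight, so quantizing weights alone cuts model size and weight memory traffic by roughly 4x (a 100 MB FP32 model shrinks to about 25 MB), and Int16 activations likewise halve activation traffic relative to FP32.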
AMD Quark is a comprehensive cross-platform deep learning toolkit designed to simplify and enhance the quantization of deep learning models. Supporting both PyTorch and ONNX models, AMD Quark empowers developers to optimize their models for deployment on a wide range of hardware backends, achieving significant performance gains without compromising accuracy.
Why the ONNX-to-ONNX flow?
ONNX-to-ONNX flow refers to a quantization pipeline where both the input and output of the workflow are ONNX models, without involving the original training framework (PyTorch, TensorFlow, etc.).
The benefits of the ONNX flow include:
Hardware and platform agnostic: Many customers produce models in ONNX format, mainly because ONNX is framework-agnostic, easier to integrate into existing toolchains, and allows customers to share only the exported computation graph without exposing full training code, proprietary architectures, or model-specific intellectual property. ONNX also avoids dependencies on the customer's original training framework, such as PyTorch or TensorFlow, making it a safer and more convenient format for third-party workflows.
Performant engine: ONNX Runtime is a high-performance inference engine designed to run machine learning models in the ONNX format. It supports models trained in various frameworks like PyTorch and TensorFlow, enabling seamless deployment across platforms. ONNX Runtime offers cross-platform compatibility, hardware acceleration, and optimized performance, making it ideal for both cloud and edge inference scenarios.
Quark's ONNX-to-ONNX quantization flow has been successfully applied to hundreds of models, demonstrating robustness and scalability. Notably, for key customers such as Microsoft, the flow's integration with Olive enables an end-to-end workflow—from quantization to deployment—ensuring seamless integration and efficient model delivery. As shown in the tables below, the ONNX-to-ONNX flow delivers excellent results on several widely used models as well as key customer-specific models.
Table 1 — Benchmark Results of Top‑1 Accuracy or mean Average Precision (mAP)
| Model Name / Metric | FP32 (Without Quantization) | INT8 | A16W8 | BF16 |
|---|---|---|---|---|
| ResNet50 / Top-1 | 0.813 | 0.809 | 0.810 | 0.812 |
| | 0.729 | 0.727 | 0.727 | 0.729 |
| | 0.756 | 0.752 | 0.756 | 0.756 |
| Yolonas_s / mAP | 0.60 | 0.51 | 0.58 | 0.60 |
Table 2 — Results of Customer Models on AMD Ryzen AI NPU
| Model | Target Data Type | Target PSNR/SNR | Quantized PSNR/SNR |
|---|---|---|---|
| Detection Model | INT8 | 30.763 dB | 37.190 dB |
| Segmentation Model | INT8 | 52.228 dB | 52.682 dB |
| Super-Resolution Model | A16W8 | 47.21 dB | 47.30 dB |
| Audio Model | BF16 | 25.0 dB | 41.1 dB |
Quark runs on both Linux and Windows operating systems. As illustrated in the table below, users can apply a wide range of data types, including XINT8 (Int8 activations, Int8 weights, and power-of-two scales), A8W8 (Int8 activations, Int8 weights, and Float32 scales), A16W8 (Int16 activations, Int8 weights, and Float32 scales), BFloat16, and BFP16. The Quark ONNX flow also offers different quantization strategies, schemes, and algorithms that allow users to customize quantization and achieve better results.
Table 3 — Quark ONNX Features
| Feature Name | Quark ONNX |
|---|---|
| Data Type | Float16, Bfloat16, Int4/Uint4, Int8/Uint8, Int16/Uint16, Int32/Uint32, BFP16, MX4/MX6/MX9, MXFP6_E2M3 / MXFP6_E3M2 / MXFP4 / MXINT8 |
| Operating Systems | Linux (ROCm/CUDA/CPU); Windows (ROCm/CUDA/CPU) |
| Quant Strategy | Static quant / Weight only / Dynamic quant |
| Quant Scheme | Per tensor / Per channel |
| Symmetry | Symmetric / Asymmetric |
| Calibration Method | Power‑of‑Two Scale (MinMax / MinMSE); Float Scale (MinMax / Percentile / Layerwise Percentile / Entropy) |
| Scale Type | Float32 / Float16 |
| Supported Ops | Almost all ONNX Ops |
| Pre‑Quant Optimization | QuaRot / SmoothQuant / CLE |
| Quantization Algorithm | AdaQuant / AdaRound / GPTQ / Bias Correction |
In practical ONNX flow use cases, vision models such as classification and object detection models are often the primary focus. Algorithms like CLE, AdaRound, and AdaQuant can significantly improve the quantization accuracy of vision models, helping to meet customer targets. In today's era of large language models (LLMs), we also support LLM quantization algorithms such as SmoothQuant, GPTQ, and QuaRot, achieving promising quantization accuracy on models like the Llama series.
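For example, with the configuration API used in the walkthrough below, switching to a different data type is typically just a matter of requesting a different default configuration. The exact configuration names accepted by get_default_config depend on your Quark version, so treat the strings here, taken from the tables above, as illustrative assumptions rather than a definitive list:

from quark.onnx import QConfig

# Int16 activations, Int8 weights, float scales (name assumed from the tables above)
a16w8_config = QConfig.get_default_config("A16W8")
# Int8 activations and weights with power-of-two scales (name assumed from the text above)
xint8_config = QConfig.get_default_config("XINT8")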
How to perform ONNX-to-ONNX quantization?
We will take the YOLO models as an example. They are convolutional neural network (CNN) based models commonly used for object detection tasks. You can refer to Quark's official documentation for the detailed code.
1. Prepare the original float model and calibration data
Here we assume you already have an ONNX model and a small dataset with several images.
2. Implement the calibration data reader
Calibration is essentially the process of collecting activation values through model forward passes in order to determine the quantization parameters. Therefore, we need a calibration data reader.
from onnxruntime.quantization import CalibrationDataReader

class ImageDataReader(CalibrationDataReader):
    def __init__(self, calibration_image_folder: str, input_name: str):
        ...
    def _preprocess_images(self, image_folder: str):
        ...
    def get_next(self):
        ...
    def rewind(self):
        ...
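A minimal sketch of such a reader is shown below, assuming RGB .jpg calibration images, a 256x256 model input, and simple [0, 1] scaling (it requires NumPy and Pillow; adapt the file pattern and preprocessing to match your model's training pipeline):

import glob
import os

import numpy as np
from PIL import Image
from onnxruntime.quantization import CalibrationDataReader

class ImageDataReader(CalibrationDataReader):
    def __init__(self, calibration_image_folder: str, input_name: str):
        self.input_name = input_name
        self.data = self._preprocess_images(calibration_image_folder)
        self.iterator = iter(self.data)

    def _preprocess_images(self, image_folder: str):
        # Resize each image to the model input size, scale to [0, 1], and
        # convert to NCHW float32 with a batch dimension of 1.
        batches = []
        for path in sorted(glob.glob(os.path.join(image_folder, "*.jpg"))):
            img = Image.open(path).convert("RGB").resize((256, 256))
            arr = np.asarray(img, dtype=np.float32).transpose(2, 0, 1) / 255.0
            batches.append({self.input_name: np.expand_dims(arr, axis=0)})
        return batches

    def get_next(self):
        # Return the next calibration batch, or None when exhausted.
        return next(self.iterator, None)

    def rewind(self):
        # Restart iteration so the reader can be reused for another pass.
        self.iterator = iter(self.data)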
3. Set the quantization configuration
Set the A8W8 quantization config, which means Int8 activations and Int8 weights, quantized with a per-tensor scheme and float scales.
from quark.onnx import QConfig
quantization_config = QConfig.get_default_config("A8W8")
4. Quantize the model
Perform the quantization. Typically, quantizing a model of a few dozen megabytes with a few dozen calibration images only takes a few minutes. However, using more advanced algorithms such as AdaRound or AdaQuant can take several hours.
from quark.onnx import ModelQuantizer

# Build the calibration data reader over the calibration images
calib_data_reader = ImageDataReader(calib_data_path, model_input_name)
# Create the quantizer with the A8W8 configuration and write out the quantized model
quantizer = ModelQuantizer(quantization_config)
quantizer.quantize_model(input_model_path, quantized_model_path, calib_data_reader)
5. Results
Table 4 — Accuracy Results of Quantized Yolo Models
| Model | FP32 mAP50 | A8W8 mAP50 |
|---|---|---|
| Yolo face v8m | 0.99 | 0.99 |
| Yolo face v8n | 0.97 | 0.967 |
| Yolo face v8n-v2 | 0.97 | 0.969 |
| Yolo face v8s | 0.989 | 0.988 |
| Yolo face v9c | 0.99 | 0.99 |
| Yolo hand v8n | 0.983 | 0.978 |
| Yolo hand v9c | 0.989 | 0.988 |
| Yolo person v8m | 0.798 | 0.781 |
| Yolo person v8n | 0.739 | 0.711 |
| Yolo person v8s | 0.777 | 0.767 |
As shown in the table above, A8W8 quantization incurs almost no accuracy loss for most YOLO models. mAP stands for mean Average Precision and is the most commonly used metric in object detection; a higher value indicates better performance.
How to deploy the quantized models?
Quantized models produced by Quark are compatible with multiple AMD deployment flows. Below, we continue to use the quantized YOLO model as an example.
1. Setting up the Ryzen AI environment
You can refer to this documentation for the hardware and OS requirements, and follow the steps below to create the Ryzen AI environment on an AMD device with NPU hardware.
A) Install the NPU driver
Download the release driver package from here.
Unzip the package and double-click npu_sw_installer.exe. Verify that the NPU MCDM driver is correctly installed by opening Task Manager -> Performance -> NPU0; you should see NPU 0 in the list.
B) Install Conda
Download and install Conda (conda-forge) from here, then create a Conda environment named ryzen-ai-1.5.0.
C) Install Ryzen AI software package
Download the package from here, double-click to install it, and select the Conda environment ryzen-ai-1.5.0.
2. Run the YOLO models on Ryzen AI
Activate the Conda environment ryzen-ai-1.5.0.
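For example, from a command prompt:
conda activate ryzen-ai-1.5.0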
Create a test script test.py like the one below.
import argparse

import numpy as np
import onnxruntime as ort

parser = argparse.ArgumentParser(description='Run inference on an ONNX model on the Ryzen AI NPU')
parser.add_argument('--model', type=str, required=True, help='path to the quantized ONNX model')
parser.add_argument('--xclbin', type=str, required=True, help='path to the xclbin overlay file')
args = parser.parse_args()

# Disable spinning so idle threads do not burn CPU cycles while the NPU runs
so = ort.SessionOptions()
so.add_session_config_entry("session.intra_op.allow_spinning", "0")

# Use the Vitis AI Execution Provider to offload the model to the NPU
ep = ["VitisAIExecutionProvider"]
vai_po = [{
    'xclbin': args.xclbin,
    'log_level': 'info'
}]

onnx_session = ort.InferenceSession(args.model,
                                    providers=ep,
                                    provider_options=vai_po,
                                    sess_options=so)

# Run a single inference with random input data matching the model's input shape
onnx_inputs = onnx_session.get_inputs()
input_name = onnx_inputs[0].name
onnx_session.run(None, {input_name: np.random.rand(1, 3, 256, 256).astype(np.float32)})
To test the YOLO models, execute the following command:
python test.py --model yolo_quantized.onnx --xclbin AMD_AIE2P_8x4x1_Overlay.xclbin
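To reproduce latency numbers like those in the table below, you could extend the test script with a simple timing loop. This is a rough sketch; the warm-up count, run count, and 1x3x256x256 input shape are assumptions to adjust per model, and results will vary with system load:

import time

# Average latency over repeated runs after a short warm-up
dummy = np.random.rand(1, 3, 256, 256).astype(np.float32)
for _ in range(5):  # warm-up so first-run compilation/caching does not skew timing
    onnx_session.run(None, {input_name: dummy})

runs = 100
start = time.perf_counter()
for _ in range(runs):
    onnx_session.run(None, {input_name: dummy})
elapsed = time.perf_counter() - start

print(f"Average latency: {elapsed / runs * 1000:.2f} ms, FPS: {runs / elapsed:.1f}")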
3. Results
Table 5 — Latency of Yolo Models
| Model | CPU EP Latency (ms) | CPU FPS | NPU EP Latency (ms) | NPU FPS |
|---|---|---|---|---|
| Yolo face v8m | 253 | 21 | 15.6947 | 63.7043 |
| Yolo face v8n | 47.69 | 21 | 6.963 | 143.573 |
| Yolo face v8n-v2 | 58 | 17 | 6.9353 | 144.146 |
| Yolo face v8s | 126.6 | 7.9 | 8.5593 | 116.804 |
| Yolo face v9c | 405 | 2.46 | 21.691 | 46.0964 |
| Yolo hand v8n | 58.5 | 17 | 7.2462 | 137.959 |
| Yolo hand v9c | 407 | 2.45 | 21.8217 | 45.8192 |
| Yolo person v8m | 333 | 3 | 18.3264 | 54.5566 |
| Yolo person v8n | 76.3 | 13.1 | 8.7464 | 114.303 |
| Yolo person v8s | 167.8 | 5.96 | 11.4092 | 87.6278 |
As shown in the table above, CPU EP denotes the baseline running on the ONNX Runtime CPU Execution Provider, while NPU EP denotes the Vitis AI Execution Provider running on the Ryzen AI NPU. Ryzen AI achieves substantial acceleration across all of the YOLO models.
Summary
Quantizing models with AMD Quark and deploying them on Ryzen AI hardware offers a powerful pathway to high-performance, energy-efficient edge AI. As demonstrated using several YOLO object-detection models, Quark’s ONNX-to-ONNX workflow delivers near-lossless accuracy while achieving more than 7x faster inference on the NPU compared to CPU execution. With broad data type support, advanced quantization algorithms, and seamless integration with ONNX Runtime, Quark enables developers to unlock the full potential of Ryzen AI systems—without requiring access to original training code or frameworks.
As AI workloads continue to grow in complexity and scale, robust tooling and efficient hardware acceleration will become increasingly essential. AMD Quark and Ryzen AI provide a streamlined, production-ready solution for bringing optimized, real-time AI inference to client devices.
Acknowledgement
We would like to express our thanks to our colleagues from the AMD Quark Team and the AI Software Team for their insightful feedback and technical assistance.