AI Inference Acceleration on Ryzen AI NPU with AMD Quark

Dec 19, 2025

AI inference on client devices is rapidly evolving, and AMD Ryzen™ AI PCs sit at the center of this shift. Powered by the AMD XDNA™-based Neural Processing Unit (NPU), Ryzen AI systems deliver high-throughput, low-latency, and energy-efficient acceleration for a wide range of deep learning workloads. With the AMD Ryzen™ AI Software stack, developers can seamlessly execute ONNX models on the NPU or integrated GPU, unlocking significant performance gains while reducing power consumption compared to CPU- or GPU-only pipelines.

For ONNX models, ONNX Runtime provides a popular and flexible inference framework, but its built-in quantization support can be limited for developers who need a broader set of data types, fine-grained controls, or state-of-the-art algorithms. This is where AMD Quark excels. Quark enhances the ONNX-to-ONNX quantization workflow, enabling developers to convert full-precision models into low-bit versions optimized for Ryzen AI and deploy them efficiently.

In this blog, we walk through how the ONNX-to-ONNX quantization flow in AMD Quark works, why it matters, and how to use it to accelerate real-world models—using the YOLO family as a practical example. Finally, we show how to deploy the resulting quantized models on a Ryzen AI NPU for dramatic inference speedups.

Why Quantize with AMD Quark?

Quantization Benefits:

  • Faster inference by reducing computation and improving memory bandwidth usage.

  • Lower power consumption, which is extremely valuable in production deployments.

AMD Quark is a comprehensive cross-platform deep learning toolkit designed to simplify and enhance the quantization of deep learning models. Supporting both PyTorch and ONNX models, AMD Quark empowers developers to optimize their models for deployment on a wide range of hardware backends, achieving significant performance gains without compromising accuracy.

Why the ONNX-to-ONNX Flow?

ONNX-to-ONNX flow refers to a quantization pipeline where both the input and output of the workflow are ONNX models, without involving the original training framework (PyTorch, TensorFlow, etc.).

The benefits of ONNX flow include:

  1. Hardware and platform agnostic: Many customers produce models in ONNX format, mainly because ONNX is framework-agnostic, easier to integrate into existing toolchains, and allows customers to share only the exported computation graph without exposing full training code, proprietary architectures, or model-specific intellectual property. ONNX also avoids dependencies on the customer’s original training framework, such as PyTorch or TensorFlow, making it a safer and more convenient format for third-party workflows.

  2. Performant engine: ONNX Runtime is a high-performance inference engine designed to run machine learning models in the ONNX format. It supports models trained in various frameworks like PyTorch and TensorFlow, enabling seamless deployment across platforms. ONNX Runtime offers cross-platform compatibility, hardware acceleration, and optimized performance, making it ideal for both cloud and edge inference scenarios; a minimal usage example follows this list.
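To illustrate how little is needed to run an exported model, the snippet below is a minimal sketch that loads a hypothetical model.onnx with ONNX Runtime's default CPU execution provider; the file name and input shape are placeholders, not part of the flow described in this post.

    import numpy as np
    import onnxruntime as ort

    # Execution providers available in this ONNX Runtime build (CPU, CUDA, VitisAI, ...).
    print(ort.get_available_providers())

    # Create a session directly from the exported graph, independent of the training framework.
    session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name

    # Feed a random NCHW image tensor; 1x3x224x224 is only an assumed input shape.
    dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
    outputs = session.run(None, {input_name: dummy})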

Quark's ONNX-to-ONNX quantization flow has been successfully applied to hundreds of models, demonstrating robustness and scalability. Notably, for key customers such as Microsoft, its integration with Olive enables an end-to-end workflow from quantization to deployment, ensuring seamless integration and efficient model delivery. As shown in the tables below, the ONNX-to-ONNX flow delivers excellent results on several widely used models as well as key customer-specific models.

Table 1 — Benchmark Results of Top‑1 Accuracy or mean Average Precision (mAP)  

Model Name / Metric | FP32 (Without Quantization) | INT8 | A16W8 | BF16
Resnet50 / Top-1 | 0.813 | 0.809 | 0.810 | 0.812
 | 0.729 | 0.727 | 0.727 | 0.729
 | 0.756 | 0.752 | 0.756 | 0.756
Yolonas_s / mAP | 0.60 | 0.51 | 0.58 | 0.60

Table 2 — Results of Customer Models on AMD Ryzen AI NPU  

Model | Target Data Type | Target PSNR/SNR | Quantized Accuracy
Detection Model | INT8 | 30.763 dB | 37.190 dB
Segmentation Model | INT8 | 52.228 dB | 52.682 dB
Super-Resolution Model | A16W8 | 47.21 dB | 47.30 dB
Audio Model | BF16 | 25.0 dB | 41.1 dB

Quark can run on both Linux and Windows operating systems. As illustrated in Table 3 below, users can apply a wide range of data types, including XINT8 (Activation Int8, Weight Int8, Power-of-Two Scales), A8W8 (Activation Int8, Weight Int8, Float32 Scales), A16W8 (Activation Int16, Weight Int8, Float32 Scales), BFloat16, and BFP16. The Quark ONNX flow also offers different quantization strategies, schemes, and algorithms that allow users to customize quantization and achieve better results.

Table 3 — Quark ONNX Features  

Feature Name | Quark ONNX
Data Type | Float16, BFloat16, Int4/UInt4, Int8/UInt8, Int16/UInt16, Int32/UInt32, BFP16, MX4/MX6/MX9, MXFP6_E2M3 / MXFP6_E3M2 / MXFP4 / MXINT8
Operating Systems | Linux (ROCm/CUDA/CPU); Windows (ROCm/CUDA/CPU)
Quant Strategy | Static quant / Weight only / Dynamic quant
Quant Scheme | Per tensor / Per channel
Symmetric | Symmetric / Asymmetric
Calibration Method | Power-of-Two Scale (MinMax / MinMSE); Float Scale (MinMax / Percentile / Layerwise Percentile / Entropy)
Scale Type | Float32 / Float16
Supported Ops | Almost all ONNX ops
Pre-Quant Optimization | QuaRot / SmoothQuant / CLE
Quantization Algorithm | AdaQuant / AdaRound / GPTQ / Bias Correction
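Switching between these data types only changes the configuration name passed to Quark. The sketch below reuses the QConfig helper shown later in this post; the preset names ("XINT8", "A16W8", "BF16") are assumptions based on the data-type labels above, so check the Quark ONNX documentation for the exact identifiers in your release.

    from quark.onnx import QConfig

    # Preset names below are assumed from the data-type labels in this post;
    # verify them against the Quark ONNX documentation for your release.
    xint8_config = QConfig.get_default_config("XINT8")   # Int8 with power-of-two scales
    a16w8_config = QConfig.get_default_config("A16W8")   # Int16 activations, Int8 weights
    bf16_config = QConfig.get_default_config("BF16")     # BFloat16 weights and activations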

In practical ONNX flow use cases, vision models such as classification and object detection models are often the primary focus. Algorithms like CLE, AdaRound, and AdaQuant can significantly improve the quantization accuracy of vision models, helping to meet customer targets. In today’s era of large language models (LLMs), we also support LLM quantization algorithms such as SmoothQuant, GPTQ, and QuaRot, achieving promising quantization accuracy on models like the Llama series.

How to Perform ONNX-to-ONNX Quantization?

We will use the YOLO family of models as an example. They are Convolutional Neural Network (CNN) based models commonly used for object detection tasks. You can refer to Quark's official documentation for the detailed code.

1. Prepare the original float model and calibration data

Here we assume you already have an ONNX model and a small dataset with several images.
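If you are starting from a PyTorch checkpoint rather than an ONNX file, one convenient option (mentioned here only as an illustration, not as part of the Quark flow itself) is the Ultralytics package, which can export YOLOv8 weights to ONNX in a couple of lines; the calibration set can simply be a folder containing a few dozen representative images.

    from ultralytics import YOLO

    # Export a YOLOv8 checkpoint to ONNX; this writes yolov8n.onnx next to the weights.
    # imgsz=256 matches the input resolution assumed in the rest of this post.
    model = YOLO("yolov8n.pt")
    model.export(format="onnx", imgsz=256)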

2. Implement the calibration data reader.

The calibration process in quantization is essentially a process of collecting activation values through model forward passes to determine the quantization parameters, so we need a calibration data reader. A minimal implementation is sketched below; it assumes 256x256 RGB inputs (matching the input shape used in the deployment script later in this post) and uses Pillow for image loading, so adapt the preprocessing to your own model.

    import os
    import numpy as np
    from PIL import Image
    from onnxruntime.quantization import CalibrationDataReader

    class ImageDataReader(CalibrationDataReader):
        def __init__(self, calibration_image_folder: str, input_name: str):
            self.input_name = input_name
            self.data = self._preprocess_images(calibration_image_folder)
            self.iterator = iter(self.data)

        def _preprocess_images(self, image_folder: str):
            # Resize each image to the assumed 256x256 model input, scale to [0, 1],
            # and build NCHW float32 feed dicts keyed by the model input name.
            feeds = []
            for name in sorted(os.listdir(image_folder)):
                img = Image.open(os.path.join(image_folder, name)).convert("RGB").resize((256, 256))
                arr = np.asarray(img, dtype=np.float32).transpose(2, 0, 1) / 255.0
                feeds.append({self.input_name: arr[np.newaxis, ...]})
            return feeds

        def get_next(self):
            # Return the next feed dict, or None when the calibration data is exhausted.
            return next(self.iterator, None)

        def rewind(self):
            # Restart iteration so the reader can be reused for another calibration pass.
            self.iterator = iter(self.data)

3. Set the quantization configuration 

Set the A8W8 quantization configuration, which quantizes activations to Int8 and weights to Int8 using a per-tensor scheme and float scales.

    from quark.onnx import QConfig

    quantization_config = QConfig.get_default_config("A8W8")

4. Quantize the model 

Perform the quantization. Typically, quantizing a model of a few dozen megabytes with a few dozen calibration images only takes a few minutes. However, using more advanced algorithms such as AdaRound or AdaQuant can take several hours. 

    from quark.onnx import ModelQuantizer

    # calib_data_path, model_input_name, input_model_path, and quantized_model_path
    # are user-defined paths/names for your own model and calibration data.
    calib_data_reader = ImageDataReader(calib_data_path, model_input_name)

    quantizer = ModelQuantizer(quantization_config)
    quantizer.quantize_model(input_model_path, quantized_model_path, calib_data_reader)

5. Results 

Table 4 — Accuracy Results of Quantized Yolo Models   

Model | FP32 mAP50 | A8W8 mAP50
Yolo face v8m | 0.99 | 0.99
Yolo face v8n | 0.97 | 0.967
Yolo face v8n-v2 | 0.97 | 0.969
Yolo face v8s | 0.989 | 0.988
Yolo face v9c | 0.99 | 0.99
Yolo hand v8n | 0.983 | 0.978
Yolo hand v9c | 0.989 | 0.988
Yolo person v8m | 0.798 | 0.781
Yolo person v8n | 0.739 | 0.711
Yolo person v8s | 0.777 | 0.767

As shown in the table above, A8W8 quantization introduces almost no accuracy loss for most YOLO models. mAP stands for mean Average Precision, the most commonly used metric in object detection (mAP50 denotes mAP at an IoU threshold of 0.5); a higher value indicates better performance.
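Before moving on to NPU deployment, it can be worth smoke-testing that the quantized model still loads and runs with plain ONNX Runtime. The sketch below uses the CPU execution provider and assumes the 1x3x256x256 input shape used in the deployment script in the next section; models quantized to data types that rely on custom operators (for example BFP16) may need the VitisAI execution provider instead.

    import numpy as np
    import onnxruntime as ort

    # Load the quantized model on the CPU execution provider for a quick smoke test.
    session = ort.InferenceSession(quantized_model_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name

    # Run one forward pass with random data and print the output shapes.
    dummy = np.random.rand(1, 3, 256, 256).astype(np.float32)
    outputs = session.run(None, {input_name: dummy})
    print([o.shape for o in outputs])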

How to deploy the quantized models? 

Quantized models produced by Quark are compatible with multiple AMD deployment flows. Below, we continue using the quantized YOLO models as an example.

1. Setting up the Ryzen AI environment  

You can refer to this documentation for the hardware and OS requirements, then follow the steps below to create a Ryzen AI environment on an AMD device with NPU hardware.

A) Install the NPU driver 

Download the release driver package from here.  

Unzip the package and double-click npu_sw_installer.exe. Verify that the NPU MCDM driver is installed correctly by opening Task Manager -> Performance: NPU0 should appear in the list.

B) Install Conda

Download Conda Forge from here, install it, and create the Conda environment ryzen-ai-1.5.0.

C) Install the Ryzen AI software package

Download the package from here, then double-click to install it and select the Conda environment ryzen-ai-1.5.0.

2. Run the YOLO models on Ryzen AI

Activate the Conda environment ryzen-ai-1.5.0.
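For example, from a command prompt (the environment name comes from the installation step above):

conda activate ryzen-ai-1.5.0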

Create a test script, test.py, like the one below.

    import argparse
    import numpy as np
    import onnxruntime as ort

    parser = argparse.ArgumentParser(description='Run inference on an ONNX model using the VitisAI execution provider')
    parser.add_argument('--model', type=str, required=True, help='path to the quantized ONNX model')
    parser.add_argument('--xclbin', type=str, required=True, help='path to the NPU overlay .xclbin file')
    args = parser.parse_args()

    so = ort.SessionOptions()
    so.add_session_config_entry("session.intra_op.allow_spinning", "0")

    # Run on the NPU through the VitisAI execution provider.
    ep = ["VitisAIExecutionProvider"]
    vai_po = [{
        'xclbin': args.xclbin,
        'log_level': 'info'
    }]

    onnx_session = ort.InferenceSession(args.model,
        providers=ep,
        provider_options=vai_po,
        sess_options=so
    )

    # Feed a random tensor matching the model's expected 1x3x256x256 input.
    input_name = onnx_session.get_inputs()[0].name
    onnx_session.run(None, {input_name: np.random.rand(1, 3, 256, 256).astype(np.float32)})

To test the YOLO models, execute the following command:

python test.py --model yolo_quantized.onnx --xclbin AMD_AIE2P_8x4x1_Overlay.xclbin 

3. Results 

Table 5 — Latency of Yolo Models   

Model | CPU EP Latency (ms) | CPU FPS | NPU EP Latency (ms) | NPU FPS
Yolo face v8m | 253 | 21 | 15.6947 | 63.7043
Yolo face v8n | 47.69 | 21 | 6.963 | 143.573
Yolo face v8n-v2 | 58 | 17 | 6.9353 | 144.146
Yolo face v8s | 126.6 | 7.9 | 8.5593 | 116.804
Yolo face v9c | 405 | 2.46 | 21.691 | 46.0964
Yolo hand v8n | 58.5 | 17 | 7.2462 | 137.959
Yolo hand v9c | 407 | 2.45 | 21.8217 | 45.8192
Yolo person v8m | 333 | | 18.3264 | 54.5566
Yolo person v8n | 76.3 | 13.1 | 8.7464 | 114.303
Yolo person v8s | 167.8 | 5.96 | 11.4092 | 87.6278

As shown in the table above, CPU EP (ONNX Runtime's default CPU execution provider) represents the baseline performance, while NPU EP (the VitisAI execution provider targeting the Ryzen AI NPU) represents the performance after optimization with Ryzen AI. The NPU delivers outstanding acceleration across all of the tested models.

Summary 

Quantizing models with AMD Quark and deploying them on Ryzen AI hardware offers a powerful pathway to high-performance, energy-efficient edge AI. As demonstrated using several YOLO object-detection models, Quark’s ONNX-to-ONNX workflow delivers near-lossless accuracy while achieving more than 7x faster inference on the NPU compared to CPU execution. With broad data type support, advanced quantization algorithms, and seamless integration with ONNX Runtime, Quark enables developers to unlock the full potential of Ryzen AI systems—without requiring access to original training code or frameworks. 

As AI workloads continue to grow in complexity and scale, robust tooling and efficient hardware acceleration will become increasingly essential. AMD Quark and Ryzen AI provide a streamlined, production-ready solution for bringing optimized, real-time AI inference to client devices. 

Acknowledgement 

We would like to express our thanks to our colleagues from the AMD Quark Team and AI Software Team for their insightful feedback and technical assistance.
