AI Inference Acceleration on Ryzen AI NPU with AMD Quark

Dec 19, 2025

AI inference on client devices is rapidly evolving, and AMD Ryzen™ AI PCs sit at the center of this shift. Powered by the AMD XDNA™-based Neural Processing Unit (NPU), Ryzen AI systems deliver high-throughput, low-latency, and energy-efficient acceleration for a wide range of deep learning workloads. With the AMD Ryzen™ AI Software stack, developers can seamlessly execute ONNX models on the NPU or integrated GPU, unlocking significant performance gains while reducing power consumption compared to CPU- or GPU-only pipelines.

For ONNX models, ONNX Runtime provides a popular and flexible inference framework, but its built-in quantization support can be limited for developers who need a broader set of data types, fine-grained controls, or state-of-the-art algorithms. This is where AMD Quark excels. Quark enhances the ONNX-to-ONNX quantization workflow, enabling developers to convert full-precision models into low-bit versions optimized for Ryzen AI and deploy them efficiently.

In this blog, we walk through how the ONNX-to-ONNX quantization flow in AMD Quark works, why it matters, and how to use it to accelerate real-world models—using the YOLO family as a practical example. Finally, we show how to deploy the resulting quantized models on a Ryzen AI NPU for dramatic inference speedups.

Why Quantize with AMD Quark?

Quantization Benefits:

  • Faster inference by reducing computation and improving memory bandwidth usage.

  • Lower power consumption, which is extremely valuable in production deployments.

AMD Quark is a comprehensive cross-platform deep learning toolkit designed to simplify and enhance the quantization of deep learning models. Supporting both PyTorch and ONNX models, AMD Quark empowers developers to optimize their models for deployment on a wide range of hardware backends, achieving significant performance gains without compromising accuracy.

Why the ONNX-to-ONNX Flow?

ONNX-to-ONNX flow refers to a quantization pipeline where both the input and output of the workflow are ONNX models, without involving the original training framework (PyTorch, TensorFlow, etc.).

The benefits of ONNX flow include:

  1. Hardware and platform agnostic: Many customers produce models in ONNX format, mainly because ONNX is framework-agnostic, easier to integrate into existing toolchains, and allows customers to share only the exported computation graph without exposing full training code, proprietary architectures, or model-specific intellectual property. ONNX also avoids dependencies on the customer’s original training framework, such as PyTorch or TensorFlow, making it a safer and more convenient format for third-party workflows.

  2. Performant engine: ONNX Runtime is a high-performance inference engine designed to run machine learning models in the ONNX format. It supports models trained in various frameworks like PyTorch and TensorFlow, enabling seamless deployment across platforms. ONNX Runtime offers cross-platform compatibility, hardware acceleration, and optimized performance, making it ideal for both cloud and edge inference scenarios; a minimal usage example follows this list.
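To illustrate how little is needed to run an exported model, the snippet below is a minimal sketch that loads a hypothetical model.onnx with ONNX Runtime's default CPU execution provider; the file name and input shape are placeholders, not part of the flow described in this post.

    import numpy as np
    import onnxruntime as ort

    # Execution providers available in this ONNX Runtime build (CPU, CUDA, VitisAI, ...).
    print(ort.get_available_providers())

    # Create a session directly from the exported graph, independent of the training framework.
    session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name

    # Feed a random NCHW image tensor; 1x3x224x224 is only an assumed input shape.
    dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
    outputs = session.run(None, {input_name: dummy})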

Quark's ONNX-to-ONNX quantization flow has been successfully applied to hundreds of models, demonstrating robustness and scalability. Notably, for key customers such as Microsoft, its integration with Olive enables an end-to-end workflow from quantization to deployment, ensuring seamless integration and efficient model delivery. As shown in the tables below, the ONNX-to-ONNX flow delivers excellent results on several widely used models as well as key customer-specific models.

Table 1 — Benchmark Results of Top‑1 Accuracy or mean Average Precision (mAP)  

Model Name / Metric | FP32 (Without Quantization) | INT8 | A16W8 | BF16
Resnet50 / Top-1 | 0.813 | 0.809 | 0.810 | 0.812
 | 0.729 | 0.727 | 0.727 | 0.729
 | 0.756 | 0.752 | 0.756 | 0.756
Yolonas_s / mAP | 0.60 | 0.51 | 0.58 | 0.60

Table 2 — Results of Customer Models on AMD Ryzen AI NPU  

Model | Target Data Type | Target PSNR/SNR | Quantized Accuracy
Detection Model | INT8 | 30.763 dB | 37.190 dB
Segmentation Model | INT8 | 52.228 dB | 52.682 dB
Super-Resolution Model | A16W8 | 47.21 dB | 47.30 dB
Audio Model | BF16 | 25.0 dB | 41.1 dB

Quark can run on both Linux and Windows operating systems. As illustrated in Table 3 below, users can apply a wide range of data types, including XINT8 (Activation Int8, Weight Int8, Power-of-Two Scales), A8W8 (Activation Int8, Weight Int8, Float32 Scales), A16W8 (Activation Int16, Weight Int8, Float32 Scales), BFloat16, and BFP16. The Quark ONNX flow also offers different quantization strategies, schemes, and algorithms that allow users to customize quantization and achieve better results.

Table 3 — Quark ONNX Features  

Feature Name | Quark ONNX
Data Type | Float16, BFloat16, Int4/UInt4, Int8/UInt8, Int16/UInt16, Int32/UInt32, BFP16, MX4/MX6/MX9, MXFP6_E2M3 / MXFP6_E3M2 / MXFP4 / MXINT8
Operating Systems | Linux (ROCm/CUDA/CPU); Windows (ROCm/CUDA/CPU)
Quant Strategy | Static quant / Weight only / Dynamic quant
Quant Scheme | Per tensor / Per channel
Symmetric | Symmetric / Asymmetric
Calibration Method | Power-of-Two Scale (MinMax / MinMSE); Float Scale (MinMax / Percentile / Layerwise Percentile / Entropy)
Scale Type | Float32 / Float16
Supported Ops | Almost all ONNX ops
Pre-Quant Optimization | QuaRot / SmoothQuant / CLE
Quantization Algorithm | AdaQuant / AdaRound / GPTQ / Bias Correction
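Switching between these data types only changes the configuration name passed to Quark. The sketch below reuses the QConfig helper shown later in this post; the preset names ("XINT8", "A16W8", "BF16") are assumptions based on the data-type labels above, so check the Quark ONNX documentation for the exact identifiers in your release.

    from quark.onnx import QConfig

    # Preset names below are assumed from the data-type labels in this post;
    # verify them against the Quark ONNX documentation for your release.
    xint8_config = QConfig.get_default_config("XINT8")   # Int8 with power-of-two scales
    a16w8_config = QConfig.get_default_config("A16W8")   # Int16 activations, Int8 weights
    bf16_config = QConfig.get_default_config("BF16")     # BFloat16 weights and activations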

In practical ONNX flow use cases, vision models such as classification and object detection models are often the primary focus. Algorithms like CLE, AdaRound, and AdaQuant can significantly improve the quantization accuracy of vision models, helping to meet customer targets. In today’s era of large language models (LLMs), we also support LLM quantization algorithms such as SmoothQuant, GPTQ, and QuaRot, achieving promising quantization accuracy on models like the Llama series.

How to Perform ONNX-to-ONNX Quantization?

We will use the YOLO family of models as an example. They are Convolutional Neural Network (CNN) based models commonly used for object detection tasks. You can refer to Quark's official documentation for the detailed code.

1. Prepare the original float model and calibration data

Here we assume you already have an ONNX model and a small dataset with several images.
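If you are starting from a PyTorch checkpoint rather than an ONNX file, one convenient option (mentioned here only as an illustration, not as part of the Quark flow itself) is the Ultralytics package, which can export YOLOv8 weights to ONNX in a couple of lines; the calibration set can simply be a folder containing a few dozen representative images.

    from ultralytics import YOLO

    # Export a YOLOv8 checkpoint to ONNX; this writes yolov8n.onnx next to the weights.
    # imgsz=256 matches the input resolution assumed in the rest of this post.
    model = YOLO("yolov8n.pt")
    model.export(format="onnx", imgsz=256)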

2. Implement the calibration data reader.

The calibration process in quantization is essentially a process of collecting activation values through model forward passes to determine the quantization parameters, so we need a calibration data reader. A minimal implementation is sketched below; it assumes 256x256 RGB inputs (matching the input shape used in the deployment script later in this post) and uses Pillow for image loading, so adapt the preprocessing to your own model.

    import os
    import numpy as np
    from PIL import Image
    from onnxruntime.quantization import CalibrationDataReader

    class ImageDataReader(CalibrationDataReader):
        def __init__(self, calibration_image_folder: str, input_name: str):
            self.input_name = input_name
            self.data = self._preprocess_images(calibration_image_folder)
            self.iterator = iter(self.data)

        def _preprocess_images(self, image_folder: str):
            # Resize each image to the assumed 256x256 model input, scale to [0, 1],
            # and build NCHW float32 feed dicts keyed by the model input name.
            feeds = []
            for name in sorted(os.listdir(image_folder)):
                img = Image.open(os.path.join(image_folder, name)).convert("RGB").resize((256, 256))
                arr = np.asarray(img, dtype=np.float32).transpose(2, 0, 1) / 255.0
                feeds.append({self.input_name: arr[np.newaxis, ...]})
            return feeds

        def get_next(self):
            # Return the next feed dict, or None when the calibration data is exhausted.
            return next(self.iterator, None)

        def rewind(self):
            # Restart iteration so the reader can be reused for another calibration pass.
            self.iterator = iter(self.data)

3. Set the quantization configuration 

Set the A8W8 quantization configuration, which quantizes activations to Int8 and weights to Int8 using a per-tensor scheme and float scales.

    from quark.onnx import QConfig

    quantization_config = QConfig.get_default_config("A8W8")

4. Quantize the model 

Perform the quantization. Typically, quantizing a model of a few dozen megabytes with a few dozen calibration images only takes a few minutes. However, using more advanced algorithms such as AdaRound or AdaQuant can take several hours. 

    from quark.onnx import ModelQuantizer

    # calib_data_path, model_input_name, input_model_path, and quantized_model_path
    # are user-defined paths/names for your own model and calibration data.
    calib_data_reader = ImageDataReader(calib_data_path, model_input_name)

    quantizer = ModelQuantizer(quantization_config)
    quantizer.quantize_model(input_model_path, quantized_model_path, calib_data_reader)

5. Results 

Table 4 — Accuracy Results of Quantized Yolo Models   

Model | FP32 mAP50 | A8W8 mAP50
Yolo face v8m | 0.99 | 0.99
Yolo face v8n | 0.97 | 0.967
Yolo face v8n-v2 | 0.97 | 0.969
Yolo face v8s | 0.989 | 0.988
Yolo face v9c | 0.99 | 0.99
Yolo hand v8n | 0.983 | 0.978
Yolo hand v9c | 0.989 | 0.988
Yolo person v8m | 0.798 | 0.781
Yolo person v8n | 0.739 | 0.711
Yolo person v8s | 0.777 | 0.767

As shown in the table above, A8W8 quantization introduces almost no accuracy loss for most YOLO models. mAP stands for mean Average Precision, the most commonly used metric in object detection (mAP50 denotes mAP at an IoU threshold of 0.5); a higher value indicates better performance.
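Before moving on to NPU deployment, it can be worth smoke-testing that the quantized model still loads and runs with plain ONNX Runtime. The sketch below uses the CPU execution provider and assumes the 1x3x256x256 input shape used in the deployment script in the next section; models quantized to data types that rely on custom operators (for example BFP16) may need the VitisAI execution provider instead.

    import numpy as np
    import onnxruntime as ort

    # Load the quantized model on the CPU execution provider for a quick smoke test.
    session = ort.InferenceSession(quantized_model_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name

    # Run one forward pass with random data and print the output shapes.
    dummy = np.random.rand(1, 3, 256, 256).astype(np.float32)
    outputs = session.run(None, {input_name: dummy})
    print([o.shape for o in outputs])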

How to deploy the quantized models? 

Quantized models produced by Quark are compatible with multiple AMD deployment flows. Below, we continue using the quantized YOLO models as an example.

1. Setting up the Ryzen AI environment  

You can refer to this documentation for the hardware and OS requirements, then follow the steps below to create a Ryzen AI environment on an AMD device with NPU hardware.

A) Install the NPU driver 

Download the release driver package from here.  

Unzip the package and double-click npu_sw_installer.exe. Verify that the NPU MCDM driver is installed correctly by opening Task Manager -> Performance: NPU0 should appear in the list.

B) Install Conda

Download Conda Forge from here, install it, and create the Conda environment ryzen-ai-1.5.0.

C) Install the Ryzen AI software package

Download the package from here, then double-click to install it and select the Conda environment ryzen-ai-1.5.0.

2. Run the YOLO models on Ryzen AI

Activate the Conda environment ryzen-ai-1.5.0.
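For example, from a command prompt (the environment name comes from the installation step above):

conda activate ryzen-ai-1.5.0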

Create a test script, test.py, like the one below.

    import argparse
    import numpy as np
    import onnxruntime as ort

    parser = argparse.ArgumentParser(description='Run inference on an ONNX model using the VitisAI execution provider')
    parser.add_argument('--model', type=str, required=True, help='path to the quantized ONNX model')
    parser.add_argument('--xclbin', type=str, required=True, help='path to the NPU overlay .xclbin file')
    args = parser.parse_args()

    so = ort.SessionOptions()
    so.add_session_config_entry("session.intra_op.allow_spinning", "0")

    # Run on the NPU through the VitisAI execution provider.
    ep = ["VitisAIExecutionProvider"]
    vai_po = [{
        'xclbin': args.xclbin,
        'log_level': 'info'
    }]

    onnx_session = ort.InferenceSession(args.model,
        providers=ep,
        provider_options=vai_po,
        sess_options=so
    )

    # Feed a random tensor matching the model's expected 1x3x256x256 input.
    input_name = onnx_session.get_inputs()[0].name
    onnx_session.run(None, {input_name: np.random.rand(1, 3, 256, 256).astype(np.float32)})

To test the YOLO models, execute the following command:

python test.py --model yolo_quantized.onnx --xclbin AMD_AIE2P_8x4x1_Overlay.xclbin 

3. Results 

Table 5 — Latency of Yolo Models   

Model | CPU EP Latency (ms) | CPU FPS | NPU EP Latency (ms) | NPU FPS
Yolo face v8m | 253 | 21 | 15.6947 | 63.7043
Yolo face v8n | 47.69 | 21 | 6.963 | 143.573
Yolo face v8n-v2 | 58 | 17 | 6.9353 | 144.146
Yolo face v8s | 126.6 | 7.9 | 8.5593 | 116.804
Yolo face v9c | 405 | 2.46 | 21.691 | 46.0964
Yolo hand v8n | 58.5 | 17 | 7.2462 | 137.959
Yolo hand v9c | 407 | 2.45 | 21.8217 | 45.8192
Yolo person v8m | 333 | | 18.3264 | 54.5566
Yolo person v8n | 76.3 | 13.1 | 8.7464 | 114.303
Yolo person v8s | 167.8 | 5.96 | 11.4092 | 87.6278

As shown in the table above, CPU EP (ONNX Runtime's default CPU execution provider) represents the baseline performance, while NPU EP (the VitisAI execution provider targeting the Ryzen AI NPU) represents the performance after optimization with Ryzen AI. The NPU delivers outstanding acceleration across all of the tested models.

Summary 

Quantizing models with AMD Quark and deploying them on Ryzen AI hardware offers a powerful pathway to high-performance, energy-efficient edge AI. As demonstrated using several YOLO object-detection models, Quark’s ONNX-to-ONNX workflow delivers near-lossless accuracy while achieving more than 7x faster inference on the NPU compared to CPU execution. With broad data type support, advanced quantization algorithms, and seamless integration with ONNX Runtime, Quark enables developers to unlock the full potential of Ryzen AI systems—without requiring access to original training code or frameworks. 

As AI workloads continue to grow in complexity and scale, robust tooling and efficient hardware acceleration will become increasingly essential. AMD Quark and Ryzen AI provide a streamlined, production-ready solution for bringing optimized, real-time AI inference to client devices. 

Acknowledgement 

We would like to express our thanks to our colleagues from the AMD Quark Team and AI Software Team for their insightful feedback and technical assistance.
