Deploying an End-to-End Object Detection Model on AMD AI PC with NPU

Jan 23, 2026

Introduction

The era of on-device AI is accelerating, and AMD Ryzen™ AI-powered PCs are at the forefront. With dedicated Neural Processing Units (NPUs), integrated GPUs, and high-performance CPUs, developers can now deploy complex computer vision models locally, achieving low-latency inference without cloud dependencies.

Object detection is a cornerstone of AI applications—from autonomous systems to retail analytics—and running these workloads efficiently on edge devices requires careful optimization. This blog provides a step-by-step, end-to-end workflow for deploying an object detection model on an AMD AI PC, covering:

  • Exporting models to ONNX format for cross-platform compatibility
  • Quantization using AMD’s Quark tool to reduce model size and improve NPU throughput
  • Deployment with AMD ONNX Runtime and NPU acceleration
  • Evaluation of accuracy, latency, and efficiency

Using YOLO-World as an example, we will demonstrate how to maximize NPU performance while maintaining accuracy, giving developers a practical blueprint for real-world AI applications.

Environment 

Hardware Requirements

  • AMD Ryzen™ AI series processor with NPU driver version 32.0.203.280 or newer
  • Windows 11 x86-64 (latest updates recommended)

Software Requirements

Create a new conda environment by cloning the installed Ryzen AI 1.6.1 environment, then activate it:

    conda create --name ryzen-ai-1.6.1-yoloworld --clone ryzen-ai-1.6.1
    conda activate ryzen-ai-1.6.1-yoloworld
	

Code and Model

Float Model Overview

For this demonstration, we take a trained YOLO-World model with:

  • Version: Yolov8s-worldv2
  • Input resolution: 640×640
  • Format: FP32
  • Baseline accuracy:
    • AP50-95: 0.415
    • AP50: 0.498

This model serves as a reference for both quantization and performance evaluation.

Exporting Yolo World to ONNX

YOLO-World supports direct ONNX export. You can use the Python script below to export it directly.

Export Command

		python .\ultra_yolo_to_onnx.py --pt-model .\models\yolov8s-worldv2 --input-size 640 
	

Key Notes

  • Use opset ≥ 20
  • After exporting, verify the model using Netron
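
If you do not have the blog's ultra_yolo_to_onnx.py at hand, a minimal sketch of an equivalent export using the ultralytics package might look like the following; the checkpoint path and export arguments here are assumptions, and the blog's script may differ in details:

    # Minimal export sketch (assumes the ultralytics package; the blog's
    # ultra_yolo_to_onnx.py may differ in details).
    from ultralytics import YOLOWorld

    model = YOLOWorld("models/yolov8s-worldv2.pt")    # trained FP32 checkpoint
    model.export(format="onnx", imgsz=640, opset=20)  # opset >= 20, fixed 640x640 input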

Operator Support

Ensure that all operators in the model are supported by the AMD NPUs. 

Key supported operators in YOLO World include:

 

Operator              Support
Conv                  Y
BatchNormalization    Fused
Sigmoid               Y
Exp                   Y
Add/Mul               Y
Transpose             Y
Reshape               Y
Clip                  Y
Softmax               Y
Cast                  Y
Normalize             Y
Div                   Y
Mul                   Y
MaxPool               Y
Resize                Y
Slice                 Y
Einsum                Y
NMS                   On CPU

Full operator support details can be found in the AMD Ryzen AI Docs.
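
Before quantizing, it can also help to list the operator types the exported graph actually contains and compare them against the table above. A quick check with the onnx package (an assumed helper, not part of the blog's scripts):

    # List the distinct ONNX operator types used by the exported model so they
    # can be compared against the NPU support table above.
    import onnx

    model = onnx.load("yolo-world-models/yolov8s-world.onnx")
    op_types = sorted({node.op_type for node in model.graph.node})
    print(op_types)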

Quantization Workflow

Preparing Calibration Dataset

The calibration dataset must:

  • Be representative of your deployment data
  • Cover variations in lighting, object size, and class distribution
  • Contain 100–1000 images
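
The blog's quark_quant.py handles calibration internally (see --num-calib-images below), but as an illustration, a calibration data reader in the ONNX Runtime style could look roughly like this; the input name "images" and the simplified resize are assumptions:

    # Hypothetical calibration data reader sketch (ONNX Runtime style).
    # The real pipeline should apply the same letterbox preprocessing used at inference.
    import glob
    import cv2
    import numpy as np
    from onnxruntime.quantization import CalibrationDataReader

    class YoloWorldCalibReader(CalibrationDataReader):
        def __init__(self, image_dir, input_name="images", num_images=512, size=640):
            self.paths = iter(sorted(glob.glob(f"{image_dir}/*.jpg"))[:num_images])
            self.input_name = input_name
            self.size = size

        def get_next(self):
            path = next(self.paths, None)
            if path is None:
                return None  # signals end of calibration data
            img = cv2.imread(path)
            img = cv2.resize(img, (self.size, self.size))             # letterbox omitted for brevity
            img = img[:, :, ::-1].astype(np.float32) / 255.0          # BGR -> RGB, scale to 0-1
            img = np.ascontiguousarray(img.transpose(2, 0, 1))[None]  # HWC -> NCHW + batch dim
            return {self.input_name: img}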

Running the Quantization Tool

Example (platform-specific command):

    python quark_quant.py --onnx yolo-world-models/yolov8s-world.onnx ^
        --quant A16W8_ADAROUND ^
        --exclude-post
	

Key configuration fields:

Field              Explanation
quant              Quantization type; A8W8, A16W8, and other schemes are supported
exclude-post       Excludes post-processing from quantization; recommended for activations
num-calib-images   Number of images used for calibration (default: 512)
lr                 Learning rate (default: 0.1)
iters              Number of iterations (default: 3000)
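
For reference, here is the same run with the tuning fields above written out explicitly, assuming quark_quant.py accepts them as command-line flags under the names listed in the table:

    python quark_quant.py --onnx yolo-world-models/yolov8s-world.onnx ^
        --quant A16W8_ADAROUND ^
        --exclude-post ^
        --num-calib-images 512 ^
        --lr 0.1 ^
        --iters 3000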

Deployment Workflow

Loading Model into the Runtime

Example Python API:

    import onnxruntime as ort

    session = ort.InferenceSession(model_path,
                                   providers=["VitisAIExecutionProvider"])
	

Full Inference Pipeline

    # 1. preprocess
    img_resized, pad_top_left, scale = preprocess_image(
        img, input_size_wh, bgr2rgb=True
    )

    # 2. inference
    outputs = session.run(output_names=None, input_feed={input_name: img_resized})

    # 3. decode + NMS
    img_detections = postprocess_output(
        outputs[0],
        pad_top_left,
        scale,
        yolo_id_to_coco_id_map,
        min_score_thres,
        nms_iou_thresh,
        img_width,
        img_height,
    )
	

Recommended preprocessing steps:

  • Resize to (640,640) with letterbox
  • Normalize to 0–1
  • Channel order BGR → RGB
  • Padding uses "center alignment"
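
For reference, here is a minimal sketch of that preprocessing (letterbox resize with center padding, 0–1 normalization, BGR → RGB); the blog's preprocess_image presumably does the equivalent:

    # Letterbox preprocessing sketch: resize while preserving aspect ratio,
    # pad to 640x640 with the image centered, normalize, and convert to NCHW.
    import cv2
    import numpy as np

    def letterbox_preprocess(img_bgr, size=640, pad_value=114):
        h, w = img_bgr.shape[:2]
        scale = min(size / h, size / w)
        new_h, new_w = round(h * scale), round(w * scale)
        resized = cv2.resize(img_bgr, (new_w, new_h))
        top, left = (size - new_h) // 2, (size - new_w) // 2
        canvas = np.full((size, size, 3), pad_value, dtype=np.uint8)
        canvas[top:top + new_h, left:left + new_w] = resized
        rgb = canvas[:, :, ::-1].astype(np.float32) / 255.0          # BGR -> RGB, 0-1
        tensor = np.ascontiguousarray(rgb.transpose(2, 0, 1))[None]  # HWC -> NCHW + batch
        return tensor, (top, left), scale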

Test mAP and performance

You can swap the model path and the --device argument to test CPU mode.

    python eval_on_coco.py --model yolo-world-models/yolov8l-worldv2-A16W8_ADAROUND-640x640-exclude-post --device npu

    python infer_single.py --model yolo-world-models/yolov8l-worldv2-A16W8_ADAROUND-640x640-exclude-post --image test_img.jpg --device npu --runtime-seconds 60
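
The --runtime-seconds option presumably amounts to a timed loop over session.run; a rough sketch of such a measurement (hypothetical helper, not the blog's script):

    # Average single-image latency over a fixed time window.
    import time

    def measure_latency_ms(session, input_name, tensor, seconds=60):
        session.run(None, {input_name: tensor})   # warm-up run (compile/cache not counted)
        runs, start = 0, time.perf_counter()
        while time.perf_counter() - start < seconds:
            session.run(None, {input_name: tensor})
            runs += 1
        return (time.perf_counter() - start) * 1000.0 / runs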
	

Precautions

  • Preprocessing must exactly match training
  • Avoid dynamic shapes on embedded platforms
  • Verify the feature map layout (e.g., NCHW vs. NHWC)
  • The calibration dataset is extremely important

Results

Below is the yolov8s-worldv2 results table:

yolov8s-worldv2    mAP    mAP50   mAP75   Latency (ms)   Model Size
Float model        35.9   49.7    39.0    53.21          48.06k
Quantized model    35.6   49.6    38.7    90.68          12.46k
NPU E2E            35.6   49.5    38.7    22.85          17.39k

Conclusion

In this blog, we showcased a complete workflow for object detection on AMD AI PCs, from exporting YOLO-World to ONNX, through quantization, to NPU deployment and evaluation.

Key takeaways include:

  • Quantization with A16W8_ADAROUND significantly reduces model size while maintaining accuracy
  • Proper calibration and preprocessing are critical for consistent deployment performance

The AMD AI ecosystem is rapidly evolving, and now is the perfect time to explore on-device AI pipelines. We encourage developers to:

  • Experiment with different quantization schemes to optimize their workloads
  • Benchmark custom models on Ryzen AI PCs to fully leverage NPUs
  • Contribute to AMD’s AI developer community by sharing insights, performance results, and best practices

By following this workflow, you can unlock the full potential of AMD NPUs for real-time, production-ready object detection, bringing powerful AI capabilities to the edge with efficiency and precision.
