Deploying an End-to-End Object Detection Model on AMD AI PC with NPU

Jan 23, 2026

Introduction

The era of on-device AI is accelerating, and AMD Ryzen™ AI-powered PCs are at the forefront. With dedicated Neural Processing Units (NPUs), integrated GPUs, and high-performance CPUs, developers can now deploy complex computer vision models locally, achieving low-latency inference without cloud dependencies.

Object detection is a cornerstone of AI applications—from autonomous systems to retail analytics—and running these workloads efficiently on edge devices requires careful optimization. This blog provides a step-by-step, end-to-end workflow for deploying an object detection model on an AMD AI PC, covering:

  • Exporting models to ONNX format for cross-platform compatibility
  • Quantization using AMD’s Quark tool to reduce model size and improve NPU throughput
  • Deployment with AMD ONNX Runtime and NPU acceleration
  • Evaluation of accuracy, latency, and efficiency

Using YOLO-World as an example, we will demonstrate how to maximize NPU performance while maintaining accuracy, giving developers a practical blueprint for real-world AI applications.

Environment 

Hardware Requirements

  • AMD Ryzen™ AI series processor with NPU driver version 32.0.203.280 or newer
  • Windows 11 x86-64 (latest updates recommended)

Software Requirements

Create a new conda environment by cloning the installed Ryzen AI 1.6.1 environment, then activate it:

    conda create --name ryzen-ai-1.6.1-yoloworld --clone ryzen-ai-1.6.1
    conda activate ryzen-ai-1.6.1-yoloworld
	

Code and Model

Float Model Overview

For this demonstration, we take a trained YOLO-World model with:

  • Version: Yolov8s-worldv2
  • Input resolution: 640×640
  • Format: FP32
  • Baseline accuracy:
    • AP50-95: 0.415
    • AP50: 0.498

This model serves as a reference for both quantization and performance evaluation.

Exporting Yolo World to ONNX

YOLO-World supports direct ONNX export. You can use the Python script below to export it directly.

Export Command

		python .\ultra_yolo_to_onnx.py --pt-model .\models\yolov8s-worldv2 --input-size 640 
	

Key Notes

  • Use opset ≥ 20
  • After exporting, verify the model using Netron
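
If you do not have the blog's ultra_yolo_to_onnx.py at hand, a minimal sketch of an equivalent export using the ultralytics package might look like the following; the checkpoint path and export arguments here are assumptions, and the blog's script may differ in details:

    # Minimal export sketch (assumes the ultralytics package; the blog's
    # ultra_yolo_to_onnx.py may differ in details).
    from ultralytics import YOLOWorld

    model = YOLOWorld("models/yolov8s-worldv2.pt")    # trained FP32 checkpoint
    model.export(format="onnx", imgsz=640, opset=20)  # opset >= 20, fixed 640x640 input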

Operator Support

Ensure that all operators in the model are supported by the AMD NPUs. 

Key supported operators in YOLO World include:

 

Operator              Support
Conv                  Y
BatchNormalization    Fused
Sigmoid               Y
Exp                   Y
Add/Mul               Y
Transpose             Y
Reshape               Y
Clip                  Y
Softmax               Y
Cast                  Y
Normalize             Y
Div                   Y
Mul                   Y
MaxPool               Y
Resize                Y
Slice                 Y
Einsum                Y
NMS                   On CPU

Full operator support details can be found in the AMD Ryzen AI Docs.
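
Before quantizing, it can also help to list the operator types the exported graph actually contains and compare them against the table above. A quick check with the onnx package (an assumed helper, not part of the blog's scripts):

    # List the distinct ONNX operator types used by the exported model so they
    # can be compared against the NPU support table above.
    import onnx

    model = onnx.load("yolo-world-models/yolov8s-world.onnx")
    op_types = sorted({node.op_type for node in model.graph.node})
    print(op_types)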

Quantization Workflow

Preparing Calibration Dataset

The calibration dataset must:

  • Be representative of your deployment data
  • Cover variations in lighting, object size, and class distribution
  • Contain 100–1000 images
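
The blog's quark_quant.py handles calibration internally (see --num-calib-images below), but as an illustration, a calibration data reader in the ONNX Runtime style could look roughly like this; the input name "images" and the simplified resize are assumptions:

    # Hypothetical calibration data reader sketch (ONNX Runtime style).
    # The real pipeline should apply the same letterbox preprocessing used at inference.
    import glob
    import cv2
    import numpy as np
    from onnxruntime.quantization import CalibrationDataReader

    class YoloWorldCalibReader(CalibrationDataReader):
        def __init__(self, image_dir, input_name="images", num_images=512, size=640):
            self.paths = iter(sorted(glob.glob(f"{image_dir}/*.jpg"))[:num_images])
            self.input_name = input_name
            self.size = size

        def get_next(self):
            path = next(self.paths, None)
            if path is None:
                return None  # signals end of calibration data
            img = cv2.imread(path)
            img = cv2.resize(img, (self.size, self.size))             # letterbox omitted for brevity
            img = img[:, :, ::-1].astype(np.float32) / 255.0          # BGR -> RGB, scale to 0-1
            img = np.ascontiguousarray(img.transpose(2, 0, 1))[None]  # HWC -> NCHW + batch dim
            return {self.input_name: img}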

Running the Quantization Tool

Example (platform-specific command):

    python quark_quant.py --onnx yolo-world-models/yolov8s-world.onnx ^
        --quant A16W8_ADAROUND ^
        --exclude-post
	

Key configuration fields:

Field              Explanation
quant              Quantization type; A8W8, A16W8, and other schemes are supported
exclude-post       Excludes post-processing from quantization; recommended for activations
num-calib-images   Number of images used for calibration (default: 512)
lr                 Learning rate (default: 0.1)
iters              Number of iterations (default: 3000)
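
For reference, here is the same run with the tuning fields above written out explicitly, assuming quark_quant.py accepts them as command-line flags under the names listed in the table:

    python quark_quant.py --onnx yolo-world-models/yolov8s-world.onnx ^
        --quant A16W8_ADAROUND ^
        --exclude-post ^
        --num-calib-images 512 ^
        --lr 0.1 ^
        --iters 3000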

Deployment Workflow

Loading Model into the Runtime

Example Python API:

    import onnxruntime as ort

    session = ort.InferenceSession(model_path,
                                   providers=["VitisAIExecutionProvider"])
	

Full Inference Pipeline

    # 1. preprocess
    img_resized, pad_top_left, scale = preprocess_image(
        img, input_size_wh, bgr2rgb=True
    )

    # 2. inference
    outputs = session.run(output_names=None, input_feed={input_name: img_resized})

    # 3. decode + NMS
    img_detections = postprocess_output(
        outputs[0],
        pad_top_left,
        scale,
        yolo_id_to_coco_id_map,
        min_score_thres,
        nms_iou_thresh,
        img_width,
        img_height,
    )
	

Recommended preprocessing steps:

  • Resize to (640,640) with letterbox
  • Normalize to 0–1
  • Channel order BGR → RGB
  • Padding uses "center alignment"
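
For reference, here is a minimal sketch of that preprocessing (letterbox resize with center padding, 0–1 normalization, BGR → RGB); the blog's preprocess_image presumably does the equivalent:

    # Letterbox preprocessing sketch: resize while preserving aspect ratio,
    # pad to 640x640 with the image centered, normalize, and convert to NCHW.
    import cv2
    import numpy as np

    def letterbox_preprocess(img_bgr, size=640, pad_value=114):
        h, w = img_bgr.shape[:2]
        scale = min(size / h, size / w)
        new_h, new_w = round(h * scale), round(w * scale)
        resized = cv2.resize(img_bgr, (new_w, new_h))
        top, left = (size - new_h) // 2, (size - new_w) // 2
        canvas = np.full((size, size, 3), pad_value, dtype=np.uint8)
        canvas[top:top + new_h, left:left + new_w] = resized
        rgb = canvas[:, :, ::-1].astype(np.float32) / 255.0          # BGR -> RGB, 0-1
        tensor = np.ascontiguousarray(rgb.transpose(2, 0, 1))[None]  # HWC -> NCHW + batch
        return tensor, (top, left), scale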

Test mAP and performance

You can swap the model path and the --device argument to test CPU mode.

    python eval_on_coco.py --model yolo-world-models/yolov8l-worldv2-A16W8_ADAROUND-640x640-exclude-post --device npu

    python infer_single.py --model yolo-world-models/yolov8l-worldv2-A16W8_ADAROUND-640x640-exclude-post --image test_img.jpg --device npu --runtime-seconds 60
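
The --runtime-seconds option presumably amounts to a timed loop over session.run; a rough sketch of such a measurement (hypothetical helper, not the blog's script):

    # Average single-image latency over a fixed time window.
    import time

    def measure_latency_ms(session, input_name, tensor, seconds=60):
        session.run(None, {input_name: tensor})   # warm-up run (compile/cache not counted)
        runs, start = 0, time.perf_counter()
        while time.perf_counter() - start < seconds:
            session.run(None, {input_name: tensor})
            runs += 1
        return (time.perf_counter() - start) * 1000.0 / runs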
	

Precautions

  • Preprocessing must exactly match training
  • Avoid dynamic shapes on embedded platforms
  • Verify the feature map layout (e.g., NCHW vs. NHWC)
  • The calibration dataset is extremely important

Results

Below is the yolov8s-worldv2 results table:

yolov8s-worldv2    mAP    mAP50   mAP75   Latency (ms)   Model Size
Float model        35.9   49.7    39.0    53.21          48.06k
Quantized model    35.6   49.6    38.7    90.68          12.46k
NPU E2E            35.6   49.5    38.7    22.85          17.39k

Conclusion

In this blog, we showcased a complete workflow for object detection on AMD AI PCs, from exporting YOLO-World to ONNX, through quantization, to NPU deployment and evaluation.

Key takeaways include:

  • Quantization with A16W8_ADAROUND significantly reduces model size while maintaining accuracy
  • Proper calibration and preprocessing are critical for consistent deployment performance

The AMD AI ecosystem is rapidly evolving, and now is the perfect time to explore on-device AI pipelines. We encourage developers to:

  • Experiment with different quantization schemes to optimize their workloads
  • Benchmark custom models on Ryzen AI PCs to fully leverage NPUs
  • Contribute to AMD’s AI developer community by sharing insights, performance results, and best practices

By following this workflow, you can unlock the full potential of AMD NPUs for real-time, production-ready object detection, bringing powerful AI capabilities to the edge with efficiency and precision.
