Deploying an End-to-End Object Detection Model on AMD AI PC with NPU
Jan 23, 2026
Introduction
The era of on-device AI is accelerating, and AMD Ryzen™ AI-powered PCs are at the forefront. With dedicated Neural Processing Units (NPUs), integrated GPUs, and high-performance CPUs, developers can now deploy complex computer vision models locally, achieving low-latency inference without cloud dependencies.
Object detection is a cornerstone of AI applications—from autonomous systems to retail analytics—and running these workloads efficiently on edge devices requires careful optimization. This blog provides a step-by-step, end-to-end workflow for deploying an object detection model on an AMD AI PC, covering:
- Exporting models to ONNX format for cross-platform compatibility
- Quantization using AMD’s Quark tool to reduce model size and improve NPU throughput
- Deployment with AMD ONNX Runtime and NPU acceleration
- Evaluation of accuracy, latency, and efficiency
Using YOLO-World as an example, we will demonstrate how to maximize NPU performance while maintaining accuracy, giving developers a practical blueprint for real-world AI applications.
Environment
Hardware Requirements
- AMD Ryzen™ AI series processor with NPU driver version 32.0.203.280 or newer
- Windows 11 x86-64 (latest updates recommended)
Software Requirements
- Install Anaconda or Miniforge
- Download and install the Ryzen AI Python Package
Clone the base Ryzen AI environment into a new conda environment:
conda create --name ryzen-ai-1.6.1-yoloworld --clone ryzen-ai-1.6.1
conda activate ryzen-ai-1.6.1-yoloworld
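To confirm the runtime is visible in the new environment, you can list the execution providers available to ONNX Runtime; the output should include VitisAIExecutionProvider if the Ryzen AI package is installed correctly:
python -c "import onnxruntime as ort; print(ort.get_available_providers())"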
Code and Model
- YOLO-World Test Code: Ryzen AI-SW-GitHub
- Original Float Model: yolov8s-worldv2
Float Model Overview
For this demonstration, we use a trained YOLO-World model with:
- Version: Yolov8s-worldv2
- Input resolution: 640×640
- Format: FP32
- Baseline accuracy:
  - AP50-95: 0.415
  - AP50: 0.498
This model serves as a reference for both quantization and performance evaluation.
Exporting Yolo World to ONNX
YOLO-World supports direct ONNX export. Use the provided Python script to export it:
Export Command
python .\ultra_yolo_to_onnx.py --pt-model .\models\yolov8s-worldv2 --input-size 640
Key Notes
- Use opset ≥ 20
- After exporting, verify the model using Netron; a quick programmatic check is sketched below
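A minimal sanity check of the graph and opset version, assuming an illustrative ONNX filename (use whatever path the export script actually produced):

import onnx

# Path is illustrative; point it at the file written by the export command above
model = onnx.load("models/yolov8s-worldv2-640x640.onnx")

# Validate the graph structure
onnx.checker.check_model(model)

# Confirm the default-domain opset is >= 20
opset = next(op.version for op in model.opset_import if op.domain in ("", "ai.onnx"))
print("opset:", opset)
assert opset >= 20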
Operator Support
Ensure that all operators in the model are supported by the AMD NPUs.
Key supported operators in YOLO World include:
| Operator | Support |
|---|---|
| Conv | Y |
| BatchNormalization | Fused |
| Sigmoid | Y |
| Exp | Y |
| Add/Mul | Y |
| Transpose | Y |
| Reshape | Y |
| Clip | Y |
| Softmax | Y |
| Cast | Y |
| Normalize | Y |
| Div | Y |
| Mul | Y |
| MaxPool | Y |
| Resize | Y |
| Slice | Y |
| Einsum | Y |
| NMS | On CPU |
Full operator support details can be found in the AMD Ryzen AI Docs.
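To cross-check your exported graph against the table above, you can enumerate the operator types it actually contains. A minimal sketch, assuming the ONNX path used in the quantization step below:

import onnx
from collections import Counter

model = onnx.load("yolo-world-models/yolov8s-world.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)

# Compare each op type against the supported-operator list in the Ryzen AI docs
for op_type, count in sorted(op_counts.items()):
    print(f"{op_type}: {count}")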
Quantization Workflow
Preparing Calibration Dataset
The calibration dataset must:
- Be representative of your deployment data
- Cover variations in lighting, object size, and class distribution
- Contain 100–1000 images (a sketch for assembling such a set follows this list)
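A minimal sketch, assuming COCO-style validation images on disk (the source folder, destination folder, and sample count are illustrative):

import random
import shutil
from pathlib import Path

SRC = Path("coco/val2017")    # representative, deployment-like images
DST = Path("calib_images")    # folder passed to the quantization tool
NUM_IMAGES = 512              # matches the num-calib-images default

DST.mkdir(exist_ok=True)
images = sorted(SRC.glob("*.jpg"))
random.seed(0)  # reproducible sampling
for img in random.sample(images, min(NUM_IMAGES, len(images))):
    shutil.copy(img, DST / img.name)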
Running the Quantization Tool
Example (platform-specific command):
python quark_quant.py --onnx yolo-world-models/yolov8s-world.onnx ^
--quant A16W8_ADAROUND ^
--exclude-post
Key configuration fields (a fuller example invocation follows the table):
| Field | Explanation |
|---|---|
| quant | Quantization scheme, e.g. A8W8, A16W8, or other supported types |
| exclude-post | Exclude post-processing from quantization; recommended for activations |
| num-calib-images | Number of images used for calibration (default: 512) |
| lr | Learning rate (default: 0.1) |
| iters | Number of iterations (default: 3000) |
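Assuming each field in the table maps to a command-line flag of the same name (not verified here), a fuller invocation with the defaults spelled out might look like:
python quark_quant.py --onnx yolo-world-models/yolov8s-world.onnx ^
--quant A16W8_ADAROUND ^
--exclude-post ^
--num-calib-images 512 ^
--lr 0.1 ^
--iters 3000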
Deployment Workflow
Loading Model into the Runtime
Example Python API:
import onnxruntime as ort

session = ort.InferenceSession(model_path,
                               providers=["VitisAIExecutionProvider"])
Full Inference Pipeline
1. Preprocess
img_resized, pad_top_left, scale = preprocess_image(
img, input_size_wh, bgr2rgb=True
)
2. Inference
outputs = session.run(output_names=None, input_feed={input_name: img_resized})
3. Decode + NMS
img_detections = postprocess_output(
outputs[0],
pad_top_left,
scale,
yolo_id_to_coco_id_map,
min_score_thres,
nms_iou_thresh,
img_width,
img_height,
)
Recommended preprocessing steps (a minimal letterbox sketch follows this list):
- Resize to (640,640) with letterbox
- Normalize to 0–1
- Channel order BGR → RGB
- Padding uses "center alignment"
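The preprocess_image helper ships with the test code; the sketch below illustrates these rules. The gray padding value and NCHW output layout are assumptions and may differ from the repo's implementation:

import cv2
import numpy as np

def letterbox_preprocess(img_bgr, input_size_wh=(640, 640)):
    """Resize with preserved aspect ratio, center-pad, BGR->RGB, scale to 0-1."""
    ih, iw = img_bgr.shape[:2]
    w, h = input_size_wh
    scale = min(w / iw, h / ih)
    nw, nh = int(round(iw * scale)), int(round(ih * scale))

    resized = cv2.resize(img_bgr, (nw, nh))
    pad_left, pad_top = (w - nw) // 2, (h - nh) // 2

    canvas = np.full((h, w, 3), 114, dtype=np.uint8)  # gray padding value assumed
    canvas[pad_top:pad_top + nh, pad_left:pad_left + nw] = resized

    rgb = cv2.cvtColor(canvas, cv2.COLOR_BGR2RGB)
    tensor = (rgb.astype(np.float32) / 255.0).transpose(2, 0, 1)[np.newaxis]  # NCHW
    return tensor, (pad_top, pad_left), scale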
Test mAP and Performance
You can replace the model path and the --device argument to test CPU mode.
python eval_on_coco.py --model yolo-world-models/yolov8l-worldv2-A16W8_ADAROUND-640x640-exclude-post --device npu
python infer_single.py --model yolo-world-models/yolov8l-worldv2-A16W8_ADAROUND-640x640-exclude-post --image test_img.jpg --device npu --runtime-seconds 60
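For a quick latency number outside the provided scripts, a simple timing loop over session.run is enough. A minimal sketch, assuming a 1x3x640x640 float input and an illustrative path to your quantized model:

import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("path/to/quantized-model.onnx",  # path assumed
                               providers=["VitisAIExecutionProvider"])
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)

# Warm-up so one-time compilation and caching do not skew the numbers
for _ in range(10):
    session.run(None, {input_name: dummy})

runs = 100
start = time.perf_counter()
for _ in range(runs):
    session.run(None, {input_name: dummy})
print(f"avg latency: {(time.perf_counter() - start) / runs * 1000:.2f} ms")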
Precautions
- Preprocessing must exactly match training
- Avoid dynamic shapes on embedded platforms
- Verify the feature map layout
- The calibration dataset is extremely important
Results
Below are the results for yolov8s-worldv2:
| yolov8s-worldv2 | mAP | mAP50 | mAP75 | Latency (ms) | Model Size |
|---|---|---|---|---|---|
| Float model | 35.9 | 49.7 | 39 | 53.21 | 48.06k |
| Quantized model | 35.6 | 49.6 | 38.7 | 90.68 | 12.46k |
| NPU E2E | 35.6 | 49.5 | 38.7 | 22.85 | 17.39k |
Conclusion
In this blog, we showcased a complete workflow for object detection on AMD AI PCs, from exporting YOLO-World to ONNX, through quantization, to NPU deployment and evaluation.
Key takeaways include:
- Quantization with A16W8_ADAROUND significantly reduces model size while maintaining accuracy
- Proper calibration and preprocessing are critical for consistent deployment performance
The AMD AI ecosystem is rapidly evolving, and now is the perfect time to explore on-device AI pipelines. We encourage developers to:
- Experiment with different quantization schemes to optimize their workloads
- Benchmark custom models on Ryzen AI PCs to fully leverage NPUs
- Contribute to AMD’s AI developer community by sharing insights, performance results, and best practices
By following this workflow, you can unlock the full potential of AMD NPUs for real-time, production-ready object detection, bringing powerful AI capabilities to the edge with efficiency and precision.