Advancing Windows ML Acceleration with AMD at Microsoft Build 2026

Jun 02, 2026

Since Microsoft Ignite 2025, AMD has continued advancing NPU- and GPU-accelerated AI on Windows, with a focus on developer productivity, performance, and platform scale. Ahead of Microsoft Build 2026, we delivered meaningful updates across the NPU and GPU Execution Providers (EPs) as well as the broader Windows ML ecosystem for AMD APUs, NPUs, and discrete GPUs.

These improvements span the Windows AI stack and Microsoft’s emerging DirectX Compute Graph Compiler (DxCGC). DxCGC uses MLIR-based representations to bring full model graphs into the DirectX pipeline, enabling compiler-driven graph optimization, memory planning, operator fusion, and GPU execution on AMD hardware. On the NPU side, the latest work improves inference performance, developer tooling, benchmarking, and web platform integration.
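To make "memory planning" concrete, here is a minimal sketch of one common approach: reuse a buffer as soon as the tensor that occupied it is no longer live. This is a toy illustration under stated assumptions, not how DxCGC is implemented; the `Tensor` record, the `plan_buffers` helper, and the sizes in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    name: str
    size: int        # bytes
    first_use: int   # index of the op that produces the tensor
    last_use: int    # index of the last op that reads it


def plan_buffers(tensors):
    """Greedy liveness-based planner: reuse a buffer once its tensor is dead.

    Returns (assignment, buffer_sizes): assignment maps tensor name to a
    buffer index, and buffer_sizes[i] is the size of buffer i in bytes.
    """
    buffer_sizes = []   # size of each allocated buffer
    free_at = []        # last op index at which each buffer is still in use
    assignment = {}
    for t in sorted(tensors, key=lambda t: t.first_use):
        # Take the first buffer whose previous tenant died before this op.
        for i, busy_until in enumerate(free_at):
            if busy_until < t.first_use:
                buffer_sizes[i] = max(buffer_sizes[i], t.size)
                free_at[i] = t.last_use
                assignment[t.name] = i
                break
        else:
            # No reusable buffer: allocate a new one.
            buffer_sizes.append(t.size)
            free_at.append(t.last_use)
            assignment[t.name] = len(buffer_sizes) - 1
    return assignment, buffer_sizes


# Hypothetical four-op chain; sizes are placeholders, not measurements.
chain = [
    Tensor("a", 4096, 0, 1),
    Tensor("b", 4096, 1, 2),
    Tensor("c", 2048, 2, 3),
    Tensor("d", 1024, 3, 3),
]
assignment, buffer_sizes = plan_buffers(chain)
naive_bytes = sum(t.size for t in chain)
planned_bytes = sum(buffer_sizes)
```

In this example the four tensors fit in two buffers, so `planned_bytes` is smaller than `naive_bytes`. A production planner would also account for alignment, in-place operators, and branching graphs.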

This post highlights the platform enhancements delivered during this period and how they improve the end-to-end developer experience for AI inference workloads, including diffusion models and large language models (LLMs).

NPU Execution Provider Improvements

The NPU EP now supports more LLMs and runs supported models more quickly.

These gains come from improvements across the execution path, including operator fusion, more efficient kernels, and reduced overhead in end-to-end model execution. Model coverage has also doubled, with added support for GPT-OSS-20B, Qwen 2.5, and other models. The AI Toolkit (AITK) now provides per-operator execution timelines and NPU utilization breakdowns, helping developers identify and resolve inference bottlenecks with greater precision.
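Operator fusion merges adjacent operators into one kernel so intermediate results do not round-trip through memory. The sketch below shows the idea on a linear list of op names. It is a toy illustration, not the NPU EP's actual fusion pass; the function name and the fused op name `FusedMatMulAddRelu` are hypothetical.

```python
def fuse_matmul_add_relu(ops):
    """Collapse each MatMul -> Add -> Relu run in a linear op list into one op."""
    pattern = ["MatMul", "Add", "Relu"]
    fused = []
    i = 0
    while i < len(ops):
        if ops[i:i + 3] == pattern:
            # Three separate kernels become a single fused kernel.
            fused.append("FusedMatMulAddRelu")
            i += 3
        else:
            fused.append(ops[i])
            i += 1
    return fused


before = ["MatMul", "Add", "Relu", "MatMul", "Add", "Softmax"]
after = fuse_matmul_add_relu(before)
```

Only the first three ops match the pattern, so the trailing `MatMul`, `Add`, `Softmax` sequence is left untouched. Real fusion passes match patterns on a dataflow graph rather than a flat list and check that intermediate outputs have no other consumers.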

Strong scores in Procyon® AI Computer Vision Benchmark 2.0

Windows ML achieved excellent scores on the Procyon AI Computer Vision 2.0 benchmark, validating end-to-end performance across a range of workloads on similar NPUs.

WebNN Support (Experimental)

An initial Web Neural Network API (WebNN) integration now routes browser-based inference through the Windows ML AMD NPU execution path, extending hardware acceleration to web applications.

GPU Execution Provider (EP) Advancements

We introduced a new plugin interface for the ONNX Runtime GPU Execution Provider with the latest GPU EP release. This new release aligns with the current ONNX Runtime plugin API and separates GPU backend implementation from the ONNX Runtime core, allowing GPU backends to evolve without requiring core runtime updates.

Separating backend implementation from the core runtime provides several benefits. First, it gives backend teams a stable ONNX Runtime plugin ABI (Application Binary Interface) to target, improving long-term compatibility and reducing integration work. Second, it allows the GPU backend to move faster, with shorter release cycles and less dependency on core runtime changes.

Finally, it simplifies deployment. A single, unified MSIX package now supports both legacy GPU EP backends and new plugin-based, ABI-compatible GPU EP implementations. Developers can continue using existing backends while adopting the new plugin model as they see fit and at their own pace.
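For applications that still select backends by name, a small helper can keep a preferred order while falling back gracefully. This is a minimal sketch of the existing provider-name path in ONNX Runtime, not the new plugin API. The preference order and the helper `order_providers` are assumptions for illustration, and which provider names are present depends on the packages installed on a given system.

```python
def order_providers(available, preferred):
    """Return the preferred providers that are available, ending with CPU."""
    chosen = [p for p in preferred if p in available]
    # CPUExecutionProvider is the standard ONNX Runtime fallback.
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


# Assumed preference order for illustration only.
PREFERRED = [
    "VitisAIExecutionProvider",
    "MIGraphXExecutionProvider",
    "DmlExecutionProvider",
]

if __name__ == "__main__":
    try:
        import onnxruntime as ort
    except ImportError:
        ort = None
    if ort is not None:
        providers = order_providers(ort.get_available_providers(), PREFERRED)
        print(providers)
        # "model.onnx" is a placeholder path:
        # session = ort.InferenceSession("model.onnx", providers=providers)
```

Keeping the selection logic in a pure function makes it easy to unit test without any accelerator present.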

AMD ROCm™ Software 7.1 Library and Runtime Upgrade

The GPU EP has been upgraded to ROCm 7.1-based runtime and libraries. This delivers measurable gains in kernel-level performance and runtime stability while reducing a model’s overall memory footprint. These updates improve out-of-the-box behavior in Windows ML inference workloads running on AMD GPUs.

Optimizations for Diffusion Models

AMD delivered targeted performance and memory optimizations for diffusion workloads, reducing memory use and improving throughput for widely used models such as Stable Diffusion 3.5 (SD3.5), SDXL, and FLUX.1. These improvements make image generation pipelines more efficient, particularly in memory-constrained scenarios, and improve both responsiveness and scalability when these workloads are run on AMD GPUs. Performance data for AMD RDNA4™-based GPUs on diffusion¹ and standard vision² models is shown below:

Figure: AMD RDNA4™ GPU performance improvement over baseline (diffusion models)

Standard Vision Models:

Figure: AMD RDNA4™ GPU performance improvement over baseline (standard vision models)
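For readers reproducing this kind of driver-to-driver comparison, "improvement over baseline" can be derived from raw throughput as sketched below. This is an assumption about the arithmetic, not a description of the methodology AMD used. The function names are hypothetical, the geometric mean is one common way to summarize several models, and the numbers are placeholders rather than measurements.

```python
from math import prod


def improvement_pct(baseline, candidate):
    """Percent throughput gain of candidate over baseline (items per second)."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (candidate / baseline - 1.0) * 100.0


def geomean_speedup(pairs):
    """Geometric mean of candidate/baseline ratios over (baseline, candidate) pairs."""
    ratios = [candidate / baseline for baseline, candidate in pairs]
    return prod(ratios) ** (1.0 / len(ratios))


# Placeholder throughputs, not measured results.
gain = improvement_pct(baseline=10.0, candidate=12.5)
summary = geomean_speedup([(1.0, 2.0), (1.0, 8.0)])
```

The geometric mean is less sensitive than the arithmetic mean to a single model with an unusually large speedup.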

Expanded Hardware Support

GPU EP support now extends to AMD Ryzen™ AI 400 Series processors, enabling GPU-accelerated AI inference on the latest Ryzen AI platforms. This expansion broadens the hardware targets available to developers using the ONNX Runtime GPU EP for client-side AI workloads on Windows.

LLM Stability and Performance Enhancements

Large language model (LLM) workloads remain a key focus area. Since Ignite 2025, we’ve delivered further stability and performance improvements through tighter integration with onnxruntime-genai (OGA) and ongoing stability work via DirectML (DML). Together, these changes have increased execution reliability, throughput, and overall efficiency when running transformer-based models on AMD GPUs.

Dx Compute Graph (CG) API with Windows AI API on AMD Radeon™ GPUs

Following the public launch of the DxCG API at GDC 2026, we are excited to announce a preview driver that integrates DxCG with Windows ML on AMD GPUs via the Windows ML API. This preview provides early access to the Windows AI API, enabling a more direct and efficient mapping of machine learning graphs to modern GPU backends, while strengthening alignment across Windows ML, DirectX, and AMD GPU execution providers.

Integrating DxCG into Windows ML is a key step toward tighter cohesion across the Windows AI stack. It opens new opportunities for innovation at both the ML compiler level and within hardware acceleration, enabling developers to deliver higher-performance AI workloads on Windows.

On the AMD side, the DxCGC driver-based backend will introduce advanced graph optimizations and GPU kernels tuned specifically for AMD Radeon™ GPUs. These optimizations will enable efficient execution paths through the Windows ML → DxCGC pipeline, delivering improved performance and scalability on AMD devices.

This preview driver also enables Microsoft Phi Silica workloads to run optimally on AMD platforms using these driver-based optimizations. Please note that this driver is intended for preview purposes only and may not be suitable for production environments.

Looking Ahead

The updates delivered since Microsoft Ignite 2025 and highlighted at Build 2026 reflect our continued investment in making AI on Windows more performant, scalable, and easier to use when developing for different AMD platforms.

Whether you’re deploying diffusion models, running local LLMs, or building on the next generation of Windows ML APIs, these improvements are designed to help you move faster and ship with greater confidence on AMD hardware.

We look forward to continuing to work with the developer community throughout 2026.

1 - Testing as of May 2026 by AMD Engineering Labs on a test system configured with AMD Ryzen™ 9 7950X3D CPU, ROG Crosshair X670E Extreme motherboard, 32 GB DDR5 Memory, Windows 11 25H2 and AMD Radeon™ RX 9070 XT (Driver 26.3.1) vs a similarly configured system with graphics driver version 25.10.1 comparing Text-To-Image Diffusion Model Throughput between drivers. The following benchmark(s) were used: SDXL (1024x1024), SD3 Medium (1024x1024), SD3.5 Medium (1024x1024), and SDXL Turbo (512x512). System manufacturers may vary configurations, yielding different results. RX-1262. 

2 - Testing as of May 2026 by AMD Engineering Labs on a test system configured with AMD Ryzen™ 9 7950X3D CPU, ROG Crosshair X670E Extreme motherboard, 32 GB DDR5 Memory, Windows 11 25H2 and AMD Radeon™ RX 9070 XT (Driver 26.3.1) vs a similarly configured system with graphics driver version 25.10.1 comparing Vision Model Throughput between drivers. The following benchmark(s) were used: ArcFaceResnet100, DenseNet, EfficientNet-lite4, FNSCandy, GoogleNet, InceptionV1, MobilenetV2_fp16, MobilenetV2_fp32, MobilenetV3_fp32, Resnet50_fp16, Resnet50_fp32, RetinaNet, ShuffleNetV2, Squeezenet_fp32, SuperRes, ZFNet512, and vgg19. System manufacturers may vary configurations, yielding different results. RX-1263.

Article By

Hisham Chowdhury

Fellow, GPU Technology and Engineering, AMD

Contributors

Joshua Hort

Senior Director, Head of ISV Enabling for Computing & Graphics

Bader Alam
