Advancing Windows ML Acceleration with AMD at Microsoft Build 2026

Jun 02, 2026


Since Microsoft Ignite 2025, AMD has continued advancing NPU- and GPU-accelerated AI on Windows, with a focus on developer productivity, performance, and platform scale. Ahead of Microsoft Build 2026, we delivered meaningful updates across the NPU and GPU Execution Providers (EPs) as well as the broader Windows ML ecosystem for AMD APUs, NPUs, and discrete GPUs.

These improvements span the Windows AI stack and Microsoft’s emerging DirectX Compute Graph Compiler (DxCGC). DxCGC uses MLIR-based representations to bring full model graphs into the DirectX pipeline, enabling compiler-driven graph optimization, memory planning, operator fusion, and GPU execution on AMD hardware. On the NPU side, the latest work improves inference performance, developer tooling, benchmarking, and web platform integration.

This post highlights the platform enhancements delivered during this period and how they improve the end-to-end developer experience for AI inference workloads, including diffusion models and large language models (LLMs).

NPU Execution Provider Improvements

The NPU EP now supports more LLMs and runs supported models faster. Time to first token (TTFT) improved by up to 1.5x, while sustained token generation (tokens per second) increased by up to 3.5x. In practice, this means users receive both the initial response and the full text more quickly when generating output on a local device.
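To see how the two metrics combine, a rough model of end-to-end response time is TTFT plus the remaining tokens divided by the sustained generation rate. The sketch below applies the upper-bound speedups quoted above to a hypothetical baseline (0.9 s TTFT, 10 tokens per second, 256 output tokens). The baseline figures and the `response_time` helper are illustrative assumptions, not measurements from AMD hardware.

```python
def response_time(ttft_s: float, tokens_per_s: float, n_tokens: int) -> float:
    """Approximate end-to-end latency: first token, then the rest at a steady rate."""
    if n_tokens < 1:
        raise ValueError("n_tokens must be at least 1")
    return ttft_s + (n_tokens - 1) / tokens_per_s


# Hypothetical baseline, chosen only for illustration.
BASE_TTFT_S = 0.9
BASE_TOKENS_PER_S = 10.0
N_TOKENS = 256

# Upper-bound speedups quoted in this post.
TTFT_SPEEDUP = 1.5
THROUGHPUT_SPEEDUP = 3.5

before = response_time(BASE_TTFT_S, BASE_TOKENS_PER_S, N_TOKENS)
after = response_time(
    BASE_TTFT_S / TTFT_SPEEDUP,
    BASE_TOKENS_PER_S * THROUGHPUT_SPEEDUP,
    N_TOKENS,
)

print(f"before: {before:.2f} s, after: {after:.2f} s, speedup: {before / after:.2f}x")
```

Under this simple model, long responses are dominated by the throughput gain, while very short responses benefit mainly from the TTFT improvement.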

These gains come from improvements across the execution path, including operator fusion, more efficient kernels, and reduced overhead in end-to-end model execution. Model coverage has also doubled, with added support for GPT-OSS-20B, Qwen 2.5, and other models. The AI Toolkit (AITK) now provides per-operator execution timelines and NPU utilization breakdowns, helping developers identify and resolve inference bottlenecks with greater precision.
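As a rough illustration of how a per-operator timeline helps locate bottlenecks, the sketch below sums busy time per operator and ranks operators by their share of the total. The event tuples and the `time_by_operator` helper are hypothetical; this is not the AITK export format or API, only a minimal model of the idea.

```python
from collections import defaultdict

# Hypothetical timeline events: (operator name, start in ms, end in ms).
# This is an illustrative format, not the AI Toolkit export format.
timeline = [
    ("MatMul", 0.0, 4.0),
    ("Softmax", 4.0, 5.0),
    ("MatMul", 5.0, 9.5),
    ("LayerNorm", 9.5, 10.0),
]


def time_by_operator(events):
    """Sum the busy time per operator and return (name, ms, share) sorted by ms."""
    totals = defaultdict(float)
    for name, start_ms, end_ms in events:
        totals[name] += end_ms - start_ms
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, ms, ms / grand_total) for name, ms in ranked]


for name, ms, share in time_by_operator(timeline):
    print(f"{name:<10} {ms:5.1f} ms  {share:6.1%}")
```

In this made-up trace, the operator at the top of the ranking is the first candidate for fusion or kernel tuning.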

Strong scores in Procyon® AI Computer Vision Benchmark 2.0

Windows ML achieved excellent scores on the Procyon AI Computer Vision Benchmark 2.0, validating end-to-end performance across a range of workloads on similar NPUs.

WebNN Support (Experimental)

An initial Web Neural Network API (WebNN) integration now routes browser-based inference through the Windows ML AMD NPU execution path, extending hardware acceleration to web applications.

GPU Execution Provider (EP) Advancements

We introduced a new plugin interface for the ONNX Runtime GPU Execution Provider with the latest GPU EP release. This new release aligns with the current ONNX Runtime plugin API and separates GPU backend implementation from the ONNX Runtime core, allowing GPU backends to evolve without requiring core runtime updates.

Separating backend implementation from the core runtime provides several benefits. First, it gives backend teams a stable ONNX Runtime plugin ABI (Application Binary Interface) to target, improving long-term compatibility and reducing integration work. Second, it allows the GPU backend to move faster, with shorter release cycles and less dependency on core runtime changes.

Finally, it simplifies deployment. A single, unified MSIX package now supports both legacy GPU EP backends and new plugin-based, ABI-compatible GPU EP implementations. Developers can continue using existing backends and adopt the new plugin model at their own pace.
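For developers who call ONNX Runtime directly, opting in to a plugin-based EP could look roughly like the sketch below. It assumes a recent ONNX Runtime Python build that exposes the plugin EP calls `register_execution_provider_library`, `get_ep_devices`, and `SessionOptions.add_provider_for_devices`. The registration name, library path, and EP name are placeholders supplied by the caller, not the names AMD ships, and on Windows ML the platform may handle acquisition and registration for you.

```python
def select_ep_devices(ep_devices, ep_name):
    """Keep only the devices exposed by the execution provider named ep_name."""
    return [d for d in ep_devices if d.ep_name == ep_name]


def create_session(model_path, registration_name, library_path, ep_name):
    """Register a plugin EP library and build a session restricted to its devices.

    All four arguments are caller-supplied placeholders (assumptions for this sketch).
    """
    import onnxruntime as ort  # imported lazily; needs a build with plugin EP support

    # Load the plugin library into the ONNX Runtime environment.
    ort.register_execution_provider_library(registration_name, library_path)

    # Discover (EP, hardware device) pairs and keep the ones from the chosen EP.
    devices = select_ep_devices(ort.get_ep_devices(), ep_name)
    if not devices:
        raise RuntimeError(f"No devices found for execution provider {ep_name!r}")

    options = ort.SessionOptions()
    options.add_provider_for_devices(devices, {})  # empty dict: default EP options
    return ort.InferenceSession(model_path, sess_options=options)
```

Because the session is configured from discovered devices rather than a hard-coded backend, the same application code can keep working as the backend library is updated independently of the core runtime.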

AMD ROCm™ Software 7.1 Library and Runtime Upgrade

The GPU EP has been upgraded to a ROCm 7.1-based runtime and libraries. This delivers measurable gains in kernel-level performance and runtime stability while reducing a model’s overall memory footprint. These updates improve out-of-the-box behavior in Windows ML inference workloads running on AMD GPUs.

Optimizations for Diffusion Models

AMD delivered targeted performance and memory optimizations for diffusion workloads, reducing memory use and improving throughput for widely used models such as Stable Diffusion 3.5 (SD3.5), SDXL, and FLUX.1. These improvements make image generation pipelines more efficient, particularly in memory-constrained scenarios, and improve both responsiveness and scalability when these workloads are run on AMD GPUs. Additional performance data for AMD RDNA4™-based GPUs in diffusion and standard vision models is shared below: 

[Figure: AMD RDNA4™ GPU performance improvement over baseline.]

Standard Vision Models:

[Figure: AMD RDNA4™ GPU performance improvement over baseline.]

Expanded Hardware Support

GPU EP support now extends to AMD Ryzen™ AI 400 Series processors, enabling GPU-accelerated AI inference on the latest Ryzen AI platforms. This expansion broadens the hardware targets available to developers using the ONNX Runtime GPU EP for client-side AI workloads on Windows.

LLM Stability and Performance Enhancements

Large language model (LLM) workloads remain a key focus area. Since Ignite 2025, we’ve delivered additional stability and performance improvements through enhanced integration with onnxruntime-genai (OGA) and continued stability work in DirectML (DML). Together, these improvements have increased execution reliability, throughput, and overall efficiency when running transformer-based models on AMD GPUs.

DirectX Compute Graph (DxCG) API with Windows ML IR on AMD Radeon™ GPUs (Coming Soon)

Following the public launch of the DxCG API at GDC 2026, we are excited to announce an upcoming preview of DxCG integration with Windows ML on AMD GPUs via the Windows ML API. This preview will provide early access to the Windows ML Intermediate Representation (IR) pipeline, enabling a more direct and efficient mapping of machine learning graphs to modern GPU backends, while strengthening alignment across Windows ML, DirectX, and AMD GPU execution providers.

Integrating DxCG into Windows ML is a key step toward tighter cohesion across the Windows AI stack. It opens new opportunities for innovation at both the ML compiler level and within hardware acceleration, enabling developers to deliver higher-performance AI workloads on Windows.

On the AMD side, the DxCGC driver-based backend will introduce advanced graph optimizations and GPU kernels tuned for AMD Radeon™ GPUs. These optimizations will enable efficient execution paths through the Windows ML → DxCGC pipeline, delivering improved performance and scalability on AMD devices.

This upcoming preview driver will enable Microsoft Phi Silica workloads to run optimally on AMD platforms using these driver-based optimizations. Please note that this driver, once released, is intended for preview purposes only and may not be suitable for production environments.

Looking Ahead

The updates delivered since Microsoft Ignite 2025 and highlighted at Build 2026 reflect our continued investment in making AI on Windows more performant, more scalable, and easier to use when developing for different AMD platforms.

Whether you’re deploying diffusion models, running local LLMs, or building on the next generation of Windows ML APIs, these improvements are designed to help you move faster and ship with greater confidence on AMD hardware.

We look forward to continuing to work with the developer community throughout 2026.

Article By


Fellow, GPU Technology and Engineering, AMD

Contributors


Senior Director, Head of ISV Enabling for Computing & Graphics
