Unlocking On-Device ASR with Whisper on Ryzen AI NPUs
Sep 29, 2025

If you’ve used speech transcription or voice assistants, chances are you’ve relied on cloud-based speech recognition. It works — until you run out of API credits, lose internet, worry about privacy, or your CPU starts to lag.
What if you could run powerful speech recognition entirely on your device — efficiently, privately, and without cloud dependencies?
With the latest AMD Ryzen™ AI software, you can deploy Whisper base, small, and medium models for real-time speech-to-text using Neural Processing Unit (NPU) acceleration. These models are part of Whisper, an open-source automatic speech recognition (ASR) and speech translation system developed by OpenAI. Whisper supports multilingual transcription and translation, converting spoken audio into text with high accuracy.
This blog is primarily intended for users with Ryzen AI 300 series PCs, but if you're using a standard CPU, you can still follow along and run Whisper locally.
Why Run Whisper on the NPU?
Running Whisper locally on the NPU offers several compelling advantages:
- Privacy First: Your audio stays on your device — no cloud uploads, no streaming, no risk of third-party eavesdropping.
- Performance: The NPU runs inference using Block Floating Point 16 (BFP16) precision, which is nearly as fast as INT8 but more accurate. This means instant voice commands and real-time captions.
- Power Efficiency: NPUs are purpose-built for AI workloads and consume significantly less power than CPUs or GPUs. This translates to better battery life and cooler, quieter devices.
- Freeing up the CPU/GPU: Offloading automatic speech recognition (ASR) to the NPU frees up your CPU and GPU for other tasks—whether you're gaming, browsing, or compiling code.
Ready to Try It Yourself?
Before diving in, we recommend familiarizing yourself with the Ryzen AI documentation to understand the platform and its capabilities. For a full, detailed walkthrough on exporting, optimizing, and running Whisper models on the Ryzen AI NPU (including example scripts, evaluation tools, and configuration files), check out the official RyzenAI-SW GitHub repository, which hosts the ASR demo.
👉 https://github.com/amd/RyzenAI-SW
👉 https://github.com/amd/RyzenAI-SW/tree/main/demo/ASR/Whisper
This repo contains everything you need to get started quickly, including step-by-step instructions, pre-built demos, and performance benchmarks.
We use the Hugging Face Optimum toolkit to export Whisper models optimized for the Ryzen AI NPU. Follow the instructions in the repository linked above to export the models with the Optimum CLI and set them up for NPU execution.
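If you prefer the Python API over the CLI, a minimal export sketch looks like the following. The repository's instructions remain authoritative, and any NPU-specific export flags used by the demo may differ from this generic example.

```python
# Minimal sketch: export Whisper to ONNX with Hugging Face Optimum.
# The RyzenAI-SW demo drives this via the optimum CLI with its own
# configuration; this is only the generic Python-API equivalent.
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import WhisperProcessor

model_id = "openai/whisper-base"  # base, small, and medium are supported on the NPU

# export=True converts the PyTorch checkpoint into ONNX encoder/decoder graphs
model = ORTModelForSpeechSeq2Seq.from_pretrained(model_id, export=True)
processor = WhisperProcessor.from_pretrained(model_id)

# Save the ONNX files; these become the inputs to the NPU compilation step
model.save_pretrained("whisper-base-onnx")
processor.save_pretrained("whisper-base-onnx")
```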
Support Details
Whisper base, small, and medium (multilingual versions) are currently supported. Whisper large exceeds the practical limits of current NPU hardware and is not supported at this time.
For optimal performance, the NPU prefers static input shapes. Unlike dynamic shapes, which are harder to optimize and can introduce latency, static shapes allow the NPU to run faster and more efficiently.
- Live Transcription: Use shorter static sequence lengths to minimize delay and improve responsiveness.
- Longer Audio: For offline transcription, set the sequence length up to 448 tokens for better throughput.
Tuning the sequence length to match your use case—whether real-time or offline—helps you get the most out of Whisper on the NPU.
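To make "static shapes" concrete, here is a minimal sketch: Whisper's feature extractor always pads audio to a fixed 30-second mel spectrogram, and the decoder token buffer can be padded to one fixed length so the compiled graph never sees a new shape. The pad_tokens helper below is illustrative, not part of the demo.

```python
# Sketch: preparing fixed-shape inputs for an NPU session.
# WhisperFeatureExtractor pads/truncates audio to 30 s by default,
# producing a static (1, 80, 3000) mel input for base/small models.
import numpy as np
from transformers import WhisperFeatureExtractor

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
audio = np.zeros(16000 * 7, dtype=np.float32)  # a 7 s clip at 16 kHz

features = extractor(audio, sampling_rate=16000, return_tensors="np")
print(features.input_features.shape)  # (1, 80, 3000) -- same shape for any clip

# Pad the decoder token buffer to one fixed length so the compiled NPU
# graph is reused for every request. 448 is Whisper's maximum; a live
# captioning setup would pick something much smaller.
MAX_TOKENS = 448

def pad_tokens(ids, pad_id):
    out = np.full((1, MAX_TOKENS), pad_id, dtype=np.int64)
    out[0, : len(ids)] = ids
    return out
```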
The first time you target the NPU, Whisper’s encoder and decoder models undergo compilation. This process applies all necessary optimizations—including kernel fusion—and stores the results in a cache location specified via provider options. Initial compilation can take 5 to 15 minutes per model, depending on model size, but it only happens once. After that, inference loads the compiled model from the cache and starts immediately.
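In code, targeting the NPU and naming a cache location looks roughly like the sketch below. The provider option keys follow the Ryzen AI documentation for the Vitis AI execution provider; treat the exact keys, file names, and paths as placeholders to verify against your installed version.

```python
# Sketch: creating an ONNX Runtime session on the NPU with a compile cache.
# Provider option keys follow the Ryzen AI docs for VitisAIExecutionProvider;
# verify them against the Ryzen AI version you have installed.
import onnxruntime as ort

session = ort.InferenceSession(
    "whisper-base-onnx/encoder_model.onnx",
    providers=["VitisAIExecutionProvider"],
    provider_options=[{
        "config_file": "vaip_config.json",   # compiler config shipped with Ryzen AI
        "cacheDir": "./npu_cache",           # where compiled artifacts are stored
        "cacheKey": "whisper_base_encoder",  # reused on later runs, skipping recompilation
    }],
)
```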
Evaluating Whisper on Ryzen AI NPU
Evaluating Whisper on the NPU is essential because:
- The NPU uses BFP16 precision and custom kernels, which can affect both speed and accuracy. For more information on quantizing to BFP16, see the AMD Quark Guide.
- Performance on NPU differs significantly from CPU inference.
How we evaluate:
- Word Error Rate (WER) for English, measured on the LibriSpeech test-clean dataset.
- Character Error Rate (CER) for Chinese, measured on datasets such as AISHELL-1.
These metrics compare model outputs against true transcripts to measure accuracy.
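As a small illustration of the metric computation itself, here is a sketch using the Hugging Face evaluate library. The reference and hypothesis strings are made up, and real evaluations typically normalize text (casing, punctuation) before scoring.

```python
# Sketch: computing WER and CER with the Hugging Face `evaluate` library.
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

references = ["the quick brown fox jumps over the lazy dog"]
predictions = ["the quick brown fox jumped over the lazy dog"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```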
Performance Highlights: Ryzen AI NPU vs CPU
We tested Whisper base and small models running on the Ryzen AI NPU (without KV caching). Here’s how they compare to CPU-only runs on 30-second audio transcriptions.
| Model | Device | Real-Time Factor (RTF) | Time to First Token (TTFT) |
|---|---|---|---|
| Whisper Base | NPU | 0.35 | 0.45 s |
| Whisper Base | CPU | 0.7 | 1.8 s |
| Whisper Small | NPU | 1.2 | 0.85 s |
| Whisper Small | CPU | 2.2 | 3.1 s |
Table 1: Whisper Model Performance on Ryzen AI NPU vs CPU for 30s audio
An RTF below 1 means faster than real time (e.g., 0.35 means a 30-second clip is processed in about a third of its duration, roughly 3x faster than real time). Note that these runs do not use KV caching; future releases will focus on improved performance with KV caching enabled.
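Both metrics reduce to simple time ratios. Here is a hedged sketch of how you might measure them around your own inference calls; run_encoder and decode_step are placeholders, not functions from the demo.

```python
# Sketch: measuring Real-Time Factor (RTF) and Time To First Token (TTFT).
# `run_encoder` and `decode_step` stand in for your own inference calls.
import time

def measure(audio_seconds, run_encoder, decode_step):
    start = time.perf_counter()
    state = run_encoder()                  # encoder pass over the 30 s window
    token = decode_step(state)             # first decoded token
    ttft = time.perf_counter() - start
    while token is not None:               # decode until end-of-text
        token = decode_step(state)
    total = time.perf_counter() - start
    return total / audio_seconds, ttft     # RTF < 1.0 means faster than real time
```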
Test Configuration for the results in Table 1:
- Processor: AMD Ryzen™ AI 9 HX 370 (12 cores, base clock 2000 MHz) with integrated Radeon™ 890M graphics
- Memory: 32 GB RAM
- Software: Ryzen AI 1.5.0
- NPU MCDM driver: 32.0.203.280, Date: 5/16/2025
- Test Date: 09/20/2025
- OS: Windows 11
Conclusion
Running Whisper on the Ryzen AI NPU unlocks fast, private, and power-efficient speech recognition on-device. With Hugging Face Optimum exports and static input shapes, you can get Whisper models running locally with ease. Performance beats CPU-only setups by a wide margin, making real-time ASR practical on portable and desktop devices alike.
Try it out on GitHub and experience local Whisper for yourself!
