RAG with Hybrid LLM on AMD Ryzen AI Processors
Aug 15, 2025

Introduction
Retrieval-augmented generation (RAG) has become a popular approach for building LLM-based applications that require accurate, context-aware responses by grounding the model in relevant external data. While most RAG deployments rely on cloud-based inference, running RAG fully on-device can improve privacy, reduce latency, and ensure availability without an internet connection, provided the models are optimized for local execution.
This blog showcases a foundational RAG application running on a PC with an AMD Ryzen™ AI processor, leveraging both the NPU and the GPU for efficient, high-compute, low-power on-device inference. The sample RAG framework is built using the popular LangChain library, integrated into the Ryzen AI software environment with a pre-quantized and preprocessed LLM based on the ONNX Runtime GenAI (OGA) framework. The other core component of the RAG flow, the embedding model, is also compiled for and runs on the NPU to enable efficient, low-power embedding generation.
LLMs present unique inference challenges due to the different compute-bound and memory-bound characteristics of their prefill and decode phases. These challenges become more pronounced for on-device acceleration, where available compute and memory bandwidth are more constrained than in cloud environments. A key aspect of this setup is the OGA hybrid model, which enables disaggregated inference by splitting execution between the NPU and GPU. During the prefill phase, high-compute workloads are offloaded to the NPU, while the GPU handles the decode phase, where high memory bandwidth is critical. This hybrid execution improves end-to-end performance and responsiveness while keeping inference fully on-device.

Figure 1: RAG on Ryzen AI software environment with Hybrid LLM
RAG Pipeline Overview
The example implements a basic RAG flow that retrieves relevant information from local documents using Facebook AI Similarity Search (FAISS) as the vector store. For brevity, details such as document loading, chunking, indexing, and retrieval logic used in this example are not discussed here, as they follow common patterns found in most RAG implementations. Instead, we focus on the key custom classes that integrate the local LLM and local embedding model into the LangChain framework.
Note: For the associated code and a detailed README with step-by-step instructions, please refer to the GitHub repository: https://github.com/amd/RyzenAI-SW/tree/main/example/llm/RAG-OGA
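For reference, the retrieval side of such a pipeline typically looks like the minimal sketch below. It assumes standard LangChain components (PyPDFLoader, RecursiveCharacterTextSplitter, FAISS); the file name, chunking parameters, and retriever settings are illustrative rather than taken from the repository, and custom_embeddings refers to the wrapper class described later in this blog.

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Load and chunk the local document (file name and chunk sizes are illustrative)
docs = PyPDFLoader("ryzen_ai_documentation.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Build the FAISS index with the NPU-backed embedding model described below
embeddings = custom_embeddings(model_path="bge-large-en-v1.5.onnx", tokenizer_name="BAAI/bge-large-en-v1.5")
vector_store = FAISS.from_documents(chunks, embeddings)

# Retrieve the chunks most relevant to a user question and pass them to the LLM
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
context_docs = retriever.invoke("What is the NPU?")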
Custom LLM class
To integrate the hybrid LLM model into the LangChain framework, we implement a custom LLM class that wraps the ONNX Runtime GenAI (OGA) model using the OGA Python API. The LLM used in this example is a precompiled, ready-to-run Llama-3.2-3B-Instruct model available from the AMD Hugging Face repository. This custom wrapper handles tokenization and model inference, and integrates seamlessly with LangChain's LLM interface.
from typing import Any, List, Optional

import onnxruntime_genai as og
from langchain_core.language_models.llms import LLM  # LangChain base class for custom LLM wrappers

class custom_llm(LLM):
    ...
    def __init__(self, model_path: str, **kwargs: Any):
        ...
        # Load the precompiled hybrid OGA model and its tokenizer
        self._model = og.Model(model_path)
        self._tokenizer = og.Tokenizer(self._model)
        self._tokenizer_stream = self._tokenizer.create_stream()
        ...

    def _prepare_generator(self, prompt: str) -> og.Generator:
        ...
        # Tokenize the prompt and configure the sampling parameters
        input_tokens = self._tokenizer.encode(prompt)
        params = og.GeneratorParams(self._model)
        search_options = {
            "max_length": min(2048, len(input_tokens) + 1024),
            "temperature": 0.5,
            "top_k": 40,
            "top_p": 0.9
        }
        params.set_search_options(**search_options)
        generator = og.Generator(self._model, params)
        ...

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any) -> str:
        ...
        generator = self._prepare_generator(prompt)
        response_tokens = []
        # Token-by-token generation: prefill runs on the NPU, decode on the GPU
        while not generator.is_done():
            ...
            generator.generate_next_token()
            token = generator.get_next_tokens()[0]
            response_tokens.append(token)
        # Decode the accumulated tokens into the final response string
        decoded_tokens = [self._tokenizer_stream.decode(t) for t in response_tokens]
        response = "".join(decoded_tokens)
        return response
For more information related to ONNX Runtime GenAI API-based LLM deployment on an AI PC powered by a Ryzen AI processor, refer to: https://ryzenai.docs.amd.com/en/latest/hybrid_oga.html.
Custom Embedding class
Similar to the custom LLM, a custom embedding class is implemented to wrap the ONNX model for the BGE large embedding model (bge-large-en-v1.5). This model generates a 1024-dimensional embedding vector and supports a maximum sequence length of 512 tokens. The ONNX Runtime session is configured to use the Vitis AI Execution Provider (EP) from the Ryzen AI software stack, which compiles and caches the model for the NPU. The compiled model is stored locally, enabling fast, low-power embedding generation on the NPU during subsequent runs.
from typing import List

import numpy as np
import onnxruntime as ort
from langchain_core.embeddings import Embeddings  # LangChain base class for custom embeddings
from transformers import AutoTokenizer

class custom_embeddings(Embeddings):
    def __init__(self, model_path: str, tokenizer_name: str):
        # Tokenizer for the BGE model, loaded here with Hugging Face transformers
        # (the repository setup may differ)
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        # ONNX Runtime session using the Vitis AI EP, which compiles and caches
        # the model for the NPU on the first run
        self.session = ort.InferenceSession(
            model_path,
            providers=["VitisAIExecutionProvider"],
            provider_options=[{
                "config_file": "vaiml_config.json",
                "cache_dir": "./",
                "cacheKey": "modelcachekey_bge"
            }]
        )

    def _embed(self, texts: List[str]) -> List[List[float]]:
        ...
        for text in texts:
            # Tokenize to the model's maximum sequence length of 512 tokens
            inputs = self.tokenizer(
                text,
                max_length=512,
                padding="max_length",
                truncation=True,
                return_tensors="np",
                return_token_type_ids=False
            )
            input_ids = inputs["input_ids"]
            total_input_tokens += np.count_nonzero(input_ids)
            onnx_inputs = {
                "input_ids": input_ids.astype(np.int64),
                "attention_mask": inputs["attention_mask"].astype(np.int64)
            }
            # Run the embedding model on the NPU
            outputs = self.session.run(None, onnx_inputs)
            ...
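LangChain's Embeddings interface expects embed_documents and embed_query methods, which in this example are thin wrappers around _embed. The sketch below is a hypothetical completion of the class plus a simple standalone usage; it is not copied from the repository, and the model path and tokenizer name are illustrative.

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Called by LangChain when building the FAISS index from document chunks
        return self._embed(texts)

    def embed_query(self, text: str) -> List[float]:
        # Called by LangChain to embed a single user query at retrieval time
        return self._embed([text])[0]

# Standalone usage (paths and names are illustrative)
embedder = custom_embeddings(model_path="bge-large-en-v1.5.onnx", tokenizer_name="BAAI/bge-large-en-v1.5")
vectors = embedder.embed_documents(["Ryzen AI pairs an NPU with a GPU for on-device inference."])
print(len(vectors[0]))  # 1024-dimensional embedding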
For more information about compilation of the embedding model, refer to the official documentation page: https://ryzenai.docs.amd.com/en/latest/modelrun.html. You can also find a standalone example of embedding model compilation here: https://github.com/amd/RyzenAI-SW/tree/main/example/gte-large-en-v1.5-bf16.
Sample Questions and Answers
Here are a few sample questions and their corresponding answers generated by the RAG example using local documents. The input document used for this demonstration is the Ryzen AI documentation PDF.
Question 1:
python rag.py
Enter your question: what is NPU and tell me the three important features of NPU.
(The RAG-based contextual answer generated from the local documents follows in the console output.)
LLM Performance on Ryzen AI Processor
The hybrid LLM implementation on the Ryzen AI processor provides industry-leading performance for key metrics such as time-to-first-token (TTFT) and tokens-per-second (TPS). By distributing execution across the NPU and GPU, it delivers optimal performance in both the compute-bound prefill phase and the memory-bound decode phase. Below is a sample performance snapshot collected from this RAG example running on a Ryzen AI 9 HX 370 processor-based PC. Actual numbers may vary depending on the LLM used, model version, and specific system configuration.
Question   Avg Input Tokens   Avg Output Tokens   Avg TTFT (Sec)   Avg TPS
Q1         1608               440                 2.272704         30.07
Q2         1172               232                 1.86373          32.65
Q3         1452               11                  2.099082         24.53
For the latest performance data, refer to the GitHub example associated with this blog, which will reflect the most up-to-date results.
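For context, both metrics can be measured directly around the generation loop shown earlier. The sketch below is a minimal, hypothetical way to compute TTFT and TPS with the OGA API used above; it is not the instrumentation from the repository, and it assumes the generator has already been prepared with the prompt tokens.

import time

def measure_generation(generator, tokenizer_stream):
    # TTFT: time from the start of generation (prefill) to the first output token
    # TPS: output tokens produced per second during the decode phase
    start = time.perf_counter()
    first_token_time = None
    tokens = []
    while not generator.is_done():
        generator.generate_next_token()
        if first_token_time is None:
            first_token_time = time.perf_counter()
        tokens.append(generator.get_next_tokens()[0])
    end = time.perf_counter()
    ttft = first_token_time - start
    tps = (len(tokens) - 1) / (end - first_token_time) if len(tokens) > 1 else 0.0
    text = "".join(tokenizer_stream.decode(t) for t in tokens)
    return text, ttft, tps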
Conclusion
In this blog, we demonstrated a sample RAG application built with LangChain on an AI PC powered by a Ryzen AI processor, showing how to efficiently run both the LLM and the embedding model through the Ryzen AI Software. By leveraging hybrid execution, with the NPU handling the compute-intensive prefill phase and the GPU handling the memory-intensive decode phase, we achieved faster end-to-end inference, lower latency, and reduced power consumption, all while keeping data fully on-device.
This approach shows how performance-optimized, locally executed RAG flows can deliver responsive, private, and energy-efficient AI experiences without relying on the cloud, providing a foundation for building more advanced RAG or agentic applications using LangChain or similar frameworks on Ryzen AI processor-based PCs.
Call to Action
Discover the full potential of RAG on AMD Ryzen AI processors by exploring the example and building your own applications. Access the full code and detailed instructions in our GitHub repository.
