Accelerating Local LLM Conversations with KV Cache Reuse on AMD Ryzen AI

Jun 10, 2026

Local large language models are enabling a new class of AI PC applications, including private document assistants, coding helpers, conversational agents, and domain-specific copilots. Running inference locally can reduce dependence on network connectivity and can help keep prompts and application data on device when the application is configured for local execution. As these applications become more conversational, maintaining low latency across multiple turns becomes increasingly important. In a typical chat session, every new user message is added to the existing conversation history. Without an efficient mechanism for retaining context, the model must repeatedly process the entire conversation before generating a response.

This repeated processing occurs during the prefill phase, where the model converts input tokens into the internal attention state required for generation. As conversations grow longer, prefill can become a significant contributor to overall response latency.

KV cache reuse addresses this challenge by preserving the model's internal attention state across turns. Instead of rebuilding the entire conversation context for every request, the application processes only newly added tokens while reusing previously computed information.

AMD Ryzen™ AI Software 1.7.1 supports this capability through ONNX Runtime GenAI's continuous decoding APIs. In this blog, we'll explore how KV cache reuse works, demonstrate multi-turn conversation handling and conversation rewind capabilities, and measure the latency benefits of avoiding redundant prefill computation.

Large language models process text in two stages:

1. Prefill: Process the input prompt and build the attention state.
2. Decode: Generate output tokens one at a time.

For many conversational workloads, prefill becomes the dominant cost.

Consider a simple conversation:

Turn 1: [System prompt] + [User message 1]
Turn 2: [System prompt] + [User message 1] + [Reply 1] + [User message 2]
Turn 3: [System prompt] + [User message 1] + [Reply 1] + [User message 2] + [Reply 2] + ...

Without context reuse, the model reprocesses the entire conversation at every turn. Even though most of the content has already been seen, the model must rebuild its attention state from scratch.

As conversation history grows, response latency increases and more compute resources are consumed.

With KV cache reuse, previously computed attention information remains available between turns. The model only processes newly appended tokens, significantly reducing redundant work.

| Without Caching | With Caching |
| --- | --- |
| Full prefill on every turn | Prefill only for new tokens |
| Latency grows with conversation length | Latency depends mainly on the new tokens in each turn |
| Higher power draw on long sessions | Lower energy per response |
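
The difference can be made concrete by counting how many tokens go through prefill at each turn. The following is a minimal sketch, not a measurement: the function name and the token counts are made up for illustration, and it assumes that reply tokens produced during decode are already in the cache when reuse is enabled.

```python
def prefill_tokens_per_turn(user_lens, reply_lens, system_len=0, reuse_cache=False):
    """Return the number of tokens that must be prefilled at each turn.

    user_lens[i]  : token count of user message i
    reply_lens[i] : token count of the model's reply to message i
    """
    counts = []
    history = system_len  # tokens already in the conversation before this turn
    for turn, user_len in enumerate(user_lens):
        if reuse_cache:
            # Only the new user message (plus the system prompt on the first turn).
            new_tokens = user_len + (system_len if turn == 0 else 0)
        else:
            # The whole conversation so far is processed again.
            new_tokens = history + user_len
        counts.append(new_tokens)
        history += user_len + reply_lens[turn]
    return counts


users = [200, 50, 50]       # illustrative user message lengths
replies = [100, 100, 100]   # illustrative reply lengths

print(prefill_tokens_per_turn(users, replies, system_len=300))
# [500, 650, 800]
print(prefill_tokens_per_turn(users, replies, system_len=300, reuse_cache=True))
# [500, 50, 50]
```

Without reuse the prefill size keeps growing with the history, while with reuse it tracks only the size of each new message.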

Understanding the KV Cache

Transformer-based LLMs use a self-attention mechanism to relate every token in the input to every other token. For each attention layer, the model computes three vectors per token: Query (Q), Key (K), and Value (V).

The Key and Value vectors are the expensive part: they must be computed for every token in the context before the model can generate any output. The Key-Value (KV) cache is simply the stored result of this computation.

Without KV cache reuse:

Turn N prompt = [Token 1 ... Token N]

Model recomputes K, V for ALL N tokens on every turn

With KV cache reuse:

After Turn 1: KV cache = [K1, V1]

After Turn 2: KV cache = [K1, V1, K2, V2] <- only K2, V2 are new

After Turn 3: KV cache = [K1, V1, K2, V2, K3, V3] <- only K3, V3 are new

Reusing the KV cache means the model only computes attention for the delta — the new tokens added since the last turn — while reading the cached state for everything that came before.
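
The idea can be illustrated with a toy single-head attention written in plain Python. This is a sketch under stated assumptions, not how the runtime is implemented: `project()` is a made-up stand-in for the learned K and V projections, the query vector is arbitrary, and the token IDs are illustrative. It shows that reusing stored keys and values gives the same attention result as recomputing them while doing fewer projections.

```python
import math


def project(token_id):
    """Toy stand-in for one layer's learned K and V projections."""
    key = [math.sin(token_id), math.cos(token_id)]
    value = [float(token_id), float(token_id * token_id)]
    return key, value


def attend(query, keys, values):
    """Softmax attention of a single query over every stored key."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dims = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dims)]


def run_conversation(turns, reuse_cache):
    """Return (attention output after the last turn, number of K/V projections)."""
    query = [1.0, 0.5]  # arbitrary toy query for the newest position
    keys, values, history = [], [], []
    projections = 0
    for new_tokens in turns:
        history.extend(new_tokens)
        if reuse_cache:
            to_project = new_tokens   # only the delta
        else:
            keys, values = [], []     # throw the cache away
            to_project = history      # and rebuild it in full
        for token_id in to_project:
            key, value = project(token_id)
            keys.append(key)
            values.append(value)
            projections += 1
    return attend(query, keys, values), projections


turns = [[3, 1, 4], [1, 5]]
out_cached, n_cached = run_conversation(turns, reuse_cache=True)
out_full, n_full = run_conversation(turns, reuse_cache=False)
print(n_cached, n_full)        # 5 8
print(out_cached == out_full)  # True
```

Both paths end with the same keys and values, so the attention output is identical; only the amount of projection work differs.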

Continuous Decoding in ONNX Runtime GenAI

Ryzen™ AI Software 1.7.1 exposes KV cache reuse through ONNX Runtime GenAI's continuous decoding functionality.

From an application perspective, this enables:

- Persistent multi-turn conversations
- Incremental token appending
- Context retention without prompt reconstruction
- Conversation rewind and branching workflows

The feature is implemented directly within the runtime and requires no additional libraries beyond the standard Ryzen AI Software installation.

Hardware and Software Support

KV cache reuse is supported on all Ryzen AI Software 1.7.1 OGA (ONNX Runtime GenAI) model configurations that run through the `onnxruntime-genai` runtime. This includes:

- Hybrid models: workload split across the CPU, iGPU, and NPU

- NPU-TPS fusion models: workload fully fused on the NPU

Getting Started

Install Ryzen AI Software 1.7.1 by following the official [Installation Instructions](https://ryzenai.docs.amd.com/en/latest/inst.html). The installer creates a Conda environment (`ryzen-ai-1.7.1`) with all required dependencies, including `onnxruntime-genai`.

Download and Configure the Model

This walkthrough uses the AMD Qwen2.5-3B hybrid model. Download it from Hugging Face:

`amd/Qwen2.5_3B_Instruct_rai_1.7.1_hybrid`

Save all downloaded files to a local directory, for example `D:\model\qwen2.5-3B`.

The following sections use this model in two concrete examples that demonstrate the feature.

Building a Multi-Turn Conversation

This example shows how to run a sequence of prompts in which each turn builds on earlier ones. The model retains the earlier turns without any explicit context management in your code.

What the test covers

The following six scenarios represent common real-world patterns where multi-turn memory is essential. Each row is a sequence of prompts to use as `DEFAULT_PROMPTS`:

| Scenario | Sample prompts |
| --- | --- |
| Identity Recall | My name is Alex, I work as a data scientist at Orbital AI. |
| | I live in Berlin and enjoy playing the piano in my free time. |
| | Where do I live and what do I enjoy doing? What is my profession? |
| Referencing Custom Rules | When I say “compact summary,” respond with 3 bullet points only. |
| | Give me a compact summary of: Ancient Egypt (Egyptian: km.t) was a cradle of civilization concentrated along the lower reaches of the Nile River in Northeast Africa. It emerged from prehistoric Egypt around 3150 BC (according to conventional Egyptian chronology), when Upper and Lower Egypt were amalgamated by Menes, who is believed by the majority of Egyptologists to have been the same person as Narmer. The history of ancient Egypt unfolded as a series of stable kingdoms interspersed by the “Intermediate Periods” of relative instability. These stable kingdoms existed in one of three periods: the Old Kingdom of the Early Bronze Age; the Middle Kingdom of the Middle Bronze Age; or the New Kingdom of the Late Bronze Age. |
| | Give me a compact summary of: stages of butterfly |
| Entity Linking | Sarah is a software engineer who leads the AI team. James is her manager. |
| | Who is Sarah's manager? |
| | What team does Sarah lead? |
| List Recall | Here is a list of items I need from the store: apples, bread, milk, eggs, tomatoes, pasta, and cheese. |
| | What was the fourth item? |
| | How many dairy items are in the list? |
| Entity Linking: Multi-character Role Tracking | Alice is a biologist who works at Genomix Lab. She reports to Dr. Patel, who is the head of the lab. Her colleague, Martin, specializes in data analysis. |
| | Who is Alice’s manager? |
| | Who is the data analyst at Genomix Lab? |
| | What is Dr. Patel’s position? |
| | What field does Alice work in? |
| Task Continuity: Math with Memory | A farmer has 3 fields. Each field has 240 apple trees. |
| | If each tree produces 120 apples, how many apples does one field produce? |
| | How many apples in total across all fields? |
| | If 10% of the apples are spoiled, how many are still good? |

Prepare the test script

The Ryzen AI installation ships a reference script at `C:\Program Files\RyzenAI\1.7.1\LLM\example\run_model.py`. Copy it to `run_kvreuse.py` in the same directory and make two modifications:

Step 1 — Replace `DEFAULT_PROMPTS`

Use the Identity Recall scenario as a starting point:

DEFAULT_PROMPTS = [
    " My name is Alex, I work as a data scientist at Orbital AI.",
    " I live in Berlin and enjoy playing the piano in my free time.",
    " Where do I live and what do I enjoy doing?",
    " What is my profession?"
]

> Why the leading space? When a generation is cut off at the token limit, the response may end in a partial word (e.g., `"clas"` instead of `"classical"`). Appending the next prompt directly would run the two together (`"clasMy name is..."`), which can confuse the model. The leading space forces a clean boundary between the generated response and the next user message.
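
The boundary problem can be seen at the plain-string level. The sketch below is only an illustration: `with_leading_space()` and `join_as_text()` are hypothetical helpers, and a real tokenizer works on subword tokens rather than raw characters, so the exact token split will differ by model.

```python
def with_leading_space(prompt):
    """Return the prompt with exactly one leading space (hypothetical helper)."""
    return " " + prompt.lstrip()


def join_as_text(previous_reply, next_prompt):
    """Show how the running conversation reads once the next prompt is appended."""
    return previous_reply + next_prompt


truncated_reply = "I mostly play clas"  # generation stopped mid-word at the token limit

print(join_as_text(truncated_reply, "What is my profession?"))
# I mostly play clasWhat is my profession?

print(join_as_text(truncated_reply, with_leading_space("What is my profession?")))
# I mostly play clas What is my profession?
```

Normalizing prompts through one helper keeps the convention consistent even when prompts come from user input rather than a hard-coded list.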

Step 2 — Replace the generation loop in `generate_text()`

```python
generator = og.Generator(model, params)

for i, prompt in enumerate(prompts):
    # Append only the new prompt tokens — the generator retains all prior context
    generator.append_tokens(tokenizer.encode(prompt))

    loop = 0
    while not generator.is_done() and loop < 120:
        if loop == 0:
            in_pos = generator.get_sequence(0)  # mark start of this response
        loop += 1
        generator.generate_next_token()

    # Extract only the tokens generated for this prompt
    output_tokens = generator.get_sequence(0)[len(in_pos):]
    output_text = tokenizer.decode(output_tokens)

    print(f"\nPrompt #{i+1}: {prompt}")
    print("Output:\n", output_text)

    results.append({
        "prompt": prompt,
        "response": output_text,
        "tokens": len(output_tokens)
    })
    total_tokens += len(output_tokens)
```

Key design decisions explained:

| Decision | Reason |
| --- | --- |
| Single `Generator` instance for all turns | The KV cache lives inside the generator. Reusing it across turns is what enables context accumulation; creating a new generator each turn would discard all prior context. |
| `append_tokens()` instead of restarting | This adds new tokens to the existing KV cache sequence rather than rebuilding the full sequence from scratch. |
| `loop < 120` token limit | Prevents the model from generating excessively long responses. Lower values keep the test fast; raise this for production use. |
| `in_pos = generator.get_sequence(0)` at `loop == 0` | Captures the sequence as it stands when generation starts for this turn. Its length marks where the new reply begins, so only the new tokens are decoded and printed, not the entire conversation history. |
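
The per-turn control flow can be exercised without a model or NPU by substituting a stand-in object. In the sketch below, `FakeGenerator` is a hypothetical stub that only mimics the four methods used in the loop above; it is not the ONNX Runtime GenAI class and produces placeholder token IDs. `run_turn()` is likewise an illustrative helper, and it records the start position before the loop rather than at `loop == 0`, which is one possible variation.

```python
class FakeGenerator:
    """Hypothetical stand-in mimicking the generator methods used in the loop.

    It "generates" placeholder token IDs and reports done after `reply_len`
    tokens per turn. It is not the real runtime class.
    """

    def __init__(self, reply_len=3):
        self.sequence = []
        self.reply_len = reply_len
        self._remaining = 0

    def append_tokens(self, tokens):
        self.sequence.extend(tokens)
        self._remaining = self.reply_len  # a new turn may generate again

    def is_done(self):
        return self._remaining == 0

    def generate_next_token(self):
        self.sequence.append(1000 + len(self.sequence))
        self._remaining -= 1

    def get_sequence(self, index):
        return list(self.sequence)


def run_turn(generator, prompt_tokens, max_new_tokens=120):
    """Append one prompt, generate a reply, and return only the new reply tokens."""
    generator.append_tokens(prompt_tokens)
    start = len(generator.get_sequence(0))  # position where this reply begins
    produced = 0
    while not generator.is_done() and produced < max_new_tokens:
        generator.generate_next_token()
        produced += 1
    return generator.get_sequence(0)[start:]


gen = FakeGenerator(reply_len=3)          # one generator for the whole session
print(run_turn(gen, [1, 2, 3, 4]))        # [1004, 1005, 1006]
print(run_turn(gen, [5, 6]))              # [1009, 1010, 1011]
print(len(gen.get_sequence(0)))           # 12
```

The second call passes only the two new prompt tokens, yet the sequence held by the generator covers both turns.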

Run the test

```bash
conda activate ryzen-ai-1.7.1
cd "C:\Program Files\RyzenAI\1.7.1\LLM\example"
python run_kvreuse.py -m D:\model\qwen2.5-3B
```

What to expect

Prompts 1 and 2 give the model background information. Prompts 3 and 4 test whether the model retained it:

Prompt 3 — "Where do I live and what do I enjoy doing?"

```

You mentioned that you live in Berlin and enjoy playing the piano in your free time. That sounds like a wonderful city and way of life! Berlin is known for its vibrant arts scene, diverse neighborhoods, and rich cultural heritage, making it a fantastic place to explore and enjoy. Playing the piano is a fantastic hobby, and I'm sure you make the most of of of your free time there. Is there anything else you'd like to share about your interests or plans for the future? I'm a big fan of of the city and its culture, and I love exploring new neighborhoods and

```

Prompt 4 — "What is my profession?"

```

You mentioned that you work as a data scientist at Orbital AI. That's great! A data scientist is a fascinating role that involves using statistical and computational techniques to extract insights and knowledge from data. It's a highly technical and versatile field that can be applied to a wide range of industries and problems. As a data scientist, you likely have the opportunity to work with large datasets, develop predictive models, and help organizations make data-driven decisions. What are some of of of the projects or challenges you've worked on recently? I'm a big fan of of of the city and its culture, and

```

The model correctly answers both questions using only what was said in earlier turns, with no context passed explicitly by the caller. The KV cache is handling the memory.

Rewind to a Previous State

The `rewind_to()` API lets you move the KV cache pointer back to any earlier position in the conversation, discarding everything after that point. This is useful when:

- A user wants to take the conversation in a different direction from an earlier turn
- An agentic workflow needs to explore multiple branches from a shared starting point
- An intermediate response was poor quality and you want to re-ask from a clean state

How rewind works

```
Normal flow:
  Q1 → A1 → Q2 → A2 → Q3 → A3
  KV cache: [Q1][A1][Q2][A2][Q3][A3]

After rewind_to(len(Q1 + A1)):
  KV cache: [Q1][A1]          ← everything after A1 is discarded

Re-ask Q2:
  KV cache: [Q1][A1][Q2][A2'] ← model is in the same state as after the original A1

If the model is deterministic: A2' == A2
```
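
The bookkeeping behind a rewind can be sketched with a plain list. `ToySequenceCache` below is a hypothetical model of a session's token sequence, with string labels standing in for tokens; it only illustrates the truncate-then-append semantics described above and says nothing about how the runtime stores its cache.

```python
class ToySequenceCache:
    """Hypothetical model of one session's sequence with per-turn checkpoints."""

    def __init__(self):
        self.tokens = []
        self.checkpoints = []  # sequence length after each completed turn

    def add_turn(self, question, answer):
        self.tokens.extend(question)
        self.tokens.extend(answer)
        self.checkpoints.append(len(self.tokens))

    def rewind_to(self, length):
        if not 0 <= length <= len(self.tokens):
            raise ValueError("rewind target outside the current sequence")
        del self.tokens[length:]  # discard everything after the target
        self.checkpoints = [c for c in self.checkpoints if c <= length]


cache = ToySequenceCache()
cache.add_turn(["Q1"], ["A1a", "A1b"])
cache.add_turn(["Q2"], ["A2"])
cache.add_turn(["Q3"], ["A3"])
print(cache.checkpoints)            # [3, 5, 7]

cache.rewind_to(cache.checkpoints[0])
print(cache.tokens)                 # ['Q1', 'A1a', 'A1b']

cache.add_turn(["Q2-alt"], ["A2-alt"])
print(cache.tokens)                 # ['Q1', 'A1a', 'A1b', 'Q2-alt', 'A2-alt']
print(cache.checkpoints)            # [3, 5]
```

Keeping one checkpoint per turn means any earlier turn boundary can serve as a branch point later.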

Prepare the rewind test script

Copy `run_kvreuse.py` to `run_rewind.py`. Replace the generation loop with:

```python
# Phase 1: Run Q1, Q2, Q3 in sequence to build up the full KV cache
for i in range(3):
    generator.append_tokens(tokenizer.encode(prompts[i]))
    loop = 0
    while not generator.is_done() and loop < 120:
        loop += 1
        generator.generate_next_token()

    if i == 0:
        # Save the sequence after Q1+A1; its length is the rewind target
        output_tokens0 = generator.get_sequence(0)
    if i == 1:
        # Save the decoded sequence through A2 (Q1+A1+Q2+A2) to compare after the rewind
        output_tokens1 = generator.get_sequence(0)
        output_text1 = tokenizer.decode(output_tokens1)

# Phase 2: Rewind to the position after Q1+A1
generator.rewind_to(len(output_tokens0))

# Re-ask Q2 from this rewound state
generator.append_tokens(tokenizer.encode(prompts[1]))
loop = 0
while not generator.is_done() and loop < 120:
    loop += 1
    generator.generate_next_token()

# The sequence now holds Q1+A1+Q2+A2' and is decoded the same way as in Phase 1
output_tokens1_rewind = generator.get_sequence(0)
output_text1_rewind = tokenizer.decode(output_tokens1_rewind)

# Phase 3: Verify the outputs match
if output_text1 == output_text1_rewind:
    print("rewind test OK")
else:
    # Report a mismatch explicitly instead of exiting silently
    print("rewind test FAILED: output after rewind differs from the original")
return
```

What each phase is doing:

| Phase | What happens | Why |
| --- | --- | --- |
| Phase 1 | Run Q1→A1→Q2→A2→Q3→A3 sequentially | Build up a realistic KV cache with three turns of history |
| `rewind_to(len(output_tokens0))` | Reset the cache pointer to the position after A1 | Discard Q2, A2, Q3, A3 from the cache; keep only Q1+A1 |
| Phase 2 | Re-ask Q2 from the rewound state | The model now sees only Q1+A1 before Q2, exactly as it did the first time |
| Phase 3 | Compare A2 and new A2 | Confirms the rewind fully restored the earlier model state |

Run the rewind test

```bash
python run_rewind.py -m D:\model\qwen2.5-3B
```

Expected output

```
rewind test OK
```

The identical outputs confirm that `rewind_to()` successfully restored the model to the state it was in after processing Q1 and A1 — erasing all trace of the intermediate turns from the model's perspective.

Summary and Next Steps

What this feature gives you

| Capability | API | Best for |
| --- | --- | --- |
| Persistent multi-turn memory | `append_tokens()` on a shared `Generator` | Chatbots, document Q&A, agentic assistants |
| Conversation branching / undo | `rewind_to(position)` | Correcting wrong turns, multi-path exploration, agentic retry |

Best Practices

1. Reuse one `Generator` per session — do not create a new one per turn

2. Use `append_tokens()` for each new turn — not a full prompt rebuild

3. Add a leading space before each prompt — prevents partial-token boundary artifacts

4. Store `get_sequence(0)` length after each turn — you'll need it as the argument to `rewind_to()` if you want to return to that point later

Benchmark of Context Caching Reuse

To quantify the benefits provided by this feature, we designed the following test cases.
(Tested on an AMD Strix Halo system with 128 GB of memory.)

Test Case 1: Without the Feature

Without this feature, multi-turn conversations require the previous conversation history to be appended to the next prompt.

Input prompt 1: 2048 tokens
Output response 1: 256 tokens
Desired input prompt 2: 256 tokens

However, the actual second prompt must include the entire conversation history:

full_prompt2 = prompt1 + response1 + prompt2

As a result, the input length of full_prompt2 becomes 2560 tokens.

Output response 2: 128 tokens

The experiment was repeated 10 times, and the mean value was reported.

Results:

Response 1 latency: approximately 8.49 s
(input length = 2048, output length = 256)
Response 2 latency: approximately 5.78 s
(input length = 2560, output length = 128)

Test Case 2: With the Feature

With this feature enabled:

Input prompt 1: 2048 tokens
Output response 1: 256 tokens
Input prompt 2: 256 tokens
Output response 2: 128 tokens

The experiment was repeated 10 times, and the mean value was reported.

Results:

Response 1 latency: approximately 8.47 s
(same configuration as Test Case 1: input length = 2048, output length = 256)
Response 2 latency: approximately 4.42 s
(input length = 256, output length = 128)

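| | Case 1 (without reuse) | Case 2 (with reuse) |
| --- | --- | --- |
| Input prompt 1 (tokens) | 2048 | 2048 |
| Output response 1 (tokens) | 256 | 256 |
| Time of 1st conversation turn | 8.49 s | 8.47 s |
| Input prompt 2 (tokens) | prompt1 + response1 + prompt2 = 2560 | 256 |
| Output response 2 (tokens) | 128 | 128 |
| Time of 2nd conversation turn | 5.78 s | 4.42 s |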

These results show that enabling the feature reduces the second-turn response time in this scenario by approximately 1.36 seconds (5.78 s - 4.42 s), or roughly 24%. The saving comes from the time to first token (TTFT): the second turn prefills 256 new tokens instead of the full 2560-token history.

If the system prompt is significantly larger, the time savings can become substantially more noticeable.
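
The arithmetic behind these figures is easy to reproduce. The snippet below is a small sketch that uses only the latencies and token counts reported above; `summarize()` is an illustrative helper name, not part of any benchmark tooling.

```python
def summarize(baseline_s, cached_s, baseline_prefill, cached_prefill):
    """Compare second-turn latency and prefill size with and without cache reuse."""
    saved = baseline_s - cached_s
    return {
        "saved_s": round(saved, 2),
        "latency_reduction_pct": round(100 * saved / baseline_s, 1),
        "prefill_reduction_pct": round(
            100 * (baseline_prefill - cached_prefill) / baseline_prefill, 1
        ),
    }


# Second-turn numbers from Test Case 1 (no reuse) and Test Case 2 (reuse).
print(summarize(5.78, 4.42, 2560, 256))
# {'saved_s': 1.36, 'latency_reduction_pct': 23.5, 'prefill_reduction_pct': 90.0}
```

The prefill size drops by 90% while total latency drops by about 24%, because the second-turn time also includes decoding 128 output tokens in both cases.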

Below is the code snippet for this test. To run it, prepare your own test data as a large text file named `benchmark_test.txt` (it must encode to at least 2304 tokens) and place it in the same directory as the Python script.

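```python
import json
import os
import time

import numpy as np
import onnxruntime_genai as og

# Note: DEFAULT_PROMPTS and get_tokenizer() are not defined in this excerpt; they are
# assumed to come from the reference run_model.py that this script is based on.
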
def load_prompts(prompt_input):
    """Loads prompts from a .txt file, a direct string, or falls back to default prompts."""
    if prompt_input:
        if os.path.exists(prompt_input):
            with open(prompt_input, "r", encoding="utf-8") as f:
                prompts = [line.strip() for line in f.readlines() if line.strip()]
            if prompts:
                return prompts
        else:
            # Treat as a direct string input
            return [prompt_input]
    print("Warning: Invalid or missing prompt input. Using default prompts.")
    return DEFAULT_PROMPTS
 
def load_model_and_tokenizer(model_path, verbose=False):
    """Loads the ONNX model and tokenizer, determining the model type from the config."""
    config_path = os.path.join(model_path, 'genai_config.json')
 
    # Read the model type from the configuration file
    with open(config_path, 'r') as config_file:
        config = json.load(config_file)
        model_type = config['model']['type']
 
    if verbose:
        print(f"Loading {model_type} model from {model_path}...")
 
    model = og.Model(model_path)
    tokenizer = get_tokenizer(model_path, model_type, model)
    return model, tokenizer, model_type
 
def _tokens_to_list(tokens):
    """Normalize tokenizer output to a flat Python list for append_tokens."""
    if isinstance(tokens, np.ndarray):
        return tokens.flatten().tolist()
    return list(tokens)
 
def _generate_up_to(generator, max_new_tokens):
    """Generate up to max_new_tokens; stop early if is_done. Returns count generated."""
    generated = 0
    while generated < max_new_tokens and not generator.is_done():
        generator.generate_next_token()
        generated += 1
    return generated
 
def _new_tokens_since(generator, start_len):
    return _tokens_to_list(generator.get_sequence(0))[start_len:]
 
def generate_text(model, tokenizer, prompts, model_type, args):
  """KV benchmark: slice text1/text2 from benchmark_test.txt tokens, then time 10 runs."""
  TEXT1_LEN = 2048
  TEXT2_LEN = 256
  OUT1_MAX = 256          # max new tokens after text1
  OUT2_MAX = 128
  BENCH_LOOPS = 10
  NEED_TOKENS = TEXT1_LEN + TEXT2_LEN
 
  script_dir = os.path.dirname(os.path.abspath(__file__))
  benchmark_path = os.path.join(script_dir, "benchmark_test.txt")
  if not os.path.exists(benchmark_path):
    raise FileNotFoundError(f"benchmark_test.txt not found: {benchmark_path}")
 
  with open(benchmark_path, "r", encoding="utf-8") as f:
    benchmark_text = f.read()
 
  # --- Phase 1: tokenize file only; take fixed slices (no model input of full file) ---
  all_tokens = _tokens_to_list(tokenizer.encode(benchmark_text))
  print(
    f"Slicing from {benchmark_path}: encoded {len(all_tokens)} tokens, "
    f"using [{0}:{TEXT1_LEN}) + [{TEXT1_LEN}:{NEED_TOKENS}), ignoring the rest."
  )
  if len(all_tokens) < NEED_TOKENS:
    raise RuntimeError(
      f"benchmark_test.txt too short after encode: got {len(all_tokens)} tokens, "
      f"need at least {NEED_TOKENS} ({TEXT1_LEN} + {TEXT2_LEN})."
    )
 
  text1 = all_tokens[0:TEXT1_LEN]
  text2 = all_tokens[TEXT1_LEN:NEED_TOKENS]
 
  print(f"text1 ready: {len(text1)} tokens | text2 ready: {len(text2)} tokens")
 
  # --- Phase 2: one generator — text1 -> out1, then text2 -> out2 (KV retained) ---
  bench_max_len = TEXT1_LEN + TEXT2_LEN + OUT1_MAX + OUT2_MAX + 256
 
  bench_params = og.GeneratorParams(model)
  bench_search = {
    "do_sample": args.do_random_sampling,
    "max_length": bench_max_len,
    "min_length": args.min_length,
    "top_p": args.top_p,
    "top_k": args.top_k,
    "temperature": args.temperature,
    "repetition_penalty": args.repetition_penalty,
  }
  bench_params.set_search_options(**{k: v for k, v in bench_search.items() if v is not None})
  bench_params.try_graph_capture_with_max_batch_size(1)
 
  total_time_text1 = 0.0
  total_time_text2 = 0.0
  out1 = None
  out2 = None
 
  print(f"\n[Phase 2] KV-retained benchmark ({BENCH_LOOPS} iterations)...\n")
 
  for i in range(BENCH_LOOPS):
    generator = og.Generator(model, bench_params)
 
    # text1 -> out1 (up to OUT1_MAX new tokens, stop if is_done early)
    t0 = time.time()
    generator.append_tokens(text1)
    in_pos = len(_tokens_to_list(generator.get_sequence(0)))
    _generate_up_to(generator, OUT1_MAX)
    out1 = _new_tokens_since(generator, in_pos)
    total_time_text1 += time.time() - t0
 
    # text2 -> out2 (same generator, KV from text1 retained)
    t0 = time.time()
    generator.append_tokens(text2)
    in_pos2 = len(_tokens_to_list(generator.get_sequence(0)))
    _generate_up_to(generator, OUT2_MAX)
    out2 = _new_tokens_since(generator, in_pos2)
    total_time_text2 += time.time() - t0
 
  print("\n--- Phase 2 results (single generator, text1 then text2) ---")
  print(f"avg time text1 -> out1: {total_time_text1 / BENCH_LOOPS:.4f}s")
  print(f"avg time text2 -> out2: {total_time_text2 / BENCH_LOOPS:.4f}s")
 
  # --- Phase 3: text1 -> 256 out1; new generator -> text1+out1+text2 -> out2 ---
  total_time_p3_text1 = 0.0
  total_time_p3_combo = 0.0
  p3_out1 = None
  p3_out2 = None
 
  print(f"\n[Phase 3] split-generator benchmark ({BENCH_LOOPS} iterations)...\n")
 
  for i in range(BENCH_LOOPS):
    # Step A: text1 -> up to OUT1_MAX (256) new tokens
    gen_a = og.Generator(model, bench_params)
    t0 = time.time()
    gen_a.append_tokens(text1)
    pos_a = len(_tokens_to_list(gen_a.get_sequence(0)))
    _generate_up_to(gen_a, OUT1_MAX)
    p3_out1 = _new_tokens_since(gen_a, pos_a)
    total_time_p3_text1 += time.time() - t0
 
    # Step B: fresh generator, text1 + out1 + text2 -> out2
    gen_b = og.Generator(model, bench_params)
    t0 = time.time()
    gen_b.append_tokens(text1 + p3_out1 + text2)
    pos_b = len(_tokens_to_list(gen_b.get_sequence(0)))
    _generate_up_to(gen_b, OUT2_MAX)
    p3_out2 = _new_tokens_since(gen_b, pos_b)
    total_time_p3_combo += time.time() - t0
 
  print("\n--- Phase 3 results (text1->out1; new gen; text1+out1+text2->out2) ---")
  print(f"avg time step A text1 -> out1: {total_time_p3_text1 / BENCH_LOOPS:.4f}s")
  print(f"avg time step B text1+out1+text2 -> out2: {total_time_p3_combo / BENCH_LOOPS:.4f}s")
```

Supported Models

The following models from AMD's Hugging Face collections have been fully tested with this feature.

Hybrid models (from the Ryzen AI 1.7.1 Hybrid collection):

- CodeLlama-7b-Instruct-hf
- Llama-2-7b-chat-hf
- Llama-3.2-1B-Instruct
- Llama-3.2-3B-Instruct
- Meta-Llama-3.1-8B-Instruct
- Mistral-7B-Instruct-v0.2
- Mistral-7B-Instruct-v0.3
- Phi-3.5-mini-instruct
- Phi-3-mini-128k-instruct
- Phi-4-mini-instruct
- Phi-4-mini-instruct-awq-quant-onnx-hybrid
- Qwen-2.5_1.5B_Instruct
- Qwen2.5_3B_Instruct
- Qwen2.5-7B-Instruct
- Qwen2.5-Coder-1.5B-Instruct
- Qwen2.5-Coder-7B-Instruct
- smollm_hybrid

NPU fusion models (from the Ryzen AI 1.7.1 NPU 4K collection):

- CodeLlama-7b-Instruct-hf_fusion
- Llama-2-7b-chat-hf_fusion
- Llama-3.2-1B-Instruct_fusion
- Llama-3.2-3B-Instruct_fusion
- Meta-Llama-3.1-8B-Instruct_fusion
- Mistral-7B-Instruct-v0.1_fusion
- Mistral-7B-Instruct-v0.2_fusion
- Mistral-7B-Instruct-v0.3_fusion
- Phi-3-mini-128k-instruct_fusion
- Phi-3-mini-4k-instruct_fusion
- Phi-3.5-mini-instruct_fusion
- Phi-4-mini-instruct_fusion
- Qwen-2.5_1.5B_Instruct_fusion
- Qwen2.5-7B-Instruct_fusion
- Qwen2.5-Coder-0.5B-Instruct_fusion
- Qwen2.5-Coder-1.5B-Instruct_fusion
- Qwen2.5-Coder-7B-Instruct_fusion
- Qwen2.5_3B_Instruct_fusion

Summary

KV cache reuse helps reduce latency in multi-turn LLM applications by preserving the model's internal attention state and processing only newly added tokens instead of rebuilding the entire conversation. Combined with conversation rewind capabilities, it enables more responsive chatbots, document assistants, and agentic applications on Ryzen™ AI PCs.

Using the continuous decoding APIs available in Ryzen™ AI Software 1.7.1, developers can implement persistent conversational memory, reduce response latency, and unlock more efficient local AI experiences on Ryzen™ AI PCs.

From here, you can extend these techniques by increasing generation limits for production workloads, experimenting with additional multi-turn memory scenarios, combining rewind_to() with branching logic to create conversational tree exploration workflows, or applying the same pattern to NPU Fusion models for fully on-NPU inference. As conversations become longer and system prompts grow more complex, the performance benefits of KV cache reuse become increasingly valuable.

What to Explore Next

- Increase the token limit beyond 120 for production-quality responses
- Try the other five scenario types from the [multi-turn table](#what-the-test-covers)
- Combine `rewind_to()` with branching logic to build a conversational tree explorer
- Apply the same pattern to NPU-TPS fusion models for fully on-NPU inference