Accelerating Local LLM Conversations with KV Cache Reuse on AMD Ryzen AI
Jun 10, 2026
Local large language models are enabling a new class of AI PC applications, including private document assistants, coding helpers, conversational agents, and domain-specific copilots. Running inference locally can reduce dependence on network connectivity and can help keep prompts and application data on device when the application is configured for local execution. As these applications become more conversational, maintaining low latency across multiple turns becomes increasingly important. In a typical chat session, every new user message is added to the existing conversation history. Without an efficient mechanism for retaining context, the model must repeatedly process the entire conversation before generating a response.
This repeated processing occurs during the prefill phase, where the model converts input tokens into the internal attention state required for generation. As conversations grow longer, prefill can become a significant contributor to overall response latency.
KV cache reuse addresses this challenge by preserving the model's internal attention state across turns. Instead of rebuilding the entire conversation context for every request, the application processes only newly added tokens while reusing previously computed information.
AMD Ryzen™ AI Software 1.7.1 supports this capability through ONNX Runtime GenAI's continuous decoding APIs. In this blog, we'll explore how KV cache reuse works, demonstrate multi-turn conversation handling and conversation rewind capabilities, and measure the latency benefits of avoiding redundant prefill computation.
Large language models process text in two stages:
Prefill – Process the input prompt and build the attention state.
Decode – Generate output tokens one at a time.
For many conversational workloads, prefill becomes the dominant cost.
Consider a simple conversation:
Turn 1: [System prompt] + [User message 1]
Turn 2: [System prompt] + [User message 1] + [Reply 1] + [User message 2]
Turn 3: [System prompt] + [User message 1] + [Reply 1] + [User message 2] + [Reply 2] + ...
Without context reuse, the model reprocesses the entire conversation at every turn. Even though most of the content has already been seen, the model must rebuild its attention state from scratch.
As conversation history grows, response latency increases and more compute resources are consumed.
With KV cache reuse, previously computed attention information remains available between turns. The model only processes newly appended tokens, significantly reducing redundant work.
| Without Caching | With Caching |
|---|---|
| Full prefill on every turn | Prefill only for new tokens |
| Latency grows with conversation length | Near-constant latency per turn |
| Higher power draw on long sessions | Lower energy per response |
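To make the comparison concrete, the sketch below counts how many tokens each turn has to prefill with and without reuse. The message lengths are made-up illustrative values, not measurements, and `prefill_tokens_per_turn` is a hypothetical helper written for this post.

```python
def prefill_tokens_per_turn(user_lens, reply_lens, system_len=0):
    """Return (without_reuse, with_reuse) prefill token counts for each turn.

    user_lens[i]  : tokens in the i-th user message (illustrative values)
    reply_lens[i] : tokens in the i-th model reply
    system_len    : tokens in the system prompt, prefilled once on turn 1
    """
    without_reuse, with_reuse = [], []
    history = system_len  # tokens already in the conversation before this turn
    for turn, user_len in enumerate(user_lens):
        # Without reuse: the whole history plus the new message is prefilled.
        without_reuse.append(history + user_len)
        # With reuse: only tokens that are not yet in the cache are prefilled.
        with_reuse.append(user_len + (system_len if turn == 0 else 0))
        history += user_len + reply_lens[turn]
    return without_reuse, with_reuse


full, incremental = prefill_tokens_per_turn(
    user_lens=[50, 40, 60], reply_lens=[120, 100, 90], system_len=200
)
print("without reuse:", full)         # [250, 410, 570]
print("with reuse:   ", incremental)  # [250, 40, 60]
```

The first turn costs the same either way; the gap opens from the second turn onward and widens as the history grows.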
Understanding the KV Cache
Transformer-based LLMs use a self-attention mechanism to relate every token in the input to every other token. In each attention layer, the model computes three vectors per token: Query (Q), Key (K), and Value (V).
The Key and Value vectors are the part worth keeping: they must be available for every token in the context before the model can generate any output, and they do not change once computed. The Key-Value (KV) cache is simply the stored result of this computation.
Without KV cache reuse:
Turn N prompt = [Token 1 ... Token N]
Model recomputes K, V for ALL N tokens on every turn
With KV cache reuse:
After Turn 1: KV cache = [K1, V1]
After Turn 2: KV cache = [K1, V1, K2, V2] <- only K2, V2 are new
After Turn 3: KV cache = [K1, V1, K2, V2, K3, V3] <- only K3, V3 are new
Reusing the KV cache means the model only computes attention for the delta — the new tokens added since the last turn — while reading the cached state for everything that came before.
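The idea can be sketched with a toy single-head attention in plain Python. This is an illustration only, under simplifying assumptions: the "projections" are made-up scalar weights (`W_Q`, `W_K`, `W_V`), there is one layer and one head, and `ToyKVCache` is a hypothetical class, not the runtime's data structure. It shows that attending over an incrementally built cache gives the same result as recomputing every key and value from scratch, while doing fewer projections.

```python
import math

W_Q, W_K, W_V = 0.5, 1.5, 2.0  # made-up scalar "projection weights"


def project(token, weight):
    """Toy projection: scale a small token embedding by a scalar weight."""
    return [weight * x for x in token]


def attend(query, keys, values):
    """Softmax attention of one query over all cached keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


class ToyKVCache:
    def __init__(self):
        self.keys, self.values = [], []
        self.projections_computed = 0

    def append(self, tokens):
        """Compute K and V only for the tokens passed in, then store them."""
        for tok in tokens:
            self.keys.append(project(tok, W_K))
            self.values.append(project(tok, W_V))
            self.projections_computed += 1


turn1 = [[1.0, 0.0], [0.0, 1.0]]
turn2 = [[1.0, 1.0]]

cache = ToyKVCache()
cache.append(turn1)            # turn 1: two tokens projected
cache.append(turn2)            # turn 2: only the one new token is projected
query = project(turn2[-1], W_Q)
incremental = attend(query, cache.keys, cache.values)

fresh = ToyKVCache()
fresh.append(turn1 + turn2)    # no reuse: everything recomputed for turn 2
full = attend(query, fresh.keys, fresh.values)

print(incremental == full)            # True
print(cache.projections_computed)     # 3 (versus 2 + 3 = 5 without reuse)
```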
Continuous Decoding in ONNX Runtime GenAI
Ryzen™ AI Software 1.7.1 exposes KV cache reuse through ONNX Runtime GenAI's continuous decoding functionality.
From an application perspective, this enables:
- Persistent multi-turn conversations
- Incremental token appending
- Context retention without prompt reconstruction
- Conversation rewind and branching workflows
The feature is implemented directly within the runtime and requires no additional libraries beyond the standard Ryzen AI Software installation.
Hardware and Software Support
KV cache reuse is available on all Ryzen AI Software 1.7.1 OGA model configurations that run through the `onnxruntime-genai` runtime. This includes:
- Hybrid models: workload split across the CPU, iGPU, and NPU
- NPU-TPS fusion models: workload fully fused on the NPU
Getting Started
Install Ryzen AI Software 1.7.1 by following the official [Installation Instructions](https://ryzenai.docs.amd.com/en/latest/inst.html). The installer creates a Conda environment (`ryzen-ai-1.7.1`) with all required dependencies, including `onnxruntime-genai`.
Download and Configure the Model
This walkthrough uses the AMD Qwen2.5-3B hybrid model. Download it from Hugging Face:
`amd/Qwen2.5_3B_Instruct_rai_1.7.1_hybrid` (`main` branch)
Save all downloaded files to a local directory, for example `D:\model\qwen2.5-3B`.
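Before running anything, it can help to confirm that the download is complete. The sketch below checks for `genai_config.json`, the file the reference scripts read to determine the model type, and returns the value stored under `config["model"]["type"]`. The function name `read_model_type` is hypothetical, and the check covers only this one file, not the model weights.

```python
import json
import os


def read_model_type(model_dir):
    """Return config['model']['type'] from genai_config.json in model_dir."""
    config_path = os.path.join(model_dir, "genai_config.json")
    if not os.path.isfile(config_path):
        raise FileNotFoundError(f"genai_config.json not found in {model_dir}")
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    return config["model"]["type"]
```

For example, `read_model_type(r"D:\model\qwen2.5-3B")` should return the model type string if the directory was populated correctly, and raise `FileNotFoundError` otherwise.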
The following sections use this model in two concrete examples that demonstrate the feature.
Building a Multi-Turn Conversation
This example shows how to run a sequence of prompts in which each turn builds on earlier ones and the model retains everything, without any explicit context management in your code.
What the test covers
The following six scenarios represent common real-world patterns where multi-turn memory is essential. Each row is a sequence of prompts to use as `DEFAULT_PROMPTS`:
| Scenario | Sample prompts list |
|---|---|
| Identity Recall | My name is Alex, I work as a data scientist at Orbital AI. |
| | I live in Berlin and enjoy playing the piano in my free time. |
| | Where do I live and what do I enjoy doing? What is my profession? |
| Referencing Custom Rules | When I say “compact summary,” respond with 3 bullet points only. |
| | Give me a compact summary of: Ancient Egypt (Egyptian: km.t) was a cradle of civilization concentrated along the lower reaches of the Nile River in Northeast Africa. It emerged from prehistoric Egypt around 3150 BC (according to conventional Egyptian chronology), when Upper and Lower Egypt were amalgamated by Menes, who is believed by the majority of Egyptologists to have been the same person as Narmer. The history of ancient Egypt unfolded as a series of stable kingdoms interspersed by the “Intermediate Periods” of relative instability. These stable kingdoms existed in one of three periods: the Old Kingdom of the Early Bronze Age; the Middle Kingdom of the Middle Bronze Age; or the New Kingdom of the Late Bronze Age. |
| | Give me a compact summary of: stages of butterfly |
| Entity Linking | Sarah is a software engineer who leads the AI team. James is her manager. |
| | Who is Sarah's manager? |
| | What team does Sarah lead? |
| List Recall | Here is a list of items I need from the store: apples, bread, milk, eggs, tomatoes, pasta, and cheese. |
| | What was the fourth item? |
| | How many dairy items are in the list? |
| Entity Linking: Multi-character Role Tracking | Alice is a biologist who works at Genomix Lab. She reports to Dr. Patel, who is the head of the lab. Her colleague, Martin, specializes in data analysis. |
| | Who is Alice’s manager? |
| | Who is the data analyst at Genomix Lab? |
| | What is Dr. Patel’s position? |
| | What field does Alice work in? |
| Task Continuity: Math with Memory | A farmer has 3 fields. Each field has 240 apple trees. |
| | If each tree produces 120 apples, how many apples does one field produce? |
| | How many apples in total across all fields? |
| | If 10% of the apples are spoiled, how many are still good? |
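One convenient way to switch between scenarios is to keep them in a dictionary and pick one as `DEFAULT_PROMPTS`. The sketch below copies three of the prompt sequences from the table; the dictionary keys and the `SCENARIOS` and `prompts_for` names are hypothetical helpers, not part of `run_model.py`. Each prompt is given a single leading space so that consecutive turns are separated cleanly when they are appended to the same sequence.

```python
# Prompt text copied from the scenario table; names and keys are illustrative.
SCENARIOS = {
    "entity_linking": [
        "Sarah is a software engineer who leads the AI team. James is her manager.",
        "Who is Sarah's manager?",
        "What team does Sarah lead?",
    ],
    "list_recall": [
        "Here is a list of items I need from the store: apples, bread, milk, eggs, "
        "tomatoes, pasta, and cheese.",
        "What was the fourth item?",
        "How many dairy items are in the list?",
    ],
    "math_with_memory": [
        "A farmer has 3 fields. Each field has 240 apple trees.",
        "If each tree produces 120 apples, how many apples does one field produce?",
        "How many apples in total across all fields?",
        "If 10% of the apples are spoiled, how many are still good?",
    ],
}


def prompts_for(name):
    """Return the scenario's prompts, each with exactly one leading space."""
    return [" " + prompt.lstrip() for prompt in SCENARIOS[name]]


DEFAULT_PROMPTS = prompts_for("list_recall")
```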
Prepare the test script
The Ryzen AI installation ships a reference script at:
C:\Program Files\RyzenAI\1.7.1\LLM\example\run_model.py
Copy it to `run_kvreuse.py` in the same directory and make two modifications:
Step 1 — Replace `DEFAULT_PROMPTS`
Use the Identity Recall scenario as a starting point:
DEFAULT_PROMPTS = [
" My name is Alex, I work as a data scientist at Orbital AI.",
" I live in Berlin and enjoy playing the piano in my free time.",
" Where do I live and what do I enjoy doing?",
" What is my profession?"
]
> Why the leading space? When a generation is cut off at the token limit, the last token may be a partial word (e.g., `"clas"` instead of `"classical"`). Appending the next prompt directly would concatenate them into a malformed token (`"clasMy name is..."`), which confuses the model. The leading space forces a clean token boundary between the generated response and the next user message.
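The effect is easy to see at the text level. The sketch below is only an illustration: in the real pipeline the concatenation happens on token IDs rather than strings, and `with_leading_space` is a hypothetical helper.

```python
def with_leading_space(prompt):
    """Return the prompt with exactly one leading space."""
    return " " + prompt.lstrip()


truncated_reply = "I enjoy clas"   # generation stopped mid-word at the token limit
next_prompt = "My name is Alex."

print(repr(truncated_reply + next_prompt))
# 'I enjoy clasMy name is Alex.'
print(repr(truncated_reply + with_leading_space(next_prompt)))
# 'I enjoy clas My name is Alex.'
```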
Step 2 — Replace the generation loop in `generate_text()`
```python
generator = og.Generator(model, params)
for i, prompt in enumerate(prompts):
    # Append only the new prompt tokens; the generator retains all prior context
    generator.append_tokens(tokenizer.encode(prompt))
    loop = 0
    while not generator.is_done() and loop < 120:
        if loop == 0:
            in_pos = generator.get_sequence(0)  # mark start of this response
        loop += 1
        generator.generate_next_token()
    # Extract only the tokens generated for this prompt
    output_tokens = generator.get_sequence(0)[len(in_pos):]
    output_text = tokenizer.decode(output_tokens)
    print(f"\nPrompt #{i+1}: {prompt}")
    print("Output:\n", output_text)
    results.append({
        "prompt": prompt,
        "response": output_text,
        "tokens": len(output_tokens)
    })
    total_tokens += len(output_tokens)
```
Key design decisions explained:
| Decision | Reason |
|---|---|
| Single `Generator` instance for all turns | The KV cache lives inside the generator. Reusing it across turns is what enables context accumulation; creating a new generator each turn would discard all prior context. |
| `append_tokens()` instead of restarting | This adds new tokens to the existing KV cache sequence rather than rebuilding the full sequence from scratch. |
| `loop < 120` token limit | Prevents the model from generating excessively long responses. Lower values keep the test fast; raise this for production use. |
| `in_pos = generator.get_sequence(0)` at `loop == 0` | Records the sequence length at the start of generation for this turn so only the new tokens are decoded and printed, not the entire conversation history. |
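The same decisions can be packaged into a small per-turn helper. The sketch below is one possible shape, not the reference script's code: `run_turn` is a hypothetical function that relies only on the generator and tokenizer methods used above (`append_tokens`, `is_done`, `generate_next_token`, `get_sequence`, `encode`, `decode`). So that it can be dry-run without a model, it is exercised here with two stand-in classes that merely imitate those methods.

```python
def run_turn(generator, tokenizer, prompt, max_new_tokens=120):
    """Append one prompt to a shared generator; return (new_text, new_token_count)."""
    generator.append_tokens(tokenizer.encode(prompt))
    start = len(generator.get_sequence(0))   # sequence length before generation
    produced = 0
    while not generator.is_done() and produced < max_new_tokens:
        generator.generate_next_token()
        produced += 1
    new_tokens = list(generator.get_sequence(0))[start:]
    return tokenizer.decode(new_tokens), len(new_tokens)


class EchoTokenizer:
    """Stand-in tokenizer for dry runs: one token per character."""

    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, tokens):
        return "".join(chr(t) for t in tokens)


class CountingGenerator:
    """Stand-in generator for dry runs: emits reply_len '.' tokens per turn."""

    def __init__(self, reply_len=3):
        self.tokens, self.reply_len, self.left = [], reply_len, 0

    def append_tokens(self, tokens):
        self.tokens.extend(tokens)
        self.left = self.reply_len           # a new turn can generate again

    def is_done(self):
        return self.left == 0

    def generate_next_token(self):
        self.tokens.append(ord("."))
        self.left -= 1

    def get_sequence(self, index):
        return list(self.tokens)


gen, tok = CountingGenerator(), EchoTokenizer()
first, n1 = run_turn(gen, tok, " hi")
second, n2 = run_turn(gen, tok, " again")
print(first, n1, second, n2)          # ... 3 ... 3
print(len(gen.get_sequence(0)))       # 15: both prompts and both replies are retained
```

Recording `start` before the loop, rather than on the first iteration, also covers the case where the generator is already done and the loop body never runs.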
Run the test
```bash
conda activate ryzen-ai-1.7.1
cd "C:\Program Files\RyzenAI\1.7.1\LLM\example"
python run_kvreuse.py -m D:\model\qwen2.5-3B
```
What to expect
Prompts 1 and 2 give the model background information. Prompts 3 and 4 test whether the model retained it:
Prompt 3 — "Where do I live and what do I enjoy doing?"
```
You mentioned that you live in Berlin and enjoy playing the piano in your free time. That sounds like a wonderful city and way of life! Berlin is known for its vibrant arts scene, diverse neighborhoods, and rich cultural heritage, making it a fantastic place to explore and enjoy. Playing the piano is a fantastic hobby, and I'm sure you make the most of of of your free time there. Is there anything else you'd like to share about your interests or plans for the future? I'm a big fan of of the city and its culture, and I love exploring new neighborhoods and
```
Prompt 4 — "What is my profession?"
```
You mentioned that you work as a data scientist at Orbital AI. That's great! A data scientist is a fascinating role that involves using statistical and computational techniques to extract insights and knowledge from data. It's a highly technical and versatile field that can be applied to a wide range of industries and problems. As a data scientist, you likely have the opportunity to work with large datasets, develop predictive models, and help organizations make data-driven decisions. What are some of of of the projects or challenges you've worked on recently? I'm a big fan of of of the city and its culture, and
```
The model correctly answers both questions using only what was said in earlier turns, with no context passed explicitly by the caller. The KV cache is handling the memory.
Rewind to a Previous State
The `rewind_to()` API lets you move the KV cache pointer back to any earlier position in the conversation, discarding everything after that point. This is useful when:
- A user wants to take the conversation in a different direction from an earlier turn
- An agentic workflow needs to explore multiple branches from a shared starting point
- An intermediate response was poor quality and you want to re-ask from a clean state
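The branching use case can be sketched as a loop that tries several follow-up prompts from the same shared history and rewinds between attempts. `explore_branches` is a hypothetical helper; it assumes a generator exposing the `append_tokens`, `is_done`, `generate_next_token`, `get_sequence`, and `rewind_to` methods used in this post. The stand-in classes exist only so the sketch can be dry-run without a model.

```python
def explore_branches(generator, tokenizer, branch_prompts, max_new_tokens=120):
    """Generate one reply per branch prompt, rewinding to the shared prefix in between."""
    checkpoint = len(generator.get_sequence(0))   # length of the shared history
    replies = {}
    for prompt in branch_prompts:
        generator.append_tokens(tokenizer.encode(prompt))
        start = len(generator.get_sequence(0))
        produced = 0
        while not generator.is_done() and produced < max_new_tokens:
            generator.generate_next_token()
            produced += 1
        replies[prompt] = tokenizer.decode(list(generator.get_sequence(0))[start:])
        generator.rewind_to(checkpoint)           # drop this branch, keep the prefix
    return replies


class StubTokenizer:
    """Stand-in tokenizer for dry runs: one token per character."""

    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, tokens):
        return "".join(chr(t) for t in tokens)


class StubGenerator:
    """Stand-in generator for dry runs: each reply repeats the last token twice."""

    def __init__(self):
        self.tokens, self.left = [], 0

    def append_tokens(self, tokens):
        self.tokens.extend(tokens)
        self.left = 2

    def is_done(self):
        return self.left == 0

    def generate_next_token(self):
        self.tokens.append(self.tokens[-1])
        self.left -= 1

    def get_sequence(self, index):
        return list(self.tokens)

    def rewind_to(self, length):
        del self.tokens[length:]


gen, tok = StubGenerator(), StubTokenizer()
gen.append_tokens(tok.encode("shared context"))
replies = explore_branches(gen, tok, [" A?", " B!"])
print(replies)                                                 # {' A?': '??', ' B!': '!!'}
print(len(gen.get_sequence(0)) == len("shared context"))       # True
```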
How rewind works
```
Normal flow:
Q1 → A1 → Q2 → A2 → Q3 → A3
KV cache: [Q1][A1][Q2][A2][Q3][A3]
After rewind_to(len(Q1 + A1)):
KV cache: [Q1][A1] ← everything after A1 is discarded
Re-ask Q2:
KV cache: [Q1][A1][Q2][A2'] ← model is in the same state as after the original A1
If the model is deterministic: A2' == A2
```
Prepare the rewind test script
Copy `run_kvreuse.py` to `run_rewind.py`. Replace the generation loop with:
```python
# Same generator setup as in run_kvreuse.py
generator = og.Generator(model, params)

# Phase 1: Run Q1, Q2, Q3 in sequence to build up the full KV cache
for i in range(3):
    generator.append_tokens(tokenizer.encode(prompts[i]))
    loop = 0
    while not generator.is_done() and loop < 120:
        loop += 1
        generator.generate_next_token()
    if i == 0:
        # Save the sequence after Q1+A1; its length is our rewind target
        output_tokens0 = generator.get_sequence(0)
    if i == 1:
        # Save the decoded sequence through A2 so we can compare it after the rewind
        output_tokens1 = generator.get_sequence(0)
        output_text1 = tokenizer.decode(output_tokens1)

# Phase 2: Rewind to the position after Q1+A1
generator.rewind_to(len(output_tokens0))
# Re-ask Q2 from this rewound state
generator.append_tokens(tokenizer.encode(prompts[1]))
loop = 0
while not generator.is_done() and loop < 120:
    loop += 1
    generator.generate_next_token()
output_tokens1_rewind = generator.get_sequence(0)
output_text1_rewind = tokenizer.decode(output_tokens1_rewind)

# Phase 3: Verify the outputs match (requires deterministic decoding)
if output_text1 == output_text1_rewind:
    print("rewind test OK")
else:
    print("rewind test FAILED: outputs differ after rewind")
return
```
What each phase is doing:
| Phase | What happens | Why |
|---|---|---|
| Phase 1 | Run Q1→A1→Q2→A2→Q3→A3 sequentially | Build up a realistic KV cache with three turns of history |
| `rewind_to(len(output_tokens0))` | Reset the cache pointer to the position after A1 | Discard Q2, A2, Q3, A3 from the cache; keep only Q1+A1 |
| Phase 2 | Re-ask Q2 from the rewound state | The model now sees only Q1+A1 before Q2, exactly as it did the first time |
| Phase 3 | Compare A2 and new A2 | Confirms the rewind fully restored the earlier model state |
Run the rewind test
```bash
python run_rewind.py -m D:\model\qwen2.5-3B
```
Expected output
```
rewind test OK
```
The identical outputs confirm that `rewind_to()` successfully restored the model to the state it was in after processing Q1 and A1 — erasing all trace of the intermediate turns from the model's perspective.
Summary and Next Steps
What this feature gives you
| Capability | API | Best for |
|---|---|---|
| Persistent multi-turn memory | `append_tokens()` on a shared `Generator` | Chatbots, document Q&A, agentic assistants |
| Conversation branching / undo | `rewind_to(position)` | Correcting wrong turns, multi-path exploration, agentic retry |
Best Practices
1. Reuse one `Generator` per session — do not create a new one per turn
2. Use `append_tokens()` for each new turn — not a full prompt rebuild
3. Add a leading space before each prompt — prevents partial-token boundary artifacts
4. Store `get_sequence(0)` length after each turn — you'll need it as the argument to `rewind_to()` if you want to return to that point later
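Practice 4 can be sketched as a small bookkeeping class. `TurnCheckpoints` is a hypothetical helper; it assumes only that the generator provides `get_sequence(0)` and `rewind_to(length)` as used earlier. The `ListGenerator` stand-in exists solely so the example can be dry-run without a model.

```python
class TurnCheckpoints:
    """Remember the sequence length after each turn so any turn can be restored."""

    def __init__(self, generator):
        self.generator = generator
        self.lengths = []   # lengths[i] is the sequence length after turn i + 1

    def mark(self):
        """Call once after each completed turn."""
        self.lengths.append(len(self.generator.get_sequence(0)))

    def rewind_to_turn(self, turn):
        """Keep turns 1..turn and discard everything generated after them."""
        if not 1 <= turn <= len(self.lengths):
            raise ValueError(f"turn must be in 1..{len(self.lengths)}")
        self.generator.rewind_to(self.lengths[turn - 1])
        del self.lengths[turn:]


class ListGenerator:
    """Stand-in for dry runs: stores tokens in a list and truncates on rewind."""

    def __init__(self):
        self.tokens = []

    def get_sequence(self, index):
        return list(self.tokens)

    def rewind_to(self, length):
        del self.tokens[length:]


gen = ListGenerator()
checkpoints = TurnCheckpoints(gen)
for turn_tokens in ([1, 2, 3], [4, 5], [6, 7, 8, 9]):
    gen.tokens.extend(turn_tokens)   # pretend a prompt and reply were processed
    checkpoints.mark()

print(checkpoints.lengths)           # [3, 5, 9]
checkpoints.rewind_to_turn(1)
print(gen.get_sequence(0))           # [1, 2, 3]
print(checkpoints.lengths)           # [3]
```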
Benchmark of Context Caching Reuse
To quantify the benefits provided by this feature, we designed the following test cases.
(Tested on an AMD Strix Halo system with 128 GB of memory.)
Test Case 1: Without the Feature
Without this feature, multi-turn conversations require the previous conversation history to be appended to the next prompt.
- Input prompt 1: 2048 tokens
- Output response 1: 256 tokens
- Desired input prompt 2: 256 tokens
However, the actual second prompt must include the entire conversation history:
full_prompt2 = prompt1 + response1 + prompt2
As a result, the input length of full_prompt2 becomes 2560 tokens.
- Output response 2: 128 tokens
The experiment was repeated 10 times, and the mean value was reported.
Results:
- Response 1 latency: approximately 8.49 s (input length = 2048, output length = 256)
- Response 2 latency: approximately 5.78 s (input length = 2560, output length = 128)
Test Case 2: With the Feature
With this feature enabled:
- Input prompt 1: 2048 tokens
- Output response 1: 256 tokens
- Input prompt 2: 256 tokens
- Output response 2: 128 tokens
The experiment was repeated 10 times, and the mean value was reported.
Results:
- Response 1 latency: approximately 8.47 s (same configuration as Test Case 1: input length = 2048, output length = 256)
- Response 2 latency: approximately 4.42 s (input length = 256, output length = 128)
| | Case 1 | Case 2 |
|---|---|---|
| Input prompt 1 (tokens) | 2048 | 2048 |
| Output response 1 (tokens) | 256 | 256 |
| Time of 1st conversation turn | 8.49 s | 8.47 s |
| Input prompt 2 (tokens) | prompt1 + response1 + prompt2 = 2560 | 256 |
| Output response 2 (tokens) | 128 | 128 |
| Time of 2nd conversation turn | 5.78 s | 4.42 s |
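These results show that enabling the feature reduces the second-turn response time in this scenario by approximately 1.36 seconds (5.78 s - 4.42 s). Because both cases generate the same 128 output tokens, the saving corresponds to the difference in time to first token (TTFT) between prefilling 2560 tokens and prefilling 256 tokens.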
If the system prompt is significantly larger, the time savings can become substantially more noticeable.
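The arithmetic behind these figures can be reproduced directly from the reported means. The variable names below are illustrative; the numbers are the ones listed in the results above.

```python
# Mean latencies reported above for the second response, in seconds.
latency_without_reuse = 5.78   # prefill 2560 tokens, generate 128
latency_with_reuse = 4.42      # prefill 256 tokens, generate 128

saved = latency_without_reuse - latency_with_reuse
relative = saved / latency_without_reuse

prefill_without = 2048 + 256 + 256   # prompt1 + response1 + prompt2
prefill_with = 256                   # prompt2 only

print(f"saved per turn: {saved:.2f} s ({relative:.1%} of the uncached latency)")
# saved per turn: 1.36 s (23.5% of the uncached latency)
print(f"prefill tokens: {prefill_without} -> {prefill_with} "
      f"({prefill_without // prefill_with}x fewer)")
# prefill tokens: 2560 -> 256 (10x fewer)
```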
The code snippet for this test is shown below. To run it, prepare your own test data as a large text file named `benchmark_test.txt` (it must encode to at least 2304 tokens, that is, 2048 + 256) and place it in the same directory as the Python script.
```python
import json
import os
import time

import numpy as np
import onnxruntime_genai as og

# DEFAULT_PROMPTS and get_tokenizer() come from the reference run_model.py script.


def load_prompts(prompt_input):
    """Loads prompts from a .txt file, a direct string, or falls back to default prompts."""
    if prompt_input:
        if os.path.exists(prompt_input):
            with open(prompt_input, "r", encoding="utf-8") as f:
                prompts = [line.strip() for line in f.readlines() if line.strip()]
            if prompts:
                return prompts
        else:
            # Treat as a direct string input
            return [prompt_input]
    print("Warning: Invalid or missing prompt input. Using default prompts.")
    return DEFAULT_PROMPTS


def load_model_and_tokenizer(model_path, verbose=False):
    """Loads the ONNX model and tokenizer, determining the model type from the config."""
    config_path = os.path.join(model_path, 'genai_config.json')
    # Read the model type from the configuration file
    with open(config_path, 'r') as config_file:
        config = json.load(config_file)
    model_type = config['model']['type']
    if verbose:
        print(f"Loading {model_type} model from {model_path}...")
    model = og.Model(model_path)
    tokenizer = get_tokenizer(model_path, model_type, model)
    return model, tokenizer, model_type


def _tokens_to_list(tokens):
    """Normalize tokenizer output to a flat Python list for append_tokens."""
    if isinstance(tokens, np.ndarray):
        return tokens.flatten().tolist()
    return list(tokens)


def _generate_up_to(generator, max_new_tokens):
    """Generate up to max_new_tokens; stop early if is_done. Returns count generated."""
    generated = 0
    while generated < max_new_tokens and not generator.is_done():
        generator.generate_next_token()
        generated += 1
    return generated


def _new_tokens_since(generator, start_len):
    """Return the tokens added to the sequence after position start_len."""
    return _tokens_to_list(generator.get_sequence(0))[start_len:]


def generate_text(model, tokenizer, prompts, model_type, args):
    """KV benchmark: slice text1/text2 from benchmark_test.txt tokens, then time 10 runs."""
    TEXT1_LEN = 2048
    TEXT2_LEN = 256
    OUT1_MAX = 256  # max new tokens after text1
    OUT2_MAX = 128  # max new tokens after text2
    BENCH_LOOPS = 10
    NEED_TOKENS = TEXT1_LEN + TEXT2_LEN

    script_dir = os.path.dirname(os.path.abspath(__file__))
    benchmark_path = os.path.join(script_dir, "benchmark_test.txt")
    if not os.path.exists(benchmark_path):
        raise FileNotFoundError(f"benchmark_test.txt not found: {benchmark_path}")
    with open(benchmark_path, "r", encoding="utf-8") as f:
        benchmark_text = f.read()

    # --- Phase 1: tokenize file only; take fixed slices (no model input of full file) ---
    all_tokens = _tokens_to_list(tokenizer.encode(benchmark_text))
    print(
        f"Slicing from {benchmark_path}: encoded {len(all_tokens)} tokens, "
        f"using [{0}:{TEXT1_LEN}) + [{TEXT1_LEN}:{NEED_TOKENS}), ignoring the rest."
    )
    if len(all_tokens) < NEED_TOKENS:
        raise RuntimeError(
            f"benchmark_test.txt too short after encode: got {len(all_tokens)} tokens, "
            f"need at least {NEED_TOKENS} ({TEXT1_LEN} + {TEXT2_LEN})."
        )
    text1 = all_tokens[0:TEXT1_LEN]
    text2 = all_tokens[TEXT1_LEN:NEED_TOKENS]
    print(f"text1 ready: {len(text1)} tokens | text2 ready: {len(text2)} tokens")

    # --- Phase 2: one generator: text1 -> out1, then text2 -> out2 (KV retained) ---
    bench_max_len = TEXT1_LEN + TEXT2_LEN + OUT1_MAX + OUT2_MAX + 256
    bench_params = og.GeneratorParams(model)
    bench_search = {
        "do_sample": args.do_random_sampling,
        "max_length": bench_max_len,
        "min_length": args.min_length,
        "top_p": args.top_p,
        "top_k": args.top_k,
        "temperature": args.temperature,
        "repetition_penalty": args.repetition_penalty,
    }
    bench_params.set_search_options(**{k: v for k, v in bench_search.items() if v is not None})
    bench_params.try_graph_capture_with_max_batch_size(1)

    total_time_text1 = 0.0
    total_time_text2 = 0.0
    out1 = None
    out2 = None
    print(f"\n[Phase 2] KV-retained benchmark ({BENCH_LOOPS} iterations)...\n")
    for i in range(BENCH_LOOPS):
        generator = og.Generator(model, bench_params)
        # text1 -> out1 (up to OUT1_MAX new tokens, stop if is_done early)
        t0 = time.time()
        generator.append_tokens(text1)
        in_pos = len(_tokens_to_list(generator.get_sequence(0)))
        _generate_up_to(generator, OUT1_MAX)
        out1 = _new_tokens_since(generator, in_pos)
        total_time_text1 += time.time() - t0
        # text2 -> out2 (same generator, KV from text1 retained)
        t0 = time.time()
        generator.append_tokens(text2)
        in_pos2 = len(_tokens_to_list(generator.get_sequence(0)))
        _generate_up_to(generator, OUT2_MAX)
        out2 = _new_tokens_since(generator, in_pos2)
        total_time_text2 += time.time() - t0
    print("\n--- Phase 2 results (single generator, text1 then text2) ---")
    print(f"avg time text1 -> out1: {total_time_text1 / BENCH_LOOPS:.4f}s")
    print(f"avg time text2 -> out2: {total_time_text2 / BENCH_LOOPS:.4f}s")

    # --- Phase 3: text1 -> out1; new generator -> text1+out1+text2 -> out2 ---
    total_time_p3_text1 = 0.0
    total_time_p3_combo = 0.0
    p3_out1 = None
    p3_out2 = None
    print(f"\n[Phase 3] split-generator benchmark ({BENCH_LOOPS} iterations)...\n")
    for i in range(BENCH_LOOPS):
        # Step A: text1 -> up to OUT1_MAX (256) new tokens
        gen_a = og.Generator(model, bench_params)
        t0 = time.time()
        gen_a.append_tokens(text1)
        pos_a = len(_tokens_to_list(gen_a.get_sequence(0)))
        _generate_up_to(gen_a, OUT1_MAX)
        p3_out1 = _new_tokens_since(gen_a, pos_a)
        total_time_p3_text1 += time.time() - t0
        # Step B: fresh generator, text1 + out1 + text2 -> out2 (full prefill, no reuse)
        gen_b = og.Generator(model, bench_params)
        t0 = time.time()
        gen_b.append_tokens(text1 + p3_out1 + text2)
        pos_b = len(_tokens_to_list(gen_b.get_sequence(0)))
        _generate_up_to(gen_b, OUT2_MAX)
        p3_out2 = _new_tokens_since(gen_b, pos_b)
        total_time_p3_combo += time.time() - t0
    print("\n--- Phase 3 results (text1->out1; new gen; text1+out1+text2->out2) ---")
    print(f"avg time step A text1 -> out1: {total_time_p3_text1 / BENCH_LOOPS:.4f}s")
    print(f"avg time step B text1+out1+text2 -> out2: {total_time_p3_combo / BENCH_LOOPS:.4f}s")
```
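The benchmark above reports only the mean over 10 runs. If you adapt it, it may be worth reporting the run-to-run spread as well, so that a difference between two configurations can be judged against the noise. The sketch below is an optional addition, not part of the benchmark script: `time_runs` is a hypothetical helper that times any zero-argument callable with `time.perf_counter()` and returns the mean and sample standard deviation.

```python
import statistics
import time


def time_runs(fn, runs=10):
    """Call fn() `runs` times; return (mean, stdev) of wall-clock seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    spread = statistics.stdev(durations) if runs > 1 else 0.0
    return statistics.mean(durations), spread


# Dry run with a cheap stand-in workload instead of a model call.
mean_s, stdev_s = time_runs(lambda: sum(range(10_000)), runs=5)
print(f"mean {mean_s:.6f} s, stdev {stdev_s:.6f} s")
```

In the benchmark, the callable would wrap one complete turn (append the prompt tokens, then generate), with a fresh generator created wherever the no-reuse case requires it.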
Supported Models
The following models, published by AMD on Hugging Face, have been fully tested with this feature.

| Hybrid models | NPU-fusion models |
|---|---|
| CodeLlama-7b-Instruct-hf | CodeLlama-7b-Instruct-hf_fusion |
| Llama-2-7b-chat-hf | Llama-2-7b-chat-hf_fusion |
| Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct_fusion |
| Llama-3.2-3B-Instruct | Llama-3.2-3B-Instruct_fusion |
| Meta-Llama-3.1-8B-Instruct | Meta-Llama-3.1-8B-Instruct_fusion |
| Mistral-7B-Instruct-v0.2 | Mistral-7B-Instruct-v0.1_fusion |
| Mistral-7B-Instruct-v0.3 | Mistral-7B-Instruct-v0.2_fusion |
| Phi-3.5-mini-instruct | Mistral-7B-Instruct-v0.3_fusion |
| Phi-3-mini-128k-instruct | Phi-3-mini-128k-instruct_fusion |
| Phi-4-mini-instruct | Phi-3-mini-4k-instruct_fusion |
| Phi-4-mini-instruct-awq-quant-onnx-hybrid | Phi-3.5-mini-instruct_fusion |
| Qwen-2.5_1.5B_Instruct | Phi-4-mini-instruct_fusion |
| Qwen2.5_3B_Instruct | Qwen-2.5_1.5B_Instruct_fusion |
| Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct_fusion |
| Qwen2.5-Coder-1.5B-Instruct | Qwen2.5-Coder-0.5B-Instruct_fusion |
| Qwen2.5-Coder-7B-Instruct | Qwen2.5-Coder-1.5B-Instruct_fusion |
| smollm_hybrid | Qwen2.5-Coder-7B-Instruct_fusion |
| | Qwen2.5_3B_Instruct_fusion |

The two columns are independent lists; models on the same row are not necessarily counterparts.
Summary
KV cache reuse helps reduce latency in multi-turn LLM applications by preserving the model's internal attention state and processing only newly added tokens instead of rebuilding the entire conversation. Combined with conversation rewind capabilities, it enables more responsive chatbots, document assistants, and agentic applications on Ryzen™ AI PCs.
Using the continuous decoding APIs available in Ryzen™ AI Software 1.7.1, developers can implement persistent conversational memory, reduce response latency, and unlock more efficient local AI experiences on Ryzen™ AI PCs.
From here, you can extend these techniques by increasing generation limits for production workloads, experimenting with additional multi-turn memory scenarios, combining rewind_to() with branching logic to create conversational tree exploration workflows, or applying the same pattern to NPU Fusion models for fully on-NPU inference. As conversations become longer and system prompts grow more complex, the performance benefits of KV cache reuse become increasingly valuable.
What to Explore Next
- Increase the token limit beyond 120 for production-quality responses
- Try the other five scenario types from the [multi-turn table](#what-the-test-covers)
- Combine `rewind_to()` with branching logic to build a conversational tree explorer
- Apply the same pattern to NPU-TPS fusion models for fully on-NPU inference
Resources
- [Ryzen AI Software 1.7.1: Installation Instructions](https://ryzenai.docs.amd.com/en/latest/inst.html)