Efficient LLM Serving at Scale with Unified Caching (vLLM+LMCache)

Name: Efficient LLM Serving at Scale with Unified Caching (vLLM+LMCache)
Start: 2026-07-23T15:00:00-07:00
End: 2026-07-23T15:45:00-07:00

This hands-on workshop for advanced users shows how TensorMesh and AMD enable efficient LLM serving through a unified caching layer. You will learn how tiered KV cache management brings out the benefits of cache-aware inference: improving throughput under interactive latency SLAs, reducing time to first token (TTFT) through KV cache reuse and offload, and enabling production-style distributed inference on Instinct GPUs.

SMTS Systems Design Engineer | AMD

Topic

AI Training & Inference

Session Type

Workshop

Domain-Specific AI at Scale: Open Models, Post-Training, and AI Infrastructure

Learn how domain-specific AI moves beyond generic models using post-training, domain evals, and scalable open infrastructure. Using Open Telco Models as a case study, this session covers curated data, reward loops, unified training and serving, and AMD Instinct/ROCm-based stacks for building specialized AI systems at enterprise scale.

July 23, 2026
Panel Discussion: From Models to Production—A Blueprint for AI at Scale

Moving AI from training to production takes more than GPUs. Hear how Microsoft and Chai AI built scalable AI infrastructure on Vultr using AMD Instinct GPUs and ROCm. Learn best practices for data locality, secure networking, Kubernetes orchestration, benchmarking, cost optimization, and scale-out operations. Leave with a practical blueprint for deploying fast, portable, production-ready AI workloads.

July 23, 2026
Zyphra: Large-Model Training Lessons on AMD

Learn what it took to train ZAYA1-74B, a 74B-parameter mixture-of-experts model, end-to-end on AMD Instinct MI300X. This session shares key engineering lessons from designing an efficient training stack, optimizing long-context performance, and building a reinforcement learning pipeline for math, code, and agentic AI workloads. Discover practical insights for training and deploying large AI models on AMD infrastructure.

July 23, 2026
Training at Scale with AMD Primus

Primus makes large-scale training on Instinct reliable, debuggable, and highly performant. It supports the latest OSS training frameworks and models, and is expanding support to new, cutting-edge model architectures, training techniques, and datatypes. SOTA pre- and post-training performance with Primus, proven at scales of thousands of GPUs, positions AMD Instinct GPUs as a competitive solution for model development at frontier labs, enterprises, and AI startups.

July 23, 2026
Agentic Kernel Performance Tuning with AMD ROCm

This session introduces an agentic kernel development workflow for optimizing AI and HPC workloads on AMD ROCm. Learn how a self-directing optimization loop can profile, analyze, optimize, validate, and generate production-ready kernel improvements with minimal manual tuning. The talk highlights how AMD is accelerating kernel engineering by condensing weeks of performance optimization effort into an automated, scalable workflow for developers and performance engineers.

July 23, 2026
Redefining Scalable AI Performance: OCI Supercomputing in the Cloud

Organizations building frontier AI models need infrastructure designed for performance at scale. This session shows how OCI combines AMD Instinct, AMD EPYC, and Pensando in Oracle Acceleron to enable ultra-low-latency networking for high-throughput distributed workloads, with practical guidance for designing infrastructure for large language, multimodal, and scientific AI models.

July 23, 2026
Accelerating Inference at Scale: Crusoe's Experience with AMD

As a customer and operator of AMD technology, Crusoe’s Managed Inference team has built a production inference stack designed for speed, efficiency, and scale. This session will show how AMD Instinct, including MI355X, helped shape its serverless inference offering and what teams can apply when building production AI services that balance performance, memory bandwidth, and cost.

July 23, 2026
ROCm AMD Infinity Context: Shared KV Cache for Agentic AI

As LLM inference shifts toward long context, multi-turn sessions, and agents, the KV cache becomes a dominant scaling constraint that no longer fits in GPU HBM. Existing caching solutions partially solve this using expensive CPU memory, non-optimized remote storage, or local, non-shareable NVMe SSDs. ROCm AIC proposes a low-latency storage tier, shared across GPUs, for offloading the KV cache. This session gives insight into the ROCm AIC architecture and presents measured performance, backed by partner vendors.

July 23, 2026
