vLLM in 2026: Challenges and Optimizations

Start: 2026-07-22T13:50:00-07:00
End: 2026-07-22T14:10:00-07:00

As LLMs grow in size, context length, and architectural complexity, vLLM must evolve to meet new performance and scalability challenges. This talk presents key improvements in vLLM's core architecture and highlights major optimizations in KV cache management and GPU kernels. It also covers the latest updates on the vLLM community, large-scale serving, and cross-hardware efforts.

July 22, 2026 1:50 PM - 2:10 PM PDT

Co-Founder and CEO | Inferact

Topic

Developer Platforms & Open Ecosystems

AI Training & Inference

Session Type

Tech Talk

Domain-Specific AI at Scale: Open Models, Post-Training, and AI Infrastructure

Learn how domain-specific AI moves beyond generic models using post-training, domain evals, and scalable open infrastructure. Using Open Telco Models as a case study, this session covers curated data, reward loops, unified training and serving, and AMD Instinct/ROCm-based stacks for building specialized AI systems at enterprise scale.

July 23, 2026
Panel Discussion: From Models to Production—A Blueprint for AI at Scale

Moving AI from training to production takes more than GPUs. Hear how Microsoft and Chai AI built scalable AI infrastructure on Vultr using AMD Instinct GPUs and ROCm. Learn best practices for data locality, secure networking, Kubernetes orchestration, benchmarking, cost optimization, and scale-out operations. Leave with a practical blueprint for deploying fast, portable, production-ready AI workloads.

July 23, 2026
Accelerating LLM Inference on AMD GPUs with AMD ATOM

This advanced hands-on workshop introduces AMD ATOM, an open-source optimized LLM inference backend for ROCm. Learn to serve LLMs with popular workflows using AMD-optimized attention and inference kernels. The workshop introduces out-of-tree plugins for existing vLLM and SGLang users and aims to demonstrate how ATOM preserves the familiarity of these frameworks while accelerating model execution and boosting inference performance, bridging open-source frameworks with the AMD high-performance inference stack.

July 23, 2026
Zyphra: Large-Model Training Lessons on AMD

Learn what it took to train ZAYA1-74B, a 74B-parameter mixture-of-experts model, end-to-end on AMD Instinct MI300X. This session shares key engineering lessons from designing an efficient training stack, optimizing long-context performance, and building a reinforcement learning pipeline for math, code, and agentic AI workloads. Discover practical insights for training and deploying large AI models on AMD infrastructure.

July 23, 2026
Training at Scale with AMD Primus

Primus makes large-scale training on Instinct reliable, debuggable, and highly performant. It supports the latest OSS training frameworks and models, and is expanding support to new, cutting-edge model architectures, training techniques, and datatypes. SOTA pre- and post-training performance with Primus, proven at scales of thousands of GPUs, positions AMD Instinct GPUs as a competitive solution for model development at frontier labs, enterprises, and AI startups.

July 23, 2026
Agentic Kernel Performance Tuning with AMD ROCm

This session introduces an agentic kernel development workflow for optimizing AI and HPC workloads on AMD ROCm. Learn how a self-directing optimization loop can profile, analyze, optimize, validate, and generate production-ready kernel improvements with minimal manual tuning. The talk highlights how AMD is accelerating kernel engineering by turning weeks of performance optimization effort into an automated, scalable workflow for developers and performance engineers.

July 23, 2026
Redefining Scalable AI Performance: OCI Supercomputing in the Cloud

Organizations building frontier AI models need infrastructure designed for performance at scale. This session shows how OCI combines AMD Instinct, AMD EPYC, and Pensando in Oracle Acceleron to enable ultra-low-latency networking for high-throughput distributed workloads, with practical guidance for designing infrastructure for large language, multimodal, and scientific AI models.

July 23, 2026
Accelerating Inference at Scale: Crusoe's Experience with AMD

As a customer and operator of AMD technology, Crusoe’s Managed Inference team has built a production inference stack designed for speed, efficiency, and scale. This session will show how AMD Instinct, including MI355X, helped shape its serverless inference offering and what teams can apply when building production AI services that balance performance, memory bandwidth, and cost.

July 23, 2026
