At-Scale Agentic RL with Miles
Abstract
An intermediate-to-advanced hands-on workshop on scalable RL training with Miles on AMD GPUs. Participants will learn how to deploy RL techniques such as GRPO and PPO. The workshop also demonstrates multi-turn agentic RL training with mixed datasets, such as SWE-bench and Terminal-Bench, through a single endpoint.
July 22, 2026 1:30 PM - 4:15 PM PDT
Speakers
Presented By
PMTS Product Application Engineer | AMD
Session Type
Workshop
Related Product
Instinct, EPYC, ROCm
Related Sessions
-
Building GPU Kernels in Python with ROCm FlyDSL
This advanced hands-on workshop introduces ROCm FlyDSL, a Python-based domain-specific language (DSL) for developing high-performance GPU kernels with low-level control on AMD GPUs. Attendees will receive a concise introduction to FlyDSL and learn how to implement high-performance kernels in pure Python using the library. The workshop will also showcase how FlyDSL is used in production, with practical optimization techniques for improving the end-to-end serving performance of the Kimi K2.5 model using optimized FlyDSL Mixture-of-Experts (MoE) kernels.
July 23, 2026
-
ElMerFold: Exascale AI for Protein Structure Prediction with El Capitan
Protein structure prediction is foundational to modern biology, enabling breakthroughs in drug discovery, enzyme engineering, and AI-driven science. We present ElMerFold, a production-scale synthetic data generation workflow running on the El Capitan system across 11,000 nodes and 44,000 APUs. ElMerFold processes ~41 million proteins at 2,378 structures/s, achieving a 16.3× improvement over prior approaches and reaching 969 PFLOP/s FP32 inference performance.
July 23, 2026
-
Scaling AI in Production with AMD and Vultr
Explore how enterprises can scale AI from training to inference using AMD-powered infrastructure on Vultr. Through a deep dive into the University of Cambridge's Tessera model, learn how organizations can accelerate AI deployment, improve operational efficiency, and scale globally. The session also highlights real-world AI initiatives across healthcare, retail, finance, manufacturing, and hospitality.
July 23, 2026
-
Training at Scale with AMD Primus
Primus makes large-scale training on Instinct reliable, debuggable, and highly performant. It supports the latest open-source training frameworks and models, and is expanding support to new, cutting-edge model architectures, training techniques, and datatypes. State-of-the-art pre- and post-training performance with Primus, proven at scales of thousands of GPUs, positions AMD Instinct GPUs as a competitive solution for model development at frontier labs, enterprises, and AI startups.
July 23, 2026
-
Accelerating vLLM Inference on AMD Instinct GPUs with AMD ATOM
This advanced hands-on workshop introduces AMD ATOM, an open-source, optimized LLM inference backend for ROCm. Learn to serve LLMs with popular workflows using AMD-optimized attention and inference kernels. The workshop introduces out-of-tree plugins for existing vLLM and SGLang users and demonstrates how ATOM preserves the familiarity of these frameworks while accelerating model execution and boosting inference performance, bridging open-source frameworks with the AMD high-performance inference stack.
July 23, 2026
-
Benchmarking AI Systems: from Model Metrics to Real-World Performance
AI benchmarking is evolving rapidly as enterprises scale from experimentation to deployment. This interactive session explores how to measure real-world performance across inference and training workloads. We will discuss the metrics that matter, throughput vs. latency tradeoffs, memory bandwidth, and open software ecosystems. Gain practical insights into evaluating AI infrastructure for performance, scalability, efficiency, and TCO in modern enterprise and developer environments.
July 23, 2026
-
Agentic Kernel Performance Tuning with AMD ROCm
This session introduces an agentic kernel development workflow for optimizing AI and HPC workloads on AMD ROCm. Learn how a self-directing optimization loop can profile, analyze, optimize, and validate kernels, and generate production-ready kernel improvements with minimal manual tuning. The talk highlights how AMD is accelerating kernel engineering by condensing weeks of performance optimization effort into an automated, scalable workflow for developers and performance engineers.
July 23, 2026
-
Redefining Scalable AI Performance: OCI Supercomputing in the Cloud
Organizations building frontier AI models need infrastructure designed for performance at scale. This session shows how OCI combines AMD Instinct, AMD EPYC, and Pensando in Oracle Acceleron to enable ultra-low-latency networking for high-throughput distributed workloads, with practical guidance for designing infrastructure for large language, multimodal, and scientific AI models.
July 23, 2026