Unlocking LLM Inference Performance with ROCm FlyDSL
Abstract
This advanced hands-on workshop introduces ROCm FlyDSL, a Python-based domain-specific language (DSL) for developing high-performance GPU kernels with low-level control on AMD GPUs. Attendees will receive a concise introduction to FlyDSL and learn how to implement high-performance kernels using the library. The workshop will also showcase practical optimization techniques for improving end-to-end serving performance of the Kimi K2.5 model using optimized FlyDSL Mixture-of-Experts (MoE) kernels.
July 22, 2026 16:30 - 17:15
Speakers
Presented By
SMTS Product Application Engineer | AMD
Session Type
Workshop
Related Product
Instinct, EPYC, ROCm
Related Sessions
-
Accelerating vLLM Inference on AMD Instinct GPUs with AMD ATOM
This advanced hands-on workshop introduces AMD ATOM, an open-source, optimized LLM inference backend for ROCm. Learn to serve LLMs with popular workflows using AMD-optimized attention and inference kernels. The workshop introduces out-of-tree plugins for existing vLLM and SGLang users and demonstrates how ATOM preserves the familiarity of these frameworks while accelerating model execution and boosting inference performance, bridging open-source frameworks with the AMD high-performance inference stack.
July 23, 2026
-
Training at Scale with AMD Primus
Primus makes large-scale training on Instinct reliable, debuggable, and highly performant. It supports the latest OSS training frameworks and models, and is expanding support to new, cutting-edge model architectures, training techniques, and datatypes. Primus’ SOTA pre- and post-training performance, proven at scales of thousands of GPUs, positions Instinct as a competitive solution for model development at frontier labs, enterprises, and AI startups.
July 23, 2026
-
Agentic Kernel Performance Tuning with AMD ROCm
This session introduces an agentic kernel development workflow for optimizing AI and HPC workloads on AMD ROCm. Learn how a self-directing optimization loop can profile, analyze, optimize, validate, and generate production-ready kernel improvements with minimal manual tuning. The talk highlights how AMD is accelerating kernel engineering by turning weeks of performance optimization effort into an automated, scalable workflow for developers and performance engineers.
July 23, 2026
-
Unlocking LLM Inference Performance with ROCm FlyDSL
This advanced hands-on workshop introduces ROCm FlyDSL, a Python-based domain-specific language (DSL) for developing high-performance GPU kernels with low-level control on AMD GPUs. Attendees will receive a concise introduction to FlyDSL and learn how to implement high-performance kernels using the library. The workshop will also showcase practical optimization techniques for improving end-to-end serving performance of the Kimi K2.5 model using optimized FlyDSL Mixture-of-Experts (MoE) kernels.
July 23, 2026
-
Accelerating LLM Inference on AMD ROCm with AITER and ATOM
This technical talk introduces AITER and ATOM, optimized inference technologies for AMD ROCm software. Learn how AITER accelerates LLM and MoE execution with optimized kernels and distributed inference enhancements, while ATOM integrates these capabilities into familiar vLLM and SGLang workflows through plugin-based acceleration. The session highlights how AMD enables scalable, high-performance open-source LLM serving while preserving existing developer and deployment workflows.
July 23, 2026
-
Transformation of AMD ROCm Software in a New AI Era
This session explores an AI-native GPU software stack for large-scale AI systems on AMD hardware. Learn how AI-assisted GPU programming, distributed training, optimized inference, memory expansion, and agentic deployment workflows are enabling scalable AI infrastructure across clusters and hyperscale environments. The talk highlights practical approaches for improving performance, observability, automation, and resource efficiency on AMD GPU platforms.
July 23, 2026
-
ROCm Certification Associate: Architecture, Programming, and Optimization
This first hour of a required 3-hour certification covers ROCm setup, ecosystem components, compatibility, and deployment using containers and images, along with troubleshooting basics. Participants learn HIP programming and run a first program, reinforced by labs. The session then introduces GPU architecture fundamentals, performance concepts, and Instinct hardware, with hands-on benchmarking and profiling exercises.
This 3-hour ROCm Certification course provides hands-on training in ROCm fundamentals, AMD GPU architecture, AI and HPC development, PyTorch, HIP programming, libraries, CUDA porting, profiling, and performance optimization. Participants will learn to build, debug, and optimize GPU applications through guided labs and practical exercises. Completion of all three course modules is required to qualify for the ROCm Certification exam at the end.
July 23, 2026
-
Inference Performance Tuning with AI Agents
In this advanced hands-on workshop, learn about the AMD agentic kernel development workflow and how to deploy it for your use case. This course will help replace weeks-long performance engineering with an agentic, self-directing loop that profiles, plans, optimizes, validates, and delivers production-ready kernel improvements automatically.
July 23, 2026