Building GPU Kernels in Python with ROCm FlyDSL

Name: Building GPU Kernels in Python with ROCm FlyDSL
Start: 2026-07-22T16:30:00-07:00
End: 2026-07-22T17:15:00-07:00

This advanced hands-on workshop introduces ROCm FlyDSL, a Python-based domain-specific language (DSL) for developing GPU kernels with low-level control on AMD GPUs. Attendees will receive a concise introduction to FlyDSL and learn how to implement GPU kernels in pure Python using the library. The workshop will also showcase how FlyDSL is used in production to improve the end-to-end serving performance of large language models such as Kimi K2.5.

July 22, 2026 4:30 PM - 5:15 PM PDT

SMTS Product Application Engineer | AMD

Topic: Developer Platforms & Open Ecosystems

Session Type: Workshop

Accelerating LLM Inference on AMD GPUs with AMD ATOM

This advanced hands-on workshop introduces AMD ATOM, an open-source, optimized LLM inference backend for ROCm. Learn to serve LLMs with popular workflows using AMD-optimized attention and inference kernels. The workshop introduces out-of-tree plugins for existing vLLM and SGLang users and demonstrates how ATOM preserves the familiarity of those frameworks while accelerating model execution and boosting inference performance, bridging open-source frameworks with the AMD high-performance inference stack.

July 23, 2026
Training at Scale with AMD Primus

Primus makes large-scale training on Instinct reliable, debuggable, and highly performant. It supports the latest OSS training frameworks and models, and is expanding support to new, cutting-edge model architectures, training techniques, and datatypes. SOTA pre- and post-training performance with Primus, proven at scales of thousands of GPUs, positions AMD Instinct GPUs as a competitive solution for model development at frontier labs, enterprises, and AI startups.

July 23, 2026
Agentic Kernel Performance Tuning with AMD ROCm

This session introduces an agentic kernel development workflow for optimizing AI and HPC workloads on AMD ROCm. Learn how a self-directing optimization loop can profile, analyze, optimize, validate, and generate production-ready kernel improvements with minimal manual tuning. The talk highlights how AMD is accelerating kernel engineering by condensing weeks of performance optimization effort into an automated, scalable workflow for developers and performance engineers.

July 23, 2026
Building GPU Kernels in Python with ROCm FlyDSL

This advanced hands-on workshop introduces ROCm FlyDSL, a Python-based domain-specific language (DSL) for developing GPU kernels with low-level control on AMD GPUs. Attendees will receive a concise introduction to FlyDSL and learn how to implement GPU kernels in pure Python using the library. The workshop will also showcase how FlyDSL is used in production to improve the end-to-end serving performance of large language models such as Kimi K2.5.

July 23, 2026
Accelerating LLM Inference on AMD ROCm with AITER and ATOM

This technical talk introduces AITER and ATOM, optimized inference technologies for AMD ROCm software. Learn how AITER accelerates LLM and MoE execution with optimized kernels and distributed inference enhancements, while ATOM integrates these capabilities into familiar vLLM and SGLang workflows through plugin-based acceleration. The session highlights how AMD enables scalable, high-performance open-source LLM serving while preserving existing developer and deployment workflows.

July 23, 2026
Transformation of AMD ROCm Software in a New AI Era

This session explores an AI-native GPU software stack for large-scale AI systems on AMD hardware. Learn how AI-assisted GPU programming, distributed training, optimized inference, memory expansion, and agentic deployment workflows are enabling scalable AI infrastructure across clusters and hyperscale environments. The talk highlights practical approaches for improving performance, observability, automation, and resource efficiency on AMD GPU platforms.

July 23, 2026
Inference Performance Tuning with AI Agents

In this advanced hands-on workshop, learn about the AMD agentic kernel development workflow and how to deploy it for your use case. The workshop will help replace weeks-long performance engineering with an agentic, self-directing loop that profiles, plans, optimizes, validates, and delivers production-ready kernel improvements automatically.

July 23, 2026
ROCm Certification Associate: Architecture, Programming, and Optimization

This 4-hour ROCm Certification workshop includes three hours of hands-on training followed by a one-hour certification exam. Participants will learn ROCm fundamentals, AMD GPU architecture, AI and HPC development, PyTorch, HIP programming, ROCm libraries, CUDA porting, profiling, and performance optimization. Guided labs provide practical experience building, debugging, and optimizing GPU-accelerated applications on AMD platforms.

July 23, 2026
