AMD ROCm 7.0 Software: Supercharging AI and HPC Infrastructure with AMD Instinct Series GPUs and Open Innovation
Oct 01, 2025

In today’s era of exponential compute demand, AMD ROCm™ 7.0 software represents a foundational leap forward in software infrastructure for AI and high-performance computing (HPC). Built from the ground up for AMD Instinct™ GPUs – especially the next-generation MI350 Series – ROCm 7.0 software introduces transformative capabilities in GPU compute, precision optimization, orchestration, and datacenter scalability. With full-stack performance enhancements, expanded datatype support (including FP4 and FP6), and robust support for modern AI and scientific workloads, ROCm 7.0 enables datacenter architects, HPC system managers, and AI infrastructure leads to deploy large-scale training and inference with maximum efficiency, throughput, and flexibility.
Datacenter-Grade Feature Highlights
ROCm 7.0 introduces a slate of upgrades across the AMD AI software stack, each aimed at improving usability, performance, and cross-platform compatibility:
- Unified Triton 3.3 Kernels for Cross-Vendor HPC/AIML Portability: Integrates Triton v3.3, enabling a unified GPU kernel development experience across AMD Instinct and other Triton-supported hardware. Triton automatically targets HIP for AMD hardware, with built-in support for vendor-specific instructions, making it ideal for scientific computing teams seeking performance portability. Out-of-the-box Flash Attention kernels written in Triton deliver fast transformer attention compared to unoptimized baselines, accelerating core LLM operations (see the kernel sketch after this list).
- DeepEP Inference Engine for Multi-GPU Efficiency: Debuts DeepEP, introducing intelligent pipelining for overlapping compute and data transfer across GPU nodes, ensuring high utilization for latency-sensitive AI workflows—particularly valuable in tightly coupled multi-node configurations common in supercomputing.
- Full Stack Day-0 Framework & Model Stack Readiness: ROCm 7.0 supports major AI frameworks (PyTorch, TensorFlow, ONNX, JAX/XLA) from day zero, along with the 1.8M+ models on the Hugging Face Hub and scientific workloads, helping ensure that system integrators and infrastructure managers can count on production-ready software for datacenter deployment.
- Vendor-Agnostic Orchestration for Heterogeneous GPU Datacenters: With integrated support for vLLM and SGLang, ROCm 7.0 offers highly efficient orchestration of transformer workloads across mixed-GPU clusters. SGLang intelligently balances compute loads in real time, supporting complex topologies and reducing idle time in multi-node installations. Crucially, these orchestration layers work across heterogeneous clusters – the ROCm communication layer abstracts AMD RCCL seamlessly, so multi-GPU collectives and pipeline transfers operate transparently on any mix of hardware.
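To make the Triton point above concrete, here is a minimal sketch of a Triton kernel that the ROCm Triton backend can lower to HIP for AMD Instinct GPUs. This is illustrative code, not an AMD-provided sample; the kernel, block size, and tensor sizes are arbitrary.

```python
import torch
import triton
import triton.language as tl

# Minimal element-wise addition kernel. Triton lowers this source to HIP/AMDGCN
# on ROCm builds and to PTX on CUDA builds, so the kernel stays vendor-neutral.
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

if __name__ == "__main__":
    # PyTorch's ROCm build exposes AMD GPUs through the torch.cuda namespace.
    x = torch.randn(1 << 20, device="cuda")
    y = torch.randn(1 << 20, device="cuda")
    assert torch.allclose(add(x, y), x + y)
```

The same source compiles on other Triton-supported GPUs, which is the portability argument behind the unified-kernel story.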
Scalable Performance for AI & HPC Infrastructure
Training & Inference Throughput
ROCm 7.0 software boasts significant performance gains for AI workloads. Key results include:
- ~3× Training Throughput with AMD Instinct MI300X GPUs and ROCm 7.0: A preview version of ROCm 7.0 delivered an average of three times higher training performance (measured in TFLOPS) on an 8x AMD Instinct MI300X GPU node compared to an identically configured node running ROCm 6 software¹. Large-scale model training (e.g. GPT pre-training via Megatron-LM) benefits from new FP8 precision and optimized kernels, enabling faster time-to-train for transformer models on ROCm.
- Up to 4.6× Inference Throughput Uplift for Generative AI Infrastructure: A preview version of ROCm 7.0 achieves up to 4.6× more tokens/sec, measured as a combined average across the Llama 3.1 70B, Qwen 72B, and DeepSeek R1 models on the AMD Instinct MI300X platform, compared to the same platform running ROCm 6.x software². High-throughput quantized formats (FP4, INT8, etc.) combined with better pipeline parallelism make ROCm software ideal for scalable AI serving deployments; a minimal sketch of running reduced-precision compute through PyTorch on ROCm follows this list.
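As a hedged illustration of framework readiness and reduced-precision execution, the sketch below (not AMD-provided code) shows that PyTorch's ROCm build exposes AMD Instinct GPUs through the familiar torch.cuda interface, so a low-precision GEMM runs unchanged. FP8 and FP4 paths are surfaced through framework- and kernel-level integrations rather than a plain tensor dtype, so bfloat16 is used here as the stand-in.

```python
import torch

# ROCm's PyTorch build reuses the torch.cuda device namespace, so CUDA-style
# scripts typically run unmodified on AMD Instinct GPUs.
assert torch.cuda.is_available(), "no ROCm-visible GPU found"
device = torch.device("cuda")
print("Running on:", torch.cuda.get_device_name(device))

# Reduced-precision GEMM (bfloat16 shown); the same pattern is what optimized
# FP8/FP4 kernels accelerate further at the library level.
a = torch.randn(4096, 4096, device=device, dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device=device, dtype=torch.bfloat16)
c = a @ b
torch.cuda.synchronize()
print(c.shape, c.dtype)
```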
Competitive Benchmarking vs. NVIDIA GPUs
AMD has demonstrated that a pre-release build of ROCm 7.0 software on AMD Instinct hardware can meet or exceed NVIDIA’s higher-end accelerators in throughput. For example, an AMD Instinct MI355X platform (8x GPU, FP4 precision) delivered up to 1.3× better inference throughput than an 8x GPU NVIDIA B200 platform on a cutting-edge LLM (the DeepSeek R1 model)³. This shows that the AMD Instinct MI355X Series offers higher generative AI throughput per node, directly challenging the NVIDIA B200 GPU on large-model inference. (Note: the AMD Instinct MI355X is a next-gen GPU optimized for inference, with 288GB of HBM3E per GPU versus 180GB per NVIDIA B200 GPU in the tested configuration.)
Overall, ROCm 7.0 software is designed for performance gains that can translate into higher productivity and lower costs for AI infrastructure. In short, technical leaders can achieve favorable AI outcomes by adopting the optimized AMD ROCm 7.0 software stack.
AMD Instinct™ MI350 Series GPU Enablement for Advanced AI and Scientific Computing
ROCm 7.0 software introduces production-grade software enablement for the AMD Instinct MI350 Series, purpose-built for large-scale inference and training workloads. The MI350 GPU leverages advanced FP4 and FP6 datatype support, offering outstanding compute density and memory efficiency for transformer workloads. Combined with the AI Tensor Engine for ROCm (AITER) software, these GPUs are designed to deliver faster performance on decoder execution and significant acceleration across GEMM, attention, and MoE layers—transforming datacenter throughput for both AI and HPC workloads.
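The memory-efficiency argument can be made concrete with the same back-of-the-envelope method described in the MI350-012 footnote below: bytes per parameter for the chosen datatype, plus roughly 10% overhead, divided into per-GPU HBM capacity. The sketch below is illustrative arithmetic under those assumptions, not an AMD sizing tool, and the packed FP6 size is a nominal value.

```python
import math

# Nominal bytes per parameter for common datatypes (FP6/FP4 as packed sizes).
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP6": 0.75, "FP4": 0.5}

def min_gpus(params_billion: float, dtype: str, hbm_gb: float = 288.0,
             overhead: float = 0.10) -> int:
    """Estimate the minimum GPU count needed just to hold the model weights."""
    model_gb = params_billion * BYTES_PER_PARAM[dtype] * (1.0 + overhead)
    return math.ceil(model_gb / hbm_gb)

# Example: a 520B-parameter model on 288GB MI355X-class GPUs.
for dtype in ("FP16", "FP8", "FP4"):
    print(dtype, "->", min_gpus(520, dtype), "GPU(s)")
```

By this estimate, the weights of a 520B-parameter model at FP4 occupy only a fraction of a single 8-GPU MI355X node's HBM (weights only, before KV cache and activations), which is why the low-precision datatypes matter for serving density.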
Ecosystem & Community
ROCm 7.0 software is not just a standalone release – it’s backed by a robust ecosystem strategy and community collaboration:
- Open-Source Collaboration: AMD continues to upstream key optimizations from ROCm 7.0 back to the broader open-source community. Improvements to kernel algorithms (e.g. fused attention, optimized GEMMs) have been contributed to projects like ONNX-MLIR and the MLIR compiler ecosystem; advances in distributed training flow into DeepSpeed; scheduling and runtime innovations feed into OpenXLA; and the Triton OSS compiler community sees contributions as well. This cooperative approach ensures that ROCm software stays aligned with the industry’s best practices and that the wider AI ecosystem benefits in turn.
- Cross-Vendor Flexibility: A core tenet of ROCm 7.0 is vendor neutrality. The software stack is designed to work across hardware from multiple vendors out of the box. Abstraction layers like NIXL automatically select the appropriate communication backend (such as RCCL) at runtime, and kernel compilers like Triton generate device-specific code so that developers don’t need to write vendor-specific variants. There are no proprietary lock-ins or “works on X only” caveats – ROCm 7.0 software’s broad support enables true portability and protects investments as new hardware emerges.
Deployment Scenarios in Datacenter Environments
ROCm 7.0 software’s features come together to enable a wide range of real-world AI workflows. Below are a few examples of what becomes possible with this release:
- Distributed Inference Scheduling at Scale: With vLLM, SGLang, and ROCm 7.0, datacenter operators can serve large generative models across heterogeneous clusters while meeting strict latency targets. ROCm integrates both frameworks to optimize scheduling, balance workloads, and streamline backend communications for efficient large-scale AI services (a minimal serving sketch follows this list).
- Fine-Tuning Large Models on Minimal Hardware: Using ZeRO-3, quantization, and FP8/FP4 datatype support, a preview version of ROCm 7.0 enables datacenter teams to fit and fine-tune models of up to 520B parameters using only MI355X GPUs⁴ – dramatically improving compute efficiency without infrastructure sprawl.
- Hybrid Cloud Bursting: For organizations running on-premises AI deployments, ROCm 7.0 offers flexibility to burst workloads to the cloud when needed. Workflows can be configured to “burst” from a small on-prem cluster (e.g. 4 GPUs) to a larger cloud cluster (say 32 GPUs) during peak demand. ROCm software’s orchestration layer handles the transition seamlessly – spinning up containers on cloud instances, redistributing model shards with llm-d on the fly, and later scaling back down. This elasticity means you pay for extra compute only when required, without having to maintain a massive cluster.
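As a hedged illustration of the distributed-inference scenario above, the following sketch uses vLLM's public Python API for offline generation; the model name, tensor-parallel degree, and sampling settings are illustrative placeholders rather than an AMD reference configuration.

```python
from vllm import LLM, SamplingParams

# Shard a large model across 8 GPUs in one node via tensor parallelism;
# on ROCm, the NCCL-style collectives used underneath map to RCCL.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model choice
    tensor_parallel_size=8,
)

sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

prompts = [
    "Summarize the benefits of mixed-precision inference.",
    "Explain tensor parallelism in one paragraph.",
]

# Each result carries the original prompt and one or more completions.
for output in llm.generate(prompts, sampling):
    print(output.prompt)
    print(output.outputs[0].text)
```

Online serving follows the same pattern through vLLM's OpenAI-compatible server, with SGLang available as an alternative scheduler for more complex request topologies.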
Enterprise Ready With ROCm 7.0
ROCm 7.0 offers an open-source, modular AI software stack built on a multi-layered architecture that spans from core algorithms to infrastructure, delivering a cohesive platform for enterprise AI. Its open, vendor-neutral design allows deployment across various platforms without vendor lock-in, making ROCm 7.0 ideal for heterogeneous, hybrid cloud and on-premises data center environments. ROCm 7.0 integrates seamlessly with Kubernetes orchestration and MLOps pipelines, enabling enterprise infrastructure teams to run and scale AI workloads in containerized environments and oversee the entire AI lifecycle from model ingestion to inference within a unified platform. Furthermore, the platform offers robust multi-tenant capabilities via Kubernetes – including fine-grained resource quotas, role-based access control (RBAC), and built-in observability with monitoring and auto-scaling hooks – so infrastructure teams can securely share and govern AI resources across projects with confidence.
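To make the Kubernetes integration tangible, here is a minimal sketch using the official Kubernetes Python client to request an AMD GPU through the amd.com/gpu extended resource exposed by the AMD GPU device plugin. The pod name, namespace, and container image are illustrative assumptions, not a prescribed deployment.

```python
from kubernetes import client, config

# Build a pod spec that requests one AMD GPU via the device plugin's
# extended resource name ("amd.com/gpu"); names and image are illustrative.
config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="rocm-inference-demo"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="rocm-smoke-test",
                image="rocm/pytorch:latest",  # example ROCm container image
                command=["python3", "-c",
                         "import torch; print(torch.cuda.is_available())"],
                resources=client.V1ResourceRequirements(
                    limits={"amd.com/gpu": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Resource quotas, RBAC policies, and autoscaling then apply to these GPU requests exactly as they do to CPU and memory, which is what enables the multi-tenant governance described above.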
Conclusion
ROCm 7.0 marks a significant milestone in AMD’s commitment to open, high-performance computing for AI and HPC. By delivering a production-grade platform with uncompromising performance and broad compatibility, ROCm 7.0 enables organizations to deploy advanced AI solutions at scale on their terms – whether that’s fine-tuning the latest LLM on an on-prem cluster or serving millions of inference requests across a heterogeneous fleet. This release solidifies ROCm software as a truly vendor-agnostic AI stack, marrying software flexibility with GPU power to drive breakthroughs in machine learning. With ongoing optimizations and a thriving open-source community, ROCm software continues to redefine what’s possible in AI, paving the way for researchers, engineers, and enterprises to innovate without barriers. We invite you to explore ROCm 7.0 and join the growing community pushing the frontiers of AI on an open platform.
Deployment & Documentation Resources
To explore ROCm 7.0 or deploy it in your environment, the following resources will be helpful:
- AMD ROCm Official Documentation – Comprehensive guides, release notes, and tutorials for installing and using ROCm 7.0.
- AMD ROCm 7.0 Technical Blog – Comprehensive overview of the stack and new features
- AMD Infinity Hub – A catalog of ready-to-run containerized applications and frameworks (for both HPC and AI) optimized for ROCm, along with deployment guides.
- ROCm Developer Hub – Portal for developers featuring training materials, webinars, community forums, and best practices for ROCm.
For more information, visit amd.com/ROCm and join the open AI/HPC community driving ROCm forward.
Footnotes
- (MI300-081) - Testing by AMD as of May 15, 2025, measuring the training throughput performance (TFLOPS) of ROCm 7.0 preview version software, Megatron-LM, on (8) AMD Instinct MI300X GPUs running Llama 2-70B (4K), Qwen1.5-14B, and Llama3.1-8B models, and a custom docker container vs. a similarly configured system with AMD ROCm 6.0 software. Server manufacturers may vary configurations, yielding different results. Performance may vary based on configuration, software, vLLM version, and the use of the latest drivers and optimizations. MI300-081
- (MI300-080) - Testing by AMD as of May 15, 2025, measuring the inference performance in tokens per second (TPS) of AMD ROCm 6.x software, vLLM 0.3.3 vs. AMD ROCm 7.0 preview version SW, vLLM 0.8.5 on a system with (8) AMD Instinct MI300X GPUs running Llama 3.1-70B (TP2), Qwen 72B (TP2), and Deepseek-R1 (FP16) models with batch sizes of 1-256 and sequence lengths of 128-204. Stated performance uplift is expressed as an average TPS over the (3) LLMs tested. Server manufacturers may vary configurations, yielding different results. Performance may vary based on configuration, software, vLLM version, and the use of the latest drivers and optimizations. MI300-080
- (ROC7-001) - Testing by AMD Performance Labs as of May 25, 2025, measuring the inference performance in tokens per second (TPS) of the AMD Instinct MI355X platform (8xGPU) with ROCm 7.0 pre-release build 16047, running DeepSeek R1 LLM on SGLang versus NVIDIA Blackwell B200 platform (8xGPU) with CUDA version 12.8. Server manufacturers may vary configurations, yielding different results. Performance may vary based on hardware configuration, software version, and the use of the latest drivers and optimizations.
Additional Hardware Configuration(s) 2P AMD EPYC™ 9575F CPU server with 8x AMD Instinct™ MI355X (288GB, 1400W) GPUs, Supermicro AS-4126GS-NMR0LCC, 3 TiB (24 DIMMs, 6400 MT/s memory, 128 GiB/DIMM), 2x 3.49TB Micron 7450 storage, BIOS version: 1.4a. 2P Intel Xeon 6972P CPU server with 8x NVIDIA B200 (180GB, 1000W) GPUs, Supermicro SYS-A22GA-NBRT, 2.95 TiB (24 DIMMs, 4800 MT/s memory, 128 GiB/DIMM), 2x 3.5 TB Micron 7450 storage, BIOS version: 1.8.
Additional Software Configuration(s) Ubuntu 22.04 LTS with Linux kernel 6.8.0-59-generic, ROCm 7.0.0 (pre-release build 16047) + amdgpu 6.14.5 (build 2168543), Pre-release Docker: rocm/aigmodels-private:experimental_950_5_26 (cache off, --chunked prefill size 131072, torch compile), TP8+DP8 vs. Ubuntu 22.04.5 LTS with Linux kernel 5.15.0-72-generic, Driver Version: 570.133.20, CUDA Version: 12.8, Public Docker: lmsysorg/sglang:blackwell.
- (MI350-012) - Based on calculations by AMD as of April 17, 2025, using the published memory specifications of the AMD Instinct MI350X / MI355X GPUs (288GB) vs MI300X (192GB) vs MI325X (256GB). Calculations performed with FP16 precision datatype at (2) bytes per parameter, to determine the minimum number of GPUs (based on memory size) required to run the following LLMs: OPT (130B parameters), GPT-3 (175B parameters), BLOOM (176B parameters), Gopher (280B parameters), PaLM 1 (340B parameters), Generic LM (420B, 500B, 520B, 1.047T parameters), Megatron-LM (530B parameters), LLaMA (405B parameters) and Samba (1T parameters). Results based on GPU memory size versus memory required by the model at defined parameters, plus 10% overhead. Server manufacturers may vary configurations, yielding different results. Results may vary based on GPU memory configuration, LLM size, and potential variance in GPU memory access or the server operating environment. *All results based on FP16 datatype. For FP8 results = x2. For FP4 = x4. MI350-012
