Win on TCO: How AMD Instinct™ MI355X GPUs Achieve Cost-Competitive Distributed Inference Through SGLang with MoRI

May 27, 2026

As large-scale LLM inference increasingly relies on multi-node disaggregated serving with wide Expert Parallelism (EP), the critical question for infrastructure teams is no longer "which GPU is fastest" but "which platform delivers the lowest cost per token at production interactivity levels."

Using SGLang with AMD's MoRI communication library on AMD Instinct™ MI355X GPUs, we demonstrate that AMD achieves competitive - and at key operating points, superior - Total Cost of Ownership (TCO) compared to NVIDIA B200 running Dynamo + TRT-LLM on DeepSeek-R1 disaggregated inference. These results are validated by InferenceX™, SemiAnalysis's open-source continuous benchmark platform that tests across hundreds of GPUs with a live dashboard.

This post describes what we have achieved, how we achieved it, and our future plans.

Figure 1: InferenceX TCO Calculator - AMD Instinct™ MI355X and B200 configurations

Results at a Glance

At the typical operating point representative of production coding assistants and interactive chatbots - e.g. 129 tok/s/user interactivity - we observe the following:

AMD Instinct™ MI355X GPUs (MoRI SGLang MTP): $0.173 per million tokens, 2,378 tok/s/GPU (achieved on 24 GPUs)
NVIDIA B200 (Dynamo TRT-LLM MTP): $0.178 per million tokens, 3,128 tok/s/GPU (achieved on 28 GPUs)
NVIDIA B200 (Dynamo SGLang MTP): $0.284 per million tokens, 1,945 tok/s/GPU (achieved on 48 GPUs)

The MI355X GPUs deliver 2.9% lower cost than B200 TRT-LLM, 39% lower cost than B200 SGLang, and 1.22× higher throughput per GPU than B200 SGLang - beating B200 SGLang on both cost and performance simultaneously.

Figure 2: InferenceX Performance pareto curve - throughput vs interactivity for AMD Instinct™ MI355X MoRI SGLang vs B200 Dynamo SGLang

What You'll Learn Under the Hood

How AITER FlyDSL empowers MoE in distributed scenarios
How MoRI quantized dispatch/combine benefits expert parallelism
How MoRI KV cache backend accelerates transfer
How stream overlapping with SDMA hides communication latency behind compute
How Specv2 enablement on AMD GPU accelerates decoding
How CPU streaming overhead is optimized for high throughput scenarios

Key Optimization Details

How MoRI quantized all-to-all dispatch/combine benefits expert parallelism

Hybrid FP4/FP8 quantized all-to-all

In expert-parallel MoE inference, each token must be dispatched to the top-k selected experts via dispatch and combine communication primitives. For DeepSeek-R1 with a hidden dimension of 7,168 and top-8 expert routing, BF16 communication volume is significantly higher than that of FP8 and FP4 quantized communication.

The key insight is that quantizing the dispatch payload to MXFP4 on the fly speeds up transmission without accuracy loss. Similarly, expert outputs (combine phase) tolerate FP8 quantization without meaningful accuracy loss.

MoRI supports multi-level quantized communication:

MoRI-EP combine kernel micro-benchmark on the MI355X GPUs (EP8, BF16 input, max_tokens=4096, hidden_dim=7168, scale_dim=56, zero-copy=0, dispatch=128/16, combine=128/16, 10-round average, combine latency only)
Case	Path	Combine Latency
Normal (no-scale)	fp8_blockwise specialized	~736 µs
Uniform[−1024, 1024] (scale-active)	fp8_blockwise specialized	~770 µs
Force-scale-active	fp8_blockwise specialized	~769 µs
Reference	bf16 no-quant	~907 µs
Reference	fp8_direct_cast	~526 µs

For MXFP4 models such as amd/DeepSeek-R1-0528-MXFP4-v2, the system uses FP4 dispatch + FP8 combine, achieving a 2.56× overall round-trip bandwidth reduction (from 28,672 to 11,200 bytes per token).
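
The 28,672 and 11,200 byte figures can be reproduced with a little arithmetic. The sketch below is our reading of those numbers, not a description of MoRI's wire format: it assumes the OCP MX convention for MXFP4 (one 1-byte shared scale per 32 elements) and per-128-element FP32 scales for FP8 blockwise. The helper name `payload_bytes` is hypothetical.

```python
HIDDEN_DIM = 7168  # DeepSeek-R1 hidden dimension (from the text)

def payload_bytes(bits_per_elem: int, scale_block: int = 0, scale_bytes: int = 0) -> int:
    """Bytes needed to send one token's hidden vector, including block scales."""
    data = HIDDEN_DIM * bits_per_elem // 8
    # One scale factor per `scale_block` elements; no scales when scale_block == 0.
    scales = (HIDDEN_DIM // scale_block) * scale_bytes if scale_block else 0
    return data + scales

bf16 = payload_bytes(16)                  # unquantized, no scales
mxfp4_dispatch = payload_bytes(4, 32, 1)  # assumed: 1-byte scale per 32 elements
fp8_combine = payload_bytes(8, 128, 4)    # assumed: FP32 scale per 128 elements

bf16_round_trip = 2 * bf16                       # BF16 dispatch + BF16 combine
quant_round_trip = mxfp4_dispatch + fp8_combine  # FP4 dispatch + FP8 combine
reduction = bf16_round_trip / quant_round_trip

print(bf16_round_trip, quant_round_trip, round(reduction, 2))
# prints: 28672 11200 2.56
```

Under these assumptions the scale factors add 224 bytes in each direction, which is why the quantized round trip is 11,200 bytes rather than 10,752.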

Blockwise quantization preserves accuracy through fine-grained scaling. By default, FP8 blockwise uses per-128-element FP32 scale factors. It achieves a good tradeoff between performance and accuracy.

The quantization mode is auto-detected from the model's weight format and can be overridden via the SGLANG_MORI_DISPATCH_DTYPE and SGLANG_MORI_COMBINE_DTYPE environment variables.

Adaptive kernel selection

MoRI dynamically selects the optimal communication kernel based on workload characteristics:

Kernel	Condition	Optimized For
IntraNode	Single-node (≤8 GPUs)	Shared memory / P2P
InterNodeV1	Multi-node, >256 tokens/rank	High throughput, staged RDMA
InterNodeV1LL	Multi-node, ≤256 tokens/rank	Low latency
AsyncLL	SDMA-enabled paths	Fully async send/recv split

The switching threshold is automatically configured based on the decode batch size, ensuring that prefill phases (large batches) use high-throughput kernels while decode phases (smaller per-rank batches) use low-latency kernels.
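
The selection rules in the table can be sketched as a small decision function. This is an illustration only, not MoRI's actual code: the function name, its parameters, and in particular the precedence given to the SDMA path are assumptions, since the table does not say how the conditions are ordered.

```python
from enum import Enum

class Kernel(Enum):
    INTRA_NODE = "IntraNode"
    INTER_NODE_V1 = "InterNodeV1"
    INTER_NODE_V1_LL = "InterNodeV1LL"
    ASYNC_LL = "AsyncLL"

def select_kernel(world_size: int, tokens_per_rank: int, *,
                  gpus_per_node: int = 8, ll_threshold: int = 256,
                  sdma_enabled: bool = False) -> Kernel:
    """Hypothetical selector mirroring the conditions in the table above."""
    if sdma_enabled:
        # Assumption: SDMA-enabled paths take precedence over the other rules.
        return Kernel.ASYNC_LL
    if world_size <= gpus_per_node:
        return Kernel.INTRA_NODE          # single node: shared memory / P2P
    if tokens_per_rank > ll_threshold:
        return Kernel.INTER_NODE_V1       # large batches (prefill): staged RDMA
    return Kernel.INTER_NODE_V1_LL        # small batches (decode): low latency

print(select_kernel(24, 4096).value)  # multi-node prefill-sized batch
print(select_kernel(24, 64).value)    # multi-node decode-sized batch
```

With the default threshold of 256 tokens per rank, the first call selects the high-throughput kernel and the second selects the low-latency one.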

How MoRI's KV Cache Backend Accelerates Transfer

Inline transfer for high-concurrency KV migration

Lock-free inline execution. Transfer requests execute directly in the caller path instead of being dispatched to worker threads. Transfer plans are precomputed once and reused across all layers, eliminating per-layer scheduling overhead and reducing lock contention.
Robust at scale. Default RDMA parallelism is increased to 4 queue pairs and 4 workers per transfer, with thread-safe connection reuse that prevents port exhaustion under thousands of concurrent requests.

Broader model coverage

Beyond standard MLA-based KV cache, MoRI adds state transfer support for hybrid architectures - Mamba (SSM state), SWA, and NSA - enabling disaggregated serving for models like Qwen3.5-397B-A17B. It also handles TP-mismatch scenarios where prefill and decode use different tensor-parallel degrees, correctly mapping replicated attention heads across ranks.

MoRI-IO benchmark on the MI355X GPUs (8 GPUs/node, 8x AMD Pensando™ Pollara 400 AI-NIC, DeepSeek-R1 671B FP8, TP=8, 2048 prompts, ISL=8192, OSL=1024)
Metric	MoRI-IO	Mooncake
Request throughput	7.49 req/s	6.80 req/s
Input token throughput	31,111 tok/s	28,257 tok/s
Output token throughput	3,775 tok/s	3,428 tok/s
Total token throughput	34,886 tok/s	31,685 tok/s

MoRI-IO delivers ~10% higher throughput than Mooncake across all metrics, with comparable single-request latency (~7 ms TPOT) and high accuracy (GSM8K 5-shot: 0.970).
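
The ~10% figure follows directly from the table. A quick check, using only the numbers reported above:

```python
# (MoRI-IO, Mooncake) pairs copied from the benchmark table
results = {
    "request throughput (req/s)": (7.49, 6.80),
    "input token throughput (tok/s)": (31_111, 28_257),
    "output token throughput (tok/s)": (3_775, 3_428),
    "total token throughput (tok/s)": (34_886, 31_685),
}

# Relative gain of MoRI-IO over Mooncake, in percent
gains = {name: (mori / mooncake - 1) * 100
         for name, (mori, mooncake) in results.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}%")
```

Every metric lands at roughly +10.1%, which is expected when both backends serve the same fixed workload and differ mainly in end-to-end rate.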

Computation-Communication Overlap - Two-Batch Overlap (TBO) with SDMA

Even with 2–4× bandwidth reduction from quantization, all-to-all communication remains significant. Two-Batch Overlap (TBO) hides this latency by interleaving communication and compute across two micro-batches:

MicroBatch A dispatch sends quantized tokens over the network on a dedicated communication stream
While network transfer is in flight, MicroBatch B attention computes on the main compute stream
MicroBatch A arrives; MoE GEMM runs
MicroBatch A combine sends results back on the communication stream
Meanwhile, MicroBatch B dispatch begins

The dispatch and combine operations are split into A/B phases - dispatch_a for local quantization on the compute stream, dispatch_b for network transfer on the communication stream. A CommStreamPool manages dedicated streams, and events synchronize handoff points.
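
To see why splitting work across a compute stream and a communication stream helps, the pipeline can be modeled as a toy schedule. The sketch below is not SGLang or MoRI code: the op names, the single communication stream, and all durations are hypothetical values chosen for illustration. Each stream runs one op at a time in issue order, and an op starts once its dependencies have finished.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str
    stream: str       # stream the op is issued on
    duration: float   # hypothetical duration in microseconds
    deps: tuple = ()  # names of ops that must finish first

def makespan(ops) -> float:
    """Finish time of the last op when each stream runs its ops in issue order."""
    stream_free, finish = {}, {}
    for op in ops:
        ready = max((finish[d] for d in op.deps), default=0.0)
        start = max(ready, stream_free.get(op.stream, 0.0))
        finish[op.name] = start + op.duration
        stream_free[op.stream] = finish[op.name]
    return max(finish.values())

def layer_ops(overlap: bool):
    # Without overlap, communication is issued on the compute stream too.
    comm = "comm" if overlap else "compute"
    return [
        Op("attn_A", "compute", 100),
        Op("dispatch_A", comm, 80, ("attn_A",)),
        Op("attn_B", "compute", 100),
        Op("moe_A", "compute", 150, ("dispatch_A",)),
        Op("dispatch_B", comm, 80, ("attn_B",)),
        Op("combine_A", comm, 80, ("moe_A",)),
        Op("moe_B", "compute", 150, ("dispatch_B",)),
        Op("combine_B", comm, 80, ("moe_B",)),
    ]

serial = makespan(layer_ops(overlap=False))
overlapped = makespan(layer_ops(overlap=True))
print(serial, overlapped)  # prints: 820.0 580.0
```

With these made-up durations, most of the dispatch and combine time disappears behind attention and MoE compute; only the trailing combine for micro-batch B remains exposed.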

When SDMA is enabled (MORI_ENABLE_SDMA=true), data transfers run on AMD's dedicated System DMA (SDMA) engines, which move data between GPU memory and network interfaces without consuming any compute units. This achieves true zero-compute-overhead communication, keeping every compute unit available for GEMM operations throughout the pipeline.

Figure 3: Two-Batch Overlap pipeline diagram - interleaved compute and communication streams

FlyDSL FusedMoE for High-Performance MoE Compute

Traditionally, FusedMoE kernels on AMD relied solely on Composable Kernel (CK) - hand-tuned templates that are performant but inflexible. AITER introduces FlyDSL (Flexible Layout Python DSL), a Python DSL backed by an MLIR stack for authoring GPU kernels with explicit layouts and tiling. It serves as a competitive FusedMoE kernel path for mixed-precision MoE (e.g., A4W4) on MI355X. FlyDSL enables rapid exploration of kernel configurations beyond what hand-tuned CK templates cover. At a typical concurrency of 512, we measured up to a 1.6× latency reduction for the FusedMoE compute.

MoE GEMM performance is shape-dependent, and the dominant shapes differ by serving scenario. In low-latency pure TP deployments, each GPU processes all experts with small batch sizes, producing tall-skinny GEMMs. In high-throughput DP+EP deployments, tokens are distributed across expert-parallel ranks, yielding different N/K dimensions per expert. FlyDSL allows us to provide separate tuning configurations for each scenario to maximize MI355X utilization.

Triton blockscale GEMM tuning - alongside FlyDSL, the A8W8 blockscale GEMM path uses per-shape tuned configurations for MI355X (gfx950). Key shapes like (N=7168, K=16384) and (N=16384, K=1536) - matching DeepSeek-R1's expert dimensions - are tuned with optimized block sizes, warp counts, pipeline stages, and k-splitting parameters. Special-case tuning for ultra-small M values (≤8, ≤256) targets the small per-expert batches typical in EP decode.
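
Per-shape tuning of this kind is typically organized as a lookup keyed by the weight shape and an M bucket. The sketch below shows that structure only; it is not AITER's configuration format. The (N, K) keys and the M bucket edges (8 and 256) come from the text, while the function name and the config labels are hypothetical placeholders standing in for real block sizes, warp counts, pipeline stages, and k-split values.

```python
import bisect

# Upper bounds of the small-M buckets mentioned in the text; larger M uses the last bucket.
M_BUCKET_EDGES = [8, 256]

# Placeholder labels, one per M bucket: M <= 8, 8 < M <= 256, M > 256.
TUNED_CONFIGS = {
    (7168, 16384): ["cfg_m_le_8", "cfg_m_le_256", "cfg_m_large"],
    (16384, 1536): ["cfg_m_le_8", "cfg_m_le_256", "cfg_m_large"],
}

def pick_config(m: int, n: int, k: int) -> str:
    """Return the tuned config label for an (M, N, K) GEMM, or a generic fallback."""
    buckets = TUNED_CONFIGS.get((n, k))
    if buckets is None:
        return "cfg_default"  # untuned shape
    # bisect_left makes each edge inclusive: M == 8 still falls in the first bucket.
    return buckets[bisect.bisect_left(M_BUCKET_EDGES, m)]

print(pick_config(4, 7168, 16384))     # cfg_m_le_8
print(pick_config(128, 16384, 1536))   # cfg_m_le_256
print(pick_config(1024, 7168, 16384))  # cfg_m_large
```

Keeping the small-M buckets separate matters because EP decode spreads tokens across many experts, so each expert GEMM often sees only a handful of rows.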

Figure 4: FlyDSL kernel and Triton GEMM tuning speedup

How Specv2 enablement on AMD GPU accelerates decoding

DeepSeek supports Multi-Token Prediction (MTP) via the NEXTN speculative decoding algorithm, predicting 2 additional tokens per step. MTP creates a compounding effect with quantized communication: it increases the decode batch size by 3x (original + 2 speculative tokens), improving all-to-all bandwidth utilization at larger batch sizes, while FP4/FP8 quantization keeps per-token communication cost low despite the larger batches.

SGLang's Specv2 pipeline overlaps the draft and verify phases by running verification preparation on a separate GPU stream while the draft model executes. This hides scheduling overhead and is now the default path in SGLang. We enabled it on ROCm™ software with AMD-specific attention backends (AITER) for draft CUDA graph capture, plus a targeted stream synchronization fix that ensures correct draft-to-verify data handoff.

With our optimization, MTP on the MI355X GPU runs with full overlap scheduling, combining multi-token prediction throughput gains with hidden scheduling latency.

How CPU Streaming Overhead Is Optimized for High Throughput

Under high-concurrency PD disaggregation (e.g., 2,048 concurrent requests), the GPU pipeline is no longer the bottleneck - the decode-side CPU path becomes the limiter. We optimized the asyncio notification batching and SSE serialization hot path in SGLang's tokenizer manager and API layer, reducing CPU overhead without affecting inter-token latency.
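
One common way to cut per-token CPU overhead in an asyncio streaming path is to drain everything that is already queued on each wakeup and emit a single SSE frame for the batch. The sketch below illustrates that general idea with the standard library only. It is not SGLang's implementation, and `drain_batch` and `sse_frame` are hypothetical names.

```python
import asyncio
import json

async def drain_batch(queue: asyncio.Queue, max_batch: int = 64) -> list:
    """Wait for one item, then take whatever else is already queued."""
    batch = [await queue.get()]
    while len(batch) < max_batch:
        try:
            batch.append(queue.get_nowait())  # never waits, so no added latency
        except asyncio.QueueEmpty:
            break
    return batch

def sse_frame(tokens: list) -> bytes:
    """Serialize a batch of tokens as one Server-Sent Events data frame."""
    payload = json.dumps({"tokens": tokens}, separators=(",", ":"))
    return f"data: {payload}\n\n".encode()

async def demo() -> list:
    queue = asyncio.Queue()
    for tok in ["Hello", ",", " world"]:  # three tokens arrive before the consumer wakes
        queue.put_nowait(tok)
    batch = await drain_batch(queue)
    return [sse_frame(batch)]

frames = asyncio.run(demo())
print(frames)  # one frame carrying all three tokens
```

Because the consumer only collects items that are already available, batching reduces wakeups and serialization calls without holding any token back.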

Future Plan

The next frontier for distributed inference is shifting from chat-style workloads toward agentic applications - tools like Claude Code, Codex, and Cursor that drive deep multi-turn, tool-augmented conversations with long context windows (up to 1M tokens), extremely high KV cache reuse, and rapid-fire request bursts from parallel subagent spawning. InferenceX is developing an agentic coding benchmark to capture these patterns, moving toward a true end-to-end system benchmark. Our future optimizations on AMD Instinct™ MI355X GPUs will target this workload by leveraging more advanced asynchronous parallelism strategies such as DWDP (Disaggregated Wide Data Parallelism), as well as exploiting ROCm software specific capabilities like SDMA for fully asynchronous, zero-compute-overhead data movement - ultimately pushing disaggregated MoE serving to match the burst-traffic, cache-heavy demands of agentic inference at scale.

Beyond inference, SDMA's zero-compute-overhead communication capability is not limited to serving workloads. We are exploring the potential of extending SDMA to training and reinforcement learning frameworks such as Miles. Early validation in the DeepSpeed training framework has already demonstrated ~10% performance gains [8], suggesting that dedicated DMA engines can meaningfully accelerate collective communication in large-scale distributed training as well.

Summary

This post demonstrates how AMD Instinct MI355X GPUs with MoRI on SGLang achieve competitive TCO for large-scale DeepSeek disaggregated inference. At 129 tok/s/user interactivity, the MI355X GPU delivers inference at $0.173 per million tokens with 2,378 tok/s/GPU - 2.9% lower cost than NVIDIA B200 TRT-LLM and 1.22× higher throughput per GPU than NVIDIA B200 SGLang.

This result is driven by a full-stack optimization effort across compute, communication, and serving:

MoRI quantized all-to-all - hybrid FP4/FP8 communication with adaptive kernel selection, reducing round-trip bandwidth by up to 2.56x
MoRI-IO KV cache backend - lock-free inline transfer with high-concurrency RDMA, delivering ~10% higher throughput than Mooncake
Two-Batch Overlap with SDMA - hiding communication latency behind compute using AMD's dedicated DMA engines
AITER GEMM tuning + FlyDSL FusedMoE - platform-tuned compute kernels for both TP and DP+EP scenarios on MI355X
Specv2 MTP on ROCm software - full overlap scheduling for multi-token prediction, increasing effective decode batch size by 3x
CPU streaming optimization - asyncio batching and SSE fast path, unlocking +20% output throughput at 2,048 concurrency

Combined with the MI355X GPU's hardware cost advantage ($1.48/hr/GPU vs $1.95 for B200), these software optimizations translate competitive throughput into a TCO win.
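
The link between hourly GPU price, per-GPU throughput, and cost per million tokens can be sketched with a simple formula: hourly price divided by tokens generated per GPU-hour. This is our simplified reading, not the InferenceX calculator's implementation, which may account for additional terms. The function name is hypothetical; the inputs are the MI355X figures quoted in this post.

```python
def cost_per_million_tokens(price_per_gpu_hour: float,
                            tokens_per_sec_per_gpu: float) -> float:
    """Dollars per one million generated tokens for a single GPU."""
    tokens_per_gpu_hour = tokens_per_sec_per_gpu * 3600
    return price_per_gpu_hour / tokens_per_gpu_hour * 1_000_000

# MI355X: $1.48/hr/GPU at 2,378 tok/s/GPU (129 tok/s/user operating point)
mi355x = cost_per_million_tokens(1.48, 2378)
print(f"${mi355x:.3f} per million tokens")  # prints: $0.173 per million tokens
```

The formula makes the trade-off explicit: a lower hourly price can offset lower per-GPU throughput, which is how a platform can lead on cost per token without leading on raw tokens per second.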

The results are open-source and continuously validated via InferenceX. Together, these innovations define a practical, reproducible blueprint for cost-effective large-scale MoE inference on AMD platforms.

Endnotes

System configuration for AMD Instinct™ MI355X benchmark:

GPU: 8× AMD Instinct™ MI355X per node
Host CPU: AMD EPYC™ processors
Network: AMD AINIC (ionic) RDMA, 8 NICs per node
Software: SGLang v0.5.10+, AITER, MoRI, ROCm 7.2
Model: amd/DeepSeek-R1-0528-MXFP4-v2

TCO estimates sourced from SemiAnalysis InferenceXv2 analysis. Hardware costs reflect hyperscaler pricing models.
Performance results measured by SemiAnalysis InferenceX continuous benchmark platform. Benchmark methodology and raw data available at https://github.com/SemiAnalysisAI/InferenceX.

Disclaimers

Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED "AS IS" WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.