Day 0 Support for MiniMax M3 on AMD Instinct GPUs

Jun 12, 2026

What's New in MiniMax M3?

MiniMax M3 is a new open-weight model for coding, agentic, and multimodal workloads. MiniMax describes M3 as combining three frontier capabilities in one model: strong coding and agentic task performance, long-context MiniMax Sparse Attention (MSA), and native multimodality for text, image, and video understanding.

The 1M-token context window of MiniMax M3 enables sophisticated, long-horizon application workflows, including autonomous software engineering agents, repository-wide reasoning, and native multimodal document analysis alongside tool-driven automation.

The AMD day-zero enablement focuses on making MiniMax M3 available on AMD Instinct™ GPUs with ROCm™ software. 

Unlock High-Performance Enterprise Inference

Running MiniMax M3 on AMD Instinct MI300 and MI350 series GPUs ensures your enterprise-grade workloads benefit from industry-leading memory bandwidth and compute efficiency. Our Day 0 enablement focuses on delivering high throughput and low-latency inference, allowing you to scale your AI applications with confidence. Further optimization work is on the way.

There are two variants of the M3 Models:

  • MiniMaxAI/MiniMax-M3: precision is bf16
  • MiniMaxAI/MiniMax-M3-MXFP8: precision is mxfp8

Model config’s max_position_embeddings is 524288; 1M contetxt length is supported via RoPE scaling.

Deploying MiniMax M3 on AMD hardware has never been more efficient with vLLM and SGLang. A ready-to-use serving recipe is included below.

Deploying with vLLM on AMD GPUs

By leveraging vLLM with ROCm support, developers can unlock high-throughput serving in ROCm. Support is available in the MiniMax M3 nightly build of vLLM via docker image using the vLLM MiniMax M3 recipe.

		vllm/vllm-openai-rocm:nightly
	

The model is decode-bound for text and encoder/prefill-bound for vision, so there are two tuned recipes. (The same server handles both; the vision recipe is a safe superset for mixed workloads.)

Notes:

  • One required flag: --block-size 128. MiniMax-M3's sparse/index cache requires it; the default block size will not start.
  • Chat/agent flags: every command below includes MiniMax-M3's reasoning and tool-call parsers (--reasoning-parser minimax_m3 --tool-call-parser minimax_m3 --enable-auto-tool-choice) — they expose the <mm:think> chain-of-thought and automatic tool selection over the OpenAI API. Drop them for plain completions.
  • Launch the Inference Service on MI350 Series with tp=4 for the MXFP8 variant as the floor (weights ~452 GB MXFP8) for best tokens/s per gpu, or tp=8 for lower latency. For MI300 Series, tp=8 is recommended.
  • Set --max-model-len <N> : set a sane value for your workload. If unset, the defaults to 524288 (512K). KV pool gets sized for that.
  • On MI300X, you can add ⁠ --skip-mm-profiling ⁠ for a faster server startup and to avoid timeout.

Text + Multimodal:

		vllm serve MiniMaxAI/MiniMax-M3 \ 
  --trust-remote-code \ 
  --block-size 128 \ 
  --tensor-parallel-size 8 \ 
  --attention-backend TRITON_ATTN \ 
  --mm-encoder-tp-mode data \ 
  --mm-encoder-attn-backend ROCM_AITER_FA \ 
  --tool-call-parser minimax_m3 \ 
  --enable-auto-tool-choice \ 
  --reasoning-parser minimax_m3
	

More KV capacity (fp8 KV cache)

Add --kv-cache-dtype fp8 to any recipe for ~1.5× the KV pool, it is lossless in our testing across the full native context. Especially worth it for high concurrency or long context, where KV is the binding constraint:

		vllm serve MiniMaxAI/MiniMax-M3 \ 
  --trust-remote-code \ 
  --block-size 128 \ 
  --kv-cache-dtype fp8 \ 
  --tensor-parallel-size 8 \ 
  --attention-backend TRITON_ATTN \ 
  --mm-encoder-tp-mode data \ 
  --mm-encoder-attn-backend ROCM_AITER_FA \ 
  --tool-call-parser minimax_m3 \ 
  --enable-auto-tool-choice \ 
 --reasoning-parser minimax_m3
	

Long context (to 1M)

Native context is 512K. To go past it, supply a YaRN rope_scaling on the text config (a top-level override silently misses the decoder's config) and allow the long max length. TP=8 + fp8 KV is the practical combo at 1M:

		 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \ 
  vllm serve MiniMaxAI/MiniMax-M3 \ 
  --trust-remote-code \ 
  --block-size 128 \ 
  --kv-cache-dtype fp8 \ 
  --tensor-parallel-size 8 \ 
  --attention-backend TRITON_ATTN \ 
  --mm-encoder-tp-mode data \ 
  --mm-encoder-attn-backend ROCM_AITER_FA \ 
  --tool-call-parser minimax_m3 \ 
  --enable-auto-tool-choice \ 
  --reasoning-parser minimax_m3 \ 
  --hf-overrides '{"text_config":{"rope_scaling":{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":524288}}}' 
	

Deploying with SGLang

By leveraging SGLang with ROCm support, developers can unlock high-throughput serving in ROCm. Support is available in the lmsysorg build of SGLang via docker image using the SGLang MiniMax M3 recipe.

		lmsysorg/sglang:*-rocm720-mi35x (MI350X / MI355X) 
lmsysorg/sglang:*-rocm700-mi30x (MI300X / MI325X) 
	

The model is decode-bound for text and encoder/prefill-bound for vision, so there are two tuned recipes. (The same server handles both; the vision recipe is a safe superset for mixed workloads.) 

Notes

  • One required flag: If you use mxfp8 quantization, pass --quantization mxfp8 on all paths. On gfx942 SGLang converts MXFP8 to block-fp8 [128,128] at load; you still pass mxfp8.
  • Chat/agent flags: every command below includes MiniMax-M3's reasoning and tool-call parsers (--reasoning-parser minimax_m3 --tool-call-parser minimax_m3 --enable-auto-tool-choice) — they expose the <mm:think> chain-of-thought and automatic tool selection over the OpenAI API. Drop them for plain completions.
  • Launch the Inference Service on MI350 Series with tp=4 as the floor (weights ~452 GB MXFP8) for best tokens/s per gpu, or tp=8 for lower latency. For MI300 Series, tp=8 is recommended.

Text — MI350X / MI355X baseline (native MXFP8)

		SGLANG_USE_AITER=1 sglang serve \ 
  --model-path MiniMaxAI/MiniMax-M3 \ 
  --tp 8 --mem-fraction-static 0.80 \ 
  --quantization mxfp8 --dtype bfloat16 \ 
   --chunked-prefill-size 8192 \ 
   --reasoning-parser minimax-m3 --tool-call-parser minimax-m3-nom \ 
  --trust-remote-code --host 0.0.0.0 --port 8080
	

Text —MI350X / MI355X baseline (native MXFP8)

		SGLANG_USE_AITER=1 sglang serve \ 
  --model-path MiniMaxAI/MiniMax-M3 \ 
  --tp 8 --mem-fraction-static 0.80 \ 
  --quantization mxfp8 --dtype bfloat16 \ 
   --chunked-prefill-size 8192 \ 
   --reasoning-parser minimax-m3 --tool-call-parser minimax-m3 \ 
  --trust-remote-code --host 0.0.0.0 --port 8080 
	

Text — MI300X / MI325X (block-fp8 emulation)

gfx942 has no hardware MX matmul; SGLang auto-converts weights at load. Add --watchdog-timeout 3600 --skip-server-warmup on first boot (AITER JIT). 

		SGLANG_USE_AITER=1 sglang serve \ 
  --model-path MiniMaxAI/MiniMax-M3 \ 
  --tp 8 --mem-fraction-static 0.80 \ 
  --quantization mxfp8 --dtype bfloat16 \ 
  --attention-backend aiter --moe-runner-backend triton \ 
   --chunked-prefill-size 8192 \ 
  --watchdog-timeout 3600 --skip-server-warmup \ 
   --reasoning-parser minimax-m3 --tool-call-parser minimax-m3 \ 
  --trust-remote-code --host 0.0.0.0 --port 8080
	

Vision / multimodal

		SGLANG_USE_AITER=1 sglang serve \ 
  --model-path MiniMaxAI/MiniMax-M3 \ 
  --tp 8 --mem-fraction-static 0.80 \ 
  --attention-backend triton --moe-runner-backend triton \ 
  --mm-attention-backend aiter_attn --mm-enable-dp-encoder \ 
  --chunked-prefill-size 65536 --max-prefill-tokens 65536 \ 
  --cuda-graph-max-bs 128 page-size 128 --disable-radix-cache \ 
  --reasoning-parser minimax-m3 --tool-call-parser minimax-m3 \ 
 --trust-remote-code --host 0.0.0.0 --port 8080
	

More KV capacity (fp8 KV cache) 

Add --kv-cache-dtype fp8_e4m3 to any recipe for ~1.5× the KV pool. Especially worth it for high concurrency or long context, where KV is the binding constraint:

		SGLANG_USE_AITER=1 sglang serve \ 
  --model-path MiniMaxAI/MiniMax-M3 \ 
  --quantization mxfp8 --tp 8 \ 
  --moe-runner-backend triton --attention-backend triton \ 
  --mem-fraction-static 0.9 --max-running-requests 64 --chunked-prefill-size 8192 \ 
  --page-size 128 --disable-radix-cache \ 
  --kv-cache-dtype fp8_e4m3 \ 
  --trust-remote-code --host 0.0.0.0 --port 8080 
	

Long context (to 1M) 

Native context is 512K. To reach 1M, add a YaRN rope_scaling plus a raised max length into the model's text_config in config.json:

		"text_config": { 

  ..., 

  "rope_scaling": {"rope_type": "yarn", "factor": 2.0, "original_max_position_embeddings": 524288}, 

  "max_position_embeddings": 1048576 

} 

(edit the file directly; don't use inline --json-model-override-args — this SGLang build shallow-replaces text_config and drops sibling fields). TP=8 + fp8 KV is the practical combo at 1M: 

SGLANG_USE_AITER=1 sglang serve \ 
  --model-path MiniMaxAI/MiniMax-M3 \ 
  --quantization mxfp8 --kv-cache-dtype fp8_e4m3 --context-length 1048576 \ 
  --attention-backend triton --moe-runner-backend triton \ 
  --page-size 128 --disable-radix-cache --chunked-prefill-size 8192 \ 
  --tool-call-parser minimax-m3-nom --reasoning-parser minimax-m3 \ 
  --trust-remote-code --host 0.0.0.0 --port 8080 
	

Conclusion

MiniMax M3 brings together coding, agentic, long-context, and native multimodal capabilities in an open-weight model. AMD's Day 0 vLLM and SGLang enablement lets developers deploy MiniMax M3 on AMD Instinct GPUs with ROCm, including MI300 and MI350 series GPUs.

The ROCm implementation is built around the model's actual architecture: MXFP8 MoE and linear layers for MI350 Series with native MXFP8 matrix cores, MiniMax Sparse Attention, vision-language serving, and model-specific normalization and activation kernels. This gives developers a practical path to run MiniMax M3 in OpenAI-compatible serving environments on AMD datacenter GPUs.

Acknowledgements

We would like to extend our sincere thanks to Minimax and Inferact for their collaboration on the vLLM Day 0 effort. 

Additional Resources

Share:

Article By


Related Blogs