AMD Instinct MI355X GPUs Power DeepSeek-V3.2-Exp

Oct 03, 2025

AMD is proud to support DeepSeek-V3.2-Exp, the latest release from DeepSeek AI, optimized to run on our flagship AMD Instinct™ MI355X GPUs.

DeepSeek-V3.2-Exp introduces DeepSeek Sparse Attention (DSA), a groundbreaking sparse attention mechanism that dramatically improves efficiency for long-context tasks. This innovation makes it an excellent match for the massive bandwidth and compute density of MI355X accelerators.

DeepSeek-V3.2-Exp: Model Architecture

DeepSeek-V3.2-Exp builds upon the V3.1-Terminus checkpoint with a revolutionary sparse attention mechanism designed to optimize training and inference efficiency in long-context scenarios.

DeepSeek Sparse Attention (DSA)

The key innovation in V3.2-Exp is the introduction of DeepSeek Sparse Attention, a two-stage attention mechanism that dramatically reduces computational complexity:

1. Lightning Indexer

  • A lightweight, high-speed scanner that rapidly scores all preceding tokens
  • Determines relevance for any given query token using a small number of attention heads
  • Operates in FP8 precision for maximum efficiency
  • Identifies the most relevant excerpts from the entire context window

2. Fine-Grained Token Selection

  • Performs precise top-K token selection across the entire document
  • Only the most relevant tokens (top 2048 key-value pairs per query) are processed by the main attention mechanism
  • Maintains output quality on par with V3.1-Terminus while dramatically reducing compute
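The two-stage flow above can be sketched in a few lines of Python. This is a toy illustration only: the real Lightning Indexer uses a small number of FP8 attention heads with learned projections, while here a plain dot product stands in for the scoring step.

```python
def lightning_index_scores(query, keys):
    # Stage 1 (hypothetical stand-in): score every preceding token with a
    # cheap dot product against the query. DeepSeek's actual indexer uses
    # a few lightweight FP8 attention heads for this scan.
    return [sum(q * k for q, k in zip(query, key)) for key in keys]

def select_top_k(scores, k):
    # Stage 2: keep only the indices of the k highest-scoring tokens;
    # the main attention mechanism then runs over just these key-value
    # pairs (top 2048 per query in DeepSeek-V3.2-Exp).
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy example: 4 cached tokens, select the top 2 for one query vector.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
scores = lightning_index_scores(query, keys)
selected = select_top_k(scores, k=2)  # indices of the 2 most relevant tokens
```

Because the indexer is much cheaper per token than full attention, scanning all L tokens and attending over only k of them is a net win whenever k is much smaller than L.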

Computational Efficiency

Complexity Reduction: DSA reduces the attention computation from O(L²) to O(Lk), where L is sequence length and k is a fixed number of selected tokens.
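Plugging in the numbers from this release makes the asymptotic claim concrete. With L = 128,000 and k = 2,048 (the top-k budget quoted above), the attention-score work per sequence shrinks by a factor of L / k:

```python
L = 128_000  # sequence length (128K context window)
k = 2_048    # selected key-value pairs per query under DSA

dense_cost = L * L   # O(L^2) pairwise attention scores
sparse_cost = L * k  # O(L*k) scores with top-k selection

speedup = dense_cost / sparse_cost  # = L / k = 62.5
```

The 62.5x ratio for the attention-score computation lines up with the "up to 64x" figure reported for 128K-token sequences [1]; end-to-end gains are lower because the indexer scan and the non-attention layers still run in full.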

Performance Gains: 

  • Up to 64x speedup for sequences of 128,000 tokens [1] 
  •  50%+ reduction in API costs for long-context operations [2] 
  •  2-3x faster inference compared to dense attention on long contexts [3] 
  •  30-40% reduction in memory usage [3] 
  •  50% improvement in training efficiency [3]

Real-World Cost Impact: For a 128K context window, reported inference costs dropped from $2.20 to $0.25, roughly a 9x cost reduction [3].

Training Approach

DeepSeek employed a multi-phase training strategy: 

  1.  Warmup Phase: 2.1B tokens used to train the lightning indexer while keeping the main model frozen 
  2.  Sparse Training: 943.7B tokens of continued pretraining with full sparse attention enabled 
  3.  Specialization: 5 specialized models (coding, math, etc.) trained via reinforcement learning, then distilled into the final checkpoint

AMD Instinct MI355X: Built for AI at Scale

The AMD Instinct MI355X is an AI accelerator designed to push the boundaries of training and inference for large-scale foundation models. Built on the CDNA 4 architecture, MI355X brings unprecedented performance and efficiency for next-generation AI workloads.

Key Specifications

  • Memory: 288 GB HBM3e 
  • Memory Bandwidth: 8 TB/s 
  • Peak Power: 1400W (OAM form factor with liquid cooling) 
  • Architecture: CDNA 4 
  • Performance (Peak Theoretical):
    • FP16/BF16: 5.03 PFLOPS 
    • FP64: 78.6 TFLOPS 
  • New Precision Formats: FP6 and FP4 for higher throughput in AI workloads

Key Features & Benefits

  • High-Performance AI: Optimized for both training and inference of massive foundation models. 
  • Enhanced Efficiency: CDNA 4 architecture and new lower-precision formats (FP6, FP4) deliver improved efficiency and throughput. 
  • Scalability: Built for deployment in multi-GPU clusters, enabling massive computational performance at scale. 
  • Competitive Performance: Benchmarks show MI355X outperforms competing accelerators in LLM inference and fine-tuning, with superior memory capacity and compute throughput.

Getting Started: Run DeepSeek-V3.2-Exp on MI355X

Follow these simple steps to launch the model on your AMD system:

Step 1: Pull the ROCm SGLang Image
    docker pull lmsysorg/sglang:dsv32-rocm
Step 2: Start a Container with GPU Access
    docker run -it \
      --ipc=host \
      --network=host \
      --device=/dev/kfd \
      --device=/dev/dri \
      --security-opt seccomp=unconfined \
      --group-add video \
      --shm-size 32G \
      -w /workspace lmsysorg/sglang:dsv32-rocm
Step 3: Launch the Server
    SGLANG_NSA_KV_CACHE_STORE_FP8=false \
    SGLANG_NSA_USE_REAL_INDEXER=true \
    SGLANG_NSA_USE_TILELANG_PREFILL=true \
    python -m sglang.launch_server \
      --model-path deepseek-ai/DeepSeek-V3.2-Exp \
      --disable-cuda-graph \
      --tp 8 \
      --mem-fraction-static 0.85 \
      --page-size 64 \
      --nsa-prefill "tilelang" \
      --nsa-decode "tilelang"
Step 4: Send a Test Request
    curl http://localhost:30000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "prompt": "The future of AI is",
        "max_tokens": 100,
        "temperature": 0
      }'
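The same request can be sent from Python using only the standard library, since the SGLang server exposes an OpenAI-compatible completions endpoint. This sketch assumes the server from Step 3 is running on localhost:30000:

```python
import json
import urllib.request

# Same payload as the curl example above.
payload = json.dumps({
    "prompt": "The future of AI is",
    "max_tokens": 100,
    "temperature": 0,
}).encode()

req = urllib.request.Request(
    "http://localhost:30000/v1/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)

if __name__ == "__main__":
    # Requires the SGLang server launched in Step 3 to be up.
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
        print(body["choices"][0]["text"])
```

For production use, an OpenAI-compatible client library pointed at the same base URL works equally well.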

Conclusion

With DeepSeek-V3.2-Exp now available, researchers and developers can leverage AMD Instinct MI355X GPUs to push the limits of sparse attention at scale.
AMD is committed to supporting frontier AI models, and we look forward to seeing how the community leverages DeepSeek-V3.2-Exp on MI355X to build faster, smarter, and more efficient AI applications.

Acknowledgements

We would like to thank the SGLang team for their outstanding collaboration with AMD in enabling DeepSeek-V3.2-Exp support on MI355X GPUs. Special thanks to Tom Chen, Ziyi Xu, and Liangsheng Yin for their invaluable contributions and dedication in making this integration possible.

References
  1. DeepSeek AI (2025). “DeepSeek-V3.2: Technical Report.” GitHub Repository. Available at: https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/DeepSeek_V3_2.pdf
  2. DeepSeek AI (2025). “Introducing DeepSeek-V3.2-Exp.” DeepSeek API Documentation, September 29, 2025. Available at: https://api-docs.deepseek.com/news/news250929
  3. DeepSeek AI (2025). “DeepSeek-V3.2-Exp Model Card.” Hugging Face. Available at: https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp