Day 0 Support for Qwen3.6 on AMD Instinct GPUs

Apr 17, 2026

We are excited to announce Day 0 support for Alibaba’s latest open-weights AI coding models, Qwen3.6-35B-A3B and Qwen3.6-35B-A3B-FP8, on AMD Instinct™ GPUs (MI300X/MI325X/MI350X/MI355X). This technical blog walks through a Day 0 deployment of the Qwen3.6 model family on AMD Instinct GPUs using AMD ROCm™ 7.0 software together with upstream vLLM and SGLang optimizations.

This deployment guide is designed for AI developers, system architects, and DevOps professionals who are building next-generation agentic workflows. Supporting the Qwen3.6 family on AMD Instinct GPUs lets developers run these models from day one. With only 3B active parameters, Qwen3.6-35B-A3B outperforms the dense 27B-parameter Qwen3.5-27B on several key coding benchmarks and dramatically surpasses its direct predecessor, Qwen3.5-35B-A3B, especially on agentic coding and reasoning tasks.

Model Overview

Qwen3.6-35B-A3B is a fully open-source MoE model (35B total / 3B active), featuring:

  • Exceptional agentic coding capability competitive with much larger models.
  • Strong multimodal perception and reasoning ability.

Qwen3.6-35B-A3B has been evaluated comprehensively against peer-scale models across a wide range of tasks and modalities. Figure 1 shows the coding agent and reasoning results.

Figure 1: Performance on Coding Agent & Reasoning Benchmarks

Run Qwen3.6 with vLLM/SGLang on AMD Instinct GPUs

The integration of ROCm™ 7 software with vLLM and SGLang allows users to fully exploit the 192 GB of HBM3 on the MI300X GPU and the 288 GB of HBM3E on the MI355X GPU.

  • Cost Reduction: Users can serve the full context length on a single GPU for both checkpoints (Qwen3.6-35B-A3B and Qwen3.6-35B-A3B-FP8), a critical requirement for repo-level coding tasks that often exceed the memory limits of GPUs with less HBM.
  • Optimized Throughput: By leveraging tensor parallelism across multiple GPUs, developers can achieve the low-latency response times required for real-time IDE integrations such as Qwen Code.
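
As a rough sanity check on the single-GPU claim, the sketch below estimates the memory taken by the weights alone. It assumes 2 bytes per parameter for the 16-bit checkpoint and 1 byte per parameter for FP8; these sizes are assumptions, not published figures. KV cache, activations, and runtime overhead are deliberately left out. All 35B parameters count here because every expert must be resident in memory even though only 3B are active per token.

```bash
#!/usr/bin/env bash
# Weight-only estimate: parameters (billions) x bytes per parameter ~= GB.
# Assumption: 2 bytes/param for 16-bit weights, 1 byte/param for FP8.
weights_gb() {
  local params_billion=$1 bytes_per_param=$2
  echo $(( params_billion * bytes_per_param ))
}

# Memory left for KV cache and runtime overhead after loading the weights.
headroom_gb() {
  local hbm_gb=$1 weights=$2
  echo $(( hbm_gb - weights ))
}

w16=$(weights_gb 35 2)
w8=$(weights_gb 35 1)
echo "16-bit weights: ${w16} GB, headroom on MI300X (192 GB): $(headroom_gb 192 "$w16") GB"
echo "FP8 weights: ${w8} GB, headroom on MI355X (288 GB): $(headroom_gb 288 "$w8") GB"
```

The remaining headroom is what the serving engine can spend on KV cache, which is what determines how much context fits in practice.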

Before you start, ensure you have access to AMD Instinct GPUs and that the ROCm drivers are installed.
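
One way to check the host before pulling any image is to look for the device nodes that the docker run commands in this guide pass through (/dev/kfd and /dev/dri). The helper name missing_paths is made up for this sketch, and the rocm-smi call is skipped when the tool is not on the PATH.

```bash
#!/usr/bin/env bash
# Print every path from the argument list that does not exist.
missing_paths() {
  local p
  for p in "$@"; do
    [ -e "$p" ] || echo "$p"
  done
}

missing=$(missing_paths /dev/kfd /dev/dri)
if [ -z "$missing" ]; then
  echo "ROCm device nodes found"
else
  echo "Missing device nodes: $missing"
fi

# rocm-smi ships with ROCm; list the detected GPUs when it is available.
if command -v rocm-smi >/dev/null 2>&1; then
  rocm-smi --showproductname
fi
```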

Step 1. Get Started with SGLang

Please use the latest pre-built upstream Docker image for the MI355X GPU:

		docker run -d -it \
    --ipc=host \
    --network=host \
    --privileged \
    --cap-add=CAP_SYS_ADMIN \
    --device=/dev/kfd \
    --device=/dev/dri \
    --device=/dev/mem \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --shm-size 32G \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v /:/work \
    --entrypoint "/bin/bash" \
    --name qwen3.6 \
    lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260414
	

For the MI300X GPU, please use the following image: 

		lmsysorg/sglang:v0.5.10-rocm720-mi30x
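
The docker run command above starts the container detached (-d) with /bin/bash as its entrypoint, so the serve commands in the next step need a shell inside it. A minimal sketch, using the container name from the --name flag; the DRY_RUN switch is an assumption added here so that the command is only printed by default.

```bash
#!/usr/bin/env bash
# Container name taken from the --name flag of the docker run command.
CONTAINER_NAME="qwen3.6"

# Build the command as an array so it can be printed or executed unchanged.
enter_cmd=(docker exec -it "$CONTAINER_NAME" /bin/bash)

if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "${enter_cmd[*]}"     # default: only print the command
else
  "${enter_cmd[@]}"          # DRY_RUN=0: open the shell in the container
fi
```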
	

Step 2. Start SGLang serving

Single GPU deployment (MI355X):

		sglang serve --model-path Qwen/Qwen3.6-35B-A3B \
    --tensor-parallel-size 1 \
    --enable-flashinfer-allreduce-fusion \
    --attention-backend triton \
    --mem-fraction-static 0.8 \
    --disable-radix-cache \
    --trust-remote-code
	

Multi-GPU deployment (MI355X):

		sglang serve \
    --model-path Qwen/Qwen3.6-35B-A3B \
    --tensor-parallel-size 4 \
    --ep-size 1 \
    --trust-remote-code \
    --enable-aiter-allreduce-fusion \
    --attention-backend triton \
    --disable-radix-cache \
    --mem-fraction-static 0.8
	

MTP (multi-token prediction) enablement:

		sglang serve \
    --model-path Qwen/Qwen3.6-35B-A3B \
    --tensor-parallel-size 4 \
    --ep-size 1 \
    --trust-remote-code \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --enable-aiter-allreduce-fusion \
    --attention-backend triton \
    --disable-radix-cache \
    --mem-fraction-static 0.8
	

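Step 3: Chat Completions API

The request below assumes the server listens on port 8888. SGLang listens on port 30000 by default, so either add --port 8888 to the sglang serve command or change the URL to match.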

		curl http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [
      {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'
	

If the server is running correctly, you should see output like the following:

Figure 2: Server outputs
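
Loading the weights can take a while, so a script may prefer to poll the server instead of watching the log. The sketch below is a generic retry helper; the name wait_until is made up for this example, and the commented usage line assumes the OpenAI-compatible /v1/models endpoint on the same port 8888 used in the request above.

```bash
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempt budget is exhausted.
# Usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
wait_until() {
  local max_attempts=$1 delay=$2 attempt
  shift 2
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Assumed endpoint and port; uncomment to wait up to 10 minutes (120 x 5 s):
# wait_until 120 5 curl -sf http://localhost:8888/v1/models && echo "server is ready"
```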

Step 1. Get Started with vLLM

Please use the latest pre-built upstream vLLM Docker image:

		docker run -d -it \
    --ipc=host \
    --network=host \
    --privileged \
    --cap-add=CAP_SYS_ADMIN \
    --device=/dev/kfd \
    --device=/dev/dri \
    --device=/dev/mem \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --shm-size 32G \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v /:/work \
    --entrypoint "/bin/bash" \
    --name qwen3.6 \
    vllm/vllm-openai-rocm:latest
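
Both docker run commands in this guide use --name qwen3.6, so Docker refuses to start the vLLM container while an SGLang container with that name still exists. The sketch below checks for the name first; name_in_list is a hypothetical helper that reads container names from standard input.

```bash
#!/usr/bin/env bash
# Succeeds if the exact name given as $1 appears on stdin (one name per line).
name_in_list() {
  grep -Fxq -- "$1"
}

# Only query Docker when the CLI is available.
if command -v docker >/dev/null 2>&1; then
  if docker ps -a --format '{{.Names}}' | name_in_list "qwen3.6"; then
    echo "A container named qwen3.6 already exists; remove it with: docker rm -f qwen3.6"
  fi
fi
```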
	

Step 2. Start vLLM serving

Single GPU deployment (MI355X):

		vllm serve Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 1 \
  --max-model-len 32768  \
  --reasoning-parser qwen3
	

Multi-GPU deployment (MI355X):

		vllm serve Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3
	

MTP (multi-token prediction) enablement:

		vllm serve Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
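
When experimenting with different draft lengths, it can be convenient to generate the JSON argument rather than edit it by hand. The helper below is a hypothetical convenience, not part of vLLM; it only reproduces the JSON shape used in the command above and rejects values that are not positive integers.

```bash
#!/usr/bin/env bash
# Emit the JSON passed to --speculative-config for a given number of draft tokens.
mtp_config() {
  local n=$1
  # Reject empty input, non-digits, zero, and numbers with leading zeros.
  case "$n" in
    ''|*[!0-9]*|0*)
      echo "num_speculative_tokens must be a positive integer" >&2
      return 1
      ;;
  esac
  printf '{"method": "mtp", "num_speculative_tokens": %s}\n' "$n"
}

mtp_config 2
# Example use with the same flags as the command above:
# vllm serve Qwen/Qwen3.6-35B-A3B --tensor-parallel-size 8 --max-model-len 262144 \
#   --reasoning-parser qwen3 --speculative-config "$(mtp_config 2)"
```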
	

Let’s Try Coding!

Qwen Code deployment

In this section, we show how to deploy Qwen Code locally on top of the SGLang server and how to interact with it for coding queries.

Figure 3: Qwen code deployment

Step 1: Install Node.js

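		```bash
# Install nvm, then load it into the current shell so the nvm command is available
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash
export NVM_DIR="$HOME/.nvm"
. "$NVM_DIR/nvm.sh"
# Install and select the latest LTS release of Node.js
nvm install --lts
nvm use --lts
# Verify the installation
node -v
npm -v
```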
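
Before installing Qwen Code, it may be worth confirming that the Node.js version is recent enough. The sketch below assumes a minimum major version of 20; that threshold is an assumption, so check the Qwen Code README for the actual requirement. The helper name node_major is made up for this example.

```bash
#!/usr/bin/env bash
# Extract the major version from a string such as "v20.11.0".
node_major() {
  local v=${1#v}      # drop the leading "v" if present
  echo "${v%%.*}"     # keep everything before the first dot
}

MIN_NODE_MAJOR=20     # assumed minimum; verify against the Qwen Code README

if command -v node >/dev/null 2>&1; then
  major=$(node_major "$(node -v)")
  if [ "$major" -ge "$MIN_NODE_MAJOR" ]; then
    echo "Node.js $major is new enough"
  else
    echo "Node.js $major is older than $MIN_NODE_MAJOR; run: nvm install --lts"
  fi
fi
```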

	

Step 2: Install Qwen Code

		```bash
# Install Qwen Code
npm install -g @qwen-code/qwen-code@latest
# Verify the installation
qwen --version
```

	

Step 3: Launch Qwen Code

Step 1: Set the OpenAI-compatible API variables so that Qwen Code connects to the model served locally on the MI355X GPU

		```bash
export OPENAI_API_KEY="EMPTY"
export OPENAI_BASE_URL="http://localhost:8888/v1"
export OPENAI_MODEL="Qwen/Qwen3.6-35B-A3B"
```
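
Whitespace inside the quotes becomes part of the value, and a model name padded with spaces will not match the name the server registered. The sketch below flags variables that are unset, empty, or padded; check_env is a hypothetical helper and relies on bash indirect expansion.

```bash
#!/usr/bin/env bash
# Report variables that are unset, empty, or padded with whitespace.
check_env() {
  local name value status=0
  for name in "$@"; do
    value=${!name:-}                     # indirect expansion (bash only)
    if [ -z "$value" ]; then
      echo "$name is not set"
      status=1
    elif [[ "$value" =~ ^[[:space:]] || "$value" =~ [[:space:]]$ ]]; then
      echo "$name has leading or trailing whitespace"
      status=1
    fi
  done
  return "$status"
}

check_env OPENAI_API_KEY OPENAI_BASE_URL OPENAI_MODEL \
  || echo "Fix the variables listed above before starting qwen"
```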
	

Step 2: Start Qwen Code

		```bash
qwen
```
	

If everything goes well, the Qwen Code interface is displayed locally:

Figure 4: Qwen code interface

Step 3: Interact with the Qwen Code Agent

Sample Question:

sample question

Answer:

sample response

Summary

This blog presents Day 0 support for Alibaba's Qwen3.6 model family on AMD Instinct GPUs. By following this guide, you have learned how to deploy Qwen3.6 with vLLM or SGLang, how to enable the reasoning parser and MTP speculative decoding for agentic tasks, and how to deploy Qwen Code locally on top of Qwen3.6-35B-A3B.

This enablement lets your development team start building robust, agent-led coding platforms on the latest AMD hardware right away. Subsequent posts will dive deeper into kernel-level profiling, custom attention implementations, and the ongoing co-optimization of the AMD ROCm software stack and Qwen models. Stay tuned.

Acknowledgements

Thanks to the AMD team members who contributed to this effort (Andy Luo, Haichen Zhang, FangChun, Chang Liu, Bingqing Guo, Yi Gan, Hattie Wu, and Tun Jian) and to the Qwen team.
