Day 0 Support for Gemma 4 on AMD Processors and GPUs

Apr 02, 2026

What’s New in Gemma 4?

The Gemma 4 family of open-weights models from Google includes four variants, spanning sizes from 2B effective parameters to 31B parameters and including both Mixture of Experts (MoE) and dense architectures. These multimodal models ingest text, vision, and, for select variants, audio inputs, and generate text outputs. They support context sizes of up to 256K tokens and have been trained for thinking, coding, function calling, optical character recognition (OCR), object recognition, and automatic speech recognition tasks. For relatively compact models they have outstanding language skills, understanding up to 140 different languages.

Architecture changes from the Gemma 3 generation improve efficiency and long-context quality, and are coupled with new architectures for vision and audio processing. The combination of strong capabilities and sizes appropriate for a range of local hardware makes Gemma 4 a strong fit for agentic AI workflows.

Supported across the full range of AMD hardware

AMD is proud to provide Day Zero support for the full set of Gemma 4 models across our portfolio of AI-enabled hardware. This includes AMD Instinct™ GPUs for cloud and enterprise datacenters, AMD Radeon™ GPUs for AI workstations, and AMD Ryzen™ AI processors for AI PCs. Support includes integration with the most popular AI applications like LM Studio, and support for open-source software projects, including vLLM, SGLang, llama.cpp, Ollama, and Lemonade.

Deploying with vLLM

Gemma 4 can be deployed on AMD GPUs using vLLM to take advantage of the many optimizations in this inference framework, particularly its support for multiple concurrent requests. The whole range of AMD GPUs supported by vLLM, including multiple generations of both Instinct and Radeon GPUs, can be used with the Gemma 4 models. This support is planned for both the Gemma 4 launch build of upstream vLLM and future nightly builds, installable as either a Docker image or a Python package using the process documented at https://vllm.ai/.

		docker pull vllm/vllm-openai-rocm:gemma4
	

For all AMD GPUs, vLLM can be invoked with the TRITON_ATTN backend:

		vllm serve <model> --attention-backend TRITON_ATTN
	

Support for additional attention backends with further optimizations for MI300- and MI350-series GPUs will be available soon.
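Once the server is up, requests go through vLLM's OpenAI-compatible API. The sketch below is an illustration only: it assumes the server is listening on localhost:8000 and uses a placeholder model name, which you should replace with the checkpoint you are serving.

```python
import json
import urllib.request

MODEL = "<model>"                       # placeholder -- substitute your Gemma 4 checkpoint
BASE_URL = "http://localhost:8000/v1"   # vLLM's OpenAI-compatible endpoint

def build_chat_request(prompt: str, max_tokens: int = 64) -> dict:
    """Assemble an OpenAI-style chat completion payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """POST the payload to the running vLLM server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

print(build_chat_request("Hello!")["messages"][0]["content"])
```

Because the endpoint follows the OpenAI schema, any OpenAI-compatible client library can be pointed at the same base URL instead of hand-rolling requests.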

Deploying with SGLang

Gemma 4 can also be deployed on AMD Instinct MI300X, MI325X, and MI350-series GPUs using SGLang, which provides high-performance serving.

SGLang supports the full Gemma 4 family, including the dense models (E2B, E4B, 31B) and the MoE variant (26B-A4B). This support is available in the Gemma 4 launch build of SGLang as a Docker image; see https://cookbook.sglang.io/ for setup instructions.

All Gemma 4 models require the Triton attention backend for bidirectional image-token attention. 
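To illustrate what bidirectional image-token attention means, the pure-Python sketch below (with made-up token positions) builds a standard causal mask and then opens up the image-token span so those tokens can attend to one another in both directions:

```python
# Illustrative sketch only -- sequence length and positions are hypothetical.
seq_len = 8
image_span = set(range(2, 5))  # hypothetical contiguous run of image tokens

# Standard causal mask: row i may attend to column j only when j <= i.
mask = [[j <= i for j in range(seq_len)] for i in range(seq_len)]

# Image tokens attend to every other image token, including later positions.
for i in image_span:
    for j in image_span:
        mask[i][j] = True

# A text token (row 0) still cannot look ahead, but an image token (row 2)
# can now see the image token at position 4.
print(mask[0][1], mask[2][4])  # → False True
```

Attention backends that only implement causal masking cannot express the opened-up block, which is why the Triton backend is required here.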

SGLang can be invoked as follows: 

		python3 -m sglang.launch_server --model-path <model> --attention-backend triton --tp 1  
	

Even the largest Gemma 4 model fits on a single MI300X GPU (192 GB HBM3) at TP=1 with the full context length. For higher-throughput workloads, tensor parallelism can be increased (e.g., --tp 2).
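A quick back-of-envelope check makes the single-GPU claim concrete, assuming bf16 weights (2 bytes per parameter); the real footprint also depends on quantization and KV-cache configuration:

```python
params = 31e9          # largest Gemma 4 variant (31B parameters)
bytes_per_param = 2    # bf16 weights
hbm_gb = 192           # MI300X HBM capacity

weights_gb = params * bytes_per_param / 1e9
headroom_gb = hbm_gb - weights_gb
print(f"weights ~{weights_gb:.0f} GB, ~{headroom_gb:.0f} GB left for KV cache and activations")
# → weights ~62 GB, ~130 GB left for KV cache and activations
```

The remaining headroom is what allows the full 256K-token context to be served without sharding the model.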

Deploying on local hardware with LM Studio

Gemma 4 models can be deployed easily and with strong performance on AMD hardware through the open-source llama.cpp project and LM Studio. Users can quickly spin up these models on supported hardware, such as AMD Ryzen™ AI and Ryzen AI Max processors as well as Radeon and Radeon PRO graphics cards, by downloading the popular LM Studio application and pairing it with the latest AMD Software: Adrenalin™ Edition drivers.

Deploying on local hardware with Lemonade Server

Lemonade Server enables deployment of Gemma 4 models on AMD hardware through an open-source local LLM server with OpenAI‑compatible APIs. It supports acceleration on AMD Radeon™ and Radeon™ PRO GPUs via ROCm, and on AMD Ryzen™ AI processors using the XDNA 2 NPU.

GPU deployment with Lemonade and ROCm

To run Gemma 4 on AMD GPUs with ROCm acceleration:

  • Install Lemonade and download the preview ROCm build of llama.cpp for your GPU architecture from the release artifacts (e.g., llama-windows-rocm-gfx1151-x64 for Radeon™ 8060S).
  • Point Lemonade to the ROCm build by setting the environment variable:
		export LEMONADE_LLAMACPP_ROCM_BIN=/path/to/llama-server
	
  • Start Lemonade and load the Gemma 4 model via the API:
		```
lemonade-server serve
curl http://localhost:8000/api/v1/pull \
    -H "Content-Type: application/json" \
    -d '{"model_name": "user.Gemma-4-E4B-IT", "checkpoint": "<insert-checkpoint-name>", "recipe": "llamacpp"}'
```
	
  • Chat with the model via the OpenAI-compatible API:
		```
curl http://localhost:8000/api/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "user.Gemma-4-E4B-IT", "messages": [{"role": "user", "content": "Hello!"}], "llamacpp": "rocm"}'
```
	

NPU deployment with Ryzen AI

Developers will be able to deploy Gemma 4 models on the NPU by integrating Lemonade Server, which supports the latest AMD XDNA 2 NPU. NPU support for the Gemma 4 E2B and E4B models will arrive with the next Ryzen AI Software update. This update will be integrated into Lemonade and will also be available to developers directly through ONNX Runtime APIs.

Conclusion

With Gemma 4, Google continues to advance the state of the art in compact open-weights models. AMD is uniquely positioned to enable users to deploy these models, with hardware offerings that are ideally suited to each of the Gemma 4 variants and software support for Gemma 4 in widely deployed AI inference projects. This spans the gamut from small models running on the NPU to orchestration of several models running on Instinct GPUs, and everything in between.
