Unlock the Power of the IBM Granite 4.0 Family of Models with AMD Instinct GPUs: A Developer’s Day 0 guide
Oct 02, 2025

AMD is excited to announce Day 0 support for IBM's next-generation Granite 4.0 language models on AMD Instinct™ MI300 Series GPUs (MI300X, MI325X) and MI350 Series GPUs (MI350X, MI355X) using vLLM.
This blog covers the architecture highlights, the collaboration between AMD and IBM, and the prerequisites, and provides a quick start so you can run IBM Granite 4.0 models on AMD GPUs.
Brief Introduction to Granite 4.0 Language Models
Granite 4.0 models utilize a new hybrid Mamba-2/Transformer architecture, marrying the speed and efficiency of Mamba with the precision of transformer-based self-attention.
Many of the innovations informing the Granite 4.0 architecture arose from IBM Research's collaboration with the original Mamba creators on Bamba.
The Granite 4.0 Mixture of Experts (MoE) architecture employs 9 Mamba blocks for every 1 transformer block.
The Mamba blocks capture global context, which is then passed to transformer blocks that enable a more nuanced parsing of local context. The result is a dramatic reduction in memory usage and latency with no apparent tradeoff in performance.
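To see why this hybrid layout cuts memory use, note that a transformer layer's KV cache grows linearly with sequence length, while a Mamba-2 layer keeps a fixed-size state regardless of context length. The back-of-envelope sketch below illustrates the effect; all dimensions (head counts, state sizes, layer counts) are hypothetical placeholders, not Granite 4.0's actual configuration:

```python
def kv_cache_bytes(seq_len, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # Keys + values per attention layer: 2 * seq_len * n_kv_heads * head_dim
    return 2 * seq_len * n_kv_heads * head_dim * dtype_bytes

def mamba_state_bytes(d_state=128, d_inner=4096, dtype_bytes=2):
    # A Mamba-2 layer's SSM state is fixed-size, independent of sequence length
    return d_state * d_inner * dtype_bytes

def hybrid_cache_bytes(seq_len, n_layers=40, attn_every=10):
    # In a 9:1 hybrid, only 1 in 10 layers keeps a growing KV cache
    n_attn = n_layers // attn_every
    n_mamba = n_layers - n_attn
    return n_attn * kv_cache_bytes(seq_len) + n_mamba * mamba_state_bytes()

def full_transformer_cache_bytes(seq_len, n_layers=40):
    # Every layer of a pure transformer pays the full KV-cache cost
    return n_layers * kv_cache_bytes(seq_len)

if __name__ == "__main__":
    for seq in (1_000, 32_000, 128_000):
        full = full_transformer_cache_bytes(seq) / 2**20
        hyb = hybrid_cache_bytes(seq) / 2**20
        print(f"{seq:>7} tokens: full transformer {full:9.1f} MiB vs hybrid {hyb:9.1f} MiB")
```

At long contexts the hybrid's cache is dominated by the handful of attention layers, so the gap over a pure transformer widens roughly tenfold under these assumptions.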
AMD and IBM Collaboration: Day 0 Support and Beyond
AMD has longstanding collaborations with IBM and Red Hat. Together we continue to push the boundaries of AI performance. Thanks to this close relationship, Granite 4.0 can run seamlessly on AMD Instinct GPUs from Day 0, using PyTorch and vLLM. Our collaboration paves the way for even more groundbreaking innovations, ensuring that AI performance continues to evolve and meet the increasing demands of modern computing.
Running Granite 4.0 on AMD Instinct GPUs
Prerequisites:
- An AMD Instinct MI300X or newer GPU
- AMD ROCm™ drivers installed
This section provides step-by-step instructions for running Granite 4.0 with our custom prebuilt Docker image. To run on bare metal, build vLLM from the tip of tree of its GitHub repository, as Granite 4.0 support has been fully upstreamed.
Step 1: Get the Granite 4.0 Docker image
We have created a public preview Docker image for Granite 4.0, which you can pull as follows:
docker pull rocm/vllm-dev:granite_4_preview
Step 2: Download Granite 4.0
Download a Granite 4.0 model through Hugging Face: Granite models
Step 3: Launch the Docker container
docker run \
--rm \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
--memory $(python3 -c "import os; mlim = int(0.8 * os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES') / 10**9); print(f'{mlim}G')") \
--cap-add=SYS_PTRACE --security-opt seccomp=unconfined --privileged \
--shm-size=16g \
--ulimit core=0:0 \
-e "TERM=xterm-256color" \
--name "granite_4_vllm_rocm" \
-it rocm/vllm-dev:granite_4_preview \
/bin/bash
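Besides the offline script shown in the next section, the container also includes vLLM's OpenAI-compatible server (started inside the container with `vllm serve ibm-granite/granite-4.0-micro`). A minimal sketch of querying such a server from Python, assuming vLLM's default port 8000 and the standard /v1/completions endpoint:

```python
import json
import urllib.request

def build_completion_request(prompt, model="ibm-granite/granite-4.0-micro", max_tokens=64):
    """Build the JSON body for the OpenAI-compatible /v1/completions endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.8,
        "top_p": 0.95,
    }

def complete(prompt, url="http://localhost:8000/v1/completions"):
    """POST a completion request and return the generated text."""
    body = json.dumps(build_completion_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# Example usage (requires the server to be running inside the container):
#   print(complete("The capital of France is"))
```

The request is plain JSON over HTTP, so any OpenAI-compatible client library works equally well in place of the hand-rolled helper above.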
Demo Granite 4.0 on AMD Instinct GPUs
With the container running and a Granite 4.0 model downloaded, you can run simple prompts against the model:
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)


def main():
    # Create an LLM.
    llm = LLM(model="ibm-granite/granite-4.0-micro")
    # Generate texts from the prompts.
    # The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}")
        print(f"Output: {generated_text!r}")
        print("-" * 60)


if __name__ == "__main__":
    main()
Before running this script, configure your Hugging Face access token following this tutorial and export it as an environment variable:
export HF_TOKEN=[your huggingface access token here]
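A quick sanity check that the token is actually visible to your Python process can save a confusing download failure later. The helper below is a hypothetical convenience, not part of vLLM or Hugging Face tooling:

```python
import os

def check_hf_token(env=None):
    """Return True if a non-empty HF_TOKEN is present in the environment mapping."""
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN"))

if __name__ == "__main__":
    if check_hf_token():
        print("HF_TOKEN detected; gated model downloads should authenticate.")
    else:
        print("HF_TOKEN not set; downloads of gated checkpoints may fail.")
```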
Sample output from ibm-granite/granite-4.0-micro:
Generated Outputs:
------------------------------------------------------------
Prompt: 'Hello, my name is'
Output: ' Helen and I am from Boston. I am a senior manager at a technology firm'
------------------------------------------------------------
Prompt: 'The president of the United States is'
Output: ' an interesting case in point. He is the head of the executive branch, which'
------------------------------------------------------------
Prompt: 'The capital of France is'
Output: ' Paris.'
------------------------------------------------------------
Prompt: 'The future of AI is'
Output: ' promising and will bring many changes to the world. As AI continues to develop,'
------------------------------------------------------------
Summary
This blog provided a step-by-step Day 0 guide to running IBM Granite 4.0 models on AMD Instinct MI300 and MI350 Series GPUs. With Granite 4.0 running seamlessly on AMD Instinct GPUs, developers can immediately build and scale AI applications such as document summarization and analysis, RAG, and AI agents, while maintaining transparency, safety, and security with an ISO 42001-certified LLM. This milestone is part of our broader mission to drive innovation and provide the AI community with open, high-performance tools.
Acknowledgements
AMD team members who contributed to this effort: Aleksandr Malyshev, Gregory Shtrasberg, and Matthew Wong.
This work would not have been possible without the close collaboration and support of the various organizations inside IBM. There are too many folks involved to name them all, but special thanks to Raghu Ganti for his leadership on this collaboration.
