Day-0 Support for Baidu ERNIE-Image on AMD GPUs: Validation on the Instinct MI355X GPU and Radeon AI PRO R9700

Apr 23, 2026

This post documents the day-0 deployment and inference validation of Baidu’s ERNIE-Image text-to-image model on AMD GPUs, covering both the data-center-class AMD Instinct™ MI355X GPU (CDNA 4) and the professional workstation-class Radeon™ AI PRO R9700 graphics card (RDNA 4). Using the HuggingFace Diffusers framework and the ROCm™ software stack, ERNIE-Image runs on AMD hardware with zero modifications to the core inference code.

  • Instinct MI355X GPU (288 GB HBM3e): Full single-card deployment with generous memory headroom.
  • Radeon AI PRO R9700 (32 GB GDDR6): Single-card deployment via enable_model_cpu_offload(), with peak VRAM usage well within the card’s capacity.

Background

ERNIE-Image Model Overview

ERNIE-Image is a text-to-image model developed by Baidu, built on the Diffusion Transformer (DiT) architecture. It supports both Chinese and English prompt input. The model has been submitted for upstream integration into the HuggingFace Diffusers library via PR #13432.

The model architecture comprises the following components:

| Component | Type | Size | Role |
| --- | --- | --- | --- |
| Transformer | ErnieImageTransformer2DModel | 15 GB | Diffusion backbone |
| Text Encoder | Mistral3Model | 7.2 GB | Text encoding |
| Prompt Enhancer (PE) | Ministral3ForCausalLM | 7.2 GB | Automatic prompt enrichment |
| VAE | AutoencoderKLFlux2 | 161 MB | Variational autoencoder |
| Scheduler | FlowMatchEulerDiscreteScheduler | | Sampling scheduler |
| Total | | ~29.5 GB | |

The Prompt Enhancer automatically expands short user inputs into detailed Chinese descriptions, which the model then uses for generation. This works transparently for both English and Chinese inputs.

AMD Instinct MI355X GPUs

The AMD Instinct MI355X GPU is the latest AMD data-center AI accelerator, based on the CDNA 4 architecture (gfx950). Each card carries 288 GB of HBM3e, making it well suited for large-scale model training and inference. The test system was an 8-GPU MI355X server; only a single card was used for this inference validation.

AMD Radeon AI PRO R9700 GPUs

The AMD Radeon AI PRO R9700 is a professional workstation AI accelerator based on the RDNA 4 architecture (gfx1201), with 32 GB of GDDR6 memory. As part of the Radeon AI PRO R9000 series, it targets local AI inference, model development, and other memory-intensive workloads, combining large VRAM capacity with ROCm-based multi-GPU scalability.

In hardware terms, the card includes 64 Compute Units, 4096 Stream Processors, 128 AI Accelerators, and 64 Ray Accelerators. Its memory and board configuration includes a 256-bit memory interface, 640 GB/s of peak bandwidth, 64 MB of Infinity Cache, PCIe 5.0 x16, 300 W total board power, active cooling, and ECC memory support on Linux.

Compared to the MI355X, the R9700 tests whether a professional-class GPU with constrained VRAM can still run a ~29.5 GB model. The test system was a 4-GPU R9700 workstation; only a single card was used.

Environment Setup

MI355X Hardware and Software

Hardware:

| Item | Detail |
| --- | --- |
| GPU | AMD Instinct MI355X × 8 (single card used) |
| Architecture | CDNA 4 (gfx950) |
| VRAM per card | 288 GB HBM3e |
| Host ROCm | 7.2.1 |

Software:

| Software | Version |
| --- | --- |
| Docker image | rocm/pytorch:rocm7.2.1_ubuntu24.04_py3.12_pytorch_release_2.9.1 |
| PyTorch | 2.9.1+rocm7.2.1 |
| ROCm (HIP) | 7.2.53211 |
| Diffusers | 0.38.0.dev0 (HsiaWinter/diffusers add-ernie-image branch) |
| Transformers | 5.5.3 |
| Accelerate | 1.13.0 |
| Python | 3.12 |

R9700 Hardware and Software

Hardware:

| Item | Detail |
| --- | --- |
| GPU | AMD Radeon AI PRO R9700 × 4 (single card used) |
| Architecture | RDNA 4 (gfx1201) |
| VRAM per card | 32 GB GDDR6 |
| Host ROCm | 7.2 |

Software:

| Software | Version |
| --- | --- |
| Docker image | rocm/pytorch:rocm7.2_ubuntu24.04_py3.12_pytorch_release_2.9.1 |
| PyTorch | 2.9.1+rocm7.2.0 |
| ROCm (HIP) | 7.2.26015 |
| Diffusers | 0.38.0.dev0 (HsiaWinter/diffusers add-ernie-image branch) |
| Transformers | 5.5.3 |
| Accelerate | 1.13.0 |
| Python | 3.12 |

Deployment Steps

Pull the Docker Image

Pull the ROCm-compatible PyTorch image matching the host ROCm version:

		docker pull rocm/pytorch:rocm7.2.1_ubuntu24.04_py3.12_pytorch_release_2.9.1
	

Create the Container

Create a Docker container with GPU passthrough:

		docker run -d --name ernie-image-test \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  --shm-size=64G \
  -v /path/to/model:/workspace \
  rocm/pytorch:rocm7.2.1_ubuntu24.04_py3.12_pytorch_release_2.9.1 \
  sleep infinity
	

Key flags:

  • --device=/dev/kfd --device=/dev/dri: Pass AMD GPU devices into the container
  • --group-add video --group-add render: Grant GPU access permissions
  • --shm-size=64G: Set shared memory to avoid data loading bottlenecks

Verify GPU Availability

Once inside the container, verify that PyTorch detects the GPU:

		docker exec ernie-image-test python3 -c "
import torch
print(f'PyTorch: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'Device: {torch.cuda.get_device_name(0)}')
print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.0f} GB')
"
	

Expected Output:

		PyTorch: 2.9.1+rocm7.2.1.gitff65f5bc
CUDA available: True
Device: AMD Instinct MI355X
VRAM: 288 GB
	

Note: ROCm provides CUDA API compatibility through HIP (Heterogeneous-compute Interface for Portability). This means the standard torch.cuda interface works on AMD GPUs with no code changes.

Install Diffusers and Dependencies

		# Clone the Diffusers branch with ERNIE-Image support
git clone https://github.com/HsiaWinter/diffusers /workspace/diffusers-ernie
cd /workspace/diffusers-ernie
git checkout add-ernie-image

# Install
pip install -e .
pip install accelerate Pillow transformers
	

Prepare Model Weights

Unpack the ERNIE-Image model weights into the workspace:

		tar xf ERNIE-Image.tar
	

The extracted directory structure:

		ERNIE-Image/
├── model_index.json
├── transformer/          # DiT backbone (15 GB)
├── text_encoder/         # Text encoder (7.2 GB)
├── pe/                   # Prompt Enhancer (7.2 GB)
├── pe_tokenizer/         # PE tokenizer
├── tokenizer/            # Text tokenizer
├── scheduler/            # Sampling scheduler
└── vae/                  # VAE decoder (161 MB)
	
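Before loading, it can help to verify that the extraction produced all of the components listed above, so a partial archive fails fast rather than partway through `from_pretrained`. A minimal stdlib-only sketch; `check_model_layout` is a hypothetical helper written for this post, not part of Diffusers:

```python
from pathlib import Path

# Component entries from the extracted layout above: the pipeline
# config file plus one subdirectory per component.
EXPECTED = [
    "model_index.json", "transformer", "text_encoder", "pe",
    "pe_tokenizer", "tokenizer", "scheduler", "vae",
]

def check_model_layout(root):
    """Return the expected entries missing under `root` (empty list = OK)."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).exists()]
```

Calling `check_model_layout("/workspace/ERNIE-Image")` and asserting the result is empty catches a truncated `tar xf` before any weights are loaded.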

Inference on the MI355X GPU

Inference Script

		import os
import random
import numpy as np
import torch
from diffusers import ErnieImagePipeline
seed = random.randint(0, 100000)
print(f"seed: {seed}")
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# Load pipeline
pipe = ErnieImagePipeline.from_pretrained(
    "/workspace/ERNIE-Image",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
pipe.transformer.eval()
pipe.vae.eval()
pipe.text_encoder.eval()
pipe.pe.eval()

# Generate images
generator = torch.Generator(device="cuda").manual_seed(seed)
prompt_list = [
    "a photo of a flower with a red petal and yellow center",
    "一朵鲜艳的玫瑰",
    "A photograph of the Straw Hat Pirates drawn on a glass whiteboard "
    "with a faded green marker, front view, 4K resolution."
]

for idx, prompt in enumerate(prompt_list):
    output = pipe(
        prompt=prompt,
        height=1024,
        width=1024,
        num_inference_steps=50,
        guidance_scale=5.0,
        generator=generator,
    )
    output.images[0].save(f"ernie_output_{idx+1}.png")
    print(f"Revised prompt: {output.revised_prompts}")
	

ROCm adaptation note: Compared to the original CUDA version, the only changes needed are removing the CUBLAS_WORKSPACE_CONFIG environment variable and torch.backends.cudnn.deterministic settings, which are CUDA-specific. The core inference code requires no modifications to run on AMD GPUs.
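An alternative to deleting those lines is guarding them, so the same script runs unchanged on CUDA and ROCm builds. A minimal sketch; the `+rocm` version-string check is a convention assumed here, and `determinism_env` is a hypothetical helper, not a PyTorch API:

```python
# Guard CUDA-only determinism settings instead of removing them,
# so one script covers both CUDA and ROCm builds of PyTorch.

def is_rocm_build(torch_version: str) -> bool:
    """'2.9.1+rocm7.2.1' -> True, '2.9.1+cu128' -> False."""
    return "+rocm" in torch_version

def determinism_env(torch_version: str) -> dict:
    """CUDA uses CUBLAS_WORKSPACE_CONFIG for deterministic cuBLAS;
    the variable has no effect on ROCm, so it is skipped there."""
    if is_rocm_build(torch_version):
        return {}
    return {"CUBLAS_WORKSPACE_CONFIG": ":4096:8"}

# Usage (illustrative):
#   import os, torch
#   os.environ.update(determinism_env(torch.__version__))
#   if not is_rocm_build(torch.__version__):
#       torch.backends.cudnn.deterministic = True
```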

Results

All three prompts successfully generated 1024×1024 images. The Prompt Enhancer automatically expanded each input into a detailed Chinese description for generation.

Prompt 1: "a photo of a flower with a red petal and yellow center"

Revised prompt (generated by PE): “一张高清微距摄影照片,展示了一朵盛开的鲜花。花朵位于画面中心,构图平衡,背景为深度虚化的自然绿色植被。花瓣呈现出鲜艳的深红色,质地轻薄且带有细腻的丝绸光泽...” (roughly: “A high-definition macro photograph of a flower in full bloom. The flower sits at the center of a balanced composition against a deeply blurred background of natural green foliage. The petals are a vivid deep red, thin in texture with a delicate silk-like sheen...”)

Figure 1: Flower with red petals and yellow center

Prompt 2: "一朵鲜艳的玫瑰" (a vibrant rose)

Figure 2: A vibrant rose

Prompt 3: "A photograph of the Straw Hat Pirates drawn on a glass whiteboard with a faded green marker, front view, 4K resolution."

Figure 3: Straw Hat Pirates on a glass whiteboard

With 288 GB of HBM3e, MI355X provides ample headroom for full single-card deployment. The remaining memory leaves room for batch generation, higher resolutions, or running multiple model instances in parallel.

Technical Notes

ROCm and CUDA Compatibility

Thanks to ROCm’s HIP compatibility layer, PyTorch-based model code runs on AMD GPUs with zero modifications:

  • torch.cuda.is_available() → returns True
  • model.to("cuda") → correctly maps to AMD GPU
  • torch.cuda.manual_seed_all() → works as expected
  • torch.Generator(device="cuda") → works as expected

CUDA to ROCm Migration Reference

When migrating from NVIDIA to AMD, the following items require attention:

| Item | NVIDIA (CUDA) | AMD (ROCm) | Action |
| --- | --- | --- | --- |
| CUBLAS_WORKSPACE_CONFIG | Required | Not applicable | Remove |
| torch.backends.cudnn.* | cuDNN config | Uses MIOpen | Remove related settings |
| torch.use_deterministic_algorithms | Supported | Partial support | Remove if needed |
| torch.cuda.* API | Native | HIP compatibility layer | No changes needed |
| Attention backend | Flash Attention / cuDNN | AOTriton | Automatically selected |

AOTriton is the AMD Triton-based attention kernel optimized for the ROCm platform. PyTorch automatically selects it as the scaled dot-product attention (SDPA) backend during inference.
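For reference, the operation these backends accelerate is softmax(QKᵀ/√d)·V. A tiny pure-Python version on small 2-D lists (illustrative only; the real kernels operate on batched GPU tensors):

```python
import math

def sdpa(q, k, v):
    """Scaled dot-product attention for small 2-D lists:
    softmax(q @ k^T / sqrt(d)) @ v."""
    d = len(q[0])
    # scores[i][j] = (q_i . k_j) / sqrt(d)
    scores = [[sum(qi[t] * kj[t] for t in range(d)) / math.sqrt(d)
               for kj in k] for qi in q]
    out = []
    for row in scores:
        m = max(row)                      # subtract max for numerical stability
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        w = [e / z for e in exps]         # softmax attention weights
        out.append([sum(w[j] * v[j][t] for j in range(len(v)))
                    for t in range(len(v[0]))])
    return out
```

With a zero query the weights are uniform, so the output is the mean of the value rows; a query aligned with one key concentrates the weight on that key's value.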

Docker Image Selection

For MI355X GPUs (gfx950), use ROCm 7.0+ PyTorch images. The rocm/pytorch repository on Docker Hub provides pre-built options:

  • Stable: rocm/pytorch:rocm7.2.1_ubuntu24.04_py3.12_pytorch_release_2.9.1
  • Nightly: rocm/pytorch-nightly:2026-03-19-rocm7.2

The stable release is recommended for reliability.

R9700 Adaptation: Running a 29.5 GB Model on a 32 GB Card

This section covers the adaptation needed to run ERNIE-Image on the AMD Radeon AI PRO R9700 graphics card. Unlike the MI355X with its 288 GB of HBM3e, the R9700 has 32 GB of GDDR6, which requires a memory-aware deployment strategy.

VRAM Analysis

The full ERNIE-Image model in bfloat16 occupies approximately 29.5 GB. Loading each component onto the R9700 shows the following cumulative VRAM usage:

| Load Order | Component | Incremental | Cumulative | Remaining |
| --- | --- | --- | --- | --- |
| 1 | Text Encoder | 7.18 GiB | 7.18 GiB | 22.68 GiB |
| 2 | Prompt Enhancer | 6.38 GiB | 13.56 GiB | 16.30 GiB |
| 3 | VAE | 0.17 GiB | 13.73 GiB | 16.13 GiB |
| 4 | Transformer | 14.96 GiB | 28.69 GiB | 1.17 GiB |

Although all model parameters fit in VRAM, the remaining ~1.17 GiB is not enough to hold the intermediate tensors required during inference (attention computation, activations, etc.). A direct pipe.to("cuda") call results in an out-of-memory error during the transformer’s forward pass.
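The shortfall follows directly from the table's figures (pure Python; the ~29.86 GiB usable figure is inferred from the table's Remaining column, not a separately measured value):

```python
# Component weights in GiB, taken from the VRAM table above.
components = {
    "text_encoder": 7.18,
    "prompt_enhancer": 6.38,
    "vae": 0.17,
    "transformer": 14.96,
}
USABLE_GIB = 29.86  # approx. usable VRAM implied by the table's Remaining column

resident = sum(components.values())   # all components on the GPU at once
remaining = USABLE_GIB - resident     # headroom left for activations
print(f"resident={resident:.2f} GiB, remaining={remaining:.2f} GiB")
```

About 1.17 GiB of headroom is far less than the intermediate tensors of a 1024×1024 denoising pass need, which is why the naive `pipe.to("cuda")` fails.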

Solution: enable_model_cpu_offload()

Diffusers’ enable_model_cpu_offload() mechanism allows the pipeline components to time-share a single GPU:

  1. Prompt Enhancer loads to GPU → enhances the prompt → offloads back to CPU
  2. Text Encoder loads to GPU → encodes the text → offloads back to CPU
  3. Transformer loads to GPU → runs 50 denoising steps → offloads back to CPU
  4. VAE loads to GPU → decodes the image → offloads back to CPU

Since each inference stage runs sequentially, peak VRAM usage only needs to accommodate the single largest component (the Transformer at 14.96 GiB) plus intermediate tensors. This keeps peak VRAM well within the card’s 32 GB capacity.
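Under offload, peak weight residency drops from the sum of all components to the largest single component. A minimal simulation of the time-sharing schedule above (sizes in GiB from the table; illustrative, not the accelerate implementation):

```python
def offload_peak(stages):
    """Simulate sequential model CPU offload: each stage loads, runs,
    and unloads before the next, so peak residency is the max stage size."""
    peak = resident = 0.0
    for name, size in stages:
        resident += size            # stage weights moved to the GPU
        peak = max(peak, resident)
        resident -= size            # offloaded back to the CPU afterwards
    return peak

stages = [  # pipeline order, sizes in GiB from the table above
    ("prompt_enhancer", 6.38),
    ("text_encoder", 7.18),
    ("transformer", 14.96),
    ("vae", 0.17),
]
print(offload_peak(stages))  # the 14.96 GiB Transformer dominates
```

The 14.96 GiB peak leaves roughly half the card free for activations, versus the 1.17 GiB left when everything is resident at once.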

R9700 Container Setup

The R9700 container setup is similar to the MI355X setup, with a few differences:

		docker run -d --name ernie-image-radeon \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --shm-size=16G \
  -v /path/to/model:/workspace/ERNIE-Image \
  rocm/pytorch:rocm7.2_ubuntu24.04_py3.12_pytorch_release_2.9.1 \
  sleep infinity
	

Notable differences from the MI355X setup:

  • --cap-add=SYS_PTRACE --security-opt seccomp=unconfined: Required for RDNA 4 debugging and profiling support
  • --shm-size=16G: Reduced from 64G (sufficient for single-card workstation use)
  • Docker image uses rocm7.2 (matching the R9700 host ROCm version)

Install dependencies the same way:

		docker exec ernie-image-radeon bash -c "
  git clone https://github.com/HsiaWinter/diffusers /workspace/diffusers-ernie
  cd /workspace/diffusers-ernie && git checkout add-ernie-image
  pip install -e .
  pip install accelerate Pillow transformers
"
	

R9700 Inference Script

The only difference from the MI355X script is replacing pipe.to("cuda") with pipe.enable_model_cpu_offload():

		import os
import random

import numpy as np
import torch
from diffusers import ErnieImagePipeline

seed = random.randint(0, 100000)
print(f"seed: {seed}")
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# Load pipeline
pipe = ErnieImagePipeline.from_pretrained(
    "/workspace/ERNIE-Image",
    torch_dtype=torch.bfloat16,
)

# R9700 adaptation: use CPU offload for time-sharing a single card
pipe.enable_model_cpu_offload()

pipe.transformer.eval()
pipe.vae.eval()
pipe.text_encoder.eval()
pipe.pe.eval()

# Generate images
generator = torch.Generator(device="cpu").manual_seed(seed)

prompt_list = [
    "a photo of a flower with a red petal and yellow center",
    "一朵鲜艳的玫瑰",
    "A photograph of the Straw Hat Pirates drawn on a glass whiteboard "
    "with a faded green marker, front view, 4K resolution."
]

for idx, prompt in enumerate(prompt_list):
    output = pipe(
        prompt=prompt,
        height=1024,
        width=1024,
        num_inference_steps=50,
        guidance_scale=5.0,
        generator=generator,
    )
    output.images[0].save(f"ernie_output_{idx+1}.png")
    print(f"Revised prompt: {output.revised_prompts}")
	

Key adaptation detail: When using enable_model_cpu_offload(), the torch.Generator must be set to device="cpu" (not "cuda"), because the offload mechanism automatically manages device migration.

Results

All three prompts successfully generated 1024×1024 images on the R9700 series graphics card, with the Prompt Enhancer functioning correctly to produce detailed Chinese-language enhanced prompts.

This confirms that a professional 32 GB AMD card can run ERNIE-Image without model surgery, quantization, or multi-GPU sharding, requiring only a one-line change to the loading strategy.

Conclusion

This validation demonstrates that:

  1. ERNIE-Image runs on AMD Instinct MI355X GPUs with no core code changes. The standard Diffusers pipeline works through ROCm’s HIP compatibility layer.
  2. ERNIE-Image also runs on AMD Radeon AI PRO R9700 series graphics, requiring only a switch from pipe.to("cuda") to pipe.enable_model_cpu_offload() to fit within the 32 GB VRAM constraint.
  3. The HuggingFace Diffusers framework is well supported on ROCm software, with pipeline loading, model inference, and image generation all working smoothly across both CDNA 4 and RDNA 4 architectures.
  4. MI355X’s 288 GB HBM3e provides ample room for single-card deployment, with headroom for batch generation, higher resolutions, or multi-instance serving. R9700’s 32 GB GDDR6 makes single-card inference practical through Diffusers’ built-in offload mechanism.
  5. Diffusers’ enable_model_cpu_offload() provides an out-of-the-box solution for memory-constrained scenarios, enabling professional and workstation-class GPUs to run large text-to-image models.

AMD GPUs, from the data-center Instinct line to the professional Radeon line, combined with the ROCm software stack, offer a credible, high-compatibility inference platform for models like ERNIE-Image. For teams evaluating multi-vendor AI infrastructure, this is a practical data point: ROCm plus Diffusers works.

Appendix: Quick Reproduction Guide

MI355X GPU

		# 1. Pull image
docker pull rocm/pytorch:rocm7.2.1_ubuntu24.04_py3.12_pytorch_release_2.9.1

# 2. Create container
docker run -d --name ernie-image \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  --shm-size=64G \
  -v /path/to/workspace:/workspace \
  rocm/pytorch:rocm7.2.1_ubuntu24.04_py3.12_pytorch_release_2.9.1 \
  sleep infinity

# 3. Install dependencies
docker exec ernie-image bash -c "
  git clone https://github.com/HsiaWinter/diffusers /workspace/diffusers-ernie
  cd /workspace/diffusers-ernie && git checkout add-ernie-image
  pip install -e .
  pip install accelerate Pillow transformers
"

# 4. Unpack model and run inference
docker exec ernie-image bash -c "
  cd /workspace && tar xf ERNIE-Image.tar
  python3 test_ernie_image.py
"
	

Radeon AI PRO R9700

		# 1. Pull image
docker pull rocm/pytorch:rocm7.2_ubuntu24.04_py3.12_pytorch_release_2.9.1

# 2. Create container
docker run -d --name ernie-image-radeon \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --shm-size=16G \
  -v /path/to/workspace:/workspace \
  rocm/pytorch:rocm7.2_ubuntu24.04_py3.12_pytorch_release_2.9.1 \
  sleep infinity

# 3. Install dependencies
docker exec ernie-image-radeon bash -c "
  git clone https://github.com/HsiaWinter/diffusers /workspace/diffusers-ernie
  cd /workspace/diffusers-ernie && git checkout add-ernie-image
  pip install -e .
  pip install accelerate Pillow transformers
"

# 4. Unpack model and run inference (script uses enable_model_cpu_offload)
docker exec ernie-image-radeon bash -c "
  cd /workspace && tar xf ERNIE-Image.tar
  python3 test_ernie_image_radeon.py
"
	

Key difference: The R9700 inference script uses pipe.enable_model_cpu_offload() instead of pipe.to("cuda"), and torch.Generator uses device="cpu".
