Pushing the Boundaries of Foundation Model Training with AMD

AMD is committed to open-source AI, releasing everything behind our GenAI models: model weights, training configs, datasets, and code. Whether you're benchmarking, building, or contributing, you'll find everything you need to replicate, innovate, and scale with confidence.

Explore Models

AMD OLMo

Explore a series of fully open AMD OLMo models, trained entirely on Instinct™ MI250 GPUs and equipped with instruction-following and chat capabilities.

Instella-3B

Discover an open family of 3B-parameter language models trained from scratch on Instinct™ MI300X GPUs using ROCm™ software, delivering competitive performance against leading open-weight models.

AMD-135M

Meet the first AMD small language model with speculative decoding; it establishes an end-to-end workflow, encompassing both training and inference, on select AMD GPUs and AMD Ryzen™ AI processors.

Hummingbird-0.9B

Uncover an open-source text-to-video diffusion model that combines structural distillation and a novel data processing pipeline to deliver high-quality video generation.

AMD Nitro Diffusion

Explore two single-step diffusion models that showcase the performance of Instinct GPUs: they match the quality of full-step models while running efficiently on both data center and edge devices.

Instella-VL-1B

Dig deeper into a fully open-source, reproducible vision-language model for image understanding, trained on AMD Instinct MI300X GPUs.

Explore Publications

AI Agent

Agent Laboratory: Using LLM Agents as Research Assistants

An end-to-end autonomous research workflow designed to assist you, the human researcher, in implementing your research ideas.

MoEA: A Mixture-of-Experts Agent for Open-World Minecraft with Multimodal Expert Memory

An LLM-empowered agent that can complete various tasks in Minecraft automatically. It enhances adaptability and generalization by integrating online RL training with a multi-expert memory module. Experiments show that the AMD MoEA framework outperforms state-of-the-art methods on MineDojo tasks.

Model Compression

Quantization | Sparsity 

TernaryLLM: Ternarized Large Language Model

Dual Learnable Ternarization (DLT) and Outlier-Friendly Feature Knowledge Distillation (OFF) handle outliers in weights and activations, enabling TernaryLLM to outperform prior low-bit methods in text generation and zero-shot tasks.
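To make the idea concrete, here is a minimal PyTorch sketch of learnable ternarization in the spirit of DLT: weights map to three levels with learnable per-channel scales and train through a straight-through estimator. The threshold heuristic and the reading of "dual" as separate positive/negative scales are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class LearnableTernaryLinear(nn.Module):
    """Sketch of learnable ternarization (illustrative, not the paper's code)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # Learnable per-channel scales for positive and negative levels
        # (our reading of the "dual" in Dual Learnable Ternarization).
        self.alpha_pos = nn.Parameter(torch.ones(out_features, 1))
        self.alpha_neg = nn.Parameter(torch.ones(out_features, 1))

    def forward(self, x):
        w = self.weight
        # Threshold at a fraction of the mean magnitude (a common heuristic).
        delta = 0.7 * w.abs().mean(dim=1, keepdim=True)
        pos = (w > delta).float()
        neg = (w < -delta).float()
        ternary = pos * self.alpha_pos - neg * self.alpha_neg
        # Straight-through estimator: forward uses ternarized weights;
        # backward passes gradients to the latent full-precision weights
        # (and to the scales through `ternary`).
        w_q = ternary + w - w.detach()
        return x @ w_q.t()
```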

Efficient Architecture

Transformer | Diffusion | Hybrid 

Enhancing Vision Transformer: Amplifying Non-Linearity in Feedforward Network Module (ICML 2024)

An improved FFN (IFFN) module for vision transformers that uses the AGeLU activation function and multiple activation instances to enhance non-linearity, reducing hidden dimensions and computational cost.
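As a rough illustration, the sketch below assumes AGeLU is a GELU wrapped in learnable affine parameters and that the multiple activation instances are summed; both are our assumptions for illustration, not the paper's exact definitions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AGeLU(nn.Module):
    """Hypothetical stand-in for the paper's AGeLU: a GELU with learnable
    affine parameters. The paper's exact formulation may differ."""
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))
        self.shift = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return self.scale * F.gelu(x) + self.shift

class IFFN(nn.Module):
    """IFFN sketch: several activation instances add non-linearity so the
    hidden dimension (and FLOPs) can shrink versus a standard 4x FFN."""
    def __init__(self, dim, hidden_dim, num_instances=4):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.acts = nn.ModuleList([AGeLU() for _ in range(num_instances)])
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        h = self.fc1(x)
        h = sum(act(h) for act in self.acts)  # summation is our simplification
        return self.fc2(h)
```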

QT-ViT: Improving Linear Attention in ViT with Quadratic Taylor Expansion (NeurIPS 2024)

QT-ViT replaces softmax-based attention with a second-order Taylor expansion, accelerating it via a fast approximation algorithm. It achieves superior performance without knowledge distillation or high-order attention residuals, outperforming previous models.
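The core trick can be sketched directly: a second-order Taylor expansion of exp(q·k) factors into a feature map phi with phi(q)·phi(k) = 1 + q·k + (q·k)²/2, which makes attention linear in sequence length. The paper's fast approximation algorithm is not reproduced here; this is the naive feature map.

```python
import torch

def taylor_feature_map(x):
    """phi(x) = [1, x, vec(x x^T)/sqrt(2)] so that
    phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2, approximating exp(q.k).
    x: (..., d) -> (..., 1 + d + d*d)."""
    ones = torch.ones(*x.shape[:-1], 1, dtype=x.dtype, device=x.device)
    outer = (x.unsqueeze(-1) * x.unsqueeze(-2)).flatten(-2) / (2 ** 0.5)
    return torch.cat([ones, x, outer], dim=-1)

def quadratic_linear_attention(q, k, v):
    """Linear attention with the quadratic feature map. q, k, v: (B, N, d).
    Cost is linear in sequence length N (quadratic in head dim d)."""
    q, k = taylor_feature_map(q), taylor_feature_map(k)
    kv = torch.einsum("bnf,bnd->bfd", k, v)               # aggregate K-V once
    z = 1.0 / (torch.einsum("bnf,bf->bn", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnf,bfd,bn->bnd", q, kv, z)      # normalized output
```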

FDViT: Improve the Hierarchical Architecture of Vision Transformer (ICCV 2023)

FDViT employs a flexible downsampling layer to reduce feature map sizes smoothly. Combined with a masked auto-encoder for training, it decreases redundant calculations and information loss.
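A minimal sketch of the smooth-downsampling idea, assuming the flexible layer boils down to interpolation at a fractional scale followed by a channel projection (the paper's actual layer may be more elaborate):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlexibleDownsample(nn.Module):
    """Sketch: instead of halving the feature map (stride-2), interpolate to
    a non-integer scale (e.g. 0.7x) so resolution shrinks smoothly across
    stages. Scale and projection choices here are illustrative."""
    def __init__(self, in_dim, out_dim, scale=0.7):
        super().__init__()
        self.scale = scale
        self.proj = nn.Conv2d(in_dim, out_dim, kernel_size=1)

    def forward(self, x):  # x: (B, C, H, W)
        x = F.interpolate(x, scale_factor=self.scale, mode="bilinear",
                          align_corners=False)
        return self.proj(x)
```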

DoSSR: Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs (NeurIPS 2024)

A domain-shift, diffusion-based SR model that capitalizes on the generative power of pretrained diffusion models while significantly enhancing efficiency by initiating the diffusion process from low-resolution (LR) images.
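The starting-point shift can be illustrated in a few lines: noise the upsampled LR image to an intermediate timestep and denoise from there, rather than sampling from pure noise at t = T. DoSSR's actual domain-shift SDE differs; this conveys only the gist.

```python
import torch

def shifted_diffusion_start(lr_upsampled, alphas_cumprod, t_start):
    """Sketch of starting the reverse process from the LR image: noise it to
    an intermediate timestep t_start instead of sampling x_T ~ N(0, I).
    The pretrained denoiser then runs only for t = t_start, ..., 0."""
    a_bar = alphas_cumprod[t_start]
    noise = torch.randn_like(lr_upsampled)
    x_t = a_bar.sqrt() * lr_upsampled + (1 - a_bar).sqrt() * noise
    return x_t
```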

ReNeg: Learning Negative Embedding with Reward Guidance (CVPR 2025 Highlight)

A reward-guided approach that directly learns negative embeddings through gradient descent. The negative embeddings exhibit strong generalization and can be seamlessly adapted to T2I and T2V models.
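Schematically, the training loop might look like the sketch below, where generate_cfg and reward_model are hypothetical callables standing in for a differentiable sampler and a reward scorer; only the negative embedding receives gradients.

```python
import torch

def learn_negative_embedding(generate_cfg, reward_model, embed_dim,
                             steps=1000, lr=1e-3):
    """Sketch: learn a negative prompt embedding by gradient ascent on a
    reward model, with the generator frozen. `generate_cfg` (a differentiable
    CFG sampler) and `reward_model` are assumed callables, not a real API."""
    e_neg = torch.zeros(1, embed_dim, requires_grad=True)
    opt = torch.optim.Adam([e_neg], lr=lr)
    for _ in range(steps):
        img = generate_cfg(negative_embedding=e_neg)  # hypothetical sampler
        loss = -reward_model(img)                     # maximize reward
        opt.zero_grad()
        loss.backward()
        opt.step()
    return e_neg.detach()
```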

AMD-Hybrid: Towards Extremely Efficient Hybrid Models

Using an enhanced post-training approach based on intermediate layer distillation and optimized layer selection, our hybrid models dramatically reduce KV-cache requirements without compromising quality.

SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer (CVPR 2025)

A continuous image tokenizer that leverages soft categorical posteriors to aggregate multiple codewords into each latent token, increasing the representation capacity of the latent space.
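A minimal sketch of soft vector quantization, assuming the soft categorical posterior is a temperature-scaled softmax over codeword similarities (an illustration, not the paper's exact parameterization):

```python
import torch
import torch.nn as nn

class SoftVQ(nn.Module):
    """Sketch: each latent token is a softmax-weighted combination of
    codebook entries (a soft categorical posterior), rather than a single
    nearest codeword as in hard VQ."""
    def __init__(self, num_codes, dim, temperature=1.0):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))
        self.temperature = temperature

    def forward(self, z):                              # z: (B, N, dim)
        logits = z @ self.codebook.t()                 # similarity to codewords
        probs = torch.softmax(logits / self.temperature, dim=-1)
        return probs @ self.codebook                   # soft aggregation
```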

X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression

An efficient post-training approach to convert MHA models to multi-head latent attention (MLA) using knowledge distillation.
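The gist of the MLA target architecture can be sketched in a few lines: keys and values are re-expanded from a small shared latent, and only the latent is cached. Dimensions and module names here are illustrative.

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Sketch of multi-head latent attention's KV compression: cache a
    low-dimensional latent c instead of the full K and V tensors, and
    re-expand at attention time."""
    def __init__(self, dim, latent_dim):
        super().__init__()
        self.down = nn.Linear(dim, latent_dim)   # cached: (B, N, latent_dim)
        self.up_k = nn.Linear(latent_dim, dim)
        self.up_v = nn.Linear(latent_dim, dim)

    def forward(self, h):
        c = self.down(h)                         # the only tensor cached
        return self.up_k(c), self.up_v(c)
```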

Speculative Decoding

Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding (ICML 2025)

Gumiho combines serial and parallel heads for speculative decoding: early draft tokens are generated serially by a sophisticated Transformer head, while later ones are produced in parallel by lightweight MLP heads. Experiments show it outperforms existing methods.
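For context, the sketch below shows the generic greedy draft-then-verify step that methods like Gumiho build on; it does not reproduce Gumiho's serial Transformer and parallel MLP heads, and the model API (ids in, per-position logits out) is assumed.

```python
import torch

@torch.no_grad()
def speculative_verify(target_model, context, draft_tokens):
    """Greedy draft-then-verify: the target model scores the whole drafted
    continuation in one forward pass, keeps the longest matching prefix,
    and emits one token of its own (correction or bonus)."""
    seq = torch.cat([context, draft_tokens])
    logits = target_model(seq.unsqueeze(0)).squeeze(0)      # (len(seq), vocab)
    # Target's greedy prediction for each drafted position (offset by one).
    preds = logits[len(context) - 1 : -1].argmax(dim=-1)
    matches = (preds == draft_tokens).int()
    n = int(matches.cumprod(dim=0).sum())                   # accepted prefix
    if n < len(draft_tokens):
        next_tok = preds[n : n + 1]                         # correction token
    else:
        next_tok = logits[-1].argmax(dim=-1, keepdim=True)  # bonus token
    return torch.cat([context, draft_tokens[:n], next_tok])
```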

Beyond Text: Multimodal Speculative Decoding for Faster AI Inference

Multimodal speculative decoding enhances inference by parallelizing token prediction and verification across cross-modal contexts, achieving higher acceptance lengths and up to 3× speedups in structured visual-text interpretation tasks.

Footnotes
  1. MI200-094: Testing conducted internally by the AMD Research team as of December 2024 on an AMD Instinct MI250 accelerator, measuring the latency of AMD Hummingbird-0.9B, VideoLCM, AnimateLCM, Turbo-v1, Turbo-v2, and VideoCrafter2, all in FP16; results are an average of 5 test rounds.
    Test environment:
    OS:  Ubuntu 22.04 LTS
    CPU: AMD EPYC 73F3 CPU x1
    GPU: Instinct MI250 GPU x1
    GPU Driver: ROCm 6.1
    Python 3.8, PyTorch 2.2.0, and FlashAttention 2.2.0.
    Inference latency:
    VideoLCM = 2.35s
    AnimateLCM = 6.38s
    Turbo-v1 = 2.49s
    Turbo-v2 = 2.57s
    VideoCrafter2 = 44.16s
    Hummingbird-0.9B = 1.87s
    Performance may vary based on different hardware configurations, software versions and optimization.
  2. MI200-095:
    On average, a system configured with an AMD Instinct™ MI250X GPU shows that with Parallel Draft (PARD), the Llama3 series models achieve up to a 3.3× inference speedup. Testing done by AMD on 03/17/2025; results may vary based on configuration, usage, software version, and optimizations.

    SYSTEM CONFIGURATION
    System Model: Supermicro GPU A+ Server AS - 4124GQ-TNMI
    CPU: AMD EPYC 73F3 16-Core Processor (2 sockets, 16 cores per socket, 2 threads per core)
    NUMA Config: 2 NUMA nodes per socket
    Memory: 1024 GB (16 DIMMs, 3200 MT/s, 64 GiB/DIMM)
    Disk: Root drive + Data drive combined:
    2 x 894.3G SAMSUNG MZQL2960HCJR-00A07
    4 x 7T SAMSUNG MZQL27T6HBLA-00A07
    GPU: 4x AMD MI250X 128GB HBM2e 500W
    Host OS: Ubuntu 22.04.5 LTS 5.15.0-41-generic
    System BIOS: 2.5
    System BIOS Vendor: American Megatrends International, LLC.
    Host GPU Driver: ROCm™ 6.3.2
  3. MI200-096
    On average, a system configured with an AMD Instinct™ MI250X GPU shows that with Parallel Draft (PARD), the DeepSeek series models achieve up to a 2.3× inference speedup. Testing done by AMD on 03/17/2025; results may vary based on configuration, usage, software version, and optimizations.

    SYSTEM CONFIGURATION
    System Model: Supermicro GPU A+ Server AS - 4124GQ-TNMI
    CPU: AMD EPYC 73F3 16-Core Processor (2 sockets, 16 cores per socket, 2 threads per core)
    NUMA Config: 2 NUMA nodes per socket
    Memory: 1024 GB (16 DIMMs, 3200 MT/s, 64 GiB/DIMM)
    Disk: Root drive + Data drive combined:
    2 x 894.3G SAMSUNG MZQL2960HCJR-00A07
    4 x 7T SAMSUNG MZQL27T6HBLA-00A07
    GPU: 4x AMD MI250X 128GB HBM2e 500W
    Host OS: Ubuntu 22.04.5 LTS 5.15.0-41-generic
    System BIOS: 2.5
    System BIOS Vendor: American Megatrends International, LLC.
    Host GPU Driver: ROCm™ 6.3.2
  4. MI200-097
    On average, a system configured with an AMD Instinct™ MI250X GPU shows that with Parallel Draft (PARD), the Qwen model series benefits from a 4.87× inference speedup. Testing done by AMD on 03/17/2025; results may vary based on configuration, usage, software version, and optimizations.

    SYSTEM CONFIGURATION
    System Model: Supermicro GPU A+ Server AS - 4124GQ-TNMI
    CPU: AMD EPYC 73F3 16-Core Processor (2 sockets, 16 cores per socket, 2 threads per core)
    NUMA Config: 2 NUMA nodes per socket
    Memory: 1024 GB (16 DIMMs, 3200 MT/s, 64 GiB/DIMM)
    Disk: Root drive + Data drive combined:
    2 x 894.3G SAMSUNG MZQL2960HCJR-00A07
    4 x 7T SAMSUNG MZQL27T6HBLA-00A07
    GPU: 4x AMD MI250X 128GB HBM2e 500W
    Host OS: Ubuntu 22.04.5 LTS 5.15.0-41-generic
    System BIOS: 2.5
    System BIOS Vendor: American Megatrends International, LLC.
    Host GPU Driver: ROCm™ 6.3.2