Pushing the Boundaries of Foundation Model Training with AMD

AMD is committed to open-source AI, releasing everything behind our GenAI models: model weights, training configs, datasets, and code. Whether you're benchmarking, building, or contributing, you'll find everything you need to replicate, innovate, and scale with confidence.

Explore Models

AMD OLMo

Explore a series of fully open AMD OLMo models, trained entirely on Instinct™ MI250 GPUs and equipped with instruction-following and chat capabilities.

Instella-3B

Discover an open family of 3B-parameter language models trained from scratch on Instinct™ MI300X GPUs using ROCm™ software, delivering competitive performance against leading open-weight models.

AMD-135M

Meet the first AMD small language model with speculative decoding; it establishes an end-to-end workflow, encompassing both training and inference, on select AMD GPUs and AMD Ryzen™ AI processors.

Hummingbird-0.9B

Uncover an open-source text-to-video diffusion model that combines structural distillation and a novel data processing pipeline to deliver high-quality video generation.

AMD Nitro Diffusion

Explore two single-step diffusion models that showcase the performance of Instinct GPUs: they match the quality of full-step models while running efficiently on both data center and edge devices.

Instella-VL-1B

Dig deeper into a fully open-source, reproducible vision-language model for image understanding, trained on AMD Instinct MI300X GPUs.

Explore Publications

AI Agent

Agent Laboratory: Using LLM Agents as Research Assistants

An end-to-end autonomous research workflow designed to assist you, the human researcher, in implementing your research ideas.

MoEA: A Mixture-of-Experts Agent for Open-World Minecraft with Multimodal Expert Memory

An LLM-empowered agent that can complete various tasks in Minecraft automatically. It enhances adaptability and generalization by integrating online RL training with a multi-expert memory module. Experiments show that the AMD MoEA framework outperforms state-of-the-art methods on MineDojo tasks.

Model Compression

Quantization | Sparsity 

TernaryLLM: Ternarized Large Language Model

Dual Learnable Ternarization (DLT) and Outlier-Friendly Feature Knowledge Distillation (OFF) handle outliers in weights and activations, enabling TernaryLLM to outperform prior low-bit methods in text generation and zero-shot tasks.
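To make the idea concrete, here is a minimal PyTorch sketch of learnable ternarization in the spirit of DLT: weights map to three levels with learnable per-channel scales and train through a straight-through estimator. The threshold heuristic and the reading of "dual" as separate positive/negative scales are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class LearnableTernaryLinear(nn.Module):
    """Sketch of learnable ternarization (illustrative, not the paper's code)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # Learnable per-channel scales for positive and negative levels
        # (our reading of the "dual" in Dual Learnable Ternarization).
        self.alpha_pos = nn.Parameter(torch.ones(out_features, 1))
        self.alpha_neg = nn.Parameter(torch.ones(out_features, 1))

    def forward(self, x):
        w = self.weight
        # Threshold at a fraction of the mean magnitude (a common heuristic).
        delta = 0.7 * w.abs().mean(dim=1, keepdim=True)
        pos = (w > delta).float()
        neg = (w < -delta).float()
        ternary = pos * self.alpha_pos - neg * self.alpha_neg
        # Straight-through estimator: forward uses ternarized weights;
        # backward passes gradients to the latent full-precision weights
        # (and to the scales through `ternary`).
        w_q = ternary + w - w.detach()
        return x @ w_q.t()
```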

Efficient Architecture

Transformer | Diffusion | Hybrid 

Enhancing Vision Transformer: Amplifying Non-Linearity in Feedforward Network Module (ICML 2024)

An improved FFN (IFFN) module for vision transformers that uses the AGeLU activation function and multiple activation instances to enhance non-linearity, reducing hidden dimensions and computational cost.
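As a rough illustration, the sketch below assumes AGeLU is a GELU wrapped in learnable affine parameters and that the multiple activation instances are summed; both are our assumptions for illustration, not the paper's exact definitions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AGeLU(nn.Module):
    """Hypothetical stand-in for the paper's AGeLU: a GELU with learnable
    affine parameters. The paper's exact formulation may differ."""
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))
        self.shift = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return self.scale * F.gelu(x) + self.shift

class IFFN(nn.Module):
    """IFFN sketch: several activation instances add non-linearity so the
    hidden dimension (and FLOPs) can shrink versus a standard 4x FFN."""
    def __init__(self, dim, hidden_dim, num_instances=4):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.acts = nn.ModuleList([AGeLU() for _ in range(num_instances)])
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        h = self.fc1(x)
        h = sum(act(h) for act in self.acts)  # summation is our simplification
        return self.fc2(h)
```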

QT-ViT: Improving Linear Attention in ViT with Quadratic Taylor Expansion (NeurIPS 2024)

QT-ViT replaces softmax-based attention with a second-order Taylor expansion, accelerating it via a fast approximation algorithm. It achieves superior performance without knowledge distillation or high-order attention residuals, outperforming previous models.
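The core trick can be sketched directly: a second-order Taylor expansion of exp(q·k) factors into a feature map phi with phi(q)·phi(k) = 1 + q·k + (q·k)²/2, which makes attention linear in sequence length. The paper's fast approximation algorithm is not reproduced here; this is the naive feature map.

```python
import torch

def taylor_feature_map(x):
    """phi(x) = [1, x, vec(x x^T)/sqrt(2)] so that
    phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2, approximating exp(q.k).
    x: (..., d) -> (..., 1 + d + d*d)."""
    ones = torch.ones(*x.shape[:-1], 1, dtype=x.dtype, device=x.device)
    outer = (x.unsqueeze(-1) * x.unsqueeze(-2)).flatten(-2) / (2 ** 0.5)
    return torch.cat([ones, x, outer], dim=-1)

def quadratic_linear_attention(q, k, v):
    """Linear attention with the quadratic feature map. q, k, v: (B, N, d).
    Cost is linear in sequence length N (quadratic in head dim d)."""
    q, k = taylor_feature_map(q), taylor_feature_map(k)
    kv = torch.einsum("bnf,bnd->bfd", k, v)               # aggregate K-V once
    z = 1.0 / (torch.einsum("bnf,bf->bn", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnf,bfd,bn->bnd", q, kv, z)      # normalized output
```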

FDViT: Improve the Hierarchical Architecture of Vision Transformer (ICCV 2023)

FDViT employs a flexible downsampling layer to reduce feature map sizes smoothly. Combined with a masked auto-encoder for training, it decreases redundant calculations and information loss.
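A minimal sketch of the smooth-downsampling idea, assuming the flexible layer boils down to interpolation at a fractional scale followed by a channel projection (the paper's actual layer may be more elaborate):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlexibleDownsample(nn.Module):
    """Sketch: instead of halving the feature map (stride-2), interpolate to
    a non-integer scale (e.g. 0.7x) so resolution shrinks smoothly across
    stages. Scale and projection choices here are illustrative."""
    def __init__(self, in_dim, out_dim, scale=0.7):
        super().__init__()
        self.scale = scale
        self.proj = nn.Conv2d(in_dim, out_dim, kernel_size=1)

    def forward(self, x):  # x: (B, C, H, W)
        x = F.interpolate(x, scale_factor=self.scale, mode="bilinear",
                          align_corners=False)
        return self.proj(x)
```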

DoSSR: Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs (NeurIPS 2024)

A domain-shift, diffusion-based SR model that capitalizes on the generative power of pretrained diffusion models while significantly enhancing efficiency by initiating the diffusion process from low-resolution (LR) images.
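The starting-point shift can be illustrated in a few lines: noise the upsampled LR image to an intermediate timestep and denoise from there, rather than sampling from pure noise at t = T. DoSSR's actual domain-shift SDE differs; this conveys only the gist.

```python
import torch

def shifted_diffusion_start(lr_upsampled, alphas_cumprod, t_start):
    """Sketch of starting the reverse process from the LR image: noise it to
    an intermediate timestep t_start instead of sampling x_T ~ N(0, I).
    The pretrained denoiser then runs only for t = t_start, ..., 0."""
    a_bar = alphas_cumprod[t_start]
    noise = torch.randn_like(lr_upsampled)
    x_t = a_bar.sqrt() * lr_upsampled + (1 - a_bar).sqrt() * noise
    return x_t
```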

ReNeg: Learning Negative Embedding with Reward Guidance (CVPR 2025 Highlight)

A reward-guided approach that directly learns negative embeddings through gradient descent. The negative embeddings exhibit strong generalization and can be seamlessly adapted to T2I and T2V models.
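Schematically, the training loop might look like the sketch below, where generate_cfg and reward_model are hypothetical callables standing in for a differentiable sampler and a reward scorer; only the negative embedding receives gradients.

```python
import torch

def learn_negative_embedding(generate_cfg, reward_model, embed_dim,
                             steps=1000, lr=1e-3):
    """Sketch: learn a negative prompt embedding by gradient ascent on a
    reward model, with the generator frozen. `generate_cfg` (a differentiable
    CFG sampler) and `reward_model` are assumed callables, not a real API."""
    e_neg = torch.zeros(1, embed_dim, requires_grad=True)
    opt = torch.optim.Adam([e_neg], lr=lr)
    for _ in range(steps):
        img = generate_cfg(negative_embedding=e_neg)  # hypothetical sampler
        loss = -reward_model(img)                     # maximize reward
        opt.zero_grad()
        loss.backward()
        opt.step()
    return e_neg.detach()
```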

AMD-Hybrid: Towards Extremely Efficient Hybrid Models

Using an enhanced post-training approach based on intermediate layer distillation and optimized layer selection, our hybrid models dramatically reduce KV-cache requirements without compromising quality.

SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer (CVPR 2025)

A continuous image tokenizer that leverages soft categorical posteriors to aggregate multiple codewords into each latent token, increasing the representation capacity of the latent space.
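A minimal sketch of soft vector quantization, assuming the soft categorical posterior is a temperature-scaled softmax over codeword similarities (an illustration, not the paper's exact parameterization):

```python
import torch
import torch.nn as nn

class SoftVQ(nn.Module):
    """Sketch: each latent token is a softmax-weighted combination of
    codebook entries (a soft categorical posterior), rather than a single
    nearest codeword as in hard VQ."""
    def __init__(self, num_codes, dim, temperature=1.0):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))
        self.temperature = temperature

    def forward(self, z):                              # z: (B, N, dim)
        logits = z @ self.codebook.t()                 # similarity to codewords
        probs = torch.softmax(logits / self.temperature, dim=-1)
        return probs @ self.codebook                   # soft aggregation
```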

X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression

An efficient post-training approach to convert MHA models to multi-head latent attention (MLA) using knowledge distillation.
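The gist of the MLA target architecture can be sketched in a few lines: keys and values are re-expanded from a small shared latent, and only the latent is cached. Dimensions and module names here are illustrative.

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Sketch of multi-head latent attention's KV compression: cache a
    low-dimensional latent c instead of the full K and V tensors, and
    re-expand at attention time."""
    def __init__(self, dim, latent_dim):
        super().__init__()
        self.down = nn.Linear(dim, latent_dim)   # cached: (B, N, latent_dim)
        self.up_k = nn.Linear(latent_dim, dim)
        self.up_v = nn.Linear(latent_dim, dim)

    def forward(self, h):
        c = self.down(h)                         # the only tensor cached
        return self.up_k(c), self.up_v(c)
```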

Speculative Decoding

Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding (ICML 2025)

Gumiho combines serial and parallel heads for speculative decoding: early draft tokens are generated serially by a sophisticated Transformer head, while later ones are produced in parallel by lightweight MLP heads. Experiments show it outperforms existing methods.
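For context, the sketch below shows the generic greedy draft-then-verify step that methods like Gumiho build on; it does not reproduce Gumiho's serial Transformer and parallel MLP heads, and the model API (ids in, per-position logits out) is assumed.

```python
import torch

@torch.no_grad()
def speculative_verify(target_model, context, draft_tokens):
    """Greedy draft-then-verify: the target model scores the whole drafted
    continuation in one forward pass, keeps the longest matching prefix,
    and emits one token of its own (correction or bonus)."""
    seq = torch.cat([context, draft_tokens])
    logits = target_model(seq.unsqueeze(0)).squeeze(0)      # (len(seq), vocab)
    # Target's greedy prediction for each drafted position (offset by one).
    preds = logits[len(context) - 1 : -1].argmax(dim=-1)
    matches = (preds == draft_tokens).int()
    n = int(matches.cumprod(dim=0).sum())                   # accepted prefix
    if n < len(draft_tokens):
        next_tok = preds[n : n + 1]                         # correction token
    else:
        next_tok = logits[-1].argmax(dim=-1, keepdim=True)  # bonus token
    return torch.cat([context, draft_tokens[:n], next_tok])
```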

Beyond Text: Multimodal Speculative Decoding for Faster AI Inference

Multimodal speculative decoding enhances inference by parallelizing token prediction and verification across cross-modal contexts, achieving higher acceptance lengths and up to 3× speedups in structured visual-text interpretation tasks.

Footnotes
  1. MI200-094: Testing conducted internally by the AMD Research team as of December 2024 on an AMD Instinct MI250 accelerator, measuring the latency of AMD Hummingbird-0.9B, VideoLCM, AnimateLCM, Turbo-v1, Turbo-v2, and VideoCrafter2, all in FP16; results are an average of 5 test rounds.
    Test environment:
    OS:  Ubuntu 22.04 LTS
    CPU: AMD EPYC 73F3 CPU x1
    GPU: Instinct MI250 GPU x1
    GPU Driver: ROCm 6.1
    Python 3.8, PyTorch 2.2.0, and FlashAttention 2.2.0.
    Inference latency:
    VideoLCM = 2.35s
    AnimateLCM = 6.38s
    Turbo-v1 = 2.49s
    Turbo-v2 = 2.57s
    VideoCrafter2 = 44.16s
    Hummingbird-0.9B = 1.87s
    Performance may vary based on different hardware configurations, software versions and optimization.
  2. MI200-095:
    On average, a system configured with an AMD Instinct™ MI250X GPU shows that with Parallel Draft (PARD), the Llama3 series models achieve up to a 3.3× inference speedup. Testing done by AMD on 03/17/2025; results may vary based on configuration, usage, software version, and optimizations.

    SYSTEM CONFIGURATION
    System Model: Supermicro GPU A+ Server AS - 4124GQ-TNMI
    CPU: AMD EPYC 73F3 16-Core Processor (2 sockets, 16 cores per socket, 2 threads per core)
    NUMA Config: 2 NUMA nodes per socket
    Memory: 1024 GB (16 DIMMs, 3200 MT/s, 64 GiB/DIMM)
    Disk: Root drive + Data drive combined:
    2 x 894.3G SAMSUNG MZQL2960HCJR-00A07
    4 x 7T SAMSUNG MZQL27T6HBLA-00A07
    GPU: 4x AMD MI250X 128GB HBM2e 500W
    Host OS: Ubuntu 22.04.5 LTS 5.15.0-41-generic
    System BIOS: 2.5
    System BIOS Vendor: American Megatrends International, LLC.
    Host GPU Driver: ROCm™ 6.3.2
  3. MI200-096
    On average, a system configured with an AMD Instinct™ MI250X GPU shows that with Parallel Draft (PARD), the DeepSeek series models achieve up to a 2.3× inference speedup. Testing done by AMD on 03/17/2025; results may vary based on configuration, usage, software version, and optimizations.

    SYSTEM CONFIGURATION
    System Model: Supermicro GPU A+ Server AS - 4124GQ-TNMI
    CPU: AMD EPYC 73F3 16-Core Processor (2 sockets, 16 cores per socket, 2 threads per core)
    NUMA Config: 2 NUMA nodes per socket
    Memory: 1024 GB (16 DIMMs, 3200 MT/s, 64 GiB/DIMM)
    Disk: Root drive + Data drive combined:
    2 x 894.3G SAMSUNG MZQL2960HCJR-00A07
    4 x 7T SAMSUNG MZQL27T6HBLA-00A07
    GPU: 4x AMD MI250X 128GB HBM2e 500W
    Host OS: Ubuntu 22.04.5 LTS 5.15.0-41-generic
    System BIOS: 2.5
    System BIOS Vendor: American Megatrends International, LLC.
    Host GPU Driver: ROCm™ 6.3.2
  4. MI200-097
    On average, a system configured with an AMD Instinct™ MI250X GPU shows that with Parallel Draft (PARD), the Qwen model series benefits from a 4.87× inference speedup. Testing done by AMD on 03/17/2025; results may vary based on configuration, usage, software version, and optimizations.

    SYSTEM CONFIGURATION
    System Model: Supermicro GPU A+ Server AS - 4124GQ-TNMI
    CPU: AMD EPYC 73F3 16-Core Processor (2 sockets, 16 cores per socket, 2 threads per core)
    NUMA Config: 2 NUMA nodes per socket
    Memory: 1024 GB (16 DIMMs, 3200 MT/s, 64 GiB/DIMM)
    Disk: Root drive + Data drive combined:
    2 x 894.3G SAMSUNG MZQL2960HCJR-00A07
    4 x 7T SAMSUNG MZQL27T6HBLA-00A07
    GPU: 4x AMD MI250X 128GB HBM2e 500W
    Host OS: Ubuntu 22.04.5 LTS 5.15.0-41-generic
    System BIOS: 2.5
    System BIOS Vendor: American Megatrends International, LLC.
    Host GPU Driver: ROCm™ 6.3.2