Zyphra Demonstrates Large Scale Training on AMD GPUs & Networking with ZAYA1

Nov 22, 2025


We are celebrating a major milestone in AI infrastructure and model development: Zyphra successfully trained ZAYA1-base, the first large-scale Mixture-of-Experts (MoE) foundation model trained entirely on an AMD cluster comprised of AMD Instinct™ GPUs and AMD Pensando™ Pollara AI NICs.

This effort proves that AMD Instinct MI300X GPUs, coupled with AMD Pensando Pollara 400 networking and the ROCm software stack, now form a viable, high-performance, and production-ready alternative platform for frontier-scale AI training.

The throughput achieved and the resulting ZAYA1-base model (760 million active, 8.3 billion total parameters) demonstrate the power of optimizing software and architecture around AMD silicon.

An AI Partnership 

Success at the frontier of AI requires robust, integrated infrastructure. The ZAYA1 training run was a joint effort between Zyphra, AMD, and IBM Cloud.

Zyphra collaborated closely with AMD and IBM to design and deploy a large-scale training cluster. This jointly engineered cluster, powered by AMD Instinct MI300X GPUs and utilizing IBM Cloud’s high-performance networking fabric, delivered over 750 PFLOPs[1] of Max Achievable FLOPS in training performance.

Zyphra was deeply committed to demonstrating end-to-end large-scale pretraining on the combined AMD platform. This systematic approach culminated in a technical report on the ZAYA1 training run that provides practical guidance for training on AMD hardware.

The Platform Advantage: Instinct GPUs & Pensando Networking

The MI300X GPU’s large memory capacity (192 GB HBM) enabled Zyphra to pretrain ZAYA1-base primarily with a simple parallelism strategy: data parallelism with the ZeRO-1 distributed optimizer.
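
For readers unfamiliar with this setup, here is a minimal sketch of ZeRO-1-style data parallelism using PyTorch's stock ZeroRedundancyOptimizer, which shards optimizer state across data-parallel ranks. AdamW stands in for Muon purely for illustration (Muon is not part of core PyTorch), and the model dimensions are hypothetical, not ZAYA1's.

```python
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

# On ROCm, the "nccl" backend name transparently maps to RCCL.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Hypothetical toy model; ZAYA1's actual architecture is far larger.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 16384), torch.nn.GELU(), torch.nn.Linear(16384, 4096)
).cuda()
model = DDP(model)

# ZeRO-1: gradients are still all-reduced as in plain DDP, but each rank
# owns only a 1/world_size shard of the optimizer state.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.AdamW,  # stand-in for Muon, for illustration only
    lr=3e-4,
)
```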

Crucially, the cluster leveraged eight dedicated AMD Pensando Pollara 400 Gbps NICs per node. Connected via a rails-only topology, this high-bandwidth setup delivers a full 3.2 Tbps of per-node bandwidth (8 × 400 Gbps).

A major technical contribution of this work was validating the networking performance:

  1. First Systematic Benchmarks: Zyphra delivered the first systematic collective-communication and memory-bandwidth microbenchmarks at this scale on the AMD stack, specifically characterizing the AMD Pensando Pollara 400 programmable networking hardware for essential collectives (e.g., all_reduce, reduce_scatter, all_gather).

  2. Networking Guidance: Zyphra derived and validated detailed insights for maximizing network performance in real-world large-scale pretraining workloads, such as optimal gradient-buffer sizes for saturating the fabric (a minimal benchmark sketch follows this list).
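
Zyphra's exact harness is not public; the sketch below shows the general shape of such an all_reduce microbenchmark in PyTorch, reporting the standard ring "bus bandwidth" figure. Message sizes and iteration counts here are illustrative, not Zyphra's methodology.

```python
import time
import torch
import torch.distributed as dist

def bench_all_reduce(num_elems: int, iters: int = 20, warmup: int = 5) -> float:
    """Return achieved bus bandwidth in GB/s for a bf16 all_reduce."""
    x = torch.ones(num_elems, dtype=torch.bfloat16, device="cuda")
    for _ in range(warmup):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - t0) / iters
    n = dist.get_world_size()
    # Ring all_reduce moves 2*(n-1)/n bytes over the wire per byte of payload.
    bus_bytes = 2 * (n - 1) / n * x.numel() * x.element_size()
    return bus_bytes / elapsed / 1e9

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")  # RCCL on ROCm
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    for size in (1 << 20, 1 << 24, 1 << 28):  # sweep message sizes
        bw = bench_all_reduce(size)
        if dist.get_rank() == 0:
            print(f"{size * 2 / 1e6:8.1f} MB -> {bw:7.1f} GB/s")
```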
Achieving Competitive Throughput with ROCm Optimization

Zyphra built a highly optimized, fault-tolerant, and robust training stack specialized for AMD pretraining, integrating several optimized components from AMD’s training software stack, such as Primus, AITER, and RCCL.

Custom HIP Kernels for Peak Efficiency

To achieve competitive throughput, Zyphra developed critical custom kernels primarily written in HIP (Heterogeneous-Compute Interface for Portability):

  1. Optimized Muon Optimizer Kernels: Muon, the optimizer used for ZAYA1-base training, requires compute-heavy Newton–Schulz (NS) iterations. Zyphra’s custom work included: 
          ●  Implementing fused HIP kernels for multi-tensor momentum and weight updates.
          ●  Developing a specialized symmetric matrix multiplication kernel for the Gram matrix calculations in the NS iterations. Because the Gram matrix is symmetric, the kernel computes only one triangle of the output, eliminating roughly half of the multiply–accumulate work and halving HBM writes for off-diagonal tiles; this makes the NS phase bandwidth-friendly and significantly reduces the optimizer's overhead.

  2. Fused LayerNorm/RMSNorm: Zyphra developed an optimized, fused HIP kernel that combines the residual add, statistics, normalization, and affine steps into a single pass, addressing the subpar performance encountered with naive transpilation. Reference-level sketches of both techniques appear after this list.
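
Neither HIP kernel is public, but both patterns are easy to sketch at the PyTorch level. The first function follows the widely used open-source Muon Newton–Schulz iteration (coefficients from the public Muon reference implementation, not necessarily Zyphra's); the `X @ X.T` product is the symmetric Gram matrix that Zyphra's kernel computes triangle-only. The second is an unfused reference for the normalization pattern the fused kernel performs in one pass.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize the update matrix, as in Muon.

    Coefficients follow the public Muon reference implementation.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16() / (G.norm() + 1e-7)
    transposed = G.size(0) > G.size(1)
    if transposed:  # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T            # symmetric Gram matrix: a triangle-only kernel
        B = b * A + c * A @ A  # halves the MACs and off-diagonal HBM writes here
        X = a * X + B @ X
    return X.T if transposed else X

def fused_rms_norm_reference(x, residual, weight, eps: float = 1e-6):
    """Unfused reference for the single-pass residual-add + RMSNorm kernel."""
    h = x + residual                                             # residual add
    inv_rms = h.pow(2).mean(-1, keepdim=True).add(eps).rsqrt()   # statistics
    return h * inv_rms * weight, h                               # normalize + affine
```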

System Design and Robustness

For stable, long-term training, Zyphra developed two core infrastructure components:

  1. Aegis Fault Tolerance System: This in-house system minimizes downtime during long runs. Aegis automatically identifies and mitigates common hardware, networking, or software failures. It is even capable of automatically selecting a new node and restarting the run if a GPU ECC failure is detected.

  2. Distributed Checkpointing: Zyphra’s asynchronous and distributed scheme reduces checkpoint time by more than 10x compared to baseline approaches by having each data-parallel rank write its own optimizer shard asynchronously. This system, combined with checkpoint-reshaping utilities, enables robust recovery even after arbitrary node failures; a minimal sketch follows below.
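
Zyphra's checkpointing code is not public; the sketch below captures the core idea under stated assumptions: each data-parallel rank snapshots its own optimizer shard to host memory, then a background thread performs the write so training iterations are not blocked. The helper names (`_to_cpu`, `save_optimizer_shard_async`) and paths are hypothetical.

```python
import os
import threading
import torch
import torch.distributed as dist

def _to_cpu(obj):
    """Recursively snapshot tensors to host memory (hypothetical helper)."""
    if torch.is_tensor(obj):
        return obj.detach().cpu()
    if isinstance(obj, dict):
        return {k: _to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_to_cpu(v) for v in obj)
    return obj

def save_optimizer_shard_async(optimizer, step: int, out_dir: str = "checkpoints"):
    """Each DP rank writes only its own shard; the file write runs off-thread."""
    os.makedirs(out_dir, exist_ok=True)
    rank = dist.get_rank()
    snapshot = _to_cpu(optimizer.state_dict())  # cheap device->host copy, taken in-line
    path = os.path.join(out_dir, f"step{step:07d}-rank{rank:04d}-optim.pt")
    writer = threading.Thread(target=torch.save, args=(snapshot, path), daemon=True)
    writer.start()
    return writer  # join() before the next checkpoint to bound host memory
```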

MI300X-Aware Model Sizing

To further improve performance, Zyphra applied specific sizing recommendations based on the MI300X's compute-memory balance. This included ensuring that the vocabulary size is divisible by 64, and that the product of microbatch size and sequence length, as well as the attention head dimensions, are divisible by large powers of two (up to 64). Zyphra performed static tuning using PyTorch TunableOp and TransformerEngine to map GEMM sizes to the most performant algorithms within rocBLAS and hipBLASLt.
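
A minimal sketch of these divisibility checks and of enabling TunableOp is shown below; the concrete dimensions are hypothetical, chosen only to illustrate the padding rule.

```python
import os

def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up so the resulting GEMM dimensions tile cleanly on MI300X."""
    return ((n + multiple - 1) // multiple) * multiple

# Hypothetical raw sizes, padded per the recommendations above.
vocab_size = pad_to_multiple(100_003)   # -> 100_032, divisible by 64
head_dim   = pad_to_multiple(96)        # -> 128, a large power of two
micro_batch, seq_len = 2, 4096
assert (micro_batch * seq_len) % 64 == 0

# PyTorch TunableOp: search once offline, then replay the tuned GEMM choices.
# Set these before torch is imported in the training process.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"       # "0" for replay-only runs
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"
```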

ZAYA1 Performance Validation

The resulting model, ZAYA1-base, validates the architectural innovations co-designed by Zyphra around AMD silicon.

ZAYA1-base utilizes a novel ‘MoE++’ recipe which includes:

  1. Compressed Convolutional Attention (CCA): This performs sequence mixing in a compressed latent space, achieving significant compute savings and dramatically reducing the KV-cache size. For ZAYA1-base, the combined CCGQA method (CCA paired with grouped-query attention) achieved an 8x compression of the KV cache versus full multi-head attention, one of the highest ratios achieved at this scale, which makes long-context training (up to 32k context length) significantly easier. A worked KV-cache size example follows this list.

  2. ZAYA1 Router: This more expressive router, which replaces the standard linear gate with an MLP, promotes superior expert specialization and enables successful training with a top-k of 1 (without residual experts), diverging from emerging MoE standards to improve efficiency.
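
To make the 8x figure concrete, the sketch below computes the KV-cache footprint of a full multi-head-attention baseline and divides by the compression factor. All dimensions here are hypothetical, chosen only to illustrate the arithmetic, not ZAYA1's actual configuration.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Bytes needed to cache K and V for every layer (factor 2 = K plus V)."""
    return 2 * layers * batch * heads * seq_len * head_dim * dtype_bytes

# Hypothetical full-MHA baseline at 32k context in bf16.
full = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=32_768, batch=1)
compressed = full / 8  # the reported 8x CCGQA KV-cache compression

print(f"full MHA : {full / 2**30:.1f} GiB per sequence")       # 16.0 GiB
print(f"CCGQA    : {compressed / 2**30:.1f} GiB per sequence")  # 2.0 GiB
```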

Competitive Benchmarks

ZAYA1-base demonstrates extremely competitive performance against leading open-source models. Despite operating at a fraction of their active parameter counts, ZAYA1-base outperforms established models across challenging benchmarks:

Zyphra Pre-Training Performance: ZAYA1-base (8.3B total / 760M active parameters)

  Outperforms       Llama-3-8B and OLMoE-1B-7B across reasoning, mathematics, and coding benchmarks
  Exceeds           Gemma3-12B on challenging mathematics and coding benchmarks
  Competitive with  State-of-the-art models like Qwen3-4B (a 4B dense model)

ZAYA1-base excels particularly at complex mathematical and STEM reasoning tasks. Zyphra’s reasoning-focused checkpoint demonstrated strong performance that approached state-of-the-art reasoning models like Qwen3-4B-Thinking, even before explicit instruction-tuning (SFT/RL).

Conclusion: AMD is Mature for Frontier Training

Zyphra’s work confirms that AMD’s GPUs, networking, and software stack are sufficiently mature and robust to support large-scale LLM pretraining.

Zyphra's experience, from micro-benchmarking Pollara networking to developing custom HIP kernels for the Muon optimizer, demonstrates that the AMD ecosystem represents a viable and competitive alternative for teams looking to push the boundaries of large-scale AI.

We look forward to supporting Zyphra as they continue to scale their efforts, utilizing AMD GPUs, networking, and ROCm software for future breakthroughs in agentic capabilities, long-term memory, and continual learning. 


[1] Testing by Zyphra as of November 14, 2025, measuring the aggregate throughput of training iterations across the full Zyphra cluster measured in quadrillion floating point operations per second (PFLOPs). The workload was training a model comprised of a set of subsequent MLPs in BFLOAT16 across the full cluster of (128) compute nodes, each containing (8) AMD Instinct™ MI300X GPUs and (8) AMD Pensando™ Pollara 400 Interconnects running a proprietary training stack created by Zyphra. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of the latest drivers and optimizations. This benchmark was collected with AMD ROCm 6.4.
