Transforming AI Networks with AMD Pensando™ Pollara 400

Oct 10, 2024


The advent of generative AI and large language models (LLMs) has created unprecedented challenges for traditional Ethernet networks in AI clusters. These advanced AI/ML models demand intense communication capabilities, including tightly coupled parallel processing, rapid data transfers, and low-latency communication - requirements that conventional Ethernet, designed for general-purpose computing, has historically struggled to meet. Despite these challenges, Ethernet remains the preferred choice for network technology in AI clusters due to its widespread adoption and abundance of operational expertise. However, the limitations of traditional Ethernet in supporting specialized AI workloads have become increasingly apparent.

The AMD Pensando™ Pollara 400 emerges as a significant advancement in AI networking, specifically designed to address these issues. The Pollara 400 optimizes performance to meet the requirements of modern AI environments while allowing customers to leverage familiar Ethernet-based fabrics. It effectively bridges the gap between Ethernet's broad compatibility and the specialized demands of AI workloads, combining the best of both worlds. By addressing the specific communication needs of AI/ML models, the Pollara 400 enables organizations to fully harness the potential of their AI workloads without sacrificing the benefits of Ethernet infrastructure. This approach represents a crucial step forward in adapting networking technology to the evolving landscape of AI computing.

Agile and Efficient Distribution in an Open Environment

What is AMD Pensando™ Pollara 400? The Pollara 400 is a specialized network accelerator explicitly designed to optimize data transfer within back-end AI networks for GPU-to-GPU communication. It delivers a fully programmable 400 Gigabit per second (Gbps) RDMA Ethernet Network Interface Card (NIC), enhancing the efficiency of AI workloads. Traditional data center Ethernet, which typically focuses on providing services such as user access, segmentation, and multi-tenancy, often falls short of meeting the demanding requirements of modern AI workloads. These workloads demand high bandwidth, low latency, and efficient communication patterns not prioritized in conventional Ethernet designs. Addressing them requires a network that supports distributed computing over multiple GPU nodes with low jitter and RDMA. The Pollara 400 is designed to manage the unique communication patterns of AI workloads, offering high throughput across all available links along with congestion avoidance, reduced tail latency, scalable performance, and fast job completion times. Additionally, it provides an open environment that doesn't limit customers to specific vendors, giving them more flexibility.

Standout Capabilities

P4 Programmability:The P4 programmable architecture allows the Pollara 400 to be versatile, enabling it to introduce innovations today while remaining adaptable to evolving standards in the future, such as those set by the Ultra Ethernet Consortium (UEC). This programmability ensures that the AMD Pensando™ Pollara 400 can adapt to new protocols and requirements, future-proofing AI infrastructure investments. By leveraging P4, AMD enables customers to customize network behavior, implement bespoke RDMA transports, and optimize performance for specific AI workloads, all while maintaining compatibility with future industry standards.

Multipathing & Intelligent Packet Spraying: The Pollara 400 supports advanced adaptive packet spraying, which is crucial for managing AI models' high-bandwidth and low-latency requirements. This technology fully utilizes available bandwidth, particularly in CLOS fabric architectures, resulting in fast message completion times and lower tail latency. The Pollara 400 integrates seamlessly with AMD Instinct™ accelerator and AMD EPYC™ CPU infrastructure, providing reliable, high-speed connectivity for GPU-to-GPU RDMA communication. By intelligently spraying the packets of a queue pair (QP) across multiple paths, it minimizes the chance of creating hot spots and congestion in AI networks, ensuring optimal performance. The Pollara 400 allows customers to choose their preferred Ethernet switching vendor, whether a lossy or lossless implementation. Importantly, the Pollara 400 drastically reduces network configuration and operational complexity by eliminating the requirement for a lossless network. This flexibility and efficiency make the Pollara 400 a powerful solution for enhancing AI workload performance and network reliability.
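To make the idea concrete, here is a minimal sketch of spraying the packets of one QP across several equal-cost paths. The load-tracking heuristic and path count are illustrative assumptions, not AMD's actual hardware algorithm:

```python
class PacketSprayer:
    """Illustrative sketch: spray the packets of one queue pair (QP)
    across multiple equal-cost fabric paths, always preferring the
    least-loaded path so no single link becomes a hot spot."""

    def __init__(self, num_paths):
        # Per-path load estimate; a real NIC would feed this from telemetry.
        self.load = [0.0] * num_paths

    def pick_path(self):
        # Choose the currently least-loaded path for the next packet.
        return min(range(len(self.load)), key=lambda p: self.load[p])

    def spray(self, packets):
        assignments = []
        for seq, _payload in enumerate(packets):
            path = self.pick_path()
            assignments.append((seq, path))
            self.load[path] += 1.0   # naive proxy for in-flight bytes
        return assignments

sprayer = PacketSprayer(num_paths=4)
plan = sprayer.spray(["pkt"] * 8)
# Eight packets spread evenly: two per path, no hot spot.
```

The key point the sketch illustrates is that spraying happens per packet rather than per flow, which is why the receiver must tolerate out-of-order arrival, as described next.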

In-Order Message Delivery: The Pollara 400 offers advanced capabilities for handling out-of-order packet arrivals, a frequent occurrence with multipathing and packet-spraying techniques. The receiving Pollara 400 can efficiently process data packets that arrive in a different sequence than originally transmitted, placing them directly into GPU memory without delay. By managing this complexity at the NIC level, the system maintains high performance and data integrity without placing an additional burden on the GPU. This intelligent packet handling contributes to reduced latency and improved overall system efficiency.
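The mechanism behind direct placement can be sketched simply: if each packet carries the offset of its data within the destination buffer, the receiver can write it straight into memory in whatever order it arrives, with no reorder queue. The field names below are illustrative, not the wire format:

```python
def place_packets(packets, buffer_size):
    """Sketch of direct data placement: each packet carries its target
    offset, so the receiver writes it into (GPU) memory immediately,
    regardless of arrival order. No reorder buffer is needed."""
    memory = bytearray(buffer_size)
    for pkt in packets:
        off, data = pkt["offset"], pkt["data"]
        memory[off:off + len(data)] = data   # write at the packet's own offset
    return bytes(memory)

# Packets arriving out of order still land in the right place.
arrivals = [
    {"offset": 4, "data": b"WORL"},   # middle chunk arrives first
    {"offset": 0, "data": b"HELL"},
    {"offset": 8, "data": b"D!"},
]
result = place_packets(arrivals, 10)
```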

Fast Loss Recovery with Selective Retransmission: The Pollara 400 enhances network performance through in-order message delivery and selective acknowledgment (SACK) retransmission. Unlike RoCEv2's Go-back-N mechanism, which resends all packets from the point of failure, SACK allows the Pollara 400 to identify and retransmit only lost or corrupted packets. This targeted approach optimizes bandwidth utilization, reduces latency in packet-loss recovery, and minimizes redundant data transmission. By combining efficient in-order delivery with SACK retransmission, the AMD Pensando™ Pollara 400 enables smooth data flow and optimal resource utilization. These features result in faster job completion times, lower tail latencies, and more efficient bandwidth use, making it ideal for demanding AI networks and large-scale machine learning operations.
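The difference between the two recovery strategies is easy to see in a small sketch. With ten packets in flight and two dropped, Go-back-N resends everything from the first loss onward, while selective retransmission resends only the two missing packets:

```python
def go_back_n_retransmit(sent, lost):
    """Go-back-N (RoCEv2 style): resend every packet from the
    first loss onward, even ones that arrived successfully."""
    first = min(lost)
    return [s for s in sent if s >= first]

def sack_retransmit(sent, lost):
    """Selective retransmission: resend only the packets
    the receiver's SACK reported as missing."""
    return [s for s in sent if s in lost]

sent = list(range(10))   # sequence numbers 0..9 in flight
lost = {3, 7}            # two packets dropped in the fabric
# Go-back-N resends 7 packets (3..9); SACK resends just the 2 lost ones.
```

At 400 Gbps, with large in-flight windows, that gap in redundant retransmission translates directly into the bandwidth and tail-latency savings described above.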

Path-Aware Congestion Control: The Pollara 400 employs real-time telemetry and network-aware algorithms to effectively manage network congestion, including incast scenarios. Unlike RoCEv2, which relies on PFC and ECN in a lossless network, the AMD UEC-ready RDMA transport offers a more sophisticated approach:

  • Maintains per-path congestion status
  • Dynamically avoids congested paths using adaptive packet-spraying
  • Sustains near wire-rate performance during transient congestion
  • Optimizes packet flow across multiple paths without requiring PFC
  • Implements per-flow congestion control to prevent interference between data flows
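A minimal sketch of the per-path congestion state described above: each returning ACK carries a congestion signal (e.g. an ECN-style mark) that updates a per-path estimate, and paths whose estimate crosses a threshold drop out of the spraying candidate set. The EWMA weight and threshold are hypothetical values, not AMD's actual algorithm:

```python
class PathAwareCC:
    """Sketch of per-path congestion tracking, assuming each ACK
    returns a boolean congestion mark for the path it traveled.
    Threshold and EWMA weight are illustrative assumptions."""

    def __init__(self, num_paths, threshold=0.5, alpha=0.2):
        self.score = [0.0] * num_paths   # EWMA of congestion signals
        self.threshold = threshold
        self.alpha = alpha

    def on_ack(self, path, congestion_marked):
        # Fold the latest telemetry sample into the per-path estimate.
        signal = 1.0 if congestion_marked else 0.0
        self.score[path] = (1 - self.alpha) * self.score[path] + self.alpha * signal

    def usable_paths(self):
        # Only paths below the threshold remain spraying candidates.
        return [p for p, s in enumerate(self.score) if s < self.threshold]

cc = PathAwareCC(num_paths=4)
for _ in range(10):
    cc.on_ack(2, congestion_marked=True)   # path 2 keeps reporting congestion
# Path 2 falls out of the candidate set; traffic shifts to paths 0, 1, 3.
```

Because congested paths are avoided rather than paused, no PFC back-pressure is needed, which is what removes the congestion-spreading and head-of-line-blocking failure modes mentioned below.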

These features simplify configuration, reduce operational overhead, and avoid common issues like congestion spreading, deadlock, and head-of-line blocking. Path-aware congestion control enables deterministic performance across the network, crucial for large-scale AI operations. By intelligently handling congestion without requiring a fully lossless network, the AMD Pensando™ Pollara 400 reduces network complexity, streamlining deployment in AI-driven data centers.

Rapid Fault Detection in High-Performance AI Networks: High-performance networks are crucial for efficient data synchronization in AI GPU clusters. The AMD Pensando™ Pollara 400 employs sophisticated methods for rapid fault detection, essential for maintaining optimal performance. Standard protocols' timeout mechanisms are often too slow for AI applications, which require aggressive fault detection to reduce idle GPU time and increase the throughput of AI training and inference tasks, ultimately decreasing job completion time.

  • Sender-Based ACK Monitoring leverages the sender's ability to track acknowledgments (ACKs) across multiple network paths.
  • Receiver-Based Packet Monitoring takes the receiver's perspective, tracking packet reception on each distinct network path; a potential fault is flagged if packets stop arriving on a path for a specified duration.
  • Probe-Based Verification is employed when a fault is suspected (triggered by either of the above methods): a probe packet is transmitted on the suspected faulty path, and if no response arrives within a specified timeframe, the path is confirmed as failed. This additional step helps distinguish transient network issues from actual path failures.
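The three techniques above compose into a simple state machine per path: silence raises suspicion, suspicion triggers a probe, and an unanswered probe confirms the failure. A minimal sketch, with illustrative millisecond timeouts rather than the NIC's real values:

```python
class PathFaultDetector:
    """Sketch of the rapid fault-detection flow: a path that goes
    silent past ack_timeout gets a probe; a probe unanswered past
    probe_timeout confirms the path as failed. Times are in ms and
    purely illustrative."""

    def __init__(self, paths, ack_timeout=5.0, probe_timeout=2.0):
        self.last_ack = {p: 0.0 for p in paths}
        self.probe_sent = {}               # path -> time probe was sent
        self.failed = set()
        self.ack_timeout = ack_timeout
        self.probe_timeout = probe_timeout

    def on_ack(self, path, now):
        self.last_ack[path] = now
        self.probe_sent.pop(path, None)    # any reply clears suspicion

    def tick(self, now):
        for path, t in self.last_ack.items():
            if path in self.failed:
                continue
            if path in self.probe_sent:
                if now - self.probe_sent[path] > self.probe_timeout:
                    self.failed.add(path)  # probe unanswered: confirm failure
            elif now - t > self.ack_timeout:
                self.probe_sent[path] = now  # suspicion: send a probe

det = PathFaultDetector(paths=[0, 1])
det.on_ack(0, now=6.0)    # path 0 keeps acknowledging
det.tick(now=7.0)         # path 1 silent for > 5 ms: probe it
det.tick(now=10.0)        # no probe reply within 2 ms: path 1 failed
```

The probe step is what keeps the detector aggressive without being trigger-happy: a transient hiccup that still answers the probe never reaches the failed state.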

Rapid fault detection mechanisms offer significant advantages. By identifying issues in milliseconds, they enable near-instantaneous failover, minimizing GPU idle time. Swift detection and isolation of faulty paths optimize network resource allocation, ensuring uninterrupted AI workloads on healthy paths. This approach enhances overall AI performance, potentially reducing training times and improving inference accuracy.

Final Thoughts: The AMD Pensando™ Pollara 400 is more than just a network card; it's a foundational component of a robust AI infrastructure. It addresses the limitations of traditional RoCEv2 Ethernet networks with features like real-time telemetry, adaptive packet spraying with intelligent path-aware congestion control to alleviate incast scenarios, selective acknowledgment, and robust error detection. AI workloads require networks that support bursty data flows, minimal jitter, noise isolation, and high bandwidth to ensure optimal GPU performance. When paired with best-of-breed, standards-compliant Ethernet switches, the AMD Pensando™ Pollara 400 forms the backbone of a high-efficiency, low-latency AI cloud environment.

With its ability to deliver high throughput, low latency, and exceptional scalability, combined with the flexibility of P4 programmability, the AMD Pensando™ Pollara 400 is an essential tool in the arsenal of any AI cloud infrastructure. This programmable approach not only enhances the NIC's versatility but also allows for rapid deployment of new networking features, ensuring that AI infrastructures can evolve as quickly as the AI technologies they support.
