7 Questions to Ask Before Building Your AI Infrastructure
Apr 30, 2026
The core bottleneck for deploying massive AI models is often the network. Systems now scale to tens of thousands of GPUs, and traditional Ethernet, while ubiquitous and cost-effective, was not designed for the stringent demands of the large-scale, tightly coupled parallel processing that AI workloads require.
AI workloads depend on continuous, uninterrupted data transfer, which makes them unforgiving of packet loss. Addressing this may require a fundamentally new architectural approach. As you plan infrastructure upgrades for AI workloads, these seven critical questions will help guide your decision-making process.
1. What network performance do AI workloads actually require at scale?
High-performance AI clusters demand efficient data transmission between nodes with low jitter and high bandwidth. This has traditionally forced a choice between proprietary InfiniBand fabrics and complex "lossless" Ethernet (RoCE or RoCEv2) configurations that require deep-buffer switches to prevent packet loss. A more effective architecture makes the endpoint intelligent enough to provide a reliable transport protocol over a standard Ethernet fabric that may be lossy.
By eliminating the complexity of lossless fabric management, infrastructure teams can focus on scaling GPU utilization rather than debugging network configurations. This architectural shift can reduce operational overhead while enabling faster job completion times and higher cluster performance. Maintaining reliability over standard Ethernet fabrics fundamentally changes the economics of large-scale AI deployments.
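The endpoint-driven reliability described above can be illustrated with a toy model. This is a simplified sketch, not AMD's actual transport protocol: the sender sprays packets across a lossy fabric and selectively retransmits only the chunks the receiver reports missing, rather than requiring the fabric itself to be lossless.

```python
import random

def send_reliable(payload_chunks, loss_rate=0.05, max_rounds=10):
    """Toy model of endpoint-driven reliability over a lossy fabric:
    retransmit only the chunks the receiver has not acknowledged."""
    delivered = {}                               # receiver state: seq -> chunk
    outstanding = dict(enumerate(payload_chunks))
    rounds = 0
    while outstanding and rounds < max_rounds:
        rounds += 1
        for seq, chunk in list(outstanding.items()):
            if random.random() >= loss_rate:     # packet survived the fabric
                delivered[seq] = chunk
        # selective ACK: keep only the still-missing sequence numbers
        outstanding = {s: c for s, c in outstanding.items()
                       if s not in delivered}
    return [delivered[s] for s in sorted(delivered)], rounds
```

The point of the sketch is that reliability lives entirely in the endpoints; the fabric is free to drop packets, which is what lets standard, cost-effective Ethernet switches carry AI traffic.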
2. How can you design your network to control costs when scaling AI infrastructure?
As you scale your AI infrastructure, network costs can quickly spiral out of control if your architecture relies on expensive, specialized hardware. A key strategy for managing capital expenditures is to shift networking intelligence from high-cost switches to the NICs themselves. By doing so, you can build massive, resilient networks on less specialized, more cost-effective switching infrastructure.
Organizations that adopt endpoint-intelligent networking architectures can achieve substantial cost reductions while maintaining performance. By eliminating fully scheduled, deep‑buffer switching fabrics and adopting a multi‑plane architecture with intelligent packet distribution, AMD internal analysis shows up to 58% lower network costs while supporting the same GPU scale and performance. [1]
More importantly, the simplified switching architecture reduces operational complexity, enabling leaner network operations teams to manage larger infrastructures. The economic impact extends beyond initial deployment costs to lower ongoing maintenance, lower power consumption, and reduced facility requirements.
3. How quickly can your network detect and remediate issues before AI workloads slow down?
In large-scale distributed systems, failures are inevitable. The critical metric is therefore not mean time between failures (MTBF) but mean time to recovery (MTTR): how quickly a fault is detected, isolated, and routed around. Your network must detect faults within milliseconds and fail over instantly to healthy paths to minimize GPU idle time. This requires more than just redundancy; it requires an architecture built for robust fault isolation.
Advanced fault isolation capabilities can markedly increase cluster reliability. Organizations implementing millisecond-level fault detection can see dramatic improvements in GPU utilization and training consistency: when network issues are detected and isolated instantly, AI workloads continue uninterrupted on healthy links, avoiding costly training restarts. This resilience becomes even more important as clusters scale, where even minor network disruptions can cascade into significant downtime. The business impact includes reduced model training costs, faster model development velocity, and higher confidence in the reliability of AI systems running in production.
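The detect-and-fail-over loop above can be sketched with a heartbeat monitor. This is an illustrative toy, not a real NIC implementation: a link whose heartbeat goes stale beyond the detection window is declared down, and traffic immediately shifts to the next healthy link.

```python
import time

class LinkMonitor:
    """Toy millisecond-scale fault detector: a link is considered
    down once its last heartbeat is older than the timeout window."""

    def __init__(self, links, timeout_ms=5):
        self.timeout = timeout_ms / 1000.0
        self.last_seen = {link: time.monotonic() for link in links}

    def heartbeat(self, link):
        """Record that a heartbeat just arrived on this link."""
        self.last_seen[link] = time.monotonic()

    def healthy_links(self):
        now = time.monotonic()
        return [l for l, t in self.last_seen.items()
                if now - t < self.timeout]

    def pick_link(self):
        """Fail over: return the first link that is still healthy."""
        healthy = self.healthy_links()
        if not healthy:
            raise RuntimeError("no healthy links")
        return healthy[0]
```

In a real fabric the heartbeat and failover logic run in NIC hardware rather than host software, which is what makes millisecond-scale MTTR achievable.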
4. How does networking observability improve AI cluster uptime and reliability?
Operating a massive AI cluster demands deep visibility and streamlined automation. Without rich telemetry and intelligent management tools, troubleshooting becomes a nightmare, and configuration drift introduces significant operational risk. Operational excellence depends on embedding these capabilities directly into the network fabric.
Organizations with comprehensive network observability can achieve high cluster uptime and operational efficiency. Real-time telemetry and automated validation of settings across every node in a cluster can prevent configuration drift before it affects production workloads. The ability to perform hitless upgrades and maintain continuous monitoring enables true automation across large AI infrastructures. This operational maturity becomes a competitive advantage, allowing organizations to iterate faster on AI models while maintaining the reliability of production systems.
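The automated consistency validation described above amounts to diffing every node's live settings against a known-good baseline. A minimal sketch, with hypothetical setting names chosen for illustration:

```python
def find_drift(golden, fleet):
    """Compare each node's network settings against a golden config
    and report any drifted keys as {node: {key: (expected, actual)}}."""
    report = {}
    for node, cfg in fleet.items():
        diffs = {k: (golden.get(k), cfg.get(k))
                 for k in golden.keys() | cfg.keys()
                 if golden.get(k) != cfg.get(k)}
        if diffs:
            report[node] = diffs
    return report

# Hypothetical example: node-42 drifted to a smaller MTU.
golden = {"mtu": 9000, "ecn": "enabled", "fw": "1.4.2"}
fleet = {
    "node-17": {"mtu": 9000, "ecn": "enabled", "fw": "1.4.2"},
    "node-42": {"mtu": 1500, "ecn": "enabled", "fw": "1.4.2"},
}
drift = find_drift(golden, fleet)
```

Run continuously against live telemetry, a check like this flags drift minutes after it appears instead of hours into a degraded training run.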
5. How do open ecosystems provide flexibility for evolving AI infrastructure?
Vendor lock-in is a significant risk that can stifle innovation and inflate long-term costs. An open ecosystem that leverages industry standards gives you the freedom to choose the best components based on your needs and adapt your infrastructure as new technologies emerge, creating a powerful competitive advantage.
Organizations that prioritize open standards will enjoy strategic flexibility as AI networking evolves. By avoiding proprietary solutions, teams can adapt nimbly to new technologies and optimize costs. This approach enables hybrid architectures that combine different vendor strengths while maintaining operational consistency across data centers. The long-term impact includes faster adoption of emerging standards and the ability to scale infrastructure without architectural constraints. Open ecosystems also facilitate knowledge sharing across the industry, accelerating innovation and the development of best practices.
6. How do you support AI training without sacrificing inference performance?
Organizations are increasingly shifting workloads from model training to inference to meet the growing demands of agentic AI. By accounting for both, organizations can build a network that supports current and future AI workloads.
Unified networking architectures that serve both AI training clusters and inference workloads can simplify operations and reduce costs. This approach reduces the need for separate networking stacks for different AI workloads, which lowers complexity and training requirements for operations teams. The strategic impact includes faster deployment of new AI services, cost predictability, and tighter alignment between AI infrastructure and business requirements. For regulated industries, unified on-premises infrastructure provides the control and compliance capabilities needed for AI adoption.
7. Can your network keep up with evolving AI standards and workloads?
The pace of AI innovation is relentless. A network built for today's models may be obsolete tomorrow. A truly future-ready architecture must be programmable, allowing you to adapt to new standards and optimize for emerging workloads with simple software updates rather than costly hardware replacement cycles.
Programmable network infrastructure enables organizations to evolve their AI capabilities without major hardware refreshes. Teams with software-defined networking can adapt to new transport protocols, optimize for different AI model architectures, and implement custom performance optimizations. This agility becomes critical as AI workloads diversify beyond traditional training and inference patterns. Organizations can achieve faster time-to-market for new AI services and longer infrastructure refresh cycles. The strategic advantage includes the ability to experiment with cutting-edge AI techniques without infrastructure constraints and to maintain competitive positioning as the AI landscape evolves.
Building a Foundational AI Network with AMD
The transition to large-scale AI requires a deliberate re-evaluation of network architecture. By asking these seven critical questions, infrastructure leaders can move beyond the limitations of conventional networking and design systems that are more performant, more cost-effective, and built to last.
The AMD Pensando™ Pollara 400 AI NIC was engineered from the ground up to provide definitive answers to these challenges. It integrates the intelligence needed to deliver reliable, high-throughput performance on open Ethernet-based fabrics. Its programmable design provides the adaptability to meet future demands, while its support for open standards enables you to retain control of your technology stack. By shifting network complexity to the endpoint, the AMD Pensando™ Pollara 400 AI NIC establishes a new paradigm for building and operating next-generation AI infrastructure at scale.
Discover how the AMD Pensando™ Pollara 400 AI NIC can help maximize your AI investments by requesting a consultation today, or by reading the Product Guide.
Footnotes
1. PEN-018: AMD comparison and pricing as of July 6, 2025, for network fabric costs to support 128,000 GPUs. Comparison of the AMD Pensando™ Pollara AI NIC with multiplane fabric and packet spray on an 800G Tomahawk 5–based multiplane design versus a generic fat-tree fabric built on fully scheduled, big-buffer (Jericho3/Ramon3) 800G switching platforms. The generic system is assumed to use a competitive NIC, with NIC costs considered comparable. The AMD Pensando™ Pollara AI NIC-based design is estimated to deliver up to 58% network switching cost savings by enabling the use of more cost-effective Tomahawk 5–based switching in a multiplane architecture. AMD comparison and pricing as of 4/23/2025 of a Tomahawk 5 system with AMD Pensando™ Pollara AI NIC featuring exclusive multiplane fabric and packet spray versus a generic big-buffer 800G switching platform; the generic system would employ a competitive NIC, and NIC costs are assumed to be comparable. Deploying the AMD Pensando™ Pollara AI NIC with multi-fabric support and packet spray allows customers to build cost-effective multiplane network fabrics instead of a fat-tree design, using fewer network switches to deliver the same network bandwidth across the fabric and dramatically reducing both switch platform cost and the cost of cables and optics.
• Fat-Tree Big Buffer Fully Scheduled Network (Leaf/Spine/Core) estimated cost: $1.22B
  • 3,556 leaf (Jericho3-AI) units at $104,998 each = $373M
  • 1,557 spine/core (Ramon3) units at $147,998 each = $247M
  • 128K AOC-10m cables at $1,059 each = $136M
  • 568,889 QDD-SR4-400G transceivers at $819 each = $466M
  • Total (switching & optics) = $1.22B
• Naddod Tomahawk5 800G Multiplane Fabric Network estimated cost: $511M
  • 3,000 leaf and spine units (Naddod N9600-640C) at $26,999 each = $81M
  • 384K QDD-SR4-400G transceivers at $819 each = $313M
  • 64K switch (OSFP-2x400G-DR4) transceivers for NIC connections at $759 each = $48M
  • 256K MPO cables at $26 each = $6.6M
  • 2K optical shuffle boxes, modules, and internal cables at $30K per rack = $60M
  • Total (switching & optics) = $511M
Prices subject to change. Comparison for specific network configurations only, and may not be representative of all possible network configurations and comparisons.