NVLink, UALink, and CXL: Understanding interconnects


The interconnects linking AI chips are arguably just as important as the chips themselves

Modern AI training has moved far beyond what any single GPU can accomplish. Training large language models now requires hundreds or thousands of GPUs working together, passing massive amounts of data between them in a constant, coordinated exchange. This creates an infrastructure challenge — the connections between GPUs become arguably just as important as the processors themselves.

When interconnects are slow or inefficient, expensive GPUs sit idle waiting for data. This bottleneck directly impacts training speed and determines how much return organizations see on their substantial hardware investments. The faster and more efficiently GPUs can communicate, the more productive each chip becomes and the faster models reach completion.

For years, this infrastructure layer has been largely controlled by Nvidia through its proprietary NVLink technology. That dominance has created deep dependencies for organizations building GPU clusters, locking them into Nvidia’s hardware and networking ecosystem. That said, open standards backed by AMD, Intel, and others are emerging to challenge that position — though how viable they are as a true alternative isn’t quite as clear.

Why interconnects matter so much

The raw performance of a GPU cluster isn’t simply additive. When training workloads distribute across multiple accelerators, those chips must continuously exchange gradient updates, model parameters, and intermediate calculations. Every microsecond lost to data transfer is a microsecond of expensive silicon sitting underutilized.
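To make that exchange concrete, the sketch below shows the gradient all-reduce that data-parallel training performs on every step, using PyTorch's torch.distributed API. The buffer size, backend choice, and launch setup are illustrative assumptions, not a tuned configuration for any particular interconnect.

```python
# Minimal sketch: the all-reduce that data-parallel training runs each step.
# Launch with e.g. `torchrun --nproc_per_node=8 allreduce_sketch.py`.
# Sizes and backend choice are illustrative, not a tuned configuration.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")   # NCCL rides NVLink where available
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Stand-in for one step's gradients: 1 GiB of fp16 (~0.5B parameters).
    grads = torch.randn(512 * 1024 * 1024, dtype=torch.float16, device="cuda")

    # Every GPU must send and receive this buffer before the optimizer can step.
    # The time spent here is pure interconnect cost: the GPU does no useful math
    # while it waits, which is why link bandwidth governs utilization.
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)
    grads /= dist.get_world_size()            # average the gradients

    torch.cuda.synchronize()
    if rank == 0:
        print("all-reduce complete across", dist.get_world_size(), "GPUs")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The faster the fabric moves that buffer, the smaller the idle slice on every training step.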

Organizations are pouring hundreds of millions into GPU clusters for AI training. When interconnect constraints drag utilization down from 95% to 70%, the effective cost of compute inflates dramatically. Put simply, the interconnect layer determines whether a major infrastructure investment delivers compelling returns or disappointing efficiency.
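A back-of-the-envelope calculation shows how quickly that inflation compounds; the cluster cost, GPU count, and utilization figures below are hypothetical, chosen only to illustrate the shape of the math.

```python
# Back-of-the-envelope: how interconnect-driven utilization losses inflate
# the effective cost of compute. All inputs are hypothetical.
CLUSTER_COST = 300_000_000           # $300M cluster, illustrative
GPU_HOURS_PER_YEAR = 10_000 * 8760   # 10,000 GPUs running year-round

def cost_per_useful_hour(utilization: float) -> float:
    """Amortized cost per GPU-hour that actually does useful work."""
    useful_hours = GPU_HOURS_PER_YEAR * utilization
    return CLUSTER_COST / useful_hours

well_fed = cost_per_useful_hour(0.95)   # GPUs rarely wait on the fabric
starved  = cost_per_useful_hour(0.70)   # GPUs idle waiting for data

print(f"at 95% utilization: ${well_fed:.2f} per useful GPU-hour")
print(f"at 70% utilization: ${starved:.2f} per useful GPU-hour")
print(f"effective cost inflation: {starved / well_fed - 1:.0%}")   # ~36%
```

Dropping from 95% to 70% utilization makes every useful GPU-hour roughly a third more expensive, regardless of what the hardware cost on paper.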

These dynamics ripple into competitive positioning in ways that go beyond raw throughput. Building clusters on proprietary interconnects means committing not just to a vendor’s current hardware, but to their entire future roadmap. Switching costs balloon, negotiating power erodes, and organizations find themselves locked into upgrade cycles they don’t fully control.

The big one: Nvidia NVLink

Nvidia’s NVLink has commanded the GPU interconnect landscape since its 2016 debut, and for good reason — nothing else in production has matched its performance.

The current generation, NVLink 5.0, delivers up to 1.8 TB/s of bandwidth per GPU and supports fabrics spanning up to 576 GPUs. That’s a significant jump from NVLink 4.0 on Hopper-generation hardware, which topped out at 900 GB/s per GPU across a maximum of 256 GPUs. The technology leverages Ethernet-style signaling to hit these bandwidth figures, trading higher error rates and latency for sheer throughput.
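The headline figures decompose simply from Nvidia's published per-link numbers, as the quick sanity check below shows; treat it as a restatement of the public spec rather than a claim about achievable application throughput.

```python
# Sanity check of the public NVLink per-GPU aggregate bandwidth figures.
# Link counts and per-link rates are Nvidia's published spec values; real
# application throughput will be lower once protocol overhead and traffic
# patterns are taken into account.

def per_gpu_bandwidth(links: int, gb_per_link: float) -> float:
    return links * gb_per_link  # GB/s

nvlink4 = per_gpu_bandwidth(links=18, gb_per_link=50)    # Hopper:    900 GB/s
nvlink5 = per_gpu_bandwidth(links=18, gb_per_link=100)   # Blackwell: 1800 GB/s

print(f"NVLink 4.0 (Hopper):    {nvlink4:.0f} GB/s per GPU")
print(f"NVLink 5.0 (Blackwell): {nvlink5 / 1000:.1f} TB/s per GPU")
```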

NVLink’s architecture favors dense, homogeneous systems where maximum bandwidth between tightly-packed accelerators is paramount. It shines in unified configurations like Nvidia’s DGX systems and NVL72 racks — environments where every component originates from a single vendor and the full stack is tuned to operate as one.

The tradeoff, however, is lock-in. Choosing NVLink means committing to Nvidia GPUs, Nvidia networking, Nvidia software tools, and Nvidia’s upgrade timeline. For large enterprises and telecommunications providers making multi-year, multi-billion-dollar capital expenditures, this dependency can be risky, expensive, or both. That said, Nvidia’s integrated approach delivers performance and reliability that fragmented alternatives have consistently struggled to replicate.

Open standards: UALink and CXL

Two open standards are positioning themselves against Nvidia’s proprietary approach, though they address somewhat different problems and operate on different timelines.

UALink 1.0

Published in April 2025, UALink marks the first credible open-standard challenge specifically targeting GPU-to-GPU communication. The specification runs at 200 GT/s per lane, delivering up to 800 Gbit/s (roughly 100 GB/s) per x4 port. That's well below NVLink 5.0's per-GPU throughput. UALink compensates with greater scalability, supporting up to 1,024 accelerators in a single fabric versus NVLink's 576-GPU ceiling.

The architecture shares similarities with NVLink, employing Ethernet-style SerDes to achieve high bandwidth. Ports come in flexible widths (x1, x2, or x4) and can be aggregated, letting organizations tune connectivity to their specific requirements.
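A quick calculation, using the lane rate and port widths described above, shows where those figures land in GB/s terms. These are raw signaling rates per port and per direction, before encoding and protocol overhead, and an accelerator can aggregate multiple ports, so they are not a per-GPU ceiling.

```python
# Raw signaling rate of a UALink 1.0 port at the spec's 200 GT/s per lane,
# for the x1/x2/x4 widths it allows. Treating 200 GT/s as ~200 Gbit/s per
# lane, as the spec's 800 Gbit/s x4 figure implies. Raw line rates only:
# encoding and protocol overhead reduce delivered bandwidth.
LANE_RATE_GBITPS = 200

for lanes in (1, 2, 4):
    gbit_s = lanes * LANE_RATE_GBITPS        # Gbit/s, raw
    gbyte_s = gbit_s / 8                     # GB/s, raw
    print(f"x{lanes} port: {gbit_s:4d} Gbit/s  (~{gbyte_s:.0f} GB/s)")

# For scale: NVLink 5.0's published per-GPU aggregate is 1.8 TB/s, so a
# single x4 UALink port (~100 GB/s raw) sits well below it; narrowing the
# gap means ganging multiple ports per accelerator.
```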

UALink’s most compelling advantage may be vendor neutrality. Backed by AMD, Intel, and Astera Labs, the standard enables modular, multi-vendor environments where organizations could theoretically combine AMD MI-series accelerators, Intel Gaudi processors, and other hardware on the same fabric. This directly addresses the lock-in concerns fueling enterprise interest in NVLink alternatives.

UALink hardware won’t materialize until late 2026 at the earliest, with meaningful production deployments likely extending into 2027. The specification exists; the silicon doesn’t yet.

CXL 4.0

The Compute Express Link (CXL) standard takes a different approach, focusing on CPU-to-memory and device-to-memory communication rather than direct GPU-to-GPU links. CXL 4.0, released in November 2025, doubled bandwidth to 128 GT/s and introduced features enabling multi-rack memory pooling for the first time.

Where UALink aims to accelerate GPU communication during training, CXL enables memory expansion and shared memory pools across data center infrastructure. The technology proves particularly valuable for workloads demanding terabyte-scale shared memory, allowing compute nodes to access memory pools that would otherwise be cost-prohibitive to attach directly.

CXL relies on PCIe-based SerDes, yielding lower error rates and latency than Ethernet-style alternatives but also lower peak bandwidth. A consortium including Intel, AMD, and others manages the technology, making it vendor-neutral.
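For a rough sense of scale, the sketch below estimates raw per-direction bandwidth for a CXL 4.0 link at 128 GT/s per lane, assuming PCIe-style lane widths and ignoring encoding and protocol overhead; actual usable bandwidth will be lower.

```python
# Rough raw-bandwidth figures for a CXL 4.0 link at 128 GT/s per lane,
# assuming PCIe-style lane widths (an assumption here) and ignoring
# encoding and protocol overhead.
LANE_RATE_GTPS = 128  # CXL 4.0 signaling rate per lane

for lanes in (4, 8, 16):
    gb_per_dir = lanes * LANE_RATE_GTPS / 8   # GB/s per direction, raw
    print(f"x{lanes} link: ~{gb_per_dir:.0f} GB/s per direction (raw)")

# An x16 link lands around ~256 GB/s per direction raw: ample for memory
# expansion and pooling, but well below NVLink 5.0's 1.8 TB/s per-GPU
# aggregate, which is why CXL targets memory rather than GPU-to-GPU
# gradient traffic.
```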

Multi-rack CXL deployments target 2027, and like UALink, the technology remains largely theoretical at scale. No major deployments have been announced, leaving organizations to evaluate specifications rather than battle-tested implementations.

Head to head

| Metric | UALink 1.0 | NVLink 4.0 (Hopper) | NVLink 5.0 (Blackwell) |
| --- | --- | --- | --- |
| Per-GPU bandwidth | 800 Gbit/s (~100 GB/s) per x4 port | 900 GB/s | 1.8 TB/s |
| Maximum cluster size | 1,024 accelerators | 256 GPUs | 576 GPUs |
| Vendor lock-in | Open standard | Proprietary | Proprietary |
| Hardware availability | Late 2026/2027 | In production today | In production today |
| Signaling trade-off | Ethernet-style SerDes; higher error rate and latency | Ethernet-style signaling; higher error rate and latency | Ethernet-style signaling; higher error rate and latency |

NVLink 5.0 offers far more per-GPU bandwidth than UALink 1.0: 1.8 TB/s versus 800 Gbit/s (roughly 100 GB/s) per x4 port. For workloads where raw communication speed between adjacent GPUs drives performance, this gap is significant.

UALink counters with greater cluster scalability. Support for 1,024 accelerators versus NVLink’s 576 gives organizations building massive training clusters more headroom, and the modular flexibility to mix vendors provides strategic optionality that NVLink simply cannot match.

The most critical difference, however, isn’t captured in any specification. NVLink hardware is in production and shipping today. UALink multi-rack deployments remain 12-18 months out. For organizations building clusters now, the choice isn’t really a choice at all.

Lock-in vs. independence

The proprietary-versus-open interconnect battle extends well beyond technical specifications into strategic infrastructure decisions that will shape organizations for years.

Telecommunications companies and large enterprises are growing increasingly uncomfortable with single-vendor dependency. Building GPU clusters on NVLink means committing not just to today’s Nvidia hardware but to Nvidia’s future roadmap. Upgrade cycles, pricing negotiations, and technology choices all become constrained by that initial decision. For infrastructure teams making capital expenditures measured in billions, this risk concentration creates serious concern.

UALink’s value proposition centers on breaking that dependency. A telecommunications provider could theoretically deploy AMD accelerators today and integrate Intel processors later, all on the same fabric. This preserves negotiating leverage, enables best-of-breed procurement, and reduces the risk of being stranded on a deprecated platform.

Nvidia’s closed ecosystem, by contrast, delivers integration benefits that fragmented alternatives struggle to match. When the GPU vendor also controls the interconnect, the network switches, and the software stack, optimization happens at every layer. Performance is typically better, debugging is simpler, and support is unified. For organizations prioritizing immediate performance over long-term flexibility, the lock-in may be an acceptable price.

Infrastructure teams are watching this tradeoff carefully, particularly as they plan cycles for 2027 and 2028. The decisions made in those timeframes will determine whether open standards capture meaningful market share or remain niche alternatives.

Conclusions

The current reality tilts heavily toward Nvidia. For organizations building GPU clusters in 2026, NVLink remains the only production-ready option for demanding AI training workloads. The technology is mature, the hardware is available, and the ecosystem of software optimization and operational knowledge stands unmatched.

The distance between open standards and production readiness remains substantial. UALink silicon won’t ship until late 2026, with meaningful deployments likely stretching into 2027. CXL multi-rack capabilities target similar timeframes. Early adopters of either technology risk backing specifications that haven’t been proven at scale, and the protracted standardization process creates uncertainty for GPU manufacturers weighing adoption.

Software may present the most significant hurdle for challengers. Nvidia’s advantages extend beyond hardware to encompass years of CUDA optimization, deep framework integration with PyTorch and TensorFlow, and accumulated best practices from thousands of deployments. Even if UALink hardware matched NVLink performance (which it doesn’t) the software ecosystem gap would remain substantial. Performance in AI training depends on the entire stack, not just the interconnect specification.
