Solving the ‘efficiency gap’ for massive GPU clusters


In a 1,000-GPU cluster, it’s typical to have 2-4 disruptive events a day, which can cost a data center operator millions of dollars in losses. AMD- and Broadcom-backed Clockwork dynamically works around link failures and hardware crashes so training jobs finish 1.2x to 1.5x faster.

In sum, what you need to know:

  • Reaching peak performance – In large-scale AI training, computing clusters often reach only 30–50% of their theoretical performance.
  • Resolve disruptive events – In a 1,000-GPU cluster, it’s typical to have 2-4 disruptive events a day.
  • Stop wasting Capex – GPUs sit idle because of communication and synchronization bottlenecks that cost hundreds of thousands to millions of dollars per day.

This RCR AI TechTalk features Suresh Vasudevan, CEO of Clockwork.io, a company backed by AMD and Broadcom because of its nanosecond-level time synchronization across server clocks, which eliminates the need for AI workload restarts when outages and hardware failures strike massive GPU clusters.

In large-scale AI training, computing clusters often reach only 30–50% of their theoretical performance because GPUs sit idle while waiting to communicate with one another. These communication and synchronization bottlenecks can cost data center operators hundreds of thousands to millions of dollars per day. “In a 1,000 GPU cluster, it’s typical to have 2-4 disruptive events on a daily basis, which can mean losses of $5 million to $8 million out of $50 million spent on that cluster,” explains CEO Suresh Vasudevan of Clockwork.io. The AMD- and Broadcom-backed company makes software that facilitates ultra-fast communication between thousands of GPUs, a critical factor when even a microsecond of network congestion or a server error can idle expensive silicon and stall training.
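To put the quoted figures in proportion, here is a minimal arithmetic sketch. The dollar amounts and the 30–50% range come from the article; the function names are illustrative, and reading “idle share” as simply one minus utilization is a simplifying assumption rather than anything Clockwork states.

```python
# Figures quoted in the article.
CLUSTER_SPEND_USD = 50_000_000
LOSS_RANGE_USD = (5_000_000, 8_000_000)
UTILIZATION_RANGE = (0.30, 0.50)  # share of theoretical performance reached


def loss_fraction(loss_usd, spend_usd):
    """Share of cluster spend lost to disruptive events."""
    return loss_usd / spend_usd


def idle_share(utilization):
    """Assumption: capacity not reached counts as idle (1 - utilization)."""
    return 1.0 - utilization


low_loss, high_loss = (loss_fraction(x, CLUSTER_SPEND_USD) for x in LOSS_RANGE_USD)
print(f"Loss share of spend: {low_loss:.0%} to {high_loss:.0%}")

best_idle, worst_idle = (idle_share(u) for u in reversed(UTILIZATION_RANGE))
print(f"Idle share of capacity: {best_idle:.0%} to {worst_idle:.0%}")
```

On these numbers, the quoted losses amount to 10% to 16% of the spend on the cluster.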

“We focus on customers that are deploying AI workloads on thousands and tens-of-thousands of GPUs,” says Vasudevan, pointing to a growing customer base of neoclouds like Nscale and Nebius; large global enterprises like Zoom, Uber, and DCAI (Denmark’s sovereign AI, through the Danish Centre for AI Innovation); and hyperscalers like Amazon Web Services.

What differentiates a software approach from bespoke hardware solutions is the programmability of the software control plane, which sits between AI application workloads and the underlying network hardware. As a “software-defined AI fabric,” FleetIQ routes traffic without proprietary hardware, instead using abstraction software to bridge the gap between application code and physical hardware. For example, Clockwork works across AMD and Nvidia GPUs and sits on top of technologies like InfiniBand and Ethernet, whether in public clouds, neoclouds, or hyperscaler environments.

When it comes to InfiniBand versus Ethernet, the Nvidia portfolio supports both, but Vasudevan says Ethernet deployments, including RDMA over Ethernet, are growing faster than InfiniBand. “You’ll see a coexistence of both, but the share gain is in favor of Ethernet.”

Focusing on massive AI workloads: Demanding and Distributed

These AI workloads are extremely demanding because each one is a distributed application that relies on massive numbers of GPUs, all working as one entity. “So typically, when a single GPU is running slow or fails, the entire job runs slow or fails and has to be restarted,” notes Vasudevan. “By optimizing communication among all GPUs, you raise utilization of the GPU cluster. You spend hundreds-of-millions on infrastructure for running these workloads, so communication is critical.”

According to Vasudevan, there are three focus areas in large-scale GPU clusters:

  • Observability: monitor the cluster to rapidly detect failures and identify their root cause.
  • Fault Tolerance: detect a failure and seamlessly migrate the workload from a failing GPU, node, or network link onto spare resources so training is uninterrupted.
  • Performance Optimization: when GPUs communicate, the network becomes congested and multiple flows collide on certain network paths, so traffic is rerouted to keep GPUs performing at their highest level.
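The idea behind the third focus area, moving colliding flows onto less busy paths, can be illustrated with a generic greedy placement. This is a minimal sketch of a textbook least-loaded heuristic, not Clockwork’s algorithm; the flow names, path names, and bandwidth figures are hypothetical.

```python
def path_loads(assignment, flows, paths):
    """Sum the offered load (Gbps) on each path for a flow -> path assignment."""
    load = {p: 0.0 for p in paths}
    for name, gbps in flows:
        load[assignment[name]] += gbps
    return load


def place_least_loaded(flows, paths):
    """Greedy placement: largest flows first, each onto the least-loaded path."""
    load = {p: 0.0 for p in paths}
    assignment = {}
    for name, gbps in sorted(flows, key=lambda f: f[1], reverse=True):
        target = min(paths, key=lambda p: load[p])
        assignment[name] = target
        load[target] += gbps
    return assignment


# Hypothetical example: four 100 Gbps flows and two equal-cost paths.
FLOWS = [("f0", 100.0), ("f1", 100.0), ("f2", 100.0), ("f3", 100.0)]
PATHS = ["path-a", "path-b"]

# Worst case for static placement: every flow lands on the same path.
collided = {name: "path-a" for name, _ in FLOWS}
collided_load = path_loads(collided, FLOWS, PATHS)

balanced_load = path_loads(place_least_loaded(FLOWS, PATHS), FLOWS, PATHS)
print(max(collided_load.values()), max(balanced_load.values()))  # 400.0 200.0
```

In the collided case one path carries all 400 Gbps while the other sits empty; after greedy placement each path carries 200 Gbps. A production fabric would also need live congestion measurements, which this sketch leaves out.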

Resilience to GPU failures is critical. Tracking the distributed state that exists on every GPU makes it possible to identify which GPU has failed, bring in a “spare” GPU, and recreate the state, all without the training job ever noticing the change. Bespoke solutions from companies like OpenAI and Anthropic deal with these failures by going back to a previous point in time and restarting. “We are fully software based, whether dealing with Nvidia GPUs, or AMD GPUs, or in the cloud with proprietary accelerators used for training. We work irrespective of whether it’s InfiniBand Networks from Nvidia, or Ethernet, or Arista, Cisco or anyone else.”
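The spare-GPU swap described here can be pictured as a small bookkeeping exercise. The sketch below is an assumption-heavy illustration: the `RankMap` class, its fields, and the idea of tracking a per-rank training step are all hypothetical and are not drawn from Clockwork’s software.

```python
from dataclasses import dataclass, field


@dataclass
class RankMap:
    """Hypothetical map of logical training ranks to physical GPUs plus spares."""
    active: dict                                # rank -> GPU id
    spares: list                                # idle GPU ids
    state: dict = field(default_factory=dict)   # rank -> last recorded step

    def record_state(self, rank, step):
        """Remember the latest training step seen for a rank."""
        self.state[rank] = step

    def fail_over(self, failed_gpu):
        """Reassign the rank on failed_gpu to a spare, keeping its recorded state."""
        rank = next((r for r, g in self.active.items() if g == failed_gpu), None)
        if rank is None:
            raise KeyError(f"{failed_gpu} is not running any rank")
        if not self.spares:
            # With no spare left, the only option is the rollback-and-restart path.
            raise RuntimeError("no spare GPU available")
        replacement = self.spares.pop(0)
        self.active[rank] = replacement
        return rank, replacement, self.state.get(rank)


cluster = RankMap(
    active={0: "gpu-0", 1: "gpu-1", 2: "gpu-2"},
    spares=["gpu-spare-0"],
)
cluster.record_state(1, 1200)
rank, replacement, step = cluster.fail_over("gpu-1")
print(rank, replacement, step)  # 1 gpu-spare-0 1200
```

The other ranks keep their GPU assignments, which is what separates this pattern from rolling the whole job back to an earlier checkpoint.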

According to Vasudevan, training jobs can finish 1.2x to 1.5x faster, which translates into faster time to market without buying extra GPUs. “Something that has a baseline of four hours can often take six to eight hours,” he says, and fault tolerance and faster remediation of slowdowns are what recover that lost time.
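The 1.2x to 1.5x figure can be turned into wall-clock hours with one division. The sketch below assumes “1.2x faster” means the run time is divided by 1.2; the six-hour job is a hypothetical input chosen for illustration.

```python
def hours_after_speedup(current_hours, speedup):
    """Run time once a job finishes `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return current_hours / speedup


CURRENT_HOURS = 6.0  # hypothetical job duration today
for speedup in (1.2, 1.5):
    new_hours = hours_after_speedup(CURRENT_HOURS, speedup)
    print(f"{speedup}x faster: {new_hours:.1f} h (saves {CURRENT_HOURS - new_hours:.1f} h)")
```

Under that reading, a six-hour job would finish in about five hours at 1.2x and about four hours at 1.5x.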

Training and inference are both demanding in terms of performance, and inference is increasingly a distributed application as well. “There’s the issue of reliability of the underlying GPU infrastructure, having to make sure utilization of the GPU infrastructure has a direct impact on economics. The workload patterns themselves are different, and I do think where large enterprises see inference deployment scale is going to be larger than the training deployment scale,” according to Vasudevan.

He says monitoring, resilience in the face of failures, root-cause analysis, and remediation will all be just as relevant in inference, where latency, and time-to-first-token latency in particular, is paramount. “The number of tokens I deliver per second are key metrics that fundamentally depend on having a reliable cluster, and a reliable network within that cluster.”
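Both inference metrics mentioned here can be computed from token arrival timestamps. This is a minimal sketch using common definitions (time to first token measured from the request, throughput as tokens over total elapsed time); the function name and sample timestamps are hypothetical.

```python
def inference_metrics(request_time, token_times):
    """Return (time_to_first_token_s, tokens_per_second) for one request.

    request_time: when the request was sent, in seconds.
    token_times: ascending arrival times of each generated token, in seconds.
    """
    if not token_times:
        raise ValueError("no tokens were generated")
    ttft = token_times[0] - request_time
    elapsed = token_times[-1] - request_time
    tokens_per_second = len(token_times) / elapsed if elapsed > 0 else float("inf")
    return ttft, tokens_per_second


# Hypothetical request: four tokens arriving a quarter of a second apart.
ttft, tps = inference_metrics(0.0, [0.25, 0.5, 0.75, 1.0])
print(ttft, tps)  # 0.25 4.0
```

A stalled GPU or congested link shows up in both numbers: the first token arrives later and the gaps between tokens widen.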
