Idle GPU clusters can cost data center operators millions of dollars


In large-scale AI training, computing clusters often reach only 30–50% of their theoretical performance because GPUs sit idle while waiting to communicate with one another. In fact, communication and synchronization bottlenecks in massive GPU clusters can cost data center operators hundreds of thousands to millions of dollars per day.

On Monday, RCRTech will break down highlights from a recent AI TechTalk with CEO Suresh Vasudevan of Clockwork Systems – an AMD- and Broadcom-backed company that is attracting attention from neoclouds, large enterprises, hyperscalers and anyone deploying AI workloads on tens of thousands and even hundreds of thousands of GPUs. According to Vasudevan, “a 1,000 GPU cluster can typically have two to four disruptive events on a daily basis, bringing losses of $5 million to $8 million out of about $50 million spent on that size cluster.”
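To put the quoted figures in proportion, the short sketch below works out what share of cluster spend those losses represent. It uses only the dollar amounts from Vasudevan's quote; the function and variable names are illustrative, and the quote does not state the time period the losses cover, so the result is a simple ratio rather than an annualized rate.

```python
def loss_share(loss_usd: float, spend_usd: float) -> float:
    """Return losses as a fraction of total cluster spend."""
    if spend_usd <= 0:
        raise ValueError("spend_usd must be positive")
    return loss_usd / spend_usd


# Figures quoted for a 1,000-GPU cluster, in US dollars.
CLUSTER_SPEND = 50_000_000
LOSS_LOW, LOSS_HIGH = 5_000_000, 8_000_000

low = loss_share(LOSS_LOW, CLUSTER_SPEND)
high = loss_share(LOSS_HIGH, CLUSTER_SPEND)
print(f"Losses: {low:.0%} to {high:.0%} of cluster spend")
```

On these numbers, disruptions would account for roughly a tenth to a sixth of what is spent on a cluster of that size.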

Check back in to see how software-driven solutions can bring nanosecond-level time synchronization across server clocks to optimize communication among GPUs and raise utilization of a GPU cluster in both training and inference workloads.

 


Susana Schwartz
Technology Editor
RCRTech

 

AI Infrastructure Top Stories

Idle GPUs cost millions: Large-scale computing clusters often reach only 30–50% of their theoretical performance, with a 1,000-GPU cluster typically seeing two to four disruptive events per day. Clockwork.io CEO Suresh Vasudevan digs into the issue.

APAC as DC growth engine: According to McKinsey, traditional compute, storage, and cloud workloads currently account for more than 70% of APAC data center demand, while AI training and inference workloads represent roughly 30%.

 


AI Today: What You Need to Know

Infineon AI Expansion: Infineon’s first 750V and 1200V CoolSiC JFET devices in Q-DPAK packages are now entering mass production. These chips target solid-state circuit breakers and power architectures inside advanced AI data centers.
 
540 MW Project Caprock: Aligned Data Centers is developing a $5 billion, 540 MW data center campus on 313 acres in Hale County, TX, built for high-density hyperscale, cloud, and AI workloads.
 
DC developer eyes bottling site: An as-yet-unnamed entity is considering a former Crystal Geyser bottling site in Mount Shasta, California, for a data center, but the community is circulating a petition opposing the facility. No formal application has been submitted.

IBM Sub-1nm architecture: IBM touts energy savings in the debut of the world’s first sub-1-nanometer chip. Utilizing a “nanostack” 3D transistor architecture at the 0.7 nm node, it crams 100 billion transistors onto a fingernail-sized piece of silicon.

PA bill might end DC tax credit: The Pennsylvania House of Representatives passed House Bill 2198 with a 197-5 vote. If it also passes the Senate, it will eliminate a 2021 policy that exempts data centers from paying state sales tax on data center equipment.

Micron’s AI-driven ascent: Micron Technology briefly surpassed Meta and Tesla in market valuation after a blockbuster $22 billion in customer commitments for its memory chips, highlighting intense infrastructure demand.

 

RCR Events

Quantum Safe Networks Forum, July 14th
Quantum Safe Networks Forum brings together telecom operators, cybersecurity experts, and industry analysts to explore how to build resilient, future-ready infrastructure in the face of quantum disruption. Register now

RCR Roundtables AI Infrastructure, October 21st, Dallas, Texas
Join 50 senior data center, energy and AI leaders at the Ritz-Carlton Dallas on October 21 for invitation-only roundtables on powering and scaling AI. Request your invitation 

 

Industry Resources

Webinar, June 29th: Agentic RAN Management: Delivering OPEX efficiency and a path to 6G 

Webinar, June 30th: Building the 6G Standard: Key developments to know

Webinar, July 7th: Noise-Figure Measurements with RFmx and PXI VSTs

Webinar, July 16th: NTN in motion — evolving standards, expanding services

Whitepaper: Powering sovereign AI at scale

Whitepaper: Scalable database design for 5G and beyond

Report: Scaling AIOPs from insight to action

Summit Access: GSMA Device Enablement Summit: How operators can fix device-network fragmentation

Whitepaper: Telco AI Enabler: Mediation’s defining role

Report: Securing telecom infrastructure for the quantum era

Report: Scaling optical networks for the AI and hyperscale era

