Today’s top story examines the challenges top-tier operators face when managing frontier AI models and large-scale cloud services, where faults are guaranteed and idle time can be catastrophic. For operators monitoring 100,000 to well over 1,000,000 GPUs in LLM training workloads, training faults not only waste compute time and progress, but also trigger inference faults that degrade user experience, violate SLAs, and destroy unit economics.
Suresh Vasudevan, CEO of Clockwork Systems, says nanosecond-level time synchronization across server clocks can eliminate the need for AI workload restarts when outages and hardware failures strike massive GPU clusters.
He talks about the ways in which data center operators can:
– Reach peak performance – In large-scale AI training, computing clusters often reach only 30–50% of their theoretical performance.
– Resolve disruptive events – In a 1,000-GPU cluster, it’s typical to have two to four disruptive events a day.
– Eliminate wasted Capex – GPUs sit idle because of communication and synchronization bottlenecks that cost hundreds of thousands to millions of dollars per day.
Check out the highlights from our recent RCR AI TechTalk interview.
Also read about OpenAI and Broadcom’s Jalapeño, a custom chip built only for AI inference. Take a look below.

Susana Schwartz
Technology Editor
RCRTech
AI Infrastructure Top Stories
Jalapeño already runs GPT-5.3 workloads: OpenAI and Broadcom unveiled Jalapeño, a custom application-specific integrated circuit (ASIC) built for LLM inference and described as a multigenerational “Intelligence Processor” and “AI accelerator.”
Reader Forum – subsea resilience: Data center operators must evaluate the physical and political risks of geographic pathways. Subsea network resilience is measured by corridor-level risk, not just by counting cables, says Exa’s Steve Roberts.
AI Today: What You Need to Know
Anthropic – Alibaba controversy: Anthropic wrote Sen. Tim Scott and Sen. Elizabeth Warren of the U.S. Senate Committee on Banking, Housing, and Urban Affairs, accusing Alibaba of “the largest distillation attack on Anthropic to date.”
AI’s next bottleneck isn’t chips: “Severe weather is no longer something that can be treated as a background exposure,” says Patrick McBride, Zurich Insurance’s head of international construction, noting extreme weather’s impact on data centers.
FL data center law: Florida’s SB 484 takes effect July 1, preventing utilities from passing data center electricity costs on to residents and preserving local zoning, water usage and environmental permitting legislation.
NY DC moratorium: New York’s Responsible Data Center Development Act (S10642 / A11560) has passed both houses of the State Legislature and is currently awaiting a final decision from Governor Kathy Hochul.
Amazon’s $48B in India: Amazon is committing $13 billion more to infrastructure and AWS data center development in India, bringing the total to $48 billion by 2030. The focus will be on expanding cloud computing infrastructure and AI services.
So. Korea semiconductor cluster: South Korea will develop a new semiconductor production base through 800 trillion won (US$517.9 billion) in corporate investments, which will go toward four memory chip fabrication plants.
RCR Events
Quantum Safe Networks Forum, July 14th
Quantum Safe Networks Forum brings together telecom operators, cybersecurity experts, and industry analysts to explore how to build resilient, future-ready infrastructure in the face of quantum disruption. Register now
RCR Roundtables AI Infrastructure, October 21st, Dallas, Texas
Join 50 senior data center, energy and AI leaders at the Ritz-Carlton Dallas on October 21 for invitation-only roundtables on powering and scaling AI. Request your invitation
Industry Resources
Webinar, June 29th: Agentic RAN Management: Delivering OPEX efficiency and a path to 6G
Webinar, June 30th: Building the 6G Standard: Key developments to know
Webinar, July 7th: Noise-Figure Measurements with RFmx and PXI VSTs
Webinar, July 16th: NTN in motion — evolving standards, expanding services
Whitepaper: Powering sovereign AI at scale
Whitepaper: Scalable database design for 5G and beyond
Report: Scaling AIOps from insight to action
Summit Access: GSMA Device Enablement Summit: How operators can fix device-network fragmentation
Whitepaper: Telco AI Enabler: Mediation’s defining role
Report: Securing telecom infrastructure for the quantum era
Report: Scaling optical networks for the AI and hyperscale era