How the CoreWeave-Meta pact builds momentum for the test and measurement market

The rising tide of AI spending, including circular deals like the one between CoreWeave and Meta, is starting to spill into supporting industries.

AI companies continue to pour money into infrastructure, a trend that promises to lift many adjacent industries. Meta’s multibillion-dollar deal with CoreWeave is a case in point. Pegged at $14.2 billion, the partnership will secure Meta a continued supply of compute capacity through December 2031, with an option to extend into 2032.

The deal marks a major milestone, moving the data center market closer to Dell’Oro Group’s projected $1.1 trillion in CapEx spending by 2029. The effects of this massive spending on AI infrastructure are already showing up in supplier and partner ecosystems – call it AI’s golden touch. The boom is rapidly spreading across the test and measurement market, dominated by the likes of Keysight, Rohde & Schwarz, and Spirent, which are responsible for verifying, benchmarking, and assuring the performance of that infrastructure.

When deals like these get locked in, data center operators need to make significant downstream investments in testing and benchmarking to ensure the infrastructure meets the high-availability and low-latency requirements of AI workloads. However, building large-scale pre-deployment GPU labs is costly; it essentially demands the same infrastructure as the production environment.

Nevertheless, testing remains critical for gauging the performance of the accelerators and the network fabric, and for guaranteeing SLAs to buyers. That is why many infrastructure providers are actively collaborating with testing and benchmarking companies.

CoreWeave, for example, used the industry-standard MLPerf benchmark suite developed by MLCommons, an open engineering consortium, to evaluate the performance of its AI cloud platform. According to its recent MLPerf Training v5.0 results, “CoreWeave completed the challenging Llama 3.1 405B model training benchmark in only 27.3 minutes, more than twice as fast as similarly sized GPU clusters utilizing NVIDIA Hopper GPUs,” the company shared in a blog post in June.

Quantifiable results like these require faithful replication of real-world AI workloads, something that is impossible without the right set of emulation tools.

“The CoreWeave-Meta deal fundamentally shifts the focus of cloud infrastructure from simple availability to guaranteed, measurable AI performance, requiring an unprecedented level of rigorous testing and validation across the entire technology stack,” said Ron Westfall, analyst in residence at HyperFRAME Research.

Westfall added that this requires deep, real-time visibility into the GPUs, the thermal systems and the network fabric supporting it all, in order to meet SLAs and sidestep the costly penalties that can come with system failures.

GPUs are expensive hardware. On top of that, their heavy energy consumption and high cooling demands together account for 30% to 50% of OpEx. With servers sometimes packing up to eight high-end GPUs in a single chassis, extracting more useful GPU hours becomes all the more critical. Yet studies have shown that the accelerators sit idle 30% to 80% of the time. A big part of the problem is the back-end network fabric: if the network cannot sustain lossless data transfer at low latency, it becomes the bottleneck itself. In a webinar, Spirent stated that even 1% packet loss can degrade GPU performance by as much as 30%, translating into millions of dollars in losses.
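To put rough numbers on that, here is an illustrative back-of-the-envelope calculation. The cluster size, hourly rate, and idle fraction below are assumptions chosen for illustration, not figures from CoreWeave, Meta, or Spirent; only the 30–80% idle range and the 30% degradation figure come from the sources cited above.

```python
# Illustrative back-of-the-envelope math on wasted GPU capacity.
# All inputs are assumptions for illustration only.

gpus = 16_000                 # assumed cluster size
hourly_rate = 3.00            # assumed billable $/GPU-hour
hours_per_year = 24 * 365

# Assume GPUs sit idle 40% of the time (within the 30-80% range cited above).
idle_fraction = 0.40
idle_cost = gpus * hourly_rate * hours_per_year * idle_fraction

# Assume 1% packet loss cuts effective throughput by 30% (the Spirent figure),
# applied only to the hours the GPUs are actually busy.
busy_gpu_hours = gpus * hours_per_year * (1 - idle_fraction)
degradation_cost = busy_gpu_hours * hourly_rate * 0.30

print(f"Idle-time cost per year:       ${idle_cost:,.0f}")
print(f"Packet-loss degradation cost:  ${degradation_cost:,.0f}")
```

Even with these conservative assumptions, the wasted capacity at cluster scale runs well into the hundreds of millions of dollars per year, which is why fabric validation gets so much attention.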

To monetize this new kind of infrastructure and remain profitable, companies need to identify and eliminate bottlenecks before deployment. This requires end-to-end validation workflows and up-to-date testing methodologies that can accurately replicate real-world workloads and high-throughput training behavior. The tests reveal how the system behaves under continuous operation and expose degradation under stress, memory error rates, channel failures, overheating, power sag, cooling inefficiency and other issues that could erode service quality – and are far more costly to fix post-production.
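As a simplified sketch of what such a pre-deployment validation gate might look like, the snippet below checks burn-in metrics against acceptance thresholds. The metric names and threshold values are hypothetical examples, not taken from any vendor’s test suite.

```python
# Minimal sketch of a pre-deployment validation gate.
# Metric names and thresholds are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class BurnInResult:
    """Aggregated metrics from a sustained pre-deployment stress run."""
    avg_gpu_utilization: float      # fraction of time GPUs did useful work
    packet_loss_pct: float          # observed loss on the back-end fabric
    p99_fabric_latency_us: float    # 99th-percentile network latency (microseconds)
    uncorrectable_mem_errors: int   # ECC errors that could not be corrected
    max_gpu_temp_c: float           # hottest GPU observed during the run

# Example acceptance thresholds an operator might set before taking a cluster live.
THRESHOLDS = {
    "avg_gpu_utilization": lambda r: r.avg_gpu_utilization >= 0.70,
    "packet_loss_pct": lambda r: r.packet_loss_pct <= 0.01,
    "p99_fabric_latency_us": lambda r: r.p99_fabric_latency_us <= 10.0,
    "uncorrectable_mem_errors": lambda r: r.uncorrectable_mem_errors == 0,
    "max_gpu_temp_c": lambda r: r.max_gpu_temp_c <= 85.0,
}

def validate(result: BurnInResult) -> list[str]:
    """Return the names of failed checks; an empty list means the cluster passes."""
    return [name for name, check in THRESHOLDS.items() if not check(result)]

if __name__ == "__main__":
    run = BurnInResult(0.74, 0.02, 8.5, 0, 81.0)
    failures = validate(run)
    print("PASS" if not failures else f"FAIL: {failures}")
```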

Westfall noted that AI is a holistic challenge: constant testing and monitoring of interdependent systems is key to eliminating wasted GPU hours and getting the most performance out of the infrastructure while maintaining service quality.
