AI accelerator diversity for high-performance enterprise inference

Cirrascale’s ‘horses for courses’ strategy is to test and deploy on every leading AI accelerator, with bare-metal servers fine-tuned for training, inference, and inference-as-a-service.

Cirrascale founder and CEO David Driggers is a pioneering figure in HPC, known for his work on dense hardware architecture and multi-GPU processing efficiency. In the early 2010s, Driggers founded Cirrascale as a hardware manufacturer of deep-learning servers like the GPU8, then transitioned the company to hardware-to-cloud and GPU-as-a-service. It became one of the first “neoclouds” dedicated to heavy-duty, bare-metal hardware for AI training, and it is now shifting toward enterprise-focused dedicated inferencing and inference-as-a-service for Fortune 500 companies.

Cirrascale has been a specialist in AI inferencing, powering massive open scientific models and deploying frontier AI inside sovereign environments, with dedicated, bare-metal servers optimized for deep learning algorithms and enterprise customers. “We are different in that unlike other neocloud providers, we are not ‘new’ and we come from a hardware background,” says Driggers, noting that he collaborated early on with OpenAI “when it had 8 people.” Cirrascale has since grown to power Essential AI’s 1,000-GPU Lenovo ThinkSystem and AMD Instinct-based training platform and Ai2’s Scientific AI Initiative, and to serve as an operational partner for the Nvidia-backed Open Multimodal AI Infrastructure to Accelerate Science (OMAI), for which Cirrascale manages the open infrastructure that enables the OLMo and Molmo models.

Accelerator diversity: ‘More horses for courses’

Driggers has long contended that the brute-force tools built to handle training and virtually anything thrown at them are not necessary in inference. “With the current generation of AI, which is the third major wave, the model delta – the difference in size from the smallest usable models to the largest usable models – is orders of magnitude. It used to be that the biggest and smallest models were pretty close to one another, but now you’ve got billion-parameter models that are usable in generative AI and LLMs, all the way up to multi-trillion parameter models.”

According to Driggers, a one-size-fits-all approach is impossible from an accelerator perspective. “As we move to a mixture of experts and we move to multimodal type inferencing where you may be integrating audio, video, plus text, and ultimately spatial, different accelerators will excel at different things.” He says it will be very important in inferencing to find the right platform for different needs, whether that’s ultra-low latency, energy efficiency, lowest possible cost per token, or other requirements. “You will have to seek the smallest, simplest unit your model will fit into, and then push it down the technology stack as far as you can go…while still meeting your latency requirements – your time to first token.”

According to Driggers, “every semiconductor company charges more, the higher you move up their technology stack,” charging per flop and per megabyte of memory. “You want to push the performance and memory stack as far down as you can go to where you still hit your latency. If you get too low, you’ve got to step back up. Or, if it doesn’t fit on one GPU, you have to split your job across two, which means significant loss in efficiency and difficulty in deployment,” he explains.
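The “push it down the stack” rule can be sketched as a small selection routine: among the options the model fits on and that meet the latency target, take the cheapest. This is only an illustration of the reasoning, not Cirrascale’s tooling. The `Tier` type, the `pick_tier` function, the tier names, and every number below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Tier:
    """One rung of a hypothetical accelerator stack (all figures are made up)."""
    name: str
    memory_gb: float                # memory available on a single device
    ttft_ms: float                  # assumed measured time to first token for the model
    cost_per_million_tokens: float  # assumed price, in arbitrary currency units


def pick_tier(
    tiers: Sequence[Tier], model_memory_gb: float, max_ttft_ms: float
) -> Optional[Tier]:
    """Return the cheapest tier that fits the model and meets the TTFT target."""
    fits = [
        t for t in tiers
        if t.memory_gb >= model_memory_gb and t.ttft_ms <= max_ttft_ms
    ]
    # None means no single device qualifies; the job would have to be split
    # across devices or the latency target relaxed.
    return min(fits, key=lambda t: t.cost_per_million_tokens, default=None)


# Hypothetical stack, ordered from the bottom (cheapest) to the top.
TIERS = [
    Tier("small", memory_gb=32, ttft_ms=900, cost_per_million_tokens=0.20),
    Tier("mid", memory_gb=80, ttft_ms=350, cost_per_million_tokens=0.50),
    Tier("large", memory_gb=192, ttft_ms=120, cost_per_million_tokens=1.20),
]

choice = pick_tier(TIERS, model_memory_gb=40, max_ttft_ms=500)
print(choice.name if choice else "no single-device fit")
```

With these made-up figures, a 40 GB model with a 500 ms target lands on the middle tier: the bottom tier is too small and too slow, and the top tier meets the target but costs more per token.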

Driggers says that with inferencing, you’re “in production,” so once you hit your required speed, it’s all about cost-per-token. “If you don’t push it down, it could be too cost prohibitive to run, and that’s after you’ve trained a model, built a rack, and fine-tuned it. If it’s a profit center for you, saving 10% may double your net margin, so if you can drop an extra 10 points, and you’re only making 10, you double your margin. For inferencing, it really does matter to get the right horse for the right course.”
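The margin arithmetic in that quote can be checked with a few lines. The sketch below assumes “points” means percentage points of revenue, and the revenue and cost figures are illustrative values chosen to match the quote, not data from Cirrascale. The helper name `margin_points` is hypothetical.

```python
def margin_points(revenue: float, cost: float) -> float:
    """Net margin expressed in percentage points of revenue."""
    return 100.0 * (revenue - cost) / revenue


# Illustrative figures: a service that keeps 10 points of margin.
revenue = 100.0
before = margin_points(revenue, cost=90.0)

# Dropping cost by 10 points of revenue, for example by moving the model
# to a cheaper accelerator that still meets the latency target.
after = margin_points(revenue, cost=80.0)

print(f"margin before: {before:.0f} points")
print(f"margin after:  {after:.0f} points")
print(f"ratio:         {after / before:.1f}x")
```

The thinner the starting margin, the larger the effect of the same cost reduction, which is why the argument applies most strongly to inference run as a profit center.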

That course is determined by many factors, such as time to first token (TTFT). Batch processing of PDFs or OCR can span days, chatbots may require near-real-time responses, and fact checking may need real-time responses. “If you want to make sure no bad actor, no fraud, no virus, no child pornography gets in, then you need it faster.”

Because enterprises need varied silicon options to test and deploy across myriad architectures, Driggers wants to support all of them, including:

  • NVIDIA: (HGX B200, H100, and Tensor Core GPUs)
  • AMD: (Instinct series accelerators)
  • Qualcomm: (Cloud AI 100 Ultra)
  • Others: Cerebras, Tenstorrent, and SambaNova

In addition, Cirrascale’s AI Innovation Cloud has scaled with massive AMD Instinct MI300X clusters, developed with Lenovo, to power training and inference pipelines for companies like Essential AI. To further expand beyond traditional GPUs, Cirrascale also has a commercial deployment of Tenstorrent Galaxy Blackhole servers, whose RISC-V-based AI processors bypass GPU supply constraints to cut per-token costs in inference-heavy workloads.

The challenge for enterprises is knowing the differences and the nuances of where the hardware would ideally fit. “This is why with our inference-as-a-service, we get into the model, taking it from the client, running it, and working with them on what their SLA, time to first token, or regional latency challenges are. Do they have a real-time application that follows the sun from the east coast to the west, and then perhaps onto Asia and Europe?” Once the key questions are answered, Cirrascale works with enterprises to establish the SLA requirements and run and test the models to ensure the best platform for the use case is chosen. “We figure out the price per token to meet your SLA – if it’s real time, near time, batch/offload, and so on.”

Who are the customers and what do they need?

This year, Driggers says, enterprise adoption is still fairly new, with most enterprises getting through POC phases and employees trying Gemini, Copilot, ChatGPT, and Claude to see if they can be more efficient or productive. “For coding, it’s now 70-80% adoption, but for agentic AI, chatbots and supports, many enterprises are reluctant to go into the general cloud. They don’t want their trade secrets getting out.” Driggers says one of his biggest hopes for the coming year is to see open source models evolve to be good enough that “you don’t have to go to one of the monster frontier models; or, that frontier models will become available as cut-down open source versions.”

Driggers says he champions private AI for Fortune 1000 companies, and sectors like defense, healthcare, and finance, which he says want to run frontier-level models without exposing sensitive data to public cloud layers. “The customer always wins and if they want something private and under their control, they’re going to get it. It’s too big a market to ignore.”

To target those enterprises and sectors, Driggers says Cirrascale’s offerings fall into three main camps:

Dedicated training is, according to Driggers, primarily leveraged by well-funded startups in later stages of their development. For example, the Allen Institute for AI (Ai2), founded by Paul Allen, is a non-profit that wants to remain completely open. “As a non-profit, they have funding rules about how much they can spend on Capex, or on people, jobs, and so on,” explains Driggers. “This is where we differentiate from hyperscalers and most of the neoclouds in that we allow our customers to own some of the equipment if they want to. We leverage that part and turn it into a cloud service for them. Normally they buy the platform for us…we can sell the equipment at a low margin and then it’s easier for us to maintain, just like in a normal cloud.”

Dedicated inferencing targets organizations that require highly secure, regulatory-compliant environments and want to bypass the data-privacy risks of public clouds. “This is where we move into startups that are in production, or Fortune 500s that are delivering something ‘as a service.’” Here, the customers are health, finance, public sector, and defense contractors that want to avoid standard multi-tenant cloud APIs. In addition to high-profile AI labs and research centers like Ai2, Cirrascale is also working with the National Science Foundation and Google Public Sector on enterprise AI.

Inference-as-a-service is a serverless, pay-per-token model aimed primarily at Fortune 500 companies and enterprises that need multi-region production with a predictable, guaranteed cost per token. “They don’t want variable costs like with hyperscalers, so with us, what they have at start of month is what they have at end of month in terms of cost,” says Driggers, adding that “it’s hard to build on prem for peak if you’re going across multiple regions because your peak goes down and then you have hardware sitting. We move the workload and repurpose the hardware for other workloads when it’s not doing something real time.”

Across sectors, Cirrascale forks the LLM to optimize for different hardware platforms:

  • The Nvidia Fork: This version is compiled and optimized to leverage Nvidia’s specialized Tensor Cores and CUDA libraries.
  • The Qualcomm Fork: This version is modified to heavily prioritize extreme energy efficiency and low cost-per-token using Qualcomm’s unique neural architecture.
  • The AMD Fork: This version is tailored to make full use of AMD’s specific high-bandwidth memory (HBM) layouts.

By forking the model, Cirrascale ensures that no matter which “horse” an enterprise chooses for their “course,” the LLM will deliver the highest possible speed, lowest latency, and best cost efficiency. “The batch is filling in your holes, and getting your efficiency from a cloud perspective, gets our hardware efficiencies up, which is where we are able to lower the price on the real time and back fill the funnel with the batch.”
