Cirrascale’s ‘more horses for courses’ strategy is to test and deploy on every leading AI accelerator, with bare-metal servers tuned for training, inference, and inference-as-a-service.
Cirrascale founder and CEO David Driggers is a pioneering figure in HPC, known for his work on dense hardware architecture and multi-GPU processing efficiency. In the early 2010s, Driggers started Cirrascale as a manufacturer of deep-learning servers like the GPU8, then transitioned the company to hardware-to-cloud and GPU-as-a-service, making it one of the first “neoclouds” dedicated to heavy-duty, bare-metal hardware for AI training. The company is now shifting toward enterprise-focused dedicated inferencing and inference-as-a-service for Fortune 500 companies.
Cirrascale has been a specialist in AI inferencing, powering massive open scientific models and deploying frontier AI inside sovereign environments, differentiating on private AI, bare-metal hardware diversity, and serverless inference-as-a-service. “We are different in that unlike other neocloud providers, we are not ‘new’ and we come from a hardware background,” says Driggers, noting he collaborated early on with OpenAI “when it had 8 people.” Cirrascale has since grown to power Essential AI’s 1,000-GPU Lenovo ThinkSystem and AMD Instinct-based training platform and Ai2’s Scientific AI Initiative, and to serve as an operational partner for the Nvidia-backed Open Multimodal AI Infrastructure to Accelerate Science (OMAI), for which Cirrascale manages the open infrastructure that enables the OLMo and Molmo models.
Accelerator diversity: ‘More horses for courses’
Driggers has long contended that the brute-force tools built to handle training and virtually anything thrown at them are not necessary and fundamentally inefficient for inference. “With the current generation of AI, which is the third major wave, the model delta – the difference in size from the smallest usable models to the largest usable models – is orders of magnitude. It used to be that the biggest and smallest models were pretty close to one another, but now you’ve got billion-parameter models that are usable in generative AI and LLMs, all the way up to multi-trillion parameter models.”
According to Driggers, a one-size-fits-all approach is impossible from an accelerator perspective. “As we move to a mixture of experts and we move to multimodal type inferencing where you may be integrating audio, video, plus text, and ultimately spatial, different accelerators will excel at different things.” Highlighting the stark economic and technical differences between training a model and running inference at scale, Driggers believes it’ll be very important in inferencing to find the right platform for different needs, whether that’s for ultra-low latency, energy efficiency, lowest possible cost per token, or other requirements. “You will have to seek the smallest, simplest unit your model will fit into, and then push it down the technology stack as far as you can go…while still meeting your latency requirements – your time to first token.”
That advice comes from the fact, as Driggers puts it, “that every semiconductor company charges more, the higher you move up their technology stack,” which he says means, “charging per flop and per megabyte of memory. That’s why you want to push the performance and memory stack as far down as you can go to where you still hit your latency. If you get too low, you’ve got to step back up. Or, if it doesn’t fit on one GPU, you have to split your job across two, which means significant loss in efficiency and difficulty in deployment.”
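The "push it down the stack" rule can be read as a simple selection problem: among the platforms where the model fits on a single device and the time to first token is within budget, take the cheapest. The sketch below is only an illustration of that reading. The `Tier` class, the tier names, and every number in it are hypothetical placeholders, not Cirrascale pricing or measured benchmarks.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Tier:
    """One rung of a hypothetical accelerator stack (all values illustrative)."""
    name: str           # made-up label, not a real product
    memory_gb: float    # device memory available to the model
    hourly_cost: float  # price per device-hour, arbitrary currency units
    ttft_ms: float      # assumed measured time to first token for this model


def pick_tier(tiers: Sequence[Tier], model_gb: float,
              max_ttft_ms: float) -> Optional[Tier]:
    """Return the cheapest tier that holds the model on one device and
    meets the latency target, or None if no single device qualifies."""
    candidates = [
        t for t in tiers
        if t.memory_gb >= model_gb and t.ttft_ms <= max_ttft_ms
    ]
    # default=None covers the case where the job would have to be split
    # across devices, which the article notes costs efficiency.
    return min(candidates, key=lambda t: t.hourly_cost, default=None)


tiers = [
    Tier("small", memory_gb=24, hourly_cost=1.0, ttft_ms=900),
    Tier("medium", memory_gb=48, hourly_cost=2.5, ttft_ms=400),
    Tier("large", memory_gb=80, hourly_cost=6.0, ttft_ms=150),
]

# A 20 GB model fits on "small", but "small" misses a 500 ms target,
# so the selection steps back up one rung.
choice = pick_tier(tiers, model_gb=20, max_ttft_ms=500)
print(choice.name if choice else "no single-device fit")
```

With a looser 1,000 ms target the same model lands on the cheapest tier, and a 100 GB model returns `None`, which corresponds to the case Driggers describes where the job has to be split across two devices.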
Driggers says with inferencing, “you’re in production,” so once you hit your required speed, it’s all about cost per token. “If you don’t push it down, it could be too cost prohibitive to run, and that’s after you’ve trained a model, built a rack, and fine-tuned it. If it’s a profit center for you, saving 10% may double your net margin, so if you can drop an extra 10 points, and you’re only making 10, you double your margin. For inferencing, it really does matter to get the right horse for the right course.”
That course is chosen by many factors. For example, an acceptable time to first token (TTFT) means different things in different scenarios. In batch processing of PDFs or OCR jobs, you can be talking about days. With chatbots, you could need near real-time. With fact checking, you may need real-time. And when it comes to screening for bad actors, fraud, viruses, or child pornography, you need it faster still to keep the bad things out.
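The margin arithmetic in that quote is easy to check with round numbers. The sketch below assumes "10 points" means ten percentage points of revenue, and the figures (revenue of 100, cost of 90) are invented purely for illustration.

```python
def net_margin(revenue: float, cost: float) -> float:
    """Net margin as a fraction of revenue."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return (revenue - cost) / revenue


revenue = 100.0  # illustrative revenue from serving tokens
cost = 90.0      # illustrative serving cost, leaving a 10-point margin

before = net_margin(revenue, cost)
# Cut serving cost by 10 points of revenue (90 -> 80).
after = net_margin(revenue, cost - 10.0)

print(f"margin before: {before:.0%}, after: {after:.0%}")
```

Under these assumptions the margin goes from 10% to 20%. Note that cutting the cost itself by 10% (90 to 81) would lift the margin to 19%, which is close to, but not exactly, a doubling.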
Because different architectures require different silicon options, Cirrascale supports a multitude of platforms, including:
- NVIDIA: (HGX B200, H100, and Tensor Core GPUs)
- AMD: (Instinct series accelerators)
- Qualcomm: (Cloud AI 100 Ultra)
- Others: Cerebras, Tenstorrent, and SambaNova
As an example, Cirrascale’s AI Innovation Cloud has scaled with massive AMD Instinct MI300X clusters developed with Lenovo to power training and inference pipelines for companies like Essential AI. To further expand beyond traditional GPUs, Cirrascale also has a commercial deployment of Tenstorrent Galaxy Blackhole servers, with RISC-V-based AI processors bypassing GPU supply constraints in efforts to cut the per-token costs in inference-heavy workloads.
The challenge for enterprises is knowing the differences and the nuances of where the hardware would ideally fit. “This is why with our inference-as-a-service, we get into the model, taking it from the client, running it, and working with them on what their SLA, time to first token, or regional latency challenges are,” says Driggers, noting another question to ask is, “Do they have a real-time application that follows the sun from the east coast to the west, and then perhaps onto Asia and Europe?” Once the key questions are answered, Cirrascale works with enterprises to establish the SLA requirements and run and test the models to ensure the best platform for the use case is chosen. “We figure out the price per token to meet their SLA – if it’s real time, near time, batch/offload, and so on.”
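One way to reason about price per token is to divide what a device costs per hour by the tokens it actually serves in that hour. The function below is a back-of-the-envelope sketch under that assumption, not Cirrascale's pricing method. The parameter names and all numbers are hypothetical.

```python
def cost_per_million_tokens(hourly_cost: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Cost to serve one million tokens on a device.

    hourly_cost:        price of the device per hour (arbitrary units)
    tokens_per_second:  sustained throughput while the device is busy
    utilization:        fraction of the hour the device is doing work
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_cost / tokens_per_hour * 1_000_000


# Illustrative numbers only: a device at 3.6 units/hour serving
# 1,000 tokens/second.
fully_busy = cost_per_million_tokens(3.6, 1000)
half_idle = cost_per_million_tokens(3.6, 1000, utilization=0.5)
print(f"{fully_busy:.2f} vs {half_idle:.2f} per million tokens")
```

In this toy example, a device that sits idle half the time costs twice as much per token as one that is kept busy, which is why the SLA class (real time, near time, or batch) matters to the quoted price.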
Who are the customers and what do they need?
This year, Driggers says, enterprise adoption is still fairly new, with most enterprises getting through POC phases and employees trying Gemini, Copilot, ChatGPT, and Claude to see if they can be more efficient or productive. “For coding, it’s now 70-80% adoption, but for agentic AI, chatbots, and support, many enterprises are reluctant to go into the general cloud. They don’t want their trade secrets getting out.” Driggers says one of his biggest hopes for the coming year is to see open source models evolve to be good enough that “you don’t have to go to one of the monster frontier models; or, that frontier models will become available as cut-down open source versions.”
Driggers champions private AI for Fortune 1000 companies, and sectors like defense, healthcare, and finance, which he says want to run frontier-level models, but without exposing sensitive data to public cloud layers. “The customer always wins and if they want something private and under their control, they’re going to get it. It’s too big a market to ignore.”
To target those enterprises and sectors, Driggers says Cirrascale offerings are falling into three main camps:
Dedicated training, which according to Driggers is primarily leveraged by well-funded startups in later stages of their development. For example, the Allen Institute for AI (Ai2), founded by Paul Allen, is a non-profit that wants to remain completely open. “As a non-profit, they have funding rules about how much they can spend on Capex, or on people, jobs, and so on,” explains Driggers. “This is where we differentiate from hyperscalers and most of the neoclouds in that we allow our customers to own some of the equipment if they want to. We leverage that part and turn it into a cloud service for them. Normally they buy the platform from us…we can sell the equipment at a low margin and then it’s easier for us to maintain, just like in a normal cloud.”
Dedicated inferencing, which targets organizations that require highly secure, regulatory-compliant environments and want to bypass the data-privacy risks of public clouds. “This is where we move into startups that are in production, or Fortune 500s that are delivering something ‘as a service.’” Here, the customers are health, finance, public sector, and defense contractors that want to avoid standard multi-tenant cloud APIs. In addition to high-profile AI labs and research centers like Ai2, Cirrascale is also working with the National Science Foundation and Google Public Sector on enterprise AI.
Inference-as-a-service, which is a serverless, pay-per-token model targeting primarily Fortune 500 companies and enterprises needing predictable costs per token and production that’s multi-region with a guaranteed cost. “They don’t want variable costs like with hyperscalers, so with us, what they have at start of month is what they have at end of month in terms of cost,” says Driggers, adding that “it’s hard to build on prem for peak if you’re going across multiple regions because your peak goes down and then you have hardware sitting. We move the workload and repurpose the hardware for other workloads when it’s not doing something real time.”
Across sectors, Cirrascale forks the LLM to optimize for different hardware platforms:
- The Nvidia Fork: This version is compiled and optimized to leverage Nvidia’s specialized Tensor Cores and CUDA libraries.
- The Qualcomm Fork: This version is modified to heavily prioritize extreme energy efficiency and low cost-per-token using Qualcomm’s unique neural architecture.
- The AMD Fork: This version is tailored to make full use of AMD’s specific high-bandwidth memory (HBM) layouts.
By forking the model, Cirrascale ensures that no matter which “horse” an enterprise chooses for their “course,” the LLM will deliver the highest possible speed, lowest latency, and best cost efficiency. “The batch is filling in your holes, and getting your efficiency from a cloud perspective, gets our hardware efficiencies up, which is where we are able to lower the price on the real time and back fill the funnel with the batch.”