The memory wall is more of an issue than ever in AI workloads. How will it be fixed?
As AI workloads scale, compute performance is increasing far faster than memory bandwidth. Modern accelerators can perform trillions of operations per second, but their efficiency is often limited by how quickly data can move in and out of memory, a long-standing issue known as the memory wall.
The problem is especially prominent in AI data centers, where large model training and inference workloads demand both high-capacity and high-bandwidth memory systems. Even with high-bandwidth memory and advanced interconnects, memory access remains the bottleneck for many architectures that, on paper, have compute to spare.
The industry is addressing this through a multi-pronged approach – scaling capacity and channels to provide more memory, and redesigning architectures for higher efficiency and locality. Both strategies are helping shape next-generation architectures, but is one approach proving more important?
Origins of the memory wall
The term memory wall was coined in the mid-1990s to describe the growing performance gap between processors and memory. While compute speeds doubled every few years under Moore’s Law, memory latency and bandwidth improved far more slowly. This imbalance became a defining constraint in high-performance computing (HPC), where processors frequently stalled waiting for data to arrive from DRAM.
AI and accelerated computing have amplified the issue. Modern GPUs and AI accelerators rely on massive parallelism, executing thousands of simultaneous threads that each depend on rapid data access. Training a large language model, for example, involves moving terabytes of parameter and activation data between compute units and memory every second. Even with high-bandwidth memory (HBM), the effective bandwidth per FLOP has not kept pace with increases in compute density.
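To see why bandwidth per FLOP matters, a simple roofline-style calculation helps. The sketch below uses illustrative, assumed figures for peak compute and memory bandwidth (not any particular accelerator's specifications) to show how an operation's arithmetic intensity, the FLOPs it performs per byte moved, determines whether it is compute-bound or memory-bound.

```python
# Back-of-the-envelope roofline check: is an operation compute-bound or
# memory-bound on a hypothetical accelerator? All numbers are illustrative
# assumptions, not vendor specifications.

PEAK_FLOPS = 1.0e15   # assumed 1 PFLOP/s of dense math
PEAK_BW    = 3.4e12   # assumed 3.4 TB/s of HBM bandwidth

# Ridge point: FLOPs a kernel must perform per byte moved before compute,
# rather than memory bandwidth, becomes the limiter.
ridge = PEAK_FLOPS / PEAK_BW
print(f"ridge point: ~{ridge:.0f} FLOPs per byte")

def attainable_flops(arithmetic_intensity):
    """Roofline model: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(PEAK_FLOPS, PEAK_BW * arithmetic_intensity)

# An elementwise fp32 add moves ~12 bytes per FLOP (intensity ~0.08),
# while a large matrix multiply can reach hundreds of FLOPs per byte.
for name, intensity in [("elementwise add", 1 / 12), ("large matmul", 300)]:
    frac = attainable_flops(intensity) / PEAK_FLOPS
    print(f"{name:>16}: reaches {frac:.2%} of peak compute")
```

Under these assumed figures, the bandwidth-limited elementwise operation reaches well under one percent of peak compute, while the matrix multiply saturates it, which is the imbalance the memory wall describes.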
“A significant share of AI performance and energy cost comes from data movement, not computation,” said Karthik Sj, General Manager of AI at LogicMonitor, in an interview with RCRTech. “Current architectures are approaching limits because data transfer—between memory, storage, and compute—consumes far more energy than arithmetic itself. The entire AI datacenter stack is being reengineered to reduce this bottleneck.”
At the system level, interconnects between GPUs and nodes introduce further latency. As models are distributed across clusters, data often travels through PCIe, NVLink, or InfiniBand fabrics. The result is a chain of bottlenecks, each layered on top of the last.
More memory: Scaling capacity and channels
One approach to addressing the memory wall is simply to add more memory. As AI models continue to grow, expanding total capacity and bandwidth has become a priority for both chipmakers and data center operators. High-bandwidth memory (HBM) has evolved rapidly over the past few generations, with HBM3E now offering over 1.2TB/s per stack and HBM4 reaching 2TB/s.
Micron in particular has already begun sampling HBM4 stacks that deliver up to 2.8TB/s of bandwidth, while SK Hynix, a key Nvidia supplier, says it too has completed development of the next-generation technology. Whether either can scale supply to meet demand remains to be seen.
Vendors are scaling capacity aggressively. Nvidia’s Blackwell GPUs ship with up to 192GB of HBM3E, while AMD’s MI450 architecture will push to a massive 432GB of HBM4. Similar scaling is appearing across accelerators and custom ASICs as hyperscalers optimize systems for model training efficiency. The increase in capacity enables larger models to fit within single nodes, reducing cross-node communication overhead.
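A rough sizing sketch shows why capacity translates so directly into fewer cross-node transfers. The per-parameter byte counts and the hypothetical eight-GPU node below are assumptions chosen only to illustrate the arithmetic, not measurements of any specific system.

```python
# Rough sizing sketch: does a model's training state fit in one node's HBM?
# Byte counts and node configuration are illustrative assumptions.

def training_footprint_gb(params_billion, bytes_per_param=2,
                          optimizer_bytes_per_param=12):
    """Weights (e.g. bf16) plus optimizer state (e.g. fp32 master weights
    and Adam moments); activations are ignored for simplicity."""
    params = params_billion * 1e9
    return params * (bytes_per_param + optimizer_bytes_per_param) / 1e9

NODE_HBM_GB = 8 * 192   # hypothetical 8-GPU node with 192 GB of HBM per GPU

for size_b in (70, 180, 400):
    need = training_footprint_gb(size_b)
    verdict = "fits in one node" if need <= NODE_HBM_GB else "needs sharding across nodes"
    print(f"{size_b}B params -> ~{need:,.0f} GB of state: {verdict}")
```

Under these assumptions a 70-billion-parameter model fits within a single node's HBM, while larger models spill across the interconnect, which is exactly the overhead bigger stacks are meant to avoid.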
Beyond local memory, data center operators are experimenting with pooled and disaggregated architectures. NVLink and NVSwitch allow multiple GPUs to share a common memory space, while CXL-based memory expanders extend addressable capacity across servers. Optical interconnects and rack-scale disaggregation are emerging as next steps, potentially decoupling memory from compute entirely.
These designs come with trade-offs. HBM stacks are expensive to manufacture and thermally demanding, and yield challenges increase as stack height grows. Power consumption also scales with bandwidth, often offsetting efficiency gains. Even so, for large-scale training where model size dominates, simply having more memory remains one of the most direct ways to push back against the wall.
Better memory: Architectural advances
While increasing capacity helps scale model size, much of the industry’s attention is shifting toward architectural improvements that make memory faster, closer, and more efficient. The goal is not just to expand memory, but to shorten the distance between compute and data, reducing latency and improving energy efficiency.
“Chiplets and advanced packaging are near-term solutions that help mitigate the memory wall by bringing compute and memory physically closer, reducing latency and improving throughput,” said Sj. “Photonic interconnects, while further out, represent a more fundamental long-term path—enabling data transfer at higher bandwidth and far lower energy cost over distance.”
At the hardware level, this shift is visible in new 3D designs that integrate DRAM directly with compute dies. Techniques such as hybrid bonding and through-silicon vias (TSVs) enable tighter coupling between logic and memory layers, improving bandwidth density while lowering interconnect energy.
The interface layer is also advancing. CXL 3.0 introduces full cache coherence between devices, allowing accelerators and CPUs to share memory with minimal software overhead. Combined with PCIe 6.0, it opens the door for memory systems that behave as a single addressable pool. This enables new deployment models, such as shared memory fabrics and memory tiering, where high-speed local HBM is used together with slower but larger storage-class memory.
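The sketch below illustrates what such tiering can look like in software terms: a toy placement policy that keeps the most frequently accessed tensors in scarce local HBM and spills the rest to a larger, slower CXL-attached pool. Tier capacities, tensor names, and access counts are all assumptions made for illustration, not a description of any shipping memory manager.

```python
# Minimal sketch of a two-tier placement policy: hottest tensors go to
# fast local HBM, the rest spill to a larger CXL-attached pool.
# Capacities, names, and access counts are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size_gb: float
    accesses_per_step: int   # crude "hotness" signal

def place(tensors, hbm_capacity_gb=192, cxl_capacity_gb=1024):
    """Greedy hotness-first placement across two memory tiers."""
    placement, hbm_used, cxl_used = {}, 0.0, 0.0
    for t in sorted(tensors, key=lambda t: t.accesses_per_step, reverse=True):
        if hbm_used + t.size_gb <= hbm_capacity_gb:
            placement[t.name], hbm_used = "HBM", hbm_used + t.size_gb
        elif cxl_used + t.size_gb <= cxl_capacity_gb:
            placement[t.name], cxl_used = "CXL pool", cxl_used + t.size_gb
        else:
            placement[t.name] = "unplaced"
    return placement

tensors = [
    Tensor("attention weights", 80,  accesses_per_step=1000),
    Tensor("kv cache",          150, accesses_per_step=800),
    Tensor("optimizer state",   300, accesses_per_step=1),
]
print(place(tensors))
```

Real schedulers weigh far more than access counts, but the design choice is the same: keep hot data close and let cold data live in the cheaper, larger tier.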
“Nvidia’s massive investments in NVLink and InfiniBand prove that interconnect bandwidth is the real bottleneck,” said Charles Yeomans, CEO and co-founder of Atombeam, a data compaction and optimization company, in an interview with RCRTech. “But they still have a fundamental architectural problem, because they are moving a ton of redundant data. The solution requires both ultra-fast interconnects and dramatic data reduction designed specifically for machine-generated patterns.”
The HBM4 generation, which is just starting to roll out, combines several of these advances, pushing bandwidth beyond 2TB/s per stack (Micron claims 2.8TB/s) while improving power efficiency. As packaging and interconnect technologies continue to improve, “better” memory increasingly means building memory into the compute fabric itself, not simply improving DRAM performance.
Beyond the memory wall
Efforts to overcome the memory wall are less about choosing between more memory and better memory than about combining the two. Capacity scaling and architectural innovation are converging into unified designs where memory, interconnect, and compute work together far more closely.
HBM4, CXL fabrics, and near-memory compute exemplify this shift: the first raises local bandwidth and density, the second extends flexibility and reach, and the third cuts data movement at its source. Together, they reflect a broader industry realignment toward memory-centric computing.
However, overcoming the memory wall could also come through more efficient data optimization techniques.
“The memory wall isn’t just about pipe speed, the hardware question – it’s also very much about the absurd amount of redundant data we’re pushing through those pipes because our architectures treat knowledge as flat parameter arrays rather than structured, reducible patterns,” continued Yeomans.
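Quantization is one simple, well-established way to make that point concrete. It is not Atombeam's technique, but the arithmetic of data reduction is the same: fewer bytes per parameter means proportionally fewer bytes crossing every interconnect. The sketch below quantizes a synthetic fp32 weight tensor to int8 and compares the resulting traffic.

```python
# Quantization as one illustration of the data-reduction idea in the quote
# above (not Atombeam's method): fewer bytes per parameter means
# proportionally less traffic over the same pipes.

import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(1_000_000).astype(np.float32)  # synthetic tensor

# Symmetric int8 quantization with a single fp32 scale per tensor.
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

bytes_fp32 = weights.nbytes
bytes_int8 = q.nbytes + 4   # int8 payload plus the stored scale
error = np.abs(weights - q.astype(np.float32) * scale).mean()

print(f"traffic: {bytes_fp32/1e6:.1f} MB -> {bytes_int8/1e6:.1f} MB "
      f"({bytes_fp32/bytes_int8:.1f}x less), mean abs error {error:.4f}")
```

The four-fold reduction here is the simplest case; the broader argument in the quote is that structured, pattern-aware reduction of machine-generated data can go further than dropping precision alone.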
Ultimately, the future of AI infrastructure won’t be decided by how much memory can be stacked, but by how intelligently that memory is placed, accessed, and shared – while at the same time leveraging new techniques to compress the data itself. The memory wall may never fully disappear, but the boundaries between compute and memory are already starting to blur, and that’s likely to contribute to the next generation of performance gains.