For decades, compute has scaled faster than memory. Processors can execute more operations every year, but the speed at which data moves in and out of memory has lagged behind. That mismatch, known as the “memory wall,” is now one of the defining constraints in artificial intelligence.
AI makes the problem even worse. These days, training and serving large models is less about raw compute throughput and more about how quickly parameters and activations can be fed to thousands of parallel compute units. Even the most advanced accelerators spend a surprising amount of time waiting on data. Latency, bandwidth, and energy per bit dominate real-world performance, which is why memory innovation has become so important.
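To see why the waiting happens, it helps to compare a workload's arithmetic intensity (FLOPs per byte moved) against the ratio the hardware can actually sustain. The sketch below is a rough roofline-style check with made-up peak numbers for a hypothetical accelerator; the exact figures are assumptions, and only the ratios matter.

```python
# Rough roofline-style check: is a kernel compute-bound or memory-bound?
# All figures below are illustrative assumptions, not vendor specs.

PEAK_FLOPS = 1.0e15      # assumed accelerator peak: 1 PFLOP/s
PEAK_BW    = 3.0e12      # assumed memory bandwidth: 3 TB/s

machine_balance = PEAK_FLOPS / PEAK_BW   # FLOPs the chip can do per byte moved

def attainable_tflops(arithmetic_intensity: float) -> float:
    """Attainable throughput (TFLOP/s) for a kernel with the given
    arithmetic intensity, in FLOPs per byte of memory traffic."""
    return min(PEAK_FLOPS, arithmetic_intensity * PEAK_BW) / 1e12

# Example: a GEMV-like decode step reads each weight once and does ~2 FLOPs
# per 2-byte (fp16) weight, so roughly 1 FLOP per byte.
for name, intensity in [("decode GEMV (~1 FLOP/B)", 1.0),
                        ("large GEMM (~500 FLOP/B)", 500.0)]:
    bound = "memory-bound" if intensity < machine_balance else "compute-bound"
    print(f"{name}: {attainable_tflops(intensity):7.1f} TFLOP/s attainable ({bound})")
```

At one FLOP per byte, the assumed chip can use only a fraction of a percent of its peak; the rest of the time it is, in effect, waiting on memory.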
What is high-bandwidth memory?
High Bandwidth Memory (HBM) is the industry’s most direct response to the memory wall. Rather than placing DRAM modules inches away on a motherboard, HBM stacks multiple DRAM dies vertically and mounts them adjacent to the processor on a silicon interposer. The physical proximity and ultra-wide interfaces help deliver dramatically higher bandwidth at lower energy per bit than traditional forms of memory.
Recent generations have pushed those advantages even further. HBM3E raises per-stack throughput past a terabyte per second, with leading implementations reaching as much as 1.2TB/s per stack. When accelerators integrate several stacks, aggregate bandwidth climbs into multi-terabyte-per-second territory. That headroom can make a real difference: more bandwidth lets GPUs and AI accelerators keep their compute units fed, maintain higher utilization, and reduce stalls caused by memory contention.
The end result? Larger batch sizes become viable, optimizer steps run more smoothly, and inter-GPU traffic can drop because more working sets fit in fast local memory. That said, there are trade-offs — HBM is expensive, thermally demanding, and requires advanced packaging. But for workloads that require that high bandwidth, it’s the most effective approach we have right now.
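Some back-of-the-envelope arithmetic shows how those per-stack numbers translate into system behavior. The sketch below uses assumed figures (an HBM3E-class 1.2TB/s stack, six stacks, a hypothetical 70B-parameter model in 16-bit weights) and ignores KV-cache traffic and batching, so treat it as an upper bound rather than a benchmark.

```python
# Back-of-the-envelope HBM arithmetic (all numbers are illustrative assumptions).

per_stack_bw_tb_s = 1.2   # HBM3E-class per-stack bandwidth
num_stacks        = 6     # stacks integrated alongside an assumed accelerator
aggregate_bw      = per_stack_bw_tb_s * num_stacks           # TB/s

# During low-batch decoding, every parameter is read roughly once per token.
params            = 70e9                                      # assumed 70B-parameter model
bytes_per_param   = 2                                         # fp16/bf16 weights
bytes_per_token   = params * bytes_per_param                  # ~140 GB per token

tokens_per_s_ceiling = (aggregate_bw * 1e12) / bytes_per_token

print(f"Aggregate bandwidth: {aggregate_bw:.1f} TB/s")
print(f"Bandwidth-limited decode ceiling: ~{tokens_per_s_ceiling:.0f} tokens/s per replica")
```

The point is simply that, at low batch sizes, decode throughput is set almost entirely by how fast weights can be streamed out of memory, which is why every extra stack and every HBM generation matters.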
CXL and memory expansion
Compute Express Link (CXL) tackles a different side of the problem: capacity and elasticity. CXL is a cache-coherent interconnect that rides on PCIe, allowing processors and accelerators to attach external memory as if it were part of the system’s addressable space. That opens the door to memory expansion modules, pooled memory appliances, and shared memory fabrics across servers.
With CXL, operators can decouple how much memory a node can address from how much DRAM sits on the motherboard. Memory can be allocated dynamically to where it’s needed most, improving utilization and reducing stranded capacity. For large-scale AI, this means models that don’t fit entirely in local HBM can spill into slower but far larger CXL‑attached memory tiers. The result is more flexible scaling, especially for inference services with fluctuating demand and for training setups that benefit from larger context windows.
In practice, of course, CXL doesn’t replace HBM; it complements it. HBM remains the high-speed “L1 for memory,” while CXL extends the address space and allows memory to be provisioned and shared across accelerators and hosts.
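As a rough illustration of the tiering idea, here is a minimal sketch of a capacity-based placement policy: hot data lands in HBM until it fills up, and the remainder spills to a CXL-attached tier. The capacities, tensor sizes, and hottest-first ordering are all illustrative assumptions, not a real allocator or any vendor’s API.

```python
# Minimal sketch of capacity-based placement across memory tiers.
# Capacities, sizes, and the greedy policy are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    capacity_gb: float
    used_gb: float = 0.0

    def try_place(self, size_gb: float) -> bool:
        """Reserve space in this tier if it fits; return whether it did."""
        if self.used_gb + size_gb <= self.capacity_gb:
            self.used_gb += size_gb
            return True
        return False

hbm = Tier("HBM", capacity_gb=96)                  # fast, local, scarce
cxl = Tier("CXL-attached DRAM", capacity_gb=512)   # slower, far more abundant

# Working set ordered hottest-first: weights and KV cache before optimizer state.
tensors = [("weights", 70), ("kv_cache", 40), ("optimizer_state", 280)]

for name, size in tensors:
    # Try HBM first; fall back to the CXL tier if it does not fit.
    placed = next((t for t in (hbm, cxl) if t.try_place(size)), None)
    print(f"{name:16s} {size:4d} GB -> {placed.name if placed else 'does not fit'}")
```

Real runtimes would also weigh access frequency, bandwidth, and latency when deciding what spills where, which is exactly the kind of tiering policy discussed later.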
Unified memory and 3D stacking
The industry is moving towards architectures where memory is faster, closer, and, increasingly, unified. Unified memory models blur the lines between CPU, GPU, and accelerator memory spaces, allowing applications to treat them as a single pool. That reduces copying and simplifies programming, while allowing runtimes to tier data across HBM, local DRAM, and CXL‑attached memory.
3D stacking is the other major vector. Techniques like through‑silicon vias (TSVs) and hybrid bonding bring logic and memory into tighter proximity, improving bandwidth density and cutting energy per bit. As stacking evolves, from taller HBM stacks to logic‑in‑memory and near‑memory compute, data doesn’t have to travel as far or as often. Over time, these approaches could turn memory hierarchies into something more like a memory fabric woven directly into the compute substrate.
Will this “break” the memory wall? That remains to be seen, but it’s unlikely. Still, by pushing more bandwidth per millimeter, collapsing distances, and treating memory as a distributed resource, these approaches make the wall a little easier to scale.
What’s next?
HBM and CXL are on fast-moving roadmaps. HBM3E is in mass production, and the industry is already moving quickly towards the next step, HBM4, which expands bandwidth to more than 2TB/s per stack while doubling layer counts to as many as 16 per stack. Samsung, SK Hynix, and Micron are already sampling HBM4, with mass production expected to begin in 2026.
On the interconnect and memory-sharing front, CXL 3.x from the CXL Consortium is now shipping, delivering key features such as multi-level switching, peer-to-peer coherency, and full fabric management. These enable memory pooling, disaggregated memory tiers, and coherent sharing across CPUs, GPUs, and accelerators. Meanwhile, attention is shifting toward the next phase, CXL 4.0, which is expected to leverage the PCIe 7.0 physical layer and deliver a further step up in bandwidth.
These refreshes will have big implications for system design, of course. HBM transitions require new interposers, power delivery, and cooling strategies, often locking in platform choices for years. CXL introduces its own planning challenges: tiering policies, QoS, address translation, and failure domains all have to be designed into the stack.
Conclusion
In modern AI systems, the fastest chip is only as good as the memory feeding it. HBM delivers the local bandwidth needed to keep accelerators busy, while CXL makes memory more flexible and abundant.
The memory wall isn’t gone, and it isn’t going anywhere any time soon, especially as AI demands continue to grow. But these innovations can at least make a dent in it.