LLMs and quantization: FP8, FP4, and INT8 explained


How can quantization turn massive models into efficient tools without ruining their accuracy?

Running large language models is expensive. The biggest ones pack hundreds of billions of parameters, each stored as a high-precision number that chews through memory, power, and premium hardware. But do we actually need all that precision? Increasingly, the answer is no.

That realization has driven the AI world toward quantization, which involves taking model parameters from 32-bit or 16-bit formats and compressing them down to 8-bit, or sometimes even 4-bit representations. You end up with models that are dramatically smaller, faster, and cheaper to run, often with surprisingly little quality degradation. But it is a little more complex than that. Low-precision formats aren’t interchangeable, and the choice between INT8, FP8, and FP4 involves trade-offs across dynamic range, hardware compatibility, and accuracy. As quantized inference shifts from niche optimization trick to mainstream deployment strategy, understanding those trade-offs is becoming table stakes.

What is quantization?

At its core, quantization means taking a neural network’s parameters, like model weights, activations, and other stored values, and converting them from high-precision formats like FP32 or FP16 into lower-precision representations that use fewer bits. The objective is basically to shrink the memory footprint needed to store and process a model, which unlocks faster computation and lower hardware costs.

Rather than storing every parameter at maximum numerical fidelity, this involves storing approximations that land “close enough” to the originals while being far more efficient to work with. A weight originally represented as a 32-bit floating-point number might become an 8-bit integer or a 4-bit float. The number isn’t identical, but it’s close enough — and when you’re dealing with billions of parameters, those memory savings compound fast.
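
To make that concrete, here's a minimal sketch of symmetric per-tensor INT8 quantization in NumPy. The function names and the symmetric scheme are illustrative choices for this example, not any particular library's API.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
    scale = np.abs(weights).max() / 127.0                      # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)                   # stand-in for one layer's weights
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("worst-case rounding error:", np.abs(w - w_hat).max())   # small relative to the weight range
```

Each value now occupies one byte instead of four, and the only extra state to carry around is a single floating-point scale per tensor.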

The core trade-off is essentially that you sacrifice some numerical detail in exchange for a smaller footprint, faster inference, and the ability to deploy on less expensive or power-constrained hardware. The underlying industry thesis is that modern neural networks are robust enough that shaving off that detail doesn’t meaningfully degrade their outputs. A growing body of research backs this up, demonstrating that well-calibrated 8-bit and even 4-bit quantization can land near FP32 accuracy across a wide range of tasks.

Why the industry is moving to lower precision

Today’s large language models run into the billions or even trillions of parameters. At FP16, a 7-billion parameter model needs roughly 14GB of memory just for weight storage. Move to FP8 and that drops to about 7GB. Push all the way to FP4 and you’re at around 3.5GB. If you’re trying to run inference on consumer GPUs, edge devices, or cost-sensitive cloud setups, those numbers are the difference between viable and impossible. A model that once required a high-end data center GPU can suddenly fit on a laptop.
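
The arithmetic behind those figures is simple enough to spell out. The sketch below counts only weight storage and ignores activations, the KV cache, and quantization metadata, which add real overhead in practice.

```python
# Rough weight-storage footprint for a 7-billion-parameter model at different precisions.
params = 7_000_000_000
bytes_per_value = {"FP32": 4, "FP16": 2, "FP8": 1, "INT8": 1, "FP4": 0.5}

for fmt, nbytes in bytes_per_value.items():
    print(f"{fmt:>5}: ~{params * nbytes / 1e9:.1f} GB")   # FP16 ≈ 14 GB, FP8 ≈ 7 GB, FP4 ≈ 3.5 GB
```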

Hardware economics are reinforcing the trend too. Lower precision formats need less transistor space and draw less power per operation. Scale those savings across thousands of GPUs in a data center running inference 24/7, and the cost reductions get serious. Nvidia’s latest GPU architectures have leaned hard into this by baking native support for lower-precision operations directly into silicon.

The hit to accuracy is frequently smaller than you’d think. Modern neural networks tend to be over-parameterized, carrying more numerical precision than they genuinely need to produce quality outputs. Research has shown again and again that with proper calibration, models quantized to 8-bit or even 4-bit perform nearly on par with their full-precision versions. 

Comparing INT8, FP8, and FP4

Not all 8-bit formats behave the same way. The differences between integer and floating-point representations carry real consequences for quantized model performance.

INT8, or 8-bit integer, dedicates all 8 bits to representing whole numbers uniformly spaced across a fixed range, giving it just 256 distinct values (typically −128 to 127, mapped back to real values through a scale factor). INT8 is well understood, broadly supported across hardware generations, and easy to implement, and it delivers strong precision for values near zero, which is exactly where most neural network weights tend to cluster. The problem is outliers. Neural network weights and activations sometimes contain extreme values, and INT8's uniform spacing means those outliers either get clipped or force the entire range to stretch out, dragging down precision for everything else.
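
A tiny experiment makes the outlier problem visible. Under symmetric per-tensor scaling, the quantization step is set by the largest absolute value, so a single extreme entry coarsens the resolution for every other value. The setup below is illustrative, not drawn from any specific model.

```python
import numpy as np

def int8_step(values: np.ndarray) -> float:
    """Size of one INT8 quantization step under symmetric per-tensor scaling."""
    return np.abs(values).max() / 127.0

bulk = np.random.randn(10_000).astype(np.float32) * 0.05   # typical near-zero weights
with_outlier = np.append(bulk, 8.0)                        # one extreme value

print("step without outlier:", int8_step(bulk))            # fine-grained for the bulk
print("step with outlier:   ", int8_step(with_outlier))    # dozens of times coarser for the same bulk
```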

FP8, or 8-bit floating-point, works fundamentally differently. Instead of uniform spacing, it splits its 8 bits across a sign bit, exponent, and mantissa — the same architecture used by standard floating-point formats, just squeezed into fewer bits. This creates exponentially spaced values, giving FP8 a dramatically wider dynamic range than INT8. Two main variants are in play: E4M3, with 4 exponent bits and 3 mantissa bits, leans toward precision and covers roughly ±448. E5M2, with 5 exponent bits and 2 mantissa bits, favors dynamic range and can represent values up to approximately ±57,344. E4M3 tends to be the better fit for forward passes during inference, while E5M2 suits backward passes during training. The big advantage FP8 holds over INT8 comes down to outlier robustness — exponential spacing means extreme values don’t blow up the representation or compromise precision across the rest of the range.
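
To see that exponential spacing in action, here's a small decoder for E4M3 written from the bit layout described above (1 sign, 4 exponent, 3 mantissa bits, exponent bias 7). It's a sketch that skips the special NaN encoding hardware variants reserve, not a production decoder.

```python
def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 value: sign / 4 exponent bits / 3 mantissa bits, bias 7.
    Simplified sketch; ignores the reserved NaN bit pattern."""
    sign = -1.0 if (byte >> 7) & 0x1 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0:                                          # subnormal: 0.mmm * 2^(1 - 7)
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)    # normal: 1.mmm * 2^(exp - 7)

# Spacing between neighboring codes widens as magnitude grows:
print(decode_e4m3(0b0_0111_001) - decode_e4m3(0b0_0111_000))  # 0.125 near 1.0
print(decode_e4m3(0b0_1111_110) - decode_e4m3(0b0_1111_101))  # 32.0 near the ±448 limit
```

That widening gap is the trade: FP8 tolerates outliers that would wreck an INT8 scale, at the cost of coarser resolution for large values.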

FP4, or 4-bit floating-point, takes compression further. It uses the same sign-exponent-mantissa structure as FP8 but with far fewer bits to work with, delivering a 4x memory reduction over FP16. The precision loss at this level is more pronounced, and FP4 demands careful calibration to stay viable. Hardware support is more limited than for FP8, and standardization is still catching up. That said, research has shown FP4 quantization performing well for inference across many scenarios, especially when paired with techniques like quantization-aware training.
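
To get a feel for how coarse 4 bits really is, the sketch below enumerates every value representable under the common E2M1 layout (1 sign, 2 exponent, 1 mantissa bit, bias 1). Treating FP4 as E2M1 is an assumption made for illustration; it's the layout used by recent microscaling-style FP4 formats, but the discussion above doesn't pin down a specific variant.

```python
def decode_e2m1(nibble: int) -> float:
    """Decode one FP4 E2M1 value: sign / 2 exponent bits / 1 mantissa bit, bias 1."""
    sign = -1.0 if (nibble >> 3) & 0x1 else 1.0
    exp = (nibble >> 1) & 0x3
    man = nibble & 0x1
    if exp == 0:                               # subnormal: 0.m * 2^0
        return sign * man * 0.5
    return sign * (1.0 + man * 0.5) * 2.0 ** (exp - 1)

# The 16 encodings cover just 15 distinct values: 0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6
print(sorted({decode_e2m1(code) for code in range(16)}))
```

With so few representable points, the scale factors and the calibration wrapped around them end up doing most of the heavy lifting.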

The importance of hardware

A quantization format is only as good as the hardware running it. FP8 has become viable for production workloads in large part because Nvidia’s Hopper and Ada Lovelace architectures ship with native FP8 support baked into the silicon. FP8 operations run at full speed on dedicated tensor cores — no software emulation overhead required. On anything pre-Hopper, you’d need to emulate FP8 in software, which wipes out most of the speed advantages and makes the whole proposition far less compelling.

This hardware alignment explains why FP8 adoption has picked up so sharply. It’s not purely that the format is theoretically superior to INT8 for many workloads — it’s that the silicon now exists to actually make use of it. Native support means FP8 can be applied not just to model weights but also to activations and KV (Key-Value) caches — the data structures storing intermediate attention computations that can eat significant memory during inference.

The practical implications hit hardest with smaller models. A 7-billion parameter model quantized to FP8 can run efficiently on consumer-grade GPUs or edge hardware, putting capable AI within reach in environments where data center infrastructure isn’t available. As hardware support for even lower precision formats like FP4 matures, this democratization trend will only accelerate.

Not all black and white

For all the upside of lower-precision formats, quantization isn’t free. Model sensitivity varies widely — some architectures degrade gracefully under quantization, others fall apart entirely. Naive quantization, where you simply round parameters to lower precision without any preparation, almost always delivers poor results. The gap between a well-quantized model and a poorly quantized one is massive.

Calibration and quantization-aware methods help close this gap. Techniques like SmoothQuant, which shifts quantization difficulty from activations to weights, and quantization-aware training (QAT), which teaches the model to anticipate reduced precision during training, are critical for avoiding meaningful accuracy drops. These methods add real complexity to the deployment pipeline, and getting them right takes expertise and experimentation. It’s definitely not a flip-the-switch situation.
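
As a rough illustration of the SmoothQuant idea, the sketch below computes per-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha) and migrates part of the activation outliers into the weights while keeping the layer's output mathematically unchanged. Shapes and function names are illustrative; this is not the paper's reference implementation.

```python
import numpy as np

def smoothquant_scales(X: np.ndarray, W: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Per-input-channel smoothing factors in the spirit of SmoothQuant:
    s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    act_max = np.abs(X).max(axis=0)      # per input channel, across tokens
    w_max = np.abs(W).max(axis=1)        # per input channel, across output features
    return act_max ** alpha / w_max ** (1 - alpha)

X = np.random.randn(32, 64).astype(np.float32)   # (tokens, in_features) -- toy shapes
X[:, 7] *= 50.0                                  # plant an activation outlier channel
W = np.random.randn(64, 16).astype(np.float32)   # (in_features, out_features)

s = smoothquant_scales(X, W)
X_smooth, W_smooth = X / s, W * s[:, None]       # X @ W == X_smooth @ W_smooth
assert np.allclose(X @ W, X_smooth @ W_smooth, rtol=1e-3, atol=1e-3)
print("outlier channel max before/after:", np.abs(X[:, 7]).max(), np.abs(X_smooth[:, 7]).max())
```

The layer computes the same result either way; the difference is that the rescaled activations are far friendlier to an 8-bit grid.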

INT8, however, does still dominate production deployments, FP8’s theoretical edge notwithstanding. The reasons are basically that INT8 runs on older hardware that’s already deployed at scale, its calibration methods are more mature and battle-tested, and implementation is simpler. FP8’s advantages are genuine, but they only fully materialize on very recent hardware that many organizations don’t yet have access to. For a lot of teams, INT8 remains the sensible default.

And in practice, the real world rarely applies a single format uniformly across an entire model. Most production implementations use mixed precision, with INT8 in some layers, FP8 in others, tuned to each layer’s sensitivity to quantization. Different parts of a model have different precision needs, and the most effective deployments match the format to the workload rather than forcing a one-size-fits-all approach. The landscape is still evolving, and the “right” answer depends heavily on the specific model, the target hardware, and how much engineering effort you’re prepared to invest in calibration.
