Authors: @Amanda_Liang @Parmita_Mehta
Amongst many other innovations, Google’s Ironwood chips are our first TPUs to support 8-bit floating point (FP8) precision, helping to accelerate AI training and inference by reducing memory usage and doubling throughput compared to 16-bit formats (FP16/BF16). Importantly, this lets you improve throughput while maintaining model quality that is statistically similar to higher-precision BF16 baselines.
Central to this capability is Ironwood’s native integration of FP8 formats directly within its Matrix Multiply Units (MXUs). Unlike rigid integer quantization, this architecture allows the silicon to process specialized numerical representations tailored for specific deep learning tasks. This flexibility enables the system to prioritize precision for weights and activations while allocating wider dynamic range for gradients, effectively unlocking aggressive quantization techniques such as coarse scaling and deterministic rounding that are typically infeasible with integer-only math.
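To make the format split concrete, here is a minimal JAX sketch (illustrative only, not the MaxText/Qwix code path) that casts forward-pass operands to E4M3FN, keeps a wide accumulator for the matmul, and casts a gradient-like tensor to E5M2; the shapes and random data are placeholders.

import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (128, 256), dtype=jnp.bfloat16)   # activations
w = jax.random.normal(key, (256, 512), dtype=jnp.bfloat16)   # weights

# Forward pass: E4M3FN keeps more mantissa bits, preserving precision for
# weights and activations.
x_fp8 = x.astype(jnp.float8_e4m3fn)
w_fp8 = w.astype(jnp.float8_e4m3fn)

# Feed FP8 operands to the matmul but accumulate in a wider type.
y = jnp.dot(x_fp8, w_fp8, preferred_element_type=jnp.bfloat16)

# Backward pass: E5M2 trades mantissa for exponent bits, widening the dynamic
# range to capture the large spread of gradient magnitudes.
g = jax.random.normal(key, (128, 512), dtype=jnp.bfloat16).astype(jnp.float8_e5m2)

print(y.dtype, g.dtype)  # bfloat16 float8_e5m2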
We implemented many capabilities and optimizations into Ironwood to enable FP8 for your workloads. Some include:
- Production-ready recipes, such as those used for DeepSeek v3, demonstrating how to achieve highly efficient training on the new Ironwood architecture, so you can optimize performance.
- Specialized FP8 formats, such as E4M3FN for forward passes and E5M2 for backward passes, which preserve dynamic range and ensure numerical stability without compromising accuracy.
- Advanced tuning capabilities, such as activation host offloading and SparseCore communication offloading, which keep TensorCores fed and hide system collectives, mitigating system bottlenecks.
You can begin using these optimizations immediately with MaxText and the JAX ecosystem. Our FP8 training recipe is implemented through Qwix and can be enabled by setting a few flags in the MaxText configuration — details below. You can consult the Qwix user guide for guidance on quantizing your specific models.
Below is a deep dive into the journey we took and technical decisions we made to build this stack. After reading this blog, you will have techniques to effectively train your models with FP8 on Ironwood.
The journey from INT4/8 to FP8
To implement the first FP8 recipe on TPUs, we developed new scaling and quantization strategies specifically for Ironwood, rather than relying on established GPU or integer-based TPU workflows. These strategies are available for you to use as well.
We also optimized several state-of-the-art models for this architecture, which are available as production-ready recipes. Our goal was to improve throughput using Ironwood’s FP8 capabilities while keeping quality neutral relative to BF16 baselines. Using DeepSeek v3 as a case study, we demonstrate below a production-ready methodology for FP8 training on Ironwood.
A case study on DeepSeek v3 training
We explored and developed a customized FP8 recipe for DeepSeek v3. You can repurpose the learnings from this case study when developing your own FP8 models for Ironwood.
1. Defining the quantization scope
For Sparse Mixture-of-Experts (MoE) architectures, profiling consistently reveals that the highest quantization impact lies in the MLP and attention projections, where matrix multiplication dominates the runtime. To improve efficiency in these bandwidth-heavy models, it is also critical to quantize Megablox kernel weight all-gathers, which significantly reduces communication and quantization overhead.
We also explored quantizing splash attention kernels, but quickly discovered that this approach led to significant quality degradation, even when applying the most conservative FP8 recipes. Furthermore, the performance goals could not be met because the kernels were heavily constrained by the Vector Processing Unit (VPU), which handles complex element-wise operations, rather than the MXU where FP8 provides its acceleration. As a result, converting the matrix operations to FP8 yielded no meaningful latency reduction, as the VPU bottleneck remained the dominant limiting factor.
2. DeepSeek FP8 training recipe
The final recipe for DeepSeek V3 training on Ironwood achieves the best of both quality and performance:
- Rounding method: Round to Nearest Even (RNE). We chose this deterministic approach over stochastic rounding to ensure reproducibility and eliminate training noise.
- Precision formats:
  - Activations & weights: E4M3FN (to maximize precision in the forward pass)
  - Gradients: E5M2 (to capture the high dynamic range of the backward pass)
- Scaling granularity: per-axis. While per-tensor scaling was explored for performance, per-axis scaling was selected for the final launch to guarantee the highest model quality. The original DeepSeek papers used per-block scaling with a block size of (1, 128) for pretraining, but it was shown not to be needed in post-training. See the sketch after this list.
- Scaling mode: Hybrid.
  - Static scaling for weights and activations (pre-computed via profiling)
  - Dynamic scaling for gradients
- Quantization scope:
  - FP8 weight all-gather
  - All Megablox kernels: weights, activations, and gradients
  - All attention projections: weights, activations, and gradients
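To illustrate the per-axis granularity and the hybrid static/dynamic scaling modes above, here is a minimal plain-JAX sketch; the max-based calibration, helper names, and shapes are illustrative assumptions and do not reflect the Qwix implementation.

import jax
import jax.numpy as jnp

E4M3_MAX = 448.0  # largest finite magnitude representable in float8_e4m3fn

def per_axis_scale(x, axis):
    # One scale per slice along `axis`, mapping the observed absmax onto the FP8 range.
    absmax = jnp.max(jnp.abs(x), axis=axis, keepdims=True)
    return jnp.maximum(absmax, 1e-6) / E4M3_MAX

def quantize_e4m3(x, scale):
    # astype rounds to nearest even, matching the deterministic RNE choice above.
    return jnp.clip(x / scale, -E4M3_MAX, E4M3_MAX).astype(jnp.float8_e4m3fn)

key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (256, 512), dtype=jnp.float32)
x = jax.random.normal(jax.random.PRNGKey(1), (128, 256), dtype=jnp.float32)

# Static scaling: in the recipe, weight and activation scales are pre-computed
# via profiling and reused; here we simply compute one scale per output column.
w_scale = per_axis_scale(w, axis=0)
w_q = quantize_e4m3(w, w_scale)

# Dynamic scaling: gradient scales are recomputed from the live tensor each step;
# shown here on the activation tensor for brevity, one scale per row.
x_scale = per_axis_scale(x, axis=1)
x_q = quantize_e4m3(x, x_scale)

# FP8 matmul with a wide accumulator, then de-scale back into real units.
y = jnp.dot(x_q, w_q, preferred_element_type=jnp.float32) * x_scale * w_scale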
We implemented the recipe through Qwix, and it can be enabled in MaxText by specifying the following flags:
quantization=fp8_full \
weight_quantization_calibration_method="fixed,-224,224" \
act_quantization_calibration_method="fixed,-224,224" \
use_qwix_quantization=true
You can find a detailed example here. For more information about customizing the recipe, please refer to the Qwix user guide.
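For orientation, a full MaxText training invocation with these flags would look roughly like the following; the entry point, config file, run name, and output directory are illustrative placeholders rather than the exact recipe command, so consult the MaxText documentation for the current invocation.

python3 -m MaxText.train MaxText/configs/base.yml \
  run_name=<your_run_name> \
  base_output_directory=<gs://your-bucket/output> \
  quantization=fp8_full \
  weight_quantization_calibration_method="fixed,-224,224" \
  act_quantization_calibration_method="fixed,-224,224" \
  use_qwix_quantization=true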
3. Retune the FP8 model
The transition to FP8 introduces a classic optimization challenge: As compute becomes significantly faster, it exposes system bottlenecks that were previously hidden by slower math operations. With matrix multiplications accelerating, the relative cost of communication and data movement increases, requiring us to extensively re-tune XLA flags and adjust instruction scheduling to better hide collectives. We also leveraged activation host offloading to manage memory pressure and utilized Fully Sharded Data Parallelism (FSDP), which proved sufficient for the model scale when paired with these scheduler adjustments.
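The FSDP layout itself is standard JAX sharding; below is a minimal plain-JAX illustration of the idea. The mesh axis name, shapes, and dtypes are placeholders, and the XLA flag re-tuning and activation host offloading mentioned above are handled separately in MaxText and are not shown.

import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D device mesh; in the runs described here this axis spans 256 chips.
mesh = Mesh(jax.devices(), axis_names=("fsdp",))

# Shard the weight's first dimension and the batch dimension across the fsdp axis;
# XLA inserts the just-in-time weight all-gather, which scheduler tuning then hides.
w = jax.device_put(jnp.zeros((8192, 8192), dtype=jnp.bfloat16),
                   NamedSharding(mesh, P("fsdp", None)))
x = jax.device_put(jnp.ones((4 * mesh.size, 8192), dtype=jnp.bfloat16),
                   NamedSharding(mesh, P("fsdp", None)))

@jax.jit
def layer(x, w):
    return jnp.dot(x, w)

y = layer(x, w)
print(y.shape, y.sharding)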
To keep the accelerated TensorCores fed, we offloaded heavy communication tasks (such as MoE token dispatch and collective operations) to the SparseCores. This included enabling an XLA flag for decomposed Reduce-Scatter operations, which makes effective use of Inter-Chip Interconnect (ICI) bandwidth in the absence of Megacore support.
Finally, we optimized the compute path by tuning Megablox tiling strategies to match the increased throughput of the FP8 compute engine.
4. Results: Performance meets quality
Performance
At the time of writing, FP8 DeepSeek v3 on Ironwood achieved 3307 tokens/s/chip with FSDP sharding on 256 chips, a ~1.3x speedup over the BF16 baseline (2590 tokens/s/chip).
Quality
The training loss curves below demonstrate that our FP8 recipe (blue) closely tracks the BF16 baseline (orange), showing that aggressive quantization can be applied without compromising model quality.
Ready to train your own FP8 models?
Ironwood is the first TPU to fully embrace native FP8 support — a capability that has driven efficiency in the GPU ecosystem for years. By moving away from legacy integer formats, Google’s custom silicon is now aligned with the modern standard for high-performance AI training.
However, hardware support alone is not enough. The recipes and optimizations detailed here are the bridge that makes that hardware usable. This work allows you to use Ironwood TPUs with low-precision FP8, ensuring you can successfully deploy your most demanding workloads on this new architecture with the same confidence and quality you expect from higher-precision formats. To learn more and get started, check out this guide.



