Do scaling laws for training AI models still apply when differential privacy is folded into the process?
Earlier this year, Google Research and DeepMind published work on this question. That work has now taken concrete form in VaultGemma.
VaultGemma largely inherits the Gemma model architecture, but it was trained with differential privacy: a noise-adding anonymization technique, implemented here to reduce the chances of the model memorizing its training data.
The challenge was to strike the right balance among three factors: model size, batch size, and the number of iterations. All while accounting for budgets: compute, data, and privacy.
It is widely accepted that the algorithm used, DP-SGD, yields better results with larger batch sizes. The remaining two parameters therefore needed tuning, with the caveat that traditional scaling laws proved “far from optimal.”
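To make the batch-size intuition concrete, here is a minimal DP-SGD step in NumPy. This is a generic sketch, not the team's implementation; the function name and defaults are illustrative. Per-example gradients are clipped, averaged, and perturbed with Gaussian noise whose effective magnitude shrinks as the batch grows, which is why larger batches help.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm=1.0,
                noise_multiplier=1.0, lr=0.1, rng=np.random.default_rng(0)):
    """One DP-SGD step: clip each example's gradient, average, add noise.

    per_example_grads: array of shape (batch_size, dim).
    """
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale each per-example gradient so its norm is at most clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    batch_size = per_example_grads.shape[0]
    # Gaussian noise with std proportional to clip_norm / batch_size:
    # larger batches dilute the noise relative to the signal.
    noise = rng.normal(0.0, noise_multiplier * clip_norm / batch_size,
                       size=params.shape)
    grad = clipped.mean(axis=0) + noise
    return params - lr * grad
```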
Scaling laws with limited comparability
Among other trends highlighted by the work published earlier this year: as the privacy budget increases, it is better to grow the model size and shrink the batch size while performing more iterations.
In general, there is a margin of maneuver: a wide range of configurations can optimize the loss. In this context, training smaller models on more tokens should be favored, given the efficiency gains at inference time.
Optimal model sizes are indeed markedly smaller than those predicted by traditional scaling laws. At 10^22 FLOPs, for instance, the optimum is around 10^8 parameters, versus 10^10 without the differential-privacy layer.
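As a back-of-the-envelope check, the standard approximation C ≈ 6·N·D (a common rule of thumb from the scaling-law literature, not a formula from the VaultGemma report) relates compute C, parameter count N, and training tokens D. Under that assumption, the figures above imply the smaller DP-optimal model would see roughly 100× more tokens at the same budget:

```python
def tokens_for_budget(compute_flops, n_params):
    """Token count implied by the rule-of-thumb approximation C = 6 * N * D."""
    return compute_flops / (6 * n_params)

C = 1e22
dp_tokens = tokens_for_budget(C, 1e8)     # DP-optimal size cited above
nodp_tokens = tokens_for_budget(C, 1e10)  # classic scaling-law size
# The DP-optimal model trains on ~100x more tokens for the same compute.
```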
The original experiments began with uniform batches. Adopting Poisson sampling, which offers a good trade-off between privacy level and the amount of noise injected, altered the landscape: it produces batches of varying sizes that must be processed in a specific random order. Google Research and DeepMind tackled these issues with their Scalable DP-SGD mechanism, which enables fixed-size batches through padding or trimming while preserving the privacy guarantees.
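A minimal sketch of the trim-or-pad idea, assuming -1 as a sentinel index for zero-contribution "dummy" examples (the function name and sentinel convention are my own, not Google's implementation):

```python
import numpy as np

def fixed_size_poisson_batch(n_examples, sample_rate, batch_size,
                             rng=np.random.default_rng(0)):
    """Poisson-sample a batch, then trim or pad it to a fixed size.

    Each example is included independently with probability sample_rate,
    so the raw batch size varies. Trimming drops surplus examples;
    padding appends sentinel indices (-1) that stand for dummy examples
    contributing nothing to the gradient.
    """
    mask = rng.random(n_examples) < sample_rate
    batch = np.flatnonzero(mask)
    rng.shuffle(batch)
    if len(batch) >= batch_size:
        return batch[:batch_size]                   # trim surplus examples
    pad = np.full(batch_size - len(batch), -1)      # -1 marks a dummy slot
    return np.concatenate([batch, pad])
```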
VaultGemma, as capable as a GPT… from 2020
To develop VaultGemma, the team reused the methodology established earlier in the year, adapting it in three respects. In particular, they explicitly modeled the optimal learning rate. Rather than treating it as a parameter to be optimized via grid search for each training configuration, they modeled its optimal value as a function of that configuration. For each configuration (model size and the ratio of noise added to batch size), they ran seven training rounds, each with a different learning rate. The resulting losses were fitted with a quadratic function whose vertex provides an estimate of the optimal value.
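The quadratic-fit step can be sketched as follows. The function name and the seven-point sweep values are hypothetical, and fitting in log-learning-rate space is an assumption on my part, not a detail confirmed by the report:

```python
import numpy as np

def optimal_lr_from_sweep(lrs, losses):
    """Fit a quadratic to loss vs. log(lr) and return the lr at its vertex.

    lrs: the learning rates tried for one configuration (e.g. seven values).
    losses: the final loss observed for each run.
    """
    x = np.log(lrs)
    a, b, c = np.polyfit(x, losses, 2)   # loss ~ a*x^2 + b*x + c
    assert a > 0, "need an upward-opening parabola to have a minimum"
    return float(np.exp(-b / (2 * a)))   # vertex of the parabola

# Hypothetical sweep: seven runs bracketing an optimum near lr = 1e-3.
lrs = np.array([1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1])
losses = 0.1 * (np.log(lrs) - np.log(1e-3)) ** 2 + 2.5
print(optimal_lr_from_sweep(lrs, losses))  # ~ 0.001
```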
The method was also revised with a locally parameterized approach that allows estimating the loss over an interval of iterations without relying on intermediate values. As for the final law, it was built in two stages, modeling the loss separately as a function of model size and of the number of iterations.
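The two-stage idea, fitting the loss against one variable at a time, can be illustrated with simple power-law fits. Everything here is synthetic: the exponents, coefficients, and grid values are invented for illustration and do not come from the report:

```python
import numpy as np

def fit_power_law(x, losses):
    """Fit loss ~ c * x**(-alpha) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(x), np.log(losses), 1)
    return np.exp(intercept), -slope  # (c, alpha)

# Stage 1: loss as a function of model size N, iterations held fixed.
N = np.array([1e7, 1e8, 1e9])
c_n, alpha_n = fit_power_law(N, 12.0 * N ** -0.24)

# Stage 2: loss as a function of iteration count T, model size held fixed.
T = np.array([1e3, 1e4, 1e5])
c_t, alpha_t = fit_power_law(T, 9.0 * T ** -0.08)
```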

VaultGemma’s weights are available on Hugging Face and Kaggle. The dataset is the same as for Gemma 2 27B (13 trillion tokens, mostly in English, with the same filtering techniques and the same tokenizer). The sequence length was limited to 1024 to allow using larger batches.
For its size, VaultGemma delivers performance comparable, on the benchmarks considered, to that of GPT-2, a model released five years ago without differential privacy. Memorization rates, however, have dropped clearly and measurably.