Training with quantization noise for extreme model compression
Quant-Noise is a new technique for extreme model compression: it mimics the effect of quantization during training, so that the compressed models still deliver high performance when deployed in practical applications.
This method delivers performance that nearly matches that of the original uncompressed models while reducing the memory footprint by 10x to 20x. This significantly exceeds the 4x compression with int8 currently available in both PyTorch and TensorFlow. Quant-Noise can be used to shrink models even further, by more than 50x, in use cases where greater performance trade-offs are acceptable. Quant-Noise changes model training only by adding a regularization noise similar to dropout, with no impact on either the convergence rate or training speed.
During the forward pass at training time, Quant-Noise randomly selects a subset of the weights and applies simulated quantization noise to them. This makes the model resilient to quantization and enables large compression ratios without much loss in accuracy.
Because the noise is applied to only a subset of the weights, unbiased gradients still flow through the remaining weights, which are unaffected by the noise.
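To make the mechanism concrete, here is a minimal PyTorch sketch of the idea: a random subset of weights sees its fake-quantized value in the forward pass, with a straight-through estimator on the noised entries and exact gradients everywhere else. The per-weight mask and the int8-style rounding are simplifying assumptions for illustration; the paper's strongest results use block-wise noise with iPQ codebooks.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantNoiseLinear(nn.Linear):
    """Linear layer with a Quant-Noise-style forward pass (illustrative sketch).

    In training mode, a random subset of weights is replaced by its simulated
    int8-quantized value; gradients pass straight through on those entries,
    while the un-noised weights receive exact, unbiased gradients.
    """

    def __init__(self, in_features, out_features, p=0.1, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        self.p = p  # probability that a given weight gets quantization noise

    def _fake_int8(self, w):
        # Symmetric per-tensor int8 simulation: scale to [-127, 127], round, de-scale.
        scale = w.abs().max().clamp(min=1e-8) / 127.0
        return torch.round(w / scale) * scale

    def forward(self, x):
        w = self.weight
        if self.training and self.p > 0:
            # Bernoulli mask selecting which weights get noised this pass.
            mask = (torch.rand_like(w) < self.p).to(w.dtype)
            # Detach the noise so masked entries use a straight-through estimator.
            noise = (self._fake_int8(w) - w).detach()
            w = w + mask * noise
        return F.linear(x, w, self.bias)

# Usage: in train mode the forward pass carries quantization noise.
layer = QuantNoiseLinear(512, 512, p=0.1)
out = layer(torch.randn(8, 512))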
The authors demonstrate that their framework compresses the state-of-the-art EfficientNet-B3 model from ~50 MB to 3.3 MB while achieving 80% top-1 accuracy on ImageNet (vs. 81.7% for the uncompressed model), and compresses the RoBERTa Base model from 480 MB to 14 MB while achieving 82.5% on MNLI (vs. 84.8% for the original model).
blogpost: https://ai.facebook.com/blog/training-with-quantization-noise-for-extreme-model-compression/
paper: https://arxiv.org/abs/2004.07320
github: https://github.com/pytorch/fairseq/tree/master/examples/quant_noise
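The fairseq repo linked above ships a quant_noise wrapper for standard layers; usage is along these lines (import path and exact signature are assumptions taken from memory of the repo's examples, so verify against the linked code):

import torch.nn as nn
from fairseq.modules.quant_noise import quant_noise  # path assumed per the fairseq repo

# Wrap a layer so that 5% of its 8x8 weight blocks receive
# quantization noise at each training forward pass.
layer = quant_noise(nn.Linear(1024, 1024), p=0.05, block_size=8)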
#quantization #compression #shrinking