Data Science by ODS.ai 🦜
First Telegram Data Science channel. Covering all technical and popular stuff about anything related to Data Science: AI, Big Data, Machine Learning, Statistics, general Math and the applications of the former. To reach editors contact: @malev
Training with quantization noise for extreme model compression

Quant-Noise is a new technique for extreme compression of models that still deliver high performance when deployed in practical applications: it mimics the effect of quantization during training.
This method delivers performance that nearly matches that of the original uncompressed models while reducing the memory footprint by 10x to 20x. This significantly exceeds the 4x compression with int8 currently available in both PyTorch and TensorFlow. Quant-Noise can be used to shrink models even further, by more than 50x, in use cases where greater performance trade-offs are acceptable. Quant-Noise changes model training only by adding a regularization noise similar to dropout, with no impact on either the convergence rate or training speed.

During the forward pass at training time, the method randomly selects a subset of the weights and applies simulated quantization noise to them. This makes the model resilient to quantization and enables large compression ratios without much loss in accuracy.

Quant-Noise is applied to only a subset of the weights, which has the advantage that unbiased gradients still flow through the weights unaffected by the noise.
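
A rough PyTorch sketch of the idea (illustrative only, not the authors' fairseq implementation): on each forward pass, fake-quantize a random subset of the weights and pass gradients straight through:

import torch
import torch.nn as nn
import torch.nn.functional as F

def quant_noise(weight, p=0.1, bits=4):
    # Fake-quantize the whole tensor to a symmetric integer grid.
    qmax = 2 ** (bits - 1) - 1
    scale = weight.detach().abs().max().clamp(min=1e-8) / qmax
    quantized = torch.round(weight / scale).clamp(-qmax - 1, qmax) * scale
    # Pick a random subset of weights to receive the quantization noise.
    mask = (torch.rand_like(weight) < p).float()
    mixed = mask * quantized + (1.0 - mask) * weight
    # Straight-through estimator: forward sees the mixed tensor,
    # backward treats it as identity, so gradients from the untouched
    # weights stay unbiased.
    return weight + (mixed - weight).detach()

class QuantNoiseLinear(nn.Linear):
    def forward(self, x):
        w = quant_noise(self.weight) if self.training else self.weight
        return F.linear(x, w, self.bias)

# Usage: layer = QuantNoiseLinear(512, 512); y = layer(torch.randn(8, 512))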

The authors demonstrated that their framework compresses the SOTA EfficientNet-B3 model from ~50 MB to 3.3 MB while achieving 80% top-1 accuracy on ImageNet, compared with 81.7% for the uncompressed model. It also compresses the RoBERTa Base model from 480 MB to 14 MB while achieving 82.5% accuracy on MNLI, compared with 84.8% for the original model.


blogpost: https://ai.facebook.com/blog/training-with-quantization-noise-for-extreme-model-compression/
paper: https://arxiv.org/abs/2004.07320
github: https://github.com/pytorch/fairseq/tree/master/examples/quant_noise

#quantization #compression #shrinking
QLoRA: Efficient Finetuning of Quantized LLMs

This paper introduces QLoRA, a novel finetuning approach that decreases memory usage significantly while maintaining impressive performance. Imagine this: a 65-billion-parameter model finetuned on a single 48GB GPU while preserving full 16-bit task performance. The method backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters, opening up new frontiers in machine learning. The icing on the cake is their high-performing model family, Guanaco, which outperforms all previously released models on the Vicuna benchmark, achieving a staggering 99.3% of the performance level of ChatGPT with just 24 hours of finetuning on a single GPU.
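
A toy sketch of the adapter idea (illustrative, not the paper's code): the pretrained weight matrix stays frozen (in QLoRA it would also be stored in 4 bits and dequantized on the fly), and only the small low-rank matrices A and B are updated.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    # y = W x + (alpha / r) * B A x; W is frozen, only A and B are trained.
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)            # frozen base weights
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))   # zero-init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        base = F.linear(x, self.weight)                            # frozen path
        update = F.linear(F.linear(x, self.lora_A), self.lora_B)   # trainable low-rank path
        return base + self.scaling * update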

The study also unveils several innovative techniques to conserve memory without compromising performance. These include 4-bit NormalFloat (NF4), an innovative data type that is theoretically optimal for normally distributed weights, double quantization for average memory footprint reduction, and paged optimizers to handle memory spikes. The QLoRA approach was applied to finetune more than 1000 models, leading to a detailed analysis of instruction following and chatbot performance across various model types and scales. The results affirm that QLoRA finetuning on a small, high-quality dataset yields state-of-the-art results, even with smaller models than previously used. A notable finding is that GPT-4 evaluations offer a cost-effective alternative to human evaluation. All models and code, including CUDA kernels for 4-bit training, have been released by the researchers.
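
A hedged sketch of how these pieces fit together with the Hugging Face stack (transformers + bitsandbytes + peft); the model name and LoRA hyperparameters below are placeholders, not the exact setup from the paper:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "facebook/opt-1.3b"  # placeholder base model

# 4-bit NF4 quantization with double quantization, as described in the paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Only the small Low Rank Adapter matrices receive gradients.
lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Paged optimizers from bitsandbytes handle memory spikes, e.g. by passing
# optim="paged_adamw_32bit" to the Hugging Face Trainer.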

Paper link: https://arxiv.org/abs/2305.14314
Code link: https://github.com/artidoro/qlora
CUDA kernels link: https://github.com/TimDettmers/bitsandbytes

A detailed unofficial overview of the paper: https://andlukyane.com/blog/paper-review-qlora
#deeplearning #nlp #llm #quantization
Forwarded from Machinelearning
⚡️ Gemma 3 QAT

Google DeepMind has released updated versions of its Gemma 3 language models that are significantly more memory-efficient without a substantial loss in performance.

The key technology: QAT (Quantization-Aware Training)

What is it? QAT is a training technique in which, during fine-tuning, the model "learns" to operate at reduced numerical precision (using fewer bits to represent numbers). This simulates the conditions the model will face after quantization (compression).

Ordinary post-training quantization can cause a drop in accuracy. QAT lets the model adapt in advance to running in a low-precision regime, minimizing the quality loss after the final quantization.
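
A minimal PyTorch sketch of the fake-quantization step at the heart of QAT (illustrative only, not DeepMind's training code):

import torch

def fake_quantize(w, bits=4):
    # Simulated integer quantization used in the forward pass.
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    # Straight-through estimator: forward uses the quantized weights,
    # backward passes gradients to the full-precision copy unchanged.
    return w + (w_q - w).detach()

# During QAT fine-tuning every forward pass uses fake_quantize(layer.weight),
# so the model adapts to 4-bit behaviour before the real quantization happens.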

Each model (1B, 4B, 12B, 27B) was fine-tuned for about 5,000 steps with simulated low-bit weights. A trick similar to knowledge distillation was used: the original unquantized model served as the "teacher".

The benefit of the QAT approach for Gemma 3 turned out to be huge: officially, the quantized Gemma 3 QAT models are reported to retain quality with virtually no drop while requiring roughly 3x less memory.

For example, the memory needed to store the weights of the largest 27B-parameter model shrank from ~54 GB (in bfloat16) to ~14 GB in a 4-bit integer format, a memory saving of roughly 3-4x.
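
A quick back-of-the-envelope check of those numbers (weights only; the per-block quantization scales add a little overhead, which is why the real figure is closer to 14 GB than 13.5 GB):

params = 27e9                      # Gemma 3 27B
bf16_gb = params * 2 / 1e9         # 2 bytes per weight  -> 54.0 GB
int4_gb = params * 0.5 / 1e9       # 4 bits per weight   -> 13.5 GB
print(bf16_gb, int4_gb, bf16_gb / int4_gb)   # 54.0 13.5 4.0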

ollama run hf.co/google/gemma-3-4b-it-qat-q4_0-gguf

✔️HF


@ai_machinelearning_big_data


#google #gemma #AI #ML #LLM #Quantization