# SqueezeLLM: Dense-and-Sparse Quantization [Paper]
SqueezeLLM is a post-training quantization framework that incorporates a new method called Dense-and-Sparse Quantization to enable efficient LLM serving.
TLDR: Deploying LLMs is difficult due to their large memory size. This can be addressed with reduced-precision quantization, but naive quantization hurts performance. We address this with a new Dense-and-Sparse Quantization method, which splits each weight matrix into two components: a dense component that can be heavily quantized without affecting model performance, and a sparse component that preserves the sensitive and outlier entries of the weight matrix. With this approach, we can serve larger models with a smaller memory footprint and the same latency, yet with higher accuracy and quality. For instance, the Squeeze variant of the Vicuna models can be served within 6 GB of memory while reaching 2% higher MMLU than the baseline FP16 model, which has a 2x larger memory footprint. For more details please check out our paper.
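As a toy illustration of the decomposition idea, the sketch below splits a weight list into a dense part and a sparse part that sum back to the original. Note this is only a magnitude-based sketch: in the paper, the sparse component also includes sensitive values selected with second-order information, and the dense component is then quantized non-uniformly; the `outlier_frac` parameter and threshold rule here are illustrative assumptions, not the repo's implementation.

```python
def dense_and_sparse_split(w, outlier_frac=0.0045):
    """Split weights into a dense part (outliers zeroed out) and a sparse
    part holding the largest-magnitude outlier_frac fraction of entries in
    full precision. Invariant: dense[i] + sparse[i] == w[i] for every i."""
    k = max(1, round(len(w) * outlier_frac))
    # magnitude of the k-th largest entry serves as the outlier threshold
    thresh = sorted((abs(x) for x in w), reverse=True)[k - 1]
    sparse = [x if abs(x) >= thresh else 0.0 for x in w]   # kept in FP16
    dense = [0.0 if abs(x) >= thresh else x for x in w]    # quantized to 3/4 bits
    return dense, sparse

w = [0.1, -0.2, 8.0, 0.05, -7.5, 0.3]  # two large outliers
dense, sparse = dense_and_sparse_split(w, outlier_frac=1 / 3)
assert sparse == [0.0, 0.0, 8.0, 0.0, -7.5, 0.0]
assert all(d + s == x for d, s, x in zip(dense, sparse, w))
```

Because the sparse part contains only a fraction of a percent of the entries, it can be stored in a compressed sparse format and processed with a separate kernel at negligible cost.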
- Updates (7/10): All models other than LLaMA and Vicuna v1.1 can now be run and evaluated without downloading the original checkpoints.
- Updates (7/5): Salesforce's XGen models (both Base and Inst) with 8k sequence length and OPT models are supported.
- Updates (6/28): All the LLaMA/Vicuna checkpoints are uploaded for all sizes and sparsity levels.
- Updates (6/20): The dense-and-sparse kernel is supported.
- Updates (6/16): Vicuna-7B and 13B, and LLaMA-30B are all supported in both 3-bit and 4-bit.
- Create a conda environment

```bash
conda create --name sqllm python=3.9 -y
conda activate sqllm
```

- Clone and install the dependencies

```bash
git clone https://github.com/SqueezeAILab/SqueezeLLM
cd SqueezeLLM
pip install -e .
cd squeezellm
python setup_cuda.py install
```
Currently, we support LLaMA 7B, 13B, 30B, and 65B; instruction-tuned Vicuna 7B and 13B; XGen 7B with 8k sequence length; and OPT 1.3B to 30B. For each model, we support 3-bit and 4-bit quantization, with sparsity levels of 0% (dense-only), 0.05%, and 0.45%. See our paper for more detailed information on these configurations. Below are the links to download the models.
Model | Bitwidth | Dense-only (0%) | 0.05% sparsity | 0.45% sparsity |
---|---|---|---|---|
LLaMA-7B | 3 | sq-llama-7b-w3-s0 | sq-llama-7b-w3-s5 | sq-llama-7b-w3-s45 |
LLaMA-7B | 4 | sq-llama-7b-w4-s0 | sq-llama-7b-w4-s5 | sq-llama-7b-w4-s45 |
LLaMA-13B | 3 | sq-llama-13b-w3-s0 | sq-llama-13b-w3-s5 | sq-llama-13b-w3-s45 |
LLaMA-13B | 4 | sq-llama-13b-w4-s0 | sq-llama-13b-w4-s5 | sq-llama-13b-w4-s45 |
LLaMA-30B | 3 | sq-llama-30b-w3-s0 | sq-llama-30b-w3-s5 | sq-llama-30b-w3-s45 |
LLaMA-30B | 4 | sq-llama-30b-w4-s0 | sq-llama-30b-w4-s5 | sq-llama-30b-w4-s45 |
LLaMA-65B | 3 | sq-llama-65b-w3-s0 | sq-llama-65b-w3-s5 | sq-llama-65b-w3-s45 |
LLaMA-65B | 4 | sq-llama-65b-w4-s0 | sq-llama-65b-w4-s5 | sq-llama-65b-w4-s45 |
Model | Bitwidth | Dense-only (0%) | 0.45% sparsity |
---|---|---|---|
Vicuna-7B | 3 | sq-vicuna-7b-w3-s0 | sq-vicuna-7b-w3-s45 |
Vicuna-7B | 4 | sq-vicuna-7b-w4-s0 | sq-vicuna-7b-w4-s45 |
Vicuna-13B | 3 | sq-vicuna-13b-w3-s0 | sq-vicuna-13b-w3-s45 |
Vicuna-13B | 4 | sq-vicuna-13b-w4-s0 | sq-vicuna-13b-w4-s45 |
XGen-7B-8k-Base is a 7B model pre-trained with an 8k sequence length. XGen-7B-8k-Inst is a model supervised-finetuned on public-domain instructional data for instruction-following applications. Please refer to the blog post from Salesforce AI Research for more details on the models.
Model | Bitwidth | Dense-only (0%) | 0.45% sparsity |
---|---|---|---|
XGen-7B-8k-Base | 3 | sq-xgen-7b-8k-base-w3-s0 | sq-xgen-7b-8k-base-w3-s45 |
XGen-7B-8k-Base | 4 | sq-xgen-7b-8k-base-w4-s0 | sq-xgen-7b-8k-base-w4-s45 |
XGen-7B-8k-Inst | 3 | sq-xgen-7b-8k-inst-w3-s0 | sq-xgen-7b-8k-inst-w3-s45 |
XGen-7B-8k-Inst | 4 | sq-xgen-7b-8k-inst-w4-s0 | sq-xgen-7b-8k-inst-w4-s45 |
Model | Bitwidth | Dense-only (0%) | 0.5% sparsity |
---|---|---|---|
OPT-1.3B | 3 | sq-opt-1.3b-w3-s0 | sq-opt-1.3b-w3-s50 |
OPT-1.3B | 4 | sq-opt-1.3b-w4-s0 | sq-opt-1.3b-w4-s50 |
OPT-2.7B | 3 | sq-opt-2.7b-w3-s0 | sq-opt-2.7b-w3-s50 |
OPT-2.7B | 4 | sq-opt-2.7b-w4-s0 | sq-opt-2.7b-w4-s50 |
OPT-6.7B | 3 | sq-opt-6.7b-w3-s0 | sq-opt-6.7b-w3-s50 |
OPT-6.7B | 4 | sq-opt-6.7b-w4-s0 | sq-opt-6.7b-w4-s50 |
OPT-13B | 3 | sq-opt-13b-w3-s0 | sq-opt-13b-w3-s50 |
OPT-13B | 4 | sq-opt-13b-w4-s0 | sq-opt-13b-w4-s50 |
OPT-30B | 3 | sq-opt-30b-w3-s0 | sq-opt-30b-w3-s50 |
OPT-30B | 4 | sq-opt-30b-w4-s0 | sq-opt-30b-w4-s50 |
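The checkpoint names in the tables above follow a single scheme: `sq-{model}-w{bitwidth}-s{sparsity}`, where the sparsity suffix is the percentage with the decimal point dropped (0% → `s0`, 0.05% → `s5`, 0.45% → `s45`). The helper below is a hypothetical convenience function (not part of the repo) that just restates this naming convention:

```python
def sq_checkpoint_name(model: str, wbits: int, sparsity_pct: float) -> str:
    """Build a SqueezeLLM checkpoint file name from the tables above.

    The sparsity suffix is the percentage scaled by 100:
    0% -> s0, 0.05% -> s5, 0.45% -> s45, 0.5% -> s50.
    """
    return f"sq-{model}-w{wbits}-s{round(sparsity_pct * 100)}.pt"

assert sq_checkpoint_name("llama-7b", 3, 0.45) == "sq-llama-7b-w3-s45.pt"
assert sq_checkpoint_name("opt-1.3b", 4, 0.5) == "sq-opt-1.3b-w4-s50.pt"
```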
The following commands run and benchmark the 3-bit quantized models on the C4 dataset. The `--torch_profile` argument can be passed when benchmarking to replicate the runtime results from the paper.

Download the quantized model (e.g. `sq-llama-7b-w3-s0.pt` or `sq-xgen-7b-8k-base-w3-s0.pt`) locally from the links above. Note that for the LLaMA models, you need to first obtain the original, pre-trained LLaMA model in the Hugging Face-compatible format locally and provide its path in `{model_path}`. For other model types, you don't need to install or download the original models separately, as we provide Hugging Face-compatible configs for all supported models in `models`. You can follow the same procedure for other model types and quantization settings such as bitwidth and sparsity level.
```bash
# LLaMA Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s0.pt --benchmark 128 --check

# XGen Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py models/xgen-7b-8k-base c4 --wbits 3 --load sq-xgen-7b-8k-base-w3-s0.pt --benchmark 128 --check
```
When using checkpoints with sparsity (i.e. a non-zero sparsity level), the `--include_sparse` flag should also be passed:
```bash
# LLaMA Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s5.pt --include_sparse --benchmark 128 --check

# XGen Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py models/xgen-7b-8k-base c4 --wbits 3 --load sq-xgen-7b-8k-base-w3-s45.pt --include_sparse --benchmark 128 --check
```
NOTE: To reproduce the perplexity numbers in our paper, use `--eval` instead of `--benchmark`, following the instructions below.
The following commands evaluate perplexity with the 3-bit quantized models on the C4 dataset, following the same evaluation methodology as GPTQ and GPTQ-For-LLaMA. This reproduces the perplexity numbers reported in our paper.
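For reference, the quantity reported is corpus perplexity: the exponential of the mean per-token negative log-likelihood over fixed-length segments of the evaluation set, as in the GPTQ evaluation setup. A minimal sketch of the final reduction step (the per-token NLLs themselves would come from the model's logits):

```python
import math

def perplexity(token_nlls):
    """Corpus perplexity: exp of the mean per-token negative
    log-likelihood (in nats) over all evaluated tokens."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model assigning every token probability 1/2 has perplexity exactly 2.
assert abs(perplexity([math.log(2)] * 100) - 2.0) < 1e-9
```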
Download the quantized model (e.g. `sq-llama-7b-w3-s0.pt` or `sq-xgen-7b-8k-base-w3-s0.pt`) locally from the links above. Note that for the LLaMA models, you need to first obtain the original, pre-trained LLaMA model in the Hugging Face-compatible format locally and provide its path in `{model_path}`. For other model types, you don't need to install or download the original models separately, as we provide Hugging Face-compatible configs for all supported models in `models`. You can follow the same procedure for other model types and quantization settings such as bitwidth and sparsity level.
```bash
# LLaMA Perplexity Evaluation
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s0.pt --eval

# XGen Perplexity Evaluation
CUDA_VISIBLE_DEVICES=0 python llama.py models/xgen-7b-8k-base c4 --wbits 3 --load sq-xgen-7b-8k-base-w3-s0.pt --eval
```
When using checkpoints with sparsity (i.e. a non-zero sparsity level), the `--include_sparse` flag should also be passed:
```bash
# LLaMA Perplexity Evaluation
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s5.pt --include_sparse --eval

# XGen Perplexity Evaluation
CUDA_VISIBLE_DEVICES=0 python llama.py models/xgen-7b-8k-base c4 --wbits 3 --load sq-xgen-7b-8k-base-w3-s45.pt --include_sparse --eval
```
The code was tested on A5000 and A6000 GPUs with CUDA 11.3 and cuDNN 8.2.
This code reuses components from several libraries including GPTQ as well as GPTQ-For-LLaMA.
SqueezeLLM has been developed as part of the following paper. We would appreciate it if you cite the paper when you find the library useful for your work:
```bibtex
@article{kim2023squeezellm,
  title={SqueezeLLM: Dense-and-Sparse Quantization},
  author={Kim, Sehoon and Hooper, Coleman and Gholami, Amir and Dong, Zhen and Li, Xiuyu and Shen, Sheng and Mahoney, Michael and Keutzer, Kurt},
  journal={arXiv},
  year={2023}
}
```