# SqueezeLLM: Dense-and-Sparse Quantization [Paper]
SqueezeLLM is a post-training quantization framework that incorporates a new method called Dense-and-Sparse Quantization to enable efficient LLM serving.
TLDR: Deploying LLMs is difficult due to their large memory size. This can be addressed with reduced-precision quantization, but naive quantization hurts performance. We address this with a new Dense-and-Sparse Quantization method, which splits each weight matrix into two components: a dense component that can be heavily quantized without affecting model performance, and a sparse component that preserves the sensitive and outlier entries of the weight matrix. With this approach, we can serve larger models with a smaller memory footprint and the same latency, yet with higher accuracy and quality. For instance, the Squeeze variant of the Vicuna models can be served within 6 GB of memory while reaching 2% higher MMLU than the baseline FP16 model, which has a 2x larger memory footprint. For more details please check out our paper.
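As a toy illustration of the decomposition idea, the sketch below splits a weight list into a dense part and a sparse part that sum back to the original. Note this is only a magnitude-based sketch: in the paper, the sparse component also includes sensitive values selected with second-order information, and the dense component is then quantized non-uniformly; the `outlier_frac` parameter and threshold rule here are illustrative assumptions, not the repo's implementation.

```python
def dense_and_sparse_split(w, outlier_frac=0.0045):
    """Split weights into a dense part (outliers zeroed out) and a sparse
    part holding the largest-magnitude outlier_frac fraction of entries in
    full precision. Invariant: dense[i] + sparse[i] == w[i] for every i."""
    k = max(1, round(len(w) * outlier_frac))
    # magnitude of the k-th largest entry serves as the outlier threshold
    thresh = sorted((abs(x) for x in w), reverse=True)[k - 1]
    sparse = [x if abs(x) >= thresh else 0.0 for x in w]   # kept in FP16
    dense = [0.0 if abs(x) >= thresh else x for x in w]    # quantized to 3/4 bits
    return dense, sparse

w = [0.1, -0.2, 8.0, 0.05, -7.5, 0.3]  # two large outliers
dense, sparse = dense_and_sparse_split(w, outlier_frac=1 / 3)
assert sparse == [0.0, 0.0, 8.0, 0.0, -7.5, 0.0]
assert all(d + s == x for d, s, x in zip(dense, sparse, w))
```

Because the sparse part contains only a fraction of a percent of the entries, it can be stored in a compressed sparse format and processed with a separate kernel at negligible cost.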
- Updates (7/10): All models other than LLaMA and Vicuna v1.1 can now be run and evaluated without downloading the original checkpoints.
- Updates (7/5): Salesforce's XGen models (both Base and Inst) with 8k sequence length and OPT models are supported.
- Updates (6/28): All the LLaMA/Vicuna checkpoints are uploaded for all sizes and sparsity levels.
- Updates (6/20): The dense-and-sparse kernel is supported.
- Updates (6/16): Vicuna-7B and 13B, and LLaMA-30B are all supported in both 3-bit and 4-bit.
- Create a conda environment

```bash
conda create --name sqllm python=3.9 -y
conda activate sqllm
```

- Clone and install the dependencies

```bash
git clone https://github.com/SqueezeAILab/SqueezeLLM
cd SqueezeLLM
pip install -e .
cd squeezellm
python setup_cuda.py install
```
Currently, we support LLaMA 7B, 13B, 30B, and 65B; instruction-tuned Vicuna 7B and 13B; XGen 7B with 8k sequence length; and OPT 1.3B to 30B. For each model, we support 3-bit and 4-bit quantization, with sparsity levels of 0% (dense-only), 0.05%, and 0.45%. See our paper for more detailed information on these configurations. Below are the links to download the models.
Model | Bitwidth | Dense-only (0%) | 0.05% sparsity | 0.45% sparsity |
---|---|---|---|---|
LLaMA-7B | 3 | sq-llama-7b-w3-s0 | sq-llama-7b-w3-s5 | sq-llama-7b-w3-s45 |
LLaMA-7B | 4 | sq-llama-7b-w4-s0 | sq-llama-7b-w4-s5 | sq-llama-7b-w4-s45 |
LLaMA-13B | 3 | sq-llama-13b-w3-s0 | sq-llama-13b-w3-s5 | sq-llama-13b-w3-s45 |
LLaMA-13B | 4 | sq-llama-13b-w4-s0 | sq-llama-13b-w4-s5 | sq-llama-13b-w4-s45 |
LLaMA-30B | 3 | sq-llama-30b-w3-s0 | sq-llama-30b-w3-s5 | sq-llama-30b-w3-s45 |
LLaMA-30B | 4 | sq-llama-30b-w4-s0 | sq-llama-30b-w4-s5 | sq-llama-30b-w4-s45 |
LLaMA-65B | 3 | sq-llama-65b-w3-s0 | sq-llama-65b-w3-s5 | sq-llama-65b-w3-s45 |
LLaMA-65B | 4 | sq-llama-65b-w4-s0 | sq-llama-65b-w4-s5 | sq-llama-65b-w4-s45 |
Model | Bitwidth | Dense-only (0%) | 0.45% sparsity |
---|---|---|---|
Vicuna-7B | 3 | sq-vicuna-7b-w3-s0 | sq-vicuna-7b-w3-s45 |
Vicuna-7B | 4 | sq-vicuna-7b-w4-s0 | sq-vicuna-7b-w4-s45 |
Vicuna-13B | 3 | sq-vicuna-13b-w3-s0 | sq-vicuna-13b-w3-s45 |
Vicuna-13B | 4 | sq-vicuna-13b-w4-s0 | sq-vicuna-13b-w4-s45 |
XGen-7B-8k-Base is a 7B model pre-trained with an 8k sequence length. XGen-7B-8k-Inst is a model supervised-finetuned on public-domain instructional data for instruction-following applications. Please refer to the blog post from Salesforce AI Research for more details on the models.
Model | Bitwidth | Dense-only (0%) | 0.45% sparsity |
---|---|---|---|
XGen-7B-8k-Base | 3 | sq-xgen-7b-8k-base-w3-s0 | sq-xgen-7b-8k-base-w3-s45 |
XGen-7B-8k-Base | 4 | sq-xgen-7b-8k-base-w4-s0 | sq-xgen-7b-8k-base-w4-s45 |
XGen-7B-8k-Inst | 3 | sq-xgen-7b-8k-inst-w3-s0 | sq-xgen-7b-8k-inst-w3-s45 |
XGen-7B-8k-Inst | 4 | sq-xgen-7b-8k-inst-w4-s0 | sq-xgen-7b-8k-inst-w4-s45 |
Model | Bitwidth | Dense-only (0%) | 0.5% sparsity |
---|---|---|---|
OPT-1.3B | 3 | sq-opt-1.3b-w3-s0 | sq-opt-1.3b-w3-s50 |
OPT-1.3B | 4 | sq-opt-1.3b-w4-s0 | sq-opt-1.3b-w4-s50 |
OPT-2.7B | 3 | sq-opt-2.7b-w3-s0 | sq-opt-2.7b-w3-s50 |
OPT-2.7B | 4 | sq-opt-2.7b-w4-s0 | sq-opt-2.7b-w4-s50 |
OPT-6.7B | 3 | sq-opt-6.7b-w3-s0 | sq-opt-6.7b-w3-s50 |
OPT-6.7B | 4 | sq-opt-6.7b-w4-s0 | sq-opt-6.7b-w4-s50 |
OPT-13B | 3 | sq-opt-13b-w3-s0 | sq-opt-13b-w3-s50 |
OPT-13B | 4 | sq-opt-13b-w4-s0 | sq-opt-13b-w4-s50 |
OPT-30B | 3 | sq-opt-30b-w3-s0 | sq-opt-30b-w3-s50 |
OPT-30B | 4 | sq-opt-30b-w4-s0 | sq-opt-30b-w4-s50 |
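The checkpoint names in the tables above follow a single scheme: `sq-{model}-w{bitwidth}-s{sparsity}`, where the sparsity suffix is the percentage with the decimal point dropped (0% → `s0`, 0.05% → `s5`, 0.45% → `s45`). The helper below is a hypothetical convenience function (not part of the repo) that just restates this naming convention:

```python
def sq_checkpoint_name(model: str, wbits: int, sparsity_pct: float) -> str:
    """Build a SqueezeLLM checkpoint file name from the tables above.

    The sparsity suffix is the percentage scaled by 100:
    0% -> s0, 0.05% -> s5, 0.45% -> s45, 0.5% -> s50.
    """
    return f"sq-{model}-w{wbits}-s{round(sparsity_pct * 100)}.pt"

assert sq_checkpoint_name("llama-7b", 3, 0.45) == "sq-llama-7b-w3-s45.pt"
assert sq_checkpoint_name("opt-1.3b", 4, 0.5) == "sq-opt-1.3b-w4-s50.pt"
```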
The following commands run and benchmark the 3-bit quantized models on the C4 dataset. The `--torch_profile` argument can be passed when benchmarking to replicate the runtime results from the paper.

Download the quantized model (e.g. `sq-llama-7b-w3-s0.pt` or `sq-xgen-7b-8k-base-w3-s0.pt`) locally from the links above. Note that for the LLaMA models, you need to first obtain the original, pre-trained LLaMA model in the Hugging Face-compatible format locally and provide its path in `{model_path}`. For other model types, you don't need to install or download the original models separately, as we provide Hugging Face-compatible configs for all supported models in `models`. You can follow the same procedure for other model types and quantization settings such as bitwidth and sparsity level.
```bash
# LLaMA Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s0.pt --benchmark 128 --check

# XGen Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py models/xgen-7b-8k-base c4 --wbits 3 --load sq-xgen-7b-8k-base-w3-s0.pt --benchmark 128 --check
```
When using checkpoints with sparsity (i.e. a non-zero sparsity level), the `--include_sparse` flag should also be passed:
```bash
# LLaMA Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s5.pt --include_sparse --benchmark 128 --check

# XGen Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py models/xgen-7b-8k-base c4 --wbits 3 --load sq-xgen-7b-8k-base-w3-s45.pt --include_sparse --benchmark 128 --check
```
NOTE: To reproduce the perplexity numbers in our paper, use `--eval` instead of `--benchmark`, following the instructions below.
The following commands evaluate perplexity with the 3-bit quantized models on the C4 dataset, following the same evaluation methodology as GPTQ and GPTQ-For-LLaMA. This reproduces the perplexity numbers reported in our paper.
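For reference, the quantity reported is corpus perplexity: the exponential of the mean per-token negative log-likelihood over fixed-length segments of the evaluation set, as in the GPTQ evaluation setup. A minimal sketch of the final reduction step (the per-token NLLs themselves would come from the model's logits):

```python
import math

def perplexity(token_nlls):
    """Corpus perplexity: exp of the mean per-token negative
    log-likelihood (in nats) over all evaluated tokens."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model assigning every token probability 1/2 has perplexity exactly 2.
assert abs(perplexity([math.log(2)] * 100) - 2.0) < 1e-9
```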
Download the quantized model (e.g. `sq-llama-7b-w3-s0.pt` or `sq-xgen-7b-8k-base-w3-s0.pt`) locally from the links above. Note that for the LLaMA models, you need to first obtain the original, pre-trained LLaMA model in the Hugging Face-compatible format locally and provide its path in `{model_path}`. For other model types, you don't need to install or download the original models separately, as we provide Hugging Face-compatible configs for all supported models in `models`. You can follow the same procedure for other model types and quantization settings such as bitwidth and sparsity level.
```bash
# LLaMA Perplexity Evaluation
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s0.pt --eval

# XGen Perplexity Evaluation
CUDA_VISIBLE_DEVICES=0 python llama.py models/xgen-7b-8k-base c4 --wbits 3 --load sq-xgen-7b-8k-base-w3-s0.pt --eval
```
When using checkpoints with sparsity (i.e. a non-zero sparsity level), the `--include_sparse` flag should also be passed:
```bash
# LLaMA Perplexity Evaluation
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s5.pt --include_sparse --eval

# XGen Perplexity Evaluation
CUDA_VISIBLE_DEVICES=0 python llama.py models/xgen-7b-8k-base c4 --wbits 3 --load sq-xgen-7b-8k-base-w3-s45.pt --include_sparse --eval
```
The code was tested on A5000 and A6000 GPUs with CUDA 11.3 and cuDNN 8.2.
This code reuses components from several libraries including GPTQ as well as GPTQ-For-LLaMA.
SqueezeLLM has been developed as part of the following paper. We would appreciate it if you cite the paper when you find the library useful for your work:
```bibtex
@article{kim2023squeezellm,
  title={SqueezeLLM: Dense-and-Sparse Quantization},
  author={Kim, Sehoon and Hooper, Coleman and Gholami, Amir and Dong, Zhen and Li, Xiuyu and Shen, Sheng and Mahoney, Michael and Keutzer, Kurt},
  journal={arXiv},
  year={2023}
}
```