With the rapid progress of deep learning, large language models (LLMs) have drawn wide attention for their strong natural-language understanding and generation capabilities. However, these models usually have a huge number of parameters, which makes real-world deployment expensive: high compute consumption and long inference latency. Model quantization was developed to overcome these challenges; it lowers the numeric precision used to represent model weights, reducing storage and compute cost while keeping model quality as close to the original as possible.
Why quantize a model?
- Higher efficiency: quantization sharply reduces storage requirements and compute, which speeds up inference and matters most on resource-constrained devices.
- Lower cost: by reducing the need for high-end hardware, quantization lowers the cost of deploying a model.
- Broader reach: it lets large models run on edge devices, extending the range of applications to mobile phones, IoT devices, and more.
Why AWQ matters
AWQ (Activation-aware Weight Quantization) is a low-bit weight quantization method designed specifically for large language models. It takes into account not only the distribution of the weights themselves but also the influence of the activations, which lets the quantized model preserve the original model's quality better. Compared with running the model in FP16, the AutoAWQ toolkit built on AWQ offers the following advantages:
- Faster inference: roughly 3x higher inference speed, greatly improving processing efficiency.
- Lower memory footprint: memory usage drops to about one third of the FP16 model, so larger models can run on a wider range of hardware.
- Hardware friendly: the quantization kernels are tuned for the underlying hardware, so the model executes efficiently across different devices.
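To make "activation-aware" concrete, here is a minimal sketch of the idea (an illustration only, not the AutoAWQ implementation): input channels with large average activations receive a larger scale before group-wise INT4 rounding, so the weights that matter most lose the least precision.

import torch

def awq_style_quantize(w, act_mean, group_size=128, n_bits=4, alpha=0.5):
    # w: [out_features, in_features] weight matrix
    # act_mean: [in_features] mean absolute activation per input channel (from calibration data)
    scales = act_mean.clamp(min=1e-5) ** alpha        # per-channel scale: salient channels get amplified
    w_scaled = w * scales
    q_max = 2 ** n_bits - 1                           # asymmetric range with a zero point
    w_q = torch.empty_like(w_scaled)
    for i in range(0, w.shape[1], group_size):        # quantize each group of input channels separately
        g = w_scaled[:, i:i + group_size]
        g_min, g_max = g.min(), g.max()
        step = (g_max - g_min) / q_max
        w_q[:, i:i + group_size] = ((g - g_min) / step).round() * step + g_min
    return w_q / scales                               # fold the scales back out after rounding

w = torch.randn(256, 256)
act_mean = torch.rand(256)                            # stand-in for real activation statistics
print((awq_style_quantize(w, act_mean) - w).abs().mean())  # average reconstruction error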
1. vLLM GPU environment setup
- Official vLLM installation guide: https://docs.vllm.ai/en/latest/getting_started/installation.html
# Create a new virtual environment
conda create -n myenv python=3.11 -y
conda activate myenv
# Install vLLM (the default wheel targets CUDA 12.1)
pip install vllm
If the installation above fails, you can download the whl files manually and install them. An example and the points to check:
- Check which CUDA version your GPU/driver supports; if it supports 11.8, for example, download the whl files built for cu118 (see the commands right after this list for a quick way to check).
- Check the Python version of your current environment; for Python 3.11, download the cp311 whl files.
- On Linux download the Linux whl files; on Windows download the Windows ones.
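The commands below are one way to check these versions (the last line assumes PyTorch is already installed):

# Python version of the current environment
python --version
# CUDA version supported by the installed driver
nvidia-smi
# CUDA version an already-installed PyTorch build was compiled against
python -c "import torch; print(torch.version.cuda)"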
For example, my CUDA version is 11.8 and my Python environment is 3.11 on Linux, so the install commands are:
pip install /home/xxx/cuda_whl/vllm-0.4.2+cu118-cp311-cp311-manylinux1_x86_64.whl
pip install /home/xxx/cuda_whl/xformers-0.0.26.post1+cu118-cp311-cp311-manylinux2014_x86_64.whl
pip install /home/xxx/cuda_whl/torch-2.3.0+cu118-cp311-cp311-linux_x86_64.whl
If any of these installs fail, resolve the problem based on the error message.
2. Installing AutoAWQ
pip install autoawq
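An optional sanity check that the package imports correctly:

python -c "from awq import AutoAWQForCausalLM; print('autoawq OK')"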
3. Downloading the pretrained base model
- Download Qwen2.5-7B-Instruct as the model to quantize: https://www.modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct
Install modelscope
pip install modelscope
Download the model
# Download the model
from modelscope import snapshot_download
model_dir = snapshot_download('Qwen/Qwen2.5-7B-Instruct')
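By default modelscope caches the download under ~/.cache/modelscope. If you prefer a specific directory (for example the models folder used later in this guide), snapshot_download also accepts a cache_dir argument; the path below is only an example, and the exact subfolder layout may differ:

from modelscope import snapshot_download

model_dir = snapshot_download('Qwen/Qwen2.5-7B-Instruct', cache_dir='/home/xxx/models')
print(model_dir)  # local path to pass to AutoAWQ in the next step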
4. Running AWQ quantization
- AutoAWQ: https://github.com/casper-hansen/AutoAWQ
- Qwen quantization guide: https://qwen.readthedocs.io/zh-cn/latest/quantization/awq.html
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '7'

# Path of the base model downloaded earlier
model_path = '/home/xxx/models/Qwen2.5-7B-Instruct'
# Output path for the quantized model; created automatically if it does not exist
quant_path = '/home/xxx/models/Qwen2.5-7B-Instruct-awq'

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto", safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Calibration data; you can replace this with your own dataset
dataset = [
[
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are an expert in travel planning and programming."},
{"role": "user", "content": "Can you recommend a scenic spot for a weekend trip near Beijing?"},
{"role": "assistant", "content": "Certainly! The Great Wall at Mutianyu would be a perfect choice for a weekend getaway. It's less than two hours' drive from Beijing and offers stunning views and a less crowded experience compared to Badaling."}
],
[
{"role": "user", "content": "How about a good place for a family vacation in China?"},
{"role": "assistant", "content": "For a family vacation, I would suggest Shanghai Disneyland. It offers a magical experience with various attractions and parades suitable for all ages. Plus, it's a great way to combine fun with a visit to one of China's most vibrant cities."}
],
[
{"role": "user", "content": "Can you help me with a simple Python code to sort a list of numbers?"},
{"role": "assistant", "content": "Of course! Here's a simple example of how to sort a list of numbers in Python:\n\npython\nnumbers = [3, 1, 4, 1, 5, 9]\nnumbers.sort()\nprint(numbers)\n"}
],
[
{"role": "user", "content": "What are some must-see destinations in Europe for a first-time traveler?"},
{"role": "assistant", "content": "Europe has many must-see destinations, but for a first-time traveler, I would recommend visiting Paris for its iconic Eiffel Tower and Louvre Museum, Rome for its ancient ruins like the Colosseum, and Amsterdam for its charming canals and museums."}
],
[
{"role": "user", "content": "Can you provide a tip for optimizing JavaScript code performance?"},
{"role": "assistant", "content": "One tip for optimizing JavaScript performance is to minimize DOM manipulations. Accessing the DOM is slow, so try to batch updates or use Document Fragments when appending multiple elements to the DOM."}
]
]
data = []
for msg in dataset:
    text = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
    data.append(text.strip())
# Start quantization
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=data,       # custom calibration data; if omitted, a default dataset is downloaded from the Hub
    max_calib_seq_len=256  # use a smaller value when the calibration data is small
)
# Save the quantized model and tokenizer
model.save_quantized(quant_path, safetensors=True, shard_size="4GB")
tokenizer.save_pretrained(quant_path)
print(f'Model is quantized and saved at "{quant_path}"')
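As an optional check of the memory-footprint claim above, you can compare the size of the quantized checkpoint on disk with the original FP16 model (a quick ad hoc check, not part of AutoAWQ):

import os
total = sum(os.path.getsize(os.path.join(quant_path, f)) for f in os.listdir(quant_path))
print(f"Quantized model size on disk: {total / 1e9:.1f} GB")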
4.1. Quantization errors
- ConnectionError: Couldn't reach 'mit-han-lab/pile-val-backup' on the Hub (ConnectionError)
  Raised at: model.quantize(tokenizer, quant_config=quant_config)
  Cause: without calib_data, AutoAWQ tries to download its default calibration dataset from the Hub to calibrate the quantization scale factors, and the external connection failed.
  Fix: pass your own calibration dataset via calib_data, as in the code above.
- RuntimeError: torch.cat(): expected a non-empty list of Tensors
  Cause: the calibration data is too small, so n_split = cat_samples.shape[1] // max_seq_len in the source code floors to 0, which triggers the error.
  Fix: lower max_calib_seq_len from the default 512 to 256:
  model.quantize(tokenizer, quant_config=quant_config, calib_data=data,
      max_calib_seq_len=256
  )
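Before calling quantize, you can roughly check whether the calibration set is long enough for the chosen max_calib_seq_len (a sketch of the same constraint that causes the error above):

max_calib_seq_len = 256
total_tokens = sum(len(tokenizer(t).input_ids) for t in data)
print(total_tokens, "calibration tokens in total")
if total_tokens < max_calib_seq_len:
    print("Too few tokens: add more calibration samples or lower max_calib_seq_len")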
5. Inference with vLLM
Load the quantized model and run inference.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
import torch, os
os.environ['CUDA_VISIBLE_DEVICES'] = '7'
# os.environ['CUDA_VISIBLE_DEVICES'] = '6,7'  # multiple GPUs
model_path = "/home/xxx/models/Qwen2.5-7B-Instruct-awq"
prompt = "介绍一下大模型技术!"tokenizer = AutoTokenizer.from_pretrained(model_path, # trust_remote_code=True)
# Input the model name or path. Can be GPTQ or AWQ models.
llm = LLM(model=model_path,max_model_len=10000, # 设置最大输入长度tensor_parallel_size=1, # 多少张卡gpu_memory_utilization=0.95,trust_remote_code=True)
messages = [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(messages,tokenize=False,add_generation_prompt=True
)# 输出
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512 # 设置最大输出长度)
outputs = llm.generate([text], sampling_params)
# print(outputs)# Print the outputs.
for output in outputs:prompt = output.promptgenerated_text = output.outputs[0].textprint(f"Generated text:\n {generated_text}")
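The same quantized checkpoint can also be served through vLLM's OpenAI-compatible API server instead of the offline LLM class; the flags below are one possible configuration, so adjust the path and port to your setup:

python -m vllm.entrypoints.openai.api_server \
    --model /home/xxx/models/Qwen2.5-7B-Instruct-awq \
    --quantization awq \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.95 \
    --port 8000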
6. Inference with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, os
os.environ['CUDA_VISIBLE_DEVICES'] = '7'
# os.environ['CUDA_VISIBLE_DEVICES'] = '6,7'  # multiple GPUs
model_name = "/home/xxx/models/Qwen2.5-7B-Instruct-awq"
prompt = "介绍一下大模型技术!"model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)messages = [{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},{"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(messages,tokenize=False,add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)generated_ids = model.generate(**model_inputs,max_new_tokens=512,
)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]