[LLM in Practice] The Origins of vLLM, and Deploying and Accelerating LLM Inference

1. Background and the Origins of vLLM

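Large language models are undeniably playing an increasingly important role in work and daily life. But these models are large, and serving a single LLM request can cost up to 10 times as much as a traditional keyword query. Because inference is this expensive, raising the throughput of LLM serving systems, and thereby lowering the cost per request, is becoming more and more important [1].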

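At the core of an LLM is an autoregressive Transformer that generates output one token at a time, conditioned on the input prompt and the tokens generated so far. Every request repeats this expensive process until the model emits a termination token. This sequential generation makes the workload memory-bound, leaves the GPU's compute capacity underused, and limits serving throughput. Throughput can be improved by batching multiple requests together, but batching requires managing each request's memory efficiently. In the setup profiled in [1] (a 13B-parameter model on a 40 GB GPU), about 65% of memory goes to model weights, which stay fixed during serving. Close to 30% holds the dynamic state of requests. For Transformers, this state consists of the key and value tensors of the attention mechanism, known as the KV cache, which represents the context from earlier tokens that is used to generate the next output token. The small remainder is used for other data, including activations, the ephemeral tensors created when evaluating the model. Since the weights are constant and activations take up only a small share of GPU memory, how the KV cache is managed is critical in determining the maximum batch size. Managed poorly, KV cache memory severely limits batch size and, with it, LLM throughput.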
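To make the memory split more concrete, here is a rough back-of-the-envelope sketch of how large the KV cache gets. The model shape (40 layers, hidden size 5120, FP16) and the 30%-of-40-GiB budget are assumptions chosen to resemble a 13B-class model on a 40 GB GPU; the function names are illustrative only.

```python
def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * num_layers * hidden_size * bytes_per_elem


def max_cached_tokens(kv_budget_bytes: int, num_layers: int, hidden_size: int,
                      bytes_per_elem: int = 2) -> int:
    """How many tokens (summed over all in-flight requests) fit in a KV budget."""
    return kv_budget_bytes // kv_bytes_per_token(num_layers, hidden_size, bytes_per_elem)


# Assumed shape: 40 layers, hidden size 5120, FP16 (2 bytes per element).
per_token = kv_bytes_per_token(40, 5120)
# Assumed budget: 30% of a 40 GiB GPU, computed in integer arithmetic.
budget = 40 * 1024**3 * 30 // 100

print(per_token)                            # bytes of KV cache per token
print(max_cached_tokens(budget, 40, 5120))  # tokens that fit in the budget
```

Under these assumptions each token costs roughly 800 KiB of KV cache, so the total number of tokens that can be held across all requests is what ultimately caps the batch size.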

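Some LLM serving systems manage KV cache memory poorly, mainly because they store each request's KV cache in contiguous memory. Unlike tensors in traditional deep learning workloads, however, the KV cache has distinctive properties: it grows and shrinks dynamically as the model generates new tokens, and its lifetime and length are not known in advance. These properties make existing systems markedly inefficient in two ways: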

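(1) Existing systems suffer from memory fragmentation. To store a request's KV cache contiguously, they pre-allocate a contiguous chunk sized for the request's maximum length, for example 2048 tokens. This causes internal fragmentation, because the request's actual length may be far shorter than the maximum. Moreover, the whole chunk stays reserved for the request's entire lifetime, so other, shorter requests cannot use the unoccupied part. And because the pre-allocated size can differ from request to request, external fragmentation can arise as well. According to [1], only about 20% to 40% of KV cache memory in existing systems stores actual token state; the rest is lost to fragmentation.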

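(2) Existing systems cannot exploit opportunities for memory sharing. LLM serving often uses decoding algorithms such as parallel sampling and beam search, which generate multiple outputs per request. In these scenarios a request consists of multiple sequences that could partially share their KV cache. Existing systems cannot share this memory, because each sequence's KV cache sits in its own contiguous space.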
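The fragmentation described in point (1) can be illustrated with a small calculation. This is only a sketch: the four request lengths are made-up numbers, and the helper name is hypothetical.

```python
def contiguous_utilization(actual_lengths, max_len=2048):
    """Fraction of reserved KV slots that hold real token state when every
    request reserves max_len contiguous slots up front."""
    reserved = max_len * len(actual_lengths)
    used = sum(min(n, max_len) for n in actual_lengths)
    return used / reserved


# Hypothetical final lengths (prompt + generated tokens) of four requests.
lengths = [300, 512, 128, 900]
print(f"{contiguous_utilization(lengths):.1%}")  # 22.5%
```

With these illustrative lengths, fewer than a quarter of the reserved slots ever hold token state, which is the kind of waste the 20% to 40% figure refers to.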

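To address these memory problems, [1] proposes PagedAttention, an attention algorithm inspired by virtual memory and paging in operating systems. For a description of PagedAttention, see the earlier post 《自注意力机制计算加速工程优化技巧》 ("Engineering Optimization Tricks for Accelerating Self-Attention Computation"). PagedAttention splits a request's KV cache into blocks, each holding the attention keys and values for a fixed number of tokens. The blocks do not have to be stored contiguously. The KV cache can therefore be managed as flexibly as virtual memory in an operating system: think of blocks as pages, tokens as bytes, and requests as processes. This design reduces internal fragmentation by using relatively small blocks and allocating them on demand. It eliminates external fragmentation because all blocks have the same size. Finally, it enables memory sharing at block granularity, across the different sequences of one request and even across different requests.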
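The paging analogy can be sketched in a few lines of Python. This is a toy model for illustration, not vLLM's actual implementation: the class names, the block size of 16, and the reference-counting scheme are all assumptions, and only block bookkeeping is modelled (no tensors are stored or copied).

```python
class BlockAllocator:
    """Hands out fixed-size physical blocks and tracks reference counts."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> None:
        self.refcount[block] += 1

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """Maps logical token positions to physical blocks via a block table."""

    def __init__(self, allocator: BlockAllocator, block_size: int = 16):
        self.allocator = allocator
        self.block_size = block_size
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % self.block_size == 0:
            # Last block is full (or there is none yet): allocate on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.refcount[last] > 1:
                # Copy-on-write: the half-filled block is shared, so take a private one.
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        """Create a second sequence that shares all existing blocks."""
        child = Sequence(self.allocator, self.block_size)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for block in child.block_table:
            self.allocator.share(block)
        return child


allocator = BlockAllocator(num_blocks=8)
parent = Sequence(allocator)
for _ in range(20):        # a 20-token prompt needs two 16-token blocks
    parent.append_token()
child = parent.fork()      # e.g. a second sample for the same prompt
child.append_token()       # triggers copy-on-write on the half-full block
print(parent.block_table, child.block_table, len(allocator.free))
```

In this toy run the two sequences keep sharing the full first block and only diverge on the partially filled second one, so three physical blocks serve two sequences instead of four.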

2. vLLM and Model Deployment and Inference in Practice

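vLLM is a high-throughput distributed LLM serving engine built on PagedAttention that achieves near-zero waste in KV cache memory. It uses block-level memory management and preemptive request scheduling co-designed with PagedAttention. As usual, given network conditions in mainland China, we use models from the ModelScope model hub for the hands-on part. ModelScope has also released the ms-swift framework [2], which supports deploying and running inference on LLMs with vLLM. If you want to study vLLM in its original form, see the vLLM project itself [3, 4, 5].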

[6] lists some commonly used LLM inference frameworks and compares their features.

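vLLM: suited to large batches of prompts where inference speed matters;
Text generation inference: suited to cases that rely on HuggingFace models and do not need multiple adapters on the core model;
CTranslate2: can run inference on CPU;
OpenLLM: suited to adding adapters to the core model and using HuggingFace Agents, especially when not relying solely on PyTorch;
Ray Serve: stable pipelines and flexible deployment, suited to more mature projects;
MLC LLM: deploys LLMs natively on client (edge) devices, for example on Android or iPhone;
DeepSpeed-MII: deploys LLMs using the DeepSpeed library.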

This article mainly uses the ms-swift (Scalable lightWeight Infrastructure for Fine-Tuning) framework to deploy and run inference on models with vLLM [7, 8].

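SWIFT is a lightweight, out-of-the-box framework for model fine-tuning and inference built on PyTorch. It integrates a range of open-source tuners such as LoRA, QLoRA, and Adapter, and also includes ResTuning, a tuner specific to ModelScope.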

2.1 Environment Setup

GPU device

3090

Check the CUDA version with nvidia-smi

Configure a global pip mirror

pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/

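Install third-party packages (environment setup is probably the most tedious part of engineering work: version conflicts between packages crop up and have to be resolved one by one)

# Install only the LLM capabilities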

pip install 'ms-swift[llm]' -U

pip install vllm
pip install openai -U

You may hit this error when running the code:

AttributeError: module 'tensorflow._api.v2.compat.v2.__internal__' has no attribute 'register_load_context_function'

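The fix is to reinstall tensorflow. I ended up on version 2.12.0; the latest version, 2.18.0, runs into numpy compatibility problems.

There was one more small hiccup: I had started Jupyter from the root directory, so the model download filled up the disk. I then switched to a different working directory.

Under Jupyter, the configuration needs to be changed:

Run jupyter notebook --generate-config to generate the configuration file and print its path

Then edit the c.NotebookApp.notebook_dir setting in that file

In addition, add one line to the code to set the model cache location: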

os.environ['MODELSCOPE_CACHE']='/data/llm'
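Because a model download can fill a small partition, a quick pre-flight check of free disk space may help before the first download. The sketch below uses only the standard library; the helper name and the 30 GiB threshold are hypothetical choices, not part of ms-swift or ModelScope.

```python
import os
import shutil


def has_free_space(path: str, required_gib: float) -> bool:
    """Return True if the filesystem holding `path` has at least required_gib GiB free."""
    # Walk up to the nearest existing directory so the check also works
    # before the cache directory has been created.
    probe = os.path.abspath(path)
    while not os.path.exists(probe):
        probe = os.path.dirname(probe)
    free_gib = shutil.disk_usage(probe).free / 1024**3
    return free_gib >= required_gib


# Falls back to the home directory if MODELSCOPE_CACHE is not set.
cache_dir = os.environ.get('MODELSCOPE_CACHE', os.path.expanduser('~'))
print(cache_dir, has_free_space(cache_dir, required_gib=30))
```

Running this in the notebook before loading the model shows whether the chosen cache directory sits on a partition with enough room.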

2.2 Deploying Qwen-7B with vLLM and Running Inference

The code follows [7, 10].

2.2.1 Single-GPU inference

import os
# Set these before importing swift so they take effect.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['MODELSCOPE_CACHE'] = '/data/llm'

from swift.llm import (
    ModelType, get_vllm_engine, get_default_template_type,
    get_template, inference_vllm, inference_stream_vllm
)

model_type = ModelType.qwen_7b_chat
model_id_or_path = None
llm_engine = get_vllm_engine(model_type, model_id_or_path=model_id_or_path)
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.hf_tokenizer)
llm_engine.generation_config.max_new_tokens = 256
generation_info = {}

# Queries: "Hello!" and "Which city in Jiangsu is the most vibrant?"
request_list = [{'query': '你好!'}, {'query': '江苏哪个城市最有活力？'}]
resp_list = inference_vllm(llm_engine, template, request_list, generation_info=generation_info)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")
print(generation_info)

# Follow-up question ("What fun sights are there here?") using the second answer's history
history1 = resp_list[1]['history']
request_list = [{'query': '这有什么好玩的景点', 'history': history1}]
gen = inference_stream_vllm(llm_engine, template, request_list, generation_info=generation_info)
query = request_list[0]['query']
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for resp_list in gen:
    resp = resp_list[0]
    response = resp['response']
    # Each step carries the full response so far; print only the new part.
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()

history = resp_list[0]['history']
print(f'history: {history}')
print(generation_info)

# Batch answers
# Queries: "Give me a graduate-level math problem!", "In which year will Tiaoxi in
# Hangzhou, Zhejiang be demolished?", "What local snacks does Jinhua, Zhejiang have?"
request_list = [{'query': '帮我出一道研究生数学题!'}, {'query': '浙江杭州笤溪会在哪年拆迁？'}, {'query': '浙江金华当地有哪些小吃？'}]
resp_list = inference_vllm(llm_engine, template, request_list, generation_info=generation_info)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")
print(generation_info)

# Answer based on conversation history
history1 = resp_list[1]['history']
request_list = [{'query': '这有什么好玩的景点', 'history': history1}]
gen = inference_stream_vllm(llm_engine, template, request_list, generation_info=generation_info)
query = request_list[0]['query']
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for resp_list in gen:
    resp = resp_list[0]
    response = resp['response']
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()

history = resp_list[0]['history']
print(f'history: {history}')
print(generation_info)
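The streaming loops above rely on each step carrying the whole response generated so far, and print only the newly added suffix. That logic can be isolated and tried without a GPU. In this sketch, fake_stream is a stand-in generator and iter_deltas is a hypothetical helper; neither is part of ms-swift.

```python
from typing import Iterable, Iterator


def iter_deltas(cumulative: Iterable[str]) -> Iterator[str]:
    """Turn a stream of growing response strings into only-the-new-text pieces."""
    printed = 0
    for response in cumulative:
        yield response[printed:]
        printed = len(response)


def fake_stream() -> Iterator[str]:
    # Stand-in for a streaming generator that yields the full response so far.
    yield "Nan"
    yield "Nanjing is"
    yield "Nanjing is lively."


for delta in iter_deltas(fake_stream()):
    print(delta, end='', flush=True)
print()
```

Tracking the printed length rather than diffing strings keeps the loop cheap, at the cost of assuming the response only ever grows by appending.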

2.2.2 Dual-GPU inference

import os
# Use two GPUs
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
os.environ['MODELSCOPE_CACHE'] = '/data/llm'

from swift.llm import (
    ModelType, get_vllm_engine, get_default_template_type,
    get_template, inference_vllm, inference_stream_vllm
)

model_type = ModelType.qwen_7b_chat
model_id_or_path = None
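# Note: making two GPUs visible does not by itself split the model across them.
# Assumption (not in the original code): ms-swift 2.x accepts a tensor_parallel_size
# argument on get_vllm_engine for this, e.g. tensor_parallel_size=2.
llm_engine = get_vllm_engine(model_type, model_id_or_path=model_id_or_path)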
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.hf_tokenizer)
llm_engine.generation_config.max_new_tokens = 256
generation_info = {}

2.2.3 Launching from the command line

CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen-7b-chat --infer_backend vllm

2.2.4 Server deployment

CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen-7b-chat

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-7b-chat",
"messages": [{"role": "user", "content": "最近取得了很多突破，感觉很好，你怎么看？"}],
"max_tokens": 256,
"temperature": 0
}'
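The same request can also be issued from Python. The sketch below mirrors the curl call using only the standard library; it assumes the server listens on localhost:8000 as in the curl example and returns an OpenAI-style response body. The function names are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "qwen-7b-chat",
                       base_url: str = "http://localhost:8000/v1") -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style response: choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server from swift deploy running, calling print(chat('你好!')) should print the model's reply. Since the openai package was installed earlier, its client pointed at the same base URL would be an alternative.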