[LLM in Practice] GPU Inference Testing for Large Models (Using Qwen2.5-7B as an Example)

1. Background

Two new 3090 cards arrived today, so we checked the installation with nvidia-smi and then ran a simple inference test with Qwen2.5-7B.
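The nvidia-smi check can also be scripted. The following is a minimal sketch, assuming nvidia-smi is on the PATH; the helper names parse_gpu_csv and query_gpus are made up for illustration and are not part of any library.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, memory.total' CSV rows (no header, no units) into (name, MiB) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma so that a comma inside the name would not break parsing
        name, mem = [field.strip() for field in line.rsplit(",", 1)]
        gpus.append((name, int(mem)))
    return gpus


def query_gpus():
    """Return the GPUs reported by nvidia-smi, or [] if the tool is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for index, (name, mem_mib) in enumerate(query_gpus()):
        print(f"GPU {index}: {name}, {mem_mib} MiB")
```

On a machine with both cards installed correctly, this should print two lines, one per 3090.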

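The basic specifications of the 3090 are shown in the figure below. With the two cards combined, the total VRAM is 48 GB (24 GB per card), and each card has a memory bandwidth of 936.2 GB/s. That is enough for common large-model inference services and for model fine-tuning; with QLoRA it can even support fine-tuning a 65-billion-parameter model [1]. In a previous article, "大模型显存资源计算以及GPU如何选择" (Calculating GPU Memory Requirements for Large Models and How to Choose a GPU), we described how to estimate the GPU resources a model needs.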

2. GPU Inference Test for a Large Model

We will use Qwen2.5-7B in BF16 precision. At 2 bytes per parameter, the model weights come to a little over 15 GB, so a single 3090 is enough.
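The size estimate is simple arithmetic: parameter count times bytes per parameter. The sketch below assumes roughly 7.6 billion parameters for the "7B" model (an assumed figure, not taken from this article) and counts only the weights; the KV cache and activations need additional memory on top of this.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in decimal GB (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: about 7.6e9 parameters for the "7B" model.
NUM_PARAMS = 7.6e9
BF16_BYTES = 2          # BF16 stores each parameter in 2 bytes
CARD_VRAM_GB = 24       # one 3090 (two cards give the 48 GB mentioned earlier)

weights_gb = weight_memory_gb(NUM_PARAMS, BF16_BYTES)
print(f"BF16 weights: ~{weights_gb:.1f} GB")
print(f"Fits on one {CARD_VRAM_GB} GB card (weights only): {weights_gb < CARD_VRAM_GB}")
```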

Downloading the model from Hugging Face ran into a connection problem and failed with: OSError: We couldn't connect to 'https://huggingface.co' to load this file.

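To get a quick validation, we switched to ModelScope and downloaded the model with its snapshot_download function. ModelScope can be thought of as a Chinese counterpart to Hugging Face: it is a hosting platform for large models and provides mirror sites in China, so download speeds are good. As estimated above, this 7B model is roughly 15 GB, so the download may still take a little while.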

Code example:

import os

import torch
from modelscope import snapshot_download, AutoModel, AutoTokenizer

# Download the model snapshot into a local cache directory and keep the returned path
model_dir = snapshot_download('qwen/Qwen2.5-7B-Instruct', cache_dir='/root/autodl-tmp', revision='master')

The download took me about 20 minutes.
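Before loading the model, it can be worth confirming that the download is complete by comparing the on-disk size with the estimate of roughly 15 GB. This is a minimal sketch; dir_size_gb is a hypothetical helper, and the path is the one used by the inference code in this article.

```python
import os


def dir_size_gb(path):
    """Sum the sizes of all regular files under path, in decimal GB.

    Symbolic links are skipped so that a linked file is not counted twice.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path) and not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total / 1e9


if __name__ == "__main__":
    # Path taken from the inference code in this article; adjust it to your own cache_dir
    model_dir = "/root/autodl-tmp/qwen/Qwen2___5-7B-Instruct"
    if os.path.isdir(model_dir):
        print(f"{model_dir}: {dir_size_gb(model_dir):.1f} GB")
    else:
        print(f"{model_dir} not found")
```

Printing the model_dir value returned by snapshot_download is the most reliable way to find the actual directory, since the on-disk name can differ from the model ID.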

Inference test code:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Check whether a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Path to the downloaded model
mode_name_or_path = '/root/autodl-tmp/qwen/Qwen2___5-7B-Instruct'


def get_model():
    # Load the tokenizer from the pretrained model directory
    tokenizer = AutoTokenizer.from_pretrained(mode_name_or_path, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    # Load the model weights in BF16 and move them to the selected device
    model = AutoModelForCausalLM.from_pretrained(mode_name_or_path, torch_dtype=torch.bfloat16).to(device)
    return tokenizer, model


# Load the Qwen2.5 model and tokenizer
tokenizer, model = get_model()

prompt = "I bought two 3090 GPUs. Describe the performance figures of the 3090 GPU."
messages = [
    {"role": "system", "content": "You are an intelligent large-model assistant tool for users"},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
# Strip the prompt tokens so that only the newly generated tokens remain
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

When you first run the test code above, you may also hit an error saying that the tokenizer class does not exist: ValueError: Tokenizer class Qwen2Tokenizer does not exist or is not currently imported
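One common cause of this kind of error is an outdated transformers installation, since Qwen2-family support requires a reasonably recent release. The sketch below checks the installed version against 4.37.0, which is the minimum assumed here; version_tuple and is_at_least are hypothetical helpers written for this example, and this may not be the only possible cause.

```python
from importlib import metadata


def version_tuple(version):
    """Turn '4.37.0' or '4.37.0.dev0' into a tuple of at least three leading integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # pad so that '4.37' compares equal to '4.37.0'
    return tuple(parts)


def is_at_least(installed, required):
    """True if the installed version is greater than or equal to the required one."""
    return version_tuple(installed) >= version_tuple(required)


if __name__ == "__main__":
    required = "4.37.0"  # assumed minimum for Qwen2-family models
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        if is_at_least(installed, required):
            print(f"transformers {installed} is recent enough")
        else:
            print(f"transformers {installed} is older than {required}; consider upgrading")
```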