[Multimodal Large Models] Qwen2-VL: Core Principles and Hands-On Inference and Deployment

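Contents

Qwen2-VL Fundamentals
- A Brief Review of Qwen-VL
- Key Upgrades in Qwen2-VL
- Unified Visual Processing
- Native Dynamic Resolution (No Image Tiling)
- Multimodal Rotary Position Embedding
Qwen2-VL Inference Implementation | Code Walkthrough
- Single-Image Inference
  - Visual Input Preprocessing
    - Finding the Best Size Divisible by 28
    - Handling the Min/Max Pixel Bounds
  - Multimodal Input Preprocessing
    - Splitting into Patches
    - Filling in the Vision Tokens
  - Vision Encoder Forward Pass
    - 3D Convolution in PatchEmbed
    - Generating the Multimodal Rotary Position Embedding with rot_pos_emb
    - Compressing Visual Features with PatchMerger
- Video Inference
Deploying Qwen2-VL with vLLM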


Qwen2-VL Fundamentals

[2024-09-18] Paper: https://arxiv.org/abs/2409.12191
Code: https://github.com/QwenLM/Qwen2-VL
Blog: https://qwenlm.github.io/blog/qwen2-vl/
Qwen2-VL-72B Demo: https://huggingface.co/spaces/Qwen/Qwen2-VL

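Alibaba's Tongyi Qianwen (Qwen) team released its latest-generation vision-language model, Qwen2-VL, on August 30, 2024. Qwen2-VL-2B, Qwen2-VL-7B, and Qwen2-VL-72B have been open-sourced, and the open models are integrated into Hugging Face Transformers, vLLM, and other third-party frameworks.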

On several multimodal benchmarks, it performs on par with GPT-4o.

A Brief Review of Qwen-VL

Paper: https://arxiv.org/abs/2308.12966
Code: https://github.com/QwenLM/Qwen-VL

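Qwen-VL was open-sourced by Alibaba's Tongyi Qianwen (Qwen) team on August 22, 2023. Its main contributions are:

Position-aware vision-language adapter:

To mitigate the efficiency problems caused by long image feature sequences, Qwen-VL introduces a vision-language adapter that compresses the image features. The adapter consists of a single, randomly initialized cross-attention layer. It uses a set of trainable vectors (embeddings) as the queries and the image features from the vision encoder as the keys of the cross-attention operation. This compresses the visual feature sequence to a fixed length of 256.
In addition, because positional information matters for fine-grained image understanding, 2D absolute positional encodings are incorporated into the query-key pairs of the cross-attention to reduce the loss of positional detail during compression. The compressed image feature sequence of length 256 is then fed into the large language model.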

Three-stage training: two pre-training stages followed by one instruction fine-tuning stage

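Pre-training: the first of the three stages mainly uses large-scale, weakly labeled, web-crawled image-text pairs. The pre-training dataset for this stage combines several publicly accessible sources with some in-house data.
- Data volume and format: the raw dataset contains 5B image-text pairs in total; 1.4B remain after cleaning, of which 77.3% are English (text) and 22.7% are Chinese (text).
- Training procedure: in the first stage, the large language model is frozen and only the vision encoder and the vision-language adapter are trained. Input images are resized to 224 × 224. The training objective is to minimize the cross-entropy of the text tokens. The maximum learning rate is 2e−4, the batch size is 30720 image-text pairs, and the whole first stage lasts 50,000 steps, consuming about 1.5B image-text samples.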
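The "about 1.5B samples" figure follows directly from the batch size and step count quoted above. A quick arithmetic check (the variable names are illustrative, and the "passes over the data" ratio is only a rough reading that assumes every sample is drawn from the 1.4B cleaned pairs):

```python
# Numbers quoted in the text above.
batch_size = 30_720          # image-text pairs per step
steps = 50_000               # length of pre-training stage 1
cleaned_pairs = 1.4e9        # pairs left after cleaning

samples_seen = batch_size * steps
print(f"samples seen: {samples_seen / 1e9:.3f}B")

# Rough ratio only; assumes all samples come from the cleaned set.
print(f"approx. passes over cleaned data: {samples_seen / cleaned_pairs:.2f}")
```

So stage 1 sees slightly more samples than there are cleaned pairs, which is consistent with the rounded "1.5B" in the text.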


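Multi-task pre-training:
- Data volume and format: the second stage, multi-task pre-training, introduces vision-language annotation data with larger input resolution, higher quality, and finer granularity, together with interleaved image-text data. Qwen-VL is trained on 7 tasks simultaneously (the task list is given in a table in the Qwen-VL paper).
- Training procedure: the input resolution of the vision encoder is raised from 224 × 224 to 448 × 448 to reduce the information lost through image downsampling. The paper also reports an ablation of window attention versus global attention for the higher-resolution vision transformer. The large language model is unfrozen and the whole model is trained. The training objective is the same as in the first pre-training stage.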

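Supervised fine-tuning (SFT):
- Data volume and format: in this stage the pre-trained Qwen-VL model is instruction-tuned to strengthen its instruction-following and dialogue abilities, producing the interactive Qwen-VL-Chat model. Multimodal instruction-tuning data mostly comes from captioning or dialogue data generated by LLM self-instruction, which usually covers only single-image dialogue and reasoning and is limited to image content understanding. The authors therefore build an additional set of dialogue data through manual annotation, model generation, and strategic concatenation to bring localization and multi-image understanding into Qwen-VL. They confirm that the model transfers these capabilities to a broader range of languages and question types. Multimodal and text-only dialogue data are also mixed during training to keep the model's dialogue ability general. The instruction-tuning data amounts to 350K samples.
- Training procedure: in this stage the vision encoder is frozen, and the language model and the adapter module are optimized.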

Qwen-VL-Max, the strongest model in the Qwen-VL series, showed very strong multimodal understanding at the time, performing on par with GPT-4V on several multimodal benchmarks.

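Key Upgrades in Qwen2-VL

Building on the first-generation Qwen-VL, Qwen2-VL's main advantages are:

- It redefines the conventional fixed-resolution approach to visual processing and can handle real-world image inputs of arbitrary resolution.
- It unifies the visual processing pipeline for single images, multiple images, and video (everything is treated as video; a single image is duplicated into two identical consecutive frames), which adapts better to different types of visual input.
- It introduces a multimodal rotary position embedding that applies RoPE to visual tokens along the temporal and spatial dimensions as well, giving better position encoding for multimodal information.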

Below is the model architecture of Qwen2-VL-7B-Instruct:

Qwen2VLForConditionalGeneration(
  (visual): Qwen2VisionTransformerPretrainedModel(
    (patch_embed): PatchEmbed(
      (proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
    )
    (rotary_pos_emb): VisionRotaryEmbedding()
    (blocks): ModuleList(
      (0-31): 32 x Qwen2VLVisionBlock(
        (norm1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        (norm2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        (attn): VisionSdpaAttention(
          (qkv): Linear(in_features=1280, out_features=3840, bias=True)
          (proj): Linear(in_features=1280, out_features=1280, bias=True)
        )
        (mlp): VisionMlp(
          (fc1): Linear(in_features=1280, out_features=5120, bias=True)
          (act): QuickGELUActivation()
          (fc2): Linear(in_features=5120, out_features=1280, bias=True)
        )
      )
    )
    (merger): PatchMerger(
      (ln_q): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
      (mlp): Sequential(
        (0): Linear(in_features=5120, out_features=5120, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=5120, out_features=3584, bias=True)
      )
    )
  )
  (model): Qwen2VLModel(
    (embed_tokens): Embedding(152064, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2VLDecoderLayer(
        (self_attn): Qwen2VLSdpaAttention(
          (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((3584,), eps=1e-06)
  )
  (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)

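As the printout shows, Qwen2-VL makes substantial changes to the vision encoder and its surrounding processing:

- The first layer, patch_embed, is a 3D convolution whose kernel size is (2, 14, 14) and whose stride is also (2, 14, 14): the kernel spans 2 frames along the temporal dimension and 14 × 14 pixels spatially.
- A custom rotary_pos_emb layer applies rotary position encoding to the visual input according to its temporal and spatial grid.
- The alignment layer, PatchMerger, is a plain MLP with two Linear layers. Unlike the cross-attention used in Qwen-VL, it does not reduce the number of visual tokens through learnable queries. Instead, PatchMerger merges adjacent visual tokens, which reduces the token count while increasing the feature dimension of each token.

Below, we combine the Qwen2-VL paper with the concrete code to take a closer look at this recent and highly capable image-text understanding model.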

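Unified Visual Processing

Qwen2-VL uses a mixed training regimen that combines image and video data, so that it is proficient at both image and video understanding.

- To preserve as much video information as possible, Qwen2-VL samples every video at two frames per second.
- It integrates a 3D convolution of depth 2 (Carreira and Zisserman, 2017) to handle video input, so the model processes 3D tubes rather than 2D patches and can therefore take in more video frames without increasing the sequence length.
- For consistency, every image is treated as two identical frames.
- To balance the computational cost of long videos against overall training efficiency, the resolution of each video frame is adjusted dynamically so that the total number of tokens per video is capped at 16384. This strikes a balance between the model's ability to understand long videos and training efficiency.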
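To make the token budget concrete, here is a rough back-of-the-envelope sketch. The helper names are hypothetical, and the 28-pixel spatial unit (a 14-pixel patch followed by a 2 × 2 token merge) is taken from the rest of this article; actual frame sampling and resizing in the library involve more rules than shown here.

```python
FPS = 2.0                  # sampling rate stated above
TEMPORAL_PATCH = 2         # depth of the 3D convolution: 2 frames form one tube
SPATIAL_UNIT = 28          # assumption: 14-pixel patch x 2 (2x2 token merge)
MAX_VIDEO_TOKENS = 16384   # per-video cap stated above


def video_tokens(duration_s: float, height: int, width: int) -> int:
    """Estimate the visual tokens a video contributes (hypothetical helper)."""
    frames = int(duration_s * FPS)
    temporal_steps = frames // TEMPORAL_PATCH
    tokens_per_step = (height // SPATIAL_UNIT) * (width // SPATIAL_UNIT)
    return temporal_steps * tokens_per_step


def max_tokens_per_step(duration_s: float) -> int:
    """Tokens each pair of frames may use while staying under the cap."""
    temporal_steps = max(1, int(duration_s * FPS) // TEMPORAL_PATCH)
    return MAX_VIDEO_TOKENS // temporal_steps


print(video_tokens(60, 448, 448))   # one minute at 448 x 448
print(max_tokens_per_step(600))     # ten minutes: budget per frame pair
```

Under these assumptions a one-minute 448 × 448 clip fits within the cap, while a ten-minute clip leaves only a few dozen tokens per frame pair, which is why the per-frame resolution has to shrink as videos get longer.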

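Native Dynamic Resolution (No Image Tiling)

A key architectural improvement in Qwen2-VL is native dynamic resolution support. Unlike Qwen-VL, Qwen2-VL can process images of any resolution, dynamically converting them into a variable number of visual tokens.

- To support this, Qwen2-VL modifies the ViT by removing the original absolute position embeddings and introducing 2D RoPE to capture the two-dimensional positional information of the image.
- At inference time, images of different resolutions are packed into a single sequence, and the packed length is controlled to limit GPU memory usage.
- In addition, to reduce the number of visual tokens per image, a simple MLP layer after the ViT compresses each group of adjacent 2 × 2 tokens into one token, and the special tokens <|vision_start|> and <|vision_end|> are placed at the beginning and end of the compressed visual tokens. As a result, a 224 × 224 image encoded by a ViT with patch_size=14 is compressed to 66 tokens before entering the LLM: (224 / 14)² = 256 patches, 256 / 4 = 64 merged tokens, plus the 2 special tokens.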
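The 66-token figure can be reproduced for any image size that is a multiple of 28. A minimal sketch follows; the function name is made up for illustration, and it assumes the image has already been resized so both sides are divisible by patch_size × merge_size:

```python
PATCH_SIZE = 14   # ViT patch size from the text
MERGE_SIZE = 2    # adjacent 2 x 2 tokens are merged into one


def llm_visual_tokens(height: int, width: int, with_special_tokens: bool = True) -> int:
    """Count the tokens an image occupies in the LLM input (illustrative helper)."""
    unit = PATCH_SIZE * MERGE_SIZE
    if height % unit or width % unit:
        raise ValueError(f"height and width must be multiples of {unit}")
    patches = (height // PATCH_SIZE) * (width // PATCH_SIZE)
    merged = patches // (MERGE_SIZE ** 2)
    # <|vision_start|> and <|vision_end|> wrap the merged tokens.
    return merged + (2 if with_special_tokens else 0)


print(llm_visual_tokens(224, 224))                             # 64 merged + 2 special
print(llm_visual_tokens(868, 896, with_special_tokens=False))  # the example image used later
```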

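This version of Qwen2-VL does not use the currently popular image-tiling approach (as in LLaVA-NeXT, InternVL2.5, and MiniCPM-V). Instead, it splits the image into patches directly, runs them straight through the image encoder for feature extraction, and, before alignment to the LLM, uses the PatchMerger layer (a two-layer MLP) to compress the number of visual tokens and extract features further.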

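Multimodal Rotary Position Embedding

Another key architectural enhancement in Qwen2-VL is the Multimodal Rotary Position Embedding (M-RoPE). Unlike the conventional 1D RoPE in LLMs, which can only encode one-dimensional positional information, M-RoPE models the positional information of multimodal inputs. It does so by decomposing the original rotary embedding into three components: temporal, height, and width.

- For text inputs, the three components use identical position IDs, which makes M-RoPE functionally equivalent to 1D RoPE.
- For images, the temporal ID of every visual token stays constant, while the height and width components are assigned distinct IDs according to the token's position in the image.
- For video, which is treated as a sequence of frames, the temporal ID increases with each frame, while the height and width components follow the same assignment pattern as for images.
- When the input contains multiple modalities, the position numbering of each modality starts from the maximum position ID of the preceding modality plus 1.

The paper illustrates M-RoPE with a figure. According to the paper, M-RoPE not only improves positional modeling but also lowers the position ID values for images and videos, allowing the model to extrapolate to longer sequences at inference time.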
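The ID assignment rules above can be sketched in a few lines of plain Python. This is an illustrative reconstruction from the description, not the library's implementation; the function name and the segment format are assumptions, and the vision grid is given in merged-token units.

```python
def mrope_position_ids(segments):
    """Assign (temporal, height, width) IDs per token (illustrative sketch).

    segments: list of ("text", length) or ("vision", (t, h, w)),
    where h and w count tokens after the 2x2 merge.
    """
    ids = []
    start = 0
    for kind, spec in segments:
        first_new = len(ids)
        if kind == "text":
            # Text: all three components share the same ID (plain 1D RoPE).
            for i in range(spec):
                ids.append((start + i, start + i, start + i))
        else:
            t, h, w = spec
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        ids.append((start + ti, start + hi, start + wi))
        if len(ids) > first_new:
            # The next modality starts at the previous maximum ID plus 1.
            start = max(max(triple) for triple in ids[first_new:]) + 1
    return ids


ids = mrope_position_ids([("text", 3), ("vision", (1, 2, 2)), ("text", 2)])
for triple in ids:
    print(triple)
```

In this toy input, nine tokens end at position ID 6 rather than 8, which shows how M-RoPE keeps position IDs smaller than a flat 1D numbering would.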

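Qwen2-VL Inference Implementation | Code Walkthrough

Papers rarely spell out every implementation detail, so we run Qwen2-VL's forward inference code ourselves to understand the three innovations above in depth. Single-image inference and video inference serve as the examples.

To step into the transformers library installed in the environment, debugpy is used for debugging. For how to use it, see the blog post 【大模型推理】大模型前向推理过程详解 ("[LLM Inference] A Detailed Walkthrough of the LLM Forward Inference Process").

First, set up the environment following the official instructions. Note that the latest transformers library must be installed here (once the official releases catch up, installing a pinned release version will do):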

conda create -n qwen2vl python=3.10 -y
conda activate qwen2vl 
pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830 accelerate

Single-Image Inference

With the code in place, run the following:

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
# from qwen_vl_utils import process_vision_info
from vision_process import process_vision_info

# Use debugpy for in-depth debugging,
# and set "justMyCode" to false in launch.json.
# Code: https://github.com/yuanzhoulvpi2017/vscode_debug_transformers
import debugpy
try:
    # VS Code's default attach port is 5678; this script listens on 9501 instead.
    # Unless a host and port are specified, host defaults to 127.0.0.1
    debugpy.listen(("localhost", 9501))
    print("Waiting for debugger attach")
    debugpy.wait_for_client()
except Exception as e:
    pass

model_path = '/root/models/Qwen/Qwen2-VL-7B-Instruct'

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)

# default processor
processor = AutoProcessor.from_pretrained(model_path)

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/root/qwen2-vl/assets/小王子1.png",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
# '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# [Step 1: visual input preprocessing]
image_inputs, video_inputs = process_vision_info(messages)

# [Step 2: multimodal input processing]
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# [Step 3: model forward inference and output]
# Two main parts:
# 1. The vision encoder's forward pass produces the compressed visual tokens
# 2. The LLM's forward pass generates the final result token by token
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Visual Input Preprocessing

Implementation: https://github.com/QwenLM/Qwen2-VL/blob/main/qwen-vl-utils/src/qwen_vl_utils/vision_process.py

Execution first reaches image_inputs, video_inputs = process_vision_info(messages). Step into the process_vision_info function:

def process_vision_info(
    conversations: list[dict] | list[list[dict]],
) -> tuple[list[Image.Image] | None, list[torch.Tensor | list[Image.Image]] | None]:
    vision_infos = extract_vision_info(conversations)
    ## Read images or videos
    image_inputs = []
    video_inputs = []
    for vision_info in vision_infos:
        if "image" in vision_info or "image_url" in vision_info:
            image_inputs.append(fetch_image(vision_info))
        elif "video" in vision_info:
            video_inputs.append(fetch_video(vision_info))
        else:
            raise ValueError("image, image_url or video should in content.")
    if len(image_inputs) == 0:
        image_inputs = None
    if len(video_inputs) == 0:
        video_inputs = None
    return image_inputs, video_inputs

The function to focus on is fetch_image. Stepping into it (some code, mainly the image-loading logic, is omitted here to keep the key parts visible):

def fetch_image(ele: dict[str, str | Image.Image], size_factor: int = IMAGE_FACTOR) -> Image.Image:
    if "image" in ele:
        image = ele["image"]
    else:
        image = ele["image_url"]
    image_obj = None
    if isinstance(image, Image.Image):
        image_obj = image
    elif ...  # [remaining branches that load the image are omitted here]
    image = image_obj.convert("RGB")
    ## resize
    if "resized_height" in ele and "resized_width" in ele:
        resized_height, resized_width = smart_resize(
            ele["resized_height"],
            ele["resized_width"],
            factor=size_factor,
        )
    else:
        width, height = image.size
        min_pixels = ele.get("min_pixels", MIN_PIXELS)
        max_pixels = ele.get("max_pixels", MAX_PIXELS)
        resized_height, resized_width = smart_resize(
            height,
            width,
            factor=size_factor,
            min_pixels=min_pixels,
            max_pixels=max_pixels,
        )
    image = image.resize((resized_width, resized_height))
    return image

The core function is smart_resize, which finds the most suitable size for the image just loaded:

IMAGE_FACTOR = 28
MIN_PIXELS = 4 * 28 * 28
MAX_PIXELS = 16384 * 28 * 28
MAX_RATIO = 200

VIDEO_MIN_PIXELS = 128 * 28 * 28
VIDEO_MAX_PIXELS = 768 * 28 * 28
VIDEO_TOTAL_PIXELS = 24576 * 28 * 28
FRAME_FACTOR = 2
FPS = 2.0
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 768


def smart_resize(
    height: int, width: int, factor: int = IMAGE_FACTOR, min_pixels: int = MIN_PIXELS, max_pixels: int = MAX_PIXELS
) -> tuple[int, int]:
    """
    Rescales the image so that the following conditions are met:
    1. Both dimensions (height and width) are divisible by 'factor'.
    2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
    3. The aspect ratio of the image is maintained as closely as possible.
    """
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError(
            f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}"
        )
    h_bar = max(factor, round_by_factor(height, factor))
    w_bar = max(factor, round_by_factor(width, factor))
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar

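Let's go through this function in detail.

Finding the Best Size Divisible by 28

Why is IMAGE_FACTOR set to 28? Qwen2-VL's image encoder splits the image into 14 × 14 patches and later merges adjacent 2 × 2 visual tokens, so both the height and the width of the image must be divisible by 28, and the smallest allowed size is 28: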

def round_by_factor(number: int, factor: int) -> int:
    """Returns the closest integer to 'number' that is divisible by 'factor'."""
    return round(number / factor) * factor

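Handling the Min/Max Pixel Bounds

Once the best-fitting size is found, the pixel count is checked against the preset min_pixels and max_pixels. If it falls outside the bounds, the image is rescaled, depending on whether it exceeds the maximum or falls below the minimum, to the most suitable size that satisfies the bound: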

def ceil_by_factor(number: int, factor: int) -> int:
    """Returns the smallest integer greater than or equal to 'number' that is divisible by 'factor'."""
    return math.ceil(number / factor) * factor


def floor_by_factor(number: int, factor: int) -> int:
    """Returns the largest integer less than or equal to 'number' that is divisible by 'factor'."""
    return math.floor(number / factor) * factor

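The image used here has an original size of [868, 899], and its size after preprocessing is [868, 896]: 868 is already a multiple of 28, 899 rounds to the nearest multiple, 896, and the resulting pixel count lies within [min_pixels, max_pixels]. After image_inputs, video_inputs = process_vision_info(messages) finishes, image_inputs is therefore a list holding the single resized image, and video_inputs is None because the message contains no video.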
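To try the resizing rules outside the debugger, the logic can be condensed into a standalone script. This is a simplified re-statement of the quoted function for experimentation: the three rounding helpers are inlined and the aspect-ratio check is left out.

```python
import math

IMAGE_FACTOR = 28
MIN_PIXELS = 4 * 28 * 28
MAX_PIXELS = 16384 * 28 * 28


def smart_resize(height, width, factor=IMAGE_FACTOR,
                 min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS):
    """Standalone copy of the resize logic (aspect-ratio check omitted)."""
    # Round each side to the nearest multiple of `factor`, at least one factor.
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


print(smart_resize(868, 899))   # the article's image: (868, 896)
print(smart_resize(20, 30))     # below min_pixels, so it is enlarged
print(smart_resize(2160, 3840, max_pixels=1280 * 28 * 28))  # capped by a custom max_pixels
```

Lowering max_pixels, as in the last call, is the knob the processor comments mention for trading accuracy against the number of visual tokens.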

Multimodal Input Preprocessing

Implementation: https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_vl/processing_qwen2_vl.py
and https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py

Next, the call inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt") runs and enters the __call__ function in processing_qwen2_vl.py, which has two main parts:

- The first calls image_processor to split the visual input into patches.
- The second fills in the placeholder tokens across the whole input (system prompt + question + visual information, and so on) and tokenizes it:

    def __call__(
        self,
        images: ImageInput = None,
        text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
        videos: VideoInput = None,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = None,
        max_length: int = None,
        return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
    ) -> BatchFeature:
        if images is not None:
            image_inputs = self.image_processor(images=images, videos=None, return_tensors=return_tensors)
            image_grid_thw = image_inputs["image_grid_thw"]
        else:
            image_inputs = {}
            image_grid_thw = None

        # [part of the code omitted here]

        if image_grid_thw is not None:
            merge_length = self.image_processor.merge_size**2
            index = 0
            for i in range(len(text)):
                while "<|image_pad|>" in text[i]:
                    text[i] = text[i].replace(
                        "<|image_pad|>", "<|placeholder|>" * (image_grid_thw[index].prod() // merge_length), 1
                    )
                    index += 1
                text[i] = text[i].replace("<|placeholder|>", "<|image_pad|>")

        # [part of the code omitted here]

        text_inputs = self.tokenizer(
            text, return_tensors=return_tensors, padding=padding, truncation=truncation, max_length=max_length
        )
        return BatchFeature(data={**text_inputs, **image_inputs, **videos_inputs})

Splitting into Patches

Step into image_processor, i.e. the preprocess function in image_processing_qwen2_vl.py. The key part is the loop below, which in turn calls _preprocess:

for image in images:
    patches, image_grid_thw = self._preprocess(
        image,
        do_resize=do_resize,
        resample=resample,
        do_rescale=do_rescale,
        rescale_factor=rescale_factor,
        do_normalize=do_normalize,
        image_mean=image_mean,
        image_std=image_std,
        data_format=data_format,
        do_convert_rgb=do_convert_rgb,
        input_data_format=input_data_format,
    )
    pixel_values.extend(patches)
    vision_grid_thws.append(image_grid_thw)

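Stepping into the core _preprocess function, the main flow is:

- First, the loaded images are converted to NumPy arrays.
- Then every image is resized, rescaled, and normalized. Notably, the resize step calls smart_resize again to search for the best size a second time, which acts as a double safeguard.
- After all images are processed they are concatenated. With a single image the first dimension is 1, i.e. [1, 3, 868, 896]. The first dimension is then checked, and if it is 1 the code below duplicates the data along that dimension, so patches becomes [2, 3, 868, 896]: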

if patches.shape[0] == 1:
    patches = np.tile(patches, (self.temporal_patch_size, 1, 1, 1))

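After this step the format matches that of video input, which is what the paper calls Unified Image and Video Understanding, i.e. the unified visual processing approach.

Next comes the actual patch-splitting step, which is the key part: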

# patches.shape = (2, 3, 868, 896)
# self.temporal_patch_size = 2
# self.patch_size = 14
channel = patches.shape[1]  # 3
grid_t = patches.shape[0] // self.temporal_patch_size  # 1
grid_h, grid_w = resized_height // self.patch_size, resized_width // self.patch_size  # 62, 64
patches = patches.reshape(
    grid_t,
    self.temporal_patch_size,
    channel,
    grid_h // self.merge_size,
    self.merge_size,
    self.patch_size,
    grid_w // self.merge_size,
    self.merge_size,
    self.patch_size,
)
# patches.shape = (1, 2, 3, 31, 2, 14, 32, 2, 14)
patches = patches.transpose(0, 3, 6, 4, 7, 2, 1, 5, 8)
# patches.shape = (1, 31, 32, 2, 2, 3, 2, 14, 14)
flatten_patches = patches.reshape(
    grid_t * grid_h * grid_w, channel * self.temporal_patch_size * self.patch_size * self.patch_size
)
# flatten_patches.shape = (3968, 1176)
return flatten_patches, (grid_t, grid_h, grid_w)

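The three variables grid_t, grid_h, and grid_w are crucial: they determine how many patches the image is split into. In this example the input has shape [2, 3, 868, 896] and is split into 1 × 62 × 64 = 3968 patches, and each patch flattens to a feature vector of length 3 × 2 × 14 × 14 = 1176.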
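The reshape and transpose also fix the order in which patches are emitted: the four patches of every 2 × 2 merge block end up next to each other. The sketch below reproduces that ordering for one temporal step with plain loops; it is an illustration derived from the transposed axis order (block row, block column, inner row, inner column), and the function name is made up.

```python
def patch_order(grid_h: int, grid_w: int, merge_size: int = 2):
    """Return (row, col) patch coordinates in the flattened output order."""
    order = []
    for block_row in range(grid_h // merge_size):
        for block_col in range(grid_w // merge_size):
            for inner_row in range(merge_size):
                for inner_col in range(merge_size):
                    order.append((block_row * merge_size + inner_row,
                                  block_col * merge_size + inner_col))
    return order


order = patch_order(4, 4)
print(order[:4])    # first 2x2 block
print(order[4:8])   # the block to its right
print(len(patch_order(62, 64)), 3 * 2 * 14 * 14)  # patch count and per-patch feature length
```

Because each group of four consecutive rows belongs to one 2 × 2 neighborhood, a later stage can merge them simply by reshaping four rows into one.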

Filling in the Vision Tokens

After the patch splitting above, execution returns to the __call__ function in processing_qwen2_vl.py and continues:

if image_grid_thw is not None:
    # self.image_processor.merge_size = 2
    merge_length = self.image_processor.merge_size**2  # 4
    index = 0
    # text: <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n
    for i in range(len(text)):
        while "<|image_pad|>" in text[i]:
            text[i] = text[i].replace(
                "<|image_pad|>", "<|placeholder|>" * (image_grid_thw[index].prod() // merge_length), 1
            )
            index += 1
        text[i] = text[i].replace("<|placeholder|>", "<|image_pad|>")

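Here the vision placeholder reserved in text (the full input) is expanded according to the number of patches. The number of placeholders is image_grid_thw[index].prod() // merge_length, i.e. 3968 // 4 = 992, which is also the number of visual tokens the image actually yields after the image encoder.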
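The expansion can be reproduced without the processor. A minimal sketch, assuming the grid sizes are given as plain (t, h, w) tuples instead of tensors (the function name is hypothetical):

```python
from math import prod


def expand_image_pads(text: str, grids, merge_size: int = 2) -> str:
    """Expand each <|image_pad|> into one placeholder per merged visual token."""
    merge_length = merge_size ** 2
    index = 0
    while "<|image_pad|>" in text:
        count = prod(grids[index]) // merge_length
        # A temporary marker keeps already-expanded pads from matching again.
        text = text.replace("<|image_pad|>", "<|placeholder|>" * count, 1)
        index += 1
    return text.replace("<|placeholder|>", "<|image_pad|>")


prompt = "<|vision_start|><|image_pad|><|vision_end|>Describe this image."
expanded = expand_image_pads(prompt, [(1, 62, 64)])
print(expanded.count("<|image_pad|>"))
```

The temporary <|placeholder|> marker is what lets the while loop terminate: without it, every expanded pad would match the search again.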

Vision Encoder Forward Pass

Implementation: https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py

Next, generated_ids = model.generate(**inputs, max_new_tokens=512) runs and enters the forward function of the Qwen2VLForConditionalGeneration class in modeling_qwen2_vl.py, which has two main parts:

    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        pixel_values: Optional[torch.Tensor] = None,
        pixel_values_videos: Optional[torch.FloatTensor] = None,
        image_grid_thw: Optional[torch.LongTensor] = None,
        video_grid_thw: Optional[torch.LongTensor] = None,
        rope_deltas: Optional[torch.LongTensor] = None,
    ) -> Union[Tuple, Qwen2VLCausalLMOutputWithPast]:
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # [Part 1: visual feature extraction and fusion with the text embeddings]
        if inputs_embeds is None:
            inputs_embeds = self.model.embed_tokens(input_ids)
            if pixel_values is not None:
                pixel_values = pixel_values.type(self.visual.get_dtype())
                image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw).to(inputs_embeds.device)
                image_mask = input_ids == self.config.image_token_id
                if self.training:
                    inputs_embeds = inputs_embeds.clone()
                inputs_embeds[image_mask] = image_embeds
            if pixel_values_videos is not None:
                pixel_values_videos = pixel_values_videos.type(self.visual.get_dtype())
                video_embeds = self.visual(pixel_values_videos, grid_thw=video_grid_thw).to(inputs_embeds.device)
                video_mask = input_ids == self.config.video_token_id
                inputs_embeds[video_mask] = video_embeds
            if attention_mask is not None:
                attention_mask = attention_mask.to(inputs_embeds.device)

        # [Part 2: LLM forward pass that produces the final result]
        outputs = self.model(
            input_ids=None,
            position_ids=position_ids,
            attention_mask=attention_mask,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        hidden_states = outputs[0]
        logits = self.lm_head(hidden_states)
        logits = logits.float()

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = CrossEntropyLoss()
            shift_logits = shift_logits.view(-1, self.config.vocab_size)
            shift_labels = shift_labels.view(-1)
            # Enable model parallelism
            shift_labels = shift_labels.to(shift_logits.device)
            loss = loss_fct(shift_logits, shift_labels)

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return Qwen2VLCausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
            rope_deltas=rope_deltas,
        )

Here we focus only on the first part, the visual feature extraction stage. The main call is image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw).to(inputs_embeds.device), so we step into the forward function of the Qwen2VisionTransformerPretrainedModel class:

def forward(self, hidden_states: torch.Tensor, grid_thw: torch.Tensor) -> torch.Tensor:
    # hidden_states.shape: torch.Size([3968, 1176]), grid_thw: tensor([[ 1, 62, 64]])
    hidden_states = self.patch_embed(hidden_states)
    # hidden_states.shape: torch.Size([3968, 1280])
    rotary_pos_emb = self.rot_pos_emb(grid_thw)
    # rotary_pos_emb.shape: torch.Size([3968, 40])
    cu_seqlens = torch.repeat_interleave(grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]).cumsum(
        dim=0, dtype=torch.int32
    )
    # cu_seqlens = tensor([3968], device='cuda:0', dtype=torch.int32)
    cu_seqlens = F.pad(cu_seqlens, (1, 0), value=0)
    # cu_seqlens = tensor([   0, 3968], device='cuda:0', dtype=torch.int32)
    for blk in self.blocks:
        hidden_states = blk(hidden_states, cu_seqlens=cu_seqlens, rotary_pos_emb=rotary_pos_emb)
    # torch.Size([3968, 1280])
    return self.merger(hidden_states)  # torch.Size([992, 3584])

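Qwen2-VL's visual feature extraction therefore consists of these steps:

- Apply the 3D convolution layer to extract patch-level features from the input.
- Compute the rotary position encoding from the temporal and spatial grid information (grid_thw: tensor([[ 1, 62, 64]])).
- Compute the cumulative sequence boundaries cu_seqlens along the temporal dimension (each segment length is the number of spatial tokens of one temporal step, i.e. grid_h × grid_w); these are used later to build the attention mask.
- Encode through the stack of transformer layers.
- Finally, use PatchMerger to compress the visual tokens and apply the final encoding, projecting the visual token features to the same dimension as the text tokens.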
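The cu_seqlens computation is easier to follow with plain lists. The sketch below mirrors the repeat_interleave, cumsum, and pad sequence from the forward function using only the standard library; the second input is a made-up video-like grid added for contrast.

```python
from itertools import accumulate


def cu_seqlens(grid_thw):
    """Cumulative token boundaries: one segment of h*w tokens per temporal step."""
    lengths = []
    for t, h, w in grid_thw:
        lengths.extend([h * w] * t)          # repeat_interleave(h * w, t)
    return [0] + list(accumulate(lengths))   # cumsum, then pad a leading 0


print(cu_seqlens([(1, 62, 64)]))   # the single image from this walkthrough
print(cu_seqlens([(3, 4, 6)]))     # hypothetical video grid with 3 temporal steps
```

Each pair of neighboring boundaries marks one block of tokens that attend to each other, so a video with several temporal steps is split into one attention segment per step.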

3D Convolution in PatchEmbed

This is a standard 3D convolution:

# self.embed_dim = 1280
# self.temporal_patch_size = 2
# self.patch_size = 14
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
    # hidden_states.shape: torch.Size([3968, 1176])
    target_dtype = self.proj.weight.dtype
    hidden_states = hidden_states.view(
        -1, self.in_channels, self.temporal_patch_size, self.patch_size, self.patch_size
    )
    # hidden_states.shape: torch.Size([3968, 3, 2, 14, 14])
    hidden_states = self.proj(hidden_states.to(dtype=target_dtype)).view(-1, self.embed_dim)
    # self.proj(hidden_states.to(dtype=target_dtype)): torch.Size([3968, 1280, 1, 1, 1])
    # hidden_states.shape: torch.Size([3968, 1280])
    return hidden_states
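The [3968, 1280, 1, 1, 1] intermediate shape follows from the standard convolution output-size formula (no padding, no dilation): the kernel and stride both equal the patch size, so each patch yields exactly one output position. A small arithmetic check, with a helper name chosen only for this illustration:

```python
def conv_output_size(input_size, kernel, stride):
    """Output size per dimension for a convolution without padding or dilation."""
    return tuple((n - k) // s + 1 for n, k, s in zip(input_size, kernel, stride))


patch = (2, 14, 14)    # (temporal, height, width) of one flattened patch
kernel = (2, 14, 14)
stride = (2, 14, 14)
print(conv_output_size(patch, kernel, stride))

in_features = 3 * 2 * 14 * 14   # channels x temporal x height x width
out_features = 1280
print(in_features, in_features * out_features)  # input length and weight count (bias=False)
```

In other words, on already-split patches this Conv3d behaves like a single linear projection from 1176 to 1280 features.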

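Generating the Multimodal Rotary Position Embedding with rot_pos_emb

This part is more involved, so it is worth stepping through the source yourself: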

def rot_pos_emb(self, grid_thw):
    pos_ids = []
    for t, h, w in grid_thw:
        hpos_ids = torch.arange(h).unsqueeze(1).expand(-1, w)
        hpos_ids = hpos_ids.reshape(
            h // self.spatial_merge_size,
            self.spatial_merge_size,
            w // self.spatial_merge_size,
            self.spatial_merge_size,
        )
        hpos_ids = hpos_ids.permute(0, 2, 1, 3)
        hpos_ids = hpos_ids.flatten()

        wpos_ids = torch.arange(w).unsqueeze(0).expand(h, -1)
        wpos_ids = wpos_ids.reshape(
            h // self.spatial_merge_size,
            self.spatial_merge_size,
            w // self.spatial_merge_size,
            self.spatial_merge_size,
        )
        wpos_ids = wpos_ids.permute(0, 2, 1, 3)
        wpos_ids = wpos_ids.flatten()
        pos_ids.append(torch.stack([hpos_ids, wpos_ids], dim=-1).repeat(t, 1))
    pos_ids = torch.cat(pos_ids, dim=0)
    max_grid_size = grid_thw[:, 1:].max()
    rotary_pos_emb_full = self.rotary_pos_emb(max_grid_size)
    rotary_pos_emb = rotary_pos_emb_full[pos_ids].flatten(1)
    return rotary_pos_emb
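A pure-Python sketch may help when stepping through this function. It builds, for one temporal step, the same kind of table: each patch gets its row and column index (in 2 × 2 block order) multiplied by a set of inverse frequencies, and the two halves are concatenated. The rotary width of 40 is inferred from the [3968, 40] shape observed above, while theta = 10000 and the function name are assumptions for illustration, not values read from the model code.

```python
def vision_rotary_table(grid_h, grid_w, rotary_dim=40, theta=10000.0, merge_size=2):
    """Per-patch rotary angles: [h * inv_freq ..., w * inv_freq ...] (sketch)."""
    # One inverse frequency per pair of channels within each half.
    inv_freq = [1.0 / theta ** (i / rotary_dim) for i in range(0, rotary_dim, 2)]
    table = []
    for block_row in range(grid_h // merge_size):
        for block_col in range(grid_w // merge_size):
            for inner_row in range(merge_size):
                for inner_col in range(merge_size):
                    h = block_row * merge_size + inner_row
                    w = block_col * merge_size + inner_col
                    table.append([h * f for f in inv_freq] + [w * f for f in inv_freq])
    return table


table = vision_rotary_table(62, 64)
print(len(table), len(table[0]))   # one row per patch, height half + width half
print(table[1][0], table[1][20])   # patch (0, 1): height angle 0, first width angle 1
```

Note that only height and width enter the table; for inputs with several temporal steps the source simply repeats the same spatial IDs t times.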

Compressing Visual Features with PatchMerger

The code shows that the visual token compression takes less than one line: .view(-1, self.hidden_size):

class PatchMerger(nn.Module):
    def __init__(self, dim: int, context_dim: int, spatial_merge_size: int = 2) -> None:
        super().__init__()
        self.hidden_size = context_dim * (spatial_merge_size**2)
        self.ln_q = LayerNorm(context_dim, eps=1e-6)
        self.mlp = nn.Sequential(
            nn.Linear(self.hidden_size, self.hidden_size),
            nn.GELU(),
            nn.Linear(self.hidden_size, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x.shape: torch.Size([3968, 1280])
        x = self.mlp(self.ln_q(x).view(-1, self.hidden_size))
        # self.ln_q(x): torch.Size([3968, 1280])
        # self.ln_q(x).view(-1, self.hidden_size): torch.Size([992, 5120])
        return x  # torch.Size([992, 3584])
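What .view(-1, self.hidden_size) does can be shown with plain lists: every four consecutive token vectors are concatenated into one longer vector. This is a toy sketch with made-up sizes (8 tokens of dimension 3) rather than the real 3968 × 1280 tensor, and it relies on the patches already being ordered so that each group of four forms a 2 × 2 neighborhood.

```python
def merge_tokens(tokens, merge_size=2):
    """Concatenate each run of merge_size**2 token vectors into one vector."""
    group = merge_size ** 2
    if len(tokens) % group:
        raise ValueError("token count must be a multiple of merge_size ** 2")
    return [sum(tokens[i:i + group], []) for i in range(0, len(tokens), group)]


tokens = [[float(i)] * 3 for i in range(8)]   # 8 toy tokens, feature dim 3
merged = merge_tokens(tokens)
print(len(merged), len(merged[0]))            # 4x fewer tokens, 4x wider features
print(3968 // 4, 1280 * 4)                    # the real shapes from the comments above
```

The MLP that follows then maps the widened 5120-dimensional vectors to the LLM's 3584-dimensional embedding space.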

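Video Inference

Video inference works much like single-image inference, because Qwen2-VL processes all visual inputs in the unified [T, C, H, W] format, so you can try it yourself: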

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
# from qwen_vl_utils import process_vision_info
from vision_process import process_vision_info

import debugpy
try:
    # VS Code's default attach port is 5678; this script listens on 9501 instead.
    # Unless a host and port are specified, host defaults to 127.0.0.1
    debugpy.listen(("localhost", 9501))
    print("Waiting for debugger attach")
    debugpy.wait_for_client()
except Exception as e:
    pass

model_path = '/root/models/Qwen/Qwen2-VL-7B-Instruct'

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)

# default processor
processor = AutoProcessor.from_pretrained(model_path)

# Messages containing a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "/root/datasets/video1.mp4",
                "max_pixels": 720 * 1280,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
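When experimenting with the "fps" option, it helps to estimate how many frames will be sampled. The sketch below is a guess at how the video constants quoted earlier (FRAME_FACTOR, FPS, FPS_MIN_FRAMES, FPS_MAX_FRAMES) plausibly combine; the function name and the exact clamping order are assumptions, so check fetch_video in vision_process.py for the real behavior.

```python
import math

FRAME_FACTOR = 2      # frame count must be a multiple of the temporal patch size
FPS = 2.0             # default sampling rate
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 768


def estimate_sampled_frames(total_frames: int, video_fps: float, target_fps: float = FPS) -> int:
    """Estimate how many frames get sampled from a video (hypothetical helper)."""
    min_frames = math.ceil(FPS_MIN_FRAMES / FRAME_FACTOR) * FRAME_FACTOR
    max_frames = math.floor(min(FPS_MAX_FRAMES, total_frames) / FRAME_FACTOR) * FRAME_FACTOR
    wanted = total_frames / video_fps * target_fps   # duration in seconds x target rate
    clamped = min(max(wanted, min_frames), max_frames)
    return round(clamped / FRAME_FACTOR) * FRAME_FACTOR


print(estimate_sampled_frames(900, 30.0, target_fps=1.0))   # 30 s clip at 1 fps
print(estimate_sampled_frames(45, 30.0))                    # very short clip hits the minimum
```

An even frame count matters because the 3D convolution consumes frames in pairs, as described in the unified visual processing section.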