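This article introduces an improvement mechanism, the Convolution and Attention Fusion Module (CAFM), and explains how to apply it to YOLOv11 with the aim of improving model performance. CAFM is designed to combine the strengths of convolutional neural networks (CNNs) and Transformers so that both global and local features are modeled effectively. We then walk through its structure and show how to combine the CAFM module with YOLOv11 to improve object detection.
1. CAFM structure
CAFM fuses the strengths of convolutional neural networks (CNNs) and Transformers. It pairs the local feature capture of convolution with the global feature extraction of the attention mechanism, modeling both the global and the local features of an image to improve detection.
1.1 Local branch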
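1. Channel adjustment: A 1×1 convolution first adjusts the channel dimension. It can change the number of channels freely without altering the width or height of the feature map, which makes the features easier for the later operations to process.
2. Channel shuffle: Next comes a channel shuffle operation. The input tensor is split into several groups along the channel dimension, a depth-wise separable convolution is applied within each group to induce the shuffle, and the group outputs are concatenated along the channel dimension to form a new output tensor. This further mixes and fuses channel information and strengthens cross-channel interaction.
3. Feature extraction: Finally, a 3×3×3 convolution extracts features. It captures local feature information jointly across the spatial and spectral dimensions.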
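To make "mixing channels across groups" concrete, here is a minimal sketch of the plain ShuffleNet-style channel shuffle, applied to a list of channel indices. Treat it as an illustration only: CAFM as described above induces the mixing with grouped depth-wise separable convolutions rather than a fixed permutation, and the name channel_shuffle is a hypothetical helper, not part of the CAFM code.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups (ShuffleNet-style permutation).

    `channels` is any list whose length is divisible by `groups`.
    Hypothetical helper for illustration; not taken from the CAFM code.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    # View the list as a (groups, per_group) grid and read it column by column,
    # so neighbouring output channels come from different groups.
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


print(channel_shuffle(list(range(6)), groups=2))  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, each consecutive pair of channels holds one channel from each original group, which is the kind of cross-group information flow the local branch is after.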
1.2 Global branch
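1. Generating the query, key, and value: A 1×1 convolution followed by a 3×3 depth-wise convolution produces the query (Q), key (K), and value (V). In the code of Section 3 these are implemented as Conv3d layers over a singleton depth axis, and each of the three tensors has shape $C \times H \times W$.
2. Computing the attention map: Q is reshaped to $C \times HW$ and K to $HW \times C$ (per head, with the channels split evenly across heads in the code of Section 3), and the attention map is computed from their product $QK^{\top}$, which has size $C \times C$. Computing the attention map this way, rather than building the much larger regular attention map of size $HW \times HW$, reduces the computational burden.
3. Computing the global-branch output: Written to match the code in Section 3, the output of the global branch is $F_{att} = W_{1\times 1}\,\mathrm{Softmax}(\alpha\, QK^{\top})\,V$, where $W_{1\times 1}$ is the 1×1 output projection and $\alpha$ is a learnable scaling parameter (temperature in the code) that controls the magnitude of the matrix product of Q and K before the Softmax function is applied.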
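The saving from attending over channels instead of spatial positions is easy to check with a little arithmetic. The sketch below counts attention-map entries per image for both formulations. The helper name attention_map_entries and the example sizes (128 channels, a 20×20 feature map, 4 heads) are assumptions chosen for illustration, roughly what a P5 feature map looks like for a 640×640 input on the smallest model scale.

```python
def attention_map_entries(channels, height, width, num_heads=4):
    """Count attention-map entries per image for both formulations.

    Hypothetical helper: returns (channel_wise, spatial) entry counts.
    """
    if channels % num_heads != 0:
        raise ValueError("channels must be divisible by num_heads")
    c_head = channels // num_heads
    n = height * width
    channel_wise = num_heads * c_head * c_head  # one (C/heads x C/heads) map per head
    spatial = num_heads * n * n                 # one (HW x HW) map per head
    return channel_wise, spatial


cw, sp = attention_map_entries(channels=128, height=20, width=20)
print(cw, sp, sp / cw)  # 4096 640000 156.25
```

The channel-wise map also stays the same size when the input resolution grows, whereas the spatial map grows with the square of the number of pixels.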
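2. Combining YOLOv11 with CAFM
This article replaces the attention layer in the C2PSA module of YOLOv11 with CAFM, forming the C2PSA_CAFM module. The local and global branches extract local and global features respectively, and their outputs are summed to give the module output, so that both global and local features are modeled.
3. CAFM code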
import torch
import torch.nn as nn
from einops import rearrange
from .block import PSABlock, C2PSA


class Attention(nn.Module):
    def __init__(self, dim, num_heads=4, bias=False):
        super(Attention, self).__init__()
        self.num_heads = num_heads
        # One learnable scale per head, applied before the softmax
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))

        self.qkv = nn.Conv3d(dim, dim * 3, kernel_size=(1, 1, 1), bias=bias)
        self.qkv_dwconv = nn.Conv3d(dim * 3, dim * 3, kernel_size=(3, 3, 3), stride=1, padding=1,
                                    groups=dim * 3, bias=bias)
        self.project_out = nn.Conv3d(dim, dim, kernel_size=(1, 1, 1), bias=bias)
        self.fc = nn.Conv3d(3 * self.num_heads, 9, kernel_size=(1, 1, 1), bias=True)

        self.dep_conv = nn.Conv3d(9 * dim // self.num_heads, dim, kernel_size=(3, 3, 3), bias=True,
                                  groups=dim // self.num_heads, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.unsqueeze(2)  # add a singleton depth axis for Conv3d
        qkv = self.qkv_dwconv(self.qkv(x))
        qkv = qkv.squeeze(2)
        f_conv = qkv.permute(0, 2, 3, 1)
        f_all = qkv.reshape(f_conv.shape[0], h * w, 3 * self.num_heads, -1).permute(0, 2, 1, 3)
        f_all = self.fc(f_all.unsqueeze(2))
        f_all = f_all.squeeze(2)

        # local conv
        f_conv = f_all.permute(0, 3, 1, 2).reshape(x.shape[0], 9 * x.shape[1] // self.num_heads, h, w)
        f_conv = f_conv.unsqueeze(2)
        out_conv = self.dep_conv(f_conv)  # B, C, H, W
        out_conv = out_conv.squeeze(2)

        # global SA
        q, k, v = qkv.chunk(3, dim=1)
        q = rearrange(q, 'b (head c) h w -> b head c (h w)', head=self.num_heads)
        k = rearrange(k, 'b (head c) h w -> b head c (h w)', head=self.num_heads)
        v = rearrange(v, 'b (head c) h w -> b head c (h w)', head=self.num_heads)
        q = torch.nn.functional.normalize(q, dim=-1)
        k = torch.nn.functional.normalize(k, dim=-1)

        attn = (q @ k.transpose(-2, -1)) * self.temperature
        attn = attn.softmax(dim=-1)

        out = (attn @ v)

        out = rearrange(out, 'b head c (h w) -> b (head c) h w', head=self.num_heads, h=h, w=w)
        out = out.unsqueeze(2)
        out = self.project_out(out)
        out = out.squeeze(2)
        # Sum of the global (attention) and local (convolution) branches
        output = out + out_conv

        return output


class PSABlock_CAFM(PSABlock):
    def __init__(self, c, qk_dim=16, pdim=32, shortcut=True) -> None:
        """Initializes the PSABlock with attention and feed-forward layers for enhanced feature extraction."""
        super().__init__(c)
        # Replace the default PSA attention layer with CAFM
        self.attn = Attention(c)


class C2PSA_CAFM(C2PSA):
    def __init__(self, c1, c2, n=1, e=0.5):
        """Initializes the C2PSA module with specified input/output channels, number of layers, and expansion ratio."""
        super().__init__(c1, c2)
        assert c1 == c2
        self.c = int(c1 * e)
        self.m = nn.Sequential(*(PSABlock_CAFM(self.c, qk_dim=16, pdim=32) for _ in range(n)))
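The reshapes in Attention.forward are the hardest part of the code to follow, so here is a small pure-Python shape trace. It is a reading aid derived by hand from the code above, not part of the module. The function name cafm_shape_trace is hypothetical, and the example sizes (128 channels, 20×20 feature map) are assumptions.

```python
def cafm_shape_trace(b, c, h, w, num_heads=4):
    """Return (stage, shape) pairs for the tensors in Attention.forward.

    Hypothetical helper: the shapes are derived by hand from the code above.
    """
    if c % num_heads != 0:
        raise ValueError("c must be divisible by num_heads")
    c_head = c // num_heads
    return [
        ("input x", (b, c, h, w)),
        ("qkv after both convs, depth axis squeezed", (b, 3 * c, h, w)),
        ("f_all after reshape + permute", (b, 3 * num_heads, h * w, c_head)),
        ("f_all after fc (3*heads -> 9 channels)", (b, 9, h * w, c_head)),
        ("f_conv fed to dep_conv", (b, 9 * c_head, h, w)),
        ("out_conv (local branch)", (b, c, h, w)),
        ("q, k, v per head", (b, num_heads, c_head, h * w)),
        ("attention map", (b, num_heads, c_head, c_head)),
        ("out (global branch)", (b, c, h, w)),
        ("output = out + out_conv", (b, c, h, w)),
    ]


for stage, shape in cafm_shape_trace(b=8, c=128, h=20, w=20):
    print(f"{stage:45s} {shape}")
```

Two things follow from the trace: the channel count passed to Attention must be divisible by num_heads, and both branches return a tensor with the input shape, which is why they can simply be added.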
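4. Adding CAFM to YOLOv11
Step 1: Copy the core code from Section 3 into the D:\bilibili\model\YOLO11\ultralytics-main\ultralytics\nn directory, as shown in the figure below. Note that from .block import PSABlock, C2PSA is a relative import, so it only resolves if the new file sits in the same package as block.py (ultralytics\nn\modules in the Ultralytics repository); otherwise adjust the import path to match.
Step 2: Import the CAFM module in tasks.py.
Step 3: In the model configuration section of tasks.py, add the code that registers C2PSA_CAFM.
Step 4: Copy the model configuration below into the YOLOv11 YAML file.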
# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLO11 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
  m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
  l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
  x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs

# YOLO11n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 2, C3k2, [256, False, 0.25]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 2, C3k2, [512, False, 0.25]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 2, C3k2, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 2, C3k2, [1024, True]]
  - [-1, 1, SPPF, [1024, 5]] # 9
  - [-1, 2, C2PSA_CAFM, [1024]] # 10

# YOLO11n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3k2, [512, False]] # 13

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [256, False]] # 16 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 13], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3k2, [512, False]] # 19 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 10], 1, Concat, [1]] # cat head P5
  - [-1, 2, C3k2, [1024, True]] # 22 (P5/32-large)

  - [[16, 19, 22], 1, Detect, [nc]] # Detect(P3, P4, P5)
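The [1024] and repeat count 2 on the C2PSA_CAFM line are not the values the module finally receives, because the scales entry rescales them. The sketch below works through that arithmetic for each scale. It assumes that Step 3 registers C2PSA_CAFM so that it gets the same width and depth scaling as the built-in C2PSA (channels capped at max_channels, multiplied by width, rounded up to a multiple of 8; repeats multiplied by depth). The helper names make_divisible and c2psa_cafm_config are written here for illustration and are not imported from Ultralytics.

```python
import math

# [depth, width, max_channels], copied from the YAML above
SCALES = {
    "n": (0.50, 0.25, 1024),
    "s": (0.50, 0.50, 1024),
    "m": (0.50, 1.00, 512),
    "l": (1.00, 1.00, 512),
    "x": (1.00, 1.50, 512),
}


def make_divisible(x, divisor=8):
    """Round x up to the nearest multiple of divisor (illustrative helper)."""
    return math.ceil(x / divisor) * divisor


def c2psa_cafm_config(scale, yaml_channels=1024, yaml_repeats=2, e=0.5, num_heads=4):
    """Estimate what C2PSA_CAFM receives for a given scale (assumed scaling rule)."""
    depth, width, max_channels = SCALES[scale]
    c_out = make_divisible(min(yaml_channels, max_channels) * width)
    repeats = max(round(yaml_repeats * depth), 1)
    hidden = int(c_out * e)  # self.c in C2PSA_CAFM, the dim passed to Attention
    return {
        "c_out": c_out,
        "repeats": repeats,
        "hidden": hidden,
        "channels_per_head": hidden // num_heads,
        "heads_divide_evenly": hidden % num_heads == 0,
    }


for name in SCALES:
    print(name, c2psa_cafm_config(name))
```

Under these assumptions the hidden width divides evenly by the default 4 heads at every scale, which the rearrange calls and the grouped dep_conv in Attention both rely on.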
Step 5: Run training to confirm that the model builds and trains successfully.
from ultralytics.models import YOLO

if __name__ == "__main__":
    # Build the model from the custom YOLOv11 YAML file, load the pretrained weights, and train
    model = YOLO(r"D:\bilibili\model\YOLO11\ultralytics-main\ultralytics\cfg\models\11\yolo11_CAFM.yaml").load(
        r"D:\bilibili\model\YOLO11\ultralytics-main\yolo11n.pt"
    )  # build from YAML and transfer weights

    results = model.train(
        data=r"D:\bilibili\model\ultralytics-main\ultralytics\cfg\datasets\VOC_my.yaml",
        epochs=100,
        imgsz=640,
        batch=8,
    )