YOLOv8改进 | 主干网络 | 将backbone替换为Swin-Transformer结构【论文必备】

  秋招面试专栏推荐 :深度学习算法工程师面试问题总结【百面算法工程师】——点击即可跳转


💡💡💡本专栏所有程序均经过测试,可成功执行💡💡💡


专栏目录 :《YOLOv8改进有效涨点》专栏介绍 & 专栏目录 | 目前已有110+篇内容,内含各种Head检测头、损失函数Loss、Backbone、Neck、NMS等创新点改进——点击即可跳转


本文给大家带来的教程是将YOLOv8的backbone替换为Swin-Transformer结构来提取特征。文章在介绍主要的原理后,将手把手教学如何进行模块的代码添加和修改,并将修改后的完整代码放在文章的最后,方便大家一键运行,小白也可轻松上手实践。以帮助您更好地学习深度学习目标检测YOLO系列的挑战。 

专栏地址YOLOv8改进——更新各种有效涨点方法——点击即可跳转 

目录

1.原理

2. 将Swin-Transformer添加到yolov8网络中

2.1 Swin-Transformer代码实现

2.2 Swin-Transformer的神经网络模块代码解析

2.3 更改init.py文件

2.4 添加yaml文件

2.5 注册模块

2.6 执行程序

3. 完整代码分享

4. GFLOPs

5. 进阶

6. 总结


1.原理

论文地址:Swin-Transformer点击即可跳转

官方代码:Swin-Transformer官方代码仓库点击即可跳转

Swin-Transformer是MRA的作品,而MRA撑起了深度学习的半边天。

Swin-Transformer是2021年微软研究院发表在ICCV上的一篇文章,并且已经获得ICCV 2021 best paper 的荣誉称号。Swin Transformer网络是Transformer模型在视觉领域的又一次碰撞。该论文一经发表就已在多项视觉任务中霸榜。

Swin Transformer使用了类似卷积神经网络中的层次化构建方法(Hierarchical feature maps),比如特征图尺寸中有对图像下采样4倍的,8倍的以及16倍的,这样的backbone有助于在此基础上构建目标检测,实例分割等任务。而在之前的Vision Transformer中是一开始就直接下采样16倍,后面的特征图也是维持这个下采样率不变。

在Swin Transformer中使用了Windows Multi-Head Self-Attention(W-MSA)的概念,比如在下图的4倍下采样和8倍下采样中,将特征图划分成了多个不相交的区域(Window),并且Multi-Head Self-Attention只在每个窗口(Window)内进行。相对于Vision Transformer中直接对整个特征图进行Multi-Head Self-Attention,这样做的目的是能够减少计算量的,尤其是在浅层特征图很大的时候。这样做虽然减少了计算量但也会隔绝不同窗口之间的信息传递,所以在论文中作者又提出了 Shifted Windows Multi-Head Self-Attention(SW-MSA)的概念,通过此方法能够让信息在相邻的窗口中进行传递。

​因此,Swin-Transformer在计算效率上相对于传统的 Transformer 架构具有优势。Swin Transformer 是一种高效且性能优越的深度学习模型,适用于图像分类、目标检测等视觉任务,并且在处理大规模图像数据时表现出色。

2. 将Swin-Transformer添加到yolov8网络中

2.1 Swin-Transformer代码实现

关键步骤一将下面代码粘贴到在/ultralytics/ultralytics/nn/modules/block.py中,并在该文件的__all__中添加“Swin-Transformer”


#swin transformer
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as checkpoint
import numpy as np
from typing import Optionaldef drop_path_f(x, drop_prob: float = 0., training: bool = False):if drop_prob == 0. or not training:return xkeep_prob = 1 - drop_probshape = (x.shape[0],) + (1,) * (x.ndim - 1)random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)random_tensor.floor_()output = x.div(keep_prob) * random_tensorreturn outputclass DropPath(nn.Module):def __init__(self, drop_prob=None):super(DropPath, self).__init__()self.drop_prob = drop_probdef forward(self, x):return drop_path_f(x, self.drop_prob, self.training)def window_partition(x, window_size: int):B, H, W, C = x.shapex = x.view(B, H // window_size, window_size, W // window_size, window_size, C)# permute: [B, H//Mh, Mh, W//Mw, Mw, C] -> [B, H//Mh, W//Mh, Mw, Mw, C]# view: [B, H//Mh, W//Mw, Mh, Mw, C] -> [B*num_windows, Mh, Mw, C]windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)return windowsdef window_reverse(windows, window_size: int, H: int, W: int):B = int(windows.shape[0] / (H * W / window_size / window_size))# view: [B*num_windows, Mh, Mw, C] -> [B, H//Mh, W//Mw, Mh, Mw, C]x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)# permute: [B, H//Mh, W//Mw, Mh, Mw, C] -> [B, H//Mh, Mh, W//Mw, Mw, C]# view: [B, H//Mh, Mh, W//Mw, Mw, C] -> [B, H, W, C]x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)return xclass Mlp(nn.Module):def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):super().__init__()out_features = out_features or in_featureshidden_features = hidden_features or in_featuresself.fc1 = nn.Linear(in_features, hidden_features)self.act = act_layer()self.drop1 = nn.Dropout(drop)self.fc2 = nn.Linear(hidden_features, out_features)self.drop2 = nn.Dropout(drop)def forward(self, x):x = self.fc1(x)x = self.act(x)x = self.drop1(x)x = self.fc2(x)x = self.drop2(x)return xclass WindowAttention(nn.Module):r""" Window based multi-head self attention (W-MSA) module with relative position bias.It supports both of shifted and non-shifted window.Args:dim (int): Number of input channels.window_size (tuple[int]): The height and width of the window.num_heads (int): Number of attention heads.qkv_bias (bool, optional):  If True, add a learnable bias to query, key, value. Default: Trueattn_drop (float, optional): Dropout ratio of attention weight. Default: 0.0proj_drop (float, optional): Dropout ratio of output. Default: 0.0"""def __init__(self, dim, window_size, num_heads, qkv_bias=True, attn_drop=0., proj_drop=0.):super().__init__()self.dim = dimself.window_size = window_size  # [Mh, Mw]self.num_heads = num_headshead_dim = dim // num_headsself.scale = head_dim ** -0.5# define a parameter table of relative position biasself.relative_position_bias_table = nn.Parameter(torch.zeros((2 * window_size[0] - 1) * (2 * window_size[1] - 1), num_heads))  # [2*Mh-1 * 2*Mw-1, nH]# get pair-wise relative position index for each token inside the windowcoords_h = torch.arange(self.window_size[0])coords_w = torch.arange(self.window_size[1])coords = torch.stack(torch.meshgrid([coords_h, coords_w], indexing="ij"))  # [2, Mh, Mw]coords_flatten = torch.flatten(coords, 1)  # [2, Mh*Mw]# [2, Mh*Mw, 1] - [2, 1, Mh*Mw]relative_coords = coords_flatten[:, :, None] - coords_flatten[:, None, :]  # [2, Mh*Mw, Mh*Mw]relative_coords = relative_coords.permute(1, 2, 0).contiguous()  # [Mh*Mw, Mh*Mw, 2]relative_coords[:, :, 0] += self.window_size[0] - 1  # shift to start from 0relative_coords[:, :, 1] += self.window_size[1] - 1relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1relative_position_index = relative_coords.sum(-1)  # [Mh*Mw, Mh*Mw]self.register_buffer("relative_position_index", relative_position_index)self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)self.attn_drop = nn.Dropout(attn_drop)self.proj = nn.Linear(dim, dim)self.proj_drop = nn.Dropout(proj_drop)nn.init.trunc_normal_(self.relative_position_bias_table, std=.02)self.softmax = nn.Softmax(dim=-1)def forward(self, x, mask: Optional[torch.Tensor] = None):"""Args:x: input features with shape of (num_windows*B, Mh*Mw, C)mask: (0/-inf) mask with shape of (num_windows, Wh*Ww, Wh*Ww) or None"""# [batch_size*num_windows, Mh*Mw, total_embed_dim]B_, N, C = x.shape# qkv(): -> [batch_size*num_windows, Mh*Mw, 3 * total_embed_dim]# reshape: -> [batch_size*num_windows, Mh*Mw, 3, num_heads, embed_dim_per_head]# permute: -> [3, batch_size*num_windows, num_heads, Mh*Mw, embed_dim_per_head]qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4).contiguous()# [batch_size*num_windows, num_heads, Mh*Mw, embed_dim_per_head]q, k, v = qkv.unbind(0)  # make torchscript happy (cannot use tensor as tuple)# transpose: -> [batch_size*num_windows, num_heads, embed_dim_per_head, Mh*Mw]# @: multiply -> [batch_size*num_windows, num_heads, Mh*Mw, Mh*Mw]q = q * self.scaleattn = (q @ k.transpose(-2, -1))# relative_position_bias_table.view: [Mh*Mw*Mh*Mw,nH] -> [Mh*Mw,Mh*Mw,nH]relative_position_bias = self.relative_position_bias_table[self.relative_position_index.view(-1)].view(self.window_size[0] * self.window_size[1], self.window_size[0] * self.window_size[1], -1)relative_position_bias = relative_position_bias.permute(2, 0, 1).contiguous()  # [nH, Mh*Mw, Mh*Mw]attn = attn + relative_position_bias.unsqueeze(0)if mask is not None:# mask: [nW, Mh*Mw, Mh*Mw]nW = mask.shape[0]  # num_windows# attn.view: [batch_size, num_windows, num_heads, Mh*Mw, Mh*Mw]# mask.unsqueeze: [1, nW, 1, Mh*Mw, Mh*Mw]attn = attn.view(B_ // nW, nW, self.num_heads, N, N) + mask.unsqueeze(1).unsqueeze(0)attn = attn.view(-1, self.num_heads, N, N)attn = self.softmax(attn)else:attn = self.softmax(attn)attn = self.attn_drop(attn)# @: multiply -> [batch_size*num_windows, num_heads, Mh*Mw, embed_dim_per_head]# transpose: -> [batch_size*num_windows, Mh*Mw, num_heads, embed_dim_per_head]# reshape: -> [batch_size*num_windows, Mh*Mw, total_embed_dim]#x = (attn @ v).transpose(1, 2).reshape(B_, N, C)x = (attn.to(v.dtype) @ v).transpose(1, 2).reshape(B_, N, C)x = self.proj(x)x = self.proj_drop(x)return xclass SwinTransformerBlock(nn.Module):r""" Swin Transformer Block.Args:dim (int): Number of input channels.num_heads (int): Number of attention heads.window_size (int): Window size.shift_size (int): Shift size for SW-MSA.mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: Truedrop (float, optional): Dropout rate. Default: 0.0attn_drop (float, optional): Attention dropout rate. Default: 0.0drop_path (float, optional): Stochastic depth rate. Default: 0.0act_layer (nn.Module, optional): Activation layer. Default: nn.GELUnorm_layer (nn.Module, optional): Normalization layer.  Default: nn.LayerNorm"""def __init__(self, dim, num_heads, window_size=7, shift_size=0,mlp_ratio=4., qkv_bias=True, drop=0., attn_drop=0., drop_path=0.,act_layer=nn.GELU, norm_layer=nn.LayerNorm):super().__init__()self.dim = dimself.num_heads = num_headsself.window_size = window_sizeself.shift_size = shift_sizeself.mlp_ratio = mlp_ratioassert 0 <= self.shift_size < self.window_size, "shift_size must in 0-window_size"self.norm1 = norm_layer(dim)self.attn = WindowAttention(dim, window_size=(self.window_size, self.window_size), num_heads=num_heads, qkv_bias=qkv_bias,attn_drop=attn_drop, proj_drop=drop)self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()self.norm2 = norm_layer(dim)mlp_hidden_dim = int(dim * mlp_ratio)self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)def forward(self, x, attn_mask):H, W = self.H, self.WB, L, C = x.shapeassert L == H * W, "input feature has wrong size"shortcut = xx = self.norm1(x)x = x.view(B, H, W, C)# pad feature maps to multiples of window size# 把 feature map 给 pad 到 window size 的整数倍pad_l = pad_t = 0pad_r = (self.window_size - W % self.window_size) % self.window_sizepad_b = (self.window_size - H % self.window_size) % self.window_sizex = F.pad(x, (0, 0, pad_l, pad_r, pad_t, pad_b))_, Hp, Wp, _ = x.shape# cyclic shiftif self.shift_size > 0:shifted_x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))else:shifted_x = xattn_mask = None# partition windowsx_windows = window_partition(shifted_x, self.window_size)  # [nW*B, Mh, Mw, C]x_windows = x_windows.view(-1, self.window_size * self.window_size, C)  # [nW*B, Mh*Mw, C]# W-MSA/SW-MSAattn_windows = self.attn(x_windows, mask=attn_mask)  # [nW*B, Mh*Mw, C]# merge windowsattn_windows = attn_windows.view(-1, self.window_size, self.window_size, C)  # [nW*B, Mh, Mw, C]shifted_x = window_reverse(attn_windows, self.window_size, Hp, Wp)  # [B, H', W', C]# reverse cyclic shiftif self.shift_size > 0:x = torch.roll(shifted_x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))else:x = shifted_xif pad_r > 0 or pad_b > 0:# 把前面pad的数据移除掉x = x[:, :H, :W, :].contiguous()x = x.view(B, H * W, C)# FFNx = shortcut + self.drop_path(x)x = x + self.drop_path(self.mlp(self.norm2(x)))return xclass SwinStage(nn.Module):"""A basic Swin Transformer layer for one stage.Args:dim (int): Number of input channels.depth (int): Number of blocks.num_heads (int): Number of attention heads.window_size (int): Local window size.mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: Truedrop (float, optional): Dropout rate. Default: 0.0attn_drop (float, optional): Attention dropout rate. Default: 0.0drop_path (float | tuple[float], optional): Stochastic depth rate. Default: 0.0norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNormdownsample (nn.Module | None, optional): Downsample layer at the end of the layer. Default: Noneuse_checkpoint (bool): Whether to use checkpointing to save memory. Default: False."""def __init__(self, dim, c2, depth, num_heads, window_size,mlp_ratio=4., qkv_bias=True, drop=0., attn_drop=0.,drop_path=0., norm_layer=nn.LayerNorm, use_checkpoint=False):super().__init__()assert dim == c2, r"no. in/out channel should be same"self.dim = dimself.depth = depthself.window_size = window_sizeself.use_checkpoint = use_checkpointself.shift_size = window_size // 2# build blocksself.blocks = nn.ModuleList([SwinTransformerBlock(dim=dim,num_heads=num_heads,window_size=window_size,shift_size=0 if (i % 2 == 0) else self.shift_size,mlp_ratio=mlp_ratio,qkv_bias=qkv_bias,drop=drop,attn_drop=attn_drop,drop_path=drop_path[i] if isinstance(drop_path, list) else drop_path,norm_layer=norm_layer)for i in range(depth)])def create_mask(self, x, H, W):# calculate attention mask for SW-MSAHp = int(np.ceil(H / self.window_size)) * self.window_sizeWp = int(np.ceil(W / self.window_size)) * self.window_sizeimg_mask = torch.zeros((1, Hp, Wp, 1), device=x.device)  # [1, Hp, Wp, 1]h_slices = (slice(0, -self.window_size),slice(-self.window_size, -self.shift_size),slice(-self.shift_size, None))w_slices = (slice(0, -self.window_size),slice(-self.window_size, -self.shift_size),slice(-self.shift_size, None))cnt = 0for h in h_slices:for w in w_slices:img_mask[:, h, w, :] = cntcnt += 1mask_windows = window_partition(img_mask, self.window_size)  # [nW, Mh, Mw, 1]mask_windows = mask_windows.view(-1, self.window_size * self.window_size)  # [nW, Mh*Mw]attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)  # [nW, 1, Mh*Mw] - [nW, Mh*Mw, 1]# [nW, Mh*Mw, Mh*Mw]attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0)).masked_fill(attn_mask == 0, float(0.0))return attn_maskdef forward(self, x):B, C, H, W = x.shapex = x.permute(0, 2, 3, 1).contiguous().view(B, H * W, C)attn_mask = self.create_mask(x, H, W)  # [nW, Mh*Mw, Mh*Mw]for blk in self.blocks:blk.H, blk.W = H, Wif not torch.jit.is_scripting() and self.use_checkpoint:x = checkpoint.checkpoint(blk, x, attn_mask)else:x = blk(x, attn_mask)x = x.view(B, H, W, C)x = x.permute(0, 3, 1, 2).contiguous()return xclass PatchEmbed(nn.Module):def __init__(self, in_c=3, embed_dim=96, patch_size=4, norm_layer=None):super().__init__()patch_size = (patch_size, patch_size)self.patch_size = patch_sizeself.in_chans = in_cself.embed_dim = embed_dimself.proj = nn.Conv2d(in_c, embed_dim, kernel_size=patch_size, stride=patch_size)self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()def forward(self, x):_, _, H, W = x.shape# paddingpad_input = (H % self.patch_size[0] != 0) or (W % self.patch_size[1] != 0)if pad_input:# to pad the last 3 dimensions,# (W_left, W_right, H_top,H_bottom, C_front, C_back)x = F.pad(x, (0, self.patch_size[1] - W % self.patch_size[1],0, self.patch_size[0] - H % self.patch_size[0],0, 0))x = self.proj(x)B, C, H, W = x.shape# flatten: [B, C, H, W] -> [B, C, HW]# transpose: [B, C, HW] -> [B, HW, C]x = x.flatten(2).transpose(1, 2)x = self.norm(x)# view: [B, HW, C] -> [B, H, W, C]# permute: [B, H, W, C] -> [B, C, H, W]x = x.view(B, H, W, C)x = x.permute(0, 3, 1, 2).contiguous()return xclass PatchMerging(nn.Module):r""" Patch Merging Layer.Args:dim (int): Number of input channels.norm_layer (nn.Module, optional): Normalization layer.  Default: nn.LayerNorm"""def __init__(self, dim, c2, norm_layer=nn.LayerNorm):super().__init__()assert c2 == (2 * dim), r"no. out channel should be 2 * no. in channel "self.dim = dimself.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)self.norm = norm_layer(4 * dim)def forward(self, x):"""x: B, C, H, W"""B, C, H, W = x.shape# assert L == H * W, "input feature has wrong size"x = x.permute(0, 2, 3, 1).contiguous()# x = x.view(B, H*W, C)# paddingpad_input = (H % 2 == 1) or (W % 2 == 1)if pad_input:# to pad the last 3 dimensions, starting from the last dimension and moving forward.# (C_front, C_back, W_left, W_right, H_top, H_bottom)x = F.pad(x, (0, 0, 0, W % 2, 0, H % 2))x0 = x[:, 0::2, 0::2, :]  # [B, H/2, W/2, C]x1 = x[:, 1::2, 0::2, :]  # [B, H/2, W/2, C]x2 = x[:, 0::2, 1::2, :]  # [B, H/2, W/2, C]x3 = x[:, 1::2, 1::2, :]  # [B, H/2, W/2, C]x = torch.cat([x0, x1, x2, x3], -1)  # [B, H/2, W/2, 4*C]x = x.view(B, -1, 4 * C)  # [B, H/2*W/2, 4*C]x = self.norm(x)x = self.reduction(x)  # [B, H/2*W/2, 2*C]x = x.view(B, int(H / 2), int(W / 2), C * 2)x = x.permute(0, 3, 1, 2).contiguous()return x

2.2 Swin-Transformer的神经网络模块代码解析

  1. DropPath:实现随机深度正则化的类,这是一种在训练期间随机丢弃一些残差分支以提高模型泛化能力的技术。

  2. WindowAttention:实现具有相对位置偏差的基于窗口的多头自注意力 (W-MSA),它处理输入图像的小窗口以有效捕获局部依赖关系。

  3. SwinTransformerBlock:一个 Swin Transformer 块,包括基于窗口的多头自注意力和具有 GELU 激活的前馈网络 (FFN)。它支持移位窗口以提高模型捕获跨窗口连接的能力。

  4. SwinStage:Swin Transformer 的一个阶段,由多个 Swin Transformer 块组成。它包括处理循环移位、窗口分区和注意力应用的逻辑。

  5. PatchEmbed:将输入图像转换为补丁,然后将其线性投影到更高维的嵌入空间,本质上是为转换器层准备输入。

  6. PatchMerging:合并相邻补丁以减少空间维度,这有助于逐步降低特征采样率。

主要特点:

  • 窗口分区反转函数允许在非重叠窗口中处理输入,与标准自注意力相比,这降低了计算复杂度。

  • 相对位置编码可在不需要固定位置网格的情况下将空间信息添加到注意力机制中。

  • 移位窗口机制允许跨窗口连接并增强模型捕获远程依赖关系的能力。

  • 补丁嵌入和合并转换输入图像并逐步降低空间分辨率,使其对于大输入更有效率。

2.3 更改init.py文件

关键步骤二:修改modules文件夹下的__init__.py文件,先导入函数    

然后在下面的__all__中声明函数 

2.4 添加yaml文件

关键步骤三:在/ultralytics/ultralytics/cfg/models/v8下面新建文件yolov8_Swin-Transformer.yaml文件,粘贴下面的内容

  •  OD【目标检测】
nc: 80  # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n.yaml' will call yolov8.yaml with scale 'n'# [depth, width, max_channels]n: [0.33, 0.25, 1024]  # YOLOv8n summary: 225 layers,  3157200 parameters,  3157184 gradients,   8.9 GFLOPss: [0.33, 0.50, 1024]  # YOLOv8s summary: 225 layers, 11166560 parameters, 11166544 gradients,  28.8 GFLOPsm: [0.67, 0.75, 768]   # YOLOv8m summary: 295 layers, 25902640 parameters, 25902624 gradients,  79.3 GFLOPsl: [1.00, 1.00, 512]   # YOLOv8l summary: 365 layers, 43691520 parameters, 43691504 gradients, 165.7 GFLOPsx: [1.00, 1.25, 512]   # YOLOv8x summary: 365 layers, 68229648 parameters, 68229632 gradients, 258.5 GFLOPs# YOLOv8.0n backbone
backbone:# [from, repeats, module, args]- [-1, 1, PatchEmbed, [96, 4]]  # 0 [b, 96, 160, 160]- [-1, 1, SwinStage , [96, 2, 3, 7]]  # 1 [b, 96, 160, 160]- [-1, 1, PatchMerging, [192]]    # 2 [b, 192, 80, 80]- [-1, 1, SwinStage,  [192, 2, 6, 7]]  # 3 --F0-- [b, 192, 80, 80] p3- [-1, 1, PatchMerging, [384]]   # 4 [b, 384, 40, 40]- [-1, 1, SwinStage, [384, 6, 12, 7]] # 5 --F1-- [b, 384, 40, 40] p4- [-1, 1, PatchMerging, [768]]   # 6 [b, 768, 20, 20]- [-1, 1, SwinStage, [768, 2, 24, 7]] # 7 --F2-- [b, 768, 20, 20]- [-1, 1, SPPF, [768, 5]]# YOLOv8.0n head
head:- [-1, 1, nn.Upsample, [None, 2, 'nearest']]- [[-1, 5], 1, Concat, [1]]  # cat backbone P4- [-1, 1, C2f, [512]]  # 11- [-1, 1, nn.Upsample, [None, 2, 'nearest']]- [[-1, 3], 1, Concat, [1]]  # cat backbone P3- [-1, 1, C2f, [256]]  # 14 (P3/8-small)- [-1, 1, Conv, [256, 3, 2]]- [[-1, 11], 1, Concat, [1]]  # cat head P4- [-1, 1, C2f, [512]]  # 17 (P4/16-medium)- [-1, 1, Conv, [512, 3, 2]]- [[-1, 8], 1, Concat, [1]]  # cat head P5- [-1, 1, C2f, [1024]]  # 20 (P5/32-large)- [[14, 17, 20], 1, Detect, [nc]]  # Detect(P3, P4, P5)
  • Seg【语义分割】
nc: 80  # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n.yaml' will call yolov8.yaml with scale 'n'# [depth, width, max_channels]n: [0.33, 0.25, 1024]  # YOLOv8n summary: 225 layers,  3157200 parameters,  3157184 gradients,   8.9 GFLOPss: [0.33, 0.50, 1024]  # YOLOv8s summary: 225 layers, 11166560 parameters, 11166544 gradients,  28.8 GFLOPsm: [0.67, 0.75, 768]   # YOLOv8m summary: 295 layers, 25902640 parameters, 25902624 gradients,  79.3 GFLOPsl: [1.00, 1.00, 512]   # YOLOv8l summary: 365 layers, 43691520 parameters, 43691504 gradients, 165.7 GFLOPsx: [1.00, 1.25, 512]   # YOLOv8x summary: 365 layers, 68229648 parameters, 68229632 gradients, 258.5 GFLOPs# YOLOv8.0n backbone
backbone:# [from, repeats, module, args]- [-1, 1, PatchEmbed, [96, 4]]  # 0 [b, 96, 160, 160]- [-1, 1, SwinStage , [96, 2, 3, 7]]  # 1 [b, 96, 160, 160]- [-1, 1, PatchMerging, [192]]    # 2 [b, 192, 80, 80]- [-1, 1, SwinStage,  [192, 2, 6, 7]]  # 3 --F0-- [b, 192, 80, 80] p3- [-1, 1, PatchMerging, [384]]   # 4 [b, 384, 40, 40]- [-1, 1, SwinStage, [384, 6, 12, 7]] # 5 --F1-- [b, 384, 40, 40] p4- [-1, 1, PatchMerging, [768]]   # 6 [b, 768, 20, 20]- [-1, 1, SwinStage, [768, 2, 24, 7]] # 7 --F2-- [b, 768, 20, 20]- [-1, 1, SPPF, [768, 5]]# YOLOv8.0n head
head:- [-1, 1, nn.Upsample, [None, 2, 'nearest']]- [[-1, 5], 1, Concat, [1]]  # cat backbone P4- [-1, 1, C2f, [512]]  # 11- [-1, 1, nn.Upsample, [None, 2, 'nearest']]- [[-1, 3], 1, Concat, [1]]  # cat backbone P3- [-1, 1, C2f, [256]]  # 14 (P3/8-small)- [-1, 1, Conv, [256, 3, 2]]- [[-1, 11], 1, Concat, [1]]  # cat head P4- [-1, 1, C2f, [512]]  # 17 (P4/16-medium)- [-1, 1, Conv, [512, 3, 2]]- [[-1, 8], 1, Concat, [1]]  # cat head P5- [-1, 1, C2f, [1024]]  # 20 (P5/32-large)- [[14, 17, 20], 1, Segment, [nc, 32, 256]] # Segment(P3, P4, P5)

温馨提示:因为本文只是对yolov8基础上添加模块,如果要对yolov8n/l/m/x进行添加则只需要指定对应的depth_multiple 和 width_multiple。不明白的同学可以看这篇文章: yolov8yaml文件解读——点击即可跳转 


# YOLOv8n
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.25  # layer channel multiple
max_channels: 1024 # max_channels# YOLOv8s
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
max_channels: 1024 # max_channels# YOLOv8l 
depth_multiple: 1.0  # model depth multiple
width_multiple: 1.0  # layer channel multiple
max_channels: 512 # max_channels# YOLOv8m
depth_multiple: 0.67  # model depth multiple
width_multiple: 0.75  # layer channel multiple
max_channels: 768 # max_channels# YOLOv8x
depth_multiple: 1.33  # model depth multiple
width_multiple: 1.25  # layer channel multiple
max_channels: 512 # max_channels

2.5 注册模块

关键步骤四:在task.py的parse_model函数中注册

elif m in (PatchMerging,PatchEmbed,SwinStage):c1, c2 = ch[f], args[0]if c2 != nc:c2 = make_divisible(min(c2, max_channels) * width, 8)args = [c1, c2, *args[1:]]

2.6 执行程序

在train.py中,将model的参数路径设置为yolov8_Swin-Transformer.yaml的路径

建议大家写绝对路径,确保一定能找到 

from ultralytics import YOLO
import warnings
warnings.filterwarnings('ignore')
from pathlib import Pathif __name__ == '__main__':# 加载模型model = YOLO("ultralytics/cfg/v8/yolov8.yaml")  # 你要选择的模型yaml文件地址# Use the modelresults = model.train(data=r"你的数据集的yaml文件地址",epochs=100, batch=16, imgsz=640, workers=4, name=Path(model.cfg).stem)  # 训练模型

 🚀运行程序,如果出现下面的内容则说明添加成功🚀 

                   from  n    params  module                                       arguments0                  -1  1      1176  ultralytics.nn.modules.block.PatchEmbed      [3, 24, 4]1                  -1  1     15462  ultralytics.nn.modules.block.SwinStage       [24, 24, 2, 3, 7]2                  -1  1      4800  ultralytics.nn.modules.block.PatchMerging    [24, 48]3                  -1  1     58572  ultralytics.nn.modules.block.SwinStage       [48, 48, 2, 6, 7]4                  -1  1     18816  ultralytics.nn.modules.block.PatchMerging    [48, 96]5                  -1  1    683208  ultralytics.nn.modules.block.SwinStage       [96, 96, 6, 12, 7]6                  -1  1     74496  ultralytics.nn.modules.block.PatchMerging    [96, 192]7                  -1  1    897840  ultralytics.nn.modules.block.SwinStage       [192, 192, 2, 24, 7]8                  -1  1     92736  ultralytics.nn.modules.block.SPPF            [192, 192, 5]9                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']10             [-1, 5]  1         0  ultralytics.nn.modules.conv.Concat           [1]11                  -1  1    135936  ultralytics.nn.modules.block.C2f             [288, 128, 1]12                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']13             [-1, 3]  1         0  ultralytics.nn.modules.conv.Concat           [1]14                  -1  1     36224  ultralytics.nn.modules.block.C2f             [176, 64, 1]15                  -1  1     36992  ultralytics.nn.modules.conv.Conv             [64, 64, 3, 2]16            [-1, 11]  1         0  ultralytics.nn.modules.conv.Concat           [1]17                  -1  1    123648  ultralytics.nn.modules.block.C2f             [192, 128, 1]18                  -1  1    147712  ultralytics.nn.modules.conv.Conv             [128, 128, 3, 2]19             [-1, 8]  1         0  ultralytics.nn.modules.conv.Concat           [1]20                  -1  1    476672  ultralytics.nn.modules.block.C2f             [320, 256, 1]21        [14, 17, 20]  1    897664  ultralytics.nn.modules.head.Detect           [80, [64, 128, 256]]
YOLOv8_swin_trf summary: 348 layers, 3701954 parameters, 3701938 gradients, 31.3 GFLOPs

3. 完整代码分享

https://pan.baidu.com/s/1YyxKn0ue9y6rYfpqykpUsg?pwd=8x9s

 提取码: 8x9s 

4. GFLOPs

关于GFLOPs的计算方式可以查看百面算法工程师 | 卷积基础知识——Convolution

未改进的YOLOv8nGFLOPs

img

改进后的GFLOPs

5. 进阶

可以与其他的注意力机制或者损失函数等结合,进一步提升检测效果

6. 总结

Swin Transformer 是一种基于分区注意力机制和层次化结构的先进深度学习模型,通过在局部区域内进行自注意力计算以及使用窗口式注意力机制,实现了在图像分类和目标检测等任务上优异的性能表现。其Transformer缩放技术提高了模型的可扩展性和效率,使其能够处理大规模图像数据,并在训练和推理过程中保持高效率。综合而言,Swin Transformer以其创新性的设计和卓越的性能,成为处理图像数据的一种领先模型。

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.xdnf.cn/news/146908.html

如若内容造成侵权/违法违规/事实不符,请联系一条长河网进行投诉反馈,一经查实,立即删除!

相关文章

Tansformer代码实现

目录 1.Tansformer架构图 2.代码实现 2.1创建类&#xff1a;实现基于位置的前馈网络 2.2创建 残差&LN层标准归一化的类 2.3编码器block 2.4创建编码器 2.5创建解码器 2.6transformer解码器部分 3.知识点个人理解 1.Tansformer架构图 2.代码实现 2.1创建类&…

连续数组问题

目录 一题目&#xff1a; 二思路&#xff1a; 三代码&#xff1a; 一题目&#xff1a; leetcode链接&#xff1a;. - 力扣&#xff08;LeetCode&#xff09; 二思路&#xff1a; 思路&#xff1a;前缀和&#xff08;第二种&#xff09;化0为-1hash&#xff1a; 这样可以把…

【大模型实战篇】一种关于大模型高质量数据的处理方法-无标注数据类别快速识别及重复数据检测(加权向量-卷积神经网络-聚类算法结合)

1. 背景介绍 大模型的能力很大程度上依赖于高质量的数据&#xff0c;在之前的一篇文章《高质量数据过滤及一种BoostedBaggingFilter处理方法的介绍》中&#xff0c;我们介绍了大模型的数据处理链路&#xff0c;本文继续关注在高质量数据的模块。 本文所要介绍的处理方法&…

vscode 配置django

创建运行环境 使用pip安装Django&#xff1a;pip install django。 创建一个新的Django项目&#xff1a;django-admin startproject myproject。 打开VSCode&#xff0c;并在项目文件夹中打开终端。 在VSCode中安装Python扩展&#xff08;如果尚未安装&#xff09;。 在项…

滑动窗口经典题目

目录 滑动窗口 什么是滑动窗口&#xff1f; 什么时候用滑动窗口&#xff1f; 怎么用滑动窗口&#xff1f; 209. 长度最小的子数组&#xff08;滑动窗口的引入&#xff09; 3. 无重复字符的最长子串 1004. 最大连续1的个数 III 1658. 将 x 减到 0 的最小操作数 904. 水…

Fyne ( go跨平台GUI )中文文档-容器和布局 (四)

本文档注意参考官网(developer.fyne.io/) 编写, 只保留基本用法 go代码展示为Go 1.16 及更高版本, ide为goland2021.2 这是一个系列文章&#xff1a; Fyne ( go跨平台GUI )中文文档-入门(一)-CSDN博客 Fyne ( go跨平台GUI )中文文档-Fyne总览(二)-CSDN博客 Fyne ( go跨平台GUI…

【重学 MySQL】三十七、聚合函数

【重学 MySQL】三十七、聚合函数 基本概念5大常用的聚合函数COUNT()SUM()AVG()MAX()MIN() 使用场景注意事项示例查询 聚合函数&#xff08;Aggregate Functions&#xff09;在数据库查询中扮演着至关重要的角色&#xff0c;特别是在处理大量数据时。它们能够对一组值执行计算&a…

37. Vector3与模型位置、缩放属性

本文章给通过组对象Group (opens new window)给大家讲解一下threejs层级模型或树结构的概念。 Group层级模型(树结构)案例 下面代码创建了两个网格模型mesh1、mesh2&#xff0c;通过THREE.Group类创建一个组对象group,然后通过add方法把网格模型mesh1、mesh2作为设置为组对象g…

Vuex的使用看这一篇就够了

Vuex概述 Vuex 是一个专为 Vue.js 应用程序开发的状态管理库。它采用集中式存储管理应用的所有组件的状态&#xff0c;并以一种可预测的方式来保证状态以一种可预测的方式发生变化。 state状态 把公用的数据放到store里的state就行了&#xff0c;上面是vue2的代码&#xff0c;下…

[大语言模型] LINFUSION:1个GPU,1分钟,16K图像

1. 文章 2409.02097 (arxiv.org)https://arxiv.org/pdf/2409.02097 LINFUSION: 1 GPU, 1 MINUTE, 16K IMAGE 摘要 本文介绍了一种新型的扩散模型LINFUSION&#xff0c;它能够在保持高分辨率图像生成性能的同时显著降低时间和内存复杂度。该模型采用了基于Transformer的UNet进…

【前端】ES6:Class语法和Class继承

文章目录 1 Class语法1.1 类的写法1.2 getter与setter1.3 静态属性和静态方法 2 Class继承 1 Class语法 1.1 类的写法 class Person {constructor(name,age){this.name name;this.age age;}say(){console.log(this.name,this.age)} } let obj new Person("kerwin&quo…

python--基础语法(2)

1.顺序语句 默认情况下&#xff0c;Python的代码执行顺序是按照从上到下的顺序&#xff0c;依次执行的。 2.条件语句 条件语句能够表达“如果 ...否则 ...”这样的语义这构成了计算机中基础的逻辑判定条件语&#xff0c; 也叫做 分支语句。表示了接下来的逻辑可能有几种走向…

SysML图例-10cm最小航天器AC-10

DDD领域驱动设计批评文集>> 《软件方法》强化自测题集>> 《软件方法》各章合集>> SysML图中词汇 AC10 AeroCube-10&#xff0c;大小仅为10 10 15 cm的卫星&#xff0c;更多信息参见下文&#xff1a; AeroCube-10成为迄今为止完成在轨接近操作的最小航天…

yolov8模型在手部关键点检测识别中的应用【代码+数据集+python环境+GUI系统】

yolov8模型在手部关键点检测识别中的应用【代码数据集python环境GUI系统】 背景意义 在手势识别、虚拟现实&#xff08;VR&#xff09;、增强现实&#xff08;AR&#xff09;等领域&#xff0c;手部关键点检测为用户提供了更加自然、直观的交互方式。通过检测手部关键点&#…

移动登录页:让用户开启一段美好的旅程吧。

Hi,大家好&#xff0c;我是大千UI工场&#xff0c;移动登录页千千万&#xff0c;这里最好看&#xff0c;本期分享一批移动端的登录页面&#xff0c;供大家欣赏。 本次分享的是毛玻璃/3D风格的登录页。

Linux文件IO(七)-复制文件描述符

在 Linux 系统中&#xff0c;open 返回得到的文件描述符 fd 可以进行复制&#xff0c;复制成功之后可以得到一个新的文件描述符&#xff0c;使用新的文件描述符和旧的文件描述符都可以对文件进行 IO 操作&#xff0c;复制得到的文件描述符和旧的文件描述符拥有相同的权限&#…

【文化课学习笔记】【化学】选必三:合成高分子生物大分子

【化学】选必三&#xff1a;合成高分子&生物大分子 如果你是从 B 站一化儿笔记区来的&#xff0c;请先阅读我在第一篇有机化学笔记中的「读前须知」(点开头的黑色小三角展开)&#xff1a;链接 加聚反应 基本概念 聚合反应 由小分子化合物合成高分子化合物的反应叫聚合反应。…

学习 git 命令行的简单操作, 能够将代码上传到 Gitee 上

首先登录自己的gitee并创建好仓库 将仓库与Linux终端做链接 比如说我这里已经创建好了一个我的Linux学习仓库 点开克隆/下载&#xff1a; 在你的终端中粘贴上图中1中的指令 此时他会让你输入你的用户名和密码&#xff0c;用户名就是上图中3中Username for ....中后面你的一个…

秒变 Vim 高手:必学的编辑技巧与隐藏功能大揭秘

文章目录 前言一、vi与vim二、Vim的三种模式1. 普通模式2. 插入模式3. 命令模式 三、Vim中的查找与替换1. 查找2. 替换 四、给Vim设置行号1. 临时显示行号2. 永久显示行号 总结 前言 在Linux系统中&#xff0c;文本编辑器是开发者和系统管理员日常工作中的重要工具之一。其中&…

手机号归属地查询-运营商归属地查询-手机号归属地信息-运营商手机号归属地查询接口-手机号归属地

手机号归属地查询接口是一种网络服务接口&#xff0c;它允许开发者通过编程方式查询手机号码的注册地信息。这种接口通常由第三方服务提供商提供&#xff0c;并可通过HTTP请求进行调用。以下是一些关于手机号归属地查询接口的相关信息&#xff1a; 1. 接口功能 归属地查询&am…