当前位置：首页 > news >正文

看模型结构分析模型结构

news 2026/6/30 6:18:20

from transformers import ForImageClassification

model = ForImageClassification.from_pretrained(
""
)

print(model)

打印模型结构

Some weights of ForImageClassification were not initialized from the model checkpoint at /liujiangli-dataand are newly initialized: ['classifier.bias', 'classifier.weight']

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
意思就是需要自己训练分类头
ForImageClassification(
(vision_model):VisionTransformer(

这部分负责将图像转化为一系列的token
(embeddings): VisionEmbeddings(
(patch_embedding): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16), padding=valid)

用 3×3 RGB 输入，卷积核大小 16、步幅 16，（卷积核大小和步长大小相同）相当于把图像按照 16×16 的不重叠 patch 切分并对每个 patch 做线性投影（这是很多 ViT 实现常用的 trick，用 conv 实现 patch embedding）。
输出通道数 768 表示每个 patch 被投影到 768 维特征向量。
若输入是 224×224，输出的 patch 网格大小是 14×14，总 patch 数 196（14×14）。
数据形状变化（假设 batch size = B）：
- 输入：(B, 3, 224, 224)
- conv 输出：(B, 768, 14, 14)
- flatten/transpose 后：(B, 196, 768)或常见的(B, 768, 196)根据实现差别（你的打印显示 position embedding 是 Embedding(196,768)，所以 tokens 数＝196）。

(position_embedding): Embedding(196, 768)
)

为每个 patch token 加上可学习的位置编码（长度 196，每个位置 768 维）。
这样模型能区分不同 patch 的空间位置。

包括embeddings 包括patch embedding 和 position embedding

接下来Transformer 的堆叠编码器，负责在所有 patch token 间建模全局交互。

(encoder): Encoder(
(layers): ModuleList(

有12层这样的堆叠在一起一共 12 层 encoder layer（典型 ViT-Base 的层数）。每一层都包括以下几个部分
(0-11): 12 x EncoderLayer(
(layer_norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(self_attn): Attention(
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(layer_norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): MLP(
(activation_fn): GELUTanh()
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
)
)
)
)

`EncoderLayer`

layer_norm1: LayerNorm((768,), eps=1e-06, elementwise_affine=True)
- 对 token 的 768 维做 Layer Normalization（归一化），eps是数值稳定项，elementwise_affine=True意味着有可学习的缩放和平移参数（γ, β）。
self_attn: Attention
- Transformer 的自注意力子层，包含 Q/K/V 的投影和输出投影：
  - q_proj,k_proj,v_proj:Linear(in_features=768, out_features=768, bias=True)
    - 把 768 维输入线性变换成 Q、K、V，维度仍是 768。通常会在内部 reshape 为(num_heads, head_dim)形式进行多头计算。
    - 虽然打印没有给出num_heads，但常见配置是 12 头（768 / 12 = 64 head_dim）。这是个合理假设，但具体头数要看实现。
  - out_proj: Linear(768 -> 768)：把多头注意力拼回 768 维。
- 自注意力作用：每个 patch token 会基于所有 token 的相互注意力加权组合信息，捕捉全局依赖和上下文。
layer_norm2: LayerNorm((768,), eps=1e-06, elementwise_affine=True)
- 注意力后的第二个 LayerNorm（用于残差连接后的规范化 / 稳定训练）。
mlp: MLP
- 前馈网络（通常称为 MLP 或 FFN），结构如下：
  - fc1: Linear(in_features=768, out_features=3072, bias=True)：扩展维度到 3072（通常是 4× hidden dim）。
  - activation_fn: GELUTanh()：这是一个自定义激活，名字暗示可能是结合了 GELU 与 tanh 的形式，或是某种带有 tanh 缓和的 GELU 风格激活（不是标准的，但用途类似于 GELU 提供非线性）。作用是引入非线性并帮助模型表达复杂变换。
  - fc2: Linear(in_features=3072, out_features=768, bias=True)：把维度投回 768。
- MLP 的作用是对每个 token 进行位置无关的逐 token 非线性变换（增强表示能力）。

每个 encoder layer 常见的数据流是：x = x + SelfAttn(LN(x))，x = x + MLP(LN(x))（即 Pre-LN 或 Post-LN 变体，这里看起来是 pre-LN，因为 LayerNorm 在子层前面）。

(post_layernorm): LayerNorm((768,), eps=1e-06, elementwise_affine=True)

在全部 encoder layer 堆叠之后再做一次 LayerNorm，输出一个稳定的 token 表示序列。

(head): MultiheadAttentionPoolingHead(
(attention): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(layernorm): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): MLP(
(activation_fn): GELUTanh()
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
)
)
)

这个 head 不是简单的 global average 或 class token，而是用 attention pooling（多头注意力池化）把 token 序列聚合为一个全局表示。

(attention): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True) )
- MultiheadAttention 作为 pooling 操作：通常做法是引入一个 learnable query（或使用一个特殊的 class token），通过 query 对所有 token 做 attention，加权得到一个聚合向量。
- out_proj把多头拼回 768 维。NonDynamicallyQuantizableLinear是 PyTorch 的一种线性层实现细节（与量化相关的实现差异），对功能没有影响。
- 这一步的结果是把(B, N, 768)的 token 序列压缩为(B, 1, 768)或(B, 768)的全局向量（取决于实现）。
(layernorm): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
- 对聚合向量做规范化。
(mlp): MLP( fc1: 768->3072, activation GELUTanh, fc2: 3072->768 )
- 对聚合向量再做一次 MLP 变换以增强判别能力（类似 encoder 内部的 MLP，但只作用于 pooled vector）。