
Building a Transformer-Based Chinese Text Classification Model in Python: From Zero to One Toward Enterprise-Grade Text Classification

news 2026/7/3 4:23:10

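Summary: Starting from scratch, this article builds a Chinese text classification model on the Transformer architecture in Python. It covers the full workflow: environment setup, data preprocessing, word embeddings, the Transformer encoder, model training, evaluation, and inference. The code is heavily commented and meant to be run as-is, which makes it suitable for NLP beginners, hands-on deep learning practice, text classification competitions, and as a starting point for enterprise projects. Worth bookmarking!

Use cases: news classification, sentiment analysis, spam detection, intent recognition, review classification, legal / medical text classification, multi-label text classification

Keywords: Transformer, Chinese text classification, Python, NLP, deep learning, large models, PyTorch, neural networks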

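Preface
Core principles of Transformer text classification
Environment setup and dependency installation
Building and preprocessing the Chinese dataset
Chinese word segmentation and vocabulary construction
Embedding layer and positional encoding
Implementing the Transformer encoder
Building the Transformer-based text classification model
Training and validation
Model evaluation (precision, recall, F1, accuracy)
Inference and prediction in practice
Model optimization and next steps
Summary
Reader perks and follow links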

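1. Preface

In NLP (natural language processing), text classification is one of the most classic, most widely used, and most commercially valuable tasks. Spam SMS filtering, sentiment analysis, news topic classification, customer-service intent recognition, content moderation, and legal document categorization all rely on it.

The Transformer architecture, proposed by Google in 2017, reshaped the NLP field. Its self-attention mechanism captures global features of a text and outperforms traditional RNNs and LSTMs, and it became the foundation of large models such as BERT, GPT, and RoBERTa.

Many newcomers assume that Transformer = large model = hard to get started with. This article challenges that assumption: using plain language and compact PyTorch code, it implements a Transformer-based Chinese text classifier from scratch, without relying on BERT or other pretrained models (the attention layer itself uses PyTorch's built-in nn.MultiheadAttention), so that you can see how the Transformer works under the hood.

The article follows a principle → code → comments → run → results pattern, and the material can be used for coursework, graduation projects, hands-on projects, and interview preparation.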

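2. Core principles of Transformer text classification

2.1 Why use a Transformer for text classification?

It processes all positions of a sequence in parallel, so training is typically much faster than with LSTM/GRU
Self-attention can capture long-range dependencies
It extracts features well, which suits the complex semantics of Chinese
It scales easily to large models and large datasets

2.2 Text classification pipeline

```plaintext
Chinese text → tokenization → token ids → word embedding + positional encoding → Transformer encoder → pooling / CLS feature → fully connected layer → predicted class
```

2.3 Core modules

Embedding layer: maps tokens to vectors
Positional encoding: injects information about each token's position in the sequence
Multi-head self-attention (Multi-Head Attention): captures relationships between words
Feed-forward network (Feed Forward): transforms the features at each position
Layer normalization + residual connections: keep the training of deep networks stable
Classifier head: outputs one score per class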
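
To make the self-attention item in the list above more concrete, here is a minimal, dependency-free sketch of scaled dot-product attention. It is a simplification for illustration only: it assumes a single head and uses Q = K = V = the input vectors (no learned projections). The function names `softmax` and `self_attention` and the toy vectors are hypothetical and not part of the article's model.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x.

    x is a list of token vectors (lists of floats of equal length).
    Returns (weights, outputs), where weights[i][j] is how strongly
    token i attends to token j.
    """
    d = len(x[0])
    # Similarity of every query with every key, scaled by sqrt(d)
    scores = [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        for q in x
    ]
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of all value vectors
    outputs = [
        [sum(w * v[j] for w, v in zip(w_row, x)) for j in range(d)]
        for w_row in weights
    ]
    return weights, outputs


# Three toy "token" vectors: tokens 0 and 1 are similar, token 2 is different.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights, outputs = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and token 0 puts more weight on the similar token 1 than on token 2. A real multi-head layer adds learned projections for Q, K, and V and runs several such heads in parallel.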

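3. Environment setup and dependency installation

This project is built on PyTorch and supports training on either CPU or GPU.

3.1 Install dependencies

```bash
pip install torch torchvision
pip install jieba
pip install numpy pandas scikit-learn tqdm
```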

3.2 Import packages

```python
import torch
import torch.nn as nn
import torch.optim as optim
import jieba
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from tqdm import tqdm
import warnings

warnings.filterwarnings('ignore')

# Select the device (prefer the GPU when one is available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
```

4. Building and preprocessing the Chinese dataset

4.1 About the dataset

We build a Chinese news classification dataset with 5 classes: ['体育', '财经', '教育', '科技', '娱乐'] (sports, finance, education, technology, entertainment).

You can also swap in your own sentiment analysis or review classification data.

4.2 Build a toy dataset (replace with real data in practice)

```python
# Build the Chinese text classification dataset
def build_dataset():
    data = [
        ("中国女排3比0战胜日本队，喜迎世界杯三连胜", "体育"),
        ("A股三大指数收涨，券商板块领涨两市", "财经"),
        ("教育部发布新政策，加强中小学体育教育", "教育"),
        ("华为发布新一代AI芯片，性能提升50%", "科技"),
        ("周杰伦新专辑上线，销量突破千万张", "娱乐"),
        ("CBA联赛新疆队击败广东队获得总冠军", "体育"),
        ("央行降准释放长期资金1万亿元", "财经"),
        ("2025年高考时间确定，报名人数再创新高", "教育"),
        ("OpenAI发布新版大模型，上下文长度提升", "科技"),
        ("电影《热辣滚烫》票房突破30亿", "娱乐"),
        ("足协公布新赛季中超联赛赛程", "体育"),
        ("美联储维持利率不变，全球市场回暖", "财经"),
        ("高校新增人工智能专业，招生规模扩大", "教育"),
        ("自动驾驶汽车迎来新政策支持", "科技"),
        ("第38届大众电影百花奖获奖名单公布", "娱乐")
    ]
    texts = [d[0] for d in data]
    labels = [d[1] for d in data]
    return texts, labels

texts, labels = build_dataset()

# Map label names to integer ids and back
label2id = {"体育": 0, "财经": 1, "教育": 2, "科技": 3, "娱乐": 4}
id2label = {v: k for k, v in label2id.items()}
labels_ids = [label2id[l] for l in labels]

# Split into training and test sets (15 samples → 12 train / 3 test)
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels_ids, test_size=0.2, random_state=42
)
```
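
With only 15 samples, a random 80/20 split leaves 3 test sentences, so at least two of the 5 classes cannot appear in the test set. A quick way to see this is to count samples per class after splitting. The sketch below is only an illustration: the helper name `class_distribution` and the example label ids are hypothetical, not output from the split above.

```python
from collections import Counter

# Same id-to-label mapping as in the article
id2label = {0: "体育", 1: "财经", 2: "教育", 3: "科技", 4: "娱乐"}


def class_distribution(label_ids, id2label):
    """Count samples per class, reporting 0 for classes that never occur."""
    counts = Counter(label_ids)
    return {name: counts.get(idx, 0) for idx, name in id2label.items()}


# Hypothetical label ids of a 3-sample test split
example_test_labels = [0, 3, 3]
dist = class_distribution(example_test_labels, id2label)
missing = [name for name, n in dist.items() if n == 0]

print(dist)
print("Classes missing from the split:", missing)
```

On a real dataset with enough samples per class, passing `stratify=labels_ids` to `train_test_split` keeps the class proportions the same in both splits.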

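5. Chinese word segmentation and vocabulary construction

5.1 Chinese word segmentation (jieba)

```python
# Chinese word segmentation
def tokenize(text):
    return list(jieba.cut(text))  # jieba's default (accurate) mode

# Tokenize every text
train_tokens = [tokenize(t) for t in train_texts]
test_tokens = [tokenize(t) for t in test_texts]
```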

5.2 Build the vocabulary

```python
# Build the vocabulary from the training tokens only
def build_vocab(tokens_list, min_freq=1):
    all_tokens = []
    for tokens in tokens_list:
        all_tokens.extend(tokens)
    # Count token frequencies
    counter = Counter(all_tokens)
    vocab = ['<PAD>', '<UNK>']  # padding token (id 0) + unknown token (id 1)
    vocab.extend([w for w, c in counter.items() if c >= min_freq])
    word2id = {w: i for i, w in enumerate(vocab)}
    return vocab, word2id

vocab, word2id = build_vocab(train_tokens)
vocab_size = len(vocab)
print("Vocabulary size:", vocab_size)
```

5.3 Convert text to index sequences of a fixed length

```python
# Convert tokens to ids
def text2id(tokens, word2id, max_len=20):
    ids = [word2id.get(w, 1) for w in tokens]  # 1 is <UNK>
    # Pad or truncate to max_len
    if len(ids) < max_len:
        ids += [0] * (max_len - len(ids))  # 0 is <PAD>
    else:
        ids = ids[:max_len]
    return ids

max_len = 20
train_X = torch.LongTensor([text2id(t, word2id, max_len) for t in train_tokens])
test_X = torch.LongTensor([text2id(t, word2id, max_len) for t in test_tokens])
train_y = torch.LongTensor(train_labels)
test_y = torch.LongTensor(test_labels)
```
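
The padding, truncation, and unknown-word behavior of the conversion step is easier to see on a tiny hand-written vocabulary. The sketch below mirrors the logic of `text2id` but is self-contained; the `encode` helper, the five-entry `word2id`, and the pre-tokenized inputs are hypothetical stand-ins so that the example runs without jieba.

```python
PAD_ID, UNK_ID = 0, 1


def encode(tokens, word2id, max_len):
    """Map tokens to ids, truncate to max_len, then right-pad with PAD_ID."""
    ids = [word2id.get(tok, UNK_ID) for tok in tokens][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


# A hand-written toy vocabulary (not the one built from the dataset)
word2id = {"<PAD>": 0, "<UNK>": 1, "华为": 2, "发布": 3, "芯片": 4}

# "手机" is not in the vocabulary, so it maps to <UNK>; the rest is padding
short = encode(["华为", "发布", "手机"], word2id, max_len=5)
# Nine tokens are cut down to the first five
long = encode(["华为", "发布", "芯片"] * 3, word2id, max_len=5)

print(short)  # [2, 3, 1, 0, 0]
print(long)   # [2, 3, 4, 2, 3]
```

Because the vocabulary is built from the training set only, any word that first appears at test or inference time ends up as `<UNK>`, which is one reason a 12-sentence training set generalizes poorly.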

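6. Embedding layer and positional encoding

6.1 Positional encoding (required by the Transformer)

```python
# Sinusoidal positional encoding
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=20):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # 1 / 10000^(2i / d_model) for each pair of dimensions
        div = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float) * (-np.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
        pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
        # A buffer follows the module across devices but is not trained
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: [batch, seq_len, dim]
        return x + self.pe[:x.size(1), :]
```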
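
For readers who want to check the numbers, the same sinusoidal formula can be written out in plain Python for a very small case. This is a sketch for inspection only, not a replacement for the module above; the function name `positional_encoding` is hypothetical, and the sizes `max_len=3`, `d_model=4` are chosen just to keep the output readable.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings.

    PE[pos][2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos][2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


pe = positional_encoding(max_len=3, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 4) for v in row])
```

Position 0 always encodes to alternating 0 and 1 (sin 0 and cos 0). The first pair of dimensions changes quickly from one position to the next, while later pairs change slowly, which gives every position a distinct pattern.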

6.2 Embedding layer

```python
embedding_dim = 128  # must equal the model dimension `dim` used in Section 8
embedding = nn.Embedding(vocab_size, embedding_dim)
```

7. Implementing the Transformer encoder

7.1 Multi-head attention + feed-forward network + encoder layer

```python
# One Transformer encoder layer
class TransformerEncoderLayer(nn.Module):
    def __init__(self, dim, hidden_dim, num_heads, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, dim)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention, then residual connection + layer norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Feed-forward network, then residual connection + layer norm
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_out))
        return x
```
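
The encoder layer above calls `self.attn(x, x, x)` without a mask, so `<PAD>` positions take part in attention like real tokens. One possible refinement is to pass a `key_padding_mask` to `nn.MultiheadAttention`. The sketch below shows the idea in isolation; the sizes (vocabulary of 10, dimension 8, 2 heads) and the example ids are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD_ID = 0

# Two padded sequences of token ids: [batch=2, seq_len=5]
ids = torch.tensor([[5, 7, 9, 0, 0],
                    [3, 4, 0, 0, 0]])

# True marks positions that attention should ignore (the padding)
key_padding_mask = ids.eq(PAD_ID)

embed = nn.Embedding(10, 8, padding_idx=PAD_ID)
attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = embed(ids)
with torch.no_grad():
    out, weights = attn(x, x, x, key_padding_mask=key_padding_mask)

print(out.shape)      # [batch, seq_len, dim]
print(weights.shape)  # [batch, query position, key position], averaged over heads
print(weights[0, 0])  # attention of the first token over all 5 key positions
```

With the mask in place, the attention weights on padded key positions are exactly zero. To use this in the article's model, the mask would have to be threaded through `forward` of each encoder layer, which changes the layer's signature.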

7.2 The complete Transformer encoder

```python
# A stack of num_layers encoder layers applied one after another
class TransformerEncoder(nn.Module):
    def __init__(self, num_layers, dim, hidden_dim, num_heads):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(dim, hidden_dim, num_heads)
            for _ in range(num_layers)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
```

8. Building the Transformer-based text classification model

8.1 Overall model structure

```plaintext
Embedding → positional encoding → N TransformerEncoder layers → pooling → fully connected classifier
```

8.2 Model code (with detailed comments)

```python
class TransformerTextClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, dim=128, hidden_dim=256,
                 num_heads=2, num_layers=2, max_len=20):
        super().__init__()
        # Word embedding
        self.embedding = nn.Embedding(vocab_size, dim)
        # Positional encoding
        self.pos_encoding = PositionalEncoding(dim, max_len)
        # Transformer encoder
        self.encoder = TransformerEncoder(num_layers, dim, hidden_dim, num_heads)
        # Classifier head
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, x):
        # 1. Word embedding: [batch, seq_len] → [batch, seq_len, dim]
        x = self.embedding(x)
        # 2. Add position information
        x = self.pos_encoding(x)
        # 3. Feature extraction with the Transformer encoder
        x = self.encoder(x)
        # 4. Pooling: average over all positions to get a sentence vector
        feat = torch.mean(x, dim=1)
        # 5. Classification output (raw logits, one per class)
        out = self.classifier(feat)
        return out

# Hyperparameters
num_classes = 5
dim = 128
hidden_dim = 256
num_heads = 2
num_layers = 2

# Instantiate the model
model = TransformerTextClassifier(
    vocab_size=vocab_size,
    num_classes=num_classes,
    dim=dim,
    hidden_dim=hidden_dim,
    num_heads=num_heads,
    num_layers=num_layers
).to(device)
```
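
Note that `torch.mean(x, dim=1)` averages over all 20 positions, including the `<PAD>` positions of short sentences. A possible alternative is a mask-aware mean that only averages real tokens. The following is a minimal sketch of that idea, not the article's implementation; the helper name `masked_mean` and the toy tensors are hypothetical.

```python
import torch


def masked_mean(x, ids, pad_id=0):
    """Average token vectors while ignoring padded positions.

    x:   [batch, seq_len, dim] encoder output
    ids: [batch, seq_len] token ids, where pad_id marks padding
    """
    mask = ids.ne(pad_id).unsqueeze(-1).to(x.dtype)  # [batch, seq_len, 1]
    summed = (x * mask).sum(dim=1)                    # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1.0)           # avoid division by zero
    return summed / counts


# One sentence with two real tokens followed by two padding positions.
ids = torch.tensor([[4, 5, 0, 0]])
x = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0], [100.0, 100.0]]])

plain = torch.mean(x, dim=1)   # includes padding: tensor([[51., 51.]])
masked = masked_mean(x, ids)   # real tokens only: tensor([[2., 2.]])
print(plain)
print(masked)
```

Using it in the classifier would require keeping the original id tensor around in `forward`, for example by pooling with `masked_mean(x, input_ids)` instead of `torch.mean(x, dim=1)`.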

9. Training and validation

9.1 Loss function and optimizer

```python
criterion = nn.CrossEntropyLoss()  # expects raw logits and integer class ids
optimizer = optim.Adam(model.parameters(), lr=1e-3)
epochs = 20
```
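
As a reference point for reading the training loss, the cross-entropy of a single sample can be computed by hand from its logits: it is the negative log of the softmax probability assigned to the true class. The sketch below is a plain-Python illustration; the function name `cross_entropy` and the example logits are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target]).

    Computed as log-sum-exp(logits) - logits[target] for numerical stability.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


# A model that scores all 5 classes equally: loss is ln(5)
uniform = cross_entropy([0.0, 0.0, 0.0, 0.0, 0.0], target=2)
# A model that strongly prefers the correct class: loss is close to 0
confident = cross_entropy([0.0, 0.0, 5.0, 0.0, 0.0], target=2)

print(round(uniform, 4))
print(round(confident, 4))
```

For a 5-class problem, an untrained model should therefore start with a loss near ln 5 ≈ 1.61; `nn.CrossEntropyLoss` averages this per-sample value over the batch by default.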

9.2 Training loop

```python
for epoch in range(epochs):
    model.train()  # training mode (enables dropout)
    with tqdm(total=1, desc=f"Epoch {epoch+1}/{epochs}") as pbar:
        # The dataset is tiny, so each epoch is a single full-batch step
        inputs = train_X.to(device)
        targets = train_y.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        total_loss = loss.item()
        pbar.set_postfix({"loss": total_loss})
        pbar.update(1)

print("✅ Model training complete!")
```
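
The loop above feeds the whole training set in one step, which only works because the dataset is tiny. For real data, mini-batch training with `DataLoader` is the usual pattern. The sketch below shows that pattern on random stand-in data with a deliberately simple stand-in model; the tensor sizes, batch size, and model here are hypothetical and would be replaced by `train_X`, `train_y`, and the Transformer classifier.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: 12 sequences of 20 token ids, 5 classes
X = torch.randint(0, 50, (12, 20))
y = torch.randint(0, 5, (12,))
loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=True)

# Stand-in model: embedding → flatten → linear classifier
model = nn.Sequential(
    nn.Embedding(50, 16),
    nn.Flatten(),
    nn.Linear(20 * 16, 5),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

epoch_losses = []
for epoch in range(3):
    model.train()
    running = 0.0
    for batch_X, batch_y in loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_X), batch_y)
        loss.backward()
        optimizer.step()
        # Weight each batch loss by its size to get a per-sample average
        running += loss.item() * batch_X.size(0)
    epoch_losses.append(running / len(loader.dataset))
    print(f"epoch {epoch + 1}: mean loss {epoch_losses[-1]:.4f}")
```

Shuffling each epoch and accumulating a per-sample average loss gives a number that stays comparable when the last batch is smaller than the others.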

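10. Model evaluation

```python
# Evaluation mode (disables dropout)
model.eval()
with torch.no_grad():
    test_inputs = test_X.to(device)
    test_targets = test_y.to(device)
    outputs = model(test_inputs)
    preds = torch.argmax(outputs, dim=1).cpu().numpy()
    true = test_targets.cpu().numpy()

acc = accuracy_score(true, preds)
print(f"\n📊 Test accuracy: {acc:.2f}")
print("\n📋 Classification report:")
# Pass `labels` explicitly: the 3-sample test split cannot contain all 5 classes,
# and without it the number of classes found would not match the 5 target names.
print(classification_report(
    true, preds,
    labels=list(id2label.keys()),
    target_names=list(id2label.values()),
    zero_division=0,
))
```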
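
The precision, recall, and F1 columns in the classification report follow directly from per-class counts of true positives, false positives, and false negatives. The following hand-rolled sketch makes those definitions explicit; the helper name `per_class_prf` and the toy label lists are hypothetical, and in practice scikit-learn's report is the tool to use.

```python
def per_class_prf(y_true, y_pred, num_classes):
    """Return {class_id: (precision, recall, f1)} computed from raw counts."""
    report = {}
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # correctly predicted c
        fp = sum(1 for t, p in pairs if t != c and p == c)  # predicted c, was not c
        fn = sum(1 for t, p in pairs if t == c and p != c)  # was c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = (precision, recall, f1)
    return report


# Toy example with 3 classes: one sample of class 0 is mistaken for class 1
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]

report = per_class_prf(y_true, y_pred, num_classes=3)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for c, (p, r, f) in report.items():
    print(f"class {c}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

In this toy case class 0 has perfect precision but only 50% recall, while class 1 has perfect recall but lower precision, even though the overall accuracy is 80%. That difference is why per-class metrics matter on imbalanced data.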

11. Inference and prediction in practice

```python
# Predict the class of a single sentence
def predict(text):
    model.eval()
    tokens = tokenize(text)
    ids = text2id(tokens, word2id, max_len)
    inputs = torch.LongTensor([ids]).to(device)  # add a batch dimension
    with torch.no_grad():
        out = model(inputs)                      # [1, num_classes]
    pred_id = torch.argmax(out, dim=1).item()
    label = id2label[pred_id]
    return label

# Try it out
if __name__ == "__main__":
    test_sentence = "小米发布新款自动驾驶汽车"
    label = predict(test_sentence)
    print(f"Input text: {test_sentence}")
    print(f"Predicted class: 【{label}】")
```
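
`predict` returns only the top label. In practice it is often useful to also see how confident the model is, which can be done by applying softmax to the logits and taking the top-k classes. The sketch below works on a hand-written logits vector so that it runs on its own; the helper name `top_k_labels` and the logit values are hypothetical, not output of the trained model.

```python
import torch

# Same id-to-label mapping as in the article
id2label = {0: "体育", 1: "财经", 2: "教育", 3: "科技", 4: "娱乐"}


def top_k_labels(logits, id2label, k=2):
    """Return the k most likely (label, probability) pairs for one sample.

    logits: 1-D tensor of raw class scores, as produced by the classifier head.
    """
    probs = torch.softmax(logits, dim=-1)      # convert scores to probabilities
    top_p, top_i = torch.topk(probs, k)        # sorted from most to least likely
    return [(id2label[i.item()], p.item()) for p, i in zip(top_p, top_i)]


# Hypothetical logits for one sentence
logits = torch.tensor([0.2, 0.1, -1.0, 2.5, 0.3])
result = top_k_labels(logits, id2label, k=2)

for label, prob in result:
    print(f"{label}: {prob:.3f}")
```

With the article's model, the logits for one sentence would be `out[0]` inside `predict`. Keep in mind that softmax probabilities from a model trained on 12 sentences are not calibrated confidence estimates.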