
From TT100K to YOLO Format: A Pitfall-Avoidance Guide to Dataset Conversion and Splitting (with Complete Code)

news 2026/6/2 13:51:23

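From TT100K to YOLO Format: A Practical Guide to Converting a Traffic-Sign Detection Dataset

If you are using YOLOv5 or YOLOv8 for traffic-sign detection, you have probably come across the TT100K dataset. With upward of ten thousand images it looks ideal, but once you start using it you find quite a few pitfalls on the way from the raw data to a YOLO training format. This article walks through the full TT100K → COCO → YOLO conversion and shares lessons from my own projects.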

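1. Understanding the Structure and Challenges of the TT100K Dataset

TT100K stands for "Tsinghua-Tencent 100K", a large traffic-sign dataset released jointly by Tsinghua University and Tencent. The original dataset contains more than 10,000 high-resolution images (2048×2048) annotated with 221 traffic-sign categories. Using it directly for YOLO training runs into several problems:

Class imbalance: many of the 221 categories have only single-digit sample counts, while common ones such as speed-limit signs have more than a thousand
Incompatible format: the original annotations are JSON, not the txt format YOLO expects
Unsuitable split: the original train/test/other division does not map onto the train/val/test layout used for training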

    # Example layout of the original TT100K data
    tt100k_dataset/
    ├── annotations/   # JSON annotation files
    ├── train/         # training images (6105)
    ├── test/          # test images (3071)
    └── other/         # other images (7641)

Tip: back up the original dataset before converting anything, and run every operation on the copy.

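2. Data Cleaning and Category Selection

With 221 unbalanced categories, training on everything rarely works well. Clean the data and select categories first:

Count the category distribution: parse all JSON files and count how often each category appears
Filter by threshold: keep categories with at least 100 samples (adjust to your needs)
Handle multi-label images: some images contain several signs, so make sure only annotations of the selected categories are kept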

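    import json
    import os
    from collections import defaultdict

    # Count how often each category appears.
    # Assumes one JSON file per image in annotation_dir, each with an 'objects' list.
    def count_categories(annotation_dir):
        category_counts = defaultdict(int)
        for ann_file in os.listdir(annotation_dir):
            if not ann_file.endswith('.json'):
                continue
            with open(os.path.join(annotation_dir, ann_file)) as f:
                data = json.load(f)
            for obj in data['objects']:
                category_counts[obj['category']] += 1
        return category_counts

    # Example: keep categories with at least 100 samples.
    # Sorting makes the category order (and the IDs derived from it) reproducible.
    category_counts = count_categories('tt100k/annotations')
    selected_categories = sorted(cat for cat, count in category_counts.items() if count >= 100)
    print(f"Kept {len(selected_categories)} categories after filtering")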

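After filtering, typically 45 to 50 major traffic-sign categories remain. Dropping the long tail of rare classes usually improves training, because the model no longer has to learn classes it sees only a handful of times.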
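The third step in the list above, keeping only annotations of the selected categories, can be sketched as follows. This is a minimal illustration under stated assumptions: `filter_objects` and `filter_dataset` are hypothetical helper names, the in-memory dict stands in for the per-image JSON files, and the category names in the sample are made up.

```python
def filter_objects(objects, selected_categories):
    """Return only the objects whose category is in selected_categories."""
    keep = set(selected_categories)
    return [obj for obj in objects if obj["category"] in keep]


def filter_dataset(per_image_objects, selected_categories, drop_empty=True):
    """Filter every image's object list.

    per_image_objects maps an image file name to its list of object dicts
    (a hypothetical in-memory form of the per-image annotations).
    With drop_empty=True, images left without any object are removed.
    """
    filtered = {}
    for name, objects in per_image_objects.items():
        kept = filter_objects(objects, selected_categories)
        if kept or not drop_empty:
            filtered[name] = kept
    return filtered


# Tiny made-up sample; the category names are illustrative only.
sample = {
    "a.jpg": [{"category": "pl50"}, {"category": "rare_sign"}],
    "b.jpg": [{"category": "rare_sign"}],
}
result = filter_dataset(sample, ["pl50"])
print(result)  # {'a.jpg': [{'category': 'pl50'}]}
```

Whether to drop images that end up with no annotations is a design choice: keeping some of them as background examples can help reduce false positives, so `drop_empty=False` may be worth trying.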

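3. Converting TT100K to COCO Format

COCO is one of the common annotation formats in computer vision and a convenient intermediate step on the way to YOLO. The conversion has to handle the following:

Box representation: TT100K stores each box as pixel corners (xmin, ymin, xmax, ymax), while COCO stores [x, y, width, height] with (x, y) the top-left corner; both use absolute pixel values
Category ID mapping: create new, consecutive IDs for the selected categories
Image size: TT100K images are all 2048×2048, but the size must still be stated explicitly in every image entry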

    import json
    import os

    # Core of the TT100K-to-COCO conversion
    def tt100k_to_coco(tt100k_dir, output_json, selected_categories):
        coco = {
            "images": [],
            "annotations": [],
            "categories": [{"id": i + 1, "name": cat}
                           for i, cat in enumerate(selected_categories)],
        }
        cat_to_id = {cat: i + 1 for i, cat in enumerate(selected_categories)}

        for img_file in sorted(os.listdir(os.path.join(tt100k_dir, 'train'))):
            if not img_file.endswith('.jpg'):
                continue
            img_id = len(coco["images"]) + 1
            coco["images"].append({
                "id": img_id,
                "file_name": img_file,
                "width": 2048,
                "height": 2048,
            })

            ann_file = os.path.join(tt100k_dir, 'annotations',
                                    os.path.splitext(img_file)[0] + '.json')
            with open(ann_file) as f:
                data = json.load(f)

            for obj in data['objects']:
                if obj['category'] not in cat_to_id:
                    continue  # skip categories that were filtered out
                # Pixel corners -> COCO [x, y, width, height], still in pixels
                x, y = obj['bbox']['xmin'], obj['bbox']['ymin']
                w, h = obj['bbox']['xmax'] - x, obj['bbox']['ymax'] - y
                coco["annotations"].append({
                    "id": len(coco["annotations"]) + 1,
                    "image_id": img_id,
                    "category_id": cat_to_id[obj['category']],
                    "bbox": [x, y, w, h],
                    "area": w * h,
                    "iscrowd": 0,
                })

        with open(output_json, 'w') as f:
            json.dump(coco, f)

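Note: the COCO step keeps everything in absolute pixels; only the box representation changes. Normalization to the 0-1 range belongs to the YOLO step. Getting either conversion wrong produces labels that YOLO cannot train on correctly.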
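To make the corner-to-COCO step concrete, here is a small sketch of the box conversion on its own. The helper name `corners_to_coco_bbox` is hypothetical, and the clamping to the image border is an added safeguard, on the assumption that an annotation could extend slightly past the edge; it is not something the conversion above does.

```python
def corners_to_coco_bbox(xmin, ymin, xmax, ymax, img_w=2048, img_h=2048):
    """Convert pixel corners to a COCO-style [x, y, width, height] box.

    The corners are clamped to the image first. Returns None when the
    clamped box has no area, so the caller can skip it.
    """
    x0 = min(max(xmin, 0.0), img_w)
    y0 = min(max(ymin, 0.0), img_h)
    x1 = min(max(xmax, 0.0), img_w)
    y1 = min(max(ymax, 0.0), img_h)
    w, h = x1 - x0, y1 - y0
    if w <= 0 or h <= 0:
        return None
    return [x0, y0, w, h]


# A box from (100, 200) to (150, 260) is 50 px wide and 60 px tall.
print(corners_to_coco_bbox(100, 200, 150, 260))
# A box that starts left of the image is cut at x = 0.
print(corners_to_coco_bbox(-10, 0, 20, 30))
```

All values stay in pixels here; nothing is divided by the image size yet.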

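4. The Final Conversion: COCO to YOLO

With the data in COCO format, the next step is the txt format that YOLO training needs. The YOLO format works as follows:

Each image has a txt file with the same base name
Each line describes one object: class_id center_x center_y width height
All coordinates are normalized to 0-1 relative to the image width and height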

    import json
    import os
    from collections import defaultdict

    def coco_to_yolo(coco_json, output_dir, img_dir):
        # img_dir is not used by the conversion itself
        with open(coco_json) as f:
            coco = json.load(f)
        os.makedirs(output_dir, exist_ok=True)

        # Map COCO category IDs to zero-based YOLO class IDs
        cat_id_map = {cat['id']: i for i, cat in enumerate(coco['categories'])}

        # Group annotations by image
        img_anns = defaultdict(list)
        for ann in coco['annotations']:
            img_anns[ann['image_id']].append(ann)

        # Write one label file per image
        for img in coco['images']:
            img_id = img['id']
            img_w, img_h = img['width'], img['height']
            txt_file = os.path.join(
                output_dir, os.path.splitext(img['file_name'])[0] + '.txt')

            with open(txt_file, 'w') as f:
                for ann in img_anns.get(img_id, []):
                    # Top-left pixel box -> normalized center box
                    x, y, w, h = ann['bbox']
                    center_x = (x + w / 2) / img_w
                    center_y = (y + h / 2) / img_h
                    norm_w = w / img_w
                    norm_h = h / img_h

                    # Write the YOLO line
                    class_id = cat_id_map[ann['category_id']]
                    f.write(f"{class_id} {center_x:.6f} {center_y:.6f} "
                            f"{norm_w:.6f} {norm_h:.6f}\n")
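A worked example of the normalization may help when checking a few labels by hand. The two helpers below are a sketch; their names are hypothetical and they only restate the arithmetic of the conversion above, plus its inverse.

```python
def coco_bbox_to_yolo(bbox, img_w, img_h):
    """COCO [x, y, w, h] in pixels -> YOLO (cx, cy, w, h) normalized to 0-1."""
    x, y, w, h = bbox
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)


def yolo_to_coco_bbox(yolo_box, img_w, img_h):
    """Inverse of coco_bbox_to_yolo: normalized center box -> pixel box."""
    cx, cy, nw, nh = yolo_box
    w, h = nw * img_w, nh * img_h
    return [cx * img_w - w / 2, cy * img_h - h / 2, w, h]


# A 64x32 px sign whose top-left corner is at (1024, 512) in a 2048x2048 image:
# center = (1024 + 32, 512 + 16) = (1056, 528), then divide by 2048.
yolo_box = coco_bbox_to_yolo([1024, 512, 64, 32], 2048, 2048)
print(yolo_box)  # (0.515625, 0.2578125, 0.03125, 0.015625)

# Converting back should reproduce the original pixel box.
print(yolo_to_coco_bbox(yolo_box, 2048, 2048))  # [1024.0, 512.0, 64.0, 32.0]
```

The example also shows how small the signs are relative to the image: the box covers about 3% of the width, which is worth keeping in mind when choosing the training image size.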

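5. Best Practices for Splitting and Organizing the Dataset

After the format conversion, the data has to be divided into training, validation, and test sets. Rather than reusing the original TT100K split, I suggest the following:

Merge all original data: combine the images from train/test/other
Stratify by category: keep the category proportions in each set close to the overall distribution
Typical ratios: 70% training, 15% validation, 15% test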

    import os
    from sklearn.model_selection import train_test_split

    def split_dataset(image_files, test_size=0.3, random_state=42):
        """
        Split into train/val/test. test_size is the share held out for
        validation and test together, so 0.3 gives 70% / 15% / 15%.
        Note: this is a plain random split; it does not stratify by category.
        """
        train, holdout = train_test_split(
            image_files, test_size=test_size, random_state=random_state)
        val, test = train_test_split(
            holdout, test_size=0.5, random_state=random_state)
        return train, val, test

    # Example usage (only train/ is listed here; add test/ and other/ when merging)
    all_images = [f for f in os.listdir('tt100k/train') if f.endswith('.jpg')]
    train_files, val_files, test_files = split_dataset(all_images)

Target directory structure:

    dataset_root/
    ├── images/
    │   ├── train/
    │   ├── val/
    │   └── test/
    └── labels/
        ├── train/
        ├── val/
        └── test/
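Stratifying a detection dataset is not as direct as stratifying a classification dataset, because one image can contain several categories. One simple approximation, sketched below, is to group each image by its rarest category and split every group with the same ratios. This is an assumption-laden sketch rather than the only way to do it: the function name `stratified_split` and the input shape (image name mapped to a list of category names) are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(image_classes, ratios=(0.7, 0.15, 0.15), seed=42):
    """Split image names into train/val/test, approximately per category.

    image_classes maps an image name to the list of category names it contains.
    Each image is assigned to the group of its rarest category, and every
    group is split with the same ratios.
    """
    # In how many images does each category occur?
    freq = Counter(c for classes in image_classes.values() for c in set(classes))

    groups = defaultdict(list)
    for name in sorted(image_classes):
        classes = image_classes[name]
        # Rarest category first; ties are broken by name for determinism.
        key = min(set(classes), key=lambda c: (freq[c], c)) if classes else None
        groups[key].append(name)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for key in sorted(groups, key=str):
        names = groups[key]
        rng.shuffle(names)
        n_train = round(len(names) * ratios[0])
        n_val = round(len(names) * ratios[1])
        train += names[:n_train]
        val += names[n_train:n_train + n_val]
        test += names[n_train + n_val:]  # whatever remains goes to test
    return train, val, test


# Made-up example: 20 images of category "a" and 20 of category "b".
demo = {f"a{i}.jpg": ["a"] for i in range(20)}
demo.update({f"b{i}.jpg": ["b"] for i in range(20)})
train, val, test = stratified_split(demo)
print(len(train), len(val), len(test))
```

Very small groups are the weak point of this approach: a category with only a few images may end up missing from the validation or test set, which is one more reason to filter rare categories first.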

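For a large dataset, copying files directly can be slow and wasteful. I recommend organizing the data with symbolic links instead. Linking a whole directory, as in the example below, only fits when the source directory already contains exactly the files of that split; after re-splitting, the files have to be linked one by one.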

    # Example: create symbolic links for the training set
    ln -s /path/to/original/images/train/ /path/to/dataset/images/train
    ln -s /path/to/converted/labels/train/ /path/to/dataset/labels/train
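When the split no longer matches the original directories, the links have to be created per file. A minimal sketch using only the standard library follows; `link_split` is a hypothetical helper name, and on Windows creating symlinks may need extra privileges.

```python
import os


def link_split(file_names, src_dir, dst_dir):
    """Create one symlink per file in dst_dir, pointing at src_dir/<name>.

    Existing entries in dst_dir are left untouched, so the function can be
    re-run safely. Returns the number of links created.
    """
    os.makedirs(dst_dir, exist_ok=True)
    created = 0
    for name in file_names:
        src = os.path.abspath(os.path.join(src_dir, name))
        dst = os.path.join(dst_dir, name)
        if os.path.lexists(dst):
            continue  # already linked (or a real file is in the way)
        os.symlink(src, dst)
        created += 1
    return created
```

A call such as `link_split(train_files, 'tt100k/train', 'dataset/images/train')` would then populate one split; the label files can be linked the same way. Absolute link targets are used so the links keep working regardless of the current working directory.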

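6. Verifying the Conversion

Always verify the result before you start training. The key checks are:

Annotation alignment: draw the boxes on randomly sampled images and check that they cover the signs
Consistent category distribution: make sure the train/val/test sets have similar category distributions
Format compliance: confirm that the YOLO label files are well formed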

    import os
    import random

    import cv2

    def visualize_annotation(image_path, label_path, class_names):
        """Draw the YOLO boxes on the image for a visual check."""
        image = cv2.imread(image_path)
        if image is None:
            raise FileNotFoundError(f"Cannot read image: {image_path}")
        h, w = image.shape[:2]

        with open(label_path) as f:
            for line in f:
                if not line.strip():
                    continue  # ignore blank lines
                class_id, cx, cy, nw, nh = map(float, line.split())
                # Convert back to pixel coordinates
                x1 = int((cx - nw / 2) * w)
                y1 = int((cy - nh / 2) * h)
                x2 = int((cx + nw / 2) * w)
                y2 = int((cy + nh / 2) * h)

                cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
                cv2.putText(image, class_names[int(class_id)], (x1, y1 - 10),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.9, (36, 255, 12), 2)

        cv2.imshow('Annotation Check', image)
        cv2.waitKey(0)
        cv2.destroyAllWindows()

    # Example usage
    # Replace with your own class list, in the same order as the YOLO class IDs
    class_names = ['prohibitory', 'mandatory', 'warning']
    sample_image = random.choice(os.listdir('dataset/images/train'))
    image_path = f'dataset/images/train/{sample_image}'
    label_path = f'dataset/labels/train/{os.path.splitext(sample_image)[0]}.txt'
    visualize_annotation(image_path, label_path, class_names)
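The other two checks in the list, format compliance and category distribution, can be automated. The sketch below validates label lines and counts class IDs so that the counts of different splits can be compared. The function names are hypothetical, and the rules encode the format as described in this article: five fields, an integer class ID, and coordinates within 0-1.

```python
from collections import Counter


def validate_label_line(line, num_classes):
    """Return a list of problems found in one YOLO label line (empty if valid)."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    problems = []
    if not parts[0].isdigit() or int(parts[0]) >= num_classes:
        problems.append(f"bad class id: {parts[0]}")
    try:
        cx, cy, w, h = (float(p) for p in parts[1:])
    except ValueError:
        return problems + ["non-numeric coordinate"]
    if not all(0.0 <= v <= 1.0 for v in (cx, cy, w, h)):
        problems.append("coordinate outside [0, 1]")
    if w <= 0 or h <= 0:
        problems.append("non-positive width or height")
    return problems


def validate_label_text(text, num_classes):
    """Validate the contents of one label file.

    Returns (problems, counts): problems is a list of (line_number, message)
    pairs, counts is a Counter of class IDs over the valid lines.
    """
    problems, counts = [], Counter()
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        found = validate_label_line(line, num_classes)
        if found:
            problems.extend((line_no, msg) for msg in found)
        else:
            counts[int(line.split()[0])] += 1
    return problems, counts


sample_text = "0 0.5 0.5 0.1 0.1\n1 0.2 0.2 0.1 0.1\n0 2 0.5 0.1 0.1\n"
print(validate_label_text(sample_text, num_classes=3))
```

Summing the returned counters over all files of a split gives a per-class histogram; comparing the normalized histograms of train, val, and test shows whether the split kept the distributions similar.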

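7. Tips for Processing Large Datasets Efficiently

With tens of thousands of high-resolution images, the conversion can take a long time. A few ways to speed it up:

Parallel processing: use Python's multiprocessing module to process images in parallel
Incremental processing: work in batches to avoid running out of memory
Caching intermediate results: save intermediate outputs such as the COCO file to make debugging easier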

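    import os
    from multiprocessing import Pool

    def process_image(args):
        """Wrap the per-image work so it can run in a worker process."""
        img_file, input_dir, output_dir = args
        # Put the actual conversion logic here
        pass

    # Parallel processing example
    def batch_process(image_files, input_dir, output_dir, workers=4):
        args = [(f, input_dir, output_dir) for f in image_files]
        with Pool(workers) as p:
            p.map(process_image, args)

    # The guard is needed on platforms where worker processes are spawned
    if __name__ == '__main__':
        all_images = [f for f in os.listdir('tt100k/train') if f.endswith('.jpg')]

        # Process a large dataset in batches
        batch_size = 1000
        for i in range(0, len(all_images), batch_size):
            batch = all_images[i:i + batch_size]
            batch_process(batch, 'tt100k/train', 'dataset/images/train')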

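In real projects, I have found it most convenient to wrap the whole conversion in a configurable Pipeline class. That makes it easy to adjust the parameters of each step, and saving the intermediate state helps when tracking down problems.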
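What such a Pipeline class might look like is sketched below. This is not the author's implementation: the names `PipelineConfig` and `ConversionPipeline`, the config fields, and the JSON-file caching scheme are all assumptions made for illustration. Each step is a function that takes the current state and the config and returns a new JSON-serializable state.

```python
import json
import os
from dataclasses import dataclass


@dataclass
class PipelineConfig:
    """Hypothetical settings shared by all steps."""
    min_instances: int = 100                 # category selection threshold
    split_ratios: tuple = (0.7, 0.15, 0.15)  # train / val / test
    cache_dir: str = "cache"                 # where intermediate states are saved


class ConversionPipeline:
    """Runs named steps in order and caches each step's output as JSON."""

    def __init__(self, config):
        self.config = config
        self.steps = []  # list of (name, function) pairs

    def add_step(self, name, func):
        self.steps.append((name, func))
        return self  # allow chaining

    def run(self, state):
        os.makedirs(self.config.cache_dir, exist_ok=True)
        for name, func in self.steps:
            cache_file = os.path.join(self.config.cache_dir, f"{name}.json")
            if os.path.exists(cache_file):
                # Reuse the saved result instead of recomputing the step.
                with open(cache_file) as f:
                    state = json.load(f)
                continue
            state = func(state, self.config)
            with open(cache_file, "w") as f:
                json.dump(state, f)
        return state
```

A step could be as small as `lambda state, cfg: {"categories": sorted(c for c, n in state["counts"].items() if n >= cfg.min_instances)}`. Deleting a step's cache file forces that step to run again; note that in this sketch the later caches would have to be deleted as well, since they are not invalidated automatically.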
