
Stop Staring at AUC Alone! A Hands-On Python Guide to Plotting ROC and PR Curves (with sklearn Code)

news 2026/5/30 23:53:43

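Beyond AUC: A Hands-On, In-Depth Look at ROC and PR Curves in Python

In model evaluation, AUC works like a lighthouse: it gives many data scientists a single number to steer by. But AUC only summarizes a curve, and it is the ROC and PR curves themselves that show how a model behaves at each threshold. This article plots both curves in Python and uses them to show, in practice, how they differ and when each one applies.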

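1. Environment Setup and Data Loading

First, make sure the following libraries are installed in your Python environment (the leading ! is notebook syntax; drop it in a regular shell):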

!pip install scikit-learn matplotlib numpy pandas

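We use sklearn's built-in breast cancer dataset as the example. It is a binary classification dataset with 569 samples (label 0 = malignant, label 1 = benign), which makes it convenient for demonstrating ROC and PR curves: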

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

Tip: in real projects, pass stratify=y so that the training and test sets keep the same class distribution.
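
To make the tip concrete, here is a minimal sketch of the same split with stratify=y added, followed by a check of the positive-class share in each subset. The `_strat` variable names and the printed summary are illustrative choices, not part of the original walkthrough.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Same split parameters as in the article, plus stratification on the labels
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Share of label 1 in the full data and in each subset
overall_rate = y.mean()
train_rate = y_train_strat.mean()
test_rate = y_test_strat.mean()
print(f"overall={overall_rate:.3f}, train={train_rate:.3f}, test={test_rate:.3f}")
```

With stratification the three rates should differ only by rounding, whereas an unstratified split can drift by a few percentage points on a dataset of this size.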

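2. Model Training and Probability Prediction

We use logistic regression as the baseline model because it outputs class probabilities directly, and both curves are built from continuous scores rather than hard 0/1 predictions: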

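from sklearn.linear_model import LogisticRegression

# max_iter is raised so the solver can converge on the unscaled features
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Predicted probabilities on the test set
y_scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class (label 1)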
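
The `[:, 1]` index relies on the column order of predict_proba, which follows model.classes_. The sketch below checks that order explicitly and also shows that a ranking metric such as ROC AUC only needs scores, not calibrated probabilities: the raw decision_function output ranks the samples the same way. The names `positive_col` and `raw_scores` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)

# Columns of predict_proba are ordered like model.classes_
proba = model.predict_proba(X_test)
positive_col = list(model.classes_).index(1)
y_scores = proba[:, positive_col]

# decision_function returns the raw linear score, a monotone transform
# of the probability, so ranking-based metrics should agree
raw_scores = model.decision_function(X_test)
auc_from_proba = roc_auc_score(y_test, y_scores)
auc_from_raw = roc_auc_score(y_test, raw_scores)
print(f"classes_={model.classes_}, positive column={positive_col}")
print(f"AUC from probabilities={auc_from_proba:.4f}, from raw scores={auc_from_raw:.4f}")
```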

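Understanding how the predicted probabilities are distributed helps with the later analysis. The table below illustrates the kind of summary to look at; the exact counts depend on the split and the fitted model: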

Probability range	Samples	Share
0.0-0.2	15	8.8%
0.2-0.4	23	13.5%
0.4-0.6	32	18.8%
0.6-0.8	45	26.5%
0.8-1.0	55	32.4%
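
A summary like this can be computed rather than read off by hand. The following is a minimal sketch that bins the test-set probabilities into the same five ranges with np.histogram; the numbers it prints come from your own run and need not match the table above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
y_scores = model.predict_proba(X_test)[:, 1]

# Five equal-width bins: 0.0-0.2, 0.2-0.4, ..., 0.8-1.0
bin_edges = np.linspace(0.0, 1.0, 6)
counts, _ = np.histogram(y_scores, bins=bin_edges)
shares = counts / counts.sum()

for low, high, count, share in zip(bin_edges[:-1], bin_edges[1:], counts, shares):
    print(f"{low:.1f}-{high:.1f}\t{count}\t{share:.1%}")
```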

3. Plotting and Interpreting the ROC Curve

The ROC curve is generated with the following code:

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, y_scores)
roc_auc = roc_auc_score(y_test, y_scores)

plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2,
         label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')  # random-guess baseline
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve analysis')
plt.legend(loc="lower right")
plt.show()

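Key points:

Point (0, 1): a perfect classifier, with every positive identified and no false alarms
Diagonal: the baseline performance of random guessing
Bulge of the curve: the closer it gets to the top-left corner, the better the model separates the classes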
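
Each point on the ROC curve is just the (FPR, TPR) pair obtained at one threshold. The sketch below computes one such point by hand on a small made-up dataset and checks that roc_curve contains it. The toy labels and scores, and the helper `roc_point`, are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy data (made up): 4 positives and 4 negatives
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.6])

def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when samples with score >= threshold are called positive."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    tn = np.sum(~y_pred & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)

fpr_manual, tpr_manual = roc_point(y_true, scores, threshold=0.5)
print(f"FPR={fpr_manual:.2f}, TPR={tpr_manual:.2f}")  # FPR=0.25, TPR=0.75

# drop_intermediate=False keeps every threshold, so the manual point must appear
fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate=False)
on_curve = bool(np.any(np.isclose(fpr, fpr_manual) & np.isclose(tpr, tpr_manual)))
```

At threshold 0.5, three of the four positives and one of the four negatives score above the cut, which gives TPR 0.75 and FPR 0.25.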

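Some practical ways to read a ROC curve:

Early recognition: a steep left side means the highest-scoring samples are almost all true positives
Robustness: a curve without long flat stretches or abrupt jumps suggests that performance changes gradually with the threshold (on small test sets, a stepped curve mainly reflects the sample size)
AUC value: as a rough rule of thumb, above 0.9 is excellent, 0.8-0.9 good, 0.7-0.8 fair; what counts as acceptable depends on the task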
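
One way to make AUC values easier to interpret: AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below verifies this on a small made-up dataset by comparing every positive-negative pair; the toy data is invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data (made up): 4 positives and 4 negatives
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.6])

pos_scores = scores[y_true == 1]
neg_scores = scores[y_true == 0]

# Score difference for every (positive, negative) pair: a 4 x 4 matrix
diff = pos_scores[:, None] - neg_scores[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

sklearn_auc = roc_auc_score(y_true, scores)
print(f"pairwise={pairwise_auc:.4f}, sklearn={sklearn_auc:.4f}")  # both 0.8125
```

Here 13 of the 16 pairs are ranked correctly, so the AUC is 13/16 = 0.8125.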

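4. Plotting the PR Curve and Analyzing Use Cases

The PR curve is particularly well suited to class-imbalanced problems. The plotting code is: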

from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, thresholds = precision_recall_curve(y_test, y_scores)
avg_precision = average_precision_score(y_test, y_scores)

plt.figure(figsize=(10, 6))
plt.plot(recall, precision, color='blue', lw=2,
         label=f'PR curve (AP = {avg_precision:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('PR curve analysis')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend(loc="lower left")
plt.show()

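Key differences between the PR curve and the ROC curve:

Feature	ROC curve	PR curve
X-axis	FPR	Recall
Y-axis	TPR	Precision
Baseline	Diagonal (y = x)	Positive-class rate (horizontal line)
Typical use	Balanced classes	Imbalanced classes
Focus	Overall ranking ability	Quality of positive predictions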

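Note: when negatives far outnumber positives, the PR curve reflects real-world performance better than the ROC curve. FPR is divided by the large number of negatives, so it stays small even when false positives outnumber true positives, whereas precision drops immediately.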
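
The different baselines can be checked empirically. The sketch below scores a synthetic, imbalanced label vector with pure random noise: ROC AUC lands near 0.5 regardless of the class ratio, while average precision lands near the positive-class rate. The sample size, the 5% prevalence, and the seed are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_samples, prevalence = 20000, 0.05  # assumed values for the illustration

# Labels with roughly 5% positives, scored by an uninformative random model
y_true = (rng.random(n_samples) < prevalence).astype(int)
random_scores = rng.random(n_samples)

positive_rate = y_true.mean()
auc_random = roc_auc_score(y_true, random_scores)
ap_random = average_precision_score(y_true, random_scores)

print(f"positive rate={positive_rate:.3f}")
print(f"random-model ROC AUC={auc_random:.3f}, AP={ap_random:.3f}")
```

This is why an AP of 0.30 can be a strong result on a dataset with 1% positives, while an AUC of 0.60 is weak on any dataset.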

5. Practical Threshold Selection

By analyzing the curves, we can pick the classification threshold that best suits the task:

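# Find the best threshold by maximizing the F1 score
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score, precision_recall_curve

# Candidate thresholds: the ones at which the PR curve is evaluated
_, _, pr_thresholds = precision_recall_curve(y_test, y_scores)
f1_scores = [f1_score(y_test, y_scores >= t) for t in pr_thresholds]
best_threshold = pr_thresholds[np.argmax(f1_scores)]
print(f"Best F1 threshold: {best_threshold:.3f}")

# Visualize the threshold choice
plt.figure(figsize=(10, 6))
# f1_scores has one entry per threshold, so the two arrays already align
plt.plot(pr_thresholds, f1_scores, label='F1 score')
plt.axvline(x=best_threshold, color='r', linestyle='--',
            label=f'Best threshold = {best_threshold:.3f}')
plt.xlabel('Threshold')
plt.ylabel('F1 score')
plt.title('Threshold selection')
plt.legend()
plt.show()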

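In real projects, threshold selection should take into account:

Business requirements: whether false positives and false negatives cost the same
Resource limits: manual review capacity determines how many positive predictions are acceptable
Stability: avoid thresholds in sensitive regions, where the curve is steep and a small shift changes the metric sharply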
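
When false positives and false negatives do not cost the same, one option is to pick the threshold that minimizes total cost instead of maximizing F1. The following is a minimal sketch under stated assumptions: `COST_FP` and `COST_FN` are hypothetical per-error costs, and the candidate grid of thresholds is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
y_scores = model.predict_proba(X_test)[:, 1]

COST_FP = 1.0  # hypothetical cost of one false positive
COST_FN = 5.0  # hypothetical cost of one false negative

candidate_thresholds = np.linspace(0.05, 0.95, 19)
total_costs = []
for t in candidate_thresholds:
    y_pred = (y_scores >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    total_costs.append(COST_FP * fp + COST_FN * fn)

best_idx = int(np.argmin(total_costs))
cost_threshold = candidate_thresholds[best_idx]
print(f"Lowest-cost threshold: {cost_threshold:.2f} (total cost {total_costs[best_idx]:.1f})")
```

Raising COST_FN relative to COST_FP tends to push the chosen threshold down, because missing a positive becomes more expensive than raising a false alarm. As with the F1 approach, a threshold tuned on the test set is optimistic; a separate validation split is the safer place to tune it.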

6. Visual Comparison of Multiple Models

Comparing the curves of different models shows performance differences at a glance:

from sklearn.ensemble import RandomForestClassifier

# Train a random forest model (random_state fixed for reproducibility)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_scores = rf_model.predict_proba(X_test)[:, 1]

# Compute the curves for both models
lr_fpr, lr_tpr, _ = roc_curve(y_test, y_scores)
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_scores)
lr_precision, lr_recall, _ = precision_recall_curve(y_test, y_scores)
rf_precision, rf_recall, _ = precision_recall_curve(y_test, rf_scores)

# Draw the comparison plots
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
plt.plot(lr_fpr, lr_tpr,
         label=f'Logistic regression (AUC={roc_auc_score(y_test, y_scores):.2f})')
plt.plot(rf_fpr, rf_tpr,
         label=f'Random forest (AUC={roc_auc_score(y_test, rf_scores):.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve comparison')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(lr_recall, lr_precision,
         label=f'Logistic regression (AP={average_precision_score(y_test, y_scores):.2f})')
plt.plot(rf_recall, rf_precision,
         label=f'Random forest (AP={average_precision_score(y_test, rf_scores):.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('PR curve comparison')
plt.legend()

plt.tight_layout()
plt.show()

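Key things to look for when comparing models:

Curve envelope: a model whose curve lies entirely inside another's is worse at every operating point
AUC/AP values: objective metrics for quantitative comparison
Performance in specific regions: focus on the range the business cares about (for example, the high-recall region)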
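
Region-specific performance can also be put into numbers. The sketch below reports, for each model, the best precision achievable at a recall of at least `TARGET_RECALL`, and the standardized partial AUC over FPR in [0, `MAX_FPR`] via the max_fpr argument of roc_auc_score. Both constants are hypothetical business targets, and the helper `region_summary` is introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

TARGET_RECALL = 0.95  # hypothetical requirement: miss at most 5% of positives
MAX_FPR = 0.10        # hypothetical requirement: only low-FPR operation matters

def region_summary(y_true, scores):
    """Best precision at recall >= TARGET_RECALL, and partial AUC up to MAX_FPR."""
    precision, recall, _ = precision_recall_curve(y_true, scores)
    best_precision = precision[recall >= TARGET_RECALL].max()
    partial_auc = roc_auc_score(y_true, scores, max_fpr=MAX_FPR)
    return best_precision, partial_auc

models = {
    "logistic regression": LogisticRegression(max_iter=10000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
summaries = {}
for name, clf in models.items():
    scores = clf.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    summaries[name] = region_summary(y_test, scores)
    print(f"{name}: precision@recall>={TARGET_RECALL} = {summaries[name][0]:.3f}, "
          f"partial AUC (FPR<={MAX_FPR}) = {summaries[name][1]:.3f}")
```

Two models with nearly identical full AUC can differ noticeably on these restricted measures, which is the situation where the curves tell you more than the headline number.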

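7. Advanced Usage and Pitfalls

When applying these curves in real projects, watch out for a few common pitfalls:

Strategies for class imbalance:

Oversample or undersample to adjust the class distribution
Use a class weight parameter (for example, class_weight='balanced')
Rely on the PR curve in preference to the ROC curve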

# Logistic regression with class weights
balanced_model = LogisticRegression(class_weight='balanced', max_iter=10000)
balanced_model.fit(X_train, y_train)
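
The breast cancer dataset is only mildly imbalanced, so the effect of class weights is easier to see on synthetic data. The sketch below is an illustration under assumptions: the dataset comes from make_classification with roughly 5% positives, and both models are evaluated at the default 0.5 threshold. Balanced weights typically raise recall at the expense of precision, but the exact numbers depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives (an assumption for illustration)
X_imb, y_imb = make_classification(
    n_samples=5000, n_features=20, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.3, random_state=42, stratify=y_imb
)

results = {}
for name, weight in [("unweighted", None), ("balanced", "balanced")]:
    clf = LogisticRegression(class_weight=weight, max_iter=10000).fit(X_tr, y_tr)
    y_pred = clf.predict(X_te)  # uses the default 0.5 threshold
    results[name] = {
        "precision": precision_score(y_te, y_pred, zero_division=0),
        "recall": recall_score(y_te, y_pred),
    }
    print(name, results[name])
```

Class weights mainly shift the predicted probabilities, so they act much like moving the decision threshold; the ranking of samples, and therefore the ROC and PR curves, often changes far less than the precision and recall at 0.5.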

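Plotting curves with cross-validation:

A more robust approach is to build the curve from cross-validated predictions. The code below pools the out-of-fold predictions from all folds into a single curve: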

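from sklearn.model_selection import cross_val_predict

# Out-of-fold probabilities: each sample is scored by a model that never saw it
cv_scores = cross_val_predict(
    LogisticRegression(max_iter=10000), X, y, cv=5, method='predict_proba'
)[:, 1]

# Plot the ROC curve built from the pooled cross-validated predictions
fpr, tpr, _ = roc_curve(y, cv_scores)
plt.plot(fpr, tpr, label=f'Cross-validated ROC (AUC={roc_auc_score(y, cv_scores):.2f})')
plt.legend()
plt.show()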
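
An alternative to pooling predictions is to compute one ROC curve per fold and average them, which also gives a spread for the AUC. The sketch below interpolates each fold's TPR onto a shared FPR grid with np.interp and averages the results. The grid size, the shuffled StratifiedKFold, and its seed are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
mean_fpr = np.linspace(0.0, 1.0, 101)  # shared FPR grid for all folds

fold_tprs, fold_aucs = [], []
for train_idx, test_idx in cv.split(X, y):
    fold_model = LogisticRegression(max_iter=10000).fit(X[train_idx], y[train_idx])
    fold_scores = fold_model.predict_proba(X[test_idx])[:, 1]
    fpr, tpr, _ = roc_curve(y[test_idx], fold_scores)
    interp_tpr = np.interp(mean_fpr, fpr, tpr)  # TPR of this fold on the shared grid
    interp_tpr[0] = 0.0                          # every ROC curve starts at (0, 0)
    fold_tprs.append(interp_tpr)
    fold_aucs.append(roc_auc_score(y[test_idx], fold_scores))

mean_tpr = np.mean(fold_tprs, axis=0)
mean_tpr[-1] = 1.0                               # and ends at (1, 1)
print(f"AUC across folds: mean={np.mean(fold_aucs):.3f}, std={np.std(fold_aucs):.3f}")
```

Plotting mean_fpr against mean_tpr gives the averaged curve, and the standard deviation of fold_tprs can be drawn as a band around it to show fold-to-fold variation.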

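The importance of probability calibration:

Some models (such as SVM and random forest) output probabilities that need calibration. A calibration curve shows how far the predicted probabilities are from the observed frequencies: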

from sklearn.calibration import calibration_curve

prob_true, prob_pred = calibration_curve(y_test, y_scores, n_bins=10)
plt.plot(prob_pred, prob_true, marker='o', label='Uncalibrated')
plt.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
plt.legend()
plt.show()
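
The calibration curve only diagnoses the problem. To actually recalibrate a model, sklearn provides CalibratedClassifierCV. The following is a minimal sketch that wraps a random forest and compares Brier scores before and after; the choice of method="sigmoid" and cv=5 is an assumption for illustration, and calibration is not guaranteed to lower the Brier score on every dataset.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Uncalibrated random forest
raw_rf = RandomForestClassifier(n_estimators=100, random_state=42)
raw_rf.fit(X_train, y_train)
raw_probs = raw_rf.predict_proba(X_test)[:, 1]

# Same model wrapped in a calibrator fitted with internal cross-validation
calibrated_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    method="sigmoid",
    cv=5,
)
calibrated_rf.fit(X_train, y_train)
calibrated_probs = calibrated_rf.predict_proba(X_test)[:, 1]

# Brier score: mean squared error of the probabilities (lower is better)
brier_raw = brier_score_loss(y_test, raw_probs)
brier_calibrated = brier_score_loss(y_test, calibrated_probs)
print(f"Brier score raw={brier_raw:.4f}, calibrated={brier_calibrated:.4f}")
```

Passing calibrated_probs to calibration_curve and plotting the result next to the uncalibrated line shows whether the curve moved closer to the diagonal.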

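In medical diagnosis projects, we have found the ROC curve more valuable for early screening and the PR curve more critical at the confirmation stage. A practical approach is to combine the two: use the ROC curve to establish the model's overall discriminative ability, then use the PR curve to tune the specific threshold.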


http://www.cnnetsun.cn/news/2641141.html