Python Basics Day 37: A First Look at Ensemble Learning (Random Forest & Gradient Boosting Trees)

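Welcome to Day 37 of learning Python! Today we move on to a very practical intermediate topic in machine learning: ensemble learning. In this section you will learn two common ensemble methods, Random Forest and the Gradient Boosting Decision Tree (GBDT), and see how to implement both with Scikit-learn.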

1. What Is Ensemble Learning?

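Ensemble learning is a way of combining several weak models (usually decision trees) into one strong model. The core idea is:

A single model may not perform well enough on its own
Several models that vote or average their outputs can produce a more accurate and more stable result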

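Ensemble methods fall into two main families:

Type	Representative algorithms	Idea
Bagging	Random Forest	Train sub-models in parallel and average (or vote on) their results
Boosting	GBDT, XGBoost	Train sub-models sequentially, each one correcting the errors of the previous ones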
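
To make the voting idea concrete, here is a minimal sketch that computes how often a majority vote is correct. It rests on two simplifying assumptions that are not part of the article: the task is binary classification, and the models make their mistakes independently of each other. Real trees trained on the same data are correlated, so the gain in practice is smaller. The function name majority_vote_accuracy is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n_models independent binary
    classifiers, each correct with probability p, gives the right answer.
    Use an odd n_models so that ties cannot happen."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Each individual model is only right 60% of the time (hypothetical figure).
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions, five models that are each right 60% of the time are right about 68% of the time as a group, and the figure keeps rising as more models are added.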

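2. Random Forest

2.1 How It Works

A random forest is a model made up of many decision trees. During training, each tree is fitted on a random sample of the training data (a bootstrap sample, drawn with replacement), and each split considers only a random subset of the features. At prediction time, the outputs of the trees are combined by majority vote (classification) or by averaging (regression).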

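Its advantages include:

Copes well with high-dimensional features; missing values can also be handled, although this depends on the implementation (older scikit-learn releases require imputing them first)
Resists overfitting well, because averaging many trees reduces variance
Can compute feature importances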
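
The mechanism described above (bootstrap sampling, random feature subsets, majority vote) can be sketched by hand with plain decision trees. This is only an illustration, not how scikit-learn implements RandomForestClassifier: the library averages class probabilities instead of counting hard votes, so its results may differ slightly. The tree count of 25 and the seeds are arbitrary choices for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(0)
n_trees = 25  # arbitrary, chosen only for this sketch
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw as many rows as the training set has, with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes each split look at a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Collect one row of predictions per tree: shape (n_trees, n_test_samples)
all_votes = np.stack([t.predict(X_test) for t in trees])

# Majority vote: for each test sample, pick the class predicted most often
n_classes = len(np.unique(y))
y_pred = np.array(
    [np.bincount(votes, minlength=n_classes).argmax() for votes in all_votes.T]
)

accuracy = np.mean(y_pred == y_test)
print("Hand-rolled forest accuracy:", accuracy)
```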

2.2 Classification Example (Iris Dataset)

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the data and hold out 20% of it for testing
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Build a random forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Accuracy on the test set
print("Accuracy:", accuracy_score(y_test, y_pred))

2.3 Feature Importance

import matplotlib.pyplot as plt

feature_importances = clf.feature_importances_
plt.bar(range(len(feature_importances)), feature_importances)
plt.xticks(range(len(feature_importances)), iris.feature_names, rotation=45)
plt.title("Feature Importance")
plt.tight_layout()
plt.show()
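
If a chart is not convenient, the same scores can be printed as a ranked list. The sketch below is self-contained, so it refits a forest on the full iris data rather than on the training split used earlier; that is a simplification made for brevity. The scores in feature_importances_ are normalized, so they sum to 1.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
# Simplification for this sketch: fit on all rows instead of a train split
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(iris.data, iris.target)

# Pair each feature name with its importance and sort from high to low
ranked = sorted(
    zip(iris.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name:<20} {score:.3f}")

total = sum(score for _, score in ranked)
print("Sum of importances:", round(total, 6))
```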

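3. Gradient Boosting Trees (GBDT)

3.1 How It Works

Gradient boosting uses a sequence of weak models (usually decision trees), each one correcting the prediction errors of the models before it, to build a strong predictor. At every step a new model is trained to fit the residuals (prediction errors) left by the previous round.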

Common implementations and variants include:

GradientBoostingClassifier (Scikit-learn)
XGBoost, LightGBM (more efficient)
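
The residual-fitting loop described in this section can be sketched by hand for a regression task. With squared-error loss the residual is exactly the negative gradient, which is where the name comes from; classifiers such as GradientBoostingClassifier apply the same idea to a different loss. The synthetic sine data, the 50 rounds, and the variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for this sketch): a noisy sine curve
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Round 0: start from a constant prediction (the mean of the targets)
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]
trees = []

for _ in range(n_rounds):
    residual = y - prediction            # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residual)                # the new tree learns the residuals
    prediction += learning_rate * tree.predict(X)  # take a small corrective step
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

The learning rate shrinks each tree's contribution. Smaller values usually need more rounds but tend to generalize better, which is why the two are tuned together.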

3.2 Hands-On Example (Breast Cancer Dataset)

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Use the breast cancer dataset and hold out 20% of it for testing
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Build the GBDT model (random_state fixed so repeated runs give the same result)
gbdt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbdt.fit(X_train, y_train)

# Predict on the test set
y_pred = gbdt.predict(X_test)

# Per-class precision, recall and F1 report
print(classification_report(y_test, y_pred))

4. Random Forest vs GBDT: Which One Should You Choose?

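Property	Random Forest	GBDT
Training	Parallel	Sequential
Performance	Stable, less prone to overfitting	Often more accurate, but more prone to overfitting
Interpretability	Supports feature importance analysis	Also supports feature importance analysis
Robustness to outliers	Strong	Weak
Tuning complexity	Lower	Higher (learning rate, number of boosting rounds, etc. need tuning)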
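
The table gives general tendencies. On a specific dataset, the simplest way to decide is to evaluate both models under the same conditions. The sketch below does that with 5-fold cross-validation on the breast cancer dataset. The dataset, the fold count, and the default-like hyperparameters are choices made for this illustration, and the ranking may well differ on other data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "GBDT": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
    ),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validation: five train/test splits, one accuracy score each
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean accuracy = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Comparing the mean together with the standard deviation shows whether a gap between the two models is larger than the fold-to-fold noise.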