[Machine Learning] Task 5: Classification on the Wine and Iris Datasets

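1. Experimental Background

1.1 Ensemble Learning

Ensemble learning improves predictive performance by combining multiple base learners. There are two main families of methods:

Bagging: resamples the training data with replacement to build several independent models, then averages their predictions (regression) or takes a majority vote (classification). Random forest is the typical example.
Boosting: trains models sequentially, with each model correcting the errors of the ones before it. Common algorithms include GBDT, XGBoost, and LightGBM.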

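(1) Random Forest

Random forest is a concrete implementation of Bagging: it builds many decision trees, each on a random sample of the rows and a random subset of the features, and predicts by voting (classification) or averaging (regression). Compared with a single decision tree, it reduces overfitting and gives more stable, more accurate predictions.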
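The two sources of randomness described above (bootstrap rows, random feature subsets) followed by a majority vote can be sketched by hand. This is only an illustration of the idea, not how scikit-learn's RandomForestClassifier is implemented internally; the tree count of 25 and the seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_trees = 25  # illustrative choice, not taken from the article
all_votes = []

for _ in range(n_trees):
    # Bootstrap: draw as many row indices as there are training rows, with replacement.
    rows = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes every split consider a random subset of the features.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(0, 1_000_000))
    )
    tree.fit(X_train[rows], y_train[rows])
    all_votes.append(tree.predict(X_test))

votes = np.stack(all_votes)  # shape: (n_trees, n_test_samples)
# Majority vote: for each test sample, pick the label predicted by most trees.
ensemble_pred = np.array(
    [np.bincount(column, minlength=3).argmax() for column in votes.T]
)
print("bagged accuracy:", (ensemble_pred == y_test).mean())
```

Each individual tree sees a slightly different dataset, so their errors are only partly correlated, and the vote averages much of that noise away.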

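(2) Gradient Boosting Decision Trees (GBDT)

GBDT is a Boosting method that builds trees one at a time: each new tree is fitted to the residuals of the current ensemble (more generally, to the negative gradient of the loss), so that adding it reduces the remaining error. GBDT handles nonlinear relationships and feature interactions well.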
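The residual-fitting loop is short enough to write out. The sketch below assumes squared-error loss on a synthetic one-dimensional regression problem; the data, learning rate, tree depth, and number of rounds are all illustrative assumptions, and real GBDT libraries add many refinements on top of this.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)  # synthetic target

learning_rate = 0.1
n_rounds = 50
prediction = np.full_like(y, y.mean())     # round 0: a constant model
errors = [np.mean((y - prediction) ** 2)]  # training MSE after each round

for _ in range(n_rounds):
    # For squared-error loss the negative gradient is simply the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Add a shrunken copy of the new tree to the ensemble.
    prediction += learning_rate * tree.predict(X)
    errors.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {errors[0]:.4f} -> {errors[-1]:.4f}")
```

With a learning rate between 0 and 1 the training error never increases from one round to the next, which is why boosting needs a validation set or early stopping to decide when to stop adding trees.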

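(3) XGBoost

XGBoost is an efficient implementation of GBDT. It adds regularization terms to the objective to curb overfitting, parallelizes split finding to speed up training, and handles missing values natively.

(4) LightGBM

LightGBM is an efficient GBDT implementation developed by Microsoft. Its histogram-based tree-learning algorithm markedly improves training speed and memory efficiency, which makes it well suited to large datasets and high-dimensional features.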

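1.2 Parameter Optimization

Hyperparameter tuning is a key step in improving model performance. Common methods are:

(1) Grid Search

Grid search is exhaustive: it enumerates every combination in a predefined parameter grid, then trains and evaluates a model for each one. The cost grows multiplicatively with the number of values per parameter, but the method is simple and works well for small search spaces.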
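To make the cost concrete, the following sketch enumerates the XGBoost grid used in section 2.4 with plain itertools and counts the model fits that 3-fold cross-validation implies. It only counts combinations; it does not train anything.

```python
from itertools import product

# Same grid and fold count as the XGBoost search in section 2.4.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
}
cv_folds = 3

names = list(param_grid)
combinations = [
    dict(zip(names, values)) for values in product(*param_grid.values())
]
total_fits = len(combinations) * cv_folds

print(len(combinations), "combinations,", total_fits, "model fits")
# prints: 27 combinations, 81 model fits
```

GridSearchCV performs one further fit at the end when refit=True (its default), retraining the best combination on the whole training set. Adding a fourth parameter with three values would triple the total.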

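(2) Random Search

Random search trains on randomly sampled parameter combinations. For a fixed budget it explores high-dimensional spaces more efficiently than grid search and often finds good parameters in less time.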
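A minimal sketch of random search with scikit-learn's RandomizedSearchCV follows. It assumes a random forest on the Wine dataset and a budget of 5 sampled combinations; the budget and the seed are illustrative choices rather than values from this article.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3 * 3 * 2 = 18 possible combinations; only n_iter of them are tried.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
}
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # search budget: number of sampled combinations
    scoring="accuracy",
    cv=3,
    random_state=42,     # makes the sampled combinations reproducible
)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```

The budget is set by n_iter rather than by the grid size, so the search space can grow without the cost growing with it.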

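(3) Bayesian Optimization

Bayesian optimization builds a probabilistic model of the objective function and updates it after each evaluation to decide which parameters to try next. It needs fewer evaluations, so it suits models that are expensive to train.

(4) Genetic Algorithms

Genetic algorithms mimic natural selection: selection, crossover, and mutation operators carry out a global search of the parameter space, which suits complex optimization problems.

(5) Hyperparameter Optimization Tools

Tools such as Optuna, Hyperopt, and Ray Tune bundle several optimization algorithms, automate the parameter search, and provide visualization and analysis features.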

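1.3 Model Interpretation

Understanding how a model reaches its decisions is essential for transparency and interpretability. Common interpretation methods include:

(1) Feature Importance Analysis

Feature importance measures how much each feature influences the model's predictions, showing which features drive its decisions. Tree-based models such as random forest and XGBoost provide built-in feature importance scores.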
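As a small illustration, the sketch below reads the built-in importances of a random forest trained on the Wine dataset. The forest settings are assumptions made for the example. The scores are impurity-based and normalized, so they sum to 1.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(wine.data, wine.target)

# One score per feature; higher means the feature was more useful for splitting.
importances = forest.feature_importances_
order = np.argsort(importances)[::-1]  # feature indices, most important first

for i in order[:5]:
    print(f"{wine.feature_names[i]:<30} {importances[i]:.3f}")
print("sum of all importances:", importances.sum())
```

Impurity-based scores are computed on the training data and tend to favor features with many distinct values, so they are best read as a ranking rather than as exact effect sizes.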

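(2) SHAP

SHAP values come from cooperative game theory and explain a model by computing each feature's contribution to a prediction. They quantify the effect of every feature on a single prediction and can be aggregated into a global view of model behavior.

(3) Partial Dependence Plots (PDP)

A PDP visualizes how the prediction changes with one or more features, averaged over the remaining features, revealing the relationship between those features and the predicted outcome.

(4) Adversarial Example Analysis

Generating adversarial examples and observing how the predictions change makes it possible to assess a model's fragility and probe its decision boundary.

(5) LIME

LIME is a model-agnostic method that approximates a complex model around a single input with a local linear model, which makes individual predictions interpretable.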

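2. Wine Dataset Classification Task

2.1 Import the Required Libraries and Load the Dataset

(1) Goal:

Import the required libraries (pandas, numpy, matplotlib, seaborn, sklearn, xgboost, lightgbm, and so on).
Load the Wine dataset provided by scikit-learn.

(2) Code:

# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, f1_score
import xgboost as xgb
import lightgbm as lgb

# Load the Wine dataset
wine = load_wine()

# Extract the feature matrix (X) and the target variable (y)
X_wine, y_wine = wine.data, wine.target

(3) Explanation:

Import libraries: import the libraries needed for data handling, visualization, machine learning, and model evaluation.
- pandas handles data processing, numpy numerical computation, and matplotlib and seaborn plotting.
- xgboost and lightgbm are two widely used gradient boosting libraries, used here for classification.
Load the Wine dataset: scikit-learn's load_wine() function loads the dataset, which contains 178 samples, 13 features, and 3 wine classes.
Extract features and target: X_wine holds the feature data and y_wine holds the class labels.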

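2.2 Explore and Visualize the Data

(1) Goal:

Inspect the dataset's basic information and summary statistics, and confirm that it has no missing values or anomalies.
Visualize the data with seaborn to understand how the features relate to the target label.

(2) Code:

# Convert the data to a DataFrame
df_wine = pd.DataFrame(X_wine, columns=wine.feature_names)
df_wine['target'] = y_wine

# Inspect the dataset (info() prints its report directly and returns None)
df_wine.info()
print(df_wine.describe())

# Visualize how the Wine features relate to the target
sns.pairplot(df_wine, hue="target", markers=["o", "s", "D"])
plt.show()

(3) Explanation:

Convert to a DataFrame: turning the numpy array into a pandas DataFrame makes the data easier to inspect and manipulate.
Explore the dataset: info() and describe() report the basic facts about the data, including data types, missing values, and summary statistics.
- info() shows the structure of the dataset; every column should report 178 non-null values, which confirms that nothing is missing.
- describe() gives descriptive statistics for each feature, such as the mean, standard deviation, minimum, and maximum.
Visualize the data: seaborn's pairplot() draws a scatter-plot matrix for observing how the features relate to the target variable (the wine class).
- hue="target" colors the points by wine class.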

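2.3 Split the Dataset into Training and Test Sets

(1) Goal:

Split the Wine dataset into a training set and a test set so that the model can be trained and then validated.
Use an 80:20 split: 80% of the data for training and 20% for testing.

(2) Code:

# Split the Wine dataset into training and test sets
X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(X_wine, y_wine, test_size=0.2, random_state=42)

(3) Explanation:

train_test_split(): splits the dataset into a training set and a test set. With test_size=0.2, 80% of the data (142 samples) is used to train the model and 20% (36 samples) to test it.
random_state=42: fixes the random seed so the split is identical on every run, which makes the results reproducible.
X_train_wine and y_train_wine are the training features and labels; X_test_wine and y_test_wine are the test features and labels.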
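With only 178 samples, a purely random split can leave the three classes unevenly represented in the 36-sample test set. One common refinement, not used in the article's code, is a stratified split. The sketch below assumes the same test_size and seed and only adds stratify=y.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# stratify=y keeps the class proportions (roughly) equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_tr.shape, X_te.shape)
print("class share overall:", np.round(np.bincount(y) / len(y), 3))
print("class share in test:", np.round(np.bincount(y_te) / len(y_te), 3))
```

The set sizes stay at 142 and 36; only the way rows are assigned to each part changes.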

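2.4 Tune the XGBoost Model's Parameters

(1) Goal:

Use grid search (GridSearchCV) to tune the XGBoost classifier's parameters and find the combination that gives the best performance.

(2) Code:

# Define the XGBoost model and the parameter grid
xgb_model_wine = xgb.XGBClassifier(random_state=42)
xgb_param_grid_wine = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
}

# Run the grid search with cross-validation
xgb_grid_search_wine = GridSearchCV(estimator=xgb_model_wine, param_grid=xgb_param_grid_wine, scoring='accuracy', cv=3)
xgb_grid_search_wine.fit(X_train_wine, y_train_wine)

# Print the best parameters
print("XGBoost best parameters (Wine dataset):", xgb_grid_search_wine.best_params_)

(3) Explanation:

Define the model: xgb.XGBClassifier is the XGBoost classifier, used here for the Wine classification task.
Parameter tuning:
- Grid search (GridSearchCV) tunes n_estimators (number of trees), max_depth (maximum tree depth), and learning_rate (learning rate).
- cv=3 means 3-fold cross-validation, which scores each parameter combination so that the best one can be selected.
Print the best parameters: best_params_ holds the best parameter combination found for the XGBoost model.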

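2.5 Predict on the Test Set and Evaluate the Model

(1) Goal:

Use the best XGBoost model found by the grid search to predict the test set, and evaluate its classification performance.

(2) Code:

# Predict the test set with the best XGBoost model
xgb_y_pred_wine = xgb_grid_search_wine.best_estimator_.predict(X_test_wine)

# Evaluate the model (accuracy and F1 score)
xgb_accuracy_wine = accuracy_score(y_test_wine, xgb_y_pred_wine)
xgb_f1_wine = f1_score(y_test_wine, xgb_y_pred_wine, average='weighted')

# Print the results
print(f"XGBoost classifier - Wine dataset: accuracy = {xgb_accuracy_wine}, F1 score = {xgb_f1_wine}")

(3) Explanation:

Prediction: best_estimator_ returns the best XGBoost model found by the grid search, already refit on the full training set, and predict() classifies the test set.
Evaluation:
- accuracy_score() computes the accuracy, the proportion of samples that are predicted correctly.
- f1_score() computes the F1 score, which balances precision and recall; average='weighted' averages the per-class F1 scores, weighting each class by its number of true samples.
Print the results: prints the XGBoost model's accuracy and F1 score on the Wine dataset.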
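To show what average='weighted' computes, the sketch below works the weighted F1 score out by hand on a tiny, made-up set of labels (six samples, three classes) and compares it with scikit-learn's result. The labels are invented purely for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels: one class-0 sample is misclassified as class 1.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])

per_class_f1, support = [], []
for label in np.unique(y_true):
    tp = np.sum((y_pred == label) & (y_true == label))  # true positives
    fp = np.sum((y_pred == label) & (y_true != label))  # false positives
    fn = np.sum((y_pred != label) & (y_true == label))  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    per_class_f1.append(2 * precision * recall / (precision + recall))
    support.append(np.sum(y_true == label))  # number of true samples in the class

# Weighted F1: per-class F1 averaged with the class supports as weights.
manual_weighted_f1 = np.average(per_class_f1, weights=support)
sklearn_weighted_f1 = f1_score(y_true, y_pred, average="weighted")
print(per_class_f1, manual_weighted_f1, sklearn_weighted_f1)
```

Here the per-class F1 scores are 0.8, 0.8, and 1.0 with supports 3, 2, and 1, so the weighted score is (0.8·3 + 0.8·2 + 1.0·1) / 6 = 5/6 ≈ 0.833.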

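2.6 Visualize Model Performance

(1) Goal:

Draw a bar chart of the XGBoost model's accuracy and F1 score on the Wine dataset to show its performance at a glance.

(2) Code:

# Draw a bar chart of the XGBoost model's performance
performance_metrics = ['Accuracy', 'F1 Score']
xgb_performance_wine = [xgb_accuracy_wine, xgb_f1_wine]

plt.figure(figsize=(8, 6))
plt.bar(performance_metrics, xgb_performance_wine, color='blue')
plt.title('XGBoost Performance on the Wine Dataset')
plt.ylabel('Score')
plt.ylim(0, 1)
plt.show()

(3) Explanation:

Plotting: matplotlib's bar() function draws a bar chart of the XGBoost model's accuracy and F1 score.
Visual summary: the bar chart shows the model's classification performance at a glance, which makes its strengths and weaknesses easier to judge.
- ylim(0, 1) sets the Y-axis range to 0 through 1.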

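3. Iris Dataset Classification Task

3.1 Import the Required Libraries and Load the Dataset

(1) Goal:

Import the required libraries (pandas, numpy, matplotlib, seaborn, sklearn, xgboost, lightgbm, and so on).
Load the Iris dataset provided by scikit-learn.

(2) Code:

# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, f1_score
import xgboost as xgb
import lightgbm as lgb

# Load the Iris dataset
iris = load_iris()

# Extract the feature matrix (X) and the target variable (y)
X_iris, y_iris = iris.data, iris.target

(3) Explanation:

Import libraries: import the libraries needed for data handling, visualization, machine learning, and model evaluation.
- pandas handles data processing, numpy numerical computation, and matplotlib and seaborn plotting.
- xgboost and lightgbm are two widely used gradient boosting libraries, used here for classification.
Load the Iris dataset: scikit-learn's load_iris() function loads the dataset, which contains 150 samples, 4 features, and 3 iris species.
Extract features and target: X_iris holds the feature data and y_iris holds the class labels.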

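3.2 Explore and Visualize the Data

(1) Goal:

Inspect the dataset's basic information and summary statistics, and confirm that it has no missing values or anomalies.
Visualize the data with seaborn to understand how the features relate to the target label.

(2) Code:

# Convert the data to a DataFrame
df_iris = pd.DataFrame(X_iris, columns=iris.feature_names)
df_iris['target'] = y_iris

# Inspect the dataset (info() prints its report directly and returns None)
df_iris.info()
print(df_iris.describe())

# Visualize how the Iris features relate to the target
sns.pairplot(df_iris, hue="target", markers=["o", "s", "D"])
plt.show()

(3) Explanation:

Convert to a DataFrame: turning the numpy array into a pandas DataFrame makes the data easier to inspect and manipulate.
Explore the dataset: info() and describe() report the basic facts about the data, including data types, missing values, and summary statistics.
- info() shows the structure of the dataset; every column should report 150 non-null values, which confirms that nothing is missing.
- describe() gives descriptive statistics for each feature, such as the mean, standard deviation, minimum, and maximum.
Visualize the data: seaborn's pairplot() draws a scatter-plot matrix for observing how the features relate to the target variable (the iris species).
- hue="target" colors the points by iris species.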

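3.3 Split the Dataset into Training and Test Sets

(1) Goal:

Split the Iris dataset into a training set and a test set so that the model can be trained and then validated.
Use an 80:20 split: 80% of the data for training and 20% for testing.

(2) Code:

# Split the Iris dataset into training and test sets
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

(3) Explanation:

train_test_split(): splits the dataset into a training set and a test set. With test_size=0.2, 80% of the data (120 samples) is used to train the model and 20% (30 samples) to test it.
random_state=42: fixes the random seed so the split is identical on every run, which makes the results reproducible.
X_train_iris and y_train_iris are the training features and labels; X_test_iris and y_test_iris are the test features and labels.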

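3.4 Tune the LightGBM Model's Parameters

(1) Goal:

Use grid search (GridSearchCV) to tune the LightGBM classifier's parameters and find the combination that gives the best performance.

(2) Code:

# Define the LightGBM model and the parameter grid
lgb_model_iris = lgb.LGBMClassifier(random_state=42)
lgb_param_grid_iris = {
    'n_estimators': [50, 100, 200],
    'num_leaves': [31, 50, 100],
    'learning_rate': [0.01, 0.1, 0.3],
}

# Run the grid search with cross-validation
lgb_grid_search_iris = GridSearchCV(estimator=lgb_model_iris, param_grid=lgb_param_grid_iris, scoring='accuracy', cv=3)
lgb_grid_search_iris.fit(X_train_iris, y_train_iris)

# Print the best parameters
print("LightGBM best parameters (Iris dataset):", lgb_grid_search_iris.best_params_)

(3) Explanation:

Define the model: lgb.LGBMClassifier is the LightGBM classifier, used here for the Iris classification task.
Parameter tuning:
- Grid search (GridSearchCV) tunes n_estimators (number of trees), num_leaves (maximum number of leaves per tree), and learning_rate (learning rate).
- cv=3 means 3-fold cross-validation, which scores each parameter combination so that the best one can be selected.
Print the best parameters: best_params_ holds the best parameter combination found for the LightGBM model.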
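One caveat about the num_leaves grid is worth checking with simple arithmetic. The sketch below assumes LightGBM's default min_child_samples=20, which the code above does not override. Under that assumption every leaf must hold at least 20 rows, so the Iris training set is too small for any tree to approach 31 leaves. The helper max_possible_leaves is hypothetical and written only for this illustration.

```python
def max_possible_leaves(n_samples: int, min_child_samples: int = 20) -> int:
    """Upper bound on leaves when every leaf needs at least min_child_samples rows."""
    return max(1, n_samples // min_child_samples)


n_train = 120                    # 80% of the 150 Iris samples
n_per_cv_fit = n_train * 2 // 3  # each 3-fold CV fit trains on 2/3 of that: 80 rows

for num_leaves in [31, 50, 100]:
    effective = min(num_leaves, max_possible_leaves(n_per_cv_fit))
    print(f"num_leaves={num_leaves:>3} -> at most {effective} leaves per tree")
```

If that assumption holds, the three num_leaves values in the grid are expected to produce the same trees, so the search would effectively vary only n_estimators and learning_rate. A grid with smaller values (for example 2 to 6), or a lower min_child_samples, would make the parameter matter on a dataset of this size.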

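3.5 Predict on the Test Set and Evaluate the Model

(1) Goal:

Use the best LightGBM model found by the grid search to predict the test set, and evaluate its classification performance.

(2) Code:

# Predict the test set with the best LightGBM model
lgb_y_pred_iris = lgb_grid_search_iris.best_estimator_.predict(X_test_iris)

# Evaluate the model (accuracy and F1 score)
lgb_accuracy_iris = accuracy_score(y_test_iris, lgb_y_pred_iris)
lgb_f1_iris = f1_score(y_test_iris, lgb_y_pred_iris, average='weighted')

# Print the results
print(f"LightGBM classifier - Iris dataset: accuracy = {lgb_accuracy_iris}, F1 score = {lgb_f1_iris}")

(3) Explanation:

Prediction: best_estimator_ returns the best LightGBM model found by the grid search, and predict() classifies the test set.
Evaluation:
- accuracy_score() computes the accuracy, the proportion of samples that are predicted correctly.
- f1_score() computes the F1 score, which balances precision and recall; average='weighted' weights each class by its number of true samples.
Print the results: prints the LightGBM model's accuracy and F1 score on the Iris dataset.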

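3.6 Visualize Model Performance

(1) Goal:

Draw a bar chart of the LightGBM model's accuracy and F1 score on the Iris dataset to show its performance at a glance.

(2) Code:

# Draw a bar chart of the LightGBM model's performance
performance_metrics = ['Accuracy', 'F1 Score']
lgb_performance_iris = [lgb_accuracy_iris, lgb_f1_iris]

plt.figure(figsize=(8, 6))
plt.bar(performance_metrics, lgb_performance_iris, color='green')
plt.title('LightGBM Performance on the Iris Dataset')
plt.ylabel('Score')
plt.ylim(0, 1)
plt.show()

(3) Explanation:

Plotting: matplotlib's bar() function draws a bar chart of the LightGBM model's accuracy and F1 score.
Visual summary: the bar chart shows the model's classification performance at a glance, which makes its strengths and weaknesses easier to judge.
- ylim(0, 1) sets the Y-axis range to 0 through 1.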

4. Complete Code and Results

4.1 Wine

(1) Complete code

# Step 1: import the required modules
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import f1_score
import numpy as np

# Step 2: load the dataset and separate features from labels
wine = load_wine()
X = wine.data
y = wine.target

# Step 3: split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the training and test sets
print("Training features shape:", X_train.shape)
print("Test features shape:", X_test.shape)
print("Training labels shape:", y_train.shape)
print("Test labels shape:", y_test.shape)

# Step 4: define a random forest classifier and tune its parameters
rf = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
}
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, scoring='accuracy', cv=3)
grid_search.fit(X_train, y_train)

# Print the best parameters
print("Best parameters:", grid_search.best_params_)

# Step 5: train the model with the best parameters
# (best_estimator_ is already refit on the training set; fitting again is harmless)
best_model = grid_search.best_estimator_
best_model.fit(X_train, y_train)

# Step 6: print the classifier's confusion matrix
y_pred = best_model.predict(X_test)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", conf_matrix)

# Draw the confusion matrix as a heatmap
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

# Step 7: get the feature importances, sort them, and plot them
feature_importances = best_model.feature_importances_
feature_names = wine.feature_names
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Draw the feature importance bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title('Feature Importance from Random Forest')
plt.show()

# Step 8: select features whose importance is above the mean
selector = SelectFromModel(best_model, threshold='mean', prefit=True)
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

# Step 9: train on the selected features and evaluate
rf_selected = RandomForestClassifier(random_state=42)
rf_selected.fit(X_train_selected, y_train)
y_pred_selected = rf_selected.predict(X_test_selected)

accuracy_selected = accuracy_score(y_test, y_pred_selected)
f1_selected = f1_score(y_test, y_pred_selected, average='weighted')

print("Accuracy after feature selection:", accuracy_selected)
print("F1 score after feature selection:", f1_selected)

# Step 10: plot the performance comparison
performance_metrics = ['Accuracy', 'F1 Score']
original_performance = [accuracy_score(y_test, y_pred), f1_score(y_test, y_pred, average='weighted')]
selected_performance = [accuracy_selected, f1_selected]

fig, ax = plt.subplots(figsize=(10, 6))
bar_width = 0.35
index = np.arange(len(performance_metrics))

bar1 = ax.bar(index, original_performance, bar_width, label='Original Features')
bar2 = ax.bar(index + bar_width, selected_performance, bar_width, label='Selected Features')

ax.set_xlabel('Performance Metrics')
ax.set_title('Classifier Performance Comparison Before and After Feature Selection')
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels(performance_metrics)
ax.legend()

# Annotate each bar with its value
for bar in list(bar1) + list(bar2):
    height = bar.get_height()
    ax.annotate('%.2f' % height,
                xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3), textcoords="offset points",
                ha='center', va='bottom')

plt.show()

(2) Output

Training features shape: (142, 13)
Test features shape: (36, 13)
Training labels shape: (142,)
Test labels shape: (36,)
Best parameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 100}
Confusion matrix:
 [[14  0  0]
 [ 0 14  0]
 [ 0  0  8]]

Accuracy after feature selection: 0.9722222222222222
F1 score after feature selection: 0.9717752234993614

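(3) Analysis of Results

The tuned random forest (no depth limit, 100 trees, min_samples_split=2) classifies all 36 test samples correctly when trained on all 13 features: the confusion matrix has no off-diagonal entries. The model retrained on the features kept by SelectFromModel reaches 97.22% accuracy (35 of 36 samples correct) and a weighted F1 score of 0.9718, so dropping the features of below-average importance costs one misclassification on this split. Both models generalize well to the held-out data and show no sign of overfitting, although a 36-sample test set gives only a coarse estimate of performance.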
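Because a single 36-sample test set gives a coarse estimate, one way to check the stability of these numbers is k-fold cross-validation over the whole dataset. The sketch below is an addition to the article's workflow. It assumes 5 stratified folds and reuses the best parameters reported above.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_wine(return_X_y=True)

# Best parameters reported by the grid search in this section.
model = RandomForestClassifier(
    n_estimators=100, max_depth=None, min_samples_split=2, random_state=42
)
# 5 stratified folds: every sample is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("fold accuracies:", scores.round(3))
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across folds indicates how much of the difference between 100% and 97.22% could be due to the particular split rather than to the feature selection itself.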

4.2 Iris

(1) Complete code

# Step 1: import the required libraries and the dataset loader
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, f1_score
import xgboost as xgb
import lightgbm as lgb
import matplotlib.font_manager as fm

# Configure a Chinese font (only needed when plot text contains Chinese characters)
plt.rcParams['font.sans-serif'] = ['SimHei']  # use the SimHei font
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
font_path = 'C:/Windows/Fonts/simhei.ttf'  # font path on Windows
my_font = fm.FontProperties(fname=font_path)

# Step 2: load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
features = iris.feature_names

# Convert the data to a DataFrame for easier inspection
df = pd.DataFrame(X, columns=features)
df['target'] = y

# Step 3: explore the data: basic information and feature relationships
# (info() prints its report and returns None, so print() also emits "None")
print(df.info())
print(df.describe())

# Visualize the relationships between features
sns.pairplot(df, hue="target", markers=["o", "s", "D"])
plt.show()

# Step 4: split the dataset into training and test sets (80:20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: tune the parameters of the chosen classifiers (XGBoost and LightGBM)
# 1. XGBoost parameter tuning
xgb_model = xgb.XGBClassifier(random_state=42)
xgb_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
}
xgb_grid_search = GridSearchCV(estimator=xgb_model, param_grid=xgb_param_grid, scoring='accuracy', cv=3)
xgb_grid_search.fit(X_train, y_train)
xgb_best_model = xgb_grid_search.best_estimator_
print(f"XGBoost best parameters: {xgb_grid_search.best_params_}")

# 2. LightGBM parameter tuning
# verbosity=-1 suppresses LightGBM's warnings
lgb_model = lgb.LGBMClassifier(random_state=42, verbosity=-1)
lgb_param_grid = {
    'n_estimators': [50, 100, 200],
    'num_leaves': [31, 50, 100],
    'learning_rate': [0.01, 0.1, 0.3],
}
lgb_grid_search = GridSearchCV(estimator=lgb_model, param_grid=lgb_param_grid, scoring='accuracy', cv=3)
lgb_grid_search.fit(X_train, y_train)
lgb_best_model = lgb_grid_search.best_estimator_
print(f"LightGBM best parameters: {lgb_grid_search.best_params_}")

# Step 6: predict the test set and save the results
xgb_y_pred = xgb_best_model.predict(X_test)
lgb_y_pred = lgb_best_model.predict(X_test)

# Save the predictions to a CSV file
pd.DataFrame({'XGBoost prediction': xgb_y_pred, 'LightGBM prediction': lgb_y_pred}).to_csv('classification_results.csv', index=False)

# Step 7: evaluate and compare the two tuned classifiers
# XGBoost classifier
xgb_accuracy = accuracy_score(y_test, xgb_y_pred)
xgb_f1 = f1_score(y_test, xgb_y_pred, average='weighted')
print(f"XGBoost classifier - accuracy: {xgb_accuracy}, F1 score: {xgb_f1}")

# LightGBM classifier
lgb_accuracy = accuracy_score(y_test, lgb_y_pred)
lgb_f1 = f1_score(y_test, lgb_y_pred, average='weighted')
print(f"LightGBM classifier - accuracy: {lgb_accuracy}, F1 score: {lgb_f1}")

# Step 8: visualize the comparison
performance_metrics = ['Accuracy', 'F1 Score']
xgb_performance = [xgb_accuracy, xgb_f1]
lgb_performance = [lgb_accuracy, lgb_f1]

fig, ax = plt.subplots(figsize=(10, 6))
bar_width = 0.35
index = np.arange(len(performance_metrics))

bar1 = ax.bar(index, xgb_performance, bar_width, label='XGBoost')
bar2 = ax.bar(index + bar_width, lgb_performance, bar_width, label='LightGBM')

ax.set_xlabel('Performance Metrics', fontproperties=my_font)
ax.set_title('XGBoost vs LightGBM Performance Comparison', fontproperties=my_font)
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels(performance_metrics, fontproperties=my_font)
ax.legend()

# Annotate each bar with its value
for bar in list(bar1) + list(bar2):
    height = bar.get_height()
    ax.annotate('%.2f' % height,
                xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3), textcoords="offset points",
                ha='center', va='bottom')

plt.show()

(2) Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
None
       sepal length (cm)  sepal width (cm)  petal length (cm)  \
count         150.000000        150.000000         150.000000   
mean            5.843333          3.057333           3.758000   
std             0.828066          0.435866           1.765298   
min             4.300000          2.000000           1.000000   
25%             5.100000          2.800000           1.600000   
50%             5.800000          3.000000           4.350000   
75%             6.400000          3.300000           5.100000   
max             7.900000          4.400000           6.900000   

       petal width (cm)      target  
count        150.000000  150.000000  
mean           1.199333    1.000000  
std            0.762238    0.819232  
min            0.100000    0.000000  
25%            0.300000    0.000000  
50%            1.300000    1.000000  
75%            1.800000    2.000000  
max            2.500000    2.000000

XGBoost best parameters: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 200}
LightGBM best parameters: {'learning_rate': 0.01, 'n_estimators': 200, 'num_leaves': 31}
XGBoost classifier - accuracy: 1.0, F1 score: 1.0
LightGBM classifier - accuracy: 1.0, F1 score: 1.0