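Medical Data Analysis Practicum, Project 7: Ensemble Learning -- Air Quality Indicators -- Air Quality Analysis and Prediction

Project 7: Ensemble Learning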

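Objectives
  1. Understand the principles of ensemble learning algorithms;
  2. Become familiar with and proficient in using common ensemble learning algorithms;
  3. Become familiar with methods for evaluating model performance;
  4. Master methods for optimizing models.
Platform
  • Operating system: Windows 7 or later
  • Python version: 3.8.x or later
  • IDE: PyCharm or Anaconda
Tasks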

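The dataset file is named "aqi.csv" and contains nationwide air quality data for 2020. It records air quality indicators from January 2020 through September 2020, with fields for the date, AQI, quality level, PM2.5 (ppm), PM10 (ppm), SO2 (ppm), CO (ppm), NO2 (ppm), and O3_8h (ppm).

The business scenario for this project is air quality analysis and prediction. The data is split into a training set and a test set, and ensemble learning models are built to predict the AQI value and the quality level.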

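(I) Data Understanding and Preparation
  1. Import the Python packages required for this case study;
  2. Explore the loaded data with the describe() and info() methods and the shape attribute;
  3. Preprocess the dataset as appropriate for the actual data;
  4. Extract the features used for analysis and split the data into training and test sets.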
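
The exploration and preprocessing steps listed above can be sketched on a tiny stand-in table. The four-row frame below is made up for illustration (an assumption, not the real aqi.csv); only its column names mirror the ones used later in this write-up.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the real dataset (values are invented).
demo = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"],
    "AQI": [45, 80, np.nan, 120],
    "Quality_Level": ["A", "B", "B", "C"],
    "PM2_5_(ppm)": [20.0, 55.0, 48.0, np.nan],
})

# Exploratory checks: size, column types, summary statistics.
print(demo.shape)
demo.info()
print(demo.describe())

# Count missing values per column before deciding how to handle them.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Fill gaps in numeric columns with the column mean; text columns are untouched.
numeric_cols = demo.select_dtypes(include="number").columns
demo[numeric_cols] = demo[numeric_cols].fillna(demo[numeric_cols].mean())

# Parse the date strings into proper datetime values.
demo["Date"] = pd.to_datetime(demo["Date"])
```

Mean imputation is only one option; dropping incomplete rows with dropna() may be preferable when few rows are affected.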
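(II) Model Building, Prediction, and Optimization
Task 1: Random Forest
  1. Regression model

    • Build and train a model with RandomForestRegressor();
    • Use the model to predict AQI values;
    • Evaluate the model with the following metrics: mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), mean absolute percentage error (MAPE), and the R² score (coefficient of determination);
    • Tune the model with GridSearchCV and read the best-performing parameter combination from the best_params_ attribute;
    • Retrain the model with those parameters, output the new model's MAE, MSE, RMSE, MAPE, and R² score, and compare them with the metrics before tuning;
    • Use the feature_importances_ attribute to output the importance of each feature, sorted by importance;
    • Make predictions with the tuned model and output the results;
    • Visualize the predicted values against the test values.
  2. Classification model

    • Build and train a model with RandomForestClassifier();
    • Use the model to predict the air quality level;
    • Use confusion_matrix(), accuracy_score(), precision_score(), recall_score(), and f1_score() to evaluate the model's confusion matrix, accuracy, precision, recall, and F1 score, and output the results;
    • If the results are unsatisfactory, tune the model.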
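Task 2: Gradient Boosting Machine (GBM)
  1. Regression model

    • Build and train a model with GradientBoostingRegressor();
    • Use the model to predict AQI values;
    • Evaluate the model with the following metrics: mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), mean absolute percentage error (MAPE), and the R² score (coefficient of determination);
    • Tune the model with GridSearchCV and read the best-performing parameter combination from the best_params_ attribute;
    • Retrain the model with those parameters, output the new model's MAE, MSE, RMSE, MAPE, and R² score, and compare them with the metrics before tuning;
    • Use the feature_importances_ attribute to output the importance of each feature, sorted by importance;
    • Make predictions with the tuned model and output the results;
    • Visualize the predicted values against the test values.
  2. Classification model

    • Build and train a model with GradientBoostingClassifier();
    • Use the model to predict the air quality level;
    • Use confusion_matrix(), accuracy_score(), precision_score(), recall_score(), and f1_score() to evaluate the model's confusion matrix, accuracy, precision, recall, and F1 score, and output the results;
    • If the results are unsatisfactory, tune the model.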
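Task 3: Light Gradient Boosting Machine (LightGBM)
  1. Regression model

    • Build and train a model with LGBMRegressor();
    • Use the model to predict AQI values;
    • Evaluate the model with evaluation metrics and output the results;
    • If the results are unsatisfactory, tune the model.
  2. Classification model

    • Build and train a model with LGBMClassifier();
    • Use the model to predict the air quality level;
    • Evaluate the model with evaluation metrics and output the results;
    • If the results are unsatisfactory, tune the model.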

(I) Data Understanding and Preparation

# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error, r2_score, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
import lightgbm as lgb

# Load the data
# (modified_data.csv is presumably a preprocessed copy of aqi.csv with English column names; that step is not shown here)
data = pd.read_csv('output/modified_data.csv')

# Show basic information about the data
print("Data info:")
data.info()
print("\nData description:")
print(data.describe())
print("\nData shape:", data.shape)

# Check for and handle missing values
if data.isnull().sum().sum() > 0:
    # Missing values can either be filled in or the affected rows dropped;
    # here the numeric columns are simply filled with the column mean
    data.fillna(data.mean(numeric_only=True), inplace=True)

# Convert the date column to datetime
data['Date'] = pd.to_datetime(data['Date'])
print(data.head())

# Feature extraction
features = ['PM2_5_(ppm)', 'PM10_(ppm)', 'SO2_(ppm)', 'CO_(ppm)', 'NO2_(ppm)', 'O3_8h_(ppm)']
target_aqi = 'AQI'
target_quality = 'Quality_Level'

# Split into training and test sets (both targets are split together so the rows stay aligned)
X = data[features]
y_aqi = data[target_aqi]
y_quality = data[target_quality]
X_train, X_test, y_aqi_train, y_aqi_test, y_quality_train, y_quality_test = train_test_split(X, y_aqi, y_quality, test_size=0.2, random_state=42)
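
The split above passes three arrays to train_test_split at once. A minimal sketch on made-up arrays (the names and values are illustrative assumptions) shows that every array is cut with the same row indices, so features and both targets stay aligned:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i has features [2i, 2i+1], regression target 10*i, and a letter label.
X_demo = np.arange(20).reshape(10, 2)
y_reg = np.arange(10) * 10
y_cls = np.array(list("AABBBCCABC"))

# One call splits all three arrays with the same shuffled row order.
X_tr, X_te, yr_tr, yr_te, yc_tr, yc_te = train_test_split(
    X_demo, y_reg, y_cls, test_size=0.2, random_state=42
)

# Because of how the toy data was built, alignment can be checked directly.
rows_aligned = bool(np.all(X_te[:, 0] * 5 == yr_te))
labels_aligned = bool(np.all(y_cls[yr_te // 10] == yc_te))
print(rows_aligned, labels_aligned)
```

For the classification target, passing stratify=y_cls would additionally keep the class proportions similar in both parts; that option is not used in the write-up.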

(II) Model Building, Prediction, and Optimization

Task 1: Random Forest
# 1. Build and train the random forest regression model
rf_reg = RandomForestRegressor(random_state=42)
rf_reg.fit(X_train, y_aqi_train)

# 2. Predict AQI values
y_aqi_pred = rf_reg.predict(X_test)
print('Random forest regression model predicted AQI values:', y_aqi_pred)

# 3. Compute the evaluation metrics
mae = mean_absolute_error(y_aqi_test, y_aqi_pred)
mse = mean_squared_error(y_aqi_test, y_aqi_pred)
rmse = np.sqrt(mse)
mape = mean_absolute_percentage_error(y_aqi_test, y_aqi_pred)
r2 = r2_score(y_aqi_test, y_aqi_pred)
print("Random forest regression model evaluation metrics:")
print(f'MAE: {mae}, \nMSE: {mse}, \nRMSE: {rmse}, \nMAPE: {mape}, \nR2_SCORE: {r2}')

Random forest regression model predicted AQI values: [124.48 81.15 72.38 71.58 45.77 82.31 34.6 31.42 80.58 83.7
47.59 74.32 75.95 47.13 39.53 33.33 45.76 75.11 80.87 42.57
39.87 58.52 44.34 45.51 60.06 40.73 51.15 45.06 51.46 43.2
70.71 37.2 127.29 31.26 86.79 43.56 90.83 66. 111.21 80.26
33.47 53.14 47.4 130.66 73.89 47.37 47.58 47.16 66.56 39.78
44.36 115.1 105.81 110.77 74.06]
Random forest regression model evaluation metrics:
MAE: 3.0150909090909086,
MSE: 25.965259999999997,
RMSE: 5.095611837650116,
MAPE: 0.05528859263399927,
R2_SCORE: 0.965415994168552
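
To make the five metrics concrete, here is a minimal NumPy sketch of their textbook definitions on three made-up values. The helper name regression_report is hypothetical; the write-up itself uses the scikit-learn functions. Note that scikit-learn's mean_absolute_percentage_error returns a fraction, so the MAPE of about 0.055 above corresponds to roughly 5.5%.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Compute MAE, MSE, RMSE, MAPE and R^2 from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    # Textbook MAPE as a fraction; undefined if any true value is zero.
    mape = np.mean(np.abs(errors) / np.abs(y_true))
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape, "R2": r2}

# Errors are -5, 10 and 0, so MAE = 5 and MSE = 125 / 3.
report = regression_report([50, 100, 150], [55, 90, 150])
print(report)
```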

# 4. Tune the model with GridSearchCV
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=rf_reg, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
# Run the grid search
grid_search.fit(X_train, y_aqi_train)
# Get the best parameter combination
best_params = grid_search.best_params_
print(f'Best parameter combination: {best_params}')

Best parameter combination: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}

# 5. Retrain the model with the best parameters
best_rf_reg = RandomForestRegressor(**best_params, random_state=42)
best_rf_reg.fit(X_train, y_aqi_train)

# Predict with and evaluate the tuned model
y_aqi_pred_optimized = best_rf_reg.predict(X_test)
print('Optimized random forest regression model predicted AQI values:', y_aqi_pred_optimized)
mae_optimized = mean_absolute_error(y_aqi_test, y_aqi_pred_optimized)
mse_optimized = mean_squared_error(y_aqi_test, y_aqi_pred_optimized)
rmse_optimized = np.sqrt(mse_optimized)
mape_optimized = mean_absolute_percentage_error(y_aqi_test, y_aqi_pred_optimized)
r2_optimized = r2_score(y_aqi_test, y_aqi_pred_optimized)
print("Optimized random forest regression model evaluation metrics:")
print(f'MAE: {mae_optimized}, \nMSE: {mse_optimized}, \nRMSE: {rmse_optimized}, \nMAPE: {mape_optimized}, \nR2_SCORE: {r2_optimized}')

Optimized random forest regression model predicted AQI values: [124.48 81.15 72.38 71.58 45.77 82.31 34.6 31.42 80.58 83.7
47.59 74.32 75.95 47.13 39.53 33.33 45.76 75.11 80.87 42.57
39.87 58.52 44.34 45.51 60.06 40.73 51.15 45.06 51.46 43.2
70.71 37.2 127.29 31.26 86.79 43.56 90.83 66. 111.21 80.26
33.47 53.14 47.4 130.66 73.89 47.37 47.58 47.16 66.56 39.78
44.36 115.1 105.81 110.77 74.06]
Optimized random forest regression model evaluation metrics:
MAE: 3.0150909090909086,
MSE: 25.965259999999997,
RMSE: 5.095611837650116,
MAPE: 0.05528859263399927,
R2_SCORE: 0.965415994168552

# Compare the metrics before and after optimization
print("Metrics before and after optimization:")
print(f"Before: MAE: {mae}, MSE: {mse}, RMSE: {rmse}, MAPE: {mape}, R2_SCORE: {r2}")
print(f"After: MAE: {mae_optimized}, MSE: {mse_optimized}, RMSE: {rmse_optimized}, MAPE: {mape_optimized}, R2_SCORE: {r2_optimized}")

Metrics before and after optimization:
Before: MAE: 3.0150909090909086, MSE: 25.965259999999997, RMSE: 5.095611837650116, MAPE: 0.05528859263399927, R2_SCORE: 0.965415994168552
After: MAE: 3.0150909090909086, MSE: 25.965259999999997, RMSE: 5.095611837650116, MAPE: 0.05528859263399927, R2_SCORE: 0.965415994168552

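The optimization did not improve the model. The selected values for n_estimators (100), min_samples_split (2), and min_samples_leaf (1) are the same as scikit-learn's defaults, and the identical predictions indicate that the max_depth=20 limit is never reached on this data, so the "tuned" model is effectively the same model as before.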
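
One way to see how close the candidates were is to inspect the search's cv_results_ instead of only best_params_. The sketch below is a hedged illustration on synthetic data (the data, the small grid, and n_estimators=30 are assumptions chosen to keep it fast); it also includes max_depth=None, the default, in the grid, and reuses best_estimator_, which GridSearchCV already refits on the full training data when refit=True (the default).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: the target follows the largest of three readings plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(120, 3))
y_demo = X_demo.max(axis=1) + rng.normal(0, 2, size=120)

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=30, random_state=42),
    param_grid={"max_depth": [None, 3, 10], "min_samples_leaf": [1, 4]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# Mean cross-validated MSE for every candidate (the score is negated MSE).
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(-score, 2))

# The refitted best model is available directly; no manual retraining is needed.
tuned_model = search.best_estimator_
n_candidates = len(search.cv_results_["params"])
```

If the best candidate's cross-validated score is barely better than the default configuration's, identical or near-identical test metrics are to be expected.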

# 6. Output the importance of each feature via feature_importances_, sorted in descending order
importances = best_rf_reg.feature_importances_
feature_importances = pd.Series(importances, index=features).sort_values(ascending=False)
print(feature_importances)

# 7. The tuned model's predictions (y_aqi_pred_optimized) were already printed in step 5

# 8. Visualize the predicted values against the test values
plt.figure(figsize=(10, 6))
plt.scatter(y_aqi_test, y_aqi_pred_optimized, alpha=0.5)
plt.xlabel('Actual AQI')
plt.ylabel('Predicted AQI')
plt.title('Actual vs Predicted AQI')
plt.show()

PM10_(ppm) 0.400419
PM2_5_(ppm) 0.291729
O3_8h_(ppm) 0.288429
CO_(ppm) 0.008184
NO2_(ppm) 0.007934
SO2_(ppm) 0.003305
dtype: float64
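
As a small worked check on the importances printed above, the cumulative share can be computed in plain Python. The values are copied from the output; the variable names are illustrative.

```python
# Feature importances copied from the random forest output above.
importances = {
    "PM10_(ppm)": 0.400419,
    "PM2_5_(ppm)": 0.291729,
    "O3_8h_(ppm)": 0.288429,
    "CO_(ppm)": 0.008184,
    "NO2_(ppm)": 0.007934,
    "SO2_(ppm)": 0.003305,
}

# Rank features and accumulate their share of the total importance.
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
running = 0.0
cumulative = []
for name, value in ranked:
    running += value
    cumulative.append((name, running))
    print(f"{name:<12} {value:.6f}  cumulative {running:.6f}")

top3_share = cumulative[2][1]
```

PM10, PM2.5, and O3_8h together account for about 98% of the total importance, while CO, NO2, and SO2 contribute very little to this model.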

[Figure: scatter plot of actual vs. predicted AQI for the tuned random forest model]
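
The task list also asks for a random forest classification model, which is not shown in this write-up. The following is a minimal sketch of what that step could look like, run on synthetic stand-in data (the data generation, the variable names, and n_estimators=50 are assumptions, not the author's code). On the real data, X_train and y_quality_train would take the place of the synthetic arrays.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in: two made-up pollutant readings; the level depends on the larger one.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 150, size=(300, 2))
worst = X_demo.max(axis=1)
y_demo = np.where(worst < 60, "A", np.where(worst < 110, "B", "C"))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Build, train, and predict.
rf_clf = RandomForestClassifier(n_estimators=50, random_state=42)
rf_clf.fit(X_tr, y_tr)
y_hat = rf_clf.predict(X_te)

# Evaluate with the metrics named in the task list (weighted averages for multiclass).
cm = confusion_matrix(y_te, y_hat, labels=["A", "B", "C"])
scores = {
    "accuracy": accuracy_score(y_te, y_hat),
    "precision": precision_score(y_te, y_hat, average="weighted", zero_division=0),
    "recall": recall_score(y_te, y_hat, average="weighted", zero_division=0),
    "f1": f1_score(y_te, y_hat, average="weighted", zero_division=0),
}
print(cm)
print(scores)
```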

Task 2: GBM
Regression model
# 1. Build and train the model with GradientBoostingRegressor()
gb_reg = GradientBoostingRegressor(random_state=42)
gb_reg.fit(X_train, y_aqi_train)
# 2. Use the model to predict AQI values
y_aqi_pred = gb_reg.predict(X_test)
print('GBM regression model predicted AQI values:', y_aqi_pred)

# 3. Evaluate the model with the evaluation metrics
mae = mean_absolute_error(y_aqi_test, y_aqi_pred)
mse = mean_squared_error(y_aqi_test, y_aqi_pred)
rmse = np.sqrt(mse)
mape = mean_absolute_percentage_error(y_aqi_test, y_aqi_pred)
r2 = r2_score(y_aqi_test, y_aqi_pred)
print("Gradient Boosting Regression Model Evaluation Metrics:")
print(f'MAE: {mae}, \nMSE: {mse}, \nRMSE: {rmse}, \nMAPE: {mape}, \nR2_SCORE: {r2}')

GBM regression model predicted AQI values: [122.57416247 83.36233285 73.90280417 71.61249735 45.90407098
83.09407824 35.38809475 32.1115523 81.92797541 83.40916295
48.82405535 74.28270394 74.96495747 45.69629863 39.59354642
33.09971192 45.41896268 75.52727318 81.71507209 47.02496198
41.96486507 59.76085878 45.10753769 46.1912337 59.05166283
49.05189862 53.29885368 47.58476507 46.59894793 42.17298408
70.67172663 35.57436497 130.76443134 33.12142879 85.93142525
41.04272972 88.25804535 64.42863259 112.47587802 80.12500147
32.96123373 55.09504267 50.37469809 125.99062665 75.72767345
48.10707457 51.29551088 47.94867709 70.66198919 40.51320902
40.7250176 115.95276244 114.3584965 112.04106305 74.86570745]
Gradient Boosting Regression Model Evaluation Metrics:
MAE: 3.017067274506405,
MSE: 19.567961603563685,
RMSE: 4.4235688763218874,
MAPE: 0.058486439950287586,
R2_SCORE: 0.9739367717401174

# 4. Tune the model with GridSearchCV
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=gb_reg, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
# Run the grid search
grid_search.fit(X_train, y_aqi_train)

# Get the best parameter combination
best_params = grid_search.best_params_
print(f'Best Parameters: {best_params}')

Best Parameters: {'learning_rate': 0.1, 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 300}

# 5. Retrain the model with the best parameters
best_gb_reg = GradientBoostingRegressor(**best_params, random_state=42)
best_gb_reg.fit(X_train, y_aqi_train)

# Predict with the tuned model
y_aqi_pred_optimized = best_gb_reg.predict(X_test)
print('Optimized model predictions:', y_aqi_pred_optimized)
# Compute the evaluation metrics of the tuned model
mae_optimized = mean_absolute_error(y_aqi_test, y_aqi_pred_optimized)
mse_optimized = mean_squared_error(y_aqi_test, y_aqi_pred_optimized)
rmse_optimized = np.sqrt(mse_optimized)
mape_optimized = mean_absolute_percentage_error(y_aqi_test, y_aqi_pred_optimized)
r2_optimized = r2_score(y_aqi_test, y_aqi_pred_optimized)
print("Optimized gradient boosting regression model evaluation metrics:")
print(f'Optimized MAE: {mae_optimized}, \nOptimized MSE: {mse_optimized}, \nOptimized RMSE: {rmse_optimized}, \nOptimized MAPE: {mape_optimized}, \nOptimized R2_SCORE: {r2_optimized}')

Optimized model predictions: [124.50273685 83.46470773 76.67313626 71.43908717 46.06087546
82.48270218 35.45781481 30.29347664 80.68483493 83.63975494
48.01910073 75.04391558 74.6780025 44.02048381 39.16875902
31.57326064 45.52152266 74.54621085 81.98742113 41.15229431
40.05067005 60.0349372 43.40693783 42.44777993 60.0874834
46.4533299 53.98613726 45.00781228 51.56679542 38.97574632
73.97473389 36.03646256 131.65412729 30.82872235 86.88627133
44.17166092 89.64827072 66.71578258 112.06193027 80.82544043
32.13607404 53.33558888 48.52689834 125.55765644 77.38396113
48.52990476 51.07272122 48.89955218 69.66154718 40.70715896
49.21862157 117.74301294 107.39395475 111.89285961 75.11097803]
Optimized gradient boosting regression model evaluation metrics:
Optimized MAE: 2.7372135954667853,
Optimized MSE: 20.880137541908642,
Optimized RMSE: 4.569478913608054,
Optimized MAPE: 0.048316013543643156,
Optimized R2_SCORE: 0.9721890403365572

# Compare the metrics before and after optimization
print("Comparison of evaluation metrics before and after optimization:")
print(f"Before: MAE: {mae}, MSE: {mse}, RMSE: {rmse}, MAPE: {mape}, R2_SCORE: {r2}")
print(f"After: MAE: {mae_optimized}, MSE: {mse_optimized}, RMSE: {rmse_optimized}, MAPE: {mape_optimized}, R2_SCORE: {r2_optimized}")

Comparison of evaluation metrics before and after optimization:
Before: MAE: 3.017067274506405, MSE: 19.567961603563685, RMSE: 4.4235688763218874, MAPE: 0.058486439950287586, R2_SCORE: 0.9739367717401174
After: MAE: 2.7372135954667853, MSE: 20.880137541908642, RMSE: 4.569478913608054, MAPE: 0.048316013543643156, R2_SCORE: 0.9721890403365572
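
The GBM comparison is mixed rather than a clear win, which is easier to see as relative changes. The short sketch below uses the numbers printed above; the dictionary names are illustrative.

```python
# Test-set metrics copied from the GBM output above.
before = {"MAE": 3.017067274506405, "MSE": 19.567961603563685,
          "RMSE": 4.4235688763218874, "MAPE": 0.058486439950287586,
          "R2": 0.9739367717401174}
after = {"MAE": 2.7372135954667853, "MSE": 20.880137541908642,
         "RMSE": 4.569478913608054, "MAPE": 0.048316013543643156,
         "R2": 0.9721890403365572}

# Relative change in percent; negative means the metric went down after tuning.
change = {name: (after[name] - before[name]) / before[name] * 100 for name in before}
for name, pct in change.items():
    print(f"{name}: {pct:+.2f}%")
```

MAE and MAPE drop (better), while MSE and RMSE rise and R² falls slightly (worse). The grid search selected parameters by cross-validated MSE on the training data, so a small shift in either direction on a 55-row test set is not surprising.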

# 6. Output the feature importances
importances = best_gb_reg.feature_importances_
feature_importances = pd.Series(importances, index=features).sort_values(ascending=False)
print(feature_importances)

# Visualize the predicted values against the test values
plt.figure(figsize=(10, 6))
plt.scatter(y_aqi_test, y_aqi_pred_optimized, alpha=0.5)
plt.xlabel('Actual AQI')
plt.ylabel('Predicted AQI')
plt.title('Actual vs Predicted AQI')
plt.show()

PM10_(ppm) 0.422780
O3_8h_(ppm) 0.303769
PM2_5_(ppm) 0.265263
NO2_(ppm) 0.006753
SO2_(ppm) 0.001192
CO_(ppm) 0.000243
dtype: float64

[Figure: scatter plot of actual vs. predicted AQI for the tuned GBM model]

Classification model
# 1. Build and train the model
gbm_clf = GradientBoostingClassifier(random_state=42)
gbm_clf.fit(X_train, y_quality_train)

# 2. Predict the air quality level
y_quality_pred = gbm_clf.predict(X_test)
print('GBM classification model predictions:', y_quality_pred)
# 3. Evaluate the model
conf_matrix = confusion_matrix(y_quality_test, y_quality_pred)
accuracy = accuracy_score(y_quality_test, y_quality_pred)
precision = precision_score(y_quality_test, y_quality_pred, average='weighted')
recall = recall_score(y_quality_test, y_quality_pred, average='weighted')
f1 = f1_score(y_quality_test, y_quality_pred, average='weighted')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Accuracy: {accuracy}, \nPrecision: {precision}, \nRecall: {recall}, \nF1 Score: {f1}')

GBM classification model predictions: ['C' 'B' 'B' 'B' 'A' 'B' 'A' 'A' 'B' 'B' 'A' 'B' 'B' 'A' 'A' 'A' 'A' 'B'
'B' 'A' 'A' 'B' 'A' 'A' 'B' 'B' 'B' 'B' 'B' 'A' 'B' 'A' 'C' 'A' 'B' 'A'
'B' 'B' 'C' 'B' 'A' 'B' 'B' 'C' 'B' 'A' 'A' 'B' 'B' 'A' 'B' 'C' 'C' 'C'
'B']
Confusion Matrix:
[[20 3 0]
[ 0 25 0]
[ 0 0 7]]
Accuracy: 0.9454545454545454,
Precision: 0.9512987012987013,
Recall: 0.9454545454545454,
F1 Score: 0.9450955363197574
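
The weighted scores above can be reproduced by hand from the confusion matrix, which makes it clearer what average='weighted' does: each class's score is weighted by its number of true samples (its support). A minimal NumPy sketch, using the matrix printed above (rows are true levels A, B, C; columns are predicted levels):

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[20, 3, 0],
               [0, 25, 0],
               [0, 0, 7]])

support = cm.sum(axis=1)      # true samples per class (row sums)
predicted = cm.sum(axis=0)    # predicted samples per class (column sums)
correct = np.diag(cm)         # correctly classified samples per class

accuracy = correct.sum() / cm.sum()
precision_per_class = correct / predicted
recall_per_class = correct / support
f1_per_class = (2 * precision_per_class * recall_per_class
                / (precision_per_class + recall_per_class))

# Support-weighted averages, matching average='weighted' in scikit-learn.
weights = support / support.sum()
weighted_precision = np.sum(weights * precision_per_class)
weighted_recall = np.sum(weights * recall_per_class)
weighted_f1 = np.sum(weights * f1_per_class)

print(accuracy, weighted_precision, weighted_recall, weighted_f1)
```

Weighted recall always equals accuracy, which is why those two values coincide in every classification result in this write-up.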


# 4. Tune the model
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a StratifiedKFold object (left disabled; cv=2 is used instead)
# stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=gbm_clf, param_grid=param_grid, cv=2, scoring='accuracy')
# Run the grid search
grid_search.fit(X_train, y_quality_train)

# Get the best parameter combination
best_params = grid_search.best_params_
print(f'Best Parameters: {best_params}')

Best Parameters: {'learning_rate': 0.1, 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 200}
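
The search above leaves the StratifiedKFold line commented out and uses cv=2. For reference, this is a hedged sketch of how an explicit StratifiedKFold object could be passed to GridSearchCV, on synthetic data from make_classification with a deliberately small grid (both are assumptions to keep the example quick). With an integer cv and a classifier, scikit-learn already uses stratified folds without shuffling; the explicit object mainly adds shuffling with a fixed seed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic three-class stand-in for the air quality levels.
X_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42
)

# Five shuffled folds that preserve the class proportions in each fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid={"n_estimators": [20, 40], "max_depth": [2, 3]},
    cv=folds,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

More folds give each candidate a more stable accuracy estimate than cv=2, at the cost of proportionally more model fits.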

# 5. Retrain the model with the best parameters
best_gbm_clf = GradientBoostingClassifier(**best_params, random_state=42)
best_gbm_clf.fit(X_train, y_quality_train)

# 6. Predict with the tuned model
y_quality_pred_optimized = best_gbm_clf.predict(X_test)
print('Optimized model predicted air quality levels:', y_quality_pred_optimized)
# 7. Compute the evaluation metrics of the tuned model
conf_matrix_optimized = confusion_matrix(y_quality_test, y_quality_pred_optimized)
accuracy_optimized = accuracy_score(y_quality_test, y_quality_pred_optimized)
precision_optimized = precision_score(y_quality_test, y_quality_pred_optimized, average='weighted')
recall_optimized = recall_score(y_quality_test, y_quality_pred_optimized, average='weighted')
f1_optimized = f1_score(y_quality_test, y_quality_pred_optimized, average='weighted')
print(f'Optimized Confusion Matrix:\n{conf_matrix_optimized}')
print(f'Optimized Accuracy: {accuracy_optimized}, \nOptimized Precision: {precision_optimized}, \nOptimized Recall: {recall_optimized}, \nOptimized F1 Score: {f1_optimized}')

# Compare the metrics before and after optimization
print("Comparison of evaluation metrics before and after optimization:\n")
print(f"Before: Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1 Score: {f1}")
print(f"After: Accuracy: {accuracy_optimized}, Precision: {precision_optimized}, Recall: {recall_optimized}, F1 Score: {f1_optimized}")

Optimized model predicted air quality levels: ['C' 'B' 'B' 'B' 'A' 'B' 'A' 'A' 'B' 'B' 'A' 'B' 'B' 'A' 'A' 'A' 'A' 'B'
'B' 'A' 'A' 'B' 'A' 'A' 'B' 'B' 'B' 'A' 'A' 'A' 'B' 'A' 'C' 'A' 'B' 'A'
'B' 'B' 'C' 'B' 'A' 'B' 'A' 'C' 'B' 'A' 'A' 'B' 'B' 'A' 'B' 'C' 'C' 'C'
'B']
Optimized Confusion Matrix:
[[22 1 0]
[ 1 24 0]
[ 0 0 7]]
Optimized Accuracy: 0.9636363636363636,
Optimized Precision: 0.9636363636363636,
Optimized Recall: 0.9636363636363636,
Optimized F1 Score: 0.9636363636363636
Comparison of evaluation metrics before and after optimization:

Before: Accuracy: 0.9454545454545454, Precision: 0.9512987012987013, Recall: 0.9454545454545454, F1 Score: 0.9450955363197574
After: Accuracy: 0.9636363636363636, Precision: 0.9636363636363636, Recall: 0.9636363636363636, F1 Score: 0.9636363636363636

Task 3: LightGBM
# 1. Regression model
# 1.1 Build and train the regression model with LGBMRegressor()
lgb_reg = lgb.LGBMRegressor(random_state=42)
lgb_reg.fit(X_train, y_aqi_train)

# 1.2 Use the model to predict AQI values
y_aqi_pred = lgb_reg.predict(X_test)

# 1.3 Evaluate the model
mse = mean_squared_error(y_aqi_test, y_aqi_pred)
r2 = r2_score(y_aqi_test, y_aqi_pred)
print("LGBMRegressor Model Evaluation Metrics:")
print(f'Mean Squared Error (MSE): {mse}')
print(f'R^2 Score: {r2}')

LGBMRegressor Model Evaluation Metrics:
Mean Squared Error (MSE): 70.6065599506794
R^2 Score: 0.9059567406190894

from sklearn.preprocessing import LabelEncoder

# 2. Classification model
# 2.1 Build and train the classification model with LGBMClassifier()
# Convert the class labels to integers
label_encoder = LabelEncoder()
y_quality_train_encoded = label_encoder.fit_transform(y_quality_train)
y_quality_test_encoded = label_encoder.transform(y_quality_test)
lgb_clf = lgb.LGBMClassifier(random_state=42)
lgb_clf.fit(X_train, y_quality_train_encoded)

# 2.2 Use the model to predict the air quality level (as encoded integers)
y_quality_pred = lgb_clf.predict(X_test)
print('Predicted air quality levels:', y_quality_pred)

# 2.3 Evaluate the model
conf_matrix = confusion_matrix(y_quality_test_encoded, y_quality_pred)
accuracy = accuracy_score(y_quality_test_encoded, y_quality_pred)
precision = precision_score(y_quality_test_encoded, y_quality_pred, average='weighted')
recall = recall_score(y_quality_test_encoded, y_quality_pred, average='weighted')
f1 = f1_score(y_quality_test_encoded, y_quality_pred, average='weighted')
print("LGBMClassifier Model Evaluation Metrics:")
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')

Predicted air quality levels: [2 1 1 1 0 1 0 0 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 0 2 0 1 0 1
1 2 1 0 1 0 2 1 0 0 1 1 0 1 2 1 2 1]
LGBMClassifier Model Evaluation Metrics:
Confusion Matrix:
[[22 1 0]
[ 1 24 0]
[ 0 1 6]]
Accuracy: 0.9454545454545454
Precision: 0.9468531468531469
Recall: 0.9454545454545454
F1 Score: 0.9452900041135335
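
The LightGBM predictions above are printed as the integers 0, 1, and 2 because the labels were passed through LabelEncoder. A minimal sketch of how such codes map back to the original letters with inverse_transform (the short label list is made up; the variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit the encoder on a few example levels; classes are sorted alphabetically.
encoder = LabelEncoder()
encoded = encoder.fit_transform(["A", "B", "C", "B", "A"])
print(encoder.classes_)   # the position of each label is its integer code
print(encoded)

# Map integer predictions back to the original labels.
decoded = encoder.inverse_transform(np.array([2, 1, 1, 0]))
print(decoded)
```

In the code above, label_encoder.inverse_transform(y_quality_pred) would give the predicted levels in the same 'A'/'B'/'C' form that the GBM classifier prints.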

# 3. Tune the models if the evaluation results are unsatisfactory
# 3.1 Define the parameter grids
param_grid_reg = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'num_leaves': [31, 63, 127],
    'max_depth': [-1, 5, 10],
    'min_child_samples': [20, 50, 100]
}
param_grid_clf = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'num_leaves': [31, 63, 127],
    'max_depth': [-1, 5, 10],
    'min_child_samples': [20, 50, 100]
}

# 3.2 Create the GridSearchCV objects
grid_search_reg = GridSearchCV(estimator=lgb_reg, param_grid=param_grid_reg, cv=5, scoring='neg_mean_squared_error')
grid_search_clf = GridSearchCV(estimator=lgb_clf, param_grid=param_grid_clf, cv=5, scoring='accuracy')

# 3.3 Run the grid searches
grid_search_reg.fit(X_train, y_aqi_train)
grid_search_clf.fit(X_train, y_quality_train_encoded)

# 3.4 Get the best parameter combinations
best_params_reg = grid_search_reg.best_params_
best_params_clf = grid_search_clf.best_params_
print(f'Best Parameters for Regression: {best_params_reg}')
print(f'Best Parameters for Classification: {best_params_clf}')

Best Parameters for Regression: {'learning_rate': 0.1, 'max_depth': 5, 'min_child_samples': 20, 'n_estimators': 100, 'num_leaves': 31}
Best Parameters for Classification: {'learning_rate': 0.1, 'max_depth': 5, 'min_child_samples': 20, 'n_estimators': 100, 'num_leaves': 31}

# 3.5 Retrain the models with the best parameters
best_lgb_reg = lgb.LGBMRegressor(**best_params_reg, random_state=42)
best_lgb_clf = lgb.LGBMClassifier(**best_params_clf, random_state=42)
best_lgb_reg.fit(X_train, y_aqi_train)
best_lgb_clf.fit(X_train, y_quality_train_encoded)

# 3.6 Predict with the tuned models
y_aqi_pred_optimized = best_lgb_reg.predict(X_test)
y_quality_pred_optimized = best_lgb_clf.predict(X_test)
print('Optimized model predicted AQI values:', y_aqi_pred_optimized)
print('Optimized model predicted air quality levels:', y_quality_pred_optimized)

# 3.7 Evaluate the tuned models
mse_optimized = mean_squared_error(y_aqi_test, y_aqi_pred_optimized)
r2_optimized = r2_score(y_aqi_test, y_aqi_pred_optimized)
conf_matrix_optimized = confusion_matrix(y_quality_test_encoded, y_quality_pred_optimized)
accuracy_optimized = accuracy_score(y_quality_test_encoded, y_quality_pred_optimized)
precision_optimized = precision_score(y_quality_test_encoded, y_quality_pred_optimized, average='weighted')
recall_optimized = recall_score(y_quality_test_encoded, y_quality_pred_optimized, average='weighted')
f1_optimized = f1_score(y_quality_test_encoded, y_quality_pred_optimized, average='weighted')
print("Optimized LGBMRegressor Model Evaluation Metrics:")
print(f'Optimized Mean Squared Error (MSE): {mse_optimized}')
print(f'Optimized R^2 Score: {r2_optimized}')
print("Optimized LGBMClassifier Model Evaluation Metrics:")
print(f'Optimized Confusion Matrix:\n{conf_matrix_optimized}')
print(f'Optimized Accuracy: {accuracy_optimized}')
print(f'Optimized Precision: {precision_optimized}')
print(f'Optimized Recall: {recall_optimized}')
print(f'Optimized F1 Score: {f1_optimized}')

Optimized model predicted AQI values: [119.6722501 90.43625783 76.95094102 69.7134339 45.49522429
85.11016632 35.29020956 34.373013 76.12352252 86.39110431
48.60966258 71.83512479 73.98876859 44.13139587 40.82771554
34.96190592 45.33698962 73.97657317 84.40383692 48.74370587
42.31917891 54.61740284 43.89328402 50.84420449 61.99838848
44.00117867 54.84723723 47.00982841 47.98332788 49.8258541
62.28614705 36.04575205 113.75560249 34.31105093 88.98552298
45.43941569 106.8158533 62.86787307 111.01787045 82.98067324
34.80876636 65.3185259 50.05687814 115.46064086 84.07845619
49.74122766 52.93800566 54.78650467 54.40771277 40.1914266
36.17207261 107.11934225 97.1210987 100.3162011 74.79308805]
Optimized model predicted air quality levels: [2 1 1 1 0 1 0 0 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 0 2 0 1 0 1
1 2 1 0 1 0 2 1 0 0 1 1 0 0 2 1 2 1]
Optimized LGBMRegressor Model Evaluation Metrics:
Optimized Mean Squared Error (MSE): 75.10416813845606
Optimized R^2 Score: 0.8999662245297593
Optimized LGBMClassifier Model Evaluation Metrics:
Optimized Confusion Matrix:
[[22 1 0]
[ 2 23 0]
[ 0 1 6]]
Optimized Accuracy: 0.9272727272727272
Optimized Precision: 0.9287878787878787
Optimized Recall: 0.9272727272727272
Optimized F1 Score: 0.9271536973664632


# Compare the metrics before and after optimization
print("Comparison of Evaluation Metrics Before and After Optimization:")
print(f"Regression: MSE: {mse} -> {mse_optimized}, R^2 Score: {r2} -> {r2_optimized}")
print(f"Classification: Accuracy: {accuracy} -> {accuracy_optimized}, Precision: {precision} -> {precision_optimized}, Recall: {recall} -> {recall_optimized}, F1 Score: {f1} -> {f1_optimized}")

Comparison of Evaluation Metrics Before and After Optimization:
Regression: MSE: 70.6065599506794 -> 75.10416813845606, R^2 Score: 0.9059567406190894 -> 0.8999662245297593
Classification: Accuracy: 0.9454545454545454 -> 0.9272727272727272, Precision: 0.9468531468531469 -> 0.9287878787878787, Recall: 0.9454545454545454 -> 0.9272727272727272, F1 Score: 0.9452900041135335 -> 0.9271536973664632
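
To close the comparison, the regression results reported throughout this write-up can be collected and ranked in one place. The sketch below only reorganizes numbers already printed above; the dictionary layout and labels are illustrative.

```python
# Test-set regression metrics copied from the outputs in this write-up.
regression_results = {
    "Random forest (default and tuned)": {"MSE": 25.965259999999997, "R2": 0.965415994168552},
    "GBM (default)": {"MSE": 19.567961603563685, "R2": 0.9739367717401174},
    "GBM (tuned)": {"MSE": 20.880137541908642, "R2": 0.9721890403365572},
    "LightGBM (default)": {"MSE": 70.6065599506794, "R2": 0.9059567406190894},
    "LightGBM (tuned)": {"MSE": 75.10416813845606, "R2": 0.8999662245297593},
}

# Rank the models from lowest to highest test MSE.
ranked = sorted(regression_results.items(), key=lambda item: item[1]["MSE"])
for name, m in ranked:
    print(f"{name:<36} MSE={m['MSE']:8.2f}  R2={m['R2']:.4f}")

best_name = ranked[0][0]
```

On this 55-row test set, the default GBM has the lowest MSE and LightGBM the highest. For both LightGBM models, the grid-searched parameters scored slightly worse on the test set than the defaults, so tuning by cross-validation did not carry over to held-out data here.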

