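Medical Data Analysis Practicum, Project 7: Ensemble Learning - Air Quality Index - Air Quality Analysis and Prediction

Project 7: Ensemble Learning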

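Objectives
  1. Understand the principles behind ensemble learning algorithms;
  2. Become proficient with commonly used ensemble learning algorithms;
  3. Become familiar with methods for evaluating model performance;
  4. Master methods for optimizing models.
Platform
  • Operating system: Windows 7 or later
  • Python version: 3.8.x or later
  • IDE: PyCharm or Anaconda
Tasks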

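The dataset file is named "aqi.csv" and contains nationwide air quality data for 2020. It records air quality indicators from January through September 2020, with fields for the date, AQI, quality level, and the PM2.5, PM10, SO2, CO, NO2, and O3_8h concentrations (all labeled ppm).

This project covers air quality analysis and prediction: split the data into training and test sets, then build ensemble learning models to predict the AQI value and the quality level.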

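(1) Data Understanding and Preparation
  1. Import the Python packages needed for this case study;
  2. Explore the loaded data with the describe() and info() methods, the shape attribute, and similar tools;
  3. Preprocess the dataset as appropriate for the actual data;
  4. Extract the features used for analysis and split the data into training and test sets.
(2) Model Building, Prediction, and Optimization
Task 1: Random Forest
  1. Regression model

    • Build and train a model with RandomForestRegressor();
    • Use the model to predict AQI values;
    • Evaluate the model with mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), mean absolute percentage error (MAPE), and the coefficient of determination R² (the value score() returns);
    • Tune the model with GridSearchCV grid search and read the best-performing parameter combination from the best_params_ attribute;
    • Retrain the model with those parameters, output the new model's MAE, MSE, RMSE, MAPE, and R², and compare them with the values before tuning;
    • Output each feature's importance with the feature_importances_ attribute, sorted by importance;
    • Predict with the tuned model and output the predictions;
    • Visualize predicted values against test values.
  2. Classification model

    • Build and train a model with RandomForestClassifier();
    • Use the model to predict the air quality level;
    • Evaluate the model with confusion_matrix(), accuracy_score(), precision_score(), recall_score(), and f1_score(), which give the confusion matrix, accuracy, precision, recall, and F1 score, and output the results;
    • Tune the model if the results are unsatisfactory.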
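Task 2: Gradient Boosting Machine (GBM)
  1. Regression model

    • Build and train a model with GradientBoostingRegressor();
    • Use the model to predict AQI values;
    • Evaluate the model with mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), mean absolute percentage error (MAPE), and the coefficient of determination R² (the value score() returns);
    • Tune the model with GridSearchCV grid search and read the best-performing parameter combination from the best_params_ attribute;
    • Retrain the model with those parameters, output the new model's MAE, MSE, RMSE, MAPE, and R², and compare them with the values before tuning;
    • Output each feature's importance with the feature_importances_ attribute, sorted by importance;
    • Predict with the tuned model and output the predictions;
    • Visualize predicted values against test values.
  2. Classification model

    • Build and train a model with GradientBoostingClassifier();
    • Use the model to predict the air quality level;
    • Evaluate the model with confusion_matrix(), accuracy_score(), precision_score(), recall_score(), and f1_score(), which give the confusion matrix, accuracy, precision, recall, and F1 score, and output the results;
    • Tune the model if the results are unsatisfactory.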
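Task 3: Light Gradient Boosting Machine (LightGBM)
  1. Regression model

    • Build and train a model with LGBMRegressor();
    • Use the model to predict AQI values;
    • Evaluate the model with evaluation metrics and output the results;
    • Tune the model if the results are unsatisfactory.
  2. Classification model

    • Build and train a model with LGBMClassifier();
    • Use the model to predict the air quality level;
    • Evaluate the model with evaluation metrics and output the results;
    • Tune the model if the results are unsatisfactory.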

(1) Data Understanding and Preparation

# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error, r2_score, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
import lightgbm as lgb

# Load the data
data = pd.read_csv('output/modified_data.csv')

# Show basic information about the data
print("Data info:")
data.info()  # info() prints its report directly and returns None
print("\nData description:")
print(data.describe())
print("\nData shape:", data.shape)

# Check for and handle missing values
if data.isnull().sum().sum() > 0:
    # Missing values can be filled in, or the affected rows dropped;
    # here the numeric columns are simply filled with their column means
    data.fillna(data.mean(numeric_only=True), inplace=True)

# Convert the date column to datetime
data['Date'] = pd.to_datetime(data['Date'])
print(data.head())

# Feature selection
features = ['PM2_5_(ppm)', 'PM10_(ppm)', 'SO2_(ppm)', 'CO_(ppm)', 'NO2_(ppm)', 'O3_8h_(ppm)']
target_aqi = 'AQI'
target_quality = 'Quality_Level'

# Split into training and test sets
X = data[features]
y_aqi = data[target_aqi]
y_quality = data[target_quality]
X_train, X_test, y_aqi_train, y_aqi_test, y_quality_train, y_quality_test = train_test_split(X, y_aqi, y_quality, test_size=0.2, random_state=42)
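The split passes the feature matrix and both targets to a single train_test_split call, so the same rows end up in the test set for all three objects. A minimal sketch of that behavior, assuming a made-up toy frame (its values are invented and are not rows from aqi.csv):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data; every value here is invented
toy = pd.DataFrame({
    "PM2_5_(ppm)": range(10),
    "AQI": range(100, 110),
    "Quality_Level": list("AABBCAABBC"),
})

X = toy[["PM2_5_(ppm)"]]
y_aqi = toy["AQI"]
y_quality = toy["Quality_Level"]

# One call splits every array using the same shuffled row order
X_tr, X_te, ya_tr, ya_te, yq_tr, yq_te = train_test_split(
    X, y_aqi, y_quality, test_size=0.2, random_state=42
)

# The test rows of all three objects carry the same index labels
aligned = list(X_te.index) == list(ya_te.index) == list(yq_te.index)
print(len(X_tr), len(X_te), aligned)
```

Splitting the two targets in separate calls would also work with a shared random_state, but a single call makes the alignment explicit.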

(2) Model Building, Prediction, and Optimization

Task 1: Random Forest
# 1. Build and train the random forest regression model
rf_reg = RandomForestRegressor(random_state=42)
rf_reg.fit(X_train, y_aqi_train)
# 2. Predict AQI values
y_aqi_pred = rf_reg.predict(X_test)
print('Random forest regression predicted AQI values:', y_aqi_pred)
# 3. Compute the evaluation metrics
mae = mean_absolute_error(y_aqi_test, y_aqi_pred)
mse = mean_squared_error(y_aqi_test, y_aqi_pred)
rmse = np.sqrt(mse)
mape = mean_absolute_percentage_error(y_aqi_test, y_aqi_pred)
r2 = r2_score(y_aqi_test, y_aqi_pred)
print("Random forest regression evaluation metrics:")
print(f'MAE: {mae}, \nMSE: {mse}, \nRMSE: {rmse}, \nMAPE: {mape}, \nR2_SCORE: {r2}')

Random forest regression predicted AQI values: [124.48 81.15 72.38 71.58 45.77 82.31 34.6 31.42 80.58 83.7
47.59 74.32 75.95 47.13 39.53 33.33 45.76 75.11 80.87 42.57
39.87 58.52 44.34 45.51 60.06 40.73 51.15 45.06 51.46 43.2
70.71 37.2 127.29 31.26 86.79 43.56 90.83 66. 111.21 80.26
33.47 53.14 47.4 130.66 73.89 47.37 47.58 47.16 66.56 39.78
44.36 115.1 105.81 110.77 74.06]
Random forest regression evaluation metrics:
MAE: 3.0150909090909086,
MSE: 25.965259999999997,
RMSE: 5.095611837650116,
MAPE: 0.05528859263399927,
R2_SCORE: 0.965415994168552
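To make the five regression metrics concrete, here is a small worked example that computes them directly with NumPy. The four value pairs are invented for illustration and are not taken from the dataset. Note that scikit-learn's mean_absolute_percentage_error returns a fraction, so the MAPE of about 0.055 reported above corresponds to roughly 5.5%.

```python
import numpy as np

# Invented values, for illustration only
y_true = np.array([50.0, 80.0, 100.0, 40.0])
y_pred = np.array([55.0, 76.0, 110.0, 40.0])

err = y_true - y_pred                          # [-5, 4, -10, 0]

mae = np.mean(np.abs(err))                     # mean absolute error
mse = np.mean(err ** 2)                        # mean squared error
rmse = np.sqrt(mse)                            # root mean squared error
mape = np.mean(np.abs(err) / np.abs(y_true))   # fraction, not percent

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(err ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f'MAE: {mae}, MSE: {mse}, RMSE: {rmse}, MAPE: {mape}, R2: {r2}')
```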

# 4. Tune the model with GridSearchCV grid search
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=rf_reg, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
# Run the grid search
grid_search.fit(X_train, y_aqi_train)
# Get the best parameter combination
best_params = grid_search.best_params_
print(f'Best parameter combination: {best_params}')

Best parameter combination: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
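Grid search is exhaustive, so its cost grows with the product of the list lengths. As a quick check of how much work the random forest grid implies, scikit-learn's ParameterGrid can enumerate the candidates without fitting anything:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 3 * 3 combinations
n_fits = n_candidates * 5                      # each candidate is fit once per fold (cv=5)
print(n_candidates, n_fits)
```

With the default refit=True, GridSearchCV then fits the winning combination once more on the full training set.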

# 5. Retrain the model with the best parameters
best_rf_reg = RandomForestRegressor(**best_params, random_state=42)
best_rf_reg.fit(X_train, y_aqi_train)
# Predict with the tuned model and evaluate it
y_aqi_pred_optimized = best_rf_reg.predict(X_test)
print('Optimized random forest regression predicted AQI values:', y_aqi_pred_optimized)
mae_optimized = mean_absolute_error(y_aqi_test, y_aqi_pred_optimized)
mse_optimized = mean_squared_error(y_aqi_test, y_aqi_pred_optimized)
rmse_optimized = np.sqrt(mse_optimized)
mape_optimized = mean_absolute_percentage_error(y_aqi_test, y_aqi_pred_optimized)
r2_optimized = r2_score(y_aqi_test, y_aqi_pred_optimized)
print("Optimized random forest regression evaluation metrics:")
print(f'MAE: {mae_optimized}, \nMSE: {mse_optimized}, \nRMSE: {rmse_optimized}, \nMAPE: {mape_optimized}, \nR2_SCORE: {r2_optimized}')

Optimized random forest regression predicted AQI values: [124.48 81.15 72.38 71.58 45.77 82.31 34.6 31.42 80.58 83.7
47.59 74.32 75.95 47.13 39.53 33.33 45.76 75.11 80.87 42.57
39.87 58.52 44.34 45.51 60.06 40.73 51.15 45.06 51.46 43.2
70.71 37.2 127.29 31.26 86.79 43.56 90.83 66. 111.21 80.26
33.47 53.14 47.4 130.66 73.89 47.37 47.58 47.16 66.56 39.78
44.36 115.1 105.81 110.77 74.06]
Optimized random forest regression evaluation metrics:
MAE: 3.0150909090909086,
MSE: 25.965259999999997,
RMSE: 5.095611837650116,
MAPE: 0.05528859263399927,
R2_SCORE: 0.965415994168552

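# Compare the metrics before and after optimization
print("Metrics before and after optimization:")
print(f"Before: MAE: {mae}, MSE: {mse}, RMSE: {rmse}, MAPE: {mape}, R2_SCORE: {r2}")
print(f"After: MAE: {mae_optimized}, MSE: {mse_optimized}, RMSE: {rmse_optimized}, MAPE: {mape_optimized}, R2_SCORE: {r2_optimized}")

Metrics before and after optimization:
Before: MAE: 3.0150909090909086, MSE: 25.965259999999997, RMSE: 5.095611837650116, MAPE: 0.05528859263399927, R2_SCORE: 0.965415994168552
After: MAE: 3.0150909090909086, MSE: 25.965259999999997, RMSE: 5.095611837650116, MAPE: 0.05528859263399927, R2_SCORE: 0.965415994168552

Tuning did not improve the model. The selected n_estimators=100, min_samples_split=2, and min_samples_leaf=1 are the scikit-learn defaults, and max_depth=20 evidently does not constrain the trees on this dataset, so the tuned model's predictions and metrics are identical to the untuned model's.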
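One plausible reason the tuned and untuned random forests agree exactly is that a depth cap larger than the depth the trees reach on their own has no effect. The sketch below illustrates this on synthetic data from make_regression (an assumption; it does not use aqi.csv): it measures the deepest tree in a default forest, then refits with a cap above that depth and compares predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with six features, like the pollutant columns
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# Default forest: max_depth=None, so trees grow until their leaves cannot be split
default_rf = RandomForestRegressor(random_state=42).fit(X, y)
deepest = max(tree.get_depth() for tree in default_rf.estimators_)

# A cap above the depth the trees reach on their own should change nothing
capped_rf = RandomForestRegressor(max_depth=deepest + 5, random_state=42).fit(X, y)
same = np.allclose(default_rf.predict(X), capped_rf.predict(X))
print(deepest, same)
```

If this is what happened here, a grid with smaller max_depth values or a different max_features would be needed before the search could select a model that actually differs from the default.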

# 6. Output each feature's importance using the feature_importances_ attribute
importances = best_rf_reg.feature_importances_
feature_importances = pd.Series(importances, index=features).sort_values(ascending=False)
print(feature_importances)
# 7. The tuned model's predictions (y_aqi_pred_optimized) were printed in step 5
# 8. Visualize predicted values against test values
plt.figure(figsize=(10, 6))
plt.scatter(y_aqi_test, y_aqi_pred_optimized, alpha=0.5)
plt.xlabel('Actual AQI')
plt.ylabel('Predicted AQI')
plt.title('Actual vs Predicted AQI')
plt.show()

PM10_(ppm) 0.400419
PM2_5_(ppm) 0.291729
O3_8h_(ppm) 0.288429
CO_(ppm) 0.008184
NO2_(ppm) 0.007934
SO2_(ppm) 0.003305
dtype: float64

[Figure: scatter plot of actual vs. predicted AQI]
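The task list also asks for a random forest classification model, which this write-up does not show. A minimal sketch of what that step might look like follows. It runs on synthetic three-class data from make_classification rather than on aqi.csv, so the data and the printed scores are assumptions for illustration only; with the real data, X_train and y_quality_train would take the place of the synthetic arrays.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the six pollutant features
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Build, train, and predict, mirroring the regression workflow
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_tr, y_tr)
y_pred = rf_clf.predict(X_te)

# Evaluate with the same metric functions the task list names
cm = confusion_matrix(y_te, y_pred)
acc = accuracy_score(y_te, y_pred)
f1 = f1_score(y_te, y_pred, average='weighted')
print(cm)
print(f'Accuracy: {acc:.3f}, weighted F1: {f1:.3f}')
```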

Task 2: GBM
Regression model
# 1. Build and train a model with GradientBoostingRegressor()
gb_reg = GradientBoostingRegressor(random_state=42)
gb_reg.fit(X_train, y_aqi_train)
# 2. Use the model to predict AQI values
y_aqi_pred = gb_reg.predict(X_test)
print('GBM regression predicted AQI values:', y_aqi_pred)
# 3. Evaluate the model
mae = mean_absolute_error(y_aqi_test, y_aqi_pred)
mse = mean_squared_error(y_aqi_test, y_aqi_pred)
rmse = np.sqrt(mse)
mape = mean_absolute_percentage_error(y_aqi_test, y_aqi_pred)
r2 = r2_score(y_aqi_test, y_aqi_pred)
print("Gradient Boosting Regression Model Evaluation Metrics:")
print(f'MAE: {mae}, \nMSE: {mse}, \nRMSE: {rmse}, \nMAPE: {mape}, \nR2_SCORE: {r2}')

GBM regression predicted AQI values: [122.57416247 83.36233285 73.90280417 71.61249735 45.90407098
83.09407824 35.38809475 32.1115523 81.92797541 83.40916295
48.82405535 74.28270394 74.96495747 45.69629863 39.59354642
33.09971192 45.41896268 75.52727318 81.71507209 47.02496198
41.96486507 59.76085878 45.10753769 46.1912337 59.05166283
49.05189862 53.29885368 47.58476507 46.59894793 42.17298408
70.67172663 35.57436497 130.76443134 33.12142879 85.93142525
41.04272972 88.25804535 64.42863259 112.47587802 80.12500147
32.96123373 55.09504267 50.37469809 125.99062665 75.72767345
48.10707457 51.29551088 47.94867709 70.66198919 40.51320902
40.7250176 115.95276244 114.3584965 112.04106305 74.86570745]
Gradient Boosting Regression Model Evaluation Metrics:
MAE: 3.017067274506405,
MSE: 19.567961603563685,
RMSE: 4.4235688763218874,
MAPE: 0.058486439950287586,
R2_SCORE: 0.9739367717401174

# 4. Tune the model with GridSearchCV grid search
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=gb_reg, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
# Run the grid search
grid_search.fit(X_train, y_aqi_train)
# Get the best parameter combination
best_params = grid_search.best_params_
print(f'Best Parameters: {best_params}')

Best Parameters: {'learning_rate': 0.1, 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 300}

# 5. Retrain the model with the best parameters
best_gb_reg = GradientBoostingRegressor(**best_params, random_state=42)
best_gb_reg.fit(X_train, y_aqi_train)
# Predict with the tuned model
y_aqi_pred_optimized = best_gb_reg.predict(X_test)
print('Optimized model predictions:', y_aqi_pred_optimized)
# Compute the tuned model's evaluation metrics
mae_optimized = mean_absolute_error(y_aqi_test, y_aqi_pred_optimized)
mse_optimized = mean_squared_error(y_aqi_test, y_aqi_pred_optimized)
rmse_optimized = np.sqrt(mse_optimized)
mape_optimized = mean_absolute_percentage_error(y_aqi_test, y_aqi_pred_optimized)
r2_optimized = r2_score(y_aqi_test, y_aqi_pred_optimized)
print("Optimized gradient boosting regression evaluation metrics:")
print(f'Optimized MAE: {mae_optimized}, \nOptimized MSE: {mse_optimized}, \nOptimized RMSE: {rmse_optimized}, \nOptimized MAPE: {mape_optimized}, \nOptimized R2_SCORE: {r2_optimized}')

Optimized model predictions: [124.50273685 83.46470773 76.67313626 71.43908717 46.06087546
82.48270218 35.45781481 30.29347664 80.68483493 83.63975494
48.01910073 75.04391558 74.6780025 44.02048381 39.16875902
31.57326064 45.52152266 74.54621085 81.98742113 41.15229431
40.05067005 60.0349372 43.40693783 42.44777993 60.0874834
46.4533299 53.98613726 45.00781228 51.56679542 38.97574632
73.97473389 36.03646256 131.65412729 30.82872235 86.88627133
44.17166092 89.64827072 66.71578258 112.06193027 80.82544043
32.13607404 53.33558888 48.52689834 125.55765644 77.38396113
48.52990476 51.07272122 48.89955218 69.66154718 40.70715896
49.21862157 117.74301294 107.39395475 111.89285961 75.11097803]
Optimized gradient boosting regression evaluation metrics:
Optimized MAE: 2.7372135954667853,
Optimized MSE: 20.880137541908642,
Optimized RMSE: 4.569478913608054,
Optimized MAPE: 0.048316013543643156,
Optimized R2_SCORE: 0.9721890403365572

# Compare the metrics before and after optimization
print("Comparison of metrics before and after optimization:")
print(f"Before: MAE: {mae}, MSE: {mse}, RMSE: {rmse}, MAPE: {mape}, R2_SCORE: {r2}")
print(f"After: MAE: {mae_optimized}, MSE: {mse_optimized}, RMSE: {rmse_optimized}, MAPE: {mape_optimized}, R2_SCORE: {r2_optimized}")

Comparison of metrics before and after optimization:
Before: MAE: 3.017067274506405, MSE: 19.567961603563685, RMSE: 4.4235688763218874, MAPE: 0.058486439950287586, R2_SCORE: 0.9739367717401174
After: MAE: 2.7372135954667853, MSE: 20.880137541908642, RMSE: 4.569478913608054, MAPE: 0.048316013543643156, R2_SCORE: 0.9721890403365572
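In this comparison the tuned GBM has a lower test MAE and MAPE but a slightly higher test MSE. GridSearchCV ranks candidates by their mean cross-validated score on the training folds (here negative MSE), not by the test set, so the winner is not guaranteed to score best on the held-out data. The sketch below shows how to read the cross-validated score next to the test score. It uses synthetic data and a deliberately tiny grid, both of which are assumptions chosen so that it runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; not the air quality dataset
X, y = make_regression(n_samples=150, n_features=6, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Deliberately tiny grid so the sketch finishes quickly
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid={'n_estimators': [50, 100], 'max_depth': [2, 3]},
    cv=3,
    scoring='neg_mean_squared_error',
)
grid.fit(X_tr, y_tr)

cv_mse = -grid.best_score_                                # mean MSE across the CV folds
test_mse = mean_squared_error(y_te, grid.predict(X_te))   # MSE on the held-out test set
print(grid.best_params_, cv_mse, test_mse)
```

The per-candidate scores are available in grid.cv_results_, which helps judge whether the winner is clearly better than the default or only marginally so.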

# 6. Output the feature importances
importances = best_gb_reg.feature_importances_
feature_importances = pd.Series(importances, index=features).sort_values(ascending=False)
print(feature_importances)
# Visualize predicted values against test values
plt.figure(figsize=(10, 6))
plt.scatter(y_aqi_test, y_aqi_pred_optimized, alpha=0.5)
plt.xlabel('Actual AQI')
plt.ylabel('Predicted AQI')
plt.title('Actual vs Predicted AQI')
plt.show()

PM10_(ppm) 0.422780
O3_8h_(ppm) 0.303769
PM2_5_(ppm) 0.265263
NO2_(ppm) 0.006753
SO2_(ppm) 0.001192
CO_(ppm) 0.000243
dtype: float64

[Figure: scatter plot of actual vs. predicted AQI]

Classification model
# 1. Build and train the model
gbm_clf = GradientBoostingClassifier(random_state=42)
gbm_clf.fit(X_train, y_quality_train)
# 2. Predict the air quality level
y_quality_pred = gbm_clf.predict(X_test)
print('GBM classification predictions:', y_quality_pred)
# 3. Evaluate the model
conf_matrix = confusion_matrix(y_quality_test, y_quality_pred)
accuracy = accuracy_score(y_quality_test, y_quality_pred)
precision = precision_score(y_quality_test, y_quality_pred, average='weighted')
recall = recall_score(y_quality_test, y_quality_pred, average='weighted')
f1 = f1_score(y_quality_test, y_quality_pred, average='weighted')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Accuracy: {accuracy}, \nPrecision: {precision}, \nRecall: {recall}, \nF1 Score: {f1}')

GBM classification predictions: ['C' 'B' 'B' 'B' 'A' 'B' 'A' 'A' 'B' 'B' 'A' 'B' 'B' 'A' 'A' 'A' 'A' 'B'
'B' 'A' 'A' 'B' 'A' 'A' 'B' 'B' 'B' 'B' 'B' 'A' 'B' 'A' 'C' 'A' 'B' 'A'
'B' 'B' 'C' 'B' 'A' 'B' 'B' 'C' 'B' 'A' 'A' 'B' 'B' 'A' 'B' 'C' 'C' 'C'
'B']
Confusion Matrix:
[[20  3  0]
 [ 0 25  0]
 [ 0  0  7]]
Accuracy: 0.9454545454545454,
Precision: 0.9512987012987013,
Recall: 0.9454545454545454,
F1 Score: 0.9450955363197574
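The weighted precision, recall, and F1 values can be recovered by hand from the confusion matrix, which is a useful check on what average='weighted' means. The sketch below uses the matrix reported for the untuned GBM classifier (rows are true classes, columns are predicted classes, following scikit-learn's convention) and weights each class's score by its number of true samples.

```python
import numpy as np

# Confusion matrix reported for the untuned GBM classifier
cm = np.array([[20, 3, 0],
               [0, 25, 0],
               [0, 0, 7]])

tp = np.diag(cm)              # correctly classified samples per class
support = cm.sum(axis=1)      # true samples per class: 23, 25, 7
predicted = cm.sum(axis=0)    # predicted samples per class: 20, 28, 7

precision_c = tp / predicted
recall_c = tp / support
f1_c = 2 * precision_c * recall_c / (precision_c + recall_c)

# 'weighted' averaging: each class counts in proportion to its support
weights = support / support.sum()
accuracy = tp.sum() / cm.sum()
precision_w = float(np.sum(weights * precision_c))
recall_w = float(np.sum(weights * recall_c))
f1_w = float(np.sum(weights * f1_c))
print(accuracy, precision_w, recall_w, f1_w)
```

Weighted recall always equals accuracy, which is why the two values coincide in every classification result in this write-up.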


# 4. Tune the model
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# Create a StratifiedKFold object (left disabled; cv=2 is used below)
# stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=gbm_clf, param_grid=param_grid, cv=2, scoring='accuracy')
# Run the grid search
grid_search.fit(X_train, y_quality_train)
# Get the best parameter combination
best_params = grid_search.best_params_
print(f'Best Parameters: {best_params}')

Best Parameters: {'learning_rate': 0.1, 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 200}

# 5. Retrain the model with the best parameters
best_gbm_clf = GradientBoostingClassifier(**best_params, random_state=42)
best_gbm_clf.fit(X_train, y_quality_train)
# 6. Predict with the tuned model
y_quality_pred_optimized = best_gbm_clf.predict(X_test)
print('Optimized model predicted air quality levels:', y_quality_pred_optimized)
# 7. Compute the tuned model's evaluation metrics
conf_matrix_optimized = confusion_matrix(y_quality_test, y_quality_pred_optimized)
accuracy_optimized = accuracy_score(y_quality_test, y_quality_pred_optimized)
precision_optimized = precision_score(y_quality_test, y_quality_pred_optimized, average='weighted')
recall_optimized = recall_score(y_quality_test, y_quality_pred_optimized, average='weighted')
f1_optimized = f1_score(y_quality_test, y_quality_pred_optimized, average='weighted')
print(f'Optimized Confusion Matrix:\n{conf_matrix_optimized}')
print(f'Optimized Accuracy: {accuracy_optimized}, \nOptimized Precision: {precision_optimized}, \nOptimized Recall: {recall_optimized}, \nOptimized F1 Score: {f1_optimized}')
# Compare the metrics before and after optimization
print("Comparison of metrics before and after optimization:\n")
print(f"Before: Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1 Score: {f1}")
print(f"After: Accuracy: {accuracy_optimized}, Precision: {precision_optimized}, Recall: {recall_optimized}, F1 Score: {f1_optimized}")

Optimized model predicted air quality levels: ['C' 'B' 'B' 'B' 'A' 'B' 'A' 'A' 'B' 'B' 'A' 'B' 'B' 'A' 'A' 'A' 'A' 'B'
'B' 'A' 'A' 'B' 'A' 'A' 'B' 'B' 'B' 'A' 'A' 'A' 'B' 'A' 'C' 'A' 'B' 'A'
'B' 'B' 'C' 'B' 'A' 'B' 'A' 'C' 'B' 'A' 'A' 'B' 'B' 'A' 'B' 'C' 'C' 'C'
'B']
Optimized Confusion Matrix:
[[22  1  0]
 [ 1 24  0]
 [ 0  0  7]]
Optimized Accuracy: 0.9636363636363636,
Optimized Precision: 0.9636363636363636,
Optimized Recall: 0.9636363636363636,
Optimized F1 Score: 0.9636363636363636
Comparison of metrics before and after optimization:

Before: Accuracy: 0.9454545454545454, Precision: 0.9512987012987013, Recall: 0.9454545454545454, F1 Score: 0.9450955363197574
After: Accuracy: 0.9636363636363636, Precision: 0.9636363636363636, Recall: 0.9636363636363636, F1 Score: 0.9636363636363636
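The GBM classifier's grid search code contains a disabled StratifiedKFold line. For reference, here is a small sketch of what stratification does, assuming invented labels with a rare class 'C' (the class counts are made up and are not the dataset's): every validation fold receives the same share of each class.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Invented, imbalanced labels: 40 'A', 40 'B', 10 'C'
y = np.array(['A'] * 40 + ['B'] * 40 + ['C'] * 10)
X = np.zeros((len(y), 1))  # the features play no role in how folds are formed

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count how many 'C' samples land in each validation fold
c_counts = [int(np.sum(y[val_idx] == 'C')) for _, val_idx in skf.split(X, y)]
print(c_counts)
```

Such an object can be passed to GridSearchCV as cv=skf. When cv is an integer and the estimator is a classifier, scikit-learn already uses stratified folds without shuffling, so the main gains from an explicit object are shuffling and control over the seed.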

Task 3: LightGBM
# 1. Regression model
# 1.1 Build and train a regression model with LGBMRegressor()
lgb_reg = lgb.LGBMRegressor(random_state=42)
lgb_reg.fit(X_train, y_aqi_train)
# 1.2 Use the model to predict AQI values
y_aqi_pred = lgb_reg.predict(X_test)
# 1.3 Evaluate the model
mse = mean_squared_error(y_aqi_test, y_aqi_pred)
r2 = r2_score(y_aqi_test, y_aqi_pred)
print("LGBMRegressor Model Evaluation Metrics:")
print(f'Mean Squared Error (MSE): {mse}')
print(f'R^2 Score: {r2}')

LGBMRegressor Model Evaluation Metrics:
Mean Squared Error (MSE): 70.6065599506794
R^2 Score: 0.9059567406190894

from sklearn.preprocessing import LabelEncoder

# 2. Classification model
# 2.1 Build and train a classification model with LGBMClassifier()
# Convert the class labels to integers
label_encoder = LabelEncoder()
y_quality_train_encoded = label_encoder.fit_transform(y_quality_train)
y_quality_test_encoded = label_encoder.transform(y_quality_test)
lgb_clf = lgb.LGBMClassifier(random_state=42)
lgb_clf.fit(X_train, y_quality_train_encoded)
# 2.2 Use the model to predict the air quality level
y_quality_pred = lgb_clf.predict(X_test)
print('Predicted air quality levels:', y_quality_pred)
# 2.3 Evaluate the model
conf_matrix = confusion_matrix(y_quality_test_encoded, y_quality_pred)
accuracy = accuracy_score(y_quality_test_encoded, y_quality_pred)
precision = precision_score(y_quality_test_encoded, y_quality_pred, average='weighted')
recall = recall_score(y_quality_test_encoded, y_quality_pred, average='weighted')
f1 = f1_score(y_quality_test_encoded, y_quality_pred, average='weighted')
print("LGBMClassifier Model Evaluation Metrics:")
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')

Predicted air quality levels: [2 1 1 1 0 1 0 0 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 0 2 0 1 0 1
1 2 1 0 1 0 2 1 0 0 1 1 0 1 2 1 2 1]
LGBMClassifier Model Evaluation Metrics:
Confusion Matrix:
[[22  1  0]
 [ 1 24  0]
 [ 0  1  6]]
Accuracy: 0.9454545454545454
Precision: 0.9468531468531469
Recall: 0.9454545454545454
F1 Score: 0.9452900041135335
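The LightGBM predictions are printed as the integer codes produced by LabelEncoder, which are harder to read than the original levels. A fitted encoder can map them back with inverse_transform. A minimal sketch, assuming made-up labels and a made-up prediction array in place of y_quality_train and the output of lgb_clf.predict:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for y_quality_train
label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(['B', 'A', 'C', 'A', 'B'])

# Pretend these integer codes came from a classifier's predict() call
y_pred_encoded = np.array([2, 1, 1, 0])

# Map the codes back to the original class labels
y_pred_labels = label_encoder.inverse_transform(y_pred_encoded)
print(list(label_encoder.classes_), list(y_pred_labels))
```

The codes follow the sorted order of the labels, so with levels 'A', 'B', and 'C' the codes 0, 1, and 2 correspond to 'A', 'B', and 'C'.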

# 3. Tune the models if the results are unsatisfactory
# 3.1 Define the parameter grids
param_grid_reg = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'num_leaves': [31, 63, 127],
    'max_depth': [-1, 5, 10],
    'min_child_samples': [20, 50, 100]
}
param_grid_clf = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'num_leaves': [31, 63, 127],
    'max_depth': [-1, 5, 10],
    'min_child_samples': [20, 50, 100]
}
# 3.2 Create the GridSearchCV objects
grid_search_reg = GridSearchCV(estimator=lgb_reg, param_grid=param_grid_reg, cv=5, scoring='neg_mean_squared_error')
grid_search_clf = GridSearchCV(estimator=lgb_clf, param_grid=param_grid_clf, cv=5, scoring='accuracy')
# 3.3 Run the grid searches
grid_search_reg.fit(X_train, y_aqi_train)
grid_search_clf.fit(X_train, y_quality_train_encoded)
# 3.4 Get the best parameter combinations
best_params_reg = grid_search_reg.best_params_
best_params_clf = grid_search_clf.best_params_
print(f'Best Parameters for Regression: {best_params_reg}')
print(f'Best Parameters for Classification: {best_params_clf}')

Best Parameters for Regression: {'learning_rate': 0.1, 'max_depth': 5, 'min_child_samples': 20, 'n_estimators': 100, 'num_leaves': 31}
Best Parameters for Classification: {'learning_rate': 0.1, 'max_depth': 5, 'min_child_samples': 20, 'n_estimators': 100, 'num_leaves': 31}

# 3.5 Retrain the models with the best parameters
best_lgb_reg = lgb.LGBMRegressor(**best_params_reg, random_state=42)
best_lgb_clf = lgb.LGBMClassifier(**best_params_clf, random_state=42)
best_lgb_reg.fit(X_train, y_aqi_train)
best_lgb_clf.fit(X_train, y_quality_train_encoded)
# 3.6 Predict with the tuned models
y_aqi_pred_optimized = best_lgb_reg.predict(X_test)
y_quality_pred_optimized = best_lgb_clf.predict(X_test)
print('Optimized model predicted AQI values:', y_aqi_pred_optimized)
print('Optimized model predicted air quality levels:', y_quality_pred_optimized)
# 3.7 Evaluate the tuned models
mse_optimized = mean_squared_error(y_aqi_test, y_aqi_pred_optimized)
r2_optimized = r2_score(y_aqi_test, y_aqi_pred_optimized)
conf_matrix_optimized = confusion_matrix(y_quality_test_encoded, y_quality_pred_optimized)
accuracy_optimized = accuracy_score(y_quality_test_encoded, y_quality_pred_optimized)
precision_optimized = precision_score(y_quality_test_encoded, y_quality_pred_optimized, average='weighted')
recall_optimized = recall_score(y_quality_test_encoded, y_quality_pred_optimized, average='weighted')
f1_optimized = f1_score(y_quality_test_encoded, y_quality_pred_optimized, average='weighted')
print("Optimized LGBMRegressor Model Evaluation Metrics:")
print(f'Optimized Mean Squared Error (MSE): {mse_optimized}')
print(f'Optimized R^2 Score: {r2_optimized}')
print("Optimized LGBMClassifier Model Evaluation Metrics:")
print(f'Optimized Confusion Matrix:\n{conf_matrix_optimized}')
print(f'Optimized Accuracy: {accuracy_optimized}')
print(f'Optimized Precision: {precision_optimized}')
print(f'Optimized Recall: {recall_optimized}')
print(f'Optimized F1 Score: {f1_optimized}')

Optimized model predicted AQI values: [119.6722501 90.43625783 76.95094102 69.7134339 45.49522429
85.11016632 35.29020956 34.373013 76.12352252 86.39110431
48.60966258 71.83512479 73.98876859 44.13139587 40.82771554
34.96190592 45.33698962 73.97657317 84.40383692 48.74370587
42.31917891 54.61740284 43.89328402 50.84420449 61.99838848
44.00117867 54.84723723 47.00982841 47.98332788 49.8258541
62.28614705 36.04575205 113.75560249 34.31105093 88.98552298
45.43941569 106.8158533 62.86787307 111.01787045 82.98067324
34.80876636 65.3185259 50.05687814 115.46064086 84.07845619
49.74122766 52.93800566 54.78650467 54.40771277 40.1914266
36.17207261 107.11934225 97.1210987 100.3162011 74.79308805]
Optimized model predicted air quality levels: [2 1 1 1 0 1 0 0 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 0 2 0 1 0 1
1 2 1 0 1 0 2 1 0 0 1 1 0 0 2 1 2 1]
Optimized LGBMRegressor Model Evaluation Metrics:
Optimized Mean Squared Error (MSE): 75.10416813845606
Optimized R^2 Score: 0.8999662245297593
Optimized LGBMClassifier Model Evaluation Metrics:
Optimized Confusion Matrix:
[[22  1  0]
 [ 2 23  0]
 [ 0  1  6]]
Optimized Accuracy: 0.9272727272727272
Optimized Precision: 0.9287878787878787
Optimized Recall: 0.9272727272727272
Optimized F1 Score: 0.9271536973664632


# Compare the metrics before and after optimization
print("Comparison of Evaluation Metrics Before and After Optimization:")
print(f"Regression: MSE: {mse} -> {mse_optimized}, R^2 Score: {r2} -> {r2_optimized}")
print(f"Classification: Accuracy: {accuracy} -> {accuracy_optimized}, Precision: {precision} -> {precision_optimized}, Recall: {recall} -> {recall_optimized}, F1 Score: {f1} -> {f1_optimized}")

Comparison of Evaluation Metrics Before and After Optimization:
Regression: MSE: 70.6065599506794 -> 75.10416813845606, R^2 Score: 0.9059567406190894 -> 0.8999662245297593
Classification: Accuracy: 0.9454545454545454 -> 0.9272727272727272, Precision: 0.9468531468531469 -> 0.9287878787878787, Recall: 0.9454545454545454 -> 0.9272727272727272, F1 Score: 0.9452900041135335 -> 0.9271536973664632
