Abstract
Harmful algal blooms (HABs) pose a risk to human and ecosystem health. HAB occurrences are influenced by numerous environmental factors; thus, accurate predictions of HABs, together with explanations of those predictions, are required to implement preventive water quality management. In this study, machine learning (ML) algorithms, i.e., random forest (RF) and extreme gradient boosting (XGB), were employed to predict HABs in eight water supply reservoirs in South Korea. Applying the synthetic minority oversampling technique to address imbalanced HAB occurrences improved the classification performance of the ML algorithms. Although RF and XGB showed only marginal performance differences, XGB exhibited more stable performance in the presence of data imbalance. Furthermore, a post hoc explanation technique, Shapley additive explanation (SHAP), was employed to estimate relative feature importance. Among the input features, water temperature and the concentrations of total nitrogen and total phosphorus were important in predicting HAB occurrences. These results suggest that using ML algorithms together with explanation methods increases the usefulness of predictive models as decision-making tools for water quality management.
HIGHLIGHTS
Machine learning (ML) algorithms were used to predict HAB occurrences in water supply reservoirs.
Synthetic minority oversampling technique (SMOTE) was applied to address the class imbalance problem.
The performance of extreme gradient boosting (XGB) with SMOTE was more stable under class imbalance compared to random forest.
Shapley additive explanation (SHAP) was used to estimate relative feature importance.
Water temperature and nutrients were generally important features.
Graphical Abstract
INTRODUCTION
Climate change and intensified eutrophication driven by anthropogenic activity have caused a worldwide proliferation of cyanobacteria in water bodies, including reservoirs, rivers, and oceans (Paerl & Huisman 2008). Harmful algal blooms (HABs) caused by the proliferation of cyanobacteria are a major concern for water quality and water resource management because they create scums and odorous substances that hinder the stable utilization of water resources (Huisman et al. 2018). In addition, some cyanobacteria produce toxins, including hepatotoxins and neurotoxins, which threaten the health of humans and aquatic ecosystems and further hinder the safe utilization of water resources (Weirich & Miller 2014). Thus, to reduce risk and respond effectively to HABs, a proactive management strategy based on accurate HAB predictions is required (Kim et al. 2021b). Moreover, because the occurrence of HABs is affected by various environmental factors, identifying the major factors that promote cyanobacterial proliferation is important for establishing an effective response strategy.
The proliferation of harmful cyanobacteria is influenced by various factors, e.g., meteorological (Sinha et al. 2017), water quality (Mellios et al. 2020), and hydrological (Cha et al. 2017) factors. Accordingly, machine learning (ML), which can infer the relationship between input variables and output values from data, has been applied successfully to HAB prediction (Derot et al. 2020; Mellios et al. 2020; Choi et al. 2021; Izadi et al. 2021; Kim et al. 2021a; Shin et al. 2021). For example, Derot et al. (2020) predicted cyanobacteria abundance using random forest (RF), and Izadi et al. (2021) predicted the occurrence of HABs using extreme gradient boosting (XGB), RF, and support vector machines. In addition, Choi et al. (2021) predicted HAB occurrence using a deep neural network, a type of deep learning model, together with RF, and Shin et al. (2021) predicted HAB occurrence using decision tree-based classifiers. Thus, a wide range of ML models, including deep learning, have demonstrated their applicability to bloom forecasting. However, cyanobacterial blooms occur intensively in summer when temperatures are high, which causes data imbalance (Shin et al. 2021). Data imbalance arises when the sample size of certain classes is very small compared with that of other classes (Krawczyk et al. 2016). ML training optimizes the overall performance of a model; however, in the presence of class imbalance, prediction performance for a minority class can be low even when performance for the majority class is high (Chawla et al. 2002). Thus, in this study, we applied an oversampling technique to increase the number of minority class samples and thereby achieve sufficient ML performance for HAB prediction.
Even if accurate predictions can be obtained using ML models, their usefulness as decision-support tools is limited when the predictions are not accompanied by effective explanations. ML models with simple structures, e.g., decision trees, allow direct interpretation of the influence of input variables on prediction results; however, for models with complex structures, it is difficult to explain the contribution of the input variables to the output values. Recently, explainable artificial intelligence has emerged to overcome the 'black-box' nature of ML models (Doran et al. 2017). Among the various explanation techniques, Shapley additive explanation (SHAP) is the most representative post hoc analysis technique (Lundberg & Lee 2017). SHAP can estimate the direction and magnitude of the contribution of the input features to a model's output (Lundberg et al. 2020). In addition, SHAP has been widely used in the water environment field because it provides effective visualizations of analysis results. For example, Cha et al. (2021) employed SHAP to estimate the contribution of environmental factors to the results of a species distribution model, and Kim et al. (2022) utilized SHAP to identify the major spectral bands and band ratios for predicting lake Chl-a concentrations from satellite images. Despite the increasing application of SHAP across water management domains, only a few studies have employed SHAP to explain ML-based HAB predictions (Park et al. 2022). Therefore, in this study, we examine the usefulness of SHAP for identifying the relative importance of input features in predicting HAB occurrences. In South Korea, annual precipitation is concentrated in summer, and dams are constructed to store fresh water in reservoirs. HABs in water source reservoirs incur additional water treatment costs; thus, an algae alert system is used to manage the proliferation of cyanobacteria. The primary goal of this study is to apply interpretable ML to support the algae alert system by identifying common factors influencing HAB occurrence in different reservoirs. The objectives of this study were to (1) collect harmful algae monitoring data from water source reservoirs nationwide, (2) predict the occurrence of HABs using ML models, (3) evaluate the effectiveness of oversampling in improving HAB occurrence prediction performance, and (4) identify relative feature importance in HAB occurrence predictions using SHAP. Here, RF and XGB, which have demonstrated excellent performance in previous studies, were used as prediction models (Derot et al. 2020; Izadi et al. 2021). In addition, the synthetic minority oversampling technique (SMOTE), which has shown superior performance in various applications, was used to mitigate data imbalance (Shin et al. 2021). The results suggest that interpretable ML is highly applicable as a decision-support tool for establishing effective water quality management strategies.
METHODS
Study area and data description
Table 1. Characteristics of target reservoirs

| Reservoir | Latitude | Longitude | No. of stations | Reservoir size (km²) |
|---|---|---|---|---|
| Angye | 36°01′02″ | 129°26′77″ | 1 | 1.4 |
| Daecheong | 36°37′11″ | 127°49′56″ | 3 | 72.8 |
| Gwanggyo | 37°30′37″ | 127°30′01″ | 1 | 0.3 |
| Jinyang | 35°15′10″ | 128°02′93″ | 2 | 29.4 |
| Paldang | 37°59′42″ | 127°34′14″ | 3 | 36.5 |
| Sayeon | 35°16′80″ | 128°03′16″ | 2 | 2.0 |
| Unmun | 37°08′12″ | 127°26′87″ | 2 | 7.8 |
| Yeongcheon | 36°07′02″ | 129°02′10″ | 1 | 6.9 |
The data for cyanobacteria cell counts (cells/mL) were obtained from the cyanobacteria monitoring sites (https://water.nier.go.kr/). The total cyanobacteria cell count was calculated by summing the cell counts of the four major cyanobacteria genera that form HABs in South Korea, i.e., Microcystis, Dolichospermum (Anabaena), Oscillatoria, and Aphanizomenon. Data for environmental variables relevant to HAB occurrences were obtained from multiple sources (Table 2). The water quality variables, obtained from the nearest water quality monitoring sites (https://water.nier.go.kr/), included water temperature (Wtemp; °C), total phosphorus (T-P; mg/L), total nitrogen (T-N; mg/L), PO4-P (mg/L), NO3-N (mg/L), TOC (mg/L), and SS (mg/L). The meteorological variables included daily precipitation (Prec; mm), total irradiance (Irr; MJ/m²), and average wind speed (Wspeed; m/s), and the corresponding data were obtained from the nearest automated surface observing system station (https://data.kma.go.kr/).
Table 2. Data description and summary statistics (mean (S.D.))

| Variable (unit) | Abbreviation | Angye | Daecheong | Gwanggyo | Jinyang | Paldang | Sayeon | Unmun | Yeongcheon |
|---|---|---|---|---|---|---|---|---|---|
| Water temperature (°C) | Wtemp | 23.82 (2.85) | 23.00 (2.79) | 24.31 (2.04) | 18.56 (3.79) | 23.34 (1.89) | 26.02 (2.79) | 21.43 (4.16) | 24.08 (3.06) |
| Daily precipitation (mm) | Prec | 9.32 (23.47) | 14.72 (32.11) | 10.35 (23.13) | 6.96 (15.49) | 7.85 (20.52) | 4.31 (11.03) | 6.44 (14.99) | 10.02 (24.30) |
| Irradiance (MJ/m²) | Irr | 12.14 (6.56) | 15.90 (6.92) | 13.22 (6.72) | 13.77 (6.60) | 13.71 (5.93) | 11.73 (4.80) | 14.67 (6.38) | 12.42 (6.59) |
| Average wind speed (m/s) | Wspeed | 2.49 (1.13) | 1.43 (0.52) | 1.75 (0.71) | 1.02 (0.40) | 1.89 (0.69) | 2.20 (0.84) | 1.98 (0.75) | 2.54 (1.20) |
| Total phosphorus (mg/L) | T-P | 0.01 (0.00) | 0.03 (0.02) | 0.06 (0.03) | 0.04 (0.02) | 0.05 (0.04) | 0.04 (0.01) | 0.02 (0.01) | 0.02 (0.01) |
| Total nitrogen (mg/L) | T-N | 1.15 (0.14) | 1.80 (0.54) | 1.66 (0.45) | 1.29 (0.33) | 2.10 (0.46) | 1.38 (0.59) | 1.44 (0.45) | 1.38 (0.10) |
| Phosphate (mg/L) | PO4-P | 0.01 (0.00) | 0.00 (0.00) | 0.03 (0.02) | 0.01 (0.01) | 0.01 (0.02) | 0.01 (0.01) | 0.01 (0.01) | 0.01 (0.00) |
| Nitrate nitrogen (mg/L) | NO3-N | 0.83 (0.26) | 0.74 (0.29) | 0.75 (0.51) | 0.40 (0.23) | 1.71 (0.41) | 0.46 (0.36) | 1.10 (0.30) | 0.97 (0.30) |
| Total organic carbon (mg/L) | TOC | 3.67 (0.43) | 2.88 (0.57) | 4.68 (1.34) | 2.89 (0.62) | 2.55 (0.49) | 3.67 (0.68) | 2.23 (0.39) | 4.10 (0.55) |
Model development
Data resampling using SMOTE
In this study, SMOTE, a representative oversampling technique, was applied to prevent reduced predictive performance due to data imbalance. Rather than simply replicating samples belonging to the minority class (i.e., bloom occurrence), SMOTE mitigates data imbalance by generating synthetic minority class samples. Specifically, synthetic minority samples are created at random points on the straight lines connecting each measured minority sample with its k-nearest minority neighbors (Chawla et al. 2002). Here, SMOTE was applied only to the training data, and synthetic minority samples were generated until the balance ratio (IR), i.e., the ratio of the number of minority class samples to the number of majority class samples in the training data, reached 1. Kalman filtering and SMOTE were performed using the pykalman 0.9.2 and imblearn 0.8.1 libraries, respectively, in the Python 3.7.12 environment.
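As an illustration, the sketch below shows this resampling step with the imblearn API cited above; the synthetic dataset and variable names are placeholders rather than the study's monitoring records.

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Placeholder imbalanced training data (label 1 = bloom occurrence,
# the minority class); SMOTE is applied to the training data only.
X_train, y_train = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42)

# Oversample the minority class until the minority/majority ratio reaches 1;
# k_neighbors=5 is the imblearn default, assumed here.
smote = SMOTE(sampling_strategy=1.0, k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

print(f'Minority fraction before: {y_train.mean():.2f}')  # ~0.10
print(f'Minority fraction after:  {y_res.mean():.2f}')    # 0.50
```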
Development of ML classifiers
Among the various ML algorithms, RF and XGB, which are ensemble tree classifiers, have shown outstanding performance in bloom prediction (Derot et al. 2020; Izadi et al. 2021). Therefore, RF and XGB were used to predict the occurrence of HABs in each of the eight reservoirs. RF is a bagging-based ensemble classifier that generates multiple decision trees and aggregates them to obtain the final result (Breiman 2001), and it has demonstrated excellent performance in various applications, including algae forecasting (Derot et al. 2020). When building each tree, RF draws random subsamples of the training data that contain only a subset of the input variables, which helps prevent overfitting. Unlike RF, XGB is a boosting-based ensemble classifier (Chen & Guestrin 2016) and an improved version of the gradient boosting decision tree (GBDT), which has been used in various fields owing to its fast learning and excellent performance (Bhattacharya et al. 2020). GBDT generates weak decision trees sequentially, with each tree fitted to reduce the loss of the previous ensemble, thereby minimizing the loss function; XGB additionally includes a regularization term in the loss function to prevent overfitting. In this study, the RF and XGB classifiers were implemented using the scikit-learn 1.0.2 and xgboost 0.90 Python libraries, respectively.
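The two classifiers can be instantiated as sketched below with the scikit-learn and xgboost APIs cited above; the hyperparameter values shown are illustrative picks from the Table 3 ranges, not the tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Placeholder data; in the study, the SMOTE-resampled training set was used.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Bagging: many trees grown on bootstrap subsamples with random feature
# subsets at each split, aggregated by majority vote.
rf = RandomForestClassifier(n_estimators=300, min_samples_split=2,
                            min_samples_leaf=1, random_state=42)

# Boosting: trees added sequentially to minimize a regularized loss;
# gamma and subsample act as additional regularization controls.
xgb = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.05,
                    gamma=0.5, subsample=0.8, colsample_bytree=0.8)

rf.fit(X, y)
xgb.fit(X, y)
print(rf.score(X, y), xgb.score(X, y))  # training accuracy, illustration only
```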
Hyperparameter optimization
Table 3. Hyperparameter search space for RF and XGB (ranges denote [minimum, maximum] of the search interval)

| Model | Parameter | Range |
|---|---|---|
| RF | n_estimators | [100, 500] |
| RF | min_samples_split | [2, 6] |
| RF | min_samples_leaf | [1, 6] |
| XGB | n_estimators | [100, 350] |
| XGB | max_depth | [3, 8] |
| XGB | min_child_weight | [1, 10] |
| XGB | learning_rate | [0.01, 0.08] |
| XGB | gamma | [0.1, 3] |
| XGB | subsample | [0.5, 1] |
| XGB | colsample_bytree | [0.6, 0.9] |
The hyperparameters of RF and XGB were tuned using Bayesian optimization based on the tree-structured Parzen estimator (TPE). TPE models the conditional density of a hyperparameter configuration $x$ given the observed loss $y$ using two densities split at a threshold loss $y^{*}$:

$$p(x \mid y) = \begin{cases} \ell(x), & y < y^{*} \\ g(x), & y \geq y^{*} \end{cases}$$

where $\ell(x)$ is estimated from the iterations whose losses were smaller than $y^{*}$ and $g(x)$ from the remaining iterations. The next candidate configuration is selected by maximizing the expected improvement, which is proportional to

$$EI_{y^{*}}(x) \propto \left( \gamma + \frac{g(x)}{\ell(x)}\,(1 - \gamma) \right)^{-1}$$

where $\gamma = p(y < y^{*})$.
Thus, the optimization process focuses on iterations with smaller loss values, i.e., better performance, and converges to a set of optimal hyperparameters. The average accuracy obtained from 10-fold cross-validation on the training data (with and without SMOTE) was used as the loss function for hyperparameter tuning. The number of optimization iterations was set to 100, and the optimization was implemented using the hyperopt 0.2.7 Python library.
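A sketch of this tuning loop for RF is given below, assuming the hyperopt API and the Table 3 search ranges; hp.quniform draws integer-valued candidates, and the negated mean 10-fold cross-validation accuracy serves as the loss.

```python
from hyperopt import fmin, tpe, hp, Trials
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder training data (SMOTE-resampled in the actual workflow).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# RF search space following the ranges in Table 3.
space = {
    'n_estimators': hp.quniform('n_estimators', 100, 500, 1),
    'min_samples_split': hp.quniform('min_samples_split', 2, 6, 1),
    'min_samples_leaf': hp.quniform('min_samples_leaf', 1, 6, 1),
}

def loss(params):
    model = RandomForestClassifier(
        n_estimators=int(params['n_estimators']),
        min_samples_split=int(params['min_samples_split']),
        min_samples_leaf=int(params['min_samples_leaf']),
        random_state=0)
    # Lower loss corresponds to higher mean 10-fold CV accuracy.
    return -cross_val_score(model, X, y, cv=10, scoring='accuracy').mean()

# 100 optimization iterations, as described above.
best = fmin(fn=loss, space=space, algo=tpe.suggest,
            max_evals=100, trials=Trials())
print(best)
```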
Performance metrics
Here, W is the weight assigned to each performance index, and A, R, and F represent accuracy, recall, and F-measure, respectively. The weight for each index may be set to a value between 0 and 1 such that the weights sum to 1, according to the purpose of the model. In this study, the iMOO scores were calculated with equal weights (0.25) for all four performance indicators (accuracy, AUC, recall, and F-measure). Note that a lower iMOO score indicates higher model performance.
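The iMOO equation itself is not reproduced in this excerpt. Purely as a hedged illustration, the sketch below assumes a weighted-distance form, i.e., the distance of the four metrics from their ideal value of 1, which is consistent with the stated properties (weights summing to 1; lower scores indicating better performance) but should not be read as the authors' exact definition.

```python
import numpy as np

def imoo(accuracy, auc, recall, f_measure,
         weights=(0.25, 0.25, 0.25, 0.25)):
    """Assumed form of the iMOO score: weighted Euclidean distance of
    (A, AUC, R, F) from the ideal point (1, 1, 1, 1). This functional
    form is an assumption; only the weights and the lower-is-better
    behavior are stated in the text."""
    metrics = np.array([accuracy, auc, recall, f_measure])
    w = np.array(weights)  # must sum to 1; equal weights in this study
    return float(np.sqrt(np.sum(w * (1.0 - metrics) ** 2)))

# High accuracy but zero minority-class recall still yields a poor score.
print(round(imoo(0.90, 0.50, 0.00, 0.00), 3))
```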
Model explanations
SHAP explains an individual prediction as the sum of the contributions (SHAP values) of the input features. For a model $f$, the explanation model $g$ is expressed as a linear function of simplified binary features $z' \in \{0,1\}^{|N|}$:

$$g(z') = \phi_0 + \sum_{i=1}^{|N|} \phi_i z'_i$$

where $\phi_0$ is the base value (the average model output) and $\phi_i$ is the SHAP value of the $i$th feature, computed as the Shapley value:

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \left[ f_{S \cup \{i\}}\!\left(x_{S \cup \{i\}}\right) - f_S(x_S) \right]$$

Here, $N$ is the set of all features, and $f_{S \cup \{i\}}(x_{S \cup \{i\}})$ and $f_S(x_S)$ are the predictions of the model with and without the $i$th feature, respectively.
In this study, TreeSHAP was used to estimate the SHAP values (Lundberg et al. 2020). TreeSHAP is applicable to tree-based models; it avoids redundant evaluations by recursively tracking which leaves each subset of input features reaches using the structure of each decision tree. Accordingly, TreeSHAP reduces the computational load of estimating feature contributions from exponential to polynomial complexity while computing exact SHAP values (Lundberg et al. 2020). SHAP was implemented using the shap 0.40.0 library in the Python 3.7.12 environment.
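A minimal sketch of the TreeSHAP step with the shap library cited above follows; the fitted model and data are placeholders for the tuned per-reservoir XGB classifiers.

```python
import shap
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Placeholder model standing in for a tuned per-reservoir XGB classifier.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)

# TreeSHAP exploits the tree structure to compute exact SHAP values in
# polynomial rather than exponential time.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)
print(shap_values.shape)
```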
RESULTS AND DISCUSSION
Characteristics of cyanobacterial bloom and environmental factors in reservoirs
Seasonal variations of measured cyanobacteria cell counts during the summer (black bars) and non-summer (white bars) seasons across the study sites. The red horizontal line indicates the attention level (1,000 cells/mL). Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/wqrj.2022.019.
The study sites are distributed across different watersheds; thus, the environmental factors used as input features differed among the reservoirs (Figure 2 and Table 2). Because the data cover the summer season, all reservoirs commonly exhibited high Wtemp values (average: 23.07 °C). However, the reservoirs with the highest (Sayeon; 26.02 °C) and lowest (Jinyang; 18.56 °C) average water temperatures during the study period differed by 7.46 °C (Table 2). Differences were also observed for the other meteorological features across the study sites: Prec (average: 8.75 mm; range: 4.31–14.72 mm), Irr (average: 13.45 MJ/m²; range: 11.73–15.90 MJ/m²), and Wspeed (average: 1.91 m/s; range: 1.02–2.54 m/s). Water quality features also differed across the study sites. The feature with the largest relative difference was PO4-P (average: 0.011 mg/L; range: 0.005–0.026 mg/L), for which the highest reservoir value was 5.21 times the lowest. The feature with the smallest relative difference, T-N (average: 1.53 mg/L; range: 1.15–2.10 mg/L), exhibited a 1.83-fold difference between the lowest and highest values (Table 2).
Class imbalance for HABs
Comparison of iMOO scores for the ML models with and without SMOTE. Bars indicate the iMOO scores for RF (green), XGB (red), RF with SMOTE (blue), and XGB with SMOTE (purple). Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/wqrj.2022.019.
Relationship between IR and (a) accuracy, (b) AUC, (c) recall, (d) F-measure, and (e) iMOO for RF with SMOTE.
Relationship between IR and (a) accuracy, (b) AUC, (c) recall, (d) F-measure, and (e) iMOO for XGB with SMOTE.
Comparison of model performance
RF and XGB were employed to generate 1-week forecasts of HABs in the eight water supply reservoirs. In the Jinyang, Paldang, Sayeon, and Unmun reservoirs, the bloom occurrence classification accuracy of both RF and XGB improved after applying SMOTE; however, in the Angye, Daecheong, Gwanggyo, and Yeongcheon reservoirs, classification performance did not improve significantly (Table 4). Both RF (accuracy: 0.71–0.93) and XGB (accuracy: 0.71–0.95) exhibited high accuracy without SMOTE. However, the metrics that reflect performance for the minority class, i.e., bloom occurrence, showed lower average values (AUC: 0.63, recall: 0.34, and F-measure: 0.39). After applying SMOTE, RF and XGB exhibited performance improvements for most of the target reservoirs. For example, the Unmun reservoir showed the greatest improvement after the degree of data imbalance was mitigated via SMOTE (average increase in AUC: 0.27, recall: 0.50, and F-measure: 0.67). However, for the Angye reservoir (average change in recall: 0.00, F-measure: 0.00, and AUC: −0.03), little improvement was observed for either RF or XGB (Table 4 and Supplementary Material, Table S1). Note that the sample size for the Angye reservoir was smaller than those of the other reservoirs, and the test set contained only two bloom occurrence samples. Thus, even a single bloom occurrence misclassified as a non-occurrence could cause a considerable performance reduction (López et al. 2013).
Table 4. Performance evaluation of forecasting cyanobacteria bloom occurrence for each study site using RF and XGB with and without SMOTE. Values are reported as before SMOTE / after SMOTE.

| Reservoir | Model | Accuracy | AUC | Recall | F-measure |
|---|---|---|---|---|---|
| Angye | RF | 0.90 / 0.85 | 0.50 / 0.47 | 0.00 / 0.00 | 0.00 / 0.00 |
| | XGB | 0.90 / 0.85 | 0.50 / 0.47 | 0.00 / 0.00 | 0.00 / 0.00 |
| Daecheong | RF | 0.75 / 0.71 | 0.75 / 0.73 | 0.73 / 0.68 | 0.78 / 0.75 |
| | XGB | 0.71 / 0.70 | 0.68 / 0.72 | 0.80 / 0.65 | 0.78 / 0.73 |
| Gwanggyo | RF | 0.90 / 0.95 | 0.75 / 0.88 | 0.50 / 0.75 | 0.67 / 0.86 |
| | XGB | 0.95 / 0.90 | 0.88 / 0.84 | 0.75 / 0.75 | 0.86 / 0.75 |
| Jinyang | RF | 0.80 / 0.78 | 0.78 / 0.76 | 0.65 / 0.65 | 0.73 / 0.71 |
| | XGB | 0.78 / 0.75 | 0.76 / 0.74 | 0.65 / 0.71 | 0.71 / 0.71 |
| Paldang | RF | 0.93 / 0.93 | 0.50 / 0.50 | 0.00 / 0.00 | 0.00 / 0.00 |
| | XGB | 0.93 / 0.93 | 0.50 / 0.62 | 0.00 / 0.25 | 0.00 / 0.33 |
| Sayeon | RF | 0.92 / 0.92 | 0.63 / 0.73 | 0.25 / 0.50 | 0.40 / 0.57 |
| | XGB | 0.89 / 0.92 | 0.61 / 0.84 | 0.25 / 0.75 | 0.33 / 0.67 |
| Unmun | RF | 0.90 / 0.95 | 0.50 / 0.75 | 0.00 / 0.50 | 0.00 / 0.67 |
| | XGB | 0.90 / 0.95 | 0.50 / 0.75 | 0.00 / 0.50 | 0.00 / 0.67 |
| Yeongcheon | RF | 0.71 / 0.81 | 0.60 / 0.60 | 0.33 / 0.67 | 0.40 / 0.67 |
| | XGB | 0.76 / 0.67 | 0.68 / 0.62 | 0.50 / 0.50 | 0.55 / 0.46 |
In this study, the iMOO score was used to compare the performance of the prediction models by study site (Figure 4); a lower iMOO score indicates higher prediction performance. Although differences were observed depending on the study site, XGB exhibited slightly better performance than RF. Without SMOTE, the average iMOO scores across all reservoirs were 0.516 for RF and 0.486 for XGB; after applying SMOTE, they decreased to 0.405 and 0.384, respectively (Figure 4). Thus, applying SMOTE generally improved model performance. When all metrics were integrated, RF exhibited improved performance (average decrease in iMOO score: 0.236) for the Gwanggyo, Sayeon, Unmun, and Yeongcheon reservoirs, and XGB exhibited improved performance (average decrease in iMOO score: 0.240) for the Jinyang, Paldang, Sayeon, and Unmun reservoirs. For the Angye, Daecheong, Gwanggyo, and Yeongcheon reservoirs, the improvement was relatively small, or performance was slightly higher without SMOTE (average difference in iMOO score: 0.03).
Effects of IR on model performance
Even when SMOTE was applied, prediction performance was affected by the degree of imbalance (Figures 5 and 6). For both RF (slope = 0.011) and XGB (slope = 0.021), accuracy tended to increase with increasing IR; however, higher IR values reduced the AUC, recall, and F-measure (Figures 5(a)–5(d) and 6(a)–6(d)). For example, the F-measure of RF (slope = −0.051) and the recall of XGB (slope = −0.041) decreased markedly at higher IR values. The iMOO score, which integrates all evaluation metrics, increased with IR, indicating that the models produced more accurate predictions when the data were less imbalanced. The variation in iMOO score with increasing IR was greater for RF (slope = 0.041) than for XGB (slope = 0.021) (Figures 5(e) and 6(e)). For all metrics, including the iMOO score, XGB exhibited more stable performance than RF in the presence of class imbalance (Figures 5 and 6). One possible explanation is that the sequential learning process of XGB, which minimizes the loss function by reflecting the errors of previous weak classifiers when generating the ensemble, handles data imbalance better than RF (Sun et al. 2019).
Model explanations
In this study, SHAP was employed to estimate the relative importance of the input features in the predictions of XGB with SMOTE. Although the performance of XGB and RF did not differ significantly after applying SMOTE, XGB was selected because its performance was more stable across a wide range of IR values. After applying SHAP to the XGB predictions for each study site, the relative importance (i.e., the mean absolute SHAP value) of water temperature and the nutrient-related water quality variables, including T-N and T-P, was higher than that of the other environmental variables (Table 5). The importance of water temperature (Cha et al. 2017) and nutrients (Richardson et al. 2019) for cyanobacterial blooms has been reported consistently in previous studies.
Table 5. The five most important input features in the prediction of HABs using XGB with SMOTE (values in parentheses are mean absolute SHAP values)

| Reservoir | 1st | 2nd | 3rd | 4th | 5th |
|---|---|---|---|---|---|
| Angye | NO3-N (0.0801) | Wtemp (0.0703) | Prec (0.0629) | SS (0.0572) | T-N (0.0567) |
| Daecheong | SS (0.1064) | PO4-P (0.0604) | TOC (0.0594) | Irr (0.0535) | T-P (0.0338) |
| Gwanggyo | TOC (0.1108) | NO3-N (0.1066) | T-N (0.0655) | T-P (0.0523) | SS (0.0514) |
| Jinyang | NO3-N (0.1396) | TOC (0.0911) | Wtemp (0.0757) | T-N (0.0333) | SS (0.0324) |
| Paldang | TOC (0.1211) | Wspeed (0.0802) | Wtemp (0.0724) | T-N (0.0600) | Irr (0.0560) |
| Sayeon | Wspeed (0.0678) | Prec (0.0668) | Irr (0.0500) | Wtemp (0.0498) | T-N (0.0459) |
| Unmun | PO4-P (0.1924) | T-P (0.1195) | TOC (0.0811) | Wtemp (0.0414) | T-N (0.0298) |
| Yeongcheon | TOC (0.1228) | Wtemp (0.0686) | SS (0.0529) | T-P (0.0464) | T-N (0.0418) |
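The per-site rankings in Table 5 correspond to the mean absolute SHAP value per feature; a sketch of this aggregation, under the same placeholder setup as above, follows. The feature names are listed only to mirror the study's inputs.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Placeholder setup; feature names mirror the study's input variables.
feature_names = ['Wtemp', 'Prec', 'Irr', 'Wspeed', 'T-P',
                 'T-N', 'PO4-P', 'NO3-N', 'TOC', 'SS']
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)

# Relative importance: mean absolute SHAP value per feature, as in Table 5.
mean_abs = np.abs(shap_values).mean(axis=0)
for name, value in sorted(zip(feature_names, mean_abs),
                          key=lambda t: -t[1])[:5]:
    print(f'{name}: {value:.4f}')

# The beeswarm summary plots in the figure below can be drawn with, e.g.:
# shap.summary_plot(shap_values, X, feature_names=feature_names)
```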
SHAP summary plots for forecasting cyanobacteria blooms using XGB with SMOTE. Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/wqrj.2022.019.
CONCLUSIONS
Herein, RF and XGB were employed in combination with SMOTE to predict HAB occurrences in water supply reservoirs. Although both RF and XGB exhibited generally good prediction performance, XGB was more stable across a wide range of IR values. In addition, post hoc explanation results from SHAP indicated that, among the environmental variables, water temperature and the concentrations of T-N and T-P played important roles in predicting HAB occurrences. The use of explainable ML models provided accurate HAB predictions together with explanations of those predictions, increasing the usefulness of these models as decision-making tools for water quality management. However, for a few of the studied reservoirs, performance did not improve significantly even after SMOTE was applied. Thus, in future studies, we plan to implement and evaluate various resampling techniques, which we expect to improve model reliability, enhance explanation results, and maximize the applicability of explainable ML models.
ACKNOWLEDGEMENTS
This research was funded by the Korea Environment Industry & Technology Institute (KEITI) through the project for developing innovative drinking water and wastewater technologies, funded by the Korea Ministry of Environment (MOE) (2020002700001), and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1A2C1009961).
DATA AVAILABILITY STATEMENT
All relevant data are available from online repositories: water quality and cyanobacteria cell count data from https://water.nier.go.kr/, and meteorological data from https://data.kma.go.kr/.
CONFLICT OF INTEREST
The authors declare there is no conflict.