Harmful algal blooms (HABs) pose a risk to human and ecosystem health. HAB occurrences are influenced by numerous environmental factors; thus, accurate predictions of HABs, together with explanations of those predictions, are required to implement preventive water quality management. In this study, machine learning (ML) algorithms, i.e., random forest (RF) and extreme gradient boosting (XGB), were employed to predict HABs in eight water supply reservoirs in South Korea. The use of the synthetic minority oversampling technique to address imbalanced HAB occurrences improved the classification performance of the ML algorithms. Although RF and XGB showed only marginal performance differences, XGB exhibited more stable performance in the presence of data imbalance. Furthermore, a post hoc explanation technique, Shapley additive explanation (SHAP), was employed to estimate relative feature importance. Among the input features, water temperature and the concentrations of total nitrogen and total phosphorus appeared important in predicting HAB occurrences. The results suggest that the use of ML algorithms along with explanation methods increases the usefulness of predictive models as decision-making tools for water quality management.

  • Machine learning (ML) algorithms were used to predict HAB occurrences in water supply reservoirs.

  • Synthetic minority oversampling technique (SMOTE) was applied to address the class imbalance problem.

  • The performance of extreme gradient boosting (XGB) with SMOTE was more stable under class imbalance compared to random forest.

  • Shapley additive explanation (SHAP) was used to estimate relative feature importance.

  • Water temperature and nutrients were generally important features.

Graphical Abstract


Climate change and intensified eutrophication driven by anthropogenic activity have caused a worldwide proliferation of cyanobacteria in all types of water bodies, including reservoirs, rivers, and oceans (Paerl & Huisman 2008). Harmful algal blooms (HABs) caused by the proliferation of cyanobacteria are a major concern for water quality and water resource management because they create scum and odorous substances that hinder the stable utilization of water resources (Huisman et al. 2018). In addition, some cyanobacteria produce toxins, including hepatotoxins and neurotoxins, which threaten the health of humans and aquatic ecosystems and further hinder the safe utilization of water resources (Weirich & Miller 2014). Thus, to reduce risk and respond effectively to HABs, a proactive management strategy based on accurate HAB predictions is required (Kim et al. 2021b). Moreover, because the occurrence of HABs is affected by various environmental factors, identifying the major factors that promote cyanobacteria proliferation is essential to establishing an effective response strategy.

The proliferation of harmful cyanobacteria is influenced by various factors, e.g., meteorological (Sinha et al. 2017), water quality (Mellios et al. 2020), and hydrological (Cha et al. 2017) factors. Accordingly, machine learning (ML), which can infer the relationship between input variables and output values from data, has been applied successfully to HAB prediction (Derot et al. 2020; Mellios et al. 2020; Choi et al. 2021; Izadi et al. 2021; Kim et al. 2021a; Shin et al. 2021). For example, Derot et al. (2020) predicted cyanobacteria abundance using random forest (RF), and Izadi et al. (2021) predicted the occurrence of HABs using extreme gradient boosting (XGB), RF, and support vector machines. In addition, Choi et al. (2021) predicted HAB occurrence using RF and a deep neural network, a type of deep learning model, and Shin et al. (2021) predicted the occurrence of HABs using decision tree-based classifiers. Thus, a wide range of ML models, including deep learning, have demonstrated their applicability to bloom forecasting. However, cyanobacterial blooms occur intensively in summer when temperatures are high, which causes data imbalance (Shin et al. 2021). Data imbalance occurs when the sample size for certain categories is very small compared to that of other categories (Krawczyk et al. 2016). ML training optimizes the overall performance of a model; however, in the presence of class imbalance, prediction performance for a minority class can be low even when it is high for the majority class (Chawla et al. 2002). Thus, in this study, we applied an oversampling technique to increase the number of samples in the minority class and thereby realize sufficient ML performance for HAB prediction.

Even if an accurate prediction can be realized using ML models, the ability to use the corresponding model as a decision-support tool is limited if the prediction results are not accompanied by an effective explanation. ML models with a simple structure, e.g., a single decision tree, allow the influence of input variables on prediction results to be interpreted directly; however, for models with complex structures, it is difficult to explain the contribution of the input variables to the output values. Recently, explainable artificial intelligence has emerged to overcome the ‘black-box’ nature of ML models (Doran et al. 2017). Among the various explanation techniques, Shapley additive explanation (SHAP) is the most representative post hoc analysis technique (Lundberg & Lee 2017). SHAP can estimate the direction and magnitude of the contribution of the input features to a model's output (Lundberg et al. 2020). In addition, SHAP has been widely used in water environment research because it provides effective visualizations of analysis results. For example, Cha et al. (2021) employed SHAP to estimate the contribution of environmental factors to the results of a species distribution model, and Kim et al. (2022) utilized SHAP to identify the major spectral bands and band ratios for predicting lake Chl-a concentrations from satellite images. Despite the increasing application of SHAP across water management domains, few studies have employed SHAP to explain ML-based HAB predictions (Park et al. 2022). Therefore, in this study, we examine the usefulness of SHAP for identifying the relative importance of input features in predicting HAB occurrences. In South Korea, annual precipitation is concentrated in summer, and dams are constructed to store fresh water in reservoirs. HABs in water source reservoirs incur additional water treatment costs; thus, an algae alert system is used to manage the proliferation of cyanobacteria.
The primary goal of this study is to apply interpretable ML to support the algae alert system by identifying common factors influencing HAB occurrence in different reservoirs. The objectives of this study were to (1) collect harmful algae monitoring data from nationwide water source reservoirs, (2) predict the occurrence of HABs using ML models, (3) evaluate the effectiveness of oversampling in improving HAB occurrence prediction performance, and (4) identify relative feature importance in HAB occurrence predictions using SHAP. Here, RF and XGB, which have demonstrated excellent performance in previous studies, were used as prediction models (Derot et al. 2020; Izadi et al. 2021). In addition, the synthetic minority oversampling technique (SMOTE), which has shown superior performance in various applications, was used to mitigate data imbalance (Shin et al. 2021). The results suggest that interpretable ML is highly applicable as a decision-support tool for establishing effective water quality management strategies.

Study area and data description

In this study, we focused on water supply reservoirs in South Korea where the algae alert system has been implemented. Here, monitoring data from the algae alert system from 23 reservoirs were collected from 2016 to 2020. Among these reservoirs, a total of eight reservoirs were selected for modeling by excluding those without water quality monitoring sites (seven reservoirs) and those without cyanobacterial bloom (nine reservoirs). The selected reservoirs include the Paldang and Gwanggyo reservoirs in the Han River watershed, Daecheong reservoir in the Geum River watershed, and Yeongcheon, Angye, Unmun, Sayeon, and Jinyang reservoirs in the Nakdong River watershed (Table 1). A total of 15 cyanobacteria monitoring sites are present in these eight reservoirs. The number of monitoring sites differed among the reservoirs: a single monitoring site for Gwanggyo, Angye, and Yeongcheon reservoirs, two sites for Sayeon, Unmun, and Jinyang reservoirs, and three sites for Daecheong and Paldang reservoirs (Figure 1 and Table 1).
Table 1

Characteristics of target reservoirs

Reservoir    Latitude    Longitude    No. of stations  Reservoir size (km²)
Angye        36°01′02″   129°26′77″   1                1.4
Daecheong    36°37′11″   127°49′56″   3                72.8
Gwanggyo     37°30′37″   127°30′01″   1                0.3
Jinyang      35°15′10″   128°02′93″   2                29.4
Paldang      37°59′42″   127°34′14″   3                36.5
Sayeon       35°16′80″   128°3′16″    2
Unmun        37°08′12″   127°26′87″   2                7.8
Yeongcheon   36°07′02″   129°02′10″   1                6.9
Figure 1

Study sites and location of monitoring sites.


The data for cyanobacteria cell counts (cells/mL) were obtained from the cyanobacteria monitoring sites (https://water.nier.go.kr/). The total cyanobacteria cell count was calculated by summing the cell counts of the four major cyanobacteria genera that form HABs in South Korea, i.e., Microcystis, Dolichospermum (Anabaena), Oscillatoria, and Aphanizomenon. Data for environmental variables relevant to HAB occurrences were obtained from multiple sources (Table 2). The water quality variables were obtained from the nearest water quality monitoring sites (https://water.nier.go.kr/) and included water temperature (Wtemp; °C), total phosphorus (T-P; mg/L), total nitrogen (T-N; mg/L), phosphate (PO4-P; mg/L), nitrate nitrogen (NO3-N; mg/L), total organic carbon (TOC; mg/L), and suspended solids (SS; mg/L). The meteorological variables included daily precipitation (Prec; mm), total irradiance (Irr; MJ/m²), and average wind speed (Wspeed; m/s); the corresponding data were obtained from the nearest automated surface observing system station (https://data.kma.go.kr/).

Table 2

Data description and summary statistics (mean (S.D.))

Variables (unit)             Abbreviation  Angye         Daecheong     Gwanggyo      Jinyang       Paldang       Sayeon        Unmun         Yeongcheon
Water temperature (°C)       Wtemp         23.82 (2.85)  23.00 (2.79)  24.31 (2.04)  18.56 (3.79)  23.34 (1.89)  26.02 (2.79)  21.43 (4.16)  24.08 (3.06)
Daily precipitation (mm)     Prec          9.32 (23.47)  14.72 (32.11) 10.35 (23.13) 6.96 (15.49)  7.85 (20.52)  4.31 (11.03)  6.44 (14.99)  10.02 (24.3)
Irradiance (MJ/m²)           Irr           12.14 (6.56)  15.90 (6.92)  13.22 (6.72)  13.77 (6.6)   13.71 (5.93)  11.73 (4.8)   14.67 (6.38)  12.42 (6.59)
Average wind speed (m/s)     Wspeed        2.49 (1.13)   1.43 (0.52)   1.75 (0.71)   1.02 (0.4)    1.89 (0.69)   2.20 (0.84)   1.98 (0.75)   2.54 (1.2)
Total phosphorus (mg/L)      T-P           0.01 (0)      0.03 (0.02)   0.06 (0.03)   0.04 (0.02)   0.05 (0.04)   0.04 (0.01)   0.02 (0.01)   0.02 (0.01)
Total nitrogen (mg/L)        T-N           1.15 (0.14)   1.80 (0.54)   1.66 (0.45)   1.29 (0.33)   2.1 (0.46)    1.38 (0.59)   1.44 (0.45)   1.38 (0.1)
Phosphate (mg/L)             PO4-P         0.01 (0)      0.00 (0.00)   0.03 (0.02)   0.01 (0.01)   0.01 (0.02)   0.01 (0.01)   0.01 (0.01)   0.01 (0)
Nitrate nitrogen (mg/L)      NO3-N         0.83 (0.26)   0.74 (0.29)   0.75 (0.51)   0.4 (0.23)    1.71 (0.41)   0.46 (0.36)   1.10 (0.3)    0.97 (0.3)
Total organic carbon (mg/L)  TOC           3.67 (0.43)   2.88 (0.57)   4.68 (1.34)   2.89 (0.62)   2.55 (0.49)   3.67 (0.68)   2.23 (0.39)   4.1 (0.55)

Model development

The data used for HAB predictions were limited to the summer months (July–September), during which HABs generally occur in water supply reservoirs. As data preprocessing, the monitoring dates for the input features were matched with those for the cyanobacteria cell counts on a weekly basis. In addition, to generate the output of the ML models, the total cyanobacteria cell counts were classified as HAB occurrences or non-occurrences based on the attention level (>1,000 cells/mL) of the Korean Cyanobacteria Alert System. Missing values of each input feature were imputed using Kalman filtering (Moritz & Bartz-Beielstein 2017). The dataset was randomly divided into training (70%) and test (30%) sets for each of the eight reservoirs. The input features measured one week before the output were input to the models to produce 1-week-ahead HAB forecasts. As ML classifiers, RF and XGB (Breiman 2001; Chen & Guestrin 2016), two commonly used tree ensembles, were employed. Furthermore, SMOTE was adopted to address imbalanced HAB occurrences, and Bayesian optimization was employed to optimize each model's hyperparameters. As a result, four prediction models, i.e., the RF and XGB models with and without SMOTE, were constructed for each of the eight reservoirs (Figure 2).
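The labelling and 1-week lagging steps above can be sketched as follows. The weekly records here are hypothetical illustrations; the study additionally imputed missing values with Kalman filtering and split each reservoir's data 70/30.

```python
# Hypothetical weekly records: (Wtemp, T-N, T-P) measured in week t, and
# the total cyanobacteria cell count (cells/mL) observed in the same week.
features = [(24.1, 1.5, 0.03), (26.3, 1.8, 0.05), (27.0, 2.1, 0.06), (25.2, 1.6, 0.04)]
cell_counts = [800, 1500, 2300, 900]

ATTENTION_LEVEL = 1000  # cells/mL, attention level of the Korean alert system

# Pair week t's features with week t+1's bloom label so that a classifier
# trained on (X, y) produces 1-week-ahead forecasts.
X = features[:-1]
y = [1 if c > ATTENTION_LEVEL else 0 for c in cell_counts[1:]]
```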
Figure 2

Modeling procedure to predict cyanobacterial bloom occurrence.


Data resampling using SMOTE

In this study, SMOTE, a representative oversampling technique, was applied to prevent reduced predictive performance in the event of data imbalance. Rather than simply replicating samples belonging to the minority class (i.e., bloom occurrences), SMOTE mitigates data imbalance by generating synthetic minority class samples. With SMOTE, synthetic minority samples are created at random points on the straight lines that connect each measured minority sample with its k-nearest minority neighbors (Chawla et al. 2002). Here, SMOTE was applied to the training data only, and synthetic minority samples were generated until the balance ratio (IR), i.e., the ratio of the number of minority class samples to the number of majority class samples in the training data, reached a value of 1. Kalman filtering and SMOTE were performed using the pykalman 0.9.2 and imblearn 0.8.1 libraries, respectively, in the Python 3.7.12 environment.
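The core of SMOTE can be illustrated with a minimal pure-Python sketch; the study itself used imblearn's SMOTE implementation, and the toy samples below are hypothetical.

```python
import math
import random

def smote(minority, n_new, k=3, seed=42):
    """Minimal SMOTE sketch: each synthetic sample lies at a random point
    on the segment between a minority sample and one of its k-nearest
    minority neighbours (Chawla et al. 2002)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p != x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()  # random point on the segment x -> nb
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Oversample the minority (bloom) class until the balance ratio reaches 1.
majority = [(float(i), 0.5) for i in range(12)]              # non-occurrences
minority = [(2.0, 3.0), (2.5, 3.5), (3.0, 2.8), (2.2, 3.2)]  # occurrences
minority_balanced = minority + smote(minority, n_new=len(majority) - len(minority))
```

Because each synthetic point is a convex combination of two measured minority samples, the oversampled class stays inside the region spanned by the original bloom observations.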

Development of ML classifiers

Among the various ML algorithms, RF and XGB, which are ensemble tree classifiers, have shown outstanding performance in bloom prediction (Derot et al. 2020; Izadi et al. 2021). Therefore, RF and XGB were used to predict the occurrence of HABs at each of the eight reservoirs. The RF is a bagging-based ensemble classifier that generates multiple single decision trees and aggregates them to obtain the final result (Breiman 2001), and it has demonstrated excellent performance in various applications, e.g., algae forecasting (Derot et al. 2020). When creating each decision tree, the RF draws a random bootstrap subsample of the training data and considers only a random subset of the input variables, which helps prevent overfitting. Unlike the RF, the XGB is a boosting-based ensemble classifier (Chen & Guestrin 2016); it is an improved version of the gradient boosting decision tree (GBDT), which has been used in various fields and exhibits fast learning and excellent performance (Bhattacharya et al. 2020). The GBDT generates weak decision trees sequentially, with each new tree fit to reduce the loss remaining from the previous trees, and the XGB adds a regularization term to the loss function to prevent overfitting. In this study, the RF and XGB classifiers were implemented using the scikit-learn 1.0.2 and xgboost 0.90 Python libraries, respectively.
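The bagging idea behind RF can be sketched with one-level decision trees (stumps) standing in for full trees. This is an illustrative toy, not the scikit-learn implementation; the example data and the temperature cutoff are hypothetical, and a full RF would additionally randomize the features considered at each split.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Fit a one-level decision tree: the single-feature threshold rule
    with the fewest training errors."""
    best = None  # (errors, feature j, threshold t, class predicted when x[j] > t)
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            for hi in (0, 1):
                err = sum((hi if x[j] > t else 1 - hi) != yi
                          for x, yi in zip(X, y))
                if best is None or err < best[0]:
                    best = (err, j, t, hi)
    _, j, t, hi = best
    return lambda x: hi if x[j] > t else 1 - hi

def fit_bagging(X, y, n_trees=25, seed=0):
    """Bagging sketch: train each stump on a bootstrap resample of the
    training data and aggregate predictions by majority vote."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda x: Counter(s(x) for s in stumps).most_common(1)[0][0]

# Hypothetical 1-D example: bloom occurrence (1) above a temperature cutoff.
X = [[20.0], [22.0], [24.0], [26.0], [28.0], [30.0]]
y = [0, 0, 0, 1, 1, 1]
predict = fit_bagging(X, y)
```

Averaging many trees trained on perturbed copies of the data is what reduces the variance of a single decision tree.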

Hyperparameter optimization

Hyperparameter optimization was performed to improve the prediction performance of the models. Three RF hyperparameters, i.e., n_estimators, min_samples_split, and min_samples_leaf, and seven XGB hyperparameters, i.e., n_estimators, max_depth, min_child_weight, learning_rate, gamma, subsample, and colsample_bytree, were optimized (Table 3). Hyperparameter tuning was performed using the tree-structured Parzen estimator (TPE), a Bayesian optimization technique. The TPE sequentially searches for the hyperparameter set with the largest expected improvement (EI), based on the results of previous iterations, as follows.
EI_{y^*}(x) = \int_{-\infty}^{y^*} (y^* - y)\, p(y \mid x)\, dy
(1)
Table 3

Hyperparameter search space for RF and XGB

Model  Parameter          Range
RF     n_estimators       {100, 500}
       min_samples_split  {2, 6}
       min_samples_leaf   {1, 6}
XGB    n_estimators       {100, 350}
       max_depth          {3, 8}
       min_child_weight   {1, 10}
       learning_rate      {0.01, 0.08}
       gamma              {0.1, 3}
       subsample          {0.5, 1}
       colsample_bytree   {0.6, 0.9}
Here, x is the hyperparameter set, and y is the value of the loss function. The threshold y^* is set such that p(y < y^*) = \gamma for a preset quantile \gamma (\gamma = 0.15 in this study). The TPE expresses p(x \mid y) via two density functions estimated from the sets of searched hyperparameters whose loss values lie below or above the threshold, as follows.
p(x \mid y) = \begin{cases} l(x) & \text{if } y < y^* \\ g(x) & \text{if } y \ge y^* \end{cases}
(2)
The EI is then expressed as follows.
EI_{y^*}(x) \propto \left( \gamma + \frac{g(x)}{l(x)}\,(1 - \gamma) \right)^{-1}
(3)

Thus, the optimization process focuses on iterations with smaller loss values, i.e., better performance, and converges to a set of optimal hyperparameters. The average accuracy score obtained from 10-fold cross validation on the training data (with and without SMOTE) was used as the loss function for hyperparameter tuning. In this study, the number of hyperparameter optimization iterations was set to 100, and optimization was implemented using the hyperopt 0.2.7 Python library.
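The candidate-selection step of TPE can be illustrated in miniature: past trials are split at the γ-quantile of the loss, two densities l(x) and g(x) are fitted, and the candidate maximising l(x)/g(x), and hence the EI of Equation (3), is chosen. This is a 1-D toy with a hypothetical loss surface, not the hyperopt implementation used in the study.

```python
import math
import random

def gauss_kde(points, bw=0.5):
    """Gaussian kernel density estimate over 1-D points."""
    norm = len(points) * bw * math.sqrt(2 * math.pi)
    return lambda x: sum(math.exp(-0.5 * ((x - p) / bw) ** 2)
                         for p in points) / norm

def tpe_suggest(trials, gamma=0.15, n_candidates=100, seed=1):
    """One TPE step over past (x, loss) trials: fit l(x) to the best
    gamma-fraction of trials and g(x) to the rest, then return the
    candidate with the largest density ratio l(x)/g(x)."""
    rng = random.Random(seed)
    ordered = sorted(trials, key=lambda t: t[1])       # sort by loss y
    n_good = max(1, int(gamma * len(ordered)))
    l = gauss_kde([x for x, _ in ordered[:n_good]])    # losses below y*
    g = gauss_kde([x for x, _ in ordered[n_good:]])    # losses at/above y*
    candidates = [rng.uniform(0.0, 10.0) for _ in range(n_candidates)]
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))

# Hypothetical 1-D loss surface with its minimum near x = 3.
history = [(x, (x - 3.0) ** 2) for x in (0.5, 1.0, 2.0, 2.8, 3.2, 5.0, 7.0, 9.0)]
next_x = tpe_suggest(history)
```

Because l(x) concentrates around the low-loss trials, the suggested candidate lands near the minimum, which is how the optimization focuses on iterations with better performance.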

Performance metrics

To evaluate model performance, accuracy, the area under the receiver operating characteristic curve (AUC), recall, and the F-measure were used as evaluation metrics. The classification results for HAB occurrences can be divided into four categories: (1) a HAB occurrence predicted as an occurrence, i.e., a true positive (TP); (2) a non-occurrence predicted as a non-occurrence, i.e., a true negative (TN); (3) a non-occurrence predicted as an occurrence, i.e., a false positive (FP); and (4) an occurrence predicted as a non-occurrence, i.e., a false negative (FN). Thus, accuracy, AUC, recall, and F-measure can be calculated as follows (Chawla et al. 2002).
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
(4)
\text{AUC} = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)
(5)
\text{Recall} = \frac{TP}{TP + FN}
(6)
\text{F-measure} = \frac{2\,TP}{2\,TP + FP + FN}
(7)
In addition to accuracy, AUC, recall, and F-measure, we used the integrated multi-objective optimization (iMOO) score, which combines the individual performance metrics and thereby allows an effective comparison of the predictability of the ML models (Hong et al. 2020). Because each performance metric takes values in the same range (0–1), the iMOO score can be calculated directly using the following fitness function.
\text{iMOO} = W_A (1 - A) + W_{AUC} (1 - \text{AUC}) + W_R (1 - R) + W_F (1 - F)
(8)

Here, W is the weight of each performance metric, and A, R, and F represent accuracy, recall, and F-measure, respectively. The weight of each metric may be set to a value between 0 and 1, such that the weights sum to 1, according to the purpose of the model. In this study, the iMOO scores were calculated with equal weights (0.25) for all performance metrics. Note that a lower iMOO score indicates higher model performance.
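The metrics and the iMOO score can be computed directly from the confusion-matrix counts. The sketch below assumes the single-point AUC form computed from hard class labels and an iMOO fitness of the weighted-sum-of-(1 − metric) form; both are assumptions consistent with the description in the text (weights summing to 1, lower scores indicating better performance), not the exact implementation of Hong et al. (2020).

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, AUC, recall, and F-measure from confusion-matrix counts;
    AUC here is the single-point form used with hard class labels."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    auc = (recall + specificity) / 2
    f_measure = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, auc, recall, f_measure

def imoo(metrics, weights=(0.25, 0.25, 0.25, 0.25)):
    """iMOO fitness sketch: weighted sum of (1 - metric); lower is better,
    and a perfect classifier scores 0."""
    return sum(w * (1 - m) for w, m in zip(weights, metrics))

# Example: a model that finds 6 of 16 blooms in a 100-sample test set.
scores = classification_metrics(tp=6, tn=80, fp=4, fn=10)
fitness = imoo(scores)
```

High accuracy with low recall, as in this example, is exactly the imbalance symptom the iMOO score is meant to expose: the accuracy term is small while the recall and F-measure terms dominate the fitness.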

Model explanations

The relative importance of the input features for HAB prediction in each reservoir was estimated using SHAP (Lundberg & Lee 2017), which is a representative post hoc explanation method that secures interpretability based on the Shapley value from game theory. SHAP estimates the contribution of input variables to the output value of the model based on the additive feature attribute concept, which can be expressed as follows using the explanatory model (g).
g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i
(9)
Here, z' \in \{0, 1\}^M, M is the number of features, and \phi_i \in \mathbb{R}. The coalition vector z' represents the presence (z'_i = 1) or absence (z'_i = 0) of the ith feature, \phi_i is the contribution of the ith feature, and \phi_0 is the prediction (i.e., the baseline probability) when no features are present. SHAP estimates the contribution of a single variable by calculating the change in the prediction with and without that variable while considering all possible combinations of the other input features. When the conditional expectation f_x(S) = E[f(x) \mid x_S] is defined for a subset S of the features, the SHAP value, i.e., the Shapley value of this conditional expected value of the model, can be expressed as follows:
\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left[ f_x(S \cup \{i\}) - f_x(S) \right]
(10)

Here, N is the set of all features, and f_x(S \cup \{i\}) and f_x(S) are the predictions of the model with and without the ith feature, respectively.

In this study, TreeSHAP was used to estimate the SHAP values (Lundberg et al. 2020). TreeSHAP is applicable to tree-based models; rather than enumerating feature subsets explicitly, it recursively tracks which leaves each subset of input variables flows into using the structure of each decision tree. Accordingly, TreeSHAP reduces the computational load required to estimate the contributions from exponential to polynomial complexity while still computing exact SHAP values (Lundberg et al. 2020). In this study, SHAP was implemented using the shap 0.40.0 library in the Python 3.7.12 environment.
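Equation (10) can be evaluated by brute force for a tiny model, which makes the exponential cost that TreeSHAP avoids concrete; the additive toy model below is hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_value(f, n_features, i):
    """Exact Shapley value of feature i for a set function f(S), computed
    by enumerating every subset S of the other features as in Equation
    (10); the cost is exponential in n_features."""
    others = [j for j in range(n_features) if j != i]
    phi = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = set(subset)
            weight = (factorial(len(s)) * factorial(n_features - len(s) - 1)
                      / factorial(n_features))
            phi += weight * (f(s | {i}) - f(s))
    return phi

# Hypothetical additive model: f(S) sums the values of the present
# features, so each feature's Shapley value should equal its own value.
x = [1.0, 2.0, 3.0]
f = lambda S: sum(x[j] for j in S)
phis = [shapley_value(f, 3, i) for i in range(3)]
```

The attributions also satisfy the efficiency property used in SHAP: the \phi_i sum to the difference between the full-model prediction and the baseline f(\emptyset).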

Characteristics of cyanobacterial bloom and environmental factors in reservoirs

The cyanobacteria abundance of the selected reservoirs (i.e., the Angye, Daecheong, Gwanggyo, Jinyang, Paldang, Sayeon, Unmun, and Yeongcheon reservoirs) exhibited seasonality. From 2016 to 2020, bloom occurrences above the attention level (>1,000 cells/mL) were concentrated in the summer months (July–September), with an average of 32.88 occurrences across the eight reservoirs, whereas relatively few blooms occurred in the other seasons (an average of 12.86 occurrences) (Figure 3). The Daecheong reservoir had the highest number of HAB occurrences during summer, with 127 samples classified as HAB occurrences over this 5-year period. In contrast, the Paldang reservoir exhibited only six HAB occurrences in the same period.
Figure 3

Seasonal variations of measured cyanobacteria cell counts during the summer season (black bar) and non-summer (white bar) seasons across study sites. The red horizontal line shows the attention level (1,000 cells/mL). Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/wqrj.2022.019.


The study sites are distributed across different watersheds; thus, the environmental factors used as input features differed among the reservoirs (Figure 1 and Table 2). High temperatures are typical during summer; thus, all the reservoirs commonly exhibited high Wtemp values (average: 23.07 °C). However, the reservoirs with the highest (Sayeon reservoir; 26.02 °C) and lowest (Jinyang reservoir; 18.56 °C) average water temperatures during the study period differed by 7.46 °C (Table 2). Differences were also observed in the other meteorological features across the study sites: Prec (average: 8.75 mm; range: 4.31–14.72 mm), Irr (average: 13.45 MJ/m²; range: 11.73–15.90 MJ/m²), and Wspeed (average: 1.91 m/s; range: 1.02–2.54 m/s). The water quality features also differed across the study sites. The feature with the largest difference among the study sites was PO4-P (average: 0.011 mg/L; range: 0.005–0.026 mg/L), for which the highest reservoir value was 5.21 times greater than the lowest. The feature with the smallest difference, i.e., T-N (average: 1.53 mg/L; range: 1.15–2.10 mg/L), exhibited a 1.83-fold difference between the lowest and highest values (Table 2).

Class imbalance for HABs

The difference in the frequency of HAB occurrence among the study sites caused class imbalance in the input data (Supplementary Material, Table S1). For the total dataset of each reservoir, the IR values (here, the ratio of HAB non-occurrences to occurrences) ranged from 1.12 to 29.67. Among the eight reservoirs, the IR value for the Daecheong reservoir was the lowest, and that for the Paldang reservoir was the highest (Supplementary Material, Table S1). The presence of class imbalance hindered the ML models from accurately predicting HAB occurrences (Figure 4), with higher IR values generally decreasing prediction performance (Figures 5 and 6). Thus, in this study, SMOTE was applied to each prediction model to improve the classification performance.
Figure 4

Comparison of iMOO scores for the ML models with and without SMOTE. Bars indicate the iMOO scores for RF (green), XGB (red), RF with SMOTE (blue), and XGB with SMOTE (purple). Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/wqrj.2022.019.

Figure 5

Relationship between IR and (a) accuracy, (b) AUC, (c) recall, (d) F-measure, and (e) iMOO for RF with SMOTE.

Figure 6

Relationship between IR and (a) accuracy, (b) AUC, (c) recall, (d) F-measure, and (e) iMOO for XGB with SMOTE.


Comparison of model performance

RF and XGB were employed to generate 1-week-ahead forecasts of HABs in the eight water supply reservoirs. In the Jinyang, Paldang, Sayeon, and Unmun reservoirs, the bloom occurrence classification accuracy of both RF and XGB improved after applying SMOTE; however, in the Angye, Daecheong, Gwanggyo, and Yeongcheon reservoirs, classification performance did not improve significantly (Table 4). Both RF (accuracy: 0.71–0.93) and XGB (accuracy: 0.71–0.95) exhibited high accuracy without SMOTE. However, the metrics that reflect performance for the minority class, i.e., bloom occurrence, showed considerably lower average values (AUC: 0.63, recall: 0.34, and F-measure: 0.39). After applying SMOTE, RF and XGB exhibited performance improvements for most of the target reservoirs. For example, after mitigating the degree of data imbalance via SMOTE, the models for the Unmun reservoir demonstrated the greatest performance improvement (average increase in AUC: 0.27, recall: 0.50, and F-measure: 0.67). However, for the Angye reservoir (average difference in recall: 0.00, F-measure: 0.00, and AUC: −0.03), little performance improvement was observed for either RF or XGB (Table 4 and Supplementary Material, Table S1). Note that the sample size for the Angye reservoir was smaller than that of the other reservoirs, and only two bloom occurrence samples were contained in the test set; thus, even a single bloom occurrence misclassified as a non-occurrence could cause a considerable performance reduction (López et al. 2013).

Table 4

Performance evaluation of forecasting cyanobacteria bloom occurrence for each study site using RF and XGB with and without SMOTE

                   Before applying SMOTE                After applying SMOTE
Reservoir   Model  Accuracy  AUC   Recall  F-measure    Accuracy  AUC   Recall  F-measure
Angye       RF     0.90      0.50  0.00    0.00         0.85      0.47  0.00    0.00
            XGB    0.90      0.50  0.00    0.00         0.85      0.47  0.00    0.00
Daecheong   RF     0.75      0.75  0.73    0.78         0.71      0.73  0.68    0.75
            XGB    0.71      0.68  0.80    0.78         0.70      0.72  0.65    0.73
Gwanggyo    RF     0.90      0.75  0.50    0.67         0.95      0.88  0.75    0.86
            XGB    0.95      0.88  0.75    0.86         0.90      0.84  0.75    0.75
Jinyang     RF     0.80      0.78  0.65    0.73         0.78      0.76  0.65    0.71
            XGB    0.78      0.76  0.65    0.71         0.75      0.74  0.71    0.71
Paldang     RF     0.93      0.50  0.00    0.00         0.93      0.50  0.00    0.00
            XGB    0.93      0.50  0.00    0.00         0.93      0.62  0.25    0.33
Sayeon      RF     0.92      0.63  0.25    0.40         0.92      0.73  0.50    0.57
            XGB    0.89      0.61  0.25    0.33         0.92      0.84  0.75    0.67
Unmun       RF     0.90      0.50  0.00    0.00         0.95      0.75  0.50    0.67
            XGB    0.90      0.50  0.00    0.00         0.95      0.75  0.50    0.67
Yeongcheon  RF     0.71      0.60  0.33    0.40         0.81      0.60  0.67    0.67
            XGB    0.76      0.68  0.50    0.55         0.67      0.62  0.50    0.46

In this study, the iMOO score was used to compare the performance of the prediction models by study site (Figure 4), where a lower iMOO score indicates higher prediction performance. Although differences were observed depending on the study site, XGB exhibited slightly better performance than RF: without SMOTE, the average iMOO scores across all reservoirs for RF and XGB were 0.516 and 0.486, respectively, and after applying SMOTE, they were 0.405 and 0.384, respectively (Figure 4). Thus, applying SMOTE generally improved the performance of the models. When all metrics were integrated, RF exhibited improved performance (average decrease in iMOO score: 0.236) for the Gwanggyo, Sayeon, Unmun, and Yeongcheon reservoirs, and XGB exhibited improved performance (average decrease in iMOO score: 0.240) for the Jinyang, Paldang, Sayeon, and Unmun reservoirs. For the Angye, Daecheong, Gwanggyo, and Yeongcheon reservoirs, the degree of improvement was relatively small, or performance was slightly higher without SMOTE (average difference in iMOO score: 0.03).

Effects of IR on model performance

Even when SMOTE was applied, the prediction performance was affected by the degree of imbalance (Figures 5 and 6). For both the RF (slope = 0.011) and XGB (slope = 0.021), accuracy tended to increase with increasing IR values. However, increased IR values caused reductions in the AUC, recall, and F-measure values (Figures 5(a)–5(d) and 6(a)–6(d)). For example, the F-measure (slope = −0.051) of the RF and recall (slope = −0.041) of the XGB decreased significantly with higher IR values. In terms of the iMOO score, which integrates all evaluation metrics, the value increased as the IR value increased, which indicates that the model obtained more accurate predictions when the IR value was lower. The variation in iMOO score with increasing IR value was greater with the RF (slope = 0.041) than with the XGB (slope = 0.021) (Figures 5(e) and 6(e)). For all metrics, including the iMOO score, the XGB exhibited more stable performance than the RF in the presence of class imbalance (Figures 5 and 6). One possible explanation is that the sequential learning process implemented in the XGB, which minimizes the loss function by reflecting errors from previous weak classifiers in the process of generating an ensemble, may address the data imbalance better than the RF (Sun et al. 2019).
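The IR referred to here is the ratio of majority-class (no bloom) to minority-class (bloom) observations, and the reported slopes describe how each metric changes as IR grows. A minimal sketch with illustrative numbers (not the study's values):

```python
import numpy as np

# Imbalance ratio (IR): majority-class count divided by minority-class count.
y = np.array([0] * 45 + [1] * 5)            # illustrative labels, 1 = HAB occurrence
ir_value = np.sum(y == 0) / np.sum(y == 1)  # 45 / 5 = 9.0

# Slope of a metric against IR via ordinary least squares, analogous to
# the slopes reported in the text (the data points below are invented).
ir = np.array([1.5, 3.0, 4.2, 6.0, 9.0])
recall = np.array([0.80, 0.71, 0.65, 0.50, 0.33])
slope, intercept = np.polyfit(ir, recall, 1)  # negative slope: recall drops as IR grows
```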

Model explanations

In this study, SHAP was employed to estimate the relative importance of the input features for the prediction results of the XGB with SMOTE. Although the performance of the XGB and RF did not differ significantly after applying SMOTE, the XGB was selected here because its performance was more stable across a wide range of IR values. After applying SHAP to the prediction results of the XGB for each study site, the relative importance (i.e., the mean |SHAP| value) of water temperature and the nutrient-related water quality variables, including T-N and T-P, was higher than that of the other environmental variables (Table 5). The importance of water temperature (Cha et al. 2017) and nutrients (Richardson et al. 2019) for cyanobacterial blooms has been consistently reported in previous studies.
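The mean |SHAP| importance reported in Table 5 is the average absolute SHAP value of each feature over the samples. The sketch below assumes a precomputed SHAP matrix (in practice obtained from shap.TreeExplainer applied to the fitted XGB model); the numbers are invented purely for illustration.

```python
import numpy as np

# Hypothetical SHAP value matrix: rows are samples, columns are features.
features = ["Wtemp", "T-N", "T-P", "TOC", "SS"]
shap_values = np.array([
    [ 0.12, -0.05,  0.02, -0.01,  0.03],
    [-0.10,  0.07, -0.03,  0.02, -0.01],
    [ 0.08, -0.06,  0.04, -0.02,  0.02],
])

# Mean |SHAP| per feature, then features ranked from most to least important.
importance = np.abs(shap_values).mean(axis=0)
ranking = [features[i] for i in np.argsort(importance)[::-1]]
```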

Table 5

The five most important input features in the prediction of HABs using XGB with SMOTE

Input features ranked by importance; values in parentheses are mean absolute SHAP values.

| Lake | 1st | 2nd | 3rd | 4th | 5th |
|---|---|---|---|---|---|
| Angye | NO3-N (0.0801) | Wtemp (0.0703) | Prec (0.0629) | SS (0.0572) | T-N (0.0567) |
| Daecheong | SS (0.1064) | PO4-P (0.0604) | TOC (0.0594) | Irr (0.0535) | T-P (0.0338) |
| Gwanggyo | TOC (0.1108) | NO3-N (0.1066) | T-N (0.0655) | T-P (0.0523) | SS (0.0514) |
| Jinyang | NO3-N (0.1396) | TOC (0.0911) | Wtemp (0.0757) | T-N (0.0333) | SS (0.0324) |
| Paldang | TOC (0.1211) | Wspeed (0.0802) | Wtemp (0.0724) | T-N (0.0600) | Irr (0.0560) |
| Sayeon | Wspeed (0.0678) | Prec (0.0668) | Irr (0.0500) | Wtemp (0.0498) | T-N (0.0459) |
| Unmun | PO4-P (0.1924) | T-P (0.1195) | TOC (0.0811) | Wtemp (0.0414) | T-N (0.0298) |
| Yeongcheon | TOC (0.1228) | Wtemp (0.0686) | SS (0.0529) | T-P (0.0464) | T-N (0.0418) |

SHAP provides the direction of contribution to the prediction results in addition to the magnitude. The SHAP summary plot in Figure 7 shows the distribution of SHAP values against feature values (higher feature values are shown in red, and lower feature values are shown in blue). For example, for Wtemp, which had the second highest relative importance at the Angye reservoir, the SHAP value increases as the feature value increases, indicating that Wtemp contributes positively to bloom occurrence (Figure 7(a)). This positive contribution of Wtemp was common to all reservoirs. In contrast, the direction of contribution of the water quality variables, such as the nutrients (T-N and T-P) and TOC, varied slightly among reservoirs. This variation may reflect the different characteristics of the environmental factors among the study sites (Table 1).
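The direction read off a summary plot can also be summarized numerically: if higher feature values coincide with higher SHAP values, the feature pushes predictions toward bloom occurrence. A small sketch with invented values for Wtemp:

```python
import numpy as np

# Illustrative pairs of feature values and their SHAP values for Wtemp.
wtemp = np.array([12.0, 18.0, 22.0, 26.0, 29.0])
shap_wtemp = np.array([-0.08, -0.02, 0.03, 0.07, 0.11])

# A positive correlation mirrors the red-high/blue-low gradient in the
# summary plot: warmer water pushes the prediction toward bloom occurrence.
direction = np.sign(np.corrcoef(wtemp, shap_wtemp)[0, 1])
```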
Figure 7

SHAP summary plots for forecasting cyanobacteria blooms using XGB with SMOTE. Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/wqrj.2022.019.


Herein, RF and XGB were employed in combination with SMOTE to predict HAB occurrences in water supply reservoirs. Although both RF and XGB exhibited generally good prediction performance, XGB was more stable across a wide range of IR values. In addition, post hoc explanation results from SHAP indicated that, among the environmental variables, water temperature and the TN and TP concentrations played important roles in predicting HAB occurrences. The use of explainable ML models provided accurate HAB predictions together with explanations of those predictions, increasing the usefulness of such models as a decision-making tool for water quality management. However, for a few of the studied reservoirs, performance did not improve significantly even after SMOTE was applied. Thus, in future studies, we plan to implement and evaluate various resampling techniques, which we expect to increase model reliability, enhance the explanation results, and maximize the applicability of explainable ML models.

This research was funded by the Korea Environment Industry & Technology Institute (KEITI) through the project for developing innovative drinking water and wastewater technologies, funded by the Korea Ministry of Environment (MOE) (2020002700001), and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1A2C1009961).

All relevant data are available from online repositories: water quality and cyanobacteria cell count data from https://water.nier.go.kr/, and meteorological data from https://data.kma.go.kr/.

The authors declare there is no conflict.

Bhattacharya, S., Krishnan, S. R., Maddikunta, P. K. R., Kaluri, R., Singh, S., Gadekallu, T. R., Alazab, M. & Tariq, U. 2020 A novel PCA-firefly based XGBoost classification model for intrusion detection in networks using GPU. Electronics 9 (2). https://doi.org/10.3390/electronics9020219.

Breiman, L. 2001 Random forests. Mach. Learn. 45 (1), 5–32. https://doi.org/10.1023/A:1010933404324.

Cha, Y., Cho, K. H., Lee, H., Kang, T. & Kim, J. H. 2017 The relative importance of water temperature and residence time in predicting cyanobacteria abundance in regulated rivers. Water Res. 124, 11–19. https://doi.org/10.1016/j.watres.2017.07.040.

Cha, Y., Shin, J., Go, B., Lee, D. S., Kim, Y., Kim, T. & Park, Y. S. 2021 An interpretable machine learning method for supporting ecosystem management: application to species distribution models of freshwater macroinvertebrates. Journal of Environmental Management 291, 112719. https://doi.org/10.1016/j.jenvman.2021.112719.

Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. 2002 SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357. https://doi.org/10.1613/jair.953.

Chen, T. & Guestrin, C. 2016 XGBoost: a scalable tree boosting system. In: Proc. of the 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2016). Association for Computing Machinery, New York, NY, United States, pp. 785–794. https://doi.org/10.1145/2939672.2939785.

Choi, Y., Gil-Garcia, R., Aranay, O., Burke, B. & Werthmuller, D. 2021 Using artificial intelligence techniques for evidence-based decision making in government: random forest and deep neural network classification for predicting harmful algal blooms in New York State. In: The 22nd Annual International Conference on Digital Government Research (DG.O'21), New York, NY, USA. Association for Computing Machinery, pp. 27–37. https://doi.org/10.1145/3463677.3463713.

Doran, D., Schulz, S. & Besold, T. R. 2017 What does explainable AI really mean? A new conceptualization of perspectives. arXiv preprint arXiv:1710.00794.

Hong, J., Kang, H. & Hong, T. 2020 Oversampling-based prediction of environmental complaints related to construction projects with imbalanced empirical-data learning. Renewable Sustainable Energy Rev. 134, 110402. https://doi.org/10.1016/j.rser.2020.110402.

Huisman, J., Codd, G. A., Paerl, H. W., Ibelings, B. W., Verspagen, J. M. H. & Visser, P. M. 2018 Cyanobacterial blooms. Nat. Rev. Microbiol. 16 (8), 471–483. https://doi.org/10.1038/s41579-018-0040-1.

Izadi, M., Sultan, M., Kadiri, R. E., Ghannadi, A. & Abdelmohsen, K. 2021 A remote sensing and machine learning-based approach to forecast the onset of harmful algal bloom. Remote Sens. 13 (19), 3863. https://doi.org/10.3390/rs13193863.

Kim, J. H., Shin, J. K., Lee, H., Lee, D. H., Kang, J. H., Cho, K. H., Lee, Y. G., Chon, K., Baek, S. S. & Park, Y. 2021a Improving the performance of machine learning models for early warning of harmful algal blooms using an adaptive synthetic sampling method. Water Res. 207, 117821. https://doi.org/10.1016/j.watres.2021.117821.

Kim, Y. W., Kim, T., Shin, J., Lee, D. S., Park, Y. S., Kim, Y. & Cha, Y. 2022 Validity evaluation of a machine-learning model for chlorophyll a retrieval using Sentinel-2 from inland and coastal waters. Ecol. Indic. 137, 108737. https://doi.org/10.1016/j.ecolind.2022.108737.

Krawczyk, B., Galar, M., Jeleń, Ł. & Herrera, F. 2016 Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy. Appl. Soft Comput. 38, 714–726. https://doi.org/10.1016/j.asoc.2015.08.060.

López, V., Fernández, A., García, S., Palade, V. & Herrera, F. 2013 An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Information Sciences 250, 113–141. https://doi.org/10.1016/j.ins.2013.07.00.

Lundberg, S. M. & Lee, S. I. 2017 A unified approach to interpreting model predictions. Adv. Neural. Inf. Process. Syst. 30, 1–10. Available from: https://proceedings.neurips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf.

Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Himmelfarb, J., Bansal, N. & Lee, S. I. 2020 From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2 (1). https://doi.org/10.1038/s42256-019-0138-9.

Mellios, N., Moe, S. J. & Laspidou, C. 2020 Machine learning approaches for predicting health risk of cyanobacterial blooms in northern European lakes. Water 12 (4). https://doi.org/10.3390/w12041191.

Moritz, S. & Bartz-Beielstein, T. 2017 imputeTS: time series missing value imputation in R. R Journal 9 (1), 207–218. https://doi.org/10.32614/RJ-2017-009.

Paerl, H. W. & Huisman, J. 2008 Blooms like it hot. Science 320 (5872). https://doi.org/10.1126/science.1155398.

Park, J., Lee, W. H., Kim, K. T., Park, C. Y., Lee, S. & Heo, T. Y. 2022 Interpretation of ensemble learning to predict water quality using explainable artificial intelligence. Sci. Total Environ. 832, 155070. https://doi.org/10.1016/j.scitotenv.2022.155070.

Richardson, J., Feuchtmayr, H., Miller, C., Hunter, P. D., Maberly, S. C. & Carvalho, L. 2019 Response of cyanobacteria and phytoplankton abundance to warming, extreme rainfall events and nutrient enrichment. Global Change Biology 25 (10), 3365–3380. https://doi.org/10.1111/gcb.14701.

Shin, J., Yoon, S., Kim, Y., Kim, T., Go, B. & Cha, Y. 2021 Effects of class imbalance on resampling and ensemble learning for improved prediction of cyanobacteria blooms. Ecol. Inform. 61, 101202. https://doi.org/10.1016/j.ecoinf.2020.101202.

Sinha, E., Michalak, A. M. & Balaji, V. 2017 Eutrophication will increase during the 21st century as a result of precipitation changes. Science 357 (6349), 405–408. https://doi.org/10.1126/science.aan2409.

Sun, F., Wang, R., Wan, B., Su, Y., Guo, Q., Huang, Y. & Wu, X. 2019 Efficiency of extreme gradient boosting for imbalanced land cover classification using an extended margin and disagreement performance. ISPRS Int. J. Geo-Inf. 8 (7). https://doi.org/10.3390/ijgi8070315.

Weirich, C. A. & Miller, T. R. 2014 Freshwater harmful algal blooms: toxins and children's health. Curr. Probl. Pediatr. Adolesc. Health Care 44 (1), 2–24. https://doi.org/10.1016/j.cppeds.2013.10.007.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).
