This study aims to identify the best machine learning (ML) approach to predict concentrations of biochemical oxygen demand (BOD), nitrate, and phosphate. Four ML techniques – decision tree, random forest, gradient boosting, and XGBoost – were compared for estimating the water quality parameters based on biophysical (i.e., population, basin area, river slope, water level, and stream flow) and physicochemical (i.e., conductivity, turbidity, pH, temperature, and dissolved oxygen) input parameters. The innovation lies in the combination of on-the-spot variables with additional characteristics of the watershed. Model performances were evaluated using the coefficient of determination (R2), the Nash–Sutcliffe efficiency (NSE) coefficient, the root mean squared error (RMSE), and the Kling–Gupta efficiency (KGE) coefficient. The robust fivefold cross-validation, along with hyperparameter tuning, achieved R2 values of 0.71, 0.66, and 0.69 for phosphate, nitrate, and BOD; NSE values of 0.67, 0.65, and 0.62; and KGE values of 0.64, 0.75, and 0.60, respectively. XGBoost yielded good results, showing superior performance across all analyses performed, but its performance was closely matched by the other algorithms. The overall modeling design and approach, which include careful consideration of data preprocessing, dataset splitting, statistical evaluation metrics, feature analysis, and learning curve analysis, are just as important as algorithm selection.

  • A careful selection of independent variables related to the physics of the problem contributed to the model efficiency.

  • Stratified segmentation of the dataset enhanced model performance by ensuring similar descriptive statistics for both training and testing datasets.

  • Cross-validation technique and hyperparameter optimization resulted in high validation scores for key water quality parameters.

Traditionally, physically based hydrological models have been applied for water quality modeling, allowing for a comprehensive study of the interaction between environmental variables and water bodies, as well as the simulation of human interventions in aquatic environments. However, in recent years, machine learning (ML) techniques have been widely applied to estimate water quality parameters (Hu et al. 2022; Sahour et al. 2023), though the concept of this application is not entirely new (Maier & Dandy 1996).

The data-driven ML methodology helps to reveal characteristics of a physical system as an inverse problem approach based on observed data (Cambioni, Asphaug & Furfaro 2022). In real-world scenarios, we strive to select or create a model that accurately represents a particular phenomenon, supported by experimental data (Moura Neto & Silva Neto 2013). Different techniques from statistics and ML have been applied in both prediction and inference. While the former aims at predicting future scenarios, inference aims at creating a mathematical model to describe how a system behaves (Bzdok et al. 2018). Statistical methods have traditionally focused on inference, which involves constructing and fitting probability models tailored to specific projects (Fisher 1956). In contrast, ML emphasizes prediction by employing versatile learning algorithms to uncover patterns in complex datasets (Koranga et al. 2022; Zhu et al. 2022b).

Many statistical techniques have been developed to estimate water constituents based on regression models by using streamflow, constituent concentration, and other variables (Runkel et al. 2004). However, as the volume of data on aquatic environments continues to increase, ML has emerged as a valuable tool for data analysis and prediction. Typically, monitoring networks prioritize flow measurement, which can be done through conventional monitoring with a limited number of measurements per day or automated monitoring with a higher frequency of measurements. Water quality, in contrast, is generally monitored less frequently, as the conventional laboratory methods for many water parameters (e.g., BOD5,20, phosphorus, and nitrate) require significant time and financial resources. However, applications related to water quality modeling and assessment require data at regular time intervals, which may differ from the actual data acquisition periods.

In this context, ML regression and classification techniques demonstrate a high capacity to fit the sample dataset and yield impressive performance statistics, drawing significant attention from the academic community. On the other hand, ML models can be strongly influenced by the presence of outliers, censored data, the hyperparameters used in the models, the data split, and the random state during model execution. ML techniques have been applied to model diverse aquatic environments such as rivers (Nasir et al. 2022), including deltas (Stoica et al. 2016) and watershed streams (Alnahit et al. 2022); groundwater (Sahour et al. 2023) and coastal aquifers (Aish et al. 2023); reservoirs (Chou et al. 2018); potable water (Alnaqeb et al. 2022); wastewater treatment plants (Zare Abyaneh 2014; Zhu et al. 2022a); and oceans (Lima et al. 2009, 2013). Hu et al. (2022) reviewed 27 papers and found that most of them focused on classification models that predict chemical concentrations with respect to a threshold value. However, these models performed poorly when predicting absolute contamination concentrations.

Many authors have applied ML classification to estimate the water quality index (WQI) as a combination of many variables, including not only water quality elements like temperature, pH, dissolved oxygen, biochemical oxygen demand (BOD), nitrogen, phosphorus, and dissolved solids (Stoica et al. 2016; Chou et al. 2018) but also other variables such as land use, soil, topography, and climate (Alnahit et al. 2022). Zare Abyaneh (2014) indicates that the quality of input parameters is more important than their quantity for enhancing model prediction. Generally, it is useful to identify the optimal set of independent variables for analysis. This can be achieved using the recursive feature elimination (RFE) method (Sahour et al. 2023), principal component analysis (Dilmi & Ladjal 2021), or simply by analyzing feature importances (Aish et al. 2023).
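As an illustration of the RFE approach cited above, the following is a minimal scikit-learn sketch; the synthetic data, the random forest estimator, and the number of retained features are assumptions for demonstration only, not this study's setup.

```python
# Minimal RFE sketch: recursively drop the weakest feature until 5 remain.
# Synthetic data and the random forest estimator are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

X, y = make_regression(n_samples=200, n_features=12, noise=0.1, random_state=0)

selector = RFE(RandomForestRegressor(random_state=0),
               n_features_to_select=5, step=1)
selector.fit(X, y)
print("kept features:", np.where(selector.support_)[0])
print("ranking (1 = kept):", selector.ranking_)
```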

ML can help in the timely identification of potential threats to water quality, determining whether water under certain features is safe for consumption or use, thereby protecting human and animal health (Dritsas & Trigka 2023). This ability to predict water quality has broad implications across various sectors, including industrial production, human lives, and the ecological environment (Zhu et al. 2022b). By identifying nonlinear and complex relationships between input and output data, these models can help in setting up a water quality prediction model, which is of critical importance for managing water resources (Najah Ahmed et al. 2019). Moreover, ML models can exploit the correlation between the water quality of different sections of a river, capturing deeper information that traditional models may overlook (Wu et al. 2022).

Field measurements of physicochemical parameters such as pH, turbidity, temperature, dissolved oxygen, and conductivity provide immediate data, but laboratory-based testing of water samples introduces a delay, creating a knowledge gap in understanding water quality over time. Additionally, the gap identified by Hu et al. (2022) highlights that ML models generally perform poorly when predicting absolute contamination concentrations. In this context, this paper aims to identify the best ML approach for predicting concentrations of BOD, nitrate, and phosphate using readily available physicochemical parameters and physical data. The innovation lies in the combination of on-the-spot variables with additional characteristics of the watershed, including flow rates, curve number (CN), slope, area, and population, to produce better estimates. This method could facilitate more efficient and near real-time monitoring, potentially reducing both the time and costs associated with conventional water quality analysis.

Study area

The Piabanha watershed in southeast Brazil (Figure 1) is part of Rio de Janeiro's mountainous region and spans 2,050 km2 within the Brazilian Atlantic Forest, one of the most biodiverse biomes on the planet (Myers et al. 2000; Russo 2009). The Piabanha River originates from the highlands at 1,150 m altitude, flows 80 km, and joins the Paraíba do Sul River at 260 m altitude. The difference in altitude from the river source to its mouth can significantly impact water quality and dissolved oxygen levels, which are closely related to the BOD and nitrate cycles. Higher altitudes have cooler and more oxygen-rich water, beneficial to aquatic life. As the river descends, the water temperature increases, reducing oxygen levels. The terrain's steepness also increases the river's speed and turbulence, leading to higher oxygenation but also causing more erosion and sedimentation, negatively impacting water quality.
Figure 1: Piabanha River Basin and its water quality monitoring stations.

The upper basin has a humid tropical climate with steep slopes and over 2,000 mm annual rainfall. The lower basin is sub-humid with an average rainfall of 1,300 mm. As of 2018, the basin was home to 535,000 people (CEIVAP 2020). Approximately 18.3% of the sewage produced in Petrópolis is released into regional rivers without any treatment. When taking into account both treated and untreated effluents, nearly 32% of the BOD originating from Petrópolis ends up in the rivers of the surrounding region (ANA 2017). The main sources of nitrate, phosphate, and BOD in the Piabanha River basin are domestic sewage discharge and agricultural activities. Sewage is responsible for at least 43% of the nitrogen load discharged into the river, while agricultural activities, through the use of nitrogenous fertilizers, contribute at least 15% of the discharged nitrogen (Alvim 2016). These pollutants can cause a range of negative impacts on water quality, including eutrophication, which can lead to excessive algae growth and decrease the oxygen available for other aquatic organisms. Water quality parameters, including BOD, phosphate, and nitrate, are typically monitored in Brazil on a quarterly basis due to the costs and logistics associated with collecting and analyzing water samples. However, quarterly monitoring is not sufficient to detect temporal variations in water quality, constraining water resources management. More frequent measurements or predictions of these parameters may provide early warning of water quality issues, allowing for quicker responses to protect public health and the environment.

Dataset

The database used in this study stems from a partnership with the Brazilian Geological Survey (SGB) through the project 'Integrated Studies in Experimental and Representative Watersheds', referred to as EIBEX, a collaborative initiative involving universities and government agencies (Villas-Boas et al. 2017). The EIBEX project provides data from August 2009 to the present (2024), with sampling frequency ranging from monthly to bi-annually. The monitoring stations are equipped with sensors to measure the river level, which makes it possible to calculate the flow. For the present study, BOD, nitrate, and phosphorus were selected as dependent variables to be estimated based on the independent variables: water level, stream flow, turbidity, pH, water temperature, dissolved oxygen, and conductivity, along with the station code, CN, river slope, area, and population within each gauge subbasin. The sample year and month are also used to detect seasonality and enable the analysis of possible interventions in the basin. We examined data spanning from 2009 to 2019, with no gaps in between, resulting in a total of 203 records covering all analyzed variables. The dataset used in this work is available in the Supplementary Material. High concentrations were flagged as values exceeding the third quartile by more than 1.5 times the interquartile range.
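As a sketch of the screening rule just described, the snippet below flags high values with the 1.5 × IQR criterion; the DataFrame, column name, and sample values are placeholders, not the study data.

```python
# Flag values above Q3 + 1.5*IQR; df and the "nitrate" column are placeholders.
import pandas as pd

def high_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values exceeding the third quartile by k times the IQR."""
    q1, q3 = series.quantile([0.25, 0.75])
    return series > q3 + k * (q3 - q1)

df = pd.DataFrame({"nitrate": [0.2, 0.9, 2.9, 6.3, 14.9, 16.1, 40.0]})
print(df[high_outliers(df["nitrate"])])  # rows flagged as unusually high
```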

River slope and the subbasin area within each gauge's influence were derived from the NASADEM digital elevation model with 30 m resolution (JPL/NASA 2020), using geoprocessing techniques in the QGIS software (QGIS Development Team 2021). The CN is an empirical parameter used in hydrology for predicting direct runoff and was based on land use and land cover maps (Hjelmfelt 1991). Land use and land cover data are derived from the MapBiomas project – a collaborative effort of many scientific institutions aimed at annually updating land use and land cover maps of Brazil (Souza et al. 2020). The population was retrieved from the Brazilian Institute of Geography and Statistics (IBGE), organized by census tracts, which were processed to determine the population in each subbasin influencing the water quality stations.

ML predictions

ML models were implemented using Jupyter notebooks. To build the models, the scikit-learn library in Python was used – one of the most popular and powerful ML libraries available (Douglass 2020). Scikit-learn provides a wide range of ML algorithms, along with tools for data preprocessing, feature selection, and model evaluation. The workflow is illustrated in Figure 2.
Figure 2: Representation of the workflow employed for ML regressions and classifications.

ML techniques may vary; however, most models described in the literature tend to incorporate certain common steps in response to the particular sensitivities and challenges of the ML process. For example, descriptive statistics analysis is useful for a preliminary dataset analysis and partition (Bui et al. 2020; Shah et al. 2021). ML models are particularly sensitive to the presence of outliers, which can be identified by analyzing the data distribution, graphical plotting, or using interquartile range-based thresholds (Juna et al. 2022; Shamsuddin et al. 2022; Aish et al. 2023).

It is good practice to split the dataset, generally using 70–80% for calibration and 20–30% for validation purposes (Daneshfaraz et al. 2021; Gelli et al. 2023). Different split percentages were tested to ensure the descriptive statistics of both sets were as similar as possible for good representativeness; a 70% training and 30% validation split was chosen. Additionally, we tuned the subsample hyperparameter at 0.5, 0.7, and 1.0, optimizing the subset size to enhance model performance and stability across different configurations. To reduce the potential bias introduced by using only one train-test split, k-fold cross-validation is a widely used technique in ML for model evaluation (Derdour et al. 2022; Sahour et al. 2023). It involves dividing the dataset into k subsets, using k-1 for training and one for testing in each iteration. This process is repeated k times, and the performance metrics from each iteration are averaged to provide a reliable estimate of the model's overall performance (Avila et al. 2018). Statistical metrics vary widely (Hu et al. 2022); for regression tasks, commonly reported metrics include the correlation coefficient (r), the coefficient of determination (R2), the root mean square error (RMSE), and the mean absolute error (MAE). Other statistics, such as the Kling–Gupta efficiency (KGE) and the Nash–Sutcliffe efficiency (NSE), can also be useful (Moriasi et al. 2015; Daneshfaraz et al. 2021; Kalateh et al. 2024).
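The following is a minimal sketch of the fivefold cross-validation just described, assuming scikit-learn; the synthetic data and the gradient boosting estimator are placeholders for the study's dataset and tuned models.

```python
# Fivefold cross-validation sketch; data and estimator are illustrative only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=203, n_features=12, noise=5.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)  # k = 5 folds
scores = cross_val_score(GradientBoostingRegressor(random_state=0),
                         X, y, scoring="r2", cv=cv)
print("R2 per fold:", np.round(scores, 2), "mean:", round(scores.mean(), 2))
```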

A variety of ML algorithms are available, but in this study the decision tree, random forest, gradient boosting, and XGBoost algorithms were selected not only because they are widely used for regression tasks (Alnahit et al. 2022; Zhu et al. 2022b; Aish et al. 2023), but also because of their interpretability, ease of use, and robustness, especially in the context of small datasets (Sagi & Rokach 2021). These models offer clear insights into feature importance and are relatively easy to tune compared with more complex architectures. Additional algorithms, such as support vector regression (SVR) and multilayer perceptron (MLP), were also tested, but they underperformed compared with the selected models. For phosphate, SVR and MLP achieved NSE validation scores of 0.48 and 0.23, respectively. For nitrate, the scores were 0.35 and 0.29, and for BOD, they were −0.14 and 0.11. This is likely due to the limited dataset size, consisting of 203 observations. Future work could explore alternative regression models with a larger dataset, providing a more tailored approach to these tasks.

The selected algorithms use decision trees as their base learners, which means they all operate by splitting the data based on certain conditions, hence creating a tree of decisions. Random forest and gradient boosting both create ensembles of decision trees, but they do so in different ways. Random forest builds and evaluates each decision tree independently and then combines them using averages or majority rules at the end of the process (Breiman 2001). On the other hand, gradient boosting starts the combining process at the beginning, creating a model in a stepwise manner by optimizing an objective function and combining a group of weak learners to build a single strong learner (Friedman 2001). XGBoost, or extreme gradient boosting, is an efficient implementation of gradient boosting that handles missing data well and includes regularization to avoid overfitting (Chen & Guestrin 2016). A sketch instantiating the four learners follows.
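The snippet below instantiates the four tree-based learners side by side, assuming scikit-learn and the separate xgboost package; all hyperparameters are library defaults here, for illustration only.

```python
# The four tree-based learners compared in this study (default settings).
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor  # requires the xgboost package

models = {
    "Decision tree": DecisionTreeRegressor(random_state=0),
    # Bagging: independent trees whose predictions are averaged
    "Random forest": RandomForestRegressor(random_state=0),
    # Boosting: trees added sequentially to correct residual errors
    "Gradient boosting": GradientBoostingRegressor(random_state=0),
    # Regularized, optimized gradient boosting implementation
    "XGBoost": XGBRegressor(random_state=0),
}
```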

ML models can be trained and tested using their default configurations (Abuzir & Abuzir 2022), but hyperparameter tuning is recommended for optimal model performance (Xin & Mou 2022). The hyperparameters determine the structure of the model and the way the learning process should operate. The literature suggests several techniques for tuning the optimal parameters, but grid search is the most basic and commonly used method (Uddin et al. 2023). This technique involves specifying a subset of the hyperparameter space as a grid and then systematically trying out all the combinations in the grid. For each combination, the model is trained and its performance is measured in order to select the parameters that give the best performance. In this study, the dataset was split into calibration (70%) and validation (30%) sets, with a stratified division by quantiles to ensure representativeness in both sets (Castrillo & García 2020; Bourel et al. 2021; Nasir et al. 2022). A fivefold cross-validation was applied. The hyperparameters were optimized using GridSearchCV to perform an exhaustive search over the specified hyperparameter values (Krishnaraj & Honnasiddaiah 2022; Rodríguez-López et al. 2023). Based on the literature (Piraei et al. 2023; Shan et al. 2023), the parameters and values used in this work are displayed in Tables 1–3, together with the best hyperparameters selected.
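A minimal GridSearchCV sketch follows, reproducing the gradient boosting grid of Table 2; X_train and y_train stand for the 70% calibration set and are not defined here.

```python
# Exhaustive grid search over the gradient boosting grid reported in Table 2
# (4 * 5 * 4 * 2 * 2 = 320 candidates, fivefold CV = 1,600 fits).
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [10, 20, 40, 80],
    "max_depth": [None, 2, 4, 6, 8],
    "min_samples_split": [2, 4, 6, 8],
    "learning_rate": [0.01, 0.1],
    "subsample": [0.5, 0.8],
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid,
                      cv=5, scoring="r2", n_jobs=-1)
# search.fit(X_train, y_train)   # X_train, y_train: the 70% calibration set
# print(search.best_params_)     # best combination found by the search
```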

Table 1

Decision tree and random forest hyperparameter optimization, fitting five folds for each of 81 candidates, totaling 405 fits for each model

Hyperparameter | Grid search | Best (decision tree) | Best (random forest)
max_depth | 2, 4, 6 | |
min_samples_split | 2, 4, 6 | |
min_samples_leaf | 1, 2, 4 | |
max_features | None, sqrt, log2 | None | sqrt
Table 2

Gradient boosting hyperparameter optimization, fitting five folds for each of 320 candidates, totaling 1,600 fits

Hyperparameter | Grid search | Best hyperparameter
n_estimators | 10, 20, 40, 80 | 20
max_depth | None, 2, 4, 6, 8 |
min_samples_split | 2, 4, 6, 8 |
learning_rate | 0.01, 0.1 | 0.01
subsample | 0.5, 0.8 | 0.8
Table 3

XGB regressor hyperparameter optimization, fitting five folds for each of 2,916 candidates, totaling 14,580 fits

Hyperparameter | Grid search | Best hyperparameter
learning_rate | 0.01, 0.1 | 0.1
max_depth | 2, 3, 4 |
n_estimators | 50, 100 | 100
subsample | 0.5, 0.7, 1.0 |
colsample_bytree | 0.5, 0.7, 1.0 | 0.7
gamma | 0, 0.1, 0.3 |
reg_alpha | 0, 0.1, 0.5 | 0.5
reg_lambda | 1, 2, 3 | 5

Performance assessment

The coefficient of determination (R2) is a metric used to assess the fit of a regression model (Renaud & Victoria-Feser 2010). It represents the proportion of the variance in the dependent variable that is predictable from the independent variables. In other words, R2 indicates the proportion of the total variation in the observed data that is explained by the regression model (Equation (1)). The range of values of R2 depends on the type of model being fitted; in standard cases like linear least-squares regression models, values lie between 0 and 1, with values close to 1 indicating a strong fit between the model and the data.

The root mean squared error (RMSE) is a measure of the absolute error that squares the deviations to prevent positive and negative deviations from canceling each other out (Equation (2)). This measure weights large errors heavily, which can help identify methods prone to such errors. RMSE is commonly used to express the accuracy of numerical results, presenting error values in the same units as the analyzed variable.

The NSE coefficient is a metric commonly used to evaluate the performance of hydrological models (Equation (3)). It provides a measure of how well the model simulation matches the observed data. The NSE compares the mean squared error (MSE) of the model predictions to the variance of the observed data. NSE values range from negative infinity to 1, and values close to 1 indicate good agreement between the simulated and observed data (Moriasi et al. 2015).

The KGE coefficient (Equation (4)) provides a comprehensive assessment of model accuracy by considering three key aspects: the correlation coefficient (Equation (5)), the variability ratio (Equation (6)), and the bias (Equation (7)). KGE incorporates information about the model's ability to reproduce the mean, variability, and timing of the observed data (Gupta et al. 2009). A higher KGE value, which can range from negative infinity to 1, indicates better agreement between the model and observed data, with values closer to 1 suggesting a more accurate representation of the observed variability.
$$R^2 = \left[\frac{\sum_{i=1}^{n}(O_i-\bar{O})(P_i-\bar{P})}{\sqrt{\sum_{i=1}^{n}(O_i-\bar{O})^2}\,\sqrt{\sum_{i=1}^{n}(P_i-\bar{P})^2}}\right]^2 \quad (1)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(O_i-P_i)^2} \quad (2)$$

$$\mathrm{NSE} = 1-\frac{\sum_{i=1}^{n}(O_i-P_i)^2}{\sum_{i=1}^{n}(O_i-\bar{O})^2} \quad (3)$$

$$\mathrm{KGE} = 1-\sqrt{(r-1)^2+(\alpha-1)^2+(\beta-1)^2} \quad (4)$$

$$r = \frac{\sum_{i=1}^{n}(O_i-\bar{O})(P_i-\bar{P})}{\sqrt{\sum_{i=1}^{n}(O_i-\bar{O})^2}\,\sqrt{\sum_{i=1}^{n}(P_i-\bar{P})^2}} \quad (5)$$

$$\alpha = \frac{\sigma_P}{\sigma_O} \quad (6)$$

$$\beta = \frac{\mu_P}{\mu_O} \quad (7)$$

where $O_i$ and $P_i$ represent the $i$th observed and predicted values, respectively; $\bar{O}$ and $\bar{P}$ represent the averages of the observed and predicted values, respectively; $n$ is the total number of observations; $r$ is the Pearson linear correlation coefficient between predicted and observed values; $\mu_P$ and $\mu_O$ are the mean values of the predicted and observed data; and $\sigma_P$ and $\sigma_O$ are the standard deviations of the predicted and observed data, respectively.
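For concreteness, a small NumPy sketch of Equations (1)–(7) follows; the obs and pred arrays are placeholder values, and R2 is computed here as the squared Pearson correlation, one common convention.

```python
# NumPy sketch of Equations (1)-(7); obs and pred are placeholder arrays.
import numpy as np

def scores(obs: np.ndarray, pred: np.ndarray) -> dict:
    r = np.corrcoef(obs, pred)[0, 1]        # Eq. (5): Pearson correlation
    alpha = pred.std() / obs.std()          # Eq. (6): variability ratio
    beta = pred.mean() / obs.mean()         # Eq. (7): bias ratio
    return {
        "R2": r ** 2,                                              # Eq. (1)
        "RMSE": float(np.sqrt(np.mean((obs - pred) ** 2))),        # Eq. (2)
        "NSE": 1 - np.sum((obs - pred) ** 2)
                 / np.sum((obs - obs.mean()) ** 2),                # Eq. (3)
        "KGE": 1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2
                           + (beta - 1) ** 2),                     # Eq. (4)
    }

obs = np.array([0.12, 0.40, 0.62, 1.10, 2.50])
pred = np.array([0.15, 0.35, 0.70, 1.00, 2.30])
print(scores(obs, pred))
```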

In well-calibrated models with good overall performance, R2 and NSE tend to align closely, as both assess how well the model fits the data. While both metrics derive from the sum of squared errors, R2 measures the proportion of variance explained by the model, whereas NSE compares the model's performance to a mean-based predictor. They can diverge when the model poorly predicts extreme values or exhibits bias, as NSE penalizes large errors more heavily, especially in cases of underperformance on extremes. In contrast, R2 may still indicate a reasonable fit if most data points are well-predicted. In situations where R2 and NSE are similar but overlook key aspects of model behavior, the KGE offers a more comprehensive evaluation by incorporating correlation, bias, and variability.

The stratified dataset split into quantiles was highly effective in producing training and test datasets with very similar descriptive statistics (Table 4), which helps improve model performance. The optimal number of quantiles was tested to conduct the split with the least difference between the descriptive statistics, resulting in the adoption of five quantiles for nitrate and phosphate, and four quantiles for BOD. The robust method of fivefold cross-validation, combined with hyperparameter tuning, achieved R2 values of 0.76, 0.67, and 0.71 for phosphate, nitrate, and BOD, and NSE values of 0.73, 0.67, and 0.66, respectively. A sketch of the quantile-stratified split follows.
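The following is a minimal sketch of that stratification, assuming pandas and scikit-learn; the DataFrame, target column, and quantile count are placeholders.

```python
# Quantile-stratified 70/30 split: bin the target into quantiles and pass the
# bin labels to stratify; df and the "phosphate" column are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"phosphate":
                   [0.01, 0.10, 0.12, 0.30, 0.45, 0.60, 0.90, 1.20, 1.80, 2.50] * 3})

bins = pd.qcut(df["phosphate"], q=5, labels=False, duplicates="drop")
train, test = train_test_split(df, test_size=0.3, stratify=bins, random_state=0)
print(train["phosphate"].describe())
print(test["phosphate"].describe())  # statistics should resemble the training set
```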

Table 4

Descriptive statistics

Statistic | Phosphate (Train) | Phosphate (Test) | Nitrate (Train) | Nitrate (Test) | BOD (Train) | BOD (Test)
Samples | 152 | 51 | 137 | 46 | 48 | 17
Mean | 0.42 | 0.46 | 4.21 | 3.97 | 5.60 | 5.23
Std | 0.51 | 0.64 | 4.01 | 3.60 | 3.88 | 3.27
Min | 0.00 | 0.01 | 0.20 | 0.24 | 1.15 | 3.00
1st quartile | 0.12 | 0.12 | 0.93 | 0.93 | 3.00 | 3.00
2nd quartile | 0.12 | 0.12 | 2.94 | 2.90 | 3.00 | 3.00
3rd quartile | 0.62 | 0.57 | 6.33 | 6.31 | 7.85 | 7.00
Max | 2.53 | 2.84 | 16.11 | 14.99 | 15.77 | 13.00
Skewness | 1.99 | 2.36 | 1.22 | 0.98 | 1.31 | 1.37
Kurtosis | 4.18 | 5.50 | 0.86 | 0.42 | 0.56 | 0.72

Phosphate estimates

The results obtained from the phosphate regressions (Figure 3) indicate that the XGBoost model demonstrated the best performance. During model training with 152 samples, the metrics reached R2 = NSE = 0.96 and KGE = 0.90 (Table 5). In the test set, which consisted of 51 samples, the model achieved R2 = 0.69, NSE = 0.67, and KGE = 0.63, indicating good performance. Furthermore, the feature analysis revealed that conductivity, slope, month, and flow rate were the most relevant variables, collectively representing approximately 85% of the variable importance in the model. It is noteworthy that conductivity exhibited a Pearson linear correlation (r) of 0.63, indicating a significant relationship with the phosphate parameter. The XGBoost model was also tested using only the two most important features, conductivity and slope, as independent variables for phosphate estimation. The results showed an NSE of 0.99 and 0.96 for calibration and validation, respectively, compared with 0.96 and 0.67 using the full feature set. This approach resulted in overfitting, which was evident in the scatter plot. Based on these results, we conclude that, in this case, the full model is preferable. The learning curves confirmed the superiority of the XGBoost model: both in training and testing, the curves converged to an error very close to zero as the number of samples increased. The scatter plot indicated that a significant portion of the XGBoost model's predictions fell within the 95% confidence interval. The greatest uncertainties are associated with higher concentrations, as indicated by the widening of the uncertainty band for these values. The gradient boosting and random forest models also exhibited good performance, with similar R2 scores of 0.65 and 0.71, respectively. Consequently, establishing a good model requires analyzing the whole design, including hyperparameter optimization, data preprocessing, dataset splitting, cross-validation, multiple statistical evaluations, feature analysis, and learning curve analysis. The analysis design and approach are as important as algorithm selection.
Table 5

Phosphate model performances

Model | RMSE (Train) | RMSE (Test) | NSE (Train) | NSE (Test) | R2 (Train) | R2 (Test) | KGE (Train) | KGE (Test)
Decision tree | 0.24 | 0.45 | 0.78 | 0.50 | 0.78 | 0.51 | 0.84 | 0.64
Random forest | 0.18 | 0.40 | 0.88 | 0.61 | 0.91 | 0.71 | 0.76 | 0.49
Gradient boosting | 0.14 | 0.38 | 0.92 | 0.64 | 0.95 | 0.65 | 0.79 | 0.63
XGBoost | 0.10 | 0.37 | 0.96 | 0.67 | 0.96 | 0.69 | 0.90 | 0.63
Figure 3: (a) Phosphate ML regressions, (b) feature importances, and (c) learning curves; green and red lines correspond to the training and test datasets, respectively.
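The feature importance and learning curve diagnostics plotted in Figure 3 can be produced along these lines in scikit-learn; fitted_model, X, y, and feature_names are placeholders, not this study's objects.

```python
# Sketch of the diagnostics behind Figure 3: ranked feature importances and a
# learning curve; fitted_model, X, y, and feature_names are placeholders.
import numpy as np
from sklearn.model_selection import learning_curve

def diagnostics(fitted_model, X, y, feature_names):
    # Tree ensembles expose gain-based importances after fitting
    ranked = sorted(zip(feature_names, fitted_model.feature_importances_),
                    key=lambda t: -t[1])
    for name, imp in ranked:
        print(f"{name:>15s}: {imp:.2f}")
    # Training/validation error as a function of training-set size
    sizes, train_sc, test_sc = learning_curve(
        fitted_model, X, y, cv=5, scoring="neg_root_mean_squared_error",
        train_sizes=np.linspace(0.2, 1.0, 5))
    print("train RMSE:", np.round(-train_sc.mean(axis=1), 2))
    print("test RMSE: ", np.round(-test_sc.mean(axis=1), 2))
```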

Nitrate estimates

Out of the 203 observations for the nitrate parameter, 7 observations (3% of the data) showed concentrations below the detection limit of the laboratory, and 13 observations (6% of the data) exceeded 1.5 times the interquartile range and were therefore categorized as outliers, since our study aims to represent the average behavior of the watershed rather than extreme events. These extreme values significantly impacted the performance of the models. When using the dataset without any preprocessing, the best performance was observed with the gradient boosting algorithm, with an NSE of 0.85 for training and 0.33 for testing, which was considered unsatisfactory for the validation set. After preprocessing to remove values below the detection limit and outliers, we achieved better performances, resulting in a total of 183 observations, split into 137 for training and 46 for testing. In this configuration, the estimation of the nitrate parameter yielded promising results (Figure 4), with the XGBoost, random forest, and gradient boosting algorithms demonstrating improved performance. Their metrics were closely aligned, with NSE values ranging from 0.61 to 0.65, R2 values from 0.62 to 0.66, and KGE values from 0.73 to 0.75 for the test set (Table 6). Notably, the random forest algorithm exhibited a more stable learning curve. In contrast, XGBoost was able to learn effectively using only two variables: population and slope.
Table 6

Nitrate model performances

Model | RMSE (Train) | RMSE (Test) | NSE (Train) | NSE (Test) | R2 (Train) | R2 (Test) | KGE (Train) | KGE (Test)
Decision tree | 2.44 | 3.41 | 0.63 | 0.09 | 0.63 | 0.36 | 0.71 | 0.57
Random forest | 1.79 | 2.10 | 0.80 | 0.65 | 0.82 | 0.66 | 0.75 | 0.75
Gradient boosting | 0.77 | 2.16 | 0.96 | 0.63 | 0.98 | 0.65 | 0.83 | 0.73
XGBoost | 1.33 | 2.23 | 0.89 | 0.61 | 0.92 | 0.62 | 0.79 | 0.74
Figure 4: (a) Nitrate ML regressions, (b) feature importances, and (c) learning curves; green and red lines correspond to the training and test datasets, respectively.

XGBoost, specifically, demonstrated a simple structure, emphasizing key features such as population, slope, and conductivity; the first two accounted for over 90% of the feature importances. Similarly, the random forest algorithm identified population, conductivity, and turbidity as the three most significant variables, cumulatively accounting for about 50% of the total importance. This outcome is noteworthy because the population variable represents a considerable source of nitrogen, and the slope variable is linked to the transformation of organic nitrogen into nitrate: steeper slopes incorporate more dissolved oxygen into the water, facilitating aerobic bacterial activity in the nitrogen cycle. This result indicates that the physical processes of the watershed were captured by the data structure in the ML process. The random forest model exhibited the most optimized learning curve among the evaluated models for nitrate estimation, displaying consistent behavior for both the training and test sets, with error decreasing as more samples were incorporated. These observations highlight the necessity of comprehensive performance analysis of the models, extending beyond global metrics such as R2 and NSE to incorporate aspects such as the learning curve and the feature importances.

BOD estimates

The grouping of stations in the urban area resulted in 65 observations of BOD, with 48 designated for training and 17 for testing. XGBoost (Figure 5) exhibited R2 = 0.84, NSE = 0.77, and KGE = 0.64 during calibration (Table 7). In validation, it showed R2 = 0.69, NSE = 0.62, and KGE = 0.59, considered good given the complexity of BOD's biogeochemical cycle. Gradient boosting performed closely, with an R2 of 0.96 and 0.58 for calibration and validation, respectively. The most significant features for XGBoost were slope, year, and pH; slope and year alone represented more than 90% of the total importance. All four algorithms ranked the year of sample collection among the two most important features. This is quite relevant because, when analyzing pollution in the basin, the year is a determining factor due to improvements in the sanitation infrastructure of the region. The learning curves for the two best algorithms behaved similarly and suggest that expanding the sample set could enhance the results.
Table 7

BOD model performances

Model | RMSE (Train) | RMSE (Test) | NSE (Train) | NSE (Test) | R2 (Train) | R2 (Test) | KGE (Train) | KGE (Test)
Decision tree | 1.85 | 2.92 | 0.77 | 0.16 | 0.77 | 0.45 | 0.83 | 0.60
Random forest | 2.54 | 2.21 | 0.56 | 0.52 | 0.64 | 0.65 | 0.49 | 0.58
Gradient boosting | 1.15 | 2.14 | 0.91 | 0.55 | 0.96 | 0.58 | 0.75 | 0.57
XGBoost | 1.85 | 1.97 | 0.77 | 0.62 | 0.84 | 0.69 | 0.64 | 0.59
Figure 5: (a) BOD ML regressions, (b) feature importances, and (c) learning curves; green and red lines correspond to the training and test datasets, respectively.

The robust fivefold cross-validation, along with hyperparameter tuning, achieved R2 values of 0.71, 0.66, and 0.69 for phosphate, nitrate, and BOD; NSE values of 0.67, 0.65, and 0.62; and KGE values of 0.64, 0.75, and 0.60, respectively. Following the performance criteria suggested by Moriasi et al. (2015), all three parameters exhibit good performance. Sadayappan et al. (2022) found an average testing R2 of 0.69 for nitrate across many sites in the USA, while Tso et al. (2023) achieved R2 values of 0.71 and 0.58 for nitrate and orthophosphate in Great Britain's rivers, respectively. Alnahit et al. (2022) observed NSE ranging from 0.45 to 0.6 for total nitrogen and from 0.3 to 0.6 for total phosphorus, employing a range of biophysical parameters. Notably, high-performance models often rely on highly correlated variables, such as the study by Nafsin & Li (2022), which achieved an R2 of 0.91 for BOD prediction by including chemical oxygen demand (COD) as a predictor. In this case, by definition, BOD constitutes the biodegradable portion of COD, which also requires laboratory analysis. Thus, our approach has the advantage of using only easily obtainable parameters. Despite using fivefold cross-validation and hyperparameter tuning to control model complexity, overfitting remained a challenge. For the gradient boosting and XGBoost models (Table 3), learning rates of 0.01 and 0.1 were set to control the step size during each boosting round, while the maximum tree depth was constrained (2–4) to avoid overly complex models (Bentéjac et al. 2021). The number of estimators was set between 50 and 100, while subsample and colsample_bytree values (ranging from 0.5 to 1.0) were adjusted to reduce variance. The min_child_weight parameter ensured that each leaf contained enough data points, preventing overly complex trees (Ahmed et al. 2024). In XGBoost, the gamma parameter was applied for pruning, while L1 (reg_alpha) and L2 (reg_lambda) regularizations were adjusted to penalize large coefficients and reduce overfitting (Ying 2019).

After automated tuning with GridSearchCV, manual adjustments were made to further refine the models. For gradient boosting, n_estimators was set to 20 and max_depth to 4. The n_estimators parameter controls the number of boosting iterations, with more rounds generally improving performance but increasing the risk of overfitting; 20 was chosen as a balanced value. The max_depth limits the complexity of each tree, reducing the likelihood of the model becoming overly specific to the training data. This adjustment reduced the calibration R2 for phosphate from 1 to 0.95, while the validation R2 improved from 0.64 to 0.65.

In the XGBoost model, reg_alpha was set to 0.5 and reg_lambda to 5. These regularization terms directly influence the loss function by penalizing large weights, with reg_alpha applying an L1 penalty (proportional to the absolute values of weights) and reg_lambda applying an L2 penalty (proportional to the squared weights). These penalties help control the magnitude of the model's parameters, promoting simpler models and reducing the risk of overfitting to noise. After regularization, the phosphate calibration R2 slightly decreased from 0.98 to 0.96, while the validation R2 dropped from 0.76 to 0.69. Although the changes in calibration R2 were minor, the dispersion in the validation scatter plot improved significantly. While techniques like early stopping or data augmentation could further enhance models with smaller datasets (Ying 2019), the current configuration performed well during cross-validation, indicating adequate generalization to the test set.
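The final, manually refined configurations described above could be expressed as follows; this is a hedged sketch assuming the scikit-learn and xgboost APIs, with all unstated hyperparameters left at their defaults.

```python
# Final manually refined settings reported above; other parameters are defaults.
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

gb_final = GradientBoostingRegressor(n_estimators=20,  # fewer boosting rounds
                                     max_depth=4,      # shallower trees
                                     random_state=0)
xgb_final = XGBRegressor(reg_alpha=0.5,  # L1 penalty on leaf weights
                         reg_lambda=5,   # L2 penalty on leaf weights
                         random_state=0)
```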

Segmenting the dataset into quantiles in a stratified manner proved highly effective, yielding comparable descriptive statistics for both training and testing datasets, thereby improving model performance. In contrast, Castrillo & García (2020) applied a stratified sample to obtain a representative and unbiased validation dataset, employing the native function of StratifiedShuffleSplit from the Scikit-Learn library. However, this approach maintains the same percentage for each target class without considering percentiles, which we believe provides superior performance by offering a predefined, user-oriented method for dividing sample regions.

Hyperparameter optimization was a crucial step in enhancing the performance of the ML models, a fact also observed by Yan et al. (2023) in predicting water quality, as the default settings generally do not lead to the best performance. The use of fivefold cross-validation strengthens our approach, in agreement with the literature (Grbčić et al. 2022). Additionally, Jung et al. (2020) assessed the impact of cross-validation on the efficiency of ML models and pointed out that the greater the number of cross-validations, the more reliable the estimates become. Specifically, regarding the BOD parameter in urban areas, the model's learning curve analysis indicated that a larger dataset could enhance its learning. Although we achieved satisfactory results, it is recommended to use a larger dataset (Shen 2018).

While a high correlation between dependent and independent variables can be beneficial in many ML techniques, it is not strictly necessary (Tahmasebi et al. 2020), as it is for statistical approaches. ML algorithms can capture complex and nonlinear patterns even when the variable correlation is weak (Najah Ahmed et al. 2019). The inclusion of variables related to the physics of the problem was also very important. For instance, population directly influences the pollution loads of nitrate, phosphate, and BOD. Similarly, the CN reflects the water runoff in the basin associated with the types of land use. The basin area, river slope, level, and flow are physical variables that directly influence chemical transport, dilution, and decay. Furthermore, physicochemical parameters such as conductivity, turbidity, pH, temperature, and dissolved oxygen are potentially correlated with the dependent variables and help improve the efficiency of the algorithms. The inclusion of these variables corroborates the literature finding that the quality of input parameters is more important than their quantity for enhancing model prediction (Zare Abyaneh 2014).

When evaluating water quality, there are several additional factors within the historical data series that require attention. These include climatic seasonality, the discharge of residential sewage and industrial wastewater, and engineering interventions in the watershed that directly impact the behavior of the dataset (von Sperling et al. 2020; De Andrade Costa et al. 2023; Dos Santos Ferreira et al. 2023; da Silva Junior et al. 2024). Between 2009 and 2022, more than 20 sewage treatment units were implemented in the Piabanha watershed. In a metaphorical sense, an expert examining the water quality data of a region employs a decision tree approach by considering temporality and seasonality. Considering the year makes it possible to discern structural interventions within the watershed, while considering the month captures the climatic seasonality that impacts the flow rate and, ultimately, constituent concentrations. The use of dates in ML algorithms is in alignment with the literature (Arias-Rodriguez et al. 2023) and was beneficial in this study, as observed in the feature importance analysis.

Our study demonstrated significant results through the innovative application of ML techniques. We incorporated a robust fivefold cross-validation technique and hyperparameter optimization, which resulted in high validation scores for phosphate, nitrate, and BOD. The selection of independent variables proved to be crucial. Variables related to the physics of the problem were key contributors to the efficiency of the algorithms. The model performances were further enhanced by the effective stratified segmentation of the dataset into quantiles, leading to similar descriptive statistics for both training and testing datasets.

Investigating the performance of various ML algorithms, we found differing behaviors in terms of their ability to generalize and adjust to the data. For instance, the decision tree showed the lowest generalization capacity, performing well on training data but poorly on the test dataset. Gradient boosting proved to be the most prone to overfitting in the training set; however, it still managed to produce good results on the test set. In comparison to gradient boosting, random forest was less susceptible to overfitting due to its ensemble nature, yet achieved a similar performance. XGBoost generally outperformed the other models, largely due to its algorithm's characteristics. While data preprocessing, feature engineering, and hyperparameter tuning significantly influence model performance, XGBoost's core boosting framework, effective regularization techniques, and advanced tree optimization strategies make it particularly powerful compared with other tree-based algorithms.

On the other hand, establishing a good model requires analyzing the entire modeling design, including hyperparameter optimization, data preprocessing methods, dataset splitting, cross-validation, statistical evaluation metrics, feature analysis, and learning curve analysis. The overall analysis approach is as important as algorithm selection since different algorithms can produce similar metric results with minor differences. Therefore, instead of assuming one algorithm is the best for a particular situation, it is more advisable to apply a set of techniques to determine the most suitable approach. As a limitation, regarding the BOD parameter in urban areas, we found that a larger dataset could enhance learning and model performance. However, it is worth noting that simply increasing the size of a dataset does not always guarantee improved results. Our findings underscore the importance of not just quantity, but also the quality of data in building robust ML models.

The authors acknowledge the financial support provided by the following Brazilian agencies: FAPERJ, Carlos Chagas Filho Foundation for Research Support of the State of Rio de Janeiro; CNPq, National Council for Scientific and Technological Development; and CAPES, Coordination for the Improvement of Higher Education Personnel (Finance Code 001). The authors appreciate the anonymous reviewers for their valuable comments, which helped clarify several points in the original manuscript.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Abuzir
S. Y.
&
Abuzir
Y. S.
(
2022
)
Machine learning for water quality classification
,
Water Quality Research Journal
,
57
(
3
),
152
164
.
https://doi.org/10.2166/wqrj.2022.004
.
Ahmed
U.
,
Mahmood
A.
,
Tunio
M. A.
,
Hafeez
G.
,
Khan
A. R.
, &
Razzaq
S.
(
2024
)
Investigating boosting techniques’ efficacy in feature selection: A comparative analysis
,
Energy Reports
,
11
,
3521
–3532.
https://doi.org/10.1016/j.egyr.2024.03.020
.
Aish
A. M.
,
Zaqoot
H. A.
,
Sethar
W. A.
&
Aish
D. A.
(
2023
)
Prediction of groundwater quality index in the Gaza coastal aquifer using supervised machine learning techniques
,
Water Practice & Technology
,
18
(
3
),
501
521
.
https://doi.org/10.2166/wpt.2023.028
.
Alnahit
A. O.
,
Mishra
A. K.
&
Khan
A. A.
(
2022
)
Stream water quality prediction using boosted regression tree and random forest models
,
Stochastic Environmental Research and Risk Assessment
,
36
(
9
),
2661
2680
.
https://doi.org/10.1007/s00477-021-02152-4
.
Alnaqeb
R.
,
Alrashdi
F.
,
Alketbi
K.
&
Ismail
H.
(
2022
). ‘
Machine learning-based water potability prediction
’,
Proceedings of IEEE/ACS International Conference on Computer Systems and Applications, AICCSA
, December 2022, pp.
1
6
.
https://doi.org/10.1109/AICCSA56895.2022.10017579.
Alvim
R. B.
(
2016
)
Dinâmica do Nitrogênio E Fósforo em águas Fluviais de uma Bacia Hidrográfica com Diferentes Usos do Solo no Sudeste do Brasil (Dynamics of Nitrogen and Phosphorus in River Waters of A River Basin with Different Land Uses in Southeastern Brazil)
.
Universidade Federal Fluminense
.
Niterói, Brazil Available at: https://app.uff.br/riuff/handle/1/3076.
ANA
. (
2017
)
Atlas Esgotos – Despoluição de Bacias Hidrográficas (Atlas Sewage – Cleaning up Watersheds)
.
Brasília
:
Agência Nacional de Águas
.
Arias-Rodriguez
L. F.
,
Tüzün
U. F.
,
Duan
Z.
,
Huang
J.
,
Tuo
Y.
&
Disse
M.
(
2023
)
Global water quality of inland waters with harmonized Landsat-8 and Sentinel-2 using cloud-computed machine learning
,
Remote Sensing
,
15
(
5
),
1390
.
https://doi.org/10.3390/rs15051390
.
Avila
R.
,
Horn
B.
,
Moriarty
E.
,
Hodson
R.
&
Moltchanova
E.
(
2018
)
Evaluating statistical model performance in water quality prediction
,
Journal of Environmental Management
,
206
,
910
919
.
https://doi.org/10.1016/j.jenvman.2017.11.049
.
Bentéjac
C.
,
Csörgő
A.
, &
Martínez-Muñoz
G.
(
2021
)
A comparative analysis of gradient boosting algorithms
,
Artificial Intelligence Review
,
54
(
3
),
1937
–1967.
https://doi.org/10.1007/s10462-020-09896-5
.
Bourel
M.
,
Segura
A. M.
,
Crisci
C.
,
López
G.
,
Sampognaro
L.
,
Vidal
V.
,
Kruk
C.
,
Piccini
C.
&
Perera
G.
(
2021
)
Machine learning methods for imbalanced data set for prediction of faecal contamination in beach waters
,
Water Research
,
202
,
117450
.
https://doi.org/10.1016/j.watres.2021.117450
.
Breiman
L.
(
2001
)
Random forests
,
Machine Learning
,
45
(
1
),
5
32
.
https://doi.org/10.1023/A:1010933404324
.
Bui
D. T.
,
Khosravi
K.
,
Tiefenbacher
J.
,
Nguyen
H.
&
Kazakis
N.
(
2020
)
Improving prediction of water quality indices using novel hybrid machine-learning algorithms
,
Science of the Total Environment
,
721
,
137612
.
https://doi.org/10.1016/j.scitotenv.2020.137612
.
Bzdok
D.
,
Altman
N.
&
Krzywinski
M.
(
2018
)
Statistics versus machine learning
,
Nature Methods
,
15
(
4
),
233
234
.
https://doi.org/10.1038/nmeth.4642
.
Cambioni
S.
,
Asphaug
E.
&
Furfaro
R.
(
2022
)
Combining machine-learned regression models with Bayesian inference to interpret remote sensing data
. In: Helbert, J., D’Amore, M., Aye, M. & Kerner, H. (eds.)
Machine Learning for Planetary Science
.
Elsevier
, Cambridge. pp.
193
207
.
https://doi.org/10.1016/B978-0-12-818721-0.00020-3
.
Castrillo
M.
&
García
Á. L
. (
2020
)
Estimation of high frequency nutrient concentrations from water quality surrogates using machine learning methods
,
Water Research
,
172
,
115490
.
https://doi.org/10.1016/j.watres.2020.115490
.
CEIVAP
. (
2020
)
Plano Integrado de Recursos Hídricos da bacia hidrográfica do Rio Paraíba do Sul (Integrated plan for water resources in the watershed of the Paraíba do Sul river)
.
Resende
:
Comitê de Integração da Bacia Hidrográfica do Rio Paraíba do Sul (CEIVAP). Profill Engenharia e Ambiente. Associação Pró-Gestão das Águas da Bacia Hidrográfica do Rio Paraíba do Sul (AGEVAP)
.
Chen
T.
&
Guestrin
C.
(
2016
). ‘
XGBoost
’,
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
.
New York, NY, USA
:
ACM
, pp.
785
794
.
Chou
J.-S.
,
Ho
C.-C.
&
Hoang
H.-S.
(
2018
)
Determining quality of water in reservoir using machine learning
,
Ecological Informatics
,
44
,
57
75
.
https://doi.org/10.1016/j.ecoinf.2018.01.005
.
Costa
D. de A.
,
Bayissa
Y.
,
Villas-Boas
M. D.
,
Maskey
S.
,
Lugon
Junior
, J.,
Silva
Neto
,
A. J.
da
, &
Srinivasan
R.
(
2024
)
Water availability and extreme events under climate change scenarios in an experimental watershed of the Brazilian Atlantic Forest
,
Science of The Total Environment
,
946
(
7
),
174417
.
https://doi.org/10.1016/j.scitotenv.2024.174417
.
Daneshfaraz
R.
,
Aminvash
E.
,
Ghaderi
A.
,
Abraham
J.
&
Bagherzadeh
M.
(
2021
)
SVM performance for predicting the effect of horizontal screen diameters on the hydraulic parameters of a vertical drop
,
Applied Sciences
,
11
(
9
),
4238
.
https://doi.org/10.3390/app11094238
.
da Silva Junior
L. C. S.
,
Costa
D. d. A.
&
Fedler
C. B.
(
2024
)
From scarcity to abundance: Nature-based strategies for small communities experiencing water scarcity in West Texas/USA
,
Sustainability
,
16
(
5
),
1959
.
https://doi.org/10.3390/su16051959
.
de Andrade Costa
D.
,
Bayissa
Y.
,
Lugon Junior
J.
,
Yamasaki
E. N.
,
Kyriakides
I.
&
Silva Neto
A. J.
(
2023
)
Cyprus surface water area variation based on the 1984–2021 time series built from remote sensing products
,
Remote Sensing
,
15
(
22
),
5288
.
https://doi.org/10.3390/rs15225288
.
Derdour
A.
,
Jodar-Abellan
A.
,
Pardo
M. Á.
,
Ghoneim
S. S. M.
&
Hussein
E. E.
(
2022
)
Designing efficient and sustainable predictions of water quality indexes at the regional scale using machine learning algorithms
,
Water (Switzerland)
,
14
(
18
),
1
16
.
https://doi.org/10.3390/w14182801
.
Dilmi
S.
&
Ladjal
M.
(
2021
)
A novel approach for water quality classification based on the integration of deep learning and feature extraction techniques
,
Chemometrics and Intelligent Laboratory Systems
,
214
,
104329
.
https://doi.org/10.1016/j.chemolab.2021.104329
.
Dos Santos Ferreira
M.
,
Gomes de Siqueira
J.
,
De Paulo Santos de Oliveira
V.
&
De Andrade Costa
D.
(
2023
)
Analysis of municipal public policies for payment for water environmental services through the public policy assessment index: The state of Rio de Janeiro (Brazil) as a study model
,
Agua Y Territorio/Water and Landscape
, (
23
),
e6976
.
https://doi.org/10.17561/at.23.6976
.
Douglass
M. J. J.
(
2020
)
Book review: Hands-on machine learning with scikit-learn, keras, and tensorflow, 2nd edition by Aurélien Géron
,
Physical and Engineering Sciences in Medicine
,
43
(
3
),
1135
1136
.
https://doi.org/10.1007/s13246-020-00913-z
.
Dritsas
E.
&
Trigka
M.
(
2023
)
Efficient data-driven machine learning models for water quality prediction
,
Computation
,
11
(
2
),
16
.
https://doi.org/10.3390/computation11020016
.
Fisher
R. A.
(
1956
)
Statistical Methods and Scientific Inference
.
Oxford, England
:
Hafner Publishing Co
.
Friedman
J. H.
(
2001
)
Greedy function approximation: A gradient boosting machine
,
The Annals of Statistics
,
29
(
5
),
1189
1232
.
https://doi.org/10.1214/aos/1013203451
.
Gelli
Y. K.
,
de Ade Andrade Costa
D.
,
Nicolau
A. P.
&
da Silva
J. G.
(
2023
)
Vegetational succession assessment in a fragment of the Brazilian Atlantic Forest
,
Environmental Monitoring and Assessment
,
195
(
1
),
179
.
https://doi.org/10.1007/s10661-022-10709-1
.
Grbčić
L.
,
Družeta
S.
,
Mauša
G.
,
Lipić
T.
,
Lušić
D. V.
,
Alvir
M.
,
Lučin
I.
,
Sikirica
A.
,
Davidović
D.
,
Travaš
V.
,
Kalafatovic
D.
,
Pikelj
K.
,
Fajković
H.
,
Holjević
T.
&
Kranjčević
L.
(
2022
)
Coastal water quality prediction based on machine learning with feature interpretation and spatio-temporal analysis
,
Environmental Modelling & Software
,
155
,
105458
.
https://doi.org/10.1016/j.envsoft.2022.105458
.
Gupta
H. V.
,
Kling
H.
,
Yilmaz
K. K.
&
Martinez
G. F.
(
2009
)
Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling
,
Journal of Hydrology
,
377
(
1–2
),
80
91
.
https://doi.org/10.1016/j.jhydrol.2009.08.003
.
Hjelmfelt
A. T.
(
1991
)
Investigation of curve number procedure
,
Journal of Hydraulic Engineering
,
117
(
6
),
725
737
.
https://doi.org/10.1061/(ASCE)0733-9429(1991)117:6(725)
.
Hu
X. C.
,
Dai
M.
,
Sun
J. M.
&
Sunderland
E. M.
(
2022
)
The utility of machine learning models for predicting chemical contaminants in drinking water: Promise, challenges, and opportunities
,
Current Environmental Health Reports
,
10
(
1
),
45
60
.
https://doi.org/10.1007/s40572-022-00389-x
.
JPL/NASA
. (
2020
)
NASADEM Merged DEM Global 1 arc Second V001
.
California
.
Publisher: NASA EOSDIS Land Processes Distributed Active Archive Center. Accessed 2024-11-09 from https://doi.org/10.5067/MEaSUREs/NASADEM/NASADEM_HGT.001
.
Juna
A.
,
Umer
M.
,
Sadiq
S.
,
Karamti
H.
,
Eshmawi
A. A.
,
Mohamed
A.
&
Ashraf
I.
(
2022
)
Water quality prediction using KNN imputer and multilayer perceptron
,
Water (Switzerland)
,
14
(
17
),
1
19
.
https://doi.org/10.3390/w14172592
.
Jung
K.
,
Bae
D.-H.
,
Um
M.-J.
,
Kim
S.
,
Jeon
S.
&
Park
D.
(
2020
)
Evaluation of nitrate load estimations using neural networks and canonical correlation analysis with K-fold cross-validation
,
Sustainability
,
12
(
1
),
400
.
https://doi.org/10.3390/su12010400
.
Kalateh
F.
,
Aminvash
E.
&
Daneshfaraz
R.
(
2024
)
On the hydraulic performance of the inclined drops: The effect of downstream macro-roughness elements
,
AQUA – Water Infrastructure, Ecosystems and Society
,
73
(
3
),
553
568
.
https://doi.org/10.2166/aqua.2024.304
.
Koranga
M.
,
Pant
P.
,
Kumar
T.
,
Pant
D.
,
Bhatt
A. K.
&
Pant
R. P.
(
2022
)
Efficient water quality prediction models based on machine learning algorithms for Nainital Lake, Uttarakhand
,
Materials Today: Proceedings
,
57
,
1706
1712
.
https://doi.org/10.1016/j.matpr.2021.12.334
.
Krishnaraj
A.
&
Honnasiddaiah
R.
(
2022
)
Remote sensing and machine learning based framework for the assessment of spatio-temporal water quality in the Middle Ganga Basin
,
Environmental Science and Pollution Research
,
29
(
43
),
64939
64958
.
https://doi.org/10.1007/s11356-022-20386-9
.
Lima
E. B.
,
Rodrigues
P. P. G. W.
,
Silva Neto
A. J.
,
Mesa
M. I.
,
Santiago
O. L.
&
Lugon Junior
J.
(
2009
). ‘
Parameter estimation in model of estuarine hydrodynamics based on genetic algorithms
’,
20th International Congress of Mechanical Engineering
.
Gramado, RS, Brazil
.
Lima
E. B.
,
Rodrigues
P. P. G. W.
,
Silva Neto
A. J.
,
Mesa
M. I.
,
Santiago
O. L.
&
Lugon Junior
J.
, (
2013
)
Coupling Mohid with optimization algorithms: Perspectives on the development of automatic calibration tools
. In:
Mateus
R. N. M.
(ed.)
Ocean Modelling for Coastal Management – Case Studies with MOHID
,
Lisboa
IST Press
, pp.
117
130
.
Maier
H. R.
&
Dandy
G. C.
(
1996
)
The use of artificial neural networks for the prediction of water quality parameters
,
Water Resources Research
,
32
(
4
),
1013
1022
.
https://doi.org/10.1029/96WR03529
.
Moriasi
D. N.
,
Gitau
M. W.
,
Pai
N.
&
Daggupati
P.
(
2015
)
Hydrologic and water quality models: Performance measures and evaluation criteria
,
Transactions of the ASABE
,
58
(
6
),
1763
1785
.
https://doi.org/10.13031/trans.58.10715
.
Moura Neto, F. D. & Silva Neto, A. J. (2013) An Introduction to Inverse Problems with Applications. Berlin, Germany: Springer Berlin Heidelberg.
Myers, N., Mittermeier, R. A., Mittermeier, C. G., da Fonseca, G. A. B. & Kent, J. (2000) Biodiversity hotspots for conservation priorities, Nature, 403(6772), 853–858. https://doi.org/10.1038/35002501.
Nafsin, N. & Li, J. (2022) Prediction of 5-day biochemical oxygen demand in the Buriganga River of Bangladesh using novel hybrid machine learning algorithms, Water Environment Research, 94(5), 1–17. https://doi.org/10.1002/wer.10718.
Najah Ahmed, A., Binti Othman, F., Abdulmohsin Afan, H., Khaleel Ibrahim, R., Ming Fai, C., Shabbir Hossain, M., Ehteram, M. & Elshafie, A. (2019) Machine learning methods for better water quality prediction, Journal of Hydrology, 578, 124084. https://doi.org/10.1016/j.jhydrol.2019.124084.
Nasir, N., Kansal, A., Alshaltone, O., Barneih, F., Sameer, M., Shanableh, A. & Al-Shamma'a, A. (2022) Water quality classification using machine learning algorithms, Journal of Water Process Engineering, 48, 102920. https://doi.org/10.1016/j.jwpe.2022.102920.
Piraei, R., Afzali, S. H. & Niazkar, M. (2023) Assessment of XGBoost to estimate total sediment loads in rivers, Water Resources Management, 37(13), 5289–5306. https://doi.org/10.1007/s11269-023-03606-w.
QGIS Development Team (2021) QGIS Geographic Information System. Boston: Open Source Geospatial Foundation Project.
Renaud, O. & Victoria-Feser, M.-P. (2010) A robust coefficient of determination for regression, Journal of Statistical Planning and Inference, 140(7), 1852–1862. https://doi.org/10.1016/j.jspi.2010.01.008.
Rodríguez-López, L., Bustos Usta, D., Bravo Alvarez, L., Duran-Llacer, I., Lami, A., Martínez-Retureta, R. & Urrutia, R. (2023) Machine learning algorithms for the estimation of water quality parameters in Lake Llanquihue in Southern Chile, Water, 15(11), 1994. https://doi.org/10.3390/w15111994.
Runkel, R. L., Crawford, C. G. & Cohn, T. A. (2004) Load estimator (LOADEST): A FORTRAN program for estimating constituent loads in streams and rivers, Techniques and Methods, 4, 69. U.S. Geological Survey, U.S. Department of the Interior. https://water.usgs.gov/software/loadest/doc/.
Russo, G. (2009) Biodiversity: Biodiversity's bright spot, Nature, 462(7271), 266–269. https://doi.org/10.1038/462266a.
Sadayappan, K., Kerins, D., Shen, C. & Li, L. (2022) Nitrate concentrations predominantly driven by human, climate, and soil properties in US rivers, Water Research, 226, 1–10. https://doi.org/10.1016/j.watres.2022.119295.
Sagi, O. & Rokach, L. (2021) Approximating XGBoost with an interpretable decision tree, Information Sciences, 572, 522–542. https://doi.org/10.1016/j.ins.2021.05.055.
Sahour, S., Khanbeyki, M., Gholami, V., Sahour, H., Kahvazade, I. & Karimi, H. (2023) Evaluation of machine learning algorithms for groundwater quality modeling, Environmental Science and Pollution Research, 30(16), 46004–46021. https://doi.org/10.1007/s11356-023-25596-3.
Shah, M. I., Javed, M. F., Alqahtani, A. & Aldrees, A. (2021) Environmental assessment based surface water quality prediction using hyper-parameter optimized machine learning models based on consistent big data, Process Safety and Environmental Protection, 151, 324–340. https://doi.org/10.1016/j.psep.2021.05.026.
Shamsuddin, I. I. S., Othman, Z. & Sani, N. S. (2022) Water quality index classification based on machine learning: A case from the Langat River Basin model, Water (Switzerland), 14(19), 1–20. https://doi.org/10.3390/w14192939.
Shan, S., Ni, H., Chen, G., Lin, X. & Li, J. (2023) A machine learning framework for enhancing short-term water demand forecasting using attention-BiLSTM networks integrated with XGBoost residual correction, Water (Switzerland), 15(20), 1–17. https://doi.org/10.3390/w15203605.
Shen, C. (2018) A transdisciplinary review of deep learning research and its relevance for water resources scientists, Water Resources Research, 54(11), 8558–8593. https://doi.org/10.1029/2018WR022643.
Souza, C. M., Shimbo, J. Z., Rosa, M. R., Parente, L. L., Alencar, A. A., Rudorff, B. F. T., Hasenack, H., Matsumoto, M., Ferreira, L. G., Souza-Filho, P. W. M., de Oliveira, S. W., Rocha, W. F., Fonseca, A. V., Marques, C. B., Diniz, C. G., Costa, D., Monteiro, D., Rosa, E. R., Vélez-Martin, E., Weber, E. J., Lenti, F. E. B., Paternost, F. F., Pareyn, F. G. C., Siqueira, J. V., Viera, J. L., Neto, L. C. F., Saraiva, M. M., Sales, M. H., Salgado, M. P. G., Vasconcelos, R., Galano, S., Mesquita, V. V. & Azevedo, T. (2020) Reconstructing three decades of land use and land cover changes in Brazilian biomes with Landsat archive and Earth Engine, Remote Sensing, 12(17), 2735. https://doi.org/10.3390/rs12172735.
Stoica, C., Camejo, J., Banciu, A., Nita-Lazar, M., Paun, I., Cristofor, S., Pacheco, O. R. & Guevara, M. (2016) Water quality of Danube Delta systems: Ecological status and prediction using machine-learning algorithms, Water Science and Technology, 73(10), 2413–2421. https://doi.org/10.2166/wst.2016.097.
Tahmasebi, P., Kamrava, S., Bai, T. & Sahimi, M. (2020) Machine learning in geo- and environmental sciences: From small to large scale, Advances in Water Resources, 142, 103619. https://doi.org/10.1016/j.advwatres.2020.103619.
Tso, C. H. M., Magee, E., Huxley, D., Eastman, M. & Fry, M. (2023) River reach-level machine learning estimation of nutrient concentrations in Great Britain, Frontiers in Water, 5, 1–18. https://doi.org/10.3389/frwa.2023.1244024.
Uddin, M. G., Nash, S., Rahman, A. & Olbert, A. I. (2023) Performance analysis of the water quality index model for predicting water state using machine learning techniques, Process Safety and Environmental Protection, 169, 808–828. https://doi.org/10.1016/j.psep.2022.11.073.
Villas-Boas, M. D., Olivera, F. & de Azevedo, J. P. S. (2017) Assessment of the water quality monitoring network of the Piabanha River experimental watersheds in Rio de Janeiro, Brazil, using autoassociative neural networks, Environmental Monitoring and Assessment, 189(9), 439. https://doi.org/10.1007/s10661-017-6134-9.
von Sperling, M., Verbyla, M. E. & Oliveira, S. M. A. C. (2020) Assessment of Treatment Plant Performance and Water Quality Data: A Guide for Students, Researchers and Practitioners. Belo Horizonte, Brazil: IWA Publishing.
Wu, X., Zhang, Q., Wen, F. & Qi, Y. (2022) A water quality prediction model based on multi-task deep learning: A case study of the Yellow River, China, Water, 14(21), 3408. https://doi.org/10.3390/w14213408.
Xin, L. & Mou, T. (2022) Research on the application of multimodal-based machine learning algorithms to water quality classification, Wireless Communications and Mobile Computing, 2022, 1–13. https://doi.org/10.1155/2022/9555790.
Yan, T., Zhou, A. & Shen, S.-L. (2023) Prediction of long-term water quality using machine learning enhanced by Bayesian optimisation, Environmental Pollution, 318, 120870. https://doi.org/10.1016/j.envpol.2022.120870.
Ying, X. (2019) An overview of overfitting and its solutions, Journal of Physics: Conference Series, 1168, 022022. https://doi.org/10.1088/1742-6596/1168/2/022022.
Zare Abyaneh, H. (2014) Evaluation of multivariate linear regression and artificial neural networks in prediction of water quality parameters, Journal of Environmental Health Science and Engineering, 12(1), 40. https://doi.org/10.1186/2052-336X-12-40.
Zhu, J., Jiang, Z. & Feng, L. (2022a) Improved neural network with least square support vector machine for wastewater treatment process, Chemosphere, 308, 136116. https://doi.org/10.1016/j.chemosphere.2022.136116.
Zhu, M., Wang, J., Yang, X., Zhang, Y., Zhang, L., Ren, H., Wu, B. & Ye, L. (2022b) A review of the application of machine learning in water quality evaluation, Eco-Environment & Health, 1(2), 107–116. https://doi.org/10.1016/j.eehl.2022.06.001.