ABSTRACT
This study aims to identify the best machine learning (ML) approach to predict concentrations of biochemical oxygen demand (BOD), nitrate, and phosphate. Four ML techniques, namely decision tree, random forest, gradient boosting, and XGBoost, were compared for estimating the water quality parameters from biophysical (i.e., population, basin area, river slope, water level, and stream flow) and physicochemical (i.e., conductivity, turbidity, pH, temperature, and dissolved oxygen) input parameters. The innovation lies in the combination of on-the-spot variables with additional characteristics of the watershed. Model performance was evaluated using the coefficient of determination (R2), the Nash-Sutcliffe efficiency (NSE) coefficient, the root mean squared error (RMSE), and the Kling-Gupta efficiency (KGE) coefficient. Robust five-fold cross-validation, along with hyperparameter tuning, achieved R2 values of 0.71, 0.66, and 0.69 for phosphate, nitrate, and BOD; NSE values of 0.67, 0.65, and 0.62; and KGE values of 0.64, 0.75, and 0.60, respectively. XGBoost yielded good results, showing superior performance across all analyses performed, but its performance was closely matched by the other algorithms. The overall modeling design and approach, which includes careful consideration of data preprocessing, dataset splitting, statistical evaluation metrics, feature analysis, and learning curve analysis, is just as important as algorithm selection.
HIGHLIGHTS
A careful selection of independent variables related to the physics of the problem contributed to the model efficiency.
Stratified segmentation of the dataset enhanced model performance by ensuring similar descriptive statistics for both training and testing datasets.
Cross-validation technique and hyperparameter optimization resulted in high validation scores for key water quality parameters.
INTRODUCTION
Traditionally, physically based hydrological models have been applied for water quality modeling, allowing for a comprehensive study of the interaction between environmental variables with water bodies, as well as the simulation of human interventions in aquatic environments. However, in recent years, machine learning (ML) techniques have been widely applied to estimate water quality parameters (Hu et al. 2022; Sahour et al. 2023), though the concept of this application is not entirely new (Maier & Dandy 1996).
The data-driven ML methodology helps to reveal characteristics of a physical system as an inverse problem approach based on observed data (Cambioni Asphaug & Furfaro 2022). In real-world scenarios, we strive to select or create a model that accurately represents a particular phenomenon, supported by experimental data (Moura Neto & Silva Neto 2013). Different techniques from statistics and ML have been applied in both prediction and inference. While the first aims at predicting future scenarios, inference aims at creating a mathematical model to describe how a system behaves (Bzdok et al. 2018). Statistical methods have traditionally focused on inference, which involves constructing and fitting probability models tailored to specific projects (Fisher 1956). In contrast, ML emphasizes prediction by employing versatile learning algorithms to uncover patterns in complex datasets (Koranga et al. 2022; Zhu et al. 2022b).
Many statistical techniques have been developed to estimate water constituents with regression models using streamflow, constituent concentration, and other variables (Runkel et al. 2004). However, as the volume of data on aquatic environments continues to increase, ML has emerged as a valuable tool for data analysis and prediction. Typically, monitoring networks prioritize flow measurement, which can be done through conventional monitoring with a limited number of measurements per day or automated monitoring with a higher frequency of measurements. Water quality, in contrast, is generally monitored less frequently, since the conventional laboratory methods for many water parameters (e.g., BOD5,20, phosphorus, and nitrate) require significant time and financial resources. However, applications related to water quality modeling and assessment require data at regular time intervals, which may differ from the actual data acquisition periods.
In this context, ML regression and classification techniques demonstrate a high capacity to fit the sample dataset and yield impressive performance statistics, drawing significant attention from the academic community. On the other hand, ML models can be strongly influenced by the presence of outliers, censored data, the hyperparameters used in the models, the data split, and the random state during model execution. ML techniques have been applied to model diverse aquatic environments such as rivers (Nasir et al. 2022), including deltas (Stoica et al. 2016) and watershed streams (Alnahit et al. 2022); groundwater (Sahour et al. 2023); coastal aquifers (Aish et al. 2023); reservoirs (Chou et al. 2018); potable water (Alnaqeb et al. 2022); wastewater treatment plants (Zare Abyaneh 2014; Zhu et al. 2022a); and the ocean (Lima et al. 2009, 2013). Hu et al. (2022) reviewed 27 papers and found that most of them focused on classification models to predict chemical concentrations with respect to a threshold value. However, these models performed poorly when predicting absolute contamination concentrations.
Many authors have applied ML classification to estimate water quality index (WQI) as a combination of many variables, including not only water quality elements like temperature, pH, dissolved oxygen, biochemical oxygen demand (BOD), nitrogen, phosphorus, and dissolved solids (Stoica et al. 2016; Chou et al. 2018) but also other variables such as land use, soil, topography, and climate (Alnahit et al. 2022). Zare Abyaneh (2014) indicates that the quality of input parameters is more important than its quantity to enhance model prediction. Generally, it is useful to identify the optimum independent variables for analysis. This can be achieved using the recursive feature elimination (RFE) method (Sahour et al. 2023), principal components analysis (Dilmi & Ladjal 2021), or simply by analyzing features' importance (Aish et al. 2023).
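As an illustrative sketch of one of the feature-selection options mentioned above, recursive feature elimination can be run in a few lines with scikit-learn. The data and names below are synthetic stand-ins, not the study's actual setup:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                      # six candidate features
# Target depends only on the first two features (by construction)
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=100)

# Recursively drop the least important feature until two remain
selector = RFE(DecisionTreeRegressor(random_state=0), n_features_to_select=2)
selector.fit(X, y)
print(selector.support_)   # boolean mask of retained features
```

The same `support_` mask can then be used to subset the predictor matrix before model fitting.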
ML can help in the timely identification of potential threats to water quality, determining whether water under certain features is safe for consumption or use, thereby protecting human and animal health (Dritsas & Trigka 2023). This ability to predict water quality has broad implications across various sectors, including industrial production, human lives, and the ecological environment (Zhu et al. 2022b). By identifying nonlinear and complex relationships between input and output data, these models can help in setting up a water quality prediction model, which is of critical importance for managing water resources (Najah Ahmed et al. 2019). Moreover, ML models can exploit the correlation between the water quality of different sections of a river, capturing deeper information that traditional models may overlook (Wu et al. 2022).
Field measurements of physicochemical parameters such as pH, turbidity, temperature, dissolved oxygen, and conductivity provide immediate data, but laboratory-based testing of water samples introduces a delay, creating a knowledge gap in understanding water quality over time. Additionally, the gap identified by Hu et al. (2022) highlights that ML models generally perform poorly when predicting absolute contamination concentrations. In this context, this paper aims to identify the best ML approach for predicting concentrations of BOD, nitrate, and phosphate using readily available physicochemical parameters and physical data. The innovation lies in the combination of on-the-spot variables with additional characteristics of the watershed, including flow rates, curve number (CN), slope, area, and population, to produce better estimates. This method could facilitate more efficient and near real-time monitoring, potentially reducing both the time and costs associated with conventional water quality analysis.
MATERIALS AND METHODS
Study area
The upper basin has a humid tropical climate with steep slopes and over 2,000 mm annual rainfall. The lower basin is sub-humid with an average rainfall of 1,300 mm. As of 2018, the basin was home to 535,000 people (CEIVAP 2020). Approximately 18.3% of the sewage produced in Petrópolis is released into regional rivers without any treatment. When taking into account both treated and untreated effluents, nearly 32% of the BOD originating from Petrópolis ends up in the rivers of the surrounding region (ANA 2017). The main sources of nitrate, phosphate, and BOD in the Piabanha River basin are domestic sewage discharge and agricultural activities. The sewage is responsible for at least 43% of the nitrogen load discharged into the river, while agricultural activities, through the use of nitrogenous fertilizers, contribute at least 15% of the discharged nitrogen (Alvim 2016). These pollutants can cause a range of negative impacts on water quality, including eutrophication, which can lead to excessive algae growth and decrease the oxygen available for other aquatic organisms. Water quality parameters, including BOD, phosphate, and nitrate, are typically monitored in Brazil on a quarterly basis due to the costs and logistics associated with collecting and analyzing water samples. However, quarterly monitoring is not sufficient to detect temporal variations in water quality, constraining water resources management. More frequent measurements or predictions of these parameters may provide early warning of water quality issues, allowing for quicker response to protect public health and the environment.
Dataset
The database used in this study comes from a partnership with the Brazilian Geological Survey (SGB) through the project ‘Integrated Studies in Experimental and Representative Watersheds’, referred to as EIBEX, a collaborative initiative involving universities and government agencies (Villas-Boas et al. 2017). The EIBEX project provides data from August 2009 to the present (2024), with sampling frequency ranging from monthly to bi-annually. The monitoring stations are equipped with sensors to measure the river level, which makes it possible to calculate the flow. For the present study, BOD, nitrate, and phosphorus were selected as dependent variables to be estimated from the independent variables: water level, stream flow, turbidity, pH, water temperature, dissolved oxygen, and conductivity, along with station code, CN, river slope, area, and population within each gauge subbasin. The sample year and month are also used to detect seasonality and enable the analysis of possible interventions in the basin. We examined data spanning from 2009 to 2019, with no gaps in between, resulting in a total of 203 records with all analyzed variables. The dataset used in this work is available in the Supplementary Material. High concentrations were identified as values exceeding the third quartile by more than 1.5 times the interquartile range.
River slope and subbasin area within each gauge influence were derived from the digital elevation model (NASA DEM) with 30 m resolution (JPL/NASA 2020), using geoprocessing techniques in QGIS software (QGIS Development Team 2021). The CN is an empirical parameter used in hydrology for predicting direct runoff and was based on land use and land cover maps (Hjelmfelt 1991). Land use and land cover data are derived from the MapBiomas project – a collaborative effort of many scientific institutions aimed at annually updating land use and land cover maps in Brazil (Souza et al. 2020). The population was retrieved from the Brazilian Institute of Geography and Statistics (IBGE), organized by census tracts, which were processed to determine the population in each subbasin influencing the water quality stations.
ML predictions
ML techniques may vary; however, most models described in the literature tend to incorporate certain common steps in response to the particular sensitivities and challenges of the ML process. For example, descriptive statistics analysis is useful for a preliminary dataset analysis and partition (Bui et al. 2020; Shah et al. 2021). ML models are particularly sensitive to the presence of outliers, which can be identified by analyzing the data distribution, graphical plotting, or using interquartile range-based thresholds (Juna et al. 2022; Shamsuddin et al. 2022; Aish et al. 2023).
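A minimal sketch of the interquartile-range threshold approach mentioned above (the data and the function name are illustrative, not from the study):

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartile range."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Example: one obvious outlier in a small BOD-like sample
sample = np.array([3.0, 3.2, 2.9, 3.1, 3.0, 15.0])
print(sample[iqr_outlier_mask(sample)])  # -> [15.]
```

Flagged observations can then be inspected graphically or against the data distribution before deciding whether to keep them.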
It is a good practice to split the dataset, generally using 70–80% for calibration and 20–30% for validation purposes (Daneshfaraz et al. 2021; Gelli et al. 2023). Different split percentages were tested to ensure that the descriptive statistics of both sets were as similar as possible for good representativeness; a 70% training and 30% validation split was chosen. Additionally, we tuned the subsample hyperparameter at 0.5, 0.7, and 1.0, optimizing the subset size to enhance model performance and stability across different configurations. To reduce the potential bias introduced by using a single train-test split, k-fold cross-validation is widely used in ML for model evaluation (Derdour et al. 2022; Sahour et al. 2023). It involves dividing the dataset into k subsets, using k-1 for training and one for testing in each iteration. This process is repeated k times, and the performance metrics from each iteration are averaged to provide a reliable estimate of the model's overall performance (Avila et al. 2018). Statistical metrics vary widely (Hu et al. 2022); for regression tasks, commonly reported metrics include the correlation coefficient (r), coefficient of determination (R2), root mean square error (RMSE), and mean absolute error (MAE). Other statistics, such as the Kling–Gupta efficiency (KGE) and the Nash–Sutcliffe efficiency (NSE), can also be useful (Moriasi et al. 2015; Daneshfaraz et al. 2021; Kalateh et al. 2024).
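The split-and-cross-validate workflow described above can be sketched as follows. The data are synthetic stand-ins; the 70/30 split and five folds mirror the text, while the choice of regressor here is arbitrary:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))           # stand-in predictor matrix
y = X[:, 0] * 2 + rng.normal(size=200)  # stand-in target

# 70% calibration / 30% validation split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Five-fold cross-validation on the calibration set
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestRegressor(n_estimators=50, random_state=0),
                         X_train, y_train, cv=cv, scoring="r2")
print(scores.mean())  # average R2 across the five folds
```

Averaging the per-fold scores, rather than reporting a single train-test result, gives the more reliable estimate discussed in the text.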
A variety of ML algorithms are available, but in this study the decision tree, random forest, gradient boosting, and XGBoost algorithms were selected not only because they are widely used for regression tasks (Alnahit et al. 2022; Zhu et al. 2022b; Aish et al. 2023), but also because of their interpretability, ease of use, and robustness, especially in the context of small datasets (Sagi & Rokach 2021). These models offer clear insights into feature importance and are relatively easy to tune compared with more complex architectures. Additional algorithms, such as support vector regression (SVR) and multilayer perceptron (MLP), were also tested, but they underperformed compared with the selected models. For phosphate, SVR and MLP achieved NSE validation scores of 0.48 and 0.23, respectively; for nitrate, the scores were 0.35 and 0.29, and for BOD, −0.14 and 0.11. This is likely due to the limited dataset size of 203 observations. Future work could explore alternative regression models with a larger dataset, providing a more tailored approach to these tasks.
The selected algorithms use decision trees as their base learners, which means they all operate by splitting the data based on certain conditions, hence creating a tree of decisions. Random forest and gradient boosting both create ensembles of decision trees, but they do so in different ways. Random forest builds and evaluates each decision tree independently and then combines them using averages or majority rules at the end of the process (Breiman 2001). On the other hand, gradient boosting starts the combining process at the beginning, creating a model in a stepwise manner by optimizing an objective function and combining a group of weak learning models to build a single strong learner (Friedman 2001). XGBoost, or extreme gradient boosting, is an efficient implementation of gradient boosting that handles missing data well and includes regularization to avoid overfitting (Chen & Guestrin 2016).
ML models can be trained and tested using their default configurations (Abuzir & Abuzir 2022), but hyperparameter tuning is recommended for optimal model performance (Xin & Mou 2022). The hyperparameters determine the structure of the model and the way the learning process should operate. The literature suggests several techniques for tuning the optimal parameters, but grid search is the most basic and commonly used method (Uddin et al. 2023). This technique involves specifying a subset of the hyperparameter space as a grid and then systematically trying all the combinations in the grid. For each combination, the model is trained and its performance measured in order to select the parameters that give the best performance. In this study, the dataset was split into calibration (70%) and validation (30%) sets, with a stratified division by quantiles to ensure representativeness in both sets (Castrillo & García 2020; Bourel et al. 2021; Nasir et al. 2022). A fivefold cross-validation was applied in this study. The hyperparameters were optimized using GridSearchCV to perform an exhaustive search over the specified hyperparameter values (Krishnaraj & Honnasiddaiah 2022; Rodríguez-López et al. 2023). Based on the literature (Piraei et al. 2023; Shan et al. 2023), Tables 1–3 display the parameters and values used in this work, as well as the best hyperparameters selected.
Table 1. Grid search values and best hyperparameters for the decision tree and random forest models.

| Hyperparameter | Grid search | Decision tree (best) | Random forest (best) |
|---|---|---|---|
| max_depth | 2, 4, 6 | 4 | 6 |
| min_samples_split | 2, 4, 6 | 6 | 2 |
| min_samples_leaf | 1, 2, 4 | 2 | 1 |
| max_features | None, sqrt, log2 | None | sqrt |
Table 2. Grid search values and best hyperparameters for the gradient boosting model.

| Hyperparameter | Grid search | Best |
|---|---|---|
| n_estimators | 10, 20, 40, 80 | 20 |
| max_depth | None, 2, 4, 6, 8 | 4 |
| min_samples_split | 2, 4, 6, 8 | 4 |
| learning_rate | 0.01, 0.1 | 0.01 |
| subsample | 0.5, 0.8 | 0.8 |
Table 3. Grid search values and best hyperparameters for the XGBoost model.

| Hyperparameter | Grid search | Best |
|---|---|---|
| learning_rate | 0.01, 0.1 | 0.1 |
| max_depth | 2, 3, 4 | 4 |
| n_estimators | 50, 100 | 100 |
| subsample | 0.5, 0.7, 1.0 | 1.0 |
| colsample_bytree | 0.5, 0.7, 1.0 | 0.7 |
| gamma | 0, 0.1, 0.3 | 0 |
| reg_alpha | 0, 0.1, 0.5 | 0.5 |
| reg_lambda | 1, 2, 3, 5 | 5 |
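As a sketch of the GridSearchCV procedure described above, the snippet below searches a reduced subset of the gradient boosting grid from Table 2 (reduced to keep the example fast, and run on synthetic stand-in data rather than the study's dataset):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=150)

param_grid = {                 # reduced subset of the Table 2 grid
    "n_estimators": [20, 40],
    "max_depth": [2, 4],
    "learning_rate": [0.01, 0.1],
    "subsample": [0.5, 0.8],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),  # fivefold CV
    scoring="r2",
)
search.fit(X, y)
print(search.best_params_)   # combination with the best mean CV score
```

Every grid combination is fitted on each of the five folds, and `best_params_` holds the combination with the highest average validation score.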
Performance assessment
The coefficient of determination (R2) is a metric used to assess the fit of a regression model (Renaud & Victoria-Feser 2010). It represents the proportion of the variance in the dependent variable that is predictable from the independent variables. In other words, R2 indicates the proportion of the total variation in the observed data that is explained by the regression model (Equation (1)). The range of R2 values depends on the type of model fitted; in standard cases like linear least-squares regression, they lie between 0 and 1, with values close to 1 indicating a strong fit between the model and the data.
The root mean squared error (RMSE) is a measure of the absolute error that squares the deviations to prevent positive and negative deviations from canceling each other out (Equation (2)). This measure weights large errors heavily, which helps identify models that produce them. RMSE is commonly used to express the accuracy of numerical results, presenting error values in the same units as the analyzed variable.
The NSE coefficient is a metric commonly used to evaluate the performance of hydrological models (Equation (3)). It provides a measure of how well the model simulation matches the observed data. The NSE compares the mean squared error (MSE) of the model predictions to the variance of the observed data. NSE values range from negative infinity to 1 and a higher value close to 1 indicates a good agreement between the simulated and observed data (Moriasi et al. 2015).
In well-calibrated models with good overall performance, R2 and NSE tend to align closely, as both assess how well the model fits the data. While both metrics derive from the sum of squared errors, R2 measures the proportion of variance explained by the model, whereas NSE compares the model's performance to a mean-based predictor. They can diverge when the model poorly predicts extreme values or exhibits bias, as NSE penalizes large errors more heavily, especially in cases of underperformance on extremes. In contrast, R2 may still indicate a reasonable fit if most data points are well-predicted. In situations where R2 and NSE are similar but overlook key aspects of model behavior, the KGE offers a more comprehensive evaluation by incorporating correlation, bias, and variability.
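Equations (1)–(3) referenced above are not reproduced in this excerpt. Their standard forms, consistent with the descriptions given (with $O_i$ the observed values, $P_i$ the predictions, $\bar{O}$ and $\bar{P}$ their means, $\sigma_O$ and $\sigma_P$ their standard deviations, and $n$ the sample size), along with the KGE, read:

```latex
R^2 = \left( \frac{\sum_{i=1}^{n} (O_i - \bar{O})(P_i - \bar{P})}
{\sqrt{\sum_{i=1}^{n} (O_i - \bar{O})^2}\,\sqrt{\sum_{i=1}^{n} (P_i - \bar{P})^2}} \right)^2
\quad \text{(1)}

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (O_i - P_i)^2}
\quad \text{(2)}

\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n} (O_i - P_i)^2}{\sum_{i=1}^{n} (O_i - \bar{O})^2}
\quad \text{(3)}

\mathrm{KGE} = 1 - \sqrt{(r - 1)^2 + (\alpha - 1)^2 + (\beta - 1)^2},
\quad \alpha = \frac{\sigma_P}{\sigma_O}, \quad \beta = \frac{\bar{P}}{\bar{O}}
```

Here $r$ is the Pearson correlation between observations and predictions, so the KGE combines the correlation, variability ratio, and bias terms discussed above.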
RESULTS
The stratified dataset split into quantiles was highly effective in producing training and test datasets with very similar descriptive statistics (Table 4), which helps improve model performance. The optimal number of quantiles was tested to conduct the split with the least difference between the descriptive statistics, resulting in the adoption of five quantiles for phosphate and nitrate, and four quantiles for BOD. The robust method of fivefold cross-validation, combined with hyperparameter tuning, achieved R2 values of 0.76, 0.67, and 0.71 for phosphate, nitrate, and BOD, and NSE values of 0.73, 0.67, and 0.66, respectively.
Table 4. Descriptive statistics of the training and test datasets.

| | Phosphate (Train) | Phosphate (Test) | Nitrate (Train) | Nitrate (Test) | BOD (Train) | BOD (Test) |
|---|---|---|---|---|---|---|
| Samples | 152 | 51 | 137 | 46 | 48 | 17 |
| Mean | 0.42 | 0.46 | 4.21 | 3.97 | 5.60 | 5.23 |
| Std | 0.51 | 0.64 | 4.01 | 3.60 | 3.88 | 3.27 |
| Min | 0.00 | 0.01 | 0.20 | 0.24 | 1.15 | 3.00 |
| 1st quartile | 0.12 | 0.12 | 0.93 | 0.93 | 3.00 | 3.00 |
| 2nd quartile | 0.12 | 0.12 | 2.94 | 2.90 | 3.00 | 3.00 |
| 3rd quartile | 0.62 | 0.57 | 6.33 | 6.31 | 7.85 | 7.00 |
| Max | 2.53 | 2.84 | 16.11 | 14.99 | 15.77 | 13.00 |
| Skewness | 1.99 | 2.36 | 1.22 | 0.98 | 1.31 | 1.37 |
| Kurtosis | 4.18 | 5.50 | 0.86 | 0.42 | 0.56 | 0.72 |
Phosphate estimates
| Model | RMSE (Train) | RMSE (Test) | NSE (Train) | NSE (Test) | R2 (Train) | R2 (Test) | KGE (Train) | KGE (Test) |
|---|---|---|---|---|---|---|---|---|
| Decision tree | 0.24 | 0.45 | 0.78 | 0.50 | 0.78 | 0.51 | 0.84 | 0.64 |
| Random forest | 0.18 | 0.40 | 0.88 | 0.61 | 0.91 | 0.71 | 0.76 | 0.49 |
| Gradient boosting | 0.14 | 0.38 | 0.92 | 0.64 | 0.95 | 0.65 | 0.79 | 0.63 |
| XGBoost | 0.10 | 0.37 | 0.96 | 0.67 | 0.96 | 0.69 | 0.90 | 0.63 |
Nitrate estimates
| Model | RMSE (Train) | RMSE (Test) | NSE (Train) | NSE (Test) | R2 (Train) | R2 (Test) | KGE (Train) | KGE (Test) |
|---|---|---|---|---|---|---|---|---|
| Decision tree | 2.44 | 3.41 | 0.63 | 0.09 | 0.63 | 0.36 | 0.71 | 0.57 |
| Random forest | 1.79 | 2.10 | 0.80 | 0.65 | 0.82 | 0.66 | 0.75 | 0.75 |
| Gradient boosting | 0.77 | 2.16 | 0.96 | 0.63 | 0.98 | 0.65 | 0.83 | 0.73 |
| XGBoost | 1.33 | 2.23 | 0.89 | 0.61 | 0.92 | 0.62 | 0.79 | 0.74 |
XGBoost, specifically, demonstrated a simple structure, emphasizing key features such as population, slope, and conductivity; the first two accounted for over 90% of the feature importance. Similarly, the random forest algorithm identified population, conductivity, and turbidity as the three most significant variables, cumulatively accounting for about 50% of the total importance. This outcome is noteworthy, as the population variable represents a considerable source of nitrogen, and the slope variable is linked to the transformation of organic nitrogen into nitrate: steeper slopes incorporate more dissolved oxygen into the water, facilitating aerobic bacterial activity in the nitrogen cycle. This result indicates that the physical processes of the watershed were captured by the data structure in the ML process. The random forest model exhibited the most optimized learning curve among the evaluated models for nitrate estimation, displaying consistent behavior for both the training and test sets, with error decreasing as more samples were incorporated. These observations highlight the necessity for comprehensive performance analysis of the models, extending beyond global metrics such as R2 and NSE to incorporate aspects such as the learning curve and the feature importances.
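The two diagnostics discussed here, feature importances and learning curves, can be sketched with scikit-learn as follows. The data are synthetic, with one feature dominant by construction, and stand in for the study's dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve, KFold

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] + rng.normal(scale=0.3, size=200)  # feature 0 dominates

# Feature importances (they sum to 1.0 for tree ensembles in scikit-learn)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = model.feature_importances_

# Learning curve: CV score as a function of the training-set size
sizes, train_scores, test_scores = learning_curve(
    RandomForestRegressor(n_estimators=50, random_state=0), X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    train_sizes=np.linspace(0.2, 1.0, 5), scoring="r2")
print(importances.argmax(), test_scores.mean(axis=1))
```

A test-score curve that keeps rising with more samples, as described for BOD later in the paper, signals that a larger dataset would likely help.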
BOD estimates
| Model | RMSE (Train) | RMSE (Test) | NSE (Train) | NSE (Test) | R2 (Train) | R2 (Test) | KGE (Train) | KGE (Test) |
|---|---|---|---|---|---|---|---|---|
| Decision tree | 1.85 | 2.92 | 0.77 | 0.16 | 0.77 | 0.45 | 0.83 | 0.60 |
| Random forest | 2.54 | 2.21 | 0.56 | 0.52 | 0.64 | 0.65 | 0.49 | 0.58 |
| Gradient boosting | 1.15 | 2.14 | 0.91 | 0.55 | 0.96 | 0.58 | 0.75 | 0.57 |
| XGBoost | 1.85 | 1.97 | 0.77 | 0.62 | 0.84 | 0.69 | 0.64 | 0.59 |
DISCUSSION
The robust fivefold cross-validation, along with hyperparameter tuning, achieved R2 values of 0.71, 0.66, and 0.69 for phosphate, nitrate, and BOD; NSE values of 0.67, 0.65, and 0.62; and KGE values of 0.64, 0.75, and 0.60, respectively. Following the performance criteria suggested by Moriasi et al. (2015), all three parameters exhibit good performance. Sadayappan et al. (2022) found an average testing R2 of 0.69 for nitrate across many sites in the USA, while Tso et al. (2023) achieved R2 values of 0.71 and 0.58 for nitrate and orthophosphate in Great Britain's rivers, respectively. Alnahit et al. (2022) observed NSE ranging from 0.45 to 0.6 for total nitrogen and from 0.3 to 0.6 for total phosphorus, employing a range of biophysical parameters. Notably, high-performance models often rely on highly correlated variables; for example, the Nafsin & Li (2022) study achieved an R2 of 0.91 for BOD prediction but included chemical oxygen demand (COD) as a predictor. By definition, BOD constitutes the biodegradable portion of COD, which also requires laboratory analysis. Thus, our approach has the advantage of using only easily obtainable parameters. Despite using fivefold cross-validation and hyperparameter tuning to control model complexity, overfitting remained a challenge. For the gradient boosting and XGBoost models (Table 3), learning rates of 0.01 and 0.1 were tested to control the step size during each boosting round, while maximum tree depth was constrained (2–4) to avoid overly complex models (Bentéjac et al. 2021). The number of estimators was set between 50 and 100, while subsample and colsample_bytree values (ranging from 0.5 to 1.0) were adjusted to reduce variance. The min_child_weight parameter ensured that each leaf contained enough data points, preventing overly complex trees (Ahmed et al. 2024).
In XGBoost, the gamma parameter was applied for pruning, while L1 (reg_alpha) and L2 (reg_lambda) regularizations were adjusted to penalize large coefficients and reduce overfitting (Ying 2019).
After automated tuning with GridSearchCV, manual adjustments were made to further refine the models. For gradient boosting, n_estimators was set to 20 and max_depth to 4. The n_estimators parameter controls the number of boosting iterations; more rounds generally improve performance but increase the risk of overfitting, so 20 was chosen as a balanced value. The max_depth parameter limits the complexity of each tree, reducing the likelihood of the model becoming overly specific to the training data. This adjustment reduced the calibration R2 for phosphate from 1 to 0.95, while the validation R2 improved from 0.64 to 0.65.
In the XGBoost model, reg_alpha was set to 0.5 and reg_lambda to 5. These regularization terms directly influence the loss function by penalizing large weights, with reg_alpha applying an L1 penalty (proportional to the absolute values of weights) and reg_lambda applying an L2 penalty (proportional to the squared weights). These penalties help control the magnitude of the model's parameters, promoting simpler models and reducing the risk of overfitting to noise. After regularization, the phosphate calibration R2 slightly decreased from 0.98 to 0.96, while the validation R2 dropped from 0.76 to 0.69. Although the changes in calibration R2 were minor, the dispersion in the validation scatter plot improved significantly. While techniques like early stopping or data augmentation could further enhance models with smaller datasets (Ying 2019), the current configuration performed well during cross-validation, indicating adequate generalization to the test set.
Segmenting the dataset into quantiles in a stratified manner proved highly effective, yielding comparable descriptive statistics for both training and testing datasets, thereby improving model performance. In contrast, Castrillo & García (2020) applied a stratified sample to obtain a representative and unbiased validation dataset, employing the native function of StratifiedShuffleSplit from the Scikit-Learn library. However, this approach maintains the same percentage for each target class without considering percentiles, which we believe provides superior performance by offering a predefined, user-oriented method for dividing sample regions.
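A sketch of the quantile-based stratification described above, using pandas `qcut` to form the quantile classes and passing them as the `stratify` argument of `train_test_split`. The data are illustrative, and the study's exact implementation may differ:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = pd.Series(rng.lognormal(size=200))      # skewed, like the concentrations
X = pd.DataFrame({"x": rng.normal(size=200)})

bins = pd.qcut(y, q=5, labels=False)        # five quantile classes

# Each quantile class is represented proportionally in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=bins, random_state=0)
print(y_tr.median(), y_te.median())
```

Because the split preserves each quantile class's share, the train and test sets end up with the similar descriptive statistics reported in Table 4.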
Hyperparameter optimization was a crucial step in enhancing the performance of the ML models, a fact also observed by Yan et al. (2023) in predicting water quality, as the default settings generally do not lead to the best performance. The use of fivefold cross-validation strengthens our approach, in agreement with the literature (Grbčić et al. 2022). Additionally, Jung et al. (2020) assessed the impact of cross-validation on the efficiency of ML models and pointed out that the greater the number of cross-validations, the more reliable the estimates become. Specifically, regarding the BOD parameter in urban areas, the model's learning curve analysis indicated that a larger dataset could enhance its learning. Although we achieved satisfactory results, it is recommended to use a larger dataset (Shen 2018).
While a high correlation between dependent and independent variables is beneficial in many ML techniques, it is not strictly necessary (Tahmasebi et al. 2020), as it is for statistical approaches. ML algorithms can capture complex, nonlinear patterns even when variable correlations are weak (Najah Ahmed et al. 2019). The inclusion of variables related to the physics of the problem was also very important. For instance, population directly influences the pollution loads of nitrate, phosphate, and BOD. Similarly, the curve number (CN) reflects water runoff in the basin as a function of land-use type. Basin area, river slope, water level, and stream flow are physical variables that directly influence chemical transport, dilution, and decay. Furthermore, physicochemical parameters such as conductivity, turbidity, pH, temperature, and dissolved oxygen are potentially correlated with the dependent variables and help improve the efficiency of the algorithms. The inclusion of these variables corroborates the literature finding that the quality of input parameters matters more than their quantity for enhancing model prediction (Zare Abyaneh 2014).
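The point that tree-based models can exploit weakly correlated predictors can be illustrated with a toy example: a quadratic relationship has near-zero Pearson correlation, yet a random forest fits it well. The quadratic form is an assumption chosen purely for demonstration.

```python
# Demonstration: near-zero linear correlation, yet a nonlinear pattern a
# random forest can learn. The quadratic relationship is a toy assumption.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
x = rng.uniform(-3, 3, size=500)
y = x**2 + rng.normal(scale=0.2, size=500)  # symmetric: Pearson r is near zero

r = np.corrcoef(x, y)[0, 1]
model = RandomForestRegressor(random_state=7).fit(x.reshape(-1, 1), y)
score = model.score(x.reshape(-1, 1), y)  # in-sample R2

print(f"Pearson r = {r:.2f}, random-forest R2 = {score:.2f}")
```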
When evaluating water quality, several additional factors within the historical data series require attention, including climatic seasonality, the discharge of residential sewage and industrial wastewater, and engineering interventions in the watershed that directly affect the behavior of the dataset (von Sperling et al. 2020; De Andrade Costa et al. 2023; Dos Santos Ferreira et al. 2023; da Silva Junior et al. 2024). Between 2009 and 2022, more than 20 sewage treatment units were implemented in the Piabanha watershed. Metaphorically, an expert examining a region's water quality data follows a decision-tree approach that accounts for temporality and seasonality: the year reveals structural interventions within the watershed, while the month captures the climatic seasonality that affects flow rate, ultimately guiding the analysis of constituent concentrations. Using dates in ML algorithms aligns with the literature (Arias-Rodriguez et al. 2023) and was beneficial in this study, as observed in the feature importance analysis.
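In practice, using sampling dates as model inputs amounts to expanding each timestamp into features a tree-based model can split on: the year can encode structural interventions and the month climatic seasonality. The column names and values in this sketch are illustrative assumptions.

```python
# Minimal sketch of deriving year/month features from sampling dates.
# Column names and values are illustrative, not the study's data.
import pandas as pd

samples = pd.DataFrame({
    "date": pd.to_datetime(["2009-03-15", "2015-11-02", "2022-07-20"]),
    "phosphate": [0.42, 0.31, 0.18],
})

samples["year"] = samples["date"].dt.year    # proxy for watershed interventions over time
samples["month"] = samples["date"].dt.month  # proxy for climatic seasonality

print(samples[["year", "month"]].to_dict("list"))
```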
CONCLUSION
Our study demonstrated significant results through the innovative application of ML techniques. We incorporated a robust fivefold cross-validation technique and hyperparameter optimization, which resulted in high validation scores for phosphate, nitrate, and BOD. The selection of independent variables proved to be crucial. Variables related to the physics of the problem were key contributors to the efficiency of the algorithms. The model performances were further enhanced by the effective stratified segmentation of the dataset into quantiles, leading to similar descriptive statistics for both training and testing datasets.
Investigating the performance of various ML algorithms, we found differing behaviors in their ability to generalize and fit the data. The decision tree showed the lowest generalization capacity, performing well on the training data but poorly on the test dataset. Gradient boosting proved the most prone to overfitting the training set, yet still produced good results on the test set. Random forest was less susceptible to overfitting than gradient boosting because of its ensemble nature, while achieving similar performance. XGBoost generally outperformed the other models, largely owing to the characteristics of its algorithm: while data preprocessing, feature engineering, and hyperparameter tuning significantly influence model performance, XGBoost's core boosting framework, effective regularization, and advanced tree-optimization strategies make it particularly powerful compared with other tree-based algorithms.
On the other hand, establishing a good model requires analyzing the entire modeling design, including hyperparameter optimization, data preprocessing methods, dataset splitting, cross-validation, statistical evaluation metrics, feature analysis, and learning curve analysis. The overall analysis approach is as important as algorithm selection since different algorithms can produce similar metric results with minor differences. Therefore, instead of assuming one algorithm is the best for a particular situation, it is more advisable to apply a set of techniques to determine the most suitable approach. As a limitation, regarding the BOD parameter in urban areas, we found that a larger dataset could enhance learning and model performance. However, it is worth noting that simply increasing the size of a dataset does not always guarantee improved results. Our findings underscore the importance of not just quantity, but also the quality of data in building robust ML models.
ACKNOWLEDGEMENTS
The authors acknowledge the financial support provided by the following Brazilian agencies: FAPERJ, Carlos Chagas Filho Foundation for Research Support of the State of Rio de Janeiro; CNPq, National Council for Scientific and Technological Development; and CAPES, Coordination for the Improvement of Higher Education Personnel (Finance Code 001). The authors appreciate the anonymous reviewers for their valuable comments, which helped clarify several points in the original manuscript.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.