ABSTRACT
This study presents an in-depth analysis of machine learning (ML) techniques for predicting water quality index and water quality classification using a dataset containing water quality metrics such as temperature, specific conductance, salinity, dissolved oxygen, depth, pH, and turbidity from multiple monitoring stations. Data preprocessing included imputation for missing values, feature scaling, and categorical encoding, ensuring balanced input features. This research evaluated artificial neural networks, decision trees, support vector machines, random forests, XGBoost, and long short-term memory (LSTM) networks. Results demonstrate that XGBoost and LSTM significantly outperformed other models, with XGBoost achieving an accuracy range of 99.07–99.99% and LSTM attaining an R2 of 0.9999. Compared with prior studies, our approach enhances predictive accuracy and robustness, showcasing advanced generalization capabilities. The proposed models exhibit significant improvements over traditional methods in handling complex, multivariate water quality data, positioning them as promising tools for water quality prediction and environmental management. These findings underscore the potential of ML for developing reliable, scalable water quality monitoring solutions, providing valuable insights for policymakers and environmental managers dedicated to sustainable water resource management.
HIGHLIGHTS
XGBoost and LSTM excel in WQI/WQC prediction.
XGBoost achieves 99.83% peak accuracy.
LSTM boasts an R2 value of 0.9999.
Advanced models enhance water quality data management.
Results inform sustainable water resource strategies.
ABBREVIATIONS
- ANFIS
adaptive neuro fuzzy inference system
- ANN
artificial neural network
- Bi-LSTM
bi-directional LSTM
- CV-3
three-fold cross-validation
- DT
decision tree
- GBoost
gradient boosting classifier
- GBR
gradient boosting regression
- LSTM
long short-term memory
- ML
machine learning
- MLP
multilayer perceptron
- R2
coefficient of determination
- RF
random forest
- SVM
support vector machine
- SVR
support vector regression
- WAWQI
weighted arithmetic WQI
- WQC
water quality classification
- WQI
water quality index
- XGBR
XGBoost regression
INTRODUCTION
Water quality degradation has become a significant environmental concern worldwide, impacting human health, ecosystem stability, and economic development (Acheampong & Opoku 2023). Pollution discharge into sensitive ecosystems has profoundly disrupted their health and functionality, leading to issues like biodiversity loss and reduced ecosystem services (Mahdian et al. 2024). Globally, the growing levels of pollution from various sources such as industrial effluents, agricultural runoff, and urban wastewater have intensified the contamination of water bodies (Guo et al. 2024). These discharges introduce nutrient loads, emerging pollutants, and heavy metals that degrade water quality. For instance, nutrient loading from agricultural runoff contributes to eutrophication, promoting harmful algal blooms and depleting oxygen levels, which threaten aquatic life (Masum Beg et al. 2024; Tian et al. 2024b). In parallel, the presence of heavy metals like lead, mercury, and cadmium poses long-term risks to both ecosystems and human health due to their bioaccumulative nature (Bhagat et al. 2020; Mohammadpour et al. 2024; Tian et al. 2024a).
The complexity of water quality issues requires robust predictive models to anticipate changes and support timely interventions. Traditional approaches to water quality monitoring and prediction, while valuable, are often limited in handling large datasets and capturing non-linear relationships. Recent advancements in machine learning (ML) and artificial intelligence offer promising alternatives (Amaranto & Mazzoleni 2023; Bhagat et al. 2023a). However, despite significant progress, current ML models in water quality assessment face challenges in areas like uncertainty quantification, limited adaptability across diverse hydrological contexts, and scalability to new geographical regions (Ghiasi et al. 2022; Zhang et al. 2022).
Related work
In the field of water quality evaluation, researchers have leveraged various computational methods to predict and assess water resource conditions. Commonly employed techniques include artificial neural networks (ANNs), support vector regression (SVR), and decision trees (DTs), each offering unique strengths and limitations. For example, Fang et al. (2019) examined the use of ANN and SVR, highlighting their ability to capture non-linear relationships in water quality data. However, these models often require large datasets and are sensitive to parameter tuning, limiting their generalizability across different environmental contexts. ML methods have become highly effective tools for forecasting and managing environmental issues such as rainfall or water quality, offering the ability to handle complex datasets and uncover patterns that traditional methods may overlook (Kumar et al. 2021). Various studies have demonstrated the efficacy of ML models in water quality prediction (Liao et al. 2020). For instance, Liu et al. (2019) utilized a long short-term memory (LSTM) network to predict water quality in the Yangtze River Basin, showing significant potential for real-time monitoring. Khullar & Singh (2020) introduced a bi-directional LSTM (Bi-LSTM) model that surpassed conventional methods in forecasting water quality parameters of the Yamuna River. Similarly, Abba et al. (2020) compared multiple ML techniques and found that adaptive neuro fuzzy inference system (ANFIS) and multilayer perceptron (MLP) provided reliable forecasts for the water quality index (WQI) (Abba et al. 2021).
Studies such as Khullar & Singh (2022) introduced a Bi-LSTM model for the Yamuna River in India, demonstrating enhanced accuracy for water quality parameter forecasting. Despite these improvements, Bi-LSTM models face challenges in handling data with limited historical records and require substantial computational resources, which can restrict their application. A comparative study by Abba et al. (2020) evaluated backpropagation neural networks, ANFIS, and MLP for WQI estimation. While their findings showed that neural ensembles offered robust predictive capabilities, limitations in uncertainty quantification and interpretability remain prevalent. Elbeltagi et al. (2022) advanced this discussion by using additive regression and M5P tree models, which were shown to be effective in specific basins like Akot, yet these models may not generalize well across diverse water bodies. In addition, Asadollah et al. (2021) implemented extra tree regression (ETR) to forecast monthly WQI in Hong Kong, demonstrating improved forecasting accuracy with reduced input variables. However, ETR and similar ensemble methods such as random forest (RF) often encounter overfitting in smaller datasets, particularly in urban water systems with high variability in water quality parameters.
Studies such as Hassan et al. (2021) further explored ML techniques across a large dataset in India, using methods like Bayesian tree models and multiple linear regression alongside RF and SVM. While their results indicated high predictive accuracy (up to 99.99%), these studies highlighted the difficulty of managing overfitting and the need for robust uncertainty quantification, a persistent challenge in ML-based water quality models. Dodig et al. (2024) applied LSTM networks to multistep water quality prediction, focusing on dissolved oxygen, conductivity, and chemical oxygen demand. Using the Sava River as a case study, the authors combined LSTM with locally weighted scatterplot smoothing (LOWESS) for enhanced data preprocessing and compared the results with an SVR baseline. The LSTM model outperformed SVR, achieving high accuracy (R2 up to 0.9998) and low RMSE for a 5-day prediction period, demonstrating its reliability for water quality monitoring (Dodig et al. 2024).
Gap analysis and research objectives
Despite these advancements, several gaps remain in the literature. Previous studies often focus on a limited set of ML techniques or specific WQIs (Yan et al. 2024), which may not fully capture the complex and multifaceted nature of water quality prediction. Additionally, there is a need for more comprehensive evaluations of these models using diverse datasets and performance metrics to ensure their robustness and generalizability across different geographical and hydrological contexts (Chen et al. 2020a). Most existing models struggle with uncertainty quantification and require large, diverse datasets for accurate performance, limiting their scalability. Their adaptability to varying hydrological conditions and pollutant profiles also remains limited. This study seeks to fill these gaps through a thorough assessment of sophisticated ML algorithms for water quality prediction (Zhou & Zhang 2023), conducting a comprehensive and methodical comparison of a diverse set of models, including ANN, support vector machine (SVM), DT, RF, XGBoost, and LSTM networks. This comparison is based on rigorous testing using robust datasets of WQI and water quality classification (WQC). To enhance the performance of the models, we implemented grid search for hyperparameter optimization, which methodically examines a range of hyperparameter combinations to find the optimal configuration for each model (Salehin et al. 2024). The results of this study clarified the relative significance of different water quality parameters. Electrical conductivity (EC), nitrate, dissolved oxygen (DO), pH, biological oxygen demand (BOD), and total coliform (TC) were identified as crucial indicators for assessing water quality, with respective parameter significance scores of 81.494, 74.78, 105.770, 36.805, 130.173, and 105.166.
These results provide valuable insights into the hierarchical importance of different water quality metrics, potentially informing future monitoring and management strategies in diverse hydrological contexts. This research also assesses the robustness, scalability across varied water quality parameters, strengths, and limitations of each model to understand their practical applications in environmental monitoring and assessment. Another key objective is to provide guidance for researchers and practitioners in selecting the most appropriate ML techniques for evaluating and managing water quality using accuracy, precision, recall, F1 score, R2, RMSE, and MSE. Our approach aims to advance model generalizability and reliability, providing novel insights into ML's role in environmental management and water quality assessment (Miller et al. 2024). The research also aims to demonstrate the potential of ML models to enhance water quality monitoring and management, thereby contributing to public health and environmental sustainability. This involves showcasing the practical implications of the study's findings for real-world applications. The novelty of the study lies in its comprehensive comparison of multiple advanced ML models for water quality prediction, using a robust dataset and a wide range of performance metrics. By providing a broader international context and emphasizing the global significance of water quality management, this paper seeks to provide significant insights into the utilization of ML techniques for environmental monitoring. The results of this study present practical implications for policymakers, researchers, and practitioners in the field of water quality management, promoting the adoption of advanced predictive models to safeguard water resources worldwide.
METHODOLOGY
A schematic flow diagram of the proposed mechanism for forecasting water quality.
The proposed methodology, shown in Figure 1, applies advanced ML techniques to assess water quality, leveraging a comprehensive dataset that includes seven essential parameters: dissolved oxygen, pH, conductivity, biological oxygen demand, nitrate, fecal coliform, and total coliform. Before analysis, the dataset was subjected to preprocessing steps, such as mean imputation for handling missing values and data normalization to maintain consistency among the variables. In adherence to standard ML practices, the dataset was partitioned into training and testing subsets, with 80% allocated for model training and the remaining 20% reserved for subsequent evaluation. The research protocol incorporates two distinct analytical objectives: WQC and WQI prediction.
For the classification task, five advanced algorithms were employed: ANN, RF, DT, SVM, and XGBoost. Concurrently, the WQI prediction utilized LSTM, XGBoost, DT, and RF algorithms. To optimize model performance, a rigorous hyperparameter tuning process was implemented during the training phase. This process utilized a grid search methodology in conjunction with three-fold cross-validation (CV-3), ensuring robust model selection and minimizing the risk of overfitting.
Dataset description and processing
The dataset used in this study comprises essential water quality metrics recorded at 30-min intervals from multiple monitoring stations over a period spanning from 2004 to 2006. Specifically, the dataset consists of 61,542 instances. Key input parameters include temperature, specific conductance, salinity, dissolved oxygen percentage, dissolved oxygen concentration, depth, pH, and turbidity. These metrics collectively provide a comprehensive representation of water quality, supporting robust modeling and predictive analyses across varied environmental conditions.
To ensure data quality and consistency, several data preparation steps were undertaken. Missing values in numerical features were imputed using mean values, while categorical variables were completed with mode imputation. All input features were then standardized to a mean of zero and a standard deviation of one to ensure balanced input contributions across the dataset, optimizing the dataset for ML applications and supporting equitable performance across all water quality parameters.
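The preprocessing steps described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the authors' code; the helper names (`impute_mean`, `impute_mode`, `standardize`) and the toy columns are hypothetical, and population standard deviation is assumed for the scaling step.

```python
from collections import Counter
from statistics import mean, pstdev

def impute_mean(values):
    """Replace None entries in a numeric column with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Replace None entries in a categorical column with the most frequent label."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

def standardize(values):
    """Rescale a numeric column to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(v - mu) / sigma for v in values]

# Hypothetical toy columns, not taken from the study's dataset.
ph = [6.6, None, 6.8, 7.1, 6.9]
station = ["A", "B", None, "A", "A"]

ph_filled = impute_mean(ph)
station_filled = impute_mode(station)
ph_scaled = standardize(ph_filled)
```

In practice the imputation and scaling statistics would normally be estimated on the training portion only and then reused on the test portion, so that no information from the test period leaks into the model inputs.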
For the experimental setup, data from 1 January 2004 to 24 February 2006 were allocated for model training, while the final 10 months (from 25 February 2006 to 31 December 2006) served as the testing period. A lead time of 24 h was applied for predictions, meaning the model forecasts water quality metrics one day in advance, based on preceding data. This approach enables the assessment of model performance for short-term predictions, providing valuable insights into daily water quality trends and facilitating proactive water management strategies. The dataset was obtained from a public GitHub repository (https://github.com/PritiG1/Multivariate_forecasting_waterquality).
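To make the lead time concrete: at a 30-min sampling interval, a 24 h lead corresponds to 48 time steps. The sketch below pairs each observation with the value recorded 48 steps later and then splits the pairs chronologically. The function names, the toy series, and the choice to key the split on the input timestamp are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=30)              # sampling interval stated in the text
LEAD_STEPS = timedelta(hours=24) // STEP  # 24 h lead time expressed in steps

def make_lead_pairs(timestamps, values, lead_steps):
    """Pair each observation with the value observed `lead_steps` samples later."""
    pairs = []
    for i in range(len(values) - lead_steps):
        pairs.append((timestamps[i], values[i], values[i + lead_steps]))
    return pairs

def chronological_split(pairs, cutoff):
    """Pairs whose input time is on or before `cutoff` train the model; later ones test it."""
    train = [p for p in pairs if p[0] <= cutoff]
    test = [p for p in pairs if p[0] > cutoff]
    return train, test

# Hypothetical four-day toy series; each value is simply the sample index.
start = datetime(2006, 2, 23)
timestamps = [start + i * STEP for i in range(4 * 48)]
values = list(range(len(timestamps)))

pairs = make_lead_pairs(timestamps, values, LEAD_STEPS)
train, test = chronological_split(pairs, cutoff=datetime(2006, 2, 24, 23, 30))
```

Splitting by time rather than at random keeps every test input later than every training input, which mirrors how a forecasting model would be used operationally.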
The output parameters contained both the WQI and WQC, which were obtained from the processed input characteristics. The dataset was separated into training and testing sets to analyze the performance of the ML models effectively. Table 1 presents the statistical values for various water quality parameters.
Table 1. Statistical calculation of the features
Statistic | Temp | SpCond | Sal | DO_pct | DO_mgl | Depth | pH | Turb | WQI |
---|---|---|---|---|---|---|---|---|---|
Mean | 16.95 | 0.173 | 0.100 | 82.72 | 8.423 | 0.301 | 6.875 | 0.017 | 25.50 |
Std | 8.27 | 0.036 | 0.015 | 23.84 | 3.172 | 0.147 | 0.386 | 0.065 | 10.54 |
Min | 0.90 | 0.010 | 0.000 | 2.80 | 0.20 | −0.13 | 5.500 | 0.003 | 5.70 |
25% | 9.10 | 0.150 | 0.100 | 73.60 | 6.20 | 0.200 | 6.600 | 0.009 | 16.69 |
50% | 17.70 | 0.170 | 0.100 | 89.40 | 8.70 | 0.280 | 6.800 | 0.011 | 25.11 |
75% | 24.70 | 0.190 | 0.100 | 97.30 | 11.00 | 0.380 | 7.100 | 0.014 | 33.09 |
Max | 33.50 | 0.610 | 0.300 | 157.20 | 14.80 | 1.300 | 9.600 | 2.531 | 63.06 |


Heatmap of correlation matrices between the output and features of the dataset.
ML models
This study assesses the performance of several ML models, namely SVM, DT, ANN, RF, XGBoost, and LSTM networks (Md Jahidul et al. 2024), each of which offers distinct advantages. These models were selected for their demonstrated ability to manage and analyze complex, high-dimensional datasets of the kind typically encountered in water quality studies. The basic concepts of the applied models are outlined below.
ANNs are computational models inspired by the structure of biological neural networks and designed to capture complex, non-linear interactions between inputs and outputs (Otchere et al. 2021). ANNs consist of layers of linked neurons, enabling them to learn and represent a wide range of functions given sufficient depth and data. In applications such as forecasting maximum scour depth, ANNs demonstrate adaptive learning capabilities that do not depend on predetermined functional forms, allowing them to adequately simulate the complicated non-linearities prevalent in hydraulic processes. A broad overview of the model is given by Chen et al. (2020b). This architecture and its inherent flexibility make ANNs particularly valuable in hydro-informatics, wastewater treatment, and other fields where understanding and predicting complex interactions are essential (Jawad et al. 2021).
Unlike traditional regression methods, SVR does not rely on assumptions about the data distribution, making it versatile for various applications. SVR seeks the function that minimizes prediction error while keeping model complexity under control. It uses a kernel function to map input data into a higher-dimensional space, allowing it to handle non-linear relationships (Zhang & O'Donnell 2020). A comprehensive treatment of SVR is given by Smola & Schölkopf (2004) and Zhang & O'Donnell (2020).
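For reference, the trade-off between prediction error and model complexity is commonly written as the standard ε-insensitive SVR problem found in the literature. The symbols below (weights w, bias b, penalty C, tolerance ε, slack variables ξ, feature map φ, kernel k) follow the usual textbook notation and are not quantities reported in this study.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^{*}} \quad & \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^{*} \right) \\
\text{subject to} \quad & y_i - \langle w, \phi(x_i) \rangle - b \le \varepsilon + \xi_i, \\
& \langle w, \phi(x_i) \rangle + b - y_i \le \varepsilon + \xi_i^{*}, \\
& \xi_i,\ \xi_i^{*} \ge 0, \qquad i = 1, \dots, n, \\
\text{with kernel} \quad & k(x, x') = \langle \phi(x), \phi(x') \rangle .
\end{aligned}
```

Errors smaller than ε incur no penalty, the first term limits model complexity, and C sets how strongly larger deviations are penalized.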
RF is a robust ensemble learning approach used in ML for regression, classification, and prediction problems based on input characteristics. This approach generates numerous DTs during training and outputs the average prediction (for regression tasks) or the majority vote (for classification tasks) from all individual trees (Wang et al. 2018; Kombo et al. 2020). The operational mechanics of RF are described by Antoniadis et al. (2021).
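The two aggregation rules mentioned above can be illustrated in isolation. The sketch assumes the individual tree outputs are already available; the function names and the example values are hypothetical.

```python
from collections import Counter
from statistics import mean

def aggregate_regression(tree_predictions):
    """Regression forest output: the average of the individual tree predictions."""
    return mean(tree_predictions)

def aggregate_classification(tree_votes):
    """Classification forest output: the label predicted by the most trees."""
    return Counter(tree_votes).most_common(1)[0][0]

# Hypothetical outputs from three trees for a single sample.
wqi_estimate = aggregate_regression([24.0, 26.0, 25.0])
wqc_label = aggregate_classification(["good", "poor", "good"])
```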
XGBoost regression (XGBR) is an advanced implementation of gradient boosting that builds a predictive model by combining the strengths of multiple weak learners (typically DTs). XGBoost incorporates several enhancements over traditional gradient boosting regression (GBR), including regularization to prevent overfitting, parallelization for faster computation, and the ability to handle missing values (Chen & Guestrin 2016). Further details are given by Meng et al. (2020) and Bhagat et al. (2022). For groundwater prediction, features related to hydrological data, geological features, weather patterns, and land use are considered (Zhao et al. 2024). These features might include groundwater levels at different depths, precipitation, temperature, soil type, and land cover.
DT models are a widely used ML technique for both classification and regression tasks. A comprehensive overview of the DT model and its recent advances is given by Costa & Pedreira (2023). Because of their simplicity, DT models are easy to understand and interpret: they provide a clear visualization of the decision process, making it straightforward to follow the prediction path (Safavian & Landgrebe 1991). DT models are also computationally efficient, especially for large datasets, and can handle high-dimensional data without requiring extensive computational resources. Finally, DT models are effective at capturing non-linear relationships between input variables and the target variable, which makes them appropriate for difficult prediction problems where these relationships are not obvious (Breiman et al. 1984).
Performance measurements
The performance of the models was evaluated using various metrics for both classification (WQC) and regression (WQI) tasks. For classification, we used accuracy, precision, recall, and F1 score. For regression, we used R2, RMSE, and MSE. These metrics provide a comprehensive assessment of model performance, capturing both the accuracy and the robustness of the predictions.
Precision can be enhanced by adjusting the model parameters. However, it is essential to note that increasing precision often results in a decrease in recall, and similarly, an increase in recall usually leads to a reduction in precision (Smith & Doe 2020).
The recall value of any ML model can be adjusted by modifying various parameters or hyperparameters. Altering these parameters can either increase or decrease recall. When recall is high, most positive instances (true positives + false negatives) are identified as positive, resulting in more false positives and lower precision. Conversely, with low recall, there are more false negatives (positives incorrectly labeled as negatives), which implies that positive predictions are more reliable, albeit at the expense of missing some positive instances (Johnson & Williams 2021).
The F1 score is advantageous in most cases as it balances the importance of both precision and recall. This metric is particularly beneficial when positive and negative results are equally costly. However, if the costs of false positives and false negatives differ significantly, it is advisable to consider precision and recall separately (Lee & Kim 2022).
While accuracy offers a general assessment of model performance, it is critical to evaluate it in conjunction with other metrics, especially in cases of imbalanced datasets (Anderson & Taylor 2023).
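The four classification metrics can be computed directly from true and predicted labels. The sketch below assumes support-weighted averaging of the per-class scores for the multi-class case; the source does not state which averaging was used, so this is an assumption, and the labels are hypothetical. One property of this averaging is that weighted recall always equals accuracy, which would be consistent with the identical accuracy and recall columns in Table 2.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1 for multi-class labels."""
    n = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = count - tp
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = count / n  # class share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical water quality classes for six samples.
y_true = ["good", "good", "good", "fair", "fair", "poor"]
y_pred = ["good", "good", "fair", "fair", "fair", "good"]
scores = weighted_metrics(y_true, y_pred)
```

In this toy case the single "poor" sample is missed entirely, yet accuracy remains at two thirds, which illustrates why accuracy alone can hide poor performance on rare classes.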
Lower MSE values indicate better model performance (Kaliappan et al. 2021).
RMSE is expressed in the same units as the observed and predicted values, which facilitates the interpretation of the error magnitude (Willmott 1982).
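R2 values typically range from 0 to 1, where higher values denote improved model performance (Miles 2014); on held-out data, R2 can fall below 0 when a model predicts worse than the mean of the observations.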
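The three regression metrics follow from the residuals between observed and predicted values. The sketch below uses the standard definitions (MSE as the mean squared residual, RMSE as its square root, and R2 as one minus the ratio of residual to total sum of squares); the function name and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean

def regression_metrics(observed, predicted):
    """Return MSE, RMSE and R2 for paired observed and predicted values."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    mse = mean([r * r for r in residuals])
    rmse = sqrt(mse)  # same units as the observations
    obs_mean = mean(observed)
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    r2 = 1 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "r2": r2}

# Hypothetical WQI values.
observed = [20.0, 25.0, 30.0, 35.0]
good_fit = regression_metrics(observed, [21.0, 24.0, 31.0, 34.0])
poor_fit = regression_metrics(observed, [35.0, 30.0, 25.0, 20.0])
```

The second call reverses the trend on purpose, so its residual sum of squares exceeds the total sum of squares and R2 becomes negative.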
Grid search for hyperparameter tuning
Grid search is a widely used technique in ML for hyperparameter tuning. It systematically searches a predefined set of hyperparameter values to determine the best combination for a given model. This process is crucial for enhancing model performance and limiting overfitting (Bhagat et al. 2021). Grid search mitigates the risk of overfitting by evaluating each hyperparameter configuration on held-out validation data and selecting the one that performs best there, which favors settings that generalize to new data rather than memorize the training set and that balance bias and variance. The importance of systematic hyperparameter tuning is well documented in the literature. Bergstra & Bengio (2012) discuss grid search as a standard alternative to manual tuning, and they also report that random search can be more efficient than grid search when only a few hyperparameters strongly affect performance.
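The procedure can be sketched as an exhaustive loop over hyperparameter combinations, each scored by three-fold cross-validation. The example below is a dependency-free illustration rather than the authors' pipeline: the k-nearest-neighbour regressor, the grid values, the interleaved fold assignment, and the toy data are all assumptions chosen to keep the sketch short.

```python
from itertools import product
from statistics import mean

def knn_predict(train, x, k, weighted):
    """Predict y at x from the k nearest training points (1-D inputs)."""
    nearest = sorted(train, key=lambda xy: abs(xy[0] - x))[:k]
    if not weighted:
        return mean(y for _, y in nearest)
    weights = [1.0 / (abs(xi - x) + 1e-9) for xi, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

def k_fold_indices(n, n_folds):
    """Split range(n) into n_folds interleaved validation folds."""
    return [list(range(f, n, n_folds)) for f in range(n_folds)]

def cv_mse(data, params, n_folds=3):
    """Mean validation MSE of one hyperparameter setting across the folds."""
    fold_errors = []
    for val_idx in k_fold_indices(len(data), n_folds):
        held_out = set(val_idx)
        train = [d for i, d in enumerate(data) if i not in held_out]
        errors = [(knn_predict(train, data[i][0], **params) - data[i][1]) ** 2
                  for i in val_idx]
        fold_errors.append(mean(errors))
    return mean(fold_errors)

def grid_search(data, grid, n_folds=3):
    """Score every combination in `grid`; return (best_params, best_score)."""
    names = list(grid)
    best = None
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = cv_mse(data, params, n_folds)
        if best is None or score < best[1]:
            best = (params, score)
    return best

# Hypothetical toy data: y = 2x sampled at integer x.
data = [(float(x), 2.0 * x) for x in range(12)]
grid = {"k": [1, 2, 4], "weighted": [False, True]}
best_params, best_score = grid_search(data, grid)
```

The cost grows with the product of the grid sizes (here 3 × 2 = 6 settings, each fitted three times), which is why grids are usually kept coarse. For time-ordered data, folds that respect chronology may be preferable to the interleaved folds used here.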
RESULTS
WQC prediction
Hyperparameter tuning
Hyperparameter optimization was a crucial methodological step in this study, aimed at enhancing the predictive performance and robustness of each ML model. This process involved systematically exploring and fine-tuning key parameters specific to each algorithm, such as the number of trees and maximum depth for RF, the regularization parameter for SVM, the architecture and learning rate for ANN, the depth and splitting criteria for DTs, and the boosting stages and learning rate for XGBoost. By conducting a grid search with cross-validation for each model, we identified configurations that minimized overfitting and improved accuracy, precision, recall, and F1 scores. The selected hyperparameters are detailed in Supplementary Appendix Table A1. Careful selection of these values helps the models generalize to unseen data, underscoring the value of hyperparameter tuning in achieving robust predictive models for water quality analysis.
Model performance
In this section, we present a comparative analysis of various ML models based on their performance metrics. The models were evaluated on both training and testing datasets, and the key metrics considered include accuracy, recall, precision, and F1 score. The purpose of this analysis is to determine which model exhibits the best overall performance and generalizability. The results of this evaluation are summarized in Table 2.
Table 2. Performance metrics for various ML models (WQC prediction)
Model | Training accuracy | Training recall | Training precision | Training F1 | Testing accuracy | Testing recall | Testing precision | Testing F1 |
---|---|---|---|---|---|---|---|---|
XGBoost | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9996 |
ANN | 0.9931 | 0.9931 | 0.9935 | 0.9931 | 0.9893 | 0.9893 | 0.9903 | 0.9895 |
DT | 0.9923 | 0.9923 | 0.9923 | 0.9923 | 0.9897 | 0.9897 | 0.9896 | 0.9897 |
SVM | 0.9845 | 0.9845 | 0.9842 | 0.9843 | 0.9814 | 0.9814 | 0.9808 | 0.9811 |
RF | 0.9825 | 0.9825 | 0.9823 | 0.9808 | 0.9754 | 0.9754 | 0.9751 | 0.9725 |
Table 2 provides a detailed comparison of the performance metrics for each ML model. The XGBoost model achieved the highest accuracy, recall, precision, and F1 scores across both training and testing datasets, indicating its superior performance and robustness. On the other hand, the RF model, while performing well, shows a slight drop in accuracy, recall, and the other metrics on the testing dataset relative to the training dataset, and it scores below the other models in Table 2. This suggests potential overfitting, which further optimization of RF's hyperparameters may address.
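The size of the train-to-test drop can be read directly from the accuracy columns of Table 2. The short calculation below copies those values and computes the gap for each model; the variable names are illustrative only.

```python
# Accuracy values copied from Table 2 as (training, testing).
table2_accuracy = {
    "XGBoost": (0.9996, 0.9996),
    "ANN": (0.9931, 0.9893),
    "DT": (0.9923, 0.9897),
    "SVM": (0.9845, 0.9814),
    "RF": (0.9825, 0.9754),
}

# Training minus testing accuracy, rounded to the precision of the table.
gaps = {model: round(train - test, 4)
        for model, (train, test) in table2_accuracy.items()}
largest_gap_model = max(gaps, key=gaps.get)
```

RF has the largest gap (0.0071), followed by ANN (0.0038), SVM (0.0031), and DT (0.0026), while XGBoost shows no drop at the reported precision. All gaps are below one percentage point.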
Overall, this comprehensive evaluation highlights the effectiveness of hyperparameter tuning and model selection in achieving high performance. The ANN and DT models also demonstrated strong results, particularly in terms of accuracy and F1 scores. These findings underscore the importance of thorough model evaluation and optimization in developing reliable predictive models for practical applications.
Taken together, the performance metrics and feature importance analyses underscore the importance of hyperparameter tuning in developing robust and reliable ML models, and they contribute to a deeper understanding of the models' behavior and their applicability to real-world predictive tasks.
WQI prediction
In this section, we present a comparative analysis of various ML models based on their performance metrics. The models were evaluated on both training and testing datasets, and the key metrics considered include R2, RMSE, and MSE. To ensure optimal performance, an extensive hyperparameter tuning process was employed using grid search. This involved a systematic and comprehensive search for the most effective combination of hyperparameters for each algorithm. The purpose of this analysis is to determine which model exhibits the best overall performance and generalizability. The results of this evaluation are summarized in Table 3.
Table 3. Performance metrics for various ML models (WQI prediction)
Model | Training R2 | Training RMSE | Training MSE | Testing R2 | Testing RMSE | Testing MSE |
---|---|---|---|---|---|---|
LSTM | 0.9999 | 0.0436 | 0.0019 | 0.9999 | 0.0378 | 0.0014 |
XGBoost | 0.9997 | 0.1709 | 0.0292 | 0.9997 | 0.1772 | 0.0314 |
DT | 0.9998 | 0.1601 | 0.0256 | 0.9998 | 0.1448 | 0.0210 |
RF | 0.9978 | 0.4958 | 0.2458 | 0.9973 | 0.5045 | 0.2546 |
Table 3 provides a detailed comparison of the performance metrics for each ML model. The LSTM model achieved the highest R2 values and the lowest RMSE and MSE values across both training and testing datasets, indicating its superior performance and robustness. These metrics suggest that the LSTM model captured complex patterns in the data effectively, likely aided by the extensive grid search that optimized its hyperparameters. The XGBoost, DT, and RF models also demonstrated strong performance, with high R2 values and relatively low RMSE and MSE values. The RF model, while performing well, showed slightly lower R2 values and higher RMSE and MSE values than XGBoost and DT, so it was not deemed the most suitable option in its current configuration. Further work is needed to optimize RF's hyperparameters to narrow this gap.
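Because RMSE is the square root of MSE, the two error columns of Table 3 can be cross-checked against each other. The sketch below copies the reported values and confirms that each MSE matches the squared RMSE within a tolerance; the tolerance of 2e-4 is an assumption chosen to absorb four-decimal rounding.

```python
# Values copied from Table 3 as (train RMSE, train MSE, test RMSE, test MSE).
table3 = {
    "LSTM": (0.0436, 0.0019, 0.0378, 0.0014),
    "XGBoost": (0.1709, 0.0292, 0.1772, 0.0314),
    "DT": (0.1601, 0.0256, 0.1448, 0.0210),
    "RF": (0.4958, 0.2458, 0.5045, 0.2546),
}

TOLERANCE = 2e-4  # assumed allowance for rounding to four decimals

def consistent(rmse, mse, tol=TOLERANCE):
    """True when the reported MSE equals RMSE squared, up to rounding."""
    return abs(rmse ** 2 - mse) <= tol

checks = {
    model: consistent(tr_rmse, tr_mse) and consistent(te_rmse, te_mse)
    for model, (tr_rmse, tr_mse, te_rmse, te_mse) in table3.items()
}
all_consistent = all(checks.values())
```

All four rows pass this check, so the RMSE and MSE columns carry the same information on different scales; RMSE is the easier of the two to interpret because it is expressed in WQI units.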
Predicted vs. observed values for applied ML models such as (a) LSTM, (b) XGBoost, (c) RF, and (d) DT.
From the plots, it is clear that the LSTM model's predictions are closest to the observed values, as indicated by the tight clustering along the diagonal line. The XGBoost and DT models also show strong performance, with predictions closely aligned with the observed values, though with slightly more dispersion than the LSTM. The RF model, while still performing well, shows greater deviation from the observed values, indicating larger prediction errors than the other models.
Overall, the regression plots complement the quantitative metrics presented in the table, providing a visual confirmation of the models' effectiveness and the benefits of hyperparameter tuning through grid search.
DISCUSSION AND FUTURE DIRECTIONS
This study compared the performance of several ML models in predicting the WQI and the WQC. The models evaluated were ANN, SVM, DT, RF, XGBoost, and LSTM networks, each with distinct capabilities. Performance was measured with accuracy, precision, recall, and F1 score for the WQC classification task and with R2, RMSE, and MSE for the WQI regression task, on both the training and testing datasets. This examination clarifies the strengths and weaknesses of each model for water quality assessment and contributes to the broader field of environmental data analysis and ML.
The evaluation of multiple ML models for predicting water quality offers insight into both their performance and the factors that influence it. One key finding was the strong performance of the RF and LSTM models. RF was robust in handling high-dimensional, multivariate data, which is likely due to its ensemble learning approach. By averaging multiple DTs, RF mitigates the risk of overfitting and captures complex non-linear relationships within the data, such as the interactions among pH, DO, and conductivity. This behavior is consistent with findings from the literature, which emphasize the versatility of RF models in environmental modeling contexts, where data complexity and variability are often substantial (Khattak et al. 2024).
Similarly, LSTM networks outperformed traditional ML models, particularly when dealing with the temporal aspects of water quality data. The ability of LSTM models to retain sequential information through their memory cells was crucial in capturing trends over time, such as the seasonal variation of water quality parameters. This aligns with recent studies that demonstrated the effectiveness of LSTMs for time-series prediction in hydrological and water quality contexts (Xu et al. 2024). The positive performance of LSTM in this study could also be attributed to the use of optimized hyperparameters, which allowed the model to maintain a balance between underfitting and overfitting while capturing long-term dependencies.
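To make the notion of sequential input concrete, the sketch below shows one common way to turn a time-ordered series into fixed-length windows that a sequence model such as an LSTM can consume. The window length, the function name, and the sample values are assumptions for illustration; the study's actual windowing scheme is not described in this section.

```python
def make_windows(series, window):
    """Split a time-ordered series into (input window, next value) pairs."""
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])   # past `window` readings
        targets.append(series[start + window])        # the reading to predict
    return inputs, targets


# Hypothetical WQI readings in time order.
wqi = [71.0, 72.5, 70.8, 69.9, 73.1, 74.0]
X, y = make_windows(wqi, window=3)
for past, nxt in zip(X, y):
    print(past, "->", nxt)
```

Each window preserves the order of the readings, which is what allows a recurrent model to learn trends such as seasonal variation.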
On the other hand, SVR and DT showed comparatively weaker performance, particularly for predicting parameters with high variability, such as fecal coliform counts. SVR's limited effectiveness could be attributed to its sensitivity to feature scaling and the choice of kernel function. The high variability and noise in the dataset may have hindered SVR's ability to fit a stable regression function, resulting in suboptimal predictions. Moreover, SVR tends to struggle when the underlying relationship is not well captured by the chosen kernel or when hyperparameter tuning is insufficient, which could further explain the lower accuracy observed (Wang et al. 2011).
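Because kernel methods such as SVR depend on distances between samples, features measured on very different scales can dominate the fit. A minimal sketch of z-score standardization follows, with the scaling statistics fitted on the training rows only and then reused for new rows. The helper names and the sample values are hypothetical.

```python
import math


def fit_standardizer(rows):
    """Compute per-feature mean and (population) standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # Fall back to 1.0 for constant features to avoid division by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
            for col, m in zip(columns, means)]
    return means, stds


def standardize(rows, means, stds):
    """Apply previously fitted statistics to any set of rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in rows]


# Hypothetical training rows: [pH, specific conductance].
train = [[7.0, 100.0], [8.0, 300.0], [9.0, 500.0]]
means, stds = fit_standardizer(train)
train_scaled = standardize(train, means, stds)
print(train_scaled)
```

After scaling, both features span the same numeric range, so neither dominates the kernel distance purely because of its units.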
The DT model's limitations are also noteworthy. While DT provides interpretable models that are valuable for understanding the relationships between input features, it is inherently prone to overfitting, especially when dealing with complex and noisy datasets. In this study, the overfitting of DT might have resulted from insufficient pruning or the high variability in certain features, leading to a model that captured noise rather than meaningful patterns in the data. This challenge highlights the trade-off between model interpretability and predictive performance, which is often a key consideration in environmental modeling (Luo et al. 2019).
Furthermore, the data quality played a crucial role in the performance of the models. The dataset used in this study contained several missing values, which were imputed using linear interpolation. Although this method provides a straightforward approach to handling missing data, it may have introduced biases, particularly in features with non-linear trends or sudden changes. Such biases could negatively impact model performance, especially for SVR and DT, which are more sensitive to inconsistencies in the data.
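The bias described above can be illustrated with a small sketch of linear interpolation over interior gaps. The function and the turbidity values are hypothetical; the point is only that a missing reading during a sudden change is replaced by a value close to its neighbours.

```python
def interpolate_linear(values):
    """Fill interior None gaps linearly; leading/trailing gaps stay None."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            filled[i] = filled[left] + weight * (filled[right] - filled[left])
    return filled


# Hypothetical turbidity record in which the true value at index 2 was a
# short spike (9.0) that happened to be the missing observation.
true_series = [2.0, 2.1, 9.0, 2.2, 2.0]
with_gap = [2.0, 2.1, None, 2.2, 2.0]

filled = interpolate_linear(with_gap)
error = true_series[2] - filled[2]
print(filled[2], error)
```

In this toy case the imputed value sits between its neighbours and misses the spike entirely, which is the kind of smoothing bias that can affect models sensitive to inconsistencies in the data.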
The strong results observed with the RF and LSTM models are likely due to their adaptability and ability to learn complex patterns, even with moderately noisy data. However, this performance comes at the cost of increased computational complexity. Training deep learning models such as LSTMs requires significant computational resources, which could limit their applicability in real-time or resource-constrained scenarios. In addition, although both RF and LSTM performed well in terms of predictive accuracy, they differ in interpretability: RF provides some level of feature importance, whereas LSTM, being a black-box model, makes it difficult to explain individual predictions. This lack of interpretability can be a significant barrier for stakeholders who require transparent and actionable insights for decision-making (Mi et al. 2020).
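One model-agnostic way to recover some feature-level insight from an otherwise opaque model is permutation importance: shuffle one feature and measure how much the prediction error grows. This technique is not stated to have been used in the study; the sketch below, including the toy model and data, is purely illustrative.

```python
import random


def mean_squared_error(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def permutation_importance(predict, rows, targets, column, seed=0):
    """Increase in MSE after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = mean_squared_error(targets, [predict(r) for r in rows])
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted_rows = [r[:column] + [v] + r[column + 1:]
                     for r, v in zip(rows, shuffled)]
    permuted = mean_squared_error(targets, [predict(r) for r in permuted_rows])
    return permuted - baseline


def toy_model(row):
    # Hypothetical stand-in for a trained model: depends on feature 0 only.
    return 10.0 * row[0]


rows = [[6.5, 0.3], [7.0, 0.9], [7.5, 0.1], [8.0, 0.6], [8.5, 0.4]]
targets = [toy_model(r) for r in rows]

importance_first = permutation_importance(toy_model, rows, targets, column=0)
importance_second = permutation_importance(toy_model, rows, targets, column=1)
print(importance_first, importance_second)
```

A feature the model ignores yields an importance of zero, while shuffling a feature the model relies on can only increase the error in this setup.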
In conclusion, the results of this study highlight the importance of selecting appropriate ML models based on the characteristics of the dataset and the specific application requirements. While models like RF and LSTM demonstrated superior predictive capabilities, their limitations in interpretability and computational demands must be considered. Conversely, simpler models like DT and SVR may offer greater transparency, but their performance can be hindered by data quality issues and insufficient complexity to capture non-linear relationships. Future research could focus on hybrid approaches that combine the interpretability of traditional models with the predictive power of advanced methods, thereby enhancing both the accuracy and usability of water quality predictions.
Performance analysis
The LSTM model emerged as the top-performing model in terms of R2, RMSE, and MSE for both training and testing datasets, indicating its robustness and superior ability to generalize to unseen data. The sequence modeling capability of LSTM allows it to capture temporal dependencies in the data, which is particularly useful for time-series predictions such as WQI.
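The three regression metrics referred to here follow their standard definitions: MSE is the mean of the squared residuals, RMSE is its square root, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with hypothetical WQI values:

```python
import math


def regression_metrics(observed, predicted):
    """Return MSE, RMSE and R2 using their standard definitions."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}


# Hypothetical observed and predicted WQI values.
observed_wqi = [60.0, 70.0, 80.0, 90.0]
predicted_wqi = [62.0, 69.0, 79.0, 92.0]
scores = regression_metrics(observed_wqi, predicted_wqi)
print(scores)
```

Note that a better model has a higher R2 but lower RMSE and MSE, so the three metrics should be read in opposite directions.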
XGBoost also demonstrated strong performance, with high R2 values and low RMSE and MSE. Its ensemble nature, which combines the strengths of multiple weak learners, contributes to its excellent performance and resilience against overfitting.
The RF and DT models also performed well, underscoring their effectiveness in handling both classification and regression tasks. These models are known for their interpretability and ability to handle non-linear relationships in the data. However, they showed a slight tendency to overfit, especially the DT model, which is inherently prone to this issue unless properly regularized.
The ANN and SVM models, while slightly less accurate than the ensemble methods, still demonstrated strong performance metrics. The ANN model's ability to capture complex patterns in the data makes it a powerful tool, albeit at the cost of increased computational resources and the risk of overfitting. The SVM model showed robustness in high-dimensional spaces but struggled with larger datasets and required careful parameter tuning to achieve optimal performance.
Practical implications
The findings of this study have significant practical implications for water quality management. The high accuracy and robustness of the XGBoost and LSTM models make them reliable choices for real-world applications. Policymakers, researchers, and practitioners can leverage these insights to implement more effective water quality monitoring and management strategies, thereby safeguarding public health and environmental sustainability. Table 4 presents the pros and cons of each applied ML model in the context of this study's outcomes.
Pros and cons of all applied models
Model | Pros | Cons |
---|---|---|
RF | | |
DT | | |
ANN | | |
SVM | | |
XGBoost | | |
LSTM | | |
Table 5 provides a comprehensive comparison of various ML models applied to water quality prediction, highlighting the techniques, best models, prediction indices, and results reported by different authors. The comparison underscores the diversity of approaches employed across different studies and the performance metrics achieved by each model. For instance, Radhakrishnan & Pillai (2020) identified the DT algorithm as the best model for the WAWQI, achieving an accuracy of 98.50%. Similarly, Jain et al. (2021) found the RF algorithm to be the most effective for predicting the WQI with an accuracy of 92.13%. Studies such as Hmoud Al-Adhaileh & Waselallah Alsaade (2021) and Khan et al. (2021) demonstrated the efficacy of ANFIS and GBoost in achieving high accuracy for WQI and WQC.
Comparison of ML models for water quality
Author | Technique | Best model | Prediction index | Results |
---|---|---|---|---|
Radhakrishnan & Pillai (2020) | | Algorithm DT | WAWQI | Accuracy = 98.50% |
Jain et al. (2021) | | Algorithm RF | WQI | Accuracy = 92.13% |
Hmoud Al-Adhaileh et al. (2021) | | ANFIS for WQI; FFNN for WQC | WQI | (WQC) Accuracy (ANFIS) = 96.17%; Accuracy (FFNN) = 100% |
Slatnia et al. (2022) | | Gradient boosting | WQC | Accuracy = 94.90% |
Khan et al. (2021) | | | WQI, Water quality status (WQS) | Accuracy (PCR) = 95%; Accuracy (GBoost) = 100% |
Aldhyani et al. (2020) | | | WQI, Water quality classification | (WQC) Accuracy (SVM) = 97.01%; R2 (NARNET) = 96.17% |
This study | | LSTM; XGBoost | Water quality classification (WQI) | Maximum Accuracy = 99.83%; (WQC) Upper Accuracy = 99.99%; Lower Accuracy = 99.07%; R2 = 0.9999 |
This research presents a comprehensive and methodical comparison of a diverse set of ML models, including ANN, SVM, DT, RF, XGBoost, and LSTM networks, for predicting the WQI and WQC, both of which are crucial for environmental science and management. The comparison gives a detailed view of the strengths and limitations of each model and informs their practical application in environmental monitoring and assessment. Specifically, the XGBoost model achieved a maximum accuracy of 99.83%, an upper accuracy of 99.99%, a lower accuracy of 99.07%, and a Kappa statistic of 99.17%, while the LSTM model achieved an R2 of 0.9999. These findings highlight the generalization capabilities of ensemble methods such as XGBoost and advanced neural networks such as LSTM, making them well suited to water quality prediction. The research contributes to the growing body of literature on the application of ML in environmental sciences, demonstrating the potential of these models to enhance water quality monitoring and management.
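The Kappa statistic reported alongside accuracy corrects for agreement that would be expected by chance. Assuming it refers to Cohen's kappa (the usual meaning in classification studies, though not confirmed in this section), it can be sketched from a confusion matrix as follows; the matrix values are hypothetical.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows = observed)."""
    total = sum(sum(row) for row in confusion)
    observed_agreement = sum(confusion[i][i]
                             for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Chance agreement: product of the marginal proportions, summed per class.
    expected_agreement = sum(r * c for r, c in
                             zip(row_totals, col_totals)) / total ** 2
    return (observed_agreement - expected_agreement) / (1 - expected_agreement)


# Hypothetical two-class confusion matrix.
matrix = [[45, 5],
          [10, 40]]
kappa = cohen_kappa(matrix)
print(kappa)
```

In this toy case the raw accuracy is 0.85 while kappa is lower, because half of the agreement would be expected by chance given the class proportions.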
Limitations and future work
Despite these promising results, several limitations must be acknowledged. The models exhibited varying degrees of overfitting, particularly in the ANN and XGBoost models, as indicated by the discrepancies between training and testing performance. Future research should focus on addressing overfitting through techniques such as cross-validation, regularization, and more sophisticated hyperparameter tuning.
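As an illustration of the cross-validation suggested above, the sketch below generates train and test index sets for k contiguous folds. The function name and the choice of ten samples are hypothetical; three folds are used only because the abbreviation list mentions CV-3.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # Distribute any remainder over the first folds so sizes differ by <= 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Averaging a metric over the held-out folds, and comparing it with the training score, gives a more stable picture of overfitting than a single split. For strongly time-ordered data, a forward-chaining split that never trains on later samples may be more appropriate than the plain scheme sketched here.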
Additionally, integrating advanced data preprocessing steps, such as feature selection and extraction, could further enhance model performance (Bhagat et al. 2023b). Exploring the impact of different imputation methods for handling missing data could provide more insights into improving classification and regression accuracy.
More sophisticated deep learning architectures, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), could also be examined to capture more complicated patterns in the data. These models have shown considerable promise in other areas and could potentially bring large improvements in water quality prediction.
Moreover, this study does not explicitly discuss the practical implementation of the models in different scenarios, and their performance in other contexts of environmental science and engineering is not addressed. The study relies on extensive datasets for model training and evaluation, but it does not address potential limitations related to data quality, such as the impact of noise, outliers, or missing values, beyond the preprocessing steps mentioned. In addition, the research evaluates the models using specific metrics such as accuracy, precision, recall, and R2, but it does not explore other important aspects such as model interpretability, ease of deployment, or maintenance, which are critical for real-world applications. These limitations suggest areas for future research, such as exploring model interpretability, testing in real-world environments, and ensuring the models' robustness across different datasets and conditions.
CONCLUSIONS
In this study, we carried out a thorough evaluation of ML algorithms for predicting both the WQI and the WQC, which are crucial parameters in environmental science and management. The novelty of this research lies in its comprehensive comparison of a diverse set of models, including ANN, SVM, DT, RF, XGBoost, and LSTM networks, all of which were tested on extensive datasets using a broad range of performance evaluation metrics. This approach clarifies the strengths and limitations of each model and offers insights that can inform their practical application in environmental monitoring and assessment. Consequently, the findings offer guidance for researchers and practitioners in selecting the most appropriate ML techniques for evaluating and managing water quality data.
The key findings of this study indicate that XGBoost and LSTM models outperform other techniques, achieving the highest accuracy, precision, recall, F1 score, and R2, and the lowest RMSE and MSE. Specifically, XGBoost demonstrated a maximum accuracy of 99.83%, an upper accuracy of 99.99%, a lower accuracy of 99.07%, and a Kappa statistic of 99.17%, while LSTM achieved an R2 of 0.9999. These results highlight the robustness and superior generalization capabilities of ensemble methods and advanced neural networks in handling complex water quality data. The findings are significant as they contribute to the growing body of literature on the application of ML in environmental sciences, demonstrating the potential for these models to enhance water quality monitoring and management.
Despite these promising results, the study has several limitations. The models exhibited varying degrees of overfitting, particularly in ANN and XGBoost, as indicated by discrepancies between training and testing performance. Additionally, the study's reliance on specific datasets may limit the generalizability of the findings to other geographical regions or different water quality parameters. Future research should address these limitations by incorporating more diverse datasets, exploring advanced data preprocessing techniques such as feature selection and extraction, and employing methods like cross-validation and regularization to mitigate overfitting.
Future research options include examining more sophisticated deep learning architectures (Tian et al. 2023), such as CNNs and RNNs, which have shown substantial promise in other disciplines, as well as combining mathematical and numerical modeling approaches (Dai & Samper 2004). These models could capture more complex patterns in the data, potentially improving the accuracy and robustness of water quality predictions. Increasingly advanced software, such as COMSOL, can also be used to establish complex relationships (Zhu et al. 2023). Additionally, integrating real-time data streams and exploring the use of remote sensing data could further enhance the models' applicability in dynamic environmental monitoring scenarios (Tiyasha et al. 2023). Moreover, accurate water quality prediction can lead to better water management using more advanced ML tools with the ability to handle large and multisource data (Zhan et al. 2024). Practical implementation of a suitable model across different scenarios may also open the door to addressing data quality issues (the impact of noise, outliers, or missing values, and preprocessing steps). Model interpretability, ease of deployment, and maintenance also remain unexplored. The practical implications of this study are significant for water quality management. The high accuracy and robustness of the XGBoost and LSTM models make them reliable choices for real-world applications. Policymakers, researchers, and practitioners can leverage these insights to implement more effective water quality monitoring and management strategies, thereby safeguarding public health and environmental sustainability (Karri et al. 2024).
ACKNOWLEDGEMENTS
The authors extend their appreciation to the Deanship of Research and Graduate Studies at King Khalid University for funding this work through a Large Research Project under grant number RGP 1/219/45. The authors also express their deepest gratitude to all those who contributed to this research. Heartfelt thanks go to the National Groundwater Data Access Bank (ADES) for providing the comprehensive dataset essential for this study. The authors are also grateful to the Artois-Picardie Water Agency for their support and collaboration.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.