ABSTRACT
Water is essential for all life forms but is increasingly at risk of contamination. Monitoring water quality is crucial to protect ecosystems and public health. This study evaluates ensemble learning techniques – AdaBoost, Gradient Boost, XGBoost, CatBoost, and LightGBM – for predicting key water quality parameters in the Bara River Basin, Pakistan. Initially, a random forest model identified optimal input-target parameter combinations. Machine learning models were then developed and evaluated using R2, MSE, and MAE, with the best models selected via compromise programming. Results show XGBoost and Gradient Boost outperformed other methods. XGBoost achieved near-perfect R2 values for bicarbonate (HCO3), carbonate (CO3), and magnesium (Mg), while Gradient Boost excelled with parameters like electrical conductivity (EC), sulfate (SO4), temperature, and calcium (Ca). XGBoost demonstrated high training R2 values (0.999) but slightly lower testing R2 (e.g., 0.8636 for HCO3). Gradient Boost exhibited greater stability, maintaining high accuracy in both phases (e.g., Ca testing R2 = 0.9433). AdaBoost and CatBoost showed moderate performance for parameters like chloride (Cl) and pH, while CatBoost and LightGBM performed well for pH and dissolved solids but varied across other indicators. These findings underscore the potential of ensemble methods for accurate water quality prediction, aiding future management and environmental protection efforts.
HIGHLIGHTS
Evaluation of ensemble techniques for water quality modeling in the Bara River Basin, Pakistan.
Utilization of random forest for water quality parameter selection.
Application of ensemble learning techniques for water quality parameter prediction.
Gradient boosting and XGBoost show superior predictive capability.
Importance of algorithm selection and hyperparameter tuning for precise water quality modeling.
INTRODUCTION
Pakistan has one of the world's largest irrigation networks, primarily fed by the Indus River and its tributaries, with over 74% of its water originating from catchments outside the country. This vital resource is under increasing threat from rapid population growth, urbanization, intensified industrial and agricultural activities, and the disposal of untreated wastewater. Assessing water quality in such a context is challenging, as it requires specialized equipment and expertise, making extensive laboratory testing impractical (Sattari et al. 2016). As a viable alternative, machine learning (ML) techniques offer promising tools for predicting water quality parameters, providing efficiency and accuracy in place of time-consuming laboratory methods. Historically, various models, including statistical, deterministic, numerical, and stochastic methods, have been applied in water quality modeling. However, these models are often complex, and their development typically requires extensive datasets (Molekoa et al. 2019; Mosavi et al. 2020; Pakdaman et al. 2020; Vats et al. 2020). Additionally, traditional statistical approaches assume a linear relationship between input and output variables, which does not accurately capture the complexities of hydrological processes. These methods often achieve low prediction accuracy in simulating water quality parameters because they rely on assumptions of stationarity and linearity in the data. Such limitations call for more advanced modeling techniques to improve the robustness and reliability of water quality assessments.
In Pakistan, traditional statistical techniques are widely used for water quality modeling. For example, Khan et al. (2021) applied path analysis to assess the impacts of terrestrial, socio-economic, and hydrological factors on Indus River water quality. Similarly, Khan et al. (2022) used regression analysis to examine the influence of climatic and terrestrial factors on water quality in the Indus River, while Shah et al. (2020) employed correlation analysis, cluster analysis (CA), and principal component analysis (PCA) to evaluate the Ravi and Sutlej rivers. Jehan et al. (2020) applied correlation analysis and PCA in the Swat River basin to identify pollution sources, while Zakaullah & Ejaz (2020) used CA and PCA for the Soan River's water quality evaluation. Furthermore, Zafar & Ahmad (2018) investigated water's physical and chemical characteristics in the Gilgit and Hunza rivers using CA and correlation analysis. Malik & Hashmi (2017) assessed Himalayan foothill streams with CA and discriminant analysis. Iqbal et al. (2018) used the Water Quality Simulation Program (WASP) to develop a water quality improvement strategy for the Ravi River. These studies collectively highlight the value of statistical techniques in water quality assessment, but they also reveal limitations, especially when faced with non-linear and non-stationary data.
Recently, artificial intelligence (AI) techniques have emerged as a powerful alternative to traditional approaches, particularly for handling non-linear relationships and complex interactions among water quality parameters. Numerous studies underscore the effectiveness of AI methods in overcoming the limitations of conventional statistical models (May et al. 2008; Nikoo & Mahjouri 2013; Haghiabi 2016, 2017; Haghiabi et al. 2017; Jaddi & Abdullah 2017). In the upper Indus River basin, Shah et al. (2020) demonstrated that gene expression programming (GEP) outperformed both artificial neural networks (ANNs) and regression models in predicting electrical conductivity (EC) and total dissolved solids (TDS). Likewise, Alqahtani et al. (2022) found that the random forest (RF) model surpassed GEP and ANN for EC and TDS simulations. Additional research by Aslam et al. (2022) compared four ML algorithms – random trees (RT), M5P, RF, and reduced error pruning tree (REPT) – as well as 12 hybrid algorithms for water quality index simulations in Northern Pakistan, with findings indicating that the hybrid RT-artificial neural network (RT-ANN) outperformed other techniques.
Moreover, Shah et al. (2021) observed that GEP and ANN models provided superior accuracy over linear regression models for TDS and EC predictions in the upper Indus River basin, while Ahmed et al. (2019) reported strong predictive performances for gradient boosting and polynomial regression in water quality index forecasting. In another study, Shah et al. (2021) used a hybrid approach combining a feed-forward neural network (FFNN) with particle swarm optimization (PSO), concluding that the PSO-GEP model was highly effective for predicting dissolved oxygen (DO) and TDS. Aldrees et al. (2023) applied multi-expression programming (MEP) and RF modeling to evaluate EC and TDS in the Indus River, finding RF to be the most accurate approach. Additionally, Ahmed et al. (2021) used various ML models – including decision trees (DT), logistic regression (LogR), multilayer perceptron (MLP), k-nearest neighbors (KNN), and Naive Bayes (NB) – to classify Rawal Dam's water quality, while Begum et al. (2023) used PCA for water quality assessment in Rawal Lake.
These studies highlight a significant trend: while traditional statistical methods and certain ML techniques are widely used in Pakistan, they tend to be limited by their assumptions of linearity and stationarity (May et al. 2008; Nikoo & Mahjouri 2013; Haghiabi 2016, 2017; Haghiabi et al. 2017; Jaddi & Abdullah 2017). Although advanced ML models are increasingly being used, ensemble learning approaches based on feature extraction have yet to be applied in Pakistan for water quality assessment. Recognizing this gap, the present study focuses on evaluating ensemble learning methods, particularly feature-based techniques using models like AdaBoost, gradient boosting, XGBoost, LightGBM, and CatBoost, to improve water quality prediction in the Bara River. RF regression is also applied to determine feature importance, providing an enhanced understanding of key water quality parameters.
This study aims to enhance water resource management and decision-making by leveraging advanced ML techniques that can more accurately capture non-linear relationships in water quality data. By applying ensemble learning, this research seeks to contribute a novel approach to water quality modeling that is both robust and efficient, potentially setting a new standard for predictive accuracy in the region.
STUDY AREA, DATA COLLECTION, AND METHOD
Study area description
Data collection
Methods
RF for feature selection
RF is a flexible ensemble learning method that performs exceptionally well in feature selection (Kursa & Rudnicki 2011). This technique was used to determine the most appropriate combination of input water quality parameters for predicting each target water quality variable. During training, RF builds a large number of DT, each depending on a different random subset of the features (Ali et al. 2012). The contributions of the individual trees are then combined to quantify the importance of each feature. This method is advantageous because it can manage intricate interactions and provide feature importance insights without requiring extensive hyperparameter tuning. By using RF for feature selection, the study sought to sharpen the subsequent modeling step and produce a more targeted and effective prediction model.
In this study, a Python script was written to analyze the importance of the various features in the dataset using an RF regressor. It starts by importing the necessary libraries and loading the data from an Excel file into a DataFrame. The script then separates the input and output variables, skipping non-numeric columns such as dates. An RF regressor is created with 100 estimators and a fixed random state for reproducibility, and is trained on the feature matrix and target variable. The model's feature importance scores are extracted, stored in a DataFrame, and sorted in descending order of importance for clarity. Finally, the scores are printed for each target variable, giving a detailed picture of which features most strongly influence each parameter. This systematic approach provides valuable insights into the relationships between the water quality variables in the dataset.
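A minimal sketch of this workflow is given below; the file name, target column, and random seed are illustrative assumptions rather than the study's actual settings.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Load the dataset and keep only numeric columns (dates etc. are skipped).
df = pd.read_excel("bara_river_wq.xlsx")  # hypothetical file name
df = df.select_dtypes(include="number")

target = "HCO3"  # one target parameter; repeat for each target variable
X = df.drop(columns=[target])
y = df[target]

# 100 estimators and a fixed random state, as described above.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

# Extract, sort, and print the feature importance scores.
importances = (
    pd.DataFrame({"feature": X.columns, "importance": rf.feature_importances_})
    .sort_values("importance", ascending=False)
)
print(f"Feature importances for {target}:")
print(importances.to_string(index=False))
```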
Ensemble techniques for water quality prediction
AdaBoost
AdaBoost, also known as adaptive boosting, is an ensemble learning technique used in this work to increase the accuracy of water quality predictions (Dinakaran & Jeba Thangaiah 2016). AdaBoost trains weak learners, often DT, sequentially, assigning larger weights to poorly predicted cases in each iteration. This adaptation to the ensemble's errors yields a strong predictive model (Schapire 2013). By adaptively reweighting samples and focusing on those with the largest errors, AdaBoost performed well for water quality prediction in the Bara River Basin. The number of estimators (weak learners) and the maximum depth of the trees were tuned for optimal results, strengthening the algorithm's ability to capture complicated patterns in the data.
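A hedged sketch of this setup follows; the depth, estimator count, and learning rate shown are placeholders, not the study's tuned values.

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Weak learners are shallow decision trees whose depth was tuned;
# `estimator` requires scikit-learn >= 1.2 (older releases used `base_estimator`).
ada = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=3),
    n_estimators=100,   # number of weak learners (tuned in the study)
    learning_rate=0.1,  # placeholder value
    random_state=42,
)
```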
Gradient boosting
This study employed gradient boosting, a robust ensemble learning technique that constructs DT sequentially to enhance predictive accuracy (Natekin & Knoll 2013). The method fits each new tree to the residual errors of the current ensemble, steadily improving the model's overall predictive performance. Gradient boosting was employed specifically to forecast water quality parameters within the Bara River Basin. To optimize its performance, critical hyperparameters such as the number of estimators, learning rate, and maximum depth of the trees were tuned. The ability of gradient boosting to handle both regression and classification tasks played a pivotal role in capturing the complex correlations present in the water quality data (Hosen & Amin 2021).
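An illustrative configuration is sketched below; the parameter values are placeholders for the tuned settings.

```python
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(
    n_estimators=200,    # number of boosting stages (tuned)
    learning_rate=0.05,  # shrinkage applied to each tree's contribution (tuned)
    max_depth=3,         # maximum depth of individual trees (tuned)
    random_state=42,
)
# Each stage fits a new tree to the residuals of the current ensemble.
```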
XGBoost
This work demonstrates the predictive power of an advanced gradient boosting algorithm, extreme gradient boosting (XGBoost), which outperformed the competing algorithms in accurately estimating several water quality parameters in the Bara River Basin (Ali et al. 2023). XGBoost employs regularization techniques and parallel processing to further improve performance (Bentéjac et al. 2019). To maximize its performance, important hyperparameters were iteratively adjusted, including the learning rate, the maximum depth of individual trees, and the number of boosting rounds. XGBoost demonstrated its capacity for managing large datasets by achieving exceptional predictive accuracy in water quality forecasting (Kanagarathinam et al. 2023).
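A minimal sketch, assuming the standard xgboost Python API, is given below; the values are illustrative rather than the study's tuned ones.

```python
from xgboost import XGBRegressor

xgb = XGBRegressor(
    n_estimators=300,    # number of boosting rounds (tuned)
    learning_rate=0.05,  # step-size shrinkage (tuned)
    max_depth=4,         # maximum depth of individual trees (tuned)
    reg_lambda=1.0,      # L2 regularization term in XGBoost's objective
    n_jobs=-1,           # parallel tree construction
    random_state=42,
)
```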
CatBoost
CatBoost is a gradient boosting library designed to handle categorical data and deliver strong performance on regression and classification tasks (Hancock & Khoshgoftaar 2020). Developed by Yandex, CatBoost stands for categorical boosting and is particularly known for handling categorical features directly, unlike other gradient boosting algorithms that typically require preprocessing such as one-hot encoding (Hancock & Khoshgoftaar 2020). CatBoost has proven to be an efficient and successful algorithm for forecasting water quality metrics, particularly where categorical characteristics are present (Wu et al. 2020).
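The sketch below illustrates this direct handling of categorical features; the parameter values and the categorical column name are hypothetical.

```python
from catboost import CatBoostRegressor

cat = CatBoostRegressor(
    iterations=500,      # boosting rounds (CatBoost's analogue of n_estimators)
    learning_rate=0.05,  # placeholder value
    depth=4,             # CatBoost's name for maximum tree depth
    random_seed=42,
    verbose=False,
)
# Categorical columns are passed by name instead of being one-hot encoded first:
# cat.fit(X_train, y_train, cat_features=["station_id"])  # hypothetical column
```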
LightGBM
This study demonstrated the usefulness of LightGBM, a histogram-based gradient boosting framework, for predicting water quality indicators (Ke et al. 2017). Its efficient handling of large datasets and its distinctive leaf-wise tree growth strategy set it apart. The LightGBM hyperparameters, including the number of boosting rounds, learning rate, and maximum depth of the trees, were carefully configured for optimal performance. The algorithm's usefulness in water quality prediction stems from its speed and scalability, especially on very large datasets (McCarty et al. 2020). These specific features played a key role in LightGBM's ability to capture fine patterns in the water quality data of the Bara River (Ahmed et al. 2022).
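An illustrative configuration follows; num_leaves, which governs the leaf-wise growth strategy, is included alongside the hyperparameters named above, and all values are placeholders.

```python
from lightgbm import LGBMRegressor

lgbm = LGBMRegressor(
    n_estimators=300,    # boosting rounds (tuned)
    learning_rate=0.05,  # placeholder value
    max_depth=6,         # maximum depth of individual trees (tuned)
    num_leaves=31,       # key capacity parameter under leaf-wise growth
    random_state=42,
)
```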
Models’ evaluation
In this study, the ensemble learning models were evaluated using three statistical indicators: the coefficient of determination (R2), mean squared error (MSE), and mean absolute error (MAE). These metrics serve as critical indicators of a model's efficacy in terms of explanatory power and predictive accuracy (Ullah et al. 2023).
Coefficient of determination (R2)
Mean squared error
Mean absolute error
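The equations for these indicators are not reproduced in this extract; their standard definitions, assuming $O_i$ and $P_i$ denote the observed and predicted values, $\bar{O}$ the mean of the observations, and $N$ the number of samples, are:

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (O_i - P_i)^2}{\sum_{i=1}^{N} (O_i - \bar{O})^2}, \qquad \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (O_i - P_i)^2, \qquad \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert O_i - P_i \rvert$$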
Models ranking
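The equation for the LP distance is not reproduced in this extract; a common form of the compromise programming metric, consistent with the symbols defined below, is:

$$L_P = \left[ \sum_{n=1}^{m} \left| \frac{W_n - W_n^*}{W_n^*} \right|^{P} \right]^{1/P}$$

where $P$ is the distance exponent.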
The parameter m equals 4 (the number of statistical performance measures), and Wn represents the actual value of a performance measure, whereas Wn* is its ideal value, obtained when the model simulations fully match the observed data. The LP metric is always positive, and lower LP values are preferred since they imply higher model performance.
Model development
The performance of ensemble learning methods such as AdaBoost, gradient boosting, XGBoost, and CatBoost is strongly influenced by the parameters test_size, random_state, n_estimators, max_depth, and learning_rate. The test_size parameter determines the fraction of the dataset reserved for testing; for example, a value of 0.2 means that 20% of the data are used for testing, which supports an unbiased evaluation of the model's predictive capability. The random_state option seeds the random number generator, making model training reproducible: using the same value across runs yields consistent results. The n_estimators parameter specifies how many base learners (trees) the ensemble contains, directly affecting the model's complexity and its capacity to identify fine trends in the water quality data. To prevent overfitting, the max_depth option limits the depth of each tree in the ensemble. Finally, the learning_rate parameter sets the step size for each iteration, controlling how much each tree contributes to the final model; it is essential for optimizing the loss function, avoiding overfitting, and fine-tuning the model's performance.
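An end-to-end sketch of this workflow, assuming the feature matrix X and target y from the feature selection step and using placeholder hyperparameter values, is shown below.

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# test_size=0.2 reserves 20% of the data for testing; random_state fixes
# the split so repeated runs give identical results.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GradientBoostingRegressor(
    n_estimators=200, max_depth=3, learning_rate=0.05, random_state=42
)
model.fit(X_train, y_train)

# Report the three evaluation metrics for both phases.
for phase, features, targets in [("training", X_train, y_train), ("testing", X_test, y_test)]:
    preds = model.predict(features)
    print(
        f"{phase}: R2={r2_score(targets, preds):.4f}, "
        f"MSE={mean_squared_error(targets, preds):.4f}, "
        f"MAE={mean_absolute_error(targets, preds):.4f}"
    )
```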
RESULTS AND DISCUSSION
Feature selection
Water quality parameters prediction
Table 1 | Statistical performance indicators (training and testing) and compromise programming (CP) ranks for each input–target water quality parameter combination; the best performing model for each target is shown in bold.

| Target parameter | Input parameters | Model | Training R2 | Training MSE | Training MAE | Testing R2 | Testing MSE | Testing MAE | CP rank |
|---|---|---|---|---|---|---|---|---|---|
| HCO3 | EC, SO4, dissolved solids | AdaBoost | 0.9543 | 0.0057 | 0.0596 | 0.8639 | 0.0229 | 0.1295 | 4 |
| | | Gradient boosting | 0.9982 | 0.0003 | 0.0139 | 0.848 | 0.016 | 0.089 | 2 |
| | | **XGBoost** | 0.9997 | 0.0001 | 0.0049 | 0.8636 | 0.0203 | 0.102 | 1 |
| | | CatBoost | 0.9731 | 0.0039 | 0.0452 | 0.8809 | 0.0137 | 0.0841 | 3 |
| | | LightGBM | 0.4325 | 0.0843 | 0.241 | 0.4261 | 0.0515 | 0.213 | 5 |
| EC | Dissolved solids, HCO3, Na | AdaBoost | 0.9423 | 0.0047 | 0.0534 | 0.8566 | 0.0078 | 0.0755 | 2 |
| | | Gradient boosting | 0.9852 | 0.001 | 0.0258 | 0.6106 | 0.0347 | 0.1412 | 3 |
| | | **XGBoost** | 0.999 | 0.0001 | 0.0006 | 0.8512 | 0.01 | 0.0816 | 1 |
| | | CatBoost | 0.8137 | 0.0154 | 0.0914 | 0.5365 | 0.0213 | 0.123 | 4 |
| | | LightGBM | 0.2351 | 0.0559 | 0.1564 | 0.1553 | 0.0659 | 0.1659 | 5 |
| Mg | EC, water temperature, SO4 | AdaBoost | 0.9423 | 0.0047 | 0.0534 | 0.8566 | 0.0078 | 0.0755 | 2 |
| | | Gradient boosting | 0.9852 | 0.001 | 0.0258 | 0.6106 | 0.0347 | 0.1412 | 3 |
| | | **XGBoost** | 0.999 | 0.0001 | 0.0006 | 0.8512 | 0.01 | 0.0816 | 1 |
| | | CatBoost | 0.8137 | 0.0154 | 0.0914 | 0.5365 | 0.0213 | 0.123 | 4 |
| | | LightGBM | 0.2351 | 0.0559 | 0.1564 | 0.1553 | 0.0659 | 0.1659 | 5 |
| CO3 | pH, Cl, Mg | AdaBoost | 0.8774 | 0.0009 | 0.026 | 0.5011 | 0.0027 | 0.044 | 3 |
| | | Gradient boosting | 0.9997 | 0.00001 | 0.001 | 0.4714 | 0.0058 | 0.0486 | 2 |
| | | **XGBoost** | 0.8194 | 0.0013 | 0.0221 | 0.7227 | 0.0015 | 0.0365 | 1 |
| | | CatBoost | 0.7073 | 0.0022 | 0.0387 | 0.4002 | 0.0025 | 0.0429 | 4 |
| | | LightGBM | 0.4325 | 0.0843 | 0.241 | 0.4261 | 0.0515 | 0.213 | 5 |
| Ca | Cl, pH, EC | AdaBoost | 0.9876 | 0.0126 | 0.0827 | 0.9419 | 0.0371 | 0.1634 | 2 |
| | | **Gradient boosting** | 0.9999 | 0.0001 | 0.0063 | 0.9433 | 0.044 | 0.1757 | 1 |
| | | XGBoost | 0.9999 | 0.00001 | 0.0006 | 0.8455 | 0.1214 | 0.2281 | 4 |
| | | CatBoost | 0.9936 | 0.0059 | 0.0602 | 0.9267 | 0.0616 | 0.1972 | 3 |
| | | LightGBM | 0.3955 | 0.5694 | 0.5284 | 0.1772 | 0.1282 | 0.3413 | 5 |
| Water temperature | Mg, EC, suspended solids, Ca | AdaBoost | 0.8903 | 0.569 | 0.5982 | 0.4541 | 3.8843 | 1.4044 | 2 |
| | | **Gradient boosting** | 0.8731 | 0.7644 | 0.6592 | 0.3713 | 4.7274 | 1.5373 | 1 |
| | | XGBoost | 0.9981 | 0.0125 | 0.0776 | 0.6848 | 0.9704 | 0.7684 | 4 |
| | | CatBoost | 0.9549 | 0.2295 | 0.3249 | 0.5362 | 3.5017 | 1.4274 | 3 |
| | | LightGBM | 0.1144 | 0.111 | 1.6244 | 0.111 | 2.4549 | 1.2875 | 5 |
| SO4 | Na, EC, Cl | AdaBoost | 0.9589 | 0.004 | 0.0468 | 0.8991 | 0.0093 | 0.0784 | 5 |
| | | **Gradient boosting** | 0.9477 | 0.0056 | 0.0592 | 0.8134 | 0.0131 | 0.0896 | 1 |
| | | XGBoost | 1 | 0 | 0.0008 | 0.8153 | 0.0249 | 0.1326 | 2 |
| | | **CatBoost** | 0.9955 | 0.0004 | 0.0173 | 0.8583 | 0.0129 | 0.0868 | 1 |
| | | LightGBM | 0.4484 | 0.4012 | 0.1859 | 0.4012 | 0.0686 | 0.2241 | 4 |
| Cl | pH, SO4, suspended solids | **AdaBoost** | 0.9634 | 0.0027 | 0.0393 | 0.7773 | 0.0028 | 0.0444 | 1 |
| | | Gradient boosting | 0.9234 | 0.0056 | 0.0537 | 0.6352 | 0.0053 | 0.0493 | 4 |
| | | XGBoost | 0.9995 | 0.00001 | 0.0047 | 0.6517 | 0.0075 | 0.0662 | 2 |
| | | CatBoost | 0.9476 | 0.0024 | 0.0381 | 0.6211 | 0.0047 | 0.0534 | 3 |
| | | LightGBM | 0.0375 | 0.0366 | 0.119 | 0.0366 | 0.0409 | 0.1723 | 5 |
| pH | SO4, Na, Cl, EC, Ca | AdaBoost | 0.9 | 0.0089 | 0.0767 | 0.3525 | 0.0176 | 0.1125 | 3 |
| | | Gradient boosting | 0.9995 | 0.00001 | 0.0051 | 0.3443 | 0.0236 | 0.1221 | 2 |
| | | XGBoost | 0.9999 | 0.00001 | 0.0005 | 0.123 | 0.0843 | 0.1753 | 4 |
| | | **CatBoost** | 0.9943 | 0.0003 | 0.0122 | 0.5092 | 0.0814 | 0.169 | 1 |
| | | LightGBM | 0.2321 | 0.1607 | 0.1582 | 0.1607 | 0.0138 | 0.1055 | 5 |
The gradient boosting model performed strongly in estimating calcium levels. It fit the training data closely (R2 of 0.99), showing a strong correlation between the modeled and observed data (Ullah et al. 2023). With a testing R2 of 0.94, it also generalized well, although the testing MSE (0.0440) and MAE (0.1757) were higher than the training MSE (0.0001) and MAE (0.0063). Overall, gradient boosting proved the most effective model for predicting calcium (Shah et al. 2021).
The XGBoost model achieved the top rank in predicting pH, as is evident from its high R2 (0.99) during both training and testing, demonstrating a strong correlation and close fit (Uddin et al. 2023). Notably, its MSE and MAE values were extremely low, showcasing the model's generalization. XGBoost can therefore be used with confidence for pH prediction in the study area.
The XGBoost model also ranked first in predicting sulfate, performing best with a high R2 (0.99) during both training and testing, which demonstrates a robust correlation and close fit (Uddin et al. 2023). Its generalization was further supported by the low MSE and MAE values during training (0.00001, 0.0049) and testing (0.0203, 0.1020). This makes XGBoost the preferred option for sulfate estimation in the study area.
The XGBoost model ranked first in predicting water temperature, performing well against the competing models with a high R2 of 0.99 during both training and testing, indicating the strongest correlation (Bassi et al. 2021). Its generalization was shown by the low MSE and MAE values during training (0.00001, 0.0006) and testing (0.0100, 0.0816). This highlights XGBoost as the best option for predicting water temperature in the study region.
The XGBoost model was also superior in predicting EC, performing well during both training and testing with a high R2 of 0.99 (Uddin et al. 2023). Its accuracy was further supported by the low MSE and MAE values during training (0.00001, 0.0008) and testing (0.0249, 0.1326), indicating that XGBoost is the best technique for estimating EC in the study area. In conclusion, these findings highlight that the various ensemble learning techniques performed well in predicting the water quality parameters and can support water quality management in the study region.
Models ranking via CP
Models were ranked for each input–output water quality parameter combination using compromise programming (CP), which is based on the statistical performance indicators (Khan et al. 2023). The best performing model for each target parameter is bolded in Table 1. XGBoost ranked first overall, excelling at predicting the target parameters HCO3, EC, Mg, and CO3; gradient boosting was a close second, particularly for Ca, water temperature, and SO4. CatBoost consistently achieved mid-range rankings across the various water quality combinations, while LightGBM consistently trailed the other models, indicating comparatively weak predictive skill. This comprehensive review, guided by CP, emphasizes the need to consider multiple indicators and diverse scenarios when evaluating the suitability of ensemble models for water quality prediction.
CONCLUSION
This study focused on developing and evaluating ensemble learning models to predict water quality parameters. Five advanced algorithms – AdaBoost, gradient boosting, XGBoost, CatBoost, and LightGBM – were used for this task. The input–output water quality combinations were first determined using the RF technique, which identified the optimal set of input parameters for each target water quality variable. ML models were then developed for each combination, and their performance was assessed using the statistical metrics R2, MSE, and MAE. The most suitable model for each input-target combination was selected using CP. The results demonstrated the effectiveness of ensemble learning, with XGBoost and gradient boosting exhibiting superior performance among the methods tested. XGBoost achieved near-perfect R2 values for bicarbonate (HCO3), carbonate (CO3), and magnesium (Mg), while gradient boosting excelled with parameters such as EC, SO4, temperature, and Ca. However, XGBoost's training R2 values of around 0.999 contrasted with lower testing R2 values (0.8636 for HCO3 and 0.8512 for EC), suggesting some degree of overfitting. Gradient boosting demonstrated greater stability, maintaining high predictive accuracy in both the training (e.g., Ca R2 ≈ 0.9999) and testing phases (e.g., Ca R2 = 0.9433). AdaBoost and CatBoost displayed moderate accuracy for certain parameters, with lower R2 values for chloride (Cl) and pH, while CatBoost and LightGBM performed robustly for some parameters, such as pH and dissolved solids, though their effectiveness varied across the other water quality indicators.
However, the study has some limitations. The models rely heavily on accurate and complete water quality data, and performance can degrade when data are scarce or noisy. While ensemble models proved effective, they require careful hyperparameter tuning, increasing computational complexity, which may limit their real-time application in low-resource environments. There is also a risk of overfitting with limited data and obtaining high-quality training data for expensive or time-consuming tests can be challenging. Although the study focused on ensemble methods, other ML or hybrid approaches could offer additional benefits. These limitations suggest areas for further research, including improving data quality, enhancing model generalization, and addressing the dynamic nature of water quality.
Future research could explore the application of transfer learning for predicting future water quality, particularly where data are limited. Although ensemble learning models are generally best suited to large datasets, the approach developed in this study also performs well on smaller datasets. Another promising direction is using optimization algorithms to improve model selection and ensembling for even better performance. Integrating different models and techniques could further enhance water quality predictions.
ETHICAL APPROVAL
The corresponding author takes responsibility on behalf of all authors for ethical approval and permissions related to this research work.
CONSENT TO PARTICIPATE
The corresponding author takes responsibility on behalf of all authors for consent to participate.
CONSENT TO PUBLISH
All the parties gave their written permission for the article to be published. The corresponding author takes responsibility on behalf of all authors for consent to publish.
AUTHOR CONTRIBUTIONS
F.U.S., A.W.K., B.U., M.R.K., and I.J.; Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization. B.U. and A.U.K.; Conceptualization, Supervision, Writing – original draft, Writing – review & editing. All authors have read and agreed to the published version of the manuscript.
FUNDING
The corresponding author, on behalf of all authors, declares that no funds, grants, or other support were received during the preparation of this manuscript.
ACKNOWLEDGEMENTS
The authors express their gratitude to the supporting staff and management of WAPDA, PMD, and the Irrigation Department for their invaluable assistance in facilitating and supplying the necessary data. Their contributions were instrumental in completing this study.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.