Water is essential for all life forms but is increasingly at risk of contamination. Monitoring water quality is crucial to protect ecosystems and public health. This study evaluates ensemble learning techniques – AdaBoost, gradient boosting, XGBoost, CatBoost, and LightGBM – for predicting key water quality parameters in the Bara River Basin, Pakistan. Initially, a random forest model identified optimal input-target parameter combinations. Machine learning models were then developed and evaluated using R2, MSE, and MAE, with the best models selected via compromise programming. Results show that XGBoost and gradient boosting outperformed the other methods. XGBoost achieved near-perfect R2 values for bicarbonate (HCO3), carbonate (CO3), and magnesium (Mg), while gradient boosting excelled with parameters such as electrical conductivity (EC), sulfate (SO4), temperature, and calcium (Ca). XGBoost demonstrated high training R2 values (0.999) but slightly lower testing R2 values (e.g., 0.8636 for HCO3). Gradient boosting exhibited greater stability, maintaining high accuracy in both phases (e.g., Ca testing R2 = 0.9433). AdaBoost and CatBoost showed moderate performance for parameters such as chloride (Cl) and pH, while CatBoost and LightGBM performed well for pH and dissolved solids but varied across other indicators. These findings underscore the potential of ensemble methods for accurate water quality prediction, supporting future management and environmental protection efforts.

  • Evaluation of ensemble techniques for water quality modeling in the Bara River Basin, Pakistan.

  • Utilization of random forest for water quality parameter selection.

  • Application of ensemble learning techniques for water quality parameter prediction.

  • Gradient boosting and XGBoost show superior predictive capability.

  • Importance of algorithm selection and hyperparameter tuning for precise water quality modeling.

Pakistan has one of the world's largest irrigation networks, primarily fed by the Indus River and its tributaries, with over 74% of its water originating from catchments outside the country. This vital resource is under increasing threat from rapid population growth, urbanization, intensified industrial and agricultural activities, and the disposal of untreated wastewater. Assessing water quality in such a context is challenging, as it requires specialized equipment and expertise, making extensive laboratory testing impractical (Sattari et al. 2016). As a viable alternative, machine learning (ML) techniques offer promising tools for predicting water quality parameters, providing efficiency and accuracy in place of time-consuming laboratory methods. Historically, various models, including statistical, deterministic, numerical, and stochastic methods, have been applied in water quality modeling. However, these models are often complex, and their development typically requires extensive datasets (Molekoa et al. 2019; Mosavi et al. 2020; Pakdaman et al. 2020; Vats et al. 2020). Additionally, the traditional statistical approaches in common use assume a linear relationship between input and output variables, which does not accurately capture the complexities of hydrological processes, and they often achieve low prediction accuracy in simulating water quality parameters because of their reliance on stationary and linear datasets. These limitations call for more advanced modeling techniques to improve the robustness and reliability of water quality assessments.

In Pakistan, traditional statistical techniques are widely used for water quality modeling. For example, Khan et al. (2021) applied path analysis to assess the impacts of terrestrial, socio-economic, and hydrological factors on Indus River water quality. Similarly, Khan et al. (2022) used regression analysis to examine the influence of climatic and terrestrial factors on water quality in the Indus River, while Shah et al. (2020) employed correlation analysis, cluster analysis (CA), and principal component analysis (PCA) to evaluate the Ravi and Sutlej rivers. Jehan et al. (2020) applied correlation analysis and PCA in the Swat River basin to identify pollution sources, while Zakaullah & Ejaz (2020) used CA and PCA for the Soan River's water quality evaluation. Furthermore, Zafar & Ahmad (2018) investigated water's physical and chemical characteristics in the Gilgit and Hunza rivers using CA and correlation analysis. Malik & Hashmi (2017) assessed Himalayan foothill streams with CA and discriminant analysis. Iqbal et al. (2018) used the Water Quality Simulation Program (WASP) to develop a water quality improvement strategy for the Ravi River. These studies collectively highlight the value of statistical techniques in water quality assessment, but they also reveal limitations, especially when faced with non-linear and non-stationary data.

Recently, artificial intelligence (AI) techniques have emerged as a powerful alternative to traditional approaches, particularly for handling non-linear relationships and complex interactions among water quality parameters. Numerous studies underscore the effectiveness of AI methods in overcoming the limitations of conventional statistical models (May et al. 2008; Nikoo & Mahjouri 2013; Haghiabi 2016, 2017; Haghiabi et al. 2017; Jaddi & Abdullah 2017). In the upper Indus River basin, Shah et al. (2020) demonstrated that gene expression programming (GEP) outperformed both artificial neural networks (ANNs) and regression models in predicting electrical conductivity (EC) and total dissolved solids (TDS). Likewise, Alqahtani et al. (2022) found that the random forest (RF) model surpassed GEP and ANN for EC and TDS simulations. Additional research by Aslam et al. (2022) compared four ML algorithms – random trees (RT), M5P, RF, and reduced error pruning tree (REPT) – as well as 12 hybrid algorithms for water quality index simulations in Northern Pakistan, with findings indicating that the hybrid RT-artificial neural network (RT-ANN) outperformed other techniques.

Moreover, Shah et al. (2021) observed that GEP and ANN models provided superior accuracy over linear regression models for TDS and EC predictions in the upper Indus River basin, while Ahmed et al. (2019) reported strong predictive performances for gradient boosting and polynomial regression in water quality index forecasting. In another study, Shah et al. (2021) used a hybrid approach combining a feed-forward neural network (FFNN) with particle swarm optimization (PSO), concluding that the PSO-GEP model was highly effective for predicting dissolved oxygen (DO) and TDS. Aldrees et al. (2023) applied multi-expression programming (MEP) and RF modeling to evaluate EC and TDS in the Indus River, finding RF to be the most accurate approach. Additionally, Ahmed et al. (2021) used various ML models – including decision trees (DT), logistic regression (LogR), multilayer perceptron (MLP), k-nearest neighbors (KNN), and Naive Bayes (NB) – to classify Rawal Dam's water quality, while Begum et al. (2023) used PCA for water quality assessment in Rawal Lake.

These studies highlight a significant trend: while traditional statistical methods and certain ML techniques are widely used in Pakistan, they tend to be limited by their assumptions of linearity and stationarity (May et al. 2008; Nikoo & Mahjouri 2013; Haghiabi 2016, 2017; Haghiabi et al. 2017; Jaddi & Abdullah 2017). Although advanced ML models are increasingly being used, ensemble learning approaches based on feature extraction have yet to be applied in Pakistan for water quality assessment. Recognizing this gap, the present study focuses on evaluating ensemble learning methods, particularly feature-based techniques using models like AdaBoost, gradient boosting, XGBoost, LightGBM, and CatBoost, to improve water quality prediction in the Bara River. RF regression is also applied to determine feature importance, providing an enhanced understanding of key water quality parameters.

This study aims to enhance water resource management and decision-making by leveraging advanced ML techniques that can more accurately capture non-linear relationships in water quality data. By applying ensemble learning, this research seeks to contribute a novel approach to water quality modeling that is both robust and efficient, potentially setting a new standard for predictive accuracy in the region.

Study area description

The study focuses on the Bara River Basin, which encompasses a variety of ecosystems and landscapes and is an important area for water resources (Figure 1). The basin is defined by its rugged topography, glaciers, and river systems, spanning a wide range of physical features. The region is essential to the hydrological cycle because it affects downstream water supply for domestic, industrial, and agricultural uses.
Figure 1: Case study area map.

Data collection

The water quality data were collected from the Water and Power Development Authority's Surface Water Hydrology Department for the Jhansipost monitoring station on the Bara River. The data include the following water quality parameters: chloride (Cl), calcium (Ca), pH, EC, magnesium (Mg), sulfate (SO4), dissolved solids by evaporation (DS), bicarbonate (HCO3), sodium (Na), and carbonate (CO3). Summary statistics of these parameters are presented as violin plots in Figure 2.
Figure 2: The distribution and variability of water quality parameters.

Methods

The methodology consisted of several key steps: first, an RF model was utilized to identify the optimal combination of input water quality parameters for the selected target water quality variable. Following this, ensemble learning models were developed for each combination of input parameters and target water quality variables. The performance of these models was evaluated using statistical indicators, including R2, mean squared error (MSE), and mean absolute error (MAE). The most suitable model for each input and target parameter combination was then identified using compromise programming (CP). The step-by-step procedure of this methodology is illustrated in the flowchart shown in Figure 3.
Figure 3: Methodology of the study.

RF for feature selection

RF is a flexible ensemble learning method that performs exceptionally well in feature selection (Kursa & Rudnicki 2011). This technique was used to determine the most appropriate combination of input water quality parameters for predicting each target water quality variable. During training, RF builds a large number of decision trees (DTs), each depending on a different random subset of features (Ali et al. 2012). The contributions of the individual trees are then combined to determine the importance of each feature. This method offers several advantages: it can manage intricate interactions and provide feature importance insights without requiring extensive hyperparameter tuning. By using RF for feature selection, the study sought to sharpen the subsequent modeling step and produce a more targeted and effective prediction model.

In this study, a Python script was designed to analyze the importance of various features in the dataset using an RF regressor. It starts by importing the necessary libraries and loading the data from an Excel file into a DataFrame. The code then separates the input and output variables, skipping non-numeric columns such as dates. An RF regressor is created with 100 estimators and a fixed random state for reproducibility and is then trained on the feature matrix and target variable. The model's feature importance scores are extracted, stored in a DataFrame, and sorted in descending order of importance for clarity. Finally, the feature importance scores are printed for each target variable, giving a detailed picture of which features most strongly influence each parameter. This systematic approach provides valuable insight into the relationships between the water quality variables in the dataset.
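A minimal sketch of this workflow is shown below; the file name, column labels, and random seed are illustrative placeholders rather than the study's actual configuration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_excel("bara_river_wq.xlsx")     # hypothetical file name
df = df.select_dtypes(include="number")      # skip non-numeric columns (e.g., dates)

target = "Ca"                                # one example target parameter
X = df.drop(columns=[target])
y = df[target]

# 100 trees and a fixed random state for reproducibility, as described.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

# Importance scores, sorted in descending order for clarity.
importances = (
    pd.DataFrame({"feature": X.columns, "importance": rf.feature_importances_})
    .sort_values("importance", ascending=False)
)
print(importances)
```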

Ensemble techniques for water quality prediction

AdaBoost

AdaBoost, short for adaptive boosting, is a type of ensemble learning used in this work to increase the accuracy of water quality predictions (Dinakaran & Jeba Thangaiah 2016). AdaBoost trains weak learners, often decision trees, sequentially, applying increasing weights to poorly predicted cases in each iteration. This adaptation to ensemble errors results in a strong predictive model (Schapire 2013). In the context of water quality prediction in the Bara River Basin, AdaBoost performed well by carefully adjusting sample weights and focusing on the hardest cases. The number of estimators (weak learners) and the maximum depth of the trees were tuned for optimal results, improving the algorithm's ability to capture complicated patterns in the data.

In this study, a Python code was developed to implement an AdaBoost regressor with a decision tree as the base estimator for predicting water quality parameters from the selected input variables, as illustrated in Figure 4. It begins by importing the necessary libraries and loading the dataset from an Excel file into a DataFrame. The features and target are then defined, and the dataset is split into training and testing sets, with 20–30% reserved for testing. The code then creates an AdaBoost model with a decision tree of maximum depth 3 as the base estimator, set for 15 boosting rounds. The model is trained on the training data and used to generate predictions for the training and testing sets. The code evaluates the model's performance using statistical metrics, including MSE, MAE, and R2.
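A minimal sketch of the described setup follows; the file name and input-target combination are placeholders, and the `estimator` keyword assumes scikit-learn >= 1.2.

```python
import pandas as pd
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

df = pd.read_excel("bara_river_wq.xlsx")   # hypothetical file name
X, y = df[["Cl", "pH", "EC"]], df["Ca"]    # example input-target combination

# 20% held out for testing (the study used 20-30%); fixed seed for a stable split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=9)

# Decision tree of maximum depth 3 as base estimator, 15 boosting rounds.
model = AdaBoostRegressor(estimator=DecisionTreeRegressor(max_depth=3),
                          n_estimators=15, random_state=9)
model.fit(X_tr, y_tr)

for label, Xs, ys in (("training", X_tr, y_tr), ("testing", X_te, y_te)):
    pred = model.predict(Xs)
    print(label, r2_score(ys, pred), mean_squared_error(ys, pred),
          mean_absolute_error(ys, pred))
```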
Figure 4: Overview of the AdaBoost regression model structure.

Gradient boosting

This research employed gradient boosting, a robust ensemble learning technique, to construct decision trees and enhance their predictive accuracy (Natekin & Knoll 2013). The method fits each new tree to the residual errors of the current ensemble, leading to an overall improvement in predictive performance. Gradient boosting was employed specifically to forecast water quality parameters within the Bara River Basin. To optimize its performance, critical hyperparameters such as the number of estimators, the learning rate, and the maximum depth of trees were tuned. The ability of gradient boosting to handle both regression and classification tasks played a pivotal role in successfully capturing the complex correlations present in the water quality data (Hosen & Amin 2021).

In the current research, a Python code was developed that uses a gradient boosting regressor to predict the target water quality parameter, as illustrated in Figure 5. It starts by importing the necessary libraries, reading the dataset from an Excel file, and defining the features and target. The dataset is split into training and testing sets, with 20–30% reserved for testing and a random state of 9 to ensure reproducibility. The gradient boosting regressor is then created with 100 estimators, a learning rate of 0.1, and a maximum depth of 3. The model is trained on the training data and used to predict values for the training and testing sets. Model performance is evaluated using statistical metrics such as MSE, MAE, and R2 for both the training and testing data.
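A corresponding sketch for the gradient boosting regressor is given below, with the same placeholder file name and an example input-target combination.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

df = pd.read_excel("bara_river_wq.xlsx")   # hypothetical file name
X, y = df[["Cl", "pH", "EC"]], df["Ca"]    # example input-target combination
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=9)

# 100 estimators, learning rate 0.1, maximum depth 3, as described in the text.
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                  max_depth=3, random_state=9)
model.fit(X_tr, y_tr)

pred_tr, pred_te = model.predict(X_tr), model.predict(X_te)
print("training:", r2_score(y_tr, pred_tr), mean_squared_error(y_tr, pred_tr),
      mean_absolute_error(y_tr, pred_tr))
print("testing:", r2_score(y_te, pred_te), mean_squared_error(y_te, pred_te),
      mean_absolute_error(y_te, pred_te))
```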
Figure 5: Illustration of the gradient-boosted regression model workflow.

XGBoost

This work demonstrates the predictive power of an advanced gradient boosting algorithm, extreme gradient boosting (XGBoost), which outperformed the competing algorithms in accurately estimating water quality parameters in the Bara River Basin (Ali et al. 2023). XGBoost utilizes regularization techniques and parallel processing to further improve performance (Bentéjac et al. 2019). To maximize its performance, important hyperparameters were iteratively adjusted, including the learning rate, the maximum depth of individual trees, and the number of boosting rounds. XGBoost proved its capability to manage large datasets by achieving exceptional predictive accuracy in water quality forecasting (Kanagarathinam et al. 2023).

In the current work, a Python code was developed to implement an XGBoost regressor for predicting water quality parameters. It begins by loading the data from an Excel file into a DataFrame and defining the features and target. The code then splits the dataset into training and testing sets, with 20–30% reserved for testing to validate the model's performance, as illustrated in Figure 6. An XGBoost regressor is created with 100 trees (n_estimators), a learning rate of 0.1, and a maximum tree depth of 3 to control complexity and prevent overfitting. The model is trained on the training set and used to make predictions on both the training and testing sets. The performance of the model is evaluated using statistical metrics, including MSE, MAE, and R2, for both training and testing data.
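A minimal sketch using the `xgboost` package follows; the file name and the HCO3 input-target combination are placeholders drawn from the feature selection results.

```python
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

df = pd.read_excel("bara_river_wq.xlsx")     # hypothetical file name
X, y = df[["EC", "SO4", "DS"]], df["HCO3"]   # example input-target combination
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=9)

# 100 trees, learning rate 0.1, maximum depth 3 to control complexity.
model = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
print("testing:", r2_score(y_te, pred), mean_squared_error(y_te, pred),
      mean_absolute_error(y_te, pred))
```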
Figure 6: Schematic representation of the XGBoost algorithm architecture.

CatBoost

CatBoost is a gradient boosting library designed to handle categorical data and deliver strong performance on regression and classification tasks (Hancock & Khoshgoftaar 2020). Developed by Yandex, CatBoost stands for categorical boosting and is particularly known for its ability to handle categorical features directly, unlike other gradient boosting algorithms that typically require preprocessing such as one-hot encoding (Hancock & Khoshgoftaar 2020). CatBoost has proven to be an efficient and successful algorithm for forecasting water quality metrics, particularly where categorical features are present (Wu et al. 2020).

To implement a CatBoost regressor in this research, a Python code was developed to predict water quality parameters, as illustrated in Figure 7. The data are loaded from an Excel file into a pandas DataFrame, and the features and target are defined accordingly. The code then splits the dataset into training and testing sets, with 20–30% of the data reserved for testing, to assess the model's generalizability. A CatBoost regressor is created with 100 iterations, a learning rate of 0.1, and a depth of 3, controlling model complexity and the training rate. The model is trained on the training set, and predictions are made for both the training and testing sets. To evaluate the model's performance, statistical metrics such as MSE, MAE, and R2 are computed for both sets.
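A minimal sketch with the `catboost` package is given below; the file name and the pH input-target combination are placeholders.

```python
import pandas as pd
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

df = pd.read_excel("bara_river_wq.xlsx")              # hypothetical file name
X, y = df[["SO4", "Na", "Cl", "EC", "Ca"]], df["pH"]  # example combination
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=9)

# 100 iterations, learning rate 0.1, depth 3; verbose=0 silences training logs.
model = CatBoostRegressor(iterations=100, learning_rate=0.1, depth=3, verbose=0)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
print("testing:", r2_score(y_te, pred), mean_squared_error(y_te, pred),
      mean_absolute_error(y_te, pred))
```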
Figure 7: Diagrammatic view of the CatBoost model architecture.

LightGBM

This study demonstrated the usefulness of LightGBM, a histogram-based gradient boosting framework, for predicting water quality indicators (Ke et al. 2017). Its efficient handling of massive datasets and its leaf-wise tree-growth strategy set it apart. Hyperparameters for LightGBM, including the number of boosting rounds, the learning rate, and the maximum depth of trees, were carefully configured for optimal performance. The algorithm's usefulness in water quality prediction can be attributed to its speed and scalability, especially when working with large datasets (McCarty et al. 2020). These specific features of LightGBM played a key role in its effectiveness at capturing fine patterns in the water quality data of the Bara River (Ahmed et al. 2022).

In the current research work, a Python code was developed to implement a LightGBM regressor for water quality parameter prediction, as illustrated in Figure 8. The data are imported from an Excel file and loaded into a pandas DataFrame. The independent variables and the target variable are specified, and the dataset is divided into training and testing sets, with 20–30% of the data reserved for testing. A LightGBM regressor is created with parameters such as 100 estimators, a learning rate of 0.1, and a maximum depth of 3 to control model complexity and training pace. The model is then trained on the training data, and predictions are generated for both the training and testing sets. The code evaluates the model's performance using MSE, MAE, and R2 for both sets, providing insight into its accuracy and predictive power.
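A minimal sketch with the `lightgbm` package follows; the file name and the EC input-target combination are placeholders.

```python
import pandas as pd
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

df = pd.read_excel("bara_river_wq.xlsx")    # hypothetical file name
X, y = df[["DS", "Na", "SO4"]], df["EC"]    # example input-target combination
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=9)

# 100 estimators, learning rate 0.1, maximum depth 3, as described.
model = LGBMRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
print("testing:", r2_score(y_te, pred), mean_squared_error(y_te, pred),
      mean_absolute_error(y_te, pred))
```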
Figure 8: Schematic representation of the LightGBM architecture.

Models’ evaluation

In the study, ensemble learning models were evaluated using three statistical indicators: R2, MSE, and MAE. These metrics serve as critical indicators for analyzing the model's efficacy in terms of explanatory power and predictive accuracy (Ullah et al. 2023).

Coefficient of determination (R2)

R2 represents the proportion of the variance in the dependent variable that is explained by the independent variables and is expressed by Equation (1):

$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(O_{i} - P_{i}\right)^{2}}{\sum_{i=1}^{n}\left(O_{i} - \bar{O}\right)^{2}} \qquad (1)$$

where $O_i$ and $P_i$ denote the observed and predicted values, $\bar{O}$ is the mean of the observed values, and $n$ is the number of observations.

Mean squared error

The MSE represents the average of the squared differences between the predicted and actual values and is defined by Equation (2):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(O_{i} - P_{i}\right)^{2} \qquad (2)$$

Mean absolute error

The MAE represents the average of the absolute differences between the observed and predicted values and is expressed by Equation (3):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|O_{i} - P_{i}\right| \qquad (3)$$
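For reference, Equations (1)–(3) translate directly into a few lines of NumPy; the sketch below is illustrative, with `obs` and `pred` standing for observed and predicted arrays.

```python
import numpy as np

def r2(obs, pred):
    """Equation (1): coefficient of determination."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

def mse(obs, pred):
    """Equation (2): mean squared error."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return np.mean((obs - pred) ** 2)

def mae(obs, pred):
    """Equation (3): mean absolute error."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return np.mean(np.abs(obs - pred))
```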

Models ranking

CP is a technique for merging several statistical metrics (Zelany 1974). In this work, CP was utilized to evaluate and rank the ensemble learning techniques on the basis of three essential performance indicators: R2, MSE, and MAE, which together assess the effectiveness of the techniques. Within the CP framework, a distance measure termed the LP metric is computed from the values of these indicators as follows:

$$L_{P} = \left[\sum_{n=1}^{N}\left|W_{n}^{*} - W_{n}\right|^{m}\right]^{1/m} \qquad (4)$$

where $N$ is the number of statistical performance indicators considered.

The parameter m is set equal to 4; Wn represents the actual value of a statistical performance measure, whereas Wn* indicates its ideal value, obtained when model simulations fully match the observed data. The LP metric is always positive, and lower LP values are preferred since they imply better model performance.
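As an illustration of how Equation (4) can be applied, the sketch below scores a model from its three indicators, assuming ideal values of R2* = 1 and MSE* = MAE* = 0; the example numbers are training values for the HCO3 combination from Table 1.

```python
import numpy as np

def lp_metric(indicators, ideals, m=4):
    """Equation (4): distance of a model's indicators from their ideal values."""
    w = np.asarray(indicators, dtype=float)
    w_star = np.asarray(ideals, dtype=float)
    return np.sum(np.abs(w_star - w) ** m) ** (1.0 / m)

ideals = [1.0, 0.0, 0.0]                             # perfect R2, MSE, MAE
xgb = lp_metric([0.9997, 0.0001, 0.0049], ideals)    # XGBoost, HCO3 (training)
lgbm = lp_metric([0.4325, 0.0843, 0.2410], ideals)   # LightGBM, HCO3 (training)
print(xgb < lgbm)  # True: a lower LP implies better performance
```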

Model development

The performance of ensemble learning methods such as AdaBoost, gradient boosting, XGBoost, and CatBoost is strongly influenced by the parameters test_size, random_state, n_estimators, max_depth, and learning_rate. The test_size parameter determines the percentage of the dataset used to test the model; for example, a value of 0.2 means that 20% of the data are held out for testing, which allows a fair evaluation of the model's predictive capability. The random_state option seeds the random number generator, so using the same value across several runs yields reproducible results. The n_estimators parameter specifies how many base learners (trees) are included in the ensemble model; this directly affects model complexity and the capacity to identify fine trends in the water quality data. To prevent overfitting, the max_depth option controls the maximum depth of each tree in the ensemble. Finally, the learning_rate parameter sets the step size of each iteration and thus how much each tree contributes to the final model; it is essential for optimizing the loss function, avoiding overfitting, and fine-tuning performance.
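To make the role of these shared parameters concrete, the sketch below wires the same split settings and hyperparameters into all five ensembles; the file name, column labels, and chosen combination are placeholders, not the study's exact configuration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor

df = pd.read_excel("bara_river_wq.xlsx")   # hypothetical file name
X, y = df[["Cl", "pH", "EC"]], df["Ca"]    # one example combination

# test_size=0.2 holds out 20% for testing; random_state fixes the split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=9)

models = {
    "AdaBoost": AdaBoostRegressor(n_estimators=15, random_state=9),
    "Gradient boosting": GradientBoostingRegressor(n_estimators=100,
                                                   learning_rate=0.1, max_depth=3),
    "XGBoost": XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3),
    "CatBoost": CatBoostRegressor(iterations=100, learning_rate=0.1,
                                  depth=3, verbose=0),
    "LightGBM": LGBMRegressor(n_estimators=100, learning_rate=0.1, max_depth=3),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "testing R2:", r2_score(y_te, model.predict(X_te)))
```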

Feature selection

The RF approach was used to identify the best combination of input water quality variables for each target water quality variable (Abba et al. 2022), as shown in Figure 9. For the target parameter Ca, the best combination of input variables is Cl, pH, and EC, with importance scores of 0.41, 0.15, and 0.13, respectively. For the target variable Mg, the most important combination is EC, water temperature, and SO4, with importance scores of 0.38, 0.16, and 0.10, respectively, highlighting how strongly these features affect magnesium levels. The sodium adsorption ratio (SAR) is the dominant predictor of the target variable Na, with an importance value of 0.71; pH and EC also showed significant influence on sodium concentrations, with contributions of 0.27 and 0.23, respectively. For HCO3, the feature selection identified SO4 as the most important predictor, with an importance value of 0.44, while pH and DS also contributed notably (0.46 and 0.11, respectively). With an importance score of 0.46, DS is the main feature for the target variable EC; Na and SO4 came second and third, with contributions of 0.18 and 0.14, respectively. For pH, Cl was a significant contributor (importance score of 0.25), and SO4 and HCO3, contributing 0.14 and 0.11, respectively, also had a notable impact. SAR, with an importance value of 0.77, was the single most important factor in forecasting SO4; Na and Cl, with contributions of 0.14 and 0.13, respectively, also had a noteworthy influence. Water temperature, with an importance score of 0.33, proved to be the most important factor for predicting SAR, with SO4 and Cl (contributions of 0.07 and 0.08, respectively) also showing some importance. In summary, the RF technique identified the key predictors of each target variable, shedding light on the mechanisms influencing water chemistry (Al-Sulttani et al. 2021).
Figure 9: (a–i) The extracted features for each target water quality parameter.

Water quality parameters prediction

In this research work, water quality modeling was carried out using ML techniques, namely gradient boosting, AdaBoost, XGBoost, CatBoost, and LightGBM. Different proportions of the water quality data were used for model training and testing for the various input–output parameter combinations suggested by the RF technique. Model fine-tuning was carried out using the parameters discussed in the model development section. The training and testing results of the best-performing models are shown in Figures 10 and 11, and model performance was assessed via statistical performance indicators in both the training and testing phases, as demonstrated in Figure 12 and Table 1. The best-performing models are discussed below.
Figure 10: (a–i) Training and testing of the best-performing ensemble techniques.

Figure 11: (a–i) Regression graphs of the developed ensemble models.

Figure 12: (a–i) Statistical performance indicators for various models.
Table 1: Ranking of ensemble models for various input–output water quality parameter combinations

| Input parameters | Target parameter | Model | R2 (train) | MSE (train) | MAE (train) | R2 (test) | MSE (test) | MAE (test) | CP rank |
|---|---|---|---|---|---|---|---|---|---|
| EC, SO4, dissolved solids | HCO3 | AdaBoost | 0.9543 | 0.0057 | 0.0596 | 0.8639 | 0.0229 | 0.1295 | – |
| | | Gradient boosting | 0.9982 | 0.0003 | 0.0139 | 0.848 | 0.016 | 0.089 | – |
| | | XGBoost | 0.9997 | 0.0001 | 0.0049 | 0.8636 | 0.0203 | 0.102 | 1 |
| | | CatBoost | 0.9731 | 0.0039 | 0.0452 | 0.8809 | 0.0137 | 0.0841 | – |
| | | LightGBM | 0.4325 | 0.0843 | 0.241 | 0.4261 | 0.0515 | 0.213 | – |
| Dissolved solids, HCO3, Na | EC | AdaBoost | 0.9423 | 0.0047 | 0.0534 | 0.8566 | 0.0078 | 0.0755 | – |
| | | Gradient boosting | 0.9852 | 0.001 | 0.0258 | 0.6106 | 0.0347 | 0.1412 | – |
| | | XGBoost | 0.999 | 0.0001 | 0.0006 | 0.8512 | 0.01 | 0.0816 | 1 |
| | | CatBoost | 0.8137 | 0.0154 | 0.0914 | 0.5365 | 0.0213 | 0.123 | – |
| | | LightGBM | 0.2351 | 0.0559 | 0.1564 | 0.1553 | 0.0659 | 0.1659 | – |
| EC, water temperature, SO4 | Mg | AdaBoost | 0.9423 | 0.0047 | 0.0534 | 0.8566 | 0.0078 | 0.0755 | – |
| | | Gradient boosting | 0.9852 | 0.001 | 0.0258 | 0.6106 | 0.0347 | 0.1412 | – |
| | | XGBoost | 0.999 | 0.0001 | 0.0006 | 0.8512 | 0.01 | 0.0816 | 1 |
| | | CatBoost | 0.8137 | 0.0154 | 0.0914 | 0.5365 | 0.0213 | 0.123 | – |
| | | LightGBM | 0.2351 | 0.0559 | 0.1564 | 0.1553 | 0.0659 | 0.1659 | – |
| pH, Cl, Mg | CO3 | AdaBoost | 0.8774 | 0.0009 | 0.026 | 0.5011 | 0.0027 | 0.044 | – |
| | | Gradient boosting | 0.9997 | 0.00001 | 0.001 | 0.4714 | 0.0058 | 0.0486 | – |
| | | XGBoost | 0.8194 | 0.0013 | 0.0221 | 0.7227 | 0.0015 | 0.0365 | 1 |
| | | CatBoost | 0.7073 | 0.0022 | 0.0387 | 0.4002 | 0.0025 | 0.0429 | – |
| | | LightGBM | 0.4325 | 0.0843 | 0.241 | 0.4261 | 0.0515 | 0.213 | – |
| Cl, pH, EC | Ca | AdaBoost | 0.9876 | 0.0126 | 0.0827 | 0.9419 | 0.0371 | 0.1634 | – |
| | | Gradient boosting | 0.9999 | 0.0001 | 0.0063 | 0.9433 | 0.044 | 0.1757 | 1 |
| | | XGBoost | 0.9999 | 0.00001 | 0.0006 | 0.8455 | 0.1214 | 0.2281 | – |
| | | CatBoost | 0.9936 | 0.0059 | 0.0602 | 0.9267 | 0.0616 | 0.1972 | – |
| | | LightGBM | 0.3955 | 0.5694 | 0.5284 | 0.1772 | 0.1282 | 0.3413 | – |
| Mg, EC, suspended solids, Ca | Water temperature | AdaBoost | 0.8903 | 0.569 | 0.5982 | 0.4541 | 3.8843 | 1.4044 | – |
| | | Gradient boosting | 0.8731 | 0.7644 | 0.6592 | 0.3713 | 4.7274 | 1.5373 | 1 |
| | | XGBoost | 0.9981 | 0.0125 | 0.0776 | 0.6848 | 0.9704 | 0.7684 | – |
| | | CatBoost | 0.9549 | 0.2295 | 0.3249 | 0.5362 | 3.5017 | 1.4274 | – |
| | | LightGBM | 0.1144 | 0.111 | 1.6244 | 0.111 | 2.4549 | 1.2875 | – |
| Na, EC, Cl | SO4 | AdaBoost | 0.9589 | 0.004 | 0.0468 | 0.8991 | 0.0093 | 0.0784 | – |
| | | Gradient boosting | 0.9477 | 0.0056 | 0.0592 | 0.8134 | 0.0131 | 0.0896 | 1 |
| | | XGBoost | – | – | 0.0008 | 0.8153 | 0.0249 | 0.1326 | – |
| | | CatBoost | 0.9955 | 0.0004 | 0.0173 | 0.8583 | 0.0129 | 0.0868 | – |
| | | LightGBM | 0.4484 | 0.4012 | 0.1859 | 0.4012 | 0.0686 | 0.2241 | – |
| pH, SO4, suspended solids | Cl | AdaBoost | 0.9634 | 0.0027 | 0.0393 | 0.7773 | 0.0028 | 0.0444 | 1 |
| | | Gradient boosting | 0.9234 | 0.0056 | 0.0537 | 0.6352 | 0.0053 | 0.0493 | – |
| | | XGBoost | 0.9995 | 0.00001 | 0.0047 | 0.6517 | 0.0075 | 0.0662 | – |
| | | CatBoost | 0.9476 | 0.0024 | 0.0381 | 0.6211 | 0.0047 | 0.0534 | – |
| | | LightGBM | 0.0375 | 0.0366 | 0.119 | 0.0366 | 0.0409 | 0.1723 | – |
| SO4, Na, Cl, EC, Ca | pH | AdaBoost | 0.9 | 0.0089 | 0.0767 | 0.3525 | 0.0176 | 0.1125 | – |
| | | Gradient boosting | 0.9995 | 0.00001 | 0.0051 | 0.3443 | 0.0236 | 0.1221 | – |
| | | XGBoost | 0.9999 | 0.00001 | 0.0005 | 0.123 | 0.0843 | 0.1753 | – |
| | | CatBoost | 0.9943 | 0.0003 | 0.0122 | 0.5092 | 0.0814 | 0.169 | 1 |
| | | LightGBM | 0.2321 | 0.1607 | 0.1582 | 0.1607 | 0.0138 | 0.1055 | – |

When it came to estimating calcium levels, the gradient boosting model performed very well. The model fit the training data closely, showing a strong correlation between modeled and observed data (training R2 of 0.9999) (Ullah et al. 2023). With a testing R2 of 0.9433, the testing performance indicates strong generalization, although the testing MSE (0.0440) and MAE (0.1757) are higher than the training MSE (0.0001) and MAE (0.0063). Overall, the gradient boosting model proved to be the most effective in predicting calcium levels (Shah et al. 2021).

The XGBoost model achieved the top rank in predicting pH. The model excels at predicting pH, as evident from its high R2 values during training and testing, demonstrating strong correlation and fit (Uddin et al. 2023). Notably, the error terms, MSE and MAE, are extremely low, supporting the model's generalization. This suggests that the XGBoost model can be used with confidence for pH prediction in the study area.

The XGBoost model also earned first place in predicting sulfate. It performed best among the competing models, as evident from its high R2 values during training and testing, demonstrating robust correlation and fit (Uddin et al. 2023). The model's generalization is further supported by low error terms, with MSE and MAE values of 0.00001 and 0.0049 during training and 0.0203 and 0.1020 during testing. This indicates that the XGBoost model is a strong option for sulfate estimation in the study area.

The XGBoost model ranked first in predicting water temperature. It performed well in contrast to the other competing models, as demonstrated by its high R2 values during training and testing (Bassi et al. 2021). The model's generalization is shown by its low error terms, with MSE and MAE values of 0.00001 and 0.0006 during training and 0.0100 and 0.0816 during testing. This highlights the XGBoost model as a strong option for predicting water temperature in the study region.

The XGBoost model also performed strongly in predicting EC during both training and testing, as demonstrated by its high R2 values (Uddin et al. 2023). The model's accuracy is further illustrated by low error terms, with MSE and MAE values of 0.00001 and 0.0008 during training and 0.0249 and 0.1326 during testing. This indicates that XGBoost is well suited to EC prediction in the study area. In conclusion, the findings highlight that the various ensemble learning techniques performed well in predicting different water quality parameters and can support water quality management in the study region.

Models ranking via CP

Models were ranked for each specific input–output water quality parameter combination using compromise programming, which is based on statistical performance indicators (Khan et al. 2023). The best-performing model for each combination is the one ranked first in Table 1. XGBoost ranked first most often, excelling in the prediction of HCO3, EC, Mg, and CO3; gradient boosting was a close second, ranking first for Ca, water temperature, and SO4. CatBoost consistently achieved mid-range rankings across a variety of water quality combinations, while LightGBM consistently trailed the other models, demonstrating a relative lack of predictive skill. This comprehensive review, led by CP, emphasizes the need to consider numerous indicators and diverse scenarios when evaluating the viability of ensemble models for water quality prediction.

This study focused on developing and evaluating ensemble learning models to predict water quality parameters. Five advanced algorithms – AdaBoost, gradient boosting, XGBoost, CatBoost, and LightGBM – were used for this task. The input–output water quality combinations were first determined using the RF technique, which identified the optimal set of input parameters for each target water quality variable. ML models were then developed for each combination, and their performance was assessed using statistical metrics, namely R2, MSE, and MAE. The most suitable models for each input-target combination were selected using CP. The results demonstrated the effectiveness of ensemble learning, with XGBoost and gradient boosting exhibiting superior performance among the methods tested. XGBoost achieved near-perfect R2 values for bicarbonate (HCO3), carbonate (CO3), and magnesium (Mg), while gradient boosting excelled with parameters such as EC, SO4, temperature, and Ca. However, XGBoost showed training R2 values around 0.999 but lower testing R2 values (0.8636 for HCO3 and 0.8512 for EC). Gradient boosting demonstrated more stability, maintaining high predictive accuracy in both the training (e.g., Ca R2 ≈ 0.9999) and testing phases (e.g., Ca R2 = 0.9433). AdaBoost and CatBoost displayed moderate accuracy for certain parameters, with lower R2 values for chloride (Cl) and pH. CatBoost and LightGBM performed robustly for some parameters, such as pH and dissolved solids, though their effectiveness varied across other water quality indicators.

However, the study has some limitations. The models rely heavily on accurate and complete water quality data, and performance can degrade when data are scarce or noisy. While the ensemble models proved effective, they require careful hyperparameter tuning, which increases computational complexity and may limit real-time application in low-resource environments. There is also a risk of overfitting with limited data, and obtaining high-quality training data from expensive or time-consuming tests can be challenging. Although the study focused on ensemble methods, other ML or hybrid approaches could offer additional benefits. These limitations suggest areas for further research, including improving data quality, enhancing model generalization, and addressing the dynamic nature of water quality.

Future research could explore the application of transfer learning for predicting future water quality, particularly in situations with limited data. Ensemble learning models may be best suited for large datasets, but the approach developed in this study works well for smaller datasets. Another promising direction is using optimization algorithms to improve model selection and ensembling for even better performance. Integrating different models and techniques could further enhance water quality predictions.

The corresponding author takes responsibility on behalf of all authors for ethical approval and permissions related to this research work.

The corresponding author takes responsibility on behalf of all authors for consent to participate.

All the parties gave their written permission for the article to be published. The corresponding author takes responsibility on behalf of all authors for consent to publish.

F.U.S., A.W.K., B.U., M.R.K., and I.J.; Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization. B.U. and A.U.K.; Conceptualization, Supervision, Writing – original draft, Writing – review & editing. All authors have read and agreed to the published version of the manuscript.

The corresponding author, on behalf of all authors, declares that no funds, grants, or other support were received during the preparation of this manuscript.

The authors express their gratitude to the supporting staff and management of WAPDA, PMD, and the Irrigation Department for their invaluable assistance in facilitating and supplying the necessary data. Their contributions were instrumental in completing this study.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

References

Abba, S. I., Abdulkadir, R. A., Sammen, S. S., Pham, Q. B., Lawan, A. A., Esmaili, P., Malik, A. & Al-Ansari, N. (2022) Integrating feature extraction approaches with hybrid emotional neural networks for water quality index modeling, Applied Soft Computing, 114, 108036.

Ahmed, U., Mumtaz, R., Anwar, H., Shah, A. A., Irfan, R. & García-Nieto, J. (2019) Efficient water quality prediction using supervised machine learning, Water, 11 (11), 2210.

Aldrees, A., Javed, M. F., Taha, A. T. B., Mohamed, A. M., Jasiński, M. & Gono, M. (2023) Evolutionary and ensemble machine learning predictive models for evaluation of water quality, Journal of Hydrology: Regional Studies, 46, 101331.

Ali, J., Khan, R., Ahmad, N. & Maqsood, I. (2012) Random forests and decision trees, International Journal of Computer Science Issues (IJCSI), 9, 273–278.

Ali, Z., Abduljabbar, Z., Tahir, H., Sallow, A. & Almufti, S. (2023) Exploring the power of eXtreme gradient boosting algorithm in machine learning: A review, Academic Journal of Nawroz University, 12, 320–334. doi:10.25007/ajnu.v12n2a1612.

Al-Sulttani, A. O., Al-Mukhtar, M., Roomi, A. B., Farooque, A. A., Khedher, K. M. & Yaseen, Z. M. (2021) Proposition of new ensemble data-intelligence models for surface water quality prediction, IEEE Access, 9, 108527–108541.

Aslam, B., Maqsoom, A., Cheema, A. H., Ullah, F., Alharbi, A. & Imran, M. (2022) Water quality management using hybrid machine learning and data mining algorithms: An indexing approach, IEEE Access, 10, 119692–119705.

Bassi, A., Shenoy, A., Sharma, A., Sigurdson, H., Glossop, C. & Chan, J. H. (2021) Building energy consumption forecasting: A comparison of gradient boosting models. Paper presented at the 12th International Conference on Advances in Information Technology.

Begum, S., Firdous, S., Naeem, Z., Chaudhry, G. E. S., Arshad, S., Abid, F., Zahra, S., Khan, S., Adnan, M., Sung, Y. Y. & Muhammad, T. S. T. (2023) Combined multivariate statistical techniques and water quality index (WQI) to evaluate spatial variation in water quality, Tropical Life Sciences Research, 34 (3), 129. https://doi.org/10.21315/tlsr2023.34.3.7.

Bentéjac, C., Csörgő, A. & Martínez-Muñoz, G. (2019) A comparative analysis of XGBoost. arXiv preprint. Ithaca, New York: Cornell University Library. https://doi.org/10.48550/arXiv.1911.01914.

Dinakaran, S. & Jeba Thangaiah, R. (2016) Ensemble method of effective AdaBoost algorithm for decision tree classifiers, International Journal on Artificial Intelligence Tools, 26, 1750007. doi:10.1142/S0218213017500075.

Haghiabi, A. H. (2017) Modeling river mixing mechanism using data driven model, Water Resources Management, 31, 811–824.

Haghiabi, A. H., Azamathulla, H. M. & Parsaie, A. (2017) Prediction of head loss on cascade weir using ANN and SVM, ISH Journal of Hydraulic Engineering, 23 (1), 102–110.

Hancock, J. & Khoshgoftaar, T. (2020) CatBoost for big data: An interdisciplinary review, Journal of Big Data, 7, 94. doi:10.1186/s40537-020-00369-8.

Hosen, M. S. & Amin, R. (2021) Significant of gradient boosting algorithm in data management system, Engineering International, 9, 85–100. doi:10.18034/ei.v9i2.559.

Jehan, S., Ullah, I., Khan, S., Muhammad, S., Khattak, S. A. & Khan, T. (2020) Evaluation of the Swat River, Northern Pakistan, water quality using multivariate statistical techniques and water quality index (WQI) model, Environmental Science and Pollution Research, 27, 38545–38558.

Kanagarathinam, K., Krishnan, S. & Manikandan, R. (2023) Water quality prediction: A data-driven approach exploiting advanced machine learning algorithms with data augmentation, Journal of Water and Climate Change, 15 (2), 431–452. doi:10.2166/wcc.2023.403.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. & Liu, T.-Y. (2017) LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, pp. 3149–3157. Available at: https://api.semanticscholar.org/CorpusID:3815895.

Khan, A. U., Rahman, H. U., Ali, L., Khan, M. I., Khan, H. M., Khan, A. U., Khan, F. A., Khan, J., Shah, L. A., Haleem, K., Abbas, A. & Ahmad, I. (2021) Complex linkage between watershed attributes and surface water quality: Gaining insight via path analysis, Civil Engineering Journal, 7 (04), 701–712.

Khan, M., Khan, S., Ullah Khan, A., Noman, M., Usama, M., Ahmad Khan, F., Haleem, K. & Khan, J. (2022) Effect of land use change on climate elasticity of water quality at multiple spatial scales, Water Practice & Technology, 17 (11), 2334–2350.

Khan, S., Khan, A. U., Khan, M., Khan, F. A., Khan, S. & Khan, J. (2023) Intercomparison of SWAT and ANN techniques in simulating streamflows in the Astore Basin of the Upper Indus, Water Science & Technology, 88 (7), 1847–1862.

Kursa, M. B. & Rudnicki, W. R. (2011) The all relevant feature selection using random forest. arXiv preprint. Ithaca, New York: Cornell University Library. https://doi.org/10.48550/arXiv.1106.5112.

May, R. J., Dandy, G. C., Maier, H. R. & Nixon, J. B. (2008) Application of partial mutual information variable selection to ANN forecasting of water quality in water distribution systems, Environmental Modelling & Software, 23 (10–11), 1289–1299.

Mosavi, A., Hosseini, F. S., Choubin, B., Goodarzi, M. & Dineva, A. A. (2020) Groundwater salinity susceptibility mapping using classifier ensemble and Bayesian machine learning models, IEEE Access, 8, 145564–145576.

Natekin, A. & Knoll, A. (2013) Gradient boosting machines, a tutorial, Frontiers in Neurorobotics, 7, 21. doi:10.3389/fnbot.2013.00021.

Nikoo, M. R. & Mahjouri, N. (2013) Water quality zoning using probabilistic support vector machines and self-organizing maps, Water Resources Management, 27, 2577–2594.

Pakdaman, M., Falamarzi, Y., Yazdi, H. S., Ahmadian, A., Salahshour, S. & Ferrara, M. (2020) A kernel least mean square algorithm for fuzzy differential equations and its application in earth's energy balance model and climate, Alexandria Engineering Journal, 59 (4), 2803–2810.

Sattari, M. T., Joudi, A. R. & Kusiak, A. (2016) Estimation of water quality parameters with data-driven model, Journal – American Water Works Association, 108 (4), E232–E239.

Schapire, R. E. (2013) Explaining AdaBoost. In: Empirical Inference (Schölkopf, B., Luo, Z. & Vovk, V., eds). Berlin, Heidelberg: Springer, pp. 37–52. https://doi.org/10.1007/978-3-642-41136-6_5.

Shah, H. A., Sheraz, M., Khan, A. U., Khan, F. A., Shah, L. A., Khan, J., Khan, A. & Khan, Z. (2020) Surface and groundwater pollution: The invisible, creeping threat to human health, Civil and Environmental Engineering, 16 (1), 157–169. https://doi.org/10.2478/cee-2020-0016.

Shah, M. I., Alaloul, W. S., Alqahtani, A., Aldrees, A., Musarat, M. A. & Javed, M. F. (2021) Predictive modeling approach for surface water quality: Development and comparison of machine learning models, Sustainability, 13 (14), 7515.

Uddin, M. G., Nash, S., Rahman, A. & Olbert, A. I. (2023) Performance analysis of the water quality index model for predicting water state using machine learning techniques, Process Safety and Environmental Protection, 169, 808–828.

Ullah, B., Fawad, M., Khan, A. U., Mohamand, S. K., Khan, M., Iqbal, M. J. & Khan, J. (2023) Futuristic streamflow prediction based on CMIP6 scenarios using machine learning models, Water Resources Management, 37 (15), 6089–6106. doi:10.1007/s11269-023-03645-3.

Zafar, M. & Ahmad, W. (2018) Water quality assessment and apportionment of northern Pakistan by multivariate statistical techniques, a case study, International Journal of Hydrology, 2 (1), 00040.

Zakaullah & Ejaz, N. (2020) Investigation of the Soan River water quality using multivariate statistical approach, Journal of Chemistry, 2020, 6644796. https://doi.org/10.1155/2020/6644796.

Zelany, M. (1974) A concept of compromise solutions and the method of the displaced ideal, Computers & Operations Research, 1 (3), 479–496. https://doi.org/10.1016/0305-0548(74)90064-1.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).