Groundwater serves as the source for nearly half of the world's drinking water, yet understanding of global groundwater resources remains incomplete, and management of aquifers falls short, particularly with respect to groundwater quality. This research offers insights into groundwater quality at 242 stations in Maharashtra and the Union Territory of Dadra and Nagar Haveli. Nine parameters (pH, TDS, TH, calcium (Ca2+), magnesium (Mg2+), chloride (Cl−), sulphate (SO42−), nitrate (NO3−), fluoride (F−)) were considered for computing the Water Quality Index (WQI) and, hence, the Water Quality Classification (WQC). The study applies Machine Learning (ML) models, specifically Random Forest, Adaptive Boosting (AdaBoost), Gradient Boosting, XGBoost, Support Vector Machine (SVM), and K-Nearest Neighbor (KNN), to predict WQC, and the models are tested. The grid search method is used for hyperparameter tuning to achieve the best possible performance of the ML models. The performance of the classification models is evaluated and reported using Accuracy, Precision, Recall (Sensitivity), and F1 Score. SVM achieved the highest performance in predicting WQC. With accurate predictions of WQC, these findings have the potential to enhance NEP concerning water resources by facilitating ongoing improvements in water quality.

  • In total, 242 stations and 9 parameters were considered for computing the WQI and hence the classification of water quality was based on the WQI.

  • ML models are used for predicting the water quality classification (WQC) of groundwater.

  • The grid search method is used to achieve the best possible performance of ML models.

  • Accuracy, Precision, Recall or Sensitivity, and F1 Score metrics are used to evaluate the performance of ML models.

  • SVM achieved the highest performance in predicting WQC.

Water stands as arguably the paramount resource globally, exerting a direct influence on the socio-economic advancement of communities (Gartsiyanova et al. 2021). Water is essential for agriculture, industry, and domestic use. Access to clean water is vital for human health and sanitation. Water is essential for the survival of diverse ecosystems, including freshwater habitats like lakes, rivers, and wetlands, as well as marine environments like oceans and estuaries. These ecosystems support a wide range of plant and animal species, contributing to biodiversity and ecological balance. The quality of water is essential for the sustainable success of any diversion plan. Poor-quality water can lead to costly consequences, requiring resources to be redirected toward repairing water delivery infrastructure whenever problems arise. There is a growing demand for improved water management and quality control to ensure safe drinking water at affordable rates. To tackle these challenges, it is imperative to conduct systematic evaluations of freshwater sources, disposal systems, and organizational monitoring protocols (Hu et al. 2019). Enhancing strategies for avoiding and regulating water pollution can benefit from predicting future trends in water quality across different pollution levels and devising practical methods for preventing and controlling water pollution.

Groundwater is a critical source of drinking water for communities around the world, providing support for agriculture, industry, and ecosystems, while also playing a pivotal role in maintaining streamflow and preserving wetlands (Kazakis et al. 2017). Forecasting the quality of groundwater is a crucial tool for water planning, control, and monitoring. It is a vital component of research aimed at understanding and enhancing water ecological protection, particularly in the context of water contamination. Consequently, accurate forecasting of changes in water quality not only ensures the security of drinking water for individuals but also aids in directing fishing productivity and preserving biodiversity (Wu & Wang 2022). Currently, researchers are predominantly focused on improving the feasibility and reliability of groundwater forecasting methods and have introduced various novel approaches, including artificial neural networks (ANNs) and fuzzy mathematics, among others, to enhance water quality forecasting techniques and broaden their scope of application (Lee & Lee 2018). Nevertheless, despite the degradation of groundwater conditions, people in numerous regions around the world rely on groundwater for drinking because no other reliable sources are available. As a significant element of the hydrological cycle and an aqueous reservoir, groundwater is under tremendous strain globally. Consequently, assessing the quality of drinking water is a paramount problem in contemporary times (Singh et al. 2021).

The Water Quality Index (WQI) is a single numerical indicator employed to denote comprehensive water quality, determined by a predefined group of specifications (Abu El-Magd et al. 2023; Al-Janabi & Al-Barmani 2023; Pandey et al. 2023); in other words, it is a tool employed to assess the condition of water for multiple purposes. It can be utilized to determine whether water is suitable for drinking, industrial use, aquatic life support, and other applications (Bhardwaj & Verma 2017). The WQI provides essential data for decision-makers (Bui et al. 2020). Among various approaches for computing the WQI, the weighted arithmetic index approach's popularity stems from its effectiveness and ease of use (Lukhabi et al. 2023; Machireddy 2023). The WQI assigns scores to different water quality parameters based on their significance to human health and environmental integrity. These scores are then combined to calculate an aggregated score that indicates the comprehensive water condition. The final value is often presented in categories that classify the water quality as excellent, good, poor, very poor, or not suitable for consumption. The WQI is extensively utilized in numerous studies to assess whether surface and groundwater quality is suitable for human consumption (Abbasnia et al. 2019; Banerji & Mitra 2019). In response to the challenges, researchers have adopted the Artificial Intelligence (AI) approach (Malek et al. 2021). Modeling using AI eliminates the need for sub-index calculations and generates WQI values rapidly. Furthermore, the AI approach offers the advantage of being insensitive to missing variables and capable of handling intricate mathematical operations involving vast datasets and non-linear structures (Bui et al. 2020). Hence, numerous scientists have placed particular emphasis on the utilization of AI-based techniques, such as machine learning (ML).

ML techniques show great promise as a versatile and effective approach across diverse scientific disciplines (Kim et al. 2019; Rozos 2019). Prior research has explored ML models including ANN, Support Vector Machine (SVM), k-Nearest Neighbor (KNN), Decision Tree, and Naive Bayes. Nevertheless, these traditional ML methods encounter issues such as overfitting and bias (Malek et al. 2021). As a result, progress in ML algorithms through the adoption of ensemble methods, such as bagging and boosting, aims to tackle these challenges (Sharafati et al. 2020). Ensemble models enhance prediction accuracy by aggregating the decisions of numerous base classifiers. Worldwide, investigators have utilized ML approaches such as XGBoost and Random Forest (RF) in diverse water-related research (Jeihouni et al. 2020; Lu & Ma 2020; Xu et al. 2020). Khullar & Singh (2022) introduced a deep learning Bidirectional Long Short-Term Memory (Bi-LSTM) model for predicting the water quality of the Yamuna River in India. Elbeltagi et al. (2022) employed different ML techniques, Random Subspace (RSS), M5P tree model (M5P), SVM, and Additive Regression (AR), to estimate the WQI using a variable elimination method; among these, AR outperformed the others. Nosair et al. (2022) introduced a predictive regression model integrating saltwater intrusion indicators and AI methods to monitor the salinization of groundwater in Egypt's Eastern Nile Delta aquifer caused by saltwater intrusion. Ahmed et al. (2022) utilized LSTM, MLP, and CNN to forecast water quality, and LSTM outperformed the others. Goodarzi et al. (2023) employed gene expression programming, Multivariate Adaptive Regression Splines (MARS), and M5P to investigate the quality of well water in Iran, and MARS outperformed M5P. Uddin et al. (2023) conducted a comparison of KNN, SVM, Gradient Boosting, Naïve Bayes, and RF approaches for WQC, and the Gradient Boosting model emerged as the most appropriate. Sharma et al. (2024) synthesized innovative nanocomposites (NCs) using reduced graphene oxide (rGO) and Al2O3-ZnO-TiO2 (AZT) nanoparticles (NPs) via a controlled hydrothermal process; characterized by X-ray diffraction, SEM, UV-visible, and FTIR techniques, the NCs exhibit enhanced magnetic, anticorrosive, antimicrobial, and photocatalytic properties and demonstrate potential for applications in water purification, biomedical devices, and environmental remediation. The widespread use of pesticides to protect vegetation has increased agricultural production but has also led to significant environmental pollution, necessitating the removal of these pollutants. Jangra et al. (2022) presented the synthesis of humic acid-coated magnetite NPs for the effective removal of the insecticide imidacloprid from aqueous solutions; the NPs were characterized by various analytical techniques and analyzed for adsorption efficiency under different conditions, confirming their potential as reusable nano-adsorbents. In an investigation of the synergistic effects of various antiscalant combinations, 1-hydroxyethane-1,1-diphosphonic acid (HEDP) and sodium hexametaphosphate (SHMP) were found to be the most effective corrosion inhibitors for carbon steel, achieving 98% corrosion inhibition efficiency at a 50/50 ppm combination, attributed to intermolecular hydrogen bonding and uniform multilayer adsorption on the steel surface (Moudgil et al. 2009). Rani et al. (2024) summarized various methods for synthesizing carbon quantum dots (CQDs) and their applications, highlighting the need for simpler, high-yield approaches and addressing the challenges in producing undecorated and doped CQDs. Khanna et al. (2024) investigated the efficiency of a pyrazole Schiff base as a corrosion inhibitor for mild steel in 1 M HCl using techniques such as EIS, WL, and PDP, demonstrating high inhibition efficiency and exploring the kinetic and thermodynamic properties of the inhibitor, with computational simulations supporting the experimental results. Petals of Bougainvillea glabra were investigated as a green corrosion inhibitor (GCI) for mild steel in 1.0 M H2SO4, demonstrating 93.13% efficiency, with thermodynamic, kinetic, and DFT studies and surface analysis supporting the experimental data and highlighting its environmental benefits over chemical inhibitors (Kumar et al. 2023). Omidvar et al. (2022) synthesized ionic N-vinyl caprolactam/maleic-based copolymers (P(VCap-co-MA)) as dual-purpose gas hydrate and corrosion inhibitors, demonstrating effective gas hydrate inhibition and significant corrosion protection in oilfield-produced water, thus enhancing flow assurance in oil and gas pipelines. Murraya koenigii Linn (curry leaves) was demonstrated as an eco-friendly acid corrosion inhibitor for mild steel in 5 M HCl, achieving up to 98.13% inhibition efficiency through physio-chemisorption, supported by experimental (weight loss, polarization, impedance spectroscopy) and theoretical (DFT, MDS) studies (Kumar & Yadav 2021). Tundwal et al. (2024) explored the use of carbon nanotubes (CNTs) and conducting polymer (CP) NCs based on polyaniline and polythiophene for environmental remediation, focusing on their synthesis, functionalization, modification, and adsorption behavior for removing pollutants such as organic dyes and heavy metal ions from water.

The current study investigated the effectiveness of ML approaches in predicting WQC. Hyperparameter tuning using grid search with CV = 5 is performed for the different ML models. The performance of the ML models is assessed using various metrics: Accuracy, Precision, Recall (Sensitivity), and F-β score. The layout of the paper is as follows: Section 2 introduces the study area. Section 3 presents the materials and methods. Section 4 presents the methodology used. Results and discussion are presented in Section 5. Finally, Section 6 provides the conclusions.

Study area

Maharashtra, India's third-largest state, covers a total geographical expanse of 3,07,762 km2. Positioned in the west-central region of India, it has the Arabian Sea as its shoreline and spans latitudes 15°45′ to 22°00′ N and longitudes 73°00′ to 80°59′ E. The Union Territory of Dadra and Nagar Haveli is located on the state's northwestern border and covers approximately 491 km2. Maharashtra State is divided administratively into 36 districts, which are arranged into six divisions: Amravati, Aurangabad, Konkan, Nagpur, Nashik, and Pune. Additionally, the state is segmented into five regions: Khandesh, Konkan, Marathwada, Vidarbha, and Western Maharashtra. About 80% of the study area is encompassed by Deccan basalts, and the remaining parts are covered by Archean formations (10.5%), Quaternary alluvium (4.7%), Precambrian formations (2%), and Gondwanas (1.6%). The major river basins include the Godavari, Krishna, Mahanadi, Narmada, Tapi, and coastal basins. Approximately 52% of the network stations are located within the Godavari basin, 17% each within the Krishna and Tapi basins, and the remaining 14% within coastal basins. Maharashtra primarily experiences rainfall during the southwest monsoon season, which lasts from June to September. In Nandurbar district, the average rainfall is as low as 661.6 mm, whereas in Ratnagiri district it may reach 3,394.9 mm. The phreatic aquifer responds quickly as the monsoon advances, with changes evident by August. However, the unpredictable rainfall pattern has diverse effects on the recharge of the groundwater regime during the monsoon season. Figure 1 shows the location map of the study area and sampling sites.
Figure 1

Location map of the study area.


Methodology

Water pollution stands as a paramount environmental challenge facing humanity, with its detrimental effects stemming primarily from inadequate prediction, timely vigilance, and emergency response capacities. Consequently, the establishment of a suitable monitoring and early-warning framework to support informed decision-making and effective water quality control emerges as a pivotal scientific and technical concern demanding prompt attention (Liao et al. 2020). Figure 2 shows the methodology used for WQC prediction.
  • 1. The target of the suggested approach is to create an ML model for WQC prediction utilizing a dataset comprising nine parameters: pH, total dissolved solids (TDS), total hardness (TH), calcium, magnesium, chloride, sulfate, nitrate, and fluoride.

  • 2. The WQI is computed and water quality is categorized as follows: excellent if the WQI is less than 50; good if the WQI is between 50 and 100; poor if the WQI is between 100.1 and 200; very poor if the WQI is between 200.1 and 300; and unsuitable for consumption if the WQI is greater than 300.

  • 3. The given dataset is normalized and partitioned into training and testing sets, and six ML models (RF, AdaBoost, Gradient Boosting, XGBoost, SVM, and KNN) are trained and evaluated. Equation (1) was used to normalize the WQ data to a range of 0–1 before the modeling phase in order to improve the ML models' training speed and forecast accuracy.
    $X_{\text{norm}} = \dfrac{X - X_{\min}}{X_{\max} - X_{\min}}$ (1)
  • Here, $X_{\text{norm}}$ and $X$ represent the normalized and original values of a water quality variable (such as pH, TDS, and TH) at a given station, while $X_{\min}$ and $X_{\max}$ denote the minimum and maximum values of that variable, respectively.

  • 4. While training the dataset, hyperparameter tuning using grid search with CV = 5 is done for different machine learning models used.

  • Hyperparameters are parameters whose values control the learning process. These adjustable parameters, also called external parameters, are used to obtain an optimal model. Hyperparameter tuning, also called hyperparameter optimization, refers to the process of choosing the optimum set of hyperparameters for an ML model. Hyperparameters such as the learning rate, the number of epochs or iterations, and the number of estimators are typical tuning targets. In grid search, a range of hyperparameter values is considered while evaluating the ML models: the method exhaustively searches a manually specified subset of the learning algorithm's hyperparameter space by brute force, building a model for every conceivable combination of the given hyperparameter values, evaluating each model, and choosing the configuration that yields the best results.

  • 5. ML model performance is assessed using various metrics (Accuracy, Precision, Recall or Sensitivity, and F-β score) for evaluating and reporting the performance of classification models, and the best model is selected based on model accuracy.
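Steps 3–5 above can be sketched in a few lines, assuming scikit-learn is available; the toy data and the parameter grid below are hypothetical placeholders, not the study's dataset or tuned grids.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((100, 9))        # stand-in for the nine water quality parameters
y = rng.integers(0, 2, 100)     # stand-in WQC labels (binary here for brevity)

# Step 3: min-max normalization to [0, 1] (Equation (1)) and data partitioning
X_norm = MinMaxScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_norm, y, test_size=0.3, random_state=0)

# Step 4: exhaustive grid search with 5-fold cross-validation (CV = 5)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}  # hypothetical grid
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X_tr, y_tr)

# Step 5: report the best configuration and its held-out accuracy
print(search.best_params_, search.score(X_te, y_te))
```

The same pattern applies to the other five models; only the estimator and its parameter grid change.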

Figure 2

Flowchart of the proposed methodology.


Data

The data utilized for the present study is taken from the Ground Water Year Book of Maharashtra and Union Territory of Dadra and Nagar Haveli for the year 2021–22 collected by the Government of India, Ministry of Jal Shakti, Department of Water Resources, RD & GR, Central Ground Water Board. The Indian government collected this data to ensure the potability of water. The data includes 242 stations of Maharashtra and Union Territory of Dadra and Nagar Haveli and nine parameters (pH, TDS, TH, calcium (Ca2+), magnesium (Mg2+), chloride (Cl), sulfate (SO42−), nitrate (NO3), fluoride (F)).

pH: It is a measure of the acidity or alkalinity of a solution, representing the concentration of hydrogen ions. It is a crucial parameter as it affects the solubility and availability of nutrients and chemicals in water influencing the ecosystem.

TDS: It refers to the total concentration of dissolved substances in water. These substances can include minerals, salts, metals, and other organic and inorganic compounds. It is an indicator of water purity and quality. TDS is usually measured in parts per million (ppm) or milligrams per liter (mg/L).

TH: It refers to the concentration of dissolved minerals. It is a measure of calcium and magnesium ions in water and is typically measured in units such as milligrams per liter (mg/L) or grains per gallon (gpg).

Calcium and magnesium: In water, calcium and magnesium exist in dissolved form as calcium ions (Ca2+) and magnesium ions (Mg2+), respectively, contributing to water hardness and playing a role in various chemical reactions and biological processes.

Chloride, sulfate, nitrate, fluoride: In water, chloride ions (Cl−), sulfate ions (SO42−), nitrate ions (NO3−), and fluoride ions (F−) are among the major dissolved ions and are typically measured as part of water quality assessments.

Salts containing ions such as Ca2+ and Mg2+ determine whether water is hard or soft. Water with a high concentration of Cl− ions is considered saline. Nitrate ions from human activities percolate into groundwater and can become the dominant major anion. Elevated fluoride levels in groundwater are associated with the concentration of Ca2+ ions present in the groundwater.

The groundwater quality standards recommended by the Bureau of Indian Standards (BIS) (IS 10500:2012) are shown in Table 1.

Table 1

Water quality standards by BIS (IS-10500-2012)

Parameter | Desirable limit | Maximum permissible limit (BIS IS 10500:2012)
pH | 6.5–8.5 | –
TDS (mg/L) | 500 | 2,000
TH (mg/L) | 300 | 600
Ca2+ (mg/L) | 75 | 200
Mg2+ (mg/L) | 30 | 100
Cl− (mg/L) | 250 | 1,000
NO3− (mg/L) | 45 | No relaxation
SO42− (mg/L) | 200 | 400
F− (mg/L) | 1.0 | 1.5

The WQI

The WQI is the most efficient method for expressing the acceptability of water resources for human consumption by summarizing all of the quality criteria of the water in a single component. The steps involved in calculating the WQI are as follows:

  • 1. Selection of parameters;

  • 2. Determination of weightage;

  • 3. Determination of sub-indices.

  • 4. Integration of sub-indices in a mathematical expression.

Selection of parameter

In the current investigation, the parameters pH, TDS, TH, calcium (Ca2+), magnesium (Mg2+), chloride (Cl−), sulfate (SO42−), nitrate (NO3−), and fluoride (F−) were considered in the calculation of the WQI.

Determination of weightage

Initially, a weight is allocated to every parameter, considering its significance for human health and appropriateness for drinking purposes. The relative weight is calculated using the following equation:
$W_i = \dfrac{w_i}{\sum_{i=1}^{n} w_i}$ (2)
where $w_i$ is the unit weightage.
The unit weightage is calculated using the following equation:
$w_i = \dfrac{k}{S_i}$ (3)
where $S_i$ is the recommended standard for the ith parameter and $k$ is the proportionality constant, calculated using the following equation:
$k = \dfrac{1}{\sum_{i=1}^{n} (1/S_i)}$ (4)

Determination of sub-indices

A sub-index is calculated using the following equation:
$Q_i = \dfrac{V_i}{S_i} \times 100$ (5)
where $V_i$ represents the monitored value of the ith parameter and $S_i$ represents the recommended standard for the ith parameter.

Integration of sub–indices in mathematical expression

The WQI is calculated using the following equation:
$\mathrm{WQI} = \sum_{i=1}^{n} W_i Q_i$ (6)
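The four steps of the weighted arithmetic method can be sketched as a short function; the two-parameter standards used in the example call are illustrative placeholders, not BIS values.

```python
# Weighted arithmetic WQI, a sketch of Equations (2)-(6).
# `standards` maps each parameter to its recommended standard S_i,
# `measured` to its monitored value V_i; values here are placeholders.

def compute_wqi(measured, standards):
    params = list(standards)
    k = 1.0 / sum(1.0 / standards[p] for p in params)            # Eq. (4)
    w = {p: k / standards[p] for p in params}                    # Eq. (3) unit weight
    total_w = sum(w.values())
    W = {p: w[p] / total_w for p in params}                      # Eq. (2) relative weight
    Q = {p: 100.0 * measured[p] / standards[p] for p in params}  # Eq. (5) sub-index
    return sum(W[p] * Q[p] for p in params)                      # Eq. (6)

wqi = compute_wqi({"A": 50.0, "B": 100.0}, {"A": 100.0, "B": 200.0})
print(round(wqi, 2))  # both sub-indices equal 50, so the WQI is 50.0
```

Note that by construction the unit weights already sum to 1 (since k is their common numerator), so the normalization in Equation (2) is a safeguard rather than a change of values.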

The WQI of the locations was assessed against the water grade categorization for consumption purposes based on the criteria established by Boateng et al. (2016) who classified water quality into five categories – excellent, good, poor, very poor and water not suitable for consumption based on WQI values as shown in Table 2.

Table 2

Quality of groundwater according to the WQI range

WQI range | Water type
<50 | Excellent water
50–100 | Good water
100.1–200 | Poor water
200.1–300 | Very poor water
>300 | Water not suitable for consumption
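The ranges of Table 2 translate directly into a small classifier function (a sketch; the boundary handling simply follows the table's ranges):

```python
# Map a WQI value to its water quality class per Table 2 (Boateng et al. 2016).
def classify_wqi(wqi):
    if wqi < 50:
        return "Excellent water"
    elif wqi <= 100:
        return "Good water"
    elif wqi <= 200:
        return "Poor water"
    elif wqi <= 300:
        return "Very poor water"
    return "Water not suitable for consumption"

print(classify_wqi(42))    # Excellent water
print(classify_wqi(150))   # Poor water
```

This mapping produces the WQC labels that the ML models are trained to predict.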

Random forest

Ensemble techniques are of two types: bagging and boosting. RF follows the bagging technique and is used to solve both regression and classification problems. Decision trees, and no other algorithm, are used as the base learners. From the training dataset, row sampling and feature sampling are used to provide a separate dataset to each independently trained base learner (decision tree). For classification problems, majority voting over the trees gives the output, whereas for regression the average of the trees' outputs is used. A decision tree splits the dataset down to the leaf nodes. Entropy and the Gini index are used to check whether a split in a decision tree is pure. A related measure, information gain, quantifies how well a selected feature separates the data and thereby guides feature selection in the tree. Entropy is calculated for every set S using the following equation:
$H(S) = -\sum_{j} p_j \log_2 p_j$ (7)
Here, $H(S)$ denotes the entropy and $p_j$ denotes the proportion of the set $S$ classified as belonging to class $j$.
The entropy of the root node for binary classification is defined using the following equation:
$H(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$ (8)

If H(S) = 0, then split will be pure and if H(S) = 1, then split will be impure.

Gini Impurity is determined using the following equation:
$GI = 1 - \sum_{j} p_j^2$ (9)
For binary classification, Gini Impurity is calculated using the following equation:
$GI = 1 - (p_{+}^2 + p_{-}^2)$ (10)
For a completely impure (50/50) binary split, GI attains its maximum value of 0.5, and for a pure split it attains its minimum value of 0, as shown in Figure 3.
Figure 3

Entropy and Gini Impurity graph.

Let $f_1, f_2, \ldots, f_n$ be n features; then Information Gain is defined using the following equation:
$IG(S, f) = H(S) - \sum_{v \in \mathrm{values}(f)} \dfrac{|S_v|}{|S|} H(S_v)$ (11)

RF offers a multitude of advantages. A single decision tree without any hyperparameter tuning is grown to full depth on the dataset, which causes a problem called 'overfitting'; RF mitigates this because it aggregates multiple deep decision trees trained on different samples of the data.
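Equations (7)–(10) can be checked numerically with a few lines of pure Python (a sketch, using a 50/50 binary split and a pure node as test cases):

```python
import math

# Entropy, Equation (7): H(S) = -sum_j p_j * log2(p_j)
def entropy(probs):
    s = sum(p * math.log2(p) for p in probs if p > 0)
    return -s if s != 0 else 0.0

# Gini Impurity, Equation (9): GI = 1 - sum_j p_j^2
def gini(probs):
    return 1.0 - sum(p * p for p in probs)

# A 50/50 binary split is maximally impure: H(S) = 1, GI = 0.5
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 1.0 0.5
# A pure node: H(S) = 0, GI = 0
print(entropy([1.0]), gini([1.0]))            # 0.0 0.0
```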

Adaptive boosting (AdaBoost)

AdaBoost follows the ensemble boosting technique, meaning all models are connected sequentially, with weights assigned to them, and trained sequentially. It includes the following steps.

Step 1: Decision tree stumps are created and the best stump is selected.

Step 2: A sample weight (w) is assigned to each sample and calculated using the following equation:
$w_i = \dfrac{1}{N}, \quad i = 1, 2, \ldots, N$ (12)
where N represents the total number of data points.
Step 3: The total error and the performance of the stump are calculated, where the Total Error (TE) is the sum of the sample weights of all misclassified data points and the Performance of Stump ($\alpha$) is defined using the following equation:
$\alpha = \dfrac{1}{2} \ln\left(\dfrac{1 - TE}{TE}\right)$ (13)
Step 4: The weights of correctly and incorrectly classified points are updated: the sample weight of incorrectly classified points is increased and that of correctly classified points is decreased. Updated weights are calculated using the following equations:
$w_{\text{new}} = w \cdot e^{\alpha}$ (for incorrectly classified points) (14)
Also,
$w_{\text{new}} = w \cdot e^{-\alpha}$ (for correctly classified points) (15)
Step 5: Normalized weights are computed and bins are assigned. The normalized weight is computed using the following equation:
$w_{\text{norm}} = \dfrac{w_{\text{new}}}{\sum w_{\text{new}}}$ (16)

Weights are normalized so that the updated records are passed on to train the next model; bin assignment is then performed, meaning the samples are grouped according to their normalized weights.

Step 6: Data points are selected for the next stump, and the iteration process is repeated until a low training error is achieved and the final result is predicted.
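Steps 2–5 can be verified numerically. The sketch below traces the weight arithmetic for one round on four samples with one misclassification (it is not a full AdaBoost implementation):

```python
import math

N = 4
weights = [1.0 / N] * N                      # Step 2: w = 1/N (Eq. (12))

TE = 0.25                                    # one of four points misclassified
alpha = 0.5 * math.log((1 - TE) / TE)        # Step 3: stump performance (Eq. (13))

# Step 4: raise the weight of the misclassified point, lower the rest (Eqs. (14)-(15))
misclassified = [True, False, False, False]
updated = [w * math.exp(alpha if m else -alpha) for w, m in zip(weights, misclassified)]

# Step 5: normalize so the weights again sum to 1 (Eq. (16))
total = sum(updated)
normalized = [w / total for w in updated]
print([round(w, 4) for w in normalized])     # the misclassified point now carries weight 0.5
```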

Gradient boosting

Gradient boosting follows the ensemble boosting technique and includes several steps. The first step is creating a base model. Then, residuals (also called errors or pseudo-residuals) are computed. A loss function is used to compute the errors, and it differs between regression and classification: for regression, mean square error and root mean square error are used, while for classification, log loss and hinge loss are used. The residuals are computed using Equations (17) and (18):
$F_0(x) = \bar{y}$ (17)
$r_i = y_i - F_0(x_i)$ (18)
where $F_0(x)$ is the base model prediction (for regression, the mean $\bar{y}$ of the observed outputs) and $r_i$ is the residual of the ith data point.
If the predictions still differ substantially from the observed output values, one more decision tree is added; this decision tree is created with the same inputs but with the residuals ($r_i$) as targets. The decision tree is trained on all the data points and predicts $\hat{r}_i$. The predicted output therefore equals the base model plus $\alpha \times$ (decision tree output), where $\alpha$ is the learning rate. In general, the model is represented by Equation (19):
$F(x) = F_0(x) + \sum_{m=1}^{M} \alpha_m h_m(x)$ (19)
where $F_0(x)$ is the base model and $h_m(x)$ is the output of the mth decision tree.

Over successive iterations, the residuals progressively decrease.
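A toy numeric sketch of this residual-fitting loop follows; for illustration, the "tree" at each step is assumed to reproduce the residuals exactly, so only the learning rate limits each update:

```python
# Gradient boosting on a tiny regression target, per Equations (17)-(19).
y = [1.0, 2.0, 3.0]
base = sum(y) / len(y)                       # Eq. (17): base model = mean of y
pred = [base] * len(y)
lr = 0.1                                     # learning rate (alpha)

for step in range(3):
    residuals = [yi - pi for yi, pi in zip(y, pred)]           # Eq. (18)
    pred = [pi + lr * ri for pi, ri in zip(pred, residuals)]   # Eq. (19) update
    print(step, [round(p, 3) for p in pred])
```

Each pass shrinks the residuals by a factor of (1 − lr), illustrating why they decrease over iterations.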

eXtreme Gradient Boosting

XGBoost follows the ensemble boosting technique. It is an advancement of gradient boosting and is used to solve both classification and regression problem statements. For a classification problem, it follows several steps. As in gradient boosting, in the first step of XGBoost a base model is constructed. The base model is a weak learner that can take any inputs and outputs a probability of 0.5, acting as a dummy base model. In the second step, residuals are calculated. In the next step, a decision tree is constructed with a root node using the features; whenever the tree is grown, binary splits are used, so each node has essentially two leaf branches. In the next step, the similarity weights of the root node and the branches of the split are calculated using the following equation:
$SW = \dfrac{\left(\sum r_i\right)^2}{\sum P_i(1 - P_i) + \lambda}$ (20)
where $\lambda$ is a hyperparameter and P is the output probability.
In the next step, the information gain is calculated using the following equation:
$\mathrm{Gain} = SW_{\text{left}} + SW_{\text{right}} - SW_{\text{root}}$ (21)
Information gain is used to select the specific node through which the split happens. Whether splitting continues is decided by post-pruning based on a cover value: if the information gain of a branch is less than the cover value, that branch is cut, and only branches whose information gain exceeds it are retained. The base model output with respect to probability is calculated using the following equation:
$\log(\text{odds}) = \log\left(\dfrac{P}{1 - P}\right)$ (22)
where P is the output probability.
The entire function for the XGBoost classification problem is represented by Equation (23):
$P = \sigma\left(\log(\text{odds}) + \sum_{m=1}^{M} \alpha_m h_m(x)\right)$ (23)
where the $\alpha_m$'s are learning rates lying between 0 and 1 and $\sigma$ is the sigmoid function.
In the case of the regression problem, the base model is constructed taking any input features, with the output being the average of all values. In the next steps, as in the classification problem, residuals are calculated and a tree with a root node is constructed. In the next step, the similarity weight is calculated using the following equation:
$SW = \dfrac{\left(\sum r_i\right)^2}{n + \lambda}$ (24)
where $\lambda$ is the hyperparameter and n is the number of residuals in the node.
The information gain is then calculated, and finally the XGBoost output is defined by Equation (25):
$\mathrm{Output} = F_0(x) + \sum_{m=1}^{M} \alpha_m h_m(x)$ (25)
where $F_0(x)$ is the base model and $h_m(x)$ is the output of the mth decision tree.
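The similarity-weight and gain arithmetic of Equations (20), (21), and (24) can be sketched directly; the residuals, probabilities, and λ value below are illustrative:

```python
# XGBoost similarity weight and split gain, per Equations (20), (21), (24).
def similarity_classification(residuals, probs, lam):
    # Eq. (20): (sum of residuals)^2 / (sum of P(1-P) + lambda)
    return sum(residuals) ** 2 / (sum(p * (1 - p) for p in probs) + lam)

def similarity_regression(residuals, lam):
    # Eq. (24): (sum of residuals)^2 / (n + lambda)
    return sum(residuals) ** 2 / (len(residuals) + lam)

def gain(sim_left, sim_right, sim_root):
    # Eq. (21): gain of a candidate split
    return sim_left + sim_right - sim_root

# Dummy base model outputs P = 0.5, so initial residuals are y - 0.5
root = similarity_classification([0.5, -0.5, 0.5], [0.5] * 3, lam=1.0)
print(round(root, 4))
```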

Support Vector Machine

The objective of the SVM algorithm is to establish the optimal line or decision boundary that can separate a given dataset into multiple classes, allowing easy categorization of new data points in the future. The optimal decision boundary is known as the hyperplane. SVM is used to solve both classification and regression problems under supervised learning. Two classes are easily separated by a hyperplane. SVM also creates two marginal lines, separated by some distance, so that the classification points are linearly separable. In addition to the hyperplane, two parallel planes expressed by dotted lines, called marginal planes, are created. The best hyperplane is selected as the one with the maximum marginal distance. The dimensionality of the hyperplane depends on the features present in the dataset. The points that pass through the marginal planes are called support vectors. SVM uses a technique called kernels, which maps low-dimensional data into a higher-dimensional space. The higher the marginal distance, the more generalized the model.

SVM is of two types: linear and non-linear. Linear SVM is used for linearly separable data, where the dataset can be classified into two classes using a single straight line. For datasets that are not linearly separable, non-linear SVM is employed. Figure 4 represents the SVM structure, showing hyperplane and support vectors.
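A minimal sketch with scikit-learn's SVC on a linearly separable toy set illustrates the hyperplane and support vectors (the data points are invented for illustration):

```python
from sklearn.svm import SVC

# Two linearly separable clusters
X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear").fit(X, y)

# The support vectors are the points lying on the marginal planes
print(clf.support_vectors_)
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))
```

Swapping `kernel="linear"` for `"rbf"` gives the non-linear variant discussed above.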

k-nearest neighbor

KNN is a non-parametric classification approach. Qualitative and quantitative characteristics are inputs to KNN. In essence, the output features are classifications of data represented by categorical values. KNN uses the majority among the nearest neighbors to assign a categorical value. While implementing this algorithm, some assumptions are made: it makes no assumption about the underlying data distribution, since KNN is a non-parametric approach. The parameter k, a hyperparameter, is selected based on the dataset. The algorithm requires a distance metric, such as the Euclidean or Manhattan distance, to define the proximity between two data points. The Euclidean distance is the Cartesian distance between two points situated in a hyperplane; in other words, it represents the length of the straight line connecting the two points under consideration, i.e., their total displacement. It is calculated using the following equation:
$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ (26)
Manhattan distance is used to calculate the overall distance covered by an object, instead of displacement. This measurement is derived by adding up the absolute variances between the coordinates of the points across multiple dimensions. It is calculated using the following equation:
d(p, q) = \sum_{i=1}^{n} |q_i - p_i|  (27)
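The two distance metrics of Equations (26) and (27) can be sketched as follows (NumPy assumed; the function names are illustrative):

```python
# Sketch of the two distance metrics used by KNN to define proximity.
import numpy as np

def euclidean(p, q):
    """Straight-line (Cartesian) distance between two points."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sqrt(np.sum((p - q) ** 2)))

def manhattan(p, q):
    """Sum of absolute coordinate differences (city-block distance)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(np.abs(p - q)))

print(euclidean([0, 0], [3, 4]))  # 5.0
print(manhattan([0, 0], [3, 4]))  # 7.0
```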

Performance metrics

After implementation, the ML models are tested to determine their effectiveness on the dataset. The performance metrics used for evaluating and reporting the performance of classification models are Accuracy, Precision, Recall (Sensitivity), and the F-β score.

Confusion matrix

It is a table used to describe the performance of a classification model. In binary classification it is a 2 × 2 matrix, in which the columns correspond to the predicted values and the rows to the actual values, and it consists of the following four fields:

True Positive: the model correctly predicted Yes.

True Negative: the model correctly predicted No.

False Positive: the model predicted Yes but the actual value is No. It is also known as a ‘Type I error’.

False Negative: the model predicted No but the actual value is Yes. It is also known as a ‘Type II error’.

Table 3 displays a confusion matrix.
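A minimal sketch of the four fields (scikit-learn assumed; the labels are illustrative, with 1 standing for Yes and 0 for No):

```python
# Illustrative confusion matrix for a binary labelling; in scikit-learn's
# convention the rows are actual classes and the columns predicted classes.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # actual values (1 = Yes, 0 = No)
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]   # model predictions

# ravel() flattens the 2 x 2 matrix into the four fields.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)  # 3 3 1 1
```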

Accuracy

Accuracy is calculated using the following equation:
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}  (28)
where TP is the True Positive, TN is the True Negative, FP is the False Positive, and FN is the False Negative.

Precision

Precision measures, out of the total predicted positive values, how many are actually positive. It is calculated using the following equation:
Precision = \frac{TP}{TP + FP}  (29)

Recall

Recall measures, out of the total actual positive values, how many are correctly predicted as positive. It is calculated using the following equation:
Recall = \frac{TP}{TP + FN}  (30)

F – β score

F – β score is calculated using the following equation:
F_{\beta} = \frac{(1 + \beta^2) \times Precision \times Recall}{\beta^2 \times Precision + Recall}  (31)

If β = 1, the F-β score is called the F1 Score (the harmonic mean of Precision and Recall).

If False Positives and False Negatives are equally important, then select β = 1.

If False Positives are more important than False Negatives, then choose a β value between 0 and 1.

If the impact of False Negatives is higher than that of False Positives, then choose a β value greater than 1.
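The four metrics can be computed directly from the confusion-matrix counts, following Equations (28)–(31); a minimal sketch (the function name is illustrative):

```python
# Accuracy, Precision, Recall and F-beta from confusion-matrix counts.
def metrics(tp, tn, fp, fn, beta=1.0):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # beta = 1 gives the F1 score (harmonic mean of precision and recall);
    # beta < 1 weights precision more, beta > 1 weights recall more.
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return accuracy, precision, recall, f_beta

print(metrics(tp=3, tn=3, fp=1, fn=1))  # (0.75, 0.75, 0.75, 0.75)
```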

The count, mean, standard deviation, minimum, maximum, and quartiles of the groundwater parameters were computed, and correlation analysis was performed, using Python. Table 4 displays the statistical summary for the parameters of the groundwater samples measured across 242 stations.
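A minimal sketch of this step (pandas assumed; the values below are illustrative, not the study's measurements):

```python
# describe() yields count, mean, std, min, quartiles and max; corr() yields
# the Pearson correlation matrix used for the correlation analysis.
import pandas as pd

df = pd.DataFrame({
    'pH':  [7.3, 7.6, 7.9, 6.4, 8.6],
    'TDS': [381.5, 561.5, 789.5, 120.0, 950.0],
    'TH':  [150.0, 300.0, 400.0, 60.0, 520.0],
})

summary = df.describe()
corr = df.corr()

print(summary.loc['count', 'pH'])   # 5.0
print(corr.loc['TDS', 'TH'] > 0)    # TDS rises with TH in this toy data
```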

Table 3

Confusion matrix

                     Predicted Positive    Predicted Negative
Actual Positive      TP                    FN
Actual Negative      FP                    TN
Table 4

Statistical analysis of groundwater parameters

        pH        TDS (mg/L)   TH (mg/L)    Ca (mg/L)   Mg (mg/L)   Cl (mg/L)   NO3 (mg/L)  SO4 (mg/L)  F (mg/L)
Count   242.0000  242.000000   242.000000   242.000000  242.000000  242.000000  242.000000  242.000000  242.000000
Mean    7.589008  601.280992   308.078512   66.876033   34.198347   114.809917  28.074380   40.966942   0.782975
Std     0.485180  307.407058   204.912505   53.737501   24.776384   91.770037   30.398778   50.260720   0.977179
Min     5.800000  69.000000    20.000000    4.000000    1.000000    7.000000    0.000000    0.000000    0.010000
25%     7.300000  381.500000   150.000000   28.000000   15.000000   43.000000   6.250000    11.000000   0.190000
50%     7.555000  561.500000   300.000000   56.000000   29.500000   90.500000   18.500000   25.500000   0.380000
75%     7.850000  789.500000   400.000000   86.000000   51.000000   170.000000  40.000000   54.000000   1.007500
Max     11.60000  1,987.00000  1,500.00000  397.000000  145.000000  496.000000  216.000000  480.000000  7.750000

pH: The pH of the groundwater samples varied from 5.80 to 11.60, with an average of 7.59; the range and average values indicate predominantly alkaline conditions. In its natural state, groundwater typically exhibits a pH range of 6.5 to 8.5, as shown in Table 1. Variations in groundwater pH below 6.5 or above 8.5 are influenced by the presence or absence of CO2 in the surrounding environment, as well as by the infiltration of strongly acidic or alkaline wastewater from human activities.

TDS: TDS (mg/L) varied from 69.00 to 1,987.00; the average TDS values at the sampling locations indicate that the groundwater is fresh in nature. The groundwater TDS levels were comfortably within the maximum permissible limit set by the BIS, specifically below 2,000 mg/L.

TH, Ca2+, and Mg2+: TH (mg/L) varied from 20.00 to 1,500.00. Ca2+ (mg/L) ranged from 4.00 to 397.00 and Mg2+ (mg/L) from 1.00 to 145.00. TH represents the combined concentrations of Ca2+ and Mg2+ expressed as calcium carbonate (CaCO3). In areas where TH exceeds the desirable limit, there is a higher concentration of Ca2+ and Mg2+.

Cl and SO42−: Cl (mg/L) ranged between 7.00 and 496.00. SO42− (mg/L) varies within the range of 0.00 to 480.00. Since Maharashtra State is predominantly covered by basalt, the occurrence of Cl and SO42− in groundwater is minimal. Most locations exhibit chloride and sulfate concentrations that fall within permissible limits set by the BIS, 250 and 200 mg/L, respectively. The concentrations of Cl and SO42− indicate that their presence has a limited impact on the potability of groundwater.

NO3: NO3 (mg/L) exhibits a range of values from 0.00 to 216.00. Nitrate rarely becomes a predominant ion in groundwater under natural geochemical conditions. However, the nitrate content in the state groundwater indicates that it has become a predominant ion. Domestic waste, wastewater and sewage in both urban and rural areas of the state contribute to the percolation of NO3 into the groundwater.

F: F (mg/L) exhibits a fluctuation within the range of 0.01 to 7.75. Fluoride in groundwater originates from naturally occurring fluoride bearing minerals in the region's geological formations.

According to the table, the mean values of pH, TDS, TH, calcium (Ca2+), magnesium (Mg2+), chloride (Cl), nitrate (NO3), sulfate (SO42−) and fluoride (F) were 7.589008, 601.280992, 308.078512, 66.876033, 34.198347, 114.809917, 28.074380, 40.966942 and 0.782975, respectively, and the corresponding standard deviations were 0.485180, 307.407058, 204.912505, 53.737501, 24.776384, 91.770037, 30.398778, 50.260720 and 0.977179.

pH: pH less than the desirable limit of 6.5 is observed at two locations in Mandhala, Jalgaon (5.8) and Tamnakawda, Kolhapur (6.4). pH more than the maximum permissible limit of 8.5 is observed at Dharna in Yavatmal district and four locations in Kolhapur district with the highest being at Akurde in Kolhapur.

TDS: TDS of groundwater from various locations of Maharashtra and the Union Territory of Dadra and Nagar Haveli is well within the maximum permissible limit as prescribed by the BIS.

TH: TH more than the maximum permissible limit of 600 mg/L is observed in 15 locations across the Nagpur, Dhule, Jalgaon, Kolhapur and Yavatmal districts. The highest TH of 1,500 mg/L is observed at Kasba Sangaon in Kolhapur.

Ca2+ and Mg2+: Ca2+ and Mg2+ are higher in areas where TH exceeds the desirable limit. Four locations in the Kolhapur and Yavatmal districts have Ca2+ above the maximum permissible limit of 200 mg/L, with the highest Ca2+ of 397 mg/L observed at Kasba Sangaon in Kolhapur. Two locations in the Dhule and Kolhapur districts have Mg2+ above the maximum permissible limit of 100 mg/L, with the highest Mg2+ of 145 mg/L observed at Bhadne in Dhule district.

Cl: No samples have Cl above the maximum permissible limit of 1,000 mg/L. The highest Cl concentration of 496 mg/L is observed at Hibarkhed, Jalgaon, a location found with a Cl concentration above the desirable limit prescribed by the BIS, i.e. 250 mg/L.

NO3: Anthropogenic contamination of NO3 is observed above the maximum permissible limit of 45 mg/L in 45 locations, including the Nagpur, Gondia, Jalgaon, Nashik, Sangli, and Yavatmal districts. The highest concentration of NO3, 216 mg/L, was observed at Antargaon, Yavatmal. The affected sampling locations reflect the percolation of nitrate contaminants into the groundwater.

SO42−: SO42− exceeds the maximum permissible limit of 400 mg/L at only one location, Kasba Sangaon, Kolhapur.

F: F exceeds the maximum permissible limit at 48 locations, including Karangi (Baramati) in Pune district and locations in the Bhandara, Chandrapur, Gondia, Jalgaon, Kolhapur, Sangli and Yavatmal districts.
Figure 4

SVM structure.

Figure 5 represents a bar chart for WQC based on the WQI range. Figure 6 shows the correlation analysis of the groundwater samples. TDS is positively correlated with TH, Ca2+ and Cl. TH is positively correlated with Ca2+ and Mg2+. pH is significantly negatively correlated with TH.
Figure 5

Bar chart of WQC based on the WQI.

Figure 6

Correlation analysis of groundwater parameters.


Hyperparameter tuning of ML models

The grid search strategy yielded optimal parameters for the ML models, which are shown in Supplementary Material Table S1. The table summarizes the tuning parameters investigated for every model and lists the precise parameter values that produced the best results after tuning. These optimal parameters are essential for maximizing each ML model's performance on its assigned task. For RF, the parameters under consideration for tuning are:

  • 1. N_estimators refers to the number of decision trees used in the ensemble and hence controls the complexity of the model. Tested values are [50, 100, 150, 200, 250] and the optimal value is ‘100’.

  • 2. Max_features determines the maximum number of features considered when splitting a node during the construction of each decision tree in the ensemble; it controls the size of the feature subset and influences the diversity of the trees. Tested values are [0.1, 0.2, 0.3, 0.6, 1.0] and the optimal value is ‘0.3’.

  • 3. Max_depth determines the maximum depth of each decision tree in the ensemble by limiting the number of nodes from the root to the deepest leaf. Tested values are [2, 8, None] and the optimal value is ‘8’.

  • 4. The criterion determines the function used to measure the quality of a split when building the decision trees in the ensemble. Tested values are [‘gini’, ‘entropy’] and the optimal value is ‘gini’.

For AdaBoost, the parameters under consideration for tuning are:

  • 1. N_estimators specifies the number of weak learners, typically decision trees, to be included in the ensemble. Tested values are [10, 50, 100, 200] and the optimal value is ‘50’.

  • 2. Learning_rate controls the contribution of each weak learner to the final prediction. Tested values are [1.5, 1.2, 1, 0.5, 0.1, 0.01, 0.001] and the optimal value is ‘0.1’.

  • 3. The algorithm specifies the boosting algorithm to use; AdaBoost can update the weights of the training samples with either ‘SAMME’ (discrete AdaBoost) or ‘SAMME.R’ (real AdaBoost). Tested values are [‘SAMME’, ‘SAMME.R’] and the optimal value is ‘SAMME’.

For Gradient Boosting, the parameters under consideration for tuning are:

  • 1. N_estimators specifies the number of weak learners, typically decision trees, to be included in the ensemble. Tested values are [50, 100, 150, 200, 250] and the optimal value is ‘150’.

  • 2. Max_features determines the maximum number of features considered when constructing each weak learner in the ensemble and controls the size of this feature subset. Tested values are [‘auto’, ‘sqrt’, ‘log2’] and the optimal value is ‘auto’.

  • 3. Max_depth specifies the maximum depth to which each individual weak learner is allowed to grow during training. Tested values are [1, 2, 3, 4, 5, 6, 7, 8, 9] and the optimal value is ‘3’.

  • 4. Subsample determines the fraction of training samples randomly drawn for fitting each weak learner in the ensemble. Tested values are [0.5, 0.75, 1] and the optimal value is ‘0.5’.

For XGBoost, parameters under consideration for tuning are:

  • 1. Learning_rate controls the contribution of each weak learner to the final prediction. Tested values are [0.1, 0.01, 0.001] and the optimal value is ‘0.1’.

  • 2. Max_depth specifies the maximum depth to which each individual weak learner is allowed to grow during training. Tested values are [3, 5, 7] and the optimal value is ‘5’.

  • 3. Subsample determines the fraction of training samples randomly drawn for fitting each weak learner in the ensemble. Tested values are [0.5, 0.7, 1] and the optimal value is ‘0.7’.

For SVM, parameters under consideration for tuning are:

  • 1. C is a regularization parameter that trades off margin width against training error. Tested values are [0.001, 0.01, 0.1, 10, 100] and the optimal value is ‘10’.

  • 2. Kernel refers to the type of kernel function used to transform the input data into a higher-dimensional space. Tested values are [‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’] and the optimal value is ‘linear’.

  • 3. Gamma determines the width of the Gaussian kernel. Tested values are [0.001, 0.01, 0.1, 10, 100] and the optimal value is ‘0.001’.

For KNN, parameters under consideration for tuning are:

  • 1. N_neighbors specifies the number of neighbors to consider when making predictions; in classification tasks, the class label of a new sample is determined by the majority class among its nearest neighbors. Tested values are [5, 7, 9, 11, 13, 15] and the optimal value is ‘7’.

  • 2. Weights determines the weight assigned to each neighbor when making predictions. It can be set to ‘uniform’, where all neighbors have equal weight, or ‘distance’, where closer neighbors have a higher weight. Tested values are [‘uniform’, ‘distance’] and the optimal value is ‘uniform’.

  • 3. Metric specifies the distance metric used to measure the similarity between samples. Tested values are [‘euclidean’, ‘manhattan’, ‘minkowski’] and the optimal value is ‘minkowski’.
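The tuning step itself can be sketched as follows (scikit-learn assumed; shown for the SVM grid listed above, with synthetic data standing in for the nine-parameter groundwater dataset; the same pattern applies to the other five models):

```python
# Exhaustive grid search with cross-validation: every combination of the
# candidate parameter values is fitted and scored, and the best is kept.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the 9-parameter groundwater feature matrix.
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

param_grid = {
    'C': [0.001, 0.01, 0.1, 10, 100],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': [0.001, 0.01, 0.1, 10, 100],
}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print(search.best_params_)            # the tuned parameter combination
print(round(search.best_score_, 3))   # its mean cross-validated accuracy
```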

Performance evaluation using grid search method

This section presents the outcomes of WQC prediction employing six ML models: RF, AdaBoost, Gradient Boosting, XGBoost, SVM and KNN. These ML models utilized a set of nine groundwater parameters: pH, TDS, TH, calcium (Ca2+), magnesium (Mg2+), chloride (Cl), sulfate (SO42−), nitrate (NO3) and fluoride (F). As presented in Table 5, the performance of the classification ML models, including the RF, AdaBoost, Gradient Boosting, XGBoost, SVM and KNN models, is showcased through the utilization of the grid search method for hyperparameter optimization.

Table 5

Performance evaluation of ML models

Model                                 Accuracy (%)  Recall (%)  Precision (%)  F1 score (%)
Random Forest (RF)                    92            93          87             90
Adaptive Boosting (AdaBoost)          86            89          79             82
Gradient Boosting (GB)                90            92          84             88
Extreme Gradient Boosting (XGBoost)   89            92          82             86
Support Vector Machine (SVM)          96            95          94             94
k-Nearest Neighbor (KNN)              77            80          73             75

SVM: The proposed SVM model outperforms the other classification models, showing superior results across all performance metrics. It achieves the highest accuracy of 96%, indicating that it correctly classifies 96% of the instances. The SVM model also demonstrates high Recall (95%), Precision (94%), and F1 score (94%), making it the most reliable model for WQC prediction.

RF: Following SVM, RF model also performs well with an Accuracy of 92%, Recall of 93%, Precision of 87%, and F1 score of 90%. This suggests that RF is a robust alternative for WQC prediction, although it slightly lags behind SVM.

Gradient Boosting: Gradient Boosting model exhibits a competitive performance with an accuracy of 90%, Recall of 92%, Precision of 84%, and F1 score of 88%. This model is effective but shows a slight decrease in precision compared to RF.

XGBoost: XGBoost model achieves an Accuracy of 89%, Recall of 92%, Precision of 82%, and F1 score of 86%. While XGBoost is efficient, its precision and F1 score are lower compared to Gradient Boosting.

AdaBoost: The AdaBoost model shows an Accuracy of 86%, Recall of 89%, Precision of 79%, and F1 score of 82%. Although AdaBoost is useful, its performance metrics are noticeably lower than those of SVM, RF, and Gradient Boosting.

KNN: KNN model has the lowest performance among the six models with an Accuracy of 77%, Recall of 80%, Precision of 73%, and F1 score of 75%. Despite its simplicity, KNN is less effective for WQC prediction in this context.

The proposed SVM model outperforms other highlighted classification models, showing superior results in terms of Accuracy, Recall, Precision and F1 score.
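Since WQC is a multi-class label, scores such as those in Table 5 are typically reported with class-averaged metrics; a minimal sketch (scikit-learn assumed; the labels are illustrative, e.g. 0 = Excellent, 1 = Good, 2 = Poor):

```python
# Weighted-average Precision, Recall and F1 across the WQC classes.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 0, 1, 1, 2, 2, 1, 0]   # actual WQC classes
y_pred = [0, 0, 1, 2, 2, 2, 1, 0]   # one misclassification (index 3)

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='weighted', zero_division=0)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```

With weighted averaging, per-class scores are combined in proportion to each class's support, which is why weighted Recall coincides with Accuracy.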

Figure 7 depicts the feature importance for the RF, AdaBoost, Gradient Boosting, XGBoost, SVM and KNN models, employing the grid search method. Figure 8 represents a comparison of accuracy among RF, AdaBoost, Gradient Boosting, XGBoost, SVM and KNN.
Figure 7

Feature importance for ML models.

Figure 8

Comparison of accuracy among ML models.


This study was carried out to assess the potential of ML models, specifically RF, AdaBoost, Gradient Boosting, XGBoost, SVM and KNN, for WQC prediction. As a case study, 242 stations of Maharashtra and the Union Territory of Dadra and Nagar Haveli and nine parameters (pH, TDS, TH, calcium (Ca2+), magnesium (Mg2+), chloride (Cl), sulfate (SO42−), nitrate (NO3), fluoride (F)) were considered. The performance metrics used for evaluating and reporting the classification models were Accuracy, Precision, Recall (Sensitivity) and the F-β score. The SVM model outperforms the other classification ML models, showing superior results in terms of Accuracy, Recall, Precision and F1 score: it attains 96% Accuracy, 95% Recall, 94% Precision and 94% F1 score, followed by RF with 92% Accuracy, 93% Recall, 87% Precision and 90% F1 score, Gradient Boosting with 90% Accuracy, 92% Recall, 84% Precision and 88% F1 score, XGBoost with 89% Accuracy, 92% Recall, 82% Precision and 86% F1 score, and AdaBoost with 86% Accuracy, 89% Recall, 79% Precision and 82% F1 score; KNN had the lowest performance in WQC prediction, with 77% Accuracy, 80% Recall, 73% Precision and 75% F1 score. Accurate predictions of WQC by ML models have significant implications for National Environmental Policy: they support better monitoring, regulation and management of groundwater resources. Policymakers can utilize these predictions to identify areas at risk of pollution and implement timely interventions, and regions identified with poor water quality can be prioritized for regulatory action, ensuring that resources are allocated efficiently to address the most critical issues. Using ML models to predict WQC provides a data-driven basis for decision-making, which enhances the development and enforcement of environmental regulations by providing empirical evidence to support policy decisions.
Strict guidelines and standards can be established for regions where groundwater quality is predicted to deteriorate, thereby preventing potential health risks associated with polluted water sources. Effective management of groundwater resources is crucial for sustainable development, and the insights gained from this research enable better planning and management of water resources. Policymakers can use the predictive models to develop sustainable extraction plans that minimize the impact on groundwater quality, ensuring that clean water remains available for future generations. By integrating ML predictions into policy frameworks, it is possible to enhance water quality management, allocate resources more efficiently, and develop adaptive strategies to address the impacts of human activities and climate change on groundwater. Through the integration of ML model predictions into National Environmental Policy, this research offers practical relevance and the potential to facilitate ongoing improvements in water quality, leading to more robust and dynamic environmental management. It provides the tools needed for continuous improvement of water quality standards and practices, and can thus help in the formulation of policies that not only address current water quality issues but also anticipate and mitigate future challenges.

Limitations and future research

Only 242 stations of Maharashtra and the Union Territory of Dadra and Nagar Haveli and nine parameters (pH, TDS, TH, calcium (Ca2+), magnesium (Mg2+), chloride (Cl), sulfate (SO42−), nitrate (NO3), fluoride (F)) were considered as a case study; other potentially relevant parameters were not included, which may affect the comprehensiveness of the WQI. Although the ML models were tuned for optimal performance, each model has inherent limitations: KNN showed lower performance, owing to its sensitivity to the scale of the data and to high-dimensional feature spaces, and ML models in general are sensitive to the quality and quantity of data, which influences their predictive accuracy. Future studies should consider larger and more diverse datasets to enhance the generalizability of the findings. Future analyses should also include a broader range of water quality parameters, such as heavy metals and organic pollutants, to provide a more comprehensive assessment of groundwater quality and to support the development of more detailed and accurate WQIs. Future research could focus on improving the performance of the XGBoost, AdaBoost and KNN models for WQC prediction, for example by training the models on balanced datasets, and on other ML algorithms such as ANNs and Deep Learning to improve prediction accuracy. Finally, studying the role of public awareness and community engagement in groundwater quality management would provide insights into effective strategies for public participation in environmental conservation efforts.

The authors are grateful to the ‘Guru Gobind Singh Indraprastha University’ for financial support and research facilities.

This research received no external funding.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Abbasnia A., Yousefi N., Mahvi A. H., Nabizadeh R., Radfard M., Yousefi M. & Alimohammadi M. (2019) Evaluation of groundwater quality using water quality index and its suitability for assessing water for drinking and irrigation purposes: Case study of Sistan and Baluchistan province (Iran), Hum. Ecol. Risk Assess., 25(4), 988–1005. https://doi.org/10.1080/10807039.2018.1458596.

Abu El-Magd S. A., Ismael I. S., El-Sabri M. A., Abdo M. S. & Farhat H. I. (2023) Integrated machine learning–based model and WQI for groundwater quality assessment: ML, geospatial, and hydro-index approaches, Environ. Sci. Pollut. Res., 30, 53862–53875.

Al-Janabi S. & Al-Barmani Z. (2023) Intelligent multi-level analytics of soft computing approach to predict water quality index (IM12CP-WQI), Soft Comput., 27, 7831–7861. https://doi.org/10.1007/s00500-023-07953-z.

Banerji S. & Mitra D. (2019) Geographical information system-based groundwater quality index assessment of northern part of Kolkata, India for drinking purpose, Geocarto Int., 34, 943–958. https://doi.org/10.1080/10106049.2018.1451922.

Bhardwaj D. & Verma N. (2017) Research paper on analysing impact of various parameters on water quality index, Int. J. Adv. Res. Comput. Sci., 8(5), 2496–2498.

Boateng T. K., Opoku F., Acquaah S. O. & Akoto O. (2016) Groundwater quality assessment using statistical approach and water quality index in Ejisu-Juaben Municipality, Ghana, Environmental Earth Sciences, 75, 489. https://doi.org/10.1007/s12665-015-5105-0.

Bui D. T., Khosravi K., Tiefenbacher J., Nguyen H. & Kazakis N. (2020) Improving prediction of water quality indices using novel hybrid machine-learning algorithms, Science of the Total Environment, 721, 137612.

Elbeltagi A., Pande C. B., Kouadri S. & Islam A. R. M. T. (2022) Applications of various data-driven models for the prediction of groundwater quality index in the Akot basin, Maharashtra, India, Environmental Science and Pollution Research, 29(12), 17591–17605.

Gartsiyanova K., Varbanov M., Kitev A. & Genchev S. (2021) Water quality analysis of the rivers Topolnitsa and Luda Yana, Bulgaria using different indices, J. Physics Conf. Ser., 1960, 012018.

Goodarzi M. R., Niknam A. R. R., Barzkar A., Niazkar M., Zare Mehrjerdi Y., Abedi M. J. & Heydari Pour M. (2023) Water quality index estimations using machine learning algorithms: A case study of Yazd-Ardakan Plain, Iran, Water, 15(10), 1876.

Jangra A., Kumar J., Singh D., Kumar H., Kumar P., Kumar S. & Kumar R. (2022) Proficient exclusion of pesticide using humic acid-modified magnetite nanoparticles from aqueous solution, Water Science & Technology, 86(11), 3028–3040.

Kazakis N., Mattas C., Pavlou A., Patrikaki O. & Voudouris K. (2017) Multivariate statistical analysis for the assessment under different hydrological regimes, Environmental Earth Sciences, 76(9), 349. https://doi.org/10.1007/s12665-017-6665-y.

Khanna R., Kalia V., Kumar R., Kumar R., Kumar P., Dahiya H., Pahuja P., Jhaa G. & Kumar H. (2024) Synergistic experimental and computational approaches for evaluating pyrazole Schiff bases as corrosion inhibitor for mild steel in acidic medium, Journal of Molecular Structure, 1297, 136845.

Khullar S. & Singh N. (2022) Water quality assessment of a river using deep learning Bi-LSTM methodology: Forecasting and validation, Environmental Science and Pollution Research, 29(9), 12875–12889.

Kim J., Han H., Johnson L. E., Lim S. & Cifelli R. (2019) Hybrid machine learning framework for hydrological assessment, Journal of Hydrology, 577, 123913.

Kumar H., Yadav P., Kumari R., Sharma R., Sharma S., Singh D., Dahiya H., Kumar P., Bhardwaj S. & Kaur P. (2023) Highly efficient green corrosion inhibitor for mild steel in sulfuric acid: Experimental and DFT approach, Colloids and Surfaces A: Physicochemical and Engineering Aspects, 675, 132039.

Lee S. & Lee D. (2018) Improved prediction of harmful algal blooms in four Major South Korea's Rivers using deep learning models, International Journal of Environmental Research and Public Health, 15(7), 1322.

Liao Z., Li Y., Xiong W., Wang X., Liu D., Zhang Y. & Li C. (2020) An in-depth assessment of water resource responses to regional development policies using hydrological variation analysis and system dynamics modeling, Sustainability, 12, 5814. https://doi.org/10.3390/su12145814.

Lukhabi D. K., Mensah P. K., Asare N. K., Pulumuka-Kamanga T. & Ouma K. O. (2023) Adapted water quality indices: limitations and potential for water quality monitoring in Africa, Water, 15, 1736.

Machireddy S. R. (2023) Assessment and distribution of groundwater quality using water quality index and geospatial technology in Vempalli Mandal of Andhra Pradesh, India, Water Resour. Manag., 9, 51.

Malek N. H. A., Yaacob W. F. W., Nasir S. A. M. & Shaadan N. (2021) The effect of chemical parameters on water quality index in machine learning studies: A meta-analysis, Journal of Physics: Conference Series, 2084(1), 012007. IOP Publishing.

Moudgil H. K., Yadav S., Chaudhary R. S. & Kumar D. (2009) Synergistic effect of some antiscalants as corrosion inhibitor for industrial cooling water system, Journal of Applied Electrochemistry, 39(8), 1339–1347.

Nosair A. M., Shams M. Y., AbouElmagd L. M., Hassanein A. E., Fryar A. E. & Abu Salem H. S. (2022) Predictive model for progressive salinization in a coastal aquifer using artificial intelligence and hydrogeochemical techniques: A case study of the Nile Delta aquifer, Egypt, Environmental Science and Pollution Research, 29(6), 9318–9340.

Omidvar M., Cheng L., Farhadian A., Berisha A., Rahimi A., Ning F., Kumar H., Peyvandi K. & Nabid M. R. (2022) Development of highly efficient dual-purpose gas hydrate and corrosion inhibitors for flow assurance application: An experimental and computational study, Energy & Fuels, 37(2), 1006–1021.

Rani G., Siddharth, Ahlawat R. & Kumar H. (2024) A comprehensive review on the synthesis, doping, and characterization techniques of carbon quantum dots for their multifaceted applications, Comments on Inorganic Chemistry, 1–38.

Sharafati A., Asadollah S. B. H. S. & Hosseinzadeh M. (2020) The potential of new ensemble machine learning models for effluent quality parameters prediction and related uncertainty, Process Safety and Environmental Protection, 140, 68–78.

Sharma R., Kumar H., Kumari R., Kumar G., Dhayal A., Yadav A., Yadav D., Yadav A., Saini C., Saloni, Kumar A. & Pandit V. (2024) Innovative Al2O3-ZnO-TiO2@rGO nanocomposites: A versatile approach for advanced water purification, biomedical devices, and environmental remediation, Diamond and Related Materials, 145, 111081.

Singh S., Pasupuleti S., Singha S., Singh R. & Kumar S. (2021) Prediction of groundwater quality using efficient machine learning technique, Chemosphere, 276, 130265. https://doi.org/10.1016/j.chemosphere.2021.130265.

Tundwal A., Kumar H., Binoj B. J., Sharma R., Kumari R., Yadav A., Kumar G., Yadav A., Singh D., Mangla B. & Kumar P. (2024) Conducting polymers and carbon nanotubes in the field of environmental remediation: Sustainable developments, Coordination Chemistry Reviews, 500, 215533.

Uddin M. G., Nash S., Rahman A. & Olbert A. I. (2023) Performance analysis of the water quality index model for predicting water state using machine learning techniques, Process Safety and Environmental Protection, 169, 808–828.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).

Supplementary data