ABSTRACT
India has been dealing with fluoride contamination of groundwater for the past few decades. Long-term exposure to fluoride can cause skeletal and dental fluorosis. Therefore, an in-depth exploration of fluoride concentrations in different parts of India is desirable. This work employs machine learning algorithms to analyze the fluoride concentrations in five major affected Indian states (Andhra Pradesh, Rajasthan, Tamil Nadu, Telangana and West Bengal). A correlation matrix was used to identify appropriate predictor variables for fluoride prediction. The algorithms used for prediction were K-nearest neighbor (KNN), logistic regression (LR), random forest (RF), support vector classifier (SVC), Gaussian naïve Bayes (NB), multi-layer perceptron (MLP) classifier, decision tree classifier, gradient boosting classifier, voting classifier soft and voting classifier hard. The performance of these models was assessed using accuracy, precision, recall, error rate and the receiver operating characteristic curve. As the dataset was skewed, the performance of the models was evaluated before and after resampling. Analysis of the results indicates that the RF model is the best model for predicting fluoride contamination in groundwater in the Indian states studied.
HIGHLIGHTS
Prediction of fluoride in groundwater using supervised machine learning algorithms is explored in Indian states.
This study focuses on only naturally occurring fluoride, i.e., geogenic sources of fluoride contamination in groundwater.
pH, EC, bicarbonate, chloride, total alkalinity, sodium and sulfate are major factors influencing the occurrence of fluoride in groundwater.
The random forest model is the best model in predicting fluoride contamination in groundwater.
INTRODUCTION
Most nations, including Germany, the United States, China and India, rely primarily on groundwater for household, agricultural and industrial needs (Houéménou et al. 2020; Khosravi et al. 2020; Sutradhar & Mondal 2021). In India, declining rainfall and surface water availability have increased the reliance on groundwater (Li et al. 2017). According to the Central Ground Water Board (CGWB) 2010 report, 19 states across India have severe fluoride contamination (Rodell et al. 2009; World Bank 2010). Because of rapid population growth, industrial development and urbanization, the chemical composition of water has changed considerably in the recent past (Bhagure & Mirgane 2011; Singh & Kumar 2017). As a result, contaminated groundwater poses a threat to human health as well as to the nation's economic growth. Groundwater contains various chemicals such as chromium, lead, arsenic and silica, and inorganic ions such as fluoride, nitrate and chloride, which can be harmful to health (Brindha & Elango 2011; Vithanage & Bhattacharya 2015). Of these, fluoride is of major concern in India.
Granitic, metamorphic and igneous rocks occur naturally in the earth's crust. Weathering of these rocks releases fluoride into groundwater, which is a geogenic source of fluoride contamination. Other natural sources include volcanic eruptions, which release large amounts of volcanic ash and volcanic rock loaded with fluorine (Araya et al. 1993). Geothermal water, which is alkaline in nature, promotes the uptake of fluoride from fluoride-bearing minerals in rocks (Saxena & Ahmed 2001). Through atmospheric deposition, fluoride present in the air in particulate or gaseous form reaches the earth's surface with rainfall and then enters the groundwater (Gupta et al. 2005). Anthropogenic sources, including industrial debris, phosphatic fertilizers, dumping grounds, brick manufacturing, ceramic industries, aluminum smelting and coal-fired power stations, are also responsible for fluoride contamination in groundwater (Singh et al. 2008; Rawat et al. 2010). Of all these sources, geogenic contamination caused by the weathering of rocks in the earth's crust is the primary cause of fluoride contamination around the globe (Kumar et al. 2016).
Fluoride occurs in combination with other elements in minerals such as fluorite and fluorapatite, and in soil and water (Ghosh et al. 2013; Banerjee 2015). The earth's crust contains fluoride-rich minerals such as mica, topaz, fluorite and apatite (Handa 1975). Upon weathering, these minerals release inorganic fluoride into groundwater (Tavener & Clark 2006; Vithanage & Bhattacharya 2015). The hydroxide ion (OH−) is similar to the fluoride ion (F−) in size and charge, and can therefore replace F− in chemical reactions (Saxena & Ahmed 2001; Chae et al. 2007). The processes responsible for fluoride contamination of groundwater are precipitation, hydrolysis, adsorption, dissolution, biochemical reaction and ion exchange (Saxena & Ahmed 2003). Fluoride is present in granitic rocks (hornblende, muscovite), igneous rocks, metamorphic rocks and sedimentary deposits, which release fluoride on weathering (Edmunds & Smedley 2005; Ozsvath 2009; Vithanage & Bhattacharya 2015). The quantity of fluoride depends on the composition of the rocks. For instance, the fluoride content in ultramafic rocks is 100 mg/kg, whereas alkaline rocks contain 1,000 mg/kg and marine shales contain 1,300 mg/kg (Hem 1985; Faure 1991; Ozsvath 2009).
Nearly 200 million people consume water with fluoride above the limit prescribed by the World Health Organization (WHO) (World Health Organization (WHO) 2006). Of these, 70 million are from India alone (UNICEF 1999). The permissible amount of fluoride prescribed by WHO lies between 0.5 and 1.0 mg/L, and is 1.5 mg/L in adverse conditions. The guidelines and standards for fluoride in drinking water are presented in Table 1. Freshwater contains 0.01–3 mg/L of fluoride, whereas groundwater contains 1–35 mg/L. Although fluoride helps strengthen tooth enamel and preserve bone health in the human body (Ghosh et al. 2013), excessive consumption of fluoride leads to chronic fluorosis, which affects not only the teeth (World Health Organization (WHO) 1994) and bones but also has notable effects on the cardiovascular, respiratory, gastrointestinal and immune systems (Salve et al. 2008; Chouhan & Flora 2010).
Country/body | Value (mg/L) | Reference |
---|---|---|
World Health Organization (WHO) | 1.5 (Guideline value) | WHO (2011) |
Australia | 1.5 (Permissible limit) | NHMRC & NRMMC (2011) |
Bureau of Indian Standards (BIS) | 1 (Acceptable limit); 1.5 (Permissible limit) | BIS (2012) |
Canada | 1.5 (Permissible limit) | Health Canada (2010) |
European Union | 1.5 (Permissible limit) | DECLG (2014) |
Ireland | 1.5 (Permissible limit) | NEIA (2018) |
Japan | 0.8 (Standard value) | MHLW (2010) |
New Zealand | 1.5 (Permissible limit) | MH (2008) |
Malaysia | 1.5 (Permissible limit) | ESD (2004) |
Singapore | 0.7 (Max. prescribed quantity) | NEA Singapore (2008) |
South Korea | 1.5 (Permissible limit) | ECOREA (2013) |
United States Environmental Protection Agency (USEPA) | 4 (Max. contaminant level); 2 (Secondary max. contaminant level) | USEPA (2011) |
Several studies have been done in India to predict fluoride contamination using traditional chemical analysis approaches like high fluoride in Tamil Nadu, India (Chicas et al. 2022), geochemical analysis and health risk associated with fluoride contamination in Guntur, Andhra Pradesh, India (Rao Subba et al. 2020), fluoride contamination in groundwater resources of Alleppey, South India (Raj & Shaji 2016), fluoride pollution in Nalgonda, India (Adimalla et al. 2019), fluoride assessment in different parts of Telangana (Narsimha & Sudarshan 2017; Narsimha & Rajitha 2018), fluoride distribution in different parts of Haryana, India (Yadav et al. 2019), fluoride contamination in West Bengal, India (Batabyal & Gupta 2017; De et al. 2022) and fluoride contamination in Sonbhadra, Uttar Pradesh (Raju et al. 2009). Another study of the Kolar and Tumkur districts of Karnataka, India revealed that evaporation and rock-weathering are responsible for fluoride contamination in groundwater (Mamatha & Rao 2010).
These studies assess groundwater through chemical analysis, which is time-consuming and costly because it requires digging wells in different areas to collect samples, evaluating the collected water samples and managing the data. The cost of equipment, labor and chemicals used for evaluation makes the process even more expensive (Tiyasha Tung & Yaseen 2020). According to the literature, the cost of traditional chemical analysis constrains water quality assessment to some extent (Ongley 2000). In this digital era, artificial intelligence techniques are seen as potential solutions to real-world problems.
Machine learning is an artificial intelligence technique that learns useful patterns from data to make predictions. These techniques can capture non-linear and intricate associations between input and output data. Over the years, scientists have collected large amounts of data to anticipate water contamination. The present study uses such data to make predictions with machine learning techniques in less time. Machine learning models have been used to estimate groundwater contamination in different parts of the world. For example, the artificial neural network (ANN) can tolerate numerous inaccuracies within a dataset and can uncover non-linear associations between predictor and response variables, whereas the random forest (RF) model handles binary, continuous, missing-value and high-dimensional data effectively. Furthermore, logistic regression (LR) is an efficient method for binary classification that trains quickly. Table 2 presents a summary of studies that employed machine learning techniques for fluoride prediction. The studies in Table 2 indicate that machine learning algorithms can predict the occurrence of fluoride accurately and in less time.
State/country | Algorithm used | Evaluation parameters | Limitation | Reference |
---|---|---|---|---|
Ghana, West Africa | Random forest algorithm | Sensitivity, specificity, precision, balanced accuracy | Only random forest algorithm used | Araya et al. (2022) |
Datong Basin, China | Random forest, logistic regression, artificial neural network | Accuracy, sensitivity, specificity, error rate | Other models can be used | Nafouanti et al. (2021) |
Qiantao and Houtao plain, Northwestern China | Random forest algorithm | Sensitivity, specificity, positive predictive value, negative predictive value | Only random forest algorithm is used | Xiangcao et al. (2024) |
Whole China | Artificial neural network | Sensitivity, specificity, AUC curve | Used only artificial neural network | Hailong et al. (2022) |
Chhattisgarh, India | Random forest, extreme gradient boosting (XGBoost), artificial neural network | Mean square error, mean absolute error, root mean squared error, mean absolute percentage error, coefficient of determination | Single monsoon dataset was collected | Singha et al. (2021) |
Mamundiyar Basin, India | Artificial neural network | Mean square error | Only single algorithm is used | Dar et al. (2012) |
Bankura, Purulia, Paschim Medinipur, West Bengal India | Random forest model | Accuracy, sensitivity, specificity, ROC (AUC) curve | Random forest is used | Aind et al. (2022) |
India | Random forest model, multivariate logistic regression | Sensitivity, specificity, ROC (AUC) curve | Only two algorithms are used | Podgorski et al. (2018) |
Khaf, Iran | Artificial neural network | Root mean squared error | Only artificial neural network is used | Mohammadi et al. (2016a) |
Pakistan | Random forest model | Mean decrease accuracy, mean decrease Gini impurity | Small dataset and only random forest is used | Yuya et al. (2022) |
Maku, Turkey | Extreme learning machine, multi-layer perceptron, support vector machine | Coefficient of determination, root mean squared error, mean absolute bias error, Nash–Sutcliffe efficiency coefficient | Only 143 water samples were used | Barzegar et al. (2017) |
Southeast Anatolia Region, Turkey | LR-KNN-ANN (hybrid) and SVM | Correlation coefficient values of the machine learning models | Only 252 samples were used | Ataş et al. (2021) |
Western United States | Random forest algorithm | R-squared and root mean squared error | Only random forest algorithm used | Celia et al. (2022) |
All over World | Random forest algorithm | Sensitivity, specificity, balanced accuracy, AUC curve | Only random forest algorithm used | Podgorski & Berg (2022) |
The following research gaps were identified during the literature survey:
(1) Firstly, traditional chemical analysis of groundwater is unavoidable but time-consuming. Repeated chemical analysis incurs high costs, as digging wells involves labor, equipment and other expenses. Given the abundance of existing chemical analysis data and the desire to mitigate the expenses associated with repetitive chemical testing, machine learning offers a promising avenue for exploration. Machine learning techniques are highly efficient in predicting fluoride in groundwater in a short span of time (Table 2). The models are trained on the data generated by the chemical analysis of water, which helps in predicting fluoride in groundwater.
(2) Secondly, the major source of fluoride contamination is geogenic, arising from the weathering of rocks in the earth's crust. However, only a limited number of studies have analyzed geogenic sources of fluoride in groundwater on their own (Kumar et al. 2016).
(3) Furthermore, some of the most heavily fluoride-affected regions in India, including Andhra Pradesh, Rajasthan, Tamil Nadu, Telangana and West Bengal, have yet to be examined using machine learning methodologies. This presents an opportunity to apply advanced analytical techniques to better understand fluoride contamination in these areas, uncover new insights and develop targeted mitigation strategies for improved public health outcomes.
The study focuses on predicting fluoride contamination due to geogenic sources in groundwater in India using supervised machine learning algorithms. It also evaluates and compares different supervised machine learning algorithms and analyzes the fluoride concentrations in five major affected states of India: Andhra Pradesh, Rajasthan, Tamil Nadu, Telangana and West Bengal.
MATERIALS AND METHODS
After preprocessing, the next step is to develop the model using supervised machine learning algorithms. The data were split in an 80:20 ratio for training and testing the model. The training set is used to fit the model using various classification algorithms, whereas the test set is used to assess the performance of the model using evaluation parameters such as accuracy, precision, recall and error rate (Figure 2). The detailed steps are as follows.
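The 80:20 split described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `split_80_20`, the fixed seed and the toy data are assumptions made for the example and are not taken from the study.

```python
import random


def split_80_20(rows, labels, test_fraction=0.2, seed=42):
    """Shuffle the sample indices once, then hold out `test_fraction` for testing.

    Hypothetical helper for illustration; the study does not publish its code.
    """
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Toy data: 10 samples with two made-up features each and a binary label.
toy_rows = [[float(i), float(i) * 2.0] for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
x_train, x_test, y_train, y_test = split_80_20(toy_rows, toy_labels)
```

In practice, a library routine such as scikit-learn's `train_test_split` serves the same purpose. For a skewed dataset, a stratified split that preserves the class ratio in both subsets may be preferable.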
Geological and hydrological settings of Indian states
The subsurface rock formations of Andhra Pradesh, Rajasthan, Tamil Nadu, Telangana and West Bengal consist mainly of granites and gneisses, limestones, clays, sandstones and coal. Table 3 lists the range of fluoride in different types of rocks.
Type of rock | Range of fluoride (ppm) |
---|---|
Basalt | 20–1,060 |
Granites and gneisses | 20–2,700 |
Shales and clays | 10–7,600 |
Limestones | 0–1,200 |
Sandstones | 10–880 |
Coal (ash) | 40–480 |
Data collection and extraction
Data collection and extraction is the first step in the methodology of this study (Figure 2). The data were collected from the groundwater quality reports on the CGWB website (https://cgwb.gov.in/index.html) (groundwater quality data, 2010–2018). The data cover groundwater analyses from 2010 to 2018 for all the states of India. In total, the dataset contains 85,197 rows. The water samples were taken from dug wells and tube wells. Of the 85,197 tuples, the five states account for the following: Andhra Pradesh (4,618), Rajasthan (5,426), Tamil Nadu (4,026), Telangana (2,917) and West Bengal (4,032). Together, the five states contribute 21,019 rows.
Data preprocessing
The data contain 15 input variables: pH, electrical conductivity (EC), total hardness (TH), total alkalinity (TA), calcium (Ca), magnesium (Mg), sodium (Na), potassium (K), iron (Fe), carbonate (CO3), bicarbonate (HCO3−), chloride (Cl−), sulfate (SO42−), nitrate (NO3−) and fluoride (F−). The data were noisy, with missing values, irrelevant observations, outliers and inconsistent formats. In this study, the data were cleaned by removing the irrelevant observations and the records in inconsistent formats, and by filling the missing values with the mean. After cleaning, 8,047 tuples remained.
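The mean-filling step mentioned above can be illustrated with a short sketch. It assumes that missing values are represented as `None` and that each column is imputed with the mean of its own observed values; the function name and the toy rows are hypothetical, and the study's actual cleaning code is not available.

```python
from statistics import mean


def impute_with_column_mean(rows):
    """Replace None entries in each column with the mean of that column's observed values."""
    n_cols = len(rows[0])
    col_means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        # A column with no observed values cannot be imputed and stays None.
        col_means.append(mean(observed) if observed else None)
    return [
        [col_means[c] if r[c] is None else r[c] for c in range(n_cols)]
        for r in rows
    ]


# Toy rows: [pH, EC]; the values are made up for illustration.
raw_rows = [[7.0, 100.0], [None, 300.0], [8.0, None]]
clean_rows = impute_with_column_mean(raw_rows)
```

One design consideration: computing the column means on the training portion only, and reusing them for the test portion, avoids leaking test information into the model.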
Choice of appropriate input
The performance of machine learning models suffers when informative features are discarded or when unrelated inputs are retained (Gheyas & Smith 2010). To determine the association between the input variables, a correlation matrix is generated.
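A correlation matrix of this kind can be sketched as below, assuming the Pearson coefficient is used (as the later section on Pearson correlation suggests). The function names and the toy columns are illustrative only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict of pairwise Pearson correlations for named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy columns with made-up values; real inputs would be the measured parameters.
toy_columns = {
    "pH": [7.1, 7.4, 7.9, 8.2],
    "HCO3": [210.0, 260.0, 390.0, 450.0],
    "F": [0.4, 0.6, 1.1, 1.6],
}
matrix = correlation_matrix(toy_columns)
```

With pandas, the same matrix is available through `DataFrame.corr()`, whose default method is Pearson.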
Evaluation parameters
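The models are evaluated using accuracy, precision, recall and error rate, which in their standard form are defined as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP},$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Error rate} = \frac{FP + FN}{TP + TN + FP + FN},$$

where a true positive (TP) is a case in which the model correctly forecasts fluoride; a true negative (TN) is a case in which the model correctly forecasts non-fluoride; a false positive (FP) is a case in which the model incorrectly forecasts fluoride (the outcome is forecast as fluoride but is actually not); and a false negative (FN) is a case in which the model incorrectly forecasts non-fluoride while it is actually fluoride.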
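As a worked illustration, the four evaluation parameters can be computed from the confusion-matrix counts using their standard definitions (accuracy is the share of correct predictions, precision is TP/(TP + FP), recall is TP/(TP + FN), and error rate is one minus accuracy). The function name and the counts below are made up for the example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and error rate as fractions in [0, 1]."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    error_rate = (fp + fn) / total
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "error_rate": error_rate,
    }


# Hypothetical counts, not results from the study.
metrics = classification_metrics(tp=40, tn=45, fp=10, fn=5)
```

Multiplying each value by 100 gives the percentages in which such results are usually tabulated. Accuracy and error rate always sum to one.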
Supervised machine learning algorithms used for analysis
Logistic regression
K-nearest neighbor
Support vector machine
Support vector machines (SVMs) are another supervised machine learning approach, used for both classification and regression. Although SVMs were originally meant for classification, they have since been generalized to regression problems. SVMs efficiently classify linearly separable data, and non-linearly separable data using kernel functions such as sigmoid, radial or polynomial (Singh et al. 2022). They are advantageous when the data are separable and the number of dimensions is large relative to the number of samples. SVMs are memory efficient, but training is time-consuming, which makes them impractical for large datasets.
Naïve Bayes
Naïve Bayes (NB) follows Bayes' theorem and predicts accurately on non-linear interdependence among predictors and response even when the sample size is small. NB is considered appropriate for unconditional variables. It needs only a limited amount of training data to predict promptly on test data. In addition, NB strongly assumes that the predictors are uncorrelated and independent of each other. NB is described as:
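Let $P(a \mid b)$ be the posterior probability of a class $a$ given predictor $b$, $P(a)$ the prior probability of the class, $P(b \mid a)$ the probability of the predictor given the class, and $P(b)$ the prior probability of the predictor. Then,

$$P(a \mid b) = \frac{P(b \mid a)\, P(a)}{P(b)}$$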
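A small numerical sketch may help here. It applies Bayes' theorem with the independence assumption stated above, so that the likelihood of several predictors given a class is the product of the individual likelihoods. The class names, priors and likelihood values are invented for illustration and do not come from the study.

```python
def naive_bayes_posterior(priors, likelihoods):
    """Return P(class | predictors) for each class.

    priors:      {class_name: P(a)}
    likelihoods: {class_name: [P(b_1 | a), P(b_2 | a), ...]}
    Predictors are assumed conditionally independent given the class.
    """
    joint = {}
    for cls, prior in priors.items():
        p = prior
        for lik in likelihoods[cls]:
            p *= lik  # independence: multiply the per-predictor likelihoods
        joint[cls] = p
    evidence = sum(joint.values())  # P(b), summed over all classes
    return {cls: p / evidence for cls, p in joint.items()}


# Hypothetical two-class example with two predictors.
example_priors = {"high_fluoride": 0.3, "low_fluoride": 0.7}
example_likelihoods = {
    "high_fluoride": [0.8, 0.5],
    "low_fluoride": [0.2, 0.5],
}
posterior = naive_bayes_posterior(example_priors, example_likelihoods)
```

In this toy case the observed predictors are four times more likely under the first class on one feature, which is enough to outweigh its smaller prior.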
Multi-layer perceptron classifier
Decision tree classifier
The decision tree is a multivariate machine learning technique suitable for classification as well as regression. Decision trees are intuitive (Kucheryavskiy 2018) and straightforward (Géron 2019). The decision tree classifier is a tree-based classification technique that is frequently used to model binary responses. It learns how the independent variables classify binary responses through a sequence of decisions. Decision trees handle non-linear datasets effectively and work well with both numerical and categorical data.
Ensemble techniques
Random forest
RF is an ensemble technique built on classification and regression trees, which apply recursive binary splitting to partition the dataset and find the optimal split variables (Breiman 2001). A large number of different trees are grown by randomly resampling the original data and randomly selecting variables, which makes the forecast more dependable. The final forecast aggregates the outputs of the entire tree population (Araya et al. 2022). RF thus builds on the concept of bagging with a further degree of randomization. The RF model also handles multi-scaled, missing and dichotomous data well, is robust to noise and trains quickly, which makes it convenient and uncomplicated (Wu et al. 2019).
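The three ingredients named above (bootstrap resampling of the data, random selection of variables, and aggregation over all trees) can be sketched in a deliberately simplified form. To keep the example short, each "tree" here is a one-split decision stump rather than a full classification and regression tree, so this is an illustration of the idea and not Breiman's algorithm or the study's implementation. All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature_ids):
    """Find the single (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no usable split: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (None, None, majority, majority)
    return best[1:]


def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    if f is None:
        return l_lab
    return l_lab if row[f] <= t else r_lab


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Grow stumps on bootstrap samples, each restricted to a random feature subset."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]


# Toy, clearly separable data with two made-up features.
toy_x = [[1, 1], [2, 2], [3, 3], [8, 8], [9, 9], [10, 10]]
toy_y = [0, 0, 0, 1, 1, 1]
toy_forest = fit_forest(toy_x, toy_y)
```

Because each stump sees a different resample and feature subset, individual stumps can be poor while the aggregated vote remains stable, which is the motivation for bagging.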
Gradient boosting
Gradient boosting is another ensemble technique, used for both classification and regression. It relies on the assumption that weak learners, when combined, can form an effective model by learning from prior misclassifications.
Voting classifier hard and voting classifier soft
A voting classifier is a machine learning model that learns from a collection of models and forecasts the output class that has the highest likelihood of being selected. The voting classifier supports two voting methods:
Voting classifier hard: The class with the maximum number of votes, i.e., the class forecast by the largest number of individual classifiers, is selected.
Voting classifier soft: The class with the highest aggregate predicted probability across the individual classifiers is selected.
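The difference between the two voting methods can be made concrete with a short sketch. It assumes that soft voting averages the class probabilities with equal weights; the function names and the three hypothetical classifier outputs are for illustration only.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Return the class predicted by the largest number of classifiers."""
    return Counter(predicted_labels).most_common(1)[0][0]


def soft_vote(predicted_probas):
    """Return the class with the highest average predicted probability.

    predicted_probas: one {class: probability} dict per classifier.
    """
    n = len(predicted_probas)
    classes = list(predicted_probas[0])
    avg = {c: sum(p[c] for p in predicted_probas) / n for c in classes}
    return max(avg, key=avg.get)


# Hypothetical outputs of three classifiers for one sample (class 1 = fluoride).
probas = [
    {0: 0.55, 1: 0.45},
    {0: 0.55, 1: 0.45},
    {0: 0.05, 1: 0.95},
]
# Each classifier's hard label is its most probable class.
labels = [max(p, key=p.get) for p in probas]

hard_result = hard_vote(labels)
soft_result = soft_vote(probas)
```

Here two classifiers lean weakly toward class 0 and one is strongly confident in class 1, so the two methods disagree: hard voting counts heads, whereas soft voting lets the confident classifier carry more weight. In scikit-learn, `VotingClassifier` exposes the same choice through its `voting` parameter (`"hard"` or `"soft"`).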
RESULTS AND DISCUSSION
Extensive experiments were performed to analyze the data from various perspectives. Data were analyzed using Python 3.x, an open-source programming language, in Jupyter Notebook. The detailed discussion of the experiments and results is as follows.
Statistical analysis
Interpreting the hydrochemistry of groundwater plays a vital role in understanding its chemical composition in different areas. The fluoride concentration in groundwater ranges from 0 to 98 mg/L, with an average of 0.99 mg/L (Table 4).
Variables | pH | EC | TH | Ca2+ | Mg2+ | TA | Na+ | K+ | HCO3− | Cl− | SO42− | NO3− | F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min | 0 | 0 | 0 | 0 | −48.6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Max | 9.66 | 48,500 | 9,445 | 2,351 | 2,722 | 7,909 | 9,750 | 779 | 9,650 | 20,000 | 8,862 | 4,405 | 98 |
Mean | 7.81 | 2,041 | 439.55 | 80.42 | 66.85 | 280.76 | 275.89 | 18.54 | 383.73 | 363.47 | 164.54 | 65.19 | 0.99 |
SD | 0.43 | 2,241.7 | 416.5 | 82.32 | 124.53 | 226.01 | 421.97 | 47.17 | 248.84 | 651.19 | 290.37 | 132.52 | 1.60 |
The major factor influencing fluoride in our study area is HCO3−, whose concentration ranges from 0 to 9,650 mg/L with an average of 383.73 mg/L. Other factors that contribute to fluoride contamination are SO42−, which ranges from 0 to 8,862 mg/L with an average of 164.54 mg/L, and Cl−, which ranges from 0 to 20,000 mg/L with an average of 363.47 mg/L in the study areas. The average values of Mg2+, Na+ and EC are 66.85 mg/L, 275.89 mg/L and 2,041.00 μS/cm, respectively (Table 4).
The Bureau of Indian Standards (BIS 2012) and WHO provide drinking-water guidelines for different parameters in the form of acceptable and permissible limits. BIS advises applying the acceptable limit whenever possible, and the permissible limit only when no alternative water source is available. The pH of drinking water should lie between 6.5 and 8.5, and the EC should not exceed 400 μS/cm (WHO). For TH and TA (both as calcium carbonate), the acceptable limit is 200 mg/L and the permissible limit is 600 mg/L. The acceptable limits of calcium, chloride and magnesium are 75, 250 and 30 mg/L, whereas the permissible limits are 200, 1,000 and 100 mg/L, respectively. Sulfate has an acceptable limit of 200 mg/L and a permissible limit of 400 mg/L. Nitrate has an acceptable limit of 45 mg/L, with no relaxation for the permissible limit. Although BIS has not defined acceptable or permissible limits for sodium and potassium, WHO (1996) states that more than 200 mg/L of sodium can affect the taste of drinking water, whereas potassium intake should be 4.7 g/day in adults between 19 and 70 years of age (WHO 2009).
Pearson correlation between the parameters
Meticulous experiments were performed to evaluate the performance of 10 machine learning algorithms to predict the fluoride concentrations in five states of India. As the performance of machine learning models is highly affected by the input features in terms of accuracy as well as computational complexity, the first study was done to identify the important features to be considered for analysis.
Figure 5 displays the correlation between the parameters; the correlation values lie between −1 and +1. Among the parameters used in this study, fluoride has a positive correlation with pH (0.12), EC (0.18), TA (0.20), sodium (0.23), bicarbonate (0.25), sulfate (0.16) and chloride (0.12). Several studies indicate that high pH and a high concentration of bicarbonate ions (HCO3−) together with sodium ions (Na+) could be the dominant cause of fluoride in groundwater (Saxena & Ahmed 2001; Guo et al. 2007; Dey et al. 2012). Higher levels of bicarbonate and hydroxide ions also increase the alkalinity of groundwater. An increase in alkalinity results in the displacement of fluoride ions from fluoride-rich minerals like muscovite, biotite and amphibole (Guo et al. 2007). Fluoride has a weak negative correlation with potassium (−0.06) and carbonate (−0.03) and a weak positive correlation with magnesium (0.05) and nitrate (0.06).
Determining dominant factors accountable for fluoride motility
The features that are positively correlated with fluoride and the features extracted using mean decrease in impurity (MDI) are the same. Both measures are used to validate the resultant features.
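Assuming MDI here refers to the impurity-based feature importance commonly reported for tree ensembles, the quantity accumulated for each feature is the weighted decrease in node impurity achieved by the splits made on that feature. The sketch below computes that decrease for a single binary split using Gini impurity. The choice of Gini is an assumption, since the source does not state which impurity measure was used, and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure or empty node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# A split that separates the classes perfectly removes all impurity.
perfect = impurity_decrease([0, 0, 1, 1], [0, 0], [1, 1])
# A split that leaves both children as mixed as the parent removes none.
useless = impurity_decrease([0, 0, 1, 1], [0, 1], [0, 1])
```

Summing such decreases over every node that splits on a given feature, and averaging over the trees, yields one importance score per feature. In scikit-learn, tree ensembles expose impurity-based scores through the `feature_importances_` attribute.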
Analysis of results
Model | Accuracy (%) | Precision (%) | Recall (%) | Error rate (%) |
---|---|---|---|---|
K-nearest neighbor | 66.25 | 50.63 | 57.67 | 33.75 |
Logistic regression | 70.54 | 56.88 | 57.68 | 29.46 |
Random forest | 75.29 | 65.96 | 61.63 | 24.71 |
Support vector classifier | 70.59 | 56.95 | 57.79 | 29.41 |
Gaussian NB | 67.67 | 54.82 | 41.32 | 32.33 |
MLP classifier | 60.75 | 46.05 | 59.40 | 39.25 |
Decision tree classifier | 67.27 | 52.21 | 53.57 | 32.73 |
Gradient boosting classifier | 69.90 | 55.12 | 60.21 | 30.10 |
Voting classifier soft | 70.59 | 59.51 | 51.14 | 29.41 |
Voting classifier hard | 71.79 | 60.26 | 56.02 | 28.21 |
SVC and voting classifier soft have the same accuracy (70.59%), but SVC has a precision of 56.95% and a recall of 57.79%, whereas voting classifier soft has a precision of 59.51% and a recall of 51.14%. Thus, both models achieve the same accuracy, but voting classifier soft gives higher precision and lower recall than SVC in the present study.
CONCLUSION
The present study evaluated and compared different supervised machine learning models to forecast fluoride accumulation in groundwater in five major states of India. The models are evaluated and compared on the basis of accuracy, precision, recall and error rate. The results from the study indicated that:
The features that have a positive correlation with fluoride are the same as the dominant features extracted using MDI: EC, sulfate, bicarbonate, sodium, magnesium and chloride. The model performance obtained with these six input variables differs negligibly from that obtained with all 15 input variables. It is therefore advisable to train and test the model using only the relevant features.
The different supervised machine learning algorithms are compared in this study by considering only the geogenic parameters of fluoride contamination.
Out of all algorithms, the RF model gives an accuracy of 75.8% in predicting fluoride concentration in groundwater.
The error rate indicating the incorrect predictions out of total predictions of RF is lowest among all the models.
The results generated in this study suggest that machine learning algorithms have proven efficient in forecasting fluoride contamination in groundwater. However, this study considers only natural sources of fluoride in groundwater. This study can be further extended by including different parameters like evapotranspiration, precipitation, soil parameters, etc.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.