Abstract
This paper presents a machine learning approach for classifying arsenic (As) levels as safe or unsafe in groundwater samples collected from the Indo-Gangetic region. Because water is essential for sustaining life, heavy metals such as arsenic pose a public health concern. In this study, five tree-based machine learning models, namely Random Forest, Optimized Forest, CS Forest, SPAARC, and REP Tree, were applied to classify water samples. As per the guidelines of the World Health Organization (WHO), the arsenic concentration in water should not exceed 10 μg/L. The groundwater quality parameters were ranked using a classifier attribute evaluator before training and testing the models. Parameters obtained from the confusion matrix, namely accuracy, precision, recall, and false positive rate (FPR), were used to analyze model performance. Among all models, Optimized Forest outperforms the other classifiers, with the highest accuracy (80.64%) and precision (80.70%), a recall of 97.87%, and the lowest FPR (73.33%). The Optimized Forest model can be used to classify arsenic levels in new groundwater samples.
HIGHLIGHTS
Decision tree-based machine learning algorithms were used to predict arsenic (As) in groundwater samples.
Confusion matrices were obtained, and accuracy, precision, recall, and FPR were calculated.
The model can be used to approximate the size of the population affected by arsenic.
Spatial analysis of water parameters has been discussed.
Optimized Forest algorithm is the best-suited model for classification of arsenic.
INTRODUCTION
Water is a priceless and essential natural resource for sustaining life on earth. Around 97.5% of the water on earth is saltwater, and the rest is freshwater. Of the total freshwater available, approximately 79% is in the form of ice caps and glaciers. Groundwater accounts for around 20%, and the remaining 1% is in lakes, soil, rivers, and the atmosphere. With increased human population and climate change, freshwater demand has increased significantly.
To meet the growing demand for freshwater, Panagopoulos (2022a) proposed saltwater desalination using zero liquid discharge (ZLD). The ZLD desalination system of Panagopoulos (2021) consists of a reverse osmosis (RO) membrane, a brine concentrator (BC), and a brine crystallizer (BCr) to produce over 99% of freshwater and mixed salts. The freshwater produced has various industrial applications and is potable. More recently, Panagopoulos (2022b) compared a ZLD desalination system using BCr with a ZLD desalination system using Wind-Aided Intensified Evaporation (WAIV). The BCr system produces much more freshwater than the WAIV system, but WAIV is more cost-effective and energy-efficient.
Groundwater is a primary freshwater resource that makes up about 98% of the freshwater available on the planet outside the cryosphere and plays a substantial role in the terrestrial and aquatic ecosphere. Arsenic contamination of groundwater is a growing concern, especially in the middle Gangetic belt of Uttar Pradesh, India (Chattopadhyay et al. 2020) and the lower Indo Gangetic plain of West Bengal, India (Das et al. 2020). Groundwater quality depends on the geochemical and mineralogical composition of aquifers and on anthropogenic sources, through the weathering of rocks and minerals followed by leaching and runoff (Jeelani et al. 2014). Arsenic concentrations in aquifers above the World Health Organization permissible limit of 10 μg/L (WHO 2011) pose a severe health risk to approximately one hundred million people in the Indo Gangetic Plains (Bhowmick et al. 2018). Thus, in the last few decades, arsenic has been ranked as the top substance posing a severe threat to human life (ATSDR 2019).
Most work in the literature identifies the spatial distribution of arsenic in the groundwater of the Indo Gangetic plain and the geologic, topographic, sedimentary, biogeochemical, hydrogeologic, and anthropogenic factors responsible for As contamination (Biswas et al. 2012; Chakraborty et al. 2020). Studies have revealed that As contamination of groundwater is due to the dissolution of arsenic-bearing minerals found in Quaternary deposits of Holocene age (Lee et al. 2009; Mukherjee et al. 2009; Postma et al. 2016). As Himalayan glaciers melt due to global warming, As-rich sediments are carried by rivers originating in the Himalayas and discharged into the Indo Gangetic plain of Uttar Pradesh (Ahamed et al. 2006; Chauhan et al. 2009), Bihar (Chakraborti et al. 2003; Saha 2009), Jharkhand (Alam et al. 2016; Tirkey et al. 2017), West Bengal (Nickson et al. 2008; Mukherjee et al. 2011), Assam (Verma et al. 2015, 2019), Punjab, and Haryana (Kumar & Singh 2020). The main reason aquifers are contaminated is the oxidation of pyrite into iron oxides under aerobic conditions, which releases arsenic, sulfate, and various trace elements.
In recent years, several machine learning models have been applied to assess arsenic contamination in groundwater, of which logistic regression is the most common. Dummer et al. (2015) used a logistic regression model to predict the spatial distribution of arsenic worldwide. Parameters such as topography, soil properties, and geology have been used to predict the arsenic concentration using logistic regression (Zhang et al. 2012). In some studies, attributes such as Holocene sediments, soil salinity, soil texture, and the topographic wetness index were used to predict groundwater contamination (Lado et al. 2008). For arsenic contamination detection in groundwater, Luo et al. (2012) used principal component regression (PCR), and Zhang et al. (2013) used a linear regression model. Other studies used artificial neural networks (ANN) for the detection of arsenic in groundwater (Cho et al. 2011; Bonelli et al. 2017) or Bayesian modeling (Cha et al. 2016).
Several researchers have used adaptive neuro-fuzzy inference systems, regression Kriging, Random Forest, boosted regression trees, etc., to relate the spatial distribution of arsenic contamination in groundwater to different environmental attributes in other parts of the world (Erickson et al. 2018; Bindal & Singh 2019; Podgorski & Berg 2020). Machine learning models can capture complex relationships between attributes, including their magnitude, and they handle patterns and relationships that are difficult to establish with parametric models. However, no such machine learning model has been developed based on water parameters to assess arsenic contamination risk in the Varanasi region. Varanasi was chosen by the Ministry of Urban Development (MoUD) and the Ministry of Housing and Urban Poverty Alleviation (MoHUPA) of the Government of India to develop the city while conserving its heritage, culture, and traditions. Therefore, arsenic investigation strategies need to be developed for the densely populated Varanasi region to provide rapid information for monitoring and managing public health improvement.
The present study used tree-based machine learning algorithms to classify the water samples collected across the Varanasi region (Figure 1) as safe or unsafe based on the arsenic level. Most studies use geology, soil parameters, land cover, topography, minerals, temperature, precipitation, hydrology, aquifer connectivity, etc. (Chakraborty et al. 2020; de Menezes et al. 2020) as input variables to predict the occurrence of arsenic in their study areas. This study uses the water parameters pH, EC (μS cm−1), TDS (ppm), Salinity (ppm), Na+ (mEq L−1), K+ (ppm), Ca2+ + Mg2+ (mEq L−1), SAR, SSP (%), CO32− (mEq L−1), HCO3− (mEq L−1), RSC (mEq L−1), PO43− (ppm), Cl− (mEq L−1), Mn2+ (ppb), Cu2+ (ppb), Fe2+ (ppm), Zn2+ (ppb), and SO42− (ppm) as input variables to classify the water samples as per the WHO permissible limit. Moreover, for attribute selection, a Random Forest algorithm was used to find the relevant attributes (Chen et al. 2020) responsible for the arsenic occurrence.
Therefore, the study is significant as machine learning approaches using easily estimable relevant parameters as inputs could prove highly useful in predicting the arsenic contamination level. This also helps in digitizing the dataset, which could be easily retrievable for further use and classification, and suitability analysis of groundwater. The work involves applying tree-based machine learning classification algorithms to classify the samples as safe or unsafe, namely Optimized Forest, SPAARC, CS Forest, Reduced Error Pruning (REP) Tree, and Random Forest. These models have been analyzed based on multiple evaluation criteria like accuracy, precision, recall, and FPR.
MATERIALS AND METHODS
Study area
In this study, 62 groundwater samples were collected along the bank of the river Ganga in the Varanasi region on a grid basis (Figure 1). Water quality parameters of the collected samples were determined by the standard methods described by the American Public Health Association (APHA; Maiti 2001; Federation & Aph Association 2005).
The concentrations of zinc (Zn2+), copper (Cu2+), iron (Fe2+), manganese (Mn2+), and arsenic (As) ions were determined with an atomic absorption spectrophotometer (AAS). Specifically, atomic absorption spectrometry (Agilent Technologies VGA 77 AA spectrophotometer, Australia, Serial no. MY16020008) was used (Van Herreweghe et al. 2003) to determine the arsenic concentration in the water samples.
Dataset description
The data were collected from 62 different locations in Varanasi in the Indo Gangetic region, and water was classified as safe or unsafe using the attributes shown in Table 1. All attributes are numeric except for one categorical attribute that labels the water as safe or unsafe. The descriptive statistics of the dataset are shown in Table 1, which lists each parameter's minimum, maximum, mean, mode, and median values.
Attribute description | Min | Max | Mean | Mode | Median |
---|---|---|---|---|---|
Hydrogen ion concentration (pH) | 7.01 | 8.71 | 7.88 | 7.35 | 7.84 |
Electrical conductivity (EC (μS cm−1)) | 209 | 859 | 426.77 | 329 | 411 |
Total dissolved solid (TDS (ppm)) | 134 | 559 | 275.56 | 214 | 266 |
Salt content (Salinity (ppm)) | 96 | 647 | 295.58 | 248 | 299 |
Sodium (Na+ (mEq L−1)) | 0.07 | 13.40 | 1.83 | 0.14 | 0.62 |
Potassium (K+ (ppm)) | 0.70 | 44.60 | 9.22 | 2.60 | 6.05 |
Calcium and Magnesium (Ca2+ + Mg2+ (mEq L−1)) | 1.70 | 10.20 | 3.15 | 5.80 | 5.15 |
Sodium Adsorption Ratio (SAR) | 0.04 | 9.48 | 1.23 | 0.04 | 0.41 |
Soluble Sodium (SSP (%)) | 0.53 | 77.01 | 18.64 | 2.44 | 7.53 |
Carbonate (CO32− (mEq L−1)) | 1.20 | 4.40 | 2.24 | 2.00 | 2.00 |
Bicarbonate (HCO3− (mEq L−1)) | 1.80 | 9.20 | 4.78 | 3.60 | 4.65 |
Residual Sodium Carbonate (RSC (mEq L−1)) | 0.10 | 6.90 | 1.77 | 0.30 | 1.15 |
Phosphate (PO43− (ppm)) | 0.45 | 14.09 | 4.16 | 3.94 | 3.94 |
Chloride (Cl− (mEq L−1)) | 1.60 | 5.00 | 2.71 | 2.40 | 2.60 |
Manganese (Mn2+ (ppb)) | 0.90 | 97.00 | 31.48 | 36.00 | 26.00 |
Copper (Cu2+ (ppb)) | 12.00 | 74.20 | 34.65 | 22.00 | 30.90 |
Iron (Fe2+ (ppm)) | 0.10 | 22.10 | 3.63 | 1.20 | 2.10 |
Zinc (Zn2+ (ppb)) | 10.00 | 77.00 | 31.32 | 39.00 | 27.50 |
Arsenic (As (III) (ppb)) | 3.00 | 25.00 | 8.38 | 6.20 | 7.52 |
Statistical and spatial analysis
Pearson correlation analysis was performed to find the correlation between the parameters and arsenic. Moreover, to examine the variation of parameters in the dataset, principal component analysis was performed using IBM SPSS Statistics 25 software. The data variation in the study area can be visualized using spatial variability maps. The inverse distance weighting (IDW) interpolation technique was used for the spatial analysis. In IDW, each measured data point has a local influence that is strongest nearby and decreases with increasing distance. Spatial variability maps of all the parameters were created using ArcGIS 10.8 software.
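The distance-decay idea behind IDW can be illustrated with a minimal sketch. This is not the ArcGIS implementation; the function name `idw`, the default power of 2, and the sample coordinates are assumptions made for illustration only.

```python
import math

def idw(points, x, y, power=2.0):
    """Inverse distance weighted estimate at location (x, y).

    points: iterable of (px, py, value) measured samples.
    power:  distance exponent; 2 is a common default (assumed here).
    """
    num = 0.0
    den = 0.0
    for px, py, value in points:
        d = math.hypot(x - px, y - py)
        if d == 0.0:
            return value  # query sits exactly on a sample: use its value
        w = 1.0 / d ** power  # nearer samples receive larger weights
        num += w * value
        den += w
    return num / den

# Hypothetical sample locations with arsenic values in ppb.
samples = [(0.0, 0.0, 3.0), (10.0, 0.0, 25.0), (0.0, 10.0, 8.0)]
estimate = idw(samples, 1.0, 1.0)
print(f"interpolated value near the first sample: {estimate:.2f} ppb")
```

Because the query point lies much closer to the first sample, the estimate stays close to that sample's value, which is the local influence described above.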
Feature selection
The collected dataset contained 62 instances, which were used to train and test the Optimized Forest, SPAARC, CS Forest, Reduced Error Pruning (REP) Tree, and Random Forest classifiers. Before training and testing, it is essential to perform feature selection. Feature selection (Hossain et al. 2013) simplifies the model by reducing the number of attributes, which decreases the training time. It can also increase the model's accuracy by retaining the attributes most relevant for classification. Moreover, it reduces overfitting by improving generalization and avoids the curse of dimensionality, in which the predictive power of a classifier first increases as the number of attributes increases but then declines beyond a certain number of attributes.
A classifier attribute evaluator was used to find the relevant attributes in the dataset. It evaluates individual attributes by measuring the amount of information gained about the class given the attribute. A Random Forest classifier was used to assess the importance of each attribute for classification (Chen et al. 2020).
Implementation of machine learning algorithms
The machine learning classifiers were trained and tested (Fushiki 2011) based on these attributes to get a confusion matrix. A flowchart of the methodology used to implement machine learning models is shown in Figure 2. Classifiers have been analyzed based on multiple evaluation criteria like accuracy, precision, recall, FPR (false positive rate), and mean absolute error (MAE).
Optimized Forest
The Optimized Forest classifier (Adnan & Islam 2016) is a decision forest algorithm that uses a genetic algorithm (GA) to select an optimized subforest with high accuracy and diversity, increasing the overall accuracy of the ensemble. The main idea is to infuse high-quality trees into the initial GA population. The GA encodes the population as data structures known as chromosomes. In this algorithm, 20 chromosomes constitute the population: the 10 odd-numbered chromosomes are selected using a stratified sampling technique that favors good-quality trees, and the remaining 10 even-numbered chromosomes are chosen randomly. Crossover and mutation are then applied to chromosomes chosen with the roulette wheel technique. An elitist operation is applied after this step to retain the competent chromosomes. To prevent degradation, a pool of 40 chromosomes is created, from which 20 are selected using the roulette wheel technique. Finally, a sequential search operation is applied to obtain the best ensemble accuracy.
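The roulette wheel technique mentioned above selects candidates with probability proportional to their fitness. The following is a minimal sketch of that selection step only, not the Optimized Forest implementation; the function name `roulette_select` and the fitness values are hypothetical.

```python
import random

def roulette_select(fitness, k, rng):
    """Pick k indices, each with probability proportional to its fitness."""
    total = sum(fitness)
    picks = []
    for _ in range(k):
        r = rng.uniform(0, total)  # spin the wheel
        cum = 0.0
        for i, f in enumerate(fitness):
            cum += f
            if r < cum:
                picks.append(i)
                break
        else:
            # Guard against floating-point round-off at the upper edge.
            picks.append(len(fitness) - 1)
    return picks

rng = random.Random(42)
# Hypothetical fitness scores for a pool of 40 chromosomes.
pool_fitness = [rng.uniform(0.5, 1.0) for _ in range(40)]
survivors = roulette_select(pool_fitness, 20, rng)
print(len(survivors), "chromosomes selected from a pool of", len(pool_fitness))
```

Selection here is with replacement, so a fit chromosome can be drawn more than once; whether the original algorithm allows repeats is not stated in the text.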
SPAARC
The Split-Point and Attribute Reduced Classifier (SPAARC) is a fast decision tree algorithm (Yates et al. 2018). It speeds up the induction of decision trees through Split Point Sampling and Node Attribute Sampling (NAS). NAS avoids testing every non-class attribute at every node, and it does so without preselecting a subset of attributes before induction. Instead, it dynamically selects attributes based on the depth of the current node under test. Split point sampling is used to find the optimum split point according to the Gini index.
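To make the Gini-based choice of split point concrete, the sketch below scores a small set of candidate thresholds and keeps the one with the lowest weighted Gini impurity. It illustrates the general idea of evaluating only sampled split points and is not the SPAARC code; all names and the toy data are assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(values, labels, candidates):
    """Return the candidate threshold with the lowest weighted impurity."""
    return min(candidates, key=lambda t: split_gini(values, labels, t))

# Toy attribute values and class labels (hypothetical).
values = [1.0, 2.0, 3.0, 4.0]
labels = ["safe", "safe", "unsafe", "unsafe"]
# Only a sampled subset of thresholds is scored instead of every midpoint.
chosen = best_split(values, labels, candidates=[1.5, 2.5, 3.5])
print("chosen threshold:", chosen)
```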
CS Forest
CS Forest is a cost-sensitive classification technique that uses an ensemble of decision trees (Siers & Islam 2015). The algorithm uses CSVoting to minimize the classification cost. CSVoting calculates the cost of labeling a record as positive and as negative: it sums the positive classification cost over all leaves and, likewise, the negative classification cost. A record is classified as positive if the total cost of the positive classification is less than that of the negative classification; otherwise, it is classified as negative. The algorithm also uses cost-sensitive pruning, in which a tree is pruned if pruning does not significantly increase the classification cost. In the CSForest algorithm, each tree is first allowed to grow fully and is then pruned. The ensemble of trees is built first, and CSVoting is then used for classification.
REP Tree
The Reduced Error Pruning (REP) Tree is a fast decision tree learner (Elomaa & Kaariainen 2001). It is based on the C4.5 algorithm and can be used to build classification and regression trees. It builds a decision tree based on information gain and prunes it using reduced-error pruning with backfitting. REP traverses the tree bottom-up and replaces a node with its most frequent class whenever doing so does not reduce the accuracy of the tree. This procedure continues until further pruning would decrease the accuracy, which is estimated using a pruning set.
Random Forest
The Random Forest algorithm (Breiman 2001) is a collection of tree-structured classifiers. It uses the bootstrap resampling method (Stine 1989) to draw subsamples from the original samples and builds a decision tree for each subsample. The resulting decision trees are combined to form a forest. To make a prediction, the forest takes a vote among the trees, and the class chosen by the majority of trees is the output.
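The two ingredients described above, bootstrap resampling and majority voting, can be sketched in a few lines. This is an illustration only: tree induction is omitted, and the function names and toy records are hypothetical.

```python
import random
from collections import Counter

def bootstrap(rows, rng):
    """Draw len(rows) records with replacement from the original sample."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = [("s1", "safe"), ("s2", "safe"), ("s3", "unsafe"), ("s4", "safe")]
subsample = bootstrap(rows, rng)  # one tree would be trained on this
print("bootstrap subsample:", subsample)

# Hypothetical votes from five trees for a single new water sample.
votes = ["safe", "unsafe", "safe", "safe", "unsafe"]
print("forest prediction:", majority_vote(votes))
```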
RESULTS AND DISCUSSION
Pearson correlation between the parameters
The correlation between the parameters is shown in Table 2. Correlation values range from −1 (denoted in red) to +1 (denoted in green). Among the parameters considered in the study area, arsenic is positively correlated with iron (r = +0.248), sodium (r = +0.144), and, very weakly, potassium (r = +0.009), and it is weakly negatively correlated with copper (r = −0.025). Multiple studies have shown that the reduction of iron oxides and hydroxides by abiotic or biotic factors and the oxidation of iron sulfides are responsible for arsenic accumulation in groundwater. Iron shows a positive correlation with Ca2+ + Mg2+ (r = +0.304).
Please refer to the online version of this paper to see this table in colour: http://dx.doi.org/10.2166/wh.2022.015.
Moreover, potassium is positively correlated with zinc (r = +0.166). Sodium has a negative correlation with Ca2+ + Mg2+ (r = −0.266), iron (r = −0.013), and potassium (r = −0.331) but a positive correlation with copper (r = +0.034). Thus, the study observed an inverse relation between sodium and most other metal ions. Phosphate shows a negative correlation with arsenic (r = −0.281) and bicarbonate (r = −0.271). Arsenic is positively correlated with bicarbonate (r = +0.259), which is responsible for the dissociation of arsenic from iron oxyhydroxides present in aquifers and its accumulation in groundwater.
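For reference, the Pearson coefficient r reported above is the covariance of two variables divided by the product of their standard deviations. A minimal sketch follows; the study itself used IBM SPSS, and the paired values below are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements (not taken from the dataset).
iron = [0.5, 1.2, 2.1, 3.6, 5.0]
arsenic = [4.0, 6.2, 5.8, 9.1, 12.5]
print(f"r = {pearson_r(iron, arsenic):+.3f}")
```

Values near +1 or −1 indicate a strong linear relation, while values near 0, such as the arsenic and potassium pair above, indicate almost none.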
Principal component analysis for significant variables
The variation in the dataset explained by the parameters is shown in Table 3. The first principal component, to which SSP and Na+ contribute most, accounts for 24.60% of the total variation. TDS and SAR contribute to the second principal component, which accounts for 16.05% of the total variation. In the third principal component, Ca2+ + Mg2+, Fe2+, and arsenic contribute 11.99% of the total variation. In the fourth principal component, RSC and Cl− contribute 8.60% of the total variation. Arsenic and Mn2+ contribute 7.30% of the total variation in the fifth principal component. In the sixth principal component, PO43− and Zn2+ contribute 6.41% of the total variation. In this study, arsenic contributes to the third and fifth principal components affecting groundwater aquifers.
Parameters | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 |
---|---|---|---|---|---|---|
% of Variance | 24.602 | 16.058 | 11.993 | 8.609 | 7.300 | 6.416 |
SSP | 0.865 | −0.403 | ||||
Na+ | 0.767 | −0.459 | ||||
pH | 0.758 | |||||
SAR | 0.750 | −0.501 | ||||
CO32− | 0.623 | |||||
RSC | 0.595 | 0.536 | ||||
TDS | 0.508 | 0.783 | ||||
EC | 0.533 | 0.753 | ||||
Salinity | 0.598 | 0.675 | ||||
HCO3− | 0.632 | 0.405 | 0.480 | |||
Ca2++Mg2+ | 0.489 | 0.655 | ||||
Fe2+ | 0.564 | |||||
PO43− | −0.552 | 0.506 | ||||
Cl− | −0.517 | |||||
K+ | 0.472 | |||||
As | 0.513 | −0.587 | ||||
Mn2+ | 0.574 | |||||
Cu2+ | 0.477 | 0.574 | ||||
Zn2+ | −0.435 | −0.490 |
Spatial analysis of parameters
The variation of data in the study area is examined through spatial analysis. The spatial analysis maps of zinc (Figure 3(a)) and arsenic (Figure 3(b)) show that the two are negatively correlated. In some parts, the high zinc content is due to the overuse of cow dung as manure for agriculture. Moreover, TDS (Figure 3(c)) has little relation to arsenic.
The spatial analysis maps of SSP (Figure 3(d)) and of sodium, SAR, salinity, and RSC (Figure 4(a)–4(d)) show high sodium content in water samples due to the deposition of sodium by the river Ganga. A high phosphate level (Figure 4(e)) is negatively related to arsenic occurrence but positively correlated with potassium, as shown in Figure 4(f).
A high pH level in groundwater (Figure 5(a)) increases the negative charge on minerals present in aquifers and stimulates the desorption of arsenic into groundwater. Iron content (Figure 5(b)) is higher along the meandering stretch of the river Ganga, showing a positive relation to arsenic. The map in Figure 5(c) shows that chloride content is higher away from the meandering part of the river Ganga. The presence of manganese in groundwater (Figure 5(d)) of the study area indicates conditions that favor the dissolution or reduction of manganese oxyhydroxides, which releases arsenic. Calcium content (Figure 5(e)) and magnesium content (Figure 5(f)) in groundwater can decrease the leaching of arsenic.
In Figure 6(a), the groundwater has high EC, which is negatively related to the occurrence of arsenic. There is a negative correlation between arsenic and copper, as shown in Figure 6(b), but both are responsible for nephrotoxicity and damage to the kidneys. The presence of carbonates (Figure 6(c)) and bicarbonates (Figure 6(d)) favors arsenic dissolution into aquifers. Bicarbonates create favorable conditions for the mobilization of arsenic from iron and manganese oxyhydroxides.
It is evident from the Pearson correlation results (Table 2) and the attributes selected by the classifier attribute evaluator that Fe2+ (ppm) is positively correlated with, and responsible for, arsenic occurrence in groundwater (Bhattacharya et al. 2003). The oxides/hydroxides of Fe2+ containing arsenic are reduced due to abiotic and biotic factors (Islam et al. 2004). Moreover, the oxidation of Fe2+ sulfides is also responsible for arsenic occurrence in groundwater (Carraro et al. 2013). The presence of Ca2++Mg2+ in groundwater also has a positive relationship with the occurrence of arsenic, and Ca2++Mg2+ shows a positive correlation with Fe2+. Bhowmick et al. (2013) reported that a high concentration of HCO3− (mEq L−1) favors the mobilization of arsenic in groundwater by dissociating arsenic from iron oxyhydroxides, illite, and kaolinite (Gao et al. 2011).
High pH in the Varanasi region favours arsenic occurrence. This may be due to the presence of calcite in bedrock (Ayotte et al. 2003). At high pH (>8.5), arsenic desorbs from mineral oxides, and at near-neutral pH it is released through the reduction of Fe2+ and Mn2+ oxides (Smedley & Kinniburgh 2002). Moreover, the Pearson correlation and the classifier attribute evaluator show that Zn2+ (ppb) and Mn2+ (ppb) have a negative correlation with the occurrence of arsenic (Murphy et al. 2019). Cl− (mEq L−1) is found to be positively associated with the occurrence of arsenic, and it is evident that Cl− derives arsenic from bedrock (Warner 2001). High sodium concentration in water samples shows a significant negative correlation with iron, potassium, calcium, and magnesium. Consistent with Meng et al. (2017), the Pearson correlation and the classifier attribute evaluator indicate that the CO32− (mEq L−1) of Ca2+ is negatively correlated with the occurrence of arsenic.
Classification result of arsenic contamination in groundwater
Based on the classifier attribute evaluator feature selection method, the following attributes were selected for training and testing the machine learning algorithms: HCO3− (mEq L−1), Fe2+ (ppm), Zn2+ (ppb), pH, Ca2++Mg2+ (mEq L−1), Cl− (mEq L−1), SSP (%), Mn2+ (ppb), RSC (mEq L−1), and CO32− (mEq L−1). For all machine learning classifiers, accuracy, precision, recall, and FPR were calculated from the confusion matrix, as shown in Table 4.
Algorithm | Actual class | Predicted safe | Predicted unsafe | Accuracy (%) | Precision (%) | Recall (%) | FPR (%) |
---|---|---|---|---|---|---|---|
Optimized Forest | Safe | 46 | 1 | 80.64 | 80.70 | 97.87 | 73.33 |
 | Unsafe | 11 | 4 | | | | |
SPAARC | Safe | 47 | 0 | 77.41 | 77.04 | 100 | 93.33 |
 | Unsafe | 14 | 1 | | | | |
CS Forest | Safe | 46 | 1 | 75.80 | 76.66 | 97.87 | 93.33 |
 | Unsafe | 14 | 1 | | | | |
REP Tree | Safe | 46 | 1 | 75.80 | 76.66 | 97.87 | 93.33 |
 | Unsafe | 14 | 1 | | | | |
Random Forest | Safe | 46 | 1 | 79.03 | 79.31 | 97.87 | 80.00 |
 | Unsafe | 12 | 3 | | | | |
Experimental results obtained after training and testing are compared, and a graph (Figure 7) is plotted for all machine learning classifiers. In the context of our study, the algorithm with the highest overall accuracy, precision, and recall and the lowest FPR is considered best. Accuracy is the ratio of correct predictions to the total number of predictions made. Optimized Forest has the highest accuracy (Figure 7(a)) of 80.64% compared to the other classifiers. Precision is the proportion of true positives among all positive predictions. Optimized Forest has the highest precision (Figure 7(b)) of 80.70% compared to the other classifiers.
When recall is considered, SPAARC has the highest recall (Figure 7(c)) of 100% compared to the other machine learning classifiers. Recall signifies the sensitivity of the model. When FPR is considered, the Optimized Forest algorithm has the lowest FPR of 73.33% (Figure 7(d)) compared to the other classifiers. A lower FPR indicates a better model.
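The four metrics can be recomputed directly from the confusion matrix counts in Table 4. The sketch below assumes that "safe" is the positive class, an inference that reproduces the reported values; the function and variable names are illustrative.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and FPR in percent, with 'safe' as positive."""
    total = tp + fn + fp + tn
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "precision": 100.0 * tp / (tp + fp),
        "recall": 100.0 * tp / (tp + fn),
        "fpr": 100.0 * fp / (fp + tn),
    }

# Counts from Table 4: (safe as safe, safe as unsafe, unsafe as safe, unsafe as unsafe).
confusion = {
    "Optimized Forest": (46, 1, 11, 4),
    "SPAARC": (47, 0, 14, 1),
    "CS Forest": (46, 1, 14, 1),
    "REP Tree": (46, 1, 14, 1),
    "Random Forest": (46, 1, 12, 3),
}

results = {name: metrics(*counts) for name, counts in confusion.items()}
for name, m in results.items():
    print(name, {k: round(v, 2) for k, v in m.items()})
```

Note that with only 15 unsafe samples out of 62, every classifier labels most unsafe samples as safe, which is why the FPR stays high even for the best model.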
Finally, considering the overall performance of all classifiers, the Optimized Forest model is the best in terms of accuracy, precision, and FPR, and its recall (97.87%) is close to the highest. Moreover, when the mean absolute error (MAE) of all models is compared, Optimized Forest has the lowest MAE of 0.240, compared to 0.245, 0.358, 0.245, and 0.243 for the SPAARC, CS Forest, REP Tree, and Random Forest algorithms, respectively. Thus, the Optimized Forest classifier model can be used for various applications.
Application of machine learning model outcomes
Approximation of people prone to As poisoning
The total area of our study is 1,258.39 km2, out of which 197.36 km2 have a high arsenic concentration (>10 μg/L). As per the census of India, the population density of the Varanasi district is 2,395 people/km2. Using Equation (1), approximately 381,166 people live in the high arsenic concentration (>10 μg/L) region, and 2,049,197 people live in the low arsenic concentration (<10 μg/L) region. So, approximately 2,430,363 people are prone to developing carcinogenic and non-carcinogenic diseases in the study area.
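Equation (1) is not reproduced in this text. The reported figures are consistent with multiplying area, population density, and the Optimized Forest accuracy (80.64%); the sketch below checks that reading, which is an assumption rather than the authors' stated formula. All names are illustrative.

```python
# Values quoted in the paragraph above.
TOTAL_AREA_KM2 = 1258.39
HIGH_AS_AREA_KM2 = 197.36
DENSITY_PER_KM2 = 2395

# Assumption: Equation (1) scales the population by the model accuracy.
MODEL_ACCURACY = 0.8064

def exposed_population(area_km2, density=DENSITY_PER_KM2, accuracy=MODEL_ACCURACY):
    """Estimated population for an area under the assumed form of Equation (1)."""
    return area_km2 * density * accuracy

high = exposed_population(HIGH_AS_AREA_KM2)
total = exposed_population(TOTAL_AREA_KM2)
low = total - high  # population in the low-arsenic region

print(f"high-arsenic region: {high:,.0f}")
print(f"low-arsenic region:  {low:,.0f}")
print(f"whole study area:    {total:,.0f}")
```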
Carcinogenic and non-carcinogenic risk assessment of the study area
Based on the model accuracy and the population of the study area (Equation (1)), approximately 2,430,363 people are prone to arsenic poisoning. The carcinogenic and non-carcinogenic risk to people living in the study area is calculated (Table 5) using the minimum, maximum, mean, and median values of the arsenic distribution in the study area (Li et al. 2018). The safe limit for non-carcinogenic risk (HQ) is less than 1, and for carcinogenic risk (TCR) it is less than 10−6 (U.S. EPA (U.S. Environmental Protection Agency) 1992a, 1992b). The HQ value ranges from 0.33 (low risk) to 2.77 (very high risk); areas with HQ > 1 have a high risk of non-carcinogenic diseases. Moreover, the TCR value ranges from 1.5 × 10−4 (high risk) to 1.25 × 10−3 (very high risk); areas with TCR > 10−6 have a high risk of skin, lung, and urinary bladder cancer due to long-term exposure to arsenic.
 | | Non-carcinogenic risk | | | | | Carcinogenic risk | | |
---|---|---|---|---|---|---|---|---|---|
 | As | ADDi | ADDd | HQi | HQd | HQ | TCRi | TCRd | TCR |
Min | 3.00 | 1*10−4 | 7.57*10−11 | 0.33 | 6.16*10−15 | 0.33 | 1.5*10−4 | 2.77*10−10 | 1.5*10−4 |
Mean | 8.38 | 2.79*10−4 | 2.11*10−10 | 0.93 | 1.72*10−14 | 0.93 | 4.18*10−4 | 7.74*10−10 | 4.19*10−4 |
Median | 7.52 | 2.50*10−4 | 1.90*10−10 | 0.83 | 1.54*10−14 | 0.83 | 3.75*10−4 | 6.95*10−10 | 3.75*10−4 |
Max | 25.00 | 8.33*10−4 | 6.31*10−10 | 2.77 | 5.13*10−14 | 2.77 | 1.24*10−3 | 2.31*10−9 | 1.25*10−3 |
i, ingestion; d, dermal. As is given in μg/L.
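The ingestion columns of Table 5 can be approximated with a short calculation. The sketch below is not the authors' code. It assumes the common U.S. EPA form ADD = C × IR / BW (with exposure frequency and duration cancelling against the averaging time), HQ = ADD / RfD, and TCR = ADD × CSF. The intake rate (2 L/day) and body weight (60 kg) are hypothetical values chosen only because their ratio matches the tabulated ADDi; RfD = 3 × 10⁻⁴ mg/kg/day and CSF = 1.5 (mg/kg/day)⁻¹ are commonly used U.S. EPA values for arsenic and are likewise assumed here. The dermal pathway is omitted because its contribution in Table 5 is negligible.

```python
# Sketch of the ingestion-pathway risk calculation.
# All four parameters below are assumptions, not values stated in the paper.
INTAKE_L_PER_DAY = 2.0      # assumed drinking-water intake (hypothetical)
BODY_WEIGHT_KG = 60.0       # assumed adult body weight (hypothetical)
RFD_MG_PER_KG_DAY = 3e-4    # assumed oral reference dose for arsenic
CSF_PER_MG_KG_DAY = 1.5     # assumed oral cancer slope factor for arsenic


def ingestion_risk(as_ug_per_l):
    """Return (ADDi, HQi, TCRi) for an arsenic concentration in ug/L."""
    conc_mg_per_l = as_ug_per_l / 1000.0                      # ug/L -> mg/L
    add = conc_mg_per_l * INTAKE_L_PER_DAY / BODY_WEIGHT_KG   # mg/kg/day
    hq = add / RFD_MG_PER_KG_DAY                              # hazard quotient
    tcr = add * CSF_PER_MG_KG_DAY                             # cancer risk
    return add, hq, tcr


if __name__ == "__main__":
    # Arsenic summary statistics taken from Table 5 (ug/L).
    stats = [("Min", 3.00), ("Mean", 8.38), ("Median", 7.52), ("Max", 25.00)]
    for label, conc in stats:
        add, hq, tcr = ingestion_risk(conc)
        print(f"{label:<6} ADDi={add:.2e}  HQi={hq:.2f}  TCRi={tcr:.2e}")
```

Under these assumptions the computed values agree with Table 5 to within one unit in the last reported digit, which suggests that the tabulated figures were truncated rather than rounded.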
Association between attributes and spatial distribution of arsenic
The Pearson correlation and the classifier attribute evaluator identified the significant attributes associated with the occurrence of arsenic. The model was trained and tested on those attributes to classify arsenic levels as safe or unsafe, and it can be used to classify new water samples in the same way. The spatial distribution of arsenic, shown in Figure 3(a), can be used to group regions into high-, moderate-, and low-risk zones. Based on these zones, various mitigation strategies can be implemented to protect people's health. Continuous surveillance and monitoring of water quality in the affected regions is needed.
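The safe/unsafe label that the classifier is trained to predict can be illustrated with a simple threshold rule. This is a minimal sketch, not the authors' model: it assumes that a sample counts as safe when its arsenic concentration does not exceed the WHO guideline of 10 μg/L, and the function and constant names are hypothetical. The example values are the arsenic summary statistics reported in Table 5.

```python
WHO_LIMIT_UG_PER_L = 10.0  # WHO guideline value for arsenic in drinking water


def label_sample(as_ug_per_l, limit=WHO_LIMIT_UG_PER_L):
    """Label a measured arsenic concentration (ug/L) as 'safe' or 'unsafe'.

    Assumption: concentrations equal to the limit are treated as safe.
    """
    return "safe" if as_ug_per_l <= limit else "unsafe"


if __name__ == "__main__":
    # Summary statistics of arsenic in the study area (ug/L), from Table 5.
    stats = [("Min", 3.00), ("Median", 7.52), ("Mean", 8.38), ("Max", 25.00)]
    for name, conc in stats:
        print(name, conc, label_sample(conc))
```

The trained model predicts this label from the other water quality attributes, so a rule of this kind only defines the target classes.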
Mitigation techniques based on the spatial map of arsenic
Based on the spatial maps and the grouping of zones, policymakers can plan mitigation strategies. Arsenic mitigation falls into two categories: first, finding alternative arsenic-free water sources, and second, using the latest technology to remove arsenic from the water source.
Rainwater harvesting pits, new ponds, lakes, and dug wells created in the affected zones are generally free from arsenic because of the continuously oxidizing environment and groundwater recharge by rainfall. Another option is to motivate people in the affected zones to switch from arsenic-affected shallow tube wells to new deep tube wells to access arsenic-free water.
Infrastructure for supplying arsenic-free water needs to be developed in the arsenic-affected zones. One removal technique is the oxidation process (Lee et al. 2003), in which soluble AsIII is converted to AsV, followed by adsorption; AsV is adsorbed onto a solid surface more easily than AsIII. Several oxidants are used, including O₃, H₂O₂, Cl₂, and NH₂Cl.
Another technique for removing arsenic is the coagulation-flocculation process (Pallier et al. 2010), which uses Fe- and Al-based coagulants, followed by the formation of flocs that aggregate into larger particles. The soluble arsenic is precipitated onto the flocs and thus removed from the water.
Arsenic removal using adsorbents (Katsoyiannis et al. 2008) has also been widely explored. Zero-valent iron (ZVI, Fe(0)), ferrihydrite (Giles et al. 2011), granular ferric hydroxide (Driehaus et al. 1998), and hydrous ferric oxide (Wilkie & Hering 1996) are among the most widely explored iron-based adsorbents for As removal, with promising results for both AsIII and AsV.
In another study, the bacteria Gallionella ferruginea and Leptothrix ochracea removed arsenic from contaminated water (Katsoyiannis & Zouboulis 2004). These bacteria oxidize AsIII to AsV, which is then removed by a coagulation process to yield arsenic-free water.
CONCLUSION AND FUTURE WORK
This study assesses groundwater quality in the Varanasi district of Uttar Pradesh, India, to classify water samples as safe or unsafe according to the WHO guidelines. Arsenic is a known carcinogen that affects the health of humans and animals. The study identified the major geochemical parameters associated with the occurrence of arsenic. Based on its overall performance, the Optimized Forest algorithm was used to build a machine learning model for classifying water samples as safe or unsafe. Based on the model accuracy and the population of the study area, approximately 2,430,363 people are estimated to be prone to arsenic poisoning. The carcinogenic and non-carcinogenic risks were assessed from the arsenic distribution in the study area. To counter arsenic poisoning in the study area, continuous surveillance and monitoring of water quality is needed, and infrastructure for arsenic-free water needs to be developed in the worst-affected regions.
In future work, we will include other vital parameters, such as geology, soil parameters, land use and land cover, topography, minerals, temperature, precipitation, hydrology, aquifer connectivity, and hydrostratigraphy, along with water parameters, to build a machine learning model that is robust and accurate over a broader study area.
DECLARATION OF COMPETING INTEREST
The authors declare that they are not affiliated with or involved in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this paper.
ACKNOWLEDGEMENTS
The author would like to acknowledge the fellowship provided by the University Grants Commission (UGC), New Delhi, in the form of a Junior Research Fellowship (JRF) and a Senior Research Fellowship (SRF), to conduct this research.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.