ABSTRACT
Assessment of groundwater vulnerability to nitrate serves as a measure of potential groundwater nitrate pollution in a target area. This study applies the DRASTIC-LU framework, nitrate distribution data, and three machine learning models (RF, XGB, SVM) to classify nitrate levels (exceeding 10 mg/L as nitrogen) in Chongqing, China. Model evaluation uses accuracy and F1 score metrics, with RF achieving the highest accuracy (92.9%), kappa (0.857), and AUC (0.948) on the test dataset. Furthermore, the SHAP interpreter revealed that aquifer conductivity, lithology, agricultural activities, areas with high-intensity development, and groundwater recharge are the most influential indicators of groundwater vulnerability. The final groundwater vulnerability level distribution map, with a resolution of 1 km × 1 km, reveals that high and extremely high vulnerability levels are concentrated in areas with high-intensity urban development and karst trough valleys in the southeastern, northeastern, and central urban areas. This work represents the first attempt to use machine learning models for groundwater vulnerability assessment in the Chongqing region. It provides theoretical support for the layout of groundwater monitoring stations and for the prevention and control of groundwater pollution in the future.
HIGHLIGHTS
Predicting groundwater nitrate vulnerability in Chongqing region using DRASTIC-LU framework and machine learning models.
Random Forest outperforms other machine learning approaches.
Aquifer conductivity, lithology, agricultural activities, and areas with high-intensity development were the most influential explanatory factors.
Groundwater nitrate vulnerability is mapped at 1 km resolution.
The groundwater nitrate vulnerability map reveals that high vulnerability levels are distributed in karst trough valleys in the southeastern, northeastern, and central urban areas.
INTRODUCTION
Groundwater serves as a vital resource for numerous rural communities globally, playing a crucial role in sustaining agricultural activities and supporting food production on a global scale (Li et al. 2021). Approximately half of the world's potable water and a significant portion of irrigation water are sourced from groundwater reservoirs (Gleeson et al. 2016). In regions like China, the overexploitation of groundwater for agricultural, industrial, and domestic purposes has intensified, elevating concerns about the reliance on groundwater and its diminishing quality (Gu et al. 2013). It is anticipated that the demand for groundwater as a dependable and safe water supply source will significantly rise in the future. This surge in demand may render aquifers more vulnerable to anthropogenic influences, including intensified agricultural practices, alterations in land use (LU) and land cover, population growth (especially in developing regions), heightened water consumption with economic prosperity, rapid urbanization and industrialization, accessibility to inexpensive drilling and pumping technology, discharge of pollutants, power generation activities, and shifts in institutional frameworks. These factors collectively contribute to the increasing vulnerability of aquifers, highlighting the urgent need for sustainable groundwater management practices.
Nitrate is a naturally occurring form of nitrogen essential for plant growth (Mencio et al. 2016). However, in recent decades, intensive agricultural activities in rural areas have led to excessive leaching of nitrate from sources such as animal manure and synthetic fertilizers into groundwater (Canter 2019). Numerous factors contribute to the quantity of nitrate that infiltrates groundwater. These factors encompass both present and past land-use practices, historical and current nitrogen applications or deposition, soil type, the depth of the groundwater table, and the rate at which groundwater is replenished (Goodarzi et al. 2022). The interplay of these variables can have a significant impact on the extent of nitrogen leaching into groundwater. According to the United States Environmental Protection Agency (USEPA), the maximum allowable concentration of nitrate nitrogen (NO3–N) in drinking water is 10 mg/L, roughly equivalent to 45 mg/L of nitrate (NO3) (Ransom et al. 2017). In China, a maximum permissible concentration of 50 mg/L for nitrate in drinking water has been established (Wang et al. 2020). The assessment of aquifer vulnerability to nitrate contamination serves as a crucial tool for decision makers, aiding in the adoption of efficient management strategies to mitigate groundwater pollution. It facilitates groundwater resource allocation and assists in determining appropriate land-use practices. Moreover, it plays a vital role in raising awareness among communities about the risks associated with groundwater nitrogen contamination.
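The relation between the two units quoted above can be checked with a short calculation. The following is a minimal sketch, assuming standard atomic masses for nitrogen and oxygen; the function names are illustrative and do not come from the study.

```python
# Standard atomic masses in g/mol.
M_N = 14.007
M_O = 15.999
M_NO3 = M_N + 3 * M_O  # molar mass of the nitrate ion


def nitrate_n_to_nitrate(conc_n_mg_l):
    """Convert a concentration in mg/L as nitrogen to mg/L as nitrate."""
    return conc_n_mg_l * M_NO3 / M_N


def exceeds_threshold(conc_n_mg_l, threshold_n_mg_l=10.0):
    """Return True when a sample exceeds the nitrate-nitrogen threshold."""
    return conc_n_mg_l > threshold_n_mg_l


# 10 mg/L as N corresponds to about 44.3 mg/L as nitrate,
# which is commonly rounded to 45 mg/L.
print(round(nitrate_n_to_nitrate(10.0), 1))
```

The same threshold of 10 mg/L as nitrogen is the one used later to split samples into the two classes.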
To develop effective methods for modeling groundwater pollution and safeguarding groundwater against surface pollutants, a diverse array of models employing various techniques and methodologies has been applied globally in recent decades (Taghavi et al. 2022). Numerical simulation is effective for investigating the flow and transport of pollutants in groundwater at a small scale (Eslamian et al. 2023). However, this method demands high-quality feature indicators and is difficult to apply in large-scale groundwater pollution assessments. Therefore, standardized index systems have emerged for large-scale groundwater vulnerability (GWV) assessment. Commonly used methods for GWV assessment include well-established techniques such as DRASTIC (Aller et al. 1987), GOD (Foster 1987), SINTACS, COP (Vías et al. 2006), and EPIK (Doerfliger et al. 1999). These methods all assess GWV by considering hydrogeological characteristics and pollutant sources. The modified DRASTIC-LU model (Secunda et al. 1998), which uses a Geographic Information System (GIS) and incorporates surface cover factors, allows for a more comprehensive assessment of groundwater contamination risk. DRASTIC-LU combines the seven hydrogeological parameters of the DRASTIC framework with a land-use index: depth to water, net recharge, aquifer media, soil media, topography, impact of the vadose zone, aquifer hydraulic conductivity, and a land-use layer (Goodarzi et al. 2022; Mkumbo et al. 2022; Secunda et al. 1998; Shanmugamoorthy et al. 2023). The traditional DRASTIC-LU evaluation model computes a linear combination of indicator values with subjectively assigned weights, making it a subjective empirical model. However, the impact of complex hydrogeological environments on groundwater pollutants is nonlinear, and subjective weight models struggle to adapt to diverse hydrogeological conditions.
In contrast to complex traditional physical models and empirical index models, data-driven classification models coupled with an evaluation index system have been widely employed in regional and national-scale studies of groundwater quality assessment and other fields of hydrology, such as groundwater nitrate concentration prediction (Huang et al. 2011; Knoll et al. 2019; Ransom et al. 2022; Spijker et al. 2021), groundwater pH prediction (DeSimone et al. 2020), and the prediction of other pollutant concentrations in groundwater (Wheeler et al. 2015; Barzegar et al. 2018; Podgorski & Berg 2020). Machine learning methods such as multiple linear regression, random forest (RF) (Breiman 2001), and others are utilized for this purpose (Tianqi Chen 2016). Machine learning can effectively capture the nonlinear relationships between GWV and indicator systems, and its interpretability tools support attribution analysis. GWV can be defined as the tendency and likelihood that, under natural conditions or the influence of human activities, contaminants introduced above the aquifer reach a specific location in the groundwater system (Foster 1987; Council 1993; Aslam et al. 2018; Lahjouj et al. 2020). Hence, the probability of pollutant exceedance can be used as an indicator of the vulnerability of groundwater to that specific contaminant. The supervised learning paradigm in machine learning can use the data-label correspondence to estimate the probability that a pollutant concentration exceeds a threshold, thereby predicting the likelihood of pollutants surpassing permissible levels in water. This method addresses the limitations of linear equations in subjective weighted assessments.
The utilization of DRASTIC-LU and machine learning models in aquifer quality management appears to offer a novel approach that could mitigate numerous associated problems and reduce costs. The purpose of this study is to classify whether groundwater nitrate concentrations exceed standard threshold values using supervised learning models in the Chongqing area. This approach aims to avoid the subjectivity associated with determining the weights of factors in traditional methods such as the DRASTIC-LU model. In addition, by integrating the SHapley Additive exPlanations (SHAP) (Lundberg & Lee 2017) explainer, the correlation between the model's predictive indicators and the model output results is analyzed and interpreted to identify the drivers of GWV to nitrate. Ultimately, accurate classification probabilities of the DRASTIC-LU model for groundwater nitrate–nitrogen concentration thresholds are obtained. These probabilities serve as estimates of groundwater nitrate–nitrogen vulnerability, providing a theoretical basis for large-scale groundwater resource management.
MATERIALS AND METHODS
Study area
Chongqing, a city in southwestern China, is situated in the transitional area between the Tibetan Plateau and the plain on the middle and lower reaches of the Yangtze River, in a subtropical climate zone often swept by moist monsoons (Figure 1). Elevation is higher in the northeast and southeast than in the west, with the highest elevation in the northeast and the lowest in the west. The topography of the study area can be classified into four major units: the western part features hilly terrain on the edge of the Sichuan Basin, the central region exhibits parallel ridges and valleys with low mountains and hills, the northeastern part showcases karst low-mountain terrain in the Daba Mountains, and the southeastern area presents karst mid-mountain terrain in the Wuling Mountains.
Due to the complex geological structures and topographic conditions, the hydrogeological environment within the area is notably intricate. Based on a combination of factors including groundwater occurrence conditions, hydraulic characteristics, and aquifer properties, groundwater in the region can be categorized into four major types: carbonate rock fractured-cave water (referred to as karst water), clastic rock porous-fractured water, bedrock fractured water, and unconsolidated rock porous water. These types account for 35.90, 38.09, 18.15, and 0.64% of the total area, respectively. The remaining area is characterized by Silurian mudstone, shale, and siltstone, forming a relatively impermeable layer. The distribution and burial characteristics of groundwater in the region are closely tied to geological strata, tectonics, and topography. Groundwater resources are abundant in the surrounding mountainous areas, while resources in the central and western parts are relatively scarce. Based on differences in tectonic units and regional topographic forms, the groundwater resources of the entire municipality are classified into three zones: the Red Formation hilly hydrogeological zone in the Sichuan Basin, the parallel ridge-valley low-mountain hydrogeological zone in eastern Sichuan, and the karst low-to-mid-mountain hydrogeological zone around the basin.
DRASTIC-LU model
Parameter | Variable name | Data type | Resolution | Data source |
---|---|---|---|---|
D | Groundwater table depth | Grid | 1 km | Fan et al. (2013) |
R | Recharge | Grid | 1 km | Fick & Hijmans (2017); Zomer et al. (2022) |
A | Aquifer media | Polygons | – | Hartmann & Moosdorf (2012) |
S | Soil media | Grid | 1 km | Wieder et al. (2014) |
T | Topography | Grid | 30 m | Agency (2021) |
I | Impact of vadose zone | Grid | 1 km | Wieder et al. (2014) |
C | Aquifer conductivity | Polygons | – | Hartmann & Moosdorf (2012) |
LU | Land use | Grid | 10 m | Gong et al. (2019) |
The groundwater table depth data were sourced from the Global Groundwater Depth Database (Fan et al. 2013), with a resolution of 1 km. Precipitation data (Fick & Hijmans 2017) (PRE) and actual evapotranspiration (ET0) data (Zomer et al. 2022) were collected from the WorldClim V2.0 global climate dataset and the Global Aridity Index and Potential Evapotranspiration Database, with a spatial resolution of 1 km. Soil data and vadose zone data were obtained from the Harmonized World Soil Database v1.2 (Wieder et al. 2014), with a resolution of 1 km. Terrain data were derived from the global digital surface model dataset (AW3D) (Agency 2021), and slope data for the study area were generated using ArcGIS, with a resolution of 30 m. Conductivity data of the aquifers were obtained from vectorized hydrogeological maps (1:250,000) provided by the Chongqing Geological Survey Bureau, which were then converted into point data using grid panels. Land-use data with 10 m resolution were used in this study (Gong et al. 2019), consisting of 10 LU types: Bareland, Cropland, Forest, Grassland, Impervious surface, Shrubland, Snow, Tundra, Water, and Wetland. The detailed data sources and data types of the DRASTIC-LU model are listed in Table 1.
Depth to groundwater table (D)
Recharge (R)
Areas with high recharge rates may be more vulnerable to surface pollutants. Groundwater recharge, the process by which water is replenished into aquifers, plays a crucial role in influencing the transport and fate of contaminants. The groundwater recharge can be estimated by the difference between precipitation (PRE) (Fick & Hijmans 2017) and actual evapotranspiration (ET0) (Zomer et al. 2022). The groundwater recharge characteristics in the study area exhibit higher rates in the southeast and northeast regions, while the central and western areas show lower rates (Figure 2).
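The recharge estimate described above is a cell-by-cell difference of two co-registered grids. The following is a minimal sketch, assuming both grids share the same shape and units (for example mm per year); the sample values are invented, and the optional clipping of negative values is an assumption of this sketch, not a step stated in the study.

```python
def estimate_recharge(pre, et, clip_negative=False):
    """Cell-wise recharge estimate: precipitation minus evapotranspiration.

    pre and et are lists of rows with identical shape and units.
    clip_negative is a hypothetical option that sets negative values to zero.
    """
    if len(pre) != len(et) or any(len(p) != len(e) for p, e in zip(pre, et)):
        raise ValueError("precipitation and evapotranspiration grids must match")
    recharge = []
    for pre_row, et_row in zip(pre, et):
        row = [p - e for p, e in zip(pre_row, et_row)]
        if clip_negative:
            row = [max(value, 0.0) for value in row]
        recharge.append(row)
    return recharge


# Invented 2 x 2 example grids.
pre = [[1200.0, 1100.0], [1000.0, 950.0]]
et = [[700.0, 750.0], [800.0, 1000.0]]
print(estimate_recharge(pre, et))
```

In practice both rasters would first be resampled to the common 1 km grid so that the subtraction compares the same cells.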
Aquifer media (A)
The parameter for aquifer media delineates the characteristics of the materials within the aquifer, exerting a significant influence on the processes of pollutant attenuation. This layer was collected through well profiles and hydrogeological maps. In the study area, based on a combination of factors such as groundwater occurrence conditions, hydrodynamic characteristics, and aquifer properties, the types of groundwater in the region can be classified into four major categories: carbonate rock fracture-karst water, clastic rock pore-fracture water, bedrock fracture water, and loose rock pore water (Hartmann & Moosdorf 2012). The lithology of carbonate rock and fissure karst water mainly consists of Triassic limestone, distributed in the karst trough valleys located in the southeastern, northeastern, and central-western regions of the study area. Meanwhile, the lithology of clastic rock pore-fissure water mainly comprises Triassic and Jurassic sandstones and mudstones, primarily distributed in the central-western area with sporadic occurrences in the northeastern and southeastern parts. Fracture water in bedrock and porous water in loose rocks have relatively smaller distribution areas, appearing only in the southeastern and northeastern parts (Figure 2).
Soil media (S)
Groundwater recharge, water infiltration, contaminant transport, and the interaction between groundwater and surface water are all contingent on soil media properties. The soil type map, derived from the HWSD soil dataset, shows that soil textures in the study area include clay, loam, loamy sand, sandy clay loam, sandy loam, and silt loam (Wieder et al. 2014). Clay is mainly distributed in the southwest, northwest, and the west karst trough valley region. Loam, which has the largest distribution area, can be observed across the entire region (Figure 2).
Topography (T)
In the DRASTIC model, the topography parameter is represented by slope, which plays a crucial role in the examination of water resources and surface infiltration. This parameter directly affects the ability of water to permeate the soil. Gentler slopes encourage more substantial infiltration, heightening the potential for pollutants to migrate into the aquifer. In this study, the slope of the region was derived from the DEM using GIS software and reclassified into different types (Agency 2021). The slope in the study area ranges from 0 to 72° and is divided into six categories (Figure 2).
Impact of vadose zone (I)
The vadose zone, also known as the unsaturated zone, is the area above the water table where the pores in the soil and rock contain both air and water. This zone plays a significant role in regulating the movement and transformation of contaminants before they reach the groundwater table. In the context of groundwater pollution, the vadose zone can either mitigate or exacerbate the contamination process. Factors such as soil composition, moisture content, and the presence of organic matter influence how contaminants migrate through this zone (Wieder et al. 2014). The types of vadose zone in the study area can be categorized into five classes: clay, clay loam, loam, loamy sand, and sandy loam. The distribution of each type is illustrated in Figure 2.
Aquifer conductivity (C)
Aquifer conductivity delineates the capacity of an aquifer to transmit water. It assumes a fundamental role in dictating the dynamics of water movement within the aquifer, thereby influencing groundwater flow patterns and the intricate transport mechanisms of contaminants. Elevated conductivity levels expedite groundwater flow, potentially extending the reach of contaminant migration. Conversely, diminished conductivity can act as a natural impediment, restraining the mobilization of contaminants and serving as an inherent protective barrier. In this study, the hydraulic conductivity of the aquifer was estimated by integrating the 1:250,000 hydrogeological survey data from Chongqing with empirical values of rock hydraulic conductivity. Areas with high aquifer conductivity are mainly concentrated in the northeast and southeast, as well as in the karst trough valleys parallel to the ridges. Conversely, areas with low conductivity are predominantly found in the sandstone and mudstone sedimentary rock areas, particularly concentrated in the central-western region of the study area (Figure 2).
Land use
LU (Figure 2), as a surface cover, plays a crucial role in influencing groundwater quality. Agricultural activities and urban expansion contribute to increased fertilizer usage and pollutant discharge, which threaten groundwater quality when carried downward by infiltrating precipitation and surface runoff. Extensive research has already demonstrated the adverse impacts of human activities on the groundwater environment. We therefore calculated the proportion of each land-use category within every 1 km × 1 km grid cell and used these proportions as representative metrics for the impact of each land-use type.
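The proportional statistics can be illustrated with a block aggregation over a categorical raster. The following is a minimal sketch on a toy grid; the function name and the block size of 2 are illustrative assumptions (aggregating 10 m pixels to 1 km cells would correspond to a block size of 100).

```python
from collections import Counter


def landuse_proportions(grid, block):
    """Return {(block_row, block_col): {land_use_class: fraction}}.

    grid is a list of rows of class labels; block is the number of fine
    pixels per coarse cell along each axis.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    result = {}
    for r0 in range(0, n_rows, block):
        for c0 in range(0, n_cols, block):
            # Collect every fine pixel that falls inside this coarse cell.
            cells = [
                grid[r][c]
                for r in range(r0, min(r0 + block, n_rows))
                for c in range(c0, min(c0 + block, n_cols))
            ]
            counts = Counter(cells)
            result[(r0 // block, c0 // block)] = {
                name: count / len(cells) for name, count in counts.items()
            }
    return result


# Toy 2 x 4 raster aggregated into two 2 x 2 blocks.
grid = [
    ["Cropland", "Cropland", "Forest", "Forest"],
    ["Cropland", "Impervious surface", "Forest", "Water"],
]
props = landuse_proportions(grid, block=2)
print(props[(0, 0)])
```

Each coarse cell then carries one fraction per land-use type, and these fractions can enter the models as separate predictor columns.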
Groundwater nitrate concentration (NO3–N)
Machine learning models
Random forest
The vulnerability of groundwater can be interpreted as the probability that groundwater becomes polluted. To address the impact of subjective factors on weight assignment, we employ the RF algorithm (Breiman 2001) combined with a threshold for groundwater nitrate concentration to assess the vulnerability of groundwater. RF, a powerful ensemble learning technique, has been widely applied in the field of hydrology. It has found extensive use in tasks such as predicting large-scale groundwater pollutant concentrations (Knoll et al. 2019; DeSimone et al. 2020; Ransom et al. 2022) and GWV assessment (Sajedi-Hosseini et al. 2018; Lahjouj et al. 2020). Comprising multiple decision trees, each trained on random subsets of the data, RF mitigates overfitting and enhances prediction accuracy.
Extreme gradient boosting
XGBoost (eXtreme Gradient Boosting) (Tianqi Chen 2016) is an efficient and scalable implementation of gradient boosting that, like RF, can be applied to classification and regression tasks. It is a powerful machine learning algorithm that has gained widespread popularity due to its speed, accuracy, and flexibility. XGBoost incorporates regularization techniques to prevent overfitting and improve generalization. It includes L1 (Lasso) and L2 (Ridge) regularization terms in the objective function, which penalize model complexity by adding the magnitude of the leaf weights to the loss function. XGBoost has widespread application in domains such as groundwater pollution (Belitz & Stackelberg 2021; Ransom et al. 2022) and groundwater resource assessment.
Support vector machine
Support vector machine (SVM) (Suykens & Vandewalle 1999) is a supervised machine learning algorithm that is commonly used for binary classification tasks. SVM works by finding the optimal hyperplane that best separates the data into different classes. It is widely used in various domains such as text classification, image recognition, and bioinformatics. SVM, as a classic binary classification model, effectively addresses the threshold division problem of groundwater pollutants (El Bilali et al. 2021).
In this study, we employ the RF, XGBoost, and SVM algorithms, using the DRASTIC-LU indicators as input features and the exceedance of the groundwater nitrate concentration threshold as the prediction target. The predicted probabilities from the classification model are then used as the GWV index.
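The idea of turning a threshold into a class label and an ensemble output into a vulnerability index can be sketched as follows. This is an illustration only: the decision rules below are hand-written stand-ins for trained trees, the feature names and cut-off values are invented, and the fraction of votes is a simplification of how an ensemble classifier derives a class probability.

```python
THRESHOLD_N = 10.0  # mg/L as nitrogen, the class boundary used in this study


def make_label(no3_n_mg_l):
    """Binary target: 1 if the sample exceeds the threshold, else 0."""
    return 1 if no3_n_mg_l > THRESHOLD_N else 0


def vote_fraction(trees, features):
    """Share of ensemble members that predict class 1 for one grid cell."""
    votes = [tree(features) for tree in trees]
    return sum(votes) / len(votes)


# Hypothetical rules standing in for fitted decision trees.
trees = [
    lambda x: 1 if x["conductivity"] > 5.0 else 0,
    lambda x: 1 if x["cropland_fraction"] > 0.4 else 0,
    lambda x: 1 if x["depth_to_water"] < 10.0 else 0,
    lambda x: 1 if x["recharge"] > 400.0 else 0,
]

cell = {
    "conductivity": 8.0,
    "cropland_fraction": 0.6,
    "depth_to_water": 25.0,
    "recharge": 450.0,
}
vulnerability_index = vote_fraction(trees, cell)
print(make_label(12.3), vulnerability_index)
```

A value close to 1 means that most ensemble members expect the threshold to be exceeded in that cell, which is the sense in which the predicted probability serves as a vulnerability index.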
Model evaluation indices
The kappa coefficient is a statistical measure used to assess the consistency between classifiers or raters, particularly in cases involving classification problems. It takes into account the agreement in classification results that arises by chance, thus providing a more accurate evaluation even for imbalanced classification data:

$$\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_o = \frac{1}{n}\sum_{i=1}^{k} n_{ii}, \qquad p_e = \frac{1}{n^{2}}\sum_{i=1}^{k} r_i c_i$$

where $p_o$ represents the observed probability of agreement between classifiers or raters, while $p_e$ represents the expected probability of agreement between classifiers or raters. $n_{ii}$ represents the number of consistent classifications for the ith category by the classifier or rater. $n$ represents the total number of samples. $r_i$ represents the total sum of rows for the ith category. $c_i$ represents the total sum of columns for the ith category. $k$ represents the total number of categories.
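The kappa definition can be made concrete with a small computation from a confusion matrix. The following is a minimal sketch; the variable names p_o and p_e for the observed and expected agreement are chosen here for illustration, and the example matrix is invented.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: truth, columns: prediction)."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    # Observed agreement: share of samples on the diagonal.
    p_o = sum(confusion[i][i] for i in range(k)) / n
    # Expected agreement: products of matching row and column totals.
    row_totals = [sum(confusion[i]) for i in range(k)]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return (p_o - p_e) / (1 - p_e)


# Invented two-class example with 100 samples.
matrix = [[45, 5], [10, 40]]
kappa = cohen_kappa(matrix)
print(round(kappa, 3))
```

For this matrix the observed agreement is 0.85 and the expected agreement is 0.50, so kappa is 0.70.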
An ROC curve is a plot of true positive rate (sensitivity) against false positive rate (1 − specificity) for different threshold values. AUC measures the overall performance of the model across all possible classification thresholds. A higher AUC value indicates better model performance.
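The AUC can also be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following is a minimal sketch that computes the AUC from this pairwise definition; the labels and scores are invented, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented example: three of the four positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

Because it depends only on the ranking of the scores, the AUC does not require choosing a single classification threshold.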
Data process
RESULTS AND DISCUSSION
Performance of models
Models | Dataset | Class | Precision | F1 score | Recall | Accuracy | Kappa | AUC |
---|---|---|---|---|---|---|---|---|
RF | Train | class_0 | 0.966 | 0.960 | 0.953 | 0.960 | 0.919 | 0.970 |
RF | Train | class_1 | 0.954 | 0.960 | 0.966 | | | |
RF | Test | class_0 | 0.938 | 0.928 | 0.918 | 0.929 | 0.857 | 0.948 |
RF | Test | class_1 | 0.919 | 0.929 | 0.940 | | | |
XGB | Train | class_0 | 0.941 | 0.937 | 0.933 | 0.937 | 0.874 | 0.968 |
XGB | Train | class_1 | 0.934 | 0.938 | 0.942 | | | |
XGB | Test | class_0 | 0.898 | 0.886 | 0.874 | 0.887 | 0.775 | 0.932 |
XGB | Test | class_1 | 0.877 | 0.889 | 0.901 | | | |
SVM | Train | class_0 | 0.845 | 0.841 | 0.837 | 0.842 | 0.684 | 0.923 |
SVM | Train | class_1 | 0.838 | 0.843 | 0.847 | | | |
SVM | Test | class_0 | 0.829 | 0.826 | 0.824 | 0.827 | 0.654 | 0.886 |
SVM | Test | class_1 | 0.825 | 0.827 | 0.830 | | | |
Note: Class_0 and Class_1, respectively, represent groundwater nitrate concentrations less than 10 mg/L (as nitrogen) and greater than 10 mg/L (as nitrogen).
In this study, all three classifiers used in model development performed well. RF achieved the highest accuracy (training: 96.0% and testing: 92.9%), followed by XGB (training: 93.7% and testing: 88.7%), and SVM had the lowest accuracy (training: 84.2% and testing: 82.7%). The kappa statistics indicate that the models are reliable and that the agreement between predictions and observations is not due to chance. The kappa statistics for the RF and XGB models range from 0.77 to 0.92, indicating good model performance, while the kappa statistic for the SVM model ranges from 0.65 to 0.68, which is poorer than for RF and XGB. The AUC values, shown in Figure 5, range from 0.88 to 0.97 for the three models. Models with AUC values greater than 0.8 are considered good models, indicating that all three models perform well on this metric. For a more detailed presentation of the model evaluation results, the precision, F1 score, and recall for each class (class_0 and class_1) in both the training and testing datasets are shown in Table 2.
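The F1 scores in Table 2 follow from the reported precision and recall as their harmonic mean. The following is a minimal sketch that reproduces the two RF test-set values from the table; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# RF on the test set, values taken from Table 2.
f1_class_0 = f1_score(0.938, 0.918)
f1_class_1 = f1_score(0.919, 0.940)
print(round(f1_class_0, 3), round(f1_class_1, 3))
```

Both results round to the tabulated values of 0.928 and 0.929.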
Uncertainty analysis
Dominant feature of groundwater vulnerability to nitrate
To effectively carry out groundwater pollution prevention and control, as well as optimize groundwater monitoring networks, it is necessary to analyze and model the influencing factors of GWV. Therefore, we utilized the DRASTIC-LU model and employed the SHAP interpreter to analyze the impact of features on the model. SHAP generates a value for each input feature (referred to as a SHAP value), indicating how much that feature contributes to the prediction of a specific data point. Some factors positively influence the prediction probability, while others have a negative impact on it.
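What a SHAP value expresses can be illustrated by computing exact Shapley values for a small model by enumerating all feature orderings. This is a sketch of the underlying definition, not of the SHAP library, which relies on faster algorithms and a background dataset; the toy model, its coefficients, and the single baseline used here are invented for illustration.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for ordering in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in ordering:
            # Switch one feature from its baseline value to its actual value.
            current[name] = instance[name]
            updated = model(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(x):
    """Hypothetical score that rises with conductivity and cropland and falls with slope."""
    return (
        0.1
        + 0.05 * x["conductivity"]
        + 0.3 * x["cropland_fraction"]
        - 0.01 * x["slope"]
        + 0.02 * x["conductivity"] * x["cropland_fraction"]
    )


instance = {"conductivity": 6.0, "cropland_fraction": 0.7, "slope": 20.0}
baseline = {"conductivity": 2.0, "cropland_fraction": 0.2, "slope": 10.0}
phi = shapley_values(toy_model, instance, baseline)
print({name: round(value, 4) for name, value in phi.items()})
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the sign of each value shows whether the feature pushes the prediction up or down.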
Hydraulic conductivity (C), identified as the most important variable, and aquifer lithology were positively related to GWV. These two predictive indicators reflect how easily surface runoff infiltrates into groundwater. In addition, groundwater table depth, vadose zone influence, and soil parameters also rank among the top 10 most influential indicators. Groundwater depth is negatively correlated with vulnerability level, and vadose zone media with lower porosity are likewise associated with lower groundwater nitrate vulnerability. These conditions promote denitrification processes. The SHAP values and the magnitude of feature importance show that GWV to nitrate increases with aquifer permeability. This implies that high permeability facilitates the conversion of surface runoff into groundwater and, through infiltration, makes it easier for nitrate to enter the groundwater body.
Among the top 10 most influential indicators, meteorological, LU, and topographical parameters are also included. Groundwater recharge and the proportion of water bodies are positively correlated with GWV indicators, while slope gradient is negatively correlated. These predictive indicators reveal that in areas with high groundwater recharge, there is an increased leaching of nitrates from the soil, leading to their transfer into the groundwater and an elevated risk of groundwater nitrate pollution. In terms of LU, the proportions of both cultivated land and urban land are positively correlated with GWV level, indicating the positive impact of nitrogen input from urban water consumption, sewage discharge, and agricultural activities on the increase in groundwater nitrate content.
Model result verification
Groundwater vulnerability to nitrate
The tuned models are used to predict the probability that the groundwater nitrate concentration exceeds the threshold. This process quantifies the vulnerability of groundwater to nitrate pollution. A comparison of classification methods using the Spearman rank correlation (ρ), Eta coefficient (η), and F-statistics indicated that the equal interval method (Sajedi-Hosseini et al. 2018; Lahjouj et al. 2020) is the most appropriate for the three machine learning models (Table 3). Combining groundwater nitrate observation data with the groundwater nitrate vulnerability zoning yields box plots of nitrate concentrations for the different vulnerability zones. Figure 9 shows that different vulnerability levels correspond to different ranges of nitrate concentrations: observed nitrate concentrations are lower in the lower vulnerability zones and higher in the higher vulnerability zones.
Classification method | Spearman ρ (RF) | Spearman ρ (XGB) | Spearman ρ (SVM) | Eta η (RF) | Eta η (XGB) | Eta η (SVM) | ANOVA F (RF) | ANOVA F (XGB) | ANOVA F (SVM) |
---|---|---|---|---|---|---|---|---|---|
Equal interval | 0.6726 | 0.6673 | 0.6000 | 0.2861 | 0.2774 | 0.2641 | 1.8722 | 1.8503 | 1.2658 |
Quantile | 0.6571 | 0.6628 | 0.5733 | 0.2701 | 0.2681 | 0.2617 | 1.4473 | 1.4483 | 1.2117 |
Natural break | 0.6480 | 0.6616 | 0.5962 | 0.2734 | 0.2649 | 0.2614 | 1.4272 | 1.4375 | 1.2000 |
Geometric interval | 0.6745 | 0.6655 | 0.5840 | 0.2778 | 0.2738 | 0.2642 | 1.8414 | 1.7405 | 1.2136 |
Note: ANOVA, analysis of variance.
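The equal interval scheme compared in Table 3 divides the range of the predicted probability into classes of equal width. The following is a minimal sketch for five vulnerability levels, assuming the intervals span the full range from 0 to 1 (GIS software may instead use the observed minimum and maximum) and assuming names for the intermediate levels, which are not spelled out in the text.

```python
# Level names from extremely low to extremely high; the middle names are assumed.
LEVELS = ["extremely low", "low", "moderate", "high", "extremely high"]


def equal_interval_level(probability, n_classes=5):
    """Map a probability in [0, 1] to a class index using equal-width bins."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    # A probability of exactly 1.0 is assigned to the top class.
    return min(int(probability * n_classes), n_classes - 1)


for p in (0.05, 0.35, 0.5, 0.79, 1.0):
    print(p, LEVELS[equal_interval_level(p)])
```

With five classes each bin is 0.2 wide, so a predicted probability of 0.79 falls in the fourth level.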
Compared to the three machine learning models mentioned earlier, traditional index weighting models tend to predict larger areas for high and very high vulnerability levels, while smaller areas are predicted for low vulnerability levels. The percentages of the five GWV levels in the predictions of the best model (RF) are as follows: 30.63, 24.05, 25.96, 16.35, and 3.01%, respectively.
All three machine learning models indicate higher vulnerability in the central-western urban areas, karst trough valleys, and the southeastern and northeastern regions. Comparison with LU and hydrogeological maps shows that areas of high vulnerability are concentrated in urban and agricultural land-use zones, as well as in carbonate rock regions, which are highly correlated with groundwater recharge parameters. Because karst areas lack natural impermeable or filtering layers, surface water and the pollutants it carries can easily enter aquifers or underground rivers directly through karst features such as caves. This indicates that agricultural activities (excessive use of nitrogen fertilizers), urban sewage, and highly permeable aquifers are the main factors controlling groundwater nitrogen pollution, consistent with previous studies (Knoll et al. 2019a; Hartmann et al. 2021; Goodarzi et al. 2022; Mkumbo et al. 2022; Shanmugamoorthy et al. 2023). Rural areas in the northeastern and southeastern parts of Chongqing, where groundwater vulnerability to nitrate is high, often rely on household wells for water supply. The elevated vulnerability of groundwater, coupled with nitrogen emissions from human activities, poses a high risk of nitrate contamination in groundwater. Long-term consumption of nitrate-contaminated groundwater in these areas may lead to conditions such as methemoglobinemia, leukemia, gastrointestinal cancers, and other diseases (Yüksel et al. 2021; Topaldemir et al. 2023). To ensure the safety of drinking water in areas with high GWV, relevant authorities should take corresponding measures to manage groundwater in these areas. These may include avoiding excessive use of nitrogen fertilizers, raising awareness among local residents about safe water practices, prioritizing the use of high-quality regional groundwater, and implementing targeted water treatment measures.
Subsequent research should sample domestic drinking water wells in high-vulnerability areas to validate the model results. In addition, monitoring points should be added in these areas to protect drinking water sources.
CONCLUSIONS
Assessing GWV has emerged as a crucial tool for the sustainable management of groundwater resources, and there is a growing demand for novel methods that improve the accuracy of these assessments. This study takes Chongqing Municipality as the research area and integrates the traditional DRASTIC-LU GWV framework with machine learning techniques and the distribution of groundwater nitrate concentrations. The approach quantifies GWV without subjective weighting. Evaluation of the RF, XGB, and SVM models using accuracy, precision, recall, F1 score, kappa value, and AUC leads to the following conclusions:
(1) Among the three machine learning models, the RF model performs best, achieving the highest accuracy (92.9%), kappa value (0.857), and AUC (0.948) on the test dataset. Further validation was conducted by analyzing the correlation between sampled groundwater nitrate concentrations and the groundwater nitrate vulnerability index. The results confirmed that both the RF model and the XGB model outperformed the traditional index weighting model, with the RF model achieving the highest R² value (R² = 0.803). These analyses affirm that the RF model is the most suitable of the tested models for predicting the groundwater nitrate vulnerability index in the study area.
(2) The SHAP interpreter was used to explain the contributions of the DRASTIC-LU input features. The results indicate that aquifer conductivity, lithology, groundwater recharge, and cultivated and urban LU are the most influential indicators of GWV to nitrate.
(3) A GWV assessment was conducted across the entire Chongqing region using a 1 km × 1 km grid. The vulnerability levels, from extremely low to extremely high, account for 30.63, 24.05, 25.96, 16.35, and 3.01% of the region, respectively. Areas with high and extremely high vulnerability levels are concentrated in the southeastern, northeastern, and central urban areas, particularly in regions with high urban development intensity and karst trough valleys.
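For readers who wish to reproduce the threshold-based evaluation metrics named above, the following minimal sketch computes accuracy, precision, recall, F1 score, and Cohen's kappa from a binary confusion matrix. The function name `binary_metrics` and the confusion-matrix counts are hypothetical and are not the study's test results. AUC is omitted because it requires predicted scores rather than class counts.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from binary confusion-matrix counts.

    Degenerate matrices (e.g. no predicted positives) are not handled here.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Agreement expected by chance, from the marginal totals of each class.
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)
    p_neg = ((fn + tn) / total) * ((fp + tn) / total)
    p_expected = p_pos + p_neg
    kappa = (accuracy - p_expected) / (1 - p_expected)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical counts for a balanced 28-sample test set (not from the study).
metrics = binary_metrics(tp=13, fp=1, fn=1, tn=13)
```

Kappa corrects accuracy for chance agreement; for a balanced binary problem with symmetric errors, as in this hypothetical example, it reduces to 2 × accuracy − 1.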
Although the machine learning models performed well in quantifying GWV, the nitrate distribution data used in this study were themselves generated by machine learning simulations, which introduces a degree of uncertainty. In addition, due to financial constraints, we were unable to validate the results in unsampled areas. Such validation will be pursued in future research on groundwater pollution prevention and control.
ACKNOWLEDGEMENTS
We gratefully acknowledge the financial support provided by the Chongqing Science and Technology Development Foundation (Project Number: cstc2020jcyj-msxmX1074) and the self-funded resources of the Chongqing Institute of Geology and Mineral Resources.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.