Abstract
Population growth and overexploitation of water resources pose ongoing pressure on groundwater resources. This study compares the capability of four data mining methods, namely, boosted regression tree (BRT), random forest (RF), multivariate adaptive regression spline (MARS), and support vector machine (SVM), for water spring potential mapping (WSPM) in Al-Karak Governorate, east of the Dead Sea, Jordan. Overall, 200 spring locations and 13 predictor variables were considered for model building and validation. The four models were calibrated and trained on 70% of the spring locations (i.e., 140 locations) and their predictive accuracy was evaluated on the remaining 30% of the locations (i.e., 60 locations). The area under the receiver operating characteristic curve (AUROCC) was employed as the performance measure for evaluating the accuracy of the constructed models. The AUROCC-based accuracy assessment revealed that the RF model (AUROCC = 0.748) performed better than any other model (AUROCC SVM = 0.732, AUROCC MARS = 0.727, and AUROCC BRT = 0.689).
HIGHLIGHTS
Groundwater potential zoning and mapping is a critical step in identifying and managing water resources.
Multicriteria decision-making methods can be employed as fast and efficient techniques in decision-making.
The possibility of the presence of multicollinearity among the 13 predictors of the presence of water springs was examined.
INTRODUCTION
Rapid population growth, frequent refugee crises, limited surface water resources, and overpumping of groundwater have resulted in the irreversible loss of natural resources (Brans 2001; Lightbody 2009; Brückner et al. 2021). Groundwater is a vital source of water for agricultural, domestic, and industrial uses (Kemper 2004; Gyau-Boakye et al. 2008). Its low salinity, chemical and temperature stability, low pollution, and high reliability make it highly dependable, particularly in arid and semi-arid regions, which are, in general, characterized by limited water resources (Jha et al. 2009). Because of its effect on the ecological potential of the surrounding area, groundwater plays an important role in economic growth, biodiversity, and public health (Jha et al. 2009). Groundwater accounts for around 4% of all the water actively circulating in the hydrological cycle. Nevertheless, nearly 40% of the population of the world relies on groundwater resources for drinking (Rahmati et al. 2018).
Groundwater potential zoning and mapping is a critical step in identifying and managing water resources (Al-Fugara et al. 2020a, 2020b). Traditional methods of mapping water spring potential are often time- and resource-consuming, and given their nondigital nature, they do not allow for compiling data in databases to help water resource planners and managers in decision-making (Mukherjee et al. 2012; Gumma & Pavelic 2013). Therefore, multicriteria decision-making methods like the analytic hierarchy process, machine learning, and data mining, coupled with remote sensing and geographic information systems, can be employed as fast and efficient techniques for this purpose (Kumar et al. 2014; Al-Shabeeb 2015; Naghibi & Pourghasemi 2015; Agarwal & Garg 2016; Naghibi et al. 2016; Al-Shabeeb 2018; Al-Shabeeb et al. 2018; Mahato & Pal 2019). In recent years, this approach has proved to be useful, and it has been used by researchers and water resource planners for preparing groundwater potential maps (Miraki et al. 2019; Al-Fugara et al. 2020c; Prasad et al. 2020).
Data-driven techniques have received more attention than expert-based methods for groundwater potential mapping (GPM). The data-driven approaches assign potential probabilities (e.g., in GPM) or degrees of sensitivity (e.g., in flood susceptibility mapping) to specific locations, which may be represented by the corresponding pixels in the image of a region (Pourghasemi & Beheshtirad 2015; Rahmati et al. 2016). In contrast, the expert-based approaches rely on the opinion of experts to estimate the probability of the event or the degree of sensitivity (Kumar et al. 2014; Mahato & Pal 2019). Errors are likely in the second approach because the experts' opinions may contradict one another. Therefore, the data-driven approaches can have lower uncertainty than the expert-based ones (Ahmadlou et al. 2020).
So far, researchers and modelers have used several data-driven methods, including the artificial neural network, support vector machine (SVM), and adaptive neuro-fuzzy inference system (Chen et al. 2019a, 2019b; Tien Bui et al. 2019; Nguyen et al. 2020). These methods use various variables as predictors for building the models. In general, the predictors (or factors) can differ from one study location to another and which predictors to use depends on the availability and comprehensiveness of the relevant, site-specific data. Generally speaking, these factors can be classified into three groups, namely, topographical factors (e.g., elevation), hydrological factors (e.g., the topographic wetness index (TWI)), and geological factors (e.g., lithology).
Although several modeling approaches have been used for GPM, not many studies have compared them with each other. Comparative studies can help water resource planners and managers adopt the most accurate models. A review of the literature shows that ensemble models are gaining popularity (Naghibi et al. 2017; Naghibi et al. 2019). By iterating modeling step(s) using bootstrap sampling and similar methods, this modeling approach produces a diverse range of models whose predictions are combined by voting. The present study uses boosted regression tree (BRT) and random forest (RF), which are ensemble models. Predictions of these two modeling approaches are then compared with predictions of two well-known data mining models: the SVM and multivariate adaptive regression spline (MARS) models.
In the rest of this article, Section 2 describes the study area and the factors considered in water spring potential mapping (WSPM). Section 3 illustrates the adopted modeling methods. Then, Section 4 presents the results of this study and discusses them. Finally, Section 5 presents the study conclusions.
MATERIALS AND METHODS
Study area
Al-Karak Governorate has a total area of about 3,702 km². The land area is nearly 3,495.40 km²; this figure excludes the part of the Dead Sea that belongs, according to the administrative divisions of the country, to this governorate, which amounts to almost 206.6 km² (Figure 1).
Geologically, this study area is part of the fault valley. Therefore, the Dead Sea transform rift influences its geology and geomorphology. Elevations of this study area drop from 1,301 m above sea level in its eastern parts to 416 m below sea level in its western parts at the shoreline of the Dead Sea.
The geomorphology of the governorate of Al-Karak is categorized into three distinctive units: the highland toes in the eastern parts of the governorate; the plateau and highlands, which cover most of the study area in its middle parts; and the steep slopes, which lie in the western parts of the governorate and extend to the shoreline of the Dead Sea.
In the study area, there are seven wadis (i.e., valleys) that drain surface water, which originates mostly from rainfall floods, to the Dead Sea. These wadis are Karak, Dhira, Ibn-Hammad, Hesa, Numera, Issal, and Mujib wadis. In recent years, the volume of surface water discharged from springs has dropped sharply (Al-Weshah 2000; Abu Ghazleh et al. 2009). According to Arabtech-Jardaneh (1996), the annual discharge from springs in the study area is estimated at 7 × 10⁶ to 9 × 10⁶ m³. However, the Water Authority of Jordan, the governmental authority managing the water sector in Jordan, estimated the annual discharge from 150 springs in the area in the years 1986/1987 and 2010/2011 at 58,811 m³ and 11,006 m³, respectively.
The study area has a Mediterranean climate, which is characterized by hot, dry summers and cold, wet winters. The average temperature varies from −2 °C in January to 42 °C in May. The average annual rainfall depth ranges from 52 mm in the western parts of Al-Karak Governorate to more than 400 mm at the heights in the middle parts of the governorate, with an annual average rainfall depth of about 345 mm (Abed 2000).
Formation symbol | Era | Sub era | Description |
---|---|---|---|
CS | Paleozoic | Cambrian | Massive brownish sandstone |
Ks | Mesozoic | Cretaceous | Sandy limestone and dolomite marl |
Ks2 | Mesozoic | Cretaceous | Limestone sandy limestone, dolomite nodular limestone |
Ks3 | Mesozoic | Cretaceous | Chalk, marl bituminous limestone, phosphorite |
Os3 | Paleozoic | Ordovician | Fine-grained sandstone, sandy shale, marine deposit |
pCs | Precambrian | – | Slate–greywacke series and Saramuj conglomerate |
Qb | Cenozoic | Quaternary | Basaltic flows |
Qs5 | Cenozoic | Quaternary | Terrestrial, fluviatile, and lacustrine sediments |
TR | Mesozoic | Triassic | Limestone sandy limestone and locally gypsum |
Ts3 | Cenozoic | Tertiary | Sandstone, conglomerate, marl, and evaporites |
Methodology
Dataset
This study first attempted to identify the most appropriate factors for water spring potential assessment and prediction. To this end, a list of the influential factors was prepared based on a literature review and then refined based on the availability of suitable data. Overall, 13 of the most influential factors were selected: slope, elevation, profile curvature, plan curvature, lithology, rainfall depth, fault density, soil texture, TWI, distance to roads, distance to drainage, distance to fault, and distance to rivers (Manap et al. 2014; Chen et al. 2018; Khosravi et al. 2018; Lee et al. 2018; Al-Fugara et al. 2020a). The type and amount of precipitation often vary by elevation. Higher altitudes often receive precipitation in the form of snow and have lower evapotranspiration than lower altitudes. In this case, the gradual melting of the snow in the warm season feeds the groundwater basins. At the same time, the areas of the drainage basins of aquifers decrease with elevation. That is, as the elevation increases, less ground area is exposed to the falling snow. Elevation is therefore a two-sided modeling parameter with opposing effects.
Slope, too, is an effective predictor in terms of both the amount of runoff produced by precipitation and ground permeability. Usually, faster runoff and lower ground permeability are associated with steeper slopes. Hence, this parameter has a negative impact on aquifer recharge. The slope map was prepared in this study based on a 30 m ASTER DEM of the study area using the surface analysis function and the slope tool in the ArcGIS software environment. Aspect has a great influence on hydrological, and other, processes and phenomena, including the melting of snow, diversity of vegetation, and evapotranspiration, thus affecting groundwater accumulation. Evaporation is more intense, and snow melting is faster in areas exposed to direct sunlight. The aspect map too was derived from the DEM.
The lithology of the study area was mapped using geological maps prepared by the Ministry of Energy and Natural Resources, Jordan. Geological formations and rocks play a critical role in the development of groundwater resources. Permeable and porous geological formations consisting of limestone, sandstone, and conglomerate are the most effective formations in accumulating groundwater, whereas marl is the least effective in this respect. The TWI is another factor that is often used in GPM studies as an indicator of soil moisture.
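The TWI is commonly defined as ln(a / tan β), where a is the specific catchment area (upslope contributing area per unit contour width) and β is the local slope angle. The following is a minimal Python sketch of that definition, not the workflow used in this study; the function name `twi`, its arguments, and the `min_slope_deg` floor for flat cells are assumptions introduced here for illustration.

```python
import math


def twi(specific_catchment_area_m, slope_deg, min_slope_deg=0.1):
    """Topographic wetness index ln(a / tan(beta)) for a single cell.

    specific_catchment_area_m: upslope area per unit contour width (m).
    slope_deg: local slope in degrees.
    min_slope_deg: hypothetical floor that avoids division by zero on flat cells.
    """
    slope = max(slope_deg, min_slope_deg)
    return math.log(specific_catchment_area_m / math.tan(math.radians(slope)))


# For the same contributing area, a gentle slope gives a higher index
# (wetter cell) than a steep slope.
gentle = twi(100.0, 5.0)
steep = twi(100.0, 30.0)
```

Higher values indicate cells that collect more upslope water and drain it more slowly, which is why the index is used as a soil moisture proxy.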
MACHINE LEARNING TECHNIQUES
Boosted regression tree
Although decision trees were first developed several decades ago, they can still be categorized among the new generation of data mining techniques. The BRT is a tree-based model that combines the Classification and Regression Tree (CART; Breiman et al. 1984) model with the boosting technique (Elith et al. 2008). It combines the results of multiple weak models to obtain a stronger model (Zhang et al. 2016). Boosting is a simple and effective process that creates numerous models and combines their results to increase the overall accuracy (Schwenk & Bengio 2000). The idea behind this method is that averaging the results of multiple models is more reliable and accurate than using a single model. However, the BRT method does not use all of the training samples to create each of the trees. Instead, each tree is developed using a random subset of the training samples (Elith et al. 2008). The performance of BRT models is substantially influenced by parameters such as the learning rate, tree complexity, bagging rate, the minimum number of observed samples in the leaf nodes, and the number of trees (Elith et al. 2008). The learning rate limits the impact of each tree on the overall model. It must be a real number in the range of 0–1 (Elith et al. 2008); in practice, it is frequently set to a value between 0.001 and 0.1. Decreasing the learning rate increases the number of trees required and reduces the prediction error (Elith et al. 2008). Consequently, choosing a lower learning rate is preferable, provided that there are enough training samples and time to train the models. In the bagging process, a number of training samples are selected randomly without replacement in each iteration (Skurichina & Duin 2002). An appropriate value for the bagging rate lies between 0.5 and 0.75.
One of the criteria commonly used to stop building new trees is the number of observed samples in the leaf nodes. The number of trees in the final model is determined based on the aforementioned parameters and cross-validation. The BRT model generally performs better with more than 1,000 trees than with fewer. The ability to handle various types of predictor variables, the absence of any need for outlier removal, and the ability to model nonlinear relationships among the variables are some of the advantages of this model (Elith et al. 2008).
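The roles of the learning rate, the bagging rate, and the number of trees can be illustrated with a small boosting loop. The sketch below is a toy, pure-Python version for squared-error regression on one-dimensional synthetic data with single-split trees (stumps); it is not the implementation used in this study, and the names `fit_stump` and `boost`, the default parameter values, and the data are assumptions made for illustration.

```python
import random


def fit_stump(xs, residuals):
    """Least-squares regression stump (one split) on 1-D inputs."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=200, learning_rate=0.1, bag_fraction=0.5, seed=0):
    """Stochastic boosting: each stump fits residuals on a random subset."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    trees = []
    n_bag = max(2, int(bag_fraction * len(xs)))
    for _ in range(n_trees):
        idx = rng.sample(range(len(xs)), n_bag)  # sampling without replacement
        sub_x = [xs[i] for i in idx]
        sub_res = [ys[i] - preds[i] for i in idx]
        if len(set(sub_x)) < 2:
            continue  # no valid split in this subset
        tree = fit_stump(sub_x, sub_res)
        trees.append(tree)
        # The learning rate shrinks the contribution of every tree.
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)


def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic toy data (an assumption, unrelated to the spring dataset).
xs = [i / 20 for i in range(40)]
ys = [x * x for x in xs]
model = boost(xs, ys)
baseline_mse = mse(lambda x: sum(ys) / len(ys), xs, ys)
boosted_mse = mse(model, xs, ys)
```

With a smaller `learning_rate`, each tree corrects less of the remaining error, so more trees are needed to reach the same fit, which matches the trade-off described above.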
Random forest
The RF is an ensemble data mining technique that was developed by Breiman (2001). It may be considered an extension of the CART model (Breiman 2001). In the CART model, the training set is repeatedly split into subsets to determine the relationship between the dependent variable and each independent variable and specify the target variable (Breiman et al. 1984). Unlike other tree-based methods in which a limited number of decision trees are built, in the RF method, hundreds or even thousands of decision trees are built. In essence, the RF is an ensemble learning technique that uses a set of weak learners to create a stronger model. Various forms of bootstrapping are used in the development of RF models. In addition, a random subset of input variables is used to build each tree. Using bootstrapping, a large number of n-sample subsets of the initial training dataset are created via random sampling with replacement. Approximately one-third of the samples are left out of each bootstrap sample; these are the out-of-bag (OOB) samples, which are used for significant variable selection and unbiased error estimation (Breiman 2001). A tree is then developed from each bootstrap sample. During the process of building the trees, m variables are selected at each tree branch from the whole set of M independent variables to split the tree. In regression models, the recommended m/M ratio is 1/3, while for classifier models, the recommended value of m is √M (Biau & Scornet 2016). Once all the trees have been built, the test samples are presented to the trees. Each tree generates an output for each input vector. The final output is the average of all these outputs.
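The "approximately one-third" OOB share follows from sampling n items with replacement: the probability that a given sample is never drawn is (1 − 1/n)ⁿ, which approaches 1/e ≈ 0.368 for large n. The Python sketch below checks this by simulation and also encodes the usual defaults for the number of candidate variables per split (M/3 for regression, √M for classification). The function names and the training-set size of 280 (140 spring plus 140 non-spring locations) are used here only for illustration.

```python
import math
import random


def mean_oob_fraction(n_samples, n_trees, seed=0):
    """Average share of samples left out of a bootstrap sample of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
        total += 1.0 - len(in_bag) / n_samples
    return total / n_trees


def default_mtry(n_predictors, task):
    """Common default number of candidate variables per split."""
    if task == "regression":
        return max(1, n_predictors // 3)
    if task == "classification":
        return max(1, int(math.sqrt(n_predictors)))
    raise ValueError("task must be 'regression' or 'classification'")


oob = mean_oob_fraction(n_samples=280, n_trees=500)
mtry_classification = default_mtry(13, "classification")
mtry_regression = default_mtry(13, "regression")
```

With 13 predictors, the classification default is 3 candidate variables per split.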
Multivariate adaptive regression spline
The MARS is an efficient data mining algorithm that models the nonlinear relationships among the predictive variables and the dependent variable in two basic steps (Friedman 1991): forward pass and backward pass.
The forward pass
The backward pass
In this equation, d is the cost associated with each function and M is the number of the nonzero terms (i.e., basis functions).
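As an illustration of the two ingredients named above, the Python sketch below shows the paired hinge (basis) functions that the forward pass adds at a knot and a generalized cross-validation (GCV) score of the kind the backward pass uses to prune terms. The exact form of the effective number of parameters varies between implementations; the sketch assumes C(M) = M + d·M, and the function names and the default d = 3 are placeholders, not values taken from this study.

```python
def hinge_pair(x, knot):
    """Mirrored hinge functions max(0, x - t) and max(0, t - x)."""
    return max(0.0, x - knot), max(0.0, knot - x)


def gcv(rss, n_samples, n_terms, d=3.0):
    """GCV score: mean squared residual inflated by a complexity penalty.

    rss: residual sum of squares of the candidate model.
    n_terms: number of nonzero terms (basis functions), M.
    d: cost per basis function (placeholder default).
    Assumption: effective parameters C(M) = M + d * M.
    """
    effective_params = n_terms + d * n_terms
    if effective_params >= n_samples:
        raise ValueError("model too complex for the number of samples")
    return (rss / n_samples) / (1.0 - effective_params / n_samples) ** 2


# With the same residual error, the model with more terms is penalized more.
score_small = gcv(rss=10.0, n_samples=100, n_terms=4)
score_large = gcv(rss=10.0, n_samples=100, n_terms=6)
```

In the backward pass, a term is dropped when doing so lowers the GCV score, that is, when the gain in simplicity outweighs the increase in residual error.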
Support vector machine
In this equation, C is the coefficient that balances the trade-off between the complexity of the estimated function and the maximum deviation from ε; and ξᵢ and ξᵢ* are terms for the penalty added to the objective function for samples with error values greater than ε.
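The role of C and ε can be sketched with the standard ε-insensitive formulation, in which errors smaller than ε cost nothing and larger errors are penalized linearly, weighted by C. The Python sketch below assumes that standard form for a linear model; the function names and the numbers are illustrative and are not taken from this study.

```python
def epsilon_insensitive(y_true, y_pred, epsilon):
    """Penalty for one sample: zero inside the epsilon tube, linear outside."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_objective(weights, y_true, y_pred, C, epsilon):
    """0.5 * ||w||^2 + C * (sum of penalties for errors beyond epsilon)."""
    complexity = 0.5 * sum(w * w for w in weights)
    penalty = sum(epsilon_insensitive(t, p, epsilon)
                  for t, p in zip(y_true, y_pred))
    return complexity + C * penalty


# Illustrative values: the first prediction lies inside the tube (no penalty),
# the second misses by 1.0 and is penalized for the 0.9 beyond epsilon.
objective = svr_objective(weights=[1.0, 2.0],
                          y_true=[1.0, 2.0],
                          y_pred=[1.05, 3.0],
                          C=10.0,
                          epsilon=0.1)
```

A larger C makes violations of the ε tube more expensive relative to model complexity, so the fitted function follows the training data more closely.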
RESULTS AND DISCUSSION
Multicollinearity test
In the present study, the possibility of multicollinearity among the 13 predictors of the presence of water springs was examined using the tolerance (TOL) and variance inflation factor (VIF) measures of multicollinearity. According to the values of TOL and VIF (Table 2), no multicollinearity among the predictors was present since all TOL values are higher than 0.10 and all VIF values are less than 10. In light of the VIF and TOL values, it is concluded that the 13 selected predictors were suitable for WSPM in the governorate of Al-Karak.
Model coefficientsᵃ
Model | Unstandardized B | Unstandardized std. error | Standardized beta | t | Sig. | Tolerance | VIF |
---|---|---|---|---|---|---|---|
(Constant) | 1.159 | 0.313 | – | 3.699 | 0.000 | – | – |
Distance from drainage | 0.000 | 0.000 | −0.162 | −2.383 | 0.018 | 0.760 | 1.315 |
Rainfall depth | −0.002 | 0.001 | −0.248 | −1.936 | 0.054 | 0.213 | 4.696 |
Profile curvature | −0.032 | 0.071 | −0.032 | −0.452 | 0.652 | 0.707 | 1.415 |
Plan curvature | −0.014 | 0.101 | −0.010 | −0.140 | 0.889 | 0.690 | 1.449 |
Lithology | −0.036 | 0.017 | −0.169 | −2.054 | 0.041 | 0.519 | 1.927 |
Distance from wadi | 0.000 | 0.000 | −0.344 | −4.660 | 0.000 | 0.642 | 1.557 |
Distance from fault | 6.108 × 10⁻⁷ | 0.000 | 0.004 | 0.036 | 0.971 | 0.369 | 2.706 |
TWI | 0.016 | 0.014 | 0.084 | 1.184 | 0.238 | 0.694 | 1.441 |
Fault density | −0.006 | 0.015 | −0.026 | −0.378 | 0.706 | 0.743 | 1.345 |
Soil texture | 0.006 | 0.117 | 0.006 | 0.053 | 0.958 | 0.239 | 4.186 |
DEM (altitude) | 0.000 | 0.000 | 0.389 | 2.126 | 0.035 | 0.305 | 3.567 |
Slope angle | 4.745 × 10⁻⁵ | 0.000 | 0.065 | 1.073 | 0.285 | 0.970 | 1.031 |
Distance from roads | −9.398 × 10⁻⁵ | 0.000 | −0.209 | −2.826 | 0.005 | 0.639 | 1.564 |
ᵃ Dependent variable: RNDSEL.
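The two collinearity measures are linked by VIF = 1 / TOL, so the thresholds TOL > 0.10 and VIF < 10 express the same criterion. The Python sketch below applies that relation to the values printed in the table and flags rows where the printed pair differs by more than rounding. With the values as printed here, only the DEM (altitude) row is flagged; the sketch does not decide which of the two printed values is the intended one. The variable names and the 0.01 rounding tolerance are assumptions of this sketch.

```python
# (tolerance, VIF) pairs as printed in the collinearity table.
collinearity = {
    "Distance from drainage": (0.760, 1.315),
    "Rainfall depth": (0.213, 4.696),
    "Profile curvature": (0.707, 1.415),
    "Plan curvature": (0.690, 1.449),
    "Lithology": (0.519, 1.927),
    "Distance from wadi": (0.642, 1.557),
    "Distance from fault": (0.369, 2.706),
    "TWI": (0.694, 1.441),
    "Fault density": (0.743, 1.345),
    "Soil texture": (0.239, 4.186),
    "DEM (altitude)": (0.305, 3.567),
    "Slope angle": (0.970, 1.031),
    "Distance from roads": (0.639, 1.564),
}

# Threshold check used in the text: TOL > 0.10 and VIF < 10.
passes_thresholds = all(tol > 0.10 and vif < 10.0
                        for tol, vif in collinearity.values())

# Consistency check: VIF should equal 1 / TOL up to rounding.
ROUNDING_TOLERANCE = 0.01  # assumption of this sketch
inconsistent_rows = [name for name, (tol, vif) in collinearity.items()
                     if abs(1.0 / tol - vif) > ROUNDING_TOLERANCE]
```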
Once the most influential factors for WSPM had been determined and the training and testing data subsets had been specified, modeling using the BRT, RF, MARS, and SVM methods was carried out. In this study, the dataset used consists of 200 water spring sites, from which 70% of the sites (140 sites) were selected randomly and allotted to the training dataset and the remaining 30% of the sites (60 sites) were used for model testing. In addition, another 140 and 60 locations with no springs were added to the training and testing data subsets, respectively. The models were all developed in the R software environment.
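The partitioning described above can be sketched as follows. This is a minimal Python illustration of a random 70/30 split applied separately to spring and non-spring locations, not the R code used in the study; the function name `split_sites`, the seed, and the placeholder site identifiers are assumptions.

```python
import random


def split_sites(site_ids, train_fraction=0.7, seed=42):
    """Randomly split site identifiers into training and testing subsets."""
    rng = random.Random(seed)
    shuffled = list(site_ids)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder identifiers: 200 spring sites and 200 non-spring locations.
springs = [("spring", i) for i in range(200)]
non_springs = [("non_spring", i) for i in range(200)]

spring_train, spring_test = split_sites(springs)
absent_train, absent_test = split_sites(non_springs)

# Label 1 marks a spring location, 0 a location with no spring.
train = [(s, 1) for s in spring_train] + [(s, 0) for s in absent_train]
test = [(s, 1) for s in spring_test] + [(s, 0) for s in absent_test]
```

Splitting the two classes separately keeps the training and testing subsets balanced (140 + 140 and 60 + 60 locations).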
In the case of the BRT model, the optimum number of trees was 400, the split rate was set to 1, the reduction rate was 0.01, and the optimal number of nodes in each tree was 20. For the RF model, the number of trees and the number of variables to consider in each split were set to 1,000 and 3, respectively. The RF model estimates its prediction error from the OOB samples; therefore, the testing samples are not used for this estimate. The OOB estimate of the error rate for the model was 25.46%.
In the SVM model, the value of sigma obtained from cross-validation was 0.16, and the number of support vectors was 165. The significance of the predictor variables in the model was determined using the cross-validation method (Figure 5(b)). As can be observed in this figure, the DEM and distance to the river were the most significant predictors of the locations of potential water springs.
TP (true positives): the samples that the model correctly predicts as positive;
FP (false positives): the samples that the model predicts as positive, but that are negative in reality;
FN (false negatives): the samples that the model predicts as negative, but that are positive in reality;
TN (true negatives): the samples that the model correctly predicts as negative.
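The four counts above and the area under the ROC curve can be computed as in the following Python sketch. The AUROCC is calculated here through its rank interpretation, that is, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function names, the 0.5 classification threshold, and the six toy samples are assumptions for illustration, not data from this study.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, and TN for binary labels (1 = spring, 0 = no spring)."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def auroc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: predicted spring probabilities for six locations.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

counts = confusion_counts(y_true, y_pred)
sensitivity = counts["TP"] / (counts["TP"] + counts["FN"])
specificity = counts["TN"] / (counts["TN"] + counts["FP"])
area = auroc(y_true, scores)
```

Unlike sensitivity and specificity, the area does not depend on the chosen threshold, which makes it convenient for comparing models.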
Model | AUROCC |
---|---|
BRT | 0.689 |
MARS | 0.727 |
RF | 0.748 |
SVM | 0.732 |
CONCLUSION
Water spring potential maps provide a vital source of information for land use planning, especially in arid and semi-arid regions. These maps can help policy-makers in managing water resources. In this study, we employed four modeling approaches, that is, the BRT, RF, MARS, and SVM, to create water spring potential maps for the governorate of Al-Karak in southern Jordan. In modeling, the study employed the 13 selected predictors of groundwater potential locations (distance from wadi, distance to drainage network, distance from faults, distance from roads, plan curvature, profile curvature, fault density, lithology, rainfall depth, slope angle, TWI, altitude, and soil texture). The modeling results provided evidence that the RF model has the best performance (AUROCC = 0.748), followed by the SVM (0.732) and MARS (0.727) models. In contrast, the BRT model has relatively poor performance (0.689). These results are consistent with the view that ensemble models (the RF model in this study) often outperform single models, although the BRT model, which is also an ensemble, did not do so here.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.