Population growth and overexploitation of water resources place ongoing pressure on groundwater resources. This study compares the capability of four data mining methods, namely, boosted regression tree (BRT), random forest (RF), multivariate adaptive regression spline (MARS), and support vector machine (SVM), for water spring potential mapping (WSPM) in Al-Karak Governorate, east of the Dead Sea, Jordan. Overall, 200 spring locations and 13 predictor variables were considered for model building and validation. The four models were calibrated and trained on 70% of the spring locations (i.e., 140 locations) and their predictive accuracy was evaluated on the remaining 30% of the locations (i.e., 60 locations). The area under the receiver operating characteristic curve (AUROCC) was employed as the performance measure for evaluating the accuracy of the constructed models. The accuracy assessment based on the AUROCC revealed that the RF model (AUROCC = 0.748) performed better than any other model (AUROCC SVM = 0.732, AUROCC MARS = 0.727, and AUROCC BRT = 0.689).

  • Groundwater potential zoning and mapping is a critical step in identifying and managing water resources.

  • Multicriteria decision-making methods can be employed as fast and efficient techniques in decision-making.

  • The possibility of the presence of multicollinearity among the 13 predictors of the presence of water springs was examined.

Rapid population growth, frequent refugee crises, limited surface water resources, and overpumping of groundwater have resulted in the irreversible loss of natural resources (Brans 2001; Lightbody 2009; Brückner et al. 2021). Groundwater is a vital source of water for agricultural, domestic, and industrial uses (Kemper 2004; Gyau-Boakye et al. 2008). Its low salinity, chemical and temperature stability, low pollution, and high reliability make it highly dependable, particularly in arid and semi-arid regions, which are, in general, characterized by limited water resources (Jha et al. 2009). Because of its effect on the ecological potential of the surrounding area, groundwater plays an important role in economic growth, biodiversity, and public health (Jha et al. 2009). Groundwater accounts for around 4% of all the water actively circulating in the hydrological cycle. Nevertheless, nearly 40% of the world's population relies on groundwater resources for drinking (Rahmati et al. 2018).

Groundwater potential zoning and mapping is a critical step in identifying and managing water resources (Al-Fugara et al. 2020a, 2020b). Traditional methods of mapping water spring potential are often time- and resource-consuming, and given their nondigital nature, they do not allow for compiling data in databases to help water resource planners and managers in decision-making (Mukherjee et al. 2012; Gumma & Pavelic 2013). Therefore, multicriteria decision-making methods like the analytic hierarchy process, machine learning, and data mining, coupled with remote sensing and the geographic information system, can be employed as fast and efficient techniques for this purpose (Kumar et al. 2014; Al-Shabeeb 2015; Naghibi & Pourghasemi 2015; Agarwal & Garg 2016; Naghibi et al. 2016; Al-Shabeeb 2018; Al-Shabeeb et al., 2018; Mahato & Pal 2019). In recent years, this approach has proved to be useful, and it has been used by researchers and water resource planners for preparing groundwater potential maps (Miraki et al. 2019; Al-Fugara et al. 2020c; Prasad et al. 2020).

Data-driven techniques have received more attention than expert-based methods for groundwater potential mapping (GPM). The data-driven approaches assign potential probabilities (e.g., in GPM) or degrees of sensitivity (e.g., in flood susceptibility mapping) to specific locations, which may be represented by the corresponding pixels in the image of a region (Pourghasemi & Beheshtirad 2015; Rahmati et al. 2016). In contrast, the expert-based approaches rely on the opinion of the expert to estimate the probability of the event or the degree of sensitivity (Kumar et al. 2014; Mahato & Pal 2019). Errors are likely in the expert-based approach because experts' opinions may contradict one another. Therefore, the data-driven approaches can have lower uncertainty than the expert-based ones (Ahmadlou et al. 2020).

So far, researchers and modelers have used several data-driven methods, including the artificial neural network, support vector machine (SVM), and adaptive neuro-fuzzy inference system (Chen et al. 2019a, 2019b; Tien Bui et al. 2019; Nguyen et al. 2020). These methods use various variables as predictors for building the models. In general, the predictors (or factors) can differ from one study location to another and which predictors to use depends on the availability and comprehensiveness of the relevant, site-specific data. Generally speaking, these factors can be classified into three groups, namely, topographical factors (e.g., elevation), hydrological factors (e.g., the topographic wetness index (TWI)), and geological factors (e.g., lithology).

Although several modeling approaches have been used for GPM, few studies have compared them with each other. Comparative studies can help water resource planners and managers adopt the most accurate models. A review of the literature shows that ensemble models are gaining popularity (Naghibi et al. 2017; Naghibi et al. 2019). By iterating the modeling step(s) using bootstrap sampling and similar methods, this modeling approach produces a diverse range of models whose predictions are combined by voting. The present study uses boosted regression tree (BRT) and random forest (RF), which are ensemble models. Predictions of these two modeling approaches are then compared with predictions of two well-known data mining models: the SVM and multivariate adaptive regression spline (MARS) models.

In the rest of this article, Section 2 describes the study area and the factors considered in water spring potential mapping (WSPM). Section 3 illustrates the adopted modeling methods. Then, Section 4 presents the results of this study and discusses them. Finally, Section 5 presents the study conclusions.

Study area

The study area is the land area of the governorate of Al-Karak. This governorate lies east of the Dead Sea and covers a substantial part of its eastern watershed, which extends beyond the governorate of Al-Karak and has an area of about 1,497.3 km² (Figure 1).
Figure 1

Map location of the study area.


Al-Karak Governorate has a total area of about 3,702 km². The land area is nearly 3,495.40 km². This figure excludes the part of the Dead Sea that belongs, according to the administrative divisions of the country, to this governorate, which amounts to almost 206.6 km² (Figure 1).

Geologically, this study area is part of the fault valley. Therefore, the Dead Sea transform rift influences its geology and geomorphology. Elevations of this study area drop from 1,301 m above sea level in its eastern parts to 416 m below sea level in its western parts at the shoreline of the Dead Sea.

The geomorphology of the governorate of Al-Karak is categorized into three distinctive units: the highland toes in the eastern parts of the governorate; the plateau and highlands, which cover most of the study area in its middle parts; and the steep slopes, which lie in the western parts of the governorate and extend to the shoreline of the Dead Sea.

In the study area, there are seven wadis (i.e., valleys) that drain surface water, which originates mostly from rainfall floods, to the Dead Sea. These wadis are Karak, Dhira, Ibn-Hammad, Hesa, Numera, Issal, and Mujib wadis. In recent years, the volume of surface water discharged from springs has dropped sharply (Al-Weshah 2000; Abu Ghazleh et al. 2009). According to Arabtech-Jardaneh (1996), the annual discharge from springs in the study area is estimated at 7 × 10⁶ to 9 × 10⁶ m³. However, the Water Authority of Jordan, the governmental authority managing the water sector in Jordan, estimated the annual discharge from 150 springs in the area in the years 1986/1987 and 2010/2011 at 58,811 m³ and 11,006 m³, respectively.

The study area has a Mediterranean climate, which is characterized by hot, dry summers and cold, wet winters. The average temperature varies from −2 °C in January to 42 °C in May. The average annual rainfall depth ranges from 52 mm in the western parts of the Al-Karak governorate to more than 400 mm at the heights in the middle parts of this governorate, with an annual average rainfall depth of about 345 mm (Abed 2000).

Further description of the study area is provided in Figure 2, where Figure 2(a) provides the description of the study area in terms of geology, surface water, rainfall depth, and elevation, while Figure 2(b) gives the description of this area in terms of slope, fault density, plan curvature, and soil texture. Table 1 shows the description of the geological formations shown in Figure 2(a).
Table 1

Geological classification of the rock units in the study area (Powell 1988)

Formation symbol | Era | Sub-era | Description
CS | Paleozoic | Cambrian | Massive brownish sandstone
Ks | Mesozoic | Cretaceous | Sandy limestone and dolomite marl
Ks2 | Mesozoic | Cretaceous | Limestone sandy limestone, dolomite nodular limestone
Ks3 | Mesozoic | Cretaceous | Chalk, marl bituminous limestone, phosphorite
Os3 | Paleozoic | Ordovician | Fine-grained sandstone, sandy shale, marine deposit
pCs | Precambrian | – | Slate–greywacke series and Saramuj conglomerate
Qb | Cenozoic | Quaternary | Basaltic flows
Qs5 | Cenozoic | Quaternary | Terrestrial, fluviatile, and lacustrine sediments
TR | Mesozoic | Triassic | Limestone sandy limestone and locally gypsum
Ts3 | Cenozoic | Tertiary | Sandstone, conglomerate, marl, and evaporites

Figure 2

(a) Description of the study area: (i) geology map, (ii) elevation map, (iii) wadis map, and (iv) rainfall map. (b) Description of the study area: (i) plan curvature map, (ii) fault density map, (iii) soil texture map, and (iv) slope map.


Methodology

Figure 3 is the flowchart of the method followed in this study. It shows the predictor variables (or factors) used in the analysis and modeling and outlines the main steps taken in the modeling process. In this method, the water spring potential maps were developed in three steps, beginning with construction of the spatial database. This database was developed to document the sites of groundwater springs and to list the criteria (predictor or influential variables or factors) that contribute to their formation. Potential water springs were mapped using RF, SVM, BRT, and MARS models. These models were also used for defining the relationships among the sites of springs and the factors that contribute to their formation. One water spring potential map was developed using each model. Model validation was performed, and the accuracy of each water spring potential map was assessed based on the AUROCC measure.
Figure 3

Flowchart of the research method.


Dataset

This study first attempted to identify the most appropriate factors for water spring potential assessment and prediction. To this end, a list of the influential factors was prepared based on a literature review and, then, refined based on the availability of suitable data. Overall, 13 of the most influential factors were selected, which are slope, elevation, profile curvature, plan curvature, lithology, rainfall depth, fault density, soil texture, TWI, distance to roads, distance to drainage, distance to fault, and distance to rivers (Manap et al. 2014; Chen et al. 2018; Khosravi et al. 2018; Lee et al. 2018; Al-Fugara et al., 2020a). The type and amount of precipitation often vary by elevation. Higher altitudes often receive precipitation in the form of snow and have lower evapotranspiration than lower altitudes. In this case, the gradual melting of the snow in the warm season feeds the groundwater basins. In the meantime, areas of the drainage basins of aquifers decrease with elevation. That is, as the elevation increases, less ground area is exposed to the falling snow. So, the elevation is a double-sided modeling parameter with opposing effects.

Slope, too, is an effective predictor in terms of both the amount of runoff produced by precipitation and ground permeability. Usually, faster runoff and lower ground permeability are associated with steeper slopes. Hence, this parameter has a negative impact on aquifer recharge. The slope map was prepared in this study based on a 30 m ASTER DEM of the study area using the surface analysis function and the slope tool in the ArcGIS software environment. Aspect has a great influence on hydrological, and other, processes and phenomena, including the melting of snow, diversity of vegetation, and evapotranspiration, thus affecting groundwater accumulation. Evaporation is more intense, and snow melting is faster in areas exposed to direct sunlight. The aspect map too was derived from the DEM.

The lithology of the study area was mapped using geological maps prepared by the Ministry of Energy and Natural Resources, Jordan. Geological formations and rocks play a critical role in the development of groundwater resources. Permeable and porous geological formations consisting of limestone, sandstone, and conglomerate are the most effective formations in accumulating groundwater, whereas marl is the least effective in this respect. The TWI is another factor that is often used in GPM studies as an indicator of soil moisture.
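
As a hedged illustration of how the TWI is commonly computed, the sketch below applies the standard definition TWI = ln(a / tan β), where a is the specific catchment area and β is the local slope. The source does not state the formula or the tool used to derive the TWI, so the function name, the argument names, and the small minimum-slope guard are assumptions introduced here.

```python
import math

def topographic_wetness_index(specific_catchment_area, slope_deg, min_slope_deg=0.1):
    """TWI = ln(a / tan(beta)); a is the upslope area per unit contour width."""
    # Assumed guard: flat cells would give tan(beta) = 0, so the slope is floored.
    beta = math.radians(max(slope_deg, min_slope_deg))
    return math.log(specific_catchment_area / math.tan(beta))

# Hypothetical cells: a gently sloping valley floor and a steep ridge.
flat_valley = topographic_wetness_index(500.0, 2.0)
steep_ridge = topographic_wetness_index(50.0, 30.0)
print(flat_valley > steep_ridge)
```

Cells with a large contributing area and a gentle slope receive higher TWI values, which is why the index is read as a proxy for where water tends to accumulate.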

Boosted regression tree

Although decision trees were first developed several decades ago, they can still be categorized among the new generation of data mining techniques. The BRT is a tree-based model that combines the Classification and Regression Tree (CART; Breiman et al. 1984) model with the boosting technique (Elith et al. 2008), merging the results of multiple weak models to obtain a stronger one (Zhang et al. 2016). Boosting is a simple and effective process that creates numerous models and combines their results to increase the overall accuracy (Schwenk & Bengio 2000). The idea behind this method is that averaging the results of multiple models is more reliable and accurate than using a single model. However, the BRT method does not use all of the training samples to create each tree. Instead, each tree is developed using a random subset of the training samples (Elith et al. 2008). The performance of BRT models is substantially influenced by the learning rate, tree complexity, bagging rate, minimum number of observed samples in the leaf nodes, and number of trees (Elith et al. 2008). The learning rate limits the impact of each tree on the overall model. It must be a real number in the range of 0–1 (Elith et al. 2008), and it is frequently set to a value between 0.001 and 0.1. Decreasing the learning rate increases the number of trees required and reduces the prediction error (Elith et al. 2008). Consequently, choosing a lower learning rate is preferable, provided that there are enough training samples and time to train the models. In the bagging process, a number of training samples are selected randomly without replacement in each iteration (Skurichina & Duin 2002). An appropriate value for the bagging rate lies between 0.5 and 0.75.
One of the criteria commonly used to stop building new trees is the number of observed samples in the leaf nodes. The number of trees in the final model is determined based on the aforementioned parameters and cross-validation. The BRT model generally performs better with more than 1,000 trees than with fewer. The advantages of this model include its ability to handle different types of predictor variables, its robustness to outliers (which removes the need to eliminate them), and its ability to model nonlinear relationships among the variables (Elith et al. 2008).
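
The interplay of learning rate, bagging rate, and number of trees described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's implementation: it uses one-split trees (stumps), a single predictor, and squared-error loss for simplicity, and every function name and parameter value is hypothetical.

```python
import random

def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:  # all x values identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return (xs[0], mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_boosted(xs, ys, n_trees=200, learning_rate=0.05, bag_fraction=0.5, seed=0):
    rng = random.Random(seed)
    base = sum(ys) / len(ys)          # start from the mean response
    preds = [base] * len(xs)
    stumps = []
    n_bag = max(2, int(bag_fraction * len(xs)))
    for _ in range(n_trees):
        # Bagging step: random subset drawn without replacement.
        idx = rng.sample(range(len(xs)), n_bag)
        stump = fit_stump([xs[i] for i in idx], [ys[i] - preds[i] for i in idx])
        stumps.append(stump)
        # The learning rate shrinks the contribution of each new tree.
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Synthetic step-shaped response, used only to exercise the sketch.
xs = [i / 20 for i in range(40)]
ys = [1.0 if x > 1.0 else 0.0 for x in xs]
model = fit_boosted(xs, ys)
mean_y = sum(ys) / len(ys)
mse_baseline = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_boosted = sum((y - predict_boosted(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_boosted < mse_baseline)
```

A smaller `learning_rate` makes each tree contribute less, so more trees are needed to reach the same fit, which mirrors the trade-off stated in the text.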

Random forest

The RF is an ensemble data mining technique that was developed by Breiman (2001). It may be considered an extension of the CART model (Breiman 2001). In the CART model, the training set is repeatedly split into subsets to determine the relationship between the dependent variable and each independent variable and specify the target variable (Breiman et al. 1984). Unlike other tree-based methods in which a limited number of decision trees are built, in the RF method, hundreds or even thousands of decision trees are built. In essence, the RF is an ensemble learning technique that uses a set of weak learners to create a stronger model. Various forms of bootstrapping are used in the development of RF models. In addition, a random subset of input variables is used to build each tree. Using bootstrapping, a large number of n-sample subsets of the initial training dataset is created via random sampling with replacement. Approximately one-third of the samples are not used during this process, and they are considered out-of-bag (OOB) samples, which are used for significant variable selection and unbiased error estimation (Breiman 2001). A tree is then developed from each bootstrap sample. During the process of building the trees, m variables are selected for each tree branch from the whole set of M independent variables to split the tree. In regression models, the recommended m/M ratio is 1/3, while for classifier models, the recommended ratio is $1/\sqrt{M}$, i.e., $m = \sqrt{M}$ (Biau & Scornet 2016). Once all the trees have been built, the test samples are presented to each tree. Each tree generates an output for each input vector. The final output is the average of all these outputs.
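
The "approximately one-third" out-of-bag share follows from sampling n items with replacement: each sample is left out of a given bootstrap with probability (1 − 1/n)ⁿ, which approaches e⁻¹ ≈ 0.37. The sketch below checks this by simulation and also computes the number of variables tried per split; the function names are hypothetical, and rounding √M down is an assumption made for illustration.

```python
import math
import random

def oob_fraction(n_samples, n_trees, seed=0):
    """Average share of samples left out of each bootstrap replicate."""
    rng = random.Random(seed)
    left_out = 0
    for _ in range(n_trees):
        # Draw n_samples indices with replacement; unseen indices are out of bag.
        in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
        left_out += n_samples - len(in_bag)
    return left_out / (n_samples * n_trees)

def variables_per_split(n_predictors, task):
    """Common defaults: M/3 for regression, sqrt(M) for classification."""
    if task == "regression":
        return max(1, n_predictors // 3)
    return max(1, math.isqrt(n_predictors))  # assumed: round sqrt(M) down

fraction = oob_fraction(n_samples=280, n_trees=500)
m_classification = variables_per_split(13, "classification")
print(round(fraction, 3), m_classification)
```

With 13 predictors, the classification default evaluates to 3 variables per split.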

Multivariate adaptive regression spline

The MARS is an efficient data mining algorithm that models the nonlinear relationships among the predictive variables and the dependent variable in two basic steps (Friedman 1991): forward pass and backward pass.

The forward pass

In the forward pass, the algorithm starts with a model that consists of only the mean of the target variable. Then, it repeatedly adds a pair of basis functions to the model. The added basis functions are the ones that lead to the maximum reduction in the sum of squares of residual errors. The process of adding new functions continues until the minimum value of the sum-squared error (SSE) is obtained. At the end of this step, the resulting model is overfitted to the training samples used in the forward pass, and it does not perform well on new samples (test samples). The prediction equation of the MARS algorithm is as follows (Friedman 1991):
$$Y = \beta_0 + \sum_{i=1}^{m} \beta_i B_i(X) \qquad (1)$$
where Y is the predicted output of the MARS model and $\beta_0$ is the mean of the target variable (the coefficient of the constant basis function with index 0). The ith basis function and its coefficient are indicated by $B_i(X)$ and $\beta_i$, respectively. Finally, m is the number of the basis functions. In this equation, the $\beta_i$ values are obtained by minimizing the SSE.
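
To make the structure of a MARS prediction concrete, the sketch below evaluates a model built from hinge basis functions of the form max(0, x − t) and max(0, t − x), which are the basis functions used by MARS (Friedman 1991). The intercept, coefficients, and knots in the example are hypothetical values chosen for illustration; they are not the fitted model of this study.

```python
def hinge(x, knot, direction):
    """direction=+1 gives max(0, x - knot); direction=-1 gives max(0, knot - x)."""
    return max(0.0, direction * (x - knot))

def mars_predict(x, intercept, terms):
    """terms: list of (coefficient, [(variable_index, knot, direction), ...])."""
    total = intercept
    for coefficient, factors in terms:
        basis = 1.0
        # A term with two factors is a degree-2 interaction between variables.
        for variable_index, knot, direction in factors:
            basis *= hinge(x[variable_index], knot, direction)
        total += coefficient * basis
    return total

# Hypothetical model: one main-effect term and one two-variable interaction.
terms = [
    (0.5, [(0, 1.0, +1)]),
    (2.0, [(0, 1.0, +1), (1, 5.0, -1)]),
]
prediction = mars_predict([2.0, 3.0], intercept=1.0, terms=terms)
print(prediction)
```

Each term is zero on one side of its knot, which is how the model represents piecewise-linear, nonlinear responses.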

The backward pass

This step uses generalized cross-validation (GCV) to lower the complexity of the model and improve its generalizability. The algorithm starts with the largest model. In each iteration, a basis function is temporarily eliminated, and the GCV of the new model is calculated. If the GCV is reduced, the algorithm removes that basis function from the model. The process continues until the minimum GCV is obtained. Equation (2) shows the formula for calculating the GCV. As can be seen, at best, the GCV value can reach 0 (Friedman 1991).
$$\mathrm{GCV} = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - f_k(x_i)\right)^2}{\left(1 - \frac{C}{n}\right)^2} \qquad (2)$$
where $y_i$ is the true output value for the ith sample, $f_k(x_i)$ is the predicted output of the MARS model for the ith sample in the kth iteration, n is the number of samples, and C is the penalty term, which is determined using the following equation (Friedman 1991):
(3)

In this equation, d is the cost associated with each function and M is the number of the nonzero terms (i.e., basis functions).
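
As a rough sketch of the pruning criterion, the function below computes the GCV in Friedman's (1991) form: the mean squared error divided by (1 − C/n)². The penalty C is passed in directly rather than derived from d and M, and the function name and example values are assumptions made for illustration.

```python
def gcv(y_true, y_pred, penalty):
    """Mean squared error inflated by the model-complexity penalty C."""
    n = len(y_true)
    if penalty >= n:
        raise ValueError("penalty C must be smaller than the number of samples")
    mse = sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / n
    return mse / (1.0 - penalty / n) ** 2

perfect_fit = gcv([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], penalty=2.0)
unit_errors = gcv([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], penalty=2.0)
print(perfect_fit, unit_errors)
```

For a fixed error, a larger penalty raises the GCV, so removing a basis function pays off only when the loss in fit is small compared with the reduction in complexity.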

Support vector machine

The SVM is based on statistical learning theory (Vapnik 1999). In the SVM, the output is estimated using Equation (4) (Cortes & Vapnik 1995):
$$Z = \langle W, X \rangle + b \qquad (4)$$
where $\langle \cdot , \cdot \rangle$ is the inner product, W is the weight vector, X is the feature vector, Z is the output predicted by the SVM, and b is the bias. The values of the weight vector and the bias are determined by minimizing the ε-insensitive error function provided in Equation (5) (Cortes & Vapnik 1995).
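$$|y - Z|_{\varepsilon} = \begin{cases} 0, & |y - Z| \le \varepsilon \\ |y - Z| - \varepsilon, & \text{otherwise} \end{cases} \qquad (5)$$
where Z is the estimated quantity, y is the corresponding observed value, and ε is the accuracy at which Z is estimated.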
Applying the structural risk minimization principle in regression problems means that finding the optimum SVM model is equivalent to solving the following optimization problem (Cortes & Vapnik 1995):
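$$\min_{W,\, b,\, \xi,\, \xi^*} \; \frac{1}{2}\lVert W \rVert^2 + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^*\right) \quad \text{subject to} \quad \begin{cases} y_i - \langle W, X_i \rangle - b \le \varepsilon + \xi_i \\ \langle W, X_i \rangle + b - y_i \le \varepsilon + \xi_i^* \\ \xi_i,\ \xi_i^* \ge 0 \end{cases} \qquad (6)$$

In this equation, $y_i$ is the observed output for the ith sample, C is the coefficient that balances the trade-off between the complexity of the estimated function and the maximum deviation from ε; $\xi_i$ and $\xi_i^*$ are terms for the penalty added to the objective function for samples with error values greater than ε.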

The aforementioned optimization problems are often solved in the dual form using Lagrange coefficients, which enable them to handle nonlinear relationships. By writing Equation (6) in the dual form and computing its derivative with respect to the main variables, we obtain Equation (7) (Cortes & Vapnik 1995):
$$Z = \sum_{i=1}^{n} \left(\alpha_i - \alpha_i^*\right) K(X_i, X) + b \qquad (7)$$
where $\alpha_i$ and $\alpha_i^*$ are the Lagrange coefficients and K is the kernel function that replaces the dot product. Since the radial basis function kernel has shown promising results in previous works, we used it in the current study.
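
The kernel expansion and the ε-insensitive loss can be sketched as follows. The source does not state which kernel parameterization was used; the form exp(−σ‖x − z‖²) is assumed here (it is the one used by the `rbfdot` kernel of the R package kernlab), and the support vectors, coefficients, and bias in the example are hypothetical.

```python
import math

def rbf_kernel(x, z, sigma):
    """Assumed parameterization: K(x, z) = exp(-sigma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sigma * squared_distance)

def svr_predict(x, support_vectors, dual_coefficients, bias, sigma):
    """Each dual coefficient stands for the difference of a Lagrange-multiplier pair."""
    return bias + sum(
        coefficient * rbf_kernel(sv, x, sigma)
        for sv, coefficient in zip(support_vectors, dual_coefficients)
    )

def epsilon_insensitive_loss(observed, predicted, epsilon):
    """Errors inside the epsilon tube cost nothing; larger ones grow linearly."""
    return max(0.0, abs(observed - predicted) - epsilon)

# Hypothetical two-support-vector model evaluated at one support vector.
support_vectors = [[0.0, 0.0], [3.0, 4.0]]
dual_coefficients = [2.0, -1.0]
z = svr_predict([0.0, 0.0], support_vectors, dual_coefficients, bias=0.5, sigma=0.16)
print(z)
```

Only samples with nonzero dual coefficients (the support vectors) contribute to the prediction.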

Multicollinearity test

In the present study, the possibility of the presence of multicollinearity among the 13 predictors of the presence of water springs was examined by using the tolerance (TOL) and variance inflation factor (VIF) measures of multicollinearity. According to the values of TOL and VIF (Table 2), no multicollinearity among the predictors was present since all TOL values are higher than 0.10 and all VIF values are less than 10. In the light of the VIF and TOL values, it is concluded that the 13 selected predictors were relevant for water spring potential mapping (WSPM) in the governorate of Al-Karak.

Table 2

Results of testing for multicollinearity among the predictors of presence of water spring

Model coefficients (a)
Predictor | B (unstandardized) | Std error | Beta (standardized) | t | Sig | Tolerance | VIF
(Constant) | 1.159 | 0.313 | – | 3.699 | 0.000 | – | –
Distance from drainage | 0.000 | 0.000 | −0.162 | −2.383 | 0.018 | 0.760 | 1.315
Rainfall depth | −0.002 | 0.001 | −0.248 | −1.936 | 0.054 | 0.213 | 4.696
Profile curvature | −0.032 | 0.071 | −0.032 | −0.452 | 0.652 | 0.707 | 1.415
Plan curvature | −0.014 | 0.101 | −0.010 | −0.140 | 0.889 | 0.690 | 1.449
Lithology | −0.036 | 0.017 | −0.169 | −2.054 | 0.041 | 0.519 | 1.927
Distance from wadi | 0.000 | 0.000 | −0.344 | −4.660 | 0.000 | 0.642 | 1.557
Distance from fault | 6.108 × 10⁻⁷ | 0.000 | 0.004 | 0.036 | 0.971 | 0.369 | 2.706
TWI | 0.016 | 0.014 | 0.084 | 1.184 | 0.238 | 0.694 | 1.441
Fault density | −0.006 | 0.015 | −0.026 | −0.378 | 0.706 | 0.743 | 1.345
Soil texture | 0.006 | 0.117 | 0.006 | 0.053 | 0.958 | 0.239 | 4.186
DEM (altitude) | 0.000 | 0.000 | 0.389 | 2.126 | 0.035 | 0.305 | 3.567
Slope angle | 4.745 × 10⁻⁵ | 0.000 | 0.065 | 1.073 | 0.285 | 0.970 | 1.031
Distance from roads | −9.398 × 10⁻⁵ | 0.000 | −0.209 | −2.826 | 0.005 | 0.639 | 1.564

(a) Dependent variable: RNDSEL.
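
The screening rule applied to Table 2 can be written as a short check. For predictor j, TOL = 1 − R²ⱼ and VIF = 1/TOL, where R²ⱼ is obtained by regressing predictor j on the remaining predictors. The sketch below applies the thresholds stated in the text (TOL > 0.10 and VIF < 10) to the tabulated values; the function names are hypothetical.

```python
def vif_from_r_squared(r_squared):
    """VIF_j = 1 / (1 - R_j^2); the tolerance is its reciprocal."""
    return 1.0 / (1.0 - r_squared)

def flag_collinear(rows, min_tolerance=0.10, max_vif=10.0):
    """Return the predictors that fail either threshold."""
    return [name for name, tol, vif in rows if tol <= min_tolerance or vif >= max_vif]

# (predictor, tolerance, VIF) as listed in Table 2.
table_2 = [
    ("Distance from drainage", 0.760, 1.315),
    ("Rainfall depth", 0.213, 4.696),
    ("Profile curvature", 0.707, 1.415),
    ("Plan curvature", 0.690, 1.449),
    ("Lithology", 0.519, 1.927),
    ("Distance from wadi", 0.642, 1.557),
    ("Distance from fault", 0.369, 2.706),
    ("TWI", 0.694, 1.441),
    ("Fault density", 0.743, 1.345),
    ("Soil texture", 0.239, 4.186),
    ("DEM (altitude)", 0.305, 3.567),
    ("Slope angle", 0.970, 1.031),
    ("Distance from roads", 0.639, 1.564),
]
flagged = flag_collinear(table_2)
print(flagged)
```

An empty list means that no predictor crosses either threshold, which is the condition the text relies on to retain all 13 predictors.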

Once the most influential factors for WSPM had been determined and the training and testing data subsets had been specified, modeling using the BRT, RF, MARS, and SVM methods was started. The dataset used in this study consists of 200 water spring sites, of which 70% (140 sites) were selected randomly and allotted to the training dataset, while the remaining 30% (60 sites) were used for model testing. In addition, another 140 and 60 locations with no springs were added to the training and testing data subsets, respectively. All models were developed in the R software environment.
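
The partitioning described above can be sketched as follows, assuming that presence (spring) and absence (non-spring) locations are each split 70/30 and then labeled 1 and 0. The identifiers, the seed, and the function name are placeholders introduced for illustration; the study's own sampling procedure in R is not documented in the text.

```python
import random

def split_presence_absence(presence, absence, train_fraction=0.7, seed=42):
    """Split each class separately so that both subsets stay balanced."""
    rng = random.Random(seed)

    def split(items):
        shuffled = list(items)
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        return shuffled[:cut], shuffled[cut:]

    presence_train, presence_test = split(presence)
    absence_train, absence_test = split(absence)
    train = [(site, 1) for site in presence_train] + [(site, 0) for site in absence_train]
    test = [(site, 1) for site in presence_test] + [(site, 0) for site in absence_test]
    return train, test

# Placeholder identifiers standing in for 200 spring and 200 non-spring locations.
springs = [f"spring_{i}" for i in range(200)]
non_springs = [f"absence_{i}" for i in range(200)]
train, test = split_presence_absence(springs, non_springs)
print(len(train), len(test))
```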

In the case of the BRT model, the optimum number of trees was 400, the split rate was set to 1, the reduction rate was 0.01, and the optimal number of nodes in each tree was 20. In the case of the RF model, the number of trees and the number of variables to consider in each split were set to 1,000 and 3, respectively. The RF model estimates its prediction error from the OOB samples, so the testing samples are not needed for this estimate. The OOB estimate of the error rate for the model was 25.46%.

One of the advantages of the BRT model is that it can determine the significance of each input predictor variable. By using this capability, we could determine the effectiveness and importance of each factor. For the RF model, the mean decrease accuracy (MDA) and the mean decrease in Gini (MDG) are two measures that can be used to determine the significance of variables. In the case of the MDA, the actual values of a variable are replaced by random values generated for each tree. If this replacement does not affect the value of the error, then the variable is less significant than a variable whose replacement causes a considerable increase in the estimation error. Values of the MDA and MDG metrics in the RF model are depicted in Figure 4, while the relative significance of the predictors of water spring location according to the BRT model is shown in Figure 5(a). According to the results (Figure 5(a)), lithology is the most significant factor in the model, followed by DEM, distance to roads, distance to river, and distance to wadi.
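
The idea behind the MDA can be illustrated with a permutation test: shuffle one predictor, re-score the model, and record how much the accuracy drops. The sketch below is a simplification under stated assumptions. It scores a fixed rule on the full dataset, whereas an RF computes the drop per tree on its OOB samples, and the data, the model, and all names are synthetic.

```python
import random

def accuracy(model, rows, labels):
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, feature, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in rows]
        rng.shuffle(column)
        permuted = [row[:feature] + [value] + row[feature + 1:]
                    for row, value in zip(rows, column)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats

# Synthetic data: the label depends on feature 0 only; feature 1 is noise.
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(row[0] > 0.5) for row in rows]

def model(row):
    # Stand-in for a trained classifier that uses feature 0 only.
    return int(row[0] > 0.5)

importance_informative = permutation_importance(model, rows, labels, feature=0)
importance_noise = permutation_importance(model, rows, labels, feature=1)
print(importance_informative > importance_noise)
```

Shuffling a predictor that the model ignores leaves the accuracy unchanged, whereas shuffling an informative one degrades it, which is the contrast the MDA ranks.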
Figure 4

Values of the MDA and MDG metrics in the RF model.

Figure 5

(a) Relative importance of predictors of water spring location according to the BRT model and (b) relative importance of predictors of water spring location according to the SVM.

In the case of the MARS model, the maximum number of basis functions was set to 10. The final MARS model is given in Equation (8):
(8)
Equation (8) shows that the constant (intercept) in the regression model has the value of –0.01236699. In this model, the maximum degree of interaction between variables is set to 2. For example, the fifth term is the product of the DEM and slope variables. In this term, the knot value of the variable is 0.5. Figure 6 is a plot of the basis functions of this model for four variables.
Figure 6

The basis functions of the MARS model.


In the SVM model, the value of sigma obtained from cross-validation was 0.16, and the number of support vectors was 165. The significance of the predictor variables in the model was determined using the cross-validation method (Figure 5(b)). As can be observed in this figure, the DEM and distance to the river were the most significant predictors of the locations of potential water springs.

The AUROCC is one of the most prominent evaluation metrics and is often used to compare different models. Values of the AUROCC for the four models constructed in this study are plotted in Figure 7 and listed in Table 3. As this figure and table show, the RF and BRT models had the highest (0.748) and lowest (0.689) area under the curve values, respectively. The AUROCC model equation can be expressed as follows:
(9)
where:
  • TP: the samples that the model predicts as positive correctly;

  • FP: the samples that the model predicts as positive, but they are negative in reality;

  • FN: the samples that the model predicts as negative, but in reality they are positive;

  • TN: the samples that the model predicts as negative correctly.
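
The four counts above define the two axes of the ROC curve: the true-positive rate TP/(TP + FN) and the false-positive rate FP/(FP + TN). As a hedged sketch, the code below computes these rates and estimates the area under the curve through its rank interpretation, that is, the probability that a randomly chosen spring location scores higher than a randomly chosen non-spring location. The scores, counts, and function names are illustrative and are not taken from the study.

```python
def roc_rates(tp, fp, fn, tn):
    """Return (true-positive rate, false-positive rate)."""
    return tp / (tp + fn), fp / (fp + tn)

def auroc(scores, labels):
    """Share of positive/negative pairs ranked correctly; ties count one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

tpr, fpr = roc_rates(tp=40, fp=15, fn=20, tn=45)
auc_example = auroc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
print(tpr, fpr, auc_example)
```

An area of 0.5 corresponds to random ranking and 1.0 to perfect separation.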

Figure 7

Values of the AUROCC for the four models under study.

Table 3

The areas under the receiver operating characteristic curves for the four models

Model | AUROCC
BRT | 0.689
MARS | 0.727
RF | 0.748
SVM | 0.732

Comparative studies can reveal the advantages and disadvantages of different machine learning and data mining methods. Therefore, we used four modeling approaches in this study to create water spring potential maps for the governorate of Al-Karak, Jordan (Figure 8) and compared their outcomes. The results indicate that the RF, which is an ensemble method, has the best performance. Various studies have shown that, compared with a single model, an ensemble of multiple models usually has better performance. Thus, our findings are consistent with the findings of previous studies in this regard (e.g., Kehoe et al. 2012; Han et al. 2018; Lei et al. 2019).
Figure 8

(a) The water spring potential map produced by the BRT, (b) the water spring potential map produced by the RF, (c) the water spring potential map produced by the MARS, and (d) the water spring potential map produced by the SVM.


Water spring potential maps provide a vital source of information for land use planning, especially in arid and semi-arid regions. These maps can help policy-makers in managing water resources. In this study, we employed four modeling approaches, that is, the BRT, RF, MARS, and SVM, to create water spring potential maps for the governorate of Al-Karak in southern Jordan. In modeling, the study employed 13 significant predictors of groundwater potential locations (distance from wadi, distance to drainage network, distance from faults, distance from roads, plan curvature, profile curvature, fault density, lithology, rainfall depth, slope angle, TWI, altitude, and soil texture). The modeling results provided evidence that the RF model has the best performance, followed by the SVM and MARS models. In contrast, the BRT model has relatively poor performance. These results confirm that the ensemble models (the RF model in this study) often outperform the single models.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Abed A. M. 2000 Geology of Jordan, Its Water and Environments. Jordanian Geologists Association, Amman, Jordan.
Abu Ghazleh S., Hartmann J., Jansen N. & Kempe S. 2009 Water input requirements of the rapidly shrinking Dead Sea. Naturwissenschaften 96, 637–643.
Ahmadlou M., Al-Fugara A. K., Al-Shabeeb A. R., Arora A., Al-Adamat R., Pham Q. B., Al-Ansari N., Linh N. T. T. & Sajedi H. 2020 Flood susceptibility mapping and assessment using a novel deep learning model combining multilayer perceptron and autoencoder neural networks. Journal of Flood Risk Management 14 (1), e12683.
Al-Fugara A. K., Ahmadlou M., Al-Shabeeb A. R., AlAyyash S., Al-Amoush H. & Al-Adamat R. 2020a Spatial mapping of groundwater springs potentiality using grid search-based and genetic algorithm-based support vector regression. Geocarto International 37 (1), 284–303.
Al-Fugara A. K., Ahmadlou M., Shatnawi R., AlAyyash S., Al-Adamat R., Al-Shabeeb A. A.-R. & Soni S. 2020b Novel hybrid models combining meta-heuristic algorithms with support vector regression (SVR) for groundwater potential mapping. Geocarto International 37 (9), 2627–2646.
Al-Fugara A. K., Pourghasemi H. R., Al-Shabeeb A. R., Habib M., Al-Adamat R., Al-Amoush H. & Collins A. L. 2020c A comparison of machine learning models for the mapping of groundwater spring potential. Environmental Earth Sciences 79, 206.
Al-Shabeeb A. R. R. 2015 A Modified Analytical Hierarchy Process Method to Select Sites for Groundwater Recharge in Jordan. PhD thesis, University of Leicester, Leicester, UK.
Al-Shabeeb A. A. R., Al-Adamat R., Al-Fugara A., Al-Amoush H. & AlAyyash S. 2018 Delineating groundwater potential zones within the Azraq Basin of Central Jordan using multi-criteria GIS analysis. Groundwater for Sustainable Development 7, 82–90.
Al-Weshah R. A. 2000 The water balance of the Dead Sea: an integrated approach. Hydrological Processes 14, 145–154.
Arabtech-Jardaneh 1996 Study and Evaluation of Surface and Underground Water in the Zara and Ma’in Areas – Feasibility Study Report. Jordan Valley Authority, Amman, Jordan.
Biau G. & Scornet E. 2016 A random forest guided tour. TEST 25 (2), 197–227.
Brans E. H. P. 2001 Liability for Damage to Public Natural Resources: Standing, Damage and Damage Assessment. Kluwer Law International BV, The Hague, The Netherlands.
Breiman L. 2001 Random forests. Machine Learning 45 (1), 5–32.
Breiman L., Friedman J. H., Olshen R. A. & Stone C. J. 1984 Classification and Regression Trees. CRC Press, Boca Raton, FL, USA.
Brückner F., Bahls R., Alqadi M., Lindenmaier F., Hamdan I., Alhiyari M. & Atieh A. 2021 Causes and consequences of long-term groundwater overabstraction in Jordan. Hydrogeology Journal 29, 2789–2802.
Chen W., Li H., Hou E., Wang S., Wang G., Panahi M., Li T., Peng T., Guo C., Niu C., Xiao L., Wang J., Xie X. & Ahmad B. B. 2018 GIS-based groundwater potential analysis using novel ensemble weights-of-evidence with logistic regression and functional tree models. Science of the Total Environment 634, 853–867.
Chen W., Panahi M., Khosravi K., Pourghasemi H. R., Rezaie F. & Parvinnezhad D. 2019a Spatial prediction of groundwater potentiality using ANFIS ensembled with teaching-learning-based and biogeography-based optimization. Journal of Hydrology 572, 435–448.
Chen W., Pradhan B., Li S., Shahabi H., Rizeei H. M., Hou E. & Wang S. 2019b Novel hybrid integration approach of bagging-based Fisher's linear discriminant function for groundwater potential analysis. Natural Resources Research 28 (4), 1239–1258.
Cortes C. & Vapnik V. 1995 Support-vector networks. Machine Learning 20 (3), 273–297.
Elith J., Leathwick J. R. & Hastie T. 2008 A working guide to boosted regression trees. Journal of Animal Ecology 77 (4), 802–813.
Friedman J. H. 1991 Multivariate adaptive regression splines. The Annals of Statistics 19 (1), 1–67.
Gyau-Boakye P., Kankam-Yeboah K., Darko P. K., Dapaah-Siakwan S. & Duah A. A. 2008 Groundwater as a vital resource for rural development: example from Ghana. In: Applied Groundwater Studies in Africa (S. M. A. Adelana & A. M. MacDonald, eds), CRC Press/Balkema, London, UK, pp. 149–169.
Han T., Jiang D., Zhao Q., Wang L. & Yin K. 2018 Comparison of random forest, artificial neural networks and support vector machine for intelligent diagnosis of rotating machinery. Transactions of the Institute of Measurement and Control 40 (8), 2681–2693.
Jha M. K., Kamii Y. & Chikamori K. 2009 Cost-effective approaches for sustainable groundwater management in alluvial aquifer systems. Water Resources Management 23 (2), 219–233.
Kemper K. E. 2004 Groundwater – from development to management. Hydrogeology Journal 12 (1), 3–5.
Lightbody L. 2009 Winter v. Natural Resources Defense Council, Inc. Harvard Environmental Law Review 33, 593–807.
Manap M. A., Nampak H., Pradhan B., Lee S., Sulaiman W. N. A. & Ramli M. F. 2014 Application of probabilistic-based frequency ratio model in groundwater potential mapping using remote sensing data and GIS. Arabian Journal of Geosciences 7 (2), 711–724.
Miraki S., Zanganeh S. H., Chapi K., Singh V. P., Shirzadi A., Shahabi H. & Pham B. T. 2019 Mapping groundwater potential using a novel hybrid intelligence approach. Water Resources Management 33 (1), 281–302.
Mukherjee P., Singh C. K. & Mukherjee S. 2012 Delineation of groundwater potential zones in arid region of India – a remote sensing and GIS approach. Water Resources Management 26 (9), 2643–2672.
Naghibi S. A., Dolatkordestani M., Rezaei A., Amouzegari P., Heravi M. T., Kalantar B. & Pradhan B. 2019 Application of rotation forest with decision trees as base classifier and a novel ensemble model in spatial modeling of groundwater potential. Environmental Monitoring and Assessment 191 (4), 248.
Nguyen P. T., Ha D. H., Nguyen H. D., Phong T. V., Trinh P. T., Al-Ansari N., Le H. V., Pham B. T., Ho L. S. & Prakash I. 2020 Improvement of credal decision trees using ensemble frameworks for groundwater potential modeling. Sustainability 12 (7), 2622.
Powell J. H. 1988 The Geology of the Karak Area, Bulletin 8, Map Sheet No. 3152-III. National Mapping Project, Ministry of Energy and Mineral Resources (Geology Directorate). Amman, Jordan.
Prasad P., Loveson V. J., Kotha M. & Yadav R. 2020 Application of machine learning techniques in groundwater potential mapping along the west coast of India. GIScience & Remote Sensing 57 (6), 735–752.
Rahmati O., Naghibi S. A., Shahabi H., Bui D. T., Pradhan B., Azareh A., Rafiei-Sardooi E., Samani A. N. & Melesse A. M. 2018 Groundwater spring potential modelling: comprising the capability and robustness of three different modeling approaches. Journal of Hydrology 565, 248–261.
Schwenk H. & Bengio Y. 2000 Boosting neural networks. Neural Computation 12 (8), 1869–1887.
Skurichina M. & Duin R. P. W. 2002 Bagging, boosting and the random subspace method for linear classifiers. Pattern Analysis & Applications 5 (2), 121–135.
Tien Bui D., Shirzadi A., Chapi K., Shahabi H., Pradhan B., Pham B. T., Singh V. P., Chen W., Khosravi K., Bin Ahmad B. & Lee S. 2019 A hybrid computational intelligence approach to groundwater spring potential mapping. Water 11 (10), 2013.
Vapnik V. N. 1999 An overview of statistical learning theory. IEEE Transactions on Neural Networks 10 (5), 988–999.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).