Abstract
Groundwater availability is one of the key concerns in most semi-arid regions of Ethiopia. The purpose of this study was to map the groundwater potential zones of the alluvial plain of Gambela. The study applied the analytic hierarchy process (AHP) model together with four machine learning algorithms: random forest classifier (RFC), gradient boosting classifier (GBC), decision tree classifier (DTC), and K-neighbor classifier (KNC). The features used as predictors include geology, geomorphology, slope, soil, lineament density, drainage density, land use and land cover (LULC), normalized difference vegetation index (NDVI), topographic wetness index (TWI), topographic roughness index (TRI), and rainfall. The final output was classified into very low, low, moderate, high, and very high groundwater potential zones. Validation using the receiver operating characteristic (ROC) curve gives area under the curve (AUC) values of 78.2, 93.4, 92.5, 72.4, and 87.7% for AHP, RFC, GBC, DTC, and KNC, respectively. The results show that RFC and GBC are the best estimators of the groundwater potential zone (GWPZ) map. The study also shows that rainfall and geomorphology are the primary factors influencing the GWPZ. The outcome might support improved management alternatives in other areas of the country with a comparable climate.
HIGHLIGHTS
The study applies four machine learning algorithms (MLAs).
It uses 11 groundwater-influencing criteria to compare models and criteria for the first time.
The study area is a remote area that has not received the required attention from researchers.
The study should become a benchmark for researchers in the area.
It applies as many criteria as are expected to influence groundwater potential.
INTRODUCTION
‘Water is a great mover and doer, constantly modifying the landscape’ (Fieth 1973). Water resources have been a concern of scholars since the earliest research. Groundwater is the chief source of potable water in Africa (MacDonald et al. 2012; Godfrey et al. 2019). Large sedimentary basins made of sandstones and limestones are found mostly in the semi-arid and dry sections of Africa's subsurface environment, together with foundation rocks and mudstones with poor transmissivity. Some of the world's greatest freshwater reserves are found in these basins (Medici et al. 2023). Groundwater becomes the primary source of water in semi-arid regions where surface water is scarce (Magesh et al. 2012; Ostad-Ali-Askari et al. 2017; Manna et al. 2022).
Ethiopia possesses vast surface water capacity with eight primary river basin systems, in which 90% of the water flows consistently throughout the year and crosses international borders. Accordingly, a sufficient amount of surface water can increase subsurface storage (Ostad-Ali-Askari et al. 2017; Morbidelli et al. 2018). The groundwater potential of Ethiopia is not well known (Worqlul et al. 2017). The country's freshwater provision relies on groundwater reserves drawn from springs, shallow hand-dug wells, and deep wells. Groundwater from these sources is used by many sectors, from private citizens to governmental organizations, and from urban supplies to industrial water needs. Although only limited research has been done in the past, much remains to be learned about the subsurface hydrological component of Ethiopian topography. Additionally, the country is experiencing a severe water problem due to climate change and unsustainable economic growth, which increases the strain on the already precarious groundwater resources (Kebede 2013; Seifu et al. 2022).
Groundwater is a concealed asset, and investigating it can be a daunting undertaking. Numerous techniques for groundwater exploration exist across the globe (Jothimani et al. 2021). Among them, test drilling (Regenspurg et al. 2018) and stratigraphic analysis (Campo et al. 2020) are distinctive methods for investigating aquifer parameters. However, these methods are costly and time-consuming, and they require technology and skilled manpower (Mukherjee et al. 2012). Currently, several methods and strategies are used in different parts of the world for groundwater potential zone mapping (Razandi et al. 2015; Golkarian et al. 2018; Azma et al. 2021). Machine learning algorithms (MLAs) are a relatively new technology showing promising results (Gómez-Escalonilla et al. 2022). The models are entirely computational and are used to address difficult problems with complex datasets. The algorithms’ main benefit is that they work directly with raw data, which substantially reduces expert bias (Gómez-Escalonilla et al. 2021). MLAs, such as random forest classifier (RFC), gradient boosting classifier (GBC), decision tree classifier (DTC), and K-neighbor classifier (KNC), have been used in several studies for mapping groundwater potential zones (Naghibi et al. 2017; Avand et al. 2019; Nguyen et al. 2020; Al-Abadi et al. 2021). These studies produced reliable maps with a high degree of precision.
The study region has received little research attention because of its remoteness. The region is distinguished by a semi-arid climate, limited water resources, and erratic weather patterns. Owing to remoteness and security concerns for data collection, the methods described above are new in the chosen research location and have not been tested in a comparable oasis region. Therefore, this study aims to apply MLAs and analytic hierarchy process (AHP) techniques with Geographical Information System (GIS) and remote sensing for the prediction of groundwater potential. The RFC, GBC, DTC, and KNC algorithms were used to predict potential groundwater sites in the research region. The region is covered with Quaternary sediment (alluvial-lacustrine), an important geologic component that has spread over the plain (Kebede 2013). Numerous variables, such as temperature, agricultural methods, land use, and land cover, can affect the groundwater patterns in alluvial zones (Jannis et al. 2021).
The key originality of the present study is the use of four MLA models and AHP to compare the effectiveness of each model for groundwater potential site prediction. Additionally, this particular study uses 11 influencing variables for groundwater potential zone mapping to assess models and criteria for the first time. The objectives of the current investigation are to (i) establish the significance of the geological, morphological, and hydroclimatic aspects in assessing groundwater, (ii) recognize the factors that affect groundwater and appraise their impact, (iii) evaluate the AHP technique for GWPZ study, and (iv) assess the efficiency of MLAs in identifying GWPZ.
METHODOLOGY
Description of the study area
Preparation of thematic layers
The features used as predictors include geology, geomorphology, slope, soil, lineament density, drainage density, land use and land cover (LULC), normalized difference vegetation index (NDVI), topographic wetness index (TWI), topographic roughness index (TRI), and rainfall. We acquired the geological map of the study area from the Ethiopian Geological Survey Institute (GSI). Data on groundwater, soil (in shapefile format), and LULC were collected from the Ministry of Water (MoW). The National Meteorological Agency (NMA) provided information on the meteorological conditions. Additionally, we downloaded satellite data, such as the SRTM DEM and Sentinel-2 images, from the USGS Earth Explorer website (https://earthexplorer.usgs.gov). The satellite imagery used was captured from January to September 2021 with 20% cloud coverage. Interactive supervised classification was used to classify the LULC of the area. Converting and geo-referencing all the available data into Universal Transverse Mercator (UTM) zone 37 was a common preprocessing step. Each thematic map was divided into 5–10 subclasses based on the specific character of each thematic layer (Supplementary Table A2). RockWorks 16, ERDAS Imagine, Surfer, and ArcGIS software were used in this study for data processing.
Geology
Groundwater influencing parameters of the study area: (a) geology, (b) geomorphology, (c) soil texture, and (d) slope.
Geomorphology
The majority of the study area is flat with a small inclination. The ridges and mountains are situated in the eastern headwater area. The elevation difference between the lowest and highest locations is 577 m. The geomorphology map of the study area was developed using the topographic position method (Starbuck et al. 2022).
Five groups of landforms, including canyons, shallow valleys, plain regions, local ridges, and high ridges and mesas, were identified in the research area (Figure 2(b)). The plain area covers most of the landscape (82.6%), while the high ridge has the lowest coverage (0.12%).
Soil texture
The aquifer's capacity to transmit water is significantly influenced by the type, texture, permeability, and structure of the soil. The characteristics of the soil particles affect how much water infiltrates into the aquifer medium. Because different soil textures have different infiltration rates, groundwater recharge is strongly affected by soil texture; fine-grained soil has a relatively low groundwater recharge compared to coarse-grained soil because of its low porosity and permeability (Seifu et al. 2023). Clay soil has the lowest infiltration capacity, whereas sandy soils have a high infiltration rate (Juandi & Syahril 2017). According to the soil classification report by Berhanu et al. (2013), clay, loamy sand, and loam are the most common soil textures in the research region, with areal coverages of 60, 28, and 12%, respectively (Figure 2(c)).
Slope
Changes in runoff and infiltration, which are controlled by the steepness of the surface, affect groundwater recharge. Because runoff rises with slope steepness, recharge to the subsurface falls as the slope steepens, and vice versa (Nag & Kundu 2018). Groundwater potential is therefore higher in flat areas with low runoff (Rajaveni et al. 2017).
The research area's slope map was produced using ArcGIS tools from DEM and ranges from 0 to 70.7° (Figure 2(d)). The slope was then reclassified into five subclasses: flat sloping (0–2°), gentle sloping (2°–5°), strong sloping (5°–10°), moderate steep sloping (10°–15°), and very steep sloping (>15°).
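The reclassification described above can be sketched in a few lines. This is only an illustration using NumPy rather than the ArcGIS tools used in the study; the function name, the sample values, and the treatment of values that fall exactly on a class boundary are assumptions, since the text does not state how boundaries were handled.

```python
import numpy as np

# Upper bounds (degrees) of the first four slope classes given in the text.
SLOPE_BREAKS_DEG = [2.0, 5.0, 10.0, 15.0]
SLOPE_LABELS = [
    "flat sloping",
    "gentle sloping",
    "strong sloping",
    "moderate steep sloping",
    "very steep sloping",
]


def reclassify_slope(slope_deg):
    """Map slope values in degrees to class indices 0..4."""
    slope = np.asarray(slope_deg, dtype=float)
    # right=True assigns a boundary value (e.g. exactly 2 degrees) to the
    # lower class. This boundary rule is an assumption, not from the source.
    return np.digitize(slope, SLOPE_BREAKS_DEG, right=True)


# Hypothetical slope values, including the reported maximum of 70.7 degrees.
sample = np.array([0.5, 2.0, 3.7, 9.9, 12.0, 70.7])
classes = reclassify_slope(sample)
for value, idx in zip(sample, classes):
    print(f"{value:5.1f} deg -> {SLOPE_LABELS[idx]}")
```

The same function can be applied to a whole slope raster loaded as a two-dimensional array, because `np.digitize` preserves the input shape.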
Land use and land cover (LULC)
Groundwater influencing parameters of the study area: (a) land use land cover, (b) rainfall, (c) NDVI, and (d) drainage density.
Rainfall
Rainfall is one of the major elements influencing a region's groundwater potential. Precipitation increases the amount of water on the surface, which raises the likelihood that water will infiltrate to depth. Compared with other parts of the country, the research area has less data and fewer meteorological stations. The average annual rainfall in the region was 1,037.6 mm. The spatial distribution of rainfall over the study area was calculated with kriging interpolation in ArcGIS (Figure 3(b)).
Normalized difference vegetation index (NDVI)
NDVI values approaching one indicate vegetation cover, while lower values (−1 to 0) may indicate clouds, snow, or water (Figure 3(c)). A positive value near zero indicates bare land, rock, and similar surfaces (Gandhi et al. 2015).
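For orientation, the value ranges above follow from the standard NDVI definition, (NIR − Red) / (NIR + Red). The sketch below assumes that the near-infrared and red reflectances come from Sentinel-2 bands B8 and B4; the study does not state which bands were used, and the function name and sample reflectances are hypothetical.

```python
import numpy as np


def ndvi(nir, red):
    """Normalized difference vegetation index, (NIR - Red) / (NIR + Red)."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    total = nir + red
    # Cells where both bands are zero (e.g. nodata) are returned as NaN.
    out = np.full(total.shape, np.nan)
    np.divide(nir - red, total, out=out, where=total != 0)
    return out


# Hypothetical reflectances: dense vegetation, bare land, and water.
# Assumed band mapping: NIR = Sentinel-2 B8, Red = Sentinel-2 B4.
nir_band = np.array([0.45, 0.30, 0.05])
red_band = np.array([0.05, 0.25, 0.10])
values = ndvi(nir_band, red_band)
print(np.round(values, 3))
```

With these illustrative inputs the vegetated cell lies close to one, the bare cell is slightly positive, and the water cell is negative, which matches the interpretation given in the text.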
Drainage density
Lineament density
Groundwater influencing parameters of the study area: (a) lineaments density map (the faults on the surface), (b) rose diagram map for general lineament orientation, (c) TWI, and (d) TRI.
Topographic wetness index (TWI)
The TWI is a parameter that describes how the terrain profile controls water accumulation. It was developed by Beven and Kirkby within TOPMODEL (Beven et al. 2021).
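In its usual TOPMODEL form the index is TWI = ln(a / tan β), where a is the specific catchment area (upslope contributing area per unit contour width) and β is the local slope. The following is a minimal sketch of that formula; the function name, the minimum-slope clamp for flat cells, and the sample values are assumptions and do not describe the authors' processing chain.

```python
import numpy as np


def topographic_wetness_index(specific_catchment_area, slope_rad, min_slope_rad=1e-3):
    """TWI = ln(a / tan(beta)) with a in metres and beta in radians."""
    a = np.asarray(specific_catchment_area, dtype=float)
    # Flat cells would give tan(beta) = 0, so the slope is clamped to a small
    # positive value. The clamp value is an assumption for illustration.
    beta = np.maximum(np.asarray(slope_rad, dtype=float), min_slope_rad)
    return np.log(a / np.tan(beta))


# Hypothetical cells with the same contributing area but different slopes.
area = np.array([100.0, 100.0, 100.0])
slope = np.radians([0.0, 5.0, 45.0])
twi = topographic_wetness_index(area, slope)
print(np.round(twi, 2))
```

For a fixed contributing area, gentler slopes give larger TWI values, which is why flat alluvial terrain tends to score as wetter.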
Topographic roughness index (TRI)
Methods applied for GWPZ
Analytic hierarchy process (AHP)
Multi-criteria decision-making (MCDM) was used to assign weights to the chosen criteria. MCDM helps individuals or groups of decision-makers examine their choices in complicated scenarios involving several factors (Arabameri et al. 2019; Ravichandran et al. 2022).
The groundwater potential zone is defined by 11 influencing criteria. The weights were determined using Saaty's scale of relative significance. The first step was to create a hierarchical structure for the influence of these parameters on groundwater potential zonation. The relative importance of each criterion was evaluated by creating an 11 by 11 pairwise comparison matrix (Table 1). Each value was then divided by the total of its column to obtain a normalized value. The weight of each thematic layer is given by the row average in the normalized pairwise matrix (Supplementary Table A1). Each thematic layer received a weight based on how it affected the GWPZ of the test site (Supplementary Table A2).
Pairwise comparison matrix
Criterion | RF | Geom. | Geo | LD | Slope | Soil | LULC | DD | NDVI | TWI | TRI |
---|---|---|---|---|---|---|---|---|---|---|---|
RF | 1 | | | | | | | | | | |
Geom. | 0.33 | 1 | | | | | | | | | |
Geo | 0.33 | 0.33 | 1 | | | | | | | | |
LD | 0.2 | 0.33 | 1 | 1 | | | | | | | |
Slope | 0.2 | 0.2 | 0.33 | 1 | 1 | | | | | | |
Soil | 0.2 | 0.2 | 0.33 | 0.5 | 1 | 1 | | | | | |
LULC | 0.14 | 0.2 | 0.2 | 0.33 | 0.33 | 1 | 1 | | | | |
DD | 0.14 | 0.14 | 0.2 | 0.33 | 0.33 | 0.33 | 1 | 1 | | | |
NDVI | 0.14 | 0.14 | 0.2 | 0.2 | 0.2 | 0.33 | 0.33 | 0.33 | 1 | | |
TWI | 0.13 | 0.13 | 0.14 | 0.2 | 0.2 | 0.2 | 0.33 | 0.33 | 1 | 1 | |
TRI | 0.11 | 0.13 | 0.14 | 0.14 | 0.17 | 0.2 | 0.2 | 0.33 | 0.5 | 1 | 1 |
Note: RF, rainfall; Geom., geomorphology; Geo, geology; LD, lineament density; LULC, land use/land cover; DD, drainage density; NDVI, normalized difference vegetation index; TWI, topographic wetness index; TRI, topographic roughness index.
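The weighting procedure can be illustrated with the lower-triangular values of the pairwise comparison matrix. In this sketch the upper triangle is filled with reciprocals, the columns are normalized, and the row means are taken as weights. The consistency ratio uses a random index of 1.51 for n = 11, a commonly tabulated value for Saaty's method that is assumed here rather than taken from the source; the function names are hypothetical, and the resulting weights are not claimed to reproduce Supplementary Table A1.

```python
import numpy as np

CRITERIA = ["RF", "Geom.", "Geo", "LD", "Slope", "Soil",
            "LULC", "DD", "NDVI", "TWI", "TRI"]

# Lower triangle of the pairwise comparison matrix, row by row.
LOWER = [
    [1],
    [0.33, 1],
    [0.33, 0.33, 1],
    [0.2, 0.33, 1, 1],
    [0.2, 0.2, 0.33, 1, 1],
    [0.2, 0.2, 0.33, 0.5, 1, 1],
    [0.14, 0.2, 0.2, 0.33, 0.33, 1, 1],
    [0.14, 0.14, 0.2, 0.33, 0.33, 0.33, 1, 1],
    [0.14, 0.14, 0.2, 0.2, 0.2, 0.33, 0.33, 0.33, 1],
    [0.13, 0.13, 0.14, 0.2, 0.2, 0.2, 0.33, 0.33, 1, 1],
    [0.11, 0.13, 0.14, 0.14, 0.17, 0.2, 0.2, 0.33, 0.5, 1, 1],
]


def build_matrix(lower):
    """Fill the full reciprocal matrix from its lower triangle."""
    n = len(lower)
    matrix = np.ones((n, n))
    for i, row in enumerate(lower):
        for j, value in enumerate(row):
            matrix[i, j] = value
            matrix[j, i] = 1.0 / value
    return matrix


def ahp_weights(matrix):
    """Normalize each column by its sum, then average across each row."""
    normalized = matrix / matrix.sum(axis=0)
    return normalized.mean(axis=1)


def consistency_ratio(matrix, random_index):
    """CR = CI / RI with CI = (lambda_max - n) / (n - 1)."""
    n = matrix.shape[0]
    lambda_max = np.max(np.linalg.eigvals(matrix).real)
    return ((lambda_max - n) / (n - 1)) / random_index


pairwise = build_matrix(LOWER)
weights = ahp_weights(pairwise)
# Assumed random index for an 11 x 11 matrix (not given in the source).
cr = consistency_ratio(pairwise, random_index=1.51)
for name, weight in zip(CRITERIA, weights):
    print(f"{name:6s} {weight:.3f}")
print(f"consistency ratio: {cr:.3f}")
```

In Saaty's method a consistency ratio below 0.10 is usually taken to mean that the pairwise judgements are acceptably consistent.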
Checking consistency
Random forest classifier (RFC)
Random forests are supervised MLAs that use majority voting for classification or averaging for regression. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean prediction of the individual trees is returned (Tyralis et al. 2019).
Gradient boosting classifier (GBC)
Decision tree classifier (DTC)
A decision tree is a nonparametric supervised learning technique for classification and regression. The objective is to build a model that predicts the value of a target variable by learning simple decision rules derived from the data features. A tree can be thought of as a piecewise constant approximation. Decision tree inducers are techniques that create a decision tree from a given dataset automatically. Typically, the best decision tree is found by minimizing the generalization error (Tanimu et al. 2022).
K-neighbor classifier (KNC)
Dataset for MLAs
The dataset of 2,959 samples was split into training, validation, and test sets in proportions of 70, 20, and 10%, respectively, to increase the performance of our MLA models (Figure 5). As a general rule, normalization and standardization are applied in the next stage to increase the effectiveness of the models. Data normalization ensures that each feature contributes on a comparable scale. This does not imply, however, that all features are equally significant when choosing a classifier. In several application domains, researchers have employed normalization techniques to enhance classification performance (Thanh et al. 2022). In this study, we use Min-Max Scaler normalization (Deepa & Ramesh 2022).
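A minimal sketch of this preparation step, assuming scikit-learn, is shown below. The synthetic features, the binary labels, the stratified splitting, and the choice to fit the scaler on the training subset only are all assumptions for illustration; the study does not describe these details.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def split_and_scale(X, y, seed=42):
    """Split 70/20/10 and rescale features to [0, 1] with Min-Max scaling."""
    # First take 70% for training, then divide the remaining 30% in a 2:1
    # ratio, which gives roughly 20% validation and 10% test.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=0.70, random_state=seed, stratify=y)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=1 / 3, random_state=seed, stratify=y_rest)
    # Fitting on the training subset only avoids leaking information from
    # the validation and test subsets (an assumption about good practice).
    scaler = MinMaxScaler().fit(X_train)
    return ((scaler.transform(X_train), y_train),
            (scaler.transform(X_val), y_val),
            (scaler.transform(X_test), y_test))


# Synthetic stand-in: 2,959 points, 11 predictors, hypothetical 0/1 labels.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 100.0, size=(2959, 11))
y_demo = rng.integers(0, 2, size=2959)
train, val, test = split_and_scale(X_demo, y_demo)
print(len(train[1]), len(val[1]), len(test[1]))
```

Because the scaler is fitted on the training subset, validation and test values can fall slightly outside the [0, 1] range.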
Random Search CV: Random search over hyperparameters is implemented by the Randomized Search CV technique (Bergstra et al. 2012). The function takes several arguments, including the estimator, the parameter distributions, the number of iterations, the scoring metric, and the cross-validation scheme. In contrast to grid search, this method assumes that not all hyperparameters are equally important. Each iteration draws a new random set of hyperparameters, which raises the probability of discovering effective combinations.
Model validation
Model validation refers to the process of ensuring that the model actually achieves its intended purpose (Naghibi et al. 2016). The verifiability of any scientific research serves as a gauge of its quality. The performance of the GWPZ predictions from the models described above was assessed using the ROC curve and the statistical measures of accuracy, precision, recall, F1-score, and kappa index (Table 2).
Accuracy indices calculated for each classifier
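No | Parameter | Formula |
---|---|---|
1 | Accuracy | $\dfrac{TP + TN}{TP + TN + FP + FN}$ |
2 | Precision | $\dfrac{TP}{TP + FP}$ |
3 | Recall | $\dfrac{TP}{TP + FN}$ |
4 | F1-score | $\dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ |
5 | Overall accuracy (OA) | $\dfrac{TP + TN}{N}$ |
6 | Expected accuracy (EA) | $\dfrac{(TP + FN)(TP + FP) + (FP + TN)(FN + TN)}{N^{2}}$ |
7 | Kappa | $\dfrac{OA - EA}{1 - EA}$ |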
Note: TP is the number of positive points correctly classified (true positives), TN is the number of negative points correctly classified (true negatives), FP is the number of negative points incorrectly classified as positive (false positives), FN is the number of positive points incorrectly classified as negative (false negatives), N is the total number of points, and EA is the expected accuracy.
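As a worked illustration of these indices, the sketch below computes them from the four confusion-matrix counts using the standard binary definitions, which are assumed here to match the ones the study used. The function name and the example counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1-score and Cohen's kappa from counts."""
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n  # identical to overall accuracy (OA)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Expected (chance) agreement from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "expected_accuracy": expected,
        "kappa": kappa,
    }


# Hypothetical counts for 100 validation points.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a multi-class GWPZ map the same quantities are usually computed per class and then averaged, for example with the weighted and macro averages mentioned later in the text.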
RESULT
Multicollinearity and features selection
The variance inflation factor (VIF) was used to assess the severity of collinearity between independent variables. It measures how much collinearity inflates the variance of the coefficient estimates. Additionally, tolerance indicates the proportion of variance in a predictor that cannot be explained by the other predictors. In practice, if the VIF is greater than 4 or the tolerance is less than 0.25, multicollinearity may be present and needs further investigation (Ashwini et al. 2023). According to the findings (Table 3), there is no severe collinearity among the independent variables in the data.
VIF and tolerance result of the predictors
Predictors | VIF | Tolerance |
---|---|---|
Geomorphology | 1.04 | 0.96 |
Geology | 1.19 | 0.83 |
Drainage density | 1.07 | 0.93 |
Lineament density | 1.19 | 0.84 |
LULC | 1.43 | 0.69 |
NDVI | 1.42 | 0.7 |
Slope | 1.13 | 0.88 |
Soil | 1.11 | 0.9 |
RF | 2.15 | 0.46 |
TWI | 1.13 | 0.88 |
TRI | 1.01 | 0.98 |
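The relation between the two columns is tolerance = 1 − R² and VIF = 1 / tolerance, where R² comes from regressing one predictor on all the others. A minimal NumPy sketch of that calculation follows; the function name and the synthetic data are assumptions, and the tool the authors used to compute these values is not stated.

```python
import numpy as np


def vif_and_tolerance(X):
    """Return a (VIF, tolerance) pair for every column of X."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    results = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Ordinary least squares of predictor j on the others, with intercept.
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        tolerance = ss_res / ss_tot  # equals 1 - R^2
        results.append((1.0 / tolerance, tolerance))
    return results


# Synthetic, nearly independent predictors (hypothetical data).
rng = np.random.default_rng(0)
independent = rng.normal(size=(500, 3))
for name, (vif, tol) in zip(["x1", "x2", "x3"], vif_and_tolerance(independent)):
    print(f"{name}: VIF = {vif:.2f}, tolerance = {tol:.2f}")
```

Nearly independent predictors give VIF values close to 1, as in the table above, whereas a predictor that is almost a linear combination of others would exceed the threshold of 4.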
Prediction results of GWPZ
Groundwater potential zone maps produced by the AHP, RFC, KNC, GBC, and DTC models.
Model validation
The models’ performance was assessed using five metrics: accuracy, precision, kappa, recall, and F1-score. The hyperparameters were tuned using the Randomized Search CV technique (Bergstra et al. 2012). It is crucial to specify the search space and provide a starting point. The search uses a random state of 42 and performs three-fold cross-validation for each of 100 candidates, resulting in 300 fits. Table 4 shows the search space together with the baseline and best parameter values.
Hyperparameters used for the baseline and best algorithms, including the search space
Hyperparameter | Search space | RFC baseline | RFC best | GBC baseline | GBC best | DTC baseline | DTC best | KNC baseline | KNC best |
---|---|---|---|---|---|---|---|---|---|
n_estimators | [200, 2,000, num = 10] | [‘100’] | [‘600’] | [‘100’] | [‘200’] | _ | _ | _ | _ |
max_features | [‘auto’, ‘sqrt’, ‘log2’] | [‘auto’] | [‘auto’] | [‘auto’] | [‘sqrt’] | [‘None’] | [‘auto’] | _ | _ |
max_depth | [10, 110, num = 11] | [‘None’] | [‘60’] | [‘None’] | [‘50’] | [‘10’] | [‘80’] | _ | _ |
min_samples_split | [2, 5, 10] | [‘2’] | [‘2’] | [‘2’] | [‘10’] | [‘2’] | [‘2’] | _ | _ |
min_samples_leaf | [1, 2, 4] | [‘1’] | [‘2’] | [‘1’] | [‘2’] | [‘1’] | [‘2’] | _ | _ |
bootstrap | [True, False] | [‘True’] | [‘False’] | None | None | _ | _ | _ | _ |
metric | [‘euclidean’, ‘manhattan’, ‘minkowski’] | _ | _ | _ | _ | _ | _ | [‘minkowski’] | [‘manhattan’] |
n_neighbors | [3, 5, 11, 19] | _ | _ | _ | _ | _ | _ | [‘5’] | [‘19’] |
weights | [‘uniform’, ‘distance’] | _ | _ | _ | _ | _ | _ | [‘uniform’] | [‘distance’] |
criterion | [‘gini’, ‘entropy’] | [‘gini’] | [‘entropy’] |
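The tuning setup can be sketched with scikit-learn's `RandomizedSearchCV`, using the random forest search space from the table, 100 candidates, three-fold cross-validation, and a random state of 42. This is an illustration under assumptions, not the authors' script: the helper names and the accuracy scoring are hypothetical, and ‘auto’ is left out of `max_features` because recent scikit-learn releases no longer accept that value for forest classifiers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV


def rfc_search_space():
    """Random forest search space following the table above."""
    return {
        # 200, 400, ..., 2000 and 10, 20, ..., 110, as implied by the
        # "num = 10" and "num = 11" entries.
        "n_estimators": np.linspace(200, 2000, num=10).round().astype(int).tolist(),
        "max_features": ["sqrt", "log2"],  # 'auto' omitted (see lead-in)
        "max_depth": np.linspace(10, 110, num=11).round().astype(int).tolist(),
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        "bootstrap": [True, False],
    }


def tune(estimator, space, X, y, n_iter=100, cv=3, seed=42):
    """Randomized search: n_iter sampled candidates, each scored with cv folds."""
    search = RandomizedSearchCV(
        estimator=estimator,
        param_distributions=space,
        n_iter=n_iter,
        cv=cv,
        scoring="accuracy",  # assumed scoring metric
        random_state=seed,
    )
    search.fit(X, y)
    return search


# Example call (X_train and y_train are placeholders for the training data):
# search = tune(RandomForestClassifier(random_state=42), rfc_search_space(),
#               X_train, y_train)
# print(search.best_params_)
```

With `n_iter=100` and `cv=3` the search performs 300 fits, which matches the count reported in the text.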
The results for the two scenarios (baseline and random search CV) are given in Table 5. The RFC model fared better than the other models in this experiment in terms of accuracy (96.42%), precision (89%), recall (89%), F1-score (88%), and kappa (0.76). Following RFC, the GBC model achieved an accuracy of 96.11%, precision of 88%, recall of 87%, F1-score of 88%, and kappa of 0.74. Because the top-performing models have kappa values between 0.61 and 0.80, their performance is interpreted as substantial (Czodrowski 2014). The study makes use of weighted and macro averages, of which the weighted average produced respectable outcomes. The models improved once the hyperparameters were optimized with random search CV. GBC shows the greatest improvement, followed by DTC, with improvements of 2.09 and 1.66%, respectively.
Performance analysis result for all algorithms
Model | Scenario | Accuracy | Precision | Recall | F1-score | Kappa | Improvement |
---|---|---|---|---|---|---|---|
GBC | Baseline | 94.14% | 80% | 81% | 81% | 0.61 | 2.09% |
GBC | Random search CV | 96.11% | 88% | 88% | 87% | 0.74 | |
RFC | Baseline | 95.30% | 85% | 84% | 84% | 0.68 | 1.18% |
RFC | Random search CV | 96.42% | 89% | 89% | 88% | 0.76 | |
DTC | Baseline | 93.02% | 78% | 77% | 77% | 0.54 | 1.66% |
DTC | Random search CV | 94.57% | 82% | 82% | 82% | 0.63 | |
KNC | Baseline | 92.09% | 74% | 74% | 74% | 0.46 | 1.01% |
KNC | Random search CV | 93.02% | 77% | 77% | 77% | 0.52 | |
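The Improvement column appears to be the relative gain in accuracy from the baseline to the tuned model, that is, (tuned − baseline) / baseline × 100. This reading is an inference from the tabulated numbers, not something the text states, and the short check below is only a sketch of that arithmetic.

```python
# Baseline and tuned accuracies (%) taken from the performance table.
ACCURACY = {
    "GBC": (94.14, 96.11),
    "RFC": (95.30, 96.42),
    "DTC": (93.02, 94.57),
    "KNC": (92.09, 93.02),
}


def relative_improvement(baseline, tuned):
    """Relative accuracy gain in percent of the baseline value."""
    return (tuned - baseline) / baseline * 100.0


improvements = {
    model: relative_improvement(baseline, tuned)
    for model, (baseline, tuned) in ACCURACY.items()
}
for model, gain in improvements.items():
    print(f"{model}: {gain:.2f}%")
```

Under this assumption the computed gains agree with the tabulated values to within rounding in the second decimal place.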
Feature importance
The Summary bar (on the left) and global interpretability plot (on the right) for GBC, RFC, and DTC.
ROC analysis
DISCUSSION
According to the results, the AUC values of the ROC curves for AHP, RFC, GBC, DTC, and KNC were 0.782, 0.934, 0.925, 0.724, and 0.877, respectively. All the models therefore perform well, with AUC values greater than 0.70. These results are similar to those of previous groundwater potential prediction studies (Naghibi et al. 2019; Thanh et al. 2022). Compared with the other models, RFC and GBC deliver excellent results, KNC provides very good results, and AHP and DTC fall in the category of good outcomes. The DTC has the lowest predictive value of all the models. These results indicate that the RFC and GBC models fitted the data well, with excellent AUC values. According to the MLAs’ accuracy and kappa index, the order for GWPZ prediction is RFC > GBC > DTC > KNC. The results for RFC and GBC are similar to those of studies in various regions of the world (Rahmati et al. 2016; Naghibi et al. 2017, 2020). Our findings are likewise comparable to those of Maskooni et al.'s (2020) research in northern Iran and Sachdeva & Kumar's (2021) study in India. In both of these studies, GBC and RFC were the best-performing models, and GBC (0.874 and 0.79) had greater accuracy than RFC (0.864 and 0.71) in the respective studies.
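To make the AUC figures above concrete: the area under the ROC curve equals the probability that a randomly chosen positive point receives a higher predicted score than a randomly chosen negative point, with ties counted as one half. The sketch below computes it directly from that definition; the function name, labels, and scores are hypothetical and are not the study's validation data.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive-negative pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical validation points: 1 = productive well, 0 = non-productive.
demo_labels = [1, 1, 1, 0, 0, 0]
demo_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
demo_auc = roc_auc(demo_labels, demo_scores)
print(f"AUC = {demo_auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the five models are compared.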
The analysis shows that much of the Gambela Plain has high and very high groundwater potential. The very high potential class covers 43.37, 44.63, 38.03, 33.81, and 39.6% of the area for AHP, RFC, GBC, DTC, and KNC, respectively (Figure 7). RFC gives the largest area coverage for the very high class, while DTC gives the smallest. The very low potential class covers 3.86, 20.87, 16.13, 15.54, and 16.17% of the area for AHP, RFC, GBC, DTC, and KNC, respectively. Compared with the other models, RFC gives the largest area coverage for the very low class, whereas AHP gives the smallest. The DTC model has the largest area coverage for the high and moderate potential classes, with 32 and 19.8%, respectively (Figure 7).
Geomorphology and rainfall are the key factors that affect the GWPZ of the study region in all model outputs. In the MLAs, slope, geology, TWI, and TRI have little influence on GWPZ prediction. MLAs, which are one of the most effective tools for dealing with high-dimensional, unstable, and real-world problems, have become very popular, especially in geospatial applications including groundwater potential assessment. Many ensemble models are thought to be more effective at addressing prediction problems than single ML approaches (Naghibi et al. 2017). The methodology considers eleven commonly used groundwater-contributing factors, which supports the study's high degree of accuracy. A number of recent studies worldwide have applied AHP and MLAs to investigate groundwater potential (Zabihi et al. 2016; Golkarian et al. 2018; Avand et al. 2019). As in other spatial science applications, the primary limitation of this study is that model outputs must be examined across several study areas before they can be considered generally applicable. A further limitation is that the final GWPZ maps might vary when new datasets and/or models are incorporated. The results of our study show higher MLA accuracy than other studies in the field, owing to the random search application. The baseline and the best parameters are distinguishable after the improvement. The success of this method depends on a number of characteristics of the input dataset, as well as on how the algorithm is used and validated. Therefore, to enhance the performance of future studies, we strongly advise using additional MLAs and models and including further groundwater-influencing parameters.
In order to ensure both groundwater availability and management, we advise that future studies evaluate this technique using qualitative indicators in additional locations with varied geo-environmental features.
CONCLUSION
Sustainable development depends on correct GWPZ evaluation. In areas where data are scarce, RS-based data products can provide valuable information. AHP, RFC, GBC, KNC, and DTC were all used in this investigation. Eleven thematic maps of factors that influence the local groundwater were generated. In general, the research region is distinguished by a gradual slope, a low number of lineaments, a wetland nature, and homogeneous geological material of alluvial-lacustrine deposits, all of which contribute to a high groundwater potential. The eastern corner of the research region has some difficulties due to the abundance of geologic materials, the steepness of the slope, and the increased density of lineaments. The characteristics that had the greatest influence on GWPZ mapping for all of the models were geomorphology and rainfall. The GWPZ map produced with ArcGIS spatial analysis tools depicted five potential zones: very low, low, moderate, high, and very high. GWPZ is very high and high in agricultural, grazing, and wetland regions. The ROC curve was used to confirm the predictive ability of the methods applied. The results reveal that RFC and GBC outperform the other models in GWPZ prediction. The GWPZ map of the data-scarce and overlooked Gambela Plain will help stakeholders and decision-makers to manage and plan the resource efficiently. Owing to their high and rapid efficiency, the approaches used demonstrate the usefulness of MLAs, remote sensing, and GIS in spatial decision-making, notably in groundwater management.
ACKNOWLEDGEMENTS
The author expresses gratitude to all governmental bodies for supplying the information needed for this research project. The author would like to thank Haramaya University for providing the opportunity for PhD studies and for sponsoring the tuition. The author is also grateful to the anonymous reviewer for the insightful comments, which greatly improved the quality of the paper.
STATEMENT OF DECLARATION
I certify that the information presented here is true and complete to the best of my knowledge. I declare that this work is not published anywhere and is not submitted to any journal for publication.
FINANCIAL DISCLOSURE STATEMENT
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
AUTHOR CONTRIBUTIONS
All authors contributed to the study's conception and design. Material preparation, data collection, and analysis were performed by T.K.S. and T.A.W. The first draft of the manuscript was written by T.K.S. and edited by T.A.W. Both T. Alemayehu and T. Ayenew read and commented on previous versions of the manuscript. The final version was proofread by T.K.S. All authors read and approved the final manuscript.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.