ABSTRACT
Water resources are critically needed in arid regions. This study examines areas suitable for siting dams using machine learning (ML) techniques. The random forest (RF) and support vector machine (SVM) algorithms were applied to predict regions suitable for siting a dam from the factors that influence site suitability. The results suggest that 54.43% of the study area was suitable under the RF model and 49.72% under the SVM model, where suitable denotes the moderate, high, and very high suitability classes combined. The root mean square error, R2, and mean absolute error were 47.84, 0.53, and 0.64 for the RF model and 43.55, 0.49, and 0.61 for the SVM model, respectively. The area under the receiver operating characteristic curve was 0.827 for the RF model and 0.790 for the SVM model. The findings can be applied to enhance decision-making on dam construction, and the study offers a framework for identifying suitable dam locations using ML methods.
HIGHLIGHTS
Machine learning optimizes dam siting in Northern Ghana for water management.
Random forest classified 54.43% of the region as suitable for dam construction.
Validation metrics (AUC, RMSE, R2, and MAE) show strong model performance.
Supports the One Village One Dam initiative to boost agricultural productivity.
INTRODUCTION
Water is an essential resource for both domestic and commercial purposes, especially in regions with low rainfall where a sustainable water supply is critical (Nayak et al. 2023). This study seeks to identify high-potential areas for dam siting in Northern Ghana, a region characterized by high aridity and variable rainfall. Dams play a crucial role in storing water, mitigating surface flooding, and supporting various uses such as irrigation, water supply, and flood control (Degu et al. 2011; Pradhan & Srinivasan 2022; Luo et al. 2023). Historically, dams have played a significant role in enhancing water security and supporting local economies, but the construction of these dams has been hindered by inadequate investigation and has significantly impacted catchment water resources in terms of their spatial distribution and storage capacities (Al-Ruzouq et al. 2019). However, despite their benefits, the construction of dams often faces challenges related to site selection and design, which can impact their effectiveness and sustainability (Othman et al. 2020; Urzică et al. 2021).
Global water scarcity is a growing concern, with predictions indicating that by 2025, half of the world's population will live in countries experiencing significant water stress (Dalstein & Naqvi 2022; Donchyts et al. 2022). Earlier projections indicated that by 2016, 36 states within the United States would experience a deficiency in water resources (Donchyts et al. 2022). This highlights the importance of identifying suitable locations for dams, particularly in semi-arid regions. In Ghana, the government's ‘One Village One Dam’ (1V1D) policy aims to address water scarcity through the construction of dams, but many of these structures have not met standard specifications, limiting their effectiveness (Awuni et al. 2023). Therefore, a systematic approach to site selection is necessary to ensure the sustainability and functionality of these dams.
Recent advancements in machine learning (ML) offer promising solutions for water resource management, including dam site selection. This study employed both the RF and SVM models. The RF technique combines many decision trees to improve prediction accuracy and reduce overfitting. Because it is non-parametric, provides variable importance measures, resists overfitting, and accommodates missing values, it can handle incomplete information in complicated, real-world datasets. SVM, on the other hand, is a powerful ML technique that models complex data relationships and works well in high-dimensional domains when employing the kernel method. These techniques have demonstrated high accuracy in handling complex data and predicting viable areas for development (Al-Ruzouq et al. 2019; Prasad et al. 2020; Ezugwu et al. 2022). Dam site selection is inherently complex due to the multitude of factors involved. This study proposes a comprehensive framework that considers 11 criteria, addressing the intertwined nature of these factors and incorporating uncertainty into the decision-making process. Previous studies often relied on fewer criteria or simpler decision-making processes, which may not fully capture the complexity of dam site selection (Agarwal et al. 2013). By using an extensive set of criteria, this study aims to provide a more detailed and accurate assessment of potential dam sites. This approach takes hydrological, geological, and environmental considerations into account, thereby improving the overall decision-making process (Chegbeleh et al. 2020).
The five northern regions of Ghana have seen numerous dams constructed under the ‘One Village One Dam’ initiative. However, concerns about improper siting and subsequent structural failures necessitate a more rigorous approach. Field surveys indicate that many of these dams have not been constructed to standard specifications, resulting in limited effectiveness and increased risk of breaches and flooding-related issues (Awuni et al. 2023). Few studies have examined the difficulties encountered under the 1V1D initiative in these regions. This study provides a novel application of ML in this region, offering valuable insights for local authorities and decision-makers. By focusing on Northern Ghana, this research addresses specific local challenges and contributes to the development of sustainable water management practices in the region. By producing viability maps that consider both ML predictions and current land use conditions, this study grounds the site selection process in practical realities. Traditional methods often fail to integrate these aspects comprehensively, leading to suboptimal site selection. Integrating viability maps with land use conditions allows for a more accurate and realistic assessment of potential dam sites. It also enhances the applicability of our findings to other regions with similar conditions, providing a model that can be replicated and adapted for use in different contexts.
The primary objective of this research is to employ ML techniques to address the pressing issue of water scarcity in Northern Ghana by identifying optimal sites for dam construction. The significance of this study lies in its potential to inform policy decisions, support sustainable development goals, and mitigate the adverse effects of climate change on water resources in the region (Nguyen et al. 2021). By providing a robust, data-driven approach to dam site selection, we aim to contribute to the long-term water security and resilience of communities in Northern Ghana (Al-Addous et al. 2023).
DESCRIPTION OF THE STUDY AREA
This area was selected for this study because of its topographical and hydrological conditions. These are the major phenomena that enable surface water existence. The present reservoirs were the sampling targets. This study found that the area's vegetation is a mixture of Savannah woodland and grassland, depending on the local climate (Asiamah et al. 1997). The Sudan Savannah's sporadic grasslands and trees can be seen in the northern regions. The vegetation changes to Guinea Savannah as you go east, which has a greater tree cover and a wider variety of plants. In the drier sections, acacia and baobab trees are widespread, while luxuriant grasslands and bushes are found in the wetter portions (Avornyo et al. 2014). The study region has a varied geological makeup that includes a range of rock formations. The low hills are a result of the presence of granite and gneiss rocks in the northern and upper east regions. While the Savannah region has a variety of sedimentary rocks, the upper west region is recognized for its sandstone formations (Chegbeleh et al. 2020). Northern Ghana is characterized by a complex geological landscape, primarily composed of the Birimian supergroup and the Tarkwaian group. The Birimian rocks, predominantly metavolcanics and metasediments, are significant for their rich mineral deposits, including gold and manganese. These rocks are intruded by granitoids, which vary from granodiorites to tonalites, contributing to the region's geological diversity. The Tarkwaian group, composed of clastic sediments such as quartzites, conglomerates, and phyllites, overlays the Birimian rocks and is also known for hosting gold deposits. Additionally, the region features extensive lateritic weathering profiles, resulting in ferruginous and bauxitic soils that influence the area's geomorphology and land use. This geological complexity underpins the diverse mineral resources and varied topography of Northern Ghana. 
The region's water supply and soil composition are influenced by these geological factors. The primary geology and weather have an impact on the different types of soil found in the study area. The soil is fertile, which facilitates agricultural activities. Alluvial soils along riverbanks and laterite soils, which are predominant and contribute to fertile agriculture, are particularly found in Savannah areas. In certain places, the presence of organic matter improves soil fertility, which helps crops grow successfully. Rivers and other bodies of water are important components of the integrated research area. The Volta River flows through the area, supplying water for residential consumption, agriculture, and the production of hydroelectric power. The Volta River systems include the White, Black, and Red Volta. Seasonal ponds and streams also contribute to the water network and support nearby ecosystems.
MATERIALS AND METHODS
Choosing the determining factors
The most crucial factors used in determining viable dam sites range from topographic to socio-economic. Studies have adopted several factors, including soil type, lithology, land use, slope, elevation, distance to roads, population density, soil moisture, distance to settlements, drainage density, fault line, geomorphology, rainfall, groundwater recharge, and many more (Naghibi et al. 2017; Rane & Jayaraj 2022). This study selected the following influencing factors for dam siting: rainfall, land use and land cover (LULC), normalized difference vegetation index (NDVI), lithology, soil texture, land surface temperature (LST), distance to settlements, slope steepness, elevation, drainage stream density, and topographical wetness index (TWI). These factors are crucial in hydrological and environmental assessments, as they directly impact dam functionality and sustainability. Rainfall is essential for ensuring a reliable water supply, while LULC affects runoff patterns, sedimentation rates, and hydrological response. NDVI assesses vegetation health and density, while soil texture determines water retention and erosion susceptibility. LST provides insights into the microclimate, evaporation rates, and thermal regime of the water body. Distance to settlements is crucial for resource accessibility and socio-economic impacts. Slope steepness affects water runoff and erosion rates, while elevation determines the gravitational potential energy available for water storage and distribution. Drainage stream density reflects the network of water channels in the area, and TWI helps understand the spatial distribution of soil moisture. Incorporating these factors supports the siting of earthen dams in the northern part of Ghana.
Data collection and preparation
A review of the literature indicates that studies have extensively adopted GIS and AHP in dam selection analysis. While AHP and TOPSIS have been employed in several studies, little has been done using ML (Li et al. 2018; Arulbalaji et al. 2019; Ifediegwu 2022; Rane et al. 2023). However, some studies have employed different ML approaches in groundwater potential assessment and landslide susceptibility analysis. This study seeks to apply ML techniques in mapping suitable dam sites while comparing the accuracy of these models for better decision-making.
Firstly, as the entire amount of rainfall has a considerable effect on reservoir volume, it is important to consider the probability of runoff in a given location (Sarkar et al. 2022). The catchment's rainfall information was taken from the August 2003 rainfall data provided by Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) Africa with a resolution of 500 m. Since the CHIRPS dataset offers rainfall distribution across the whole of Ghana, the boundaries of the study area were loaded to extract the extent of rainfall distribution. The CHIRPS dataset captures seasonal and average rainfall, with extreme rainfall events considered for assessing runoff potential.
The present study utilized the Shuttle Radar Topography Mission – Digital Elevation Model (SRTM DEM) with 30 m resolution and satellite imagery (Landsat 8 and 9 datasets), both downloaded from the United States Geological Survey (USGS) EarthExplorer website. Dam location data were collected from the Ghana Water Company and the Community Water and Sanitation Agency, while most of the data were collected from the field and Google Earth (GE). Table 1 shows a breakdown of the data and their sources.
Datasets, formats, and sources used in this study
No. | Data | Format | Source |
---|---|---|---|
1 | Topographic Data (DEM) | TIFF | https://www.usgs.gov/the-national-map-data-delivery/gis-data-download? |
2 | Lithology | Shapefile/scanned maps | https://ggsa.gov.gh/? |
3 | LULC | Shapefile | https://www.esri.com/about/newsroom/announcements/esri-releases-latest-land-cover-map-with-updated-sentinel-2-satellite-data? |
4 | Soil information | Shapefile | https://www.fao.org/land-water/land/land-governance/land-resources-planning-toolbox/category/details/en/c/1026564/? |
5 | Rainfall data | TIFF | https://www.chc.ucsb.edu/data/chirps? |
6 | Dam location | Excel/PDF | Field survey |
QGIS 3.16 (Hannover) is an open-source geographic information system (GIS) program for processing, analyzing, and visualizing spatial data (Löwe et al. 2022). The data processing in this research was conducted in the QGIS 3.16 and RStudio environments. In QGIS, each factor layer was clipped to the boundaries of the study area. All the selected parameters were resampled to a spatial resolution of 30 m with the same processing extent. Because topographic features affect hydrological characteristics both directly and indirectly, they are critical for predicting suitable areas. The distance to settlements was computed in QGIS using the Euclidean distance technique, with the settlement data as the input feature.
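The Euclidean distance layer described above can be illustrated with a small standalone sketch. It is not the QGIS tool used in the study; the function name `distance_to_nearest`, the toy grid size, and the settlement cell positions are hypothetical, and the 30 m cell size is taken from the resampling resolution stated in the text.

```python
import math

def distance_to_nearest(rows, cols, settlements, cell_size=30.0):
    """Return a rows x cols grid of distances (m) to the nearest settlement cell.

    settlements is a list of (row, col) cell positions; cell_size is the
    raster resolution in metres (30 m assumed, as in the resampled layers).
    """
    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Straight-line distance in cell units to the closest settlement
            d = min(math.hypot(r - sr, c - sc) for sr, sc in settlements)
            row.append(d * cell_size)
        grid.append(row)
    return grid

# Toy 3 x 4 raster with two hypothetical settlement cells
dist = distance_to_nearest(3, 4, [(0, 0), (2, 3)])
```

Cells holding a settlement receive a distance of zero, and values grow with distance from the nearest settlement, which is the pattern the study reads from its distance-to-settlements map.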
Environmental change and sustainability issues can be associated with LULC changes (Kuusaana & Eledi 2015; Hosek 2019). Although LULC changes have both natural and man-made causes, population growth has made anthropogenic disturbance the most significant cause (Fitriyanto et al. 2019; Rahman et al. 2020; Fahad et al. 2021). The majority of these LULC changes are related to numerous and serious man-made activities occurring across the world. As a result, it is crucial to investigate these dynamics and their importance for dam location. Some studies have found that dams should not be sited around dense vegetation but should be accessible to settlements. This is because dense vegetation generates low runoff, while nearness to settlements satisfies the purpose of the dams. The study area is dominated by seven LULC classes, which are described in Table 2.
LULC classification and description in the study area
Class | Depiction | Description |
---|---|---|
A | Water bodies | At least 60% of the area is covered by permanent water bodies. |
B | Mixed forest | Dominated by neither deciduous nor evergreen (40–60% of each) tree type (canopy >2 m). Tree cover >60%. |
C | Permanent wetland | Permanently inundated lands with 30–60% water cover and >10% vegetation cover. |
E | Grassland/shrubs | Dominated by woody perennials (1–2 m height) 10–60% cover. |
F | Bareland | At least 60% of the area is non-vegetated barren (sand, rock, and soil) areas with less than 10% vegetation. |
G | Cropland | At least 60% of the area is cultivated cropland. |
H | Urban and built-up | At least 30% impervious surface area including building materials, asphalt, and vehicles. |
LST is the concentration of thermal energy as a radiative skin temperature on the surface calibrated in the direction of the remote sensor (Ghosh et al. 2019; Tafesse & Suryabhagavan 2019). LST differs markedly from air (atmospheric) temperature, and the two parameters should not be confused. LST has been used in soil physics (Saha et al. 2018), climatic and meteorological studies (Shiran et al. 2021), forestry (Sinha et al. 2015), ecological studies (Cui & Shi 2012), natural resource studies, and agricultural science (Saha et al. 2018, 2021). In this research, the temperature index was used because of the high spatio-temporal variability in the study area. The current study developed these indices to explore the dynamics of dam location and LST. Areas with high LST can experience greater evaporation and seepage of water, depending on the soil and land use conditions. The estimation of LST using Landsat 8 (OLI) requires a series of systematic computations, which are summarized in Table 3 (Kafy et al. 2021).
Formulae used in land surface temperature (LST) computation
S/N | Name | Equations | Remarks |
---|---|---|---|
1 | Top of atmospheric spectral radiance (TOA) | $L_\lambda = M_L \, Q_{cal} + A_L - O_i$ | Where ML represents the band-specific multiplicative rescaling factor, Qcal is the Band 10 image, AL is the band-specific additive rescaling factor, while Oi is the correction for Band 10. |
2 | Brightness temperature (BT) | $BT = \dfrac{K_2}{\ln\left(K_1 / L_\lambda + 1\right)} - 273.15$ | Where K1 and K2 stand for the band-specific thermal conversion constants from the metadata. |
3 | NDVI | $NDVI = \dfrac{NIR - Red}{NIR + Red}$ | This is used to calculate the proportion of vegetation. |
4 | Proportion of vegetation (Pv) | $P_v = \left(\dfrac{NDVI - NDVI_{min}}{NDVI_{max} - NDVI_{min}}\right)^2$ | Global values from NDVI can be calculated from at-surface reflectivity. |
5 | Land surface emissivity (LSE) | $\varepsilon = 0.004 \, P_v + 0.986$ | Where Pv is the proportion of vegetation. |
6 | Land surface temperature (LST) | $LST = \dfrac{BT}{1 + (\lambda \, BT / \rho) \ln \varepsilon}$ | Where LST is in Celsius (°C), BT is the at-sensor BT (°C), λ is the wavelength of the emitted radiance, ρ = hc/σ, and ε is the LSE. |
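The chain of computations listed in Table 3 can be sketched for a single pixel as follows. This is an illustrative sketch, not the processing used in the study: the equation forms are the commonly published Landsat 8 expressions, and the numeric constants (rescaling factors, K1, K2, the Band 10 correction, the wavelength, and ρ) are assumed values that would normally be read from the scene metadata. All function names are hypothetical.

```python
import math

# Assumed Landsat 8 Band 10 constants; real values come from the scene metadata.
M_L = 3.342e-4          # band-specific multiplicative rescaling factor (assumed)
A_L = 0.1               # band-specific additive rescaling factor (assumed)
O_I = 0.29              # correction for Band 10 (assumed)
K1 = 774.8853           # thermal conversion constant (assumed)
K2 = 1321.0789          # thermal conversion constant (assumed)
WAVELENGTH_UM = 10.895  # wavelength of emitted radiance, micrometres (assumed)
RHO_UM_K = 14388.0      # h*c/sigma in micrometre kelvin (assumed)

def toa_radiance(q_cal):
    """Step 1: top-of-atmosphere spectral radiance from the Band 10 digital number."""
    return M_L * q_cal + A_L - O_I

def brightness_temp_c(radiance):
    """Step 2: at-sensor brightness temperature in degrees Celsius."""
    return K2 / math.log(K1 / radiance + 1.0) - 273.15

def ndvi(nir, red):
    """Step 3: normalized difference vegetation index from NIR and red reflectance."""
    return (nir - red) / (nir + red)

def proportion_vegetation(ndvi_value, ndvi_min, ndvi_max):
    """Step 4: proportion of vegetation from NDVI scaled by its scene range."""
    return ((ndvi_value - ndvi_min) / (ndvi_max - ndvi_min)) ** 2

def emissivity(pv):
    """Step 5: land surface emissivity from the proportion of vegetation."""
    return 0.004 * pv + 0.986

def land_surface_temp_c(bt_c, eps):
    """Step 6: emissivity-corrected land surface temperature in degrees Celsius."""
    return bt_c / (1.0 + (WAVELENGTH_UM * bt_c / RHO_UM_K) * math.log(eps))

# Hypothetical pixel values for illustration only
bt = brightness_temp_c(toa_radiance(30000))
pv = proportion_vegetation(ndvi(0.5, 0.1), ndvi_min=0.1, ndvi_max=0.8)
lst = land_surface_temp_c(bt, emissivity(pv))
```

Because emissivity is below one, its logarithm is negative, so the corrected LST comes out slightly higher than the brightness temperature for a warm surface.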
Machine learning
ML is a branch of artificial intelligence (AI) that uses several computational algorithms to solve complex decision-making problems (Jordan & Mitchell 2015). ‘The discipline of science that offers computers the power to develop themselves without explicitly being programmed’ is how AI pioneer Arthur Samuel characterized it in the 1950s (Cordeschi 2007). Several studies have employed ML in natural resource development, social science, political science, and many others to solve complex problems (McFarland et al. 2016; Grimmer et al. 2021). The three primary categories of ML algorithms are reinforcement learning, unsupervised learning, and supervised learning (Iqbal et al. 2022). In supervised learning, an individual provides the machine with both input and output; in unsupervised learning, the system receives input and generates an output based on patterns it detects (Saravanan & Sujatha 2018). Reinforcement learning assists the program in identifying its strengths and motivates it to carry out more of the same type of activity (Mannini & Sabatini 2010). These algorithms frequently make use of particular techniques to find patterns and arrange data so that the computer can process them. Predictive analytics, decision trees, regression, clustering, and classification are examples of common practices. In ML, classification refers to the networks' segmentation and separation of data according to predetermined rules, whereas clustering, which is utilized in unsupervised training, separates related parts (Ezugwu et al. 2022). ML models employ predictive analytics to forecast future events based on the data a network receives.
This study employed a classification technique to examine potential areas that are suitable for siting a dam. The ML techniques adopted in this study were RF and SVM. The random forest (RF) ensemble learning technique combines several decision trees to lessen overfitting and enhance model generalization. It can handle incomplete information in complex, real-world datasets because it is non-parametric, provides variable importance measures, resists overfitting, and handles missing values. The support vector machine (SVM), on the other hand, is a potent ML algorithm that, using the kernel method, performs well in high-dimensional domains and models intricate data relationships. Its margin maximization strategy improves generalization performance and resists overfitting, which makes it appropriate for scenarios with small to moderate amounts of data.
Data augmentation
Data augmentation is a strategy for increasing both the quantity and variability of a dataset, typically to rectify imbalances or improve the robustness of ML models. In the context of RF and SVM models, data augmentation can improve performance by supplying more training samples, allowing the models to generalize better. Naghibi et al. (2017) utilized synthetic data generation to augment rainfall data in flood risk modeling, improving model robustness and accuracy. Prasad et al. (2020) used data augmentation in land use classification tasks by generating synthetic satellite images. Melville et al. (2020) employed temporal data augmentation for NDVI to improve crop classification accuracy. The current study used synthetic sampling to build synthetic lithology samples based on geological features and known distributions, using interpolation and extrapolation to generate new soil texture data points from existing data. Temporal variations and noise addition were utilized to introduce realistic noise to existing LST data in order to create new samples. Missing data for slope steepness and TWI was filled in using perturbation and interpolation techniques for slope data augmentation.
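The noise-addition step mentioned for the LST data can be sketched as below. The study does not report its noise model, so zero-mean Gaussian noise, the standard deviation `sigma`, the number of copies, and the sample values are all assumptions made for illustration; `augment_with_noise` is a hypothetical helper.

```python
import random

def augment_with_noise(values, n_copies=2, sigma=0.5, seed=42):
    """Return the original values followed by n_copies noisy replicas.

    Each replica adds zero-mean Gaussian noise with standard deviation sigma
    (an assumed value) to every original observation.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    augmented = list(values)
    for _ in range(n_copies):
        augmented.extend(v + rng.gauss(0.0, sigma) for v in values)
    return augmented

# Hypothetical LST observations in degrees Celsius
lst_samples = [24.8, 31.2, 38.5, 45.7]
augmented = augment_with_noise(lst_samples)
```

Keeping the originals at the front of the list makes it easy to separate observed from synthetic samples later, which matters when synthetic points should be excluded from the test set.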
This study made use of datasets with varying spatial and temporal resolutions obtained from multiple sources. Because the datasets had different spatial resolutions, it was imperative to normalize them to a single scale. The spatial resolution of all datasets was standardized to a common resolution of 30 m to ensure compatibility. This was achieved using resampling techniques such as bilinear interpolation. Missing data points were handled through interpolation or, where appropriate, imputation techniques to maintain consistency across the dataset. The choice of resolution was driven by the need to balance detail with computational efficiency and to ensure that the datasets could be meaningfully integrated without introducing significant bias.
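Bilinear resampling of a coarse layer onto a finer grid can be sketched as follows. This is a simplified illustration that aligns the corner cells of the input and output grids; GIS software handles georeferencing, cell centres, and no-data values, none of which are modeled here. The function names and the toy grid are hypothetical.

```python
def bilinear_sample(grid, y, x):
    """Interpolate grid (a list of rows) at fractional row y and column x."""
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(grid) - 1)   # clamp at the last row
    x1 = min(x0 + 1, len(grid[0]) - 1)  # clamp at the last column
    fy, fx = y - y0, x - x0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

def resample(grid, out_rows, out_cols):
    """Resample grid to out_rows x out_cols with corner-aligned bilinear interpolation."""
    in_rows, in_cols = len(grid), len(grid[0])
    out = []
    for r in range(out_rows):
        y = r * (in_rows - 1) / (out_rows - 1) if out_rows > 1 else 0.0
        row = []
        for c in range(out_cols):
            x = c * (in_cols - 1) / (out_cols - 1) if out_cols > 1 else 0.0
            row.append(bilinear_sample(grid, y, x))
        out.append(row)
    return out

# Toy 2 x 2 coarse layer refined to 3 x 3
coarse = [[0.0, 10.0], [20.0, 30.0]]
fine = resample(coarse, 3, 3)
```

Bilinear interpolation suits continuous layers such as rainfall or elevation; categorical layers such as LULC or lithology are usually resampled with nearest-neighbour assignment instead, since averaging class codes is meaningless.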
Running the model
The caret and e1071 R packages were used within RStudio for model building: caret for the RF model and e1071 for the SVM model. They were used alongside other packages such as sp, ggplot, dplyr, raster, and rgdal; the sf and raster packages are commonly used to handle vector and raster data, respectively. Models were built by first splitting the dataset in a ratio of 70:30, which is widely used in several studies (Hembram et al. 2021; Charan et al. 2023). The 11 factors were treated as covariates and the dam locations as the dependent variable, while the favorability classes were converted into a factor with five levels (1, 2, 3, 4, and 5). Kavzoglu et al. (2015) used a variety of methods to ensure reliable data splitting and model validation. To divide the data into training and testing sets, the current study used both random and stratified sampling. Stratified sampling ensured that each class was proportionally represented in both sets, which is critical for ensuring that models behave consistently across classes. The training set (70%) has 790 data points, whereas the testing set (30%) has 339. Model training used the training set to fit the RF and SVM models, with cross-validation to tune hyperparameters. Similar approaches have been used to estimate landslide vulnerability, groundwater potential zones, and urban growth. After the models were built and validated, the spatial layers were loaded in RStudio and stacked into a single multi-layer raster. The fitted models were then applied to the stacked raster, and the final output showed the classes suitable for siting a dam.
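The stratified 70:30 split can be sketched independently of the R workflow. The following is a plain-Python illustration rather than the caret code used in the study; `stratified_split`, the seed, and the toy class counts are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=1):
    """Split sample indices so each class keeps roughly the same train/test ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # random assignment within each class
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# Hypothetical, imbalanced favorability classes
labels = [1] * 20 + [2] * 30 + [3] * 50
train_idx, test_idx = stratified_split(labels)
```

Splitting within each class, rather than over the pooled data, prevents a rare suitability class from landing entirely in one partition.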
To lower prediction variance, RF combines multiple unpruned decision trees, each trained on a bootstrap sample of the same dataset, and uses out-of-bag (OOB) samples to estimate variable importance and test set error. Hyperparameter tuning is an important phase in the construction of ML models. Here, grid search was combined with cross-validation to investigate various hyperparameter combinations for the RF and SVM models. The RF model was tuned over the number of trees and the minimum number of samples per leaf. The SVM model was tuned over the regularization parameter, kernel type, and kernel coefficient.
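The combination of grid search and k-fold cross-validation can be sketched as below. To keep the example self-contained, a small k-nearest-neighbour classifier stands in for the RF and SVM models, so the grid here covers `k` and `weighted` rather than the number of trees, leaf size, regularization parameter, or kernel settings tuned in the study. All names and the synthetic two-cluster data are hypothetical.

```python
import itertools
import random

def knn_predict(train_x, train_y, point, k, weighted):
    """Stand-in classifier: vote among the k nearest training points."""
    neighbours = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, point)), y)
        for x, y in zip(train_x, train_y)
    )[:k]
    votes = {}
    for dist, label in neighbours:
        weight = 1.0 / (dist + 1e-9) if weighted else 1.0
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)

def cv_accuracy(xs, ys, k_folds, **params):
    """Mean accuracy over k_folds interleaved folds for one parameter setting."""
    folds = [list(range(i, len(xs), k_folds)) for i in range(k_folds)]
    correct = 0
    for held_out in folds:
        held = set(held_out)
        tr_x = [x for i, x in enumerate(xs) if i not in held]
        tr_y = [y for i, y in enumerate(ys) if i not in held]
        correct += sum(
            knn_predict(tr_x, tr_y, xs[i], **params) == ys[i] for i in held_out
        )
    return correct / len(xs)

def grid_search(xs, ys, grid, k_folds=5):
    """Evaluate every combination in grid and return the best one with its score."""
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for combo in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = cv_accuracy(xs, ys, k_folds, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Synthetic two-class data: clusters around (0, 0) and (3, 3)
rng = random.Random(0)
xs = [(rng.gauss(c, 0.5), rng.gauss(c, 0.5)) for c in (0, 3) for _ in range(30)]
ys = [c for c in (0, 3) for _ in range(30)]
best_params, best_score = grid_search(xs, ys, {"k": [1, 3, 5], "weighted": [False, True]})
```

Every candidate is scored only on folds it was not trained on, so the selected setting is not chosen on the basis of training-set fit.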
Model evaluation


To evaluate the predictive effectiveness of the RF and SVM models for dam site suitability, the area under the curve–receiver operating characteristic (AUC-ROC) analysis was performed in ArcGIS. First, the dam suitability map produced by each model was reclassified into predicted presence (1) and absence (0) according to threshold probability values. The actual dam locations were converted into point shapefiles and overlaid onto the prediction maps to validate them. The Spatial Analyst Toolset in ArcGIS was used to extract prediction values at the dam point sites, which were then compared with non-dam (background) points to calculate true positive and false positive rates. The ROC add-in in ArcGIS was used to create the ROC curves and compute the AUC scores.
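The validation metrics reported in this study (RMSE, MAE, R2, and AUC) can be written out in a few lines. This sketch uses the standard textbook definitions and is not the ArcGIS add-in or R code used in the study; the AUC is computed through its rank interpretation, the probability that a dam point receives a higher score than a background point, which equals the area under the ROC curve. The function names and the example scores are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual / total sum of squares."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def auc(dam_scores, background_scores):
    """Share of (dam, background) pairs where the dam point scores higher; ties count half."""
    wins = 0.0
    for d in dam_scores:
        for b in background_scores:
            if d > b:
                wins += 1.0
            elif d == b:
                wins += 0.5
    return wins / (len(dam_scores) * len(background_scores))

# Hypothetical predicted suitability at dam and background points
example_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of dam and background points, which places the reported values of 0.827 (RF) and 0.790 (SVM) between the two.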
RESULTS AND DISCUSSION
Spatial variation of the thematic layers
Spatial variation of thematic factors affecting dam suitability: (a) drainage stream density, (b) rainfall, (c) geology, (d) soil, (e) distance to settlements, (f) DEM, (g) land use, (h) LS, (i) LST, (j) TWI, and (k) NDVI.
It was observed that rainfall ranges from 341 to 1,612 mm (Figure 2(b)), and increases toward the southern part of the area. The essence of rainfall in this study cannot be underestimated because it is the primary determinant of runoff leading to water storage in the dam. From the analysis, it was observed that regions of high precipitation would be suitable for siting a dam, while areas with low rainfall would be considered unsuitable. Al-Ruzouq et al. (2019) reported similar findings suggesting that regions of high rainfall distribution are ideal for siting a dam. This is because the size of the dam is based on the amount of rainfall in the catchment, and hence, smaller dams sited in a region of high rainfall can lead to embankment breaches (structural damages) and floods. Degu et al. (2011) indicated that it is important to routinely assess the amount of rainfall in a catchment to avoid unexpected accidents.
It was reported that one of the dams constructed under the 1V1D in the upper west region of Ghana (Duong) in 2021 breached its embankment, causing extensive flooding in the Kaleo-Nadowli areas and leading to loss of properties, settlements, and farms, submerged roads, and other damage as a result of the large amounts of runoff (Citi News 2019). More recently, the Gbimsi dam (northeast region) overtopped its banks, posing a threat to surrounding communities. The persistence of high runoff from rainfall and its deteriorating impact on the dams make it important to predict possible sites for future and ongoing dam projects in Ghana. Several studies have reported the importance of rainfall in determining regions for possible dam location and found that rainfall is a primary determinant (Naghibi et al. 2017; Al-Ruzouq et al. 2019). This study seeks to examine areas suitable for dam construction to prevent these events.
The main geological setting in the study area was the synvolcanic intrusive rocks which cover about one-third of the study area, while the main soil type in the area was found to be Fluvisols. Geology and soils are similar factors that play different roles because they are all underlying mediums protecting surface water qualitatively and quantitatively. Soils like Fluvisols have been reported to contain a high amount of clay which serves as a binding agent to cement the dam floor preventing seepage, while geology would prevent deep percolation of fractures within the geological medium (Kpiebaya et al. 2022).
One of the most important factors in this study is the distance to settlements because the primary purpose of these dams is to give local residents access to water. When a dam is sited far from a settlement, the water is difficult to access and the site is therefore rendered unsuitable for domestic use. The Euclidean distance map shows high values in the Savannah region, indicating scattered settlements that might not be considered suitable, while the northern region lies in close proximity to settlements. The elevation of the study area ranges from 15 to 532 m. Low-lying areas suggest a higher suitability for generating catchment yield and are hence viable for dam siting, while areas with higher elevation would not retard surface runoff and hence might not be viable. The dominant LULC in the study area was grassland with shrubs, at about 84.23%, while the least dominant was permanent wetland, at 0.07%. The importance of LULC in this study cannot be overlooked because it links the immediate land cover directly to water availability while also having an impact on the ecosystem. Several studies have adopted LULC as one of the main influencing factors of dam suitability analysis (Sinha et al. 2015; Talukdar et al. 2020).
A wide range of factors can be used to predict favorable areas for siting a dam, but because of the study area's peculiar characteristics, which revolve around aridity and vulnerability to climate change and variability, this study adopted LST and slope steepness to improve the prediction. LST is essential in this study because it describes the processes involved in the exchange of energy and water within the atmosphere. Few studies have employed LST and slope steepness in water resources development. The LST in the study area ranges from 24.79 to 45.75 °C, while slope steepness ranges from −0.01409 to 70.1036%. TWI describes the impact of terrain on a location, and it is related to local soil conditions. TWI of the study area ranges from 0.4134 to 13.1835. TWI has an impact on both surface and groundwater hydrology (Regmi et al. 2015). A negative profile TWI value shows that water flow decelerates on the surface, a positive figure suggests that the flow of water on the surface accelerates, and a zero suggests that the surface is linear (Prasad et al. 2020). This indicates that the northern sectors of the study area generally accelerate the flow of water, but the upper regions would accelerate more.
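For reference, TWI is commonly defined from the upslope contributing area and the local slope. The expression below is the standard textbook form and is given here as an assumption, since the exact formulation used to derive the TWI layer is not stated in this section; the symbols a and β are introduced only for illustration.

```latex
\mathrm{TWI} = \ln\left(\frac{a}{\tan\beta}\right)
```

Here a is the upslope contributing area per unit contour width and β is the local slope angle. Under this definition, gentle slopes with large contributing areas give high TWI values, which is where water tends to accumulate.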
It is easy to understand that vegetation and land cover change can coexist in the same ecosystem but the emergence of rapid urbanization poses a threat to natural biodiversity (Delpy et al. 2021). Several studies have been conducted on the correlation between LULC, NDVI, and water resource management. NDVI in the study area ranges from −0.2 to 0.8126 where positive values suggest healthy vegetation and negative values may indicate less or unhealthy vegetation. Interestingly, regions with healthy vegetation are considered suitable for siting a dam because of their protective nature against expected pollution activities.
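As a minimal sketch of how the NDVI values discussed above are typically obtained, the snippet below applies the standard normalized difference of near-infrared and red reflectance to a few pixels. The reflectance values, the pixel names, and the zero-signal fallback are hypothetical and chosen only for illustration; the sensor and preprocessing used in this study are not assumed here.

```python
def ndvi(nir, red):
    """Return NDVI for one pixel from near-infrared and red reflectance."""
    total = nir + red
    if total == 0:
        # Assumption: pixels with no signal in either band are reported as 0.
        return 0.0
    return (nir - red) / total


def describe(value):
    """Rough reading of the sign of NDVI, mirroring the interpretation in the text."""
    return "vegetated" if value > 0 else "sparse or non-vegetated"


# Hypothetical (nir, red) reflectance pairs, not taken from the study data.
pixels = {
    "dense_canopy": (0.50, 0.10),
    "bare_soil": (0.30, 0.25),
    "open_water": (0.02, 0.06),
}

ndvi_by_pixel = {name: ndvi(nir, red) for name, (nir, red) in pixels.items()}

for name, value in ndvi_by_pixel.items():
    print(f"{name}: NDVI = {value:.3f} ({describe(value)})")
```

Because the index is a ratio bounded between −1 and 1, it can be compared across scenes with different illumination, which is one reason it is widely used as a thematic layer.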
Yang et al. (2021) used remote sensing to monitor vegetation performance and concluded that vegetation growth is largely halted by anthropogenic activities. The analyses in past works and in this study do not differ greatly, but this research specifically sought to identify regions that are suitable for siting a dam. Findings from this study suggest that more trees and plants should be planted within urban towns to improve the quality of vegetation. This would go a long way toward reducing LST and improving the soil moisture index (SMI), which in turn helps to reduce the warming and atmospheric temperatures that affect the main source of water for people living in the study area. Figure 2 presents the spatial variation of the thematic layers used in the study.
Suitability map
This finding aligns with prior research using traditional methods such as the analytical hierarchy process (AHP). For example, studies by Al-Ruzouq et al. (2019), Othman et al. (2020), and Rane et al. (2023) utilized AHP to classify regions from very low to very high suitability for dam siting, highlighting the value of multi-criteria analysis in accurately identifying suitable dam locations. This study extends previous work by demonstrating that ML models, like RF and SVM, can achieve comparable or superior accuracy and reliability in suitability mapping (Shahzad et al. 2022). The nuanced differences in area statistics between the RF and SVM models highlight the distinct ways each model processes spatial and environmental data. The RF model, known for its robustness in handling large datasets with numerous variables, identified a slightly larger suitable area compared with the SVM model (Naghibi et al. 2017). These differences, though minor, could influence decision-making in dam construction by offering multiple perspectives on site viability.
The suitability maps generated by the RF and SVM models have significant practical implications for dam construction in Northern Ghana. Highly suitable areas in the northern and upper regions, characterized by low elevation (approximately 54 m) and high annual rainfall (around 1,290 mm), provide clear directives for policymakers and engineers (Asamoah & Ansah-Mensah 2020). These areas are ideal for dam construction due to their potential for water accumulation through surface runoff. Decision-makers can use these findings to prioritize dam construction projects in high-suitability areas, ensuring efficient resource use and maximizing the benefits of water storage and flood mitigation. Moderate suitability areas, particularly in the northeastern region, offer additional development options, allowing for a phased infrastructure development approach. Despite overall consistency with expectations, some discrepancies warrant further discussion. The RF model identified a slightly larger suitable area than the SVM model, likely due to differences in data handling and sensitivity to input parameters (Naghibi et al. 2017). The RF model's ability to capture subtle data variations may have contributed to this discrepancy.
Furthermore, areas identified as having low potential for dam sites, especially around the Savannah regions, reflect the impact of existing land use, such as extensive forest reserves like Damango Forest Reserve and Mole National Park. This finding emphasizes the need to integrate land use data into suitability analyses to avoid conflicts with conservation efforts and ensure sustainable development. The methodological approach and findings of this study have broader implications for future research and policy in dam construction and water resource management. The effectiveness of these ML models suggests that similar approaches could be applied in other regions with comparable environmental conditions. Future studies could refine these models by incorporating additional criteria or exploring other ML techniques to enhance predictive accuracy (Kourou et al. 2015; Prasad et al. 2020).
For policymakers, the insights provided by this study can inform strategic planning and resource allocation for dam construction projects. Focusing on high-suitability areas ensures that investments in dam infrastructure yield maximum benefits, contributing to sustainable development goals related to water security, agriculture, and climate resilience (Nguyen et al. 2021). The suitability maps produced by the RF and SVM models offer valuable tools for identifying optimal dam sites in Northern Ghana. The comparative analysis with traditional methods and the integration of practical land use considerations enhance the reliability and applicability of these findings. By leveraging advanced ML techniques, this study provides a robust framework for decision-making in dam construction, with the potential to significantly influence policy and practice in water resource management (Naghibi et al. 2017; Pradhan & Srinivasan 2022).
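The suitable-area shares compared in this section can be illustrated with a short sketch that bins per-cell suitability scores into five classes and sums the share of cells in the moderate, high, and very high classes. The equal-interval breaks, the assumption that every cell covers the same area, and the sample scores are all hypothetical; the classification scheme actually used for the maps may differ.

```python
CLASS_NAMES = ["very low", "low", "moderate", "high", "very high"]


def classify(score, breaks=(0.2, 0.4, 0.6, 0.8)):
    """Map a suitability score in [0, 1] to a class name.

    The equal-interval breaks are an assumption made for illustration.
    """
    for name, upper in zip(CLASS_NAMES, breaks):
        if score < upper:
            return name
    return CLASS_NAMES[-1]


def class_shares(scores):
    """Percentage of cells in each class, assuming equal-area cells."""
    counts = {name: 0 for name in CLASS_NAMES}
    for score in scores:
        counts[classify(score)] += 1
    return {name: 100.0 * count / len(scores) for name, count in counts.items()}


def suitable_share(scores):
    """Combined share of the moderate, high, and very high classes."""
    shares = class_shares(scores)
    return sum(shares[name] for name in ("moderate", "high", "very high"))


# Hypothetical model outputs for ten raster cells.
scores = [0.05, 0.15, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95, 0.30]

print(class_shares(scores))
print(f"suitable share: {suitable_share(scores):.2f}%")
```

Running the same function on the outputs of two models gives directly comparable area statistics, provided both sets of scores are classified with the same breaks.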
Correlation matrix of the determining factors
These findings align with previous research emphasizing the importance of integrating multiple factors for effective dam site selection. For instance, Naghibi et al. (2017), Shaibu et al. (2024), and Kpiebaya et al. (2022) demonstrate the utility of multi-criteria decision analysis (MCDA) methods in capturing the intricate dependencies among site selection criteria. Our study advances this work by employing ML models, specifically RF and SVM, to enhance predictive accuracy and reliability. A study by Prasad et al. (2020) similarly utilized RF and SVM for groundwater potential mapping, highlighting the superior performance of these models in handling complex environmental datasets. The application of these models to dam site selection especially in Northern Ghana highlights their robustness and adaptability across different geographical contexts and environmental challenges.
The practical implications of these findings are significant for decision-making in dam construction across the globe. Understanding the correlations and dependencies among key factors enables planners and engineers to make more informed decisions regarding dam siting. The positive correlation between NDVI and LST suggests that vegetated areas with higher temperatures may still be viable for dam construction if other factors such as soil and rainfall are favorable. Conversely, the negative correlation between rainfall and DEM indicates that lower elevation areas with adequate rainfall might be more suitable, as they can better manage water retention and reduce runoff challenges.
In the Ghanaian context, this knowledge can directly influence policy and strategic planning, enabling the government to optimize the ‘One Village One Dam’ initiative. By leveraging advanced ML models, authorities can prioritize regions that balance environmental suitability with logistical feasibility, ultimately enhancing the effectiveness and sustainability of newly constructed dams. This approach can also mitigate risks associated with improper siting, such as dam breaches and flooding, thereby safeguarding communities and investments.
One unexpected finding in our study is the weak negative correlation between rainfall and DEM. This suggests that areas of higher elevation, which typically receive less rainfall, may not necessarily be unsuitable for dam construction if other factors are favorable. This finding contradicts some traditional assumptions that high rainfall areas are always preferable for dam sites. The practical implication is that site selection should not rely solely on elevation and rainfall but must consider a holistic set of criteria, as demonstrated by our ML models.
Additionally, the strong positive correlation between NDVI and LST challenges the notion that high-temperature areas are automatically unsuitable for dams. In Northern Ghana's arid climate, vegetation presence indicated by NDVI may compensate for high LST, suggesting that these areas could still support dam infrastructure if managed correctly. This insight emphasizes the need for nuanced, data-driven approaches in environmental planning. The integration of RF and SVM models in our study provides a robust framework for identifying suitable dam sites in Northern Ghana. By analyzing the interactions among multiple environmental factors, we offer a comprehensive approach that enhances decision-making processes and supports sustainable water resource management. These findings not only validate the application of ML techniques in environmental studies but also highlight their potential to address complex real-world challenges in dam construction and beyond.
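The pairwise relationships discussed in this section are the kind of result a Pearson correlation matrix provides. The sketch below computes such a matrix for three factors in plain Python. The sample values are hypothetical and are not drawn from the study data, and the choice of Pearson's coefficient is an assumption, since the correlation measure is not named in this section.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(factors):
    """Nested dict of pairwise correlations keyed by factor name."""
    names = list(factors)
    return {a: {b: pearson(factors[a], factors[b]) for b in names} for a in names}


# Hypothetical values sampled at five locations (illustration only).
factors = {
    "rainfall": [1200.0, 1100.0, 950.0, 800.0, 600.0],
    "dem": [60.0, 120.0, 150.0, 300.0, 280.0],
    "ndvi": [0.60, 0.50, 0.45, 0.30, 0.20],
}

matrix = correlation_matrix(factors)
for row_name, row in matrix.items():
    print(row_name, {col: round(value, 2) for col, value in row.items()})
```

A coefficient near zero, such as a weak negative value between rainfall and elevation, means one factor carries little linear information about the other, which supports keeping both as separate model inputs.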
Model performance
Receiver operating characteristic (ROC) curve for RF and SVM models.
Studies worldwide have employed various models to identify appropriate dam sites, achieving differing levels of success (Kim et al. 2019; Karakatsanis et al. 2023). The effectiveness of these models often depends on the metrics used to assess their performance (Kafy et al. 2021). Commonly used metrics such as the kappa coefficient and overall accuracy may not always provide the most accurate measure of a model's performance. Therefore, this study evaluated suitable dam sites using more robust metrics, including RMSE, R-squared (R2), mean absolute error (MAE), and ROC-AUC, which is consistent with previous studies (Saha et al. 2022).
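To make the four metrics concrete, the sketch below computes RMSE, MAE, R2, and ROC-AUC from their textbook definitions using only the Python standard library. The labels and scores are hypothetical, and the rank-based AUC shown here is one standard formulation that is not necessarily the procedure used to produce the reported values.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical validation points: 1 = suitable site, 0 = unsuitable site.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]

print(f"RMSE = {rmse(labels, scores):.3f}")
print(f"MAE  = {mae(labels, scores):.3f}")
print(f"R2   = {r_squared(labels, scores):.3f}")
print(f"AUC  = {roc_auc(labels, scores):.3f}")
```

RMSE, MAE, and R2 measure how far the predicted scores are from the observed labels, while AUC depends only on how well the scores rank suitable sites above unsuitable ones, so the two groups of metrics can favor different models.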
The high ROC-AUC values suggest that the models are reliable for identifying suitable dam sites, which is crucial for regions with limited water resources and high variability in rainfall. The ability of the RF model to outperform SVM in this context implies that decision-makers could prioritize the use of RF for more accurate and reliable site selection. The use of these ML models can streamline the decision-making process by providing a data-driven approach to site selection. This can help mitigate the risks associated with improper siting, such as structural failures and flooding, which have been prevalent issues in Northern Ghana under the ‘One Village One Dam’ initiative (Awuni et al. 2023). By adopting ML models, authorities can ensure that dams are constructed in locations that optimize water storage and distribution, thereby enhancing the sustainability and resilience of water resources in the region.
CONCLUSIONS
In conclusion, this study applied ML techniques, specifically RF and SVM, to predict suitable areas for dam siting in Northern Ghana. Addressing the critical need for water resources in arid regions with low rainfall, this research provided significant insights into water availability. The study revealed that 54.43% of the area was suitable for dam siting using the RF model, while the SVM model identified 49.72% as viable, encompassing moderate, high, and very high-suitability classes. The ML algorithms predicted suitable areas for dam siting with RMSE, R2, and MAE values of 47.84, 0.53, and 0.64 for the RF model, and 43.55, 0.49, and 0.61 for the SVM model, respectively. The AUC values were 0.827 for RF and 0.790 for SVM, indicating superior performance of the RF model. These findings have significant implications for water resource management policy, particularly the ‘One Village One Dam’ initiative. By leveraging advanced ML techniques, this research provides a robust framework for identifying optimal dam sites, which can significantly enhance the effectiveness of this initiative.
The identification of highly suitable areas for dam construction, particularly in the northern and upper regions of the study area, offers clear directives for policymakers and engineers. These regions, characterized by low elevation and high rainfall, are ideal for dam construction, ensuring efficient use of resources and maximizing the benefits of water storage and flood mitigation. For policymakers, the suitability maps generated in this study provide valuable insights for strategic planning and resource allocation. By focusing on areas with high suitability, the government can ensure that investments in dam infrastructure yield maximum benefits, contributing to sustainable development goals related to water security, agriculture, and climate resilience. Also, this framework can be adapted for other semi-arid regions with comparable hydrological and geological conditions.
Additionally, exploring the impact of climate change on dam site suitability is a crucial area for future research. Climate change is expected to alter precipitation patterns and increase the frequency and intensity of extreme weather events, which could affect the viability of dam sites identified in this study. By incorporating climate change projections into ML models such as XGBoost, future research could provide more accurate and resilient site selection strategies, ensuring the long-term sustainability of water resources. Future research should evaluate socio-economic impacts, including community displacement and construction costs, to promote equitable dam siting.
Moreover, the application of advanced ML techniques and data analytics can be expanded to include real-time monitoring and maintenance of existing dams. This can facilitate early detection of structural issues and prevent potential disasters associated with poorly maintained dams. The integration of Internet of Things (IoT) devices and remote sensing technologies with ML models can provide continuous monitoring and predictive maintenance, further enhancing the safety and functionality of dams.
Limitations of this study include potential bias in the CHIRPS rainfall data arising from sparse ground station coverage, especially in heterogeneous or hillslope areas. Furthermore, the sensitivity of the models to outliers and the variable quality of input data (e.g., LST or NDVI data obscured by clouds) could influence prediction accuracy. The spatial resolution of the remote sensing products, the input storage, and other assumptions used for parameter calibration also introduce some uncertainty into the interpretation of results.
This study contributes to the broader field of hydrology by demonstrating the effectiveness of ML techniques in water resource management. The comparative analysis of RF and SVM models provides valuable insights into the strengths and limitations of these techniques in predicting suitable dam sites. The study also highlights the importance of a comprehensive, multi-criteria approach to site selection, addressing the complex interplay of environmental factors involved in dam construction. By providing a robust, data-driven approach to dam site selection, this research not only supports the Ghanaian government's ‘One Village One Dam’ initiative but also offers a scalable model adaptable for use in other regions facing similar challenges. The findings indicate the potential of ML techniques to enhance the reliability and accuracy of hydrological studies, paving the way for more sustainable and effective water resource management practices.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.