Abstract
Because the physical processes behind floods are complex, data-driven machine learning (ML) models offer a cost-efficient approach to flood modeling. The innovation of the current study lies in the development of tree-based ML models, namely Rotation Forest (ROF), Alternating Decision Tree (ADTree), and Random Forest (RF), optimized via binary particle swarm optimization (BPSO), to estimate flood susceptibility in the Maneh and Samalqan watershed, Iran. To implement the models, 370 flood-prone locations in the case study were identified (2016–2019). In addition, 20 hydrogeological, topographical, geological, and environmental criteria affecting flood occurrence in the study area were extracted to predict flood susceptibility. The area under the curve (AUC) and a variety of other statistical indicators were used to evaluate the performance of the models. The results showed that RF-BPSO (AUC = 0.935) had the highest accuracy, followed by ADTree-BPSO (AUC = 0.923) and ROF-BPSO (AUC = 0.904). In addition, the findings illustrated that the chance of flooding in the center of the study area is greater than elsewhere due to lower elevation, lower slope, and proximity to rivers. The ensemble framework proposed here can therefore also be used to produce flood susceptibility maps in other regions with similar geo-environmental characteristics for flood management and prevention.
HIGHLIGHTS
Comparative assessment of tree-based machine learning models to classify locations as either flooded or non-flooded.
Development of machine learning models using the BPSO algorithm.
A total of 20 geo-environmental criteria were used for flood susceptibility mapping.
Determining flood-affecting criteria using the BPSO algorithm.
Sensitivity analysis of 20 geo-environmental criteria in predicting flood susceptibility.
INTRODUCTION
Every year various natural disasters such as floods, landslides, and earthquakes cause extensive human and financial losses worldwide (Smith & Ward 1998; Chapi et al. 2017). Floods have been identified as one of the most devastating and destructive natural disasters around the globe, negatively affecting humans and ecosystems (Alam et al. 2021). Statistics show that floods were responsible for more than half of the damage caused by natural disasters in the past five decades globally (Nachappa et al. 2020; Imani et al. 2021). In Iran, the occurrence of floods is not specific to certain regions and the entirety of the country faces this issue; however, based on the features of each region, the type of floods that occur and the damages caused by this natural disaster vary (Giang et al. 2020; Haer et al. 2020; Saedi et al. 2020). Despite the efforts of experts, policymakers, stakeholders, and government officials to reduce the effects of floods in the last few decades, the number of occurrences has been on the rise around the world (Johann & Leismann 2017; Kocaman et al. 2020). Due to this increase in floods, especially in cities, and the various hazards therefrom, identifying efficient flood prediction measures is of great importance (Hong et al. 2018; Chen et al. 2019). Therefore, identifying these measures can assist in a more effective prevention of this phenomenon while making use of public education, efficient management policies, and more extensive monitoring to combat the factors stimulating the increase of floods (Du et al. 2013; Minea 2013). Flood susceptibility refers to the probability that an area will experience flooding. Essentially, it is the likelihood of flooding of a particular type in a given location. It refers to the spatial likelihood or probability (either qualitatively or quantitatively) of a flood in the future (Hervás & Bobrowsky 2009). 
In other words, maps of flood susceptibility can be defined as quantitative or qualitative assessments of the classification, area, and spatial distribution of floods both existing and potentially occurring in a region (Santangelo et al. 2011). Prediction of flood susceptibility maps has been found to be a crucial step in the prevention and management of future floods (Khosravi et al. 2016; Youssef et al. 2016). One of the methods to reduce flood risks is the preparation of flood susceptibility maps, which provide valuable information about the nature of floods and their effects on floodplain lands and river boundaries (Zuo et al. 2015; Khosravi et al. 2016). As a result, it is possible to send appropriate warnings in case of flood danger and facilitate rescue operations. In flood zoning for functional control and land development, floodplain areas are divided into zones with different levels of susceptibility. In recent years, metaheuristic techniques, numerical simulations, physical hydrological models, and machine learning (ML) models have been used to prepare flood susceptibility maps in watersheds worldwide (Liu et al. 2016; Chapi et al. 2017; Razavi Termeh et al. 2018; Choubin et al. 2019; Khosravi et al. 2019). Flood susceptibility maps can identify and predict the risks of future floods based on statistical or deterministic methods (Mosavi et al. 2018). However, the complex conditions under which floods occur make reliable prediction difficult (Pham et al. 2020). It can be concluded that spatial prediction of natural hazards, using models built from spatial data, leads to the preparation of susceptibility maps, which is the most appropriate solution for land use planning in watersheds to prevent these events. 
Because of the various complex criteria affecting floods, however, it is extremely difficult to reliably predict this phenomenon (Pourghasemi et al. 2020). In recent years, the combination of Geographic Information System (GIS) with Remote Sensing (RS) technology has dramatically increased the accuracy of flood susceptibility prediction (Safaripour et al. 2012; Chen et al. 2015; Zuo et al. 2015). Swift access to RS satellite data and improved business practices have increased the use of GIS in the prediction of flood susceptibility maps. Therefore, GIS is a useful tool for analyzing complex phenomena like floods (Hong et al. 2018). In previous studies, knowledge-driven and data-driven approaches were used for the zoning of flood susceptible regions (Khosravi et al. 2016; Wang & Liu 2019). On the one hand, data-driven approaches have high efficiency in well-known areas or regions in which the amount of known evidence is statistically sufficient. On the other hand, knowledge-driven approaches are more efficient in less known regions or where there are fewer targets in the region. In terms of flood susceptibility prediction, the major models are data-driven and rely on simple assumptions (Lohani et al. 2014). Over the past two decades, ML models have provided better performance and cost-effective solutions by mimicking the complex mathematical expressions of flooding processes (Mosavi et al. 2018). Physically based models have been used to predict hydrological events, such as storms, rainfall, and runoff, and to model hydraulic flow, including the impact of atmospheric, oceanic, and flood events (Zhao & Hendon 2009; Borah 2011; Costabile et al. 2013; Fernández-Pato et al. 2016; Xia et al. 2017). Despite their capacity to predict a wide range of flood scenarios, physical models often require a variety of hydro-geomorphological monitoring datasets and intensive calculations, which make short-term predictions impossible (Nayak et al. 2005). 
In addition, establishing physically based models requires an in-depth understanding of hydrological parameters which is proving to be quite challenging (Kim et al. 2015). Also, studies have found that physical models have a short-term prediction capability gap (Costabile & Macchione 2015). Physically based models have certain disadvantages, which leads to the use of advanced data-driven models, such as ML models. This popularity is due, in part, to the fact that flood nonlinearity can be numerically derived from historical data without having to understand the relevant physical processes (Mosavi et al. 2018). ML models can develop faster and require less input than traditional models using data-driven prediction. The field of ML is based on artificial intelligence (AI) that aims to develop patterns, provide easier implementation with low computing cost, as well as rapid training, validation, testing, and evaluation with high performance compared to physical hydrological models (Mekanik et al. 2013). With the continual development of ML models in the last two decades, they have proven to perform more accurately than conventional models in predicting flood susceptibility (Mosavi et al. 2018).
For flood modeling studies, physical hydrological models, including HEC-HMS (Feldman 2000), SWAT (Arnold et al. 1998), and HSPF (Bicknell et al. 1997), have been used. Although such models are useful, they require a large number of field measurements and tedious parameterization methods (Fenicia et al. 2008). In addition, they still provide estimates of flood risk only at the site using local flow data recorded at hydrometric stations, so they are not suitable for regional flood assessments (Tien Bui et al. 2016). Using GIS to predict flood susceptibility is a valuable tool to reduce the risks associated with future floods (Wang et al. 2019). Through geostatistical tools for managing large amounts of spatial data, GIS has made significant contributions to flood susceptibility prediction studies (Tien Bui et al. 2016; Wang et al. 2019). Various statistical and data-driven techniques along with GIS techniques to identify flood susceptible regions have been proposed and used in the literature. Common knowledge-driven approaches used are the Analytic Hierarchy Process (AHP), frequency ratio (FR), and weight of evidence (WOE) (Rahmati et al. 2016; Khosravi et al. 2016; Seejata et al. 2018). Khosravi et al. (2016) prepared flood susceptibility maps for the Haraz river watersheds in Mazandaran using four different models, including FR, WOE, Analytical Hierarchy Process (AHP), and a combination of frequency ratio and analytical hierarchy process (FR-AHP). To implement the proposed models, 10 flood-affecting criteria, including slope angle, plain curvature, elevation, Topographic Wetness Index (TWI), Stream Power Index (SPI), rainfall, distance to river, lithology, land use, and Normalized Difference Vegetation Index (NDVI) were extracted. The results showed that the FR model had the most area under the curve (AUC) compared to the other models. Rahmati et al. (2016) used two knowledge-driven FR and WOE models for flood susceptibility mapping in Golestan Province, Iran. 
The final results showed that the FR and WOE models produced similar and reasonable results. However, there are drawbacks to the use of these methods in generating a flood susceptibility map. For example, AHP results are subject to uncertainty because of ambiguous judgments and the FR method is highly dependent on the sample size (Miles & Snow 1984; Sajedi-Hosseini et al. 2018). Models based on ML can provide information directly from data without assuming anything prior to analysis. They reduce operating costs, improve the speed of data analysis, and are important when dealing with spatial data analysis (Jaafari et al. 2019). Various ML models, including support vector machines (SVMs), genetic algorithms (GAs), adaptive neuro-fuzzy inference systems (ANFIS), artificial neural networks (ANNs), and tree-based models, have been proposed and developed for flood susceptibility prediction (Kia et al. 2012; Seckin et al. 2013; Tien Bui et al. 2016, 2018; Chapi et al. 2017; Zhao et al. 2018; Khosravi et al. 2019; Wang et al. 2019). Liu et al. (2016) used the Naïve Bayes (NB) method to evaluate the flood susceptibility of the Bowen watershed in Australia. To this end, four measures were used, including elevation, slope angle, soil type, and drainage density. It was found that the measures of elevation and slope angle have significant effects on evaluating flood susceptibility compared to the other measures. Hong et al. (2018) used a combination of various methods, including Logistic Regression (LR), Random Forest (RF), and SVM with WOE to predict flood susceptibility maps for the Poyang region in China. The findings indicated that the SVM-WOE combination model had the highest AUC compared to the other combination models (LR-WOE and RF-WOE). 
Choubin et al. (2019) predicted flood susceptibility maps in the Khiyav-Chai watershed in Iran using different ML models such as multivariate discriminant analysis (MDA), SVM, and classification and regression tree (CART) models. The results showed that all models have high performance with an AUC greater than 0.8. Nachappa et al. (2020) used two Multi-Criteria Decision Analysis (MCDA) models, including AHP and Analytical Network Process (ANP), and two ML models, including RF and SVM, to prepare flood susceptibility maps for the city of Salzburg, Austria. The AUC findings indicated that the RF (AUC = 87.8%) and SVM (AUC = 87%) models performed better than the multi-criteria decision models. Costache et al. (2021) used six ML models, including SVM, J48 decision tree, ANFIS, RF, ANN, and Alternating Decision Tree (ADTree) to predict flood susceptibility maps for the Buzău river watershed in Romania. To this end, 12 criteria including slope angle, elevation, aspect, Topographic Position Index (TPI), TWI, Convergence Index (CI), plan curvature, soil type, land use, distance to river, lithology, and rainfall were used. The results showed that the RF and ADTree models had the highest accuracy, while the J48 model had the lowest. In addition, previous studies have shown that ML performance can be further improved by combining ML models with other ML models, metaheuristic techniques, numerical simulations, and physical models (Mosavi et al. 2018; Razavi Termeh et al. 2018; Wang et al. 2019). Razavi Termeh et al. (2018) used a combination of ANFIS with Ant Colony Optimization (ACO), Genetic Algorithm (GA), and Particle Swarm Optimization (PSO) to predict the flood susceptibility of the city of Jahrom in Fars province, Iran. To this end, nine hydrogeological, topographical, geological, and environmental criteria were extracted. 
The AUC resulting from Receiver Operating Characteristic (ROC) showed accuracies of 91.8, 92.6, and 94.5% for ANFIS-ACO, ANFIS-GA, and ANFIS-PSO combination models, respectively. They found that the ANFIS metaheuristic model has the most practical application in terms of reproducing the highly focused flood susceptibility map. To predict flood susceptibility in Dingnan County in China, Wang et al. (2019) combined the ANFIS model with two metaheuristic methods, including biogeography-based optimization (BBO) and imperialistic competitive algorithm (ICA). The results showed that the two ensemble ML models had superior effectiveness compared to ANFIS in the study area. However, ML algorithms have important characteristics that must be carefully considered. First, as a rule, they are as good as their training, since they learn the task through past data, and second, the capability of each ML model varies depending on the type of task (Faizollahzadeh Ardabili et al. 2018; Mosavi et al. 2018). This is also referred to as the ‘generalization problem’, because it shows how well the trained system can predict beyond the scope of the training data. It may be possible for some algorithms to perform well for short-term predictions, but not for long-term predictions. It is important to clarify these characteristics of the ML models based on the type of training data (Mosavi et al. 2018).
In all of the above studies, researchers have compared ML and metaheuristic models to select the best model for predicting flood susceptibility. It is desirable to identify more efficient and accurate ML models that can be used with minimal field data to spatially predict flood susceptibility. Although a number of ensemble ML models have been used for this purpose, no single method predicts flood susceptibility perfectly. Based on the results of previous research, tree-based ML models including RF, ADTree, and Rotation Forest (ROF) have high accuracy and efficiency in predicting natural hazards (Xia et al. 2017; Tien Bui et al. 2019; Costache et al. 2021). The advantage of the ROF algorithm is that it increases both the variability in the data set and the accuracy of the classification process (Phong et al. 2021). The ADTree model uses fewer repetitions in its processes and can turn the classes in the model into binary to analyze the training data sets and validate them (Wu et al. 2020). In addition, the RF model is very flexible and can handle regression and classification tasks with a high degree of accuracy (Chen et al. 2021). Because they can identify flood susceptible regions more accurately and predict them more efficiently, new hybrid tree-based ML models combined with metaheuristic algorithms are extremely important. There are very few studies that report on the optimization of tree-based ML models for flood susceptibility mapping. Accordingly, the objectives of this research are framed based on previous research gaps. 
Therefore, its main objectives are (1) to evaluate the performance of the improvement of tree-based ML models by the Binary Particle Swarm Optimization (BPSO) algorithm, (2) to determine the effective criteria in predicting flood susceptibility using BPSO along with tree-based ML models, and (3) to predict the flood susceptibility maps using new hybrid tree-based ML models to identify the most susceptible region in the study area.
The reason for combining ML models with the BPSO metaheuristic algorithm is to improve the accuracy of tree-based ML models and to determine the criteria affecting flood susceptibility in the study area, which is the innovation of the present study. Thus, the present research is distinguished from previous studies because of the use of improved tree-based ML models, including Optimized Rotation Forest (ROF-BPSO), Optimized Alternating Decision Tree (ADTree-BPSO), and Optimized Random Forest (RF-BPSO), to identify the effective criteria and predict the flood susceptibility of the Maneh and Samalqan watershed in North Khorasan province, Iran. Finally, AUC and statistical indicators were used to evaluate and compare the effectiveness of the proposed models in the case study.
MATERIALS AND METHODS
The research design adopted for the present study was descriptive-analytical with an applied purpose. QGIS 3.1, SAGA GIS 7.9, and Google Earth Engine were used to process the data, while Python was used for quantitative calculations and the development of methods. The six-step procedure of the current research is shown in Figure 1:
1. Preparation of a flood reference map to define the dependent variable (flood-prone locations in the case study).
2. Extraction of flood-affecting spatial criteria in the case study to define independent variables (20 hydrogeological, topographical, geological, and environmental criteria).
3. Use of a multi-collinearity test to investigate the independence of the spatial flood-affecting criteria in the case study.
4. Identifying spatial flood-affecting criteria in the region using BPSO combined with tree-based ML models.
5. Predicting flood susceptibility based on the development of tree-based ML approaches, such as ROF-BPSO, ADTree-BPSO, and RF-BPSO.
6. Evaluating and comparing the performance of the developed ML models through the use of appropriate statistical indicators.
Study area
The city of Maneh and Samalqan is located in the northwestern region of North Khorasan province. It is surrounded by the city of Raz and Jorgalan to the north, Golestan province to the west, the cities of Jajerm and Garmeh to the south, and the city of Bojnurd to the east. Additionally, this city shares 8 km of its border with Turkmenistan. In terms of its geographical location, this city is located between the 37 °17′ and 38 °7′ north latitude of the equator and between 55 °59′ and 57 °17′ of the east longitude from the Greenwich meridian. The city of Maneh and Samalqan has an area of 6,053 km2 and has a topography that ranges from 314 to 2,785 m in elevation. This city has a variety of different terrain and topography; it is located in a region with an arid and semi-arid climate, and the spatial and temporal distribution of rainfall is completely variable and semi-uniform. Rainfall usually occurs during cold and damp seasons, resulting in an increased chance of flooding and soil erodibility. In addition, the existence of steep slopes, extensive use of sand mines, and the general lack of vegetation and trees in the region have resulted in more floods in the past few years, causing immense damage. The hydrographic network of the city consists of seasonal and permanent rivers and is part of the Atrak watershed. These rivers originate from their sources which can be located in the heights outside of the city and make their way down these areas based on the general slope. Maneh and Samalqan city limits can be observed in Figure 2.
Preparation of information layers
Flood reference map
Before predicting the occurrence of future floods in the area of research, analyses on previous floods must be performed. Flood points play an important role in the relationship between the flood-affecting criteria and the occurrence of floods. In other words, previous historical flood occurrences are indicators of future floods, in that previously affected areas are more susceptible to floods in the future (Rahmati et al. 2016). Flood points represent a level of relationship between the occurrence of floods and the criteria that cause them. Flood inventory maps indicating historic flood locations are required for spatial modeling of flood susceptibility (Tehrany et al. 2014). There are various means for the preparation of reference flood maps, such as interpreting digital satellite imagery or making use of past flood databases (Chen et al. 2019). In total, 370 flood-prone points (with a value equal to one) were recorded in the watershed of the study area by the Iranian forests, rangelands, and watershed organization (2016–2019). In this study, 70% of these points (259 flood-prone points) were randomly chosen to train the models while the remaining 30% (111 flood-prone points) were used for validation. This ratio is accepted and recommended by many researchers in the field (Khosravi et al. 2016; Costache et al. 2021). Additionally, based on the findings of previous studies and in order to make the findings of the present study more realistic, 370 non-flood points (with a value equal to zero) were randomly created using topographic maps and the Google Earth Engine in areas such as hills and mountains in which floods cannot occur (Hong et al. 2018; Costache et al. 2021). In this research, the dependent variable was defined by merging flood and non-flood points (values of zero and one). Furthermore, the flood inventory maps used in this study and previous ones only used binary values (0, 1) for the absence or presence of floods (Tehrany et al. 2014; Rahmati et al. 2016). 
As a result, flood locations were given equal weight to predict flood susceptibility maps (Khosravi et al. 2016; Hong et al. 2018; Costache et al. 2021). In fact, it can be said that the binary definition of flood and non-flood data can indicate the susceptibility of the study area to flood events. For all flood and non-flood points (740 points), the raster values of the independent variables were extracted at the points. A data matrix consisting of one column of flood and non-flood labels (0 or 1) and 20 columns of raster values of the independent variables was created. This data matrix was then supplied as the independent and dependent variables to the optimized tree-based ML models. Therefore, by analyzing the relationships between the dependent variable and its influencing criteria (independent variables) using optimized tree-based ML models, the past flood records can be used for predicting future flood events in the study area (Choubin et al. 2019; Khosravi et al. 2019). Finally, the results were imported back into QGIS to produce the flood susceptibility maps. Field surveys were conducted in order to validate the flood-prone points in the area of research (Figure 3).
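The data matrix and the 70/30 split described above can be sketched as follows. This is a minimal illustration only: the feature values are random placeholders rather than real raster samples, the function names `build_matrix` and `stratified_split` are hypothetical, and applying the same 70/30 ratio to the non-flood points is an assumption (the text states the split explicitly only for the flood points).

```python
import random

N_CRITERIA = 20  # number of flood-affecting criteria reported in the text


def build_matrix(n_flood=370, n_non_flood=370, seed=0):
    """Return rows of [label, 20 feature values]; 1 = flood, 0 = non-flood."""
    rng = random.Random(seed)
    rows = []
    for label, count in ((1, n_flood), (0, n_non_flood)):
        for _ in range(count):
            # Placeholder values standing in for raster values sampled at a point.
            features = [rng.random() for _ in range(N_CRITERIA)]
            rows.append([label] + features)
    return rows


def stratified_split(rows, train_fraction=0.7, seed=0):
    """Split each class separately so both sets keep the 1:1 class balance."""
    rng = random.Random(seed)
    train, validation = [], []
    for label in (1, 0):
        group = [r for r in rows if r[0] == label]
        rng.shuffle(group)
        n_train = round(train_fraction * len(group))  # 259 of 370 per class
        train.extend(group[:n_train])
        validation.extend(group[n_train:])
    return train, validation


matrix = build_matrix()
train, validation = stratified_split(matrix)
print(len(matrix), len(train), len(validation))  # 740 518 222
```

Splitting per class keeps 259 flood and 259 non-flood points in the training set, which matches the counts given for the flood points.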
Flood-affecting criteria
In which min and max are the lowest and highest elevations of the adjacent cells, respectively. This index is indicative of the topographical features of the region and can significantly influence flood occurrence (Kalantari et al. 2017). Soil type has been found to be an important mechanism regarding permeability, runoff production, and flooding (Kanani-Sadat et al. 2019). The soil type map of the region with a scale of 1:250,000 was obtained from the Iranian forests, rangelands, and watershed organization. Additionally, different lithology units show noteworthy changes in the instability of hillsides and also affect surface runoff. The lithology map of the case study with a scale of 1:100,000 was prepared using the Iranian Geological map (26 geological units). The flow accumulation contains the cumulative number of cells upstream of a cell and shows the amount of current flowing from the upstream cells to that cell (Kanani-Sadat et al. 2019). The higher the flow accumulation value, the higher the susceptibility to floods. The HOFD is the actual movement of water from one cell to another cell (Pourghasemi et al. 2020). The lower the HOFD value, the higher the susceptibility to floods. The VOFD indicates the vertical distance between the height of each cell and the height calculated for the flow network (Pourghasemi et al. 2020). The lower the VOFD value, the higher the susceptibility to floods. The flow accumulation, HOFD, and VOFD criteria were calculated using appropriate analyses in the QGIS 3.1 and SAGA GIS 7.9 software. An overview of the data used in the study is provided in Table 1.
Category | Unit | Source | Data type | Scale/Spatial resolution |
---|---|---|---|---|
MFI | Millimeter | Six synoptic stations | Grid | 30 × 30 m |
Slope | Degree | DEM | Grid | 30 × 30 m |
DEM | Meter | SRTM satellite (https://earthexplorer.usgs.gov) | Grid | 30 × 30 m |
CN | – | DEM and rainfall | Grid | 30 × 30 m |
Distance to river | Meter | Topographical map from Iranian national cartographic center | Grid | 1:50,000 |
NDVI | – | OLI satellite (https://earthengine.google.com) | Grid | 30 × 30 m |
Plane curvature | 100/meter | DEM | Grid | 30 × 30 m |
TWI | – | DEM | Grid | 30 × 30 m |
Aspect | Degree | DEM | Grid | 30 × 30 m |
Distance to road | Meter | Topographical map from Iranian national cartographic center | Grid | 1:50,000 |
VOFD | – | DEM | Grid | 30 × 30 m |
Distance to fault | Meter | Topographical map from Iranian national cartographic center | Grid | 1:100,000 |
Soil type | – | Iranian forests, rangelands and watershed organization | Vector | 1:250,000 |
Land use | – | OLI satellite (https://earthengine.google.com) | Vector | 30 × 30 m |
Lithology | – | Iranian Geological map | Vector | 1:100,000 |
HOFD | – | DEM | Grid | 30 × 30 m |
Flow accumulation | – | DEM | Grid | 30 × 30 m |
SPI | – | DEM | Grid | 30 × 30 m |
TPI | – | DEM | Grid | 30 × 30 m |
TRI | – | DEM | Grid | 30 × 30 m |
Methods
Multi-collinearity test
Rotation forest
In the ROF model, principal component analysis (PCA) was used to train the model by rotating the feature space for each individual base classifier (Pham et al. 2017). To create a training data set for each base classifier, the feature set F was randomly split into K subsets and PCA was performed for each of these subsets. All of the principal components were retained in order to preserve the variability information in the data. Therefore, K axis rotations took place to form the new features used to train a base classifier (Kuncheva & Rodríguez 2007). The advantage of the ROF algorithm is that it increases both the variability in the data set and the accuracy of the classification process (Phong et al. 2021). The extraction of features for each base classifier increases data variety and, because all principal components are kept, preserves classification accuracy (Xia et al. 2013). The following steps were performed when creating the ROF model (Kuncheva & Rodríguez 2007; Phong et al. 2021): (1) The training data set features were randomly categorized into K subsets. (2) The PCA algorithm was performed for each subset. (3) The rotation matrix was rearranged according to the sequence of the original features. (4) The results of the various classifiers were combined. (5) Each pixel of the training data set was assigned a class label before all of the pixels of the area were trained. In recent years, the ROF algorithm has been mostly used for classification and has shown satisfactory performance as confirmed by AUC and other statistical indicators (Xia et al. 2013).
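Steps (1) to (3) above, the rotation itself, can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the function name `rotation_matrix` is hypothetical, a full-size bootstrap sample of rows is drawn for each subset as a simplification, and the base classifier that would be trained on the rotated data (step 4 onward) is omitted.

```python
import numpy as np


def rotation_matrix(X, n_subsets, rng):
    """Build one rotation matrix from per-subset PCA, keeping all components."""
    n_samples, n_features = X.shape
    order = rng.permutation(n_features)          # step 1: random feature split
    subsets = np.array_split(order, n_subsets)
    R = np.zeros((n_features, n_features))
    for cols in subsets:
        # Assumption: bootstrap the rows so that each rotation differs.
        rows = rng.choice(n_samples, size=n_samples, replace=True)
        block = X[np.ix_(rows, cols)]
        block = block - block.mean(axis=0)
        cov = np.atleast_2d(np.cov(block, rowvar=False))
        _, vectors = np.linalg.eigh(cov)         # step 2: PCA, all axes retained
        # Step 3: place the loadings at the positions of the original features.
        R[np.ix_(cols, cols)] = vectors
    return R


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))      # placeholder data: 50 points, 6 criteria
R = rotation_matrix(X, n_subsets=3, rng=rng)
X_rot = X @ R                     # rotated features for one base classifier
print(X_rot.shape)
```

Because every principal component is retained, `R` is an orthogonal matrix: the rotation changes the axes each base classifier sees without discarding information. An ensemble would repeat this with a fresh `R` per classifier and combine the predictions.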
Alternating decision tree
The ADTree algorithm is an enhanced decision tree that is more accurate than traditional decision tree algorithms and has higher reliability when solving classification and prediction problems (Bhowmick et al. 2010; Pham et al. 2016). The ADTree algorithm can be defined as the repeated invocation of the decision tree algorithm with the help of the boosting method. In other words, the ADTree algorithm is created by repeatedly invoking condition-based decision trees on the data (Chen et al. 2020). The ADTree creates an optimal tree structure in response. To create this tree, an index is attributed to each feature. The root node has an index of zero and holds the number of available records, while nodes with an index of j expand from every node with an index of i (i > j). Therefore, all of the features expand from the root node. The application of decision trees and boosting algorithms results in more valid classifiers. The combination of boosting with decision trees creates new classification rules which are smaller and easier to interpret (Bhowmick et al. 2010). The advantages of ADTree include a smaller number of nodes and simplicity of explanation (Pham et al. 2016). In addition, the ADTree model uses fewer repetitions in its processes and can turn the classes in the model into binary to analyze the training data sets and validate them (Wu et al. 2020).
Random forest
The RF algorithm is one of the most popular algorithms for classification and prediction problems (Chen et al. 2021). This algorithm has a low sensitivity to multi-collinearity and its results are relatively stable in the presence of missing or irregular data (Chen et al. 2020). It extends the single decision tree to a large collection of classification and regression trees (Hong et al. 2018). The RF algorithm is essentially an ensemble method in which several tree algorithms are combined in order to produce a single prediction of a phenomenon (Rahmati et al. 2016). Generally, individual decision trees are more prone to overfitting and have little generalizability. During the creation of a decision tree, small changes in the training data can cause noteworthy changes in the shape of the tree. The predictive RF model is based on the average of all of its decision trees and classifies most data sets with high accuracy (De Santana et al. 2018). To classify a new item, the input vector is passed down each tree in the RF; each tree produces a classification and thereby votes for that class. The class that receives the highest number of votes from the trees in the forest is chosen as the output. The performance of the RF is mainly determined by the number of decision trees (ntree) and the number of features in each random subset (mtry) (De Santana et al. 2018; Chen et al. 2020). Larger numbers of trees can result in longer modeling times, while smaller numbers of trees can lead to higher error (Costache et al. 2021).
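The voting step described above can be illustrated with a minimal sketch. The function name `majority_vote` and the example votes are hypothetical; in a real forest, each vote would come from one trained tree.

```python
from collections import Counter

def majority_vote(tree_votes):
    """Return the class label predicted by the largest number of trees."""
    counts = Counter(tree_votes)
    # most_common(1) returns a one-element list [(label, count)]
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical votes from five trees for one pixel (1 = flood, 0 = non-flood)
votes = [1, 0, 1, 1, 0]
predicted_class = majority_vote(votes)
# Share of trees voting "flood"; one simple way to obtain a score in [0, 1]
flood_fraction = votes.count(1) / len(votes)
```

With a binary target, using an odd number of trees avoids tied votes.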
Binary particle swarm optimization algorithm
In the BPSO update equations, Xi(t) is the position of particle i, Xi(t + 1) is the position of particle i at the next step, Vi(t) is the velocity of particle i, Vi(t + 1) is the velocity of particle i at the next step, pbest is the best position experienced by particle i, gbest is the best position found by any of the particles, c1 is the individual learning coefficient, c2 is the collective learning coefficient, w is the inertia weight, and r1, r2, and ρ are random numbers in the range [0, 1]. The steps of the BPSO algorithm are shown in Figure 4 (Beheshti 2020).
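For reference, the widely used textbook form of the binary PSO update, written with the symbols defined here, is sketched below. This is an assumption about the form of the equations: the sigmoid transfer function S and the exact variant used in the study (which may follow Beheshti 2020 more closely) are not confirmed by the text.

```latex
\begin{align}
V_i(t+1) &= w\,V_i(t) + c_1 r_1 \bigl(\mathit{pbest}_i - X_i(t)\bigr)
            + c_2 r_2 \bigl(\mathit{gbest} - X_i(t)\bigr) \\
S\bigl(V_i(t+1)\bigr) &= \frac{1}{1 + e^{-V_i(t+1)}} \\
X_i(t+1) &=
  \begin{cases}
    1, & \rho < S\bigl(V_i(t+1)\bigr) \\
    0, & \text{otherwise}
  \end{cases}
\end{align}
```

In this form the velocity is updated as in continuous PSO, and each bit of the new position is set to one with probability S(Vi(t + 1)).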
Evaluation of the performance of the models
Statistical indicators
In these indicators, n is the number of observations, yi is the observed value for observation i, ŷi is the predicted value for observation i, and ȳ is the mean of the observed values.
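As a minimal sketch of how these indicators can be computed, the functions below use the standard definitions of RMSE, MAE, and R2. The form R2 = 1 - SS_res / SS_tot is an assumption, and the function names and the four example values are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n)

def mae(observed, predicted):
    """Mean absolute error between observed and predicted values."""
    n = len(observed)
    return sum(abs(y - p) for y, p in zip(observed, predicted)) / n

def r_squared(observed, predicted):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical example: 1 = flood, 0 = non-flood, predictions in [0, 1]
observed = [1, 0, 1, 0]
predicted = [0.9, 0.1, 0.8, 0.3]
```

Lower RMSE and MAE and a higher R2 indicate a better fit.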
Receiver operating characteristic
In the ROC formulation, P and N are the total numbers of flood and non-flood locations, respectively, and TP and TN were defined in the previous section. The ROC curve is a widely used method for quantitatively estimating the accuracy of a predictive model. In this method, the AUC values can range between 0.5 and 1, and the closer the value is to 1, the higher the accuracy of the model (Yesilnacar & Topal 2005). The relationship between AUC values and the qualitative rating of the prediction is presented in Table 2.
Qualitative | Poor | Average | Good | Very Good | Excellent |
---|---|---|---|---|---|
Quantitative | 0.5–0.6 | 0.6–0.7 | 0.7–0.8 | 0.8–0.9 | 0.9–1 |
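The rating scale of Table 2 can be applied programmatically, as in the sketch below. The pairwise (Mann-Whitney) AUC estimate is a common stand-in and is not necessarily the formula used in the study. Treating each lower bound in Table 2 as inclusive is also an assumption, and both function names are hypothetical.

```python
def auc_pairwise(scores, labels):
    """AUC as the fraction of (flood, non-flood) pairs ranked correctly.

    Ties between a flood and a non-flood score count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    correct = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(pos) * len(neg))

def auc_rating(auc):
    """Map an AUC value to the qualitative classes of Table 2."""
    if not 0.5 <= auc <= 1.0:
        raise ValueError("Table 2 covers AUC values from 0.5 to 1")
    # Thresholds checked from highest to lowest; lower bounds are inclusive
    for lower, rating in [(0.9, "Excellent"), (0.8, "Very Good"),
                          (0.7, "Good"), (0.6, "Average")]:
        if auc >= lower:
            return rating
    return "Poor"
```

For example, `auc_rating(0.85)` falls in the 0.8 to 0.9 band and returns "Very Good".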
Tuning of the machine learning and BPSO algorithm parameters
Because RMSE is one of the most widely used evaluation measures for data-driven models, minimizing the RMSE was chosen as the fitness function of the BPSO algorithm in order to check the accuracy of the ML models (Fotheringham & Oshan 2016). RMSE measures the spread of the model residuals (Fotheringham & Oshan 2016). The optimal values of the original BPSO algorithm parameters, listed in Table 3, were chosen through trial and error. To simplify the procedure, a fixed number of iterations was used as the stop condition. In the present study, due to the randomized nature of the BPSO algorithm and based on previous studies, the BPSO algorithm was run 10 times and the best of these 10 runs was taken as the final output (Saeidian et al. 2018).
Parameters | Value | Parameters | Value |
---|---|---|---|
Swarm size | 30 | C2 | 2 |
Total iterations | 100 | W | 1 |
C1 | 2 | Minimum and maximum velocity | [−4,4] |
In addition, tuning the input parameters was one of the most important steps in implementing the ML algorithms of the present study. To tune ntree and mtry in the random forest algorithm, the k-fold cross-validation technique was used to stabilize the error estimates and increase prediction reliability (Arabameri et al. 2020). In k-fold cross-validation, the training data is divided into k folds; in each round, one fold is used for validation while the remaining k - 1 folds are used for training (Witten et al. 2011). Thus, the ten-fold cross-validation implemented in the present study involved randomly dividing the training data into 10 groups before the best parameter values were determined. Because the error rate generally decreases as the number of trees increases, 2,000 trees (ntree = 2,000) were deemed appropriate for the present study, and the feature-subset parameter was set to mtry = 10. In addition, in order to avoid overfitting, ten-fold cross-validation was also used to tune the parameters of the ROF and ADTree algorithms (Wang et al. 2006; Pham et al. 2016).
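The fold construction used in ten-fold cross-validation can be sketched as follows. This is an illustration only: the function name `k_fold_indices`, the seed, and the sample count of 518 (the sum of the training counts reported in Table 5) are assumptions for the example, and the model fitting inside each round is omitted.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of the k rounds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)           # random division of the data
    folds = [indices[i::k] for i in range(k)]      # k folds of near-equal size
    for i in range(k):
        validation = folds[i]                      # one fold held out
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation

splits = list(k_fold_indices(518, k=10))
# In each round a candidate (ntree, mtry) pair would be fitted on `training`
# and scored on `validation`; the pair with the lowest mean error is kept.
```

Every sample is used for validation exactly once across the 10 rounds, which is what makes the averaged error a more reliable guide for parameter tuning than a single split.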
RESULTS
Preparation of data
As was previously mentioned, 20 hydrogeological, topographical, geological, and environmental criteria were used to predict the flood susceptibility of the Maneh and Samalqan watershed. These criteria include CN, HOFD, MFI, VOFD, NDVI, TWI, SPI, TRI, TPI, DEM, Slope angle, Flow Accumulation, Aspect, Plane curvature, Distance to fault, Distance to road, Distance to river, Lithology, Soil type, and Land use. Figure 5 shows maps of the mentioned criteria based on kriging interpolation (Eftekhari et al. 2021).
Investigation of variable independence
The existence of multi-collinearity among the criteria was explored, and the results are presented in Table 4. Based on these results, TRI failed to meet the acceptance thresholds (TOL > 0.1 and VIF < 5), with TOL = 0.048 and VIF = 13.95; it was therefore removed from the set of modeling inputs in the present study. No high multi-collinearity was observed among the other criteria. In addition, T and Sig values were considered for the coefficient test. The higher the T value, the weaker the assumption that the coefficient is zero, and therefore the greater the role of that criterion in modeling (Choubin et al. 2019). This can also be investigated through the significance level, where a value of less than 0.05 indicates that the null hypothesis (no effect of the criterion in modeling) can be rejected (Choubin et al. 2019).
Order | Criteria | B | Std error | Standardized coefficients Beta | T | Sig | TOL | VIF |
---|---|---|---|---|---|---|---|---|
1 | MFI | 0.87 | 0.043 | 0.18 | 1.02 | 0.045 | 0.92 | 1.02 |
2 | Slope | 0.12 | 0.037 | 0.12 | 2.08 | 0.025 | 0.19 | 2.82 |
3 | DEM | 0.03 | 0.025 | 0.65 | 2.02 | 0.023 | 0.52 | 1.95 |
4 | CN | 0.04 | 0.068 | 0.03 | 2.25 | 0.007 | 0.25 | 1.12 |
5 | Distance to river | 0.14 | 0.075 | 0.36 | 3.36 | 0.039 | 0.32 | 2.22 |
6 | NDVI | 0.22 | 0.077 | 0.74 | 2.73 | 0.017 | 0.19 | 2.95 |
7 | Plane curvature | 0.12 | 0.066 | 0.36 | 2.95 | 0.001 | 0.42 | 4.85 |
8 | TWI | 0.03 | 0.042 | 0.58 | 1.35 | 0.081 | 0.44 | 1.35 |
9 | Aspect | 0.01 | 0.036 | 0.19 | 1.58 | 0.025 | 0.74 | 2.36 |
10 | Distance to road | 0.14 | 0.031 | 0.27 | 1.95 | 0.075 | 0.39 | 2.22 |
11 | VOFD | 0.18 | 0.041 | 0.75 | 2.33 | 0.009 | 0.33 | 1.63 |
12 | Distance to fault | 0.17 | 0.042 | 0.23 | 1.97 | 0.025 | 0.63 | 3.25 |
13 | Soil type | 0.10 | 0.048 | 0.33 | 1.25 | 0.018 | 0.27 | 4.16 |
14 | Land use | 0.07 | 0.034 | 0.19 | 2.84 | 0.026 | 0.74 | 1.75 |
15 | Lithology | 0.12 | 0.062 | 0.44 | 1.33 | 0.074 | 0.15 | 2.65 |
16 | HOFD | 0.23 | 0.071 | 0.26 | 2.39 | 0.032 | 0.22 | 2.32 |
17 | Flow accumulation | 0.15 | 0.042 | 0.29 | 1.74 | 0.005 | 0.83 | 2.08 |
18 | SPI | 0.35 | 0.028 | 0.24 | 1.46 | 0.019 | 0.55 | 3.02 |
19 | TPI | 0.18 | 0.021 | 0.23 | 2.59 | 0.013 | 0.49 | 1.96 |
20 | TRI | 0.12 | 0.079 | 0.66 | 3.52 | 0.098 | 0.048 | 13.95 |
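The screening rule stated in the text (TOL > 0.1 and VIF < 5) can be applied directly to the TOL and VIF columns of Table 4, as in the short sketch below. The function name `passes_screen` is hypothetical; the values are copied from the table.

```python
# (criterion, TOL, VIF) as listed in Table 4
TABLE_4 = [
    ("MFI", 0.92, 1.02), ("Slope", 0.19, 2.82), ("DEM", 0.52, 1.95),
    ("CN", 0.25, 1.12), ("Distance to river", 0.32, 2.22),
    ("NDVI", 0.19, 2.95), ("Plane curvature", 0.42, 4.85),
    ("TWI", 0.44, 1.35), ("Aspect", 0.74, 2.36),
    ("Distance to road", 0.39, 2.22), ("VOFD", 0.33, 1.63),
    ("Distance to fault", 0.63, 3.25), ("Soil type", 0.27, 4.16),
    ("Land use", 0.74, 1.75), ("Lithology", 0.15, 2.65),
    ("HOFD", 0.22, 2.32), ("Flow accumulation", 0.83, 2.08),
    ("SPI", 0.55, 3.02), ("TPI", 0.49, 1.96), ("TRI", 0.048, 13.95),
]

def passes_screen(tol, vif, tol_min=0.1, vif_max=5.0):
    """True when a criterion satisfies both multi-collinearity thresholds."""
    return tol > tol_min and vif < vif_max

retained = [name for name, tol, vif in TABLE_4 if passes_screen(tol, vif)]
rejected = [name for name, tol, vif in TABLE_4 if not passes_screen(tol, vif)]
```

Applied to these values, the rule rejects only TRI and retains the other 19 criteria, which matches the 19 dimensions used later by the BPSO algorithm.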
Identification of criteria affecting flood susceptibility
Binary PSO is used to determine the optimal combination of effective criteria for predicting flood susceptibility. In the algorithm's binary mode, each dimension of a particle is restricted to the values zero and one. According to Table 4 and Figure 6, the BPSO algorithm used in this study includes 19 dimensions, each of which corresponds to a criterion. After the dimensions of each particle are initialized (0 or 1), the raster layers of the criteria whose dimensions have a value of one are entered as input into the ML algorithm, and the fitness function value for the corresponding particle is calculated. The process continues until the BPSO algorithm reaches the 100th iteration, which is the stop condition in this research. The particle with the best fitness value is then selected, and its dimensions with a value of one indicate the optimal combination of effective criteria for predicting flood susceptibility.
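The procedure above can be sketched as a small, self-contained loop. This is an illustration under stated assumptions, not the study's implementation. The sigmoid transfer function is the textbook choice. Parameter defaults follow Table 3. `toy_fitness` is a hypothetical stand-in for the RMSE of an ML model trained on the selected criteria, and `HYPOTHETICAL_TARGET` exists only to give the toy fitness something to minimize.

```python
import math
import random

CRITERIA = ["MFI", "Slope", "DEM", "CN", "Distance to river", "NDVI",
            "Plane curvature", "TWI", "Aspect", "Distance to road", "VOFD",
            "Distance to fault", "Soil type", "Land use", "Lithology",
            "HOFD", "Flow accumulation", "SPI", "TPI"]  # 19 dimensions

HYPOTHETICAL_TARGET = [i % 2 for i in range(len(CRITERIA))]

def toy_fitness(mask):
    """Stand-in for model RMSE: fraction of bits differing from the target."""
    return sum(m != t for m, t in zip(mask, HYPOTHETICAL_TARGET)) / len(mask)

def bpso_select(fitness, n_dims, swarm_size=30, iterations=100,
                w=1.0, c1=2.0, c2=2.0, v_max=4.0, seed=0):
    """Minimize `fitness` over binary masks with a basic binary PSO."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 1) for _ in range(n_dims)] for _ in range(swarm_size)]
    V = [[rng.uniform(-v_max, v_max) for _ in range(n_dims)]
         for _ in range(swarm_size)]
    pbest = [x[:] for x in X]
    pbest_fit = [fitness(x) for x in X]
    g = min(range(swarm_size), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(iterations):
        for i in range(swarm_size):
            for d in range(n_dims):
                r1, r2 = rng.random(), rng.random()
                v = (w * V[i][d]
                     + c1 * r1 * (pbest[i][d] - X[i][d])
                     + c2 * r2 * (gbest[d] - X[i][d]))
                v = max(-v_max, min(v_max, v))      # clamp to [-v_max, v_max]
                V[i][d] = v
                # Bit becomes 1 with probability sigmoid(v)
                X[i][d] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-v)) else 0
            f = fitness(X[i])
            if f < pbest_fit[i]:                    # update personal best
                pbest[i], pbest_fit[i] = X[i][:], f
                if f < gbest_fit:                   # update global best
                    gbest, gbest_fit = X[i][:], f
    return gbest, gbest_fit

best_mask, best_fit = bpso_select(toy_fitness, len(CRITERIA))
selected = [name for name, bit in zip(CRITERIA, best_mask) if bit == 1]
```

In practice, `toy_fitness` would be replaced by a function that trains the ROF, ADTree, or RF model on the criteria whose bits equal one and returns the resulting RMSE.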
After the ROF, ADTree, and RF ML models were trained using the BPSO algorithm, the RMSE values were plotted (Figure 7). Based on Figure 7, the best RMSE values for the ROF-BPSO, ADTree-BPSO, and RF-BPSO models were 0.335, 0.289, and 0.218, respectively. This indicates that the RF-BPSO model is the most accurate of the three in predicting the flood susceptibility of the study area. The effective criteria in predicting flood susceptibility for the ROF-BPSO, ADTree-BPSO, and RF-BPSO hybrid tree-based models were identified (Figure 8). The results showed that for the ROF-BPSO, ADTree-BPSO, and RF-BPSO models, 11 criteria (MFI, Slope, DEM, CN, Distance to river, Plane curvature, VOFD, Distance to fault, Land use, Lithology, and SPI), 12 criteria (MFI, Slope, DEM, Distance to river, NDVI, Aspect, Distance to fault, Soil type, Land use, Lithology, HOFD, and SPI), and 15 criteria (MFI, Slope, DEM, CN, Distance to river, NDVI, TWI, Aspect, Distance to road, Soil type, Land use, Lithology, HOFD, SPI, and SPI) were found to be effective in predicting flood susceptibility, respectively. Seven criteria (MFI, Slope, DEM, Distance to river, Land use, Lithology, and SPI) were selected by all three models and were therefore of greater importance in predicting flood susceptibility in the study area.
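The criteria shared by all three models can be checked with a set intersection over the lists given above. The variable names are hypothetical; the criteria are copied from the text, and a set keeps one copy of any name that is printed twice.

```python
rof_bpso = {"MFI", "Slope", "DEM", "CN", "Distance to river",
            "Plane curvature", "VOFD", "Distance to fault", "Land use",
            "Lithology", "SPI"}
adtree_bpso = {"MFI", "Slope", "DEM", "Distance to river", "NDVI", "Aspect",
               "Distance to fault", "Soil type", "Land use", "Lithology",
               "HOFD", "SPI"}
rf_bpso = {"MFI", "Slope", "DEM", "CN", "Distance to river", "NDVI", "TWI",
           "Aspect", "Distance to road", "Soil type", "Land use",
           "Lithology", "HOFD", "SPI"}

# Criteria selected by every model
common_criteria = rof_bpso & adtree_bpso & rf_bpso
```

The intersection contains MFI, Slope, DEM, Distance to river, Land use, Lithology, and SPI.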
Prediction of the flood susceptibility map
Although the different flood susceptibility models give similar results in terms of determining flood-prone areas, another major goal of flood susceptibility modeling is to select the most accurate model. The greater the number of flood-prone locations in an area, the higher the likelihood of flood occurrence, and the lower the number, the lower the likelihood. In the present study, flood susceptibility maps were produced using the ROF-BPSO, ADTree-BPSO, and RF-BPSO hybrid tree-based models, with 70% of the data used for training and 30% for validation. As mentioned, combining the BPSO metaheuristic algorithm with tree-based ML models provides an optimal combination of flood-affecting criteria based on the value of the fitness function and predicts flood susceptibility maps according to these optimal criteria. In Figure 9, the degree of susceptibility predicted (based on the flood-prone locations) by combining tree-based ML models and the BPSO algorithm is shown in the range [0, 1]. The predicted degree of susceptibility is classified into five output classes, from very low to very high, according to the natural breaks classification method. This method was used because it is based on the natural groupings in the data: the break points between the groupings are identified in such a way that similar values are grouped together and the differences between the classes are maximized (Razavi Termeh et al. 2018; Chen et al. 2020; Eslaminezhad et al. 2022). Class boundaries are therefore placed where relatively large changes in the data values occur. According to Figure 9, these maps show good reliability, since they are consistent with the floods that occurred.
A high or very high level of susceptibility was predicted at the locations of many previous events. Figure 9 also shows the flood susceptibility maps for the area in question predicted using the aforementioned models. These maps indicate that the chances of flooding in the center of the area in question are higher than in other parts due to lower elevations and lower slope angles (Kanani-Sadat et al. 2019). Furthermore, due to a decrease in forest lands and severe changes in land use in recent years, and the resulting increase in urban areas and dry lands, the likelihood of flood damage has increased. In addition, the damage caused by floods has increased due to buildings that are built too close to the river.
The percentage of the study area assigned to each flood susceptibility class by the ROF-BPSO, ADTree-BPSO, and RF-BPSO tree-based models can be observed in Figure 10. In the RF-BPSO model, the very high and high susceptibility classes covered a higher percentage of the area than the same classes in the ROF-BPSO and ADTree-BPSO models; this model assigned 50.5 and 29.05% of the area to the very low and very high flood susceptibility classes, respectively. The moderate class in the ADTree-BPSO model covered a higher percentage of the area than the same class in the ROF-BPSO and RF-BPSO models; this model assigned 56.23 and 23.27% of the area to the very low and very high flood susceptibility classes, respectively. The very low and low classes in the ROF-BPSO model covered a higher percentage of the area than the same classes in the RF-BPSO and ADTree-BPSO models; this model assigned 62.92 and 17.15% of the area to the very low and very high flood susceptibility classes, respectively. Averaged over the three models, 56.55% of the area in question was classified in the very low flood susceptibility class.
Validation of the flood susceptibility prediction models
Validating the predicted maps is a necessary step in evaluating their quality; thus, the ROC curve method was used to evaluate the performance of the models. The ROC curves for the training and validation data sets of the ROF-BPSO, ADTree-BPSO, and RF-BPSO tree-based models are presented in Figure 11. Generally, the training data sets show how well the models fit the known flood locations, while the validation data sets show the predictive skill of the models. For the training data sets, RF-BPSO had the highest AUC value (0.961), followed by ROF-BPSO (0.957) and ADTree-BPSO (0.942) (Figure 11). For the validation data sets, RF-BPSO had the highest accuracy (AUC = 0.935), followed by ADTree-BPSO (AUC = 0.923) and ROF-BPSO (AUC = 0.904). Therefore, while all three models fall in the excellent range of Table 2 (AUC above 0.9), the RF-BPSO model has the best accuracy and performance in predicting the flood susceptibility of the study area. These findings conform to those of previous studies (Khosravi et al. 2016; Hong et al. 2018).
Training (Table 5) and validation (Table 6) data sets were used to evaluate the flood susceptibility predicting abilities of the models. For the classification of flood pixels, the RF-BPSO model had the highest sensitivity (SST) for the training and validation data sets (0.929 and 0.887, respectively). For the classification of non-flood pixels, the RF-BPSO model had the highest specificity (SPC) for the training and validation data sets (0.963 and 0.924, respectively). In addition, the RF-BPSO model had the highest positive and negative prediction rates (PPR and NPR) and the highest accuracy (ACC) for the two data sets. The validation findings show that the RF-BPSO model performs better than the other two models in flood susceptibility prediction for the study area. This may be due to its ability to model large databases and to combine large numbers of input variables without changing them (Rahmati et al. 2016). Previous studies have also shown that the RF model performs at a high level when predicting flood susceptibility maps (Hong et al. 2018; Zhao et al. 2018; Tang et al. 2020). The RF model makes use of the high variance in individual trees and places them in classes (Chen et al. 2017). Furthermore, studies on various issues such as fire, groundwater potential, and earthquake susceptibility have shown that RF can be used to predict a variety of phenomena (Levy et al. 2007; Mukerji et al. 2009; Sahoo et al. 2009; Tien Bui et al. 2018).
Number | Criteria | ROF-BPSO | ADTree-BPSO | RF-BPSO |
---|---|---|---|---|
1 | TP | 238 | 244 | 250 |
2 | TN | 233 | 235 | 240 |
3 | FP | 21 | 15 | 9 |
4 | FN | 26 | 24 | 19 |
5 | PPR | 0.918 | 0.942 | 0.965 |
6 | NPR | 0.899 | 0.907 | 0.926 |
7 | SST | 0.901 | 0.910 | 0.929 |
8 | SPC | 0.917 | 0.940 | 0.963 |
9 | ACC | 0.909 | 0.924 | 0.945 |
Number | Criteria | ROF-BPSO | ADTree-BPSO | RF-BPSO |
---|---|---|---|---|
1 | TP | 95 | 99 | 103 |
2 | TN | 91 | 96 | 98 |
3 | FP | 16 | 12 | 8 |
4 | FN | 20 | 15 | 13 |
5 | PPR | 0.855 | 0.891 | 0.927 |
6 | NPR | 0.819 | 0.864 | 0.882 |
7 | SST | 0.826 | 0.868 | 0.887 |
8 | SPC | 0.850 | 0.888 | 0.924 |
9 | ACC | 0.837 | 0.878 | 0.905 |
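The rate indicators in Tables 5 and 6 follow from the four confusion-matrix counts. The sketch below uses the standard definitions, which are assumed here because the defining equations are not reproduced in this section; the function name is hypothetical. With the RF-BPSO counts from the two tables, the results agree with the tabulated values to within 0.001.

```python
def classification_indicators(tp, tn, fp, fn):
    """Compute PPR, NPR, SST, SPC, and ACC from confusion-matrix counts."""
    return {
        "PPR": tp / (tp + fp),                   # positive prediction rate
        "NPR": tn / (tn + fn),                   # negative prediction rate
        "SST": tp / (tp + fn),                   # sensitivity
        "SPC": tn / (tn + fp),                   # specificity
        "ACC": (tp + tn) / (tp + tn + fp + fn),  # overall accuracy
    }

# RF-BPSO counts taken from Table 5 (training) and Table 6 (validation)
rf_training = classification_indicators(tp=250, tn=240, fp=9, fn=19)
rf_validation = classification_indicators(tp=103, tn=98, fp=8, fn=13)
```

For example, the training sensitivity is 250 / (250 + 19), approximately 0.929, and the validation specificity is 98 / (98 + 8), approximately 0.924.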
In addition to the above, the performance of the three flood susceptibility models on the training and validation data sets was assessed using several error measures, namely RMSE, MAE, and the coefficient of determination (R2); the results are presented in Table 7. The RF-BPSO model has the lowest RMSE, the lowest MAE, and the highest R2 of the proposed models for both the training and validation data sets, which indicates that it predicts flood susceptibility with the fewest errors. According to Table 7, the following ranking of statistical performance for the optimized tree-based ML models is observed: RF-BPSO > ADTree-BPSO > ROF-BPSO.
Indicator | ROF-BPSO (training) | ROF-BPSO (validation) | ADTree-BPSO (training) | ADTree-BPSO (validation) | RF-BPSO (training) | RF-BPSO (validation) |
---|---|---|---|---|---|---|
RMSE | 0.312 | 0.335 | 0.269 | 0.289 | 0.198 | 0.218 |
MAE | 0.191 | 0.193 | 0.175 | 0.176 | 0.145 | 0.147 |
R2 | 0.842 | 0.813 | 0.886 | 0.877 | 0.931 | 0.926 |
In accordance with Khosravi et al. (2016) and Rahmati et al. (2016), the results of the present study showed that the DEM, distance to river, and MFI have a major effect on flood susceptibility, and are therefore useful in flood susceptibility prediction models. A direct relationship was found between MFI and flood susceptibility: higher rainfall increases the chance of flooding and the weight of the flood susceptibility layer. These findings regarding the direct relationship between MFI and flooding conform to studies conducted by Pham et al. (2020) and Razavi Termeh et al. (2018). Another criterion found to affect flood susceptibility is DEM, where the higher the elevation, the lower the chance of flooding. The highest flood susceptibility was found in low elevation hillside areas, which can be due to the accumulation of rainwater and flooding in these areas. This finding also conforms to previous studies (Khosravi et al. 2016; Liu et al. 2016). Based on the findings of the present study, it can be concluded that the majority of the accumulation and distribution of floods occurs near rivers and in low elevation regions. In other words, areas with high flood susceptibility have low elevation, a minimum slope angle, and flat terrain, and are close to rivers.
Final flood susceptibility map
In the combination formula, F is the value of the raster layer resulting from the combination of the three models, AUCi is the area under the curve of model i, m is the flood susceptibility layer of each model, and n is the number of models. Figure 12 shows the final flood susceptibility map of the study area, which combines the three models ROF-BPSO, ADTree-BPSO, and RF-BPSO. This combined map assigns 54.03 and 26.21% of the area to the very low and very high flood susceptibility classes, respectively. The central parts of the study area to the east and west have high and very high flood susceptibility, the most important reasons for which are the lower elevation and minimum slope angle of these areas.
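One plausible reading of this combination is an AUC-weighted average of the three susceptibility layers, sketched below for a single pixel. The normalization by the sum of the AUC values is an assumption, since the equation itself is not reproduced here. The function name and the three pixel values are hypothetical; the AUC values are the validation results reported above.

```python
def auc_weighted_susceptibility(layer_values, auc_values):
    """AUC-weighted average of per-model susceptibility values for one pixel.

    Assumed form: F = sum(AUC_i * m_i) / sum(AUC_i), for i = 1..n models.
    """
    if len(layer_values) != len(auc_values):
        raise ValueError("one AUC value is required per model layer")
    weighted_sum = sum(a * m for a, m in zip(auc_values, layer_values))
    return weighted_sum / sum(auc_values)

# Validation AUC values of ROF-BPSO, ADTree-BPSO, and RF-BPSO
validation_auc = [0.904, 0.923, 0.935]
# Hypothetical susceptibility values predicted by the three models for a pixel
pixel_values = [0.2, 0.5, 0.8]
combined = auc_weighted_susceptibility(pixel_values, validation_auc)
```

Under this assumed form the combined value stays within the range of the individual model outputs, and models with a higher AUC contribute slightly more to the final map.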
DISCUSSION
Floods are considered to be some of the most devastating and destructive forces of nature. Using appropriate spatial analyses relevant to flood susceptibility, the majority of the damage caused by these natural disasters can be avoided. The aim of the present study was to predict the flood susceptibility of the Maneh and Samalqan watershed using three ML models, i.e. ROF-BPSO, ADTree-BPSO, and RF-BPSO. In recent years, few studies have been conducted on the use of tree-based ML models in predicting flood susceptibility (Khosravi et al. 2016; Costache et al. 2021). Therefore, the performances of the three ML models were evaluated and compared using statistical indicators and AUC values. The findings indicated that RF-BPSO outperformed the other two models. RF is an algorithm that can produce more satisfactory results, with higher accuracy and lower variance and bias (Hong et al. 2018; Zhao et al. 2018; Chen et al. 2020; Nachappa et al. 2020). Therefore, the RF-BPSO model has the best performance with regard to predicting the flood susceptibility of the Maneh and Samalqan watershed. Regarding the ADTree-BPSO model, some variance may be found between its performance on the training and validation data, resulting in a decrease in the model's stability. In addition, the general statistical performance of the ADTree-BPSO model was better than that of the ROF-BPSO model. However, the ROF-BPSO model showed a higher AUC value for the training data. This means that multiple statistical indicators must be taken into account in order to evaluate the overall performance of the models. All three models can produce acceptable results in terms of flood susceptibility prediction. Regarding the distance criteria, the greater the distance to rivers, the lower the chances of flooding.
Regarding elevation, which is considered to be one of the most impactful criteria in flood susceptibility, greater chances of flooding were found at lower elevations; floods usually occur in low-lying areas. Regarding the MFI, the higher the elevation, the higher the chances of rainfall, while the chances of flooding at higher elevations are lower. Land use is also an influential criterion in flood susceptibility, as the type of land use can affect permeability and increase the speed of runoff. Due to human influences and changes in land use, the strength of the waterway decreases, thereby influencing the chances of flooding. These findings conform to the findings of previous studies (Khosravi et al. 2016; Liu et al. 2016; Rahmati et al. 2016; Razavi Termeh et al. 2018; Pham et al. 2020). Based on these findings and the predicted flood susceptibility maps, appropriate management measures can be undertaken in order to reduce the human and financial damage caused by floods. Finally, the use of data mining and GIS to predict the potential of flooding can be beneficial, especially in developing countries in which access to hydrogeological data is difficult.
CONCLUSION
The analysis of flood susceptibility maps using modern ML models can assist policymakers in reducing the effects and damages of floods in high-risk areas. The present study proposed and evaluated three new hybrid tree-based ML models, ROF-BPSO, ADTree-BPSO, and RF-BPSO, to predict flood susceptibility and identify the most susceptible regions in the Maneh and Samalqan watershed, Iran. To this end, high-resolution satellite images, Google Earth, and primary data on previous flood locations were used to prepare the inventory map, and the 370 flood-prone locations were divided into training (70%) and testing (30%) sets for the construction and validation of the three ML models. The findings of this study can be summarized as follows:
1. The results showed that the three optimized ML models of RF-BPSO, ADTree-BPSO, and ROF-BPSO can perform well in regard to predicting the flood susceptibility of the area in question. RF-BPSO was found to be superior in terms of PPR, NPR, sensitivity, specificity, ACC, and AUC.
2. The results indicated that the seven criteria, including MFI, Slope, DEM, Distance to river, Land use, Lithology, and SPI, were of greater importance in predicting flood susceptibility in the study area.
3. The findings indicated that the chance of flooding in the center of the area in question is greater than in other points due to lower elevation, lower slope angle, more drylands, and proximity to rivers.
Due to the satisfactory performance and high accuracy of the RF-BPSO model in predicting flood susceptibility, it is suggested that future studies use this model and take hydrogeological, topographical, geological, and environmental criteria into account in order to reduce the damage caused by floods in other regions.
However, there are limitations to the practical use of the resulting flood susceptibility maps. In particular, these maps do not include flood depth, duration, severity, or frequency. Furthermore, the flood inventory maps used in this study and in previous ones record only binary values (0, 1) for the absence or presence of floods and do not include flood frequency. As a result, all flood locations were given equal weight when predicting flood susceptibility maps. Future research should consider the frequency of floods at each flood point in flood susceptibility maps.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICTS OF INTEREST
There is no conflict of interest in this article.