## Abstract

The objective of this study was the development of an approach based on machine learning and GIS, namely Adaptive Neuro-Fuzzy Inference System (ANFIS), Gradient-Based Optimizer (GBO), Chaos Game Optimization (CGO), Sine Cosine Algorithm (SCA), Grey Wolf Optimization (GWO), and Differential Evolution (DE) to construct flood susceptibility maps in the Ha Tinh province of Vietnam. The database includes 13 conditioning factors and 1,843 flood locations, which were split by a ratio of 70/30 between those used to build and those used to validate the model, respectively. Various statistical indices, namely root mean square error (RMSE), area under curve (AUC), mean absolute error (MAE), accuracy, and R1 score, were applied to validate the models. The results show that all the proposed models performed well, with an AUC value of more than 0.95. Of the proposed models, ANFIS-GBO was the most accurate, with an AUC value of 0.96. Analysis of the flood susceptibility maps shows that approximately 32–38% of the study area is located in the high and very high flood susceptibility zone. The successful performance of the proposed models over a large-scale area can help local authorities and decision-makers develop policies and strategies to reduce the threats related to flooding in the future.

## HIGHLIGHTS

Flood susceptibility modeling was done using hybrid machine learning approaches.

The proposed models have achieved great precision and have surpassed the reference models.

ANFIS-GBO and ANFIS-SCA were the best models.


## INTRODUCTION

The consequences of global warming and urban sprawl contribute to both the growing severity and number of floods, landslides, and droughts (Mukherjee *et al.* 2018; Feng *et al.* 2020; Ghorbanzadeh *et al.* 2020). Particularly, climate change leads to an increase in the intensity and frequency of torrential rains, which are the main aggravators of flooding (Prein *et al.* 2017; Nachappa *et al.* 2020; Tabari 2020). According to statistics from the World Bank, the occurrence of flooding has increased by more than 40% in the last two decades, leading to serious economic consequences (Alfieri *et al.* 2017; Andaryani *et al.* 2021). Globally, flooding affected an estimated 109 million people between 1995 and 2015 and caused damage worth approximately USD 75 billion each year (Andaryani *et al.* 2021). This figure may increase to USD 300 billion by 2050, by which time an estimated 1.3 billion people will be living on floodplains (Giang *et al.* 2020).

Vietnam has a coastline of more than 3,200 km and 2,360 rivers with a total length of about 41,900 km, which makes it one of the countries most affected by floods. Between 1989 and 2015, floods in Vietnam left at least 14,867 people dead or missing (Luu *et al.* 2018). The Central region of Vietnam is particularly affected by flooding as it has a relatively high rainfall of 1,800–2,500 mm/year and is densely populated (Pham *et al.* 2021f). An example of this vulnerability came in October 2020, when flooding left an estimated 129 people dead or missing and 214 injured. In addition, 111,200 houses were damaged, at a cost of approximately USD 1 billion.

Despite the efforts of experts and policymakers to reduce the effects of flooding in recent years, its impact is in fact increasing, all over the world (Pham *et al.* 2021d, 2021e). We urgently need a better understanding of what makes a given area vulnerable to flooding. The term ‘flood susceptibility’ describes the spatial and temporal probability of a flood event, and flood susceptibility maps are considered crucial to the management of flood risk in the future (Pham *et al.* 2020). The literature shows that there are three central methods to determine areas at risk of flooding: remote sensing, physics-based models, and data-driven models.

In the case of remote sensing, researchers have used both radar and optical data to detect flood zones (Sun *et al.* 2000; Schumann *et al.* 2007; Anusha & Bharathi 2020). Although this approach can generate highly accurate flood maps at a low cost, especially when the model is integrated with GIS, it has significant limitations related to spatial and temporal resolutions and is affected by clouds. In addition, remote sensing requires the manual adjustment of the threshold parameters to obtain a good result; this process is time-consuming and laborious. Critically, remote sensing is also unable to present the original causes of the flood (Nguyen *et al.* 2020).

Physics-based models – such as MIKE FLOOD (Patro *et al.* 2009; Kadam & Sen 2012) and HEC-RAS (Shustikova *et al.* 2019; Ongdas *et al.* 2020) – have received considerable attention from researchers looking to map flood risk. Although this approach has proven capable of modeling future flood scenarios, unfortunately these models are affected by changes in topography, initial or boundary conditions, or other parameters such as coefficients of friction, diffusion, or degradation (often inaccessible to direct measurement). Moreover, these models require detailed data at location – such as hydro-geomorphological data – requiring intensive calculations and thus making short-term forecasting difficult (Eslaminezhad *et al.* 2022). Previous research has demonstrated that physics-based models only have short-term predictive ability (Eslaminezhad *et al.* 2022; Prasad *et al.* 2022). One final challenge is that the establishment of these models requires a thorough understanding of hydrological parameters (Pham *et al.* 2021f).

The drawbacks of physics-based models have led many to the use of data-based models such as statistical modeling and machine learning. Statistical modeling has been used extensively to predict flood risk, with methods including frequency ratio (FR; Samanta *et al.* 2018; Tehrany & Kumar 2018), logistic regression (LR; Nandi *et al.* 2016), fuzzy logic (Pulvirenti *et al.* 2011; Perera & Lahat 2015), and weight of evidence (Rahmati *et al.* 2016). These models have proven effective in assessing flood susceptibility in several regions around the world. However, the inundation process is often nonlinear and time-varying, which presents a significant challenge for statistical models, particularly at the regional scale (Ha *et al.* 2022).

Some scientists have used machine learning, integrated with data from satellite images, to determine areas that are susceptible to floods. Machine learning is quicker and requires a smaller amount of input data than physics-based models. Moreover, it can resolve the nonlinear characteristics of past flood events just from the data; it does not need to understand the physical processes involved (Islam *et al.* 2021). The use of machine learning is predicated on the relationship between the input dataset and flooding remaining unchanged in the future (Bui *et al.* 2019). Examples of this methodology include Support Vector Machine (Tehrany *et al.* 2014; Pham *et al.* 2019), Random Forest (RF; Lee *et al.* 2017; Chen *et al.* 2020), Adaboost (Hong *et al.* 2018a; Pham *et al.* 2021g), Artificial Neural Networks (Falah *et al.* 2019; Bui *et al.* 2020), and Adaptive Neuro-Fuzzy Logic (Hong *et al.* 2018b; Tabbussum & Dar 2021).

Several previous studies have claimed that machine learning is more appropriate than traditional approaches to flood prediction (Mosavi *et al.* 2017), but traditional machine learning methods are becoming less effective. A particular problem is that of overfitting: machine learning predicts well in the training process, because systems learn tasks based on data from the past, but if data are missing or insufficiently diverse, the systems cannot predict accurately in the testing process (Mosavi *et al.* 2018; Bui *et al.* 2020; Chou *et al.* 2021).

In recent years, hybrid models have become a more popular way to resolve the problems of mapping flood susceptibility (Nguyen *et al.* 2021a; Yen *et al.* 2021). Hybrid models combine individual models with metaheuristic algorithms. The advantage is that hybrid models can eliminate the weak points of individual models, to obtain more accurate results (Saha *et al.* 2021). Moreover, metaheuristic algorithms explore the entire search space, thus limiting local optimization problems (Zhao *et al.* 2020). Optimization algorithms have been applied effectively in different domains, such as economics, earth science, and engineering. These can be divided into three groups: evolution-based – such as Genetic Algorithm (GA) (Shahabi *et al.* 2021) and DE (Razavi-Termeh *et al.* 2021) – physics-based – such as Henry Gas Solubility Optimization (HGSO; Nguyen *et al.* 2021b), Atom Search Optimization (ASO; Mundra *et al.* 2022) – and swarm-based – such as Whale Optimization Algorithm (WOA; Liu *et al.* 2020), GWO (Nguyen *et al.* 2021a), and Bat Algorithm (BA; Ahmadlou *et al.* 2019).

One example is that of Termeh *et al.* (2018), who integrated ANFIS with three optimization algorithms (Ant Colony Optimization, GA, and Particle Swarm Optimization (PSO)) to produce flood susceptibility maps for the Fars province of southern Iran. The hybrid models were successful, outperforming the reference models. Bui *et al.* (2019) combined the Extreme Learning Machine model with PSO to evaluate flood susceptibility in a mountain region of northern Vietnam. This combination displayed promising precision, with an AUC of over 0.95. Bui *et al.* (2020) utilized the swarm intelligence algorithms, namely GWO, Grasshopper Optimization Algorithm, and Social Spider Optimization algorithm to improve the performance of the DNN model to determine flood susceptibility in an area of Lai Chau province. The hybrid models outperformed the Support Vector Machine and RF reference models.

Such excellent results have inspired the implementation and development of hybrid models in the modeling of many natural hazards. However, key limitations of machine learning models remain, namely the generalization problem (i.e. models can perform well within the range of the training dataset but cannot predict beyond that range) and the no-free-lunch problem (i.e. no single model can solve all problems in all regions).

Researchers therefore recommend developing new ways to build flood susceptibility maps and related natural hazard simulations, and so the aim of this study was to develop new hybrid models based on ANFIS and five optimization algorithms to predict flood susceptibility in Ha Tinh province, Vietnam. ANFIS is a popular neuro-fuzzy model for solving highly complex nonlinear problems. In addition, ANFIS has the ability to adapt automatically to problems. ANFIS is based on the fuzzy model and is particularly strong at interpreting problems, which in turn improves its ability to generalize. ANFIS has gained recognition due to its strength in transforming crisp inputs into fuzzy sets and providing crisp output from fuzzy rules (Chopra *et al.* 2021). The model has been proven to be effective in several different areas, such as flood susceptibility mapping and streamflow prediction (Adnan *et al.* 2022). However, ANFIS also has disadvantages: its performance depends on the selection of the type and the number of membership functions, as well as the position of the membership functions (Chopra *et al.* 2021). Therefore, it is necessary to integrate ANFIS with optimization algorithms like Gradient-Based Optimizer (GBO), Chaos Game Optimization (CGO), and Sine Cosine Algorithm (SCA). These optimization algorithms have proven effective in several different areas although, as yet, they have rarely been applied in the field of environmental science and natural hazard management.

We have created new methods, based on ANFIS and three novel optimization algorithms (GBO, CGO, and SCA), to generate spatial flood susceptibility maps with minimal field data for the study area, which is a province that is regularly affected by flooding. The performance of these models was compared with the reference models ANFIS-DE and ANFIS-GWO. Our initial hypothesis was that the three proposed models would successfully build a flood susceptibility map and that their performance would surpass the reference models ANFIS-DE and ANFIS-GWO.

The novel elements of this study are twofold. This is the first time flood susceptibility in Ha Tinh has been mapped in such a rigorous way; it is also the first time ANFIS has been integrated with GBO, CGO, and SCA. To the authors' knowledge, these hybrid models have not previously been used to construct a flood susceptibility map, and so the findings of this study represent a novel approach to the use of deep-learning algorithms in generating flood susceptibility maps.

## STUDY AREA AND DATA USE

### Study area

Due to the combined effects of tropical depressions, cold waves, and high slope, the orographic precipitation over the Central region of Vietnam is very high (Le *et al.* 2021). The study area lies in the tropical monsoon region and experiences two different seasons: the rainy season (November–March) and the dry season (April–October). The average annual precipitation for the period 1958–2018 was relatively high – between 1,142 and 4,391 mm, with 75% of the precipitation concentrated in the rainy season. Peak rainfall is in September and October, when the monthly average is 500–800 mm, which is very high compared with other regions of Vietnam (Le *et al.* 2021). The study area has a dense river network with a total length of more than 400 km and an average annual flow of 195 m^{3}/s. There are three main river systems: Ngan Sau, Ngan Pho, and the coastal river system.

Forests currently cover about 66% of the surface of the province, although illegal deforestation, infrastructure construction, and forest fires are causing this figure to rapidly decrease. According to the Department of Agriculture and Rural Development, the forested area decreased by 1,728.17 ha between 2019 and 2020. This increases the flood risk in the study area. According to the Institute of Science and Technology, between 1961 and 2015, 44 storms affected Ha Tinh province. Storms can combine with heavy precipitation over a large area to result in serious floods. In October 2020, the province suffered two particularly destructive floods, which killed 6 people, destroyed 3,765 houses, and caused damage of VND 723 billion.

In recent years, the provincial government's flood strategy has focused on two main objectives: the adoption of structural measures – such as containment or flood mitigation through actions like the construction of dams – and the integration of flood depth maps, produced using hydrodynamic modeling, into land-use planning strategies to minimize future damage to human life and the economy. However, as mentioned above, hydrodynamic modeling has critical limitations that influence the accuracy of the flood depth map. Van den Honert & McAneney (2011) pointed out that the inability of hydrodynamic modeling to accurately predict floods had led to major damage in Queensland, Australia. Therefore, in this study, machine learning was selected for the construction of the map.

### Geospatial data

#### Flood inventory

The first step when using a machine learning approach to create a flood susceptibility model is building the flood inventory map, which details past flood locations and relations with conditioning factors at those locations (Mojaddadi *et al.* 2017; Nachappa *et al.* 2020; Pham *et al.* 2021a). In this study, the locations of floods were collected from sources obtained from the Department of Natural Resources and Environment of Ha Tinh (DONRE of Ha Tinh). In addition, field missions were carried out in 2020 and 2021 to measure flood marks, and Landsat 8 OLI and Sentinel 1A satellite images were used to detect floodplains during flooding in October 2010 and October 2020. Details of the methodology were presented by Kumar (2019) and Tiampo *et al.* (2022). A total of 1,843 samples were acquired to establish the model for Ha Tinh, including 943 samples for flood and 900 samples for non-flood. Flood and non-flood inventory were coded as 1 and 0, respectively. These data were split into two sets: 70% of the dataset was utilized to create the model, 30% to verify the model.

#### Conditioning factors

*Elevation* is an important factor because it is linked to the capacity of water accumulation. Flooding is most likely to happen in lower-lying areas. Elevation affects flood regulation because with decreasing altitude, rainfall and river flow increase (Choubin *et al.* 2019; Nachappa *et al.* 2020). In the study area, the altitude value is 0–2,280 m.

*Slope* is the most vital conditioning factor, because it has effects on flow velocity and water accumulation capacity. Gentle slopes are usually more prone to flooding, as precipitation runs off steeper slopes into lowland water bodies (Yariyan *et al.* 2020). The slope value varied from 0° to 70° in this study.

*Aspect* is an important factor in hydrological response. It influences the local climatic conditions, soil moisture, and infiltration. Although aspect affects flooding indirectly, several researchers have considered it essential to susceptibility maps (Nachappa *et al.* 2020). The aspect value varied from 0° to 360° in this study.

Several researchers have underlined the critical role of *curvature* in flood susceptibility modeling since it directly impacts flow accumulation (Mirzaei *et al.* 2021). In Ha Tinh province, curvature varied from −79 to 79.

*NDVI* is one of the indices to measure vegetation density in the study area. The higher the vegetation density is, the lower the probability of flooding, and conversely (Dodangeh *et al.* 2020; Nachappa *et al.* 2020). The NDVI value ranged from −1 to 1 in this study.

*NDBI* calculates building density. This is important because buildings directly affect surface permeability (Nguyen *et al.* 2021a). The NDBI value ranged from −1 to 1 in this study.

*Density to river* was critical, as we were looking at fluvial flooding, where the river plays an essential role in flood expansion. Most areas near rivers are prone to flooding, as water overflows the riverbank (Vojtek & Vojteková 2019; Chowdhuri *et al.* 2020). The density to river value varied from 0 to 6.4 m in this study.

*Density to road* is also a primary factor as it directly affects the infiltration capacity of the surface (Ahmadlou *et al.* 2021; Linh *et al.* 2022). In Ha Tinh province, the national Highway 1A acts as a dam that blocks the flow of water toward the sea. The density to road value varied from 0 to 7.4 m in this study.

*TWI* describes soil moisture and the ability of the soil to erode spatially (Nachappa *et al.* 2020). The TWI value varied from 1.3 to 23.1 in this study.

*CTI* is the quotient between the slope and the flow accumulation. It describes the capacity of water resources per unit area. A low CTI value represents a basin with a steep slope and a small water surface (Azedou *et al.* 2021). The CTI value varied from 1.3 to 23.1 in this study.

*Land use* is key when predicting flood occurrence because each type of land use has a different water impermeability capacity (Andaryani *et al.* 2021; El-Haddad *et al.* 2021). Beckers *et al.* (2013) pointed out that the evolution of land cover is the main cause of increased flood risk. We categorized land use as either forest, water, agricultural area, artificial area, barren land, or fish-farming areas.

Various previous studies have highlighted the relationship between *precipitation* and flood occurrence. Rainfall is a trigger factor for flooding and an increase in the intensity of rainfall can lead to an increase in the intensity of a flood (Chapi *et al.* 2017; Pham *et al.* 2021f). The rainfall value varied from 2,013 to 2,519 mm in this study.

Although *soil type* influences the process of flooding indirectly, studies have shown its importance in modeling as it controls the process of water infiltration from the surface (Bui *et al.* 2019; Costache *et al.* 2020). Therefore, in this study, soil type was divided into humic acrisols, haplic arenosols, alisols, salic fluvisols, xanthic ferralsols, thionic fluvisols, fluvisols, dystric gleysols, ferralic acrisols, ferralsols, arenic acrisols, lithic leptosols, plinthic acrisols, and hyperdystric acrisols.

## METHODS

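(i) We collected flood inventory data from several field missions and from Landsat 8 OLI and Sentinel 1A imagery. The 13 conditioning factors selected were elevation, aspect, slope, curvature, NDBI, NDVI, density to river, density to road, CTI, TWI, land use, rainfall, and soil type. Data were normalized using the min–max normalization technique:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ is the original value of a conditioning factor, $x_{\min}$ and $x_{\max}$ are its minimum and maximum values, and $x'$ is the normalized value in the range 0 to 1.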

In addition, we used RF to select important factors and filter out non-predictive factors, because data redundancy could increase the complexity of models and could reduce the accuracy of prediction models.
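The min–max scaling in step (i) can be illustrated with a minimal Python sketch. The function name `min_max_normalize` and the sample elevation values are hypothetical; only the 0–2,280 m elevation range comes from the text.

```python
def min_max_normalize(values):
    """Scale a list of numbers to the range [0, 1] with min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant factor has no spread; map it to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical elevation samples (m) spanning the 0-2,280 m range of the study area.
elevation = [0.0, 570.0, 1140.0, 2280.0]
scaled = min_max_normalize(elevation)
print(scaled)
```

Scaling every conditioning factor to the same range prevents factors with large numeric ranges, such as elevation, from dominating factors with small ranges, such as NDVI.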

(ii) Two sets of data must be considered in any statistical model. The first is used to train the models and a separate second set is used to evaluate the models. In this study, a total of 1,843 flood and non-flood points, each described by the 13 conditioning factors, were divided into sets for training (70%) and validation (30%). We used five different models to develop the flood susceptibility map using Python. For neural network usage, the network topology was determined by the dataset. The model parameter initialization process comprised two steps: the initialization of the ANFIS model hyper-parameters using the trial-and-error method and the initialization of the five optimization algorithms.

The structure of the ANFIS model comprises five layers, of which the first is the input layer, which consisted of 13 conditioning factors (see Figure 2). The final output layer contains two classes of results: flood and non-flood. The ANFIS model is influenced by hyper-parameters (number of fuzzy memberships, membership function, batch size, optimizer, and loss).
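To make the five-layer structure concrete, the following sketch implements one forward pass of a small ANFIS. It assumes a first-order Sugeno system with Gaussian membership functions and grid-partitioned rules (one rule per combination of membership functions); the text does not state the rule structure, so this is an illustration rather than the authors' implementation. All names and parameter values are hypothetical.

```python
import math
from itertools import product


def gaussian_mf(x, center, sigma):
    """Gaussian membership degree of a crisp value x."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def anfis_forward(x, premise, consequent):
    """Forward pass of a first-order Sugeno ANFIS (assumed structure).

    x          : list of n crisp inputs
    premise    : premise[i][k] = (center, sigma) of membership function k on input i
    consequent : one list [p_1, ..., p_n, r] per rule
    """
    # Layer 1: fuzzification of every input.
    degrees = [[gaussian_mf(xi, c, s) for (c, s) in mfs] for xi, mfs in zip(x, premise)]
    # Layer 2: firing strength of a rule = product of one degree per input.
    strengths = [math.prod(combo) for combo in product(*degrees)]
    # Layer 3: normalization of the firing strengths.
    total = sum(strengths)
    weights = [w / total for w in strengths]
    # Layers 4-5: linear consequent of each rule, weighted and summed.
    rule_out = [sum(p * xi for p, xi in zip(params[:-1], x)) + params[-1]
                for params in consequent]
    return sum(w * o for w, o in zip(weights, rule_out))


# Two inputs with two membership functions each give 2**2 = 4 rules.
premise = [[(0.0, 1.0), (1.0, 1.0)], [(0.0, 1.0), (1.0, 1.0)]]
consequent = [[0.0, 0.0, 0.7] for _ in range(4)]  # every rule outputs the constant 0.7
y = anfis_forward([0.3, 0.8], premise, consequent)
print(y)
```

Because the normalized firing strengths sum to 1, identical constant consequents return that constant, which is a convenient sanity check on the layer wiring.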

Optimization algorithms allow us to enhance the accuracy of the models by modifying the hyper-parameters. In this study, ANFIS models were developed using five optimization algorithms (GBO, CGO, SCA, GWO, and DE) to create the fuzzy inference system (FIS). After the generation of FIS, the parameters of the membership function were generated and stored in a master. In the ANFIS algorithm, both the fuzzification and defuzzification layers had parameters, while the rules and normalization layers did not. It should be noted that at each iteration, ANFIS was trained with the problem size created by the optimization algorithms. Each generated different solutions starting with the set of input weights and other specific parameters. RMSE was used as the objective function to determine the best solution found by each of the five optimization algorithms. If the stopping condition was met, that is, the model produced the best result with a small RMSE, the optimization stopped. Otherwise, the optimization process was repeated until the model achieved the best results.

See Table 1 for the parameters of the proposed models.

| Parameters | ANFIS-GBO | ANFIS-CGO | ANFIS-SCA | ANFIS-GWO | ANFIS-DE |
|---|---|---|---|---|---|
| Inputs | 13 | 13 | 13 | 13 | 13 |
| Number of fuzzy memberships | 2 | 2 | 2 | 2 | 2 |
| Batch size | 32 | 32 | 32 | 32 | 32 |
| Membership function | Gaussian | Gaussian | Gaussian | Gaussian | Gaussian |
| Optimizer | SGD | SGD | SGD | SGD | SGD |
| Loss | RMSE | RMSE | RMSE | RMSE | RMSE |
| Epochs | 200 | 200 | 200 | 200 | 200 |
| Problem size | 114,740 | 114,740 | 114,740 | 114,740 | 114,740 |

The five proposed algorithms were used to calculate the weights of the model. The total number of weights in each model was 114,740.
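The figure of 114,740 is consistent with a grid-partitioned first-order Sugeno ANFIS with 13 inputs and two Gaussian membership functions per input. That structure is an assumption (the text gives only the totals), but the arithmetic below reproduces the reported problem size.

```python
n_inputs = 13  # conditioning factors (Table 1)
n_mfs = 2      # fuzzy memberships per input (Table 1)

# Each Gaussian membership function has two parameters: center and width.
premise_params = n_inputs * n_mfs * 2

# Grid partition (assumed): one rule per combination of membership functions.
n_rules = n_mfs ** n_inputs

# First-order Sugeno consequent (assumed): one coefficient per input plus a bias.
consequent_params = n_rules * (n_inputs + 1)

problem_size = premise_params + consequent_params
print(premise_params, n_rules, consequent_params, problem_size)
```

Under these assumptions almost all of the weights sit in the rule consequents, which illustrates why the search space handed to the optimization algorithms grows exponentially with the number of inputs.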

(iii) The ROC curve is one of the techniques most often used in assessing the accuracy of a flood susceptibility model, by validating the dataset. The curve was generated by overlaying the validation dataset onto the prediction results of the five proposed models. AUC represents the accuracy of the models.

(iv) After constructing the proposed models, the map was generated by inputting into the models all the pixels of the study area, along with the values of the conditioning factors. The output value is the flood susceptibility index and lies within a range of 0 to 1. These values were split into five classes (applying the natural breaks method): very low, low, moderate, high, and very high.
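The final classification in step (iv) can be sketched as a lookup against four class boundaries. The break values below are placeholders chosen for illustration; in the study they would be derived from the distribution of the susceptibility index by the natural breaks method, which is not reproduced here.

```python
from bisect import bisect_right

CLASSES = ["very low", "low", "moderate", "high", "very high"]


def classify(index, breaks):
    """Map a susceptibility index in [0, 1] to one of five classes.

    breaks: four ascending class boundaries (hypothetical values in this sketch).
    """
    return CLASSES[bisect_right(breaks, index)]


hypothetical_breaks = [0.2, 0.4, 0.6, 0.8]  # placeholders, not natural-break values
labels = [classify(v, hypothetical_breaks) for v in (0.05, 0.5, 0.8, 0.97)]
print(labels)
```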

### Adaptive Neuro-Fuzzy Inference System

### Gradient-Based Optimizer

The gradient search rule (GSR) helps GBO consider stochastic behavior in the optimization process to ease the exploration process and avoid local optima. In addition, a direction of movement has been added to GSR to improve the algorithm's convergence speed (Ahmadianfar *et al.* 2020). Based on GSR and the direction of movement, the position of the vector is calculated by an equation in which *n* is the number of iterations, *m* is the total number of iterations, *randn* is a normally distributed random number, and *ε* is a small number within the range (0, 0.1).

The solution is calculated by an equation in which *f*_{1} is a uniform random number within the range (−1, 1), and *f*_{2} is a random number from a normal distribution with a mean of 0 and a standard deviation of 1. Pr is a probability value; *u*_{1}, *u*_{2}, and *u*_{3} are random numbers. More details of this algorithm were presented by Ahmadianfar *et al.* (2020). GBO has performed well in a variety of contexts, including economics (Deb *et al.* 2021) and energy (Ismaeel *et al.* 2021; Zhou *et al.* 2021). Prior to this study, this algorithm had not been applied in the field of geography.

### Chaos Game Optimization

CGO is a metaheuristic optimization algorithm proposed by Talatahari & Azizi (2021) and is based on chaos theory. Chaos theory explains how a small change in initial conditions can cause an extreme deviation later. Based on this theory, the current state of a system can be used to determine the future state of that system. In mathematics, the chaos game is a method of constructing fractals using an initial polygon and a randomly selected initial point. This process aims to create a sequence of points by applying the method iteratively (Talatahari & Azizi 2021). The main shape of the fractal is determined by the vertices of a polygon, so the vertices of the polygon must be appropriately defined. Next, an initial point is randomly chosen to start the fractal construction. Subsequent points are determined based on the distance of the original point, and the remaining vertex of the polygon is randomly selected after each iteration. This process is repeated continuously and, with random selection of initial points and vertices of the polygon after each iteration, a fractal is constructed (Talatahari & Azizi 2021). The triangle was constructed:

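$$
X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_i \\ \vdots \\ X_n \end{bmatrix}
= \begin{bmatrix}
x_1^1 & x_1^2 & \cdots & x_1^j & \cdots & x_1^d \\
x_2^1 & x_2^2 & \cdots & x_2^j & \cdots & x_2^d \\
\vdots & \vdots & & \vdots & & \vdots \\
x_i^1 & x_i^2 & \cdots & x_i^j & \cdots & x_i^d \\
\vdots & \vdots & & \vdots & & \vdots \\
x_n^1 & x_n^2 & \cdots & x_n^j & \cdots & x_n^d
\end{bmatrix},
\quad i = 1, 2, \ldots, n; \; j = 1, 2, \ldots, d
$$

$$
x_i^j(0) = x_{i,\min}^j + rand \cdot \left( x_{i,\max}^j - x_{i,\min}^j \right)
$$

where *d* is the problem dimension, *n* is the total number of initializations, and $x_i^j$ is the *j*th design variable at point *i* in the search space. $x_{i,\min}^j$ is the lower bound of the decision variable, while $x_{i,\max}^j$ is the upper bound. *rand* is a random number that ranges from 0 to 1.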

More details of this algorithm can be found in Talatahari & Azizi (2021). CGO has already been successfully applied in fields such as engineering and energy.
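The random initialization described above, in which each design variable is drawn between its lower and upper bounds using a random number in the range 0 to 1, can be sketched as follows. The function name, the bounds, and the number of candidates are hypothetical.

```python
import random


def initialize_candidates(n, lower, upper, rng):
    """Create n candidate points; coordinate j is lower[j] + rand * (upper[j] - lower[j])."""
    return [[lo + rng.random() * (hi - lo) for lo, hi in zip(lower, upper)]
            for _ in range(n)]


rng = random.Random(0)       # fixed seed so the sketch is repeatable
lower = [0.0, -1.0, 10.0]    # hypothetical lower bounds for d = 3 design variables
upper = [1.0, 1.0, 20.0]     # hypothetical upper bounds
candidates = initialize_candidates(5, lower, upper, rng)
for point in candidates:
    print(point)
```

In the hybrid models, each candidate would hold one full set of ANFIS parameters, so *d* would equal the problem size rather than 3.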

### Sine Cosine Algorithm

represents the individual *I* at iteration *t* in the *d*th dimension; is the best position of the individual at iteration *t* in the *d*th dimension; and *r*_{1}, *r*_{2}, and *r*_{3} are random parameters.

More details of this algorithm can be found in Gabis *et al.* (2021). SCA has performed well in contexts such as energy and earth sciences.
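As an illustration, the sketch below implements a commonly published form of the SCA position update, in which each coordinate moves toward or around the best position along a sine or cosine wave. The switch parameter `r4`, the linear decay of `r1`, the test function `sphere`, and all default values are assumptions taken from the standard formulation, not from this study.

```python
import math
import random


def sphere(x):
    """Simple test objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


def sca_minimize(objective, dim, bounds, n_agents=20, n_iter=100, a=2.0, seed=0):
    rng = random.Random(seed)
    lo, hi = bounds
    agents = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_agents)]
    best = min(agents, key=objective)[:]
    best_fit = objective(best)
    for t in range(n_iter):
        r1 = a - t * a / n_iter  # step amplitude, decreasing linearly from a
        for agent in agents:
            for d in range(dim):
                r2 = rng.uniform(0.0, 2.0 * math.pi)  # phase of the wave
                r3 = rng.uniform(0.0, 2.0)            # random weight on the best position
                r4 = rng.random()                     # assumed switch between sine and cosine
                wave = math.sin(r2) if r4 < 0.5 else math.cos(r2)
                agent[d] += r1 * wave * abs(r3 * best[d] - agent[d])
                agent[d] = min(max(agent[d], lo), hi)  # keep the agent inside the bounds
            fit = objective(agent)
            if fit < best_fit:  # the best position is updated as soon as it improves
                best, best_fit = agent[:], fit
    return best, best_fit


sca_best, sca_fit = sca_minimize(sphere, dim=3, bounds=(-5.0, 5.0))
print(sca_best, sca_fit)
```

In a hybrid ANFIS-SCA model, `objective` would be the training RMSE of ANFIS evaluated for the parameter vector held by each agent.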

### Grey Wolf Optimization

GWO is a metaheuristic algorithm proposed by Mirjalili *et al.* (2014). It is based on the predatory nature and social hierarchy of grey wolves. Grey wolves live in packs of 5–12 and are divided into four groups: alpha (*α*, the most dominant), beta (*β*), delta (*δ*), and omega (*ω*, the least dominant; Darabi *et al.* 2021). GWO works by dividing a set of solutions into four groups. The first three results are considered the best, belonging to *α*, *β*, and *δ*, and the fourth result belongs to *ω*. To implement this mechanism, in each iteration, GWO updates based on the three best solutions. The algorithm is divided into three clear processes: siege, hunt, and attack. GWO has a small number of parameters and is easy to implement. In addition, the convergence speed of GWO is faster than that of other layout algorithms (Zou *et al.* 2019). The mathematical model can be represented by the following equations:

$$D = \left| C \cdot X_{victim}(t) - X_{wolf}(t) \right|$$

$$X_{wolf}(t+1) = X_{victim}(t) - A \cdot D$$

where *A*, *C*, and *D* are defined as the coefficient vectors; *X*_{victim}(*t*) and *X*_{wolf}(*t*) are the current positions of the victim and the wolf, respectively, and *t* is the current iteration.

More details of this algorithm can be found in Liu *et al.* (2021). GWO has been successfully applied in many different applications including the prediction of forest fires, landslides, and floods.
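The sketch below shows one common way to implement the GWO update, in which every wolf moves to the average of three positions suggested by the alpha, beta, and delta wolves. The linear decrease of `a` from 2 to 0, the re-ranking of leaders from the current pack at each iteration, the test function `sphere`, and all default values are assumptions made for illustration.

```python
import random


def sphere(x):
    """Simple test objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


def gwo_minimize(objective, dim, bounds, n_wolves=20, n_iter=100, seed=0):
    rng = random.Random(seed)
    lo, hi = bounds
    wolves = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_wolves)]
    for t in range(n_iter):
        # Alpha, beta, and delta are the three best wolves of the current pack.
        leaders = [w[:] for w in sorted(wolves, key=objective)[:3]]
        a = 2.0 * (1.0 - t / n_iter)  # assumed schedule: decreases linearly toward 0
        for wolf in wolves:
            for d in range(dim):
                pulls = []
                for leader in leaders:
                    A = 2.0 * a * rng.random() - a     # coefficient in [-a, a]
                    C = 2.0 * rng.random()             # coefficient in [0, 2]
                    D = abs(C * leader[d] - wolf[d])   # distance to the leader
                    pulls.append(leader[d] - A * D)    # position suggested by this leader
                wolf[d] = min(max(sum(pulls) / 3.0, lo), hi)
    best = min(wolves, key=objective)
    return best[:], objective(best)


gwo_best, gwo_fit = gwo_minimize(sphere, dim=3, bounds=(-5.0, 5.0))
print(gwo_best, gwo_fit)
```

Large values of |A| early in the run push wolves away from the leaders (exploration), while small values late in the run pull them in (exploitation).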

### Differential Evolution

*et al.*2021). This process is represented by the following equation:

*r*_{1}, *r*_{2}, *r*_{3} ∊ {0, 1, 2,…, NP} are the three random unequal integers. *F* is the coefficient and varies from 0 to 2.

*et al.*2021). The process of selection is represented by the following equation:

In the DE algorithm, all members have an equal chance of becoming parents, and the best members get to the next step compared with the parent generation. The DE algorithm has the advantage of not requiring any optimization function to achieve the best solution; however, to tune the assigned problems, the DE algorithm maintains a large number of existing solutions and initiates new solutions in parallel with existing solutions (Al-Sudani *et al.* 2019). More details of this algorithm can be found in Razavi-Termeh *et al.* (2021).
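A minimal sketch of the classic DE scheme (often labeled DE/rand/1/bin) illustrates the mutation and selection steps. It assumes the standard mutation `x_r1 + F * (x_r2 - x_r3)`, binomial crossover with rate `CR`, and greedy selection applied member by member; the crossover step, the test function `sphere`, and all default values are assumptions, not details given in the text.

```python
import random


def sphere(x):
    """Simple test objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


def de_minimize(objective, dim, bounds, pop_size=20, n_gen=100, F=0.8, CR=0.9, seed=0):
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    fits = [objective(p) for p in pop]
    for _ in range(n_gen):
        for i in range(pop_size):
            # Mutation: three distinct members, all different from member i.
            r1, r2, r3 = rng.sample([k for k in range(pop_size) if k != i], 3)
            mutant = [pop[r1][d] + F * (pop[r2][d] - pop[r3][d]) for d in range(dim)]
            # Binomial crossover: at least one coordinate comes from the mutant.
            j_rand = rng.randrange(dim)
            trial = [mutant[d] if (rng.random() < CR or d == j_rand) else pop[i][d]
                     for d in range(dim)]
            trial = [min(max(v, lo), hi) for v in trial]
            # Selection: the trial replaces its parent only if it is at least as good.
            trial_fit = objective(trial)
            if trial_fit <= fits[i]:
                pop[i], fits[i] = trial, trial_fit
    best_i = min(range(pop_size), key=fits.__getitem__)
    return pop[best_i], fits[best_i]


de_best, de_fit = de_minimize(sphere, dim=3, bounds=(-5.0, 5.0))
print(de_best, de_fit)
```

Because selection is greedy, the fitness of each member never gets worse from one generation to the next.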

### Accuracy assessment

*R*

^{2}, accuracy, and

*K*to verify the accuracy of the proposed models. Ours is not the first study to use these indices (Arora

*et al.*2021; Razavi-Termeh

*et al.*2021). ROC is denoted by 1 – specificity on the

*X*-axis and sensibility on the

*Y*-axis. The advantage of this index is that there is no structural assumption that has been made about the shape of the graph, and there is no need to determine the underlying distribution of results for these two groups (Nguyen

*et al.*2022b). In addition, the performance of the model is measured by the area under the ROC curve.

TP constitutes the number of flood pixel classified correctly as flood, while TN is the number of non-flood pixel correctly classified as non-flood. *P* is the number of flood pixel. *N* is the number of non-flood pixel.

*m*is the number of samples,

*Y*

_{i}_{predicted}is the prediction value at sample

*i*, and

*Y*

_{i}_{observed}is the observation value at sample

*i*.

*K* is used to evaluate the level of agreement among categorical components. The main advantage of the kappa index is its simplicity and its applicability to multi-class problems (Sadeghbeygi *et al.* 2021).

FP is the number of non-flood points incorrectly classified as flood, and FN is the number of flood points incorrectly classified as non-flood. *P*_{p} is the number of samples classified correctly as flood or non-flood. *P*_{exp} is the number of expected agreements.
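The confusion-matrix indices can be sketched as below. The kappa function uses the standard Cohen form K = (P_p − P_exp)/(1 − P_exp) with both terms expressed as proportions of the total sample; treating them as proportions is an assumption made here, and the four counts are invented for the example.

```python
def accuracy(tp, tn, fp, fn):
    """Share of samples classified correctly; P = tp + fn and N = tn + fp."""
    return (tp + tn) / (tp + tn + fp + fn)


def cohen_kappa(tp, tn, fp, fn):
    """Cohen's kappa for a binary flood / non-flood classification."""
    total = tp + tn + fp + fn
    p_observed = (tp + tn) / total
    # Agreement expected by chance, from the row and column totals.
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical counts for 100 validation points.
tp, tn, fp, fn = 45, 40, 10, 5
acc_value = accuracy(tp, tn, fp, fn)
kappa_value = cohen_kappa(tp, tn, fp, fn)
```

Here 85% of the points are classified correctly, but half of that agreement would be expected by chance, so kappa is lower than accuracy.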

*R*^{2}, the coefficient of determination, is a popular statistical index that measures how well a regression model fits the data. For a least-squares linear regression, *R*^{2} equals the square of the correlation coefficient between the observed and fitted values of the dependent variable. It expresses the part of the variance of the dependent variable that is explained by the independent variables. The value of *R*^{2} is at most 1 and has no lower bound. If the value of *R*^{2} is negative, the model performs worse than simply predicting the mean of the observations. If the value of *R*^{2} equals 0, the model cannot explain the variability of the response data around its mean. If the value of *R*^{2} equals 1, the model is perfect. This ease of interpretation is the main advantage of the *R*^{2} index compared with other indices such as AUC, RMSE, or MAE (Chicco *et al.* 2021).
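A common way to compute the coefficient of determination is 1 − SS_res/SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the observations from their mean. The sketch below applies that definition to three invented prediction vectors to show the three regimes (perfect, mean-only, and worse than the mean).

```python
def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


observed = [1.0, 0.0, 1.0, 0.0]
perfect = r_squared(observed, [1.0, 0.0, 1.0, 0.0])    # every value matched
mean_only = r_squared(observed, [0.5, 0.5, 0.5, 0.5])  # always predicts the mean
inverted = r_squared(observed, [0.0, 1.0, 0.0, 1.0])   # systematically wrong
```

Under this definition the inverted predictor scores well below zero, which is why a negative value signals a model that is doing worse than the mean of the data.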

## RESULTS

### Feature selection analysis

To identify regions susceptible to flooding with the ANFIS approach, 13 factors influencing the occurrence of flooding were selected, based on a review of the existing literature and an analysis of the sensitivity of each factor. Figure 5 shows the results of the sensitivity analysis using the RF method. Land use (0.25), elevation (0.2), slope (0.14), and distance from road (0.13) were most strongly related to flood occurrence. In other words, when land use is eliminated from the model, the accuracy of the model decreases by 25%, with corresponding decreases for elevation, slope, and distance from road. CTI (0.12), NDBI (0.04), soil (0.037), and TWI (0.03) showed moderate levels of importance. Aspect (0.024), distance from river (0.009), rainfall (0.002), and curvature (0.0018) had the least influence. Remarkably, the importance of the NDVI factor was equal to 0, meaning that it did not affect the prediction of floods in the study area at all. This factor was therefore removed, as redundant data can increase model complexity and may reduce the accuracy of the flood susceptibility model.
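The screening step can be illustrated with the importance values quoted above. The sketch only ranks the reported scores and drops factors whose importance is zero; it does not reproduce the RF computation itself, and the variable names are hypothetical.

```python
# RF importance scores as reported in the text for the 13 conditioning factors.
importance = {
    "land use": 0.25,
    "elevation": 0.2,
    "slope": 0.14,
    "distance from road": 0.13,
    "CTI": 0.12,
    "NDBI": 0.04,
    "soil": 0.037,
    "TWI": 0.03,
    "aspect": 0.024,
    "distance from river": 0.009,
    "rainfall": 0.002,
    "curvature": 0.0018,
    "NDVI": 0.0,
}

# Rank factors from most to least important.
ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)

# Keep only factors that contribute to the prediction.
retained = [name for name, score in ranked if score > 0]
dropped = [name for name, score in ranked if score == 0]
```

With these values, 12 factors are retained for modeling and only NDVI is dropped.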

### Model performance

The accuracy of the five models on the training and validation data was assessed using RMSE, MAE, AUC, ACC, *K*, and *R*^{2} (Table 2). In training, the proposed ANFIS-GBO model outperformed the other models in terms of RMSE, MAE, and *R*^{2}, followed by ANFIS-SCA, ANFIS-GWO, ANFIS-DE, and ANFIS-CGO. In validation, ANFIS-GBO was again the most accurate, followed in the same order by ANFIS-SCA, ANFIS-GWO, ANFIS-DE, and ANFIS-CGO. Overall, ANFIS-GBO performed best and ANFIS-CGO poorest.

| Type | Model | Training RMSE | Training MAE | Training AUC | Training ACC | Training *K* | Training *R*^{2} | Validating RMSE | Validating MAE | Validating AUC | Validating ACC | Validating *K* | Validating *R*^{2} |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proposed model | ANFIS-GBO | 0.14 | 0.09 | 0.994 | 0.98 | 0.97 | 0.91 | 0.23 | 0.16 | 0.99 | 0.96 | 0.86 | 0.63 |
| Proposed model | ANFIS-CGO | 0.2 | 0.15 | 0.992 | 0.98 | 0.96 | 0.82 | 0.26 | 0.2 | 0.98 | 0.95 | 0.83 | 0.54 |
| Proposed model | ANFIS-SCA | 0.17 | 0.12 | 0.996 | 0.98 | 0.97 | 0.88 | 0.25 | 0.17 | 0.99 | 0.95 | 0.85 | 0.58 |
| Reference model | ANFIS-GWO | 0.17 | 0.14 | 0.994 | 0.98 | 0.97 | 0.87 | 0.25 | 0.17 | 0.987 | 0.95 | 0.85 | 0.57 |
| Reference model | ANFIS-DE | 0.18 | 0.14 | 0.993 | 0.98 | 0.97 | 0.86 | 0.26 | 0.19 | 0.984 | 0.95 | 0.84 | 0.56 |

### Flood susceptibility maps

| Model | Very low (%) | Low (%) | Moderate (%) | High (%) | Very high (%) |
|---|---|---|---|---|---|
| ANFIS-GBO | 28.49 | 28.55 | 7.9 | 19.71 | 15.35 |
| ANFIS-CGO | 13.52 | 27.43 | 20.83 | 19.43 | 18.78 |
| ANFIS-SCA | 24.64 | 28.58 | 14.55 | 17.54 | 14.68 |
| ANFIS-GWO | 19.56 | 25.95 | 18.07 | 19.5 | 16.92 |
| ANFIS-DE | 18.79 | 24.14 | 18.59 | 18.65 | 19.82 |
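The combined share of the study area in the high and very high classes follows directly from the table above. The sketch below simply adds the last two columns for each model; the dictionary layout is an illustrative choice.

```python
# Percent of study area per class, in the order:
# (very low, low, moderate, high, very high), copied from the table above.
shares = {
    "ANFIS-GBO": (28.49, 28.55, 7.9, 19.71, 15.35),
    "ANFIS-CGO": (13.52, 27.43, 20.83, 19.43, 18.78),
    "ANFIS-SCA": (24.64, 28.58, 14.55, 17.54, 14.68),
    "ANFIS-GWO": (19.56, 25.95, 18.07, 19.5, 16.92),
    "ANFIS-DE": (18.79, 24.14, 18.59, 18.65, 19.82),
}

# Sum the "high" and "very high" columns for each model.
high_or_above = {
    model: round(row[3] + row[4], 2) for model, row in shares.items()
}
lowest = min(high_or_above.values())
highest = max(high_or_above.values())
```

The totals range from 32.22% (ANFIS-SCA) to 38.47% (ANFIS-DE), which matches the 32–38% range reported for the high and very high susceptibility zones.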

## DISCUSSION

Accurate estimation of the level of flood susceptibility in a given area is vital to protect inhabitants and develop effective mitigation strategies. Flood susceptibility mapping has become central to global flood risk management efforts (Pham *et al.* 2021b; Shahabi *et al.* 2021; Towfiqul Islam *et al.* 2021). Powerful new techniques continue to be developed, improving the accuracy of results and better supporting those responsible for flood risk management. In this study, several models – namely ANFIS, ANFIS-GBO, ANFIS-CGO, ANFIS-SCA, ANFIS-GWO, and ANFIS-DE – were developed to prepare flood susceptibility maps for a region that is often affected by flooding but does not have in place adequately powerful or effective measures to mitigate against flood damage.

In this study, 1,843 historic flood locations were gathered from satellite images and field missions. Previous studies have used several different techniques to detect flood locations. For example, indices such as the Enhanced Vegetation Index (EVI), NDWI, and the Normalized Difference Surface Water Index (NDSWI) can be derived from optical remote sensing data such as Landsat imagery (Tong *et al.* 2018; Du *et al.* 2021). However, optical data are affected by cloud cover and have limited spatial and temporal resolution. Several researchers have therefore applied radar images such as Sentinel-1 to detect flood locations (Martinis *et al.* 2018). Radar sensors operate at long wavelengths, so they are largely unaffected by atmospheric conditions and can acquire data day and night. Moreover, water and non-water areas can be separated in these images by thresholding. For these reasons, we used radar data to collect flood locations in this study.
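The thresholding idea can be sketched as follows. Open water typically returns little radar energy to the sensor, so pixels whose backscatter falls below a cut-off are labelled as water. The cut-off of −18 dB and the small grid of backscatter values are hypothetical; in practice the threshold would be determined from the image itself.

```python
WATER_THRESHOLD_DB = -18.0  # hypothetical cut-off, not a value from the study


def water_mask(backscatter_db, threshold_db=WATER_THRESHOLD_DB):
    """Label each pixel 1 (water) if its backscatter is below the threshold, else 0."""
    return [
        [1 if value < threshold_db else 0 for value in row]
        for row in backscatter_db
    ]


# Invented 2 x 3 patch of radar backscatter values in decibels.
scene = [
    [-8.5, -21.0, -19.4],
    [-7.9, -17.2, -23.1],
]
mask = water_mask(scene)
water_pixels = sum(sum(row) for row in mask)
```

Pixels flagged in this way can then be checked against field observations before they are added to a flood inventory.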

This study utilized RF to assess the importance of each of the 13 factors used to create the flood susceptibility model. The importance depends on the study area's geo-environmental, climatic, and hydrological characteristics, on its anthropogenic activities, and on the methodology utilized. In addition, the flood susceptibility model used in this study was data-driven, so the importance of each factor depends on data distribution (Andaryani *et al.* 2021; Luu *et al.* 2021). Importance was calculated by assigning weights to each factor. The higher the weight, the more important the factor, and vice versa. If the weight equals 0, the factor does not influence the probability of flood occurrence. Shahabi *et al.* (2021) pointed out that slope, distance to river, drainage density, and TWI were the most important of the 10 factors used to construct a flood map of Iran's Haraz watershed, using the ORAE technique. Pham *et al.* (2021a) reported that soil type, distance from river, river density, and geology were most influential on flood occurrence in Vietnam's Nghe An province, using the RF technique. Luu *et al.* (2021) used OneR to assess the importance of conditioning factors when building flood susceptibility models for the Quang Binh province in Vietnam; they pointed out that land use, geology, slope, and rainfall were the most important factors.

Our RF analysis showed that the most important factors for flooding in Ha Tinh province were, in descending order, land use (0.25), elevation (0.2), slope (0.14), and distance from road (0.13). These results are consistent with previous studies. Land use ranks highly because flooding can only occur in the region after the soil has been saturated. The clearing of forests in the mountains and the construction of infrastructure that allows floodplains to be developed are the main factors aggravating floods. The reduced regulation and retention capacity of lakes, which is directly related to land use, is also a crucial factor in the intensification of floods. The results show that all residential and agricultural areas are located in places with high and very high flood susceptibility. Residential areas are more vulnerable to floods because they have impermeable surfaces, and agricultural areas are located on the low-lying coastal plains. Elevation and slope ranked second and third: these physical characteristics strongly influence flooding because they determine the capacity for surface runoff. The higher the elevation and the steeper the slope, the faster the flow velocity and the faster the water accumulates in the delta area (Abedi *et al.* 2021). Towfiqul Islam *et al.* (2021) used machine learning to assess the flood susceptibility in the Teesta River Basin of Bangladesh. The authors emphasized that land use, distance to road, elevation, and slope are the most important factors contributing to the probability of flood occurrence. El-Haddad *et al.* (2021) reported that distance to river, land use, lithology, and slope were the most influential factors on the likelihood of flood occurrence in Wadi Qena Basin of Egypt. Arora *et al.* (2021) showed that distance to river, curvature, slope, river density, and land use were the most important factors for the flood susceptibility model in the Ganga River Basin of India. Pham *et al.* (2021c) used 16 conditioning factors to establish flood susceptibility maps for the Quang Nam province of Vietnam. Of the factors used, elevation, rainfall, slope, and land cover/land use had the biggest effect on the likelihood of flood.

Finally, there is a question linked to the importance of precipitation in the flood susceptibility model. Why is it less influential than other factors such as land use, elevation, slope, and distance from road, while still being a flood trigger factor? Based on different types of floods, an appropriate conditioning factor is chosen for the flood susceptibility model. Moreover, for each type of flood, the importance of each conditioning factor is different. Towfiqul Islam *et al.* (2021) and Saha *et al.* (2021) eliminated precipitation factors when assessing flood susceptibility in the Teesta River Basin in northern Bangladesh. Luu *et al.* (2021) reported that rainfall ranked fourth out of the ten factors used to establish flood susceptibility maps in Quang Binh province of Vietnam. In the study of Nguyen *et al.* (2021a), rainfall ranked sixth out of the 13 factors used to map flood susceptibility in Quang Ngai province.

Ha Tinh province contains both fluvial and coastal floodplains. Coastal flooding has less of an impact because the province is only weakly influenced by its micro-tidal regime and by storm surges, the two main factors that cause coastal flooding. River flooding in the area is caused by torrential rains in the mountains, and the intensity of the rains tends to decrease by the time they reach the plain. Moreover, several previous studies have pointed out that flooding can only occur after soil saturation (Luu *et al.* 2021). That is why floods usually occur after heavy rains or during rainy seasons. National Highway 1A also acts as a dam, holding back water that would otherwise drain toward the sea. This is one of the important causes of flooding in Ha Tinh province. Therefore, the importance of rainfall is less than that of land use, elevation, slope, or distance from road. This has been substantiated by several previous studies. The importance of land use, elevation, slope, and distance from road can be related to the diversity of characteristics of these factors. NDVI, by contrast, had no measurable influence on the model (RF importance = 0). As mentioned above, the model applied in this study is data-driven; the distribution of the data strongly influences the results. The lack of influence of NDVI may therefore be due to the large area covered by vegetation.

The final result confirmed our initial hypothesis that the models proposed in this study would successfully build flood susceptibility maps for Ha Tinh and that they would outperform the reference models.

One of the advantages of metaheuristic algorithms is their stochastic search, which helps them avoid becoming trapped in local optima and converge on a near-optimal solution (Lu *et al.* 2021). The objective is not to obtain the best solution, but a near-optimal one within a reasonable computation time. The search combines exploration (discovering promising regions of the search space) and exploitation (searching for high-quality solutions within those regions). Improving the performance of a metaheuristic algorithm requires a balance between exploration and exploitation (Bui *et al.* 2020). The GBO and SCA algorithms have the advantage of balancing these two processes (Abualigah & Diabat 2021; Deb *et al.* 2021). That is why the two models ANFIS-GBO and ANFIS-SCA performed better than the others.
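The balance between exploration and exploitation can be illustrated with a minimal sketch of the Sine Cosine Algorithm update rule in its commonly published form: the step amplitude `r1` shrinks linearly over the iterations, so early moves range widely and late moves stay close to the best solution found so far. The `sphere` objective, the bounds, and all parameter values are assumptions for the example, and the code is not the configuration used in this study.

```python
import math
import random


def sphere(x):
    """Toy objective (assumed for illustration): sum of squares, minimum at 0."""
    return sum(v * v for v in x)


def sine_cosine_search(objective, dim=3, agents=20, iterations=200,
                       a=2.0, bound=5.0, seed=0):
    """Minimise `objective` with a basic Sine Cosine Algorithm loop."""
    rng = random.Random(seed)
    positions = [[rng.uniform(-bound, bound) for _ in range(dim)]
                 for _ in range(agents)]
    best = min(positions, key=objective)[:]
    best_value = objective(best)
    history = [best_value]
    for t in range(iterations):
        # Amplitude decays from a to 0: exploration first, exploitation later.
        r1 = a - t * (a / iterations)
        for pos in positions:
            for k in range(dim):
                r2 = rng.uniform(0.0, 2.0 * math.pi)
                r3 = rng.uniform(0.0, 2.0)
                wave = math.sin(r2) if rng.random() < 0.5 else math.cos(r2)
                pos[k] += r1 * wave * abs(r3 * best[k] - pos[k])
                pos[k] = max(-bound, min(bound, pos[k]))  # keep inside bounds
            value = objective(pos)
            if value < best_value:
                best, best_value = pos[:], value
        history.append(best_value)
    return best, best_value, history


best, best_value, history = sine_cosine_search(sphere)
```

The `history` list records the best objective value after each iteration; because the best solution is only replaced by a strictly better one, the sequence never increases.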

Besides the simple structure, fast search speed, and high search precision, a big advantage of the GWO algorithm is that there are fewer setting parameters and it is easier to combine with other algorithms (Arora & Banyal 2021). The ANFIS-GWO model came third in terms of performance.

ANFIS-DE was fourth. Besides its simple structure and ease of use, DE has the advantage of avoiding the local optimization problem (Rout *et al.* 2013).

The ANFIS-CGO model was less accurate than all the other models (RMSE = 0.26, MAE = 0.2, ACC = 0.95, *K* = 0.83, *R*^{2} = 0.54). CGO faces key limitations in the process of exploring the search space, which leads to local optimization problems (Talatahari & Azizi 2021).

The ANFIS model and associated hybrids have been applied in various studies to predict the likelihood of flood occurrence around the world. Costache *et al.* (2020) used ANFIS and hybrids thereof to analyze flood susceptibility in the Trotuș River Basin in Romania. The precision of the results varied from 0.85 to 0.94 for the value of AUC. Termeh *et al.* (2018) evaluated flood susceptibility in the Jahrom Township in Fars province in Iran using ANFIS and three optimization algorithms (Ant Colony Optimization, GA, and PSO). In this case, the value of AUC ranged from 0.91 to 0.94. Another study (Hong *et al.* 2018b) used ANFIS and two optimization algorithms (GA and DE) to predict flood susceptibility in Hengfeng County in China. In this study, the maximum AUC value was 0.87. Vafakhah *et al.* (2020) analyzed flood susceptibility in the Gilan Province of Iran using ANFIS. The accuracy score was 63%. We can conclude that the accuracy of the models proposed in this study was consistent with the accuracy of previous significant studies.

The flood susceptibility levels in Ha Tinh province are also consistent with the flood susceptibility maps obtained using the AHP method (Nguyen *et al.* 2022a). They are also corroborated by previous studies showing the extent to which the coastal plains in the Central region of Vietnam are often subject to large floods (Luu *et al.* 2021; Pham *et al.* 2021e; Nguyen *et al.* 2022b). Therefore, the flood susceptibility maps constructed in this study can be seen to be a suitable alternative solution for local organizations responsible for assessing flood susceptibility.

Flood management is now a critical task. As the climate changes, sea levels rise, and floods become more common and more destructive. In Ha Tinh and elsewhere, informed land-use planners must limit new construction or the concentration of populations in areas with high and very high probability of flooding. The accuracy of the models in this study surpassed the reference models in previous studies. Therefore, the results can support planners to develop necessary strategies to diminish the damage. Although this study applies to Vietnam, the results may be applied to other regions around the world that are similarly affected by flooding.

## CONCLUSION

In the context of climate change, the severity and violence of floods are increasing in Asia, and especially in Vietnam. Therefore, building a flood susceptibility model with high accuracy is one of the most important tools available to policy makers who are responsible for formulating strategies to reduce impact. This study proposed five hybrid models – ANFIS-GBO, ANFIS-CGO, ANFIS-SCA, ANFIS-GWO, and ANFIS-DE – to determine areas with probability of occurrence of flooding for Ha Tinh province in Vietnam.

ANFIS and its hybrids proved to be powerful models for determining areas with a probability of flooding; however, their accuracy depends on the structure of the input data, so data preprocessing must be done properly.

This is the first time these five models have been used to build such maps, which represents the main novelty of this study. The models are characterized by high precision (AUC > 0.95; RMSE ≤ 0.26) and complement the current literature on flood susceptibility. Additionally, the proposed models can be used to assess flood susceptibility levels in any region of the world, particularly where data are limited.

Ha Tinh province was divided according to five levels of flood susceptibility. Between 32 and 38% of the study area was located in the high and very high flood susceptibility zone. These regions are mainly concentrated along the river and on the eastern coastal plain, where there is a high density of population and infrastructure.

Once potential flood-prone areas have been identified, local government agencies can take appropriate action to reduce the extreme negative impacts of flooding. The results of this study can also be applied to the prediction and assessment of other types of disasters.

This study still has general limitations related to model input data, and reducing these uncertainties will further improve the accuracy of the predictions. More conditioning factors should be considered, and a feature selection method should then be used to remove the non-predictive factors. Furthermore, only a limited level of detail is available, particularly regarding flood depth and velocity. In addition, the flood inventory used in this study used binary values (0, 1) to represent flood and non-flood points, and did not include flood frequency. The flood locations were therefore given equal weight when used to predict flood susceptibility. In future research, we will record the flood frequency at each flood point and put that data to use.

Flooding is changing due to climate change and land-use changes; therefore, the application of machine learning to assess the effects of these phenomena on flooding is extremely useful for decision-makers and others tasked with building effective strategies to reduce future damage to property and loss of life.

## FUNDING STATEMENT

No funding was received for this study.

## AUTHOR CONTRIBUTIONS STATEMENT

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by H.D.N. The first draft of the manuscript was written by H.D.N. All authors read and approved the final manuscript.

## DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.

## CONFLICT OF INTEREST

The authors declare there is no conflict.

## REFERENCES

**37**,

**2021**(1), 6455592.

**13**(6), 2353–2385.

**67**(7), 1065–1083.