Abstract
The largest recorded flood loss in the study area occurred in 2013. This study examines resampling methods (cross-validation (CV), bootstrap, and random subsampling) for improving the performance of seven basic machine learning algorithms: Generalized Linear Model, Support Vector Machine, Random Forest (RF), Boosted Regression Tree, Multivariate Adaptive Regression Splines, Mixture Discriminant Analysis, and Flexible Discriminant Analysis. It also identifies the factors that cause flooding and the strongest correlation between variables. The models are evaluated using Area Under the Curve (AUC), Correlation (COR), True Skill Statistic (TSS), and deviance. The methodology was applied in Kendari City, an urban area that has faced destructive floods. The evaluation shows that CV-RF performs well in predicting flood susceptibility in this area, with AUC = 0.99, COR = 0.97, TSS = 0.90, and deviance = 0.05. A total of 89.44 km², equivalent to 32.54% of the total area, is flood-prone, and it is dominated by lowland morphology. Among the 17 flood-causing parameters, the vegetation density index and the Terrain Roughness Index (TRI) have the strongest influence across the 28 models. The strongest correlation, 0.77, occurs between the TRI and the Sediment Transport Index (STI), which indicates that flooding in this area is strongly influenced by roughness elements.
HIGHLIGHTS
The CV-RF algorithm has good performance in increasing accuracy and predicting flood vulnerability in urban areas.
The Normalized Difference Vegetation Index (NDVI) and TRI factors have the strongest influence on flood events in all 28 models.
There is a strong correlation between TRI and STI factors. These results were not reviewed in previous studies.
INTRODUCTION
Floods are among the deadliest and most destructive natural disasters in the world, causing more than 5,000 deaths annually (Kalantari et al. 2019; El-Rawy et al. 2022). Floods can cause many casualties as well as ecosystem and socio-economic damage (De Silva & Kawasaki 2020). Floods can occur naturally through prolonged rainfall, heavy rain, and snowmelt, or through non-natural causes such as increased degradation due to population growth, deforestation, and urbanization (Chan et al. 2018; Şen 2018; Wang et al. 2019b; Van & Schwarz 2020). Over the past two decades, floods have affected ±109 million people worldwide (Hirabayashi et al. 2013; Alfieri et al. 2017).
Although some findings suggest that floods have a positive impact on an affected area, such as refilling aquifers, restoring wetlands, and improving soil fertility, these benefits are not proportional to the damage floods cause (Nachappa et al. 2020). Many countries have been affected by floods, and Indonesia is no exception. Based on data from the National Disaster Management Agency (BNPB), Indonesia experienced 997, 775, 1,276, and 555 flood events in 2017, 2018, 2019, and 2020 (as of 6 June 2020), respectively (BNPB 2019). The floods that resulted in the most missing or dead victims and the most infrastructure damage occurred in East Java, where a total of 51 people went missing or died and 2,666 houses were moderately damaged. Meanwhile, in Southeast Sulawesi, 6,826 houses suffered minor damage (Islamy et al. 2022). Kendari City is one of several regencies in the Southeast Sulawesi region. Its flood-prone zones are generally scattered along the banks of the Wanggu River, where the topography is relatively flat (Gandri et al. 2019; Aldiansyah et al. 2021). Kendari City experienced its largest flood losses in 2013, when economic losses reached billions of rupiah (BNPB Daerah 2013).
Currently, flood researchers and authorities are constantly looking for new ways to reduce flood hazards in risky areas, especially high-population areas. Efforts are underway to identify flood-prone areas by developing accurate flood vulnerability models and maps. Flood vulnerability maps are a proactive tool used for land use and management planning as a development guide for flood protection and recovery (Tuyet et al. 2019). Therefore, the study of flood events and the mapping of flood vulnerability zones are becoming increasingly important for policymaking because the intensity of urban flooding continues to increase with a repeating pattern.
Machine Learning (ML) is one approach that can be used to map flood vulnerability areas (Bui et al. 2020; Chen et al. 2020). This approach is quite good at solving non-linear problems, but its accuracy is sensitive to the quality of sample points (Tehrany et al. 2013) based on predefined rules (Costache et al. 2020; Ahmed et al. 2021; Pham et al. 2021). The use of ML is increasing because of its ability to efficiently extract information from past events without having to understand the physical processes behind them (Wang et al. 2015; Bera et al. 2022). The most widely used ML methods for modeling flood events are the Generalized Linear Model (GLM), Boosted Regression Tree (BRT) (Davoudi Moghaddam et al. 2019; Nhu et al. 2020; Pourghasemi et al. 2020; Mosavi et al. 2022), Support Vector Machine (SVM) (Fotovatikhah et al. 2018; Mosavi et al. 2018; Choubin et al. 2019a; Pandey et al. 2021), Random Forest (RF) (Chen et al. 2020; Talukdar et al. 2020; Mosavi et al. 2022), Multivariate Adaptive Regression Splines (MARS) (Bui et al. 2019a; Mosavi et al. 2022), Mixture Discriminant Analysis (MDA) (Choubin et al. 2019b), and Flexible Discriminant Analysis (FDA) (Mosavi et al. 2022). These models have the merit of accepting a wide range of non-normalized factors on the ratio, nominal, ordinal, and interval scales (Ferentinou & Chalkias 2013; Tien Bui et al. 2018; Zhao et al. 2018; Bui et al. 2019b; Wang et al. 2019b). They are used because they are simple to understand and implement, so they have received a lot of attention from researchers and policymakers.
Experts have tried to diversify the models and input variables (Chen et al. 2017; Pourghasemi & Rahmati 2018). Explorations so far have addressed pixel size (Chen et al. 2020) and the amount of training data (Goetz et al. 2015; Fu et al. 2020). However, the aspects that are most often neglected are pre-processing and resampling methods, which matter when the classes are imbalanced. Class imbalance refers to a condition in which one or more classes are represented by only a small number of samples (Galar et al. 2011). In this case, a standard classification algorithm tends to predict the classes with large amounts of data, and classes with few cases are treated as noise (Cieslak & Chawla 2008). Small samples, overlapping classes, and difficulties with class separation can exacerbate class imbalance. Three approaches have been proposed: manipulating the algorithms, manipulating the data, or using cost-sensitive learning frameworks (Batista et al. 2004). Computational statistical methods such as Cross-Validation (CV), Bootstrap (Bt), and Random Subsampling (RS) can be used to overcome this problem (Hastie et al. 2009; Dodangeh et al. 2020). Hybridization with soft computing is recommended because it can improve the quality of predictions (Fotovatikhah et al. 2018; Mosavi et al. 2018; Fu et al. 2020). These methods assume that the observations are drawn independently from the same distribution. Bt models the data by repeatedly resampling the original dataset at random with replacement. RS, a Monte Carlo method (Picard & Cook 1984), randomly divides the dataset into training and testing groups in each replication (Dieterle 2003). CV divides the dataset into equally sized subsets that serve in turn as training and testing data and evaluates the model over the resulting iterations (Dieterle 2003).
This study aims to predict flood susceptibility in the urban area of Kendari City using seven ML algorithms: GLM, SVM, RF, BRT, MARS, MDA, and FDA. The seven algorithms are combined with CV, Bt, and RS to improve model performance, and the results are compared with the Basic Model (BM). This study also evaluates the dominant factors, the correlation among the parameters, and the correlation between the parameters and flood occurrence.
METHODS
Study area
Factors causing flood: (a) aspect; (b) curvature; (c) elevation; (d) flow accumulation; (e) flow direction; (f) geology; (g) slope; (h) LULC; (i) NDVI; (j) rainfall; (k) distance from rivers; (l) soil; (m) SPI; (n) STI; (o) TRI; (p) TWI; and (q) wind.
Flood occurrence data
Information on the 2013 flood events used in this study was obtained from the Regional Disaster Management Agency (BPBD) of Kendari City (BPBD 2020). The 2013 flood was the worst flood on record in Kendari City; it caused one fatality and displaced 2,300 people (BNPB 2013). The flood data were validated through interviews conducted during June–August with agencies and communities affected by the flood. Data on non-flood events were collected through interviews with communities in areas that were not affected by floods. In total, 23 flood locations and 28 non-flood locations are used. Flood event information is extracted from latitude and longitude values with the help of ArcMap 10.4.1, and each point is assumed to represent one flood event. The data are stored in csv format and serve as the dependent variable in this study.
Factors causing flood
This study uses 17 variables that cause flooding (Tehrany et al. 2019; Kalantari et al. 2019; Panahi et al. 2021): aspect, curvature, elevation, flow accumulation, flow direction, geology, slope, Land Use Land Cover (LULC), Normalized Difference Vegetation Index (NDVI), rainfall, distance from rivers, soil, Stream Power Index (SPI), Sediment Transport Index (STI), Terrain Roughness Index (TRI), Topographic Wetness Index (TWI), and wind (Figure 2). Digital Elevation Model (DEM) data are sourced from the Shuttle Radar Topography Mission (SRTM) of 11–12 February 2000, which was a one-time mission, so the data are not updated. DEM data are used to extract the TWI, SPI, TRI, STI, slope, elevation, curvature, aspect, flow direction, and flow accumulation factors. Soil and geology data were updated in 2016 and developed by BBSDLP. These data are considered valid for use in Indonesia to date. NDVI data were obtained from Landsat-8 imagery taken from the Google Earth Engine (GEE) dataset. NDVI values were extracted from the red and Near-Infrared (NIR) channels of Landsat-8 imagery on the GEE platform through the substitution technique to derive information on vegetation density (Lillesand & Kiefer 1994). Proximity data are obtained using the Euclidean Distance method in ArcMap 10.4.1. River network data are taken from the Geospatial Information Agency (BIG) dataset at a scale of 1:50,000 for 2021. Wind data are obtained from the Global Wind Atlas and represent the average wind speed for 2021. Data were obtained from various open data sources and are shown in Table 1.
Table 1. Data sources
Factor | Data source | Year | Source location | Resolution (m) |
---|---|---|---|---|
1. NDVI | Landsat-8 from GEE | 2021 | https://code.earthengine.google.com/ | 30 |
2. LULC | Ministry of Environment and Forestry of the Republic of Indonesia (KLHK) | 2020 | https://sigap.menlhk.go.id/sigap/ | |
3. Rainfall | Indonesian Agency for Meteorological, Climatological and Geophysics (BMKG) | 2020 | https://gis.bmkg.go.id/arcgis/home/ | |
4. DEM, TWI, SPI, TRI, STI, Slope, Curvature, Aspect, Flow Direction, Flow Accumulation | Shuttle Radar Topography Mission (SRTM) | 2000 | https://earthexplorer.usgs.gov/ | |
5. Wind | Global Wind Atlas | 2021 | https://globalwindatlas.info/en | |
6. River Network | Geospatial Information Agency (BIG) | 2021 | https://www.big.go.id/ | |
7. Geology | Southeast Sulawesi Agricultural Technology Study Center, Agricultural Research and Development Agency, Ministry of Agriculture | 2016 | https://www.pemda-balitbangsultra.info/ | |
8. Soil | Center for Research and Development of Agricultural Land Resources (BBSDLP) | 2016 | https://www.pemda-balitbangsultra.info/ | |
The importance and relevance of the most commonly used flood conditioning factors in flood susceptibility mapping are covered in this section. Aspect is important in flood studies (Choubin et al. 2019b). It describes the direction the slope faces and therefore the direction in which flood water moves across the slope. Yates et al. (2000) showed that the hydrological response unit is highly influenced by the slope aspect. Rahmati et al. (2016) demonstrated that soil moisture content and local climatic conditions are also influenced by the slope aspect. Curvature is another important flood conditioning factor, which affects heterogeneity and hyporheic exchange (Cardenas et al. 2004). Curvature has a great impact on the acceleration of water flowing over the surface (Lee et al. 2017; Tehrany et al. 2019). Curvature is usually classified into three classes, namely concave, flat, and convex. Convex surfaces are more susceptible to runoff and highly susceptible to flooding (Il'inskii & Yakimov 1987). Elevation describes the height of the ground surface. Areas with lower elevations are usually more prone to flooding (Li et al. 2012; Cea & Bladé 2015). Therefore, it is very important to understand the topographical forms and derived features that are responsible for the occurrence of flooding in an area (Woodrow et al. 2016).
Flow accumulation computes the accumulated flow as the summed weight of all cells that flow into each downslope cell in the output raster (Zhang et al. 2017). The direction of the steepest slope, which can also be considered the maximum drop, is often used to calculate the direction of flow from each cell. The flow direction map is created from a digital surface model. The final map shows eight possible output directions, which correspond to the eight neighboring cells into which flow could pass. This method is commonly referred to as the eight-direction (D8) flow model and incorporates an approach proposed by Jenson & Domingue (1988). Geology describes the level of rock permeability in an area (Hammami et al. 2019). Lee et al. (2012) indicated that different geology units have different susceptibilities to flooding. Geology also affects the channel shape of the temporal flood (Reneau 2000; Heitmuller et al. 2015). The slope is directly related to flow velocity, soil type structure, and drainage because it regulates surface water flow and controls runoff (Zzaman et al. 2021). The runoff volume and velocity increase with increasing slope gradient (Khosravi et al. 2016; Tien Bui et al. 2018). As the slope gradient increases, the runoff infiltration rate decreases and a large amount of runoff enters the drainage network (Tehrany et al. 2015).
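The D8 rule described above can be illustrated with a minimal Python sketch. It is not the tool used in the study (the factors were derived in GIS software); the direction codes 1 to 128, the toy elevation grid, and the function name `d8_flow_direction` are assumptions made for illustration only.

```python
import math

# Offsets (row, col) to the eight neighbours and a direction code for each.
# The power-of-two codes are an assumed encoding, used here for illustration.
D8_NEIGHBOURS = [
    (0, 1, 1),      # east
    (1, 1, 2),      # south-east
    (1, 0, 4),      # south
    (1, -1, 8),     # south-west
    (0, -1, 16),    # west
    (-1, -1, 32),   # north-west
    (-1, 0, 64),    # north
    (-1, 1, 128),   # north-east
]


def d8_flow_direction(dem, cell_size=1.0):
    """Return, for every cell, the code of the neighbour with the maximum drop.

    Drop is the elevation difference divided by the distance to the neighbour
    (diagonal neighbours are sqrt(2) cells away). Cells with no lower
    neighbour (pits or flats) get code 0.
    """
    rows, cols = len(dem), len(dem[0])
    directions = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            best_drop, best_code = 0.0, 0
            for dr, dc, code in D8_NEIGHBOURS:
                nr, nc = r + dr, c + dc
                if not (0 <= nr < rows and 0 <= nc < cols):
                    continue  # neighbour lies outside the raster
                distance = cell_size * math.hypot(dr, dc)
                drop = (dem[r][c] - dem[nr][nc]) / distance
                if drop > best_drop:
                    best_drop, best_code = drop, code
            directions[r][c] = best_code
    return directions


# Hypothetical 3 x 3 elevation grid that drains toward the lower-right corner.
toy_dem = [
    [9.0, 8.0, 7.0],
    [8.0, 6.0, 5.0],
    [7.0, 5.0, 1.0],
]
directions = d8_flow_direction(toy_dem)
```

In this toy grid the centre cell drains to its south-east neighbour, and the lower-right cell, which has no lower neighbour, is marked as a pit.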
The LULC pattern of an area has a significant effect on the transport of surface runoff and sediment (Zhang et al. 2010), especially on built-up land (Costache & Bui 2020), while forest areas have good permeability (Yin et al. 2017). The NDVI is an index that can be used to evaluate vegetation cover and its impact on flooding. It is a simple graphic indicator used in remote sensing analyses to assess vegetation attributes in a region (Sajedi-Hosseini et al. 2018). There is an inverse relationship between vegetation density and flooding (Tehrany et al. 2013; Kumar & Acharya 2016). Rainfall is a key influencing factor in flood susceptibility mapping, as widely noted in the literature (Tehrany et al. 2015; Bui et al. 2017). Intense rainfall over a short period can cause flooding in an area (Ali et al. 2020). Rainfall data were extracted for the cumulative rainfall period in 2021. Soil type directly affects the drainage process through inherent soil characteristics such as texture, permeability level, and structure (Mojaddadi et al. 2017). Predick & Turner (2008) emphasized that proximity to the river is an important factor for flooding. Tien Bui et al. (2018) and Darabi et al. (2019) also observed that a great number of floods occurred in areas adjacent to the river. Flood events are more likely as the distance from the river decreases (Shahabi et al. 2021).
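As a brief illustration of the NDVI, the standard formula NDVI = (NIR − Red) / (NIR + Red) can be sketched in Python as follows. The study computed the index on the GEE platform; the reflectance values and the zero-denominator handling below are assumptions for illustration.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel.

    nir and red are reflectances of the near-infrared and red bands
    (bands 5 and 4 on Landsat-8). Values range from -1 to 1, with higher
    values indicating denser vegetation.
    """
    total = nir + red
    if total == 0:
        return 0.0  # assumption: a pixel with no signal is reported as 0
    return (nir - red) / total


# Hypothetical reflectances: dense vegetation versus built-up land.
vegetated = ndvi(nir=0.5, red=0.1)
built_up = ndvi(nir=0.25, red=0.2)
```

The vegetated pixel yields a much higher index than the built-up pixel, which is consistent with the inverse relationship between vegetation density and flooding noted above.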



Machine learning models
Each parameter is converted to raster data in TIF format and imported into R along with the flood event data, which are analyzed by the maximum entropy method using the SDM package (Naimi & Araújo 2016). The flood model with resampling using CV, Bt, and RS was run with two iterations. A flood vulnerability map was made from each of the three resampling techniques. The results are compared with the unprocessed basic model (BM), which uses no resampling. The resampling is done in R. The number of samples was not fixed in advance because of fluctuations in the gamma value and the decrease in standard error, so the sample data were divided directly into two datasets. The resulting model has a continuous vulnerability value. To obtain a flood vulnerability map, each model is classified into several classes. Quantile-based classification was chosen because it is the most suitable for the data in this study, as it places an equal number of pixels (area) in each class. The model is classified into five classes, namely very low, low, moderate, high, and very high, as was done in previous studies (Shabani et al. 2018; Pham et al. 2021).
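The quantile-based classification can be sketched as follows. This is an illustrative Python version, not the procedure run in the study; the susceptibility values are hypothetical and the function names are assumptions.

```python
LABELS = ["very low", "low", "moderate", "high", "very high"]


def quantile_breaks(values, n_classes=5):
    """Upper bounds of the first n_classes - 1 classes.

    Breaks are placed so that each class receives (as nearly as possible)
    the same number of pixels.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // n_classes - 1] for k in range(1, n_classes)]


def classify(value, breaks, labels=LABELS):
    """Assign a susceptibility value to the first class whose bound covers it."""
    for upper, label in zip(breaks, labels):
        if value <= upper:
            return label
    return labels[-1]


# Hypothetical continuous susceptibility values for ten pixels.
susceptibility = [0.02, 0.10, 0.15, 0.30, 0.35, 0.50, 0.55, 0.70, 0.85, 0.95]
breaks = quantile_breaks(susceptibility)
classes = [classify(v, breaks) for v in susceptibility]
```

With ten pixels and five classes, each class receives exactly two pixels, which is the equal-area property that motivates the quantile scheme.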
Generalized linear model
GLM is a parametric model and an extension of the linear model. The relationship between explanatory factors and responses is measured by regression parameters. The GLM excels when the observations are not normally distributed and when other regression models yield unsatisfactory results. A more detailed description of GLM can be found in McCullagh & Nelder (1989). The GLM is explored in RStudio Desktop using the ‘glm’ package.
Support vector machine
SVM is a non-linear model (Cortes & Vapnik 1995) that applies the principles of Structural Risk Minimization (SRM) and Empirical Risk Minimization (ERM), so it can minimize both training and generalization errors (Wu et al. 2014). Therefore, the SVM can provide an optimal network structure with better generalization to decrease the complexities associated with flood prediction. This model is applied to flood prediction by using different kernel functions (Chen et al. 2017; Pourghasemi & Rahmati 2018). Linear (LN), Polynomial (PL), Radial Basis Function (RBF), and sigmoid (SIG) are the four kernel types used in SVM. LN is regarded as a special case of RBF, while SIG behaves like RBF for certain parameters (Song et al. 2011). When RBF is used in the processing, LN is no longer required. In terms of accuracy, RBF generates more reliable and solid outcomes than SIG because of its higher fitness in interpolation. The RBF kernel was used in this study. SVM has several types: C-SVM, ν-SVM, one-class SVM, ε-SVR, and ν-SVR. Among those, the ε-SVR type was used in this research (Chang & Lin 2011). The values 1,000 and 0.1 were selected for the parameters C and ε, respectively, based on trial and error. The SVM model is explored in RStudio Desktop using the ‘kernlab’ package.
Random forest
The RF model uses novel methods of data combination to construct and combine numerous trees for prediction. As the prediction factor is chosen, the RF model grows a Classification And Regression Tree (CART)-like tree (Naghibi & Pourghasemi 2015). This tree, however, differs from standard CART in several respects. The key RF parameters are the number of trees and the number of predicting factors with which each decision tree is grown to its maximum size and then left unpruned. In this research, the values of these two parameters were selected by trial and error, giving 1,000 and 4, respectively. RF modeling is carried out with the ‘RF’ package in RStudio Desktop.
Boosted regression tree
BRT is a non-parametric statistical method that strengthens the classification results by combining several weak classifiers into a strong classifier. The CART algorithm is applied to generate a decision tree. The tree is pruned to select the best-performing result through a CV procedure (Breiman et al. 1984). Further explanation can be found in Breiman et al. (1984). This model is implemented with the ‘gbm’ package in the software.
Multivariate adaptive regression splines
MARS was introduced by Friedman (1991) as a flexible non-parametric regression approach that can handle high-dimensional data. The shape of the function is unknown and there is a patterned relationship between the dependent and independent variables that can be overcome by this model (Friedman 1991). MARS implementation shows that this ML approach can help to construct good predictive models for complex engineering datasets (Heddam & Kisi 2018). Since the related works on flooding susceptibility prediction demonstrate that the classification function needed to discriminate flood and non-flood areas can be highly complex, MARS can be potentially useful for generalizing such classification functions from the GIS database. This model is run on an RStudio Desktop device using the ‘earth’ package.
Mixture discriminant analysis
Flexible discriminant analysis
FDA is a classification and pattern recognition model consisting of a combination of linear regression models. This model uses optimal finding and canonical correlation analysis of response factors to distinguish the best object groups (Hastie et al. 1994). This model is implemented using the ‘mda’ package in the software.
Resampling approach
Cross-validation
ML modeling uses datasets that are divided into training and test groups. In this study, 70% of the data are used for training and 30% for testing. In each iteration of the CV process, part of the data is used for training and part for testing. The test data of one iteration are returned to the CV process and redefined as training data in the next iteration. The dataset is used to test the model (Dieterle 2003). CV can be performed in several ways, for example using holdout, leave-one-out, leave-P-out, and k-fold, which is the method used in this study. The training dataset is randomly split into k subsets (folds) of equal size, and each CV iteration uses k–1 folds for training and the remaining fold for validation. In this study, k was set to 5.
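The k-fold procedure can be sketched as follows. This is a minimal Python illustration rather than the R implementation used in the study; the sample count of 51 corresponds to the 23 flood and 28 non-flood locations, while the seed and function name are assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    The indices are shuffled once and dealt into k folds of nearly equal
    size; each fold serves as the test set exactly once while the other
    k - 1 folds form the training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 23 flood + 28 non-flood locations = 51 samples, with k = 5 as in the study.
splits = list(k_fold_indices(51, k=5))
```

Because 51 is not divisible by 5, one fold holds 11 samples and the other four hold 10; every sample is tested exactly once across the five iterations.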
Bootstrap
Theoretically, Bt holds that inferences about a population can be drawn from a sample of it and, in turn, that inferences about the sample can be drawn from resampled units. In other words, if a large number of samples of n units are obtained by simple random sampling with replacement, the Bt statistic can be specified by combining the estimated statistics (Fox 2002). Supposing the original sample is yn = (y1, y2, …, yn), resampled units y*n = (y1*, y2*, …, yn*) can be drawn a large number of times (B model runs), which gives the Monte Carlo Bt approximation of the mean and variance of the given parameter.
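A minimal Python sketch of the Monte Carlo bootstrap approximation is given below. The sample values, the number of resamples B = 1,000, and the function name are assumptions chosen for illustration and do not come from the study.

```python
import random
import statistics


def bootstrap_statistic(sample, statistic, n_resamples=1000, seed=0):
    """Monte Carlo bootstrap estimate of a statistic.

    Draws n_resamples resamples of the same size as the sample, with
    replacement, evaluates the statistic on each, and returns the mean and
    variance of those estimates.
    """
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    return statistics.mean(estimates), statistics.variance(estimates)


# Hypothetical observations of some parameter.
observations = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
boot_mean, boot_variance = bootstrap_statistic(observations, statistics.mean)
```

The bootstrap mean stays close to the sample mean, and the bootstrap variance approximates the sampling variability of that mean.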
Random subsampling
RS is a Monte Carlo technique (Picard & Cook 1984). It randomly separates the data into training and testing groups in each of B replications (Dieterle 2003). RS samples without replacement, and the total sample pool across the B replications will have a wide range of correlations, which distinguishes it from Bt.
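Random subsampling can be sketched in the same style. In this Python illustration, the 70/30 split follows the proportion stated for the training and test groups, while the number of replications B = 10, the seed, and the function name are assumptions.

```python
import random


def random_subsampling(n_samples, n_replications=10, train_fraction=0.7, seed=0):
    """Return a list of (train, test) index splits.

    Each replication reshuffles all indices without replacement and cuts
    them into a training part and a testing part. Unlike k-fold CV, the test
    sets of different replications may overlap.
    """
    rng = random.Random(seed)
    n_train = round(n_samples * train_fraction)
    splits = []
    for _ in range(n_replications):
        shuffled = rng.sample(range(n_samples), n_samples)
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits


# 51 samples (23 flood + 28 non-flood locations), split 70/30 ten times.
rs_splits = random_subsampling(51)
```

Within one replication no sample appears in both groups, which is the without-replacement property that separates RS from Bt.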
Multicollinearity






Accuracy assessment



RESULTS AND DISCUSSION
Flood prediction
All models show that the highest flood vulnerability is in the middle of the area, which is downstream of the Wanggu watershed. This area has a flat lowland morphology, although most of the soil types are alluvial with a high level of permeability. However, the area has developed into a built-up area that is dense with a layer of concrete or cement so that it interferes with the absorption capacity of the soil. The model also highlights the Wanggu River as the main river with a high level of flooding compared to other tributaries. Sedimentation in this area is also the cause of flooding. Sedimentation in Kendari Bay continues to increase. Kendari Bay sedimentation occurs due to mining activities upstream or in watersheds. According to Yang et al. (2018), Darabi et al. (2019), Choubin et al. (2019b), and Dodangeh et al. (2020), areas around rivers are more prone to flooding.
Evaluation and comparison
The AUC values of the BM models range from 0.90 to 0.99. RF and SVM have the highest AUC values, followed by BRT, MARS, MDA, GLM, and FDA (Figure 4, Table 2). Overall, the BM-RF model outperformed the other BM models because it also performed well on the COR and TSS values. In line with Avand & Moradi (2022), RF is a more accurate model for predicting the level of flood vulnerability. RF has the advantages of relatively fast computation, no overfitting as the number of trees increases, and better accuracy than other models (Breiman 2001). The FDA model has the worst performance of all the models. This is because the FDA is working with a small population. The FDA requires the difference between the within-class variance and the between-class variance to be sufficiently large. If this assumption is not satisfied, the power of the FDA weakens (Yan & Dai 2011).
Table 2. Model accuracy
Algorithm | BM i | BM ii | BM iii | BM iv | CV i | CV ii | CV iii | CV iv | Bt i | Bt ii | Bt iii | Bt iv | RS i | RS ii | RS iii | RS iv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1. GLM | 0.93 | 0.37 | 0.78 | 0.14 | 0.92 | 0.32 | 0.68 | 0.15 | 0.92 | 0.39 | 0.64 | 0.20 | 0.92 | 0.41 | 0.65 | 0.14 |
2. SVM | 0.99 | 0.77 | 0.88 | 0.11 | 0.99 | 0.81 | 0.89 | 0.08 | 0.99 | 0.84 | 0.89 | 0.08 | 0.99 | 0.79 | 0.88 | 0.11 |
3. RF | 0.99 | 0.96 | 0.89 | 0.05 | 0.99 | 0.97 | 0.90 | 0.05 | 0.99 | 0.96 | 0.89 | 0.05 | 0.99 | 0.96 | 0.89 | 0.05 |
4. BRT | 0.98 | 0.70 | 0.88 | 0.12 | 0.99 | 0.76 | 0.89 | 0.11 | 0.99 | 0.80 | 0.90 | 0.14 | 0.99 | 0.76 | 0.88 | 0.11 |
5. MARS | 0.98 | 0.66 | 0.85 | 0.09 | 0.98 | 0.68 | 0.80 | 0.08 | 0.98 | 0.76 | 0.81 | 0.10 | 0.97 | 0.67 | 0.76 | 0.09 |
6. MDA | 0.94 | 0.52 | 0.73 | 0.14 | 0.94 | 0.48 | 0.68 | 0.15 | 0.91 | 0.52 | 0.53 | 0.21 | 0.95 | 0.60 | 0.75 | 0.12 |
7. FDA | 0.90 | 0.27 | 0.74 | 0.17 | 0.90 | 0.26 | 0.67 | 0.16 | 0.90 | 0.35 | 0.65 | 0.25 | 0.91 | 0.32 | 0.62 | 0.17 |
Note: i, AUC; ii, COR; iii, TSS; iv, deviance.
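For readers unfamiliar with two of the reported metrics, the following Python sketch shows one common way to compute AUC and TSS from observed labels and predicted probabilities. It uses the standard definitions (AUC as the probability that a flood point is ranked above a non-flood point, and TSS = sensitivity + specificity − 1); the 0.5 threshold and the toy data are assumptions, and the study's own computation in R may differ in detail.

```python
def auc(observed, predicted):
    """Area under the ROC curve via pairwise ranking.

    Counts the share of (flood, non-flood) pairs in which the flood point
    receives the higher predicted probability; ties count as half.
    """
    positives = [p for o, p in zip(observed, predicted) if o == 1]
    negatives = [p for o, p in zip(observed, predicted) if o == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def true_skill_statistic(observed, predicted, threshold=0.5):
    """TSS = sensitivity + specificity - 1 at an assumed probability threshold."""
    tp = fn = tn = fp = 0
    for obs, prob in zip(observed, predicted):
        flagged = prob >= threshold
        if obs == 1 and flagged:
            tp += 1
        elif obs == 1:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


# Hypothetical test set: 1 = flood location, 0 = non-flood location.
observed = [1, 1, 1, 0, 0, 0]
predicted = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
toy_auc = auc(observed, predicted)
toy_tss = true_skill_statistic(observed, predicted)
```

In this toy case eight of the nine flood and non-flood pairs are ranked correctly, and one flood and one non-flood point fall on the wrong side of the threshold.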
Similar to the BM models, the AUC curves show that both the RF and SVM models can predict flood and non-flood events with high accuracy. In the training phase, the AUC was 0.99 for both RF and SVM. The RF models trained with the different resampling procedures (that is, excluding BM) perform better than the other models (Figures 5–7). The CV-GLM, Bt-GLM, Bt-MDA, RS-GLM, and RS-MARS models have lower AUC values than their BM counterparts. The CV-RF model has similar AUC, TSS, and deviance values to the Bt-RF and RS-RF models, but with COR = 0.97 it performs better than Bt-RF and RS-RF in predicting flood events. Based on the COR value, CV-RF therefore gives the most satisfactory results, followed by Bt-RF, RS-RF, and BM. This is because the RF algorithm searches for the best feature among a random subset of features, rather than for the most important feature overall, when splitting a ‘node’. This introduces variation among the trees and helps to produce the best model. In this study, RF considers only a subset of the features when splitting a ‘node’. Trees can be made more random by using random thresholds for each attribute rather than searching for the best threshold (Achour & Pourghasemi 2020). However, when the effect of the resampling techniques is judged by AUC, the RS-MDA and RS-FDA models show the most satisfactory performance. This suggests that the RS technique could be used to improve the performance of MDA and FDA in predicting flood events. Judging from the TSS, the CV and Bt models generally perform better than the RS models. The RS-MDA model is an exception: it gives better results than the CV and Bt versions of MDA, which again points to RS as a way to improve the MDA model. The deviance of the RF model is satisfactory under all four methods.
The Bt-BRT model performs better than BM-BRT, with TSS = 0.90. Dodangeh et al. (2020) reported a similar finding. The Bt-GLM, Bt-MDA, and Bt-FDA models have the worst scores compared with the other models and methods, so Bt should not be applied to these algorithms. However, it should be noted that an AUC close to one does not guarantee highly accurate results, because the value can be biased by other factors.
Flood susceptibility predictions by the CV-RF model and the locations of built-up/settlement and main river.
It should be considered that each of the above models has coefficients and parameters that will affect its performance. In the research of Wang et al. (2019a), the SVM model will have different results if the hybrid kernel selection effect is explored. Kernel-type selection in an SVM model can be considered a vital step because it directly controls effective training and classification accuracy (Yao et al. 2008). In this study, the parameters were determined based on trial and error.
Importance of variable
The relationship between variables and variables with flood events.
Overall, NDVI and TRI have the highest importance across all models. Low NDVI values are related to flood events caused by land cover changes (Aldiansyah et al. 2021; Atefi & Miura 2022). The lowest TRI values in this study are found in a river basin that often floods. Similarly, Tehrany et al. (2019) found that a low TRI value is always associated with high flood intensity. The shift from low to high urbanization patterns, driven by changes in landforms that follow the linear pattern of main roads, is the main driver of increasing flood risk in urban areas (Waghwala & Agnihotri 2019; Pal et al. 2022).
Multicollinearity
The DEM-derived features play an important role in this study because most of them are positively correlated with other variables (Figure 13). TWI is positively correlated with slope (0.37), flow accumulation (0.28), and STI (0.24). TWI quantifies topographic effects in hydrological processes (Lee et al. 2017). Water accumulates at certain locations and tends to move downward under gravity. TWI describes the distribution of moisture across an area; where a higher TWI corresponds to a higher soil water content, flooding also increases (Guzzetti et al. 2006; Meinhardt et al. 2015), and this is a factor that affects flooding in this study. Geology is also positively correlated with soil type (0.39). Soil absorption is determined by the strength and characteristics of the soil material, such as soil permeability and pore water pressure (Bui et al. 2016). TRI is positively correlated with STI (0.77), flow direction (0.57), distance from rivers (0.48), slope (0.40), and LULC (0.35). The TRI has a strong relationship with the morphological aspects of flooded areas (Werner et al. 2005). The TRI shows a uniform elevation distribution, and flooding is more common in areas with low TRI in this study. Flood areas are always associated with roughness elements such as surface variations and irregularities and vegetation types such as trees and shrubs, and the direction of the slope toward the river affects how quickly runoff is received through slope gaps (Casas et al. 2010; Tehrany et al. 2014). STI is a flood-causing factor that determines the timing of sediment movement driven by water movement (Mojaddadi et al. 2017). In this study, STI is positively correlated with flow direction (0.61), slope (0.34), and distance from the river (0.27). The STI describes the overall runoff plot.
High runoff areas have higher sediment transport and are less prone to flooding. SPI describes the amount of moisture in the soil and the potential for flooding to flow down the study area. The lower the SPI, the greater the effect of flooding because it determines the area that can accumulate flow. Steep slopes can significantly reduce the amount of soil absorption and accelerate surface runoff due to precipitation. Areas with low slopes are more likely to be inundated, which happened at the research location. Consequently, the slope gradient plays an important role in regulating surface runoff, infiltration, and water retention.
NDVI has a strong relationship with rainfall (0.54) and flow accumulation (0.25). Kendari City had high rainfall during 2021, and the flood-affected area has low vegetation density and low-lying morphology. Rainfall is the main cause of flooding in areas where rain is the only source of water: the higher the rainfall, the higher the chance of flooding. Rainfall in the study area is concentrated in the highlands, which can accelerate the flow of water to the lowlands. Most areas of Kendari City often experience flooding that is influenced by high rainfall intensity, poor drainage, a lack of infiltration or biospheric wells, and coastal reclamation (Idati et al. 2020; Sinaga 2022). Coastal reclamation changes the topology and elevation, so that when heavy rain coincides with a high tide, seawater also enters the rainwater channels. A low NDVI value is related to flood events caused by changes in land cover: if rainfall intensity increases and is not matched by infiltration capacity, puddles become even more likely to form (Rahmati et al. 2016; Atefi & Miura 2022). NDVI affects flooding because the flow of water is reduced and slowed according to the NDVI value. Vegetation cover allows water to penetrate deeper into the soil, resulting in reduced water volume and a lower probability of flooding. LULC has a positive correlation with the distance from the river (0.43). Several areas along the main river at the research location have land cover that has been degraded into residential areas and buildings. The type of LULC has a significant influence on hydrological elements such as infiltration, evapotranspiration, and runoff generation (Rahmati et al. 2016).
Areas with high flood susceptibility levels are distributed in siliciclastic sedimentary rock types (TRJm), whose porosity and permeability are determined by the size, shape, and composition of the soil type (Pettijohn 1975). Most of the areas are in very low flood susceptibility for these soil types, but areas that have been inventoried by other surface constituents are in high vulnerability areas. This is in line with the altitude factor, which is positively correlated to geology in this study with a value of 0.30. The alluvium is always associated with floodplains in river areas (Miller & Juilleret 2020), which is the dominant area for flood vulnerability in this study with a proven correlation value of 0.28. Geological fractures are usually manifested in relief, distribution of river networks, and rock erosion patterns that can lead to the gradual development of floodplains (Heitmuller et al. 2015).
The aspect is strongly influenced by the slope (0.39) and the acceleration of water flow (0.30) in this study. Areas with low elevation receive runoff from the slopes in a short period of time, which causes flooding in those areas (Tehrany et al. 2014). Aspect is very important because slopes facing different directions receive different intensities of sunlight, which indirectly drives evapotranspiration, hydrological processes, soil type, and the amount of vegetation (Pourghasemi & Rahmati 2018). Floods are usually preceded by heavy and prolonged rainfall. Wind has a positive correlation with STI (0.50), TRI (0.50), distance from the river (0.34), and TWI (0.26), relationships that had not previously been explored in this study area. The wind speed in this area is distributed by the lowland morphology. Rain formation requires unstable atmospheric conditions, high water vapor content, intensive lifting of air masses, and low wind speeds.
When two factors have a high Pearson correlation, the simplest remedy is to delete one of them from the dataset and repeat the analysis (Dai & Lee 2001). While Pearson's correlation coefficient provides some useful insights, a few results were unexpected. TRI and STI are both slope-governed functions, which explains the higher correlation between these factors. However, the following points should be considered when interpreting correlations:
A correlation of zero indicates absolutely no linear relationship between those two variables whatsoever. Pearson quantifies statistical association in terms of a straight line (Xu & Deng 2017). Pearson, however, does not eliminate the potential for some non-linear relationship between those two variables. Hence, two variables that may be highly associated with one another may produce a Pearson's Correlation coefficient value of zero. Thus, a Pearson score of zero, between two variables, neither means no association, nor that one cannot predict one of the pair from the other.
Correlation can be extremely reactive to the influence of outliers; one extraordinary observation may have a significant impact on a particular correlation. A quick survey of the scatterplot facilitates the detection of outliers (de Winter et al. 2016).
Correlations are not always causally related (Price 2000), nor are the relationships fully reciprocal in that the same association does not apply in both directions. For example, a causal relation will exist between two events where the first causes the occurrence of the second. In simple terms, the first event becomes known as the cause, while the second event becomes the effect. Alternatively, correlations between two variables are not necessarily based on causation.
Hence, the Pearson method has its limitations. In this study, several correlations have been impacted by outliers, unequal variance, non-normality, and nonlinearity (de Winter et al. 2016). The method is most successfully applied where both variables in the pair express normal distribution (Mukaka 2012).
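Two of the limitations listed above can be demonstrated with a short sketch. The values are invented for illustration and are not taken from the study data, and the variable names are hypothetical. The first case shows a perfectly determined but non-linear relationship that yields a Pearson coefficient of zero; the second shows a single outlier turning an uncorrelated sample into a strongly correlated one.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Case 1: y is fully determined by x (y = x^2), yet the linear
# association is zero because the relationship is symmetric about x = 0.
x_sym = [-3, -2, -1, 0, 1, 2, 3]
y_quadratic = [v * v for v in x_sym]
r_nonlinear = pearson(x_sym, y_quadratic)

# Case 2: five points with no linear trend, then one extreme observation.
x_clean = [1, 2, 3, 4, 5]
y_clean = [2, 1, 3, 1, 2]
r_clean = pearson(x_clean, y_clean)
r_with_outlier = pearson(x_clean + [20], y_clean + [20])
```

Here `r_nonlinear` and `r_clean` are both zero up to floating-point error, while `r_with_outlier` is strongly positive, so a scatterplot check before interpreting any single coefficient remains advisable.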
We have discussed the factors affecting flood susceptibility and the model results. Because ML models are robust, they can be effective for flood susceptibility mapping in other areas when the environmental aspects and model input parameters are the same, except where hydrogeological and topographical differences have to be considered. In addition, the proposed method can be used efficiently to obtain flood susceptibility maps for data-scarce areas. The methodology can also be applied to other study areas and to probability analyses of other natural hazards, such as river flooding, landslides, and land subsidence. The requirements are an inventory map showing the history and location of previous hazard points, a set of relevant conditioning factors (thematic maps), and a spatial analyst to process the resampling technique. Finally, our results can also be extended to mapping vulnerability to other flood-related disasters (water quality, siltation, shoreline change, and loss of vegetation). However, the susceptibility map produced in this study only represents the flood-susceptible zone in an urban area like Kendari City. Furthermore, the current study will improve understanding of the evaluation process and consistency in decision-making for flood preparedness activities (e.g., construction of houses and buildings) in the future.
CONCLUSIONS
This study explores various algorithms to find an accurate and reliable algorithm for detecting flood-prone areas. Seven ML algorithms, namely GLM, SVM, RF, BRT, MARS, MDA, and FDA, were combined with resampling methods. These results cannot be generalized with certainty to the prediction of flood risk in other parts of the world because the behavior patterns of flood-influencing variables vary with region. Since ML is based on finding these patterns, the outputs of modeling should be expected to vary with region. CV-RF performs better than the other models among the 28 evaluated. CV in this study estimates out-of-sample performance more accurately because each observation is used for both training and testing, which reduces overfitting and helps achieve a good level of prediction. Of the 17 parameters, two have the greatest influence on flood events in all models in Kendari City, namely NDVI and TRI. The importance of other factors differed among the models. The largest correlation occurs between TRI and STI, meaning that the incidence of flooding in this area is strongly influenced by roughness elements such as surface variations and irregularities, types of vegetation such as trees and shrubs, and the direction of the slope toward the river, which can accelerate the process of water accumulation. A total of 89.44 km2, equivalent to 32.54% of the total area, is flood-prone, and this area is dominated by lowland morphology.
This research also reveals the importance of housing and building development locations. Around 377 location points or the equivalent of 83.04% are in flooded areas. Future research should include evaluation and be consistent with strong spatial and regulatory plans. Future regional development must follow the rules and regulations that have been formulated in the planning stage. This aims to realize the preparation of the spatial policy properly.
ACKNOWLEDGEMENTS
We thank the University of Indonesia for supporting this research, and the Kendari City Government and the community for allowing and facilitating the collection of field data.
AUTHOR CONTRIBUTIONS
This paper was written in collaboration among all authors. S.A. and F.W. designed this study, F.W. helped improve its progression and clarity, S.A. and F.W. wrote this paper, and F.W. helped in revising the paper.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.