The largest recorded flood loss in the study area occurred in 2013. This study examines resampling methods, namely cross-validation (CV), bootstrap, and random subsampling, to improve the performance of seven basic machine learning algorithms: Generalized Linear Model, Support Vector Machine, Random Forest (RF), Boosted Regression Tree, Multivariate Adaptive Regression Splines, Mixture Discriminant Analysis, and Flexible Discriminant Analysis. It also identifies the factors that cause flooding and the strongest correlation between variables. The models are evaluated using Area Under the Curve (AUC), Correlation (COR), True Skill Statistic (TSS), and Deviance. The methodology was applied in Kendari City, an urban area that has experienced destructive floods. The evaluation results show that CV-RF performs well in predicting flood susceptibility in this area, with AUC = 0.99, COR = 0.97, TSS = 0.90, and deviance = 0.05. A total of 89.44 km2, equivalent to 32.54% of the total area, is flood-prone and is dominated by lowland morphology. Among the 17 flood-causing parameters, the vegetation density index and the Terrain Roughness Index (TRI) have the strongest influence across the 28 models. The strongest correlation occurs between the TRI and the Sediment Transport Index (STI) (0.77), which suggests that flooding in this area is strongly influenced by terrain roughness.

  • The CV-RF algorithm performs well in improving accuracy and predicting flood vulnerability in urban areas.

  • The Normalized Difference Vegetation Index (NDVI) and TRI factors have the strongest influence on flood events in all 28 models.

  • There is a strong correlation between the TRI and STI factors, a result that has not been reported in previous studies.

Graphical Abstract


Floods are among the deadliest and most destructive natural disasters in the world, causing more than 5,000 deaths annually (Kalantari et al. 2019; El-Rawy et al. 2022). They can cause many casualties, damage ecosystems, and bring socio-economic losses (De Silva & Kawasaki 2020). Floods can arise from natural causes such as prolonged rainfall, heavy rain, and snowmelt, or from non-natural causes such as increased land degradation due to population growth, deforestation, and urbanization (Chan et al. 2018; Şen 2018; Wang et al. 2019b; Van & Schwarz 2020). Over the past two decades, floods have affected approximately 109 million people worldwide (Hirabayashi et al. 2013; Alfieri et al. 2017).

Although some findings suggest that floods have a positive impact on an affected area, such as recharging aquifers, restoring wetlands, and improving soil fertility, these benefits are small compared with the damage floods cause (Nachappa et al. 2020). Several countries in the world have been affected by floods, and Indonesia is no exception. Based on data from the National Disaster Management Agency (BNPB), Indonesia recorded 997, 775, 1,276, and 555 flood events in 2017, 2018, 2019, and 2020 (as of 6 June 2020), respectively (BNPB 2019). The floods that resulted in missing or dead victims and damage to infrastructure mostly occurred in East Java. A total of 51 people were missing or dead and 2,666 houses were moderately damaged. Meanwhile, in Southeast Sulawesi, 6,826 houses suffered minor damage (Islamy et al. 2022). Kendari City is one of several regencies in the Southeast Sulawesi region. Its flood-prone zones are generally scattered along the banks of the Wanggu River, where the topography is relatively flat (Gandri et al. 2019; Aldiansyah et al. 2021). Kendari City experienced its largest flood losses in 2013, when economic losses reached billions of rupiah (BNPB Daerah 2013).

Currently, flood researchers and authorities are constantly looking for new ways to reduce flood hazards in risky areas, especially high-population areas. Efforts are underway to identify flood-prone areas by developing accurate flood vulnerability models and maps. Flood vulnerability maps are a proactive tool used for land use and management planning as a development guide for flood protection and recovery (Tuyet et al. 2019). Therefore, the study of flood events and the mapping of flood vulnerability zones are becoming increasingly important for policymaking because the intensity of urban flooding continues to increase with a repeating pattern.

Machine Learning (ML) is one approach that can be used to map flood vulnerability areas (Bui et al. 2020; Chen et al. 2020). This approach handles non-linear problems well, but its accuracy is sensitive to the quality of sample points (Tehrany et al. 2013) based on predefined rules (Costache et al. 2020; Ahmed et al. 2021; Pham et al. 2021). The use of ML is increasing because of its ability to learn efficiently from past events without having to understand the physical processes behind them (Wang et al. 2015; Bera et al. 2022). The most widely used ML methods for modeling flood events are the Generalized Linear Model (GLM), Boosted Regression Tree (BRT) (Davoudi Moghaddam et al. 2019; Nhu et al. 2020; Pourghasemi et al. 2020; Mosavi et al. 2022), Support Vector Machine (SVM) (Fotovatikhah et al. 2018; Mosavi et al. 2018; Choubin et al. 2019a; Pandey et al. 2021), Random Forest (RF) (Chen et al. 2020; Talukdar et al. 2020; Mosavi et al. 2022), Multivariate Adaptive Regression Splines (MARS) (Bui et al. 2019a; Mosavi et al. 2022), Mixture Discriminant Analysis (MDA) (Choubin et al. 2019b), and Flexible Discriminant Analysis (FDA) (Mosavi et al. 2022). These models have the merit of accepting a wide range of non-normalized factors on ratio, nominal, ordinal, and interval scales (Ferentinou & Chalkias 2013; Tien Bui et al. 2018; Zhao et al. 2018; Bui et al. 2019b; Wang et al. 2019b). They are also simple to understand and implement, so they have received a lot of attention from researchers and policymakers.

Experts have tried to diversify the models and input variables (Chen et al. 2017; Pourghasemi & Rahmati 2018). Explorations so far have addressed pixel size (Chen et al. 2020) and the amount of training data (Goetz et al. 2015; Fu et al. 2020). However, the aspects that are mostly neglected are pre-processing and resampling methods, particularly under class imbalance. Class imbalance refers to a condition in which one or more classes are represented by only a small number of samples (Galar et al. 2011). In this case, standard classification algorithms tend to predict the classes with large amounts of data, while classes with few cases are treated as noise (Cieslak & Chawla 2008). Small samples, overlapping classes, and difficulties with class separation can exacerbate class imbalance. Three approaches have been proposed to address it: manipulating the algorithms, manipulating the data, or using cost-sensitive learning frameworks (Batista et al. 2004). Computational statistical methods such as Cross-Validation (CV), Bootstrap (Bt), and Random Subsampling (RS) can be used to overcome this problem (Hastie et al. 2009; Dodangeh et al. 2020). Hybridization with soft computing techniques is recommended because it can improve the quality of predictions (Fotovatikhah et al. 2018; Mosavi et al. 2018; Fu et al. 2020). These methods assume that the observations are drawn independently from populations with the same distribution. Bt models the data by repeatedly resampling the original dataset at random with replacement. RS, a Monte Carlo method (Picard & Cook 1984), divides the dataset at random into training and testing groups over a number of replications (Dieterle 2003). CV divides the dataset into subsets of equal size and alternates them between training and testing over a number of iterations (Dieterle 2003).

This study aims to predict flood susceptibility in the urban areas of Kendari City using seven ML algorithms: GLM, SVM, RF, BRT, MARS, MDA, and FDA. The seven algorithms are combined with CV, Bt, and RS to improve model performance, and the results are compared with the Basic Model (BM). This study also evaluates the dominant factors, the correlation among the parameters, and the correlation between the parameters and the occurrence of flooding.

Study area

Kendari City is located at 03°54′40″S – 04°5′05″S and 122°26′33″E – 122°39′14″E, with a total area of 274.91 km2 based on the UTM Zone 51 S projection. Kendari City is an area with potential for economic growth. The research area is bisected by the Wanggu River, which flows northward from the southwest of the Watu Re Mountains in South Konawe Regency and empties into Kendari Bay, passing through densely populated settlements. The study area had 20.42 rainy days and 55.88 hours of sunshine duration in 2021 (BPS 2022). The climate of the watershed is a tropical monsoon climate (Köppen – Am) and the annual average temperature is 18 °C (Peel et al. 2007). In this area, settlements occupy flat to gently sloping land in a linear pattern. The north and south sides are hilly areas with sloping to steep morphology, which may accelerate the accumulation of water toward the city center. An overview of the study area can be seen in Figure 1.
Figure 1. Study area.

Figure 2. Factors causing flood: (a) aspect; (b) curvature; (c) elevation; (d) flow accumulation; (e) flow direction; (f) geology; (g) slope; (h) LULC; (i) NDVI; (j) rainfall; (k) distance from rivers; (l) soil; (m) SPI; (n) STI; (o) TRI; (p) TWI; and (q) wind.

Flood occurrence data

Information on the 2013 flood events used in this study was obtained from the Regional Disaster Management Agency (BPBD) of Kendari City (BPBD 2020). The 2013 flood was the worst flood incident in Kendari City, causing one fatality and displacing 2,300 people (BNPB 2013). Flood incident data were validated through interviews conducted during June–August with agencies and communities affected by the flood disaster. Data on non-flood locations were collected through interviews with the community in areas that were not affected by floods. In total, 23 flood locations and 28 non-flood locations were used. Flood event locations were extracted as latitude and longitude values with the help of ArcMap 10.4.1, and each point is assumed to represent one flood event. The data are stored in csv format and serve as the dependent variable in this study.

Factors causing flood

This study uses 17 variables that cause flooding (Tehrany et al. 2019; Kalantari et al. 2019; Panahi et al. 2021): aspect, curvature, elevation, flow accumulation, flow direction, geology, slope, Land Use Land Cover (LULC), Normalized Difference Vegetation Index (NDVI), rainfall, distance from rivers, soil, Stream Power Index (SPI), Sediment Transport Index (STI), Terrain Roughness Index (TRI), Topographic Wetness Index (TWI), and wind (Figure 2). Digital Elevation Model (DEM) data are sourced from the Shuttle Radar Topography Mission (SRTM) of 11–12 February 2000, which was a one-time mission, so the data are not updated. DEM data are used to extract the TWI, SPI, TRI, STI, slope, elevation, curvature, aspect, flow direction, and flow accumulation factors. Soil and geology data were updated in 2016 and developed by BBSDLP. These data are considered valid for use in Indonesia to date. NDVI data were obtained from Landsat-8 imagery taken from the Google Earth Engine (GEE) dataset. NDVI values were extracted from the red and Near-Infrared (NIR) channels of Landsat-8 imagery on the GEE platform through the substitution technique to derive information on vegetation density (Lillesand & Kiefer 1994). Proximity data were obtained using the Euclidean Distance method in ArcMap 10.4.1. River network data are taken from the Geospatial Information Agency (BIG) dataset at a scale of 1:50,000 for 2021. Wind data are obtained from the Global Wind Atlas and represent the average wind speed for 2021. Data were obtained from various open data sources and are shown in Table 1.

Table 1

Data source

| Factor | Data source | Year | Source location | Resolution (m) |
|---|---|---|---|---|
| 1. NDVI | Landsat-8 from GEE | 2021 | https://code.earthengine.google.com/ | 30 |
| 2. LULC | Ministry of Environment and Forestry of the Republic of Indonesia (KLHK) | 2020 | https://sigap.menlhk.go.id/sigap/ | |
| 3. Rainfall | Indonesian Agency for Meteorological, Climatological and Geophysics (BMKG) | 2020 | https://gis.bmkg.go.id/arcgis/home/ | |
| 4. DEM, TWI, SPI, TRI, STI, Slope, Curvature, Aspect, Flow Direction, Flow Accumulation | Shuttle Radar Topography Mission (SRTM) | 2000 | https://earthexplorer.usgs.gov/ | |
| 5. Wind | Global Wind Atlas | 2021 | https://globalwindatlas.info/en | |
| 6. River Network | Geospatial Information Agency (BIG) | 2021 | https://www.big.go.id/ | |
| 7. Geology | Southeast Sulawesi Agricultural Technology Study Center, Agricultural Research and Development Agency, Ministry of Agriculture | 2016 | https://www.pemda-balitbangsultra.info/ | |
| 8. Soil | Center for Research and Development of Agricultural Land Resources (BBSDLP) | 2016 | https://www.pemda-balitbangsultra.info/ | |
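
The NDVI extraction from the red and NIR channels can be illustrated with the standard normalized-difference formula, NDVI = (NIR − Red) / (NIR + Red). The sketch below is only an illustration under that assumption; the function name `ndvi` and the reflectance values are hypothetical, and the exact GEE processing chain used in the study is not specified in the text. For Landsat-8, the NIR and red channels correspond to bands 5 and 4.

```python
def ndvi(nir, red):
    """Return the NDVI of one pixel from NIR and red reflectance.

    Returns None when NIR + Red is zero, where the index is undefined.
    """
    total = nir + red
    if total == 0:
        return None
    return (nir - red) / total


# Hypothetical reflectance values: dense vegetation reflects strongly in NIR.
vegetated = ndvi(nir=0.5, red=0.1)
built_up = ndvi(nir=0.25, red=0.2)
print(vegetated, built_up)
```

Values close to 1 indicate dense vegetation, while values near 0 or below indicate sparse vegetation, bare surfaces, or water.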

The importance and relevance of the most commonly used flood conditioning factors in flood susceptibility mapping are covered in this section. Aspect is important in flood studies (Choubin et al. 2019b). It relates to the direction the slope faces and therefore to the direction in which flood water moves along the slope. Yates et al. (2000) showed that the hydrological response unit is highly influenced by the slope aspect. Rahmati et al. (2016) also demonstrated that soil moisture content and local climatic conditions are influenced by the slope aspect. Curvature is another important flood conditioning factor that affects heterogeneity and hyporheic exchange (Cardenas et al. 2004). Curvature has a great impact on the acceleration of water flowing over the surface (Lee et al. 2017; Tehrany et al. 2019). Curvature is usually classified into three classes: concave, flat, and convex. Convex surfaces are more susceptible to runoff and highly susceptible to flooding (Il'inskii & Yakimov 1987). Elevation describes the height of the ground surface. Areas with lower elevations are usually more prone to flooding (Li et al. 2012; Cea & Bladé 2015). Therefore, it is very important to understand the topographical forms and derived features that are responsible for the occurrence of flooding in an area (Woodrow et al. 2016).

Flow accumulation computes the accumulated flow as the summed weight of all cells that flow into each downslope cell in the output raster (Zhang et al. 2017). The direction of the steepest slope, which can also be considered the maximum drop, is often used to calculate the direction of flow from each cell. The flow direction map is created from a digital surface model. The final map shows eight valid output directions, one for each of the eight neighboring cells into which flow could pass. This method is commonly referred to as the eight-direction (D8) flow model and follows an approach proposed by Jenson & Domingue (1988). Geology describes the level of rock permeability in an area (Hammami et al. 2019). Lee et al. (2012) indicated that different geological units have different susceptibilities to flooding. Geology also affects the channel shape during floods (Reneau 2000; Heitmuller et al. 2015). Slope is directly related to flow velocity, soil type structure, and drainage because it regulates surface water flow and controls runoff (Zzaman et al. 2021). Runoff volume and velocity increase with increasing slope gradient (Khosravi et al. 2016; Tien Bui et al. 2018). As the slope gradient increases, the infiltration rate decreases and a large amount of runoff enters the drainage network (Tehrany et al. 2015).
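
The D8 rule described above can be sketched in a few lines. This is a minimal illustration, not the ArcMap implementation used in the study: the function name `d8_direction`, the sample grid, and the handling of pits (returning 0) are assumptions, and the direction codes follow the common power-of-two convention (1 = east, 2 = southeast, 4 = south, and so on clockwise to 128 = northeast).

```python
import math

# (row offset, column offset, direction code) for the eight neighbors.
NEIGHBORS = [
    (0, 1, 1), (1, 1, 2), (1, 0, 4), (1, -1, 8),
    (0, -1, 16), (-1, -1, 32), (-1, 0, 64), (-1, 1, 128),
]


def d8_direction(dem, row, col, cell_size=1.0):
    """Return the code of the neighbor with the steepest downhill drop.

    The drop is the elevation difference divided by the distance to the
    neighbor, so diagonal neighbors are penalized by a factor of sqrt(2).
    Returns 0 when no neighbor is lower (a pit or flat cell).
    """
    best_code, best_drop = 0, 0.0
    for d_row, d_col, code in NEIGHBORS:
        r, c = row + d_row, col + d_col
        if 0 <= r < len(dem) and 0 <= c < len(dem[0]):
            distance = cell_size * math.hypot(d_row, d_col)
            drop = (dem[row][col] - dem[r][c]) / distance
            if drop > best_drop:
                best_code, best_drop = code, drop
    return best_code


# Hypothetical 3 x 3 elevation grid sloping toward the lower-right corner.
dem = [
    [9, 8, 7],
    [8, 5, 4],
    [7, 4, 1],
]
print(d8_direction(dem, 1, 1))
```

Flow accumulation can then be obtained by following these directions downslope and counting, for each cell, how many upstream cells drain into it.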

The type of LULC in an area has a significant effect on the transport of surface runoff and sediment (Zhang et al. 2010), especially on built-up land (Costache & Bui 2020), while forest areas have good permeability (Yin et al. 2017). The NDVI is an index that can be used to evaluate vegetation cover and its impact on flooding. It is a simple graphic indicator used in remote sensing analyses to assess vegetation attributes in a region (Sajedi-Hosseini et al. 2018). There is an inverse relationship between vegetation density and flooding (Tehrany et al. 2013; Kumar & Acharya 2016). Rainfall is a key influencing factor in flood susceptibility mapping, as is widely noted in the literature (Tehrany et al. 2015; Bui et al. 2017). Short, high-intensity rainfall can cause flooding in an area (Ali et al. 2020). Rainfall data were extracted for the cumulative rainfall period in 2021. Soil type directly affects the drainage process through inherent soil characteristics such as texture, permeability level, and structure (Mojaddadi et al. 2017). Predick & Turner (2008) emphasized that proximity to the river is a serious factor for flooding. Tien Bui et al. (2018) and Darabi et al. (2019) also observed that a great number of floods occurred in areas adjacent to the river. Flooding becomes more likely as the distance from the river decreases (Shahabi et al. 2021).

The SPI shows the erosional strength of flowing water and its influence on the fluvial system (Tehrany et al. 2014) in terms of sediment transport and riverbed erodibility (Chen et al. 2020). The STI relates to the possible movement of sediment in the watershed by water, is influenced by topography, and characterizes erosion and deposition processes (Rahmati et al. 2019). The TRI is related to the local topography of a basin; a low TRI is associated with higher flooding (Tehrany et al. 2019). The TWI, first proposed by Beven & Kirkby (1979), represents the spatial variation of wetness across a river basin (Meles et al. 2020). Winds play a strong role in generating storm surges and tidal levels (Joyce et al. 2018; Abijith et al. 2021). ArcMap GIS Software was used to generate SPI, STI, TWI, and TRI from the DEM using the following equations (Jaafari et al. 2014; Jebur et al. 2014):
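$$\mathrm{SPI} = A_s \tan\beta \tag{1}$$
$$\mathrm{STI} = \left(\frac{A_s}{22.13}\right)^{0.6} \left(\frac{\sin\beta}{0.0896}\right)^{1.3} \tag{2}$$
$$\mathrm{TWI} = \ln\left(\frac{A_s}{\tan\beta}\right) \tag{3}$$
where $A_s$ represents the area of the catchment (m2) and $\beta$ (radian) the gradient of the slope.
$$\mathrm{TRI} = \sqrt{\sum \left(x_{ij} - x_{00}\right)^2} \tag{4}$$
where $x_{ij}$ is the elevation of each neighbor cell to cell (0,0).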
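
As a worked illustration of the four DEM-derived indices, the sketch below evaluates them for a single cell. It assumes the commonly used forms of these indices: SPI = A_s·tan β, STI = (A_s/22.13)^0.6·(sin β/0.0896)^1.3, TWI = ln(A_s/tan β), and TRI as the square root of the summed squared elevation differences between a cell and its eight neighbors. The function names and input values are hypothetical, and the study itself computed these indices in ArcMap.

```python
import math


def spi(catchment_area, slope_rad):
    """Stream Power Index: catchment area times tan(slope)."""
    return catchment_area * math.tan(slope_rad)


def sti(catchment_area, slope_rad):
    """Sediment Transport Index with the usual 22.13 and 0.0896 constants."""
    return (catchment_area / 22.13) ** 0.6 * (math.sin(slope_rad) / 0.0896) ** 1.3


def twi(catchment_area, slope_rad):
    """Topographic Wetness Index; undefined for a perfectly flat cell."""
    return math.log(catchment_area / math.tan(slope_rad))


def tri(window):
    """Terrain Roughness Index of the center cell of a 3 x 3 elevation window."""
    center = window[1][1]
    return math.sqrt(sum((value - center) ** 2 for row in window for value in row))


# Hypothetical cell: 100 m2 of contributing area on a 10 degree slope.
slope = math.radians(10)
print(spi(100, slope), sti(100, slope), twi(100, slope))
print(tri([[12, 11, 10], [11, 10, 9], [10, 9, 8]]))
```

In practice a small minimum slope is often substituted for flat cells so that TWI stays finite; that safeguard is omitted here for brevity.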

Machine learning models

Each parameter was converted to raster data in TIF format and imported into R software along with the flood event data, which were analyzed by the maximum entropy method using the SDM package (Naimi & Araújo 2016). The flood model with resampling using CV, Bt, and RS was run with two iterations. A flood vulnerability map was made for each of the three resampling techniques. The results were compared with the unprocessed model (BM), which uses no resampling. The resampling was done in R software. The number of samples was not fixed in advance because of fluctuations in the gamma value and a decreasing standard error, so the sample data were divided directly into two dataset groups. The resulting model has a continuous vulnerability value. To obtain a flood vulnerability map, each model output was classified into several classes. Quantile-based classification was chosen because it groups an equal number of pixels (area) into each class, which suits the data in this study. The model is classified into five classes, namely very low, low, moderate, high, and very high, as was done in previous studies (Shabani et al. 2018; Pham et al. 2021).
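
The quantile-based classification into five classes can be illustrated as follows. This is a minimal sketch, assuming the continuous vulnerability values are available as a flat list; the function names `quantile_breaks` and `classify` and the simple rank-based break rule are hypothetical choices, not the routine used in the study.

```python
LABELS = ["very low", "low", "moderate", "high", "very high"]


def quantile_breaks(values, n_classes=5):
    """Return the n_classes - 1 break values that split the sorted data
    into groups holding (nearly) equal numbers of pixels."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // n_classes] for k in range(1, n_classes)]


def classify(value, breaks):
    """Map a continuous vulnerability value to one of the five class labels."""
    index = sum(value >= b for b in breaks)
    return LABELS[index]


# Hypothetical continuous susceptibility values for 100 pixels.
scores = [i / 100 for i in range(100)]
breaks = quantile_breaks(scores)
print(breaks)
print(classify(0.95, breaks))
```

Because the breaks follow the ranks of the data rather than fixed value intervals, each class covers roughly the same area even when the susceptibility values are strongly skewed.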

Generalized linear model

GLM is a parametric model and an extension of the linear model. The relationship between explanatory factors and responses is measured by regression parameters. The GLM excels when the observations are not normally distributed and when other regression models yield unsatisfactory results. A more detailed description of GLM can be found in McCullagh & Nelder (1989). The GLM is explored in the RStudio Desktop software using the ‘glm’ function.

Support vector machine

SVM is a non-linear model (Cortes & Vapnik 1995) that applies the principles of Structural Risk Minimization (SRM) and Empirical Risk Minimization (ERM), so it can minimize both training and generalization errors (Wu et al. 2014). Therefore, the SVM can provide an optimal network structure with better generalization to decrease the complexities associated with flood prediction. This model is applied to flood prediction by using different kernel functions (Chen et al. 2017; Pourghasemi & Rahmati 2018). Linear (LN), Polynomial (PL), Radial Basis Function (RBF), and sigmoid (SIG) are the four kernel types used in SVM. LN is regarded as a special case of RBF, while SIG performs equivalently to RBF for certain parameters (Song et al. 2011). When RBF is used in the processing, LN is no longer required. In terms of accuracy, RBF generates more reliable and robust outcomes than SIG because of its better fit in interpolation. The RBF kernel was used in this study. SVM has several types: C-SVM, ν-SVM, one-class SVM, ε-SVR, and ν-SVR. Among those, the ε-SVR type was used in this research (Chang & Lin 2011). The values 1,000 and 0.1 of the parameters C and ε, respectively, were selected based on trial and error. The SVM model is explored in the RStudio Desktop software using the ‘kernlab’ package.

Random forest

The RF model uses novel methods of data combination to construct and combine numerous trees for prediction. As the prediction factor is chosen, the RF model grows a Classification And Regression Tree (CART)-like tree (Naghibi & Pourghasemi 2015). This tree, however, differs from standard CART in several respects. The key RF parameters are the number of trees and the number of predicting factors on which each decision tree is grown to its maximum size and then left unpruned. In this research, the values of these two parameters were selected based on trial and error, giving 1,000 and 4, respectively. RF modeling is carried out with the ‘RF’ package in RStudio Desktop.

Boosted regression tree

BRT is a non-parametric statistical method that strengthens classification results by combining several weak classifiers into a strong classifier. The CART algorithm is applied to generate a decision tree. The resulting tree is pruned to select the best performance through a CV procedure (Breiman et al. 1984). A further explanation is given by Breiman et al. (1984). This model is implemented with the ‘gbm’ package in the software.

Multivariate adaptive regression splines

MARS was introduced by Friedman (1991) as a flexible non-parametric regression approach that can handle high-dimensional data. The model can cope with cases in which the shape of the function is unknown but there is a patterned relationship between the dependent and independent variables (Friedman 1991). MARS implementations show that this ML approach can help to construct good predictive models for complex engineering datasets (Heddam & Kisi 2018). Since related works on flood susceptibility prediction demonstrate that the classification function needed to discriminate flood and non-flood areas can be highly complex, MARS is potentially useful for generalizing such classification functions from the GIS database. This model is run in RStudio Desktop using the ‘earth’ package.

Mixture discriminant analysis

MDA was introduced by Hair et al. (1998) and is used to form linear combinations of independent factors so that the degree of correlation of the factors in each group is the same (Choubin et al. 2019b). The resulting composite, called the discriminant function, is:
$$Y = W_1 X_1 + W_2 X_2 + \dots + W_n X_n \tag{5}$$
where Y is the discriminant score, Xn is the independent factor, and Wn is the weight of each factor. This model is run in RStudio Desktop using the ‘mda’ package.

Flexible discriminant analysis

FDA is a classification and pattern recognition model consisting of a combination of linear regression models. This model uses optimal scoring and canonical correlation analysis of response factors to distinguish the best object groups (Hastie et al. 1994). This model is implemented using the ‘mda’ package in the software.

Resampling approach

Cross-validation

ML modeling uses datasets that are divided into training groups and test groups. In this study, 70% of the data are used for training and 30% for testing. In each iteration of the CV process, part of the data is used for training and part for testing. The test data of one iteration are reused as training data in the next iteration, so that the whole dataset is eventually used to test the model (Dieterle 2003). CV can be performed in several ways, for example holdout, leave-one-out, leave-P-out, and k-fold, the last of which is the method used in this study. In k-fold CV, the training dataset is randomly split into k subsets or folds of equal size, and each CV iteration uses k–1 folds for training and the remaining fold for validation. In this study, k was set to 5.
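
The k-fold procedure with k = 5 can be sketched as index bookkeeping, independent of any particular model. The generator name `k_fold_splits` and the fixed random seed are hypothetical; the sample count of 51 simply reflects the 23 flood and 28 non-flood locations, and folds differ in size by at most one sample when the count is not divisible by k.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Indices are shuffled once, dealt into k folds, and each fold serves as
    the validation set exactly once while the other k - 1 folds train.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 23 flood + 28 non-flood locations = 51 samples.
for fold_number, (train, test) in enumerate(k_fold_splits(51), start=1):
    print(fold_number, len(train), len(test))
```

A model would be fitted on each `train` set and scored on the matching `test` set, and the k scores averaged to give the cross-validated estimate.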

Bootstrap

Theoretically, Bt suggests that any inference about a population can be made from a sample of it. Additionally, the sampling distribution can be inferred from resampled units. In other words, if samples of n units are drawn in large numbers by simple random sampling with replacement, the Bt statistic can be specified by combining the estimated statistics (Fox 2002). Supposing the distinct sample is y = (y1, y2, …, yn), the resampled units y* = (y1*, y2*, …, yn*) can be obtained in large numbers (B model runs), giving the Monte Carlo Bt approximation of the mean and variance of the given parameter.
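
The Monte Carlo bootstrap approximation can be illustrated for a simple statistic, the sample mean. This is a sketch under stated assumptions: the function name `bootstrap_means`, the number of replicates B = 1,000, and the sample values are hypothetical, and the study applies the resampling to model fitting rather than to a mean.

```python
import random
import statistics


def bootstrap_means(sample, n_replicates=1000, seed=0):
    """Draw B resamples of size n with replacement and return their means."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_replicates)]


# Hypothetical observations y = (y1, ..., yn).
y = [2.0, 4.0, 6.0, 8.0]
replicate_means = bootstrap_means(y)

# Bootstrap estimates of the mean and variance of the statistic.
print(statistics.fmean(replicate_means))
print(statistics.variance(replicate_means))
```

Because each resample is drawn with replacement, some observations appear several times and others not at all, which is what lets the spread of the replicate statistics approximate the sampling variance.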

Random subsampling

RS is a Monte Carlo technique (Picard & Cook 1984). It randomly separates the data into a training group and a testing group in each of B replications (Dieterle 2003). RS samples without replacement, and the total sample pool over the B replications will have a wide range of correlations, which distinguishes it from Bt.
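
Random subsampling differs from the bootstrap in that each replication is a fresh split without replacement. The sketch below illustrates this; the function name `random_subsampling`, the 70/30 split fraction, and B = 10 replications are assumptions chosen for illustration rather than settings reported for RS in the study.

```python
import random


def random_subsampling(n_samples, n_replicates=10, train_fraction=0.7, seed=0):
    """Yield B independent random train/test splits drawn without replacement."""
    rng = random.Random(seed)
    n_train = round(n_samples * train_fraction)
    for _ in range(n_replicates):
        shuffled = rng.sample(range(n_samples), n_samples)
        yield shuffled[:n_train], shuffled[n_train:]


# 51 samples (23 flood + 28 non-flood locations), 10 replications.
for train, test in random_subsampling(51):
    print(len(train), len(test))
```

Unlike k-fold CV, the test sets of different replications may overlap, and a given sample may never be selected for testing.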

Multicollinearity

While the ideal number of flood-triggering factors is unknown, the Learning Vector Quantization (LVQ) algorithm was used to determine the most important factors. LVQ is a supervised classifier formulated by Kohonen (1995). It is a training method for large datasets that classifies the input by searching for the shortest distance to the value and eliminates noise that could interfere with convergence in the forecasting system (Kohonen 1995). It was run in the RStudio Desktop software package. This study also uses the Pearson Correlation method (Al-Juaidi et al. 2018) to determine the relationships among the variables and between the variables and the flood event data. The Pearson Correlation value ranges from −1 to 1. If there is a strong positive linear relationship between the variables, the value of r will be close to +1. If there is a strong negative linear relationship between the variables, the value of r will be close to −1. When the value of r is near 0, the linear relationship is weak or nonexistent, and a value of exactly 0 indicates that there is no linear relationship between the two variables. A correlation value > 0.7 indicates a strong level of collinearity (Tien Bui et al. 2016). The Pearson Correlation is calculated using the following equation (Sarwono 2006):
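$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} \tag{6}$$
where $r$ is the Pearson Correlation Coefficient, $n$ is the number of data pairs, $\sum xy$ is the sum of products of the paired variables, $\sum x$ is the sum of the x values, $\sum y$ is the sum of the y values, $\sum x^2$ is the sum of the squared x values, and $\sum y^2$ is the sum of the squared y values.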
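
The Pearson Correlation Coefficient can be computed directly from the sums named above (the number of pairs, the sum of products, the sums of x and y, and the sums of their squares). The sketch below is illustrative only; the function name `pearson_r` and the sample values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation from the raw sums of x, y, xy, x^2, and y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical factor values sampled at the same five locations.
factor_a = [0.2, 0.5, 0.9, 1.4, 2.0]
factor_b = [1.1, 1.9, 3.2, 4.1, 6.3]
r = pearson_r(factor_a, factor_b)

# A value above 0.7 would be flagged as strong collinearity.
print(r, r > 0.7)
```

The function raises a `ZeroDivisionError` if either variable is constant, since the correlation is undefined in that case.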

Accuracy assessment

The validity of the results of the models and pre-processing was investigated using four performance metrics: Area Under the ROC Curve (AUC), Correlation Coefficient (COR), True Skill Statistic (TSS), and Deviance. The AUC has been frequently applied for the accuracy assessment of spatial prediction models in flood susceptibility modeling (Lee et al. 2017; Hong et al. 2018; Tien Bui et al. 2018; Choubin et al. 2019b; Chen et al. 2020; Mosavi et al. 2022). The AUC is used for the quantitative assessment of the developed integrative model. A good AUC value is above 0.7 (Shabani et al. 2018). TSS is an alternative to the Kappa coefficient which, unlike the Kappa statistic, is not affected by the prevalence and size of the validation dataset (Allouche et al. 2006). The TSS metric indicates the ability of a model to distinguish between flood and non-flood localities. The closer the COR value is to 1, the better; the closer the deviance value is to 0, the better. RStudio Software was used to compute COR, TSS, and deviance using the following equations (Kwon 2017; Hong et al. 2018):
(7)
(8)
(9)
where N is the total number of samples, a and b are the numbers of pixels that were classified as flood-susceptible and non-flood samples by the models, and c is the number of known floods mistakenly labeled as non-susceptible areas. The FOI indicates flood presence (FOI = 1) or absence (FOI = 0), while the FSI indicates the susceptibility to flooding predicted by the model. Sensitivity is the proportion of known flood pixels that were modeled as flood-susceptible areas. Specificity is the proportion of known non-flood pixels that were modeled as non-susceptible areas (Hong et al. 2018). If = for all future observations, the D value is zero. If is always true, the value of D is infinite (Kwon 2017).
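
Two of the metrics can be illustrated from first principles. The sketch below computes TSS using its standard definition, sensitivity + specificity − 1, and AUC as the fraction of flood/non-flood pairs in which the flood location receives the higher score. The function names, the 0.5 classification threshold, and the sample scores are assumptions for illustration; the exact COR and deviance formulas used in the study are not reproduced here.

```python
def tss(observed, predicted, threshold=0.5):
    """True Skill Statistic from binary observations and continuous scores."""
    tp = fn = tn = fp = 0
    for obs, score in zip(observed, predicted):
        flagged = score >= threshold
        if obs == 1 and flagged:
            tp += 1
        elif obs == 1:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


def auc(observed, predicted):
    """Rank-based AUC: share of (flood, non-flood) pairs ranked correctly,
    counting ties as half."""
    flood = [s for o, s in zip(observed, predicted) if o == 1]
    non_flood = [s for o, s in zip(observed, predicted) if o == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in flood for q in non_flood)
    return wins / (len(flood) * len(non_flood))


# Hypothetical observations (1 = flood, 0 = non-flood) and model scores.
observed = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
print(tss(observed, scores), auc(observed, scores))
```

TSS ranges from −1 to 1 and AUC from 0 to 1; a model that ranks locations no better than chance gives a TSS near 0 and an AUC near 0.5.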
This study required six steps (Figure 3): collecting spatial data about past floods in the study area; identifying the factors that contribute to flood occurrence by reviewing the scholarly literature; assessing multicollinearity among the factors and between each factor and flood occurrence, then selecting standalone models and creating weighted ensemble ML models; performing resampling; determining the set of the most important factors for each model; and calibrating and evaluating the models to select the best model.
Figure 3. Research flowchart.

Figure 4. Flood vulnerability predictions using BM models.

Figure 5. Flood vulnerability predictions using CV models.

Figure 6. Flood vulnerability predictions using Bt models.

Figure 7. Flood vulnerability predictions using RS models.

Flood prediction

All models show that the highest flood vulnerability is in the middle of the area, which is downstream of the Wanggu watershed. This area has a flat lowland morphology. Although most of its soils are alluvial with high permeability, the area has developed into a dense built-up area covered by concrete or cement, which reduces the absorption capacity of the soil. The models also highlight the Wanggu River as the main river with a high level of flooding compared with its tributaries. Sedimentation also contributes to flooding in this area: sedimentation in Kendari Bay continues to increase because of mining activities upstream or elsewhere in the watersheds. These findings agree with Yang et al. (2018), Darabi et al. (2019), Choubin et al. (2019b), and Dodangeh et al. (2020), who found that areas around rivers are more prone to flooding.

Evaluation and comparison

The AUC values of the BM models range from 0.90 to 0.99. The RF and SVM models have the highest AUC values, followed by BRT, MARS, MDA, GLM, and FDA (Figure 4, Table 2). Overall, however, the BM-RF model outperformed the other BM models because it also performed well on the COR and TSS values. This is in line with Avand & Moradi (2022), who found RF to be a more accurate model for predicting the level of flood vulnerability. RF has the advantages of relatively fast computation, no overfitting as the number of trees increases, and better accuracy than other models (Breiman 2001). The FDA model has the worst performance of all the models. This is because the FDA is working with a small population. The FDA requires the difference between the within-class variance and the between-class variance to be sufficiently large. If this assumption is not satisfied, the power of the FDA weakens (Yan & Dai 2011).

Table 2

Model accuracy

Algorithm  BM                      CV                      Bt                      RS
           i     ii    iii   iv    i     ii    iii   iv    i     ii    iii   iv    i     ii    iii   iv
1. GLM     0.93  0.37  0.78  0.14  0.92  0.32  0.68  0.15  0.92  0.39  0.64  0.20  0.92  0.41  0.65  0.14
2. SVM     0.99  0.77  0.88  0.11  0.99  0.81  0.89  0.08  0.99  0.84  0.89  0.08  0.99  0.79  0.88  0.11
3. RF      0.99  0.96  0.89  0.05  0.99  0.97  0.90  0.05  0.99  0.96  0.89  0.05  0.99  0.96  0.89  0.05
4. BRT     0.98  0.70  0.88  0.12  0.99  0.76  0.89  0.11  0.99  0.80  0.90  0.14  0.99  0.76  0.88  0.11
5. MARS    0.98  0.66  0.85  0.09  0.98  0.68  0.80  0.08  0.98  0.76  0.81  0.10  0.97  0.67  0.76  0.09
6. MDA     0.94  0.52  0.73  0.14  0.94  0.48  0.68  0.15  0.91  0.52  0.53  0.21  0.95  0.60  0.75  0.12
7. FDA     0.90  0.27  0.74  0.17  0.90  0.26  0.67  0.16  0.90  0.35  0.65  0.25  0.91  0.32  0.62  0.17

Note: i, AUC; ii, COR; iii, TSS; iv, deviance.

Similar to the BM models, the AUC curves reveal the potential of both the RF and SVM models to make highly accurate predictions of flood and non-flood events. In the training phase, the AUC was 0.99 for both RF and SVM. The RF models (other than BM-RF) perform better than the other models because they are trained with different resampling procedures (Figures 5–7). Judged by the AUC value, the CV-GLM, Bt-GLM, Bt-MDA, RS-GLM, and RS-MARS models are weaker than their BM counterparts in predicting flood vulnerability. The CV-RF model is similar to the Bt-RF and RS-RF models in terms of AUC, TSS, and deviance, but with COR = 0.97 it performs better than both in predicting flood events. Based on the COR value, CV-RF therefore gives the most satisfactory results, followed by Bt-RF, RS-RF, and BM-RF. This is because the RF algorithm looks for the best feature among a random collection of features, rather than looking for the most important feature overall, when splitting a node. This increases variation among the trees and aims to produce the best model. In this study, RF considers only one sub-feature when splitting a node. More random trees can also be obtained by using a threshold for each attribute when searching for the best threshold (Achour & Pourghasemi 2020). However, judged by the change in AUC under resampling, the RS-MDA and RS-FDA models perform most satisfactorily, which suggests that the RS technique can be considered for improving the performance of MDA and FDA in predicting flood events. Judging from the TSS, the CV and Bt models generally perform better than the RS models. The RS-MDA model, however, gives better results than CV-MDA and Bt-MDA, so RS can be considered for improving the performance of the MDA model. The deviance of the RF model is satisfactory under all four methods.
The Bt-BRT model performs better than BM-BRT, as shown by TSS = 0.90. Dodangeh et al. (2020) reported a similar finding. The Bt-GLM, Bt-MDA, and Bt-FDA models have the worst scores across models and methods, so the Bt method should not be applied to these algorithms. However, it should be noted that an AUC close to 1 does not by itself demonstrate highly accurate results, because the value can be biased by other factors.
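The three resampling schemes compared above (CV, Bt, and RS) differ mainly in how they divide the samples into training and testing sets. The following is a minimal sketch of that difference, not the authors' implementation: the function names, the number of folds (k = 5), the 30% hold-out fraction, and the seeds are all assumptions chosen for illustration.

```python
import random

def k_fold_splits(n, k=5, seed=0):
    """Cross-validation: every sample is used for testing exactly once."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    splits = []
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        splits.append((train, test))
    return splits

def bootstrap_split(n, seed=0):
    """Bootstrap: draw n samples with replacement; untouched samples form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test

def random_subsample_split(n, test_fraction=0.3, seed=0):
    """Random subsampling: one random hold-out split without replacement."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]

# Example with 20 hypothetical flood/non-flood sample points.
cv_splits = k_fold_splits(20, k=5)
bt_train, bt_test = bootstrap_split(20)
rs_train, rs_test = random_subsample_split(20, test_fraction=0.3)
print(len(cv_splits), len(bt_train), len(bt_test), len(rs_train), len(rs_test))
```

In the cross-validation scheme each observation appears in a test fold exactly once, whereas the bootstrap training set contains repeated observations and random subsampling tests only on a single hold-out set.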

According to the CV-RF model, areas with very low, low, moderate, high, and very high levels of flood vulnerability cover 145.82 km² (53.04%), 39.64 km² (14.42%), 31.76 km² (11.55%), 29.32 km² (10.67%), and 28.35 km² (10.31%) of the total area, respectively. Based on the built-up/settlement layer of the Indonesian Earth Map, there are 454 small and large built-up/settlement areas in the study area. Of these, 39 (8.59%), 38 (8.37%), 84 (18.50%), 102 (22.47%), and 191 (42.07%) are in the very low, low, moderate, high, and very high categories, respectively (Figure 8).
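The settlement percentages above can be checked with a few lines of arithmetic. The counts are taken from the text; grouping the moderate, high, and very high classes into a single flood-prone total is an assumption made here for illustration.

```python
# Built-up/settlement counts per susceptibility class, as reported in the text.
counts = {"very low": 39, "low": 38, "moderate": 84, "high": 102, "very high": 191}

total = sum(counts.values())
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

# Assumed grouping: moderate, high, and very high classes count as flood-prone.
prone = sum(counts[name] for name in ("moderate", "high", "very high"))
prone_share = round(100 * prone / total, 2)

print(total, shares, prone, prone_share)
```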
Figure 8

Flood susceptibility predictions by the CV-RF model and the locations of built-up/settlement and main river.


It should be noted that each of the above models has coefficients and parameters that affect its performance. According to Wang et al. (2019a), the SVM model gives different results when the effect of hybrid kernel selection is explored. Kernel-type selection in an SVM model is a vital step because it directly controls training effectiveness and classification accuracy (Yao et al. 2008). In this study, the parameters were determined by trial and error.

Importance of variable

The role of the parameters that cause flooding varies from one region to another. It is therefore critical to evaluate and validate the data before using them as input to learning models. In the BM models, the most important factors are NDVI and LULC in all models except BM-GLM and BM-FDA (Figure 9). BM-GLM shows that slope and elevation are the most important factors, while BM-FDA shows NDVI and TRI. The BM-GLM, BM-SVM, and BM-MDA models show that all variables contribute to flood events in the study area, with varying influence. In the BM-RF model, only NDVI, LULC, TRI, and TWI contribute to flooding. With CV, the most important factors are NDVI and TRI in all models, with three exceptions: CV-GLM replaces TRI with elevation; CV-MARS gives slightly different importance to NDVI, LULC, and FD; and CV-MDA ranks SPI, soil, and NDVI as most important (Figure 10). All variables in the CV-GLM, CV-SVM, and CV-MARS models contribute to some degree as flood-causing factors, whereas CV-RF shows that only NDVI and TRI are influential; in the other models the influence varies, but not widely. The most important factors in the models trained with Bt are similar to those with CV. With Bt, all models show NDVI as the most important factor for flood occurrence, except Bt-MARS, which is most influenced by flow accumulation (FA), TRI, and LULC (Figure 11). The Bt-SVM, Bt-RF, and Bt-BRT models show that NDVI and TRI are the most important factors, but in the Bt-MDA model TRI and DtR have no significant effect, and curvature does not affect the Bt-FDA model. The RS method yields different levels of importance (Figure 12). Each model has a variable that has no effect, but the influence of the variables is consistent across all models, with NDVI and TRI having the most influence on flood events. These results are in accordance with Khosravi et al. (2018), Tien Bui et al. (2018), Choubin et al. (2019b), and Darabi et al. (2019).
Figure 9

Important factor of flood using BM models.

Figure 10

Important factor of flood using CV models.

Figure 11

Important factor of flood using Bt models.

Figure 12

Important factor of flood using RS models.

Figure 13

The relationship between variables and variables with flood events.


Overall, NDVI and TRI have the highest importance across all models. Low NDVI values are related to flood events caused by land cover changes (Aldiansyah et al. 2021; Atefi & Miura 2022). In this study, the lowest TRI values are found in a river basin that floods often. Similarly, Tehrany et al. (2019) found that a low TRI value is always associated with high flood intensity. The shift from low to high urbanization, driven by landform changes that follow the linear pattern of main roads, is the main driver of increasing flood risk in urban areas (Waghwala & Agnihotri 2019; Pal et al. 2022).

Multicollinearity

The DEM-derived features play an important role in this study because most of them are positively correlated with other variables (Figure 13). TWI is positively correlated with slope (0.37), flow accumulation (0.28), and STI (0.24). TWI quantifies topographic effects on hydrological processes (Lee et al. 2017). Water accumulates at certain locations and tends to move downward under gravity. TWI describes the distribution of moisture across an area: where TWI indicates higher soil water content, flooding also increases (Guzzetti et al. 2006; Meinhardt et al. 2015), and this is a factor that affects flooding in this study. Geology is also positively correlated with soil type (0.39). Soil absorption is determined by the strength and characteristics of the soil material, such as soil permeability and pore water pressure (Bui et al. 2016). TRI is positively correlated with STI (0.77), flow direction (0.57), distance from rivers (0.48), slope (0.40), and LULC (0.35). TRI has a strong relationship with the morphological aspects of flooded areas (Werner et al. 2005). The TRI shows a uniform elevation distribution, and flooding is more common in areas with low TRI in this study. Flood areas are always associated with roughness elements such as surface variations and irregularities and vegetation types such as trees and shrubs, and the direction of the slope toward the river affects how quickly runoff arrives through slope gaps (Casas et al. 2010; Tehrany et al. 2014). STI is a flood-causing factor that determines the timing of sediment movement due to water movement (Mojaddadi et al. 2017). In this study, STI is positively correlated with flow direction (0.61), slope (0.34), and distance from the river (0.27). The STI describes the overall runoff plot. High runoff areas have higher sediment transport and are less prone to flooding. SPI describes the amount of moisture in the soil and the potential for floodwater to flow down through the study area. The lower the SPI, the greater the effect of flooding, because SPI determines the area that can accumulate flow. Steep slopes can significantly reduce soil absorption and accelerate surface runoff from precipitation. Areas with low slopes are more likely to be inundated, which is what happened at the research location. Consequently, slope gradient plays an important role in regulating surface runoff, infiltration, and water retention.
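A common way to screen correlations such as those reported above is to flag factor pairs whose absolute correlation exceeds a cutoff. The sketch below uses the coefficients quoted in this section; the 0.7 cutoff and the variable names are assumptions for illustration and are not taken from the study.

```python
# Pairwise Pearson correlations quoted in the text for DEM-derived factors.
reported = {
    ("TRI", "STI"): 0.77,
    ("STI", "flow direction"): 0.61,
    ("TRI", "flow direction"): 0.57,
    ("TRI", "distance from rivers"): 0.48,
    ("TRI", "slope"): 0.40,
    ("geology", "soil type"): 0.39,
    ("TWI", "slope"): 0.37,
    ("TRI", "LULC"): 0.35,
    ("STI", "slope"): 0.34,
    ("TWI", "flow accumulation"): 0.28,
    ("STI", "distance from rivers"): 0.27,
    ("TWI", "STI"): 0.24,
}

THRESHOLD = 0.7  # assumed cutoff for "high" correlation, not from the paper

# Keep pairs at or above the cutoff, strongest first.
flagged = sorted(
    ((pair, r) for pair, r in reported.items() if abs(r) >= THRESHOLD),
    key=lambda item: -abs(item[1]),
)

for (first, second), r in flagged:
    print(f"{first} and {second}: r = {r}")
```

Under this assumed cutoff only the TRI and STI pair is flagged, which is consistent with it being the strongest correlation reported in the study.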

NDVI has a strong relationship with rainfall (0.54) and flow accumulation (0.25). Kendari City had high rainfall in 2021, and the flood-affected area has low vegetation density and low-lying morphology. Rainfall is the main cause of flooding in areas where rain is the only source of water: the higher the rainfall, the higher the chance of flooding. Rainfall in the study area is concentrated in the highlands, which can accelerate the flow of water to the lowlands. Most areas of Kendari City often experience flooding influenced by high rainfall intensity, poor drainage, lack of infiltration or biospheric wells, and coastal reclamation (Idati et al. 2020; Sinaga 2022). Coastal reclamation changes the topography and elevation, so that when heavy rain coincides with a high tide, seawater also enters the rainwater channels. A low NDVI value is related to flood events caused by changes in land cover: if rainfall intensity increases and is not matched by infiltration capacity, the likelihood of ponding becomes even greater (Rahmati et al. 2016; Atefi & Miura 2022). NDVI affects flooding because denser vegetation (a higher NDVI value) reduces and slows the flow of water. Vegetation cover allows water to penetrate deeper into the soil, reducing water volume and lowering the probability of flooding. LULC has a positive correlation of 0.43 with distance from the river. Several areas along the main river at the research location have land cover that has been degraded into residential areas and buildings. The type of LULC has a significant influence on hydrological elements such as infiltration, evapotranspiration, and runoff generation (Rahmati et al. 2016).

Areas with high flood susceptibility are distributed over siliciclastic sedimentary rock types (TRJm), whose porosity and permeability are determined by the size, shape, and composition of the soil type (Pettijohn 1975). Most areas on these soil types have very low flood susceptibility, but areas where other surface constituents have been inventoried fall in the high vulnerability class. This is in line with the elevation factor, which is positively correlated with geology in this study (0.30). Alluvium is always associated with floodplains in river areas (Miller & Juilleret 2020), which form the dominant flood-vulnerable area in this study, as shown by a correlation value of 0.28. Geological fractures are usually manifested in relief, the distribution of river networks, and rock erosion patterns that can lead to the gradual development of floodplains (Heitmuller et al. 2015).

Aspect is strongly related to slope (0.39) and the acceleration of water flow (0.30) in this study. Areas with low elevation receive runoff from the slopes in a short period of time, which creates flooding in the area (Tehrany et al. 2014). Aspect is very important because slopes facing different directions receive different intensities of sunlight, which indirectly affects evapotranspiration, hydrological processes, soil type, and the amount of vegetation (Pourghasemi & Rahmati 2018). Floods are usually preceded by heavy and prolonged rainfall. Wind has a positive correlation with STI (0.50), TRI (0.50), distance from the river (0.34), and TWI (0.26), relationships that had not previously been explored in this study area. Wind speed in this area is distributed according to the lowland morphology. Rain formation requires unstable atmospheric conditions, high water vapor content, intensive lifting of air masses, and low wind speeds.

When two factors have a high Pearson correlation, the simplest remedy is to delete one of them from the dataset and repeat the analysis (Dai & Lee 2001). While Pearson's correlation coefficient provides some useful insights, a few results were unexpected. TRI and STI are both slope-governed functions, which explains the higher correlation between them. However, when interpreting correlations, the following points should be considered:

  • A correlation of zero indicates absolutely no linear relationship between those two variables whatsoever. Pearson quantifies statistical association in terms of a straight line (Xu & Deng 2017). Pearson, however, does not eliminate the potential for some non-linear relationship between those two variables. Hence, two variables that may be highly associated with one another may produce a Pearson's Correlation coefficient value of zero. Thus, a Pearson score of zero, between two variables, neither means no association, nor that one cannot predict one of the pair from the other.

  • Correlation can be extremely reactive to the influence of outliers; one extraordinary observation may have a significant impact on a particular correlation. A quick survey of the scatterplot facilitates the detection of outliers (de Winter et al. 2016).

  • Correlations are not always causally related (Price 2000), nor are the relationships fully reciprocal in that the same association does not apply in both directions. For example, a causal relation will exist between two events where the first causes the occurrence of the second. In simple terms, the first event becomes known as the cause, while the second event becomes the effect. Alternatively, correlations between two variables are not necessarily based on causation.

Hence, the Pearson method has its limitations. In this study, several correlations were affected by outliers, unequal variance, non-normality, and nonlinearity (de Winter et al. 2016). The method is most successfully applied where both variables in the pair are normally distributed (Mukaka 2012).
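Two of the limitations listed above, insensitivity to non-linear relationships and sensitivity to outliers, can be illustrated with a short sketch. The data are synthetic toy values, not data from this study, and the `pearson` helper is a plain implementation of the standard formula written for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# 1. A perfect but non-linear relationship (y = x^2) gives zero correlation.
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]
r_nonlinear = pearson(x, y)

# 2. A single extreme observation can create a strong correlation
#    between otherwise unrelated variables.
x_base = [1, 2, 3, 4, 5]
y_base = [2, 1, 2, 1, 2]
r_base = pearson(x_base, y_base)
r_outlier = pearson(x_base + [50], y_base + [50])

print(r_nonlinear, r_base, r_outlier)
```

In the first case y is fully determined by x even though the coefficient is zero; in the second, one added point moves the coefficient from roughly zero to nearly one, which is why inspecting a scatterplot before interpreting a coefficient is advisable.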

We have discussed the factors affecting flood susceptibility and the model results. Because ML models are robust, they can be effective for flood susceptibility mapping in other areas where environmental conditions and model input parameters are the same, although hydrogeological and topographical differences have to be considered. In addition, the proposed method can be used efficiently to obtain flood susceptibility maps for data-scarce areas. The methodology can be applied to other study areas, as well as to probability analyses of other natural hazards, such as river flooding, landslides, and land subsidence. The requirements are an inventory map showing the history and location of previous hazard points, a set of relevant conditioning factors (thematic maps), and a spatial analyst to carry out the resampling technique. Finally, our results can also be extended to mapping vulnerability to other problems associated with flooding (water quality, siltation, shoreline change, and loss of vegetation). However, the susceptibility map produced in this study only represents the flood-susceptible zones in an urban area like Kendari City. Furthermore, the current study will improve understanding of the evaluation process and consistency in decision-making for flood preparedness activities (e.g. construction of houses and buildings) in the future.

This study explored various algorithms to find an accurate and reliable algorithm for detecting flood-prone areas. Seven ML algorithms, namely GLM, SVM, RF, BRT, MARS, MDA, and FDA, were combined with resampling methods. These results cannot be generalized with certainty to flood risk prediction in other parts of the world, because the behavior of flood-influencing variables varies with region. Since ML is based on finding these patterns, modeling outputs should be expected to vary with region as well. CV-RF performs better than the other 27 models. CV in this study estimates out-of-sample performance more accurately because each observation is used for both training and testing, which reduces overfitting and helps achieve a good level of prediction. Of the 17 parameters, two have the greatest influence on flood events in all models in Kendari City, namely NDVI and TRI. The largest correlation occurs between TRI and STI, meaning that flooding in this area is strongly influenced by roughness elements such as surface variations and irregularities, types of vegetation such as trees and shrubs, and the direction of the slope toward the river, which can accelerate water accumulation. The importance of other factors differed among the models. A total of 89.44 km², equivalent to 32.54% of the total area, is vulnerable to flooding, and lowland morphology dominates this area.

This research also reveals the importance of the locations of housing and building development. A total of 377 built-up/settlement locations, or 83.04%, are in flood-prone areas. Future research should include evaluation and be consistent with strong spatial and regulatory plans. Future regional development must follow the rules and regulations formulated at the planning stage so that spatial policy is properly implemented.

We thank the University of Indonesia for supporting this research, and the Kendari City Government and the community for allowing and facilitating the collection of field data.

This paper was composed by collaboration among all authors. S.A. and F.W. designed this study, F.W. helped improve its progression and clarity, S.A. and F.W. wrote this paper, and F.W. helped in revising the paper.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Abijith D., Saravanan S., Jennifer J. J., Parthasarathy K. S. S., Singh L. & Sankriti R. 2021 Assessing the impact of damage and government response toward the cyclone Gaja in Tamil Nadu, India. In: Pal, I., Shaw, R., Djalante, R. & Shrestha, S. (eds). Disaster Resilience and Sustainability. Elsevier, Cambridge, MA, USA, pp. 577–590.
Ahmed N., Hoque M. A. A., Arabameri A., Pal S. C., Chakrabortty R. & Jui J. 2021 Flood susceptibility mapping in Brahmaputra floodplain of Bangladesh using deep boost, deep learning neural network, and artificial neural network. Geocarto International 37, 1–22.
Aldiansyah S., Mandini Mannesa M. & Supriatna S. 2021 Monitoring of vegetation cover changes with geomorphological forms using Google Earth Engine in Kendari City. Jurnal Geografi Gea 21 (2), 159–170.
Alfieri L., Bisselink B., Dottori F., Naumann G., de Roo A., Salamon P., Wyser K. & Feyen L. 2017 Global projections of river flood risk in a warmer world. Earth's Future 5 (2), 171–182.
Ali S. A., Parvin F., Pham Q. B., Vojtek M., Vojteková J., Costache R., Linh N. T. T., Nguyen H. Q., Ahmad A. & Ghorbani M. A. 2020 GIS-based comparative assessment of flood susceptibility mapping using hybrid multi-criteria decision-making approach, naïve Bayes tree, bivariate statistics and logistic regression: a case of Topľa basin, Slovakia. Ecological Indicators 117, 106620.
Al-Juaidi A. E., Nassar A. M. & Al-Juaidi O. E. 2018 Evaluation of flood susceptibility mapping using logistic regression and GIS conditioning factors. Arabian Journal of Geosciences 11 (24), 1–10.
Badan Nasional Penanggulangan Bencana [BNPB] 2013 1 Meninggal dan 2.300 Jiwa Mengungsi dari Banjir Di Kota Kendari. Available from: https://bnpb.go.id/berita/1-meninggal-dan-2-300-jiwa-mengungsi-dari-banjir-di-kota-kendari (accessed 7 January 2023).
Badan Nasional Penanggulangan Bencana Daerah [BNPB Daerah] 2013 Data Bencana Indonesia. Available from: www.bnpb.go.id (accessed 12 October 2022).
Badan Nasional Penanggulangan Bencana [BNPB] 2019 Data Informasi Bencana Indonesia (DIBI) [Data set]. Available from: dibi.bnpb.go.id (accessed 12 October 2022).
Badan Penanggulangan Bencana Daerah [BPBD] 2020 Data Informasi Kejadian Banjir Kota Kendari.
Badan Pusat Statistik [BPS] 2022
Batista G. E., Prati R. C. & Monard M. C. 2004 A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter 6 (1), 20–29.
Breiman L. 2001 Random forests. Machine Learning 45, 5–32.
Breiman L., Friedman J., Olshen R. & Stone C. 1984 Classification and Regression Trees, 1st edn. Chapman and Hall/CRC, Belmont, CA.
Bui D. T., Bui Q. T., Nguyen Q. P., Pradhan B., Nampak H. & Trinh P. T. 2017 A hybrid artificial intelligence approach using GIS-based neural-fuzzy inference system and particle swarm optimization for forest fire susceptibility modeling at a tropical area. Agricultural and Forest Meteorology 233, 32–44.
Bui D. T., Tsangaratos P., Ngo P. T. T., Pham T. D. & Pham B. T. 2019b Flash flood susceptibility modeling using an optimized fuzzy rule based feature selection technique and tree based ensemble methods. Science of the Total Environment 668, 1038–1054.
Cardenas M. B., Wilson J. L. & Zlotnik V. A. 2004 Impact of heterogeneity, bed forms, and stream curvature on subchannel hyporheic exchange. Water Resources Research 40 (8), 1–13.
Casas A., Lane S. N., Yu D. & Benito G. 2010 A method for parameterising roughness and topographic sub-grid scale effects in hydraulic modelling from LiDAR data. Hydrology and Earth System Sciences 14 (8), 1567–1579.
Chan F. K. S., Griffiths J. A., Higgitt D., Xu S., Zhu F., Tang Y. T. & Thorne C. R. 2018 ‘Sponge city’ in China – a breakthrough of planning and flood risk management in the urban context. Land Use Policy 76, 772–778.
Chang C. C. & Lin C. J. 2011 LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3), 1–27.
Chen W., Li Y., Xue W., Shahabi H., Li S., Hong H., Wang X., Bian H., Zhang S., Pradhan B. & Ahmad B. B. 2020 Modeling flood susceptibility using data-driven approaches of naïve Bayes tree, alternating decision tree, and random forest methods. Science of the Total Environment 701, 134979.
Choubin B., Borji M., Mosavi A., Sajedi-Hosseini F., Singh V. P. & Shamshirband S. 2019a Snow avalanche hazard prediction using machine learning methods. Journal of Hydrology 577, 123929.
Choubin B., Moradi E., Golshan M., Adamowski J., Sajedi-Hosseini F. & Mosavi A. 2019b An ensemble prediction of flood susceptibility using multivariate discriminant analysis, classification and regression trees, and support vector machines. Science of the Total Environment 651, 2087–2096.
Cieslak D. A. & Chawla N. V. 2008 Start globally, optimize locally, predict globally: Improving performance on imbalanced data. In 2008 Eighth IEEE International Conference on Data Mining. IEEE, Pisa, pp. 143–152.
Cortes C. & Vapnik V. 1995 Support-vector networks. Machine Learning 20 (3), 273–297.
Dai F. C. & Lee C. F. 2001 Frequency–volume relation and prediction of rainfall-induced landslides. Engineering Geology 59 (3–4), 253–266.
Darabi H., Choubin B., Rahmati O., Haghighi A. T., Pradhan B. & Kløve B. 2019 Urban flood risk mapping using the GARP and QUEST models: a comparative study of machine learning techniques. Journal of Hydrology 569, 142–154.
Davoudi Moghaddam D., Pourghasemi H. R. & Rahmati O. 2019 Assessment of the contribution of geo-environmental factors to flood inundation in a semi-arid region of SW Iran: comparison of different advanced modeling approaches. In: Pourghasemi, H. & Rossi, M. (eds). Natural Hazards GIS-Based Spatial Modeling Using Data Mining Techniques. Springer, Cham, pp. 59–78.
Dieterle F. J. 2003 Multianalyte Quantifications by Means of Integration of Artificial Neural Networks, Genetic Algorithms and Chemometrics for Time-Resolved Analytical Data. Dissertation, Institute of Physical and Theoretical Chemistry (IPTC), Tubingen University, Germany.
Dodangeh E., Choubin B., Eigdir A. N., Nabipour N., Panahi M., Shamshirband S. & Mosavi A. 2020 Integrated machine learning methods with resampling algorithms for flood susceptibility prediction. Science of the Total Environment 705, 135983.
Ferentinou M. & Chalkias C. 2013 Mapping mass movement susceptibility across Greece with GIS, ANN and statistical methods. In: Margottini, C., Canuti, P. & Sassa, K. (eds). Landslide Science and Practice. Springer, Berlin, Heidelberg, pp. 321–327.
Fotovatikhah F., Herrera M., Shamshirband S., Chau K. W., Faizollahzadeh Ardabili S. & Piran M. J. 2018 Survey of computational intelligence as basis to big flood management: challenges, research directions and future work. Engineering Applications of Computational Fluid Mechanics 12 (1), 411–437.
Fox J. 2002 Bootstrapping regression models. The Annals of Statistics 9 (6), 1218–1228. doi:10.1214/aos/1176345638.
Friedman J. H. 1991 Multivariate adaptive regression splines. The Annals of Statistics 19 (1), 1–67.
Fu M., Fan T., Ding Z. A., Salih S. Q., Al-Ansari N. & Yaseen Z. M. 2020 Deep learning data-intelligence model based on adjusted forecasting window scale: application in daily streamflow simulation. IEEE Access 8, 32632–32651.
Galar M., Fernandez A., Barrenechea E., Bustince H. & Herrera F. 2011 A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42 (4), 463–484.
Gandri L., Purwanto M. Y. J., Sulistyantara B. & Zain A. F. M. 2019 Pemodelan Bahaya Banjir Kawasan Perkotaan (Studi Kasus di Kota Kendari). Jurnal Keteknikan Pertanian 7 (1), 9–16.
Goetz J. N., Brenning A., Petschko H. & Leopold P. 2015 Evaluating machine learning and statistical prediction techniques for landslide susceptibility modeling. Computers & Geosciences 81, 1–11.
Guzzetti F., Reichenbach P., Ardizzone F., Cardinali M. & Galli M. 2006 Estimating the quality of landslide susceptibility models. Geomorphology 81 (1–2), 166–184.
Hair
J. F.
,
Black
W. C.
,
Babin
B. J.
,
Anderson
R. E.
&
Tatham
R. L.
1998
Multivariate Data Analysis
, Vol.
5
, No.
3
.
Prentice Hall
,
Upper Saddle River, NJ
, pp.
207
219
.
Hammami
S.
,
Zouhri
L.
,
Souissi
D.
,
Souei
A.
,
Zghibi
A.
,
Marzougui
A.
&
Dlala
M.
2019
Application of the GIS based multi-criteria decision analysis and analytical hierarchy process (AHP) in the flood susceptibility mapping (Tunisia)
.
Arabian Journal of Geosciences
12
(
21
),
1
16
.
Hastie
T.
,
Tibshirani
R.
&
Buja
A.
1994
Flexible discriminant analysis by optimal scoring
.
Journal of The American Statistical Association
89
(
428
),
1255
1270
.
Hastie
T.
,
Tibshirani
R.
&
Friedman
J.
2009
Random Forests, The Elements of Statistical Learning
.
Springer
,
New York, NY
, pp.
587
604
.
Hirabayashi Y., Mahendran R., Koirala S., Konoshima L., Yamazaki D., Watanabe S., Kim H. & Kanae S. 2013 Global flood risk under climate change. Nature Climate Change 3 (9), 816–821.
Hong H., Panahi M., Shirzadi A., Ma T., Liu J., Zhu A. X., Chen W., Kougias I. & Kazakis N. 2018 Flood susceptibility assessment in Hengfeng area coupling adaptive neuro-fuzzy inference system with genetic algorithm and differential evolution. Science of the Total Environment 621, 1124–1141.
Idati L., Magribi M., M L. & Lakawa I. 2020 Analisis Banjir, Faktor Penyebab dan Prioritas Penangan Sungai Anduonohu. Sultra Civil Engineering Journal 1 (2), 54–71.
Islamy U., Nursidah D. R., Narendra I. S., Anshori M. L. & Widodo E. 2022 Pengelompokkan Provinsi Di Indonesia Berdasarkan Indikator Dampak Bencana Banjir Tahun 2017–2020 Menggunakan K-Medoids. Bimaster: Buletin Ilmiah Matematika, Statistika dan Terapannya 11 (2), 381–388.
Jaafari A., Najafi A., Pourghasemi H. R., Rezaeian J. & Sattarian A. 2014 GIS-based frequency ratio and index of entropy models for landslide susceptibility assessment in the Caspian forest, northern Iran. International Journal of Environmental Science and Technology 11 (4), 909–926.
Jenson S. K. & Domingue J. O. 1988 Extracting topographic structure from digital elevation data for geographic information system analysis. Photogrammetric Engineering and Remote Sensing 54 (11), 1593–1600.
Kalantari Z., Ferreira C. S. S., Koutsouris A. J., Ahlmer A. K., Cerdà A. & Destouni G. 2019 Assessing flood probability for transportation infrastructure based on catchment characteristics, sediment connectivity and remotely sensed soil moisture. Science of the Total Environment 661, 393–406.
Khosravi K., Pham B. T., Chapi K., Shirzadi A., Shahabi H., Revhaug I., Prakash I. & Bui D. T. 2018 A comparative assessment of decision trees algorithms for flash flood susceptibility modeling at Haraz watershed, northern Iran. Science of the Total Environment 627, 744–755.
Kohonen T. 1995 Learning Vector Quantization; Self-Organizing Maps. Springer, Berlin, pp. 175–189.
Kwon J. M. 2017 Data Science to Follow and Learn. Jpub, Seoul, Korea.
Lee M. J., Kang J. E. & Jeon S. 2012 Application of frequency ratio model and validation for predictive flooded area susceptibility mapping using GIS. In: 2012 IEEE International Geoscience and Remote Sensing Symposium. IEEE, Munich, Germany, pp. 895–898.
Lee S., Kim J. C., Jung H. S., Lee M. J. & Lee S. 2017 Spatial prediction of flood susceptibility using random-forest and boosted-tree models in Seoul metropolitan city, Korea. Geomatics, Natural Hazards and Risk 8 (2), 1185–1203.
Li X. H., Zhang Q., Shao M. & Li Y. L. 2012 A comparison of parameter estimation for distributed hydrological modelling using automatic and manual methods. Advanced Materials Research 356, 2372–2375.
Lillesand T. M. & Kiefer R. W. 1994 Remote Sensing and Image Interpretation, 3rd edn. John Wiley & Sons, Inc., New York.
McCullagh P. & Nelder J. A. 1989 Generalized Linear Models. Chapman and Hall, London, UK.
Meles M. B., Younger S. E., Jackson C. R., Du E. & Drover D. 2020 Wetness index based on landscape position and topography (WILT): modifying TWI to reflect landscape position. Journal of Environmental Management 255, 109863.
Mojaddadi H., Pradhan B., Nampak H., Ahmad N. & Ghazali A. H. B. 2017 Ensemble machine-learning-based geospatial approach for flood risk assessment using multi-sensor remote-sensing data and GIS. Geomatics, Natural Hazards and Risk 8 (2), 1080–1102.
Mosavi A., Golshan M., Janizadeh S., Choubin B., Melesse A. M. & Dineva A. A. 2022 Ensemble models of GLM, FDA, MARS, and RF for flood and erosion susceptibility mapping: a priority assessment of sub-basins. Geocarto International 37 (9), 2541–2560.
Mukaka M. M. 2012 A guide to appropriate use of correlation coefficient in medical research. Malawi Medical Journal 24 (3), 69–71.
Nachappa T. G., Piralilou S. T., Gholamnia K., Ghorbanzadeh O., Rahmati O. & Blaschke T. 2020 Flood susceptibility mapping with machine learning, multi-criteria decision analysis and ensemble using Dempster Shafer Theory. Journal of Hydrology 590, 125275.
Nhu V.-H., Thi Ngo P.-T., Pham T. D., Dou J., Song X., Hoang N.-D., Tran D. A., Cao D. P., Aydilek İ. B., Amiri M., Costache R., Hoa P. V. & Tien Bui D. 2020 A new hybrid firefly–PSO optimized random subspace tree intelligence for torrential rainfall-induced flash flood susceptible mapping. Remote Sensing 12 (17), 2688.
Pal S. C., Chowdhuri I., Das B., Chakrabortty R., Roy P., Saha A. & Shit M. 2022 Threats of climate change and land use patterns enhance the susceptibility of future floods in India. Journal of Environmental Management 305, 114317.
Panahi M., Dodangeh E., Rezaie F., Khosravi K., Van Le H., Lee M. J., Lee S. & Pham B. T. 2021 Flood spatial prediction modeling using a hybrid of meta-optimization and support vector regression modeling. Catena 199, 105114.
Pandey M., Arora A., Arabameri A., Costache R., Kumar N., Mishra V. N., Siddiqui M. A., Ray Y., Soni S. & Shukla U. K. 2021 Flood susceptibility modeling in a subtropical humid low-relief alluvial plain environment: application of novel ensemble machine learning approach. Frontiers in Earth Science 9, 1091.
Peel M. C., Finlayson B. L. & McMahon T. A. 2007 Updated world map of the Köppen-Geiger climate classification. Hydrology and Earth System Sciences 11 (5), 1633–1644.
Pettijohn F. J. 1975 Sedimentary Rocks, Vol. 3. Harper & Row, New York, p. 628.
Pham Q. B., Pal S. C., Chakrabortty R., Norouzi A., Golshan M., Ogunrinde A. T., Janizadeh S., Khedher K. M. & Anh D. T. 2021 Evaluation of various boosting ensemble algorithms for predicting flood hazard susceptibility areas. Geomatics, Natural Hazards and Risk 12 (1), 2607–2628.
Picard R. R. & Cook R. D. 1984 Cross-validation of regression models. Journal of the American Statistical Association 79 (387), 575–583.
Pourghasemi H. R., Amiri M., Edalat M., Ahrari A. H., Panahi M., Sadhasivam N. & Lee S. 2020 Assessment of urban infrastructures exposed to flood using susceptibility map and Google Earth Engine. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14, 1923–1937.
Predick K. I. & Turner M. G. 2008 Landscape configuration and flood frequency influence invasive shrubs in floodplain forests of the Wisconsin River (USA). Journal of Ecology 96 (1), 91–102.
Price I. 2000 Research Methods and Statistics PESS202 Lecture and Commentary Notes. University of New England, Armidale.
Rahmati O., Kalantari Z., Samadi M., Uuemaai E., Moghaddam D. D., Nalivan O. A., Destouni G. & Tien Bui D. 2019 GIS-based site selection for check dams in watersheds: considering geomorphometric and topo-hydrological factors. Sustainability 11 (20), 5639.
Sarwono J. 2006 Quantitative and Qualitative Research Methods. Quantitative and Qualitative Research. Graha Ilmu, Yogyakarta.
Şen Z. 2018 Flood Modeling, Prediction and Mitigation. Springer International Publishing, Cham, Switzerland.
Shabani F., Kumar L. & Ahmadi M. 2018 Assessing accuracy methods of species distribution models: AUC, specificity, sensitivity and the true skill statistic. Global Journal of Human Social Science 18 (1), 6–18.
Shahabi H., Shirzadi A., Ronoud S., Asadi S., Pham B. T., Mansouripour F., Geertsema M., Clague J. J. & Bui D. T. 2021 Flash flood susceptibility mapping using a novel deep learning model based on deep belief network, back propagation and genetic algorithm. Geoscience Frontiers 12 (3), 101100.
Sinaga T. W. 2022 Evaluasi Sistem Drainase terhadap Penanggunalangan Banjir di Kecamatan Baruga Kota Kendari Sulawesi Tenggara. Theses, Civil Engineering, Universitas Islam Malang, Indonesia.
Talukdar S., Ghose B., Salam R., Mahato S., Pham Q. B., Linh N. T. T., Costache R. & Avand M. 2020 Flood susceptibility modeling in Teesta River basin, Bangladesh using novel ensembles of bagging algorithms. Stochastic Environmental Research and Risk Assessment 34 (12), 2277–2300.
Tehrany M. S., Pradhan B. & Jebur M. N. 2015 Flood susceptibility analysis and its verification using a novel ensemble support vector machine and frequency ratio method. Stochastic Environmental Research and Risk Assessment 29 (4), 1149–1165.
Tien Bui D., Khosravi K., Li S., Shahabi H., Panahi M., Singh V., Chapi K., Shirzadi A., Panahi S., Chen W. & Bin Ahmad B. 2018 New hybrids of ANFIS with several optimization algorithms for flood susceptibility modeling. Water 10 (9), 1210.
Tuyet N. T., Thanh N. D. & van Tan P. 2019 Performance of SEACLID/CORDEX-SEA multimodel experiments in simulating temperature and rainfall in Vietnam. Vietnam Journal of Earth Sciences 41, 374–387.
Van E. T. & Schwarz A. 2020 Plastic debris in rivers. Wiley Interdisciplinary Reviews: Water 7 (1), e1398.
Waghwala R. K. & Agnihotri P. G. 2019 Flood risk assessment and resilience strategies for flood risk management: a case study of Surat city. International Journal of Disaster Risk Reduction 40, 101155.
Wang Z., Lai C., Chen X., Yang B., Zhao S. & Bai X. 2015 Flood hazard risk assessment model based on random forest. Journal of Hydrology 527, 1130–1141.
Wang Y., Hong H., Chen W., Li S., Panahi M., Khosravi K., Shirzadi A., Shahabi H., Panahi S. & Costache R. 2019b Flood susceptibility mapping in Dingnan County (China) using adaptive neuro-fuzzy inference system with biogeography based optimization and imperialistic competitive algorithm. Journal of Environmental Management 247, 712–729.
Werner M. G. F., Hunter N. M. & Bates P. D. 2005 Identifiability of distributed floodplain roughness values in flood extent estimation. Journal of Hydrology 314 (1–4), 139–157.
Yan H. & Dai Y. 2011 The comparison of five discriminant methods. In: 2011 International Conference on Management and Service Science. IEEE, Wuhan, pp. 1–4.
Yang W., Xu K., Lian J., Ma C. & Bin L. 2018 Integrated flood vulnerability assessment approach based on TOPSIS and Shannon entropy methods. Ecological Indicators 89, 269–280.
Zhang X., Wenhong C., Qingchao G. & Sihong W. 2010 Effects of landuse change on surface runoff and sediment yield at different watershed scales on the Loess Plateau. International Journal of Sediment Research 25 (3), 283–293.
Zhang H., Yao Z., Yang Q., Li S., Baartman J. E., Gai L., Yao M., Yang X., Ritsema C. J. & Geissen V. 2017 An integrated algorithm to evaluate flow direction and flow accumulation in flat regions of hydrologically corrected DEMs. Catena 151, 174–181.
Zhao G., Pang B., Xu Z., Yue J. & Tu T. 2018 Mapping flood susceptibility in mountainous areas on a national scale in China. Science of the Total Environment 615, 1133–1142.
Zzaman R. U., Nowreen S., Billah M. & Islam A. S. 2021 Flood hazard mapping of Sangu River basin in Bangladesh using multi-criteria analysis of hydro-geomorphological factors. Journal of Flood Risk Management 14 (3), e12715.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).