ABSTRACT
Flash floods are highly destructive disasters, posing severe threats to lives and infrastructure. In this study, we conducted a comparative analysis of bivariate and multivariate statistical models and machine learning algorithms to predict flash flood susceptibility in the flood-prone Rheraya watershed. Six models were utilized: frequency ratio (FR), logistic regression (LR), random forest (RF), extreme gradient boosting (XGBoost), K-nearest neighbors (KNN), and naïve Bayes (NB). We considered 12 flash flood conditioning variables, such as slope, elevation, and distance to the river, as independent variables, and 246 inventory points (flash flood locations recorded over the past 40 years and an equal number of non-flash flood locations) as the dependent variable in the modeling process. The area under the curve (AUC) of the receiver operating characteristic was used to validate and compare the performance of the models. The results indicated that distance to the river was the factor contributing most to flash floods in the study area. Moreover, the RF outperformed all the other models, achieving an AUC of 0.86, followed by XGBoost (AUC = 0.85), LR (AUC = 0.83), NB (AUC = 0.76), KNN (AUC = 0.75), and FR (AUC = 0.72). The RF model effectively pinpoints highly susceptible zones, which is critical for establishing precise flash flood mitigation strategies within the region.
HIGHLIGHTS
Comparative assessment of bivariate and multivariate statistical models and machine learning algorithms to classify locations as flash-flooded or not.
Distance to the river and drainage density have the strongest influence on flash floods in the Rheraya watershed.
The RF model outperformed all the other models.
Highly susceptible zones within the ungauged and flood-prone Rheraya watershed were effectively identified.
INTRODUCTION
A flash flood is a rapid rise of water along a stream or in a low-lying urban area. Flash flooding occurs within 6 h of a significant rain event and is usually caused by intense storms that produce heavy rainfall in a short amount of time (Doswell 2015). Flash floods are among the most devastating, costly, and frequent disasters, resulting in extensive damage to infrastructure and properties and even the loss of human lives (Jeyaseelan 2003; Tingsanchali 2012; Kuenzer et al. 2013; Dahri & Abida 2017; Nogueira et al. 2018; Mohanty et al. 2020). More specifically, they are responsible for approximately 84% of global disaster-related deaths (Jamali et al. 2020). In recent years, the increasing frequency of extreme weather events related to global warming, together with urbanization, loss of natural land, deforestation, and changes in land use patterns such as the canalization of water streams, has made flash floods a growing concern for scientific communities worldwide, which expect the severity of this phenomenon to increase (Kundzewicz et al. 2014; Guha-Sapir et al. 2016; Mekonnen & Hoekstra 2016). Effective flood management strategies and accurate identification of high-risk areas are therefore necessary to reduce the impact of these events on human lives and properties (Grothmann & Reusswig 2006; European Union 2007). However, predicting flooding remains a challenging task due to the complex nature of the phenomenon (Kalantari et al. 2014).
In the context of climate change, the severity and intensity of floods are increasing in Morocco (Loudyi et al. 2022). Over recent decades, the country, like many others worldwide, has experienced numerous destructive hydrological events that have affected many of its regions. In fact, flash floods are the most common and dangerous disasters in the country owing to their frequency, magnitude, and sudden onset (Mallouk et al. 2016; Karmaoui & Balica 2019). Moreover, there is a 95% chance that the country will face either earthquakes or floods over the next 30 years (Michel-Kerjan et al. 2014). This is mainly due to the geographical location of the country, which frequently experiences intense rainfall events. The country experienced over 35 major flash floods between 1951 and 2015 (TARGA-AIDE & Zurich Insurance 2015). More specifically, the Rheraya watershed witnessed one of the most destructive flooding events in 1995, when heavy rainfall over a short period of time killed more than 150 people, including 60 tourists (Digby 2000). In 2019, the basin encountered another severe flash flood, which resulted in significant damage to properties. Similarly, flooding in the Ourika basin, which borders the Rheraya basin to the east, resulted in the deaths of more than 200 people in 1995 and more than 60 people in 2002 (JICA 2011). The high risk and significant loss of life in these specific areas can also be attributed to the fact that these exposed areas are often occupied for tourism purposes (El Alaoui El Fels et al. 2018). Therefore, pinpointing the zones that are highly susceptible to flash floods in the Rheraya watershed is of paramount importance for developing precise mitigation strategies in this tourist area.
Flood susceptibility mapping (FSM) is essential for identifying flood-prone areas by considering the various environmental factors that influence floods (Wang et al. 2019). Rainfall-runoff models are valuable for flood forecasting but come with limitations, including the need for calibration and extensive gauging data, which can be resource-intensive (Ludwig et al. 2003; Peel & McMahon 2020). Researchers have utilized diverse methods for creating flood susceptibility maps, including geospatial analysis through geographic information system tools and remote sensing. Recent advancements include the adoption of statistical methods and machine learning (ML) algorithms. Statistical methods are commonly applied for studying spatial phenomena and mitigating flood risks. Tehrany et al. (2014b), for instance, used bivariate and multivariate statistical models to map flood susceptibility in Busan City, South Korea. Similarly, Tien Bui et al. (2019a) conducted flood susceptibility modeling in the Haraz catchment, Iran, using a multivariate logistic regression (LR) model. Both studies reported high model performance in mapping the susceptibility to floods. ML algorithms, on the other hand, such as artificial neural networks (ANNs) (Shu & Burn 2004; Seckin et al. 2013; Liu et al. 2016; Jahangir et al. 2019; Rahman et al. 2021), support vector machines (Tehrany et al. 2014a, 2015; Dazzi et al. 2021), random forests (RFs) (Chapi et al. 2017; Lee et al. 2017; Farzaneh et al. 2019; Vafakhah et al. 2020; Abedi et al. 2022; Ghanim et al. 2023), adaptive neuro-fuzzy inference systems (Ahmadlou et al. 2018), and long short-term memory networks (Apaydin et al. 2020; Dazzi et al. 2021), have enhanced flood risk prediction accuracy by addressing non-linearity. For example, Seydi et al. (2022) and Sellami et al. (2022) conducted comparative analyses of multiple ML models to assess flash flood susceptibility (FFS) in Iran and Tetouan (Morocco), respectively. Their findings demonstrated that while there were minor differences in algorithm performance, all models effectively identified areas at very high risk of flooding. Nonetheless, the literature lacks a comprehensive comparative study that incorporates both statistical models and ML techniques. Specifically, there is limited research assessing the performance of bivariate and multivariate statistical models alongside various ML algorithms in the same geographical area, with only one study focusing on just two ML algorithms based on regression trees (Al-Abadi & Al-Najar 2020).
This study aims to compare the effectiveness of statistical and ML models for predicting FFS and to develop an FFS map for the Rheraya watershed, Morocco. Our specific objectives were (1) to identify the factors contributing most to flash flood occurrence in the region; (2) to create flood susceptibility maps of the region using various models; (3) to assess and compare the performance of the models; and (4) to pinpoint high-risk zones within the study area. The novelty of this study is twofold. First, it is the first to conduct a comparative analysis of bivariate and multivariate statistical models and various ML algorithms to assess FFS within a single study and for a specific area. Second, it addresses a crucial gap by developing the first FFS map of the ungauged Rheraya watershed, an area renowned for its historically destructive flash floods. Through this study, we seek to provide valuable insights that will aid in the effective management and mitigation of flash flood risk in the Rheraya watershed.
The subsequent sections of the paper are organized as follows: Section 2 describes the study area, methodological approach, data sources for modeling, and background information on the statistical and ML models used in this work. Section 3 outlines the obtained results, while Section 4 presents a discussion of the outcomes, including a comparison between the various models and the final FFS maps. Finally, Section 5 summarizes the research findings with closing remarks.
MATERIALS AND METHODS
Study area
The location of the study area with the training and testing flash flood inventory points.
Data and methods
Methodological approach applied for flash flood susceptibility modeling.
Inventory map
Identifying and creating a flash flood inventory map of the study area is a critical step in investigating the relationship between flash flood events and various influencing factors. Current approaches generally rely on historical records, field surveys, and satellite imagery. We first used the historical flood records of the Tensift Hydraulic Basin Agency (ABHT), followed by a field survey in which local residents contributed to the development of a historical flash flood inventory map, and finally, Google Earth images of both pre- and post-flash floods were exploited for further verification. Through these steps, we were able to identify 123 flash flood locations. In addition, 123 points were selected as ‘non-flash flood’ from the areas where there was no evidence of flash floods occurring. Flash FSM is a binary classification; therefore, to generate the training data, we assigned binary values of 1 to flash flood points and 0 to non-flash flood points for the modeling process. The resulting dataset (i.e., 246 points) was then randomly split into a 70% training set and a 30% testing set (Figure 1).
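The random 70/30 split described above can be sketched as follows. This is a minimal illustration in Python rather than the software used in the study; the function name `split_inventory`, the seed, and the `(point_id, label)` layout of the inventory are assumptions introduced here.

```python
import random

def split_inventory(points, train_fraction=0.7, seed=42):
    """Shuffle the inventory and cut it into training and testing subsets."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = points[:]               # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical inventory: 123 flash flood points (label 1) and
# 123 non-flash flood points (label 0), identified by an integer id.
inventory = [(i, 1) for i in range(123)] + [(i, 0) for i in range(123, 246)]

train, test = split_inventory(inventory)
print(len(train), len(test))  # 172 74
```

A plain random split like this one does not guarantee that the two classes stay equally represented in each subset; splitting each class separately would enforce that balance.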
Flash flood influencing factors
The identification of flash flood conditioning factors can greatly affect the accuracy of the mathematical models (Kia et al. 2012). We thoughtfully selected the conditioning factors based on a comprehensive literature review of previous studies (e.g., Bentivoglio et al. 2022), data availability, and the characteristics of flash floods in the Rheraya watershed. We made sure to consider different aspects of the study area, including topographic, hydrological, geological, and land cover features.
Flash flood conditioning factors used in this study: (a) elevation, (b) slope, (c) aspect, (d) curvature, (e) topographic wetness index (TWI), and (f) stream power index (SPI).
Flash flood conditioning factors used in this study: (a) distance to the river, (b) drainage density, (c) rainfall, (d) land cover, (e) normalized difference vegetation index (NDVI), and (f) lithology.
Topographical factors
Topographical features included elevation, slope, aspect, curvature, and distance to the river. Elevation is a key factor in flash flood modeling (Bui et al. 2020; Dodangeh et al. 2020). It is inversely related to flash floods (Fernández & Lutz 2010), meaning that as elevation decreases, the terrain becomes flatter and the amount of water carried by rivers increases (Cao et al. 2016). Slope is an important factor in flash floods because it affects the speed of flowing water (Stevaux et al. 2020). In general, a steeper slope angle leads to higher flow velocity, which can decrease the rate of infiltration and increase water stagnation. Aspect influences floodwater flow directions, which helps to maintain the humidity of the soil (Chu et al. 2020). It indirectly affects the flooding. Slope curvature separates diverging and converging runoff regions, which influences water flow (Torcivia & López 2020). Depending on the slope, runoff accelerates or decelerates. Convex slopes tend to increase overland flow, potentially affecting infiltration and soil saturation (Cao et al. 2016), while concave slopes can slow down overland flow and potentially improve infiltration (Young & Mutchler 1969). Distance to the river is a critical factor in determining an area's vulnerability to flooding in a basin (Tehrany et al. 2015). Areas closer to rivers are more prone to flooding than those farther away (Butler et al. 2006; Chapi et al. 2017).
Hydrological and meteorological factors
Hydrological and meteorological variables directly influence the occurrence and severity of flash floods. Four such factors were taken into consideration in this study: the stream power index (SPI), the topographic wetness index (TWI), drainage density, and rainfall. SPI is a measure of the erosive power of flowing water at a specific point of the topographic surface. It has a significant impact on the fluvial system (Knighton 1999). TWI is a hydrological measure that relates the upslope contributing area at a given point to the local slope angle (Wilson & Gallant 2000; Nhu et al. 2020). It reflects the amount of water likely to accumulate in each pixel of the area (Zhang et al. 2020). Drainage density is the total stream length per unit watershed area (Elmore et al. 2013; Nguyen et al. 2020). Areas with high drainage density have a higher risk of flooding than areas with low drainage density, all other conditions being equal (Chapi et al. 2017). Rainfall is one of the most significant factors that can cause floods (Pourghasemi et al. 2020). When the intensity of the rain exceeds the ground's infiltration capacity, surface runoff is generated and flash floods can occur. To generate the precipitation map, data acquired over a 40-year period (1983–2023) from two meteorological stations were used: Tahanaout, located in the lower catchment, and Armed, located in the upper catchment. The average annual rainfall over the 40 years was computed for each station, and a rainfall map was generated using the inverse distance weighting (IDW) interpolation method in ArcGIS 10.2.
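For reference, the three terrain-derived indices described above are commonly written as shown below. These are the standard textbook forms and not necessarily the exact formulation used to produce the maps in this study; the symbols $A_s$ (specific catchment area), $\beta$ (local slope angle), $L_i$ (length of stream segment $i$), and $A$ (watershed area) are notation introduced here.

```latex
\begin{align}
\mathrm{TWI} &= \ln\!\left(\frac{A_s}{\tan\beta}\right), \\
\mathrm{SPI} &= A_s \tan\beta, \\
D_d &= \frac{\sum_i L_i}{A}.
\end{align}
```

Written this way, TWI grows where a large contributing area drains through gently sloping terrain, whereas SPI grows where a large contributing area coincides with steep slopes.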
Geological and land cover factors
Geological and land cover variables included lithology, land cover, and NDVI. The variety of lithologic structures in a study area can significantly increase or decrease the level of flood risk since the permeability and porosity of these different structures directly influence infiltration and runoff. Land cover affects surface runoff and sediment transport, which directly influence flood frequency (Benito et al. 2010). Flooding is more common in urban areas, whereas vegetation, particularly forests, intercepts precipitation and slows runoff velocity. We created a land cover map using Sentinel-2B images acquired in May 2023. These images were chosen because they exhibited lower cloud coverage and reduced snow cover. The images were classified into five classes (agriculture, bare land, bare rocky soil, forest, and built-up areas) using the maximum likelihood classification in SNAP software (Figure 4(d)). Field surveys and Google Earth images were used to validate the obtained map. The NDVI is a dimensionless index computed as the normalized difference between near-infrared and red reflectance. Its values range from −1 to +1, with low values indicating low vegetation density and high values indicating high vegetation density. NDVI values can indicate changes in vegetation and surface water cover over time (Ahmed & Akter 2017) and reveal the relationship between flooding and vegetation in a basin (Tehrany et al. 2013). To create the NDVI map of the Rheraya watershed, Sentinel-2B images from May 2023 were used, and the map was classified into six classes (Figure 4(e)).
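The NDVI computation can be illustrated per pixel with the short sketch below. It assumes the usual Sentinel-2 band assignment (band 8 for near-infrared, band 4 for red); the function name and the reflectance values are hypothetical and are not taken from the study.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index for one pixel.

    nir and red are surface reflectances (assumed here to come from
    Sentinel-2 bands 8 and 4, respectively).
    """
    denominator = nir + red
    if denominator == 0:
        return 0.0  # convention chosen here for pixels with no signal
    return (nir - red) / denominator

# Hypothetical reflectance pairs (nir, red).
dense_vegetation = ndvi(0.45, 0.05)   # strongly positive
bare_surface = ndvi(0.25, 0.20)       # close to zero
water = ndvi(0.02, 0.06)              # negative

print(dense_vegetation, bare_surface, water)
```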
Flash flood susceptibility modeling
Statistical approaches
Frequency ratio
The FR is one of the most commonly used and trustworthy methods to evaluate the susceptibility to floods worldwide (Rahmati et al. 2016a, 2016b; Samanta et al. 2018). An FR value higher than 1 indicates that the factors have a substantial influence on flash flooding, while an FR value less than 1 means that there is a negative correlation between the flash flood frequency and the conditioning factors (Lee & Talib 2005).
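As a worked illustration, the FR of a class can be computed as the share of flash flood pixels falling in that class divided by the share of the study area occupied by that class. The sketch below applies this commonly used definition to the slope rows reported in Table 1; the function and variable names are introduced here, and the definition is assumed rather than quoted from the text.

```python
def frequency_ratio(flood_pixels, class_pixels):
    """Return the FR of each class of one conditioning factor.

    FR = (flood pixels in class / all flood pixels)
         / (pixels in class / all pixels in the domain)
    """
    total_flood = sum(flood_pixels)
    total_domain = sum(class_pixels)
    return [
        (f / total_flood) / (c / total_domain)
        for f, c in zip(flood_pixels, class_pixels)
    ]

# Slope classes 0-7, 7-18, 18-28, 28-39 and 39-80 degrees (values from Table 1).
slope_flood_pixels = [4062.5, 4062.5, 2031.25, 2812.5, 312.5]
slope_class_pixels = [495495, 502247, 371190, 738683, 352350]

slope_fr = frequency_ratio(slope_flood_pixels, slope_class_pixels)
for label, fr in zip(["0-7", "7-18", "18-28", "28-39", "39-80"], slope_fr):
    print(f"slope {label}: FR = {fr:.2f}")
```

Rounded to two decimals, these values reproduce the slope FR column of Table 1 (1.52, 1.50, 1.01, 0.71, and 0.16).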
Logistic regression
In its standard form, the LR model estimates the probability P of flash flood occurrence as P = 1/(1 + e^(−z)), with z = b0 + b1x1 + b2x2 + … + bnxn, where b0 is the intercept of the model, n is the number of independent variables, b1, b2, …, bn are the coefficients, and x1, x2, …, xn are the flash flood conditioning factors. P ranges from 0 to 1: a value close to 1 suggests a high vulnerability, while a value close to 0 represents a low vulnerability.
ML algorithms
Extreme gradient boosting
The optimization procedure in XGBoost starts by fitting a first learner to the dataset; each subsequent learner is then fitted to the residuals of the current model. The procedure ends when the stopping criteria are reached. Compared with other models, XGBoost is also more robust to missing data in the dataset. The caret package in the R statistical software (R 3.6.2; R Core Team 2018) was used to apply the XGBoost algorithm.
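The residual-fitting loop described above can be sketched in a few lines. Note that this is plain gradient boosting with a squared-error loss and one-split trees on a single feature, written in Python for illustration; it is not XGBoost's regularized objective and not the caret workflow used in the study. All names and the toy data are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single threshold on x that best fits the residuals."""
    best = None
    for t in sorted(set(x))[:-1]:  # every value except the largest is a candidate
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= t else right_mean

def boost(x, y, n_rounds=50, learning_rate=0.3, tol=1e-6):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        if max(abs(r) for r in residuals) < tol:  # stopping criterion
            break
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

# Toy data: distance to the river (m) and a flood / non-flood label.
distance = [50, 120, 250, 400, 700, 1100]
flooded = [1, 1, 1, 0, 0, 0]
predict = boost(distance, flooded)
print(predict(100), predict(900))
```

Each round shrinks the remaining residuals by the learning rate, which is why a small learning rate needs more rounds before the stopping criterion is met.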
Random forest
K-nearest neighbors
KNN is an easy-to-use supervised ML algorithm that can be used for both regression and classification tasks. It is a non-parametric, lazy algorithm: it makes no assumptions about the distribution of the data and defers all computation to prediction time, when it identifies the K training samples nearest to a query point based on distance. This is especially useful when modeling hydrological phenomena, such as floods, where there is little prior knowledge of the data distribution (Wettschereck et al. 1997). The optimal number of neighbors typically depends on the regression or classification metric used. For continuous variables, the Euclidean distance is the most commonly used distance metric, while for discrete variables, the Hamming distance is the most typical. As a rule of thumb, K is often set to the square root of the number of samples, although the best value varies with the dataset (Duda et al. 2012; Guo et al. 2023).
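A minimal sketch of KNN classification with the Euclidean distance and the square-root rule of thumb is given below. It is written in Python for illustration and is not the implementation used in the study; the function names, the choice of forcing K to be odd, and the toy feature values are assumptions made here.

```python
import math
from collections import Counter

def rule_of_thumb_k(n_samples):
    """Square root of the sample count, nudged to an odd number.

    Forcing an odd K is a choice made in this sketch so that a
    two-class majority vote cannot end in a tie.
    """
    k = max(1, round(math.sqrt(n_samples)))
    return k if k % 2 == 1 else k + 1

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to the query."""
    by_distance = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy data: (scaled distance to river, scaled drainage density) per point.
train_x = [(0.10, 0.90), (0.20, 0.80), (0.15, 0.70),
           (0.80, 0.20), (0.90, 0.10), (0.70, 0.30)]
train_y = [1, 1, 1, 0, 0, 0]

near_river = knn_predict(train_x, train_y, (0.20, 0.75), k=3)
far_from_river = knn_predict(train_x, train_y, (0.85, 0.15), k=3)
print(near_river, far_from_river)  # 1 0
print(rule_of_thumb_k(172))        # 13
```

Because the vote depends on raw distances, features measured on very different scales are usually rescaled first so that no single factor dominates.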
Naïve Bayes
Model validation
Assessing the performance of a model is a critical step in probabilistic modeling, as it ensures the reliability of the output. There are various metrics that have been used to evaluate the performance of flash flood prediction models. The area under the receiver operating characteristic curve (AUC ROC) is used to assess the performance of our models. AUC computes the entire two-dimensional area below the ROC curve and provides an aggregate measure of classification performance across all potential thresholds. It quantifies the probability that the model will correctly rank a randomly chosen positive instance higher than a randomly chosen negative instance (Hanley 1989). A higher AUC value indicates a better performance of the model (Shirzadi et al. 2019).
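The ranking interpretation of the AUC given above can be made concrete with a small sketch: the AUC equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as half. The function name and the labels and scores below are made up for illustration.

```python
def auc_from_scores(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))

# Hypothetical test points: 1 = flash flood, 0 = non-flash flood.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

auc = auc_from_scores(labels, scores)
print(round(auc, 3))  # 0.889, i.e. 8 of the 9 pairs are ranked correctly
```

This pairwise form is quadratic in the number of points, which is acceptable for an inventory of a few hundred points.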
RESULTS AND ANALYSIS
FR model
The susceptibility of the Rheraya watershed to flash floods was assessed using the FR bivariate statistical method with geospatial techniques. The FR for various classes of each factor was used to understand and determine the significance or probability of a subclass under flash flood occurrences (Table 1).
Spatial relationship between the flooded area and its related factors using the FR method
Factor | Factor class | Number of flash flood pixels | Percentage of flash flood pixels (%) | Number of pixels in class | Percentage of domain (%) | FR |
---|---|---|---|---|---|---|
Elevation (m) | 400–981 | 3,593.75 | 27.06 | 705,148 | 28.66 | 0.94 |
Elevation (m) | 981–1,610 | 4,218.75 | 31.76 | 536,973 | 21.83 | 1.46 |
Elevation (m) | 1,610–2,188 | 5,156.25 | 38.82 | 737,981 | 30.00 | 1.29 |
Elevation (m) | 2,188–2,928 | 312.5 | 2.35 | 317,140 | 12.89 | 0.18 |
Elevation (m) | 2,928–4,211 | 0.00 | 0.00 | 162,723 | 6.61 | 0.00 |
Slope (°) | 0–7 | 4,062.5 | 30.59 | 495,495 | 20.14 | 1.52 |
Slope (°) | 7–18 | 4,062.5 | 30.59 | 502,247 | 20.42 | 1.50 |
Slope (°) | 18–28 | 2,031.25 | 15.29 | 371,190 | 15.09 | 1.01 |
Slope (°) | 28–39 | 2,812.5 | 21.18 | 738,683 | 30.03 | 0.71 |
Slope (°) | 39–80 | 312.5 | 2.35 | 352,350 | 14.32 | 0.16 |
SPI | 0–26,678 | 12,500 | 94.12 | 2,453,021 | 99.72 | 0.94 |
SPI | 26,678–92,923 | 0.00 | 0.00 | 2,868 | 0.12 | 0.00 |
SPI | 92,923–199,745 | 312.5 | 2.35 | 1,373 | 0.06 | 42.16 |
SPI | 199,745–352,565 | 468.75 | 3.53 | 1,136 | 0.05 | 76.43 |
SPI | 352,565–913,374 | 0.00 | 0.00 | 1,567 | 0.06 | 0.00 |
TWI | 5–6 | 3,437.5 | 25.88 | 1,094,759 | 44.50 | 0.58 |
TWI | 6–8 | 3,906.25 | 29.41 | 793,326 | 32.25 | 0.91 |
TWI | 8–10 | 2,656.25 | 20.00 | 399,365 | 16.23 | 1.23 |
TWI | 10–13 | 1,093.75 | 8.24 | 136,934 | 5.57 | 1.48 |
TWI | 13–21 | 2,187.5 | 16.47 | 35,581 | 1.45 | 11.39 |
Aspect | Flat | 2,968.75 | 22.35 | 462,063 | 18.78 | 1.19 |
Aspect | North | 3,437.5 | 25.88 | 351,613 | 14.29 | 1.81 |
Aspect | Northeast | 1,093.75 | 8.24 | 262,371 | 10.67 | 0.77 |
Aspect | East | 625 | 4.71 | 285,843 | 11.62 | 0.40 |
Aspect | South | 2,968.75 | 22.35 | 368,264 | 14.97 | 1.49 |
Aspect | Southwest | 937.5 | 7.06 | 389,575 | 15.84 | 0.45 |
Aspect | West | 1,250 | 9.41 | 340,236 | 13.83 | 0.68 |
Distance to river (m) | 0–300 | 10,781.25 | 81.18 | 492,783 | 20.03 | 4.05 |
Distance to river (m) | 300–600 | 1,406.25 | 10.59 | 813,071 | 33.05 | 0.32 |
Distance to river (m) | 600–800 | 0.00 | 0.00 | 565,760 | 23.00 | 0.00 |
Distance to river (m) | 800–1,000 | 156.25 | 1.18 | 337,585 | 13.72 | 0.09 |
Distance to river (m) | 1,000–1,200 | 937.5 | 7.06 | 250,472 | 10.18 | 0.69 |
Drainage density | 0–1.5 | 625 | 4.71 | 484,182 | 19.68 | 0.24 |
Drainage density | 1.5–3 | 312.5 | 2.35 | 550,492 | 22.38 | 0.11 |
Drainage density | 3–4.5 | 625 | 4.71 | 543,162 | 22.08 | 0.21 |
Drainage density | 4.5–6 | 3,906.25 | 29.41 | 488,064 | 19.84 | 1.48 |
Drainage density | 6–8.32 | 7,812.5 | 58.82 | 387,534 | 15.75 | 3.73 |
NDVI | −0.32 | 156.25 | 1.18 | 24,787 | 1.01 | 1.17 |
NDVI | 0–0.15 | 3,906.25 | 29.41 | 774,171 | 31.47 | 0.93 |
NDVI | 0.15–0.3 | 5,468.75 | 41.18 | 819,295 | 33.31 | 1.24 |
NDVI | 0.3–0.45 | 1,093.75 | 8.24 | 454,964 | 18.49 | 0.45 |
NDVI | 0.45–0.85 | 2,656.25 | 20.00 | 305,485 | 12.42 | 1.61 |
Rainfall (mm) | 369–373 | 4,062.5 | 30.59 | 713,544 | 29.01 | 1.05 |
Rainfall (mm) | 373–378 | 1,562.5 | 11.76 | 658,802 | 26.78 | 0.44 |
Rainfall (mm) | 378–384 | 781.25 | 5.88 | 226,069 | 9.19 | 0.64 |
Rainfall (mm) | 385–389 | 2,812.5 | 21.18 | 278,829 | 11.33 | 1.87 |
Rainfall (mm) | 389–392 | 4,062.5 | 30.59 | 582,427 | 23.68 | 1.29 |
Land cover | Agriculture | 4,687.5 | 35.29 | 208,855 | 8.49 | 4.16 |
Land cover | Built-up and urban | 312.5 | 2.35 | 24,114 | 0.98 | 2.40 |
Land cover | Bare soil | 5,000 | 37.65 | 1,333,323 | 54.20 | 0.69 |
Land cover | Forest | 2,812.5 | 21.18 | 781,870 | 31.78 | 0.67 |
Land cover | Rocky bare soil | 468.75 | 3.53 | 111,679 | 4.54 | 0.78 |
Lithology | Impermeable | 6,718.75 | 50.59 | 1,106,856 | 44.99 | 1.12 |
Lithology | Permeable | 3,906.25 | 29.41 | 770,101 | 31.31 | 0.94 |
Lithology | Semipermeable | 2,656.25 | 20.00 | 501,845 | 20.40 | 0.98 |
Curvature | Concave | 468.75 | 3.53 | 223,604 | 9.08 | 0.39 |
Curvature | Flat | 1,093.75 | 8.24 | 2,058,021 | 83.66 | 0.10 |
Curvature | Convex | 11,718.75 | 88.24 | 178,340 | 7.24 | 12.17 |
The analysis shows that flash floods occur most frequently at elevations of 981–1,610 m (FR = 1.46), whereas the FR drops to 0.18 and 0.00 in the two highest elevation classes, indicating that flash floods are unlikely to occur at higher altitudes in the Rheraya basin. The slope angle was also found to be a significant factor, with the highest probability of flash flood occurrence observed in the 0–7-degree and 7–18-degree classes, with FR values of 1.52 and 1.50, respectively. In addition, the FR values decrease as the slope angle increases, indicating a reduced probability of flash floods on steeper slopes. The relationship between flash floods and the slope aspect reveals that flash floods are frequent on flat (1.19), north-facing (1.81), and south-facing (1.49) terrain surfaces. Convex slope curvatures had the highest probability of flood occurrence, with an FR value of 12.17. The proximity to a river was found to be a significant factor, with the highest FR value (4.05) observed within a distance of 0–300 m from a river, as flash floods are more likely to occur close to the riverbank. For land cover, the highest FR values were observed in the agriculture (4.16) and built-up (2.40) classes. This might be due to the impermeable surfaces and sparse vegetation in these areas. Drainage density was also found to have an impact on flash floods, with the highest probability of occurrence at densities of 6–8.32 (FR = 3.73). Impermeable rocks were the most sensitive to flash floods, with an FR value of 1.12. The FR increased with the TWI, rising from 0.58 in the lowest TWI class to 11.39 in the highest. The stream power index (SPI) indicated that areas with high stream power (199,745–352,565) had a high likelihood of flash flood occurrence, with an FR value of 76.43. For the vegetation index (NDVI), the class with the highest values (0.45–0.85) had the highest FR (1.61). Lastly, rainfall values between 389 and 392 mm had an FR value of 1.29, indicating a higher occurrence of flash floods, although the highest rainfall FR (1.87) was found in the 385–389 mm class.
Flash flood susceptibility models using (a) FR, (b) LR, (c) XGBoost, (d) RF, (e) KNN, and (f) NB.
The percentage of flash flood susceptible areas under different classes of the six models.
LR model
LR is a commonly used technique for determining the magnitude of the relationship between flash flood conditioning factors and locations. The relative importance of each conditioning factor is expressed by the LR coefficients, which were computed using R software (Table 2).
LR coefficients of different factors
Factor | LR coefficient |
---|---|
Constant | −8.594 × 10⁻¹ |
Elevation | −1.437 × 10⁻³ |
Slope | −8.159 × 10⁻⁴ |
Aspect | −4.935 × 10⁻³ |
Curvature | 2.881 × 10⁻² |
Distance to the river | 7.35 × 10⁻⁵ |
Drainage density | 1.016 |
Rainfall | 2.353 × 10⁻² |
TWI | −4.34 × 10⁻² |
SPI | 9.947 × 10⁻⁶ |
Land cover | −6.329 × 10⁻⁷ |
NDVI | 2.585 × 10⁻¹ |
Lithology | −8.371 × 10⁻⁷ |
The results revealed that curvature, distance to the river, drainage density, rainfall, SPI, and NDVI have positive coefficients and are thus positively associated with flash flooding in the study area, while the remaining factors have negative coefficients. A positive coefficient means that higher values of the variable increase the likelihood of flash flooding, while a negative coefficient means that higher values reduce it. It is clear that flash floods in the study area are not controlled by a single factor but rather by a combination of factors. Based on the absolute values of the LR coefficients, the factor with the greatest impact on flash flood occurrence in the study area was found to be drainage density.
Although slope, elevation, land cover, and lithology seem like factors that would influence FFS (Al-Juaidi et al. 2018; Anucharn 2019), their significance in the LR model can vary depending on the study area and the scale of analysis. In some cases, other variables such as rainfall intensity, distance to the stream, and drainage density patterns may overshadow the effects of these variables. For instance, in areas with high rainfall intensity, even relatively flat terrain can experience flash floods due to poor drainage or urbanization (Shao & Shao 2019). The lack of significance of land cover in the analysis could be due to the complexity of land cover types and their interaction with other variables. In some cases, land cover may indirectly influence FFS through factors such as soil infiltration rates. Similar to land cover, lithology's lack of contribution might be attributed to its interaction with other variables or the specific geological characteristics of the study area. Certain lithological units may promote rapid runoff and contribute to flash floods, but their significance may not always be captured in the model due to collinearity with other predictors.
Directly comparable results are difficult to find in other studies, but some studies have reported coefficients close to 0 or negative for factors that are expected to have a high influence on flash floods. Tehrany et al. (2014a), for instance, showed that soil drainage was almost the least influential factor and that the effects of slope and soil were not statistically significant in the model. Nandi et al. (2016) also revealed that the distance to the stream was the factor contributing least to floods in their study area. In summary, the lack of significance of slope, elevation, land cover, and lithology in our LR analysis may be due to the interplay of various factors and the specific characteristics of the study area.
An FFS map was created by multiplying the LR coefficients with their corresponding conditioning factors (Figure 5(b)). These values were then classified into five classes using the natural break classification technique, i.e., very high (10.06%), high (19.81%), moderate (26.95%), low (27.6%), and very low (15.58%) (Figure 6).
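As an illustration only, the per-pixel computation described above can be sketched as follows. The sketch assumes the standard logistic form (an intercept plus the weighted sum of the factors, passed through the logistic function); the factor names, coefficient values, and class break values are hypothetical placeholders, not the values fitted in this study, and the breaks would in practice come from a natural break classification of the whole map.

```python
import math

def lr_susceptibility(factors, coefficients, intercept=0.0):
    """Return a susceptibility value in (0, 1) for one pixel.

    factors:      dict mapping factor name -> (normalized) factor value
    coefficients: dict mapping factor name -> LR coefficient
    """
    # Linear predictor: intercept plus sum of coefficient * factor value
    z = intercept + sum(coefficients[name] * value
                        for name, value in factors.items())
    # Logistic transform maps the linear predictor to a probability
    return 1.0 / (1.0 + math.exp(-z))

def classify(value, breaks):
    """Assign one of five classes given four ascending break values."""
    labels = ["very low", "low", "moderate", "high", "very high"]
    for label, upper in zip(labels, breaks):
        if value <= upper:
            return label
    return labels[-1]

# Hypothetical coefficients and pixel values, for illustration only
coefficients = {"drainage_density": 1.2, "slope": -0.4}
pixel = {"drainage_density": 0.8, "slope": 0.3}
# Hypothetical class breaks (would come from natural break classification)
breaks = [0.2, 0.4, 0.6, 0.8]

p = lr_susceptibility(pixel, coefficients, intercept=-0.5)
print(round(p, 3), classify(p, breaks))
```

A positive coefficient raises the linear predictor and hence the susceptibility value, while a negative coefficient lowers it, which matches the interpretation of the coefficient signs given above.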
ML models
The importance of flash flood conditioning factors. g1: Semipermeable, g2: impermeable, g3: permeable, l1: built-up and urban, l2: bare rocky land, l3: agriculture, l4: forest, and l5: bare land.
XGBoost, RF, NB, and KNN algorithms were utilized to evaluate the susceptibility of flash floods for each pixel of the basin. The RF model proved to be the most successful in terms of prediction performance. A number of classification methods were applied in several studies to classify the FFS maps, such as the equal interval, regular interval, standard deviation, quantile, natural break, and manual approach. Among these methods, the natural break and quantile approaches are the most popular in the literature (Tehrany et al. 2019a, 2019b; Tien Bui et al. 2019a, 2019b), and thus, this study used the natural break classification method to classify the FFS maps into five classes: very low, low, moderate, high, and very high (Figure 5(c)–5(f)).
Based on the results, the KNN model assigned the smallest share of the area (1.62%) to the low class, followed by the moderate (18.83%), very low (23.05%), high (27.6%), and very high (28.9%) classes. The NB model assigned 49.35, 21.43, 12.66, 9.42, and 7.14% of the area to the very high, high, moderate, low, and very low classes, respectively. For the RF model, 10.06% of the area fell in the very high class, followed by the moderate (10.39%), high (13.64%), low (20.13%), and very low (45.78%) classes. Lastly, for the XGBoost model, the area percentages were 17.53% for the very high FFS class, 6.17% for the high, 4.87% for the moderate, 8.12% for the low, and 63.31% for the very low (Figure 6).
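Area percentages of this kind can be obtained by counting the pixels assigned to each class. The following is a minimal sketch, assuming that every pixel covers the same ground area and that the classified map is available as a flat list of class labels; the function name and the toy input are hypothetical.

```python
from collections import Counter

def class_area_percentages(pixel_classes):
    """Return the percentage of pixels falling in each susceptibility class.

    Assumes equal-area pixels, so pixel share equals area share.
    """
    counts = Counter(pixel_classes)
    total = len(pixel_classes)
    return {label: round(100.0 * n / total, 2) for label, n in counts.items()}

# Toy classified map, for illustration only
toy_map = ["very low"] * 3 + ["high"]
print(class_area_percentages(toy_map))
```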
The findings indicate that the KNN and NB models tend to overestimate FFS in regions with high risk while underestimating it in areas with low risk, in contrast to the RF and XGBoost models. Nevertheless, all four models unanimously indicate a very high flash flood risk in locations near the main river within the study area.
Validation and comparison
The accuracy and predictive ability of the six FFS models were evaluated using the area under the ROC curve (AUC), computed on both the training and testing data. The higher the AUC value, the better the model's prediction performance.
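For readers unfamiliar with the metric, the AUC can be interpreted as the probability that a randomly chosen flooded location receives a higher susceptibility score than a randomly chosen non-flooded location. The sketch below computes it directly from that pairwise definition; it is an illustrative implementation with hypothetical inputs, not the software used in this study, and its quadratic cost is acceptable only for small inventories.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from the pairwise (rank) definition.

    labels: iterable of 1 (flooded) or 0 (non-flooded)
    scores: iterable of predicted susceptibility values
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # flooded point ranked above non-flooded point
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy example: 3 of the 4 flooded/non-flooded pairs are ranked correctly
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of flooded and non-flooded points.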
The validation of the six flash flood susceptibility models using the ROC curve based on the training points (a) and the validation points (b).
Although all six FFS models exhibited high to moderate prediction accuracy with AUC values greater than 0.70, the RF model was determined to be the most effective for predicting FFS in the study area.
DISCUSSION
This study highlights the multifaceted nature of flash flood occurrences, the interplay of conditioning factors, and the potential of bivariate and multivariate statistical models and ML techniques for flash FSM in the Rheraya watershed, a flood-prone region. These susceptibility maps are invaluable resources for a wide range of stakeholders, including hazard managers, urban planners, and policymakers, to prevent flash flood-related injuries and property losses (Figure 5; Shokouhifar et al. 2022). Moreover, the increasing risk of flash floods, driven by various factors such as rapid urbanization, deforestation, canalization, changes in land use, and the effects of climate change (i.e., changes in the intensity and frequency of heavy precipitation), highlights the critical need for improved mapping of FFS (Hapuarachchi et al. 2011; Badraq Nejad et al. 2019; Prasad et al. 2021). Therefore, our study holds significant implications for understanding and managing FFS in the Rheraya basin and similar ungauged regions known for their past destructive flash flood occurrences.
The results indicate that all the models identified several significant factors influencing FFS in the Rheraya watershed, such as elevation, slope, distance to the river, NDVI, and rainfall. This highlights the complex nature of flash floods, reflecting the combined impact of terrain features, land use patterns, and hydrological conditions. Identifying the factor that contributes most to flash floods in our area is important because it allows for targeted and effective mitigation strategies, resource allocation, and risk assessment. Distance to the river emerged as the most influential factor for flash floods in the region. Many studies have also confirmed the important contribution of this factor to flood occurrence (Rahmati et al. 2016a, 2016b; Pham et al. 2020; Bansal et al. 2022; Chaulagain et al. 2023). This is expected, as areas near riverbanks are more susceptible due to their increased exposure to rapid water flow as well as their role as natural drainage paths. Therefore, urban development in the Rheraya watershed should favor safe locations, particularly those away from riverbanks.
Our FFS modeling involved six approaches, including two statistical methods (i.e., FR and LR) and four ML algorithms (i.e., RF, XGBoost, NB, and KNN). The comparative analysis revealed the RF model as the best-performing model, with an AUC value of 0.86, followed by XGBoost (AUC = 0.85), LR (AUC = 0.83), NB (AUC = 0.76), KNN (AUC = 0.75), and FR (AUC = 0.72). Multiple studies align with our findings, emphasizing the robustness of RF in modeling FFS (Lee et al. 2017). For instance, Islam & Chowdhury (2024) conducted a local-scale FFS assessment in northeastern Bangladesh using RF and support vector machine algorithms, demonstrating RF's superior performance. Similarly, Ghanim et al. (2023), Abedi et al. (2022), and Vafakhah et al. (2020) assessed various ML algorithms and statistical techniques to delineate FFS at regional scales in Jeddah City (Saudi Arabia), the Bâsca Chiojdului river basin (Romania), and Gilan province (Iran), respectively, also finding RF to be the most effective. In addition, Zhao et al. (2018) mapped flood susceptibility in mountainous areas on a national scale in China using RF, artificial neural network, and support vector machine methods, with RF emerging as the optimal model. Hence, RF emerges as a consistently effective model for FFS modeling. However, it is worth noting that all models in our study exhibited AUC values exceeding 0.70, indicating their efficacy in assessing FFS.
3D representation of the RF model with the areas most vulnerable to flash floods: (a) Moulay Brahim, (b) Asni, (c) Tinitine, (d) Imlil, and (e) Armed.
Although this study showed the effectiveness and robustness of statistical methods and ML techniques, particularly RF, in identifying areas at high risk of flash floods in the ungauged and flood-prone Rheraya watershed, and in identifying the factors contributing most to flash floods in the region, our approach is not without limitations. The models applied rely on various assumptions, and the results may be sensitive to the choice of conditioning factors and their classification (Tehrany et al. 2019a, 2019b). Future research could explore the use of additional data sources and the validation of the models under different conditions to enhance the accuracy of the FFS maps. Climate models, for instance, provide data for different future projections, which could lead to better FFS mapping. In addition, collaborative efforts in data collection and sharing can enhance data accessibility, spanning from local to global flood inventories. The availability of such an expanded dataset can help in evaluating the models' performance across diverse spatial scales and in assessing their adaptability and generalizability. The flashiness of floods is significantly influenced by the characteristics of rainfall events. Saharia et al. (2021) showed that spatial variability of rainfall impacts flash flood severity as much as basin geomorphology and climatology. Therefore, future work should also investigate adding rainfall event-related factors such as intensity, duration, spatial distribution, and temporal patterns to the modeling, and consequently developing more robust FFS maps. It is crucial to note that our study utilized data from just one study site, which limits the generalizability of our findings to other ungauged basins. To ensure the scalability of our approach, it is therefore essential to assess the accuracy of our methods across basins with diverse topographical, hydrological, geological, geographical, and anthropogenic characteristics.
Future research should aim to confirm the reliability of ML compared with statistical models across various watersheds, considering the variability in factors that could impact model outcomes.
CONCLUSION
Flooding risks are constantly increasing due to climate change and land use changes. Flash floods, in particular, are very difficult to predict, and their consequences can be extreme. Creating an FFS map using mathematical and statistical methods applied to geospatial data is a significant advancement in managing the hazards of flash floods.
The Rheraya basin is susceptible to flash flooding due to its steep slopes, high altitude, low permeability, and high drainage density, among other characteristics. To mitigate the severity of flash floods in this region, effective methods for identifying the most vulnerable locations are necessary. Recent advancements in statistical approaches and ML algorithms have proven to be valuable tools in this endeavor. In this study, six models were evaluated and applied, including LR, FR, RF, NB, XGBoost, and KNN. Input data for the models consisted of 12 independent conditioning factors, including elevation, land use, slope, NDVI, rainfall, lithology, distance to the river, and drainage density, and one dependent variable, the flash flood inventory, which comprises 246 locations recorded over the past 40 years, obtained from historical archives and field surveys and validated with satellite data. The flash flood inventory was divided into 70% for training the models and 30% for validating them. Results indicated that the RF model performed the best based on the ROC curve evaluation metric (AUC = 0.86), followed by XGBoost (AUC = 0.85), LR (AUC = 0.83), NB (AUC = 0.76), KNN (AUC = 0.75), and FR (AUC = 0.72). The analysis also revealed that distance to the river and drainage density were the most influential factors in the occurrence of flash floods in the Rheraya basin.
The current study has demonstrated the robustness and effectiveness of the methodology adopted for the analysis of susceptibility to flash floods. Therefore, it is strongly recommended to extend the application of the RF model to predict the susceptibility to flash floods in other basins, especially in the case of ungauged watersheds that lack hydrological data. This will enable a more accurate assessment of vulnerability in areas with high human traffic, allowing for the precise identification of high-risk zones. Furthermore, these models can also be utilized to guide the installation of meteorological equipment and evaluate the suitability of the location of flood warning stations. Future work should broaden the factors considered in susceptibility modeling to better understand flash flood vulnerability. Collaborative data collection and improved data accessibility will help incorporate additional information, making flood management and disaster risk reduction more effective on a larger scale.
ACKNOWLEDGEMENTS
We thank the Tensift Hydraulic Basin Agency for providing data that allowed us to carry out this research.
FUNDING
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
ETHICAL APPROVAL
The authors confirm that this article is original research and has not been published or presented previously in any journal or conference in any language (in whole or in part).
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.