Abstract
In an effort to improve tools for effective flood risk assessment, we applied machine learning algorithms to predict flood-prone areas in Amol city (Iran), a site with recent floods (2017–2018). An ensemble approach was then implemented to predict hazard probabilities using the best machine learning algorithms (boosted regression tree, multivariate adaptive regression spline, generalized linear model, and generalized additive model), selected on the basis of a receiver operating characteristic-area under the curve (ROC-AUC) assessment. The algorithms were all trained and tested on 92 randomly selected points, information from a flood inundation survey, and geospatial predictor variables (precipitation, land use, elevation, slope percent, curve number, distance to river, distance to channel, and depth to groundwater). The ensemble model had ROC-AUC values of 0.925 and 0.892 for training and testing data, respectively. We then created a vulnerability map from data on building density, building age, population density, and socio-economic conditions and assessed risk as a product of hazard and vulnerability. The results indicated that distance to channel, land use, and runoff generation were the most important factors associated with flood hazard, while population density and building density were the most important factors determining vulnerability. Areas of highest and lowest flood risks were identified, leading to recommendations on where to implement flood risk reduction measures to guide flood governance in Amol city.
INTRODUCTION
In general, floods occur when the runoff water volume exceeds the transport capacity of channels (Lytle & Poff 2004; Tockner et al. 2010; Borga et al. 2011; Petroselli et al. 2019). Flooding is intensified when drainage systems are improperly designed and/or maintained (Sadegh et al. 2018). Flood vulnerability increases when development proceeds along river channels and on floodplains (Zhou et al. 2012; Liu et al. 2016). A common narrative in many Asian cities where flooding is prevalent is increased vulnerability to flood disasters owing to uncontrolled development in flood-prone areas (Julien et al. 2009; Ghani et al. 2012; Dewan 2013; Engeland et al. 2018). In future, the flood risk is projected to increase with an intensification of hydrological and climate variables (Muller 2007; Roy 2009; Zhou et al. 2012; Yin et al. 2015; Hinkel et al. 2014; Muis et al. 2015).
Flooding impacts include the loss of life, direct (property/asset loss or damage) and indirect (the loss of livelihoods) economic damage, and damage to transportation, utility, infrastructure, and communication systems (Sharif et al. 2016; Wu et al. 2018). Impacts may persist for months or years, and secondary impacts, including health-related problems, may emerge (Tapsell et al. 2002; Sinnakaudan et al. 2003). As more than half the world's population already lives in urban areas and this proportion is projected to increase to two-thirds by 2050 (UNPD 2014; El Alfy 2016), there is an immediate need to reduce flood risks in urban population centers worldwide.
Floods in cities are often devastating because high densities of people and of assets are concentrated in areas where the flood potential is exacerbated by disturbances to nature (Kjeldsen 2010; Cherqui et al. 2015; Darabi et al. 2019). For example, urbanization is usually associated with a high proportion of impervious features (e.g., roads, walkways, and car parks), disturbed river/stream channels, and artificial storm drainage systems (Kundzewicz et al. 2010; Suriya & Mudgal 2012). Hydrological changes associated with urbanization include reductions in infiltration, evapotranspiration, and groundwater recharge and an increase in the volume of fast-flowing surface water during most storms (Schueler 1994; Mulligan & Crampton 2005; Haghighi et al. 2019; Pirnia et al. 2019). These changes affect the severity, timing, and extent of flooding (Nirupama & Simonovic 2007; Dewan & Yamaguchi 2008; Du et al. 2010; Suriya & Mudgal 2012).
Urban flood mapping is an evolving challenge in flood risk reduction planning by city managers and policymakers (Noh et al. 2016; Darabi et al. 2018; Yaraghi et al. 2019). Effective risk and vulnerability assessment requires a thorough knowledge of the conditions affecting flooding and exacerbating flood impacts (Ouma & Tateishi 2014). At the heart of such assessments are flood risk maps, which can be created by a number of approaches, including hydrological and hydraulic models (Brimicombe & Bartlett 1996; Booij 2005; Masood & Takeuchi 2012), the integration of analytic hierarchy process (AHP) and geographic information system (GIS) techniques (Ouma & Tateishi 2014), frequency ratio (Khosravi et al. 2016), multi-criteria evaluation (Meyer et al. 2009), systems simulation (Amendola et al. 2000), and probability-based analysis (Jalayer et al. 2014). An emerging technology involves machine learning methods (Chau et al. 2005; Maier et al. 2010; Lamovec et al. 2013; Choubin et al. 2018; Termeh et al. 2018; Zhao et al. 2018). Despite recent advances in producing maps for flood risk assessments (Choubin et al. 2018), limitations still exist and new approaches are still needed.
The main objectives of this study were to (i) combine machine learning techniques, GIS, and data on environmental conditioning factors to produce flood risk maps, (ii) compare the new machine learning techniques with previous ones, and (iii) apply the ensemble model and rank the importance of different conditioning factors in urban flood hazard prediction to evaluate flood vulnerability in Amol city, Iran.
The novelty of the study lies in (i) comparing formerly used algorithms with new algorithms, including support vector machine (SVM), random forest (RF), maximum entropy (Maxent), boosted regression tree (BRT), multivariate adaptive regression spline (MARS), generalized linear model (GLM), and generalized additive model (GAM); (ii) developing a spatial framework for urban flood vulnerability mapping by applying new urban conditioning factors; and (iii) introducing an ensemble model for urban flood risk reduction. The ensemble model combines the predictions of the best-performing individual models into one integrated model, exploiting their respective advantages to increase prediction accuracy.
STUDY AREA AND DATA
Amol city (36°26′02″–36°29′45″N; 52°19′14″–53°23′50″E) had a population of 237,528 in 2016, making it the third largest city in Mazandaran Province in northern Iran (Lotfi et al. 2016). It is located at an altitude of 59–137 m above mean sea level (masl) and expanded in area from 21 to 27 km2 between 1998 and 2017 (Figure 1). Amol is located on the Haraz River, which passes through the middle of the city on its way to the Caspian Sea. The residential areas of the city are surrounded mainly by agricultural land, orchards, and high mountains covered by forest. The mean annual rainfall in the region is 680 mm, and the climate is semi-humid (Sahin 2012; Choubin et al. 2018). Over recent decades, urban development has led to extensive changes in hydrological processes and drainage systems in the city. As a result, the number of urban flood events has increased year on year over roughly the past 10 years. Notable flood events occurred on 12 November 2012 (with damage to more than 40 residential areas), 16 September 2015 (31 areas), 24 November 2015 (40 areas), and 15 October 2016 (10 areas) (Sedaghat et al. 2016).
METHODS
Prediction of flood-prone areas
Machine learning models
We applied the following seven machine learning algorithms to predict flooded areas using geospatial predictor variables.
Support vector machine: A supervised learning model that enables the computer to learn how to analyze data for classification and regression (Chen et al. 2017). SVMs are popular because of their good empirical performance (compared with other models, such as artificial neural networks), easy training process, avoidance of local minima, suitability for multi-dimensional data, and tradeoff between complexity and error (Chen et al. 2017).
Random forest (suggested by Breiman (2001)): A classification and regression technique based on assembling a large number of decision trees. Specifically, it is an ensemble of trees constructed from a training dataset and internally validated to predict a dependent variable from given independent variables (Boulesteix et al. 2012). RF relies on two powerful machine learning techniques: bagging and random feature selection. In bagging, each tree is trained on a bootstrap sample of the training data, and predictions are made by a majority vote of the trees. In addition, the model randomly selects a subset of features as split candidates at each node when growing a tree (Jiang et al. 2007).
Maximum entropy (Maxent) (proposed by Phillips et al. (2006) and specifically designed for ecological modeling and spatial distribution modeling): A general-purpose machine learning technique for making predictions from independent variables. Maxent uses the principle of maximum entropy to relate presence-only data (the dependent variable) to environmental variables in order to estimate a potential geographical distribution. Because only presence data are available, Maxent indirectly solves a discriminative problem through Bayes' rule (Phillips et al. 2006; Urbani et al. 2015; Rahmati et al. 2016; Chen et al. 2017).
Boosted regression tree (belonging to the gradient boosting modeling family): A tree-based model that combines a large number of machine learning and regression tree models to learn and weigh them (by assigning individual weights to every sample point of the training dataset), in order to describe the relationship between the independent and dependent variables. It uses several techniques to improve the performance of a single model, e.g., by creating an ensemble of regression models (Littke et al. 2017; Hu et al. 2018; Wang et al. 2018).
Multivariate adaptive regression spline (introduced by Friedman (1991) to organize relationships between a set of independent variables and the dependent variable): A machine learning technique that can estimate general functions of high-dimensional arguments. Further, it is an adaptive modeling process based on non-linear and non-parametric statistics (Samui 2013). MARS makes no assumptions about the underlying functional relationships between the input (independent) and target (dependent) variables (Zhang & Goh 2016). It allows complex relationship modeling between a dependent variable and independent variables, and its simple rule-based functions facilitate the prediction of spatial distributions using independent variables (Leathwick et al. 2006).
Generalized linear model: A model that generalizes linear regression by allowing a non-linear (curvilinear) relationship between the mean of the dependent variable and a linear combination of the independent variables. GLM is defined by three components: a random component that specifies a distribution for the response variable, a systematic component that combines the predictor variables into a linear predictor, and a link function that connects the random and systematic components (Guisan et al. 2002; Koubbi et al. 2011).
Generalized additive model: A GLM for which the linear predictor is specified as a sum of smooth functions of some or all of the covariates. GAM also provides estimates using a combination of the local scoring algorithm and the back-fitting algorithm (Dominici et al. 2002). It is best used for more than purely exploratory analysis, for which its smoothing parameter is a key component (Hastie & Tibshirani 1990). GAM is a semi-parametric extension of GLM. Like GLM, GAM uses a link function to establish a relationship between the mean of the dependent variable and a ‘smoothed’ function of the independent variables. The strength of GAM lies in its ability to deal with highly non-linear and non-monotonic relationships between the dependent and independent variables (Guisan et al. 2002).
Inputs to the hazard model
Flooded points in Amol were identified based on a flood inventory of inundated areas during the floods in 2017–2018 and on documents obtained from the municipal authority of Amol showing historical flood inundation areas. Historical points were validated with photographs showing the severity of flooding. These points served as dependent variables for our prediction models. In the preparation of the flood hazard map, we selected 92 flood-prone points (assigned a value of 1, for the flooded zone) and 60 non-flooded points (assigned a value of 0).
Based on the literature, we included eight geospatial predictors as model conditioning factors (independent variables) related to flood inundation. These were precipitation, land use/land cover (LULC), elevation, slope percent, curve number (CN), distance to river, distance to channel, and depth to groundwater (Fernández & Lutz 2010; Ouma & Tateishi 2014). All variables were converted to 5-m grids to create the hazard maps from the spatial distribution models (SDMs), which were run as machine learning algorithms in the R programming environment.
Daily precipitation data for 16 weather stations (Ramsar, Noushahr, Siahbisheh, Gharakhil-Ghaemshr, Firouzkooh, Sari, Kiasar, Amol, Polsefid, Alasht, Bandaramirabad, Galogah, Kojoor, Baladeh, Babolsar, and Dasht-E-Naz) were obtained from the Iranian Meteorological Organization (IRIMO). These data were used to prepare a precipitation depth map (for the period 2001–2016) for Mazandaran Province, using the inverse-distance weighted (IDW) interpolation method in ArcGIS 10.4. The interpolated amounts vary from 672 mm in the east of the study area to 684 mm in the west (Figure 2(a)). Mean annual precipitation in Amol city is 680 mm, based on the nearby Amol weather station (Figure 1).
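As an illustration of the IDW interpolation mentioned above, the following minimal Python sketch estimates a value at an unsampled location from surrounding stations. The station coordinates, the precipitation values, and the distance exponent of 2 are assumptions chosen for illustration; they are not the settings used in ArcGIS for this study.

```python
import math

def idw(stations, x, y, power=2.0):
    """Inverse-distance weighted estimate at (x, y).

    stations: list of (x_i, y_i, value_i) tuples.
    power: distance exponent (2 is a common default; assumed here).
    """
    num = 0.0
    den = 0.0
    for sx, sy, value in stations:
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            return value  # target coincides with a station
        w = 1.0 / d ** power  # closer stations get larger weights
        num += w * value
        den += w
    return num / den

# Hypothetical station positions (km) and mean annual precipitation (mm).
stations = [(0.0, 0.0, 672.0), (10.0, 0.0, 684.0), (5.0, 8.0, 680.0)]
estimate = idw(stations, 5.0, 0.0)
```

Because the estimate is a weighted mean, it always lies between the smallest and largest station values, which is consistent with the narrow 672–684 mm range reported for the study area.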
Using the 2016 LULC map obtained from Amol city authority (Figure 2(b)), we identified five LULC types: agricultural area (21.67%), orchard (9.45%), park (1.50%), residential (including residential buildings and street areas) (66.20%), and the Haraz River (1.20%). We also created a 5-m resolution digital elevation model (DEM) that represents the 59–137 masl variation across the study site (Figure 2(c)). We derived a slope map from the 5-m DEM using the Slope tool of the Spatial Analyst extension in ArcGIS 10.4. The slope varies from 0% to more than 7.21% in the study area (Figure 2(d)).
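To make the slope derivation concrete, the sketch below computes percent slope at the centre of a 3×3 elevation window with the widely used Horn finite-difference estimate. Whether this reproduces the Spatial Analyst output cell for cell is an assumption, and the elevation window is hypothetical.

```python
import math

def slope_percent(window, cell_size):
    """Percent slope at the centre of a 3x3 elevation window (Horn's method).

    window: three rows of three elevations (m); row 0 is the northern row.
    cell_size: grid spacing in metres (5 m in this study).
    """
    (a, b, c), (d, _, f), (g, h, i) = window
    # Weighted east-west and north-south elevation gradients.
    dz_dx = ((c + 2 * f + i) - (a + 2 * d + g)) / (8 * cell_size)
    dz_dy = ((g + 2 * h + i) - (a + 2 * b + c)) / (8 * cell_size)
    return 100.0 * math.hypot(dz_dx, dz_dy)

# Hypothetical window rising 0.25 m per 5-m cell from west to east.
window = [[60.00, 60.25, 60.50],
          [60.00, 60.25, 60.50],
          [60.00, 60.25, 60.50]]
slope = slope_percent(window, cell_size=5.0)
```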
Then, we estimated the U.S. Soil Conservation Service (SCS) CN for the study area (Figure 2(e)). The CN summarizes the infiltration capacity and runoff retention of a specific area as a function of characteristics such as land use and land cover (Zeng et al. 2017). We derived these values from land use and hydrologic soil group data using the ArcCN-runoff tool in the GIS software (Darabi et al. 2014, 2016; Menberu et al. 2014).
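For readers unfamiliar with how a CN translates into runoff, the sketch below applies the standard SCS curve number equations in metric units, with the conventional initial abstraction ratio of 0.2. The rainfall depth and the two CN values are hypothetical and are not taken from the Amol CN map.

```python
def scs_runoff_mm(precip_mm, cn, ia_ratio=0.2):
    """Direct runoff depth (mm) from the SCS curve number method."""
    if not 0 < cn <= 100:
        raise ValueError("CN must be in (0, 100]")
    s = 25400.0 / cn - 254.0   # potential maximum retention (mm)
    ia = ia_ratio * s          # initial abstraction (mm)
    if precip_mm <= ia:
        return 0.0             # all rainfall is abstracted, no runoff
    return (precip_mm - ia) ** 2 / (precip_mm - ia + s)

# Hypothetical 50-mm storm on two illustrative land covers.
runoff_residential = scs_runoff_mm(50.0, cn=90)  # largely impervious surface
runoff_orchard = scs_runoff_mm(50.0, cn=65)      # more pervious surface
```

A higher CN yields more runoff for the same storm, which is why CN serves as a proxy for runoff generation among the conditioning factors.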
The distance to the river plays an important role in urban flood inundation mapping (Ghani et al. 2009; Leow et al. 2009). In Amol city, according to records from the field survey and local authorities, many, but not all, areas affected by flooding lie near the Haraz River. The Euclidean distance to the Haraz River was calculated using the distance module in ArcGIS 10.4 (Figure 2(f)).
According to existing records, the areas most affected by flood inundation are close to areas with poor urban drainage systems. Therefore, the Euclidean distance to other channels (as collectors of surface water) was also extracted using the distance module in ArcGIS 10.4 (Figure 2(g)).
The groundwater level can lead to an increase or decrease in the groundwater recharge rate in unsaturated and saturated zones (Tam & Nga 2018). The groundwater level data used in this study were obtained from the Iranian Water Resources Management Company (IWRMC), and an IDW interpolation method was applied to identify the depth to groundwater (Figure 2(h)).
Inputs to the vulnerability model
To determine the vulnerability factors, ArcGIS (10.4) maps were constructed for each class in 5-m raster grids for the assignment of the weight/rank values, and an analysis was carried out using the AHP method (Saaty 2006; Fernández & Lutz 2010). The building density is important because it has significant impacts on the damage caused by urban floods. It was divided into four classes: high (>300 dwellings per hectare), medium (200–300 dwellings per hectare), low (100–200 dwellings per hectare), and very low (<100 dwellings per hectare) (Figure 3(a)). Building age was divided into five classes: recently completed (≤5 years), new (10–19 years), medium (20–29 years), old (30–39 years), and very old (≥40 years) (Figure 3(b)). The population density refers to the number of people inhabiting a given urbanized area, where high levels reflect the population at risk from floods (Güneralp et al. 2017). It was divided into four classes: high (≥1,500/km2), medium (1,000–1,500/km2), low (500–1,000/km2), and very low (≤500/km2) (Figure 3(c)). Socio-economic conditions refer to the inherent properties and behavior of humans and society within a specific urbanized region, and are assessed based on economic conditions and social welfare data on people in a given urban area. This type of information is valuable in taking into account the otherwise indirect and intangible impacts of flood hazards (Kaspersen & Halsnæs 2017). The socio-economic condition map for Amol was divided into five classes: very good, good, moderate, weak, and very weak (Figure 3(d)). Open spaces were also included as a separate class in each vulnerability map. All class divisions are based on suggestions by Güneralp et al. (2017) and Darabi et al. (2019), and all the data were obtained from the municipal authority of Amol.
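The AHP step can be sketched as follows: priority weights are taken from the principal eigenvector of a pairwise comparison matrix, and judgement quality is checked with the consistency ratio. The pairwise judgements below are hypothetical, not the expert judgements used in this study, and the power-iteration solver is only one of several ways to obtain the eigenvector.

```python
def ahp_weights(matrix, iterations=100):
    """Priority weights and consistency ratio for a pairwise comparison matrix."""
    n = len(matrix)
    w = [1.0 / n] * n
    # Power iteration converges to the principal eigenvector.
    for _ in range(iterations):
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(v)
        w = [x / total for x in v]
    # Estimate the principal eigenvalue from A w = lambda_max w.
    aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / w[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    random_index = {3: 0.58, 4: 0.90, 5: 1.12}[n]  # Saaty's random index
    return w, consistency_index / random_index

# Hypothetical judgements; rows and columns follow the order of `factors`.
factors = ["population density", "building density",
           "building age", "socio-economic conditions"]
pairwise = [
    [1.0, 2.0, 2.0, 3.0],
    [1 / 2, 1.0, 2.0, 2.0],
    [1 / 2, 1 / 2, 1.0, 2.0],
    [1 / 3, 1 / 2, 1 / 2, 1.0],
]
weights, consistency_ratio = ahp_weights(pairwise)
```

A consistency ratio below 0.1 is conventionally taken to indicate acceptably consistent judgements.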
Model training
Our modeling approach used a variety of models (Naimi & Araújo 2016; Darabi et al. 2019) to relate the response variable (here, urban flood occurrence) to predictor variables (conditioning factors). In this study, each model was run based on learning algorithms using flood points and predictor variables. Each model was trained on one portion of the data and then tested on another portion.
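A random partition of the inventory points into training and testing subsets can be sketched as below. The 70/30 split, the seed, and the point identifiers are assumptions for illustration; the source does not state the proportion used.

```python
import random

def split_points(points, train_fraction=0.7, seed=42):
    """Randomly partition points into training and testing subsets."""
    shuffled = points[:]                      # leave the input list untouched
    random.Random(seed).shuffle(shuffled)     # seeded for reproducibility
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical point IDs: 92 flooded (label 1) and 60 non-flooded (label 0).
points = [(i, 1) for i in range(92)] + [(i, 0) for i in range(92, 152)]
train, test = split_points(points)
```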
Testing and performance
Ensemble model
Based on the performance of the individual models, we built the ensemble model from the four best machine learning models. The process for building the ensemble model involved the following steps: (1) selecting the best models based on the ROC-AUC, (2) combining the selected models using an R program, to exploit all the advantages of the selected models, and (3) assessing the ensemble model using the ROC-AUC and selecting the most important conditioning factors (environmental variables) in the urban flood hazard.
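The three steps above can be sketched in Python as follows. The testing AUC values are those reported in Table 1; the AUC-weighted averaging rule and the per-cell probabilities are assumptions for illustration, since the exact combination rule applied in the R program is not specified here.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a flooded point outranks a non-flooded one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Step 1: rank the models by testing AUC (Table 1) and keep the best four.
test_auc = {"SVM": 0.774, "RF": 0.649, "Maxent": 0.764, "BRT": 0.824,
            "MARS": 0.815, "GLM": 0.833, "GAM": 0.846}
selected = sorted(test_auc, key=test_auc.get, reverse=True)[:4]

# Step 2: combine the selected models (assumed rule: AUC-weighted mean).
def ensemble_probability(predictions, weights):
    """Weighted mean of per-model flood probabilities for one grid cell."""
    total = sum(weights[m] for m in predictions)
    return sum(weights[m] * p for m, p in predictions.items()) / total

# Hypothetical per-model probabilities for a single 5-m cell.
cell = {"GAM": 0.80, "GLM": 0.70, "BRT": 0.75, "MARS": 0.65}
p_ensemble = ensemble_probability(cell, test_auc)
```

Step 3 would then apply `roc_auc` to the ensemble probabilities at the training and testing points.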
Urban flood risk
RESULTS AND DISCUSSION
Performance of hazard models
Model accuracy of the SVM, RF, Maxent, BRT, MARS, GLM, and GAM approaches, assessed using the ROC-AUC, is shown in Table 1. The AUC values during testing were highest for the GAM (0.85), GLM (0.83), BRT (0.82), and MARS (0.82) models. The performance was lower for the SVM (0.77), Maxent (0.76), and RF (0.65) approaches.
Model | AUC (training) | AUC (testing)
---|---|---
SVM | 0.818 | 0.774
RF | 0.788 | 0.649
Maxent | 0.806 | 0.764
BRT | 0.838 | 0.824
MARS | 0.915 | 0.815
GLM | 0.876 | 0.833
GAM | 0.892 | 0.846
Ensemble | 0.925 | 0.892
Based on these performance data, we built the ensemble model from the BRT, MARS, GLM, and GAM approaches. The ROC-AUC of the ensemble model was 0.925 and 0.892 for the training and testing data, respectively (Figure 4).
Urban flood hazard maps
Urban flood maps were constructed from the machine learning algorithms for regions with a high and low hazard of urban flooding. The importance of the conditioning factors was determined based on ensemble functions and the impact of the variables from the flooded points (flood inventory). Only the four most important factors were included in the ensemble model. These were the distance to the channel (0.92), LULC (0.88), CN (0.84), and elevation (0.81) (Figure 5). All models demonstrated that zones with high hazard probability are mostly located in the north and center of Amol city (Figure 6(a)–(h)). The zones with a high hazard probability were mostly identified by the GAM, and these areas were considered as having a high flood risk (Figure 6).
Urban flood risk map
Weight and rank values were assigned to the factors and classes according to their importance in the case study. Based on expert knowledge and using AHP results to evaluate the relative importance of urban flood vulnerability factors, the factor with the greatest weight was population density (0.38), followed by building density (0.29), building age (0.19), and socio-economic conditions (0.14). The weights obtained for each class of the building density, building age, population density, and socio-economic conditions factors are shown in Table 2. The factor weights and class rankings were then combined, and each pixel of the output map was assigned a value reflecting its class ranking and normalized factor weight (Figure 7).
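Factor | Weighting | Class | Ranking
---|---|---|---
Building density | 0.29 | High | 0.354
Building density | 0.29 | Medium | 0.283
Building density | 0.29 | Low | 0.202
Building density | 0.29 | Very low | 0.152
Building density | 0.29 | Open space | 0.009
Building density | 0.29 | Inconsistency ratio | 0.017
Population density | 0.38 | High | 0.402
Population density | 0.38 | Medium | 0.281
Population density | 0.38 | Low | 0.204
Population density | 0.38 | Very low | 0.111
Population density | 0.38 | Open space | 0.002
Population density | 0.38 | Inconsistency ratio | 0.014
Building age | 0.19 | Newest | 0.049
Building age | 0.19 | New | 0.089
Building age | 0.19 | Moderate | 0.234
Building age | 0.19 | Old | 0.269
Building age | 0.19 | Very old | 0.358
Building age | 0.19 | Open space | 0.001
Building age | 0.19 | Inconsistency ratio | 0.019
Socio-economic conditions | 0.14 | Very good | 0.002
Socio-economic conditions | 0.14 | Good | 0.041
Socio-economic conditions | 0.14 | Moderate | 0.195
Socio-economic conditions | 0.14 | Poor | 0.332
Socio-economic conditions | 0.14 | Very poor | 0.429
Socio-economic conditions | 0.14 | Open space | 0.002
Socio-economic conditions | 0.14 | Inconsistency ratio | 0.021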
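The weights and rankings in Table 2 can be combined into a per-cell vulnerability score, and then into a risk index, along the lines of the sketch below. The weighted linear combination is an assumed overlay rule, the product of hazard and vulnerability follows the definition of risk given in the abstract, and the two example cells and the hazard probability of 0.9 are hypothetical.

```python
# Factor weights and class rankings as reported in Table 2.
FACTOR_WEIGHTS = {"building_density": 0.29, "building_age": 0.19,
                  "population_density": 0.38, "socio_economic": 0.14}
CLASS_RANKS = {
    "building_density": {"high": 0.354, "medium": 0.283, "low": 0.202,
                         "very low": 0.152, "open space": 0.009},
    "building_age": {"newest": 0.049, "new": 0.089, "moderate": 0.234,
                     "old": 0.269, "very old": 0.358, "open space": 0.001},
    "population_density": {"high": 0.402, "medium": 0.281, "low": 0.204,
                           "very low": 0.111, "open space": 0.002},
    "socio_economic": {"very good": 0.002, "good": 0.041, "moderate": 0.195,
                       "poor": 0.332, "very poor": 0.429, "open space": 0.002},
}

def vulnerability(cell_classes):
    """Weighted sum of class rankings for one cell (assumed overlay rule)."""
    return sum(FACTOR_WEIGHTS[f] * CLASS_RANKS[f][c]
               for f, c in cell_classes.items())

def risk(hazard_probability, cell_classes):
    """Risk index as the product of hazard and vulnerability."""
    return hazard_probability * vulnerability(cell_classes)

# Two hypothetical cells with the same hazard probability.
dense_old = {"building_density": "high", "building_age": "very old",
             "population_density": "high", "socio_economic": "very poor"}
open_space = {f: "open space" for f in FACTOR_WEIGHTS}
risk_dense = risk(0.9, dense_old)
risk_open = risk(0.9, open_space)
```

Under these assumptions, a densely built, densely populated cell scores far higher than open space at the same hazard level, which reflects the dominant weights of population density and building density.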
Urban flood risk index map
The spatial distribution of the ensemble model output shows the urban flood risk index (Figure 8(a)). Using the natural break method in ArcGIS 10.4, the risk index map was classified into very low, low, moderate, high, and very high, representing 3.40%, 16.41%, 14.17%, 18.58%, and 47.45% of the total area, respectively (Figure 8(b)). The risk index map confirmed that northern and central areas of Amol city have the highest risk of flooding. A histogram assessment of the ensemble model (Figure 9) showed that the probability of flood point occurrence (for the test data) in the very high, high, moderate, low, and very low-risk areas was 5.49%, 18.68%, 15.38%, 17.58%, and 42.86% of total points, respectively. Hence, the ratio of flood point occurrence (%) to the area of risk class (%) was 1.641, 1.139, 1.086, 0.946, and 0.903, for the very high, high, moderate, low, and very low classes, respectively (Figure 9). Accordingly, areas with a very high risk of flooding had the highest density of flood points, and areas with a very low risk had the lowest density (in a given area).
Urban flooding has been modeled previously using models such as MIKE and LISFLOOD-FP. However, these hydrological and hydraulic models simulate the physical processes of urban flooding conditions and require sophisticated datasets and abundant computations. Thus, in recent years, machine learning algorithms have been widely used in environmental and especially flood risk studies, in which mapping-based models are important. Models for mapping flood-prone areas have been used by researchers worldwide and most agree that developing a specific model and identifying appropriate flood conditioning factors are important. In this study, machine learning algorithms were used to identify flood-prone areas in an urban environment, Amol city in Iran. In these models, urban flood potential areas are determined by a flood inventory map, which shows the maximum probability of urban flooding occurring in those locations. The identification of the risk of urban flood-prone areas is important in Amol, because most recent flooding has occurred within the city center. However, mapping of flood-prone areas showed very high risks across Amol city, and also areas with a lower risk around the edges of the city.
In general, urbanization increases the magnitude and frequency of floods and may expose vulnerable communities to greater risk. Machine learning and data mining techniques have now become more popular in the field of spatial distribution analysis, modeling, and urban flood mapping. Floods are affected by many factors, such as land use and meteorological, hydrological, and topographical conditions. The consideration of conditioning factors (precipitation, slope, CN, distance to river, distance to channels, depth to groundwater, land use, and elevation (DEM)) proved useful in developing a flood hazard map for Amol city using several machine learning algorithms combined into a novel ensemble model. A key advantage of this approach is that limited knowledge is required. Moreover, the methods used are parsimonious since in areas where climate, hydrological, and hydraulic data are lacking, predictive variables, namely hydrological, topographical, or land use properties, can be used instead for urban flood modeling. Thus, the method is transferable to other urban areas.
The main limitation of the approach is that some of the above-mentioned conditioning factors vary over time, and therefore influence the results. For example, precipitation is dynamic, yet we were forced to use a rainfall value reflecting the spatial distribution of the long-term mean. Moreover, our approach does not take into account potential changes in LULC and associated changes in the infiltration capacity. If necessary, new data on the conditioning factors can be added in the future to determine flood hazards, particularly if new information on inundation is made available following future floods. Finally, we expect the most important vulnerability factors (here the building density, building age, population density, and socio-economic conditions) to change in future versions of the flood risk map.
CONCLUSIONS
Preparing a flood risk map that reflects both the severity and extent of flood impact is a prerequisite for sound flood governance in urban areas, especially in regions with intensive and prolonged storms that cause recurrent floods. To overcome input data limitations, we applied seven machine learning algorithms to assess flood risk in the case study city of Amol, Iran. The accuracy of mapping differed between the individual algorithms, with the MARS, GAM, GLM, and BRT models proving most accurate. We then integrated these four algorithms into an ensemble prediction model to derive a final flood risk map, which was determined to be more accurate than the individual algorithms. The four conditioning factors that proved to be of greatest importance in determining the susceptibility to future flood damage were building density, building age, population density, and socio-economic conditions. The vulnerability map produced was useful in identifying high-risk areas where flood mitigation actions are needed. According to the ensemble model, areas surrounding the city and in the vicinity of the Haraz River have the lowest vulnerability to flooding, whereas the central city district is the most vulnerable and should be a high priority in future flood mitigation planning. Overall, our ensemble approach combining several algorithms gave high accuracy in urban flood risk mapping for the study area. The accuracy could be improved with better data on rainfall distribution during flood-producing storms and on the location of inundation zones during floods. This information was not available to us but is likely to be in the future as Amol develops a stronger flood governance program.