Abstract
Climate change is contributing to the increasing frequency and severity of flooding worldwide. Therefore, forecasting and preparing for floods while considering extreme climate conditions are essential for decision-makers to prevent and manage disasters. Although recent studies have demonstrated the potential of long short-term memory (LSTM) models for forecasting rainfall-related runoff, there remains room for improvement due to the lack of observational data. In this study, we developed a flood forecasting model based on a hybrid modeling approach that combined a rainfall-runoff model and a deep learning model. Furthermore, we proposed a method for forecasting flooding time using several representative rainfall variables. The study focused on urban river basins and combined rainfall amount, duration, and time distribution to create virtual rainfall scenarios. Additionally, the simulated results of the rainfall-runoff model were used as input data to forecast flooding time under extreme and other rainfall conditions. The prediction results achieved high accuracy, with a correlation coefficient of >0.9 and a Nash–Sutcliffe efficiency of >0.8. These results indicated that the proposed method would enable reasonable forecasting of flood occurrences and their timing using only forecasted rainfall information.
HIGHLIGHTS
A flood forecasting model based on hybrid modeling that combines an R–R model and an LSTM model is developed, and a method for forecasting floods using representative rainfall variables is proposed.
This study combined rainfall amount, duration, and distribution to create virtual rainfall scenarios.
The simulated results of the R–R model were used as input data to forecast flooding time under various rainfall conditions.
INTRODUCTION
Flooding is one of the most common and catastrophic natural hazards in the world and a major cause of social and economic loss and loss of life (Razavi et al. 2020). Moreover, increasingly more people are at risk of flooding (Paprotny et al. 2018). With climate change, flood risks are expected to increase in urban areas due to the intensification of extreme rainfall (Jha et al. 2012; Fadhel et al. 2018; Hettiarachchi et al. 2018). In particular, most small urban river basins in Korea have a very short concentration time, resulting in frequent fatal accidents owing to torrents in rivers during localized heavy rains (Lee et al. 2020). Therefore, the demands for rapid response management and accurate flood predictions are increasing.
There are currently two types of models used for flood prediction: physical-based and data-driven (Thirumalaiah & Deo 2000). Physical-based models can accurately predict runoff as they physically consider the hydrodynamic process of water flow and hydrological processes (Le et al. 2019). However, establishing physical-based models typically requires a large amount of input data (i.e., rainfall, water level, and flow data), which are generally unavailable for most watersheds. In addition, physical-based models incur high computational costs, which limit their practical application for real-time flood forecasting (Nayak et al. 2005; Mosavi et al. 2018; Le et al. 2019; Song et al. 2020). In contrast, data-driven models are based on the statistical relationship between input and output data (Paliwal & Kumar 2009). Therefore, these models do not utilize complex hydrodynamic and hydrologic processes for runoff predictions, and their computational costs are relatively low (Mosavi et al. 2018). One of the common data-driven models is the artificial neural network (ANN) model, which can replace existing physical modeling methods for the hydrological prediction of river flows (Kişi 2011).
With the rapid development of computational science, ANN-based deep learning or deep neural network (DNN) techniques have attracted significant academic and practical research attention (LeCun et al. 2015). Yaseen et al. (2015) reviewed papers published between 2000 and 2015 on the application of artificial intelligence (AI) to flow prediction models and highlighted the significant progress made by AI for predicting and modeling nonlinear hydrologic problems. Hidayat et al. (2014) developed an ANN model to forecast discharge in a tidal river (the Mahakam River, Indonesia) using upstream water levels, tide levels, and flow rates as the input data. Elsafi (2014) proposed an ANN model for the prediction of the daily flow rate at the Dongola Station in the Nile River Basin (Sudan) using upstream flow data. Khan et al. (2016) developed an ANN model for predicting the daily stream flow and water level of the Ramganga River in India using the daily discharge and water levels at a riverside gauging station as the input data. Sung et al. (2017) developed an ANN model for the hourly prediction of water levels in the Anyangcheon basin (Korea) with a lead time of 1–3 h and then validated the accuracy of the model using a statistical method.
Long short-term memory (LSTM) neural networks are one of the most advanced applications of DNNs and have been successfully applied in various fields (particularly for time-series problems). Examples of their uses include stock prediction (Nelson et al. 2017), tourism (Li & Cao 2018), speech recognition (Graves et al. 2013), machine translation (Cho et al. 2014; Sutskever et al. 2014), language modeling (Mikolov et al. 2014), and rainfall–runoff (R–R) simulations (Hu et al. 2018; Kratzert et al. 2018). Several of these studies have demonstrated the successful application of LSTM-based models in various fields, including their potential application for river flood forecasting. Therefore, this study employed a data-based LSTM method to forecast floods using forecasted rainfall information and then applied the model to the Dorim stream basin in Seoul (Korea) to evaluate its applicability.
Several studies have suggested that forecasted rainfall information can be used as a major source of input data for flood forecasting and warning modeling. The United States Bureau of Reclamation implemented an automatic local flood warning system in Northern California to investigate the relationship between rainfall and floods using observed rainfall data to prevent flood damage (National Weather Service 1997). In addition, in the Brays Bayou area in Houston (Texas), which is a representative small-scale urban watershed in the United States, a chart capable of flood forecasting was created using only rainfall information (Vieux et al. 2005). Similarly, Bae et al. (2012) developed a flow nomograph that could be used for real-time forecasting of river floods using only rainfall information through the relations between rainfall intensity and duration and flood occurrences. Lee et al. (2018) proposed a method to reveal the relation between rainfall intensity, rainfall duration, and flood occurrences in urban watersheds using synthetic rainfall data, which included the application of a Huff distribution. Although most of these methods can predict whether a flood will occur based on the forecasted rainfall information, they struggle to accurately forecast the timing of a flood. In addition, as they do not consider rainfall patterns according to various rainfall variables (i.e., rainfall amount, duration, and time distribution), their accuracy decreases when various types of rainfall occur in nature. Additionally, because these methodologies were mainly developed for large-scale watersheds, they cannot be easily applied to small urban river basins with different hydrological runoff characteristics.
In this study, we developed a hybrid modeling approach that combined R–R and LSTM models for flood forecasting in small urban river basins under various rainfall conditions. Although the meteorological and hydrological observatories in the study area have been continuously observing rainfall and water levels, the observed hydrological data alone are insufficient as input data because they include few events under extreme rainfall conditions or rainfall heavy enough to cause flooding. Therefore, the input data necessary for training the LSTM model were generated using virtual rainfall scenarios and the R–R model, which was optimized for this basin. These input data included extreme rainfall conditions and some other rainfall conditions not previously observed. The results of this study indicated that the proposed method can be used to reasonably predict the occurrence of floods and the time of flooding using the predicted rainfall information and is expected to be applicable for real-time flood forecasting.
STUDY AREA AND DATA
Water system of the Dorim stream basin
Classification | Area (km²) | Flow path length (km) | River length (km) | Average elevation (EL.m) | Average slope (%) | Average basin width (A/L)
---|---|---|---|---|---|---
| 42.50 | 14.51 | 11.00 | 89.34 | 9.99 | 2.30
To analyze the rainfall–runoff characteristics in the investigated watershed, sufficient hydrological observation data and geographical information system (GIS) data in the basin should be collected. Long-term hydrological observation data (rainfall, water level, and discharge) of the Dorim stream and the latest river cross-sectional survey data are available. Rainfall stations are evenly distributed in and around the Dorim stream basin. Accordingly, rainfall data for this study were obtained from the automatic weather system (https://www.kma.go.kr) at seven points (405, 410, 417, 418, 425, 509, and 510) located near the Dorim stream basin. The rainfall data were collected at a resolution of 1 min over 21 years (2001–2021). The water level observation data at a resolution of 10 min were collected from the following seven water level observation stations: Seoul National University entrance, Sillim 3 Bridge, Gwanak Dorim Bridge, Sindaebang Station, Guro Digital Complex Station, Guro 1 Bridge, and Dorim Bridge. Table 2 shows the observation stations in the Dorim stream basin selected for this study, including their latitude and longitude information.
Classification | Station ID | Station name | Height of observation field (EL.m) | Longitude (east °) | Latitude (north °)
---|---|---|---|---|---
AWS | 405 | Yangcheon | 11.0 | 126.8793 | 37.5240
AWS | 410 | KMA | 33.5 | 126.9228 | 37.4935
AWS | 417 | Geumcheon | 45.0 | 126.9162 | 37.4575
AWS | 418 | Hangang | 11.0 | 126.9394 | 37.5213
AWS | 425 | Namhyeon | 113.0 | 126.9815 | 37.4634
AWS | 509 | Gwanak | 142.0 | 126.9523 | 37.4501
AWS | 510 | Yeongdeungpo | 25.0 | 126.9071 | 37.5271
WLGS | | Dorim Bridge | 5.0 | 126.5328 | 37.3038
WLGS | | Guro 1 Bridge | 14.0 | 126.5351 | 37.2914
WLGS | | Guro Digital Complex Station | 15.0 | 126.5406 | 37.2907
WLGS | | Sindaebang Station | 17.0 | 126.5447 | 37.2914
WLGS | | Gwanak Dorim Bridge | 18.0 | 126.5508 | 37.2923
WLGS | | Sillim 3 Bridge | 33.0 | 126.5600 | 37.2819
WLGS | | Seoul National University | 52.0 | 126.5649 | 37.2802
METHODOLOGY
The method can be summarized as follows. First, the XP storm water management model (XP-SWMM) was used for the R–R analysis required to create the training dataset for the LSTM model. The XP-SWMM was constructed by collecting basic data, such as the topography and hydrological data of the target basin (XP Solutions 2014), and the accuracy of the R–R analysis was improved by calculating various parameters and optimizing the model. Second, an analysis point within the target basin was selected, and the reference flood level was set at that point. Given that safety accidents (such as people being isolated or swept away in the river) occur when riverside trails are flooded during sudden rain, the reference flood level in this study was set to the height of the riverside trails. In addition, to simulate R–R under various conditions, the analysis was performed using virtual rainfall scenarios created by combining the rainfall amount, duration, and time distribution. Third, flooding time data were constructed as a training dataset for each rainfall condition based on the R–R simulation results. Finally, the LSTM model, which was developed through training and optimization, was used to forecast the flood occurrence time for real heavy rain events, and its forecasting performance and applicability were evaluated through statistical verification against the observed data.
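The third step, constructing flooding time data from the R–R simulation results, can be sketched as follows. This is a minimal illustration, assuming that the flooding time is counted from the start of the simulation to the first time step at which the simulated water level reaches the reference flood level. The function name `flooding_time`, the 1 min time step, and the sample water levels are hypothetical and are not taken from the study.

```python
from typing import Optional, Sequence


def flooding_time(levels_m: Sequence[float],
                  reference_level_m: float,
                  step_min: float = 1.0) -> Optional[float]:
    """Return the minutes elapsed until the level first reaches the reference level.

    Returns None when the reference level is never reached (no flooding).
    """
    for index, level in enumerate(levels_m):
        if level >= reference_level_m:
            return index * step_min
    return None


# Hypothetical simulated water levels (m) at 1 min intervals.
simulated = [0.20, 0.25, 0.40, 0.75, 1.10, 1.45, 1.30, 0.90]

print(flooding_time(simulated, reference_level_m=1.0))  # level first reached at index 4
print(flooding_time(simulated, reference_level_m=2.0))  # level never reached
```

Scenarios for which the function returns `None` would correspond to rainfall conditions that do not produce flooding at the analysis point.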
LSTM network
The main components of an LSTM are a memory cell, which can maintain its state over time, and three nonlinear gates that regulate the data flow into and out of the cell (Greff et al. 2017). In other words, to update the hidden state ($h_t$) at a specific point in time, the LSTM determines whether to update its internal information using the cell state ($c_t$). The data flow of this cell is controlled by an input gate ($i_t$), a forget gate ($f_t$), and an output gate ($o_t$).
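For reference, a commonly used formulation of the LSTM update (without peephole connections) is sketched below for the hidden state $h_t$, cell state $c_t$, input gate $i_t$, forget gate $f_t$, and output gate $o_t$. The input $x_t$, candidate state $\tilde{c}_t$, weight matrices $W$ and $U$, and bias vectors $b$ are notation assumed here; $\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise multiplication.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

The forget gate scales the previous cell state, the input gate scales the new candidate information, and the output gate determines how much of the updated cell state is exposed as the hidden state.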
In this study, the LSTM layer was implemented using the built-in components of the Keras framework, a commonly used deep learning library. The selection of hyperparameters (such as the optimizer, learning rate, and number of epochs) is crucial for the performance of the LSTM model. The Adam optimizer was utilized, as it is a widely applied optimization algorithm in deep learning for computer vision and natural language processing. Proposed by Kingma & Ba (2014), Adam is an extension of the stochastic gradient descent method that iteratively updates the network weights based on the training data. The learning rate determines the size of the step taken during each weight update, and setting it too high or too low can result in suboptimal performance. The number of epochs determines the number of times the model sees the entire training dataset; setting this value too low can result in underfitting, whereas setting it too high can cause overfitting. Accordingly, these hyperparameters must be tuned carefully. Table 3 lists the detailed training options of the LSTM model and the optimization range of the hyperparameters.
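To illustrate how the learning rate sets the step size in Adam, the update rule of Kingma & Ba (2014) is sketched below for a single weight. This is a toy example, not the Keras implementation used in the study: the quadratic loss, the starting point, and the function name `adam_minimize` are hypothetical, while the default values of `beta1`, `beta2`, and `eps` follow the original Adam paper.

```python
import math


def adam_minimize(grad, w0, lr=0.005, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=5000):
    """Minimize a 1-D function with Adam, given its gradient function."""
    w, m, v = w0, 0.0, 0.0
    history = [w]
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g        # first-moment (mean) estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)           # bias-corrected moments
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
        history.append(w)
    return w, history


# Toy loss (w - 3)^2 with gradient 2 (w - 3); the minimum is at w = 3.
w_final, history = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
first_step = abs(history[1] - history[0])

print(round(first_step, 6))  # the first step has a magnitude close to lr
print(round(w_final, 2))     # the weight approaches the minimum at w = 3
```

Because the first update is normalized by the gradient magnitude, its size is approximately equal to the learning rate regardless of the scale of the loss, which is one reason the learning rate has such a direct effect on training behavior.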
Hyperparameters | Search space
---|---
Activation function | ReLU, Tanh
Optimizer | Adam, Adadelta, Adagrad
Layer | 1, 2, 3
Epochs | 300
Learning rate | 0.001, 0.002, 0.005, 0.1
Batch size | 16, 32, 64, 128
Hidden size | 10, 15, 20, 25, 30
Dropout rate | 0, 0.2, 0.3, 0.5
The activation function converts linear relationships into nonlinear ones. The optimizer determines the direction in which the weights of the network are updated. The number of epochs is the number of complete passes over the training dataset. The learning rate controls the learning speed of the model. The batch size is the number of samples processed by the neural network per iteration. The hidden size sets the output size of the hidden layers in the LSTM network, and the dropout rate is used to mitigate overfitting.
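A random search over the search space in Table 3 can be sketched as follows. This is a minimal illustration using only the Python standard library; the scoring function `toy_score` is a hypothetical placeholder for training the LSTM model and returning its validation error, and the dictionary keys are names chosen here for readability.

```python
import random

# Search space taken from Table 3.
SEARCH_SPACE = {
    "activation": ["relu", "tanh"],
    "optimizer": ["adam", "adadelta", "adagrad"],
    "layers": [1, 2, 3],
    "epochs": [300],
    "learning_rate": [0.001, 0.002, 0.005, 0.1],
    "batch_size": [16, 32, 64, 128],
    "hidden_size": [10, 15, 20, 25, 30],
    "dropout_rate": [0.0, 0.2, 0.3, 0.5],
}


def grid_size(space):
    """Number of distinct configurations in the full grid."""
    size = 1
    for options in space.values():
        size *= len(options)
    return size


def random_search(space, score_fn, n_trials, seed=0):
    """Sample configurations uniformly and keep the one with the lowest score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = {name: rng.choice(options) for name, options in space.items()}
        score = score_fn(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_score(config):
    # Hypothetical stand-in for "train the model, return the validation error".
    return config["learning_rate"] + config["dropout_rate"]


print(grid_size(SEARCH_SPACE))  # 2 * 3 * 3 * 1 * 4 * 4 * 5 * 4 = 5760
best_config, best_score = random_search(SEARCH_SPACE, toy_score, n_trials=1000)
print(best_config, best_score)
```

The full grid contains 5,760 combinations, so a random search with a fixed trial budget evaluates only a subset of them, which keeps the tuning cost bounded.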
R–R model
Urban rivers can have different runoff characteristics depending on various factors, such as the drainage system, sewer network, land cover condition, and topography of the urban basin (Seo et al. 1996). Hence, it is necessary to comprehensively consider and analyze the drainage system of a river and the basin when analyzing the R–R relationship. Therefore, the XP-SWMM was utilized, as it is capable of R–R simulations that link the inland and outland water and flood control facilities. In addition, the XP-SWMM can simultaneously analyze flow and water level changes by simulating the surface runoff, groundwater flow, and flow within a pipeline system, mainly due to rainfall events in urban watersheds.
Synthetic rainfall scenarios
Synthetic rainfall data are required to obtain the input data for the LSTM model using R–R simulations. In this study, the synthetic rainfall data were generated using the Huff distribution (Huff 1967) for the rainfall time distribution, which is commonly used when designing urban watershed drainage facilities in Korea (Lee et al. 2018). The Huff method yields a non-dimensional cumulative curve representing the temporal distribution characteristics of rainfall, and rainfall events are grouped into four quartile types according to the quarter of the rainfall duration in which the peak rainfall occurs. All four Huff quartile distributions were considered.
Synthetic rainfall data generation comprised three steps. First, the cumulative distribution for each Huff quartile was generated according to the regression equations, which describe the Huff distribution in cumulative form. Second, the cumulative distribution was converted to an incremental distribution. Third, the synthetic rainfall event was obtained by applying the rainfall amount and duration to the incremental distribution. Countless rainfall patterns can occur in nature, and considering them all is impossible; hence, synthetic rainfall scenarios were created by combining representative rainfall variables (rainfall amount, duration, and time distribution) as follows. The rainfall amount was varied in 5 mm increments (from 5 to 200 mm), the duration was varied in 10 min increments (from 10 to 120 min), and the rainfall time distribution covered the four Huff quartiles (from the 1st to the 4th quartile), as shown in Table 4.
Type of data | Rainfall amount | Duration | Distribution
---|---|---|---
Unit | mm | min | quartile
Numbers | 40 | 12 | 4
Increments | 5 mm | 10 min | 1
Max | 200 mm | 120 min | 4
Min | 5 mm | 10 min | 1
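The scenario grid in Table 4 and the conversion from a cumulative to an incremental distribution can be sketched as follows. The cumulative curve used here is a hypothetical placeholder with a front-loaded shape; it does not reproduce the Huff regression equations applied in the study, and the function name `disaggregate` is chosen for illustration.

```python
from itertools import product

# Value ranges taken from Table 4.
amounts_mm = list(range(5, 201, 5))        # 5, 10, ..., 200 mm (40 values)
durations_min = list(range(10, 121, 10))   # 10, 20, ..., 120 min (12 values)
quartiles = [1, 2, 3, 4]                   # Huff quartiles

scenarios = list(product(amounts_mm, durations_min, quartiles))
print(len(scenarios))  # 40 * 12 * 4 = 1920 combinations


def disaggregate(total_mm, cumulative_fractions):
    """Convert a dimensionless cumulative curve into incremental depths (mm)."""
    increments = []
    previous = 0.0
    for fraction in cumulative_fractions:
        increments.append(total_mm * (fraction - previous))
        previous = fraction
    return increments


# Hypothetical cumulative fractions at six equal time steps (placeholder values).
curve = [0.35, 0.65, 0.80, 0.90, 0.96, 1.00]
hyetograph = disaggregate(50.0, curve)
print([round(depth, 1) for depth in hyetograph])
```

Each scenario tuple (amount, duration, quartile) would be paired with the cumulative curve of its quartile, and the resulting incremental depths would serve as the rainfall input for one R–R simulation.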
ANALYSIS AND RESULTS
The results from calibrating and validating the model are presented in this section to demonstrate the accuracy and applicability of the proposed method. The prediction accuracy was quantified using the coefficient of determination (R2), the root mean square error (RMSE), the mean absolute error (MAE), the mean absolute percentage error (MAPE), the Nash–Sutcliffe efficiency (NSE; Nash & Sutcliffe 1970), and the Kling–Gupta efficiency (KGE; Gupta et al. 2009), as follows:
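The standard forms of these metrics are given below as a sketch. The notation is assumed here: $S_i$ and $O_i$ are the simulated and observed values, $n$ is the number of values, overbars denote means, and $\sigma$ and $\mu$ denote the standard deviation and mean of the simulated ($S$) or observed ($O$) series.

```latex
\begin{aligned}
R^2 &= \left[\frac{\sum_{i=1}^{n}\left(O_i-\bar{O}\right)\left(S_i-\bar{S}\right)}
{\sqrt{\sum_{i=1}^{n}\left(O_i-\bar{O}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(S_i-\bar{S}\right)^2}}\right]^2 \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(S_i-O_i\right)^2} \\
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n}\left|S_i-O_i\right| \\
\mathrm{MAPE} &= \frac{100}{n}\sum_{i=1}^{n}\left|\frac{S_i-O_i}{O_i}\right| \\
\mathrm{NSE} &= 1-\frac{\sum_{i=1}^{n}\left(O_i-S_i\right)^2}{\sum_{i=1}^{n}\left(O_i-\bar{O}\right)^2} \\
\mathrm{KGE} &= 1-\sqrt{\left(r-1\right)^2+\left(\alpha-1\right)^2+\left(\beta-1\right)^2},
\qquad \alpha=\frac{\sigma_S}{\sigma_O},\quad \beta=\frac{\mu_S}{\mu_O}
\end{aligned}
```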
Here, $S_i$ and $O_i$ are the simulated and observed values, respectively; r is the Pearson correlation coefficient (CC) between the simulated and observed values (dimensionless); $\alpha$ is the variability ratio (dimensionless); and $\beta$ is the bias ratio (dimensionless). To qualify the performance of the prediction model, we adopted the evaluation criteria for hydrologic models proposed by Moriasi et al. (2015). The performance of the model was considered excellent, good, and satisfactory for NSE values of 1.0–0.8, 0.8–0.6, and 0.6–0.5, respectively. An NSE range from 1.0 to 0 is typically regarded as acceptable. In addition, to evaluate the hydrological model calibration for the computed KGE values, the evaluation criteria proposed by Towner et al. (2019) were utilized. Model calibration is considered good if KGE = 1.0–0.75, intermediate if KGE = 0.75–0.5, poor if KGE = 0.5–0, and very poor if KGE < 0. It should be noted that KGE values can range from −∞ to 1, with values closer to 1 indicating better performance. The evaluation criteria indicated that the performance of the model was good and satisfactory.
XP-SWMM calibration result
Difficulties can be encountered when calculating the phenomena occurring in rivers in any hydrological model. In particular, most hydrological models only calculate these phenomena using a few of the most influential variables, and the parameters of the hydrological model are estimated using empirical formulas or strategic values. Therefore, to calculate values close to the real discharge and water levels, it is essential to adjust the parameters of the model using the observed data on these factors. In this study, the R–R model (introduced in Section 3.2) was optimized by calibrating and validating parameters using the observed water levels for real major heavy rain events. To optimize our model, parameters such as the roughness coefficient of the channel, roughness coefficient of the conduit, impermeable area ratio, curve number (CN) value, and loss coefficient were considered. However, we employed the CN values and loss coefficients previously suggested by the Water Resources Management Information System (WAMIS; http://www.wamis.go.kr), and only the remaining three parameters were corrected in this study.
The accuracy of the model was validated by applying previous major heavy rain events and comparing the simulated results of the water level with the observations. The target rainfall event selected for the study was the heavy rain event (57 mm/h and 21 mm/10 min) on August 1, 2020 when a river isolation accident occurred during the rainy season.
Station | R2 | RMSE (m) | NSE | KGE
---|---|---|---|---
Seoul National University | 0.83 | 0.13 | 0.73 | 0.90
Sillim 3 Bridge | 0.79 | 0.25 | 0.72 | 0.87
Gwanak Dorim Bridge | 0.90 | 0.33 | 0.73 | 0.79
Sindaebang Station | 0.95 | 0.13 | 0.93 | 0.87
Guro Digital Complex Station | 0.99 | 0.10 | 0.98 | 0.94
Guro 1 Bridge | 0.91 | 0.31 | 0.81 | 0.91
Average | 0.90 | 0.21 | 0.82 | 0.88
According to the simulated and observed values, the water level prediction of the model was highly accurate in terms of the peak water level and peak arrival time: R2 = 0.79–0.99, RMSE = 0.10–0.33 m, NSE = 0.72–0.98, and KGE = 0.79–0.94. Although the prediction accuracy right after the peak value was somewhat low (due to underestimating or overestimating the discharge), the overall predicted values (including the peak value) exhibited a trend similar to that of the observed values and displayed a high degree of correlation. These results indicated that the optimized model exhibited a good prediction performance (NSE > 0.6, KGE > 0.75) for real rainfall events and was considered suitable for use in the development of the LSTM model.
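The NSE and KGE criteria stated earlier can be applied to the per-station calibration results as sketched below. The class boundaries are treated as inclusive at their lower limit, and the label "unsatisfactory" for NSE values below 0.5 is an assumption made here, as the text does not name that class.

```python
def classify_nse(nse):
    """NSE classes as stated in the text; 'unsatisfactory' is an assumed label."""
    if nse >= 0.8:
        return "excellent"
    if nse >= 0.6:
        return "good"
    if nse >= 0.5:
        return "satisfactory"
    return "unsatisfactory"


def classify_kge(kge):
    """KGE classes as stated in the text."""
    if kge >= 0.75:
        return "good"
    if kge >= 0.5:
        return "intermediate"
    if kge >= 0.0:
        return "poor"
    return "very poor"


# (NSE, KGE) per station, taken from the XP-SWMM calibration results.
calibration = {
    "Seoul National University": (0.73, 0.90),
    "Sillim 3 Bridge": (0.72, 0.87),
    "Gwanak Dorim Bridge": (0.73, 0.79),
    "Sindaebang Station": (0.93, 0.87),
    "Guro Digital Complex Station": (0.98, 0.94),
    "Guro 1 Bridge": (0.81, 0.91),
}

ratings = {station: (classify_nse(nse), classify_kge(kge))
           for station, (nse, kge) in calibration.items()}
mean_nse = sum(nse for nse, _ in calibration.values()) / len(calibration)
mean_kge = sum(kge for _, kge in calibration.values()) / len(calibration)

for station, rating in ratings.items():
    print(station, rating)
print(round(mean_nse, 2), round(mean_kge, 2))
```

Under these assumptions, every station is rated at least "good" for both metrics, which is consistent with the conclusion drawn above.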
Generation of input datasets for the development of the LSTM model
The generated input dataset comprised 1,686 rainfall scenarios with different flooding times based on various combinations of the three rainfall variables (rainfall amount, duration, and time distribution). This allowed us to consider both extreme rainfall conditions and those that have not been observed in the basin.
Development and assessment of the LSTM model
In this study, the LSTM model of the DNN structure was utilized to forecast the flooding time according to various rainfall scenarios. Accordingly, we set 1,349 datasets (80% of the generated input dataset) as the training datasets and 337 datasets (20%) as the validation datasets and then used them to develop the LSTM model. In addition, 21-year observation data (2001–2021) were used as test data to validate the accuracy of the LSTM model. The time distribution of the observed rainfall data was divided into four quartiles, the section containing the largest rainfall value was selected, and the quartile was determined according to the section's location. Hyperparameter optimization in the LSTM model was performed by 1,000 random trials using a random search method. Finally, Tanh was selected as the activation function, Adam was selected as the optimizer, and the layer number, epoch, learning rate, batch size, hidden size, and dropout rate were set to 1, 300, 0.005, 16, 20, and 0.2, respectively.
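The 80/20 split and the quartile assignment for observed rainfall events can be sketched as follows. Two assumptions are made here: the split is performed by random shuffling (the text does not state how the datasets were divided), and "the section containing the largest rainfall value" is interpreted as the quarter of the event that contains the peak interval. The function names and the sample event are hypothetical.

```python
import random


def split_indices(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and split them into training and validation sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]


def peak_quartile(rainfall):
    """Return 1-4: the quarter of the event that contains the largest value."""
    if not rainfall:
        raise ValueError("rainfall series is empty")
    n = len(rainfall)
    peak_index = max(range(n), key=lambda i: rainfall[i])
    return peak_index * 4 // n + 1


train_idx, val_idx = split_indices(1686)
print(len(train_idx), len(val_idx))  # sizes of the two subsets

# Hypothetical observed event (mm per interval) with its peak in the third quarter.
event = [0.5, 1.0, 2.5, 6.0, 12.0, 21.0, 9.0, 3.0]
print(peak_quartile(event))
```

The quartile obtained in this way, together with the total rainfall amount and duration of the event, would form the three input variables for a test sample.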
Classification | Dataset | R2 | RMSE (min) | MAE (min) | MAPE (%) | NSE
---|---|---|---|---|---|---
LSTM | Calibration | 0.91 | 3.53 | 2.49 | 4.67 | 0.96
LSTM | Validation | 0.88 | 5.60 | 3.74 | 6.96 | 0.89
LSTM | Test | 0.81 | 6.18 | 4.42 | 6.48 | 0.81
SWMM | | 0.84 | 5.81 | 4.83 | 7.59 | 0.83
The CC, R2, RMSE, MAE, MAPE, and NSE of the calibration results were 0.96, 0.91, 3.53 min, 2.49 min, 4.67%, and 0.96, respectively. The corresponding values of the validation results were 0.94, 0.88, 5.60 min, 3.74 min, 6.96%, and 0.89, respectively. For the real rainfall events in the test set, the values were 0.90, 0.81, 6.18 min, 4.42 min, 6.48%, and 0.81, respectively. These results indicate that the developed model exhibited ‘very good’ predictive performance (NSE > 0.8) for calibration, validation, and the real rainfall events.
The results of this study indicated that the predictive performance of the LSTM model was not significantly different from that of the SWMM.
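The metrics reported here can be computed from paired predicted and observed flooding times as sketched below, using the standard definitions and taking R2 as the square of CC. The function name `evaluate` and the sample flooding times are hypothetical and serve only to show the calculation.

```python
import math


def evaluate(simulated, observed):
    """Return CC, R2, RMSE, MAE, MAPE (%), and NSE for paired series."""
    n = len(observed)
    mean_s = sum(simulated) / n
    mean_o = sum(observed) / n
    cov = sum((s - mean_s) * (o - mean_o) for s, o in zip(simulated, observed))
    ss_s = sum((s - mean_s) ** 2 for s in simulated)
    ss_o = sum((o - mean_o) ** 2 for o in observed)
    errors = [s - o for s, o in zip(simulated, observed)]
    sse = sum(e * e for e in errors)
    cc = cov / math.sqrt(ss_s * ss_o)
    return {
        "CC": cc,
        "R2": cc ** 2,
        "RMSE": math.sqrt(sse / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": 100.0 / n * sum(abs(e / o) for e, o in zip(errors, observed)),
        "NSE": 1.0 - sse / ss_o,
    }


# Hypothetical flooding times (min): observed versus predicted.
observed = [40.0, 55.0, 70.0, 85.0]
predicted = [42.0, 53.0, 73.0, 84.0]

metrics = evaluate(predicted, observed)
print({name: round(value, 3) for name, value in metrics.items()})
```

RMSE and MAE carry the unit of the predicted quantity (minutes for flooding time), whereas CC, R2, and NSE are dimensionless.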
Impact of rainfall variables on flooding time
The analysis revealed that the R2, RMSE, and NSE values of the model under Case 2 (where the input variable S, the rainfall time distribution, was not used) were 0.68, 11.24 min, and 0.37, respectively. Compared to Case 1, where all variables were considered, the RMSE value under Case 2 increased by approximately 82%, whereas R2 and NSE decreased by approximately 16 and 54%, respectively. In Case 3 (where the input variable D, the rainfall duration, was not used), the R2, RMSE, and NSE values of the model were 0.68, 11.33 min, and 0.36, respectively. Compared to Case 1, the RMSE value under Case 3 increased by approximately 83%, whereas the R2 and NSE values decreased by approximately 16 and 56%, respectively. Finally, in Case 4, where the input variable R (the rainfall amount) was not used for the prediction, the R2, RMSE, and NSE values were 0.08, 27.21 min, and −2.72, respectively. Compared to Case 1, the RMSE value under Case 4 increased by approximately 340%, and the R2 and NSE values decreased by approximately 90 and 436%, respectively.
The model exhibited the highest accuracy when all three input variables were used, and its accuracy was considerably lower when any of the variables was excluded. In particular, when R or D was excluded, the accuracy of the model decreased sharply, indicating that rainfall amount and duration have a relatively high correlation with the flooding time in the LSTM model. In contrast, S exhibited the lowest correlation with the flooding time. This is because the LSTM model was trained only on data from the four Huff quartile distributions and did not consider other distributions of the observed rainfall; not all rainfall patterns in nature can be explained by the Huff distribution. The detailed results are provided in Table 7.
Case | Input variable | R2 | RMSE (min) | MAE (min) | MAPE (%) | NSE
---|---|---|---|---|---|---
1 | R, D, S | 0.81 | 6.18 | 4.42 | 6.48 | 0.81
2 | R, D | 0.68 | 11.24 | 9.24 | 14.93 | 0.37
3 | R, S | 0.68 | 11.33 | 9.55 | 15.21 | 0.36
4 | D, S | 0.08 | 27.21 | 18.38 | 28.11 | −2.72
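The relative changes with respect to Case 1 can be reproduced from the values in Table 7 as follows. This is a short arithmetic check; the helper name `percent_change` is chosen here for illustration.

```python
def percent_change(value, baseline):
    """Relative change of value with respect to baseline, in percent."""
    return 100.0 * (value - baseline) / abs(baseline)


# R2, RMSE (min), and NSE per case, taken from Table 7.
table7 = {
    1: {"R2": 0.81, "RMSE": 6.18, "NSE": 0.81},
    2: {"R2": 0.68, "RMSE": 11.24, "NSE": 0.37},
    3: {"R2": 0.68, "RMSE": 11.33, "NSE": 0.36},
    4: {"R2": 0.08, "RMSE": 27.21, "NSE": -2.72},
}

baseline = table7[1]
changes = {
    case: {name: percent_change(value, baseline[name])
           for name, value in metrics.items()}
    for case, metrics in table7.items() if case != 1
}

for case, result in changes.items():
    print(case, {name: round(value, 1) for name, value in result.items()})
```

Positive values indicate an increase relative to Case 1 (as for RMSE), and negative values indicate a decrease (as for R2 and NSE).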
DISCUSSION
Application of the hybrid modeling approach for flood forecasting and early warning
When applied to a real-time river flood warning system, the timing of flooding and warning issuance corresponding to the forecasted rainfall information can be calculated automatically and managed in real time, allowing the advanced evacuation of people. Therefore, the proposed method was considered to be a valuable method that could enable the forecasting of flooding time using minimum rainfall information. Furthermore, preemptive warnings could be issued by pre-analyzing different flood cases using R–R simulation results applied to virtual rainfall scenarios and training them in the LSTM model.
Limitations and improvements
The flood forecasting method based on the LSTM model proposed in this study uses rainfall information, such as rainfall amount, duration, and time distribution, to forecast flood occurrence and occurrence time for various rainfall conditions. Particularly, it can predict flooding time in real time through virtual rainfall and flood data learning using the predicted rainfall information.
However, the method has some limitations, and there is room for improvement to enable its practical application.
First, the method uses the simulation results of the R–R model (SWMM) to obtain flood data according to the virtual rainfall scenario, which are the input data required for learning. Accordingly, the predictions can inevitably be affected by errors in the R–R simulation results. For example, overpredictions in the SWMM simulations can be attributed to an overestimation of the impervious ratio (Wang & Altunkaynak 2011), and the corresponding simulation errors may decrease the prediction accuracy of the LSTM model.
In addition, as the drainage system diagram, sewage pipe network specifications, and drainage district data are complex, they are difficult to reflect in the model realistically. Therefore, the drainage system diagram and drainage basin were simplified to develop the model, inevitably distorting the characteristics of the basin and conduit.
In addition, urban rivers generally have artificial drainage systems (such as rainwater pumping stations). Hence, the uncertainty of basin data for estimating parameters is relatively high, which could reduce the accuracy of the LSTM model, as it uses the results of this R–R model as input data. Accordingly, this limitation has to be mitigated. Some previous studies proposed a method for improving the accuracy of the R–R model by applying parameter optimization algorithm techniques, such as Shuffled Complex Evolution, University of Arizona (Duan et al. 1994), Monte Carlo simulation (Kuczera & Parent 1998), and dynamically dimensioned search (Tolson & Shoemaker 2007). These methods could be applied to improve the accuracy of the model.
The second limitation of the LSTM model is the use of incomplete rainfall data as the input data. In this study, synthetic rainfall scenarios were employed to generate the input data. However, the scenarios only comprised specific ranges and combinations of the representative rainfall variables (amount, duration, and time distribution), and these variables alone may not explain real rainfall patterns occurring in nature. In particular, because only the four Huff quartile distributions were used to create the input dataset, the LSTM model could not account for other time distributions found in observed rainfall. Some types of rainfall cannot be fully described by the Huff quartile distributions, and for rainfall that exceeds the range of conditions learned by the model, the prediction performance may deteriorate. Therefore, it is necessary to improve this aspect, for which the following approaches can be applied.
The first stage would be to separate the rainfall time distribution into more quantiles. In this study, the rainfall time distribution was classified into four quartiles. A finer division (for example, into 8, 10, or even 20 quantiles) could be used to generate both the input data and the test data. The second stage would be to apply weather modeling techniques (such as statistical, physics-based, and machine learning-based modeling) that mimic real rainfall patterns to create virtual rainfall data. Statistical models can generate new rainfall patterns from historical rainfall data by modeling the statistical properties of weather events (Wilks 2019). Physics-based weather modeling numerically simulates weather conditions by establishing physical models of weather phenomena (Kalnay 2003). Machine learning-based modeling can learn from past rainfall data through neural networks or regression models and generate new rainfall data (Hsieh 2009).
Finally, it is necessary to apply rainfall scenarios with more finely subdivided conditions to consider more diverse rainfall variables and to develop and apply customized rainfall and flood factors for each specific watershed. This could be achieved by using continuous observation data and various statistical techniques.
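The idea of dividing the rainfall time distribution into more classes can be sketched as follows. This is only an illustration under stated assumptions: synthetic_hyetograph is a hypothetical helper that places a triangular intensity peak at the midpoint of the chosen class, which is a simple stand-in and not the empirical Huff curves used in this study. The rainfall amounts, durations, the number of classes, and the 10-minute time step are likewise assumed values.

```python
from itertools import product

import numpy as np

def synthetic_hyetograph(total_mm, n_steps, peak_class, n_classes):
    """Distribute total_mm over n_steps with a triangular intensity shape.
    peak_class runs from 1 to n_classes; the peak sits at the midpoint of that class.
    This is a simplified stand-in, not an empirical Huff curve."""
    if not 1 <= peak_class <= n_classes:
        raise ValueError("peak_class must be between 1 and n_classes")
    peak_pos = (peak_class - 0.5) / n_classes   # dimensionless time of peak, in (0, 1)
    t = (np.arange(n_steps) + 0.5) / n_steps    # midpoints of the time steps, in (0, 1)
    # Rise linearly to the peak, then fall linearly to the end of the storm.
    weights = np.where(t <= peak_pos, t / peak_pos, (1.0 - t) / (1.0 - peak_pos))
    return total_mm * weights / weights.sum()   # depths per step sum to total_mm

# Assumed scenario ranges for demonstration only.
amounts_mm = [50.0, 100.0, 150.0]
durations_h = [3, 6]
n_classes = 10          # finer than the four quartiles
steps_per_hour = 6      # 10-minute time steps (assumption)

# One scenario per combination of amount, duration, and peak class.
scenarios = [
    {
        "amount_mm": a,
        "duration_h": d,
        "peak_class": c,
        "hyetograph": synthetic_hyetograph(a, d * steps_per_hour, c, n_classes),
    }
    for a, d, c in product(amounts_mm, durations_h, range(1, n_classes + 1))
]
print(len(scenarios))
```

Each generated hyetograph would then be passed to the R–R model to obtain the corresponding flood response. The number of scenarios grows in proportion to the number of classes, so the simulation cost rises accordingly.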
SUMMARY AND CONCLUSION
In this study, we developed a hybrid modeling method that can forecast flooding time in real time based on a minimal set of rainfall variables in an urban river basin. The proposed method was capable of producing instant and reliable forecasts of flooding time under various rainfall conditions in a small urban river basin. This was achieved by pre-analyzing various flood cases through R–R simulations of virtual rainfall scenarios and applying the synthetic rainfall scenarios as the input data for the LSTM model. Thus, we overcame the shortage of data on observed rainfall and flood events and accurately forecasted the time of flood occurrence for various rainfall conditions. The R–R model was optimized by calibrating its parameters, and the validation results yielded an average NSE of >0.8.
The synthetic rainfall scenarios were created by combining three representative rainfall variables: rainfall amount, duration, and time distribution. The proposed LSTM model was able to forecast the flood occurrence time based on these three rainfall variables only. We also validated the performance of the model through a comparison with observed data from an actual rainfall event. The comparison revealed that the LSTM model achieved high accuracy, with a correlation coefficient (CC) of >0.9 and an NSE of >0.8, indicating its practicability for flood forecasting. Moreover, we expect that the practical applicability of the model for flood forecasting and warnings will improve as its prediction accuracy is increased through the measures discussed above.
FUNDING INFORMATION
This research was supported by the Seoul Institute of Technology (2022-AC-030).
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.