## ABSTRACT

With the impact of global climate change and urbanization, the risk of urban flooding has increased rapidly, especially in developing countries. Real-time monitoring and prediction of the flooding extent and the state of the drainage system are the foundation of effective urban flood emergency management. Therefore, this paper presents a rapid nowcasting method for urban flooding based on a data-driven model and real-time monitoring. The proposed method first uses a small number of monitoring points to deduce the real-time water level across the whole urban area with a machine learning algorithm. Then, a data-driven method is developed to achieve dynamic urban flooding nowcasting from real-time monitoring data and high-accuracy precipitation prediction. The results show that, for the water level deduction method, the average MAE and RMSE are 0.101 and 0.144 for the urban surface and 0.124 and 0.162 for the conduit system, respectively, and probabilistic statistical analysis indicates that the flooding depth deduction is more stable than the deduction for the conduit system. Moreover, the urban flooding nowcasting method can accurately predict the flooding depth, with *R*^{2} values as high as 0.973 and 0.962 in testing. The urban flooding nowcasting method provides technical support for emergency flood risk management.

## HIGHLIGHTS

A rapid nowcasting method for urban flooding based on a data-driven model and real-time monitoring is presented.

The deduction model accurately estimates the global water depth.

The proposed urban flooding nowcasting model outperformed the traditional machine learning model in prediction accuracy.

## INTRODUCTION

With the impact of global climate change, extreme weather events are becoming more and more frequent (Güneralp *et al.* 2015; Li & Willems 2020). Meanwhile, the increase in urban impervious surface area due to rapid global urbanization will lead to greater flood risk, especially in developing countries (Ding *et al.* 2022; Wang *et al.* 2022). Rapidly predicting urban flooding is a critical part of providing decision-makers with enough time to take action and thus minimize damage (Hou *et al.* 2021; Yan *et al.* 2021).

To better predict flooding scenarios, hydraulic models based on physical methods are used extensively to simulate urban floods (Chang *et al.* 2021). The most commonly used software packages, such as InfoWorks ICM (Innovyze 2019), LISFLOOD-FP (Bates & De Roo 2000) and HEC-RAS (de Arruda Gomes *et al.* 2021), are based on the shallow water equations (SWEs) and their simplified forms (Guidolin *et al.* 2016). However, solving the SWEs at high spatial resolution is complex and computationally expensive (Zhao *et al.* 2020; Buttinger-Kreuzhuber *et al.* 2022). Therefore, different prediction methods have been proposed for rapid and accurate urban flood prediction.

Machine learning methods have emerged in recent years as surrogate models for urban flooding prediction (Mosavi *et al.* 2018). They provide an efficient and accurate prediction approach that does not need to resolve complex physical processes such as nonlinear fluid motion (Kabir *et al.* 2020; Chu *et al.* 2020). Recently, these methods have been extensively used in urban flood prediction and risk assessment (Madayala *et al.* 2022; Youssef *et al.* 2022; Zahura & Goodall 2022). However, these models are usually based on hydraulic model data or historical data and can only provide static predictions of urban flooding conditions (Wu *et al.* 2020; Zhou *et al.* 2021; Zhou 2022). In contrast, understanding the dynamics of flooding processes and the associated variable effects on flood zone objects (e.g., buildings) can provide valuable information for better management of flood disasters (Gao *et al.* 2021).

Moreover, data-driven methods also have shortcomings in the dynamic prediction of urban flooding. Changes in actual conditions affect forecast accuracy and make dynamic nowcasting difficult (Mancini *et al.* 2022). Therefore, effective model calibration based on real-time monitored data is essential for developing a highly accurate data-driven model (Fattoruso *et al.* 2015). Nevertheless, deploying monitoring sensors requires considerable human and material resources, and it is not practical to install a large number of them (Banik *et al.* 2015; Yazdi 2018). Thus, it is important to achieve dynamic prediction of urban flooding from real-time monitoring with a small number of monitoring sensors.

The present work aims to develop a method based on a data-driven model and real-time monitoring for rapid nowcasting of urban flooding. The main components are as follows: (i) determining the location and number of monitoring sensors and developing a deduction method for the water level of the whole drainage system based on these sensors, and (ii) proposing a method based on a data-driven model and real-time monitoring for rapid nowcasting of urban flooding.

## METHODOLOGY

An urban flooding nowcasting method was developed to enhance the dynamic nowcasting of urban flooding based on a data-driven model and real-time monitoring data. The method consists of three modules: the data sources, the deduction of real-time water levels across the whole system with a machine learning (ML) algorithm, and the rapid nowcasting of urban flooding with a data-driven method.

### Data sources

A hydraulic model was constructed using InfoWorks ICM software to simulate the urban drainage system and flooding scenarios; it includes the conduit drainage model and the urban surface flood model. Rainfall events with precipitation greater than 25 mm in 24 h were identified from historical rainfall data as events that may lead to urban flooding (Zhou *et al.* 2019). The selected historical rainfall events were used as the input of the established hydraulic model, and 1,200 different scenarios were simulated by adjusting the initial level of the drainage nodes, the water level of rivers, the hydraulic structures, the control facilities and so on. The datasets were then obtained from the simulation results of the hydraulic model and subsequently used to construct the UFP-RD model.

### Real-time monitoring

Optimizing the arrangement of monitoring sites is an important part of real-time monitoring; its essence is to obtain as much information about the urban drainage system as possible from a limited number of monitoring sites. Thus, we adopt the Principal Component Analysis (PCA) method to reduce the data dimensions for monitoring site optimization. The water level time series of the drainage nodes and surface nodes in the datasets were used as the input to the PCA algorithm to extract the main eigenvectors and obtain the dimension reduction results. The specific theories and calculation process are shown in Text S1 of the Supplementary material. The PCA results are first ranked by eigenvalue, and for each principal component the point with the maximum load is taken as a candidate monitoring site. The top-ranked sites are then selected according to the number of monitoring points required. The water levels at the nodes of the drainage system are not isolated but interact with each other, and the water levels at neighboring points are strongly correlated because of their hydraulic connections. Thus, the vast majority of the water level information of the drainage system can be obtained from the selected monitoring points by ML modeling techniques.
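The site-selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Text S1: it assumes NumPy, a synthetic water-level matrix, PCA on the correlation matrix, and a tie-breaking rule that skips a node already chosen for an earlier component. The function name `select_monitoring_sites` is hypothetical.

```python
import numpy as np

def select_monitoring_sites(levels, n_sites):
    """Pick candidate monitoring nodes from a (time steps x nodes) water-level matrix.

    Assumed reading of the procedure: rank principal components by eigenvalue,
    then take, for each component in turn, the node with the largest absolute loading.
    """
    # Standardize each node's series so that PCA acts on the correlation matrix.
    z = (levels - levels.mean(axis=0)) / levels.std(axis=0)
    corr = (z.T @ z) / z.shape[0]
    eigvals, eigvecs = np.linalg.eigh(corr)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # rank components by eigenvalue
    sites = []
    for k in order:
        # Visit nodes by descending |loading| on component k.
        for node in np.argsort(-np.abs(eigvecs[:, k])):
            if int(node) not in sites:           # assumption: skip nodes already chosen
                sites.append(int(node))
                break
        if len(sites) == n_sites:
            break
    explained = eigvals[order] / eigvals.sum()   # variance share per component
    return sites, explained

# Synthetic example: two hypothetical hydraulic signals drive six nodes.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
a, b = np.sin(2 * np.pi * t), np.cos(6 * np.pi * t)
levels = np.column_stack([a, 0.9 * a, 0.8 * a, b, 0.7 * b, 0.5 * a + 0.5 * b])
levels = levels + 0.01 * rng.standard_normal(levels.shape)

sites, explained = select_monitoring_sites(levels, n_sites=2)
print("selected nodes:", sites)
print("cumulative variance explained:", np.cumsum(explained).round(3))
```

Because neighboring nodes are strongly correlated, a few components carry most of the variance in such data, which is why a small number of sites can represent the whole system.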

Subsequently, we adopt ML techniques to quantify the cross-scale correlation between local and global water levels. The eXtreme Gradient Boosting (XGBoost) model, proposed by Chen & Guestrin (2016), is a scalable tree boosting model derived from the Gradient Boosting Decision Tree (GBDT). XGBoost helps avoid overfitting and accelerates convergence by performing a second-order Taylor expansion of the loss function and adding a regularization term. The basic idea of the algorithm is to keep adding trees through feature splitting: each added tree learns a new function that fits the residuals of the previous round's prediction, and the score of a sample is predicted from its features. XGBoost is selected as the ML algorithm in this study because it is fast, efficient, accurate and fault-tolerant compared with other ML methods, and it has been widely used for urban flood prediction (Liao *et al.* 2023; Wang *et al.* 2023).
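For reference, the second-order expansion and the regularization term mentioned above take the following form in Chen & Guestrin (2016). The notation is that of the XGBoost paper rather than of this study: $f_t$ is the tree added in round $t$, $\hat{y}_i^{(t-1)}$ is the prediction from the previous round, $T$ is the number of leaves, $w$ is the vector of leaf weights, and $\gamma$ and $\lambda$ are regularization coefficients.

```latex
\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^{2},

g_i = \frac{\partial\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^{2} l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^{2}}
```

Each round therefore fits the new tree to the first- and second-order gradient statistics of the loss at the previous prediction, while $\Omega$ penalizes trees with many leaves or large leaf weights.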

In this study, min-max normalization was used for data preprocessing to eliminate adverse effects caused by anomalous sample data. Then, the dataset from the hydraulic model was divided into two sub-datasets: a training set (1,080 scenarios) and a testing set (120 scenarios), corresponding to 90 and 10% of the dataset, respectively. The training set provides the data samples for model fitting, and gradient descent on the training error is performed during training to learn the trainable parameters. The testing set is used to evaluate the generalization ability of the final model (Ahmed *et al.* 2019). The *k*-fold cross-validation method was used for developing the XGBoost model, with *k* set to 10. The inputs of the XGBoost model are the water levels at the monitoring sites and the simulated precipitation events from the training set, and the outputs are the water levels at all points other than the monitoring sites. The learning rate of the XGBoost model was 0.1, and the maximum depth and number of trees were 8 and 500, respectively. The specific details are given in Text S2 of the Supplementary material. Finally, the local-global deduction (LGD) model was developed based on PCA and XGBoost for real-time water level deduction.
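The preprocessing and data-splitting steps above can be sketched in plain Python. This is a hedged illustration only: the helper names are hypothetical, the random shuffle and the choice to compute the scaling bounds from a given set of rows are assumptions, and the hyperparameter dictionary simply restates the reported values using the parameter names of xgboost's scikit-learn-style interface without training a model.

```python
import random

def min_max_fit(rows):
    """Column-wise minima and maxima of a list of equal-length numeric rows."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def min_max_apply(rows, lo, hi):
    """Scale each column to [0, 1]; constant columns are mapped to 0."""
    return [[(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(r, lo, hi)]
            for r in rows]

def split_scenarios(n_scenarios, test_fraction, seed=0):
    """Shuffle scenario ids and split them into training and testing ids."""
    ids = list(range(n_scenarios))
    random.Random(seed).shuffle(ids)
    n_test = round(n_scenarios * test_fraction)
    return ids[n_test:], ids[:n_test]

def k_fold(ids, k):
    """Yield (train_ids, validation_ids) pairs for k roughly equal folds."""
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]

# 1,200 simulated scenarios, 90/10 split, 10-fold cross-validation on the training set.
train_ids, test_ids = split_scenarios(1200, 0.10)
folds = list(k_fold(train_ids, 10))

# Reported LGD settings; the key names are an assumed mapping to xgboost parameters.
lgd_params = {"learning_rate": 0.1, "max_depth": 8, "n_estimators": 500}

lo, hi = min_max_fit([[1.0, 10.0], [3.0, 20.0]])
scaled = min_max_apply([[1.0, 10.0], [3.0, 20.0]], lo, hi)
print(len(train_ids), len(test_ids), len(folds), scaled)
```

With these settings each cross-validation round trains on 972 scenarios and validates on 108, while the 120 testing scenarios stay untouched until the final evaluation.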

### Urban flooding risk prediction

*et al.* 2021). Similarly, the *k*-fold cross-validation method was used for developing this XGBoost model, with *k* set to 10; the learning rate was 0.003, and the maximum depth and number of trees were 8 and 200, respectively. The autoregression training method was used for developing the time series XGBoost model, and the specific details are given in Text S3 of the Supplementary material. The implemented models were run on a personal computer with an AMD Ryzen 7 5700G processor, 32 GB of random access memory and an NVIDIA RTX 3060 Ti 8 GB graphics processing unit (GPU).

For each manhole, the flood risk in a rainfall event is determined from the predicted water level at the most unfavorable moment in the *i*th manhole, the elevation of the bottom of the *i*th manhole and the elevation of the ground level at the *i*th manhole. For the urban ground segment, according to the Code for Design of Outdoor Wastewater Engineering GB50014-2021 (Ministry of Housing and Urban-Rural Development of the People's Republic of China 2021) and the Technical Code for Urban Flooding Prevention and Control GB51222-2017 (Ministry of Housing and Urban-Rural Development of the People's Republic of China 2017), a 2 cm depth of water on an asphalt pavement is not considered a flooding event, whereas an average depth of water on a road section greater than or equal to 15 cm is considered a flooding event. Thus, we used 2 and 15 cm as the zoning values for flooding risk calculations. The flood risk of the *i*th ground point in a rainfall event is determined from the predicted water level at the most unfavorable moment at the *i*th ground point and the elevation of the ground level at that point.
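As an illustration of how the 2 and 15 cm zoning values could be applied to a predicted surface water level, a minimal sketch follows. It is not the risk expression used in this study: the function names, the three zone labels (in particular the intermediate label "ponding") and the treatment of the boundary values are assumptions.

```python
def surface_depth(water_level_m, ground_elevation_m):
    """Water depth above ground in metres; negative differences count as dry."""
    return max(water_level_m - ground_elevation_m, 0.0)

def surface_flood_zone(depth_m, low=0.02, high=0.15):
    """Assign a zone from the 2 cm and 15 cm zoning values (labels are hypothetical)."""
    if depth_m <= low:
        return "no flooding"      # up to 2 cm is not treated as a flooding event
    if depth_m < high:
        return "ponding"          # assumed label for the intermediate zone
    return "flooding"             # 15 cm or more is treated as a flooding event

# Example: predicted level of 3.60 m at a ground point with elevation 3.50 m.
depth = surface_depth(3.60, 3.50)
print(round(depth, 3), surface_flood_zone(depth))
```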

### Model evaluation

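The model performance was evaluated using the maximum absolute error (MAXE), mean absolute error (MAE), root mean square error (RMSE) and coefficient of determination (*R*^{2}). The specific calculation is shown in Equations (3)–(6):

$$\mathrm{MAXE} = \max_{1 \le i \le m} \left| y_i - \hat{y}_i \right| \tag{3}$$

$$\mathrm{MAE} = \frac{1}{m} \sum_{i=1}^{m} \left| y_i - \hat{y}_i \right| \tag{4}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^{2}} \tag{5}$$

$$R^{2} = 1 - \frac{\sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^{2}}{\sum_{i=1}^{m} \left( y_i - \bar{y} \right)^{2}} \tag{6}$$

where $y_i$ is the measured value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the average value of $y_i$, and *m* is the number of samples. Values of MAXE, MAE and RMSE closer to 0 indicate better model results. The *R*^{2} value is between 0 and 1, and the closer it is to 1, the better the fit (Chu *et al.* 2020).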
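The four indicators can be computed with a few lines of plain Python. The sketch below assumes the standard definitions (maximum absolute error, mean absolute error, root mean square error, and the coefficient of determination as one minus the ratio of residual to total sum of squares); the function names and the sample values are illustrative only.

```python
import math

def maxe(measured, predicted):
    """Largest absolute difference between measured and predicted values."""
    return max(abs(y - p) for y, p in zip(measured, predicted))

def mae(measured, predicted):
    """Mean of the absolute differences."""
    return sum(abs(y - p) for y, p in zip(measured, predicted)) / len(measured)

def rmse(measured, predicted):
    """Square root of the mean squared difference."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(measured, predicted)) / len(measured))

def r2(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(measured) / len(measured)
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in measured)
    return 1.0 - ss_res / ss_tot

# Made-up water levels (m), used only to exercise the functions.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print(maxe(measured, predicted), mae(measured, predicted),
      round(rmse(measured, predicted), 4), r2(measured, predicted))
```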

*et al.* 2020), which was slightly modified to make the evaluation of CCH more objective in this study. CCH is a histogram used to visualize the differences or associations between two groups. By displaying the distribution of the two groups and their cumulative frequencies at different percentiles, it provides an intuitive understanding of the overall data distribution. The CCH is a bar chart with the same number of columns as the corresponding CDF; the value of each column is defined from the frequency of the predicted values and the frequency of the measured values in the matching CDF column, where *n* is the number of columns in the CDF. When the model works very well, the CDFs of the predicted and measured values are exactly the same, and the values of all columns in the CCH are uniform. Because the evaluation of a CCH is highly subjective, Chen *et al.* (2020) proposed the CD value to evaluate a CCH more objectively; it is computed from the values of the *n* columns of the CCH. The CD value is between 0 and 1, and a value closer to 0 indicates better stability of the model.

## CASE STUDY

### Study area

The SGT polder has an average altitude of 3.5 m (Yellow Sea elevation datum of China) and is served by a separate sewer system. The drainage system has a design return period of less than one year, and thus SGT is highly prone to urban flooding when disaster-causing rainstorms occur.

### Data collection and hydraulic model calibration

Meteorological data were taken from historical meteorological monitoring records provided by the WheatA agro-weather big data system. To obtain real-time observed data, sensors were installed at the sites identified by the optimal arrangement of monitoring sites. The installation positions of these sensors are shown in Figure 3. Three water level meters (HOBO U20L-01) were installed at different locations along the river. Two rain gauges (L99-YL, China) were also installed to record precipitation events. Finally, six flow-level meters (Isco 2150) were installed at the optimized nodes of the drainage system as monitoring sites.

The river pump gate operation data was obtained from the management department, and the model was calibrated with these data and actual measurement data from the sensors. In several rainfall events for validation, the *R*^{2} of the hydraulic model at each flow meter reached 0.92–0.98. This indicates that the developed hydraulic model possesses acceptable accuracy, and hence it can accurately simulate the hydrodynamic process of the study area.

## RESULTS AND DISCUSSION

### Real-time monitoring of urban drainage system

*et al.* 2022). This means that the water level in almost all of the drainage system in the SGT area can be calculated if the water levels in the 14 manholes are known. Meanwhile, Figure 4 also shows that increasing the number of principal components beyond 14 does not effectively improve the variance explained; in other words, it cannot improve the accuracy of the urban global water level. In addition, according to Kaiser's rule, only the principal components with eigenvalues greater than one are significant in PCA (Gewers *et al.* 2022). These principal components (eigenvalue >1) are the same as the 14 principal components described above. The point corresponding to the maximum load of each principal component is listed for selection (Table S1, Supplementary material). To balance economy and accuracy, six monitoring sensors (95.3% explanation rate) were installed in the case study area, with two used for validation (Figure 3).
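The two criteria used here, cumulative variance explained and Kaiser's rule, are straightforward to compute once the eigenvalues are known. The sketch below uses made-up eigenvalues rather than the values behind Figure 4, and the function names are hypothetical.

```python
def cumulative_explained(eigenvalues):
    """Cumulative share of variance explained, components ranked by eigenvalue."""
    total = sum(eigenvalues)
    running, shares = 0.0, []
    for ev in sorted(eigenvalues, reverse=True):
        running += ev
        shares.append(running / total)
    return shares

def kaiser_count(eigenvalues):
    """Number of components kept by Kaiser's rule (eigenvalue greater than one)."""
    return sum(1 for ev in eigenvalues if ev > 1.0)

# Hypothetical eigenvalues of a correlation matrix with five variables.
eigenvalues = [4.0, 2.5, 1.2, 0.2, 0.1]
shares = cumulative_explained(eigenvalues)
print([round(s, 4) for s in shares], kaiser_count(eigenvalues))
```

In this toy case three components pass Kaiser's rule and already explain more than 96% of the variance, so adding further components gains little, which mirrors the flattening of the curve described for Figure 4.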

The average *R*^{2} values of the surface nodes and manholes reach 0.909 and 0.899, respectively. In addition, the average MAE and RMSE values between the two models are relatively low: 0.101 and 0.144 for the urban surface, and 0.124 and 0.162 for the conduit system, respectively. These results reveal a strong fit between the LGD model and the hydraulic model, indicating that the results deduced by the LGD model are reliable. Figure 5 presents the most unfavorable flooding scenario under a disaster-causing rainfall event. A comparison of the LGD model and hydraulic model results reveals that the two models simulate similar flood extents, which are also consistent with the recorded flood spots in the SGT polder. This supports the reasonableness of the results deduced by the LGD model.

| Metrics | Urban surface: average | Urban surface: maximum | Urban surface: minimum | Conduit system: average | Conduit system: maximum | Conduit system: minimum |
|---|---|---|---|---|---|---|
| MAXE | 0.164 | 0.239 | 0.109 | 0.218 | 0.373 | 0.122 |
| MAE | 0.101 | 0.192 | 0.051 | 0.124 | 0.239 | 0.064 |
| RMSE | 0.144 | 0.252 | 0.074 | 0.162 | 0.286 | 0.088 |
| *R*^{2} | 0.909 | 0.952 | 0.852 | 0.899 | 0.945 | 0.836 |

*R*^{2} is 0.899. The results were similar to those for the urban surface.

### Urban flooding risk prediction

The *R*^{2} for the XGBoost model was 0.933 and 0.914 at Points 1 and 2, whereas the *R*^{2} for the UFP-RD model was higher, at 0.973 and 0.962, respectively. The other evaluation indicators also show that the proposed UFP-RD model achieves high accuracy in the dynamic prediction process. Two factors explain the gap. On the one hand, the dataset used to develop the XGBoost model consists of scenarios simulated by the hydraulic model. However, the hydraulic model is only a simplification of the real situation and differs from it in some respects, which causes the difference between the XGBoost model output and the monitoring data. On the other hand, the XGBoost model predicts the flooding situation for the entire area directly from inputs such as rainfall, by building a mathematical model that maps these input features to water levels. While the XGBoost model can learn complex relationships in the data, it does not use water level data from actual monitoring points as inputs, which may limit its accuracy in certain situations. The proposed UFP-RD model is more accurate because it incorporates real monitoring data: it learns the mapping between the water levels at the monitoring points and those at other points, and then deduces the results from the real monitoring point data.

| Metrics | UFP-RD: Point 1 | UFP-RD: Point 2 | XGBoost: Point 1 | XGBoost: Point 2 |
|---|---|---|---|---|
| MAXE | 0.053 | 0.132 | 0.127 | 0.212 |
| MAE | 0.015 | 0.062 | 0.062 | 0.107 |
| RMSE | 0.042 | 0.081 | 0.103 | 0.142 |
| *R*^{2} | 0.973 | 0.962 | 0.933 | 0.914 |

It is clear from the comparison analysis above that this paper's urban surface part lacks a comparison between simulation results and actual flooding depth. This is because monitoring data for this area is difficult to obtain.

The UFP-RD model improved the prediction accuracy compared with the traditional ML model because it uses the real-time observed data to eliminate the cumulative error at each time step. Moreover, the UFP-RD model retains the high computational efficiency and short run time of the traditional ML model, with a total calculation time of 0.29 s (CPU) and 0.25 s (GPU) (Figure S3, Supplementary material). Thus, the proposed model is well suited to the nowcasting of urban flooding scenarios.
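The effect of correcting an autoregressive forecast with observations can be shown with a toy example. The sketch below is not the UFP-RD model: the one-step dynamics, the constant bias of the surrogate and all names (`rollout`, `true_step`, `surrogate_step`) are hypothetical, chosen only to show how feeding observed levels back into the next step keeps the error from accumulating.

```python
def rollout(step, level0, rain, observations=None):
    """Autoregressive forecast: the level fed into each step is the previous output.

    When an observation is available for a time step, it replaces the
    prediction before the next step is computed.
    """
    level, forecast = level0, []
    for t, r in enumerate(rain):
        pred = step(level, r)
        forecast.append(pred)
        obs = observations[t] if observations is not None else None
        level = pred if obs is None else obs
    return forecast

def true_step(level, rain_mm):
    """Hypothetical 'real' dynamics: storage decay plus a rainfall contribution."""
    return 0.9 * level + 0.005 * rain_mm

def surrogate_step(level, rain_mm):
    """Hypothetical surrogate with a constant 1 cm bias per step."""
    return true_step(level, rain_mm) + 0.01

rain = [10.0, 20.0, 5.0, 0.0, 0.0, 0.0]
truth = rollout(true_step, 0.2, rain)
open_loop = rollout(surrogate_step, 0.2, rain)                      # no monitoring data
corrected = rollout(surrogate_step, 0.2, rain, observations=truth)  # corrected each step

open_err = [abs(p - y) for p, y in zip(open_loop, truth)]
corr_err = [abs(p - y) for p, y in zip(corrected, truth)]
print([round(e, 4) for e in open_err])
print([round(e, 4) for e in corr_err])
```

In this toy setting the uncorrected error grows from 1 cm toward several centimetres over six steps, whereas the corrected forecast stays at the single-step bias of 1 cm.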

## CONCLUSIONS

This study develops a rapid nowcasting prediction method for urban flooding risk based on data-driven and real-time monitoring. Our approach provides an LGD model based on an ML algorithm that enables quantifying the cross-scale relationship in local-global water levels. The following conclusions were obtained from a case study in the SGT polder of J City:

(1) The LGD model can deduce the urban global water level based on a small number of monitoring sites. The average MAE and RMSE of the LGD model were as low as 0.101 and 0.144 for the urban surface and 0.124 and 0.162 for the conduit system, respectively. Meanwhile, the LGD model was more stable when deducing water levels on the urban surface than in the conduit system.

(2) This study implements a nowcasting prediction method of urban flooding scenarios based on data-driven and real-time monitoring data, which is named the UFP-RD model. The proposed UFP-RD model has higher accuracy than the traditional ML algorithm in predicting the water depth of urban floods, and it retains the advantages of high computational efficiency. It can provide important technical reference for the early warning and control of urban flooding disasters.

## ACKNOWLEDGEMENTS

This project was supported by the Basic Public Welfare Research Program of Zhejiang Province (ZJWZ24E090002).

## DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.

## CONFLICT OF INTEREST

The authors declare there is no conflict.

## REFERENCES

**578**, 124084.

**13**(10), 3692–3715.

**30**(6), 16081–16105