Research on flood forecasting based on flood hydrograph generalization and random forest in Qiushui River basin, China

At present, hydrological models are the main technical approach for real-time flood forecasting. However, in arid and semi-arid areas, their use is restricted by technical and data conditions. As hydrological data accumulate, making full use of historical records and mining the hydrological laws, causal relationships and other valuable information behind them offers a new approach to real-time flood forecasting in such areas. This paper develops a hybrid flood forecasting model that combines the flood hydrograph generalization method and random forest for the Qiushui River basin in the middle reaches of the Yellow River. The performance of this hybrid model is compared to that of the antecedent precipitation index model. For the development of these models, 23 flood events occurring from 1980 to 2010 are selected, of which 18 are used for calibration and 5 are used for validation. The results show that the hybrid model yields accurate predictions, and the comparison shows that it performs better than the empirical model in the Qiushui River basin. Thus, this study provides a method for improving the accuracy of flood forecasting.


INTRODUCTION
the Hec-1 Model (Chem ), emerged in this period. In the second half of the 20th century, many multi-parameter and complex conceptual lumped models have been developed in succession by countries all over the world, such as the TANK model (Sugawara ), antecedent precipitation index (API) model (Sittner et al. ) and Xin'anjiang model (Renjun et al. ). These conceptual hydrological models have played an important role in studying hydrological laws and solving practical problems in production. Another However, the existing hydrological models are more suitable for flood forecasting in humid areas. Our study area, Qiushui River basin, consists of an arid and semi-arid region where the spatial composition of flood sources is complex (Li et al. ), the forecast accuracy of hydrological models is often low, which is difficult to meet the needs of flood control and disaster reduction in this area (Li et al. ). Hence, there is an urgent need for a flood forecasting method that can not only avoid the direct simulation of physical flood formation processes in arid and semi-arid areas but also meet forecasting accuracy requirements. Another type of the flood forecasting model is the data-driven model. This type of model does not consider the physical mechanism of the hydrological process, regards the hydrological process as a black box and determines the mathematical function according to input and output data.
Random forest (RF) is a data-driven model that combines the predictions from an ensemble of decision trees (Breiman ). RF has become popular in various industries due to its predictive power and processing speed (Svetnik et al. ). The remainder of this paper is organized as follows: 'Study area and data' introduces the study area and the data used.

STUDY AREA AND DATA
The Qiushui River basin is located on the left bank of the middle reaches of the Yellow River. The Qiushui River is a tributary of the Yellow River, and its basin covers an area of 1,989 km². The basin contains more than 20 branch ditches with drainage areas larger than 10 km², arranged as asymmetric pinnate inlets.  Table 1 lists the information of these flood events, including the beginning and ending time.

Flood hydrograph generalization method
Flood hydrograph generalization refers to the production of a representative flood hydrograph based on the observed flood hydrograph data of a large number of flood events at a hydrological station. The method comprises the following steps. First, plot all flood hydrographs on the same drawing, where the ordinate is the ratio Q_i/Q_m and the abscissa is the ratio T_i/T. Here, Q_m is the peak discharge, T is the total duration of the flood process, and Q_i and T_i are the discharge and time, respectively, at any moment. Then, align the times of peak discharge, and take the average hydrograph, which summarizes the flood shape characteristics of the station, as the generalized flood hydrograph. In this paper, considering that the flood recession process is long, the flood process is divided into two parts: the rising and recession processes. Moreover, we assume that the total flood duration is twice as long as the duration of the rising process. The 6-point generalization method is taken as an example to control the hydrograph characteristics, as shown in Figure 2. When the flood hydrograph is calculated from the generalization map, the coordinates of the points in the graph are (0, α_1 Q_m), (β_1 T, α_2 Q_m), (β_2 T, Q_m), (β_3 T, α_3 Q_m), (0.5T, Q_g), (0, T), respectively. Here, Q_m is the peak discharge, Q_g is the maximum discharge of the recession process, and T is the flood duration.
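The normalization and averaging steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the number of grid points and the linear interpolation are assumptions, and each event is assumed to have its peak strictly inside the record so that both limbs have a non-zero duration.

```python
def interpolate(xs, ys, x):
    """Linear interpolation of (xs, ys) at x; xs must be strictly increasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            w = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + w * (ys[i] - ys[i - 1])


def normalize_limb(times, flows, q_peak, n_points):
    """Resample one limb onto a 0..1 time axis with ordinates Q_i / Q_m."""
    t0, t1 = times[0], times[-1]
    rel_t = [(t - t0) / (t1 - t0) for t in times]
    rel_q = [q / q_peak for q in flows]
    grid = [k / (n_points - 1) for k in range(n_points)]
    return [interpolate(rel_t, rel_q, g) for g in grid]


def column_means(rows):
    """Average several equally long lists position by position."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def generalize(events, n_points=11):
    """Average the dimensionless rising and recession limbs over all events.

    Each event is a (times, flows) pair. Splitting every event at its peak
    places all peaks at the same position before averaging.
    """
    rising, recession = [], []
    for times, flows in events:
        q_peak = max(flows)
        i_peak = flows.index(q_peak)
        rising.append(
            normalize_limb(times[:i_peak + 1], flows[:i_peak + 1], q_peak, n_points))
        recession.append(
            normalize_limb(times[i_peak:], flows[i_peak:], q_peak, n_points))
    return column_means(rising), column_means(recession)
```

Treating the two limbs separately mirrors the split into rising and recession processes, and it guarantees that the averaged curve reaches a relative discharge of 1 exactly at the peak.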

RF (Breiman ) is a machine learning algorithm combin-
ing the Bagging ensemble learning theory (Breiman ) and the random subspace method (Ho ). An RF is a classifier consisting of a collection of tree-structured classifiers {h(x, Θ n ), n ¼ 1, . . .}, where {Θ n } are independent identically distributed random vectors, and each tree casts a unit vote for the most popular class at input x (Breiman ). RF utilizes bootstrap resampling technology to sample original samples to generate a number of training samples, each of which randomly selects feature attributes through random subspace methods to construct a decision tree. Finally, the optimal result is obtained by the voting or averaging method. Previous studies have found that RF can effectively overcome the problems of noise and overfitting and obtain a high prediction precision (Wang et al. ). RF has two main technological aspects: the first is bootstrap resampling technology and out-of-band error estimation; the second is decision tree construction and the random subspace theory (Liang et al. ).
The main structure of the model is shown in Figure 3.

Hybrid model
The hybrid model combines the two methods above. The API model is based on the physical mechanism of rainfall and runoff generation in basins and takes the main influencing factors as parameters to establish the quantitative correlation between rainfall and runoff. Some common parameters are antecedent precipitation, seasonal characteristics and precipitation duration.
We use the (P + P_a) - R relation graph (Kohler & Linsley ), which relates the sum of precipitation (P) and antecedent precipitation (P_a) to runoff (R), as shown in Equation (1):
The main steps of this method are as follows. First, calculate the average daily precipitation in the basin. Second, compute the values of antecedent precipitation at the early stage of the forecast period as follows:
where n is the number of days that influence the flood event, with a general value of 15 days; k is a constant coefficient, with a general value of 0.85 (Bao ). Third, R is calculated from the (P + P_a) - R relation.
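The antecedent precipitation step can be illustrated with a short sketch. Because the equation itself is not reproduced in the text, the code assumes the commonly used form P_a = k P_(t-1) + k^2 P_(t-2) + ... + k^n P_(t-n); the function name and the input layout (a list of daily basin-average precipitation ending on the day before the forecast) are also assumptions.

```python
def antecedent_precipitation(daily_precip, k=0.85, n=15):
    """Antecedent precipitation index under the assumed form sum(k**i * P_(t-i)).

    daily_precip is ordered oldest to newest; its last element is the day
    immediately before the forecast. Only the n most recent days are used.
    """
    recent = daily_precip[-n:]
    # reversed() makes i = 1 correspond to the most recent day.
    return sum(k ** i * p for i, p in enumerate(reversed(recent), start=1))
```

With k = 0.85, a rainfall of 10 mm on the previous day contributes 8.5 mm to P_a, while the same rainfall 15 days earlier contributes less than 1 mm, which is why a window of about 15 days is sufficient.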
Finally, taking R as the input quantity, the flood hydrograph is obtained from the unit hydrograph (UH).
where Q_obv and Q_cal are the observed and simulated discharge, respectively.
DT = |T_cal - T_obv|
where T_obv and T_cal are the observed and simulated flood duration, respectively.
where Q_obs(i) and Q_cal(i) are the observed and simulated discharge series, respectively; \bar{Q}_obs and \bar{Q}_cal are the means of the observed and simulated discharge series, respectively; and N is the length of the time series considered.
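The evaluation statistics can be sketched in a few lines. Only DT is given explicitly in the text; the relative error of discharge, the correlation coefficient (CC) and the root mean square error (RMSE) are written here in their standard forms, which is an assumption, and all function names are illustrative.

```python
import math


def duration_error(t_cal, t_obv):
    """DT = |T_cal - T_obv|."""
    return abs(t_cal - t_obv)


def relative_error_pct(q_cal, q_obv):
    """Relative error of a discharge value in percent (assumed definition)."""
    return abs(q_cal - q_obv) / q_obv * 100.0


def rmse(obs, cal):
    """Root mean square error between two equally long discharge series."""
    return math.sqrt(sum((o - c) ** 2 for o, c in zip(obs, cal)) / len(obs))


def correlation_coefficient(obs, cal):
    """Pearson correlation between observed and simulated discharge."""
    mean_obs = sum(obs) / len(obs)
    mean_cal = sum(cal) / len(cal)
    cov = sum((o - mean_obs) * (c - mean_cal) for o, c in zip(obs, cal))
    spread_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in obs))
    spread_cal = math.sqrt(sum((c - mean_cal) ** 2 for c in cal))
    return cov / (spread_obs * spread_cal)
```

CC measures how well the shape of the simulated hydrograph follows the observed one, whereas RMSE also penalizes a systematic offset, so the two statistics are best read together.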

RESULTS AND DISCUSSION
Hybrid model development

Flood hydrograph generalization
Considering the long process of flood recession, the flood process is divided into two parts: the rising and recession processes. The generalization method is applied to each of these two processes. The rising and recession processes of the 18 flood events of the calibration period were generalized for each flood hydrograph, and finally, the general flood hydrograph was obtained by averaging the individual flood hydrographs (Figures 5-8).

Screening of predictors
Correlation analyses were carried out between the precipitation factors (peak hourly precipitation (P_m), accumulated 5 h precipitation (AP_5), accumulated 10 h precipitation (AP_10), accumulated 15 h precipitation (AP_15), precipitation intensity during the rising process (PI), time of peak discharge (T_Qm) and time of peak precipitation (T_Pm)) and each of the predictands, namely peak discharge (Q_m), duration of the rising process (T_s) and the maximum discharge of the recession process (Q_g), to select the key influential predictors, as shown in Table 2.
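A screening of this kind can be sketched as a ranking of candidate predictors by the absolute value of their Pearson correlation with a predictand. The threshold of 0.5, the function names and the sample values below are hypothetical and are not taken from Table 2.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def screen_predictors(candidates, target, threshold=0.5):
    """Return the predictors with |r| >= threshold, strongest first, plus all scores."""
    scores = {name: pearson(values, target) for name, values in candidates.items()}
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    kept = [name for name in ranked if abs(scores[name]) >= threshold]
    return kept, scores


# Invented numbers, for illustration only.
candidates = {"P_m": [1, 2, 3, 4, 5], "AP_5": [5, 1, 4, 2, 3]}
peak_discharge = [10, 20, 30, 40, 50]
kept, scores = screen_predictors(candidates, peak_discharge)
```

The same call would be repeated with T_s and Q_g as the target, since each predictand may retain a different set of predictors.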

RF model development
The model was built in three steps. In the first step, peak discharge was forecasted. The duration of the rising process and maximum discharge of the recession process were forecasted in the second and third steps, respectively, with the total duration set to twice the duration of the rising process.
The predicted flood hydrograph was obtained by substituting the predicted time into the generalized flood process.
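A minimal sketch of this substitution step is given below. It assumes that the generalized hydrograph is stored as pairs of relative time and relative discharge, both between 0 and 1, and that the forecast peak discharge and rising duration are already available; the shape values and names are hypothetical, and the separate treatment of the recession maximum Q_g is not covered.

```python
def build_hydrograph(shape, q_peak, t_rise):
    """Scale a dimensionless hydrograph to a forecast event.

    shape  : list of (relative_time, relative_discharge) pairs in 0..1
    q_peak : forecast peak discharge Q_m
    t_rise : forecast duration of the rising process T_s
    The total duration is taken as twice the rising duration.
    """
    total_duration = 2.0 * t_rise
    return [(rel_t * total_duration, rel_q * q_peak) for rel_t, rel_q in shape]


# Hypothetical generalized shape with its peak at half of the total duration.
shape = [(0.0, 0.1), (0.25, 0.4), (0.5, 1.0), (0.75, 0.3), (1.0, 0.05)]
forecast = build_hydrograph(shape, q_peak=1000.0, t_rise=6.0)
```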
Eighteen floods were used to calibrate the model, and five floods were used to validate the model.
First, the selected M predictors were used to construct the training sample set D together with the predictand series.
where X is the M-dimensional explanatory variable vector composed of predictors, Y is the target variable of the predictand series, and N is the sample capacity. Second, n training sample subsets were randomly taken from the training sample set D through bootstrap resampling, and the size of the training sample subset was N. Third, n decision trees were constructed for the n training sample subsets. According to the random subspace theory, m indexes (generally, ) were randomly selected from the M indexes.
Then, the optimal value was selected based on the principle of entropy increasing, and this value was the final node attribute value. n in this research was set at 100. Finally, each decision tree was executed based on top-down recursive growth to obtain a predicted value. The results of the n decision trees were then voted or averaged to obtain the ultimate classification or regression results, namely, the final values of predictands.
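The procedure above can be sketched in plain Python as a small regression forest. Several choices here are assumptions rather than the authors' implementation: splits are chosen by minimizing the summed squared error (a common criterion for continuous targets, used here in place of the entropy criterion), the m candidate features are redrawn at every split, the default for m is arbitrary, and all names are illustrative.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, targets, features):
    """Return the (feature, threshold) with the lowest summed squared error, or None."""
    best, best_score = None, float("inf")
    for f in features:
        # Dropping the largest value keeps the right-hand side non-empty.
        for threshold in sorted({r[f] for r in rows})[:-1]:
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            score = sse(left) + sse(right)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best


def grow_tree(rows, targets, m, rng, min_size=2):
    """Grow a tree top-down and recursively; a leaf is the mean of its targets."""
    if len(rows) < 2 * min_size or len(set(targets)) == 1:
        return sum(targets) / len(targets)
    features = rng.sample(range(len(rows[0])), m)  # random subspace of size m
    split = best_split(rows, targets, features)
    if split is None:
        return sum(targets) / len(targets)
    f, threshold = split
    left = [i for i, r in enumerate(rows) if r[f] <= threshold]
    right = [i for i, r in enumerate(rows) if r[f] > threshold]
    return (f, threshold,
            grow_tree([rows[i] for i in left], [targets[i] for i in left], m, rng, min_size),
            grow_tree([rows[i] for i in right], [targets[i] for i in right], m, rng, min_size))


def predict_tree(node, row):
    """Follow the splits until a leaf value is reached."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node


def fit_forest(rows, targets, n_trees=100, m=None, seed=0):
    """Train n_trees trees, each on a bootstrap sample of size N."""
    rng = random.Random(seed)
    m = m or max(1, len(rows[0]) // 3)  # assumed default, not taken from the paper
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in range(len(rows))]
        forest.append(grow_tree([rows[i] for i in idx], [targets[i] for i in idx], m, rng))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees (regression)."""
    return sum(predict_tree(tree, row) for tree in forest) / len(forest)
```

Because every leaf stores a mean of training targets and the forest averages the leaves, a prediction can never fall outside the range of the training targets, whatever the input values are.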
When forecasting the predictands, the training sample of the RF model was calculated as follows:
The forecasting statistics of the RF model in the calibration period are shown in Table 3. Moreover, according to the accuracy requirement for flood forecasting in the Chinese standard for hydrological information and hydrological forecasting (GB/T22482-2008), a 20% variation between the observed peak discharge and forecasted peak
In the calibration period of the model, the relative error of the peak discharge was less than 20% for 13 events; thus, the QR for forecasting Q_m is 72%.
In addition, the average value of DQ_m in the calibration period was 18.8%. As for the flood duration, four flood events had a DT value of 0, which is a good result. However, the DT value of No. 19890722 was relatively large, mainly because the observed duration is relatively long. Ten of the 18 flood events had a relative error of Q_g of less than 20%; therefore, the QR for forecasting Q_g is 55.6%. The average value of DQ_g in the calibration period was 22.4%. Broadly speaking, however, the accuracy was satisfactory. Table 4 shows the forecasting statistics of the RF model in the validation period.
During the validation period, the QR for forecasting Q_m and Q_g is 80% and 60%, respectively. The average value of  20100919, the accuracies of forecasting the peak discharge and the maximum discharge of the recession process are both very low. This is mainly due to a property of the RF model: although RF is a powerful model, it cannot make predictions beyond the range of the training data. That is, when the maximum and minimum peak discharges in the training set are 1,520 and 207 m³/s, the forecasted discharge cannot be greater than 1,520 m³/s or smaller than 207 m³/s. However, the observed Q_m of No. 20100919 is 2,280 m³/s, which is beyond the maximum value of the training set. Therefore, the forecast for this event will be relatively poor in any case. The same is true for Q_g.
In general, the model provided acceptable accuracy in both the calibration and validation periods.
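The qualified rates quoted above follow directly from the number of events whose relative error stays below the 20% tolerance. The short sketch below reproduces that arithmetic; the function name is illustrative, and the error values are placeholders chosen only to give the reported counts.

```python
def qualified_rate(relative_errors_pct, tolerance_pct=20.0):
    """Percentage of events whose relative error is below the tolerance."""
    qualified = sum(1 for e in relative_errors_pct if e < tolerance_pct)
    return 100.0 * qualified / len(relative_errors_pct)


# Placeholder errors: 13 of 18 calibration events qualify for Q_m, 10 of 18 for Q_g.
qr_qm = qualified_rate([10.0] * 13 + [30.0] * 5)
qr_qg = qualified_rate([10.0] * 10 + [30.0] * 8)
```

Thirteen of 18 events give about 72.2% and ten of 18 give about 55.6%, in line with the calibration values reported for Q_m and Q_g.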

Empirical model development
The (P + P_a) - R relation graph is shown in Figure 9. The UH was derived in a conventional manner using the selected events, as shown in the figure.
Generally, if the precipitation intensity is large, the peak discharge of the UH is higher and the peak time is earlier. Conversely, if the precipitation intensity is small, the peak discharge is lower and the peak time lags behind. When the precipitation center is upstream, the peak discharge of the UH is lower and the peak time lags behind because of the long confluence path. Conversely, when the precipitation center is downstream, the peak discharge is higher and the peak time is earlier. Therefore, the UH is classified and compiled according to the location of the precipitation center and the magnitude of the net precipitation. We summarized the flood events into four types of unit hydrographs, as shown in Figure 10. The classification of the flood events is shown in Table 5.
After the values of P + P_a are computed, the values of R are read from the relation graph; the flood process is then obtained from the UH and Equation (12).
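The last step can be illustrated with a discrete convolution of the runoff series with the UH ordinates. Equation (12) is not reproduced in the text, so the standard convolution form is assumed here; the function name and the sample numbers are illustrative.

```python
def convolve_unit_hydrograph(runoff, unit_hydrograph):
    """Discharge series from runoff depths and UH ordinates.

    runoff          : runoff depth R for each time step
    unit_hydrograph : discharge ordinates produced by one unit of runoff
    Each runoff pulse generates a scaled copy of the UH, shifted to the time
    step of the pulse; the copies are summed.
    """
    flow = [0.0] * (len(runoff) + len(unit_hydrograph) - 1)
    for i, r in enumerate(runoff):
        for j, u in enumerate(unit_hydrograph):
            flow[i + j] += r * u
    return flow


# Two runoff pulses routed through a three-ordinate UH (invented values).
flow = convolve_unit_hydrograph([1.0, 2.0], [10.0, 20.0, 5.0])
```

In the empirical model, the UH passed to such a routine would be the one of the four classified types that matches the precipitation center and net precipitation of the event.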
Comparison of the hybrid model and empirical model
The CC and the RMSE of the hybrid model and empirical model in the calibration period are summarized in Table 6.
It is evident from Table 6 that the hybrid model performs better than the empirical model in the calibration period. Yet, two events show different results. For No. 19840701 and No. 19910721, the empirical model has better values of CC, which are 0.02 and 0.06 higher, respectively, than those of the hybrid model. This might be explained by the variance of the antecedent precipitation. When the antecedent precipitation is well distributed over time, the empirical model performs well, and when the antecedent precipitation is concentrated in a short time, the hybrid model is better. However, the real problem of the Qiushui River basin is that the spatial and temporal distribution of precipitation is often uneven and appears as peaks that rise and fall steeply. Thus, the hybrid model can be more suitable than the traditional model in the

Table 5. Classification of the flood events
UH1: 19880715, 19890716, 19960809
UH2: 19800818, 19810620, 19810703, 19840701, 19910610, 19920802
UH3: 19900811, 19910721, 19910915, 19920828
UH4: 19810707, 19850805, 19880718, 19890722, 19910727

Figure 12 shows

CONCLUSIONS
To address the low forecasting accuracy of hydrological models in arid and semi-arid areas, this paper develops a flood forecasting method that combines the flood hydrograph generalization method and RF in the Qiushui River basin. First, selected flood events from 1980 to 2010 were generalized using the flood hydrograph generalization method. Then, the peak discharge and flood duration were forecasted using the RF method, and the flood processes were deduced. The specific findings of this study are as follows:
1. RF cannot make predictions beyond the range of the training data, which may lead to poor results when forecasting extreme values. No solution to this problem is proposed in this study; it is left for future work.
2. Our study found that when the antecedent precipitation is well distributed over time, the empirical model performs well, and when the antecedent precipitation is concentrated in a short time, the hybrid model is better. The accuracy of current hydrological models is often lower in arid and semi-arid areas with more complex hydrological processes. The Qiushui River basin is such an area, where the spatial and temporal distribution of precipitation is often uneven and appears as peaks that rise and fall steeply. In this study, for the