Abstract
This article proposes a multi-head attention flood forecasting model (MHAFFM) that combines a multi-head attention mechanism (MHAM) with multiple linear regression for flood forecasting. Compared to models based on Long Short-Term Memory (LSTM) neural networks, MHAFFM enables precise and stable multi-hour flood forecasting. First, the model exploits the full-batch, stable input characteristic of multiple linear regression to eliminate the oscillation observed in the prediction results of existing models. Second, full-batch information is connected to the MHAM to improve the model's ability to process and interpret high-dimensional information. Finally, the model accurately and stably predicts future flood processes through linear layers. The model is applied to the Dawen River Basin, and experimental results show that MHAFFM, compared to three benchmarking models, namely LSTM, BOA-LSTM (LSTM with a Bayesian optimization algorithm for hyperparameter tuning), and MHAM-LSTM (LSTM with MHAM in the hidden layer), significantly improves prediction performance under different lead-time scenarios while maintaining good stability and interpretability. Taking the Nash–Sutcliffe efficiency index as an example, under a lead time of 3 h, the MHAFFM model exhibits improvements of 10.29, 3.71, and 8.85% over the three benchmarking models, respectively. This research provides a new approach for flood forecasting.
HIGHLIGHTS
Proposes a novel multi-head attention flood forecasting model (MHAFFM).
Multi-head attention mechanism strengthens the model's ability to handle high-dimensional data.
Linear layers effectively harness the performance of the multi-head attention mechanism.
MHAFFM significantly enhances the stability of forecasted results.
Even in longer lead time scenarios, the model maintains high accuracy and stability.
INTRODUCTION
Existing flood forecasting models can be roughly divided into two types according to the driving process. One is the process-driven hydrological model, based on the specific physical processes of flow generation and confluence (Wang et al. 2021); the other is the data-driven model, based on measured hydrological data (Zhang et al. 2020; Cao et al. 2022; Gao et al. 2022). The former is limited by insufficient knowledge of some flood processes, making it difficult to describe the mechanism of specific hydrological processes in detail (Chomba et al. 2022). The latter is widely used in flood forecasting because it only needs to capture the relationship between input and output and does not need to describe complex physical processes (Liang et al. 2018; Liu et al. 2019; Krisnayanti et al. 2022; Yuan et al. 2022). However, classical data-driven models rest on the premise that the residual sequence is independent and normally distributed. General hydrological sequences rarely satisfy this condition, which limits this type of method (Wang et al. 2012).
With the improvement of computing power, data-driven machine learning models, especially deep learning models, have shown powerful learning capabilities (Ekwueme 2022). They are not constrained by hypothetical principles and are widely used in flood forecasting (Ditthakit et al. 2023; Min et al. 2023; Xie et al. 2023). Among deep learning models, the recurrent neural network (RNN) can express spatiotemporal anisotropy and has structural advantages in time series prediction (Elman 1990; Frame et al. 2022). However, because weights are propagated forward and backward through every time step, RNNs are prone to exploding or vanishing gradients on longer sequences, which limits their application (Noh 2021). In response to this problem, Hochreiter & Schmidhuber (1997) constructed the LSTM model on the basis of the RNN and used gating units to filter data, which effectively reduced the occurrence of these problems. Hu et al. (2018) used this model for rainfall–runoff modeling for the first time and achieved good results. Since then, the LSTM model has been widely used in runoff prediction with good results (Abbas et al. 2020; Gao et al. 2020; Cao et al. 2022).
However, the LSTM model is a black-box model: the randomness with which its gating units screen data leads to poor interpretability, making it difficult for practitioners to fully trust this type of model (Herath et al. 2021; Li et al. 2022a; Paudel et al. 2023). In 2019, the International Association of Hydrological Sciences emphasized, in its twenty-three unsolved problems in hydrology, the importance of resolving confusion and reducing uncertainty in model structure, parameters, and inputs for hydrological prediction (Blöschl et al. 2019). Jiang et al. (2022) likewise demonstrated the significance of interpreting deep learning models for understanding the mechanisms of flood formation. The interpretability of a model therefore cannot be ignored in hydrological forecasting.
In addressing the issue of model interpretability, Cai et al. (2022) improved model performance by adding physical mechanism constraints to the model, ensuring that the recursive process of the model aligns with physical mechanisms. Chadalawada et al. (2020) are dedicated to flexibly combining machine learning algorithms to construct rainfall–runoff model components with physical significance. In the field of deep learning, Bahdanau et al. (2014) proposed an attention mechanism that was embedded into the hidden layer of RNN models to improve model performance and observe the degree of model attention to data, thus enhancing model interpretability. RNN combined with the attention mechanism has achieved good performance in many fields (Xu et al. 2019; Zhao et al. 2020; Hwang et al. 2021).
In the direction of hydrological forecasting, Chen et al. (2020) combined the LSTM model with a self-attention mechanism for daily runoff forecasting, and the results were more accurate than those of the LSTM model alone. Gao et al. (2022) combined the attention mechanism with the gated recurrent unit (GRU) model, introduced a linear layer for input information processing, and used a seq2seq architecture for multistep flood forecasting, which improved the accuracy of flood forecasting. However, in the current stage of hydrological forecasting, research on deep learning models incorporating attention mechanisms is primarily focused on improving prediction accuracy, with a limited investigation into model prediction result stability. This lack of exploration hinders the practical application and promotion of interpretable deep learning models in real-world scenarios.
Therefore, besides model prediction accuracy, this study also pays particular attention to the distribution of model performance indicators. By comparing changes in the predicted outcome metrics, we find that although existing coupling methods that combine the LSTM model with the attention mechanism can balance interpretability and the average performance of flood forecasting results, the stability of their performance metrics is poor. In extreme cases, performance can even be worse than that of the basic model, which reduces the reliability of the model and makes it unsuitable for practical flood forecasting applications. To address this issue, this article analyzes the factors that affect model stability and replaces the traditional use of LSTM as the hidden layer with multiple linear regression, which stabilizes the input data in a full batch. By doing so, it overcomes the oscillation of model prediction results and constructs a stable deep learning model.
In addition, current flood forecasting research generally couples the attention mechanism or self-attention mechanism with the LSTM model. However, due to their structural characteristics, these two mechanisms have limited ability to extract features from high-dimensional arrays, and it is difficult for them to establish an accurate affine transformation for a nonlinear flood routing process influenced by multiple factors, which also affects model accuracy and result stability (Vaswani et al. 2017). Existing research in hydrological forecasting typically addresses this through data processing, with commonly used methods including principal component analysis and data decomposition (Sarraf 2015; Adnan et al. 2021; Carreau & Guinot 2021). Although these methods can improve model performance, they require preprocessing of all runoff data, and future information is introduced into the data processing step, which is unrealistic for models operating under actual flood forecasting conditions (Zhang et al. 2015; Tan et al. 2018; Zuo et al. 2020).
To enhance the model's practicality, this article introduces a multi-head attention architecture from the perspective of model design. The multi-head attention mechanism was introduced by the Google team in the article proposing the Transformer model, aimed at handling high-dimensional data (Vaswani et al. 2017). Currently, the ChatGPT model based on this architecture has achieved tremendous success in various applications. In the field of hydrology, however, there is limited research using this architecture for prediction tasks. Therefore, this article adopts the multi-head attention mechanism, exploiting its partitioning of high-dimensional data into information subspaces, to solve the problem of processing high-dimensional runoff arrays and improve the model's applicability.
Overall, this article draws on the ideas of multiple linear regression and the multi-head attention architecture to propose the multi-head attention flood forecasting model (MHAFFM), breaking away from the LSTM framework and using a new architectural form to improve the accuracy and stability of deep learning predictions for flood forecasting. It also analyzes the interpretability of the MHAFFM model to help understand the behavioral logic of flood forecasting models and improve their trustworthiness. The main contributions of this article are as follows:
- (1)
The introduction of the multi-head attention mechanism, utilizing its ability to jointly focus on different subspaces of data, enhances the model's feature extraction capability for flood processes. It provides a novel way for model predictions in high-dimensional data scenarios.
- (2)
By modifying the hidden layer of the model based on the concept of multiple linear regression, we enable full-batch input information to enter the attention architecture, establishing the MHAFFM model and addressing the stability issue in flood forecasting predictions caused by the LSTM framework.
- (3)
By analyzing the behavioral logic of the MHAFFM model, a highly interpretable flood forecasting model with high accuracy is provided.
The rest of this article is arranged as follows: Section 2 introduces the proposed MHAFFM model and the three benchmarking models used for comparison, LSTM, BOA-LSTM, and MHAM-LSTM, together with the model evaluation metrics. Sections 3–5 present the results of the study conducted in the Dawen River Basin in Shandong Province, China. Finally, Section 6 provides a summary of the article.
METHODOLOGY
LSTM
The LSTM model belongs to the RNN family. Its structure exploits the continuity of time series information, which is fed into the hidden layer step by step for weighting. Compared with other types of neural network models, the LSTM model has the following advantages: (1) owing to the way it processes information, it can learn the spatiotemporal correlation of time series data without explicitly encoding the temporal position of each sample; (2) as an improved version of the RNN, it introduces gating units that alleviate gradient vanishing and explosion, improving its applicability.
In the output gate, the new cell state $c_t$ is activated by the tanh function, and the new hidden state $h_t$ is selected for output by the output gate $o_t$: $h_t = o_t \odot \tanh(c_t)$.
Information of early-stage precipitation $\{x_1, x_2, \ldots, x_n\}$ is passed through the LSTM unit, and at time $n$, all temporal information that has passed through the gating units is accumulated in the hidden state $h_n$. $h_n$ is then fed into a linear layer to establish a linear relationship between it and the predicted runoff $\{q_{t+1}, q_{t+2}, q_{t+3}\}$.
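A minimal sketch of this structure in PyTorch (the framework the article reports using) may help; the names are illustrative rather than the authors' code, while the 128 hidden units, 12-step input window, and 3-h horizon follow the settings reported later in this article:

```python
# Minimal sketch of the LSTM forecasting structure described above
# (illustrative names; not the authors' code).
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, n_features, hidden_size=128, lead_time=3):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, lead_time)

    def forward(self, x):          # x: (batch, 12, n_features) rainfall-runoff window
        out, _ = self.lstm(x)      # gated accumulation of temporal information
        h_n = out[:, -1, :]        # h_n holds everything passed through the gates
        return self.head(h_n)      # linear map to runoff at t+1 .. t+3
```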
BOA-LSTM
Table 1. Optimal hyperparameters of the BOA-LSTM model

| Learning rate | Hidden units | L2 regularization |
|---|---|---|
| 6.291784 × 10⁻³ | 180 | 1.49 × 10⁻⁶ |
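The article does not specify which Bayesian optimization implementation it uses, so the sketch below uses Optuna (whose default TPE sampler is one Bayesian approach) purely to illustrate how the three hyperparameters in Table 1 could be searched; `LSTMForecaster` is the sketch above, and `train_and_validate` is a hypothetical helper returning a validation error:

```python
# Hedged illustration of BOA-style hyperparameter search; not the authors' setup.
import optuna

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True)
    hidden = trial.suggest_int("hidden_units", 32, 256)
    l2 = trial.suggest_float("l2_regularization", 1e-7, 1e-4, log=True)
    model = LSTMForecaster(n_features=8, hidden_size=hidden)  # n_features: placeholder
    return train_and_validate(model, lr=lr, weight_decay=l2)  # hypothetical helper

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)  # Table 1 reports lr ≈ 6.29e-3, 180 units, L2 ≈ 1.49e-6
```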
MHAM-LSTM
For the hidden state sequence $\{h_1, h_2, \ldots, h_n\}$ output by the LSTM unit, the multi-head attention mechanism proceeds as follows:
- (1)
The hidden state sequence $\{h_1, h_2, \ldots, h_n\}$ output by the LSTM unit is fed into a linear layer, which produces the query sequence $Q$, the key sequence $K$, and the value sequence $V$, respectively.
- (2)
The three sequences are divided into $h$ equal parts ($h$ is the number of heads of the multi-head attention mechanism) to obtain the subsequences $Q^{1}, Q^{2}, \ldots, Q^{h}$; $K^{1}, K^{2}, \ldots, K^{h}$; and $V^{1}, V^{2}, \ldots, V^{h}$.
- (3)
Using the query subsequence at time step $n$ as the final query and computing its scaled dot product with the key sequence, the attention scores for each time step are obtained. Taking the computation of the $h$th head at time step $n$ as an example:

$$\alpha^{h}_{n,i} = \operatorname{softmax}\!\left(\frac{Q^{h}_{n}\,(K^{h}_{i})^{\top}}{\sqrt{d_k}}\right), \quad i = 1, \ldots, n$$

where $d_k$ represents the dimension of sequence $K$ and $\alpha^{h}_{n,i}$ denotes the attention weight that $Q^{h}_{n}$ imposes on $K^{h}_{i}$ at time step $n$.
- (4)
The attention weights are multiplied by the corresponding value subsequence and summed over the time steps to obtain the context vector of each head: $Z^{h} = \sum_{i=1}^{n} \alpha^{h}_{n,i} V^{h}_{i}$.
- (5)
The context vectors of all $h$ heads are concatenated: $Z = \operatorname{concat}(Z^{1}, Z^{2}, \ldots, Z^{h})$.
- (6)
$Z$ is fed into the linear layer 'linear' to produce the output of the multi-head attention structure.
The linear layer 'linear' is a linear function of the form $y = Wx + b$.
When compared to the traditional multi-head attention mechanism, the adaptive modification here is to compute attention only for the final query $Q_n$ instead of the whole query sequence $\{Q_1, \ldots, Q_n\}$. The reason is as follows:
The traditional multi-head attention mechanism was originally proposed to solve natural language processing (NLP) problems with a seq2seq architecture, where there is temporal parallelism between the input and output sequences and a close contextual relationship within the sequences. Specifically, any query $Q_i$ may be correlated with the final query $Q_n$, and each $Q_i$ contains a similar amount of information to $Q_n$. For flood forecasting problems, however, the output sequence usually lags behind the input sequence and the relationship is temporally unidirectional. There is no physical correlation between an intermediate query $Q_i$ and the forecast target, and $Q_i$ contains less information than $Q_n$. Therefore, $Q_n$ is used in place of the full query sequence for attention computation.
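A minimal PyTorch sketch of this adapted mechanism follows, with illustrative names (d_model = 128 and 16 heads match the hyperparameter settings reported later); only the final query attends over the keys, and the per-head weights are returned for inspection:

```python
# Sketch of multi-head attention with the final-query modification; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FinalQueryMHA(nn.Module):
    def __init__(self, d_model=128, n_heads=16):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)   # step (1): Q, K, V projections
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)      # step (6): output linear layer

    def forward(self, seq):                          # seq: (batch, n, d_model)
        b = seq.size(0)
        split = lambda x: x.view(b, -1, self.h, self.d_k).transpose(1, 2)  # step (2)
        q = split(self.q_proj(seq[:, -1:, :]))       # final query Q_n only
        k, v = split(self.k_proj(seq)), split(self.v_proj(seq))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5  # step (3): scaled dot product
        alpha = F.softmax(scores, dim=-1)            # attention weights over time steps
        ctx = alpha @ v                              # step (4): weighted sum of values
        ctx = ctx.transpose(1, 2).reshape(b, -1)     # step (5): concatenate the heads
        return self.out(ctx), alpha                  # output and weights for inspection
```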
MHAFFM
Through experiments, it has been found that although the MHAM-LSTM model can improve the accuracy of flood forecasting results, its stability is poor. The reasons for this phenomenon are as follows:
- (1)
The limited amount of observed data for flood forecasting makes it difficult to support the stable identification of the global optimal solution by the model.
Because hydrological observation techniques have developed over a relatively short time, the datasets used in flood forecasting tasks are relatively small. Compared to the GB-level datasets in NLP tasks, such data can hardly support the model in stably finding the global optimum; the model may instead fall into local optima, reducing its stability.
- (2)
Due to the gating mechanism of the LSTM hidden layer, it is difficult to feed the entire batch of flood data into the attention layer.
As shown in Equations (1)–(4), the hidden layer selectively inherits previous information and selectively admits current information. While this information processing method alleviates the gradient problem, it inevitably discards part of the input information required by the model, which also reduces model stability.
The first point is limited by actual conditions and has little room for improvement. Regarding the method of information transmission, however, this article attempts to use full-batch input to reconstruct the hidden layer and improve the information available to the attention layer. To pass full-batch data through the hidden layer, this article introduces the idea of multivariate linear regression, in which all explanatory variables enter the model simultaneously:

$$y = w_1 x_1 + w_2 x_2 + \cdots + w_m x_m + b$$
This article uses a linear layer as the input encoding layer, replacing the LSTM layer for processing input information, to ensure that all input information is passed to the multi-head attention structure. Compared with MHAM-LSTM, the advantages of this model are as follows: (1) replacing the LSTM layer with a simple linear layer for input data encoding improves the speed of the model; and (2) breaking away from the influence of the LSTM architecture improves the utilization of input data.
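A minimal sketch of this layout, reusing the FinalQueryMHA sketch from the previous section (the names are illustrative; the 12-step window and 3-h horizon follow the dataset settings given below):

```python
# Sketch of the MHAFFM layout: linear encoder -> multi-head attention -> linear head.
import torch.nn as nn

class MHAFFM(nn.Module):
    def __init__(self, n_features, d_model=128, n_heads=16, lead_time=3):
        super().__init__()
        self.encode = nn.Linear(n_features, d_model)  # full-batch encoding, no gating
        self.mha = FinalQueryMHA(d_model, n_heads)    # sketch from the previous section
        self.head = nn.Linear(d_model, lead_time)     # linear output layer

    def forward(self, x):             # x: (batch, 12, n_features)
        z = self.encode(x)            # every time step reaches the attention layer intact
        ctx, alpha = self.mha(z)
        return self.head(ctx)         # runoff at t+1 .. t+3
```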
Model evaluation indices
NSE is sensitive to the fluctuations of the data series and can characterize the tracking ability of the predicted values to the actual values, which is used to evaluate the prediction stability of the model (Kumar et al. 2016). RMSE and MAE are used to compute errors of predicted values, indicating the overall prediction accuracy of the model (Ćalasan et al. 2020). KGE combines model correlation, bias, and flow variability into a single metric to evaluate the model's performance. TPE is used to evaluate the prediction accuracy of the top 2% of the runoff process of the model (Gao et al. 2022). APB is used to evaluate the prediction error of the flood volume of the model (Miao et al. 2022).
NSE, RMSE, MAE, and KGE are used to analyze the performance of the model on the entire flood process, while TPE and APB are used to analyze the prediction effect of the model on single flood events.
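The four whole-process metrics have standard definitions; the NumPy sketch below follows the common formulas for NSE, RMSE, MAE, and KGE, since the article does not give its exact implementations:

```python
# Common definitions of the whole-process metrics; a sketch, not the authors' code.
import numpy as np

def whole_process_metrics(obs, sim):
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    rmse = np.sqrt(np.mean((sim - obs) ** 2))
    mae = np.mean(np.abs(sim - obs))
    nse = 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)
    r = np.corrcoef(obs, sim)[0, 1]          # correlation
    alpha = sim.std() / obs.std()            # flow variability ratio
    beta = sim.mean() / obs.mean()           # bias ratio
    kge = 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
    return {"RMSE": rmse, "MAE": mae, "NSE": nse, "KGE": kge}
```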
Methodological framework
RESEARCH AREA AND MODEL PARAMETERS
Research area and data
The research object of this article is the Dawen River Basin in Shandong Province, China, which lies in the middle and lower reaches of the Yellow River Basin. The Dawen River originates north of Xuangu Mountain in Shandong, with a total length of 209 km and a basin area of 9,098 km². It flows from east to west into Dongping Lake before joining the Yellow River, making it the last tributary before the Yellow River flows into the sea (Li et al. 2022b). Under the influence of the monsoon climate, more than 70% of the annual precipitation in this basin falls during the flood season, making it prone to seasonal flood disasters. After floods converge into the Yellow River, they may even threaten the safety of the main river channel downstream. Therefore, it is necessary to establish an accurate and stable hydrological model framework for this region.
Based on the measured data from hydrological and rainfall stations within the Dawen River basin, hourly runoff processes from the year 2000 to 2020 were compiled. The data source is the Shandong Hydrological and Water Resources Bureau of the Yellow River Conservancy Commission. Considering changes in underlying surface conditions, this article uses hourly runoff data from the Loude control section in the upstream basin and hourly flood event data from relevant rainfall stations from 2000 to 2020 for the model's flood process prediction performance analysis.
Input and output sequence selection
When using rainfall–runoff data to generate the model dataset, the time step and forecast lead time of the input and output sequences need to be selected. Increasing the time step allows the model to acquire more relevant information and improves the accuracy of flood process prediction, but it also increases the model's runtime. The time step is generally selected based on the basin's runoff concentration time. Considering the size of the basin, the time step is set to 12 h, which enables the model to obtain relevant information on flood processes and better demonstrate its learning ability. The maximum forecast lead time is set to 3 h, which can demonstrate the model's ability to learn the rules of flood processes. In short, this article uses the rainfall–runoff information from time $t-11$ to $t$ to predict the flood process from time $t+1$ to $t+3$.
In addition, to improve the model's running speed, this article uses parallelization techniques to divide the dataset into subsets, each containing 500 input–output pairs (Dean et al. 2012).
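A sketch of this windowing under the stated settings (12-step input window, 3-h horizon, subsets of 500 samples); `features` and `runoff` stand for hypothetical aligned hourly arrays:

```python
# Sliding-window dataset construction as described above; illustrative array names.
import numpy as np

def make_windows(features, runoff, n_in=12, n_out=3):
    X, y = [], []
    for t in range(n_in - 1, len(runoff) - n_out):
        X.append(features[t - n_in + 1 : t + 1])  # rainfall-runoff info, t-11 .. t
        y.append(runoff[t + 1 : t + 1 + n_out])   # flood process, t+1 .. t+3
    return np.stack(X), np.stack(y)

X, y = make_windows(features, runoff)
# Divide into subsets of 500 input-output pairs for parallel processing.
subsets = [(X[i:i + 500], y[i:i + 500]) for i in range(0, len(X), 500)]
```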
Hyperparameter settings
To better utilize the performance of the models, the hyperparameters are tuned in this article. Since LSTM serves as the baseline comparative model, the performance of the LSTM model was taken as the reference for setting the learning rate, the number of neurons, and the regularization parameters.
The learning rate is set to 1 × 10⁻³, which allows the LSTM model to converge stably to the optimum and avoids divergent results. In addition, all models with LSTM hidden layers use 128 neurons, which is sufficient to fit the nonlinear characteristics of the watershed data at a reasonable computational cost. The regularization parameter of the LSTM models is set to 1 × 10⁻⁵, which helps prevent overfitting.
To enhance comparability between models, the number of nonlinear encoding neurons in the encoding layer of the MHAFFM model is also set to 128, giving the encoding layer the same nonlinear expression ability as the LSTM hidden layer. For the MHAM-LSTM and MHAFFM models, which use a multi-head attention structure, the number of attention heads is set to 16, which suits the data size and enables parallel processing of the flood forecasting task. This article uses the Adam optimization algorithm for gradient descent and the mean squared error as the iteration loss. Each model is trained for 1,500 iterations, from which the optimal solution is selected (Kingma & Ba 2014).
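A sketch of these training settings (Adam, MSE loss, 1,500 iterations, keeping the best parameters), assuming the MHAFFM sketch and the windowed arrays X, y from earlier; selecting the optimum by tracking the lowest loss is an assumption, not the authors' stated procedure:

```python
# Training settings as reported above; a sketch, not the authors' script.
import copy
import torch

model = MHAFFM(n_features=X.shape[-1])
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
loss_fn = torch.nn.MSELoss()
xb = torch.as_tensor(X, dtype=torch.float32)   # full-batch input
yb = torch.as_tensor(y, dtype=torch.float32)

best_loss, best_state = float("inf"), None
for it in range(1500):                         # each model is trained 1,500 times
    opt.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    opt.step()
    if loss.item() < best_loss:                # keep the optimal solution
        best_loss = loss.item()
        best_state = copy.deepcopy(model.state_dict())
model.load_state_dict(best_state)
```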
Data processing workflow
RESULTS
Model performance
To validate the overall performance advantage of the MHAFFM model and the superiority of the multi-head attention mechanism, in addition to the four models constructed in this study, we also introduced the SAM-LSTM model (self-attention mechanism coupled with the LSTM model) proposed in related research in this section.
Table 2. Performance metrics of each model under different lead times

| Lead time (h) | Metric | LSTM | BOA-LSTM | SAM-LSTM | MHAM-LSTM | MHAFFM |
|---|---|---|---|---|---|---|
| 1 | MAE | 27.298 | **16.235** | 40.732 | 28.134 | 18.371 |
| 1 | RMSE | 41.094 | 30.734 | 69.099 | 44.672 | **30.618** |
| 1 | NSE | 0.941 | 0.968 | 0.840 | 0.932 | **0.969** |
| 1 | KGE | 0.904 | 0.925 | 0.779 | 0.861 | **0.926** |
| 2 | MAE | 37.436 | 26.106 | 44.781 | 34.390 | **24.875** |
| 2 | RMSE | 56.238 | 47.849 | 73.462 | 55.688 | **42.423** |
| 2 | NSE | 0.894 | 0.923 | 0.819 | 0.895 | **0.940** |
| 2 | KGE | 0.850 | 0.879 | 0.746 | 0.828 | **0.902** |
| 3 | MAE | 46.537 | 31.750 | 47.955 | 40.461 | **28.356** |
| 3 | RMSE | 70.038 | 57.661 | 77.084 | 67.385 | **48.577** |
| 3 | NSE | 0.836 | 0.889 | 0.801 | 0.847 | **0.922** |
| 3 | KGE | 0.793 | 0.849 | 0.728 | 0.801 | **0.896** |

Note: The bold values represent the best performance for each metric.
Table 3. Performance improvement of the MHAFFM model over each benchmarking model (%)

| Lead time (h) | Metric | LSTM | BOA-LSTM | SAM-LSTM | MHAM-LSTM |
|---|---|---|---|---|---|
| 1 | MAE | 32.70 | −13.16 | 54.90 | 34.70 |
| 1 | RMSE | 25.49 | 0.38 | 55.69 | 31.46 |
| 1 | NSE | 2.98 | 0.10 | 15.41 | 3.97 |
| 1 | KGE | 2.41 | 0.13 | 18.84 | 7.48 |
| 2 | MAE | 33.55 | 4.72 | 44.45 | 27.67 |
| 2 | RMSE | 24.57 | 11.34 | 42.25 | 23.82 |
| 2 | NSE | 5.15 | 1.84 | 14.78 | 5.03 |
| 2 | KGE | 6.16 | 2.64 | 20.89 | 8.93 |
| 3 | MAE | 39.07 | 10.69 | 40.87 | 29.92 |
| 3 | RMSE | 30.64 | 15.75 | 36.98 | 27.91 |
| 3 | NSE | 10.29 | 3.71 | 15.14 | 8.85 |
| 3 | KGE | 13.01 | 5.46 | 23.09 | 11.84 |
Based on a comprehensive analysis of Figures 11 and 12 and Tables 2 and 3, the MHAFFM model shows improved performance in all scenarios except the 1-h lead time, where its MAE is slightly worse than that of the BOA-LSTM model. Overall, the MHAFFM model outperforms the benchmarking models in terms of performance indicators. Furthermore, the MHAFFM model exhibits relatively little performance degradation as the lead time increases, and its advantage over the other models becomes more prominent at longer lead times. It can be seen that although both algorithmic hyperparameter optimization and coupling an attention mechanism can improve model performance, they still fall short of the MHAFFM model. In addition, the SAM-LSTM model constructed with the self-attention mechanism performed the worst on all indicators, while the MHAM-LSTM model constructed with the multi-head attention mechanism showed significant improvement over it. This indicates that the self-attention mechanism adapts poorly to high-dimensional input data.
The MHAFFM model, with a linear layer as its data input layer, shows significantly better stability than the MHAM-LSTM, SAM-LSTM, BOA-LSTM, and LSTM models in all four metrics. The four models with LSTM hidden layers exhibit poor stability, with significant oscillations in their prediction results and lower reliability. As the lead time increases, the prediction results of the LSTM, BOA-LSTM, SAM-LSTM, and MHAM-LSTM models all oscillate more strongly, whereas the MHAFFM model maintains excellent stability.
As can be observed, replacing the LSTM layer with a linear layer as the information processing layer of the multi-head attention mechanism provides the model with more stable information, significantly enhances the overall control of the flooding process, and improves performance in terms of MAE, RMSE, NSE, and KGE.
Furthermore, the SAM-LSTM model showed the poorest stability in its prediction results among the five models, indicating that its architecture is not suitable for the current flood forecasting task.
Model performance on single flood event
Table 4. Performance of each model on the first test flood event

| Model | APB (1 h) | TPE (1 h) | APB (2 h) | TPE (2 h) | APB (3 h) | TPE (3 h) |
|---|---|---|---|---|---|---|
| LSTM | 0.122 | **0.051** | 0.187 | 0.083 | 0.244 | 0.110 |
| BOA-LSTM | **0.038** | 0.081 | **0.090** | 0.132 | 0.129 | 0.177 |
| MHAM-LSTM | 0.185 | 0.055 | 0.217 | 0.093 | 0.246 | 0.140 |
| MHAFFM | 0.080 | 0.060 | 0.109 | **0.066** | **0.119** | **0.058** |

Note: The bold values represent the best performance for each metric.
Table 5. Improvement of the MHAFFM model over each model on the first test flood event (%)

| Compared to | APB (1 h) | TPE (1 h) | APB (2 h) | TPE (2 h) | APB (3 h) | TPE (3 h) |
|---|---|---|---|---|---|---|
| LSTM | 34.43 | −17.65 | 41.71 | 20.48 | 51.23 | 47.27 |
| BOA-LSTM | −110.53 | 25.93 | −21.11 | 50.00 | 7.75 | 67.23 |
| MHAM-LSTM | 56.76 | −9.09 | 49.77 | 29.03 | 51.63 | 58.57 |
Table 6. Performance of each model on the second test flood event

| Model | APB (1 h) | TPE (1 h) | APB (2 h) | TPE (2 h) | APB (3 h) | TPE (3 h) |
|---|---|---|---|---|---|---|
| LSTM | 0.054 | 0.054 | 0.082 | 0.064 | 0.144 | **0.073** |
| BOA-LSTM | 0.031 | 0.030 | 0.053 | 0.062 | 0.084 | 0.074 |
| MHAM-LSTM | 0.080 | 0.062 | 0.082 | 0.083 | 0.082 | 0.107 |
| MHAFFM | **0.019** | **0.026** | **0.031** | **0.036** | **0.033** | 0.089 |

Note: The bold values represent the best performance for each metric.
Table 7. Improvement of the MHAFFM model over each model on the second test flood event (%)

| Compared to | APB (1 h) | TPE (1 h) | APB (2 h) | TPE (2 h) | APB (3 h) | TPE (3 h) |
|---|---|---|---|---|---|---|
| LSTM | 64.82 | 51.85 | 62.20 | 43.75 | 77.08 | −21.90 |
| BOA-LSTM | 38.71 | 13.33 | 41.51 | 41.94 | 60.71 | −20.30 |
| MHAM-LSTM | 76.25 | 58.07 | 62.20 | 56.63 | 59.76 | 16.82 |
By combining Figures 14 and 15 with Tables 4 and 5, it can be observed that compared to the other three models, the performance advantage of the MHAFFM model in controlling flood peak and discharge for the first test event gradually becomes more prominent as the lead time increases. However, the performance of the MHAFFM model does not reach its optimal level at a 1-h lead time.
By combining Figures 16 and 17 with Tables 6 and 7, it can be observed that compared to the other three models, the performance advantage of the MHAFFM model in controlling the flooding process for the second test event becomes more prominent as the lead time increases. However, there is a significant decline in performance on the 3-h TPE indicator.
Taking into account the overall performance of each model, the MHAFFM model exhibits the best performance in most cases, with a few exceptions. We attribute the exceptions to the following: during training, the search for the optimal solution is based on the entire flood dataset, so the learned data relationships are guided by the overall optimization objective, making it difficult to achieve the best performance on every indicator for individual flood events. However, the fact that the MHAFFM model generally outperforms the other three models also indirectly proves that it has learned data mapping relationships closer to reality.
Considering Figures 18 and 19 together, the MHAFFM model's prediction performance on individual flood events remains stable. This indicates that the model repeatedly learns consistent flood process mappings across multiple instances, demonstrating a data abstraction capability that is both stable and reliable.
Fit of flood process
Visually, it can be observed from Figures 20–23 that the MHAFFM model fits the observed flood process more accurately than the other three models, and its predicted flood process exhibits higher correlation coefficients with the observed data.
Visualization of attention mechanism in the MHAFFM model
From Figure 24, it can be seen that the attention effects of the 16 heads of the MHAFFM model are all different, indicating that the model generates differentiated attention for different information subspaces and can handle high-dimensional input information well. However, because the model divides the data into 16 blocks for attention output, it is difficult to map these outputs back to the actual input. Therefore, this article reduces the 16-dimensional attention data to one dimension.
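One plausible way to realize this reduction is to average the 16 per-head attention maps into a single map before plotting; the article does not state its exact reduction method, so the sketch below is an assumption, with `alpha` standing for the stacked per-head weights:

```python
# Collapse the 16 attention heads into one map for visualization; assumed method.
import numpy as np
import matplotlib.pyplot as plt

# alpha: per-head attention weights, assumed shape (16, n_out_steps, n_in_steps)
alpha_1d = alpha.mean(axis=0)                 # reduce the head dimension to one map
plt.imshow(alpha_1d, cmap="viridis", aspect="auto")
plt.xlabel("Input time step")
plt.ylabel("Forecast time step")
plt.colorbar(label="Attention weight")
plt.show()
```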
Based on Figure 25 and the flood timing information, it can be observed that for flood event 1, during time steps 19–30, which correspond to the rising limb of the flood, the MHAFFM model pays more attention to the input information at time step 19, which corresponds to the first rainfall process in physical space. Moreover, as the time step increases, the main focus remains on the input information at that step. By time step 30, the flood peak, the main focus is still on the input at time step t−11 (i.e., at 19:00). Furthermore, the model's attention concentrates in the vicinity of the diagonal of the attention space (i.e., around 19:00). This indicates that the flood generation process mainly originates from the rainfall process around time step 19, which is consistent with physical understanding, so the interpretability of the model is relatively good.
An analysis of the reasons behind this attention pattern is as follows: First, the generation of a reasonable attention space is based on the complete transmission of fundamental information. Since the linear layer does not filter information, it conveys all available information to the attention layer (which also contributes to the stability of the model's predictions), allowing the multi-head attention mechanism to process the complete information. Second, the multi-head attention mechanism assigns appropriate weights to the input information through the scoring function. The information vectors before 22:00 have higher scaled dot-product values, obtaining higher weights (as observed in the green portion above the attention space).
Model execution speed
Model development is based on the torch framework in Python 3.8. The computations are performed on an NVIDIA GeForce RTX 3080 GPU and an Intel Core i7-11800H CPU. The average run time of each model is listed in Table 8.
Table 8. Average run time of each model

| Model name | LSTM | BOA-LSTM | MHAM-LSTM | MHAFFM |
|---|---|---|---|---|
| Time cost (s) | 57.94 | 1,266.98 | 98.07 | 60.59 |
In terms of model computation speed, the LSTM model is the fastest, followed by the MHAFFM model, while the MHAM-LSTM model takes slightly longer due to its coupled structure. On the other hand, the BOA-LSTM model requires hyperparameter optimization, resulting in a much longer computation time compared to the other three models.
ANALYSIS AND DISCUSSION
Comparative analysis with LSTM
Based on the results, it can be seen that compared to the LSTM model with the same parameters, the MHAFFM model achieves a significant improvement in both average performance and stability of model evaluation indicators with a small increase in time cost. At the same time, the MHAFFM model also has strong interpretability.
The analysis is as follows:
- (1)
In terms of model evaluation metrics, the combination of linear layers and the multi-head attention mechanism in the MHAFFM model can better map the flood process generation mechanism and fit the observed runoff sequence than the LSTM gating units.
- (2)
In terms of the stability of model prediction results, the screening method of the LSTM gating units, which discards part of the data, leads to oscillating prediction results, and the oscillation strengthens as the forecast horizon increases. The MHAFFM model, in contrast, has better stability and is not significantly affected by the forecast horizon.
- (3)
When compared to the current situation where the working mechanism of the LSTM model is difficult to explain (Li et al. 2022a), the multi-head attention mechanism in the MHAFFM model enhances the interpretability of the working mechanism.
Comparative analysis with BOA-LSTM
Using the Bayesian optimization algorithm, important hyperparameters of the LSTM model are optimized, significantly improving its flood forecasting performance at the expense of increased computation time. However, compared with the MHAFFM model, the BOA-LSTM model shows no significant performance advantage, and its prediction performance falls further behind the MHAFFM model as the forecast horizon increases. In addition, the Bayesian optimization algorithm only improves the LSTM model's evaluation metrics, while the stability of the model does not change significantly: prediction results still exhibit obvious oscillations, indicating that hyperparameter settings have little impact on model stability.
When compared to the BOA-LSTM model, which obtains the optimal hyperparameter combination for the current data, the MHAFFM model achieves superior prediction results at a lower time cost. This indicates that the combination of linear layers and the multi-head attention mechanism in the MHAFFM model has structural advantages over the LSTM architecture for the given data. The MHAFFM model also suffers less performance degradation as the lead time increases, yielding more accurate and stable predictions.
Comparative analysis with MHAM-LSTM
When compared to the MHAM-LSTM model, the MHAFFM model differs only in the data processing layer. However, in terms of model performance, stability, and computation time, the MHAFFM model outperforms the MHAM-LSTM model in all aspects. This suggests that the data processing layer of the two models has a significant impact on the performance of the subsequent multi-head attention mechanism.
The MHAFFM model adopts a linear layer to fully pass the data, which allows the multi-head attention mechanism to better explore the linear relationship between input information and flood processes, thereby obtaining stable and excellent prediction results. In contrast, the MHAM-LSTM model selectively passes data through gate units in the hidden layer, making it difficult for the multi-head attention mechanism to effectively work, resulting in poor performance and stability in flood forecasting. In addition, the MHAFFM model has a significantly faster operating speed than the MHAM-LSTM model. Therefore, when compared to the LSTM hidden layer, the linear layer activates the multi-head attention mechanism more concisely and efficiently, achieving more excellent prediction results.
CONCLUSION AND FUTURE PERSPECTIVES
This article introduces a multi-head attention mechanism to construct a flood forecasting model and explores the impact of the data processing method in the hidden layer on the performance of the multi-head attention mechanism. Ultimately, the LSTM architecture is abandoned and a linear layer is used as the hidden layer combined with the multi-head attention mechanism to propose the MHAFFM model. Through comparison with the LSTM model, BOA-LSTM model, and MHAM-LSTM model, the following conclusions are drawn:
- (1)
When compared to LSTM, BOA-LSTM, and MHAM-LSTM models, the proposed MHAFFM model exhibits less performance degradation with increasing lead time. It efficiently accomplishes flood forecasting tasks under different lead times, resulting in high prediction accuracy and good stability.
- (2)
The multi-head attention mechanism enables the model to obtain differentiated attention effects, which not only endows the model with interpretability but also enhances the model's ability to process high-dimensional data information.
- (3)
In the task of streamflow forecasting, feeding the full batch of data through a linear layer is more conducive to the performance of the multi-head attention mechanism than an LSTM hidden layer. This approach achieves higher prediction accuracy and stability and improves the model's reliability.
However, this study still has some limitations. First, the study area is characterized by a monsoon climate, and whether the conclusions are applicable under other climate conditions requires further analysis. Second, the forecast period is limited to 3 h, and whether the superior performance of the MHAFFM model can be maintained beyond this duration needs further investigation. In addition, the hyperparameters of the MHAFFM model were not algorithmically optimized, and the potential improvement in model performance in this direction remains to be explored.
ACKNOWLEDGEMENTS
The authors are grateful for the support of the special project for collaborative innovation of science and technology in 2021 (No: 202121206) and Henan Province University Scientific and Technological Innovation Team (No: 18IRTSTHN009).
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.