ABSTRACT
In this paper, we address the critical task of 24-h streamflow forecasting using advanced deep-learning models, with a primary focus on the transformer architecture, which has seen limited application in this specific task. We compare the performance of five models: persistence, long short-term memory (LSTM), gated recurrent unit (GRU), Seq2Seq, and transformer, across four distinct regions. The evaluation is based on three performance metrics: Nash–Sutcliffe Efficiency (NSE), Pearson's r, and normalized root mean square error (NRMSE). Additionally, we investigate the impact of two data extension methods, zero-padding and persistence, on the models' predictive capabilities. Our findings highlight the transformer's superiority in capturing complex temporal dependencies and patterns in streamflow data, outperforming all other models in terms of both accuracy and reliability. Specifically, the transformer model improved NSE scores by up to 20% compared to the other models. These insights emphasize the value of advanced deep learning techniques, such as the transformer, in hydrological modeling and streamflow forecasting for effective water resource management and flood prediction.
HIGHLIGHTS
Transformer model surpasses persistence, LSTM, GRU, and Seq2Seq models in 24-h streamflow prediction.
Employed different data extension methods, zero-padding and persistence, to enhance predictive capabilities.
Evaluated model performance using Nash–Sutcliffe Efficiency, Pearson's r, and normalized root mean square error metrics.
Conducted region-specific analysis, demonstrating the transformer model's adaptability across varied hydrological environments.
INTRODUCTION
Globally, the incidence and catastrophic effects of natural disasters have increased dramatically. The World Meteorological Organization's analysis (2021) shows that, on average, a weather, climate, or water-related disaster has occurred every day for the past half-century, causing $202 million in losses and claiming 115 lives daily. Further, Munich Re's (2022) report indicated that natural catastrophes, encompassing hurricanes, floods, and other disaster types, inflicted more than $280 billion in projected damage worldwide. Of this total, $145 billion in damages occurred in the United States alone, along with thousands of fatalities and substantial damage to property and infrastructure. Current research suggests that ongoing climate change is projected to cause an upsurge in extreme and intense natural disasters globally, leading to an increase in the number of victims and losses (Banholzer et al. 2014; WMO 2021).
Floods are the most commonly occurring natural disaster, leading to billions in financial losses and innumerable fatalities over time (WMO 2021). In the year 2020, over 60% of all reported natural disasters were flood-related, accounting for 41% of the overall death toll due to such events (NDRCC 2021). Multiple studies suggest that climate change is causing an escalation in the frequency and severity of floods in specific areas (Tabari 2020; Davenport et al. 2021; NOAA 2022). This rise in flooding events can be attributed to factors like an increase in sea level (Strauss et al. 2016), the heightened occurrence of extreme rainfall (Diffenbaugh et al. 2017), or amplified rainfall during hurricanes (Trenberth et al. 2018). Hence, accurately forecasting streamflow and, as a result, potential flooding is essential for effectively mitigating the destructive consequences in terms of property damage and fatalities (Alabbad & Demir 2022).
In addition, streamflow forecasting plays a vital role in numerous aspects of hydrology and water management, including watershed management (Demir & Beck 2009), agricultural planning (Yildirim & Demir 2022), flood mapping systems (Li & Demir 2022), and other mitigation activities (Yaseen et al. 2018; Ahmed et al. 2021). Yet, achieving accurate and reliable predictions poses a challenge due to the inherent complexity of hydrological systems, which include nonlinearity, and unpredictability in the datasets (Yaseen et al. 2017; Honorato et al. 2018; Sit et al. 2023a).
Over time, a plethora of physical and data-driven methods have been introduced, each exhibiting diverse characteristics such as employing different types of data, focusing on specific geographical areas, or offering varying levels of generalization (Salas et al. 2000; Yaseen et al. 2015). Physics-driven prediction models (Beven & Kirkby 1979; Ren-Jun 1992; Arnold 1994; Lee & Georgakakos 1996; Devia et al. 2015) can simulate the complex interactions among different physical processes, including atmospheric circulation and the long-term evolution of global weather patterns (Yaseen et al. 2019; Sharma & Machiwal 2021). However, these models, while valuable, come with notable limitations: they demand extensive and precise hydrological and geomorphological data, which increases operational costs, and their accuracy tends to wane in long-term forecasting scenarios.
Furthermore, due to their computational intensity and high parameter counts, traditional physically based hydrological models require substantial computing resources, leading to significant computational costs (Mosavi et al. 2018; Sharma & Machiwal 2021; Liu et al. 2022; Castangia et al. 2023). As a result, recent research (Yaseen et al. 2015) has explored alternative approaches to streamflow forecasting, indicating that machine learning, especially deep learning models, can serve as viable alternatives and often outperform physically based models in terms of accuracy. These deep learning models have shown promising results in enhancing the accuracy and reliability of streamflow predictions, presenting an opportunity to revolutionize hydrological modeling (Demiray et al. 2023; Sit et al. 2023b).
Many classical machine-learning approaches have been used in streamflow forecasting and environmental studies (Bayar et al. 2009; Li & Demir 2023), including support vector machines (SVMs) and linear regression (LR) (Granata et al. 2016; Yan et al. 2018; Sharma & Machiwal 2021). However, advancements in artificial intelligence (AI), coupled with the increasing capabilities of graphics processing units (GPUs), have opened up new possibilities and accelerated the progress of deep learning techniques, which has led to their widespread use in streamflow forecasting as well (Sit et al. 2022a). Among the various neural network architectures explored for streamflow forecasting (Sit et al. 2021a; Xiang & Demir 2022b; Chen et al. 2023), recurrent neural networks (RNNs), especially the long short-term memory (LSTM) network and gated recurrent units (GRUs), have emerged as the most extensively studied models in this domain.
Kratzert et al. (2018) applied an LSTM model to predict daily runoff, incorporating meteorological observations, and demonstrated that the LSTM model outperformed a well-established physical model in their study area. Xiang et al. (2021) demonstrated that an LSTM-seq2seq model surpasses the predictive accuracy of linear models such as LR, lasso regression, and ridge regression. Guo et al. (2021) compared LSTMs, GRUs, and SVMs over 25 different locations in China and found that while LSTMs and GRUs demonstrated comparable performance, GRUs exhibited faster training times. Since the research in this field is extensive, more detailed information about deep learning studies on streamflow prediction can be found in Yaseen et al. (2015) and Ibrahim et al. (2022).
In 2017, a group of researchers from Google introduced a new way to model longer sequences for language translation (Vaswani et al. 2017), and this new model, the transformer, has since been applied to various tasks, including time series prediction (Wu et al. 2021; Zhou et al. 2021, 2022; Lin et al. 2022). Despite attention from other fields, only a limited number of studies focus on the performance and usage of transformers in streamflow forecasting. Liu et al. (2022) introduced a transformer neural network model for monthly streamflow prediction of the Yangtze River in China. Their approach utilized historical water levels and incorporated the El Niño-Southern Oscillation (ENSO) as additional input features, allowing the model to capture the influence of ENSO on streamflow patterns and improve the accuracy of monthly streamflow predictions for the Yangtze River. More recently, Castangia et al. (2023) used a transformer-based model to predict the water level of a river one day in advance, leveraging the historical water levels of its upstream branches as predictors; they conducted experiments using data from the severe flood that occurred in Southeast Europe in May 2014.
In this work, we investigate the performance of a transformer model in streamflow forecasting for four different locations in Iowa, USA. More specifically, we predict the upcoming 24-h streamflow using the previous 72-h precipitation, evapotranspiration, and discharge values, and then compare the results of the transformer-based model with three deep learning models as well as the persistence method. According to the experiment results, the transformer-based model outperforms all tested methods.
The structure of the remaining sections of this paper is as follows: in the next section, the dataset that has been used in this research and study area will be introduced. Section 3 outlines the methods employed in this study. Following that, Section 4 presents the results of our experiments and provides a detailed discussion of the findings. Finally, in Section 5, we summarize the key findings of this study and discuss future prospects.
DATASET
WaterBench, developed by Demir et al. (2022), is a benchmark dataset explicitly created for flood forecasting research, adhering to FAIR (findability, accessibility, interoperability, and reuse) data principles. Its structure is designed for easy application in data-driven and machine-learning studies, and it also provides benchmark performance metrics for advanced deep-learning architectures, enabling comparative analysis. This dataset has been compiled by gathering streamflow, precipitation (Sit et al. 2021b), watershed area, slope, soil types, and evapotranspiration data from various federal and state entities, including NASA, NOAA, USGS, and the Iowa Flood Center. This consolidated resource is specifically geared towards studies of hourly streamflow forecasts.
Figure: Selected locations and corresponding watersheds in the State of Iowa.
The data from October 2011 to September 2017 are selected for the training set. Fifteen percent of the remaining data are used for validation, and the rest is allocated for testing. This data split was consistent across all stations and models studied in our research; the use of a uniform data division strategy ensures comparability and fairness in the evaluation of each model's performance in different regions. As a preprocessing step, we followed the same methods as the original dataset paper (Demir et al. 2022), since we compare our results with the models provided in the WaterBench paper. The data and benchmark models can be accessed from https://github.com/uihilab/WaterBench. A statistical summary of the streamflow values in the test data is provided in Table 1.
Table 1 | Statistical summary of streamflow values in test data (m³/s)

| | Bluffton | Fulton | Iowa City | Clarinda |
|---|---|---|---|---|
| Max | 13,050.00 | 10,075.00 | 2,242.50 | 11,575.00 |
| Min | 41.09 | 121.99 | 0.16 | 70.87 |
| Mean | 436.98 | 425.59 | 12.76 | 443.70 |
| Median | 246.00 | 308.00 | 4.28 | 256.00 |
METHODS
In this study, we evaluated a transformer-based model on the streamflow prediction task and compared the results with the four models (persistence, GRU, LSTM, and Seq2Seq) that are provided with the WaterBench dataset. In this section, we provide the details of these methods as well as the transformer-based approach.
Persistence approach
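The persistence approach is the naive baseline commonly used in streamflow forecasting: the last observed discharge value is carried forward unchanged over the entire forecast horizon, so the 24-h prediction is simply a repetition of the most recent observation. A minimal sketch of this baseline is given below; the function name and array shapes are illustrative and follow the input convention used elsewhere in this paper.

```python
# Minimal sketch of the persistence baseline (standard naive forecast;
# function name and shapes are illustrative, not from the original code).
import numpy as np

def persistence_forecast(discharge_history: np.ndarray, horizon: int = 24) -> np.ndarray:
    """Repeat the last observed discharge value over the forecast horizon.

    discharge_history: array of shape [batch, 72] with past hourly discharge.
    Returns an array of shape [batch, horizon].
    """
    last_value = discharge_history[:, -1:]            # [batch, 1]
    return np.repeat(last_value, horizon, axis=1)     # [batch, horizon]

# Example: forecast the next 24 h from a 72-h history
history = np.random.rand(4, 72)
print(persistence_forecast(history).shape)            # (4, 24)
```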


LSTM and GRU
In the context of time-series forecasting, RNNs have proven effective in capturing temporal dependencies. However, they suffer from the vanishing gradient problem, where the gradient diminishes exponentially over time, hindering the model's ability to retain long-term dependencies. This limitation impacts the accuracy of time-series predictions, particularly for tasks that require memory of events far back in the past. To address these shortcomings, LSTM networks were introduced by Hochreiter & Schmidhuber (1997). LSTMs are designed to extend the lifespan of short-term memory and effectively capture long-term dependencies in the data. This makes them well suited for time series problems, including hydrological forecasting tasks that involve longer memory requirements, such as flood and rainfall forecasting (Kratzert et al. 2018; Feng et al. 2020; Frame et al. 2022; Sit et al. 2022b).
The core of an LSTM unit comprises three gates: the input gate, forget gate, and output gate, each playing a pivotal role in the information flow within the network. The input gate controls the addition of new information to the cell state, balancing between the current input and the previous state. The forget gate, on the other hand, determines which parts of the existing memory to retain or discard, allowing the model to forget irrelevant data over time. The output gate decides the next hidden state based on the current cell state, effectively controlling the output information of the LSTM unit. These gates, governed by sigmoid and tanh activation functions, operate through a series of equations that manage data storage, retention, and output. This sophisticated gating mechanism is fundamental to the LSTM's ability to manage information over extended periods, making it a powerful tool in forecasting where past events significantly influence future outcomes.
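For reference, the standard gate equations (following the common formulation of Hochreiter & Schmidhuber 1997; the notation below is chosen for illustration rather than taken from this paper) can be written as:

$$
\begin{aligned}
f_t &= \sigma\!\left(W_f x_t + U_f h_{t-1} + b_f\right) &&\text{(forget gate)}\\
i_t &= \sigma\!\left(W_i x_t + U_i h_{t-1} + b_i\right) &&\text{(input gate)}\\
o_t &= \sigma\!\left(W_o x_t + U_o h_{t-1} + b_o\right) &&\text{(output gate)}\\
\tilde{c}_t &= \tanh\!\left(W_c x_t + U_c h_{t-1} + b_c\right) &&\text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t &&\text{(cell state update)}\\
h_t &= o_t \odot \tanh(c_t) &&\text{(hidden state)}
\end{aligned}
$$

where $x_t$ is the input at time $t$, $h_{t-1}$ the previous hidden state, $\sigma$ the sigmoid function, and $\odot$ elementwise multiplication.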
While LSTM networks have been instrumental in addressing the vanishing gradient problem and achieving remarkable progress in natural language processing and time-series prediction, their time complexity can be a concern, especially for large-scale applications. To mitigate this issue, GRU networks were introduced by Cho et al. (2014) as an efficient alternative that retains much of the effectiveness of LSTM. The GRU merges the functionalities of the input and forget gates into a single update gate, reducing complexity and computational burden. Its architecture comprises two main components: the update gate and the reset gate. The update gate determines the extent to which the previous state influences the current state, thus controlling the flow of information from the past. The reset gate, on the other hand, decides how much past information to forget, allowing the model to drop irrelevant data from previous time steps. These gates use sigmoid and tanh functions to manage the model's memory effectively. By employing these gating mechanisms and streamlined computations, the GRU model strikes a balance between computational efficiency and predictive performance.
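The corresponding GRU update, in the standard formulation of Cho et al. (2014) (again with illustrative notation; sign conventions for the update gate vary slightly across references), is:

$$
\begin{aligned}
z_t &= \sigma\!\left(W_z x_t + U_z h_{t-1} + b_z\right) &&\text{(update gate)}\\
r_t &= \sigma\!\left(W_r x_t + U_r h_{t-1} + b_r\right) &&\text{(reset gate)}\\
\tilde{h}_t &= \tanh\!\left(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\right) &&\text{(candidate state)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t &&\text{(new hidden state)}
\end{aligned}
$$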
Both LSTM and GRU models have been applied successfully in various domains, particularly in hydrological forecasting (Yaseen et al. 2015; Ibrahim et al. 2022). Their ability to model the non-linear relationships inherent in hydrological processes has led to them becoming popular choices for tasks such as predicting streamflow and rainfall. Studies such as those by Kratzert et al. (2018) and Guo et al. (2021) have demonstrated the efficacy of these models in hydrological forecasting, highlighting their strengths in capturing complex temporal patterns and relationships in hydrological data.
Seq2Seq model
In addition to LSTM and GRU, a variant of the Seq2Seq model (Xiang & Demir 2022a) is also employed as a baseline method in this study. The Seq2Seq model follows an encoder–decoder architecture and utilizes multiple TimeDistributed layers with a final dense layer. The encoder–decoder structure consists of two main components: an encoder and a decoder. The encoder processes the input time series data and encodes it into a fixed-size context vector, effectively capturing relevant temporal patterns and features. In this implementation, GRU layers, which have proven effective in modeling sequential data and handling long-range dependencies, are used in both the encoder and the decoder.
During the encoding process, the input time series data, including historical rainfall, streamflow, and evapotranspiration for the past 72 h, along with 24-h forecast data of rainfall and evapotranspiration, is passed through the multiple GRUs. The encoder generates a context vector that summarizes the important information from the input sequence. Next, the decoder takes the context vector produced by the encoder and predicts the future 24-h streamflow. The decoder GRUs process the context vector along with the predicted streamflow values from the previous timestep, iteratively generating the streamflow predictions for the next 24 h. To capture intricate patterns and temporal dynamics in the predictions, multiple TimeDistributed layers are employed, applying the same dense layer to each timestep of the output sequence. Finally, the Seq2Seq model concludes with a final dense layer that projects the output sequence to the desired format for 24-h streamflow predictions. For comprehensive implementation details, we recommend referring to the works by Xiang & Demir (2022a) and Demir et al. (2022).
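To make this architecture concrete, a minimal Keras sketch of a GRU encoder–decoder with TimeDistributed output layers is given below. Layer sizes are assumptions, and the 24-h forecast inputs (precipitation and evapotranspiration) are omitted for brevity; see Xiang & Demir (2022a) and Demir et al. (2022) for the exact architecture used in this study.

```python
# Minimal sketch of a GRU-based Seq2Seq model (assumed layer sizes; forecast
# inputs omitted for brevity).
import tensorflow as tf
from tensorflow.keras import layers, models

past_steps, past_feats = 72, 3   # precipitation, evapotranspiration, discharge
future_steps = 24                # 24-h forecast horizon

inputs = layers.Input(shape=(past_steps, past_feats))
# Encoder: GRU compresses the 72-h history into a fixed-size context vector
context = layers.GRU(64)(inputs)
# Decoder: repeat the context for each of the 24 output steps and decode with a GRU
repeated = layers.RepeatVector(future_steps)(context)
decoded = layers.GRU(64, return_sequences=True)(repeated)
# TimeDistributed dense layer maps every timestep to a streamflow value
outputs = layers.TimeDistributed(layers.Dense(1))(decoded)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.summary()
```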
Transformer model
The transformer model represents a revolutionary neural network architecture that emerged as a seminal work by Vaswani et al. (2017) to tackle challenges in machine translation tasks. Its groundbreaking design subsequently found applications in various domains that deal with long input sequences, including time series forecasting (Wu et al. 2021; Zhou et al. 2021; Lin et al. 2022; Wen et al. 2022; Zhou et al. 2022). The transformer's key innovation lies in the self-attention mechanism, which completely replaces traditional recurrent layers, enabling more efficient and effective analysis of extended input sequences.
The persistence, GRU, LSTM, and transformer models are developed with PyTorch, whereas the Seq2Seq model is developed with Keras. Please refer to Demir et al. (2022) for further implementation details and the model architectures of the GRU, LSTM, and Seq2Seq used in this study. In the transformer model, we employed a linear embedding layer to expand the feature size of the input from 3 to 64, preparing the input data for processing by the transformer architecture. The model comprises a single encoder layer equipped with eight attention heads, enhancing its ability to focus on different facets of the input sequence simultaneously. The model size for the transformer is set to 64, balancing complexity and computational efficiency, and the encoder's internal feedforward network has a dimension of 256, which provides sufficient capacity for internal feature transformations. In addition, the GELU activation function is used between the two linear layers inside the feed-forward component. To provide a clearer understanding of the computational complexity of the models employed in our study, Table 2 lists the trainable parameter counts for each model.
Table 2 | Trainable parameter counts of tested models

| Model name | Number of trainable parameters |
|---|---|
| LSTM | 51,009 |
| Seq2Seq | 77,505 |
| GRU | 38,273 |
| Transformer | 56,449 |
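A minimal PyTorch sketch of the transformer configuration described above is given below. The readout head is an illustrative assumption (the paper does not detail the output layer), and the positional encoding used in the study is omitted for brevity.

```python
# Sketch of the described transformer encoder: 3 -> 64 embedding, 1 encoder layer,
# 8 heads, feedforward dim 256, GELU. Output head is an assumption; positional
# encoding (used in the study) is omitted here for brevity.
import torch
import torch.nn as nn

class StreamflowTransformer(nn.Module):
    def __init__(self, n_features=3, d_model=64, n_heads=8, d_ff=256, horizon=24):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)            # expand features 3 -> 64
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
            activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
        self.head = nn.Linear(d_model, 1)                       # per-step streamflow estimate
        self.horizon = horizon

    def forward(self, x):                                        # x: [batch, 96, 3]
        z = self.encoder(self.embed(x))                          # [batch, 96, d_model]
        # Read the last 24 positions as the 24-h forecast (illustrative choice)
        return self.head(z[:, -self.horizon:, :]).squeeze(-1)    # [batch, 24]

model = StreamflowTransformer()
dummy = torch.randn(8, 96, 3)
print(model(dummy).shape)  # torch.Size([8, 24])
```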
During training, we used mean squared error (MSE) as the loss function and Adam as the optimizer. In addition, we set the batch size to 512 and the learning rate to 0.00001. The learning rate is halved if no improvement is observed for 10 epochs, and training is stopped early if there is no improvement for 20 epochs.
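A sketch of this training configuration in PyTorch is shown below. The loss, optimizer, batch size, learning rate, scheduler, and early-stopping patience follow the description above, while the epoch cap and the helpers `train_loader` and `val_loss` are illustrative assumptions.

```python
# Training-loop sketch matching the stated hyperparameters (model, train_loader,
# and val_loss() are assumed to be defined elsewhere).
import torch

criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
# Halve the learning rate after 10 epochs without validation improvement
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=10)

best_val, epochs_without_improvement = float("inf"), 0
for epoch in range(1000):                      # epoch cap is an assumption
    model.train()
    for x, y in train_loader:                  # DataLoader built with batch_size=512
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    current_val = val_loss(model)              # validation MSE
    scheduler.step(current_val)
    if current_val < best_val:
        best_val, epochs_without_improvement = current_val, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= 20:   # early stopping
            break
```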
RESULTS AND DISCUSSION
In this section, we present and discuss the findings from our research into the 24-h prediction of streamflow using different models, focusing primarily on the performance of the transformer model we used. To assess its effectiveness, we compare it against four other models, three of which are deep learning models – LSTM, GRU, and Seq2Seq – and one is a classical approach known as persistence.
Streamflow prediction holds immense significance in various domains such as water resource management, environmental monitoring, and decision-making processes. Deep learning models have demonstrated remarkable capabilities in time-series forecasting tasks, making them a natural choice for tackling streamflow prediction challenges. However, the application of transformer models in this specific context is relatively new and deserving of detailed investigation. The transformer's self-attention mechanism has shown great promise in sequence modeling tasks, making it an intriguing candidate for capturing temporal dependencies in streamflow data.
Our comparative analysis employs three metrics: Nash–Sutcliffe efficiency (NSE), Pearson's r, and normalized root mean square error (NRMSE). Together, these metrics facilitate a thorough, multidimensional understanding of each model's predictive capacity and of the effectiveness of the transformer model. The subsequent sections present the three evaluation metrics and their relevance in streamflow prediction assessment. Following that, we present and analyze the results obtained from each model, highlighting their respective strengths and limitations. Through this examination, we aim to uncover the effectiveness of the transformer model in 24-h streamflow prediction and its potential implications for future research and real-world applications.
Performance metrics
These metrics serve as essential tools in quantifying the predictive performance of our streamflow prediction models, enabling us to assess their effectiveness in capturing the underlying patterns and dynamics of streamflow behavior.
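For completeness, the standard definitions of these metrics are given below, where $Q_o^t$ and $Q_s^t$ denote the observed and simulated streamflow at time $t$, overbars denote means over the $n$ evaluation timesteps, and NRMSE is assumed here to be normalized by the mean observed flow (normalization conventions vary).

$$
\mathrm{NSE} = 1 - \frac{\sum_{t=1}^{n}\left(Q_o^t - Q_s^t\right)^2}{\sum_{t=1}^{n}\left(Q_o^t - \overline{Q_o}\right)^2}, \qquad
r = \frac{\sum_{t=1}^{n}\left(Q_o^t - \overline{Q_o}\right)\left(Q_s^t - \overline{Q_s}\right)}{\sqrt{\sum_{t=1}^{n}\left(Q_o^t - \overline{Q_o}\right)^2}\,\sqrt{\sum_{t=1}^{n}\left(Q_s^t - \overline{Q_s}\right)^2}}, \qquad
\mathrm{NRMSE} = \frac{\sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(Q_o^t - Q_s^t\right)^2}}{\overline{Q_o}}
$$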
Experiment results and discussion
In this section, we present the experiment results that address the core objective of our study: 24-h streamflow prediction. For this investigation, we utilized a comprehensive dataset comprising historical precipitation, evapotranspiration, and discharge values from the preceding 72 h, as well as 24-h forecasts of precipitation and evapotranspiration. Our investigation focused on evaluating the performance of the transformer-based model, comparing it against three deep learning models (LSTM, GRU, and Seq2Seq) and a classical method (persistence). To assess the predictive capabilities of these models, we employed three commonly used metrics in hydrological modeling and streamflow forecasting: NSE, Pearson's r, and NRMSE. These metrics provide valuable insights into the accuracy and effectiveness of the models in capturing streamflow patterns.
A crucial aspect of the experiments involved adjusting the dimensions of the input data and incorporating additional values to accommodate the implementation specifications of the GRU, LSTM, and transformer models. More specifically, the input for these networks is a combination of past values and forecast values. The past values are 72 h of precipitation, evapotranspiration, and discharge, while the forecast values are 24 h of precipitation and evapotranspiration; consequently, one input group has a shape of [batch size, 72, 3] and the other [batch size, 24, 2]. To merge and align these two input groups, an extra feature dimension had to be introduced for the forecast values, and the experiment results show that the choice of values used to fill this additional dimension affects the results dramatically. Two approaches are considered to handle this dimension discrepancy. One approach is zero-padding, wherein the forecast values are extended with zeros in the additional dimension. Alternatively, the persistence method can be adopted, wherein the historical values are extended into the forecast period by repeating the last available value, ensuring consistency in the input data across time steps. Both techniques are employed to ensure compatibility between the input data and the specific model requirements. Once the additional dimension is added, past and forecast values are merged, and an input with a shape of [batch size, 96, 3] is obtained for the transformer, GRU, and LSTM models.
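The following sketch illustrates the two data-extension strategies under the stated shapes; variable names are illustrative, and the extended channel is assumed to be the discharge feature missing from the forecast window.

```python
# Sketch of the two data-extension strategies (illustrative variable names).
import numpy as np

past = np.random.rand(32, 72, 3)       # [batch, 72, (precip, et, discharge)]
forecast = np.random.rand(32, 24, 2)   # [batch, 24, (precip, et)]

# Zero-padding: append a zero-valued channel to the forecast window
zeros = np.zeros((forecast.shape[0], forecast.shape[1], 1))
forecast_zero = np.concatenate([forecast, zeros], axis=-1)            # [batch, 24, 3]

# Persistence: repeat the last observed discharge over the forecast window
last_discharge = past[:, -1:, 2:3]                                    # [batch, 1, 1]
repeated = np.repeat(last_discharge, forecast.shape[1], axis=1)       # [batch, 24, 1]
forecast_persist = np.concatenate([forecast, repeated], axis=-1)      # [batch, 24, 3]

# Merge past and extended forecast into a single [batch, 96, 3] input
x_zero = np.concatenate([past, forecast_zero], axis=1)
x_persist = np.concatenate([past, forecast_persist], axis=1)
print(x_zero.shape, x_persist.shape)   # (32, 96, 3) (32, 96, 3)
```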
The results in Table 3 demonstrate the performance comparison of the transformer model using zero-padding and persistence approaches for 24-h streamflow forecasting in four different regions. The NSE scores reveal valuable insights into the model's predictive capabilities under each data extension method. Upon analysis, it becomes evident that the persistence method for data extension consistently outperforms zero-padding in capturing underlying streamflow patterns and dynamics for the transformer model in three of the four analyzed regions. These findings emphasize the critical role of data extension techniques in improving the transformer model's performance for streamflow forecasting tasks.
Table 3 | Performance comparison of the transformer model for 24-h streamflow forecasting using zero-padding and persistence approaches in four different regions (NSE scores)

| | Bluffton (Mean / Median) | Fulton (Mean / Median) | Iowa City (Mean / Median) | Clarinda (Mean / Median) |
|---|---|---|---|---|
| Transformer-zero-padding | 0.73 / 0.70 | 0.62 / 0.64 | 0.24 / 0.25 | 0.72 / 0.72 |
| Transformer-persistence | 0.82 / 0.81 | 0.70 / 0.71 | 0.42 / 0.42 | 0.65 / 0.68 |
Similar to Table 3, Table 4 displays the NSE scores obtained from the predictions made by the GRU and LSTM models under the zero-padding and persistence data extension methods for each region. Upon analysis, we observe variations in the models' performance across the four regions. Interestingly, for the LSTM model, the zero-padding approach yields higher mean and median NSE scores compared to the persistence method. Conversely, for the GRU model, the persistence method consistently outperforms the zero-padding approach, resulting in higher mean and median NSE scores.
Table 4 | Performance comparison of GRU and LSTM models for 24-h streamflow forecasting using zero-padding and persistence approaches in four different regions (NSE scores)

| | Bluffton (Mean / Median) | Fulton (Mean / Median) | Iowa City (Mean / Median) | Clarinda (Mean / Median) |
|---|---|---|---|---|
| GRU-zero-padding | 0.56 / 0.54 | 0.42 / 0.45 | 0.12 / 0.12 | −0.40 / −0.45 |
| GRU-persistence | 0.72 / 0.72 | 0.62 / 0.64 | 0.19 / 0.19 | 0.57 / 0.59 |
| LSTM-zero-padding | 0.77 / 0.76 | 0.45 / 0.44 | −0.45 / −0.52 | 0.51 / 0.53 |
| LSTM-persistence | 0.50 / 0.50 | 0.40 / 0.41 | −1.50 / −1.60 | 0.09 / 0.09 |
In summary, the different performance trends for the three models under the zero-padding and persistence approaches highlight the significance of selecting appropriate data extension techniques in streamflow forecasting tasks, as the effectiveness can vary depending on the model architecture.
Table 5 presents the 24-h streamflow prediction results for the four regions using five different models: persistence, LSTM, Seq2Seq, GRU, and transformer. The results are evaluated using three performance metrics: the mean of the 24 hourly NSE scores, Pearson's r, and NRMSE. In this study, the persistence model serves as the baseline for comparison. While it exhibits moderate performance in some regions, it falls short of capturing the underlying dynamics of streamflow, leading to higher NRMSE values; as expected, it shows limited predictive capabilities compared to the advanced deep learning models. The LSTM and Seq2Seq models demonstrate mixed results across regions. While they achieve reasonably high NSE scores in certain regions, they struggle to consistently outperform the persistence model, especially in the Iowa City and Clarinda regions.
Table 5 | 24-h streamflow prediction results (NSE, Pearson's r, and NRMSE)

| Region | Metric | Persistence | LSTM | Seq2Seq | GRU | Transformer |
|---|---|---|---|---|---|---|
| Bluffton | NSE | 0.58 | 0.77 | 0.66 | 0.72 | 0.82 |
| | r | 0.77 | 0.88 | 0.85 | 0.86 | 0.91 |
| | NRMSE | 1.26 | 0.92 | 1.13 | 1.01 | 0.82 |
| Fulton | NSE | 0.46 | 0.45 | 0.58 | 0.62 | 0.70 |
| | r | 0.73 | 0.67 | 0.76 | 0.78 | 0.83 |
| | NRMSE | 0.98 | 0.99 | 0.87 | 0.83 | 0.73 |
| Iowa City | NSE | −0.30 | −0.45 | 0.01 | 0.19 | 0.42 |
| | r | 0.34 | 0.29 | 0.16 | 0.44 | 0.66 |
| | NRMSE | 6.36 | 6.65 | 5.53 | 4.95 | 4.22 |
| Clarinda | NSE | 0.48 | 0.51 | 0.29 | 0.57 | 0.65 |
| | r | 0.74 | 0.84 | 0.56 | 0.90 | 0.90 |
| | NRMSE | 1.23 | 1.20 | 1.43 | 1.08 | 1.00 |
This indicates that their recurrent architecture might face challenges in capturing the complex temporal dependencies in streamflow data. The GRU model showcases competitive performance across all regions. With consistent NSE scores and relatively lower NRMSE values compared to LSTM and Seq2Seq models, it proves its capability to effectively model the temporal dynamics in streamflow data. However, it still falls behind the transformer model's overall performance. The transformer model emerges as the top-performing model in 24-h streamflow prediction across all regions. With the highest NSE scores and the lowest NRMSE values among all models, the transformer demonstrates its efficacy in capturing and learning the long-range dependencies and patterns in the time series data. The self-attention mechanism, along with positional encoding, enables the transformer to effectively process and utilize the sequential information, leading to its superior predictive capabilities.
In conclusion, the experimental results highlight the transformer model's significant advantage over the other models in 24-h streamflow forecasting. Its self-attention mechanism allows it to efficiently capture and utilize the temporal dependencies in the input time series, resulting in more accurate and reliable predictions than the LSTM, Seq2Seq, and GRU models. These findings underscore the importance of leveraging advanced deep learning architectures like the transformer in hydrological modeling and streamflow forecasting tasks, offering valuable insights for the research community and for practical applications in water resource management and flood forecasting.

However, it is crucial to recognize certain limitations that are inherent to streamflow forecasting models in general. One notable observation is the varying performance across different watersheds, as seen in the Iowa City basin. This basin, being the smallest among those studied, presented challenges not only for the transformer model but also for the other models. The reasons for these variations are complex and might be related to factors such as watershed size, land use patterns, or specific hydrological characteristics, indicating a need for further investigation into how environmental and regional factors influence model performance.

Additionally, across all models, we observed a trend of diminishing NSE scores over the 24-h forecasting period. This pattern suggests that while these models are effective in short-term forecasting, their accuracy tends to decrease over longer horizons. This is a critical area for future research, which could focus on enhancing long-term forecasting capabilities and examining the causes of this accuracy decline.

Furthermore, while the longer inference time of the transformer model, 0.95 s compared to the LSTM's 0.51 s and the GRU's 0.49 s for a batch of 32 examples, may not significantly impact our 24-h prediction window, it is a crucial consideration for other environmental and hydrological tasks that demand immediate action, such as flash flood early warning systems or real-time water quality monitoring. This computational overhead, a known characteristic of the transformer architecture (Tay et al. 2022), underscores the broader importance of selecting models not only for their accuracy but also for their operational efficiency in various hydrological contexts. Acknowledging these broader challenges and limitations is important for advancing the field of hydrological modeling. Our findings, while highlighting the strengths of the transformer model, also point to the ongoing need for research and development to address these complex issues in streamflow prediction.
CONCLUSION
In this study, we conducted an in-depth investigation of 24-h streamflow forecasting using various deep learning models, with a particular focus on the transformer architecture. Through extensive experimentation and analysis, we compared the performance of five different models across four distinct regions. The results demonstrate that the transformer model consistently outperforms other models, including persistence, LSTM, Seq2Seq, and GRU, in terms of accuracy and predictive capabilities. The transformer's powerful self-attention mechanism, along with positional encoding, enables it to effectively capture long-range dependencies and underlying patterns in the input time series data. Consequently, the transformer model excels in providing accurate and reliable streamflow predictions.
Furthermore, we explored the influence of two data extension methods, zero-padding and persistence, on model performance. The findings indicate that the persistence method, which extends historical streamflow data into the forecast window, generally yields superior results compared to zero-padding for the transformer and GRU models, although the LSTM was an exception. This underscores the importance of carefully considering data extension techniques to improve the model's forecasting accuracy.
Overall, our research contributes valuable insights into the field of hydrological modeling and streamflow forecasting, with the transformer model exhibiting superior accuracy. This model's success in streamflow prediction opens new opportunities in water resource management, where precise forecasts can inform reservoir level adjustments and water distribution planning, crucial for drought mitigation and flood prevention. In flood prediction, the model's reliable forecasts provide essential data for developing proactive flood management strategies, enhancing emergency response, and safeguarding communities and infrastructure.
The applications of the transformer model also extend to sectors reliant on streamflow predictions. In agriculture, accurate forecasts from the transformer model can guide irrigation scheduling, contributing to water conservation and crop yield optimization. In urban planning, insights from streamflow predictions can be pivotal in designing robust drainage systems and managing sewage overflow events, especially in cities prone to sudden water level changes. As future work, we suggest exploring the applicability of the transformer in handling larger datasets and further investigating the impact of different hyperparameters on the model's performance. The knowledge gained from this study can significantly benefit water management practices, supporting sustainable decision-making and mitigation strategies in the face of increasingly unpredictable weather patterns and climate change.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper. In addition, the dataset and benchmark models can be accessed at https://github.com/uihilab/WaterBench.
CONFLICT OF INTEREST
The authors declare there is no conflict.