The imperative for a reliable and accurate flood forecasting procedure stems from the hazardous nature of the disaster. In response, researchers are increasingly turning to innovative approaches, particularly machine learning models, which offer enhanced accuracy compared with traditional methods. However, a notable gap exists in the literature concerning studies focused on the South Asian tropical region, which possesses distinct climate characteristics. This study investigates the applicability and behavior of long short-term memory (LSTM) and transformer models in flood simulation for the Mahaweli catchment in Sri Lanka, which is mostly affected by the Northeast Monsoon. The importance of different input variables in the prediction was also a key focus of this study. Input features for the models included observed rainfall data collected from three nearby rain gauges, as well as historical water level data from the target river gauge. Results showed that past water level data had a greater impact on the output than the other input features, such as rainfall, for both architectures. All models performed satisfactorily in simulating daily water levels, with Nash–Sutcliffe Efficiency (NSE) values greater than 0.77, while the transformer encoder model showed superior performance compared with the encoder–decoder models.

  • Past water level data had the highest impact on the output among all the input features.

  • It is recommended to use multiple input features that show high correlations with the target, along with past observations of the output variable.

  • Switching the inputs between the encoder and decoder and comparing the accuracies of the two cases is recommended when using transformer encoder–decoder models.

Natural catastrophes, such as hurricanes, earthquakes, and floods, lead to significant economic, ecological, and social damages and casualties. Among them, floods stand out as a phenomenon with particularly severe effects, impacting about 109 million people throughout the world between 1995 and 2015 (Hirabayashi et al. 2013; Alfieri et al. 2017). Floods accounted for about 43% of all disaster events and 55% of the people affected, with lost assets totaling over 636 billion USD (Serinaldi et al. 2018), encouraging researchers worldwide to mitigate this disaster (Hallegatte et al. 2017). In South Asian tropical regions, which are characterized by a rapid seasonal reversal of wind direction accompanied by intense precipitation, resulting in wet summers and dry winters (Xie & Saiki 1999), large-scale floods and droughts can be expected (Parthasarathy & Mooley 1978). One viable option in flood hazard management is practical and effective flood warning systems (Boulange et al. 2021). Early prediction of floods facilitates timely management of hydro-junction operations and fast evacuation of individuals from flood-affected regions, leading to a reduction in socioeconomic losses (Zhang et al. 2022).

A significant challenge in advancing flood forecasting technology is the limited availability of field data. Flood prediction approaches can usually be divided into two categories: physically based models and data-driven models. Physical models (Pierini et al. 2014; Mourato et al. 2021) often require a substantial amount of both hydrological and geomorphological data for calibration and validation, and such data might not always be readily accessible. Furthermore, the model parameters must be carefully tested and evaluated, because they are regionally dependent and can be challenging to estimate.

To overcome these limitations, machine learning-based data-driven models have gained popularity in flood forecasting because of their ability to capture complex nonlinear patterns, cope with limited data effectively (Rahmati & Pourghasemi 2017), and capture spatial information from images (Lee et al. 1990). These models can be implemented solely on the basis of available rainfall and measured discharge data, without the need for detailed catchment characteristics. The artificial neural network (ANN) is a common algorithm for flood simulation because it has outperformed traditional methods on many occasions (Elsafi 2014; Chu et al. 2020; Tamiru & Dinka 2021). The recurrent neural network (RNN) was later introduced for time series forecasting tasks, with the ability to capture essential information from long sequences of data. Long short-term memory (LSTM), a special type of RNN, has gained significant popularity and widespread adoption in hydrologic prediction tasks (Xiang et al. 2020; Fang et al. 2021; Zou et al. 2023; Dtissibe et al. 2024).

As a solution to some limitations of these traditional neural network algorithms, such as low computational speed and ineffectiveness in capturing long-term dependencies, Google introduced a new architecture called the transformer (Vaswani et al. 2017), which is based on an attention mechanism (Bahdanau et al. 2014). Although originally designed for natural language processing (NLP), the transformer model has demonstrated its effectiveness in handling other types of time series data (Wu et al. 2020; Farsani & Pazouki 2021).

In the realm of flood forecasting, there is a scarcity of studies that incorporate the transformer architecture, representing a notable gap in the literature. Moreover, existing research indicates that the accuracy ranking of models, including transformers, varies across regions. For example, Xu et al. (2023) introduced the transfer learning (TL)-transformer framework to enhance flood prediction accuracy in basins with limited data by utilizing models trained on data-rich basins. The research was centered on the middle reaches of the Yellow River. The findings showed that the TL-transformer outperformed other models, such as TOPMODEL, the multilayer perceptron (MLP), TL-MLP, LSTM, TL-LSTM, and the transformer, at all the target basin stations. However, the conclusions of Wei et al. (2023) say otherwise. They evaluated the performance of transformer (TSF), LSTM, and Gated Recurrent Unit (GRU) models for runoff prediction in the Yangtze River Basin in China. The results showed that the GRU outperformed the other models with fewer parameters, while the TSF faced challenges due to data limitations. Therefore, whether the transformer model outperforms the traditional LSTM model in forecasting flood events remains an open question. This provides the motivation and contribution of the present research.

We examined the 1-day-ahead flood forecasting capabilities of transformer encoder, transformer encoder–decoder, and LSTM models. The lack of studies on deep learning-based flood simulation in the South Asian tropical zone further motivated this work. This paper builds on our earlier study (Madhushanka et al. 2024), which focused on the lower reach of the Mahaweli catchment in Sri Lanka. Here, apart from the forecasting capabilities for daily water levels, the effects of different input features on the output are thoroughly investigated.

Study area

South Asian tropical climate

South Asia, encompassing Afghanistan, Pakistan, India, Nepal, Bhutan, Bangladesh, the Maldives, and Sri Lanka, stands as the world's most populous and agriculture-dependent region. The climate of the region is characterized by the South Asian Monsoon, a significant seasonal phenomenon marked by dramatic shifts in winds that bring vital rainfall to the area. Unlike regions with consistent precipitation year-round, South Asia experiences distinct dry and wet seasons. The summer monsoon, occurring from June to September, carries moisture-laden winds from the Indian Ocean, resulting in heavy rainfall. Conversely, the winter monsoon, from December to February, brings dry continental winds from the north (Xie & Saiki 1999). Additionally, there are two inter-monsoon seasons: the first inter-monsoon from March to May and the second inter-monsoon from October to November (Wickramagamage 2016).

South Asian countries are particularly susceptible to temperature and precipitation extremes, including floods and droughts, due to the effects of global warming (Naveendrakumar et al. 2019). The frequency of intense precipitation events with the potential for extreme outcomes is projected to rise across various regions in South Asian countries (Christensen et al. 2007) while Central Asia is expected to have less rainfall compared to the past (Donat et al. 2016). Given these factors, there is an urgent need to establish a reliable flood forecasting procedure to prevent future catastrophes.

Mahaweli catchment

The Mahaweli River, the longest river in Sri Lanka, stretches 335 km in length, originating from the central hills of the country as a collection of numerous small creeks. It traverses the central region of Sri Lanka before reaching its terminus at the southwestern side of Trincomalee Bay, where it merges with the Bay of Bengal. The Mahaweli River Basin (MRB) is the largest river basin in Sri Lanka, covering an area of approximately 10,448 km², which represents about 16% of the country's total land area (Diyabalanage et al. 2016). The runoff from the Mahaweli River contributes one-seventh of the total runoff of all rivers in Sri Lanka, with an average annual streamflow of 8.8 × 10⁹ m³ (De 1997).

The distribution of rainfall within the MRB is uneven both spatially and temporally due to its topographical features. The MRB can be divided into two main parts based on topography: the Upper Mahaweli Basin (UMB) and the Lower Mahaweli Basin (LMB) (Hewawasam 2010). The UMB, situated in the western part of the central highlands, experiences a total annual precipitation of around 6,000 mm (Zubair 2003). Conversely, most parts of the LMB are classified as dry regions, such as the North Central and Eastern provinces, with mean annual precipitation ranging from about 1,600 to 1,900 mm. The precipitation in the UMB is primarily influenced by the southwest monsoon, while the precipitation in the LMB is affected by the northeast monsoon (NEM), owing to the intricate terrain and monsoon patterns in Sri Lanka (Shelton & Lin 2019). We selected the LMB as our study area because the climatic data of the region align well with the climatic characteristics of the South Asian Tropical zone. This choice ensures that the case study accurately reflects the conditions typically observed in this region, enhancing the relevance and applicability of our research findings.

The Somawathi area, located downstream of the Parakrama Samudra reservoir in the Polonnaruwa district and within the LMB, is recognized as a flood-prone region. Floods in this area typically occur between December and February, coinciding with the NEM season. To monitor flood levels effectively, the Manampitiya River gauge station plays a crucial role. Real-time water level data, along with alert levels, minor flood levels, and major flood levels for the station, are available on the Irrigation Department of Sri Lanka's website. Figure 1 shows the map of the study area.
Figure 1 | Mahaweli watershed up to Manampitiya station (Madhushanka et al. 2024).

The dataset

The dataset for the area comprises daily water levels (in meters) recorded at the Manampitiya station and daily rainfall measurements (in millimeters) obtained from three upstream meteorological stations, namely Angamedilla, Aralaganwila, and Polonnaruwa Agri. Figure 2 shows the water level and rainfall data, while Table 1 presents their summary statistics.
Table 1 | Dataset summary

| Statistic | Aralaganwila rainfall (mm) | Angamedilla rainfall (mm) | Polonnaruwa Agri rainfall (mm) | Manampitiya water level (m) |
|---|---|---|---|---|
| Count | 10,319 | 10,319 | 10,319 | 10,319 |
| Mean | 4.94584 | 4.820227 | 4.20901 | 33.382384 |
| Std | 15.08369 | 15.45192 | 13.48675 | 0.643105 |
| Min | 0 | 0 | 0 | 32.196 |
| Max | 225.8 | 222 | 184 | 37.254333 |
Figure 2 | Characteristics of the rainfall and water level data used for the study. (a–c) subplots show the rainfall variation of Aralaganwila, Angamedilla, and Polonnaruwa_Agri, respectively. (d) exhibits the water level of Manampitiya. (e, f) boxplots illustrate the annual and monthly water level variation of Manampitiya, respectively.

Based on the boxplots presented in Figure 2, we can discern the influence of the NEM on the region. Upon closer examination, it becomes evident that while the maximum precipitation occurs from December to February, there are also instances of extreme rainfall events throughout the year. Although some of these events are classified as outliers, it is not advisable to use preprocessing techniques such as the interquartile range (IQR) (Granata et al. 2022; Luppichini et al. 2022) to remove them, as some of these points might be attributed to unexpected extreme weather conditions. Such occurrences, including sudden storms with heavy rainfall, are common in Sri Lanka due to the influences of the Bay of Bengal. Furthermore, consistent patterns observed in monthly precipitation and water level data indicate a strong correlation between them. This correlation underscores the interdependence of precipitation patterns and water levels in the region, emphasizing the importance of considering both variables in this flood simulation task.

As outlined in our previous paper (Madhushanka et al. 2024), the analysis was conducted using rainfall data from the designated rain gauge stations along with the water level at Manampitiya, based on the Pearson correlation coefficients calculated among the four stations. Because the other upstream river gauges belong to the UMB, a region with different geographic and climatic conditions, they were excluded. Prior to their use as inputs and labels for the models, the data were normalized using the standard scaling method, which uses the mean and standard deviation of each variable. Subsequently, the dataset was split into a training set and a test set: the first 70% of the data was allocated to the training set, while the remaining portion was reserved for the test set to evaluate model performance.
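A minimal sketch of this preprocessing is given below. The file and column names are hypothetical, and fitting the scaler on the training portion only is an assumption (a common safeguard the text does not spell out); the 70/30 chronological split follows the procedure described above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("mahaweli_daily.csv", parse_dates=["date"], index_col="date")
features = ["aralaganwila_rf", "angamedilla_rf", "polonnaruwa_agri_rf", "manampitiya_wl"]

# Chronological 70/30 split: no shuffling, so the time order is preserved.
split = int(len(df) * 0.7)
train, test = df.iloc[:split], df.iloc[split:]

# Standard scaling with each variable's mean and standard deviation;
# here the statistics are fitted on the training set and reused for the test set.
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train[features])
test_scaled = scaler.transform(test[features])
```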

Utilized models

Long short-term memory

The LSTM architecture, introduced by Hochreiter & Schmidhuber (1997), is an upgraded version of the RNN designed to address the limitation of traditional RNNs in capturing long-term dependencies (Figure 3) and to solve the vanishing gradient problem. Unlike RNNs, the LSTM incorporates an additional cell state or cell memory ($c_t$) where information can be stored, along with gates (represented by dashed rectangles in Figure 3) that regulate the flow of information within the LSTM cell. The first gate, known as the forget gate and denoted by a red rectangle, determines the extent to which elements of the cell state vector ($c_{t-1}$) will be forgotten. It is computed using the following equations.
$f_t = \sigma(w_f x_t + u_f h_{t-1} + b_f)$  (1)

$c'_{t-1} = f_t \odot c_{t-1}$  (2)
Figure 3 | LSTM architecture.

Here, $f_t$ represents the resulting vector with values ranging from 0 to 1, and $h_{t-1}$ denotes the hidden state at time step $t-1$. $\sigma(\cdot)$ denotes the sigmoid function. The parameters $w_f$ and $u_f$ are adjustable weight matrices, and $b_f$ is the bias vector of the forget gate. The symbol $\odot$ denotes element-wise multiplication. As in traditional RNNs, the hidden state $h$ is initialized with a vector of zeros of a predefined length at the first timestep.

Subsequently, the current input ($x_t$) and the last hidden state ($h_{t-1}$) are combined to calculate a potential update vector for the cell state, using the following equations.
$i_t = \sigma(w_i x_t + u_i h_{t-1} + b_i)$  (3)

$\tilde{c}_t = \tanh(w_c x_t + u_c h_{t-1} + b_c)$  (4)

$c_t = c'_{t-1} + i_t \odot \tilde{c}_t$  (5)

$\tanh(\cdot)$ denotes the hyperbolic tangent function, and $w_i$, $w_c$, $u_i$, $u_c$, $b_i$, and $b_c$ represent another set of learnable parameters. Additionally, the second gate, denoted by a green rectangle and known as the input gate or compute gate, determines the extent to which the information from $\tilde{c}_t$ is utilized for updating the cell state at the current time step.

The third and final gate, shown by a blue rectangle, is known as the output gate. Computed using the following equations, it manages the information from the cell state ($c_t$) that moves into the new hidden state ($h_t$).
$o_t = \sigma(w_o x_t + u_o h_{t-1} + b_o)$  (6)

$h_t = o_t \odot \tanh(c_t)$  (7)

Here, $o_t$ represents a vector with values ranging from 0 to 1, and $w_o$, $u_o$, and $b_o$ represent a set of learnable parameters specific to the output gate. By combining the results obtained from the previous equations, the new hidden state ($h_t$) is calculated. Notably, the cell state ($c_t$) is responsible for learning long-term dependencies effectively; it can retain information unchanged over an extended number of time steps owing to its simple linear interactions with the rest of the LSTM cell.
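As an illustration of Equations (1)–(7), the following NumPy sketch performs a single LSTM cell step; the weight dictionary `p` and its shapes are assumptions for demonstration, not the TensorFlow implementation used in the study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM cell step following Equations (1)-(7).
    p holds weight matrices w*, u* and bias vectors b*."""
    f_t = sigmoid(p["wf"] @ x_t + p["uf"] @ h_prev + p["bf"])      # forget gate, Eq. (1)
    c_kept = f_t * c_prev                                          # retained cell memory, Eq. (2)
    i_t = sigmoid(p["wi"] @ x_t + p["ui"] @ h_prev + p["bi"])      # input gate, Eq. (3)
    c_tilde = np.tanh(p["wc"] @ x_t + p["uc"] @ h_prev + p["bc"])  # candidate update, Eq. (4)
    c_t = c_kept + i_t * c_tilde                                   # new cell state, Eq. (5)
    o_t = sigmoid(p["wo"] @ x_t + p["uo"] @ h_prev + p["bo"])      # output gate, Eq. (6)
    h_t = o_t * np.tanh(c_t)                                       # new hidden state, Eq. (7)
    return h_t, c_t
```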

Transformer and attention mechanism

The transformer, depicted in Figure 4, is a special type of neural network architecture introduced by Vaswani et al. (2017); it has been a game-changer in the field of NLP, particularly for machine translation tasks. However, its innovative design has found applications in various other domains, including those that involve the analysis of lengthy input sequences, such as time series forecasting and classification, as highlighted by Li et al. (2019).
Figure 4 | Transformer architecture – with the encoder and the decoder.
One of the key features that sets the transformer apart is its self-attention mechanism, which replaces the traditional recurrent layers in sequence analysis. As the first step of the process, each element in the input sequence is transformed into Query (Q), Key (K), and Value (V) vectors with a dimension of $d_{model}$. The Q vectors are then matched against the K vectors by computing the dot product between each pair of Q and K vectors, generating a square matrix in which each element represents the relationship between the corresponding Q and K vectors. This matrix is scaled by dividing by the square root of the dimension of the key vectors ($d_{model}$), and the scaled matrix is passed through a softmax function to compute the attention scores, which represent the importance or weight of each element in the sequence with respect to all other elements. The attention scores are then multiplied by the corresponding value (V) vectors to produce a new representation of the input sequence. This procedure is carried out in parallel by several attention heads, whose outputs are concatenated into a large matrix; this is known as multi-head attention (MHA). Finally, the concatenated representations are passed through a final linear transformation. The following equation summarizes the self-attention mechanism.
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_{model}}}\right)V$  (8)
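A minimal NumPy sketch of Equation (8) for a single attention head; multi-head attention repeats this with separate learned projections and concatenates the results.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention per Equation (8).
    Q, K, V: arrays of shape (sequence_length, d_model)."""
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # pairwise Q-K dot products, scaled
    # Softmax over the last axis yields the attention weights.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V  # weighted combination of the value vectors
```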

Additionally, positional information of the elements in the sequence is provided through static positional encodings using sine and cosine functions, which are added to the original input embeddings. The original transformer consists of an encoder–decoder architecture, as depicted in Figure 4. In the decoder, a causal mask is used when calculating self-attention, preventing each token from attending to future ones. In contrast to self-attention, cross-attention, also known as 'encoder–decoder attention,' is used to capture relationships between tokens in different input sequences. The output of the encoder is transformed into the query and key matrices, and the output of the self-attention block of the decoder is transformed into the value matrix in order to calculate the attention as described above.
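The causal mask can be sketched as an upper-triangular Boolean matrix that blocks attention to future positions (note that, as described in the experimental setup below, this mask was omitted in the present implementation):

```python
import numpy as np

def causal_mask(seq_len):
    """True marks future positions that must not be attended to."""
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

# Applied by setting masked scores to -inf before the softmax, e.g.:
# scores = np.where(causal_mask(scores.shape[0]), -np.inf, scores)
```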

Evaluation metrics

The following indicators (Equations (9)–(12)) were used to quantify the accuracy of the models: root mean square error (RMSE), mean absolute percentage error (MAPE), Nash–Sutcliffe efficiency (NSE), and the coefficient of determination ($R^2$). Values of NSE and $R^2$ close to 1 and values of RMSE and MAPE close to 0 are desirable. The indicators were calculated using the following formulas.
$\mathrm{RMSE} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}\left(O_i - S_i\right)^2}$  (9)

$\mathrm{MAPE} = \dfrac{100\%}{n}\sum_{i=1}^{n}\left|\dfrac{O_i - S_i}{O_i}\right|$  (10)

$\mathrm{NSE} = 1 - \dfrac{\sum_{i=1}^{n}\left(O_i - S_i\right)^2}{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^2}$  (11)

$R^2 = \left(\dfrac{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)\left(S_i - \bar{S}\right)}{\sqrt{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(S_i - \bar{S}\right)^2}}\right)^2$  (12)
where $n$ denotes the number of observations; $O_i$ and $S_i$ represent the observation and simulation on day $i$, while $\bar{O}$ and $\bar{S}$ denote the mean values of the observation and simulation series.
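The four metrics of Equations (9)–(12) can be computed directly, as in this sketch:

```python
import numpy as np

def evaluate(obs, sim):
    """RMSE, MAPE, NSE, and R2 per Equations (9)-(12)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    rmse = np.sqrt(np.mean((obs - sim) ** 2))                               # Eq. (9)
    mape = 100.0 * np.mean(np.abs((obs - sim) / obs))                       # Eq. (10)
    nse = 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)  # Eq. (11)
    r2 = np.corrcoef(obs, sim)[0, 1] ** 2                                   # Eq. (12)
    return {"RMSE": rmse, "MAPE": mape, "NSE": nse, "R2": r2}
```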

Lag correlation

In hydrology, timing misalignment between predicted and observed data is a common issue. This temporal discrepancy can be quantified and analyzed using lag correlation metrics, providing insights into timing errors. The process involves computing a selected error metric (e.g., RMSE), then systematically lagging one of the time series relative to the other and recomputing the error metric (Jackson et al. 2019). This approach indicates the time lag at which the correlation or similarity between observations and predictions is maximized. Hyndman & Khandakar (2008) utilized lag correlation measures to gain insights into their dataset's ability to capture specific events despite timing variations.

In our case, we used the RMSE value to determine the lag correlation between simulations and observations with a 1-day lag. We calculated the percentage deviation of RMSE between the original and the lagged prediction series in order to quantify the lag correlation.
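A sketch of this lag-correlation check: compute the RMSE of the original prediction series, recompute it with the predictions shifted one day earlier, and report the percentage deviation.

```python
import numpy as np

def lag_rmse_deviation(obs, pred, lag=1):
    """Percentage drop in RMSE when the prediction series is lagged by `lag` days."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    rmse = lambda a, b: np.sqrt(np.mean((a - b) ** 2))
    original = rmse(obs, pred)
    # Align the prediction for day t+lag with the observation of day t.
    lagged = rmse(obs[:-lag], pred[lag:])
    return 100.0 * (original - lagged) / original
```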

Experimental setup

The work was conducted using the Python programming language, with data preprocessing, management, and visualization carried out using libraries such as NumPy, Pandas, Scikit-learn, Matplotlib, and Seaborn. For deep learning tasks, the TensorFlow framework was employed. Training of the models was conducted on the Google Colab platform, which provides a cloud-based environment for running Python code, particularly well-suited for machine learning and deep learning tasks. Historical data of the three rain gauges and the target river gauge were used as inputs to forecast the following day's water level using the sliding window method. Hyperparameters for the models were kept the same as in our previous paper (Madhushanka et al. 2024). The transformer architecture employed has been slightly modified from the original implementation presented in Vaswani et al. (2017). Notably, the original input data were used directly without being mapped to an embedding vector, which was done considering the continuity of the data for this regression task. Additionally, the mask of the self-attention layer in the decoder was omitted, thereby allowing time series data to access their successors. However, other aspects such as positional encoding, the number of layers and attention heads, and the dropout rate were kept consistent with the specifications outlined in the original paper. All the hyperparameters are shown in Table 2. ‘Early Stopping’ was used as the regularization for all the LSTM and transformer models.

Table 2 | The values of hyperparameters

| Hyperparameter | Value |
|---|---|
| Batch size | 32 |
| Sliding window size | 15 |
| Number of LSTM units in the hidden layer | 64 |
| Optimizer | Adam |
| Activation function | ReLU |
| Validation split | 0.1 |
| Learning rate | 0.001 |
| For the transformer: | |
| d_model (model dimension; number of units in the final layer of the feedforward block) | 64 |
| d_ff (number of units in the inner layer of the feedforward block) | 192 |
| Number of layers | 6 |
| Number of heads | 8 |
| Dropout rate | 0.1 |
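A sketch of the sliding-window input construction and early-stopping setup described above, with the window size, batch size, and validation split taken from Table 2; the patience value and epoch count are assumptions.

```python
import numpy as np
import tensorflow as tf

def make_windows(data, target_col, window=15):
    """Build (samples, window, features) inputs and next-day targets
    with the sliding window method; `data` is a (time, features) array."""
    X, y = [], []
    for t in range(len(data) - window):
        X.append(data[t:t + window, :])         # past `window` days of all features
        y.append(data[t + window, target_col])  # the following day's water level
    return np.array(X), np.array(y)

# 'Early Stopping' regularization, monitoring the 10% validation split.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)
# model.fit(X_train, y_train, batch_size=32, validation_split=0.1,
#           callbacks=[early_stop], epochs=200)
```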

As the first task, we studied the contributions from each input feature to the final output by considering the following input combinations. Three LSTMs and three transformer encoders were utilized for this task with the same model architectures, except for the input layer.

  • Case 1 – Past water levels and rainfall data as inputs

  • Case 2 – Past water levels as the only input

  • Case 3 – Past rainfall data as the only input

Based on the results, Case 1 was selected as the best scenario for both the LSTM and transformer encoder models. We then considered two additional transformer encoder–decoder models with the same hyperparameters as the encoder (Figure 5), for a broader comparison. Figure 6 illustrates the overall methodology.
Figure 5 | Developed transformer models – in addition to the initial encoder model, two additional encoder–decoder models were developed for the comparison.
Figure 6 | Methodology.

Impact of input features on simulation of daily water level

Three LSTM and three transformer models were used to examine the impacts of the three different input combinations for prediction. RMSE values of the results were calculated after lagging the prediction series by 1 day in order to examine the lag correlation between the two series as shown in Figure 7.
Figure 7 | RMSE and RMSE (lagged) for the three input combinations – RMSE and RMSE (lagged) for each model show the lag correlation when lagging the prediction series by 1 day.

According to the results (blue and green bars) in Figure 7, Case 3 has the highest error and Case 2 the lowest, while the error of Case 1 is close to that of Case 2, for both the LSTM and transformer algorithms. For the LSTM, RMSE improved by 47% from Case 3 to Case 1 and by a further 12% from Case 1 to Case 2. For the transformer, the corresponding improvements were 39 and 9%, denoting similar behavior. The large improvement from Case 3 to Case 1 and the comparatively small improvement from Case 1 to Case 2 indicate that past water level data have the highest impact on the output among all the input features.

When considering the LSTM models (blue and orange bars), there is a large reduction of RMSE between the actual and the lagged scenarios in Cases 1 and 2, while Case 3 does not show a lag correlation. The RMSE of the LSTM dropped by 76% in Case 1 and 51% in Case 2, while the transformer encoder (green and red bars) showed reductions of 67 and 46%, respectively, with the largest percentage reduction occurring in Case 1. These percentages quantify the impact of lagging the prediction series by 1 day. This error reduction likely arises when the models allocate a higher weight to the final timestep of the input series generated by the sliding window method, which explains why the cases that include past water levels show large RMSE reductions, whereas Case 3, which used only rainfall as input, showed no reduction when the prediction series was lagged by 1 day. A low lag correlation is preferable; otherwise, one could argue that the prediction series is merely the observation series shifted by a time offset. Case 2 showed the lowest RMSE as well as a low lag correlation compared with Case 1. Based on these results, we used past rainfall and water level data for the subsequent models.

Model performance

Figures 8–11 show the daily water levels of observation and simulation at the Manampitiya station in the Mahaweli catchment for the training (1984–2003) and testing (2004–2012) periods in the multivariate analysis.
Figure 8 | The predicted and observed water levels of the Mahaweli River for the training period.

Figure 9 | Scatter plots of predicted and observed water levels during the training period, with NSE representing the Nash–Sutcliffe efficiency.

Figure 10 | The predicted and observed water levels of the Mahaweli River for the testing period.

Figure 11 | Scatter plots of predicted and observed water levels during the testing period, with NSE representing the Nash–Sutcliffe efficiency.

As shown in Figures 8–11, all multivariate models performed relatively well in simulating the average water levels of the Mahaweli River. However, they differed greatly when it came to simulating both upward and downward peaks in streamflow. It should be noted that both transformer encoder–decoder models significantly underestimated the peak water levels, mostly for daily water levels exceeding 35.5 m, while clear overestimations were observed for low water levels, especially those below 33 m. Among the models tested, the LSTM and transformer encoder showed superior performance in simulating the peak water levels, while the other two models performed poorly in this regard.

Table 3 exhibits the evaluation indices of the developed models to compare their performances. Four evaluation metrics (RMSE, MAPE, NSE, and R²) were utilized for measuring the forecasting capabilities of the models. During the training period, RMSE ranged from 18.09 to 23.59 cm, MAPE varied from 0.282 to 0.392%, and R² varied from 0.8743 to 0.9185. All models had an NSE greater than 0.86. The LSTM achieved an NSE of 0.9183 and an RMSE of 18.09 cm, with an R² of 0.9185 and a MAPE of 0.282%. For the transformer encoder, those values were 0.9010, 19.90 cm, 0.9014, and 0.314%, respectively. Both transformer encoder–decoder models showed poorer performance than the others.

Table 3 | The evaluation of daily streamflow simulation

| Period | Model | RMSE (cm) | MAPE (%) | NSE | R² |
|---|---|---|---|---|---|
| Training | LSTM | 18.09 | 0.282 | 0.9183 | 0.9185 |
| | Transformer Encoder | 19.90 | 0.314 | 0.9010 | 0.9014 |
| | Transformer Enc→WL Dec→RF | 21.51 | 0.338 | 0.8844 | 0.8928 |
| | Transformer Enc→RF Dec→WL | 23.59 | 0.392 | 0.8609 | 0.8743 |
| Testing | LSTM | 27.83 | 0.449 | 0.7971 | 0.8026 |
| | Transformer Encoder | 27.83 | 0.471 | 0.7972 | 0.7985 |
| | Transformer Enc→WL Dec→RF | 29.10 | 0.527 | 0.7782 | 0.7828 |
| | Transformer Enc→RF Dec→WL | 29.19 | 0.52 | 0.7768 | 0.7901 |

During the testing period, all models demonstrated satisfactory performance in simulating daily water levels, with NSE values greater than 0.7768, although performance was lower than in the training phase. The LSTM and transformer encoder showed similar performance, with RMSE, MAPE, NSE, and R² values of 27.83 cm, 0.449%, 0.7971, and 0.8026 and 27.83 cm, 0.471%, 0.7972, and 0.7985, respectively. The Transformer Enc→WL Dec→RF model (encoder fed with water level (WL), decoder fed with rainfall (RF)) and the Enc→RF Dec→WL model produced similar evaluation values, with RMSEs of 29.10 and 29.19 cm and NSEs of 0.7782 and 0.7768, respectively. The transformer encoder exhibited improved performance compared with the encoder–decoder models.

Analysis of attention mechanism in transformer

Furthermore, we utilized the transformer's attention mechanism to gain deeper insights into the relationships between input features and their impact on the model's predictions. The attention plot (Figure 12) illustrates the cross-attention weights from the last decoder layer of our multivariate encoder–decoder model 1. This analysis covers the period from 3 January to 17 January 2011, which was deliberately selected because it includes the day with the highest recorded water level (10 January) in the test set. By focusing on this specific time period, we aim to observe the model's behavior and the attention dynamics surrounding an extreme event, providing a clear view of how the model prioritizes different inputs under significant hydrological conditions.
Figure 12 | The attention plot obtained from multivariate encoder–decoder model 1. The data from 3 January 2011 to 17 January 2011 were used for this analysis.

The plot consists of eight subplots, corresponding to the eight attention heads. Each subplot visualizes the attention weights, where the x-axis represents the rainfall data fed into the encoder, and the y-axis represents the water level data fed into the decoder. The color intensity indicates the magnitude of the attention weights, with darker colors representing lower attention weights and brighter colors representing higher attention weights.

Key observations from the attention plot reveal several important aspects. First, there is a noticeable concentration of attention around 10 January across several heads, particularly in Heads 1, 2, 5, and 6. This suggests that the model is giving significant importance to the days leading up to and following January 10, which is consistent with the high-water level recorded on that day. Moreover, the different attention heads display varying patterns. While some heads (e.g., Head 3 and Head 7) have a more diffuse distribution of attention, others (e.g., Head 1 and Head 2) show more focused attention on specific dates. This variability indicates that the transformer model can differentiate the importance of input features over time and thus, captures a wide range of dependencies and relationships in the data. For example, certain heads might focus more on Aralaganwila rainfall data, while others give more weight to Angamedilla or Polonnaruwa Agri.
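An attention plot of this kind can be drawn from the extracted cross-attention weights with matplotlib, as in the sketch below; `attn` is assumed to be an array of shape (heads, decoder steps, encoder steps) taken from the last decoder layer.

```python
import matplotlib.pyplot as plt

def plot_attention(attn, enc_labels, dec_labels):
    """attn: array of shape (n_heads, len(dec_labels), len(enc_labels))."""
    n_heads = attn.shape[0]
    fig, axes = plt.subplots(2, 4, figsize=(16, 7))
    for h, ax in enumerate(axes.flat[:n_heads]):
        im = ax.imshow(attn[h], aspect="auto", cmap="viridis")   # brighter = higher weight
        ax.set_title(f"Head {h + 1}")
        ax.set_xticks(range(len(enc_labels)))
        ax.set_xticklabels(enc_labels, rotation=90, fontsize=6)  # encoder: rainfall dates
        ax.set_yticks(range(len(dec_labels)))
        ax.set_yticklabels(dec_labels, fontsize=6)               # decoder: water level dates
    fig.colorbar(im, ax=axes.ravel().tolist(), shrink=0.8)
    plt.show()
```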

Discussion

Overall, the single-encoder model showed the best performance among the transformer models. When a transformer model has both an encoder and a decoder, it takes the Query (Q) and Key (K) vectors from the encoder and the Value (V) vector from the decoder to calculate the cross-attention between the two inputs. When there is just the encoder, all of Q, K, and V are taken from a single input series, which might be the reason for its increased accuracy. However, if there are input series with two different sequence lengths, an encoder–decoder model must be used for the analysis.

Furthermore, the LSTM and transformer encoder models exhibited approximately similar performance in daily water level forecasting, although many studies have reported higher accuracy for transformers compared with LSTM (Castangia et al. 2023; Xu et al. 2023). There might be several common reasons for this, including the presence of outliers and high variance in the dataset (Granata et al. 2022). The box plots in Figure 2 highlight numerous outlier points within the dataset. Despite this, we chose not to apply any denoising techniques, because some outliers may reflect the climatic patterns of the South Asian tropical zone.

Additionally, another reason for the suboptimal performance could be the lack of input features, as suggested by Wei et al. (2023). Our dataset included only four variables: data from three rain gauges and one river gauge. Incorporating other variables, such as evapotranspiration, temperature, and wind, could potentially enhance performance.

In this study, four machine learning models (LSTM, transformer encoder, and transformer encoder–decoder models 1 and 2) were applied to simulate daily water levels at the Manampitiya river gauge in the Mahaweli catchment, and the impacts of different inputs on the output were examined. According to the results, past water level data had a greater impact on the output than the other input features, such as rainfall. It is recommended to use additional input features that show high correlations with the target, along with past observations of the output, since this strategy increases model performance and decreases the lag correlation. Further, the LSTM and transformer encoder showed similar accuracies in daily water level forecasting. Although transformer models tend to perform better in streamflow forecasting, the lack of input features and the outliers and high variance of the dataset may have reduced the performance of the transformer encoder model. In this case, switching the inputs between the encoder and decoder yielded similar performance, but this might differ in other settings. Therefore, when the input features have two different time steps, it is recommended to try both input assignments and compare the resulting accuracies.

For future research, we recommend examining how different input features such as rainfall, temperature, and humidity affect water level forecasting, considering the attention analysis. Specifically, leveraging the attention mechanism of transformer models can help identify the most influential input features, thereby optimizing the data used for training and improving prediction accuracy.

Additionally, exploring hybrid models that combine multiple algorithms, such as integrating LSTM with transformer models, could enhance forecasting accuracy and extend the prediction window. These hybrid models can take advantage of the strengths of each individual algorithm, potentially leading to more robust and reliable predictions. This approach is particularly valuable in addressing the high variability and complexity of hydrological data.

Ultimately, these advancements in model development and data utilization can significantly benefit water management practices. Improved forecasting accuracy supports sustainable decision-making, allowing for more effective mitigation strategies for droughts, floods, and other water-related challenges. This research has the potential to enhance disaster preparedness, optimize resource allocation, and contribute to the overall resilience of communities and infrastructure against climate-related impacts.

We would like to thank Ranga Rodrigo for giving valuable advice as well as learning resources regarding machine learning. Further, we are thankful to the Department of Irrigation, Sri Lanka for providing us with water level data.

G.W.T.I.M. led the development of research methodology, raw data acquisition, code development, data manipulation, training and evaluation of the models, results analysis and writing of the original draft. M.T.R.J. led the supervision throughout the study, providing financial support and facilitating data acquisition by signing the agreements. R.A.R. contributed to data analysis, code development, and final manuscript writing and provided guidance as well as financial support.

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Data cannot be made publicly available; readers should contact the corresponding author for details.

The authors declare there is no conflict.

Alfieri, L., Bisselink, B., Dottori, F., Naumann, G., de Roo, A., Salamon, P., Wyser, K. & Feyen, L. (2017) Global projections of river flood risk in a warmer world, Earth's Future, 5(2), 171–182. https://doi.org/10.1002/2016EF000485.

Bahdanau, D., Cho, K. & Bengio, Y. (2014) Neural Machine Translation by Jointly Learning to Align and Translate.

Boulange, J., Hanasaki, N., Yamazaki, D. & Pokhrel, Y. (2021) Role of dams in reducing global flood exposure under climate change, Nature Communications, 12(1), 417. https://doi.org/10.1038/s41467-020-20704-0.

Castangia, M., Grajales, L. M. M., Aliberti, A., Rossi, C., Macii, A., Macii, E. & Patti, E. (2023) Transformer neural networks for interpretable flood forecasting, Environmental Modelling and Software, 160. https://doi.org/10.1016/j.envsoft.2022.105581.

Christensen, J. H., Hewitson, B., Busuioc, A., Chen, A., Gao, X., Held, I., Jones, R., Kolli, R. K., Kwon, W. T., Laprise, R., Magana Rueda, V., Mearns, L., Menendez, C. G., Raisanen, J., Rinke, A., Sarr, A. & Whetton, P. (2007) Regional Climate Projections. Chapter 11. Cambridge, UK: Cambridge University Press.

Chu, H., Wu, W., Wang, Q. J., Nathan, R. & Wei, J. (2020) An ANN-based emulation modelling framework for flood inundation modelling: Application, challenges and future directions, Environmental Modelling & Software, 124, 104587. https://doi.org/10.1016/j.envsoft.2019.104587.

De, A. C. (1997) Management of the Mahaweli, a river in Sri Lanka, Water International, 22(2), 98–107. https://doi.org/10.1080/02508069708686678.

Diyabalanage, S., Abekoon, S., Watanabe, I., Watai, C., Ono, Y., Wijesekara, S., Guruge, K. S. & Chandrajith, R. (2016) Has irrigated water from Mahaweli River contributed to the kidney disease of uncertain etiology in the dry zone of Sri Lanka?, Environmental Geochemistry and Health, 38(3), 679–690. https://doi.org/10.1007/s10653-015-9749-1.

Donat, M. G., Lowry, A. L., Alexander, L. V., O'Gorman, P. A. & Maher, N. (2016) More extreme precipitation in the world's dry and wet regions, Nature Climate Change, 6(5), 508–513. https://doi.org/10.1038/nclimate2941.

Dtissibe, F. Y., Ari, A. A. A., Abboubakar, H., Njoya, A. N., Mohamadou, A. & Thiare, O. (2024) A comparative study of machine learning and deep learning methods for flood forecasting in the far-north region, Cameroon, Scientific African, 23, e02053. https://doi.org/10.1016/j.sciaf.2023.e02053.

Elsafi, S. H. (2014) Artificial Neural Networks (ANNs) for flood forecasting at Dongola Station in the River Nile, Sudan, Alexandria Engineering Journal, 53(3), 655–662. https://doi.org/10.1016/j.aej.2014.06.010.

Fang, Z., Wang, Y., Peng, L. & Hong, H. (2021) Predicting flood susceptibility using LSTM neural networks, Journal of Hydrology, 594, 125734. https://doi.org/10.1016/j.jhydrol.2020.125734.

Farsani, R. M. & Pazouki, E. (2021) A transformer self-attention model for time series forecasting, Journal of Electrical and Computer Engineering Innovations, 9(1), 1–10. https://doi.org/10.22061/JECEI.2020.7426.391.

Granata, F., Di Nunno, F. & de Marinis, G. (2022) Stacked machine learning algorithms and bidirectional long short-term memory networks for multi-step ahead streamflow forecasting: A comparative study, Journal of Hydrology, 613. https://doi.org/10.1016/j.jhydrol.2022.128431.

Hallegatte, S., Vogt-Schilb, A., Bangalore, M. & Rozenberg, J. (2017) Unbreakable: Building the Resilience of the Poor in the Face of Natural Disasters. Washington, DC: World Bank. https://doi.org/10.1596/978-1-4648-1003-9.

Hewawasam, T. (2010) Effect of land use in the upper Mahaweli catchment area on erosion, landslides and siltation in hydropower reservoirs of Sri Lanka, Journal of the National Science Foundation of Sri Lanka, 38(1), 3. https://doi.org/10.4038/jnsfsr.v38i1.1721.

Hirabayashi, Y., Mahendran, R., Koirala, S., Konoshima, L., Yamazaki, D., Watanabe, S., Kim, H. & Kanae, S. (2013) Global flood risk under climate change, Nature Climate Change, 3(9), 816–821. https://doi.org/10.1038/nclimate1911.

Hochreiter, S. & Schmidhuber, J. (1997) Long short-term memory, Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.

Hyndman, R. J. & Khandakar, Y. (2008) Automatic time series forecasting: The forecast package for R, Journal of Statistical Software, 27. Available from: http://www.jstatsoft.org/.

Jackson, E. K., Roberts, W., Nelsen, B., Williams, G. P., Nelson, E. J. & Ames, D. P. (2019) Introductory overview: Error metrics for hydrologic modelling – a review of common practices and an open source library to facilitate use and adoption. In: Environmental Modelling and Software, Vol. 119. Elsevier Ltd., pp. 32–48. https://doi.org/10.1016/j.envsoft.2019.05.001.

Lee, J., Weger, R. C., Sengupta, S. K. & Welch, R. M. (1990) A neural network approach to cloud classification, IEEE Transactions on Geoscience and Remote Sensing, 28(5), 846–855. https://doi.org/10.1109/36.58972.

Li, S., Jin, X., Xuan, Y., Zhou, X., Chen, W., Wang, Y.-X. & Yan, X. (2019) Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting.

Luppichini, M., Barsanti, M., Giannecchini, R. & Bini, M. (2022) Deep learning models to predict flood events in fast-flowing watersheds, Science of the Total Environment, 813. https://doi.org/10.1016/j.scitotenv.2021.151885.

Madhushanka, T., Jayasinghe, T. & Rajapakse, R. (2024) Multi day ahead flood prediction in South Asian tropical zone using deep learning. Available as a preprint. https://doi.org/10.21203/rs.3.rs-4070758/v1.

Mourato, S., Fernandez, P., Marques, F., Rocha, A. & Pereira, L. (2021) An interactive Web-GIS fluvial flood forecast and alert system in operation in Portugal, International Journal of Disaster Risk Reduction, 58, 102201. https://doi.org/10.1016/j.ijdrr.2021.102201.

Naveendrakumar, G., Vithanage, M., Kwon, H.-H., Chandrasekara, S. S. K., Iqbal, M. C. M., Pathmarajah, S., Fernando, W. C. D. K. & Obeysekera, J. (2019) South Asian perspective on temperature and rainfall extremes: A review, Atmospheric Research, 225, 110–120. https://doi.org/10.1016/j.atmosres.2019.03.021.

Parthasarathy, B. & Mooley, D. A. (1978) Some features of a long homogeneous series of Indian summer monsoon rainfall, Monthly Weather Review, 106(6), 771–781. https://doi.org/10.1175/1520-0493(1978)106<0771:SFOALH>2.0.CO;2.

Pierini, N. A., Vivoni, E. R., Robles-Morua, A., Scott, R. L. & Nearing, M. A. (2014) Using observations and a distributed hydrologic model to explore runoff thresholds linked with mesquite encroachment in the Sonoran Desert, Water Resources Research, 50(10), 8191–8215. https://doi.org/10.1002/2014WR015781.

Rahmati, O. & Pourghasemi, H. R. (2017) Identification of critical flood prone areas in data-scarce and ungauged regions: A comparison of three data mining models, Water Resources Management, 31(5), 1473–1487. https://doi.org/10.1007/s11269-017-1589-6.

Serinaldi, F., Loecker, F., Kilsby, C. G. & Bast, H. (2018) Flood propagation and duration in large river basins: A data-driven analysis for reinsurance purposes, Natural Hazards, 94(1), 71–92. https://doi.org/10.1007/s11069-018-3374-0.

Shelton, S. & Lin, Z. (2019) Streamflow variability over the period of 1990–2014 in Mahaweli River Basin, Sri Lanka and its possible mechanisms, Water (Switzerland), 11(12). https://doi.org/10.3390/w11122485.

Tamiru, H. & Dinka, M. O. (2021) Application of ANN and HEC-RAS model for flood inundation mapping in lower Baro Akobo River Basin, Ethiopia, Journal of Hydrology: Regional Studies, 36, 100855. https://doi.org/10.1016/j.ejrh.2021.100855.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. & Polosukhin, I. (2017) Attention Is All You Need.

Wei, X., Wang, G., Schmalz, B., Hagan, D. F. T. & Duan, Z. (2023) Evaluate transformer model and self-attention mechanism in the Yangtze River Basin runoff prediction, Journal of Hydrology: Regional Studies, 47. https://doi.org/10.1016/j.ejrh.2023.101438.

Wickramagamage, P. (2016) Spatial and temporal variation of rainfall trends of Sri Lanka, Theoretical and Applied Climatology, 125(3–4), 427–438. https://doi.org/10.1007/s00704-015-1492-0.

Wu, N., Green, B., Ben, X. & O'Banion, S. (2020) Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case.

Xiang, Z., Yan, J. & Demir, I. (2020) A rainfall-runoff model with LSTM-based sequence-to-sequence learning, Water Resources Research, 56(1). https://doi.org/10.1029/2019WR025326.

Xie, S.-P. & Saiki, N. (1999) Abrupt onset and slow seasonal evolution of summer monsoon in an idealized GCM simulation, Journal of the Meteorological Society of Japan. Ser. II, 77(4), 949–968. https://doi.org/10.2151/jmsj1965.77.4_949.

Xu, Y., Lin, K., Hu, C., Wang, S., Wu, Q., Zhang, L. & Ran, G. (2023) Deep transfer learning based on transformer for flood forecasting in data-sparse basins, Journal of Hydrology, 625. https://doi.org/10.1016/j.jhydrol.2023.129956.

Zhang, Y., Ragettli, S., Molnar, P., Fink, O. & Peleg, N. (2022) Generalization of an encoder-decoder LSTM model for flood prediction in ungauged catchments, Journal of Hydrology, 614, 128577. https://doi.org/10.1016/j.jhydrol.2022.128577.

Zou, Y., Wang, J., Lei, P. & Li, Y. (2023) A novel multi-step ahead forecasting model for flood based on time residual LSTM, Journal of Hydrology, 620, 129521. https://doi.org/10.1016/j.jhydrol.2023.129521.

Zubair, L. (2003) El Niño-southern oscillation influences on the Mahaweli streamflow in Sri Lanka, International Journal of Climatology, 23(1), 91–102. https://doi.org/10.1002/joc.865.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).