Runoff prediction is a cornerstone of the effective management, allocation, and utilization of water resources and plays a key role in hydrological research. This study employs a recently reported deep learning model, Mamba, to forecast daily river runoff and compares it with various benchmark methods, including statistical models, machine learning methods, recurrent neural networks, and attention-based models. These models are applied at three hydrological stations situated along the middle and lower reaches of the Mississippi River. Daily runoff data from 1983 to 2023 were used to build the models for 7-day-ahead prediction. The findings demonstrate the superiority of the Mamba model over its counterparts, showcasing its potential as a backbone model. In response to the need for a more lightweight approach, a refined variant of the Mamba model, called LightMamba, is proposed. LightMamba incorporates partial normalization and MPM (Multi-Path-Mamba) to better discern nonlinear trends and capture long-term dependencies in streamflow data. Notably, LightMamba achieves commendable performance with average NSE values of 0.904, 0.907, and 0.900 at the three stations. This study introduces an innovative backbone model for time series forecasting, which offers a novel approach to hybrid modeling for future daily runoff prediction.

  • The effectiveness of a Mamba model for predicting long-term daily runoff is explored.

  • LightMamba is proposed as a lightweight model with partial normalization and an MPM module.

  • The proposed model was compared with various benchmark methods and the results were analyzed.

  • Results show that LightMamba generally performed best among the compared models.

Runoff forecasting plays a critical role in water resources planning and management (Liu et al. 2024), and high-precision runoff forecasts are essential for water supply, flood control, and power generation (Williams et al. 2021). Influenced by natural meteorological conditions, watershed characteristics, and human activities, the runoff sequence is characterized by nonlinearity, stochasticity, and periodicity (Sharafati et al. 2020). The search for more accurate runoff forecasting methods has been a major concern of scholars.

Existing runoff prediction methods generally fall into two broad categories: physically driven models and data-driven models (Nourani et al. 2021). Physically driven models help explain the underlying physics of hydrologic processes. They construct complex mathematical models, such as high-dimensional partial differential equations, from hydrological, meteorological, topographic, soil-property, and vegetation-cover data to simulate or predict river runoff (Farmer et al. 2003; Fenicia et al. 2008; Bai et al. 2009). However, such models are difficult to build, and their parameter estimation is complex.

Unlike physically driven models, data-driven models focus only on the relationship between inputs and outputs (Kumar et al. 2022) and are easier to build. As data-driven models continue to evolve, they can be further categorized into statistical models, dynamic and stochastic models, machine learning methods, and deep learning models. Statistical models use statistical correlation to predict future runoff from historical runoff observations, for example the Autoregressive Integrated Moving Average (ARIMA) model (Farajzadeh et al. 2014). Statistical models can accurately capture the linear relationship between historical and future data, but they cannot account for external factors that affect the series, and their ability to capture nonlinear relationships is weak. Stochastic methods model time series as stochastic processes (Dimitriadis et al. 2021) and can describe complex nonlinear relationships in a system, but estimating the parameters of the stochastic processes is difficult.

Machine learning models are capable of fitting nonlinear relationships between inputs and outputs (Mosavi et al. 2018) and can therefore be used for short-term runoff prediction. A variety of machine learning methods have been proposed for runoff prediction, including k-nearest neighbor (KNN) regression (Yang et al. 2020), support vector regression (SVR) (Bigdeli et al. 2023), eXtreme Gradient Boosting (XGBoost) (Szczepanek 2022), and others. However, these methods have limited capacity for extracting deep information and cannot infer the order of a sequence on their own, so temporal information must be added to the model through feature engineering (Dwivedi et al. 2022). Moreover, their predictive performance degrades rapidly as the dimensionality of the input features and the number of prediction steps increase.

With the rapid development of artificial intelligence and computing capability (e.g., GPUs), deep learning models are widely used in various tasks because they can process high-dimensional input information. Owing to their inherent sequential structure, Recurrent Neural Networks (RNNs) excel at sequence tasks (Nosouhian et al. 2021). Long Short-Term Memory (LSTM) alleviates the vanishing- and exploding-gradient problems of RNNs, and the Gated Recurrent Unit (GRU) further improves training efficiency, so these two variants are often used as base models for runoff prediction (Xiang et al. 2020; Man et al. 2023; Yao et al. 2023). However, the recurrent structure of these models means they can only be trained recursively, which becomes inefficient when the input sequence is long.

The Transformer (Vaswani et al. 2017) is widely used in natural language processing and computer vision and has also been applied to runoff prediction (Yin et al. 2022; Fang et al. 2024). Its core is the self-attention mechanism, which removes the sequential dependency of traditional recurrent neural networks. The Informer model (Zhou et al. 2021), which improves the efficiency of the Transformer for time series prediction, has been applied to runoff prediction with good results (Du et al. 2022). However, the attention mechanism suffers from quadratic complexity in the input sequence length and a low inductive bias (Neyshabur 2020).

Recently, state space models (SSMs) have attracted a lot of attention in the NLP and CV communities due to their efficacy in long-range sequence modeling. Mamba introduces a selection mechanism into SSMs, giving them the ability to perform context-aware selection and capture long-range correlations (Gu & Dao 2023). It employs scanning operations for parallel computation during training while still maintaining linear complexity, allowing Mamba-based models to outperform transformers. As a sequence model, Mamba also has great potential for time series prediction tasks, which triggers our interest in exploring Mamba for runoff prediction.

The main contributions of this study can be categorized into three areas:

  • The Mamba model was used to predict runoff data to explore the potential application of Mamba as a backbone.

  • A new deep learning prediction model, LightMamba, is proposed. The Multi-Path-Mamba (MPM) module is used to reduce the number of parameters of the Mamba model, and the input sequences are partially normalized to improve prediction accuracy on non-stationary time series such as runoff data.

  • We tested our model and compared it with several benchmark models using four evaluation metrics, and the results show that our model performs better.

Problem statement

In the time series forecasting task, the input to the model is the historical sequence $X = \{x_1, x_2, \ldots, x_L\} \in \mathbb{R}^{L \times D}$, where $x_i \in \mathbb{R}^D$. The future sequence $Y = \{x_{L+1}, \ldots, x_{L+T}\}$ is predicted using the input $X$. $L$ and $T$ are the lengths of the time windows for inputting the past and predicting the future, respectively, and are referred to as the retrospective window and the prediction horizon. $x_i$ is the variable at time step $i$ and $D$ is the dimension of the variable.
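To make the notation concrete, the following is a minimal sketch, assuming a univariate series ($D = 1$) and the window lengths used later in this study ($L = 30$, $T = 7$), of how the sliding input windows $X$ and target windows $Y$ can be constructed; all function and variable names are illustrative.

```python
import numpy as np

def sliding_windows(series: np.ndarray, L: int = 30, T: int = 7):
    """Split a univariate series into lookback windows X of shape (n, L, 1)
    and forecast targets Y of shape (n, T, 1)."""
    X, Y = [], []
    for t in range(len(series) - L - T + 1):
        X.append(series[t:t + L])          # retrospective window
        Y.append(series[t + L:t + L + T])  # prediction horizon
    X = np.asarray(X)[..., None]           # append the variable dimension D = 1
    Y = np.asarray(Y)[..., None]
    return X, Y
```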

State space model

The structured state space sequence model (S4) (Gu et al. 2021) is a class of recent sequence models for deep learning that can represent any cyclic process with latent states. The model maps the input $x(t)$ to the output $y(t)$ via the hidden state $h(t) \in \mathbb{R}^{N}$:
$$h'(t) = A h(t) + B x(t) \qquad (1)$$
$$y(t) = C h(t) \qquad (2)$$
where $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, and $C \in \mathbb{R}^{1 \times N}$ are learnable parameters. To model discrete sequences, the SSM must be discretized using the time scale parameter $\Delta$. This discretization transforms the continuous parameters $(A, B)$ into their discrete form $(\bar{A}, \bar{B})$. Let $t$ and $N$ denote the time step and the state dimension, respectively, and let $x_t$, $h_t$, and $y_t$ denote the input, hidden state, and output at time $t$. Then the discretized SSM can be expressed as follows:
$$h_t = \bar{A} h_{t-1} + \bar{B} x_t \qquad (3)$$
$$y_t = C h_t \qquad (4)$$
Here, $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$ under zero-order-hold discretization. The discrete form of the SSM can be computed in a linear recursive manner, improving computational efficiency. Figure 1 shows the workflow of the SSM at time step $t$. S4, derived from the original SSM, employs HiPPO (Gu et al. 2020) for initialization to add structure to the state matrix $A$, thereby enhancing long-range dependency modeling.
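A minimal numerical sketch of Equations (1)–(4), assuming the zero-order-hold discretization given above; matrix sizes and function names are illustrative only.

```python
import numpy as np
from scipy.linalg import expm

def discretize(A, B, delta):
    """Zero-order hold: A_bar = exp(dt*A), B_bar = (dt*A)^-1 (exp(dt*A) - I) dt*B."""
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.solve(dA, A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, x):
    """Linear recursion of Equations (3)-(4): h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:                          # x is a 1-D input sequence
        h = A_bar @ h + B_bar[:, 0] * x_t
        ys.append((C @ h).item())
    return np.asarray(ys)
```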
Figure 1

Architecture of discretized SSM.


Mamba

By introducing a data-dependent selection mechanism (Arjovsky et al. 2016) and an efficient scanning algorithm into the SSM, and by merging the well-known H3 architecture with an MLP gating mechanism into a single component, Gu & Dao (2023) proposed a new structured state space sequence model, Mamba, shown in Figure 2. The selection mechanism allows Mamba to capture contextual information in long sequences while maintaining computational efficiency.
Figure 2

Architecture of Mamba.

In the SSM with a selection mechanism, fully connected layers are used to learn $B$, $C$, and $\Delta$ from the input $x$. The hidden states and outputs of the SSM model can then be written as:
$$h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t \qquad (5)$$
$$y_t = C_t h_t \qquad (6)$$
Through this selection mechanism, the variables in S4 are transformed from time-invariant to time-varying. In general, $\Delta$ controls the balance between how much to focus on or ignore the current input $x_t$: mechanically, a large $\Delta$ resets the state $h$ and focuses on the current input $x$, while a small $\Delta$ persists the state and ignores the current input. Although $A$ could also be considered selective, its ultimate effect on the model is only through its interaction with $\Delta$ via $\bar{A} = \exp(\Delta A)$; therefore, ensuring selectivity in $\Delta$ is sufficient to guarantee selectivity in $(\bar{A}, \bar{B})$. Since $B$ and $C$ are selective, the model can accurately control whether the input $x_t$ is allowed to enter the state $h_t$ and whether the state is allowed to enter the output $y_t$. This mechanism, similar to the attention mechanism, enables the model to selectively extract key information and filter irrelevant noise from the input data. The scanning algorithm also allows Mamba to use an efficient parallel training mode. Mamba is a potentially powerful tool for runoff prediction tasks because it achieves strong sequence modeling performance while maintaining linear complexity in the input sequence length. Algorithm 1 provides a detailed description of the computation of a Mamba block.

Algorithm 1 Mamba Block with Selective SSM

Input: sequence $X \in \mathbb{R}^{B \times L \times D}$

Output: sequence $Y \in \mathbb{R}^{B \times L \times D}$

1: $x, z \leftarrow \mathrm{Linear}(X)$  // project to the expanded inner dimension

2: $x \leftarrow \mathrm{SiLU}(\mathrm{Conv1D}(x))$

3: $B, C, \Delta \leftarrow \mathrm{Linear}(x)$  // input-dependent SSM parameters

4: $\bar{A}, \bar{B} \leftarrow \mathrm{discretize}(\Delta, A, B)$

5: $y \leftarrow \mathrm{SelectiveSSM}(\bar{A}, \bar{B}, C)(x)$  // computed with a parallel scan

6: $y \leftarrow y \odot \mathrm{SiLU}(z)$  // gating branch

7: $Y \leftarrow \mathrm{Linear}(y)$

8: return $Y$
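For readers who prefer code, the following is a minimal PyTorch sketch of a Mamba-style block with a selective SSM, corresponding to Equations (5)–(6) and Algorithm 1. It is an illustrative reimplementation under our own assumptions (layer sizes, diagonal state matrix, a sequential loop instead of the fused parallel-scan kernel of Gu & Dao 2023), not the exact configuration used in this study.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Selective SSM: B_t, C_t and delta_t are projected from the input x_t, then the
    discretized recursion h_t = A_bar_t h_{t-1} + B_bar_t x_t, y_t = C_t h_t is unrolled.
    The real Mamba kernel replaces this Python loop with a parallel scan."""
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.d_state = d_state
        # Diagonal, negative-real state matrix A (a common S4/Mamba-style initialization)
        self.A_log = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float()))
        self.proj_BC = nn.Linear(d_model, 2 * d_state)
        self.proj_delta = nn.Linear(d_model, d_model)

    def forward(self, x):                                    # x: (batch, length, d_model)
        b, l, d = x.shape
        A = -torch.exp(self.A_log)                           # (d_state,)
        B, C = self.proj_BC(x).chunk(2, dim=-1)              # each (b, l, d_state)
        delta = F.softplus(self.proj_delta(x))               # positive step sizes, (b, l, d_model)
        h = x.new_zeros(b, d, self.d_state)
        ys = []
        for t in range(l):
            A_bar = torch.exp(delta[:, t].unsqueeze(-1) * A)           # Eq. (5): discretized A
            B_bar = delta[:, t].unsqueeze(-1) * B[:, t].unsqueeze(1)   # Eq. (5): discretized B
            h = A_bar * h + B_bar * x[:, t].unsqueeze(-1)
            ys.append((h * C[:, t].unsqueeze(1)).sum(-1))              # Eq. (6)
        return torch.stack(ys, dim=1)                        # (b, l, d_model)

class MambaBlock(nn.Module):
    """Gated Mamba-style block: expand, depthwise causal conv, selective SSM, gate, project."""
    def __init__(self, d_model: int, expand: int = 2, d_conv: int = 4):
        super().__init__()
        d_inner = expand * d_model
        self.in_proj = nn.Linear(d_model, 2 * d_inner)
        self.conv = nn.Conv1d(d_inner, d_inner, d_conv, padding=d_conv - 1, groups=d_inner)
        self.ssm = SelectiveSSM(d_inner)
        self.out_proj = nn.Linear(d_inner, d_model)

    def forward(self, x):                                    # x: (batch, length, d_model)
        x_in, z = self.in_proj(x).chunk(2, dim=-1)
        x_in = self.conv(x_in.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        x_in = F.silu(x_in)
        y = self.ssm(x_in)
        return self.out_proj(y * F.silu(z))                  # MLP-style gating
```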

LightMamba

The number of parameters in Mamba is primarily determined by its internal linear projection layers. When the input and output dimensions are both $d$, a linear layer contains $d^2$ weights, i.e., the parameter count grows quadratically with the dimension of the input variable. Thus, reducing the dimension processed by each projection reduces the number of parameters in the linear layer quadratically; for example, splitting a dimension of 64 into four subspaces of width 16 replaces one $64 \times 64$ projection (4,096 weights) with four $16 \times 16$ projections (1,024 weights in total).

The Multi-Path-Mamba (MPM) module (Algorithm 2) is proposed based on this idea. First, the input sequence $X$ is split along the variable dimension into subspaces $X_1, \ldots, X_k$. Then, a smaller Mamba model extracts the sequence features within each subspace. Finally, the features from all subspaces are merged to form the final output. MPM helps the model learn time series feature representations in different subspaces while using fewer parameters than the original model.

Algorithm 2 Multi-Path-Mamba

Input: sequence $X \in \mathbb{R}^{B \times L \times D}$

Output: sequence $Y \in \mathbb{R}^{B \times L \times D}$

1: split $X$ along the variable dimension into $k$ subspaces $X_1, \ldots, X_k$

2: initialize an empty list of subspace outputs

3: for each $X_i$ do

4: $Y_i \leftarrow \mathrm{Mamba}_i(X_i)$  // a small Mamba per subspace

5: append $Y_i$ to the list of subspace outputs

6: end for

7: $Y \leftarrow \mathrm{Linear}(\mathrm{Concat}(Y_1, \ldots, Y_k))$  // merge subspace features

8: return $Y$
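A compact PyTorch sketch of the split-process-merge idea in Algorithm 2 follows. The `mamba_factory` argument, the merge projection, and the shapes are our illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MultiPathMamba(nn.Module):
    """Multi-Path-Mamba sketch: split the feature dimension into k subspaces,
    run a small sequence mixer (e.g., a Mamba block) per subspace, then merge."""
    def __init__(self, d_model: int, n_paths: int, mamba_factory):
        super().__init__()
        assert d_model % n_paths == 0, "d_model must be divisible by the number of paths"
        self.n_paths = n_paths
        d_sub = d_model // n_paths
        # mamba_factory(d_sub) should return a module mapping (B, L, d_sub) -> (B, L, d_sub)
        self.paths = nn.ModuleList(mamba_factory(d_sub) for _ in range(n_paths))
        self.merge = nn.Linear(d_model, d_model)     # fuse the concatenated subspace features

    def forward(self, x):                            # x: (batch, length, d_model)
        chunks = x.chunk(self.n_paths, dim=-1)       # split along the variable/feature dimension
        outs = [path(c) for path, c in zip(self.paths, chunks)]
        return self.merge(torch.cat(outs, dim=-1))
```

For instance, `MultiPathMamba(d_model=64, n_paths=4, mamba_factory=lambda d: MambaBlock(d))` would split a 64-dimensional embedding into four 16-dimensional paths, each handled by a small block such as the `MambaBlock` sketched earlier.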

After the MPM module, an MLP block is connected, and residual connections and normalization layers are added at suitable locations to fuse the multi-subspace features extracted by MPM. Because the intermediate dimension $d$ is typically large in deep learning models, even though the added MLP block introduces some parameters, MPM keeps the model lighter overall. Finally, a partial normalization block is added to the model for runoff prediction. For each input series $x = \{x_1, \ldots, x_L\}$, we transform it by translation and scaling operations to obtain $x' = \{x'_1, \ldots, x'_L\}$. The partial normalization block can be formulated as follows:
$$\mu = \frac{1}{L}\sum_{i=1}^{L} x_i, \qquad \sigma^2 = \frac{1}{L}\sum_{i=1}^{L}\left(x_i - \mu\right)^2, \qquad x'_i = \left(x_i - \mu\right) \oslash \sigma \qquad (7)$$
where $\oslash$ denotes element-wise division and $\odot$ is the element-wise product; the model output is mapped back to the original scale by $\hat{y}_i = \sigma \odot \hat{y}'_i + \mu$. This is motivated by the fact that non-stationary transformers with this module can improve the forecasting of non-stationary time series, and its effectiveness has been confirmed (Liu et al. 2022).
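A minimal sketch of the normalization/denormalization pair in Equation (7), following the non-stationary normalization of Liu et al. (2022); the epsilon term and tensor shapes are our assumptions.

```python
import torch

def partial_normalize(x: torch.Tensor, eps: float = 1e-5):
    """Normalize each input window along the time axis: x' = (x - mean) / std.
    The statistics are returned so the forecast can be mapped back to the original scale."""
    mean = x.mean(dim=1, keepdim=True)                         # x: (batch, length, variables)
    std = torch.sqrt(x.var(dim=1, keepdim=True, unbiased=False) + eps)
    return (x - mean) / std, mean, std

def denormalize(y_hat: torch.Tensor, mean: torch.Tensor, std: torch.Tensor):
    """Undo the translation and scaling on the model output: y = y' * std + mean."""
    return y_hat * std + mean
```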
Therefore, we propose a lightweight Mamba model with partial normalization and MPM, called LightMamba, whose network structure is shown in Figure 3. First, the input $X$ is fed into the normalization layer to obtain the normalized $X'$ together with its mean and variance. After embedding, it is fed into the MPM block and the MLP block. After $N$ such cycles, the output passes through the projection layer and denormalization to produce the final prediction. In LightMamba, the MPM layer acts as a token mixer for sequence interaction, and the MLP layer acts as a channel mixer to further enhance the representation.
Figure 3

Overall framework of LightMamba. The left side of the figure presents the overall architecture of our model; the right side details the components of the MPM block.


Model evaluation indicators

The following four evaluation indicators are used to evaluate the prediction results, with metrics including mean absolute error (MAE), root mean squared error (RMSE), mean absolute percentage error (MAPE) and Nash–Sutcliffe efficiency coefficient (NSE) as quantitative evaluation criteria, and the calculation equations are as follows:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \qquad (8)$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad (9)$$
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \qquad (10)$$
$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \qquad (11)$$
Here, $y_i$ and $\hat{y}_i$ represent the observed and predicted values, respectively, and $\bar{y}$ denotes the mean of the observed values. Lower errors and a higher NSE denote better performance.
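For reference, the following is a small sketch of how these four indicators can be computed with NumPy; function and variable names are illustrative.

```python
import numpy as np

def evaluate(y_obs: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute MAE, RMSE, MAPE and NSE (Eqs. 8-11) for observed vs. predicted runoff."""
    err = y_pred - y_obs
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    mape = np.mean(np.abs(err / y_obs))                     # assumes no zero observations
    nse = 1.0 - np.sum(err ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "NSE": nse}
```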

Study area

In this study, the middle and lower reaches of the Mississippi River were selected as a case study, as shown in Figure 4. The total length of the Mississippi River is 6,021 km, with a basin area of 3.22 million km². The flood season in the middle and lower reaches lasts from January to June, with the highest water level in April and the dry season in October, and the river carries a high sediment load, which increases with the influx of tributaries from semi-arid regions. Most of the basin lies on plains. The middle and lower reaches, with their gentle slope and extensive floodplain, are prone to flooding during spring and summer surges, and low-lying areas along the middle reaches are particularly vulnerable (Munoz et al. 2018).
Figure 4

The research area and hydrological stations of this study.


Data

The Global Runoff Data Centre (GRDC) is an international data centre operating under the auspices of the World Meteorological Organization (WMO). GRDC provides pre-processed flow data from rivers around the world for scientific research (https://grdc.bafg.de/). For this study, St. Louis, Chester, and Thebes, located in the middle and lower reaches of the Mississippi River, were selected as study stations; their spatial distribution is shown in Figure 4. Daily runoff data (m³/s) were obtained from GRDC for the period 1983–2023, and each time series is complete with 14,610 data points. The statistical characteristics of the runoff data from the three stations are shown in Table 1, and the runoff processes are plotted in Figure 5. The area exhibits distinct summer-flood characteristics, with significant variation in annual runoff and substantial differences between the maximum and minimum values within each year, which challenges the model's learning capability. Additional explanatory variables (e.g., rainfall or river flow at an upstream site) would likely enhance prediction capacity, but their use is outside the scope of this work.
Table 1

Statistical characterization of runoff data (runoff in m³/s)

Station | Min | Max | Mean | Std | Date range
St. Louis | 1464.0 | 29166.3 | 7076.7 | 4329.1 | 1983/10/1–2023/09/30
Chester | 1407.3 | 28316.8 | 6809.0 | 4198.5 | 1983/10/1–2023/09/30
Thebes | 1166.7 | 29732.6 | 6509.5 | 4033.8 | 1983/10/1–2023/09/30
Figure 5

Runoff process of St. Louis, Chester, and Thebes.


Prediction models

The models utilized in this paper are as follows:

  • Benchmark: BM.

  • Statistical and ML methods: ARIMA, SVR, KNN, XGBoost.

  • Recurrent neural network: RNN, LSTM, BiLSTM, GRU.

  • Attention-based models: Transformer, Informer.

  • Mamba-based models: Mamba, LightMamba.

In the benchmark model (abbreviated BM), the prediction is taken to be the current state (Dimitriadis et al. 2016), i.e., $\hat{y}_{t+i} = x_t$ for $i = 1, \ldots, T$. The Mamba model is devoid of both MPM and partial normalization, whereas LightMamba is our proposed model.
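As a minimal sketch (names are illustrative), the BM persistence forecast can be written as:

```python
import numpy as np

def persistence_forecast(history: np.ndarray, horizon: int = 7) -> np.ndarray:
    """Benchmark (BM): repeat the most recent observation for every lead time,
    i.e. y_hat_{t+i} = x_t for i = 1, ..., horizon."""
    return np.repeat(history[-1], horizon)
```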

Experiment settings

The data are divided into training, validation, and testing sets in a 7:1:2 ratio along the time axis. Normalization is fitted on the training set, and the validation and test sets are normalized using the scale parameters of the training set. After the model makes a prediction, the results are denormalized using the same scale to avoid information leakage (Li et al. 2023). The selected model is the best performer on the validation set, which helps ensure generalization. All models predict the daily runoff for the next 7 days using runoff data from the past 30 days, i.e., $L = 30$ and $T = 7$ in this study. The parameter settings for each model are shown in Table 2, in which the deep learning models share some hyperparameters. All models were trained three times, and the average of the evaluation metrics was taken as the final result.
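A brief sketch of the chronological split and training-set-only scaling described above (ratios and names as assumed here):

```python
import numpy as np

def chronological_split_and_scale(series: np.ndarray, ratios=(0.7, 0.1, 0.2)):
    """Split a series into train/validation/test along the time axis and standardize
    all three parts with statistics computed on the training set only, so that no
    information from the future leaks into model fitting."""
    n = len(series)
    n_train, n_val = int(ratios[0] * n), int(ratios[1] * n)
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:]
    mean, std = train.mean(), train.std()

    def scale(s):
        return (s - mean) / std

    return scale(train), scale(val), scale(test), (mean, std)
```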

Table 2

Parameter settings for each model

Model | Parameter
ARIMA | (p, d, q)
KNN | Number of neighbors
SVR | Kernel type is linear
XGBoost | Number of estimators 100, max depth 6
RNN, LSTM, BiLSTM, GRU, Transformer, Informer, Mamba, LightMamba | Hidden layer size 64, layers of RNNs 1 (BiLSTM 2), attention heads of transformers 4, dropout rate 0.1, loss MSE, optimizer AdamW, learning rate 0.001, learning rate decay factor 0.9, maximum training period 30, early-stop patience 5

Result and analysis

We used all of the above models to make predictions on the testing set and evaluated them. Figure 6 shows the observed data for St. Louis and the predictions of each model over the last year. Scatter plots of the various model predictions are shown in Figure 7. The MAE, RMSE, and NSE metrics of each model for the St. Louis station at different prediction steps are given in Tables 3–5 (results for Chester and Thebes are in the Supplementary material). The models are compared using the average of three evaluations. The best results are highlighted in bold, while the second-best results are underlined. Additionally, Figure 8 displays the performance of the deep learning models, measured by average RMSE, with the number of parameters on the horizontal axis.
Table 3

MAE of different methods for St. Louis station runoff prediction

Model | Avg | 1 day | 2 days | 3 days | 4 days | 5 days | 6 days | 7 days
BM | 989.48 | 309.81 | 588.83 | 831.63 | 1042.27 | 1227.30 | 1390.89 | 1535.60
ARIMA | 904.27 | 205.73 | 465.42 | 716.13 | 944.48 | 1149.06 | 1338.48 | 1510.59
KNN | 1588.89 | 830.06 | 1154.43 | 1442.37 | 1689.35 | 1869.59 | 2013.46 | 2122.97
SVR | 1432.19 | 524.75 | 946.36 | 1286.92 | 1556.96 | 1761.53 | 1914.18 | 2034.62
XGBoost | 1331.62 | 252.50 | 752.27 | 1160.16 | 1488.43 | 1725.83 | 1903.24 | 2038.93
RNN | 902.94 | 257.29 | 494.71 | 729.26 | 950.20 | 1140.52 | 1302.69 | 1445.94
LSTM | 912.93 | 248.57 | 502.46 | 744.87 | 956.86 | 1154.15 | 1315.92 | 1467.72
BiLSTM | 909.29 | 243.53 | 499.24 | 743.94 | 958.58 | 1150.75 | 1314.11 | 1454.90
GRU | 889.00 | 253.46 | 484.74 | 725.03 | 930.91 | 1117.99 | 1282.23 | 1428.62
Transformer | 970.09 | 430.52 | 625.63 | 819.04 | 1000.91 | 1165.90 | 1314.67 | 1433.97
Informer | 959.05 | 437.69 | 602.01 | 803.45 | 989.61 | 1166.47 | 1291.52 | 1422.62
Mamba | 856.00 | 235.21 | 455.78 | 681.93 | 892.59 | 1081.13 | 1248.17 | 1397.16
LightMamba | 833.75 | 191.18 | 422.63 | 661.40 | 877.14 | 1068.54 | 1234.47 | 1380.87
Table 4

RMSE of different methods for St. Louis station runoff prediction

Model | Avg | 1 day | 2 days | 3 days | 4 days | 5 days | 6 days | 7 days
BM | 1503.74 | 493.46 | 925.91 | 1290.98 | 1600.19 | 1862.63 | 2083.94 | 2269.04
ARIMA | 1425.46 | 345.57 | 757.95 | 1145.84 | 1496.41 | 1811.72 | 2090.19 | 2330.54
KNN | 2318.50 | 1250.13 | 1726.09 | 2146.32 | 2476.32 | 2712.56 | 2888.03 | 3030.04
SVR | 2125.61 | 867.01 | 1507.39 | 1982.88 | 2321.32 | 2560.62 | 2741.91 | 2898.16
XGBoost | 1942.27 | 441.84 | 1171.33 | 1758.76 | 2187.99 | 2481.73 | 2692.83 | 2861.39
RNN | 1384.58 | 452.50 | 803.42 | 1143.27 | 1455.57 | 1733.11 | 1957.58 | 2146.62
LSTM | 1387.60 | 434.77 | 801.26 | 1149.31 | 1458.54 | 1739.39 | 1969.05 | 2160.88
BiLSTM | 1380.70 | 406.99 | 790.73 | 1148.12 | 1465.74 | 1739.40 | 1965.51 | 2148.44
GRU | 1371.60 | 429.09 | 778.21 | 1137.82 | 1443.73 | 1719.66 | 1949.90 | 2142.78
Transformer | 1515.09 | 682.19 | 995.55 | 1309.02 | 1589.43 | 1824.72 | 2019.52 | 2185.17
Informer | 1497.22 | 696.35 | 944.21 | 1287.12 | 1563.05 | 1814.74 | 1998.11 | 2177.00
Mamba | 1320.39 | 374.28 | 727.13 | 1076.42 | 1394.06 | 1669.89 | 1904.70 | 2096.26
LightMamba | 1286.01 | 308.04 | 680.41 | 1040.57 | 1363.11 | 1640.26 | 1883.45 | 2086.22
Table 5

NSE of different methods for St. Louis station runoff prediction

Model | Avg | 1 day | 2 days | 3 days | 4 days | 5 days | 6 days | 7 days
BM | 0.875 | 0.988 | 0.959 | 0.920 | 0.878 | 0.834 | 0.793 | 0.754
ARIMA | 0.882 | 0.994 | 0.973 | 0.937 | 0.893 | 0.843 | 0.791 | 0.740
KNN | 0.726 | 0.925 | 0.858 | 0.780 | 0.707 | 0.649 | 0.602 | 0.562
SVR | 0.762 | 0.964 | 0.891 | 0.812 | 0.743 | 0.687 | 0.641 | 0.599
XGBoost | 0.788 | 0.991 | 0.934 | 0.852 | 0.771 | 0.706 | 0.654 | 0.609
RNN | 0.893 | 0.990 | 0.969 | 0.937 | 0.899 | 0.856 | 0.817 | 0.780
LSTM | 0.892 | 0.991 | 0.969 | 0.937 | 0.898 | 0.855 | 0.815 | 0.777
BiLSTM | 0.892 | 0.992 | 0.970 | 0.937 | 0.897 | 0.855 | 0.815 | 0.780
GRU | 0.894 | 0.991 | 0.971 | 0.938 | 0.900 | 0.859 | 0.818 | 0.781
Transformer | 0.878 | 0.978 | 0.953 | 0.918 | 0.879 | 0.841 | 0.805 | 0.772
Informer | 0.881 | 0.977 | 0.957 | 0.921 | 0.883 | 0.843 | 0.809 | 0.774
Mamba | 0.900 | 0.993 | 0.975 | 0.945 | 0.907 | 0.867 | 0.827 | 0.790
LightMamba | 0.904 | 0.995 | 0.978 | 0.948 | 0.911 | 0.871 | 0.830 | 0.792
Figure 6

Predictions of the St. Louis in the last year.

Figure 7

Scatter plots of different methods on St. Louis station.

Figure 8

Model performance and number of parameters.


From Tables 3–5, we can see that the ARIMA model has excellent forecasting performance for 1-day-ahead prediction and also performs better over the whole forecasting horizon than the listed machine learning methods. This indicates that the ARIMA model can capture short-term linear relationships by considering the historical behavior and trend of the series, and its predictions are more robust with respect to the long-term trend (Fard & Akbari-Zadeh 2014). However, the ARIMA model performs worse at longer prediction steps. ARIMA is less suitable for runoff prediction because it cannot handle time series with complex patterns or external factors influencing the series. Additionally, selecting an appropriate ARIMA model requires complex data preprocessing, making it difficult to apply in multivariate forecasting and practical applications.

The predictive power of machine learning methods in this study is inadequate. These models lack sequential structure and only learn the relationship between input features and outputs, without the ability to infer temporal order. Although XGBoost achieved an impressive accuracy of 0.991 NSE on the first day, its prediction error increased significantly with each subsequent step. As a result, the model’s overall performance was poor, with a score of only 0.788 NSE for the entire prediction period. This suggests that machine learning methods struggle to capture the temporal evolution of runoff sequences.

Recurrent neural networks encode a priori information about sequence order and therefore predict time series more effectively. The GRU exhibits the best performance among the four recurrent neural networks in terms of MAE, RMSE, and NSE, with values of 889.00, 1371.60, and 0.894, respectively, at the St. Louis station, and it has 25.5K parameters, fewer than the other recurrent models. This verifies the effectiveness of the GRU in runoff prediction; owing to its superior prediction capacity and small parameter count, the GRU is widely used in runoff prediction studies. However, the recursive structure of RNNs limits their ability to be trained in parallel, and this inefficiency becomes more pronounced when the input sequence is long, requiring additional computational resources for training.

However, our results show that the transformer-based models did not achieve the expected accuracy. We attribute this to their low inductive bias and their encoder–decoder structure, which results in a large number of parameters: in this experiment, the Transformer and Informer have 93.5K and 106.0K parameters, respectively. Although a larger parameter count raises a model's learning capacity, the model must be trained adequately on a substantial amount of data, so this type of model is better suited to large-scale datasets (Wen et al. 2022). In contrast, the runoff dataset here is relatively small, making it difficult to realize the potential of transformer models.

The Mamba-based models demonstrate superior performance compared with the other models, particularly at longer prediction steps and in overall performance. The MAE, RMSE, and NSE of Mamba are 856.00, 1320.39, and 0.900 at the St. Louis station, respectively. With only 16.9K parameters, Mamba exceeds the prediction performance of GRU while using fewer parameters. The results demonstrate that Mamba-based models are capable of performing well in the runoff prediction task. The performance of LightMamba is particularly noteworthy, with MAE, RMSE, and NSE values of 833.75, 1286.01, and 0.904, respectively, indicating higher prediction accuracy than Mamba. This is due to the incorporation of partial normalization and MPM into LightMamba. Partial normalization keeps the output of the model within the same range as the inputs, which improves prediction accuracy to a certain extent for non-stationary time series such as runoff. Concurrently, the MPM algorithm, analogous to the multi-head attention mechanism, enables Mamba to model sequence relationships in separate subspaces, enhancing the model's representation capability. Furthermore, MPM effectively reduces the model parameters: LightMamba has only 10.2K parameters, about 60% of Mamba's.

The average performance of each model across the three stations is summarized in Table 6. The best results are highlighted in bold, while the second-best results are underlined. The results indicate that the transformer-based models exhibit lower overall performance than the RNNs on this limited dataset. In contrast, the Mamba-based models achieve the best combined performance across the three stations, with higher predictive accuracy while maintaining a lower number of parameters. In particular, in terms of average RMSE across the three stations, our proposed LightMamba achieves reductions of 14.71, 49.14, 7.67, and 16.16% relative to BM, XGBoost, GRU, and Informer, respectively.

Table 6

Results of the seven-step prediction

Model | St. Louis: MAE | RMSE | MAPE | NSE | Chester: MAE | RMSE | MAPE | NSE | Thebes: MAE | RMSE | MAPE | NSE
BM | 989.48 | 1503.74 | 0.121 | 0.875 | 956.70 | 1445.57 | 0.123 | 0.876 | 893.05 | 1355.40 | 0.121 | 0.883
ARIMA | 904.27 | 1425.46 | 0.113 | 0.882 | 882.35 | 1376.12 | 0.115 | 0.882 | 846.52 | 1325.55 | 0.116 | 0.884
KNN | 1588.89 | 2318.50 | 0.196 | 0.726 | 1568.84 | 2278.42 | 0.201 | 0.717 | 1476.23 | 2144.91 | 0.197 | 0.729
SVR | 1432.19 | 2125.61 | 0.174 | 0.762 | 1373.94 | 2023.02 | 0.175 | 0.768 | 1265.08 | 1859.82 | 0.169 | 0.788
XGBoost | 1331.62 | 1942.27 | 0.165 | 0.788 | 1287.96 | 1875.96 | 0.167 | 0.788 | 1209.52 | 1777.83 | 0.164 | 0.798
RNN | 902.94 | 1384.58 | 0.118 | 0.893 | 877.93 | 1343.99 | 0.120 | 0.891 | 854.76 | 1324.41 | 0.121 | 0.888
LSTM | 912.93 | 1387.60 | 0.118 | 0.892 | 885.94 | 1355.29 | 0.119 | 0.889 | 846.58 | 1284.60 | 0.122 | 0.893
BiLSTM | 909.29 | 1380.70 | 0.118 | 0.892 | 893.07 | 1364.69 | 0.121 | 0.888 | 849.86 | 1304.55 | 0.120 | 0.890
GRU | 889.00 | 1371.60 | 0.114 | 0.894 | 875.30 | 1365.61 | 0.115 | 0.889 | 841.58 | 1301.76 | 0.117 | 0.891
Transformer | 970.09 | 1515.09 | 0.124 | 0.878 | 947.02 | 1458.48 | 0.126 | 0.878 | 904.76 | 1401.88 | 0.127 | 0.879
Informer | 959.05 | 1497.22 | 0.121 | 0.881 | 945.28 | 1459.25 | 0.125 | 0.878 | 903.28 | 1401.45 | 0.125 | 0.879
Mamba | 856.00 | 1320.39 | 0.111 | 0.900 | 828.69 | 1274.02 | 0.114 | 0.902 | 818.82 | 1241.25 | 0.119 | 0.900
LightMamba | 833.75 | 1286.01 | 0.106 | 0.904 | 799.07 | 1229.48 | 0.107 | 0.907 | 809.62 | 1236.29 | 0.113 | 0.900

This paper examines the potential of the Mamba model for runoff prediction. Compared to the statistical model, machine learning methods, recurrent neural networks, and attention-based models, the Mamba model demonstrates superior overall performance. The Mamba model is adept at extracting nonlinear and long-term dependencies in runoff data, and its prediction accuracy is enhanced while the number of parameters is reduced. The scanning algorithm in the Mamba block enables the model to be trained in parallel, resulting in a faster training speed and lower GPU memory usage than other models. This demonstrates the great potential of Mamba as a backbone model for runoff prediction tasks.

In this paper, we also propose a deep learning model for runoff prediction: LightMamba. It utilizes partial normalization and an MPM module, which improves the accuracy of Mamba on runoff prediction tasks and reduces the number of parameters. We validate LightMamba on runoff data from three stations, and the results demonstrate that LightMamba can effectively capture the temporal dependence in runoff data and outperforms previous methods.

However, in this study, only historical runoff data were utilized to predict future runoff, which is a univariate forecasting problem. Mamba is also adept at extracting synergistic relationships between multiple variables. Therefore, other variables related to runoff (e.g., rainfall, temperature, potential evapotranspiration, vegetation cover) can be subsequently collected and utilized. In addition, various decomposition methods are frequently employed in runoff prediction tasks, as they can significantly improve the prediction accuracy. Mamba may be considered as a backbone model, combined with data decomposition methods to improve the precision of runoff prediction, which is also our future research direction.

This work was supported by the National Natural Science Foundation of China (Grant No. 42130113). The numerical calculations in this paper were supported by the Supercomputing Center of Lanzhou University.

J.D. conceived and designed the study; J.D., H.D., and C.S. conducted the workshops; J.D. and H.D. conducted the survey; J.D., H.D., and C.S. performed the inspections; J.D. and H.D. analyzed the data; J.D., H.D., and L.W. wrote the draft paper. All authors have read and agreed to the published version of the manuscript.

All relevant data are available from an online repository: https://grdc.bafg.de/.

The authors declare there is no conflict.

Arjovsky M., Shah A. & Bengio Y. (2016) 'Unitary evolution recurrent neural networks', International Conference on Machine Learning. PMLR, pp. 1120–1128.

Bai Y., Wagener T. & Reed P. (2009) A top-down framework for watershed model evaluation and selection under uncertainty, Environmental Modelling & Software, 24 (8), 901–916.

Bigdeli Z., Majnooni-Heris A., Delirhasannia R. & Karimi S. (2023) Application of support vector machine and boosted tree algorithm for rainfall-runoff modeling (Case study: Tabriz plain), Environment and Water Engineering, 9 (4), 532–547.

Dimitriadis P., Koutsoyiannis D. & Tzouka K. (2016) Predictability in dice motion: how does it differ from hydro-meteorological processes?, Hydrological Sciences Journal, 61 (9), 1611–1622.

Du N., Liang X., Wang C. & Jia L. (2022) 'Multi-station joint long-term water level prediction model of Hongze Lake based on RF-informer', 2022 3rd International Conference on Information Science, Parallel and Distributed Systems (ISPDS). IEEE, pp. 25–30.

Dwivedi D., Mital U., Faybishenko B., Dafflon B., Varadharajan C., Agarwal D., Williams K. H., Steefel C. I. & Hubbard S. S. (2022) Imputation of contiguous gaps and extremes of subhourly groundwater time series using random forests, Journal of Machine Learning for Modeling and Computing, 3 (2), 1–22.

Fang J., Yang L., Wen X., Li W., Yu H. & Zhou T. (2024) A deep learning-based hybrid approach for multi-time-ahead streamflow prediction in an arid region of Northwest China, Hydrology Research, 55 (2), 180–204.

Fard A. K. & Akbari-Zadeh M.-R. (2014) A hybrid method based on wavelet, ANN and ARIMA model for short-term load forecasting, Journal of Experimental & Theoretical Artificial Intelligence, 26 (2), 167–182.

Farmer D., Sivapalan M. & Jothityangkoon C. (2003) Climate, soil, and vegetation controls upon the variability of water balance in temperate and semiarid landscapes: Downward approach to water balance analysis, Water Resources Research, 39 (2), 1035. doi:10.1029/2001WR000328.

Fenicia F., McDonnell J. J. & Savenije H. H. (2008) Learning from model improvement: On the contribution of complementary data to process understanding, Water Resources Research, 44 (6), W06419. doi:10.1029/2007WR006386.

Gu A. & Dao T. (2023) Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv preprint arXiv:2312.00752.

Gu A., Dao T., Ermon S., Rudra A. & Ré C. (2020) HiPPO: Recurrent memory with optimal polynomial projections, Advances in Neural Information Processing Systems, 33, 1474–1487.

Gu A., Goel K. & Ré C. (2021) Efficiently Modeling Long Sequences with Structured State Spaces. arXiv preprint arXiv:2111.00396.

Kumar M., Elbeltagi A., Pande C. B., Ahmed A. N., Chow M. F., Pham Q. B., Kumari A. & Kumar D. (2022) Applications of data-driven models for daily discharge estimation based on different input combinations, Water Resources Management, 36 (7), 2201–2221.

Liu Y., Wu H., Wang J. & Long M. (2022) Non-stationary transformers: Exploring the stationarity in time series forecasting, Advances in Neural Information Processing Systems, 35, 9881–9893.

Liu Z., Zhou J., Yang X., Zhao Z. & Lv Y. (2024) Research on water resource modeling based on machine learning technologies, Water, 16 (3), 472.

Man Y., Yang Q., Shao J., Wang G., Bai L. & Xue Y. (2023) Enhanced LSTM model for daily runoff prediction in the upper Huai River basin, China, Engineering, 24, 229–238.

Mosavi A., Ozturk P. & Chau K.-w. (2018) Flood prediction using machine learning models: literature review, Water, 10 (11), 1536.

Munoz S. E., Giosan L., Therrell M. D., Remo J. W., Shen Z., Sullivan R. M., Wiman C., O'Donnell M. & Donnelly J. P. (2018) Climatic control of Mississippi River flood hazard amplified by river engineering, Nature, 556 (7699), 95–98.

Neyshabur B. (2020) Towards learning convolutions from scratch, Advances in Neural Information Processing Systems, 33, 8078–8088.

Nosouhian S., Nosouhian F. & Khoshouei A. K. (2021) A review of recurrent neural network architecture for sequence learning: comparison between LSTM and GRU.

Nourani V., Gökçekuş H. & Gichamo T. (2021) Ensemble data-driven rainfall-runoff modeling using multi-source satellite and gauge rainfall data input fusion, Earth Science Informatics, 14 (4), 1787–1808.

Sharafati A., Khazaei M. R., Nashwan M. S., Al-Ansari N., Yaseen Z. M. & Shahid S. (2020) Assessing the uncertainty associated with flood features due to variability of rainfall and hydrological parameters, Advances in Civil Engineering, 2020, 1–9.

Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł. & Polosukhin I. (2017) Attention is all you need, Advances in Neural Information Processing Systems, 30, 5998–6008.

Wen Q., Zhou T., Zhang C., Chen W., Ma Z., Yan J. & Sun L. (2022) Transformers in Time Series: A Survey. arXiv preprint arXiv:2202.07125.

Williams B. S., Das A., Johnston P., Luo B. & Lindenschmidt K.-E. (2021) Measuring the skill of an operational ice jam flood forecasting system, International Journal of Disaster Risk Reduction, 52, 102001.

Xiang Z., Yan J. & Demir I. (2020) A rainfall-runoff model with LSTM-based sequence-to-sequence learning, Water Resources Research, 56 (1), e2019WR025326.

Yang M., Wang H., Jiang Y., Lu X., Xu Z. & Sun G. (2020) GECA proposed ensemble-KNN method for improved monthly runoff forecasting, Water Resources Management, 34 (2), 849–863.

Yin H., Guo Z., Zhang X., Chen J. & Zhang Y. (2022) RR-former: rainfall-runoff modeling based on transformer, Journal of Hydrology, 609, 127781.

Zhou H., Zhang S., Peng J., Zhang S., Li J., Xiong H. & Zhang W. (2021) 'Informer: beyond efficient transformer for long sequence time-series forecasting', Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 11106–11115.

Supplementary data