Streamflow prediction offers crucial information for managing water resources, flood control, and hydropower generation. Yet, reliable streamflow prediction is challenging due to the complexity and nonlinearity of the rainfall-runoff relationship. This study investigated the comparative performance of a newly integrated self-attention-based deep learning (DL) model, SA-Conv1D-BiGRU, against Conv1D-LSTM and bidirectional long short-term memory (Bi-LSTM) models for streamflow prediction under different time-series conditions and a range of variable input combinations based on flood events. All datasets passed quality control procedures, and the time lags for generating input series were established through Pearson correlation analysis. Of the data, 80% was used for training and 20% to evaluate model performance. The models were evaluated using four metrics: mean absolute error (MAE), root mean square error (RMSE), Nash–Sutcliffe efficiency (NSE), and coefficient of determination (R²). The findings reveal the excellent potential of DL models for streamflow prediction, with the SA-Conv1D-BiGRU model outperforming the other models across different time-series characteristics. Despite their added complexity, the Conv1D-LSTM models did not outperform the Bi-LSTM model. In conclusion, the results are condensed into themes of model variability and time-series characteristics: differences in DL model architecture had a greater influence on streamflow prediction accuracy than input time lags and time-series features.

  • Deep learning (DL) algorithms are compared for streamflow prediction across various time-series traits.

  • Bi-LSTM outperforms Conv1D-LSTM in capturing time-series characteristics.

  • Hybrid self-attention improves spatial and temporal feature identification in time series.

Streamflow prediction offers crucial information for managing water resource systems, flood control, and hydropower generation. However, reliable streamflow prediction is challenging due to the high nonlinearity and spatiotemporal variability of hydrological processes (Apaydin et al. 2020). Streamflow prediction techniques employ various statistical, mathematical, and computational approaches (Gupta & Nearing 1969). The choice of technique is influenced by the available data, system complexity, and the specific requirements of the application (Wegayehu & Muluneh 2022). Although traditional statistical models and lumped hydrological models are effective in managing the temporal fluctuations observed in precipitation and flow time series, they struggle to accurately depict the spatial variations inherent in these phenomena (Niu et al. 2019). Hence, researchers seek models that simulate streamflow both accurately and with little effort. Given these limitations of physical models, data-driven models present an attractive alternative (Ji et al. 2012).

Data-driven models rely on the statistical relationship between input and output data. These models are further classified as linear and nonlinear models. Autoregressive moving average (ARMA), multiple linear regression (MLR), and autoregressive integrated moving average (ARIMA) are the most common linear methods (Mosavi et al. 2018), and the most common nonlinear data-driven models are machine learning (ML) models. The major drawback of the linear models is that they cannot handle the system's nonlinearity (Apaydin et al. 2020). Advanced data-driven models, like ML, are increasingly used due to the shortcomings of the physical-based and linear models discussed above. ML, a subfield of artificial intelligence (AI), is now the most widely used approach in hydrology. It involves using computational power to extract insights from data by iteratively learning relationships from datasets (Salehinejad et al. 2017).

Artificial neural networks (ANNs), one of the popular data-driven models, have been widely applied in hydrological modeling for their strong nonlinear fitting ability (Ji et al. 2012). A significant limitation of traditional ANNs is that they are typically restricted to one or two hidden layers, which can limit their ability to model complex relationships effectively. To successfully address complicated issues, deep learning (DL) networks such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have recently been enhanced with multi-layered architectures, and time-series data, like streamflow, can be effectively modeled using DL algorithms (Wegayehu & Muluneh 2022).

An RNN is a neural network specialized for processing sequences of data, swiftly adjusting to temporal dynamics through prior time-step information (Apaydin et al. 2020). Nevertheless, RNNs struggle to capture prolonged dependencies and are prone to vanishing and exploding gradients. To overcome this deficiency, Hochreiter & Schmidhuber (1997) proposed the long short-term memory (LSTM) network for learning long-term dependencies. Recently, LSTM models have been explored and researched in watershed hydrological modeling, and their capabilities have been showcased in various applications, including river flow forecasting and flood prediction (Wegayehu & Muluneh 2022). Kratzert et al. (2018) utilized the LSTM network for daily flow prediction and discovered that it significantly outperformed hydrological models calibrated at both the regional and individual basin levels. Hu et al. (2018) evaluated the LSTM model using 98 flood events and concluded that it surpassed conceptual and physical models in terms of performance. Studies have shown that LSTM exhibits impressive performance in streamflow prediction when compared with other sophisticated multi-layered techniques. More recently, gated recurrent units (GRUs) were introduced as an alternative to LSTM. GRUs are similar to LSTMs but have fewer parameters: they merge the forget and input gates into a single update gate and do not include an output gate (Apaydin et al. 2020). GRUs have demonstrated proficiency in time-series modeling and natural language processing akin to LSTM. Nonetheless, there is an ongoing discussion regarding the comparative effectiveness of these two architectures in simulating streamflow and reservoir inflow, an area that has not been extensively explored across different timescales and environments.

CNNs have a superior capability in capturing spatial data features and have played a key role in the latest advancements in DL (Wegayehu & Muluneh 2022). Van et al. (2020) developed a novel 1D CNN with a ReLU activation function for rainfall-runoff modeling. Recently, integrated DL approaches have received more attention in hydrological modeling. Specifically, the 1D CNN component is used to capture temporal patterns and spatial dependencies in the time-series data and is effective in extracting local features from the input. The LSTM component is then used to model the temporal and long-range dependencies present in the streamflow data. By combining the capabilities of the 1D CNN and LSTM in an integrated model, the system can effectively leverage both the spatial features extracted by the 1D CNN and the temporal dependencies captured by the LSTM for improved streamflow prediction performance. Furthermore, combining a CNN with a GRU can enhance data preprocessing robustness, offering a promising avenue to enhance the model's precision (Wegayehu & Muluneh 2022). Li & Xu (2022) employed both single-variable and multi-variable time-series data in LSTM and CNN-LSTM models; their results indicated that, when forecasting particulate matter (PM2.5) concentration for air quality analysis, the multi-variable CNN-LSTM model they proposed exhibited superior performance with minimal error. The fusion of CNN and LSTM models enhances time-series prediction by enabling the LSTM model to effectively capture extended sequences of pattern information, while CNN models excel in filtering noise from the input data and extracting crucial features, potentially boosting the prediction model's accuracy (Livieris et al. 2020). Moreover, Wegayehu & Muluneh (2021) used hybrid CNN-LSTM and CNN-GRU models for multivariate streamflow prediction in different climatic regions, and the models effectively captured the complex temporal dependencies and patterns present in the streamflow data. Their findings indicated that the CNN-GRU model surpassed both the CNN-LSTM and the traditional LSTM and GRU models.

Most recently, the self-attention mechanism has been applied to allow models to weigh the importance of different time steps in the input sequence. Forghanparast (2022) compared the effectiveness of three DL algorithms (CNN, LSTM, and self-attention LSTM (SA-LSTM) models) with a baseline extreme learning machine (ELM) model for forecasting monthly streamflow in the upper reaches of the Texas Colorado River; the SA-LSTM model offered higher accuracy and better stability. Zhou et al. (2023) proposed a new hybrid DL model for hourly streamflow prediction, SA-CNN-LSTM, and compared it with LSTM, CNN, ANN, random forest (RF), SA-LSTM, and SA-CNN models. Their findings revealed that the SA-CNN-LSTM model exhibited strong predictive capabilities across varying flood intensities and lead times, emphasizing the value of capturing temporal and feature interdependencies in runoff forecasting. Despite the variations in their performance, choosing suitable time-series models from a range of established DL network architectures poses a challenge, and further research is needed to achieve predictions with enhanced accuracy, quicker processing times, and simpler model structures. To the best of our knowledge, there is limited literature exploring the performance differences among various hybrid DL models for streamflow prediction under different input variability conditions. Hence, a comparative assessment of diverse network architectures can assist in identifying the most optimized solution for time-series analysis.

In this study, the SA-Conv1D-BiGRU hybrid streamflow prediction model is introduced to capture the interdependencies between time steps and features within the streamflow series. By considering these relationships, the model aims to enhance the accuracy and performance of monthly streamflow predictions. Thus, as the main aim, we compared the newly integrated self-attention-based DL model, SA-Conv1D-BiGRU, with other integrated and standalone models, Conv1D-LSTM and bidirectional LSTM (Bi-LSTM), for streamflow prediction under different time-series characteristics.

Study area description

In the present study, two river sub-catchments were selected having different climatic regions: the Kulfo River catchment in southern Ethiopia and the Woybo River catchment of the Omo River Basin in Ethiopia (Figure 1).
  • (A) The study was carried out in the Kulfo River Watershed, which is located in the Abaya-Chamo sub-basin of the Southern Ethiopian Rift Valley. This watershed flows into Lake Chamo and is positioned between latitudes 5°55′ N and 6°15′ N, and longitudes 37°18′ E and 37°36′ E (Figure 1). The elevation in the area ranges from 1,208 to 3,547 m above sea level (masl), and the watershed covers a total area of about 384.56 km². The annual rainfall in the catchment area varies from 620 to 1,250 mm, and the mean annual temperature ranges from 14 to 23 °C.

  • (B) The Woybo River is one of the tributaries of the Omo River Basin flowing in the southwest of Ethiopia. It flows into the Omo Gibe River and is situated between latitudes 6°40′ N and 7°10′ N and longitudes 37°30′ E and 38°00′ E, as shown in Figure 1. The study area has a tropical climate regime with a watershed area of 533.65 km². Precipitation in the watershed has strong seasonal and elevation variability. The rainy season spans from April to October, with July and August being the wettest months. Rainfall distribution is mainly influenced by the movement of the Intertropical Convergence Zone from south to north. The average maximum and minimum temperatures range from 19.67 to 21.83 °C and 16.19 to 18.71 °C, respectively. The sub-basin receives an average annual rainfall of 1,377.74 mm, indicating significant spatial and temporal variability in precipitation. The drainage network of the Woybo River watershed was derived from the digital elevation model (DEM). The watershed, classified as third order with a drainage density of 0.45 km/km², consists of 23 streams totaling 148.9 km in length. The longest flow path in the watershed is 1,944 km, and the bifurcation ratio is 0.96 (Ukumo et al. 2022).

Figure 1

Description of the study area.

Data collection and preprocessing

The DEM data for the Kulfo watershed, with a 30 × 30 m resolution, was obtained from the United States Geological Survey (USGS) database. Meteorological data, including daily precipitation and maximum and minimum temperatures, were gathered from the Ethiopian Meteorology Institute (EMI). The quality and completeness of the data play a crucial role in influencing the outcomes of data analysis (Mathewos et al. 2024). In this study, the multiple imputation method and ML techniques were used comparatively to impute the missing data, given their ability to capture complex relationships and patterns. Rainfall records were examined for consistency using double mass curve analysis, and the nondimensional parametrization method was used to verify the homogeneity of the rainfall data. Long-term daily flow records (1991–2013 for the Kulfo station and 1997–2018 for the Woybo station) were obtained from the Ministry of Water and Energy (MoWE).

Rain gauges have inherent limitations in capturing the spatial distribution of rainfall due to their point-sampling nature. When conducting hydrological analyses over large areas, it therefore becomes crucial to estimate average rainfall depths over sub-watershed areas to gain a more comprehensive understanding of the overall rainfall patterns and their impact on the hydrological system (Chow et al. 1988). In this study, the Thiessen polygon method was employed to calculate the spatial distribution of rainfall across the catchments with the help of ArcGIS 10.3 tools. Selecting input variables for various model architectures poses a challenge for researchers: while rainfall, evaporation, and temperature are essential factors for streamflow modeling, data availability and research objectives constrain the range of choices (Apaydin et al. 2020). Van et al. (2020) reported that adding temperature and evapotranspiration input nodes to the model increases network complexity and causes overfitting. Various combinations of daily and monthly rainfall and discharge, with different lag times, were evaluated to identify the optimal input combination for the DL streamflow prediction models. Temperature data was excluded from the analysis due to its minimal correlation with discharge. We utilized linear correlation statistics, such as Pearson's correlation coefficient, to assess the interdependence between variables (Wegayehu & Muluneh 2022; Tables 1 and 2). Furthermore, our correlation analysis between the independent variables (monthly rainfall and discharge) demonstrated significant correlations at different lag times (Figures 4 and 5). Hence, the DL models were designed comparatively with different input scenarios.
Table 1

Descriptive statistics of time-series data for the Kulfo watershed

Data type | Pearson correlation with streamflow | Skewness | Mean | Min | Max | SD | CV
Streamflow | 1.00 | 1.69 | 10.75 | 0.00 | 50.73 | 5.43 | 0.61
Rainfall | 0.68 | 1.90 | 11.25 | 0.00 | 56.87 | 6.34 | 0.67

CV, coefficient of variation; SD, standard deviation.

Table 2

Descriptive statistics of time-series data for the Woybo watershed

Data type | Pearson correlation with streamflow | Skewness | Mean | Min | Max | SD | CV
Streamflow (m³/s) | 1.00 | 2.81 | 9.04 | 0.00 | 120.08 | 13.98 | 1.54
RF (mm/day) | 0.38 | 2.63 | 4.19 | 0.00 | 43.48 | 4.98 | 1.19

CV, coefficient of variation; SD, standard deviation.

Figure 2

Split dataset for training and testing purposes for the Kulfo watershed.

Figure 3

Split dataset for training and testing purposes for the Woybo watershed.
Normalization of data is crucial for mitigating the impact of varying scales and units of measurement (Zeroual et al. 2016). This process entails standardizing the values of different variables to a common range, usually between 0 and 1 or −1 and 1, employing various techniques. In our case, we utilize the min–max scaler function (Equation (1)):

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$

where $x'$ is the scaled value, $x$ is the original value, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the series.

The datasets underwent thorough quality control procedures, including preprocessing steps such as data standardization, and post-processing tasks like model evaluation metric computations and visualizations, all carried out using Python. The models were also constructed and executed in Python (Zeroual et al. 2016). The pre-processed data was split chronologically into 80% for training and 20% for testing, as depicted in Figure 2: the training set is employed to train the model and conduct hyperparameter tuning, whereas the testing set is utilized to assess the final model's performance.
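As a minimal illustration of this workflow, the sketch below performs the chronological 80/20 split and the min–max scaling of Equation (1) in Python; the file name, column layout, and use of scikit-learn are our assumptions for illustration, not details taken from the original analysis.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical input file: column 0 = rainfall, column 1 = streamflow (assumed layout).
data = np.loadtxt("kulfo_monthly.csv", delimiter=",", skiprows=1)

# Chronological 80/20 split: time-series data must not be shuffled.
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Min-max scaling (Equation (1)), fitted on the training set only so that
# no information from the test period leaks into model training.
scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)
```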

Figures 4 and 5 illustrate the autocorrelation (ACF) and partial autocorrelation (PACF) between the target and antecedent streamflow. Both the ACF and PACF exhibit a gradual decrease as the time lag increases. Integrating DL models raises the critical issue of determining the optimal model structure from the available input variables and defining the number of hidden nodes; however, no universal rule exists to address this challenge, so a trial-and-error approach is commonly employed. In our study, we propose selecting input variables through input combinations/scenarios based on correlation and lag analysis. Four and three scenarios were used for model training, construction of the model structure, and determination of the models' sensitivity to the number of inputs for the Kulfo and Woybo watersheds, respectively (Tables 3 and 4).
Table 3

Model input combinations for the Kulfo watershed

Input combination | Output | Scenario model names
Qt−1, Qt−3, Qt−6 | Qt | Bi-LSTM1, Conv1D-LSTM1, SA-Conv1D-BiGRU1
Qt−1, Qt−3, Qt−6, Qt−9, Qt−11, Qt−15 | Qt | Bi-LSTM2, Conv1D-LSTM2, SA-Conv1D-BiGRU2
Rt−1, Rt, Qt−1 | Qt | Bi-LSTM3, Conv1D-LSTM3, SA-Conv1D-BiGRU3
Rt, Rt−1, Rt−2, Qt−1, Qt−2 | Qt | Bi-LSTM4, Conv1D-LSTM4, SA-Conv1D-BiGRU4
Table 4

Model input combinations for the Woybo watershed

Input combination | Output | Scenario model names
Qt−1, Qt−2, Qt−3 | Qt | Bi-LSTM1, Conv1D-LSTM1, SA-Conv1D-BiGRU1
Qt−2, Qt−1, Rt, Rt−1 | Qt | Bi-LSTM2, Conv1D-LSTM2, SA-Conv1D-BiGRU2
Rt, Rt−1, Rt−2, Qt−1, Qt−2 | Qt | Bi-LSTM3, Conv1D-LSTM3, SA-Conv1D-BiGRU3
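To make the scenario construction concrete, the sketch below assembles one such lagged input matrix (scenario 4 for the Kulfo watershed: Rt, Rt−1, Rt−2, Qt−1, Qt−2 → Qt) with pandas; the column names and file are hypothetical placeholders.

```python
import pandas as pd

def make_scenario(df, rain_lags, flow_lags):
    """Build a lagged feature matrix from columns 'R' (rainfall) and 'Q' (flow)."""
    X = pd.DataFrame(index=df.index)
    for lag in rain_lags:                 # e.g. [0, 1, 2] -> Rt, Rt-1, Rt-2
        X[f"R_t-{lag}"] = df["R"].shift(lag)
    for lag in flow_lags:                 # e.g. [1, 2] -> Qt-1, Qt-2
        X[f"Q_t-{lag}"] = df["Q"].shift(lag)
    y = df["Q"]                           # target: Qt
    keep = X.dropna().index               # drop the rows lost to shifting
    return X.loc[keep], y.loc[keep]

df = pd.read_csv("kulfo_monthly.csv", parse_dates=["date"], index_col="date")
X, y = make_scenario(df, rain_lags=[0, 1, 2], flow_lags=[1, 2])
```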
Figure 4

Pearson correlation plot for input variables for the Kulfo watershed.

Figure 5

Pearson correlation plot for input variables for the Woybo watershed.

In this case study, a variety of resources were employed. The delineation of the watershed and sub-basins was conducted using ArcGIS 10.3 software, while Python, along with the Jupyter Notebook, was used for data processing, analysis, and the development of the DL models.

DL algorithms

DL is an advanced form of ML that uses neural networks with multiple layers to enhance performance (Mosavi et al. 2018). DL has emerged as a highly promising ML technique for accurate rainfall-runoff predictions.

Bidirectional long short-term memory (Bi-LSTM)

Bi-LSTM is a variation of the LSTM architecture that considers information from both past (t − 1) and future (t + 1) time steps when determining the output at each time step (Figure 6). This enables the model to effectively capture bidirectional dependencies within the data (Wegayehu & Muluneh 2022).
Figure 6

Architecture of Bi-LSTM (Wegayehu & Muluneh 2022).
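A minimal Keras sketch of such a Bi-LSTM regressor is given below; the framework choice (TensorFlow/Keras), layer sizes, and dropout rate are our assumptions, since the paper only states that the models were built in Python.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_bilstm(n_steps, n_features):
    # The Bidirectional wrapper runs one LSTM forward and one backward in time
    # and concatenates their final hidden states.
    model = keras.Sequential([
        keras.Input(shape=(n_steps, n_features)),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dropout(0.2),      # regularization against overfitting
        layers.Dense(1),          # one-step-ahead streamflow Qt
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```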

Convolutional neural network (CNN)

The CNN is recognized as one of the most successful DL models, particularly for its effectiveness in feature extraction; its network architectures encompass the 1D CNN, 2D CNN, and 3D CNN (Wegayehu & Muluneh 2021). The structure of a CNN typically includes a convolutional layer, a pooling layer, and a fully connected layer. In a CNN, the convolution and pooling layers serve as the fundamental building blocks. These layers extract various features from the input layer and reduce their dimensions by conducting convolution operations on the input layer and consolidating the outputs of neuron clusters into a single neuron. The pooling mechanism plays a crucial role in reducing the number of parameters in the network, enhancing the efficiency, ease, and speed of the training phase of CNNs compared with traditional ANNs. The 1D CNN is primarily utilized for processing sequence data (Duan et al. 2020), the 2D CNN is usually used for text and image identification (Lin et al. 2023), and the 3D CNN is recognized for modeling medical images and video data (Duan et al. 2020). Streamflow data is one-dimensional, so a Conv1D model is used in this research. CNN models have become popular for predicting streamflow in recent years due to their speed, accuracy, and stability compared with other DL algorithms (Wegayehu & Muluneh 2021).
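For illustration, a minimal sketch of the convolution-pooling building block described above, using Keras (an assumed framework) on a hypothetical window of 12 time steps with two input variables:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Conv1D slides learned filters along the time axis to extract local features;
# pooling then halves the sequence length, reducing parameters downstream.
feature_extractor = keras.Sequential([
    keras.Input(shape=(12, 2)),              # 12 time steps, rainfall + flow
    layers.Conv1D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),                         # local features as a single vector
])
```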

Conv1D-LSTM hybrid model

This study created a hybrid model by combining CNN and LSTM layers. The CNN layer's output was used as input for the LSTM layer to capture short- and long-term dependencies. The Conv1D-LSTM model includes two parts: the first part has convolutional and pooling layers, followed by a flatten layer to prepare the data for the LSTM; the second part uses LSTM and dense layers to process the features, with dropout added to prevent overfitting (Figure 7).
Figure 7

Basic architecture of the proposed Conv1D-LSTM model.
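A hedged Keras sketch of this two-part Conv1D-LSTM pipeline follows; the layer sizes are illustrative assumptions. Note that the pooled Conv1D output is itself a (shorter) sequence, so here it feeds the LSTM directly, which is a common way to realize the pipeline of Figure 7.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_conv1d_lstm(n_steps, n_features):
    model = keras.Sequential([
        keras.Input(shape=(n_steps, n_features)),
        # Part 1: convolution + pooling extract local patterns from the window.
        layers.Conv1D(64, kernel_size=2, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        # Part 2: the LSTM models temporal dependencies in the pooled sequence,
        # followed by dense layers with dropout against overfitting.
        layers.LSTM(64),
        layers.Dropout(0.2),
        layers.Dense(16, activation="relu"),
        layers.Dense(1),                      # predicted streamflow Qt
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```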

Attention/self-attention mechanism

The attention mechanism is extensively applied in different DL tasks like natural language processing, image recognition, and speech recognition (Zhou et al. 2023). This strategy can be seen as embedding a neural network within another neural network to assign weights to different parts of a sequence based on their relative importance. Inspired by the selective attention in human vision, it sifts through a wealth of data to focus on the information crucial for the current task by assigning varying degrees of importance (Forghanparast 2022). During DL training, models may struggle to focus on crucial features or may overlook them, leading to reduced prediction accuracy. To address this issue and enhance prediction accuracy by emphasizing important features in hydrological time-series data, a self-attention mechanism unit is integrated into the latter section of the combined model. In streamflow prediction, data from the past and future surrounding the prediction time typically carry more valuable information, significantly influencing current streamflow predictions. The attention mechanism prioritizes essential features, aiding the model in making more precise assessments (Equation (2)):

$$\mathrm{Attention}(h) = \sum_{t=1}^{T} \alpha_t h_t \tag{2}$$

where $h = (h_1, \ldots, h_T)$ is the feature sequence input to the attention mechanism layer, $\alpha_t$ are the attention weights, and the attention output is the weighted sum of the sequence.
In this study, the self-attention mechanism was used to extract interdependencies between time steps and features. The self-attention mechanism is recognized as a powerful technique for enhancing DL models by focusing on and assigning attention scores (weights) to each observation, thereby improving the model's performance (Zhou et al. 2023). The SA-Conv1D-BiGRU model utilizes multiplicative self-attention, a specific type of attention mechanism. This approach involves establishing relationships between various positions within a single sequence to generate a representation of the sequence (Equations (3)–(5)):

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V \tag{3}$$

$$A = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) \tag{4}$$

$$Z = AV \tag{5}$$

where $X$ is the input sequence, $W_Q$, $W_K$, and $W_V$ are learned projection matrices, $d_k$ is the dimension of the keys, $A$ contains the attention weights, and $Z$ is the attended representation of the sequence.
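A minimal Keras implementation of Equations (3)–(5) might look as follows; this is our sketch of a standard scaled dot-product (multiplicative) self-attention layer, not the authors' exact code.

```python
import tensorflow as tf
from tensorflow.keras import layers

class MultiplicativeSelfAttention(layers.Layer):
    """Scaled dot-product self-attention over a sequence (Equations (3)-(5))."""

    def __init__(self, d_k):
        super().__init__()
        self.d_k = d_k
        self.wq = layers.Dense(d_k, use_bias=False)   # W_Q in Equation (3)
        self.wk = layers.Dense(d_k, use_bias=False)   # W_K
        self.wv = layers.Dense(d_k, use_bias=False)   # W_V

    def call(self, x):                                # x: (batch, steps, features)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        scores = tf.matmul(q, k, transpose_b=True)    # QK^T
        scores /= tf.sqrt(tf.cast(self.d_k, tf.float32))
        a = tf.nn.softmax(scores, axis=-1)            # Equation (4): attention weights
        return tf.matmul(a, v)                        # Equation (5): Z = AV
```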

Self-attention-based hybrid deep learning model (SA-Conv1D-BiGRU)

The SA-Conv1D-BiGRU model in this paper is a neural network architecture that extracts multi-level feature representations by integrating the complementary strengths of the self-attention mechanism, 1D convolutional layers (Conv1D), and BiGRU (Figure 8). The self-attention mechanism enhances the model's ability to adapt to variable input sequences, improves interpretability, and reduces overfitting in streamflow prediction; Conv1D is used to extract local features, and BiGRU is used to learn long-term dependencies. These advantages make the self-attention mechanism a valuable component for effectively modeling and predicting streamflow dynamics (Zhou et al. 2023).
Figure 8

Architecture of the proposed SA-Conv1D-BiGRU model.

The specific working principle of the proposed model is as follows (a minimal code sketch of the assembled pipeline is given after the list):

  • (1) Prepare the input dataset with features such as rainfall and streamflow data with their lagged time.

  • (2) The data extracted by Conv1D is input to the fully connected layer. The Conv1D layers in the model can capture local patterns in the streamflow data.

  • (3) Implement a BiGRU layer to capture both past and future dependencies in the combined features, and their outputs are merged. This helps the model understand the sequential nature of streamflow data and how past and future conditions can impact the current flow rate.

  • (4) The sequence self-attention calculation is performed on the output of the BiGRU layer, and different weights are assigned according to the degree of influence of the feature on the prediction result.

  • (5) Conduct hyperparameter tuning using random search methods to optimize the model's performance.

  • (6) The output of the sequence multiplicative self-attention mechanism is fully connected through the dense layer to extract nonlinear features, and then the forward and backward hidden states are connected to obtain the final output.

  • (7) Finally, compile the model using the Adam optimizer and mean squared error loss function.
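Putting steps (1)–(7) together, a hedged sketch of the full architecture is shown below, reusing the MultiplicativeSelfAttention layer sketched earlier. The pooling head and layer sizes are our assumptions, inferred from the listed steps rather than taken from the authors' code; step (5), hyperparameter tuning, happens outside the model definition (see the random-search sketch after Table 5).

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_sa_conv1d_bigru(n_steps, n_features):
    inputs = keras.Input(shape=(n_steps, n_features))        # step 1: lagged inputs
    x = layers.Conv1D(64, kernel_size=2, padding="same",
                      activation="relu")(inputs)             # step 2: local patterns
    x = layers.Bidirectional(
        layers.GRU(64, return_sequences=True))(x)            # step 3: past + future
    x = MultiplicativeSelfAttention(d_k=64)(x)               # step 4: attention weights
    x = layers.GlobalAveragePooling1D()(x)                   # collapse the sequence
    x = layers.Dense(32, activation="relu")(x)               # step 6: nonlinear features
    outputs = layers.Dense(1)(x)                             # final streamflow output
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")              # step 7: Adam + MSE
    return model
```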

DL model development in Python

The development steps for a DL model in Python can be summarized as follows: import the required libraries, preprocess data, design the model architecture, compile the model, train the model, use the model for prediction, evaluate its performance, iterate and refine, and deploy the model for real-world applications (Ergete & Geremew 2024). Achieving optimal performance in DL models necessitates making decisions on a combination of parameters and hyperparameters (Wegayehu & Muluneh 2021). Parameters are the variables that the model learns from the data during the training process, including the weights and biases of the neurons in the hidden layer(s) and the output layer. Hyperparameters, on the other hand, are set by the user before the training process begins and are not learned from the data. The following paragraphs discuss the main hyperparameters optimized.

Number of hidden units: This parameter determines the number of neurons in the hidden layers of the DL model. Increasing the number of hidden units allows the model to capture more intricate patterns in the data, but it also increases the computational complexity.

Activation function: Activation functions are crucial in neural networks as they introduce nonlinearity, control the output range, and add interpretability to the model (Mosavi et al. 2018). The selection of an activation function depends on the specific problem at hand and the desired behavior of the network. Several activation functions are commonly used in DL models; in this study, the sigmoid and tanh activation functions were used.

Sigmoid function: It converts the input to a value between 0 and 1, which makes it suitable for modeling probabilities or binary classification problems, and can be interpreted as the activation level of a neuron. Values near 0 indicate low activation, while values close to 1 indicate high activation (Equation (6)):

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{6}$$
Tanh function: The hyperbolic tangent (tanh) function transforms the input into a value between −1 and 1, enabling the effective capture of both positive and negative values in the data (Equation (7)):

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{7}$$
Softmax function: The softmax activation function is commonly used in neural networks for multiclass classification tasks. It transforms the raw outputs of the network into a vector of probabilities, creating a probability distribution across the input classes. In a scenario with N classes, the softmax activation produces an output vector with N entries, where the i-th entry represents the probability of the input being classified into class i (Equation (8)):

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}} \tag{8}$$
where z is the vector of raw outputs from the neural network and e ≈ 2.718. The i-th entry in the softmax output vector softmax(z) can be thought of as the predicted probability of the input belonging to class i.
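The three activation functions of Equations (6)–(8) can be written directly in NumPy, for example:

```python
import numpy as np

def sigmoid(x):              # Equation (6): maps inputs to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                 # Equation (7): maps inputs to (-1, 1)
    return np.tanh(x)

def softmax(z):              # Equation (8): probabilities over N classes
    e = np.exp(z - np.max(z))    # subtract the max for numerical stability
    return e / e.sum()

print(sigmoid(0.0))                                # 0.5
print(softmax(np.array([1.0, 2.0, 3.0])).sum())    # 1.0
```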

Learning rate: It governs the magnitude of the step taken by the model to update its parameters during training. Setting the learning rate too high can make training unstable, potentially leading to divergence or oscillations in the optimization process; setting it too low results in slow convergence, where the model takes a long time to reach an optimal solution. It is therefore crucial to choose a learning rate that balances fast convergence against stable training, and hyperparameter tuning and experimentation are often necessary to find the optimal value for a specific model, dataset, and architecture.

Number of epochs: This hyperparameter determines how many times the model sees the entire training dataset. It is essential to find the right balance to prevent underfitting or overfitting. Monitoring the loss, using early stopping, adjusting learning rates, considering computational resources, and hyperparameter tuning are key aspects of optimizing the number of epochs for efficient training and accurate predictions in streamflow prediction tasks.

Dropout rate: Dropout is a regularization technique commonly used in DL models to prevent overfitting. It involves randomly setting a fraction of the neurons in the hidden layers to zero during each training iteration. Tuning the dropout rate can help improve the model's generalization performance and prevent it from memorizing noise in the training data.

Batch size: The batch size determines the number of training samples that are processed together in each iteration during training. Increasing the batch size can expedite training by processing more samples at once, yet it necessitates more memory. Smaller batch sizes may allow the model to generalize better as it updates its parameters more frequently.

Optimization algorithm: The optimization algorithm controls parameter updates during training. Common optimization algorithms used in DL include stochastic gradient descent (SGD) and Adam. The choice of optimization algorithm can affect the convergence speed and final performance of the model.
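A hedged sketch of how these hyperparameters enter a typical Keras training call is given below; the variable names (X_train, y_train) and the specific values are illustrative assumptions, not the tuned settings reported in this study.

```python
from tensorflow import keras

model = build_sa_conv1d_bigru(n_steps=12, n_features=5)   # from the earlier sketch
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),  # learning rate
              loss="mse")
history = model.fit(
    X_train, y_train,
    epochs=200,            # number of epochs
    batch_size=32,         # batch size
    validation_split=0.2,  # held-out fraction to monitor over/underfitting
    callbacks=[keras.callbacks.EarlyStopping(patience=20,
                                             restore_best_weights=True)],
)
```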

The rationale behind choosing a set of parameters for effectively fine-tuning a DL model for streamflow prediction was based on model complexity, computational resources, and training performance (Wegayehu & Muluneh 2021). For example, monitoring the model's training performance, such as loss and accuracy, can guide the selection of parameters, and experimenting with different configurations and evaluating their impact on training metrics can help identify the optimal set. Understanding the sensitivities of these parameters and their effects on the model's performance is crucial for optimizing a DL model for streamflow prediction tasks and achieving accurate and reliable predictions.

Model training and testing

In the context of DL, training and testing are two critical phases in the development and evaluation of a model. Training refers to the process of using a labeled dataset to teach a DL model to recognize patterns and make predictions. Testing, also known as evaluation or validation, is the phase where the trained model is assessed for its performance on new, unseen data. The objective is to improve the model's accuracy and determine its suitability for real-world applications (Ergete & Geremew 2024). In this study, the DL models were trained on monthly streamflow data from 1991 to 2008 and tested from 2009 to 2013 for the Kulfo catchment, and trained from 1997 to 2012 and tested from 2014 to 2018 for the Woybo catchment.

Performance measures

The purpose of performance measures in streamflow prediction is to assess the accuracy and reliability of the predicted streamflow data in comparison with observed data. In this study, the evaluation is conducted by comparing the predicted streamflow from the three models to the observed streamflow using the following performance measures (Equations (9)–(12)):

Root mean squared error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Q_{obs,i} - Q_{sim,i}\right)^{2}} \tag{9}$$

Mean absolute error (MAE):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Q_{obs,i} - Q_{sim,i}\right| \tag{10}$$

Nash–Sutcliffe efficiency (NSE):

$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n}\left(Q_{obs,i} - Q_{sim,i}\right)^{2}}{\sum_{i=1}^{n}\left(Q_{obs,i} - \overline{Q}_{obs}\right)^{2}} \tag{11}$$

Coefficient of determination (R²):

$$R^{2} = 1 - \frac{SS_{res}}{SS_{tot}} \tag{12}$$

where n is the number of data points, $Q_{sim}$ is the predicted streamflow, $Q_{obs}$ is the observed streamflow, $\overline{Q}_{obs}$ is the mean observed streamflow, $SS_{res}$ is the sum of squared residuals, and $SS_{tot}$ is the total sum of squares.
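These measures are straightforward to compute in NumPy; a minimal sketch, with R² computed from SS_res and SS_tot exactly as in Equation (12), is:

```python
import numpy as np

def rmse(obs, sim):          # Equation (9)
    return np.sqrt(np.mean((obs - sim) ** 2))

def mae(obs, sim):           # Equation (10)
    return np.mean(np.abs(obs - sim))

def nse(obs, sim):           # Equation (11)
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def r2(obs, sim):            # Equation (12): 1 - SS_res / SS_tot
    ss_res = np.sum((obs - sim) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return 1 - ss_res / ss_tot
```

Note that, with SS_tot taken about the observed mean, Equations (11) and (12) are algebraically identical; R² is often alternatively reported as the squared Pearson correlation between observed and simulated flows.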

DL model results

This research applied three distinct DL models (Bi-LSTM, Conv1D-LSTM, and SA-Conv1D-BiGRU), implemented in the Python programming language, to predict streamflow under different time-series characteristics. The mean squared error (MSE) was used as the cost function to minimize during training and identify the best output. After multiple iterations, the most effective hyperparameter configurations for the networks are detailed in Table 5. The monthly predicted and observed flow hydrographs are presented in Figures 10 and 11 for the Kulfo and Woybo catchments, respectively.
Table 5

Ranges of model parameters used in optimization

Parameter | Range
Conv1D layer | 16–128
Self-attention layer | 16–128
BiGRU layer | 16–128
Learning rate | [0.1, 0.01, 0.001, 0.0001]
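The random search mentioned in step (5) of the working principle could be realized, for example, with the keras-tuner package over the ranges in Table 5; this is an assumed tooling choice, since the paper states only that random search was used, and the window shape is hypothetical.

```python
import keras_tuner as kt
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    conv_units = hp.Int("conv1d_units", 16, 128, step=16)     # Conv1D layer range
    gru_units = hp.Int("bigru_units", 16, 128, step=16)       # BiGRU layer range
    lr = hp.Choice("learning_rate", [0.1, 0.01, 0.001, 0.0001])
    inputs = keras.Input(shape=(12, 5))                       # assumed window shape
    x = layers.Conv1D(conv_units, 2, padding="same", activation="relu")(inputs)
    x = layers.Bidirectional(layers.GRU(gru_units))(x)
    outputs = layers.Dense(1)(x)
    model = keras.Model(inputs, outputs)
    model.compile(keras.optimizers.Adam(lr), loss="mse")
    return model

tuner = kt.RandomSearch(build_model, objective="val_loss", max_trials=20)
tuner.search(X_train, y_train, validation_split=0.2, epochs=50)
```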
Figure 9

Training and test loss function of the optimized models.

Figure 10

Comparison of the observed and predicted flow hydrograph of all models for the Kulfo catchment.

Figure 11

Comparison of the observed and predicted flow hydrograph of all models for the Woybo catchment.

Various performance metrics can be utilized to assess DL models. In our case, we used RMSE, MAE, NSE, and R² to evaluate the models' performance (Tables 6 and 7).

Table 6

Performance of the DL models for the Kulfo watershed (testing period)

Model name | MAE | RMSE | NSE | R²
Bi-LSTM1 | 5.13 | 3.01 | 0.87 | 0.89
Bi-LSTM2 | 4.64 | 2.94 | 0.88 | 0.91
Bi-LSTM3 | 4.12 | 2.87 | 0.90 | 0.93
Bi-LSTM4 | 4.05 | 2.52 | 0.92 | 0.94
Conv1D-LSTM1 | 5.91 | 4.11 | 0.84 | 0.86
Conv1D-LSTM2 | 5.95 | 4.12 | 0.83 | 0.86
Conv1D-LSTM3 | 5.53 | 3.81 | 0.85 | 0.87
Conv1D-LSTM4 | 4.74 | 3.12 | 0.87 | 0.88
SA-Conv1D-BiGRU1 | 3.34 | 2.32 | 0.93 | 0.95
SA-Conv1D-BiGRU2 | 3.13 | 2.11 | 0.94 | 0.95
SA-Conv1D-BiGRU3 | 3.04 | 1.62 | 0.94 | 0.96
SA-Conv1D-BiGRU4 | 2.85 | 1.41 | 0.96 | 0.97
Table 7

Performance of the DL models for the Woybo watershed (testing period)

Model name | MAE | RMSE | NSE | R²
Bi-LSTM1 | 5.23 | 3.45 | 0.85 | 0.86
Bi-LSTM2 | 4.92 | 3.11 | 0.85 | 0.87
Bi-LSTM3 | 4.72 | 3.01 | 0.88 | 0.90
Conv1D-LSTM1 | 6.54 | 4.71 | 0.77 | 0.78
Conv1D-LSTM2 | 6.25 | 4.62 | 0.80 | 0.82
Conv1D-LSTM3 | 5.73 | 3.71 | 0.81 | 0.84
SA-Conv1D-BiGRU1 | 3.54 | 2.53 | 0.94 | 0.96
SA-Conv1D-BiGRU2 | 3.06 | 1.61 | 0.95 | 0.97
SA-Conv1D-BiGRU3 | 2.74 | 1.32 | 0.96 | 0.98

This study collected daily discharge data from the Kulfo and Woybo stations and daily rainfall data from 10 gauging stations. Data for 22 flood events from 1991 to 2013 and 34 flood events from 1997 to 2018 were obtained for the Kulfo and Woybo catchments, respectively.

Evaluating models under various time-series conditions using historical data is likely to yield robust DL models for streamflow prediction (Wegayehu & Muluneh 2021). Therefore, this study evaluated the performance of different DL models on two distinct time-series characteristics. The study reveals that DL models have excellent potential for time-series prediction across different time-series characteristics. The predicted streamflow captures the baseflow, recession limb, and rising limb of the observed hydrograph, and also captures peak flow fairly well in both catchments.

The purpose of the training and test loss function plot is to evaluate the model's performance during the training process: it provides insights into the model's loss, which is used to assess the convergence of the model, identify potential overfitting or underfitting, and determine the effectiveness of the model in capturing the underlying patterns in the streamflow data (Mosavi et al. 2018). The scatter plot of observed and simulated streamflow, on the other hand, serves as a means to visually compare the model's predictions with the actual streamflow observations.

The findings revealed that the SA-Conv1D-BiGRU model proposed in this paper has the highest degree of fit with the observed data among the compared models at both catchments, with the Bi-LSTM model demonstrating the second-best performance, close to that of the SA-Conv1D-BiGRU model (Figures 9–11). A comparison of prediction results between SA-Conv1D-BiGRU and the other models on the whole test set is shown in Tables 6–9. From Tables 6–9, it can be seen that the SA-Conv1D-BiGRU model has excellent predictive ability according to all numerical performance indexes, the Taylor plot, and the scatter plot (Figures 12 and 13). This finding aligns with prior research, which has indicated the superior capability of attention-based DL models in capturing temporal patterns within diverse datasets, including streamflow (Zhou et al. 2023). The noteworthy strength of SA-Conv1D-BiGRU lies in its self-attention mechanism, which enhances the model's ability to capture long-range dependencies, adapt to variable input sequences, improve interpretability, and reduce overfitting in streamflow prediction applications. Despite their complexity, the Conv1D-LSTM models did not outperform the Bi-LSTM model.

The study clearly illustrated the effects of time-series characteristics and variability in lag time on the performance of diverse DL models. The results are condensed into themes of model variability and time-series characteristics: different DL model architectures had a greater influence on streamflow prediction accuracy than input time lags and time-series features. Analysis of input combinations and lag times demonstrates an improvement in performance in both catchment areas, as depicted in Tables 6 and 7. Scenario 4 for the Kulfo and scenario 3 for the Woybo catchments performed better than the other scenarios, and using precipitation as input enhances model performance. This study did not assess the performance of DL models at different temporal scales. Future research should therefore evaluate DL models on diverse temporal scales and climatic regions to optimize parameters, determine generalizability, and incorporate more granular data, ultimately improving the understanding and applicability of DL models for effective water resource management.
Table 8

Collected flood events with their predicted values for the Kulfo watershed

Event No. | Time | Observed (m³/s) | Bi-LSTM (predicted) | Conv1D-LSTM (predicted) | SA-Conv1D-BiGRU (predicted)
1 | 1997/11/26 | 70.32 | 67.76 | 65.88 | 68.89
2 | 2001/8/9 | 81.95 | 77.78 | 74.56 | 79.34
3 | 2001/8/11 | 78.32 | 76.21 | 74.54 | 77.06
4 | 2001/8/13 | 77.42 | 75.74 | 74.38 | 76.03
… | … | … | … | … | …
19 | 2002/9/24 | 81.04 | 77.09 | 86.90 | 79.97
20 | 2002/9/25 | 79.67 | 77.34 | 76.88 | 78.82
21 | 2002/9/27 | 77.42 | 73.76 | 75.12 | 76.32
22 | 2002/10/3 | 87.29 | 85.67 | 80.36 | 86.45
Table 9

Collected flood events with their predicted values for the Woybo watershed

Event No. | Time | Observed (m³/s) | Bi-LSTM (predicted) | Conv1D-LSTM (predicted) | SA-Conv1D-BiGRU (predicted)
1 | 2003/9/19 | 109.329 | 103.54 | 99.20 | 106.13
2 | 2004/1/23 | 98.329 | 92.34 | 84.32 | 96.87
3 | 2005/2/12 | 93.975 | 89.67 | 87.67 | 91.25
4 | 2005/8/10 | 104.32 | 100.02 | 94.76 | 101.21
5 | 2005/8/9 | 120.076 | 115.78 | 112.76 | 117.34
… | … | … | … | … | …
31 | 2009/8/27 | 99.671 | 93.21 | 87.89 | 96.01
32 | 2012/9/16 | 97.958 | 92.12 | 88.67 | 91.12
33 | 2016/8/14 | 102.822 | 94.23 | 91.56 | 97.34
34 | 2018/8/17 | 90.113 | 84.23 | 81.54 | 88.56
Figure 12

Scatter plot for employed DL models.

Figure 13

Taylor diagram displays the standard deviations and correlation coefficient between observed and predicted streamflow for the proposed three models.

To address the problem that a single model is often not accurate enough for hydrological modeling, this study introduced the SA-Conv1D-BiGRU streamflow prediction model along with Conv1D-LSTM and Bi-LSTM models at two different catchments. The Conv1D and BiGRU components were employed to extract the inherent characteristics of the time-series data, and the self-attention mechanism was leveraged to investigate the temporal and feature dependencies within the streamflow input data. The models were implemented in Python within the Jupyter Notebook using different libraries and packages. The analysis revealed a diverse range of results, attributed to lag time variation, time-series characteristics, and the type of DL algorithm deployed. Performance depended more strongly on lag time variations and the type of DL algorithm than on time-series characteristics. The SA-Conv1D-BiGRU model consistently excelled at capturing flow data patterns in both catchments according to all performance metrics utilized. The self-attention mechanism introduces additional computations to calculate attention weights for each element in the input sequence, allowing the model to capture long-range dependencies and focus on relevant features. This added complexity results in a higher number of parameters and computational overhead compared with a model without the self-attention mechanism; however, the incorporation of self-attention in the SA-Conv1D-BiGRU model further increases the model's capacity to learn intricate patterns in the data. Despite their complexity, the Conv1D-LSTM models did not outperform the standalone Bi-LSTM model. Overall, the findings of this study provide insights into the performance of different DL models in different catchments, and the results can be useful for water resources management and planning.

No funding was received from any source.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Apaydin, H., Feizi, H., Sattari, M. T. & Colak, M. S. (2020) Comparative analysis of recurrent neural network. Water 12 (5), 1500.

Chow, V. T., Maidment, D. R. & Mays, L. W. (1988) Applied Hydrology. McGraw-Hill Education, New York.

Duan, S., Ullrich, P. & Shu, L. (2020) Using convolutional neural networks for streamflow projection in California. Frontiers in Water 2 (September), 1–19.

Ergete, E. T. & Geremew, G. B. (2024) Predicting the peak flow and assessing the hydrologic hazard of the Kessem Dam, Ethiopia using machine learning and risk management centre-reservoir frequency analysis software. Journal of Water and Climate Change 15 (2), 370–391. https://doi.org/10.2166/wcc.2024.320.

Forghanparast, F. (2022) Using deep learning algorithms for intermittent streamflow prediction in the headwaters of the Colorado River, Texas. Water 14 (19), 2972.

Gupta, H. V. & Nearing, G. S. (1969) Water resources research. Journal of the American Water Resources Association 5 (3), 2. https://doi.org/10.1111/j.1752-1688.1969.tb04897.

Hochreiter, S. & Schmidhuber, J. (1997) Long short-term memory. Neural Computation 9 (8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.

Hu, C., Wu, Q., Li, H., Jian, S., Li, N. & Lou, Z. (2018) Deep learning with a long short-term memory networks approach for rainfall-runoff simulation. Water (Switzerland) 10 (11), 1–16. https://doi.org/10.3390/w10111543.

Ji, J., Choi, C., Yu, M. & Yi, J. (2012) Comparison of a data-driven model and a physical model for flood forecasting. WIT Transactions on Ecology and the Environment 159, 133–142. https://doi.org/10.2495/FRIAR120111.

Kratzert, F., Klotz, D., Brenner, C., Schulz, K. & Herrnegger, M. (2018) Rainfall-runoff modelling using long short-term memory (LSTM) networks. Hydrology and Earth System Sciences 22, 6005–6022. https://doi.org/10.5194/hess-22-6005.

Li, X. & Xu, W. (2022) Hybrid CNN-LSTM models for river flow prediction. Water Supply 22 (5), 4902–4920. https://doi.org/10.2166/ws.2022.170.

Lin, Y., Wang, D., Meng, Y., Sun, W. & Qiu, J. (2023) Bias learning improves data driven models for streamflow prediction. Journal of Hydrology: Regional Studies 50, 101557.

Livieris, I. E., Pintelas, E. & Pintelas, P. (2020) A CNN-LSTM model for gold price time-series forecasting. Neural Computing and Applications 32 (23), 17351–17360. https://doi.org/10.1007/s00521-020-04867.

Mosavi, A., Ozturk, P. & Chau, K. W. (2018) Flood prediction using machine learning models: literature review. Water (Switzerland) 10, 11.

Niu, W. J., Feng, Z. K., Feng, B. F., Min, Y. W., Cheng, C. T. & Zhou, J. Z. (2019) Comparison of multiple linear regression, artificial neural network, extreme learning machine, and support vector machine in deriving operation rule of hydropower reservoir. Water (Switzerland) 11 (1). https://doi.org/10.3390/w11010088.

Salehinejad, H., Sankar, S., Barfett, J., Colak, E. & Valaee, S. (2017) Recent advances in recurrent neural networks, pp. 1–21. Available from: http://arxiv.org/abs/1801.01078.

Ukumo, T. Y., Edamo, M. L., Abdi, D. M. & Derebe, M. A. (2022) Evaluating water availability under changing climate scenarios in the Woybo catchment, Ethiopia. Journal of Water and Climate Change 13 (11), 4130–4149.

Van, S. P., Le, H. M., Thanh, D. V., Dang, T. D., Loc, H. H. & Anh, D. T. (2020) Deep learning convolutional neural network in rainfall-runoff modelling. Journal of Hydroinformatics 22 (3), 541–561. https://doi.org/10.2166/hydro.2020.095.

Wegayehu, E. B. & Muluneh, F. B. (2021) Multivariate streamflow simulation using hybrid deep learning models. Computational Intelligence and Neuroscience 2021 (1). https://doi.org/10.1155/2021/5172658.

Wegayehu, E. B. & Muluneh, F. B. (2022) Short-term daily univariate streamflow forecasting using deep learning models. Advances in Meteorology 2022. https://doi.org/10.1155/2022/1860460.

Zhou, F., Chen, Y. & Liu, J. (2023) Application of a new hybrid deep learning model that considers temporal and feature dependencies in rainfall-runoff simulation. Water Resources Management 30 (9), 3191–3205. https://doi.org/10.1007/s11269-016-1340-8.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).