Abstract
Deep learning has made significant advances in methodologies and practical applications in recent years. However, there is a lack of understanding of how long short-term memory (LSTM) networks perform in river flow prediction. This paper assesses the performance of LSTM networks to understand the impact of network structures and parameters on river flow predictions. Two river basins with different characteristics, i.e., the Hun river and Upper Yangtze river basins, are used as case studies for 10-day average flow predictions and daily flow predictions, respectively. The use of a fully connected layer with an activation function before the LSTM cell layer can substantially reduce learning efficiency. On the contrary, a nonlinear transformation following the LSTM cells is required to improve learning efficiency, owing to the different magnitudes of precipitation and flow. The batch size and the number of LSTM cells are sensitive parameters and should be carefully tuned to achieve a balance between learning efficiency and stability. Compared with several hydrological models, the LSTM network achieves good performance in terms of three evaluation criteria, i.e., the coefficient of determination, Nash–Sutcliffe Efficiency and relative error, which demonstrates its powerful capacity for learning the nonlinear and complex processes in hydrological modelling.
HIGHLIGHTS
Long short-term memory (LSTM) networks are assessed for river flow prediction.
The impacts of network structures and parameters on learning efficiency are analysed.
The batch size and the number of LSTM cells are sensitive parameters for learning.
LSTM has good predictive accuracy compared to the hydrological and data-driven models tested.
INTRODUCTION
The rainfall–runoff process of a river basin is normally characterized by a high degree of nonlinearity. The process is one of the most important components of the hydrological cycle, and its accurate modelling is crucial for water resources and flood management (Clarke 1994; Nourani 2017). Rainfall–runoff models are usually classified into three main classes: distributed, conceptual and black box models (Clarke 1994). Distributed and conceptual models are based on various hydrological processes; however, they are limited by our understanding of these processes, our ability to represent them accurately, and the available computational resources. By contrast, black box models are normally data-driven but can provide accurate predictions in many situations (Tanty & Desmukh 2015; Nourani 2017). Artificial neural network (ANN) models are one of the typical black box models. Since Daniell (1991) applied ANNs to streamflow modelling, ANNs have been widely applied in hydrological modelling because of their strong nonlinear fitting ability (ASCE Task Committee 2000a, 2000b). Currently, various ANN models have been employed to study the rainfall–runoff process, such as fuzzy neural networks (Nayak et al. 2004), wavelet neural networks (Wang & Ding 2003; Alexander & Thampi 2018) and Bayesian neural networks (Bateni et al. 2007; Kayabası et al. 2017). Traditionally, an ANN learns the relationships between input and output variables from the historical data provided and does not have the ability to automatically select the input variables or factors. ANNs with multiple hidden layers have excellent learning ability (Hinton & Salakhutdinov 2006); thus, deep neural networks are increasingly used in hydrology to simulate rainfall–runoff relationships, building on the big data which have become available in recent years (Hu et al. 2018, 2019; Kratzert et al. 2019; Le et al. 2019).
In recent years, deep learning has made significant advances in the field of machine learning and data science (Negnevitsky & Pavlovsky 2005). Many deep learning algorithms have shown great potential in solving real-world problems (Khan & Yairi 2018), for example, recurrent neural networks (RNNs) (Shin et al. 2017) and convolutional neural networks (Zhou et al. 2019). In particular, the RNN has a strong learning ability for time series data (Bengio et al. 1994; Hochreiter & Schmidhuber 1997; Saon & Picheny 2017), as it includes loops that allow information from previous time steps to be passed to the next time step. However, the vanishing or exploding gradient problem makes the RNN gradually lose the ability to learn long-distance information (Bengio et al. 1994). To overcome this deficiency, the long short-term memory (LSTM) network, a special type of RNN, was developed for learning with long sequence data (Hochreiter & Schmidhuber 1997), as it is capable of learning long-term dependencies in the data series. Based on the concept of LSTM networks, many similar networks have been constructed to improve the learning ability for different tasks (Sutskever et al. 2013; Bellegarda & Monz 2016). At present, the LSTM has been successfully used in speech recognition and text translation (Bellegarda & Monz 2016; Rocha et al. 2019).
In the last few years, LSTM networks have been tested and studied in watershed hydrological modelling, and their potential has been demonstrated in many applications, such as river flow and flood predictions (Shen 2018). Kratzert et al. (2018) applied the LSTM network to simulate the daily flows of 241 basins and found that it greatly outperforms hydrological models that are calibrated both at the regional level and at the individual basin level. Lee et al. (2018) developed an LSTM for daily runoff simulations based on the water level data of 10 stations on the upper Mekong River and showed that the LSTM performs better than the Soil and Water Assessment Tool (SWAT) model. Zhang et al. (2018) compared four different neural networks, namely the multilayer perceptron, wavelet neural network (WNN), LSTM and gated recurrent unit (GRU), in predicting the daily discharges of combined sewer overflow structures, and showed that the LSTM and GRU are highly capable of multi-step-ahead time series prediction. Sahoo et al. (2019) applied the LSTM to forecast daily flows during low-flow periods in the Mahanadi River basin, India. Kratzert et al. (2019) proposed an Entity-Aware LSTM (EA-LSTM), which performed substantially better at the regional level with 531 basins than several hydrological models calibrated individually for each basin. Hu et al. (2018) tested an LSTM model on 98 flood events and indicated that the LSTM model outperformed conceptual and physical models. Yan et al. (2019) constructed an LSTM with historical flow and weather data and weather forecasts and indicated that the LSTM outperforms support vector machines in flood predictions, especially for flood peak flow forecasts. Karimi et al. (2019) compared three data-driven methods, i.e., ANN, LSTM and the Least Absolute Shrinkage and Selection Operator (LASSO), for flood flow predictions in sewer systems, and concluded that all three models provide acceptable prediction performance, but the LSTM outperforms the ANN due to the inherent memory integrated with a feedback structure. Muhammad et al. (2019) proposed a hybrid model combining LSTM and GRU for river flow simulations, which was used for early flood warning. Hu et al. (2019) integrated an LSTM and a reduced-order model to represent the spatial–temporal distribution of floods. Le et al. (2019) used the LSTM to model 1-, 2- and 3-day flood events in Vietnam's Da River basin. In summary, previous research has shown the ability of the LSTM in river flow predictions. However, there is a lack of understanding of how LSTM structures and parameters are linked to predictive accuracy in hydrological modelling.
The main aim of this paper is to assess the performance of LSTM networks in river flow predictions in terms of LSTM structures and parameters. In this study, LSTM networks with different structures, i.e., different arrangements of fully connected layers and LSTM cells, are trained and their performances are compared using two case studies with different characteristics – the Hun river basin and the upper Yangtze river basin, China. The trained LSTM networks are used to predict the river flows in the two case study river basins. Finally, the LSTM networks are compared with four models, i.e., the SWAT model, the Xinanjiang model (XAJ), a multiple linear regression model (MLR) and a backpropagation neural network (BP).
METHODOLOGY
In this section, the LSTM network for flow simulation and prediction is first presented, and its key components, including the network structure, LSTM cells and loss function, are explained. Then, the data preprocessing and evaluation criteria used in this study are introduced. Finally, the different simulation scenarios designed to study the performance of the LSTM network are explained.
LSTM network
Network structure
In the LSTM network, the key components are the fully connected layers and the LSTM cells. As shown in Figure 1, the LSTM network contains four types of layers: (a) the input layer, which receives the input sequence data; (b) fully connected layer a, which transforms the dimensions of the input data to the dimensions of the LSTM cells and establishes a bridge between the input layer and the LSTM cell layer; (c) the LSTM cell layer of n cells (i.e., simple networks), which provides different memory abilities; and (d) the output layers, including fully connected layers b1 and b2 and the output flow vector, which transform the outputs of the LSTM cells into flows.
LSTM cell structure
Figure 2 shows the structure of the LSTM cell. There are two key states in the LSTM cell calculation, i.e., the cell state and the hidden state. In Figure 2, C_{t−1} and H_{t−1} represent the cell state and hidden state at time step t−1, respectively. The cell state is the main chain of the data stream, which allows the data to flow forward substantially unchanged. However, data in the hidden state can be added to or removed from the cell state (Le et al. 2019), which is carefully controlled by the ‘forget gate’, ‘input gate’ and ‘output gate’, represented by the dashed boxes in Figure 2. The gates are neural network layers with a series of matrix operations, which contain different individual weights and biases. The LSTM cell uses the gates to control the memory process and thus avoid the long-term dependency problem (Hochreiter & Schmidhuber 1997). The cell learns from time series data using five simple network layers, including three sigmoid layers and two tanh layers (Le et al. 2019).
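The gate operations described above can be sketched in code. The following is a minimal, generic LSTM cell step in NumPy; the weight shapes and random initial values are illustrative only, and this mirrors the standard gate equations rather than reproducing the paper's Equations (8)–(10) verbatim:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM cell step: three sigmoid gate layers and two tanh layers."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])  # candidate values
    c_t = f * c_prev + i * g   # cell state: the main chain of the data stream
    h_t = o * np.tanh(c_t)     # hidden state, exposed to the next layer
    return h_t, c_t

# Tiny demo: input dimension m = 3, cell size n = 4 (arbitrary values)
rng = np.random.default_rng(0)
m, n = 3, 4
W = {k: 0.1 * rng.standard_normal((n, m)) for k in "fiog"}
U = {k: 0.1 * rng.standard_normal((n, n)) for k in "fiog"}
b = {k: np.zeros(n) for k in "fiog"}
h, c = np.zeros(n), np.zeros(n)
h, c = lstm_cell_step(rng.standard_normal(m), h, c, W, U, b)
```

Because the output gate is a sigmoid and the cell state passes through tanh, each element of the hidden state is bounded in (−1, 1), which is what lets the cell state itself flow forward largely unchanged while the hidden state is modulated.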
Loss function
Pseudo code of LSTM network
The pseudo code of the LSTM network training is shown in Algorithm 1. In the pseudo code, the parameters include the batch size of training input data (T), the dimensions of the input data (m), the dimensions of the output flow vector (g), LSTM cell size (n) and the input and output dimensions of the fully connected layers (i.e., h and k). In the case study, the LSTM network training ends after 1,500 epochs.
Input: input data matrix [T, m]; target data matrix [T, g].
Initialize parameters.
Fully connected layer: transform the input matrix [T, m] to [T, n] using Equation (3).
LSTM cells:
  For t = 0 in length(T):
    For k = 0 in length(n):
      Update the states of the LSTM cell using Equations (8) and (10), respectively;
      Generate the cell output using Equation (9).
  Get the outputs of the LSTM cells after the iteration of the loop: [T, n].
Fully connected layers: transform the matrix [T, n] to [T, g] using fully connected layers b1 and b2.
Loss function: compare the simulated and observed flows and evaluate the loss value using Equation (11).
Weight updating: based on the loss value, update the network weights using the Adam algorithm.
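The weight-updating step uses the Adam algorithm. As an illustration, the following NumPy sketch implements one standard Adam update; the hyperparameter values are the common defaults and are assumptions here, not values reported in the paper:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponentially decayed first/second moment estimates,
    bias correction, then the parameter step."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)   # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Demo: minimise f(w) = (w - 3)^2 starting from w = 0
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    grad = 2 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
```

In the LSTM network, the same update is applied element-wise to every weight matrix and bias, with the gradients supplied by backpropagation through time.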
Structure scenarios
In this section, scenarios with different LSTM structures and parameters are presented to test the network performances, as shown in Table 1. The LSTM network structure in Figure 1 is taken as scenario A1 and used as a benchmark for comparison with the other network structures. Building on scenario A1, scenario A2 has a fully connected layer with an activation function added between the input layer and fully connected layer a. Scenario A3 is developed by removing the activation functions in fully connected layers b1 and b2 of A1.
Scenarios  Structures
A1  P[T,m] → FC[m,n] → Cell[n] → FCa[n,50] → FCa[50,30] → O[T,g]
A2  P[T,m] → FCa[m,m] → FC[m,n] → Cell[n] → FCa[n,50] → FCa[50,30] → O[T,g]
A3  P[T,m] → FC[m,n] → Cell[n] → FC[n,50] → FC[50,30] → O[T,g]
B1  P[T,m] → FC[m,n] → Cell[n] → FCa[n,50] → FCa[50,100] → FCa[100,30] → O[T,g]
B2  P[T,m] → FC[m,n] → Cell[n] → FCa[n,50] → FCa[50,100] → FCa[100,50] → FCa[50,30] → O[T,g]

P     Input matrix
O     Output flow vector
FC    Fully connected layer without activation function
FCa   Fully connected layer with ReLU activation function
Cell  LSTM cells
T     Batch size of precipitation data
m     Number of meteorological stations
n     Number of LSTM cells
g     Number of hydrological stations
Comparing the performances of scenarios A1 and A3 allows us to analyse the impact of the activation functions, which establish a nonlinear transformation between the LSTM cells and the output runoff vector. To evaluate the impact of the number of layers on the learning efficiency of the LSTM, scenarios B1 and B2 are constructed by adding one and two fully connected layers to the benchmark A1, respectively.
The batch size of training samples (T) and the number of LSTM cells (n) are important network parameters, which determine the learning efficiency of the LSTM network. Table 1 shows the values tested in this study.
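To make the notation of the structure scenarios concrete, the benchmark structure A1 can be shape-checked with random weights. In this sketch the LSTM cell layer is replaced by a shape-preserving tanh stand-in, and the final projection from the 30-unit layer to the output vector O[T, g] is an assumed linear map; only the dimension flow is illustrated, not the trained network:

```python
import numpy as np

rng = np.random.default_rng(1)
T, m, n, g = 90, 10, 20, 1   # Hun river settings: batch 90, 10 stations, 20 cells, 1 gauge

relu = lambda x: np.maximum(x, 0.0)

P   = rng.standard_normal((T, m))         # input matrix P[T, m]
W1  = 0.1 * rng.standard_normal((m, n))   # FC[m, n], no activation
Wb1 = 0.1 * rng.standard_normal((n, 50))  # FCa[n, 50]
Wb2 = 0.1 * rng.standard_normal((50, 30)) # FCa[50, 30]
Wo  = 0.1 * rng.standard_normal((30, g))  # assumed projection to O[T, g]

X = P @ W1        # bridge from the input layer into the LSTM cell layer
H = np.tanh(X)    # stand-in for the LSTM cell layer (keeps shape [T, n])
O = relu(relu(H @ Wb1) @ Wb2) @ Wo        # FCa layers b1 and b2, then output flows
```

Tracing the shapes this way makes it easy to see why A2's extra activated layer sits before the cells while A3 differs only in dropping the ReLU nonlinearities after them.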
Streamflow data preprocess
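Because precipitation and flow differ greatly in magnitude, data-driven models of this kind typically rescale their inputs before training. The following min–max scaler is a common choice, shown here as a hedged sketch (the scaling actually used in the paper may differ); the bounds are fitted on the training period only, so verification data are scaled consistently:

```python
import numpy as np

def minmax_scale(x, lo=None, hi=None):
    """Scale a series to [0, 1]; fit lo/hi on the training period only so the
    verification data are scaled with the same bounds."""
    x = np.asarray(x, dtype=float)
    lo = x.min() if lo is None else lo
    hi = x.max() if hi is None else hi
    return (x - lo) / (hi - lo), lo, hi

flow = np.array([120.0, 340.0, 90.0, 560.0, 230.0])  # made-up flows (m^3/s)
scaled, lo, hi = minmax_scale(flow)                  # fit on "training" data
new, _, _ = minmax_scale(np.array([200.0, 450.0]), lo, hi)  # reuse training bounds
```

The inverse transform (scaled * (hi − lo) + lo) recovers flows in their original units after prediction.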
Model evaluation criteria
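The three criteria used throughout the paper – Nash–Sutcliffe Efficiency (NSE), the coefficient of determination (R²) and relative error (RE) – have standard definitions, sketched below in NumPy. Equations (14) and (15) are not reproduced in this excerpt, so these are the conventional forms (e.g., RE taken as the relative error of total volume), offered as an assumption rather than the paper's exact formulas:

```python
import numpy as np

def nse(obs, sim):
    """Nash–Sutcliffe Efficiency: 1 minus error variance over observed variance."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def r2(obs, sim):
    """Coefficient of determination as the squared Pearson correlation."""
    return np.corrcoef(obs, sim)[0, 1] ** 2

def re(obs, sim):
    """Relative error of total volume, in percent."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 100.0 * (sim.sum() - obs.sum()) / obs.sum()

obs = np.array([10.0, 20.0, 30.0, 40.0])  # illustrative observed flows
sim = np.array([12.0, 18.0, 29.0, 42.0])  # illustrative simulated flows
```

A perfect simulation gives NSE = R² = 1 and RE = 0; NSE below zero means the model is worse than simply predicting the observed mean.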
CASE STUDY
Study area
In this study, two basins are taken as case studies, the Hun river basin and the upper Yangtze river basin, which are located in the northeast and southwest of China, respectively (Figure 3). The basic characteristics of the two basins are shown in Table 2.
Characteristic                    Upper Yangtze river                  Hun river
Location                          106°14′∼111°28′ E, 28°16′∼31°44′ N   124°43′∼126°50′ E, 40°40′∼42°15′ N
Area (10^{4} km^{2})              100.23                               1.48
Channel length (km)               4,000                                435
Mean annual rainfall (mm)         1,250                                860
Mean annual streamflow (m^{3}/s)  14,300                               227
The Hun river originates from the Changbai Mountains. Its precipitation is governed by a temperate continental monsoon climate. The vegetation in this basin is well maintained and is rarely disturbed by human activities. About 70% of the annual precipitation occurs from June to September.
The Upper Yangtze river originates from the Qinghai-Tibet Plateau; the upper reach is defined as the drainage area above the Three Gorges Dam on the Yangtze River. The precipitation of the basin is affected by the topography and the monsoon, which makes the annual precipitation unevenly distributed both spatially and temporally. Precipitation increases from northwest to southeast, and about 70% of it occurs from May to September.
Data source
For the Hun river basin, 10-day average meteorological data from 10 stations (i.e., m = 10) and river flow data from the Huanren station (HR) (i.e., g = 1) were obtained for 1970 to 2010. The 30-year data from 1970 to 1999 are used to train the LSTM networks and calibrate the comparison models introduced in the Comparison models section. The data from 2000 to 2010 are used to verify the performance of the models.
For the Upper Yangtze river basin, daily meteorological data from 57 stations (i.e., m = 57) were obtained for 1991 to 1998. The observed daily streamflow from six hydrological stations (i.e., g = 6), shown in Figure 3, was obtained for the same period, i.e., Shi gu (SG), Pan zhi hua (PZH), Bei bei (BB), Wu long (WL), Xiang jia ba (XJB) and Wang zhou (WZ). The data from 1991 to 1995 are used for model training and calibration, and the data from 1996 to 1998 are used for verification.
Comparison models
In the Hun river basin case study, four models are constructed as comparison models to evaluate the performance of the proposed LSTM network, i.e., SWAT, XAJ, MLR and BP. In the upper Yangtze river basin, only the SWAT model is used.
SWAT model
The SWAT model is a distributed hydrological model developed by the U.S. Department of Agriculture, Agricultural Research Service (Guzman et al. 2015), and has a strong physical basis. In this study, DEM raster data with a resolution of 90 m and land use raster data with a resolution of 3 km were obtained for the Hun river basin, and DEM and land use data at a 3 km × 3 km resolution were obtained for the Upper Yangtze river basin. The SWAT model is calibrated using the SWAT Calibration and Uncertainty Procedure (SWAT-CUP) program, an auto-calibration tool that allows for sensitivity analysis, calibration, validation and uncertainty analysis of the SWAT model (Abbaspour 2011; Abbaspour et al. 2015). The key parameters for the two basins are listed in Table 3.
Parameters  Definition                                             Hun river  Upper Yangtze river
CN2         SCS runoff curve number for moisture condition II      72         35
ALPHA_BF    Base flow recession constant (days)                    0.8        0.61
GW_DELAY    Delay time for aquifer recharge (days)                 31         15
CH_N2       Manning's n value for the main and tributary channels  0.1        0.3
CH_K2       Effective hydraulic conductivity of channel (mm/hr)    5          20
SOL_AWC     Available water capacity (mm/mm)                       0.005      0.004
SMTMP       Snow melt minimum temperature (°C)                     −1         −2
CANMX       Canopy maximum storage (mm)                            4          9
ESCO        Soil evaporation compensation factor (–)               0.95       0.6
REVAPMN     Threshold depth for evaporation to occur (mm)          71         120
XAJ model
The XAJ model is a conceptual rainfall–runoff model developed by Zhao (1992). It has been widely used in China, particularly in humid and semi-humid regions (Xu et al. 2014). The XAJ model assumes that runoff is not produced until the soil water content of the aeration zone reaches its field capacity. The actual evapotranspiration is computed from the potential evapotranspiration, while the soil storage deficit is calculated in three layers, i.e., the upper, lower and deep soil layers. The XAJ model parameters of the Hun river basin have been calibrated using a Genetic Algorithm (Xu & Peng 2016). In calibration, NSE and RE are used to evaluate the performance of the parameters, as shown in Equations (14) and (15), respectively.
The calibrated parameters of the XAJ model are shown in Table 4. In this model, the surface runoff, interflow and groundwater are routed using instantaneous unit hydrographs, and the parameters (n, k) of the three hydrographs are set to (3, 4), (4, 5) and (4.5, 7), respectively.
Parameters  Physical meaning (unit)                                           Value
U_{m}       Tension water capacity of the upper layer (mm)                    5
L_{m}       Tension water capacity of the lower layer (mm)                    50
D_{m}       Tension water capacity of the deep layer (mm)                     10
IMP         Ratio of impervious area to the total catchment area (%)          0.05
C           Evapotranspiration coefficient of the deeper layer (–)            0.1
B           Exponent of the distribution of tension water capacity (–)        0.36
S_{m}       Free water capacity (mm)                                          5
E_{x}       Exponent of the distribution of free water capacity (–)           1.4
k_{g}       Outflow coefficient of the free water storage to groundwater (–)  0.2
k_{i}       Outflow coefficient of the free water storage to interflow (–)    0.5
MLR model
In the MLR model, one regression model was developed for each time step to identify the factors of physical and statistical significance in the Hun river basin. There are 36 time steps in a year, i.e., 10 days per time step. In the MLR models, the least squares method is used to estimate the parameters of the regression equations based on the observed runoffs and factors; the resulting equations are shown in Table 5.
Time step  Equation                                              Time step  Equation
1   F_{t} = 3.5 + 0.55Q_{t−1}                                    19  F_{t} = −250 + 0.05Q_{t−1} + 0.31P_{t−1} + 0.48P_{t}
2   F_{t} = 2.75 + 0.65Q_{t−1}                                   20  F_{t} = −50 + 0.33Q_{t−1} − 0.04P_{t−1} + 0.43P_{t}
3   F_{t} = 1.94 + 0.33(Q_{t−1} + Q_{t−2})                       21  F_{t} = −70 + 0.29Q_{t−1} + 0.06P_{t−1} + 0.39P_{t}
4   F_{t} = 2.97 + 0.66Q_{t−1}                                   22  F_{t} = −25 + 0.25Q_{t−1} + 0.09P_{t−1} + 0.26P_{t}
5   F_{t} = 1.06 + 0.45(Q_{t−1} + Q_{t−2})                       23  F_{t} = 13 + 0.49Q_{t−1} + 0.01P_{t−1} + 0.09P_{t}
6   F_{t} = 4.47 + 0.68Q_{t−1}                                   24  F_{t} = −5.0 + 0.7Q_{t−1} + 0.01P_{t−1} + 0.08P_{t}
7   F_{t} = 10.50 + 0.49Q_{t−1}                                  25  F_{t} = −7 + 0.51Q_{t−1} + 0.05P_{t−1} + 0.17P_{t}
8   F_{t} = 17.10 + 1.18Q_{t−1}                                  26  F_{t} = −3 + 0.67Q_{t−1} + 0.05P_{t−1} + 0.04P_{t}
9   F_{t} = 35.0 + 1.0Q_{t−1}                                    27  F_{t} = −12 + 0.73Q_{t−1} + 0.03P_{t−1} + 0.13P_{t}
10  F_{t} = −20.0 + 0.6Q_{t−1} + 0.1P_{t−1} + 0.1P_{t}           28  F_{t} = 60.0 + 0.75Q_{t−1}
11  F_{t} = 3.0 + 0.44Q_{t−1} + 0.1P_{t−1} + 0.15P_{t}           29  F_{t} = 90.0 + 0.05(Q_{t−1} + Q_{t−2}) + 0.4P_{t−1}
12  F_{t} = −10 + 0.37Q_{t−1} + 0.07P_{t−1} + 0.19P_{t}          30  F_{t} = 50.0 + 0.35(Q_{t−1} + Q_{t−2}) + 0.2P_{t−1}
13  F_{t} = −10 + 0.4Q_{t−1} + 0.16P_{t−1} + 0.04P_{t}           31  F_{t} = 2.50 + 0.48Q_{t−1} + 0.13P_{t−1}
14  F_{t} = −40 + 0.56Q_{t−1} + 0.08P_{t−1} + 0.14P_{t}          32  F_{t} = −1.0 + 0.90Q_{t−1}
15  F_{t} = −90 + 0.32Q_{t−1} + 0.11P_{t−1} + 0.28P_{t}          33  F_{t} = 11.0 + 0.48Q_{t−1}
16  F_{t} = −45 + 0.28Q_{t−1} + 0.14P_{t−1} + 0.19P_{t}          34  F_{t} = 10.0 + 0.44Q_{t−1}
17  F_{t} = −40 + 0.35Q_{t−1} + 0.06P_{t−1} + 0.25P_{t}          35  F_{t} = 3.5 + 0.65Q_{t−1}
18  F_{t} = −270 + 0.54Q_{t−1} + 0.07P_{t−1} + 0.55P_{t}         36  F_{t} = 3.5 + 0.65Q_{t−1}
F_{t} is the simulated runoff at time step t; Q_{t−1} is the observed runoff at time step t−1; P_{t} is the average precipitation at time step t.
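The least-squares estimation described above can be illustrated with synthetic data shaped like time step 10 (F_t = −20.0 + 0.6Q_{t−1} + 0.1P_{t−1} + 0.1P_t). The data below are randomly generated for demonstration only, not the observed Hun river records:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical regressors: previous runoff and precipitation, in plausible ranges
Q1 = rng.uniform(0.0, 100.0, 50)   # Q_{t-1}
P1 = rng.uniform(0.0, 60.0, 50)    # P_{t-1}
P0 = rng.uniform(0.0, 60.0, 50)    # P_t
# Target generated from the time-step-10 equation plus small noise
F = -20.0 + 0.6 * Q1 + 0.1 * P1 + 0.1 * P0 + rng.normal(0.0, 1.0, 50)

X = np.column_stack([np.ones_like(Q1), Q1, P1, P0])  # design matrix with intercept
coef, *_ = np.linalg.lstsq(X, F, rcond=None)         # estimates of [a, b, c, d]
```

With enough samples, the recovered coefficients approach the generating values, which is exactly the fitting step performed for each of the 36 time-step equations.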
BP neural network
The BP model takes the flows at time steps t−2 and t−1 and the average precipitation of the watershed at time steps t−2, t−1 and t as inputs, and takes the flow at time step t as output. The BP model is constructed as a four-layer neural network, with 5 (input layer), 50 (hidden layer), 50 (hidden layer) and 1 (output layer) nodes, respectively. The outputs of the hidden layers are transformed by the sigmoid activation function. The BP network was trained for 900 epochs, by which point it had already converged.
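A minimal sketch of the BP network's forward pass with the stated layer sizes follows; the weights here are random and untrained, and the backpropagation training loop is omitted:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
# Layer sizes: 5 inputs (Q_{t-2}, Q_{t-1}, P_{t-2}, P_{t-1}, P_t) -> 50 -> 50 -> 1
sizes = [5, 50, 50, 1]
Ws = [0.1 * rng.standard_normal((a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]

def bp_forward(x):
    """Forward pass: sigmoid on the two hidden layers, linear output layer."""
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = sigmoid(h @ W + b)
    return h @ Ws[-1] + bs[-1]

y = bp_forward(rng.standard_normal((8, 5)))   # batch of 8 hypothetical samples
```

Unlike the LSTM, this network has no recurrence: its only memory of past conditions is whatever is packed into the five lagged input features.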
RESULTS
The network structure and parameters have a great influence on the learning efficiency. In this study, the LSTM networks are tested on the Hun and Upper Yangtze river basins, respectively. First, the effects of the LSTM network structure are analysed, and the effects of the network parameters are evaluated in terms of the number of cells (n) and the batch size (T). Then, the structure and parameters with the best performance are selected to predict the river flows of the two case studies. Finally, the performances of the LSTMs are compared with the results of the comparison models.
Learning efficiency with different structures
Effects of activation function
Scenarios A1, A2 and A3 are trained for the Hun river and Upper Yangtze river case studies, and the variations of the loss function values are shown in Figure 4. For each network structure, the network weights are trained for 1,500 epochs. The training loss values of each scenario in Figure 4 are the average values of 10 independent training runs. Note that the model parameter values used for the results in Figure 4 are those optimal values identified in Table 6. The results from the two case studies in Figure 4 show the following key points:
(1) The loss value of A1 decreases rapidly, demonstrating fast learning.
(2) In A2, the inputs are nonlinearly transformed by activation functions before being passed to the LSTM cells; as a result, the LSTM cells cannot effectively capture the long-term time dependencies in the time series data, and the loss values cannot be reduced rapidly during training.
(3) In A3, there is only a simple linear transformation between the LSTM cells and the output layer. This makes learning difficult, with a slow reduction in loss values before they start to increase after 500 epochs.
Watershed            Structures and parameters
Hun river            P[T=90, m=10] → FC[m,n] → Cell[n=20] → FCa[n,50] → FCa[50,30] → O[T, g=1]
Upper Yangtze river  P[T=360, m=57] → FC[m,n] → Cell[n=50] → FCa[n,50] → FCa[50,30] → O[T, g=6]
The results in Figure 4 show that the LSTM structures in scenarios A2 and A3 cannot provide efficient learning, and their model outputs do not match the target values well.
Effects of fully connected layers
The test results for scenarios B1 and B2 are shown in Figure 5. The results indicate that all the scenarios can learn effectively in the two case studies. However, the detailed loss value variations between epochs 1,250 and 1,500, shown in Figure 5(a2,b2), reveal that scenario A1 converges faster and is more stable than the other two scenarios. This implies that the network does not need a large number of fully connected layers between the LSTM cells and the output layer to improve the learning efficiency. Therefore, the LSTM structure of A1 is selected to predict the river flows in the two case study basins.
Learning efficiency with different parameters
Effects of T
In LSTM network training, the batch size (T) of the training samples has a significant effect on the learning efficiency. In this study, a number of T values are tested for the Hun river (i.e., 10, 30, 50, 70 and 90) and the Upper Yangtze river (i.e., 30, 60, 120, 180 and 360). Figure 6 shows the training loss variations for the two basins. The loss values fluctuate dramatically when the batch size is small, i.e., T = 10 or 30 in the case of the Hun river and T = 30, 60 or 120 in the case of the Upper Yangtze river. This indicates that the LSTM cells cannot capture the periodicity in the time series with small batch sizes, because the network weights are updated too frequently. As T increases, the training samples used for each update gradually reflect the periodicity, and the fluctuation of the loss values is therefore gradually reduced. Figure 6(a2,b2) shows the magnified training loss curves for epochs 1,250 to 1,500. The results indicate that the learning curves are stable beyond T = 50 and T = 180 for the Hun river and Upper Yangtze river basins, respectively.
Figure 7 shows the means of the loss values over epochs 1,250 to 1,500 in Figure 6. The loss means gradually decrease as T increases, which indicates that the learning efficiency of the LSTM network gradually improves. In the case of the Upper Yangtze river, when T > 180, the loss value cannot be reduced further.
Effects of cell number
The LSTM cell is the core concept of the LSTM network, and the number of cells (n) strongly affects the network's performance. Figure 8 shows the means of the loss values over epochs 1,250 to 1,500 for different cell numbers. For the Hun river basin, the LSTM achieves good learning efficiency once the number of cells reaches 20, as shown in Figure 8(a); beyond 20 cells, the efficiency improves only very slowly. Figure 8(b) shows the network performances for the Upper Yangtze river, where the learning efficiency begins to stabilize once the number of cells reaches 40. In this study, 20 and 50 cells are used in the LSTM networks for the Hun river and the Upper Yangtze river, respectively.
Performance evaluation
Based on the above analysis of the structure and parameters of the LSTM network, the structure and parameters with the best performance (Table 6) are selected to learn and predict the flows of the Hun river and the Upper Yangtze river, respectively.
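The cell update underlying these networks can be sketched as a standard single-cell LSTM forward pass. The numpy implementation below is a generic sketch with toy dimensions (two inputs standing in for precipitation and flow, four cells), not the study's configuration or weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x, h, c, W, U, b, n):
    """One LSTM time step. W: (4n, m) input weights, U: (4n, n) recurrent
    weights, b: (4n,) biases, stacked in gate order [input, forget, cell, output]."""
    z = W @ x + U @ h + b
    i = sigmoid(z[0:n])           # input gate
    f = sigmoid(z[n:2 * n])       # forget gate
    g = np.tanh(z[2 * n:3 * n])   # candidate cell state
    o = sigmoid(z[3 * n:4 * n])   # output gate
    c_new = f * c + i * g         # updated cell state
    h_new = o * np.tanh(c_new)    # hidden state (cell output)
    return h_new, c_new

# Toy dimensions for illustration: m = 2 inputs, n = 4 cells, 5 time steps
rng = np.random.default_rng(0)
m, n = 2, 4
W = rng.normal(size=(4 * n, m))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(5, m)):   # unroll over the input sequence
    h, c = lstm_cell_step(x, h, c, W, U, b, n)
print(h.shape)
```

The final hidden state h is what a following fully connected layer maps to the predicted flow; the gating keeps each output bounded, which is why a nonlinear transformation after the cells is needed to match the magnitude of the flow data.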
Hun river basin
Figure 9(a) shows the observed and simulated flows from 1970 to 1999, and the performance evaluation criteria are given in Table 7. The simulated flows in Figure 9(a) fit the observed flows well. In training, the LSTM achieves an NSE value of 0.98, showing that it has a strong nonlinear learning ability and outperforms the other models in the Hun river basin. Among the models, MLR performs worst.
Table 7 | Performance of the models in the Hun river basin

Models   Training                        Verification
         NSE (%)   R² (%)    RE (%)      NSE (%)   R² (%)    RE (%)
SWAT      84.55     85.96     3.74        79.43     80.87    −4.79
XAJ       78.56     78.56     0.62        73.63     75.82     3.38
MLR       54.53     54.56    −3.08        34.98     36.31    −0.44
BP        85.47     85.48     0.30        68.93     69.26     6.42
LSTM      98.21     98.31     1.62        74.76     75.14    −4.06
In the verification period, Table 7 shows that the predictive ability of the LSTM is slightly worse than that of SWAT; the LSTM slightly outperforms the XAJ model in terms of NSE and is much better than the other models. Figure 9(b) shows the predicted and observed flows from 2000 to 2010. Though the predicted flows from the LSTM do not match the observed flows well in some periods, most peak flows are predicted well. This is clearly shown in the scatter plots of the observed and simulated flows for the training and verification periods in Figure 10. Though the overall performances of SWAT and XAJ in terms of the three criteria are better than those of LSTM, LSTM performs much better for peak flows, which are of particular concern in flood predictions.
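The three criteria used throughout can be sketched in numpy as below. The NSE and R² formulas are standard; the RE definition as the relative bias of total flow volume is an assumption here, since the paper does not state its exact form, and the observed/simulated values are toy data.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency (%): 1 - SSE / variance of observations."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 100.0 * (1.0 - np.sum((obs - sim) ** 2)
                    / np.sum((obs - obs.mean()) ** 2))

def r2(obs, sim):
    """Coefficient of determination (%): squared Pearson correlation."""
    r = np.corrcoef(np.asarray(obs, float), np.asarray(sim, float))[0, 1]
    return 100.0 * r ** 2

def re(obs, sim):
    """Relative error (%) of total flow volume (assumed definition)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 100.0 * (sim.sum() - obs.sum()) / obs.sum()

# Toy flow series, for illustration only
obs = [3.0, 5.0, 7.0, 4.0]
sim = [2.8, 5.2, 6.9, 4.1]
print(round(nse(obs, sim), 2), round(r2(obs, sim), 2), round(re(obs, sim), 2))
```

NSE penalizes squared errors relative to the observed variance, so it is dominated by peak flows, whereas RE reflects overall volume bias; this is why a model can score well on NSE while showing a non-negligible RE, as seen for several models in Table 7.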
Upper Yangtze river
In the Upper Yangtze river basin, the daily data of 57 meteorological stations are used as inputs, and the daily flows of six hydrological stations are used as target values. The flows of the six stations are simulated using LSTM and SWAT, and their performances are shown in Table 8.
Table 8 | Performance of the LSTM network and the SWAT model at six hydrological stations in the Upper Yangtze river basin

Model          Period        Criteria   SG      PZH     BB      WL      XJB     WZ
LSTM network   Training      NSE (%)    89.63   90.77   91.09   90.96   89.84   90.35
                             R² (%)     89.73   90.89   91.11   91.01   89.96   90.41
                             RE (%)      2.15    2.17    1.17    0.05    2.33    1.65
               Verification  NSE (%)    54.60   54.13   56.56   61.62   62.48   71.51
                             R² (%)     64.92   61.60   58.13   69.79   64.11   72.72
                             RE (%)      0.94    6.10  −12.71   11.18    8.48    1.85
SWAT model     Training      NSE (%)    77.05   72.56   70.67   76.95   79.95   84.31
                             R² (%)     77.89   80.41   75.65   78.93   81.55   88.67
                             RE (%)     13.34   −0.29    4.35    8.07   −2.74    9.66
               Verification  NSE (%)    66.03   68.13   62.67   72.78   74.82   74.40
                             R² (%)     66.04   67.19   63.45   67.18   78.01   79.46
                             RE (%)     14.47    8.51    4.92   10.58    5.29   10.94
In training, the LSTM achieves very high performance for the flows of all six stations. The NSE and R² values indicate that the LSTM outperforms the SWAT during training. Figure 11 shows the simulated flows at one station (WZ).
The scatter plots of the simulated (training) and predicted (verification) flows for the six hydrological stations are shown in Figure 12. The results indicate that the performance of the LSTM in the verification period is worse than that in the training period, and predicted peak flows tend to be lower than those observed.
Note that for the Upper Yangtze river, the LSTM network is constructed to predict the flows at multiple stations simultaneously. The training results show that the LSTM network has a strong fitting ability when learning the flow data of multiple hydrological stations.
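The multi-station setup amounts to giving the output layer one unit per hydrological station, so a single network is trained against all six target series jointly. The shapes below are assumptions for illustration (batch size, sequence length and the zero-valued arrays are placeholders, not the study's data):

```python
import numpy as np

# Assumed shapes: 57 meteorological inputs, 50 LSTM cells, 6 output stations
batch, n_inputs, n_cells, n_out = 32, 57, 50, 6

h_last = np.zeros((batch, n_cells))   # final LSTM hidden state per sample
W_out = np.zeros((n_cells, n_out))    # shared fully connected output layer
b_out = np.zeros(n_out)

y_hat = h_last @ W_out + b_out        # one prediction per station per sample
print(y_hat.shape)

# The training loss averages the squared error over all six stations jointly,
# so the shared LSTM cells must represent all stations' dynamics at once.
y_obs = np.zeros((batch, n_out))
loss = np.mean((y_hat - y_obs) ** 2)
```

A design consequence of sharing the LSTM cells across stations is that the network must encode the hydrological behaviour of the whole basin in one state vector, which explains why more cells (50) are needed here than for the single-station Hun river case (20).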
DISCUSSION
LSTM network training is significantly affected by the size of the training dataset. It is generally understood that network training requires a large amount of sample data. However, the required dataset size depends on the characteristics of the catchment and the flows of concern, which determine the complexity of the input–output relationships represented by the LSTM. The LSTM has been shown to perform well with datasets smaller than the 30 years used in the Hun river basin. For example, Kratzert et al. (2018) used 15 years of daily meteorological data and observed flow data from 241 catchments to train LSTM networks whose performances are comparable to process-based models. Lee et al. (2018) trained a deep LSTM network using daily rainfall and water level data during two periods, 2000–2002 and 2008–2014 (a total of 9 years), and it outperformed a SWAT model when verified using daily data from 2003 to 2007.
A main disadvantage of using large datasets is the computational time required for LSTM training, particularly when individual networks are trained for a large number of catchments. This issue could be tackled in several ways: (1) using advances in computing power such as Graphics Processing Units and cloud computing; (2) pooling the datasets from individual catchments of similar hydrological characteristics to train an LSTM network as a regional model which can predict the discharge for a number of catchments in the region (Kratzert et al. 2018); (3) training the network offline for real-time predictions. The second approach is also important for flow predictions in ungauged catchments, as suggested by Kratzert et al. (2018), and provides a new application area for the LSTM network in hydrological predictions.
Transfer learning is powerful and useful in deep learning as it can use the network knowledge gained from solving one problem to help solve another similar problem. Due to the focus of this work, transfer learning is not investigated here. Future research should explore the potential of transfer learning from two aspects: (1) building on a reference architecture (e.g. Scenario A1 in this study), the network knowledge (e.g. parameters) could be applied to other similar architectures in solving the same problem so training could be continuously improved using prior network knowledge; (2) transferring the LSTM learning knowledge from data-rich catchments to data-scarce catchments, so the flow predictions in data-scarce catchments could be improved.
CONCLUSIONS
In this study, the performance of LSTM networks is assessed for river flow simulations using two river basins: the Hun river and Upper Yangtze river. Different LSTM structures are analysed. The prediction performances are compared against other models including hydrological models and data-driven models. The key research conclusions are summarized below.
In the LSTM network, the activation function in the fully connected layer before the LSTM cell layer can substantially reduce learning efficiency: the transformation through activation functions weakens the correlations between precipitation and flow, leading to failure in learning the rainfall–runoff relationships in both study basins. By contrast, nonlinear transformation following the LSTM cells is required to improve learning efficiency, due to the different magnitudes of precipitation and streamflow. Further, increasing the number of fully connected layers does not improve the learning efficiency; instead, it requires more epochs because of fluctuations in the loss values.
In the LSTM network, the batch size and the number of LSTM cells are sensitive parameters that affect the learning efficiency. The results of this study show that the learning efficiency continues to increase as the batch size and the number of cells increase. However, once learning is stable, the number of cells should be kept to the minimum to reduce the complexity of the network.
The LSTM has a superior nonlinear learning ability for time series data, with a simple structure and few parameters, giving it strong application potential in streamflow simulation. The results of this study show that its nonlinear learning ability in the training process is very powerful.
The LSTM networks achieve good performance compared to the other models in terms of three criteria, i.e., NSE, R² and RE. In the case of the Hun river, LSTM outperforms MLR and BP and achieves a similar level of accuracy to XAJ. It is slightly worse than a well-calibrated SWAT but provides more accurate predictions for peak flows. In the case of the Upper Yangtze river, LSTM outperforms SWAT during training but is worse than SWAT in the verification period, mainly because predicted peak flows tend to be lower than those observed.
ACKNOWLEDGEMENTS
This research is supported by the National Natural Science Foundation of China (Grant Nos 51609025 and 51709108), the UK Royal Society through an industry fellowship to Guangtao Fu (Ref: IF160108) and an international collaboration project (Ref: IEC\NSFC\170249), the Open Fund Approval (SKHL1713, 2017), and the Chongqing technology innovation and application demonstration project (cstc2018jscxmsybX0274, cstc2016shmszx30002). Guangtao Fu is also supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. We would like to thank the Hun river cascade hydropower reservoirs development Ltd and the Upper Yangtze River Bureau of Hydrological and Water Resources Survey for the case study data.
DATA AVAILABILITY STATEMENT
All relevant data are available from an online repository or repositories (http://www.hydroshare.org/resource/93f1f580de88403a8c52d2b3238297eb).