Using long short-term memory networks for river flow prediction


Deep learning has made significant advances in methodologies and practical applications in recent years. However, there is a lack of understanding of how long short-term memory (LSTM) networks perform in river flow prediction. This paper assesses the performance of LSTM networks to understand the impact of network structures and parameters on river flow predictions. Two river basins with different characteristics, i.e., the Hun river and Upper Yangtze river basins, are used as case studies for 10-day average flow predictions and daily flow predictions, respectively. The use of a fully connected layer with an activation function before the LSTM cell layer can substantially reduce learning efficiency. On the contrary, a non-linear transformation following the LSTM cells is required to improve learning efficiency, due to the different magnitudes of precipitation and flow. The batch size and the number of LSTM cells are sensitive parameters and should be carefully tuned to achieve a balance between learning efficiency and stability. Compared with several hydrological models, the LSTM network achieves good performance in terms of three evaluation criteria, i.e., the coefficient of determination, the Nash–Sutcliffe Efficiency and the relative error, which demonstrates its powerful capacity in learning non-linear and complex processes in hydrological modelling.


INTRODUCTION
The rainfall-runoff process of a river basin is normally characterized by a high degree of non-linearity. The process is one of the most important components of the hydrological cycle, and its accurate modelling is crucial for water resources and flood management (Clarke ; Nourani ). Rainfall-runoff models are usually classified into three main classes: distributed, conceptual and black box models (Clarke ).
Distributed and conceptual models are based on various hydrological processes; however, they are limited by our understanding of these processes, our ability to represent them accurately, and the available computational resources. By contrast, black box models are normally data-driven but can provide an accurate prediction in many situations (Tanty & Desmukh ; Nourani ). Artificial neural network (ANN) models are typical black box models. Since Daniell () applied ANNs to streamflow modelling, ANNs have been widely applied in hydrological modelling because of their strong non-linear fitting ability (ASCE Task Committee a, b). Currently, various ANN models have been employed to study the rainfall-runoff process, such as fuzzy neural networks (Nayak et al. ), wavelet neural networks (Wang & Ding ; Alexander & Thampi ) and Bayesian neural networks (Bateni et al. ; Kayabası et al. ). Traditionally, the ANN learns the relationships between input and output variables from historical data. To capture temporal dependencies, the recurrent neural network (RNN) uses loops to allow the information from previous time steps to be passed to the next time step. However, the gradient vanishing or explosion problem makes the RNN gradually lose the ability to learn long-distance information (Bengio et al. ). To overcome this deficiency, the long short-term memory (LSTM) network, a special type of RNN, was developed for learning with long sequence data (Hochreiter & Schmidhuber ), as it is capable of learning long-term dependencies in the data series. Based on the concept of LSTM networks, many similar networks have been constructed to improve the learning ability for different tasks (Bellegarda & Monz ; Greff et al. 2017). At present, the LSTM has been successfully used in speech recognition and text translation (Bellegarda & Monz ; Rocha et al. ).
In the last several years, LSTM networks have been tested and studied in watershed hydrological modelling, and their potential has been demonstrated in many applications, such as river flow and flood predictions (Shen ). Kratzert et al. () applied the LSTM network to simulate the daily flows of 241 basins and found that it greatly outperforms hydrological models calibrated both at the regional level and at the individual basin level.
The main aim of this paper is to assess the performance of LSTM networks in river flow prediction in terms of LSTM structures and parameters. In this study, LSTM networks with different network structures, i.e., fully connected layers and LSTM cells, are trained and their performances compared using two case studies with different characteristics: the Hun river basin and the upper river basin of the Yangtze River, China. The trained LSTM networks are used to predict the river flows in the two case study river basins. Finally, the LSTM networks are compared with four models, i.e., SWAT, the Xinanjiang model (XAJ), a multiple linear regression model (MLR) and a back-propagation neural network (BP).

METHODOLOGY
In this section, the LSTM network for flow simulation and prediction is first presented, and the key components, including the network structure, LSTM cells and loss function, are explained. Then the data pre-processing and evaluation criteria used in this study are introduced. Finally, the different simulation scenarios designed to study the performance of the LSTM network are explained.

Network structure
In the LSTM network, the key components are fully connected layers and LSTM cells. As shown in Figure 1, the precipitation data from the meteorological stations are used as inputs:

$$X = [x_t, x_{t+1}, \ldots, x_{t+T-1}]$$

where m is the number of meteorological stations; T is the batch size of precipitation data; t is the start time step; x_t is the precipitation vector at time step t, represented as $x_t = (x_t^1, x_t^2, \ldots, x_t^m)$. The observed flow data at hydrological stations are used as targets for training, i.e., to compare with the simulated flows from the LSTM network:

$$q_t = (q_t^1, q_t^2, \ldots, q_t^g)$$

where g is the number of hydrological stations; $q_t^g$ is the observed flow of the gth hydrological station at time step t; q_t is the flow vector at time step t.
During the training process of the LSTM network at a time step, as shown in Figure 1, the fully connected layer a transfers the precipitation vector (e.g., x_t) with m dimensions into n dimensions (i.e., n is the number of LSTM cells) as

$$y_{out} = W x_t + b$$

where $y_{out}$ is the output vector of layer a; W and b are the weight matrix and bias, respectively; n is the total number of LSTM cells.
After recurrent learning of the LSTM cells, an output of n dimensions is generated and sent to the fully connected layer b1. The fully connected layers b1 and b2 are neural layers with activation functions, used to transfer the LSTM cell output to flow.

LSTM cell structure

The 'forget gate' determines what information in the hidden state is forgotten, shown as the f_t process in Figure 2.
By the forget gate, the precipitation information of the past time steps can be recalled at the current time step as

$$f_t = \sigma(W_f \cdot [H_{t-1}, x_t] + b_f)$$

where σ represents the sigmoid network layer, in which the sigmoid function is used as the activation function; x_t is the input data; $W_f$ and $b_f$ are the weight matrix and bias in the sigmoid network layer, respectively; $f_t$ is the forget vector with values in the range [0, 1], where 1 means 'completely reserved' and 0 means 'completely forgotten'.
The 'input gate' determines what information in the cell state $C_t$ is to be updated by $x_t$ and $H_{t-1}$. There are a sigmoid layer and a tanh layer at this gate. The tanh layer produces a candidate state, which determines how the information of the cell state is to be updated according to $x_t$ and $H_{t-1}$ as

$$\tilde{C}_t = \tanh(W_C \cdot [H_{t-1}, x_t] + b_C)$$

where $\tilde{C}_t$ is a one-dimensional matrix with values in the range [−1, 1]; $W_C$ and $b_C$ are the weight matrix and bias in the tanh network layer in the 'input gate'.
The sigmoid layer in the 'input gate' determines the information in the hidden state to participate in the update, which operates as

$$i_t = \sigma(W_i \cdot [H_{t-1}, x_t] + b_i)$$

where $i_t$ is a one-dimensional matrix with values in the range [0, 1]; $W_i$ and $b_i$ are the weight matrix and bias in the sigmoid network layer in the 'input gate'.
Combining the outputs from the 'forget gate' and 'input gate', the information in the cell state $C_t$ can now be updated by

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

where the first component represents the pass-through information from the forget gate and the second component represents the update information from the input gate. In this way, the impact of precipitation from previous time steps on the runoff at the current time step can be learned.
The 'output gate' uses the sigmoid layer to determine which information of the hidden state is taken as the output:

$$o_t = \sigma(W_o \cdot [H_{t-1}, x_t] + b_o)$$

where $W_o$ and $b_o$ are the weight matrix and bias in the output gate; $o_t$ is the output of the LSTM cell. The hidden state $H_t$ can then be determined from the cell output $o_t$ and the cell state $C_t$ as

$$H_t = o_t * \tanh(C_t)$$
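The gate equations above can be sketched as a single NumPy time step. The parameter dictionary keys mirror the symbols in the text (W_f, b_f, etc.); the concatenated input layout and the matrix shapes are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, H_prev, C_prev, params):
    """One time step of an LSTM cell, following the forget/input/output
    gate equations in the text. H_prev and C_prev are the previous hidden
    and cell states; params holds the weight matrices and biases."""
    z = np.concatenate([H_prev, x_t])                     # [H_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])  # candidate state
    C_t = f_t * C_prev + i_t * C_tilde                    # cell state update
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate
    H_t = o_t * np.tanh(C_t)                              # new hidden state
    return H_t, C_t
```

With m input stations and n cells, each weight matrix has shape (n, n + m) because the previous hidden state is concatenated with the input vector.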

Loss function
In this study, the observed flow data of the hydrological stations are used as targets, and the loss function is the mean square error between the simulated and observed flows:

$$MSE = \frac{1}{T} \sum_{t=1}^{T} (q_t - y_{out,t}^f)^2$$

where T is the batch size of training samples for each training; $q_t$ is the target value at time step t; $y_{out,t}^f$ is the LSTM network output; MSE is the mean square error.

Pseudo code of LSTM network
The pseudo code of the LSTM network training is shown in Algorithm 1. In the pseudo code, the parameters include the batch size of training input data (T), the dimensions of the input data (m), the dimensions of the output flow vector (g), the LSTM cell size (n) and the input and output dimensions of the fully connected layers (i.e., h and k). In the case study, the LSTM network training ends after 1,500 epochs.

Algorithm 1 | The pseudo code of the LSTM neural network.

Initialization: The number of LSTM cells is n; set the initial states $C_{k,0}$ and $H_{k,0}$ (k ∈ n) to zero matrices.

Fully connected layer a: Transform the input matrix [T, m] to [T, n].

LSTM cells: Update the states $C_t$ and $H_t$ using Equations (8) and (10), respectively; generate the cell output $o_{k,t}$ using Equation (9). Get the outputs of the LSTM cells after the iteration of the loop: [T, n].

Fully connected layers b1 and b2: Get the outputs $y_{out,t}^f$ by transforming the matrix [T, n] to [T, g].

Loss function: Comparing the simulated flows ($y_{out,t}^f$) and the observed flows ($q_t$), the loss value is evaluated using Equation (11).

Weight updating: Based on the loss value, the weights of the networks are updated using the Adam algorithm.
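The whole forward pass described above (layer a, the LSTM loop, then layers b1 and b2, scored with the MSE loss) can be sketched in NumPy. The parameter names, the tanh activation on layer b1 and the hidden size h are assumptions for illustration; training (Adam updates) is omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Forward pass for one batch X of shape (T, m): layer a maps
    m -> n, the LSTM cells are iterated over the T time steps, and
    layers b1/b2 map n -> h -> g to produce the simulated flows."""
    T, m = X.shape
    n = params["b_f"].size
    H = np.zeros(n)
    C = np.zeros(n)
    g = params["W_b2"].shape[0]
    outputs = np.empty((T, g))
    for t in range(T):
        y_a = params["W_a"] @ X[t] + params["b_a"]        # layer a: m -> n
        z = np.concatenate([H, y_a])                      # [H_{t-1}, y_a]
        f = sigmoid(params["W_f"] @ z + params["b_f"])    # forget gate
        i = sigmoid(params["W_i"] @ z + params["b_i"])    # input gate
        C = f * C + i * np.tanh(params["W_C"] @ z + params["b_C"])
        o = sigmoid(params["W_o"] @ z + params["b_o"])    # output gate
        H = o * np.tanh(C)                                # hidden state
        h1 = np.tanh(params["W_b1"] @ H + params["b_b1"]) # layer b1: n -> h
        outputs[t] = params["W_b2"] @ h1 + params["b_b2"] # layer b2: h -> g
    return outputs

def mse_loss(y_pred, q_obs):
    """Equation (11): mean square error over the batch."""
    return np.mean((y_pred - q_obs) ** 2)
```

In a real implementation the loss would be back-propagated through this graph and the weights updated with Adam, as Algorithm 1 states.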

Structure scenarios
In this section, the scenarios of different LSTM structures and parameters designed to test the network performance are presented in Table 1, together with the parameter values tested in this study.

Streamflow data pre-processing
In a large river basin with multiple flow stations, the flow rates at different stations may vary over a wide range due to the different sizes of their drainage areas. The difference may cause the network to ignore small flows, leading to learning inefficiency or failure. Thus, the flow series of each hydrological station is pre-processed as

$$q'_{t,i} = \frac{q_{t,i}}{\bar{q}_i}, \quad \bar{q}_i = \frac{1}{N} \sum_{t=1}^{N} q_{t,i}$$

where $q_{t,i}$ represents the observed flow of hydrological station i at time step t; $\bar{q}_i$ represents the mean value of the observed flow series of hydrological station i; $q'_{t,i}$ represents the pre-processed flow; g is the number of hydrological stations; N represents the length of the flow data.
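A minimal sketch of this per-station scaling. The exact transform used in the paper is reconstructed here as a division by the station's long-term mean, which is an assumption; only the mean is defined in the text:

```python
import numpy as np

def preprocess_flows(Q):
    """Scale each station's flow series by its long-term mean so that
    stations with very different drainage areas contribute on a
    comparable scale. Q has shape (N, g): N time steps, g stations."""
    q_bar = Q.mean(axis=0)   # mean flow of each station
    return Q / q_bar
```

After this scaling, every station's series has mean 1, so large and small rivers carry similar weight in the loss.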

Model evaluation criteria
In this study, the simulation performances of the models are evaluated by the following three criteria. The coefficient of determination ($R^2$) provides a statistical measure of how well a hydrological model explains and predicts flows, and it indicates the level of explained variability in the data set. The Nash–Sutcliffe Efficiency (NSE) is used to quantitatively describe the accuracy of the hydrological model (Nash & Sutcliffe ). The NSE value lies between 1 and negative infinity; an NSE value of 1 corresponds to a perfect match of simulated flows to observed data:

$$NSE = 1 - \frac{\sum_{t=1}^{N} (q_{o,t} - q_{s,t})^2}{\sum_{t=1}^{N} (q_{o,t} - \bar{q}_o)^2}$$

The relative error (RE) measures the relative deviation of the simulated total flow volume from the observed one:

$$RE = \frac{\sum_{t=1}^{N} (q_{s,t} - q_{o,t})}{\sum_{t=1}^{N} q_{o,t}} \times 100\%$$

where $q_{s,t}$ and $q_{o,t}$ represent the simulated and observed flow at time step t, respectively; $\bar{q}_s$ and $\bar{q}_o$ represent the means of the simulated and observed flows, respectively.
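The three criteria can be computed directly. The functions below implement the standard definitions (with R² taken as the squared Pearson correlation between simulated and observed flows, a common convention):

```python
import numpy as np

def nse(q_sim, q_obs):
    """Nash-Sutcliffe Efficiency: 1 is a perfect match; a model no
    better than the observed mean scores 0."""
    return 1.0 - np.sum((q_obs - q_sim) ** 2) / np.sum((q_obs - q_obs.mean()) ** 2)

def r_squared(q_sim, q_obs):
    """Coefficient of determination as the squared Pearson correlation."""
    return np.corrcoef(q_sim, q_obs)[0, 1] ** 2

def relative_error(q_sim, q_obs):
    """Relative error of the total flow volume, in percent."""
    return 100.0 * (np.sum(q_sim) - np.sum(q_obs)) / np.sum(q_obs)
```

Note the different sensitivities: NSE penalizes errors on peak flows heavily (squared residuals), while RE only checks the overall water balance, which is why the paper reports all three.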

Study area
In this study, two basins are taken as case studies, the Hun river basin and the upper Yangtze river basin, which are located in the northeast and southwest of China, respectively ( Figure 3). The basic characteristics of the two basins are shown in Table 2.
The Hun river originates from the Changbai Mountain. The annual precipitation increases from northwest to southeast, and about 70% of the precipitation occurs from May to September.

Data source
In the Hun river basin, the 10-day average meteorological data from 10 stations (i.e., m = 10) and river flow data from the

Comparison models
In the Hun river basin case study, four models are constructed as comparison models to evaluate the performance of the proposed LSTM network, i.e., SWAT, XAJ, MLR and BP. In the upper Yangtze river basin, SWAT is used only.

SWAT model
The calibrated parameters of the SWAT model are shown in Table 3.

XAJ model
The XAJ model is a conceptual rainfall-runoff model, which was developed by Zhao (). This model has been widely used in China, particularly in humid and semi-humid regions (Xu et al. ). The XAJ model assumes that runoff is not produced until the soil water content of the aeration zone reaches its field capacity. The actual evapotranspiration and runoff are calculated using Equations (14) and (15), respectively.
The calibrated parameters of the XAJ model are shown in Table 4. In this model, the surface runoff, interflow and groundwater are routed using instantaneous unit hydrographs, and the parameters (n, k) of the three hydrographs are set to (3, 4), (4, 5) and (4.5, 7), respectively.

MLR model
In the MLR model, the calibrated parameters are shown in Table 5.
$F_t$ is the simulated runoff at time step t; $Q_{t-1}$ is the observed runoff at time step t−1; $P_t$ is the average precipitation at time step t.
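Assuming the MLR takes the linear form F_t = a·Q_{t−1} + b·P_t + c (the exact regression form and the Table 5 values are not reproduced in this excerpt, so the coefficient names are hypothetical), the model can be calibrated by ordinary least squares:

```python
import numpy as np

def fit_mlr(q_lag, p, q_target):
    """Least-squares fit of the assumed MLR form
    F_t = a * Q_{t-1} + b * P_t + c.
    q_lag, p and q_target are 1-D arrays of equal length."""
    A = np.column_stack([q_lag, p, np.ones_like(p)])  # design matrix
    coeffs, *_ = np.linalg.lstsq(A, q_target, rcond=None)
    return coeffs  # (a, b, c)
```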

BP neural network
The BP model takes the flows at time steps t−2 and t−1 and the average precipitation of the watershed at time steps t−2, t−1 and t as inputs, and takes the flow at time step t as output. The BP model is constructed as a four-layer neural network, and the numbers of nodes in the layers are 5 (input layer), 50 (hidden layer), 50 (hidden layer) and 1 (output layer), respectively. The outputs of the hidden layers are transformed by the sigmoid activation function. The BP network was trained for 900 epochs, by which point it had already converged.
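The described 5-50-50-1 architecture corresponds to the following forward pass (a sketch only; the trained weight values are not available, so random weights stand in for them):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bp_forward(x, weights, biases):
    """Forward pass of the four-layer BP network: two sigmoid hidden
    layers of 50 nodes each and a linear single-node output layer.
    x is the 5-element input (two lagged flows, three precipitation
    values)."""
    h1 = sigmoid(weights[0] @ x + biases[0])   # hidden layer 1: 5 -> 50
    h2 = sigmoid(weights[1] @ h1 + biases[1])  # hidden layer 2: 50 -> 50
    return weights[2] @ h2 + biases[2]         # output layer: 50 -> 1
```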

RESULTS AND DISCUSSION
The network structure and parameters have a great influence on the learning efficiency. In this study, the LSTM networks are tested on the Hun and Upper Yangtze river basins, respectively. First, the effects of LSTM network structure are analysed and the performances of the network parameters are evaluated in terms of the number of cells (n) and the batch size (T ). Then, the structure and parameters with the best performance are selected to predict the river flows of the two study cases. Finally, the performances of the LSTMs are compared with the results from the comparison models.

Effects of activation function
Scenarios A1, A2 and A3 are trained for the Hun river and Upper Yangtze river case studies, and the variations of the loss function values are shown in Figure 4. For each network structure, the network weights are trained for 1,500 epochs.
The training loss values of each scenario in Figure 4 are the average values of 10 independent training runs. Note that the model parameter values used for the results in Figure 4 are the optimal values identified in Table 6. The results from the two case studies in Figure 4 show the following key points: (1) The loss value of A1 decreases rapidly, demonstrating rapid learning.
(2) The inputs in A2 are non-linearly transformed by activation functions before being passed to the LSTM cells; as a result, the LSTM cells cannot effectively capture the long-term time dependencies in the time series data.
Thus, the loss values cannot be reduced rapidly during the training.
(3) In A3, there is only a simple linear transformation between the LSTM cells and output layer. This makes learning difficult with a slow reduction in loss values before they start to increase after 500 epochs.
The results in Figure 4 show that the LSTM structures in Scenarios A2 and A3 cannot provide efficient learning, and their outputs cannot match the target values well.

Effects of fully connected layers
The test results from Scenarios B1 and B2 are shown in Figure 5. The magnified view of the training loss during epochs 1,250 to 1,500, shown in Figure 5(a2,b2), reveals that Scenario A1 converges faster and more stably than the other two scenarios. This implies that the network does not need a large number of fully connected layers between the LSTM cells and the output layer to improve the learning efficiency.
Therefore, the LSTM structure in A1 is selected to predict the river flows in the two case study basins.

Effects of batch size

With T increasing, the training samples used for learning can gradually reflect the periodicity of the data; therefore, the fluctuation of the loss values is gradually reduced. Figure 6(a2,b2) shows the magnified training loss curves during epochs from 1,250 to 1,500. The results indicate that the learning curves are stable after T = 50 and T = 180 for the Hun river and Upper Yangtze river basins, respectively. Figure 7 shows the means of the loss values over epochs 1,250 to 1,500 in Figure 6. The results show that the loss means gradually decrease with increasing T, which indicates that the learning efficiency of the LSTM network is gradually improved. In the case of the Upper Yangtze river, when T > 180, the loss value cannot be reduced further.

Effects of cell number
The LSTM cell is the core component of the LSTM network, and the number of cells (n) determines the performance of the network. Figure

Performance evaluation
Based on the above analysis of the structure and parameters of the LSTM network, the structure and parameters with the best performance, as shown in Table 6, are selected to learn and predict the flows of the Hun river and Upper Yangtze river, respectively.

Hun river basin

In the verifying process, the predictive ability of the LSTM is shown in Table 7 to be slightly worse than that of the SWAT. However, the LSTM slightly outperforms the XAJ model in terms of NSE and is much better than the other models. Figure 9(b) shows the predicted and observed flows from 2000 to 2010. Though the predicted flows from the LSTM do not match the observed flows well in some periods, most peak flows are predicted well. This is clearly shown in the scatter plots of the observed and simulated flows from the training and verification periods in Figure 10.
Though the overall performances of SWAT and XAJ in terms of the three criteria are better than those of LSTM, LSTM performs much better for peak flows, which are of particular concern in flood predictions.

Upper Yangtze river
In the Upper Yangtze river basin, the daily data of 57 meteorological stations are used as inputs, and the daily flow of six hydrological stations is used as target values.

W. Xu et al. | Using long short-term memory networks for river flow prediction Hydrology Research | in press | 2020

The flows of the six stations are simulated using LSTM and SWAT and their performances are shown in Table 8.
In the training period, the LSTM achieves very high performance for the flows of the six stations. The NSE and $R^2$ values indicate that the LSTM outperforms the SWAT during training. Figure 11 shows the simulated flows at one station (WZ).
The scatter plots of simulated and predicted flows for six hydrological stations are shown in Figure 12. The results indicate that the performance of LSTM in the verifying period is worse than that in the training period. Predicted peak flows are likely to be lower than those observed.
Note that in Upper Yangtze river, the LSTM network is constructed to predict flows at multiple stations at the same time. The training results show that the LSTM network has a strong fitting ability to learn the flow data of multiple hydrological stations.

DISCUSSION
LSTM network training is significantly affected by the training dataset size. It is generally understood that the network training requires a large amount of training sample data.
However, the required dataset size depends on the characteristics of the catchment and the flows of concern, which determine the complexity of the input-output relationships represented by the LSTM. The LSTM has been shown to perform well with datasets smaller than the 30 years of data used in the Hun river basin. For example, Kratzert et al. () used the daily meteorological data and observed flow data from 15 years in 241 catchments to train LSTM networks, whose performances are comparable to those of process-based models. Lee et al. () … (3) training the network off-line for real-time predictions.
The second aspect is also important for flow predictions in ungauged catchments as suggested by Kratzert et al. () and provides a new application area in the use of the LSTM network in hydrological predictions.
Transfer learning is powerful and useful in deep learning as it can use the network knowledge gained from solving one problem to improve solving another similar problem.
Due to the focus of this work, transfer learning is not investigated here. Future research should explore the potential of transfer learning from two aspects: (1) building on a reference architecture (e.g. Scenario A1 in this study), the network knowledge (e.g. parameters) could be applied to other similar architectures in solving the same problem so training could be continuously improved using prior network knowledge; (2) transferring the LSTM learning knowledge from data-rich catchments to data-scarce catchments, so the flow predictions in data-scarce catchments could be improved.

CONCLUSIONS
In this study, the performance of LSTM networks is assessed for river flow simulations using two river basins: the Hun river and Upper Yangtze river. Different LSTM structures are analysed, and the prediction performances are compared against other models, including hydrological models and data-driven models. The key research conclusions are summarized below. The LSTM has superior non-linear learning ability for time series data and has a simple structure and few parameters, giving it strong application potential in streamflow simulation. The results of this study show that its non-linear learning ability in the training process is very powerful.
The LSTM networks achieve good performance compared to other models considering three criteria, i.e., NSE, $R^2$ and RE. In the case of the Hun river, the LSTM outperforms the MLR and BP models and achieves a similar level of accuracy to the XAJ model. It is slightly worse than a well-calibrated SWAT model but provides more accurate predictions for peak flows. In the case of the Upper Yangtze river, the LSTM outperforms SWAT during training but is worse than SWAT in the verification period. This is mainly because the predicted peak flows are likely to be lower than those observed.