## Abstract

The Kelantan River (Sungai Kelantan) basin in Malaysia is one of the essential catchments as it has a history of flood events. Numerous studies have been conducted on river basin modelling for flow prediction and flood mitigation as well as water resource management. Multi-step-ahead forecasting of river flow (RF) is therefore of important research interest. This study presents four different approaches to multi-step-ahead forecasting of the Kelantan RF, using NARX (nonlinear autoregressive with exogenous inputs) neural networks and a deep learning recurrent neural network called LSTM (long short-term memory). The dataset consists of monthly records spanning 29 years, from January 1988 to December 2016. The results show that the two recursive methods, using NARX and LSTM, are able to perform multi-step-ahead forecasting over 52 steps of the test dataset with NSE (Nash–Sutcliffe efficiency coefficient) values of 0.44 and 0.59 for NARX and LSTM, respectively. For few-step-ahead forecasting, LSTM with a direct sequence-to-sequence approach produces promising results, with a good NSE value of 0.75 in the case of two-step-ahead forecasting. However, it needs a larger dataset to perform better in longer-step-ahead forecasting; compared with other studies, the dataset used here is much smaller.

## HIGHLIGHTS

Four different approaches for multi-step-ahead forecasting for river flow.

Two forecasting approaches are performed in a multivariate setting using NARXNN (nonlinear autoregressive with exogenous inputs neural networks), in recursive and direct autoregressive modes, respectively.

Two forecasting approaches are performed using LSTM (long short-term memory), with recursive univariate and sequence-to-sequence multivariate approaches, respectively.

LSTM with a direct sequence-to-sequence multivariate approach performs the best for few-step-ahead forecasting given the limited data size. It has promising application for longer-step-ahead forecasting provided that the data size is sufficiently large.

## INTRODUCTION

In a time-series sequence, prediction can be classified into two categories, known as single-step-ahead and multi-step-ahead (Saroha & Aggarwal 2014). Predicting multiple time steps into the future is called multi-step time-series forecasting: the forecasting of some variables for several future time steps, given a significant span of historical data (Dabrowski *et al.* 2020). The extent of future forecasting is known as the forecasting horizon. Highly chaotic time series and those with missing data pose important issues in multi-step-ahead prediction, which have been addressed using nonlinear filters and neural networks (Chandra *et al.* 2021). According to Guo *et al.* (2020), of the five strategies suggested in the literature to address a multi-step-ahead prediction task, the recursive strategy is the one normally applied across several fields. Besides the recursive strategy, sequence-to-value and sequence-to-sequence forecasting approaches have also been used in multi-step-ahead forecasting (Wunsch *et al.* 2021).

Numerous studies have been conducted on multi-step-ahead predictive modelling. Hernandez-Ambato *et al.* (2017) conducted research to predict the future reservoir level of a hydroelectric dam at a power station located in Ecuador. Open-Loop Prediction (OLP) and Closed-Loop Prediction (CLP) techniques were used. Based on 22 days of testing in a real environment, comparing both model types over the same horizon and season, the CLP models performed better than the OLP models. Yu *et al.* (2011) applied eight different training algorithms to forecast the water level 1–5 days ahead in the Heshui catchment in China. All of the models gave satisfactory results in 1-day-ahead forecasting, where *R*^{2} values ranged from 0.85 to 0.932 and RMSE values ranged from 0.094 to 0.115. However, the performance of all models dipped as the horizon increased from 1 to 5 days ahead. Kisi *et al.* (2015) applied multi-step-ahead modelling to forecast daily water levels of Urmia Lake in Iran for horizons of 1–7 days ahead. The models were established using a support vector machine (SVM) together with a firefly algorithm (FA). The authors observed that the 1-day-ahead forecasting model, achieving an *R*^{2} of 0.9999, was considerably more accurate than the 7-day-ahead prediction. Other previous studies using the same approach include Chang *et al.* (2014), who established real-time multi-step-ahead water level prediction using an RNN for urban flood control at the Yu-Cheng Pumping Station, Taiwan, as well as Saroha & Aggarwal (2014), who applied multi-step-ahead prediction of wind power using genetic algorithm (GA)-based neural networks.
RNN was also used by Granata & Di Nunno (2021), where RNN-based models using LSTM were built for the prediction of short-term-ahead actual evapotranspiration. Under the subtropical climatic conditions of South Florida, the LSTM models proved to be more accurate than NARX models, while some exogenous variables such as sensible heat flux and relative humidity did not affect the results significantly.

River flow (RF) is a significant parameter in the hydrological cycle and the one most readily measured at the local scale. By capturing some of the vital relationships between social, physical, environmental and financial processes, this water parameter occupies an essential position in hydrology (Bhagwat & Maity 2012). For over half a century, RF prediction has attracted the interest of researchers. Nowadays, there are two techniques for water flow prediction. The first is a physically based approach, using mathematical models that describe the hydrodynamics of water flow. The second comprises data-driven models constructed on a statistical relationship between the input and output variables (Le *et al.* 2019). RF prediction can prevent potential flood damage and water shortages as well as assist water management in the agricultural field, thus providing large financial benefits (Atiya *et al.* 1999). However, river flow has been found difficult to forecast due to the nonlinear, time-varying and indeterminate nature of river flow data (Zaini *et al.* 2018).

Zaini *et al.* (2018) used an SVM-based model together with particle swarm optimization (PSO) to predict short-term daily RF at the Upper Bertam Catchment located in Cameron Highlands, Malaysia. The authors built four SVM models, where SVM1 and SVM-PSO1 used only historical data, while SVM2 and SVM-PSO2 used historical data and meteorological variables as input. The outcomes showed that SVM2 and SVM-PSO2 outperformed SVM1 and SVM-PSO1; moreover, the performance of the hybrid models was much better than that of the basic SVM models. Sahoo *et al.* (2019) applied LSTM together with RNN (LSTM-RNN) to predict low-flow series using daily discharge data obtained from the Basantapur gauging station, India. The results showed that the LSTM-RNN model outperformed the RNN model and other naïve methods used for comparison; the values of *R* (correlation coefficient), Nash–Sutcliffe efficiency (NSE) and RMSE obtained for LSTM-RNN were 0.943, 0.878 and 0.487, respectively. Moreover, Noor *et al.* (2017) applied the NARX model to predict rainfall-based RF in the Pelarit and Jarum rivers, Perlis, Malaysia, and successfully modelled the river flow 1 day (24 h) in advance from the current rainfall rates. The authors achieved values of 0.045, 0.013 and 0.9985 for RMSE, MAPE and *R*^{2}, respectively, in the Jarum River, while in the Pelarit River the corresponding values were 0.0113, 0.0038 and 0.999. Furthermore, Zhang *et al.* (2019) applied a time-series analysis model using the autoregressive integrated moving average (ARIMA) as well as a multilayer perceptron neural network (MLPNN) to forecast wastewater inflow at the Barrie Wastewater Treatment Facility in Canada.

Given the proven performance of recurrent neural networks (RNNs), particularly for intricate nonlinear systems, a dynamic neural network based on the nonlinear autoregressive model with exogenous inputs (NARX) has been widely adopted. The NARX model, which has a limited number of parameters, is known to be as powerful as a fully connected RNN while being computationally efficient (Tijani *et al.* 2014). It can be considered a form of RNN, having feedback connections around the hidden layers of the network (Marcjasz *et al.* 2019). Furthermore, it can be utilized in multi-time-series input and output applications, being a category of artificial neural network (ANN) suited to modelling time series and nonlinear systems (Ardalani-Farsa & Zolfaghari 2010).

Di Piazza *et al.* (2016) established a NARX model to perform hourly wind speed and solar irradiation prediction based on a multi-step-ahead technique, with temperature as the chosen exogenous variable. The NARX model was optimized by a GA along with an optimal brain surgeon (OBS) strategy. The time horizons used in the research ranged from 8 to 24 h ahead; the best outcomes were at 8 and 24 h for wind speed and at 8 and 10 h for solar irradiation. Di Nunno & Granata (2020), on the other hand, established a NARX model to forecast daily groundwater level fluctuations for 76 monitored springs located in the Apulian territory, Italy. Daily groundwater level data and training algorithms such as Levenberg–Marquardt (LM), Bayesian Regularization (BR) and Scaled Conjugate Gradient (SCG) were used. The outcomes showed that the BR training algorithm produced greater accuracy in groundwater level prediction. In another study (Di Nunno *et al.* 2021), an unprecedented application of NARX neural networks to the prediction of spring flows was shown. Discharge prediction models were developed for nine monitored springs located in the Umbria region, along the carbonate ridge of the Umbria-Marche Apennines. Precipitation was also considered as an exogenous input parameter. Good performances were achieved for all springs and for both short-term and long-term predictions, with lag times from 1 to 12 months. Furthermore, Wunsch *et al.* (2018) applied NARX to forecast short- and mid-term groundwater levels for several groups of aquifers in the states of Baden-Württemberg, Bavaria and Hesse, Germany, with precipitation and temperature as model inputs. The NARX model was successfully used to conduct groundwater predictions for uninfluenced observation wells in all aquifer types, but gave contrasting results for influenced observation wells.
In addition, RNNs were applied to simulate the operation of three multipurpose reservoirs located in the upper Chao Phraya River basin (Yang *et al.* 2019). Three RNNs, namely the nonlinear autoregressive model with exogenous inputs (NARX), LSTM and a genetic algorithm-based NARX (GA-NARX), were built for reservoir operation based on historical data. The results showed that GA-NARX had the highest accuracy among the three RNNs and was more stable than the original NARX owing to the optimization of the initial conditions, although it took a longer training time than NARX and LSTM.

Apart from NARX, the LSTM deep learning network, an extension of the RNN, also has a strong capacity for time-series data prediction. The major contrast between a plain RNN and LSTM is that LSTM can suitably map between input and output data while retaining long-range temporal dependencies (Abbasimehr *et al.* 2020). According to Zhou *et al.* (2018), the LSTM network is highly suitable for sequential data such as water quality data since it has a dedicated memory function. An LSTM network contains several gates, namely forget, output and input gates, as well as a memory cell in every neuron (Vu *et al.* 2020; Jia & Zhou 2020).

A number of studies have been conducted using the LSTM model. Zhang & Jin (2020) used an autoencoder together with LSTM to forecast the concentrations of total phosphorus (TP) and total nitrogen (TN) using 13 different variables as inputs. The TP model outperformed the TN model, with *R*^{2} and RMSE values of 0.924 and 0.0002, respectively, while the TN model achieved *R*^{2} and RMSE values of 0.909 and 0.024, respectively. Liu *et al.* (2019) applied LSTM to forecast dissolved oxygen (DO) for the next 10 days (*m* = 10) and the next 6 months (*m* = 181), achieving MSE values of 0.0017 and 0.0020 for *m* = 10 and *m* = 181, respectively. Furthermore, water quality forecasting based on the LSTM network was used by Hu *et al.* (2019) to forecast water temperature and pH values. The forecasting accuracies were 98.97% and 98.56% for water temperature and pH, respectively; for long-term forecasting, accuracies of up to 96.88% and 95.76%, respectively, were achieved. Le *et al.* (2019) also established LSTM models to predict floods at the Hoa Binh Station on the Da River, Vietnam.

Based on the literature review presented, it can be observed that LSTM and NARX forecasting models are commonly used in multi-step-ahead prediction involving nonlinear multivariate data. The challenge in multi-step-ahead modelling is that a model normally achieves high accuracy only for one or a few steps ahead; in other words, most of the high-accuracy studies conducted using NARX and LSTM apply to forecasting only one or a few steps ahead. This study aims to build a model for multi-step-ahead forecasting of RF based on NARX neural networks and LSTM in univariate and multivariate approaches. In NARX, we use both direct autoregression and a recursive model; unlike the recursive NARX approach, the direct autoregression NARX approach trains a multi-step-ahead prediction model directly, without requiring future exogenous inputs to make multi-step predictions. In LSTM, we use a direct sequence-to-sequence multivariate model and also a recursive univariate model, both of which are discussed in the Methodology section.

## METHODOLOGY

### Study area

The Kelantan river basin is one of the important catchments as it has a history of flood events (Pradhan & Youssef 2011; Nashwan *et al.* 2018). The catchment represents most of the land area of Kelantan State, as shown in Figure 1. Several stations such as rainfall, water level, evaporation, water quality and meteorological stations operate in the area.

### RF data

Machine learning and deep learning algorithms, such as ANN and LSTM, have proven successful in data-driven predictive modelling applications. The main ingredient for the success of predictive modelling is the data itself, in addition to the training algorithm developed. In this study, the RF data were collected together with some meteorological parameters. The time-series RF data were collected north of Kuala Krai city, downstream of the merger of the two main tributaries before discharging into the sea. The original data consist of 348 monthly samples of the Sungai Kelantan RF spanning January 1988 to December 2016 (29 years). Rainfall and evaporation are usually measured at fixed stations, and the computed area-weighted rainfall (WR) is used to estimate the rainfall quantity over the whole area (Faisal & Gaffar 2012). The RF was mainly from one main station (Guillemard), while the WR and evaporation (secondary data) cover the whole river catchment. Table 1 shows the attributes of the data collected and used in this study and their basic statistical properties. The time-series graph is shown in Figure 2. As can be seen, the RF pattern closely follows the time-series pattern of the WR, while the relation of the average evaporation (AE) pattern to RF cannot be visibly observed as its amplitude is relatively small.

| Name of variable | River flow | Weighted rainfall | Averaged evaporation |
|---|---|---|---|
| Notation (unit) | RF (mm) | WR (mm) | AE |
| Max value | 2,853.2 | 725.8 | 173.8 |
| Min value | 70.5 | 3.0 | 57.6 |
| Mean value | 471.8 | 207.7 | 114.4 |


### Preliminary data analysis

Preliminary data analysis and pre-processing are crucial initial stages of machine learning model building, as they can significantly affect the prediction accuracy for any type of data (Hayder *et al.* 2020). The purpose of this study is to build a multi-step-ahead predictive model for RF using a machine learning/deep learning-based approach.

There are basically two ways of performing multi-step-ahead forecasts: recursively, by iterating a one-step-ahead model, or directly, using a specific model for each horizon (Ben Taieb *et al.* 2012). In addition, the recorded RF data, together with the other climate variables, are basically time-series data. Thus, the multi-step-ahead time-series prediction model can be developed under two strategies, namely univariate and multivariate time-series prediction. In univariate time-series forecasting, RF is assumed to have its own seasonality and to depend only on its past values. In multivariate time-series forecasting, RF also depends on the other variables and their past values.
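To make the two strategies concrete, the following sketch (our own illustration, not the authors' code) contrasts the recursive strategy, which iterates a one-step-ahead model and feeds predictions back in, with the direct strategy, which uses a dedicated model per horizon; the `persistence` one-step model is a toy stand-in.

```python
# Illustrative sketch of recursive vs. direct multi-step forecasting.

def recursive_forecast(one_step_model, history, k):
    """Iterate a one-step-ahead model k times, feeding each prediction
    back into the input window (recursive strategy)."""
    window = list(history)
    preds = []
    for _ in range(k):
        yhat = one_step_model(window)
        preds.append(yhat)
        window = window[1:] + [yhat]   # slide the window forward
    return preds

def direct_forecast(models_per_horizon, history, k):
    """Use a dedicated model for each horizon h = 1..k (direct strategy)."""
    return [models_per_horizon[h](list(history)) for h in range(k)]

# Toy one-step model: persistence (predict the last observed value).
persistence = lambda window: window[-1]

print(recursive_forecast(persistence, [1.0, 2.0, 3.0], k=4))  # [3.0, 3.0, 3.0, 3.0]
print(direct_forecast([persistence] * 4, [1.0, 2.0, 3.0], k=4))
```

The recursive strategy needs only one trained model but accumulates error as predictions are fed back; the direct strategy avoids feedback error at the cost of training one model per horizon.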

In this study, the multi-step-ahead prediction is performed using a nonlinear machine learning-based model with multivariate and univariate approaches. The two algorithms used are NARX (nonlinear autoregressive with exogenous inputs) Neural Networks and recurrent Neural Networks called LSTM, which fall under the category of deep learning.

The sample autocorrelation coefficient at lag *k* of a time series *y* can be expressed as

$$ r_k = \frac{\sum_{t=k+1}^{N}\left(y_t - \bar{y}\right)\left(y_{t-k} - \bar{y}\right)}{\sum_{t=1}^{N}\left(y_t - \bar{y}\right)^2} $$

where $\bar{y}$ is the sample mean and $N$ is the number of samples. This coefficient is plotted against the lag value, and the resulting autocorrelation function plot is called a correlogram.
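The autocorrelation coefficient and the correlogram can be computed with the standard estimator, sketched here on a toy series with a 12-step period mimicking monthly seasonality (illustrative code, not the authors'):

```python
# Sample autocorrelation coefficient at lag k (standard estimator);
# evaluating it over a range of lags yields the correlogram.

def autocorr(y, k):
    n = len(y)
    mean = sum(y) / n
    num = sum((y[t] - mean) * (y[t - k] - mean) for t in range(k, n))
    den = sum((v - mean) ** 2 for v in y)
    return num / den

# Toy series with an exact 12-step period, mimicking monthly seasonality.
series = [float(abs(6 - (t % 12))) for t in range(60)]
correlogram = [autocorr(series, k) for k in range(25)]
# For this series, lag 12 shows a strong positive peak, as a 12-month
# seasonal RF series would in Figure 7.
```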

### Data preparation

Prior to feeding the data into the machine learning/deep learning algorithms, some data preparation needs to be carried out. First, the data are partitioned into a train and a test split. The training dataset is used to build the predictive model (model training), while the test dataset is used to evaluate the performance of the model after training has been completed. The first 280 samples, i.e. ≈80% (January 1988 to April 2011), are used as the training dataset, while the test set is the sequence from April 2010 to December 2016 (81 samples). Notice that there is some sequence overlap in the data partition, as we do not have a large number of samples to split. In the overlapped sequence, only the target data are used during training, predicted from the feature/input variables in the preceding sequence; thus, the input/feature variables in the overlapped sequence can still be used as test data. The train-test data partition is illustrated in Figure 3. Furthermore, as the data amplitudes span a wide range, scaling is performed to normalize the feature inputs to the range 0–1.
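The preparation steps above can be sketched as follows (our own helper names; the overlap handling for lagged inputs is omitted for brevity). Note the sequential split with no shuffling, which preserves time order, and that the min-max statistics come from the training portion only:

```python
# Sketch of a sequential train/test split and min-max scaling to [0, 1].

def sequential_split(data, n_train):
    return data[:n_train], data[n_train:]

def minmax_scale(train, test):
    lo, hi = min(train), max(train)
    scale = lambda v: (v - lo) / (hi - lo)
    return [scale(v) for v in train], [scale(v) for v in test]

# Stand-in for the 348 monthly RF samples.
rf = [float(70 + i) for i in range(348)]
train, test = sequential_split(rf, 280)        # first 280 samples (~80%)
train_s, test_s = minmax_scale(train, test)
```

Fitting the scaler on the training portion only avoids leaking information from the test period into the model.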

### NARX- and LSTM-based approaches for multi-step-ahead prediction

As mentioned, two machine learning models are used in this study, namely NARX and LSTM. The architecture of NARX can be seen in Figure 4, where two types of connection are common (Thapa *et al.* 2020). NARX is used in multivariate mode to predict step-ahead values of RF based on current and past values (lags) of WR and AE.

The NARX model can be expressed as (Thapa *et al.* 2020):

$$ y(t) = f\big(y(t-1), \ldots, y(t-p),\; x(t-d), \ldots, x(t-d-q+1)\big) $$

where $f$ is the mapping function performed by the neural networks, $x$ is the exogenous input variable(s), $y$ is the output variable, *p* is the autoregressive order, *q* is the exogenous input order and *d* is the exogenous delay. For multi-step-ahead forecasting, the model is applied to predict $y(t+k)$, with *k* being the prediction step.
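As an illustration of how NARX training data are arranged, the hypothetical helper below (our own, not the authors' code) builds one training row per time step from *p* lagged outputs and *q* lagged exogenous inputs with delay *d*:

```python
# Build NARX-style training rows: p lagged outputs y(t-1..t-p) and
# q exogenous inputs x(t-d..t-d-q+1), with target y(t).

def narx_rows(y, x, p, q, d):
    rows, targets = [], []
    start = max(p, d + q - 1)          # earliest t with all lags available
    for t in range(start, len(y)):
        y_lags = [y[t - i] for i in range(1, p + 1)]
        x_lags = [x[t - d - j] for j in range(q)]
        rows.append(y_lags + x_lags)
        targets.append(y[t])
    return rows, targets

y = [float(i) for i in range(20)]
x = [2.0 * i for i in range(20)]
rows, targets = narx_rows(y, x, p=2, q=12, d=0)   # cf. model NARX_2-12-0
# Each row has p + q = 14 features.
```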

Furthermore, LSTM is a type of RNN that was proposed to solve the long-term dependency issue by explicitly introducing a memory unit, called the cell, into the network. A single memory unit makes a decision by considering the current input, previous output and previous memory, and generates a new output and alters its memory. In addition to NARX, LSTM is also used here for multi-step-ahead prediction with two different approaches, namely a sequence-to-sequence multivariate approach (which we call LSTM2) and a univariate approach (which we call LSTM1). The sequence-to-sequence multivariate approach (LSTM2) is illustrated in Figure 5, which indicates how many past steps of the exogenous inputs (*q*) will be used and how many step-ahead predictions will be made (*k*). The training data are arranged as per Figure 5, pairing each *q*-step sequence of input variables with the next *k* target values (including the current step).
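The sequence-to-sequence arrangement can be sketched as follows (hypothetical helper names, our own illustration of the windowing described above): each sample pairs *q* past steps of the multivariate inputs with the next *k* target values.

```python
# Arrange seq-to-seq training samples: q past input steps -> next k targets.

def seq2seq_samples(inputs, target, q, k):
    X, Y = [], []
    for t in range(q, len(target) - k + 1):
        X.append(inputs[t - q:t])    # q past input steps (multivariate)
        Y.append(target[t:t + k])    # next k target values
    return X, Y

# Toy multivariate inputs, e.g. [WR, AE] per month, and RF targets.
exog = [[float(i), 0.5 * i] for i in range(30)]
rf = [float(i) for i in range(30)]
X, Y = seq2seq_samples(exog, rf, q=13, k=2)
```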

The gate and state operations of the LSTM cell can be expressed as follows:

$$ f_t = \sigma\!\left(W_f x_t + U_f h_{t-1} + b_f\right) $$
$$ i_t = \sigma\!\left(W_i x_t + U_i h_{t-1} + b_i\right) $$
$$ o_t = \sigma\!\left(W_o x_t + U_o h_{t-1} + b_o\right) $$
$$ c_t = f_t \circ c_{t-1} + i_t \circ \tanh\!\left(W_c x_t + U_c h_{t-1} + b_c\right) $$
$$ h_t = o_t \circ \tanh\!\left(c_t\right) $$

where $f_t$, $i_t$ and $o_t$ are the forget, input and output gates, $c_t$ is the cell (memory) state, $h_t$ is the hidden state (output), $\sigma$ is the sigmoid function and $\circ$ denotes element-wise multiplication.
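The gate structure described above (forget, input and output gates plus a memory cell) can be traced in a scalar toy case; the weights below are arbitrary illustrative values, not trained parameters:

```python
import math

# Scalar toy LSTM cell step with forget, input and output gates.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    f = sigmoid(w['wf'] * x + w['uf'] * h_prev + w['bf'])   # forget gate
    i = sigmoid(w['wi'] * x + w['ui'] * h_prev + w['bi'])   # input gate
    o = sigmoid(w['wo'] * x + w['uo'] * h_prev + w['bo'])   # output gate
    c_cand = math.tanh(w['wc'] * x + w['uc'] * h_prev + w['bc'])
    c = f * c_prev + i * c_cand      # update cell (long-term) memory
    h = o * math.tanh(c)             # new hidden state / output
    return h, c

w = {name: 0.5 for name in
     ('wf', 'uf', 'bf', 'wi', 'ui', 'bi', 'wo', 'uo', 'bo', 'wc', 'uc', 'bc')}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, w=w)
```

In practice, a framework layer (e.g. a Keras LSTM layer, as the 'Adam'/ReLU setup in Table 5 suggests) vectorizes these operations over many units; the scalar version only exposes the mechanics.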

### Metric of model evaluation

In this study, the obtained model accuracy is evaluated by calculating the RMSE (root mean squared error) and the Nash–Sutcliffe efficiency coefficient (NSE), which are defined, respectively, as follows:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(O_i - P_i\right)^2} $$

$$ \mathrm{NSE} = 1 - \frac{\sum_{i=1}^{N}\left(O_i - P_i\right)^2}{\sum_{i=1}^{N}\left(O_i - \bar{O}\right)^2} $$

where $O_i$ and $P_i$ are the observed and predicted values, respectively, and $\bar{O}$ is the mean of the observed values.

NSE is widely used to evaluate the performance of hydrological models and is often preferred over other metrics such as the coefficient of determination. An NSE value above 0.75 indicates a good-fit model, while a value below 0.5 indicates unsatisfactory model performance (Pena *et al.* 2020).
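The two metrics can be computed directly from their definitions; the toy observed/predicted values below are ours, for illustration only:

```python
import math

# RMSE and NSE for observed (obs) vs. predicted (pred) values.

def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def nse(obs, pred):
    mean_obs = sum(obs) / len(obs)
    err = sum((o - p) ** 2 for o, p in zip(obs, pred))
    var = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - err / var

obs = [100.0, 200.0, 300.0, 400.0]
pred = [110.0, 190.0, 310.0, 390.0]
# A perfect model gives NSE = 1; always predicting the observed mean
# gives NSE = 0, which is why NSE < 0.5 is deemed unsatisfactory.
```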

## RESULTS AND DISCUSSION

The experimentation is performed for the predictive model with four different approaches using NARX neural networks and LSTM, namely NARX1, NARX2, LSTM1 and LSTM2, as listed in Table 2. Although the experimentation in this study concerns a local case study, the modelling approach presented in this paper can be extended and applied to other studies internationally. More generally, the predictive modelling approach can be used not only for RF but also for other hydrology-related variables such as sediment load, water quality, groundwater level, etc. This generic data-driven approach is summarized as a flowchart in Figure 6, which begins with data collection and ends with a decision on possible model deployment upon satisfactory model evaluation.

| Multi-step-ahead predictive model | Approach |
|---|---|
| NARX1 | Recursive (closed-loop) approach; refer to Equation (3) |
| NARX2 | Direct autoregressive approach; refer to Equation (5) |
| LSTM1 | Univariate autoregressive approach; refer to Equation (6) |
| LSTM2 | Sequence-to-sequence multivariate approach; refer to Figure 5 |


### Multi-step-ahead prediction using NARX neural networks

The first experiment aims to obtain an optimum NARX model, involving the lag/delay number (*d*), the exogenous input order (*q*) and the autoregressive order (*p*). Here, autocorrelation and cross-correlation analyses are examined. Figure 7 shows the sample autocorrelation function for the RF data, which indicates a strong correlation of the present RF with its past values at a lag of 1 and around a lag of 12, i.e. for the past 1 month and the past 12 months.

Furthermore, the cross-correlations between RF and WR and between RF and AE are also examined. The result is shown in Figure 8, which indicates that RF has a strong positive correlation with the current value (lag 0) of WR and a strong negative correlation with that of AE. It also reveals a fairly similar pattern at lags 11 and 12, particularly for AE. With these observations, some experimentation to obtain the optimum NARX model is performed. The NARX model notation follows the values of *p*, *q* and *d* set for the model building; for example, NARX_2-12-0 means the model is set with *p* = 2, *q* = 12 and *d* = 0.

Table 3 shows the ANN setup used in the NARX model. Table 4 shows the experimental results used to select the optimum orders and lags. We start the experiment with a two-step-ahead prediction model, for which the performance metrics are shown. As can be seen, NARX_13-13-0 gives the optimum model, and this model is therefore used for the rest of the experiments at different step-ahead horizons. The models are evaluated using the test data as shown in Table 4, where NARX_13-13-0 produces NSE = 0.41 and RMSE = 215.37. Throughout the rest of the discussion, this model is called NARX2, which follows the direct autoregression approach. The evaluation is performed up to the longest step-ahead horizon shown in Table 4; above this, the model performs poorly and the results are therefore not shown. Other poor performances with different delays and orders are also not shown. Later, we compare the results with NARX1, the recursive approach, as well as with LSTM1 and LSTM2.

| Items | Value | Remarks |
|---|---|---|
| Number of neurons | [5, 5] | Using two hidden layers |
| Learning rate | 0.01 | Using a constant learning rate |
| Max epoch | 1,000 | Iteration using early stopping |
| Solver | 'Adam' | Reference (Kingma & Ba 2014) |
| Activation function | ReLU | Rectified linear unit |


| Step-ahead prediction | NARX model | NSE | RMSE |
|---|---|---|---|
| | NARX_2-12-0 | 0.36 | 224.19 |
| | NARX_2-13-0 | 0.40 | 217.49 |
| | NARX_13-12-0 | 0.35 | 226.87 |
| | NARX_13-13-0 | 0.41 | 215.37 |
| | NARX_13-3-10 | 0.34 | 228.99 |
| | NARX_13-4-10 | 0.37 | 224.89 |
| | NARX_13-13-0 | 0.31 | 235.46 |
| | NARX_13-13-0 | 0.41 | 218.23 |
| | NARX_13-13-0 | 0.32 | 233.98 |
| | NARX_13-13-0 | 0.35 | 223.38 |


The next results show the experimentation on NARX1. With the same delay and order setup as in NARX2, Figures 9 and 10 show the forecasting result on the training data. The result indicates a good fit of the model, with NSE = 0.85 and RMSE = 152.29. Higher errors occur particularly at some extreme values in certain months. In Figure 10, the blue diagonal line represents the regression line for the observed vs. forecasted values.

Furthermore, Figures 11 and 12 show a sample graph of the forecasting result on the test data, where the model is used to perform 52-step-ahead prediction recursively in closed-loop mode. The result indicates a moderate capability of the model, with NSE = 0.44 and RMSE = 208.85. The result of NARX1 on the test data is slightly better than that of NARX2. However, NARX1 has a disadvantage: it needs the immediate past exogenous inputs to make each prediction. This means the approach is not truly multi-step-ahead compared with the direct approach (NARX2).

### Multi-step-ahead prediction using univariate and multivariate LSTM

In this section, the results using LSTM are presented. LSTM is a specialized deep learning recurrent neural network able to learn and preserve long-term dependencies, proposed in the late 1990s (Hochreiter & Schmidhuber 1997). As mentioned earlier, the LSTM1 model used in this experiment is basically a univariate NAR model, referred to in Equation (6). Table 5 shows the LSTM setup used during training. This setup was chosen after a series of experiments. There are better ways to choose the best hyperparameter setup, such as using a grid search; however, this is beyond the scope of our research at the moment.

| Items | Value | Remarks |
|---|---|---|
| Number of neurons | 20 | Using 1 LSTM layer |
| Learning rate | 0.01 | Using a constant learning rate |
| Max epoch | 100 | Iteration using early stopping |
| Batch size | 32 | – |
| Solver | 'Adam' | Reference (Kingma & Ba 2014) |
| Activation functions | ReLU and linear | Rectified linear unit (ReLU) at LSTM and linear at output (dense) layer |
| Dropout | 0.2 | Applied to LSTM and dense layers |


As mentioned earlier, we use two LSTM approaches, namely univariate recursive (LSTM1) and direct sequence-to-sequence multivariate (LSTM2). While the recursive approach is able to make longer-term predictions, the direct sequence-to-sequence approach may not, as it also needs long sequences to train the model. This may not be practical when the data are not sufficiently large, as in our case; LSTM2 will suffer from a lack of data. We will see this in the following discussion.

Figure 13 shows the recursive forecasting result of LSTM1 on the test data, where RMSE = 191.42 is achieved, slightly lower than that of NARX1. Figure 14 shows the observed vs. predicted values, with NSE = 0.59. This result demonstrates the capability of LSTM for multi-step-ahead prediction in the univariate (NAR) approach applied in this case.

Next, we show the experimentation results with LSTM2. This approach uses the multivariate direct sequence-to-sequence approach, so we need to specify the step size of the input (*q*) and the step size of the output (*k*, for *k*-step-ahead prediction). Here, we set *q* = 13 to match the model order of NARX1 and NARX2. The experimentation results are shown in Table 6 for different values of *k*. The results are evaluated by rolling the *k*-step-ahead prediction over some portions of the test data. We also give the results of NARX2 in the last two columns for comparison. It can be seen that LSTM2 performs better up to a seven-step-ahead prediction; the model performance degrades as *k* increases above 7. This is because the LSTM2 model is trained only with *q* = 13; most likely, LSTM needs a longer input sequence to make longer step-ahead predictions. However, this would require more data, as the nature of deep learning is that it needs large data to learn more complex processes; how much training data is needed remains an open research question (Wunsch *et al.* 2021). In addition, the model may also need a different number of neurons and batch size. This should be an optimization task, such as using the grid search method, which is beyond our scope at the moment. Figure 15 shows a sample time-series graph for two-step-ahead forecasting using LSTM2, rolling over some portions of the test data.
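Our interpretation of the rolling evaluation can be sketched as follows (hypothetical helper names, with a persistence forecaster as a stand-in for the trained model): the forecaster is applied at successive start points in the test series and all *k*-step forecasts are pooled before computing the metric.

```python
# Rolling k-step-ahead evaluation over a test series.

def rolling_eval(forecast_fn, series, q, k):
    obs, preds = [], []
    for t in range(q, len(series) - k + 1):
        window = series[t - q:t]           # q past observations
        preds.extend(forecast_fn(window, k))
        obs.extend(series[t:t + k])        # the k observed values
    return obs, preds

# Toy forecaster: repeat the last observed value k times (persistence).
persistence_k = lambda window, k: [window[-1]] * k
obs, preds = rolling_eval(persistence_k,
                          [float(i) for i in range(20)], q=13, k=2)
```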

| Step-ahead prediction | LSTM2 NSE | LSTM2 RMSE | NARX2 NSE | NARX2 RMSE |
|---|---|---|---|---|
| | 0.78 | 119.44 | 0.35 | 226.87 |
| | 0.54 | 168.45 | 0.31 | 235.46 |
| | 0.42 | 196.49 | 0.41 | 218.23 |
| | 0.39 | 201.36 | 0.32 | 233.98 |
| | 0.42 | 188.86 | 0.33 | 231.97 |
| | 0.23 | 220.52 | 0.28 | 243.54 |
| | 0.13 | 239.36 | 0.31 | 230.07 |
| | 0.13 | 231.20 | 0.41 | 216.68 |
| | ≈0 | 259.75 | 0.35 | 223.38 |


On the other hand, NARX2 shows more stable performance, at least up to *k* = 10, despite being inferior for shorter step-ahead horizons. This suggests that for shorter-step-ahead forecasting, a multivariate LSTM approach is preferable to NARX, while for longer step-ahead horizons the NARX approach is preferable, especially under conditions of limited data size as in our case. However, this conclusion is not final, as the LSTM model can be further optimized and the predictor sequence can be lengthened whenever more data becomes available. With a larger dataset and further fine-tuning of the LSTM hyperparameters, a different conclusion may be reached, i.e. LSTM may outperform NARX for longer step-ahead horizons.
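The hyperparameter grid search mentioned above can be sketched generically as follows. Here `evaluate` is a hypothetical stand-in for training an LSTM with the given hyperparameters and scoring NSE on a validation split; the parameter values shown are illustrative, not the paper's:

```python
import itertools

def grid_search(evaluate, param_grid):
    """Exhaustive grid search: evaluate() returns a score (higher is better,
    e.g. validation NSE); keep the best combination of hyperparameters."""
    best_score, best_params = float("-inf"), None
    keys = list(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Toy objective standing in for "train LSTM, score NSE on validation data".
toy = lambda neurons, batch_size: -abs(neurons - 64) - abs(batch_size - 16)
print(grid_search(toy, {"neurons": [32, 64, 128], "batch_size": [8, 16, 32]}))
```

The cost grows multiplicatively with each added hyperparameter, which is why the paper defers this optimization to future work.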

The last part of this section highlights the comparison between this paper and other cited studies on multi-step-ahead prediction in hydrological systems, considering model algorithms, data size and model performance. This comparative summary is given in Table 7. In terms of data size, our study uses far fewer samples than the other studies; this is certainly our major concern for future improvement. In terms of algorithms, NARX neural networks are commonly used, while LSTM is still rarely explored for multi-step-ahead forecasting (setting aside studies on single-step-ahead forecasting). One interesting recent study by Guo *et al.* (2021) points out that the gradient boosting machine, a newer type of machine learning algorithm, performs favourably for some step-ahead predictions; this could be explored in the future. In terms of accuracy, our proposed few-step-ahead LSTM method in particular is generally on a par with, or slightly better than, some results from other studies.

| Research study | Output variable | Methods | Data size | Highlights of results |
|---|---|---|---|---|
| Bhagwat & Maity (2012) | River flow; Narmada river (India) | LS-SVR and ANN | 2,556 samples for training | Best NSE = 0.49 for two-step-ahead prediction using LS-SVR; reasonably good up to 5-day-ahead predictions (NSE = 0.3). |
| Chang et al. (2014) | Flood inundation level; Yu–Cheng Pumping Station (Taipei City) | ANN, Elman networks and NARX neural networks | 1,985 samples for training and testing | NARX networks perform the best, producing coefficients of efficiency within 0.9–0.7 (scenario I) and 0.7–0.5 (scenario II) in the testing stages for 10–60-min-ahead forecasts. |
| Guo et al. (2021) | River stage; Lan-Yang river basin (Lanyan, Simon and Kavalan stations, Taiwan) | Four optimized ML techniques, namely, SVR, RFR, ANN and LGBM | ≈7,500 samples for Simon and Lanyan; ≈4,000 samples for Kavalan | All models demonstrate favourable performance (about 0.72) for 1–6 step-ahead forecasting at all stations. The LGBM model achieves more favourable predictions than SVR, RFR and ANN. |
| Yu et al. (2011) | Water level; Heshui catchment, China | Eight different types of ANN training algorithms | 4,749 samples for training and testing | BFGS- and LM-trained ANN models gave the best performance among all of the prediction scenarios, with good coefficients of determination for two-step-ahead and five-step-ahead forecasting. |
| This paper | River flow; Kelantan river (Malaysia) | NARX neural networks and LSTM | 348 samples for training and testing | LSTM with direct sequence-to-sequence produces NSE = 0.75 and NSE = 0.39 for two-step-ahead and five-step-ahead forecasting, respectively. |


LS-SVR, least-square-support vector regression; BFGS, Broyden–Fletcher–Goldfarb–Shanno; LM, Levenberg–Marquardt; LGBM, light gradient boosting machine; RFR, random forest regressor.

## CONCLUSIONS

Four approaches for multi-step-ahead forecasting of the Kelantan RF in Malaysia using NARX neural networks and deep learning LSTM have been discussed. These approaches use different strategies to perform multi-step-ahead forecasting, involving univariate and multivariate methods. The first approach is NARX neural networks with recursive prediction, which gives acceptable performance (RMSE = 208.85 and NSE = 0.44) on the test dataset. The second approach is NARX neural networks with direct multi-step-ahead prediction, which produces the highest NSE = 0.41 for four-step-ahead and nine-step-ahead forecasting in our experiments. The third approach is LSTM with direct sequence-to-sequence prediction, which performs better for few-step-ahead forecasting (NSE = 0.78 for two-step-ahead forecasting); its performance, however, degrades quite significantly when longer-step-ahead forecasting is performed. The fourth approach is the univariate recursive LSTM, which performs slightly better than the first approach. The third approach is promising for few-step-ahead prediction, but it needs more data to build a model capable of longer-step-ahead forecasting. Future work will involve collecting more data and investigating hyperparameter optimization of the machine learning/deep learning-based models, for example using grid search or meta-heuristic optimization, as has been done earlier in another study (Hayder *et al.* 2020). Applications of ensemble and boosting machine learning, such as random forest and gradient boosting algorithms, can be explored as well. This kind of study is important for water resource management and flood mitigation planning, as the Kelantan River has recorded several flooding events in the recent past.

## ACKNOWLEDGEMENTS

This work is supported by the Ministry of Higher Education of Malaysia under the Fundamental Research Grant Scheme (FRGS), reference number FRGS/1/2019/TK01/UNITEN/02/5 (20190105FRGS).

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.