## Abstract

The objective of this study is the development of a state-of-the-art method based on long short-term memory (LSTM), support vector machine (SVM), and random forest (RF) to predict the streamflow in the Mekong Delta in Vietnam, an area crucial to Vietnam's food security. Water level and flow data from 2014 to 2018 at the Tan Chau station and Can Tho (on the Hau River) were used as the input data of the prediction model. Three different ranges of data – from the preceding 4, 8, and 12 days – were used to predict streamflow for both 1 and 7 days ahead, resulting in six individual predictions. Various statistical indices, namely root-mean-square error (RMSE), mean absolute error (MAE), and the coefficient of determination (*R*^{2}), were used to assess the predictive ability of the model. The results showed that the SVM and RF models were successful in improving the performance of the LSTM model, with *R*^{2} > 80%. For a prediction of 1 day ahead, the proposed models gave an *R*^{2} value 2–5% higher than for a prediction of 7 days ahead. These results highlighted that LSTM is a robust technique for characterizing and predicting time series behaviors in hydrology applications.

## HIGHLIGHTS

Daily streamflow forecasting was done using hybrid machine learning approaches.

Model performance was evaluated using RMSE, *R*^{2}, and MAE.

Developed models achieved high accuracy in daily streamflow forecasting.


## INTRODUCTION

Streamflow is an important index that directly influences the quantification of available water resources for water supply projects and agricultural, hydroelectric, and other development (Malik *et al.* 2020; Hunt *et al.* 2022). Adnan *et al.* (2021) reported that approximately 20% of river flow is affected by human activities. Changes in land use and the construction of dams and reservoirs are the main factors influencing the trend of river flows (Adnan *et al.* 2021).

Early warning systems are installed in many watersheds around the world, providing real-time flow measurements of river systems for water resource management. However, these warning systems require significant investment and can encounter difficulties in poorer regions (Krajewski *et al.* 2017; Hussain & Khan 2020), so state-of-the-art methods must be developed that not only reduce investment capital but also have high accuracy and reliability.

Because river flow is a complex physical process, flow predictions can be made using either physics-based or data-driven models. Many previous studies have demonstrated the efficiency of these models for hydrological predictions, especially those over the short term (Yadav *et al.* 2007). Physics-based models are developed based on flow generation processes using mathematical formulas or parameterization of physical processes (Paniconi & Putti 2015; Yan *et al.* 2021), which is complex and time-consuming. Especially in small catchments, the process of flow formation is complex and nonlinear, so physics-based models are limited in their ability to accurately predict flows (Alizadeh *et al.* 2018). In addition, physics-based models require reliable data, which are lacking in most global river basins (Alizadeh *et al.* 2018; Parisouj *et al.* 2020). Furthermore, model parameters derived from the physical characteristics of a basin are often associated with high uncertainty (Bulygina *et al.* 2011), and their application is limited by very high costs. Together, these constraints have restricted the use of physics-based models in many locations around the world. Such models therefore need to be developed further, or replaced with robust and automated methods that address these limitations.

To overcome these challenges, data-driven models have recently been receiving increased attention from scientific communities around the world due to their accurate prediction capability (Kratzert *et al.* 2019). They are categorized into two main models: statistical and machine learning (ML). Statistical models are built on the assumption that the flow generation process follows a normal distribution, so such models have limited accuracy when predicting the characteristics of nonlinear and random flow processes (Ghimire *et al.* 2021). In the last two decades, ML has proven to be a successful and cost-effective solution in the field of water resources with its ability to measure and predict flooding (Islam *et al.* 2021; Nguyen 2022b), surface water (Acharya *et al.* 2019; Chen *et al.* 2020), water quality (Haghiabi *et al.* 2018; Ahmed *et al.* 2019), water salinity (Melesse *et al.* 2020; Jung *et al.* 2021), and groundwater (Sahoo *et al.* 2017; Singha *et al.* 2021). Particularly, in recent years, ML has been widely applied in streamflow prediction with models focusing less on the physical characteristics of the hydrological cycle and more on using black box methods to establish optimal mathematical relationships between input and output data. Such models have been widely applied in flow prediction in small and medium watersheds (Jothiprakash & Magar 2012). These models include artificial neural network (ANN) (Zealand *et al.* 1999; Besaw *et al.* 2010), support vector machine (SVM) (Kisi & Cimen 2011; Huang *et al.* 2014), adaptive neuro-fuzzy inference system (ANFIS) (Chang & Chang 2006; Firat & Güngör 2007), long short-term memory (LSTM) (Feng *et al.* 2020; Ni *et al.* 2020), extreme learning machine (ELM) (Yaseen *et al.* 2016; Adnan *et al.* 2019), and fuzzy neural network (FNN) (Valença & Ludermir 2000; Deka & Chandramouli 2003). 
The advantage of ML models is their ability to handle large datasets and accept datasets at different scales, while not being sensitive to missing data (Yaseen *et al.* 2018). The models find the optimization relationships between the input data and the output data for the prediction. In addition, ML models can simulate nonlinear and complicated dynamic streamflow systems with high accuracy, as has been recognized since the 1990s (Khosravi *et al.* 2021). Challenges in the model transformation from a basin with available flow data to another with similar characteristics have been solved by ML modeling (Contreras *et al.* 2019; Khosravi *et al.* 2021). Lin *et al.* (2021) used precipitation and runoff data from six meteorological and hydrological stations to develop a hybrid model based on the first-order difference (DIFF), the feedforward neural network (FFNN), and the LSTM to predict the hourly flows of the Andun Basin in China. Rahimzad *et al.* (2021) constructed four models based on linear regression (LR), multilayer perceptron (MLP), SVM, and LSTM to predict daily streamflows in the Kentucky watershed in the United States. Data used in the model include precipitation and discharge for the period 1986–2012. Adnan *et al.* (2021) developed hybrid models based on locally weighted learning (LWL), additive regression (AR), bagging (BG), dagging (DG), random subspace (RS), and rotation forest for monthly forecasting flows in the Jhelum River Basin, Pakistan. Monthly rainfall data from 1965 to 2012 at the Kohala station on the Jhelum River Basin was used to construct these models. Le *et al.* (2021) developed an LSTM model for 1-day and 2-day flow forecasting at Son Tay station in the Red River Basin in Vietnam. The authors used river flow data for 20 years (1995–2014) to train, validate, and evaluate the model. However, the literature suggests that there is no universal model, that is, one that can solve all problems in all regions. 
Moreover, ML is associated with generalization issues: models have weak prediction ability when the training dataset is not long enough or when the validation data fall outside the range of the training data (Melesse *et al.* 2011). In addition, although individual models can perform well, their complex structures and parameter configurations make it challenging to build a better individual model. Popular tuning methods, including trial-and-error and random search, have been widely applied; however, they have a low convergence rate and do not specifically consider interactions between parameters and hyperparameters. This is why recent studies have emphasized combining several models into a hybrid model: such combinations can compensate for the weak points of the individual models.

The objective of this study is the development of a state-of-the-art method based on LSTM, SVM, and RF to predict the streamflow in the Mekong Delta in Vietnam. These three models are considered to be among the most popular and have been widely applied in previous studies to predict streamflow. Moreover, these models converge quickly, can solve nonlinear problems, and achieve high accuracy in high-dimensional spaces. In addition, these models have effective memories. Finally, the RF model has the ability to automatically resolve missing values. This study differs from previous studies because it is the first time these models have been applied to predict streamflow in the Mekong River. Water resource management in the area (prediction, reservoir operation, and flood control) has been carried out based on streamflow. In developing countries like Vietnam, which lack sufficient data to build water resource management strategies, streamflow prediction is important. The development of models for areas with limited data has received attention from scientific communities around the world. The results of this study will bring new understanding and improvements in streamflow modeling and prediction, and the findings can help decision-makers better manage water resources.

## MATERIALS AND METHODS

### Study area and data

The Mekong River is seen as 1 of the 10 most important rivers in the world in terms of flow and sediment. Its 4,350 km-long journey begins in the Tibetan Plateau in China. It flows through six different countries: China, Laos, Cambodia, Thailand, Myanmar, and Vietnam. Finally, the Mekong River empties into the East Sea through the Mekong Delta.

^{2} (Figure 1). This region is home to nearly 20 million people. Rice is the main crop in the VMD, covering an area of about 1.9 million ha and accounting for about 50% of the country's total rice production. The delta is relatively flat, with an average altitude of 0–2 m above mean sea level. The delta has a tropical climate with two main seasons: the dry season is from November to April, and the rainy season is from May to October. Average precipitation in the VMD ranges from 1,400 to 2,200 mm per year, 90–95% of which falls during the rainy season. The tides in the delta are very complicated and divided into two main regimes: semi-diurnal in the East Sea and diurnal in the West Sea. For the semi-diurnal tide, the high tide period lasts about 6 h, and the low tide period is about 7 h. The average magnitude of the tides in this region varies from 3 to 4 m and the maximum tide can reach 4.1 m. For the diurnal tide, there are two peaks and two troughs during the day, with a magnitude ranging from 0.8 to 1.2 m.

Figure 1 shows the hydrological networks in the VMD are very dense (80 m/ha) with two main rivers: the Tien River and the Hau River. According to a report by the Ministry of Natural Resources and Environment, the annual flow in the delta is around 500 km^{3}, of which approximately 23 km^{3} (4.6%) comes directly from precipitation. The remaining 477 km^{3} is from the flow of the upstream Mekong River. The flow at the Tan Chau station ranges from 5,000 to 17,000 m^{3}/s. 70–80% of the annual flow occurs in the rainy season, which causes pressure on agricultural development in the dry season.

The VMD is thought to be particularly affected by climate change. Previous studies have predicted a sea-level rise of between 46 and 77 cm by the end of the 21st century in this region, which will aggravate drought and saltwater intrusion, especially during the dry season (November to April). The accurate prediction of streamflow in the delta, therefore, plays an important role in supporting those responsible for water resource management and the sustainable development of agriculture.

The training process of ML models often encounters numerical difficulties because the raw streamflow data are strongly nonlinear, which strongly influences the prediction model (Niu *et al.* 2020). It is necessary to normalize these data to limit these problems. Because this study uses a neural network, the original values of all attributes were retained, and the data were rescaled to the range 0–1.
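As a minimal sketch of this rescaling step, the following Python snippet maps a series to the range 0–1 and back. It assumes min–max scaling (the text only states the target range); the function names and the sample flow values are hypothetical.

```python
def fit_min_max(values):
    """Return the (min, max) pair used to rescale a series."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant series cannot be rescaled to 0-1")
    return lo, hi


def scale(values, lo, hi):
    """Map raw values to the range 0-1 using a fitted (min, max) pair."""
    return [(v - lo) / (hi - lo) for v in values]


def unscale(scaled, lo, hi):
    """Map scaled values back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]


# Hypothetical daily flows in m^3/s, for illustration only.
flow = [5000.0, 8000.0, 17000.0, 11000.0]
lo, hi = fit_min_max(flow)
scaled = scale(flow, lo, hi)
restored = unscale(scaled, lo, hi)
```

One common convention is to fit the minimum and maximum on the training portion only and reuse them for the validation portion; the text does not state which convention was followed.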

### Methodology

#### Long short-term memory algorithm

LSTM is a type of recurrent neural network (RNN) with an extended memory. In an RNN, the output of the previous step is fed as an input to the current step. LSTM was designed to solve the long-term dependency problem of the RNN: a standard RNN can give accurate predictions from recent information but struggles to use information from far back in the sequence (Hochreiter & Schmidhuber 1997). LSTM allows the network to retain its inputs over a long period because it can read, write, and delete information in its memory. LSTM is widely used for prediction based on time series data (Vojtek *et al.* 2021). The structure of LSTM includes three gates: input, forget, and output (Dong *et al.* 2020; Ghimire *et al.* 2021).

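In the forget gate, the input data *X*_{t} at time *t* and the previous cell output *h*_{t−1} are sent to the gate, where they are multiplied by the weight matrices and added to the biases. The sigmoid function maps the result to a value between 0 and 1. If the result is equal to 0, the gate discards the information; if the result is equal to 1, the information is kept for the next step.

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, X_t] + b_f\right)$$

where *f*_{t} is the forget gate, *σ* is the sigmoid function, and *W*_{f} and *b*_{f} are the weight and bias matrices of the forget gate, respectively.

In the input gate, the sigmoid function is applied to *X*_{t} and *h*_{t−1}, as with the forget gate. Then, the tanh function is used to generate a vector whose values range from −1 to 1. This vector contains all possible values of the input data *X*_{t} and *h*_{t−1}. The vector values and the gate values are multiplied to retain the useful information.

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, X_t] + b_i\right)$$

$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, X_t] + b_C\right)$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

where *i*_{t} is the input gate, *W*_{i} and *b*_{i} are the weight and bias matrices of the input gate, $\tilde{C}_t$ is the candidate vector produced by the tanh function, *C*_{t−1} and *C*_{t} are the cell states at time *t* − 1 and *t*, respectively, and *W*_{C} and *b*_{C} are the respective weight matrices and bias of the cell state.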
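The forget-gate and input-gate updates described above can be traced with a short Python sketch for a single hidden unit. It is an illustration under stated assumptions, not the configuration used in the study: the function name, the parameter dictionary, and all weight values are hypothetical, and the output gate is omitted because it is not described in this section.

```python
import math


def sigmoid(z):
    """Logistic function used by the forget and input gates."""
    return 1.0 / (1.0 + math.exp(-z))


def dot(w, v):
    """Dot product of two equal-length lists."""
    return sum(wi * vi for wi, vi in zip(w, v))


def cell_state_update(x_t, h_prev, c_prev, p):
    """Forget/input-gate update for one hidden unit.

    x_t: list of input features at time t
    h_prev, c_prev: previous output and cell state (floats)
    p: dict of hypothetical weights and biases
    """
    z = [h_prev] + list(x_t)                      # concatenation [h_{t-1}, X_t]
    f_t = sigmoid(dot(p["W_f"], z) + p["b_f"])    # forget gate, in (0, 1)
    i_t = sigmoid(dot(p["W_i"], z) + p["b_i"])    # input gate, in (0, 1)
    c_tilde = math.tanh(dot(p["W_c"], z) + p["b_c"])  # candidate, in (-1, 1)
    c_t = f_t * c_prev + i_t * c_tilde            # new cell state
    return f_t, i_t, c_tilde, c_t


# With all-zero weights both gates equal 0.5 and the candidate is 0,
# so half of the previous cell state is carried forward.
zero_params = {
    "W_f": [0.0, 0.0, 0.0], "b_f": 0.0,
    "W_i": [0.0, 0.0, 0.0], "b_i": 0.0,
    "W_c": [0.0, 0.0, 0.0], "b_c": 0.0,
}
f_t, i_t, c_tilde, c_t = cell_state_update([0.3, 0.7], 0.1, 2.0, zero_params)
```

A large positive forget-gate bias pushes *f*_{t} toward 1, which is the "keep the information" case described in the text.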

#### Support vector machine

*et al.* (1995). SVM creates a hyperplane in an *N*-dimensional space to divide the data into two parts corresponding to their class. In two-dimensional space, this hyperplane is a line dividing the plane into two parts corresponding to the two classes, each class being located on one side of the line. The technique applies a linear model after mapping the dataset into a feature space via a nonlinear function (Samantaray *et al.* 2022). It uses the principles of structural risk minimization and statistical learning to determine the boundary between the two opposite classes, which improves the generalization capacity by reducing the generalization error rather than only the training error (Christian *et al.* 2021; Essam *et al.* 2022). SVM works using a kernel function that converts data from the input feature space to a higher-dimensional feature space. This conversion supports determining complex input–output relationships in a relatively simple way (Christian *et al.* 2021). The SVM technique for solving the regression problem can be expressed as follows:

$$f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x) + b$$

where *α*_{i} is the Lagrange multiplier associated with training sample *x*_{i}, *K*(*x*, *z*) is the kernel function evaluated for a pair of samples *x* and *z*, and *b* is the bias.

The performance of the SVM model depends on the parameters kernel, *C*, and gamma. The kernel can be linear, polynomial, radial basis function (RBF), sigmoid, or precomputed. *C* controls how strongly errors on outlying training samples are penalized when building the SVM model, while gamma determines how far the influence of each training sample extends when building the hyperplane. In this study, SVM was used to optimize the parameters of the LSTM algorithm.
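To make the role of the kernel and gamma more concrete, the sketch below evaluates a kernel-based regression function of the kind described above, using an RBF kernel. It is a hedged illustration only: the support vectors, multipliers, and bias are hypothetical values chosen by hand, not fitted parameters, and the function names are not taken from any library.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def kernel_regression(x, support_vectors, alphas, bias, gamma):
    """Evaluate f(x) = sum_i alpha_i * K(x_i, x) + b."""
    total = 0.0
    for sv, alpha in zip(support_vectors, alphas):
        total += alpha * rbf_kernel(sv, x, gamma)
    return total + bias


# Hypothetical support vectors (normalized water level, normalized flow),
# multipliers, and bias, for illustration only.
support_vectors = [[0.2, 0.3], [0.8, 0.7]]
alphas = [-0.4, 0.6]
bias = 0.5

near_low = kernel_regression([0.2, 0.3], support_vectors, alphas, bias, gamma=0.5)
near_high = kernel_regression([0.8, 0.7], support_vectors, alphas, bias, gamma=0.5)
```

A larger gamma makes each kernel term decay faster with distance, so each support vector influences a smaller neighbourhood of the input space.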

#### Random forest

Random forest (RF) is a powerful supervised learning algorithm and was first proposed by Breiman (2001). This algorithm uses the results of the decision tree prediction to solve the classification and regression problems. RF makes it possible to combine a large number of decision trees (weak models) automatically and randomly to create the best results with higher accuracy than individual models (Zhang *et al.* 2019; Peng *et al.* 2020). The sub-models (the decision trees) are evaluated using the majority voting method to select the best model (Al-Abadi & Shahid 2016). RF works based on three main steps (Tian *et al.* 2016): the first is to randomly select *n* data from the dataset using the bootstrapping technique. In this study, the dataset was divided into two parts: 80% of the data were used to train the models, while 20% of the data were used to validate the models. The second step is the building of the decision tree using the decision tree algorithm. RF includes many decision trees; each tree is built using the decision tree algorithm on different datasets and using different sets of attributes. Then, the RF prediction results are aggregated from the decision trees. The third step is the vote for the best prediction results. The best result is then returned.

RF has the advantage of solving the problem of missing data by using the average value of the adjacent values (Ziegler & König 2014). Also, when the forest has more trees, RF can avoid the overfitting problem. Although RF has high precision, it has limitations such as when a dataset has a large number of variables (Arabameri *et al.* 2019). A decision tree of limited depth often misses important variables. The performance of the RF model is influenced by parameters like max_features, n_estimators, and min_sample_leaf. In this study, RF was applied to compute the weights for each layer of LSTM.
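The three RF steps (bootstrap sampling, tree building, and aggregation) can be sketched in a few lines of Python. This is a deliberately reduced illustration, not the model used in the study: each "tree" is a depth-1 regression stump on a single feature, random feature selection is omitted, and the predictions are averaged, which is the usual aggregation for regression (majority voting applies to classification). All names and data are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    best = None
    for thr in sorted(set(xs))[:-1]:              # candidate split points
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    if best is None:                              # all x identical: predict the mean
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    thr, left_mean, right_mean = stump
    return left_mean if x <= thr else right_mean


def fit_forest(xs, ys, n_trees, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Aggregate the stump predictions by averaging."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)


# Hypothetical data: low flows for small x, high flows for large x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
forest = fit_forest(xs, ys, n_trees=25)
low, high = predict_forest(forest, 1), predict_forest(forest, 8)
```

Because every stump sees a different bootstrap sample, the individual split points differ, and averaging smooths out their individual errors.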

### Performance assessment

In this study, various statistical indices were used to evaluate the performance of the prediction model, namely root-mean-square error (RMSE), mean absolute error (MAE), and the coefficient of determination (*R*^{2}). Several previous studies have confirmed that these indices are widely used and reliable measures for the prediction problem.

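*et al.* 2020). The value of RMSE ranges from 0 to 1. The closer the RMSE value is to 0, the more accurate the model's prediction. RMSE is defined by the following equation:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(Y_{pre,i} - Y_{obs,i}\right)^{2}}$$

where *N* is the number of samples, *Y*_{pre,i} is the prediction value at point number *i*, and *Y*_{obs,i} is the observation value at point number *i*.

*R*^{2} is considered one of the most popular measures to assess the level of fit of the model to the observational data. The value of *R*^{2} ranges from 0 to 1. The more efficient the model, the closer the value is to 1 (Kumari *et al.* 2021). *R*^{2} is defined by the following equation:

$$R^{2} = \left[\frac{\sum_{i=1}^{N}\left(Y_{obs,i} - \bar{Y}_{obs}\right)\left(Y_{pre,i} - \bar{Y}_{pre}\right)}{\sqrt{\sum_{i=1}^{N}\left(Y_{obs,i} - \bar{Y}_{obs}\right)^{2}\,\sum_{i=1}^{N}\left(Y_{pre,i} - \bar{Y}_{pre}\right)^{2}}}\right]^{2}$$

where $\bar{Y}_{pre}$ and $\bar{Y}_{obs}$ are the mean values of predicted and observed daily streamflow, respectively.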
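The three indices named in this section can be computed directly, as in the sketch below. Two assumptions are labeled here: MAE uses its standard definition (the mean of the absolute errors), since its equation does not appear in the text, and *R*^{2} is computed as the squared correlation between observed and predicted values, which is the form that involves the means of both series. The sample values are hypothetical.

```python
import math


def rmse(obs, pre):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pre)) / len(obs))


def mae(obs, pre):
    """Mean absolute error (standard definition, assumed here)."""
    return sum(abs(p - o) for o, p in zip(obs, pre)) / len(obs)


def r_squared(obs, pre):
    """Squared correlation between observed and predicted values."""
    mean_obs = sum(obs) / len(obs)
    mean_pre = sum(pre) / len(pre)
    cov = sum((o - mean_obs) * (p - mean_pre) for o, p in zip(obs, pre))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_pre = sum((p - mean_pre) ** 2 for p in pre)
    return cov ** 2 / (var_obs * var_pre)


# Hypothetical normalized streamflow values.
observed = [0.2, 0.4, 0.6, 0.8]
predicted = [0.25, 0.35, 0.65, 0.75]

scores = {
    "RMSE": rmse(observed, predicted),
    "MAE": mae(observed, predicted),
    "R2": r_squared(observed, predicted),
}
```

Because the data are normalized to 0–1 before modeling, RMSE and MAE computed this way are expressed in normalized units rather than in m^{3}/s.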

### Basic steps of modeling by LSTM

#### Collection and preparation of data

Water level and flow data were collected at the Chau Doc and Can Tho stations on the Hau River from 2014 to 2018. After data collection, these data were normalized to use as the model input data.

#### Building of model

First, 80% of the data, corresponding to 1,460 days (4 years), were used to develop the models. The remaining 20% were used to compare and evaluate the performance of the proposed models. The assumption is that streamflow is a dynamically responsive system that depends on past conditions. Future streamflow values were therefore predicted using past values of the water level and the streamflow, which is why input windows of different lengths (the preceding 4, 8, and 12 days) were tested. These window lengths were selected by trial and error. The selection is also related to statistical significance and depends on the sizes of the empirical models. Various studies have pointed out that a greater number of previous days increases the computational cost of the model.
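The 80/20 split and the figure of 1,460 training days can be checked with a short sketch. It assumes that the record runs daily from 1 January 2014 to 31 December 2018 and that the split is chronological (earliest 80% for training); neither detail is stated explicitly in the text, and the function name is hypothetical.

```python
from datetime import date


def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered series into training and evaluation parts."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Number of daily records, assuming full calendar years 2014-2018.
n_days = (date(2018, 12, 31) - date(2014, 1, 1)).days + 1

# Placeholder series: one index per day stands in for the real records.
records = list(range(n_days))
train, evaluation = chronological_split(records)
```

Keeping the evaluation block at the end of the record avoids using future observations to predict the past, which a random shuffle would allow.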

Second, LSTM was used to predict the daily streamflow at the Can Tho station on the Hau River. To improve the performance of the LSTM model, hybrid models were built by integrating SVM and RF into the LSTM network to resample the training dataset used to train the base LSTM model. The benefit of the hybrid models was assessed by comparing their performance with that of the nonhybrid model.

The structure of the LSTM-SVM and LSTM-RF models is shown in Figure 3, with reference to Guo *et al.* (2019). The prediction process was divided into two stages. The first stage increases the number of samples for training the models by using the historical samples of precipitation and discharge data to generate the steady series samples and combining them with temporal series samples. The second stage predicts the river discharges by integrating SVM/RF and LSTM. The outputs of SVM/RF and LSTM are denoted *X*_{1} and *X*_{2}, respectively, and their combination is the final result of the LSTM-SVM/RF model.

- Degree of absolute error: |previous label − previous prediction|
- Degree of relative error: |previous label − previous prediction| / previous label
- Trend of recent error: previous label − previous prediction
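The three error terms listed above can be computed from the previous observation (label) and the previous prediction as in the sketch below. The function and key names are hypothetical, and the guard against a zero label is an added assumption, since the relative error is undefined in that case.

```python
def error_features(prev_label, prev_prediction):
    """Return the absolute error, relative error, and signed error trend."""
    if prev_label == 0:
        raise ValueError("relative error is undefined for a zero label")
    diff = prev_label - prev_prediction
    return {
        "absolute_error": abs(diff),
        "relative_error": abs(diff) / prev_label,
        "error_trend": diff,          # sign shows under- or over-prediction
    }


# Hypothetical normalized values: observed 0.5, predicted 0.4.
features = error_features(0.5, 0.4)
```

A positive trend value means the previous prediction was below the observation, and a negative value means it was above.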

To build the models, the different preceding days (4, 8, and 12 days) were used to predict the 1 and 7 days ahead.

- (i) *H*_{t−1}, *H*_{t−2}, *H*_{t−3}, *H*_{t−4}, *Q*_{t−1}, *Q*_{t−2}, *Q*_{t−3}, *Q*_{t−4}.
- (ii) *H*_{t−1}, *H*_{t−2}, *H*_{t−3}, *H*_{t−4}, *H*_{t−5}, *H*_{t−6}, *H*_{t−7}, *H*_{t−8}, *Q*_{t−1}, *Q*_{t−2}, *Q*_{t−3}, *Q*_{t−4}, *Q*_{t−5}, *Q*_{t−6}, *Q*_{t−7}, *Q*_{t−8}.
- (iii) *H*_{t−1}, *H*_{t−2}, *H*_{t−3}, *H*_{t−4}, *H*_{t−5}, *H*_{t−6}, *H*_{t−7}, *H*_{t−8}, *H*_{t−9}, *H*_{t−10}, *H*_{t−11}, *H*_{t−12}, *Q*_{t−1}, *Q*_{t−2}, *Q*_{t−3}, *Q*_{t−4}, *Q*_{t−5}, *Q*_{t−6}, *Q*_{t−7}, *Q*_{t−8}, *Q*_{t−9}, *Q*_{t−10}, *Q*_{t−11}, *Q*_{t−12}.

where *H*_{t−1} and *Q*_{t−1} are the water level and the streamflow on the previous day, respectively.
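The three input sets can be generated from the daily series with a sliding window, as sketched below. The sketch assumes that *H* denotes water level and *Q* streamflow, and that a 1-day-ahead target is *Q*_{t} while a 7-day-ahead target is *Q*_{t+6}; the exact target indexing is not stated in the text. The function name and the toy series are hypothetical.

```python
def make_windows(level, flow, n_lags, horizon):
    """Build (features, target) pairs from daily water level and flow.

    Features at day t: H_{t-1}..H_{t-n_lags}, then Q_{t-1}..Q_{t-n_lags}.
    Target: Q_{t + horizon - 1} (horizon=1 gives Q_t).
    """
    if len(level) != len(flow):
        raise ValueError("level and flow must have the same length")
    samples = []
    for t in range(n_lags, len(flow) - horizon + 1):
        h_lags = [level[t - k] for k in range(1, n_lags + 1)]
        q_lags = [flow[t - k] for k in range(1, n_lags + 1)]
        samples.append((h_lags + q_lags, flow[t + horizon - 1]))
    return samples


# Toy series of 20 days, for illustration only.
level = list(range(100, 120))
flow = list(range(0, 20))

# Six configurations: 4, 8, or 12 previous days, 1 or 7 days ahead.
datasets = {
    (n_lags, horizon): make_windows(level, flow, n_lags, horizon)
    for n_lags in (4, 8, 12)
    for horizon in (1, 7)
}
```

Each sample has 2 × `n_lags` features, so input set (iii) with 12 previous days yields 24 inputs per sample.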

#### Model validation

In this study, RMSE, MAE, and *R*^{2} were the statistical indices used to validate the proposed models. The remaining 20% of the data (river flows) were used to assess the accuracy of the proposed models; these data were not used to train the models. The precision of the proposed models differs depending on the number of preceding days and the number of days ahead considered.

#### Prediction

The proposed models were used to predict the streamflow for 1 and 7 days ahead at the Can Tho station on the Hau River in Mekong Delta.

## RESULTS

### Modeling parameter optimization in daily streamflow prediction

Each model has a set of parameters that affects its performance. The SVM model has parameters including gamma, *C*, and epsilon. The *C* parameter controls the penalty applied to prediction errors in the model: a large value of *C* gives low bias and high variance, and a small value of *C* gives higher bias and lower variance. Gamma is the parameter of the Gaussian kernel, which supports handling nonlinear relationships. The parameter values were optimized empirically: gamma was set to 0.5, *C* to 2, and epsilon to 0.05. The RF model has several parameters, including max_depth and min_sample_split; however, the model was most affected by the max_depth parameter. The max_depth of a tree in RF is defined as the longest path between the root node and a leaf node. Based on experiments, we set the value of this parameter to 20. For the LSTM model, performance depends on the number of layers, the number of hidden units, and the number of iterations, which were set to 2, 128, and 100, respectively. The parameter selections were based primarily on the trial-and-error process.

### Evaluation of the number of previous days

Tables 1 and 2 show the results of predicting 1- and 7-day streamflows using the 4, 8, and 12 previous days. In general, as the number of previous days used as input increases, the performance of the prediction model also increases. In the case of the 1-day ahead forecast, for the LSTM model, the value of RMSE decreased from 0.111 to 0.105 and then to 0.075 for input windows of 4, 8, and 12 days, respectively. Similarly, the value of MAE decreased from 0.08 to 0.076 and then to 0.055. The value of *R*^{2} increased from 0.826 to 0.844 and then to 0.921. For the LSTM-SVM model, the RMSE value decreased from 0.104 to 0.062 and then to 0.06, the MAE value decreased from 0.08 to 0.046 and then to 0.045, and the *R*^{2} value increased from 0.846 to 0.946 and then to 0.954. For the LSTM-RF model, the values of RMSE and MAE decreased from 0.103 to 0.061 and then to 0.059, and from 0.075 to 0.046 and then to 0.045, respectively. The value of *R*^{2} increased from 0.848 to 0.947 and then to 0.953.

Table 1. Performance of the models for the 1-day ahead prediction

| Model | Number of previous days | RMSE | MAE | *R*^{2} |
|---|---|---|---|---|
| LSTM-SVM | 4 | 0.104 | 0.08 | 0.846 |
| LSTM-SVM | 8 | 0.062 | 0.046 | 0.946 |
| LSTM-SVM | 12 | 0.06 | 0.045 | 0.954 |
| LSTM-RF | 4 | 0.103 | 0.075 | 0.848 |
| LSTM-RF | 8 | 0.061 | 0.046 | 0.947 |
| LSTM-RF | 12 | 0.059 | 0.045 | 0.953 |
| LSTM | 4 | 0.111 | 0.08 | 0.826 |
| LSTM | 8 | 0.105 | 0.076 | 0.844 |
| LSTM | 12 | 0.075 | 0.055 | 0.921 |


Table 2. Performance of the models for the 7-day ahead prediction

| Model | Number of previous days | RMSE | MAE | *R*^{2} |
|---|---|---|---|---|
| LSTM-SVM | 4 | 0.117 | 0.085 | 0.826 |
| LSTM-SVM | 8 | 0.088 | 0.066 | 0.892 |
| LSTM-SVM | 12 | 0.085 | 0.06 | 0.899 |
| LSTM-RF | 4 | 0.111 | 0.084 | 0.828 |
| LSTM-RF | 8 | 0.087 | 0.065 | 0.893 |
| LSTM-RF | 12 | 0.086 | 0.06 | 0.898 |
| LSTM | 4 | 0.116 | 0.081 | 0.81 |
| LSTM | 8 | 0.09 | 0.078 | 0.84 |
| LSTM | 12 | 0.089 | 0.062 | 0.889 |


In the case of the 7-day ahead prediction, the performance of the models also increased as the number of previous days used as input increased. For the LSTM model, the values of RMSE and MAE decreased from 0.116 to 0.09 and then to 0.089, and from 0.081 to 0.078 and then to 0.062, respectively, while the value of *R*^{2} increased from 0.81 to 0.84 and then to 0.889 for input windows of 4, 8, and 12 days. For the LSTM-SVM model, the values of RMSE and MAE decreased from 0.117 to 0.088 and then to 0.085, and from 0.085 to 0.066 and then to 0.06, respectively. The value of *R*^{2} increased from 0.826 to 0.892 and then to 0.899. For the LSTM-RF model, the values of RMSE and MAE decreased from 0.111 to 0.087 and then to 0.086, and from 0.084 to 0.065 and then to 0.06, respectively. The *R*^{2} value increased from 0.828 to 0.893 and then to 0.898.

### Evaluation of the 1 and 7 days ahead

For a definitive analysis of the models proposed in this study, two scenarios were used for forecasting (1 and 7 days ahead). The models used in these predictions consider streamflow data from up to the previous 12 days because these models have the best performance (Tables 1 and 2).

In general, as the lead time increased, model performance decreased. When 4 days of past data were used, the performance of the LSTM-SVM, LSTM-RF, and LSTM models decreased as the lead time increased from 1 to 7 days. Specifically, the value of RMSE increased from 0.104 to 0.117 for the LSTM-SVM model, from 0.103 to 0.111 for the LSTM-RF model, and from 0.111 to 0.116 for the LSTM model. Regarding MAE, the value increased from 0.08 to 0.085 for the LSTM-SVM model, from 0.075 to 0.084 for the LSTM-RF model, and from 0.08 to 0.081 for the LSTM model.

Similarly, in the case of using the preceding 8 days data, the value of RMSE and MAE increased from 0.062 to 0.088 and 0.046 to 0.066 for the LSTM-SVM model, from 0.061 to 0.087 and 0.046 to 0.065 for the LSTM-RF model, and from 0.065 to 0.09 and 0.076 to 0.078 for the LSTM model.

In the case of the 12-day dataset, the values of RMSE and MAE increased from 0.06 to 0.085 and from 0.045 to 0.06 for the LSTM-SVM model, from 0.059 to 0.086 and from 0.045 to 0.06 for the LSTM-RF model, and from 0.075 to 0.089 and from 0.055 to 0.062 for the LSTM model.

The *R*^{2} index was also used to evaluate performance. Figure 6 shows the *R*^{2} value of the LSTM, LSTM-SVM, and LSTM-RF models. With a 4-day dataset predicting 1 and 7 days, the *R*^{2} value decreased from 0.846 to 0.826; from 0.848 to 0.828; and from 0.826 to 0.81 for LSTM-SVM, LSTM-RF, and LSTM, respectively. With 8 days of data, *R*^{2} decreased from 0.946 to 0.892; from 0.947 to 0.893; and from 0.844 to 0.84 for LSTM-SVM, LSTM-RF, and LSTM, respectively. For 12 days, *R*^{2} decreased from 0.954 to 0.899; from 0.953 to 0.898; and from 0.921 to 0.889 for LSTM-SVM, LSTM-RF, and LSTM, respectively.
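The size of the decrease in *R*^{2} between the 1-day and 7-day predictions can be tabulated directly from the values reported in Tables 1 and 2. The short sketch below does only this arithmetic; the variable and function names are hypothetical.

```python
# R^2 values from Tables 1 and 2, ordered by 4, 8, and 12 previous days.
r2_1day = {
    "LSTM-SVM": [0.846, 0.946, 0.954],
    "LSTM-RF": [0.848, 0.947, 0.953],
    "LSTM": [0.826, 0.844, 0.921],
}
r2_7day = {
    "LSTM-SVM": [0.826, 0.892, 0.899],
    "LSTM-RF": [0.828, 0.893, 0.898],
    "LSTM": [0.81, 0.84, 0.889],
}
windows = [4, 8, 12]


def r2_drop(one_day, seven_day):
    """Difference in R^2 (1-day minus 7-day) per model and input window."""
    return {
        model: [round(a - b, 3) for a, b in zip(one_day[model], seven_day[model])]
        for model in one_day
    }


drops = r2_drop(r2_1day, r2_7day)
for model, values in drops.items():
    for window, drop in zip(windows, values):
        print(f"{model}, {window} previous days: R2 drop = {drop:.3f}")
```

Expressed in percentage points, the decreases for the hybrid models with 8 and 12 previous days are about 5, while the decrease for the 4-day window is about 2.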

The results showed that the LSTM-RF model has better generalization performance than the other two models (LSTM and LSTM-SVM) for both 1 and 7 days ahead. The LSTM-SVM model ranked second, followed by LSTM. The results again confirm the superiority of the hybrid models.

^{3}/s) compared to the observation streamflow. Because this study uses a data-driven approach to predict the streamflow, the accuracy of the models depends on the input data. However, the flood data available in this study are limited (about two or three events per year), so they are not enough to train the models. Several studies have applied different methods to reduce these problems, for example, data transformation techniques such as Fourier or wavelet decomposition to process the data before they are used as input to ML models (Wang *et al.* 2022).

## DISCUSSION

Accurate streamflow prediction plays an important role in water resource management and is one of the most challenging tasks in hydrology, especially in the context of climate change (Rasouli *et al.* 2012; Kilinc & Haznedar 2022). Although various models have been developed to predict streamflows in rivers around the world, the accuracy of these models is still a big challenge for the global scientific community. Moreover, each model can only solve the problems of a certain region. There is not yet a universal model to solve all problems in all regions, so it is necessary to develop new models. The objective of this study is the development of a state-of-the-art model to predict streamflow, based on LSTM, SVM, and RF. The Mekong River was selected as the area of study because it is the most important transboundary river in Asia, and the construction of upstream dams and climate change have had increasingly profound effects on the downstream streamflow.

We proposed an optimization framework to determine the optimal hyperparameters of the LSTM model using SVM and RF, both of which help the optimization process to converge faster than traditional methods such as trial-and-error, grid search, or population-based training. Moreover, the combination of models like SVM, RF, and LSTM provides advantages for time-series prediction problems such as river streamflow prediction. These advantages become clear when the optimization is completed. The proposed models have been trained to find the optimization level of the river streamflow, and they can predict the streamflow value in the following days. This was verified by comparing the performance of the hybrid models and the individual model. In general, the hybrid models outperformed the individual model, in line with several previous studies showing that hybrid models overcome the weak points of individual models. In this study, the LSTM-SVM and LSTM-RF models performed better than the LSTM model. SVM requires less memory and can process large datasets, while RF is easy to use and helps to limit overfitting. Both can therefore improve the performance of the LSTM model (Nguyen 2022a).

Several recent studies have used LSTM to predict river streamflow. Xu *et al.* (2020) used the LSTM model to predict the 10- and 1-day ahead streamflow in the Hun and Yangtze rivers, respectively. The results showed a value of *R*^{2} ranging from 64 to 75%. Girihagama *et al.* (2022) applied the LSTM model to predict the daily streamflow for ten different watersheds within the Ottawa River watershed. The value of *R*^{2} ranged from 50 to 86%. Li *et al.* (2021) used the LSTM model and its hybrids to predict streamflow at the Baozhusi Hydrological Station on the Jialing River, China. The results indicated that the *R*^{2} of the proposed models varied from 70 to 80%. Qi *et al.* (2019) developed the LSTM and DEL-LSTM (decomposition-ensemble-learning model and LSTM neural network) models to predict daily inflow into the Ankang reservoir from the Han River in northern China. The results showed a value of *R*^{2} from 60 to 70%. In addition, in the same study area (the Mekong River delta), Nguyen *et al.* (2015) used three models, namely, Least Absolute Shrinkage and Selection Operator (LASSO), Random Forests, and Support Vector Regression (SVR), to predict the streamflow in the Mekong River. Their results reported a higher MAE value of 0.486. Although the study regions and methodologies differ, these studies generally used the same ML-based approach to predict streamflow. The accuracy of our models was similar to that reported in these studies, so we can conclude that the performance of the models proposed in this study is consistent with the literature.
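For readers comparing the *R*^{2} and MAE values quoted above, the following minimal sketch shows how the three indices used in this study (RMSE, MAE, and *R*^{2}) are commonly computed. It assumes the coefficient of determination takes the form 1 − SS_res/SS_tot; some studies report the squared Pearson correlation instead, and the function names are illustrative rather than taken from the original work.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error between two equal-length sequences."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def r_squared(observed, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the observations are constant")
    return 1.0 - ss_res / ss_tot
```

An *R*^{2} reported as 80% corresponds to a return value of 0.8 from `r_squared`.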

Although several studies and methods have been applied effectively to predict streamflow in rivers around the world (Petty & Dhingra 2018; Adnan *et al.* 2021c), in recent years, hydrological processes have been strongly influenced by human activities such as dam construction, which makes streamflow prediction difficult, especially during extreme events such as floods or droughts (Ahn & Merwade 2014; Zuo *et al.* 2014; Nguyen *et al.* 2022). Many researchers have asked whether these problems can be solved by training models on data related to human activities, and previous studies support this (Sun *et al.* 2014; Jalali *et al.* 2021; Shah *et al.* 2022). However, in the case of the Mekong River, data sharing remains a major challenge.

The generalization problem is one of the largest obstacles to using ML in general and deep learning (DL) in particular; that is, whether the models can predict outside the scope of the training dataset. For example, many models work well for predicting streamflow in the short term but cannot predict the long term. In theory, this would not be a significant challenge if the training dataset were sufficient and included all possible events. However, one of the disadvantages of using large datasets is the computation time needed to train the LSTM model. Xu *et al.* (2020) proposed two solutions to these problems: improving the computing capacity of computers, particularly graphics processing units (GPUs), and aggregating individual watershed datasets with similar characteristics to train the LSTM model. Such a model can work at a regional level to predict the streamflow in other watersheds. In several cases, collecting sufficient training data is a significant challenge when using a data-driven model. In such cases, several authors have developed hybrid models by combining DL with a model capable of extrapolation (Kişi 2011) or with a physics-based model (Cho & Kim 2022). Cigizoglu (2003) demonstrated that the ANN model has advantages in solving the extrapolation problem and performs better than traditional models such as the lognormal distribution.

The streamflow of the Mekong Delta in Vietnam is strongly influenced by climate change and by human activities upstream. Therefore, the results of this study play an important role in supporting decision-makers and planners in the creation of effective water resource management strategies for the development of agriculture and industry. Moreover, this article advances knowledge of the real applications of ML tools in earth science, which we believe is useful and necessary for solving real-life problems with new technologies. Although this study predicts the streamflow in the Mekong Delta in Vietnam, the approach can be applied to other rivers around the world.

Although ML in general, and deep learning in particular, have proven effective in predicting streamflow, the DL model used here has limitations. First, two sets of parameters had to be initialized when setting the hyperparameters: those of the LSTM model and those of the two proposed optimization algorithms. Tuning methods such as trial-and-error, grid search, and population-based training have been used for this purpose; however, they can be time-consuming and resource intensive. Second, this study uses the water level at the Tan Chau station to predict the streamflow at the Can Tho station, whereas in reality several other factors, such as precipitation and evaporation, also influence the streamflow. In this study, the methodology was adapted to better predict the streamflow in the Mekong Delta. Third, although the proposed models were effective in predicting the streamflow, the predicted streamflow tended to be lower than the observed streamflow. This is related to the limitations of the model training data. To improve the generalizability of the model for extreme events, it is necessary to collect extreme event data. Another option to explore in the future to improve the prediction capacity is the use of other models such as ELM.
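The tendency to underpredict noted above can be quantified with a signed error measure. The sketch below computes the mean error and the percent bias; it assumes the sign convention predicted minus observed, so negative values indicate underprediction (some hydrology references use the opposite sign). The function names and sample values are purely illustrative and are not data from this study.

```python
def mean_error(observed, predicted):
    """Average of (predicted - observed); negative means underprediction."""
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)


def percent_bias(observed, predicted):
    """Total signed error as a percentage of the total observed flow."""
    total_obs = sum(observed)
    if total_obs == 0:
        raise ValueError("percent bias is undefined when observed flow sums to zero")
    return 100.0 * sum(p - o for o, p in zip(observed, predicted)) / total_obs


# Hypothetical flows (m^3/s), chosen only to illustrate the sign convention.
observed_flow = [300.0, 400.0, 500.0]
predicted_flow = [280.0, 370.0, 450.0]

print(mean_error(observed_flow, predicted_flow))    # negative: underprediction
print(percent_bias(observed_flow, predicted_flow))  # negative: underprediction
```

Reporting such a signed measure alongside RMSE and MAE makes the direction of the error visible, which the unsigned indices cannot show.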

## CONCLUSION

Accurate streamflow prediction plays an important role in water resource management and planning. Several physics-based and data-based models have been used to predict streamflow; however, each model has different limitations, and there is no universal method that solves all problems in all regions. The objective of this study was to develop a state-of-the-art method based on LSTM, SVM, and RF to predict the daily streamflow in the Mekong Delta of Vietnam. The results of this study can support decision-makers in the development and sustainable management of water resources.

The results were validated using various statistical indices and by comparison with the results of the individual LSTM model. Based on the results obtained, we can conclude that:

The hybrid models (LSTM-SVM and LSTM-RF) outperformed the individual model (LSTM) in predicting the daily streamflow.

The models proposed in this study successfully predicted the daily streamflow in the Mekong Delta of Vietnam with high accuracy (*R*^{2} > 80%). The new models can be used to predict streamflow in any region, especially in data-limited regions.

The predicted streamflow was lower than the observed streamflow for both the 1- and 7-day ahead forecasts, particularly in the flood season (300–500 m^{3}/s).

Although this study successfully built models to predict the streamflow 1 and 7 days ahead from input windows of the preceding 4, 8, and 12 days in the Mekong Delta, future models should maintain high accuracy over longer lead times, and model construction needs to be faster to better support decision-makers or planners in water resource management.
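The forecasting setup summarized above (inputs from the preceding 4, 8, or 12 days, targets 1 or 7 days ahead) can be sketched as a sliding-window transformation of the daily series. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the function name `make_windows`, the feature layout (water levels followed by flows), and the synthetic series are all hypothetical.

```python
def make_windows(water_level, streamflow, lookback, horizon):
    """Build (features, target) pairs from two aligned daily series.

    features: the last `lookback` days of water level, then of streamflow
              (this layout is an assumption made for illustration).
    target:   the streamflow `horizon` days after the last input day.
    """
    if len(water_level) != len(streamflow):
        raise ValueError("series must have the same length")
    samples = []
    last_start = len(streamflow) - lookback - horizon
    for start in range(last_start + 1):
        end = start + lookback  # index just past the last input day
        features = list(water_level[start:end]) + list(streamflow[start:end])
        target = streamflow[end + horizon - 1]
        samples.append((features, target))
    return samples


# Synthetic stand-in series; real inputs would be the station records.
levels = [float(i) for i in range(30)]
flows = [100.0 + i for i in range(30)]

# The six configurations: three input windows times two lead times.
for lookback in (4, 8, 12):
    for horizon in (1, 7):
        pairs = make_windows(levels, flows, lookback, horizon)
        print(lookback, horizon, len(pairs))
```

Longer windows and longer lead times both reduce the number of usable samples, which is one practical reason to compare several configurations rather than fix a single one.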

Future research may extend this streamflow prediction approach with other ensemble-learning techniques, for example by combining physics-based models with data-based models. Model integration aims to improve the predictive ability for complicated hydrological problems. Moreover, in this study, previous water level and streamflow values were used as the input data of the prediction model. In the future, streamflow prediction can consider more factors, such as rainfall, evaporation, and weir effects. The results of this study can be an effective tool for developing water resource management strategies in all regions of the world.

## AUTHOR CONTRIBUTIONS

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by H.D.N. and Q.-H.N. The first draft of the manuscript was written by H.D.N., C.P.V., Q.-H.N., and Q.-T.B. All authors read and approved the final manuscript.

## FUNDING STATEMENT

No funding was received for this study.

## DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.

## CONFLICT OF INTEREST

The authors declare there is no conflict.

## REFERENCES
