Abstract
Precipitation prediction is important in the Thua Thien Hue Province, which is affected by climate change. This paper therefore applies two models, the Seasonal Auto-Regressive Integrated Moving Average (SARIMA) model and the Long Short-Term Memory (LSTM) model, to predict precipitation in the province. The input data were collected at three meteorological stations for the period 1980–2018. A comparison of the two models showed that the LSTM model was more accurate than the SARIMA model at the Hue, Aluoi, and Namdong stations. The best forecast model is for the Hue station (R² = 0.94, NSE = 0.94, RMSE = 8.15), the second-best is for the Aluoi station (R² = 0.89, NSE = 0.89, RMSE = 12.72), and the least accurate is for the Namdong station (R² = 0.89, NSE = 0.89, RMSE = 12.81). The results may also support stakeholders who apply these models with future data to mitigate natural disasters in Thua Thien Hue.
HIGHLIGHTS
The SARIMA and LSTM models can improve the accuracy of monthly precipitation forecasts in the Thua Thien Hue Province.
The local precipitation forecast system presented here relies on a neural network trained on meteorological data collected at the Hue, Aluoi, and Namdong stations.
The Min–Max normalization method for the data is applied to improve the accuracy of the precipitation forecast of the models.
The LSTM and SARIMA forecasts are compared using NSE, R², and RMSE.
The LSTM predictions are significantly better than those of SARIMA for the monthly precipitation regime.
INTRODUCTION
In recent decades, global climate change has caused sea levels to rise and has increased droughts and extreme flooding. These dangerous weather phenomena, which are almost becoming a pattern in modern times, threaten food security and endanger the lives of several hundred million people on earth (Mall et al. 2006; Stuart et al. 2011; Busby 2018). Several countries are experiencing harsh climates, and the least-developed countries are the most affected (Mango et al. 2011; Elliott et al. 2014; Obianyo 2019; Oo et al. 2020). Climate change has led to the worst precipitation scenarios, because the change in the rain cycle has affected the agricultural sector, the management and reserves of water sources, groundwater, and the flow of rivers (Liu et al. 1998; Liu et al. 2000; Mirza 2003; Kotir 2011; Defrance et al. 2020; Javadinejad et al. 2020).
The mitigation and control of natural disasters require meteorological forecasts that are issued early, are highly accurate, and are easy to understand. Many meteorological prediction studies are now based on artificial intelligence methods such as machine learning and neural networks, and their results show high accuracy in predicting precipitation, storms, and droughts over both short and long periods (Nourani et al. 2011; Deka 2014; Du et al. 2018). Weather-related data are usually time series, so time-series forecasting methods are commonly applied to weather prediction (Mishra et al. 2007; Park & Kim 2017). Several studies have predicted and analyzed time-series data using supervised machine learning, such as Long Short-Term Memory Recurrent Neural Networks (LSTM RNNs), and statistical models such as the Seasonal Auto-Regressive Integrated Moving Average (SARIMA) or Auto-Regressive Integrated Moving Average (ARIMA) (Anwar et al. 2016; Chen et al. 2017; Parmezan et al. 2019). In LSTM networks, signals can also travel backward because the networks have feedback connections (Kalchbrenner et al. 2015; Salehinejad et al. 2017).
Valipour (2015) used the SARIMA and ARIMA models to study long-term runoff forecasting for 2011 with data from 1901 to 2010 in the United States. The results indicated that the SARIMA model was more accurate than the ARIMA model. Sampson et al. (2013) employed the SARIMA model to forecast precipitation using data from January 1980 to December 2010 in the Navrongo Municipality of Ghana. The best SARIMA model for the precipitation forecast had (p, d, q) × (P, D, Q)s parameters of (0, 0, 1) × (0, 1, 1)12. Bibi et al. (2014) applied the ARIMA time-series model to monthly precipitation prediction with 27 years of data (1980 to 2006) in Northeastern Nigeria. The model showed the monthly precipitation tendency and the number of rainy days per month over the six months from May to October. Hu et al. (2018) applied ANN and LSTM network models to predict precipitation runoff. The research data were collected from 14 precipitation stations and one hydrologic station in the catchment for flood events from 1971 to 2013 in the Fen River Basin. The results indicated that both network models were suitable for precipitation-runoff modelling. Ni et al. (2020) implemented the LSTM model to forecast streamflow and precipitation. The monthly streamflow volume data were collected from the Cuntan and Hankou stations in the Yangtze River basin, China. The results showed that LSTM was also suitable for time-series prediction.
This paper studies precipitation prediction using the SARIMA and LSTM models for the Thua Thien Hue Province. The input vectors used in the models are based on 468 months (39 years) of precipitation measured at the province's three main meteorological stations, Hue, Aluoi, and Namdong. The comparison between SARIMA and LSTM relies on statistical parameters such as the mean (M), root mean square error (RMSE), Nash–Sutcliffe Efficiency (NSE), minimum (Min), maximum (Max), standard deviation (St. Dev.), coefficient of variation (Cv), skewness coefficient (Cs), and coefficient of determination (R²). Together, these results indicate how well each model predicts precipitation. Furthermore, the forecasts can assist the province in mitigating natural disasters such as floods, droughts, and landslides.
The structure of the paper is organized as follows. Section 1 gives the introduction to the paper. Section 2 introduces the methodology used throughout this paper. Sections 3 and 4 describe the study results and present discussions. Finally, Section 5 presents the conclusions.
METHODOLOGY
Study area
The Thua Thien Hue Province belongs to the central coast of Vietnam (see Figure 1). The province is a tropical monsoon region with a complicated topography and climate. This province is one of the areas greatly affected by natural disasters in Vietnam (Do 2002; Lee & Lee 2017; Huynh et al. 2021; Nguyen et al. 2021). Almost annually, the province has to contend with natural disasters like floods, storms, droughts, and so on. These extreme disasters include past events like the floods of 1983, 2007, 2011, and 2015, the destructive storms of 1985, 2006, and 2013, and the floods of 1999 and 2020 that were unprecedented in terms of recorded history. These disasters caused huge loss of human lives and damage to property in the province. Moreover, nowadays, they occur with intense frequency and unfailing regularity, breaking the established law of the weather and climate.
The red–blue dots in Figure 1 denote the locations of the Hue, Aluoi, and Namdong meteorological stations. The precipitation observed in these stations greatly influences the province river system. The precipitation obtained in these stations indicates that the average annual precipitation of the province is unevenly distributed over both space and time. With regard to space, the precipitation of the mountain area fluctuates from 3,400 to 7,000 mm and that of the delta area ranges from 2,100 to 3,000 mm. With regard to time, the season of less rainfall is from January to August, and the actual rainy season is from September to December (for details, see Figure 2).
Long short-term memory
LSTM is a member of the family of RNN and was first introduced in the year 1997 (Hochreiter & Schmidhuber 1997; Chakraborty et al. 2016; Vazhayil & Soman 2018). LSTM can record values from previous periods for future applications (Mackenzie et al. 2018; Siami-Namini et al. 2018). Before describing LSTM, this study will introduce the basic concept of neural networks.
Artificial neural network
A neural network includes at least three core layers, namely, an input layer, a hidden layer, and an output layer (Huang 2003). The number of features in the dataset determines the dimension, or number of nodes, of the input layer. Input nodes receive signals in numerical form. Each node holds an activation value, and a higher value means a stronger activation. This information is then transmitted through the network. Each connection carries a weight, and the activation value is passed from one node to another according to the connection strength (weight), whether the connection is inhibitory or excitatory, and the transfer function (Murray & Edwards 1993; Kaushik et al. 2020). Neural networks learn essentially by adjusting these weights (Smith & Demetsky 1994; Chang 2012). In the hidden layers, each node applies an activation function, such as ReLU, to the weighted sum of its inputs and passes the result on toward the output layer, which produces the predicted values. The output layer produces a probability vector over the output nodes and selects the one with the minimum error rate. The difference between the expected and predicted values is measured by an error function, and the error obtained after the first training pass is unlikely to be the minimum because of the initial assignment of the weight vectors. The backpropagation algorithm therefore propagates the errors backward from the output layer to the hidden layers, and the weights are adjusted to reduce the error and improve the predicted values. The training procedure is repeated until the predicted values approach the actual values (Werbos 1990; Paola & Schowengerdt 1995).
Recurrent neural network
The RNN is a special case of a neural network. The goal of an RNN is to predict the next step in a sequence of observations from the previous steps in the same sequence. In other words, an RNN is a memory algorithm that is capable of remembering previously computed information (Mohan & Gaitonde 2018; Siami-Namini et al. 2018; Balderas et al. 2019). In traditional neural networks, the inputs and outputs are treated as independent of each other; an RNN, by contrast, carries information from one step of the sequence to the next. The idea behind RNN is to take advantage of sequential observations and learn from previous periods to predict future trends. Hence, the earlier stages of the data need to be memorized when predicting the next steps (Kraus & Feuerriegel 2017; Siami-Namini et al. 2018, 2019). In RNNs, hidden layers act as internal storage for the information obtained in previously read steps of the sequence. RNNs are called 'recurrent' because they perform the same task for every element of the sequence, using previously collected information to predict unseen future sequence data. The major problem of typical generic RNNs is that these networks only remember a few previous steps in the sequence and are not suitable for memorizing longer data sequences. This problem is addressed by the 'memory stream' introduced in LSTM (Lipton et al. 2015; Kraus & Feuerriegel 2017; Yang et al. 2018; Young et al. 2018; Wu et al. 2020a, 2020b).
Long short-term memory
LSTM includes multiple connected LSTM cells, and the specific structure is described in Figure 3 (Bermúdez et al. 2017; Wu et al. 2020a, 2020b). The idea behind LSTM is to add an internal cell state and three gates that filter information between the input and the output of the cell. These gates are the forget gate, the input gate, and the output gate. At each time step t, each gate takes the current input and the value output by the memory cell at the previous time step t–1.
The forget gate decides whether the cell state information from time step t–1 should be kept. The current input and the previous hidden state are passed through the sigmoid function, whose output lies in the range (0, 1). If the output of the forget gate is close to 1, the information is retained; if it is close to 0, the information is discarded.
The input gate is responsible for updating the cell state. The output of a sigmoid activation is multiplied by the output of a tanh activation, both computed from the current input and the previous hidden state, to decide which information should be added to the cell state.
The output gate is responsible for calculating the hidden state value of the next time step. Using the forget and input gates, the new value of the cell state can be calculated, and the current input and the hidden state values can be combined to calculate the next hidden state value. The next hidden state value is the prediction value.
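The gate logic described above can be sketched as a single LSTM time step in NumPy. This is a minimal illustration under assumptions, not the implementation used in the study: the symbol names (`f_t`, `i_t`, `o_t`, `g_t`, `c_t`, `h_t`), the weight dictionaries `W`, `U`, `b`, the hidden size of 4, and the random weights are all hypothetical and follow the commonly used LSTM formulation.

```python
import numpy as np


def sigmoid(z):
    """Logistic function; squashes values into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """Run one LSTM time step and return the new hidden and cell states.

    W, U and b are dicts keyed by 'f', 'i', 'o', 'g' holding input weights,
    recurrent weights and biases (hypothetical names, standard formulation).
    """
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    g_t = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])  # candidate values
    c_t = f_t * c_prev + i_t * g_t   # keep part of the old state, add new information
    h_t = o_t * np.tanh(c_t)         # hidden state passed to the next time step
    return h_t, c_t


# Illustrative sizes and random weights (assumptions, not the study's values).
rng = np.random.default_rng(0)
n_in, hidden = 1, 4
W = {k: rng.normal(scale=0.1, size=(hidden, n_in)) for k in "fiog"}
U = {k: rng.normal(scale=0.1, size=(hidden, hidden)) for k in "fiog"}
b = {k: np.zeros(hidden) for k in "fiog"}

h, c = np.zeros(hidden), np.zeros(hidden)
for value in [0.1, 0.4, 0.9]:        # three normalized observations
    h, c = lstm_step(np.array([value]), h, c, W, U, b)
```

Because the hidden state is the output gate (between 0 and 1) multiplied by a tanh value, every component of `h` stays strictly between −1 and 1, whatever the input.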
Obtaining a network model that generalizes well requires manual calibration of the hyperparameters. The main settings calibrated for model training are the number of layers, number of nodes, batch size, verbose level, and number of epochs. Selecting suitable hyperparameters allows the model to make better predictions. In this paper, the main settings of the LSTM model built for the three meteorological stations of Hue, Aluoi, and Namdong are given in Table 1. The model consists of an LSTM layer followed by a single fully connected layer, as recommended by some studies (Zhang et al. 2018; Ayzel 2019). The trainable parameters of the LSTM network are its weights and biases, which are updated through the backpropagation through time (BPTT) algorithm (Werbos 1990). The remaining settings define the training process. The loss function is the mean squared error (MSE). The LSTM layer has 200 hidden neurons. The network is trained over a number of epochs to minimize the MSE using the adaptive moment estimation (Adam) optimizer (Kingma & Ba 2014). Two epoch counts are used in this study, 2,000 and 10,000. The batch size equals the size of the training dataset, 336 samples, which corresponds to batch gradient descent.
Parameter | Value | Parameter | Value |
---|---|---|---|
Number of inputs | 12 | Number of outputs | 1 |
Number of hidden nodes | 200 | Activation function | ReLU |
Optimizer | Adam | Loss function | MSE |
Epochs | 2,000; 10,000 | Verbose | 2 |
Batch size | 336 | Metrics | Mean Absolute Error |
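The 12 inputs and single output in Table 1 suggest a sliding window in which 12 consecutive months predict the following month. The sketch below assumes that layout; the helper name `make_windows` and the placeholder series are hypothetical. Under this assumption, a training series of 348 months (January 1980 to December 2008) yields 348 − 12 = 336 samples, which agrees with the batch size stated above.

```python
def make_windows(series, n_inputs=12):
    """Turn a monthly series into (12-month input, next-month target) pairs."""
    inputs, targets = [], []
    for start in range(len(series) - n_inputs):
        inputs.append(series[start:start + n_inputs])   # months t .. t+11
        targets.append(series[start + n_inputs])        # month t+12
    return inputs, targets


# Placeholder series with one value per training month (not real data).
train_series = [float(month) for month in range(348)]
X, y = make_windows(train_series, n_inputs=12)
print(len(X), len(X[0]))
```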
SARIMA model
In this paper, the SARIMA model is deployed in four phases: determination, estimation, verification, and execution (or forecast). These phases are implemented to determine the optimal predictive model for each station (Salas & Obeysekera 1982; Burlando et al. 1993). In the first phase, the stationarity of the data and the general form, or estimated order, of the model are determined. In the second phase, the Augmented Dickey–Fuller (ADF) test is used to evaluate the p-value against the critical-value cutoffs. If the test statistic is lower than the critical value, the time series is considered stationary (Guo & Ogata 1997; Avishek & Prakash 2017); the unit root test thus decides whether the null hypothesis of a unit root (non-stationarity) in the time series is rejected (Said & Dickey 1984; Dabral & Murry 2017); the maximum likelihood method is applied to estimate the model parameters; and the Ljung–Box statistical test is used to check model adequacy by verifying that the residuals are white noise (Ljung & Box 1978). In the third phase, the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) are employed to determine the best model type. Each model is assessed using the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and plots of the residuals, and the model with the lowest AIC and BIC is selected (Akaike 1998; Box et al. 2011). Finally, once the optimal model is determined, the forecast is executed, and the results are compared with the validation vectors using R², NSE, and RMSE to evaluate the performance (Dastorani et al. 2016; Nury et al. 2017).
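The "integrated" part of SARIMA removes trend and seasonality by differencing before the autoregressive and moving-average terms are fitted. The sketch below illustrates regular differencing (d = 1) and seasonal differencing (D = 1 with a 12-month period) on a synthetic series. The helper `difference` and the synthetic data are assumptions made for illustration and do not reproduce the study's workflow.

```python
def difference(series, lag=1):
    """Return the lag-differenced series: x[t] - x[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Synthetic monthly series: a linear trend plus a repeating 12-month pattern.
pattern = [0, 0, 0, 0, 1, 2, 3, 5, 9, 12, 8, 3]
series = [0.5 * t + pattern[t % 12] for t in range(48)]

seasonal_diff = difference(series, lag=12)      # D = 1, s = 12: removes the yearly cycle
stationary = difference(seasonal_diff, lag=1)   # d = 1: removes the remaining trend
```

For this synthetic series the seasonal difference leaves only the constant yearly increase of the trend, and the regular difference then leaves zeros, which shows how the two operations together strip out the deterministic structure.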
Data normalization
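Min–Max normalization is usually defined as x' = (x − min) / (max − min), which rescales a series to the interval [0, 1]. The sketch below assumes this standard form. The function names and the sample values are illustrative and are not taken from the study.

```python
def min_max_scale(values):
    """Scale values linearly to [0, 1]; also return the bounds for inversion."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("Min-Max scaling is undefined for a constant series")
    return [(v - lo) / (hi - lo) for v in values], lo, hi


def min_max_invert(scaled, lo, hi):
    """Map scaled values back to the original units (for example mm)."""
    return [s * (hi - lo) + lo for s in scaled]


rain_mm = [35.0, 120.5, 788.3, 50.6, 2452.3]   # illustrative monthly totals
scaled, lo, hi = min_max_scale(rain_mm)
restored = min_max_invert(scaled, lo, hi)
```

Keeping `lo` and `hi` is what allows model outputs on the normalized scale to be converted back to millimetres before error metrics such as RMSE are reported.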
Performance metrics
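The three metrics used to compare the models can be computed as in the following sketch. It assumes the usual definitions: RMSE as the square root of the mean squared error, NSE as one minus the ratio of the summed squared errors to the summed squared deviations of the observations from their mean, and R² as the squared Pearson correlation between observed and predicted values. The exact R² variant used in the study is an assumption here, and the sample values are invented.

```python
import math


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def nse(obs, pred):
    """Nash-Sutcliffe Efficiency: 1 is a perfect fit, 0 matches the mean of obs."""
    mean_obs = sum(obs) / len(obs)
    squared_error = sum((o - p) ** 2 for o, p in zip(obs, pred))
    variance_sum = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - squared_error / variance_sum


def r_squared(obs, pred):
    """Squared Pearson correlation between observations and predictions."""
    mean_obs = sum(obs) / len(obs)
    mean_pred = sum(pred) / len(pred)
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_pred = sum((p - mean_pred) ** 2 for p in pred)
    return cov ** 2 / (var_obs * var_pred)


observed = [10.0, 20.0, 30.0, 40.0]     # invented values
predicted = [12.0, 18.0, 33.0, 37.0]
scores = (r_squared(observed, predicted), nse(observed, predicted),
          rmse(observed, predicted))
```

Under these definitions NSE penalizes bias as well as scatter, whereas the correlation-based R² does not, so the two can differ slightly for the same forecast.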
The methodology of this study is described by the flowchart shown in Figure 4.
RESULT
Data collection
This study used the precipitation observed at the Hue, Aluoi, and Namdong meteorological stations for the period 1980–2018, together with the annual statistical report of the Thua Thien Hue Centre for Hydro-Meteorological Forecasting. To double-check data reliability, the annual statistical report of the Thua Thien Hue Province was also used. Table 2 summarizes the characteristics of the data employed in this study.
Stations | Location | Earliest record year | Latest record year | Number of months |
---|---|---|---|---|
Hue | Hue city | 1980 | 2018 | 468 |
Aluoi | Aluoi district | 1980 | 2018 | 468 |
Namdong | Namdong district | 1980 | 2018 | 468 |
The operation and maintenance of the meteorological sites may affect data collection, possibly producing extreme values outside the expected range of the prediction model that are inconsistent with the rest of the data. Values in a dataset that fall outside the expected range are called outliers. The observed data must be free from outliers so that the best observation data are used in the models. To remove outliers from our datasets, we checked data reliability against the annual statistical report of the Thua Thien Hue Centre for Hydro-Meteorological Forecasting using the standard deviation method.
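A standard deviation check of this kind is often implemented by flagging values that lie more than k standard deviations from the mean. The sketch below assumes k = 3, which is a common convention and not a value stated in this study; the function name and the sample series are likewise hypothetical.

```python
import statistics


def flag_outliers(values, k=3.0):
    """Return indices of values more than k sample standard deviations from the mean."""
    mean = statistics.fmean(values)
    st_dev = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) > k * st_dev]


# Invented series: thirty ordinary months and one implausible reading.
monthly_mm = [100.0] * 30 + [5000.0]
suspect = flag_outliers(monthly_mm)
print(suspect)
```

Flagged readings would then be checked against an independent source, such as the statistical reports mentioned above, before being corrected or removed.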
Statistical characteristic results from the monthly precipitation for each meteorological station are also presented in Table 3. The range of the following characteristics was computed from the observed monthly precipitation time-series: the mean, minimum and maximum values, standard deviation (St. Dev.), coefficient of variation (Cv), and skewness coefficient (Cs) (Eris et al. 2019).
Stations | Mean (mm): Min | Mean (mm): Max | Min (mm): Min | Min (mm): Max | Max (mm): Min | Max (mm): Max | St. Dev.: Min | St. Dev.: Max | Cv (%): Min | Cv (%): Max | SkD: Min | SkD: Max |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Hue | 50.6 | 788.3 | 3.2 | 35 | 353.7 | 2,452.3 | 46.7 | 451 | 48 | 106 | 0.39 | 2.26 |
Aluoi | 68.2 | 912.4 | 4.5 | 132.7 | 499.0 | 2,590.0 | 68.2 | 912.4 | 31 | 89 | 0.37 | 1.93 |
Namdong | 66.2 | 974.0 | 1.6 | 123.5 | 412.4 | 2,672.3 | 50 | 681.1 | 43 | 76 | 0.44 | 2.06 |
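The descriptive statistics reported for each station can be computed along the following lines. This is a sketch under assumptions: the coefficient of variation is taken as the sample standard deviation divided by the mean, and the skewness coefficient as the third central moment divided by the 1.5 power of the second. The study does not state which variants it uses, and the sample values are invented.

```python
import statistics


def describe(values):
    """Mean, min, max, standard deviation, Cv (%) and skewness of a series."""
    n = len(values)
    mean = statistics.fmean(values)
    st_dev = statistics.stdev(values)                  # sample standard deviation
    m2 = sum((v - mean) ** 2 for v in values) / n      # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n      # third central moment
    return {
        "mean": mean,
        "min": min(values),
        "max": max(values),
        "st_dev": st_dev,
        "cv_percent": 100.0 * st_dev / mean,
        "skewness": m3 / m2 ** 1.5,
    }


stats = describe([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # invented values
```

A positive skewness, as in this example, means that a few large values stretch the upper tail, which is typical of monthly rainfall totals.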
The input precipitation variables are separated into two parts. The first part is for the period January 1980–December 2008 and is used for the training phase, which contains about 74% of the entire data. The second part is for January 2009–December 2018 and is used for the test phase, which contains the remaining 26% of precipitation data. The monthly values of the dataset have been averaged for the entire January 1980–December 2018 period. Furthermore, considering the precipitation amplitude in these areas, this study has normalized this dataset by the Min–Max scaler method. The dataset after normalization is used in this study for precipitation prediction using the LSTM and SARIMA models. The lines in Figure 5 indicate the conversion of the precipitation data series of the three meteorological stations (Figure 5(a)) to the Min–Max scaler (Figure 5(b)).
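The chronological split described above can be checked with a few lines of arithmetic. The helper name and the placeholder series are hypothetical; only the years and the 12 months per year come from the text.

```python
def chronological_split(series, first_year, last_train_year):
    """Split a monthly series so that training ends in December of last_train_year."""
    n_train = (last_train_year - first_year + 1) * 12
    return series[:n_train], series[n_train:]


months = list(range(468))                      # one placeholder per month, 1980-2018
train, test = chronological_split(months, first_year=1980, last_train_year=2008)
train_share = 100.0 * len(train) / len(months)
print(len(train), len(test), round(train_share, 1))
```

The split keeps the time order intact, so every test month lies after every training month and no future information leaks into training.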
Application
Precipitation prediction by the SARIMA model
To predict the precipitation at the three stations, the following four steps were used to determine the optimal predictive models. First, the precipitation data are normalized using a Min–Max scaler. The calibration data in Figure 5(b) show that precipitation peaks about once a year. The highest precipitation peak is observed in the Namdong area, the second-highest in the Aluoi area, and the lowest in the Hue area. This result can be considered an indicator of seasonal behavior. However, it is difficult to confirm seasonality visually from the figure, so this study applies time-series decomposition to separate and analyze the components of the series (trend, seasonal, and random terms). The four line charts in Figure 6 show the additive decomposition of the series. The series are highly seasonal, mainly following the typical unilateral precipitation regime at the stations. At the same time, no trend is maintained throughout the series. Second, the ADF test gave a p-value of p ≈ 0.00 (<0.05) for the models. These p-values indicate that the null hypothesis of a unit root is rejected, so the data used in the models (with d = 1 and D = 1) are stationary (Diebold & Senhadji 1996; MacKinnon 1996; Granger & Swanson 1997). Third, the ACF and PACF correlation plots of the data from the three stations (see Figure 7) oscillate sinusoidally. These fluctuations also indicate that the SARIMA model is suitable for representing the precipitation series of the Hue, Aluoi, and Namdong stations. In addition, each pair of plots shares two features: non-randomness of the time series and a high lag-1 autocorrelation (which will probably require a higher order of differencing d and D).
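The sinusoidal ACF pattern mentioned above is what a strongly seasonal series produces: the autocorrelation is high at lags that are multiples of the period and negative about half a period away. The sketch below computes the sample ACF of a synthetic 12-month pattern. The function name and the data are illustrative assumptions, not the station records.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((v - mean) ** 2 for v in series)
    numerator = sum((series[t] - mean) * (series[t - lag] - mean)
                    for t in range(lag, n))
    return numerator / denominator


# Synthetic series: ten repetitions of a single-peak yearly pattern.
pattern = [1, 1, 1, 1, 2, 2, 3, 4, 8, 12, 9, 4]
series = [float(pattern[t % 12]) for t in range(120)]
acf = {lag: autocorrelation(series, lag) for lag in (1, 6, 12)}
```

For this series the lag-12 autocorrelation is the largest of the three and the lag-6 value is negative, which is the oscillating signature that motivates a seasonal period of 12 in the SARIMA specification.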
The standardized residual plots in Figures 8(a), 9(a), and 10(a) show that the residuals over time behave as white noise. In the histograms in Figures 8(b), 9(b), and 10(b), the orange Kernel Density Estimation (KDE) curve conforms rather closely to the green N(0,1) curve. Figures 8(b) and 10(b) show histograms that resemble a right-skewed distribution, whereas the histogram in Figure 9(b) resembles a left-skewed distribution. These are good signals for the SARIMA models of the Hue, Namdong, and Aluoi stations in the sense that the residuals are approximately normally distributed. Likewise, the Q–Q plots in Figures 8(c), 9(c), and 10(c) show the ordered residuals (blue dots) lying on the red line, with a mean of 0 and a standard deviation of 1. This indicates that the points follow a linear trend, as expected for normally distributed residuals. The correlogram plots in Figures 8(d), 9(d), and 10(d) imply that the residuals have a low correlation with lagged versions of themselves. Finally, the p, d, q and P, D, Q parameters of the models are adjusted until the optimal model, the one with the minimum AIC, is obtained. The data in Table 4 indicate that the best-fit model for each of the Hue, Namdong, and Aluoi stations is SARIMA (0, 1, 1) × (1, 1, 1, 12), selected on the basis of the minimum AIC values of −510, −321, and −322, respectively.
Hue: SARIMA model | Hue: AIC | Namdong: SARIMA model | Namdong: AIC | Aluoi: SARIMA model | Aluoi: AIC |
---|---|---|---|---|---|
(0, 1, 1) × (1, 1, 1, 12) | −510 | (0, 1, 1) × (1, 1, 1, 12) | −321 | (0, 1, 1) × (1, 1, 1, 12) | −322 |
(0, 1, 1) × (0, 1, 1, 12) | −509 | (0, 1, 1) × (0, 1, 1, 12) | −320 | (0, 1, 1) × (0, 1, 1, 12) | −320 |
(1, 1, 1) × (0, 1, 1, 12) | −509 | (1, 1, 1) × (1, 1, 1, 12) | −319 | (0, 1, 1) × (0, 1, 1, 12) | −320 |
(1, 1, 1) × (1, 1, 1, 12) | −509 | (1, 1, 1) × (0, 1, 1, 12) | −317 | (1, 1, 1) × (1, 1, 1, 12) | −319 |
(1, 1, 1) × (1, 1, 0, 12) | −458 | (0, 1, 1) × (1, 1, 0, 12) | −243 | (1, 1, 1) × (0, 1, 1, 12) | −318 |
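Selecting the model with the minimum AIC amounts to a simple search over the candidate orders. The sketch below uses the Hue candidates and AIC values listed in Table 4. The `aic` helper states the standard definition, AIC = 2k − 2 ln L, for reference; the function names are hypothetical, and in practice the AIC values would come from the fitted models.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood


def select_by_aic(candidates):
    """Return the (order, AIC) pair with the smallest AIC."""
    return min(candidates.items(), key=lambda item: item[1])


# Candidate (p, d, q) x (P, D, Q, s) orders and AIC values for Hue from Table 4.
hue_candidates = {
    ((0, 1, 1), (1, 1, 1, 12)): -510,
    ((0, 1, 1), (0, 1, 1, 12)): -509,
    ((1, 1, 1), (0, 1, 1, 12)): -509,
    ((1, 1, 1), (1, 1, 1, 12)): -509,
    ((1, 1, 1), (1, 1, 0, 12)): -458,
}
best_order, best_aic = select_by_aic(hue_candidates)
print(best_order, best_aic)
```

The top four Hue candidates differ by only one AIC unit, so the residual diagnostics described above remain a useful complement to the AIC ranking.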
The experimental results in Figure 11 reveal that the values predicted by the SARIMA models are very close to the actual precipitation at the three stations. These plots also indicate that the predicted and actual precipitation lines overlap by about 85% at the three stations. The R², NSE, and RMSE values in Table 5 measure the accuracy at the Hue, Aluoi, and Namdong stations: the RMSE values are 30.10, 33.32, and 36.73, respectively, and the R² and NSE values range from 0.87 to 0.93. These values are quite acceptable for the forecast of natural phenomena. In addition, the precipitation forecast is most accurate at the Hue station, followed by the Aluoi station, and is least accurate at the Namdong station. These results are compared with the precipitation forecasts simulated by the LSTM model.
Station | RMSE (SARIMA) | RMSE (LSTM) | R² (SARIMA) | R² (LSTM) | NSE (SARIMA) | NSE (LSTM) |
---|---|---|---|---|---|---|
Hue | 30.09 | 8.15 | 0.93 | 0.94 | 0.93 | 0.94 |
Aluoi | 33.32 | 12.72 | 0.88 | 0.89 | 0.89 | 0.90 |
Namdong | 36.73 | 12.81 | 0.87 | 0.89 | 0.88 | 0.90 |
Average | 33.38 | 12.21 | 0.90 | 0.91 | 0.90 | 0.91 |
Precipitation prediction by LSTM
For the LSTM model, the following five steps produced acceptable results. First, the input data are normalized by Min–Max normalization. Second, the numbers of neurons in the input, hidden, and output layers are set to 12, 200, and 1, respectively. Third, the activation function is ReLU and the training algorithm is Adam. Fourth, the training settings (epochs, verbose, and batch size) are examined, with an early stopping function used to reach the optimal objective, by comparing the loss function (MSE) with the metric (MAE). Finally, the result in Figure 12 shows how the loss descends as the number of epochs increases. In addition, the values in Table 6 show that the loss and the metric, and the amplitude of their oscillation during training, are lower in Figure 12(b) than in Figure 12(a). (That is, the loss and metric curves of Figure 12(b) lie closer to the horizontal axis than those of Figure 12(a).) Training is stopped at the 2,000th and 10,000th epochs, respectively, because the model starts overfitting beyond these points. A comparison of the error metrics for the two epoch counts shows that training for 10,000 epochs gives better accuracy, better performance, and greater stability than training for 2,000 epochs.
Stations | MSE (2,000th epoch) | MAE (2,000th epoch) | MSE (10,000th epoch) | MAE (10,000th epoch) |
---|---|---|---|---|
Hue | 1.5508e-05 | 0.0019 | 1.4790e-05 | 0.0018 |
Aluoi | 2.2966e-05 | 0.0028 | 2.1158e-05 | 0.0023 |
Namdong | 3.1147e-05 | 0.0030 | 2.6396e-05 | 0.0027 |
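The early stopping mentioned in the fourth step is commonly implemented with a patience rule: training halts once the monitored loss has failed to improve for a set number of consecutive epochs. The sketch below shows that rule on an invented loss curve. The patience value, the `min_delta` threshold, the function name, and the losses are all assumptions, since the study does not report its early stopping settings.

```python
def early_stopping_epoch(losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training stops under a patience rule."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            best = loss                       # new best loss, reset the counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch                  # no improvement for `patience` epochs
    return len(losses) - 1                    # never triggered: ran to the end


# Invented validation losses: improvement until epoch 3, then a slow rise.
val_losses = [0.50, 0.30, 0.20, 0.18, 0.19, 0.21, 0.22, 0.25]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)
```

A rising loss after the best epoch, as in this invented curve, is the usual sign of overfitting that such a rule is meant to catch.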
The experimental results in Figure 13 show that the predicted and actual precipitation lines overlap by about 95% at the three stations. The R², NSE, and RMSE values in Table 5 are used to evaluate the best predictive model. The precipitation forecast is most accurate for the Hue station (R² = 0.94, NSE = 0.94, and RMSE = 8.15), second most accurate for the Aluoi station (R² = 0.89, NSE = 0.90, and RMSE = 12.72), and least accurate for the Namdong station (R² = 0.89, NSE = 0.89, and RMSE = 12.81).
Also, the R² of the predictive models ranges from 0.89 to 0.94. In other words, the predictive model reproduces about 90% of the behavior of the actual observations. Therefore, the LSTM forecasting model is acceptable for the three stations. The results of these models are compared with the rain forecast simulations of the SARIMA model in the following section.
A comparison of precipitation prediction between SARIMA and LSTM
The results of simulating the optimal precipitation forecast for the three meteorological stations of Hue, Aluoi, Namdong based on the two models of SARIMA and LSTM are analyzed as presented in Section 3.2.1 and Section 3.2.2 of this paper. In this section, the LSTM model is compared with the SARIMA model to explore the best forecasting model. First, the predicted precipitation results of these models indicate the best, the second-best, and the lowest ranks for the stations, respectively. Second, the two models are used to compare which model gives highly accurate and acceptable precipitation forecasting results for each meteorological station. The three-line charts in Figure 14 show the lines of training, testing, and forecasting. At the same time, Figure 14(a)–(c) shows precipitation prediction graphs using the LSTM and SARIMA models at Hue, Aluoi, and Namdong, respectively. The average values in Table 5 also indicate that the precipitation prediction by the LSTM model is better than the precipitation forecast by the SARIMA model at these three meteorological stations.
DISCUSSIONS
Rain brings many benefits to natural landscapes and human life, but it can also bring disasters such as flooding, landslides, and inundation, and its failure causes drought. Hence, high-accuracy forecasts are needed to help mitigate the impact of these disasters. The Thua Thien Hue Province is the place with the highest precipitation in Vietnam, with an annual amount of about 7,000 mm. With this huge amount of precipitation, it is important that the province has forecasting models for precipitation. Hence, this study forecasts precipitation using the two models of SARIMA and LSTM with data collected from the Hue, Aluoi, and Namdong meteorological stations. The models may prove useful for the province's precipitation prediction. The plots in Figure 15 provide the regression graphs for precipitation forecasting with each model. Both models have R² values for the precipitation forecast of 0.87 or higher. This shows that a high correlation exists between the actual and the predicted data. Although this study uses two different modelling methods, both produce quite similar results. Several recent precipitation studies have factored in rainfall and some of its effects to forecast precipitation. Barman et al. (2021) used the SARIMA model to predict precipitation in the state of Assam in India; the results revealed an RMSE value of 84.40, which was higher than the RMSE value of SARIMA in this study. Poornima & Pushpalatha (2019) deployed an advanced LSTM-based RNN with weighted linear units to estimate precipitation in the Hyderabad region of India; the result showed an RMSE value of 0.35, which was lower than that of this study. Hence, the results of this study are acceptable when compared with those of the two studies.
However, many published works have shown that the mechanisms that generate rainfall are based on cloud phenomena. Sánchez-Monedero et al. (2014) classified clouds into three main groups: condensation nuclei, water vapor, and vertical motion. Consequently, in their study, the ensemble of input variables was selected at different pressure levels relative to weather factors. These factors consisted of the total amount of precipitable water, the degrees of rainfall, moisture, wind speed, wind direction, Convective Available Potential Energy (CAPE), and inhibition of convection. Hashim et al. (2016) also ascertained the significant effect of cloud information when studying rainfall in the city of Patna, India, based on a set of variables including cloud coverage, vapor pressure, maximum and minimum temperatures, and frequency of wet days. Therefore, because rainfall is a complex nonlinear atmospheric process that depends largely on the local scale (Applequist et al. 2002), it is still difficult to define a set of input variables for it. Meteorology is the discipline best suited to guide the training of Artificial Intelligence (AI) models, because it takes into account the physical mechanism of the 24-h rain process.
Although this study has the limitations mentioned above, the results show that the LSTM model is the best model for monthly rainfall forecasting, followed by the SARIMA model. In addition, the results support the efforts of the provincial authority and local people to develop a plan for the mitigation of natural disasters.
CONCLUSIONS
This paper is the first study to present precipitation predicted by neural networks for the Thua Thien Hue Province. The LSTM and SARIMA models are used for forecasting precipitation, and the two models are compared to assess their effectiveness. The study shows that the LSTM model is more accurate than the SARIMA model. The ranking of the forecasts places the Hue station first (= 0.94, = 0.94, and = 8.15), the Aluoi station second (= 0.89, = 0.90, and = 12.72), and the Namdong station last (= 0.89, = 0.89, and = 12.81). These statistics also indicate that the monthly rainfall fluctuation at the Hue station is more stable than that at the Namdong and Aluoi stations. In addition, the Min–Max normalization method was applied to the data to improve the accuracy of the precipitation forecast of the models. The approach would be more useful for daily forecasting if it were based on input data of rainfall measured by day. Nevertheless, the results will prove useful for the Thua Thien Hue provincial government and the local inhabitants if they apply these models, with new data collected in the future, to the mitigation of natural disasters such as floods, droughts, and landslides. One possible direction for future work is to combine the two models into a hybrid model, which may give more accurate rain forecasting results.
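The Min–Max normalization mentioned above maps each value x to (x - min) / (max - min), so that the fitted range becomes [0, 1]. The sketch below is one possible implementation, under the assumption (not stated in this paper) that the minimum and maximum are taken from the training period only and then reused for later data; the function names and rainfall values are hypothetical.

```python
def fit_min_max(values):
    """Return the (minimum, maximum) pair used for scaling."""
    return min(values), max(values)


def scale(values, lo, hi):
    """Map values so that lo becomes 0 and hi becomes 1."""
    span = hi - lo
    if span == 0:
        raise ValueError("cannot scale a constant series")
    return [(v - lo) / span for v in values]


def unscale(scaled, lo, hi):
    """Invert scale(): map normalized values back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]


# Hypothetical monthly totals in mm, NOT data from the study.
train = [0.0, 50.0, 200.0, 100.0]
test = [250.0, 25.0]

lo, hi = fit_min_max(train)            # assumption: fitted on training data only
train_scaled = scale(train, lo, hi)    # all values lie within [0, 1]
test_scaled = scale(test, lo, hi)      # may fall outside [0, 1]
restored = unscale(train_scaled, lo, hi)
```

Model outputs produced in the normalized range would be passed through `unscale` before scores such as RMSE are computed, so that errors are reported in millimetres.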
AUTHOR CONTRIBUTIONS
Conceptualization, Material, and Methods: Nguyen Hong Giang; Writing, Original Draft Preparation: Tran Dinh Hieu, Nguyen Tien Thinh; Writing, Review, and Editing: Yu Ren Wang, Le Anh Phuong; Funding Acquisition: Tran Dinh Hieu, Nguyen Tien Thinh, and Nguyen Hong Giang. All authors have read and agreed to the published version of the manuscript.
FUNDING
This research was funded by Thu Dau Mot University, Vietnam. The APC was funded by Thu Dau Mot University.
ACKNOWLEDGEMENTS
We thank the school of Civil Engineering of the National Kaohsiung University of Science and Technology, Taiwan. We also thank Dr Hector M. Tibo and Dr June Raymond L. Mariano, the current PhD students of the National Kaohsiung University of Science and Technology, Taiwan.
CONFLICTS OF INTEREST
The authors declare no conflict of interest.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.