## Abstract

Precipitation prediction research still lacks methods that use coupled models to predict the different frequency subseries of a precipitation series separately. This paper therefore proposes a coupled model for month-by-month precipitation prediction based on Ensemble Empirical Mode Decomposition (EEMD), the Long Short-Term Memory neural network (LSTM) and the Autoregressive Integrated Moving Average (ARIMA) model. The model was built on monthly historical precipitation data for Luoyang City from 1973 to 2021. The modal components of different frequencies obtained by EEMD were divided into a high-frequency part and a low-frequency part using the Permutation Entropy (PE) algorithm; the LSTM model predicts the high-frequency part and the ARIMA model predicts the low-frequency part. Monthly precipitation forecasts are obtained by superimposing the results of the two models. Finally, predictive performance is evaluated with several assessment metrics. The metrics show that the model outperforms the combined EMD-LSTM (EMD: Empirical Mode Decomposition), EEMD-LSTM and EEMD-ARIMA models as well as the single models, and that its predictions of future precipitation have high confidence.

## HIGHLIGHTS

This paper adopts the EEMD algorithm to decompose the precipitation series into modal components of different frequencies.

LSTM is a special kind of RNN that mitigates the exploding-gradient and vanishing-gradient problems that occur when training RNNs.

The ARIMA model is very simple and requires only endogenous variables without resorting to other exogenous variables.

## INTRODUCTION

Precipitation is the main driver of the biospheric water cycle and is highly complex and uncertain in time and space. Studies have shown that global warming has exacerbated the instability of the climate system (Zhang & Zhou 2019). Under current global climate change, extreme precipitation and extreme drought events are becoming more frequent (Simon & Alberto 2019), which brings considerable uncertainty to agricultural production, the ecological environment and sustainable economic and social development. Accurate advance forecasts of regional precipitation are therefore needed to support regional flood and drought prevention and water management. Accurate precipitation prediction remains a difficult problem, however, because the factors that affect precipitation are complex and variable.

To date, precipitation prediction models fall roughly into two categories: traditional and emerging. Traditional forecasting models comprise regression-analysis and time-series approaches; examples include grey systems theory (Meng *et al.* 2022), Markov chains and set-pair analysis (Huang & Gao 2022). Such models are generally built on large amounts of monitored historical precipitation data: the data are ordered by time scale, statistical methods are combined with numerical autocorrelation features, the patterns and structure of the underlying stochastic process are mined, and a corresponding prediction model is constructed. They do not take other influencing factors into account and rely solely on regularities found in large datasets to predict future conditions. Liu & Duan (2017) applied maximum covariance regression analysis to gridded precipitation data from observation stations on the Qinghai-Tibet Plateau to improve rainfall prediction in the region. Chang & Zhao (2011) combined time series and regression methods into a precipitation prediction method that uses both the time series dynamics of precipitation and the influence of natural environmental factors, and is suitable for predicting the variability characteristics of precipitation. Lai & Dzombak (2021) developed a statistical time series forecasting technique based on the Autoregressive Integrated Moving Average (ARIMA) model and compared it quantitatively with other common statistical techniques for annual temperature and precipitation forecasting, showing that ARIMA models generally provide more accurate forecasts, especially interval forecasts. Jabbari & Bae (2020) used Total Least Squares (TLS) and a lead-time-dependent bias correction method, Dynamic Weighting (DW), to post-process real-time forecast data and reduce the bias in real-time precipitation forecasts. Although such traditional prediction models have achieved some success, they still face many challenges because of the low stability of statistical relationships, the complexity of the modelled atmospheric physics equations and the strict stationarity requirements placed on the time series.

The rapid development of machine learning and deep learning has helped to compensate for the shortcomings of traditional prediction models. This emerging approach uses large amounts of historical meteorological data (including precipitation data) to build models that can be trained to identify patterns in the data and then predict the evolution of other precipitation processes (He *et al.* 2021). It interests many researchers because it learns the relationship between input observations and output results directly from the data and can also produce forecasts at higher grid resolution and higher frequency. Examples include the first Artificial Neural Network (ANN) methods used to predict rainfall (Zhou *et al.* 2020), robust rainfall prediction based on Support Vector Regression (SVR) (Hasan *et al.* 2015) and ensemble precipitation prediction based on the random forest approach (Zarei *et al.* 2021). Although these models can handle non-linear precipitation data, they ignore the sequential relationships within the data, which leaves room for improvement.

Recurrent Neural Networks (RNNs) have memory: the network uses previous information together with the current data to determine its output. However, RNNs have difficulty retaining dependencies across longer intervals and usually lose the remembered information. Hochreiter & Schmidhuber (1997) therefore designed the Long Short-Term Memory (LSTM) network, which addresses the long-term dependence problem and overcomes the vanishing-gradient problem, and is thus widely used in hydrological prediction. Li *et al.* (2022) trained and tested an LSTM model on runoff data from the Yarkant River to improve the accuracy of inland river runoff prediction. Liu *et al.* (2020) applied an LSTM network to precipitation data for the Tibetan Plateau from 1990–2016 to predict future precipitation trends, and the results showed that the LSTM model predicted precipitation more accurately than traditional prediction methods.

Precipitation data are nonlinear and temporally dependent, are influenced by many complex factors and generally cannot be assumed to be stationary, so predictions made with the LSTM method alone may still contain errors (Kala *et al.* 2022). To further improve prediction accuracy, the forecasting field has turned its attention to the frequency-domain features of time series. Precipitation data contain random noise that interferes with prediction and reduces its accuracy, so handling this noise is an indispensable data processing task, and Ensemble Empirical Mode Decomposition (EEMD) is one of the most commonly used denoising methods (Yu *et al.* 2018). Wang *et al.* (2015) used EEMD to process the original annual runoff time series, and the resulting EEMD-ARIMA model significantly improved on annual runoff prediction by the ARIMA time series method. Yang *et al.* (2021) used a coupled EEMD and LSTM model to predict annual precipitation in the northern Tianshan economic zone. Their results show that the model outperforms a single model, but they also expose a drawback: the LSTM model does not handle the low-frequency series obtained from EEMD well and is therefore prone to errors. These models do not divide the subseries obtained from EEMD into high-frequency and low-frequency parts and build a separate model for each, nor do they consider whether a single algorithm is applicable to subseries of different frequencies, and they therefore generate some avoidable errors.

Based on the above, this paper adopts a permutation entropy (PE) algorithm to divide the subseries obtained from EEMD into high-frequency and low-frequency parts, and models the two parts with different models to construct an EEMD-LSTM-ARIMA coupled precipitation prediction model. LSTM has higher prediction accuracy for high-frequency series, and the ARIMA model predicts low-frequency series well.

The coupled EEMD-LSTM-ARIMA approach, which models the different frequencies separately, is still lacking in precipitation prediction studies. This study has a dual purpose: first, to find the optimal parameters of the coupled model for regional medium- and long-term precipitation forecasting; second, to test the developed model against other models to demonstrate its accuracy and improve precipitation prediction in the region. Based on month-by-month historical precipitation data for Luoyang City from 1973 to 2021, this paper proposes a method that integrates ensemble empirical mode decomposition, a long short-term memory network and the autoregressive integrated moving average model. A coupled EEMD-LSTM-ARIMA model was constructed and trained on the monthly precipitation data of the Luoyang region from 1973 to 2011. The monthly precipitation of the region from 2012 to 2021 was then simulated and predicted, and the results were compared with those of several other models and with the observed data to evaluate the effectiveness of the model. Finally, the model was used to forecast monthly precipitation in the region for 2022–2024. The model offers a new reference for improving precipitation prediction accuracy in the region and provides data support for regional disaster prevention and mitigation.

## RESEARCH METHODOLOGY

### Coupled model (EEMD-LSTM-ARIMA)

The steps are as follows:

- (1) EEMD decomposition is performed on the monthly historical precipitation data collected for Luoyang City from 1973–2021.

- (2) The valid IMF (Intrinsic Mode Function) components obtained from the decomposition are split into a 1973–2011 dataset for training and a 2012–2021 dataset for validation. The permutation entropy of each IMF component is calculated to separate the high-frequency part from the low-frequency part.

- (3) The IMF subseries in the high-frequency part are predicted with the LSTM model, which predicts high-frequency data well. The IMF subseries in the low-frequency part are predicted with the ARIMA model, which performs well on low-frequency data. The predicted values of the two methods are then combined to give the final predicted values.

- (4) The resulting model is used to predict the monthly precipitation in Luoyang City from 2012 to 2021, and the performance of the coupled EEMD-LSTM-ARIMA precipitation prediction model is evaluated by comparing its predictions with the validation set of observed historical data.

- (5) The EEMD-LSTM-ARIMA coupled precipitation prediction model is used to predict monthly precipitation in Luoyang City from 2022 to 2024.
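Step (2) relies on permutation entropy to separate high-frequency from low-frequency components. The following is a minimal sketch of that idea, assuming the usual ordinal-pattern definition of permutation entropy normalised by the logarithm of the number of possible patterns. The embedding order, the delay and the threshold value are illustrative assumptions and are not taken from this paper; `split_by_entropy` is a hypothetical helper name.

```python
import math
import random


def permutation_entropy(series, order=3, delay=1):
    """Normalised permutation entropy in [0, 1] based on ordinal patterns."""
    n_patterns = len(series) - (order - 1) * delay
    if n_patterns < 1:
        raise ValueError("series too short for the chosen order and delay")
    counts = {}
    for start in range(n_patterns):
        window = [series[start + j * delay] for j in range(order)]
        # The ordinal pattern is the index order that sorts the window.
        pattern = tuple(sorted(range(order), key=lambda j: window[j]))
        counts[pattern] = counts.get(pattern, 0) + 1
    entropy = sum((c / n_patterns) * math.log(n_patterns / c) for c in counts.values())
    return entropy / math.log(math.factorial(order))


def split_by_entropy(components, threshold, order=3, delay=1):
    """Assign each named component to the high- or low-frequency group.

    The threshold is an assumed, user-chosen value.
    """
    high, low = [], []
    for name, values in components.items():
        pe = permutation_entropy(values, order, delay)
        (high if pe >= threshold else low).append(name)
    return high, low


# Synthetic stand-ins for decomposed components (not the Luoyang data).
rng = random.Random(0)
noisy = [rng.random() for _ in range(500)]
slow = [math.sin(2 * math.pi * t / 200) for t in range(500)]
high_group, low_group = split_by_entropy({"noisy": noisy, "slow": slow}, threshold=0.75)
```

An irregular, rapidly fluctuating component visits many ordinal patterns and scores close to 1, while a smooth, slowly varying component scores much lower, which is what makes the entropy usable as a frequency proxy.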

Time series datasets are often partitioned by rule of thumb: the training part should normally make up more than 60% of the data and the validation part more than 20%. Researchers have used different partitioning schemes: Kumar *et al.* (2019) used 70% of the data to train RNN and LSTM models of the 'all-India' monthly average precipitation data; Liu *et al.* (2021) used 80% of the data as the training set for wind speed prediction models and the remaining 20% as the test set. In this study, 80% of the data is used to train the model and 20% to test model performance.
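As a quick check of how the 1973–2011 training period and the 2012–2021 test period relate to the 80%/20% split, the sketch below partitions a monthly record chronologically. The precipitation values are placeholders (zeros), and `chronological_split` is a hypothetical helper name used only for illustration.

```python
def chronological_split(records, first_test_year):
    """Split (year, month, value) records by time, never at random."""
    train = [r for r in records if r[0] < first_test_year]
    test = [r for r in records if r[0] >= first_test_year]
    return train, test


# Placeholder monthly records for 1973-2021; real values would come from the station data.
records = [(year, month, 0.0) for year in range(1973, 2022) for month in range(1, 13)]
train, test = chronological_split(records, first_test_year=2012)
train_share = len(train) / len(records)
test_share = len(test) / len(records)
```

With 49 years of monthly data, the training set holds 468 months and the test set 120 months, so the shares are roughly 79.6% and 20.4%, close to the nominal 80/20 split.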

#### EEMD-Ensemble empirical mode decomposition

Most time series used for forecasting are non-linear and non-stationary. Predicting directly from such a series, without first stabilising it, tends to give inaccurate results. Empirical Mode Decomposition (EMD) is a method for analysing nonlinear and non-stationary signals (Huang *et al.* 1998). EMD decomposes the original complex non-linear signal into several Intrinsic Mode Functions (IMFs) and a residual (Res), so that the characteristic information of each frequency of the original signal is carried by the IMF components. Each IMF obtained by EMD must satisfy two conditions: (1) over the whole data series, the number of local extrema and the number of zero crossings are equal or differ by at most one; (2) at any time point, the mean of the upper and lower envelopes is 0. Let the signal to be decomposed be $x(t)$. The EMD decomposition steps are as follows:

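- (1) Find the local extrema of the original signal $x(t)$. The upper envelope $u(t)$ and the lower envelope $l(t)$ are obtained by fitting the maxima and the minima separately with cubic spline interpolation.

- (2) Compute the mean of the upper and lower envelopes: $m(t) = \dfrac{u(t) + l(t)}{2}$.

- (3) Subtract the envelope mean from the signal to obtain a new sequence: $h(t) = x(t) - m(t)$.

- (4) Determine whether the new sequence $h(t)$ satisfies the IMF conditions:

  - (a) If $h(t)$ meets the IMF conditions, it is an IMF component. Let $c_1(t) = h(t)$, then calculate the first-order residual $r_1(t) = x(t) - c_1(t)$, treat $r_1(t)$ as the initial signal and repeat steps (1), (2), and (3).

  - (b) If $h(t)$ does not satisfy the IMF conditions, treat $h(t)$ as the initial signal and repeat steps (1), (2), and (3).

- (5) Repeat the process until the residual $r_n(t)$ is monotonic or contains too few extrema for another IMF to be extracted.

- (6) The original signal is then expressed as the sum of the IMF components and the residual: $x(t) = \sum_{i=1}^{n} c_i(t) + r_n(t)$.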
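The first IMF condition (the numbers of local extrema and zero crossings differ by at most one) can be checked numerically. The sketch below is a simplified illustration on sampled data. It does not test the second condition on the envelope mean, and the function names are hypothetical.

```python
import math


def count_extrema(signal):
    """Count strict interior local maxima and minima of a sampled signal."""
    count = 0
    for prev, cur, nxt in zip(signal, signal[1:], signal[2:]):
        if (cur > prev and cur > nxt) or (cur < prev and cur < nxt):
            count += 1
    return count


def count_zero_crossings(signal):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(signal, signal[1:]) if a * b < 0)


def satisfies_first_imf_condition(signal):
    """True if extrema and zero-crossing counts differ by at most one."""
    return abs(count_extrema(signal) - count_zero_crossings(signal)) <= 1


# A zero-mean oscillation passes; the same oscillation shifted upward does not.
oscillation = [math.sin(2 * math.pi * t / 20 + 0.3) for t in range(100)]
shifted = [value + 2.0 for value in oscillation]
```

The shifted series never crosses zero, so it fails the check even though its shape is identical, which is why the sifting step subtracts the envelope mean before testing a candidate IMF.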

However, the EMD method also has limitations. A major problem is that the extreme points of the signal affect the IMFs: if they are not uniformly distributed, the result exhibits mode mixing (Tang *et al.* 2012). The EEMD method, which adds noise-assisted analysis to the decomposition process, effectively suppresses mode mixing (Wu & Huang 2009). White noise is added to the original signal and the decomposition is repeated over a large number of trials; when the results are averaged, the added noise gradually cancels out and the final ensemble-mean signal becomes stable.

The EEMD decomposition steps are as follows:

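- (1) Initialise the total number of EMD runs, $M$.

- (2) On the $m$-th run ($m = 1, 2, \dots, M$), add Gaussian white noise $n_m(t)$ to the original signal to construct a new sequence signal: $x_m(t) = x(t) + n_m(t)$.

- (3) Decompose the new sequence signal $x_m(t)$ by EMD to obtain a finite number of IMF components $c_{i,m}(t)$ and a residual $r_m(t)$.

- (4) Repeat steps (2) and (3) until all $M$ runs have been completed.

- (5) Average the corresponding IMF components and residuals over the $M$ runs: $c_i(t) = \dfrac{1}{M}\sum_{m=1}^{M} c_{i,m}(t)$ and $r(t) = \dfrac{1}{M}\sum_{m=1}^{M} r_m(t)$.

- (6) The final EEMD decomposition of the original signal is $x(t) = \sum_{i} c_i(t) + r(t)$.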
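The noise-assisted averaging at the heart of EEMD can be sketched independently of the sifting procedure. In the example below, `moving_average_split` is a toy stand-in for EMD (it is not EMD and always returns exactly two components), and the trial count, noise level and seed are illustrative assumptions. The point is only to show the loop structure: add white noise, decompose, accumulate, then average.

```python
import math
import random


def moving_average_split(signal, window=5):
    """Toy stand-in for EMD: returns [detail, trend] with detail + trend == signal."""
    half = window // 2
    trend = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        trend.append(sum(signal[lo:hi]) / (hi - lo))
    detail = [s - t for s, t in zip(signal, trend)]
    return [detail, trend]


def ensemble_decompose(signal, decompose, trials=200, noise_std=0.2, seed=0):
    """Average the components of `decompose` over many noise-perturbed copies."""
    rng = random.Random(seed)
    sums = None
    for _ in range(trials):
        noisy = [s + rng.gauss(0.0, noise_std) for s in signal]
        components = decompose(noisy)
        if sums is None:
            sums = [[0.0] * len(signal) for _ in components]
        # Assumes every trial yields the same number of components.
        for total, comp in zip(sums, components):
            for i, value in enumerate(comp):
                total[i] += value
    return [[v / trials for v in total] for total in sums]


signal = [math.sin(2 * math.pi * t / 25) for t in range(100)]
components = ensemble_decompose(signal, moving_average_split)
rebuilt = [sum(values) for values in zip(*components)]
max_error = max(abs(a - b) for a, b in zip(rebuilt, signal))
```

Because the added noise has zero mean, its contribution shrinks as the number of trials grows, and the sum of the averaged components stays close to the original signal. A real EMD may return a different number of IMFs on different trials, which a full implementation has to handle.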

Precipitation data series are non-linear and generally contain random noise, which easily degrades prediction accuracy. Decomposing the precipitation series with EEMD exposes the intrinsic patterns and trends of the precipitation distribution, reduces the influence of random noise, and lays the foundation for constructing the precipitation prediction model.

#### LSTM-Long short-term memory neural network

*et al.* 2021). When information enters the LSTM cell, it is passed on to the next step if the gating rules judge it useful and is discarded by the forget gate if it is judged useless. The basic cell structure of LSTM is shown in Figure 2.

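The specific calculation process, in the standard LSTM formulation, is as follows:

- (1) The forget gate decides which information from the previous cell state $C_{t-1}$ is discarded: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$.

- (2) The input gate decides which new information is stored, and the cell state is updated: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$, $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$, $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$.

- (3) The output gate determines the hidden state passed to the next step: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$, $h_t = o_t \odot \tanh(C_t)$.

Here $x_t$ is the input at time $t$, $h_{t-1}$ is the previous hidden state, $W$ and $b$ are the weights and biases of each gate, $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication.

LSTM has strong adaptive learning ability and fits complex sample data well. Because its gating structure preserves memory, it performs better on predictions with long-range temporal correlation and effectively avoids the tendency of gradients to vanish as the time span increases. The LSTM neural network is therefore well suited to precipitation data, which are nonlinear and temporally dependent.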
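The gating mechanism can be made concrete with a single time step of a scalar LSTM cell. This is a minimal sketch of the standard LSTM update (forget gate, input gate, candidate state, output gate), not the network trained in this study; the parameter layout and the function names are assumptions chosen for readability.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    params maps each of 'f', 'i', 'c', 'o' to a tuple (w_x, w_h, b).
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre_activation("f"))        # forget gate: how much old state to keep
    i_t = sigmoid(pre_activation("i"))        # input gate: how much new content to admit
    c_tilde = math.tanh(pre_activation("c"))  # candidate cell content
    c_t = f_t * c_prev + i_t * c_tilde        # updated cell state
    o_t = sigmoid(pre_activation("o"))        # output gate
    h_t = o_t * math.tanh(c_t)                # new hidden state
    return h_t, c_t


# With all weights and biases at zero, every gate is exactly half open.
zero_params = {name: (0.0, 0.0, 0.0) for name in ("f", "i", "c", "o")}
h_1, c_1 = lstm_step(x_t=1.0, h_prev=0.0, c_prev=1.0, params=zero_params)
```

The additive update of the cell state is what lets information (and gradients) persist over many steps: when the forget gate stays near 1, the previous state is carried forward almost unchanged.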

#### ARIMA-Autoregressive integrated moving average

ARIMA models are often used to forecast time series data. ARIMA can handle series that are non-stationary and is especially suited to the dynamics of stochastic processes, provided that the series is stationary after differencing (Liu *et al.* 2021). In essence, it summarises the patterns and trends of a series by exploring the linear relationship between past and current data. Depending on the conditions, the model takes the form of a moving average model of order $q$, MA($q$); an autoregressive model of order $p$, AR($p$); an autoregressive moving average model, ARMA($p$, $q$); or an autoregressive integrated moving average model, ARIMA($p$, $d$, $q$), where $d$ is the order of differencing.
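The roles of $d$ and $p$ can be illustrated with a small sketch that differences a series and fits a first-order autoregression by ordinary least squares. This covers only the special case with no moving-average term ($q = 0$) and is not a full ARIMA estimator; the function names and the synthetic series are assumptions for illustration.

```python
def difference(series, d=1):
    """Apply first-order differencing d times."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def undifference(last_value, diffs):
    """Rebuild levels from forecast differences, starting after last_value."""
    levels = []
    for step in diffs:
        last_value = last_value + step
        levels.append(last_value)
    return levels


def fit_ar1(series):
    """Least-squares fit of y_t = c + phi * y_{t-1}; returns (c, phi)."""
    x, y = series[:-1], series[1:]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = sxy / sxx
    return mean_y - phi * mean_x, phi


def forecast_ar1(last_value, c, phi, steps):
    """Iterate the fitted recursion forward for the requested number of steps."""
    out = []
    for _ in range(steps):
        last_value = c + phi * last_value
        out.append(last_value)
    return out


# Synthetic series generated exactly by y_t = 2 + 0.5 * y_{t-1}.
series = [10.0]
for _ in range(5):
    series.append(2.0 + 0.5 * series[-1])
c_hat, phi_hat = fit_ar1(series)
next_values = forecast_ar1(series[-1], c_hat, phi_hat, steps=3)
```

For a series that needs differencing, the same fit would be applied to `difference(series, d)` and the forecasts mapped back to levels with `undifference`.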

### Model evaluation indicators

In Equations (13)–(16) above, $y_k$ and $\hat{y}_k$ denote the observed and model-predicted precipitation at moment *k*, respectively, $\bar{y}$ is the average of the observed data, and *n* is the total number of precipitation data points. The smaller the values of RMSE, MAE and MSE, the more accurate the prediction. The closer $R^2$ is to 1, the better the regression fit of the model; an $R^2$ above 0.8 is generally considered a good fit.
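A minimal sketch of the four indicators is given below. MAE, MSE and RMSE follow their standard definitions. For $R^2$ the sketch assumes the coefficient-of-determination form (one minus the ratio of the residual sum of squares to the total sum of squares), which may differ from the exact formula used in this paper. The function name and the sample values are illustrative only.

```python
import math


def evaluation_metrics(observed, predicted):
    """Return MAE, MSE, RMSE and R2 for paired observed and predicted values."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    # Assumed definition: coefficient of determination.
    r2 = 1.0 - sum(e * e for e in errors) / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Toy values, not precipitation data.
scores = evaluation_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

RMSE is always the square root of MSE, so the two rank models identically; MAE is less sensitive to a few large errors than either of them.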

## CASE STUDY

### Study area

### Data source

### Decomposition results

| Series | Maximum (mm) | Minimum (mm) | Average (mm) | Standard deviation (mm) |
|---|---|---|---|---|
| Raw data | 360.426 | 0 | 47.56782 | 63.56654 |
| IMF1 | 122.4906 | −125.845 | 31.5553 | 41.3389 |
| IMF2 | 115.4464 | −130.461 | 37.80779 | 47.15509 |
| IMF3 | 98.32462 | −72.9705 | 17.8036 | 23.88606 |
| IMF4 | 37.82077 | −54.033 | 10.44778 | 14.1637 |
| IMF5 | 8.553985 | −7.07767 | 3.594369 | 4.149955 |
| IMF6 | 6.073163 | −6.29281 | 3.748799 | 4.168697 |
| IMF7 | 17.02878 | −14.1688 | 7.894088 | 9.062226 |
| Res | 80.06577 | 62.37893 | 4.831101 | 5.490517 |


### Model building

An LSTM network was built for the high-frequency part. The settings were selected through a large number of experiments: 180 hidden units, the ReLU activation function, 600 training iterations, and an initial learning rate of 0.002 that is multiplied by a factor of 0.2 after 300 rounds. The Adam optimizer was used with MAE as the loss function, and dropout was set to 10% to control overfitting. An ARIMA model was built for the low-frequency part; after PACF analysis of the Res series, the ARIMA(1, 0, 0) model was adopted.
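The reported LSTM settings can be collected in one place, together with the step-decay learning-rate schedule they imply. This is a framework-independent sketch: the dictionary keys and the function name are hypothetical, and the epoch index is assumed to start at 0.

```python
# Hyperparameters as reported in the text; key names are illustrative.
LSTM_SETTINGS = {
    "hidden_units": 180,
    "activation": "relu",
    "epochs": 600,
    "initial_learning_rate": 0.002,
    "lr_drop_epoch": 300,
    "lr_drop_factor": 0.2,
    "optimizer": "adam",
    "loss": "mae",
    "dropout": 0.10,
}


def learning_rate(epoch, settings=LSTM_SETTINGS):
    """Learning rate for a 0-based epoch index: constant, then a single drop."""
    rate = settings["initial_learning_rate"]
    if epoch >= settings["lr_drop_epoch"]:
        rate *= settings["lr_drop_factor"]
    return rate


schedule = [learning_rate(epoch) for epoch in range(LSTM_SETTINGS["epochs"])]
```

Under these settings the first 300 epochs run at 0.002 and the remaining 300 at 0.0004, so the larger steps are taken early and the later epochs refine the fit.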

## RESULTS AND DISCUSSION

### Validation of the model

### Discussion

The predictive performance of the models can be evaluated more clearly using the model evaluation indicators RMSE, MAE, MSE and $R^2$. The detailed results for each model are shown in Table 2.

| Model | MAE | MSE | RMSE | $R^2$ |
|---|---|---|---|---|
| EEMD-LSTM-ARIMA | 7.15772 | 127.1159 | 11.27457 | 0.87221 |
| EEMD-LSTM | 13.40644 | 452.7453 | 21.27781 | 0.76066 |
| EEMD-ARIMA | 23.20466 | 1126.839 | 33.56842 | 0.58573 |
| EMD-LSTM | 18.002 | 668.5379 | 25.8561 | 0.678614 |
| LSTM | 28.52654 | 1799.315 | 42.41833 | 0.49072 |
| ARIMA | 32.89908 | 2389.303 | 48.8805 | 0.41266 |


The comparison shows that, because the monthly precipitation series is non-stationary and fluctuates strongly, a single model cannot capture the characteristics of monthly precipitation variation well, and the single models therefore predict poorly. Of the two, ARIMA performs worse than LSTM: ARIMA requires a highly stationary series and cannot capture nonlinear relationships, whereas LSTM can fit arbitrary nonlinear relationships. Decomposing the original monthly precipitation series with EEMD effectively reduces the influence of random noise and significantly improves on the accuracy of the single models. EEMD also avoids the mode mixing to which EMD is prone and makes each IMF component more uniform, so the EEMD-LSTM combined model predicts better than the EMD-LSTM combined model. However, the IMF components obtained by EEMD still differ in frequency, so the combined models EEMD-LSTM and EEMD-ARIMA, each of which pairs EEMD with a single model, retain certain limitations. The coupled EEMD-LSTM-ARIMA model divides the components into high-frequency and low-frequency series by PE calculation, predicts the high-frequency series with LSTM and the low-frequency series with ARIMA. This makes full use of the respective advantages of the two models and avoids the large errors that a single model easily produces on series for which it is unsuited. As a result, the prediction accuracy of the coupled EEMD-LSTM-ARIMA model is higher than that of the combined EEMD-LSTM and EEMD-ARIMA models.
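The size of the improvement can be read directly from Table 2. The sketch below copies the MSE and RMSE columns of Table 2 and computes the relative RMSE reduction of the coupled model against each comparison model; the variable and function names are illustrative.

```python
import math

# MSE and RMSE values copied from Table 2.
TABLE_2 = {
    "EEMD-LSTM-ARIMA": {"MSE": 127.1159, "RMSE": 11.27457},
    "EEMD-LSTM": {"MSE": 452.7453, "RMSE": 21.27781},
    "EEMD-ARIMA": {"MSE": 1126.839, "RMSE": 33.56842},
    "EMD-LSTM": {"MSE": 668.5379, "RMSE": 25.8561},
    "LSTM": {"MSE": 1799.315, "RMSE": 42.41833},
    "ARIMA": {"MSE": 2389.303, "RMSE": 48.8805},
}


def rmse_reduction(baseline, proposed="EEMD-LSTM-ARIMA"):
    """Percentage by which the proposed model lowers RMSE relative to a baseline."""
    base = TABLE_2[baseline]["RMSE"]
    return 100.0 * (base - TABLE_2[proposed]["RMSE"]) / base


reductions = {
    name: round(rmse_reduction(name), 1)
    for name in TABLE_2
    if name != "EEMD-LSTM-ARIMA"
}

# Internal consistency check: RMSE should equal the square root of MSE.
consistent = all(
    abs(math.sqrt(row["MSE"]) - row["RMSE"]) < 1e-3 for row in TABLE_2.values()
)
```

On these figures the coupled model lowers RMSE by about 47% relative to EEMD-LSTM, about 66% relative to EEMD-ARIMA and about 77% relative to the single ARIMA model.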

### Monthly precipitation forecast

The EEMD-LSTM-ARIMA model was used to forecast the monthly precipitation in Luoyang City from 2022 to 2024, and the forecast results are shown in Table 3.

| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sept | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2022 | 22.92 | 5.44 | 29.42 | 34.18 | 65.17 | 74.83 | 151.52 | 183.06 | 102.74 | 67.2 | 13.6 | 2.367 |
| 2023 | 12.84 | 11.03 | 22.08 | 40.62 | 66.04 | 60.57 | 146.55 | 169.67 | 95.86 | 38.03 | 16.01 | 5.08 |
| 2024 | 16.27 | 22.25 | 34.24 | 82.67 | 95.86 | 107.92 | 173.36 | 236.23 | 59.43 | 33.53 | 18.03 | 7.87 |
| HistAvg* | 11.66 | 15.64 | 21.9 | 50.42 | 62.01 | 75.69 | 145.25 | 127.59 | 104.27 | 49.9 | 28.54 | 7.1 |


HistAvg*: Historical average, the monthly average precipitation (mm) over the past 30 years.

In 2022–2024, 16 months are predicted to fall below the 30-year historical monthly average precipitation and 20 months above it. By climate type, 19 of the 36 months are dry or less rainy, eight are rainy and the remaining nine are wet. The predicted precipitation for August 2024 is the maximum of the three forecast years, reaching 236.23 mm. Historical monthly precipitation shows that heavy precipitation in Luoyang City often occurs in July and August, so the city needs to take preventive measures against urban flooding in these months, focus on preventing flash floods caused by local or continuous rainfall in mountainous and hilly areas, and prepare for flash flood disasters. The lowest precipitation over three consecutive months is likely to occur from December 2022 to February 2023, with a forecast three-month total of 26.24 mm. This is consistent with the low local precipitation in winter and spring under the control of dry, cold continental air masses, so attention should be paid to possible droughts that threaten agriculture and the ecological environment, and to forest fires.
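The counts quoted above can be reproduced from Table 3. The sketch below copies the forecast and historical-average rows of Table 3 and recomputes the number of months above and below the historical average, the wettest forecast month and the driest run of three consecutive months; the variable names are illustrative.

```python
# Forecast monthly precipitation (mm) and 30-year monthly averages, copied from Table 3.
FORECAST = {
    2022: [22.92, 5.44, 29.42, 34.18, 65.17, 74.83, 151.52, 183.06, 102.74, 67.2, 13.6, 2.367],
    2023: [12.84, 11.03, 22.08, 40.62, 66.04, 60.57, 146.55, 169.67, 95.86, 38.03, 16.01, 5.08],
    2024: [16.27, 22.25, 34.24, 82.67, 95.86, 107.92, 173.36, 236.23, 59.43, 33.53, 18.03, 7.87],
}
HIST_AVG = [11.66, 15.64, 21.9, 50.42, 62.01, 75.69, 145.25, 127.59, 104.27, 49.9, 28.54, 7.1]

# Flatten to a chronological list of (year, month, value).
monthly = [
    (year, month, value)
    for year in sorted(FORECAST)
    for month, value in enumerate(FORECAST[year], start=1)
]

above = sum(1 for _, month, value in monthly if value > HIST_AVG[month - 1])
below = sum(1 for _, month, value in monthly if value < HIST_AVG[month - 1])

wettest = max(monthly, key=lambda item: item[2])

# Driest window of three consecutive months, identified by its starting month.
windows = [
    (sum(item[2] for item in monthly[i:i + 3]), monthly[i][0], monthly[i][1])
    for i in range(len(monthly) - 2)
]
driest_total, driest_year, driest_month = min(windows)
```

The window search runs across year boundaries, which is how the December 2022 to February 2023 period is identified as the driest three-month run.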

## CONCLUSION

To address the non-linear and non-stationary characteristics of precipitation series, this paper uses the EEMD algorithm to decompose the series into modal components of different frequencies, which avoids mode mixing. The PE algorithm then divides the modal components into a high-frequency part and a low-frequency part, which are predicted by the LSTM and ARIMA models, respectively. This reduces the prediction error of each component, improves overall prediction accuracy and stability, and can be applied to monthly precipitation prediction.

The month-by-month precipitation predictions for Luoyang City for 2012–2021 show that decomposing the raw data with EEMD significantly improves prediction performance. The single LSTM and ARIMA models handle the precipitation time series of Luoyang City poorly, and their predictions have larger errors than those of the combined models. The coupled EEMD-LSTM-ARIMA model proposed in this paper makes full use of the advantages of each method to obtain more accurate predictions. The indicators show that it outperforms the combined models EMD-LSTM, EEMD-LSTM and EEMD-ARIMA, and that its predictions have higher confidence.

The proposed model is effective in improving the accuracy of month-by-month precipitation prediction. However, this study did not consider other variables that affect precipitation or whether including them would improve prediction. Future work can add such variables for multivariate prediction and can also try other combined models.

## FUNDING

This study was funded by the Henan Provincial Key R&D and Promotion Special Project (Science and Technology Tackling) (182102210066) and the National Natural Science Foundation of China (51709115).

## DECLARATIONS

**Ethical Approval** Not applicable.

**Consent to Participate** Not applicable.

**Consent to Publish** Not applicable.

**Competing interests** None.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.

## CONFLICT OF INTEREST

The authors declare there is no conflict.