Performance comparison of filtering methods on modelling and forecasting the total precipitation amount: a case study for Muğla in Turkey

Condensed water vapor in the atmosphere is observed as precipitation whenever moist air rises sufficiently to produce saturation, condensation, and the growth of precipitation particles. The amount and concentration of total precipitation are hard to measure over time because precipitation amounts change and the climate is variable. As a result, modelling and forecasting the precipitation amount is challenging. For this reason, this study compares the forecasting performance of different methods on monthly precipitation series with covariates including the temperature, relative humidity, and cloudiness of the Muğla region, Turkey. To accomplish this, the performance of multiple linear regression, the state space model (SSM) via Kalman filter, a hybrid model integrating the logistic regression and SSM models, the seasonal autoregressive integrated moving average (SARIMA), exponential smoothing with state space model (ETS), exponential smoothing state space model with Box-Cox transformation, ARMA errors, trend and seasonal components (TBATS), feed-forward neural network (NNETAR) and Prophet models are all compared. This comparison has yet to be undertaken in the literature. The empirical findings overwhelmingly support the SSM when modelling and forecasting the monthly total precipitation amount of the Muğla region, encouraging time-varying coefficient extensions of the precipitation model.


INTRODUCTION
One of the most common problems in the world is the remarkable changes observed in climate. The effects of these changes on Earth and human beings cannot be ignored if we want to create a more habitable future.
Many factors change the everyday climate. One of the best known is global warming, which has a direct effect on climate and has caused an imbalance in the world's climate. The seasons are shorter or longer than in previous years, so it no longer makes sense to expect normal seasonal weather. There may be no solution to this change in temperature, but if the amount of precipitation can be predicted, that will help make lives easier and the world more liveable. Planning for these types of future events is crucial during this tumultuous time. Making predictions about such unseasonable and changeable factors affecting the Earth requires scrutiny and investigation.
In particular, when global warming is combined with the natural variability of the climate, predicting the amount of precipitation in its various forms becomes difficult. A predictive tool of this kind would greatly benefit the people and animals living on Earth. Being able to predict the amount of precipitation, together with the parameters that affect it, makes it possible to adapt agricultural activities, to plan engineering processes, and to prepare for the conditions caused by severe precipitation. For example, a good prediction model would enable farmers to predict the amount of precipitation for their area and so to improve, for instance, their existing irrigation systems. With updated precipitation information, the revised irrigation plans could improve seeding and cropping practices, saving farmers time and manpower costs compared with previous seeding systems. Keefer () mentioned that knowledge of the amount of precipitation also enables the selection of the correct agricultural equipment for better handling severe precipitation types. Indeed, the importance of predicting and forecasting the amount of precipitation for agricultural activities is related to the power of production.
As a result, a reliable prediction model translates to a strong agricultural system, which strengthens countries' economies. The significance of predicting the amount of precipitation is also evident in the energy sector. The most popular field of energy associated with precipitation is hydroelectric power, which is generated by large dams (Harting ). Thus, precipitation zones are taken into consideration when selecting suitable areas for the construction of hydroelectric plants. Large dams should be constructed in regions which receive sufficient precipitation so that maximal energy is obtained. These plants are constructed not only in precipitation zones but also in drainage basins. The amount of precipitation and the basins are related in that river basins are prioritized in regions known to receive more precipitation than others. Constructing these basins in such regions also prevents freshets, which result from severe amounts of precipitation. By predicting the amount of precipitation correctly, proper drainage basins can be set up, and hydroelectric energy is produced from the river water supplemented by the precipitation which occurs naturally. A good model for predicting precipitation can also help to prevent huge natural disasters: people can take proper precautions if they have prior knowledge of possible amounts of extreme precipitation. Predicting precipitation amounts can also aid tourism. Good predictions regarding precipitation are beneficial for travelers wishing to better manage their holidays, which in turn directly affects countries' economies.
As a result, a good mechanism for predicting precipitation has comprehensive effects on daily life and the long-run planning of countries' economies.
Problems of range, intermittency, concentration, and temporal and spatial distribution create great variety and complexity in precipitation variables, which makes them hard to describe, model and predict. This difficulty can be explained by the association between the changing amount of precipitation and the variability in the climate, with both its causes and consequences (Ezenwaji et al. ). In addition to the effects of climatological variability, some natural factors affect the total amount of precipitation. These factors can be considered as different parameters of nature which can both result from climate change and influence the total precipitation amount over varying periods of time. The observed amount of precipitation changes accordingly.
Furthermore, different mechanisms, such as the rate of humidity, observed temperature, and cloudiness, may affect the time, duration, or intensity of the precipitation. As a result, accurate and precise modelling of total precipitation series is difficult to achieve because the series is highly parametrized and the data are highly variable. While modelling total precipitation, it is important to take the maximum and minimum values of those parameters into account in order to obtain the most efficient structures for the model. Understanding the different factors which affect the total precipitation, and hence the structure of the data, is important for generating a model with sufficiently good forecasts.
Although numerous studies on modelling and predicting precipitation amounts exist (e.g. Sigrist Table 1. Furthermore, the time series plot of monthly total precipitation amounts is displayed in Figure 1.
The various descriptive statistics are given in Table 1. The minimum observation for precipitation in Muğla is 0 mm as, on days without any precipitation, the amount of precipitation remains at 0. The mean amount of precipitation is 96.72 mm, which is a moderate value given the dataset's minimum and maximum. The temperature never drops below 0 °C and never exceeds 29 °C. The average temperature during the time period under study is 14.97 °C. This indicates a moderate

Multiple linear regression
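The precipitation model in the form of the multiple linear regression model is defined as follows:

$$\text{Precipitation}_t = \kappa + \beta_1\,\text{Temperature}_t + \beta_2\,\text{Relative Humidity}_t + \beta_3\,\text{Cloudiness}_t + \varepsilon_t \qquad (1)$$

where $\kappa$ is the regression intercept and $\beta_1$, $\beta_2$ and $\beta_3$ are unknown coefficients. The errors $\varepsilon_t$ ($t = 1, \ldots, T$) are normally distributed with mean 0 and constant variance $H$. The unknown $\beta$ values given in Equation (1) are estimated via ordinary least squares (OLS), where $\hat{\beta}$ is written as follows:

$$\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y \qquad (2)$$

Here $X$ is the matrix of covariates (with a column of ones for the intercept) and $y$ is the vector of observed precipitation amounts.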
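The OLS estimation described above can be illustrated with a minimal sketch in plain Python. The helper names (`solve_linear_system`, `ols_fit`) and the data rows are hypothetical and made up for illustration; they are not the Muğla observations, and the responses are generated from known coefficients only so that the recovery can be checked.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols_fit(covariates, y):
    """Return [intercept, beta_1, ..., beta_k] from the normal equations X'X b = X'y."""
    x = [[1.0] + list(row) for row in covariates]  # prepend the intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up rows: (temperature, relative humidity, cloudiness).
covariates = [(5.0, 80.0, 6.0), (10.0, 70.0, 5.0), (20.0, 50.0, 2.0),
              (25.0, 40.0, 1.0), (15.0, 65.0, 4.0), (8.0, 75.0, 7.0)]
# Noise-free responses built from intercept 10 and slopes (-1, 0.5, 3).
y = [10.0 - 1.0 * t + 0.5 * h + 3.0 * c for t, h, c in covariates]
coefficients = ols_fit(covariates, y)
```

Because the toy responses contain no noise, the fitted coefficients should match the generating values up to rounding error. With real data, a numerically stabler solver (for example a QR-based least-squares routine) would normally be preferred over forming $X^{\top}X$ directly.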

State space model and Kalman filter
The extension of the regression model given in Equation (1) allows time-varying coefficients, $\beta_i$ ($i = 1, 2, 3$), in the mean-reverting specification of the SSM, which is estimated using the Kalman filter algorithm and presented in Equation (3). The state equations given in Equation (3) are accompanied by prior distributions for the parameters. Here, the initial estimates of $\mu_{\beta_{10}}$, $\mu_{\beta_{20}}$ and $\mu_{\beta_{30}}$ and $\Sigma_{\beta_{10}}$, $\Sigma_{\beta_{20}}$ and $\Sigma_{\beta_{30}}$ are obtained from the data as part of the estimation. The error terms of the observation equation ($\varepsilon_t$) and of the state equations ($w_{it}$) are assumed to be mutually independent, independent over time $t$ ($t = 1, \ldots, T$), and normally distributed with mean 0 and variances $H$ and $Q_i$, respectively. Also, $\varphi_i$ quantifies the temporal autocorrelation in the coefficient $\beta_{it}$ of the precipitation model.
A Kalman filter can be used wherever there is uncertain information about some dynamic system. Its aim is to extract as much information as possible from the uncertain measurements. A Kalman filter is an optimal estimator that infers parameters of interest from indirect, inaccurate and uncertain observations. It accounts for the uncertainty associated with the system by adding some new uncertainty after every prediction step. It then combines the prediction with the new measurement using a weighted average, giving an estimate that lies between the prediction and the measurement and has a smaller estimated uncertainty than either alone. This process is repeated at every time step, with the new estimate informing the prediction used in the following iteration. The relative certainty of the measurements and the current estimate determines the Kalman gain, which is the relative weight given to the measurements and the current state estimate and can be 'tuned' to achieve a particular performance.
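The predict and update cycle described above can be sketched for the simplest case of one covariate with one time-varying coefficient. This is an illustration under stated assumptions, not the paper's exact specification: the state equation is assumed to be a first-order mean-reverting process, and the function name `kalman_filter_tvc` and the parameters `beta_mean`, `beta0` and `p0` are hypothetical.

```python
def kalman_filter_tvc(y, x, phi, q, h, beta_mean, beta0, p0):
    """Filter a single time-varying regression coefficient.

    Assumed model (a simplification for illustration):
        y_t    = x_t * beta_t + eps_t,                              eps_t ~ N(0, h)
        beta_t = beta_mean + phi * (beta_{t-1} - beta_mean) + w_t,  w_t ~ N(0, q)
    A missing observation (None) skips the update step.
    """
    beta, p = beta0, p0
    filtered, gains = [], []
    for y_t, x_t in zip(y, x):
        # Prediction step: propagate the state and add process uncertainty q.
        beta_pred = beta_mean + phi * (beta - beta_mean)
        p_pred = phi * phi * p + q
        if y_t is None:
            # No measurement: carry the prediction forward unchanged.
            beta, p, gain = beta_pred, p_pred, 0.0
        else:
            # Update step: the gain weights the measurement against the prediction.
            innovation = y_t - x_t * beta_pred
            s = x_t * p_pred * x_t + h
            gain = p_pred * x_t / s
            beta = beta_pred + gain * innovation
            p = (1.0 - gain * x_t) * p_pred
        filtered.append(beta)
        gains.append(gain)
    return filtered, gains


# Toy input: a constant signal of 2.0 with one missing value (None).
y = [2.0, 2.0, None, 2.0, 2.0]
x = [1.0] * 5
filtered, gains = kalman_filter_tvc(y, x, phi=1.0, q=0.1, h=1.0,
                                    beta_mean=0.0, beta0=0.0, p0=1.0)
```

In this toy run the filtered coefficient moves step by step from the prior value 0 towards 2, and the missing observation is handled by carrying the prediction forward. Extending the sketch to three covariates would replace the scalars by vectors and matrices.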
Because the precipitation series is highly uncertain, this method is well suited to it. For example, Weeks () proposed the Kalman filter for time series with missing data, for which its recursive algorithm makes it particularly suitable.
During the Kalman filtering and smoothing process, three types of problems are defined as follows: if $t > n$, this is a prediction problem; if $t = n$, this is a filtering problem; and if $t < n$, this is a smoothing problem. The flow chart of the general state space model via the Kalman filtering (forward step, $t = 1, \ldots, n$) and smoothing (backward step, $t = n, n-1, \ldots, 1$) algorithm is given in Figure 2. In this process, the parameters of Equations (3) and (4) are modified as follows: Relative Humidity$_t$, Cloudiness$_t$). Note that the whole mechanism for the application of the Kalman filter is based on this procedure. This model is used in the application to obtain the smoothed values for precipitation in Equations (3) and (4). As a result of the estimation, some values are predicted as negative values that are very close to 0. To overcome this, a logarithmic transformation was considered, but this did not solve the problem. It was therefore decided to set those negative values to zero because they were already very close to zero. This means that there is no expectation for precipitation on that day. This model is called an SSM throughout this research.

Hybrid model
The hybrid data is created from the actual data by using the logistic regression method, in order to handle the negative predictions observed in the smoothing part. The steps listed below are followed for generating the hybrid model:
Step 1: The precipitation amount series is recoded as 0 if the amount is 0; otherwise, it is recorded as 1.
Step 2: A logistic regression model for this binary series is fitted by using the explanatory variables.
Step 3: After classifying observations as rain and no-rain cases, the estimated 0's (no-rain cases) are fixed at 0, and for the estimated 1's (rain cases) the precipitation amounts are passed to the Kalman filter algorithm.
Step 4: The accuracy measures and forecasts for future observations are then obtained by using the state space model via the Kalman filter algorithm.
This approach is called a hybrid model throughout this research.
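The steps of the hybrid model can be sketched as follows. This is a hedged illustration only: the logistic regression is fitted with plain gradient ascent rather than any particular library routine, the single standardized covariate and the `amount_estimates` values are made up, and the second-stage amounts stand in for the output of the state space model rather than being computed here. All function names are hypothetical.

```python
import math


def to_rain_indicator(amounts):
    """Step 1: 0 if the precipitation amount is 0, otherwise 1."""
    return [0 if a == 0 else 1 for a in amounts]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, learning_rate=0.1, n_iter=5000):
    """Step 2: logistic regression fitted by gradient ascent on the log-likelihood."""
    k = len(features[0])
    weights = [0.0] * (k + 1)  # weights[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for row, label in zip(features, labels):
            z = weights[0] + sum(w * v for w, v in zip(weights[1:], row))
            error = label - sigmoid(z)
            grad[0] += error
            for j, v in enumerate(row):
                grad[j + 1] += error * v
        weights = [w + learning_rate * g / len(labels) for w, g in zip(weights, grad)]
    return weights


def predict_rain(weights, features, threshold=0.5):
    """Step 3 (first part): classify each observation as rain (1) or no rain (0)."""
    classes = []
    for row in features:
        z = weights[0] + sum(w * v for w, v in zip(weights[1:], row))
        classes.append(1 if sigmoid(z) >= threshold else 0)
    return classes


def hybrid_amounts(rain_classes, amount_estimates):
    """Step 3 (second part): no-rain cases are fixed at 0; rain cases keep their amount."""
    return [a if c == 1 else 0.0 for c, a in zip(rain_classes, amount_estimates)]


# Made-up data: one standardized covariate and observed amounts in mm.
features = [[-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5]]
observed = [0.0, 0.0, 0.0, 12.0, 30.5, 80.2]
labels = to_rain_indicator(observed)
weights = fit_logistic(features, labels)
rain_classes = predict_rain(weights, features)
# Hypothetical second-stage estimates, including small negative values.
amount_estimates = [-0.3, 0.1, -0.05, 11.0, 29.0, 75.0]
hybrid = hybrid_amounts(rain_classes, amount_estimates)
```

The point of the design is that the classifier, not an ad hoc truncation, decides which estimates are set to zero, so small negative amounts in no-rain periods never reach the final series.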

ARIMA model
The SARIMA model is defined as follows:

$$\Phi_P(B^s)\,\phi_p(B)\,(1 - B)^d (1 - B^s)^D\, y_t = \Theta_Q(B^s)\,\theta_q(B)\,\varepsilon_t$$

Here, it is denoted by ARIMA$(p, d, q) \times (P, D, Q)_s$, where the seasonality is represented by $s$ (Box & Jenkins ). $\Phi$ and $\Theta$ are the seasonal polynomials with orders $P$ and $Q$, respectively, and $\phi$ and $\theta$ are their non-seasonal counterparts with orders $p$ and $q$, each containing no roots inside the unit circle, and $B$ represents the back-shift operator, $B^j y_t = y_{t-j}$. Given the seasonality observed in Figure 1 and the autocorrelation and partial autocorrelation plots of the series, the SARIMA model was fitted by using the arima function in R software (R Core

The exponential smoothing method is widely used in forecasting due to its simplicity, computational efficiency and accurate forecast performance. The type of exponential smoothing method varies with the characteristics of the time series. The additive model is given in Equation (7), where $\ell_t$ denotes an estimate of the level of the series at time $t$.

In the TBATS model, $m_i$ denotes the seasonal periods, $\ell_t$ and $b_t$ represent the level and trend components of the series at time $t$, respectively, $s^{(i)}_{j,t}$ represents the $i$th seasonal component at time $t$ with $\lambda^{(i)}_j = 2\pi j / m_i$, $d_t$ denotes an ARMA$(p, q)$ process and $\varepsilon_t$ is a Gaussian white noise process with zero mean and constant variance. The smoothing parameters are given by $\alpha$, $\beta$ and $\gamma_i$ for $i = 1, \ldots, T$, together with the dampening parameter.
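To make the role of the back-shift operator in the SARIMA notation concrete, the following minimal sketch applies regular and seasonal differencing, that is $(1 - B)$ and $(1 - B^{12})$, to a toy monthly series. The series and the helper name `difference` are made up for illustration; this shows only the differencing part of the model, not the estimation of the ARMA polynomials.

```python
def difference(series, lag=1):
    """Apply (1 - B^lag): y_t - y_{t-lag}; the first `lag` values are lost."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Toy monthly series: linear trend plus a pattern repeating every 12 months.
pattern = [5, 4, 3, 2, 1, 0, 0, 1, 2, 3, 4, 5]
series = [0.5 * t + pattern[t % 12] for t in range(36)]

seasonal_diff = difference(series, lag=12)    # (1 - B^12) y_t
both_diff = difference(seasonal_diff, lag=1)  # (1 - B)(1 - B^12) y_t
```

Seasonal differencing removes the repeating 12-month pattern and leaves a constant equal to twelve times the monthly trend; the additional regular difference removes that constant as well, which corresponds to $d = 1$, $D = 1$ and $s = 12$ in the ARIMA$(p, d, q) \times (P, D, Q)_s$ notation.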

Feed-forward neural network (NNETAR)
An ANN is a mathematical model imitating human neural biology to solve nonlinear problems. One of its significant properties is the non-linearity of its structure. ANN models can also be described as universal approximators that can approximate the generating mechanism of the data accurately (Zhang ). They do not require any modelling assumptions. However, since they involve a high level of modelling complexity, ANN models suffer from being time-consuming (Theodosiou ). The algorithm is a supervised learning method, for which input and output series need to be provided. In the first step, the given input variables are multiplied by weights, which are learned by the algorithm using back-propagation. Next, these weighted inputs are summed. Then, a bias term is added to adjust the threshold. In the final step, the sum of the weighted inputs and the bias is transformed into the final output by the activation function, which captures the complex structure of the series. One of the common activation functions used for time series is the sigmoid function. The mathematical formulation of this process is given as follows:

$$\text{Precipitation}_t = f\left(\sum_{i=1}^{m} w_{i,t}\,x_{i,t} + b\right)$$

where $x_{i,t}$ is the input in discrete time $t$ with $i = 1, \ldots, m$, $w_{i,t}$ is the weight value at time $t$, $b$ is the bias, $f$ is an activation function, and Precipitation$_t$ is the output value at time $t$.
The weights are updated by back-propagation, which is based on the chain rule, so as to minimize a loss function such as the MSE. In a feed-forward neural network, the procedure runs in one direction, from the input variables to the output variable.
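The forward pass and one back-propagation update can be sketched for a single sigmoid neuron. This is a minimal illustration rather than the NNETAR implementation: there is no hidden layer, the weights are not indexed by time, the inputs and target are made-up values assumed to be scaled to the unit interval, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through the sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def gradient_step(inputs, target, weights, bias, learning_rate=0.5):
    """One back-propagation update for the squared error (output - target)^2."""
    output = forward(inputs, weights, bias)
    # Chain rule: dL/dz = dL/d(output) * d(output)/dz.
    delta = 2.0 * (output - target) * output * (1.0 - output)
    new_weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - learning_rate * delta
    return new_weights, new_bias


# Made-up scaled inputs and target for a single observation.
inputs, target = [0.2, 0.7, 0.5], 0.9
weights, bias = [0.1, -0.2, 0.3], 0.0

loss_before = (forward(inputs, weights, bias) - target) ** 2
for _ in range(200):
    weights, bias = gradient_step(inputs, target, weights, bias)
loss_after = (forward(inputs, weights, bias) - target) ** 2
```

After repeated updates the squared error on this single observation shrinks towards zero. A practical network would average the gradient over many observations and propagate it through the hidden layers in the same chain-rule fashion.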
For the network to capture the nonlinear structure of the data, hidden layers are added between the input layer and the output layer. The feed-forward structure is a network with interconnections.

The Prophet model decomposes the series as follows:

$$y(t) = g(t) + s(t) + h(t) + \varepsilon_t$$

where $g(t)$ is the piecewise linear or logistic growth curve for modelling non-periodic changes in the time series; $s(t)$ is the periodic change (e.g. weekly/yearly seasonality); $h(t)$ is the effect of holidays (user provided) with irregular schedules; and $\varepsilon_t$ is the error term accounting for any unusual changes not accommodated by the model. The SARIMA, TBATS and NNETAR models are also run with the temperature, humidity and cloudiness covariates included. The results of these models are reported as ARIMAX, TBATSX and NNETARX.
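The performance of the fitted models in the in-sample modelling (out-of-sample forecasting) procedure is evaluated in terms of the MAE and MSE criteria, which are defined as follows:

$$\text{MAE} = \frac{1}{n}\sum_{t=1}^{n} \left| y_t - \hat{y}_t \right| \qquad (11)$$

$$\text{MSE} = \frac{1}{n}\sum_{t=1}^{n} \left( y_t - \hat{y}_t \right)^2 \qquad (12)$$

where $y_t$ and $\hat{y}_t$ are the observed and fitted (forecast) precipitation amounts and $n$ is the number of observations in the evaluation period. Lower values of these criteria indicate that the in-sample modelling (out-of-sample forecasting) is more accurate. The models are compared using the MAE (Equation (11)) and MSE (Equation (12)), respectively. Those results are presented in Table 2.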
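As a small worked illustration of the two criteria, the sketch below computes the MAE and MSE for four made-up observation and forecast pairs. The numbers are not taken from the paper's tables, and the function names are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared differences between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up monthly amounts in mm.
actual = [120.0, 80.0, 0.0, 45.0]
predicted = [110.0, 90.0, 5.0, 45.0]

mae = mean_absolute_error(actual, predicted)  # (10 + 10 + 5 + 0) / 4 = 6.25
mse = mean_squared_error(actual, predicted)   # (100 + 100 + 25 + 0) / 4 = 56.25
```

Because the MSE squares each error, it penalizes a few large misses more heavily than the MAE does, which is why the two criteria can rank models differently.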

RESULTS AND DISCUSSION
The modelling and forecasting results during that time period are given in Table 2.
The SSM via Kalman filter forecasts and the actual total precipitation amounts for the 12 months are provided in Table 3.
As seen from Table 3, the difference between the actual precipitation values and the forecast precipitation values from the SSM via Kalman filter is very small.
This means that the variation of the precipitation amount has been captured successfully in the model.

Best model fit and forecast
Comparing the results of the aforementioned models for modelling and forecasting during these time periods, the SSM via Kalman filter exhibits the best performance. The parameter estimates of the SSM in the in-sample model fit procedure are shown in Table 4. Moreover, the mean value of the time-varying $\hat{\beta}_3$ of cloudiness is greater than that of temperature and relative humidity during the time period. This result indicates that cloudiness is more volatile than the other two parameters.

CONCLUSIONS
Predicting precipitation is challenging because many natural parameters are involved in the process and directly affect precipitation.

DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.