## Abstract

Water management is essential for sustaining human life, and rainfall forecasting is one of the most important inputs to the water management of an area. A forecast is an estimate of what will happen in the future based on past information, under the assumption that the pattern followed in the past will continue. This work aims to obtain forecasting models for a time series data set using conventional models and computational models. Annual climate data for Varanasi city covering a total of 113 years are used for the analysis. Initially, individual models are considered and used for forecasting. Later, hybrid models are considered and compared with the individual models. The individual statistical models considered are moving average, exponential smoothing with one parameter, and autoregressive integrated moving average (ARIMA). Forecasts are also made individually using the k-nearest neighbor (kNN) technique and the cubic spline interpolation technique. Finally, the best statistical models and the interpolation model are coupled with kNN to develop hybrid models, and these hybrid models are used to forecast the data. All the models are compared and the best among them is chosen.

## HIGHLIGHT

Rainfall forecasting is very important for water management. Here we have compared the five latest AI/data science techniques of forecasting.

Study area is Varanasi, the oldest city in India.

113 years of climate data is used.

Models used are ARIMA, kNN, spline, exponential smoothing.

The hybrid model was prepared for better forecasting.

## INTRODUCTION

A time series is a sequence of observations measured at equal intervals of time. The observations may be measured hourly, daily, weekly, monthly, yearly, or at any other regular interval. When the observations are recorded continuously through time, the time series is said to be continuous; if they are recorded at specified times, usually equally spaced, it is said to be discrete. The dependence among the observations in a time series is of great interest, and time series analysis provides the techniques for studying this dependence.

Forecasting in general refers to the process of estimating the value of some variable at some future point in time. Weather forecasting plays a significant role in meteorology and remains a formidable challenge because of its data-intensive and frenzied nature (Omar *et al.* 2017, 2019, 2022). Forecasts are used by government and industry for planning and decision-making, to protect life and property, and by individuals to carry out daily activities. There are two broad types of forecasting techniques: qualitative and quantitative. Qualitative forecasts are normally used in situations where little or no historical data is available, for example the introduction of a new product for which no history exists. Quantitative forecasts, by contrast, make use of historical data: a forecasting model is used to project past and current data into the future. If the historical data is restricted to past values of the variable, the forecasting procedure is called a time series method and the corresponding historical data is termed a time series. Time series analysis uses statistical or other methods to extract the characteristics of the given data, identify the pattern in the series, and use it to predict the future. A model is constructed to extract meaningful information from the data, and with this model one can forecast future occurrences based on past data.

Three things are essential for the survival of human beings: air, food, and water. Water becomes available through different forms of precipitation, of which rain is the most important. Rivers and lakes are considered to be natural resources for any place. Forecasting weather parameters such as rainfall, temperature, wind, and humidity for a region plays an important role and is one of the main functions of national weather services. Information about rainfall is very useful for predicting natural disasters such as droughts and floods. Rainfall forecasting also helps to decide the area to be irrigated and the water required for irrigation, and to estimate the quantity and quality of surface water and groundwater. Forecasting models help us to understand meteorological information and integrate it into the planning and decision-making process. Thus, rainfall forecasting aids efficient planning by the government.

Various methodologies have been tried in predicting rainfall. Some of the most popular methods are autoregressive integrated moving average (ARIMA) and artificial neural networks (ANNs).

Gupta *et al.* (2013) studied a neural network model for the prediction of rainfall in India. They proposed a multilayered ANN trained with the backpropagation algorithm and evaluated the predictions using RMSE, correlation, and standard deviation. The proposed model predicted values with suitable results. Kashiwao *et al.* (2017) presented a local rainfall prediction system based on neural networks (NNs) and applied it to local rainfall in Japan using data from the Japan Meteorological Agency (JMA).

Mandale & Jadhawar (2015) analyzed the application of data mining techniques in weather forecasting using two techniques: ANN and decision trees. For the classification of weather parameters, the C5 decision tree classification algorithm was applied to formulate rules. Based on the given data, NNs are able to detect the relationship between weather parameters and to predict them. Kar *et al.* (2019) used a fuzzy logic technique to predict rainfall using the temperature of the geographical location as input. Results show that there is an association between actual and predicted rainfall. Darji *et al.* (2015) made a detailed survey and comparison of different neural network architectures used by researchers for rainfall forecasting. Their paper discussed the issues of using ANN for yearly/monthly/daily rainfall forecasting and also presented different accuracy measures used by researchers for evaluating the performance of ANN.

Lee *et al.* (1998) proposed an approach in which the whole region under study is divided into four subregions. The two larger regions are predicted by radial basis function (RBF) networks using information based on location and the remaining two smaller areas are predicted by the model linear regression using information based on elevation. The prediction of daily rainfall at 367 locations is based on the prediction of daily rainfall at 100 locations in Switzerland. Comparisons with the observed data have shown that RBF networks gave better results than the linear regression model. Sharma & Nijhawan (2015), for predicting rainfall in Delhi, have used three different NNs, namely, backpropagation algorithm, cascaded backpropagation, and layer recurrent network with the same learning functions and adaptive learning functions. Backpropagation gave the best results. The training function TRAINLM gave the best results in the training set, testing set, and validation of data. LEARNGDM has been identified as the best adaptive learning function with minimum MSE value. Mislan *et al.* (2015) applied ANN with backpropagation neural network (BPNN) algorithm to get accurate forecasting of rainfall. The rainfall data were tested using two hidden layers of BPNN architectures with three different epochs. Experimental results show that the epoch of 1,000 produced a good result to predict rainfall in Tenggarong, Indonesia.

Pai & Rajeevan (2006) explored the hypothesis that the incorporation of sea surface temperature configuration information in the statistical models improves the monsoon forecast skill through a long time series of global sea surface temperature and rainfall data. The relationship between all-India summer monsoon rainfall and the timing of El Niño-Southern Oscillation (ENSO) related warming is investigated by Ihara *et al.* (2008). In their work, the link between the interannual variability of Indian summer monsoon rainfall and the time evolutions of Indo-Pacific sea surface temperature anomalies are also examined and the significance is assessed using a two-sample, two-tailed, Student's *t*-test.

The combination of the statistical method ARIMA with the neural network model, called the hybrid model, is considered to be one of the famous combinations for making forecasts. Hybrid models are mainly applied to improve the performance of individual techniques by smoothing the input data through conventional statistical methods or any other methods and then applying the second model. Remesan *et al.* (2009) applied hybrid models for runoff prediction. Indrabayu *et al.* (2013) proposed a novel approach in which they combined a support vector machine (SVM) and fuzzy logic. This new combination is then compared with another combination of neural network and fuzzy logic. Results have shown that the combination of SVM and fuzzy logic achieved higher accuracy than a combination of neural network and fuzzy logic.

Rahul *et al.* (2020) in their paper identified a suitable seasonal model through Box Jenkins seasonal ARIMA model for the prediction of monthly rainfall in UP. Two methods, ARIMA and adaptive splines threshold autoregressive (ASTAR), are compared by Indrabayu *et al.* (2013) in another paper to predict daily rainfall in the area of Makassar, Indonesia. For prediction, 10 years of daily data from 2001 to 2010 obtained from Badan Meteorologi, Klimatologi, dan Geofisika (BMKG) have been utilized. Among various meteorological variables, four variables, namely temperature, humidity, wind speed, and previous precipitation, have been selected and given as input to both these methods. Based on the error measure root mean square error (RMSE), it has been proved that ASTAR outperformed ARIMA. Nirmala & Sundaram (2010) have considered the coupling of the traditional technique moving average with the conventional technique ANN to make a hybrid model MA-ANN. They have shown that the duo combination model is a better tool than the moving average and ANN models applied individually. Joseph & Ratheesh (2013) have applied respectively the data mining techniques such as clustering and classification for prediction of rainfall. The neural network Bayesian regularization is implemented. Dataset is obtained from the official website of the National Oceanic and Atmospheric Administration (NOAA) maintained by the US Department of Commerce. For doing the prediction, the parameters namely relative humidity, pressure, temperature, precipitable water, and wind speed are used.

Two prediction techniques, one a linear approach (ARIMA) and the other a non-linear approach (ANN), are presented by Mahalakshmi *et al.* (2014). A comparison is made between ARIMA and the hybrid model ARIMA-ANN. Based on the error measure mean squared error (MSE), the hybrid model outperforms ARIMA, the reason being that it uses the linearity of ARIMA and the non-linearity of ANN. Mohamed & Ibrahim (2016) identified the seasonal ARIMA model (0,0,0) × (0,1,1)_{12} for monthly rainfall prediction at Nyala station, Sudan. For the prediction, rainfall data for the years 1971–2010 were used. Using this model, the monthly rainfall for upcoming years is predicted. Swain *et al.* (2020) developed an ARIMA model for the Khordha district, Odisha, India. The model selection of ARIMA was made using the Akaike information criterion (AIC) and Bayesian information criterion (BIC). The forecasts produced by the model showed an excellent match with observed monthly rainfall data.

Kaushik & Singh (2008) used a seasonal ARIMA approach to predict temperature and rainfall on a monthly scale at Mirzapur, Uttar Pradesh. Twelve years of data from 1994 to 2006 were used to predict temperature and rainfall for the next five years, and the performance of the model was assessed using the coefficient of determination (*R*^{2}) and RMSE. The results indicate that the model provides a reliable and satisfactory prediction of temperature and rainfall on a monthly scale. Graham & Mishra (2017) predicted rainfall for the next five years on a monthly scale by analyzing the previous 31 years (1985–2015) of rainfall data for Allahabad, Uttar Pradesh. The seasonal ARIMA model (0,0,0) (0,1,0) was identified as the best for making the prediction. Based on the coefficient of determination (*R*^{2}) and RMSE, the prediction of rainfall by the model proved consistent and satisfactory on a monthly scale. Chaudhari *et al.* (2013) also discuss the application of various data mining techniques to predict, classify, cluster, or associate meteorological data patterns.

A survey of different techniques for rainfall prediction is presented by Hirani & Mishra (2016). The paper reports a detailed survey of the following methods: multiple linear regression (MLR), ARIMA, genetic algorithm, ASTAR, SVM, fuzzy logic, BPNN, radial basis function network (RBFN), self-organizing map (SOM), and the weather research and forecasting (WRF) model. These methods have been used extensively over the last 20 years. The survey found that most researchers use ANN and obtain significant results. It also reports that forecasting techniques using MLP, BPNN, RBFN, SOM, and SVM gave more suitable results for rainfall prediction than statistical and numerical methods, though each model has some limitations.

Literature survey reveals that rainfall forecasting was done by many techniques like artificial intelligence techniques, NNs, ARIMA, SVM, MLR, k-nearest neighbor (kNN), spline, fuzzy logic, genetic algorithm, particle swarm optimization, etc. Every model has its own limitations, advantages, and disadvantages. In this research study, the non-parametric technique kNN is applied individually. Also, different combinations of hybrid models with kNN have been proposed as an innovative approach for forecasting rainfall in Varanasi, UP.

The overall objective of the work is to develop rainfall forecasting models using the computational technique kNN. The interpolation technique spline is also applied. The main goal of the study is to develop new hybrid models that couple kNN with the data pre-processing technique moving average, the time series forecasting model ARIMA, and the interpolation technique spline, and to evaluate their effectiveness. The various models developed are compared using error measures and are applied for forecasting the annual rainfall amount in Varanasi. The objectives of this research work have been framed as follows:

To identify computational intelligence techniques for the time series data.

To formulate new models and methods based on computational intelligence for time series data.

To compare the performance of new hybrid models with conventional models.

To validate the model for the time series data.

## STUDY AREA

## METHODOLOGY AND DATA USED

Data mining techniques are applicable in various areas like clustering, pattern recognition, machine learning, etc. ANNs, Bayesian classification, SVMs, genetic algorithms, fuzzy logic, and rough sets are some of the popular methods in data mining.

### k-nearest neighbor (kNN)

The k-nearest neighbor method is a non-parametric technique. Unlike techniques that build a generalized model from the available data before new information arrives, kNN is a lazy technique: no generalization is made from the available data, called the training data, and the method waits for a new entry before doing the classification or prediction for the test sample, hence the name lazy learner. In other words, kNN simply stores the entire training dataset and uses it at prediction time. No model is constructed in advance and no separate learning step is required; given new data, the prediction is computed directly from the stored data.

kNN is one of the simplest classifiers and can also be used for regression. It is called non-parametric because it makes no assumptions about the distribution of the given data. The fact that it needs no prior information or basic assumptions about the data is what makes the method attractive.

The kNN method was first described in the early 1950s but did not gain popularity until the 1960s, when computing power became available. It is extensively used in pattern recognition.

Some techniques can be used for both classification and regression like ANNs and kNN, whereas some techniques can be used either for classification or regression but not for both. Logistic regression can be used only for classification whereas linear regression can be used only for regression. Some of the advantages of kNN are as follows:

Can be applied to data following any distribution. Hence, it is called non-parametric.

Very simple and intuitive, as it requires only two parameters for the implementation, namely the value of ‘*k*’ and the distance function.

No training phase is required and hence any new data can be added and analysis can be done.

Very effective in classification if the sample is very large.

Some of the disadvantages of kNN are as follows:

Choosing ‘*k*’ may be tricky.

Requires large storage space.

The testing phase is computationally more expensive than the training phase.

Very sensitive to missing values, noise, and outliers and irrelevant attributes.

Does not work well with high dimensionality.

Summarizing the whole procedure, the algorithm for kNN either for classification or for regression is given as follows:

Compute the distance *D*(*x*, *x*_{i}) of the new data *x* with every training instance *x*_{i}.

Choose the ‘*k*’ closest values of *x*_{i} and their corresponding outputs ‘*y*_{i}’ based on the ‘*k*’ nearest distances.

Classify the required output as the one which is the majority among the *k* nearest *y* values, in the case of classification.

Calculate the required output for the input *x* using some appropriate formula based on the *k* nearest *y* values, in the case of regression.
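The regression case of the steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the study: the function name `knn_predict`, the Euclidean distance, the plain mean as the "appropriate formula", and the toy data are all assumptions made here.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict the output for `query` as the mean of the k nearest training outputs."""
    # Step 1: Euclidean distance from the query to every training instance.
    distances = [(math.dist(x, query), y) for x, y in zip(train_x, train_y)]
    # Step 2: keep the k closest instances (sort on distance only).
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    # Step 3 (regression): average the k nearest outputs.
    return sum(y for _, y in nearest) / k


# Hypothetical toy data: each input is a one-element feature vector
# (for example, the previous period's value) and each output is the target.
train_x = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,)]
train_y = [1.5, 2.5, 3.5, 10.5, 11.5]

print(knn_predict(train_x, train_y, (2.2,), k=3))
```

For classification, the final averaging step would be replaced by a majority vote over the *k* nearest labels.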

### Moving average

In the moving average method, the average of the most recent *n* observations is taken for the calculation. Once a new value is available, it is taken up by dropping the oldest observation from the time series data and the new average is calculated, thus maintaining the same number of terms. Every time a new entry is added, an old observation is deleted, so the average keeps changing, hence the name moving average. The number of terms to be considered to calculate the average has to be decided, and it is called the period of the moving average. It can be a three-, five-, or seven-yearly moving average and so on. The main application of the moving average is to smooth the given time series data. The forecast value using a simple moving average is given by:

$$F_t = \frac{A_{t-1} + A_{t-2} + A_{t-3} + \cdots + A_{t-n}}{n}$$

where *F*_{t} is the forecast for the period ‘*t*’, *n* is the number of terms to be averaged, and *A*_{t−1}, *A*_{t−2}, *A*_{t−3} are the actual values for the periods *t* − 1, *t* − 2, *t* − 3, and so on.
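As a small worked illustration of the simple moving average forecast, the sketch below computes one-step-ahead forecasts for a short series. The function name `moving_average_forecast` and the sample values are hypothetical and chosen only for illustration.

```python
def moving_average_forecast(actuals, n):
    """One-step-ahead forecasts: F_t is the mean of the n values before period t.

    Returns a list aligned with `actuals`; the first n entries are None
    because fewer than n past values are available there.
    """
    forecasts = [None] * len(actuals)
    for t in range(n, len(actuals)):
        window = actuals[t - n:t]  # the n most recent past values
        forecasts[t] = sum(window) / n
    return forecasts


series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]  # hypothetical values
print(moving_average_forecast(series, n=3))
# [None, None, None, 11.0, 12.0, 12.0]
```

Each forecast drops the oldest value in the window and takes in the newest one, which is the "moving" behaviour described above.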

### Autoregressive integrated moving average (ARIMA)

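An iterative and effective approach for analyzing time series, introduced by Box & Jenkins (1976), is the ARIMA model. An ARIMA model predicts a value in a time series as a linear combination of its own past values and past errors (also known as shocks). It is a powerful classical model that combines the autoregressive model and the moving average model, with an ‘integrated’ (differencing) step to make the series stationary. The autoregressive model of order ‘*p*’, abbreviated as AR(*p*), is based on the past ‘*p*’ values of the variable *Y*_{t} and, in its standard form, is written as:

$$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \varepsilon_t$$

where *c* is a constant, *φ*_{1}, …, *φ*_{p} are the autoregressive coefficients, and *ε*_{t} is the error term at time *t*.

The above equation resembles a multiple regression model except that *Y*_{t} is regressed on its past values instead of different predictor variables, hence the prefix ‘auto’ in the autoregressive model.

The moving average model of order ‘*q*’, abbreviated as MA(*q*), is based on the past ‘*q*’ disturbances or prediction errors of the past values of the same variable. Thus, the moving average model uses past errors as explanatory variables. A moving average model of order ‘*q*’ is represented, in its standard form, as follows:

$$Y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}$$

where *μ* is the mean of the series and *θ*_{1}, …, *θ*_{q} are the moving average coefficients (some texts, including Box & Jenkins, write these terms with negative signs).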

The autoregressive model and the moving average model are coupled to form a general and useful class of time series models called ARMA. This class of models can be extended to non-stationary time series, as the ARIMA model, by differencing the data series. ARMA models assume a stationary series, so non-stationary data cannot be modeled or forecast reliably with them directly. The non-stationary data needs to be transformed into stationary data to obtain consistent and reliable results.
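For reference, combining the two components gives the usual textbook form of an ARMA(*p*, *q*) model, shown here together with the first-order differencing operation. The symbols *c*, *φ*, *θ*, and *ε* are standard notation assumed here rather than taken from the original text:

$$
Y_t = c + \sum_{i=1}^{p} \phi_i Y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j},
\qquad
\nabla Y_t = Y_t - Y_{t-1}
$$

An ARIMA(*p*, *d*, *q*) model is then an ARMA(*p*, *q*) model fitted to the series after it has been differenced *d* times.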

The orders of an ARIMA model can be tentatively identified from three things: the time series plot, the autocorrelation function (ACF), and the partial autocorrelation function (PACF). Based on the ACF and PACF plots, the general characteristics of the various models can be obtained by the following guidelines.

If the series is non-stationary, then the ACF plot remains significant for six or more lags instead of declining to zero quickly. In that case, the series must be differenced until it becomes stationary.

Exponentially declining ACF and spikes in the first one or more lags in PACF indicate an autoregressive model.

Spikes in the first one or more lags in ACF and an exponentially declining PACF indicate a moving average model.

Exponential decline in both ACF and PACF stipulates a mixed model namely an autoregressive moving average model.

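The number of times the series is differenced to achieve stationarity is the order of differencing ‘*d*’ (Figure 2).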
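The ACF referred to in these guidelines can be computed directly from the data. The following is a minimal sketch in plain Python, not the procedure used in the study; the function names `sample_acf` and `significance_bound` and the toy series are illustrative assumptions. The bound ±1.96/√n is the usual approximate 95% limit for judging whether a spike is significant.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelations r_0, r_1, ..., r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    deviations = [value - mean for value in series]
    denominator = sum(d * d for d in deviations)
    acf = []
    for k in range(max_lag + 1):
        # Sum of products of deviations that are k periods apart.
        numerator = sum(deviations[t] * deviations[t + k] for t in range(n - k))
        acf.append(numerator / denominator)
    return acf


def significance_bound(n):
    """Approximate 95% bound (plus or minus) for the ACF of white noise."""
    return 1.96 / math.sqrt(n)


toy_series = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical values
print(sample_acf(toy_series, max_lag=2))
print(significance_bound(len(toy_series)))
```

An ACF that stays outside the bound for many lags suggests non-stationarity, in line with the first guideline above.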

### Spline

Spline is an interpolation technique in which a separate cubic polynomial is fitted between each pair of adjacent data points so that the resulting curve is continuous and smooth. The basic idea of the cubic spline is to represent the function by a different cubic function on each interval between data points.
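A piecewise cubic of this kind can be built with the classical tridiagonal algorithm. The sketch below is illustrative only: it assumes natural boundary conditions (zero second derivative at both ends), which the text does not specify, and it evaluates points outside the data range with the nearest end piece, which is also an assumption. The name `natural_cubic_spline` is hypothetical.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys).

    xs must be strictly increasing. On each interval [xs[j], xs[j+1]] the
    spline is ys[j] + b[j]*dx + c[j]*dx**2 + d[j]*dx**3 with dx = x - xs[j].
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]

    # Forward sweep of the tridiagonal system for the quadratic coefficients.
    diag = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        diag[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / diag[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / diag[i]

    # Back substitution gives the coefficients of each cubic piece.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def evaluate(x):
        # Pick the interval containing x; clamp to the first or last piece.
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate


spline = natural_cubic_spline([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 0.0, 1.0])  # hypothetical points
print(spline(1.0), spline(2.5))
```

The curve passes through every data point and has continuous first and second derivatives at the interior points, which is what makes it smooth.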

## MODEL VALIDITY

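Different models are considered for forecasting the rainfall data. To assess the quality of forecasting and to evaluate the consistency of the model, error measures are required. In this research work, the error measures mean absolute percentage error (MAPE), mean absolute error (MAE), and RMSE are used. In their standard form, the error measures are given by:

$$\text{MAPE} = \frac{100}{m}\sum_{t=1}^{m}\left|\frac{A_t - F_t}{A_t}\right|, \qquad \text{MAE} = \frac{1}{m}\sum_{t=1}^{m}\left|A_t - F_t\right|, \qquad \text{RMSE} = \sqrt{\frac{1}{m}\sum_{t=1}^{m}\left(A_t - F_t\right)^2}$$

where *A*_{t} is the actual value, *F*_{t} is the forecast value for period *t*, and *m* is the number of forecast values.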
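The three error measures can be computed with a few lines of Python. This sketch uses the standard definitions; the function names and the sample numbers are hypothetical and are not values from the study.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100 / len(actual) * sum(abs((a - f) / a) for a, f in zip(actual, forecast))


def mae(actual, forecast):
    """Mean absolute error, in the units of the data."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean square error, in the units of the data."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


actual = [100.0, 200.0, 400.0]    # hypothetical observations
forecast = [110.0, 190.0, 400.0]  # hypothetical forecasts
print(mape(actual, forecast), mae(actual, forecast), rmse(actual, forecast))
```

MAPE is scale-free, whereas MAE and RMSE are expressed in the units of the series, so they are only comparable between models fitted to the same (for example, log-transformed) data.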

## RESULTS AND DISCUSSION

### kNN results

The given data is divided into two sets: a training set and a test set. As discussed earlier, kNN does not perform any training with the training set; instead, it uses the so-called training data to make predictions for the new data in the test set. Thus, for each entry in the test set, a prediction is made and its validity is checked through the error measures MAPE, MAE, and RMSE. In this research work, the first 100 observations of Varanasi, UP annual rainfall (1906–2005) were considered as the initialization/training set and the remaining 13 years (2006–2018) were taken as the test set, taking the previous year's North East Monsoon (NEM) as input. Different ‘*k*’ values ranging from 2 to 10 were considered, and kNN was applied for each ‘*k*’. Among the *k* values applied, *k* = 3 gave the best MAPE value.

Using kNN, the annual rainfall of Varanasi, UP for the years 2006–2018 is predicted, and these values are compared with the actual rainfall values using the error measures MAPE, MAE, and RMSE.

| Error measure | Value |
|---|---|
| MAPE | 3.831108 |
| MAE | 0.279955 |
| RMSE | 0.339164 |

### Moving average results

Moving average, one of the oldest statistical techniques, is used to smooth any given data. As already discussed, the number of terms to be averaged is called the period of the moving average. For the natural log transformation of Varanasi annual rainfall, the moving average is calculated by varying the period from 2 to 14. For each period, the error measure MAPE is also determined, and the values are shown in Table 2.

| Model | MAPE |
|---|---|
| MA(2) | 4.412168 |
| MA(3) | 3.474756 |
| MA(4) | 3.258254 |
| MA(5) | 3.165095 |
| MA(6) | 3.128931 |
| MA(7) | 3.387038 |
| MA(8) | 3.511451 |
| MA(9) | 3.396071 |
| MA(10) | 3.255401 |
| MA(11) | 3.147462 |
| MA(12) | 3.128595 |
| MA(13) | 3.082821 |
| MA(14) | 3.048054 |

| Error measure | Value |
|---|---|
| MAPE | 3.048054 |
| MAE | 0.222456 |
| RMSE | 0.310444 |

### Exponential smoothing

In single exponential smoothing, a weight ‘*α*’, where 0 ≤ *α* ≤ 1, is assigned to the most recent observation. Values of ‘*α*’ were varied from 0.1 to 0.9, and for each ‘*α*’ value the forecasts and the error measure MAPE were calculated. The MAPE values are shown in Table 4. From the table, it is clear that *α* = 0.1 gave the least MAPE value. The graph of observed and forecast values of Varanasi rainfall by single exponential smoothing for *α* = 0.1 is shown in Figure 7. Error measures other than MAPE for this model are also shown in Table 5.
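The search over ‘*α*’ described in this section can be sketched as follows. This is an illustrative outline only: the recursion is the standard single exponential smoothing form, the initialization of the first forecast with the first observation is an assumption, and the function names and data are hypothetical.

```python
def exponential_smoothing_forecast(actuals, alpha):
    """One-step-ahead forecasts with F_{t+1} = alpha * A_t + (1 - alpha) * F_t."""
    forecasts = [actuals[0]]  # assumed initialization: first forecast = first observation
    for t in range(1, len(actuals)):
        forecasts.append(alpha * actuals[t - 1] + (1 - alpha) * forecasts[t - 1])
    return forecasts


def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100 / len(actual) * sum(abs((a - f) / a) for a, f in zip(actual, forecast))


series = [6.9, 7.1, 6.8, 7.0, 7.2, 6.7, 6.9, 7.0]  # hypothetical log-rainfall values
alphas = [i / 10 for i in range(1, 10)]            # 0.1, 0.2, ..., 0.9

# Score each alpha on every period after the first, where the forecast is not trivial.
scores = {
    alpha: mape(series[1:], exponential_smoothing_forecast(series, alpha)[1:])
    for alpha in alphas
}
best_alpha = min(scores, key=scores.get)
print(best_alpha, scores[best_alpha])
```

A small ‘*α*’ weights the accumulated history heavily and a large ‘*α*’ follows the latest observation closely, so a low optimal value indicates that heavy smoothing suits the series.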

| Model | MAPE |
|---|---|
| ES (α = 0.1) | 3.106286 |
| ES (α = 0.2) | 3.146679 |
| ES (α = 0.3) | 3.272267 |
| ES (α = 0.4) | 3.495975 |
| ES (α = 0.5) | 3.711791 |
| ES (α = 0.6) | 3.910091 |
| ES (α = 0.7) | 4.07846 |
| ES (α = 0.8) | 4.204747 |
| ES (α = 0.9) | 4.333397 |

| Error measure | Value |
|---|---|
| MAPE | 3.106286 |
| MAE | 0.226326 |
| RMSE | 0.312263 |

### Autoregressive moving average or ARIMA

The basic criteria for applying the model are normality and stationarity. The available rainfall data is stationary. Normality of the original data is checked using the Anderson–Darling test, which tests whether a given sample comes from a specified distribution. This test gives more weight to the tails of the distribution than the Kolmogorov–Smirnov (K-S) test does. Its critical values depend on the distribution being tested, which has the disadvantage that they must be calculated separately for each distribution. This is not a serious difficulty, as the Anderson–Darling test is considered superior to the chi-square and Kolmogorov–Smirnov goodness-of-fit tests.

The *p*-value obtained for the original data is 0.007, which is less than 0.05. Hence, the null hypothesis is rejected and the original data is non-normal. Thus, to apply ARIMA, the data has to be transformed so that it is normal. Several transformations were tried, and the best among them was chosen for further investigation (Figure 9).

From the figures given above, it is clear that the *p*-values of the natural log-transformed data and the Johnson-transformed data are greater than 0.05, whereas those of the other transformations are not. However, the MAPE value for the rainfall forecast using the Johnson transformation was found to be much higher than that for the natural log. Thus, further research is done with the natural log transformation.

The partial autocorrelation at lag *k* is the correlation between *y*_{t} and *y*_{t+k} after removing the dependence on the intervening values *y*_{t+1}, *y*_{t+2}, …, *y*_{t+k−1}. The ACF and the PACF plots for the log-transformed rainfall data are shown in Figures 10 and 11.

Since both the ACF and the PACF tail off gradually, the series is taken to be stationary, and hence differencing is not required for the data chosen, which implies that *d* = 0. Different tentative ARIMA models based on the ACF and PACF plots were identified and the MAPE values of those models were calculated. Among those models, ARIMA (2,0,0) gives the lowest MAPE. Error measures for the models are also shown in Table 6.
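To make the selected ARIMA (2,0,0) structure concrete, the sketch below fits an AR(2) model to a log-transformed series and produces a one-step forecast. The estimation method is an assumption: it uses the Yule-Walker equations for simplicity, whereas the text does not state how the coefficients were estimated. The function names and the rainfall values are hypothetical.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [value - mean for value in series]
    return sum(dev[t] * dev[t + lag] for t in range(n - lag)) / sum(d * d for d in dev)


def fit_ar2_yule_walker(series):
    """Solve the Yule-Walker equations for the two AR(2) coefficients."""
    r1 = autocorrelation(series, 1)
    r2 = autocorrelation(series, 2)
    phi1 = r1 * (1 - r2) / (1 - r1 ** 2)
    phi2 = (r2 - r1 ** 2) / (1 - r1 ** 2)
    return phi1, phi2


def forecast_next(series, phi1, phi2):
    """One-step AR(2) forecast around the sample mean of the series."""
    mean = sum(series) / len(series)
    return mean + phi1 * (series[-1] - mean) + phi2 * (series[-2] - mean)


# Hypothetical annual rainfall values (mm), not data from the study.
rainfall_mm = [980.0, 1105.0, 870.0, 1020.0, 1150.0, 905.0, 990.0, 1080.0, 940.0, 1010.0]
log_rainfall = [math.log(value) for value in rainfall_mm]  # natural log transformation

phi1, phi2 = fit_ar2_yule_walker(log_rainfall)
next_log = forecast_next(log_rainfall, phi1, phi2)
print(phi1, phi2, math.exp(next_log))  # forecast converted back to mm
```

Because the model is fitted on the log scale, the forecast is exponentiated to return it to the original units.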

| Model | MAPE |
|---|---|
| ARIMA (1,0,0) | 3.09957 |
| ARIMA (0,0,1) | 3.075936 |
| ARIMA (1,0,1) | 3.275254 |
| ARIMA (2,0,0) | 3.045036 |
| ARIMA (0,0,2) | 3.077912 |
| ARIMA (2,0,2) | 3.180048 |
| ARIMA (2,0,1) | 3.063268 |
| ARIMA (1,0,2) | 3.10476 |
| ARIMA (3,0,0) | 3.088554 |
| ARIMA (0,0,3) | 3.154329 |
| ARIMA (3,0,3) | 3.541949 |
| ARIMA (4,0,0) | 3.309103 |
| ARIMA (0,0,4) | 3.181194 |
| ARIMA (3,0,4) | 3.416867 |
| ARIMA (4,0,3) | 3.460453 |
| ARIMA (4,0,4) | 3.434379 |

| Error measure | Value |
|---|---|
| MAPE | 3.045036 |
| MAE | 0.223393 |
| RMSE | 0.322602 |

### Spline results

While finding the error measure, it was identified that the MAPE value is quite large. Hence, instead of taking all 100 values from 1906 to 2005 as input, only 21 values were chosen for the spline calculation. The error measure for this model is even worse than the previous one and hence this model is dropped.

| Model | MAPE | MAE | RMSE |
|---|---|---|---|
| Spline1 | 10.3612 | 0.741927 | 1.052117 |
| Spline2 | 5.277433 | 0.378281 | 0.508601 |
| Spline3 | 4.245829 | 0.309842 | 0.366111 |

## CONCLUSION

The data mining technique kNN has a wide range of applications, not only in classification but also in regression. The method is applicable in various fields, such as medicine, finance (including stock markets), agriculture (including climate forecasting), and text categorization. In this paper, kNN has been considered for the prediction of Varanasi, UP annual rainfall. The proper choice of ‘*k*’ was sought by the trial-and-error method, and ‘three’ was finally chosen as the best value for ‘*k*’ for doing the prediction. For the test set, a prediction is done based on this chosen ‘*k*’ value, and the validity of the model is checked through all the error measures discussed earlier. Results have shown that the model is quite good at predicting rainfall. This value of ‘*k*’ is maintained for the hybrid models.

Cubic spline approximation is one of the finest approximations, in that the interpolating function has a very small error when compared to the standard interpolating polynomials, namely the Lagrange polynomial and Newton's polynomial. From Figure 17, it can be concluded that the last model, Spline3, has the least error and so performs better than the other two spline models; hence this model is considered for making a hybrid with kNN.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.

## CONFLICT OF INTEREST

The authors declare there is no conflict.
