Water management is essential for sustaining human life, and rainfall forecasting is one of the most important inputs to the water management of an area. A forecast is an estimate of what will happen in the future based on past information, under the assumption that the pattern observed in the past will continue. This work aims to obtain forecasting models for a time series data set using conventional and computational models. Varanasi City's annual climate data for a total of 113 years is used for the analysis. Initially, individual models are considered and used for forecasting. Later, hybrid models are considered, and the individual and hybrid models are compared. The individual statistical models considered are moving average, exponential smoothing with one parameter, and autoregressive integrated moving average (ARIMA). Forecasts are also produced individually using k-nearest neighbor (kNN) and the cubic spline interpolation technique. Finally, the best statistical models and the interpolation model are coupled with kNN to develop hybrid models, which are then used to forecast the data. All the models are compared and the best among them is chosen.

  • Rainfall forecasting is very important for water management; here we compare five of the latest AI/data science forecasting techniques.

  • Study area is Varanasi, the oldest city in India.

  • 113 years of climate data is used.

  • Models used are ARIMA, kNN, spline, exponential smoothing.

  • The hybrid model was prepared for better forecasting.

A time series is a sequence of observations measured at equal intervals of time. The observations may be measured hourly, daily, weekly, monthly, yearly, or at any other regular interval. When the observations are recorded continuously through time, the time series is said to be continuous; if the observations are recorded at specified times, usually equally spaced, it is said to be discrete. The dependence among the observations in a time series is of great interest, and analyzing this dependence is the purpose of time series techniques.

Forecasting in general can be referred to as the process of estimating or evaluating the value of some variable at some future point in time. Weather forecasting plays a significant role in meteorology. Weather forecasting remains a formidable challenge because of its data intensive and frenzied nature (Omar et al. 2017, 2019, 2022). Forecasting is an important problem that is used by the government and industry for planning and decision-making, to protect life and property, and by every individual to carry out daily activities. There are two main broad types of forecasting techniques: qualitative techniques and quantitative techniques. Qualitative forecasts are normally used in situations where there is no or little historical data available. An example is the introduction of a new product for which no history is available whereas quantitative forecasts make use of historical data. In this technique, a forecasting model is used to project past and current data into the future. If the historical data is restricted to past values the forecasting procedure is called the time series method and the corresponding historical data is termed a time series. Analysis of time series mainly deals with statistical methods or any other methods to analyze and extract the characteristics of the given data. Time series analysis aims at identifying the pattern in the given time series and using it to predict the future. A model is constructed to extract meaningful information about the data using appropriate methods. Using the prescribed model, one can forecast future occurrences based on past data.

Three things are essential for the survival of human beings: air, food, and water. Water is available in different forms of precipitation, of which rain is the most important. Rivers and lakes are considered to be natural resources for any place. Forecasting weather parameters such as rainfall, temperature, wind, and humidity for a region plays an important role and is, in fact, one of the main functions of National Weather Services. Information about rainfall is very useful for predicting natural disasters such as droughts and floods. Rainfall forecasting also helps in deciding the area to be irrigated and the water required for irrigation, and in estimating the quantity and quality of surface water and groundwater. Forecasting models help us to understand meteorological information and to integrate it into the planning and decision-making process. Thus, rainfall forecasting aids efficient planning by the government.

Various methodologies have been tried in predicting rainfall. Some of the most popular methods are autoregressive integrated moving average (ARIMA) and artificial neural networks (ANNs).

Gupta et al. (2013) studied a neural network model for the prediction of rainfall in India. They proposed a multilayered ANN with learning by the backpropagation algorithm. The criteria used for prediction are RMSE, correlation, and standard deviation. The proposed model predicted values with suitable results. Kashiwao et al. (2017) presented a local rainfall prediction system using neural networks (NNs) and applied it to local rainfall in Japan using data from the Japan Meteorological Agency (JMA).

Mandale & Jadhawar (2015) analyze the application of data mining techniques in weather forecasting using two techniques: ANN and decision trees. For the classification of weather parameters, the C5 decision tree classification algorithm was applied to formulate rules. Based on the given data, NNs are able to detect the relationship between weather parameters and to predict them. Kar et al. (2019) used a fuzzy logic technique to predict rainfall using the temperature of the geographical location as input. Results show that there is an association between actual and predicted rainfall. Darji et al. (2015) made a detailed survey and comparison of different neural network architectures used by researchers for rainfall forecasting. Their paper discussed the issues of using ANN for yearly/monthly/daily rainfall forecasting and also presented different accuracy measures used by researchers for evaluating the performance of ANN.

Lee et al. (1998) proposed an approach in which the whole region under study is divided into four subregions. The two larger regions are predicted by radial basis function (RBF) networks using information based on location and the remaining two smaller areas are predicted by the model linear regression using information based on elevation. The prediction of daily rainfall at 367 locations is based on the prediction of daily rainfall at 100 locations in Switzerland. Comparisons with the observed data have shown that RBF networks gave better results than the linear regression model. Sharma & Nijhawan (2015), for predicting rainfall in Delhi, have used three different NNs, namely, backpropagation algorithm, cascaded backpropagation, and layer recurrent network with the same learning functions and adaptive learning functions. Backpropagation gave the best results. The training function TRAINLM gave the best results in the training set, testing set, and validation of data. LEARNGDM has been identified as the best adaptive learning function with minimum MSE value. Mislan et al. (2015) applied ANN with backpropagation neural network (BPNN) algorithm to get accurate forecasting of rainfall. The rainfall data were tested using two hidden layers of BPNN architectures with three different epochs. Experimental results show that the epoch of 1,000 produced a good result to predict rainfall in Tenggarong, Indonesia.

Pai & Rajeevan (2006) explored the hypothesis that the incorporation of sea surface temperature configuration information in the statistical models improves the monsoon forecast skill through a long time series of global sea surface temperature and rainfall data. The relationship between all-India summer monsoon rainfall and the timing of El Niño-Southern Oscillation (ENSO) related warming is investigated by Ihara et al. (2008). In their work, the link between the interannual variability of Indian summer monsoon rainfall and the time evolutions of Indo-Pacific sea surface temperature anomalies are also examined and the significance is assessed using a two-sample, two-tailed, Student's t-test.

The combination of the statistical method ARIMA with the neural network model, called the hybrid model, is considered to be one of the famous combinations for making forecasts. Hybrid models are mainly applied to improve the performance of individual techniques by smoothing the input data through conventional statistical methods or any other methods and then applying the second model. Remesan et al. (2009) applied hybrid models for runoff prediction. Indrabayu et al. (2013) proposed a novel approach in which they combined a support vector machine (SVM) and fuzzy logic. This new combination is then compared with another combination of neural network and fuzzy logic. Results have shown that the combination of SVM and fuzzy logic achieved higher accuracy than a combination of neural network and fuzzy logic.

Rahul et al. (2020) identified a suitable seasonal model through the Box–Jenkins seasonal ARIMA approach for the prediction of monthly rainfall in UP. Two methods, ARIMA and adaptive splines threshold autoregressive (ASTAR), are compared by Indrabayu et al. (2013) in another paper to predict daily rainfall in the area of Makassar, Indonesia. For prediction, 10 years of daily data from 2001 to 2010 obtained from Badan Meteorologi, Klimatologi, dan Geofisika (BMKG) were used. Among various meteorological variables, four variables, namely temperature, humidity, wind speed, and previous precipitation, were selected and given as input to both methods. Based on the root mean square error (RMSE), ASTAR was shown to outperform ARIMA. Nirmala & Sundaram (2010) coupled the traditional moving average technique with ANN to make a hybrid model, MA-ANN. They showed that the combined model is a better tool than the moving average and ANN models applied individually. Joseph & Ratheesh (2013) applied data mining techniques such as clustering and classification for the prediction of rainfall, implementing a neural network with Bayesian regularization. The dataset was obtained from the official website of the National Oceanic and Atmospheric Administration (NOAA) maintained by the US Department of Commerce. For the prediction, the parameters relative humidity, pressure, temperature, precipitable water, and wind speed were used.

Two prediction techniques, one linear (ARIMA) and the other non-linear (ANN), are presented by Mahalakshmi et al. (2014). A comparison is made between ARIMA and the hybrid model ARIMA-ANN. Based on the mean squared error (MSE), the hybrid model outperforms ARIMA because it exploits both the linearity of ARIMA and the non-linearity of ANN. Mohamed & Ibrahim (2016) identified the seasonal ARIMA model $(0,0,0) \times (0,1,1)_{12}$ for monthly rainfall prediction at Nyala station, Sudan. For the prediction, rainfall data for the years 1971–2010 were used. Using this model, the monthly rainfall for upcoming years is predicted. Swain et al. (2020) developed an ARIMA model for the Khordha district, Odisha, India. The ARIMA model was selected using the Akaike information criterion (AIC) and Bayesian information criterion (BIC). The forecasts produced by the model showed an excellent match with observed monthly rainfall data.

Kaushik & Singh (2008) predicted temperature and rainfall on a monthly scale using the seasonal ARIMA approach. For this, 12 years of data from 1994 to 2006 were used to predict temperature and rainfall for the next five years at Mirzapur, Uttar Pradesh. The performance of the model is assessed using the correlation coefficient (R²) and RMSE. The results indicate that the model provides a reliable and satisfactory prediction of temperature and rainfall on a monthly scale. Graham & Mishra (2017) predicted rainfall for the next five years on a monthly scale by analyzing the previous 31 years (1985–2015) of rainfall data for Allahabad, Uttar Pradesh. The seasonal ARIMA model (0,0,0) (0,1,0) was identified as the best for making the prediction. Based on the correlation coefficient (R²) and RMSE, the prediction of rainfall by the model proved consistent and satisfactory on a monthly scale. Chaudhari et al. (2013) also discuss the application of various data mining techniques to predict, classify, cluster, or associate meteorological data patterns.

A survey of different techniques for rainfall prediction is presented by Hirani & Mishra (2016). Their paper reports a detailed survey of the following methods: multiple linear regression (MLR), ARIMA, genetic algorithm, ASTAR, SVM, fuzzy logic, BPNN, radial basis function network (RBFN), self-organizing map (SOM), and the weather research and forecasting (WRF) model. These methods have been used extensively over the last 20 years. The survey found that most researchers use ANN and obtain significant results. It also reports that forecasting techniques using multilayer perceptron (MLP), BPNN, RBFN, SOM, and SVM gave more suitable results for rainfall prediction than statistical and numerical methods, though each model has some limitations.

Literature survey reveals that rainfall forecasting was done by many techniques like artificial intelligence techniques, NNs, ARIMA, SVM, MLR, k-nearest neighbor (kNN), spline, fuzzy logic, genetic algorithm, particle swarm optimization, etc. Every model has its own limitations, advantages, and disadvantages. In this research study, the non-parametric technique kNN is applied individually. Also, different combinations of hybrid models with kNN have been proposed as an innovative approach for forecasting rainfall in Varanasi, UP.

The overall objective of the work is to develop rainfall forecasting models using the computational technique kNN. Another attempt has been made by applying the interpolation technique spline. The main goal of the study is to develop new hybrid models for evaluating the effectiveness of the data pre-processing technique, moving average and the time series-forecasting model ARIMA and the interpolation technique spline with kNN. The various models developed are compared using error measures and are applied for forecasting the annual rainfall amount in Varanasi. The objectives of this research work have been framed as follows:

  • To identify computational intelligence techniques for the time series data.

  • To formulate new models and methods based on computational intelligence for time series data.

  • To compare the performance of new hybrid models with conventional models.

  • To validate the model for the time series data.

The study area modeled in the present study is part of the Ganga river basin. It occupies the southern and eastern parts of the state of Uttar Pradesh, India (Figure 1). The area lies between longitude 82°1′52.439″E and 83°55′10.63″E and latitude 24°22′53.034″N and 26°2′7.842″N. The significant stations covered by the study area are the Varanasi and Gazipur districts. The mainstream flow of the study area lies in the Varanasi district. The average rainfall in this area is 941.2 mm. The maximum rainfall occurs in July, August, and September, and the minimum rainfall occurs in June and October. From November to May, the study area receives negligible rainfall. The dominant land uses of the basin are barren and urban lands, which cover more than 50% of the basin.
Figure 1

Location of the study area in India.

Data mining techniques are applicable in various areas like clustering, pattern recognition, machine learning, etc. ANNs, Bayesian classification, SVMs, genetic algorithms, fuzzy logic, and rough sets are some of the popular methods in data mining.

k-nearest neighbor (kNN)

The k-nearest neighbor method is a non-parametric technique. Unlike techniques that build a generalized model from the available data before any new information arrives, kNN is a lazy technique: it performs no generalization on the available data, called training data, and instead waits for a new entry before carrying out the classification or prediction for that test sample, hence the name lazy learner. In other words, kNN simply stores the entire training dataset and consults all of it when a test sample arrives. No model is constructed and, as a result, no learning step is required; given new data, the prediction is computed directly from the stored dataset.

This is one of the simplest classifiers, and it can also be used for regression. It is called non-parametric because it makes no assumptions about the distribution of the given data. The fact that it needs no prior information or basic assumptions about the data is what makes the method attractive. It is termed lazy learning because it does not generalize from the available training data.

kNN method was first described in the early 1950s but did not gain popularity until the 1960s when computing power became available. It is extensively used in pattern recognition.

Some techniques can be used for both classification and regression like ANNs and kNN, whereas some techniques can be used either for classification or regression but not for both. Logistic regression can be used only for classification whereas linear regression can be used only for regression. Some of the advantages of kNN are as follows:

  • Can be applied to data following any distribution; hence, it is called non-parametric.

  • Very simple and intuitive as it requires only two parameters for the implementation, namely the value of ‘k’ and the distance function.

  • No training phase is required and hence any new data can be added and analysis can be done.

  • Very effective in classification if the sample is very large.

Some of the disadvantages of kNN are as follows:

  • Choosing ‘k’ may be tricky.

  • Requires large storage space.

  • The testing phase is more computationally expensive than the training phase.

  • Very sensitive to missing values, noise, outliers, and irrelevant attributes.

  • Does not work well with high dimensionality.

Summarizing the whole procedure, the algorithm for kNN either for classification or for regression is given as follows:

  • Compute the distance D(x, xi) between the new data point x and every training instance xi.

  • Choose the ‘k’ training instances closest to x, based on the ‘k’ smallest distances, together with their corresponding outputs ‘y’.

  • Classify the required output as the one which is the majority among the k nearest y values, in the case of classification.

  • Calculate the required output for the input x using some appropriate formula based on the k nearest y values, in the case of regression.
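
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the Euclidean distance, the plain mean used for regression, the function names, and the sample values are all assumptions made for the example.

```python
import math
from collections import Counter


def k_nearest_outputs(train_x, train_y, query, k):
    """Return the outputs y of the k training instances closest to `query`."""
    # Step 1: Euclidean distance D(x, xi) from the query to every training instance.
    distances = [(math.dist(x, query), y) for x, y in zip(train_x, train_y)]
    # Step 2: keep the outputs belonging to the k smallest distances.
    distances.sort(key=lambda pair: pair[0])
    return [y for _, y in distances[:k]]


def knn_classify(train_x, train_y, query, k=3):
    """Classification: majority vote among the k nearest outputs."""
    votes = Counter(k_nearest_outputs(train_x, train_y, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_x, train_y, query, k=3):
    """Regression: here, the plain mean of the k nearest outputs (an assumption)."""
    nearest = k_nearest_outputs(train_x, train_y, query, k)
    return sum(nearest) / len(nearest)


# Illustrative values only, not the Varanasi record.
train_x = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
predicted_value = knn_regress(train_x, [10.0, 20.0, 30.0, 40.0, 50.0], (2.9,), k=3)
predicted_label = knn_classify(train_x, ["dry", "dry", "wet", "wet", "wet"], (2.9,), k=3)
```

Because every prediction scans and sorts the full training set, the cost sits entirely in the testing phase, which is the trade-off listed among the disadvantages.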

Moving average

Moving average is the most widely used method in time series analysis. The most common moving average is the unweighted moving average, in which each data point is given equal weight. The mean of the most recent n observations is taken for the calculation. Once a new value is available, it is included and the oldest observation is dropped from the window, and the new average is calculated, thus maintaining the same number of terms. Every time a new entry is added an old observation is deleted, so the average keeps changing, hence the name moving average. The number of terms used to calculate the average has to be decided and is called the period of the moving average. It can be a three-, five-, or seven-yearly moving average and so on. The main application of the moving average is to smooth the given time series data. The forecast value using a simple moving average is given by:

$$F_t = \frac{A_{t-1} + A_{t-2} + \cdots + A_{t-n}}{n}$$

where $F_t$ is the forecast for the period ‘t’, n is the number of terms to be averaged, and $A_{t-1}, A_{t-2}, \ldots, A_{t-n}$ are the actual values for the periods t − 1, t − 2, …, t − n.
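
As a small illustration of the simple moving average forecast, the sketch below averages the n observations preceding each period. The function name and the sample series are hypothetical and are not taken from the rainfall data.

```python
def moving_average_forecast(actual, n):
    """Return one-step-ahead forecasts, each the mean of the n preceding observations.

    The first forecast is for index n; the last one is for the period
    immediately after the end of `actual`.
    """
    forecasts = []
    for t in range(n, len(actual) + 1):
        window = actual[t - n:t]          # the n most recent observations before t
        forecasts.append(sum(window) / n)
    return forecasts


# Illustrative series only.
series = [4.0, 6.0, 5.0, 7.0, 8.0]
forecasts = moving_average_forecast(series, 3)
```

A larger period n gives a smoother but slower-reacting forecast, which is why the period is treated as a tuning choice.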

Autoregressive integrated moving average (ARIMA)

An iterative and effective approach for analyzing time series introduced by Box & Jenkins (1976) is the ARIMA. An ARIMA model in a time series predicts a value as a linear combination of its own past values, and past errors (also known as shocks). A powerful classical model that combines the autoregressive model and moving average model through a different process ‘integrated’ to make the series stationary is called an ARIMA model. The autoregressive model of order ‘p’ abbreviated as AR(p) is based on the past ‘p’ values of the variable Yt.

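Thus, if $Y_t$ denotes the current value of the variable, it is expressed as a linear combination of its own previous values at times t − 1, t − 2, …, t − p plus a random disturbance. An autoregressive model of order ‘p’ is represented as follows:

$$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \varepsilon_t$$

where c is a constant, $\phi_1, \ldots, \phi_p$ are the autoregressive coefficients, and $\varepsilon_t$ is the random disturbance at time t.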

The above equation resembles a multiple regression model except that Yt is regressed on its past values instead of different predictor variables, hence the prefix ‘auto’ in the autoregressive model.

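Similarly, the moving average model of order ‘q’ abbreviated as MA(q) is based on the past ‘q’ disturbances or prediction errors of the past values of the same variable. Thus, the moving average model uses past errors as explanatory variables. A moving average model of order ‘q’ is represented as follows:

$$Y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}$$

where $\mu$ is the mean of the series, $\theta_1, \ldots, \theta_q$ are the moving average coefficients, and $\varepsilon_t$ is the random disturbance at time t (some texts, including Box & Jenkins, write the $\theta$ terms with negative signs).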

The autoregressive model and the moving average model are efficiently coupled to form a general and useful class of time series model called ARMA. This class of models can be extended as an ARIMA model for non-stationary time series by differencing the data series. Non-stationary data are unpredictable and cannot be modeled or forecasted. The non-stationary data needs to be transformed into stationary data to obtain consistent and reliable results.

ARIMA model can be guessed to some extent based on three things: time series plot, autocorrelation function (ACF), and partial autocorrelation function (PACF). Based on the ACF and PACF plots, the general characteristics of the various models can be obtained by the following guidelines.

  • If the series is non-stationary, then the ACF plot remains significant for six or more lags instead of declining to zero quickly. In that case, the series must be differenced until it becomes stationary.

  • Exponentially declining ACF and spikes in the first one or more lags in PACF indicate an autoregressive model.

  • Spikes in the first one or more lags in ACF and exponentially declining PACF indicate a moving average model.

  • Exponential decline in both ACF and PACF indicates a mixed model, namely an autoregressive moving average model.
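
To make the guidelines above concrete, the following sketch computes the sample ACF and obtains the PACF from it with the Durbin-Levinson recursion, then applies both to a simulated AR(1) series. The simulated series, its coefficient of 0.7, the seed, and the function names are assumptions for illustration only; they are not derived from the rainfall data.

```python
import random


def acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [value - mean for value in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + lag] for t in range(n - lag)) / c0
            for lag in range(max_lag + 1)]


def pacf(series, max_lag):
    """Sample partial autocorrelations for lags 0..max_lag (Durbin-Levinson)."""
    rho = acf(series, max_lag)
    values = [1.0]
    phi_prev = []                          # coefficients of the order k-1 fit
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den                 # partial autocorrelation at lag k
        phi_new = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi_new.append(phi_kk)
        values.append(phi_kk)
        phi_prev = phi_new
    return values


# Simulated AR(1) series with coefficient 0.7 (illustrative, not rainfall data).
random.seed(0)
ar1 = [0.0]
for _ in range(1999):
    ar1.append(0.7 * ar1[-1] + random.gauss(0.0, 1.0))

sample_acf = acf(ar1, 5)     # expected to decline gradually with the lag
sample_pacf = pacf(ar1, 5)   # expected to be near zero beyond lag 1
```

For such a series the ACF decays roughly geometrically while the PACF is close to zero after the first lag, which is the autoregressive signature described in the second guideline.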

In most time series problems, the data are non-stationary. Such data are converted into a stationary series by differencing, where the number of differences applied is the order of integration parameter ‘d’ (Figure 2).
Figure 2

Flowchart for ARIMA.

Spline

A piecewise polynomial fitted to each subinterval of an entire dataset is called a spline function or simply a spline. Splines can be of any degree. A linear spline consists of the straight lines joining each pair of adjacent data points. A quadratic spline uses second-degree approximating polynomials, and a cubic spline uses a third-degree polynomial between each pair of adjacent data points. Higher-degree splines can be defined in a similar manner. However, the cubic spline offers a better balance of accuracy and complexity than linear and quadratic splines. A spline is a piecewise polynomial that satisfies certain continuity conditions. As polynomial functions and their derivatives of all orders are always continuous, any discontinuity can arise only at the endpoints of each subinterval. These points are termed the ‘knots’ or ‘ducks’. Originally, a spline was a thin strip of metal or wood used by drafters to draw smooth shapes: the strip was bent at points chosen by the drafter and held in place with weights or pins so that it passed through all of them. The positions of these weights are called ‘knots’. In mathematical terminology, a spline refers to polynomial functions defined on each subinterval and joined together so as to give a smooth function over the entire interval. Due to the flexibility of the rod, the slope and curvature are continuous at every point (Figure 3).
Figure 3

A draftsman's spline (Courtesy: en.wikipedia.org).

Spline is an interpolation technique in which a unique cubic polynomial is fitted between each pair of adjacent data points so that the resulting curve is continuous and smooth. The basic idea of the cubic spline is to represent the function by a different cubic function on each interval between data points.
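
The piecewise construction can be sketched as follows. The text does not state which end conditions were used, so this example assumes a natural cubic spline (zero second derivative at both ends); the function name and the sample points are likewise hypothetical.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys).

    On each interval [xs[j], xs[j+1]] the spline is
    ys[j] + b[j]*dx + c[j]*dx**2 + d[j]*dx**3 with dx = x - xs[j].
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]

    # Right-hand side of the tridiagonal system for the c coefficients.
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]

    # Forward elimination; the natural end conditions fix c[0] = c[n] = 0.
    diag, mu, z = [1.0] * (n + 1), [0.0] * (n + 1), [0.0] * (n + 1)
    for i in range(1, n):
        diag[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / diag[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / diag[i]

    # Back substitution for the remaining coefficients.
    b, c, d = [0.0] * n, [0.0] * (n + 1), [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def evaluate(x):
        # Locate the interval containing x, clamping to the first or last one.
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate


# Illustrative knots only.
knots_x = [0.0, 1.0, 2.0, 3.0, 4.0]
knots_y = [0.0, 1.0, 0.0, 1.0, 0.0]
spline = natural_cubic_spline(knots_x, knots_y)
value_between_knots = spline(1.5)
```

The curve passes through every knot, and the linear system enforces matching slope and curvature where adjacent cubics meet.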

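Different models are considered for forecasting the rainfall data. To assess the quality of the forecasts and to evaluate the consistency of the models, error measures are required. In this research work, the error measures mean absolute percentage error (MAPE), mean absolute error (MAE), and RMSE are used. The formulas for the error measures are given below, where $A_t$ is the observed value, $F_t$ is the forecast value, and n is the number of forecasts:

$$\text{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right|$$

$$\text{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|A_t - F_t\right|$$

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(A_t - F_t\right)^2}$$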

However, MAPE becomes unstable when observed values are close to zero and is undefined if any observed value is exactly zero, because the observed value lies in the denominator of the formula. It should not be used in such cases.
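
The three error measures can be computed as in the sketch below, which uses the standard definitions with MAPE expressed as a percentage. The function names and the sample values are illustrative assumptions, and the explicit check reflects the zero-denominator caveat noted above.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an observed value is zero")
    return 100.0 / len(actual) * sum(abs((a - f) / a) for a, f in zip(actual, forecast))


def mae(actual, forecast):
    """Mean absolute error, in the units of the data."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean square error, in the units of the data."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


# Illustrative values only.
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
errors = (mape(observed, predicted), mae(observed, predicted), rmse(observed, predicted))
```

MAE and RMSE carry the units of the data, whereas MAPE is a relative measure; RMSE penalizes large individual errors more heavily than MAE.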

kNN results

The time series plot of Varanasi's monthly rainfall for the period from 1996 to 2015 obtained from the dataset is depicted in Figure 4.
Figure 4

Time series plot of Varanasi monthly rainfall.

The given data is divided into two sets: a training set and a test set. As discussed earlier, kNN does not perform any training with the training set; instead, it uses the so-called training data to make predictions for the new data in the test set. Thus, for each data point in the test set, a prediction is made and the validity of the prediction is checked through the error measures MAPE, MAE, and RMSE. In this research work, the first 100 observations of annual rainfall for Varanasi, UP (1906–2005) were used as the initialization/training set and the remaining 13 years (2006–2018) were taken as the test set, with the previous year's North East Monsoon (NEM) as input. Different ‘k’ values ranging from 2 to 10 were considered, and kNN was applied for each ‘k’. Among the k values applied, k = 3 gave the best MAPE value.
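
The evaluation protocol described above (a chronological split followed by a sweep over k) can be sketched as follows. This is only an illustration under stated assumptions: the series is synthetic, the input for each year is taken to be the previous value of the same series as a stand-in for the previous year's NEM input, and all function names are hypothetical.

```python
import math


def knn_forecast(inputs, outputs, query, k):
    """Mean output of the k stored inputs closest to `query`."""
    ranked = sorted(zip(inputs, outputs), key=lambda pair: abs(pair[0] - query))
    return sum(y for _, y in ranked[:k]) / k


def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100.0 / len(actual) * sum(abs((a - f) / a) for a, f in zip(actual, forecast))


def sweep_k(series, n_train, k_values):
    """Score each k on the held-out tail of `series`; return (best_k, scores)."""
    inputs, outputs = series[:-1], series[1:]   # input: previous year, output: current year
    split = n_train - 1                         # pairs whose output lies in the training years
    scores = {}
    for k in k_values:
        predictions = [knn_forecast(inputs[:split], outputs[:split], x, k)
                       for x in inputs[split:]]
        scores[k] = mape(outputs[split:], predictions)
    return min(scores, key=scores.get), scores


# Synthetic stand-in for 113 annual values; not the Varanasi data.
series = [6.5 + 0.3 * math.sin(0.7 * t) for t in range(113)]
best_k, scores = sweep_k(series, n_train=100, k_values=range(2, 11))
```

Keeping the split chronological means the test years are never used when choosing neighbors, so the reported error reflects genuine out-of-sample forecasts.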

Using kNN, the annual rainfall of Varanasi, UP for the years 2006–2018 is predicted, and these values are compared with the actual rainfall values using the error measures MAPE, MAE, and RMSE.

The graph between the actual and predicted values of rainfall in Varanasi, UP using kNN is given in Figure 5. Error measures for the model are shown in Table 1.
Table 1

Error measures for the model kNN

Error measure    Value
MAPE             3.831108
MAE              0.279955
RMSE             0.339164

Figure 5

Graph of actual and predicted values of annual rainfall in Varanasi, UP using kNN.

Moving average results

The moving average, one of the oldest statistical techniques, is used to smooth the given data. As already discussed, the number of terms to be averaged is called the period of the moving average. For the natural log transformation of Varanasi annual rainfall, the moving average is calculated by varying the period from 2 to 14. For each period from 2 to 14, the error measure MAPE is also determined, and the values are shown in Table 2.

Table 2

MAPE for the model moving average

Model     MAPE
MA(2)     4.412168
MA(3)     3.474756
MA(4)     3.258254
MA(5)     3.165095
MA(6)     3.128931
MA(7)     3.387038
MA(8)     3.511451
MA(9)     3.396071
MA(10)    3.255401
MA(11)    3.147462
MA(12)    3.128595
MA(13)    3.082821
MA(14)    3.048054

Among these values, the moving average with period 14 had the lowest MAPE. This period-14 moving average is therefore used to build a hybrid model with kNN, with the aim of further reducing not only MAPE but also MAE and RMSE. Figure 6 shows the observed and forecast values of Varanasi rainfall by moving average for period 14. Error measures other than MAPE for this chosen moving average model are shown in Table 3.
Table 3

Error measures for the model MA(14)

Error measure    Value
MAPE             3.048054
MAE              0.222456
RMSE             0.310444

Figure 6

Graph of actual and predicted values of annual rainfall in Varanasi using MA(14).

Exponential smoothing

Exponential smoothing is a special case of the weighted moving average in which the weights decrease exponentially as the observations get older. A single smoothing constant ‘α’, where 0 ≤ α ≤ 1, is the weight assigned to the most recent observation. Values of ‘α’ were varied from 0.1 to 0.9, and the calculations were carried out for each of these values. The error measure MAPE was then calculated for every ‘α’ value, and the results are shown in Table 4. From the table, it is clear that α = 0.1 gave the lowest MAPE value. The graph of observed and forecast values of Varanasi rainfall by single exponential smoothing for α = 0.1 is shown in Figure 7. Error measures other than MAPE for this model are shown in Table 5.
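
A minimal sketch of single exponential smoothing and the sweep over α is given below. The text does not state how the first forecast was initialized, so setting it equal to the first observation is an assumption here, as are the function names and the sample series.

```python
def exponential_smoothing(actual, alpha):
    """One-step-ahead forecasts F[t+1] = alpha * A[t] + (1 - alpha) * F[t]."""
    forecasts = [actual[0]]               # assumption: first forecast = first observation
    for observed in actual[:-1]:
        forecasts.append(alpha * observed + (1 - alpha) * forecasts[-1])
    return forecasts


def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100.0 / len(actual) * sum(abs((a - f) / a) for a, f in zip(actual, forecast))


# Illustrative series only, not the rainfall data.
series = [10.0, 20.0, 30.0, 20.0, 10.0, 20.0, 30.0]
alphas = [i / 10 for i in range(1, 10)]   # 0.1, 0.2, ..., 0.9

# Score each alpha, skipping the first period because its forecast is the initial value.
scores = {a: mape(series[1:], exponential_smoothing(series, a)[1:]) for a in alphas}
best_alpha = min(scores, key=scores.get)
```

A small α spreads weight over many past observations and gives a smooth forecast, while a large α follows the most recent observation closely.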
Table 4

MAPE for the model exponential smoothing

Model           MAPE
ES (α = 0.1)    3.106286
ES (α = 0.2)    3.146679
ES (α = 0.3)    3.272267
ES (α = 0.4)    3.495975
ES (α = 0.5)    3.711791
ES (α = 0.6)    3.910091
ES (α = 0.7)    4.07846
ES (α = 0.8)    4.204747
ES (α = 0.9)    4.333397

Table 5

Error measures for the model ES (α = 0.1)

Error measure    Value
MAPE             3.106286
MAE              0.226326
RMSE             0.312263

Figure 7

Graph of actual and predicted values of annual rainfall in Varanasi using ES (α = 0.1).

Autoregressive integrated moving average (ARIMA)

The basic criteria for applying the model are normality and stationarity. The available rainfall data is stationary. Normality of the original data is checked using the Anderson–Darling test, which tests whether a given sample comes from a specific distribution. This test gives more weight to the tails of the distribution than the Kolmogorov–Smirnov (K-S) test. Its critical values are calculated for the particular distribution being tested, which has the disadvantage that critical values must be calculated separately for each distribution. This is not a serious difficulty, however, and the Anderson–Darling test is considered superior to the chi-square and Kolmogorov–Smirnov goodness-of-fit tests.
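
For illustration, the Anderson–Darling statistic for normality can be computed as in the sketch below, with the mean and standard deviation estimated from the sample. This is not the software used in the study; the function name and both samples are hypothetical, and the critical values or p-value needed to complete the test are not computed here.

```python
import math
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(sample):
    """Anderson-Darling A^2 statistic against a normal distribution.

    The mean and standard deviation are estimated from the sample. Larger
    values indicate a poorer fit, especially in the tails.
    """
    n = len(sample)
    mu, sigma = mean(sample), stdev(sample)
    standard_normal = NormalDist()
    # Fitted normal CDF evaluated at the ordered observations.
    cdf = sorted(standard_normal.cdf((value - mu) / sigma) for value in sample)
    total = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
                for i in range(n))
    return -n - total / n


# Two illustrative samples: evenly spaced normal quantiles and a right-skewed version.
normal_like = [NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
skewed = [math.exp(value) for value in normal_like]

normal_stat = anderson_darling_normal(normal_like)
skewed_stat = anderson_darling_normal(skewed)
```

The logarithms of the fitted CDF and of its complement are what give extra weight to the tails, in contrast to the K-S statistic, which looks only at the largest CDF gap.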

The probability plot (normality plot) for the original annual rainfall data is shown in Figure 8. The test statistic is 1.086 and the p-value is 0.007, which is less than 0.05. Hence, the null hypothesis of normality is rejected and the original data is non-normal. To apply ARIMA, the data therefore has to be transformed so that it is approximately normal. Several transformations were tried, and the best one was chosen for further investigation (Figure 9).
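
The normality check above can be sketched in Python as follows. This is a minimal sketch, not the authors' procedure: the function names are hypothetical, the mean and standard deviation are estimated from the sample, and the p-value uses a commonly quoted approximation attributed to D'Agostino and Stephens, which is assumed here to be close to what the authors' statistical software reports. The skewed sample is synthetic.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def ad_p_value(a2, n):
    """Approximate p-value for the Anderson-Darling normality statistic.

    Assumption: small-sample adjustment and piecewise formulas as commonly
    quoted from D'Agostino and Stephens (mean and variance estimated).
    """
    a = a2 * (1 + 0.75 / n + 2.25 / n ** 2)
    if a >= 0.6:
        return math.exp(1.2937 - 5.709 * a + 0.0186 * a * a)
    if a > 0.34:
        return math.exp(0.9177 - 4.279 * a - 1.38 * a * a)
    if a > 0.2:
        return 1 - math.exp(-8.318 + 42.796 * a - 59.938 * a * a)
    return 1 - math.exp(-13.436 + 101.14 * a - 223.73 * a * a)


def anderson_darling_normal(sample):
    """Return (A2, p_value) for the null hypothesis that sample is normal."""
    n = len(sample)
    m, s = mean(sample), stdev(sample)
    std_normal = NormalDist()
    # CDF of the standardised, sorted sample; clipped to keep the logs finite.
    cdf = [
        min(max(std_normal.cdf((x - m) / s), 1e-12), 1 - 1e-12)
        for x in sorted(sample)
    ]
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - total / n
    return a2, ad_p_value(a2, n)


# p-value implied by the statistic reported in the text (1.086, n = 113).
p_reported = ad_p_value(1.086, 113)

# Synthetic, strongly skewed sample: normality should be rejected.
rng = random.Random(0)
skewed = [rng.expovariate(1.0) for _ in range(113)]
a2_skewed, p_skewed = anderson_darling_normal(skewed)
print(round(p_reported, 4), p_skewed < 0.05)
```

Under these assumptions, the reported statistic of 1.086 with 113 observations gives a p-value of approximately 0.007, in line with the value quoted above.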
Figure 8

Normality plot for standardized model of annual rainfall in Varanasi.

Figure 9

Normality plot for unity-based normalization of annual rainfall in Varanasi.


From the figures above, the p-values of the natural log-transformed data and the Johnson-transformed data are greater than 0.05, whereas those of the other transformations are not. However, the MAPE for the rainfall forecast using the Johnson transformation was found to be much higher than that for the natural log transformation. Thus, the natural log transformation is used for the rest of the analysis.

The ACF and the PACF help to check whether the given time series is stationary or non-stationary. The ACF plot shows how the values of the time series are correlated with their past values, whereas the PACF is the autocorrelation between y_t and y_{t+k} after removing the linear dependence on the intervening values y_{t+1}, y_{t+2}, …, y_{t+k−1}. The ACF and the PACF plots for the log-transformed rainfall data are shown in Figures 10 and 11.
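
To make the two quantities concrete, the sketch below computes the sample ACF directly from its definition and the sample PACF with the Durbin-Levinson recursion. The function names and the short series are illustrative assumptions; statistical packages may use slightly different estimators, so the values need not match the authors' plots exactly.

```python
def acf(y, max_lag):
    """Sample autocorrelations r_0, r_1, ..., r_max_lag."""
    n = len(y)
    mu = sum(y) / n
    dev = [v - mu for v in y]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]


def pacf(y, max_lag):
    """Sample partial autocorrelations via the Durbin-Levinson recursion."""
    r = acf(y, max_lag)
    phi_prev = []  # coefficients of the AR(k-1) fit from the previous step
    out = [1.0]
    for k in range(1, max_lag + 1):
        num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the AR(k) coefficients; the last one is the lag-k PACF.
        phi_prev = [
            phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)
        ] + [phi_kk]
        out.append(phi_kk)
    return out


# Made-up series, for illustration only.
y = [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4]
r = acf(y, 3)
p = pacf(y, 3)
# Approximate 95% bounds often drawn on ACF/PACF plots.
bound = 1.96 / len(y) ** 0.5
print([round(v, 3) for v in r], [round(v, 3) for v in p], round(bound, 3))
```

At lag 1 the PACF equals the ACF, and at lag 2 it reduces to (r_2 − r_1²)/(1 − r_1²), which is a quick sanity check on the recursion.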
Figure 10

ACF plot for the transformed annual rainfall in Varanasi.

Figure 11

PACF plot for the transformed annual rainfall in Varanasi.


Since both the ACF and the PACF tail off gradually, the series is taken to be stationary, so differencing is not required for the chosen data, which implies d = 0. Several tentative ARIMA models were identified from the ACF and PACF plots and their MAPE values were calculated. Among these models, ARIMA (2,0,0) gives the lowest MAPE. The MAPE values for the models are shown in Table 6.
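
An ARIMA (2,0,0) model is an AR(2) model with a mean term, y_t = μ + φ1 (y_{t−1} − μ) + φ2 (y_{t−2} − μ) + e_t. The sketch below fits such a model with the Yule-Walker equations and produces one-step-ahead fitted values. This is a swapped-in estimator chosen to keep the example short; statistical packages typically fit ARIMA models by maximum likelihood, so this is not necessarily how the authors obtained their results. The function names and the simulated series are assumptions for illustration.

```python
import random


def autocorr(y, k):
    """Sample autocorrelation of y at lag k."""
    n = len(y)
    mu = sum(y) / n
    c0 = sum((v - mu) ** 2 for v in y)
    return sum((y[t] - mu) * (y[t + k] - mu) for t in range(n - k)) / c0


def fit_ar2_yule_walker(y):
    """Estimate (mu, phi1, phi2) of an AR(2) model from r_1 and r_2."""
    mu = sum(y) / len(y)
    r1, r2 = autocorr(y, 1), autocorr(y, 2)
    phi1 = r1 * (1 - r2) / (1 - r1 ** 2)
    phi2 = (r2 - r1 ** 2) / (1 - r1 ** 2)
    return mu, phi1, phi2


def one_step_forecasts(y, mu, phi1, phi2):
    """Fitted values for y[2:], each using only the two preceding values."""
    return [
        mu + phi1 * (y[t - 1] - mu) + phi2 * (y[t - 2] - mu)
        for t in range(2, len(y))
    ]


# Simulated AR(2) series with known parameters, for illustration only.
rng = random.Random(42)
true_mu, true_phi1, true_phi2 = 7.0, 0.5, -0.3
series = [true_mu, true_mu]
for _ in range(3000):
    series.append(
        true_mu
        + true_phi1 * (series[-1] - true_mu)
        + true_phi2 * (series[-2] - true_mu)
        + rng.gauss(0, 0.3)
    )

mu, phi1, phi2 = fit_ar2_yule_walker(series)
fitted = one_step_forecasts(series, mu, phi1, phi2)
print(round(mu, 2), round(phi1, 2), round(phi2, 2))
```

The fitted values can then be compared with the observations to obtain MAPE, MAE and RMSE, which is how candidate orders such as those in Table 6 would be ranked.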

Table 6

MAPE for different ARIMA models

Model | MAPE
ARIMA (1,0,0) | 3.09957
ARIMA (0,0,1) | 3.075936
ARIMA (1,0,1) | 3.275254
ARIMA (2,0,0) | 3.045036
ARIMA (0,0,2) | 3.077912
ARIMA (2,0,2) | 3.180048
ARIMA (2,0,1) | 3.063268
ARIMA (1,0,2) | 3.10476
ARIMA (3,0,0) | 3.088554
ARIMA (0,0,3) | 3.154329
ARIMA (3,0,3) | 3.541949
ARIMA (4,0,0) | 3.309103
ARIMA (0,0,4) | 3.181194
ARIMA (3,0,4) | 3.416867
ARIMA (4,0,3) | 3.460453
ARIMA (4,0,4) | 3.434379

The graph for the actual and the predicted values of rainfall for the ARIMA (2,0,0) model is shown in Figure 12. Error measures for the same are depicted in Table 7. The residual plots of ACF and PACF for this model are shown in Figure 13.
Table 7

Error measures for the model ARIMA (2,0,0)

Error measure | Value
MAPE | 3.045036
MAE | 0.223393
RMSE | 0.322602
Figure 12

Graph of actual and predicted values of annual rainfall in Varanasi using ARIMA (2,0,0).

Figure 13

The residual plots of ACF and PACF of rainfall in Varanasi using ARIMA (2,0,0).


Spline results

A spline is basically an interpolation technique that can also be used for extrapolation. In this research work, three spline models are considered and compared. The spline is also applied to the transformed data. Transformed NEM values of rainfall from the year 1906–2006 are taken as input and the forecast is done for the remaining 13 years (2006–2018) using MATLAB. This software uses the not-a-knot condition for the construction of the cubic spline. This model is taken as Spline1. The fitted values from the software are recorded, and the graph of the observed rainfall values and the forecast values is shown in Figure 14.
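
A not-a-knot cubic spline used for extrapolation can be sketched in Python with SciPy's `CubicSpline`, which is assumed here to play the same role as the MATLAB routine mentioned above. The helper name and the data are hypothetical; the data are generated from a known cubic so that the extrapolated values can be checked, and are not the rainfall series.

```python
import numpy as np
from scipy.interpolate import CubicSpline


def spline_extrapolate(x_known, y_known, x_new):
    """Fit a not-a-knot cubic spline and evaluate it at x_new.

    With extrapolate=True, points outside the fitted range are evaluated
    with the first or last polynomial piece.
    """
    spline = CubicSpline(x_known, y_known, bc_type="not-a-knot", extrapolate=True)
    return spline(x_new)


def cubic(x):
    """Known cubic used only to generate illustrative data."""
    return 0.5 * x ** 3 - x ** 2 + 2.0


# Hypothetical knots (for example, year indices) and two future points.
x_known = np.arange(0.0, 6.0)
x_future = np.array([6.0, 7.0])
predicted = spline_extrapolate(x_known, cubic(x_known), x_future)
print(predicted, cubic(x_future))
```

Because the extrapolated values come from a single cubic piece, they can grow quickly outside the data range, which is consistent with the large forecast errors reported for some of the spline models.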
Figure 14

Graph of actual and predicted values of annual rainfall in Varanasi using Spline1.


On computing the error measure, the MAPE value was found to be quite large. Hence, instead of taking all 100 values from 1906 to 2006 as input, only 21 values were chosen for the spline calculation. The error measure for this model was even worse than for the previous one, and hence this model was dropped.

Transformed NEM values for 11 years are taken as input and the forecast is done for the same 13 years using MATLAB. This model is taken as Spline2. The graph of the observed values and the predicted values for the second model in spline is shown in Figure 15. The error measures are evaluated for the forecast values and the results are tabulated and shown in Table 8.
Table 8

Error measures for the model spline

Model | MAPE | MAE | RMSE
Spline1 | 10.3612 | 0.741927 | 1.052117
Spline2 | 5.277433 | 0.378281 | 0.508601
Spline3 | 4.245829 | 0.309842 | 0.366111
Figure 15

Graph of actual and predicted values of annual rainfall in Varanasi using Spline2.

To reduce the error further, all the NEM values from 1901 to 1999 are sorted, and from these, 11 values with a gap of 9 between consecutive values are selected. The spline is applied to this set of input values and the forecast for the same 13 years is done through MATLAB. This model is taken as Spline3. The graph of the observed values and the predicted values for the third spline model is shown in Figure 16. The error measures for all three spline models are tabulated in Table 8.
Figure 16

Graph of actual and predicted values of annual rainfall in Varanasi using Spline3.


The data mining technique kNN has a wide range of applications, not only in classification but also in regression. The method is used in various fields, such as medicine, finance (including stock markets), agriculture (including climate forecasting), and text categorization. In this paper, kNN is used to predict the annual rainfall of Varanasi, UP. The value of k was selected by trial and error, and k = 3 was found to be the best choice for the prediction. For the test set, the prediction is done with this value of k, and the validity of the model is checked using all the error measures discussed earlier. The results show that the model is quite good at predicting rainfall. The same value of k is used for the hybrid models.
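
One simple way to use kNN for time series regression is sketched below. The text does not specify how the input features were built, so the lag-window representation, the window length, the Euclidean distance and the function name are all assumptions for illustration; only k = 3 is taken from the paragraph above.

```python
import math


def knn_forecast(history, k=3, window=2):
    """Predict the next value of a series with kNN regression.

    Each past window of `window` consecutive values is a feature vector and
    the value that followed it is its target. The prediction is the mean
    target of the k windows closest to the most recent window.
    """
    query = history[-window:]
    candidates = []
    for start in range(len(history) - window):
        past = history[start:start + window]
        successor = history[start + window]
        candidates.append((math.dist(past, query), successor))
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:k]
    return sum(successor for _, successor in nearest) / len(nearest)


# Made-up periodic series, for illustration only.
history = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2]
print(knn_forecast(history, k=3, window=2))
```

In this toy series the last window (1, 2) has three exact matches in the past, each followed by 3, so the kNN prediction is the mean of those three successors.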

Three popular statistical models are presented: moving average, exponential smoothing with one parameter, and the classical ARIMA model. The three models are compared for the chosen data, and all of them are applied to the transformed data only. Within each model, the best variant was chosen based on the error measures and taken forward for building a hybrid with the computational model kNN. For this data, ARIMA and moving average both gave good forecast results that are almost the same. Figure 17 shows that these two models have the least errors. To further refine the forecast values, these models are merged with the computational model kNN.
Figure 17

Comparison of all methods used for forecasting.


Cubic spline approximation is one of the finest approximations, in that the interpolating function has a very small error compared with the standard interpolating polynomials, namely the Lagrange and Newton polynomials. From Figure 17, it can be concluded that the Spline3 model has the least error of the three spline models, and hence this model is considered for making a hybrid with kNN.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).