Three machine learning techniques, Multiple Linear Regression (MLR), Generalized Additive Models (GAMs), and Random Forest (RF), were used to analyze extreme annual rainfall in six states of North-Eastern (NE) India: Assam, Meghalaya, Tripura, Mizoram, Manipur, and Nagaland. Latitude, longitude, altitude, and temperature were used as covariates. The predictions for each dataset were interpolated by Ordinary Kriging (OK). Performance was assessed with the Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Coefficient of Determination (COD, R2), and Nash–Sutcliffe Efficiency (NSE). All techniques performed significantly better on ground (rain gauge) rainfall data than on satellite rainfall data. GAM gave the best predictions, with RF a close second, while the linearity of MLR prevents it from making precise predictions for a physical phenomenon like rainfall. In most cases, the MAE and RMSE of the GAM predictions are significantly lower, and the COD and NSE significantly better, than those of MLR and RF, indicating that GAM is the best of the three models for predicting rainfall in the study area.

  • The map for RF + OK is much more realistic.

  • The testing (rain gauge data) maps were much better for all the methods than their training counterparts.

  • Temperature inclusion has an effect on latitude, longitude, and elevation at each point as an attribute.

  • The use of splines in the linear model allows GAMs to get around linearity constraints.

  • Comparing the MAE and RMSE of GAM predictions for the various datasets.

Most hydrologic assessments and design concepts for sustainable management of water resource systems require precipitation data as an input. However, estimating data at a random site is one of the most encountered problems in engineering hydrology since the data are either missing or the area is ungauged (Agarwal et al. 2022). Spatial interpolation can be used to estimate missing or inaccessible data in such cases.
Figure 1

Map of the study area.

Figure 2

p-values for randomness and trend for the final 171 points.


Precipitation in NE India is distinctive in that annual totals frequently exceed 1,000 mm (Underwood 2009), and the region holds the record for the highest annual rainfall. The average yearly rainfall in Mawsynram, the wettest place on earth, is currently 11,802.4 mm, about 10 times the average yearly rainfall in India. At the same time, NE India exhibits a wide range of localized rainfall because of large altitude differences within a compact area. Because the Tropic of Cancer runs through Mizoram and Tripura, North-East India has a significant tropical influence, especially in the plains. The region has a monsoon climate, and the warmer months of June–September typically bring extreme to excessively heavy rainfall (Deka et al. 2009). Rainfall in the region's mountainous parts ranges from 2,000 to 3,000 mm, but less than 2,000 mm falls in the Himalayas' rain-shadow zone. Arunachal Pradesh has been left out of this analysis because its precipitation pattern differs from that of the other six of the seven sister states of NE India. The North-East of India has been experiencing an unusual weather pattern: while many areas of Assam, Meghalaya, and other parts of NE India are dangerously flooded each year, some areas remain much drier in comparison and can even experience a lack of rainfall.

Jain et al. (2013) and Harvey (1993) previously studied regional climate change over time by analyzing rainfall and temperature patterns in NE India. Mahanta et al. (2013) examined predictions of heavy rainfall in NE India. According to Choudhury et al. (2019), summer monsoon rainfall dropped significantly (by about 355 mm) over the 36 years from 1979 to 2014, with a significant impact on the local population's ecosystem and way of life. It was not apparent, however, whether the documented decline was a consequence of human behavior or was linked to global natural variability. Malik et al. (2021) used Multiple Linear Regression (MLR), among other methods, for multi-scalar Standardized Precipitation Index (SPI) prediction and obtained quite good results for MLR at a few stations. Hwang et al. (2012) reviewed and compared regression-based statistical methods, including MLR, for the spatial evaluation of precipitation. Ali et al. (2020) used a model incorporating Random Forest (RF) to overcome the non-stationarity challenges faced by rainfall forecasting models and found that it improved rainfall forecasting accuracy, which is crucial for agriculture, water resource management, and early drought/flood warning systems. Dou et al. (2019) evaluated and compared, on a regional scale, the effectiveness of two machine learning models, including RF, for modeling major rainfall-triggered landslides; RF showed significantly greater overall efficiency, even when the training and validation samples were changed. Lemus-Canovas et al. (2019) interpolated average daily precipitation in the Pyrenees using the Generalized Additive Model (GAM) together with several other models, coupling a synoptic-scale classification of weather types to the local scale. GAM was among the methodologies providing an excellent fit. Westra et al. (2013) presented an approach for breaking down rainfall distribution into sub-daily rainfall ‘fragments’ (fine-resolution rainfall sequences) under a future, warmer climate by combining GAM with another method. Using the GAM directly produced results that were very closely aligned.

Studies in NE India to date have mostly concentrated on Assam and Meghalaya, and mostly on analyzing the trend patterns of the region (Hastie 2017). The literature review for the study area also shows that the use of machine learning techniques such as MLR, GAM, and RF for rainfall analysis in this region has not been fully satisfactory. NE India is recognized as one of the country's most flood-prone areas, and rehabilitation costs the region and the country millions every year. Even so, the devastating annual flood remains the region's most dominant difficulty and is still beyond control (Esterby 1996). One reason might be the unavailability of the accurate precipitation data required for proper planning and management of these floods (Ahilan et al. 2019). In this study, we review the application of MLR, GAM, and RF to rainfall prediction and then interpolate the predicted datasets by Ordinary Kriging (OK) to create rainfall maps (Srivastava et al. 2009). OK is a geostatistical interpolation technique commonly used to predict values at unobserved locations within a spatial dataset. It relies on variography to quantify the spatial correlation or variability between data points, and it estimates a value as a weighted linear combination of nearby sample values, with weights determined by the spatial relationships between the points. Statistical criteria are then used to compare the predictions and determine which method is best suited to analyzing rainfall in our study area, shown in Figure 1.

Daily gridded satellite-based rainfall data files for 1980 to 2020, on a 0.25° × 0.25° grid, were acquired from the IMD Pune website. The daily rainfall measurements at each grid point were then extracted from the binary files into spreadsheets spanning the necessary latitudes and longitudes. A total of 246 grid points covered the six northeastern Indian states of Assam, Meghalaya, Tripura, Mizoram, Manipur, and Nagaland. The grid points were spaced at 0.25° increments in both latitude and longitude, over the latitude range of 22° N to 29° N and the longitude range of 90° E to 96° E. This formed the training dataset.

Assam has a tropical monsoon rainforest environment with high humidity and high precipitation. The state receives 70–120 in. of yearly precipitation on average. Meghalaya experiences an annual average rainfall of 1,150 cm, and Mawsynram experiences the highest annual average rainfall of almost 10,000 mm, ranking it the wettest area on the planet. The states of Mizoram and Tripura are traversed by the Tropic of Cancer. Rainfall in Tripura ranges from 1,922 to 2,855 mm annually and flows from the south-west to the north-east (Subash et al. 2011). The wettest months are April, June, and July to September. The average annual rainfall in Mizoram is 2,540 mm, with precipitation being heaviest from May to September and least during the winter. The climate in Manipur is significantly influenced by its topography (Raju & Kumar 2014, 2018). The average annual rainfall in the state is 1,467.5 mm. In Nagaland, the weather is monsoonal (wet-dry). Rainfall is concentrated from May to September, during the southwest monsoon, and varies from 1,800 to 2,500 mm annually.

To test all the prediction models created with the training dataset, a set of testing data of 33 points dispersed across the six states of NE India was taken into consideration (Agarwal et al. 2021b). In contrast to the satellite-based rainfall data of the training dataset, these 33 points were ground annual rainfall data from rain gauge stations (Agarwal et al. 2021a).

A total of 41 datasets, covering 1980 to 2020, were taken into account. The maximum daily rainfall in each year was then calculated at each grid point from these datasets. The mean was 97.16 mm; the highest value, 795.65 mm, was recorded at 26.25° N, 90.25° E in 1984, and the lowest, 1.4 mm, at 22.75° N, 92.75° E in 2009.

Randomness and trend

At each grid point, the results were analyzed for trend and randomness. Once these had been determined, as shown in Figure 2, the grid points showing only randomness and no trend were chosen for further investigation (Ahilan et al. 2015).

The Ljung–Box test (Ljung & Box 1978) was used to check for randomness, and the Mann–Kendall trend test (Mann 1945; Kendall 1975) was used to check for a trend in the time series data. The p-value, or probability value, was used to assess both. The null hypothesis is rejected if the p-value is less than 0.05, meaning that the probability of observing such data under the null hypothesis is less than 5% and therefore very low. For the Ljung–Box test, whose null hypothesis is that the series is random, a p-value below 0.05 therefore indicates that no randomness is present. For the trend test, a p-value below 0.05 indicates that a trend exists (Hirsch et al. 1982), meaning that such a consistent pattern would have developed by chance in fewer than 5% of instances; a p-value of 0.05 or more indicates no trend. In statistical terms, randomness means that the rainfall values should show no discernible tendencies or regularities (Gilbert 1987), which could otherwise bias the forecasts. Points whose rainfall data follow a specific pattern were taken to have a trend and were excluded from the computations, so that the final dataset contains only random series with no observable trend. This matters because data displaying trends are statistically correlated with time, which would bias the predictions and impair their accuracy (Pai et al. 2014). After this procedure, 171 such points were left.
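The screening rule described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the procedure used in the study: the Ljung–Box statistic is evaluated at a single lag (lag 1) so that the chi-square p-value has a closed form, the Mann–Kendall variance ignores ties, and the function names (`ljung_box_p_lag1`, `mann_kendall_p`, `keep_grid_point`) are hypothetical.

```python
import math


def ljung_box_p_lag1(series):
    """p-value of the Ljung-Box Q statistic at lag 1 (assumes a non-constant series)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    # Lag-1 sample autocorrelation.
    r1 = sum(dev[t] * dev[t - 1] for t in range(1, n)) / denom
    q = n * (n + 2) * r1 * r1 / (n - 1)
    # Chi-square survival function with 1 degree of freedom: erfc(sqrt(q / 2)).
    return math.erfc(math.sqrt(q / 2.0))


def mann_kendall_p(series):
    """Two-sided p-value of the Mann-Kendall trend test (no tie correction)."""
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)  # sign of the pairwise difference
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def keep_grid_point(series, alpha=0.05):
    """Keep a point only if it looks random (p >= alpha) and trend-free (p >= alpha)."""
    return ljung_box_p_lag1(series) >= alpha and mann_kendall_p(series) >= alpha
```

Applied to the 41 annual values at each grid point, `keep_grid_point` would return `True` only for the series that pass both tests at the 5% level.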

Determination of elevation and extreme annual rainfall

The 171 remaining sites were then plotted on a Digital Elevation Model (DEM) map of North-East India to determine the corresponding geographic altitudes at these points (Jhajharia et al. 2009). The maximum elevation was found to be 2,826.75 m at 25.75°N, 95°E, the minimum 19.5 m at 24.75°N, 92.75°E, and the median 224.75 m at 25.5°N, 93°E. The elevations for the 33 testing points were also determined from the DEM image.

The average of the yearly maximum rainfall over the 41 years from 1980 to 2020 was then found for each of the 171 grid points.
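As a minimal sketch of this step, assuming the daily values for one grid point are held in a dictionary keyed by year, the yearly maxima and their average could be computed as follows. The function name and the sample numbers are made up for illustration.

```python
def mean_annual_maximum(daily_by_year):
    """Return each year's maximum daily rainfall (mm) and the mean of those maxima."""
    annual_maxima = {year: max(values) for year, values in daily_by_year.items()}
    mean_max = sum(annual_maxima.values()) / len(annual_maxima)
    return annual_maxima, mean_max


# Made-up daily rainfall (mm) for one grid point; a real series has about 365 values per year.
example = {1980: [0.0, 12.5, 3.0], 1981: [7.0, 1.0, 20.5]}
annual_maxima, mean_max = mean_annual_maximum(example)
print(annual_maxima)  # {1980: 12.5, 1981: 20.5}
print(mean_max)       # 16.5
```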

Two other datasets were also prepared for validation of the various models. One combined the 171 training data points (satellite data) with the 33 testing data points (rain gauge data), giving a total of 204 data points. The other was the training dataset of 171 points processed with temperature as an added covariate for rainfall prediction at each point. The temperature used at each point was the average of the yearly maximum and minimum temperature over the 41 years from 1980 to 2020 for each of the grid points.

The datasets were then processed through various trial models of MLR, GAM, and RF for rainfall prediction, and for each method the model with the best output was selected as the final model for prediction and comparison. The predictions from the final MLR, GAM, and RF models were then interpolated by OK to create rainfall maps. A standard error map was also produced for each interpolated map, allowing visual verification of the accuracy of the interpolation. The statistical criteria of MAE, Root Mean Square Error (RMSE), Coefficient of Determination (COD), and Nash–Sutcliffe Efficiency (NSE) were computed for each prediction and interpolation to establish which method is best suited to analyzing rainfall in our study area.

Multiple Linear Regression

The training dataset of 171 points was processed under various trial models of MLR till an optimum model was found for prediction. The model chosen was also applied to the testing data to check for its errors and efficiency.

MLR follows the equation:
$$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p \qquad (1)$$
where $Y$ is the predicted or expected value of the dependent variable, $X_1$ through $X_p$ are $p$ distinct independent (predictor) variables, $b_0$ is the value of $Y$ when all the independent variables are equal to zero, and $b_1$ through $b_p$ are the estimated regression coefficients. Each regression coefficient indicates the change in $Y$ when the corresponding independent variable changes by one unit. For example, $b_1$ is the change in $Y$ for a one-unit change in $X_1$, with all other independent variables held constant.
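To make Equation (1) concrete, the sketch below estimates the coefficients by ordinary least squares through the normal equations, using only the Python standard library. It is an illustration rather than the software used in the study; the helper names are hypothetical, and the three toy predictor columns merely stand in for latitude, longitude, and elevation.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_mlr(rows, y):
    """Return [b0, b1, ..., bp] minimizing the squared error of Equation (1)."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 carries the intercept b0
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict_mlr(coefs, row):
    """Evaluate Y = b0 + b1*X1 + ... + bp*Xp for one row of predictors."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))


# Toy data generated exactly from Y = 5 + 2*X1 - 1*X2 + 0.5*X3 (made-up values).
rows = [(1, 2, 3), (2, 1, 0), (3, 5, 1), (4, 0, 2), (0, 3, 4), (5, 1, 1)]
y = [6.5, 8.0, 6.5, 14.0, 4.0, 14.5]
coefs = fit_mlr(rows, y)
print([round(c, 6) for c in coefs])
```

Because the toy response is an exact linear function of the predictors, the fitted coefficients recover the generating values up to rounding; real rainfall data would of course leave residual error.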

The MLR predictions for both the training and testing data were then interpolated by OK to create the respective predicted maps. Predictions at the 33 testing points were also read from the training map, and the two sets of testing predictions were compared. The same MLR model was also applied to the training + testing dataset of 204 points, and the predictions were interpolated by OK to create a map. Finally, temperature was added as an attribute to the training dataset in the chosen MLR model to obtain another set of predictions, which was likewise interpolated by OK to create a map.

Notably, OK minimizes estimation variance by considering both spatial correlation and data uncertainty, making it a robust tool for estimating values and providing uncertainty assessments in fields such as geosciences, environmental modeling, and resource management (Cressie & Johannesson 2008).
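The weighted linear combination behind OK can be sketched as follows, using only the Python standard library. This is an illustration, not the ArcGIS workflow used in the study: the exponential variogram and its sill and range values are assumptions chosen for the example, and the function names are hypothetical. The weights come from the usual OK system, in which a Lagrange multiplier forces the weights to sum to one.

```python
import math


def exp_variogram(h, sill=1.0, rng=1.0):
    """Assumed exponential variogram with zero nugget (illustrative parameters)."""
    return sill * (1.0 - math.exp(-h / rng))


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ordinary_kriging(points, values, target, variogram=exp_variogram):
    """Return (estimate, weights, standard_error) at the target location."""
    n = len(points)
    # Variogram between every pair of known points, bordered by the unbiasedness row/column.
    a = [[variogram(math.dist(points[i], points[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [variogram(math.dist(p, target)) for p in points] + [1.0]
    solution = solve_linear_system(a, b)
    weights, lagrange = solution[:n], solution[n]
    estimate = sum(w * v for w, v in zip(weights, values))
    variance = sum(w * g for w, g in zip(weights, b[:n])) + lagrange
    return estimate, weights, math.sqrt(max(variance, 0.0))


# Four made-up stations on a unit square; the target is the centre of the square.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
values = [10.0, 20.0, 30.0, 40.0]
estimate, weights, std_err = ordinary_kriging(points, values, (0.5, 0.5))
print(round(estimate, 6), [round(w, 6) for w in weights])
```

In this symmetric layout each station receives the same weight, and the standard error grows as the target moves away from the known points, which is the pattern the standard error maps are expected to show.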

After trying and testing different models with different combinations of input attributes, the model selected included all three attributes of latitude, longitude, and elevation. Temperature could also be added where needed.

Generalized Additive Model

GAMs allow us to model nonlinear data by assuming that the outcome can be described by a sum of arbitrary functions of each variable. A spline, a flexible function that lets us represent a nonlinear relationship for each feature, is used here. In linear regression, the prediction is a linear combination of the variables: each variable is given a weight, β (as in MLR), and the weighted terms are added together. In a GAM, instead of assuming that the target can be determined by a linear combination of variables, we use a sum of nonlinear functions of the variables, each represented by s for spline or ‘smooth function’.

The GAM is given by:
$$Y = \sum_{i=0}^{n} s_i(x_i) \qquad (2)$$
where $s_i$ is the smooth (spline) function for $x_i$, with $i = 0$ to $n$.
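The additive structure of Equation (2) can be illustrated with a small backfitting loop in standard-library Python. This is a deliberately crude sketch and not the spline-based GAM fitted in the study: it assumes an explicit intercept, replaces each spline with a bin-average smoother, and uses hypothetical function names and synthetic data.

```python
import math


def bin_smoother(x, r, n_bins):
    """Smooth r against x by averaging r within equal-width bins of x."""
    lo, hi = min(x), max(x)
    span = hi - lo
    idx = [min(int((xi - lo) / span * n_bins), n_bins - 1) if span > 0 else 0
           for xi in x]
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for i, ri in zip(idx, r):
        sums[i] += ri
        counts[i] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return [means[i] for i in idx]


def backfit_additive_model(columns, y, n_bins=5, n_iter=20):
    """Fit y ~ intercept + sum_j s_j(x_j) by backfitting with bin-average smoothers."""
    n = len(y)
    intercept = sum(y) / n
    effects = [[0.0] * n for _ in columns]
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Partial residual: remove the intercept and every other smooth term.
            partial = [y[i] - intercept
                       - sum(effects[k][i] for k in range(len(columns)) if k != j)
                       for i in range(n)]
            fitted = bin_smoother(col, partial, n_bins)
            mean_f = sum(fitted) / n
            effects[j] = [f - mean_f for f in fitted]  # keep each term centred
    preds = [intercept + sum(e[i] for e in effects) for i in range(n)]
    return intercept, effects, preds


# Synthetic data with a nonlinear effect in each of two predictors (made up).
n = 40
x1 = [i / (n - 1) for i in range(n)]
x2 = [((7 * i) % n) / (n - 1) for i in range(n)]
y = [math.sin(2 * math.pi * a) + b * b for a, b in zip(x1, x2)]
intercept, effects, preds = backfit_additive_model([x1, x2], y)
rss = sum((v - p) ** 2 for v, p in zip(y, preds))
print(round(rss, 4))
```

Each term is fitted to what the other terms leave unexplained, so the model captures a separate nonlinear curve per predictor while the contributions are still simply added together.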

As with MLR, the training dataset (171 points) was processed through various trial GAM models until the most favorable model was found; in some trials, attributes were also entered without spline functions. The optimum number of splines, denoted by k for the smoothing function of each variable, was determined as 15 for elevation and 10 each for latitude and longitude. The GAM fits were run with the default Gaussian distribution. The same model was then applied to the testing data to check its errors and efficiency. The GAM predictions for both the training and testing data were interpolated by OK to create the respective predicted maps. Predictions at the 33 testing points were also read from the training map, and the two sets of testing predictions were compared. The same GAM model was also applied to the training + testing dataset of 204 points, and the predictions were interpolated by OK to create a map. Finally, temperature, with 14 splines, was added to the selected GAM model to obtain another set of predictions for the training dataset of 171 points, which was likewise interpolated by OK into a map.

RF method

In an RF model, each tree is built independently from a random vector sampled from the input data, with the same distribution for all trees in the forest. Because the RF combines multiple trees to predict the output, some individual trees may predict it correctly while others may not, but the combined prediction of all the trees tends to be more accurate. At each node, m variables (mtry) are chosen at random from the input variables, and the best split on these m variables is used to split the node. The value of m remains constant while the forest is grown. Every tree is grown to its full extent, without pruning. Pruning is a technique used in machine learning and search algorithms to reduce the size of decision trees by deleting non-critical and superfluous portions.

n (ntree) is the number of decision trees grown by the algorithm. It is set before the majority vote or prediction average is calculated. A larger number of trees generally improves predictive performance, at a higher computational cost.

After much trial and error, mtry was kept at 2 and ntree at 500 for the final models. As with MLR and GAM, the training dataset (171 points) was processed through an RF model, and the same model was then tested on the testing dataset (33 points) and applied to the training + testing dataset of 204 points. The RF predictions for both the training and testing data were interpolated by OK to create the respective predicted maps. Predictions at the 33 testing points were also read from the training map, and the two sets of testing predictions were compared. The RF predictions for the training + testing dataset of 204 points were also interpolated by OK in ArcGIS to create a map. Finally, temperature was added as an attribute to the selected RF model to obtain another set of predictions for the training dataset of 171 points, which was likewise interpolated by OK into a map.

Statistical criteria

Mean Absolute Error

The error at each point was calculated as the difference between the observed and predicted values. The mean of the absolute values of these errors gave the MAE for each predicted dataset:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \qquad (3)$$
where $n$ is the number of data points, $y_i$ is the actual output value at the ith grid point, and $\hat{y}_i$ is the predicted output value at the ith grid point.

Root Mean Square Error

The errors calculated at each point in the previous step are squared and their mean is found. The square root of this mean gave the RMSE for each predicted dataset:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad (4)$$
where $n$ is the number of data points, $y_i$ is the actual output value at the ith grid point, and $\hat{y}_i$ is the predicted output value at the ith grid point.

COD (R2)

The COD is used to explain how much variability of one factor can be caused by its relationship to another factor. A value of 100% indicates a perfect fit and is thus a highly reliable model for future forecasts, while a value of 0% would indicate that the calculation fails to accurately model the data at all.
formula
(5)
where $y_i$ is the actual output value at the ith grid point, $\hat{y}_i$ is the predicted output value at the ith grid point, and $\bar{y}$ is the mean of the observed values $y_i$.

Nash–Sutcliffe Efficiency

NSE is one minus the ratio of the sum of squared differences between the observed and predicted values to the sum of squared differences between the observed values and their mean. NSE > 75% signifies very good performance, 65% < NSE < 75% is good, 50% < NSE < 65% is satisfactory, while lower than 50% is unsatisfactory.
$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n}\left(Y_i^{obs} - Y_i^{sim}\right)^2}{\sum_{i=1}^{n}\left(Y_i^{obs} - Y^{mean}\right)^2} \qquad (6)$$
where $Y_i^{obs}$ is the actual observed value at the ith grid point, $Y_i^{sim}$ is the predicted (simulated) output value at the ith grid point, and $Y^{mean}$ is the mean of the observed values.
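The error measures of Equations (3), (4), and (6) can be sketched in standard-library Python as below. The function names and the sample values are illustrative, and how a value falling exactly on a class boundary (for example, exactly 0.75) is rated is an assumption, since the text gives strict inequalities only.

```python
import math


def mae(obs, pred):
    """Mean Absolute Error, Equation (3)."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root Mean Square Error, Equation (4)."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def nse(obs, pred):
    """Nash-Sutcliffe Efficiency, Equation (6), as a fraction (1.0 = perfect)."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - p) ** 2 for o, p in zip(obs, pred))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


def nse_rating(value):
    """Performance class for an NSE given as a fraction; boundary handling is assumed."""
    if value > 0.75:
        return "very good"
    if value > 0.65:
        return "good"
    if value > 0.50:
        return "satisfactory"
    return "unsatisfactory"


# Made-up observed and predicted rainfall (mm) at four points.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
print(mae(observed, predicted))              # 2.0
print(round(rmse(observed, predicted), 4))   # 2.1213
print(round(nse(observed, predicted), 3))    # 0.964
print(nse_rating(nse(observed, predicted)))  # very good
```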

MLR + OK

After predictions for the training dataset were obtained from the chosen MLR model, they were interpolated by OK to produce a map, from which predictions at the testing points (yellow points in Figure 3(a)) were made and validated. Predictions for the testing dataset were also made separately by MLR and interpolated by OK to produce a separate map, as were predictions for the combined training + testing dataset. Standard error maps were produced along with the interpolation maps. The training map in Figure 3(a) shows almost parallel contours, with values highest in the west and gradually decreasing toward the east. This is because MLR predictions are linear, so when they are interpolated the contours of the map also vary linearly with distance. Such a prediction map is clearly not accurate, since a natural phenomenon like rainfall cannot vary in such a precise, orderly, linear fashion. Another factor that might have influenced the prediction is the use of satellite data instead of more accurate ground rainfall data, although satellite data, being more convenient and economical, are now more widely used in studies. The testing map (Figure 3(b)) looks more natural than the training map because it is based on rainfall data from rain gauge stations rather than satellite data. The respective standard error maps are as expected, with less error around the points of known values and increasing error away from those points. The standard error map for the testing dataset shows larger errors because the number of points of known values is much lower than for the training dataset. As expected, the interpolation error decreases as the number of known points increases.
Figure 3(c) shows the map interpolated by OK from the MLR predictions for the training (171) + testing (33) data. Compared with the map in Figure 3(a), the training + testing data produce a much more natural map, which forgoes the parallel contouring of Figure 3(a). This is because the dataset is a mix of satellite (training) and rain gauge (testing) data. Figure 3(d) is the map interpolated by OK from the MLR predictions for the training data with temperature as an added parameter. This map shows the influence of temperature on the rainfall predictions by MLR. However, again because only satellite data are used, some parallelism persists. Figure 3(g) shows the standard error of the OK interpolation of the MLR predictions for the training + testing data. As usual, the error is least where the points are densest and increases where the points become sparse. The lowest error ranges from 8 to 15 mm and hence is undesirable. Figure 3(h) is the standard error map for the OK interpolation of the MLR predictions for the training data with temperature. Here too, the error is least where the points are densest and increases where the points become sparse and away from the known points. However, the range of error is much smaller and hence acceptable.
Figure 3

Interpolated map by OK for MLR predictions for (a) training dataset; (b) testing dataset; (c) training + testing dataset; (d) training dataset with temperature parameters; standard error map for interpolated map by OK for MLR predictions for (e) training dataset; (f) for testing dataset; (g) training + testing dataset; and (h) training dataset with temperature parameters.

Figure 4

Interpolated map by OK for GAM predictions for (a) training dataset; (b) testing dataset; (c) training + testing dataset; (d) training dataset with temperature parameters; standard error map for interpolated map by OK for GAM predictions for (e) training dataset; (f) testing dataset; (g) training + testing dataset; and (h) training dataset with temperature parameters.


GAM + OK

As with MLR, after predictions for the training dataset were obtained from the chosen GAM model, they were interpolated by OK to produce a map, from which predictions at the testing points (yellow dots in Figure 4(a)) were made and validated. The testing data were also predicted separately through GAM to validate the model and were interpolated independently. The same was done for the combined dataset of training and testing points. The predictions for the training data with temperature as an added attribute were likewise interpolated into a separate map by OK. Standard error maps were created along with the interpolation maps. Figure 4(a), like Figure 3(a), shows almost parallel contours, with values highest in the west and gradually decreasing toward the east. Unlike the MLR map, however, Figure 4(a) has clearly overcome the linearity of the MLR model, which is one of the main differences between MLR and GAM. The contours are curved and more accurate than the MLR + OK predictions, with the highest rainfall in the west and a gradual decrease toward the east, reaching the lowest values in the eastern regions of Assam and Nagaland. Although not linear, the interpolated map for GAM still has parallel contours like that of MLR because the equations of MLR and GAM are of similar form. Figure 4(b) shows a more realistic interpolated map than Figure 4(a) because it is based on rainfall data from rain gauge stations rather than satellite data. However, because fewer known points (33) are available, the interpolated map is still inaccurate. Figure 4(e) and 4(f) show the standard error maps for the OK interpolation of the training data predictions and the testing data predictions, respectively.
These standard error maps are as expected, with less error around the points of known values and increasing error away from those points. Figure 4(f) shows larger errors than Figure 4(e) because the number of points of known values is much lower than for the training dataset. As expected, the interpolation error decreases as the number of known points increases.

GAMs bypass the limits of linearity, simply by using splines in the linear model. GAMs learn a nonlinear relationship between each predictor variable and the outcome variable automatically, and then add these effects linearly, along with the intercept (Gibbons & Chakraborti 2003). The type of interpolation map in Figure 4(a) clearly isn't accurate since a natural phenomenon like rainfall cannot occur with such precise variations and orderly fashion. Another factor which might have influenced the prediction might be the usage of satellite data instead of more accurate ground rainfall data. However, satellite data being more convenient and economical are used more nowadays for various studies.

Figure 4(c) shows the map interpolated by OK from the GAM predictions for the training (171) + testing (33) data. Compared with the map in Figure 4(a), the training + testing data produce a much more natural map, which forgoes the parallel contouring of Figure 4(a). This is because the dataset is a mix of satellite (training) and rain gauge (testing) data. Figure 4(d) is the map interpolated by OK from the GAM predictions for the training data with temperature as an added parameter. This map shows the influence of temperature on the rainfall predictions by GAM. However, again because only satellite data are used, some parallelism still persists.

Figure 4(g) shows the standard error of the OK interpolation of the GAM predictions for the training + testing data. As usual, the error is least where the points are densest and increases where the points become sparse. The lowest error ranges from 11 to 21 mm and hence is undesirable. Figure 4(h) is the standard error map for the OK interpolation of the GAM predictions for the training data with temperature. Here too, the error is least where the points are densest and increases where the points become sparse and away from the known points. However, the range of error is much smaller and hence acceptable.

RF + OK

The same procedure was followed for RF as for MLR and GAM. After predictions for the training dataset were obtained from the RF model with the best output, they were interpolated by OK to produce a map, from which predictions at the testing points (yellow dots in Figure 5(a)) were made and validated. The RF predictions for the testing data were also interpolated independently, as were those for the combined dataset of training and testing points. The predictions for the training data with temperature as an added attribute were likewise interpolated into a separate map by OK. Standard error maps were created along with the interpolation maps. The prediction map in Figure 5(a) is more realistic than those in Figures 3(a) and 4(a), indicating that when the predictions from MLR, GAM, and RF are interpolated by OK, RF + OK fares much better than GAM + OK, which in turn gave better results than MLR + OK. Figure 5(a) shows the highest rainfall in the lower Meghalaya plateau and the lowest in the eastern region of Assam and Nagaland, more accurately than Figures 3(a) and 4(a). Figure 5(b), however, shows the lowest rainfall in upper mid-Assam, which is inaccurate. This may be because fewer known data points were available for interpolation. Figure 5(e) and 5(f) show the standard error maps for the OK interpolation of the RF predictions for the training data and the testing data, respectively. The standard error maps are as expected, with less error around the points of known values and increasing error away from those points. Figure 5(f) shows larger errors than Figure 5(e) because the number of points of known values is much lower than for the training dataset. As expected, the interpolation error decreases as the number of known points increases.
Figure 5(c) shows an interpolated map by OK for training (171) + testing (33) data predictions by RF. It can be seen as quite like that of Figure 5(a). Figure 5(d) is the interpolated map by OK for training data predictions by RF with temperature parameters. This map shows the influence of temperature on rainfall predictions by RF. This again holds many similarities with Figure 5(a) and 5(c). Figure 5(g) shows the standard error in interpolation of OK for the RF predictions for training + testing data. The error as usual is least where the number of points is denser and the error increases as the crowding of points becomes scarce. The lowest error ranges from 9 to 12 mm and hence is undesirable. Figure 5(h) is the standard error map for interpolation of OK for the MLR predictions for training data predictions with temperature parameters. Here too, the error as usual, the least where the number of points is denser and error increases as the crowding of points becomes scarce and we move away from the known points. However, the range of error is lesser, ranging from around 3 mm, but the max error stands at around 40 mm, which makes it highly unacceptable.
Figure 5

Interpolated map by OK for RF predictions for (a) training dataset; (b) testing dataset; (c) training + testing dataset; (d) training dataset with temperature parameters (Lower Row; left to right) standard error map for interpolated map by OK for RF predictions for (e) training dataset; (f) RF predictions for testing dataset; (g) RF predictions for training + testing dataset; and (h) RF predictions for training dataset with temperature parameters.
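
The behaviour of the standard error maps described above can be illustrated with a minimal ordinary kriging sketch. The exponential variogram, its sill and range, the coordinates, and the rainfall values below are all illustrative assumptions; they are not the variogram or the data used in this study. Under those assumptions, the kriging standard error is zero at a known point and grows with distance from the known points.

```python
import numpy as np

def exp_variogram(h, sill=1.0, range_=50.0):
    """Exponential semivariogram without nugget (parameter values are assumed)."""
    return sill * (1.0 - np.exp(-3.0 * h / range_))

def ordinary_kriging(xy_known, z_known, xy_target, variogram=exp_variogram):
    """Return OK predictions and standard errors at the target points."""
    n = len(xy_known)
    # Pairwise distances between the known points.
    dists = np.linalg.norm(xy_known[:, None, :] - xy_known[None, :, :], axis=-1)
    # Kriging system: variogram matrix bordered by the unbiasedness constraint.
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = variogram(dists)
    A[n, n] = 0.0
    preds, std_errs = [], []
    for p in xy_target:
        b = np.append(variogram(np.linalg.norm(xy_known - p, axis=1)), 1.0)
        sol = np.linalg.solve(A, b)      # kriging weights followed by the Lagrange multiplier
        preds.append(sol[:n] @ z_known)  # weighted average of the known values
        # Kriging variance = sum(w_i * gamma_i0) + multiplier; clip tiny negatives.
        std_errs.append(np.sqrt(max(sol @ b, 0.0)))
    return np.array(preds), np.array(std_errs)

# Hypothetical known points (x, y) and model-predicted rainfall values at them.
xy_known = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0], [5.0, 5.0]])
z_known = np.array([100.0, 120.0, 90.0, 130.0, 110.0])

# Targets: a known point, a point close to it, and a point far from all data.
xy_target = np.array([[5.0, 5.0], [6.0, 5.0], [40.0, 40.0]])
pred, se = ordinary_kriging(xy_known, z_known, xy_target)
print(pred, se)
```

With fewer known points, more of the map lies far from any observation, which is consistent with the larger errors reported for the testing dataset in Figure 5(f).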

MAE and RMSE

Figure 6 shows the calculated Mean Absolute Error (MAE) and RMSE for the different methods. The charts allow comparison and a better visual understanding of the accuracy of the various methods. The lower the MAE and RMSE of a process, the more accurate it is. Figure 6(a) shows that, for training data predictions, interpolating with OK does little to reduce the prediction error of MLR, GAM, or RF. In this comparison, RF and RF + OK show the lowest errors, followed by GAM and GAM + OK, while MLR and MLR + OK show the highest. For every process, however, MAE < RMSE. In Figure 6(b), for the testing data predictions, GAM clearly has the lowest error. All other processes have RMSE in almost the same range, and likewise MAE, with only GAM having a low MAE. Testing data predictions by MLR/GAM/RF are more accurate than those for the training dataset (Figure 6(d)). The testing data were predicted in two ways: directly by the machine learning process and then interpolated by OK, denoted MLR/GAM/RF Testing + OK; and separately from the interpolation of the training data predictions, denoted in Figure 6(b) by MLR/GAM/RF + OK. Here too, as for the training data, MAE < RMSE. For training + testing data (Figure 6(e)), predictions from OK show the lowest MAE and RMSE, followed by GAM and then GAM + OK. MLR + OK shows the highest error for both MAE and RMSE, while RF shows the second-highest RMSE and MLR the second-highest MAE. RF + OK also shows moderately high MAE and RMSE. For training data predictions with temperature as an added attribute (Figure 6(c)), MAE is lowest for RF + OK and RF, followed by GAM and GAM + OK, and highest for MLR and MLR + OK. RF + OK shows the lowest RMSE, followed by GAM, GAM + OK, and RF, which are in the same range. As with MAE, RMSE is highest for MLR and MLR + OK.
Figure 6

Error comparison of (a) training data predictions; (b) testing data predictions; (c) training data predictions with added covariate of temperature; (d) error comparison of machine learning predictions; and (e) error comparison of training + testing predictions.
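
As a minimal sketch of the two error metrics, the snippet below computes MAE and RMSE from their standard definitions. The observed and predicted rainfall values are hypothetical and serve only to show the calculation. MAE can never exceed RMSE, and the two are equal only when every absolute error is the same, which is why MAE < RMSE for every process compared above.

```python
import numpy as np

def mae(observed, predicted):
    """Mean Absolute Error: average magnitude of the errors."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(observed - predicted)))

def rmse(observed, predicted):
    """Root Mean Square Error: penalises large errors more heavily than MAE."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

# Hypothetical annual rainfall values (mm), for illustration only.
observed = [2100.0, 1850.0, 3200.0, 2500.0]
predicted = [2000.0, 1900.0, 2900.0, 2550.0]

mae_value = mae(observed, predicted)    # errors of 100, 50, 300, 50 -> mean 125
rmse_value = rmse(observed, predicted)  # sqrt(26250), dominated by the 300 mm error
print(mae_value, rmse_value)
```

A wide gap between the two values, as in this example, points to a few large errors rather than uniformly moderate ones.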

COD and NSE

Figure 7 shows the calculated COD (R2) and NSE for the different methods. The higher the R2 and NSE of a process, the more efficient it is. In Figure 7(a), GAM + OK shows the highest COD, followed by RF, GAM, and then RF + OK. NSE is highest for RF + OK, followed by RF and then GAM. MLR and MLR + OK have the lowest COD for training data predictions, while GAM + OK has the lowest NSE, followed by MLR and MLR + OK. For testing data predictions (Figure 7(b)), GAM shows an exceptionally good NSE of above 95% and a high COD of 90.4%. None of the other methods performs comparably: even GAM + OK has only 31.4% COD and 0.29% NSE. MLR and RF have moderate COD and NSE, whereas MLR + OK and RF + OK have COD < 30% and negative NSE. Among the direct predictions of MLR, GAM, and RF (Figure 7(d)), GAM has the second-highest COD and NSE for training data, only marginally below the highest, RF. For testing data, however, GAM fares exceptionally well, with COD above 90% and NSE above 95%, while RF has a COD of only 46.5% and an even lower NSE of 34.5%. MLR scores much lower for both training and testing data. The testing data were predicted twice (Figure 7(b)) for each of the three soft computing techniques + OK: first from the interpolation of the training dataset, and second directly from the soft computing models. The direct predictions show higher efficiency than those derived from the training data interpolations. NSE for testing data predictions from the training data interpolations of MLR, GAM, and RF was around 0% in every case, although GAM, at NSE = 0.29%, was still higher than the other two. GAM also had the highest COD among these.
For direct predictions of testing data followed by interpolation with OK, GAM + OK had the highest COD, around 75%, and the highest NSE, around 94%. RF + OK had the second-highest COD and MLR + OK the lowest, while NSE was higher for MLR than for RF. For the training + testing dataset (Figure 7(e)), the COD of GAM is the highest at 80%, followed in decreasing order by GAM + OK, MLR, RF, RF + OK, and MLR + OK. NSE is highest for GAM, followed by GAM + OK; MLR, MLR + OK, RF, and RF + OK have NSE in the range of 40 to around 50%. Surprisingly, for training data predictions with temperature (Sonali & Kumar 2013) as an added attribute (Figure 7(c)), MLR + OK has the highest COD at 61% and NSE at 89%, followed by GAM, GAM + OK, and RF with COD at 59% each and NSE at 63%, with RF + OK close behind. MLR, as expected, had the lowest COD and NSE, both at around 36%.
Figure 7

(a) Efficiency comparison of training data predictions; (b) efficiency comparison of testing data predictions; (c) efficiency comparison of training data predictions with added covariate of temperature (lower row; left to right); (d) efficiency comparison of machine learning predictions; and (e) efficiency comparison of training + testing predictions.
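
The text does not state the formulas behind COD and NSE, so the sketch below rests on an assumption: COD is taken as the squared Pearson correlation between observed and predicted values, and NSE as one minus the ratio of the squared prediction error to the variance of the observations about their mean. The rainfall values are hypothetical. Under that assumption, a prediction that follows the observed pattern but carries a systematic bias keeps a high COD while its NSE turns negative, which is one possible reading of the positive COD and negative NSE reported for MLR + OK and RF + OK on the testing data.

```python
import numpy as np

def cod_r2(observed, predicted):
    """Squared Pearson correlation (assumed definition of COD here)."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    return float(np.corrcoef(observed, predicted)[0, 1] ** 2)

def nse(observed, predicted):
    """Nash-Sutcliffe Efficiency: 1 is perfect, 0 matches the observed mean,
    negative values are worse than simply predicting the observed mean."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    residual = np.sum((observed - predicted) ** 2)
    spread = np.sum((observed - observed.mean()) ** 2)
    return float(1.0 - residual / spread)

# Hypothetical rainfall values (mm): predictions track the pattern but are 2000 mm too high.
observed = np.array([1000.0, 2000.0, 3000.0])
biased = observed + 2000.0

r2_value = cod_r2(observed, biased)  # correlation is unaffected by a constant offset
nse_value = nse(observed, biased)    # 1 - 12e6 / 2e6 = -5
print(r2_value, nse_value)
```

Because NSE penalises bias and the squared correlation does not, reading the two metrics together gives a fuller picture than either one alone.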

It is difficult to make spatially detailed predictions of climatic parameters in areas with complex relief and limited meteorological data. Among the training data (satellite data) maps interpolated by OK for MLR, GAM, and RF, the map for RF + OK is much more realistic. The testing (rain gauge data) maps were much better than their training counterparts for all methods. When temperature was added as an attribute alongside latitude, longitude, and elevation at each point, the maps tended to improve. Maps for training + testing data, i.e., a combination of satellite and gauge data, were also much improved and more realistic than those for training data alone. Although not linear, the interpolated training map for GAM had parallel contours comparable to those of MLR, since their equations are similar. The use of splines in the linear model allows GAMs to get around linearity constraints: GAMs fit a nonlinear relation between each predictor variable and the outcome variable, and then add these effects, together with the intercept, linearly. A map with such parallel contours is clearly inaccurate, because natural phenomena such as rainfall do not vary in so precise and orderly a manner. For prediction purposes, the rainfall values predicted by GAM are much better and more consistent overall than those of MLR or RF. Comparing the MAE and RMSE of the GAM predictions for the various datasets with those of MLR and RF, the errors are much lower in most cases. The COD and NSE of the GAM predictions are also much higher than those of OK, MLR, and RF, making GAM the best of the three models (MLR, GAM, and RF) for predicting rainfall in our study area.
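
The additive structure described above can be sketched as follows. This is an unpenalised regression-spline fit on synthetic data, not the GAM implementation or the data used in this study: the two covariates, the piecewise-linear basis, the knot positions, and the noise level are all assumptions made for illustration, and a full GAM would normally add a smoothness penalty. Each covariate is given its own spline basis, and the resulting smooth effects are summed with an intercept.

```python
import numpy as np

def hinge_basis(x, knots):
    """Piecewise-linear spline basis: x itself plus max(x - k, 0) for each knot."""
    return np.column_stack([x] + [np.maximum(x - k, 0.0) for k in knots])

def fit_and_predict(design, y):
    """Least-squares fit with an intercept; returns the fitted values."""
    X = np.column_stack([np.ones(len(y)), design])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef

# Synthetic stand-ins for two covariates and a nonlinear additive response.
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0.0, 1.0, n)
x2 = rng.uniform(0.0, 1.0, n)
y = np.sin(2.0 * np.pi * x1) + 4.0 * (x2 - 0.5) ** 2 + rng.normal(0.0, 0.1, n)

knots = np.linspace(0.1, 0.9, 9)  # assumed knot positions

# Linear model: y ~ b0 + b1*x1 + b2*x2
linear_fit = fit_and_predict(np.column_stack([x1, x2]), y)
# Additive model: y ~ b0 + f1(x1) + f2(x2), each f built from its own spline basis
additive_fit = fit_and_predict(
    np.column_stack([hinge_basis(x1, knots), hinge_basis(x2, knots)]), y
)

rmse_linear = float(np.sqrt(np.mean((y - linear_fit) ** 2)))
rmse_additive = float(np.sqrt(np.mean((y - additive_fit) ** 2)))
print(rmse_linear, rmse_additive)
```

Because the spline basis contains the plain linear term, the additive fit can never do worse than the linear one on the fitting data; the gain comes from the extra flexibility of each smooth term, while the effects are still combined by simple addition.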

Overall, the findings suggest that GAMs are the most suitable approach for predicting rainfall in areas with complex terrain and limited meteorological data, offering better accuracy and performance than Multiple Linear Regression (MLR), RF, and OK interpolation.

Data sources: The training data are satellite rainfall data and the testing data are rain gauge data. Predictions from MLR, GAM, and RF for both were interpolated by OK.

Performance improvements: Maps created using RF combined with OK interpolation are more realistic compared to other methods. Moreover, the testing maps are better than their training counterparts.

Attributes: Adding temperature to latitude, longitude, and elevation as an attribute at each point tends to improve the maps.

Training + testing data: Combining both satellite and gauge data for training and testing leads to significantly improved and more realistic maps compared to using only training data.

GAM vs. MLR vs. RF: GAMs outperform MLR and RF in predicting rainfall. GAMs fit a nonlinear relationship between each predictor variable and the outcome variable, allowing them to capture complex patterns more effectively.

Interpolation style: The use of splines lets GAMs avoid the linearity constraints of MLR, although the interpolated training map for GAM still showed parallel contours comparable to those of MLR.

Performance metrics: MAE, RMSE, COD, and NSE show that, in most cases, GAM is the best model for predicting rainfall in the study area when compared with MLR, RF, and OK.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Agarwal S., Roy P. J., Choudhury P. S. & Debbarma N. 2021a River flow forecasting by comparative analysis of multiple input and multiple output models form using ANN. H2Open Journal 4 (1), 413–428. https://doi.org/10.2166/h2oj.2021.122.
Agarwal S., Roy P. J., Choudhury P. & Debbarma N. 2021b Flood forecasting and flood flow modeling in a river system using ANN. Water Practice and Technology 16 (4), 1194–1205. https://doi.org/10.2166/wpt.2021.068.
Agarwal S., Roy P., Choudhury P. & Debbarma N. 2022 Comparative study on stream flow prediction using the GMNN and wavelet-based GMNN. Journal of Water and Climate Change 13 (9), 3323–3337. https://doi.org/10.2166/wcc.2022.226.
Ahilan S., Guan M., Wright N. & Sleigh A. 2015 Natural flood risk management in urban rivers. In: National Hydrology Conference 2015, Leeds.
Ahilan S., Webber J., Melville-Shreeve P. & Butler D. 2019 Building Urban Flood Resilience with Rainwater Management.
Choudhury B. A., Saha S. K., Konwar M., Sujith K. & Deshamukhya A. 2019 Rapid drying of Northeast India in the last three decades: Climate change or natural variability? Journal of Geophysical Research: Atmospheres 124 (1), 227–237.
Cressie N. & Johannesson G. 2008 Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 70, 209–226. https://doi.org/10.1111/j.1467-9868.2007.00633.x.
Deka S., Borah M. & Kakaty S. C. 2009 Distributions of annual maximum rainfall series of north-east India. European Water 27 (28), 3–14.
Dou J., Yunus A. P., Bui D. T., Merghadi A., Sahana M., Zhu Z., Chen C. W., Khosravi K., Yang Y. & Pham B. T. 2019 Assessment of advanced random forest and decision tree algorithms for modeling rainfall-induced landslide susceptibility in the Izu-Oshima Volcanic Island, Japan. Science of the Total Environment 662, 332–346.
Gibbons J. D. & Chakraborti S. 2003 Nonparametric Statistical Inference. Marcel Dekker, Inc, NY.
Gilbert R. O. 1987 Statistical Methods for Environmental Pollution Monitoring. John Wiley & Sons, NY.
Harvey A. C. 1993 Time Series Models, 2nd edn. Harvester Wheatsheaf, NY, p. 44, 45.
Hastie T. J. 2017 Generalized additive models. In: Statistical Models in S (Hastie, T.J., ed.). Routledge, New York, pp. 249–307.
Hirsch R. M., Slack J. R. & Smith R. A. 1982 Techniques of trend analysis for monthly water quality data. Water Resources Research 18 (1), 107–121.
Hwang Y., Clark M., Rajagopalan B. & Leavesley G. 2012 Spatial interpolation schemes of daily precipitation for hydrologic modeling. Stochastic Environmental Research and Risk Assessment 26 (2), 295–320.
Jain S. K., Kumar V. & Saharia M. 2013 Analysis of rainfall and temperature trends in northeast India. International Journal of Climatology 33 (4), 968–978.
Jhajharia D., Shrivastava S. K., Sarkar D. S. A. S. & Sarkar S. 2009 Temporal characteristics of pan evaporation trends under the humid conditions of northeast India. Agricultural and Forest Meteorology 149 (5), 763–770.
Kendall M. 1975 Rank Correlation Methods, Vol. 8, 4th edn. Charles Griffin, San Francisco, CA, p. 875.
Lemus-Canovas M., Lopez-Bustins J. A., Trapero L. & Martin-Vide J. 2019 Combining circulation weather types and daily precipitation modeling to derive climatic precipitation regions in the Pyrenees. Atmospheric Research 220, 181–193.
Ljung G. M. & Box G. E. 1978 On a measure of lack of fit in time series models. Biometrika 65 (2), 297–303.
Mahanta R., Sarma D. & Choudhury A. 2013 Heavy rainfall occurrences in northeast India. International Journal of Climatology 33 (6), 1456–1469.
Mann H. B. 1945 Nonparametric tests against trend. Econometrica: Journal of the Econometric Society 13 (3), 245–259.
Raju K. S. & Kumar D. N. 2014 Ranking of global climate models for India using multicriterion analysis. Climate Research 60 (2), 103–117.
Raju K. S. & Kumar D. N. 2018 Impact of Climate Change on Water Resources. Springer, Singapore.
Sonali P. & Nagesh Kumar D. 2013 Review of trend detection methods and their application to detect temperature changes in India. Journal of Hydrology 476, 212–227. https://doi.org/10.1016/j.jhydrol.2012.10.034.
Srivastava A. K., Rajeevan M. & Kshirsagar S. R. 2009 Development of a high resolution daily gridded temperature data set (1969–2005) for the Indian region. Atmospheric Science Letters 10 (4), 249–254.
Underwood F. M. 2009 Describing long-term trends in precipitation using generalized additive models. Journal of Hydrology 364 (3–4), 285–297.
Westra S., Evans J. P., Mehrotra R. & Sharma A. 2013 A conditional disaggregation algorithm for generating fine time-scale rainfall data in a warmer climate. Journal of Hydrology 479, 86–99.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).