Abstract
The machine learning techniques of Multiple Linear Regression (MLR), Generalized Additive Models (GAMs), and the Random Forest (RF) method have been used to analyze extreme annual rainfall in the six states of Assam, Meghalaya, Tripura, Mizoram, Manipur, and Nagaland in North-Eastern (NE) India. Latitude, longitude, altitude, and temperature were the covariates used in this study. Ordinary Kriging was used to interpolate the predicted outcomes for each dataset. Statistical metrics, namely the Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Coefficient of Determination (COD, R2), and Nash–Sutcliffe Efficiency (NSE), were also assessed. All techniques performed significantly better on the ground (rain gauge) rainfall data than on the satellite rainfall data. For prediction, the rainfall values predicted by GAM outperform those of MLR and RF. RF ranks a close second, while the linearity of MLR prevents it from making precise predictions of a physical phenomenon like rainfall. The MAE and RMSE of the GAM forecasts are significantly lower than those of MLR and RF in most circumstances. Additionally, the COD and NSE of the GAM predictions are significantly better than those of both MLR and RF in most cases, showing that GAM, out of MLR, GAM, and RF, is the best model for predicting rainfall in our study area.
HIGHLIGHTS
The map for RF + OK is much more realistic.
The testing (rain gauge data) maps were much better for all the methods than their training counterparts.
Including temperature as an attribute, along with latitude, longitude, and elevation at each point, improves the maps.
The use of splines in the linear model allows GAMs to get around linearity constraints.
The MAE and RMSE of the GAM predictions are lower than those of MLR and RF for most of the datasets.
INTRODUCTION
NE India's precipitation is unique in that it frequently exceeds 1,000 mm (Underwood 2009) and includes the highest annual rainfall totals ever recorded. The average yearly rainfall in Mawsynram, the wettest place on earth, is currently 11,802.4 mm, about 10 times the average yearly rainfall of India as a whole. At the same time, NE India exhibits a wide range of localized rainfall owing to large altitude differences within a compact spatial area. With the Tropic of Cancer running through the territories of Mizoram and Tripura, North-East India experiences a strong tropical influence, especially in the plains. The region has a monsoon climate, and the warmer months of June–September typically bring extreme to excessively heavy rainfall (Deka et al. 2009). Rainfall in the region's mountainous parts ranges from 2,000 to 3,000 mm, but falls below 2,000 mm in the Himalayan rain-shadow zone. Since its precipitation pattern differs from that of the other six of the seven sister states of NE India, Arunachal Pradesh has been left out of this analysis. The North-East of India has been experiencing an unusual weather pattern: while many areas of Assam, Meghalaya, and other portions of NE India are dangerously flooded each year, some areas remain much drier in comparison and can even exhibit a lack of rainfall.
Jain et al. (2013) and Harvey (1993) have previously studied regional climate change through time by analyzing the rainfall and temperature patterns in NE India. Mahanta et al. (2013) also examined predictions of heavy rainfall in NE India. According to research by Choudhury et al. (2019), there has been a significant drop in summer monsoon rainfall (about 355 mm) during the last 36 years (1979–2014), which has had a significant impact on the local population's ecosystem and way of life; it was not apparent, however, whether the documented decline was a consequence of human behavior or was linked to natural global variability. Malik et al. (2021) used Multiple Linear Regression (MLR), among other methods, for multi-scalar Standardized Precipitation Index (SPI) prediction and obtained quite good results for MLR at a few stations. Hwang et al. (2012) reviewed and compared the output of regression-based statistical methods, including MLR, for the spatial evaluation of precipitation. Ali et al. (2020) incorporated Random Forest (RF) to overcome the non-stationarity challenges faced by rainfall forecasting models and found that the resulting model improved rainfall forecasting accuracy, which is crucial for agriculture, water resource management, and early drought/flood warning systems. Dou et al. (2019) evaluated and compared, on a regional scale, the effectiveness of two machine learning models, including RF, for modeling major rainfall-triggered landslides; RF showed significantly greater overall efficiency, even when the training and validation samples were changed. Lemus-Canovas et al. (2019) coupled the synoptic scale to the local scale over the Pyrenees to interpolate average daily precipitation based on a synoptic-scale classification of weather types, using the Generalized Additive Model (GAM) together with several other models; GAM was among the methodologies providing an excellent fit. Westra et al. (2013) presented an approach for breaking rainfall down into sub-daily 'fragments' (fine-resolution rainfall sequences) under a future, warmer climate by combining GAM with another method; utilizing the GAM directly produced very closely aligned results.
Studies in NE India to date have mostly been concentrated in the regions of Assam and Meghalaya, largely to analyze the trend patterns of the region (Hastie 2017). Moreover, as per the literature review of the study area, the application of machine learning techniques like MLR, GAM, and RF to rainfall analysis of this region has not been found fully satisfactory. NE India has been recognized as one of the country's most flood-prone areas, setting the region and the country back by millions in rehabilitation funds every year. Even so, the devastating annual flood in this region remains the dominant difficulty beyond control (Esterby 1996). One of the reasons might be the unavailability of the accurate precipitation data required for proper planning and management of these floods (Ahilan et al. 2019). In this study, we review the applications of the machine learning techniques of MLR, GAM, and RF for the prediction of rainfall and then further interpolate the various predicted datasets by Ordinary Kriging (OK) to create rainfall maps (Srivastava et al. 2009). OK is a geostatistical interpolation technique commonly used to predict values at unobserved locations within a spatial dataset. The method relies on variography principles to quantify the spatial correlation, or variability, between data points, and involves creating a weighted linear combination of nearby sample values, where the weights are determined by the spatial relationships between the points. Statistical criteria are then used to compare the various predictions and determine which method is best suited to analyzing the rainfall in our study area, as represented in Figure 1.
STUDY AREA AND DATA
Daily gridded satellite-based rainfall data files on a 0.25° × 0.25° grid for 1980 to 2020 were acquired from the India Meteorological Department (IMD), Pune website. The daily rainfall measurements at each grid point were then extracted from the binary files into spreadsheets spanning the necessary latitudes and longitudes. A total of 246 grid points covered the six north-eastern Indian states of Assam, Meghalaya, Tripura, Mizoram, Manipur, and Nagaland. The grid points were taken at increments of 0.25° in both the vertical and horizontal directions, over the latitude range of 22°N to 29°N and the longitude range of 90°E to 96°E. This was to be the training dataset.
Assam has a tropical monsoon rainforest climate with high humidity and heavy precipitation. The state receives 70–120 in. (roughly 1,800–3,000 mm) of precipitation a year on average. Meghalaya experiences an annual average rainfall of 1,150 cm, and Mawsynram experiences the highest annual average rainfall of almost 10,000 mm, ranking it the wettest area on the planet. The states of Mizoram and Tripura are traversed by the Tropic of Cancer. Rainfall in Tripura ranges from 1,922 to 2,855 mm annually, varying from the south-west to the north-east (Subash et al. 2011); the wettest months are April, June, and July to September. The average annual rainfall in Mizoram is 2,540 mm, with precipitation heaviest from May to September and lightest during the winter. The climate in Manipur is significantly influenced by its topography (Raju & Kumar 2014, 2018); the average annual rainfall in the state is 1,467.5 mm. In Nagaland, the climate is monsoonal (wet-dry), with rainfall concentrated from May to September during the southwest monsoon and varying from 1,800 to 2,500 mm annually.
To test all the prediction models created with the training dataset, a set of testing data of 33 points dispersed across the six states of NE India was taken into consideration (Agarwal et al. 2021b). In contrast to the satellite-based rainfall data of the training dataset, these 33 points were ground annual rainfall data from rain gauge stations (Agarwal et al. 2021a).
From 1980 to 2020, a total of 41 yearly datasets were taken into account, from which the maximum daily rainfall at each grid point in each year was calculated. The mean of these annual maxima was 97.16 mm; the highest value, 795.65 mm, was recorded at 26.25°N, 90.25°E in 1984, while the lowest, 1.4 mm, was recorded at 22.75°N, 92.75°E in 2009.
METHODOLOGY
Randomness and trend
At each grid point, the results were analyzed for trend and randomness. Once the randomness and trends had been determined, as shown in Figure 2, the grid points with only randomness present and no trend were chosen for further investigation (Ahilan et al. 2015).
(Ljung & box 1978) The Ljung-Box test was used to check whether randomization was present and the existence of a trend in the time series data was determined using the (Mann 1945) Mann–Kendall Pattern Test (Kendall 1975). The p-value, or probability value, was used to assess the existence of randomness and trend. The null hypothesis is dismissed if the p-value is less than 0.05, meaning that the likelihood of the event is less than 5% and therefore extremely low. Therefore, the p-value for randomness must be less than 0.05 for there to be no randomness present. p-value is to be less than 0.05 for the trend test, to necessarily conclude that a trend exists (Hirsch et al. 1982), meaning that only 5% of the instances could such a consistent pattern have developed by chance. As a result, the p-value must be more than or equal to 0.05 for there to be no trend. In terms of statistics, randomness means that there should be no discernible tendencies or consistencies in the values of rainfall (Gilbert 1987), as this could compromise the accuracy of our forecasts and cause them to be prejudiced or affected. If the rainfall data do follow a specific pattern, we can infer a trend from their frequency and exclude them from our computations. It is crucial that the final dataset just contains randomness and lacks any observable trends. This is because data displaying trends will affect our forecasts due to their co-dependency or statistically significant correlation with time, making them biased and impairing their prediction and accuracy (Pai et al. 2014). After this procedure, 171 such points were left.
Determination of elevation and extreme annual rainfall
The 171 remaining sites were then plotted on a Digital Elevation Model (DEM) map of North-East India to ascertain the corresponding geographic altitudes at these points (Jhajharia et al. 2009). The maximum elevation was found to be 2,826.75 m at 25.75°N, 95°E, the minimum 19.5 m at 24.75°N, 92.75°E, and the median 224.75 m at 25.5°N, 93°E. The elevations of the 33 testing points were also determined from the DEM image.
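The elevation lookup can be scripted, as in the sketch below, which samples a DEM raster at the retained grid points. The file name 'ne_india_dem.tif' and the use of rasterio are illustrative assumptions; the study only states that elevations were read from a DEM image of North-East India.

```python
# Sketch: sample elevations (first raster band) at given grid points.
# 'ne_india_dem.tif' is a hypothetical file name; rasterio is an assumed tool.
import rasterio

def elevations_at(points, dem_path="ne_india_dem.tif"):
    """points: iterable of (longitude, latitude) pairs in the DEM's CRS.
    Returns a list of elevations in metres."""
    with rasterio.open(dem_path) as dem:
        return [float(val[0]) for val in dem.sample(points)]

# e.g. elevations_at([(95.0, 25.75), (92.75, 24.75)])
```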
The average of the yearly maximum rainfall for 41 years from 1981 to 2020 was then found for each of the 171 grid points.
Two other datasets were also prepared for validation of the various models. One was a combination of the 171 training data points (satellite data) and the 33 testing data points (rain gauge data), giving a total of 204 data points. The other was the training dataset of 171 points processed separately with an added covariate of temperature for rainfall prediction at each point. The temperature used at each point was the average of the yearly maximum and minimum temperatures over the 41 years from 1981 to 2020.
The datasets were then processed through various trial models of MLR, GAM, and RF for the prediction of rainfall, and the model with the optimum output for each method was selected as the final model for prediction and comparison under that technique. The predicted datasets from the final MLR, GAM, and RF models were then interpolated by OK to create rainfall maps. Standard error maps for each of these interpolated maps were also produced, which can be used for visual verification of the accuracy of the interpolations. The statistical criteria of Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Coefficient of Determination (COD), and Nash–Sutcliffe Efficiency (NSE) were also computed for each prediction and interpolation, to compare the methods and establish which is most suited to analyzing the rainfall in our study area.
Multiple Linear Regression
The training dataset of 171 points was processed under various trial models of MLR until an optimum model was found for prediction. The chosen model was also applied to the testing data to check its errors and efficiency.

The predictions through MLR of both the training and testing data were then put through OK for further interpolation and the creation of the respective predicted maps. Through the training map, predictions at the 33 testing data points were also made, and the two sets of testing predictions were compared. The same MLR model was also applied to the training + testing dataset of 204 points, and the predictions were run through OK to create a map. Finally, the attribute of temperature was added to the chosen MLR model for the training dataset, another set of predictions was obtained, and the same process of interpolation by OK was repeated to create a map.
Notably, OK minimizes estimation variance by considering both spatial correlation and data uncertainty, making it a robust tool for estimating values and providing uncertainty assessments in fields such as geosciences, environmental modeling, and resource management (Cressie & Johannesson 2008).
After trying and testing different models with different combinations of input attributes, the model selected included all three attributes of latitude, longitude, and elevation. Temperature could also be added where needed.
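As a minimal sketch of this step, the fragment below fits an MLR on latitude, longitude, and elevation and then kriges its predictions onto a regular grid. The column names, the spherical variogram, and the use of scikit-learn and pykrige are assumptions made for illustration; in the study itself the kriging was carried out in ArcGIS.

```python
# Sketch: MLR on (lat, lon, elev) followed by OK of its predictions.
# Column names, scikit-learn, pykrige and the variogram model are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from pykrige.ok import OrdinaryKriging

def mlr_then_ok(train_df, grid_lon, grid_lat):
    """train_df: DataFrame with assumed columns 'lat', 'lon', 'elev', 'rain'.
    grid_lon, grid_lat: 1-D arrays defining the output map grid."""
    X = train_df[["lat", "lon", "elev"]].to_numpy()
    y = train_df["rain"].to_numpy()                 # mean annual maximum rainfall (mm)
    mlr = LinearRegression().fit(X, y)
    pred = mlr.predict(X)                           # MLR predictions at the 171 points
    # Ordinary Kriging of the MLR predictions onto the map grid
    ok = OrdinaryKriging(train_df["lon"].to_numpy(), train_df["lat"].to_numpy(),
                         pred, variogram_model="spherical")
    z, var = ok.execute("grid", grid_lon, grid_lat)
    return mlr, z, np.sqrt(var)                     # kriged map and its standard error
```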
Generalized Additive Model
GAMs allow us to model nonlinear data by assuming that the outcome can be described by a sum of arbitrary smooth functions of each variable. Splines are used for this purpose: flexible smooth functions that let us represent a nonlinear effect for each feature. In linear regression, the prediction is defined as a linear combination of the variables: each variable is given a weight, β (as we saw for MLR), and the weighted terms are added together. In a GAM, instead of assuming that the outcome can be determined by a linear combination of the variables, each variable enters through a nonlinear smooth function, denoted s for spline or 'smooth function.'
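In equation form (generic notation, not reproduced from the paper), the two predictors can be written as

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \quad \text{(MLR)},$$

$$\hat{y} = \beta_0 + s_1(x_1) + s_2(x_2) + \cdots + s_p(x_p) \quad \text{(GAM)},$$

where each $s_j(\cdot)$ is a smooth spline function estimated from the data, so the effect of each covariate (latitude, longitude, elevation, and, where used, temperature) is no longer constrained to be a straight line.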
As in MLR, the training dataset (171 points) was processed through various trial models of GAM until the most favorable model was found; in some of the trials, attributes were also checked for results without spline functions. The optimum number of splines, denoted by k (the size of the smooth function for a particular variable), was determined to be 15 for elevation and 10 each for latitude and longitude. The same model was then applied to the testing data to check its errors and efficiency. The GAM fits were executed with the default Gaussian distribution. The predictions through GAM of both the training and testing data were then processed through OK for further interpolation and the creation of the respective predicted maps. Through the training map, predictions at the 33 testing data points were also made, and the two sets of testing predictions were compared. The same GAM model was also applied to the training + testing dataset of 204 points, and the predictions were run through OK to create a map. Finally, the attribute of temperature, with the number of splines set to 14, was added to the selected GAM model, another set of predictions was obtained for the training dataset of 171 points, and this set of predictions was further run through OK to produce an interpolated map.
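A minimal sketch of such a GAM, using the spline sizes quoted above, is shown below. The pygam package is an assumed stand-in (the paper does not name its software), its n_splines argument is taken here as the analogue of the reported k, and the Gaussian family is pygam's default, matching the note above.

```python
# Sketch: the selected GAM with the reported spline sizes
# (k = 10 for latitude and longitude, 15 for elevation, 14 for temperature
# when included). pygam is an assumed stand-in for the study's software.
from pygam import LinearGAM, s

def fit_gam(X, y, with_temperature=False):
    """X columns assumed ordered as [lat, lon, elev] (+ temp if present)."""
    terms = s(0, n_splines=10) + s(1, n_splines=10) + s(2, n_splines=15)
    if with_temperature:
        terms = terms + s(3, n_splines=14)
    return LinearGAM(terms).fit(X, y)   # Gaussian family by default
```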
RF method
In an RF model, each tree is built individually, based on a random vector sampled from the input data with the same distribution across the forest. Since the RF combines multiple trees to predict for the dataset, some decision trees may predict the correct output while others may not, but together all the trees predict the correct output. A number m (mtry) of variables is chosen at random from the total input variables at each node, and the best split on these m variables is used to split the node. The value of m remains constant while the forest is grown. Every tree is grown to its full depth; no pruning is applied. Pruning is a data compression technique used in machine learning and search algorithms to reduce the size of decision trees by deleting non-critical and superfluous portions.
n (ntree) is the number of decision trees grown in the algorithm. It is specified before the majority voting or prediction averaging is carried out. A larger number of trees generally improves performance.
After much trial and error, mtry was kept at 2 and ntree at 500 for the final models. As for MLR and GAM, the training dataset (171 points) was processed through the RF model, and the same model was then tested on the testing dataset (33 points). The predictions through RF of both the training and testing data were then processed through OK for further interpolation and the creation of the respective predicted maps. Through the training map, predictions at the 33 testing data points were also made, and the two sets of testing predictions were compared. The same RF model was also applied to the training + testing dataset of 204 points, and its predictions were run through OK in ArcGIS to create a map. Temperature was also added as an attribute to the selected RF model, another set of predictions was obtained for the training dataset of 171 points, and this set of predictions was further run through OK to produce an interpolated map.
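The configuration described above can be sketched as follows. scikit-learn is an assumed stand-in, with n_estimators and max_features playing the roles of ntree and mtry (the mtry/ntree terminology itself suggests R's randomForest package, which the paper does not confirm).

```python
# Sketch: the reported RF configuration (mtry = 2, ntree = 500) expressed
# with scikit-learn's analogous parameters. Trees are grown fully
# (no pruning), which is also scikit-learn's default behaviour.
from sklearn.ensemble import RandomForestRegressor

def fit_rf(X, y):
    rf = RandomForestRegressor(n_estimators=500,   # ntree
                               max_features=2,     # mtry: predictors tried per split
                               random_state=0)
    return rf.fit(X, y)
```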
Statistical criteria
Mean Absolute Error
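The paper's own equation is not reproduced above; the standard definition, with $O_i$ the observed and $P_i$ the predicted rainfall at the $i$-th of $n$ validation points (notation assumed here), is

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|O_i - P_i\right|.$$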
Root Mean Square Error
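With the same assumed notation, the standard definition is

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(O_i - P_i\right)^2}.$$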
COD (R2)
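One common definition of the coefficient of determination, assumed here since the paper's exact formula is not shown above, is the squared Pearson correlation between observed and predicted values,

$$R^2 = \left(\frac{\sum_{i=1}^{n}\left(O_i-\bar{O}\right)\left(P_i-\bar{P}\right)}{\sqrt{\sum_{i=1}^{n}\left(O_i-\bar{O}\right)^2}\;\sqrt{\sum_{i=1}^{n}\left(P_i-\bar{P}\right)^2}}\right)^{2},$$

where $\bar{O}$ and $\bar{P}$ are the means of the observed and predicted values; values closer to 1 indicate better agreement.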
Nash–Sutcliffe Efficiency
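The Nash–Sutcliffe Efficiency is standardly defined as

$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n}\left(O_i - P_i\right)^2}{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^2},$$

with NSE = 1 indicating a perfect match between predictions and observations, and NSE ≤ 0 indicating that the model predicts no better than the mean of the observations.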
RESULTS AND DISCUSSION
MLR + OK
GAM + OK
As for MLR, after obtaining predictions for the training dataset using the chosen GAM model with the best output, the predictions were interpolated using OK to produce an interpolation map, and predictions for the testing dataset (yellow dots in Figure 4(a)) were then made and the results validated. The testing data were also predicted through GAM separately for validation of the model and then interpolated independently, and the same was done for the combined dataset of training and testing points. The predictions for the training data with temperature as an added attribute were also interpolated individually into a map using OK. Along with the interpolation maps, standard error maps were also created. Figure 4(a), like Figure 3(a), shows almost parallel contours of values, highest in the west and gradually decreasing toward the east. Unlike the MLR map, however, Figure 4(a) has clearly overcome the problems caused by the linearity of the MLR model, which is one of the main differences between MLR and GAM: the contours are curved and more accurate than the MLR + OK predictions, with the highest rainfall in the west decreasing gradually to the lowest in the eastern regions of Assam and Nagaland. The interpolated map for GAM, although not linear, still possesses parallel contours like that of MLR, because the equations of MLR and GAM are of similar form. Figure 4(b) shows an interpolated map of a more realistic nature than that in Figure 4(a), because the testing data are rainfall data from rain gauge stations rather than satellite data. However, with fewer known points available (33), the interpolated map is still imprecise. Figure 4(e) and 4(f) show the standard error maps of the OK interpolation for the training data predictions and testing data predictions, respectively. These standard error maps appear as expected, with less error around the points of known values and increasing error as we move away from those points. Figure 4(f) shows more, and higher-valued, error than Figure 4(e), since the number of points with known values is much lower than for the training dataset. Hence, as expected, the interpolation error decreases as the number of known points increases.
GAMs bypass the limits of linearity simply by using splines in the linear model: they automatically learn a nonlinear relationship between each predictor variable and the outcome variable and then add these effects linearly, along with the intercept (Gibbons & Chakraborti 2003). The type of interpolation map in Figure 4(a) is clearly not accurate, since a natural phenomenon like rainfall cannot occur with such precise and orderly variations. Another factor that might have influenced the prediction is the use of satellite data instead of more accurate ground rainfall data; however, being more convenient and economical, satellite data are used more often nowadays in various studies.
Figure 4(c) shows an interpolated map by OK for training (171) + testing (33) data predictions by GAM. Comparing this with the map of Figure 4(a), it can be seen that training + testing data produces a much more natural map, which foregoes the parallel contouring of Figure 4(a). This is because it is a mix of satellite (training) + rain gauge (testing) data. Figure 4(d) is the interpolated map by OK for training data predictions by GAM with temperature parameters. This map shows the influence of temperature on rainfall predictions by GAM. However, again due to the usage of satellite data only, some parallelism still persists.
Figure 4(g) shows the standard error of the OK interpolation of the GAM predictions for the training + testing data. As usual, the error is least where the points are densest and increases as the points become sparser. Even the lowest error ranges from 11 to 21 mm and is hence undesirable. Figure 4(h) is the standard error map of the OK interpolation of the GAM predictions for the training data with temperature parameters. Here too, the error is least where the points are densest and increases as the points thin out and we move away from the known points; however, the range of error is much smaller and hence acceptable.
RF + OK
MAE and RMSE
COD and NSE
CONCLUSION
It is difficult to make spatially detailed predictions of climatic parameters in areas with complex relief and limited meteorological data. Among the training data (satellite data) maps interpolated by OK for MLR, GAM, and RF, the map for RF + OK is much more realistic. The testing (rain gauge data) maps were much better for all the methods than their training counterparts. When temperature was added as an attribute along with latitude, longitude, and elevation at each point, the maps tended to improve. Maps for the training + testing data, i.e., a combination of satellite and gauge data, were also much improved and more realistic than those for the training data alone. Although not linear, the interpolated training map for GAM had parallel contours comparable to those of MLR, since their equations are of similar form. The use of splines in the linear model allows GAMs to get around linearity constraints: GAMs learn a nonlinear relation between each predictor variable and the outcome variable and then add these effects, as well as the intercept, linearly. This style of interpolation map is clearly inaccurate because natural phenomena such as rainfall do not occur in such precise and orderly variations. For prediction purposes, the rainfall values predicted by GAM exhibit much better and more consistent results than those of MLR or RF. Comparing the MAE and RMSE of the GAM predictions for the various datasets with those of MLR and RF, we see that the errors are much lower in most cases. Also, the COD and NSE of the GAM predictions are much higher than those of OK, MLR, and RF, overall exhibiting GAM as the best model for the prediction of rain in our study area among MLR, GAM, and RF.
Overall, the findings suggest that GAMs are the most suitable approach for predicting rainfall in areas with complex terrain and limited meteorological data, offering superior accuracy and performance compared to Multiple Linear Regression (MLR), RF, and OK interpolation.
Data sources: The training data consist of satellite data, interpolated by OK into maps for MLR, GAM, and RF; the testing data, on the other hand, are based on rain gauge data.
Performance improvements: Maps created using RF combined with OK interpolation are more realistic than those of the other methods. Moreover, the testing maps are better than their training counterparts.
Attributes: The inclusion of temperature, latitude, longitude, and elevation as attributes at each point has a positive impact on map accuracy.
Training + testing data: Combining both satellite and gauge data for training and testing leads to significantly improved and more realistic maps compared to using only training data.
GAM vs. MLR vs. RF: GAMs outperform MLR and RF in predicting rainfall. GAMs learn a nonlinear relationship between each predictor variable and the outcome variable, allowing them to capture complex patterns more effectively.
Interpolation style: The interpolation maps derived from GAM predictions are more accurate because GAM does not impose the same linearity constraints as MLR.
Performance metrics: Metrics like MAE, RMSE, COD, and NSE consistently show that GAMs are the best model for predicting rainfall in our study area when compared to MLR, RF, and OK.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.