ABSTRACT
Data scarcity and the unavailability of observed rainfall in the northeastern states of India limit the prediction of extreme hydro-climatological changes. To fill this gap, a data assimilation approach, which includes seasonality assessment, statistical evaluation, and bias correction, has been applied to re-construct accurate high-resolution gridded (5 km2) daily rainfall data (2001–2020). Random forest (RF) and support vector regression (SVR) were used to predict rainfall time series, and the machine learning-based and data assimilation-based gridded rainfall data were compared. Five gridded rainfall datasets, namely, Indian Monsoon Data Assimilation and Analysis (IMDAA) (12 km2), APHRODITE (25 km2), India Meteorological Department (IMD) (25 km2), PRINCETON (25 km2), and CHIRPS (25 and 5 km2), have been utilized. For the re-constructed rainfall dataset (5 km2), comparative seasonality and change assessments have been performed with respect to the other rainfall datasets. The CHIRPS and APHRODITE datasets showed the closest similarity to IMDAA. The RF and assimilated rainfall (AR) data performed best in terms of bias and extremity, and the AR data were identified as the most accurate (>0.8). Precipitation change analysis (2021–2100) using the bias-corrected and downscaled CMIP6 datasets showed that dry spells will be enhanced. Under the moderate CMIP6 emission scenario, SSP245, wet spells will be enhanced in future; under SSP585 (representing the extreme worst case), however, wet spells will decrease.
HIGHLIGHTS
A unique data assimilation approach is applied to construct an accurate high-resolution gridded (5 km2) daily rainfall time series.
Evaluation and bias correction of multisource gridded rainfall datasets were performed.
Random forest and support vector regression machine learning methods were applied for the prediction of rainfall.
Long-term rainfall changes were assessed in one of the wettest regions of the world.
INTRODUCTION
Several regions of the world suffer from limited availability of high-resolution rainfall datasets at daily or sub-daily time steps because rainfall gauges are sparse (Gupta et al. 2020a). In India, regions such as the Himalayan river basins and northeastern India (also known as one of the wettest regions of the world) have few rainfall gauges and therefore lack a standard and accurate rainfall data product that could be used for watershed applications and for analysing extreme hydro-climatic changes (Bharti & Singh 2015; Gupta et al. 2020a). The analysis of extreme events such as floods and droughts, which may be linked to climate change (Mukherjee et al. 2018), requires high-resolution rainfall data at a daily time step or finer (Alexander et al. 2019). Where rainfall gauges or accurate high-resolution rainfall datasets (e.g. gridded rainfall datasets) are scarce, the prediction and simulation of hydrological events can be highly uncertain (Singh & Xiaosheng 2019). Therefore, in a data-scarce region like the northeastern states of India, where rain gauge density is low and long-term high-resolution gridded rainfall time series are lacking (most are available at >12.5 km2 scale) (Zahan et al. 2021), there is an urgent need to develop high-resolution (say 5 km2 scale or finer) gridded rainfall datasets, so that near-term and long-term changes in rainfall extremity in the northeastern states of India can be addressed accurately.
Universally acknowledged high-resolution open-source gridded rainfall datasets such as Climate Hazards Group Infrared Precipitation with Station Data (CHIRPS) (available at 25 km2 and 5 km2 scale), Tropical Rainfall Measuring Mission (TRMM), APHRODITE (available at 25 km2 scale), Soil Moisture to Rain (SM2RAIN-ASCAT), and PRINCETON rainfall data (available at 25 km2 scale) provide a viable source for assessing rainfall variability and patterns in different parts of the world (Aggarwal et al. 2022; Bhattacharyya et al. 2022). The reliability of these gridded rainfall datasets has been explored around the world and in India, providing valuable information for the long-term assessment of rainfall variability, mostly at a larger scale (Singh & Xiaosheng 2019). Gupta et al. (2020a) compared and tested the applicability of various gridded rainfall datasets across India and showed that CHIRPS and TRMM better captured the rainfall characteristics with reference to India Meteorological Department (IMD) data in most regions (Sulugodu & Deka 2019). However, TRMM has shown some predictions in the northeastern regions. Applications of these rainfall datasets in India, mostly performed at a larger scale, show that each dataset has its own advantages and limitations (Dubey et al. 2021). The temporal and spatial availability of these rainfall data sources restricts the assessment of the short-term and long-term impact of rainfall at a finer spatial scale (Singh & Xiaosheng 2019).
Based on the findings of previous studies, new hybrid and improved (in terms of resolution and accuracy) rainfall datasets can be generated to better analyse long-term rainfall changes even at the basin scale or smaller (Pai et al. 2014). Many studies have applied data assimilation (DA) techniques to adjust or generate new datasets for better numerical predictions (Lu et al. 2018; Singh & Xiaosheng 2019). Singh & Xiaosheng (2019) used a data assimilation approach to construct long-term daily gridded rainfall datasets over Southeast Asia and, using several statistical methods, successfully removed the time series gaps in the rainfall data. Several studies have demonstrated the utility and consequences of machine learning methods such as decision forest regression, neural network regression, multilayer perceptron, random forest (RF), and support vector regression (SVR) for the prediction of short-term and long-term rainfall (Ridwan et al. 2021; Barrera-Animas et al. 2022). When the capabilities of machine learning methods for rainfall prediction were tested, some methods, such as SVR, were found reliable for short-term predictions (Kajewska-Szkudlarek 2020), and others, such as RF, were found efficient for long-term predictions (Pham et al. 2019). Several studies have also demonstrated the utility of statistical functions and bias correction methods such as quantile mapping (QM) (Singh & Xiaosheng 2019; Kumar et al. 2021), quantile–quantile analysis (Gupta et al. 2020a), and probability methods (Fang et al. 2015; Shivam et al. 2019) for the correction of rainfall datasets.
A significant change has been noticed in the last few years in the Indian Monsoon system, which may be caused by climate change, and this has affected rainfall patterns and amounts in terms of both intensity and frequency across India (Gupta et al. 2020b; Kumar et al. 2021). Many regions in India, especially the hilly regions including the northeastern regions, have been threatened by severe extreme events such as flash droughts and extreme floods (Yaduvanshi et al. 2019; Sharma & Goyal 2020). Mukherjee et al. (2018) reported that the annual maximum precipitation will decrease in the northeastern regions of India. It was observed that the northern part of the northeastern states showed a decrease in rainfall, which varied from 3% in the northwestern part to ∼12% in the northeastern part (Ravindranath et al. 2011). The increase in temperature and rainfall variability caused by climate change has exerted pressure on overall water availability in Mizoram and other northeastern regions through an increased rate of evapotranspiration and an altered overall water balance (Ravindranath et al. 2011; Monsang et al. 2021). However, region-specific observations of extreme rainfall changes in the northeast are less explored, although they might be crucial for analysing the impact of climate change on current and long-term water availability and water security in the region.
Considering the aforementioned points, this study mainly focuses on the construction of accurate and reliable high-resolution gridded rainfall datasets (i.e. 5 km2 grid scale) for the selected study area to enhance the scope of analysing current and long-term rainfall changes. The second objective is to analyse the historical and long-term (1991–2100) rainfall changes in the selected study area using the constructed rainfall data and climate model datasets by formulating various extreme rainfall climate indices (RCIs). For this purpose, the latest large-scale gridded (25 km2) Coupled Model Intercomparison Project Phase 6 (CMIP6) climate model datasets with shared socioeconomic pathway (SSP) experiments (i.e. SSP245 and SSP585) were first bias corrected with reference to the newly generated rainfall datasets, and the near-term and long-term rainfall changes were then analysed using the CMIP6 climate model datasets (Gupta et al. 2020b). For the construction of accurate gridded high-resolution rainfall data, DA with machine learning methods has been performed. For the DA, various open-source gridded rainfall datasets, namely Indian Monsoon Data Assimilation and Analysis (IMDAA) (12 km2), APHRODITE (25 km2), IMD (25 km2), PRINCETON (25 km2), and CHIRPS (25 and 5 km2), have been utilized (Gupta et al. 2020a). For assimilation, the datasets with the least error with reference to the observed gridded IMD rainfall dataset were identified using statistical functions and quantile–quantile (Q–Q) plots, and bias correction was then applied to reduce the uncertainty in the rainfall data. For the accuracy assessment of the newly constructed time series gridded rainfall dataset (5 km2), comparative seasonality and change assessments have been performed with respect to the other rainfall datasets. For the rainfall predictions and assessment of the newly constructed time series gridded rainfall dataset, machine learning methods such as RF and SVR have been employed.
The SVR and RF have been successfully utilized for the prediction of time series rainfall datasets across the world (Pham et al. 2019). For the assessment of near-term and long-term rainfall changes, the standard and widely used RCIs such as annual mean, dry spell frequency, wet spell frequency, and maximum 1-day precipitation per year (Rx1D) have been formulated and analysed.
STUDY REGION
The study area lies in the northeastern region between latitudes 22.0°–26.0° and longitudes 92.0°–94.5° and covers mainly the state of Mizoram and parts of Assam, Tripura, Manipur, Meghalaya, and Nagaland (Figure 1). It comprises parts of three river basins: Barak, minor rivers draining into Bangladesh (MRD-BAN), and minor rivers draining into Myanmar (MRD-MYA). The average rainfall of the region ranges from ∼2,500 to ∼6,000 mm, and the region can be categorized as one of the wettest regions in the world. The elevation varies from 10 to 3,100 m, and the terrain is the most variegated among all hilly areas in this part of the country. The hills are extremely rugged and steep; the ranges run in the north–south direction, leaving a few scattered plains. Furthermore, many rivers and streamlets drain the hill ranges.
MATERIALS AND METHODOLOGY
Data sources
In this study, six rainfall datasets, namely IMD (25 km2), IMDAA re-analysis (12 km2), PRINCETON University rainfall datasets (25 km2), CHIRPS (25 km2), CHIRPS (5 km2), and APHRODITE (25 km2), have been utilized for the homogeneous time period 1979–2020. Among these grid-based rainfall datasets, the IMDAA (12 km2) and IMD (25 km2) datasets, generated specifically for the Indian region using gauged rainfall stations, have been considered the most accurate and reliable observed rainfall datasets (Ashrit et al. 2020). All these gridded rainfall datasets were obtained in the NetCDF file format. Python scripts were written to extract the rainfall values for the current study region.
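The extraction scripts themselves are not reproduced in the paper. As a minimal sketch, assuming a regular latitude/longitude grid and the study-area bounds stated in the Study Region section (22.0°–26.0° N, 92.0°–94.5° E), the cell selection could look as follows. The function names `grid_indices_in_box` and `subset_field` and the toy coordinates are hypothetical, and reading the NetCDF file itself is left out.

```python
def grid_indices_in_box(lats, lons, lat_min=22.0, lat_max=26.0,
                        lon_min=92.0, lon_max=94.5):
    """Return the row and column indices of grid-cell centres inside the box."""
    rows = [i for i, lat in enumerate(lats) if lat_min <= lat <= lat_max]
    cols = [j for j, lon in enumerate(lons) if lon_min <= lon <= lon_max]
    return rows, cols


def subset_field(field, rows, cols):
    """Cut a 2-D field (list of rows) down to the selected grid cells."""
    return [[field[i][j] for j in cols] for i in rows]


# Toy 0.25-degree grid around the study area (illustrative values only).
lats = [21.75 + 0.25 * i for i in range(20)]   # 21.75 ... 26.50
lons = [91.75 + 0.25 * j for j in range(14)]   # 91.75 ... 95.00
field = [[float(i + j) for j in range(len(lons))] for i in range(len(lats))]

rows, cols = grid_indices_in_box(lats, lons)
study_area = subset_field(field, rows, cols)
```

In practice the same index lists would be applied to every daily time slice of each dataset.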
IMDAA re-analysis rainfall datasets
The IMDAA re-analysis is a regional atmospheric re-analysis that encompasses the Indian subcontinent. The IMDAA re-analysis datasets have been generated by the National Centre for Medium-Range Weather Forecasting (NCMRWF) and the IMD (Ashrit et al. 2020) (https://rds.ncmrwf.gov.in/datasets). Previous studies have utilized the IMDAA datasets across India (Ashrit et al. 2020; Rani et al. 2021).
CHIRPS rainfall
Climate Hazards Group Infrared Precipitation with Station Data (CHIRPS) is a quasi-global rainfall dataset spanning 35 years (https://data.chc.ucsb.edu/products/CHIRPS-2.0/). The CHIRPS rainfall datasets are available in two resolutions i.e. 0.05° × 0.05° (i.e. 5 km2) and 0.25° × 0.25° (25 km2), and in this study, both resolution datasets (1981–2020) have been utilized. The CHIRPS rainfall product incorporates Climate Hazards Group Rainfall Climatology (CHP Clim), Tropical Rainfall Measuring Mission (TRMM) 3B42 rainfall product, Geostationary Thermal Infrared Satellite Observations, atmospheric model rainfall from NOAA Climate Forecast System, and gauge rainfall observations from national or regional meteorological sources (Sulugodu & Deka 2019; Gupta et al. 2020a).
APHRODITE rainfall
Asian Precipitation – Highly-Resolved Observational Data Integration Towards Evaluation (APHRODITE) (Yatagai et al. 2012; Banerjee et al. 2020) gridded precipitation is a series of long-term (1951–2016) continental-scale daily products for Asia, including the Himalayas, South and Southeast Asia, and mountainous areas in the Middle East (https://climatedataguide.ucar.edu/climate-data/aphrodite-asian-precipitation-highly-resolved-observational-data-integration-towards). APHRODITE gridded data products are available for four subdomains (Monsoon Asia, Middle East, Russia, and Japan), as well as a unified domain. Except for Japan, which has a 0.05° × 0.05° horizontal resolution, the time-varying datasets have a 0.25° × 0.25° (25 km2) or 0.50° × 0.50° (50 km2) horizontal resolution. In this study, the Monsoon Asia-based climatological daily mean precipitation datasets with a resolution of 0.25° × 0.25° (25 km2) have been utilized (Yasutomi et al. 2011; Bhattacharyya et al. 2022). This dataset was prepared using gauge-based rainfall (around 12,000 rain gauge stations over the entire Asian region), and the angular distance weighting interpolation method was used for the gridding of rainfall observations (Yasutomi et al. 2011; Singh & Xiaosheng 2019).
PRINCETON rainfall
Princeton University's Terrestrial Hydrology Research Group provides a gridded daily rainfall dataset (PRINCETON) from 1948 to 2016 with a global grid resolution of 0.25° × 0.25° (http://hydrology.princeton.edu/data.pgf.php), which is utilized in this study (Singh & Xiaosheng 2019). PRINCETON rainfall datasets have been used in a variety of hydro-climatological investigations all over the world (Sheffield et al. 2006; El Kenawy & McCabe 2016; Singh & Xiaosheng 2019). The PRINCETON dataset was created by combining the NCEP re-analysis dataset with observational data from TRMM, CRU (Sheffield et al. 2006; El Kenawy & McCabe 2016), GPCP (El Kenawy & McCabe 2016), and NASA SRB products (Sheffield et al. 2006). This dataset outperformed comparable global datasets in terms of accuracy (Sheffield et al. 2006; El Kenawy & McCabe 2016).
CMIP6 climate models
In this study, the latest climate model datasets from the Coupled Model Intercomparison Project (CMIP) under the World Climate Research Programme (WCRP), named CMIP6 and developed after CMIP5, have been utilized. CMIP6 marks a significant advance over CMIP5, and a new set of emission scenarios based on various socioeconomic assumptions, known as ‘shared socioeconomic pathways’ (SSPs), has been developed (Gupta et al. 2020b; Samantaray et al. 2022). These scenarios are called SSP1-2.6, SSP2-4.5, SSP4-6.0, and SSP5-8.5, each of which results in 2100 radiative forcing levels similar to those of its predecessor in AR5 (Mishra et al. 2020).
Mishra et al. (2020) developed bias-corrected CMIP6 climate model data for precipitation, maximum temperature, and minimum temperature for six countries in South Asia. Each zipped country file contains 13 models, and each model includes five scenarios (historical, SSP1-2.6, SSP2-4.5, SSP3-7.0, and SSP5-8.5). In this analysis, four climate models, namely ACCESS-ESM 1-5, BCC-CSM2-MR, EC-Earth3, and MRI-ESM2-0, each with three scenarios (historical, SSP2-4.5, and SSP5-8.5), have been selected from the 13 climate models (https://zenodo.org/record/3987736). To select the most applicable models for the current study region, each model's historical rainfall data were compared with the observed historical data, following previous research (Gupta et al. 2020b; Samantaray et al. 2022). The overall rainfall and climate data availability is shown in Table 1.
Rainfall and climate model data availability
Sl. No. | Dataset name | Resolution | Time series availability |
---|---|---|---|
1 | CHIRPS | 0.05° × 0.05° and 0.25° × 0.25° | 1981–2021 |
2 | IMDAA re-analysis | 0.12° × 0.12° | 1979–2021 |
3 | IMD | 0.25° × 0.25° | 1901–2021 |
4 | PRINCETON | 0.25° × 0.25° | 1948–2016 |
5 | APHRODITE | 0.25° × 0.25° | 1951–2007, 2007–2015 |
6 | ACCESS-ESM 1-5 (SSP245 and SSP585) | 0.25° × 0.25° | 2014–2100 |
7 | BCC-CSM2-MR (SSP245 and SSP585) | 0.25° × 0.25° | 2014–2100 |
8 | EC-Earth3 (SSP245 and SSP585) | 0.25° × 0.25° | 2014–2100 |
9 | MRI-ESM2-0 (SSP245 and SSP585) | 0.25° × 0.25° | 2014–2100 |
Data assimilation
In this study, DA has been performed to combine coarse-scale rainfall datasets from various sources in a synergistic way to generate a new and accurate high-resolution gridded rainfall product that captures the seasonality, distribution, and extremity of the observed data (i.e. IMDAA). The DA comprises four main steps: (i) evaluation of the various rainfall datasets and selection of the most applicable ones, (ii) downscaling and bias correction of rainfall datasets, (iii) prediction of rainfall datasets using machine learning methods, and (iv) final comparison and evaluation of the best rainfall dataset.
Previously, various mathematical methods (e.g. distance power method, distance power with high correlation coefficient, linear regression, multiple linear regression, quantile regression) have been used to assimilate time series meteorological datasets (Singh & Xiaosheng 2019). In this study, quantile regression or quantile-based mapping technique has been used, because it is found useful for capturing extreme values and rainfall distribution patterns (Gupta et al. 2020a). Very few studies have used this method in the correction and adjustment of rainfall datasets (Singh & Xiaosheng 2019; Gupta et al. 2020a).
Workflow of methodology adopted for the construction of new hybrid rainfall datasets using data assimilation.
Seasonality and statistical evaluation
In this study, the selected rainfall datasets were available for different time spans and resolutions. Therefore, for comparison and evaluation, homogeneous rainfall datasets were prepared for a common time period (1989–2015) at the same grid resolution (i.e. 0.25° × 0.25°). All the datasets were available at this resolution except IMDAA, which was upscaled from 0.12° × 0.12° to 0.25° × 0.25° using the nearest-neighbour interpolation method (Teegavarapu et al. 2018). Rainfall data gaps were also corrected to improve the accuracy of the overall dataset. The gaps were mostly present in the IMD datasets and were filled using the data gap-filling approach previously applied by Singh & Xiaosheng (2019). To identify the most suitable datasets for the selected study region, the seasonality and statistical evaluation were performed at 0.25° × 0.25°, with IMDAA considered as the observed dataset. Rainfall seasonality is described as the uneven distribution of rainfall over the course of a normal year (Roffe et al. 2019). In the northeastern regions of India, the rainfall distributions are highly varied, and these regions have been categorized as the wettest regions of India (Dikshit & Dikshit 2014). The northeastern states, especially Mizoram and its surroundings, are mostly influenced by southwest (SW) summer monsoon rainfall (June–September), and around 70% of the annual rainfall is received during the SW monsoon period (Saha et al. 2015). Therefore, for the seasonality assessment, the year has been divided into four seasons, namely pre-monsoon (March–May), monsoon (June–September), post-monsoon (October–November), and winter (December–February), following previous studies (Gupta et al. 2020a). For the comparison of different datasets, the seasonal average was calculated for all the datasets.
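The four-season split described above can be sketched as a simple month-to-season lookup followed by averaging. The paper does not state whether the "seasonal average" is a mean daily rate or a mean seasonal total; the sketch below assumes a mean daily rate, and the function name `seasonal_mean_daily` and the synthetic series are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

# Month-to-season mapping as defined in the text.
SEASON_OF_MONTH = {
    3: "pre-monsoon", 4: "pre-monsoon", 5: "pre-monsoon",
    6: "monsoon", 7: "monsoon", 8: "monsoon", 9: "monsoon",
    10: "post-monsoon", 11: "post-monsoon",
    12: "winter", 1: "winter", 2: "winter",
}


def seasonal_mean_daily(daily):
    """daily: iterable of (date, rainfall_mm). Returns mean mm/day per season."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, rain in daily:
        season = SEASON_OF_MONTH[day.month]
        sums[season] += rain
        counts[season] += 1
    return {season: sums[season] / counts[season] for season in sums}


# Synthetic one-year series: 10 mm/day in the monsoon, 1 mm/day otherwise.
start = date(2001, 1, 1)
series = []
for k in range(365):
    d = start + timedelta(days=k)
    series.append((d, 10.0 if 6 <= d.month <= 9 else 1.0))

means = seasonal_mean_daily(series)
```

Applying the same function to each dataset at each grid gives directly comparable seasonal values.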
For the statistical evaluation, widely used statistical functions such as annual mean, standard deviation, root-mean-square error (RMSE), mean-square error (MSE), coefficient of determination (R2), quantile–quantile (Q–Q) plots, Hannan–Quinn information criterion (HQC), and Akaike information criterion (AIC) have been used for the selection of the best datasets (Singh & Xiaosheng 2019; Afuecheta & Omar 2021). In total, 13 grids were randomly selected over the entire study region from each dataset, and all the statistical evaluation was performed on these grids. IMDAA is considered the reference dataset because it is a regional dataset available at a fine resolution (0.12° × 0.12°) and generated specifically for the Indian region using gauged rainfall station data. IMDAA can be considered the most accurate and reliable observed rainfall dataset in the Indian context (Ashrit et al. 2020). For the assessment of long-term precipitation changes, the CMIP6 climate models were first bias corrected against the assimilated rainfall (AR) datasets. For this, the QM bias correction method was applied using the historical experiment of each CMIP6 model and the AR, and then the future rainfall scenarios for all four GCMs and both SSP scenarios were corrected.
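The evaluation metrics listed above can be sketched in a few lines. The paper does not give its formulas, so the following assumes common textbook forms: R2 as the squared Pearson correlation, and AIC and HQC in their least-squares (Gaussian error) form, n·ln(RSS/n) + 2k and n·ln(RSS/n) + 2k·ln(ln n). The function name `evaluation_metrics`, the parameter count `n_params`, and the sample values are hypothetical.

```python
import math


def evaluation_metrics(obs, sim, n_params=1):
    """Compare a candidate series (sim) against a reference series (obs)."""
    n = len(obs)
    residuals = [s - o for o, s in zip(obs, sim)]
    rss = sum(r * r for r in residuals)          # residual sum of squares
    mean_obs = sum(obs) / n
    mean_sim = sum(sim) / n
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_sim = sum((s - mean_sim) ** 2 for s in sim)
    log_term = n * math.log(rss / n)             # undefined for a perfect fit
    return {
        "mean_bias": sum(residuals) / n,
        "mse": rss / n,
        "rmse": math.sqrt(rss / n),
        "r2": cov * cov / (var_obs * var_sim),   # squared Pearson correlation
        "aic": log_term + 2 * n_params,
        "hqc": log_term + 2 * n_params * math.log(math.log(n)),
    }


# Illustrative values only, not data from the study.
obs = [10.0, 20.0, 30.0, 40.0]
sim = [12.0, 18.0, 33.0, 39.0]
metrics = evaluation_metrics(obs, sim)
```

Lower RMSE, AIC, and HQC and higher R2 indicate a closer match to the reference dataset.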
Downscaling and bias corrections
Downscaling refers to the process of obtaining high-resolution information from low-resolution variables (Kumar & Singh 2021; Kumar et al. 2022). It relies on dynamical or statistical methodologies that are extensively employed in a variety of disciplines, particularly meteorology, climatology, and remote sensing. APHRODITE, being considered the best dataset, was downscaled to the 0.05° × 0.05° scale. CHIRPS is already available at the fine resolution of 0.05° × 0.05° (Section 3.1.3). IMDAA, which was available at 0.12° × 0.12°, was also downscaled to the 0.05° × 0.05° scale.

To execute quantile regression (QR) in this study, the grid closest to the observed grid was identified, and the relationship derived over the training period was used to correct the test data. QR was applied twice to obtain the desired results. First, APHRODITE and CHIRPS for 2001–2016 were used as training datasets and CHIRPS for 1981–2000 as the testing dataset, yielding a predicted CHIRPS series for 1981–2000. Then, this predicted dataset was used as a training dataset alongside IMDAA for the same period, CHIRPS for 2001–2020 was used as the testing dataset, and QR was applied to obtain the newly generated hybrid gridded rainfall dataset.
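The equation for this step did not survive in the text, so the authors' exact formulation is unknown. As a hedged illustration of the general idea of quantile-based correction, the sketch below implements a plain empirical quantile mapping: each value to be corrected is assigned its non-exceedance probability in the training series of the dataset being corrected, and is replaced by the reference value at the same probability. The function name `empirical_quantile_map` and the toy numbers are hypothetical.

```python
import bisect


def empirical_quantile_map(model_train, obs_train, model_new):
    """Map each value in model_new onto the distribution of obs_train."""
    m_sorted = sorted(model_train)
    o_sorted = sorted(obs_train)
    n_m, n_o = len(m_sorted), len(o_sorted)
    corrected = []
    for x in model_new:
        # Non-exceedance probability of x in the training distribution.
        p = bisect.bisect_right(m_sorted, x) / n_m
        # Reference value at the same probability (nearest-rank lookup).
        k = min(n_o - 1, max(0, int(round(p * n_o)) - 1))
        corrected.append(o_sorted[k])
    return corrected


# Toy training data in which the reference is twice the candidate dataset.
model_train = [float(v) for v in range(10)]          # 0, 1, ..., 9
obs_train = [2.0 * v for v in range(10)]             # 0, 2, ..., 18
corrected = empirical_quantile_map(model_train, obs_train, [3.0, 9.0, 0.0])
```

A production version would usually interpolate between ranks and treat dry days separately; both refinements are omitted here for brevity.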
Machine learning methods for rainfall predictions
Machine learning methods, namely RF regression (RFR) and SVR, have been used to predict the time series rainfall at the 0.05° × 0.05° grid scale. For the prediction of time series rainfall, the best-selected datasets have been utilized (Figure 2). Such techniques contain certain parameters that need to be optimized before use (Ridwan et al. 2021). A few studies have demonstrated that SVR and RFR perform well in the prediction of time series datasets (Pham et al. 2019; Kajewska-Szkudlarek 2020), and therefore these two methods have been adopted in this study.
Support vector regression


Hyperparameters tuned for SVR
Sl. No. | Parameter | Range/type | Optimum value |
---|---|---|---|
1 | Kernel | Linear, poly, rbf, sigmoid | rbf |
2 | C (regularisation parameter) | 1.0–100,000.0 | 85,000 |
3 | gamma | 1.0–0.0001 | 0.001 |
4 | Epsilon (ε) | 0.1–0.00001 | 0.0001 |
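To make the roles of gamma and epsilon in the table above concrete, the sketch below writes out the two standard SVR building blocks they control: the RBF kernel, exp(−gamma·‖x − z‖²), and the ε-insensitive loss, max(0, |y − f(x)| − ε). These are the textbook definitions, not a reconstruction of the authors' equations, and the function names and sample values are hypothetical. The paper does not name the library it used; the parameter names match those accepted by implementations such as scikit-learn's `SVR`.

```python
import math

GAMMA = 0.001      # optimum gamma reported in the table above
EPSILON = 0.0001   # optimum epsilon reported in the table above


def rbf_kernel(x, z, gamma=GAMMA):
    """Similarity between two predictor vectors; 1.0 when they coincide."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def epsilon_insensitive_loss(y_true, y_pred, epsilon=EPSILON):
    """Errors smaller than epsilon cost nothing; larger ones grow linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Illustrative rainfall-like values (mm/day), not data from the study.
k_same = rbf_kernel([10.0, 5.0], [10.0, 5.0])
k_far = rbf_kernel([10.0, 5.0], [40.0, 45.0])        # squared distance 2500
loss_inside = epsilon_insensitive_loss(12.0, 12.00005)
loss_outside = epsilon_insensitive_loss(12.0, 13.0)
```

A small gamma such as 0.001 keeps the kernel broad, so distant samples still influence a prediction, while the large C in the table penalises errors outside the ε-tube heavily.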
Random forest regression
RF is an ensemble machine learning algorithm that has been found suitable for the prediction of time series variables (Pham et al. 2019). An RF is a combination of a large number of decision trees. Each tree is constructed independently from a bootstrap sample of the original dataset, and each node is split using the best of a random selection of predictor variables at that node (Pham et al. 2019). For each training set, a new decision tree is grown, and a new split is made at each node of the tree. In RFR, the final prediction is simply the average of the outcomes of all individual trees in the forest.
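The bootstrap-and-average mechanism described above can be illustrated with a deliberately tiny ensemble. This is a sketch under simplifying assumptions, not the model used in the study: each "tree" is a depth-1 stump on a single predictor, so the random selection of predictor variables at each node is omitted. All function names and data values are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Depth-1 regression tree; returns (threshold, left_mean, right_mean)."""
    mean_y = sum(ys) / len(ys)
    best = (float("inf"), xs[0], mean_y, mean_y)    # fallback: constant prediction
    for t in sorted(set(xs))[:-1]:                  # candidate split points
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Grow each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Final prediction: the average over all individual trees."""
    preds = [predict_stump(stump, x) for stump in forest]
    return sum(preds) / len(preds)


# Illustrative predictor/target pairs, not data from the study.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 2.0, 3.0, 3.0, 10.0, 11.0, 12.0, 12.0]
forest = fit_forest(xs, ys)
```

Averaging over many bootstrap-trained trees smooths out the abrupt steps that any single tree would produce.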
In this study, two RFR hyperparameters have been tuned: n_estimators (i.e. the number of regression trees created) and max_depth (i.e. the greatest depth to which trees can grow). n_estimators (ranging between 10 and 100) and max_depth (ranging between 2 and 8) were tuned to obtain the best-fit parameters and results (Table 3). To determine the hyperparameters, the grid search method was applied: the model was trained for multiple combinations of parameters, and the combination that gave the best performance was selected (Table 3). The downscaled best rainfall datasets have been used as input (predictor) variables, and IMDAA as the dependent variable (or reference dataset). The regression model was trained (1989–2007) and validated (2007–2015) at the 0.05° scale, and the precipitation datasets were predicted at the 0.05° scale for 2007–2015.
Hyperparameters tuned for RF
Sl. No. | Parameter | Range | Optimum value |
---|---|---|---|
1 | n_estimators | 10–100 | 100 |
2 | max_depth | 2–8 | 2 |
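The grid search used for tuning can be sketched generically: every combination of candidate values is scored and the best one is kept. The ranges below come from the table above, but the step sizes are an assumption, and `toy_validation_score` is a hypothetical stand-in for "train the model, then compute the validation error". Its minimum is placed arbitrarily and says nothing about the real data. Libraries such as scikit-learn offer the same procedure as `GridSearchCV`.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every parameter combination; lower scores are better."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    n_evaluated = 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        n_evaluated += 1
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evaluated


# Ranges from the table above; the step sizes are an assumption.
param_grid = {
    "n_estimators": list(range(10, 101, 10)),   # 10, 20, ..., 100
    "max_depth": list(range(2, 9)),             # 2, 3, ..., 8
}


def toy_validation_score(params):
    # Hypothetical error surface with an arbitrary minimum at (50, 4).
    return abs(params["n_estimators"] - 50) / 100 + abs(params["max_depth"] - 4)


best_params, best_score, n_evaluated = grid_search(param_grid, toy_validation_score)
```

The cost grows with the product of the grid sizes (70 fits here), which is why only two hyperparameters with modest ranges are searched.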
Evaluation of assimilated and predicted rainfall time series
After the construction of the hybrid rainfall time series dataset, its performance was evaluated statistically using changes in the mean, the coefficient of determination (R2), and RMSE. These parameters were calculated with respect to IMDAA and CHIRPS at the 0.05° × 0.05° scale. To select the best rainfall dataset between the AR (i.e. hybrid rainfall) and the predicted rainfall (i.e. SVR and RF), the percentage (%) change was computed with respect to the reference rainfall data (i.e. IMDAA). The selected best dataset is then used for the evaluation of RCIs.
Rainfall extreme indices and future changes
For the calculation of the RCIs, four CMIP6 models and, for each model, two SSP scenarios (i.e. SSP245 and SSP585) were considered, as mentioned in Section 3.1.6. Each scenario has been bias corrected with respect to the AR dataset. After bias correction of the CMIP6 models under the SSP245 and SSP585 scenarios, the full time series were divided into two periods, namely near-term (2020–2050) and far-term (2060–2090), and for each period the annual average and climate indices (CIs) were calculated with respect to the CMIP6 historical scenarios of the four selected climate models. The RCIs considered are annual mean, dry spell frequency, wet spell frequency, and maximum 1-day precipitation per year (Rx1D), as per the guidelines of the IPCC and as used in previous studies (Singh & Goyal 2016; Kumar et al. 2021). A period in which rainfall is <2.5 mm/day for 5 continuous days is counted as a dry spell, and a period in which rainfall is >2.5 mm/day for 5 continuous days is counted as a wet spell. As the name suggests, Rx1D is the maximum precipitation recorded in a single day of the year. Python modules are available for the calculation of CIs; one of them is XCLIM (https://xclim.readthedocs.io/en/stable/).
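The spell definitions above translate directly into a run-length count. The sketch below is a plain-Python illustration rather than the XCLIM implementation; it assumes that any uninterrupted run of at least 5 qualifying days counts as one spell, which the text does not state explicitly. Function names and the sample series are hypothetical.

```python
THRESHOLD_MM = 2.5   # wet/dry threshold from the text (mm/day)
MIN_LENGTH = 5       # minimum run length from the text (days)


def count_spells(daily_mm, qualifies, min_length=MIN_LENGTH):
    """Count runs of at least min_length consecutive qualifying days."""
    spells, run = 0, 0
    for value in daily_mm:
        if qualifies(value):
            run += 1
        else:
            if run >= min_length:
                spells += 1
            run = 0
    if run >= min_length:        # a run that lasts until the end of the series
        spells += 1
    return spells


def dry_spell_frequency(daily_mm):
    return count_spells(daily_mm, lambda r: r < THRESHOLD_MM)


def wet_spell_frequency(daily_mm):
    return count_spells(daily_mm, lambda r: r > THRESHOLD_MM)


def rx1day(daily_mm):
    """Maximum 1-day precipitation in the series (one year of data expected)."""
    return max(daily_mm)


# Illustrative daily series (mm/day), not data from the study.
series = ([0.0] * 6 + [12.0] * 5 + [1.0] * 4
          + [30.0, 55.5, 8.0, 3.0, 2.6, 4.0] + [0.5] * 5)
dry = dry_spell_frequency(series)
wet = wet_spell_frequency(series)
peak = rx1day(series)
```

Here the 4-day dry run is too short to count, so the series contains two dry spells and two wet spells.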
RESULTS AND DISCUSSION
Comparative assessment of rainfall datasets
This study utilizes five multiscale rainfall datasets, namely APHRODITE, CHIRPS, IMD, IMDAA, and PRINCETON. These rainfall datasets have been assimilated, and a new hybrid and improved fine-resolution gridded rainfall dataset has been generated for the selected study area. In this context, the applicability of the different rainfall datasets (i.e. APHRODITE, CHIRPS, IMD, and PRINCETON) has first been evaluated with respect to the IMDAA re-analysis rainfall data. To identify the best rainfall datasets, rainfall seasonality, statistical evaluation, and quantile regression (e.g. Q–Q plots) analyses have been carried out.
Average annual plots of rainfall (mm) (1989–2015) highlighting spatial variations among the selected different rainfall datasets at 25 km2 scale.
Highlighting seasonal variations in average (1989–2015) rainfall in the selected rainfall datasets.
Statistical evaluation of different rainfall datasets using Mean bias, RMSE and Coefficient of Determination (R2) which is computed with respect to IMDAA reanalysis rainfall datasets (1989–2015).
Comparative assessment of different rainfall datasets using different evaluation criteria which is computed with respect to IMDAA reanalysis rainfall datasets (1989–2015).
Comparative assessment of different rainfall datasets using Q-Q plots which is computed with respect to IMDAA reanalysis rainfall datasets at the selected random grids (locations) (1989–2015).
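A Q–Q comparison of this kind pairs the quantiles of a candidate dataset with the same quantiles of the reference at one grid location. The sketch below shows one way to compute those pairs. The function name `qq_pairs`, the choice of 99 probability levels, and the synthetic gamma-distributed rainfall are all assumptions for illustration.

```python
import numpy as np


def qq_pairs(candidate, reference, probs=None):
    """Return (reference quantiles, candidate quantiles) at matching probabilities."""
    if probs is None:
        probs = np.linspace(0.01, 0.99, 99)  # assumed probability levels
    candidate = np.asarray(candidate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return np.quantile(reference, probs), np.quantile(candidate, probs)


# Synthetic daily rainfall at one grid point (not data from the study).
rng = np.random.default_rng(0)
reference_mm = rng.gamma(shape=2.0, scale=5.0, size=1000)
candidate_mm = 1.2 * reference_mm  # a candidate that overestimates by 20%

ref_q, cand_q = qq_pairs(candidate_mm, reference_mm)
# Points above the 1:1 line indicate overestimation at that quantile.
print(np.all(cand_q >= ref_q))
```

Plotting `cand_q` against `ref_q` together with the 1:1 line gives the Q–Q plot; departures in the upper tail show how well extreme rainfall is reproduced.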
Construction of hybrid rainfall data
Comparison of Bias corrected vs uncorrected Rainfall datasets viz. CHIRPS (w.r.t. APHRODITE), CHIRPS (w.r.t. IMDAA) and Assimilated Rainfall. Also, highlighting the correlation and evaluating the strength of different rainfall datasets using R2 and RMSE.
A significant variation can be seen between bias-corrected CHIRPS (with respect to IMDAA) and uncorrected CHIRPS, as shown in Figures 8(e) and 8(f). As per the RMSE computed between bias-corrected CHIRPS (with respect to IMDAA) and uncorrected CHIRPS (Figure 8(g)), most of the grids show RMSE values of around 100 or more, except for a few grids (mostly in the Jaintia Hills, Meghalaya region), which show slightly higher RMSE values (>125). As per the R2 computed between bias-corrected CHIRPS (with respect to IMDAA) and uncorrected CHIRPS (Figure 8(h)), the majority of the grids show a good match, with R2 of approximately 0.5 or higher. Figures 8(i) and 8(j) show the distribution of the mean rainfall (2001–2020) for the newly constructed AR (hybrid rainfall data) and IMDAA. In Figure 8(k), the majority of grids show RMSE <200, and the AR dataset is able to capture the spatial pattern of the IMDAA rainfall data, especially the extreme rainfall values, i.e. >5,200 mm (e.g. over the Jaintia Hills, Meghalaya region). In Figure 8(l), the AR data show a good match with the IMDAA rainfall data, and most of the grids have R2 values >0.6, with very few exceptions. Overall, the AR dataset, which has been generated for the period 2001–2020, performed well across the study region and was found to be comparable to the reference IMDAA dataset.
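The grid-wise RMSE and R2 maps discussed above can be computed in one pass over arrays shaped (time, lat, lon). The sketch below is an illustration only: the function name `gridwise_rmse_r2` is hypothetical, R2 is again assumed to be the squared Pearson correlation, and the arrays are synthetic.

```python
import numpy as np


def gridwise_rmse_r2(candidate, reference):
    """Per-grid RMSE and R2 for two arrays shaped (time, lat, lon)."""
    c = np.asarray(candidate, dtype=float)
    r = np.asarray(reference, dtype=float)
    rmse = np.sqrt(np.mean((c - r) ** 2, axis=0))
    # Anomalies about each grid cell's own time mean.
    c_anom = c - c.mean(axis=0)
    r_anom = r - r.mean(axis=0)
    cov = np.sum(c_anom * r_anom, axis=0)
    denom = np.sqrt(np.sum(c_anom ** 2, axis=0) * np.sum(r_anom ** 2, axis=0))
    with np.errstate(divide="ignore", invalid="ignore"):
        # Assumption: R2 = squared Pearson correlation; NaN where a series is constant.
        r2 = np.where(denom > 0, (cov / denom) ** 2, np.nan)
    return rmse, r2


# Synthetic example: 20 time steps on a 3 x 4 grid (not data from the study).
rng = np.random.default_rng(42)
reference = rng.gamma(shape=2.0, scale=100.0, size=(20, 3, 4))
candidate = reference + rng.normal(0.0, 20.0, size=reference.shape)

rmse_map, r2_map = gridwise_rmse_r2(candidate, reference)
# Share of grid cells passing a chosen R2 threshold, e.g. 0.6 as quoted in the text.
print(float(np.mean(r2_map > 0.6)))
```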
Prediction of rainfalls using SVR and RF
Comparison of different rainfall datasets through averaged mean (2007–2015) viz. IMDAA, Assimilated Rainfall (AR), rainfall predicted by RF and rainfall predicted by SVR.
Evaluation of best-constructed rainfall times series
Showing the evaluation results of predicted rainfall datasets by RF and SVR methods (a to d) and figures (a to g) showing the comparative assessment of predicted and assimilated rainfall datasets by the computation of percentage (%) of change.
Long-term assessment of rainfall changes through rainfall indices
This study explores the long-term future changes in rainfall in the selected study region by dividing the total time series (2021–2100) into two terms, (i) near-term (NT, 2021–2050) and (ii) far-term (FT, 2061–2090), utilizing the statistically downscaled and bias-corrected CMIP6 GCM scenarios, namely, ACCESS-ESM1, BCC-CSM2-MR, EC-EARTH3, and MRI-ESM2-0, with SSP245 (i.e. moderate emission scenario) and SSP585 (extreme emission scenario). To analyse the rainfall changes, the percentage (%) of change was computed between the historical scenario (1991–2020) and NT (2021–2050), and between the historical scenario (1991–2020) and FT (2061–2090), for four rainfall indices: annual mean, Rx1D, dry spell frequency, and wet spell frequency.
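The percentage of change between a historical index map and a future (NT or FT) index map follows the usual relative-change formula, (future − historical) / historical × 100. A minimal sketch is given below. The function name `percent_change` and the sample values are assumptions for illustration, and cells where the historical value is zero are set to NaN here, which is a design choice of this sketch and not something stated in the study.

```python
import numpy as np


def percent_change(future, historical):
    """Percentage change of a future index relative to its historical value."""
    future = np.asarray(future, dtype=float)
    historical = np.asarray(historical, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        change = (future - historical) / historical * 100.0
    # Assumption: relative change is undefined where the historical value is zero.
    return np.where(historical != 0, change, np.nan)


# Synthetic index maps on a 2 x 2 grid (not data from the study).
historical_index = np.array([[1000.0, 2000.0], [0.0, 50.0]])
near_term_index = np.array([[800.0, 2200.0], [5.0, 50.0]])
print(percent_change(near_term_index, historical_index))
```

Negative values correspond to a projected decrease relative to 1991–2020 and positive values to an increase, which is how the ranges quoted in this section (for example, 0 to −20%) are read.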
Showing the variations in average annual rainfall in near term (NT) scenario (2021–2050) and far term (FT) scenario (2061–2090) as per four different CMIP6 climate models with two experimental scenarios (i.e. SSP245 and SSP585).
In the case of EC-EARTH3 SSP245, both the NT and FT scenarios (Figure 11(i) and 11(j)) show a decrease in rainfall (0 to −60%), while in the case of SSP585, both the NT and FT scenarios show a slight decrease or no change (0 to −20%) (Figure 11(k) and 11(l)). In the case of the MRI-ESM2-0 SSP245 scenario, both the NT and FT observations (Figure 11(m) and 11(n)) show a slight decrease in rainfall (0 to −25%) or no change. In the case of the SSP585 NT and FT scenarios (Figure 11(o) and 11(p)), the rainfall is slightly decreasing or unchanged in most cases (0 to −20%). Based on these observations, rainfall is found to decrease in most of the areas, and a higher rate of change (mostly decreasing) is recorded in the Mizoram area.
Showing the variations in average Rx1D (maximum 1 D rainfall) in near term (NT) scenario (2021–2050) and far term (FT) scenario (2061–2090) as per four different CMIP6 climate models with two experimental scenarios (i.e. SSP245 and SSP585).
Showing the variations in rainfall using Dry Spell Frequency in near term (NT) scenario (2021–2050) and far term (FT) scenario (2061–2090) as per four different CMIP6 climate models with two experimental scenarios (i.e. SSP245 and SSP585).
Showing the variations in rainfall using Wet Spell Frequency in near term (NT) scenario (2021–2050) and far term (FT) scenario (2061–2090) as per four different CMIP6 climate models with two experimental scenarios (i.e. SSP245 and SSP585).
Wet spell frequency-based rainfall changes have been analysed for the NT and FT scenarios, as shown in Figure 14. In Figures 14(a) and 14(b), as per ACCESS-ESM1 SSP245, the NT and FT scenarios show a mixed response: areas like Nagaland and Manipur show a slight decrease in wet spells (0 to −10%), and areas like Mizoram and Assam show a slight increase in wet spells (0 to +20%). As per Figures 14(c) and 14(d), the SSP585-based observations clearly show an increase in wet spell frequency in the majority of areas (0 to +40%). In the case of BCC-CSM2-MR, the response is mixed: under SSP245, the NT scenario shows a slight increase in wet spells (0 to +20%), except in some areas over Nagaland, while the FT scenario displays a decrease in wet spells (0 to −20%). As per Figures 14(g) and 14(h), under SSP585, both the NT and FT scenarios show a clear decrease in wet spells (0 to −40%). In the case of EC-EARTH3, mixed responses have again been observed: under SSP245, the NT scenario shows a slight increase (0 to +10%) in wet spells, except in a very few areas that show a slight decrease, while in the FT scenario (Figure 14(j)), the majority of areas show a decrease in wet spells (0 to −30%). In the case of SSP585 (Figure 14(k) and 14(l)), a mixed response has been observed in both the NT and FT scenarios. As per MRI-ESM2-0 SSP245 (Figure 14(m) and 14(n)), a mixed response has been observed in both the NT and FT scenarios, where most of the areas over Mizoram show an increase in wet spells, while a major area of Assam and Nagaland shows a decrease in wet spells. In the case of SSP585 (Figure 14(o) and 14(p)), a major area of Mizoram shows an increase in wet spells (0 to +30%), while some areas of Assam, Meghalaya, and Nagaland show a decrease in wet spells.
Overall, all scenarios have shown a mixed response; however, under SSP245, the wet spells are projected to increase, while under SSP585, the wet spells are projected to decrease.
CONCLUSION
This study was performed to accomplish two important objectives. First, it evaluated the applicability of various sources of gridded rainfall datasets (including satellite-based and gauge-based) in the wettest regions of India, such as Mizoram state and some parts of Meghalaya, Assam, Nagaland, and Manipur. DA was carried out to generate a new hybrid and improved gridded rainfall dataset over the selected study region utilizing six gridded rainfall datasets, namely, IMDAA re-analysis, APHRODITE, IMD, PRINCETON, and CHIRPS (two different versions). After applying various statistical evaluation functions and bias corrections, a new AR product was generated, and it was evaluated against the predicted rainfall datasets. The APHRODITE and CHIRPS rainfall datasets were found to be close to IMDAA; therefore, these two datasets were utilized for the construction of the assimilated rainfall. The RF and SVR machine learning methods were utilized to predict the rainfall datasets, which was found to be very useful for the comparison and evaluation of the AR product and other sources of rainfall datasets. Based on the inter-comparison of the predicted rainfall datasets (as per RF and SVR) and the AR product, the RF algorithm performed almost as well as the AR. The AR product is able to capture the seasonality and extremity of the IMDAA rainfall data, and therefore, among all the predicted and constructed datasets, the AR product (at 5 km2 scale) was found to give the best data in the selected study area.
Second, different CMIP6-based climate model datasets have been bias corrected with reference to the AR product, and the future rainfall changes were then analysed. For the assessment of long-term rainfall changes, various standard climate change indices have been formulated, and the percentage of change in rainfall was computed to highlight the rainfall variability and changes in the near future term (2021–2050) and far future term (2061–2090) with respect to the historical period (1991–2020). The rainfall indices displayed substantial variability in the near future term and far future term over the selected study region. The long-term percentage of change analysis based on rainfall extreme indices revealed significant changes in rainfall extremes over the selected study region. As per the annual mean and Rx1D-based observations for NT and FT, the rainfall amount and extreme events are expected to decrease in the future. As per the dry spell frequency analysis, dry spells will increase, while wet spells will increase or decrease in different parts of the selected study area. Considering the moderate emission scenario, i.e. SSP245, wet spells will increase in the future, while in the case of SSP585 (representing the extreme worst case), wet spells will decrease. These observations are crucial for the northeastern states of India and mostly show that, over the wettest regions of India, as per the expected climate change, the rainfall variability (in terms of frequency) will increase, while extreme high events and rainfall amount will decrease.
ACKNOWLEDGEMENTS
The authors thank the National Hydrology Project (NIH SP-45) for funding the study and supporting this research work. The authors would like to thank the National Institute of Hydrology India for providing facilities to carry out this research work. The authors are also thankful to the India Meteorological Department Pune for providing the gridded precipitation dataset. The authors are thankful to WRD Mizoram for providing the observed gauge datasets. The authors are obliged to the IPCC CMIP6, CHIRPS, TRMM, APHRODITE, and PRINCETON data generation teams/organizations for providing the rainfall datasets free of cost. The authors are thankful to the Python developers project team who made the software/scripts/libraries available free of cost.
CONTRIBUTIONS
All authors have equally contributed to the research work.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.