Decision tree-based reduction of bias in monthly IMERG satellite precipitation dataset over India

Decision trees are ideally suited for handling huge datasets and modelling non-linear relationships between different variables. Given the relationship between precipitation and bias may be very complex and non-linear, bias-correction of satellite precipitation is a challenge. We examine the applicability of Classi ﬁ cation and Regression tree (CART) for bias-correction of the Integrated Multi-satellite Retrievals for Global Precipitation Mission (IMERG) precipitation dataset over India. The gauge-based 0.25° gridded precipitation dataset from India Meteorological Department is considered as the reference. The CART model is trained (2001 – 2011) and tested (2012 – 2016) over each 0.25° grids. The training dataset is subjected to 10-fold cross-validation and optimization of the minimum size of leaf node (one of the hyper-parameter). Ef ﬁ ciency of the CART model is evaluated using performance metrics like R 2 , RMSE and MAB over entire India and different climate and elevation zones in India. CART model is observed to be highly effective in capturing the bias during the training (average R 2 ¼ 0.77) and testing (average R 2 ¼ 0.66) period. Signi ﬁ cant improvement in average monthly MAB ( (cid:2) 6.3 to 29.2%) and RMSE (8.7 – 37.3%) was obtained post bias-correction by CART. Better performance of CART model was observed when compared to two widely adopted bias-correction techniques. (cid:129) Though the CART algorithm is found to be robust, over hilly and mountainous regions, the CART model should be cautiously adopted, however.


INTRODUCTION
In the present study, we explore the possibility of applying Classification and Regression Tree (CART) for possible bias reduction of IMERG precipitation dataset over India. The specific objectives of this study are -(i) to train and test the CART model (wherein, satellite rainfall is the predictor variable and observed rainfall is the response variable) grid-wise over entire India for bias reduction of satellite rainfall (ii) to analyze the performance of CART model over different climate and elevation zones in India and (iii) to compare the performance of CART based bias correction techniques with widely adopted and established bias correction techniques. Such an analysis will serve to improve the reliability of IMERG precipitation for applications in water resource and hydrological applications over India.

STUDY AREA
The land region of Indian sub-continent covering from 66°E to 100°E and from 6°N to 39°N is chosen as the study area. India is a country with varied geophysical and climatic features. The northern part of India is surrounded by the mountainous range of Himalayas, while the southern part is surrounded by sea/oceans ( Figure 1, panel a). Western Ghats, located near to the south-western coast receive heavy rainfall. The rainfall during the monsoon months of June, July, August and September, account for more than 75% of the annual rainfall of the country and is highly significant for the agricultural dominated country. India has seven main Koppen -Geiger climate zones showing different climatic characteristics (Peel et al. 2007) (Figure 1, panel b). A brief description about different Koppen -Geiger climate zones over India is shown in Table 1.

DATA USED
Integrated multi-satellite retrievals for GPM (IMERG) The Integrated Multi-satellite Retrievals for GPM (IMERG) is a U.S. multi-satellite rainfall product from the Global Precipitation Mission (GPM) team. IMERG algorithm estimates precipitation from  Huffman et al. 2014). GPM is the successor of TRMM and has three noteworthy improvements. Firstly, it has higher global coverage due to the increased orbital inclination from 35°to 65°. Secondly, it can measure very light precipitation (,0.5 mm/hour; Hou et al. 2014) owing to increased sensitivity of precipitation radar sensor due to the inclusion of two additional frequencies. Thirdly, solid form of precipitation (snow) can also be measured due to the inclusion of high-frequency channels in the passive microwave sensor imager. The IMERG data is available at a very high spatial resolution of 0.1°Â 0.1°and multiple temporal resolutions (halfhourly, 3-hourly, one-day, 7-day and monthly). Version 6 and Level 3 monthly IMERG rainfall data (Research/Final Run) available since June 2000 is used in the present study (https://gpm.nasa.gov/ data-access/downloads/gpm). The default units of precipitation in the IMERG datasets is mm/hr, which is transformed to mm/day by accumulating over entire day (multiplying by 24).

India meteorological department (IMD) rainfall
The 0.25°Â 0.25°gridded gauge-based daily rainfall dataset developed by India Meteorological Department (IMD) (Pai et al. 2014) is considered as the reference dataset (http://imdpune.gov.in/ Clim_Pred_LRF_New/Grided_Data_Download.html#), for the estimation of bias. The IMD dataset is constructed by incorporating daily rainfall records information of approximately 6,955 rain gauge stations covering India. However, the availability of gauge station data varied vividly throughout the years. On an average gauge data from 2,600 stations is available per year for construction of data (Pai et al. 2014). Information from all the rain gauge data was subjected to various quality checks and then interpolated to 0.25°Â 0.25 grids using the Inverse Distance Weighted (IDW) scheme. Orographic dependence of rainfall is accurately captured by the IMD rainfall dataset. For instance, heavy rainfall is observed on the windward side of Western Ghats while low rainfall is observed on the leeward side of Western Ghats (Pai et al. 2014;Chaudhary et al. 2017). It is noteworthy that the gauge density is low over the hilly regions of northern India including the Himalayas and Jammu and Kashmir, therefore, IMD rainfall is less reliable over those regions.
The 0.1°Â 0.1°gridded IMERG dataset was resampled to the 0.25°Â 0.25 gridded IMD reference dataset using nearest-neighbour interpolation algorithm for further analysis. Uncorrected Proof METHODOLOGY Bias is defined as the deviation in the magnitude of satellite rainfall from gauge observed rainfall as shown in Equation (1).
where P s is satellite (here, IMERG) precipitation; P g is gauge (here, IMD) precipitation Additionally, we have adopted the Willmott scheme to decompose the total mean square error (MSE) into its systematic and random component (Willmott 1981;AghaKouchak et al. 2012;Prakash et al. 2015). The total, systematic and random MSE is derived using Equation (2).
where P Ã s is defined using a linear regression error model - where a being the slope and b being the intercept.
The left-hand side of Equation (2) represents the total MSE which is decomposed into systematic MSE (middle term in Equation (2)) and random MSE (rightmost term in Equation (2)).

Classification and regression tree (CART)
Classification and regression trees (CART), also called as decision trees are data structures that are developed by recursively partitioning the input space, and defining a local model in each of the partitioned regions of input space (Murphy 2012). The decision trees are graphically represented with a primary root node at the top, subsequently leading to branches (internal nodes) and finally ending in outer nodes called leaves (terminal or decision nodes). A sample regression decision tree is shown in Figure 2, panel a.
The CART tree is progressed at each internal node by splitting the instance space into two or more subspaces according to a certain discrete function of the input attributes values. To understand the CART algorithm, let x ¼ x 1 , x 2 : . . . :: x n be the explanatory (predictor, or features; satellite rainfall in the present study) variables and y ¼ y 1 , y 2 : . . . :: y n be the target response or dependent (predictand or class; corrected satellite rainfall or observed rainfall in the present study) variables. Here n is the number of predictor and predictand variables (the total number of months in this study). The process of building a CART model involves partitioning the input space into distinct and non-overlapping regions and then predict the responses in each region using the mean of all the response variables in that region. In CART, we divide the predictor space (x) i.e., x 1 , x 2 : . . . :: x n into J distinct and non-overlapping regions -… These regions can take any shape and are formed such that they maximize the reduction in the mean squared error (MSE) over all the splitting candidates. To partition the predictor space into J regions, the top-down recursive binary splitting approach is adopted. The splitting initially begins at the top of the tree, where two regions of partitioned, and then successively the above-partitioned regions are further split and the process continues till the optimization criteria are met (MATLAB 2015;Pekel 2020).
To understand the splitting procedure at any node, let's select a node j that is subjected to binary recursive partitioning. The node j has p predictor variables, i.e., x 1 , x 2 : . . . :: x p . The MSE of the responses in node j, i.e., y 1 , y 2 : . . . :: y p is computed using Equation (3).
where y i is the mean response for the predictor variables within the j th box.
The objective is to identify a threshold (th j ) of a split of x j at node j such that two partitioned regions exist, i.e., {xjx j , th j } (left region) and {xjx j ! th j } (right region), where th j is selected such that it leads to maximizing the reduction in MSE (ΔI) over all the splitting threshold candidates, as shown in
Now this process is recursively repeated until the reduction in MSE is maximized in all the regions (R 1 , R 2 : . . . :: R J ). The splitting of nodes in subsequent nodes is also governed by controlling different hyper-parameters of a decision tree. The maximum number of splits of branch nodes (decision splits), minimum number observations in the branch node (size of parent node) and minimum number observations in the leaf (size of leaf node) are the generally controlled hyper-parameters of the tree.
To implement CART, we use the statistical classification and regression tree feature available in the Statistics and Machine Learning Toolbox in MATLAB (MATLAB 2019).

Training and testing of CART
The dataset is split into two distinct parts for training and testing of developing the CART model. In this study, 2001-2011 is considered as the training period (132 monthly datasets for training) and 2012-2016 is considered as the testing period (60 monthly datasets for testing). The training dataset is subjected to 10-fold cross-validation (Waheed et al. 2006;Bhuiyan et al. 2019;Pekel 2020) as shown in Figure 2, panel b. In cross-validation, the training dataset is split into 10 random parts. Out of these 10 parts, one of the parts is retained as the validation data and the remaining 9 data parts are used to train the tree. The cross-validation process is repeated 10 times, till each of the 10 parts serve as the validation data at least once, which yields 10 different decision trees. Cross-validation while training the model leads to improved accuracy of the resulting tree, when it is tested on new data.
In this study, we train the decision trees, by optimizing only one of the hyper-parameterminimum number observations in the leaf (size of leaf node) at each grid over India. Minimum leaf size puts a limit to split a node when the number of observations in one of the subsequent child node is lower than the minimum leaf size. The size of leaf node controls the depth of the decision tree and thus aids in preventing a tree from overfitting the data. We vary the size of leaf node from 0 to 100 (approximately. 50% of data size) and train the regression tree for each of the above leaf node size using 10fold cross-validation. The 10-fold cross-validation loss, which is the total MSE of all the folds is estimated for each of the tree leaf size. Finally, the leaf node size which gives the minimum 10-fold loss is selected as the optimum leaf node size of the training decision tree.
The training of the CART model is performed at each grid of India and the trained CART is subjected over the independent validation period. The performance of the CART model during training and teasing is also verified using the four performance metrics -Root Mean Square Error (RMSE), Coefficient of Determination (R 2 ), adjusted R 2 and Mean Absolute Bias (MAB) as shown in Table 2.  The spatial distribution of 16 years (2001-2016) mean monthly rainfall over India from the observed IMD and satellite-based IMERG dataset during the annual period, winter season (January, February; JF), summer season (March, April, May; MAM), monsoon season (June, July, August, September; JJAS) and post-monsoon season (October, November, December; OND) are shown in Figure 3. The IMERG dataset is effective in capturing the mean annual and seasonal spatial rainfall patterns of the reference IMD dataset. The heavy rainfall over the Western Ghats and North East India during the monsoon season and southern peninsular region during the post-monsoon season is well captured by the IMERG dataset ( Figure 3, panel d,e,i,j). Although spatial rainfall pattern is well represented, slight deviations in intensities of IMERG rainfall with respect to IMD rainfall are observed in the majority of grids over India. IMERG over-estimates the all-India mean rainfall as obtained from IMD, especially during the monsoon period. On an annual scale, 3.1 mm/day of all-India mean rainfall is observed in IMERG slightly higher than all-India mean IMD rainfall 3.0 of mm/day ( Figure 3, panels a, f). The spatial variation of bias intensity, i.e., deviation in the magnitude of IMERG from IMD (IMERG rainfall -IMD rainfall), is estimated for the annual period and different seasons in Figure 3, panels ko. High magnitude of bias is evident over the Western Ghats, North-East India and northern state of Jammu and Kashmir. Over the windward side of Western Ghats, a considerably high dry bias (under-estimation) is observed in the IMERG dataset on contrary to the leeward side where significantly high wet bias (over-estimation) is observed (Figure 3, panel k, n). Consistent dry bias is observed in the northern state of Jammu and Kashmir in IMERG dataset over all the seasons. The IMD rainfall is also highly uncertain over these regions as it is highly mountainous and has significantly less gauge network station density . The spread and intensity of bias are significantly higher during the monsoon season, as evident from Figure 3, panel n. Further, the MSE in satellite-based IMERG rainfall is decomposed into systematic and random components over India as shown in Figure 4. Larger MSE is generally observed over high rainfall regimes (like Western Ghats and North-East India) and foothills of Himalayas. The spatial variations in systematic and random error components over India expressed in percentages of total MSE are shown in Figure 4(b) and 4(c). Larger systematic error observed over these mountainous zones of North-East India and Jammu and Kashmir can be either attributed to higher uncertainty of satellite rainfall over mountainous zones or uncertainty in observed rainfall due to lack in the required number of gauge station to capture rainfall variability in these zones (Prakash et al. 2015). Overall, a higher percentage of random MSE compared to systematic error is observed over India. Systematic error can be potentially be corrected using bias correction schemes, whereas although random error can be reduced but can be statistically quantified (AghaKouchak et al. 2012;Prakash et al. 2015).

Training and testing of CART over India
The optimum leaf size of the decision tree is estimated at each grid over India by minimizing the total 10-fold cross-validation MSE during the training period. Figure 5(a) shows the variation of total crossvalidation loss of 10-folds with different minimum leaf size obtained while training the CART tree on a grid selected (75°E, 15.75°N) over Western Ghats in India. The minimum leaf size of 12, which observes the minimum total 10-folds cross-validation loss of 1.25 (mm/day) 2 , is selected as the calibrated model parameter for the tree. The classification tree developed at the selected grid by using the optimized leaf size can be graphically visualized in Figure 5(b). Here, 'r' denotes the satellite rainfall and the values at node indicate the CART corrected satellite rainfall. At the parent node, the value of the rainfall threshold split is observed to be 2.8 mm/day. If the rainfall is greater than 2.8 mm/day, further if rainfall is greater than 3.6 mm/day, further it is greater than 4.5 mm/day, the corrected satellite rainfall is 4.71 mm/day. Similarly, we can generate rules using the other branches of tree as shown in Figure 5(b).
The optimized minimum leaf size obtained during the training of the CART model at each grid over India is shown in Figure 6, panel a. Higher number (.30) of observations at the leaf nodes (minimum leaf size) were required to train the model in hilly and mountainous regions, where elevation is greater than 3,000 m. On an average minimum, 10 observations were required at the leaf nodes of the CART tree over India. Figure 6, panel b shows the rainfall threshold split at the root node of the CART tree trained over India. Higher (lower) rainfall thresholds were observed at heavy (low) rainfall regions, indicating the possible dependence of root node threshold split on the rainfall pattern over India.

Uncorrected Proof
The performance of the CART model during the training and testing period is analyzed over India using four performance metrics -Coefficient of Determination (R 2 ), adjusted R 2 , Root Mean Square Error (RMSE) and Mean Absolute Bias (MAB). Figure 7, panel a-b shows the spatial distribution of R 2 during the training and the testing period. Higher R 2 is observed in most of the grids in India except for the states of Jammu and Kashmir, leeward side of Western Ghats and few grids in the southernmost corner of India. The all-India average R 2 of 0.77 and 0.66 is observed during the training and testing period respectively. Higher RMSE error was observed over the leeward side of Western Ghats and hilly regions of North East India (Figure 7, panel c-d). Similarly, high magnitude adjusted-R 2 were observed in both the training and testing period as shown in Figure S1, panel a-b. Lower magnitude of MAB, i.e., all-India average of 0.95 mm/day during the training period and 1.2 mm/day during the testing period were also observed ( Figure S1, panel c-d). The above results show that the performance of the CART model is satisfactory and can be used over India for bias correction of monthly satellite estimates. However, over hilly and mountainous regions, the CART model should be cautiously adopted.

Performance of CART over different climate zones of India
The performance of the CART algorithm is evaluated over different climate zones of India. Figure 8 shows box plots that describe the variation of different R 2 and RMSE values in 7 different climate zones of India during the training and testing periods. Least R 2 (,0.4) is observed in the Bwk climate zone during the training and testing periods. Similarly, moderately lower R 2 values, however, with high variability in R 2 (0.3-0.8) are observed in Cwb climate zone. Lower magnitude and higher variability in R 2 observed over Bwk and Cwb climate zone can be attributed to either low reliability of reference dataset (owing to low rain-gauge density in these climatic zones) or possible complexity in capturing the satellite and reference rainfall relationships at very high

Improvement in IMERG precipitation post application of CART
To quantify the improvement in the IMERG monthly satellite rainfall post-application of CART, we calculated the average monthly MAB and RMSE during the validation period. Table 3 shows the average monthly MAB and RMSE obtained before bias correction (between Original IMERG and IMD) and post bias-correction (CART corrected IMERG and IMD). The percentage of improvement in MAB and RMSE obtained was also quantified as shown in Table 3. High magnitudes of MAB (1.44-2.9) and RMSE (2.26-4.76) are observed during the monsoon season (June, July, August and September). All the months, except December, show a reduction in MAB post bias correction by CART. In December, CART was not able to reduce the absolute bias in the original IMERG dataset as an increase in MAB by 6.3% was observed. However, a significant reduction (17.1%) in RMSE was observed during December. Overall, CART was efficient in reducing the all-India average monthly MAB and RMSE in the IMERG dataset. The performance of bias correction algorithm was further analysed over different climate and elevation zones over India. Figure 9 shows the variation of monthly MAB in original and CART corrected IMERG dataset spatially averaged over entire India (panel a) and climate zones (panel b-h) during the testing period (2012)(2013)(2014)(2015)(2016). CART effectively reduced the all-India MAB over the majority of months. Although the total bias was not removed completely, a significant improvement in MAB ranging from 5.3 to 29.2% was observed. The ability of CART algorithm to reduce the MAB in IMERG dataset varied with varying  climate zones. Highest magnitude of MAB was observed in Am climate zone which receives heavy rainfall during the monsoon season. CART based bias correction was able to dampen the higher magnitudes of MAB in original data over all the climate zones except BWk. Over the BWk zone, bias correction with CART algorithm escalated the lower magnitudes of MAB, when compared to the original dataset. Moreover, Am zone observed minimum improvement in MAB. The RMSE in the original IMERG dataset was also significantly reduced post-application of CART as shown in Figure S2. Figure 10 shows the variation of all-India and elevation zones averaged monthly MAB in original IMERG and CART bias-corrected IMERG during the testing period (2012-2016) over India. Seven elevation zones as shown in Table S1 were formulated for evaluating the performance of CART based bias-correction algorithm. Low MAB values were observed in very low elevation zone (,500 m) as well as very high elevation zones (.5,000 m). The performance of CART algorithm was relatively better when elevation was low and deteriorated with increasing elevation. An increase in the magnitude of MAB was observed post bias correction by CART, especially during the months with low MAB in the original IMERG data.
Comparison of CART-based bias correction scheme with other widely adopted schemes Two widely adopted bias correction schemes -Linear scaling (LS) and Equidistant Cumulative Distribution Matching (EDCDF) are selected for comparison with the proposed CART-based bias correction technique. Detailed formulation of LS and EDCDF technique is given in Table S2. Figure 11 shows the spatial variation of average MAB and RMSE obtained from the original IMERG dataset, CART bias-corrected IMERG, LS corrected and EDCDF matching scheme during the testing period (2012-2016) over India. Over most of the grids over India, least MAB and RMSE were observed in CART-based bias correction followed by EDCDF scheme and LS scheme. Overall, least all-India average MAB of 0.94 mm/day was observed in CART, while the other two techniques, i.e., EDCDF matching and LS corrected method observed the all-India average MAB of 1.1 mm/day and 1.3 mm/day. Figure 12(a) shows the monthly variation of all-India average MAB obtained after CART based bias correction, LS bias correction and EDCDF matching method. CART based method and EDCDF matching method performed best in reducing the all-India average MAB, however, for initial months the performance of CART was much better among the two. Similar inferences were drawn when the bias correction schemes were compared based on their average all-India RMSE values ( Figure S3). Better performance of CART model over the other two methods can also be observed in Figure 12(b), wherein CART corrected IMERG dataset shows the highest correlation with IMD reference dataset during the testing period (2012)(2013)(2014)(2015)(2016).

CONCLUSIONS
In the present study, we examine the applicability of Classification and Regression tree-based machine learning algorithm for possible bias reduction of the IMERG precipitation dataset over India. The major conclusions observed in the present study are: • The CART model is highly effective in capturing the bias during the training (average R 2 ¼ 0.77) and testing (average R 2 ¼ 0.66) period. High R 2 (.0.75) and very low variability is observed in all climate zones except for Bwk and Cwb zone.
• Significant improvement in average monthly MAB (À6.3 to 29.2%) and RMSE (8.7-37.3%) was obtained post-application of CART based bias-correction.  • CART based bias correction was able to dampen the higher magnitudes of MAB and RMSE in original data over all the climate zones except BWk, where the application of CART algorithm escalated the lower magnitudes of MAB and RMSE, when compared to the original dataset. • The performance of CART algorithm was relatively better when elevation was low and deteriorated with increasing elevation.
• Better performance of CART model over Linear scaling (LS) and Equidistant Cumulative Distribution Matching (EDCDF) was observed. CART corrected IMERG dataset depicted the highest correlation with IMD reference and least average MAB and RMSE dataset during the testing period.
It is important to mention here some of the limitations of the present study. Firstly, we assumed that the reference gauge-based gridded dataset from IMD to be error-free, which may not be true always. Although errors may exist in the gauge-based datasets, their magnitude would be much smaller than those in the satellite-based datasets. For further improvement in bias reduction, station data available from Doppler weather radar or automatic weather stations can be used as a reference in the future. Since over mountainous regions of India, IMD rainfall data is uncertain due to less gauge network density, in the future, a study can be conducted to compare the IMERG datasets directly with gauge station values for these regions. Secondly, we have trained the CART model on 11 years of monthly data, i.e., 132 values which are quite less for a data greedy algorithm like CART. Nonetheless, satisfactory fitting of the CART model is observed in the present study which aids in bias correction of