## Abstract

Monthly Precipitation Forecasts (MPF) play a critical role in drought monitoring, hydrological forecasting and water resources management. In this study, we applied two advanced Machine Learning Models (MLM) and latest General Circulation Models (GCM) to generate deterministic MPFs with a resolution of 0.5° across China. Then the Bayesian Joint Probability (BJP) modeling approach is employed to calibrate and generate corresponding ensemble MPFs. Raw and post-processing MPFs were put against gridded observations over the period of 1981–2015. The results indicated that: (1) for deterministic evaluation, the forecasting performance of MLMs was more inclined to generate random forecasts around the mean value, while the GCMs could reflect the increasing or decreasing trend of precipitation to some degree; (2) for probabilistic evaluation, the four BJP calibrated ensemble MPFs were unbiased and reliable. Compared to climatology, reliability and sharpness were all significantly improved. However, in terms of overall accuracy metric, the ensemble MPFs generated from MLMs were similar to climatology. In contrast, the ensemble MPFs generated from GCMs achieved better forecasting skill and were not dependent on forecasting regions and months. Moreover, the post-processing method is necessary to achieve not only bias-free but also reliable as well as skillful ensemble MPFs.

## HIGHLIGHTS

Two advanced Machine Learning Models are employed to generate monthly precipitation forecasts with a resolution of 0.5 degree.

The latest seasonal forecasts of ECMWF are evaluated with the same forecasting grid cells and lead time.

The BJP modeling approach is used to calibrate above four raw forecasts.

A comprehensive comparison is achieved for the raw and post-processing forecasts.

## INTRODUCTION

Monthly Precipitation Forecasts (MPF) play a critical role in drought monitoring, hydrological forecasting and water resources management (Schepen *et al.* 2012; Yuan & Wood 2012; Wang *et al.* 2017). With the intensification of climate change in the low and middle latitudes of the Northern Hemisphere, the frequency of extreme rainfall and extreme drought in China is continuously increasing, which further enhances the requirements for accurate and reliable MPF (Peng *et al.* 2014b; Schepen *et al.* 2018; Wang *et al.* 2019a). Theoretically, recent methods to generate MPF can be broadly divided into two main approaches: Machine Learning Model (MLM) and General Circulation Model (GCM) (Yuan *et al.* 2016; Zhao *et al.* 2016; Shen 2018).

For the first approach, MLM implements the MPF by exploring the relationship between past precipitation and climate indices (Bazile *et al.* 2017). Operationally, MLM mainly consists of two components, climate predictors and a regression algorithm (Peng *et al.* 2014a). Here, climate predictors mean a number of large-scale climate indices, such as sea surface temperature, oscillation index, etc. In order to facilitate the research, the China Meteorological Administration (CMA) has established a climate predictors dataset including 130 types of climate indices (referred to hereafter as CI, see Appendix) and has been widely used to drive the regression algorithm (He *et al.* 2018). The regression algorithm is essentially a black box system and is capable of estimating the MPF without fully understanding the effects of climate change and human activities on the precipitation-streamflow-evaporation circle in a certain catchment. Although the regression algorithm contains a large number of different mathematical forms, it can still be roughly divided into the following categories (applied in the hydrological community): (1) Linear regression with regularization, such as Lasso (Jeon *et al.* 2016), Ridge (Jeong *et al.* 2012), and Elastic Network (Park & Mazer 2018); (2) Support Vector Regression with different kernel functions (Chen *et al.* 2010; Liang *et al.* 2018); (3) Deep Learning, which includes but is not limited to Convolutional Neural Network (Qiu *et al.* 2017), Long Short-Term Memory (Zhang *et al.* 2018), and Deep Belief Network (Bai *et al.* 2016); (4) Bagging strategy, which is famous for Random Forest (Wang *et al.* 2015); (5) Boosting strategy, in the forms of Adaptive Boosting (Liu *et al.* 2014), Gradient Boosting Decision Tree (Ma *et al.* 2018), and eXtreme Gradient Boosting (Fan *et al.* 2018). Among the above regression algorithms, the Boosting strategy is greatly developed and widely used in MPF due to the characteristics of solid mathematical theory, low computational cost, and easy implementation.

For the second approach, GCM as a category of climate models can directly generate precipitation forecasts with different lead times. GCM generally consists of numerous and complicated atmosphere, ocean, land components and corresponding dynamic exchange interfaces (Street 2016; Zhao *et al.* 2017). With the constant improvements of model configuration (e.g. structure and resolution), model initialization (e.g. operational analysis system), and ensemble generation (e.g. initial condition perturbations and stochastic perturbations), the GCM approach has become the main tool to implement MPF by most authoritative meteorological research centers (Molteni *et al.* 2011; Yuan *et al.* 2011; Schepen *et al.* 2014; Johnson *et al.* 2019), and has shown advantages in extending the lead time, improving the resolution, and quantifying the uncertainty. For instance, considering the interactions between the atmosphere and ocean allows GCM to simulate long-term phenomena such as phases of the El Niño–Southern Oscillation cycle which is a significant large-scale climate predictor deeply affecting the precipitation in the middle latitude of the Northern Hemisphere (Cao *et al.* 2014). Moreover, National Centers for Environmental Prediction (NCEP) of USA has transitioned to operationally use the Climate Forecast System version 2 (CFSv2) since 2011, to provide real-time forecasts with 24 ensemble members, maximum lead time of nine months, and resolution of 1° (Saha *et al.* 2014; Liu *et al.* 2019). The European Centre for Medium-Range Weather Forecasts (ECMWF) also upgraded the Seasonal Forecast System 5 (SEAS5) in 2017 to replace its predecessor System 4 (SEAS4) which had been operational since 2011 (Johnson *et al.* 2019).

Nonetheless, both MLM and GCM have their own shortcomings that cannot be ignored in practical application. In the matter of MLM, one of the most obvious drawbacks is that the MPF generated is always a deterministic forecast (i.e. single-valued forecast), which is generally considered to be outdated compared with the probabilistic forecast (i.e. ensemble forecast) especially in the sub-seasonal (1–3 months) scale (Bennett *et al.* 2016; Li *et al.* 2019). Duan *et al.* (2019) concluded that the traditional deterministic MPF is inadequate to meet different needs of emergency and water resources managers and that the emerging ensemble forecasting approach is the way forward. One immediate way to obtain ensemble forecasts is to input different scenarios of climate predictors or employ different regression algorithms. However, for a suitable combination of climate predictors and regression algorithms, the forecasts usually show a small variation. As for GCM, although it produces multi-member ensemble forecasts, its incomplete system patterns and unsuitable modeling parametrizations always lead to systematic bias and cannot be used directly by the end-users (Li *et al.* 2017; Shen 2018).

Therefore, in response to the above disadvantages, the post-processing method becomes a necessary step not only to generate an ensemble forecast but also to quantify and reduce the uncertainty. Moreover, the objectives of the post-processing method also contain: (1) correct bias; (2) preserve the raw predictive skills; (3) ensure the ensemble members have a reliable time-space relationship (Clark *et al.* 2004; Li *et al.* 2017). Recently, a Bayesian Joint Probability modeling approach (referred to hereafter as BJP), developed by Wang *et al.* (2009), as a conditional distribution-based statistical post-processing method, has been successfully applied to calibrate the MPF from GCM (Peng *et al.* 2014a; Bennett *et al.* 2016; Zhao *et al.* 2019a). In addition, BJP also has the ability to generate ensemble forecasts based on certain deterministic MPF, which provides an available method to overcome the shortcomings from the MLMs (Zhao *et al.* 2015), and also supports a worthwhile way to compare two MPFs from different sources within the framework of both deterministic and probabilistic forecasting.

In summary, the specific objectives this study aims to realize are as follows: (1) use the climate predictors from CI and two advanced Boosting models of MLMs (namely XGB and LGB) to generate raw MPFs with the resolution of 0.5° (3824 grid cells) covering the majority of the Chinese mainland over the period of 1981–2015; (2) assess the prediction performance of two GCMs from ECMWF (namely SEAS5 and SEAS4) by ensemble means and compare with the above two MLMs in terms of bias, accuracy, and correlation; (3) calibrate the above four raw deterministic MPFs by BJP modeling approach to generate corresponding probabilistic forecasts, and evaluate the four post-processing ensemble MPFs in terms of bias, accuracy, reliability, and sharpness. The paper is organized as follows: the next describes three categories of the dataset (observation, CI, and GCM forecasts) followed by a section providing a brief overview of XGB, LGB, BJP models as well as evaluation measures and metrics. This is followed by the results, discussion and conclusions, respectively.

## DATASET

### Gridded monthly precipitation observation

The observed precipitation is provided by the CMA, with a resolution of 0.5°, a total of 3824 grid cells, and the period of 1981–2015, covering the majority of the Chinese mainland. The original data is daily and comes from 2472 national meteorological stations that have been uniformly distributed throughout China since 1961. The Thin Plate Spline (TPS) spatial interpolation method is employed by CMA to generate gridded observations from sites and has been verified to represent a desirable accuracy. The specific introduction, establishment and assessment can be found in Zhao *et al.* (2014, 2018). Figure 1 shows the monthly mean precipitation that displays an enormous spatial and temporal variability. This increases the difficulty of accurate forecast. Due to the three-steps-distributed feature of Chinese terrain and altitude, the precipitation generally decreases from southeast to northwest, with several distinguishable boundaries. Besides, since China is mainly characterized by a continental monsoon climate, which often represents a co-occurrence of rain and hot summer, the precipitation from April to September is also significantly higher than that in other months.

### 130 types of large-scale climate indices

CI is a continuously improved large-scale climate predictor dataset which has been provided by CMA since 1970. CI contains 88 types of atmospheric circulation (e.g. Pacific Subtropical High of area/intensity/position index), 26 types of sea surface temperature (e.g. Niño anomaly of 1 + 2/3.4), and 16 types of other indices (e.g. landing typhoon, sunspot), which have demonstrated good adaptability and flexibility as predictors to establish monthly, seasonal, even yearly forecasts. The specific description can be found at https://cmdp.ncc-cma.net/Monitoring/cn_index_130.php. Here, we take the forecasting timeline shown in Figure 2 to illustrate the lead time (shown as LD) of predictand and the lag time (shown as LG) of predictors in detail. First, we assume that the lag periods of CI are 12 months, i.e. when the predictand is the forecasts in January, the predictor with the longest duration is in January of the previous year. Then, when we intend to generate MPF of January and February on any day in January (usually at the beginning), the corresponding lead times are 0 and 1, respectively. In this study, we focus on the forecast scenery with LD = 1, i.e. LG from 2 to 13 (as shown in the lower half of Figure 2).

### Seasonal forecast system products

ECMWF has been operating the SEAS4 since November 2011 and it was upgraded to the latest version, SEAS5, in November 2017. SEAS5 is a substantially changed and coupled ocean-atmosphere-land dynamical forecast system. Compared with the predecessor (i.e. SEAS4), SEAS5 has improvement in the following main aspects: (1) The atmosphere model is the ECMWF IFS (Integrated Forecast System) from version 36r4 to 43r1, with a higher resolution from 0.8 to 0.36°; (2) The ocean model is the NEMO (Nucleus for European Modelling of the Ocean) from version 3.0 to 3.4, with a higher resolution from 1 to 0.25° as well as the number of vertical levels from 42 to 75 levels; (3) A sea-ice model named LIM2 has been added; (4) Wave model resolution has improved from 1 to 0.5°; (5) The real-time forecasts still comprise 51 ensemble members but the re-forecasts (also known as hindcasts) members have increased from 15 to 25. Details of the comparisons can be found in Johnson *et al.* (2019). In this study, since the precipitation resolutions of SEAS4 and SEAS5 are 0.75 and 0.4° degrees respectively, Bilinear Interpolation, as an extension of linear interpolation and widely used interpolating function on a rectilinear 2D grid, is used to process the calculation of downscaling (for SEAS4) or upscaling (for SEAS5) to coordinate and unify the spatial resolution issue.

## METHODOLOGY

To assist the reader through the description of the components of this research, a conceptual representation of the calculation process is shown in Figure 3. First, we obtained the raw deterministic MPFs from the XGB and LGB models, as well as obtaining raw ensemble GCMs from SEAS4 and SEAS5. Then a deterministic evaluation was applied to analyse the quality of the raw MPFs. Second, the BJP model was used to calibrate the raw MPFs which contains both MLMs and GCMs approaches to obtain the calibrated MPFs. Third, a probabilistic evaluation was applied to analyse the quality of the calibrated MPFs.

### Extreme gradient boosting

*et al.*(2015) and is widely used in classification and regression issues as an enhanced implementation of the Gradient Boosting Decision Tree (Friedman 2002). The core idea of XGB is to combine a large number of weak learners (e.g. classification and regression tree) into strong learners through continuously fitting residuals. This is obtained by adding regularization into the loss function that not only minimizes the computational cost but also prevents over-fitting. The loss function of XGB can be defined as:where

*l*is a differentiable convex loss function which measures the error of observed

*y*and simulated , is the fraction of leaf nodes, is the regularization parameter, is the minimum loss required to further divide the leaf nodes and

_{i}*J*is the quantity of the leaf nodes. In addition, XGB adopts the viewpoint of parallel computing in Random Forest (Breiman 2001), which obviously improves the calculation efficiency. The detailed information of the XGB model can be found in Chen & Guestrin (2016) and Chen

*et al.*(2015).

### Light gradient boosting machine

Light Gradient Boosting Machine (referred to hereafter as LGB) is a fast, distributed and high-performance framework based on GBDT (Ke *et al.* 2017). Compared with XGB, several obvious differences are as follows. First, LGB further studies the accelerated calculation method based on the histogram algorithm. Second, XGB adopts a level-wise growth strategy, while LGB adopts a more efficient leaf-wise growth strategy with depth restriction on leaf growth. Third, LGB proposes Gradient-based One-Side Sampling (GOSS) so that the training is accelerated under the condition that the precision is hardly affected. Fourth, LGB proposes Exclusive Feature Bundling (EFB) for the optimization of column sampling. Detailed information of the LGB model can be found in Ke *et al.* (2017). Theoretically, these differences play an important role in optimizing the computational performance, but there is no research on MPF showing that the forecast skills of LGB will be better or worse than XGB (Ukkonen & Mäkelä 2019).

### Bayesian joint probability post-processing modeling approach

*et al.*2009; Wang & Robertson 2011; Robertson

*et al.*2013) and has been successfully applied to post-process precipitation (Shrestha

*et al.*2015; Bennett

*et al.*2016) and other hydrometeorological predictions (Zhao

*et al.*2019a, 2019b). The BJP post-processing modeling approach begins with the predictand

*y*(in this case observed precipitation) and predictor

*x*(in this case raw MPF), and then the log-sinh transformation (Wang

*et al.*2012) is used to normalize the variables and homogenize their variances. Mathematically, the log-sinh transformations are given by:where and are transformed variables of

*x*and

*y*, respectively, and and are the transformation parameters. Then, the transformed variables are assumed to follow a bivariate normal distribution:where and . and are the mean and standard deviation for each predictand and predictor; is the correlation coefficient. Then the forecast function could be defined as:

*N*is the length of training period and and are likelihood function and prior distribution of , respectively. The Gibbs sampling based on Markov chain Monte Carlo is the core of BJP to establish the Bayesian parameters inference and generate climatological reference distribution (Zhao

*et al.*2019b). Detailed information on the BJP model can be found in Wang

*et al.*(2009) and Wang & Robertson (2011).

### Evaluation measures and metrics

We employ a leave-one-month-out-cross-validation (lomocv) procedure to evaluate the performance of raw deterministic MPFs, calibrated deterministic MPFs and calibrated ensemble MPFs. Specifically speaking, when we drive XGB and LGB models to generate the gridded MPFs, all available datasets except one month are used to train the parameters and the prediction for the leave-out month is compared to the corresponding observation. Since we have 35 years of observation, the procedure will be repeated 35 times and the corresponding cross-validated predictions will be obtained for 12 months. Similarly, lomocv is also used to infer parameters and generate the ensemble forecast in the BJP post-processing model.

A detailed assessment usually requires several performance metrics to analyze different aspects of forecast quality attributes which normally contain Bias, Accuracy, Correlation, Reliability, Sharpness and Skill.

- (1)Bias measures the difference between average forecast and average observation. Here we use the relative bias (RB) to measure the Bias. The perfect value of RB is 0 and could be defined as:where
*y*is the observation,_{obs}*y*is the raw deterministic MPF or the mean of calibrated ensemble MPF, and_{fct}*T*is the length of verification period (in this case 35). - (2)Accuracy measures the average distance between forecast and observation. Here we use the well-known Continuous Ranked Probability Score (CRPS) for ensemble MPF (Hersbach 2000). The perfect value of CRPS is 0 and can be defined as:where is the cumulative distribution function (CDF) of the forecasts and
*H*is the Heaviside step function and is defined as: - (3)
- (4)Reliability describes how well the forecast agrees with the observation when a specific forecast is issued and here a probability integral transform (PIT) histogram is employed to assess the reliability of ensemble forecasts. PIT is the CDF of the forecast evaluated at observation and is given by:when the ensemble forecast reliably captures the distribution of observation, the observation can statistically be regarded as random samples drawn from . Therefore, the reliability of ensemble spread is shown by the uniformity of the PIT histogram (Laio & Tamea 2007; Duan
*et al.*2019). In addition, in order to compare the reliability with climatology, the PIT Area is also used (Renard*et al.*2010; Schepen*et al.*2018) and is defined by:where is the sorted in increasing order. The PIT Area represents the total deviation of from the corresponding uniform quantile (i.e. the tendency to deviate from the 1:1 line in PIT diagrams). The PIT Area ranges from 0 (perfect reliability) to 1 (worst reliability). - (5)Sharpness describes the concentration of the predicted distribution and indicates the distribution of the ensemble members. Here the 90% interquartile range (IQR) is used to evaluate the sharpness (Crochemore
*et al.*2016) and can be defined as follows:where*Q*(95%) and*Q*(5%) are the 95 and 5% percentiles of the forecast distribution. The final IQR score is the average of the whole interquartile range at a certain period. The narrower the IQR, the sharper the ensemble forecasts. - (6)Skill describes the accuracy of a forecast relative to a reference forecast or benchmark (Duan
*et al.*2019). When the skill score is superior (inferior) to zero, this means that the forecast is more (less) skillful than the reference. When it is equal to zero, the forecasts and the reference have equivalent skill (Crochemore*et al.*2016). Here CRPS Skill Score (CRPSS), PIT Area Skill Score (PITSS), and IQR Skill Score (IQRSS) are computed for the probabilistic comparison. The three skill scores can be calculated by:

Here we refer to climatology as the reference and the climatology is calculated by the BJP method (Wang *et al.* 2019b).

## RESULTS

### Quality of raw deterministic MPFs

Bias in raw deterministic MPFs is analyzed by RB using Figure 4. Each row corresponds to a month and each column corresponds to a model. In general, the MLMs overall outperform the GCMs, representing a smaller RB throughout 12 months. From February to October, almost half of the grid cells are slightly underestimated (shown in light red), and others are slightly overestimated (show in light blue). From November to January, some unideal dark blue pixels occur in the middle and northwest, but with no obvious aggregation. RB in XGB is the smallest overall, varying from –20 to 20%, and is distributed evenly throughout the mainland, slightly better than LGB. In contrast to MLMs, the MPFs from SEAS4 and SEAS5 are badly overestimated (show dark blue) in most month and grid cells, especially during January–April. For April–November, both of them show positive and negative alternation in other regions except for the negative RB of less than –40% in the northwest region. Meanwhile, the distribution of RB always shows an obvious aggregation. These systematic errors are most likely caused by the incomplete simplification of hydrological cycle mechanisms and unsuitable modeling parametrizations.

The accuracy of raw deterministic MPFs is analyzed by MAE in Figure 5, which shows a clear dividing line conforming to the geographic elevation. The distribution of MAE exhibits certain common temporal and spatial distribution characteristics of MAE among four models, particularly for November–April. The MAE values of these months are mainly distributed in three intervals. Most of them vary from 0 to 10 in the northwest; some are 10–30 in the middle, and the rest are less than 50 in the southeast. In terms of May–October, the MAE values are worse than other months, while the worst results are detected in July and August. Some MAE values even exceed 70, which means an unacceptable accuracy. It can be found that the MAE generally increases from northwest to southeast, matching the distribution of precipitation. Meanwhile, it is also difficult to tell the best model in terms of MAE. The performance of accuracy seems to depend on the magnitude of precipitation rather than the difference in models, which confirms the results in Liu *et al.* (2017) and Tian *et al.* (2017).

As mentioned above, the PCC between the raw MPF and the observation is a key parameter used by BJP in the calibration process. For this reason, Figure 6 represents the PCC of four models and only contains the significantly positive correlations at a 90% confidence level (according to a t-test). Contrary to the results in Figures 4 and 5, the GCMs overall outperform the MLMs. The number of PCC grid cells satisfying the above conditions in XGB and LGB are sparse over 12 months and far lower than those in SEAS4 and SEAS5. In addition, the PCC values in MLMs are also detected to be much smaller than in GCMs and mainly vary between 0.3 and 0.6. In contrast, PCC values of SEAS4 and SEAS5 basically cover the mainland except for some northwest grid cells, mainly varying between 0.4 and 0.9. It can be concluded that although MLMs display better performance in term of bias and accuracy, the MPFs tend to be a phenomenon of random prediction due to poor PCC values. Meanwhile, although the RB and MAE of GCM are not ideal, both SEAS4 and SEAS5 can more effectively reflect the increasing or decreasing trend of precipitation than XGB and LGB.

### Quality of BJP calibrated ensemble MPFs

As introduced above, we employed the BJP post-processing modeling approach to calibrate the four raw deterministic MPFs, which aims to correct bias and generate corresponding 1000-member ensemble MPFs. Figure 7 illustrates the RB between four calibrated ensemble MPFs and observation. As expected, in contrast to the results of Figure 4, post-processed MPFs are less biased than raw MPFs. The BJP method is effective at reducing RB of MLM-based and GCM-based MPFs. However, for November and December on the Qinghai-Tibet Plateau, the MPFs suffer negative RB, the magnitude of which varies between –30 and 0%. This is mostly caused by inaccurate precipitation observation or extreme precipitation events. Due to the sophisticated underlying surface environment, the error of precipitation observation may be magnified in dry months. For the southeast coast with dense observation sites, the RB for calibrated MPFs ranges from approximately 0 to 10%. This suggests that BJP is a useful method to generate bias-free ensemble MPFs. Meanwhile, it is also evident that post-processing is a necessary step before using the MLM-based and GCM-based MPFs in hydrological forecasting.

The accuracy of calibrated ensemble MPFs is analyzed by CRPSS using Figure 8. The climatology as the reference is also calculated by the BJP method (Wang *et al.* 2019b). Here we consider that white pixels (CRPSS between −5 and 5%) show neutrally (little or no) skillful forecasts (Zhao *et al.* 2017). Blue pixels (>5%) show positively skillful forecasts and red pixels (<− 5%) show negatively skillful forecasts. Overall, there is no great difference between XGB and LGB, with approximately 90% values of CRPSS located in white pixels throughout all months. The other 10% values mainly vary from 5 to 15, scattering sporadically on all the grid cells. For the SEAS4 and SEAS5, except that about 40% values of CRPSS are similar with climatology, the other 60% values tend to be positive, varying between 5 and 35, sometimes more than 35. Although there are several red pixels in the southwest and northwest of China, the proportion is very small. Moreover, the predictive performance of summer (June–August) is not as good as the other three seasons. Basically, SEAS5 shows a slightly better performance than SEAS4 and represents the most optimal forecasting skill in four models.

Reliability is analyzed in terms of PIT histograms (Figure 9) and PITSS plots (Figure 10). The PIT histograms visualize the distribution of PIT values to reflect the property of reliability. As shown in Figure 9, the PIT histograms exhibit inconsiderable variability between different models and different months. The red dotted line means an average level. Since there are 3824 grid cells and the period of loocv is 35 years, the average level for each month in this study is 3824 × 35/10 = 13384. The values of the frequency corresponding to 10 bins in the *x*-axis float around the red line (i.e. the PIT histograms concentrate uniform probability density in all ten bins), which presents good reliability. Figure 10 clearly illustrates that four calibrated ensemble MPFs are more reliable than climatology, with the emergence of a large proportion of blue pixels. It should be noted that in terms of reliability, the predictive performance is too close to distinguish.

Sharpness is analyzed using Figure 11. It should be noted that without reliability, a sharp forecast is misleading (Crochemore *et al.* 2016). Four calibrated ensemble MPFs are overall sharper than climatology in the large majority of grid cells. Some exceptions appear for December, and especially in the edge of the Qinghai-Tibet Plateau of west China. The IQRSS of SEAS4 and SEAS5 is much better than that of XGB and LGB, confirmed by a double proof based on a larger area and darker blue. Moreover, the obvious aggregation pixels, especially in the southeast coastal area, reflect the remarkable improvement of calibrated ensemble MPFs on becoming sharper and gaining skill.

## DISCUSSION

As illustrated in the introduction section, massive MLMs have been widely implemented in the MPFs, and the prediction performance of these deterministic forecasts has been verified around the world. However, most previous studies always focus on small catchments, with not many precipitation stations. This kind of small-scale evaluation is unconvincing to some extent. As a key process of applying MLMs to MPFs, hyperparameter optimization may be the most important factor affecting the forecasting skill. We have reason to believe that by excessively optimizing the hyperparameters, the small-scale predictions will be in good agreement with the observations.

Therefore, in order to comprehensively evaluate the deterministic forecasting performance, we have generated gridded MPFs with a resolution of 0.5° (3824 grid cells), almost spanning the entire Chinese mainland. In this case, it is almost impossible to repeatedly and accurately optimize the hyperparameters of each grid cell (although the continuous development of the Automatic Machine Learning method provides the possibility for the subsequent research of gridded hyperparameter optimization, it has not yet been applied to the hydrological community). The application of the same hyperparameters to all grid cells provides certain evidence for the general adaptation and performance of forecasting. Meanwhile, XGB and LGB have been proven to have advanced structures and weak sensitivity to hyperparameters (Chen *et al.* 2015; Chen & Guestrin 2016; Ke *et al.* 2017), which is also the basis of this prediction attempt. Moreover, due to the variety of large-scale climate predictors, as well as the complexity of the relationship between the precipitation and predictors, the main purpose of this study is to analyze the difference of MPFs between MLMs and GCMs in the above assumptions, rather than to explore what are the significant predictors that affect the precipitation in a specific region.

What calls for special attention is the evaluation of the second part. The forecasting skill of raw ensemble MPFs from SEAS4 and SEAS5 are much worse than climatology (not shown), this is the reason why we chose to compare calibrated GCMs with climatology instead of raw ensemble GCMs. Meanwhile, although only the ensemble mean values are used as the input of the calibration process, the BJP modeling approach has been proven to be a viable option for producing ensemble time-series MPFs (Hawthorne *et al.* 2013; Peng *et al.* 2014b; Schepen & Wang 2014; Schepen *et al.* 2014; Khan *et al.* 2015; Shrestha *et al.* 2015).

In addition, Molteni *et al.* (2011) and Johnson *et al.* (2019) have shown that the forecasting performance of SEAS4 and SEAS5 decreased (e.g. in terms of CRPSS) with the lead time increasing from 0 to one month. However, the 130 types of climate predictors provided by CMA have a one-month lag time, i.e. 0-month lead time forecasts cannot be generated by MLMs in this study (as shown in the upper half of Figure 2). For this reason, this study focuses on the comparison with the one-month lead time. In the future, we will continue the research to evaluate the forecasting performance between MLMs and GCMs with different lead times.

## CONCLUSIONS

In this study, 130 types of large-scale climate predictors and two advanced MLMs (namely XGB and LGB) were applied to generate deterministic MPFs with a resolution of 0.5° across China. Meanwhile, the latest ECMWF's GCMs (namely Seasonal Forecast System 5 and its predecessor System 4) were used to compare with the above two MLMs for the same lead time and grid cells. Moreover, the BJP post-processing modeling approach was employed to calibrate the raw deterministic MPFs and to generate the corresponding ensemble MPFs. Raw and post-processing MPFs were put against gridded observations with different precipitation regimes for all months over the period of 1981–2015.

Deterministic evaluation of raw MPFs was evaluated by RB, MAE, and PCC in terms of bias, accuracy, and correlation. Compared with GCMs, the results indicated that the MLMs represented obviously smaller RB (Figure 4), similar MAE (Figure 5) as well as poorer PCC values (Figure 6). It can be concluded that the forecasting performance of MLMs was more inclined to generate random forecasts around the mean value. In contrast, due to the significant PCC values, the GCMs could reflect the increasing or decreasing trend of precipitation to some degree. However, the forecasting performance of raw deterministic MPFs was strongly dependent on forecasting regions and months.

Probabilistic evaluation of post-processing MPFs was evaluated by RB, CRPSS, PITSS, and IQRSS in terms of bias, accuracy, reliability, and sharpness. Since the BJP modeling approach had been proven to be effective in calibrating raw forecasts, we directly compared the post-processing ensemble MPFs with climatology that was also generated by BJP. The results indicated that the four post-processing ensemble MPFs were unbiased (Figure 7) and reliable (Figure 9). Meanwhile, in contrast to climatology, reliability (Figure 10) and sharpness (Figure 11) were all significantly improved. The results of the overall accuracy metric (Figure 8) showed that the ensemble MPFs generated from MLMs were similar to climatology, with no obvious difference. In contrast, the ensemble MPFs generated from GCMs achieved better forecasting skills and were not dependent on forecasting regions and months.

## ACKNOWLEDGEMENTS

The study was financially supported by the Major Project of Zhejiang Natural Science Foundation, China (LZ20E090001), and the Science and Technology Project of Zhejiang Provincial Water Resources Department (RB2012). We gratefully acknowledge the anonymous editors and reviewers for their insightful and professional comments, which greatly improved this manuscript.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.