An evaluation of CMIP5 precipitation simulations using ground observations over ten river basins in China

Using precipitation measurements from 2,419 ground meteorological stations over China from 1960 to 2005 as the benchmark, the precipitation outputs of 21 individual models from the Coupled Model Intercomparison Project Phase 5 (CMIP5) were evaluated using Taylor diagrams and several statistical metrics. Based on these metrics, the models were ranked by their ability to reproduce precipitation patterns similar to the observations. The results show that the all-model ensemble mean overestimates precipitation in all river basins except the Southeast and Pearl, especially in the Southwest and Northwest. The performance of the CMIP5 models differs considerably among the river basins: most models show significant overestimation in the Northwest and Yellow basins and significant underestimation in the Southeast and Pearl basins. The simulations are more reliable in the Songhua, Liao, Yangtze, and Pearl basins than in the other river basins in terms of both spatial distribution and interannual variability. No individual model performs well in all the river basins both spatially and temporally. In the Songhua, Liao, Yangtze, and Pearl basins, the precipitation indices are more consistent with the observations and the spread among models is smaller. A multimodel ensemble built from the best-performing models shows improved performance relative to the all-model ensemble.

Although models largely agree on changes in global mean temperature and precipitation, the relationship remains unclear at the regional scale because of the highly variable nature of precipitation at this scale (Guilbert et al. ). However, it is regional precipitation that has important socioeconomic effects, owing to its potentially severe impacts on water resources and agriculture (Cao ).

The current study expands the evaluation domain to the entirety of China, focusing on assessing the ability of CMIP5 GCMs to capture regional differences in precipitation pattern and variability. Although the study domain is the entirety of China, the evaluation is carried out at the basin scale. Specifically, the Chinese mainland is divided into ten major river basins based on topographical and hydrological features, and the skill of each CMIP5 model is quantitatively evaluated by comparison with in-situ precipitation observations within each basin. The best- and worst-performing models in the CMIP5 GCM suite are identified based on a set of comprehensive evaluation scores, along with an estimate of uncertainty. In addition to the large domain and the use of multiple evaluation scores, the current study also differs from previous studies of a similar nature in utilizing a very large observational network of more than 2,400 ground stations across China over a period of more than four decades. Because basin-scale precipitation is directly linked to water resources and agriculture, the results from this study can directly inform stakeholders and policymakers who rely on CMIP5 precipitation projections for future planning.
The remainder of this paper is organized as follows: the next section describes the datasets and methodology. This is followed by a quantitative analysis of the errors in the CMIP5 precipitation products and a summary of the evaluation results. The final section provides a discussion and conclusions.

DATASETS AND METHODOLOGY
The skill of 21 CMIP5 GCMs in simulating seasonal and annual precipitation and its interannual variability in ten river basins in China is evaluated by comparison with observations from 2,419 meteorological stations during the historical period 1960-2005, when long-term ground precipitation observations are available. Several quantitative evaluation metrics are employed to rank the models and to identify those with superior performance and those that are least reliable as far as precipitation is concerned. Details about the models, observations, and evaluation metrics are given below.

Data
The simulated monthly precipitation, monthly zonal winds, and geopotential height from 21 CMIP5 models were obtained through the data portals of the Earth System Grid Federation via http://www.ipccdata.org/sim/gcm_monthly/AR5/Reference-Archive.html.
Information about the models, including model names, resolutions, and the institutions that performed the simulations, is given in Table 1. The spatial resolution differs considerably among the models; to facilitate model intercomparison, the simulated monthly precipitation from all models was interpolated onto a common grid. The ten river basins are shown in Figure 1 and listed in Table 2, and their boundaries are defined based on topographic river basin divides (Yang et al. ).
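The regridding step described above can be sketched with a simple bilinear interpolation. The text does not state the target resolution or interpolation scheme, so the 1° common grid, the domain bounds, and the random stand-in precipitation field below are all assumptions for illustration only:

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def regrid_bilinear(field, src_lat, src_lon, dst_lat, dst_lon):
    """Bilinearly interpolate a (lat, lon) field onto a target grid."""
    interp = RegularGridInterpolator(
        (src_lat, src_lon), field, method="linear",
        bounds_error=False, fill_value=np.nan)
    glat, glon = np.meshgrid(dst_lat, dst_lon, indexing="ij")
    pts = np.column_stack([glat.ravel(), glon.ravel()])
    return interp(pts).reshape(glat.shape)

# Hypothetical coarse 2.5-degree model grid covering China...
src_lat = np.arange(15.0, 55.1, 2.5)
src_lon = np.arange(70.0, 140.1, 2.5)
field = np.random.rand(src_lat.size, src_lon.size)  # stand-in monthly precipitation (mm)

# ...interpolated onto an assumed 1-degree common grid inside the source domain.
dst_lat = np.arange(16.0, 54.1, 1.0)
dst_lon = np.arange(71.0, 139.1, 1.0)
out = regrid_bilinear(field, src_lat, src_lon, dst_lat, dst_lon)
```

Because the target grid lies inside the source domain, the interpolated values stay within the range of the input field; cells outside the source domain would be filled with NaN.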

Methods of evaluation
The comparison is carried out on seasonal and annual timescales and can be either spatial or temporal. Here, the Taylor diagram is used to explore the temporal relationship, specifically the relationship between the modeled and observed interannual and interdecadal variability of annual mean precipitation.
The bias of the modeled precipitation is defined as

$$\mathrm{Bias} = \bar{P}_m - \bar{P}_o,$$

where $\bar{P}_m$ and $\bar{P}_o$ are the modeled and observed precipitation averaged over the 46-year ($n = 46$) study period.

The standard deviations (STD) of the observed and modeled precipitation are calculated, respectively, by

$$\mathrm{STD}_o = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(P_{o,t} - \bar{P}_o\right)^2}, \qquad \mathrm{STD}_m = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(P_{m,t} - \bar{P}_m\right)^2},$$

and the normalized STD, NSTD, is given by

$$\mathrm{NSTD} = \frac{\mathrm{STD}_m}{\mathrm{STD}_o}.$$

The centralized RMSE (CRMSE) is defined as follows:

$$\mathrm{CRMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left[\left(P_{m,t} - \bar{P}_m\right) - \left(P_{o,t} - \bar{P}_o\right)\right]^2},$$

which, when normalized by the observed STD, yields

$$\mathrm{NCRMSE} = \frac{\mathrm{CRMSE}}{\mathrm{STD}_o}.$$

The statistics calculated at each grid cell are then averaged over all grid cells in a basin to obtain the basin mean values. The correlation coefficient (CC) quantitatively measures the degree of similarity between two fields, which for the current study are the modeled and observed precipitation:

$$\mathrm{CC} = \frac{\frac{1}{n}\sum_{t=1}^{n}\left(P_{m,t} - \bar{P}_m\right)\left(P_{o,t} - \bar{P}_o\right)}{\mathrm{STD}_m\,\mathrm{STD}_o}.$$

The comprehensive rating metric is

$$M_R = 1 - \frac{1}{n m}\sum_{i=1}^{n} \mathrm{rank}_i, \tag{10}$$

where $m$ denotes the number of models participating in the evaluation (here $m = 21$); $n$ is the number of variables used to evaluate model performance (here $n = 4$, for CC, NSTD, NCRMSE, and Bias); and $\mathrm{rank}_i$ is the ranking of each model based on the $i$th variable. For each variable, the rank of the best-performing model is 1 and that of the worst is 21.
The closer the $M_R$ value is to 1, the closer the simulated precipitation is to the observation, as measured collectively by the four statistical variables; a model ranked 1 for all four variables attains the maximum $M_R$.
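The four statistics and the $M_R$ ranking can be sketched as follows. The synthetic observed and model series are stand-ins, and the conversion of each statistic to a "distance from perfect" (perfect CC = 1, NSTD = 1, NCRMSE = 0, Bias = 0) before ranking is our interpretation of how ranks are assigned, not something the text spells out:

```python
import numpy as np

def stats(p_m, p_o):
    """CC, NSTD, NCRMSE, and Bias for one model's annual series vs observations."""
    a_m, a_o = p_m - p_m.mean(), p_o - p_o.mean()
    std_m, std_o = p_m.std(), p_o.std()
    cc = (a_m * a_o).mean() / (std_m * std_o)
    nstd = std_m / std_o
    ncrmse = np.sqrt(((a_m - a_o) ** 2).mean()) / std_o
    bias = p_m.mean() - p_o.mean()
    return cc, nstd, ncrmse, bias

def rating_metric(score_table):
    """M_R = 1 - (1/(n*m)) * sum(rank_i) for an (m models, n metrics) table
    where each column holds 'badness' values (lower = better)."""
    m, n = score_table.shape
    ranks = score_table.argsort(axis=0).argsort(axis=0) + 1  # rank 1 = best per column
    return 1.0 - ranks.sum(axis=1) / (n * m)

rng = np.random.default_rng(0)
obs = rng.random(46) * 100 + 400              # stand-in 46-year observed annual means
models = obs + rng.normal(0, 50, (21, 46))    # 21 synthetic model series
rows = []
for pm in models:
    cc, nstd, ncrmse, bias = stats(pm, obs)
    rows.append([1 - cc, abs(nstd - 1), ncrmse, abs(bias)])  # distance from perfect
mr = rating_metric(np.array(rows))            # one M_R score per model
```

With $m = 21$ and $n = 4$, a model ranked first on every variable scores $M_R = 1 - 1/21 \approx 0.95$, and a model ranked last on every variable scores 0, so all scores fall in $[0, 1)$.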

Spatial pattern
The overall skill of all models in capturing the spatial distribution of precipitation is summarized with Taylor diagrams. In most basins, the points are clustered together, indicating that the models perform similarly.

Ranking of the models
The overall ranking of the 21 models in simulating temporal variability, as measured by the four statistics CC, NSTD, NCRMSE, and Bias, is determined by the $M_R$ score calculated by Equation (10), where $0 \le M_R < 1$, with a higher score indicating better performance (Figure 8). We choose the best and worst models for each river basin based on the $M_R$ values, as shown in Table 3. In addition, the values of the relative bias (RB) for the top-ranked (best) models based on the $M_R$ scores are compared with those of the bottom-ranked (worst) models and of all models (Figure 9).

The performance of the CMIP5 models in simulating historical precipitation needs to be evaluated thoroughly in order to better interpret the future precipitation projections based on these models (Park et al. ).
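The best/worst/all comparison above can be sketched by selecting models with the highest and lowest $M_R$ scores and computing the relative bias of each equal-weight ensemble mean. The series, the $M_R$ scores, and the cutoff of five models per group are all assumptions for illustration; the text does not state how many models enter each group:

```python
import numpy as np

def relative_bias(p_m, p_o):
    """RB (%) of a precipitation series against observations."""
    return 100.0 * (p_m.mean() - p_o.mean()) / p_o.mean()

def ensemble_rb(series, obs, idx):
    """RB of the equal-weight ensemble mean over the selected model indices."""
    return relative_bias(series[idx].mean(axis=0), obs)

rng = np.random.default_rng(1)
obs = rng.random(46) * 100 + 400            # stand-in observed annual series
series = obs + rng.normal(0, 60, (21, 46))  # 21 synthetic model series
mr = rng.random(21)                         # stand-in M_R scores from the ranking step

order = np.argsort(mr)[::-1]                # descending: best-scoring models first
best, worst = order[:5], order[-5:]         # top/bottom 5 is an assumed cutoff
rb_best = ensemble_rb(series, obs, best)
rb_worst = ensemble_rb(series, obs, worst)
rb_all = ensemble_rb(series, obs, np.arange(21))
```

Comparing `rb_best`, `rb_worst`, and `rb_all` for each basin is one way to check whether an ensemble restricted to the top-ranked models reduces the bias relative to the all-model ensemble.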
The current study evaluated and ranked the performance of 21 CMIP5 models in simulating annual and seasonal precipitation in each of the ten river basins. In addition, we also examined the geopotential height at 500 hPa in the CMIP5 models in summer, along with the model spread, for 1960-2005 (Figure 11). In summer, compared with the AMME (Figure 11(v)), the WPSH of several models, e.g., BNU-ESM, shows an obvious large center of model spread.