Evaluation of CMIP6 models in reproducing observed rainfall over Ethiopia

Ethiopia is highly susceptible to the effects of climate change and variability. This study evaluated the performance of 37 CMIP6 models against the gridded rainfall product for Ethiopia known as Enhancing National Climate Services (ENACTS) in simulating the observed rainfall from 1981 to 2014. Taylor diagrams and the Taylor Skill Score (TSS) were used to rank the performance of individual models for mean monthly, June–September, and February–May seasonal rainfall. Comprehensive rating metrics (RM) were used to derive the overall ranks of the models. Results show that the performance of the models was not consistent in reproducing rainfall distributions across statistical metrics and timeframes. More than 20 models simulated the largest dry bias over the high-elevation and high-rainfall areas of the country during the June–September season. The RM-based overall ranks of the CMIP6 models showed that GFDL-CM4 is the best-performing model, followed by GFDL-ESM4, NorESM2-MM, and CESM2, in simulating rainfall over Ethiopia. The ensemble of these four GCMs showed the best performance in representing the spatiotemporal patterns of the observed rainfall relative to the ensemble of all models. Overall, this study highlights the existence of dry bias in climate model projections for Ethiopia, which calls for bias adjustment of the models before they are used for impact assessment.


INTRODUCTION
Climate variability and change have a substantial impact on the livelihoods and economies of people in the Horn of Africa, including Ethiopia (Reinman 2012). To address the impact of climate change, the Ethiopian government is implementing a National Adaptation Plan (NAP) and has developed a Climate Resilient Green Economy (CRGE) strategy as the country's development policy framework. The country aims to pursue further integration of climate change adaptation proactively and iteratively in development policies and strategies (FDRE 2019). It is, therefore, important to have reliable future climate projections for the country in order to build a successful NAP, devise efficient CRGE strategies, and prevent setbacks associated with climate change.
The Coupled Model Intercomparison Project Phase 6 (CMIP6) models show improvements in spatial resolution and in the representation of physical processes relative to the previous CMIP5 models (Eyring et al. 2016). Hence, various studies have used CMIP6 models to study the climate characteristics of observed datasets over different parts of the globe (Zhu & Yang 2020; Ayugi et al. 2021; Babaousmail et al. 2021; Ngoma et al. 2021; Faye & Akinsanola 2022; Makula & Zhou 2022). These authors reported that CMIP6 performed better than CMIP5 in some regions and worse in others in simulating climate characteristics. Moreover, most of those studies indicate that the performance of models in reproducing the observed climate can vary over regions and across CMIP6 models.
A few studies have evaluated CMIP5 models over Ethiopia (Jury 2015; Li et al. 2016; Dyer et al. 2020), and they indicated that CMIP5 models are still prone to substantial biases in simulating seasonal and annual rainfall over the country. Similarly, only a few studies have evaluated and utilized CMIP6 models in Ethiopia (Alaminie et al. 2021; Dyer et al. 2022; Fetene et al. 2022). Despite the importance of assessing how well different CMIP6 models replicate the mean monthly and seasonal rainfall both spatially and temporally, there is no national study that investigates the performance of different CMIP6 models relevant to impact studies in different sectors. This study aimed to fill that knowledge gap by evaluating the CMIP6 models against the most reliable gridded dataset produced specifically for Ethiopia, Enhancing National Climate Services (ENACTS).
The study evaluates the capability of CMIP6 models to represent the spatiotemporal rainfall distribution before they are employed in projecting future climate for building adaptation capacity and a climate-resilient strategy at the national scale. It also selects the optimum number of models for generating a CMIP6 ensemble that can be used for future climate projection and impact studies over Ethiopia. The spatial and temporal evaluation of CMIP6 rainfall products was conducted for mean monthly rainfall, the main rainy season during June-September (JJAS), and the short rainy season during February-May (FMAM). The two seasons account for about 90% of the total rainfall over the country (Philip et al. 2018).

Study area description
Ethiopia is located between latitudes 3°N and 14°N and longitudes 33°E and 48°E. The country is characterized by complex topography, with marked contrasts in relief and altitudes ranging from about 155 m below sea level in the Northeast to about 4,533 m above sea level in the Northern highlands (Figure 1). The climate of Ethiopia is characterized by three distinct seasons, locally named Belg, Kiremt, and Bega (Gissila et al. 2004; Korecha & Barnston 2007). Belg is the small rainy season from February to May (FMAM). Much of the Northeastern, Central, Southern, Southwestern, Eastern, and Southeastern parts of the country receive a modest rainfall amount during this season (Gissila et al. 2004). Kiremt is the main rainfall season from June to September for most parts of the country except for the lowlands of Southern and Southeastern Ethiopia. This season (JJAS) contributes 65-90% of Ethiopia's annual rainfall and is responsible for 80-95% of the production of food crops in the country (Gissila et al. 2004). Bega, from October to January, is mostly a dry season for most parts of the country except for the Southwest and the lowlands of Southern and Southeastern Ethiopia.

ENACTS and CHIRPS
Rainfall measurements from ground-based rain gauge observations in Ethiopia cannot provide adequate and timely information due to insufficient spatial coverage, poor representativeness, and many missing records (Dinku et al. 2014). Thus, ENACTS was developed through a three-track approach that involved the National Meteorological and Hydrological Services (NMHS), the International Research Institute for Climate and Society (IRI), and the University of Reading (UoR) in order to produce high-resolution gridded climate data time series (Dinku et al. 2014).
The ENACTS climate dataset derives from two main kinds of data. The first is station rainfall data from the national weather stations managed by the NMHS. The Ethiopian NMHS contributed quality-controlled ground-based data from more than 600 rain gauge stations. The second is satellite rainfall estimates from the European Organization for the Exploitation of Meteorological Satellites (EUMETSAT). TAMSAT satellite rainfall data were calibrated over Ethiopia using the more than 600 quality-controlled station records to generate the best possible satellite rainfall estimates. Station rainfall data were used to evaluate and correct the errors in the spatially complete satellite products, which in turn were used to fill spatial and temporal gaps in the station observations. The combined station and satellite product was generated at daily or dekadal (10-day) time scales for all available stations, and the final products are datasets with over 40 years of rainfall time series produced on a nominal 4 km grid across the country. Most global satellite products use only 17 stations obtained through the Global Telecommunication System (GTS), while the ENACTS data incorporate over 600 stations. The result is a distinctive, high-quality dataset that is more reliable than any other long-term, satellite-based time series dataset (Dinku et al. 2014). ENACTS products are a 'gold standard' blending of the best local datasets and global data, and they can be produced sustainably so that new data can be regularly added to the historical archive (Dinku et al. 2018).
ENACTS has a spatial resolution of 0.0375° × 0.0375° and covers Ethiopia from 1981 to 2020. Hence, in this study, the ENACTS gridded dataset was used as the observational reference to evaluate the performance of CMIP6 models in simulating rainfall characteristics for the period 1981-2014 over Ethiopia.
In addition, monthly CHIRPS data for the period 1981-2014 were used as an additional observed dataset to evaluate the climate models. CHIRPS v2.0 is a quasi-global (50°S-50°N) gridded product with a spatial resolution of 0.05° × 0.05° at daily, pentad, dekadal, and monthly temporal resolution (Funk et al. 2015). CHIRPS v2.0 is available online at http://chg.geog.ucsb.edu/data/chirps. A detailed description of CHIRPS satellite rainfall products is available in Funk et al. (2015).

CMIP6 model data
This study used monthly precipitation data from the historical simulations of 37 CMIP6 models for 1981 to 2014, the same period as the observed reference gridded datasets. The simulated rainfall data were obtained from the Earth System Grid Federation (ESGF) archives (https://esgf-node.llnl.gov/search/cmip6). The details of the CMIP6 models regarding the modeling groups, countries, and horizontal resolution are presented in Table 1.

Quantitative statistical measures
The models were first evaluated on their ability to capture the annual rainfall cycle and the mean seasonal climatology (FMAM and JJAS seasons). The ability of the CMIP6 models to replicate the climate properties of the study area was assessed using spatial comparisons. Since the resolutions differ among most models and observations, all data were regridded to a common 1° × 1° grid using bilinear interpolation to ensure a uniform resolution (Makula & Zhou 2022). The performance of each model in simulating rainfall was also evaluated using statistical metrics. The metrics used in this study were mean bias (MB), root-mean-squared error (RMSE), and the Pearson correlation coefficient (PCC) for the period 1981-2014. These quantitative metrics have been employed in many studies in Africa (Dyer et al. 2020; Akinsanola et al. 2021; Alaminie et al. 2021; Ngoma et al. 2021; Makula & Zhou 2022) and are summarized in Table 2 along with their equations. Taylor diagrams (Taylor 2001) were used to summarize the performance of each model with respect to the observations in terms of the PCC, the RMSE, and the standard deviation (SD).
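As a minimal sketch of the three metrics named above (MB, RMSE, and PCC), the following Python functions use the standard textbook definitions. Table 2 is not reproduced in this text, so the exact forms used in the study are assumed rather than confirmed; the function names and the sample values are hypothetical.

```python
import math

def mean_bias(model, obs):
    """Mean bias: average of (model - observation); positive means wet bias."""
    return sum(m - o for m, o in zip(model, obs)) / len(obs)

def rmse(model, obs):
    """Root-mean-squared error between model and observation."""
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))

def pearson_cc(model, obs):
    """Pearson correlation coefficient between model and observation."""
    n = len(obs)
    mean_m = sum(model) / n
    mean_o = sum(obs) / n
    cov = sum((m - mean_m) * (o - mean_o) for m, o in zip(model, obs))
    var_m = sum((m - mean_m) ** 2 for m in model)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cov / math.sqrt(var_m * var_o)

# Hypothetical grid-point rainfall values (mm), not taken from the study
obs = [10.0, 20.0, 30.0, 40.0]
model = [12.0, 18.0, 33.0, 45.0]

print("MB   =", mean_bias(model, obs))
print("RMSE =", rmse(model, obs))
print("PCC  =", pearson_cc(model, obs))
```

In a spatial evaluation the two lists would hold the model and reference values at matching grid points after regridding to the common grid.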
The Taylor Skill Score (TSS) is a statistical summary of PCC, RMSE, and SD (Taylor 2001). It was used to assess and rank individual CMIP6 models in reproducing the mean monthly and seasonal spatial patterns of observed rainfall with respect to ENACTS and CHIRPS. This approach has been applied successfully in several recent studies over Africa and the East Africa region to rank model capability (Dyer et al. 2020; Ayugi et al. 2021; Makula & Zhou 2022). TSS is calculated using the following equation (Taylor 2001).
where PCC is the Pearson correlation coefficient between the reference and model outputs; σ_m and σ_r are the model and reference standard deviations, respectively; and PCC_0 is the highest achievable correlation coefficient, set at 1. The closer the value of TSS is to 1, the better the agreement between the simulation and the observation. Ranks were computed for the individual models using TSS for mean monthly, JJAS, and FMAM rainfall with respect to ENACTS and CHIRPS. The comprehensive rating metric (RM) was then used to obtain the overall ranks of the CMIP6 models over the entire study area (Chen et al. 2019). An RM value closer to 1 indicates a better-performing CMIP6 model, whereas an RM closer to 0 indicates a poorly performing model. The overall ranks (RM) of the GCMs were calculated using the following equation.
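The skill score described above can be sketched in Python as follows. Taylor (2001) proposes two forms of the score that differ only in the exponent applied to the correlation terms (1 or 4); which one was used here is not stated in this text, so the exponent is left as a parameter and its default of 4 is an assumption. The function name and the sample inputs are hypothetical.

```python
def taylor_skill_score(pcc, sigma_m, sigma_r, pcc0=1.0, exponent=4):
    """Taylor (2001) skill score.

    pcc      : correlation between model and reference
    sigma_m  : model standard deviation
    sigma_r  : reference standard deviation
    pcc0     : maximum attainable correlation (set to 1 in the text)
    exponent : power on the correlation terms; Taylor (2001) gives
               variants with 1 and 4 (the default here is an assumption)
    """
    if sigma_m <= 0 or sigma_r <= 0:
        raise ValueError("standard deviations must be positive")
    ratio = sigma_m / sigma_r  # normalized standard deviation
    numerator = 4.0 * (1.0 + pcc) ** exponent
    denominator = (ratio + 1.0 / ratio) ** 2 * (1.0 + pcc0) ** exponent
    return numerator / denominator

# Hypothetical inputs: a model with the right variance but imperfect correlation
print(taylor_skill_score(pcc=0.9, sigma_m=50.0, sigma_r=50.0))
# A model with perfect correlation but half the observed variability
print(taylor_skill_score(pcc=1.0, sigma_m=25.0, sigma_r=50.0))
```

The score reaches 1 only when the correlation equals PCC_0 and the two standard deviations match, which is why both a poor spatial pattern and a wrong amplitude lower a model's rank.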
where n and m represent the number of GCMs and the number of timeframes, respectively, and i represents the rank of a GCM in the mean monthly, JJAS, and FMAM timeframes. The linear trends of the best-performing models were further assessed with the modified Mann-Kendall (m-MK) test, which was used to test the significance of the trends. Sen's slope and the P-value were determined using the 'modifiedmk' R package.
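The rating metric defined through n, m, and the per-timeframe ranks can be sketched as below. The equation itself is not reproduced in this text; the form used here, RM = 1 − (sum of ranks) / (n × m), is an assumption that matches the definitions above and gives values close to 1 for top-ranked models and close to 0 for the lowest-ranked ones. The function names and the three-model rank table are hypothetical.

```python
def rating_metric(ranks, n_models):
    """Comprehensive rating metric for one model (assumed form).

    ranks    : the model's rank in each timeframe (1 = best)
    n_models : total number of models being ranked (n)
    """
    m = len(ranks)  # number of timeframes
    if any(r < 1 or r > n_models for r in ranks):
        raise ValueError("each rank must lie between 1 and n_models")
    return 1.0 - sum(ranks) / (n_models * m)

def overall_ranking(rank_table):
    """Sort models by RM, best first. rank_table maps name -> list of ranks."""
    n = len(rank_table)
    scores = {name: rating_metric(ranks, n) for name, ranks in rank_table.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical ranks for (mean monthly, JJAS, FMAM); not values from Table 3
example = {
    "MODEL_A": [1, 2, 1],
    "MODEL_B": [2, 1, 3],
    "MODEL_C": [3, 3, 2],
}
for name, rm in overall_ranking(example):
    print(name, round(rm, 2))
```

With 37 models and three timeframes, this form yields an RM of about 0.96 for a model whose ranks sum to 4 and about 0.01 for one whose ranks sum to 110.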

Generating ensembles of CMIP6
Identifying and selecting an ensemble of better-performing GCMs is important in climate projection in order to reduce the uncertainty associated with future projections (Pour et al. 2018). Seven different ensembles were compared and evaluated against ENACTS using the spatial distribution of rainfall, quantitative statistical measures, and TSS in order to select the optimum number of models for generating ensembles. The weighted average (WA) and simple average (SA) methods were employed to generate the ensembles based on RM values using Equations (3) and (4), respectively. The performance of the ensembles produced by the two methods was then compared.
The weighted average was computed using Equation (3) and the SA-based ensembles were calculated using Equation (4), where WA is the weighted average of the models; SA is the simple average of the models; m_i represents the assessed value obtained from the CMIP6 rainfall dataset; RM_i represents the value of the overall rank of the CMIP6 model with respect to ENACTS; and n is the number of data pairs.
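The two averaging schemes can be illustrated with a short Python sketch. Equations (3) and (4) are not reproduced in this text, so the weighted form below, which divides the RM-weighted sum by the sum of the RM values, is the conventional weighted average and is assumed rather than confirmed. Function names and input values are hypothetical.

```python
def simple_average(values):
    """SA: every model contributes equally."""
    return sum(values) / len(values)

def weighted_average(values, weights):
    """WA: each model value m_i is weighted by its rating metric RM_i.

    Normalizing by the sum of the weights is an assumption made here.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical rainfall at one grid point from three models (mm)
# and hypothetical RM values for those models
rain = [100.0, 120.0, 80.0]
rm = [0.96, 0.90, 0.50]

print("SA =", simple_average(rain))
print("WA =", weighted_average(rain, rm))
```

Because the weights favor higher-ranked models, the WA estimate is pulled toward the values of the better performers, whereas SA treats a poorly ranked model the same as the best one.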

Mean annual and seasonal cycle precipitation
ENACTS shows a bimodal rainfall pattern over Ethiopia, with peaks in August and May for the JJAS and FMAM seasons, respectively (Figure 2). CHIRPS captured this bimodal pattern and the rainfall rates very well. Nevertheless, compared to ENACTS, it underestimated rainfall by amounts ranging from 1.2 mm in January to 24 mm in August. A high degree of spread was observed among the CMIP6 models in capturing the annual and seasonal rainfall distributions (Figure 2). Six models (16%) (E3SM-1-0, HadGEM3-GC31-LL, MPI-ESM1-2-HR, MPI-ESM1-2-LR, MIROC6, and UKESM1-0-LL) represented the shape of the observed annual cycle more correctly, even though they showed some discrepancy in the amount of rainfall. MIROC6 simulated the highest rainfall amount in all months, whereas E3SM-1-0, HadGEM3-GC31-LL, MPI-ESM1-2-LR, and UKESM1-0-LL reproduced the observed rainfall rates and patterns of the annual cycle more accurately. Out of the 37 CMIP6 models, 13 (35%) captured the general pattern of the main rainy season (JJAS). On the other hand, 18 models (49%) simulated rainfall peaks in September and October rather than in August during the main rainy season. The thirteen models, except for FGOALS-f3-L and MPI-ESM1-2-LR, simulated more rainfall than the observed seasonal peak during the JJAS season. The pattern of the small rainy season (FMAM) was replicated with reasonable skill by all but three of the 37 CMIP6 models (CAMS-CSM1-0, CNRM-CM6-1, and CNRM-CM6-1-HR). However, seven models underestimated the FMAM peak while 27 models overestimated it. UKESM1-0-LL and MPI-ESM1-2-LR showed better JJAS seasonal rainfall patterns and peaks, with slight differences of +2 and −2.8 mm/month, respectively. The IITM-ESM model represented the FMAM peak value accurately, with a slight difference of 0.7 mm from the observed peak rainfall, followed by NESM3, IPSL-CM6A-LR, and UKESM1-0-LL.
Figure 3 shows that the maximum rainfall occurs during the JJAS season in the West and Northwest, followed by the Central, South, and Southeast parts of the country. CHIRPS and most CMIP6 models could generally capture this geographical distribution of observed rainfall over Ethiopia (Figure 3). However, more than 25 models were unable to simulate the maximum seasonal rainfall amount in the West and Northwest, although they reproduced the low rainfall amount in the Southeast and South lowlands of the country. Out of the 37 CMIP6 models, 20 (54%) underestimated the observed seasonal rainfall over the high rainfall-receiving areas of the country (West and Northwest), with the highest underestimation by the FGOALS-f3-L, CNRM-CM6-1, CNRM-CM6-1-HR, and CNRM-ESM2-1 models. In contrast, six models (ACCESS-ESM1-5, BCC-CSM2-MR, CESM2-WACCM, CESM2, MIROC6, and NorESM2-MM) overestimated the observed seasonal rainfall over the same areas during the JJAS season. GFDL-CM4, GFDL-ESM4, and E3SM-1-0 performed best in replicating the spatial distribution of observed JJAS rainfall, while CNRM-CM6-1, CNRM-CM6-1-HR, CNRM-ESM2-1, FGOALS-g3-L, and INM-CM4-8 reproduced the seasonal rainfall pattern poorly.

Spatial distribution of seasonal rainfall
The highest FMAM rainfall was received over the Southwest and Southeast highlands of the country, followed by Central Ethiopia, and it decreased towards the North, Northeast, and Southeast parts of the country (Figure 4). CHIRPS showed better performance in capturing the spatial pattern of rainfall during the FMAM season than during the JJAS season. However, most of the models captured the geographical distribution of observed rainfall less well during the FMAM season than during the JJAS season. Moreover, more than 30 CMIP6 models were incapable of capturing observed rainfall over the Southwest and Southeast highlands, where the highest FMAM rainfall is observed. Out of 37 models, 24 (65%) underestimated the maximum observed rainfall, with the highest underestimation by the ACCESS-CM2 and FGOALS-f3-L models over the Southwest and Southeast highlands. In contrast, six models (ACCESS-ESM1-5, BCC-CSM2-MR, CESM2-WACCM, CESM2, MIROC6, and MRI-ESM2-0) overestimated the observed rainfall, with the highest overestimation by MIROC6 over the same areas during the FMAM season. BCC-ESM1 and NorESM2-MM were more capable of simulating the spatial variations of the observed FMAM rainfall than other models. On the contrary, ACCESS-CM2, ACCESS-ESM1-5, CAMS-CSM1-0, FGOALS-f3-L, NESM3, SAM0-UNICON, and UKESM1-0-LL were unable to accurately represent the spatial distribution of the observed FMAM rainfall, especially in high rainfall-receiving areas.
CHIRPS showed rainfall bias in the range −50 to 50 mm in most parts of the country during the JJAS season (Figure 5). The CMIP6 models varied in how well they simulated the JJAS seasonal rainfall. Eight of the models (ACCESS-ESM1-5, BCC-CSM2-MR, CanESM5, CESM2-WACCM, CESM2, MIROC6, MRI-ESM2-0, and NorESM2-MM) depicted the highest wet bias, of more than 100 mm, over the high-rainfall and high-elevation areas of the country. On the contrary, CNRM-CM6-1, CNRM-CM6-1-HR, FGOALS-f3-L, and CNRM-ESM2-1 depicted the highest dry bias, ranging from −100 mm to −326 mm, over the same areas during the JJAS season. Generally, out of the 37 models, 13 (35%) portrayed a wet bias in the observed JJAS rainfall ranging from 37 to 350 mm at most grid points of the country, while 17 models (46%) showed a dry bias in seasonal rainfall in the range −40 to −330 mm. CESM2-WACCM-FV2, E3SM-1-0, GFDL-CM4, GFDL-ESM4, MPI-ESM1-2-HR, and NESM3 performed well in simulating the JJAS rainfall, with a relatively low bias of less than 50 mm in most parts of the country, with the exception of large wet and dry biases in some parts of the country.

Statistical analysis of CMIP6 models
The MB of the 37 CMIP6 models and CHIRPS with respect to ENACTS showed that CHIRPS exhibited little bias in the mean monthly (−1.8 mm), JJAS (5.3 mm), and FMAM (1.8 mm) timeframes (Figure 7(a)). The MB in mean monthly simulated rainfall was between −36.62 and 52.71 mm, while JJAS and FMAM rainfall biases were in the range of −79.58 to 71.78 mm and −31.69 to 60.61 mm, respectively. More than half of the models revealed wet bias in mean monthly rainfall, whereas many of the models simulated dry bias in both seasons. Dry bias was more widespread during the FMAM season (25 models, close to 68%) than during the JJAS season (20 models). However, the magnitude of the dry bias was larger in the JJAS season (22 models) than in the FMAM season (15 models) (Figure 7(a)). FGOALS-g3 exhibited the lowest simulation bias of 0.48 mm for mean monthly rainfall, while BCC-CSM2-MR and EC-Earth3 revealed the lowest biases of 0.70 and 0.30 mm for the FMAM and JJAS seasons, respectively. Further analyses of the JJAS seasonal rainfall bias time series for some models are presented in Figure A1. Rainfall bias changed significantly from one year to the next in most models during the JJAS season (Figure A1). The simulated rainfall bias for GFDL-CM4, CESM2-WACCM-FV2, and MPI-ESM1-2-HR was in the range of −24.6% (1992) to 91.7% (1981), −51% (2002) to 65% (1984), and −42.1% (1993) to 64.9% (1981), respectively, during the JJAS season. The biases in GFDL-ESM4, E3SM-1-0, NESM3, CESM2, CESM2-WACCM, MPI-ESM1-2-LR, and BCC-CSM2-MR also varied over time, as shown in Figure A1.
The RMSE results show that CHIRPS exhibited a larger error in JJAS (31.74 mm) compared to mean monthly (22.89 mm) and FMAM (20.65 mm) (Figure 7(b)). The distribution of RMSE varied across models and timeframes. The simulated error in mean monthly rainfall was in the range of 11.88 mm (GFDL-ESM4) to 83.06 mm (CNRM-CM6-1). The RMSE of simulated rainfall across all models during JJAS and FMAM ranged from 35.01 mm (GFDL-CM4) to 114.78 mm (CNRM-CM6-1) and from 11 mm (GFDL-ESM4) to 74.7 mm (MIROC6), respectively. Most of the models revealed higher RMSE when simulating rainfall for the JJAS season compared to the FMAM season. Three and thirty of the CMIP6 models showed RMSE values greater than 50 mm in simulating observed rainfall during the FMAM and JJAS seasons, respectively. GFDL-CM4 showed a low RMSE value of 11.9 mm in mean monthly rainfall, while GFDL-ESM4 revealed low RMSE values of 11 and 35 mm for the FMAM and JJAS seasons, respectively.
The PCC of the 37 CMIP6 models and CHIRPS was evaluated for mean monthly rainfall and the two seasons relative to the ENACTS dataset. Results indicated that CHIRPS showed the best correlation, with PCC values of 0.92 and 0.91 for JJAS and FMAM rainfall, respectively (Figure 8). All of the models revealed a positive correlation with observed rainfall patterns during the whole period. The models had varied PCC values ranging between 0.154 and 0.91 in JJAS and between 0.31 and 0.88 in the FMAM season, while the corresponding mean monthly value ranged from 0.17 to 0.89. CESM2-WACCM and NorESM2-MM revealed the highest correlation, with PCC values of more than 0.8 in the whole period. MPI-ESM1-2-LR, NorESM2-MM, GFDL-CM4, and NESM3 showed the highest correlation, with PCC values of over 0.85, when simulating mean monthly observed rainfall. NESM3, GFDL-CM4, GFDL-ESM4, and E3SM-1-0 were the models that best captured the spatial distribution of observed JJAS rainfall, with the highest PCC values of greater than 0.9. GFDL-ESM4, MPI-ESM1-2-LR, GFDL-CM4, and NorESM2-MM performed relatively well, with PCC values of greater than 0.8 during FMAM. On the other hand, CNRM-CM6-1 showed the lowest skill, with a PCC value of less than 0.3 in the whole period. The fine-resolution model, CNRM-CM6-1-HR, had a PCC value of less than 0.5 in mean monthly and JJAS rainfall, while it had a higher PCC value of 0.65 during the FMAM season over Ethiopia.
TSS was used to rank the CMIP6 models for simulating mean monthly, FMAM, and JJAS rainfall over the country (Table 3). Large variation was found in the TSS of simulated rainfall among different models and timeframes (Table 3). TSS values of the models for simulating mean monthly rainfall ranged from 0.27 to 0.89. For the FMAM and JJAS seasons, the values ranged from 0.35 to 0.8 and 0.17 to 0.91, respectively. Thirteen of the models had a TSS of more than 0.8 in mean monthly rainfall, 11 of the models in JJAS, and 3 of the models in FMAM. GFDL-CM4, GFDL-ESM4, NorESM2-MM, NESM3, and CESM2-WACCM were the best-performing models for mean monthly rainfall, with the highest TSS values of greater than 0.85. GFDL-CM4, GFDL-ESM4, NESM3, CESM2, CESM2-WACCM, and NorESM2-MM showed the best JJAS rainfall representation, with TSS exceeding 0.85. During the FMAM season, GFDL-ESM4, GFDL-CM4, and NorESM2-MM adequately replicated the observed seasonal rainfall, with higher TSS values of more than 0.80. CNRM-CM6-1 and CNRM-ESM2-1 were the least skilled models in simulating the observed mean monthly and JJAS rainfall, with TSS values of less than 0.4, while CNRM-CM6-1 and FGOALS-f3-L poorly simulated the FMAM rainfall patterns.
The overall ranks were assigned to each model based on the RM values (RM_TSS) to indicate their performance in representing the spatial characteristics of rainfall (Table 3). The value of the RM_TSS metric varied between 0.01 (last rank) and 0.96 (first rank) over the 37 ranks (Table 3). GFDL-CM4, GFDL-ESM4, NorESM2-MM, CESM2, CESM2-WACCM, and NESM3 occupied the first six positions, with RM_TSS values of greater than 0.80, and are hence regarded as the most skilful CMIP6 models in terms of TSS. GFDL-CM4 and GFDL-ESM4 were the only CMIP6 models that performed well for all timeframes with respect to TSS. On the other hand, CNRM-ESM2-1, FGOALS-f3-L, CNRM-CM6-1, ACCESS-CM2, and CNRM-CM6-1-HR were the last five, with RM_TSS values of less than 0.2, and can be regarded as the least performing CMIP6 models for Ethiopia. It is important to note that the finest-resolution CMIP6 model (CNRM-CM6-1-HR) showed the least skill in simulating rainfall, signifying that a finer resolution is not a guarantee of better performance. Two CMIP6 GCMs (GFDL-CM4 and GFDL-ESM4), which were developed by the same modeling center (Geophysical Fluid Dynamics Laboratory (GFDL), USA), ranked first and second in the overall ranking of the models (Table 3). In contrast, most CMIP6 models developed by the same modeling center differed in rank. For example, CESM2, CESM2-WACCM, and CESM2-WACCM-FV2, which were developed by the same institution (National Center for Atmospheric Research, USA), ranked 4th, 5th, and 23rd, respectively. Similarly, differences in rank among models developed by the same modeling centers were also found between the models INM-CM5-0 and INM- (Table 3).

Comparison of different ensembles
The weighted (WA) and SA methods for generating ensembles were compared, and the results are presented in Tables A2, A3, and A4. All ensembles showed improved bias (Table A2) and RMSE (Table A3) with the WA method relative to the SA method. The WA-based ensembles revealed minimum bias (less than 7 mm) and RMSE (less than 40 mm) in all timeframes relative to the SA-based ensembles. The highest TSS values (greater than 0.75) were found in the WA-based ensembles relative to the SA-based ensembles in all timeframes (Table A4). Thus, WA generated better-performing ensembles than SA and was used for comparing ensembles in this study. Spatially, the high rainfall over the Northwest and West parts of the country was captured well by all ensembles during the JJAS season (Figure 3). Furthermore, ENSEMB_4 and ENSEMB_8 outperformed all 37 individual CMIP6 models and the other ensembles in capturing the spatial distribution of JJAS (Figure 3) and FMAM (Figure 4) seasonal rainfall. All seven ensembles showed relatively weak simulation bias, ranging between −50 and 50 mm and between −40 and 40 mm in most parts of the country during the JJAS (Figure 5) and FMAM (Figure 6) seasons, respectively.
All ensembles showed wet and dry bias in the mean monthly and JJAS timeframes, respectively (Table 4). Except for ENSEMB_4 and ENSEMB_8, they showed dry bias during the FMAM season. The RMSE of simulated rainfall for all ensembles ranged between 16.12 mm (ENSEMB_4) and 29.06 mm (ENSEMB_ALL) during the FMAM season and between 25.09 mm (ENSEMB_4) and 50.41 mm (ENSEMB_ALL) during the JJAS season. ENSEMB_4 and ENSEMB_8 exhibited lower RMSE values, of less than 30 mm, in simulating observed rainfall during the whole period. The error was larger during the JJAS season relative to the FMAM season in all ensembles. TSS values of the ensembles for simulating mean monthly rainfall were in the range of 0.82-0.88, while those for the FMAM and JJAS seasons were 0.78-0.84 and 0.81-0.91, respectively. ENSEMB_4 showed the best performance in representing the spatial distribution of seasonal rainfall, with the highest TSS value of greater than 0.84, while ENSEMB_ALL revealed the lowest performance compared to the other ensembles. Table A1 presents the estimated RM values and overall ranks of the 37 models and 7 ensembles with respect to bias, RMSE, PCC, and TSS. Compared to individual CMIP6 models, ENSEMB_4, ENSEMB_10, ENSEMB_20, ENSEMB_8, ENSEMB_27, and ENSEMB_16 occupied the first six positions in the RM_BIAS values, while NorESM2-MM, BCC-ESM1, CMCC-CM2-HR4, and GFDL-ESM4 attained ranks 7, 8, 9, and 10, respectively. ENSEMB_4, ENSEMB_8, GFDL-ESM4, ENSEMB_10, ENSEMB_16, ENSEMB_20, and ENSEMB_20 occupied the first seven positions in the RM_RMSE values, while MIROC6, CNRM-ESM2-1, and CNRM-CM6-1 attained the last three ranks in the RM_BIAS and RM_RMSE values. GFDL-CM4 occupied the first position in the RM_PCC value, followed by GFDL-ESM4, MPI-ESM1-2-LR, and ENSEMB_4, whereas CNRM-CM6-1, CNRM-ESM2-1, and CNRM-CM6-1-HR occupied the last three positions, with the lowest RM_PCC values, in that order.
Compared to individual CMIP6 models, Table A1 shows that ENSEMB_4, ENSEMB_8, ENSEMB_16, ENSEMB_20, ENSEMB_10, ENSEMB_27, and ENSEMB_ALL ranked in the 3rd, 4th, 6th, 7th, 8th, 12th, and 15th positions in the RM_TSS values, respectively.

Evaluation of models with CHIRPS
The performance of the models was further evaluated using CHIRPS as the reference dataset with respect to TSS, and the results are presented in Table 5. Thirteen of the models had a TSS of more than 0.8 in mean monthly rainfall, 9 models in JJAS, and 8 models in FMAM. CanESM5 showed the best skill, with a TSS value of 0.9, in simulating mean monthly rainfall, followed by BCC-ESM1, GFDL-CM4, NESM3, and MPI-ESM1-2-LR. MPI-ESM1-2-LR revealed the best performance in reproducing the JJAS rainfall pattern, with the highest TSS value of 0.91, followed by NESM3, BCC-CSM2-MR, and MPI-ESM1-2-HR. CanESM5 showed the best skill in representing FMAM rainfall, with a TSS value of 0.98, followed by BCC-ESM1, GFDL-CM4, GFDL-ESM4, and NESM3. Based on the RM_TSS values (Table 5), six models (NESM3, GFDL-ESM4, MPI-ESM1-2-LR, GFDL-CM4, BCC-CSM2-MR, and E3SM-1-0, in that order) outperformed the other individual models against CHIRPS, with RM_TSS values greater than 0.8. CNRM-CM6-1 revealed the lowest RM_TSS value (0.00), preceded by FGOALS-f3-L, CMCC-CM2-HR4, IPSL-CM6A-LR, and CNRM-CM6-1-HR. Some models, such as NESM3, GFDL-CM4, and GFDL-ESM4, showed RM_TSS values greater than 0.8 against both CHIRPS (Table 5) and ENACTS (Table 3), even though their positions differ. The fine-resolution model, CNRM-CM6-1-HR, showed the least skill in simulating rainfall relative to both the CHIRPS and ENACTS datasets.
Table 6 presents the mean rainfall, Sen's slope, and P-value for the four top-ranking models together with ENACTS and CHIRPS during 1981-2014. The four top-ranking models are GFDL-ESM4, GFDL-CM4, NorESM2-MM, and ENSEMB_4. ENACTS depicts an insignificant increasing trend in rainfall, at rates of 0.10 and 0.59 mm/year for the mean monthly and FMAM timeframes, respectively, at the 5% significance level. On the contrary, the JJAS rains exhibit an insignificant decreasing trend at a rate of −0.49 mm/year.
The CHIRPS trend has a smaller magnitude than the ENACTS trend in mean monthly rainfall and in both seasons. According to CHIRPS, significant and insignificant positive trends were observed at rates of 0.33 and 0.14 mm/year for JJAS and mean monthly rainfall, respectively. An insignificant negative trend at a rate of −0.11 mm/year was found for FMAM rainfall. All four models were consistent in showing insignificant positive trends over Ethiopia for mean monthly rainfall and both seasons, and they had lower absolute trend magnitudes than ENACTS in both seasons. GFDL-ESM4, GFDL-CM4, NorESM2-MM, and ENSEMB_4 slightly overestimated the observed JJAS rainfall, while they moderately overestimated FMAM and mean monthly rainfall.
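The trend statistics reported in Table 6 (Sen's slope and P-value) can be illustrated with a short Python sketch. Note that this is the classical Mann-Kendall test with no correction for ties or serial correlation, shown only to make the calculation concrete; it is not the modified variant applied through the 'modifiedmk' R package in this study, and its P-values can differ when a series is autocorrelated. The function names and the sample series are hypothetical.

```python
import math
from statistics import median

def sens_slope(series):
    """Sen's slope: median of all pairwise slopes for equally spaced years."""
    n = len(series)
    slopes = [(series[j] - series[i]) / (j - i)
              for i in range(n - 1) for j in range(i + 1, n)]
    return median(slopes)

def mann_kendall(series):
    """Classical Mann-Kendall test (no tie or autocorrelation correction).

    Returns the S statistic, the standardized Z score, and the
    two-sided P-value from the normal approximation.
    """
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)  # sign of the difference
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return s, z, p_value

# Hypothetical seasonal rainfall totals (mm), one per year; not study data
rain = [610.0, 598.0, 625.0, 640.0, 618.0, 655.0, 630.0, 662.0]
slope = sens_slope(rain)
s_stat, z_score, p_value = mann_kendall(rain)
print(f"Sen's slope = {slope:.2f} mm/year, S = {s_stat}, p = {p_value:.3f}")
```

A trend is called significant at the 5% level when the P-value falls below 0.05; the sign of Sen's slope gives its direction.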

DISCUSSION
The main focus of this study was to evaluate the performance of CMIP6 models in reproducing the observed mean monthly and seasonal rainfall over Ethiopia from 1981 to 2014 using ENACTS data. ENACTS reveals a bimodal rainfall pattern with two peaks at the national scale. This is due to the north-south movement of the ITCZ, or tropical rain belt, which oscillates between 15°S and 15°N throughout the year (Nicholson 2018). More than 30 models performed better in capturing the short rainy season cycle (FMAM) than the long rainy season cycle (JJAS). This is because the majority of the models (22) cannot simulate the August rainfall peak of the JJAS season; instead, they produce peak rainfall in September and October. This might be related to the broader East African rainfall pattern, which has two rainfall seasons, in MAM and OND. However, JJAS rainfall is of great significance to the rain-fed agricultural economy of the Northwest part of Ethiopia. This season is responsible for 80-95% of the production of food crops in the country (Gissila et al. 2004). This emphasizes the need for improved simulation of the seasonal cycle of the main rainy season (JJAS). The majority of CMIP6 models showed a better spatial correlation in the main rainy season (JJAS) than in the short rainy season (FMAM) over most parts of the country. This may be due to higher rainfall variability during FMAM compared to JJAS (Bekele et al. 2017), which is often difficult to capture in climate models (Kamruzzaman et al. 2022). However, more than 15 and 20 CMIP6 models simulated a dry bias over most parts of the country in the JJAS and FMAM seasons, respectively. This was more pronounced in areas of high topography and high rainfall (Northwest and Central parts of the country). The dry bias reported for the previous-generation CMIP5 models over the study area (Jury 2015; Li et al. 2016; Dyer et al. 2020) has not been significantly improved in CMIP6 models.
A similar dry bias in the main rainy season has been reported for CMIP5 and CMIP6 models in East Africa. This might be related to a weak Hadley circulation (Chadwick & Good 2013) and to large-scale dynamics and southward shifts of the ITCZ over East Africa (Yang et al. 2015). Dyer et al. (2022) demonstrate that biases in CMIP5 and CMIP6 are related to Southern Ocean warm biases and Western Indian Ocean warm biases, respectively.
Moreover, the fine-resolution model CNRM-CM6-1-HR strongly underestimated the observed JJAS seasonal rainfall, whereas medium- and coarse-resolution models such as BCC-ESM1, E3SM-1-0, GFDL-CM4, and NESM3 simulated JJAS rainfall over the study area better. This indicates that the underestimation of observed rainfall in this study may not be remedied simply by increasing the models' horizontal resolution. This is mainly due to challenges posed by complex topography, especially in the East African region, which are tied to the models' physical parameterization. Once the representation of microphysical parameterizations is improved, increases in model resolution are also likely to contribute to more accurate simulation of precipitation, especially in mountainous regions (Champion et al. 2011).
Our results show a wide range of model skill in simulating observed rainfall across time scales and statistical metrics, which leads to different rankings of the models. Based on the overall ranks of CMIP6 models using RM_TSS values, GFDL-CM4 showed the best performance, followed by GFDL-ESM4, NorESM2-MM, CESM2, CESM2-WACCM, NESM3, E3SM-1-0, MPI-ESM1-2-LR, and BCC-CSM2-MR, in simulating rainfall over Ethiopia with respect to ENACTS (Table 5). Generally, considering all the metric values (RM_BIAS, RM_RMSE, RM_PCC, and RM_TSS), the best performance was observed for GFDL-ESM4, NorESM2-MM, and ENSEMB_4, while the worst was observed for the CNRM-CM6-1, CNRM-ESM2-1, and CNRM-CM6-1-HR models. These results are consistent with previous studies over Ethiopia (Dyer et al. 2020; Alaminie et al. 2021), Uganda, North Africa, and East Africa (Makula & Zhou 2022). The failure of individual models to capture the observed spatial patterns of precipitation might be due to the complex temporal and spatial dynamics of precipitation over the East African region, differences in initial and boundary conditions (Makula & Zhou 2022), and poor representation of convective schemes and model parameterization (Ongoma et al. 2019).
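To make the ranking procedure concrete, the sketch below illustrates one way a Taylor Skill Score and a rank-based rating metric could be computed. It assumes the widely used form TSS = 4(1 + R)^2 / [(σ_m/σ_o + σ_o/σ_m)^2 (1 + R0)^2] with a maximum attainable correlation R0 of 1, and RM = 1 - (sum of ranks) / (number of models × number of rankings). These formulas are assumptions consistent with the reported range of RM_TSS (down to 0.00 for the lowest-ranked model) but are not stated in this section; the function names and sample values are hypothetical.

```python
import math


def taylor_skill_score(model, obs, r0=1.0):
    """Assumed TSS form: 4(1+R)^2 / ((s_m/s_o + s_o/s_m)^2 * (1+R0)^2)."""
    n = len(obs)
    mean_m = sum(model) / n
    mean_o = sum(obs) / n
    sd_m = math.sqrt(sum((m - mean_m) ** 2 for m in model) / n)
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    cov = sum((m - mean_m) * (o - mean_o) for m, o in zip(model, obs)) / n
    r = cov / (sd_m * sd_o)   # pattern correlation coefficient
    ratio = sd_m / sd_o       # normalized standard deviation
    return 4.0 * (1.0 + r) ** 2 / ((ratio + 1.0 / ratio) ** 2 * (1.0 + r0) ** 2)


def rating_metric(ranks, n_models):
    """Assumed RM form: 1 - sum(ranks) / (n_models * len(ranks)); rank 1 is best."""
    return 1.0 - sum(ranks) / (n_models * len(ranks))


# Hypothetical monthly rainfall climatology in mm (illustrative only)
obs = [20.0, 35.0, 70.0, 95.0, 90.0, 120.0, 210.0, 220.0, 130.0, 60.0, 30.0, 18.0]
sim = [15.0, 30.0, 60.0, 80.0, 85.0, 100.0, 160.0, 170.0, 150.0, 90.0, 35.0, 15.0]
print(f"TSS: {taylor_skill_score(sim, obs):.2f}")

# A model ranked 2nd, 5th and 1st out of 37 in three separate TSS rankings
print(f"RM: {rating_metric([2, 5, 1], n_models=37):.2f}")
```

Under these assumed definitions, a model that ranks last in every ranking receives an RM of exactly 0, and a model that ranks first throughout approaches 1 as the number of models grows.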
CHIRPS showed the best performance in representing the reference observed rainfall (ENACTS) based on spatial analysis, quantitative statistical metrics, and TSS. Furthermore, CMIP6 models were evaluated against CHIRPS using TSS in order to assess their performance relative to a different observational dataset. Some models, such as NESM3, GFDL-CM4, and GFDL-ESM4, were among the most skillful, while CNRM-CM6-1, CNRM-ESM2-1, and CNRM-CM6-1-HR were the poorest, against both CHIRPS and ENACTS. However, most models showed different skill with respect to ENACTS and CHIRPS, even though some performed similarly against both gridded datasets. Thus, the relative performance of the CMIP6 models may depend on the choice of reference gridded dataset, as different gridded observational datasets differ substantially (Akinsanola et al. 2021; Faye & Akinsanola 2022).
Various studies use different numbers of top-ranked GCMs for generating ensembles, since there is no standard for selecting the number of top-ranked GCMs (Alaminie et al. 2021; Babaousmail et al. 2021; Makula & Zhou 2021; Ngoma et al. 2022). We found that the weighted average method showed better skill than the SA method in reducing errors, as it gives more weight to higher-performing models. Overall, all ensembles outperformed most individual models in representing temporal and spatial rainfall distribution. Six ensembles (ENSEMB_4, ENSEMB_8, ENSEMB_10, ENSEMB_16, ENSEMB_20, and ENSEMB_27) performed better on most of the statistical metrics than the weighted average of all models. In particular, the weighted average of the four top-ranked models (ENSEMB_4) best reproduced spatial and temporal rainfall variations, while the weighted average of all models (ENSEMB_ALL) showed the least skill. Moreover, ENSEMB_4 was the only ensemble that outperformed the other ensembles and the individual models on all metrics. This indicates the importance of model selection for climate change impact assessment in a given region. A similar result was reported for the earlier CMIP5 models over Ethiopia (Jury 2015; Dyer et al. 2020). Dyer et al. (2020) showed that the ensemble average of 30 CMIP5 models produced weak bimodal seasonality in the Awash basin due to large discrepancies among individual models. An average of the better models may provide more confident guidance than an average of all models, since the latter is prone to dry bias (Jury 2015). Kang et al. (2014) also stated that the predictive capability of an ensemble does not significantly improve beyond a certain number of GCMs. A model that cannot properly represent the current climate system may not add any improvement to the ensemble.
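The difference between the two ensemble methods can be illustrated with a short sketch. It assumes that the weighted average uses weights proportional to each model's skill score (for example, its TSS), normalized to sum to one; the exact weighting scheme is not given in this section, so this choice, the function names, and the sample values are all hypothetical.

```python
def simple_average(fields):
    """Equal-weight ensemble mean, computed grid point by grid point."""
    n = len(fields)
    return [sum(vals) / n for vals in zip(*fields)]


def weighted_average(fields, skills):
    """Ensemble mean with weights proportional to skill (assumed weighting scheme)."""
    total = sum(skills)
    weights = [s / total for s in skills]
    return [sum(w * v for w, v in zip(weights, vals)) for vals in zip(*fields)]


# Hypothetical seasonal rainfall (mm) at three grid points for three models
fields = [
    [400.0, 650.0, 900.0],   # higher-skill model
    [380.0, 600.0, 850.0],   # higher-skill model
    [200.0, 300.0, 450.0],   # lower-skill model with a strong dry bias
]
skills = [0.9, 0.85, 0.3]    # hypothetical skill scores, not study values

print("Simple average:  ", [round(v, 1) for v in simple_average(fields)])
print("Weighted average:", [round(v, 1) for v in weighted_average(fields, skills)])
```

In this illustrative case the weighted mean is pulled less toward the low-skill, dry-biased member than the simple mean, which reflects the reasoning given above for preferring the weighted method.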

CONCLUSION AND RECOMMENDATION
This study evaluated the performance of CMIP6 models in reproducing observed rainfall at the national level for the period 1981-2014. Most of the CMIP6 models performed poorly over the areas of high topography and high rainfall in the Northwest and Central parts of the country, while they performed well over the areas of low topography and low rainfall in the Southeast. This indicates that the effect of high topography on precipitation is still a challenge in climate modeling. Overall, CMIP6 models showed a dry bias in simulated rainfall over the study area similar to that of CMIP5 models. The four top-ranked models (GFDL-CM4, GFDL-ESM4, NorESM2-MM, and CESM2) and the weighted average of those models (ENSEMB_4) can be applied in projecting climate using CMIP6 under different SSPs for Ethiopia.
The findings of this study provide important information to model developers and end-users of the datasets. More detailed research is still needed to identify the sources of dry bias and improve the poor performance of the models, since future rainfall projections will be of great importance to the country's economy. This calls for modeling groups to improve the simulation of seasonal rainfall through a better understanding of its complex dynamics over Ethiopia. Users of climate model output should exercise caution when employing the models in seasonal climate change studies, as dry bias exists in the seasonal rainfall simulations; bias correction is therefore important before the models are used for future climate projection and impact studies.
Despite the robust findings of this study, there are some limitations. One limitation is the use of the same two rainfall seasons (FMAM and JJAS) for all regions of the country, since rainfall seasonality varies considerably across Ethiopia. Thus, this study recommends evaluating the CMIP6 models over the different homogeneous rainfall zones of the country for a more detailed assessment of model performance. The results offer useful information about the varying performance of CMIP6 models over the whole country and can serve as a reference for the new phase of climate models over the continent.