## Abstract

Groundwater recharge plays a vital role in replenishing aquifers, sustaining demand, and reducing adverse effects (e.g. land subsidence). In order to manage climate change-induced effects on groundwater dynamics, climate models are increasingly being used to predict current and future recharges. Even though there has been a number of hydrological studies that have averaged climate models’ predictions in a Bayesian framework, few studies have been related to the groundwater recharge. In this study, groundwater recharge estimates from 10 regional climate models (RCMs) are averaged in 12 different Bayesian frameworks with variations of priors. A recession-curve-displacement method was used to compute recharge from measured streamflow data. Two basins of different sizes located in the same water resource region in the USA, the Cedar River Basin and the Rainy River Basin, are selected to illustrate the approach and conduct quantitative analysis. It has been shown that groundwater recharge prediction is affected by the Bayesian priors. The non-Empirical Bayes g-Local-based Bayesian priors result in posterior inclusion probability values that are consistent with the performance of the climate models outside the Bayesian framework. With the proper choice of priors, the Bayesian frameworks can produce good results of groundwater recharge with R^{2}, percent bias error, and Willmott's index of agreement of >0.97, <2%, and >0.97, respectively, in the two basins. The Bayesian framework with an appropriate prior provides opportunity to estimate recharge from multiple climate models.

## HIGHLIGHTS

The choice of prior affects the suitability of Bayesian formulation for averaging recharge.

All RCMs tend to underestimate groundwater recharge.

Non-EBL-based priors result in posterior inclusion probabilities consistent with the RCM performance.

Bayesian frameworks produce a better recharge estimate than individual RCMs.

## INTRODUCTION

Globally, groundwater supplies 70% of water needs for irrigation (Rosegrant *et al.* 2009; Siebert *et al.* 2010; Long *et al.* 2020). About 25% of total freshwater consumption in the USA comes from groundwater (Meixner *et al.* 2016). Besides, groundwater is a source of safe drinking water to 90% of the US rural population (Niraula *et al.* 2017).

To assess the effect of potential climate change on groundwater resources, groundwater recharge is increasingly being modeled using climate models (Risser *et al.* 2005; Allen *et al.* 2010; Crosbie *et al.* 2010). The regional climate models (RCMs) use land surface models (LSMs) to partition precipitation from GCMs to surface runoff, evapotranspiration, and drainage (Mearns *et al.* 2012). These hydrological variables are often archived in databases such as North American Regional Climate Change Assessment Program (NARCCAP 2007; Mearns *et al.* 2013). Most LSMs are one-dimensional (1-D) vertical soil columns that partition precipitation to various hydrological outputs at relatively smaller spatial scales. Previous studies have successfully used RCMs and/or LSMs hydrological products to simulate the groundwater recharge. A Water Atmosphere Vegetation Energy and Solutes (WAVES) model (Zhang & Dawes 1998) is a 1-D soil water balance model and has been often and successfully used to simulate recharge by equating recharge to deep drainage below the rooting depth, assuming that soil evaporation and transpiration are negligible below the rooting depth (Crosbie *et al.* 2013; Xie *et al.* 2018). Hydrologic Evaluation of Landfill Performance (HELP) (Berger 2000, 2015) is a 1-D model, driven by daily precipitation, daily temperature, and daily solar radiation, which has been successfully used in numerous studies to simulate recharge based on deep drainage through and below the vertical soil column model (Woyshner & Yanful 1995; Berger 2000, 2015; Risser *et al.* 2005; Scibek & Allen 2006; Jyrkama & Sykes 2007; Allen *et al.* 2010). Soil Vegetation Atmosphere Transfer (SVAT) is a class of 1-D LSM that takes rainfall as input and partitions it into evapotranspiration (ET), runoff, and drainage below the soil column (recharge). SVAT has been used in multiple recharge studies such as in the Okavango River Basin and the Chobe-Zambezi River system (Brunner *et al.* 2004).

Combining multiple climate models' hydrological predictions in a Bayesian framework (Sloughter *et al.* 2007; Duan & Phillips 2010) has been found to provide better predictions than the individual climate models (Ma *et al.* 2016). Some of the hydrological studies based on the Bayesian framework include: the streamflow study on Mississippi's Leaf River Basin (Ajami *et al.* 2007) and French Broad watersheds in the USA (Vrugt *et al.* 2008); Yellow River Basin, China (Zhang *et al.* 2018); Irish river catchments (Bastola *et al.* 2011); and Bayesian-based precipitation and sea surface temperature studies (Luo *et al.* 2007). However, there are limited recharge studies, if any, that have averaged recharge prediction from multiple climate models in a Bayesian framework. In addition, most previous hydrological studies were based on an assumed single Bayesian prior. However, due to subjectivity, an assumed single Bayesian prior may have a potential of an ill-posed posterior model probability (PMP; Kavetski *et al.* 2006) potentially due to drastic changes in posterior probability because of small errors in the observation. Bayesian model averaging (BMA) has been used to quantify groundwater model uncertainty, with model parameters being assumed to follow a uniform prior distribution (Mustafa *et al.* 2020). In a study conducted in the San Joaquin River Basin, BMA was successfully used to estimate the uncertainty of aquifer storage from the machine leaning-based groundwater model (Yin *et al.* 2021). Plausible recharge estimates were obtained in Central America using a combination of stable isotopes and the BMA framework with an assumed uniform prior distribution (Arellano *et al.* 2020). Groundwater storage across the contiguous USA was reliably estimated using a combination of Gravity Recovery and Climate Experiment (GRACE) satellite data and the water balance model within the Bayesian framework (Mehrnegar *et al.* 2021). The Bayesian framework, with an assumed uniform distribution, was used to establish interactions of climate with groundwater and food within Louisiana, USA (Singh *et al.* 2020). Groundwater age-dating within the Bayesian framework (with an assumed log-normal prior distribution) was used to better understand the irregular mixing at the seawater intrusion zones of groundwater aquifer that exist along the shorelines of Yellow Sea, South Korea (Ju *et al.* 2021). In a study conducted in the Yixunhe watershed, China, non-point source pollution was reliably modeled in the BMA framework (Wang *et al.* 2020). Other studies have also investigated Bayesian-based integrated management of stressed hydrological systems (Molina *et al.* 2010) and the impact of climate change on stressed groundwater (Molina *et al.* 2013). Therefore, there is no study that has investigated the influence of Bayesian priors on groundwater recharge estimation to the best of our knowledge.

Based on the above discussion, the objectives of this study are to (1) average recharge predictions from 10 RCMs in 12 different Bayesian frameworks; (2) find a suitable Bayesian framework for averaging multiple climate-based recharge prediction; (3) evaluate the relative importance of climate models in predicting recharge inside and outside the Bayesian framework. To calculate posterior model probabilities, which are used to average recharge prediction from the climate models, a known actual/observed recharge is required. In this study, a recession-curve-displacement method, which is also called RORA method (Rorabaugh 1964; Rorabaugh & Simons 1966), is used to estimate groundwater recharge from historical daily streamflow measurement. The RORA method has been proven to provide reliable recharge estimates from historical streamflow measurements (Daniel 1976; Delin *et al.* 2007; Lorenz & Delin 2007; Rutledge 2007).

## METHODS

### Study area and data sources

In this study, two case-study basins located in Minnesota, USA are selected. The Cedar River Basin (near Austin, Minnesota) and the Rainy River Basin at Manitou Rapids have drainage area of 1,500 and 30,000 km^{2}, respectively. The gauging station for the Cedar River Basin is located at [−92.97°, 43.64°], whereas for the Rainy River Basin it is located at [−95.54°, 44.72°]. The Cedar Basin receives relatively higher rainfall (of 834 mm/year) than the Rainy Basin which receives 671 mm/year on average (Achieng & Zhu 2019).

The estimation of actual groundwater recharge using the RORA method requires continuous data of historical streamflow measurement. Both Cedar and Rainy River Basins' streamflow dataset was obtained from the United States Geological Survey (USGS). The climate model's water balance dataset is freely available from the North American Regional Climate Change Assessment Program (NARCCAP) (Mearns *et al.* 2012). The NARCCAP project has downscaled the general circulation models (GCMs) to the RCMs' relatively smaller spatial and temporal resolutions of 50 km and 3 h, respectively. The RCMs have LSMs that partition the precipitation input into various hydrological outputs at RCMs' spatial and temporal scales. Comparison of the climate models requires use of a common time-frame; therefore, the RCM datasets of 1968–1998 are used in this study. The same 10 NARCCAP RCM–GCM combinations as the study of Achieng & Zhu (2019) are used in this study. It should be noted, however, that any number of RCMs can be included in the proposed framework. Table 1 summarizes the RCMs and their corresponding LSMs.

NARCCAP RCMs . | LSMs . |
---|---|

1. CRCM_ccsm | Canadian Land Surface Scheme (CLASS) (Verseghy et al. 1991, 1993). |

2. CRCM_cgcm3 | |

3. ECP2_ gfdl | NOAH (N: National Centers for Environmental Prediction (NCEP); O: Oregon State University (Department of Atmospheric Sciences); A: Air Force (both Air Force Weather Agency and Air Force Research Laboratory); H: Hydrology Lab – National Weather Service, NWS) (Ek et al. 2003). |

4. ECP2_hadcm3 | |

5. MM5I_ccsm | |

6. MM5I_hadcm3 | |

7. WRFG_ccsm | |

8. WRFG_cgcm3 | |

9. RCM3_cgcm3 | Biosphere-Atmosphere Transfer Scheme (BATS) (Dickinson et al. 1986; Yang & Dickinson 1996). |

10. RCM3_gfdl |

NARCCAP RCMs . | LSMs . |
---|---|

1. CRCM_ccsm | Canadian Land Surface Scheme (CLASS) (Verseghy et al. 1991, 1993). |

2. CRCM_cgcm3 | |

3. ECP2_ gfdl | NOAH (N: National Centers for Environmental Prediction (NCEP); O: Oregon State University (Department of Atmospheric Sciences); A: Air Force (both Air Force Weather Agency and Air Force Research Laboratory); H: Hydrology Lab – National Weather Service, NWS) (Ek et al. 2003). |

4. ECP2_hadcm3 | |

5. MM5I_ccsm | |

6. MM5I_hadcm3 | |

7. WRFG_ccsm | |

8. WRFG_cgcm3 | |

9. RCM3_cgcm3 | Biosphere-Atmosphere Transfer Scheme (BATS) (Dickinson et al. 1986; Yang & Dickinson 1996). |

10. RCM3_gfdl |

For the RCMs, CRCM is the Canadian RCM (Caya & Laprise 1999; Music & Caya 2007), ECP2 is the Scripps Experimental Climate Prediction Center/Regional Spectral Model (Juang *et al.* 1997), MM5I is the Fifth-generation Pennsylvania State University/NCAR Mesoscale Model (Grell *et al.* 1994), RCM3 is the RCM, version 3 (Pal *et al.* 2007), and WRFG is the Weather Research and Forecasting Grell (Skamarock *et al.* 2005) model. For the GCMs, ccsm is the Community Climate System Model (Collins *et al.* 2006), cgcm3 is the Third Generation Coupled Global Climate Model (Flato *et al.* 2000), gfdl is the Geophysical Fluid Dynamics Laboratory (GFDL) model (Delworth *et al.* 2006), and hadcm3 is the Hadley Climate Model (Pope *et al.* 2000).

### RCM-based recharge estimate

*et al.*2013) to regional scales (Risser

*et al.*2005; Scibek & Allen 2006; Allen

*et al.*2010). Based on these assumptions, the water balance for the 10 NRCCAP RCMs (i.e. CRCM_ccsm, CRCM_cgcm3, ECP2_gfdl, ECP2_hadcm3, MM5I_ccsm, MM5I_hadcm3, WRFG_ccsm, and WRFG_cgcm3) is expressed as:where

*R*is the recharge (mm), is the rainfall (mm), is the snowmelt (mm), is the evapotranspiration (mm); is the surface runoff (mm), and is the change in water storage (mm). The NARCCAP models provide the water balance variables on 3-hourly intervals. The 3-hourly values are summed up into monthly values because recharge values are at monthly time-steps. Therefore, the terms in Equation (1) are on a monthly time-step. Note that in 2 of the 10 RCMs, RCM3_ cgcm3 and RCM3_ gfdl, the rainfall and snowmelt were combined as one variable.

To calculate the average recharge for the basin, recharge estimates from the RCM grid cells that cover the basin are averaged using the grid cell areas as the weights. Since the studied basins have irregular shapes and the RCM grids are regular in shape, only the portions of the RCM grids that cover the basins are considered.

### Observed groundwater recharge estimation: recession-curve-displacement method

The climate models' recharges are averaged in a Bayesian framework based on a known actual recharge. The observed recharge is estimated from the USGS continuous historical daily streamflow measurements using the RORA method (Rorabaugh 1964). The RORA method assumes that recharge events only occur in the aquifer and instantaneously cause the groundwater level to rise, while the rest of the basin does not experience any net gain or loss. Therefore, the rise in the groundwater level causes groundwater discharge into the stream. The Cedar Basin has sand/gravel aquifer that is about 30 m thick. The Rainy Basin has igneous and metamorphic rock-based aquifer that stores water in fractures and faults. The Cedar Basin is most likely to uphold the homogeneity/isotropy assumption because its aquifer is made of coarse sediment. Since the Rainy Basin's aquifer is composed of rock fractures and faults, it may violate the homogeneity/isotropy assumption if these cracks are not uniformly distributed throughout the aquifer. This assumption is appropriate for the two basins because both basins experience humid climate. Therefore, it is possible for the basins to have zero net loss (e.g. due to evapotranspiration) or net gain (e.g. from irrigation). The time period between the peak streamflow discharge and the stabilized streamflow is called the critical time. It is used in this study because it has been found to provide groundwater recharge that is representative of the entire river basin (Delin *et al.* 2007; Lorenz & Delin 2007) unlike the point-scale methods. Besides, the RORA method only needs streamflow dataset in order to determine the recharge (Busenberg & Plummer 1992).

*et al.*2007). Groundwater recharge associated with each peak event is analytically calculated from the recession-curve-displacement as function of the difference between the pre-event and post-event discharges as follows (Rorabaugh 1964; Rorabaugh & Simons 1966; Daniel 1976; Arnold & Allen 1999; Rutledge 2000, 2004; Delin

*et al.*2007; Lorenz & Delin 2007):where is the daily groundwater recharge (mm);

*A*is the area of the basin (km

^{2}); is the groundwater discharge at the critical time that is extrapolated from the pre-event streamflow recession (m

^{3}/s); is the groundwater discharge at the critical time that is extrapolated from the post-event streamflow recession (m

^{3}/s);

*K*is the recession index (days/log-cycle) which is either obtained as the slope of the recession curve after critical time or computed as a function of aquifer properties as shown in the Supplementary Material.

### Bayesian frameworks for averaging RCMs' recharge

*model space, where*

^{k}*k*is the total number of regressors (RCMs), i.e. 10. The actual recharge is regressed onto each of the possible combinations of the 10 regressors to produce a total of 2

^{10}(i.e. 1,024) regression models. A normal regression model of actual recharge onto regressors (where ) that contains

*n*observations is expressed as:where is the intercept, is the -dimensional vector of 1's, is an matrix that contains columns of regressors and

*n*rows of observations of the respective regressors, is the -dimensional vector of relevant regression coefficients, is the residual standard deviation, is the identically and normally distributed ratio of residual to error that results from fitting all the

*k*regressors to .

Twelve Bayesian frameworks, which are used in the study by Achieng & Zhu (2019), were implemented as the Bayesian recharge models. These Bayesian frameworks are developed by combining prior distribution of the Bayesian regression models (MPrior) and the prior distribution of the regression coefficients (g-prior). Both prior distributions are vital in computing PMP, which is then used to compute both the Bayesian regression coefficients and the posterior inclusion probability (PIP) of the individual covariates (i.e. the RCMs).

*j*is computed, based on Bayes theorem, as follows:where is the likelihood of regression model that is computed based on the g-prior, is the normalizing constant of the PMP, is the prior probability distribution of the regression model , which is the MPrior, is the actual recharge, and

*k*is the number of RCMs, i.e. 10.

*j*is computed as the sum of PMPs of regression models that contain the regressor

*j*:where is the regression coefficient. PIP is used because it has been found to be useful performance indicator in evaluating the relative importance of the explanatory variables within the Bayesian framework (Fernández

*et al.*2002; Zeugner 2011; Ley

*et al.*2012; Steel 2013; Zeugner & Feldkircher 2015).

*et al.*,2001a, 2001b):

Three MPriors were used in this study including (1) the discrete uniform distribution (Liang *et al.* 2008; Ley & Steel 2012), (2) the fixed model prior (Ley & Steel 2008), and (3) the random model prior (Zeugner 2011; Zeugner & Feldkircher 2015). Along with the three MPriors, four g-priors were used in this study which includes: (1) the uniform information prior (UIP) (Kass & Wasserman 1995), (2) the Bayesian Risk Information Criterion (BRIC) (Fernández *et al.* 2001a), (3) hyper-UIP (Liang *et al.* 2008), and (4) the Empirical Bayes g-Local (EBL) (George & Foster 2000; Cui & George 2004) g-priors. A total of 12 Bayesian frameworks were formulated based on these MPriors and g-priors. Further details about these Bayesian frameworks can be found in Achieng & Zhu (2019).

*j*, and is the covariates matrix. Using Equation (7), the Bayesian averaged coefficients are computed for each of the k regressors. The Bayesian recharge model is comprised of these k regressors with their corresponding computed regression coefficients.

### Evaluation of the performance of the climate models and Bayesian recharge models

The evaluation of the performance of the climate models and the Bayesian frameworks is performed to understand the strength of the respective models in simulating recharge, inform parameter tuning of the climate models, and evaluate the advancement of the current versions of the climate models compared with the previous versions. In this study, four widely used classical performance indices are used to assess the performance of the RCMs and the Bayesian recharge models: the coefficient of determination (R^{2}) (Gupta *et al.* 1998; Dawson *et al.* 2002), the root mean square error (RMSE) (Legates & McCabe 1999), the Willmott's index of agreement (d1) (Willmott 1982; Willmott *et al.* 1985), and the percent bias error (PBIAS) (Gupta *et al.* 1998). The RMSE is used because it is a measure of goodness of fit between observation and modeled recharge values. The PBIAS gives us the deviation of the modeled recharge values from the observation. The index of agreement d1 characterizes how well the modeled recharge is in agreement with the observation.

The relative importance of the 10 climate models is evaluated both before (outside the Bayesian framework) and after (inside the Bayesian framework) the climate models are averaged in a Bayesian framework. Evaluating the relative importance outside the Bayesian framework is done to ensure that Bayesian prior probability results into PMP values that are consistent with the performance indices of the climate models that are obtained outside the Bayesian framework.

### Sensitivity analysis of Bayesian recharge models

Bayesian recharge models are formulated by regressing the actual recharge on to the 10 RCMs' predicted recharge. Sensitivity of these recharge models, to the covariates, is assessed based on a leave-one-out approach. A climate model is left out of the trained Bayesian recharge regression equation, and the recharge prediction of the resulting model is compared with the prediction of the full model. The magnitude and direction of the change in recharge prediction, when one of the climate models is left out, give an indication of the sensitivity of the trained Bayesian recharge model to the climate model in question. The climate model which leads to the greatest change in recharge is the most important model in the Bayesian regression model. Since averaging of the climate models within the Bayesian framework aims to use the relative importance of the climate models in predicting the Bayesian-based recharge, the climate model that results in a relatively big change in predicted recharge when left out of the Bayesian recharge model is the climate model that the recharge model is most sensitive to.

## RESULTS AND DISCUSSION

### Historical recharge

The results of historical annual recharge for both basins are shown in Figure 1. Recharge estimation with this method involves the determination of both the recession index and the critical time for each basin. The recession index values for Cedar and Rainy Basins were 19.78 and 73.94 days/log-cycle, respectively. Whereas the critical time values for Cedar and Rainy Basins were 4.24 and 15.9 days, respectively. The Cedar River Basin seems to have a positive trend in annual recharge (of 3.4 mm/year) over the 30 years, whereas the Rainy Basin has a negative trend in annual recharge (of −3.4 mm/year) over the same period. This signifies an increase and a decline in annual recharge in Cedar and Rainy River Basins, respectively, during the 1968–1998 time period. Note that the long trend in an annual recharge observed in these basins is fairly weak since both basins have a low R^{2} term value of 0.15. The Rainy River Basin has a long-term mean annual recharge of 225 mm, which is larger than that of the Cedar River Basin of 191 mm. The relatively larger mean annual recharge in the Rainy River Basin can be explained by the fact that this basin lies in northern Minnesota – a region that receives relatively higher precipitation than the southern region where the Cedar River Basin is located. These recharge values are similar those obtained in previous studies (Delin *et al.* 2007).

### Recharge estimation from climate models

The scatterplots of monthly recharge for each RCM and observed data are plotted as shown in Figure 2 for both Cedar and Rainy River Basins. Most RCMs (e.g. WRFG_ccsm, WRFG_cgcm3, and MM5I_ccsm) underestimated recharge, whereas a few RCMs (e.g. ECP2_hadcm3and RCM3_cgcm3) overestimated recharge in both basins. Due to variability in partitioning of the precipitation, the RCMs' LSMs provide different values of the water balance components. As a result, there exists variability in recharge prediction among the RCMs as shown in Figure 2. For example, studies have found that NOAH LSM that is used by most of the NARCCAP RCMs (such as ECP2_ gfdl, ECP2_hadcm3, MM5I_ccsm, MM5I_hadcm3, WRFG_ccsm, and WRFG_cgcm3) underestimates evapotranspiration (Yang *et al.* 2011) and overestimates surface runoff (Niu *et al.* 2011) because the frozen top soil layer in the NOAH model is relatively more impermeable than its counterpart LSMs, thus blocking infiltration and deep drainage while promoting soil evaporation. Other studies have found that the Canadian Land Surface Scheme (CLASS) (which is used by CRCM_ccsm and CRCM_cgcm3 climate models) has been found to underestimate surface runoff and drainage (Lohmann *et al.* 1998) and overestimates evapotranspiration (Liang *et al.* 1998). On the other hand, the Biosphere-Atmosphere Transfer Scheme (BATS) is prone to overestimating surface runoff (Yang *et al.* 2011) and underestimating evapotranspiration (Dickinson & Henderson-Sellers 1988). The BATS is used by RCM3_cgcm3 and RCM3_gfdl RCMs.

### Performance of the individual RCMs

Table 2 summarizes the performance indices of climate models in simulating the groundwater recharge. In the Cedar River Basin, RCM3_gfdl is the best performing RCM with R^{2}, RMSE, d1, and PBIAS of 0.98, 3.12 mm, 0.92, and 0.29%, respectively. In the Rainy River Basin, RCM3_cgcm3 is the best performing RCM with R^{2}, RMSE, d1, and PBIAS of 0.98, 4.04 mm, 0.93, and −8.79%, respectively. In both basins, CRCM_ccsm is the worst performing model. The poor performance of the CRCM_ccsm model can be attributed to the fact that it uses the CLASS LSM which has been found to overestimate both evapotranspiration (Yang *et al.* 2011) and surface runoff (Niu *et al.* 2011) and therefore to underestimate aquifer recharge. Both evapotranspiration and surface runoff are key components of the water balance-based recharge estimation. The value of R^{2} of the remaining RCMs ranges from 0.55 to 0.98, for the Cedar Basin, and 0.22 to 0.99, for the Rainy Basin. The bias of the remaining climate models, which is represented with RMSE, is in the range of 4.5–19.1 mm in the Cedar River Basin and 4.0–35.6 mm in the Rainy River Basin. The remaining RCMs predicted recharge with d1 values of 0.50–0.92 and 0.26–0.90 in Cedar and Rainy River Basins, respectively. Whereas the PBIAS values with which the remaining RCMs predicted recharge range from −7.41 to +1.09% in the Cedar Basin and −14.77 to −1.13% in the Rainy Basin. The RCMs driven by ‘gfdl’ and ‘cgcm3’ GCMs (e.g. RCM3_cgcm3, RCM3_gfdl, and ECP2_gfdl) seem to perform better in estimating recharge in both basins. On the other hand, the RCMs driven by ‘ccsm’ GCMs (such as CRCM_ccsm and WRFG_ccsm) have poor performance in recharge estimation.

. | Cedar . | Rainy . | ||||||
---|---|---|---|---|---|---|---|---|

Climate models . | (–) . | PBIAS (%) . | d1 (–) . | RMSE (mm) . | (–) . | PBIAS (%) . | d1 (–) . | RMSE (mm) . |

CRCM_ccsm | 0.552 | −7.41 | 0.539 | 19.1 | 0.27 | −14.8 | 0.353 | 43.2 |

CRCM_cgcm3 | 0.849 | −3.48 | 0.841 | 9.64 | 0.46 | −9.99 | 0.615 | 32.1 |

ECP2_gfdl | 0.984 | −0.82 | 0.921 | 2.61 | 0.91 | −2.23 | 0.902 | 9.42 |

ECP2_hadcm3 | 0.879 | −1.10 | 0.743 | 7.29 | 0.82 | −1.13 | 0.814 | 12.9 |

MM5I_ccsm | 0.888 | −0.73 | 0.823 | 7.35 | 0.67 | −6.50 | 0.776 | 22.2 |

MM5I_hadcm3 | 0.949 | −0.14 | 0.848 | 4.52 | 0.86 | −2.77 | 0.826 | 11.5 |

RCM3_cgcm3 | 0.924 | 1.09 | 0.887 | 5.52 | 0.99 | −8.79 | 0.928 | 4.05 |

RCM3_gfdl | 0.976 | 0.29 | 0.921 | 3.13 | 0.99 | −7.82 | 0.939 | 3.58 |

WRFG_ccsm | 0.569 | −4.44 | 0.536 | 16.7 | 0.22 | −6.79 | 0.266 | 35.6 |

WRFG_cgcm3 | 0.524 | −1.66 | 0.504 | 16.6 | 0.37 | −4.96 | 0.415 | 29.5 |

. | Cedar . | Rainy . | ||||||
---|---|---|---|---|---|---|---|---|

Climate models . | (–) . | PBIAS (%) . | d1 (–) . | RMSE (mm) . | (–) . | PBIAS (%) . | d1 (–) . | RMSE (mm) . |

CRCM_ccsm | 0.552 | −7.41 | 0.539 | 19.1 | 0.27 | −14.8 | 0.353 | 43.2 |

CRCM_cgcm3 | 0.849 | −3.48 | 0.841 | 9.64 | 0.46 | −9.99 | 0.615 | 32.1 |

ECP2_gfdl | 0.984 | −0.82 | 0.921 | 2.61 | 0.91 | −2.23 | 0.902 | 9.42 |

ECP2_hadcm3 | 0.879 | −1.10 | 0.743 | 7.29 | 0.82 | −1.13 | 0.814 | 12.9 |

MM5I_ccsm | 0.888 | −0.73 | 0.823 | 7.35 | 0.67 | −6.50 | 0.776 | 22.2 |

MM5I_hadcm3 | 0.949 | −0.14 | 0.848 | 4.52 | 0.86 | −2.77 | 0.826 | 11.5 |

RCM3_cgcm3 | 0.924 | 1.09 | 0.887 | 5.52 | 0.99 | −8.79 | 0.928 | 4.05 |

RCM3_gfdl | 0.976 | 0.29 | 0.921 | 3.13 | 0.99 | −7.82 | 0.939 | 3.58 |

WRFG_ccsm | 0.569 | −4.44 | 0.536 | 16.7 | 0.22 | −6.79 | 0.266 | 35.6 |

WRFG_cgcm3 | 0.524 | −1.66 | 0.504 | 16.6 | 0.37 | −4.96 | 0.415 | 29.5 |

### Calibration/training of the Bayesian recharge models

The calibration of the Bayesian regression model is done by determining the optimal values of regression coefficients of each of the 10 regressors. Each of the regression coefficient (of the calibrated Bayesian model) is obtained by averaging the values of the regression coefficient of all the regression models that contain the regressor in question (within the 1,024 regression model space) using the corresponding PMP values of the regression models. A total of 12 Bayesian recharge models are calibrated using a subset of 23-year data from the 31-year dataset from 1968 to 1990. Calibration is undertaken using monthly data for the 1968–1990 period. Using the calibration dataset, the recharge predictions from the 10 climate models are averaged in the Bayesian framework with the RORA-based recharge being used as the observed recharge. Once the optimal Bayesian regression coefficients for the 10 RCMs are obtained, the calibrated Bayesian recharge model predictions are plotted versus the measured recharge as shown in Figure 3. Results suggest a strong correlation between the calibrated recharge models and the measured recharge. Nearly all the Bayesian recharge models have good performance with an R^{2} value of 0.998, an RMSE value of <2 mm, d1 of >0.97, and PBIAS of <|1%| in both Cedar and Rainy River Basins, as shown in Table 3. The monthly recharge of the remaining 2-year dataset for 1991–1998 is used to validate the 12 Bayesian recharge models.

. | Cedar . | Rainy . | ||||||
---|---|---|---|---|---|---|---|---|

Bayesian models . | (–) . | RMSE (mm) . | d1 (–) . | PBIAS (%) . | (–) . | RMSE (mm) . | d1 (–) . | PBIAS (%) . |

m.fixed_g.BRIC | 0.998 | 0.858 | 0.974 | −2.06 × 10^{−10} | 0.996 | 1.64 | 0.974 | −0.169 |

m.fixed_g.EBL | 0.998 | 0.749 | 0.977 | −2.75 × 10^{−10} | 0.996 | 1.60 | 0.974 | −0.363 |

m.fixed_g.hyper.UIP | 0.997 | 0.926 | 0.970 | −1.03 × 10^{−9} | 0.996 | 1.66 | 0.973 | −0.595 |

m.fixed_g.UIP | 0.998 | 0.858 | 0.974 | 3.58 × 10^{−9} | 0.996 | 1.64 | 0.974 | −0.169 |

m.random_g.BRIC | 0.998 | 0.898 | 0.972 | 2.06 × 10^{−9} | 0.996 | 1.65 | 0.973 | −0.195 |

m.random_g.EBL | 0.998 | 0.749 | 0.977 | 1.10 × 10^{−9} | 0.996 | 1.61 | 0.974 | −0.364 |

m.random_g.hyper.UIP | 0.998 | 0.898 | 0.972 | −1.10 × 10^{−9} | 0.996 | 1.71 | 0.972 | −0.703 |

m.random_g.UIP | 0.998 | 0.898 | 0.972 | −1.10 × 10^{−9} | 0.996 | 1.65 | 0.973 | −0.195 |

m.uniform_g.BRIC | 0.998 | 0.858 | 0.974 | 1.45 × 10^{−9} | 0.996 | 1.64 | 0.974 | −0.169 |

m.uniform_g.EBL | 0.998 | 0.749 | 0.977 | −1.45 × 10^{−9} | 0.996 | 1.60 | 0.974 | −0.363 |

m.uniform_g.hyper.UIP | 0.997 | 0.926 | 0.970 | 3.92 × 10^{−9} | 0.996 | 1.66 | 0.973 | −0.595 |

m.uniform_g.UIP | 0.998 | 0.858 | 0.974 | 2.68 × 10^{−9} | 0.996 | 1.64 | 0.974 | −0.169 |

. | Cedar . | Rainy . | ||||||
---|---|---|---|---|---|---|---|---|

Bayesian models . | (–) . | RMSE (mm) . | d1 (–) . | PBIAS (%) . | (–) . | RMSE (mm) . | d1 (–) . | PBIAS (%) . |

m.fixed_g.BRIC | 0.998 | 0.858 | 0.974 | −2.06 × 10^{−10} | 0.996 | 1.64 | 0.974 | −0.169 |

m.fixed_g.EBL | 0.998 | 0.749 | 0.977 | −2.75 × 10^{−10} | 0.996 | 1.60 | 0.974 | −0.363 |

m.fixed_g.hyper.UIP | 0.997 | 0.926 | 0.970 | −1.03 × 10^{−9} | 0.996 | 1.66 | 0.973 | −0.595 |

m.fixed_g.UIP | 0.998 | 0.858 | 0.974 | 3.58 × 10^{−9} | 0.996 | 1.64 | 0.974 | −0.169 |

m.random_g.BRIC | 0.998 | 0.898 | 0.972 | 2.06 × 10^{−9} | 0.996 | 1.65 | 0.973 | −0.195 |

m.random_g.EBL | 0.998 | 0.749 | 0.977 | 1.10 × 10^{−9} | 0.996 | 1.61 | 0.974 | −0.364 |

m.random_g.hyper.UIP | 0.998 | 0.898 | 0.972 | −1.10 × 10^{−9} | 0.996 | 1.71 | 0.972 | −0.703 |

m.random_g.UIP | 0.998 | 0.898 | 0.972 | −1.10 × 10^{−9} | 0.996 | 1.65 | 0.973 | −0.195 |

m.uniform_g.BRIC | 0.998 | 0.858 | 0.974 | 1.45 × 10^{−9} | 0.996 | 1.64 | 0.974 | −0.169 |

m.uniform_g.EBL | 0.998 | 0.749 | 0.977 | −1.45 × 10^{−9} | 0.996 | 1.60 | 0.974 | −0.363 |

m.uniform_g.hyper.UIP | 0.997 | 0.926 | 0.970 | 3.92 × 10^{−9} | 0.996 | 1.66 | 0.973 | −0.595 |

m.uniform_g.UIP | 0.998 | 0.858 | 0.974 | 2.68 × 10^{−9} | 0.996 | 1.64 | 0.974 | −0.169 |

### Validation of the Bayesian recharge models

To ensure that the calibration of the Bayesian recharge models works effectively, the calibrated Bayesian recharge models were validated. The monthly recharge of the 1991–1998 dataset is used to validate the 12 calibrated Bayesian recharge models. The modeled recharge is plotted versus the RORA recharge as shown in Figure 4. Results suggest a strong agreement between the Bayesian recharge models and the measured recharge at low recharge values and a weaker agreement at high recharge values as shown in Figure 4. Overall, the validated recharge models have a good performance. The validated Bayesian recharge models in the Cedar Basin have R^{2} of 0.62–0.68, RMSE of 10.1–11.1 mm, d1 of 0.70–0.72, and PBIAS of 33–49% as shown in Table 4. On the other hand, the River Basin's recharge models have relatively better performance in the validation phase, with R^{2} of 0.66–0.76, RMSE of 8.7–10.3 mm, d1 of 0.81–0.83, and PBIAS of 12–25% as shown in Table 4.

. | Cedar . | Rainy . | ||||||
---|---|---|---|---|---|---|---|---|

Bayesian models . | . | RMSE (mm) . | d1 (–) . | PBIAS (%) . | (–) . | RMSE (mm) . | d1 (–) . | PBIAS (%) . |

m.fixed_g.BRIC | 0.941 | 11.6 | 0.680 | 44.6 | 0.933 | 13.7 | 0.794 | 13.3 |

m.fixed_g.EBL | 0.917 | 9.01 | 0.757 | 34.0 | 0.937 | 14.9 | 0.783 | 25.2 |

m.fixed_g.hyper.UIP | 0.943 | 13.0 | 0.647 | 49.0 | 0.917 | 14.1 | 0.801 | 12.9 |

m.fixed_g.UIP | 0.941 | 11.6 | 0.680 | 44.6 | 0.933 | 13.7 | 0.794 | 13.3 |

m.random_g.BRIC | 0.942 | 12.4 | 0.661 | 47.0 | 0.930 | 13.8 | 0.795 | 13.2 |

m.random_g.EBL | 0.917 | 9.01 | 0.757 | 33.9 | 0.937 | 14.9 | 0.784 | 25.3 |

m.random_g.hyper.UIP | 0.942 | 12.4 | 0.661 | 47.0 | 0.903 | 14.6 | 0.800 | 14.0 |

m.random_g.UIP | 0.942 | 12.4 | 0.661 | 47.0 | 0.930 | 13.8 | 0.795 | 13.2 |

m.uniform_g.BRIC | 0.941 | 11.6 | 0.680 | 44.6 | 0.933 | 13.7 | 0.794 | 13.3 |

m.uniform_g.EBL | 0.917 | 9.01 | 0.757 | 34.0 | 0.937 | 14.9 | 0.783 | 25.2 |

m.uniform_g.hyper.UIP | 0.943 | 12.953 | 0.647 | 49.0 | 0.917 | 14.1 | 0.801 | 12.9 |

m.uniform_g.UIP | 0.941 | 11.636 | 0.680 | 44.6 | 0.933 | 13.7 | 0.794 | 13.3 |

. | Cedar . | Rainy . | ||||||
---|---|---|---|---|---|---|---|---|

Bayesian models . | . | RMSE (mm) . | d1 (–) . | PBIAS (%) . | (–) . | RMSE (mm) . | d1 (–) . | PBIAS (%) . |

m.fixed_g.BRIC | 0.941 | 11.6 | 0.680 | 44.6 | 0.933 | 13.7 | 0.794 | 13.3 |

m.fixed_g.EBL | 0.917 | 9.01 | 0.757 | 34.0 | 0.937 | 14.9 | 0.783 | 25.2 |

m.fixed_g.hyper.UIP | 0.943 | 13.0 | 0.647 | 49.0 | 0.917 | 14.1 | 0.801 | 12.9 |

m.fixed_g.UIP | 0.941 | 11.6 | 0.680 | 44.6 | 0.933 | 13.7 | 0.794 | 13.3 |

m.random_g.BRIC | 0.942 | 12.4 | 0.661 | 47.0 | 0.930 | 13.8 | 0.795 | 13.2 |

m.random_g.EBL | 0.917 | 9.01 | 0.757 | 33.9 | 0.937 | 14.9 | 0.784 | 25.3 |

m.random_g.hyper.UIP | 0.942 | 12.4 | 0.661 | 47.0 | 0.903 | 14.6 | 0.800 | 14.0 |

m.random_g.UIP | 0.942 | 12.4 | 0.661 | 47.0 | 0.930 | 13.8 | 0.795 | 13.2 |

m.uniform_g.BRIC | 0.941 | 11.6 | 0.680 | 44.6 | 0.933 | 13.7 | 0.794 | 13.3 |

m.uniform_g.EBL | 0.917 | 9.01 | 0.757 | 34.0 | 0.937 | 14.9 | 0.783 | 25.2 |

m.uniform_g.hyper.UIP | 0.943 | 12.953 | 0.647 | 49.0 | 0.917 | 14.1 | 0.801 | 12.9 |

m.uniform_g.UIP | 0.941 | 11.636 | 0.680 | 44.6 | 0.933 | 13.7 | 0.794 | 13.3 |

Even though there is no significant difference in performance among the 12 Bayesian recharge models, the EBL-based Bayesian models (i.e. m.fixed_g.EBL, m.random_g.EBL, and m.uniform_g.EBL) performed slightly better than the rest of the models in simulating recharge in the Cedar Basin during the validation phase, as shown in Table 4 and Figure 4(a). However, in the Rainy Basin, the hyper.UIP-based Bayesian models (i.e. m.fixed_g.hyper.UIP, m.random_g.hyper.UIP, and m.uniform_g.hyper.UIP) seem to have slightly outperformed the rest of the Bayesian recharge models in the validation phase, as shown in Table 4 and Figure 4(b). There is no significant difference in the Bayesian recharge models' RMSE values between Cedar and Rainy River Basins, as shown in Table 4. Therefore, the average deviation in Bayesian models recharge estimation seems to be uniform.

### Relative importance of the climate models outside and inside the Bayesian framework

The performance indices of the 10 climate models outside the Bayesian framework are summarized in Table 2. The models with the strongest agreement with the observed recharge have the highest importance during averaging in the Bayesian framework. As found earlier, RCM3_gfdl has the best performance indices outside the Bayesian framework for the Cedar River Basin, and RCM3_cgcm3 is the best model for the Rainy River Basin. On the other hand, CRCM_ccsm performs the worst for both Cedar and Rainy Basins.

After averaging the climate models within the Bayesian frameworks, the relative importance of the models (within the framework) is evaluated based on their PIPs. The PIP values of the 10 RCMs are shown in Figure 5. The results suggest that MM5I_ccsm and RCM3_gfdl are the most important RCMs in the Cedar Basin since these RCMs have PIP values of close to 1 in nearly all the 12 Bayesian recharge models, as shown in Figure 5(a). Note that even though the MM5I_ccsm does not have the strongest agreement with observed recharge outside the Bayesian framework, its performance indices values suggest a good performance based on R^{2}, RMSE, and d1. Beside the two best performing RCMs, WRFG_cgcm3 also has a good performance in the Cedar Basin with PIP values of 0.8–1 across the 12 Bayesian models, as shown in Figure 5(a). The remaining seven RCMs for the Cedar Basin have a weak performance with PIP values of less than 0.5, as shown in Figure 5(a). High variability of PIP values is also observed among the weak performing RCMs since the PIP values are in the range of 0–0.25 across the 12 Bayesian recharge models for each of the weak performing RCMs. CRCM_ccsm is also categorized as a weak performing model in the Cedar Basin since its PIP values from 7 of the 12 Bayesian models are less than 0.5. This is consistent with its performance indices values outside the Bayesian framework. However, the EBL-based Bayesian models (i.e. m.fixed_g.EBL, m.random_g.EBL, and m.uniform_g.EBL) assign PIP values of 1 to the weak performing CRCM_ccsm. This is due to the fact that EBL g-prior uses a local approach to compute g-prior based on the *F*-statistics of the individual regression models within the Bayesian regression model space of 2^{10} since there are 10 RCMs. The rest of the Bayesian models use a global approach in which a common g-value across all the regression models within the model space is applied.

In the Rainy River Basin, the best performing climate model is RCM3_cgcm3 since it has the PIP values of 0.75–1 across the 12 Bayesian recharge models, as shown in Figure 5(b). This is followed by MM5I_ccsm, MM5I_hadcm3, and WRFG_ccsm, which have PIP values of 0.5–0.7, as shown in Figure 5(b). The worst performing RCMs outside the Bayesian framework, CRCM_ccsm, also has the smallest PIP values (PIP value of <0.5), as shown in Figure 5(b). Therefore, the PIP values of the poor performing RCMs seem to be consistent with the models performance outside the Bayesian framework. The remaining RCMs have weak performance with PIP values of <0.5, as shown in Figure 5(b). The EBL-based Bayesian models seem to exaggerate the PIP values in both best and worst performing RCMs in the Rainy Basin, as shown in Figure 5(b). As explained earlier, the slightly larger PIP values observed in the EBL-based Bayesian models are attributed to the fact that the g-value of the EBL prior is computed locally for each regression model, unlike the non-Empirical Bayes g-Local (non-EBL)-based Bayesian models. The recharge estimations are relatively more uncertain in the Rainy River Basin than in the Cedar Basin. This is observed in the relatively high coefficient of variation of recharge prediction in the Rainy Basin of 1.42 than in the Cedar Basin of 1.23. The relatively high recharge prediction variability among the climate models explains the observed variability of PIP values across the 12 Bayesian recharge models for almost every climate model, as shown in Figure 5(b). Even though RCMs' PIP values are highly variable in the Rainy Basin, the Bayesian models still give a clear direction of the recharge prediction performance strength of the RCMs within the Bayesian framework.

### Sensitivity of Bayesian recharge models to the climate models

Based on a leave-one-out technique, recharge prediction of the Bayesian recharge models was computed by leaving out one of the 10 RCMs. The sensitivity of the Bayesian recharge models was expressed as both scatter plots (Figures 6 and 7) and mean percent change (Figures 8 and 9) of Bayesian recharge prediction between the Bayesian recharge prediction and the leave-one-out-based Bayesian recharge prediction. The computation of the mean percent change of recharge is shown in Equation A4 in the Supplementary Material.

Figures 6 and 7 are plots of modeled recharge with one climate model left out versus the observed recharge for Cedar and Rainy River Basins, respectively. In the Cedar River Basin, the Bayesian recharge models are most sensitive to WRFG_ccsm since the largest deviation of the modeled recharge from the observed recharge is observed when WRFG_ccsm is left out of the calibrated Bayesian recharge models, as shown in Figures 6(i) and 8(b). It is also observed that the changes in recharge when WRFG_ccsm is left out the Bayesian models lead to the overestimation of recharge in the Cedar River Basin. The worst performing model inside and outside the Bayesian framework, CRCM_ccsm, seems to produce low-to-moderate sensitivity, as shown in Figures 8 and 9 when left out of the Bayesian recharge model. However, this model is still important within the Bayesian framework since leaving CRCM_ccsm out of the calibrated Bayesian models causes underestimation of recharge in the Cedar Basin, especially at high recharge values, as shown in Figure 6(a). The only climate model that results in overestimation in recharge in the Cedar Basin, when the model is left out, is WRFG_ccsm, as shown in Figures 6(i) and 8(b). Note that the overestimation of recharge is mostly significant at high recharge values when the WRFG_ccsm model is left out. An insignificant change in recharge occurs when CRCM_cgcm3, ECP2_gfdl, ECP2_hadcm3, MM5I_hadcm3, or RCM3_cgcm3 is left out, as shown in Figure 6.

Like the Cedar River Basin, leaving out ECP2_gfdl does not cause any significant change in recharge prediction by Bayesian recharge models in the Rainy Basin, as shown in Figure 7(c). The largest underestimation in recharge in the Rainy River Basin occurs when RCM3_cgcm3 is left out, as shown in Figure 7(g). However, this underestimation does not deviate much from the original prediction (when none of the RCMs is left out). Leaving out CRCM_ccsm produces the largest mean percent change in recharge prediction, in the Rainy Basin, as shown in Figure 9(b). Therefore, in the Rainy Basin, Bayesian recharge models are most sensitive to CRCM_ccsm. A slight overestimation and underestimation of recharge, in the lower and upper recharge values, respectively, is observed when ECP2_hadcm3 is left out of the Bayesian recharge models, as shown in Figure 7(d). This is also supported by the relatively moderate mean percent change (less than −20%) as shown in Figure 9(a). No significant change in recharge is observed when WRFG_cgcm3 is left out, as shown in Figures 7(j) and 9(a).

In the Cedar River Basin, the Bayesian recharge models are most sensitive to RCM3_gfdl because RCM3_gfdl has the largest Bayesian regression coefficient value of an average of 0.56 mm/month. Therefore, removing RCM3_gfdl from the calibrated Bayesian recharge models leads to the greatest change in the modeled recharge. Unlike the RCM3_gfdl, leaving out WRFG_ccsm only produces strong sensitivity in EBL-based recharge models depicted by the relatively large mean percent change of recharge. On the other hand, ECP2_gfdl has the smallest Bayesian regression coefficient value of 0.01 mm/month and thus leads to the smallest deviation when being left out of the calibrated Bayesian recharge models.

Within the calibrated Bayesian regression model, RCM3_cgcm3 has the largest regression coefficient – an average of 0.55 mm/month – across all Bayesian recharge models. Therefore, removing RCM3_cgcm3 from the Bayesian recharge models causes the greatest change in the predicted recharge in the Rainy Basin. This is followed closely by CRCM_ccsm which, unlike the RCM3_cgcm3, only produces high sensitivity in the EBL-based recharge models of the Rainy Basin. Conversely, since ECP2_hadcm3 has the smallest Bayesian regression coefficient of −0.01 mm/month and leaving it out leads to the smallest change in the Bayesian modeled recharge in the Rainy River Basin. Except for WRFG_ccsm and CRCM_ccsm models, leaving out the climate models from the Bayesian recharge models mainly causes underestimation in the recharge in the Rainy River Basin. In the Cedar Basin, leaving out WRFG_ccsm from the Bayesian models could cause overestimation in the recharge because WRFG_ccsm has a negative effect due to the relatively larger negative PBIAS of −6.79%. Moreover, WRFG_ccsm has the largest negative regression coefficient of −0.22 mm/month. Therefore, removing WRFG_ccsm removes the negative effect on the Bayesian recharge models and thus overestimation of the recharge in the Rainy River Basin. The negative bias in recharge prediction can be explained by the tendency of WRFG_ccsm's LSM (NOAH) to overestimate surface runoff due to the assumed frozen top soil layer in model physics (Niu *et al.* 2011), which makes it more impermeable than its counterpart LSMs.

### Implication of averaging climate model-based recharge in the Bayesian framework

Groundwater recharge replenishes aquifer storage, thus sustaining demand from the ever-increasing global population, alleviating potential environmental impact like land subsidence, and dampening of draining of nearby rivers that are hydraulically connected to the aquifer. Traditional methods of estimating recharge (e.g. lysimeters and isotope techniques) are often expensive and not sustainable in long-term recharge monitoring, especially in the face of climate change. Therefore, climate models come in handy in monitoring groundwater recharge, not only in the current climate scenario, but also in future climate scenarios. However, individual climate models often produce different recharge predictions. As a result, most recharge studies often use model selection to select climate model that provides the closest recharge prediction. Model selection leads to the elimination of the climate models whose predictions are relatively over- and underestimations of the actual recharge. Climate models are costly to develop with respect to time, knowledge, and labor. Eliminating climate models that do not give predictions that are close to the measured recharge is merely a waste of resources. In this study, out of the 10 climate models, ECP2_hadcm3 and RCM3_cgcm3 would be selected since they provide recharge that is closest to the observations. However, this will mean that the rest of the models are discarded. To harness the potential relative importance of all the 10 models, we averaged their recharge predictions in a Bayesian framework. However, because Bayesian priors have been found to impact posterior and thus predictions (Kavetski *et al.* 2006; Bastola *et al.* 2011), it is important to identify the prior combinations (MPrior and g-prior) that would best average recharge predictions from climate models in a Bayesian framework. Averaging the recharge predictions in the Bayesian framework also allows us to determine the relative contribution of the climate models to the Bayesian recharge model. As a result, we not only used all the climate models in producing superior recharge predictions by identifying the most suitable Bayesian prior combinations for recharge but also provided opportunity for improving future versions of these climate models based on their relative contribution to the Bayesian recharge model.

## CONCLUSIONS

Our study suggests that non-EBL-based Bayesian frameworks are suitable for averaging recharge from climate models within a Bayesian framework. The performance of the climate models (in modellng recharge) inside the non-EBL-based Bayesian framework is consistent with their corresponding performance outside the Bayesian framework. Inconsistency between climate models' performance inside and outside the Bayesian framework arising from the EBL-based Bayesian frameworks can be attributed to the fact that their g-parameter is regression model-specific, whereas a constant g-parameter is used (for all the 1,024 regression models) in the non-EBL-based priors. These observations are consistent with previous studies (e.g. Achieng & Zhu 2019). However, since the previous study was on different regimes of streamflow, EBL priors was found to be just as suitable for estimating low flows as the non-EBL-based priors (Achieng & Zhu 2019). Note that the basin size does not seem to affect the choice of Bayesian framework when estimating the recharge in the Bayesian framework – unlike the basin size-effect on the choice of Bayesian framework for streamflow in the previous study (Achieng & Zhu 2019). This could be attributed to the fact that, unlike streamflow which is a surface hydrological process, recharge is a process that occurs much slower than streamflow and the recharge is relatively less affected than the streamflow by many hydro-topo-climatological factors. Therefore, the recharge is much more stable and thus even though there was uncertainty in the climate model-based recharge prediction, all the climate models underestimated the recharge. Either way, the climate models' uncertainty is evident in both streamflow and recharge. This can be contributed to the different model physics and assumptions that are applied on the LSMs of the respective climate models and the different spatial resolution GCMs that provide the input to the corresponding RCMs. The RORA method has been found to provide reliable recharge values. However, previous studies have also found that the RORA method has some uncertainty due to many assumptions that the method has been developed under (Delin *et al.* 2007; Lorenz & Delin 2007). Future studies should focus on characterizing the uncertainties from the ‘reference recharge’ and the climate models themselves. The relative strength of climate models at predicting groundwater recharge can be harnessed by averaging their predictions within a Bayesian framework. However, it is important to use the Bayesian framework with priors that result into PIP consistent with the performance of the respective climate models outside the Bayesian framework because of difference in Bayesian frameworks with respect to prior probability of both Bayesian regression models and regression coefficients. In this study, we comprehensively evaluate the performance of groundwater recharge predictions from 12 different Bayesian frameworks. EBL-based prior is found to overestimate PIP values of the poor performing climate models. Non-EBL-based priors result in PIP values that are consistent with the performance of the climate models outside the Bayesian framework in the two case-study basins. Therefore, we recommend the use of non-EBL prior-based Bayesian frameworks when averaging recharge predictions from climate models within the Bayesian framework. Other conclusions can be summarized as follows:

The choice of prior affects the suitability of Bayesian framework for averaging recharge from multiple climate models in a Bayesian framework.

Averaging recharge predictions from climate models within the Bayesian framework results in better recharge prediction than that from any of the individual climate models.

Nearly all the NARCCAP RCMs underestimate groundwater recharge.

## ACKNOWLEDGEMENTS

This research was supported by the Bishop, Floyd & Wilma Endowment and Paul A. Rechard Fellowship at the University of Wyoming, USA.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.