Modelers typically test alternative conceptualizations of a groundwater system because the ‘true’ hydrogeological model of the system is unknown, and the observed data are limited. This study elaborates on how to rank different conceptualizations of a groundwater system by using selection techniques and criteria that make a balance between complexity and accuracy of models. Depending on the parameterization method, number of spatial zones in parameterization, and the calibration approach, 30 different models have been built and compared against 12 discriminating selection criteria in the Dehloran groundwater system application in Iran. The validation results have shown that the best model calibrated by a two-step approach is far superior than the one calibrated by a one-step approach. Nevertheless, the low ranked models of the first step should not be disregarded, because the high ranked models selected in the first and second steps of calibration are not necessarily the same. In addition, although the ratio of the number of total observations to the number of parameters strongly affects the performance of a selection criterion, a model's complexity assessment term of a number of selection criteria is not continuous with respect to the number of parameters that causes an inappropriate performance of these criteria at some certain points.

INTRODUCTION

In recent years, researchers have actively sought to construct aquifer simulation models as a primary tool for groundwater resource management. Groundwater modeling invariably involves simplifying a real system into a conceptual model to be used for analysis purposes. Such a conceptual model must capture the essence of the problem (Bredehoeft 2003). A number of researchers have emphasized that conceptualization is the most important and difficult step in groundwater modeling (Samper et al. 1990; Kolm & Van Der Heijde 1996; Poeter & Hill 1997; Bredehoeft 2003, 2005). Recent studies have clearly shown that a large degree of uncertainty arising in parameter estimation, predictions and results of a groundwater model lies in conceptual model deficiencies (e.g., Hyun & Lee 1998; Bredehoeft 2003, 2005; Sonnenborg et al. 2003; Poeter & Anderson 2005; Gaganis & Smith 2006; Rojas et al. 2010).

Modelers are faced with an enormous volume of data in practical cases. As well, in some cases the available field data have many shortcomings. In the first case, more than one description of the system may result from the conceptualization step (Samper et al. 1990). In the latter case, in order to ensure that the true representation of the real system is reflected in the model construction, one should develop and test a set of candidate models. The candidate models typically differ in the way that they conceptualize the real system.

Given the uncertainties in modeling (conceptualization, parameterization, and calibration), a groundwater model can only be used to approximate the behavior of a subsurface system. Therefore, one can merely arrive at the best approximation (‘best’ model) and/or make inference from a set of ‘good’ models. A good model is an effective one that has as much transparency (understandability of the model dynamics) as accuracy (ability to simulate the behavior of the real system). Perhaps, we cannot do any better than to use our hydrogeologic insight to develop what we believe is a full set that includes the true situation, and then admit that we are merely making our best hydrogeologically informed guess (Voss 2011b).

The above-mentioned recognition has led to a growing tendency among hydrologists to postulate several alternative hydrologic models for a site by using a multi-modeling approach and use various criteria to (1) rank these models, (2) eliminate some of them, and/or (3) weight and average predictions and statistics generated by multiple models (Ye et al. 2008). Model selection techniques have been applied in several groundwater modeling studies (Hyun & Lee 1998; Poeter & Anderson 2005; Foglia et al. 2007; Tan et al. 2008; Rojas et al. 2010; Ye et al. 2010; Engelhardt et al. 2012).

In this study, some new model selection techniques have been used, and their performances have been evaluated. The paper particularly focuses on how to rank different conceptualizations or models while using different selection criteria. In this regard, the steps of a model selection process are first reviewed, and the by-product profits that can be obtained from model selection criteria are then introduced. To this end, several criteria are introduced and examined as effective, yet practical approaches for ranking candidate models and, consequently, selecting the best one or defining a set of good models. Finally, model selection techniques and criteria are evaluated within a multi-modeling approach in the Dehloran aquifer system located in an arid western part of Iran, and some practical points related to the appropriate use of selection criteria are provided.

MODELING TOOLS AND METHODOLOGY

Model selection criteria

While evaluating several alternative groundwater simulation models based on different conceptualizations, we need to calibrate each of the models. To do so, automatic inverse modeling is often used as a state-of-the-art convenience to estimate hydrogeological parameters of interest. Then, model selection criteria are used for assessing and ranking the candidate models as well as making some beneficial inferences. In this study, we use a number of selection criteria whose theoretical principles and formulas are discussed below.

It has been generally understood that as parameters are added to a model, the modeling error decreases, and the fit improves although the model uncertainty increases, and the predictions become less accurate (Yeh 1986; Poeter & Anderson 2005; Hill 2006). Thus, comparisons are to be made among several models in terms of both the models’ complexity and the modeling error indisputably. Model selection criteria or information criteria (IC) are used for this purpose.

Table 1 elaborates on the formulas and elements of 12 model selection criteria being investigated in this study. All the formulas comprise a term representing lack of fit and a bias correction factor quantifying a model's complexity. With regard to lack of fit, there is usually a need to calculate the maximum likelihood estimate of the residual variance, . For this case, where (Poeter & Anderson 2005; Singh et al. 2010), WRSS is the weighted residual sum of squared, and N is the number of total observations equal to the number of observations, , plus the number of prior estimation equations, (Hill & Tiedeman 2007). Thus, the term can be replaced with in Table 1.

Table 1

Selection criteria and their formulas being investigated

Criterion Symbol Formula Reference 
Akaike Information Criteria AIC  Akaike (1974)  
Small-sample correction of AIC AICc  Hurvich & Tsai (1989); Sugiura (1978)  
Unbiased version of AIC AICu  McQuarrie et al. (1997)  
Consistent version of AIC CAIC  Bozdogan (1987)  
Schwarz Information Criterion SIC or BIC  Schwarz (1978)  
Small-sample correction of SIC SICc  McQuarrie (1999)  
Draper Information Criterion DIC  Draper (1995)  
Hannan Information Criterion HIC  Hannan (1980)  
Small-sample correction of HIC HICc  McQuarrie & Tsai (1998)  
Corrected Residual Information Criterion RICc  Leng (2013)  
Large-sample model selection criterion based on Kullback's symmetric divergence KIC or AIC3  Cavanaugh (1999); Bozdogan (1994)  
Second order of KIC KICc  Cavanaugh (2004)  
Criterion Symbol Formula Reference 
Akaike Information Criteria AIC  Akaike (1974)  
Small-sample correction of AIC AICc  Hurvich & Tsai (1989); Sugiura (1978)  
Unbiased version of AIC AICu  McQuarrie et al. (1997)  
Consistent version of AIC CAIC  Bozdogan (1987)  
Schwarz Information Criterion SIC or BIC  Schwarz (1978)  
Small-sample correction of SIC SICc  McQuarrie (1999)  
Draper Information Criterion DIC  Draper (1995)  
Hannan Information Criterion HIC  Hannan (1980)  
Small-sample correction of HIC HICc  McQuarrie & Tsai (1998)  
Corrected Residual Information Criterion RICc  Leng (2013)  
Large-sample model selection criterion based on Kullback's symmetric divergence KIC or AIC3  Cavanaugh (1999); Bozdogan (1994)  
Second order of KIC KICc  Cavanaugh (2004)  

: the maximum likelihood for the candidate model; : number of estimated model parameters plus one; : number of observations; c: arbitrary positive constant greater than or equal to 2 (normally c = 2 (Carrera & Numan 1986)).

The second term penalizes models with greater complexity and is a function of K and N where K is the number of estimated parameters. In the case of least-squares regression, the value of K must be increased by 1 to reflect the variance estimate as an extra parameter (Poeter & Anderson 2005; Symonds & Moussalli 2011). The penalty term measures the increased unreliability of the chosen model due to the increased number of parameters (Mutua 1994).

Figure 1 illustrates the penalty term values of different criteria for = 5, 10, 20, 50 and = 1, …, 100. One can see that the penalty term in AIC and KIC is constant for a given K. For CAIC, SIC, DIC, and HIC, increasing N with a constant K enlarges the value of penalty term significantly, whereas a contrary behavior is observable for AICu, RICc, and KICc. This figure also shows that the penalty terms of AICc, SICc, and HICc are treated differently compared to other criteria.

In general, the variations of the penalty terms along with changes in K are low for a given N. However, when K is equal to , the penalty term cannot be calculated for AICc. This condition occurs for SICc and HICc as well when K is equal to . Similarly, the penalty terms are incomputable for AICu and KICc when (except in point for KICc) and for RICc when (except in point ). The penalty values of AICc, SICc, and HICc are relatively high and vary sharply when K is located near N. Furthermore, significant changes in value and sign can occur for the penalty term as a result of minor variations in K. For example, when = 50, for = 46, 48, and 50, the values of AICc's penalty term will be 1,533, 4,800, −5,000, respectively. This indicates that AICc may behave irrationally around the point at which with two different asymptotically converging limits of the penalty values from left and right sides (see Figure 1 at ). This situation proves true for AICu, (KICc) and RICc when K is selected approximately near and . This indicates that all these criteria (AICc, SICc, HICc, HICu, KICc, and RICc) may behave irrationally around a certain point ( or or , depending on the criterion) at which the penalty term is not continuous and would converge asymptotically to different limits from different sides (see Figure 1).

Figure 1

Penalty term of under-investigated model selection criteria for N =5, 10, 20, 50 and K =1, …,100.

Figure 1

Penalty term of under-investigated model selection criteria for N =5, 10, 20, 50 and K =1, …,100.

Figure 1

continued.

Figure 1

continued.

Figure 1

continued.

Figure 1

continued.

The comparison of SIC with AIC and AIC3 reveals that the SIC's penalty term is larger than that of AIC and AIC3 when N is greater than 102 and 103, respectively. In addition, it is evident that RICc has a larger penalty function compared to AIC and a smaller penalty compared to AICc when (Leng 2013). If K is considered as the primary measure of complexity (Hill 2006; Voss 2011b), unlike AICu, RICc, and KICc, the penalty terms of CAIC, SIC, DIC, and HIC will increase by adding the observed data when the complexity of the model is kept constant. Such a result may not be justified as it is expected that the penalty term (model complexity) increases while increasing .

AIC has a tendency to overfit when the sample size N is small or when K is a moderate-to-large fraction of the sample size N (Hurvich & Tsai 1989). Compared to AIC, AICc has an additional nonstochastic penalty term that is for small-sample bias adjustments (Hurvich & Tsai 1989) and, consequently, tends to select much better models. AICc has been recommended where is less than 40 (Burnham & Anderson 2002; Hill & Tiedeman 2007). When N becomes larger than K, the additive coefficient converges to 1, and AICc approaches AIC. Compared to AICc, AICu has an additional penalty term that provides better model choices than AICc for moderate-to-large sample sizes N, except when the ‘true’ model is of infinite order (Rao & Wu 2001). Contrary to the AIC, CAIC practically never overfits, but it often underfits (Dziak et al. 2012).

SIC can overfit in small samples the same way as AIC does (McQuarrie 1999). The models selected by SIC tend to contain fewer parameters than those selected by AIC. In the same vein, SICc shows a better performance than SIC in small samples. The penalty term in DIC is typically similar to that of SIC when the number of observations N is replaced with . As argued by Draper (1995), this correction is necessary in order to improve the performance in finite samples (Figure 1). As well, HIC is obtained by substituting the factor of the penalty term in SIC with a slower diverging quantity, i.e., . HICc is a small-sample bias correction version of HIC. KIC (or AIC3) is a more penalized version of AIC that places more emphasis on parsimony than AIC does. KICc is known as a large-sample based selection criterion where under-fitting is not that much problematic, yet overfitting still remains probable (Cavanaugh 2004).

The above explanations and analyses reveal that the behavior of different criteria and their orders of magnitude with respect to N, K, and may be complicated or even misleading. Comparative measures focusing on their relative values could, therefore, be of more help to clarify their relative advantages and disadvantages.

Assessment measures

Δ value

Given the aforementioned IC, the model with the lowest value, ICmin, may be chosen as the ‘best’ model, and the remaining models are ranked from best to worst. High ranked models constitute a set of ‘good’ models. In addition, IC differences ( values), , where i = 1, 2, … , R, can be used in order to rank the competing models (Burnham & Anderson 2004). Δ value for each model is a representation of the estimated distance to the best model.

Evidence ratio

To quantify the plausibility of each model, one needs an estimate of the likelihood of the model to be the best approximation for a given data set, (modeli|data). This term is equal to (Akaike 1981; Burnham & Anderson 2004). The ratio of likelihood of model i to that of model j is termed as the evidence ratio (ER): 
formula
1

This quantity provides a measure of the extent to which model i is more likely to be the best model than model j, especially when i is the best model (Poeter & Anderson 2005).

Model weight

A posterior model weight, , is obtained for model i by normalizing the model's likelihood in such a way that the denominator is simply the sum of the relative likelihoods for all candidate models: 
formula
2
In this way, two models with the same are given the same weight regardless of the values of their penalty terms, where can be interpreted as the weight of evidence or the probability that model i is the best model, so . Unlike the ER, values depend on the full set of models. A model with the greatest w has the lowest value of . In the case of the AIC family, these weights or model probabilities are formal probabilities, yet they differ from Bayesian posterior probabilities (Lukacs et al. 2010).
Using the model weights, one can compare specific subsets of models with other subsets in the same way (e.g., all models containing an interaction term versus those that do not) (Wagenmakers & Farrell 2004; Lavoue & Droz 2009). For the sake of illustration, consider the candidate models sorted in two classes of A and B containing two types of different parameters (e.g., hydraulic conductivity (HK) and specific storage). Without loss of generality, assume that  
formula
Then, models of class A are q times more likely than those of class B. The relative importance of individual parameters can also be investigated using weights, To this end, the weights for each model containing the parameters of interest are summed, which estimate the relative importance of the parameters in question for the available data and candidate models (Symonds & Moussalli 2011).

Rejection rule

Some simple rules of thumb have been introduced as cut-off rules to eliminate uninformative models. As discussed by Burnham & Anderson (2004), Poeter & Anderson (2005), and Burnham et al. (2011), the models with have substantial support and are regarded as very good models, and those with have considerably less empirical support and should rather be dismissed. Finally, models with have essentially no support and can be definitely dismissed from further consideration. Richards (2005, 2008) explicated a cut-off border whereby models with should be considered for making inferences, and the rest are disregarded. However, laying some models away using a defined cut-off threshold for values seems not to be a good way for multi-model based inferences because: (1) sometimes models with lower values have relatively low weights, w (say 0.5), meaning that the retained models as well as the ones deleted from the candidate set have less chance to be regarded as ‘good’ models; and (2) for values gathered around the cut-off borders, there is no explicit distinction between desirability of models located up or down. These values are on a continuous scale of information and are interpretable regardless of the measurement scale and whether the data are continuous, discrete, or categorical (Burnham et al. 2011).

A rather less Procrustean approach is to produce a ‘confidence set’ or ‘credibility set’ of models that are most realistically likely to be the best approximating model (Symonds & Moussalli 2011). This confidence set can be achieved by adding the weights, , from the best downwards. If the sum of the highest values, , is (or any other arbitrary value), one can say that the best model is within this subset by 95% confidence and could decide to base inferences only on these models. Another reasonable basis for constituting a confidence set of candidate models is selecting the models with weights that are within 10% of the highest (see e.g., Burnham & Anderson (2002) and Omole et al. (2013)).

GROUNDWATER MODEL SETUP

Study area

Dehloran regional groundwater flow system encompasses nearly 568.5 km2 and is located in the west of Iran about 228 km distance from Ilam City (Figure 2). Dehloran plain lies between 32 °22′ and 32 °45′ northern latitude and 47 °05′ and 47 °50′ eastern longitude. Dehloran City is situated in the north of the plain, and the underlying groundwater resource serves a leading role in securing irrigation demand for the region. Two major agricultural zones are considered to be the primary consumers of the groundwater resource in the south-east and western outskirt of the plain. In addition, there is a municipal pumping well field in the south-east of the plain, above irrigation areas, which supplies the necessary drinking water for Dehloran City. The study area covers 360 km2.

Figure 2

Location of the study area.

Figure 2

Location of the study area.

The analysis of results from geoelectrical profiles data (Mahab Ghods Consulting Engineers 1992) has led to the recognition of geologic layers, lithological cross sections, the aquifer geometry and the layers’ properties. The aquifer depth is within the limits of 140 m in the center of the plain, reaching 150 m in some points and decreases to 50 m and 5 m in the south and north, respectively. There are 22 observation wells, of which 18 wells are located in our modeling domain. According to data collected in 2007–2008 (Mahab Ghods Consulting Engineers 2008), there are a total of 233 operational wells in Dehloran well-field with about 67.0 MCM (million cubic meters) extraction of water annually, of which 115 wells with 38.2 MCM annual water extraction are located in the modeling domain.

Parameterization method

The zonation method (ZM) and the regularized pilot points (RPP) (Doherty 2003; Moor & Doherty 2006) method have been used to parameterize unknown parameters of the aquifer.

In the ZM, the model domain is divided into multiple subdomains or zones, and a constant value of a parameter of interest is assigned to each zone. These constants may then be considered as unknowns being adjusted in a calibration process (Prasad & Rastogi 2001; Rojas & Dassargues 2007).

The RPP method aims to provide a middle ground between cell-by-cell variability and the reduction to a few homogeneous zones for characterizing natural-world heterogeneity in groundwater models (Doherty & Hunt 2010). In this approach, a set of 2D scatter points (pilot points) are initially distributed throughout the model domain. The values assigned to the points of interest are then estimated by calibration. Afterward, the ‘true’ values for other cells of the model are calculated by using a spatial interpolation technique such as Kriging. In the RPP method (Alcolea et al. 2006), the initial values of pilot points must be in a geologically reasonable range expected in the study area. Regularization imposes an additional knowledge as well as a geological technical perspective, in the form of prior information, on the parameter estimation process and can be integrated by the method of pilot points (Doherty 2003; Moor & Doherty 2006; Franssen et al. 2009; Doherty & Hunt 2010). In fact, the addition of each new parameter will be accompanied by at least one (possibly many) ‘regularization observations’, thus restoring the numerical predominance of observations over parameters (Doherty 2003).

HK and specific yield (Sy) are considered as unknown parameters to be determined by inverse modeling (see Table 2). A SWAT (Soil and Water Assessment Tool; Arnold et al. 1998) model of the investigated area has been developed and calibrated by Ehtiat et al. (2015) to estimate time-variant recharge rates to the aquifer. The ZM has been used to model spatial distribution of recharge and Sy, and both the methods of RPP and zonation have been used for modeling spatial variations of HK. We have also considered the zonation patterns of Sy the same as those of HK, as illustrated in Figure 3.

Table 2

Definition of different conceptual models being evaluated

Group num. Calibration state Index Unknown parameters
 
Parameterization method
 
Num. of parameters (KNum. of total observation (NN/K 
HK Sy Zonation Pilot point 
Group 1 Steady-state A1   18 9.0 
A2   18 4.5 
A3   18 2.3 
A4   13 18 1.4 
A5   19 18 0.9 
B1   21 38 1.8 
B2   31 48 1.5 
B3   46 63 1.4 
B4   55 72 1.3 
B5   76 93 1.2 
Group 2 Transient A6   216 108.0 
A7   216 54.0 
A8   216 27.0 
A9   13 216 16.6 
A10   19 216 11.4 
B6   216 108.0 
B7   216 54.0 
B8   216 27.0 
B9   13 216 16.6 
B10   19 216 11.4 
Group 3 Transient A11 216 72.0 
A12 216 30.9 
A13 15 216 14.4 
A14 25 216 8.6 
A15 37 216 5.8 
B11 22 236 10.7 
B12 34 246 7.2 
B13 53 261 4.9 
B14 67 270 4.0 
B15 94 291 3.1 
Group num. Calibration state Index Unknown parameters
 
Parameterization method
 
Num. of parameters (KNum. of total observation (NN/K 
HK Sy Zonation Pilot point 
Group 1 Steady-state A1   18 9.0 
A2   18 4.5 
A3   18 2.3 
A4   13 18 1.4 
A5   19 18 0.9 
B1   21 38 1.8 
B2   31 48 1.5 
B3   46 63 1.4 
B4   55 72 1.3 
B5   76 93 1.2 
Group 2 Transient A6   216 108.0 
A7   216 54.0 
A8   216 27.0 
A9   13 216 16.6 
A10   19 216 11.4 
B6   216 108.0 
B7   216 54.0 
B8   216 27.0 
B9   13 216 16.6 
B10   19 216 11.4 
Group 3 Transient A11 216 72.0 
A12 216 30.9 
A13 15 216 14.4 
A14 25 216 8.6 
A15 37 216 5.8 
B11 22 236 10.7 
B12 34 246 7.2 
B13 53 261 4.9 
B14 67 270 4.0 
B15 94 291 3.1 
Figure 3

Parameterization methods of HK and specific yield (Sy) used in the models of type A and type B. Dark points are the location of pilot points. Note: zonation and RPP parameterization methods have been used in models of type A and type B, respectively, and the zonation patterns of Sy have been considered the same as those of HK.

Figure 3

Parameterization methods of HK and specific yield (Sy) used in the models of type A and type B. Dark points are the location of pilot points. Note: zonation and RPP parameterization methods have been used in models of type A and type B, respectively, and the zonation patterns of Sy have been considered the same as those of HK.

Conceptual models

Several alternative conceptualizations have been considered based on: (1) the parameterization methods of HK and Sy, (2) the number of spatial zones or pilot points in parameterization, and (3) the calibration approach (one-step or two-step) (Table 2). Accordingly, the developed models have been grouped into two types of A and B based on parameterization methods used for unknown parameters. The ZM is used for parameterization of both HK and Sy in the type A models, whereas in the type B models, the RPP and the ZMs are employed for HK and Sy parameterizations, respectively.

Two calibration procedures have also been used to estimate the parameter values, namely two-step and one-step calibration approaches. In the two-step approach, for each type A or type B model, there are five sub-models (A1–A5 and B1–B5) for which a steady-state groundwater model is first calibrated. Then, the resulting spatial distributions of HK are used in the subsequent transient models (A6–A10 and B6–B10), where zonal Sy values are calibrated. However, in the one-step approach, both HK and Sy are calibrated concurrently by transient models A11–A15 and B11–B15.

Zonal patterns of Sy for type B models are the same as those for the corresponding type A models with 1, 3, 7, 12, and 18 zones, respectively. The estimated distribution of HK for models B1–B5 and B11–B15 has been characterized using 20, 30, 45, 54, and 75 pilot points, respectively. In the steady-state calibration mode, the differences among models A1–A5 is in the number of zones, and the number of pilot points and their distributions are different for models B1–B5. As a result, a total of 30 models have been developed for comparison purposes.

The US Geological Survey modular flow simulator MODFLOW (McDonald & Harbaugh 1988) and PEST optimizer (Doherty 2002) have been employed for simulation and optimization (automatic calibration) of all the models, respectively. Pilot points in combination with the regularization capabilities of PEST have been exploited in the way described by Doherty (2003). Assigning initial values and emplacement of pilot points have also been coordinated considering the attention points given by Doherty & Hunt (2010). Specifically, the pilot points have been located where results from a pumping test have been available. Therefore, some of the pilot point initial values have been set to estimated K values, and others have been assumed according to the K map of the study area. Lower and upper bounds of K have also been selected based on the values recommended in previous studies.

The steady-state and transient modes of calibration have been conducted based on the data on water levels from October 2005 to September 2006. The recharge rates have been calculated on a daily basis by SWAT, and the groundwater models have been calibrated using annual and monthly time steps in steady-state and transient modes, respectively.

To attain an acceptable simulated behavior of a subsurface system, the residuals between observed and simulated values must be limited. Acceptable residuals may vary for different hydraulic head calibration targets within different models (e.g., Henriksen et al. 2003; Rojas & Dassargues 2007). In this study, a calibration target has arbitrarily been defined as a root mean squared error (RMSE) lower than 1 m, which is about 1.0% of the difference between the highest and the lowest heads across the site.

RESULTS AND DISCUSSION

Calibration results

Figure 4 demonstrates the results of the steady-state and the transient modes of automatic calibration, including the RMSE values and the WRSS errors between simulated and observed water levels at the observation wells. In the case of steady-state calibration, RMSE values for the type B models with RPP method of parameterization are significantly less than those for the models of type A. Nevertheless, the calibration target is met only in model B5 and partly in model B4. In addition, the results indicate that the modeling error decreases for both types of models when the parameter dimension increases. This behavior is also observable in transient calibration of type B models while calibrating Sy. However, for the models of type A and in the same calibration mode, the maximum and minimum values of RMSE are associated with models A7 and A9, respectively. In transient calibration of Sy, the calibration target is met in models B8–B10 and partly in models A6 and A7.

Figure 4

RMSE and the WRSS errors for 30 models at the end of steady-state and transient modes of automatic calibration. Errors calculated between simulated and observed water levels of 18 observation wells.

Figure 4

RMSE and the WRSS errors for 30 models at the end of steady-state and transient modes of automatic calibration. Errors calculated between simulated and observed water levels of 18 observation wells.

The models of type A and type B calibrated by the one-step approach are found to outperform the corresponding models calibrated by the two-step approach. Nonetheless, the improvement for models of type A is significantly greater than that for type B models in this case. It shows the flexibility of the RPP method compared to the ZM. This is because the estimation of spatially distributed parameters by means of RPP often involves, to some extent, the imposition of smoothness of equality constraints of the inverse calibration problem, and therefore creates more confidence with respect to uniqueness and stability of the identified set of parameters. As well, an acceptable characterization of HK variations over the model domain is attainable by using the RPP method in the two-step calibration approach. In the one-step calibration approach, RMSE values of type B models decrease uniformly while increasing the number of pilot points, whereas in the case of the ZM, the largest RMSE belongs to model A13 and then decreases with the rise of the number of zones. In this mode, the calibration target is met in all the models except for model A13.

Ranking of models

To rank the models, models A1–A5 plus B1–B5, models A6–A10 plus B6–B10, and finally, models A11–A15 plus B11–B15 have been categorized into three separate groups of 1, 2, and 3, respectively, and then and values for each model i of every group have been calculated, as presented in Table 3.

Table 3

and values for each model i of all groups 1, 2, and 3. Numbers in bold font indicate best models of each group

Group num. Index AIC
 
AICc
 
AICu
 
CAIC
 
SIC
 
SICc
 
DIC
 
HIC
 
HICc
 
RICc
 
AIC3
 
KICc
 
Value Δ wi Value Δ wi Value Δ wi Value Δ wi Value Δ wi Value Δ wi Value Δ wi Value Δ wi Value Δ wi Value Δ wi Value Δ wi Value Δ wi 
Group 1 A1 −8.1 170.4 0.000 −7.3 354.3 0.000 −5.8 0.0 0.935 −7.5 97.3 0.000 −9.5 171.3 0.000 −8.8 153.8 0.000 −9.2 156.8 0.000 −11.7 274.1 0.000 −11.5 59.6 0.000 −8.0 137.3 0.000 −6.1 96.4 0.000 −8.0 0.0 0.925 
A2 −6.1 172.4 0.000 −3.0 358.5 0.000 −0.5 5.3 0.065 −5.1 99.8 0.000 −9.1 171.8 0.000 −6.6 156.1 0.000 −8.3 157.7 0.000 −13.3 272.4 0.000 −12.9 58.2 0.000 −5.5 139.8 0.000 −2.1 100.4 0.000 −3.0 5.0 0.075 
A3 −0.9 177.5 0.000 15.1 376.6 0.000 20.5 26.3 0.000 1.1 106.0 0.000 −6.9 174.0 0.000 5.6 168.3 0.000 −5.3 160.6 0.000 −15.4 270.4 0.000 −13.4 57.7 0.000 3.6 148.9 0.000 7.1 109.5 0.000 16.9 24.9 0.000 
A4 6.4 184.8 0.000 97.4 458.9 0.000 109.1 115.0 0.000 9.7 114.6 0.000 −3.3 177.6 0.000 78.3 240.9 0.000 −0.8 165.2 0.000 −17.1 268.7 0.000 −4.2 66.9 0.000 38.0 183.3 0.000 19.4 121.8 0.000 101.5 109.5 0.000 
A5 18.5 196.9 0.000 −361.5 0.0 1.000 NC NC NC 23.3 128.2 0.000 4.3 185.2 0.000 −162.6 0.0 1.000 8.0 174.0 0.000 −15.8 270.0 0.000 −42.0 29.1 0.000 NC NC NC 37.5 139.9 0.000 NC NC NC 
B1 −10.5 168.0 0.000 47.3 408.8 0.000 61.5 67.4 0.000 1.7 106.6 0.000 −19.3 161.6 0.000 31.6 194.2 0.000 −15.2 150.8 0.000 −44.1 241.6 0.000 −31.4 39.8 0.000 −7.0 138.3 0.000 10.5 113.0 0.000 57.2 65.2 0.000 
B2 −24.5 154.0 0.000 99.5 461.1 0.000 122.4 128.3 0.000 −3.4 101.5 0.000 −34.4 146.5 0.000 80.3 242.9 0.000 −28.3 137.7 0.000 −72.5 213.3 0.000 −41.7 29.4 0.000 −17.3 128.0 0.000 6.5 109.0 0.000 117.0 125.0 0.000 
B3 −71.5 107.0 0.000 198.8 560.3 0.000 236.3 242.1 0.000 −34.7 70.1 0.000 −80.7 100.1 0.000 184.1 346.8 0.000 −71.7 94.3 0.000 −140.0 145.7 0.000 −64.9 6.2 0.043 −57.1 88.2 0.000 −25.5 77.0 0.000 229.1 237.1 0.000 
B4 −103.1 75.4 0.000 281.9 643.4 0.000 328.9 334.8 0.000 −55.9 48.9 0.000 −110.9 69.9 0.000 277.2 439.9 0.000 −100.2 65.8 0.000 −183.5 102.2 0.000 −71.1 0.0 0.957 −83.6 61.7 0.000 −48.1 54.4 0.000 320.7 328.7 0.000 
B5 −178.5 0.0 1.000 553.0 914.6 0.000 624.1 629.9 0.000 −104.9 0.0 1.000 −180.9 0.0 1.000 597.1 759.7 0.000 −166.0 0.0 1.000 −285.8 0.0 1.000 −53.3 17.8 0.000 −145. 0.0 1.000 −102.5 0.0 1.000 613.5 621.5 0.000 
Group 2 A6 −506.6 66.2 0.000 −506.6 62.4 0.000 −505.3 54.6 0.000 −504.0 43.5 0.000 −506.0 60.5 0.000 −505.9 55.8 0.000 −505.6 57.2 0.000 −509.2 87.7 0.000 −509.1 86.2 0.000 −507.7 74.7 0.000 −504.6 49.2 0.000 −507.2 54.7 0.000 
A7 −508.7 64.2 0.000 −508.5 60.5 0.000 −506.3 53.6 0.000 −503.3 44.2 0.000 −507.3 59.2 0.000 −507.1 54.6 0.000 −506.5 56.2 0.000 −513.7 83.1 0.000 −513.6 81.7 0.000 −510.8 71.7 0.000 −504.7 49.2 0.000 −508.2 53.7 0.000 
A8 −429.0 143.8 0.000 −428.3 140.6 0.000 −424.3 135.5 0.000 −418.3 129.1 0.000 −426.3 140.1 0.000 −425.4 136.3 0.000 −424.8 138.0 0.000 −439.1 157.7 0.000 −438.8 156.5 0.000 −433.2 149.2 0.000 −421.0 132.8 0.000 −426.3 135.6 0.000 
A9 −369.4 203.4 0.000 −367.6 201.3 0.000 −361.3 198.5 0.000 −352.1 195.4 0.000 −365.1 201.4 0.000 −362.8 198.9 0.000 −362.5 200.2 0.000 −385.9 211.0 0.000 −385.1 210.2 0.000 −376.1 206.3 0.000 −356.4 197.4 0.000 −363.3 198.6 0.000 
A10 −387.3 185.5 0.000 −383.4 185.5 0.000 −374.3 185.5 0.000 −361.9 185.5 0.000 −380.9 185.5 0.000 −376.2 185.5 0.000 −377.2 185.5 0.000 −411.3 185.5 0.000 −409.8 185.5 0.000 −396.9 185.5 0.000 −368.3 185.5 0.000 −376.4 185.5 0.000 
B6 −398.1 174.7 0.000 −398.1 170.9 0.000 −396.8 163.1 0.000 −395.5 152.0 0.000 −397.5 169.0 0.000 −397.4 164.3 0.000 −397.1 165.7 0.000 −400.7 196.2 0.000 −400.6 194.7 0.000 −399.2 183.2 0.000 −396.1 157.7 0.000 −398.6 163.3 0.000 
B7 −463.3 109.5 0.000 −463.2 105.8 0.000 −461.0 98.9 0.000 −458.0 89.5 0.000 −462.0 104.5 0.000 −461.7 100.0 0.000 −461.2 101.5 0.000 −468.4 128.4 0.000 −468.3 127.0 0.000 −465.4 117.0 0.000 −459.3 94.5 0.000 −462.9 99.0 0.000 
B8 −533.1 39.8 0.000 −532.4 36.6 0.000 −528.4 31.4 0.000 −522.4 25.1 0.000 −530.4 36.1 0.000 −529.5 32.2 0.000 −528.8 33.9 0.000 −543.2 53.7 0.000 −542.9 52.4 0.000 −537.2 45.2 0.000 −525.1 28.8 0.000 −530.3 31.6 0.000 
B9 −550.3 22.6 0.000 −548.5 20.5 0.000 −542.2 17.7 0.000 −532.9 14.6 0.001 −545.9 20.6 0.000 −543.7 18.0 0.000 −543.4 19.4 0.000 −566.7 30.1 0.000 −566.0 29.4 0.000 −556.9 25.5 0.000 −537.3 16.6 0.000 −544.2 17.7 0.000 
B10 −572.8 0.0 1.000 −569.0 0.0 1.000 −559.8 0.0 1.000 −547.5 0.0 0.999 −566.5 0.0 1.000 −561.7 0.0 1.000 −562.8 0.0 1.000 −596.8 0.0 1.000 −595.3 0.0 1.000 −582.4 0.0 1.000 −553.8 0.0 1.000 −561.9 0.0 1.000 
Group 3 A11 −522.8 239.4 0.000 −522.6 148.4 0.000 −520.9 100.2 0.000 −518.8 105.8 0.000 −521.8 196.8 0.000 −521.6 83.0 0.000 −521.2 179.0 0.000 −526.6 350.0 0.000 −526.5 313.8 0.000 −524.3 278.7 0.000 −519.8 148.4 0.000 −522.8 101.1 0.000 
A12 −511.1 251.1 0.000 −510.5 160.5 0.000 −507.0 114.1 0.000 −501.7 122.9 0.000 −508.7 209.9 0.000 −508.0 96.6 0.000 −507.3 192.8 0.000 −519.9 356.7 0.000 −519.7 320.6 0.000 −514.7 288.3 0.000 −504.1 164.1 0.000 −508.9 115.0 0.000 
A13 −415.0 347.2 0.000 −412.6 258.4 0.000 −405.4 215.7 0.000 −395.0 229.6 0.000 −410.0 308.6 0.000 −407.0 197.5 0.000 −407.1 293.1 0.000 −434.0 442.6 0.000 −433.0 407.3 0.000 −422.7 380.4 0.000 −400.0 268.2 0.000 −407.4 216.5 0.000 
A14 −492.4 269.8 0.000 −485.5 185.5 0.000 −473.5 147.6 0.000 −459.0 165.6 0.000 −484.0 234.6 0.000 −475.7 128.9 0.000 −479.1 221.0 0.000 −524.0 352.6 0.000 −521.3 319.0 0.000 −504.8 298.2 0.000 −467.4 200.8 0.000 −475.6 148.3 0.000 
A15 −484.1 278.1 0.000 −468.3 202.7 0.000 −450.2 170.9 0.000 −434.8 189.8 0.000 −471.8 246.8 0.000 −452.7 151.8 0.000 −464.5 235.6 0.000 −530.9 345.7 0.000 −524.9 315.4 0.000 −501.8 301.2 0.000 −447.1 221.1 0.000 −452.4 171.4 0.000 
B11 −556.9 205.3 0.000 −552.1 118.9 0.000 −541.6 79.5 0.000 −526.7 97.9 0.000 −548.7 169.9 0.000 −542.8 61.8 0.000 −544.4 155.8 0.000 −584.4 292.2 0.000 −582.5 257.8 0.000 −568.0 235.0 0.000 −534.9 133.3 0.000 −543.7 80.2 0.000 
B12 −572.8 189.4 0.000 −561.5 109.5 0.000 −545.1 76.0 0.000 −525.5 99.1 0.000 −559.5 159.1 0.000 −545.6 59.0 0.000 −552.8 147.3 0.000 −615.1 261.5 0.000 −610.6 229.7 0.000 −589.6 213.4 0.000 −538.8 129.4 0.000 −547.3 76.6 0.000 
B13 −655.3 106.9 0.000 −627.7 43.4 0.000 −601.4 19.7 0.000 −580.2 44.3 0.000 −633.2 85.3 0.000 −599.0 5.5 0.053 −622.8 77.3 0.000 −720.7 155.9 0.000 −709.9 130.5 0.000 −680.5 122.5 0.000 −602.3 65.9 0.000 −603.7 20.1 0.000 
B14 −685.5 76.7 0.000 −640.4 30.7 0.000 −606.4 14.8 0.001 −589.6 35.0 0.000 −656.6 62.0 0.000 −600.7 3.9 0.119 −643.5 56.7 0.000 −767.8 108.8 0.000 −750.0 90.3 0.000 −716.4 86.6 0.000 −618.5 49.7 0.000 −608.9 15.0 0.001 
B15 −762.2 0.0 1.000 −671.1 0.0 1.000 −621.1 0.0 0.999 −624.6 0.0 1.000 −718.6 0.0 1.000 −604.6 0.0 0.829 −700.1 0.0 1.000 −876.6 0.0 1.000 −840.3 0.0 1.000 −803.0 0.0 1.000 −668.2 0.0 1.000 −623.9 0.0 0.999 
Group num. Index AIC
 
AICc
 
AICu
 
CAIC
 
SIC
 
SICc
 
DIC
 
HIC
 
HICc
 
RICc
 
AIC3
 
KICc
 
Value Δ wi Value Δ wi Value Δ wi Value Δ wi Value Δ wi Value Δ wi Value Δ wi Value Δ wi Value Δ wi Value Δ wi Value Δ wi Value Δ wi 
Group 1 A1 −8.1 170.4 0.000 −7.3 354.3 0.000 −5.8 0.0 0.935 −7.5 97.3 0.000 −9.5 171.3 0.000 −8.8 153.8 0.000 −9.2 156.8 0.000 −11.7 274.1 0.000 −11.5 59.6 0.000 −8.0 137.3 0.000 −6.1 96.4 0.000 −8.0 0.0 0.925 
A2 −6.1 172.4 0.000 −3.0 358.5 0.000 −0.5 5.3 0.065 −5.1 99.8 0.000 −9.1 171.8 0.000 −6.6 156.1 0.000 −8.3 157.7 0.000 −13.3 272.4 0.000 −12.9 58.2 0.000 −5.5 139.8 0.000 −2.1 100.4 0.000 −3.0 5.0 0.075 
A3 −0.9 177.5 0.000 15.1 376.6 0.000 20.5 26.3 0.000 1.1 106.0 0.000 −6.9 174.0 0.000 5.6 168.3 0.000 −5.3 160.6 0.000 −15.4 270.4 0.000 −13.4 57.7 0.000 3.6 148.9 0.000 7.1 109.5 0.000 16.9 24.9 0.000 
A4 6.4 184.8 0.000 97.4 458.9 0.000 109.1 115.0 0.000 9.7 114.6 0.000 −3.3 177.6 0.000 78.3 240.9 0.000 −0.8 165.2 0.000 −17.1 268.7 0.000 −4.2 66.9 0.000 38.0 183.3 0.000 19.4 121.8 0.000 101.5 109.5 0.000 
A5 18.5 196.9 0.000 −361.5 0.0 1.000 NC NC NC 23.3 128.2 0.000 4.3 185.2 0.000 −162.6 0.0 1.000 8.0 174.0 0.000 −15.8 270.0 0.000 −42.0 29.1 0.000 NC NC NC 37.5 139.9 0.000 NC NC NC 
B1 −10.5 168.0 0.000 47.3 408.8 0.000 61.5 67.4 0.000 1.7 106.6 0.000 −19.3 161.6 0.000 31.6 194.2 0.000 −15.2 150.8 0.000 −44.1 241.6 0.000 −31.4 39.8 0.000 −7.0 138.3 0.000 10.5 113.0 0.000 57.2 65.2 0.000 
B2 −24.5 154.0 0.000 99.5 461.1 0.000 122.4 128.3 0.000 −3.4 101.5 0.000 −34.4 146.5 0.000 80.3 242.9 0.000 −28.3 137.7 0.000 −72.5 213.3 0.000 −41.7 29.4 0.000 −17.3 128.0 0.000 6.5 109.0 0.000 117.0 125.0 0.000 
B3 −71.5 107.0 0.000 198.8 560.3 0.000 236.3 242.1 0.000 −34.7 70.1 0.000 −80.7 100.1 0.000 184.1 346.8 0.000 −71.7 94.3 0.000 −140.0 145.7 0.000 −64.9 6.2 0.043 −57.1 88.2 0.000 −25.5 77.0 0.000 229.1 237.1 0.000 
B4 −103.1 75.4 0.000 281.9 643.4 0.000 328.9 334.8 0.000 −55.9 48.9 0.000 −110.9 69.9 0.000 277.2 439.9 0.000 −100.2 65.8 0.000 −183.5 102.2 0.000 −71.1 0.0 0.957 −83.6 61.7 0.000 −48.1 54.4 0.000 320.7 328.7 0.000 
B5 −178.5 0.0 1.000 553.0 914.6 0.000 624.1 629.9 0.000 −104.9 0.0 1.000 −180.9 0.0 1.000 597.1 759.7 0.000 −166.0 0.0 1.000 −285.8 0.0 1.000 −53.3 17.8 0.000 −145. 0.0 1.000 −102.5 0.0 1.000 613.5 621.5 0.000 
Group 2 A6 −506.6 66.2 0.000 −506.6 62.4 0.000 −505.3 54.6 0.000 −504.0 43.5 0.000 −506.0 60.5 0.000 −505.9 55.8 0.000 −505.6 57.2 0.000 −509.2 87.7 0.000 −509.1 86.2 0.000 −507.7 74.7 0.000 −504.6 49.2 0.000 −507.2 54.7 0.000 
A7 −508.7 64.2 0.000 −508.5 60.5 0.000 −506.3 53.6 0.000 −503.3 44.2 0.000 −507.3 59.2 0.000 −507.1 54.6 0.000 −506.5 56.2 0.000 −513.7 83.1 0.000 −513.6 81.7 0.000 −510.8 71.7 0.000 −504.7 49.2 0.000 −508.2 53.7 0.000 
A8 −429.0 143.8 0.000 −428.3 140.6 0.000 −424.3 135.5 0.000 −418.3 129.1 0.000 −426.3 140.1 0.000 −425.4 136.3 0.000 −424.8 138.0 0.000 −439.1 157.7 0.000 −438.8 156.5 0.000 −433.2 149.2 0.000 −421.0 132.8 0.000 −426.3 135.6 0.000 
A9 −369.4 203.4 0.000 −367.6 201.3 0.000 −361.3 198.5 0.000 −352.1 195.4 0.000 −365.1 201.4 0.000 −362.8 198.9 0.000 −362.5 200.2 0.000 −385.9 211.0 0.000 −385.1 210.2 0.000 −376.1 206.3 0.000 −356.4 197.4 0.000 −363.3 198.6 0.000 
A10 −387.3 185.5 0.000 −383.4 185.5 0.000 −374.3 185.5 0.000 −361.9 185.5 0.000 −380.9 185.5 0.000 −376.2 185.5 0.000 −377.2 185.5 0.000 −411.3 185.5 0.000 −409.8 185.5 0.000 −396.9 185.5 0.000 −368.3 185.5 0.000 −376.4 185.5 0.000 
B6 −398.1 174.7 0.000 −398.1 170.9 0.000 −396.8 163.1 0.000 −395.5 152.0 0.000 −397.5 169.0 0.000 −397.4 164.3 0.000 −397.1 165.7 0.000 −400.7 196.2 0.000 −400.6 194.7 0.000 −399.2 183.2 0.000 −396.1 157.7 0.000 −398.6 163.3 0.000 
B7 −463.3 109.5 0.000 −463.2 105.8 0.000 −461.0 98.9 0.000 −458.0 89.5 0.000 −462.0 104.5 0.000 −461.7 100.0 0.000 −461.2 101.5 0.000 −468.4 128.4 0.000 −468.3 127.0 0.000 −465.4 117.0 0.000 −459.3 94.5 0.000 −462.9 99.0 0.000 
B8 −533.1 39.8 0.000 −532.4 36.6 0.000 −528.4 31.4 0.000 −522.4 25.1 0.000 −530.4 36.1 0.000 −529.5 32.2 0.000 −528.8 33.9 0.000 −543.2 53.7 0.000 −542.9 52.4 0.000 −537.2 45.2 0.000 −525.1 28.8 0.000 −530.3 31.6 0.000 
B9 −550.3 22.6 0.000 −548.5 20.5 0.000 −542.2 17.7 0.000 −532.9 14.6 0.001 −545.9 20.6 0.000 −543.7 18.0 0.000 −543.4 19.4 0.000 −566.7 30.1 0.000 −566.0 29.4 0.000 −556.9 25.5 0.000 −537.3 16.6 0.000 −544.2 17.7 0.000 
B10 −572.8 0.0 1.000 −569.0 0.0 1.000 −559.8 0.0 1.000 −547.5 0.0 0.999 −566.5 0.0 1.000 −561.7 0.0 1.000 −562.8 0.0 1.000 −596.8 0.0 1.000 −595.3 0.0 1.000 −582.4 0.0 1.000 −553.8 0.0 1.000 −561.9 0.0 1.000 
Group 3 A11 −522.8 239.4 0.000 −522.6 148.4 0.000 −520.9 100.2 0.000 −518.8 105.8 0.000 −521.8 196.8 0.000 −521.6 83.0 0.000 −521.2 179.0 0.000 −526.6 350.0 0.000 −526.5 313.8 0.000 −524.3 278.7 0.000 −519.8 148.4 0.000 −522.8 101.1 0.000 
A12 −511.1 251.1 0.000 −510.5 160.5 0.000 −507.0 114.1 0.000 −501.7 122.9 0.000 −508.7 209.9 0.000 −508.0 96.6 0.000 −507.3 192.8 0.000 −519.9 356.7 0.000 −519.7 320.6 0.000 −514.7 288.3 0.000 −504.1 164.1 0.000 −508.9 115.0 0.000 
A13 −415.0 347.2 0.000 −412.6 258.4 0.000 −405.4 215.7 0.000 −395.0 229.6 0.000 −410.0 308.6 0.000 −407.0 197.5 0.000 −407.1 293.1 0.000 −434.0 442.6 0.000 −433.0 407.3 0.000 −422.7 380.4 0.000 −400.0 268.2 0.000 −407.4 216.5 0.000 
A14 −492.4 269.8 0.000 −485.5 185.5 0.000 −473.5 147.6 0.000 −459.0 165.6 0.000 −484.0 234.6 0.000 −475.7 128.9 0.000 −479.1 221.0 0.000 −524.0 352.6 0.000 −521.3 319.0 0.000 −504.8 298.2 0.000 −467.4 200.8 0.000 −475.6 148.3 0.000 
A15 −484.1 278.1 0.000 −468.3 202.7 0.000 −450.2 170.9 0.000 −434.8 189.8 0.000 −471.8 246.8 0.000 −452.7 151.8 0.000 −464.5 235.6 0.000 −530.9 345.7 0.000 −524.9 315.4 0.000 −501.8 301.2 0.000 −447.1 221.1 0.000 −452.4 171.4 0.000 
B11 −556.9 205.3 0.000 −552.1 118.9 0.000 −541.6 79.5 0.000 −526.7 97.9 0.000 −548.7 169.9 0.000 −542.8 61.8 0.000 −544.4 155.8 0.000 −584.4 292.2 0.000 −582.5 257.8 0.000 −568.0 235.0 0.000 −534.9 133.3 0.000 −543.7 80.2 0.000 
B12 −572.8 189.4 0.000 −561.5 109.5 0.000 −545.1 76.0 0.000 −525.5 99.1 0.000 −559.5 159.1 0.000 −545.6 59.0 0.000 −552.8 147.3 0.000 −615.1 261.5 0.000 −610.6 229.7 0.000 −589.6 213.4 0.000 −538.8 129.4 0.000 −547.3 76.6 0.000 
B13 −655.3 106.9 0.000 −627.7 43.4 0.000 −601.4 19.7 0.000 −580.2 44.3 0.000 −633.2 85.3 0.000 −599.0 5.5 0.053 −622.8 77.3 0.000 −720.7 155.9 0.000 −709.9 130.5 0.000 −680.5 122.5 0.000 −602.3 65.9 0.000 −603.7 20.1 0.000 
B14 −685.5 76.7 0.000 −640.4 30.7 0.000 −606.4 14.8 0.001 −589.6 35.0 0.000 −656.6 62.0 0.000 −600.7 3.9 0.119 −643.5 56.7 0.000 −767.8 108.8 0.000 −750.0 90.3 0.000 −716.4 86.6 0.000 −618.5 49.7 0.000 −608.9 15.0 0.001 
B15 −762.2 0.0 1.000 −671.1 0.0 1.000 −621.1 0.0 0.999 −624.6 0.0 1.000 −718.6 0.0 1.000 −604.6 0.0 0.829 −700.1 0.0 1.000 −876.6 0.0 1.000 −840.3 0.0 1.000 −803.0 0.0 1.000 −668.2 0.0 1.000 −623.9 0.0 0.999 

The results indicate that AIC, CAIC, SIC, DIC, HIC, RICc, and AIC3 criteria select models B5, B10, and B15, respectively, as the ‘best’ model, although they are the most complex models (large number of parameters) among the models of each group. In other words, they select the transient model in group 2 in which the estimated parameters of HK have been obtained by the corresponding steady-state model in group 1. The selected models have been characterized with the RPP method and have shown the lowest RMSE. These selection criteria have opted for the aforementioned models with among the candidate models of groups 1, 2, and 3 (note that just CAIC has assigned a weight equal to 0.999 to model B10). As a result, the rejection rule approach and the idea of making a confidence set have not practically been useful for these criteria in this particular case study.

Model A5 has been selected by AICc and SICc as the best model with among the candidate models of group 1. It is worth noting that the values of penalty terms in these criteria are confusing. That is to say, the number of parameters for model A5 is near N, causing a large negative value for the penalty term. For example, the penalty term values for candidate models in group 1 (models A1–A5 and B1–B5) using the AICc criterion are 4.8, 11.1, 32.0, 117, −342.0, 99.8, 186.0, 362.3, 495, and 883.5, respectively. However, eliminating this model (A5) from the candidate set of group 1 and recalculating the weights, one can find that AICc and SICc select the simplest model in the set (model A1) with (for AICc) and 0.754 (for SICc), respectively, followed by model A2 as the second model with and 0.245, respectively.

For the candidate set of group 2, AICc and SICc opt for model B10 as the best model with weight values equal to 1.0. In other words, the AICc and SICc results indicate that the remaining models have relatively no empirical support just like the other criteria. Moreover, SICc nominates model B15 with the chance of 82.9% (i.e., ) as the best approximating model among the candidate models in group 3, whereas AICc tends to select model B15 as the best model with . Taking SICc into consideration, there is a 94.7% confidence that the best model in group 3 is one of the models B14 and B15. Also, model B15 has an ER equal to 7 compared to model 14, which means that model 15 is seven times more likely to be the best model than model 14.

AICu and KICc show the same performance in the selection of the best model in groups 1, 2, and 3 with no distinctive ranking of the models. In this regard, the simplest model, i.e., A1, from the candidate set of group 1 remains the best as chosen by AICu and KICc with w values equal to 0.935 and 0.925, respectively, whereas models B10 and B15 have been chosen by both criteria as the best model from the candidate sets of groups 2 and 3 with w values equal to 1 and 0.999, respectively. It is worth noting that the two mentioned criteria cannot be calculated for model A5.

From the candidate models of group 1, HICc selects model B4 as the best one with , followed by model B3 as the second model with . Thus, model B4 is 22 times more likely to be best model than model B3 (ER = 22). Similar to other criteria, HICc nominates models B10 and B15 as the best models of groups 2 and 3, respectively. Note that HICc is a small-sample bias correction version of HIC. We can verify when N is larger than K, values of these two criteria become closer, so both of them have selected the same model as the best one in groups 2 and 3. However, values of the two criteria are completely different for group 1's models. Therefore, the best selected model is different (models B4 and B5 for HICc and HIC, respectively).

AICu and KICc have been recommended for moderate-to-large sample size N and are known as large-sample based selection criteria (Rao & Wu 2001; Cavanaugh 2004). One can verify that for group 1's models with between 1 and 9, where K is a moderate-to-large fraction of sample size, AICu and KICc have given the highest rank to the simplest model (model A1) that shows the dominance of the model complexity term. Consequently, these criteria do not perform accurately when is low.

AICc and SICc, which respectively are small-sample bias correction versions of AIC and SIC, have been recommended where is low (say values less than 40 for AICc). For group 1's models with between 1 and 9, these two criteria have selected model A5 with the highest (=0.9). As mentioned before, the number of parameters for model A5 is near N, causing a large negative value for the penalty term. Eliminating this model (A5) from the candidate set of group 1 and recalculating the weights, one can find that AICc and SICc select the simplest model in the set (model A1) with the highest (=9.0) showing the dominance of the model complexity term. Model A1 has only one parameter with the largest RMSE (7.46 m) and cannot be a good candidate for the best model. Other criteria have selected the model whose is equal to 1.2 (model B5) and the dominant role of the factor of model fit. This model is the most complex model among the set of group 1's models with RMSE equal to 0.61 m. Note that HICc has opted for model B4 whose RMSE is very close to the minimum error. The results show that CAIC, HICc, and RICc have unexpectedly opted for the most complex models of group 1 (model B4 and B5).

For group 2's models with varying from 11 to 108, all 12 criteria have chosen the model with the lowest values of both and RMSE (model B10). In other words, for and K ranging from 1 to 19, these criteria have selected the model with the highest complexity (model B10). According to the results illustrated in Table 3, one can infer that while has varied from 3 to 72 for group 3's models, all the criteria have selected the model with the lowest (model B15). However, defining a new group containing models A6–A15 and B6–B15, one can verify that all the criteria tend to opt for the model with the highest weight and the lowest (model B15), similar to the ranking of group 3's models. It shows that when has varied from a moderate value to a large value, all 12 criteria have a tendency to overfit.

From an overall view, the results indicate that all the criteria evaluated in the present study have recommended models B15 and B10 as the best models in the one-step and the two-step calibration approaches, respectively. Model B10 has the lowest RMSE value from the candidate set of group 2's models, and model B15 has the lowest RMSE value among all the candidate models. For both of them, the parameterization of HK and Sy has been done by RPP and ZMs, respectively.

We finally tested the selected models based on a monthly data set of observed water levels from October 2001 to September 2006. A meaningful difference is observed in their performances as the RMSE values are equal to 0.68 m and 0.97 m for models B10 and B15, respectively. Figure 5 displays the simulated and observed water levels at three observation wells during the validation period. The location of these observation wells is shown in Figure 1. One can see that monthly water level fluctuations have been simulated well just in model B10, and model B15 has failed to simulate the fluctuations accurately.

Figure 5

Comparison of simulated and observed water levels at the observation wells 17, 7, and 5 between models B10 and B15 during the validation period. The target interval is ±1.0 m.

Figure 5

Comparison of simulated and observed water levels at the observation wells 17, 7, and 5 between models B10 and B15 during the validation period. The target interval is ±1.0 m.

CONCLUSIONS

This study elaborated on how to rank different conceptualizations of a groundwater system by using appropriate selection techniques and criteria that make a balance between the complexity and accuracy features of models.

We used a multi-modeling approach with regard to models’ parameterization and calibration. Different candidate models were generated depending on the parameterization method of HK by either ZM (type A) or RPP method (type B), the number of zones in the ZM or number of pilot points in the RPP method and the calibration approach (one-step or two-step). This resulted in 30 different models (Table 2) that were then evaluated and compared against 12 discriminating selection criteria. Some of the criteria, i.e., AIC, AICc, SIC, and HIC had been widely used in groundwater modeling, and the others, i.e., AICu, CAIC, SICc, DIC, HICc, RICc, AIC3, and KICc had not. The results obtained from the use of several models and selection techniques in modeling of the Dehloran groundwater system in Iran, in which a SWAT-based distributed hydrological model estimates the spatio-temporal patterns of recharge, indicate that:

  1. In the two-step calibration approach, the models with high ranks in the transient mode are not necessarily the best corresponding models in the steady-state calibration approach. This situation was observed for AICc, AICu, SICc, HICc, and KICc. This reveals that low rank models in the first step of calibration should not be disregarded from further consideration, and the selection procedure must be done at the end of the second calibration step.

  2. Models B10, calibrated by the two-step approach using RPP and ZMs of HK and Sy parameterization, respectively, and B15, calibrated by the one-step approach using the same parameterization methods, have been better than other candidate models based on calibration results. However, the validation results reveal that the best model calibrated by the two-step approach has been far superior than the one calibrated by the one-step approach.

  3. Regarding the number of observations, AICc and SICc criteria could be very sensitive to the number of calibration parameters for some of the models. This is the case for HICc, AICu, KICc, and RICc too because the penalty terms of these criteria are not continuous functions of K (number of parameters), so they may change abruptly at some certain points, approximately around point. Therefore, more care should be paid when using these criteria.

  4. The ratio of the number of total observations to the number of parameters strongly affects the performance of the selection criteria. AICc had been recommended when was less than 40 (Burnham & Anderson 2002; Poeter & Anderson 2005; Hill & Tiedeman 2007). In this study, AICc has been unable to determine the best model when varied from 1 to 9. In addition, it had been reported that AICu had provided better model choices than AICc for moderate-to-large sample sizes N without defining any range (Rao & Wu 2001). Our results have indicated that AICu may not work well enough for the values of between 1 and 9.

ACKNOWLEDGEMENT

This paper has been partially supported by the research grant No. 40/758 provided by Research Grants Program of Amirkabir University of Technology to the corresponding author.

REFERENCES

REFERENCES
Akaike
H.
1974
A new look at the statistical model identification
.
IEEE Trans. Automatic Control
AC-19
(
6
),
716
723
.
Arnold
J. G.
Srinivasan
R.
Muttiah
R. S.
Williams
J. R.
1998
Large area hydrologic modeling and assessment – Part 1: model development
.
J. Am. Water Resour. Assoc.
34
(
1
),
73
89
.
Bozdogan
H.
1994
Mixture-model cluster analysis using model selection criteria and a new informational measure of complexity
. In:
Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach, vol. 2
,
Kluwer Academic Publishing
,
Boston, MA
, pp.
69
113
.
Bredehoeft
J. D.
2005
The conceptualization model problem: surprise
.
Hydrogeol. J.
13
,
37
46
.
Burnham
K. P.
Anderson
D. R.
2002
Model Selection and Multimodel Inference
.
Springer
,
New York
,
USA
.
Burnham
K. P.
Anderson
D. R.
2004
Multimodel inference, understanding AIC and BIC in model selection
.
Sociol. Methods Res.
33
(
2
),
261
304
.
Doherty
J.
2002
Manual for PEST, 5th edition
.
Watermark Numerical Computing
,
Brisbane
,
Australia
.
Doherty
J.
Hunt
R. J.
2010
Approaches to highly parameterized inversion: A guide to using PEST for groundwater model calibration
.
U.S. Geological Survey Scientific Investigations Report 2010–5169
, p.
59
.
Draper
D.
1995
Assessment and propagation of model uncertainty
.
J. Roy. Stat. Soc.
57
,
45
97
.
Dziak
J.
Coffman
D. L.
Lanza
S. T.
Li
R.
2012
Sensitivity and Specificity of Information Criteria
.
The Pennsylvania State University
,
Technical report series 12–119
, p.
31
.
Ehtiat
M.
Mousavi
J.
Ashraf Vaghefi
S.
Ghaheri
A.
2015
Analysis of recharge conceptualization in inverse groundwater modeling
.
Hydrol. Sci. J.
(submitted).
Engelhardt
I.
De Aguinaga
J. G.
Mikat
H.
Schuth
C.
Lenz
O.
Liedl
R.
2012
Complexity versus simplicity: an example of groundwater model ranking with the Akaike Information Criterion
.
Hydrol. Earth Syst. Sci. Discuss.
9
,
9687
9714
.
Foglia
L.
Mehl
S. W.
Hill
M. C.
Perona
P.
Burlando
P.
2007
Testing alternative groundwater models using cross-validation and other methods
.
Ground Water
45
(
5
),
627
641
.
Franssen
H. J. H.
Alcolea
A.
Riva
M.
Bakr
M.
van der Wiel
N.
Stauffer
F.
Guadagnini
A.
2009
A comparison of seven methods for the inverse modelling of groundwater flow. Application to the characterisation of well catchments
.
Adv. Water Resour.
32
,
851
872
.
Hannan
E. J.
1980
The estimation of the order of an ARMA process
.
Ann. Statist.
8
(
5
),
1071
1081
.
Henriksen
H. J.
Troldborg
L.
Nyegaard
P.
Sonnenborg
T. S.
Refsgaard
J. C.
Madsen
B.
2003
Methodology for construction, calibration and validation of a national hydrological model for Denmark
.
J. Hydrol.
280
,
52
71
.
Hill
M. C.
Tiedeman
C. R.
2007
Effective Groundwater Model Calibration
.
Wiley
,
Hoboken, NJ
.
Hurvich
C. M.
Tsai
C.
1989
Regression and time series model selection in small samples
.
Biometrika
76
(
2
),
279
307
.
Kolm
K. E.
Van Der Heijde
P. K. M
.
1996
Conceptualization and characterization of envirochemical systems
. In:
Proc. Int. Conf. ModelCARE'96
,
IAHS 237
,
Golden, Colorado
,
September 1996
,
267
275
.
Leng
C.
2013
The Residual Information Criterion, Corrected. arXiv:0711.1918v1 [stat.ME]. http://arxiv.org/pdf/0711.1918.pdf
.
Lukacs
P. M.
Burnham
K. P.
Anderson
D. R.
2010
Model selection bias and Freedman's paradox
.
Ann. Inst. Stat. Math.
62
,
117
125
.
Mahab Ghods Consulting Engineers
1992
Groundwater study of Dehloran, Mousian and Einkhosh Plains (available in Persian).
Mahab Ghods Consulting Engineers
2008
Groundwater study of Dehloran Plain 2008 vol. 3 (available in Persian).
McDonald
M.
Harbaugh
A.
1988
A Modular Three-dimensional Finite-difference Groundwater Flow Model
.
USGS Technical Report on Modelling Techniques Book 6
,
USGS
,
Reston, VA
, p.
596
.
McQuarrie
A.
Tsai
C. L.
1998
Regression and Time Series Model Selection
.
World Scientific
,
Singapore
.
McQuarrie
A.
Shumway
R.
Tsai
C.
1997
The model selection criterion AICu
.
Stat. Prob. Lett.
34
,
285
292
.
Moor
C.
Doherty
J.
2006
The cost of uniqueness in groundwater model calibration
.
Adv. Water Resour.
29
,
605
623
.
Omole
D. O.
Longe
E. O.
Musa
A. G.
2013
An approach to reaeration coefficient modeling in local surface water quality monitoring
.
Environ. Model. Assess.
18
,
85
94
.
Poeter
E. P.
Anderson
D.
2005
Multimodel ranking and inference in ground water modeling
.
Ground Water
43
(
4
),
597
605
.
Poeter
E. P.
Hill
M. C.
1997
Inverse models: a necessary next step in ground water modeling
.
Ground Water
35
(
2
),
250
260
.
Rao
C. R.
Wu
Y.
2001
On Model Selection. IMS Lecture Notes- Monograph Series 38
, p.
57
.
Samper
J.
Carrera
J.
Galarza
G.
Medina
A.
1990
Application of an automatic calibration technique to modeling an alluvial aquifer
.
Int. Conf. ModelCARE'90, Hague, September 1990, IAHS
195
,
87
95
.
Schwarz
G.
1978
Estimating the dimension of a model
.
Ann. Stat.
6
(
2
),
461
464
.
Sonnenborg
T. O.
Christensen
B. S. B.
Nyegaard
P.
Henriksen
H. J.
Refsgaard
J. C.
2003
Transient modeling of regional groundwater flow using parameter estimates from steady-state automatic calibration
.
J. Hydrol.
273
,
188
204
.
Wagenmakers
E. J.
Farrell
S.
2004
AIC Model selection using Akaike weights
.
Psychon. Bull. Rev.
11
(
1
),
192
196
.
Ye
M.
Meyer
P. D.
Neuman
S. P.
2008
On model selection criteria in multimodel analysis
.
Water Resour. Res.
44
,
W03428
.
Ye
M.
Pohlmann
K. F.
Chapman
J. B.
Pohll
G. M.
Reeves
D. M.
2010
A model-averaging method for assessing groundwater conceptual model uncertainty
.
Ground Water
48
(
5
),
716
728
.