ABSTRACT
Among the several hydrological model uncertainty estimation methods, the generalized likelihood uncertainty estimation (GLUE) method is popular due to its simplicity. However, the application of GLUE tends to be limited to cases in which hydrological models are applied individually; notably, little attention is given to model differences when applying GLUE. This study introduced a framework for multi-hydrological model ensemble prediction uncertainty estimation (e-PRUNE). For demonstration, the framework was applied to real hydrometeorological data while considering three sub-sources of calibration-related uncertainty: the influence of the choice of a sampling scheme (SC), hydrological model (HM), and objective function (OF). Ten SCs, six HMs, and eight OFs were considered. Influences from SCs, OFs, and HMs were combined and assumed to substantially comprise the overall predictive uncertainty (OPU). The sub-uncertainty bound based on the choice of HM was larger than that based on the selection of either SC or OF. Contributions of sub-uncertainties from HM, SC, and OF to the OPU were additive. Thus, the effect of removing one source of uncertainty (for instance, OF) could easily be realized from the width of OPU's bounds. This study showed the importance of the e-PRUNE framework for insight into the contributions of various sub-uncertainty sources to the OPU.
HIGHLIGHTS
GLUE tends to be limited to cases when models are applied individually.
Model differences are apparent, yet they are not considered when applying GLUE.
A multi-model ensemble prediction uncertainty estimation (e-PRUNE) framework is introduced.
The potential of the framework was demonstrated using three uncertainty sources.
The framework gives insights into the contributions of various uncertainty sources.
INTRODUCTION
To aid predictive planning and decision-making for local or regional-scale management of hydrological systems under the ongoing global environmental challenges, the application of hydrological models (HMs) is on the increase. There are various sources of uncertainty in hydrological modelling, including model structures, parameters, model inputs, observed data, calibration data, and calibration approaches (Renard et al. 2010; Blöschl et al. 2019; Moges et al. 2020). Uncertainties related to calibration stem from various sources, such as errors in observations. Because the initial parameter values are often not optimal, mismatches between observed and modelled series tend to be large. Calibration, or the adjustment of parameter values to reduce the said mismatches and thereby increase the credibility of a model's prediction, can be made manually or automatically. Here, it is worth noting that there are several sub-sources of calibration uncertainty, including observation errors, the choice of a calibration method, the selection of an objective function (OF) (Onyutha 2024a), and opting for a particular sampling scheme (SC) (Onyutha 2024b). Apart from calibration uncertainty, other model uncertainty sources are linked with parameters, model structures, and model inputs (Renard et al. 2010; Faramarzi et al. 2013; Gupta & Govindaraju 2023). Due to the various uncertainty sources, there are normally many sets of model parameters that can guarantee a satisfactory model fit (Qi et al. 2019). This is the notion of equifinality (Beven & Binley 1992; Beven 2006), whereby there can exist various sets of model parameters for which the performance of a model cannot be rejected (Beven 2006). The said notion of equifinality gave rise to the generalized likelihood uncertainty estimation (GLUE) methodology (Beven & Binley 1992) for constructing predictive uncertainty bounds.
GLUE (Beven & Binley 1992) is an extension of the generalized sensitivity analysis (GSA) (Spear & Hornberger 1980). In GSA, each parameter of a model is generated from the normal or uniform distribution. Eventually, many sets of parameters can be obtained. A selected model is run using the various sets of parameters to obtain several modelled series of two groups: behavioural and non-behavioural solutions, which are consistent and inconsistent with observations, respectively (Spear & Hornberger 1980). In GLUE, several sets of parameters are generated, mainly from the uniform distribution. The model is run using each set of parameters. The acceptability of each model output is assessed in terms of the mismatch between observed and modelled series, expressed through a 'goodness-of-fit' measure. Model outputs for which values of the 'goodness-of-fit' measure fall above and below the chosen threshold are termed behavioural and non-behavioural solutions, respectively. The cumulative distribution function (CDF) of the model's output is constructed using the behavioural solutions. In turn, confidence intervals (CIs) (representing model uncertainty) are constructed using quantiles from the CDFs. For instance, to construct a 95% CI from 1,000 ranked solutions, the 25th and 975th values are selected. Compared with formal techniques of uncertainty analysis, the overall merit of GLUE has attracted considerable debate. For instance, Blasone et al. (2008) concluded that GLUE is ineffective in establishing behavioural solutions. The ineffectiveness of GLUE could, to some extent, be linked to errors in the model's outputs (Xiong & O'Connor 2008) and the reliance of the framework on the simple Monte Carlo sampling technique (Onyutha 2024b). Furthermore, Blasone et al. (2008) demonstrated that the uncertainty bounds from GLUE tend to be subjective. In appraisals of GLUE, the use of informal likelihoods was found to lead to flat posterior distributions (Christensen 2004; Mantovan & Todini 2006; Mantovan et al. 2007; Blasone et al. 2008; Stedinger et al. 2008; Liu et al. 2022). In the same vein, Stedinger et al. (2008) argued that the choice of a rigid threshold of a 'goodness-of-fit' measure for obtaining behavioural solutions indicates the incorrectness of the statistical analysis. Besides, Hossain & Anagnostou (2005) argued that GLUE has a prohibitive computational burden that substantially limits its application. Despite the above limitations, GLUE remains very popular because it is simple, easy to implement, and can handle several error structures and models (Blasone et al. 2008).
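To make the GLUE workflow described above concrete, the following minimal Python sketch applies the procedure to a hypothetical one-parameter model with NSE as the 'goodness-of-fit' measure. The model, forcing data, and threshold are illustrative assumptions, not part of any published GLUE code.

```python
import numpy as np

rng = np.random.default_rng(42)

def nse(obs, sim):
    """Nash-Sutcliffe efficiency; 1 indicates a perfect fit."""
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def toy_model(rain, k):
    """Hypothetical one-parameter rainfall-runoff stand-in."""
    return k * rain

rain = rng.gamma(2.0, 3.0, size=365)                      # synthetic forcing
obs = toy_model(rain, 0.6) + rng.normal(0.0, 0.5, 365)    # synthetic 'observations'

threshold, behavioural = 0.6, []
for _ in range(10_000):                      # many Monte Carlo runs
    k = rng.uniform(0.0, 1.0)                # parameter drawn from a uniform prior
    sim = toy_model(rain, k)
    if nse(obs, sim) >= threshold:           # behavioural if measure >= threshold
        behavioural.append(sim)

# 95% CI limits at each time step from the ranked behavioural outputs
lower, upper = np.percentile(np.array(behavioural), [2.5, 97.5], axis=0)
```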
A few researchers have made improvements to GLUE. One improvement was the introduction of an approach to stochastically model the parameters' response surface (Hossain & Anagnostou 2005). The advantage of this sampling approach is that it recognizes the inherent non-linear interactions among parameters and preserves the key features of the uncertainty structure. Blasone et al. (2008) improved GLUE through the use of adaptive Markov chain Monte Carlo (MCMC) sampling of the prior parameter space. The revised GLUE performs better than the original GLUE based on a pseudo-random generator (Blasone et al. 2008). In another development, by examining the connection between the informal GLUE and formal probabilistic approaches in hydrological modelling, Nott et al. (2012) showed that, even though the 'generalized likelihood' is not a true likelihood, the connection between formal Bayes and GLUE is deeper than the merely operational; in fact, GLUE corresponds to an approximate Bayesian procedure. According to Nott et al. (2012), 'Two interpretations of GLUE emerge, one as a computational approximation to a Bayes procedure for a certain "error-free" model and the second as an exact Bayes procedure for a perturbation of that model in which the truncation of the generalized likelihood in GLUE plays a role'. Chagwiza et al. (2015) introduced a framework, GLUEMIP, that combines GLUE with mixed integer programming (MIP) for model uncertainty quantification; it performed better than MIP when tested on a water distribution system. Onyutha (2024b) showed that the use of a pseudo-random generator in GLUE does not account for the influence of the selection of an SC on uncertainty bounds. To fill this gap, randomized block quasi-Monte Carlo sampling (RBMC) (Onyutha 2024b) was introduced to improve GLUE.
The application of GLUE tends to be limited to the use of a single model and a selected likelihood function. In other words, studies that extend the concept of GLUE while jointly considering various sources of uncertainty, such as model differences and disparity in the influences of OFs, are lacking. Of the various uncertainty sources, the influence of differences among HMs arguably requires further elaboration. A significant driving force for the marked diversity in HMs can be linked to their various applications (Weiler & Beven 2015; Horton et al. 2022). Thus, the degree of sophistication of each model should be commensurate with the purpose for which the model is needed (Rosbjerg & Madsen 2005). Nevertheless, the complexity of a model is no guarantee of top performance (Fatichi et al. 2016). Another reason for the differences among models can be thought of in terms of the modelling context, referred to as the uniqueness of place (Beven 2001). Here, models are required to be parsimonious and fit for purpose (Beven & Young 2013). As a matter of fact, it is practically impossible for one model to be fit for every purpose (Hämäläinen 2015). Differences among models have often been recognized in terms of ambiguity (Beck & Halfon 1991; Zin 2002; Beven 2006), a term used to differentiate models identified on the same model inputs that have overlapping predictive uncertainty bounds (Beck & Halfon 1991), or models whose forecasts, made with different model inputs, cannot be statistically differentiated (Zin 2002). When a dominant system response exists, its identification can possibly be made without relative ambiguity (Beven 2006). Ambiguity arises not in the system or model itself, but only in decisions regarding its different representations (Kirchner et al. 2001; Beven 2006). Ambiguity is a less contentious word than equifinality (Beven 2006). This can explain why little attention has been given to the differences among models despite their importance in the analysis of HM uncertainty.
To even out the uncertainties from the various models that can be used for a particular application, the use of a multi-model ensemble is encouraged, especially to boost the credibility of forecasts or predictions. However, frameworks are lacking for quantifying generalized uncertainties of multi-model ensembles while considering equifinality and differences among models. The need to address this gap is the motivation for this study. This is in line with the need to attend to the twentieth of the unsolved problems in hydrology (UPH 20) (Blöschl et al. 2019).
MATERIALS AND METHODS
Selected data and models
Many HMs were considered for this study, including the Nedbør-Afstrømnings-Model (NAM) (Nielsen & Hansen 1973), the Veralgemeend Conceptueel Hydrologisch Model (VHM) (Willems 2014), the HM focusing on sub-flows' variation (HMSV) (Onyutha 2019), ABCD (Martinez & Gupta 2010), the simplified HM (SIMHYD) (Porter & McMahon 1971), and the probability-distributed model (PDM) (Moore 2007). All these are lumped conceptual rainfall-runoff models, and they were selected because of their compatibility with the lumped nature of the meteorological inputs adopted for this study. For brevity, the structure of each model is not presented in this study; it can be obtained from the respective original paper cited in the first sentence of this paragraph.
Sources of uncertainty







A certain constant is first added to the values of x and y before applying Equation (3), to ensure that each of the x and y series comprises only positive values. The said constant is taken as the greater of the absolute values of the maximum and minimum of x. In any case, stream flows can never be negative; thus, the case with $\min(x) < 0$ may materialize only when modelling variables other than stream flows. Importantly, it should be apparent that a model that considers variability is not required when the x series exhibits no variability.
Generally, the limitations of R2 are better documented than those of other 'goodness-of-fit' measures. For instance, it does not indicate how biased a model is. Furthermore, a poor model can yield an R2 close to one, while a good model can produce an R2 close to zero. Notwithstanding these disadvantages, R2 is still arguably the most widely applied metric for assessing model quality (Onyutha 2022). This is why the use of R2 as an OF was considered in this study. Nevertheless, to ensure that the uncertainty range based on R2 was not exaggerated, two stringent criteria were applied. First, the threshold for populating the set of behavioural solutions was raised to 0.80, compared with, say, that of MSC, which was kept at 0.60. Second, modelled series for which R2 was high (for instance, R2 = 0.8) while NSE and KGE remained low (e.g. less than zero) were excluded from the set of behavioural solutions. Here, NSE and KGE were considered because they are variants of R2.
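As a minimal sketch, the two stringent criteria described above can be written as a screening filter. The function name, and the reading that both NSE and KGE must be poor for exclusion, are assumptions for illustration:

```python
def is_behavioural_r2(r2: float, nse: float, kge: float,
                      r2_threshold: float = 0.8) -> bool:
    """Screening when R2 is the OF: a raised threshold (0.8), plus rejection
    of series whose R2 is high while NSE and KGE both remain poor (e.g. below
    zero), since NSE and KGE are variants of R2."""
    if r2 < r2_threshold:
        return False          # below the raised R2 threshold: not behavioural
    if nse < 0.0 and kge < 0.0:
        return False          # high R2 but poor NSE and KGE: exclude
    return True
```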
Some of the well-known parameter sampling methods include the simple Monte Carlo random sampling (denoted in this study as RND) and Latin hypercube sampling (LHS). Recently, RBMC (Onyutha 2024b) was introduced to support GLUE. The robustness of RBMC in uncertainty analysis lies in its uniqueness in yielding 'independent' SCs based on the stipulated configurations. This property is not possessed by common existing methods such as RND and LHS. For demonstration of the uncertainty framework, this study applied eight RBMC configurations denoted as qMC$_s$ with $s = 2, 3, \ldots, 9$. In other words, the SCs used in this study included RND, LHS, qMC2, qMC3, …, and qMC9. RBMC coded in MATLAB, with a demonstration of its application for uncertainty quantification, can be freely downloaded via https://doi.org/10.5281/zenodo.10702810. An overview of the three key sources of uncertainty considered in this study can be seen in Figure 2.
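RBMC itself is available in MATLAB at the Zenodo link above. For orientation only, the following Python sketch contrasts RND and LHS sampling within hypothetical parameter limits; the RBMC (qMC2–qMC9) configurations are not reproduced here:

```python
import numpy as np
from scipy.stats import qmc

lo = np.array([0.0, 1.0, 0.1])    # lower limits of three hypothetical parameters
hi = np.array([1.0, 500.0, 0.9])  # corresponding upper limits
n = 10_000                        # number of parameter sets to generate

# RND: simple Monte Carlo sampling with a pseudo-random generator
rng = np.random.default_rng(1)
rnd_sets = lo + (hi - lo) * rng.random((n, lo.size))

# LHS: stratifies each parameter's range so the samples cover it more evenly
lhs = qmc.LatinHypercube(d=lo.size, seed=1)
lhs_sets = qmc.scale(lhs.random(n), lo, hi)
# RBMC (qMC2-qMC9) is omitted; see the MATLAB code at the Zenodo link above.
```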
Uncertainty estimation framework
1. List the HMs to be applied for the uncertainty analysis. For instance, this study applied six models: NAM, PDM, VHM, HMSV, SIMHYD, and ABCD. Build up each model using the same datasets of a given catchment. Set the upper and lower limits of the parameters of a selected HM.
2. Determine the number of OFs considered for the hydrological modelling. Here, there were eight OFs: NSE, KGE, IOA, TSS, RRS, LME, MSC, and R2. Set the threshold of the chosen OF for determining the behavioural solutions.
3. Decide on the number of SCs for generating parameters of a selected model during modelling. This study employed RND, LHS, and qMC$_s$ with $s = 2, 3, \ldots, 9$. Set the upper and lower limits of each model's parameters.
4. Set the size of the set of behavioural solutions. This could be, for instance, 1,000 modelled series that yield values of OFs above the set threshold. For relative error measures, the best model performance is indicated by values close to one.
Three sources of calibration uncertainty were considered in this study: the choice of an SC, selection of an objective function, and choice of a HM.
There are two approaches to the e-PRUNE framework. The specific steps of the first approach are summarized as follows.
(a) Select the first model (e.g. PDM) from step (1).
(b) Pick the first SC (e.g. RND) from step (3).
(c) Use the SC selected in step (b) to generate values of the parameters of the model from step (a), within the limits stipulated in step (1).
(d) Choose the first OF (e.g. NSE) and determine its set threshold from step (2).
(e) Run the model from step (a) using the information in steps (b)–(d) several times until the size of the set of behavioural solutions stipulated in step (4) is achieved.
(f) From step (e), take the modelled series with the best (e.g. maximum) value of the OF as the optimal solution.
(g) Go to the next OF (e.g. KGE) from step (2) and repeat steps (e)–(f). Do this until all the OFs in step (2) are used.
(h) Go to the next SC (e.g. LHS) from step (3) and repeat steps (d)–(g).
(i) Go to the next model (e.g. VHM) from step (1) and repeat steps (b)–(h).
(j) Consider the overall ensemble mean to be the average of the modelled series obtained in step (f).
(k) Using the collection of all the sets of behavioural solutions from step (e), determine the total number of series in the collection and denote it as $n_T$.
(l) Determine the upper limit of the (100 − α)% confidence interval (CI) as the $[n_T \times (100 - \alpha/2)/100]$th value of the ranked series. Similarly, the lower CI limit is taken as the $[n_T \times (\alpha/2)/100]$th value.
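Steps (a)–(l) above amount to three nested loops over models, SCs, and OFs. A hedged Python sketch follows, in which `run_model`, `generate_parameters`, and the entries of `models`, `schemes`, and `ofs` are hypothetical placeholders rather than the author's implementation:

```python
import numpy as np

def e_prune_first_approach(models, schemes, ofs, thresholds,
                           run_model, generate_parameters,
                           n_behavioural, alpha=5.0):
    """Sketch of e-PRUNE approach 1, steps (a)-(l); all callables are
    placeholders. Threshold adjustment (described later) is omitted."""
    optimal_series, all_behavioural = [], []
    for model in models:                                 # steps (a)/(i)
        for scheme in schemes:                           # steps (b)/(h)
            for of_name, of_func in ofs.items():         # steps (d)/(g)
                behavioural = []
                while len(behavioural) < n_behavioural:  # step (e)
                    params = generate_parameters(model, scheme)  # step (c)
                    sim = run_model(model, params)
                    score = of_func(sim)
                    if score >= thresholds[of_name]:
                        behavioural.append((score, sim))
                best = max(behavioural, key=lambda t: t[0])      # step (f)
                optimal_series.append(best[1])
                all_behavioural.extend(s for _, s in behavioural)
    ensemble_mean = np.mean(optimal_series, axis=0)              # step (j)
    stack = np.array(all_behavioural)                            # step (k): n_T series
    lower, upper = np.percentile(stack, [alpha / 2, 100 - alpha / 2],
                                 axis=0)                         # step (l)
    return ensemble_mean, lower, upper
```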
The summary of the second approach of e-PRUNE is as follows.
(i) Set the total number of simulations (ntsm) to a large value, e.g. 10,000.
(ii) Randomly select a model from step (1).
(iii) Randomly choose an SC from step (3) and generate the parameters of the model selected in step (ii).
(iv) Randomly select an OF from step (2). Also, consider the threshold of the selected OF as set from step (2).
(v) Run the model from step (ii) using parameters in step (iii) and information in step (iv). Keep the modelled series if it is behavioural. Otherwise, discard the modelled series.
(vi) Repeat steps (ii)–(v) until the size of the set of behavioural solutions is equal to the number of simulations stipulated in (i).
(vii) Consider the average of all the behavioural solutions from step (v) as the overall ensemble mean.
(viii) The upper and lower limits of the (100 − α)% CI are determined as the $[n \times (100 - \alpha/2)/100]$th and $[n \times (\alpha/2)/100]$th values of the ranked behavioural solutions, respectively, where $n$ is the number of behavioural solutions retained.
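The second approach can similarly be sketched as a single randomized loop; as before, the callables and labels are hypothetical placeholders rather than the author's code:

```python
import numpy as np

def e_prune_second_approach(models, schemes, ofs, thresholds,
                            run_model, generate_parameters,
                            n_tsm=10_000, alpha=5.0):
    """Sketch of e-PRUNE approach 2: model, SC, and OF are drawn at random
    for every simulation until n_tsm behavioural solutions are collected."""
    rng = np.random.default_rng(7)
    of_names = list(ofs)
    behavioural = []
    while len(behavioural) < n_tsm:                      # step (vi)
        model = models[rng.integers(len(models))]        # step (ii)
        scheme = schemes[rng.integers(len(schemes))]     # step (iii)
        of_name = of_names[rng.integers(len(of_names))]  # step (iv)
        params = generate_parameters(model, scheme)
        sim = run_model(model, params)
        if ofs[of_name](sim) >= thresholds[of_name]:     # step (v)
            behavioural.append(sim)                      # keep if behavioural
    stack = np.array(behavioural)
    ensemble_mean = stack.mean(axis=0)                   # step (vii)
    lower, upper = np.percentile(stack, [alpha / 2, 100 - alpha / 2],
                                 axis=0)                 # step (viii)
    return ensemble_mean, lower, upper
```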
The first approach of e-PRUNE is important for obtaining insights into the contributions from the various uncertainty sources to the overall predictive uncertainty bounds. However, the focus of the second approach is on the overall predictive bound, regardless of the amount of contribution from each uncertainty source.

















Procedure to consider the various sources of calibration uncertainty.
The simulation experiments used the following: $n_{OF} = 8$, $n_{SC} = 10$, and $n_{HM} = 6$, where $n_{OF}$, $n_{SC}$, and $n_{HM}$ denote the numbers of OFs, SCs, and HMs, respectively. The first approach of the introduced framework starts with the stipulation of the upper and lower limits of every parameter of each model. This is followed by the stipulation of $n_{tsm}$ (the total number of simulations) and $n_{bhs}$ (the size of the set of behavioural solutions). In a standard GLUE, one would fix $n_{tsm}$, and subsequently $n_{bhs}$ is the number of acceptable solutions. Based on some factors, like the rigidity of the likelihood threshold, such an approach could lead to a small $n_{bhs}$ that would end up affecting the reasonableness of the uncertainty bound. A good idea, as adopted for the introduced framework, is to ensure that $n_{tsm}$ is far greater than $n_{bhs}$. Furthermore, $n_{bhs}$ is made large enough as well. For instance, $n_{tsm}$ and $n_{bhs}$ were set to 10,000 and 5,000, respectively.

In the next step, we choose the $j$th HM such that $1 \le j \le n_{HM}$. Subsequently, for every $j$th model, we select the $k$th OF such that $1 \le k \le n_{OF}$. Finally, considering the $j$th model and $k$th OF, we choose the $i$th SC such that $1 \le i \le n_{SC}$. For each combination of model, OF, and SC, $n_{tsm}$ simulations are targeted to obtain $n_{bhs}$ behavioural solutions. The detailed procedure, based on RBMC, to estimate model uncertainty while considering the influence of the choice of an SC can be found in Onyutha (2024b). While considering OFs, the initial value of the threshold adjustment step ($\varepsilon$) was set to 0.01, thereby making each update of the threshold ($T$) of the chosen OF equal to $T \pm 0.01$. The optimal value of each of the various OFs considered in this study is 1. For the standard GLUE, a rigid threshold for a likelihood function tends to be considered under a fixed $n_{tsm}$. This, as hinted shortly before, has the implication of making $n_{bhs}$ small. To differ from the standard GLUE, the introduced framework considers an adjustable threshold to easily obtain the stipulated $n_{bhs}$. This, in turn, permits $n_{bhs}$ to be made large enough. To do so, when $n_{tsm}$ is reached while the total number of acceptable modelled series is still less than the stipulated $n_{bhs}$, we adjust $T$ using $T_{new} = T - \varepsilon$, where $\varepsilon > 0$, to obtain a new threshold $T_{new}$. Consider that, say, with $n_{bhs} = 2{,}000$ and $T = 0.60$, a total of 1,800 (instead of the 2,000) acceptable modelled series were obtained when $n_{tsm}$ was reached. It means that we make $T_{new} = 0.60 - 0.01$ such that $T_{new} = 0.59$. We set $T = T_{new}$ again and run the model; if the remaining $2{,}000 - 1{,}800 = 200$ behavioural solutions are obtained, the procedure is stopped; otherwise, the procedure is repeated. The idea of ensuring that $T$ keeps reducing applies only when the optimal value of the OF is 1. Otherwise, for instance, if the optimal value of the OF is 0 and the worst value of the OF is 1, we cause $T$ to keep increasing using $T_{new} = T + \varepsilon$ with $\varepsilon > 0$. Furthermore, the value of $\varepsilon$ can be set on a case-by-case basis. For instance, $\varepsilon$ was set to 0.01, 0.02, …, 0.1 based on the desired computation load.

Using one model, we can choose a particular SC and vary the OFs. Here, using an OF, the optimal modelled series is the one that yields the best value of the OF. For every observed flow value, there are $n_{bhs}$ modelled events, which can be ranked from the highest to the lowest. The upper and lower limits of the (100 − α)% CI on the optimal modelled event using a particular OF are given by the $(100 - \alpha/2)$th and $(\alpha/2)$th percentiles of the ranked modelled values, respectively. In this study, α was taken as 5%. Repeating the procedure with all the OFs and SCs using one model, the ensemble can be obtained as the average of the $n_{OF} \times n_{SC}$ sets of optimal series. Finally, when the procedure is repeated with all the $n_{HM}$ models, the overall model ensemble ($Q_{ens}$) is obtained as the average of the $n_{HM} \times n_{OF} \times n_{SC}$ sets of optimal series. The final limits of the (100 − α)% CI on $Q_{ens}$ can be obtained as described in step (l) of the procedure for the first approach of e-PRUNE.
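A compact sketch of the adjustable-threshold rule described above follows; the names and the direction convention are as reconstructed here, not verbatim from the original code:

```python
def adjust_threshold(threshold: float, eps: float = 0.01,
                     optimum_is_one: bool = True) -> float:
    """Relax the behavioural threshold T when n_tsm is reached before the
    stipulated n_bhs behavioural solutions have been collected. For OFs whose
    optimal value is 1, T is lowered (T - eps); for OFs whose optimal value
    is 0 and worst value is 1, T is raised instead (T + eps)."""
    return threshold - eps if optimum_is_one else threshold + eps

# Worked example from the text: n_bhs = 2,000 targeted, T = 0.60, but only
# 1,800 acceptable series were found when n_tsm was reached; T becomes 0.59
# and the model is re-run until the remaining 200 solutions are obtained.
T = adjust_threshold(0.60, eps=0.01)   # -> 0.59
```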
RESULTS
Model quality and distribution of parameters

Frequency distribution of selected parameters for various HMs, including (a)–(f) ABCD, (g)–(l) PDM, (m)–(r) SIMHYD, (s)–(x) HMSV, (y)–(ad) NAM, and (ae)–(aj) VHM based on various OFs.
The application of each of the eight selected OFs allowed the narrowing of the range over which every parameter could be generated. Narrowing each parameter's range (within which the optimal value of the respective parameter can be found) increases the precision with which observations can be predicted. High precision can be realized in terms of the narrow width of the uncertainty range. The precision of a model can depend on a number of factors, such as the choice of the OF, the number of model parameters, and the selected SC. For instance, how a chosen OF relates to the values of a particular parameter varies among the parameters.
The identifiability of each parameter of a given model varied among the OFs (Figure 5(a)–5(aj)), and it depended on the choice of the SC. The identifiability of a parameter is the degree to which the parameter can be endorsed based on available information that can be validated in reality (Zhao et al. 2008). It is about the level of precision with which each parameter's values can be identified. The identifiability of a parameter is an important aspect of the conceptualization of a model's structure. It can be enhanced through a strong linkage between the hydrological processes and the parameters of a model (Guse et al. 2017). Adjustments of a model structure are normally guided by measurements, if available. In the absence of measurements, estimations are undertaken, and the guess of the starting point for each parameter depends on the experience and expert judgement of the modeller. Complexity in parameter identifiability increases with the number of a model's parameters. Parsimonious models contain few parameters, so only a limited number of components can be identified from observations. However, having too few parameters can limit the number of hydrological processes or components considered within a model structure, and oversimplification of the model's structure risks failure to meet the expected purpose of the modelling exercise. On the other hand, including too many parameters or processes increases a model's complexity, and this can affect the identifiability of parameters. This study considered models with parameters ranging in number from 6 (ABCD) to 17 (NAM). This was to consider the influence of the identifiability of models' parameters on the overall predictive uncertainty.
The main sources of uncertainty












Influence of the choice of the (a) SC, (b) OF and (c) the HM on the number of behavioural solutions from each model.
Influence of the choice of HM
Uncertainty bounds on stream flows modelled by (a) ABCD, (b) PDM, (c) SIMHYD, (d) HMSV, (e) NAM, and (f) VHM. Parameters were generated using RND and the model was calibrated using NSE as the OF.
Though for brevity only the results based on NSE are presented, increasing the number of OFs led to an increase in the width of the uncertainty bounds for each HM. Furthermore, the uncertainty bounds varied in width among the models. Generally, the width of the uncertainty bound is taken to reflect the precision with which the observed flow is reproduced. However, many factors affect the magnitudes of the uncertainties of models, including the compatibility of a model's structure with the available data, errors in meteorological inputs, and errors in river flow (or calibration data). Generally, these results reflect the need to apply many models so as to consider the influence of choosing a particular model, based on a given OF, on the generated set of modelled series.
Influence of the choice of an objective function
Uncertainty bounds on stream flows modelled by HMSV when its parameters were generated by RND and the model calibrated based on (a) IOA, (b) KGE, (c) LME, (d) NSE, (e) MSC, (f) RRS, (g) R2, and (h) TSS.
Influence of the choice of sampling scheme
Uncertainty bounds on stream flows modelled by HMSV when its parameters were generated by (a) RND, (b) LHS, and (c)–(j) qMC2–9. The results of each chart were obtained when the model was calibrated using NSE as the OF.
Overall model uncertainty
Overall uncertainty was obtained from the combination of sub-uncertainties based on the choices of HM, OF and SC. These results are for (a) daily and (b) monthly scales.
The interactive contributions to the overall uncertainty from the various sources may be additive or multiplicative in nature. In the additive case, the effect of removing one source of uncertainty, for instance, OF, can easily be realized from the overall uncertainty bound. This could be in terms of a slight amplification of the overall uncertainty due to combined contributions from the various uncertainty sources. If the interactive contributions from the various uncertainty sources were multiplicative, the width of the uncertainty bound would be far more pronounced than in the case shown in Figure 10. Even when considering the sources of uncertainty individually, the width of the uncertainty bounds can increase under various circumstances, for instance, (i) when the threshold of an OF (such as NSE, R2, and IOA) is reduced, (ii) when the parameter space gets larger, and (iii) when the number of structurally different HMs being applied is increased.
The results of hydrological modelling can be influenced by the scale, as shown in the results for daily (Figure 10(a)) and monthly (Figure 10(b)) scales. Aggregation of stream flows from daily to monthly scale leads to a reduction in the temporal variability or fluctuations in the series. Nevertheless, the results of the uncertainty remain valid under both daily and monthly scales.
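As an aside, the daily-to-monthly aggregation mentioned above can be illustrated with a short pandas sketch on synthetic flows; the reduction in temporal variability shows up in the standard deviations:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2000-01-01", periods=3 * 365, freq="D")
daily = pd.Series(rng.gamma(2.0, 5.0, size=idx.size), index=idx)  # synthetic flows

monthly = daily.resample("MS").mean()    # aggregate daily flows to monthly means
print(daily.std(), monthly.std())        # monthly series shows reduced variability
```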
The choice of scale is generally linked to the purpose of the analysis. For instance, the choice of the daily scale is relevant for uncertainty analysis of risk-based hydrological quantiles necessary for the design of water resource management applications and hydraulic structures, such as dikes, dams, and irrigation systems. Uncertainty about quantiles at high temporal resolution enables the modeller to estimate the cost implications of constructing and operating water resource management applications or hydraulic structures designed under various modelling uncertainties.
The results of an HM vary across catchments because catchments tend to differ with respect to climatic and physiographic characteristics. Thus, the results of this study are limited to the catchment to which the models were applied. Furthermore, the results of a model depend on the sub-period considered for hydrological analysis. Due to climate variability, hydroclimatic conditions tend to differ over sub-periods. This can result in changes in the values of a given parameter over time, thereby affecting the temporal transferability of a given model.
DISCUSSION
Different sources of model uncertainty contribute in varying proportions to the overall uncertainty bounds. In this study, the magnitude of the uncertainty due to the OF was found to be substantial. There have been proposals for modellers to prefer multi-objective calibration (MOC) to single-objective calibration (SOC) of HMs, on the assumption that MOC leads to better results than SOC (Kim & Lee 2014; Mostafaie et al. 2018). However, in a relevant study on this proposal, Zhang et al. (2018) found that MOC does not necessarily yield higher model quality than SOC. In fact, large differences exist among results from various OFs (Krause et al. 2005; Onyutha 2024a). Along this line, it is difficult to determine comparable thresholds of two or more OFs for identifying behavioural solutions. Furthermore, even for SOC, each OF for calibrating HMs has its advantages and disadvantages (Onyutha 2024a). By increasing the number of uncertainty sources, the overall uncertainty bound increased, thereby corroborating the importance of considering as many sources of uncertainty as possible in HM results. In this study, the SOC approach was adopted; the uncertainty bounds under SOC and MOC are deemed to be different.
The uncertainty bounds based on the influence of the choice of a model were the largest. This reflected the differences in the structures of the selected models regarding their compatibility with the data of the chosen catchment. The reasons for differences among models are easy to see. For a given model, various hydrological processes are considered while conceptualizing the model structure, and some hydrological processes differ among regions. Besides, catchments, even of the same climate region, can differ among themselves in terms of geology, soil, topography, drainage area, topographical aspect, land use and land cover types, and extent of urbanization. Therefore, the concept that a model can depend on physical laws and parameters established a priori to yield accurate mathematical results for different hydrological conditions, catchments, and climates concerns an impossible task or the so-called Hydrologic ElDorado (Grayson et al. 1992; Woolhiser 1996). The reasonableness of the said concept is further challenged by the non-linearity of hydrological processes, dependence on spatio-temporal scales, equifinality, and system heterogeneity (Beven & Cloke 2012). For instance, differences in the characteristics of climate regions require a focus on some distinct processes, such as snow melting in a temperate (but not tropical) region, and this can influence the structuring of an HM. Importantly, the compatibility of a model structure with observations will vary among catchments even of the same climate region. In summary, differences among models arise from dissimilarities in the concepts for representing hydrological processes (Weiler & Beven 2015). Thus, model differences comprise an important consideration in making predictions to yield information for planning water resource management. Furthermore, regarding influences from the choice of a model, understanding inter-model and intra-model differences is important for the comparison of simulated results. Inter-model difference means that results from any two structurally dissimilar models tend to be different, owing to divergences in the concepts and assumptions used in structuring the models (Butts et al. 2004). Intra-model difference means that, under the same initial conditions, a particular model (due to its structural complexity) can still yield dissimilar simulations.
In this study, the multi-model ensemble of modelled flow series was obtained by averaging. The idea of combining model results to obtain ensembles is not new (Twedt et al. 1977). Many studies applied averaging of model outputs when considering mean hydrological conditions (Rojas et al. 2008; Kumar et al. 2015; Onyutha et al. 2021). Applications of model averaging to extreme hydrological events are rarely considered by hydrologists, except on a few occasions, as in Onyutha et al. (2021). Apart from the simple arithmetic averaging of all the realizations, other approaches, such as the weighted mean of outputs, also exist. Thus, ensembles can also vary with respect to the method used to combine the various model results.
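For illustration, a minimal sketch of simple versus weighted averaging of modelled series follows, with placeholder values and hypothetical weights (e.g., proportional to each model's OF score):

```python
import numpy as np

# Stack of modelled flow series, one row per model (placeholder values)
sims = np.array([[3.1, 4.0, 2.2],
                 [2.8, 4.4, 2.0],
                 [3.5, 3.9, 2.6]])

simple_mean = sims.mean(axis=0)   # simple arithmetic ensemble mean

# Weighted alternative: weights are hypothetical here and must sum to one;
# in practice they could, e.g., be proportional to each model's OF score
w = np.array([0.5, 0.3, 0.2])
weighted_mean = w @ sims
```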
Though not considered in this study, other sub-sources of model uncertainty exist, such as (i) observation data errors, (ii) the choice of optimizers, and (iii) the method of calibration. Clarifications on these uncertainty sub-sources are as follows:
(i) Model inputs include rainfall and temperature or potential evapotranspiration. Model inputs are characterized by gaps or missing records, coarse spatio-temporal resolutions, short data records, and data-entry issues such as inaccurate placement of decimal points. Poor calibration of data recording equipment leads to bias in measurements. Errors in observed river flow or calibration data can arise due to inaccurate extrapolation or interpolation of the rating curve and hysteresis (McMillan et al. 2012; Sikorska et al. 2013; Moges et al. 2020).
(ii) Calibration is the process of changing the values of parameters to reduce the mismatch between the observed and modelled series. The calibration data for a rainfall-runoff model are observed river flows. Optimization, by contrast, deals with the adjustment of a model's parameters to guarantee the behaviour of the model with respect to a certain policy that can maximize performance without reference to measured data. Calibration can be manual or automatic. Manual calibration tends to be subjective and time-consuming (Boyle et al. 2000). Automatic calibration uses strategies that can guarantee minimal model residuals by allowing the global optimum to be located within a response surface characterized by several local minima. An example of a global search technique is the shuffled complex evolution (Duan et al. 1992). The results of automatic and manual calibration can differ to some extent.
(iii) For automatic calibration, there are various strategies or search algorithms/optimizers. Examples of calibration strategies include GLUE (Beven & Binley 1992) and Bayesian MCMC simulation. Many optimizers exist, such as the charged system search (Kaveh & Talatahari 2010), genetic algorithm (Goldberg 1989; Wang 1991), simulated annealing (Kirkpatrick et al. 1983), sequential simplex optimization (Walters et al. 1991), Powell's method (Powell 1964), differential evolution (Storn & Price 1997), particle swarm optimization (Kennedy & Eberhart 1995), and direct local search or Rosenbrock's method (Rosenbrock 1960). Others include the gravitational search algorithm (Rashedi et al. 2009), non-dominated sorting genetic algorithm II (Deb et al. 2002), ant colony optimization (Dorigo et al. 1996), dolphin echolocation (Kaveh & Farhoudi 2013), big bang-big crunch (Erol & Eksin 2006), grey wolf optimizer (Mirjalili & Hashim 2010), bacterial foraging (Passino 2002), and the bat-inspired algorithm (Yang 2010). Each of these optimization methods or algorithms has its own advantages and disadvantages, and the optimizers differ among themselves. Thus, the results of calibration based on various optimizers are expected to be dissimilar to some extent. This means that the choice of an optimizer is a sub-source of calibration uncertainty.
In this study, model residuals were assumed to be of a nature that allows them to be treated implicitly. Furthermore, the residuals in both calibration and prediction were assumed to be similar in nature; in other words, the hyperparameters of an error model resulting from calibration tend to be identical to those based on prediction (Beven & Binley 2014). Explicit treatment of errors is another option, either through a formal error model or a non-parametric technique (Beven & Smith 2015).
Apart from GLUE, several approaches for analysing uncertainty in HMs exist, such as the differential evolution adaptive metropolis (Vrugt 2016), pseudo-Bayesian approaches (Freer et al. 1996), the parameter estimation algorithm (Doherty 2010), classical Bayesian approaches (Thiemann et al. 2001), multi-model averaging methods (Vrugt & Robinson 2007), sequential data assimilation (Moradkhani et al. 2005), MOC analysis (Hadka & Reed 2013), and Bayesian total error analysis (Kavetski et al. 2003). These methods can lead to dissimilar uncertainty results due to the differences among them. Thus, the choice of the uncertainty analysis method is itself a sub-source of predictive uncertainty.
CONCLUSIONS
The non-linear nature of hydrological modelling equations makes analytical approaches to uncertainty quantification far from straightforward. As a result, a number of uncertainty estimation methods exist, such as Bayesian total error analysis, sequential data assimilation, and GLUE. The popularity of GLUE compared with other methods is linked to its simplicity. However, little attention is normally given to differences among models when applying the original GLUE framework. This study introduced a framework that considers differences among models for generating ensemble uncertainty bounds.
The potential of the introduced framework was demonstrated using three sub-sources of calibration-related uncertainty: the influence of the choice of an OF, the selection of an HM, and opting for a particular SC. Sub-uncertainty bounds on the multi-model ensemble based on the choice of an OF were wider than those due to the selection of a particular SC. However, the sub-uncertainty bounds on the multi-model ensemble due to the choice of an HM were wider than those due to the selection of an SC or OF. The uncertainty bounds based on the combined contributions from HM, SC, and OF were wider than those for any individual uncertainty source. This means that the larger the number of sources of uncertainty considered, the more representative the predictive uncertainty. The introduced framework is recommended for considering various sources of uncertainty across different models.
It is recommended that other sources of uncertainty should be incorporated within the framework, including the influences from the choice of optimizers, observation data errors, initial and boundary conditions, and, if possible, issues of model transferability. Given that several uncertainty sources were considered, this study could not give an in-depth analysis of the identifiability of parameters from the various models based on different SCs and OFs. A comprehensive assessment of parameter identifiability under various sources of uncertainty is recommended for a future study.
FUNDING
This research received no external funding.
ACKNOWLEDGEMENTS
The author acknowledges that this study used part of the modelling data from Onyutha (2022).
DATA AVAILABILITY STATEMENT
Data used in this study can be obtained as an example dataset of the Rainfall Runoff Library (RRL), downloadable via https://toolkit.ewater.org.au/Tools/RRL (accessed 16 June 2022).
CONFLICT OF INTEREST
The author declares there is no conflict.