## Abstract

Multi-objective calibration can help identify parameter sets that represent a hydrological system and enable further constraining of the parameter space. It is expected to be utilized more frequently as optimization algorithms and computing resources continue to advance. However, the impact of the number of objective functions on modeling outputs is still unclear, and the adequate number of objective functions remains an open question. We investigated the responses of model performance, equifinality, and uncertainty to the number of objective functions incorporated in a hierarchical and sequential manner in parameter calibration. Hydrological Simulation Program – FORTRAN (HSPF) models prepared for bacteria total maximum daily load (TMDL) development were used to simulate the hydrological processes of three watersheds located in Virginia, and the Expert System for Calibration of HSPF (HSPEXP) statistics were employed as objective functions in the parameter calibration experiments. Results showed that the amount of equifinality and output uncertainty overall decreased, while model performance was maintained, as the number of objective functions increased sequentially. However, there was no further significant improvement in equifinality and uncertainty when more than four objective functions were included. This study demonstrates that introducing an adequate number of objective functions can improve the quality of calibration without requiring additional observations.

## INTRODUCTION

Parameters are required to describe the features of a system of interest in hydrological modeling; however, they often cannot be measured but only estimated because of their conceptual nature and scale dependency (Vrugt *et al.* 2008; Her & Chaubey 2015). Parameter values are therefore adjusted until the agreement between observations and simulation outputs reaches a predefined performance threshold. The use of optimization algorithms in parameter calibration enables exploring the multi-dimensional parameter space of a hydrological model efficiently and finding solutions close to a global optimum conveniently (Tang *et al.* 2006; Efstratiadis & Koutsoyiannis 2010). As automatic calibration using such algorithms can identify multiple parameter sets that provide equally good or acceptable model performance, a phenomenon called equifinality, a modeler may face the problem of selecting a single set among many equivalent ones (Beven 2006; Beven & Binley 2014). Prediction uncertainty caused by the equifinality might then negatively affect the reliability of subsequent modeling practices such as forecasting (Wagener & Gupta 2005; Yatheendradas *et al.* 2008).

Equifinality is inevitable unless all of the observations that could constrain the equations and parameters of a mathematical model are available. Equifinality and its implications in model calibration have been discussed in many studies, especially regarding the quantification of output uncertainty resulting from model equifinality (Beven & Binley 1992). Her & Chaubey (2015) demonstrated that the numbers of observations and parameters are closely associated with the amount of equifinality. However, it is still not clear how deploying multiple objective functions can help resolve the equifinality issue (Gupta *et al.* 1998; Yapo *et al.* 1998; Franks *et al.* 1999; Madsen 2000; Efstratiadis & Koutsoyiannis 2010; Shafii *et al.* 2015), presumably because of the subjectivity involved in the selection of objective functions and/or potential deterioration in the efficiency of the parameter search, in particular when more than three objectives are of interest (Khu *et al.* 2008; Shafii *et al.* 2015).

The HSPEXP criteria have been used in calibrating the HSPF model for developing bacteria total maximum daily loads (TMDLs) in Virginia (Lumb *et al.* 1994; VADEQ 2003). The criteria were originally suggested by US Geological Survey (USGS) professionals based on their many years of calibration experience (Donigan *et al.* 1984; Lumb *et al.* 1994; Kim *et al.* 2007; Seong *et al.* 2015). Manual or semi-manual calibration methods have conventionally been employed to calibrate the HSPF model in bacteria TMDL developments, and a few studies demonstrated the efficiency of using automatic optimization algorithms in parameter calibration (Kim *et al.* 2007; Seong *et al.* 2015). Such studies employed a single-objective optimization algorithm by linearly aggregating the six HSPEXP statistics in their parameter calibration. However, the aggregated single-objective calibration approach requires a subjective decision in determining weights between sub-objective functions. In addition, it does not reveal a Pareto front or the trade-offs among conflicting criteria (Efstratiadis & Koutsoyiannis 2010). An understanding of the trade-offs provides opportunities to identify balanced, representative parameter sets in the calibration of a hydrological model (Booij & Krol 2010). Seong *et al.* (2015) suggested further investigation of the Pareto optimal surface and the trade-offs between the HSPEXP statistics by implementing multi-objective optimization approaches.

The use of multiple objective functions can reduce the amount of equifinality and uncertainty by applying additional constraints and revealing trade-offs between objective functions in parameter calibration (Wagener & Gupta 2005; Tang *et al.* 2006; Reed *et al.* 2013). When observations available for refining parameters are limited, introducing additional criteria or objective functions can help further screen out parameter values in calibration. As the number of objective functions increases, however, the performance of a search algorithm is expected to degrade since the non-dominance of Pareto fronts tends to increase (Ishibuchi *et al.* 2008). The degraded search performance can then fail to identify a global optimum, increase equifinality and modeling uncertainty, and subsequently affect the modeling practices that follow parameter calibration. Therefore, the introduction of an adequate number of objective functions is expected to maintain the performance of a calibrated model while minimizing model equifinality and uncertainty. We investigated the responses of model performance, equifinality, and parameter and output uncertainty to the number of objective functions. This study explicitly demonstrates how multi-objective calibration can improve the quality of parameter calibration without requiring additional observations to refine a hydrological model. In addition, this study links the search efficiency of an optimization algorithm, the search dimension of calibration, equifinality, and parameters to each other to provide an integrated understanding of hydrological modeling uncertainty.

## MATERIAL AND METHODS

### Study watersheds and HSPF model

Three watersheds were selected for this investigation as they had been studied in association with a parameter calibration tool and bacteria TMDLs in the Ridge and Valley Physiographic Region of Virginia: the Piney River (123 km^{2}), Pigg River (909 km^{2}), and Reed Creek (704 km^{2}) watersheds (Figure 1). The watersheds mainly consist of forest (52–79%), pasture (10–38%), and cropland (2–6%). Weather input and streamflow data were obtained from National Weather Service Cooperative Weather stations (COOP ID 445690 for the Piney River, 449301 for Reed Creek, and 447338 for the Pigg River) and USGS streamflow-gaging stations (ID 02027500 for the Piney River, 03167000 for Reed Creek, and 02058400 for the Pigg River). The HSPF models prepared for the watersheds were calibrated and validated against streamflow measured at the outlets (Table 1). In the bacteria TMDL developments for the study watersheds, the storm events were selected considering the sizes of storms and the balance of the numbers of storm events in dry and wet years. In this study, we directly adopted the storm selections made in the bacteria TMDL developments. Detailed descriptions of the study watersheds can be found in Seong *et al.* (2015).

| Watershed | Precipitation (mm/year) | Runoff (mm/year) | Soils/topography | Calibration period (no. storm events) | Validation period (no. storm events) |
|---|---|---|---|---|---|
| Pigg River | 1,134 | 408 | Cecil-Madison-Enon/Northern Inner Piedmont Level IV ecoregion | 1989–1995 (29) | 1984–1989 (23) |
| Reed Creek | 951 | 393 | Frederick-Carbo-Timberville and Berks-Weikert-Laidig/Ridge and Valley Level III and Southern Limestone-Dolomite Valleys and Low Rolling Hills Level IV ecoregion | 1991–1998 (29) | 2001–2005 (31) |
| Piney River | 1,277 | 729 | Clifford loam/Northern Inner Piedmont Level IV ecoregion | 1991–1995 (21) | 1996–2000 (16) |

We used the HSPF model to simulate runoff hydrographs of the study watersheds. HSPF is a continuous, spatially lumped-parameter model that has been used as a simulation tool in many TMDL studies (Benham *et al.* 2006; Seong *et al.* 2015). The hydrological simulation components of the model can be grouped into impervious land (IMPLND), pervious land (PERLND), and reaches (RCHRES). The IMPLND module accounts for surface runoff generated from impervious areas (e.g. urban land uses), and the PERLND module simulates hydrological processes occurring on pervious areas, including infiltration, evapotranspiration, surface detention, interflow, groundwater discharge to streams, and percolation to a deep aquifer. The RCHRES module accounts for the movement of runoff routed from upstream reaches and from impervious and pervious areas into streams and reservoirs using a kinematic wave approach. The model calculates hydrological processes on an hourly basis and provides simulation results at a daily (24-hour average) scale. A detailed description of the HSPF model can be found in Bicknell *et al.* (1996).

Nine parameters for the hydrological simulation of HSPF were adjusted during parameter calibration (Table 2). The calibration parameters were chosen based on a parameter sensitivity analysis implemented for the watersheds and the authors' professional judgment drawn from modeling experience for watersheds in the Ridge and Valley region in Virginia. The UZSN parameter, which controls upper soil zone storage on a monthly basis, was split into two parameters in the calibration, UZSNw and UZSNo, for winter (December–February) and non-winter months (March–November), respectively. We used the parameter value ranges suggested by the US EPA HSPF modeling guidance (USEPA 2000) to set the upper and lower limits of parameter values. The other parameters of HSPF were set to values determined in the previous TMDL studies (Benham *et al.* 2006, 2012, 2013).

| Parameter | Definition | Typical range | Possible range |
|---|---|---|---|
| LZSN | Lower zone nominal storage, mm | 76.2–203.2 | 50.8–381 |
| UZSN^{a} | Upper zone nominal storage, mm | 2.54–25.4 | 1.27–50.8 |
| INFILT | Index to infiltration capacity, mm/h | 0.25–6.35 | 0.025–12.7 |
| BASETP | Fraction of potential ET that can be sought from base flow | 0–0.05 | 0–0.2 |
| AGWETP | Fraction of remaining potential ET that can be satisfied from active groundwater storage | 0–0.05 | 0–0.2 |
| INTFW | Interflow inflow parameter | 1.0–3.0 | 1.0–10.0 |
| IRC | Interflow recession parameter, per day | 0.5–0.7 | 0.3–0.85 |
| AGWRC | Groundwater recession parameter, per day | 0.92–0.99 | 0.85–0.999 |
| DEEPFR | Fraction of groundwater inflow that goes to inactive groundwater | 0–0.2 | 0–0.5 |

^{a}Value varied between winter season and non-winter season.
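The bounded random sampling underlying the calibration experiments can be sketched as follows. This is a minimal illustration using the "possible range" bounds of Table 2 (with UZSN split into winter and non-winter parameters); the function names are ours, and the sketch stands in for neither HSPF nor the AMALGAM sampler itself.

```python
import numpy as np

# "Possible range" bounds from Table 2; UZSN is split into winter (UZSNw)
# and non-winter (UZSNo) parameters, giving ten calibration parameters.
BOUNDS = {
    "LZSN":   (50.8, 381.0),   # mm
    "UZSNw":  (1.27, 50.8),    # mm, winter (Dec-Feb)
    "UZSNo":  (1.27, 50.8),    # mm, non-winter (Mar-Nov)
    "INFILT": (0.025, 12.7),   # mm/h
    "BASETP": (0.0, 0.2),
    "AGWETP": (0.0, 0.2),
    "INTFW":  (1.0, 10.0),
    "IRC":    (0.3, 0.85),     # per day
    "AGWRC":  (0.85, 0.999),   # per day
    "DEEPFR": (0.0, 0.5),
}

def sample_parameter_sets(n, rng=None):
    """Draw n parameter sets uniformly within the bounds; one column per parameter."""
    rng = np.random.default_rng(rng)
    lo = np.array([b[0] for b in BOUNDS.values()])
    hi = np.array([b[1] for b in BOUNDS.values()])
    return lo + rng.random((n, len(BOUNDS))) * (hi - lo)

def normalize(samples):
    """Rescale each parameter to [0, 1] relative to its bounds, as done
    before comparing parameter distributions across calibration cases."""
    lo = np.array([b[0] for b in BOUNDS.values()])
    hi = np.array([b[1] for b in BOUNDS.values()])
    return (samples - lo) / (hi - lo)
```

In the study, the sampler is AMALGAM rather than plain uniform sampling, but the bounds and the normalization used for reporting parameter values follow the same scheme.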

### Objective functions and parameter sampling

The HSPEXP criteria were employed as exemplary multi-objective functions for hydrological model calibration: differences (percentage errors) between observed and simulated total, low, high, and seasonal streamflow volumes, and the volumes and peaks of selected storm events (Table 3). We selected the HSPEXP criteria because they bring multiple distinct perspectives into model performance evaluation. For instance, a large total streamflow volume does not necessarily mean a large storm volume, since baseflow can make a significant contribution to the total volume. Similarly, a large storm volume does not always lead to a high storm peak, as the peak is a function of rainfall intensity as well as rainfall depth. The Total Volume error may be correlated with all other statistics, but it emphasizes the overall agreement between the observed and simulated hydrographs. Likewise, the High10 error may be correlated with the Storm Peak error, but the High10 error highlights the agreement between relatively high observed and simulated streamflows (cf. the Storm Peak error also considers storms that produced small peaks) as well as antecedent soil moisture and flow conditions at the beginning of an event.

| Criteria | Description | Percent difference |
|---|---|---|
| Total Volume | Error in total runoff volume for the calibration period | ±10 |
| Low50 | Error in the mean of the lowest 50% of the daily mean flows | ±10 |
| High10 | Error in the mean of the highest 10% of the daily mean flows | ±15 |
| Storm Peak | Error in peak flows for selected storms | ±15 |
| Seasonal | Seasonal volume error; June–August runoff volume error minus December–February runoff volume error | ±10 |
| Storm Volume | Error in runoff volume for selected storms | ±15 |

The HSPEXP criteria have been commonly used in HSPF modeling for TMDL development for more than 30 years, since they were first suggested in 1984 (Donigan *et al.* 1984; Lumb *et al.* 1994; Kim *et al.* 2007; Seong *et al.* 2015). The storm events were determined based on the previous TMDL studies (Benham *et al.* 2006, 2012, 2013). In parameter calibration, model performance statistics are commonly used as the objective functions of the optimization problem (Moradkhani & Sorooshian 2008). Many statistics have been proposed for hydrological model calibration, and many of them are correlated with each other (Muleta 2012). For instance, it is well known that the Nash–Sutcliffe efficiency coefficient (NSE) is correlated with the mean squared error (MSE) and the coefficient of determination (R^{2}). In addition, the Total Volume error is closely associated with other HSPEXP statistics such as the Storm Volume and Storm Peak errors. Thus, the utility of multi-objective calibration can potentially be degraded by adding correlated objective functions to the parameter calibration. However, no study investigating the correlation structure between the HSPEXP statistics was found in the literature.
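Several of these percent-error statistics can be sketched from paired daily observed and simulated series as below. This is an illustration, not HSPEXP's implementation; in particular, ranking the days by the observed flows when selecting the lowest 50% and highest 10% is our assumption about the convention.

```python
import numpy as np

def pct_error(sim_value, obs_value):
    """Percent difference of a simulated aggregate from its observed counterpart."""
    return 100.0 * (sim_value - obs_value) / obs_value

def total_volume_error(obs, sim):
    """Error in total runoff volume over the period."""
    return pct_error(sim.sum(), obs.sum())

def low50_error(obs, sim):
    """Error in the mean of the lowest 50% of daily mean flows
    (days selected by ranking the observed series; an assumption)."""
    idx = np.argsort(obs)[: len(obs) // 2]
    return pct_error(sim[idx].mean(), obs[idx].mean())

def high10_error(obs, sim):
    """Error in the mean of the highest 10% of daily mean flows
    (days selected by ranking the observed series; an assumption)."""
    k = max(1, len(obs) // 10)
    idx = np.argsort(obs)[-k:]
    return pct_error(sim[idx].mean(), obs[idx].mean())
```

The storm-volume and storm-peak errors follow the same percent-error form but aggregate only over the selected storm events.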

The six HSPEXP statistics give 720 (6!) possible permutations of objective functions, which makes full consideration of the combinations impractical (Table 3). Thus, a principal component analysis (PCA) was conducted to identify relative priorities between the statistics, based on the results of HSPF modeling with 10,008 randomly sampled parameter sets, and to simplify the calibration experiments by reducing the dimensionality of the analysis. The PCA result showed that the first three principal components could explain more than 97% of the total variance of the six objective function values calculated for the study watersheds, and the Storm Peak and Total Volume errors were identified as the most and least correlated with the principal components, respectively. The High10 and Low50 errors showed stronger correlations with the first three principal components than did the Seasonal and Storm Volume errors. From the PCA result, we could prioritize the HSPEXP statistics in the order of the Storm Peak, High10, Low50, Seasonal, Storm Volume, and Total Volume errors. Each of the six HSPEXP statistics was cumulatively added to the multi-objective calibration experiments in order of priority, denoted OF1 to OF6 (Table 4). The prioritization of the HSPEXP statistics using the PCA helped simplify the calibration experiments by revealing the sensitivity of the statistics to the simulated hydrographs.
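A PCA-based prioritization along these lines can be sketched with a plain SVD. Ranking the statistics by their total squared loading on the first three components is one reasonable proxy for "correlation with the principal components"; the function name and ranking rule are our assumptions, not the study's exact procedure.

```python
import numpy as np

def pca_priority(F):
    """F: (n_samples, 6) matrix of HSPEXP statistic values, one column per
    statistic, one row per sampled parameter set. Returns the explained
    variance ratio of each component and a priority ordering of columns."""
    X = (F - F.mean(axis=0)) / F.std(axis=0)   # standardize each statistic
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    evr = s**2 / np.sum(s**2)                  # explained-variance ratios
    loadings = Vt[:3].T                        # (6, 3) loadings on PC1-PC3
    strength = (loadings**2).sum(axis=1)       # total squared loading
    return evr, np.argsort(strength)[::-1]     # strongest-first ordering
```

Applied to the 10,008 sampled parameter sets, such an analysis would yield both the cumulative explained variance (over 97% for three components here) and the priority order used to build the OF1–OF6 cases.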

| Experimental case ID | Number of objective functions | Objective functions incorporated in the multi-objective calibration |
|---|---|---|
| OF1 | 1 | Storm Peak |
| OF2 | 2 | Storm Peak, High10 |
| OF3 | 3 | Storm Peak, High10, Low50 |
| OF4 | 4 | Storm Peak, High10, Low50, Seasonal |
| OF5 | 5 | Storm Peak, High10, Low50, Seasonal, Storm Volume |
| OF6 | 6 | Storm Peak, High10, Low50, Seasonal, Storm Volume, Total Volume |

The AMALGAM algorithm was employed to locate parameter sets providing objective function values close to a global optimum (minimum). AMALGAM is an optimization algorithm that has been widely used in the parameter calibration of hydrological models (Raad *et al.* 2009; Zhang *et al.* 2010; Dane *et al.* 2011; Her & Chaubey 2015). It runs four sampling-based optimization algorithms in competition to improve the computational efficiency of parameter space exploration: adaptive Metropolis search (Haario *et al.* 2001), differential evolution (Storn & Price 1997), the Non-dominated Sorting Genetic Algorithm II (NSGA-II; Deb *et al.* 2002), and particle swarm optimization (Kennedy *et al.* 2001). Detailed descriptions of the AMALGAM algorithm can be found in Vrugt & Robinson (2007) and Zhang *et al.* (2010).

Parameter sampling was stopped once the number of sampled parameter sets reached 10,008. The population size of a generation was set to 12, and thus a total of 834 generations were created in the calibration. A preliminary analysis showed that this sampling budget was large enough to ensure the convergence of the sampled parameter values. The same criterion was applied to all the calibration cases of this study to enable fair comparisons between the measures of model performance, equifinality, and uncertainty.

### Evaluation of equifinality, uncertainty and model performance

In this study, equifinality and parameter and output uncertainty were quantified using the GLUE framework (Beven & Binley 1992; Beven & Freer 2001; Beven 2006). The level of equifinality was assumed to be proportional to the number of behavioral parameter sets. Threshold values for identifying behavioral parameter sets were adapted from the HSPEXP criteria; for example, in the OF2 case in Table 4, parameter sets whose Storm Peak and High10 errors were both within ±15% (relative errors) were regarded as behavioral. The posterior parameter distribution was derived from the identified behavioral sets. The ratio of the number of behavioral parameter sets to the total number of parameter sets sampled in the calibration was used to measure the equifinality of each calibration experiment. Model output uncertainty was quantified as the average of the ranges between the highest and lowest daily streamflow values simulated using the behavioral sets. Parameter uncertainty was measured using the coefficient of variation (CV) of the normalized behavioral parameter values. The performance of the calibrated models was evaluated using statistics commonly utilized in hydrological modeling practice, NSE, R^{2}, and KGE, as well as the HSPEXP criteria used in the parameter calibration. NSE and KGE range from minus infinity to 1, where 1 indicates a perfect fit between two hydrological time series (Gupta *et al.* 2009; Kling *et al.* 2012), and R^{2} varies from 0 to 1.
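The GLUE-style measures described above can be sketched as follows; the array layouts and function names are our assumptions, not the study's code.

```python
import numpy as np

def behavioral_mask(stats, thresholds):
    """stats: (n_sets, n_objectives) HSPEXP percent errors; thresholds:
    per-objective absolute limits (e.g. 15 for +/-15%). A parameter set is
    behavioral only if every included statistic is within its threshold."""
    return (np.abs(stats) <= np.asarray(thresholds)).all(axis=1)

def equifinality_ratio(mask):
    """Share (%) of sampled sets classified as behavioral."""
    return 100.0 * mask.mean()

def output_uncertainty(flows, mask):
    """flows: (n_sets, n_days) simulated daily streamflow. Average width of
    the band between the highest and lowest behavioral simulations."""
    band = flows[mask]
    return (band.max(axis=0) - band.min(axis=0)).mean()

def parameter_cv(norm_params, mask):
    """Coefficient of variation of each normalized parameter across the
    behavioral sets, as a measure of parameter uncertainty."""
    p = norm_params[mask]
    return p.std(axis=0) / p.mean(axis=0)
```

Adding an objective function tightens `behavioral_mask` by one column, which is the mechanism by which the OF1–OF6 cases progressively constrain the behavioral set.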

## RESULTS

### Equifinality

Parameter values sampled by AMALGAM are presented for the six OF cases in Figure 2. When three or fewer objective functions were employed in the calibration (OF1–OF3), parameters converged relatively quickly to final values. The convergence, however, was slow or not observed in the cases using more than three objective functions (OF4–OF6). Values of all parameters stabilized at the beginning of the sampling when using the single objective function (OF1). In the multi-objective cases (OF2–OF6), on the other hand, many more samples were required for parameter values to converge. For example, the normalized AGWRC parameter values of the Pigg River watershed converged to 0.48 with only about 100 samples in the OF1 case, whereas around 500 samples were required for the values to narrow to the range between 0.88 and 1.00 in the OF2 and OF3 cases. No convergence of parameter values was found in the OF4–OF6 cases (Figure 2); even with 500,000 parameter samples (rather than 10,008), the parameter values did not stabilize when all six objective functions were included (OF6). These results imply that employing many objective functions can make the parameter search of a sampling-based heuristic optimization algorithm less efficient and thus fail to locate solutions close to a global optimum within a limited number of trials. The behavior of the parameter values sampled in the calibration experiments for the Pigg River watershed was also observed in the other two watersheds. The number of behavioral sets decreased as the number of objective functions employed in the calibration increased (Table 5). In the cases using four or more objective functions (OF4–OF6), the number of behavioral parameter sets (i.e., the level of equifinality) decreased dramatically, with some fluctuations, compared to the OF1–OF3 cases. In the Pigg and Piney River watersheds, the number of behavioral sets increased slightly when the Total Volume error was added to the OF5 case.

Amount of equifinality (ratio of behavioral parameter sets, %):

| Cases | Pigg River | Reed Creek | Piney River |
|---|---|---|---|
| OF1 | 71.36 | 71.17 | 70.12 |
| OF2 | 36.22 | 73.07 | 67.69 |
| OF3 | 9.34 | 23.59 | 15.87 |
| OF4 | 0.05 | 7.30 | 1.54 |
| OF5 | 0.05 | 12.16 | 0.79 |
| OF6 | 0.13 | 0.23 | 2.33 |

### Model performance

The streamflow predictions of the HSPF models calibrated with the different objective function combinations were evaluated with respect to the six HSPEXP statistics (Figure 3 and Table 4). The variation in the HSPEXP statistics provided by the behavioral parameter sets decreased as the number of objective functions increased, which corresponds to the tendency of the quantified equifinality. Figure 3 also shows trade-offs between the HSPEXP statistics calculated in the multi-objective calibration practices. For example, the interquartile ranges of the Low50 and Seasonal errors fluctuated more across the calibration cases than did those of the Total Volume and Storm Volume errors, suggesting that the Low50 and Seasonal errors are less correlated with the other HSPEXP statistics than are the Total Volume and Storm Peak errors. The median of the Total Volume errors moved toward zero as the number of objective functions increased, implying that the Total Volume error is the objective function most correlated with the other HSPEXP statistics.

The performance of the calibrated models was also evaluated using statistics commonly applied in hydrological model calibration, NSE, R^{2}, and KGE, to enable quick and easy interpretation of the evaluation results. The statistics were calculated from streamflow simulated using the behavioral parameter sets identified in the calibration practices for the study watersheds (Figure 4 and Table 6). The model performance statistics varied with the number of objective functions but did not show a consistent trend. The OF3 case provided slightly worse performance statistics than the OF2 case in all the study watersheds. Such results can be explained by the fact that NSE and R^{2}, which are sensitive to high flows, are only weakly associated with the Low50 error newly introduced in the OF3 case (Legates & McCabe 1999).

| Cases | Statistic | Pigg River cal. med^{a} | Pigg River cal. max^{b} | Pigg River val. med | Pigg River val. max | Piney River cal. med | Piney River cal. max | Piney River val. med | Piney River val. max | Reed Creek cal. med | Reed Creek cal. max | Reed Creek val. med | Reed Creek val. max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OF1 | NSE | −0.39 | 0.24 | 0.47 | 0.57 | −0.03 | 0.31 | 0.75 | 0.78 | 0.55 | 0.60 | 0.58 | 0.67 |
| | R^{2} | 0.25 | 0.35 | 0.49 | 0.58 | 0.32 | 0.44 | 0.76 | 0.78 | 0.64 | 0.65 | 0.62 | 0.69 |
| | KGE | 0.40 | 0.58 | 0.56 | 0.73 | 0.53 | 0.66 | 0.73 | 0.84 | 0.75 | 0.79 | 0.74 | 0.80 |
| OF2 | NSE | 0.25 | 0.32 | 0.50 | 0.60 | 0.08 | 0.25 | 0.70 | 0.78 | 0.50 | 0.62 | 0.53 | 0.65 |
| | R^{2} | 0.32 | 0.39 | 0.51 | 0.61 | 0.33 | 0.39 | 0.75 | 0.80 | 0.61 | 0.65 | 0.60 | 0.67 |
| | KGE | 0.51 | 0.61 | 0.55 | 0.67 | 0.54 | 0.62 | 0.58 | 0.78 | 0.64 | 0.72 | 0.64 | 0.73 |
| OF3 | NSE | 0.25 | 0.31 | 0.49 | 0.60 | −0.36 | 0.23 | 0.67 | 0.78 | 0.33 | 0.59 | 0.47 | 0.68 |
| | R^{2} | 0.32 | 0.37 | 0.51 | 0.61 | 0.20 | 0.41 | 0.69 | 0.79 | 0.48 | 0.63 | 0.53 | 0.70 |
| | KGE | 0.52 | 0.61 | 0.51 | 0.66 | 0.40 | 0.63 | 0.63 | 0.83 | 0.59 | 0.77 | 0.56 | 0.78 |
| OF4 | NSE | 0.18 | 0.22 | 0.50 | 0.57 | 0.02 | 0.16 | 0.71 | 0.77 | 0.46 | 0.60 | 0.56 | 0.63 |
| | R^{2} | 0.29 | 0.34 | 0.51 | 0.57 | 0.29 | 0.33 | 0.75 | 0.79 | 0.54 | 0.63 | 0.59 | 0.65 |
| | KGE | 0.51 | 0.58 | 0.51 | 0.65 | 0.51 | 0.57 | 0.60 | 0.80 | 0.69 | 0.79 | 0.67 | 0.78 |
| OF5 | NSE | 0.20 | 0.23 | 0.52 | 0.55 | 0.19 | 0.28 | 0.63 | 0.79 | 0.46 | 0.60 | 0.58 | 0.63 |
| | R^{2} | 0.32 | 0.34 | 0.55 | 0.55 | 0.41 | 0.43 | 0.64 | 0.80 | 0.53 | 0.62 | 0.60 | 0.67 |
| | KGE | 0.53 | 0.57 | 0.54 | 0.56 | 0.62 | 0.64 | 0.65 | 0.78 | 0.68 | 0.78 | 0.71 | 0.80 |
| OF6 | NSE | 0.19 | 0.25 | 0.54 | 0.57 | 0.24 | 0.30 | 0.58 | 0.78 | 0.45 | 0.52 | 0.51 | 0.58 |
| | R^{2} | 0.32 | 0.34 | 0.55 | 0.58 | 0.40 | 0.43 | 0.59 | 0.79 | 0.52 | 0.57 | 0.54 | 0.58 |
| | KGE | 0.55 | 0.58 | 0.58 | 0.64 | 0.63 | 0.65 | 0.60 | 0.82 | 0.72 | 0.75 | 0.68 | 0.76 |

^{a}Median.

^{b}Maximum.

In Figure 4(b), there is a clear increase in NSE but no significant change in R^{2} between OF4 and OF5; in contrast, there is no substantial change in NSE but an increase in R^{2} between OF5 and OF6. Both NSE and R^{2} are sensitive to high extremes; however, NSE is responsive to bias while R^{2} is not. The inclusion of the Storm Volume error could decrease the overall bias in the stormwater simulation when moving from OF4 to OF5, and the additional inclusion of the Total Volume error could then improve the overall agreement between the observed and simulated total runoff hydrographs (including stormwater and baseflow) when moving from OF5 to OF6.
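The differing bias sensitivity of NSE and R^{2} can be demonstrated with a small numerical example we added for illustration: a simulation that is perfectly correlated with the observations but offset by a constant retains R^{2} = 1 while NSE drops sharply.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the squared error
    to the variance of the observations; penalized by constant bias."""
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def r2(obs, sim):
    """Coefficient of determination as the squared Pearson correlation;
    insensitive to constant bias and linear scaling."""
    return np.corrcoef(obs, sim)[0, 1] ** 2

obs = np.array([1.0, 3.0, 2.0, 6.0, 4.0])
sim = obs + 2.0  # perfectly correlated but biased by +2 flow units

print(r2(obs, sim))   # 1.0: the bias is invisible to R^2
print(nse(obs, sim))  # about -0.35: the bias pushes NSE below zero
```

This is why reducing bias (e.g. by adding the Storm Volume error) can raise NSE without moving R^{2}, as observed between OF4 and OF5.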

The performance statistics NSE, R^{2}, and KGE did not change significantly when the number of objective functions increased from four to six (Table 6). Overall, the OF2 and OF3 cases provided average NSE, R^{2}, and KGE values equal to or greater than those of the OF4–OF6 cases. Such results imply that the use of a large number of objective functions is likely to decrease model performance: the non-dominance of Pareto fronts increases, and more parameter samples (and subsequently more computing time and model runs) are required to identify the Pareto fronts as the number of objective functions increases.

### Output uncertainty

Daily streamflow hydrographs simulated for the Pigg River watershed using behavioral parameter sets are plotted in Figure 5. The OF1 case for the Pigg River watershed underestimated the observed streamflow and provided a wide range of streamflow uncertainty bands derived from the behavioral output sets, presumably because the case focused on only the peak flows of selected storm events (Figure 5(a)). The averages of the streamflows simulated using the HSPF model calibrated in the OF2 case overestimated the observations once the High10 error was included in the calibration. When the Low50 error was added in the OF3 case, the relatively low flow parts improved (Figure 5(b) and 5(c)). For the validation period, the model showed responses similar to those of the calibration period in all the OF cases.

Model output uncertainty was quantified as the range between the highest and lowest daily streamflows (Table 7). The output uncertainty consistently decreased as the number of objective functions increased for both the calibration and validation periods, with some exceptions in the Piney River watershed. Although the amount of equifinality quantified in the OF2 and OF5 cases for the Reed Creek watershed and in the OF6 case for the Pigg River watershed was greater than that of the OF1 and OF4 cases for the Reed Creek watershed and the OF5 case for the Pigg River watershed, the streamflows simulated with the OF2, OF5, and OF6 parameter sets carried less uncertainty than those of the OF1, OF4, and OF5 cases, respectively (Tables 5 and 7).

Ranges of streamflow (m^{3}/s):

Cases | Pigg River (calibration) | Pigg River (validation) | Reed Creek (calibration) | Reed Creek (validation) | Piney River (calibration) | Piney River (validation)
---|---|---|---|---|---|---
OF1 | 771.93 | 745.44 | 449.82 | 462.59 | 150.15 | 130.78
OF2 | 715.88 | 629.07 | 398.85 | 413.37 | 118.19 | 104.46
OF3 | 616.09 | 546.95 | 254.55 | 257.64 | 127.91 | 107.09
OF4 | 321.22 | 293.43 | 233.52 | 240.04 | 95.40 | 84.89
OF5 | 242.12 | 229.62 | 219.57 | 217.55 | 75.76 | 60.92
OF6 | 219.79 | 214.07 | 168.01 | 187.27 | 81.96 | 66.78
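The uncertainty quantification described above can be sketched as follows; this is an illustrative reading (not the authors' code), and the exact aggregation, here the overall range across all behavioral runs and days, plus a day-by-day envelope for the plotted bands, is an assumption:

```python
import numpy as np

def output_uncertainty(ensemble):
    """Range between the highest and lowest simulated daily streamflow
    across all behavioral runs (rows = behavioral parameter sets,
    columns = days)."""
    ensemble = np.asarray(ensemble, float)
    return float(ensemble.max() - ensemble.min())

def uncertainty_band(ensemble):
    """Day-by-day envelope (lower, upper) spanned by the behavioral
    runs, as drawn for the streamflow uncertainty bands in Figure 5."""
    ensemble = np.asarray(ensemble, float)
    return ensemble.min(axis=0), ensemble.max(axis=0)
```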

The impacts of equifinality and output uncertainty on bacteria TMDL development are demonstrated in Figure 6. The bacteria (*Escherichia coli*) loads of the Pigg River watershed were simulated using the behavioral parameter sets identified in the calibration. Then, the calendar-month geometric means of the simulated *E. coli* concentrations were compared with the water quality standard for bacteria of 126 cfu/100 mL (Benham *et al.* 2006). As seen in Figure 6, the calendar-month *E. coli* geometric means vary substantially depending on which behavioral parameter set is selected. For instance, the *E. coli* geometric means vary from 37 to 232 cfu/100 mL in the OF3 case, and the variation range (195 cfu/100 mL) is large enough to change the following TMDL implementation goals and plans (Table 8). The uncertainty bands (the variation ranges) decreased while moving from OF1 to OF4 and did not diminish further when additional objective functions were added to OF4 (Table 8), indicating that no more than four objective functions are required to reduce uncertainty in this bacteria modeling. Such a demonstration suggests that the variation range should be considered part of a margin of safety in TMDL development processes.

Calendar-month *E. coli* geometric means (cfu/100 mL):

Statistics | OF1 | OF2 | OF3 | OF4 | OF5 | OF6
---|---|---|---|---|---|---
Minimum | 36 | 33 | 37 | 44 | 46 | 47
Maximum | >10,000 | 335 | 232 | 91 | 105 | 95
Range | >10,000 | 303 | 195 | 47 | 59 | 48
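The calendar-month geometric-mean comparison can be sketched as follows; the function names and grouping keys are illustrative, not taken from the study:

```python
import numpy as np

STANDARD = 126.0  # cfu/100 mL, geometric-mean water quality criterion

def monthly_geomeans(days, conc):
    """Calendar-month geometric means of simulated E. coli
    concentrations; `days` are (year, month) tuples and `conc` are
    the corresponding daily concentrations."""
    by_month = {}
    for (y, m), c in zip(days, conc):
        by_month.setdefault((y, m), []).append(c)
    # geometric mean = exp(mean of logs)
    return {ym: float(np.exp(np.mean(np.log(v)))) for ym, v in by_month.items()}

def exceeds_standard(geomeans, standard=STANDARD):
    """Calendar months whose geometric mean violates the criterion."""
    return [ym for ym, g in geomeans.items() if g > standard]
```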

### Parameter uncertainty

The posterior distributions of the calibration parameters were derived from the behavioral parameter sets identified in the calibration to investigate how the parameter values responded to the varying objective function combinations. As seen in Figure 7, the distributions varied widely without a trend in the median values (marked as solid lines in the boxes) or the ranges (heights of the boxes) as the number of objective functions increased. The amount of parameter uncertainty was quantified with the standard deviation of the normalized behavioral parameter values (Figure 8). The results showed that the posterior parameter distributions were very sensitive to the combinations of the objective functions, which corresponds well with the conclusions of Kuczera & Mroczkowski (1998), who suggested that parameter identifiability is mainly determined by how a model structure responds to an objective function. As seen in Figure 8, parameter uncertainty increased when the number of objective functions increased from one to three. On the other hand, parameter uncertainty did not show any trend when the number of objective functions further increased from four to six, indicating that the increased number of objective functions could decrease the search efficiency of a sampling-based optimization algorithm. To confirm the inversely proportional relationship between the number of objective functions and the sampling efficiency, the number of parameter sets that satisfied the behavioral thresholds of all the objective functions was counted in each multi-objective calibration case for the Pigg River watershed (Figure 9). The y-axis of Figure 9 represents the number of behavioral parameter sets found in each generation. Figure 9 shows that the sampling intervals between peaks are much longer in the OF6 case than in the OF1 case (i.e., when more objective functions are included). Such a finding suggests that the sampling algorithm can explore solutions more efficiently with fewer objective functions.
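The quantification described above can be sketched as a few lines of code; this is a minimal reading assuming min-max normalization of each parameter by its calibration range, which the study does not spell out:

```python
import numpy as np

def parameter_uncertainty(behavioral, lower, upper):
    """Standard deviation of behavioral parameter values after min-max
    normalization by each parameter's calibration range
    (rows = behavioral sets, columns = parameters)."""
    behavioral = np.asarray(behavioral, float)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    normalized = (behavioral - lower) / (upper - lower)
    return normalized.std(axis=0)  # one uncertainty value per parameter
```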

## DISCUSSION

The results showed that the number of behavioral sets is inversely proportional to the number of objective functions employed in calibration (Table 5), which demonstrates that the use of multiple objective functions can be an effective way to reduce the degree of equifinality in hydrological modeling. The decreases in the number of behavioral sets would be partially attributable to the decreased search efficiency of the optimization algorithm caused by the increase in the number of objective functions (or the search dimension) used in the calibration. In the cases including four or more objective functions, the number of behavioral parameter sets decreased dramatically (Table 5). Such sudden degradation of search efficiency is commonly found in optimization practices using multiple objectives and has been a subject of study in the field of evolutionary optimization algorithms (Purshouse & Fleming 2003; Ishibuchi *et al.* 2008; Denysiuk *et al.* 2013; Guo *et al.* 2013). Therefore, introducing too many objective functions can decrease the quality of the solutions (optimal parameter sets) and subsequently increase the prediction error and uncertainty of modeling (Efstratiadis & Koutsoyiannis 2010).

In this study, the objective functions were introduced into the calibration sequentially in the order of the strength of their statistical relationship (correlation) to the first three principal components that explain the total variation of their sampled values (Table 4). Such a procedure could ensure that the HSPEXP criteria that are the most sensitive to modeling outputs are included first in the parameter calibration. There are many other ways to determine the order, and the analysis results could have differed depending on how the objective functions were included in the calibration. For instance, a modeler can identify the first objective function based on the purpose of modeling and then sequentially add others that are known to be the least correlated with the previous ones.

We tried different orders of including the objective functions to investigate how the correlation structure between objective functions can affect equifinality and output uncertainty in multi-objective parameter calibration (Table 9). In this study, the Seasonal and Low50 errors might be less correlated with the first objective function, the Storm Peak error, than the others are. Thus, we rearranged the orders of the objective functions including the High10, Seasonal, and Low50 errors in the OF2 and OF3 cases (the other cases, OF1 and OF4–OF6, were not affected by the replacement) and then compared the equifinality and output uncertainty calculated with the original and new objective function orders. The comparison showed that the replacement changed the trends of output uncertainty, but not those of equifinality, across the different numbers of objective functions (Table 9).

Item | Cases | OF1 | OF2 | OF3 | OF4
---|---|---|---|---|---
Equifinality^{a} | Original^{d} | 71.17 | 73.07 | 23.59 | 7.30
 | Alternative A^{e} | 71.17 | 74.57 | 7.68 | 7.30
 | Alternative B^{f} | 71.17 | 74.57 | 10.35 | 7.30
Output uncertainty^{b} (calibration^{c}) | Original | 449.82 | 398.85 | 254.55 | 233.52
 | Alternative A | 449.82 | 269.43 | 217.92 | 233.52
 | Alternative B | 449.82 | 269.43 | 314.13 | 233.52

^{a}Ratios of behavioral sets (%).

^{b}Ranges of streamflow (m^{3}/s).

^{c}Calibration period (1991–1998).

^{d}Storm Peak and High10 in OF2, and Storm Peak, High10, and Low50 in OF3.

^{e}Storm Peak and Seasonal errors in OF2, and Storm Peak, Seasonal, and Low50 errors in OF3.

^{f}Storm Peak and Seasonal errors in OF2, and Storm Peak, Seasonal, and High10 in OF3.

When the Seasonal error was included in OF2, the output uncertainty decreased more than in the case of the original objective function order (Table 9). In addition, the inclusion of the Low50 error in OF3 (Alternative A) or OF4 (Alternative B) reduced output uncertainty more efficiently than that of the High10 error did. Such a finding indicates that output uncertainty is closely associated with the correlation structure of the objective functions employed in parameter calibration, and it suggests that objective functions less correlated with each other should be included in parameter calibration to reduce modeling output uncertainty efficiently.

The results showed that the equifinality and output uncertainty substantially decreased when the fourth objective function was added in the parameter calibration (OF4; Tables 5 and 7). However, the equifinality and uncertainty were not reduced significantly when further objective functions were added to OF4 (OF5 and OF6). If we could include objective functions that are completely uncorrelated with each other, the degree of equifinality and output uncertainty would decrease more efficiently with increases in the number of objective functions from OF1 to OF4. Considering that there is a certain degree of correlation between the objective functions employed, especially in the OF2 cases (between Storm Peak and High10 or between Storm Peak and Seasonal), the benefit gained from including more than three to four carefully selected objective functions in parameter calibration appears marginal.

The level of equifinality did not decrease consistently when the number of objective functions increased from four to six in the calibration (Table 5). In the cases of the Pigg and Piney River watersheds, the number of behavioral sets even increased when the Total Volume error was included in the calibration, although the amounts of the increases were relatively small (0.08% for OF5 and 1.54% for OF6). Such results could be explained by the correlation of the Total Volume error with the other objective functions: the Total Volume error is closely related to the overall water balance, which the other objectives also implicitly and partially consider. A posterior analysis of the correlation between the objective functions in the OF6 case supported this explanation (Figure 3). In the Reed Creek watershed, on the other hand, the number of behavioral sets increased when an additional objective function was added to the OF1 and OF4 cases (Table 5). Calibration replicated with different seed numbers for the random number generator led to either increases or decreases in the number of behavioral sets in the OF2 case, which could be explained by the stochastic nature of the heuristic sampling-based optimization algorithm, AMALGAM. Additional experiments showed that the randomly sampled initial parameter values could lead to differences in search direction and efficiency even when the same calibration settings were employed.

A large amount of equifinality did not always lead to a large amount of output uncertainty in the results (Tables 5 and 7), partially due to the guided sampling strategy of the optimization algorithm. The parameter search space tends to narrow as sampling evolves, since the search may focus on the vicinity of known optimal points in the parameter space. Thus, parameter sets sampled in a small space would produce streamflow hydrographs similar to each other, which provides fewer variations in the simulated hydrographs and subsequently leads to a smaller amount of output uncertainty. As long as a guided sampling strategy (rather than purely random sampling such as a simple Monte Carlo simulation) is employed in exploring the parameter space, such results are inevitable. Thus, it would be necessary to develop stopping criteria that can be universally applied to parameter calibration using a guided sampling-based optimization algorithm so that the amounts of equifinality quantified in different modeling studies can be compared fairly.

There was no consistent trend in the amount of parameter uncertainty, while the amounts of equifinality and output uncertainty generally decreased as the number of objective functions increased (Tables 5 and 7, Figure 8). This result demonstrates the case-sensitive nature of the posterior parameter distributions and uncertainty. In calibration using a sampling-based heuristic optimization algorithm, objective function values are maximized (or minimized) by adjusting only the parameter values while fixing the other modeling components, such as the objective functions, calibration period (or the observations to be compared with the simulated outputs), model structure, and initial and boundary conditions. Thus, calibration will identify parameter values different from the previous ones once a modification is made to any of the other components. As a result, the parameter uncertainty quantified through calibration will be responsive to any changes made in those components, including the objective functions. Beck (1985), Clark & Vrugt (2006), and Her *et al.* (2017) exemplified how parameter values could compensate for incomplete, biased, or differing representations of a watershed system in hydrological modeling. In addition, the parameters of a watershed-scale model such as HSPF have complicated and nonlinear relationships with each other in the model structure, and it is thus often unclear how they interact in achieving the best modeling performance in parameter calibration. Therefore, the posterior parameter distributions identified in one model application cannot be directly applied to other cases with different watersheds, models, calibration periods and algorithms, and/or combinations of calibration parameters. Figure 8 clearly shows how unexpectedly parameter uncertainty responds to the selection of watersheds and objective functions.

In the original GLUE framework, parameter values are sampled purely randomly (simple Monte Carlo simulation), and thus the values do not converge to a particular region in the parameter space (Beven 2006; Beven & Binley 1992, 2014). This naïve sampling strategy often does not provide enough behavioral parameter sets to make statistically meaningful inferences on parameter and output uncertainty (Blasone *et al.* 2008; Tolson & Shoemaker 2008). Studies have used heuristic optimization algorithms that employ guided sampling strategies to identify behavioral sets efficiently in equifinality and uncertainty analyses (Tolson & Shoemaker 2008; Her & Chaubey 2015). However, the use of an educated guess (rather than a random guess) in parameter exploration brought subjectivity into the behavioral parameter identification of the GLUE framework (Shafii *et al.* 2015). If parameter sampling converges into a small space, for instance, the subsequent sampling may focus further on that space so that the difference between parameter values sampled in the previous and current sequences (or generations) becomes negligible at a certain point. If the sampling continues after that point, the level of equifinality will be significantly overestimated. Monitoring the distance between sampled parameter sets would be a way to detect the possibility of such overestimation.
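The original GLUE sampling described above can be sketched as follows; the likelihood measure, behavioral threshold, and sample size are placeholders, not values from the study:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility

def glue_behavioral(model, lower, upper, threshold, n_samples=10000):
    """Original GLUE: purely random (simple Monte Carlo) sampling of the
    parameter space; keep the samples whose likelihood measure (`model`,
    e.g. returning NSE) meets the behavioral threshold."""
    dim = len(lower)
    samples = rng.uniform(lower, upper, size=(n_samples, dim))
    scores = np.array([model(p) for p in samples])
    return samples[scores >= threshold]
```

Because the draws are independent and uniform, the retained behavioral sets do not cluster around earlier good samples, in contrast to the guided strategy of AMALGAM discussed above.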

The Euclidean distances between the parameter sets sampled in the calibration cases for the Reed Creek watershed were calculated to investigate the spread and extent of the sampling. The parameter values were normalized into the range between 0 and 1 before the distance calculation. The spread of the distances did not vary with the sampling sequences once it stabilized at around the 500th–5,000th samplings, depending on the objective functions (Figure 10). Figure 2 also demonstrates that the search space did not shrink as the sampling sequences proceeded, although sampling was concentrated in some regions. Such findings indicate that the sampling strategy of the AMALGAM algorithm could keep the search extent constant and wide during the entire sampling process while maintaining focus on the vicinity of the potential optimal sets (Figures 2 and 10). On the other hand, the Euclidean distances between the parameter sets increased with the number of objective functions incorporated in the calibration (Figure 10 and Table 10), indicating that the parameter space exploration of AMALGAM becomes more extensive as the number of objective functions grows. The average distance of OF6 (1.58) was 2.65 times that of OF1 (0.60) (Table 10). Considering the size of the variations in the amount of equifinality across the objective function cases (the equifinality of OF1 is 30 times greater than that of OF6; Table 5), the difference in the parameter distances would not be significant. However, it is worth noting that close similarity between parameter sets may lead to some overestimation of the equifinality level, especially in the cases with smaller numbers of objective functions.
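The distance monitoring described above can be sketched as follows (an illustrative reading, not the study's code): normalize each parameter by its calibration range, then track the average pairwise Euclidean distance within each generation:

```python
import numpy as np

def normalize(samples, lower, upper):
    """Scale every parameter into [0, 1] by its calibration range."""
    return (np.asarray(samples, float) - np.asarray(lower, float)) / \
           (np.asarray(upper, float) - np.asarray(lower, float))

def mean_pairwise_distance(samples):
    """Average Euclidean distance between all pairs of (normalized)
    parameter sets in one generation; a shrinking value would signal
    that sampling is converging into a small region."""
    s = np.asarray(samples, float)
    diff = s[:, None, :] - s[None, :, :]          # pairwise differences
    d = np.sqrt((diff ** 2).sum(axis=-1))          # full distance matrix
    return float(d[np.triu_indices(len(s), k=1)].mean())  # upper triangle only
```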

Statistics | OF1 | OF2 | OF3 | OF4 | OF5 | OF6
---|---|---|---|---|---|---
Average distance | 0.60 | 1.04 | 1.36 | 1.55 | 1.70 | 1.58
Percentage (%) | 100 | 175 | 229 | 259 | 285 | 265

## CONCLUSIONS

This study investigated how model performance, equifinality, and uncertainty respond to the number of objective functions, with the goal of understanding whether multi-objective parameter calibration can help improve calibration quality. The results showed that the output uncertainty consistently decreased as the number of objective functions increased. Equifinality, on the other hand, generally decreased as the number of objective functions increased from one to four but fluctuated when the number became greater than four. In addition, the amount of parameter uncertainty overall increased as the number of objective functions increased from one to three, and no trend was found with four to six objective functions. Such findings imply that a large number of objective functions can limit a calibration algorithm's capacity to locate optimal parameter sets. The model performance measured with NSE, R^{2}, and KGE did not deteriorate with an increase in the number of objective functions while the model output uncertainty decreased consistently, suggesting the potential of multi-objective calibration as a way to improve calibration quality. The amount of output uncertainty was found to be associated with the severity of equifinality, but large equifinality did not necessarily mean poor performance or large parameter uncertainty. Parameter uncertainty did not show a significant relationship to the number of criteria used in the calibration. The amount of parameter uncertainty and the posterior parameter distributions delivered little information about modeling equifinality, uncertainty, and performance, exhibiting the limited utility of parameter uncertainty analysis in hydrological modeling.

We eventually need a methodology to identify a parameter value set that accurately represents reality in calibration. However, equifinality and uncertainty will inevitably be introduced into modeling as long as the amount of information we have is less than that required to completely constrain the parameter space. There may be many ways to improve parameter value selection, such as reevaluating model structure and parameters in simulation periods that contain high information content, increasing the number of observations while decreasing the number of parameters to be calibrated, and linking parameters to one of the performance criteria (Wagener *et al.* 2003; Her & Chaubey 2015; Guse *et al.* 2017). This study demonstrated that multi-objective calibration could mitigate the equifinality and uncertainty issues without incorporating additional observations into parameter calibration, even though it could not solve the issues completely. The findings of this study would be useful in modeling studies, especially for areas where available measurements are limited, such as developing countries and watersheds with short monitoring records.

## ACKNOWLEDGEMENTS

This study was partially supported by research grants from the University of Florida's Institute of Food and Agricultural Sciences (UF/IFAS) through the Early Career Scientist Seed Fund program.