Abstract
Evaluation criteria play a key role in assessing the performance of hydrological models. Most previous criteria are based on the standard least squares (SLS) method, which assumes model residuals to be homoscedastic and is, therefore, not suitable for cases with heteroscedastic residuals. Here, we compared a heteroscedastic and symmetric efficiency (HSE) criterion with the Nash–Sutcliffe efficiency (NSE) and the heteroscedastic maximum-likelihood estimator (HMLE) by running a monthly water balance model with four parameters (i.e., the abcd model) in 138 basins located in the continental United States, taken from the Model Parameter Estimation Experiment dataset. The results show that, compared to the NSE, the HSE and HMLE are both more effective at stabilizing variance and produce more uniform performance across flow magnitudes. The HMLE is slightly more effective than the HSE at stabilizing the residual heteroscedasticity, with the aid of an additional parameter.
INTRODUCTION
To resolve the aforementioned problems, it is essential to acquire quantitative knowledge of the heteroscedasticity of model residuals (Beven & Smith 2015) and develop variance-stabilizing techniques. It is common practice in hydrology to use transformations, such as logarithmic or square root transformations, to adjust the variances at different flow levels (Oudin et al. 2006). Another approach is the weighted least squares method, whose major difficulty is the calculation of the weights. For instance, Sorooshian & Dracup (1980) developed a heteroscedastic maximum-likelihood estimator (HMLE) method by assuming that the standard deviation of the model residuals followed a power function of the predicted flows qsim(t). In a later study, Sorooshian & Gupta (1983) replaced qsim(t) with the observed flows qobs(t) in the HMLE method (Yapo et al. 1998; Nourali et al. 2016). There have also been attempts to use a linear function to represent the heteroscedasticity of model residuals (e.g., Schoups & Vrugt 2010). The above methods can effectively stabilize the variance in model residuals; however, their formulas contain unknown parameters that must be inferred from the model and data via complicated iterations (Duan 1991; Sorooshian et al. 1993). Thus, the objective function cannot be fully determined before calibration. Moreover, the unknown parameters vary between applications, which makes it difficult to compare performances across case studies (e.g., multi-models or multi-basins). The primary objective of this study was to construct an efficiency criterion that is suitable for heteroscedastic cases, requires no additional parameters, and can be computed from the observed and simulated flows alone.
Metric-based distance or similarity measures are widely used in many scientific fields, such as bioinformatics (Sippl 2008), transport geography (L'Hostis 2015), and computer science (Chen et al. 2009). However, the benefits of a metric-based definition of efficiency criteria have not been adequately discussed in the field of hydrology. For instance, (1) a linear combination of metrics is also a metric (see Appendix A, available with the online version of this paper), which provides a theoretical foundation for aggregated multi-objective calibrations (e.g., Zhang et al. 2016), and (2) the equivalence of metrics has been thoroughly studied in metric space theory (O'Searcoid 2006), which can be applied to hydrological studies to distinguish equivalent efficiency criteria (e.g., the debate between Willmott et al. (2012) and Legates & McCabe (2013)). There are also other properties, such as convergence and continuity, which are likely helpful for efficiency criteria and need additional consideration. Because of these advantages, heteroscedasticity and the metric-based definition were taken into account simultaneously in the design of the newly proposed criterion.
The proposed criterion was compared to the NSE and HMLE in terms of the effectiveness of variance stabilization using a heteroscedasticity test (i.e., the White test) (White 1980). Further analysis of the residuals of the three objective functions was also conducted to investigate the advantages and disadvantages of the proposed criterion. The hydrological model was run over a large sample of catchments to arrive at more convincing and general conclusions about the criterion comparison than a single simulation would allow (Gupta et al. 2014).
METHODS
Nash–Sutcliffe efficiency
SLS-based criteria, including the NSE, are generally highly sensitive to peak flows in model calibrations, which can be largely attributed to the use of squared differences in formulating these indices (Legates & McCabe 1999; Krause et al. 2005; Moriasi et al. 2007; Pushpalatha et al. 2012). It should also be emphasized that both the NSE and SLS are dominated by the larger model residuals, which are more likely to occur in high flows because of heteroscedasticity (Sorooshian & Dracup 1980; Gupta et al. 1998; Schoups & Vrugt 2010; Pushpalatha et al. 2012). This characteristic results in the high sensitivity of NSE/SLS schemes to high flows. Meanwhile, as pointed out by Schaefli et al. (2005), Schaefli & Gupta (2007), and Muleta (2012), the NSE overestimates model performance in the case of strong seasonal flows and underestimates it in the case of weak seasonal flows. This result suggests that variance alone is not a perfect benchmark for calibrating hydrological models (Legates & McCabe 2013).
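The sensitivity to peak flows described above can be illustrated with a minimal sketch. It assumes the standard textbook definition of the NSE (one minus the ratio of the sum of squared errors to the sum of squared deviations of the observations from their mean). The flow values are made up for illustration and are not taken from the study.

```python
import numpy as np

def nse(q_obs, q_sim):
    """Nash-Sutcliffe efficiency, standard definition (assumed here)."""
    q_obs = np.asarray(q_obs, dtype=float)
    q_sim = np.asarray(q_sim, dtype=float)
    sse = np.sum((q_obs - q_sim) ** 2)          # squared model errors
    sst = np.sum((q_obs - q_obs.mean()) ** 2)   # spread of the observations
    return 1.0 - sse / sst

# Hypothetical monthly flows with one high-flow month.
q_obs = np.array([1.0, 2.0, 4.0, 50.0])

# A 10% error on the peak versus a 100% error on the lowest flow.
sim_peak_error = q_obs + np.array([0.0, 0.0, 0.0, 5.0])
sim_low_error = q_obs + np.array([1.0, 0.0, 0.0, 0.0])

nse_peak_error = nse(q_obs, sim_peak_error)
nse_low_error = nse(q_obs, sim_low_error)
print(nse_peak_error, nse_low_error)
```

In this toy case the relatively small error on the peak lowers the NSE more than the relatively large error on the lowest flow, because the errors enter the criterion as absolute squared differences.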
Heteroscedastic maximum-likelihood estimator
Heteroscedastic and symmetric efficiency
HSE = 1 indicates a perfect match between the simulated and observed flows, and HSE = −1 corresponds to a simulated series of constant zero flows. In other words, no effective information is extracted from the model simulations if HSE = −1. The HSE contribution at a given time step is independent of the contributions at other time steps, which is referred to as temporal independence. As a result of the boundedness and temporal independence, the influence of a single time step on the HSE is limited to within 2/N, where N is the number of time steps; therefore, the HSE is a robust criterion suitable for cases with outliers.
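The 2/N bound follows from boundedness and temporal independence alone, and a short sketch can make this concrete. The code below does not implement the HSE formula. It assumes only a generic criterion that is the mean of N per-step scores, each confined to [−1, 1]; the function name and the random scores are hypothetical.

```python
import numpy as np

def mean_bounded_score(step_scores):
    """Average of per-step scores that must each lie in [-1, 1].

    Generic stand-in for a bounded, temporally independent criterion;
    this is NOT the HSE formula itself.
    """
    s = np.asarray(step_scores, dtype=float)
    if np.any(np.abs(s) > 1.0):
        raise ValueError("each per-step score must lie in [-1, 1]")
    return s.mean()

N = 120  # e.g. ten years of monthly steps (illustrative)
rng = np.random.default_rng(0)
scores = rng.uniform(-1.0, 1.0, size=N)

# Push one step to the extreme that is furthest from its current value.
perturbed = scores.copy()
perturbed[0] = -1.0 if scores[0] >= 0.0 else 1.0

shift = abs(mean_bounded_score(perturbed) - mean_bounded_score(scores))
bound = 2.0 / N
print(shift, bound)

# Worst case: a perfect step flipped to the worst possible value.
worst_shift = abs(mean_bounded_score([-1.0] + [1.0] * (N - 1)) - 1.0)
```

Changing one score from 1 to −1 moves the mean by exactly 2/N, and no single-step change can move it further, which is why an outlier at one time step has limited influence.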
The White test for heteroscedasticity
MODEL AND DATASET
Description of the abcd model
A commonly used monthly water balance model (i.e., the abcd model) (Thomas 1981) is selected in this study because it has been applied over a variety of hydroclimatic regions (e.g., Martinez & Gupta 2010). The model structure is illustrated in Figure 1, and the parameters with their descriptions are listed in Table 1. Detailed information on the abcd model can be found in Thomas (1981), and a brief description of this model is provided in Appendix C (available with the online version of this paper).
Parameters | Units | Description | Ranges | Initial value |
---|---|---|---|---|
a | – | Propensity for flow to occur before the soil is completely saturated | (0,1) | 0.99 |
b | mm | Upper soil zone water storage capacity | (0,5000) | 400 |
c | – | Groundwater recharge coefficient | (0,1) | 0.2 |
d | – | Groundwater discharge constant | (0,1) | 0.8 |
Note: The initial values were set to the median parameter values across the 138 basins obtained from a rough preliminary calibration.
Study region and dataset
The Model Parameter Estimation Experiment (MOPEX) dataset (Duan et al. 2006) (available at ftp://hydrology.nws.noaa.gov/pub/gcip/mopex/US_Data/) provides the daily precipitation, potential evapotranspiration, streamflow, and maximum/minimum air temperatures of 438 basins located in the continental United States. In the US MOPEX dataset, only 138 basins (Figure 2) have complete time series of monthly flow data from 1948 to 2000, with NSE ≥ 0.5 (Moriasi et al. 2007), HSE ≥ 0.4, and |λ| ≤ 1, by using the abcd model (see the next section for details on the model calibrations). The analysis of this study focused on the ensemble results of the 138 basins. Furthermore, detailed model results of two illustrative examples are also presented to illustrate the similarity and differences among the NSE, HMLE, and HSE. The first catchment is the Quinebaug River Basin in Jewett City, CT (QRB; USGS: 01127000), and the other is the Current River Basin in Van Buren, MO (CRB; USGS: 07067000). Detailed calibration results of all 138 basins, including the simulated flows and optimal parameters with the three criteria, are provided in Supplementary Material (available online).
Calibration and validation
Martinez & Gupta (2010) compared the ranges of the four parameters of the abcd model derived in their study with previously reported values (Alley 1984; Vandewiele et al. 1992). According to these studies, the parameter a of the abcd model can presumably range from 0.0 to 1.0, and values of parameter b <1,900 (mm/month) are more common for no-snow basins; therefore, in this study, the parameters a, c, and d were set to vary from 0.0 to 1.0, and parameter b was set to vary from 0 to 5,000 (mm/month). The shuffled complex evolution method developed at the University of Arizona (denoted as SCE-UA) was employed as the calibration algorithm (Duan 1991). The model warm-up period was 1948–1950, the calibration period was 1951–1980, and the validation period was 1981–2000. The model parameters were calibrated separately with each of the three criteria: NSE, HMLE, and HSE.
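The experimental setup described above can be written down compactly. The sketch below only encodes the periods and parameter ranges stated in the text and splits a monthly record accordingly. The dictionary and function names are hypothetical, and the SCE-UA optimizer itself is not shown.

```python
import numpy as np

# Periods as stated in the text (inclusive calendar years).
PERIODS = {
    "warm_up": (1948, 1950),
    "calibration": (1951, 1980),
    "validation": (1981, 2000),
}

# Search ranges as stated in the text. Table 1 lists them as open
# intervals; treating them as plain (low, high) pairs is a simplification.
PARAM_BOUNDS = {
    "a": (0.0, 1.0),
    "b": (0.0, 5000.0),
    "c": (0.0, 1.0),
    "d": (0.0, 1.0),
}

def period_masks(years):
    """Return one boolean mask per period for an array of calendar years."""
    years = np.asarray(years)
    return {
        name: (years >= start) & (years <= end)
        for name, (start, end) in PERIODS.items()
    }

# One entry per month from January 1948 to December 2000.
years = np.repeat(np.arange(1948, 2001), 12)
masks = period_masks(years)
months_per_period = {name: int(mask.sum()) for name, mask in masks.items()}
print(months_per_period)
```

Flows simulated during the warm-up months would be discarded, and each criterion would then be evaluated separately on the calibration and validation masks.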
RESULTS AND DISCUSSION
Effectiveness of variance stabilization
In this section, the White test was used to verify the variance-stabilizing abilities of the NSE, HMLE, and HSE. The results of the White test for the heteroscedasticity of the (weighted) residuals in all 138 basins and the two illustrative basins are displayed in Figure 3. The NSE residuals in all 138 basins show significant heteroscedasticity based on the White test results (NR2 ranges from 37.6 to 332.8, with an average of 150.0). Meanwhile, the HMLE- and HSE-weighted residuals passed the White test in 95.7% (132 out of 138) and 51.5% (71 out of 138) of the basins, respectively, at a significance level of α = 0.01 (i.e., NR2 below the critical value χ2(2) = 9.21). Compared to the White statistics of the NSE residuals, the HMLE and HSE weightings reduced the heteroscedasticity, on average, by 96.9% and 87.7%, respectively. The scattergrams of the two illustrative basins also show that the variances of the NSE residuals were positively correlated with the flow magnitude. In comparison, the HMLE- and HSE-weighted residuals were more homoscedastic, supporting their assumptions about the relationship between the variances of the raw residuals and the corresponding flows. In summary, both the HMLE and HSE methods were shown to be effective for stabilizing variances across the 138 basins, while the effectiveness of the NSE for stabilizing variances was rather weak.
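A minimal sketch of the White statistic may help in reading these numbers. It assumes the common single-regressor form of the test, in which the squared residuals are regressed on the flow and its square and NR2 is compared with a chi-square distribution with two degrees of freedom. The choice of flow as the regressor is an assumption, suggested by the two degrees of freedom quoted above, and the data are synthetic.

```python
import numpy as np

CHI2_CRIT_ALPHA_001_DF2 = 9.21  # critical value quoted in the text

def white_statistic(residuals, flows):
    """N * R^2 of the auxiliary regression e^2 ~ 1 + q + q^2."""
    e2 = np.asarray(residuals, dtype=float) ** 2
    q = np.asarray(flows, dtype=float)
    X = np.column_stack([np.ones_like(q), q, q ** 2])
    beta, *_ = np.linalg.lstsq(X, e2, rcond=None)
    ss_res = np.sum((e2 - X @ beta) ** 2)
    ss_tot = np.sum((e2 - e2.mean()) ** 2)
    if ss_tot == 0.0:
        return 0.0  # squared residuals are constant: nothing to explain
    return e2.size * (1.0 - ss_res / ss_tot)

# Synthetic, deterministic example with 360 monthly steps.
flows = np.linspace(1.0, 100.0, 360)
signs = np.where(np.arange(flows.size) % 2 == 0, 1.0, -1.0)

hetero_resid = 0.1 * flows * signs  # spread grows with flow
homo_resid = signs                  # constant spread

stat_hetero = white_statistic(hetero_resid, flows)
stat_homo = white_statistic(homo_resid, flows)
print(stat_hetero > CHI2_CRIT_ALPHA_001_DF2, stat_homo > CHI2_CRIT_ALPHA_001_DF2)
```

A statistic above the critical value rejects homoscedasticity; a statistic below it means the residuals pass the test at that significance level.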
Note that the parameter λ, which represents the degree of heteroscedasticity in the HMLE procedure, was generally close to zero, with a mean value of −0.04 and |λ| ≤ 0.2 for 75% of all basins (Figure 4). Meanwhile, the weights of the HSE were significantly correlated with those of the HMLE in a positive manner (R2 = 0.92 ± 0.046), which indicates that the HSE method has a strong relationship with the HMLE method, even if the HSE does not use any additional parameters.
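The role of λ can be sketched in code. The block below assumes the commonly cited form of the HMLE, with weights w(t) = qobs(t)^(2(λ − 1)) and the criterion given by the weighted mean squared error divided by the geometric mean of the weights. This form is an assumption taken from the general literature on the HMLE and may differ in detail from the formulation used in this study; the flow values are illustrative.

```python
import numpy as np

def hmle_weights(q_obs, lam):
    """Weights w_t = q_obs_t ** (2 * (lam - 1)); assumed standard form."""
    q_obs = np.asarray(q_obs, dtype=float)
    if np.any(q_obs <= 0.0):
        raise ValueError("flows must be strictly positive")
    return q_obs ** (2.0 * (lam - 1.0))

def hmle(q_obs, q_sim, lam):
    """Weighted MSE divided by the geometric mean of the weights."""
    q_obs = np.asarray(q_obs, dtype=float)
    q_sim = np.asarray(q_sim, dtype=float)
    w = hmle_weights(q_obs, lam)
    weighted_mse = np.mean(w * (q_obs - q_sim) ** 2)
    geometric_mean_w = np.exp(np.mean(np.log(w)))
    return weighted_mse / geometric_mean_w

q_obs = np.array([1.0, 10.0, 100.0])
q_sim = np.array([1.5, 9.0, 90.0])

# lam = 1 gives unit weights, so the criterion reduces to the plain MSE.
mse = np.mean((q_obs - q_sim) ** 2)
hmle_lam1 = hmle(q_obs, q_sim, lam=1.0)

# Ratio of the weight on the lowest flow to the weight on the highest flow.
ratio_lam0 = hmle_weights(q_obs, 0.0)[0] / hmle_weights(q_obs, 0.0)[-1]
ratio_neg = hmle_weights(q_obs, -0.5)[0] / hmle_weights(q_obs, -0.5)[-1]
print(hmle_lam1, ratio_lam0, ratio_neg)
```

Under this assumed form, a negative λ makes the weights fall off faster than the inverse square of the flow, so low flows dominate the criterion even more strongly than at λ = 0.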
Further analysis of residuals
Ideally, the model residuals should be independent of the flow magnitudes, although real situations usually deviate from this ideal. Therefore, we regressed the model residuals against the observed flows in all 138 basins (e.g., Figure 5), and the corresponding statistical parameters (slope, intercept, and R2) of the linear regression are shown in Figure 6. The findings are listed below:
The slope was mostly negative (Figure 6(b)), and the intercept was mostly positive (Figure 6(c)), which indicated that the flows were generally overestimated in low flow regimes and underestimated in high flow regimes.
The HMLE and HSE intercepts were considerably smaller than the NSE intercept, indicating much better performances at low flows using the HMLE and HSE than the NSE (Figure 7). Meanwhile, the performances of the HSE simulations and most of the HMLE simulations, where severe negative biases did not occur, were only slightly worse than the NSE simulations at high flows.
When λ ≥ 0, the NSE, HMLE, and HSE residuals had similar R2 values against the observed flows. However, when λ < 0, the HMLE residuals depended more on the observed flows than the HSE and NSE residuals, and the R2 values for the HMLE even reached 0.6–0.9, which is undesirable in model evaluation. Moreover, when λ < 0, the HMLE slope was more negative than the NSE and HSE slopes, which led to severe negative bias at middle and high flow levels (e.g., Figures 6(b) and 8(b)).
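The regression behind these findings can be sketched as follows. The code assumes that the residual is defined as simulated minus observed flow, which matches the reading that a positive intercept means overestimation at low flows, and it uses an ordinary least-squares line. The function name and the synthetic flows are hypothetical.

```python
import numpy as np

def residual_regression(q_obs, q_sim):
    """Regress residuals (q_sim - q_obs, an assumed convention) on q_obs.

    Returns the slope, intercept, and R^2 of the fitted line.
    """
    q_obs = np.asarray(q_obs, dtype=float)
    residuals = np.asarray(q_sim, dtype=float) - q_obs
    slope, intercept = np.polyfit(q_obs, residuals, 1)
    r2 = np.corrcoef(q_obs, residuals)[0, 1] ** 2
    return slope, intercept, r2

# Synthetic model that overestimates low flows and underestimates high flows.
q_obs = np.array([5.0, 10.0, 25.0, 60.0, 120.0])
q_sim = 5.0 + 0.8 * q_obs  # residual = 5 - 0.2 * q_obs

slope, intercept, r2 = residual_regression(q_obs, q_sim)

# Flow level at which the fitted residual changes sign.
crossover_flow = -intercept / slope
print(slope, intercept, r2, crossover_flow)
```

A negative slope with a positive intercept reproduces the pattern reported above, and a large R2 signals that the residuals depend strongly on flow magnitude rather than being independent of it.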
Performance consistency between the calibration and validation periods
We compared the values of the NSE, HMLE, and HSE during calibration and validation. The R2 value between the calibration and validation periods was 0.61 for the NSE, and 0.80 and 0.85 for the HMLE and HSE, respectively (Figure 9). The criterion values for the HMLE and HSE (especially the HSE) showed close agreement between the calibration and validation periods, which is desirable in model validation. In addition, data quality, especially that of the rainfall input data, can influence the parameter estimation in model calibration and validation to some extent (Beven & Smith 2015). High-resolution datasets should be used to obtain more precise results.
CONCLUSIONS
Due to their different weighting schemes, the NSE puts more emphasis on high flows, while the HMLE and HSE do not emphasize any particular flow level. If modelers are mainly concerned with flood peaks, the NSE is suggested; if modelers would like to obtain more uniform performance across flow magnitudes, either the HMLE or the HSE is suggested.
For the two heteroscedastic criteria, the HMLE produced slightly more homoscedastic residuals than the HSE, with the aid of an additional parameter (λ). However, this gain in variance stabilization comes at a cost. When λ < 0, the HMLE tends to allot excessive weights to low flows and insufficient weights to middle and high flows, which might cause severe bias at middle and high flow levels. Compared to the HMLE-weighting approach, the HSE-weighting approach is slightly less effective for variance stabilization but produces more robust simulations due to the use of mean flows (qmean) instead of observed flows and the setting of λ to zero. The robustness of the HSE is also reflected in two other aspects. The first is that the influence of a single time step on the HSE is limited to within 2/N, and the second is that the HSE values during calibration and validation agree more closely than those of the NSE and HMLE. Therefore, if modelers would like to stabilize the residual heteroscedasticity as much as possible, the HMLE is recommended; if modelers would like a more robust criterion that still removes the vast majority of the residual heteroscedasticity, the HSE is recommended. Future research should test the proposed HSE and the HMLE with different hydrological models.
ACKNOWLEDGEMENT
This study was supported by the National Key R&D Program of China (2017YFA0603702) and National Natural Science Foundation of China (no. 41571019 and no. 51779009). We are very grateful to the editors and reviewers for their valuable comments and constructive suggestions that helped us to greatly improve the manuscript.