The ability of a hydrological model to reproduce observed streamflow can be represented by a large variety of performance measures. Although these metrics may suit different purposes, it is unclear which of them is most appropriate for a given application. Our objective is to investigate various performance measures to assess model structures as tools for catchment classification. For this purpose, 12 model structures are generated using the SUPERFLEX modelling framework and then applied to 53 meso-scale basins in Rhineland-Palatinate (Germany). Statistical and hydrological performance measures are compared with signature indices derived from the flow duration curve and combined into a new performance measure, the standardized signature index sum (SIS). The performance measures are evaluated on their ability to distinguish the relative merits of various model alternatives. In many cases, classical and hydrological performance measures assign similar values to different hydrographs. These measures, therefore, are not well suited for model comparison. The proposed SIS is more effective in revealing differences between model results. It allows for a more distinctive identification of a best performing model for individual basins. A best performing model structure obtained through the SIS can be used as a basin classifier.

## INTRODUCTION

The selection of an appropriate model for a basin critically depends on the basin characteristics and its dominant runoff processes. This inevitably leads to modelling approaches that recognize the different characteristics of individual systems (e.g. Leavesley *et al.* 1996; Fenicia *et al.* 2011; Coxon *et al.* 2014). In conceptual hydrological modelling, flexible and multi-model frameworks have already been used to examine, for example, patterns of structural errors across multiple basins (Clark *et al.* 2008); mean residence time and basin mixing mechanisms (McMillan *et al.* 2012; Hrachowitz *et al.* 2014); representations of plot-scale surface and groundwater dynamics (Krueger *et al.* 2010); and time scale control on model parameters and inferred complexity (Kavetski *et al.* 2011).

Previous work has shown that when a set of model structures is applied to multiple basins, their performance may rank differently (e.g. Duan *et al.* 2006; Fenicia *et al.* 2013; van Esse *et al.* 2013). As a result, the performance of the different structures may provide information on the similarities and differences between basins and may therefore be used as a catchment classifier. For this approach to succeed, it is necessary to select an appropriate measure for evaluating model performance to identify a best performing model structure for each basin.

Streamflow simulations are typically assessed through a wide range of performance measures. These metrics include statistical performance measures (e.g. Pearson correlation coefficient (Pearson 1895); weighted *R*² (Krause *et al.* 2005)), hydrological performance measures (e.g. Nash and Sutcliffe Efficiency (NSE) (Nash & Sutcliffe 1970); volumetric efficiency (Criss & Winston 2008)), performance metrics that are derived from flow duration curves (FDCs) (Yilmaz *et al.* 2008) and other hydrological signatures (e.g. Westerberg *et al.* 2011; Coxon *et al.* 2014). The assessment of model performance may be conducted using these metrics in isolation or combined in a single objective function (e.g. Kling *et al.* 2012). Alternatively, these metrics may also be used simultaneously in a multi-objective framework (e.g. Yilmaz *et al.* 2008).

All known performance measures have specific strengths, weaknesses and sensitivities to various parts of the hydrograph (Legates & McCabe Jr 1999; Krause *et al.* 2005; Schaefli & Gupta 2007; Gupta & Kling 2011; Pushpalatha *et al.* 2012). For example, the NSE favours the simulation of the peaks, the *R*² reveals similarities in the dynamics, but disregards differences in the absolute values, whereas the FDC provides indications on the flow distribution, disregarding potential timing errors.

The FDC is perceived to represent a meaningful descriptor of catchment response. Blöschl *et al.* (2013) describe FDCs as ‘a key signature of runoff variability’ which ‘can be used for evaluating rainfall–runoff model output and for calibrating such models’. The use of an FDC gives more information about the hydrological behaviour of the modelled basins (Hrachowitz *et al.* 2014) and their underlying hydrological processes (Yilmaz *et al.* 2008; Gupta *et al.* 2009; Wagener & Montanari 2011).

FDCs are often used in hydrology, e.g. for model evaluation (e.g. Yilmaz *et al.* 2008; Herbst *et al.* 2009; Westerberg *et al.* 2011) or catchment grouping (e.g. Carrillo *et al.* 2011; Sawicz *et al.* 2011). A comparison of FDCs can be done with one value for the whole FDC (Ganora *et al.* 2011; Sauquet & Catalogne 2011) or with multiple indices which consider where differences between two FDCs occur. Westerberg *et al.* (2011) and Coxon *et al.* (2014) use several evaluation points, whereas Yilmaz *et al.* (2008) propose indices describing meaningful parts of the FDC. Furthermore, the use of the FDC is not restricted to the entire curve. For certain research questions specific parts of the curve can be used as well (Herbst *et al.* 2009; Sawicz *et al.* 2011; Coxon *et al.* 2014).

The aim of this study is to test the appropriateness of statistical performance measures, hydrological performance measures and performance measures derived from the FDC to identify a best performing model out of various calibrated models for 53 basins with a view to basin classification. Basins that are characterized by the same model structure may build a class of similar basins. The structures of these models are generated within the SUPERFLEX modelling framework, which facilitates model development and enables controlled model comparison (Fenicia *et al.* 2011).

## STUDY AREA AND DATA

The study area consists of 53 small to medium-sized gauged basin areas in Rhineland-Palatinate (RLP), Germany (Figure 1 and Appendix 1 (available in the online version of this paper)). The basins lie in low mountain ranges of the Rheinisches Schiefergebirge, the Saar-Nahe-Bergland and the Rhine Valley. Among these 53 basins, there are 35 headwater basins, of which three are triply nested. Basin sizes vary from 10 to 1,469 km²; 48 basins are less than 400 km² and two are larger than 1,000 km². Elevation ranges between 100 and 818 m above sea level (a.s.l.) with a mean elevation of 341 m a.s.l. Geology differs from schist, greywacke, and quartzite to sedimentary rock with tertiary and quaternary volcanism (basaltic rocks, pumice stone and tuff). Almost all the basins are rural with little urbanization except for three basins, which are moderately urbanized (11–13%). Agricultural land use varies between 7 and 90%, for most of the basins between 40 and 80%. Some basins, especially in the southeast, support viticulture and orchards.

All basins within the study area belong to the same climatic region but, depending on altitude and location, show climatic peculiarities: precipitation ranges from 530 mm/y in the southeast to 1,108 mm/y in the west, and temperature and potential evaporation differ as well. Runoff behaviour ranges from high reactivity and variability with high event runoff coefficients in wet basins with steep slopes and low storage, to low reactivity and variability with low runoff coefficients in dry, mostly flat basins with high storage capacities (Ley *et al.* 2011; Ley 2014).

For the model application, hourly runoff, areal precipitation and temperature data for the period from January 1996 to December 2003 are used. These time series cover a wide range of diverse annual or seasonal precipitation and runoff events.

Areal precipitation was calculated with ‘InterMet’ (Gerlach 2006), which interpolates meteorological data using Kriging. To calculate areal precipitation for Rhineland-Palatinate and adjacent areas, InterMet takes into account data from about 200 rain gauges, meteorological data, prevailing atmospheric conditions, orography, and satellite and radar data. Typical rainfall fields have an extent comparable to most of the basin sizes. In summer, some mostly convective rainfall events affect only parts of the basins. Although snow events do occur in the study area, they are of minor importance since they are limited in amount, not prolonged and irregular. Snow processes are thus not considered in the modelling exercise.

## METHODOLOGY

### Model structures

The SUPERFLEX modelling framework can be used to perform model comparisons through constructing models that differ in a controlled way. In the present case, it was applied using 12 model structures as proposed by Fenicia *et al.* (2013), which include serial, linear and parallel model structures with different numbers of reservoirs and parameters, thus covering a relatively broad range of conceptual model complexities (Figure 2). Starting from the simplest structure (model structure 1 (M01), Figure 2), the complexity gradually increases by adding reservoirs and lag-functions to the most complex structure (model structure 12 (M12), Figure 2). Although Fenicia *et al.* (2013) and van Esse *et al.* (2013) describe the models extensively, a short explanation of the structures follows.

Model structure 1 consists of a single reservoir with a nonlinear storage-discharge relationship characterized by a time constant and a power parameter. Model structure 2 also consists of a single reservoir, but with an upper threshold, and uses a linear function to describe outflow and a power function to describe flow exceeding the threshold. Model structures 3, 4, 5 and 6 show serial reservoir connections. These models differ in the constitutive functions used to describe the flows between the reservoirs and in the number of calibrated parameters. Model structure 3 describes the flow between the reservoirs as a threshold function while the model structures 4, 5 and 6 use power functions to describe outflows. Model structure 5 has a lag-function to represent hydrograph delay. Compared to model structure 5, model structure 6 has an interception reservoir and model structure 7 has a riparian reservoir. Model 8 is a simple parallel structure with two parallel reservoirs, namely a fast reservoir and a slow reservoir and a precipitation partitioning parameter. The model structures 9, 10, 11 and 12 build on model structure 8 with increasing complexity and an unsaturated reservoir preceding the parallel structure. Figure 2 details the 12 model structures.

### Calibration approach

Fifty-three basins in RLP provide concurrent rainfall and runoff data. The warm-up period of the calibration consists of the first year of the data period (1996). Following a split-sample approach (Klemeš 1986), the remaining range (1997–2003) is subdivided into a calibration period and a validation period of equal length.

The calibration objective function is based on a weighted least squares approach, assuming independent Gaussian error with zero mean and standard deviation linearly proportional to the modelled discharge (Kavetski & Fenicia 2011). Optimization is carried out through a quasi-Newton method with 20 multi-starts randomly selected across the parameter space. Determining one calibrated model for each of the 53 basins and each of the 12 structures results in 636 calibrated models and hence 636 validated models.

### Diagnostics

‘Hydrological model’ and ‘model structure’ can be confusing terms when not clearly defined. This study considers a hydrological model as a combination of a specific model structure with a particular parameter set. Calibrating this parameter set yields an optimal model for a given forcing data set (basin). If multiple model structures are used, the optimal model with the highest performance defines the best performing model structure.

The identification of a best performing model structure for a given basin commences with comparing observed with simulated runoff time series by means of a performance measure. This study tests three types of performance measures: (1) statistical performance measures, (2) hydrological performance measures and (3) performance metrics from the FDC. Appendix 2 (available in the online version of this paper) contains the mathematical formulations of all measures.

Statistical performance measures:

• Root mean square error (RMSE)

• Pearson product-moment correlation coefficient (Pearson 1895)

• Weighted *R*² (Krause *et al.* 2005)

• Spearman's rank correlation coefficient (Spearman 1904)

Hydrological performance measures:

• NSE (Nash & Sutcliffe 1970)

• Modified NSE (without squaring values) (Krause *et al.* 2005)

• Index of agreement (Willmott 1981)

• Modified index of agreement (without squaring values) (Krause *et al.* 2005)

• Kling–Gupta efficiency (Gupta *et al.* 2009; Kling *et al.* 2012)

• Volumetric efficiency (Criss & Winston 2008)

Performance metrics from the FDC:

• SIS: combination of four performance metrics:

• FHV: very high flow (Yilmaz *et al.* 2008)

• FMV: high flow

• FMS: slope of the mid-segment FDC (Yilmaz *et al.* 2008)

• FLV: low flow (Gronz 2013; Yilmaz *et al.* 2008)

The 12 selected model structures are calibrated for all basins, and the above-listed performance measures are calculated for each individual model (i.e. structure + parameter set) and basin. All performance measures are analysed with a view to redundancies, explanatory power and suitability to identify a best performing model for a basin.
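A few of the listed single-valued measures can be sketched in a few lines. The block below is a minimal illustration that uses the commonly cited forms of the RMSE, the NSE, the modified NSE (absolute instead of squared differences) and the volumetric efficiency; these may differ in detail from the formulations in Appendix 2. The function names and the toy discharge series are assumptions made for this example only.

```python
import math


def rmse(obs, sim):
    """Root mean square error between observed and simulated discharge."""
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))


def nse(obs, sim, power=2):
    """Nash-Sutcliffe efficiency; power=1 gives the modified form without squaring."""
    mean_obs = sum(obs) / len(obs)
    residual = sum(abs(o - s) ** power for o, s in zip(obs, sim))
    spread = sum(abs(o - mean_obs) ** power for o in obs)
    return 1.0 - residual / spread


def volumetric_efficiency(obs, sim):
    """One minus the summed absolute error relative to the observed volume."""
    return 1.0 - sum(abs(s - o) for o, s in zip(obs, sim)) / sum(obs)


# Hypothetical toy series: the same absolute error, once at the peak, once at low flow.
obs = [1.0, 2.0, 3.0, 4.0]
sim_peak_error = [1.0, 2.0, 3.0, 5.0]
sim_low_error = [2.0, 2.0, 3.0, 4.0]

for sim in (sim_peak_error, sim_low_error):
    print(rmse(obs, sim), nse(obs, sim), nse(obs, sim, power=1),
          volumetric_efficiency(obs, sim))
```

In this toy case both simulations receive identical scores on all four measures although the error sits in different parts of the hydrograph, which illustrates in a simplified way why a single value may not separate different model results.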

The statistical and hydrological performance measures describe the overall performance of a model with one value. The four performance metrics that are derived from the FDC examine the influence of specific aspects of the hydrograph on model performance. The FDC is the complement of the cumulative distribution function of stream flow (Vogel & Fennessey 1994). Despite the fact that FDCs include no information on timing of the flow, they are still a useful way of comparing observed and simulated runoff. A poorly reproduced FDC is an indication of poor model performance. Therefore, the comparison between simulated and observed FDC is a powerful descriptor of model performance.

To compare FDCs, the study adopts the approach proposed by Yilmaz *et al.* (2008), who developed so-called signature indices derived from FDCs. For specific parts of an FDC, these indices represent the bias between observed and simulated values relative to the observed FDC. They have proven their usefulness in several applications (Herbst *et al.* 2009; Ley *et al.* 2011; Casper *et al.* 2012; Gronz 2013). The indices describe major behavioural functions of a basin: extreme high runoff (FHV), mid-slope of the FDC (FMS) and the low flow (FLV). Gronz (2013) modified the FLV index to prevent misleading indices. A fourth signature index for the high flow between extreme high and medium runoff (FMV) is added in order to consider the whole FDC (Figure 3). This study summarizes the above-listed indices under the term ‘signature indices’.
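To make the idea of a segment-wise bias concrete, the following sketch builds an FDC and computes two of the indices in the spirit of Yilmaz *et al.* (2008): the high-flow volume bias (FHV) and the mid-segment slope bias (FMS). It is a sketch under stated assumptions, not the formulation of Appendix 2: the Weibull plotting position, the nearest-rank lookup, the segment boundaries (exceedance probabilities 0.02, 0.2 and 0.7) and all function names are assumptions of this example, and the FMV index and the FLV modification by Gronz (2013) are not reproduced.

```python
import math


def flow_duration_curve(flows):
    """Return (exceedance probability, flow) pairs, sorted from high to low flow."""
    ordered = sorted(flows, reverse=True)
    n = len(ordered)
    # Assumption: Weibull plotting position, rank / (n + 1).
    return [((rank + 1) / (n + 1), q) for rank, q in enumerate(ordered)]


def flow_at(fdc, p):
    """Flow whose exceedance probability is closest to p (nearest-rank lookup)."""
    return min(fdc, key=lambda pair: abs(pair[0] - p))[1]


def bias_fhv(obs, sim, p_high=0.02):
    """Percent bias in the volume of the high-flow segment (exceedance < p_high).

    Both series need the same length and enough values to populate the segment.
    """
    fdc_obs = flow_duration_curve(obs)
    fdc_sim = flow_duration_curve(sim)
    n_high = sum(1 for p, _ in fdc_obs if p < p_high)
    vol_obs = sum(q for _, q in fdc_obs[:n_high])
    vol_sim = sum(q for _, q in fdc_sim[:n_high])
    return 100.0 * (vol_sim - vol_obs) / vol_obs


def bias_fms(obs, sim, p_upper=0.2, p_lower=0.7):
    """Percent bias in the log-slope of the FDC mid-segment (flows must be > 0)."""
    fdc_obs = flow_duration_curve(obs)
    fdc_sim = flow_duration_curve(sim)
    slope_obs = math.log(flow_at(fdc_obs, p_upper)) - math.log(flow_at(fdc_obs, p_lower))
    slope_sim = math.log(flow_at(fdc_sim, p_upper)) - math.log(flow_at(fdc_sim, p_lower))
    return 100.0 * (slope_sim - slope_obs) / slope_obs


# Hypothetical toy series: the simulation doubles every observed flow.
obs = [float(q) for q in range(1, 10)]
sim = [2.0 * q for q in obs]
print(bias_fhv(obs, sim, p_high=0.25))  # bias in high-flow volume
print(bias_fms(obs, sim))               # bias in mid-segment slope
```

A uniform scaling of the flows shows up as a volume bias in the high-flow segment but leaves the log-slope of the mid-segment essentially unchanged, which is the kind of segment-specific information the signature indices are meant to provide.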

Each of the four signature indices reflects the model's ability to reproduce specific parts of the hydrograph. A combination of the four indices into one value should reflect the ability of the model to reproduce the entire hydrograph. Since the four indices can have different orders of magnitude, an approach is needed that weights them equally when combined. A straightforward method to combine them equally is to standardize the indices and then to sum them. This results in a new performance measure: the standardized signature index sum (SIS).

The SIS is calculated as follows:

(1) Calibration of all model structures on all catchments, and calculation of the four signature indices $x_{a,s,i}$ (where *s* indicates the structure, *i* indicates the catchment, *a* the type of signature index, and *x* its value).

(2) Calculation of the absolute value $|x_{a,s,i}|$ of each signature index (since the sign is irrelevant, the absolute values treat under or overestimation equally).

(3) Calculation of the standard deviation $\sigma_a$ and the mean $\bar{x}_a$ of $|x_{a,s,i}|$ for all *i* and *s*.

(4) Calculation of the standardized values (z-score); Equation (1).

(5) Combining the standardized values; Equation (2).

$$z_{a,s,i} = \frac{|x_{a,s,i}| - \bar{x}_a}{\sigma_a} \tag{1}$$

$$\mathrm{SIS}_{s,i} = \sum_{a} z_{a,s,i} \tag{2}$$

where *a* = signature index (i.e. FHV, FMV, FMS or FLV); *s* = model structure; *i* = basin; $x_{a,s,i}$ = value of a signature index; $\bar{x}_a$ = mean of all values of one signature index *a*; $\sigma_a$ = standard deviation of all values of one signature index *a*.
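The five steps can be sketched as follows. This is a minimal illustration, assuming that the signature indices are already available as signed biases per model structure and basin; the function name, the dictionary layout, the toy values and the use of the population standard deviation are assumptions of this example (the text does not state whether the population or sample form is used).

```python
from statistics import mean, pstdev

INDICES = ("FHV", "FMV", "FMS", "FLV")


def standardized_signature_index_sum(values):
    """Compute the SIS for every (structure, basin) pair.

    values maps (structure, basin) to a dict of signed signature index values.
    """
    sis = {key: 0.0 for key in values}
    for a in INDICES:
        # Step 2: absolute values, so that under- and overestimation count equally.
        abs_vals = {key: abs(v[a]) for key, v in values.items()}
        # Step 3: mean and standard deviation over all structures and basins.
        mu = mean(list(abs_vals.values()))
        sigma = pstdev(list(abs_vals.values()))  # assumption: population form
        for key, x in abs_vals.items():
            # Steps 4 and 5: z-score per index, summed over the four indices.
            sis[key] += (x - mu) / sigma
    return sis


# Hypothetical toy values for two structures on one basin.
toy = {
    ("M01", "A"): {"FHV": -10.0, "FMV": 4.0, "FMS": 2.0, "FLV": -6.0},
    ("M02", "A"): {"FHV": 30.0, "FMV": -8.0, "FMS": 6.0, "FLV": 2.0},
}
sis = standardized_signature_index_sum(toy)
print(sis)
```

Because each index is standardized before summation, an index with large raw values (here FHV) does not dominate the sum. Note that the sketch fails with a division by zero if all absolute values of one index are identical.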

## RESULTS AND DISCUSSION

Concerning model assessment, validation is an important step, since it provides independent information on model consistency (Klemeš 1986; Andréassian *et al.* 2009). A comparison of the calibrated models with the validated models reveals only minor differences in model performance. The analysis therefore uses the simulated results of the calibrated models.

### Identification by classical statistical performance measures and hydrological performance measures

The patterns of model performance calculated with classical statistical or hydrological performance measures are very similar. As an example of model performance on a given catchment, Figure 4 displays the results of the different performance measures for the simulated runoff of the basin ‘Flaumbach’ at the gauging station ‘Kloster Engelport’. The calibrated models based on structures 1, 2 and 8 perform worse than the other models, regardless of the performance measure. The other nine models perform almost equally well. The simulated runoff of the other basins displays a pattern similar to that depicted in Figure 4. Owing to this pattern, the choice of a classical statistical or hydrological performance measure seems to be less important for the identification of a best performing model structure. Therefore, the NSE is chosen for further analysis.

Fenicia *et al.* (2013) as well as van Esse *et al.* (2013) found patterns between model structure and model performance similar to our results. Enlarging the parameter space (to avoid too small parameter boundaries affecting model performance) led to better results for the model structures 9, 10, 11 and 12 than in van Esse *et al.* (2013). Hydrological differences in the respective study areas could contribute to this, but it may well be that limitations in parameter space in the French modelling exercise cause the differences.

Ranking the model structures by their NSE for individual basins determines a best performing model structure for each basin. For most of the basins, several model structures display almost identical values for the NSE (Figure 5). The conceptual differences between the model structures 4, 5 and 6; 9 and 10; and 11 and 12 are minor (Figure 2) and apparently result in very similar NSEs for these model structures. Conceptually different model structures also often display similar NSE values, with differences of less than 0.05. This effect obscures the identification of a decidedly best performing model structure for a single basin.

Only model structure 3 displays a wide range of NSE values, which allows for the identification of basins with a good performance. The performance of model structure 3 is worse for basins with low precipitation or low total runoff coefficients. This is probably caused by the threshold overflow between the two reservoirs, which is a specific characteristic of structure 3. This may give indications about the process representation by this structure, e.g. an indication of a threshold-like response. However, this amounts to a distinction between wet and dry basins, which is a trivial result.

### Identification by signature indices

Figure 6 displays the FDCs (observed and simulated with different structures) for the two gauging stations Weinähr and Wernerseck. These stations have similar NSE values for more than one model, but differ when it comes to their simulated FDCs. This shows that simulations for one basin with different models and with similar NSEs need not have similar hydrographs.

From Figure 6 it is difficult to decide which simulated FDC performs better. Figure 7 displays the values of the signature indices for all basins as box-and-whisker plots. For most of the basins, the four signature indices show clear differences in performance. Except for the models that are based on structure 3, all models underestimate the very high flow (FHV). For the other three signatures, most of the structures perform well. In general, the models of structure 8 have very low values for the NSE, caused by high deviations for the very high flow. However, when it comes to the signature indices, structure 8 shows a good performance for the other parts of the FDC. Analogous to the large variation in NSE values for structure 3, the four indices of the FDC for this structure display a wide range as well.

As for the above-mentioned statistical and hydrological performance measures, most of the simulated FDCs indicate a better agreement for structures 4, 5, 6, 7, 11 and 12 than for the other model structures (Figure 7). Although the NSEs for structures 9 and 10 are good, their simulated FDCs show for most basins a levelled curvature and thus obtain a bad signature index performance. The Steinbach gauging station provides a good example of the different relationships between the NSE and the signature indices of the FDC. Steinbach has NSEs between 0.63 and 0.69 for structures 4, 5, 6, 7, 9, 10, 11 and 12 and poor NSEs of 0.45 and 0.47 for structures 1 and 3. Figure 8 displays the measured and simulated FDCs for each structure in detail, reflecting the index values depicted in Figure 7. However, model structure 3 shows a distinctively better agreement between observed and simulated FDC than the other model structures. Despite the fact that model structures 1 and 3 show similar NSEs, the simulated FDC of model structure 1 is apparently worse than the simulated FDC of model structure 3. The diverging results with the NSE may be caused by a high sensitivity of the NSE to overestimated extreme high discharge.

The SIS (Equation (2)) indicates an overall performance for a single basin. Negative values point to an above-average performance and the lowest value identifies the best performing model. Figure 9 displays the signature indices of the three gauging stations Weinähr, Seelbach and Wernerseck, together with the SIS and the NSE. In contrast to the NSE, the SIS identifies one model as undeniably best performing. As for the NSE, the model structures 4, 5 and 6; 9 and 10; and 11 and 12 often have minor differences between their SIS. In these cases, the simpler model structures (4, 9 and 11, respectively) are set as best performing.

From Figure 9 the following can be observed:

• For the gauging station Weinähr (basin size 215 km²), the model based on structure 7 has the lowest SIS, which is due to a very low divergence from the observed FDC for high and mid runoff and a moderate divergence for low flow. Only the very high flow (FHV) shows a considerable bias, which is weighted lower by standardizing the biases for the SIS.

• For the gauging station Seelbach (basin size 193 km²), the model based on structure 4 has the lowest SIS. The models based on structures 7 and 12 show a slightly better NSE, which results from lower biases for FHV while disregarding better fits for the high and mid parts of the FDC.

• For the gauging station Wernerseck (basin size 242 km²), the model based on structure 12 has the lowest SIS. The difference in NSE between the models based on structures 11 and 12 is 0.001 and is negligible in this context. With the signature indices, however, the differences in performance between these structures become apparent.
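The selection rule described above (lowest SIS, with the simplest structure of a conceptually similar group preferred when the differences are minor) can be sketched as follows. The grouping follows the text (structures 4, 5 and 6; 9 and 10; 11 and 12), but the tolerance that defines a ‘minor’ difference is not given in the text; the value used here, the function name and the toy SIS values are purely hypothetical.

```python
# Groups of conceptually similar structures, simplest structure first.
GROUPS = [("M04", "M05", "M06"), ("M09", "M10"), ("M11", "M12")]


def best_structure(sis_by_structure, tolerance=0.1):
    """Pick the structure with the lowest SIS for one basin.

    If the winner belongs to a group of similar structures and the simplest
    member of that group is within `tolerance` (a hypothetical threshold),
    the simplest member is returned instead.
    """
    best = min(sis_by_structure, key=sis_by_structure.get)
    for group in GROUPS:
        simplest = group[0]
        if best in group and simplest in sis_by_structure:
            if sis_by_structure[simplest] - sis_by_structure[best] <= tolerance:
                return simplest
    return best


# Hypothetical SIS values for one basin.
print(best_structure({"M04": -1.50, "M06": -1.55, "M07": -0.20}))
```

With these toy values, structure 6 has the lowest SIS, but structure 4 lies within the tolerance and is therefore selected as the simpler alternative.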

With respect to the 12 model structures, the simple models 3 and 4 show the best performance based on the SIS for 38% of the basins. These two models differ in that the outflow from the unsaturated reservoir in model 4 is a power function rather than a threshold function as in model 3 (Fenicia *et al.* 2013). The extension of model 4 with a lag function (model 5) and an interception reservoir (model 6) rarely leads to a better performance: model 6 outperforms model 4 only for one basin.

Model 7 is an extension of model 5 with an additional riparian zone reservoir that receives a constant fraction of the total precipitation (Fenicia *et al.* 2013). Although this model performs best in many cases, the performances of its simpler variants (4, 5) are often almost equally good. The same holds for the models 10 and 12 and their less complex counterparts 9 and 11, respectively: the gain in performance for these models (i.e. 10 and 12) is only marginal when compared to the performance of the less complex ones (i.e. 9 and 11). Therefore, the less complex models are preferable as catchment representation.

Although single indices indicate a very good performance for special parts of the FDC, the SIS recognizes the overall performance with a compensation of extreme values and considers equally all parts of the FDC to describe the overall performance. Furthermore, the SIS value allows evaluating the similarity of the performance of different models. This enables us to better differentiate, e.g. between ‘good’ or ‘bad’ performing models. In combination with the single indices, a decision for a best performing model consistent with special aspects of a given research question is now possible.

Many active components of the hydrological cycle occur in the sub-surface (Beven & Freer 2001), which makes them difficult to observe. Thus far, these sub-surface processes have largely been inaccessible to direct human observation, and this has a psychological consequence which Kahneman & Tversky (1982) called ‘the perceptual best bet’. In hydrology this means that experience of above-ground phenomena shapes the expectation of the hidden sub-surface processes (Hellebrand 2010). Basin classification with hydrological modelling requires further research, in which a larger number of model structures need testing to find optimal structures. However, to prevent the conception of ‘perceptual best bet’ models (i.e. models that are based upon our unobserved perception of the sub-surface), it would be of interest to automatically generate model structures by means of genetic programming, which would provide the modeller with new and unthought-of structures (hypotheses) that can be tested.

## CONCLUSIONS

This study compares different performance metrics to assess model performance with a view to catchment classification. If an insufficient number of classical performance measures are used simultaneously, they fail to discriminate between different model structures, providing similar values for seemingly different hydrographs. Signature indices derived from the FDC instead succeed in capturing differences between model results. Although standard hydrological performance measures are suitable to separate well performing from poorly performing models, they show hardly any differentiation among well performing models. Since several of these performance measures are sensitive to specific parts of the hydrograph, a bias between observed and modelled hydrographs for certain flow types (e.g. high flows) can mask a good performance for other parts of the hydrograph.

The four signature indices that calculate biases between the observed FDC and the simulated FDC provide more differentiated results: they clearly identify at which part of the hydrograph these biases occur. The combination of the indices allows a decision for a best performing model that is consistent with specific aspects of a research question. The proposed SIS treats all parts of the FDC equally and makes a reasoned identification of a best performing model for single basins possible.

The use of signature indices for the evaluation of a best performing model structure is a promising way forward for basin classification by means of multiple hydrological model structures. There is clearly a need to expand the range of different types of hydrological model structures as well as to test this approach in different meso-scale research areas.

## ACKNOWLEDGEMENTS

We acknowledge the financial support of the Deutsche Forschungsgemeinschaft (DFG) through grant CA728/5-1. Furthermore, we would like to thank the LUWG, Mainz (D) for providing the data. We also would like to thank both reviewers for their constructive comments.