Evaluation of design flood estimates – a case study for Norway

The aim of this study was to evaluate the predictive fit of probability distributions to annual maximum flood data, and in particular to evaluate (1) which combination of distribution and estimation method gives the best fit and (2) whether the answer to (1) depends on record length. These aims were achieved by assessing the sensitivity to record length of the predictive performance of several probability distributions. A bootstrapping approach was used by resampling (with replacement) record lengths of 30 to 90 years (50 resamples for each record length) from the original record and fitting distributions to these subsamples. Subsequently, the fits were evaluated according to several goodness-of-fit measures and to the variability of the predicted flood quantiles. Our initial hypothesis that shorter records favor two-parameter distributions was not clearly supported. The ordinary moments method was the most stable while providing equivalent goodness-of-fit. This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/). doi: 10.2166/nh.2017.068 om http://iwaponline.com/hr/article-pdf/49/2/450/196222/nh0490450.pdf 2020 Florian Kobierska Kolbjørn Engeland (corresponding author) The Norwegian Water Resources and Energy Directorate, P.O. Box 5091 Majorstua, Oslo NOR-0301, Norway E-mail: koe@nve.no Florian Kobierska Western Norway University of Applied Sciences, Institute of Natural Sciences, Sogndal, Norway Thordis Thorarinsdottir Norwegian Computing Center, P.O. Box 114 Blindern, Oslo NO-0314, Norway


INTRODUCTION
The motivation for this study is the need to revise guidelines for design flood estimation in Norway. According to Norwe- Existing guidelines are given in Midttømme et al. () and Castellarin et al. (), and summarized in Table 1.
The approach is based on using annual maximum floods, and the recommendations depend on the length of the local data record. A minimum of 30 years of local observations is required for local flood frequency analysis and at least 50 years of data should be available to use three-parameter distributions. The Gumbel (two parameters) and generalized extreme value (GEV) (three parameters) are the preferred distributions. More recently, Glad et al. () found the generalized logistic (GL) to be the preferred distribution for annual maximum floods in small catchments.
Other guidelines for flood frequency estimation include the USA (Stedinger & Griffis , ), Australia (Ball Four methods are commonly used to estimate distribution parameters: ordinary moments, linear moments, maximum likelihood (ML), and Bayesian. The method of linear moments has been recommended for its robustness with small sample sizes (Hosking ). In recent years, Bayesian flood frequency estimation has gained increased attention in the research community (e.g., Coles & Tawn ; Gaál et al. ; Gaume et al. ; Renard et al. b), and is recommended in the operational guidelines in Australia (see Chapter 2.6.3 in Ball et al. ). The benefit of the Bayesian method is the flexibility in model formulation, the possibility to include prior and/or regional knowledge in the local estimation, and the possibility to account for errors in rating curves (Ball et al. ).
For many cases the streamflow record is either nonexisting or much shorter than the target return period. In order to predict flood quantiles in ungauged catchments or to reduce the estimation uncertainty for high flood quantiles, three different strategies can be followed (Merz & Blöschl ): (i) use flood data from several locations within a region (e.g., Dalrymple ); (ii) use historical, (e.g., Benson ) and/or paleo-hydrological information (e.g., Benito & O'Connor ); or (iii) use causal information, i.e., by combining precipitation statistics with precipitation-runoff models (e.g., Lawrence et al. ).
The recommendations provided in the national guidelines should preferably be based on systematic evaluations.
A recent example is provided in Kochanek et al. () where local, regional, and local-regional flood frequency analysis, as well as local and regional applications of a simulation approach are systematically compared resulting in recommendations. Renard et al. (a) provide a short review of evaluation frameworks and distinguish between simulation-based and data-based frameworks. In the simulation-based approach, the true distribution is known, and Monte-Carlo-generated samples from the true distribution are used to assess the performance of different distributions and/or parameter estimation methods (e.g., Hosking et al. ). It is especially useful for assessing robustness (e.g., Stedinger & Cohn ) and evaluating the estimates of standard errors (e.g., Cohn et al., ). For data-based approaches, the true distribution is not known, and the aim of the evaluation is to assess if the observations might be realizations of the estimated distribution. Goodness-offit tests combined with split-sample or cross-validation are used in order to assess the predictive performance of the fitted distribution. The goodness-of-fit criteria measure the reliability, i.e., how well the model fits to (independent) data. Renard et al. (a) introduced 'stability' as an additional criterion. It measures the sensitivity of the design flood estimates to different subsets of data. Design flood estimates that depend strongly on the underlying data might lead to re-assessment of the design flood. This can, for example, result in large costs for dam owners as the design of dams has to be re-assessed every 20 years. Stability is therefore an important criterion in order to choose between the most reliable models.
The aim of this study is to perform a systematic evaluation of the predictive performance of local flood frequency distributions and estimation methods applied to annual maximum data. The results will later be used as a foundation for recommendations in new guidelines.
In this study, we wanted to answer the following research questions: 1. Which combination of distribution and estimation method best fits the data? 2. Does the answer to (1) depend on local data availability?
To answer these questions, we set up a test bench for local flood frequency analysis using data-based evaluation

STUDY AREA AND DATA
We used annual maximum floods from 529 streamflow stations of the Norwegian hydrological database 'Hydra II'. We present here a brief summary of the dataset and associated quality control methods, which are described in detail in Engeland et al. (). All data influenced by river regulations were removed. In addition, quality controls of the data including quality assessment by the field hydrologist and of the rating curve for high flows, were used to select flood data with a sufficient quality. For all gauging stations, we extracted a set of catchment properties (for details see Engeland et al. ). Figure 1 shows the histogram for record length, catchment areas, lake percentage, mean annual temperature and precipitation and the rain contribution to floods.

Distributions
We evaluated five probability distributions (Supplementary materials, Table S1)

GEV distribution
The extreme value theorem is also known as the Fisher-Tippett theorem, which says that the maximum value from a sample of independent and identically distributed (iid) random variables follows the GEV distribution (e.g., where m is a location parameter, α a scale parameter, and k a shape parameter. Defined on the region 1 À k(x À m)=α > 0. The mean exists if k > À1.0, and the variance if k > À0.5. The shape parameter k is important in the GEV distribution as it shapes the tail of the distribution.
A negative value indicates a heavy tail, whereas positive values describe a light tail and an upper limit for the variable x.

Gumbel distribution
The Gumbel distribution is a special case of the GEV distribution (shape parameter k ¼ 0) and is written as: where m is a location parameter and α a scale parameter.
This distribution is often recommended for small datasets. Maximum values of random variables, with an exponential like upper tail (e.g. Normal, lognormal, Gamma), will theoretically follow a Gumbel distribution.

Generalized logistic
The GL distribution ( (3): where m is a location parameter, α scale parameter and k a shape parameter. As for the GEV distribution, the GL distribution has an upper bound of k > 0. This is the case only when the skewness is negative whereas for the GEV distribution, there is also an upper bound for positive skewness, i.e., L-skewness <0.17 (Robson & Reed ). Thus, for flood data we could expect the shape parameter to be between À0.5 and 0.2.

Gamma distribution
The gamma distribution is a flexible two-parameter distribution often used in environmental sciences: Here, Γ denotes the complete gamma function and γ the lower incomplete gamma function.

Pearson III
The Pearson type III distributions are given as: where m is a location parameter, α a scale parameter, and k and Australia (Haddad & Rahman ). Prior distributions are given in Reis & Stedinger ().

Fitting methods
Four methods for fitting the distributions to observed data were used: ordinary moments, linear moments, ML and Bayesian estimation.

Ordinary moments (O-moments)
The method of ordinary moments means that the moments (mean, variance, and skewness) are estimated based on the data, and subsequently, the parameters of the selected distribution are calculated based on a theoretical relationship between the moments and the distribution parameters.
Two-parameter distributions need the estimates of mean and standard deviation whereas the three-parameter distributions also require an estimate of the skewness. The specific equations for each distribution used in this study are given in Bezak et al. () and are also provided in Supplementary materials, Table S4 (available online).

Linear moments (L-moments)
The method of linear moments is a popular method in hydrology since it is a direct analog to the method of moments, easy to apply and the parameter estimates are less sensitive to outliers in the data (Hosking ). As for the O-moments, the linear moments are estimated from the data, and subsequently, the parameters of the selected distribution are calculated based on a theoretical relationship between the L-moments and the distribution parameters. The specific equations for each distribution used in this study are given in Hosking (), and are also provided in Supplementary materials, Table S5 (available online).

Maximum likelihood
The ML method chooses the values of the parameters' estimates that maximize the probability of the data sample. This probability is the product of the probability density function evaluated at all observations (with a common parameter set) and is called the likelihood function l(θ|x) of the parameters θ given data x. The objective is to maximize this function.
The likelihood functions are specified in Bezak et al.

Bayesian estimation
Bayes theorem combines the knowledge brought by the prior distribution and the data (through the likelihood) into the posterior distribution of parameters, whose pdf is The Bayesian method might include prior knowledge that could be expert knowledge or regional information In the Results section, we use the mode of the resulting posterior distribution.

Evaluation methods
We followed the evaluation strategy specified by Renard Both the fitted distribution parameters and the return levels were used for evaluation, as described below.

Stability
The stability measure is a property of the statistical model only and we can thus evaluate it for any return period, including those greatly exceeding the length of record.
Here, we evaluated the stability by calculating the coeffi- The KS test evaluates how well an empirical distribution fits to a parametric one. The statistics is based on the maximum distance between the two cumulative distributions and should therefore be as small as possible:

Reliability: evaluation of thresholds
Since the aim of flood frequency analysis is to assess critical design flood, it is relevant to evaluate the fitted distributions according to how well they predict thresholds.
The Brier score (BS) (Brier ) is commonly used for evaluation, and was used in this paper for evaluating the predicted T-years event for flood frequency distributions. The BS compares the predicted probability of the exceedance of a threshold u T,s (given by 1 À F l,s,m u T,s À Á ) to actual exceedance of the threshold by independent data (given by where u T ,s is the threshold defined by a return period T and II is an indicator function that is 1 if x s,i > u T,s and otherwise 0. The quantile score (QS) compares observed floods x s,i to the estimated flood quantile F À1 l,s,m 1 À 1=T ð Þ for a given return period T and gives the difference a low weight if the observed flood is smaller than the estimated quantile: where i is the rank of the observation Q (i) , n is the number of observations, andP 0 (i) is the estimated cumulative probability. According to Stedinger et al. (), the Hazen plotting position is a traditional choice that is least specific to a particular distribution.

Reliability: evaluation of empirical L-moments
The L-moment ratio diagram compares sample estimates of τ 2 , τ 3 , and τ 4 (standard deviation, skewness, and kurtosis) to the theoretical population for parametric distributions by plotting the relationship between τ 4 and τ 3 for three-parameter distributions and between τ 3 and τ 2 for two-parameter distributions. It was introduced by Hosking (), and approximations for several distributions are given in Hosking & Wallis (). The advantage of this evaluation is that we visually compare how several theoretical distributions fit to our data sample, and it has become a standard tool in regional flood frequency analysis (Peel et al. ).

Estimation computational chain and open access to results
Based on the methods presented above, our research approach was highly multi-dimensional and involved saving a great amount of data. For this reason, we chose to save the input and model data into a NetCDF database.
The full computational chain was carried out with the R software (R Core Team )). The following libraries were The L-moments ratios plotted in Figure 9 give a good visual impression of the spread in L-kurtosis and L-skewness across all stations. A moving average of L-skewness along L-kurtosis removes much of the scatter and thus helps analyze the data.

DISCUSSION
The first research question raised in the Introduction sought to determine which combination of distribution and estimation method best fits the data. From the results presented herein, we see that it is difficult to disentangle the performance of the estimation methods from the performance of the distributions, and that the combinations of estimation method and distribution that give the best performance vary between the performance measures. The   interpretation of the results in order to answer the research questions, is therefore, challenging.
From the performance of the reliability criteria, we see that the best estimation methods for the three-parameter distributions perform, in general, equally well or better than the best estimation methods for two-parameter distributions for all record lengths (Figure 8). The gain in using a three-parameter distribution increases with record length. The only exception is the QS, where the Gumbel distribution is equally good as the three-parameter distributions (Figure 8).   Among the three-parameter distributions, the GEV and the GL distributions give the best performance. The GL distribution is better than the GEV distribution for the BS, whereas for the other two scores, the GEV distribution slightly outperforms the GL distribution. The GL distribution seems to be more challenging to estimate than the GEV distribution, since it is rather sensitive to the estimation methods used. Taking into account the stability criterion, the method of moments is most stable with the GL distribution. However, choosing to look only at the L-moments and Bayesian estimators that are the most reliable, we see that the difference in stability between the GEV and GL distributions is small (Figure 7). This indicates a slight preference for the GEV distribution.
Concerning the choice of estimation methods, the ML method should not be used in combination with three-parameter distributions since this combination provides very unstable results ( Figure 7) and is, in some cases, only marginally better than the Bayesian and L-moment approaches (Figures 4-6). The method of moments is the most stable method for all distributions (Figure 7), but it also provides the most unreliable results for several scores (Figures 4-6).
For all three-parameter distributions, either the L-moments or the Bayesian methods is preferred (Figure 8).
An unexpected result is the relatively low performance, as measured by the Brier and QS, when the Bayesian and ML methods are used to fit the data to the Gumbel distribution. In contrast, these two estimation methods perform relatively well for the AD and KS test statistics (Figures 3   and 4). Further investigations revealed that this low performance is, to a large degree, controlled by the skewness of the    Figure 7), we conclude that the convergence rate of our chains is sufficient.
The resampling with replacement approach allowed us to compare all stations with sample sizes longer than 30 years, i.e., resampled records of lengths up to 90 years were created from the original record of 30 years. The benefit of using this approach was that more stations could be included in the evaluation. We used 280 stations, of which, only 35 had record lengths of 90 years or more. The drawback of this approach is that stations with short record lengths will get resampled several times. By grouping stations according to their length of record and plotting the groupaveraged CV of return levels for each group, we saw that (i) the average CV is lowest for the shortest record lengths, and (ii) the spread in CV is largest for the shortest record lengths. An explanation for the second issue is that the resampling approach used here might be sensitive to outliers in the underlying data, as those might be sampled several times for short records. We identified three stations that may exhibit this behavior, but excluding them from the evaluation showed little influence on the average performance.
Another aspect not tackled in this study is the possible non-stationarity of flooding patterns in Norway. The standard approach for addressing non-stationarity related to climate change is to look for a climate factor that describes the expected change in design flood estimates (Lawrence ). The climate factor assumes that the reference design flood is based on a stationarity assumption. The non-stationarity in our case might be linked to (1)  properties. This could be due to the very fragmented hydro-meteorological patterns in Norway. A more in-depth study of those relationships was beyond the scope of this paper and will be investigated in subsequent studies.

CONCLUSIONS
The aim of this study was to evaluate the predictive fit of probability distributions to annual maximum flood data, and in particular to evaluate (1) which combination of distribution and estimation method gives the best fit and (2) whether the answer to (1) depends on record length. These aims were achieved by assessing the sensitivity to record length of the predictive performance of several probability distributions.
A bootstrapping approach was used by resampling (with replacement) record lengths of 30 to 90 years (50 resamples for each record length) from the original records and fitting distributions to these subsamples. Subsequently, the fits were evaluated according to several goodness-of-fit measures and to the variability of the predicted flood quantiles.
Based on the results presented herein we conclude the following: • The GEV and GL distribution provided the most reliable results.
• The method of linear moments or the Bayesian method are the recommended estimation methods.
• The ML method was particularly unstable with three-parameter distributions, even for short return periods. This method should therefore be avoided.
• For the Gumbel distribution, the L-moment approach is recommended. The Bayesian approach was sensitive to the skewness of the data.
• The method of ordinary moments was consistently the most stable estimation method. This stability results in a light but consistent trade-off on goodness-of-fit against the method of linear moments.
• There is no clear threshold in record length above in which one should use a three-parameter distribution rather than a two-parameter distribution.
• We focused on developing a reproducible workflow so that the methodology can be reused and improved as more data become available.
The results herein show that the use of the GEV or the GL distribution is challenging since, in particular, the shape parameter is sensitive to the underlying data resulting in more unstable results. Alternative approaches such as using a mixture of two-parameter distributions, should therefore be investigated.