Abstract
Geophysical processes are often characterized by long-term persistence. An important characteristic of such behaviour is the induced large statistical bias, i.e. the deviation of a statistical characteristic from its theoretical value. Here, we examine the most probable value (i.e. mode) of the estimator of variance to adjust the model for statistical bias. Particularly, we conduct an extensive Monte Carlo analysis based on the climacogram (i.e. variance of the average process vs. scale) of the simple scaling (Gaussian Hurst-Kolmogorov) process, and we show that its classical estimator is highly skewed especially in large scales. We observe that the mode of the climacogram estimator can be well approximated by its lower quartile (25% quantile). To derive an easy-to-fit empirical expression for the mode, we assume that the climacogram estimator follows a gamma distribution, an assumption strictly valid for Gaussian white noise processes. The results suggest that when a single timeseries is available, it is advantageous to estimate the Hurst parameter using the mode estimator rather than the expected one. Finally, it is discussed that while the proposed model for mode bias works well for Gaussian processes, for higher accuracy and non-Gaussian processes, one should perform a Monte Carlo simulation following an explicit generation algorithm.
INTRODUCTION
An important attribute characterizing geophysical processes is the high spatio-temporal dependence, in the sense that a random variable of such a process at a specific time or location strongly depends on several (even infinite) past, or of different location, random variables of the same process. This type of dependence requires long samples for its identification, which is a rare case in most natural processes, and thus, for the estimation of its parameters, it is advised to use only up to the second-order statistics (Lombardo et al. 2014) and only in cases where very long samples are available to expand to higher orders. The above issues are further highlighted in Dimitriadis (2017), where several (overall thirteen) such processes with various lengths and physical properties expanding from small-scale turbulence to large-scale hydrometeorological processes are analyzed in terms of their long-term behaviour using massive databases and unbiased estimators of the second-order dependence structure. Interestingly, all the examined processes exhibited long-term persistence, otherwise known as Hurst-Kolmogorov (HK) behaviour (coined by Koutsoyiannis & Cohn (2008)), i.e. power-law decay of the autocorrelation function with lag (for a literature review on long-term persistent processes in hydrometeorology, see also O'Connell et al. (2016)). Additionally, Koutsoyiannis (2011) provided a theoretical justification of the HK behaviour in geophysical processes, showing that it is linked to the second law of thermodynamics (i.e. entropy extremization), and specifically, the stronger the persistence of the dependence structure of a process, the higher the entropy of the process at large scales.
The identification of the dependence structure of a process can be highly affected by the sample uncertainty and statistical bias where the true statistical properties (mean, variance etc.) of a statistic (e.g. variance) of a stochastic process may differ from the one estimated from a series with finite length. The deviations of the statistical characteristics from their true values should be taken into account not only for the marginal characteristics but also for the dependence structure of the process. Therefore, to correctly adjust the stochastic model to the observed series of the physical process, we should account for the bias effect since all series are of finite (and often short) lengths.
The second-order properties can be similarly assessed by common stochastic tools such as the autocovariance function (a function of lag), power spectrum (a function of frequency), and variation of statistics (e.g. variance) of the averaged process vs. scale, a tool known as climacogram (Koutsoyiannis 2010). It is shown that the latter estimator of the second-order dependence structure, as compared to the other two metrics, encompasses additional advantages in stochastic model building and interpretation from data; for example, it is characterized by smaller statistical uncertainty and easier to handle expressions of the statistical bias (Dimitriadis & Koutsoyiannis 2015). Therefore, it is advisable that the sample uncertainty of the second-order dependence structure be tackled with the estimator with the lower variation, such as the climacogram. When multiple sample realizations (i.e. recorded series) are known, the handling of the statistical bias arising from a selected stochastic model may be based on the unbiased estimator of the expected value of the climacogram (Dimitriadis & Koutsoyiannis 2018). However, when a single data series of observations is available for the model fitting (which is the case when geophysical processes are studied), it would be interesting to examine the mode of the climacogram, instead of the expected value; the two may differ in case of strong HK behaviour. This estimator is equivalent to a maximum-likelihood estimator (e.g. Kendziorski et al. 1999) for processes with zero (i.e. white noise) or short-term (e.g. Markov) dependence structure, while here we further extend it for HK processes (see also the work of Tyralis & Koutsoyiannis (2011) for the expectation of the climacogram). It is noted that while the climacogram is often based on the second central moment (i.e. variance), other types of moments (e.g. raw, L-moments or K-moments; Koutsoyiannis 2019) can be used to measure fluctuation in scale, and here, we focus on the central second-order climacogram (i.e. fluctuation measured by variance vs. scale).
METHODS
In this section, we present the applied methods, namely the climacogram estimator, the statistical bias expressions for the mode and expected values of the estimator and the algorithm for the stochastic synthesis of the Gaussian HK process for the Monte Carlo analysis.
The climacogram
The analysis of a process through the variance of the averaged process vs. scale has been thoroughly applied in stochastic processes (e.g. Papoulis 1991; Vanmarcke 2010). However, its importance to the analysis of the second-order dependence structure is highlighted mainly by more recent works (see a historical review in Koutsoyiannis (2018)). Also, the simple name climacogram allowed its further understanding through visualization; indeed, the term originates from the Greek climax (meaning scale) and gramma (meaning written; cf. the terms autocorrelogram for the autocorrelation, scaleogram for the power spectrum and wavelets).
It has been shown that the climacogram, treated as an estimator (rather than just a tool for the identification of long-term behaviour of the second-order dependence structure), has additional advantages from the more widely applied estimators of the autocovariance and power spectrum (Dimitriadis & Koutsoyiannis 2015). Namely, the climacogram could provide a more direct, easy, and accurate means to make diagnoses from data and build stochastic models in comparison to the power spectrum and autocovariance. For example, the climacogram, compared to other tools, has the lowest standardized estimation error for processes with short- and long-term persistence, zero discretization error for averaged processes, simple and analytical expression for the statistical bias, always positive values, well-defined and usually monotonic behaviour, smallest fluctuation of skewness on small scales while closest to zero skewness in larger scales, and mode closest to the expected (i.e. mean) value in large scales. Also, the climacogram is directly linked to the entropy production of a process (Koutsoyiannis 2011, 2016). Furthermore, the climacogram expands the notion of variance by making it a function of time scale and is per se further expandable for statistics different from the central estimators of fluctuation (e.g. second raw moment and second L-moment vs. scale; Koutsoyiannis 2019) for different characteristics of the estimator (e.g. mode and median) and even for moments of higher (e.g. third and fourth) orders (Dimitriadis & Koutsoyiannis 2018). Recently, Koutsoyiannis (2019) extended the notion of climacogram for orders higher than two and showed how to substitute the joint moments of a process, allowing in this manner to tackle some limitations of the latter such as the discretization effect and statistical bias.
The Gaussian long-term persistent process and its stochastic synthesis
H . | p1 . | p2 . | Maximum error (standardized) . |
---|---|---|---|
0.51 | 0.022 | 17.122 | 0.001 |
0.60 | 0.091 | 12.607 | 0.006 |
0.70 | 0.124 | 13.317 | 0.009 |
0.80 | 0.394 | 12.356 | 0.009 |
0.90 | 0.395 | 14.708 | 0.005 |
0.99 | 0.548 | 19.465 | 0.001 |
H . | p1 . | p2 . | Maximum error (standardized) . |
---|---|---|---|
0.51 | 0.022 | 17.122 | 0.001 |
0.60 | 0.091 | 12.607 | 0.006 |
0.70 | 0.124 | 13.317 | 0.009 |
0.80 | 0.394 | 12.356 | 0.009 |
0.90 | 0.395 | 14.708 | 0.005 |
0.99 | 0.548 | 19.465 | 0.001 |
The mode of climacogram estimator and its statistical bias
Since the above estimator is a random variable, it has a marginal distribution (see an illustration in Figure 1). The true value of a statistical characteristic (e.g. variance) of a stochastic model may differ from the one estimated from a series with finite length. To correctly adjust the stochastic model to the observed series of the physical process, one should account for the bias effect. An important question is how the statistical bias is generally handled through the second-order dependence structure in case of long-term persistent processes. Particularly, the selected stochastic model should be adjusted for bias before it is fitted to the sample dependence structure. It is noted that neglecting the bias effect in case of a long-term persistent process leads to underestimations of the stochastic model parameters such as the Hurst parameter and to erroneous conclusions. An adjustment of the models for bias is usually done by equating the observed dependence structure to the expected value of the applied estimator. The alternative studied here is the mode, instead of the expected value, of the dependence structure, which represents the most probable value (and thus, the most expected) of the variance estimator at each scale.
For a Gaussian white noise process of length n and variance , the distribution of its sample variance follows the gamma distribution (Cochran 1934). The averaged process at scale κ, with a sample length of n/κ and variance , follows , with for , or else 0. Hence, for , we have that , and , i.e. zero bias. However, for long-term persistent processes, the mode bias is non-zero and its analytical solution is no longer easy to derive.
From the above results, it becomes evident that the statistical bias always depends on the selected model and not on the data as commonly thought. For example, consider the Gaussian HK process in the previous section with an autocorrelation function in discrete time , where is the discrete-time lag. The bias of the autocorrelation is similarly defined as , and thus depends on the model parameter H. It is noted that the above apply even to the so-called non-parametric models, since they also involve estimation from data, and thus, these models should be similarly adjusted for statistical bias to avoid underestimation of the process variability during a Monte Carlo simulation.
For simplicity and without loss of generality, we set Δ = 1 for the rest of the analysis. It is evident that or else , since the sample variance is positively skewed, i.e. and the equality holds when , where the variance of the sample variance is zero for an ergodic process. A preliminary analysis of common HK-type processes has shown that the mode climacogram is close to the low quartile (25% quantile) of the marginal distribution of variance at each scale (Dimitriadis et al. 2016c; Gournary 2017). Therefore, when the mode of the variance estimator is of interest, we may use a Monte Carlo technique (as described in the next section) to accurately estimate the mode bias or, in case the marginal distribution of the climacogram is known, to calculate the 25% quantile at each scale to approximate the mode bias.
MONTE CARLO ANALYSIS FOR THE MODE OF THE VARIANCE ESTIMATOR
We perform Monte Carlo experiments over the N(0,1)-HK model for a wide range of Hurst parameters H (i.e. 0.5 to 0.95) and for a wide range of series lengths n (i.e. 20–2,000). Specifically, we produce a number (N) of synthetic series through the SAR model described in the section ‘The Gaussian long-term persistent process and its stochastic synthesis’, where N depends on the sample mean value to reach the expected one at scale κ = n/10 based on the rule of thumb when using the climacogram as shown in Dimitriadis & Koutsoyiannis (2015). We found that for , the standardized error between the theoretical expected value and the sample one (Equation (5)) is lower than 1% at scale κ = n/10. In this way, the mode is expected also to be well preserved with a similar error. However, caution should be given to the selection of the sample mode estimator to ensure that its variance permits a robust estimation of the true value of the mode. Since the distribution function of the estimator of variance is unknown for long-term persistent processes and given that the mode value is the most likely to occur within the sample, we calculate the sample mode from each simulated series by finding the most probable value with an accuracy of two decimal digits. Specifically, we round up each value of the time series, and for each scale, to the second decimal digit, and we estimate the most probable value of the rounded time series (for higher accuracies a larger N was required). Also, other estimators for the sample mode (e.g. Bickel & Fruwirth 2006) could be used and compared to the proposed one in future research to optimize the performance of the analysis.
Here, to derive an easy-to-fit empirical expression to approximate the mode bias, we adopt the assumption that the above distribution is nearly gamma for smaller scales (see also a similar analysis in Gournary (2017) and Dimitriadis et al. (2018)). Using the results from the Monte Carlo analysis, we then evaluate the parameter of the gamma distribution for each H, n, and κ, and we build a model for the mode, then later test its performance. Although the true autocorrelation function of the averaged process for a long-term persistent process does not vary with scale, the sample autocorrelation will be also prone to bias (e.g. Dimitriadis & Koutsoyiannis 2015) affecting the distribution function of the sample variance at each scale. To minimize the sample error for the fitting of the two-parameter gamma distribution, we use the theoretical expression for the expected value of the sample climacogram, i.e. , and the variance of the sample climacogram, i.e. , as evaluated from the Monte Carlo analysis, which exhibits the lowest variability in estimation among the four central moments (Dimitriadis & Koutsoyiannis 2018; Figure 2). Based on these two measures, we estimate the two parameters of the gamma distribution.
We first set the scale parameter of the gamma distribution such as to simulate the sample ratio of the aforementioned parameters, i.e. and so, the shape parameter can be also estimated as .
It is noted that based on the above assumptions, the standard deviation, and the skewness and excess kurtosis coefficients of the climacogram estimator can be estimated as , , and , respectively. Since all the above measures will be larger than those in case of a white noise process.
The above expression can approximate the mode by an absolute difference of 0.005 from the Monte Carlo estimates, while for better approximations it is advised to implement a new Monte Carlo analysis (see also discussion and application in the section ‘Applications to annual streamflow’). Interestingly, the standardized error between the mode and expected values of the estimator, i.e. , is calculated from the Monte Carlo analysis to reach a maximum value of 67% corresponding to cases with H ≥ 0.6 and n/κ ≤ 10, while for the white noise process it can be theoretically estimated as ɛ = 2/, which for κ = n/10 is approximately 20%.
APPLICATIONS TO ANNUAL STREAMFLOW
For illustrations of possible implications of the above results, we apply a stochastic analysis based on the expected and the mode values of the climacogram to a streamflow process at the Peneios river (Thessaly, Greece), where a historical streamflow annual time series is available at the upstream station of Ali Efenti with only a 13-year length (for more information on the study area, see Dimitriadis et al. (2016b)). For the identification of the stochastic model, we adjust for statistical bias and, in particular, we fit the mode of the estimator rather than its expectation. It is noted that the proposed empirical model for the mode bias (Equation (10)) is derived from a Monte Carlo analysis for sample lengths of n ≥ 20, and so for this application, we perform a new Monte Carlo analysis to fit the observed climacogram for scales 1 ≤ κ ≤ n/10 (rule of thumb; Dimitriadis & Koutsoyiannis 2015) and so here, for the first two scales (Figure 3). We find that an HK model can adequately simulate the observed standardized climacogram, i.e. , with H= 0.9. We also estimate the Hurst parameter with the expectation of the estimator, and we find H′ ≈ 0.8 and H′′ ≈ 0.7, with or without adjusting for bias, respectively. Evidently, both latter values underestimate the long-term persistence behaviour (Figure 3).
It is noted that the dependence structure of a process (e.g. streamflow) will have a small effect at the risk imposed by the expected number of peaks over threshold (e.g. for the design of a dam or for flood risk mapping) as compared to the effect of the marginal distribution of the process (Volpi et al. 2015; Serinaldi & Kilsby 2018). However, the dependence structure will have a great effect (especially for processes with long-term behaviour) at the duration of successive peaks over threshold (e.g. maximum duration of wet/dry periods or of flood inundation), which may highly affect urban as well as agricultural areas and insurance policies (e.g. Serinaldi & Kilsby 2016; Goulianou et al. 2019). To illustrate this, we generate an adequate number N (see the section ‘Monte Carlo analysis for the mode of the variance estimator’) of HK synthetic timeseries with H = 0.5 (N= 5 × 103), H = 0.7 (N= 4 × 104), H = 0.8 (N= 105), and H = 0.9 (N= 3 × 105). For convenience, we assume an N(0,1) distribution for all processes. We then estimate the expected frequency of the number of peaks over various thresholds (PoT) as well as the expected frequency of the maximum duration of successive peaks over various thresholds (MdT), and we standardize them with the PoT and MdT values of the white noise process (Figure 4). We find that the MdT varies with threshold and long-term persistence, while the PoT stays almost unaffected by both. Additional analyses and quantifications on the reflection of long-term term persistence in terms of clustering in time can be found in Iliopoulou & Koutsoyiannis (2019).
The results from this study suggest that the sample estimator of the variance can be skewed even for long samples in the presence of long-term persistence behaviour as opposed to the white noise process. Therefore, the mode is different from the expectation and more suitable to use in estimation. We propose that when a single recorded series is available and a Gaussian HK process is fitted with small sample size and relatively high Hurst parameter, it is advantageous to employ the mode of the estimator as calculated from the empirical model of Equation (10), rather than its expectation (Equation (5)), so as to avoid underestimation of the Hurst parameter (and thus, the uncertainty of the process). In case of a non-Gaussian distribution, larger accuracy, or a different estimator of the second-order dependence structure (e.g. other climacogram estimator, autocovariance, power spectrum, variogram etc.), we should employ the Monte Carlo technique and test whether the mode of the estimator used is close enough to its expected value. If this is true, then the expected value can be used to adjust the model for bias, whereas if the two values vary then the model should be adjusted for bias based on the mode estimator. For Monte Carlo analysis of a non-Gaussian-correlated process, an explicit algorithm should be preferred (Dimitriadis & Koutsoyiannis 2018) since the mode value is expected to highly depend on higher-order moments in case of long-term persistent processes.
CONCLUSIONS AND DISCUSSION
Awareness of uncertainty in assessing the dependence structure of a process is of paramount importance as it may critically affect the interpretation of results. Estimation uncertainty may introduce large statistical bias, which can be additionally magnified in the presence of long-term persistence (Dimitriadis & Koutsoyiannis 2015). In addition, if the uncertainty is underestimated, then a regular cluster of events could be erroneously regarded as an extreme cluster. Although the mode of the examined classical estimator for variance is close to its expectation for small Hurst parameters and large lengths, we show that for larger values of the Hurst parameter and small sample lengths, equating the expected climacogram to the observed one may lead to underestimation of the long-term persistence and thus the uncertainty of the process.
We propose that when the available series have short lengths or when the empirical Hurst parameter is estimated larger than 0.5, we should always account for statistical bias. Particularly for the bias adaptation, when information is available on only a single series/realization of the process, it is advantageous to equate the mode instead of the expectation of the climacogram estimator to the sample values. Interestingly, in case of an N(0,1)-HK process, the absolute difference between the mode and expected values of the estimator is calculated (from a Monte Carlo analysis performed in this study) to reach a maximum value of 67% of the expected value, corresponding to cases with H ≥ 0.6 and n/κ ≤ 10, while for the white noise process the value is approximately 20% for . In cases of different stochastic processes or estimators or when a larger accuracy of the mode bias is of interest, one should employ a Monte Carlo technique through an explicit generation algorithm (Dimitriadis & Koutsoyiannis 2018) to estimate the mode climacogram estimator or use the lower quartile (25% quantile) of the estimator (in case its distribution is known) as an approximation.
From the Monte Carlo analysis performed in this study, it is also observed that for an N(0,1)-HK process with variance and for large n and small n/, the distribution of the climacogram estimator tends to that of , with a mean value equal to , i.e. zero bias. However, given the estimation uncertainty present in records exhibiting persistence, the autocorrelation of the averaged process is independent of the scale, and thus, the above distribution will never be truly reached. The underestimation of the persistence of the parent process also has critical implications for the estimation of the properties of its extremes, since it was shown that the maximum duration of successive peaks over threshold is greatly affected by the degree of dependence. Additional analyses and quantifications on the reflection of long-term term persistence in terms of clustering in time can be found in Iliopoulou & Koutsoyiannis (2019).
A final remark for discussion, considering the etymology of the terms, is that the expected value of a random process is less expected to occur than its mode (i.e. most probable value; a term coined by Pearson (1895, p. 345)), where the two coincide only in symmetrical distributions. Therefore, when only one value is known (here, only one realization of the climacogram estimator), it is more accurate to fit the model and evaluate the Hurst parameter based on the proposed mode estimator rather than the expected one.
ACKNOWLEDGEMENT
The authors would like to thank the editor Luigi Berardi for handling the paper, one anonymous reviewer for useful comments, and Federico Lombardo for his fruitful discussion, comments, and suggestions that helped us improve the paper.
CODE AVAILABILITY
The MATLAB script for the SAR generation algorithm is available as well as the script for a fast estimation algorithm of the sample climacogram in very long timeseries and in many scales.