Estimation of flood frequency using statistical method: Mahanadi River basin, India

Estimating stream flow has substantial financial importance, because it assists water resources management and provides protection against water scarcity and possible flood damage. Four common statistical methods, namely, Normal, Gumbel max, Log-Pearson III (LP III), and Gen. extreme value, are employed to forecast stream flow for return periods of 10, 20, 30, 35, 40, 50, 60, 70, 75, 100, and 150 years. Monthly flow data from four stations on the Mahanadi River, in Eastern Central India, namely, Rampur, Sundargarh, Jondhra, and Basantpur, are used in the study. Results show that Gumbel max gives better flow discharge values than the Normal, LP III, and Gen. extreme value methods for all four gauge stations. Estimated flood values at the 150-year return period for Rampur, Sundargarh, Jondhra, and Basantpur stations are 372.361 m³/sec, 530.415 m³/sec, 2,133.888 m³/sec, and 3,836.22 m³/sec, respectively, considering Gumbel max. The goodness of fit of the four statistical distributions applied in the present study is also evaluated using Kolmogorov–Smirnov, Anderson–Darling, and Chi-squared tests at significance level 0.05 for the four gauge stations. Goodness-of-fit test results show that Gen. extreme value gives the best results at Rampur, Sundargarh, and Jondhra gauge stations followed by LP III, whereas LP III is the best fit for Basantpur, followed by Gen. extreme value.

INTRODUCTION
Consistent and precise stream flow forecasting is needed for numerous purposes such as water resources planning, policy development, and operation and maintenance activities. In water management, high-quality stream flow forecasts and effective use of these forecasts provide substantial economic and social benefits. For the hydrologic component, both short-term and long-term stream flow forecasts are required for optimizing systems or for planning future expansion or reduction. Short-term forecasting denotes hourly or daily forecasting, which is vital for flood warning and safety, whereas long-term forecasting is based on monthly, seasonal, or annual timescales and is very beneficial in reservoir operations and irrigation management decisions such as allocating water to downstream users, scheduling releases, drought mitigation, and managing river treaties or implementing compact compliance. Masmoudi & Habaieb (1993) developed seven statistical routing models, which were used on the Medjerdah River (Tunisia) to forecast dangerous flood events. Model performance was described by statistical measures of accuracy, peak error, and peak delay between the measured and predicted flows, together with their variations. Evensen (1994) discussed a novel sequential data assimilation technique based on forecasting error statistics using Monte Carlo methods, which served as a superior alternative to solving the traditional and computationally very demanding approximate error covariance equation used in the extended Kalman filter. Bartholmes & Todini (2005) studied the possibility of extending flood forecasting lead times up to 10 days by using a combination of advanced meteorological and hydrological models, and presented results of the coupled approach between a numerical weather prediction system and a rainfall-runoff model. Griffis & Stedinger (2007) explored features of the LP III distribution in real and log space.
Comparisons with summaries of US flood data revealed that the LP III distribution offers a reasonable model of the annual flood series distribution from unregulated catchments for log space skews. Moreover, L-moment ratio relationships were established for the LP III distribution so that they could be compared with regional statistics. Rowinski et al. (2002) discussed two probability density functions common in hydrological studies, i.e., Log-Gumbel and Log-Logistic, with regard to their applicability to hydrological data and the problems arising from their mathematical properties. The maximum likelihood method ensures convergence of the estimators outside the region where the two L-moments exist. Rath et al. (2018) employed the autoregressive integrated moving average (ARIMA) model to predict flow discharge in the Mahanadi River basin. Helsel & Hirsch (1992) discussed probabilistic approaches commonly used in hydrology. The Gumbel max value and LP III distributions are considered to be among the most widely used probabilistic models for solving water resources problems. Kamal et al. (2017) applied statistical distributions to discharge data for two locations and found that Log-normal is applicable for Haridwar and Gumbel EV1 for Garhmukteshwar. Once an appropriate distribution has been found for a region, it helps in predicting discharge for a given return period. Brandimarte & Di Baldassarre (2012) proposed another method, based on the use of uncertain flood profiles, to estimate uncertainty in hydraulic modeling and flood frequency analysis (FFA), in which the major sources of uncertainty are explicitly examined. Ewemoje & Ewemooje (2011) investigated Normal, Lognormal, and LP III distributions to model at-site annual peak flood flow in the Ogun-Oshun River, Nigeria. Chen et al. (2012) analyzed the risk of flooding resulting from the occurrence of floods, taking into consideration flood magnitude and time of occurrence, by applying LP III and mixed von Mises distributions.
Mukherjee (2013) developed a mathematical model relating peak flood discharge and return period using the generalized extreme value (GEV) distribution. Bezak et al. (2014) explored the influence of the threshold value in the peaks-over-threshold method on FFA results, compared different statistical distribution functions, and evaluated three parameter estimation techniques. Haddad & Rahman (2011a) investigated the applicability of the quantile regression method as a feasible regional FFA technique for New South Wales, Australia. Haddad & Rahman (2011b) examined flood data from Tasmania, Australia considering a range of model selection criteria: Akaike information criterion (AIC), AIC-second order variant, Bayesian IC, and a modified ADC. Results obtained by Monte Carlo simulation show that ADC is better at correctly identifying the parent distribution. Grimaldi & Serinaldi (2006) modeled the trivariate joint distribution of flood peak, volume, and duration using a class of copulas called asymmetric Archimedean copulas. Hirabayashi et al. (2013) presented global flood risk for this century based on the outputs of climate models and employed a state-of-the-art global river routing model with an inundation scheme to compute river discharge and inundation area. Haddad & Rahman (2012) proposed a model using Bayesian generalized least squares regression in a region-of-influence framework for regional flood frequency analysis (RFFA) of ungauged watersheds in eastern Australia. Yue (2001) investigated the applicability of a bivariate gamma model with five parameters to describe the joint probability behavior of multivariate flood events. Reis & Stedinger (2005) explored Bayesian Markov chain Monte Carlo techniques to evaluate the posterior distributions of flood magnitude, flood risk, and parameters of the Log-normal and LP III distributions.
Subyani (2011) quantified the hydrogeological characteristics and probability of flood occurrence of several main valleys in western Saudi Arabia by applying GEV and LP III distributions to peak daily precipitation data. Sraj et al. (2015) examined 58 flood events at Litija station on the Sava River, Slovenia, applying different bivariate copulas and comparing them using various statistical, graphical, and upper tail dependence tests. Merz & Thieken (2005) explored the difference between natural and epistemic uncertainty in FFA. Ouarda et al. (2001) proposed a clear theoretical framework for the application of canonical correlations in RFFA using data from 106 stations in Ontario province, Canada. Micevski et al. (2015) presented an alternative RFFA technique that is particularly valuable when sufficiently homogeneous regions cannot be identified on the basis of region of influence. Sahoo et al. (2020) studied bivariate low flow frequency analysis of the Mahanadi basin, which has major variations in hydrological behavior from upstream to downstream, for two main low flow characteristics. Parhi (2018) estimated peak floods on the Mahanadi River at the Hirakud dam and Naraj for recurrence intervals of up to 100 years using HEC-RAS and Gumbel's distribution. Pawar & Hire (2018) applied the LP III distribution to flood data from four locations on the Mahi River and studied peak stream flow frequency and magnitude. Lima et al. (2016) estimated local and regional GEV distributions for flood frequency analysis of the Rio Doce basin, Brazil in a multilevel, hierarchical Bayesian framework, to explicitly model and reduce uncertainties. In this study, various statistical methods are applied to estimate flow discharge at four gauge stations in the Mahanadi River basin, India. Goodness-of-fit tests are also applied to analyze the data sets. Flow discharge is also calculated for various confidence limits (up to 95%) and discussed.

STUDY AREA
Mahanadi (Figure 1) is a major interstate east-flowing river in peninsular India. The river length from its origin to its confluence with the Bay of Bengal is 851 km, of which 357 km is in Chhattisgarh and the remaining 494 km is in Odisha. Geographical and hydrological details of the four gauging stations considered in this research, Rampur, Sundargarh, Jondhra, and Basantpur, are shown in Table 1.

METHODOLOGY

Generalized extreme value
The generalized extreme value (GEV) distribution is a continuous probability distribution developed within extreme value theory. It combines the Gumbel, Fréchet, and Weibull extreme value distributions and is the limiting distribution of the normalized maxima of a sequence of independent and identically distributed random variables. GEV is used as an approximation for modeling the maxima of long (finite) sequences of random variables. Notably, when this distribution is used, the upper bound is unknown and hence has to be estimated; when Weibull is applied, the lower bound is known to be zero.
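The frequency factor ($K_t$) for the GEV distribution for a return period $T$ is:

$$K_t = -\frac{\sqrt{6}}{\pi}\left\{0.5772 + \ln\left[\ln\left(\frac{T}{T-1}\right)\right]\right\} \quad (1)$$

To express $T$ in terms of $K_t$:

$$T = \frac{1}{1 - \exp\left\{-\exp\left[-\left(0.5772 + \frac{\pi K_t}{\sqrt{6}}\right)\right]\right\}} \quad (2)$$

Predicted discharge ($Q_p$) is calculated for the different return periods from the mean ($\mu$) and standard deviation ($\sigma$) of the observed discharges, and expressed as:

$$Q_p = \mu + K_t \sigma \quad (3)$$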
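As a rough illustration of the frequency factor relation above, the following Python sketch evaluates $K_t$ for a given return period, recovers $T$ from $K_t$ as in equation (2), and applies a linear relation for the predicted discharge. The closed form for $K_t$ is obtained here by inverting equation (2); the linear form $Q_p = \mu + K_t \sigma$, the function names, and the mean and standard deviation are assumptions made for illustration and are not values from the study.

```python
import math

EULER_CONSTANT = 0.5772  # constant appearing in equation (2)
SQRT6_OVER_PI = math.sqrt(6.0) / math.pi


def frequency_factor(T):
    """Frequency factor K_t for a return period T (years, T > 1)."""
    return -SQRT6_OVER_PI * (EULER_CONSTANT + math.log(math.log(T / (T - 1.0))))


def return_period(K_t):
    """Return period T recovered from K_t, as in equation (2)."""
    inner = math.exp(-(EULER_CONSTANT + K_t / SQRT6_OVER_PI))
    return 1.0 / (1.0 - math.exp(-inner))


def predicted_discharge(mean_q, std_q, K_t):
    """Assumed linear form: Q_p = mean + K_t * standard deviation."""
    return mean_q + K_t * std_q


# Hypothetical sample statistics (m^3/sec), not taken from the study.
MEAN_Q, STD_Q = 100.0, 50.0
for T in (10, 50, 100, 150):
    K = frequency_factor(T)
    print(T, round(K, 4), round(predicted_discharge(MEAN_Q, STD_Q, K), 2))
```

Because `return_period` is the exact inverse of `frequency_factor`, a round trip through both functions is a quick consistency check on the two expressions.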

Normal distribution
In statistics, the normal distribution is a distribution in which the data are characterized by a bell-shaped curve. The shape and location of the curve are determined by the mean and standard deviation. As many natural phenomena follow it, it is a highly significant probability distribution in statistics. This distribution describes how the values of a variable are dispersed. Because it is symmetric, most observations cluster around a central peak, and the probability of values tapers off equally in both directions away from the mean. The arithmetic mean of a sample $x_1, x_2, \ldots, x_n$, typically represented by $\mu$, is the sum of the sampled values divided by the number of items ($n$):

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$

For the required return period ($T$), the probability factor ($P$) is evaluated in percentage. The conversion formula used to evaluate the probability is given as:

$$P = \frac{1}{T} \times 100$$

From the standard normal distribution table, by interpolation, the frequency factor ($K_t$) is computed for the different return periods, where the frequency factor equals the standard normal deviate ($z$). Finally, the predicted discharge ($Q_p$) is found using the standard normal distribution formula for the different return periods for the respective seasons:

$$Q_p = \mu + K_t \sigma$$

where $\sigma$ is the standard deviation of the sample.

Gumbel max
The Gumbel distribution is a statistical distribution that originates from extreme value theory. The distribution is unbounded on both sides, which can lead to negative calculated flows. It represents the distribution of extreme values, either the maximum or the minimum, of samples drawn from various distributions, and is used for modeling the distribution of peak levels. It is used for predicting earthquakes, floods, and other natural hazards. It is also used to model operational risk in risk management and the life of products that wear out rapidly after a certain age. For the required return period ($T$), the reduced variate ($Y_t$) is assessed using the formula:

$$Y_t = -\ln\left[\ln\left(\frac{T}{T-1}\right)\right]$$

The reduced mean ($Y_n$) and reduced standard deviation ($S_n$) are obtained from the Gumbel distribution table for the given sample size ($N$). Then the frequency factor is estimated using the formula:

$$K_t = \frac{Y_t - Y_n}{S_n}$$

Thus, the predicted discharge ($Q_p$) is computed, using the same formula as for the Normal distribution, for the different return periods for the respective seasons:

$$Q_p = \mu + K_t \sigma$$

Log-Pearson III
LP III is a statistical method of fitting frequency distribution data to predict the flood at a given site on a river. The frequency distribution is built after calculating the statistical information for the flood data at the river site. The probabilities of floods of various magnitudes can then be extracted from the curve. This method helps in extrapolating values for events with return periods well beyond the observed flood events. After finding the actual discharges ($Q$), we calculate their natural logarithms ($Z$) and find the standard logarithmic mean ($\mu$) and standard logarithmic deviation ($\sigma$) of the calculated values for the respective seasons:

$$Z = \ln Q$$

Then the coefficient of skewness ($C_s$) is calculated using the logarithmic discharges ($Z$), and for the required return period ($T$) the probability ($P$) is calculated in percentage, as per the formula:

$$P = \frac{1}{T} \times 100$$

From the standard normal distribution table, by interpolation, we calculate the standard normal deviate ($z$). The frequency factor depends on the coefficient of skewness and the return period. When $C_s = 0$, the frequency factor is equal to the standard normal deviate $z$ and is calculated as in the case of the Normal distribution. When $C_s \neq 0$, the frequency factor ($K_t$) is modified using the formula developed by Kite (1977):

$$K_t = z + (z^2 - 1)k + \frac{1}{3}(z^3 - 6z)k^2 - (z^2 - 1)k^3 + z k^4 + \frac{1}{3}k^5 \quad (12)$$

where $z$ = normal deviate, $K_t$ = frequency factor, and

$$k = \frac{C_s}{6} \quad (13)$$

Now, the predicted logarithmic discharge is calculated using the formula:

$$q_p = \mu + K_t \sigma$$

where $q_p$ = predicted logarithmic discharge, $\mu$ = standard logarithmic mean, and $\sigma$ = standard logarithmic deviation. Hence, the predicted discharge ($Q_p$) is calculated by taking the antilog of $q_p$:

$$Q_p = e^{q_p}$$
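The frequency factor calculations described for the Normal, Gumbel, and LP III methods can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the standard normal deviate is computed directly rather than interpolated from a table, the default reduced mean and reduced standard deviation are the large-sample limits (0.5772 and 1.2825) rather than tabulated values for a particular $N$, the standard deviation and skewness use the usual sample estimators, and the discharge values are hypothetical.

```python
import math
from statistics import NormalDist


def exceedance_probability_percent(T):
    """P (%) = 100 / T for a return period of T years."""
    return 100.0 / T


def normal_frequency_factor(T):
    """Standard normal deviate z whose exceedance probability is 1/T."""
    return NormalDist().inv_cdf(1.0 - 1.0 / T)


def gumbel_frequency_factor(T, y_n=0.5772, s_n=1.2825):
    """K_t = (Y_t - Y_n) / S_n with reduced variate Y_t = -ln(ln(T/(T-1))).

    The defaults are large-sample limits of Y_n and S_n (an assumption);
    pass tabulated values for the actual sample size N instead.
    """
    y_t = -math.log(math.log(T / (T - 1.0)))
    return (y_t - y_n) / s_n


def lp3_frequency_factor(T, c_s):
    """Kite (1977) approximation of K_t for coefficient of skewness c_s."""
    z = normal_frequency_factor(T)
    if c_s == 0:
        return z
    k = c_s / 6.0
    return (z + (z**2 - 1.0) * k + (z**3 - 6.0 * z) * k**2 / 3.0
            - (z**2 - 1.0) * k**3 + z * k**4 + k**5 / 3.0)


def lp3_discharge(discharges, T):
    """Predicted discharge from the natural logs of observed discharges."""
    logs = [math.log(q) for q in discharges]
    n = len(logs)
    mu = sum(logs) / n
    # Assumption: (n - 1) sample standard deviation and the usual
    # bias-corrected sample skewness; the text does not print either formula.
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / (n - 1))
    c_s = n * sum((v - mu) ** 3 for v in logs) / ((n - 1) * (n - 2) * sigma**3)
    q_p = mu + lp3_frequency_factor(T, c_s) * sigma
    return math.exp(q_p)  # antilog of the natural logarithm


# Hypothetical annual peak discharges (m^3/sec), for illustration only.
SAMPLE = [120.0, 95.0, 210.0, 150.0, 180.0, 90.0, 300.0, 130.0]
for T in (10, 100):
    print(T, exceedance_probability_percent(T),
          round(normal_frequency_factor(T), 3),
          round(gumbel_frequency_factor(T), 3),
          round(lp3_discharge(SAMPLE, T), 1))
```

With zero skewness the LP III factor reduces to the normal deviate, which is why `lp3_frequency_factor` returns `z` directly in that case.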
Goodness-of-fit test
A goodness-of-fit test checks whether a given distribution fits a given set of data. The quality of fit for the observed data set is ranked by calculating test statistics, which measure how closely the sample agrees with the expected theoretical probability distribution. The test is applied to evaluate the null hypothesis that the data follow the specified distribution; this hypothesis is rejected if the computed test statistic exceeds the critical value at the chosen significance level. Chi-squared, Anderson-Darling (AD), and Kolmogorov-Smirnov (KS) tests are employed here at significance level 0.05.

Kolmogorov-Smirnov test
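The main objective of this test is to determine whether a sample comes from a hypothesized continuous probability distribution. It is based on the empirical cumulative distribution function (CDF), that is:

$$F_n(x) = \frac{1}{n}\,[\text{number of observations} \le x]$$

The Kolmogorov-Smirnov test statistic ($K$) is given by the largest vertical difference between the theoretical and empirical CDF:

$$K = \max_{1 \le i \le n}\left(F(x_i) - \frac{i-1}{n},\ \frac{i}{n} - F(x_i)\right)$$

where $F$ is the theoretical CDF and $x_1 \le x_2 \le \ldots \le x_n$ are the ordered sample values.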
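A minimal Python sketch of the Kolmogorov-Smirnov statistic, computed as the largest vertical gap between the empirical CDF and a hypothesized CDF, is given below. The Gumbel parameters and discharge values are hypothetical, and the critical value uses the common large-sample approximation 1.358/√n at significance level 0.05, which is an assumption here and is shown only to illustrate the decision rule (it is not accurate for a sample as small as the one in the example).

```python
import math


def gumbel_cdf(x, location, scale):
    """CDF of the Gumbel (maximum) distribution."""
    return math.exp(-math.exp(-(x - location) / scale))


def ks_statistic(sample, cdf):
    """Largest vertical gap between the empirical CDF and a theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF steps from (i - 1)/n to i/n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_critical_value_5pct(n):
    """Large-sample approximation 1.358 / sqrt(n) at significance 0.05."""
    return 1.358 / math.sqrt(n)


# Hypothetical discharges (m^3/sec) and Gumbel parameters, for illustration.
SAMPLE = [120.0, 95.0, 210.0, 150.0, 180.0, 90.0, 300.0, 130.0]
d = ks_statistic(SAMPLE, lambda x: gumbel_cdf(x, 130.0, 55.0))
reject = d > ks_critical_value_5pct(len(SAMPLE))
print(round(d, 4), reject)
```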

Anderson-Darling test
This test compares the fit of an observed CDF to an expected CDF, and gives more weight to the tails of the distribution than the Kolmogorov-Smirnov test.
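The Anderson-Darling statistic is not written out above; a minimal Python sketch, assuming the standard definition $A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln F(x_i) + \ln(1 - F(x_{n-i+1}))\right]$ for the ordered sample, is given below. The logarithms of $F$ and $1 - F$ are what give extra weight to the tails. The function name and the uniform example are illustrative assumptions.

```python
import math


def anderson_darling_statistic(sample, cdf):
    """Anderson-Darling A^2 for a sample against a theoretical CDF.

    Assumes 0 < cdf(x) < 1 for every sample value.
    """
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for i in range(1, n + 1):
        lower_tail = math.log(cdf(xs[i - 1]))        # ln F(x_i)
        upper_tail = math.log(1.0 - cdf(xs[n - i]))  # ln(1 - F(x_{n-i+1}))
        total += (2 * i - 1) * (lower_tail + upper_tail)
    return -n - total / n


# Illustration with a uniform CDF on (0, 1); the values are hypothetical.
a2 = anderson_darling_statistic([0.25, 0.5, 0.75], lambda x: x)
print(round(a2, 4))
```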

Chi-squared test
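This test is applied to determine whether a sample has come from a population with a given distribution. It is applied to binned data, and hence the value of the test statistic depends on how the data are binned:

$$\chi^2 = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j}$$

where $O_j$ = observed frequency for bin $j$, $E_j$ = expected frequency for bin $j$ under the fitted distribution, and $k$ = number of bins.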
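The dependence of the Chi-squared statistic on the binning can be illustrated with a short Python sketch. It assumes the usual form of the statistic, the sum over bins of (observed − expected)² / expected, with the expected frequency taken as the sample size times the probability the fitted CDF assigns to the bin. The function name, bin edges, and sample are hypothetical.

```python
def chi_squared_statistic(sample, cdf, bin_edges):
    """Chi-squared statistic for binned data against a theoretical CDF.

    Bins are the half-open intervals (lo, hi] between consecutive edges;
    observations outside the outermost edges are ignored.
    """
    n = len(sample)
    statistic = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        observed = sum(1 for x in sample if lo < x <= hi)
        expected = n * (cdf(hi) - cdf(lo))
        statistic += (observed - expected) ** 2 / expected
    return statistic


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1]."""
    return min(max(x, 0.0), 1.0)


# Same hypothetical sample, two different binnings.
SAMPLE = [0.1, 0.2, 0.3, 0.9]
two_bins = chi_squared_statistic(SAMPLE, uniform_cdf, [0.0, 0.5, 1.0])
four_bins = chi_squared_statistic(SAMPLE, uniform_cdf, [0.0, 0.25, 0.5, 0.75, 1.0])
print(two_bins, four_bins)
```

The same four observations give a different statistic under the two binnings, which is why the bin definition has to be fixed before statistics are compared across distributions.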

RESULTS AND DISCUSSION
Parameters such as shape ($k$, $\alpha$), scale ($\sigma$, $\beta$), and location ($\mu$, $\gamma$) for the different distribution methods at the four gauge stations are presented in Table 2. The probability density function (PDF) and cumulative distribution function (CDF) graphs for the respective gauge stations are displayed in Figure 2. Three goodness-of-fit tests (as presented in the section 'Goodness-of-fit test') were used to analyze the flow data series at the four stations chosen. The test statistic for each test was calculated, and hypothesis testing was done at significance level 0.05. The AD, KS, and Chi-squared tests reject the hypothesized distribution if the computed statistics exceed the critical values of 2.5, 0.12555, and 12.592, respectively (Millington et al. 2011). The KS, AD, and Chi-squared tests were applied in EasyFit software to select the best-fit distribution(s), and the outcomes are given in Table 3.
At Rampur, Sundargarh, and Jondhra gauge stations, the Gen. extreme value distribution gives the best results followed by LP III, whereas LP III is the best fit for Basantpur followed by Gen. extreme value. Therefore, Gen. extreme value can be used to calculate flood return periods for the present study area. The poor ranking of the Normal distribution is perhaps due to its nature. Given that the Normal distribution is based on the central limit theorem while the data considered in this study (annual maximum) lie in the extreme right tail of all considered distributions, it was expected that the normal fit to the data would be the least efficient. In addition, it is observed that at Rampur, Jondhra, and Basantpur stations the Chi-squared test correctly rejects the normal fit to the data, as both statistics are related to the central limit theorem.

Gen. extreme value
For Rampur watershed, the flood value calculated during the monsoon period ranges from 177.4414 m³/sec to 321.6385 m³/sec for return periods of 10 years to 150 years (Table 4). For Jondhra watershed, the design flood lies within 893.1144 m³/sec to 1,944.325 m³/sec for return periods of 10 years to 150 years. The magnitude of peak floods with respect to the return period is found to be 2,052.522 m³/sec to 3,372.061 m³/sec for Basantpur watershed. This range is the highest among all seasonal peak floods.

Gumbel max
The design flood value for Rampur watershed lies within 198.8535 m³/sec to 372.361 m³/sec for return periods of 10 years to 150 years (Table 5). Correspondingly, for Sundargarh, the estimated flood ranges from 329.1873 m³/sec to 530.415 m³/sec. For Jondhra watershed, the design flood lies within 986.1719 m³/sec to 2,133.888 m³/sec for return periods of 10 years to 150 years. The magnitude of peak floods with respect to the return period is found to be 2,248.463 m³/sec to 3,836.22 m³/sec for Basantpur watershed.

Normal method
For return periods of 10 years to 150 years, the calculated flood value ranges from 175.76 m³/sec to 255.892 m³/sec for Rampur watershed (Table 6). Similarly, for Sundargarh, the estimated flood is from 302.4046 m³/sec to 395.3386 m³/sec. For Jondhra watershed, the design flood ranges from 885.808 m³/sec to 1,234.06 m³/sec for return periods of 10 years to 150 years. The magnitude of extreme flood with respect to the return period is found to be 2,037.138 m³/sec to 2,770.419 m³/sec for Basantpur watershed.

Log-Pearson III
The estimated flood value ranges from 177.4024 m³/sec to 317.6723 m³/sec for return periods of 10 years to 150 years for Rampur watershed (Table 7). Similarly, for Sundargarh, the estimated flood is from 303.5037 m³/sec to 532.3849 m³/sec. For Jondhra watershed, the design flood ranges from 897.3183 m³/sec to 2,183.191 m³/sec for return periods of 10 years to 150 years. The enormity of ...

For a given return period, the value of $x_T$ determined by the Gumbel method is subject to error because of the limited sample data used. The confidence interval indicates the limits about the calculated value between which the true value can be said to lie with a specific probability based on sampling errors. Here, $K$ is the frequency factor, given by $K = (y_T - \bar{y}_n)/S_n$, and $\sigma_{n-1}$ is the standard deviation of the sample. For different values of $T$, $x_T$ is calculated and shown in Figure 4, together with the 95, 90, 85, 80, and 75% confidence limits. It is seen that as the confidence probability rises, the confidence interval also increases. A further increase in $T$ causes the confidence band to spread. The Gumbel distribution has a fixed coefficient of skewness of about 1.14; thus, it will give erroneous results if the sample has a value of $C_s$ very much different from 1.14.
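A minimal Python sketch of confidence limits for a Gumbel estimate is given below. It assumes the form commonly given in hydrology textbooks, $x_T \pm f(c)\,S_e$ with $S_e = b\,\sigma_{n-1}/\sqrt{N}$ and $b = \sqrt{1 + 1.3K + 1.1K^2}$, where $f(c)$ is the standard normal quantile for confidence probability $c$. This form, the function name, and the input values are assumptions for illustration and are not taken from the study.

```python
import math
from statistics import NormalDist


def gumbel_confidence_limits(x_t, K, sigma, n, confidence=0.95):
    """Lower and upper confidence limits around a Gumbel estimate x_t.

    K is the frequency factor, sigma the sample standard deviation,
    n the sample size, and confidence the confidence probability c.
    """
    b = math.sqrt(1.0 + 1.3 * K + 1.1 * K**2)
    probable_error = b * sigma / math.sqrt(n)
    f_c = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for c = 0.95
    return x_t - f_c * probable_error, x_t + f_c * probable_error


# Hypothetical estimate (m^3/sec), frequency factor, and sample statistics.
for c in (0.75, 0.80, 0.85, 0.90, 0.95):
    low, high = gumbel_confidence_limits(350.0, 2.5, 60.0, 60, confidence=c)
    print(c, round(low, 1), round(high, 1))
```

Under this form the band widens both with the confidence probability and with the frequency factor, and hence with the return period, in line with the behavior described above.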

Sensitivity analysis
For the Normal distribution method, the probability factor ($P$) depends on the required return period ($T$), to which it is inversely proportional. The frequency factor ($K_t$) varies with the return period. Predicted discharge ($Q_p$) increases as the required return period increases, while the probability factor ($P$) decreases. When the frequency factor increases, predicted discharge increases. In the Gen. extreme value method, predicted flood increases as the required return period increases, while at the same time the frequency factor increases with a decrease of standard deviation. In Gumbel max, predicted flood increases as the required return period increases, while at the same time the frequency factor also increases; the reduced mean ($Y_n$) and reduced standard deviation ($S_n$) remain constant for all recurrence intervals, whereas the reduced variate ($Y_t$) increases. In LP III, predicted flood increases with an increase in the required return period, while at the same time the frequency factor also increases, whereas the coefficient of skewness ($C_s$) and reduced standard deviation remain constant for all recurrence intervals.
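The trends described in this section can be checked numerically with a short Python sketch that tabulates, for the return periods used in the study, the probability factor, the standard normal deviate, and the Gumbel reduced variate. It assumes $P = 100/T$, a directly computed normal deviate in place of table interpolation, and $Y_t = -\ln[\ln(T/(T-1))]$; the function name is hypothetical.

```python
import math
from statistics import NormalDist

# Return periods (years) considered in the study.
RETURN_PERIODS = [10, 20, 30, 35, 40, 50, 60, 70, 75, 100, 150]


def sensitivity_table(periods=RETURN_PERIODS):
    """Rows of (T, P in percent, normal deviate z, Gumbel reduced variate Y_t)."""
    rows = []
    for T in periods:
        p_percent = 100.0 / T
        z = NormalDist().inv_cdf(1.0 - 1.0 / T)
        y_t = -math.log(math.log(T / (T - 1.0)))
        rows.append((T, p_percent, z, y_t))
    return rows


for T, p, z, y_t in sensitivity_table():
    print(T, round(p, 3), round(z, 3), round(y_t, 3))
```

As the return period grows, the probability factor falls while both the normal deviate and the reduced variate rise, so any estimate of the form mean plus frequency factor times standard deviation rises with it.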

CONCLUSIONS
In this paper, an effort has been made to forecast discharges at various return periods using statistical methods. Four statistical distribution methods, namely, Normal, LP III, Gumbel max, and Gen. extreme value, are used to predict flow discharge at four stations in the Mahanadi River basin. Based on the trends of the last 60 years, the maximum and minimum discharges are found at the 150-year and 10-year return periods, respectively. The rate of increase of discharge is very high at the initial return periods, after which it becomes constant and eventually decreases. The shapes of the graphs are similar in nature and most of the time they do not intersect with each other. In most of the cases, Gumbel max gives the highest flood discharge and the Normal distribution gives the lowest discharge. Gumbel max is the most widely used method to obtain flood discharge as it can be used for infinite sample sizes. The influencing factor of frequency is analyzed on the basis of analysis of the runoff complexity from drainage basins. It is found that flow probability increases in the upstream reaches of the Mahanadi, which may be attributed to changes in underlying surface conditions influenced by human activities and geomorphological changes, and may be considered for future scope. Overall, the purpose of the research is to reduce future flood damage in the river basin. Hence, forecasting of flow discharge is a key input to hydrological modeling and development for water resources engineering.

DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.