The analysis of precipitation data is extremely important for strategic planning and decision-making in various natural systems, as well as for planning and preparing for drought periods. Drought is responsible for several impacts on the economy of Northeast Brazil (NEB), mainly in the agricultural and livestock sectors. This study analyzed the fit of six 2-parameter distributions, gamma (GAM), log-normal (LNORM), Weibull (WEI), generalized Pareto (GP), Gumbel (GUM) and normal (NORM), to monthly precipitation data from 293 rainfall stations across NEB for the period 1988–2017. The maximum likelihood (ML) method was used to estimate the model parameters, and model selection was based on a modification of the Shapiro-Wilk statistic. The results showed the chosen 2-parameter distributions to be flexible enough to describe the studied monthly precipitation data. The GAM and WEI models showed the overall best fits, but the LNORM and GP models gave the best fits in certain months of the year and in regions that differed from the others in terms of their average precipitation.
Real monthly precipitation data from 293 rainfall stations in Northeastern Brazil.
The selection of the model was based on a modification of the Shapiro-Wilk statistic.
The gamma and Weibull distributions showed the best fits compared to the others.
The analysis of precipitation data is extremely important for strategic planning and decision-making in various natural and socio-economic systems, such as agricultural planning (Stern & Coe 1982; Hussain et al. 2010), civil engineering (Bjureland et al. 2019), hydrology (Bedient et al. 2008; Langat et al. 2019; Brendel et al. 2020), water resources management (Jain & Singh 2003), among others. The search for the probability distribution that best fits the precipitation data, for the highlighted systems, was the objective of several studies (Önöz & Bayazit 1995; Olofintoye et al. 2009; Alam et al. 2018) because by using the identified distribution, it is possible to predict future events, such as the probability of rain occurring in a given region (Şen & Eljadid 1999).
Low precipitation is one of the natural factors responsible for drought (Wilhite 2000). According to Mcglade et al. (2019), among the weather-related natural hazards, drought is probably the most complex and severe natural disaster due to its intrinsic nature and wide-ranging and cascading impacts. Several definitions of drought can be found in the literature; however, in general, drought is defined as a deficiency of precipitation over a long period, resulting in a water shortage for some activities (Eslamian & Eslamian 2017). Drought is a natural phenomenon that develops slowly, unlike tornadoes and hurricanes, which are immediately detectable. The difficulty of immediately detecting the beginning of a drought period can generate devastating costs for society in different sectors such as economy, agriculture, ecosystem, and infrastructure (WMO 2015; Marengo et al. 2017; Alvala et al. 2019).
Brazil is characterized by long periods of rain and long periods of drought occurring in different regions at the same time (Marengo et al. 2013). Covering about 47.3% of South America, the country is subject to natural phenomena that directly influence its climate. The Northeast Region of Brazil (NEB) has scarce rainfall and is characterized by long periods of drought (Palharini & Vila 2017). This region has a territory of 1,554,000 km² and a population of approximately 56 million inhabitants. The sectors that drive the economy are agriculture, livestock, industry and tourism (IBGE 2020).
Most states in the NEB have a semi-arid climate and high, seasonal and interannual, rainfall variability with extreme episodes of humidity and drought. Over the years, records of extreme drought have been reported in NEB (Marengo et al. 2017). These phenomena caused several impacts, mainly in the agriculture and livestock sectors. In the years 2012 and 2013, Brazil recorded losses of approximately US$ 1.6 billion for the economic sector and US$ 1.5 billion for livestock mortality, due to drought (Brito et al. 2018).
All over the world, many scientists have been investigating changes in patterns and amounts of precipitation, in countries such as Brazil (Rao et al. 1986; Coêlho et al. 2004; Gutiérrez et al. 2014), Nigeria (Olofintoye et al. 2009), Bangladesh (Ghosh et al. 2016; Alam et al. 2018), the United States (Ye et al. 2018), Burundi (Nkunzimana et al. 2019) and Australia (Hasan et al. 2019), among others. Most studies use precipitation data to analyze the severity of drought, since precipitation directly affects drought (Cavalcanti & Kousky 2004; Coelho et al. 2016; Brito et al. 2018).
Precipitation analysis using probability distribution models has already been investigated by several researchers in Brazil. Vieira et al. (2018) evaluated the fit of four probability distributions (gamma, Weibull, log-normal and normal) to precipitation data in the southwest region of Paraná and concluded that the gamma and Weibull models presented a better fit to the data. Ozonur et al. (2020) applied eight probability distributions to model the monthly rainfall data of the pluviometric station of Campo Grande, Mato Grosso do Sul. The goodness-of-fit tools indicated that, although no distribution provided the best fit to the data for all months, the three-parameter log-normal distribution showed a generally better fit than the other distributions. Martins et al. (2020) applied both the Generalized Pareto Distribution (GP) and the Exponential Distribution (ED) to monthly rainfall data from the city of Uruguaiana, in the state of Rio Grande do Sul. Their results showed that the GP and ED fit the data in all months. Through a simulation study, they found that the GP was more suitable in September and November, whereas the exponential model was more appropriate in January, March, April and August. Melo & Lima (2021) analyzed rainfall series obtained between 1910 and 2016 in 11 localities in the region of Catolé do Rocha, State of Paraíba, and concluded that the logistic distribution adequately represents the rainfall in the region.
Although several studies in different parts of Brazil have assessed precipitation data using probability distributions, few have analyzed a large area like the Northeast region. In this scenario, this study aimed to identify the probability models that best fit the precipitation data from the Northeast region of Brazil, in order to contribute to a better understanding of rainfall patterns in the region. Monthly rainfall data from 293 rainfall stations spread across the NEB for 30 years (1988–2017) were analyzed.
Study area and rainfall data
The Northeast Brazil (NEB) region is located between the parallels of 01° 02′ 30″ north latitude and 18° 20′ 07″ south latitude and between the meridians of 34° 47′ 30″ and 48° 45′ 24″ west of Greenwich. It is the Brazilian region with the largest number of states (nine in total): Maranhão (MA), Piauí (PI), Ceará (CE), Rio Grande do Norte (RN), Paraíba (PB), Pernambuco (PE), Alagoas (AL), Sergipe (SE) and Bahia (BA). Its territorial extension is 1,554,000 km². To the north and east it is bordered by the Atlantic Ocean; to the south by the states of Minas Gerais (MG) and Espírito Santo (ES); and to the west by the states of Pará (PA), Tocantins (TO), and Goiás (GO). The NEB has an estimated population of approximately 56 million people and the economic sectors that stand out are agriculture, livestock, industry, and tourism (IBGE 2020). A graphical representation of NEB is given in Figure 1.
This study was conducted with data from the online platform of the National Water Agency (Agência Nacional de Águas – ANA) – Brazil. This data was collected in 293 rainfall stations spread across the NEB and consists of a historical series of monthly precipitation from 1988 to 2017 (30 years). The location of the stations can be seen in Figure 1.
The NEB rainy season can be divided into different periods. From Figure 2, it can be seen that from January to May the occurrence of rain is greater in the northern portion of the NEB, affecting mainly the states of Maranhão and Piauí, especially in March and April. This behavior may be related to the squall lines and the Intertropical Convergence Zone (ITCZ) (Palharini & Vila 2017).
From June onwards, it is possible to observe a significant decrease in average precipitation in almost the entire NEB. The rainfall pattern changes almost completely compared to previous months, with the highest rainfall on the NEB coast. According to Kousky (1979), the rainy season on the NEB coast from April to July is mainly due to the circulation of the sea breeze and cold fronts that occur along the coast (Figure 2).
An increase of average precipitation in November and December is observed towards the southwest of the Northeast region of Brazil, affecting mainly the states of Bahia, southern Piauí, and southern Maranhão. The increase in rainfall may be related to the fact that the South Atlantic Convergence Zone shifts to the east in December (Carvalho & Jones 2009) (Figure 2).
The procedure for the correct choice of the statistical model that best fits the data is given in three steps. The first is the choice of the model to be tested. In this step, graphs like histograms and dot plots are extremely useful to visualize the shape of the data. Once the model has been chosen, the second step is the estimation of the different parameters of the chosen model. In this step, it is necessary to use a parameter estimation method. In the third step, the quality of the model fitted to the data can be assessed, that is, how well it fits the observed data.
In the last 10 years, several authors have presented different model choices to represent monthly precipitation data (Li et al. 2013; Ashkar & Ba 2017; Sukrutha et al. 2018; Ye et al. 2018; Hasan et al. 2019; Nkunzimana et al. 2019; Salman et al. 2019; Mehdizadeh 2020); these models are listed in Table 1. Among them, the 2-parameter models presented a better fit, and for this reason we selected those with the best results for use in this study: Gamma (GAM), Gumbel (GUM), Normal (NORM), Log-normal (LNORM), Generalized Pareto (GP) and Weibull (WEI).
|Distribution|Probability Density Function (PDF)|Cumulative Distribution Function (CDF)|Support|Parameters|
|Generalized Extreme Value (GEV)| | | | |
|Log-Pearson type III|(a)|(a) (b)| | |
|Generalized Pareto (GP)| | | | |
(a) Γ(·) is the gamma function (Artin 2015).
(b) γ(·,·) is the lower incomplete gamma function (Gautschi 1959).
(c) These are shape, scale and location parameters.
Estimation of parameters by maximum likelihood
In statistical modeling, all or most parameters of the probability distributions are unknown, so they must be estimated from the data. Several methods of parameter estimation have been developed (moments, L-moments, maximum likelihood, among others), and many studies have focused on identifying the method, or methods, that are most efficient in estimating parameters for certain probability distributions. However, across most probability distributions and all sample sizes, it is rarely possible to identify a method that stands out as the best (Ashkar & Ba 2017). For this reason, in the present study, the maximum likelihood estimation method was chosen to estimate the parameters, since it performs well in parameter estimation for several probability distributions (Myung 2003; Fienberg & Rinaldo 2012).
The Maximum Likelihood (ML) method is the most widely used method for estimating parameters, as it presents a good performance for different probability distributions (Zong 2006; Ashkar & Tatsambon 2007; Park et al. 2009). In general, given a set of data and a statistical model, ML estimates the values of the parameters of a statistical model to maximize the probability of the observed data (that is, it seeks parameter values that maximize the likelihood function).
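The definition above can be written compactly. The notation here is an assumption, since the original equations are not reproduced in this text: $x_1, \dots, x_n$ denote the observed monthly totals at a station, $f(x;\theta)$ the density of the candidate distribution and $\theta$ its parameter vector.

```latex
L(\theta) = \prod_{i=1}^{n} f(x_i;\theta), \qquad
\ell(\theta) = \ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i;\theta), \qquad
\hat{\theta} = \arg\max_{\theta}\, \ell(\theta)
```

Maximizing the log-likelihood $\ell(\theta)$ gives the same estimate as maximizing $L(\theta)$ and is usually easier numerically.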
The ML estimates $\hat{\theta}$ are the parameter values that maximize the likelihood function $L(\theta)$, and $\hat{\theta}$ is called the ML estimator (MLE) of $\theta$. The ML estimation method has some essential properties for its application that are not presented here because they are outside the scope of this study; more details about the method can be found in Zong (2006) and Bolfarine & Sandoval (2001).
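As an illustration of ML estimation for one of the six candidate models, the following is a minimal sketch of a 2-parameter Weibull fit in Python. It is not the code used in the study (which was run in R). The function name `weibull_mle` is hypothetical, and the sketch assumes strictly positive observations, so months with zero rainfall would need separate treatment. It solves the standard profile likelihood equation for the shape parameter by bisection and then obtains the scale in closed form.

```python
import math

def weibull_mle(x, tol=1e-10):
    """ML fit of a 2-parameter Weibull; returns (shape, scale).

    Assumption: all observations are strictly positive and not all equal.
    """
    if min(x) <= 0 or min(x) == max(x):
        raise ValueError("need positive, non-constant data")
    n = len(x)
    x_max = max(x)
    z = [v / x_max for v in x]  # rescale so that z**k cannot overflow
    log_z = [math.log(v) for v in z]
    mean_log = sum(log_z) / n

    def score(k):
        # Profile likelihood equation for the shape k (scale-invariant):
        # sum(z^k ln z) / sum(z^k) - 1/k - mean(ln z) = 0
        zk = [v ** k for v in z]
        return sum(a * b for a, b in zip(zk, log_z)) / sum(zk) - 1.0 / k - mean_log

    # score(k) is increasing in k, so bracket the root and bisect.
    lo, hi = 1e-3, 1.0
    while score(hi) < 0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = 0.5 * (lo + hi)
    scale = x_max * (sum(v ** shape for v in z) / n) ** (1.0 / shape)
    return shape, scale
```

The other distributions follow the same pattern with their own likelihood equations; the log-normal and normal models have closed-form ML estimates, while the gamma and generalized Pareto models generally require numerical optimization.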
The Akaike information criterion (AIC) and the Anderson-Darling (AD) statistics are probably the most used methods in the literature to discriminate between probability distributions. However, these methods are biased, especially when used in small sample sizes (Ashkar & Ba 2017; Ashkar et al. 2019), which is the reality of most studies in the field of hydrology. Ashkar et al. (1997) developed the TN.SW statistic, which is a modified version of the famous Shapiro-Wilk test statistic (Shapiro & Wilk 1965).
The TN.SW statistic is much less biased than the AIC and AD methods and offers a clear advantage over Regularized Maximum Likelihood (RML)-based methods as it does not favor the selection of one model over the other, especially for small sample sizes (Ashkar & Ba 2017; Ashkar et al. 2019). Another advantage of using TN.SW statistic is the easy implementation of the method, as described by Ashkar & Ba (2017). The calculation of the statistic consists of just two steps:
2. Use the transformed sample obtained in Equation (4) to calculate the required TN.SW statistic:
Given the advantages offered by TN.SW statistic, this method was chosen to be used in this study.
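The two steps can be sketched as follows, under assumptions that should be checked against Ashkar & Ba (2017). The first assumption is that the transformation in step 1 maps each observation through the fitted CDF and then through the standard normal quantile function, so that a well-fitting model produces an approximately normal sample. The second is a substitution: the Shapiro-Francia statistic W' is used below as a simple stand-in for the Shapiro-Wilk W, because the exact Shapiro-Wilk coefficients are not available in the Python standard library. All function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def transform_to_normal(x, cdf):
    """Step 1 (assumed form): y_i = Phi^{-1}(F(x_i)), with F the fitted CDF."""
    eps = 1e-12  # keep probabilities strictly inside (0, 1)
    return [_STD_NORMAL.inv_cdf(min(max(cdf(v), eps), 1.0 - eps)) for v in x]

def shapiro_francia_w(y):
    """Step 2 (stand-in): Shapiro-Francia W' of the transformed sample."""
    n = len(y)
    ys = sorted(y)
    # Blom's approximation to the expected standard normal order statistics.
    m = [_STD_NORMAL.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    mean_y = sum(ys) / n
    num = sum(mi * yi for mi, yi in zip(m, ys)) ** 2
    den = sum(mi * mi for mi in m) * sum((yi - mean_y) ** 2 for yi in ys)
    return num / den

def weibull_cdf(shape, scale):
    """CDF of a Weibull distribution with the given shape and scale."""
    return lambda v: 1.0 - math.exp(-((v / scale) ** shape))
```

Values of the statistic close to 1 indicate that the transformed sample is close to normal, that is, that the candidate distribution fits well. Turning the statistic into a p-value requires its null distribution, which is not covered by this sketch.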
RESULTS AND DISCUSSIONS
In this section, the results obtained from fitting the six probability distributions selected in this study are presented. The objective is to show which distribution provides the most appropriate fit for each month of the year, using monthly precipitation data from 293 rainfall stations spread across the Northeast Brazilian region from 1988 to 2017. The ML parameter estimates for each statistical distribution, as well as the TN.SW statistic, were computed using the R software.
To assess the quality of the fit of the distributions mentioned above, it is necessary to examine the p-value of the TN.SW statistic, since the statistic itself allows the fits of the distributions to be compared but does not indicate whether a fit is ‘good’ or ‘bad’ (Ashkar & Aucoin 2012). Once the p-values are calculated, the models with the highest p-values are taken as those that best fit the data: the higher the p-value produced by a model, the better its fit, whereas low p-values indicate an inadequate fit. Figure 3 presents the boxplots of the p-values obtained from fitting the distributions under study.
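The selection rule described above can be illustrated with a short sketch. The function name `summarize_fits` and the p-values in `example` are hypothetical and serve only to show the logic for one station in one month: the best model is the one with the highest p-value, and a model is rejected when its p-value falls below the chosen significance level.

```python
def summarize_fits(p_values, alpha=0.05):
    """Return the best-fitting distribution and those rejected at level alpha."""
    best = max(p_values, key=p_values.get)
    rejected = sorted(name for name, p in p_values.items() if p < alpha)
    return best, rejected

# Hypothetical p-values for one station and one month (illustrative only).
example = {"GAM": 0.62, "WEI": 0.58, "LNORM": 0.31,
           "GP": 0.12, "GUM": 0.04, "NORM": 0.001}
best, rejected = summarize_fits(example)
```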
In Figure 3, it can be seen that the majority of p-values were above 0.10. This means that, at a significance level of α = 0.05, for example, most of the models were not rejected. Globally, the gamma and Weibull models showed the best fits compared to the others, but the log-normal model gave a comparable fit during certain periods of the year, particularly in January and from June to October. The good performance of the gamma distribution in fitting precipitation data has already been identified in other regions of Brazil. Araújo et al. (2001) evaluated monthly precipitation data in Boa Vista, Roraima (Northern Brazil), and observed that the gamma distribution gave the best fit to the data for almost all months of the year. Ribeiro et al. (2007) carried out a similar study in Barbacena, Minas Gerais (Southeast Brazil) and also identified the gamma distribution as the one that presented the best fit to monthly precipitation data in that region.
The GP model gave a comparable fit from September to January. This classification is based on the medians of the p-values (Figures 3 and 4). For all months of the year, the Weibull and gamma distributions, despite having high variability, presented a first quartile (Q1) above 0.125 (Figure 4); this means that fewer than 25% of the fits of each of these models were rejected at a significance level of α = 0.125, and a markedly lower percentage was rejected at α = 0.05.
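The link between a quartile of the p-values and the rejection rate can be checked numerically: if the first quartile lies above α, then at most about a quarter of the fits have p-values below α. The sketch below uses hypothetical p-values, hypothetical function names and the "inclusive" quantile definition from the Python standard library; the quantile definition used for the boxplots in the study is not stated.

```python
from statistics import quantiles

def first_quartile(p_values):
    """First quartile (Q1) of a collection of p-values."""
    return quantiles(p_values, n=4, method="inclusive")[0]

def rejection_rate(p_values, alpha):
    """Fraction of fits rejected at significance level alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)

# Hypothetical p-values of one model across eight stations (illustrative only).
p_example = [0.02, 0.13, 0.15, 0.20, 0.30, 0.45, 0.60, 0.80]
q1 = first_quartile(p_example)
rate_0125 = rejection_rate(p_example, 0.125)
rate_005 = rejection_rate(p_example, 0.05)
```

Lowering α can only keep or reduce the rejection rate, which is why the percentage rejected at α = 0.05 cannot exceed that at α = 0.125.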
The log-normal model does not appear to be as interesting as the gamma and Weibull models when only Figures 3 and 4 are analyzed, but Figure 5 reveals six months in which this model had the highest percentage of best fits; that is, in January and from June to October, the log-normal model fitted better than the others. From Figure 4 it is possible to observe that from July to October the medians of the p-values were greater than 0.25, showing that this model was rejected less than 50% of the time at a significance level α = 0.25. A comparison of gamma and log-normal distributions for characterizing satellite rain rates was performed by Cho et al. (2004), who observed that the gamma fits outperform the log-normal fits in wet regions, whereas the log-normal fits are better than the gamma fits in dry regions.
From September to January, the GP model becomes more interesting, with a performance comparable to that of the Weibull and gamma models. In fact, from September to February, the medians of the GP p-values were greater than 0.25 (Figure 4), showing that this model was rejected less than 50% of the time at a significance level α = 0.25, and a markedly lower percentage was rejected at α = 0.05. High variability was also observed among the p-values. In agreement with these results, Martins et al. (2020) observed that the GP distribution showed good fits to rainfall data for the Uruguaiana region (Southern Brazil). Their results showed that the GP distribution fits the data adequately in all months. Through a simulation study, they found that the GP model was more suitable in September and November.
The Gumbel distribution showed low p-values. In all months, the median of the p-values of the TN.SW statistic was well below 0.50 (Figure 4), and well below 0.25 outside of the period March-May. In addition, for all months the first quartile (Q1) was well below 0.25, which means that more than 25% of the data fits were rejected at a significance level of α = 0.25.
The normal distribution, in contrast to those mentioned in the previous paragraphs, presented the lowest p-values in all months analyzed. In most months, the median p-values of this distribution were very close to zero, as in January, October and December, indicating rejection at the significance level α = 0.05. The poor performance of the normal distribution in fitting rainfall data, compared to other distributions, has already been observed by other researchers. When analyzing precipitation data from Bangladesh, Ghosh et al. (2016) observed that, among the six probability distributions under study, the normal distribution showed less satisfactory results than the others. Similar results were also observed by Olofintoye et al. (2009) when comparing six probability distributions for rainfall data in some cities in Nigeria.
Figure 5 shows, for each month, the percentage of data sets for which each distribution gave the best fit. As in Figures 3 and 4, the distributions that showed good fitting potential were the gamma, Weibull, GP, and log-normal. The log-normal performed well from June to October. In August, the log-normal model gave the best fit in 34.8% of the data sets. The gamma, Weibull and GP models presented the best fits during different periods of the year. From February to May, the Weibull and gamma distributions fitted the data better, with February and April being the months in which the Weibull model gave the best fit. Finally, in November and December, the GP model outperformed the others, although it also performed relatively well in the two preceding months (September and October). The Gumbel and normal distributions did not outperform the others in any month.
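The percentages summarized in Figure 5 amount to a tally of the winning distribution per station and month. The following is a minimal sketch of that tally, assuming the p-values are available as (month, {distribution: p-value}) records; this record layout and the function name are hypothetical and not taken from the study.

```python
from collections import Counter, defaultdict

def best_fit_share_by_month(records):
    """Percentage of data sets in which each distribution fits best, per month.

    records: iterable of (month, {distribution: p_value}) pairs, one per station.
    """
    winners = defaultdict(Counter)
    for month, p_values in records:
        # The best fit is the distribution with the highest p-value.
        winners[month][max(p_values, key=p_values.get)] += 1
    shares = {}
    for month, counts in winners.items():
        total = sum(counts.values())
        shares[month] = {name: 100.0 * c / total for name, c in counts.items()}
    return shares
```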
Figure 6 was created to facilitate the spatial visualization of the selected models for each of the 293 stations, per month. From this figure, it is possible to see some results also observed in the analysis of Figures 3–5.
The selection of the GP distribution showed an interesting behavior. It is possible to observe that most of the stations, in which this model presented a better fit, are in a region with low average rainfall. In the months from February to April, there is a high concentration of GP distribution on the NEB coast. In addition, for September and October, the model was chosen more throughout the region, regardless of the area. This behavior can be justified, since the GP distribution is particularly suitable for fitting data with a thick right tail, and between February and April the NEB coast has low average rainfall, as well as September and October in almost the entire NEB.
The log-normal and gamma models fitted well in all regions of NEB and showed similar behavior in almost all months of the year. The only notable exception is the log-normal distribution in January and October, when it was selected more often at stations on the coast of the region. The state of Maranhão had the highest percentage of selection for the Weibull model, mainly from December to April. These months are characterized by a higher average rainfall.
The Gumbel and Normal models were seldom selected in almost every month analyzed. The normal model presented the highest percentage of selection in March (6.5%), fitting better to the data coming from the northern region of the NEB, mainly in the state of Maranhão. The Gumbel model was concentrated in the regions with the highest average rainfall for each month of the year. As one would expect, the presence of a location parameter in the Gumbel model gives it a fitting advantage when the average rainfall is high, in comparison to other models that have no location parameter. The location parameter of the Gumbel model provides it with a flexibility to better fit the high rainfall events. On the other hand, as would also be expected, the absence of a shape parameter in the Gumbel model (i.e. low shape flexibility) lowers its overall performance. In January and April, there is a concentration of the Gumbel model in the state of Maranhão. Between July and August, this concentration moves to the coast, and finally, it returns to concentrate in the north of the NEB in December.
As previously mentioned, some distributions showed similar behavior in the analysis of the p-values, as well as in the fitting of the models to the data. For this reason, Figures 7 and 8 were created. In these figures, we compare the p-values of the models analyzed and present scatter plots showing the correlation between them (based on the Pearson method). These analyses were carried out for all months; however, since the results were similar, only January is shown as an example.
Figure 7 provides a comparison between the p-values of the gamma distribution and the other models. It is easy to see a possible positive linear relationship between the gamma and Weibull distribution (Figure 7(b)). The correlation analysis confirms that there is a strong and positive correlation (0.84) between these variables (Figure 7(f)).
When analyzing the p-values of the gamma distribution compared to the p-values of the GP, there is a moderate correlation between these variables. There is a moderately strong correlation in the upper right portion of the scatter plot, i.e. the area where both models give a good data fit. The correlation analysis indicates only a moderate linear relationship between the p-values of the gamma and GP models (correlation coefficient Corr = 0.44).
When comparing the p-values of the gamma model with those of the other models, the points are more scattered on the graph, which indicates a weak relationship between the p-values. The calculated correlation coefficients confirm this (gamma × log-normal: Corr = 0.01, gamma × Gumbel: Corr = 0.38, gamma × normal: Corr = 0.04).
Figure 8 provides a comparison between the p-values of the GP distribution and the other models. It is possible to observe a possible linear relationship between GP and gamma, as well as GP and Weibull (Figure 8(b) and 8(c)). Comparing the GP model with the other models, it is possible to notice random points in the graph (Figure 8(a), 8(d) and 8(e)), meaning that there is no linear relationship between the variables. The correlation chart shows a moderate linear relationship between the GP and gamma models (Corr = 0.44), as well as GP and Weibull (Corr = 0.53) (Figure 8(f)).
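For reference, the Pearson correlation coefficient used in Figures 7 and 8 can be computed as in the sketch below. The function name is hypothetical; each input would hold the p-values of one model across the stations for a given month.

```python
import math

def pearson_corr(u, v):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(ss_u * ss_v)
```

A value near 1 (such as the 0.84 reported for gamma and Weibull) means that stations where one model fits well tend to be the same stations where the other fits well, while values near 0 indicate no linear relationship between the p-values.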
An analysis of the skewness of the data sets and its effect on the p-values of the TN.SW statistic for the various models was also performed. From Figure 9, it can be seen that most of the observed skew coefficient (Cs) values (marked simply as ‘skewness’ on the horizontal axes) are between 0.5 and 3. For the gamma, log-normal, Weibull and GP models, the highest p-values occurred when the skewness was between about 1 and 2 (Figure 9(a)–9(d)); this means that these models fitted better to data with a moderate positive skewness. The Gumbel model presented the highest p-values (Figure 9(e)) when the skewness was around 1, because the skew coefficient of the Gumbel distribution itself is close to 1 (Cs ≈ 1.14). On the other hand, the normal distribution showed the highest concentration of high p-values (Figure 9(f)) when the skewness was close to 0; this behavior can be explained by the symmetrical shape of this distribution (Cs = 0).
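The skew coefficient of a data set can be computed as in the sketch below. The study does not state which estimator of Cs was used, so the simple moment-based version is an assumption here, and the function and constant names are hypothetical. The constant at the end evaluates the known theoretical skewness of the Gumbel distribution, 12√6 ζ(3)/π³.

```python
import math

def sample_skewness(x):
    """Moment-based skew coefficient m3 / m2^(3/2) (no bias correction)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    return m3 / m2 ** 1.5

ZETA_3 = 1.2020569031595942  # Apery's constant, zeta(3)
GUMBEL_SKEWNESS = 12.0 * math.sqrt(6.0) * ZETA_3 / math.pi ** 3  # about 1.14
```

A symmetric sample gives a value of 0, consistent with the normal distribution fitting best when the observed skewness is close to zero.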
The 2-parameter distributions log normal, Weibull, gamma and Generalized Pareto proved to be flexible enough to describe monthly precipitation data. The results presented in this study allowed us to observe that:
Globally, the gamma and Weibull distributions showed the best fits compared to the others. These models fitted well to data throughout the NEB territory and showed similar behavior for almost every month, except for January and October. Although the gamma and Weibull models presented the overall best fits to the data, the log normal model gave a comparable fit during a certain period of the year, particularly for January and June to October.
The GP model was better fitted to data in regions with low average precipitation, mainly on the NEB coast between February and April. In general, the GP model becomes more interesting than the others from September to January, where the medians of the p-values were greater than 0.25, showing that less than 50% of the time this model was rejected at a significance level α = 0.25 and a significantly lower percentage was rejected at α = 0.05.
The Weibull model, unlike the GP, fitted better to data in regions with high average rainfall, mainly between December and April. This model was highly selected in the state of Maranhão from January to April. In general, the Weibull model was most selected in February, March and April.
As expected, the absence of a shape parameter in the Gumbel and normal models (that is, low shape flexibility) lowers their performance compared to the other models.
A 1-parameter model such as the exponential model would not be flexible enough to provide a good fit for the data.
In most cases, there seems to be no need to pick a 3-parameter model (such as the log-Pearson type III or GEV) to fit the data because 2-parameter ones such as those considered in this study seem to be flexible enough to fit the majority of the observed data in the NEB region.
In general, this study identified which probability distributions best fit monthly precipitation data for the Northeastern region of Brazil, identifying the models that best fit each region in a given period of the year. These results will be useful for studies related to drought in the region, as well as for the analysis of precipitation.
The authors acknowledge the support of Brazilian agency CAPES for granting a scholarship in the first year of the research and the National Water Agency (Agência Nacional de Águas – ANA) for providing the data used in the study.
DATA AVAILABILITY STATEMENT
All relevant data are available from an online repository or repositories at https://www.snirh.gov.br/hidroweb/serieshistoricas.