## Abstract

Monthly precipitation modeling is important in various applications, e.g. streamflow forecasts and water resources management. This paper develops an operational precipitation forecasting scheme, using Bayesian Non-homogeneous Hidden Markov Chain (NHMM) model and teleconnection index. Although the Hidden Markov Chain model has been investigated before in similar studies, the NHMM algorithm employed in this study allows modeling both non-stationary transition probabilities and emission matrix. Climatic teleconnection that affect precipitation is used to drive changes in transition probabilities of different states in the Markov model. The proposed framework is illustrated for multiple-station precipitation analysis in NingXiang County, a southern inland area in China with a high population density. A simulation model is constructed to examine the model's capacity in capturing variabilities and temporal-spatial characteristics exhibiting in monthly precipitation data during 1961–2013. Results indicate that the proposed NHMM model captures the precipitation characteristics at different stations well. Spearman correlation between conditional mean of simulated ensembles and observed data is 0.87–0.9, with few variations at distinct stations. The proposed framework has general applications and can be applied to simulate and generate stochastic monthly precipitation. Further application of the method is also described in the paper.

## INTRODUCTION

Unknown future precipitation is a key factor to consider in relevant decision-making processes. It also serves as an input to hydrologic modeling for streamflow forecasts. Hence, precipitation forecasts at multiple time scales are useful for water resources planning and management and hazard management. For instance, flooding managers typically pay close attention to daily and sub-daily rainfall forecasts to mitigate possible damage caused by storm surge and flash flooding. Water managers would use weekly and up to seasonal rainfall forecasts to guide operations to satisfy water delivery for customers. Water resources planners would then examine historical rainfall patterns and any changes in statistical characteristics of rainfall pattern in planning water infrastructures for future years. The utility of precipitation forecasts has been examined and demonstrated in various studies. Brown *et al.* (1986) examined the economic value of seasonal-precipitation forecasts for wheat farmers in the northern Great Plains to decide each spring whether to plant a crop or to let their land lie fallow. Schneider & Garbrecht (2003) developed a measure to evaluate the usefulness of seasonal precipitation forecasts for agricultural applications. Usefulness was defined as the ability of the forecasts to predict conditions significantly different from climatological norms, assuming that producers would be more likely to use a forecast if it predicts a larger departure from normal conditions. Meza *et al.* (2008) provided a review of economic value of seasonal climate forecasts, including precipitation forecasts, for agriculture applications.

Precipitation modeling and forecasts are of great importance to forecasting other important hydrologic attributes, e.g. streamflow. Cuo *et al.* (2011) reviewed approaches used for quantitative precipitation forecasts and discussed their use in short- to medium-range streamflow forecasts. Habets *et al.* (2004) examined the utility of quantitative forecast precipitation for the prediction of daily streamflow, using two numerical weather prediction models used in France, ARPEGE and ALADIN. The authors found that such precipitation forecasts can be of interest to forecast the progress of long-duration floods for their study area. Clark & Hay (2004) down-scaled precipitation forecasts from NCEP reanalysis project to station locations and then used them as input for streamflow prediction at four illustrative basins. Collischonn *et al.* (2007) developed a forecasting scheme of reservoir inflow for hydroelectric reservoirs in Brazil, based on precipitation forecasts. Precipitation forecasts from general circulation models (GCMs) have been demonstrated as useful in improving streamflow forecasting (Sankarasubramanian *et al.* 2008; Block *et al.* 2009; Sankarasubramanian *et al.* 2009; Wang *et al.* 2013; Wang & Fu 2014; Tsai *et al.* 2015; Crochemore *et al.* 2017). Courdent *et al.* (2018) developed a scheme to distinguish high and low flow domains in urban drainage systems 2 days ahead using numerical weather predictions.

Efforts to improve quantitative precipitation forecasts can be categorized into three general approaches. The first one is to improve numerical modeling of physical processes that drives the amount of future precipitation. Application of data assimilation techniques falls into this category as those techniques typically aim to improve initial conditions for precipitation forecasts in the future. The second one applies statistical methods that aim to capture statistic features of the precipitation pattern. This ranges from traditional time series applications to the recent development of machine learning algorithms. The third is a mixture of the previous two, where statistic methods are used to post-process outputs from numerical models. Various model output statistical approaches have been investigated to improve quantitative precipitation forecasts at different time scales.

Among all the statistical approaches, Hidden Markov Models (HMM) have gained much attention in the research community (Zucchini & Guttorp 1991; MacDonald & Zucchini 1997; Bellone *et al.* 2000; Greene *et al.* 2011; Zucchini *et al.* 2016) as it owns the capacity of capturing the probabilistic transition between different hidden ‘states’ of rainfall events. The hidden states, as indicated by its name, are not explicitly modeled but are implicitly considered for the changes in precipitation amount. Once an HMM model is trained, homogeneous HMM assumes that the probabilities of jumping from one state to another stays the same and the probability density function (PDF) of rainfall amount for a certain state is unchanged. Compared to homogeneous HMM, non-homogeneous HMM (NHMM) allows changes in transition probabilities or PDF for each state due to other driving forces (Hughes *et al.* 1999; Kim *et al.* 2008). Instead of using non-homogeneous transition probabilities, Bracken *et al.* (2014) used an explicit state model that is regressed on climate index. Emission distribution is assumed to be stationary. In addition, parameters of the Hidden Markov Chain are not estimated based on the Bayesian method. Very recently, Bayesian NHMM was developed to capture parameter uncertainties of the Markov model (Robert *et al.* 2000; Paroli & Spezia 2008).

Although NHMM has been applied to precipitation simulation, Bayesian NHMM is a new development and very few previous studies have examined its application in precipitation simulation. The importance of monthly precipitation simulation in many applications, e.g. water resources management, cannot be overstated. This serves as the primary motivation of this study. This paper focuses on monthly rainfall modeling at multiple stations incorporating climate index, using a Bayesian NHMM algorithm.

The objectives of this study are to: (1) develop a NMHH model for multi-stations in the study area and estimate the model parameter using the Bayesian method; and (2) examine the model's capacity in capturing characteristics associated with monthly precipitation.

This paper is arranged as follows. Background and motivation is provided in this section, followed by a description of the methodology. Study area and rainfall data as well as climate index are described in the subsequent section. The results and discussion are then provided, and finally the conclusions.

## METHODOLOGY

### Bayesian Non-homogenous Hidden Markov Chain

NHMM has general application and it is described in the context of rainfall modeling in this section. Consider a multivariate discrete time series of rainfall observation *y _{t,s}*, which is rainfall at time step

*t*at a spatial location

*s*. The observed

*y*is assumed to be a stochastic function of a finite-state Markov process

_{t,s}*z*with a component

*z*associated with each time step. The hidden state space consists of

_{t}*N*possible values, denoting each of the

*N*possible states. For a given time step

*t*, state

*z*is assumed to be conditioned on only one of its previous time steps next to it, namely

_{t}*t*− 1. In essence, the finite-state Markov process has a short memory of one time step and anything beyond one time step is not remembered by the process. Rainfall observation,

*y*, is often described as a stochastic process that follows a certain probability distribution. There are a few terminologies that are used in describing the hidden Markov process. Transition probability

_{t,s}*P*is described as the chance of transitioning from one state to another Markov state and the transition probability matrix,

_{i,j}*P*, describes transition among all different states. For each possible state there is a probability density function that determines the rainfall amount. The PDF associated with each state are components of the emission matrix. The transition matrix and emission matrix, hence, are the two most important elements in estimating HMM models.

Compared to HMM, NHMM models introduce non-homogeneity by allowing varying components in the transition matrix or emission matrix, depending on other relevant variables. As shown in Figure 1, hidden state *z _{t}* is determined jointly by the state of the previous time step

*z*

_{t−1}and another variable. In precipitation modeling, for instance, it could be large scale atmospheric teleconnections that leads to such non-stationary transition probabilities. Similarly, rainfall at time step

*t*at a spatial location

*s*,

*y*, does not only depend on the discrete state but also another variable. Two types of exogenous variables

_{t,s}*W*and

*X*are included in the modeling scheme.

*X*is a variable that could affect emission distribution and

*W*could influence transition probabilities. This allows the NHMM model to incorporate any factors that could affect stochastic distribution of the modeling quantity, precipitation in this case. For instance, precipitation seasonality could be easily incorporated. Different approaches have been investigated to estimate parameters, represented in circles in Figure 1 (Rajagopalan

*et al.*1996; Hughes

*et al.*1999; Meligkotsidou & Dellaportas 2011; Heaps

*et al.*2015; Zucchini

*et al.*2016). Holsclaw

*et al.*(2017) developed a Bayesian approach to estimate all the parameters and quantify corresponding uncertainties. All parameters are sampled via effective Gibbs sampling algorithm; hence, no tuning is needed. The

*W*variable coefficients are sampled through an ordered Multinomial probit (Albert & Chib 1998). The

*X*variable coefficients are sampled through an unordered Multinomial logit model Polya-Gamma formulation (Polson

*et al.*2013). The hidden states

*Z*

_{t}are sampled through a blocked Gibbs sampler. Details of Gibbs sampling for all the parameters are omitted due to space limitation. Interested readers are referred to Holsclaw

*et al.*(2017), where the application of NHMM to simulate daily rainfall data is demonstrated.

### Precipitation modeling using climate index forecasts

To test the applicability of NHMM model for monthly precipitation simulation, a simulation model is set to fit historical precipitation for both spatial and temporal characteristics. In doing so, seasonality is first removed before fitting any NHMM model. Normal distribution is assumed for emission distribution, which is based on the assumption that normal kernel estimation can be used to mimic any probability distribution functions. Transition probabilities are modeled as non-stationary components of the NHMM model, which is dependent on variation of the Nino 3.4 index.

### Model evaluation

Several metrics, including spatial and temporal correlations, are used to evaluate the performance of the simulation model (a description of the metrics is omitted since they are often used in the hydrology community). In addition, root mean absolute error (MSPE) and Nash–Sutcliffe efficiency (NSE) are used to evaluate deterministic forecasts deriving from simulation ensemble for the period 1961–2013. NSE is often used to evaluate how effective the forecasts are, compared to long-term mean values of the quantity of interests. It is calculated based on Equation (1).

*n*is the total number of observations.

*icat*is the category number (for instance, 1 for below normal, 2 for near normal, 3 for above normal),

*ncat*is the total number of categories (3 in a tercile-based system),

*Pcumfct*is the cumulative forecast probability up to category

*icat*, and

*Pcumobs*is the comparable term for the cumulative observation ‘probability’.

_{0}is the average RPS of the reference model (climatology) for the retrospective period. When RPSS is positive, NMME forecasts outperform climatology. The RPSS value represents how much of the RPS has been reduced (RPSS > 0) or increased (RPSS < 0) compared to the reference model (climatology). RPSS is calculated for each of the three forecasting leads.

## STUDY AREA AND DATA

The proposed methodology is illustrated by monthly precipitation simulation and operational forecasts for NingXiang county (NXC), a southern populous area in China. The county has a total area of 2,905 square kilometers (∼1.121 square miles) and it is part of Weishui basin, which contributes to Xiangjiang River. It has a population of over 1.4 million. Precipitation in the study area shows strong seasonality and inter-annual variabilities. Average annual rainfall for NCX is about 1,360–1,750 mm with a rainy season in boreal summer (May–July). Nearly 43% of rainfall over a year is received during the summer rainy season. Variations in monthly precipitation amount is important for local water resources planning and management, food production and human safety, among which summer flooding is one of the primary concerns. For instance, during the 1998 Yangtze flood, rainfall at the main stations in NXC was between 540 and 650 mm. In the summer of 2017, a regional storm at Xiangjiang river basin caused detrimental loss of life and property due to intensive rainfall in a few days. Therefore, forecasting precipitation amount in 1–3 months is of great importance.

Monthly precipitation data at 14 rainfall stations, from 1961 to 2013, were obtained, as shown in Figure 2. The stations are scattered in the region of NXC and their names are provided in Table 1. Monthly time series of rainfall over 636 months are available at each station. Preliminary analyses using one station clearly shows intra-annual seasonality with a summer peak. It is also found that de-seasonalized monthly rainfall follows a normal distribution after lognormal pretreatment. Lilliefors tests were applied to de-seasonalized monthly rainfall time series at all stations. They follow normal distribution at 5% significance level and Figure 3(b) shows a histogram of the de-seasonalized time series for illustrative purposes.

El Niño-Southern Oscillation (ENSO) phenomena, characterized by inter-annual sea surface temperature (SST) variations in the eastern-to-central equatorial Pacific, is identified as the most dominant global climatic teleconnections that affects climate variability at the sub-decadal scale (Rasmusson & Carpenter 1982; Rasmusson & Wallace 1983). Peel *et al.* (2002) investigated the relationship between annual precipitation variability and ENSO on a global scale and found that the annual precipitation variability is higher in the ENSO-influenced continent. It has been employed to improve seasonal streamflow and precipitation forecasts (Quan *et al.* 2006; Devineni & Sankarasubramanian 2010). In this study, the Nino-3.4 index reflects the SST anomaly over the region bounded by 120 W–170 W and 5S–5N is used. Monthly Nino 3.4 and retrospective precipitation forecasts for the study area were obtained from the International Research Institute for Climate and Society (http://portal.iri.columbia.edu) for the period 1961–2013.

## RESULTS AND DISCUSSION

Monthly precipitation for all the 14 stations in NXC region presents strong seasonality, as shown in Figure 4. The late boreal spring and summer season (April–June) is of the highest precipitation within the range of 200–250 mm each month, on average, over 53 years. The boreal winter months of November–January are the dry season with long-term average precipitation of about 50 mm. Besides evident seasonality, strong inter-annual variability is also observed for monthly precipitation. The coefficient of variation for each calendar month for each gage station was calculated and is presented in Figure 4(b). Not surprisingly, the coefficient of variation for the wet season is smaller compared to the dry season, where August and September are of the largest coefficient of variation. The coefficient of variation for the winter season is about 0.6–0.8 and it is 0.3–0.5 for the boreal spring season. This denotes that the inter-annual variability of precipitation for those months is large. Since surface water storage reservoirs in NXC region are within-year reservoirs that largely depend on inflow in summer and fall seasons to fill it back to full. Hence, below normal precipitation in the dry fall could often lead to below normal reservoir storage. A following dry winter season could pose challenges for meeting regional water demand. Therefore, improved prediction of monthly precipitation is informative in implementing drought mitigation strategies. Intra-annual and inter-annual variabilities are properties related to temporal characteristics of monthly precipitation in the NXC region.

### Spatial correlation of rainfall

Spatial characteristics of regional precipitation are primarily reflected in spatial correlation. Monthly precipitation data for each station are pulled first and spatial correlation among the 14 rainfall stations is then calculated. The results are presented in Figure 5, all of which are statistically significant at a confidence level of 0.95. Additional analysis of spatial correlation for each calendar month shows similar results (not shown). This indicates that there is little spatial inhomogeneity in rainfall across the region at monthly time scale and high consistency in dry or wet states among all the stations. Such spatial characteristics should be well recognized and can be used as one criteria in evaluating the proposed modeling framework.

Nino 3.4 is the averaged SST calculated from 5S–5N and 170–120 W, which is used as an ocean-atmospheric teleconnection index of ENSO phenomena. Figure 6(a) shows raw Nino 3.4 data for the period 1961–2013, which is found to have a low frequency oscillation of 2–7 years. Statistically significant correlation between Nino 3.4 and precipitation is observed for all the stations. For instance, Figure 6(b) is a scatterplot of Nino 3.4 and monthly precipitation at station ID 1. Correlation between the two attributes is 0.34, which is statistically significant at a confidence level of 0.95.

### NHMM model fitting

Before fitting the NHMM model for monthly precipitation over 53 years at all the 14 stations, the time series at each station is first de-seasonalized and a natural logarithm is then applied to the de-seasonalized data. The de-seasonalizing process first determines the trend component using a moving average and removes it from the time series of monthly precipitation. Then, the seasonal component is computed for each calendar month by averaging over all periods. The seasonal component is then centered. Finally, the error component is determined by a removing trend and seasonal component from the original time series of monthly precipitation (Kendall & Stuart 1983).

In fitting the NHMM model, three states implicitly denoting below normal, normal and AN conditions across the region are used. The 34th percentile of monthly precipitation is used as the upper boundary of below normal category, while the 67th percentile is used as the lower boundary of the AN category. Preliminary analysis found that de-seasonalized data follows normal distribution (Figure 3(b)). Hence, normal distribution is assumed for the emission distributions for each state. In Bayesian sampling, 500 iterations of Markov Chain Monte Carlo (MCMC) sampler are used to sample from the posterior distribution of all parameters of interest. To examine if the estimated NHMM model can capture the statistical properties of the historical data well, an ensemble of 100 simulations, each using estimated states of monthly precipitation and emission distribution, are generated. As stated in the methodology section, the proposed model uses all 53 years' data and the primary objective is to evaluate if the model captures all the statistics related to regional monthly precipitation.

Figure 7 depicts transition probabilities among all three states of the NHMM model. Sub-plots across each row correspond to probabilities transiting from one to the other states. The *X* axis of all the sub-plots denotes time in months. Compared to HMM, transition probabilities vary in time depending on external variables, e.g. Nino3.4 in this case. For instance, the probability of transiting from state I to state II is from 0.38 to 0.52, as shown in the middle of the first row in Figure 7. There is one state of precipitation condition associated with all the 14 stations in the region. The most probable state for one month is defined as the dominant state for all MCMC iterations, while each iteration may not infer the same state over time. The most probable sequence from all iterations indicates the relative frequency of state 1 is 269 out of 636 months (42.3%) and it is 35.4 and 22.3% for states 2 and 3 respectively.

Emission distribution is a key component of the NMME model, which defines the probability density functions for each state of monthly precipitation. Table 1 lists the mean of the normal distribution associated with each state of NMME model for all the 14 rainfall stations. Hence, three different normal distributions can represent low, normal and high precipitation across all the months. Low precipitation is associated with state 1 and high precipitation is associated with state 3.

### Model evaluation

To evaluate the performance of the NHMM model for historical data, simulated precipitation is compared to observation for each rainfall station. The first is to examine if intra-annual seasonality is preserved. For each member of the 200 simulation ensembles, average precipitation is obtained for each calendar month. Using ensemble members, a time series comprised of 200 mean values are available for one calendar month. The 95th prediction interval defined as the interval between the 2.5th and 97.5th percentile of the time series is then calculated. This procedure is repeated for all the twelve calendar months, station by station. A simple verification is to examine if average precipitation calculated from observed data falls into the 95th prediction interval. Figure 8 shows such evaluation for four rainfall stations and results for the rest are similar but are not shown here. It is found that monthly mean precipitation from the observed data is well bounded by the prediction intervals for all the stations.

To examine if inter-annual variabilities exhibiting in monthly precipitation is captured by the fitted NHMM model, the median of ensembles at each month across the 200 ensemble members are used as a deterministic simulation. Spearman correlation between ensemble median and observed time series is within the range of 0.87–0.90 for all rainfall stations. Figure 9(a) shows scatterplots between observation and ensemble median for one station as an illustrative example. Another aspect worth examining is spatial correlation among all the rainfall stations for simulation data. Analyses show that such spatial characteristics are well preserved by ensemble median, where the Spearman correlation is greater than 0.9 for each pair of rainfall gages. Figure 9(b) displays a scatterplot of median ensemble for two stations for illustrative purposes.

Evaluation metrics discussed in the methodology section are used to evaluate the performance of the simulation model. Table 2 lists evaluation results for each calendar month. For each evaluation metric, results from all the 14 stations are provided as a range. For instance, correlation between NHMM model simulation and observation of January ranges from 0.83 to 0.91 among the 14 stations.

RPS is calculated for the simulation period across all the 14 rainfall stations. It was then compared to RPS of climatology at each station, resulting in a RPSS. Figure 10 presents boxplots of RPSS across all the 14 rainfall stations in the study area. It is found that median RPSS is greater than 0.5 for all stations, denoting that the RPS calculated from the NHMM model is at least 50% less than that of climatology (Equation (3)).

## CONCLUSIONS

Monthly precipitation simulation and forecasts are useful in many applications, e.g. streamflow forecasts and water resources management. This study examines the use of Bayesian Non-homogeneous Markov Chain model in modeling monthly precipitation data, incorporating climate index and climate model forecasted gridded precipitation. One major advantage of the NHMM is to allow variations in transition probabilities or emission probabilities depending on external factors. The proposed approach was demonstrated using precipitation data over 1961–2003 for 14 stations in NXC, a southern county in China. The conditional mean of 200 ensemble members at each time period can be treated as deterministic simulation, while probabilistic outcome can be derived from ensemble members. The results indicate that the NHMM model captures the characteristics of precipitation data well, e.g. inter-annual and intra-annual variabilities, spatial correlation matrix. The 95% prediction intervals of the long-term average precipitation capture the long-term mean of monthly precipitation. The spatial correlation of monthly precipitation among different rainfall stations are also preserved. Both deterministic and probabilistic results for the simulation time period 1961–2013 were evaluated by multiple metrics, including mean square error, root mean square error and rank probability score. The results for different seasons reveal that simulation performance is better for the non-rainy season than for the rainy season. The RPS values are compared to the RPS of climatology calculated from each calendar month, resulting in RPSS. The RPSS is greater than 0.5 for half of the simulation period across all the stations, denoting a significant improvement of probabilistic precipitation. The framework demonstrated in this study is useful in generating stochastic rainfall in historical years and has general applications to other areas. Further extension of this study will be investigated later to examine its application in real-time precipitation forecasts and short-term water resources management.

## ACKNOWLEDGEMENTS

This research was partially supported by the National Natural Science Foundation of China (51809020 and 51839002), Natural Science Foundation of Hunan Province, China (2016JJ3011 and 2018JJ3535). The authors appreciate the two anonymous reviewers whose comments have substantially improved the manuscript.

## REFERENCES

*Monographs on Statistics and Applied Probability*