Abstract
This study aims to evaluate the suitability of the Tweedie generalised linear model for characterising monthly rainfall patterns across 18 meteorological stations in Peninsular Malaysia. It incorporates harmonic functions consisting of sine and cosine functions as seasonal predictors and El Niño Southern Oscillation (ENSO) indices as climatic predictors. Results indicate that three harmonic functions are essential to accurately portray rainfall dynamics in the southwestern and northwestern regions, while two suffice for the inland and western regions. However, incorporating four harmonic functions is the most optimal representation of the eastern region. An additional 1-month lag in ENSO indices is introduced to the optimal seasonal predictor model. Based on the findings, the southern oscillation index notably impacts monthly rainfall significantly in eastern and inland areas, while meteorological stations in the western and northwestern areas fit better with the multivariate ENSO index. Strikingly, no substantial impact of climate predictors is observed on the monthly rainfall within the southwestern region. Thus, the influence of climate indices is very much influenced by the geographical locations of the regions. Importantly, generating simulated data through the Tweedie model contributes to a more accurate representation of the statistical properties inherent in rainfall analysis.
HIGHLIGHTS
Incorporating seasonal terms using harmonic functions enhanced the model's ability to describe rainfall patterns.
The effect of seasonal and ENSO climate predictors on monthly rainfall amounts varies depending on the geographical locations and monsoon influence.
Simulated data based on the Tweedie model able to capture the distributional properties, and excess zeros observed in rainfall analysis.
INTRODUCTION
Due to the enormous global population growth and economic development, understanding rainfall behaviour is essential for its significance to water resource management and climate change. Consequently, climatologists and meteorologists are consistently engaged in exploring the causative factors and underlying mechanisms governing seasonal fluctuations in rainfall. An illustrative example lies in the discrete wavelet transform, as Ruwangika et al. (2020) applied it to identify complex rainfall patterns. Furthermore, integrating the Fourier series into the generalised linear model help in accounting for the nuances of seasonal rainfall variations. Leveraging machine learning techniques, as demonstrated by Barrera-Animas et al. (2022), Das et al. (2022), and Liyew & Melese (2021), have proven effective in enhancing the analytical capacity for rainfall forecasting. The main characteristics of rainfall include its amount, frequency, length of dry and wet spells, and intensity. These values may vary from place to place, year to year, month to month, day to day, and even hour to hour. Scientists have always faced challenges in modelling the chaotic behaviour of rainfall due to its spatial and temporal variabilities, coupled with constraints imposed by data availability within the study region. Therefore, finding a suitable rainfall model that can account for both issues is quite promising.
Rainfall processes may be categorised into two: one is concerned with the amount of rain on wet days, while the other deals with the sequence of dry and wet days. A combination of the Markov chain and the gamma distribution function is recognised as a simple approach for modelling. It has effectively generated daily rainfall data in many applications (Gabriel & Neumann 1962; Stern & Coe 1982; Geng et al. 1986). Because of their simplicity, those two-part models have been successfully used to simulate rainfall data. However, the effectiveness of the best order of Markov chains has been questioned, which led to other studies by several researchers, such as higher-order (Deni et al. 2009), hybrid (Wilks 1999), and hidden Markov chains (Robertson et al. 2003). Although Markov chain models have been extensively studied, these models only monitor the occurrence of rain and are thus limited in their efficacy in describing the amount of rain.
Furthermore, the shape of the rainfall distribution is often skewed to the right; hence, several positive skewed distributions other than Gamma are used in fitting the distribution, such as the lognormal distribution (Chebana & Ouarda 2021), the Weibull distribution (Sharda & Das 2005; Ximenes et al. 2021), and the skewed normal distribution (Chapman 1998). The two-part models are known to have potential uses in predicting and simulating rainfall. Several studies have been done to extend the two-part models in the framework of generalised linear models (GLMs) (Coe & Stern 1982; Chandler & Wheater 2002; Yang et al. 2005; Suhaila & Jemain 2009a; Serinaldi & Kilsby 2014; Pomee et al. 2020).
GLMs were originally used to model daily rainfall time series data at a single site (Coe & Stern 1982; Stern & Coe 1984). These models were the first to be implemented worldwide and are the most basic stochastic weather models. GLMs are parametric models that have the structure to condition the outcome variable, which is daily rainfall, on observed covariates, such as sine and cosine functions of time, to account for seasonality and other climatological variables. Chandler & Wheater (2002) developed a GLM-based framework for comprehending the spatiotemporal rainfall pattern. The division of the rainfall process into occurrences, describing wet and dry states and the amount of rainfall measured during wet spells, is a feature of the GLM models (Wilks 1998; Chandler & Wheater 2002). Yang et al. (2005) used the GLM approach to simulate multisite daily rainfall sequences incorporating intersite dependence structures.
However, when two separate models are employed, there is a possibility that the results fail to describe the overall characteristics of the rainfall. Therefore, Serinaldi (2009) developed a new model structure in modelling to deal with the issue of the construction of two separate models. He incorporated the concept of copula-based Markov chain models. The rainfall values at day t, given the rainfall value on the previous day (t −1), are conditioned using the probability distribution based on the bivariate copula model. The conditional distribution could describe the discrete-continuous nature of the rainfall without splitting the occurrence and amount of processes. The same approach was followed by Serinaldi & Kilsby (2014). They merged meta-Gaussian random fields and a generalised additive model to simulate daily rainfall over large areas without splitting the processes of rainfall occurrence and rainfall amount. As a result, the combined models become more parsimonious compared to other models for multisite rainfall simulation and, at the same time, preserve the rainfall characteristics.
The Tweedie family of distributions is a viable alternative for modelling datasets that exhibit discrete (zero) and continuous (non-zero) values. It is a special case of an exponential dispersion model (EDM), which is often used as a distribution for generalised linear models. The Tweedie family is used to fit within the GLM framework. Within this framework, the Tweedie family accommodates incorporating covariates, allowing response variables to adapt to these predictors. The Poisson–Gamma distribution in the Tweedie family was sufficient to represent both rainfall components with a single complete rainfall model (Dunn 2004; Hasan & Dunn 2010, 2011; Dzupire et al. 2018; Hasan et al. 2019). Hasan & Dunn (2010) used Tweedie GLM with a sine and cosine term (base model) as predictors for modelling monthly rainfall by fitting a single model for each station. The model fits well with the monthly rainfall data for most Australian stations. In another study, Hasan & Dunn (2012) improved the base model by introducing climatological covariates. Three separate covariates were successively applied to the base model: the southern oscillation index (SOI), the SOI phase, and the NINO 3.4 index. Their results indicated that there had been a significant shift in the amount expected and the possibility of no rain due to the impact of these climate variables. A later study by Hasan et al. (2019) investigated the relationship between the estimated parameters from the fitted model of daily, monthly, and seasonal rainfall totals across Australia. They discussed the possibility of estimating the parameters of daily data using the fitted parameters for monthly rainfall. They concluded that the established relationship between the parameters in the monthly and daily models might aid in comprehending the features and generating data on a daily timescale.
A Tweedie GLM has also been applied in a statistical downscaling approach (Hertig & Jacobeit 2015; Hertig et al. 2017; Pomee et al. 2020). Hertig & Jacobeit (2015) used the Tweedie EDM to downscale weather-dependent daily precipitation data in the Mediterranean region. The probability of occurrence and rainfall distribution corresponds well with the Poisson–Gamma distribution but not in situations with extreme values. In a recent publication, Pomee et al. (2020) employed a GLM approach to model the relationship between atmospheric variables and precipitation in the Indus River Basin of Pakistan. GLMs based on a gamma-distributed model were used in the framework. However, for the case of having exact zeros in the observed time series, a Tweedie EDM within a GLM framework was applied because of its ability to model discrete and continuous precipitation features simultaneously.
Seasonal variability and cyclical patterns are likely to be apparent in Malaysian rainfall, with some areas having dry and rainy periods regularly from year to year. It is well known that the monsoons and geographical locations have a significant impact on the Malaysian climate (Suhaila & Jemain 2009b; Suhaila et al. 2011). In their scholarly contributions, Suhaila & Jemain (2009a) introduced a Fourier series to enhance the representation of model parameters. This particular technique holds the advantage of succinctly elucidating the intricate patterns and temporal variations in rainfall. Incorporating sine and cosine functions within this series framework facilitated the systematic quantification of harmonics, thereby providing alternative ways of characterising the complex rainfall patterns prevalent in Peninsular Malaysia. It is notable, however, that their investigation concentrated solely on modelling rainfall amounts, with rainfall occurrences treated independently.
On the other hand, Yunus et al. (2017) used Tweedie GLM to investigate several possible climate covariates on daily rainfall at four studied stations over Peninsular Malaysia. The covariates in the model served various purposes, such as sinusoidal functions for capturing seasonal variability, temporal autocorrelation accounting for the influence of rainfall on the previous, and climate predictors. They found that adding climate indices such as the SOI and NINO 3.4 index improved the fit and substantially changed the predicted daily rainfall outcomes.
The concurrent modelling of discrete and continuous distributions poses a significant challenge for most models, primarily due to the dissimilarity in their marginal distributions. As a result, addressing this mixture of distributions in a unified framework can introduce considerable complexity. Therefore, the Poisson–Gamma Tweedie model emerges as one of the alternatives for the concurrent modelling of rainfall amount and occurrence, particularly within the context of Malaysia. To the best of the author's knowledge, the application of the Tweedie distribution in the domain of rainfall modelling remains new within the Malaysian context. Hence, the primary goal of this study is twofold: firstly, to examine the adaptability and versatility of the Tweedie GLM in characterising the intricate attributes of Malaysian rainfall data, and secondly, to determine the impact exerted by El Niño Southern Oscillation (ENSO) indices on the distribution of rainfall patterns.
This study introduces a dynamic approach by incorporating a linear function comprising several harmonic components and climatological predictors to enhance the flexibility of the Tweedie model. In contrast, the prior investigation by Yunus et al. (2017) exclusively focused on a single harmonic function across four meteorological stations situated in Malaysia. Since Malaysia has a distinctive seasonal rainfall pattern, relying solely on a single harmonic function proves inadequate in effectively describing the rainfall distributions across all stations. Consequently, the present study engages in a comparative analysis of various harmonic functions employed as seasonal predictors, subsequently selecting the optimal number that best describes the rainfall pattern in Malaysia. In addition, this study introduces the monthly SOI and Multivariate El Niño southern oscillation index (MEI) as climate predictors to improve the model accuracy. The predictors serve to assess the significant effect on Malaysian rainfall patterns. The relationship between the ENSO indices and rainfall could be established through the application of fitted Tweedie generalised linear models. Furthermore, the simulated data derived from the fitted model could be used for diverse applications within rainfall modelling, particularly in domains like water resource management and agricultural planning.
STUDY AREA
DATA AND METHODOLOGY
Rainfall: data source and data quality checking
This study selected 18 representative meteorological stations that are well distributed across Peninsular Malaysia based on the availability of the data. The datasets were obtained from the Malaysian Meteorological Department, and most of the studied stations (67%) have data available from January 1980 to December 2019. While the rainfall data at the remaining stations started in 1985 and 1988, some ended in 2012 and 2015. Detailed information regarding the meteorological stations is presented in Table 1. This table includes summary statistics highlighting the months when no rainfall was recorded. It was found that Kangar exhibited the highest percentage, with 42.4% of months having no recorded rainfall. Following Kangar, Chuping and Alor Star are in the second and third positions, with percentages of 32.5 and 27.5%, respectively, signifying notable occurrences of months without rainfall. Conversely, the remaining stations experienced months without rainfall at least once within the study period.
The geographical coordinates of the studied stations with the percentage of months with no rains
Stations . | Latitude . | Longitude . | Elevation (m) . | Months with no rains (%) . | Period . |
---|---|---|---|---|---|
Alor Star | 6°12′ N | 100°24′E | 3.9 | 11 (27.5) | 1980–2019 |
Baling | 5°41′ N | 100° 55′E | 51.9 | 4 (10) | 1980–2019 |
Bayan Lepas | 5°18′N | 100°16′E | 3.0 | 2 (5) | 1980–2019 |
Butterworth | 5°27′ N | 100°23′E | 3.3 | 1 (2.86) | 1985–2019 |
Chuping | 6°29′ N | 100°16′E | 22 | 13 (32.5) | 1980–2019 |
Dungun | 4°46′ N | 103°25′E | 3.1 | 2 (5.6) | 1980–2015 |
Kangar | 6° 26′ N | 100°12′E | 3.1 | 14 (42.4) | 1980–2012 |
Kluang | 2°01′ N | 103°19′E | 88.1 | 1 (2.5) | 1980–2019 |
Kota Bharu | 6°10′ N | 102°18′E | 4.4 | 7 (17.5) | 1980–2019 |
Kuala Krai | 5°32′ N | 102°12′E | 68.3 | 2 (5.89) | 1985–2019 |
KualaTerengganu | 5°23′ N | 103°06′E | 5.2 | 4 (11.43) | 1985–2019 |
Langkawi | 6°20′N | 99°44′E | 6.4 | 8 (25) | 1988–2019 |
Lenggong | 5°06′ N | 100°58′E | 100.7 | 1 (2.5) | 1980–2019 |
Melaka | 2°16′ N | 102°15′E | 8.5 | 1 (2.5) | 1980–2019 |
Mersing | 2°27′ N | 103°50′E | 43.6 | 1 (2.5) | 1980–2019 |
Pasir Mas | 6°02′ N | 102°07′E | 9.1 | 5 (12.5) | 1980–2019 |
Pontian | 1°29′ N | 103°23′E | 4.6 | 1 (2.5) | 1980–2019 |
Tangkak | 2°16′ N | 102°32′E | 30.5 | 1 (2.5) | 1980–2019 |
Stations . | Latitude . | Longitude . | Elevation (m) . | Months with no rains (%) . | Period . |
---|---|---|---|---|---|
Alor Star | 6°12′ N | 100°24′E | 3.9 | 11 (27.5) | 1980–2019 |
Baling | 5°41′ N | 100° 55′E | 51.9 | 4 (10) | 1980–2019 |
Bayan Lepas | 5°18′N | 100°16′E | 3.0 | 2 (5) | 1980–2019 |
Butterworth | 5°27′ N | 100°23′E | 3.3 | 1 (2.86) | 1985–2019 |
Chuping | 6°29′ N | 100°16′E | 22 | 13 (32.5) | 1980–2019 |
Dungun | 4°46′ N | 103°25′E | 3.1 | 2 (5.6) | 1980–2015 |
Kangar | 6° 26′ N | 100°12′E | 3.1 | 14 (42.4) | 1980–2012 |
Kluang | 2°01′ N | 103°19′E | 88.1 | 1 (2.5) | 1980–2019 |
Kota Bharu | 6°10′ N | 102°18′E | 4.4 | 7 (17.5) | 1980–2019 |
Kuala Krai | 5°32′ N | 102°12′E | 68.3 | 2 (5.89) | 1985–2019 |
KualaTerengganu | 5°23′ N | 103°06′E | 5.2 | 4 (11.43) | 1985–2019 |
Langkawi | 6°20′N | 99°44′E | 6.4 | 8 (25) | 1988–2019 |
Lenggong | 5°06′ N | 100°58′E | 100.7 | 1 (2.5) | 1980–2019 |
Melaka | 2°16′ N | 102°15′E | 8.5 | 1 (2.5) | 1980–2019 |
Mersing | 2°27′ N | 103°50′E | 43.6 | 1 (2.5) | 1980–2019 |
Pasir Mas | 6°02′ N | 102°07′E | 9.1 | 5 (12.5) | 1980–2019 |
Pontian | 1°29′ N | 103°23′E | 4.6 | 1 (2.5) | 1980–2019 |
Tangkak | 2°16′ N | 102°32′E | 30.5 | 1 (2.5) | 1980–2019 |
Complete rainfall data are observed for all stations except for the Pasir Mas station, which had 0.6% of its values recorded as missing. A simple inverse distance weighting method was used to estimate the missing values at the Pasir Mas station using the method of the nearest neighbouring stations. The data were grouped into monthly rainfall series and then tested for homogeneity. In the Malaysian context, there are no specific guidelines for choosing the best method for the homogenisation procedure. In this study, the homogeneity of each data series was conducted using the approach introduced by Wijngaard et al. (2003). This study employed four distinct homogeneity tests: the standard normal homogeneity, the Buishand range test, the Pettitt, and the Von Neumann Ratio. All these tests applied the information from the single station series since the meteorological stations used are sparsely distributed. The classification of the monthly station rainfall series was then based on their performance in the homogeneity tests, which were classified as ‘useful (U)’ (when at least three tests indicated homogeneity), ‘doubtful (D)’ (when a minimum of two tests indicated homogeneity), and ‘suspect (S)’ (none or only one test indicated homogeneity). For a more comprehensive understanding of the methods employed, readers are directed to the studies by Wijngaard et al. (2003) and Alexandersson (1986). The results of the homogeneity tests are presented in Table 2. Inhomogeneous rainfall series were detected in January for Baling, Chuping, Kota Bharu, and Pasir Mas. In contrast, the remaining months of these stations exhibited a homogeneous rainfall series. Conversely, it is notable that all other stations subjected to investigation displayed consistent homogeneity in their respective monthly rainfall series.
Categorising homogeneity test in monthly rainfall series across 18 metorological stations
Stations . | J . | F . | M . | A . | M . | J . | J . | A . | S . | O . | N . | D . |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Alor Star | D | U | U | U | U | U | U | U | U | U | U | U |
Baling | S | U | U | U | U | U | U | U | U | U | U | U |
Bayan Lepas | U | U | U | U | U | U | U | U | U | U | U | U |
Butterworth | U | U | U | U | U | U | U | U | U | U | U | U |
Chuping | S | U | U | U | U | U | U | U | U | U | U | U |
Dungun | U | U | U | U | U | U | U | U | U | U | U | U |
Kangar | U | U | U | U | U | U | U | U | U | U | U | U |
Kluang | U | U | U | U | U | U | U | U | U | U | U | U |
Kota Bharu | S | U | U | U | U | U | U | U | U | U | U | U |
Kuala Krai | U | U | U | U | U | U | U | U | U | U | U | U |
Kuala Trengganu | U | U | U | U | U | U | U | U | U | U | U | U |
Langkawi | U | U | U | U | U | U | U | U | U | U | U | U |
Lenggong | U | U | U | U | U | U | U | U | U | U | U | U |
Melaka | U | U | U | U | U | U | U | U | U | U | U | U |
Mersing | U | U | U | U | U | U | U | U | U | U | U | U |
Pasir Mas | S | U | U | U | U | U | U | U | U | U | U | U |
Pontian | U | U | U | U | U | U | U | U | U | U | U | U |
Tangkak | U | U | U | U | U | U | U | U | U | U | U | U |
Stations . | J . | F . | M . | A . | M . | J . | J . | A . | S . | O . | N . | D . |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Alor Star | D | U | U | U | U | U | U | U | U | U | U | U |
Baling | S | U | U | U | U | U | U | U | U | U | U | U |
Bayan Lepas | U | U | U | U | U | U | U | U | U | U | U | U |
Butterworth | U | U | U | U | U | U | U | U | U | U | U | U |
Chuping | S | U | U | U | U | U | U | U | U | U | U | U |
Dungun | U | U | U | U | U | U | U | U | U | U | U | U |
Kangar | U | U | U | U | U | U | U | U | U | U | U | U |
Kluang | U | U | U | U | U | U | U | U | U | U | U | U |
Kota Bharu | S | U | U | U | U | U | U | U | U | U | U | U |
Kuala Krai | U | U | U | U | U | U | U | U | U | U | U | U |
Kuala Trengganu | U | U | U | U | U | U | U | U | U | U | U | U |
Langkawi | U | U | U | U | U | U | U | U | U | U | U | U |
Lenggong | U | U | U | U | U | U | U | U | U | U | U | U |
Melaka | U | U | U | U | U | U | U | U | U | U | U | U |
Mersing | U | U | U | U | U | U | U | U | U | U | U | U |
Pasir Mas | S | U | U | U | U | U | U | U | U | U | U | U |
Pontian | U | U | U | U | U | U | U | U | U | U | U | U |
Tangkak | U | U | U | U | U | U | U | U | U | U | U | U |
Note: S, suspect; D, doubtful; U, useful.
The ENSO index influences the Malaysian climate, which can vary depending on the season and location, with drier-than-normal conditions during El Niño and wetter-than-normal conditions during La Niña (Tangang et al. 2017). According to studies by Tangang & Juneng (2004), a strong ENSO index is only apparent in East Malaysia during the season, and the interannual variability of Malaysia rainfall associated with ENSO events peaks during the NEM period (Juneng & Tangang 2005). Gomyo & Koichiro (2009) also affirm a strong connection between ENSO and rainfall patterns in East Malaysia. Historical analysis reveals that strong El Niño occurrences, such as those in 1997–1998 and 2015–2016, have led to extended droughts and water crises across multiple regions within Malaysia (Tan et al. 2022). Consequently, this study employed the SOI as the initial ENSO indicator, alongside the recent multivariate ENSO index (MEI), to serve as climate predictors. The purpose is to explore the relationship between rainfall distribution and the ENSO index phenomenon using the Tweedie GLM. Within this study, the MEI values can be comprehended through the description provided and accessed via the website: https://psl.noaa.gov/enso/mei/. Similarly, the SOI values required for this research were acquired from the following source: https://www.cpc.ncep.noaa.gov/data/indices/soi/.
Exponential dispersion model














Tweedie family of distributions






The process has a point mass at zero, which implies that it is not an entirely continuous random variable. The Tweedie family of distributions belongs to the class of the EDM (Jørgensen 1997), which can be used to fit into the GLM framework. For those cases having exact zeroes in rainfall time series, a Tweedie GLM was employed due to its ability for simultaneous modelling of discrete and continuous rainfall features (Hasan & Dunn 2010, 2012; Hertig & Jacobeit 2015; Hertig et al. 2017; Pomee et al. 2020).
Fitting Tweedie distributions
The Tweedie distributions are those EDMs with a variance function of the form where
(Jørgensen 1997). The Tweedie family of distributions has three parameters: mean
, dispersion parameter
, and index parameter p, which is denoted as
If p = 0, the Tweedie distributions correspond to normal distributions, but when p = 1,
, it becomes the Poisson distribution. In addition, when p = 2, the Tweedie distributions resemble a Gamma distribution, while if p = 3, it is an inverse Gaussian distribution. Apart from these four special cases, the densities of the Tweedie distribution cannot be written in closed form (Dunn & Smyth 2005).
In the context of GLM, corresponds to predicted values that can be estimated directly from the model using a standard algorithm. However, the maximum likelihood estimation of
is quite complicated (Dunn 2004). Similarly, the estimation procedure to compute the index parameter, p, is also difficult and requires high computational resources (Dunn & Smyth 2005; Hasan & Dunn 2010, 2012; Hasan et al. 2019). With modern and advanced computer technologies, the algorithm to compute the parameter estimations for Tweedie distribution has been successfully developed. Dunn (2022) created the algorithm for the Tweedie profile function in the R package by considering a set of values for the index parameter p and computing the log-likelihood. This can be done by directly maximising the profile likelihood function. The Tweedie package in R has a function to estimate p using tweedie.profile functions. Once the maximum likelihood value of p is found, the distribution within the class of the Tweedie family is identified. In addition, the algorithms that were used to estimate p can also be used to compute the Maximum Likelihood Estimation (MLE) of
(Dunn & Smyth 2005). Hence, this study will employ the algorithm developed by Dunn (2022), which is available in the R package.
Generalized linear models
After the Tweedie Poisson–Gamma distribution is specified, the GLM framework, as discussed in McCullagh & Nelder (1989), will be used to fit the dataset. To model the response with a linear component
the link function needs to be chosen correctly. Since the model fits GLM to the mean rainfall, the logarithm link function is applied for
Based on the Tweedie GLM, models in the form of
are fitted, where
is the mean rainfall in month i,
is a vector of predictors, and
represents a vector of parameter coefficients.
Modelling the predictors in a Tweedie GLM
Sine and cosine functions are included in the Tweedie model to describe seasonal rainfall patterns, which can be expressed as follows:
Model 1 (base model)





Instead of considering only a sine and cosine terms as applied in the previous analyses (e.g., Hasan & Dunn 2010; Yunus et al. 2017), the present study added several sine and cosine terms to determine the optimum number of harmonic functions that best described the rainfall data of the stations since it is known that Malaysian rainfall patterns are heavily influenced by the monsoon seasons. After the optimum number of harmonic functions has been determined, the lagged 1 month SOI and MEI were added to the base model of Equation (6), which are presented as Model 2 and 3 as follows:
Model 2 (base model + SOI)
Model 3 (base model + MEI)
The LRT will be used to compare the fitness of Models 2 and 3 with the base model (Model 1).
Model validation
Two approaches are proposed to assess the effectiveness of the best-fitting model. The first approach uses the k-fold cross-validation method of the generalised linear model. The entire dataset is divided into k pieces (equal sample size) for k-fold cross-validation. Following that, the remaining (k − 1) folds act as the new training set, with each fold acting as a validation set. This process is carried out iteratively k times, with a different group of observations being used as a validation set each time.
The mean square error (MSE) of each fold is computed as , where
and
are the observed and predicted values, respectively. The generalisation error estimate can be determined by averaging the MSEs over k folds of validation. This is referred to as the cross-validation error and can be expressed as
. The k-fold cross-validation method reduces the computational time compared to leave-one-out cross-validation (LOOCV). However, it may introduce some bias in the performance estimate (James et al. 2013). This study uses an algorithm cv.glm in the library(boot) to evaluate the cross-validated prediction error. The adjusted cross-validation estimate is chosen since it is designed to compensate for the bias introduced by not using LOOCV. A k-fold cross-validation is a great option for most applications since it strikes a good balance between computational efficiency and reliability.
The second model validation approach uses a graphical display via a Taylor diagram. It is a method that can be used to compare the performance of different models or simulations against observations by plotting the standard deviation, centred root mean square difference, and correlation coefficient of the model output relative to the observations. Karl Taylor introduced the method in 2001, and the diagram was named after him. However, Taylor diagrams do not offer a formal statistical test of the model's accuracy or generalisation. Instead, they can provide a visual and qualitative assessment of the model's performance. Therefore, Taylor diagrams should be used in combination with other validation methods to obtain a more comprehensive and reliable assessment of the model's performance. A detailed explanation is given by Taylor (2001).
RESULTS AND DISCUSSION
Parameter estimates of tweedie distribution


Table 3 lists the estimated index parameter p, the confidence interval of p, and the dispersion parameter for all studied stations. As shown in Table 3, the estimated values of p for the stations under study that maximise the log-likelihood function fall between 1.32 and 1.72. Dungun demonstrates the highest value of all the estimated values, with p = 1.720. The values of the estimated p for stations situated on the east coast of Peninsular Malaysia are higher than other stations in other parts of the Peninsula. The distinction in index parameter p could be attributed to heavy rains on the east coast of Peninsular Malaysia that may be influencing these outcomes. However, the MLE of p for all studied stations remains in the interval of 1 < p < 2. This finding implies that the Tweedie Poisson–Gamma distribution can be effectively employed for modelling purposes. The estimation of the Tweedie index parameter is the starting point for the formulation of the Tweedie model. The present study uses monthly rainfall as the basis. However, the Tweedie index parameter is computed for each station (not each month), producing much simpler models and reducing the computational burden. After performing the profile likelihood estimate, the desired Tweedie family distribution could be used to fit a GLM.
Tweedie parameters of each studied station
Stations . | p-Index . | Confidence interval of p . | Dispersion parameter, ![]() |
---|---|---|---|
Alor Star | 1.443 | (1.391,1.510) | 7.027 |
Baling | 1.410 | (1.337,1.502) | 7.454 |
Bayan Lepas | 1.486 | (1.388, 1.592) | 4.986 |
Butterworth | 1.492 | (1.380,1.636) | 4.522 |
Chuping | 1.443 | (1.386, 1.510) | 6.978 |
Dungun | 1.720 | (1.633, 1.792) | 2.387 |
Kangar | 1.378 | (1.312, 1.449) | 9.059 |
Kluang | 1.557 | (1.397, 1.708) | 3.281 |
Kota Bharu | 1.655 | (1.580, 1.727) | 3.760 |
Kuala Krai | 1.606 | (1.503, 1.700) | 3.497 |
Kuala Trengganu | 1.704 | (1.622, 1.772) | 2.935 |
Langkawi | 1.476 | (1.443, 1.543) | 6.068 |
Lenggong | 1.508 | (1.364, 1.662) | 3.948 |
Melaka | 1.378 | (1.233, 1.542) | 6.570 |
Mersing | 1.655 | (1.559, 1.761) | 2.575 |
Pasir Mas | 1.622 | (1.553, 1.706) | 4.105 |
Pontian | 1.329 | (1.194, 1.505) | 8.209 |
Tangkak | 1.443 | (1.346, 1.579) | 4.987 |
Stations . | p-Index . | Confidence interval of p . | Dispersion parameter, ![]() |
---|---|---|---|
Alor Star | 1.443 | (1.391,1.510) | 7.027 |
Baling | 1.410 | (1.337,1.502) | 7.454 |
Bayan Lepas | 1.486 | (1.388, 1.592) | 4.986 |
Butterworth | 1.492 | (1.380,1.636) | 4.522 |
Chuping | 1.443 | (1.386, 1.510) | 6.978 |
Dungun | 1.720 | (1.633, 1.792) | 2.387 |
Kangar | 1.378 | (1.312, 1.449) | 9.059 |
Kluang | 1.557 | (1.397, 1.708) | 3.281 |
Kota Bharu | 1.655 | (1.580, 1.727) | 3.760 |
Kuala Krai | 1.606 | (1.503, 1.700) | 3.497 |
Kuala Trengganu | 1.704 | (1.622, 1.772) | 2.935 |
Langkawi | 1.476 | (1.443, 1.543) | 6.068 |
Lenggong | 1.508 | (1.364, 1.662) | 3.948 |
Melaka | 1.378 | (1.233, 1.542) | 6.570 |
Mersing | 1.655 | (1.559, 1.761) | 2.575 |
Pasir Mas | 1.622 | (1.553, 1.706) | 4.105 |
Pontian | 1.329 | (1.194, 1.505) | 8.209 |
Tangkak | 1.443 | (1.346, 1.579) | 4.987 |
Determining the optimum number of harmonic functions for seasonal rainfall distribution
For simplicity, the fitting process for additional sine and cosine functions will be carried out using the same Tweedie index parameter p obtained for each station using Equation (11). As mentioned in the study by Hasan & Dunn (2012), the value of the estimated p makes no difference to the estimated GLM coefficients, but it has a small impact on the monthly rainfall variation of the stations. All 18 meteorological stations were analysed in this study. However, to facilitate the discussion, the findings for the following five stations will be used as case studies based on their geographical locations in Peninsular Malaysia. Chuping represents the studied stations located in the northern region; Bayan Lepas in the inland region; Lenggong represents the stations in the western region; Kota Bharu represents the stations in the eastern region of the Peninsula; and Pontian represents the stations in the southwest region.
The model fitting starts with a constant term and then proceeds with a sine and cosine term, representing the first harmonic. Then, the second and third harmonics were added until no significant effect was observed in the model. Next, model deviance is used to describe the adequacy of the fitted model. The term residual deviance (Res.dev) is used to measure the unexplained variation that remains after fitting a model to the data, while deviance (dev) here represents the reduction in the residual deviance between the nth and (n + 1)th harmonics. Models with different numbers of harmonics are then compared by considering the reductions in deviance. In addition, residual degrees of freedom (Res.df) represent the number of observations or data points minus the number of estimated parameters. The degree of freedom indicates the number of observations that are free to vary after accounting for the parameters in the model. The probability value, or p-value, of the test, is then computed using a distribution with degrees of freedom equal to the difference in the number of estimated parameters. A smaller p-value will lead to rejecting the null hypothesis, which suggests stronger evidence against the null hypothesis, implying that the factor significantly affects the response variable. Finding the optimal number of harmonic functions will proceed until no significant effect is observed in the reduction of deviance.
The deviance analyses for Chuping and Bayan Lepas stations are presented in Tables 4(a) and 4(b). The performance of the model is measured based on the reduction in deviance. Every time a harmonic is added, it will remove two degrees of freedom corresponding to the sine and cosine terms. The chi-square test statistics for a model with one harmonic yield a p-value of 0.000. This p-value is less than the significance level of 0.05, indicating a significant effect of adding one harmonic to the model with no harmonics. Note that further increases in the number of harmonics yield a p-value less than the significance level of 0.05, which means that based on the reduction in deviance, additional harmonics are needed in the model. The second and third harmonics still produce a significant result until the fourth harmonics are added to the model; they do not significantly reduce the deviance. Thus, three harmonics are adequate to fit the mean monthly rainfall at the Chuping station.
The effect of increasing the number of harmonics fitted to the mean monthly rainfall for (a) Chuping, (b) Bayan Lepas, (c) Lenggong, (d) Pontian, and (e) Kota Bharu stations
Number of harmonics . | Res. df . | Res. Dev . | Dev . | P-value . |
---|---|---|---|---|
(a) | ||||
0 harmonic | 479 | 4,692.4 | ||
1 harmonic | 477 | 3,886.1 | 806.36 | 0.0000 |
2 harmonics | 475 | 3,226.8 | 659.30 | 0.0000 |
3 harmonics | 473 | 3,107.3 | 119.43 | 0.0001 |
4 harmonics | 471 | 3,097.1 | 10.24 | 0.4638 |
5 harmonics | 469 | 3,073.1 | 24.04 | 0.1647 |
(b) | ||||
0 harmonic | 479 | 3,408.0 | ||
1 harmonic | 477 | 2,628.5 | 779.55 | 0.0000 |
2 harmonics | 475 | 2,031.1 | 597.39 | 0.0000 |
3 harmonics | 473 | 2,024.8 | 6.27 | 0.4632 |
4 harmonics | 471 | 2,021.8 | 3.02 | 0.6904 |
5 harmonics | 469 | 2,018.7 | 3.10 | 0.6831 |
(c) | ||||
0 harmonic | 479 | 2,181.9 | ||
1 harmonic | 477 | 2,051.1 | 130.88 | 0.0000 |
2 harmonics | 475 | 1,453.0 | 598.03 | 0.0000 |
3 harmonics | 473 | 1,435.9 | 17.08 | 0.0616 |
4 harmonics | 471 | 1,429.4 | 6.55 | 0.3434 |
5 harmonics | 469 | 1,422.0 | 7.41 | 0.2986 |
(d) | ||||
0 harmonic | 479 | 2,481.3 | ||
1 harmonic | 477 | 2,338.6 | 142.675 | 0.0000 |
2 harmonics | 475 | 2,250.0 | 88.556 | 0.0000 |
3 harmonics | 473 | 2,179.4 | 70.654 | 0.0000 |
4 harmonics | 471 | 2,170.3 | 9.093 | 0.3474 |
5 harmonics | 469 | 2,167.3 | 3.030 | 0.7031 |
(e) | ||||
0 harmonic | 479 | 3,070.6 | ||
1 harmonic | 477 | 2,083.6 | 986.98 | 0.0000 |
2 harmonics | 475 | 1,760.0 | 323.60 | 0.0000 |
3 harmonics | 473 | 1,559.9 | 200.13 | 0.0000 |
4 harmonics | 471 | 1,503.7 | 56.18 | 0.0002 |
5 harmonics | 469 | 1,493.0 | 10.70 | 0.2018 |
Number of harmonics . | Res. df . | Res. Dev . | Dev . | P-value . |
---|---|---|---|---|
(a) | ||||
0 harmonic | 479 | 4,692.4 | ||
1 harmonic | 477 | 3,886.1 | 806.36 | 0.0000 |
2 harmonics | 475 | 3,226.8 | 659.30 | 0.0000 |
3 harmonics | 473 | 3,107.3 | 119.43 | 0.0001 |
4 harmonics | 471 | 3,097.1 | 10.24 | 0.4638 |
5 harmonics | 469 | 3,073.1 | 24.04 | 0.1647 |
(b) | ||||
0 harmonic | 479 | 3,408.0 | ||
1 harmonic | 477 | 2,628.5 | 779.55 | 0.0000 |
2 harmonics | 475 | 2,031.1 | 597.39 | 0.0000 |
3 harmonics | 473 | 2,024.8 | 6.27 | 0.4632 |
4 harmonics | 471 | 2,021.8 | 3.02 | 0.6904 |
5 harmonics | 469 | 2,018.7 | 3.10 | 0.6831 |
(c) | ||||
0 harmonic | 479 | 2,181.9 | ||
1 harmonic | 477 | 2,051.1 | 130.88 | 0.0000 |
2 harmonics | 475 | 1,453.0 | 598.03 | 0.0000 |
3 harmonics | 473 | 1,435.9 | 17.08 | 0.0616 |
4 harmonics | 471 | 1,429.4 | 6.55 | 0.3434 |
5 harmonics | 469 | 1,422.0 | 7.41 | 0.2986 |
(d) | ||||
0 harmonic | 479 | 2,481.3 | ||
1 harmonic | 477 | 2,338.6 | 142.675 | 0.0000 |
2 harmonics | 475 | 2,250.0 | 88.556 | 0.0000 |
3 harmonics | 473 | 2,179.4 | 70.654 | 0.0000 |
4 harmonics | 471 | 2,170.3 | 9.093 | 0.3474 |
5 harmonics | 469 | 2,167.3 | 3.030 | 0.7031 |
(e) | ||||
0 harmonic | 479 | 3,070.6 | ||
1 harmonic | 477 | 2,083.6 | 986.98 | 0.0000 |
2 harmonics | 475 | 1,760.0 | 323.60 | 0.0000 |
3 harmonics | 473 | 1,559.9 | 200.13 | 0.0000 |
4 harmonics | 471 | 1,503.7 | 56.18 | 0.0002 |
5 harmonics | 469 | 1,493.0 | 10.70 | 0.2018 |
The bold values show the optimal number of harmonics required with the p-value < 0.05.
Table 4(c) shows that two harmonics are the best way to characterise the rainfall pattern for Lenggong station in the western Peninsula. The results presented in this table indicate that with only one harmonic, the deviations are still much larger, suggesting that the model still did not suit very well. When two harmonics are added to the model, the deviation is substantially reduced, suggesting that this term is still needed, but no more harmonics are required. As a result, two harmonics are ideal for describing the bimodal rainfall pattern in Lenggong. As shown in Figure 2, Lenggong displayed a bimodal rainfall pattern, representing two seasonal rainfall peaks in April and October during the intermonsoon season. This result is consistent with the findings of Suhaila & Jemain (2009a), in which they concluded that heavy rainfall was observed during the intermonsoon in the western Pennisula.
The seasonal rainfall pattern observed in Pontian is characterised by the utilisation of three harmonics, as depicted in Table 4(d). Figure 2 shows three rainfall peaks that could be considered: April, August, and November, corresponding to the SWM, NEM, and intermonsoon seasons. The result suggests that Pontian, situated in the southwestern region of the Peninsula, receives rainfall consistently throughout the year. A similar pattern could be seen for Tangkak and Melaka stations. Upon comparing these three peaks, it becomes apparent that the rainfall distribution on the southwestern Peninsula is more likely to be influenced by the NEM season due to the highest rainfall peak compared to the other seasons. Despite the northwest and southwest Peninsula yielding three harmonics, the rainfall patterns between those studied stations differ. These distinctions could be due to their geographical locations, which impact the local rainfall patterns with the monsoon circulations. Conversely, the eastern region required a large number of harmonics to depict its rainfall pattern effectively. Four harmonics are needed to describe the rainfall pattern in Kota Bharu due to fluctuating rainfall values throughout the year. The highest peak presented in Figure 2 is observed either in November or December. Suhaila & Jemain (2009a) affirmed that rainfall distribution in the eastern region is highly associated with the NEM flow.
The present study introduces a distinct model configuration compared to the approach adopted by Hasan & Dunn (2010, 2012). The number of sine and cosine terms was first examined, and the optimal number was chosen based on the deviance analysis. The selection of an optimal number of seasonal terms yielded the most accurate depiction of the rainfall distributions across the observed stations within Peninsular Malaysia. This choice is particularly relevant due to the profound impact of monsoon patterns on the climatic conditions prevailing in Malaysia.
The fitted GLM Tweedie models
Table 5 presents the findings from the analysis of five selected case studies chosen as illustrative examples. The influence of the seasonal component, characterised by sine and cosine functions, in conjunction with the climatic predictors, ENSO indices, specifically SOI and MEI, has a significant impact on the monthly rainfall amount observed at the Chuping station, as shown in Table 5(a). The LRT is employed to assess the significance of climatic predictors compared to the initial base model centred on seasonal terms. Based on the lowest AIC, it was found that the presence of MEI values within the model has a statistically significant effect on the measured rainfall amount at the Chuping station. In contrast, when considering the model incorporating SOI and two harmonic functions, it is noted that this configuration yields a better fit for the rainfall distribution in the inland region than the model with MEI, as presented in Table 5(b).
Estimated values of the regression coefficients for (a) the Chuping station (northwest region), (b) the Bayan Lepas station (inland region), (c) the Lenggong station (west region), (d) the Pontian station (southwest region), and (e) the Kota Bharu station (east region)
(a) . | ||||||
---|---|---|---|---|---|---|
. | Model 1 (Base model) AIC = 5464.46 . | Model 2 (Base Model + SOI) AIC = 5457.26 . | Model 3 (Base Model +MEI) AIC = 5450.06 . | |||
LRT Statistics . | . | 9.20a . | 16.40b . | |||
. | Coefficients . | t-value . | Coefficients . | t-value . | Coefficients . | t-value . |
![]() | 4.91 | 160.33b | 4.90 | 159.62b | 4.90 | 161.83b |
A1 (sine) | 0.47 | 11.17b | 0.48 | 11.35b | 0.48 | 11.47b |
B1 (cosine) | 0.30 | 6.67b | 0.30 | 6.68b | 0.30 | 6.86b |
A2 (sine) | −0.41 | −9.60b | −0.42 | −9.65b | −0.42 | −9.76b |
B2 (cosine) | −0.19 | −4.50b | −0.19 | −4.42b | −0.19 | −4.54b |
A3 (sine) | 0.13 | 3.13a | 0.13 | 3.11a | 0.13 | 3.16a |
B3 (cosine) | −0.12 | −2.90a | −0.12 | −2.89a | −0.12 | −2.96a |
A4 (sine) | ||||||
B4 (cosine) | ||||||
C3 (SOI) | 0.09 | 2.78a | ||||
C4 (MEI) | −0.11 | −3.75b | ||||
(b) . | ||||||
. | Model 1 (Base model) AIC = 5647.83 . | Model 2 (Base Model + SOI) AIC = 5643.09c . | Model 3 (Base Model +MEI) AIC = 5646.10 . | |||
LRT Statistics . | . | 6.75a . | 3.73 . | |||
. | Coefficients . | t-value . | Coefficients . | t-value . | Coefficients . | t-value . |
![]() | 5.17 | 210.84b | 5.17 | 211.62b | 5.17 | 211.42b |
A1 (sine) | 0.44 | 13.20b | 0.45 | 13.44b | 0.46 | 13.31b |
B1 (cosine) | 0.25 | 7.09b | 0.25 | 7.12b | 0.25 | 7.15b |
A2 (sine) | −0.35 | −10.31b | −0.35 | −10.42b | −0.35 | −10.36b |
B2 (cosine) | −0.22 | −6.54b | −0.22 | −6.51b | ||
A3 (sine) | ||||||
B3 (cosine) | ||||||
A4 (sine) | ||||||
B4 (cosine) | ||||||
C3 (SOI) | 0.07 | 2.62a | ||||
C4 (MEI) | −0.05 | −1.91 | ||||
(c) . | ||||||
. | Model 1 (Base model) AIC = 5451.101 . | Model 2 (Base Model + SOI) AIC = 5448.981 . | Model 3 (Base Model +MEI) AIC = 5444.948 . | |||
LRT Statistics . | . | 4.12c . | 8.15a . | |||
. | Coefficients . | t-value . | Coefficients . | t-value . | Coefficients . | t-value . |
![]() | 5.02 | 214.63b | 5.01 | 215.66b | 5.01 | 217.45b |
A1 (sine) | 0.21 | 6.52b | 0.21 | 6.67b | 0.21 | 6.64b |
B1 (cosine) | 0.02 | 0.50 | 0.02 | 0.49 | 0.02 | 0.54 |
A2 (sine) | −0.43 | −13.04b | −0.43 | −13.20b | −0.43 | −13.30b |
B2 (cosine) | −0.18 | −5.39b | −0.17 | −5.38b | −0.17 | −5.41b |
(c) . | ||||||
. | . | Model 2 (base model + SOI) AIC = 5448.981 . | Model 3 (base model + MEI) AIC = 5444.948 . | |||
. | Model 1 (base model) AIC = 5451.101 . | 4.12c . | 8.15a . | |||
LRT statistics . | Coefficients . | t-Value . | Coefficients . | t-Value . | Coefficients . | t-Value . |
A3 (sine) | ||||||
B3 (cosine) | ||||||
A4 (sine) | ||||||
B4 (cosine) | ||||||
C3 (SOI) | 0.05 | 2.00c | ||||
C4 (MEI) | −0.07 | −2.78a | ||||
(d) . | ||||||
. | Model 1 (Base model) AIC = 5679.23 . | Model 2 (Base Model + SOI) AIC = 5678.86 . | Model 3 (Base Model +MEI) AIC = 5678.83 . | |||
LRT Statistics . | . | 2.37 . | 2.41 . | |||
. | Coefficients . | t-value . | Coefficients . | t-value . | Coefficients . | t-value . |
![]() | 5.27 | 241.10b | 5.27 | 241.02b | 5.27 | 241.13b |
A1 (sine) | 0.14 | 4.61b | 0.14 | 4.61b | 0.14 | 4.61b |
B1 (cosine) | −0.01 | −3.23a | −0.10 | −3.22a | −0.10 | −3.22a |
A2 (sine) | −0.14 | −4.59b | −0.14 | −4.60 | −0.14 | −4.60b |
B2 (cosine) | 0.01 | 0.24 | 0.01 | 0.23 | 0.01 | 0.23 |
A3 (sine) | 0.05 | 1.48 | 0.05 | 1.50 | 0.05 | 1.50 |
B3 (cosine) | −0.12 | −3.80a | −0.12 | −3.78b | −0.12 | −3.78b |
A4 (sine) | ||||||
B4 (cosine) | ||||||
C3 (SOI) | 0.01 | 0.22 | ||||
C4 (MEI) | 0.01 | 0.28 | ||||
(e) . | ||||||
. | Model 1 (Base model) AIC = 5784.20 . | Model 2 (Base Model + SOI) AIC = 5757.89 . | Model 3 (Base Model +MEI) AIC = 5766.55 . | |||
LRT Statistics . | . | 28.31b . | 19.65b . | |||
. | Coefficients . | t-value . | Coefficients . | t-value . | Coefficients . | t-value . |
![]() | 5.12 | 146.72b | 5.10 | 153.31b | 5.11 | 150.75b |
A1 (sine) | 0.50 | 9.80b | 0.53 | 10.99b | 0.51 | 10.45b |
B1 (cosine) | −0.48 | −9.91b | −0.47 | −10.24b | −0.47 | −10.03b |
A2 (sine) | −0.31 | −6.26b | −0.33 | −6.84b | −0.32 | −6.58b |
B2 (cosine) | 0.34 | 6.95b | 0.35 | 7.63b | 0.34 | 7.34b |
A3 (sine) | 0.28 | 5.71b | 0.28 | 5.92b | 0.28 | 5.84b |
B3 (cosine) | −0.24 | −4.94b | −0.24 | −5.23b | −0.24 | −5.11b |
A4 (sine) | −0.06 | −1.19 | −0.06 | −1.27 | −0.06 | −1.26 |
B4 (cosine) | 0.19 | 3.94b | 0.19 | 4.13b | 0.19 | 4.03b |
C3 (SOI) | 0.17 | 5.07b | ||||
C4 (MEI) | −0.14 | −4.18b |
(a) . | ||||||
---|---|---|---|---|---|---|
. | Model 1 (Base model) AIC = 5464.46 . | Model 2 (Base Model + SOI) AIC = 5457.26 . | Model 3 (Base Model +MEI) AIC = 5450.06 . | |||
LRT Statistics . | . | 9.20a . | 16.40b . | |||
. | Coefficients . | t-value . | Coefficients . | t-value . | Coefficients . | t-value . |
![]() | 4.91 | 160.33b | 4.90 | 159.62b | 4.90 | 161.83b |
A1 (sine) | 0.47 | 11.17b | 0.48 | 11.35b | 0.48 | 11.47b |
B1 (cosine) | 0.30 | 6.67b | 0.30 | 6.68b | 0.30 | 6.86b |
A2 (sine) | −0.41 | −9.60b | −0.42 | −9.65b | −0.42 | −9.76b |
B2 (cosine) | −0.19 | −4.50b | −0.19 | −4.42b | −0.19 | −4.54b |
A3 (sine) | 0.13 | 3.13a | 0.13 | 3.11a | 0.13 | 3.16a |
B3 (cosine) | −0.12 | −2.90a | −0.12 | −2.89a | −0.12 | −2.96a |
A4 (sine) | ||||||
B4 (cosine) | ||||||
C3 (SOI) | 0.09 | 2.78a | ||||
C4 (MEI) | −0.11 | −3.75b | ||||
(b) . | ||||||
. | Model 1 (Base model) AIC = 5647.83 . | Model 2 (Base Model + SOI) AIC = 5643.09c . | Model 3 (Base Model +MEI) AIC = 5646.10 . | |||
LRT Statistics . | . | 6.75a . | 3.73 . | |||
. | Coefficients . | t-value . | Coefficients . | t-value . | Coefficients . | t-value . |
![]() | 5.17 | 210.84b | 5.17 | 211.62b | 5.17 | 211.42b |
A1 (sine) | 0.44 | 13.20b | 0.45 | 13.44b | 0.46 | 13.31b |
B1 (cosine) | 0.25 | 7.09b | 0.25 | 7.12b | 0.25 | 7.15b |
A2 (sine) | −0.35 | −10.31b | −0.35 | −10.42b | −0.35 | −10.36b |
B2 (cosine) | −0.22 | −6.54b | −0.22 | −6.51b | ||
A3 (sine) | ||||||
B3 (cosine) | ||||||
A4 (sine) | ||||||
B4 (cosine) | ||||||
C3 (SOI) | 0.07 | 2.62a | ||||
C4 (MEI) | −0.05 | −1.91 | ||||
(c) . | ||||||
. | Model 1 (Base model) AIC = 5451.101 . | Model 2 (Base Model + SOI) AIC = 5448.981 . | Model 3 (Base Model +MEI) AIC = 5444.948 . | |||
LRT Statistics . | . | 4.12c . | 8.15a . | |||
. | Coefficients . | t-value . | Coefficients . | t-value . | Coefficients . | t-value . |
![]() | 5.02 | 214.63b | 5.01 | 215.66b | 5.01 | 217.45b |
A1 (sine) | 0.21 | 6.52b | 0.21 | 6.67b | 0.21 | 6.64b |
B1 (cosine) | 0.02 | 0.50 | 0.02 | 0.49 | 0.02 | 0.54 |
A2 (sine) | −0.43 | −13.04b | −0.43 | −13.20b | −0.43 | −13.30b |
B2 (cosine) | −0.18 | −5.39b | −0.17 | −5.38b | −0.17 | −5.41b |
(c) . | ||||||
. | . | Model 2 (base model + SOI) AIC = 5448.981 . | Model 3 (base model + MEI) AIC = 5444.948 . | |||
. | Model 1 (base model) AIC = 5451.101 . | 4.12c . | 8.15a . | |||
LRT statistics . | Coefficients . | t-Value . | Coefficients . | t-Value . | Coefficients . | t-Value . |
A3 (sine) | ||||||
B3 (cosine) | ||||||
A4 (sine) | ||||||
B4 (cosine) | ||||||
C3 (SOI) | 0.05 | 2.00c | ||||
C4 (MEI) | −0.07 | −2.78a | ||||
(d) . | ||||||
. | Model 1 (Base model) AIC = 5679.23 . | Model 2 (Base Model + SOI) AIC = 5678.86 . | Model 3 (Base Model +MEI) AIC = 5678.83 . | |||
LRT Statistics . | . | 2.37 . | 2.41 . | |||
. | Coefficients . | t-value . | Coefficients . | t-value . | Coefficients . | t-value . |
![]() | 5.27 | 241.10b | 5.27 | 241.02b | 5.27 | 241.13b |
A1 (sine) | 0.14 | 4.61b | 0.14 | 4.61b | 0.14 | 4.61b |
B1 (cosine) | −0.01 | −3.23a | −0.10 | −3.22a | −0.10 | −3.22a |
A2 (sine) | −0.14 | −4.59b | −0.14 | −4.60 | −0.14 | −4.60b |
B2 (cosine) | 0.01 | 0.24 | 0.01 | 0.23 | 0.01 | 0.23 |
A3 (sine) | 0.05 | 1.48 | 0.05 | 1.50 | 0.05 | 1.50 |
B3 (cosine) | −0.12 | −3.80a | −0.12 | −3.78b | −0.12 | −3.78b |
A4 (sine) | ||||||
B4 (cosine) | ||||||
C3 (SOI) | 0.01 | 0.22 | ||||
C4 (MEI) | 0.01 | 0.28 | ||||
(e) . | ||||||
. | Model 1 (Base model) AIC = 5784.20 . | Model 2 (Base Model + SOI) AIC = 5757.89 . | Model 3 (Base Model +MEI) AIC = 5766.55 . | |||
LRT Statistics . | . | 28.31b . | 19.65b . | |||
. | Coefficients . | t-value . | Coefficients . | t-value . | Coefficients . | t-value . |
![]() | 5.12 | 146.72b | 5.10 | 153.31b | 5.11 | 150.75b |
A1 (sine) | 0.50 | 9.80b | 0.53 | 10.99b | 0.51 | 10.45b |
B1 (cosine) | −0.48 | −9.91b | −0.47 | −10.24b | −0.47 | −10.03b |
A2 (sine) | −0.31 | −6.26b | −0.33 | −6.84b | −0.32 | −6.58b |
B2 (cosine) | 0.34 | 6.95b | 0.35 | 7.63b | 0.34 | 7.34b |
A3 (sine) | 0.28 | 5.71b | 0.28 | 5.92b | 0.28 | 5.84b |
B3 (cosine) | −0.24 | −4.94b | −0.24 | −5.23b | −0.24 | −5.11b |
A4 (sine) | −0.06 | −1.19 | −0.06 | −1.27 | −0.06 | −1.26 |
B4 (cosine) | 0.19 | 3.94b | 0.19 | 4.13b | 0.19 | 4.03b |
C3 (SOI) | 0.17 | 5.07b | ||||
C4 (MEI) | −0.14 | −4.18b |
ap < 0.001.
b0.001 ≤ p < 0.01.
c0.001 ≤ p < 0.05.
Table 5(c) shows that all seasonal terms with two harmonics, except for the first cosine term, significantly impact the monthly rainfall amounts for the Lenggong station. In addition, based on the LRT statistics, models featuring climate predictors offer a significantly improved fit compared to the base model. Among these, the model with MEI demonstrates the most optimal fit. Contrasting results are observed at the southwestern station, as exemplified by the Pontian station, where none of the climate predictors substantially impact the monthly rainfall patterns, as detailed in Table 5(d). Consequently, adding climate predictors did not significantly influence the model for stations in the southwestern region. In other words, the monthly rainfall distribution for stations in the southwestern region is completely characterised by the model, consisting of seasonal terms. For the Kota Bahru station in the eastern region, both SOI and MEI together with the seasonal terms (excluding the sine term for the fourth harmonic) show a significant impact on the monthly rainfall amounts. Through an evaluation of the AIC, the model featuring the SOI emerges as the preferred choice for accurately modelling the monthly rainfall at the Kota Bharu station.
According to Richard & Walsh (2018), a La Niña event can be identified by a positive value of the SOI, typically accompanied by an increase in the total amount of rain. In addition, as outlined in the work of Wong et al. (2016), negative MEI values reflect the cold ENSO phase associated with the La Niña events, whereas positive MEI values correspond to the warm ENSO phase associated with the El Niño episodes. As can be seen in Tables 5(a), 5(c), and 5(e), there were cases of regression coefficients that exhibited significant positive SOI as well as significant negative MEI. These combined positive values of SOI and negative values of MEI suggest the prevalence of the La Niña episodes.
The SOI shows positive regression coefficients for all case-studied stations, with the largest impact at the Kota Bharu station in the eastern areas, as shown in Table 5(e). These findings can be explained in such a way that the increase in the positive coefficients of SOI will lead to an increase in the amount of rainfall. On the other hand, a converse relationship is observed between the MEI and the rainfall amount. The more severely negative MEI values will lead to an increase in the amount of rain. This phenomenon is particularly evident at the Chuping and Lenggong stations, which represent the northwestern and western regions of Peninsular Malaysia in the study. These stations exhibit notably larger and more significant negative MEI values. On the contrary, neither SOI nor MEI has a substantial impact on the amount of rainfall recorded at the Pontian station.
The preferred model of each studied station based on the lowest AIC and LRT statistics.
The preferred model of each studied station based on the lowest AIC and LRT statistics.
Model interpretation
Consider the model fitted to the monthly rainfall amounts for the Kota Bharu station with the index Tweedie parameter (Table 2). Based on the deviance analysis, the fitted model with four harmonics is the best for the Kota Bharu station. Note that the model uses the logarithmic link function.
Model 1 (base model)
Model 2 (base model + SOI)
Model 3 (base model + MEI)
Suppose that m = 3, t = 10 (tenth year of study 1989), then i = 12(t − 1)+ m = 12(9) + 3 = 113. The values of SOI and MEI for February 1989 are 1.2 and −1.06, respectively. The mean predicted rainfall amounts in March 1989 using model 1, model 2, and model 3 are 115.58, 135.10, and 130.11 mm, respectively. The mean predicted rainfall amounts for the rest of the months can be found by substituting the appropriate m and i.










Cross-validation and model selection
The methods of k-fold cross-validation and the Taylor diagram have been employed for all three tested models to assess the effectiveness of the best-fitting model. Table 6 lists the adjusted cross-validation prediction error for five stations taken as a case study. This analysis encompasses the utilisation of both 10-fold and 20-fold cross-validation to ensure the consistency of the results. Chuping and Lenggong stations are well fitted with the base model with the addition of the MEI, while Bayan Lepas and Kota Bharu stations are well suited with the addition of the SOI and the base model. For the Pontian station, no addition of climate predictors best described the monthly rainfall data of the station.
Tweedie GLM cross-validation prediction error of each station
Stations . | Models . | Cross-validation prediction error . | |
---|---|---|---|
K = 10 . | K = 20 . | ||
Chuping (northwest) | Base model | 6,704.72 | 6,670.65 |
Base model + SOI | 6,705.90 | 6,656.03 | |
Base model + MEI | 6,623.24 | 6,639.70 | |
Bayan Lepas (inland) | Base model | 9,785.80 | 9,771.71 |
Base model + SOI | 9,725.98 | 9,754.93 | |
Base model + MEI | 9,866.33 | 9,819.34 | |
Lenggong (west) | Base model | 6,162.00 | 6,134.14 |
Base model + SOI | 6,241.50 | 6,190.96 | |
Base model + MEI | 6,152.54 | 6,125.03 | |
Pontian (southwest) | Base model | 8,540.16 | 8,566.46 |
Base model + SOI | 8,693.56 | 8,611.66 | |
Base model + MEI | 8,604.62 | 8,594.16 | |
Kota Bharu (east) | Base model | 28,571.92 | 28,404.75 |
Base model + SOI | 27,559.41 | 27,886.04 | |
Base model + MEI | 28,319.96 | 28,052.38 |
Stations . | Models . | Cross-validation prediction error . | |
---|---|---|---|
K = 10 . | K = 20 . | ||
Chuping (northwest) | Base model | 6,704.72 | 6,670.65 |
Base model + SOI | 6,705.90 | 6,656.03 | |
Base model + MEI | 6,623.24 | 6,639.70 | |
Bayan Lepas (inland) | Base model | 9,785.80 | 9,771.71 |
Base model + SOI | 9,725.98 | 9,754.93 | |
Base model + MEI | 9,866.33 | 9,819.34 | |
Lenggong (west) | Base model | 6,162.00 | 6,134.14 |
Base model + SOI | 6,241.50 | 6,190.96 | |
Base model + MEI | 6,152.54 | 6,125.03 | |
Pontian (southwest) | Base model | 8,540.16 | 8,566.46 |
Base model + SOI | 8,693.56 | 8,611.66 | |
Base model + MEI | 8,604.62 | 8,594.16 | |
Kota Bharu (east) | Base model | 28,571.92 | 28,404.75 |
Base model + SOI | 27,559.41 | 27,886.04 | |
Base model + MEI | 28,319.96 | 28,052.38 |
Taylor diagram showing the correlation and standard deviation ratio between the predicted and observed monthly rainfall at five case study stations. The legend shows the base model, the base model with SOI, and the base model with MEI.
Taylor diagram showing the correlation and standard deviation ratio between the predicted and observed monthly rainfall at five case study stations. The legend shows the base model, the base model with SOI, and the base model with MEI.
In comparing all three tested models, a base model with the SOI displays the closest agreement to the observed monthly rainfall values for the Kota Bharu station in the eastern region. On the other hand, the base model enhanced with the MEI index emerges as the optimal choice for the Chuping station in the northwest region, surpassing the performance of the other two models. For the rest of the stations, the Taylor diagram suggests a minimal difference in statistical metrics among the three models.
Limitations of study and future work
This study aims to identify an appropriate set of harmonic functions, which include sine and cosine functions, that can effectively represent the monthly rainfall data in Peninsular Malaysia. The main concern revolves around the availability of datasets, as the study focuses solely on stations with an extensive study period and a percentage of missing daily rainfall data below 1%. Moreover, to achieve the objective of the study, only stations that recorded both zero and non-zero monthly rainfall during the study period were taken into account.
The subsequent challenge relates to the computational time required to calculate the regression coefficients for each tested harmonic function, potentially impacting the model's efficiency. Nevertheless, the outcomes yielded the optimal number of sine and cosine functions for accurately describing the monthly rainfall data, considering the substantial influence of monsoons on Malaysia's climate. These results exhibited higher accuracy than models that utilised a single harmonic function composed exclusively of a sine and cosine component.
The third limitation focuses on the Tweedie index parameter p. In this study, the Tweedie index parameter p was computed for each station but not on a monthly basis. According to Hasan & Dunn (2010), the estimated p values do not significantly affect the regression coefficient estimates but can have a slight impact on monthly variation. For the sake of simplicity, the same index parameter is applied to all three tested models: the base model (seasonal terms), the base model with the SOI, and the base model with the MEI. Cross-validation was conducted to assess the model's fitness using k-fold and Taylor diagrams presented in a graphical form. Despite these limitations, this study presents an optimal approach for selecting the correct number of sine and cosine functions to capture the seasonal rainfall pattern precisely. Moreover, this method could be employed to determine the influence of climate predictors on the monthly rainfall of the stations.
For future research, it is recommended to develop separate models for modelling rainfall data and comparing their performance against the Tweedie Poisson–Gamma distribution. Furthermore, it is advisable to compute the Tweedie index parameter for each tested model, enhancing its accuracy in describing monthly rainfall variation. The Tweedie family of distributions applies not only to monthly timescales but also to daily timescales. Finally, considering additional ENSO climate predictors in the model could provide insight into their influence on monthly rainfall.
CONCLUSION
Modelling rainfall processes separately pose many challenges. Direct modelling of rainfall processes is often problematic because rainfall data consists of zeroes and non-zeroes. Many fitting models encounter problems when handling this kind of dataset. In Malaysia, the variation in rainfall patterns is strongly affected by monsoonal winds that bring heavy rainfall during certain months in different regions. Since the studied stations have some months with zero rainfall values within the continuous rainfall total, the Tweedie family of distributions is the best model to be applied to the Malaysian rainfall series. The present study focused on modelling the monthly rainfall distribution at 18 rain gauge stations across Peninsular Malaysia using a Tweedie Poisson–Gamma distribution under the GLM framework.
This study successfully demonstrated the practical applications of the Tweedie distribution to address situations where continuous rainfall data contain exact zero values. The choice of Tweedie distribution was justified within the range of Tweedie index parameter values (1 < p < 2). The discrepancy in Tweedie parameter values observed among stations was attributed to their geographical locations and monsoon-induced variations. In contrast to the previous literature (Hasan & Dunn 2010, 2012; Yunus et al. 2017), incorporating seasonal terms using harmonic functions enhanced the model's ability to describe rainfall patterns and reveal distinct patterns across different regions. Furthermore, the influence of climatological predictors on monthly rainfall was examined, highlighting the significance of the SOI and MEI indices in explaining rainfall variations. The analysis showed that the base model with SOI provided the best fit for most stations, particularly in eastern and inland areas, while the base model with MEI demonstrated a significant impact on rainfall at specific northern stations.
The main contribution of this study lies in developing a Tweedie GLM that handles both zero and non-zero rainfall values and its extension to incorporate climatological predictors and the optimal number of seasonal terms. By re-parameterizing Tweedie parameters to the Poisson–Gamma distribution, simulations can be conducted to predict statistical aspects of monthly rainfall, aiding in agricultural planning, disaster mitigation, and water resource management.
In summary, the Tweedie GLM emerges as a suitable alternative for modelling rainfall distributions, effectively addressing the challenges posed by zero-inflated data. The insights gained from this study offer practical modelling solutions and contribute to a deeper understanding of the complex associations between climatological variables and rainfall distribution. This knowledge holds potential implications for climate forecasting and resource management within the region, emphasising the relevance of the findings for future applications and research.
ACKNOWLEDGEMENTS
The authors are indebted to the Malaysian Meteorological Department for providing the daily rainfall data used in the study. The authors would like to express their gratitude to the Ministry of Higher Education (MOHE) for the funding under the Fundamental Research Grant Scheme (FRGS/1/2020/STG06/UTM/02/3) under vote 5F311. We are also grateful to Universiti Teknologi Malaysia for supporting this project.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.