Evaluation of spatial-temporal characteristics of precipitation using discrete maximal overlap wavelet transform and spatial clustering tools

In the present study, classical and proposed methods were used to investigate the monthly precipitation characteristics of 30 stations in the southeastern United States during 1968-2018. The maximal overlap discrete wavelet transform (MODWT) was used as a preprocessing method together with K-means clustering. First, the monthly precipitation time series of the stations were decomposed into several subseries using MODWT with Daubechies (db) as the mother wavelet. Then, the energy values of these subseries were calculated and used as inputs to the K-means and radial basis function (RBF) methods. For both the classical and proposed methods, the optimum number of clusters for the considered stations was five. Before the data were used as input to the RBF method, their spatial correlation was evaluated with a variogram. Based on the clustering results and the latitude and longitude variations of the stations, precipitation at the stations decreased as the energy of the clusters increased, and vice versa. The silhouette coefficient of clustering was 0.3 for the classical method and 0.8 for the proposed method, which indicates better clustering of the selected area with the proposed method.


INTRODUCTION
Assessing precipitation variation over a large area can provide valuable information for water resources management and engineering, particularly in a changing climate (Mishra et al. ; Wei et al. ). Because precipitation participates in the global hydrologic and energy cycles, its variability is important for understanding the behavior of, and changes in, the Earth's climatic system (Lettenmaier et [...] () investigated rainfall data from 14 rain gauges in the western mountain range of the Ecuadorian Andes. They studied spatial and temporal rainfall patterns and showed that spatial variability in average rainfall was very high. Significant correlations were also found between average daily rainfall and geographical location. Cheng et al. () evaluated a rain-gauge network using geostatistical methods to calculate the mean precipitation in areas without stations. They showed that annual rainfall exhibits a significant orographic effect and less spatial variability, whereas hourly rainfall exhibits higher variability in space and the spatial variation structures vary among different [...]
Hybrid wavelet analysis has been used to improve the ability of models to capture the multiscale features of hydrological time series (e.g., Agarwal et [...]
On the other hand, clustering techniques can be applied to identify structure in an unlabeled precipitation data set. Clustering techniques objectively organize data into homogeneous groups in which the within-group-object similarity is maximized and the between-group-object similarity is minimized [...] In both cases, the number of clusters needs to be determined in advance.
In general, the input data for clustering have a significant effect on the output results. However, not all properties of the input variables are required for area clustering. Therefore, appropriate methods are needed to extract the desired characteristics from the input data. The original precipitation data contain a large amount of information, so a multiscale approach is needed to select the desired components and eliminate undesirable or redundant information.
Because of the dynamic characteristics and non-uniform distribution of precipitation data, and the need to identify homogeneous precipitation areas in water resources management, a temporal-spatial model is proposed to investigate precipitation characteristics. Monthly precipitation data from 30 precipitation stations in the United States during the period 1968-2018 were used, and the temporal-spatial characteristics of precipitation were investigated using two methods, classical and proposed. In the proposed method, the data were first decomposed into several subseries using the maximal overlap discrete wavelet transform (MODWT), and the energy value of each subseries was calculated. To avoid misapplication of the wavelet transform (WT), the precipitation datasets were analyzed using MODWT and suitable boundary extensions were selected. These data were then used as inputs to the K-means clustering method. The main objectives of this study are as follows:
• Investigating the benefits of the MODWT in evaluating the spatial-temporal characteristics of precipitation.

Study area
In this study, precipitation data from 30 stations in the southeastern United States, around the city of Atlanta in the state of Georgia, were used to investigate the temporal-spatial variability of precipitation. Most of the southeastern United States is dominated by a humid subtropical climate and receives uniform precipitation throughout the year. Atlanta has hot and humid summers and mild winters.
The mean annual precipitation in Atlanta is 50.2 inches (1,280 mm). Atlanta's area is 347.1 square kilometers, of which 344.9 square kilometers is land and 2.2 square kilometers is water. Atlanta is located among the foothills of the Appalachian Mountains. Among major cities east of the Mississippi River, Atlanta has the highest elevation. Figure 1 and Table 1 show the location of the selected stations.

Maximal overlap discrete wavelet transform (MODWT)
MODWT was considered as the discrete wavelet transform [...] The MODWT can be expressed as Equation (1), where $W_j$ denotes the wavelet coefficients, which capture local fluctuations over the whole period of a time series at [...] The coefficients $W_j$ and $V_j$ can be written as Equation (2), using the rescaled high-pass filter ($\tilde{h}_{j,l} \equiv h_{j,l}/2^{j/2}$) and low-pass filter ($\tilde{g}_{j,l} \equiv g_{j,l}/2^{j/2}$); $\{h_{j,l}\}$ and $\{g_{j,l}\}$ are the $j$th-level DWT high- and low-pass filters; and $L$ is the highest decomposition level. As in the DWT, the filters are determined by the mother wavelet.
In the energy calculation, $P$ is the precipitation amount, $n$ represents the month of the time series, and $E$ denotes the energy value.
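The decomposition and energy steps described in this section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: it assumes periodic (circular) boundary handling, uses the db2 filter (whose coefficients have a simple closed form) as a stand-in for the mother wavelet, and assumes that the energy of a subseries is the sum of its squared values. The function names `db2_modwt_filters`, `modwt`, and `energy` are hypothetical.

```python
import math


def db2_modwt_filters():
    """Return rescaled (MODWT) high- and low-pass filters for db2.

    Assumption: db2 is used only because its coefficients have a closed form.
    """
    s3 = math.sqrt(3.0)
    d = 4.0 * math.sqrt(2.0)
    g = [(1 + s3) / d, (3 + s3) / d, (3 - s3) / d, (1 - s3) / d]  # DWT low-pass
    width = len(g)
    # Quadrature mirror relation gives the DWT high-pass filter.
    h = [(-1) ** l * g[width - 1 - l] for l in range(width)]
    r = math.sqrt(2.0)
    # MODWT filters are the DWT filters divided by sqrt(2).
    return [x / r for x in h], [x / r for x in g]


def modwt(x, level):
    """Pyramid MODWT with circular filtering; returns ([W_1..W_J], V_J)."""
    h, g = db2_modwt_filters()
    n = len(x)
    v = list(x)
    details = []
    for j in range(1, level + 1):
        step = 2 ** (j - 1)  # filter taps are spread out by 2^(j-1) at level j
        w_j = [sum(h[l] * v[(t - step * l) % n] for l in range(len(h)))
               for t in range(n)]
        v_j = [sum(g[l] * v[(t - step * l) % n] for l in range(len(g)))
               for t in range(n)]
        details.append(w_j)
        v = v_j
    return details, v


def energy(series):
    """Sum of squared values (assumed definition of subseries energy)."""
    return sum(p * p for p in series)


# Hypothetical monthly series (four years), decomposed to level 4.
monthly = [float((7 * i) % 13) for i in range(48)]
details, smooth = modwt(monthly, 4)
subseries_energy = [energy(w) for w in details] + [energy(smooth)]
```

With an orthonormal filter and circular filtering, the subseries energies add up to the energy of the original series, which gives a quick sanity check on any implementation.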

K-means clustering method
Clustering is the process of dividing a given set of patterns into separate groups, so that similar patterns remain in the same cluster and dissimilar patterns fall into other clusters. K-means clustering is used for vector quantization and, more generally, for cluster analysis in data mining. Given an initial set of k means, the algorithm proceeds by alternating between two steps:
• Assignment step: assigning each observation to the cluster whose mean has the least squared Euclidean distance.
• Update step: recalculating each mean as the centroid of the observations assigned to that cluster.
[...] Wright (), RBF interpolation has been used to approximate differential operators, integral operators, and surface differential operators. The RBF mappings give an interpolating function which passes exactly through every data point.
If noise is present in the data, an interpolating function that averages over the noise gives the best extension.
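The alternating K-means procedure described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the first k points are taken as the initial means so that the run is deterministic, whereas practical applications usually use random or repeated initialization.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Return (labels, means) after alternating assignment and update steps."""
    # Assumption: the first k points serve as initial means (illustration only).
    means = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest mean by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, means[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change, so the means are stable
        labels = new_labels
        # Update step: each mean becomes the centroid of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old mean if a cluster ends up empty
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, means


# Hypothetical two-feature inputs (for example, two subseries energies per station).
stations = [(0.0, 0.0), (10.0, 10.0), (0.1, 0.2), (9.9, 10.1), (0.2, 0.1), (10.2, 9.8)]
labels, means = kmeans(stations, 2)
```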

Considered methodology
The main aim of this study was to investigate the capability of new methods in temporal and spatial analysis of [...]
(1) The number of inputs in the modeling process should be reduced.
(2) The role of chance in selecting inputs should be reduced (i.e., the selection of inputs should not be done by chance).
(3) Methods with higher accuracy should be used.

Decomposition of precipitation time series and calculating the subseries energy values
To investigate the temporal-spatial variations of precipitation in the study area, the time series were first decomposed into several subseries via MODWT. In this study, db was used as the mother wavelet, since this type of mother wavelet has been widely used in hydrological studies. The db wavelets provide compact support for time series, meaning that these wavelets have non-zero basis functions over a finite interval. In the proposed model, to find the best decomposition level and the best db number, the monthly time series were decomposed at levels from two to six, and db numbers from two to five were considered. Among the resulting 20 combinations, the best case was selected by calculating the root mean square error (RMSE). On this basis, a decomposition level of four and a db number of three were selected for time series decomposition. To use the MODWT method, some data must be removed from the left side of the time series, whereas in the DWT method data must be removed from both the left and right sides. The MODWT requires an infinite signal $P_t$, in which $t = \ldots, -1, 0, 1, \ldots, N-1, N$, whereas in reality the data are measured over a finite interval at discrete times. To use this method, the time series must be extended to determine the unobserved values ($P_0, P_{-1}, \ldots$ and $P_{N+1}, P_{N+2}, \ldots$) prior to preprocessing.
Therefore, the right end of the P series (i.e., $P_{N+1}, P_{N+2}, \ldots$) should be extended properly, and special attention should be paid to the values affected by the boundary conditions. There are two approaches to handling the boundary effect: data modification and wavelet modification. In this study, boundary conditions were handled following Percival et al. (). The extension of the data is determined by the decomposition level $j$, the filter length $L$ (twice the db number), and the number of omitted data $L_j$. In Figure 4, the results of decomposing the station S2 time series with and without data removal are shown.
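As a rough guide to how many values are affected by the boundary, the relation between j, L, and L_j can be sketched in code. The formula below is an assumption: it is the standard equivalent-filter width for level j, L_j = (2^j - 1)(L - 1) + 1, from the wavelet literature, and it may differ from the exact expression used in this study. The function name is hypothetical.

```python
def equivalent_filter_width(level, filter_length):
    """Width L_j of the level-j equivalent wavelet filter.

    Assumed formula: L_j = (2**j - 1) * (L - 1) + 1, where L is the
    width of the base filter (twice the db number).
    """
    return (2 ** level - 1) * (filter_length - 1) + 1


# Settings reported in the text: decomposition level 4 and db3 (L = 6).
width_level_4 = equivalent_filter_width(4, 6)  # (16 - 1) * 5 + 1 = 76
```

Under this assumption, the number of affected values grows roughly as 2^j, so deeper decomposition levels leave fewer boundary-free coefficients in a 612-month series.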

Clustering the study area
After the subseries energy values were calculated, clustering was performed for both the classical and proposed models and the results were compared. The clustering process requires the number of clusters to be determined first; therefore, K-means was run with the number of clusters varying between 2 and 10. The best number of clusters was selected based on the silhouette coefficient (SC) and the dispersion of stations, and was found to be five. In the classical method, all monthly precipitation data were used as input, so 612 values were used for each station. The silhouette coefficient obtained for the classical method was 0.31, which indicated relatively poor clustering.
Figure 6(a) shows the silhouette coefficient of each cluster in the classical method. As can be seen, cluster 5 is a single-member cluster that is dissimilar to the other stations. Also, clusters 2 and 3 had negative silhouette coefficients. In the proposed method, to select the best input data for clustering with five clusters, the energy [...] The results are presented in Table 2 and Figure 6. When variables W2 and V4 were used as inputs, the silhouette coefficient improved to 0.8. Another reason for selecting variables W2 and V4 as the best inputs was their lower data variability in comparison with the other variables. According to the results, stations S18 and S24 had the highest silhouette coefficients (SC = 0.98 and SC = 0.97, respectively), and stations S30 and S22 had the lowest (SC = 0.11 and SC = 0.37, respectively). In Figure 6(b), the clusters are marked on the map with different shapes. The results suggest that the stations are clustered appropriately. The clusters were located as follows: cluster 2 (rhombus) in the southern part, cluster 3 (square) in the north and northwest parts, cluster 5 (triangle) in the east part, and clusters 1 and 4 (star and circle, respectively) from the east to the west parts.
Also, Figure 7 shows the energy values of the decomposed subseries. According to this figure, it could be deduced [...] Table 3.
Figure 8(b) shows that precipitation values decreased as energy values increased, and vice versa.
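The silhouette coefficient used to compare the two clusterings can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes Euclidean distance, at least two clusters, and the common convention that a single-member cluster receives a silhouette value of 0.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_values(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b); needs >= 2 clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    values = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points."""
    vals = silhouette_values(points, labels)
    return sum(vals) / len(vals)


# Hypothetical example: two compact, well-separated groups.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
sc_good = silhouette_coefficient(pts, [0, 0, 1, 1])
```

Values near 1 indicate compact, well-separated clusters, and negative values indicate points that lie closer to another cluster than to their own, which is the pattern reported above for clusters 2 and 3 of the classical method.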

Radial basis functions (RBF) method results
In this study, two parameters, precipitation and energy, were used as inputs to the RBF method, which is an interpolation method. Before the precipitation and energy values were interpolated at locations without data, the data distribution was investigated. One useful tool for examining the data distribution is the QQ plot. In this study, the QQ plot showed that the data were normally distributed. After this preliminary analysis, the semi-variance graphs were plotted. Semi-variance analysis reveals the correlations, trends, and covariance between points. Figure 9(a) shows the semi-variance and covariance graphs for monthly precipitation and energy. Also, Table 4 lists the characteristics of the precipitation and energy semi-variance models. To express the robustness of the spatial structure of a variable, the ratio C0/(C + C0) (where C0 is the nugget effect and C is the partial sill) can be used to assess how much of the total variability is explained by the nugget effect.
Since the ratios obtained for both monthly precipitation and energy were less than 1/2, it could be deduced that the unstructured component plays a smaller role than the structured component. Therefore, the investigated parameters had a strong spatial structure. According to Figure 9(b), the covariance between mean monthly precipitation and station distance indicated a significant positive correlation between these two variables. This correlation was higher at short distances and, as the distance increased, the effect of station distance on precipitation became ineffective. The covariance between energy and station distance also showed a significant positive correlation between these two variables. The correlation was higher at short distances and, as the distance increased, the effect of station distance on the amount of energy decreased. The RBF method has five models. These models can be evaluated using different performance criteria to select the most appropriate one. In this study, different models with different parameters were tested, and after interpolation via the RBF method the model with the lowest root mean square error (RMSE) and the highest coefficient of determination (R²) was considered the best model. The results are shown in Figure 9(c).
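The semi-variance analysis and the nugget ratio discussed above can be sketched as follows. This is an illustrative sketch with hypothetical function names; it assumes planar station coordinates with Euclidean distances and fixed-width distance bins. Fitting a semi-variance model to obtain the nugget and partial sill is not shown.

```python
import math


def empirical_semivariogram(coords, values, bin_width, n_bins):
    """Return the semi-variance per distance bin (None for empty bins).

    gamma(h) = sum over pairs in the bin of (z_i - z_j)^2, divided by 2 * N(h).
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            h = math.dist(coords[i], coords[j])  # planar distance (assumption)
            b = int(h // bin_width)
            if b < n_bins:
                sums[b] += (values[i] - values[j]) ** 2
                counts[b] += 1
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]


def nugget_ratio(nugget, partial_sill):
    """C0 / (C + C0): the share of total variability due to the nugget effect."""
    return nugget / (nugget + partial_sill)


# Hypothetical stations on a line with made-up values.
gamma = empirical_semivariogram([(0, 0), (1, 0), (2, 0)], [1.0, 2.0, 4.0], 1.0, 3)
ratio = nugget_ratio(1.0, 3.0)
strong_structure = ratio < 0.5  # threshold stated in the text
```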
From the results, it was found that the best model for both [...] According to Figure 10, in the south and southeast of the study area the monthly precipitation of the stations was lower than at the other stations, and in the north and northwest parts the monthly precipitation increased and reached its highest values. In the zoning based on energy values, it was found that the south and [...]
In the classical method, monthly precipitation data were used as input for clustering the study area. The results showed that in the proposed method, clustering based on the combination of the W2 and V4 subseries led to better results. The best number of clusters obtained was five. The silhouette coefficient was 0.3 for the classical method and 0.8 for the proposed method, which indicated appropriate clustering of the selected area using the proposed method. In the RBF modeling, the data distribution was first evaluated and the correlation of the data was verified. Then, the five most commonly used RBF models were tested and the best model was selected based on RMSE and R². The results showed that the best model for both mean monthly precipitation and energy was the spline with tension model. According to the results, an inverse relationship between monthly precipitation and subseries energy was obtained. It was found that the southeast of the selected area had the highest energy and the lowest precipitation values, and the northern parts had the highest precipitation and the lowest energy values. Also, variations of the precipitation and energy parameters were investigated in terms of the stations' latitude and longitude. It was observed that with increasing the stations' longitude and decreasing their latitude the amount of precipitation increased and the energy values decreased. In general, the proposed model yielded better results than the classical model because of its higher silhouette coefficient and greater similarity among stations within clusters.
The proposed model also performed better than the classical method because it required less input data and computational time.

DATA AVAILABILITY STATEMENT
All relevant data are available from an online repository or repositories.