A hybrid regionalization method based on canonical correlation analysis and cluster analysis: a case study in northern Iran

The performance of regionalization methods used for regional flood frequency analysis is affected considerably by the features used to identify the homogeneous regions (e.g., climatological, meteorological, geomorphological, and physiographic characteristics of the watersheds). In this study, a regionalization method is proposed that combines two widely used techniques in the regionalization of watersheds: canonical correlation analysis and cluster analysis. In the proposed method, canonical correlation analysis is used to select or weight the features that are then used by a hybrid clustering algorithm for regionalization of watersheds. The proposed method is applied to the Sefidrud basin, located in the north of Iran, to implement regionalization with two, three, four, and five regions. Performance assessment shows that all the options of the proposed method can be effective alternatives to some common regionalization methods for improving the homogeneity of the regions. The results indicate that the method can approximately satisfy the homogeneity conditions for all the regions identified in the study area.
doi: 10.2166/nh.2019.105
Ali Ahani, S. Saeid Mousavi Nadoushani (corresponding author), Ali Moridi
Department of Water Resources Management, Faculty of Civil, Water and Environmental Engineering, Shahid Beheshti University, Tehran, Iran
E-mail: sa_mousavi@sbu.ac.ir


INTRODUCTION
Flood frequency analysis (FFA) is used to estimate the magnitude of a flood with a specified return period, or the return period of a flood with a specified magnitude. Flood quantiles can be estimated by at-site FFA using only the flood data recorded at the site of interest. However, in many cases, the flood data records at the sites of interest are too short to provide reliable flood estimates. In such situations, regional flood frequency analysis (RFFA) is an efficient approach that compensates for the temporal shortage of flood data by pooling data over a number of sites with similar flood generation mechanisms.
The objective of regionalization is to identify homogeneous regions, i.e., groups of sites with similar flood generation mechanisms (Hosking & Wallis). In a homogeneous region, the flood frequency distributions of the sites are identical except for a site-specific scale factor called the index flood. RFFA based on the index flood was first introduced by Dalrymple.
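The index-flood idea can be illustrated with a minimal sketch: a single dimensionless regional growth curve is scaled by each site's index flood to obtain at-site quantiles. The function name, the growth factors, and the index-flood values below are hypothetical and are not taken from the study.

```python
def at_site_quantiles(index_flood, growth_curve):
    """Scale a dimensionless regional growth curve by a site's index flood."""
    return {period: index_flood * factor for period, factor in growth_curve.items()}

# Hypothetical regional growth curve: return period (years) -> growth factor
regional_growth_curve = {2: 0.9, 10: 1.8, 100: 3.0}

# Two hypothetical sites of the same region with different index floods (m^3/s)
site_a = at_site_quantiles(120.0, regional_growth_curve)
site_b = at_site_quantiles(40.0, regional_growth_curve)
```

Under this assumption, the ratio of the quantiles of two sites is the same for every return period, which is what homogeneity means in the index-flood sense.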
Identifying homogeneous regions is a prerequisite for finding an appropriate regional flood frequency distribution in RFFA. However, it is not always easy to identify a group of sites that satisfies the homogeneity conditions. When the regionalization features are not directly related to the flood data records (e.g., climatological, meteorological, geomorphological, and physiographic characteristics of the watersheds), it is difficult to assign all the watersheds to regions that all satisfy the homogeneity conditions.
One of the most widely used methods for regionalization is cluster analysis. Cluster analysis methods are multivariate statistical methods that have been used by many researchers in hydrological studies, especially for regionalization (e.g., Acreman [...]).
Region of influence (ROI), developed by Burn, is another widely used approach for RFFA. In ROI, a region of influence, which is a hydrologically homogeneous neighborhood, is formed for each watershed in the study area. ROI has been used in several regional frequency analysis studies, and its performance has been evaluated in different case studies (e.g., Zrinji [...]).
[...] On the other hand, a noticeable limitation of hierarchical algorithms is that once a data point has been assigned to a cluster, it cannot be moved to another cluster. Partitional algorithms, which are often based on minimizing an objective function, require initial conditions such as the initial cluster centers, but they allow data points to move between clusters from one iteration of the algorithm to the next.
One of the most widely used partitional clustering algorithms in regional frequency analysis studies is the K-means algorithm. [...] showed that, in general, the combination of the Ward and SOM algorithms with the FCM algorithm provides the best results for regionalization of the studied region for regional flood frequency analysis. The benefits of cluster analysis methods for regionalization of watersheds are their ability to handle multivariate problems and their reduced need for visual judgment and time-consuming assessment. However, some issues may affect the efficiency of cluster analysis methods for regionalization of watersheds. In regionalization by cluster analysis methods, [...]
Longitude, latitude, elevation above sea level, drainage area, mean annual precipitation, and the runoff coefficient were selected as the watershed features contributing to the regionalization procedure (Table 1). It is worth noting that four sites have runoff coefficients greater than 1 because their watersheds are located in areas with karst geologic structures.
The features were selected based on the availability of the relevant data and their potential role in the flood generation mechanism. The longitude, latitude, and elevation above sea level were selected because of the special geographical situation of the study area. In fact, the noticeable [...]
The features were standardized as in Equation (1):
$$ y_{ij} = \frac{x_{ij} - \bar{x}_j}{S_j} \quad (1) $$
where x_ij is the value of feature j in data point i; \bar{x}_j is the mean value of feature j over the dataset; S_j is the standard deviation of feature j over the dataset; and y_ij is the standardized value of feature j for data point i.
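The standardization described above (subtract the feature mean, divide by the feature standard deviation) can be sketched as follows. The feature values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev

def standardize(rows):
    """Column-wise standardization: y_ij = (x_ij - mean_j) / S_j."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    sds = [stdev(c) for c in columns]  # sample standard deviation (assumption)
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

# Hypothetical values: drainage area (km^2), mean annual precipitation (mm)
features = [[100.0, 300.0], [200.0, 500.0], [300.0, 400.0]]
z = standardize(features)
```

After this step every feature has zero mean and unit standard deviation, so features measured in different units contribute on a comparable scale to distance-based clustering.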

Discordancy evaluation
Prior to the feature selection or feature weighting steps, the flood data records were evaluated using the discordancy measure D proposed by Hosking & Wallis, which is defined in terms of the L-moments of the flood data. A site is identified as discordant if D exceeds the critical value. When the number of sites is greater than 14, the critical value of D is equal to 3. This screening procedure can be performed either before regionalization, for all the sites as one group, or after regionalization, for the sites belonging to each region. Among the 39 sites, two were identified as discordant and were excluded from the regionalization process, as suggested by Hosking & Wallis. Therefore, the data of the 37 remaining sites were used in the next stages of the study.
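A minimal sketch of the discordancy screening is given below, assuming the standard Hosking & Wallis form D_i = (N/3)(u_i - ū)^T A^{-1} (u_i - ū), where u_i holds the sample L-CV, L-skewness, and L-kurtosis of site i and A is the sum-of-squares and cross-products matrix of the deviations. The data are synthetic and the function name is hypothetical; NumPy is assumed to be available.

```python
import numpy as np

def discordancy(lmom_ratios):
    """Discordancy D_i for N sites from an (N, 3) array of L-moment ratios."""
    u = np.asarray(lmom_ratios, dtype=float)
    n = u.shape[0]
    dev = u - u.mean(axis=0)
    a = dev.T @ dev  # 3x3 sum-of-squares and cross-products matrix
    # Quadratic form dev_i^T A^{-1} dev_i evaluated for every site i
    return (n / 3.0) * np.einsum("ij,jk,ik->i", dev, np.linalg.inv(a), dev)

rng = np.random.default_rng(0)
ratios = rng.normal([0.30, 0.20, 0.15], 0.03, size=(20, 3))  # synthetic sites
ratios[0] = [0.55, 0.50, 0.40]                                # deliberately outlying site
d = discordancy(ratios)
discordant = np.flatnonzero(d > 3.0)  # critical value 3 for more than 14 sites
```

A useful sanity check on any implementation is that the D values of a group of N sites always sum to N.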

Canonical correlation analysis
In CCA, a canonical space is formed based on two sets of canonical variables. Each canonical variable is a linear combination of one of the two sets of original variables.
If the original variables comprise two sets, the two sets of canonical variables are formed as in Equation (2), such that the correlation between corresponding pairs of canonical variables is maximized and the correlation between all other pairs is minimized. The highest correlation is that between the canonical variables V_1 and W_1, and the lowest correlation is that between the canonical variables V_p and W_p. For more details on CCA, see Hotelling and Cavadias.
In the present study, the six watershed features and [...]
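As a rough illustration of how canonical variables and their correlations can be computed, the sketch below uses a QR decomposition of each centered variable set followed by a singular value decomposition. This is one common numerical route to CCA and is not necessarily the procedure used by the authors. The data are synthetic stand-ins (37 sites, six features, and an assumed three L-moment ratios), and all names are hypothetical.

```python
import numpy as np

def canonical_correlations(x, y):
    """Return canonical correlations and coefficient matrices for two variable sets."""
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    qx, rx = np.linalg.qr(xc)  # orthonormal basis of each centered set
    qy, ry = np.linalg.qr(yc)
    u, s, vt = np.linalg.svd(qx.T @ qy, full_matrices=False)
    a = np.linalg.solve(rx, u)     # coefficients: V = xc @ a
    b = np.linalg.solve(ry, vt.T)  # coefficients: W = yc @ b
    return np.clip(s, 0.0, 1.0), a, b

rng = np.random.default_rng(1)
features = rng.normal(size=(37, 6))  # synthetic standardized watershed features
lmom = features[:, :3] @ rng.normal(size=(3, 3)) * 0.5 + rng.normal(size=(37, 3))
r, a, b = canonical_correlations(features, lmom)
v = (features - features.mean(axis=0)) @ a  # canonical variables V_1..V_3
w = (lmom - lmom.mean(axis=0)) @ b          # canonical variables W_1..W_3
```

The singular values come out in descending order, so `r[0]` is the correlation of the first pair (V_1, W_1), and non-corresponding pairs such as (V_1, W_2) are uncorrelated.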

Cluster validity index
To Rousseeuw () defined the silhouette width for a data point i in a clustered dataset, as Equation (3): where a(i) is the average distance of the data point i from the data points with which it is placed in the same cluster; and b(i) is the minimum average distance from the data point i  In the current study, the heterogeneity measures H are used to assess the homogeneity of the regions and a region is considered as homogeneous if H 1 < 1 and H 2 < 1 and H 3 < 1.

RESULTS AND DISCUSSION
The coefficients of the standardized watershed features and the standardized L-moment ratios in the linear combinations that form the canonical variables are presented in Table 3. The canonical variables of the watershed feature space are denoted by V_1, V_2, and V_3, and the canonical variables of the L-moment ratio space are denoted by W_1, W_2, and W_3.
As shown in Table 4, the correlation coefficient between the first pair of canonical variables is considerably greater than those between the second and third pairs. The values of the correlation [...] In Table 5 [...]
It should be noted that the words 'cluster' and 'region' are used interchangeably in the rest of the article.
The average silhouette width (ASW) values for regionalization by each method with two, three, four, and five regions are presented in Table 6. [...] In Table 7 [...]
The percentage of homogeneous regions is defined as in Equation (4):
$$ p_H = \frac{n_H}{n_t} \times 100 \quad (4) $$
where p_H is the percentage of homogeneous regions, n_H is the number of homogeneous regions, and n_t is the total number of regions across the two-, three-, four-, and five-region states (n_t = 2 + 3 + 4 + 5 = 14).
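The percentage of homogeneous regions, combined with the criterion that a region is homogeneous when H_1, H_2, and H_3 are all below 1, can be sketched as follows. The H values are invented placeholders, not results from the study, and the function names are hypothetical.

```python
def is_homogeneous(h1, h2, h3, threshold=1.0):
    """A region counts as homogeneous only if all three H measures are below the threshold."""
    return h1 < threshold and h2 < threshold and h3 < threshold

def percent_homogeneous(h_by_state):
    """p_H = 100 * n_H / n_t over all regions of all region-count states."""
    regions = [h for state in h_by_state.values() for h in state]
    n_h = sum(is_homogeneous(*h) for h in regions)
    return 100.0 * n_h / len(regions)

# Placeholder (H1, H2, H3) triples for the 2-, 3-, 4-, and 5-region states
h_values = {
    2: [(0.5, 0.2, -0.1), (1.4, 0.3, 0.1)],
    3: [(0.1, 0.4, 0.2), (0.9, 0.8, 0.7), (-0.3, 1.2, 0.5)],
    4: [(0.2, 0.1, 0.0), (0.7, -0.4, 0.3), (0.95, 0.5, 0.2), (2.1, 1.5, 0.9)],
    5: [(0.3, 0.3, 0.3), (0.6, 0.1, -0.2), (-0.5, 0.0, 0.4),
        (0.8, 0.9, 0.99), (1.0, 0.2, 0.1)],
}
n_t = sum(len(state) for state in h_values.values())
p_h = percent_homogeneous(h_values)
```

With these placeholder values, 10 of the 14 regions pass the criterion, so p_H is about 71.4%.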
The results indicate that among the methods and their implementation options, the best performance in providing [...]
However, as seen in Table 6 [...] Table 7, by adding V_2 and V_3 to the regionalization feature vectors, the homogeneity of the regions improves to some extent according to the heterogeneity measures H_2 and H_3. As seen in Table 6, V_2 has a more considerable effect than V_3 on the homogeneity improvement.
In general, the results of calculating the heterogeneity indi[...]
After the homogeneity assessment, the size of the regions, i.e., the total number of flood data contained in each region, was evaluated, and the assignment of the sites to the regions was studied. For this evaluation, CCA-WAKM_{1,2,3} is selected as the best option to represent CCA-WAKM, because it identifies homogeneous regions better than the other CCA-WAKM options.
For WAKM and CCA-wWAKM, there is no need to choose an optimal option, because each of them has only one implementation option. [...] of flood data recorded in each region (station-years) for two, three, four, and five regions.
As seen in Figure 10, CCA-WAKM_{1,2,3} identifies more regions smaller than 100 station-years than the other methods do. Given that the average flood data record length at the selected sites is about 23 years, a region larger than 100 station-years can include at least four watersheds with the average record length. It should be noted that the main goal of RFFA is to increase the reliability of flood estimates by increasing the number of flood data pooled from several sites in a homogeneous region. A regionalization that yields small regions may not serve this goal well and therefore cannot be the optimal option for RFFA. Indeed, RFFA is characterized by a trade-off between the size of the region (i.e., the number of flood data in station-years) and its homogeneity: usually, the larger the identified pooling group of sites, the higher the expected heterogeneity.
Moreover, the target size of the region depends on the return period associated with the target flood quantile (see, e.g., Cunnane; Jakob et al.). Therefore, the CCA-wWAKM method in the three-, four-, and five-region states seems to provide better results than CCA-WAKM_{1,2,3}.
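The region size used in this comparison, i.e., the total record length in station-years, can be computed with a short sketch like the one below. The record lengths and region labels are hypothetical, and the 100 station-year threshold is the one discussed above.

```python
def region_sizes(record_lengths, labels):
    """Sum the record lengths (years) of the sites assigned to each region."""
    sizes = {}
    for years, region in zip(record_lengths, labels):
        sizes[region] = sizes.get(region, 0) + years
    return sizes

# Hypothetical record lengths (years) and region assignments for seven sites
record_lengths = [23, 30, 18, 25, 40, 12, 20]
labels = ["A", "A", "A", "A", "A", "B", "B"]

sizes = region_sizes(record_lengths, labels)
small_regions = [region for region, size in sizes.items() if size < 100]
```

Here region A pools 136 station-years while region B pools only 32, so B would be flagged as a small region offering little gain over at-site analysis.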
Considering the excellent performance of CCA-wWAKM in identifying homogeneous regions with appropriate spatial proximity and a more balanced assignment of the sites to the regions, this method can be selected as the optimal option for regionalization of watersheds in the Sefidrud basin for RFFA. In addition, the distribution of the sites across the regions identified by CCA-wWAKM is more balanced, in terms of the number of flood data contained in the regions, than that of CCA-WAKM. In [...]
Also, access to a larger number of watershed features can lead to a more accurate judgment about the advantages and disadvantages of the proposed method.