Abstract
Lack of streamflow data is one of the main limitations in hydrologic studies. One way to address this problem is streamflow regionalization, whose most important stage is the identification of hydrologically homogeneous regions. In this study, homogeneous flow regions are identified by fuzzy c-means (FCM) cluster analysis based on morpho-climatic and streamflow characteristics at 208 stream gauges in the Amazon region. The optimal number of clusters was identified with the PBM validation index, which was maximized for ten clusters with a fuzzification parameter of 1.6. The ten resulting groups were well defined and demonstrated the Amazon's hydrologic similarity.
INTRODUCTION
Knowledge of the hydrologic behavior and flows in a river basin is very important for water resource planning and management. Flow rates, calculated from time series obtained at stream gauges, must be quantified. However, such hydrologic information is not always available, perhaps because there are few stream gauges and/or the observation periods are short. Observations are valid only at the measurement sites, and water resource projects are rarely located at stream gauge sites.
Streamflow data can be estimated by hydrologic regionalization, involving information transfer between locations, taking advantage of real data from another geographical area with similar characteristics. Regionalization techniques can be applied to such variables as rainfall and streamflow, probability distribution parameters, hydrologic indicators in general, rainfall-runoff model parameters and hydrologic functions, such as flow duration curves. However, for good regionalization, it is necessary to define regions with hydrologically homogeneous behavior – i.e., regions with hydrologic similarity between their physical and climatic characteristics.
Cluster analysis methods generally take a hierarchical or a partitional approach. Hierarchical and non-hierarchical algorithms are widely used to identify homogeneous regions of precipitation (Lin & Chen 2006; Álvarez et al. 2011; Nasseri & Zahraie 2011; Farsadnia et al. 2014; Awan et al. 2015) and/or discharge (Kahya et al. 2008; Srinivas et al. 2008; Rianna et al. 2011; Tsakiris et al. 2011; Dikbas et al. 2012; Latt et al. 2015).
In the partitional approach, the fuzzy c-means (FCM) method has been used frequently in hydrology to identify homogeneous streamflow regions (Rao & Srinivas 2006). The FCM algorithm, a generalization of the K-means algorithm by Bezdek (1981), allows a feature vector to belong to more than one cluster, with a different degree of pertinence (membership) in each. Rao & Srinivas (2006) used the FCM algorithm to determine statistically homogeneous regions in Indiana, USA, for regional flow analysis. Sadri & Burn (2011) employed FCM procedures to define homogeneous regions in the Canadian provinces of Alberta, Saskatchewan and Manitoba. The methodology was applied to the hydrologic records from 36 flow monitoring sites, based on bivariate criteria (severity and duration). The authors confirmed the importance of the methodology for delimiting homogeneous regions.
Satyanarayana & Srinivas (2011) presented an approach based on FCM cluster analysis by which homogeneous rainfall regions in India could be identified with large-scale atmospheric variables, location attributes and rainfall seasonality. In a study classifying rainfall series from 188 rain gauges in Turkey, Dikbas et al. (2013) applied the FCM method and found that six was the ideal number of hydrologically homogeneous regions, on the basis of total annual precipitation and its coefficient of variation, and latitude and longitude. Sahin & Cigizoglu (2012) applied cluster analysis methods, including the Ward method and a combination of neural and FCM methods, to identify homogeneous climate and precipitation sub-regions in Turkey. With >95% performance, the neuro-fuzzy method proved to be applicable to cluster analysis problems. Nourani & Komasi (2013) used the Integrated Geomorphological Adaptive Neuro-Fuzzy Inference System (IGANFIS) model for rainfall-runoff modeling at several stations in the Eel River Basin, California, USA. Input data were classified into clusters (homogeneous groups) by the FCM method to improve the model's efficiency. Goyal & Gupta (2014) identified four homogeneous rainfall regions in northeastern India using FCM cluster analysis. Bharath & Srinivas (2015) employed FCM methodology to delimit homogeneous hydro-meteorological regions, with precipitation and temperature as the two key variables.
Although the FCM algorithm always converges, it does not always reach the global minimum of the objective function, because its results depend strongly on the initial values, which are usually assigned at random. Thus, it is necessary to determine which partitions are significant. Partitions are evaluated with a cluster validation index, which establishes which partition yields the best grouping structure for a dataset. Several cluster validation indexes have been proposed. For example, the PBM index (Pakhira et al. 2004) evaluates the geometric structure of a partition, that is, whether the clusters generated are well defined and well separated. It is a maximizing index: the higher the calculated PBM, the better the quality of the partition.
In this study, the FCM method was applied to a database of variables that explain streamflow in order to identify homogeneous streamflow regions in the Amazon region, and the resulting partitions were validated with the PBM index.
MATERIALS AND METHODS
Study Area
The study area involves watersheds in the Amazon region, between 5°N and 18°S, and 42°W and 74°W. Within Brazil it includes several states – Acre, Amapá, Amazonas, Mato Grosso, Pará, Roraima, Rondônia, Tocantins and part of Maranhão. It also extends into neighboring countries like French Guiana, Venezuela, Colombia, Peru and Bolivia (Figure 1).
Data
Streamflow behavior in a river basin may be influenced by both morphometry and climate. The drainage area, for instance, defines the limit of the recharge area that supplies the basin. The amount of precipitation contributes to the rivers’ streamflow volumes. River length and basin perimeter are related to the basin's shape, and may influence the time of concentration and the maximum rate of streamflow. The data used in this study to identify homogeneous flow regions are the drainage area (A), basin perimeter (Pe), river length (L), mean annual precipitation (P), and long-term mean flow (Qm). These variables were chosen, among those related to flow, because they are relatively easy to obtain with current Geographic Information Systems (GIS). Rainfall and streamflow data were retrieved from the Brazilian Water Agency (ANA) database system.
Initially, 208 rainfall and streamflow gauges were selected (Figure 1) from the ANA Hydrological Information System, HIDROWEB (ANA 2016). The stations were chosen because their records were consistent over the historical period 1975 to 2012. Rainfall and streamflow data were stored in electronic spreadsheets, and the average annual rainfall and long-term mean streamflow were calculated for each gauge. The basins’ drainage areas, main-river lengths and perimeters were delimited using GIS with the Brazilian Digital Elevation Model – MDE (Miranda 2005).
The variables range between 491 and 3,911,283 km² (A); 134 and 17,231 km (Pe); 36 and 427,531 km (L); 813 and 3,539 mm (P); and 2 and 170,013 m³/s (Qm).
FCM Algorithm
The FCM algorithm was proposed by Dunn (1973) and generalized by Bezdek (1981). It is a multivariate analytical technique that replaces the binary membership of classical set theory with degrees of pertinence (membership), so that an element may belong to one or more sets, each with a degree of pertinence between 0 and 1. Because of this flexibility, the results can be assumed to provide more information about hydrologic processes than conventional methods.
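For reference, the standard FCM formulation attributed to Bezdek (1981) can be sketched as follows. The notation is an assumption made here for illustration and may differ from the authors' own: x_i is the feature vector of gauge i (i = 1, ..., n), v_k is the centre of cluster k (k = 1, ..., c), u_ik is the degree of pertinence of gauge i in cluster k, m > 1 is the fuzzification parameter, and ε and t_max are the stopping tolerance and the iteration limit.

```latex
% Objective function minimised by FCM, with the membership constraints
J_m(U, V) = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{\,m} \, \lVert x_i - v_k \rVert^{2},
\qquad \sum_{k=1}^{c} u_{ik} = 1, \quad 0 \le u_{ik} \le 1

% Cluster-centre update: mean of the data weighted by u_{ik}^m
v_k = \frac{\sum_{i=1}^{n} u_{ik}^{\,m} \, x_i}{\sum_{i=1}^{n} u_{ik}^{\,m}}

% Membership update from the distances to all centres
u_{ik} = \left[ \sum_{j=1}^{c}
  \left( \frac{\lVert x_i - v_k \rVert}{\lVert x_i - v_j \rVert} \right)^{2/(m-1)}
\right]^{-1}

% Iterations stop when the memberships stabilise or the limit is reached
\lVert U^{(t+1)} - U^{(t)} \rVert < \varepsilon \quad \text{or} \quad t = t_{\max}
```

As m approaches 1 the memberships tend towards 0 or 1 (a hard partition), while larger values of m give progressively fuzzier partitions.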


PBM Validation index
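As a hedged reference, the PBM index is commonly stated in the form below. The symbols follow the same illustrative notation (x_i for feature vectors, v_k for cluster centres, u_ik for degrees of pertinence, c for the number of clusters) and are assumptions rather than the authors' exact notation; the centre of the whole dataset is written here as \bar{v}.

```latex
% PBM index for a partition into c clusters (larger values indicate a better partition)
\mathrm{PBM}(c) = \left( \frac{1}{c} \cdot \frac{E_1}{E_c} \cdot D_c \right)^{2}

% Within-cluster scatter weighted by the memberships
E_c = \sum_{k=1}^{c} \sum_{i=1}^{n} u_{ik} \, \lVert x_i - v_k \rVert

% Scatter of the data around a single centre (constant for a given dataset)
E_1 = \sum_{i=1}^{n} \lVert x_i - \bar{v} \rVert

% Largest separation between two cluster centres
D_c = \max_{k,\,l} \lVert v_k - v_l \rVert
```

The factor 1/c penalises a large number of clusters, E_1/E_c rewards compact clusters and D_c rewards well-separated centres. Some fuzzy variants raise u_ik to the power m in E_c; which variant applies in a given study should be checked against its source.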

RESULTS AND DISCUSSION
The FCM algorithm was applied to a data matrix of 208 feature vectors and five variables (A, Pe, L, P, and Qm). Values of the fuzzification parameter (m) between 1.5 and 2.0 were tested, following Ross (1995), as were numbers of clusters (c) between 2 and 15. A minimum error ε = 0.0001 and a maximum number of iterations tmax = 200 were used as stopping criteria. Algorithm performance is influenced by various parameters (e.g., m, c, ε, tmax) and even by the order of the data matrix. The optimal number of clusters was identified by applying the PBM validation index (Table 1, where each column corresponds to a value of m between 1.5 and 2.0). Figure 2 shows these results graphically. The PBM validation index reached its maximum for ten clusters (c = 10) with m = 1.6. In other words, the dataset is best divided into 10 groups.
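The procedure described in this paragraph can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names `fcm` and `pbm_index`, the random initialisation, the use of the largest change in any membership as the error measure, and the toy data are all assumptions made for illustration. The defaults mirror the values reported in the text (m = 1.6, ε = 0.0001, tmax = 200).

```python
import math
import random


def fcm(data, c, m=1.6, eps=1e-4, t_max=200, seed=0):
    """Fuzzy c-means sketch. Returns (centres, memberships, iterations used)."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    # Random initial membership matrix; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    power = 2.0 / (m - 1.0)
    for t in range(1, t_max + 1):
        # Centre update: mean of the data weighted by u_ik ** m.
        centres = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            sw = sum(w)
            centres.append([sum(w[i] * data[i][d] for i in range(n)) / sw
                            for d in range(dim)])
        # Membership update from the distances to the updated centres.
        new_u = []
        for x in data:
            dist = [math.dist(x, v) for v in centres]
            hits = [k for k, dk in enumerate(dist) if dk == 0.0]
            if hits:  # x coincides with a centre: it takes the full membership
                row = [1.0 / len(hits) if k in hits else 0.0 for k in range(c)]
            else:
                row = [1.0 / sum((dk / dj) ** power for dj in dist) for dk in dist]
            new_u.append(row)
        change = max(abs(a - b) for ra, rb in zip(new_u, u) for a, b in zip(ra, rb))
        u = new_u
        if change < eps:  # memberships have stabilised
            break
    return centres, u, t


def pbm_index(data, centres, u):
    """PBM index of a fuzzy partition (larger is better); needs two or more centres."""
    n, c, dim = len(data), len(centres), len(data[0])
    grand = [sum(x[d] for x in data) / n for d in range(dim)]
    e1 = sum(math.dist(x, grand) for x in data)
    ec = sum(u[i][k] * math.dist(data[i], centres[k])
             for i in range(n) for k in range(c))
    dc = max(math.dist(a, b) for a in centres for b in centres)
    return (e1 / (c * ec) * dc) ** 2


# Toy data (not from the study): two well-separated groups of four points.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
       (20.0, 0.0), (20.0, 1.0), (21.0, 0.0), (21.0, 1.0)]
centres, u, iterations = fcm(toy, c=2)
labels = [max(range(2), key=row.__getitem__) for row in u]
score = pbm_index(toy, centres, u)
```

In a search like the one reported here, `fcm` would be run for each combination of c and m and the combination with the largest `pbm_index` retained. Because the result depends on the random initialisation, repeating each run with several seeds is a reasonable precaution.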
Table 1. Application of the PBM index to FCM algorithm groupings (PBM index by number of clusters and fuzzification parameter m)
Number of clusters | m = 1.5 | m = 1.6 | m = 1.7 | m = 1.8 | m = 1.9 | m = 2.0 |
---|---|---|---|---|---|---|
2 | 4.65E+09 | 4.64E+09 | 1.02E+12 | 1.01E+12 | 1.00E+12 | 4.64E+09 |
3 | 2.38E+09 | 2.82E+10 | 2.37E+09 | 2.38E+09 | 2.55E+10 | 1.56E+11 |
4 | 1.29E+11 | 1.76E+10 | 1.48E+09 | 4.59E+10 | 4.00E+11 | 1.61E+11 |
5 | 1.01E+10 | 3.35E+10 | 1.47E+10 | 1.10E+12 | 3.08E+11 | 6.82E+10 |
6 | 2.40E+10 | 1.99E+10 | 1.30E+10 | 9.37E+08 | 2.34E+10 | 9.54E+09 |
7 | 1.80E+08 | 8.71E+08 | 2.12E+10 | 6.11E+09 | 1.51E+11 | 5.65E+09 |
8 | 2.40E+10 | 6.76E+09 | 1.55E+10 | 2.90E+10 | 6.50E+08 | 5.54E+09 |
9 | 3.28E+09 | 5.80E+09 | 5.87E+08 | 5.49E+10 | 6.51E+09 | 7.53E+09 |
10 | 5.16E+09 | 1.32E+12 | 2.80E+09 | 4.99E+10 | 2.48E+09 | 1.21E+10 |
11 | 1.48E+10 | 1.46E+09 | 1.77E+09 | 2.13E+10 | 5.61E+08 | 3.79E+10 |
12 | 1.44E+10 | 3.52E+10 | 1.66E+10 | 1.01E+09 | 9.19E+09 | 1.20E+07 |
13 | 1.21E+09 | 1.64E+10 | 3.35E+09 | 1.52E+10 | 7.62E+09 | 5.01E+08 |
14 | 1.25E+09 | 5.03E+09 | 1.45E+09 | 4.13E+09 | 9.72E+08 | 3.27E+10 |
15 | 5.64E+09 | 4.96E+09 | 1.73E+09 | 1.26E+09 | 1.42E+10 | 8.83E+09 |
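Selecting the optimal partition amounts to finding the largest entry in the table of PBM values. As a minimal sketch, the snippet below transcribes those values and searches them; the names `M_VALUES`, `PBM_TABLE` and `best_partition` are illustrative and not part of the original study.

```python
# PBM values transcribed from Table 1: keys are c, list entries follow M_VALUES.
M_VALUES = [1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
PBM_TABLE = {
    2:  [4.65e9, 4.64e9, 1.02e12, 1.01e12, 1.00e12, 4.64e9],
    3:  [2.38e9, 2.82e10, 2.37e9, 2.38e9, 2.55e10, 1.56e11],
    4:  [1.29e11, 1.76e10, 1.48e9, 4.59e10, 4.00e11, 1.61e11],
    5:  [1.01e10, 3.35e10, 1.47e10, 1.10e12, 3.08e11, 6.82e10],
    6:  [2.40e10, 1.99e10, 1.30e10, 9.37e8, 2.34e10, 9.54e9],
    7:  [1.80e8, 8.71e8, 2.12e10, 6.11e9, 1.51e11, 5.65e9],
    8:  [2.40e10, 6.76e9, 1.55e10, 2.90e10, 6.50e8, 5.54e9],
    9:  [3.28e9, 5.80e9, 5.87e8, 5.49e10, 6.51e9, 7.53e9],
    10: [5.16e9, 1.32e12, 2.80e9, 4.99e10, 2.48e9, 1.21e10],
    11: [1.48e10, 1.46e9, 1.77e9, 2.13e10, 5.61e8, 3.79e10],
    12: [1.44e10, 3.52e10, 1.66e10, 1.01e9, 9.19e9, 1.20e7],
    13: [1.21e9, 1.64e10, 3.35e9, 1.52e10, 7.62e9, 5.01e8],
    14: [1.25e9, 5.03e9, 1.45e9, 4.13e9, 9.72e8, 3.27e10],
    15: [5.64e9, 4.96e9, 1.73e9, 1.26e9, 1.42e10, 8.83e9],
}


def best_partition(table, m_values):
    """Return (c, m, PBM) for the largest PBM value in the table."""
    cells = ((c, m, value)
             for c, row in table.items()
             for m, value in zip(m_values, row))
    return max(cells, key=lambda cell: cell[2])


best_c, best_m, best_pbm = best_partition(PBM_TABLE, M_VALUES)
```

The search returns c = 10 and m = 1.6 with PBM = 1.32E+12. The next largest entry, 1.10E+12 at c = 5 and m = 1.8, is of the same order of magnitude.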
With c = 10 and m = 1.6, the FCM algorithm reached the stopping condition in 10 iterations (Figure 3). The objective function Jm was 6.79 × 10¹² at the first iteration and 4.44 × 10¹⁰ at the last.
Table 2 summarizes the degrees of pertinence for the 208 streamflow gauges. Each gauge was assigned to the group in which it has the highest degree of pertinence.
Table 2. Degrees of pertinence of streamflow gauges
ID | Code | Gauge | G1 | G2 | G3 | G4 | G5 | G6 | G7 | G8 | G9 | G10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
E1 | 18250000 | Uruará | 0.99 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
E2 | 17345000 | Base Cachimbo | 0.95 | 0.04 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
E3 | 18121006 | Barragem Conj. | 0.57 | 0.40 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
E4 | 17610000 | Creporizão | 0.99 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
E5 | 17675000 | Jardim Ouro | 0.00 | 0.86 | 0.00 | 0.06 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 |
⁞ | ⁞ | ⁞ | ⁞ | ⁞ | ⁞ | ⁞ | ⁞ | ⁞ | ⁞ | ⁞ | ⁞ | ⁞ |
E208 | 17121000 | Caiabis | 0.87 | 0.12 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
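The assignment rule (each gauge goes to the group with its highest degree of pertinence) can be sketched with the rows shown in Table 2. The helper names `hard_group` and `margin` are hypothetical; the margin between the two largest degrees is an extra diagnostic added here for illustration, not something reported in the study.

```python
# Degrees of pertinence in G1..G10 for the gauges listed in Table 2.
MEMBERSHIPS = {
    "E1":   [0.99, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    "E2":   [0.95, 0.04, 0.00, 0.01, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    "E3":   [0.57, 0.40, 0.00, 0.02, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    "E4":   [0.99, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    "E5":   [0.00, 0.86, 0.00, 0.06, 0.00, 0.00, 0.07, 0.00, 0.00, 0.00],
    "E208": [0.87, 0.12, 0.00, 0.01, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
}


def hard_group(row):
    """Return the 1-based index of the group with the highest degree of pertinence."""
    return 1 + max(range(len(row)), key=row.__getitem__)


def margin(row):
    """Gap between the two largest degrees; a small gap flags a borderline gauge."""
    top, second = sorted(row, reverse=True)[:2]
    return round(top - second, 2)


groups = {gauge: hard_group(row) for gauge, row in MEMBERSHIPS.items()}
margins = {gauge: margin(row) for gauge, row in MEMBERSHIPS.items()}
```

Of the gauges listed, only E5 (Jardim Ouro) falls in group 2; the others fall in group 1. Gauge E3 (Barragem Conj.) has the smallest margin, 0.17, since its degrees of pertinence in G1 and G2 are 0.57 and 0.40.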
Table 3 shows the data distribution for each group. The drainage area column gives the interval between the smallest and largest area in each group, whereas the average flow and average annual precipitation columns give a single value, the group average.
Table 3. Data distribution by cluster
Clusters | NG | % | Drainage area (km²) | Average flow (m³/s) | Average annual precipitation (mm) |
---|---|---|---|---|---|
1 | 101 | 48.56 | 491–17,990 | 150 | 1837 |
2 | 55 | 26.44 | 18,394–47,038 | 925 | 1883 |
3 | 21 | 10.10 | 51,147–112,186 | 1824 | 2027 |
4 | 10 | 4.81 | 133,571–193,372 | 4646 | 2343 |
5 | 6 | 2.88 | 225,424–293,084 | 13014 | 2178 |
6 | 4 | 1.92 | 317,967–367,791 | 22882 | 2342 |
7 | 3 | 1.44 | 456,347–508,733 | 9083 | 1829 |
8 | 6 | 2.88 | 889,201–1,082,709 | 17661 | 2162 |
9 | 1 | 0.48 | 1,402,097 | 101158 | 2250 |
10 | 1 | 0.48 | 3,911,283 | 170013 | 1778 |
Total | 208 | 100 |
NG – number of gauges.
Drainage area was the explanatory variable with the greatest significance in cluster formation, and the clusters are therefore presented in ascending order of area (Table 3). River length and basin perimeter did not vary consistently across the clusters, as area and flow did; their mean values in the clusters formed lay between 1,109 and 4,275 km and between 571 and 17,231 km, respectively. The groups (homogeneous regions) were determined by the distribution of drainage area, long-term mean flow, mean annual rainfall, river length and basin perimeter. The first two were the most similar within the groups, whereas precipitation was the least affected (Figure 4). Figure 4 allows the mean flow of a river to be estimated as a function of the homogeneous region, the drainage area and the average annual rainfall.
Groupings according to the drainage area, mean annual precipitation and average long term flow.
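Because the drainage-area intervals in Table 3 do not overlap, a first guess at the homogeneous region of a basin can be sketched from its area alone. This is a simplification for illustration only: the study clustered on five variables with fuzzy memberships, and the function `region_for_area` and its nearest-interval rule for areas that fall between intervals are assumptions, not the authors' procedure.

```python
# Drainage-area intervals (km²) per cluster, transcribed from Table 3.
AREA_RANGES = {
    1: (491, 17_990),
    2: (18_394, 47_038),
    3: (51_147, 112_186),
    4: (133_571, 193_372),
    5: (225_424, 293_084),
    6: (317_967, 367_791),
    7: (456_347, 508_733),
    8: (889_201, 1_082_709),
    9: (1_402_097, 1_402_097),
    10: (3_911_283, 3_911_283),
}


def region_for_area(area_km2):
    """Return the cluster whose area interval contains, or lies nearest to, the area."""
    def gap(bounds):
        low, high = bounds
        if low <= area_km2 <= high:
            return 0  # the area falls inside this interval
        return min(abs(area_km2 - low), abs(area_km2 - high))
    return min(AREA_RANGES, key=lambda k: gap(AREA_RANGES[k]))


# Example: a hypothetical basin of 30,000 km² falls in the interval of cluster 2.
example_region = region_for_area(30_000)
```

A fuller assignment would also compare the basin's mean flow and mean annual precipitation with the cluster averages, as the combined reading of Figure 4 suggests.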
Figure 5 shows the spatial distribution of the hydrologically homogeneous streamflow regions in the Amazon determined by the FCM algorithm. Homogeneous regions 9 and 10 had the largest drainage areas (1,402,097 and 3,911,283 km², respectively). Region 10 corresponds to the entire basin of the Amazon River and its tributaries, the Solimões, Negro and Madeira rivers. It extends beyond the Brazilian border into Venezuela, Colombia, Peru and Bolivia, where the rivers rise: the Solimões River receives contributions from tributaries in Peru, the Negro from tributaries rising in Colombia, and the Madeira from Bolivia. Because region 10 is so extensive, other homogeneous regions lie within it, including the Purus, Tapajós and Madeira river basins. Region 9 is an example: it lies entirely within region 10 and corresponds to part of the Solimões River basin, whose main tributaries include the Purus, Juruá and Japurá rivers.
Geographic contiguity is not necessary to define a hydrologically homogeneous region (Rao & Srinivas 2006). Regions 6 to 8 group few stations, about 6% of the total (Table 3), whereas regions 1 to 5 incorporate most (about 93%). Regions 1 to 5 also cover most of the Amazon and reveal the region's hydrologic similarity. Some areas of the Brazilian Amazon (shown in white) were not grouped due to the lack of streamflow data.
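The shares quoted above follow directly from the gauge counts (NG) in Table 3. A short check, with the illustrative names `GAUGES_PER_CLUSTER` and `share`:

```python
# Number of gauges (NG) per cluster, transcribed from Table 3.
GAUGES_PER_CLUSTER = {1: 101, 2: 55, 3: 21, 4: 10, 5: 6, 6: 4, 7: 3, 8: 6, 9: 1, 10: 1}
TOTAL = sum(GAUGES_PER_CLUSTER.values())


def share(clusters):
    """Percentage of all gauges that fall in the given clusters, to two decimals."""
    count = sum(GAUGES_PER_CLUSTER[k] for k in clusters)
    return round(100 * count / TOTAL, 2)


share_1_to_5 = share(range(1, 6))    # regions 1 to 5
share_6_to_8 = share(range(6, 9))    # regions 6 to 8
share_9_to_10 = share(range(9, 11))  # regions 9 and 10
```

Regions 1 to 5 hold 193 of the 208 gauges (92.79%) and regions 6 to 8 hold 13 (6.25%). The remaining two gauges, one each in regions 9 and 10, account for just under 1%.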
CONCLUSION
The FCM methodology grouped streamflow gauges into homogeneous regions and revealed that drainage area was the explanatory variable with the greatest significance in region formation. The optimal number of clusters was identified by applying the PBM validation index, which was maximized for 10 clusters with a fuzzification parameter of 1.6. The 10 regions were well defined, showing the Amazon's general hydrologic similarity. Because the graph presented is a function of cluster, drainage area, mean flow and mean annual precipitation, it provides a simple reference for relating watersheds to homogeneous regions. If the drainage area, average flow and precipitation (easily measured variables) are known, the region to which a river basin belongs can be determined.