Abstract
Knowing the streamflow behavior of hydrological basins is essential for appropriate water resource planning and management. In Colombia, which has considerable water resource potential, hydrological models are needed for many ungauged catchments. This study therefore presents the regionalization of flow duration curves (FDC) in Colombia. Daily flow time series from 655 gauging stations were used to define homogeneous hydrological regions, considering geological, topographic, and climatic information. Fifteen hydrological regions were delimited by cluster analysis using the K-means algorithm, all of which exhibited high spatial heterogeneity. Multiple linear regressions were used to estimate characteristic dimensionless flows as a function of each basin's attributes. A set of equations that allows the reconstruction of simulated dimensionless FDC for each cluster was determined, with coefficients of determination (R2) of 0.5–0.9. The percentage errors of the mean, maximum, and minimum discharges of the simulated FDC compared with observed values were approximately 9, 30, and 50%, respectively.
HIGHLIGHTS
Regional equations that allow the estimation of flow regime in Colombia.
Definition of hydrological homogeneous regions in Colombia.
Hydrological data and parameter analysis for Colombia.
More regionalization parameters were included and analyzed.
Multiple linear regressions were performed with more than two independent variables.
INTRODUCTION
Flow estimation in ungauged catchments is an essential task for the study, planning, and management of water resources worldwide and remains a major challenge for the hydrological community (Sivapalan 2003). The flow regime must be estimated for catchments or territories that lack information (Mesa Sánchez et al. 2003). Flow duration curves (FDC) summarize flow regimes (Foster 1933) through the relationship between flow magnitude and frequency (Vogel & Fennessey 1994). FDC have many engineering and environmental planning applications (Castellarin et al. 2007): for example, the analysis of maximum, average, and minimum flows (Blöschl 2005) and the estimation of environmental flow and water supply at points of interest within the framework of water planning and management (GWP 2008; IDEAM & MinAmbiente 2015). There are different methodologies for parameterizing the FDC: some are based on regression models, as in this study (Wagener et al. 2013), and others on physical models supported by a probability distribution (Doulatyari et al. 2015).
Colombia has considerable water resource potential (IDEAM 2008), which creates a need for knowledge of the flow regime for the appropriate design, execution, and operation of projected small and large hydroelectric power plants, water intakes, potable water supply, agriculture, flood control, recreation, etc. (UPME-PPUJ 2015). The main motivations for this work are the lack of information and the need to adequately characterize the flow regime in Colombia for the applications mentioned above.
The main objective of this work is to present a method for approximating streamflow in ungauged catchments through FDC estimation for different hydrological regions in Colombia (Olden et al. 2012). These hydrological regions are defined by a grouping technique based on the K-means cluster methodology (Hartigan 1975). Multiple linear regression (MLR) analysis is proposed to estimate the FDC in each cluster by interpolating dimensionless flow estimations at different characteristic percentiles (Li et al. 2010). A different methodology and approximation of the FDC are presented in this study (Vogel & Fennessey 1994; Castellarin 2014).
This research considers more catchment attributes, including geological (Musiake et al. 1975), geomorphological (Beven 2012), climatic (Perez et al. 2019), topographical (Wood et al. 1990), landscape (Winter 2001), and vegetation descriptors (Burt & Swank 1992). Regressions with more than two variables were used to estimate the characteristic flow percentiles, and more criteria were used to define the hydrological regions (Mohamoud 2008). Furthermore, flow estimations from the clusters are compared with those from the traditional subregion grouping.
This work considers different aspects that improve the FDC estimation in Colombia: The proposed estimation of the FDC takes into account the high spatial and temporal variability in Colombia considering a regionalization strategy for the hydrographic zones using morphometric criteria, hydrological information, and the K-means method.
This work presents 15 homogeneous hydrological regions instead of the 5 traditional regions (Caribbean, Pacific, Andean, Amazon, and Orinoquia), which makes it possible to consider areas with different microclimates inside them and even to speculate about teleconnections between areas or basins in different parts of the country. Observed FDC were estimated using as much information as possible after removing inconsistent time series and those shorter than 15 years.
Also, FDC were estimated using a piecewise fit strategy to represent, in the best way, the minimum, mean, and maximum flow regime. Another methodological contribution consists of adding more than two variables to the equations built from the MLR method and obtaining equations that relate up to five independent variables. This paper is presented as follows: data and methodology, results, discussion, and conclusions.
DATA AND METHODOLOGY
Data collection
A total of 655 daily flow time series were collected from gauging stations (IDEAM 2014). The records span from 1940 to 2015. The stations are distributed over 1,141,748 km², the surface of the Colombian national territory, located in the north of South America. Time series from gauging stations located on river reaches with multiple channels, series with less than 15 years of daily records, and series detected as inconsistent were removed from the analysis.
Hydrological clustering
After delineating the hydrographic basin corresponding to each selected gauging station (Figure 1(b)), each gauging station was assigned a set of attributes and climate and landscape descriptors, which were used to form hydrological clusters with the K-means algorithm (Álvarez et al. 2011; Wilks 2011). Table 1 shows this set. The entire national territory was divided into subbasins or hydrological units to extrapolate and spatialize the cluster results. A sensitivity analysis of the mean Euclidean distance to each cluster centroid was conducted to define the optimum number of basin groups (García et al. 2017).
Variable | Units | Abbreviation |
---|---|---|
Basin drainage area | km² | DreA |
Basin perimeter | km | Perim |
Gravelius compactness coefficient | m/m | Comp |
Agricultural land percentage | % | %Agr |
Forest land percentage | % | %For |
Urban land percentage | % | %Urb |
Tectonic fault density | km/km² | FDen |
Drainage density | km/km² | DDen |
Silt percentage | % | %Lo |
Sand percentage | % | %San |
Clay percentage | % | %Cl |
Mean annual potential evapotranspiration | mm/year | PET |
Maximum elevation | m | MaxE |
Mean elevation | m | Emean |
Minimum elevation | m | MinE |
Basin unevenness | m | BUne |
Mainstream channel length | km | MSCL |
Average basin slope | % | Sl |
Maximum monthly precipitation | mm/month | Pmax |
Mean monthly precipitation | mm/month | Pm |
Mean surface temperature | °C | Tm |
Hypsometric curve percentile 10 | % | H10 |
Hypsometric curve percentile 25 | % | H25 |
Hypsometric curve percentile 50 | % | H50 |
Hypsometric curve percentile 75 | % | H75 |
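The clustering step and the sensitivity analysis on the number of groups can be illustrated with a minimal sketch. It assumes a plain Lloyd-style K-means on a small matrix of basin attributes; the function names, the toy attribute values, and the random initialization are hypothetical, and whether the attributes were scaled before clustering is not stated here, so the sketch uses them as given.

```python
import math
import random


def kmeans(rows, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]  # random initial centroids
    labels = None
    for _ in range(iterations):
        # Assign every basin to its nearest centroid (Euclidean distance).
        new_labels = [min(range(k), key=lambda j: math.dist(r, centroids[j]))
                      for r in rows]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


def mean_distance_to_centroid(rows, labels, centroids):
    """Mean Euclidean distance of each basin to its own cluster centroid."""
    total = sum(math.dist(r, centroids[lab]) for r, lab in zip(rows, labels))
    return total / len(rows)


# Toy attribute matrix (two columns, six basins); illustrative values only.
basins = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
          [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]

labels, centroids = kmeans(basins, k=2)

# Sensitivity analysis: how the mean distance changes with the number of groups.
sensitivity = {}
for k in (1, 2, 3):
    lab_k, cen_k = kmeans(basins, k)
    sensitivity[k] = mean_distance_to_centroid(basins, lab_k, cen_k)
```

Plotting `sensitivity` against `k` and looking for the point where the curve flattens is one common way to pick the number of groups; the criterion actually applied in the study is the one cited in the text.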
FDC estimation using a simple linear regression model
From the observed FDC built from all data available at the calibration stations (period-of-record FDC), the distribution of gauged basins in clusters (Sauquet & Catalogne 2011), and the climate and landscape attributes, the values of different characteristic flows were estimated to generate daily, synthetic, regional duration curves (Razavi & Coulibaly 2013). The characteristic flow percentiles chosen were similar to those of Mohamoud (2008) and Salazar Oliveros (2016): Q100, Q90, Q80, Q70, Q60, Q50, Q40, Q35, Q30, Q20, Q10, Q5, Q1, Q0.5, and Q0.1. Characteristic flows were normalized by the average streamflow $\bar{Q}$ of each series, resulting in dimensionless characteristic flows: $Q_p^* = Q_p / \bar{Q}$. Because the magnitude of the FDC scales with the drainage area or the average flow, the curves must be standardized; here the dimension was removed by dividing by the average flow of each series. Regionalization tests based on normalization by the drainage area of each basin were also carried out, but they gave poorer results.
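The normalization described above can be sketched as follows. This is a minimal illustration, not the study's procedure: the function names are hypothetical, the interpolation between ranked flows is an assumption (the plotting position used is not stated), and the flow series is a toy example rather than observed data.

```python
# Characteristic exceedance percentages listed in the text (Q100 ... Q0.1).
PERCENTILES = [100, 90, 80, 70, 60, 50, 40, 35, 30, 20, 10, 5, 1, 0.5, 0.1]


def flow_at_exceedance(flows, p):
    """Flow equalled or exceeded p percent of the time.

    Assumption: linear interpolation between ranked flows.
    """
    ordered = sorted(flows, reverse=True)  # largest flow first
    position = p / 100 * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


def dimensionless_fdc(flows):
    """Map each exceedance percentage p to Q*_p = Q_p / mean flow."""
    mean_flow = sum(flows) / len(flows)
    return {p: flow_at_exceedance(flows, p) / mean_flow for p in PERCENTILES}


daily_flows = [float(q) for q in range(1, 12)]  # toy series, not real data
fdc = dimensionless_fdc(daily_flows)
```

Dividing by the mean flow makes curves from basins of very different size comparable, which is what allows a single regression per cluster and percentile.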
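Equation (1) represents the linear form of MLR:

$$y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i \qquad (1)$$

Equation (2) represents the potential (power-law) form of MLR, which is the product of n terms $X_i$, each raised to its respective exponent $\beta_i$, and a general coefficient $\alpha$:

$$Y = \alpha \prod_{i=1}^{n} X_i^{\beta_i} \qquad (2)$$

The two forms are related through logarithms: the independent term $\beta_0$ of Equation (1) is the logarithm of the coefficient $\alpha$, the exponents $\beta_i$ are the same in both, and the $x_i$ and $y$ values are the logarithms of the original matrices $X_i$ and $Y$. The potential equation is adequate to represent natural processes. In this case, discharges result from the multiplicative interaction of multiple variables, such as terrain slope, soil cover, drainage area, and catchment perimeter.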
Dimensionless flow estimations were also performed for the five traditionally established subregions in Colombia: Caribe, Magdalena, Orinoquia, Amazonia, and Pacific (Salazar-Holguín 2013). These regions were delimited in this way because of their large-scale similarity in topographical, climatic, and geomorphological aspects, as well as their geographical position in the national territory. Figure 1(b) shows these regions. Estimation results from clusters and from traditionally established subregions were contrasted, under the hypothesis that estimation applied to clusters should give better approximations to observed flows (Swain & Patra 2017).
A statistical regression routine (Regress) was run to select the combination of variables giving the highest coefficient of determination R2 for each equation (Mohamoud 2008), which measures how good the estimations (the regression model output) were (Steel & Torrie 1960). Fifteen matrices (one per cluster) were assembled with dimensions n × 25, where n represents the number of selected gauging stations and 25 represents the attributes in Table 1; this group of matrices was named X. The dependent variable Y was composed of 15 matrices (one per cluster) with dimensions n × 15 (the number of characteristic percentiles) containing the dimensionless characteristic streamflows Q*. The data were not standardized in either case. The regression analysis made it possible to obtain the dimensionless FDC for each homogeneous region and to carry out the corresponding validation exercise.
From each hydrological cluster and subregion, a flow time series and its respective catchment were randomly selected to validate the estimation results. These stations were excluded from the calibration process. Tables 2 and 3 show the validation stations for clusters and subregions, respectively.
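The selection of the attribute combination with the highest R2 can be sketched as an exhaustive search over attribute subsets, fitting the power-law form in logarithmic space by ordinary least squares. This is an illustrative sketch under stated assumptions, not the routine used in the study: all function names are hypothetical, R2 is evaluated on the logarithms, the data are synthetic, and every attribute value must be strictly positive for the logarithm to exist.

```python
import itertools
import math


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution


def fit_log_linear(rows, y):
    """Least squares on logarithms; returns [ln(coefficient), exponents...]."""
    design = [[1.0] + [math.log(v) for v in row] for row in rows]
    target = [math.log(v) for v in y]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(design, target)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(rows, y, beta):
    """Coefficient of determination of the fit, computed in log space."""
    logs = [math.log(v) for v in y]
    mean = sum(logs) / len(logs)
    predicted = [beta[0] + sum(b * math.log(v) for b, v in zip(beta[1:], row))
                 for row in rows]
    ss_res = sum((o - p) ** 2 for o, p in zip(logs, predicted))
    ss_tot = sum((o - mean) ** 2 for o in logs)
    return 1.0 - ss_res / ss_tot


def best_combination(attributes, y, names, size=2):
    """Try every subset of `size` attributes and keep the highest R2."""
    best = None
    for combo in itertools.combinations(range(len(names)), size):
        rows = [[row[i] for i in combo] for row in attributes]
        beta = fit_log_linear(rows, y)
        score = r_squared(rows, y, beta)
        if best is None or score > best[0]:
            best = (score, [names[i] for i in combo], beta)
    return best


# Synthetic example: Q* depends only on drainage area and mean precipitation.
names = ["DreA", "Sl", "Pm"]
attributes = [[10, 5, 100], [50, 12, 300], [200, 3, 150],
              [800, 20, 400], [1500, 8, 250], [3000, 15, 120]]
q_star = [0.5 * a ** 0.2 * p ** -0.4 for a, _, p in attributes]
best = best_combination(attributes, q_star, names, size=2)
```

With 25 attributes the number of subsets grows quickly with `size`, so an exhaustive search of this kind is only practical for the small subset sizes (two to five variables) mentioned in the text.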
Cluster | Station code | Region |
---|---|---|
1 | 11077020 | Caribe |
2 | 23097040 | Magdalena |
3 | 35017070 | Orinoquia |
4 | 21227010 | Magdalena |
5 | 13017010 | Caribe |
6 | 35027020 | Orinoquia |
7 | 21207960 | Magdalena |
8 | 51027020 | Pacific |
9 | 42067010 | Amazonia |
10 | 15017010 | Caribe |
11 | 32077100 | Orinoquia |
12 | 23057010 | Magdalena |
13 | 21017020 | Magdalena |
14 | 35027150 | Orinoquia |
15 | 21197030 | Magdalena |
Region | Station code |
---|---|
Caribe | 13047040 |
Magdalena | 21147080 |
Orinoquia | 35087010 |
Amazonia | 44117010 |
Pacific | 52027030 |
RESULTS
Hydrological clustering
The 655 hydrological basins were grouped into 15 clusters using the K-means algorithm. Table 4 shows a summary of clustering results.
Cluster | Number of units | Predominant subregion |
---|---|---|
1 | 53 | Magdalena |
2 | 21 | Magdalena |
3 | 12 | Magdalena and Orinoquia |
4 | 82 | Magdalena and Caribe |
5 | 38 | Caribe and Orinoquia |
6 | 29 | Magdalena and Orinoquia |
7 | 16 | Magdalena |
8 | 62 | Magdalena |
9 | 26 | Orinoquia and Amazonia |
10 | 57 | Magdalena |
11 | 38 | Caribe and Pacific |
12 | 42 | Magdalena and Amazonia |
13 | 74 | Magdalena |
14 | 65 | Magdalena |
15 | 40 | Orinoquia |
Total | 655 | |
Ungauged basins were grouped with the same criterion and procedure as gauged ones. Figure 12 shows the result of this grouping.
FDC estimations
MLR was first conducted for FDC estimation with two independent variables in each equation. The average R2 of each cluster was rated as follows: Poor if R2 is lower than 0.3, Fair if it is between 0.3 and 0.4, Good if it is between 0.4 and 0.6, and Very good if it is greater than 0.6. The same ratings were computed for the traditional subregions.
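The rating scale can be written as a small function. The function name is hypothetical, and the treatment of the boundary values is inferred from Table 5, where 0.40 and 0.60 are both rated Good.

```python
def rate_r2(r2):
    """Qualitative rating of an average R2 value."""
    if r2 < 0.3:
        return "Poor"
    if r2 < 0.4:
        return "Fair"
    if r2 <= 0.6:  # 0.40 and 0.60 both count as Good, as in Table 5
        return "Good"
    return "Very good"


ratings = {r2: rate_r2(r2) for r2 in (0.20, 0.35, 0.40, 0.60, 0.79)}
```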
Generally, higher R2 values were obtained in cluster FDC estimations compared with those obtained in traditional subregions estimations.
MLR was recalculated with more independent variables in the equations to improve flow estimation in the clusters and percentiles rated fair or poor. Higher R2 values were obtained, with the average increasing from 0.45 to 0.54. Table 5 presents a summary of the results.
Cluster | Average R2 | Number of variables | Rating |
---|---|---|---|
1 | 0.46 | 4 | Good |
2 | 0.79 | 2 | Very good |
3 | 0.78 | 2 | Very good |
4 | 0.40 | 5 | Good |
5 | 0.60 | 2 | Good |
6 | 0.53 | 5 | Good |
7 | 0.71 | 2 | Very good |
8 | 0.48 | 4 | Good |
9 | 0.54 | 3 | Good |
10 | 0.49 | 4 | Good |
11 | 0.71 | 2 | Very good |
12 | 0.43 | 4 | Good |
13 | 0.35 | 5 | Fair |
14 | 0.36 | 5 | Fair |
15 | 0.49 | 5 | Good |
Table 6 shows the linear correlation coefficient (R) and the coefficient of determination (R2) between observed and estimated streamflow with respect to the y = x line for clusters, whereas Table 7 shows the same results for the traditional subregions.
Cluster | Station | R (y = x) | R2 (y = x) |
---|---|---|---|
1 | 11077020 | 0.86 | 0.74 |
2 | 23097040 | 0.98 | 0.97 |
3 | 35017070 | 0.91 | 0.83 |
4 | 21227010 | 0.79 | 0.62 |
5 | 13017010 | 0.93 | 0.86 |
6 | 35027020 | 0.93 | 0.86 |
7 | 21207960 | 0.80 | 0.63 |
8 | 51027020 | 0.96 | 0.92 |
9 | 42067010 | 0.98 | 0.96 |
10 | 15017010 | 0.82 | 0.67 |
11 | 32077100 | 0.95 | 0.90 |
12 | 23057010 | 0.96 | 0.91 |
13 | 21017020 | 0.91 | 0.82 |
14 | 35027150 | 0.91 | 0.82 |
15 | 21197030 | 0.83 | 0.70 |
Region | Station | R (y = x) | R2 (y = x) |
---|---|---|---|
Caribe | 13047040 | 0.99 | 0.98 |
Magdalena | 21147080 | 0.86 | 0.75 |
Orinoquia | 35087010 | 0.94 | 0.89 |
Amazonia | 44117010 | 0.94 | 0.88 |
Pacific | 52027030 | 0.96 | 0.92 |
Please refer to the online version of this paper to see this table in colour: http://dx.doi.org/10.2166/nh.2022.022.
Dimensionless streamflow estimation equations are given in the Supplementary Material (Gaviria Arbeláez 2019). All equations follow the structure of Equation (2); the combination of attributes shown for each percentile regression is the one that gave the highest R2 value. Figure 12 shows the map of the regions with the main results of FDC estimations (validation set), in which the horizontal and vertical axes represent the percentage of exceedance and dimensionless flow, respectively.
DISCUSSION
High spatial heterogeneity and discontinuities were seen in the cluster regions map (Figure 12). This can be explained by the complexity and orography of Colombian territory, high spatial variability of precipitation, and heterogeneity of land cover and soil type.
The main result is an equation that estimates the dimensionless flow of each characteristic percentile and cluster. Generally, higher correlations were found in cluster regression than in regional equations. Nonetheless, locating a study catchment in a traditional subregion would be simpler than locating it in a regionalization cluster.
The mean percentage error in the model validations averaged approximately 27%, corresponding to 50, 9, and 26% for the minimum, mean, and maximum flow estimations, respectively. However, the validations for the traditional subregions generally performed better, as shown in Figure 11: most streamflow magnitudes gave a lower percentage error in the region estimations than in the cluster estimations. This result was unexpected because the cluster regression equations generally gave higher R2 values. Also, the regions were validated with basins of different sizes to show that the results hold for a wide range of areas.
Estimation behavior varies spatially between clusters. For example, better results were found in clusters 2, 3, 7, and 11, which gave considerably higher R2 values than the other groups. By contrast, clusters 13 and 14 gave lower R2 values. This behavior was not due to the number of calibration points or to how homogeneous each group was. Nevertheless, even when R2 values are not high, satisfactory approximations are possible when the estimated and observed flows are compared. The outcome also depends on the study catchment, as in the validations of clusters 8, 9, 10, and 11.
Soil use and land cover variables are notably frequent in regression equations, followed by climatic variables from precipitation and evapotranspiration, then by topographic variables like elevations and slope. Regression equations are helpful for understanding which variables and processes involved are more relevant in the flow regime and basin's rainfall–runoff estimations.
Better results were obtained for the percentiles around the mean streamflow than for maximum and minimum flows. A possible reason is that the gauging stations were calibrated frequently at average flows but not during flood events (Qp<1) or recessions (Qp>85), so these parts of the observed FDC are extrapolations. Nevertheless, variations in mean R2 values among flow magnitudes were insignificant, except for a lower value at percentile 20 (Figure 3), which can be regarded as a transition between the average and maximum discharges.
The form of the dimensionless FDC was strongly related to basin size. The FDC of small basins (order of magnitude between 10¹ and 10² km²) had an ‘L’ form (pronounced concavity), which represents a large difference between extreme and mean flows, whereas the FDC of big basins (10⁴ and 10⁵ km²) was smoother. The FDC form results from the geomorphological attributes and their interrelations (Perez et al. 2018), which are very complex.
Some estimations give a flow Qp1 larger than Qp2 even though p1 is the higher percentage of exceedance (p1>p2). This is a conceptual error, because flow cannot increase with the percentage of exceedance. If this occurs in an FDC estimation, the wrong percentile should be discarded and replaced by linear interpolation between its neighbors. Errors like this were uncommon in this work's validations.
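The correction described above can be sketched as follows. The function name and the example values are hypothetical, and the sketch makes one explicit assumption: when a flow is larger than the last accepted flow at a lower percentage of exceedance, the later point is treated as the wrong one.

```python
def repair_fdc(percentiles, flows):
    """Replace non-monotonic FDC points by linear interpolation.

    `percentiles` are exceedance percentages in ascending order and
    `flows` the estimated flows in the same order.
    """
    n = len(flows)
    valid = [True] * n
    last = flows[0]
    for i in range(1, n):
        if flows[i] > last:  # flow increases with exceedance: inconsistent
            valid[i] = False
        else:
            last = flows[i]

    repaired = list(flows)
    for i in range(n):
        if valid[i]:
            continue
        left = max(j for j in range(i) if valid[j])
        right_side = [j for j in range(i + 1, n) if valid[j]]
        if not right_side:
            # No valid neighbour to the right: hold the last valid flow.
            repaired[i] = flows[left]
            continue
        right = right_side[0]
        weight = ((percentiles[i] - percentiles[left])
                  / (percentiles[right] - percentiles[left]))
        repaired[i] = flows[left] + weight * (flows[right] - flows[left])
    return repaired


# Illustrative values: the estimate at 30% exceedance breaks monotonicity.
fixed = repair_fdc([10, 20, 30, 40], [2.0, 1.0, 1.5, 0.5])
```

In the example, the value at 30% exceedance is replaced by the midpoint of its neighbours at 20% and 40%, so the repaired curve decreases monotonically again.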
The correlations between observed and estimated flows with respect to the y = x line were high for both clusters and regions (Tables 6 and 7). These high values indicate that the dimensionless flow estimations are coherent in order of magnitude.
The regression model estimates the dimensionless flow regime; to recover the original FDC (flow in m³/s), each Q* must be multiplied by the long-term mean flow. The long-term mean flow of each Colombian basin can be approximated by applying a long-term water balance. However, estimating mean annual precipitation and actual evapotranspiration introduces an additional source of error.
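As a rough illustration of this last step, the sketch below approximates the long-term mean flow as (precipitation minus actual evapotranspiration) times drainage area and uses it to rescale a dimensionless FDC. The function names and numbers are hypothetical, and the balance assumes that long-term storage change and groundwater exchange are negligible.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def mean_flow_from_water_balance(precip_mm, et_mm, area_km2):
    """Long-term mean flow in m3/s from annual P and actual ET (mm/year)."""
    runoff_m = (precip_mm - et_mm) / 1000.0  # runoff depth in metres per year
    area_m2 = area_km2 * 1.0e6
    return runoff_m * area_m2 / SECONDS_PER_YEAR


def rescale_fdc(dimensionless, mean_flow):
    """Convert dimensionless flows Q* into flows in m3/s."""
    return {p: q * mean_flow for p, q in dimensionless.items()}


# Illustrative numbers only: 2000 mm/year of rain, 1000 mm/year of actual ET.
mean_q = mean_flow_from_water_balance(2000.0, 1000.0, 31.5576)
fdc_m3s = rescale_fdc({10: 2.5, 50: 0.8, 90: 0.3}, mean_q)
```

Any error in the precipitation or evapotranspiration estimate propagates linearly into every rescaled flow, which is the additional error source noted in the text.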
CONCLUSIONS
Despite the widespread lack of hydrological information in Colombia, it is possible to extrapolate conditions to estimate the flow regime at ungauged sites. It is assumed that the selected gauging stations and their respective hydrological basins span a wide spectrum of characteristics such as size, form, mean discharge, and climatic conditions, which allows the approach to be applied with similar probabilities of success to rivers with different characteristics.
A set of equations that allows FDC estimation in ungauged catchments in Colombia was developed. The attributes used in this work are publicly available in web databases. Applying the methodology requires defining a target basin, locating it in a cluster using the coordinates of its geometric centroid, looking up the estimation equations, and identifying which attributes are needed. Attribute units and dimensions are specified.
Traditionally, five homogeneous hydrological regions are defined in Colombian territory. Nevertheless, too many variables are involved and related to the basin's behavior, which has a heterogeneous spatial distribution. The closeness criterion was not considered in the homogeneous hydrologic regions.
The way MLR was applied in this work amounts to a piecewise estimation of the Colombian flow regime. Other options exist, for example, spline regressions. According to the model validation results, low mean percentage errors were achieved for the average discharge percentiles. Extreme values (minimum and maximum flows) gave higher mean errors because these flows were not measured but extrapolated.
Basin size is influential in flow regime estimation, even for dimensionless flow. The grouping and estimation models account for the geographic extent of the basin, and drainage area is not the only parameter related to basin size: basin perimeter, Gravelius compactness coefficient, mainstream length, tectonic fault density, and basin unevenness are closely related to it too. This area dependency is observed in the regression results, which can be reviewed in the Supplementary Material of Gaviria Arbeláez (2019).
ACKNOWLEDGEMENTS
Analyses are based on data provided by the national institutes IDEAM, SGC, and IGAC. We thank the reviewers and editors in advance.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.