The clustering of small watersheds based on hydrological similarity serves as an effective technique for identifying similarities in watershed runoff generation and routing conditions. It also addresses the challenge of parameter transplantation in undocumented areas. In this research, 545 small watersheds in hilly areas within Shandong Province were studied using 22 selected indicators to represent their climate and underlying surface characteristics. The study employed a two-stage clustering method combining the self-organizing map (SOM) neural network with the K-means algorithm, facilitating the classification of these watersheds into various groups. Each group of small watersheds was then analyzed for its unique characteristics. To validate the reasonableness of the classification results, the flood peak modulus of each watershed was calculated using a hydrologic-hydraulic method, while a parameter transplantation study was carried out and generalized for the clustering results. The findings indicate that the SOM-K-means clustering method efficiently classified the watersheds into 12 similar groups, validating its effective application in small watershed classification. This classification assists in solving the problem of flood forecasting in the ungauged watersheds in Shandong Province and developing more effective flood risk management strategies.

  • The study focused on 545 watersheds in the hilly areas of Shandong Province.

  • A technical system of 'small watershed division-watershed similarity classification-small watershed hydrological model construction-parameter transplantation' has been proposed.

  • The basic features and hydrological characteristics of each group's subwatersheds were analyzed, and flood modulus was used to verify the reasonableness of the results.

Hydrological forecasting in hilly watersheds is key to preventing floods and reducing disaster losses in these areas (Xue 2016). At present, imperfect runoff monitoring networks and the late start of monitoring mean that runoff data are scarce or nonexistent in most hilly watersheds. In addition, rapid socio-economic development has significantly changed underlying surface conditions, further complicating hydrological forecasting in ungauged areas (Zhu et al. 2020). Hydrological forecasting in these ungauged areas therefore remains a significant challenge.

In recent years, a growing number of researchers have adopted parameter regionalization based on basin hydrological similarity for hydrological forecasting in ungauged areas; that is, the model parameters of ungauged basins are derived from those of gauged basins. Hydrologically similar watersheds are usually defined as watersheds that have similar underlying surface characteristics, driving forces, hydrodynamic conditions, etc., and between which hydrologic model parameters can be transplanted (Hrachowitz et al. 2013; Wu et al. 2023). The primary methods of parameter regionalization are the spatial proximity, physical attribute similarity, and regression methods (Wu et al. 2023). The regression method establishes a regression relationship between watershed attributes and hydrological model parameters and applies it to ungauged watersheds, but it is prone to equifinality, which limits its practical application. The spatial proximity method generally selects the reference watershed for parameter transplantation according to the spatial distance between watersheds, and includes spatial distance averaging and spatial interpolation (Guo et al. 2020). However, its applicability is reduced in areas with diverse watershed characteristics (Sun et al. 2023a, b).

In contrast, the physical attribute similarity method is based on the premise that watersheds with similar physical characteristics will exhibit similar hydrological responses (Yi et al. 2014). Because it is widely applicable and has few limitations, it has increasingly become a focal point for solving hydrological forecasting problems in ungauged watersheds. It typically employs clustering algorithms to identify and group watersheds with similar characteristics (Wu et al. 2023). Clustering algorithms (Hassan & Ping 2012; Zhou et al. 2014) differ from classification methods (Cao et al. 2020) in that they are unsupervised learning algorithms that require no labeled training data. Notable examples include hierarchical clustering, as demonstrated by the analytic hierarchy process (Fan & Liu 2015; Alves et al. 2023); partition-based clustering using the K-means algorithm (Clare & Cohen 2001); fuzzy clustering (Yu & Li 2014); density-based clustering such as DBSCAN (Ram et al. 2010); model-based clustering through the self-organizing map (SOM) neural network (Kohonen 1982); and graph-based clustering represented by spectral clustering (Liu et al. 2023). These algorithms are widely used in watershed classification and have achieved good results. For instance, Yi et al. (2014) classified watersheds using the SOM and hierarchical clustering analysis (HCA), and then employed the Hydrologiska Byråns Vattenbalansavdelning (HBV) model for parameter transposition in ungauged watersheds. Similarly, Yang et al. (2022) used the K-means algorithm based on physical similarity to group 64 small watersheds in hilly areas into 11 similar groups and tested hydrological model parameter transplantation. Merz & Blöschl (2004) focused on hydrological forecasting through watershed similarity, and Mayer et al. (2014) applied cluster analysis to classify the Great Lakes basin in the United States.
SOM neural networks are also increasingly applied in hydrogeology. Varouchakis et al. (2022) used the SOM neural network to process the relevant data and calculate a groundwater stress index, demonstrating its effectiveness and reliability in large-scale applications, and later combined SOM with geostatistics to provide a new method for estimating the spatial distribution of groundwater hydrology in complex hydrogeological systems (Varouchakis et al. 2023).

However, inherent limitations of single clustering algorithms can result in suboptimal clustering. For instance, the initial cluster centers in the K-means algorithm are typically selected at random, which significantly influences the final results. Furthermore, the clustering results of the SOM neural network are difficult to interpret. To overcome these issues, some researchers have combined the strengths of the SOM and K-means algorithms and applied this integrated approach in various fields. This combination has been shown to improve clustering results significantly (Huang et al. 2022; Bigdeli et al. 2022). In this paper, to address the difficulty of constructing feature indices and choosing a classification algorithm for watershed similarity division, we built a feature index library that considers both physical and attribute indices, and we propose a small watershed similarity evaluation method coupled with a two-stage clustering model.

At present, there are limited studies on large-scale hydrological similarity analysis for specific regions combined with machine learning. In this paper, a machine-learning-based hydrological similarity analysis framework that couples watershed clustering with category determination is proposed and applied to Shandong Province, China. Shandong Province is a major economic province in northern China with an important strategic position: it is an important part of the Northeast Asian Economic Circle and serves as a vital link between the Beijing–Tianjin–Hebei area and the Yangtze River Delta region. This geographical and economic significance underscores the importance of flood forecasting in ungauged watersheds for the region. Consequently, this study analyzes the hydrological similarity of 545 upstream watersheds in Shandong's hilly area, based on their physical characteristics, using a combined approach of the SOM neural network and the K-means method for clustering. Based on the clustering results, two similar basins were selected for parameter transplantation, and their flood processes were simulated using the HEC-HMS model to validate the applicability of this method to hydrological forecasting in ungauged watersheds. The study also used the hydrologic-hydraulic method to calculate the flood peak modulus of these small watersheds as a means of verifying the effectiveness of the clustering. Finally, based on the watershed clustering results, a supervised machine learning classification technique (Random Forest Classification Model) (Jonathan et al. 2022) was introduced for category determination and parameter transplantation of other small watersheds in Shandong Province, confirming that the method can be used for flood forecasting in any ungauged watershed in Shandong Province.
The results of this research not only clarify the hydrological characteristics of various small watersheds in Shandong Province but also simplify their study and facilitate the implementation of flood risk management strategies. They can also be used to identify reference watersheds and provide an effective tool for flood forecasting in ungauged watersheds within Shandong Province, while offering experience and insights for flood forecasting in other ungauged areas.

The remainder of the paper is structured as follows: Section 2 describes the study area and the data sources used. Section 3 details the methodology and the research process. Section 4 presents the results and includes a discussion of these findings, and Section 5 concludes.

Shandong Province, situated in northeastern China, is part of the North China Plain coastal region and lies east of the Taihang Mountains. The province spans approximately 155,800 km², with diverse terrain. The central region is dominated by the Taiyi Mountain Range, characterized by elevated terrain. In contrast, the southwest and northwest regions are notably flat, while the eastern section comprises gently undulating hills. Over 40% of Shandong's territory consists of mountainous and hilly areas, which are prone to frequent flooding that threatens the lives and property of tens of millions of residents across approximately 12 cities. Shandong's water systems are expansive, encompassing parts of the Yellow River, Huai River, and Hai River basins, along with the Southern Four Lakes Region. The climate is warm temperate with monsoonal influences, marked by rainfall and high temperatures occurring in the same season, brief spring and autumn, and extended winter and summer periods.

For the purposes of this research, 545 upstream watersheds within Shandong Province were analyzed. These watersheds vary in size from 42.14 to 359.55 km² and have an average elevation of approximately 156.55 m. Figure 1 illustrates the topography of the study area.
Figure 1

General topography of typical hilly areas of Shandong Province, including the research area.


Methodological framework

The research methodology employed in this paper is structured into three main steps: the first step involves understanding the specific conditions of the study area and consulting relevant literature to select suitable similarity indices, such as climate and underlying surface characteristics of the watersheds (Mayer et al. 2014; Jin et al. 2017; Sun et al. 2023b). These indices are then extracted and calculated for 545 small watersheds using ArcGIS and digital elevation model (DEM) data. The second step utilizes the SOM neural network to conduct the initial clustering. The results from this primary clustering are then input into the K-means algorithm to achieve a secondary clustering. The final step involves analyzing the characteristics of the various watershed groups formed through clustering. This analysis includes evaluating the flood peak modulus and the effectiveness of parameter transplantation to assess the reasonableness of the clustering results. The entire research workflow is illustrated in Figure 2.
Figure 2

Workflow chart for similarity analysis in small watersheds.


SOM neural network

The SOM neural network is an unsupervised learning technique capable of transforming high-dimensional data into a more manageable low-dimensional space. This transformation is achieved through self-organizing mapping processes. The model segregates sample points into different discrete areas in accordance with the similarity of their data features, thereby achieving clustering. The SOM neural network mimics the two-dimensional spatial structure of neurons and incorporates self-organization and learning capabilities through mutual competition and interaction among the neurons. Its applications in pattern recognition and classification are well regarded, owing to its proficiency in automatic clustering, robust nonlinear mapping, and high fault tolerance. However, it is also known for its high complexity and the inability to provide precise clustering information post-clustering (Kohonen 1982).

The SOM neural network comprises an input layer and a competitive layer. Each neuron in the input layer is connected through weights to the output neurons in the competitive layer, and neurons within a certain neighborhood of each other in the competitive layer are locally connected. The network controls how strongly each neuron responds to an input pattern and how strongly neurons inhibit one another laterally, and it continuously adjusts the weights throughout model training. The neuron nodes in the model are topologically connected, and this connectivity reflects the strength of association between the nodes (Bigdeli et al. 2022).
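
The weight adjustment described above is commonly written as the standard Kohonen update rule. The text does not state the rule explicitly, so the notation below (input vector $x$, weight vector $w_j$, winning neuron $c$, learning rate $\eta(t)$, neighborhood width $\sigma(t)$, and grid position $r_j$) is an assumption introduced for illustration and not necessarily the exact form used in this study:

```latex
c = \arg\min_{j} \lVert x - w_j(t) \rVert, \qquad
w_j(t+1) = w_j(t) + \eta(t)\, h_{j,c}(t)\, \bigl( x - w_j(t) \bigr), \qquad
h_{j,c}(t) = \exp\!\left( -\frac{\lVert r_j - r_c \rVert^{2}}{2\,\sigma(t)^{2}} \right)
```

Neurons close to the winner on the grid receive the largest updates, which is what preserves the topological ordering mentioned above.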

K-means clustering algorithm

The K-means clustering algorithm is one of the most fundamental and widely used unsupervised clustering techniques. Its essential function is to divide a dataset into k separate groups and establish the central point of each group, ensuring high intra-group similarity and low inter-group similarity. The algorithm predominantly uses Euclidean distance as the measure of similarity and the sum of squared errors (SSE) as the loss function, as described by Astel et al. (2007). The relevant formulas are:
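$$d(x, c_i) = \lVert x - c_i \rVert_2 \tag{1}$$
$$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, c_i)^2 \tag{2}$$
where $x$ denotes a data object in dataset $D$; $C_i$ represents the $i$th cluster; $c_i$ is the clustering center of cluster $C_i$; and $k$ is the number of clusters.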
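
As a minimal sketch of the distance measure and loss function described above, the following Python snippet assigns each sample to its nearest center and accumulates the sum of squared errors. The sample values and the function names are hypothetical and serve only to illustrate the calculation.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between a sample x and a center c."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def assign_and_sse(data, centers):
    """Assign each sample to its nearest center and return (labels, SSE)."""
    labels = []
    sse = 0.0
    for x in data:
        distances = [squared_distance(x, c) for c in centers]
        nearest = distances.index(min(distances))
        labels.append(nearest)
        sse += distances[nearest]
    return labels, sse


# Hypothetical two-indicator samples forming two obvious groups.
samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
labels, sse = assign_and_sse(samples, centers)
print(labels, sse)
```

A lower SSE for the same number of clusters indicates tighter groups, which is the quantity K-means tries to reduce at every iteration.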

One critical aspect of the K-means algorithm is the need to specify the number of clusters and the initial cluster centers in advance. The algorithm is highly sensitive to the initial choice of cluster centers: different initial selections can lead to different clustering outcomes, which affects the stability and accuracy of the results. In addition, the K-means algorithm is prone to converging to a local optimum, which can further reduce the reliability of the clustering (Clare & Cohen 2001).

SOM+K-means two-stage clustering approach

To address the challenges associated with the K-means algorithm and the SOM neural network as previously mentioned, this study integrates these two algorithms to form a two-stage SOM-K-means clustering model. The initial classification of clusters is determined by the SOM algorithm. The centroids generated by SOM clustering serve as the initial values for the K-means algorithm. The final clustering results are derived from the K-means algorithm, utilizing internal indicators to optimize the number of clusters (Han & Tang 2021; Huang et al. 2022).
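
The two-stage idea can be sketched in a few lines of plain Python. This is an illustrative simplification rather than the configuration used in the study: it assumes a small one-dimensional SOM whose neuron count equals the number of clusters, so that the trained SOM weights can seed K-means directly, and it runs on hypothetical random data. All function names and parameter values are assumptions.

```python
import math
import random


def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda j: dist2(x, centers[j]))


def train_som(data, n_neurons, iterations=300, lr0=0.5, seed=0):
    """Stage 1: train a one-dimensional SOM and return its weight vectors."""
    rng = random.Random(seed)
    weights = [list(rng.choice(data)) for _ in range(n_neurons)]
    sigma0 = max(n_neurons / 2.0, 1.0)
    for t in range(iterations):
        decay = 1.0 - t / iterations
        lr = lr0 * decay                  # shrinking learning rate
        sigma = max(sigma0 * decay, 0.5)  # shrinking neighborhood width
        x = rng.choice(data)
        bmu = nearest(x, weights)         # best-matching unit
        for j, w in enumerate(weights):
            h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
            for d in range(len(w)):
                w[d] += lr * h * (x[d] - w[d])
    return weights


def kmeans(data, init_centers, max_iter=100):
    """Stage 2: Lloyd's K-means started from the given centers."""
    centers = [list(c) for c in init_centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(x, centers) for x in data]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def sse(data, labels, centers):
    """Sum of squared distances from each sample to its assigned center."""
    return sum(dist2(x, centers[lab]) for x, lab in zip(data, labels))


# Hypothetical standardized samples: three loose groups in two dimensions.
rng = random.Random(1)
data = [(cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
        for cx, cy in [(0, 0), (4, 0), (2, 4)] for _ in range(20)]

som_weights = train_som(data, n_neurons=3)
som_labels = [nearest(x, som_weights) for x in data]
final_labels, final_centers = kmeans(data, som_weights)

print("SSE with SOM centers:   ", round(sse(data, som_labels, som_weights), 3))
print("SSE after K-means stage:", round(sse(data, final_labels, final_centers), 3))
```

Because K-means starts from the SOM weights instead of random points, the second stage can only keep or lower the SSE of the first-stage partition, and repeated runs start from the same informed positions.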

Flood peak modulus

The flood peak modulus is defined as the peak discharge generated per unit area in the watershed, which indicates the capacity of the watershed to produce floods. This measure is closely related to the elevation and slope of the watershed. The flood peak modulus is calculated as the ratio of peak discharge to watershed area (Li et al. 2017):
$$q_m = \frac{Q_m}{F} \tag{3}$$
where $Q_m$ is the design flood peak discharge; $F$ is the catchment area; and $q_m$ represents the flood peak modulus.
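
The ratio above is simple enough to check by hand, but a short helper makes the units explicit. In this sketch the units (m³/s for discharge, km² for area), the function name, and the input values are assumptions chosen for illustration; they are not taken from the study's results.

```python
def flood_peak_modulus(peak_discharge_m3s, area_km2):
    """Flood peak modulus in m^3/(s km^2): peak discharge divided by area."""
    if area_km2 <= 0:
        raise ValueError("catchment area must be positive")
    return peak_discharge_m3s / area_km2


# Hypothetical design peak discharge and catchment area.
print(flood_peak_modulus(250.0, 100.0))
```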

Selection of similarity indicators

In this study, the following characteristics were selected as similarity indicators: multi-year average flood-season rainfall, watershed area, watershed length, average watershed slope, average elevation, topographic relief, watershed shape factor, river network density, main stream gradient, main stream length, stream tortuosity, average topographic index (Jin et al. 2017), normalized difference vegetation index (NDVI), land-use types (percentage area of grassland, buildings, farmland, woodland, and waters), and soil texture. The soil texture was categorized into four types (Wang et al. 2016). Table 1 shows the maximum, minimum, and mean values of each similarity indicator across all small watersheds. In this paper, each indicator is considered to have the same impact, i.e., all indicators are equally weighted.

Table 1

Characteristic values (maximum, minimum, and mean) of the similarity indicators for small watersheds in the study area

| Feature | Indicator | Max | Min | Mean |
|---|---|---|---|---|
| Underlying surface | Area (km²) | 359.55 | 42.14 | 98.65 |
| | Average slope (°) | 21.43 | 1.24 | 6.48 |
| | Average elevation (m) | 554.55 | 3.65 | 156.55 |
| | Topographic relief (m) | 1,404.00 | 11.00 | 375.50 |
| | River network density (km/km²) | 0.96 | 0.08 | 0.39 |
| | Main stream gradient (%) | 3.27 | 0.01 | 0.36 |
| | Main stream length (km) | 66.24 | 4.69 | 19.60 |
| | Watershed length (km) | 49.78 | 8.55 | 17.54 |
| | Watershed shape factor | 0.91 | 0.06 | 0.35 |
| | Average topographic index | 7.28 | 4.36 | 5.96 |
| | Stream tortuosity | 2.53 | 1.00 | 1.31 |
| Land use | Grassland (%) | 62.26 | 0.00 | 10.29 |
| | Buildings (%) | 89.29 | 0.79 | 15.72 |
| | Farmland (%) | 93.76 | 1.05 | 60.02 |
| | Woodland (%) | 62.96 | 0.00 | 11.46 |
| | Waters (%) | 31.60 | 0.00 | 1.98 |
| | NDVI | 0.844 | 0.376 | 0.713 |
| Soil texture type | Sandy or loamy sandy soils, sandy loam soils (%) | 98.20 | 0.00 | 33.26 |
| | Clay loam soils, clay soils, silt clay loam soils, sandy clay soils (%) | 100.00 | 0.00 | 23.44 |
| | Sandy clay loam soils (%) | 100.00 | 0.00 | 35.29 |
| | Silt loam soils (%) | 100.00 | 0.00 | 7.40 |
| Climate | Multi-year average flood-season rainfall (mm) | 225.82 | 133.43 | 184.79 |

The data used in this paper come from the following sources: digital elevation model (DEM) data from the phased-array L-band synthetic aperture radar of the Advanced Land Observing Satellite (ALOS, launched in 2006); NDVI data from the Resources and Environmental Science Data Center (http://www.resdc.cn/); monthly rainfall data from the National Tibetan Plateau Data Center (https://data.tpdc.ac.cn/) (Ding & Peng 2020; Peng 2020; Peng et al. 2017, 2018, 2019); and land-use and soil data for the underlying surface from the results of the Shandong Province Flash Flood Hazard Analysis and Assessment Project.

Parameter transplantation methods

Based on the watershed clustering results and the existing hydrological data, the HEC-HMS hydrological model was used to perform the parameter transplantation test, and the specific methods and processes are as follows:

  • (1) Determine the reference watersheds by combining the clustering results with the available hydrological information.

  • (2) Perform parameter transplantation tests using the HEC-HMS hydrological model, in which the SCS-CN method is selected for the Loss model, the Snyder Unit Hydrograph for the Transform model, and Recession for the Baseflow model, without considering channel routing (Zhao et al. 2017; Cheng et al. 2021). The parameters include curve number, initial abstraction, standard lag, peaking coefficient, initial discharge, recession constant, and ratio.

  • (3) Select typical flood events to calibrate the parameters of the reference watershed. Once the calibration is qualified, transplant the parameters to the ungauged watersheds. Watersheds in the same category can share a set of model parameters. All parameters are transferred using the direct transplantation method (Wu et al. 2023).

  • (4) Compare the simulation results for multiple floods and evaluate the effect of parameter transplantation.
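
The direct transplantation in step (3) amounts to copying a calibrated parameter set, unchanged, to ungauged watersheds in the same cluster. The sketch below illustrates only that bookkeeping and does not call HEC-HMS itself. The field names follow the parameter list in step (2), while the units, watershed names, group labels, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class HmsParameters:
    """Parameter set named after the list in step (2); values are hypothetical."""
    curve_number: float
    initial_abstraction_mm: float
    standard_lag_h: float
    peaking_coefficient: float
    initial_discharge_m3s: float
    recession_constant: float
    ratio: float


def transplant(group_of: Dict[str, str],
               calibrated: Dict[str, HmsParameters],
               target: str) -> Optional[HmsParameters]:
    """Return the parameters of a calibrated watershed in the target's group."""
    target_group = group_of[target]
    for name, params in calibrated.items():
        if name != target and group_of[name] == target_group:
            return params  # direct transplantation: copy without adjustment
    return None  # no gauged reference watershed in this group


# Hypothetical watersheds and cluster labels.
group_of = {"reference_A": "X", "ungauged_B": "X", "ungauged_C": "III"}
calibrated = {"reference_A": HmsParameters(75.0, 12.0, 3.5, 0.6, 2.0, 0.8, 0.1)}

print(transplant(group_of, calibrated, "ungauged_B"))
print(transplant(group_of, calibrated, "ungauged_C"))
```

Returning `None` when a group has no calibrated member makes the gap explicit, which is where additional gauged data or a different regionalization method would be needed.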

Results

Initial clustering

The index data were first standardized and then fed into the SOM neural network for the initial phase of clustering. Since this stage was followed by a secondary clustering process, complete convergence of the SOM neural network was not necessary; an approximate clustering result was sufficient. The iteration count was set at 300, and the number of neurons was determined from the number of samples N using an empirical formula (Astel et al. 2007), applied here to the 545 samples of the study. The interclass distances and the hit markers are depicted in the Supplementary material, Figure S2. The preliminary clustering results from the SOM neural network were then input into the K-means algorithm to derive the final clustering results.
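
The preprocessing in this step can be illustrated with a short sketch. Two assumptions are made and should be checked against the original paper: standardization is taken to be a z-score transform, and the empirical sizing rule, which is not reproduced in the text above, is taken to be the widely used heuristic of roughly 5√N neurons. The function names are hypothetical.

```python
import math


def standardize(values):
    """Z-score standardization using the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def suggested_neuron_count(n_samples):
    """Assumed heuristic: about 5 * sqrt(N) neurons for N samples."""
    return round(5 * math.sqrt(n_samples))


# Illustrative input only: the max, min, and mean area values from Table 1.
areas_km2 = [359.55, 42.14, 98.65]
print(standardize(areas_km2))
print(suggested_neuron_count(545))
```

Standardizing every indicator to zero mean and unit variance is also one way to realize the equal weighting of indicators, since no indicator then dominates the Euclidean distance because of its units.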

Determination of the optimal number of clusters

To ascertain the optimal number of clusters (k) for the K-means algorithm, the Dunn index (Dunn 1974; Ibrahim et al. 2021) and the Davies–Bouldin (DB) index (Davies & Bouldin 1979) were used. As shown in Figure 3, the Dunn index peaked and the DB index was at its lowest when k = 12, thereby establishing 12 as the optimal number of clusters.
Figure 3

Determination of optimal cluster number (a) Dunn index curve (b) DB index curve.
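
For readers who wish to reproduce this kind of selection, the two internal indices can be computed as in the following sketch. Several variants of the Dunn index exist; the version here (smallest pairwise distance between clusters divided by the largest cluster diameter) and the DB form based on mean distance to the centroid are assumptions, since the text does not specify which variants were used. The sample partition is hypothetical.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def davies_bouldin(clusters):
    """Mean over clusters of the worst (s_i + s_j) / d(c_i, c_j) ratio."""
    cents = [centroid(c) for c in clusters]
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k


def dunn(clusters):
    """Smallest between-cluster distance over largest cluster diameter."""
    min_sep = min(math.dist(p, q)
                  for a, b in combinations(clusters, 2) for p in a for q in b)
    max_diam = max(math.dist(p, q)
                   for c in clusters for p, q in combinations(c, 2))
    return min_sep / max_diam


# Hypothetical partition of four samples into two clusters.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
print(davies_bouldin(clusters), dunn(clusters))
```

In practice both functions would be evaluated for each candidate k, and the k with a high Dunn index and a low DB index would be preferred. The `dunn` function as written requires at least one cluster with two or more members.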


Results of final clustering

The final clustering phase, executed using the K-means algorithm, segregated the 545 small watersheds into 12 distinct groups, as depicted in Figure 4. However, the variety of similarity indicators and the complex topography of Shandong Province led to limited spatial clustering within each group. Groups III, VI, VIII, X, and XII contained a higher number of small watersheds, whereas groups I, IV, IX, and XI contained fewer. This is attributed to the distinctive characteristics of the watersheds in these smaller groups, including their geographical features, vegetation cover, soil characteristics, and rainfall, which resulted in a lower degree of similarity with other watersheds.
Figure 4

Clustering results of small watersheds.


Discussion

Characterization of different groups of small watersheds

The 545 small watersheds were classified into 12 groups of similar watersheds, with the distribution of basic characteristics and major land-use and soil texture types detailed in Figures 5 and S3.
Figure 5

(a) Spatial distribution of topography, rainfall and other indicators for each group of small watersheds; (b) Percentage of area for each land-use type and soil texture type in each group of small watersheds. (Note: (a): 1, 2, …, 12 denote group I, group II, …, group XII. (b): ‘high’, ‘relatively high’, ‘medium’, ‘relatively low’, and ‘low’ in the figure represent the distribution of the indicator values of the watershed group among all the small watersheds).


Groups I, VI, and X, located in mountainous areas, are characterized by steep slopes, high elevations, sparse populations, scattered villages, and significant undulation. Group I, in particular, has a smaller number of watersheds. Its land use is primarily forested land, farmland, and grassland, with the largest area being forested. These watersheds have a relatively high shape coefficient, dense river networks, high vegetation cover, more tortuous rivers, and predominantly sandy loam and clay loam soils. These soil types suggest average water permeability, indicating a tendency for these areas to accumulate rainfall and pose flooding risks. Group VI shares similarities with Group I but differs in soil content and the percentage area of land-use types. Group X's small watersheds have a lower river network density, are predominantly used for cropland, and their runoff processes are significantly influenced by irrigation.

Groups II, VII, VIII, and XII, which differ only slightly in their characteristics, are situated in hilly areas with steep slopes, high elevations, and relatively dense river networks. In terms of land use, these areas consist predominantly of arable land and land used for housing and construction, indicating frequent human activity. The soil types in these groups are primarily sandy clay loam and sandy loam, which have weak water permeability. In the event of a flood, these sub-basins are particularly likely to suffer significant casualties and economic losses, and they should therefore be prioritized in flash flood prevention and control efforts.

Groups III, V, and XI are located in regions with gentle slopes and low elevations, which are more densely populated and have a lower likelihood of experiencing flash floods compared to other groups. However, if prolonged and persistent rainfall occurs, the weak water permeability of the soils in these small watersheds could trigger siltation flash floods or waterlogging, potentially leading to economic losses and casualties.

Groups IV and IX are situated near lakes and estuaries and contain only a small number of small watersheds, with topographic characteristics similar to those in Group V. However, the soil types in Groups IV and IX are predominantly silt loam, which offers good water permeability. The relatively high proportion of water area in Group IX suggests a higher number of small reservoirs and other water conservancy facilities within the watersheds of this group, which may influence the hydrological dynamics and management strategies in these areas.

Discussion on reasonableness

The flood peak modulus of the small watersheds can be used to further validate the rationality of the clustering approach. In this study, design peak flows for the 545 small watersheds were calculated using the hydrologic-hydraulic method (Xue 2016).

As depicted in Figure 6(a), the flood peak modulus in the study area's small watersheds tends to decrease gradually outward from two focal points in the central and eastern parts of the area. This pattern aligns closely with the clustering results of the small watersheds. Figure 6(b) presents a scatter plot of the flood peak modulus distribution across the 12 groups of similar watersheds. Table S1 shows the mean flood peak modulus values for each group. Analysis of Figure 6(b) and Table S1 reveals that average watershed slope, main stream gradient, and rainfall are significant determinants of the flood peak modulus. Watershed groups with larger mean slope, mean elevation, and main stream gradient therefore have a relatively large flood peak modulus, while watersheds with gentler slopes and gradients generally have a smaller one. Watersheds with similar topographical and rainfall conditions also show comparable flood peak modulus values. However, there is notable variance within some watershed groups, attributable to the inclusion of features such as stream tortuosity and river network density in the similarity indicators. These features are only weakly correlated with slope, main stream gradient, and rainfall, which leads to the observed dispersion in flood peak modulus values among similar watersheds.
Figure 6

Flood-modulus results: (a) Flood peak modulus distribution in sub-catchments; (b) Flood peak modulus distribution by sub-basin within six large communities.


Parameter transplantation test

This paper conducted a parameter transplantation experiment using the HEC-HMS model with two watersheds from Group X (Qinshui River watershed and Dongcun River watershed) in the eastern part of Shandong Province as examples. The Qinshui River watershed was designated as the reference watershed, and the Dongcun River watershed was considered an ungauged watershed. Initially, parameters were determined by selecting 10 floods from the Qinshui River watershed, with the results detailed in Table S2. Subsequently, these parameters were transplanted to the Dongcun River watershed, and the simulation results for five flood events are presented in Table S3 and Figure S5. According to the data in Table S3 and Figure S5, only one of the five floods was deemed unqualified, resulting in a qualification rate of 80%. This indicates that the methodology of grouping basins based on similar physical attributes is a viable approach for hydrological forecasting in ungauged watersheds.

Extension and application of results

This paper further generalizes the outcomes of watershed clustering. Given that the study area encompasses virtually all types of watersheds in Shandong Province, the clustering results have been employed to train a random forest classification model (Jonathan et al. 2022). The trained model can be used to identify the category of any ungauged watershed in Shandong Province.
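
The generalization step can be sketched with scikit-learn's `RandomForestClassifier`. The paper does not state which software or hyperparameters were used, so the library choice, the settings, and the tiny two-indicator dataset below are all assumptions for illustration; a real application would use the full set of standardized indicators and the 12 cluster labels.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical indicator vectors (two per watershed) with the group labels
# obtained from a previous clustering step.
features = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2],
            [8, 9], [9, 8], [9, 9], [10, 9], [9, 10], [10, 10]]
groups = ["I"] * 6 + ["X"] * 6

# Train the classifier on the clustered watersheds.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(features, groups)

# Predict the group of watersheds that were not part of the clustering.
new_watersheds = [[1, 1], [9, 9]]
predicted = list(model.predict(new_watersheds))
print(predicted)
```

Once a group is predicted for an ungauged watershed, a gauged watershed from the same group can serve as the parameter donor.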

Taking the Gaocun River Basin as an example, its physical characteristics were input into the trained random forest classification model, which predicted that the basin belongs to Group I. Based on the existing hydrological data, the Baisha River Basin in this group was taken as the reference basin, and the parameters determined for it were transplanted to the Gaocun River Basin. The results of the flood simulation are documented in Table S5 and Figure S7. The errors for the four simulated floods are within a reasonable range, and the average Nash–Sutcliffe efficiency (NSE) is 0.685. This confirms the efficacy of this method for parameter transplantation in ungauged watersheds within Shandong Province and demonstrates its practical applicability.
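
The NSE reported above is understood here as the Nash–Sutcliffe efficiency, which compares simulation errors with the variance of the observations. A minimal sketch of the calculation follows; the discharge series are hypothetical and are not data from the Gaocun or Baisha River basins.

```python
def nash_sutcliffe_efficiency(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    if len(observed) != len(simulated):
        raise ValueError("series must have the same length")
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("observed series has no variance")
    return 1.0 - residual / variance


# Hypothetical discharges (m^3/s) for one flood event.
observed = [5.0, 20.0, 60.0, 35.0, 10.0]
simulated = [6.0, 18.0, 55.0, 38.0, 12.0]
print(round(nash_sutcliffe_efficiency(observed, simulated), 3))
```

A value of 1 indicates a perfect match, while a value of 0 means the simulation is no better than using the mean of the observations.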

In this research, a comprehensive analysis was conducted on 545 small watersheds in Shandong Province, using 22 indicators related to their runoff generation and routing processes. The watersheds were clustered using the SOM neural network and K-means methods. The validity of the classification was confirmed by analyzing the flood peak modulus. In addition, parameter transplantation with the HEC-HMS hydrological model was carried out for an ungauged watershed, which demonstrated the feasibility of parameter transplantation based on physical attribute similarity and further verified the rationality of the classification. Finally, the clustering results were used to extend the application of these findings within the province. The key findings of this study include:

  • (a) The 545 watersheds were categorized into 12 similar groups. The diverse and intricate topographical features of Shandong Province, along with the wide selection of similarity indicators, led to a predominantly dispersed spatial distribution within each group, with some degree of clustering observed in a minority of cases. Noticeable disparities were found among different groups, and the clustering results aligned with the geographical distribution of the flood peak modulus. This outcome demonstrates the ability of the method to identify similarities in runoff generation and routing conditions of the watersheds and also helps to clarify the hydrologic characteristics of each watershed.

  • (b) The watershed clustering method based on physical attribute similarity used in this study proves to be a viable solution for flood forecasting in ungauged watersheds. Looking forward, the applicability and stability of this method can be further analyzed by increasing the number of validated watersheds, provided that sufficient data are available. In addition, assessing the influence of individual indicators on runoff generation and routing processes across various watersheds could inform the assignment of weights to these indicators.
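The two-stage clustering summarized in these conclusions can be sketched as follows: a SOM first compresses the standardized indicator vectors into a small grid of prototype vectors, K-means then groups those prototypes, and each watershed inherits the group of its best matching unit. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation. The grid size, learning-rate and neighbourhood schedules, the function names `train_som`, `kmeans`, and `som_kmeans`, and the synthetic data are all hypothetical.

```python
import numpy as np


def train_som(data, rows, cols, n_iter=2000, lr0=0.5, seed=0):
    """Train a small rectangular SOM; return its codebook (rows*cols, n_features)."""
    rng = np.random.default_rng(seed)
    sigma0 = max(rows, cols) / 2.0
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    codebook = data[rng.integers(0, len(data), size=rows * cols)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 1e-2     # shrinking neighbourhood radius
        x = data[rng.integers(0, len(data))]
        bmu = np.argmin(((codebook - x) ** 2).sum(axis=1))  # best matching unit
        grid_dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))        # Gaussian neighbourhood
        codebook += lr * h[:, None] * (x - codebook)
    return codebook


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's K-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        dist2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster empties
                new_centroids[j] = points[labels == j].mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


def som_kmeans(data, rows, cols, k, seed=0):
    """Stage 1: SOM prototypes. Stage 2: K-means on prototypes. Map samples via BMU."""
    z = (data - data.mean(axis=0)) / data.std(axis=0)  # standardize each indicator
    codebook = train_som(z, rows, cols, seed=seed)
    _, node_labels = kmeans(codebook, k, seed=seed)
    bmus = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return node_labels[bmus]


# Hypothetical data: 150 "watersheds" with 4 indicators, not data from the study.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(m, 0.3, size=(50, 4)) for m in (0.0, 3.0, 6.0)])
labels = som_kmeans(data, rows=4, cols=4, k=3)
print(np.bincount(labels, minlength=3))
```

Running K-means on the SOM prototypes rather than on the raw samples is what makes the approach two-stage: the SOM smooths noise and reduces the number of points, and K-means supplies a fixed number of groups.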

The authors gratefully acknowledge the financial assistance from the Natural Science Foundation of Shandong Province (ZR2020ME249), the National Natural Science Foundation of China (42301046), and the Major Science and Technology Program of the Ministry of Water Resources of the People's Republic of China (SKS-2022013). The authors also thank the anonymous reviewers, whose comments significantly improved the quality of this paper. The datasets were provided by the National Tibetan Plateau/Third Pole Environment Data Center (http://data.tpdc.ac.cn).

Data cannot be made publicly available; readers should contact the corresponding author for details.

The authors declare there is no conflict.

Alves, M. A., Meneghini, I. R., Gaspar-Cunha, A. & Guimarães, F. G. (2023) Machine learning-driven approach for large scale decision making with the analytic hierarchy process. Mathematics, 11 (3), 627.
Cao, Z. T., Fang, Z. D., Yao, J. & Xiong, L. Y. (2020) Loess landform classification based on random forest. Journal of Geo-Information Science, 22 (3), 452–463.
Cheng, X., Ma, X. X., Wang, W. S., Xiao, Y., Wang, Q. L. & Liu, X. X. (2021) Application of HEC-HMS parameter regionalization in small watershed of hilly area. Water Resources Management, 35, 1961–1976.
Clare, A. P. & Cohen, D. R. (2001) A comparison of unsupervised neural networks and k-means clustering in the analysis of multi-element stream sediment data. Geochemistry: Exploration, Environment, Analysis, 1 (2), 119–134.
Ding, Y. X. & Peng, S. Z. (2020) Spatiotemporal trends and attribution of drought across China from 1901–2100. Sustainability, 12 (2), 477.
Dunn, J. C. (1974) Well-separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4 (1), 95–104.
Fan, M. G. & Liu, J. F. (2015) Analysis of hydrologically similar basins based on clustering analysis. Hydro-Science and Engineering, 4, 106–111.
Guo, Y. H., Zhang, Y. Q., Zhang, L. & Wang, Z. Z. (2020) Regionalization of hydrological modeling for predicting streamflow in ungauged catchments: a comprehensive review. Wiley Interdisciplinary Reviews-Water, 8 (1), e1487.
Han, D. H. & Tang, Y. G. (2021) Coal quality big data mining method and application based on SOM plus K-means two-stage clustering. Coal Science and Technology, 1–12. doi:10.13199/j.cnki.cst.2021-1048.
Hassan, B. G. H. & Ping, F. (2012) Formation of homogenous regions for Luanhe basin – by using L-Moments and cluster techniques. International Journal of Environmental Science and Development, 3 (2), 205–210.
Hrachowitz, M., Savenije, H. H. G., Blöschl, G., McDonnell, J. J., Sivapalan, M., Pomeroy, J. W., Arheimer, B., Blume, T., Clark, M. P., Ehret, U., Fenicia, F., Freer, J. E., Gelfan, A., Gupta, H. V., Hughes, D. A., Hut, R. W., Montanari, A., Pande, S., Tetzlaff, D., Troch, P. A., Uhlenbrook, S., Wagener, T., Winsemius, H. C., Woods, R. A., Zehe, E. & Cudennec, C. (2013) A decade of predictions in ungauged basins (PUB) – a review. Hydrological Sciences Journal, 58 (6), 1198–1255.
Huang, Y. P., Wang, Y. H., Wang, C., Liu, W. J., Wang, H., Lv, G. F., Lin, S. J. & Hu, Q. (2022) Characteristics analysis and zoning control of groundwater pollution based on self-organizing maps and k-means. Environmental Engineering, 40 (6), 31–41 + 47.
Ibrahim, O. A., Keller, J. M. & Bezdek, J. C. (2021) Evaluating evolving structure in streaming data with modified Dunn's indices. IEEE Transactions on Emerging Topics in Computational Intelligence, 5 (2), 262–273.
Jin, Y., Zhang, W. P., Liu, J. T. & Wu, G. Q. (2017) Relation analysis of topographic index and flood recession characteristics. Yangtze River, 48 (13), 23–25 + 53.
Jonathan, L., Adeline, F., Andyne, L., Céline, P. & Hugues, C. (2022) Prediction of forest nutrient and moisture regimes from understory vegetation with random forest classification models. Ecological Indicators, 144, 109446.
Kohonen, T. (1982) Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43 (1), 59–69.
Li, Q., Wang, Y. L., Li, H. C., Zhang, M., Li, C. Z. & Chen, X. (2017) Rainfall threshold for flash flood early warning based on flood peak modulus. Journal of Geo-Information Science, 19 (12), 1643–1652.
Liu, H., Chen, J., Li, J., Shao, L., Ren, L. & Zhu, L. (2023) Transformer fault warning based on spectral clustering and decision tree. Electronics, 12 (2), 265.
Merz, R. & Blöschl, G. (2004) Regionalisation of catchment model parameters. Journal of Hydrology, 287 (1–4), 95–123.
Peng, S. Z., Ding, Y. X., Wen, Z. M., Chen, Y. M., Cao, Y. & Ren, J. Y. (2017) Spatiotemporal change and trend analysis of potential evapotranspiration over the Loess Plateau of China during 2011–2100. Agricultural and Forest Meteorology, 233, 183–194.
Peng, S. Z., Gang, C. C., Cao, Y. & Chen, Y. M. (2018) Assessment of climate change trends over the Loess Plateau in China from 1901 to 2100. International Journal of Climatology, 38 (5), 2250–2264.
Peng, S. Z., Ding, Y. X., Liu, W. Z. & Li, Z. (2019) 1 km monthly temperature and precipitation dataset for China from 1901 to 2017. Earth System Science Data, 11, 1931–1946.
Peng, S. (2020) 1-km Monthly Precipitation Dataset for China (1901–2022). National Tibetan Plateau/Third Pole Environment Data Center. Xianyang: Northwest A&F University.
Ram, A., Jalal, S., Jalal, A. S. & Kumar, M. (2010) A density based algorithm for discovering density varied clusters in large spatial databases. International Journal of Computer Applications, 3 (6), 1–4.
Sun, Z. L., Liu, Y. L., Chen, X., Shu, Z. K., Wu, H. F., Wang, J., Bao, C. X. & Wang, G. Q. (2023a) Review of hydrological model parameter regionalization method. Journal of China Hydrology, 43 (4), 1–7.
Sun, Z. L., Liu, Y. L., Shu, Z. K., Chen, X., Wang, J. & He, R. M. (2023b) Review of the theories of hydrological similarity. Hydro-Science and Engineering, 3, 155–164.
Varouchakis, E. A., Perez, G. A. C., Loaiza, M. A. D. & Spanoudaki, K. (2022) Sustainability of mining activities in the European Mediterranean region in terms of a spatial groundwater stress index. Spatial Statistics, 50, 100625.
Varouchakis, E. A., Solomatine, D., Perez, G. A. C., Jomaa, S. & Karatzas, G. P. (2023) Combination of geostatistics and self-organizing maps for the spatial analysis of groundwater level variations in complex hydrogeological systems. Stochastic Environmental Research and Risk Assessment, 37 (8), 3009–3020.
Wang, J. J., Ding, J. L., Zhang, C. & Zhang, Z. (2016) Runoff simulation based on SCS mode in Bortala River Basin in Xinjiang. Transactions of the Chinese Society of Agricultural Engineering, 32 (7), 129–135.
Wu, H. S., Shi, P., Qu, S. M., Qiu, C., Ding, S., Wang, X., Lu, M. X. & Li, Z. C. (2023) Research progress and discussion on hydrologic similarity. Water Resources Protection, 39 (6), 77–86 + 94.
Xue, X. (2016) Research on Small Watershed Classification and Early-Warning Indicators of Flash Flood Disasters in Prevention Areas. [online]. Jinan: Shandong University. Available at: https://kns.cnki.net/KCMS/detail/detail.aspx?dbcode=CMFD&dbname=CMFD201701&filename=1016164245.nh&v = .
Yang, Y. Q., Liu, J. T., Yang, S. & Bai, T. K. (2022) Hydrological classification and model parameter transfer of basins in eastern hilly region based on physical similarity method. Water Resources and Power, 40 (4), 23–27.
Yi, X., Zhou, F., Wang, X. Y., Yang, Y. H. & Guo, H. C. (2014) Classification and runoff simulation of data-scarce basins based on self-organizing maps. Progress in Geography, 33 (8), 1109–1116.
Yu, G. S. & Li, K. (2014) Watershed image segmentation based on PSO and FCM. Advanced Materials Research, 1070–1072, 2041–2044.
Zhao, Y. C., Wang, J. H., Liang, J. P., Yu, J. J. & Wang, D. (2017) Application of HEC-HMS for hydrological simulation in Zijingguan watershed. Water Resources and Power, 35 (12), 10–13.
Zhou, K. L., Yang, S. L., Ding, S. & Luo, H. (2014) On cluster validation. Systems Engineering-Theory & Practice, 34 (9), 2417–2431.
Zhu, B. J., Kan, G. Y. & He, X. Y. (2020) Study on hydrological similarity and parameter transplantation of dataless nested watershed. Journal of China Institute of Water Resources and Hydropower Research, 18 (03), 223–231.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY-ND 4.0), which permits copying and redistribution with no derivatives, provided the original work is properly cited (http://creativecommons.org/licenses/by-nd/4.0/).

Supplementary data