Abstract
Partitioning water distribution networks (WDNs) into district metered areas offers benefits including reduced nonrevenue water and simplified pressure management. However, current research in this field tends to narrowly focus on the topological relationships among nodes, often overlooking the influence of pressure reducing valves (PRVs) and pump stations on clustering results. To address this limitation, this study introduces a topology-distance-based clustering (TDBC) method that enhances the accuracy of partitioning by explicitly considering the impact of PRVs and pump stations. In the TDBC method, the Floyd algorithm is initially employed to construct a topological distance matrix that quantifies the degree of node connectivity. By amplifying topological distances for links including PRVs and pump stations, their effect on clustering results is incorporated. Subsequently, nodes are clustered using the K-means algorithm based on the resulting topology-distance matrix. The proposed TDBC approach is applied to four network cases, and its outcomes are compared with those of two traditional methods. The comparative analysis indicates that the TDBC algorithm achieves precise partitioning results for networks incorporating PRVs or pump stations, while ensuring a harmonious balance between modularity and the uniformity of the partitioning results, even in networks with complex structures and highly interconnected loops.
HIGHLIGHTS
Incorporation of PRVs and pump stations: The TDBC method introduces a novel approach by explicitly incorporating the influence of pressure reducing valves (PRVs) and pump stations into the clustering process.
Topological distance-based clustering: The TDBC method shifts from traditional geographical proximity-based methods to a topological distance-based approach.
Improved clustering uniformity.
INTRODUCTION
Treated water, being an indispensable resource for human survival, is commonly distributed to customers through networks (Alamanos 2021; Li & Yu 2022; Tufa & Abate 2022). However, with the expansion of urban areas, water distribution networks (WDNs) are also growing in size, presenting significant management challenges (Albarakati et al. 2021; Liu & Kang 2021; Zhang et al. 2022). It is widely acknowledged that partitioning WDNs into district metered areas (DMAs) offers several benefits (Saldarriaga et al. 2019). These benefits include reducing nonrevenue water, simplifying pressure management, facilitating rapid detection of burst pipes, and isolating districts to prevent contamination (Bui et al. 2020; Diao 2021). Therefore, the partitioning of WDNs into smaller sub-networks (i.e., DMAs) has become a prominent and extensively studied topic (Nardo et al. 2017; Creaco & Haidar 2019).
The manual partitioning WDNs into DMAs presents challenges due to the complex network topology, especially in large-scale systems with significant looping. Hence, extensive research has been dedicated to automating the creation of DMAs over the past few decades. Notably, Nardo & Natale (2011) and Nardo et al. (2013, 2020) have been significant contributors to utilizing optimization techniques based on graph theory for automating the design of DMAs. Additionally, Deuerlein (2008) and Perelman & Ostfeld (2011) developed topological clustering tools for WDN analysis; however, these were not specifically designed for WDN partitioning. In contrast, Liu & Han (2018) presented a three-step methodology employing spectral clustering and graph theory for effective DMA design. Building upon this progress, Liu & Lansey (2020) introduced a multi-phase DMA design method to address the partitioning of complex WDNs. More recently, Creaco et al. (2022) introduced a novel multi-objective genetic algorithm with linear programming for efficient WDN partitioning.
Diao et al. (2013) pioneered the concept of community-based partitioning for WDNs, utilizing modularity as a key indicator. Subsequently, researchers in WDN partitioning have devoted considerable attention to identifying community structures, characterized by strong internal connections, and assessing modularity (Bui et al. 2020). However, it has been observed that maximizing modularity may lead to the formation of impractically large clusters (Zhang et al. 2017). Hence, the consideration of other metrics, such as uniformity, has become indispensable. For instance, Liu et al. (2018) compared different methods for creating DMAs and highlighted the balance of partitions as a critical indicator. Additionally, Zhang et al. (2021) proposed a modified modularity approach to represent elevation uniformity within DMAs. Recently, Creaco et al. (2023) introduced a novel modularity formulation and algorithm for the optimal clustering of dual WDNs, ensuring uniform properties and improved hydraulic performance. Nevertheless, it is noteworthy that while the aforementioned studies have significantly contributed to WDN partitioning, they focus solely on the topological relationships between pipes and nodes. None of them takes into account existing pressure reducing valves (PRVs) and pump stations, despite these components directly influencing the partitioning results. Therefore, an essential challenge remains in incorporating PRVs and pump stations into the automatic partitioning process.
Detecting communities in complex networks using unsupervised machine learning, especially the K-means clustering algorithm, constitutes a noteworthy research field due to its simplicity, efficiency, and adaptability in defining the desired number of clusters. For instance, Li et al. (2019) integrated K-means with density and canopy clustering, demonstrating superior accuracy and visualization capabilities for social and biological networks. Zhou et al. (2020) introduced the CPMK-Means algorithm, addressing convergence and time complexity issues and outperforming other algorithms across various metrics. In a different domain, Chaudhary & Singh (2021) employed K-means clustering to analyze the impact of COVID-19, uncovering hidden community structures among countries. In the context of WDN partitioning research, Hajebi et al. (2013) applied the K-means algorithm to geographically cluster the network and employed a multi-agent system for boundary node adjustments based on hydrological constraints. Nonetheless, further validation is necessary to ascertain its effectiveness in larger WDNs.
In summary, DMA design involves two primary phases: clustering, which forms DMAs based on network connectivity and topology; and sectorization, which optimizes the placement of flow meters and valves for network performance and cost minimization. This study specifically focuses on the first phase of network clustering and presents an innovative topology-distance-based clustering (TDBC) method for WDN partitioning. Compared to existing methods, the significant contribution of this research lies in its explicit integration of PRVs and pump stations during the clustering process, aiming to achieve a more accurate DMA design.
Specifically, PRVs and pump stations are indispensable components in a WDN: PRVs lower high water pressure to protect infrastructure and conserve water, while pump stations elevate pressure to ensure a consistent flow, especially in areas with elevation differences. As a result, these elements naturally define water supply zone boundaries and should be integrated into the DMA design of a WDN. To explicitly account for their impact on automatic network partitioning, the proposed TDBC method introduces topological distance to quantify node connectivity and amplifies the distance for links containing PRVs and pump stations. Nodes are then clustered using the K-means algorithm based on this topology distance matrix, ensuring that networks can be divided according to the specified clustering number.
The remaining sections of the paper are as follows: Section 2 starts by presenting the schematic of this research work. It is then followed by a description of the proposed TDBC method. Additionally, the modularity-based fast greedy clustering (MFGC) method is briefly introduced. Section 3 provides information about the four networks used in this study. In Section 4, the application results of three different methods to the four networks are compared and analyzed. Section 5 summarizes the main findings of this study.
METHODOLOGY
In this study, the GCBP and TDBC methods were developed and executed within the MATLAB environment, while the MFGC method was specifically implemented using the Python language. Additionally, network files sourced from EPANET (Rossman 2000) were employed to construct the adjacency matrix, thereby facilitating the creation of the topology distance matrix and the successful execution of the MFGC method.
The proposed TDBC method
The Floyd algorithm, a fundamental technique in graph theory and network analysis, efficiently computes the shortest paths between all pairs of vertices in a graph, even when negative edge weights are present (Hougardy 2010). This versatility makes it a valuable choice for addressing various real-world problems in this study. The key advantages of using the Floyd algorithm are as follows: Firstly, it can efficiently determine all-pairs shortest paths in a single run, making it particularly advantageous for scenarios requiring computation of multiple shortest paths simultaneously. Secondly, the algorithm's simplicity and ease of implementation render it accessible to researchers and developers with varying levels of expertise. Thirdly, the Floyd algorithm's ability to handle graphs with negative edge weights, provided no negative cycles are present, enhances its applicability in real-world models involving negative weights. For further implementation details of the Floyd algorithm, please refer to Zhao & Zhao (2017).
The K-means algorithm is a widely used clustering technique for partitioning a given dataset into K distinct clusters. It operates iteratively by minimizing the total intra-cluster variance, ensuring convergence to a stable solution. Based on the constructed topological distance matrix, the K-means algorithm treats the nodes in a WDN as data points in a multi-dimensional space. By assigning each data point to the cluster with the nearest mean, it naturally lends itself to partitioning WDNs. This is attributed to its ability to cluster nodes with smaller topological distances, resulting in tightly connected nodes within each cluster while maintaining sparse connectivity between different clusters. Consequently, the incorporation of the topological distance matrix renders the K-means algorithm a valuable tool for effectively achieving this objective in WDN partitioning tasks.
One of the key advantages of the K-means algorithm lies in its simplicity and efficiency. It efficiently handles large datasets and can be easily implemented, making it a practical choice for various clustering tasks. Additionally, the algorithm's straightforward interpretation empowers users to gain insights into the clustering process and make informed decisions based on the resulting clusters. In this study, we adopt the approach proposed by Arthur & Vassilvitskii (2007) for selecting initial centroids, as it has been proven to yield more stable and improved clustering results compared to the standard K-means method.
The MFGC method
Diao et al. (2013) were among the pioneers in introducing the MFGC algorithm for WDN partitioning. The rationale behind their work stems from the goal of achieving clustering results for WDNs where nodes within each cluster are tightly interconnected, while connections between different clusters are sparse. This arrangement is expected to be advantageous as it allows for the implementation of DMAs with a reduced number of flow meters and valves, leading to more efficient water management. Furthermore, as WDNs expand in size due to the growth of residential communities, they naturally exhibit community-like properties. Hence, the application of the MFGC algorithm to detect and partition communities within WDNs is justified. Further implementation details for the MFGC algorithm can be found in Diao et al. (2013).
CASE STUDIES
The Anytown and rural networks represented a class of networks with complex structures and highly interconnected loops, or they were large-scale complex DMAs requiring further partitioning. These two networks were selected to test the algorithm performance in terms of modularity and uniformity when applied to complex network partitioning scenarios. The Anytown distribution system, conceived by Walski et al. (1987) for a Battle of the Networks competition aimed at enhancing analysis methods, is a hypothetical system. The rural network, on the other hand, represents a hydraulic model of an actual network derived from a piped irrigation system. It was initially employed by Marchi et al. (2014) to assess the performance of various evolutionary algorithms in optimizing WDSs. For comprehensive network information, including the EPANET files for both the C-town and Anytown networks, please refer to the Kentucky Water Resources Research Institute's website. Detailed information about the rural network can be found in Marchi et al. (2014).
RESULTS AND DISCUSSION
Comparison with the traditional GCBP method
This section compares the GCBP method with the newly proposed TDBC method on the simple and C-town networks, investigating clustering mechanisms and partitioning effectiveness with PRVs and pump stations. Both methods use the K-means clustering algorithm, but GCBP relies on geographical coordinates, while TDBC uses the topological distance matrix.
Partitioning results for the simple network
In contrast, the proposed TDBC method clusters node 4 with nodes 1, 2, and 3, and node 8 with nodes 5, 6, and 7, as depicted in Figure 4(b). The rationale behind the results can be attributed to the direct connections between node 4 and node 2, the direct connection between node 8 and node 7, as well as the presence of the PRV between these two clusters. Therefore, this indicates that the proposed TDBC method takes into account the pipe connectivity, as well as the presence of PRVs to achieve accurate partitioning results.
As illustrated by Figure 5(a), the traditional GCBP method prioritizes nodes with shorter geographical distances for clustering, such as nodes 4 and 6, as well as nodes 3 and 8. However, Figure 5(b) displays the proposed TDBC algorithm, which clusters nodes based on closer topological distances, such as node 2 with node 4 and node 7 with node 8. This distinction highlights that the proposed method focuses on the level of connectivity between nodes rather than their geographical distances. In conclusion, the comparative analysis reveals distinct clustering mechanisms, with the TDBC method outperforming in achieving accurate network partitioning. This is attributed to its incorporation of nodal connectivity and the presence of PRVs, effectively improving clustering accuracy.
Partitioning results for C-town network
Figure 6(b) reveals that the proposed TDBC method produces identical partitioning results to the manual method, demonstrating its effectiveness in handling real-world networks, including pump stations. On the other hand, Figure 6(d) indicates less accurate results for the proposed TDBC method without magnified topological distances for pump stations. This observation further validates the rationale behind the TDBC method, which emphasizes the importance of magnifying topological distances of pump stations to achieve more precise network partitioning.
Additionally, Figure 6(c) demonstrates that the traditional GCBP method produces impractical partitioning results. This is evident from the fact that the GCBP method groups together a large number of nodes that lack connections to each other. This is a result of the method's sole reliance on the geographical locations of nodes, with no consideration for node connections or network topology. Consequently, the resulting clusters significantly deviate from the manual method's results. This emphasizes the infeasibility of the traditional GCBP method for effectively clustering complex networks with pump stations.
Comparison with the MFGC method
This section compares the proposed TDBC method with the MFGC method on the Anytown network and the rural network, using modularity and uniformity metrics. The aim is to understand the capabilities and limitations of these two clustering methods.
Partitioning results for Anytown network
Figure 9 illustrates the superior performance of the proposed TDBC algorithm compared to the MFGC method in achieving greater node uniformity across the four clustering scenarios. Specifically, the box plots representing the TDBC algorithm exhibit smaller interquartile ranges, thereby indicating better uniformity in node counts across the clusters. In contrast, the MFGC method displays larger box plots, suggesting greater discrepancies in node counts among the clusters. Additionally, as the number of clusters decreases, the box plots of the MFGC method gradually increase in size, indicating a decrease in node count uniformity. However, the proposed TDBC algorithm consistently maintains smaller box plots across different cluster numbers, thereby demonstrating its ability to maintain balanced node counts.
As depicted in Figure 10, the MFGC and TDBC methods demonstrate fundamentally different clustering mechanisms, with MFGC relying on the consolidation of small communities for clustering, while the TDBC method achieves clustering through the generation of new centroids. Figure 10(a) shows that the MFGC method achieves a final partitioning scheme comprising two clusters by merging four clusters through three rounds of community merging. Specifically, C5 is initially merged with C1, resulting in a community comprising six nodes. Subsequently, C3 is merged with C4, leading to a community with nine nodes. Finally, these two merged communities are combined to form a larger community consisting of 15 nodes. It is important to note that throughout this process, the C2 community does not participate in any community merging, resulting in an uneven distribution of nodes among the final two communities. This discrepancy can be attributed to the mechanism of the MFGC method, which prioritizes the maximization of the current modularity Q without taking node uniformity into account. Consequently, as the desired number of clusters decreases, there is a tendency to merge large communities with other large communities, while the size of small communities remains unchanged. This ultimately leads to nonuniformity among nodes within clusters, highlighting the prioritization of modularity at the expense of node uniformity.
As depicted in Figure 10(b), the TDBC method, built upon the K-means algorithm, achieves clustering by generating dispersed cluster centroids. This implies that the clustering process is executed autonomously for each partition, and the formation of larger communities does not necessitate the amalgamation of smaller ones. Consequently, this avoids the issue of community merging that is evident in the MFGC method. As a result, the TDBC method ensures a more balanced distribution of node counts among the clustering partitions. In addition, the proposed TDBC method utilizes topological distance for node clustering, resulting in autonomous grouping of nodes with strong connections within their respective clusters. This aspect contributes to the achievement of high modularity in the obtained clustering results.
Application results for rural network
Overall, the proposed TDBC method consistently achieves favorable outcomes in terms of modularity, with all values surpassing 0.3, despite modularity (Q) not being the optimization criterion. According to Newman & Girvan (2004), a modularity value greater than 0.3 indicates satisfactory petitioning results with significant community structure. Therefore, these results suggest that the proposed TDBC method effectively partitions the network into clusters that align with the community structure, characterized by strong internal connections.
In contrast, the proposed TDBC method consistently generates smaller and more consistent box plots across the three clustering schemes, implying more comparable number of nodes among the clusters, as indicated by Figure 13(b). These findings indicate that the TDBC algorithm is capable of producing clustering schemes that are more uniformly distributed across various partition scenarios. This adaptability is achieved by updating the centers for each clustering implementation, thereby avoiding situations with excessively large or small cluster sizes. Consequently, the proposed TDBC method ensures more balanced clustering sizes and mitigates the issue of extreme disparities observed in the proposed MFGC method. In summary, within the framework of networks possessing intricate structures, the TDBC method, when compared to the MFGC method, produces clustering patterns with comparable modularity, yet it showcases less variation in node count across the clusters. This suggests that the TDBC method holds substantial promise as an approach to enhance clustering performance within such networks.
SUMMARY AND CONCLUSION
This paper presents a novel TDBC method for automatic partitioning of WDNs, which serves as the first step in forming DMAs. The motivation behind this method arises from the oversight in previous studies, where the incorporation of the PRVs and pump stations into the clustering process was neglected, despite their significant impact on the partitioning results. The proposed approach involves two key steps: firstly, constructing a topological distance matrix, and secondly, employing K-means clustering for automated partitioning, wherein the topological distance for links containing PRVs or pump stations is magnified. Comparative evaluations are conducted among the results obtained from three methods, including the traditional GCBP method, the proposed TDBC method, and the MFGC method, across four distinct networks, and the main findings are summarized.
- (1)
The proposed TDBC algorithm achieves correct partitioning results for networks that include PRVs or multiple pump pumps. However, the traditional GCBP method proves to be impractical when applied to these types of networks, as evidenced by the erroneous partitioning results observed in both the cases of a simple network and the C-town network.
- (2)
The new TDBC method differs significantly from the GCBP method in its partitioning approach. Instead of using geographical proximity, the TDBC method uses topological distance to assess node connectivity. This shift leads to partitioning results with greater modularity values, even though modularity is not the main optimization goal. This is confirmed by the clustering results for Anytown and rural networks, where modularity surpassed 0.3. These results confirm TDBC's effectiveness in creating well-structured clusters.
- (3)
The TDBC algorithm outperforms the MFGC method in terms of clustering uniformity, even though it may have slightly lower modularity values. This distinction is especially evident when dealing with fewer clusters. The MFGC method has a tendency to merge smaller communities to create larger ones, prioritizing modularity at the expense of uniformity. Conversely, the TDBC algorithm executes the clustering independently for each implementation. This guarantees that the establishment of larger communities does not require merging smaller ones, thus reducing the occurrence of significantly varying cluster sizes and enhancing uniformity within the clusters.
In conclusion, the proposed TDBC can serve as a valuable alternative for automatically partitioning WDNs. This is attributed to its explicit consideration of the impact of components such as valves and pumps on the network partitioning, coupled with its ability to maintain a balance between community attributes and the uniformity of partitioning results. However, the validation of the proposed TDBC method in this study is confined to relatively small networks, a limitation requiring future research attention. Furthermore, given its successful application in addressing PRVs and pump stations in DMA design, there is significant potential to expand the method's scope to encompass pipes crossing rivers or administrative districts. This expansion, involving the incorporation of pseudo-PRVs on these pipes, represents an important avenue for future research.
ACKNOWLEDGEMENTS
This work was funded by the National Natural Science Foundation of China (52260011).
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.