ABSTRACT
Understanding daily water consumption patterns is crucial for the efficient management and distribution of water resources, as well as for promoting energy conservation and achieving carbon peaking and neutrality targets. This study aims to identify typical daily water consumption patterns and explore the factors that influence them. It compares the performance of three clustering algorithms, K-means Clustering (KC), Agglomerative Hierarchical Clustering (AHC), and Spectral Clustering (SC), using the Silhouette Coefficient Index (SCI) and the Calinski–Harabasz Index (CHI) as evaluation metrics, in a case study based on the original hourly flow series of a water distribution division. The findings are as follows: (1) among the three algorithms, KC performs best, with SCI values of 0.6315, 0.5922, and 0.6272 and CHI values of 305.9207, 274.1120, and 302.4738 for KC, AHC, and SC, respectively; (2) KC identifies three distinct typical daily water consumption patterns; (3) seasons have a significant impact on daily water consumption patterns; (4) conversely, weekdays and holidays have minimal effect on daily water consumption patterns. These results highlight the importance of understanding daily water consumption patterns and the effectiveness of KC in identifying them.
HIGHLIGHTS
K-means Clustering performs the best among the three clustering algorithms.
Three typical daily water consumption patterns were identified in the case study.
Season was found to be a significant influencing factor for the three patterns.
ABBREVIATIONS
INTRODUCTION
In recent years, the severity of water scarcity caused by climate change has heightened the importance of the solar distillation system (United Nations Human Settlements Programme 2011; Ashok Kumar & Samsher 2021a; Ashok Kumar 2023). Coupled with the continuous growth of the population, this has resulted in a more urgent demand for water supply pumping stations. Understanding and studying water consumption patterns play a crucial role in controlling, scheduling, and optimizing the operation and management of water supply pumping stations, as it provides valuable insights into the demand patterns of water consumers. Additionally, accurate modeling of water consumption patterns aids in the design and optimization of pumping station control systems (Avni et al. 2015; Hussien et al. 2016). The study of water consumption patterns provides essential information for water resource planning and management (Dong et al. 2013) and allows us to predict and anticipate peak periods of water demand. This enables efficient allocation of water resources and ensures a reliable water supply to meet consumer needs. Furthermore, by identifying and analyzing the factors influencing water consumption patterns, we can develop strategies to encourage water conservation and promote sustainable water usage practices. Understanding the temporal variations in water demand helps develop effective control and scheduling strategies for pumping stations, ensuring optimal operation and energy efficiency. This, in turn, leads to cost savings, reduced energy consumption, and improved overall system performance. In recent years, numerous studies have focused on conducting technology–environment–economy–energy matrix observations under specific conditions to seek more comprehensive solutions. 
These studies aim to delve deeper into understanding the environmental impacts, economic feasibility, and energy efficiency of different technologies, providing more accurate information for decision-making and planning processes (Ashok Kumar & Samsher 2020, 2021a, 2021b, 2022a, 2022b, 2022c).
As water scarcity becomes increasingly widespread, it is crucial to develop effective water management strategies that are based on an enhanced understanding of urban water demand and the factors that influence household water usage patterns. This understanding can be achieved through data analysis and Machine Learning (ML), particularly by analyzing real-time flow sequences (Cominola et al. 2019).
Many researchers have shown an increasing interest in ML to address various problems in water resources and hydrology (Maier & Dandy 2000). The use of ML, a branch of Artificial Intelligence (AI), has supported numerous water studies by providing tools for high-level exploratory and statistical analysis of large-scale water consumption data (Garcia et al. 2017; Duerr et al. 2018; Rahim et al. 2021). From a data label perspective, there are three modes of learning: unsupervised, supervised, and semi-supervised (Ang et al. 2015; Li et al. 2017). Clustering algorithms are popular methods for data analysis, particularly in unsupervised learning tasks where items or objects are grouped based on inherent similarities among them (Saxena et al. 2017; Rahim et al. 2021). These algorithms can effectively group flow sequences through calculations, providing an effective method for identifying water consumption patterns. Chen demonstrated the successful application of similarity-based cluster analysis to sequence datasets using different distance measures (Chen 2007).
Clustering algorithms can be categorized into two main types: hierarchical clustering algorithms and non-hierarchical clustering algorithms (Gülagiz & Sahin 2017). K-means Clustering (KC) is a widely recognized non-hierarchical algorithm that has been extensively applied across different domains, with several researchers demonstrating its effectiveness in cluster analysis (Gnanadesikan 2011; Rahim et al. 2020; Mirzal 2022). On the other hand, Agglomerative Hierarchical Clustering (AHC) belongs to the hierarchical clustering category and has found significant usage in addressing urban water issues (Yu et al. 2013; Diao et al. 2014). While KC produces clusters that form convex sets, Spectral Clustering (SC) has been shown to effectively handle more complex problems, such as intertwined spirals (Ng et al. 2001; Nascimento & De Carvalho 2011; Ding et al. 2014; Saxena et al. 2017). However, in the context of water consumption research, SC has received less attention compared to KC and AHC. Previous studies have explored the application of these three algorithms in various areas, such as urban flood detection, early warning, and clustering performance in time series water depth difference and flood prediction (Li et al. 2020). While these algorithms have demonstrated good performance in various areas of data analysis, we discovered that SC, as an excellent algorithm, is rarely utilized in water consumption research. This observation greatly inspired our study.
Currently, there are numerous studies on residential water consumption patterns (Memon & Butler 2006; Wong et al. 2010; Browne et al. 2014; Yang et al. 2015; Garcia et al. 2017; Vieira et al. 2018). However, these studies vary in terms of data sources, materials, spatial scales, temporal scales, and clustering algorithms employed. Some scholars have employed self-mapping clustering algorithms, such as Self-Organizing Maps (SOMs), supplemented by KC and AHC, to analyze the water consumption behavior of individual users within a specific study area over a short period. The combination of KC and AHC has been shown to greatly enhance the performance of clustering algorithms (Ioannou et al. 2021). Furthermore, the application of spectral algorithms in a real-life water distribution network (WDN) in South Italy demonstrated improved performance compared to graph partitioning in terms of minimizing the number of edge cuts, making it more efficient in both hydraulic and economic aspects (Khoa Bui et al. 2020).
Some of these studies rely on questionnaires, and the data sources lack completeness, accuracy, and representativeness (Beal et al. 2018; Cominola et al. 2019). Additionally, most of these studies primarily focus on water-consuming fixtures commonly found in households, such as dishwashers, toilets, and kitchen faucets, to segment users and explore different water consumption patterns (Russell & Fielding 2010; Aghabozorgi et al. 2015; Nguyen et al. 2015). Research based on real water consumption data yields more accurate results. Such studies are conducted on a monthly or yearly time scale and have achieved favorable outcomes (Gato et al. 2007; Wang et al. 2009; Dey et al. 2012).
However, existing studies often have limitations in terms of their time scale and utilization of hourly water consumption data from a substantial number of users. Additionally, these studies incorporate multiple factors that influence water consumption, resulting in complex and accurate models. However, the requirement for a significant amount of data to build these models can be inconvenient to handle. In contrast, this study focuses on real-time flow data, which offers advantages in terms of ease of clustering and manipulation. By employing KC, AHC, and SC, this research not only identifies typical daily water consumption patterns but also analyzes the factors that contribute to these patterns. The utilization of these clustering techniques enables a comparative analysis, providing novel insights into the understanding of water consumption behavior. The findings of this study not only contribute to the field of water consumption modeling methods but also offer valuable guidance for future research in this area. By emphasizing the significance of utilizing real-time flow data and employing different clustering techniques, this research contributes to the development of more efficient and effective approaches for analyzing water consumption patterns.
The remainder of this study is structured as follows: In Section 2, a comprehensive introduction to the clustering algorithms employed for identifying daily water consumption patterns is provided, including KC, AHC, and SC. Subsequently, the study data and the preprocessing methods applied to the data are presented. Section 3 presents a comparison of the three clustering algorithms using evaluation metrics. Finally, in Section 4, the impact of weekdays or holidays on clustering results, as well as the influence of seasons on clustering results, is discussed.
METHODOLOGY AND DATA
Clustering algorithms
K-means Clustering
KC is a typical unsupervised clustering algorithm (Ahmed et al. 2020), which aims to divide the input sample dataset into k clusters such that the items of the same cluster are as similar to each other as possible, while the items of different clusters are as different as possible. The algorithm runs as follows (Sinaga & Yang 2020):
(1) First, input the value of k, and then select the initial centers of mass from the dataset D.
(2) Calculate the distance between each point in the sample set and each center of mass using the Euclidean distance measure, then assign each point to the cluster whose center of mass is closest. The Euclidean distance is calculated as follows:
$$d(x, \mu) = \sqrt{\sum_{i=1}^{n} (x_i - \mu_i)^2}$$
where $x$ represents a sample point within the cluster, $\mu$ represents the center of mass of the cluster, $n$ represents the number of features in each sample point, and $i$ indexes the individual features that constitute a point $x$.
(3) After assigning all the objects to their respective clusters, the cluster centers are recalculated by considering the existing objects within each cluster.
(4) Repeat steps (2) and (3) until the centers of mass no longer change, then output the k cluster divisions.
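The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in this study: the function names `euclidean` and `k_means`, the random choice of initial centers, the `seed` and `max_iter` parameters, and the toy two-feature profiles are all assumptions made for the example.

```python
import math
import random


def euclidean(x, c):
    """Euclidean distance between a sample point x and a center c."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))


def k_means(data, k, seed=0, max_iter=100):
    """Cluster `data` (a list of equal-length numeric tuples) into k groups."""
    rng = random.Random(seed)
    # Step (1): pick k distinct samples as the initial centers (assumed strategy).
    centers = rng.sample(data, k)
    labels = []
    for _ in range(max_iter):
        # Step (2): assign each point to the cluster with the nearest center.
        labels = [min(range(k), key=lambda j: euclidean(x, centers[j])) for x in data]
        # Step (3): recompute each center as the mean of its current members.
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if not members:  # an empty cluster keeps its previous center
                new_centers.append(centers[j])
                continue
            new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
        # Step (4): stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Toy profiles with two well-separated groups (hypothetical values, not study data).
profiles = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels, centers = k_means(profiles, k=2)
print(labels)
```

In practice each sample would be one day's flow profile (for example, 24 hourly values), and a library implementation with repeated initializations would usually be preferred over this sketch.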
Agglomerative Hierarchical Clustering
AHC is a type of hierarchical clustering, often referred to as a bottom-up method. The classical AHC algorithm relies on two essential components: a similarity or dissimilarity measure between objects and an aggregation criterion or linkage rule that governs the merging of clusters of objects (Kojadinovic 2004).
The detailed process of AHC is as follows (Day & Edelsbrunner 1984):
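(1) Create n initial clusters, each consisting of a single sample.
(2) Calculate the distance between every pair of clusters and merge the closest pair to form a new cluster.
(3) The distance used is the increase in the total within-cluster sum of squares caused by merging two clusters, which the algorithm keeps as small as possible. In other words, given two clusters A and B, the pair with the smallest value of $\Delta(A, B)$ is merged:
$$\Delta(A, B) = \sum_{i \in A \cup B} \lVert x_i - m_{A \cup B} \rVert^2 - \sum_{i \in A} \lVert x_i - m_A \rVert^2 - \sum_{i \in B} \lVert x_i - m_B \rVert^2 = \frac{n_A n_B}{n_A + n_B} \lVert m_A - m_B \rVert^2$$
where $x_i$ represents a data point of cluster A or cluster B, $m_j$ represents the center of cluster j, and $n_j$ represents the number of points in it. $\Delta(A, B)$ represents the merging cost of combining A and B.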
(4) Repeat step (2) and step (3), until all clusters are combined into one.
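As a rough illustration of the procedure above, the following plain-Python sketch merges clusters using the sum-of-squares merging cost in its closed form, n_A n_B / (n_A + n_B) times the squared distance between the two centers. The names `centroid`, `ward_cost`, and `agglomerate` are hypothetical, the toy data are invented, and the loop stops at a requested number of clusters instead of continuing until a single cluster remains, which is an assumption made so that the result can be inspected directly.

```python
def centroid(points):
    """Center of a cluster: the feature-wise mean of its points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    m_a, m_b = centroid(a), centroid(b)
    squared_gap = sum((p - q) ** 2 for p, q in zip(m_a, m_b))
    return len(a) * len(b) / (len(a) + len(b)) * squared_gap


def agglomerate(data, n_clusters):
    """Bottom-up merging until only `n_clusters` clusters remain."""
    # Step (1): every sample starts as its own cluster.
    clusters = [[x] for x in data]
    while len(clusters) > n_clusters:
        # Steps (2)-(3): find the pair of clusters with the smallest merging cost.
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        # Merge cluster j into cluster i; j > i, so index i stays valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy profiles with two well-separated groups (hypothetical values, not study data).
profiles = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
groups = agglomerate(profiles, n_clusters=2)
print(groups)
```

This naive search recomputes every pairwise cost at each merge, which is adequate for a few hundred daily profiles but would be replaced by a library routine for larger datasets.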
Spectral Clustering
SC is a clustering method. It is based on algebraic graph theory and was proposed by Donath and Hoffman in 1973 (Donath & Hoffman 1973). In recent years, it has gained extensive attention from academia. This is due to its solid theoretical foundation and its ability to deliver good clustering performance (Jia et al. 2013). SC is rooted in spectral graph theory, treating the data clustering problem as a graph partitioning problem. It constructs an undirected weighted graph, where each data point in the dataset represents a vertex, and the similarity value between any two points represents the weight of the edge connecting the corresponding vertices. The SC algorithm runs as follows (Ng et al. 2001):
(3) Calculate the graph Laplacian matrix L = D − W, where D represents the degree matrix and W represents the similarity matrix. Since both D and W are symmetric matrices, L is also a symmetric matrix.
(4) Perform the eigendecomposition of L and take the eigenvectors corresponding to the k smallest eigenvalues, arranged in columns, to form a matrix $Q = [q_1, q_2, \ldots, q_k]$, where each $q_i$ represents a column vector.
(5) Cluster the rows of the matrix Q to obtain k clusters, and output the corresponding grouping of the original data.
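Step (3) can be made concrete with a short plain-Python sketch that builds the unnormalized Laplacian L = D − W from a similarity matrix. The Gaussian similarity function, its `sigma` parameter, the function names, and the toy points are assumptions for illustration only; the eigendecomposition in step (4) and the row clustering in step (5) are left to a numerical library and are not shown.

```python
import math


def similarity_matrix(data, sigma=1.0):
    """Gaussian similarity W between all pairs of points (an assumed choice).

    The diagonal is left at zero, so no vertex is connected to itself.
    """
    n = len(data)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(data[i], data[j]))
            w[i][j] = w[j][i] = math.exp(-d2 / (2 * sigma ** 2))
    return w


def graph_laplacian(w):
    """Unnormalized graph Laplacian L = D - W, with D the diagonal degree matrix."""
    n = len(w)
    degrees = [sum(row) for row in w]
    return [
        [(degrees[i] if i == j else 0.0) - w[i][j] for j in range(n)]
        for i in range(n)
    ]


# Four toy points forming two tight pairs (hypothetical values, not study data).
points = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0), (3.1, 3.0)]
W = similarity_matrix(points)
L = graph_laplacian(W)
for row in L:
    print([round(v, 4) for v in row])
```

Two properties are easy to check on the output: L is symmetric, and each of its rows sums to zero, which is why the constant vector is always an eigenvector with eigenvalue zero.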
Evaluation indicators of clustering method performance
Clustering algorithms are typically evaluated using external and internal metrics. In this study, internal metrics, namely the Silhouette Coefficient Index (SCI) and Calinski–Harabasz Index (CHI), are employed. These metrics are unsupervised and do not depend on a benchmark dataset or an external reference model. They assess the quality of clustering results by considering the distances between sample points in the dataset and their respective cluster centers.
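For orientation, the sketch below computes both indices from their standard textbook definitions in plain Python. It is an assumption that these definitions match the exact formulas used in this study. The function names and the one-dimensional toy data are hypothetical. The silhouette value of a sample is (b − a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster; the CHI is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom.

```python
import math


def dist(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_point(points):
    """Feature-wise mean of a list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def silhouette_coefficient(data, labels):
    """Mean silhouette value over all samples (needs at least two clusters)."""
    scores = []
    for i, x in enumerate(data):
        by_cluster = {}
        for j, y in enumerate(data):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(x, y))
        own = by_cluster.pop(labels[i], None)
        if not own:  # a singleton cluster is given a silhouette of 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def calinski_harabasz(data, labels):
    """Between-cluster over within-cluster dispersion, scaled by degrees of freedom."""
    groups = {}
    for x, lab in zip(data, labels):
        groups.setdefault(lab, []).append(x)
    n, k = len(data), len(groups)
    overall = mean_point(data)
    between = sum(len(g) * dist(mean_point(g), overall) ** 2 for g in groups.values())
    within = sum(dist(x, mean_point(g)) ** 2 for g in groups.values() for x in g)
    return (between / (k - 1)) / (within / (n - k))


# One-dimensional toy data with an obvious two-cluster structure (hypothetical).
toy = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = [0, 0, 1, 1]
bad = [0, 1, 0, 1]
sci_good = silhouette_coefficient(toy, good)
sci_bad = silhouette_coefficient(toy, bad)
chi_good = calinski_harabasz(toy, good)
print(round(sci_good, 4), round(sci_bad, 4), round(chi_good, 1))
```

For both indices a larger value indicates more compact and better separated clusters, so the sensible grouping of the toy data scores higher than the interleaved one.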
Silhouette Coefficient Index
Calinski–Harabasz Index
Development environment
In this study, the clustering analysis was performed in the Python programming language using the PyCharm development environment. Python provides a versatile and widely used platform for data analysis and ML, while PyCharm offers a user-friendly interface for writing and executing the clustering code. These tools supported the implementation and analysis carried out in this study.
Study data
To gain insights into the composition of water consumption, we collected data on the proportion of residential and industrial demand through surveys and collaboration with local water supply authorities. Our findings revealed that residential demand accounted for approximately 70% of the total water consumption, while the remaining 30% was attributed to industrial demand. These proportions were taken into account during the clustering analysis to gain a better understanding of the different consumption patterns.
Data preprocessing
RESULTS
Results of clustering performance evaluation
Clustering algorithm | Number of clusters | SCI | CHI |
---|---|---|---|
K-means Clustering | 2 | 0.5161 | 207.3337 |
K-means Clustering | 3 | 0.6315 | 305.9207 |
K-means Clustering | 4 | 0.5640 | 287.8707 |
K-means Clustering | 5 | 0.5125 | 250.1610 |
Agglomerative Hierarchical Clustering | 2 | 0.5127 | 176.7943 |
Agglomerative Hierarchical Clustering | 3 | 0.5922 | 274.1120 |
Agglomerative Hierarchical Clustering | 4 | 0.5271 | 264.5629 |
Agglomerative Hierarchical Clustering | 5 | 0.4774 | 230.7360 |
Spectral Clustering | 2 | 0.5215 | 203.3216 |
Spectral Clustering | 3 | 0.6272 | 302.4738 |
Spectral Clustering | 4 | 0.5347 | 263.4528 |
Spectral Clustering | 5 | 0.4341 | 193.4173 |
Figure 7 shows that the optimal solution is three clusters, as indicated by the highest SCI and CHI values for all three algorithms. Three was therefore taken as the number of clusters in the subsequent analysis.
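The selection of the number of clusters can be reproduced from the reported SCI and CHI values with a few lines of Python. The snippet below is only a bookkeeping sketch: the dictionary layout and variable names are assumptions, while the numbers are copied from the table of evaluation results above.

```python
# (SCI, CHI) for each number of clusters, copied from the reported results.
scores = {
    "KC": {2: (0.5161, 207.3337), 3: (0.6315, 305.9207),
           4: (0.5640, 287.8707), 5: (0.5125, 250.1610)},
    "AHC": {2: (0.5127, 176.7943), 3: (0.5922, 274.1120),
            4: (0.5271, 264.5629), 5: (0.4774, 230.7360)},
    "SC": {2: (0.5215, 203.3216), 3: (0.6272, 302.4738),
           4: (0.5347, 263.4528), 5: (0.4341, 193.4173)},
}

# Number of clusters that maximizes each index, per algorithm.
best_k_by_sci = {alg: max(by_k, key=lambda k: by_k[k][0]) for alg, by_k in scores.items()}
best_k_by_chi = {alg: max(by_k, key=lambda k: by_k[k][1]) for alg, by_k in scores.items()}

# Algorithm with the highest SCI and the highest CHI at three clusters.
best_alg_sci = max(scores, key=lambda alg: scores[alg][3][0])
best_alg_chi = max(scores, key=lambda alg: scores[alg][3][1])

print(best_k_by_sci, best_k_by_chi, best_alg_sci, best_alg_chi)
```

Both indices peak at three clusters for every algorithm, and KC has the highest value of each index at that setting.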
Summarization of daily water consumption patterns
The green line in Figure 10 shows the daily water consumption curve for Cluster 1. It exhibits a relatively small fluctuation range but demonstrates a noticeable overall trend with an initial decrease, followed by an increase, and then another decrease. The peak occurs at 14:00, while the lowest point is reached at 2:00. Cluster 1 consists of 29 holidays and 57 weekdays, with 32 days falling in spring, 52 days in winter, and 1 day each in summer and autumn.
The deep blue line in Figure 10 displays the daily water consumption curve for Cluster 2. This cluster shows a highly regular pattern with a distinctive double-hump shape. There are two peak periods at 10:00 and 18:00. The curve gradually declines and reaches the first trough at 4:00, followed by the second trough at 15:00. Cluster 2 includes 33 holidays and 71 weekdays, with 75 days in autumn and 29 days in winter.
The light blue line in Figure 10 presents the daily water consumption variation curve for Cluster 3. The pattern in this cluster resembles that of Cluster 2 between 0:00 and 8:00. However, from 8:00 to 20:00, the flow remains relatively flat and constant. The water usage reaches its maximum at 21:00 and then sharply declines from 21:00 to 23:00. Cluster 3 encompasses 52 holidays and 124 weekdays, covering all four seasons, with 61 days in spring, 91 days in summer, 15 days in autumn, and 9 days in winter, as shown in Table 2.
Clusters | Holidays | Weekdays | Spring | Summer | Autumn | Winter |
---|---|---|---|---|---|---|
Cluster 1 | 29 | 57 | 32 | 1 | 1 | 52 |
Cluster 2 | 33 | 71 | 0 | 0 | 75 | 29 |
Cluster 3 | 52 | 124 | 61 | 91 | 15 | 9 |
DISCUSSIONS
Impact of weekdays or holidays on clustering results
Daily water consumption is influenced by users' daily activity habits. Speculations about differences in water consumption patterns between weekdays and weekends, as well as holidays and weekdays, are subjective and based on individual perceptions. In this study, we analyze real-time data considering additional factors associated with weekdays and holidays.
Another possibility to consider is that the suburban area, where the data is sourced, may not be representative of the main urban area. The demographics of this area are predominantly agricultural, and daily activities may not significantly change during holidays. Additionally, the COVID-19 pandemic in 2020 had a major worldwide impact on water consumption patterns (Dzimińska et al. 2021). The distinction between weekdays and holidays became less evident, especially with the rise of telecommuting and distance learning, where the ‘stay at home’ lifestyle became the norm. Overall, our findings suggest that there are no significant differences in water consumption between holidays and weekdays. The suburban context and the influence of the COVID-19 pandemic contribute to these observed patterns.
Impact of seasons on clustering results
Cluster 1, which we have labeled as Pattern A, primarily occurs during the winter and early spring seasons, with minimal representation during summer and autumn. Our analysis, as shown in Figure 10, reveals several distinctive characteristics of this pattern when compared to the other two. First, unlike the other patterns, Pattern A does not exhibit a prominent morning peak in water flow. Instead, the daily flow curve demonstrates significant rises and falls throughout the day. Notably, an afternoon peak is observed between 10:00 and 14:00, followed by a slight rebound at 16:00, forming a smaller peak. Consequently, this irregular behavior sets Pattern A apart from the other patterns. Furthermore, it is important to acknowledge the influence of the COVID-19 pandemic during the months predominantly associated with Pattern A, namely January, February, and March. The pandemic significantly affected the morning, afternoon, and evening peaks of daily water consumption within this pattern, as altered routines and increased time spent at home during lockdown measures changed water usage. However, despite the significant impact of the COVID-19 pandemic, the fluctuations in water consumption during winter and early spring, periods that hold particular significance, are relatively negligible within Pattern A. The factors underlying these negligible fluctuations remain an area for future research.
Cluster 2 consists only of autumn and winter days, with no spring or summer days. We refer to this as Pattern B.
Cluster 3 represents a daily water usage pattern dominated by summer, which we call Pattern C. Figure 10 shows the corresponding daily average water usage variation curve, while Figure 13 displays the curve for summer. Comparing these two curves confirms their high similarity. From 0:00 to 7:00, there is no noticeable difference in the water consumption curves between Pattern C and Pattern B. Between 7:00 and 20:00, Pattern C shows minimal changes, maintaining consistently high levels without significant drops or valleys. This can be attributed to the hot weather in summer and the unusually long summer experienced in Shanghai in 2020, with high temperatures and drought that may affect crop irrigation. Water usage in Pattern C rises from 19:00 to 21:00, peaking at 21:00, likely due to residents using water for evening bathing. However, the morning peak of the summer curve occurs at 7:00, whereas that of Pattern C is delayed by 1 hour. Additionally, the afternoon peak for Pattern C occurs between 12:00 and 14:00, peaking at 14:00, which aligns with the spring pattern. Moreover, more than half of the spring days fall into Pattern C, further suggesting that spring may have influenced the midday peak.
Integrating daily consumption patterns into the optimization of WDNs presents significant advantages, as evidenced by the analysis of different patterns in this study. Notably, Cluster 1, representing winter and early spring as Pattern A, exhibits distinctive characteristics. This pattern lacks a morning peak and displays significant fluctuations in the daily flow curve, rendering it the most irregular pattern. Understanding and recognizing such patterns enables the scheduling algorithm to adapt and allocate resources accordingly, ensuring a stable water supply despite irregular consumption behaviors. Similarly, Cluster 3, dominated by summer as Pattern C, demonstrates consistently high levels of water usage throughout the day. By identifying and acknowledging this pattern, the algorithm can optimize supply schedules to accommodate the specific demands of summer, such as the evening bathing peak at 21:00. This targeted approach enhances operational efficiency and ensures adequate water distribution during critical periods. These observations highlight the novelty and contribution of our research in integrating daily consumption patterns into WDN optimization. By considering and analyzing these patterns, our study provides valuable insights for improving the operational efficiency of water distribution systems. This approach enables better resource allocation, resulting in optimized water supply schedules that align with the specific demands of different seasons and consumption patterns.
CONCLUSIONS
Table 3 presents a comprehensive summary of the research findings obtained in this study.
Research method | Innovation |
---|---|
Cluster method selection | The study utilized a diverse range of clustering methods, encompassing various principles and applicabilities. This facilitates the comparison of different methods in analyzing daily water usage patterns, providing a more comprehensive evaluation and insights. |
Cluster analysis | The selected clustering method is applied to the preprocessed daily water usage data in this study, dividing the data samples into distinct clusters. Proper distance metrics and clustering evaluation metrics are employed to ensure the quality assessment of the clustering analysis results. |
Pattern variations and feature interpretation | By comparing the results of different clustering methods, this study revealed the explanatory capacity and differences in water usage patterns among the methods. By providing detailed explanations of the patterns identified by each clustering method and their features, a more comprehensive understanding of water consumption patterns is achieved, aiding in identifying the applicability of different clustering methods in specific scenarios. |
Practical application validation | This study applied the experimental method to real-world daily water usage data. By validating the effectiveness of different clustering methods in practical applications, it provides practical value and guidance for decision-makers in the field. |
This study utilized three clustering algorithms (KC, AHC, and SC) to analyze daily water consumption patterns. The performance of these algorithms was evaluated using the SCI and the CHI. The conclusions of the study are as follows:
(1) KC outperformed AHC and SC, as indicated by higher SCI and CHI values (0.6315 and 305.9207, respectively).
(2) The data were clustered into three patterns: Pattern A, Pattern B, and Pattern C. These patterns have similar proportions of weekdays and holidays. Pattern A is dominated by winter and spring, Pattern B by autumn, and Pattern C by summer and spring. Pattern B and Pattern C exhibit similar variations from 0:00 to 7:00, while Pattern A differs during this time period. The main distinction among the patterns lies in the water consumption variation between 8:00 and 21:00.
(3) Seasons significantly influence daily water consumption patterns. In spring, the midday peak is delayed, likely due to the impact of the COVID-19 pandemic in 2020. Summer primarily affects water consumption changes between 7:00 and 20:00, influenced by high temperatures and drought conditions. The area's citrus cultivation and continuous need for irrigation water contribute to minimal variation in daily water consumption during this period, resulting in a delay in the evening peak. Autumn and winter lack a midday peak, coinciding with the citrus harvest season and the agricultural activities prevalent in the study area.
(4) Weekdays and holidays show no significant effect on the three patterns, as the share of holidays is similar in every pattern (roughly 30–34% of the days in each).
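The composition statements in conclusions (2) to (4) can be checked directly against the day counts reported in Table 2. The Python snippet below is a small arithmetic sketch: the dictionary layout and variable names are assumptions, and only the counts come from the table.

```python
# Day counts per cluster, copied from Table 2.
counts = {
    "Cluster 1": {"holidays": 29, "weekdays": 57,
                  "seasons": {"spring": 32, "summer": 1, "autumn": 1, "winter": 52}},
    "Cluster 2": {"holidays": 33, "weekdays": 71,
                  "seasons": {"spring": 0, "summer": 0, "autumn": 75, "winter": 29}},
    "Cluster 3": {"holidays": 52, "weekdays": 124,
                  "seasons": {"spring": 61, "summer": 91, "autumn": 15, "winter": 9}},
}

holiday_share = {}
dominant_season = {}
for name, c in counts.items():
    total = c["holidays"] + c["weekdays"]
    # Consistency check: the seasonal counts must add up to the same total.
    assert total == sum(c["seasons"].values())
    holiday_share[name] = c["holidays"] / total
    dominant_season[name] = max(c["seasons"], key=c["seasons"].get)

for name in counts:
    print(name, f"{holiday_share[name]:.1%}", dominant_season[name])
```

The holiday share stays within a few percentage points across the three clusters, whereas the dominant season differs for each one (winter, autumn, and summer, respectively), which is the contrast that conclusions (3) and (4) describe.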
Future work could investigate the underlying factors causing the delay in the midday peak during spring and examine the specific impacts of citrus cultivation on water consumption patterns in autumn and winter. Additionally, exploring the influence of external factors such as economic activities or cultural events on daily water consumption patterns could provide valuable insights.
ACKNOWLEDGEMENTS
We are very grateful to the editors and anonymous reviewers for their insightful suggestions and comments on this paper.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.