The need for efficient management of water distribution systems is a growing concern for economic, environmental and social reasons. Water supply systems are commonly designed to ensure adequate behaviour under the worst conditions, such as maximum consumption, which leads to overestimated supply tank volumes and wasted energy. While some overestimation is warranted to account for unpredictable demands and emergency scenarios, we argue that a detailed understanding of consumption patterns improves water management and additionally benefits correlated resources such as electricity. A novel framework to detect water consumption patterns is developed and applied to an urban scenario. Observed discrepancies among computed patterns enable readjustments of the supplied water flowrate, thus promoting effective water allocation and lower pumping costs, while mitigating water contamination risks.
Water is a vital resource and a key element in the settlement and growth of human communities. Over the years, climate change, demographic pressure and urbanisation have strongly impacted urban water resources. At the drainage level, impermeable surfaces along with heavy rainfall events increase flooding risks (Hammond et al. 2018; Velasco et al. 2018). From a supply perspective, time-varying and increasing demands must be supplied over longer distances (Katirtzidou & Latinopoulos 2017).
Water supply systems are designed to meet customer demands even during peak hours, resulting in repeated overestimation of supply tank volumes. In addition, most systems have water losses and leaks that need to be identified. Both issues have serious consequences for water companies and consumers. Water that stagnates in tanks may incur increased treatment and transportation costs (Filho et al. 2017). For example, in Coimbra, Portugal, around €1.5 million (out of a total of €6.5 million) is spent annually as a result of water supply system leakages (ERSAR 2016).
The management of water distribution networks involves a trade-off between two conflicting goals: maximising available water and minimising transportation and treatment costs. A proper understanding of urban water consumption enables companies to adjust excess tank storage and pump operations, reducing contamination risks and operating costs. This has motivated several studies in recent years (Donkor et al. 2014; Qin & Boccelli 2019).
This work follows Leitão et al. (2018), developing a framework for automatic urban consumption pattern detection supported by a time-series clustering methodology. Despite strong interest and investigation in urban water patterns, to the best of our knowledge, very little attention has been given to the application of selected clustering methods to raw data. This motivated our methodology, which enables real-time consumption analysis and pattern identification, exploring techniques well-acclaimed in other fields (Jain 2010; Aghabozorgi et al. 2015).
The following section reviews the literature on water consumption patterns and presents our approach. The section after that presents and analyses consumption profiles, and the final section concludes the document.
This section addresses our methodology. It reviews publications on water consumption patterns and the techniques employed to extract them, and also covers time-series clustering techniques adopted in other domains. From this survey, a selection of techniques was incorporated into our approach, which is covered in the second half of this section. The reader is directed to the cited literature for detailed coverage.
State of the art
Characterising urban water consumption is of great importance to water companies, being the subject of intensive study for at least 60 years.
In one such early study (Linaweaver et al. 1967), historic consumption data are correlated with events of interest. Despite distinguishing summer and winter weekly patterns, details regarding their computation are not provided. Furthermore, consumption throughout the remainder of the year (autumn and spring) and how it relates to computed profiles are not addressed.
Following trends in other industries, namely energy, data mining and statistical approaches such as cluster analysis gained popularity. Cluster analysis divides a collection of observations (e.g., daily consumption) into groups, or clusters, such that observations assigned to the same cluster are similar to each other and observations assigned to different clusters are dissimilar.
One approach relies on raw consumption records, divided into equal-length time-series samples (e.g., daily consumption with hourly records). Clustering algorithms partition the feature space (e.g., a 24-dimensional space) to find dense and well-separated regions. The high computational burden of such approaches is highlighted as their main drawback (Aghabozorgi et al. 2015): clustering requires many distance computations, so the higher the dimensionality of each sample, the higher the computational requirements (time and memory).
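As a minimal sketch of this representation, the snippet below splits a flat list of hourly readings into 24-dimensional daily samples and counts the pairwise distances a full distance matrix would need. The function names and the synthetic readings are illustrative assumptions, not part of the original study.

```python
def to_daily_samples(hourly, hours_per_day=24):
    """Split a flat list of hourly volumes into complete daily samples."""
    n_days = len(hourly) // hours_per_day  # a trailing incomplete day is dropped
    return [hourly[d * hours_per_day:(d + 1) * hours_per_day] for d in range(n_days)]


def pairwise_distance_count(n_samples):
    """Number of distinct sample pairs a full distance matrix requires."""
    return n_samples * (n_samples - 1) // 2


# Hypothetical readings: three full days plus five extra hours.
readings = [float(h % 24) for h in range(3 * 24 + 5)]
samples = to_daily_samples(readings)
print(len(samples), len(samples[0]))   # number of days, dimensions per sample
print(pairwise_distance_count(365))    # pairs for one year of daily samples
```

Each pair costs at least one pass over all 24 dimensions, which is why both the number of samples and their dimensionality drive the computational burden.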
Dimensionality reduction techniques have also been employed to represent a time-series in lower dimensional spaces, reducing computational burden. Examples include principal component analysis (PCA) (Abreu et al. 2012), non-negative tensor factorisation (Figueiredo 2014) and replacing data with additional information (features) such as peak and mean daily/weekly (Laspidou et al. 2015; Yang et al. 2015) and per capita consumption (Noiva et al. 2016).
The choice of similarity metric is another important decision. Aghabozorgi et al. (2015) stress the importance of adequate representations and distance metrics to avoid grouping series that are similar in noise rather than in shape. Simple and cheap to compute, the Euclidean distance remains the most common choice. Metrics that better express similarity between time-series samples, such as dynamic time warping (DTW) (Chu et al. 2002), have also been adopted, albeit at a higher computational cost.
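The difference between the two metrics can be illustrated with a small sketch. It uses the textbook dynamic-programming formulation of DTW with an absolute-difference local cost and no warping window; these choices, and the toy series, are assumptions made for illustration only.

```python
import math


def euclidean(a, b):
    """Point-by-point distance: samples are compared hour against hour."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Unconstrained DTW cost, computed in O(len(a) * len(b)) time."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Two toy series with the same peak, shifted by one time step.
early_peak = [0.0, 0.0, 5.0, 0.0, 0.0, 0.0]
late_peak = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0]
print(euclidean(early_peak, late_peak))
print(dtw(early_peak, late_peak))
```

The Euclidean distance penalises the one-step shift, whereas DTW aligns the two peaks and treats the series as having the same shape. The nested loop also shows where the higher computational cost comes from.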
Among water consumption clustering works, self-organising maps (Laspidou et al. 2015), hierarchical (Noiva et al. 2016) and K-means clustering (Candelieri et al. 2015) are highlighted. Although apparently not as common in this field, hierarchical and K-means clustering are commonly employed together (Jain 2010), in a technique we name hierarchical K-means clustering.
The works cited in the previous paragraph (and related sources) show a recent preference for identifying patterns among extracted features rather than in raw consumption data. While relevant for characterising urban consumption, these features discard time information, making it impossible to characterise consumption over time. Raw time-series clustering, in contrast, allows for time-indexed consumption characterisation: it describes not only the value of consumption peaks but also when they are typically observed during the day.
K-means represents each computed cluster by a centroid, a sample representative of the cluster which, in this case, constitutes the consumption profile. Several centroid definitions are proposed in the literature, namely the average (average of all samples for each dimension) (Aghabozorgi et al. 2015), the medoid (the sample that minimises the within-cluster sum of squared distances) (Izakian et al. 2015) and DTW barycentre averaging (Petitjean et al. 2011).
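The first two centroid definitions can be sketched as follows. The helper names and the toy cluster are illustrative assumptions, and the Euclidean distance is used only for simplicity; DTW barycentre averaging is not shown.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_centroid(samples):
    """Average of all samples, computed independently for each dimension."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def medoid(samples, dist=euclidean):
    """Member of the cluster with the smallest sum of squared distances to the rest."""
    return min(samples, key=lambda cand: sum(dist(cand, s) ** 2 for s in samples))


# Toy cluster of two-dimensional samples (daily samples would have 24 dimensions).
cluster = [[1.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
print(average_centroid(cluster))
print(medoid(cluster))
```

The average is pulled towards the distant sample and need not coincide with any observed day, while the medoid is always one of the observed samples.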
A more recent algorithm has also been proposed for raw time-series clustering. Time-series anytime density peaks (TADPole) is a variant of density peaks clustering (Rodriguez & Laio 2014; Begum et al. 2015) that imposes constraints over the DTW to prune unnecessary computations and reduce computation time. It uses the medoid centroid and explores local density (the number of samples lying within a cut-off distance of a given sample) to find cluster centroids (samples with many close neighbours). According to Begum et al. (2015), TADPole produces adequate cluster structures, equivalent to those of other algorithms, at a fraction of the cost. Nonetheless, no water data applications were found.
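The density-based idea can be sketched with the two quantities used by plain density peaks clustering: the local density of each sample and its distance to the nearest sample of higher density. This is a simplified illustration that omits the DTW pruning specific to TADPole; the function name, cut-off value and toy data are assumptions.

```python
def density_peaks_scores(points, dist, cutoff):
    """Return (rho, delta): local density and distance to the nearest denser point."""
    n = len(points)
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # rho: how many other points lie within the cut-off distance.
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < cutoff) for i in range(n)]
    delta = []
    for i in range(n):
        denser = [d[i][j] for j in range(n) if rho[j] > rho[i]]
        # Points with no denser neighbour receive their largest distance instead.
        delta.append(min(denser) if denser else max(d[i]))
    return rho, delta


# Toy one-dimensional data with two well-separated groups.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
rho, delta = density_peaks_scores(points, lambda a, b: abs(a - b), cutoff=1.5)
# Candidate centres combine high density with a large distance to denser points.
centres = sorted(range(len(points)), key=lambda i: rho[i] * delta[i], reverse=True)[:2]
print(rho, delta, sorted(centres))
```

In this toy case the middle sample of each group is selected as a centre, and the remaining samples would be assigned to the cluster of their nearest denser neighbour.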
Real-world measurements are subject to missing and invalid data which, if not properly accommodated, distort computed patterns. A complete review of data imputation is beyond the scope of this document. For univariate time-series, Moritz et al. (2015) highlight linear interpolation, autoregressive integrated moving average (ARIMA) models and Kalman filters.
Linear interpolation estimates missing values by fitting a straight line between the previous and subsequent observations. ARIMA imputations are computed from a linear combination of previous values (autoregressive component) and white noise error terms (moving average component), after differencing the series as many times as needed for stationarity (integrated component). Finally, Kalman filters operate on a state-space model (Durbin & Koopman 2012). The state gives the current estimate, resulting from an iterative process in which a prediction is first made from the previously known state and later corrected using the actual observation.
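Of the three techniques, linear interpolation is the simplest to sketch. The snippet below assumes missing readings are marked with `None` and leaves leading or trailing gaps untouched, since they have no neighbour on one side; both conventions are assumptions made for this illustration.

```python
def interpolate_gaps(series):
    """Fill None entries on the straight line joining the nearest known neighbours."""
    filled = list(series)
    known = [i for i, value in enumerate(series) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Hypothetical hourly volumes with two consecutive missing readings.
hourly = [10.0, None, None, 16.0, None]
print(interpolate_gaps(hourly))
```

ARIMA and Kalman-based imputation additionally model the dynamics of the series, which is why they can outperform a straight line over longer gaps.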
Our literature study highlighted a recent trend featuring statistical approaches to extract knowledge from historic data.
Water consumption data were retrieved from a utility company, covering a one-year period of hourly aggregated distributed volumes (m³). The data were collected from a residential area, encompassing around 3,000 meters and 9,000 residents. They are subject to reading imprecisions, and to communication and instrumentation failures.
An exploratory analysis of the highlighted techniques was conducted. Following the approach in Abreu et al. (2012), the unit of analysis was derived from the frequency with the highest contribution to the time-series. As the non-zero frequency with the highest contribution corresponded to a period of approximately 24 h, our data were divided into daily samples.
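One way to identify the dominant non-zero frequency is a discrete Fourier transform of the mean-removed series. The sketch below uses a naive transform on a synthetic week of hourly data; the signal, the function name and the use of a plain DFT are assumptions, and the exact procedure of Abreu et al. (2012) may differ.

```python
import cmath
import math


def dominant_period(series, sample_interval_h=1.0):
    """Period (in hours) of the non-zero frequency with the largest DFT magnitude."""
    n = len(series)
    mean = sum(series) / n
    centred = [value - mean for value in series]  # removes the zero-frequency term
    best_k, best_magnitude = 0, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(centred))
        if abs(coeff) > best_magnitude:
            best_k, best_magnitude = k, abs(coeff)
    return n * sample_interval_h / best_k


# Synthetic week of hourly data: strong daily cycle plus a weaker weekly cycle.
week = [10.0 + 5.0 * math.sin(2 * math.pi * t / 24) + math.sin(2 * math.pi * t / 168)
        for t in range(168)]
print(dominant_period(week))
```

For a full year of hourly records a fast Fourier transform would be preferable to this quadratic-time loop.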
Kalman filters produced the estimates with the smallest error over artificially removed data and were therefore chosen to impute real missing values. Despite having no accurate knowledge of how many profiles exist or of their distribution throughout the year, dense and well-separated clusters are expected, as well as morning and evening/night peaks. We assess computed clusters according to the following measures:
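- Sum of Squared Errors (SSE) (Aghabozorgi et al. 2015). The error of a sample is defined as the distance to its cluster centroid.
- Average Within-Cluster Sum of Squares (AWCSS) (Aghabozorgi et al. 2015). It computes the average dissimilarity of samples in the same cluster.
- Silhouette Coefficient (SC) (Rousseeuw 1987). SC measures how similar a sample is to others in its cluster, in comparison with other clusters. It is defined as the average of individual samples' silhouette values, $s(i)$: $SC = \frac{1}{N} \sum_{i=1}^{N} s(i)$, with $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$, where $a(i)$ is the average distance from sample $i$ to the other samples in its cluster, $b(i)$ is the smallest average distance from sample $i$ to the samples of any other cluster, and $N$ is the number of samples.
- Caliński-Harabasz Index (CH) (Caliński & Harabasz 1974): $CH = \frac{B / (K - 1)}{W / (N - K)}$, where $B$ is the between-cluster sum of squares, $W$ is the within-cluster sum of squares, $K$ is the number of clusters and $N$ is the number of samples.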
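Three of these measures can be sketched under the standard textbook definitions, using average centroids and the Euclidean distance. These choices, the helper names and the toy clusters are assumptions for illustration; the study itself explored the DTW metric.

```python
import math


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return [sum(column) / len(points) for column in zip(*points)]


def sse(clusters):
    """Sum of squared distances from each sample to its cluster centroid."""
    return sum(sq_dist(p, mean(c)) for c in clusters for p in c)


def silhouette(clusters):
    """Average silhouette value; samples in singleton clusters score 0."""
    scores = []
    for ci, cluster in enumerate(clusters):
        others = [o for oi, o in enumerate(clusters) if oi != ci]
        for p in cluster:
            if len(cluster) == 1:
                scores.append(0.0)
                continue
            # The distance from p to itself is 0, so it does not affect the sum.
            a = sum(math.sqrt(sq_dist(p, q)) for q in cluster) / (len(cluster) - 1)
            b = min(sum(math.sqrt(sq_dist(p, q)) for q in o) / len(o) for o in others)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def calinski_harabasz(clusters):
    """Ratio of between-cluster to within-cluster dispersion, each scaled by its d.o.f."""
    points = [p for c in clusters for p in c]
    n, k = len(points), len(clusters)
    overall = mean(points)
    between = sum(len(c) * sq_dist(mean(c), overall) for c in clusters)
    return (between / (k - 1)) / (sse(clusters) / (n - k))


# Toy one-dimensional clusters.
toy = [[[0.0], [2.0]], [[10.0], [12.0]]]
print(sse(toy), silhouette(toy), calinski_harabasz(toy))
```

Lower SSE indicates more compact clusters, whereas higher SC and CH values indicate clusters that are both compact and well separated.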
In our literature review, we identified a lack of applications detecting water patterns from raw time-series. Motivated by results reported in other domains, consumption patterns were computed using the hierarchical K-means algorithm. The average centroid applied to raw time-series (without data transformations), combined with the DTW metric, produced the most adequate cluster structures. TADPole was the fastest method to converge but was discarded because it produced inadequate cluster structures.
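The combination of hierarchical and K-means clustering can be sketched as agglomerative clustering that supplies the initial centroids, followed by K-means refinement. The sketch below assumes centroid linkage, average centroids and the squared Euclidean distance for brevity, whereas the study explored the DTW metric; all function names and data are illustrative.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return [sum(column) / len(points) for column in zip(*points)]


def agglomerate(samples, k):
    """Merge the two groups with the closest centroids until k groups remain."""
    groups = [[s] for s in samples]
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = sq_dist(mean(groups[i]), mean(groups[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = groups.pop(j)  # j > i, so index i is unaffected by the pop
        groups[i] = groups[i] + merged
    return groups


def kmeans(samples, centroids, max_iter=100):
    """Standard assignment/update iterations, starting from the given centroids."""
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for s in samples:
            nearest = min(range(len(centroids)), key=lambda c: sq_dist(s, centroids[c]))
            clusters[nearest].append(s)
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


def hierarchical_kmeans(samples, k):
    seeds = [mean(group) for group in agglomerate(samples, k)]
    return kmeans(samples, seeds)


# Toy two-dimensional samples (daily samples would have 24 dimensions).
demo = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [8.0, 8.0], [8.2, 7.8], [7.8, 8.2]]
profiles, groups = hierarchical_kmeans(demo, 2)
print(profiles, [len(g) for g in groups])
```

Seeding K-means from the hierarchical result removes the dependence on random initial centroids, at the price of the pairwise comparisons performed during agglomeration.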
Leitão et al. (2018) collected and segmented data to reveal three predominant patterns: one corresponding to low consumptions, featuring reduced demands throughout the day (cluster 0: average = 22 m³/day); and two intense consumption patterns (cluster 1: average = 229 m³/day, and cluster 2: average = 227 m³/day), presented in Figure 1. To better illustrate differences in consumption values between clusters, a second axis (drawn in red) is included describing the average daily consumption.
For each pattern, hourly upper and lower deviations from the consumption profile are presented. This information complements profiles and allows tank volumes to be adjusted according to observed (past) deviations: if, for a given day and hour, consumption has greatly exceeded its pattern in the past, the utility company can allocate higher volumes in advance; conversely, when consumption remains closer (or inferior) to its pattern, overestimation can be diminished, reducing sediment deposition and transportation and treatment costs.
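A possible way to derive such deviations is sketched below, assuming that the upper (lower) deviation for each hour is the largest excess (shortfall) observed among the days assigned to the pattern. This definition, the function name and the numbers are assumptions; the text does not specify how the deviations were computed.

```python
def deviation_bands(profile, samples):
    """Per-hour largest excess and largest shortfall relative to the profile."""
    upper, lower = [], []
    for hour, expected in enumerate(profile):
        diffs = [sample[hour] - expected for sample in samples]
        upper.append(max(max(diffs), 0.0))  # never below the profile itself
        lower.append(min(min(diffs), 0.0))  # never above the profile itself
    return upper, lower


# Hypothetical three-hour profile (m³) and two days assigned to it.
profile = [10.0, 20.0, 30.0]
days = [[12.0, 18.0, 30.0], [9.0, 25.0, 29.0]]
upper, lower = deviation_bands(profile, days)
print(upper, lower)
```

Adding the upper deviation to the profile gives, for each hour, the largest volume observed in the past, which is the quantity a utility would need to cover in the worst case.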
Such a complement to consumption patterns is not typically reported in the studied literature. As the supply system must ensure customers' demands for worst cases (i.e., maximum consumption), its specification presents an added value to consumption analysis and management of supply systems, in comparison with quantile information.
All patterns have similar daily shapes, with early morning (8 am) and night (8 pm–9 pm) peaks and low consumption during the late night. For all groups, consumption peaks (around 10 am and 8 pm) surpass average consumption by a factor of 3. Conversely, low demand periods drop to about 30% of the average value for clusters 1 and 2, and 50% for cluster 0.
Figures 2(b) and 2(c) present the yearly distribution of these patterns. Clearly, cluster 2 encompasses a substantial portion of records (72% of days). All clusters show a similar proportion of five weekdays to two weekend days. As such, there are no significant differences between weekends and weekdays. This is somewhat unexpected, since people's daily routines typically change during the weekend and the two could have followed different shapes.
Cluster 0 stands out for its reduced demands. Very few days (around 21) were assigned to this group, covering summer months of June and July (Figure 2(a)). While the identification of a low consumption group was expected, its representative pattern and day distribution were surprising.
Indeed, as this is a residential area mainly composed of student residences, this low consumption group was expected for August (the month of summer holidays). Even for such a group, very low consumption values are registered, suggesting an outlier group resulting from measurement and transmission errors.
Night demands distinguish intense consumption patterns (Leitão et al. 2018): cluster 1 registers higher consumption between 3 am and 6 am. This can be justified by the operation of irrigation systems during summer months to water nearby municipal parks and public gardens.
Water supply companies can use consumption profiles to adjust tank refills, taking advantage of cheaper night-time electricity. Our analysis shows that, during summer months, refilling must be performed earlier given the higher demands between 3 am and 6 am.
The studied patterns do not suggest intrinsically different consumption habits: daily shape and peak hours remain the same throughout the year. The only difference resides in higher summer consumption around 6 am due to sprinkling. A similar strategy can, therefore, be adopted to regulate tank inflows, with minor pump adjustments. Morning and evening peaks are also identified, as expected for residential regions.
This framework can also be explored for real-time detection of anomalous consumption, such as leakages: by comparing consumption against tuned profiles, real-time alerts can be generated for significant deviations from patterns, prompting immediate action.
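Such a comparison could be sketched as a check of each reading against the profile widened by its historic deviations. The alert rule, the optional safety margin and all values below are hypothetical and only illustrate the idea.

```python
def find_alerts(readings, profile, upper, lower, margin=0.0):
    """Flag hours whose reading falls outside profile + [lower, upper] (+/- margin)."""
    alerts = []
    for hour, value in enumerate(readings):
        high = profile[hour] + upper[hour] + margin
        low = profile[hour] + lower[hour] - margin
        if value > high:
            alerts.append((hour, value, "above"))
        elif value < low:
            alerts.append((hour, value, "below"))
    return alerts


# Hypothetical three-hour profile (m³), its historic deviations and new readings.
expected = [10.0, 20.0, 30.0]
max_excess = [2.0, 5.0, 0.0]
max_shortfall = [-1.0, -2.0, -1.0]
today = [11.0, 27.0, 28.0]
print(find_alerts(today, expected, max_excess, max_shortfall))
```

A persistent sequence of "above" alerts during low-demand hours would be one plausible signature of a leak, although confirming it would require further investigation.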
A novel framework for automatic detection of urban water consumption patterns is presented. Three patterns are detected in historic data: one representing reduced consumption and two characterising intense demands. Even though consumption remains similar in shape, summer months exhibit higher demands between 3 am and 6 am, possibly due to outdoor sprinkling. Overall values for the reduced-consumption group are too low, and the group occurs in June and July rather than in August, when it was expected. We believe this group aggregates outliers resulting from data acquisition and transmission errors.
Characterisation of consumption patterns is complemented with information on upper and lower deviations, determined based on historic data. This distinguishes our framework from the cited works, which provide no knowledge of the consumption's expected deviations from representative profiles.
Supported by adequate technology, our framework can be used to report significant deviations from expected consumption patterns, contributing towards real-time detection of anomalous events in supply networks.
Joaquim Leitão gratefully acknowledges the Portuguese funding institution FCT (Foundation for Science and Technology), Human Capital Operational Program (POCH) and the European Union (EU) for supporting this research work under the PhD grant SFRH/BD/122103/2016.