Abstract
Leakages in water distribution networks (WDNs) affect the hydraulic state of the entire network or a large part of it. Statistical correlations computed among pressure sensors monitoring network nodes aid the detection and localization of such leaks. This opens the possibility of working with water network databases, where graph signal processing (GSP) tools help to understand changes in pressure signals caused by leakages in the hydraulic system. This paper presents a methodology for analysing time-varying pressure signals on graph structures. The core of the methodology is that pressure changes caused by leaks modify the graph structure. By computing a new graph topology at each time step and applying a centrality analysis based on PageRank, it is possible to identify the onset of new leaks in the water system. A confusion matrix evaluates the precision of the proposed methodology in defining where and when such leakages start and end. Seven simulated leaks are used to validate the process, which achieved 86% accuracy. The results show the benefits of the method in terms of speed, computational efficiency, and precision in detecting leakages.
HIGHLIGHTS
A novel data analysis methodology for leak detection in water distribution networks.
Correlation of pressure data for the creation of temporal graphs.
Analysis of water networks as graphs and vertex ranking by the PageRank metric.
INTRODUCTION
The water distribution network (WDN) is an essential infrastructure responsible for supplying citizens with drinkable water. Therefore, damage to pipes has an important impact on the water distribution process and also on water quality and network hydraulics (Hu et al. 2021a, 2021b). In addition, a large portion of treated water is lost in the WDN due to leaks, theft, measurement failures, and several other factors (Silva et al. 2021). According to Fan et al. (2021), around 3,281 megalitres of water were lost in the UK between 2009 and 2011, and in the US around 15% of the treated water supplied is lost annually. In Brazil, it is estimated that in 2019 about 38% of the distributed water was lost (Oliveira et al. 2021). This shows that, even with research focused on loss reduction and leak detection, the proposed methodologies still require further development, mainly towards fast, effective leak detection that is easy to apply in WDNs.
Leaks are linked to losses in the water distribution process and are classified as reported, unreported, or background leakage (Adedeji et al. 2017). Reported leaks are visible on the ground and are easily detected by the public or by employees of the network operator (Chan et al. 2018). On the other hand, unreported leaks are similar to reported ones, but they do not emerge from the ground. Finally, background leakages are small, difficult to detect by conventional methods, and often go unnoticed for a long time, resulting in significant losses (Abdulshaheed et al. 2017). Different methods are applied to detect leaks, and they can be classified as visual inspection, transient-based, model-based, and data-driven approaches (Chan et al. 2018).
Visual and sensor-based strategies require the use of mobile inspection equipment linked to optical, electromagnetic, or acoustic sensors. However, this is an expensive and time-consuming process, and acoustic signals in particular are influenced by soil type and pipe material (Fan et al. 2021). Transient-based approaches analyze the pressure transient wave caused by hydraulic changes in the system to detect leaks. This wave can quickly travel through the entire network, affecting flows and pressures, contracting or expanding pipes, and altering many other system parameters (Ayati et al. 2019). As the wave speed is usually high, ranging from about 400 m/s in PVC pipes to 1,200 m/s in iron pipes, transient sensors are more expensive, and transient responses decay rapidly, making large-scale application difficult (Fan et al. 2021). Model-based leak detection methods include, for example, the use of sensitivity matrices built under normal conditions and with known leaks (Geng et al. 2019; Salguero et al. 2019) and calibration approaches (Sophocleous et al. 2019). These methods are capable of detecting leaks; however, they require calibrated hydraulic models with user-demand data, pipe conditions, and pressure behaviours, making the results very sensitive to modelling and measurement errors (Fan et al. 2021; Hu et al. 2021a, 2021b). Finally, data-driven methods include feature classification methods (Sun et al. 2019), prediction classification methods (Laucelli et al. 2016), mathematical–statistical methods (Quiñones-Grueiro et al. 2018), and unsupervised clustering (Geelen et al. 2019). These methods use network monitoring data, such as pressure, flow, and reservoir levels. Hu et al. (2021a, 2021b) conclude that data-driven detection methods do not require deep comprehension of the WDN, but they do require large amounts of data, being more suitable when a large historical record of the network is available for analysis.
Mathematical–statistical tools have been successfully used for leak detection in research related to WDNs. Brentan et al. (2021) used time series of hydraulic data (pressure, flow, and reservoir levels) and applied the fast Independent Component Analysis (fastICA) algorithm to separate the hydraulic data into independent components. The authors then analyzed the independent components with a statistical control algorithm to detect abrupt changes and identify cyber-attacks in WDNs. Gao et al. (2014) used a Blind Signal Separation (BSS) process on flow and pressure data to detect leaks and to separate the leak flow from the total nodal flow. Okeya et al. (2014) used historical demand data, a network hydraulic model, and a modified Kalman Filter method to detect leaks in networks. The authors predict hourly demand from historical data, apply the predicted data to the hydraulic model to estimate flows and pressures, and apply the Kalman Filter to calculate corrected demands at the current time step, driven by the difference between predicted and observed data (Zanfei et al. 2022).
Computational and mathematical methods successfully used in other research fields can be explored and applied to reduce losses in water distribution systems. In this sense, complex networks can be an effective tool for this analysis, since this theory models and identifies interactions through the relationships between objects and has been widely used in network analysis (Tanaka et al. 2020). Complex networks are related to graph theory, but with a more irregular, complex structure that can evolve dynamically over time. The application of this theory began with the effort to define new concepts and measures to characterize the topology of graphs, thus bringing together a series of principles and statistical properties (Boccaletti et al. 2006).
Systems modelled as a graph are represented by a set of vertices and edges (Stanković et al. 2019). Turning to the data acquired by WDN monitoring sensors, it is possible to statistically correlate these data to create graphs. As this monitoring occurs at defined time steps, the graph is also updated at each time step. Thus, events in the physical WDN can affect the structure of the graph and, consequently, the metrics that evaluate it. From this perspective, studying the relationships and interactions of hydraulic data (e.g. flow, pressure, and water quality parameters) can help in leak detection and localization. Methods linked to graph theory have been successfully used to detect anomalies in radar images (Pham et al. 2015), connectivity patterns in human brain networks (Farahani et al. 2019; Wright et al. 2021), and anomalies in structural integrity (Kaveh et al. 2022), among other applications. These studies create graphs from the relationships between the monitored data and, as new data arrive, check graph parameters such as the amount of information (Pham et al. 2015) and the information paths (Farahani et al. 2019; Kaveh et al. 2022).
Since the application of signal processing to data from water distribution systems is still incipient, this work presents a methodology for leak detection based on graph signal processing of pressure data. The pressure data of each monitored node are used to create a graph structure, and a vertex ranking metric is then applied. This process is first performed with historical data to learn the pattern of vertex rankings. When the centrality metric deviates from the historical pattern, an anomaly is detected and associated with a leak in the network. The proposed methodology proved to be fast and effective, requiring low computational effort for leak detection.
METHODS
This section presents the methodology for leak detection based on graph signal processing. The process starts from the historical pressure data and continues as new data become available for graph creation. The data are handled in a Python programming environment: the graph signal processing package PyGSP (Defferrard et al. 2017) is used to calculate the correlation between the data, and a graph is created from the correlations at each time step using the Network Analysis package in Python, NetworkX (Hagberg & Conway 2020). After creating the graph, the PageRank node ranking metric is applied; this metric assigns a rank value to each vertex based on the edge structure. Once the nodes have been ranked with the historical data, the ranking values of the vertices are recomputed as new data are monitored and, if these values do not follow the historical trend, a leak in the network is indicated.
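As a minimal sketch of how the ranking metric is obtained, the snippet below computes PageRank on a small weighted NetworkX graph; the sensor labels and edge weights are illustrative only and are not taken from the case study.

```python
import networkx as nx

# Toy weighted graph; node labels and weights are illustrative only.
G = nx.Graph()
G.add_weighted_edges_from([
    ("s1", "s2", 0.9),  # strongly correlated sensor pair
    ("s1", "s3", 0.4),
    ("s2", "s3", 0.7),
    ("s3", "s4", 0.2),  # weakly correlated pair
])

# PageRank using the edge attribute 'weight'; the damping factor is
# left at NetworkX's default (alpha = 0.85).
rank = nx.pagerank(G, weight="weight")
max_score = max(rank.values())  # the maximum score tracked by the method
print(rank, max_score)
```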
Graph theory applied to WDS
To model a graph G, its objects, called vertices (V), and the interactions between these vertices, called edges (E), are defined, so that the graph is represented as G = (V, E). The graph and its interactions can be expressed as an adjacency matrix (A), a Boolean matrix whose columns and rows are indexed by the vertices: when there is an interaction between two vertices, the corresponding edge is represented by the value 1 in the matrix A. A weighted graph can be created by adding values to the edges, called edge weights. This results in a matrix of weights (W), which is no longer Boolean, because it contains values other than 0 and 1 (Stanković et al. 2019).
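As an illustration of the two representations, the sketch below builds a Boolean adjacency matrix A and a weight matrix W for a toy four-vertex graph and converts them to NetworkX graphs; the values are invented for the example.

```python
import numpy as np
import networkx as nx

# Boolean adjacency matrix A for a toy 4-vertex graph (values invented).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

# Weight matrix W: same sparsity pattern, but edge weights
# (e.g. correlation strengths) replace the 0/1 entries.
W = np.array([[0.0, 0.9, 0.4, 0.0],
              [0.9, 0.0, 0.7, 0.0],
              [0.4, 0.7, 0.0, 0.2],
              [0.0, 0.0, 0.2, 0.0]])

G_unweighted = nx.from_numpy_array(A)  # edges only (weight 1)
G_weighted = nx.from_numpy_array(W)    # edges carry a 'weight' attribute
```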
The modelling of the WDN as graphs is possible by considering the network physical topology, with the vertices corresponding to the nodes (junction nodes, tanks, and reservoirs) and the edges representing the links (pipes, pumps, and valves) (Di Nardo & Di Natale 2011). Another way to model the WDN as a graph is by the correlation between the monitored data. In this case, the vertices represent the monitoring nodes and the edges are the interactions between these nodes (Bezerra et al. 2022).
Graph creation via pressure data correlation
This methodology calculates the distance between data pairs to create the matrix W, and is implemented with the PyGSP package. The data are arranged in a square matrix in which the columns represent the pressure data acquired by a set of sensors in the network and the rows are the time steps (the matrix X in Equation (1)). A new graph is generated as monitoring data become available. The first graph, however, uses a number of time steps equal to the number of sensors, in order to generate a square weight matrix W. Once new data become available, the oldest data are removed, always keeping the matrix W square. Finally, the matrix W is used to create the graph and the vertices are ranked for each new graph.
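The sketch below illustrates this sliding-window construction. The paper relies on PyGSP to derive the weights from the data; here the absolute Pearson correlation between sensor columns is used as a stand-in, and the function and variable names are assumptions made for the example.

```python
import numpy as np
import networkx as nx

def graph_from_window(X_window):
    """Build a weighted graph from a square window of pressure data.

    X_window has shape (n_sensors, n_sensors): rows are the most recent
    time steps, columns are the monitored sensors. The absolute Pearson
    correlation between sensor columns is used here as the edge weight.
    """
    corr = np.corrcoef(X_window, rowvar=False)  # sensor-by-sensor correlation
    W = np.abs(corr)
    np.fill_diagonal(W, 0.0)                    # no self-loops
    return nx.from_numpy_array(W)               # edges carry 'weight'

def slide_window(X_window, new_row):
    """Drop the oldest reading and append the newest one, keeping the window square."""
    return np.vstack([X_window[1:], new_row])
```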
Leak detection using vertex ranking
To detect leaks, we use the maximum values produced by the PageRank metric. The maximum score does not refer to a specific vertex, but to the highest value achieved across all vertices. The data from the first 7 days, which are simulated without leaks, provide the baseline maximum values and are compared with the data from the following days of simulation. If this comparison exceeds a threshold, a data anomaly is declared. The threshold is set by a sensitivity analysis, and the percentage that yields the best result is chosen. Finally, a confusion matrix is generated between two sets of data: the first considering the simulated leaks, whose start and end times are known, and the second the time steps with anomaly detections.
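A minimal sketch of this detection rule is given below, assuming hourly graphs and a relative deviation threshold; the 10% tolerance is only a placeholder for the value that the sensitivity analysis would select.

```python
import numpy as np
import networkx as nx

def max_pagerank(G):
    """Highest PageRank score over all vertices of a graph."""
    return max(nx.pagerank(G, weight="weight").values())

def detect_anomalies(graphs, n_baseline, tol=0.10):
    """Flag time steps whose maximum PageRank deviates from the baseline.

    graphs: one graph per hourly time step.
    n_baseline: number of leak-free time steps (e.g. 7 days * 24 h = 168).
    tol: relative deviation threshold; chosen by sensitivity analysis in
         the paper, the value 0.10 here is purely illustrative.
    """
    scores = np.array([max_pagerank(G) for G in graphs])
    baseline = scores[:n_baseline].max()
    deviation = np.abs(scores - baseline) / baseline
    return deviation > tol  # True marks a detected anomaly
```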
A confusion matrix is generally used in classification problems between two sets of data in four combinations: true positive (TP), true negative (TN), false positive (FP), and false negative (FN) (Berry et al. 2021). Table 1 shows how the classifications are arranged in the confusion matrix.
|                    | Observed Positive | Observed Negative |
|--------------------|-------------------|-------------------|
| Estimated Positive | TP                | FP                |
| Estimated Negative | FN                | TN                |
Of the two datasets compared in the confusion matrix, one contains the real observations, which in this research are the known leaks, and the other contains the observations predicted by the method. In other words, the observed data are the times at which the known leaks start and the estimated data are the times at which the PageRank values deviate from the normal trend by more than the margin of error.
Accuracy is computed from the confusion matrix as (TP + TN)/(TP + TN + FP + FN), expressed as a percentage; the results therefore range between 0 and 100, with 100 being a perfect prediction (Berry et al. 2021).
Case study
The research by Mankad et al. (2022) uses this network to place pressure sensors for leak detection and localization. The authors placed five pressure sensors (shown in Figure 2), and only the data from these sensors are considered in the graph generation process. This means that the data matrix X (Equation (1)) has five columns, resulting in graphs with five vertices. A leak simulation process is also required, owing to the lack of leakage scenarios for this network; therefore, seven nodes were selected as leak points, as shown in Figure 2.
Simulation process
The leak start times and their durations are also chosen at random. The entire simulation covers 40 days, with the first 7 days free of leaks to provide reference historical data; after that, leaks are included. During the simulation, the base demand and the hourly multiplication factor are multiplied by a randomness factor, here taken between 0.9 and 1.1, to bring the process closer to reality. The sensors have an acquisition frequency of 1 h, meaning that a graph is created and PageRank is calculated every hour.
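The demand perturbation can be sketched as below; the base demand and hourly pattern values are assumptions for illustration, not data from the case study.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def noisy_demand(base_demand, hourly_pattern, n_days=40):
    """Hourly demand series with the random perturbation described above.

    base_demand: nominal demand of a node.
    hourly_pattern: 24 hourly multiplication factors.
    Each hourly value is scaled by a factor drawn uniformly from
    [0.9, 1.1], as in the simulation setup described in the text.
    """
    pattern = np.tile(hourly_pattern, n_days)         # 24 * n_days values
    noise = rng.uniform(0.9, 1.1, size=pattern.size)  # randomness factor
    return base_demand * pattern * noise

# Illustrative call; the pattern values are invented for the example.
demands = noisy_demand(1.5, np.linspace(0.6, 1.4, 24))
```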
RESULTS AND DISCUSSION
It can be observed in Figure 5 that the maximum PageRank values change significantly when there are leaks in the network, with little relation to the intensity of the leaks. For example, the leak at node 250, despite its high leakage flow, does not significantly affect the behaviour of the PageRank values, especially when compared with the leaks at nodes 120 and 224, which are of lower intensity but cause considerable changes in the PageRank values. On the other hand, the leak at node 6 does not change the behaviour at all, probably because it is a short-duration, low-flow leak that the sensors may not have been able to capture.
It can be seen in Figure 6 that some leaks are detected throughout their entire duration (nodes 120, 224, and 49). The leaks at nodes 250, 43, and 6 are detected only at certain time steps, although these cover a large part of the leaks' duration. The leak at node 153, however, which lasts until the end of the simulation, is detected at only a few moments. For a better understanding of the results, a confusion matrix is created and is shown in Table 2.
|                    | Observed Positive | Observed Negative |
|--------------------|-------------------|-------------------|
| Estimated Positive | 655 (68%)         | 14 (1.46%)        |
| Estimated Negative | 113 (11.7%)       | 176 (18.3%)       |
Evaluating the confusion matrix in terms of accuracy, the detection method scored 86%, which, according to Chicco et al. (2021), indicates high accuracy in leak detection. This shows that the leak detection process proposed in this research achieves a good detection rate using monitoring data that are already collected in the network. The proposed methodology is easy and quick to execute, proving to be an effective approach to leak detection.
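As a quick check, the reported figure can be reproduced from the counts in Table 2 with the accuracy formula given earlier.

```python
# Accuracy check from the counts in Table 2.
tp, fp, fn, tn = 655, 14, 113, 176
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"{accuracy:.1%}")  # ~86.7%, consistent with the reported 86%
```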
CONCLUSIONS AND RECOMMENDATIONS
This research presented a methodology for leak detection that makes use of vertex ranking in graphs by the PageRank metric. The proposed methodology creates graphs from the correlation between the pressure data recorded by monitoring sensors, using graph creation and analysis packages in a Python programming environment. It was shown that the graphs change when leaks occur; consequently, the vertex ranking values also change, and these changes are used to detect leaks. A historical series of hourly pressure data is used to learn the vertex ranks for the different days, and a leak is detected when the maximum PageRank value of the vertices differs from those learned from the historical data. The proposal achieved a score of 86% over seven simulated leaks, with the results evaluated in terms of the accuracy of a confusion matrix. The methodology proved to be easy and quick to apply.
Even so, there are still points to be improved, some of which can be addressed in future work: for example, exploring new ways of creating graphs, such as methods that also consider the topology of the network. New vertex classification metrics can be used, considering, for example, vertex centrality or degree, and other sources of monitoring data can be incorporated, such as flows, reservoir levels, and water quality. Another approach to be pursued in the future is leak detection by other methods linked to graph theory, such as through the connections between the sensor nodes and the remaining network nodes.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.