## Abstract

Leakages in water distribution networks (WDNs) affect the hydraulic state of the entire network or a large part of it. Statistical correlation computed among pressure sensors monitoring network nodes aids the detection and localization of such leaks. This opens the possibility of working with water network databases, where graph signal processing (GSP) tools help in understanding changes in pressure signals due to leakages in the hydraulic system. This paper presents a methodology for analysing time-varying pressure signals on graph structures. The core of this methodology is that pressure changes caused by leaks modify the graph structure. By computing a new graph topology at each time step and applying centrality analysis based on PageRank, it is possible to identify the presence of new leaks in the water system. A confusion matrix evaluates the precision of the proposed methodology in defining where and when such leakages start and end. Seven leaks are used to validate the process, which achieved 86% accuracy. The results show the benefits of the method in terms of speed, computational efficiency, and precision in detecting leakages.

## HIGHLIGHTS

A novel data analysis methodology for leak detection in water distribution networks.

Correlation of pressure data for the creation of temporal graphs.

Analysis of water networks as graphs and vertex ranking by the PageRank metric.

## INTRODUCTION

The water distribution network (WDN) is an essential infrastructure responsible for supplying citizens with drinkable water. Therefore, damage to pipes has an important impact on the water distribution process and also on water quality and network hydraulics (Hu *et al.* 2021a, 2021b). In addition, a large portion of the treated water is lost in the WDN due to leaks, theft, measurement failures, and several other factors (Silva *et al.* 2021). According to Fan *et al.* (2021), around 3,281 megalitres of water were lost in the UK between 2009 and 2011, and in the US around 15% of the treated water supplied is lost annually. In Brazil, it is estimated that in 2019 about 38% of distributed water was lost (Oliveira *et al.* 2021). This shows that, even with research focused on loss reduction and leak detection, the proposed methodologies still require further development, mainly aimed at fast, effective leak detection that is easy to apply in WDNs.

Leaks are linked to losses in the water distribution process and are classified as reported, unreported, or background leakage (Adedeji *et al.* 2017). Reported leaks are visible on the ground and are easily detected by the public or by employees of network administrators (Chan *et al.* 2018). Unreported leaks are similar to reported ones, but they do not emerge from the ground. Finally, background leakages are small and difficult to detect by normal methods, often going unnoticed for a long time and resulting in significant losses (Abdulshaheed *et al.* 2017). Different methods are applied to detect leaks, and they can be classified as visual inspection, transient-based approaches, model-based approaches, and data-driven approaches (Chan *et al.* 2018).

Visual and sensor-based strategies require the use of mobile inspection equipment linked to optical, electromagnetic, or acoustic sensors. However, this is an expensive and time-consuming process, and the acoustic signals in particular are often influenced by the soil type and pipe material (Fan *et al.* 2021). Transient-based approaches analyze and evaluate the pressure transient wave caused by hydraulic changes in the system to detect leaks. This wave can quickly travel through the entire network, affecting flows and pressures, contracting or expanding pipes, and altering many other parameters of the system (Ayati *et al.* 2019). As the wave speed is usually high, from 400 m/s in PVC pipes up to 1,200 m/s in iron pipes, transient sensors are more expensive, and transient responses decay rapidly, making large-scale applications difficult (Fan *et al.* 2021). Model-based leak detection methods include, for example, the use of sensitivity matrices created under normal conditions and with known leaks (Geng *et al.* 2019; Salguero *et al.* 2019) and calibration approaches (Sophocleous *et al.* 2019). These methods are capable of detecting leaks; however, they require calibrated hydraulic models, with user-demand data, pipe conditions, and pressure behaviours, leading to results very sensitive to modelling and measurement errors (Fan *et al.* 2021; Hu *et al.* 2021a, 2021b). Finally, data-based methods include feature classification methods (Sun *et al.* 2019), prediction classification methods (Laucelli *et al.* 2016), mathematical–statistical methods (Quiñones-Grueiro *et al.* 2018), and unsupervised clustering (Geelen *et al.* 2019). These methods use network monitoring data, such as pressure, flow, and reservoir levels. Hu *et al.* (2021a, 2021b) conclude that data-based detection methods do not require deep WDN comprehension, but do require large amounts of data, being more suitable when there is a large amount of historical network data under analysis.

The leak detection process using mathematical–statistical tools has been successfully used in research related to WDNs. Brentan *et al.* (2021) used time series of hydraulic data (pressure, flow, and reservoir levels) and applied the fast Independent Component Analysis (fastICA) algorithm to separate the hydraulic data into independent components. The authors analyzed the independent components with a statistical control algorithm, detecting abrupt changes to identify cyber-attacks in WDNs. Gao *et al.* (2014) used a Blind Signal Separation (BSS) process on flow and pressure data to detect leaks and separate the leak flow from the total node flow. Okeya *et al.* (2014) used historical demand data, a network hydraulic model, and a modified Kalman Filter method to detect leaks in networks. The authors predict hourly demand from historical data, apply the predicted data to the hydraulic model to estimate flows and pressures, and apply the Kalman Filter to calculate corrected demands at the current time step, driven by the difference between predicted and observed data (Zanfei *et al.* 2022).

Computational mathematical methods successfully used in other research fields can be explored and applied to reduce losses in water distribution systems. In this sense, complex networks can be an effective tool for this analysis, since this theory models and identifies interactions through the relationships between objects and has been widely used in network analysis (Tanaka *et al.* 2020). Complex networks are linked to graph theory, but with more irregular, complex structures that can evolve dynamically over time. The application of this theory began with the effort to define new concepts and measures to characterize the topology of graphs, thus uniting a series of principles and statistical properties (Boccaletti *et al.* 2006).

Systems modelled as a graph are represented by a set of vertices and edges (Stanković *et al.* 2019). Turning to the data acquired by the WDN monitoring sensors, it is possible to statistically correlate these data to create graphs. As this monitoring occurs at each defined time step, the graph is also updated at each time step. Thus, events in the physical WDN can affect the structure of the graph and, consequently, the metrics that evaluate such graphs. From this perspective, studying the relationships and interactions of hydraulic data (e.g. flow, pressure, and water quality parameters) can help in leak detection and location. Methods linked to graph theory have been successfully used to detect anomalies in radar images (Pham *et al.* 2015), detect connectivity patterns in human brain networks (Farahani *et al.* 2019; Wright *et al.* 2021), detect anomalies in the integrity of structures (Kaveh *et al.* 2022), etc. These studies create graphs from the relationships between the monitored data and, as soon as new data are issued, verify certain graph parameters, such as the amount of information (Pham *et al.* 2015) and information paths (Farahani *et al.* 2019; Kaveh *et al.* 2022).

Since the application of signal processing to water distribution data is still incipient, this work presents a methodology for leak detection based on graph signal processing of pressure data. For this, the pressure data of each monitored node are used to create a graph structure, and then a vertex ranking metric is applied. This process is performed with the historical data to recognize the pattern of vertex classifications. When the centrality metric deviates from the historical pattern, an anomaly is detected and associated with leaks in the network. The proposed methodology proved to be fast and effective, and required low computational effort for leak detection.

## METHODS

This section presents the methodology for leak detection based on graph signal processing. The process starts by considering the historical pressure data as they are made available for graph creation. For this, the data are processed in a Python programming environment, and the graph signal processing package (PyGSP) (Defferrard *et al.* 2017) is used to calculate the correlation between the data. The graphs are created, with the correlation calculated at each time step, using the Network Analysis package in Python (NetworkX) (Hagberg & Conway 2020). After creating a graph, the PageRank node-ranking metric is employed; this metric assigns a rank value to the vertices based on the edge structure. After classifying the nodes with the historical data, as more data are monitored, the new classification value of the vertices is determined and, if this value does not follow the historical trend, a leak in the network is indicated.

After the simulations, the pressure data at the monitoring points of the network are correlated and used in the creation of graphs. The whole process follows the flowchart shown in Figure 1.

### Graph theory applied to WDS

To model a graph (*G*), in general, its objects, called vertices (*V*), and the interactions between these vertices, called edges (*E*), are determined, the graph then being represented as *G* = (*V*, *E*). The graph and its interactions can be expressed as an adjacency matrix (*A*), which is a Boolean matrix with columns and rows indexed by the vertices. Thus, when there is an interaction between two vertices, the edge is represented as the value 1 in the matrix *A*. A weighted graph can be created by adding values to the edges, which are called edge weights. This results in a matrix of weights (*W*), no longer Boolean, because it contains values other than 0 and 1 (Stanković *et al.* 2019).

The modelling of the WDN as graphs is possible by considering the network physical topology, with the vertices corresponding to the nodes (junction nodes, tanks, and reservoirs) and the edges representing the links (pipes, pumps, and valves) (Di Nardo & Di Natale 2011). Another way to model the WDN as a graph is by the correlation between the monitored data. In this case, the vertices represent the monitoring nodes and the edges are the interactions between these nodes (Bezerra *et al.* 2022).
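As a minimal illustration of the topology-based modelling described above, a small hypothetical WDN fragment (junction and pipe labels are invented for this sketch) can be built in NetworkX and its Boolean adjacency matrix extracted:

```python
import networkx as nx
import numpy as np

# Hypothetical fragment of a WDN: vertices are junctions (J*) and a tank (T1),
# edges are the pipes connecting them.
G_phys = nx.Graph()
G_phys.add_edge("J1", "J2", pipe="P1")
G_phys.add_edge("J2", "J3", pipe="P2")
G_phys.add_edge("J2", "T1", pipe="P3")

# Boolean adjacency matrix A: entry 1 when a pipe links two vertices
A = nx.to_numpy_array(G_phys)

# A weighted graph is obtained by attaching values (e.g. data correlations)
# to the edges, yielding the weight matrix W instead of the Boolean A
G_phys["J1"]["J2"]["weight"] = 0.9
```

The same `nx.Graph` structure is reused later in the paper's data-driven variant, where vertices are monitoring sensors rather than physical junctions.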

### Graph creation via pressure data correlation

The graph-learning method starts from an *X* data matrix, where the columns ($x_1, x_2, \dots, x_n$) represent the sensors and the rows represent the time signals:

$$X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{m \times n} \quad (1)$$

The method calculates the distances between data pairs, creating a *Z* distance matrix:

$$Z_{ij} = \|x_i - x_j\|^2$$

with $Z_{ij}$ denoting the distance between the temporal data of vertices $i$ and $j$. This method assumes that the signal is smooth on the graph, i.e. that the smoothness term $\tfrac{1}{2}\|W \circ Z\|_1$ involving the edge weight matrix ($W$) is small, and creates a graph that avoids very high weights in *W* by solving:

$$\min_{W \in \mathcal{W}} \; \|W \circ Z\|_1 - \alpha \mathbf{1}^{\top} \log(W\mathbf{1}) + \beta \|W\|_F^2$$

where $\mathcal{W}$ is the set of valid adjacency matrices (determined by the variation of weights over time). The logarithmic term prevents the degree vector $W\mathbf{1}$ of the matrix *W* from taking the value zero, controlling its sparsity and imposing a data-dependent structure.

This methodology calculates the distance between data pairs to create the matrix *W* and is implemented through the PyGSP package. The data are separated into a square matrix, where the columns represent the pressure data acquired by a set of sensors in the network and the rows are the temporal data. A graph is generated as monitoring data become available. The first graph, however, uses an amount of data equal to the number of sensors, in order to generate a square matrix of weights *W*. Once new data are available, the oldest data are removed, always keeping the matrix *W* square. Finally, the matrix *W* is used in the graph creation and the vertices are ranked with each new graph.
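The windowing and ranking steps above can be sketched as follows. This is a simplified stand-in, not the paper's exact pipeline: the pairwise distance matrix *Z* is converted to edge weights with a Gaussian kernel rather than by solving PyGSP's log-barrier optimization, and the function names are illustrative:

```python
import numpy as np
import networkx as nx

def build_graph(X):
    """Build a weighted graph from a square window of pressure data.

    X: (m, n) array with rows = time steps and columns = sensors.
    Simplified stand-in for the graph-learning step: pairwise squared
    distances Z are turned into edge weights with a Gaussian kernel
    instead of solving the log-barrier optimization.
    """
    n = X.shape[1]
    # Z[i, j]: squared Euclidean distance between the signals of sensors i and j
    Z = np.array([[np.sum((X[:, i] - X[:, j]) ** 2) for j in range(n)]
                  for i in range(n)])
    sigma2 = Z[Z > 0].mean() if np.any(Z > 0) else 1.0  # kernel bandwidth
    W = np.exp(-Z / sigma2)
    np.fill_diagonal(W, 0.0)  # no self-loops
    return nx.from_numpy_array(W)

def max_pagerank(G):
    """Highest PageRank score over all vertices (the anomaly statistic)."""
    return max(nx.pagerank(G, weight="weight").values())
```

With five sensors, each hourly window is a 5 × 5 matrix, and sliding the window by one sample per hour yields one graph, and one `max_pagerank` value, per time step.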

### Leak detection using vertex ranking

Page *et al.* (1999) proposed a method to rank each web page based on the web structure. The PageRank algorithm can be described as a way of evaluating the importance of a web page based on the quantity and quality of the links that point to it. In a classic discrete-time finite-state random walk model, Yao *et al.* (2012) denote by $P = (p_{gh})_{n \times n}$ the transition matrix, where *n* is the number of states and $p_{gh}$ is the probability of transition from state *g* to state *h*. A stationary distribution value *s* is defined as:

$$s^{\top} = s^{\top} P$$

where *s* is the eigenvector of *P* that corresponds to the eigenvalue 1.

The PageRank random walk combines two behaviours: one that follows the transition matrix *P* and another that jumps from state *g* to any other state with uniform probability, with *d* acting as a mixing (damping) parameter. In the initial iteration, all vertices are considered equal, with the same initial score (Yao *et al.* 2012), but over the iterations the weights of vertices and edges are updated. This makes the convergence rate limited and empirically mimics the behaviour of web users.

In its iterative form, the rank of a vertex is computed as:

$$PR(u) = (1 - d) + d \sum_{c \in B_u} \frac{PR(c)}{N_c}$$

where *u* represents the web page (or the vertex in a graph), $B_u$ is the set of vertices that point to *u*, $PR(u)$ and $PR(c)$ are the classification scores of vertices *u* and *c*, and $N_c$ is the number of edges leaving vertex *c* (Xing & Ghorbani 2004).
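The stationary distribution can be computed by power iteration, sketched below as a minimal NumPy implementation (the paper itself relies on NetworkX; using the normalized teleport term $(1-d)/n$ is a choice of this sketch, which ranks vertices identically up to scaling):

```python
import numpy as np

def pagerank_power(W, d=0.85, tol=1e-10, max_iter=200):
    """Power iteration for PageRank on a (weighted) adjacency matrix W."""
    n = W.shape[0]
    out = W.sum(axis=1, keepdims=True)
    # Row-stochastic transition matrix P; dangling rows jump uniformly
    P = np.divide(W, out, out=np.full_like(W, 1.0 / n, dtype=float),
                  where=out > 0)
    s = np.full(n, 1.0 / n)  # all vertices start with equal score
    for _ in range(max_iter):
        # Mix: follow an edge with probability d, or jump anywhere
        s_new = (1 - d) / n + d * (P.T @ s)
        if np.abs(s_new - s).sum() < tol:
            return s_new
        s = s_new
    return s
```

For a hub-and-spoke graph, the hub receives the highest score, illustrating why the maximum PageRank value reacts when a leak redistributes the edge weights.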

To detect leaks, we use the maximum values determined by the PageRank metric. The maximum score does not refer to a specific vertex, but to the highest value achieved across all vertices. The data of the first 7 days, simulated without leaks, are used as maximum base values and are compared with the data of the following days of simulation. If this comparison exceeds a threshold, a data anomaly is flagged. The threshold is determined by a sensitivity analysis, and the percentage that presents the best result is chosen. Finally, a confusion matrix is generated between two sets of data: the first considering the simulated leaks, whose start and end moments are known, and the second the time steps with anomaly detections.
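The detection rule can be sketched as below, where `threshold` is a hypothetical placeholder for the value the sensitivity analysis would select:

```python
import numpy as np

def detect_leaks(max_pr_series, baseline, threshold=0.10):
    """Flag time steps whose maximum PageRank deviates from the
    leak-free baseline by more than `threshold` (relative deviation).

    `threshold` is a placeholder: the paper selects it through a
    sensitivity analysis over candidate percentages.
    """
    baseline_max = max(baseline)  # highest score seen in the leak-free week
    series = np.asarray(max_pr_series, dtype=float)
    return np.abs(series - baseline_max) / baseline_max > threshold
```

The Boolean output series is exactly what is compared against the known leak intervals when building the confusion matrix.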

A confusion matrix is generally used in classification problems between two sets of data in four combinations: true positive (TP), true negative (TN), false positive (FP), and false negative (FN) (Berry *et al.* 2021). Table 1 shows how the classifications are arranged in the confusion matrix.

|  | Observed: Positive | Observed: Negative |
|---|---|---|
| **Estimated: Positive** | TP | FP |
| **Estimated: Negative** | FN | TN |


Of the two datasets in the confusion matrix, one contains the real observations, which in this research are the known leaks, and the other contains the observations predicted by the method. In other words, in this work, the observed data are the times at which the known leaks start and the estimated data are the times at which the PageRank values deviate significantly, beyond a marginal error, from the normal trend.

The accuracy is computed from the confusion matrix as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100$$

The results are between 0 and 100, with 100 being a perfect prediction result (Berry *et al.* 2021).
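The accuracy computation is a one-liner; applied to the counts later reported in Table 2 it gives a value close to the 86% stated in the paper:

```python
def accuracy(tp, fp, fn, tn):
    """Confusion-matrix accuracy as a percentage (0 = worst, 100 = perfect)."""
    return 100.0 * (tp + tn) / (tp + fp + fn + tn)
```

For example, `accuracy(655, 14, 113, 176)` evaluates to roughly 86.7.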

### Case study

The network presented by Bragalli *et al.* (2012) is adapted, and a set of leaks is simulated for the application of the methodology presented in this work. This network is based on the distribution network of the Italian city of Modena, in the Emilia-Romagna region; it is composed of 268 nodes, 4 reservoirs, and 317 pipes, and has no valves or pumps, as shown in Figure 2.

The research by Mankad *et al.* (2022) uses this network to place pressure sensors to detect and locate leaks. The authors placed five pressure sensors (shown in Figure 2), and only the data from these sensors are considered in the graph generation process. This means that the *X* data matrix (Equation (1)) has five columns and results in graphs with five vertices. A leak simulation process is also required due to the lack of leakage scenarios in this network. Therefore, seven nodes were selected as leak points, as shown in Figure 2.

### Simulation process

Leaks are considered as an additional demand flow at the network nodes, with the leakage flow rate ($Q_L$) determined by the orifice equation:

$$Q_L = C_d A \sqrt{2 g P}$$

where $C_d$ is the discharge coefficient, *A* is the orifice area, *g* is the gravity acceleration, and *P* is the pressure at the leak point. As the leak flow affects the node pressure and the pressure determines the leak flow, an iterative process must be carried out until the pressure and the leak flow rate stabilize; this flow is then used in addition to the demand. The $C_d$ used is equal to 0.6, following the research by Van Zyl (2014). The area values are chosen randomly between 0.00012 and 0.00050 m², the range used in the research by Van Zyl & Cassa (2014).

Another randomly chosen factor is the start time of each leak and its duration. The entire simulation process covers 40 days, with the first 7 days without leaks to obtain reference historical data; after that, leaks are included. During the simulation process, the base demand and the hourly multiplication factor are multiplied by a randomness factor, considered in this study to be between 0.9 and 1.1, making the process closer to reality. The sensors have an acquisition frequency of 1 h, meaning that a graph is created and PageRank is calculated every hour.
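The iterative pressure–leak-flow coupling can be sketched as below. The linear pressure response (`k`) is a hypothetical stand-in for re-running the hydraulic solver at each iteration, used only to make the fixed-point loop self-contained:

```python
import math

def leak_flow(pressure_head, area, cd=0.6, g=9.81):
    """Orifice equation: Q_L = Cd * A * sqrt(2 * g * P), in m^3/s."""
    return cd * area * math.sqrt(2.0 * g * max(pressure_head, 0.0))

def iterate_leak(node_pressure, area, cd=0.6, k=0.05, tol=1e-10, max_iter=100):
    """Fixed-point iteration between node pressure and leak flow.

    `k` (m of head lost per L/s of extra withdrawal) is a hypothetical
    linear response replacing a full solver run. Returns the stabilized
    leak flow (m^3/s) and pressure head (m).
    """
    p, q = node_pressure, 0.0
    for _ in range(max_iter):
        q_new = leak_flow(p, area, cd)
        if abs(q_new - q) < tol:
            q = q_new
            break
        q = q_new
        p = node_pressure - k * q * 1000.0  # pressure drops as leak flow grows
    return q, p
```

Because the leak flow lowers the local pressure, which in turn lowers the leak flow, the loop converges in a handful of iterations for realistic heads and orifice areas.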

## RESULTS AND DISCUSSION

It can be observed in Figure 5 that the maximum PageRank values change significantly when there are leaks in the network, with little relation to the intensity of the leaks. For example, the leak at node 250, even with a high leakage flow, does not significantly impact the behaviour of the PageRank values, especially when compared to the leaks at nodes 120 and 224, which are of lower intensity but cause considerable changes in the PageRank values. On the other hand, the leak at node 6 does not cause a change in behaviour, probably because it is a short-duration, low-flow leak that the sensors may not have been able to capture.

It can be seen in Figure 6 that some leaks are detected throughout their duration (nodes 120, 224, and 49). The leaks at nodes 250, 43, and 6 are detected only at certain time steps, although these cover a large part of the leaks' duration. However, the leak at node 153, which lasts until the end of the simulation, is detected at just a few moments. For a better understanding of the results, a confusion matrix is created and shown in Table 2.

|  | Observed: Positive | Observed: Negative |
|---|---|---|
| **Estimated: Positive** | 655 (68.4%) | 14 (1.5%) |
| **Estimated: Negative** | 113 (11.8%) | 176 (18.4%) |


Evaluating the confusion matrix in terms of accuracy, the detection method obtained a score of 86%. According to Chicco *et al.* (2021), this represents high accuracy in leak detection. This shows that the leak detection process proposed in this research achieves a good detection rate using monitoring data that are already routinely collected in the network. The proposed methodology is easy and quick to execute, proving to be an effective approach to leak detection.

## CONCLUSIONS AND RECOMMENDATIONS

This research presented a methodology for leak detection that makes use of the ranking of vertices in graphs by the PageRank metric. The proposed methodology creates graphs with the correlation between the pressure data emitted by monitoring sensors using graph creation and analysis packages in a Python programming environment. It was shown that the graphs undergo changes when leaks occur. Thus, there are also changes in the vertex ranking values, and these changes are used to detect leaks. A historical series of hourly pressure data is used to learn the vertices rank for the different days, and the leak is detected when the maximum PageRank value of the vertices is different from those learned from the historical data. The proposal achieved a score of 86% based on seven simulated leaks with the results evaluated in terms of the accuracy of a confusion matrix. The methodology proved to be easy and quick to apply.

Even so, there are still points to be improved. Some of these can be addressed in future works, for example by exploring new ways of creating graphs, such as methods that consider the topology of the network. New vertex classification metrics can also be used, considering, for example, the centrality or the degree of the vertices, as well as other sources of monitoring data, such as flows, reservoir levels, and water quality. Another approach that can be followed in future work is leak detection by other methods linked to graph theory, such as through the connection between the sensor nodes and the other network nodes.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.

## CONFLICT OF INTEREST

The authors declare there is no conflict.

## REFERENCES

Sophocleous, S., Savić, D. & Kapelan, Z. 2019 *Leak Localization in a Real Water Distribution Network Based on Search-Space Reduction*. Journal of Water Resources Planning and Management **145**(7).