## Abstract

Leak detection has significant implications for the long-term stable operation of water distribution networks (WDNs). This study presents a novel leak detection method that calculates the variance of the angles between a pressure vector and the other vectors in the database to evaluate whether an anomaly is present in a network. The method first requires a reliable dataset of pressure sensor readings, which was generated with EPANET 2.2. Node water demand data under normal conditions were generated by the Monte Carlo method, and leak conditions with various leak flows were simulated by creating leak holes in the pipes. By learning from normal and abnormal data combined in a fixed proportion, the angle-based outlier detection (ABOD) model was employed to identify abnormal events. The method was applied to an actual WDN, and its performance in identifying anomalies was compared with that of previous detection methods. The results indicate that the proposed method can significantly improve the accuracy and efficiency of leak detection compared with the threshold-based and distance-based detection methods.

## HIGHLIGHTS

The proposed method for leak detection in the WDN combines the hydraulic model with outlier detection.

Numerous normal and leak scenarios are simulated by the hydraulic model to calculate residuals.

A clustering algorithm is used to determine the optimal locations of pressure sensors.

The performance of the proposed method is compared with the distance-based method in the literature.

## INTRODUCTION

Leakages in water distribution networks (WDNs) are a key issue in water resource allocation and management. Besides depleting water resources, serious leakages waste the energy spent on water treatment and allow external contaminants to intrude through broken pipes (Fontanazza *et al.* 2015; Beker & Kansal 2022). In 2020, water loss in Chinese cities was 7.85 billion m^{3}, with a comprehensive leak rate of 13.39% (CUWA 2020). Therefore, a timely and reliable anomaly detection method helps water utilities detect leakages more promptly and save precious water resources.

Leakages can form on mains and service pipes (Farley & Trow 2003) and may be caused by various factors, such as poor pipe quality, inappropriate operation, and extreme weather (e.g., freezing weather). Therefore, it is hard to find out when and where leakages happen. To improve the efficiency of anomaly identification, numerous methods have been studied and developed. These methods can be generally divided into two main categories: one based on hardware and the other on software (Valizadeh *et al.* 2009).

Many technologies are based on highly specialized hardware equipment, such as leak noise correlators (Guo *et al.* 2021), leak noise loggers (Muggleton *et al.* 2006), gas injection (Hunaidi *et al.* 2000), ground penetrating radar (Hunaidi 1998), and infrared photography (Fahmy & Moselhi 2010). Although they perform well in leak detection and location (Puust *et al.* 2010), their drawbacks cannot be ignored: they are labor-intensive, expensive, and slow, and may require long interruptions of pipeline operations (Romano *et al.* 2011).

With the widespread application of advanced meter infrastructure, data loggers, and sensors, it is imperative to integrate the hydraulic model with data processing methods for higher quality and efficiency of water management (Wu & He 2021). Software-based methods actively analyze the signals measured in WDNs to detect leakages. The signals most commonly measured by wireless sensors installed in WDNs are flow and pressure. Software-based methods can be roughly categorized into transient-based approaches, model-based approaches, and data-driven approaches. Transient-based approaches extract information about the presence of leakages from transient pressure signals measured within a WDN (Liggett & Chen 1994; Christodoulou *et al.* 2017). Experimental results demonstrate that even a small leak point can be detected by this method (Mpesha *et al.* 2001; Covas *et al.* 2005). However, restricted by technologies and costs, current transient-based methods have not been fully validated in actual networks (Li *et al.* 2015).

Benefitting from technical improvements in modeling software and the popularity of supervisory control and data acquisition (SCADA) systems, model-based approaches have improved over the last two decades (Kim *et al.* 2010; Yu *et al.* 2021). These methods detect leakages by comparing measured flow or pressure data with values obtained by simulating the hydraulic conditions of WDNs. Compared with other methods, establishing a hydraulic model with certain monitoring information to detect leakages can be more economical and simpler to apply. However, these models need to be updated whenever the topology changes, and they require more sophisticated calibration equations to perform well.

Given the limitations of the aforementioned approaches, and with the development of computing capabilities, data-driven approaches have been intensively studied since the beginning of the 21st century. Data-driven approaches can be classified as supervised and unsupervised approaches, including prediction, classification, and clustering (Zaman *et al.* 2020). The purpose of the prediction stage is to generate estimated data under normal conditions of networks. Mounce *et al.* (2010) introduced artificial neural networks for leak detection by analyzing the similarity of abnormal pressure or flow variation; such networks can model any function without specific parameters and so handle complex hydraulic problems (Mounce *et al.* 2002; Romano *et al.* 2014). The classification stage is then adopted to compare the residuals between predicted values and actual measurements to evaluate abnormal events (Mounce *et al.* 2011; Bakker *et al.* 2014). These methods have difficulty learning complex features, so Zhou *et al.* (2019) proposed a novel burst location identification framework based on a fully linear DenseNet, which replaces the convolutional layers in DenseNet with linear connections. To reduce misjudgments in the prediction process, clustering-based approaches were studied to detect leakages according to the similarities of monitoring data (Wu *et al.* 2016); these are implemented at the district metering area (DMA) level. However, due to the large scale of the networks in many cities, distance-based anomaly detection algorithms may suffer from the ‘curse of dimensionality’, which degrades their performance.

This paper aims to improve the accuracy and efficiency of leak detection by calculating the angle variance for each candidate data vector (containing the pressure values from all sensors at the same time). The research has four objectives: (1) importing the sensitivity matrix into the fuzzy c-means (FCM) algorithm and calculating the cluster centers to determine the installation positions of sensors, (2) simulating normal operating conditions by the Monte Carlo method and collecting various abnormal events with the water network tool for resilience (WNTR), (3) evaluating the performance of the proposed method and other outlier detection methods for large leakages, and (4) comparing the accuracy of the angle-based method with that of the distance-based method in minor leak scenarios.

## METHODS

### Leak scenarios generation

There are many methods of simulating leakages, such as the emitter method (Giustolisi *et al.* 2008), the artificial reservoir method (Ang & Jowitt 2006), and the additional node demand method (Shao *et al.* 2019). In this paper, leaks are simulated by adding a leak hole in the middle of a pipe and analyzing the hydraulic operating parameters with WNTR. WNTR is an open-source Python package designed to help water utilities simulate and analyze the resilience of WDNs, and it is compatible with EPANET 2.00.12 and EPANET 2.2.

In Equation (1), *q*_{l} is the leak demand (m^{3}/s), *C*_{d} is the discharge coefficient (unitless), *A* is the area of the hole (m^{2}), *α* is an exponent related to characteristics of the leak (unitless), *g* is the acceleration of gravity (m/s^{2}), and *h* is the gauge head (m). The discharge coefficient *C*_{d} and leak exponent *α* were set to 0.75 and 0.5 in this paper (Lambert 2001; Greyvenstein & van Zyl 2007).

According to Equation (1), it is essential to confirm the diameter of the leak hole, so that the area of the hole can be calculated. Then, by adding a leak to each pipe in turn and calculating the hydraulic parameters, the node pressure values of every leak event are generated to calculate the node pressure sensitivity matrix.
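The relation between hole size, head, and leak flow can be illustrated with a short sketch. It assumes the orifice-type form q_l = C_d · A · sqrt(2g) · h^α, which is consistent with the symbols defined above but is not quoted from the paper; the function name `leak_flow` and the example hole size and head are hypothetical.

```python
import math

def leak_flow(diameter_m, head_m, c_d=0.75, alpha=0.5, g=9.81):
    """Leak demand q_l (m^3/s) through a circular hole.

    Assumed form: q_l = C_d * A * sqrt(2 g) * h**alpha.
    With alpha = 0.5 this reduces to the classic orifice equation.
    """
    if head_m <= 0:
        return 0.0  # no outflow without positive gauge head
    area = math.pi * diameter_m ** 2 / 4.0  # hole area A (m^2)
    return c_d * area * math.sqrt(2.0 * g) * head_m ** alpha

# Illustrative values only: a 20 mm hole under 30 m of gauge head
q = leak_flow(0.02, 30.0)
print(f"leak flow: {q * 1000:.2f} L/s")
```

Because the area grows with the square of the diameter, doubling the hole diameter quadruples the leak flow at the same head.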

The quantity and quality of data have an important effect on the performance of the leak detection method. However, leak events in real WDNs are infrequent, and records of leaks in actual pipe networks are hard to collect. Therefore, synthetic leak data were generated by hydraulic model simulation in this paper. These data represent most of the leak events.

In this paper, the degree of a leak depends on the diameter of the leak hole. The minimum leak diameter is the smallest leak that triggers its associated sensor. To find the maximum leak diameter, the leak diameter is first set to the diameter of the leaking pipe. If no node has negative pressure and at least one sensor responds, the current diameter is taken as the maximum leak diameter. Otherwise, the leak diameter is reduced step by step until the condition is satisfied, and the maximum extent of leakage is redefined as the current leak diameter.
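The search for the maximum leak diameter can be sketched as a simple loop. The feasibility check here is a hypothetical stand-in for a hydraulic simulation (no negative node pressure and at least one responding sensor); the function names, step size, and toy acceptance rule are assumptions for illustration only.

```python
def max_leak_diameter(pipe_diameter, step, min_diameter, is_feasible):
    """Shrink the leak hole from the full pipe diameter until `is_feasible`
    accepts it. Returns None if no diameter down to `min_diameter` qualifies."""
    k = 0
    while True:
        d = pipe_diameter - k * step  # avoids accumulating rounding error
        if d < min_diameter:
            return None
        if is_feasible(d):
            return d
        k += 1

# Toy stand-in for the hydraulic check: any hole up to 0.12 m is assumed
# to keep all node pressures non-negative and to trigger a sensor.
def toy_check(diameter):
    return diameter <= 0.12

d_max = max_leak_diameter(pipe_diameter=0.30, step=0.05,
                          min_diameter=0.01, is_feasible=toy_check)
print(d_max)
```

In practice `is_feasible` would run the hydraulic model with a leak of the given diameter and inspect the resulting node pressures.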

### Pressure sensor deployment

Here, *P*_{0} is the normal node pressure, an *n*-dimensional column vector, *n* is the number of nodes in the network, *P*_{l} is the node pressure under various leaking scenarios, and *m* is the total number of pipes with potential for leakages.

In the sensitivity matrix, *n* is the number of nodes where pressure sensors can be installed, *m* is the number of pipes in the network, and each element is the pressure difference of node *n* between normal conditions and pipe *m* leaking.
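As a minimal sketch of how such a matrix of pressure differences could be assembled, the snippet below subtracts each leak-scenario pressure vector from the normal pressure vector. The sign convention (normal minus leak), the function name `sensitivity_matrix`, and all numbers are illustrative assumptions, not values from the case study.

```python
def sensitivity_matrix(p_normal, p_leak_by_pipe):
    """Return an n x m matrix: rows are nodes, columns are leaking pipes.

    Entry [i][k] is the pressure drop at node i when pipe k leaks,
    computed as normal pressure minus leak pressure (assumed sign).
    """
    return [[p_normal[i] - p_leak[i] for p_leak in p_leak_by_pipe]
            for i in range(len(p_normal))]

# Illustrative pressures (m) at 3 nodes under normal conditions
p0 = [30.0, 28.0, 25.0]
# Pressures at the same nodes when pipe 0 or pipe 1 leaks (made-up values)
p_leaks = [[29.5, 27.75, 25.0],
           [30.0, 27.0, 24.5]]

S = sensitivity_matrix(p0, p_leaks)
for row in S:
    print(row)
```

Each column of `S` then describes how strongly every node responds to a leak on one pipe, which is the kind of input a clustering step can group.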

The FCM algorithm works by dividing *X* into *k* categories and setting the cluster center matrix *V*. Equation (4) is the objective function of the FCM algorithm. The function sums the pairwise difference of every data value and cluster center (Askari 2021). In Equation (4), *J* is the objective function, *n* is the number of nodes, *c* ∈ [2, *n*] is the total number of clusters, *v*_{i} is the *i*-th cluster center, *x*_{j} is the *j*-th data point, *u*_{ij} is the membership value of the *j*-th data point in the *i*-th cluster, and *m* is the weighting index (*m* > 1) that affects the clustering results.
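For reference, the objective function as it is usually written in the FCM literature is shown below. This is the standard textbook form, given here on the assumption that it matches the symbols defined above; it is not copied from the paper's Equation (4).

```latex
J(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{\,m} \, \lVert x_j - v_i \rVert^{2},
\qquad \text{subject to } \sum_{i=1}^{c} u_{ij} = 1 \ \text{ for every } j .
```

The constraint states that the memberships of each data point across all clusters sum to one, which is what makes the partition fuzzy rather than hard.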

The FCM algorithm starts with the initialized membership matrix *U* = rand(*c*, *n*) = [*u*_{ij}]. The cluster centers *v*_{i} and membership degrees *u*_{ij} are updated until the algorithm converges. The maximum number of iterations should be set as a parameter of the algorithm. The convergence condition ends the loop: the program continues until the objective function value is less than a preset tolerance. When the convergence condition is satisfied, the clustering process ends and the result is output.

The FCM clustering algorithm was applied to determine the installation sites of the pressure sensors. After normalizing the node pressures, the node *j* with the shortest Euclidean distance to the center of cluster *i* was selected as the sensor site.
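The clustering and site-selection steps can be sketched in plain Python. This is a generic FCM implementation under stated assumptions, not the authors' code: convergence is tested on the change in the objective between iterations, the random initialization is seeded, and the six two-dimensional points stand in for rows of a normalized sensitivity matrix.

```python
import random

def fcm(data, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-means: returns (cluster centers, membership matrix U of size c x n)."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    # Random memberships; each column is normalized to sum to 1.
    u = [[rng.random() for _ in range(n)] for _ in range(c)]
    for j in range(n):
        s = sum(u[i][j] for i in range(c))
        for i in range(c):
            u[i][j] /= s
    prev_obj = None
    for _ in range(max_iter):
        # Centers are membership-weighted means of the data.
        centers = []
        for i in range(c):
            w = [u[i][j] ** m for j in range(n)]
            total = sum(w)
            centers.append([sum(w[j] * data[j][k] for j in range(n)) / total
                            for k in range(dim)])
        # Squared distances (a tiny floor avoids division by zero).
        d2 = [[sum((data[j][k] - centers[i][k]) ** 2 for k in range(dim)) or 1e-12
               for j in range(n)] for i in range(c)]
        # Membership update from relative squared distances.
        for j in range(n):
            for i in range(c):
                u[i][j] = 1.0 / sum((d2[i][j] / d2[l][j]) ** (1.0 / (m - 1.0))
                                    for l in range(c))
        obj = sum(u[i][j] ** m * d2[i][j] for i in range(c) for j in range(n))
        if prev_obj is not None and abs(prev_obj - obj) < tol:
            break  # assumed stopping rule: objective has stopped changing
        prev_obj = obj
    return centers, u

def nearest_point_index(points, center):
    """Index of the data point closest (Euclidean) to a cluster center."""
    return min(range(len(points)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(points[j], center)))

# Two well-separated toy groups standing in for node sensitivity vectors
nodes = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
         [5.0, 5.1], [5.1, 5.0], [5.05, 5.05]]
centers, U = fcm(nodes, c=2)
sites = sorted(nearest_point_index(nodes, v) for v in centers)
print(sites)
```

Picking the data point nearest to each center, rather than the center itself, keeps every proposed sensor site on an actual node of the network.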

### Leak event detection

A leak event in the WDN is one type of abnormal event, which is reflected in monitoring data as outliers in pressure or flow. Anomaly detection aims to identify outlier points in a dataset accurately and efficiently. There are three dominant types of anomaly detection algorithms: distance-based, density-based local, and statistical-based. These methods are mainly based on statistical theory and use Euclidean distance as the anomaly evaluation criterion.

There are dozens of monitoring sensors in a large WDN, which generate high-dimensional monitoring data at each time step. However, as the dimensionality increases, the distances between points concentrate, so that the distance to the nearest neighbor approaches the distance to the farthest neighbor (Beyer *et al.* 1999). It is therefore hard to identify anomalies in a large WDN, which poses a potential risk to the stability of the WDN.
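The concentration of distances can be observed numerically. The sketch below draws uniformly random points (an assumption made purely for illustration, unrelated to real pressure data) and reports the ratio of the nearest to the farthest distance from a query point; it relies on `math.dist`, available in Python 3.8 and later.

```python
import math
import random

def nearest_to_farthest_ratio(dim, n_points=200, seed=1):
    """Ratio of nearest to farthest Euclidean distance from one random query
    point to n_points random points in the unit hypercube of dimension dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)

for dim in (2, 20, 200):
    print(dim, round(nearest_to_farthest_ratio(dim), 3))
```

A ratio close to 1 means the nearest and farthest neighbors are almost equally far away, so a purely distance-based score loses contrast.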

Different from the methods mentioned above, angle-based outlier detection (ABOD) is an effective method for detecting outliers in high-dimensional datasets. It evaluates the degree of outlierness by the variance of the angles (VOA) between the target object and the other objects. Some researchers compared the performance of distance-based and angle-based algorithms on several high-dimensional datasets and verified that ABOD is more stable as data dimensionality increases (Ye *et al.* 2014).

For a point *O*, the distance-weighted angle variance serves as its outlier score: the angle ∠*AOB* is calculated for each pair of points *A*, *B* (*A* ≠ *O* and *B* ≠ *O*), and the variance of these angles gives the score. For a normal data point within a cluster (e.g., *A* and *C*), the angles to pairs of other data points vary greatly. For a point located at the border of a cluster (e.g., *B* and *D*), the variation of the angles is smaller. For an outlier (e.g., *E*), the spectrum of angles is substantially narrower, because the other points are located only in certain directions (Kriegel *et al.* 2010), which means that a point positioned outside the cluster is an outlier.

The angle-based outlier factor (ABOF) is defined as follows:
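The formula itself is not reproduced in this text. In the compact form commonly given for ABOD, written here with the point names used above and offered as an assumption about the intended equation, the factor for a point *O* in dataset *D* is the variance, over all pairs of other points, of the distance-weighted cosine term:

```latex
\mathrm{ABOF}(O) \;=\; \operatorname*{VAR}_{A,\,B \,\in\, D \setminus \{O\}}
\left( \frac{\langle \overrightarrow{OA},\, \overrightarrow{OB} \rangle}
            {\lVert \overrightarrow{OA} \rVert^{2} \cdot \lVert \overrightarrow{OB} \rVert^{2}} \right)
```

Dividing by the squared norms makes distant pairs contribute less, so far-away points influence the score less than close neighbors.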

The farther a data point is from the clusters, the smaller the variance of its angles and the smaller its ABOF. ABOD therefore calculates the ABOF for each data point and outputs the points of the dataset in ascending order of ABOF.
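A small sketch can make the ranking concrete. It computes a simplified ABOF as the plain (unweighted) variance of the distance-weighted cosine term over all pairs, which is an assumption rather than the exact weighted variance of the original formulation; the toy points are made up, and duplicate points are assumed absent to avoid division by zero.

```python
from itertools import combinations

def abof(points, idx):
    """Simplified angle-based outlier factor of points[idx]:
    variance of <OA, OB> / (|OA|^2 * |OB|^2) over all pairs A, B != O."""
    o = points[idx]
    others = [p for j, p in enumerate(points) if j != idx]
    terms = []
    for a, b in combinations(others, 2):
        oa = [x - y for x, y in zip(a, o)]
        ob = [x - y for x, y in zip(b, o)]
        dot = sum(x * y for x, y in zip(oa, ob))
        na2 = sum(x * x for x in oa)
        nb2 = sum(x * x for x in ob)
        terms.append(dot / (na2 * nb2))
    mean = sum(terms) / len(terms)
    return sum((t - mean) ** 2 for t in terms) / len(terms)

# A 3 x 3 grid acting as a cluster, plus one distant point (index 9)
grid = [(x, y) for x in (-1, 0, 1) for y in (-1, 0, 1)]
points = grid + [(10, 10)]

scores = [abof(points, i) for i in range(len(points))]
ranking = sorted(range(len(points)), key=lambda i: scores[i])
print(ranking[0])  # index with the smallest ABOF, i.e. the most outlying point
```

The loop over all pairs costs O(n²) per point, so an exact computation over a large database is cubic overall; approximate variants that use only nearest neighbors are a common workaround.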

## CASE STUDY AND RESULTS

### Network description and dataset establishment

*et al.* 2022). Therefore, the number of sensors was adopted in the case study. To detect abnormal events more effectively, the FCM algorithm was used to determine the installation sites of the sensors and the sensitive area of each sensor. The core step is importing the sensitivity matrix into the model and calculating the cluster centers with the FCM algorithm. According to the clustering results, the pressure sensors are placed at nodes 3, 9, 26, 27, 29, 41, and 44, which are represented by colored triangles in Figure 3.

The methodology of detecting an anomaly in the WDN relies on the data acquired from a SCADA system. As mentioned in Section 2.1, the construction of the SCADA system contained two parts: (1) the establishment of a pressure database under normal operating conditions and (2) the establishment of a pressure database under different extents of leakage. The construction of the database considered the uncertainty of modeling error and measurement error, which is reflected in the parameters as random noise.

For modeling error, the Monte Carlo method was used to generate water demand data with a 20% fluctuation at each node as random noise (Kapelan *et al.* 2005), and the corresponding pressure values were calculated by WNTR. The Monte Carlo simulation used a sample size of 10,000 to ensure stable operating conditions. Random noise *N*(0, 0.2 m) was added to the pressure values as measurement error.
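The two noise sources can be sketched as follows. The sketch assumes that the 20% fluctuation is uniform within ±20% of the base demand and that 0.2 m is the standard deviation of the Gaussian measurement noise; both readings are assumptions, as are the base demands and pressures. The hydraulic calculation that turns demands into pressures is not shown.

```python
import random

def sample_demands(base_demands, fluctuation=0.20, rng=random):
    """One Monte Carlo draw: scale each nodal demand by a random
    factor in [1 - fluctuation, 1 + fluctuation] (assumed uniform)."""
    return [d * (1.0 + rng.uniform(-fluctuation, fluctuation))
            for d in base_demands]

def add_measurement_noise(pressures, sigma=0.2, rng=random):
    """Add zero-mean Gaussian noise (sigma in metres, assumed) to pressures."""
    return [p + rng.gauss(0.0, sigma) for p in pressures]

rng = random.Random(42)
base = [12.0, 8.5, 20.0]  # hypothetical base demands (L/s)

# 10,000 demand scenarios, matching the sample size stated in the text
scenarios = [sample_demands(base, rng=rng) for _ in range(10_000)]

# Noise applied to one hypothetical vector of simulated pressures (m)
noisy = add_measurement_noise([30.0, 28.0, 25.0], rng=rng)
print(len(scenarios), [round(p, 2) for p in noisy])
```

Each sampled demand vector would be passed to the hydraulic solver, and the noise function applied to the resulting sensor pressures.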

For the traditional leak detection method, a threshold value is commonly adopted for each node. Under normal conditions, the pressure values follow a normal distribution. According to the Western Electric Corporation rules, the threshold in this paper was defined as 2 times the standard deviation of the normal pressure data, as suggested in the literature (Cheng *et al.* 2020). If a leak event occurs in the WDN, the pressure fluctuates suddenly. In particular, hydraulic operating parameters vary greatly during the minimum hour demand period (which is always around midnight) (Qi *et al.* 2018). As the leak flow increases, the pressure value falls below the threshold, which triggers the alarm.
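A threshold rule of this kind can be sketched in a few lines. The sketch assumes the alarm limit is the mean of the normal pressures minus two standard deviations; the text specifies the factor of 2 but not the reference level, so the mean is an assumption, and the pressure samples are made up.

```python
import statistics

def lower_threshold(normal_pressures, k=2.0):
    """Alarm limit for one node: mean minus k standard deviations
    of pressures recorded under normal conditions."""
    mu = statistics.mean(normal_pressures)
    sigma = statistics.stdev(normal_pressures)
    return mu - k * sigma

def alarm(pressure, threshold):
    """True when the measured pressure falls below the alarm limit."""
    return pressure < threshold

normal = [30.0, 30.2, 29.8, 30.1, 29.9]  # hypothetical normal readings (m)
limit = lower_threshold(normal)
print(round(limit, 3), alarm(29.5, limit), alarm(29.9, limit))
```

A leak whose pressure drop stays inside the two-sigma band never raises this alarm, which is the gap the outlier detection models are meant to close.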

*d*^{l}, which is defined as the minimum leak diameter. Then, the diameter of the leak hole was increased in steps of 0.005 *d*^{l} until the pipe completely burst. After several iterations of hydraulic calculation, a total of 12,742 possible leak events in this network were collected in the leak database. The spectrum of leak flow for all leak scenarios is shown in Figure 4.

In this paper, the leak events were divided into two classes: slight leakage, for which the pressure drop does not exceed the threshold, and large leakage, to which the pressure sensors respond. The large class was divided into five levels based on leak flow, in which level Ⅴ indicates that the leak extent just exceeds the threshold value and triggers an alarm, and level Ⅰ indicates the most severe leakage. The leak degree classification results are shown in Table 1. The slight leak class was further divided into three categories according to leak flow, with ranges of 10–50, 50–100, and 100–175 L/s.

Table 1. Leak level classification by leak flow

| Leak flow (L/s) | Leak level |
|---|---|
| 176–800 | Ⅴ |
| 801–1,500 | Ⅳ |
| 1,501–2,200 | Ⅲ |
| 2,201–2,900 | Ⅱ |
| 2,901–3,352 | Ⅰ |


### Detection performance

To evaluate the model performance more comprehensively, the area under the receiver operating characteristic curve (ROC-AUC) was adopted. ROC is a composite indicator reflecting the continuous variables of sensitivity and specificity. The ROC curve is built from a series of different dichotomies (boundary values or decision thresholds), where the *x*-coordinate is the false positive rate (FPR) and the *y*-coordinate is the true positive rate (TPR). The area under the ROC curve is the AUC; the larger the AUC, the better the performance of the model.
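The AUC can be computed without drawing the curve, using its interpretation as the probability that a randomly chosen anomaly receives a higher anomaly score than a randomly chosen normal sample. The sketch below uses that pairwise definition; the labels and scores are made-up illustrative values, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# 1 = leak, 0 = normal; higher score means more anomalous
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

For ABOD, a smaller ABOF indicates a more anomalous point, so the negated ABOF would serve as the score in this function.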

To verify the accuracy and effectiveness of ABOD for anomaly detection, four common unsupervised outlier detection algorithms were selected for comparison: empirical-cumulative-distribution-based outlier detection (ECOD), principal component analysis (PCA), isolation forest (IForest), and local outlier factor (LOF). Based on the normal and leak pressure values, each detection model was used to identify anomalies in the WDN. A total of 6,300 normal data and 700 abnormal data were selected randomly as samples and divided into training and testing data in the proportion 8:2. The leak levels that can trigger the pressure sensor alarm were analyzed by the outlier detection models, and the AUC and precision results are shown in Tables 2 and 3.

Table 2. AUC of the outlier detection algorithms for leak levels Ⅴ to Ⅰ

| Algorithm | Ⅴ Training | Ⅴ Test | Ⅳ Training | Ⅳ Test | Ⅲ Training | Ⅲ Test | Ⅱ Training | Ⅱ Test | Ⅰ Training | Ⅰ Test |
|---|---|---|---|---|---|---|---|---|---|---|
| ECOD | 0.942 | 0.930 | 0.942 | 0.921 | 0.942 | 0.922 | 0.942 | 0.936 | 0.942 | 0.927 |
| PCA | 0.979 | 0.999 | 0.979 | 1.000 | 0.979 | 1.000 | 0.979 | 1.000 | 0.979 | 0.999 |
| IForest | 0.982 | 0.980 | 0.983 | 0.984 | 0.987 | 0.984 | 0.980 | 0.991 | 0.984 | 0.988 |
| LOF | 0.689 | 1.000 | 0.689 | 1.000 | 0.689 | 1.000 | 0.689 | 1.000 | 0.689 | 1.000 |
| ABOD | 0.999 | 1.000 | 0.999 | 1.000 | 0.999 | 1.000 | 0.999 | 1.000 | 0.999 | 1.000 |


Table 3. Precision of the outlier detection algorithms for leak levels Ⅴ to Ⅰ

| Algorithm | Ⅴ Training | Ⅴ Test | Ⅳ Training | Ⅳ Test | Ⅲ Training | Ⅲ Test | Ⅱ Training | Ⅱ Test | Ⅰ Training | Ⅰ Test |
|---|---|---|---|---|---|---|---|---|---|---|
| ECOD | 0.547 | 0.390 | 0.547 | 0.320 | 0.547 | 0.327 | 0.547 | 0.410 | 0.547 | 0.633 |
| PCA | 0.871 | 0.957 | 0.871 | 0.977 | 0.871 | 0.983 | 0.871 | 0.993 | 0.871 | 0.982 |
| IForest | 0.813 | 0.777 | 0.834 | 0.853 | 0.876 | 0.883 | 0.800 | 0.897 | 0.833 | 0.880 |
| LOF | 0.346 | 1.000 | 0.346 | 1.000 | 0.346 | 1.000 | 0.346 | 1.000 | 0.346 | 1.000 |
| ABOD | 0.987 | 1.000 | 0.987 | 1.000 | 0.987 | 1.000 | 0.987 | 1.000 | 0.987 | 1.000 |


### Comparison with other leak identification methods

In this section, the performance of the proposed ABOD is further compared with distance-based outlier detection (DBOD) to comprehensively evaluate the applicability of the two methods in a medium-scale network. The core of DBOD is to calculate, for each vector, its local density and its distance from points with a higher density according to the Euclidean distances *d*_{ij} between vectors, and then to select as outliers the vectors that have a low local density and a long distance. This method, created by Rodriguez & Laio (2014), has been applied to burst detection in a small WDN (Wu *et al.* 2016).

Leak events with a leak flow of less than 175 L/s were selected. These events are too minor to trigger the monitoring sensors (according to the threshold-based method), so they could be overlooked and develop into serious burst events. The first step in establishing the dataset is selecting leak scenarios of different degrees and dividing these data into three categories according to leak flow. Then, 100 abnormal data and 900 normal data were combined, and 80% of the data were randomly selected as training data, with the remaining data as the testing set. The data include monitoring readings from seven sensors and binary classification labels, but the unsupervised anomaly detection model was not given the leak or no-leak labels for training; the labels were used only to verify the accuracy on the testing set. The performances of ABOD and DBOD in response to different levels of leak events are presented in Table 4.

Table 4. Detection results of ABOD and DBOD for minor leak events

| Leak flow (L/s) | Actual class | ABOD PNL | ABOD PL | DBOD PNL | DBOD PL |
|---|---|---|---|---|---|
| 10–50 | ANL | 182 | 1 | 170 | 11 |
| 10–50 | AL | 2 | 15 | 10 | 9 |
| 50–100 | ANL | 182 | 1 | 170 | 13 |
| 50–100 | AL | 0 | 17 | 9 | 8 |
| 100–175 | ANL | 182 | 1 | 178 | 10 |
| 100–175 | AL | 1 | 16 | 2 | 10 |


*Note:* ANL, actual non-leak; AL, actual leak; PL, predicted leak; PNL, predicted non-leak.
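The counts in Table 4 can be turned into summary indicators with a short helper. The counts below are copied from the table; the mapping to true/false positives treats a leak as the positive class, and the function name `metrics` is hypothetical.

```python
def metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / total, "precision": precision,
            "recall": recall, "f1": f1}

# (TN, FP, FN, TP) = (ANL/PNL, ANL/PL, AL/PNL, AL/PL), copied from Table 4
table4 = {
    ("10-50", "ABOD"): (182, 1, 2, 15),
    ("10-50", "DBOD"): (170, 11, 10, 9),
    ("50-100", "ABOD"): (182, 1, 0, 17),
    ("50-100", "DBOD"): (170, 13, 9, 8),
    ("100-175", "ABOD"): (182, 1, 1, 16),
    ("100-175", "DBOD"): (178, 10, 2, 10),
}

for (flow, method), counts in table4.items():
    m = metrics(*counts)
    print(flow, method, {k: round(v, 3) for k, v in m.items()})
```

For the 10–50 L/s class, for example, this gives an accuracy of 0.985 for ABOD against 0.895 for DBOD.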

Applying the ABOD algorithm to leak detection is more accurate and effective than applying the DBOD algorithm, and it overcomes the obstacle that a supervised algorithm needs a completely labeled dataset for accurate identification. In addition, because the model can process unbalanced data, it can be used even in pipe networks with few leak events. The recognition accuracy can be improved by monitoring the results of the model in real time and continuously training the model with new data.

## CONCLUSION

This paper proposed a novel angle-based leak detection method using pressure sensors and investigated it in a medium-scale WDN. The method depends on well-founded sensor deployment, so the FCM clustering algorithm was adopted to determine the placement of pressure sensors. To evaluate leak detection, the leak scenarios were divided into two major categories according to the leak flow. The following conclusions were drawn:

For large leaks, ABOD outperforms the other common outlier detection algorithms on the comprehensive indicators, which mainly comprise TPR, FPR, F1 score, accuracy, precision, and AUC. Notably, ABOD detected all the leak scenarios correctly.

For minor leaks, ABOD was further compared with the distance-based leak detection method adopted in the literature. The leak events were randomly selected from the database, and the number was set to 200 according to the abnormal ratio of 0.2. The results demonstrate that ABOD performed better than DBOD in terms of accuracy and other evaluation indicators.

Considering these limitations, future work will use an actual network and real data to ensure the validity and feasibility of the method. The model will be extended to analyze time series in order to identify the moment of leakage and predict its extent. The ABOD model presented in this study will be further improved to achieve better performance.

## ACKNOWLEDGEMENTS

This work was supported by the National Key Research and Development Program of China (2022YFF0606905), National Natural Science Foundation of China (52070167), and Zhejiang Provincial Natural Science Foundation of China (LHY22E080003).

## DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.

## CONFLICT OF INTEREST

The authors declare there is no conflict.

## REFERENCES

*Computing and Control for the Water Industry (CCWI2015) – Sharing the Best Practice in Water Management*