## Abstract

This work proposes a reliable, low-complexity leakage detection methodology for water distribution networks (WDNs) based on machine-learning strategies. We analyze pressure measurements from pumps in district metered areas (DMAs) in Stockholm, Sweden, considering a residential DMA of the water distribution network. The proposed methodology combines unsupervised learning (K-means clustering and cluster validation techniques) with supervised learning (learning vector quantization algorithms). These strategies have low computational complexity, and our numerical experiments show the potential of machine-learning strategies for leakage detection in monitored WDNs. Specifically, the experiments show that the proposed strategies obtain correct classification rates of up to 93.98%.

## HIGHLIGHTS

Leakage detection in water distribution networks using efficient machine-learning strategies.

We analyze pressure measurements from pumps in district-metered areas in Stockholm, Sweden, where we consider a monitored subarea of the water distribution network.

Our proposal can be applied to leakage detection scenarios where we have access to water pressure measurements at different points of the WDN.

## INTRODUCTION

The usage of pipelines and pipe networks for the transport of water and other fluids has continuously evolved over the past century, and these technological enhancements have made this mode of transport more reliable (Lawal 2001). In spite of the several advantages of pressurized pipelines (see Sharma & Maheshwari 2017), pipeline and pipe networks need to operate in a secure and sustainable manner, which is challenged by frequent leaks and bursts. Early detection is one of the most suitable strategies to minimize the loss of resources.

Leakage detection solutions for water distribution networks (WDNs) have been the subject of research for more than two decades (Gupta & Kulat 2018; Zaman *et al.* 2020). Since then, as detailed by Li *et al.* (2015), several techniques have been developed, including hardware (acoustic and non-acoustic solutions) and software (numerical and non-numerical modelling solutions) methods. Moreover, the work by Chan *et al.* (2018) has reviewed current intelligent technologies focusing on non-numerical modelling solutions, such as the machine-learning strategies of support vector machines (SVMs), neural networks, and convolutional neural networks.

In the context of hardware methods, the recent survey of Moubayed *et al.* (2021) has summarized the state-of-the-art strategies for water leakage detection, including ground radar and acoustic solutions such as reflectometry and piezoelectric sensors. Furthermore, the works by Lai *et al.* (2016) and Senin *et al.* (2019) have extensively studied the ground-penetrating radar method, while the works by Papadopoulou *et al.* (2008) and Moubarak *et al.* (2011) have comprehensively studied the reflectometry and piezoelectric sensor solutions, respectively.

Within the machine-learning strategies context, the recent studies by Soldevila *et al.* (2017) and Xing & Sela (2019) have confirmed the efficiency of leakage detection modelling based on pressure analysis and machine-learning techniques. Furthermore, Vrachimis *et al.* (2022) have summarized several results obtained in the emerging framework of the Battle of the Leakage Detection and Isolation Methods (BattLeDIM). In addition, Marzola *et al.* (2022) have proposed a solution for the BattLeDIM problem based on the analysis of observed data and hydraulics simulations.

As can be verified by Belka *et al.* (2018) and Sousa *et al.* (2019), there have been abundant successful prototype-based solutions for diverse anomaly detection applications, such as condition monitoring of electrical motors. On the other hand, the performance of these algorithms is strongly dependent on the pre-specified number of prototypes (Biehl *et al.* 2016).

Moreover, Villmann *et al.* (2017) have summarized the state-of-the-art prototype-based models and have discussed that solutions formulated using prototypes can provide more understandable results than the ones formulated using SVMs and deep learning schemes. Therefore, instead of using robust non-linear solutions, we assess the state-of-the-art prototype-based models to investigate machine-learning strategies for leakage detection in WDNs.

Motivated by those mentioned successful cases, our goal is to propose a reliable leakage detection solution that demands low complexity to analyze pressure measurements acquired from WDNs in municipal areas. Specifically, our approach is a non-numerical modelling solution for detecting leakage in WDNs through the analysis of observed water pressure data using low-complexity learning strategies.

To achieve our goal, we design representative sets with a reduced number of prototypes for generating a compact and realistic dataset for fault detection/classification of a monitored water distribution network. Specifically, we first cluster the observed water pressure data into understandable subgroups; in the following, we train prototypes to represent the generated subgroups; finally, we use the trained prototypes to process operational condition predictions for newly observed water pressure data.

Within the context of prototype-based models, we propose low-complexity strategies based on both unsupervised and supervised learning. For the unsupervised part, we use the conventional K-means algorithm and cluster validation techniques. For the supervised part, we use learning vector quantization (LVQ) classifiers. Specifically, we determine the number of prototypes through a clustering and cluster validation procedure per class label, which yields an adequate number of prototypes to obtain representative subsets of the input data. Then, we fine-tune the prototypes of these generated subgroups using LVQ classifiers.

Moreover, since our solution does not require hydraulic modelling, we are agnostic to it and only water pressure measurements are of our interest. Therefore, as a software-based solution, our proposal can be applied to leakage detection scenarios where we have access to water pressure measurements at different points of the WDN. To this end, we analyze water pressure measurements from pumps in district-metered areas (DMAs) in Stockholm, Sweden. To evaluate our solution, we consider a monitored subarea of the WDN. Our numerical experiments show that the proposed learning strategies are able to obtain correct classification rates of up to 93.98%.

## MATERIALS AND METHODS

### Dataset description

The dataset used in this study was provided by the Stockholm water company (SVOA, *Stockholm Vatten och Avfall*). In this dataset, we analyze the observed water pressure data in four selected pumping stations collected from January 2018 to March 2019. These stations are located in a DMA of the WDN. The DMA corresponds to a residential area with a total population of 70,250 people. Moreover, there are no tanks or reservoirs in the monitored area. Figure 1 shows the approximate positions of the pumping stations in the DMA. Due to a privacy agreement with SVOA, we do not identify these stations or reveal the DMA network. Hence, we generically label these pumps as *A*, *H*, *K*, and *S*.

The dataset represents the pumping stations operating in normal and faulty (presence of leakage) working conditions; these conditions are distinguished through a maintenance report provided by the water company. Since leakage detection is an anomaly detection problem, the majority of observations correspond to normal conditions; this imbalance is shown in Table 1.

**Table 1** Number of samples per operating condition and year

| Condition | 2018 | 2019 | Total |
|---|---|---|---|
| Normal | 324 | 70 | 394 |
| Leakage | 34 | 20 | 54 |
| Total | 358 | 90 | 448 |


In the dataset, the hydraulic data are stored for entire days of acquisition with a 1-min sampling frequency. For the raw database, we denote by 'sample' a pressure signal stored as a vector of 1,440 components for each station and for each day. During the aforementioned period, there are 7 days with excessive missing values and, as a consequence, we remove these samples from the analysis. Therefore, there are 448 available observed days to build the prediction model, and the total number of samples is denoted by $N_S = 448$.

Let $\mathbf{x}_n^{(p)}$ denote the 1,440 pressure measurements from pump $p \in \{A, H, K, S\}$ during the *n*th day. Then, the sample during the *n*th day is denoted by $\mathbf{X}_n = [\mathbf{x}_n^{(A)}, \mathbf{x}_n^{(H)}, \mathbf{x}_n^{(K)}, \mathbf{x}_n^{(S)}]$, which has 1,440 rows and four columns.
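As an illustration, one day's sample matrix can be assembled as sketched below. This is a minimal sketch on synthetic data; the function and variable names (`build_daily_sample`, `series_by_pump`) are ours and do not come from the original study.

```python
import numpy as np

MINUTES_PER_DAY = 1440  # 1-min sampling frequency
PUMPS = ["A", "H", "K", "S"]

def build_daily_sample(series_by_pump):
    """Stack the four per-pump pressure series of one day into a
    1,440 x 4 sample matrix (one column per pumping station)."""
    cols = []
    for pump in PUMPS:
        s = np.asarray(series_by_pump[pump], dtype=float)
        if s.shape != (MINUTES_PER_DAY,):
            raise ValueError(f"pump {pump}: expected {MINUTES_PER_DAY} readings")
        cols.append(s)
    return np.column_stack(cols)  # shape (1440, 4)

# toy example: synthetic pressures for one day
rng = np.random.default_rng(0)
day = {p: 50 + rng.normal(0, 1, MINUTES_PER_DAY) for p in PUMPS}
X_n = build_daily_sample(day)
```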

#### Feature extraction

To generate suitable feature vectors that represent the proposed engineering application, we apply a canonical discriminant function on the original time series vectors to obtain linear combinations of the projected time series vectors (known as *canonical variables*). Further explanation of canonical analysis is given in Rencher (1992).

The procedure operates per pumping station: (iv) compute the within-group $\mathbf{W}$ and between-group $\mathbf{B}$ scatter matrices (see definitions in the Section ‘Cluster validation techniques’); (v) obtain the eigenvector $\mathbf{w}_1$ associated with the largest eigenvalue of the matrix $\mathbf{W}^{-1}\mathbf{B}$; (vi) obtain the projected data, which is a projection of the original data, by applying the inner product between $\mathbf{w}_1$ and the raw data matrix; (vii) repeat steps (ii) to (vi) for the remaining pump stations; (viii) finally, concatenate every projected data vector.

In summary, the treated dataset $D$ comprises 448 four-dimensional labelled feature vectors, in which the attribute values represent the canonical values obtained from the most representative canonical function.
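The canonical projection of steps (iv)–(vi) can be sketched as follows. This is a minimal NumPy illustration on synthetic two-class data; the small ridge term added to $\mathbf{W}$ for numerical stability is our assumption, not part of the paper's procedure.

```python
import numpy as np

def canonical_projection(X, y):
    """One-dimensional canonical (discriminant) projection.
    X: (n_samples, n_features) raw pressure matrix of a single pump,
    y: class label per sample (0 = normal, 1 = leakage).
    Returns the scores X @ w for the leading eigenvector w of
    W^{-1} B, with W/B the within/between-group scatter matrices."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    n_feat = X.shape[1]
    W = np.zeros((n_feat, n_feat))
    B = np.zeros((n_feat, n_feat))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        W += (Xc - mc).T @ (Xc - mc)               # within-group scatter
        diff = (mc - mean_all).reshape(-1, 1)
        B += len(Xc) * (diff @ diff.T)             # between-group scatter
    # leading eigenvector of W^{-1} B (ridge-regularized for stability)
    evals, evecs = np.linalg.eig(np.linalg.solve(W + 1e-6 * np.eye(n_feat), B))
    w = np.real(evecs[:, np.argmax(np.real(evals))])
    return X @ w

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (40, 10)), rng.normal(2, 1, (10, 10))])
y = np.array([0] * 40 + [1] * 10)
z = canonical_projection(X, y)
```

Applying this per pump and concatenating the four resulting score vectors per day yields the four-dimensional feature vectors described above.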

In addition to the effective sample representation, we also investigate the existing data imbalance through sampling tuning. For this particularly challenging task, we want to measure the impact of reducing the dominance of normal-condition samples. We hypothesize that, by adjusting the trade-off between sample quality representation (pressure signal sampling) and label equilibrium, we can further improve the recognition rates of supervised classifiers trained with the selected dataset.

For this task, we modified the sampling rate and increased the number of leakage cases by the respective gain factors {3×, 5×, 15×} (e.g. a 3-min sampling frequency for the 3× gain factor). These variants of the dataset are described in Table 2. Note that the second column (variable *p*) shows the decrease in the signal quality representation, whereas the last column shows the equilibrium rate between the number of normal samples, $N_N$, and leakage samples, $N_L$.

**Table 2** Variants of the dataset

| Dataset | *p* | $N_S$ | $N_N$ | $N_L$ | $N_L/N_N$ (%) |
|---|---|---|---|---|---|
| Original | 1,440 | 448 | 394 | 54 | 13.71 |
| 3 × Leakage | 480 | 556 | 394 | 162 | 41.12 |
| 5 × Leakage | 288 | 664 | 394 | 270 | 68.53 |
| 3 × N + 15 × L | 96 | 1,992 | 1,182 | 810 | 68.53 |

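One plausible way to realize these gain factors is interleaved subsampling: reading a 1,440-point signal at every third minute from three different starting offsets yields three 480-point samples from one day. The sketch below is our interpretation of this construction, not code from the study.

```python
import numpy as np

def interleaved_subsample(signal, factor, n_offsets):
    """Split one pressure signal into `n_offsets` interleaved
    subsampled signals of length len(signal) // factor, taking every
    `factor`-th reading starting from successive offsets. This trades
    signal resolution for extra (time-shifted) samples."""
    length = len(signal) // factor
    return [np.asarray(signal)[off::factor][:length] for off in range(n_offsets)]

day = np.arange(1440.0)  # stand-in for one day of pressure readings
# 3x gain for leakage days: keep all three offsets as separate samples
leak_variants = interleaved_subsample(day, factor=3, n_offsets=3)
# normal days keep a single offset, so their count is unchanged
norm_variants = interleaved_subsample(day, factor=3, n_offsets=1)
```

Under this reading, the 3 × Leakage row of Table 2 follows directly: 54 leakage days × 3 offsets = 162 leakage samples of length *p* = 480.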

### Prototype-based models

Prototype-based models are recognized in machine learning due to their potential to explicitly represent observations (Biehl *et al.* 2016). Prototypes are reference vectors used to represent subsets of the input data in terms of distance-based dissimilarity measures. As a consequence, it is possible to directly compare input data using prototypes. The prototypes compete to represent data regions, and their positions are updated during training, which can be unsupervised (e.g. clustering methods) or supervised (e.g. LVQ classifiers). For this reason, in the Section ‘Cluster validation techniques’ we present the relevant literature on cluster validation techniques, and in the Section ‘LVQ classifier techniques’ we introduce pertinent improvements on LVQ classifier techniques.

#### Cluster validation techniques

Techniques for cluster validation are used *a posteriori* to evaluate the results of a given clustering algorithm. However, each cluster validation index has its own set of assumptions to quantify the groups’ cohesion and separation. Hence, the final results (e.g. the most adequate number of groups to generate representative subsets) may vary across the chosen techniques. In the following, we give some necessary definitions for the clustering techniques.

We denote $K$ as the number of clusters, $K^*$ as the most suitable number of clusters according to a given cluster validation technique, $\bar{\mathbf{x}}$ as the centroid of the input data matrix $\mathbf{X}$, $N_k$ as the number of objects in cluster $C_k$, $\bar{\mathbf{x}}_k$ as the centroid of cluster $C_k$, and $\mathbf{x}_i^{(k)}$ as the $i$th feature vector, $i = 1, \ldots, N_k$, belonging to the cluster $C_k$.

- (i) The *Davies–Bouldin (DB) index* (Davies & Bouldin 1979): it is a function of the ratio of the sum of scatter within the clusters and the separation between clusters, using the clusters' centroids. Initially, we need to compute the scatter within the $k$th cluster and the separation between the $k$th and $l$th clusters, $S_k$ and $d_{kl}$, respectively, which are defined as:
  $$S_k = \frac{1}{N_k}\sum_{i=1}^{N_k}\left\|\mathbf{x}_i^{(k)} - \bar{\mathbf{x}}_k\right\| \quad \text{and} \quad d_{kl} = \left\|\bar{\mathbf{x}}_k - \bar{\mathbf{x}}_l\right\|,$$
  where $\|\cdot\|$ is the Euclidean norm. Finally, the DB index is defined as:
  $$\mathrm{DB}(K) = \frac{1}{K}\sum_{k=1}^{K}\max_{l \neq k}\frac{S_k + S_l}{d_{kl}}.$$
  The value of $K$ leading to the smallest $\mathrm{DB}(K)$ is chosen as the $K^*$, i.e., $K^* = \arg\min_K \mathrm{DB}(K)$.
- (ii) The *Dunn index* (Dunn 1973): it is a function defined as:
  $$\mathrm{Dunn}(K) = \frac{\min_{k \neq l} d(C_k, C_l)}{\max_{m} \mathrm{diam}(C_m)},$$
  where $d(C_k, C_l) = \min_{\mathbf{x} \in C_k,\, \mathbf{y} \in C_l} d(\mathbf{x}, \mathbf{y})$ and $\mathrm{diam}(C_m) = \max_{\mathbf{x}, \mathbf{y} \in C_m} d(\mathbf{x}, \mathbf{y})$, with $d(\cdot,\cdot)$ denoting a dissimilarity function between vectors. Note that while $d(C_k, C_l)$ is a measure of the separation between clusters $C_k$ and $C_l$, $\mathrm{diam}(C_m)$ is a measure of the dispersion of data within the cluster $C_m$. The value of $K$ resulting in the largest $\mathrm{Dunn}(K)$ is chosen as the $K^*$, i.e., $K^* = \arg\max_K \mathrm{Dunn}(K)$.
- (iii) The *Calinski–Harabasz (CH) index* (Calinski & Harabasz 1974): it is a function defined as:
  $$\mathrm{CH}(K) = \frac{\mathrm{trace}(\mathbf{B}_K)/(K-1)}{\mathrm{trace}(\mathbf{W}_K)/(N_S - K)},$$
  where $\mathbf{B}_K$ is the between-group scatter matrix for the data partitioned into $K$ clusters, and $\mathbf{W}_K$ is the within-group scatter matrix for the data clustered into $K$ clusters. The trace(·) denotes the trace operator. The value of $K$ resulting in the largest $\mathrm{CH}(K)$ is chosen as the $K^*$, i.e., $K^* = \arg\max_K \mathrm{CH}(K)$.
- (iv) The *Silhouettes* (*Sil*) *index* (Rousseeuw 1987): it is a function defined as:
  $$\mathrm{Sil}(K) = \frac{1}{N_S}\sum_{i=1}^{N_S}\frac{b(i) - a(i)}{\max\{a(i), b(i)\}},$$
  where $a(i)$ represents the average dissimilarity of the $i$th feature vector to all other vectors within the same cluster, and $b(i)$ denotes the lowest average dissimilarity of the $i$th feature vector to any other cluster of which it is not a member. The silhouettes can be calculated with any dissimilarity metric, such as the Euclidean or Manhattan distances. The value of $K$ producing the largest $\mathrm{Sil}(K)$ is chosen as the $K^*$, i.e., $K^* = \arg\max_K \mathrm{Sil}(K)$.
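As a concrete illustration of one of these indices, the DB index can be implemented in a few lines of NumPy. This is a sketch of the textbook definition above, with a sanity check that well-separated clusters score lower (better) than randomly shuffled labels; the synthetic data are our own.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: mean over clusters of the worst-case
    ratio (S_k + S_l) / d_kl, where S_k is the average distance of
    cluster k's points to its centroid and d_kl the distance between
    the centroids of clusters k and l."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    S = np.array([np.mean(np.linalg.norm(X[labels == k] - cents[i], axis=1))
                  for i, k in enumerate(ks)])
    K = len(ks)
    db = 0.0
    for i in range(K):
        ratios = [(S[i] + S[j]) / np.linalg.norm(cents[i] - cents[j])
                  for j in range(K) if j != i]
        db += max(ratios)
    return db / K

# two well-separated blobs: correct labels should beat shuffled labels
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
good = np.array([0] * 50 + [1] * 50)
bad = rng.permutation(good)
```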

#### LVQ classifier techniques

LVQ classifiers are a family of algorithms for statistical pattern classification introduced in the late 1980s (Kohonen 1988), which led to the proposal of several variants. The main advantages of LVQ methods are their flexibility and intuitiveness because they are established on the notion that samples belonging to distinct labels are separated among data regions.

For the LVQ classifiers presented in this section, we make the following definitions. Let us consider a set of training input–output samples $\{(\mathbf{x}_t, y_t)\}_{t=1}^{N}$, where $\mathbf{x}_t$ denotes the $t$th input sample and $y_t$ denotes its corresponding class label. Note that $y_t$ is a categorical variable, which assumes only one out of $L$ values in the finite set $\{1, \ldots, L\}$.

For the family of LVQ classifiers, we have $K > L$, i.e., the number of prototypes ($K$) is higher than the number of classes ($L$). As a consequence, different prototypes may share the same label. Given a set of labelled prototype vectors $\{(\mathbf{w}_k, c_k)\}_{k=1}^{K}$, the class assignment for a new input sample $\mathbf{x}$ is based on the decision criterion that the class of $\mathbf{x}$ must be the same as the class of $\mathbf{w}_c$, where $c = \arg\min_{k} d(\mathbf{x}, \mathbf{w}_k)$, in which $d(\cdot,\cdot)$ denotes a dissimilarity measure specific to the extension of LVQ and $c$ is the index of the nearest prototype among the $K$ ones available.

Some relevant LVQ algorithms are discussed below in chronological order. We first outline the original algorithm, LVQ1 (Kohonen 1988), which does not have a cost function to ensure convergence to an optimal solution. The following two algorithms, LVQ2.1 and LVQ3 (Kohonen 1988), present improvements to obtain higher convergence speed. Then, the generalized LVQ (GLVQ) (Sato & Yamada 1995) is the first to propose a cost function, whereas the relevance LVQ (RLVQ) (Bojer *et al.* 2001) pioneered the distance learning approach, which also learns the relevance of each feature. Finally, the generalized relevance LVQ (GRLVQ) (Hammer & Villmann 2002) and the locally generalized relevance LVQ (LGRLVQ) (Hammer *et al.* 2005) include improvements by combining distance learning with the GLVQ cost function.

In a nutshell, the LVQ variants LVQ1, LVQ2.1, LVQ3, and RLVQ are heuristic solutions, and the LVQ variants GLVQ, GRLVQ, and LGRLVQ present cost functions that guarantee the convergence. Further explanation of these LVQ variants is given in Nova & Estévez (2014).
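The nearest-prototype decision rule and the heuristic LVQ1 update can be sketched as below. This is a toy NumPy implementation on synthetic data; the constant learning rate and fixed epoch count are simplifications of ours, not prescriptions from the variants discussed above.

```python
import numpy as np

def lvq1_train(X, y, prototypes, proto_labels, lr=0.05, epochs=30):
    """LVQ1: for each sample, move the nearest prototype toward the
    sample if their labels match, and away from it otherwise."""
    W = prototypes.copy()
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            c = np.argmin(np.linalg.norm(W - X[i], axis=1))  # winner index
            sign = 1.0 if proto_labels[c] == y[i] else -1.0
            W[c] += sign * lr * (X[i] - W[c])
    return W

def lvq_predict(X, W, proto_labels):
    """Assign each sample the label of its nearest prototype."""
    idx = np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1)
    return proto_labels[idx]

# toy two-class problem with one prototype per class
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (60, 2)), rng.normal(3, 0.5, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
proto_labels = np.array([0, 1])
W = lvq1_train(X, y, np.array([[0.2, -0.1], [2.8, 3.1]]), proto_labels)
acc = np.mean(lvq_predict(X, W, proto_labels) == y)
```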

### Summary

The techniques that compose the methodology are summarized in Table 3, together with their characteristics and purposes.

**Table 3** Summary of the exploited techniques

| Technique | Characteristic | Application |
|---|---|---|
| Canonical discriminant function | Feature extraction | Reduce redundant information |
| K-means | Unsupervised learning | Clustering |
| DB index | Unsupervised learning | Cluster validation technique |
| Dunn index | Unsupervised learning | Cluster validation technique |
| LVQ1 | Supervised learning | Classification |
| LVQ2.1 | Supervised learning | Classification |
| LVQ3 | Supervised learning | Classification |
| RLVQ | Supervised learning | Classification |
| GLVQ | Supervised learning | Classification |
| GRLVQ | Supervised learning | Classification |
| LGRLVQ | Supervised learning | Classification |


## RESULTS AND DISCUSSION

In this section, we evaluate the proposed methodology to find the optimal number of prototypes and their positions for the two types of classes existing in the available dataset, whose labels are represented as *N* (normal) and *L* (leakage).

For each classifier, 100 independent runs of training and testing are carried out. For each run, the four steps of the proposed methodology are executed: (i) division of the dataset into training (80%) and validation (20%) sets; (ii) canonical discriminant analysis of the training set and projection of the validation set (see description in the Section ‘Feature extraction’); (iii) determination of $K^*$ and the prototypes' positions via application of clustering and cluster validation techniques per data class; and (iv) LVQ training and testing. At the end of each run, the accuracy rate of each classifier is determined.

Specifically, in the third step, we run the K-means algorithm 10 independent times and choose the execution that produces the lowest mean squared quantization error. We repeat this procedure with the number of prototypes ranging from 2 to 10 to obtain the $K^*$ per class according to the suggestion of each cluster validation technique defined in the Section ‘Cluster validation techniques’. Finally, we define the $K^*$ using majority voting among the suggested values.
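This model-selection step can be sketched as follows. For brevity, the sketch votes with the DB index only, whereas the procedure above majority-votes across several validation indices; the plain K-means implementation and the synthetic data are our own.

```python
import numpy as np

def kmeans(X, K, n_init=10, iters=100, seed=0):
    """Plain K-means; among `n_init` restarts, keep the run with the
    lowest mean squared quantization error (MSQE)."""
    rng = np.random.default_rng(seed)
    best_msqe, best_C, best_lab = np.inf, None, None
    for _ in range(n_init):
        C = X[rng.choice(len(X), K, replace=False)]  # init from data points
        for _ in range(iters):
            lab = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(axis=1)
            newC = np.array([X[lab == k].mean(axis=0) if np.any(lab == k)
                             else C[k] for k in range(K)])
            if np.allclose(newC, C):
                break
            C = newC
        d = np.linalg.norm(X[:, None] - C[None], axis=2)
        lab, msqe = d.argmin(axis=1), np.mean(d.min(axis=1) ** 2)
        if msqe < best_msqe:
            best_msqe, best_C, best_lab = msqe, C, lab
    return best_C, best_lab

def davies_bouldin(X, lab):
    """DB index (smaller is better); see 'Cluster validation techniques'."""
    ks = np.unique(lab)
    cents = np.array([X[lab == k].mean(axis=0) for k in ks])
    S = np.array([np.mean(np.linalg.norm(X[lab == k] - cents[i], axis=1))
                  for i, k in enumerate(ks)])
    return np.mean([max((S[i] + S[j]) / np.linalg.norm(cents[i] - cents[j])
                        for j in range(len(ks)) if j != i)
                    for i in range(len(ks))])

def suggest_k(X, k_range=range(2, 11)):
    """Score K = 2..10 with the DB index and return the best K."""
    scores = {K: davies_bouldin(X, kmeans(X, K)[1]) for K in k_range}
    return min(scores, key=scores.get)

# sanity check on three well-separated synthetic blobs
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 0.3, (40, 2)) for m in (0, 5, 10)])
```

In the full procedure, `suggest_k` would be run once per class label, and the winning prototypes of the best K-means run would seed the subsequent LVQ fine-tuning.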

The frequency distribution of the suggested $K^*$ per class resulting from the majority voting scheme along the 100 independent runs is shown in Table 4. From this table, it can be seen that we obtain the same $K^*$ setup along all 100 runs for most of the datasets. For the 3 × Leakage dataset, we obtain one setup along 96 runs, a second setup along a single run, and a third setup along three runs. Therefore, it is worth emphasizing that the following classification procedures emulate scenarios with extremely limited resources and, consequently, present low computational cost.

**Table 4** Frequency distribution of the suggested $K^*$ setups per class, as [normal, leakage] counts over the 100 runs

| Dataset |  |  |  |  |  |
|---|---|---|---|---|---|
| Original | [100,100] | [0,0] | [0,0] | [0,0] | [0,0] |
| 3 × Leakage | [96,100] | [0,0] | [1,0] | [0,0] | [3,0] |
| 5 × Leakage | [100,100] | [0,0] | [0,0] | [0,0] | [0,0] |
| 3 × N + 15 × L | [100,100] | [0,0] | [0,0] | [0,0] | [0,0] |


A major characteristic of relevance-oriented modifications of LVQ models (such as RLVQ, GRLVQ, and LGRLVQ) is that we are able to inspect the attributes' relevance weights in order to have a direct notion of which pumps have more influence on the classifiers' performance. Accordingly, the last aspect we highlight relates to the relevance vector weights obtained after the GRLVQ training (see Figure 6(b)). These empirical observations reveal that the first attribute (Pump A) is the most important one, and that the pumps' relevances tend toward equilibrium as we reduce the data imbalance.

## CONCLUSIONS

In this work, we proposed a non-numerical modelling method for water leakage detection in WDNs through the analysis of observed pressure data by means of machine-learning strategies. To evaluate our solution, we considered water pressure measurements from pumps in a residential DMA of the WDN of Stockholm, Sweden.

We proposed low-complexity machine-learning strategies for leakage detection. Specifically, our strategies used techniques from both unsupervised and supervised learning. For the numerical experiments of our proposed solution, we used a real dataset from a DMA in Stockholm, Sweden.

The numerical experiments showed the potential benefits of using machine-learning strategies in the leakage detection of monitored WDNs. Specifically, we obtained classification rates of up to 93.98% when using the locally generalized relevance LVQ (LGRLVQ) algorithm. Among the compared algorithms, the GRLVQ showed the smallest degradation in the minimum F1 score values. Moreover, the GRLVQ showed promising maximum classification accuracies (e.g. 91.73% on the original dataset) while computing the importance of each pump. Regarding the importance of the considered pumps, the GRLVQ revealed that Pump A was the most significant for the training of our machine-learning-based solution.

Therefore, since our solution does not require hydraulic modelling, we showed that leakage detection is possible with neither a model of the hydraulic system nor particular knowledge of the network architecture. Specifically, such benefits make our proposed leakage detection algorithm suitable for real-world scenarios where measurements are available but little prior knowledge about them exists.

When a higher level description of the WDN is available, it is possible to use such knowledge about the network architecture to apply clustering methods aiming to divide the DMA and reduce the search area for the localization of the predicted leakages. Therefore, our machine-learning strategies can be extended and support solutions formulated by hydraulic modelling.

An important aspect to highlight is the required amount of data to properly generate the predictive system. We acknowledge that scenarios with non-sufficient data for training could lead to significantly misleading outcomes when anomalous behaviours in DMAs are analyzed. Therefore, we analyzed the total amount of collected data (15 months) that the SVOA company has shared with us to conduct our machine-learning strategies. From the water utility side, the company may continuously collect new observations to increase the reliability of the predictive system.

For future works, we aim to investigate federated learning strategies to obtain confident data analysis for the same fault diagnosis task while preserving the privacy of the information collected on the pumps. This is of high importance when dealing with sensitive and critical information, such as water supply and pump locations. Moreover, the federated model would be capable of monitoring an entire WDN and distinguishing different DMAs. This would turn the problem into not only a classification problem but also a preliminary (region-level) localization problem.

## ACKNOWLEDGEMENTS

The authors would like to thank the partial financial support from Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) – Finance Code 001, Grant 88887.155782/2017-00, CNPq Proc. 313151/2020-2 and FUNCAP Grant PS-0186-00103.01.00/21. The authors would like to thank the Mistra InfraMaint Program for the financial support. The authors also thank Stockholm Vatten och Avfall company, Stockholm, Sweden, for providing the data used in this study.

## DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.

## CONFLICT OF INTEREST

The authors declare there is no conflict.

## REFERENCES

Davies, D. L. & Bouldin, D. W. 1979 A cluster separation measure. *IEEE Transactions on Pattern Analysis and Machine Intelligence* **PAMI-1** (2), 224–227.