Abstract
This work proposes a reliable methodology for detecting leakage in water distribution networks (WDNs) using efficient machine-learning strategies. We analyze pressure measurements from pumps in district metered areas (DMAs) in Stockholm, Sweden, where we consider a residential DMA of the water distribution network. Our proposed methodology uses learning strategies from unsupervised learning (K-means and cluster validation techniques) and supervised learning (learning vector quantization algorithms). The learning strategies we propose have low complexity, and the numerical experiments show the potential of using machine-learning strategies in leakage detection for monitored WDNs. Specifically, our experiments show that the proposed learning strategies are able to obtain correct classification rates of up to 93.98%.
HIGHLIGHTS
Leakage detection in water distribution networks using efficient machine-learning strategies.
We analyze pressure measurements from pumps in district-metered areas in Stockholm, Sweden, where we consider a monitored subarea of the water distribution network.
Our proposal can be applied to leakage detection scenarios where we have access to water pressure measurements at different points of the WDN.
INTRODUCTION
The usage of pipelines and pipe networks for the transport of water and other fluids has continuously evolved since the past century, and these technological enhancements have made this mode of transport more reliable (Lawal 2001). In spite of the several advantages of pressurized pipelines (see Sharma & Maheshwari (2017)), pipeline and pipe networks need to operate in a secure and sustainable manner, which is challenged by frequent leaks and bursts. Early detection is one of the most suitable strategies to minimize the loss of resources.
Leakage detection solutions for water distribution networks (WDNs) have been the subject of research for more than two decades (Gupta & Kulat 2018; Zaman et al. 2020). Over this period, as detailed by Li et al. (2015), several techniques have been developed, including hardware (acoustic and non-acoustic solutions) and software (numerical and non-numerical modelling solutions) methods. Moreover, the work by Chan et al. (2018) has reviewed current intelligent technologies focusing on non-numerical modelling solutions, such as the machine-learning strategies of support vector machines (SVMs), neural networks, and convolutional neural networks.
In the context of hardware methods, the recent survey by Moubayed et al. (2021) has summarized the state-of-the-art strategies for water leakage detection, including ground radar and acoustic solutions such as reflectometry and piezoelectric sensors. Furthermore, the works by Lai et al. (2016) and Senin et al. (2019) have extensively studied the ground-penetrating radar method, while the works by Papadopoulou et al. (2008) and Moubarak et al. (2011) have comprehensively studied the reflectometry and piezoelectric sensor solutions, respectively.
Within the machine-learning strategies context, the recent studies by Soldevila et al. (2017) and Xing & Sela (2019) have confirmed the efficiency of leakage detection modelling based on pressure analysis and machine-learning techniques. Furthermore, Vrachimis et al. (2022) have summarized several results obtained in the emerging framework of the Battle of the Leakage Detection and Isolation Methods (BattLeDIM). In addition, Marzola et al. (2022) have proposed a solution for the BattLeDIM problem based on the analysis of observed data and hydraulics simulations.
As shown by Belka et al. (2018) and Sousa et al. (2019), there have been abundant successful prototype-based solutions for diverse anomaly detection applications, such as condition monitoring of electrical motors. On the other hand, the performance of these algorithms is strongly dependent on the pre-specified number of prototypes (Biehl et al. 2016).
Moreover, Villmann et al. (2017) have summarized the state-of-the-art prototype-based models and have discussed that solutions formulated using prototypes can provide more understandable results than the ones formulated using SVMs and deep learning schemes. Therefore, instead of using robust non-linear solutions, we assess the state-of-the-art prototype-based models to investigate machine-learning strategies for leakage detection in WDNs.
Motivated by those mentioned successful cases, our goal is to propose a reliable leakage detection solution that demands low complexity to analyze pressure measurements acquired from WDNs in municipal areas. Specifically, our approach is a non-numerical modelling solution for detecting leakage in WDNs through the analysis of observed water pressure data using low-complexity learning strategies.
To achieve our goal, we design representative sets with a reduced number of prototypes for generating a compact and realistic dataset for fault detection/classification of a monitored water distribution network. Specifically, we first cluster the observed water pressure data into understandable subgroups; in the following, we train prototypes to represent the generated subgroups; finally, we use the trained prototypes to process operational condition predictions for newly observed water pressure data.
Within the context of the prototype-based models, we propose low-complexity strategies based on both unsupervised and supervised learning. For the unsupervised method, we use the conventional K-means algorithm and cluster validation techniques. For the supervised method, we use learning vector quantization (LVQ) classifiers. Specifically, we determine the number of prototypes through a clustering and cluster validation procedure per class label, which can determine an adequate number of prototypes to obtain representative subsets of the input data. Then, we fine-tune the prototypes of these generated subgroups using LVQ classifiers.
Moreover, since our solution does not require hydraulic modelling, we are agnostic to it; only water pressure measurements are of interest. Therefore, as a software-based solution, our proposal can be applied to leakage detection scenarios where we have access to water pressure measurements at different points of the WDN. To this end, we analyze water pressure measurements from pumps in district metered areas (DMAs) in Stockholm, Sweden. To evaluate our solution, we consider a monitored subarea of the WDN. Our numerical experiments show that the proposed learning strategies are able to obtain correct classification rates of up to 93.98%.
MATERIALS AND METHODS
Dataset description
The dataset represents the pumping stations operating in normal and faulty (presence of leakage) working conditions; these conditions are distinguished through a maintenance report provided by the water company. Since leakage detection is an anomaly detection problem, the majority of observations correspond to normal conditions, and this imbalance is shown in Table 1.
Number of observed days per working condition
| Condition | 2018 | 2019 | Total |
|---|---|---|---|
| Normal | 324 | 70 | 394 |
| Leakage | 34 | 20 | 54 |
| Total | 358 | 90 | 448 |
In the dataset, the hydraulic data are stored for entire days of acquisition with a 1-min sampling interval. For the raw database, we denote by ‘sample’ a pressure signal stored as a vector of 1,440 components for each station and for each day. During the aforementioned period, there are 7 days with excessive missing values and, as a consequence, we remove these samples from the analysis. Therefore, there are 448 available observed days to build the prediction model, and the total number of samples is denoted by $N_S$.
Let $\mathbf{p}_m^{(n)} \in \mathbb{R}^{1440}$ denote the 1,440 pressure measurements from pump $m$, $m = 1, \ldots, 4$, during the $n$th day. Then, the sample during the $n$th day is denoted by $\mathbf{P}^{(n)} = [\mathbf{p}_1^{(n)} \; \mathbf{p}_2^{(n)} \; \mathbf{p}_3^{(n)} \; \mathbf{p}_4^{(n)}]$, which has 1,440 rows and four columns (one column per pump).
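As a minimal illustration of this data layout (a sketch with hypothetical values and names; only the shape matches the description above):

```python
import numpy as np

# Hypothetical daily series: one pressure vector of 1,440 readings
# (1-min interval over 24 h) for each of the four monitored pumps.
pump_series = [np.random.default_rng(seed).normal(50.0, 2.0, 1440)
               for seed in range(4)]

# Daily sample P^(n): 1,440 rows (minutes) by 4 columns (pumps).
P_n = np.column_stack(pump_series)
assert P_n.shape == (1440, 4)
```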
Stored daily time series for each pump and working conditions (in metres of water column).
Feature extraction
To generate suitable feature vectors that represent the proposed engineering application, we apply a canonical discriminant function to the original time series vectors to obtain linear combinations of the projected time series vectors (known as canonical variables). Further explanation of canonical analysis is given in Rencher (1992).





In summary, the treated dataset $D$ comprises 448 four-dimensional labelled feature vectors, in which the attribute values represent the canonical values obtained from the most representative canonical function.
In addition to the effective sample representation, we also investigate the existing data imbalance through sampling tuning. For this particularly challenging task, we want to measure the impact of reducing the dominance of the majority (normal condition) samples. We hypothesize that by adjusting the trade-off between sample quality representation (pressure signal sampling) and label equilibrium, we can further improve the recognition rates of supervised classifiers trained with the selected dataset.
For this task, we modified the sampling rate and increased the number of leakage cases by the respective gain factors {3×, 5×, 15×} (e.g. a 3-min sampling interval for the 3× gain factor). These variants of the dataset are described in Table 2, and a sketch of one plausible construction is given after the table. Note that the column labelled $p$ shows the decrease in the signal quality representation, whereas the last column shows the equilibrium rate between the number of normal samples, $N_N$, and leakage samples, $N_L$.
Database setups used in this study
| Setup | $p$ | $N_S$ | $N_N$ | $N_L$ | $N_L/N_N$ (%) |
|---|---|---|---|---|---|
| Original | 1,440 | 448 | 394 | 54 | 13.71 |
| 3 × Leakage | 480 | 556 | 394 | 162 | 41.12 |
| 5 × Leakage | 288 | 664 | 394 | 270 | 68.53 |
| 3 × N + 15 × L | 96 | 1,992 | 1,182 | 810 | 68.53 |
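The following is a minimal sketch of one plausible way to build these variants, assuming that a gain factor $g$ corresponds to keeping every $g$th reading so that the $g$ phase-shifted sub-signals of a daily record become $g$ separate samples (this interpretation matches the counts in Table 2, but the exact resampling procedure is described above only at this level of detail; all names are ours):

```python
import numpy as np

def make_variant(day_signals, gain):
    """Decimate each daily 1,440-point signal by `gain`, keeping all
    `gain` phase-shifted sub-signals as separate samples.

    day_signals: array of shape (n_days, 1440, n_pumps).
    Returns an array of shape (n_days * gain, 1440 // gain, n_pumps).
    """
    phases = [day_signals[:, phase::gain, :] for phase in range(gain)]
    return np.concatenate(phases, axis=0)

# Example: 54 leakage days become 3 * 54 = 162 samples of 480 points.
rng = np.random.default_rng(0)
leakage_days = rng.normal(50.0, 2.0, size=(54, 1440, 4))
leakage_3x = make_variant(leakage_days, gain=3)
assert leakage_3x.shape == (162, 480, 4)
```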
Prototype-based models
Prototype-based models are recognized in machine learning due to their potential to explicitly represent observations (Biehl et al. 2016). Prototypes are reference vectors used to represent subsets of the input data in terms of dissimilarity measures. As a consequence, it is possible to directly compare input data with prototypes. The prototypes compete to represent data regions, and their positions are updated during training, which can be unsupervised (e.g. clustering methods) or supervised (e.g. LVQ classifiers). For this reason, in the Section ‘Cluster validation techniques’ we present the relevant literature on cluster validation techniques, and in the Section ‘LVQ classifier techniques’ we introduce pertinent improvements on LVQ classifier techniques.
Cluster validation techniques
Techniques for cluster validation are used a posteriori to evaluate the results of a given clustering algorithm. However, each cluster validation index has its own set of assumptions to quantify the groups’ cohesion and separation. Hence, the final results (e.g. the most adequate number of groups to generate representative subsets) may vary across the chosen techniques. In the following, we give some necessary definitions for the clustering techniques.
We denote $K$ as the number of clusters, $K^*$ as the most suitable number of clusters according to a given cluster validation technique, $\bar{\mathbf{x}}$ as the centroid of the input data matrix $\mathbf{X}$, $N_k$ as the number of objects in cluster $C_k$, $\mathbf{c}_k$ as the centroid of cluster $C_k$, and $\mathbf{x}_i^{(k)}$ as the $i$th feature vector, $i = 1, \ldots, N_k$, belonging to the cluster $C_k$.
- (i) The Davies–Bouldin (DB) index (Davies & Bouldin 1979): it is a function of the ratio of the sum of scatter within the clusters and the separation between clusters, using the clusters’ centroids. Initially, we need to compute the scatter within the $k$th cluster and the separation between the $k$th and $l$th clusters, $S_k$ and $M_{k,l}$, respectively, which are defined as:
$$S_k = \frac{1}{N_k} \sum_{i=1}^{N_k} \left\| \mathbf{x}_i^{(k)} - \mathbf{c}_k \right\|, \qquad M_{k,l} = \left\| \mathbf{c}_k - \mathbf{c}_l \right\|,$$
where $\|\cdot\|$ is the Euclidean norm. Finally, the DB index is defined as:
$$\mathrm{DB}(K) = \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k} \frac{S_k + S_l}{M_{k,l}}.$$
The value of $K$ leading to the smallest $\mathrm{DB}(K)$ is chosen as the $K^*$, i.e., $K^* = \arg\min_K \mathrm{DB}(K)$.
- (ii) The Dunn index (Dunn 1973): it is a function defined as:
$$\mathrm{Dunn}(K) = \min_{1 \le k < l \le K} \left\{ \frac{d(C_k, C_l)}{\max_{1 \le m \le K} \mathrm{diam}(C_m)} \right\},$$
where $d(C_k, C_l) = \min_{\mathbf{x} \in C_k,\, \mathbf{y} \in C_l} d(\mathbf{x}, \mathbf{y})$ and $\mathrm{diam}(C_m) = \max_{\mathbf{x}, \mathbf{y} \in C_m} d(\mathbf{x}, \mathbf{y})$, with $d(\cdot,\cdot)$ denoting a dissimilarity function between vectors. Note that while $d(C_k, C_l)$ is a measure of the separation between clusters $C_k$ and $C_l$, $\mathrm{diam}(C_m)$ is a measure of the dispersion of data within the cluster $C_m$. The value of $K$ resulting in the largest $\mathrm{Dunn}(K)$ is chosen as the $K^*$, i.e., $K^* = \arg\max_K \mathrm{Dunn}(K)$.
- (iii) The Calinski–Harabasz (CH) index (Calinski & Harabasz 1974): it is a function defined as:
$$\mathrm{CH}(K) = \frac{\mathrm{trace}(\mathbf{B}_K)/(K-1)}{\mathrm{trace}(\mathbf{W}_K)/(N-K)},$$
where $\mathbf{B}_K$ is the between-group scatter matrix for the data partitioned into $K$ clusters, $\mathbf{W}_K$ is the within-group scatter matrix for data clustered into $K$ clusters, and $N$ is the total number of feature vectors. The $\mathrm{trace}(\cdot)$ denotes the trace operator. The value of $K$ resulting in the largest $\mathrm{CH}(K)$ is chosen as the $K^*$, i.e., $K^* = \arg\max_K \mathrm{CH}(K)$.
- (iv) The Silhouettes (Sil) index (Rousseeuw 1987): it is a function defined as:
$$\mathrm{Sil}(K) = \frac{1}{N} \sum_{i=1}^{N} \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},$$
where $a(i)$ represents the average dissimilarity of the $i$th feature vector to all other vectors within the same cluster, and $b(i)$ denotes the lowest average dissimilarity of the $i$th feature vector to any other cluster of which it is not a member. The silhouettes can be calculated with any dissimilarity metric, such as the Euclidean or Manhattan distances. The value of $K$ producing the largest $\mathrm{Sil}(K)$ is chosen as the $K^*$, i.e., $K^* = \arg\max_K \mathrm{Sil}(K)$.
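For reference, the following sketch computes three of these indices with scikit-learn's built-in scores and implements the Dunn index directly, since scikit-learn does not provide it (function and variable names are ours; this is an illustration, not the authors' code):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (davies_bouldin_score,
                             calinski_harabasz_score,
                             silhouette_score)

def dunn_index(X, labels):
    """Dunn index: min inter-cluster separation / max cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Largest within-cluster dissimilarity (diameter) over all clusters.
    max_diam = max(
        np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1).max()
        for c in clusters)
    # Smallest pairwise distance between points of different clusters.
    min_sep = min(
        np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).min()
        for i, a in enumerate(clusters) for b in clusters[i + 1:])
    return min_sep / max_diam

X = np.random.default_rng(1).normal(size=(200, 4))  # toy feature vectors
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(davies_bouldin_score(X, labels),     # smaller is better
      dunn_index(X, labels),               # larger is better
      calinski_harabasz_score(X, labels),  # larger is better
      silhouette_score(X, labels))         # larger is better
```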
LVQ classifier techniques
LVQ classifiers are a family of algorithms for statistical pattern classification introduced in the late 1980s (Kohonen 1988), which led to the proposal of several variants. The main advantages of LVQ methods are their flexibility and intuitiveness, because they are built on the notion that samples belonging to distinct labels are separated across data regions.
For the LVQ classifiers presented in this section, we make the following definitions. Let us consider a set of training input–output samples $\{(\mathbf{x}_t, y_t)\}_{t=1}^{N}$, where $\mathbf{x}_t \in \mathbb{R}^p$ denotes the $t$th input sample and $y_t$ denotes its corresponding class label. Note that $y_t$ is a categorical variable, which assumes only one out of $L$ values in the finite set $\{1, 2, \ldots, L\}$.
For the family of LVQ classifiers, we have $K > L$, i.e., the number of prototypes ($K$) is higher than the number of classes ($L$). As a consequence, different prototypes may share the same label. Given a set of labelled prototype vectors $\{(\mathbf{w}_k, y_k)\}_{k=1}^{K}$, the class assignment for a new input sample $\mathbf{x}$ is based on the decision criterion that the class of $\mathbf{x}$ must be the same as the class of $\mathbf{w}_c$, where
$$c = \arg\min_{k \in \{1, \ldots, K\}} d(\mathbf{x}, \mathbf{w}_k),$$
in which $d(\cdot,\cdot)$ denotes a dissimilarity measure specific to the extension of LVQ and $c$ is the index of the nearest prototype among the $K$ ones available.
Some relevant LVQ algorithms are discussed below in chronological order. We first outline the original algorithm, LVQ1 (Kohonen 1988), which does not have a cost function to ensure convergence to an optimal solution. The following two algorithms, LVQ2.1 and LVQ3 (Kohonen 1988), present improvements to obtain higher convergence speed. Then, the generalized LVQ (GLVQ) (Sato & Yamada 1995) is the first to propose a cost function, whereas the relevance LVQ (RLVQ) (Bojer et al. 2001) pioneered the distance learning approach, which also learns the relevance of each feature. Finally, the generalized relevance LVQ (GRLVQ) (Hammer & Villmann 2002) and the locally generalized relevance LVQ (LGRLVQ) (Hammer et al. 2005) include improvements by combining distance learning with the GLVQ cost function.
In a nutshell, the variants LVQ1, LVQ2.1, LVQ3, and RLVQ are heuristic solutions, whereas GLVQ, GRLVQ, and LGRLVQ present cost functions that guarantee convergence. Further explanation of these LVQ variants is given in Nova & Estévez (2014). A minimal sketch of LVQ1 is given below.
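To make the basic mechanics concrete, the following sketches the original LVQ1 update rule under the nearest-prototype decision criterion defined above (winner-take-all: the nearest prototype is attracted to a correctly labelled sample and repelled otherwise). All function names and the learning-rate schedule are ours and merely illustrative:

```python
import numpy as np

def lvq1_fit(X, y, prototypes, proto_labels, lr=0.05, epochs=50):
    """Train LVQ1 prototypes. X: (N, p) samples, y: (N,) labels,
    prototypes: (K, p) initial positions (e.g. from per-class K-means),
    proto_labels: (K,) class label of each prototype."""
    W = prototypes.copy()
    rng = np.random.default_rng(0)
    for epoch in range(epochs):
        alpha = lr * (1.0 - epoch / epochs)  # linearly decaying rate
        for t in rng.permutation(len(X)):
            c = np.argmin(np.linalg.norm(W - X[t], axis=1))  # winner
            sign = 1.0 if proto_labels[c] == y[t] else -1.0
            W[c] += sign * alpha * (X[t] - W[c])  # attract or repel
    return W

def lvq_predict(X, W, proto_labels):
    """Nearest-prototype decision rule shared by all LVQ variants."""
    proto_labels = np.asarray(proto_labels)
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=-1)
    return proto_labels[np.argmin(d, axis=1)]
```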
Summary
The techniques exploited to compose the methodology, along with their characteristics and purposes, are summarized in Table 3.
Summary of the exploited techniques
| Technique | Characteristic | Application |
|---|---|---|
| Canonical discriminant function | Feature extraction | Reduce redundant information |
| K-means | Unsupervised learning | Clustering |
| DB index | Unsupervised learning | Cluster validation technique |
| Dunn index | Unsupervised learning | Cluster validation technique |
| LVQ1 | Supervised learning | Classification |
| LVQ2.1 | Supervised learning | Classification |
| LVQ3 | Supervised learning | Classification |
| RLVQ | Supervised learning | Classification |
| GLVQ | Supervised learning | Classification |
| GRLVQ | Supervised learning | Classification |
| LGRLVQ | Supervised learning | Classification |
RESULTS AND DISCUSSION
In this section, we evaluate the proposed methodology to find the optimal number of prototypes and their positions for the two types of classes existing in the available dataset, whose labels are represented as N (normal) and L (leakage).
Label separation obtained when all the samples of the original dataset are used to calculate the canonical projection matrix. Note that the blue and orange colours represent the normal and leakage samples, respectively. Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/ws.2023.054.
For each classifier, 100 independent runs of training and testing are carried out. For each run, the four steps of the proposed methodology are executed: (i) the division of the dataset into training (80%) and validation (20%) sets; (ii) canonical discriminant analysis of the training set and projection of the validation set (see description in the Section ‘Feature extraction’); (iii) determination of the $K^*$ and the prototypes’ positions via application of clustering and cluster validity techniques per data class; and (iv) LVQ training and testing. At the end of each run, the accuracy rate of each classifier is determined.
Specifically, in the third step we run the K-means algorithm 10 independent times and choose the execution that produces the lowest value of mean squared quantization error. We repeat this procedure with the quantity of prototypes ranging from 2 to 10 to obtain the $K^*$ per class according to the suggestion of each cluster validation technique defined in the Section ‘Cluster validation techniques’. Finally, we define the $K^*$ using majority voting among the suggested values.
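A compact sketch of this selection step is given below (scikit-learn's KMeans with n_init=10 already keeps the restart with the lowest quantization error; the Dunn index, sketched earlier, would be added to the vote the same way; all names are ours):

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.metrics import (davies_bouldin_score,
                             calinski_harabasz_score,
                             silhouette_score)

def best_k_per_class(X_class, k_range=range(2, 11)):
    """Suggest K* for one class label by majority vote over indices."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(X_class)
        scores[k] = {
            'DB': -davies_bouldin_score(X_class, labels),    # negated: max
            'CH': calinski_harabasz_score(X_class, labels),  # maximize
            'Sil': silhouette_score(X_class, labels),        # maximize
        }
    votes = [max(k_range, key=lambda k: scores[k][idx])
             for idx in ('DB', 'CH', 'Sil')]
    return Counter(votes).most_common(1)[0][0]  # majority vote

# Example on toy 4-D feature vectors (data ours):
X_normal = np.random.default_rng(2).normal(size=(300, 4))
print(best_k_per_class(X_normal))
```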


Histograms of per label according to each cluster validity technique. (a) Original: N class; (b) 3 × Leakage: N class; (c) 5 × Leakage: N class; (d) 3 × N + 15 × L: N class; (e) Original: L class; (f) 3 × Leakage: L class; (g) 5 × Leakage: L class; and (h) 3 × N + 15 × L: L class.
Histograms of per label according to each cluster validity technique. (a) Original: N class; (b) 3 × Leakage: N class; (c) 5 × Leakage: N class; (d) 3 × N + 15 × L: N class; (e) Original: L class; (f) 3 × Leakage: L class; (g) 5 × Leakage: L class; and (h) 3 × N + 15 × L: L class.
The frequency distribution of the suggested $K^*$ per class resulting from the majority voting scheme along the 100 independent turns is shown in Table 4. From this table, it can be seen that a single $K^*$ configuration is selected along all 100 turns for most of the datasets. For the 3 × Leakage dataset, the most frequent configuration is selected along 96 turns for the N class, while two alternative configurations are selected along one and three turns, respectively; for the L class, a single configuration is selected along all 100 turns. Therefore, it is worth emphasizing that the following classification procedures emulate scenarios with extremely limited resources and, consequently, present low computational cost.
Distribution of the suggested optimal number of prototypes per class
| Dataset | Classes | Config. 1 | Config. 2 | Config. 3 | Config. 4 | Config. 5 |
|---|---|---|---|---|---|---|
| Original | [N, L] | [100, 100] | [0, 0] | [0, 0] | [0, 0] | [0, 0] |
| 3 × Leakage | [N, L] | [96, 100] | [0, 0] | [1, 0] | [0, 0] | [3, 0] |
| 5 × Leakage | [N, L] | [100, 100] | [0, 0] | [0, 0] | [0, 0] | [0, 0] |
| 3 × N + 15 × L | [N, L] | [100, 100] | [0, 0] | [0, 0] | [0, 0] | [0, 0] |

Each cell gives the number of turns (out of 100) in which a given $K^*$ configuration was suggested for the [N, L] classes.
Classification accuracies obtained by the LVQ algorithms: (a) maximum rates, (b) minimum rates, (c) mean rates, and (d) standard deviation rates.
Boxplots of the different LVQ-based classifier results: (a) F1 score (%) rates obtained by the evaluated LVQ classifiers and (b) GRLVQ relevance vector weights.
A major characteristic of relevance-oriented modifications of LVQ models (such as RLVQ, GRLVQ, and LGRLVQ) is that we are able to inspect the attributes’ relevance weights in order to have a direct notion of which pumps have more influence on the classifiers’ performance. Accordingly, the last aspect we highlight relates to the relevance vector weights obtained after the GRLVQ training (see Figure 6(b)). These empirical observations reveal that the first attribute (Pump A) is the most important one, and that the pumps’ relevances tend towards equilibrium as the data imbalance is reduced.
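For context, the relevance-based variants replace the plain Euclidean distance with a weighted form along these lines (standard formulation from the GRLVQ literature; the notation is ours):

```latex
% Relevance-weighted squared distance used by RLVQ/GRLVQ-type models.
% Each input dimension i (here, one attribute per pump) carries a
% learned relevance weight lambda_i.
\[
  d_{\boldsymbol{\lambda}}(\mathbf{x}, \mathbf{w})
  \;=\; \sum_{i=1}^{p} \lambda_i \,(x_i - w_i)^2 ,
  \qquad \lambda_i \ge 0, \quad \sum_{i=1}^{p} \lambda_i = 1 .
\]
% A large lambda_i after training flags dimension i (a pump) as
% influential for the classification decision.
```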
CONCLUSIONS
In this work, we proposed a non-numerical modelling method for water leakage detection in WDNs through the analysis of observed pressure data by means of machine-learning strategies. To evaluate our solution, we considered water pressure measurements from pumps in a residential DMA of the WDN of Stockholm, Sweden.
We proposed low-complexity machine-learning strategies for leakage detection. Specifically, our strategies combined techniques from both unsupervised and supervised learning methods, and the numerical experiments used a real dataset from the considered DMA.
The numerical experiments showed the potential benefits of using machine-learning strategies in the leakage detection of monitored WDNs. Specifically, we obtained classification rates of up to 93.98% when using the locally generalized relevance LVQ (LGRLVQ) algorithm. Among the compared algorithms, the GRLVQ had the least depreciation in the minimum values of the F1 score. Moreover, the GRLVQ showed promising maximum classification accuracies (e.g. 91.73% on the original dataset) while computing the importance of each pump. Regarding the importance of the considered pumps, the GRLVQ revealed that Pump A was the most significant for the training of our machine-learning-based solution.
Therefore, since our solution does not require hydraulic modelling, we showed the possibility of leakage detection solutions that require neither the modelling of the hydraulic system nor knowledge of particular information about the network architecture. Specifically, such benefits make our proposed leakage detection algorithm suitable for application in real-world scenarios where measurements are available, but without much prior knowledge about them.
When a higher-level description of the WDN is available, it is possible to use such knowledge about the network architecture to apply clustering methods aiming to divide the DMA and reduce the search area for the localization of the predicted leakages. Therefore, our machine-learning strategies can be extended to support solutions formulated by hydraulic modelling.
An important aspect to highlight is the amount of data required to properly generate the predictive system. We acknowledge that scenarios with insufficient training data could lead to significantly misleading outcomes when anomalous behaviours in DMAs are analyzed. Therefore, we analyzed the total amount of collected data (15 months) that the SVOA company shared with us to conduct our machine-learning strategies. From the water utility side, the company may continuously collect new observations to increase the reliability of the predictive system.
For future work, we aim to investigate federated learning strategies to obtain confident data analysis while preserving the privacy of the information collected on the pumps for the same fault diagnosis task. This is of high importance when dealing with sensitive and critical information, such as water supply and pump locations. Moreover, the federated model would be capable of monitoring an entire WDN and distinguishing different DMAs. This would turn the problem into not only a classification but also a preliminary localization (to within a region) problem.
ACKNOWLEDGEMENTS
The authors would like to thank the partial financial support from Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) – Finance Code 001, Grant 88887.155782/2017-00, CNPq Proc. 313151/2020-2 and FUNCAP Grant PS-0186-00103.01.00/21. The authors would like to thank the Mistra InfraMaint Program for the financial support. The authors also thank Stockholm Vatten och Avfall company, Stockholm, Sweden, for providing the data used in this study.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.