## Abstract

Accurate and rapid leak localization in water distribution networks is extremely important, as it prevents further loss of water and reduces water scarcity. This paper presents a framework for identifying relevant leak event parameters such as the leak location, leakage area, and start time. First, the proposed data-driven methodology acquires pressure data at nodes in the network through hydraulic simulations in which the leak event initial conditions (leak location, area, and start time) are changed randomly. Pressure uncertainties are added to the sensor measurements to make the problem more realistic. Second, the acquired data are used to train, test, and validate a machine learning model that predicts the relevant parameters. The random forest and the histogram-based gradient boosting machine learning algorithms are investigated and compared for the leak detection problem. The proposed approach with the histogram-based gradient boosting algorithm shows high accuracy in predicting the true leak location.

## HIGHLIGHTS

A machine learning-based framework for water network leak localization is presented.

ML models are trained with data generated by the WNTR simulator.

Histogram-based gradient boosting outperforms the random forests algorithm.

HGB achieved a leak node search space reduction of 92% with added pressure uncertainty.

High correlation between pressure measurements and leak area ($R^2$ = 75%) and leak start time ($R^2$ = 97%).

## INTRODUCTION

Water distribution networks (WDNs) play a significant role in multiple aspects of today's society, such as water supply, agriculture, farming, and energy production, making them crucial for everyday life. Efficient and reliable water systems are very expensive and require a great amount of maintenance to function properly. Clean water is distributed using WDNs that need to be designed for sustainable usage while satisfying the needs of every end user. Quality and efficiency standards need to be very high to minimize the chance of leakage, which results in an estimated 126 billion cubic meters of water loss per year, at an annual cost of 39 billion USD (Liemberger & Wyatt 2019). Typically 20–30%, and in some cases up to 50%, of the water in a WDN is lost due to slow and steady degradation of the pipe material, as described in the work by Kang *et al.* (2018). The biggest challenge in preventing water leakages is detecting the exact location of the leak so that action can be taken in a timely manner to prevent large financial losses as well as losses of valuable resources. Current leakage detection systems are also prone to faults that can lead to false alarms, which cause additional loss of resources.

Many leak detection techniques have been used and optimized through the years. Leak detection methods can be broadly divided into active and passive approaches, following Puust *et al.* (2010). In an active approach, the leaks are detected by measuring acoustic signals, pressure signals, and vibration, or by analyzing flow data. The passive approach, on the other hand, relies on real-life monitoring and inspection of the WDN. Active approaches can be further divided into transient-based, model-based, and data-driven methods. Transient-based leak detection methods are based on measurements of transient pressure signals that occur in the network (Colombo *et al.* 2009). The model-based approach uses mathematical modeling to simulate the network state of operation (Perez *et al.* 2011). In this approach, the model must be calibrated so that it accurately reflects the actual states of the real WDN. In summary, model-based approaches involve the construction of the mathematical model, calibration of the hydraulic model, detection of the leak by comparison of the model state and the actual state of the real system, and finally leak localization.

In data-driven approaches, leaks can be detected via statistical and signal-processing analyses of the acquired data, as described in the work by Cody *et al.* (2020). The advantage of this approach is that it does not require the construction of a complex mathematical model. Rather, it relies on detecting outlier values caused by abnormal events in the WDN (Geelen *et al.* 2019). The generic framework for data-driven approaches consists of extracting relevant information from monitoring data, which can be historical or live-updated. The data are preprocessed and transformed using different strategies to remove outliers or erroneous values and so facilitate subsequent analyses. The next step is to choose a suitable leak detection strategy: a feature set classification method, a prediction classification method, an unsupervised clustering method, or a statistical method. The feature set classification method consists of training a simple classification model with a machine learning algorithm to separate leak events from normal operation data. The downside of this strategy is that a high-precision classification model requires a large training dataset that includes numerous known leakage events.

Several published papers tackled the problem of predicting the existence of a leak in the WDN such as in the work by Fan *et al.* (2021) where an autoencoder artificial neural network (AEANN) is used to determine whether a leak is present in the WDN. The AEANN has an input layer of 11 nodes, each for a node of the WDN on which data are trained, and several hidden layers with an encoding layer of three nodes ending with 11 decoding nodes on the final layer. The employed strategy consists of training the model on data from the WDN without a leak and then passing input data with a leak to the model. The model outputs a reconstruction error that is compared to a threshold value indicating the presence of the leak. The accuracy of the described model depends on the compression ratio of the training data, demand uncertainty, and leak size that need to be optimized to maximize its performance.

The paper by Hu *et al.* (2021) proposed a leakage detection model (DBSCAN-MFCN) based on density-based spatial clustering of applications with noise (DBSCAN) and a multi-scale fully convolutional network (MFCN) to detect leakage in a WDN. DBSCAN is used to divide the pipes into different areas according to the leakage characteristics of the pipes. The MFCN determines the leakage location based on the residuals of the pressure and flow measurements. The proposed method improved accuracy by 78, 28, and 72% over approaches based on the support vector machine (SVM), K-nearest neighbors (KNN), and the naive Bayes classifier (NBC), respectively.

In the work by Ravichandran *et al.* (2021), a binary classifier was used to identify the leak and no-leak cases using acoustic signals. The features have been extracted from the acoustic signals, such as power spectral density and time-series data. The data have been collected from multiple cities across North America. A Multi-Strategy Ensemble Learning (MEL) using a Gradient Boosting Tree (GBT) was used as a classification model, which provides better precision in comparison with other classification models such as KNN and ANN. More improvements have been made by combining a number of GBT classifiers in a parallel ensemble method called bagging. The proposed methodology achieved a significant reduction in false-positive events.

The paper by Zhou *et al.* (2019) presents a hybrid intelligent method for pipeline leak detection. First, signal denoising and reconstruction are performed based on local mean decomposition (LMD), and a kernel principal component analysis (KPCA) method is used for feature dimension reduction. Subsequently, a K-means algorithm is used for the clustering of various operating modes. A cascade support vector data description (Cas-SVDD) was used for pipeline leak detection. The Cas-SVDD is compared with a sequential support vector data description (S-SVDD), and the results point to a reduction of the false-positive rate.

The natural next step beyond the already well-documented problem of deciding whether a leak is present is to locate where in the WDN the leak occurred.

A random forest (RF) classifier is used in the work by Lučin *et al.* (2021) to detect the node in which the leak happened. This paper was a continuation of the work by Grbčić *et al.* (2020), where an RF classifier was used to find the injection source of a pollutant in the WDN. In the work by Lučin *et al.* (2021), the RF classifier was trained on a large number of simulation results to predict the node at which the leakage occurs. Every simulation result in the dataset was obtained using the EPANET hydraulic simulation software with a known leak location. The RF model requires a great amount of training data, and this problem was solved by farming a large dataset on a supercomputer. The RF model was trained on five different datasets, ranging in size from 100,000 to 500,000 simulations, and the reported accuracy is in the range of 86–98% in terms of true positive rate. It should be noted that the accuracy of the model greatly depends on the size of the WDN and the amount of data provided for training.

In the work by Kang *et al.* (2018), leak detection and localization were addressed with an ensemble model in which a one-dimensional convolutional neural network coupled with a support vector machine (1D-CNN-SVM) detects the leak and a graph-based local search algorithm determines the position of the leak. The proposed ensemble architecture combines the outputs of the CNN and SVM to achieve high accuracy. This particular architecture was chosen to provide a more heterogeneous ensemble, whose model diversity makes the errors less correlated than in ensembles of homogeneous classifiers. This technique also combines a probabilistic model with a non-probabilistic model. For the trained ensemble method, all three evaluation metrics scored very high (0.993, 0.982, and 0.998). The signals classified as leakage signals are then fed into the localization algorithm as input to determine the location of the leakage.

In the work by Mashhadi *et al.* (2021), an investigation of the use of machine learning methods for leak localization in WDN was performed, where localization of the leakage was based on the creation of hydraulic zones. In this technique, each zone consisted of several sensors to measure the water supply variations and the water pressure. This methodology was used to investigate the capacity of six machine learning methods used to localize the leaks in the WDN. The models that were tested are three supervised methods, namely logistic regression (LR), decision tree (DT), and RF, and two unsupervised methods, namely hierarchical classification and a combination of principal component analysis (PCA) and K-means classification methods. Finally, the performance of an ANN was tested. The presented results illustrate excellent performances by the supervised methods for detecting leaks, resulting in 100% accuracy for the RF and LR, followed by 98% accuracy for the DT. An accuracy of 100% was achieved by the ANN model, but some difficulties were noticed when training the unsupervised methods due to overlapping clusters.

The issue of sensor placement and data collection was tackled in the work by Fan & Yu (2021), where a novel ML framework named clustering-then-localization semi-supervised learning (CtL-SSL) was developed to wrap several stages of the traditional ML training process in one solution. This framework uses the topological features of WDN and its leakage characteristics for partition and sensor placement and subsequently utilizes the monitoring data for detection and localization of the leak. The framework is applied to two WDNs and achieved 95% detection and 83% localization accuracy. The WDN partition is based on the leaking behaviors of the WDN junctions. The leakage characteristics are defined based on features extracted from non-leaking data with unsupervised ML models such as PCA and AE. The K-means method is proposed for the zone partition of the WDN. The leakage localization is implemented as an ML-based classifier for detecting the partition zones.

In the context of sensor placement, the paper by Tanyimboh (2021) shows the results of a study conducted on two benchmark WDNs, using two different genetic algorithms (GAs), to investigate the impact of redundant binary codes on the performance of GAs. The study finds that different mapping schemes between the genotype and phenotype spaces can lead to varying solutions, and that mapping schemes that promote diversity in the population of candidate solutions achieve better results. The results also provide guidance for the handling of redundant binary codes.

An extensive analysis was conducted in the work by Djebedjian (2021) to rank a great number of common optimization tools for WDNs. Two performance metrics were created to evaluate algorithms that reach optimal solutions (gcost = 1) in the literature. The first metric measures global performance based on one run, considering the number of generations and function evaluations needed to reach the optimal solution. The second metric, which is more realistic, measures the average global performance over multiple runs, taking into account the reliability and deviation of results from the optimal solution. The results indicate that the FDE algorithm has a superior global performance compared to other algorithms, with a higher reliability in handling medium-sized WDNs. The FDE algorithm can reach the optimal solution faster and with fewer function evaluations than other algorithms.

The article by Shende (2019) discusses the optimization of WDN design using a meta-heuristic optimization technique. The optimization problem involves balancing hydraulic requirements and cost. A new approach called SiBANET, based on the simple benchmarking algorithm (SBA), is introduced; it uses the hydraulic solver EPANET 2.0 and is able to handle discrete constraints effectively. The results of the analysis show that SiBANET is a robust algorithm with faster convergence and lower CPU time compared to other intelligent optimization algorithms (IOAs). The study focuses on single-objective optimization of WDN pipe sizing but suggests further research on optimizing complex looped networks with multiple objectives.

In this paper, a novel machine learning-based framework for determining the leak position, leak area, and leakage start time in a WDN is presented and investigated. The methodology is based on the work done by Lučin *et al.* (2021) where high accuracy is achieved for predicting a leak event using an RF classification algorithm in a simple water distribution network. The proposed framework described in this paper is mainly focused on presenting the improvements in accuracy achieved by a histogram gradient boosting classification algorithm (HGB) in comparison to the RF model presented by Lučin *et al.* (2021). Also, multiple regression HGB and RF models are trained to estimate the leak area and leak start time parameters during a rupture event. The framework utilizes the water network tool for resilience (WNTR) hydraulic simulator with a leak model for massive data farming and HGB model for determining all the relevant leak scenario parameters. The accuracy of HGB is compared with the RF algorithm. The framework was tested on a benchmark WDN with 126 possible leak nodes. Furthermore, pressure uncertainties were incorporated into the leak scenarios in order to make the WDN flow conditions closer to a real-world situation.

## METHODS

### Data farming

In this section, the process of generating the synthetic data is defined. The network used for data farming is Network 1, shown in Figure 1. This network consists of 126 nodes and 168 pipes, including 1 source, 2 tanks, 2 pumps, and 8 valves. The pressure sensor nodes, i.e. the sensor layout, were chosen through a preliminary analysis presented in Section 3.1. All nodes have a set steady demand value, while 60% of all nodes have an unsteady demand pattern assigned to them with multipliers.

Table 1. Sensor layouts considered for Network 1.

| Sensor layout name | Nodes |
|---|---|
| A | 17, 21, 68, 79, 122 |
| B | 10, 31, 45, 83, 118 |
| C | 17, 31, 45, 83, 126 |
| D | 126, 30, 118, 102, 34 |
| E | 126, 30, 102, 118, 58 |

The simulations rely on the WNTR Python package (… *et al.* 2017), which is designed to simulate and analyze the resilience of WDNs. The package provides a flexible and easy-to-use framework that allows changes to the network structure and operations, along with simulation of disruptive incidents and recovery actions. One of the benefits of the WNTR package is the functionality to explicitly add a leak to a specific node. The package offers two different simulators, namely EpanetSimulator and WNTRSimulator. The data in this paper were obtained using the WNTRSimulator with pressure-driven flow in the network. The WNTRSimulator is built upon the same mathematical model as the EpanetSimulator but offers additional functionalities. The leak model implemented in the WNTRSimulator is defined as:

$$d^{\text{leak}} = C_d \, A \, p^{\alpha} \sqrt{\frac{2}{\rho}} \tag{1}$$

where $d^{\text{leak}}$ is the leak demand (m³/s), $C_d$ is the dimensionless discharge coefficient, which is set to 0.75 (turbulent flow in pipes) (Lambert 2001), $A$ is the area of the leakage hole (m²), $\alpha$ is the dimensionless leak exponent (set as 0.5) (Lambert 2001), $p$ is the gauge pressure (Pa), and $\rho$ is the density of the fluid (kg/m³).
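As a quick numerical check of the leak model, the sketch below evaluates the leak demand for a single pressure value. The function name `leak_demand`, the water density of 1000 kg/m³, the 1 cm² hole, and the 30 m pressure head are illustrative assumptions and are not taken from the paper; only the coefficient 0.75 and the exponent 0.5 come from the text.

```python
import math

def leak_demand(area_m2, pressure_pa, discharge_coeff=0.75, exponent=0.5, density=1000.0):
    """Leak demand in m^3/s: d = C_d * A * p**alpha * sqrt(2 / rho)."""
    if pressure_pa <= 0.0:
        # Assumption of this sketch: no outflow without positive gauge pressure.
        return 0.0
    return discharge_coeff * area_m2 * pressure_pa ** exponent * math.sqrt(2.0 / density)

# Hypothetical example: a 1 cm^2 hole under 30 m of pressure head.
pressure = 1000.0 * 9.81 * 30.0          # rho * g * h, in Pa
q = leak_demand(area_m2=1e-4, pressure_pa=pressure)
print(f"leak demand: {q * 1000:.2f} L/s")
```

With the exponent fixed at 0.5, the expression reduces to the orifice equation $C_d A \sqrt{2 g h}$, which is a convenient sanity check on the units.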

The data farming was carried out on the *BURA* supercomputer of the Center for Advanced Computing and Modelling, University of Rijeka. The leak (pipe break) occurs on a randomly selected pipe by creating a temporary node in the middle of the pipe and then adding the leak conditions to that temporary node, as opposed to the implementation in the work by Lučin *et al.* (2021), where the leak can occur only in the already defined nodes of the network. This was done in accordance with the assumption that it is more probable that the rupture event will happen somewhere near the middle of the pipe. The WNTR hydraulic simulator allows adding leaks only at water distribution system junctions; hence, the temporary node methodology is employed.

The gathered data are formatted as tabular data in which each row represents a single hydraulic simulation with a single rupture event. Each row consists of the pressure values at every sensor location for every report time. Each simulation was conducted over a time period ($T$) of 24 h with a hydraulic and report time step ($\Delta t$) of 5 min. This results in a total of 1,445 features per simulation, as shown in the following equation:

$$N_f = N_s \frac{T}{\Delta t} + 5 = 5 \cdot \frac{86{,}400}{300} + 5 = 1{,}445 \tag{2}$$

where $N_f$ represents the number of features, while $N_s$ is the number of sensors. The $T$ and $\Delta t$ values are in seconds, and five features are added as they represent the initial sensor measurements at 00:00 h.

To ensure that the model is able to handle a wide range of possible leak scenarios, the initial hydraulic conditions were selected uniformly at random. Because all pipes were included as potential leak candidates during the data farming process, this approach covers a broad range of possible leaks throughout the entire network.
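The feature count can be reproduced with a few lines of Python. This is a minimal sketch: the column naming scheme and the sensor-major ordering are hypothetical, and the five added features are interpreted here as one initial reading per sensor, which coincides with the stated total for a five-sensor layout.

```python
SIM_DURATION_S = 24 * 3600   # simulation period of 24 h, in seconds
TIME_STEP_S = 5 * 60         # hydraulic and report time step of 5 min, in seconds
SENSOR_NODES = [17, 21, 68, 79, 122]   # layout A from the sensor layout table

def feature_count(n_sensors, duration_s, step_s):
    """One pressure value per sensor per report step, plus the readings at 00:00."""
    steps = duration_s // step_s
    return n_sensors * steps + n_sensors

def column_names(sensor_nodes, duration_s, step_s):
    """Hypothetical column labels: one per sensor and report time (0 s to 86,400 s)."""
    times = range(0, duration_s + step_s, step_s)
    return [f"p_node{node}_t{t}" for node in sensor_nodes for t in times]

n_features = feature_count(len(SENSOR_NODES), SIM_DURATION_S, TIME_STEP_S)
columns = column_names(SENSOR_NODES, SIM_DURATION_S, TIME_STEP_S)
print(n_features, len(columns))
```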

### Machine learning algorithms

The prediction accuracy of two ML algorithms is compared: the RF algorithm and the histogram-based gradient boosting algorithm. These algorithms were chosen for their proven success in tackling high-dimensional dataset problems, as described in the papers by Lučin *et al.* (2021) and Ravichandran *et al.* (2021). Lučin *et al.* (2021) showed that an RF model is very successful in predicting the leak node in a simple WDN, as the tree-based methodology aids in preventing overfitting in such feature-dense datasets. Ravichandran *et al.* (2021), on the other hand, achieved good performance with a GBT model on a time-series based dataset, which suggested that a gradient boosting approach would be justifiable.

Adopting these techniques brings several advantages. The two most relevant for this paper are robustness to outliers in the dataset and the ability to handle high-dimensional data efficiently. Since a degree of noise is added during the data farming process, outliers may be present that can negatively impact the trained model's ability to generalize. However, the RF and HGB algorithms divide the feature space into smaller regions, which makes them more resistant to outliers. Additionally, these tree-based models use less memory and fewer computational resources than other methods, while still being able to quickly process and make predictions on new data points, which is important for large datasets and tasks where prediction speed is a concern. Both models can be used for classification as well as regression tasks.

#### Random forest algorithm

The RF algorithm was introduced by Breiman (2001) as an ensemble algorithm that uses multiple decision trees (DTs), which are constructed in the training process using random subsets of the data. The algorithm uses the bagging method to train each DT so that the trees are de-correlated from each other. The most important hyper-parameter for the training of the RF algorithm is the number of DTs. Training more decision trees can produce greater precision in classification or regression, although the possibility of overfitting the model must be taken into account. The final prediction of the model is obtained by a majority vote over the predictions of the trained DTs in the ensemble. The RF implementation in the Python machine learning module scikit-learn 0.21.3 was used (Pedregosa *et al.* 2011).
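The two ingredients named above, bagging and majority voting, can be illustrated in isolation. The following is a minimal sketch with made-up node labels and hypothetical helper names, not the scikit-learn implementation (which combines the trees by averaging their predicted class probabilities rather than by a hard vote).

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Bagging: draw n_samples row indices with replacement for one tree."""
    return rng.choices(range(n_samples), k=n_samples)

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_indices(10, rng)      # rows used to fit one hypothetical tree

# Hypothetical leak-node predictions of six trees for one simulation.
votes = [17, 42, 17, 99, 17, 42]
print(majority_vote(votes))
```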

#### Histogram gradient boosting

The HGB algorithm is a variation of the well-known gradient boosting (GB) algorithm (Bentéjac *et al.* 2021), which is widely used for a vast number of classification and regression tasks. Together with AdaBoost, these algorithms belong to a family of models called boosting algorithms, whose primary purpose is to convert weak learners into strong ones. Boosting methods sequentially add and train new weak learners, each one trained to correct the errors made by the previously added learner. The most commonly used weak learners are decision trees. The HGB algorithm was developed to resolve the biggest flaw of the GB algorithm, which is its long training time on big datasets. This problem is solved by discretizing, or binning, the continuous input variables into at most a few hundred unique values. The most important hyper-parameter, in this case, is the learning rate of the algorithm. Great focus was directed towards algorithm optimization with multiple rounds of hyper-parameter tuning. The HGB implementation in the Python machine learning module scikit-learn 0.21.3 was used (Pedregosa *et al.* 2011).
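The binning step can be illustrated with a small sketch that maps continuous pressure values to integer bin indices using quantile-based edges. The helper names and the pressure values are hypothetical, and the exact binning rules inside scikit-learn may differ; the point is only that split candidates are then limited to the bin edges instead of every distinct feature value.

```python
from bisect import bisect_right

def quantile_bin_edges(values, n_bins):
    """Interior bin edges placed at evenly spaced quantiles of the observed values."""
    ordered = sorted(values)
    edges = [ordered[(k * len(ordered)) // n_bins] for k in range(1, n_bins)]
    return sorted(set(edges))            # drop repeated edges caused by tied values

def to_bin(value, edges):
    """Map a continuous value to an integer bin index between 0 and len(edges)."""
    return bisect_right(edges, value)

pressures = [31.2, 30.9, 28.4, 35.1, 33.3, 29.7, 34.8, 32.0]   # made-up readings
edges = quantile_bin_edges(pressures, n_bins=4)
binned = [to_bin(p, edges) for p in pressures]
print(edges, binned)
```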

### Training methodology

The classification models were trained using two methodologies. To compute the top picks metric described in Section 2.5, a classical approach to training and testing was used: the classification model was first trained on a random 84% split of the dataset and then tested on the remaining 16% of the simulations to generate a set of predictions of the true leak node. In addition, a cross-validation approach was used to report the mean and standard deviation of the accuracy and so investigate the robustness of the model. For the classification model, a stratified K-fold cross-validation method was used to guarantee that all the node classes are well represented in every fold.

The regression models, on the other hand, were tested only using a cross-validation technique. A stratified shuffle validation split (*k* = 5) with enhanced randomness was used for the regression models to ensure that the data were evenly distributed in the process of cross-validation. To create the described folds, the functions *StratifiedKFold* and *StratifiedShuffleSplit* implemented in scikit-learn 0.21.3 were used. Such strategies are particularly important because the dataset is complex in terms of the variability of each hydraulic simulation, with several randomly chosen parameters per simulation. Each fold must be as representative of the overall dataset as possible to ensure the reliability of the results. The enhanced randomness refers to using a random number generator with a high degree of unpredictability to shuffle the data rows before splitting them into folds. This helps to further reduce bias and improve the uniformity of the folds. We believe that this approach provides a reliable method for evaluating the performance of the models, as the enhanced randomness reduces the risk of an overly optimistic estimate. It is important to note that cross-validation is a tool used to evaluate the performance of the model and does not directly influence the outcome of the analysis.
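The idea behind stratified folds can be shown with a short, dependency-free sketch in which every class is dealt round-robin into the folds. This is only an illustration of the principle with hypothetical labels; the study itself relies on the scikit-learn splitters, which additionally shuffle the rows.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, spreading every class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)   # deal the class out round-robin
    return folds

# Hypothetical leak-node labels: ten simulations at node n1, five at node n2.
labels = ["n1"] * 10 + ["n2"] * 5
folds = stratified_folds(labels, k=5)
for fold in folds:
    print(sorted(labels[i] for i in fold))
```

Each fold then holds two `n1` samples and one `n2` sample, so the class proportions of the full dataset are preserved in every fold.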

Each row in the tabular dataset is composed of 1,445 features in the input data as described in Equation (2), while the output data consist of values derived from the simulation result.

### Pressure uncertainty

The previously described algorithms are then trained on two datasets that are farmed with two different deviations from the original pressure. The final results of the obtained models are compared with the results obtained by training an HGB classifier and regressor on the same amount of data, but without any pressure noise.
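One way to perturb simulated sensor readings is sketched below. The noise model is an assumption made purely for illustration: zero-mean Gaussian noise whose standard deviation is a fraction `rel_dev` of each clean reading. The function name, the value 0.01, and the pressure values are all hypothetical and are not the deviations used in the study.

```python
import random

def add_pressure_noise(pressures, rel_dev, seed=None):
    """Return a noisy copy of one simulated sensor row.

    Assumption of this sketch: zero-mean Gaussian noise with a standard
    deviation of rel_dev times the magnitude of each clean reading.
    """
    rng = random.Random(seed)
    return [p + rng.gauss(0.0, rel_dev * abs(p)) for p in pressures]

clean = [41.8, 41.5, 40.9, 39.7]          # made-up pressure readings
noisy = add_pressure_noise(clean, rel_dev=0.01, seed=42)
print(noisy)
```

Fixing the seed keeps the perturbed dataset reproducible, which matters when models trained on clean and noisy data are compared.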

### Machine learning model accuracy metrics

To evaluate the classification models, the *accuracy_score* function in scikit-learn 0.21.3 was used. The metric is defined in the following equation as

$$\text{accuracy}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=1}^{n_{\text{samples}}} 1(\hat{y}_i = y_i) \tag{3}$$

where $\hat{y}_i$ is the ML predicted value of the $i$-th sample and $y_i$ is the true value of the same sample, while $n_{\text{samples}}$ defines the total number of samples that are being predicted by the ML model. The indicator function $1(\cdot)$ equals 1 when its argument is true and 0 otherwise.

Furthermore, due to the great number of output labels and the complexity of the problem, an additional accuracy metric for classification has been used by utilizing the scikit-learn 0.21.3 *predict_proba* function. First, each node's probability of being the true value is obtained by the function; then the nodes are ranked by probability and the top *n* nodes are extracted. If the true value is within the top *n* extracted ML predicted values, the prediction is labeled as a True Positive. The value of *n* was varied as 1, 3, 5, and 10; when *n* = 1, the metric is the same as the one defined in Equation (3).
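The top-*n* metric can be sketched in plain Python as follows. The probability rows stand in for the output of a *predict_proba* call, and the node names, probabilities, and the function name `top_n_accuracy` are made up for the example.

```python
def top_n_accuracy(probabilities, true_labels, classes, n):
    """Fraction of samples whose true class is among the n most probable classes."""
    hits = 0
    for row, truth in zip(probabilities, true_labels):
        ranked = sorted(zip(row, classes), key=lambda pair: pair[0], reverse=True)
        top = [label for _, label in ranked[:n]]
        if truth in top:
            hits += 1
    return hits / len(true_labels)

classes = ["J1", "J2", "J3", "J4"]        # hypothetical candidate leak nodes
probabilities = [
    [0.10, 0.60, 0.20, 0.10],             # true node J2 is ranked first
    [0.40, 0.30, 0.20, 0.10],             # true node J3 is ranked third
    [0.25, 0.05, 0.30, 0.40],             # true node J1 is ranked third
]
true_labels = ["J2", "J3", "J1"]

top1 = top_n_accuracy(probabilities, true_labels, classes, n=1)
top3 = top_n_accuracy(probabilities, true_labels, classes, n=3)
print(top1, top3)
```

For *n* = 1 the function reduces to the plain accuracy, since only the single most probable node is compared with the true node.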

For the regression models, the $R^2$ score metric was used. The $R^2$ score, or the coefficient of determination, is a measure of the goodness of fit of a model. It is a statistical measure of how well the regression model approximates the actual data compared with their mean, as defined in the following equation:

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \tag{4}$$

where the term $SS_{\text{res}}$ is the variation of the model from the actual data and $SS_{\text{tot}}$ is the variation of the mean from the actual data. In the context of the problem investigated within this research, the $R^2$ score determines how well the measured pressure values at all sensors at the specified time correlate with the leak area and leak start time.
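A direct implementation of the coefficient of determination makes the two sums explicit. The function name `r_squared` and the numbers are illustrative only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # model vs. data
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # mean vs. data
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]             # made-up target values
y_pred = [1.5, 2.0, 2.5, 4.0]             # made-up model predictions
print(r_squared(y_true, y_pred))
```

A perfect model gives a score of 1, while a model that always predicts the mean of the data gives 0.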

## RESULTS AND DISCUSSION

In this section, the prediction accuracy of the classification and regression models (RF and HGB) is analyzed. First, the choice of the sensor layout is assessed through comparison with several other sensor layouts taken from the previous literature (Ostfeld *et al.* 2008).

Secondly, the results of hyper-parameter tuning of both RF and HGB are shown. Furthermore, an investigation of RF and HGB algorithms trained with data without added pressure uncertainty is presented.

Finally, the impact of the added pressure uncertainty on the final prediction results is investigated. It is important to mention that the classification and regression accuracy are expressed through the metrics defined in subsection 2.5.

### Sensor layout selection

To determine what is the best sensor layout for the network, five different sensor layouts in Network 1 are considered. Sensor layouts are taken from the paper by Ostfeld *et al.* (2008) and they comprise the top five most effective sensor layouts with five sensors, as given in the mentioned paper. The network is considered calibrated and its demand pattern, pressure pattern, and valve configurations are as in Ostfeld *et al.* (2008). In Figure 4, the different sensor layouts are named A, B, C, D, and E.

To determine the best layout for the task ahead, a preliminary analysis was made with the RF algorithm on 20,000 farmed simulations without pressure uncertainty. A simple train and test methodology was used, where the model was trained on 84% and tested on 16% of the provided data. For each sensor configuration under test, a small dataset was gathered and a simple RF model without hyper-parameter tuning was trained on it. Based on the training and testing results of this simple model, one of the five sensor placements was selected for further investigation. The goal was to determine which sensor layout would yield the most useful extraction of information, i.e. the highest percentage of the top 3, 5, and 10 picks containing the true leak node.

… *et al.* (2006), and this layout was used to obtain all the training data throughout this paper and to further investigate the performance of the RF and HGB algorithms.

### HGB and RF hyper-parameter tuning

The hyper-parameter search was conducted in several rounds in order to find an optimal combination of hyper-parameters (Tables 2 and 3). The process began with several rounds of random hyper-parameter selection through multiple iterations. This was done using the *RandomizedSearchCV* class defined in the *model_selection* module, which evaluates a model with randomly sampled sets of hyper-parameters using cross-validation on a small subset of the dataset. Each training iteration uses a new set of hyper-parameters for the model. Using this technique, the search space was considerably narrowed down to allow a finer search.

After the search space had been reduced, a more detailed parameter optimization was conducted with the *GridSearchCV* class. This class takes a range of values for each hyper-parameter and performs a cross-validation procedure for every possible combination to find the best set of parameters. The best set of parameters produced by *RandomizedSearchCV* was used as the starting point: for each parameter, a range of values was selected around the value found by the random procedure, and a *GridSearchCV* procedure was conducted over these newly defined ranges. This resulted in the hyper-parameter sets used in the rest of this paper.

The search for the optimal combination was conducted once for every model on a 100,000 simulation dataset without added uncertainty. The hyper-parameter search procedure is very resource-consuming, so it was conducted in parallel on multiple cores of the BURA supercomputer. Both algorithms were primarily tuned to increase the accuracy of the true leak node prediction, i.e. the multiclass classification. This process resulted in the combination of hyper-parameters for the HGB model given in Table 2.
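The coarse-to-fine logic of the search can be sketched without any ML library. In the sketch below, `toy_score` is a hypothetical stand-in for the cross-validated accuracy that *RandomizedSearchCV* and *GridSearchCV* would compute; the parameter ranges, the location of the optimum, and the helper names are assumptions made only for illustration.

```python
import itertools
import random

def random_search(score, ranges, n_iter, seed=0):
    """Stage 1: sample integer hyper-parameters uniformly and keep the best set."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.randint(low, high) for name, (low, high) in ranges.items()}
        value = score(params)
        if value > best_score:
            best_params, best_score = params, value
    return best_params

def grid_search(score, center, half_width, step):
    """Stage 2: evaluate every combination on a small grid around the stage 1 result."""
    names = list(center)
    axes = [range(center[n] - half_width, center[n] + half_width + 1, step) for n in names]
    candidates = [dict(zip(names, combo)) for combo in itertools.product(*axes)]
    return max(candidates, key=score)

def toy_score(params):
    # Hypothetical objective with a single optimum; NOT a real cross-validation score.
    return -((params["max_iter"] - 300) ** 2 + (params["max_leaf_nodes"] - 120) ** 2)

ranges = {"max_iter": (100, 500), "max_leaf_nodes": (10, 300)}
coarse = random_search(toy_score, ranges, n_iter=50, seed=0)
fine = grid_search(toy_score, coarse, half_width=20, step=5)
print(coarse, fine)
```

Because the grid is centred on the random-search result, the refined combination can never score worse than the coarse one, which is the property that motivates the two-stage procedure.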

The same hyper-parameter search process was used to find the optimal combination for the RF model (Table 3).

### HGB and RF prediction results without pressure uncertainty

The results presented in this section were produced by RF and HGB models trained on ideal data, i.e. without pressure sensor uncertainty. The purpose of this comparison is to quantify the decline in accuracy caused by adding uncertainty to the collected data and to determine which ML algorithm performs better for the given problem in an ideal setting. Tables 4 and 5 show the results for the HGB and RF algorithms, respectively. The HGB model is superior to the RF model by almost 30 percentage points in top 1 accuracy (85.59% versus 57.17%) when trained on one million data instances. The results were obtained by evaluating the models with a stratified K-fold cross-validation procedure using five folds. The maximum number of farmed instances was one million simulations, but training was conducted starting from 100,000 instances to observe the learning curve and to anticipate the plateau in the learning of the models.
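The top 1, 3, 5, and 10 accuracies reported in the tables count a prediction as correct when the true leak node is among the *k* most probable nodes. A minimal sketch of that metric is given below; the function name, the node labels, and the probability matrix are hypothetical and serve only to illustrate the calculation.

```python
import numpy as np

def top_k_accuracy(proba, y_true, classes, k):
    """Fraction of samples whose true label is among the k most probable classes.

    proba   : (n_samples, n_classes) predicted class probabilities
    y_true  : (n_samples,) true labels
    classes : (n_classes,) label of each probability column
    """
    # Column indices of the k largest probabilities in every row.
    top_k_idx = np.argsort(proba, axis=1)[:, -k:]
    # A sample is a hit if any of its top-k labels equals the true label.
    hits = (classes[top_k_idx] == y_true[:, None]).any(axis=1)
    return hits.mean()

# Hypothetical example: 4 samples, 4 candidate leak nodes.
classes = np.array([10, 20, 30, 40])
proba = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.50, 0.30, 0.10],
    [0.30, 0.20, 0.10, 0.40],
    [0.40, 0.30, 0.20, 0.10],
])
y_true = np.array([10, 30, 40, 40])

for k in (1, 2, 4):
    print(k, top_k_accuracy(proba, y_true, classes, k))
```

With a fitted scikit-learn classifier, `proba` would come from `predict_proba` and `classes` from the `classes_` attribute, evaluated fold by fold.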

Table 2. Hyper-parameters of the HGB model

| Parameter name | Parameter value |
|---|---|
| Learning rate | 0.15 |
| L2 regularization | 17 |
| Loss | Categorical crossentropy |
| Max bins | 156 |
| Max depth | 2 |
| Max iter | 429 |
| Max leaf nodes | 184 |
| Min samples leaf | 75 |

Table 3. Hyper-parameters of the RF model

| Parameter name | Parameter value |
|---|---|
| Max depth | 444 |
| Max leaf nodes | 457 |
| Min samples leaf | 2 |
| Min samples split | 65 |
| Number of estimators | 481 |

Table 4. Leak node prediction accuracy of the HGB model without pressure uncertainty

| Inputs | Top 1 (%) | Top 3 (%) | Top 5 (%) | Top 10 (%) |
|---|---|---|---|---|
| 100k | 85.03 | 90.35 | 94.12 | 98.30 |
| 300k | 83.25 | 88.69 | 92.23 | 95.97 |
| 500k | 84.63 | 90.07 | 93.57 | 97.07 |
| 1 million | 85.59 | 91.18 | 94.59 | 97.98 |

Table 5. Leak node prediction accuracy of the RF model without pressure uncertainty

| Inputs | Top 1 (%) | Top 3 (%) | Top 5 (%) | Top 10 (%) |
|---|---|---|---|---|
| 100k | 22.58 | 21.99 | 22.56 | 22.60 |
| 300k | 35.86 | 34.50 | 34.87 | 35.33 |
| 500k | 43.57 | 42.23 | 42.76 | 43.25 |
| 1 million | 57.17 | 56.11 | 56.74 | 57.50 |

### HGB and RF prediction results with pressure uncertainty

In this subsection, a comparative analysis is conducted for the results obtained with a pressure measurement uncertainty of 5% and subsequently of 10%. Every model is evaluated incrementally using cross-validation (KFold for classification and Stratified Shuffle Split for regression models), with the total number of input data being one million. For the regression models, the mean *R*^{2} score over all folds is used to indicate performance. For the classification models, the mean accuracy score over all folds is used, as described in subsection 2.5. As expected, both the classification and regression HGB and RF models showed a decline in accuracy due to the uncertainty introduced in the pressure measurements.
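To make the notion of a 5% or 10% measurement uncertainty concrete, the sketch below perturbs simulated pressures before they are passed to a model. The noise model is an assumption made for illustration only (a multiplicative factor drawn uniformly from [1 - u, 1 + u]); the function name and the sample pressures are hypothetical, and the noise model actually used in this study is the one defined in the methodology section.

```python
import numpy as np

def add_pressure_uncertainty(pressures, rel_uncertainty, rng):
    """Scale every reading by a random factor within +/- rel_uncertainty.

    Assumed noise model: uniform multiplicative noise, independent per reading.
    """
    factors = rng.uniform(
        1.0 - rel_uncertainty, 1.0 + rel_uncertainty, size=pressures.shape
    )
    return pressures * factors

rng = np.random.default_rng(0)
# Hypothetical clean pressures: 3 simulations x 5 sensors.
clean = np.array([
    [52.1, 48.7, 50.3, 47.9, 51.2],
    [51.8, 48.1, 49.9, 47.5, 50.8],
    [52.4, 49.0, 50.6, 48.2, 51.5],
])

noisy_5 = add_pressure_uncertainty(clean, 0.05, rng)
noisy_10 = add_pressure_uncertainty(clean, 0.10, rng)

# Largest relative deviation introduced at each uncertainty level.
print(np.abs(noisy_5 / clean - 1.0).max())
print(np.abs(noisy_10 / clean - 1.0).max())
```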

#### 5% pressure uncertainty

Table 6. Leak node prediction accuracy of the HGB model with 5% pressure uncertainty

| Percentage of total inputs (%) | Top 1 (%) | Top 3 (%) | Top 5 (%) | Top 10 (%) |
|---|---|---|---|---|
| 1 | 57.68 | 65.75 | 69.37 | 75.75 |
| 3 | 66.60 | 72.35 | 75.62 | 80.39 |
| 5 | 67.82 | 73.25 | 76.61 | 81.61 |
| 10 | 69.38 | 75.06 | 78.46 | 83.63 |
| 20 | 69.52 | 75.14 | 78.48 | 83.90 |
| 30 | 70.30 | 75.85 | 79.45 | 84.50 |
| 50 | 70.87 | 76.69 | 80.39 | 85.67 |
| 60 | 70.78 | 76.66 | 80.39 | 85.72 |
| 80 | 70.63 | 76.62 | 80.37 | 85.76 |
| 100 | 70.98 | 77.06 | 80.80 | 86.05 |

Leak node prediction accuracy of the RF model with 5% pressure uncertainty

| Percentage of total inputs (%) | Top 1 (%) | Top 3 (%) | Top 5 (%) | Top 10 (%) |
|---|---|---|---|---|
| 1 | 7.81 | 15.56 | 20.12 | 30.87 |
| 3 | 11.77 | 20.72 | 27.06 | 37.72 |
| 5 | 12.75 | 22.15 | 28.16 | 40.33 |
| 10 | 13.80 | 23.82 | 30.06 | 42.51 |
| 20 | 15.02 | 24.76 | 31.13 | 43.14 |
| 30 | 15.13 | 24.54 | 30.66 | 42.79 |
| 50 | 15.54 | 24.87 | 31.20 | 43.22 |
| 60 | 15.55 | 24.76 | 30.82 | 42.86 |
| 80 | 15.60 | 25.03 | 31.17 | 42.99 |
| 100 | 15.48 | 24.88 | 30.93 | 42.72 |

Furthermore, the increase in accuracy is also apparent when rankings of the top 3, 5, and 10 nodes are considered. When 100% of the farmed data is used for model training and testing (86%/14% split), there is an 86% chance that the true leak node will be among the top 10 predicted nodes (presented in Figure 7(a) and Table 6). In other words, when the top 10 predicted nodes are considered, the search space for the true leak node is narrowed down by 92%.
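The relation between a top 10 ranking and the search space reduction can be illustrated with a short calculation. The sketch assumes that the reduction is defined as the share of candidate nodes that no longer need to be inspected, 1 - k/N; both this definition and the value N = 125 are assumptions, since the number of candidate nodes is not stated in this section and 125 is simply the value that reproduces 92% for k = 10.

```python
def search_space_reduction(k: int, n_candidates: int) -> float:
    """Share of candidate nodes excluded when only the top-k nodes are inspected."""
    if not 0 < k <= n_candidates:
        raise ValueError("k must be between 1 and the number of candidate nodes")
    return 1.0 - k / n_candidates

# Hypothetical node count (assumption, not taken from the paper).
reduction = search_space_reduction(k=10, n_candidates=125)
print(f"{reduction:.0%}")
```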

The *R*^{2} score for the leak area was evaluated for the HGB and RF models. The HGB algorithm achieved a higher *R*^{2} score than the RF algorithm; however, the discrepancy is not as apparent as for the leak localization classification problem. Furthermore, increasing the amount of farmed data in the stratified shuffle split cross-validation procedure clearly improves the *R*^{2} score of the leak area value for both algorithms. For the HGB algorithm, the *R*^{2} score approaches 75% as the input approaches 1 million data instances (100% of the total input).

The *R*^{2} score validation curves for the leak start time were obtained for both the HGB and RF models. It is apparent that there is a strong correlation between the pressure measurements at the five sensors and the leak start time, since the maximum *R*^{2} score obtained with the stratified shuffle split procedure is around 97% for both algorithms.

#### 10% pressure uncertainty

Leak node prediction accuracy of the HGB model with 10% pressure uncertainty

| Percentage of total inputs (%) | Top 1 (%) | Top 3 (%) | Top 5 (%) | Top 10 (%) |
|---|---|---|---|---|
| 1 | 60.75 | 66.62 | 70.06 | 74.93 |
| 3 | 67.37 | 72.43 | 75.58 | 80.56 |
| 5 | 68.87 | 73.98 | 77.06 | 81.98 |
| 10 | 69.89 | 74.90 | 77.90 | 82.89 |
| 20 | 69.75 | 75.21 | 78.55 | 83.46 |
| 30 | 69.82 | 75.29 | 78.76 | 84.06 |
| 50 | 70.27 | 75.97 | 79.48 | 84.54 |
| 60 | 70.52 | 76.21 | 79.71 | 84.83 |
| 80 | 70.40 | 76.32 | 79.86 | 85.01 |
| 100 | 70.34 | 76.39 | 80.10 | 85.38 |

Leak node prediction accuracy of the RF model with 10% pressure uncertainty

| Percentage of total inputs (%) | Top 1 (%) | Top 3 (%) | Top 5 (%) | Top 10 (%) |
|---|---|---|---|---|
| 1 | 8.812 | 16.68 | 21.12 | 30.68 |
| 3 | 12.02 | 20.70 | 25.97 | 37.04 |
| 5 | 12.51 | 21.98 | 27.33 | 37.70 |
| 10 | 13.44 | 22.51 | 28.69 | 39.55 |
| 20 | 14.40 | 23.59 | 29.43 | 40.65 |
| 30 | 15.09 | 24.21 | 29.71 | 40.51 |
| 50 | 15.25 | 24.23 | 29.99 | 41.03 |
| 60 | 15.26 | 24.19 | 29.91 | 40.91 |
| 80 | 15.40 | 24.23 | 29.72 | 40.77 |
| 100 | 15.32 | 24.21 | 29.83 | 40.61 |

### Proposed framework limitations

Although the proposed methodology yields good results, it has limitations. One known limitation is that data farming and training of the predictive models require a great amount of computing resources, i.e. the framework cannot be applied on a conventional machine without spending a great amount of time on farming the required data. Based on the slope of the validation curves of some parameters (Figures 11 and 12), it can reasonably be expected that adding more simulation results to the dataset would further improve model accuracy up to a point. However, there is a trade-off between the size of the dataset and the computational resources required to process it. Additionally, it is important to monitor the performance of the model on an independent test set to ensure that any improvements in accuracy on the validation set are not due to overfitting. Therefore, the decision to increase the size of the dataset should be made carefully, as the additional data may not lead to significant improvements in model accuracy or may run into computational limitations. A proposed solution is to segment the process into smaller, more manageable rounds of farming and training, where after each farming cycle a training portion of the farmed data is used to train the model. This can reduce the probability that the model is trained on redundant data and thereby help avoid overfitting. Another strategy that can be considered is to bias the random leak node selection toward a specific set of nodes to increase the generalization capacity of the model on the test set.

For larger WDNs, the total computational effort of the framework could be reduced by creating sub-zones of the network, with each sub-zone having its own trained ML model for leak localization that includes only the nodes located within that sub-zone. In this case, optimal sensor positions should be further explored for each sub-zone to reduce cost and maximize data quality. Djebedjian (2021) provides a comprehensive guide to available optimization techniques that can be used in a case study such as this one. Alternatively, WDN sub-zone sensor optimization can be conducted using a genetic algorithm (GA), as proposed by Shende (2019).

## CONCLUSION

In this research, an ML-based framework was developed to localize a leak within a water distribution network and to investigate the correlation between pressure measurements and both the leak event start time and the leak area. The framework consisted of massively parallel data farming of leak events with the hydraulic simulator WNTR, acquiring pressure values at sensors, and then training and testing an ML algorithm (HGB and RF, both tuned for multilabel classification) to solve the leak localization problem. The study suggests several key takeaways:

The presented methodology can be used to assess the quality of pressure sensor positions within a WDN. The sensor layout that yields the highest accuracy of the true leak node localization prediction can be considered optimal. Hence, this methodology and the metrics used could serve to define a combinatorial optimization problem of optimal sensor placement for leak localization.

The HGB algorithm clearly outperforms RF for the leak localization problem and is computationally efficient, which can be attributed to its technique of binning continuous input variables (pressure in this case). However, both algorithms can be used to determine the correlation between sensor measurements and the prediction accuracy of leak event parameters.

Increasing the uncertainty of the pressure measurements decreases the leak localization accuracy; however, the HGB algorithm still achieved a leak node search space reduction of 92% for both 5% and 10% uncertainty, i.e. the chance that the true leak node is among the 10 nodes predicted by the ML model is about 86% (86.05% and 85.38%, respectively).

Through an ML regression analysis, a high correlation was found between the pressure sensor measurements (with uncertainties of 5% and 10%) and both the leak area (*R*^{2} of 75%) and the leak start time (*R*^{2} of 97%).

This study could serve as a foundation for future research, which includes coupling the HGB leak localization model with simulation-calibration models that can determine the leak location with higher certainty thanks to the narrowing of the search space by the ML model. Furthermore, the methodology could be used along with clustering methods that determine specific sub-zones of the WDN based on hydraulic characteristics in order to build an ML model for each sub-zone. Each model would be specialized at recognizing anomalous events in a specific sub-zone. The specialized models could then be used as an ensemble rather than using one monolithic model to predict anomalies over the whole WDN. This methodology could also result in a more generalized prediction algorithm that is more robust to noise and outlier events. Another benefit of training smaller, more specialized models is the ability to use them as pre-trained models for pressure data from different WDNs. This could reduce the training time and farming effort when applying the methodology to other networks.

## ACKNOWLEDGEMENTS

The authors acknowledge the support of the Center of Advanced Computing and Modelling at the University of Rijeka for providing computing resources.

## DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.

## CONFLICT OF INTEREST

The authors declare there is no conflict.

## REFERENCES

*Mathematics*