ABSTRACT
This article presents a model-based fusion classifier technique to estimate the approximate position of a leak in a water distribution network (WDN). This technique uses residuals obtained by comparing pressure measurements with the nominal behavior estimated from a mathematical model. These residuals are analyzed as a classification task to associate the information with an approximate leak location. The classification task is performed by three different classifiers: the K-Nearest Neighbors algorithm, the Multilayer Perceptron, and the Decision Tree classifier. The outputs of these classifiers are combined using two ensemble/fusion methods, Majority Voting and Naive Bayes fusion, to improve the accuracy of the leak position estimation. Results from a benchmark problem comprising 126 nodes, 8 valves, 2 tanks, 2 reservoirs, 3 pumps, and 168 pipes, subject to four variable demand patterns, illustrate the performance improvement of the two ensemble methods over an individual classifier. Additionally, to reflect more realistic conditions, the performance of the proposed leak localization scheme is evaluated in a scenario perturbed by noise and uncertainties in the demand estimation.
HIGHLIGHTS
A fusion classifier technique for more accurate leak localization in water distribution networks.
Model-based fusion classifier technique integrating KNN, Multilayer Perceptron, and Decision Tree algorithms for precise leak localization in water networks.
Improved leak position estimation through Majority Voting and Naive Bayes fusion methods.
Evaluated under real-world conditions with noise and demand estimation uncertainties.
Shows improvements over individual classifiers.
INTRODUCTION
Water loss management is an issue facing water suppliers around the world because of its potential safety, economic, and environmental consequences. Leaks can occur as the water infrastructure deteriorates: corrosion, material fatigue, and joint failures degrade pipes over time. In addition, high operating pressures in the water network and hydraulic transients intensify these problems by inducing dynamic stresses that accelerate crack propagation (Jara-Arriagada & Stoianov 2024). To mitigate water loss, pressure management has been widely implemented in water distribution networks (WDNs), demonstrating that regulating pressure effectively reduces leakage (Meniconi et al. 2024). Moreover, increased water consumption by end users causes greater pressure fluctuations and places additional stress on service lines, leading to a high probability of pipe damage (Meniconi et al. 2022). Besides that, soil movement, temperature fluctuations, and inadequate installation practices also contribute to leakage in water networks (Rezaei et al. 2015). Consequently, leakage rates between 30 and 50% are common in water distribution systems (World Water Assessment Programme United Nations and UN-Water 2009; Puig et al. 2016).
In this regard, leak detection and localization methods based on different approaches have been proposed to address this problem. One of these approaches is model-based methods, where nodal pressure and flow measurements, hydraulic models, and different estimation methods are used to perform online leak detection and localization tasks. These methods formulate the leak localization problem based on different design considerations and mathematical assumptions. Pudar & Liggett (1992) address the leak issue as a least-squares parameter estimation problem. Pérez et al. (2011, 2014) propose a sensitivity analysis approach, where a comparison between the pressure in a nominal state and a leak scenario is stored in a matrix and analyzed to localize the leak. However, these methods do not consider uncertainties in demand estimation and measurement disturbances, which are present in real applications. In other studies, Valizadeh et al. (2009) and Leu & Bui (2016) treat the leakage localization problem as a classification task. These approaches are considered data-driven, as they only require experimental data without the need for mathematical models. However, the performance of these methods is strongly related to the training process, and in real applications, data sets that cover all possible faults can be difficult to obtain.
Along these lines, Puig et al. (2016) and Sarrate et al. (2014) propose a mixed model-based/data-driven method, where a residual generated from the nominal state and the fault condition is analyzed using classification techniques. Unlike purely data-driven methods, this approach does not require datasets to cover all possible faults, as the model-based scheme allows the generation of potential faults. However, finding a classifier that can handle all test data or guarantee correct labeling is difficult, because demand patterns, leak size, uncertainty level, network structure, and operation all vary. Additionally, a single classifier is generally unable to manage the wide data variability. Another approach proposed by Vrachimis et al. (2021) uses a priori available information about the system to improve the accuracy of the model-based leak detection and isolation technique. This approach integrates sensor measurements within an optimization framework to effectively prelocalize a potential leak. Furthermore, in 2020, as part of the CCWI/WDSA conference, the BattLeDIM competition was held to evaluate methods for detecting and locating leaks in a simulated water network based on a real system. Participants used algorithms based on techniques such as machine learning and statistical analysis. Vrachimis et al. (2022) show that although effective solutions emerged, only 50% of the maximum score was achieved, highlighting the potential for improving these methodologies. This suggests that fusion methods for leak localization, rather than relying on a single approach, can yield more robust results and better address the complexities inherent in real-world systems.
Recent pattern classification techniques use a combination of classifiers and fuse decisions to obtain a result that outperforms each of the single classifiers (Mangai et al. 2010). The current literature presents various applications of fusion classifier methods across multiple fields, such as finance (Tsai 2014), anatomy (Heckemann et al. 2006), earth sciences (Tsai 2014), and manufacturing (Mar et al. 2011), among others. Additionally, experimental studies demonstrate that collecting and combining the outputs of multiple classifiers reduces generalization error (Rokach 2014).
To improve the general performance of the mixed model-based/data-driven approach, this paper extends it with a fusion scheme to deal with the leakage localization problem. The proposed classifier comprises a Decision Tree classifier, a Multilayer Perceptron (MLP), and the K-Nearest Neighbors (KNN) algorithm, combined through two different fusion rules: Majority Voting (MV) and the Naive Bayes (NB) combination. The two rules are compared on data from a case study to determine which is more appropriate for the application. The hydraulic model of the case study corresponds to Network 1, used to evaluate algorithm performance in the contest ‘The Battle of the Water Sensor Networks’, BWSN (Ostfeld et al. 2008). The performance of the proposed leak localization scheme is evaluated in a scenario perturbed by noise and uncertainties in the demand estimation. The training process incorporates the residual signal derived from a leak-free baseline or from a system with preexisting leaks, enabling the method to locate newly emerging leaks rather than those already present. Multiple leaks, whether simultaneous or non-concurrent, are not considered.
Leak localization methods have improved significantly in recent years, but their performance under real-world conditions still faces important challenges. Traditional model-based approaches, although theoretically solid, can sometimes be ineffective when dealing with practical issues such as changes in water demand, sensor noise, and inaccuracies in hydraulic models. On the other hand, data-driven methods require large training datasets that cover all possible leak scenarios, which are difficult to achieve. Hybrid approaches attempt to combine the strengths of both strategies, but most still rely on a single classifier, limiting their ability to handle the complexity and variability of leak patterns across different network configurations. This study proposes a new solution using a fusion of classifiers within a hybrid model-based/data-driven framework. The main contributions of this method are as follows: (i) it combines three classifiers (Decision Tree, MLP, and KNN) using both MV and NB fusion to leverage their complementary strengths; (ii) it increases robustness by generating synthetic leak scenarios, reducing dependence on large training datasets; and (iii) it achieves superior performance in benchmark tests under noisy and uncertain conditions. This fusion approach consistently outperforms single-classifier methods, especially in real networks with high variability. Its flexible design supports future extensions to multiple simultaneous leaks and dynamic network conditions, representing a solid step toward real-world implementation in water distribution systems.
The rest of the paper is organized as follows: The Preliminaries section presents the WDN modeling, the number of sensors considered, and the placement procedure used to define the instrumentation scheme for the proposed leakage localization algorithm. The Methods section describes the proposed fusion scheme for leak localization in WDS. The Results section evaluates the performance of the proposed fusion scheme through benchmark problem results. The Discussion section analyzes these findings in context. Finally, the Conclusions section summarizes the main contributions of this study.
PRELIMINARIES
This section introduces the hydraulic simulation scheme, leak model, sensor number, and placement methods considered in this study.
Governing equations
The flow rate and pressure analysis of a WDN can be performed using different model approaches, such as inertial, transient, static, and quasi-static models. The inertial and transient models are characterized by partial differential equations, while the static model analyzes steady-states through a system of algebraic equations. Static models can be extended to different periods of simulation with the superposition of static simulations in time (quasi-static models) with different boundary conditions. In general, static and quasi-static models are used for practical applications, since the daily network management (analysis, design, and operation) can be addressed with this kind of model. Furthermore, transient effects dissipate quickly, and transient models may be computationally expensive (Cabrera & Vela 2013).
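For orientation, a static model of this kind is commonly written as a mass balance at every node plus a head-loss relation for every pipe. The form below is a generic textbook sketch, not necessarily the formulation used in this study; the symbols $q_{ij}$, $d_i$, $h_i$, $R_{ij}$, and $n$ are assumptions introduced here for illustration.

```latex
% Generic steady-state network equations (illustrative notation, assumed here):
%   q_{ji} : flow from node j into node i      d_i : demand at node i
%   h_i    : hydraulic head at node i          R_{ij} : resistance of pipe (i,j)
%   n      : flow exponent (1.852 for the Hazen-Williams head-loss formula)
\begin{align}
  \sum_{j \in \mathcal{N}_i} q_{ji} &= d_i,
      & i &= 1, \dots, N, \\
  h_i - h_j &= R_{ij}\, q_{ij}\, \lvert q_{ij} \rvert^{\,n-1},
      & (i,j) &\in \mathcal{P}.
\end{align}
```

Under this reading, a quasi-static simulation solves the same algebraic system once per time step, with the demands $d_i$ and the tank levels updated between steps.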




























Sensor number and placement
The performance of a monitoring system is highly dependent on available measurements. Before implementing leak localization methods, selecting optimal sensor numbers and positions is fundamental. Although installing sensors at every network node would provide maximum monitoring performance, this approach results in economically unfeasible instrumentation costs (Sarrate et al. 2014).
The placement of M sensors among N potential nodes requires exponential time for exhaustive methods, becoming computationally infeasible for moderate-sized WDNs (Gamboa-Medina & Reis 2017). Sarrate et al. (2014) propose a two-stage approach: (1) clustering techniques group similar fault signatures (the pressure differences between leak-free and leak-affected states) to eliminate correlated sensor locations, reducing redundancy and setting the maximum sensor count equal to the number of clusters; (2) the placement is then formulated as an optimization problem to maximize monitoring performance.
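To give a feel for why exhaustive search becomes impractical, the short sketch below counts the distinct placements of M sensors among N candidate nodes. It assumes that every one of the 126 nodes of the benchmark is a candidate location, which may not hold in practice; the function name `placement_count` is hypothetical.

```python
from math import comb


def placement_count(n_nodes: int, n_sensors: int) -> int:
    """Number of distinct ways to choose n_sensors locations among n_nodes."""
    return comb(n_nodes, n_sensors)


# Assumption: all 126 benchmark nodes are candidate sensor locations.
N_NODES = 126
for m in (1, 2, 3, 5, 10):
    print(f"M = {m:2d}: {placement_count(N_NODES, m):,} candidate placements")
```

Each candidate placement would also need its own performance evaluation over all leak scenarios, so the growth in the count translates directly into simulation cost.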



Three fuzzy clustering techniques (FCM, Miyamoto et al. 2008; ECM, Masson & Denœux 2008; KFCM, Girolami 2002) with their respective validity indices.
Projection-based placement (Casillas et al. 2015) via the sensitivity matrix and the residual matrix.











METHODS
The model-based leak localization problem aims to compare the mathematical model that describes the theoretical behavior of the WDS with pressure measurements taken by distributed sensors on the network structure. This comparison generates a residual (R) containing information that allows symptom extraction and fault isolation (Isermann 2006).
To perform the leak localization task for WDS, the residuals are obtained from the differences between the estimated pressure from the hydraulic simulation in a leak-free scenario and the pressure measurements H. Additionally, it is usually assumed in the modeling that all demands occur at the nodes; for this reason, nodes are considered information sources. To obtain a more realistic scenario, measurements are considered to be perturbed by noise, uncertainties in demand patterns, and leaks of different magnitudes, which can appear at any node.
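The residual generation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the residual is taken as the leak-free model estimate minus the measurement, the noise is zero-mean Gaussian, and all numerical values and function names are hypothetical rather than taken from the case study.

```python
import random


def pressure_residuals(h_model, h_measured):
    """Residual per sensor: leak-free model estimate minus measurement."""
    if len(h_model) != len(h_measured):
        raise ValueError("model and measurement vectors must have equal length")
    return [est - meas for est, meas in zip(h_model, h_measured)]


def add_noise(values, sigma, rng):
    """Perturb each value with zero-mean Gaussian noise (assumed noise model)."""
    return [v + rng.gauss(0.0, sigma) for v in values]


rng = random.Random(0)            # fixed seed so the example is repeatable
h_model = [52.0, 47.5, 40.2]      # hypothetical leak-free heads at 3 sensors [m]
h_true = [51.6, 47.4, 39.1]       # hypothetical heads with a leak present [m]
h_measured = add_noise(h_true, sigma=0.05, rng=rng)

residual = pressure_residuals(h_model, h_measured)
print([round(r, 3) for r in residual])
```

In this toy case the third sensor shows the largest residual, which is the kind of pattern a classifier would learn to associate with leaks near that sensor.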





The main objective of the classification task is to efficiently identify which of the network nodes can be associated with the fault symptom. In a WDS, each node represents an information source, and each node has different temporal demand patterns. In general, training classifiers using a vast amount of data can be impractical and inefficient. Recent trends in classifier design propose using fusion techniques to improve performance. In fusion techniques, a set of classifiers is combined to provide a better and less biased output than a single classifier. Moreover, a fusion scheme can be efficient, as each classifier’s training process can be divided into smaller subsets of the data and later combined (Mangai et al. 2010).
The possible ways of combining classifiers depend on the output type of the individual classifiers (Kuncheva 2014). In the leak localization problem, the classifier output is a crisp label representing a network node. To evaluate the performance of fusion schemes in the leak localization problem, two fusion techniques are presented: MV and NB combination.
In MV, decisions are determined by an output matching count from the classifier pool. A class is selected when: (i) all classifiers agree on a specific output (unanimous decision), (ii) an output receives more than half the votes (simple majority), or (iii) the output receives the highest number of votes, regardless of whether it exceeds half of the total decisions (plurality vote) (Mangai et al. 2010). On the other hand, the NB combination is a decision rule based on the assumption of mutually independent classifiers and conditional probabilities (Kuncheva 2014).
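The two fusion rules can be illustrated with a small sketch. It assumes crisp integer labels, a plurality vote with ties broken in favor of the first classifier in the pool, and an NB combination that estimates the conditional probabilities from each classifier's validation confusion matrix with additive smoothing. The tie-breaking rule, the smoothing constant `alpha`, and all names and numbers are assumptions made for illustration.

```python
from collections import Counter
from math import log


def majority_vote(labels):
    """Plurality vote over crisp labels; ties go to the label seen first."""
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:  # first-seen label among the tied winners
        if counts[label] == best:
            return label


def naive_bayes_fusion(labels, confusions, priors, alpha=1.0):
    """Naive Bayes combination of crisp labels.

    confusions[i][k][s]: number of validation samples of true class k that
    classifier i labelled as s. priors[k]: prior probability of class k (> 0).
    alpha: additive smoothing (an assumption) so unseen pairs keep nonzero mass.
    """
    n_classes = len(priors)
    scores = []
    for k in range(n_classes):
        score = log(priors[k])
        for cm, s in zip(confusions, labels):
            row_total = sum(cm[k])
            score += log((cm[k][s] + alpha) / (row_total + alpha * n_classes))
        scores.append(score)
    return max(range(n_classes), key=scores.__getitem__)


# Hypothetical 3-node example: one reliable classifier and two weak ones.
reliable = [[8, 1, 1], [1, 8, 1], [1, 1, 8]]
weak = [[4, 3, 3], [3, 4, 3], [3, 3, 4]]
outputs = [0, 1, 1]  # labels returned by the three classifiers

mv_label = majority_vote(outputs)
nb_label = naive_bayes_fusion(outputs, [reliable, weak, weak], [1 / 3] * 3)
print(mv_label, nb_label)
```

In this toy case MV follows the two weak classifiers, whereas the NB combination sides with the single reliable one, because it weights each vote by how trustworthy that classifier was on validation data.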














RESULTS
To illustrate the performance of the proposed fusion classifier schemes for leak localization in WDS, this section presents different fault scenarios generated by a pressure-driven demand (PDD) simulation of a real-world case study. In addition, the performance of the proposed fusion scheme is compared with that of each individual classifier.
Hard classification of fault signatures and sensor position in Network under study.
It is important to point out that the population (sensor locations) in this optimization method can be configured by the user to exclude nodes where sensor installation may not be feasible in a real-world network due to practical constraints. This can be achieved by defining a restricted search space in the algorithm, where only eligible nodes are considered during the optimization process. Constraints can be incorporated by assigning a predefined list of feasible locations or by introducing penalty functions that discourage the selection of non-viable nodes.
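Both options can be sketched briefly. The example assumes that the optimizer maximizes a scalar score per candidate placement; the function names, the node identifiers, and the penalty weight are hypothetical and are not taken from the optimization method used in the study.

```python
def restrict_candidates(all_nodes, infeasible):
    """Option 1: keep only nodes where a sensor can actually be installed."""
    blocked = set(infeasible)
    return [n for n in all_nodes if n not in blocked]


def penalized_score(placement, base_score, infeasible, penalty=1.0):
    """Option 2: subtract a fixed penalty per sensor at an infeasible node.

    Assumes the optimizer maximizes the score, so penalties lower it.
    """
    blocked = set(infeasible)
    violations = sum(1 for n in placement if n in blocked)
    return base_score - penalty * violations


nodes = list(range(1, 11))   # hypothetical node identifiers
no_access = {3, 7}           # hypothetical nodes without site access

print(restrict_candidates(nodes, no_access))
print(penalized_score([3, 5, 7], base_score=0.9, infeasible=no_access, penalty=0.5))
```

Restricting the search space guarantees feasible placements, while the penalty variant keeps the full search space and only discourages infeasible nodes, which can be preferable when feasibility is a soft preference rather than a hard constraint.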
Classifiers training
In the fusion schemes, the tuning parameters of each algorithm must be defined: the number of nearest neighbors for the KNN algorithm; the number of hidden layers and the maximum number of iterations for the MLP classifier; and the split criterion and maximum depth for the Decision Tree classifier. Estimation uncertainties and measurement noise also affect the classification methods, so performance can be poor even when the classifiers are well tuned. To smooth the effect of the uncertainties and noise considered in the testing scenario, the confusion matrix information is accumulated over a time window, or time horizon (Ferrandez-Gamot et al. 2015). The ith column of the confusion matrix contains the probabilities of a leak being present in node i when the classifier predicts that the leak is in node j, according to the information available at time instant t. The column vectors of the confusion matrix are summed along the time window to obtain the most probable leak position. To select the tuning parameters, a test simulates different leaks under different parameter settings for each classification method. The parameter values that gave the best performance in this test are presented in Table 1.
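The time-window smoothing can be sketched as follows. This is one possible reading of the procedure, not the authors' implementation: it assumes a count matrix indexed as `counts[true][predicted]`, normalizes each predicted-node column into probabilities of the true leak node, and sums the column selected by each prediction in the window. All names and numbers are hypothetical.

```python
def column_posteriors(counts):
    """Turn counts[true][pred] into P(true | pred), one vector per predicted node."""
    n = len(counts)
    posteriors = []
    for j in range(n):
        col = [counts[i][j] for i in range(n)]
        total = sum(col)
        # A node that was never predicted carries no information: use uniform.
        posteriors.append([c / total for c in col] if total else [1.0 / n] * n)
    return posteriors


def locate_over_window(predictions, posteriors):
    """Sum P(true | pred_t) over the window; return (best node, accumulated scores)."""
    n = len(posteriors)
    acc = [0.0] * n
    for j in predictions:
        for i, p in enumerate(posteriors[j]):
            acc[i] += p
    best = max(range(n), key=acc.__getitem__)
    return best, acc


# Hypothetical validation counts for a 3-node network.
counts = [[6, 3, 1],
          [2, 7, 1],
          [1, 2, 7]]
window = [1, 0, 1, 2, 1]  # classifier outputs at five consecutive time instants

best_node, scores = locate_over_window(window, column_posteriors(counts))
print(best_node, [round(s, 3) for s in scores])
```

Isolated misclassifications inside the window (the predictions of nodes 0 and 2 here) contribute only a small share of the accumulated score, which is what makes the final decision less sensitive to noise at individual time instants.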
Tuning parameters of classification algorithms
Algorithm | Parameter | Value
---|---|---
KNN | Neighbors number | 3
KNN | Distance metric | Euclidean
MLP | Activation function | Hyperbolic tangent (tanh)
MLP | Maximum iterations | 1,500
MLP | Hidden layers | 200
Decision Tree Classifier | Criterion | Gini
Decision Tree Classifier | Maximum depth | 100
Pressure measurements at nodes 22, 75, and 115 of the case study.
DISCUSSION
The results obtained from the fusion classifier schemes for leak localization in WDS illustrate that the Voting Classifier outperforms the other classifiers in terms of accuracy, precision, and F1-score. Its superior performance, with an accuracy of 0.80, precision of 0.72, and F1-score of 0.84, suggests that combining multiple classifiers can lead to improved overall performance in leak localization tasks. Moreover, the mean distance of 7.05 shows that the Voting Classifier provides better proximity to actual leak locations.
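For readers who want to reproduce this kind of comparison, the sketch below computes accuracy, macro-averaged precision and F1-score, and a mean distance between predicted and true leak nodes. It assumes that the distance is measured in pipe hops along the network graph and that precision and F1 are macro-averaged; the study may define these metrics differently, and the toy network and labels are hypothetical.

```python
from collections import deque


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted node equals the true node."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_precision_f1(y_true, y_pred):
    """Macro-averaged precision and F1 over all labels that appear."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, f1s = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(prec)
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(precisions) / len(labels), sum(f1s) / len(labels)


def hop_distance(adjacency, source, target):
    """Breadth-first search distance in pipe hops (assumed distance measure)."""
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("target is not reachable from source")


def mean_distance(adjacency, y_true, y_pred):
    """Average hop distance between true and predicted leak nodes."""
    total = sum(hop_distance(adjacency, t, p) for t, p in zip(y_true, y_pred))
    return total / len(y_true)


# Hypothetical 4-node line network: 0 - 1 - 2 - 3
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
y_true = [0, 1, 2, 3]
y_pred = [0, 1, 3, 3]

print(accuracy(y_true, y_pred), mean_distance(adjacency, y_true, y_pred))
print(macro_precision_f1(y_true, y_pred))
```

A distance metric complements accuracy because a prediction one pipe away from the true leak is operationally far more useful than one on the other side of the network, even though both count as misclassifications.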
On the other hand, the Naive Bayes Fusion classifier shows lower performance compared to the Voting Classifier. Its mean distance of 8.53 indicates that it is less effective in terms of prediction accuracy and proximity to the true leak points. The lower performance could be due to the tuning parameters, which could be further analyzed in future studies.
The Decision Tree classifier, though not as effective as the Voting Classifier, performed well compared with the other individual classifiers. This method achieves an accuracy of 0.76, precision of 0.68, and an F1-score of 0.81, providing a good approximation of leak locations. Considering the above, a tree-based ensemble method such as Random Forest could be explored as a potential approach to improve accuracy.
The observed results highlight the importance of selecting the appropriate classifier and tuning its parameters to the specific characteristics of the WDS. Additionally, the incorporation of uncertainty and noise in the pressure measurements during the data generation phase added complexity to the problem but also helped assess the robustness of each classifier in real-world scenarios.
A fundamental aspect influencing the performance of the proposed method is the simulation time. Longer simulation times can provide more data to train the classifier, which potentially improves the accuracy of leak localization. This benefit comes at the cost of increased computational effort, which could affect the overall system performance, especially in real-time applications. In this regard, future work could explore optimizing the balance between simulation time and the quality of the leak estimation results, ensuring that the technique remains effective in both practical and time-sensitive environments.
CONCLUSIONS
This study presents a mixed model-based/data-driven fusion approach to locate leaks in WDNs, focusing on identifying new leaks. Three classifiers are used: the KNN algorithm, the MLP, and the Decision Tree classifier. Their outputs are combined through fusion techniques, specifically the Voting Classifier and Naive Bayes Fusion. The results underscore the benefits of using fusion methods over individual classifiers. The Voting Classifier emerges as the most effective, achieving 0.80 accuracy, 0.84 F1-score, and an average distance of 7.05 from the actual leak location. Although Naive Bayes Fusion yields slightly lower metrics, the results confirm that model fusion enhances both accuracy and robustness in leak localization. For future study, two main directions are proposed: first, refining the design of the Naive Bayes Fusion method to improve its performance; and second, extending the evaluation framework to include scenarios with multiple simultaneous leaks, to better align with real-world operating conditions.
ACKNOWLEDGEMENTS
We gratefully acknowledge the financial support and facilities provided by the Tecnológico Nacional de México (TecNM), TecNM Campus Zacatecas Norte, TecNM Campus Chihuahua, and Universidad Autónoma de Zacatecas (UAZ). This study was published with the support of the Instituto de Innovación y Competitividad of the Secretaría de Innovación y Desarrollo Económico del Estado de Chihuahua.
Pressure differences between leak-free and leak-affected states.
Network 1 is a real network whose location is hidden to avoid advantages in the contest.
FUNDING
This work was published with the support of the INSTITUTO DE INNOVACIÓN Y COMPETITIVIDAD de la SECRETARÍA DE INNOVACIÓN Y DESARROLLO ECONÓMICO of the state of Chihuahua, México.
ETHICS STATEMENT
All authors declare that this manuscript is original and has not been submitted to another journal for simultaneous consideration, nor has it been previously published in any language or format. The study is presented in full, without being divided into separate parts, and the results are honestly reported, without fabrication, falsification, or inappropriate data manipulation. Only open-source software was used during the research, and all authors agreed with the designation of the corresponding author and the established authorship order.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.