Simulation of a leak's growth process in water distribution systems based on growth functions

Water loss in water distribution systems is one of the major problems faced by water utilities. The components of water losses should be accurately assessed and their priority should be determined. Generally, water balance analysis is used to quantify different components of water losses and identify the main contributor to high leakage rates. The leak flow rate is assumed to be static within a given calculation period during the calculation of real losses. Errors will inevitably arise during this process. This is mainly due to our limited understanding of a leak's growth process. To overcome this problem, the current work proposes the use of growth functions to represent a leak's growth process and establish a functional relationship between the leak flow rate and the leak duration. A leakage development model is adopted to simulate a leak's growth process and optimize the parameters of growth functions. The results show that the Richards function performs better than other growth functions and its mean absolute percentage error is 15.33%. Furthermore, the growth function could be used to calculate real losses and has the prospect of evaluating the effects of leakage detection.


INTRODUCTION
Over the past few decades, the shortage of water resources has gradually received attention. Annually, about 20-30% of the total water supply in water distribution systems (WDS) is lost (Al-Washali et al.). Water loss increases the cost of water abstraction and delivery and the use of electricity and chemicals, and may affect water quality (Eryiğit; Xue et al.). This is a challenge faced by many water utilities. It is not only an economic issue but also an environmental and safety issue. A standard for quantifying and evaluating water loss is defined by the International Water Association (IWA).
Water loss includes real loss and apparent loss (Al-Washali et al.). Real loss mainly occurs in pipelines, reservoirs, and customer connections. Apparent loss is mainly due to unregistered customer meters, data processing or billing errors, and unauthorized use (Ríos et al.; Ethem Karadirek). In practice, water loss in WDS is inevitable.
Reducing water loss to an acceptable range is feasible from the perspective of economic and environmental costs. Component analysis of leakage, also called burst and background estimates (BABE), is an empirical model for analyzing a certain part of real losses (Lambert). In this method, real losses mainly include water losses caused by three types of leakage (Lambert et al.). The first is background leakage: leaks whose flow rates are too small to be detected by advanced technologies or measures. The second is unreported leakage:
leaks that can be detected underground using leakage detection devices (e.g., noise loggers or correlators). The third is reported leakage: leaks that are easily discovered by practitioners because water overflows to the ground. The lost volume of a leak is calculated as the leak flow rate multiplied by the calculation period (Lambert et al.).
Although background or unreported leakages have small leak flow rates, they can cause significant water losses due to their long leak duration (Aboelnga et al.). Reported leakages often have low water losses due to their short leak duration (Lambert & Fantozzi). The conventional method for calculating real losses is related to leak flow rates and calculation periods. Generally, the leak flow rate is assumed to be static and its evolution over time is ignored.
The detection period (i.e., the time required for practitioners to detect all pipelines in a city) is usually used as the calculation period, which is not equal to the leak duration. This has great uncertainty because it mainly depends on the experience of practitioners. However, no verification is found in the literature for this assumption. There is limited knowledge of a leak's growth process and no lab work that studies the evolution of a leak in water supply pipes.
As a result, the current calculation method of real losses is misleading and needs to be improved.
In real situations, a leak could exist for several detection periods until it is discovered, and its leak flow rate might vary over time. This is a dynamic process. In each detection period, there may be some leaks that can be detected by existing technologies or measures but eventually are not found. This is related to practitioners' experience or detection strategies. For example, it is not easy for inexperienced practitioners to find small leaks or use correlators to detect leaks under ambient noise. Furthermore, it is difficult to find a suitable calculation period for calculating real losses. Essentially, an optimal calculation period should be equal to the leak duration. There is insufficient consideration of this in the literature.
To fill the research gaps mentioned above, it is of great significance to understand a leak's growth process and clarify the mathematical relationship between the leak flow rate and the leak duration. The current work uses growth functions (namely the Logistic, Gompertz, and Richards functions) to represent a leak's growth process. The contributions of this study are summarized as follows:
(1) A growth function is used to describe a dynamic evolution process between the leak flow rate and the leak duration, which helps analyze a leak's growth process from a microscopic point of view.
(2) A leakage development model is adopted to simulate a leak's growth process, and a method to calculate real losses using growth functions is proposed. This helps provide a new method for accurately quantifying each component of real losses (i.e., unreported losses or reported losses).
(3) A detection coefficient is proposed to evaluate the effects of leakage detection, which helps provide a new angle to understand water utilities' leakage detection efficiency.
The remainder of this paper is organized as follows. The

METHODOLOGY

Growth function
Generally, pipeline leakage in WDS is caused by pipe sinking, pipe corrosion, pipe aging, and excessive external loads (Morais & de Almeida). Excluding human interference (e.g., pipe bursts due to construction accidents), a leak's growth process is unidirectional and irreversible: the leak flow rate gradually increases over time until the leak is found and repaired by practitioners. In this study, it is assumed that a complete leak's growth process includes two steps, as shown in Figure 1. The growth functions are expressed in terms of the following quantities: f(x) denotes the leak flow rate (m³/h), and x denotes the leak duration; a represents the extreme leak flow rate;
b, c, and N are parameters that need to be optimized.
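As an illustration of the three growth functions named in this study, the following sketch uses their commonly cited textbook parameterizations. These exact forms are an assumption, since other parameterizations exist; only the symbols a, b, c, and N are taken from the text, and the numeric values are hypothetical.

```python
import math


def logistic(x, a, b, c):
    """Logistic growth (assumed form): a / (1 + b * exp(-c * x))."""
    return a / (1.0 + b * math.exp(-c * x))


def gompertz(x, a, b, c):
    """Gompertz growth (assumed form): a * exp(-b * exp(-c * x))."""
    return a * math.exp(-b * math.exp(-c * x))


def richards(x, a, b, c, N):
    """Richards growth (assumed form): a / (1 + b * exp(-c * x)) ** (1 / N)."""
    return a / (1.0 + b * math.exp(-c * x)) ** (1.0 / N)


# Hypothetical parameters: a is the extreme (limiting) leak flow rate in m^3/h.
a, b, c, N = 10.0, 20.0, 0.5, 2.0
for duration in (0, 5, 10, 20):
    print(duration,
          round(logistic(duration, a, b, c), 3),
          round(gompertz(duration, a, b, c), 3),
          round(richards(duration, a, b, c, N), 3))
```

In these forms all three curves increase monotonically toward a, which matches the unidirectional growth assumed above, and the Richards form reduces to the Logistic form when N = 1.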

LEAKAGE DEVELOPMENT MODEL
Model structure
A leakage development model is developed to simulate a leak's growth process and optimize the parameters of growth functions, as shown in Figure 2. Leakage information and pipeline information are collected from a pipeline maintenance database. A real leakage dataset is divided into a low leak flow rate dataset and a high leak flow rate dataset. The main reason is that for the same probability density of leak flow rates, it is assumed that the leak duration of high leak flow rates is longer than that of low leak flow rates. In this paper, the kernel density estimation is used to calculate the probability density of leak flow rates. This is a basic data-smoothing approach inferring populations based on a finite data sample (Heidenreich et al.). Then, a random sampling process is established to simulate a real leakage detection process. In each random sampling process, the simulated leak flow rates will increase according to growth functions. Finally, the mean square error (MSE) between real leak flow rates and simulated leak flow rates is selected as an objective function of the leakage development model. The optimal parameters of growth functions can be obtained by minimizing the objective function.
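The kernel density step and the division into low and high leak flow rate datasets can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the Gaussian kernel, the fixed bandwidth, the helper names, and the flow rates are all assumptions, and the split point is taken to be the leak with the highest estimated density.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the probability density of `samples`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(q):
        return norm * sum(
            math.exp(-0.5 * ((q - s) / bandwidth) ** 2) for s in samples
        )

    return density


def split_at_density_peak(flow_rates, density):
    """Sort the leaks and split them at the one with the highest density."""
    ordered = sorted(flow_rates)
    peak = max(range(len(ordered)), key=lambda i: density(ordered[i]))
    return ordered[:peak + 1], ordered[peak + 1:]


# Hypothetical leak flow rates in m^3/h.
flows = [0.2, 0.4, 0.5, 0.6, 0.7, 1.1, 1.8, 3.0]
density = gaussian_kde(flows, bandwidth=0.3)
low_flows, high_flows = split_at_density_peak(flows, density)
print(low_flows, high_flows)
```

In practice the bandwidth would be chosen by a selection rule rather than fixed by hand; the value here only keeps the example small.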

Leakage search dataset
To describe the features of a leak, leakage information and pipeline information are used in this study. Leakage information includes detection date, detection period, leak type, and leak flow rate. Pipeline information includes pipe material, pipe diameter, pipe age, and pipe length. Leaks are classified into different categories according to the pipe material, pipe diameter, and pipe age. In this paper, H denotes the pipe length, h represents the detection distance (i.e., the distance between the leak and the practitioner when using a listening stick to detect leaks), and t denotes the detection period. In each detection period, the number of searches is H/h. In T years (i.e., the total time of data collection), the number of searches is m and the number of leaks that have been found is n. Then, the real leakage dataset M can be expressed as
$$M = \{Q_i \mid i = 1, 2, \ldots, n\}$$
where $Q_i$ represents the leak flow rate of the ith leak, and n is the number of leaks that have been discovered in T years.
Furthermore, the probability density estimation of leak flow rates in dataset M is $P(Q_i)$. The maximum probability density estimation is $P_{\max}(Q_l)$, which corresponds to the lth leak in dataset M. Then, dataset M is divided into a low leak flow rate dataset M1 and a high leak flow rate dataset M2. They can be expressed as
$$M1 = \{Q_{i1} \mid i1 = 1, 2, \ldots, l\}, \qquad M2 = \{Q_{i2} \mid i2 = l + 1, l + 2, \ldots, n\} \qquad (6)$$
where i1 represents the i1th leak in dataset M1, and i2 represents the i2th leak in dataset M2.
Firstly, a real leak A is selected from dataset M1. Its leak flow rate is $Q_{i1}$ and its probability density estimation is $P(Q_{i1})$. In each random sampling process, H/h samples are randomly selected from the leakage search dataset.
The classical Knuth-Durstenfeld shuffle algorithm (Fisher & Yates) is used to ensure that the zero and nonzero points in the leakage search dataset are selected with equal probability.
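A minimal sketch of the shuffle and of one random draw of H/h samples is given below. The helper names, the dataset size, and the leak flow rates are hypothetical; zero entries stand for searches that find no leak and nonzero entries carry a leak flow rate.

```python
import random


def durstenfeld_shuffle(items, rng):
    """Shuffle `items` in place so that every permutation is equally likely."""
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)  # randint includes both endpoints
        items[i], items[j] = items[j], items[i]


def draw_detection_period(search_dataset, searches_per_period, rng):
    """Select H/h search points at random and report the nonzero ones."""
    pool = list(search_dataset)  # copy so the original order is untouched
    durstenfeld_shuffle(pool, rng)
    selected = pool[:searches_per_period]
    found = [q for q in selected if q != 0]
    return selected, found


rng = random.Random(0)  # seeded only to make the example repeatable
# Hypothetical search dataset: 995 empty searches and 5 leaks (m^3/h).
search_dataset = [0.0] * 995 + [0.3, 0.6, 0.9, 1.4, 2.2]
selected, found = draw_detection_period(search_dataset, 100, rng)
print(len(selected), found)
```

Shuffling a copy and taking the first H/h entries gives each zero and nonzero point the same chance of being selected, which is the property the shuffle is used for here.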
The random sampling process in dataset M1 is as follows.
Step I: In the k1th random sampling process ($1 \le k1 \le [x_l] + 1$), H/h samples are randomly selected from the leakage search dataset, which contains d1 nonzero points. This indicates that practitioners implemented H/h searches and found d1 leaks in the k1th detection period.
Step II: The simulated leak flow rate of leak A is $f(k1)$ and its probability density estimation is $P(f(k1))$. If $P(f(k1))$ is larger than $P(Q_{i1})$, the real leak A is found. Otherwise, steps I and II are repeated. This indicates that when the probability density of simulated leak flow rates is larger than the average probability density of real leak flow rates, the real leak A in dataset M1 will be discovered.
Step III: The sum of squared errors of real leaks in dataset M1 is calculated.
Secondly, a real leak B is selected from dataset M2. Its leak flow rate is $Q_{i2}$ and its probability density estimation is $P(Q_{i2})$. In dataset M2, the leak duration before the real leak B is $x_p$ ($p = l, l + 1, \ldots, n - 1$). The random sampling process in dataset M2 is as follows:
Step I: In the k2th random sampling process ($1 \le k2 \le (T/t) - [x_l] - 1$), H/h samples are randomly selected from the leakage search dataset, which contains d2 nonzero points. The leak duration of leak B is $x_{i2}$ ($i2 = l + 1, \ldots, n$; $p = l, \ldots, n - 1$).
Step II: The simulated leak flow rate of leak B is $f(x_{i2})$ and its probability density estimation is $P(f(x_{i2}))$. If $P(f(x_{i2}))$ is larger than $P(Q_{i2})$, the real leak B is found. Otherwise, steps I and II are repeated.
Step III: The sum of squared errors of real leaks in dataset M2 is calculated.
Thirdly, the objective function of the leakage development model is the sum of MSE between real leak flow rates and simulated leak flow rates. When the objective function reaches its minimum value, the corresponding parameters of the growth functions are optimal.

CASE STUDY

Data acquisition
Pipeline information and leakage information are collected in City DC from 2007 to 2016. In City DC, practitioners detect leaks on all water supply pipelines every 3 months by using listening sticks or noise correlators. The average detection distance is 100 m, and the detection period is 3 months. Table 1 presents the main pipeline information in City DC.
In the evaluation indicators, $Y_i$ is a real leak flow rate, $\bar{Y}$ is the mean leak flow rate, $Y'_i$ is a simulated leak flow rate, and n is the number of leaks. $P(x)$ is the probability distribution of real leak flow rates, and $Q(x)$ is the probability distribution of simulated leak flow rates. The RE indicator is used to evaluate the similarity of probability distributions. The closer the RE is to 0, the more similar the two probability distributions are. The closer the NSE is to 1, the more accurate the model is. We can also observe that the MAPE values for the three growth functions are relatively high. The main reason is that a prediction error at a low leak flow rate is large relative to the real value and therefore inflates the MAPE. This indicates that the performance of leakage development models for low leak flow rates still needs to be improved. For ≥DN400 cement pipelines, the model per…
For the same pipe diameter, the steel and cement pipes have a longer average leak duration than cast iron pipes due to strong corrosion resistance. The leak's growth process is related to pipe properties.
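The evaluation indicators named in this subsection (MAPE, NSE, and RE) can be computed as in the sketch below. The formulas are the standard textbook definitions and are assumed rather than taken from the study; in particular, RE is assumed to be the relative entropy (Kullback-Leibler divergence) between the two discrete distributions. All input values are hypothetical.

```python
import math


def mape(real, simulated):
    """Mean absolute percentage error, in percent."""
    return 100.0 / len(real) * sum(
        abs((y - s) / y) for y, s in zip(real, simulated)
    )


def nse(real, simulated):
    """Nash-Sutcliffe efficiency: 1 means a perfect match."""
    mean_real = sum(real) / len(real)
    residual = sum((y - s) ** 2 for y, s in zip(real, simulated))
    spread = sum((y - mean_real) ** 2 for y in real)
    return 1.0 - residual / spread


def relative_entropy(p, q):
    """Relative entropy of P with respect to Q: 0 for identical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical real and simulated leak flow rates in m^3/h.
real = [0.4, 1.0, 2.0, 4.0]
simulated = [0.6, 1.1, 1.9, 4.2]
print(round(mape(real, simulated), 2), round(nse(real, simulated), 4))
print(round(relative_entropy([0.5, 0.5], [0.25, 0.75]), 4))
```

In this example the smallest real flow rate contributes the largest share of the MAPE even though its absolute error is no larger than the others, which illustrates why low leak flow rates dominate that indicator.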

Rationality of the random sampling process
In the leakage development model, the probability distribution of leak flow rates is selected as an indicator to stop the random sampling process. When a simulated leak has a higher kernel density estimation than a real leak, it indicates that the probability of the simulated leak occurring is higher than the historical average probability. In this case, the simulated leak is considered detected. The frequency (i.e., (d1 + d2)/(H/h)) converges to the probability P based on the law of large numbers. This can be expressed as
$$\lim_{H/h \to \infty} \Pr\left( \left| \frac{d1 + d2}{H/h} - P \right| < \varepsilon \right) = 1$$
where the sum of d1 and d2 obeys a binomial distribution, i.e., $d1 + d2 \sim B(H/h, P)$, and $\varepsilon$ is an error coefficient. According to Chebyshev's theorem, the following inequality is satisfied
$$\Pr\left( \left| \frac{d1 + d2}{H/h} - P \right| \ge \varepsilon \right) \le \frac{P(1 - P)}{(H/h)\,\varepsilon^2}$$
Using the DN100 cast iron pipeline as an example, $d1 + d2 \sim B(10093, 0.0024)$, and $\varepsilon$ is equal to 0.01. It is given in the form
$$\Pr\left( \left| \frac{d1 + d2}{10093} - 0.0024 \right| \ge 0.01 \right) \le \frac{0.0024 \times 0.9976}{10093 \times 0.01^2} \approx 0.0024$$
That is, when the number of samples is equal to 10,093, the probability of a large deviation is less than 0.24%. The larger the number of samples, the smaller the deviation of the random sampling process. For each random sampling process, the frequency of leaks at a certain leak flow rate can be regarded as the probability of leaks. The random sampling process is reasonable and can represent the average probability distribution of leak flow rates in dataset M.
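The bound quoted for the DN100 cast iron pipeline can be checked numerically. The sketch below assumes the standard Chebyshev bound for a binomial frequency, P(1 - P)/(n ε²), with n = H/h; the function name is hypothetical.

```python
def chebyshev_bound(n_samples, p, eps):
    """Upper bound on Pr(|X / n - p| >= eps) for X ~ Binomial(n, p)."""
    return p * (1.0 - p) / (n_samples * eps ** 2)


# Values quoted in the text for the DN100 cast iron pipeline.
bound = chebyshev_bound(n_samples=10093, p=0.0024, eps=0.01)
print(f"{bound:.4%}")

# The bound shrinks as the number of samples grows.
print(chebyshev_bound(20186, 0.0024, 0.01) < bound)
```

With these inputs the bound evaluates to roughly 0.237%, consistent with the statement that the probability of a large deviation is less than 0.24%.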

Application of the growth function
The growth function is used to establish the mathematical relationship between the leak flow rate and the leak duration. In this study, real losses mainly include unreported losses and reported losses, which can be calculated by integrating growth functions. The results illustrate that the conventional method ignores a leak's growth process, so errors will inevitably occur when calculating real losses. The proposed method for calculating real losses by growth functions considers the dynamic evolution process of leak flow rates, which is in line with the actual situation. A novel calculation method of real losses proposed in this paper is as follows:
(1) For unreported leakage, real losses can be calculated by integrating growth functions, which includes leaks that have been discovered by existing technologies (e.g., unreported leak A in Figure 7) and leaks that can be detected by existing technologies but are not discovered (e.g., unreported leak B in Figure 7).
(2) For reported leakage, real losses caused by human factors are equal to the leak flow rate multiplied by the repair duration, and real losses caused by nonhuman factors can be calculated by integrating growth functions (e.g., reported leak C in Figure 7).
(3) For the current detection period, real losses caused by unreported leak A and reported leak C correspond to detected losses, and real losses caused by unreported leak B correspond to undetected losses (Figure 7).
For the conventional method, the calculation period is uncertain and mainly depends on the experience of practitioners. Our proposed method for calculating real losses is not influenced by these factors. If practitioners discover few leaks in a certain calculation period, this does not mean that there are no other leaks on the pipelines. According to our assumptions, a leak does not occur suddenly, but will gradually
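The difference between integrating a growth function and multiplying a static leak flow rate by the calculation period can be sketched as follows. The Richards form, its parameter values, the use of hours as the duration unit, and the helper names are all assumptions made for illustration; they are not values from the case study.

```python
import math


def richards(x, a, b, c, N):
    """Assumed Richards growth form: a / (1 + b * exp(-c * x)) ** (1 / N)."""
    return a / (1.0 + b * math.exp(-c * x)) ** (1.0 / N)


def integrated_loss(growth, duration, steps=1000):
    """Lost volume as the trapezoidal integral of growth(x) over [0, duration]."""
    dx = duration / steps
    total = 0.5 * (growth(0.0) + growth(duration))
    total += sum(growth(i * dx) for i in range(1, steps))
    return total * dx


def static_loss(flow_rate, period):
    """Conventional estimate: a constant leak flow rate times the period."""
    return flow_rate * period


# Hypothetical parameters: a in m^3/h, c per hour, duration in hours.
A, B, C, N = 2.0, 50.0, 0.002, 1.5
DURATION_H = 2160.0  # about three months


def growth(x):
    return richards(x, A, B, C, N)


dynamic_volume = integrated_loss(growth, DURATION_H)
static_volume = static_loss(growth(DURATION_H), DURATION_H)
print(round(dynamic_volume, 1), round(static_volume, 1))
```

Because the assumed leak flow rate grows over time, applying the flow rate measured at detection to the whole period gives a larger volume than the integral, which is one way the static assumption can distort real loss estimates.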

CONCLUSIONS
This study investigates the potential of using a growth function to represent a leak's growth process. The growth function can be used to calculate real losses and assess leakage detection efficiency. The following conclusions can be drawn:
(1) The growth function can be used to represent a leak's growth process. The Richards function is more suitable for establishing the mathematical relationship between the leak flow rate and the leak duration than the Logistic and Gompertz functions.
(2) The leakage development model can be used to simulate a leak's growth process and optimize the parameters of growth functions. The random sampling process can be used to simulate a real leakage detection process, and its rationality has been verified.
(3) The proposed calculation method of real losses considers a leak's dynamic evolution process, which helps provide a new angle to further understand each component of real losses. However, the conventional method ignores a leak's growth process, which may lead to large errors.
(4) The detection coefficient may be used as an indicator to evaluate leakage detection efficiency. Water utilities could formulate leakage detection measures based on the detection coefficients of different regions.
It is suggested that future work test the leakage development model on a larger leakage dataset. For small leak flow rates, the time scale could be shortened to allow more accurate assessments. In addition, the calculation method of real losses based on growth functions still needs to be improved and compared with the actual real losses. Further research could investigate the influence of pipe properties (e.g., pipe age) on the proposed method.