## Abstract

Water loss in water distribution systems is one of the major problems faced by water utilities. The components of water losses should be accurately assessed and their priority should be determined. Generally, water balance analysis is used to quantify different components of water losses and identify the main contributor to high leakage rates. The leak flow rate is assumed to be static within a given calculation period during the calculation of real losses. Errors will inevitably arise during this process. This is mainly due to our limited understanding of a leak's growth process. To overcome this problem, the current work proposes the use of growth functions to represent a leak's growth process and establish a functional relationship between the leak flow rate and the leak duration. A leakage development model is adopted to simulate a leak's growth process and optimize the parameters of growth functions. The results show that the Richards function performs better than other growth functions and its mean absolute percentage error is 15.33%. Furthermore, the growth function could be used to calculate real losses and has the prospect of evaluating the effects of leakage detection.

## HIGHLIGHTS

The growth function can be used to represent a leak's growth process.

The leakage development model is adopted to simulate a leak's growth process.

The proposed calculation method of real losses considers a leak's evolution process.

The detection coefficient is proposed to evaluate leakage detection efficiency.

## INTRODUCTION

Over the past few decades, the shortage of water resources has received growing attention. Each year, about 20–30% of the total water supply in water distribution systems (WDS) is lost (Al-Washali *et al.* 2016, 2020). Water loss increases the cost of water abstraction and delivery, raises the use of electricity and chemicals, and may affect water quality (Eryiğit 2019; Xue *et al.* 2020). This challenge is faced by many water utilities. It is not only an economic issue but also an environmental and safety issue (Güngör-Demirci *et al.* 2018; Guo *et al.* 2020).

A standard for quantifying and evaluating water loss is defined by the International Water Association (IWA). Water loss includes real loss and apparent loss (Al-Washali *et al.* 2020). Real loss mainly occurs in pipelines, reservoirs, and customer connections. Apparent loss is mainly due to unregistered customer meters, data processing or billing errors, and unauthorized use (Ríos *et al.* 2014; Ethem Karadirek 2019). In practice, water loss in WDS is inevitable. Reducing water loss to an acceptable range is feasible from the perspective of economic and environmental costs (Kanakoudis *et al.* 2012; Sakai *et al.* 2020). Real loss is the main form of water loss and depends on the characteristics of pipe networks. Three methods can be used to estimate real losses: (1) a top-down water balance analysis (Al-Washali *et al.* 2016); (2) a bottom-up minimum night flow (MNF) analysis (Alkasseh *et al.* 2013); and (3) a component analysis of leakage (Aboelnga *et al.* 2018). To effectively improve leakage control management, each component or subcomponent of real losses should be assessed, including background losses, unreported losses, and reported losses. Component analysis of leakage is a conventional method that breaks down real losses into subcomponents (Al-Washali *et al.* 2020).

Component analysis of leakage, also called burst and background estimates (BABE), is an empirical model for analyzing a certain part of real losses (Lambert 1994). In this method, real losses mainly include water losses caused by three types of leakage (Lambert *et al.* 1999). The first is background leakage: these leaks cannot be detected by advanced technologies or measures because their leak flow rates are small. The second is unreported leakage: these leaks can be detected underground using leakage detection devices (e.g., noise loggers or correlators). The third is reported leakage: these leaks can be easily discovered by practitioners because water overflows to the ground. The lost volume of a leak is calculated as the leak flow rate multiplied by the calculation period (Lambert *et al.* 1999). Although background or unreported leakages have small leak flow rates, they can cause significant water losses due to their long leak duration (Aboelnga *et al.* 2018). Reported leakages often have low water losses due to their short leak duration (Lambert & Fantozzi 2005). The conventional method for calculating real losses therefore depends on leak flow rates and calculation periods. Generally, the leak flow rate is assumed to be static and its evolution over time is ignored. The detection period (i.e., the time required for practitioners to detect all pipelines in a city) is usually used as the calculation period, which is not equal to the leak duration. This choice carries great uncertainty because it mainly depends on the experience of practitioners, and no verification of the static-flow assumption is found in the literature. There is limited knowledge of a leak's growth process and no lab work that studies the evolution of a leak in water supply pipes. As a result, the current calculation method of real losses can be misleading and needs to be improved.

In real situations, a leak could exist for several detection periods until it is discovered, and its leak flow rate might vary over time. This is a dynamic process. In each detection period, there may be some leaks that can be detected by existing technologies or measures but eventually are not found. This is related to practitioners’ experience or detection strategies. For example, it is not easy for inexperienced practitioners to find small leaks or use correlators to detect leaks under ambient noise. Furthermore, it is difficult to find a suitable calculation period for calculating real losses. Essentially, an optimal calculation period should be equal to the leak duration. There is insufficient consideration of this in the literature.

To fill the research gaps mentioned above, it is of great significance to understand a leak's growth process and clarify the mathematical relationship between the leak flow rate and the leak duration. The current work uses a growth function (i.e., Logistic, Gompertz, and Richards functions) to represent a leak's growth process. The contributions of this study are summarized as follows:

- (1) A growth function is used to describe a dynamic evolution process between the leak flow rate and the leak duration, which helps analyze a leak's growth process from a microscopic point of view.

- (2) A leakage development model is adopted to simulate a leak's growth process, and a method to calculate real losses using growth functions is proposed. This helps provide a new method for accurately quantifying each component of real losses (i.e., unreported losses or reported losses).

- (3) A detection coefficient is proposed to evaluate the effects of leakage detection, which helps provide a new angle to understand water utilities’ leakage detection efficiency.

The remainder of this paper is organized as follows. The ‘Methodology’ section describes the growth function and the leakage development model. The ‘Case study’ section describes the data acquisition and modeling process. The ‘Results and discussion’ section discusses optimization results and the application of growth functions. The ‘Conclusions’ section summarizes this work and offers suggestions for future work.

## METHODOLOGY

### Growth function

Generally, pipeline leakage in WDS is caused by pipe sinking, pipe corrosion, pipe aging, and excessive external loads (Morais & de Almeida 2007). Excluding human interference (e.g., pipe bursts due to construction accidents), a leak's growth process is unidirectional and irreversible: the leak flow rate gradually increases over time until the leak is found and repaired by practitioners. In this study, it is assumed that a leak's complete growth process includes two stages, as shown in Figure 1. The first is the background leakage stage, in which leaks cannot be detected by existing technologies or measures due to low leak flow rates. The second is the leakage development stage, in which unreported or reported leaks can be discovered using existing technologies or measures, and their leak flow rates gradually increase over time. Unreported leaks are discovered underground, and reported leaks can be easily found because water overflows to the ground.

*et al.* (1995) used the Gompertz function to simulate the growth of car ownership; Huo & Wang (2012) used the Richards function to simulate vehicle sales and stock in China. Inspired by these studies, we use a growth function to represent a leak's growth process. Essentially, the leak's growth process is a complex physical, chemical, and biological reaction process caused by some microorganisms attached to the pipelines. This process is similar to the growth process of animals, plants, and microorganisms (Thornley & France 2005). In these growth functions, the dependent variable denotes the leak flow rate (m^{3}/h), and *x* denotes the leak duration. *a* represents the extreme leak flow rate. *b*, *c*, and *N* are parameters that need to be optimized.
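As an illustration of the three growth functions named above, the following minimal sketch uses their common textbook parameterizations with the symbols *a*, *b*, *c*, and *N*. The exact forms used in this work are not reproduced here, so the expressions below are assumptions. The Richards form is chosen because it reduces to the Logistic function at *N* = 1 and approaches the Gompertz function as *N* tends to zero. The numerical values are illustrative only.

```python
import numpy as np


def logistic(x, a, b, c):
    """Logistic growth: a / (1 + b * exp(-c * x))."""
    return a / (1.0 + b * np.exp(-c * x))


def gompertz(x, a, b, c):
    """Gompertz growth: a * exp(-b * exp(-c * x))."""
    return a * np.exp(-b * np.exp(-c * x))


def richards(x, a, b, c, N):
    """Richards growth (assumed form): a * (1 + N * b * exp(-c * x)) ** (-1 / N).

    Evaluated through log1p so that very small N stays numerically stable.
    """
    u = b * np.exp(-c * x)
    return a * np.exp(-np.log1p(N * u) / N)


x = np.linspace(0.0, 20.0, 5)   # leak duration (time unit is an assumption)
a, b, c = 10.0, 5.0, 0.5        # illustrative values, not fitted ones
print(richards(x, a, b, c, N=1.0) - logistic(x, a, b, c))    # zero up to rounding
print(richards(x, a, b, c, N=1e-8) - gompertz(x, a, b, c))   # close to zero
```

All three curves increase monotonically toward the extreme leak flow rate *a*, which matches the unidirectional growth assumed for a leak.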

## LEAKAGE DEVELOPMENT MODEL

### Model structure

A leakage development model is developed to simulate a leak's growth process and optimize the parameters of growth functions, as shown in Figure 2. Leakage information and pipeline information are collected from a pipeline maintenance database. A real leakage dataset is divided into a low leak flow rate dataset and a high leak flow rate dataset. The main reason is that for the same probability density of leak flow rates, it is assumed that the leak duration of high leak flow rates is longer than that of low leak flow rates. In this paper, the kernel density estimation is used to calculate the probability density of leak flow rates. This is a basic data-smoothing approach inferring populations based on a finite data sample (Heidenreich *et al.* 2013). Then, a random sampling process is established to simulate a real leakage detection process. In each random sampling process, the simulated leak flow rates will increase according to growth functions. Finally, the mean square error (MSE) between real leak flow rates and simulated leak flow rates is selected as an objective function of the leakage development model. The optimal parameters of growth functions can be obtained by minimizing the objective function.

### Leakage search dataset

*H* denotes the pipe length, *h* represents the detection distance (i.e., the distance between the leak and the practitioner when using a listening stick to detect leaks), and *t* denotes the detection period. In each detection period, the number of searches is *H*/*h*. In *T* years (i.e., the total time of data collection), the number of searches is *m* and the number of leaks that have been found is *n*. The real leakage dataset *M* then consists of the leak flow rates of the *n* leaks that have been discovered in *T* years.

The probability density estimation is calculated for each leak in dataset *M*. The maximum probability density estimation corresponds to the *l*th leak in dataset *M*. Then, dataset *M* is divided into a low leak flow rate dataset *M*1 and a high leak flow rate dataset *M*2, whose elements are indexed as the *i*1th leak in dataset *M*1 and the *i*2th leak in dataset *M*2.

Finally, a leakage search dataset that consists of zero points and nonzero points is established. The number of zero points is *m-n* and their corresponding values are equal to 0. This indicates that no leaks are discovered. The number of nonzero points is *n* and their corresponding values are equal to leak flow rates in dataset *M*. This indicates that leaks are discovered.
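The construction of the leakage search dataset can be sketched as follows. The function name, its arguments, and the relation *m* = (*H*/*h*) × (*T*/*t*) are assumptions made for illustration. The text only states that *H*/*h* searches are made per detection period and that *m* searches are made in *T* years. The leak flow rates and pipe length are hypothetical.

```python
import numpy as np


def build_search_dataset(leak_flows, pipe_length, detect_dist, total_years, period_years):
    """Return (dataset, m, n): m - n zero points followed by the n leak flow rates."""
    searches_per_period = int(pipe_length // detect_dist)   # H / h
    n_periods = int(round(total_years / period_years))      # assumed T / t
    m = searches_per_period * n_periods                     # assumed total number of searches
    n = len(leak_flows)
    if n > m:
        raise ValueError("more leaks than searches")
    dataset = np.concatenate([np.zeros(m - n), np.asarray(leak_flows, dtype=float)])
    return dataset, m, n


# Hypothetical inputs: 5 km of pipe, 100 m detection distance, 10 years, 3-month period.
dataset, m, n = build_search_dataset([1.2, 5.74, 0.8], 5000.0, 100.0, 10.0, 0.25)
print(m, n, int(np.count_nonzero(dataset)))
```

Zero points stand for searches that found no leak, so the share of nonzero points reflects how rarely a search finds a leak.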

### Random sampling process

Although water utilities have formulated detailed leakage detection arrangements, there are still many uncertainties in practice. Whether a leak can be found depends not only on its size or leak flow rate, but also on the experience of practitioners, the precision of detection devices, the intensity of ambient noise, and other factors. This study establishes a random sampling process to simulate a real leakage detection process, as shown in Figure 3. The statistical probability distribution of real leak flow rates in *T* years represents the overall level of the probability distribution of leak flow rates in each detection period. Each random sampling process represents a real leakage detection process in a detection period. The probability distribution of simulated leak flow rates is random and can represent the probability distribution of leak flow rates during each random sampling process. In the leakage development model, by comparing the probability distribution of real leak flow rates and simulated leak flow rates, it can be determined whether a leak is found. The random sampling process contains three parts.

Firstly, a real leak A is selected from dataset *M*1, together with its leak flow rate and its probability density estimation. In each random sampling process, *H*/*h* samples are randomly selected from the leakage search dataset. The classical Knuth–Durstenfeld shuffle algorithm (Fisher & Yates 1963) is used to ensure that the zero and nonzero points in the leakage search dataset can be selected equally. The random sampling process in dataset *M*1 is as follows.

Step I: In the *k*1th random sampling process, *H*/*h* samples are randomly selected from the leakage search dataset, of which *d*1 are nonzero points. This indicates that practitioners implemented *H*/*h* searches and found *d*1 leaks in the *k*1th detection period.

Step II: The simulated leak flow rate of leak A is calculated by growth functions, and its probability density estimation is obtained. If the probability density of the simulated leak flow rate is larger than the average probability density of real leak flow rates, the real leak A in dataset *M*1 is found. Otherwise, steps I and II are repeated.

Step III: The sum of squared errors of real leaks in dataset *M*1 is calculated.

Secondly, a real leak B is selected from dataset *M*2, together with its leak flow rate and its probability density estimation. In dataset *M*2, a leak duration has already elapsed before the real leak B is sampled. The random sampling process in dataset *M*2 is as follows:

Step I: In the *k*2th random sampling process, *H*/*h* samples are randomly selected from the leakage search dataset, of which *d*2 are nonzero points. The leak duration of leak B is updated in each random sampling process.

Step II: The simulated leak flow rate of leak B is calculated by growth functions, and its probability density estimation is obtained. If the probability density of the simulated leak flow rate is larger than that of real leak flow rates, the real leak B is found. Otherwise, steps I and II are repeated.

Step III: The sum of squared errors of real leaks in dataset *M*2 is calculated.
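Step I of both procedures relies on an unbiased shuffle followed by drawing *H*/*h* samples. A minimal sketch of that step is given below, using Durstenfeld's variant of the Fisher–Yates shuffle. The function names, the seed, and the dataset values are hypothetical. The growth and comparison logic of steps II and III is not included.

```python
import random


def durstenfeld_shuffle(items, rng):
    """Return a uniformly shuffled copy of items (Durstenfeld's variant of Fisher-Yates)."""
    out = list(items)
    for i in range(len(out) - 1, 0, -1):
        j = rng.randint(0, i)            # randint is inclusive on both ends
        out[i], out[j] = out[j], out[i]
    return out


def sample_detection_period(search_dataset, n_samples, rng):
    """Draw n_samples points without replacement; return (samples, number of nonzero points)."""
    shuffled = durstenfeld_shuffle(search_dataset, rng)
    samples = shuffled[:n_samples]
    found = sum(1 for q in samples if q > 0)
    return samples, found


rng = random.Random(0)                       # fixed seed, for reproducibility only
dataset = [0.0] * 1997 + [1.2, 5.74, 0.8]    # hypothetical leakage search dataset
samples, d = sample_detection_period(dataset, 50, rng)
print(len(samples), d)
```

Taking the first *H*/*h* entries of a uniformly shuffled list is equivalent to sampling without replacement, so zero and nonzero points have the same chance of being selected.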

## CASE STUDY

### Data acquisition

Pipeline information and leakage information are collected in City DC from 2007 to 2016. In City DC, practitioners detect leaks on all water supply pipelines every 3 months by using listening sticks or noise correlators. The average detection distance is 100 m, and the detection period is 3 months. Table 1 presents the main pipeline information in City DC.

Table 2 presents the main leakage information in City DC. The leak flow rate is calculated according to the orifice type function (Puust *et al.* 2010).

### Parameter selection of models

The leakage development model is developed based on the leakage search dataset from 2007 to 2016. Different growth functions have different parameters. Essentially, the Richards function can be regarded as a combination of the Logistic function and the Gompertz function. When parameter *N* is equal to 1, the Richards function is the Logistic function. When parameter *N* tends to zero, the Richards function is close to the Gompertz function. Parameter *a* is equal to the maximum leak flow rate. Parameter *b* ranges from 1 to 20 in steps of 1. Parameters *c* and *N* range from 0 to 1 in steps of 0.1.
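An exhaustive search over the stated parameter ranges can be sketched as follows. This is a simplified illustration, not the leakage development model itself. It assumes that leak durations are known, whereas in the model they result from the random sampling process. The Richards form is an assumed parameterization, and the values *c* = 0 and *N* = 0 are excluded because the assumed form is constant or undefined there.

```python
import itertools

import numpy as np


def richards(x, a, b, c, N):
    """Assumed Richards form: a * (1 + N * b * exp(-c * x)) ** (-1 / N)."""
    return a * np.exp(-np.log1p(N * b * np.exp(-c * x)) / N)


def grid_search(x, y, a):
    """Return ((b, c, N), mse) minimizing the mean square error over the parameter grid."""
    b_values = np.arange(1, 21)                        # 1, 2, ..., 20
    cn_values = np.round(np.arange(1, 11) * 0.1, 1)    # 0.1, 0.2, ..., 1.0
    best, best_mse = None, np.inf
    for b, c, N in itertools.product(b_values, cn_values, cn_values):
        mse = np.mean((richards(x, a, b, c, N) - y) ** 2)
        if mse < best_mse:
            best, best_mse = (int(b), float(c), float(N)), float(mse)
    return best, best_mse


# Synthetic check: flow rates generated from known parameters on the grid.
x = np.linspace(0.0, 30.0, 40)
y = richards(x, a=10.0, b=5, c=0.3, N=0.5)
print(grid_search(x, y, a=10.0))
```

The grid has 20 × 10 × 10 = 2,000 combinations, so a brute-force search is inexpensive at this resolution.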

For the probability density estimation, the uniform function, the triangular function, and the Gaussian function are commonly used kernel functions. Prakasa Rao (1983) proved that different kernel functions have little effect on the nonparametric density estimation. The Gaussian function is selected as a kernel function and the optimal bandwidth is calculated according to an empirical method (Silverman 1988), which determines the bandwidth from the standard deviation of the samples and the number of samples *n*. Table 3 shows the optimized parameters for different leakage development models.
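A Gaussian kernel density estimate with a rule-of-thumb bandwidth can be sketched as follows. The bandwidth formula 1.06 σ *n*^(−1/5) is Silverman's normal-reference rule and is an assumption here. The leak flow rates are hypothetical, and the rule used to split the data at the maximum density is one plausible reading of the division into *M*1 and *M*2.

```python
import numpy as np


def silverman_bandwidth(samples):
    """Rule-of-thumb bandwidth 1.06 * sigma * n ** (-1/5) for a Gaussian kernel."""
    samples = np.asarray(samples, dtype=float)
    return 1.06 * samples.std(ddof=1) * samples.size ** (-0.2)


def gaussian_kde(samples, points, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of samples at points."""
    samples = np.asarray(samples, dtype=float)
    points = np.asarray(points, dtype=float)
    h = silverman_bandwidth(samples) if bandwidth is None else bandwidth
    z = (points[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (samples.size * h * np.sqrt(2.0 * np.pi))


flows = np.array([0.8, 1.2, 1.5, 2.1, 2.4, 3.0, 5.74, 9.5])   # hypothetical, m^3/h
density = gaussian_kde(flows, flows)
l = int(np.argmax(density))          # leak with the maximum density estimate
low = flows[flows <= flows[l]]       # assumed split rule for dataset M1
high = flows[flows > flows[l]]       # assumed split rule for dataset M2
print(flows[l], low.size, high.size)
```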

### Performance indicator of models

*et al.* 2013): mean absolute error (MAE), mean absolute percentage error (MAPE), relative entropy (RE), and Nash–Sutcliffe model efficiency (NSE). These indicators compare the real leak flow rates with the simulated leak flow rates over *n* leaks, where *n* is the number of leaks, and the NSE additionally uses the mean leak flow rate. The RE indicator compares the probability distribution of real leak flow rates with the probability distribution of simulated leak flow rates and is used to evaluate the similarity of probability distributions. The closer the RE is to 0, the closer the two probability distributions are. The closer the NSE is to 1, the more accurate the model is.
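The four indicators can be sketched with their standard definitions, as below. The function names and the direction of the relative entropy (real distribution relative to simulated distribution) are assumptions. The sample values are hypothetical.

```python
import numpy as np


def mae(real, sim):
    """Mean absolute error."""
    real, sim = np.asarray(real, dtype=float), np.asarray(sim, dtype=float)
    return float(np.mean(np.abs(real - sim)))


def mape(real, sim):
    """Mean absolute percentage error in percent; real values must be nonzero."""
    real, sim = np.asarray(real, dtype=float), np.asarray(sim, dtype=float)
    return float(100.0 * np.mean(np.abs((real - sim) / real)))


def relative_entropy(p, q):
    """Kullback-Leibler divergence sum(p * log(p / q)) of two discrete distributions."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    mask = p > 0                     # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def nse(real, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / sum of squared deviations from the mean."""
    real, sim = np.asarray(real, dtype=float), np.asarray(sim, dtype=float)
    return float(1.0 - np.sum((real - sim) ** 2) / np.sum((real - real.mean()) ** 2))


real = [2.0, 4.0, 6.0, 8.0]          # hypothetical real leak flow rates, m^3/h
sim = [2.5, 3.5, 6.5, 7.5]           # hypothetical simulated leak flow rates, m^3/h
print(mae(real, sim), mape(real, sim), nse(real, sim))
```

Because MAPE divides each error by the real leak flow rate, the same absolute error weighs more heavily on small leaks than on large ones.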

## RESULTS AND DISCUSSION

### Optimization results

Table 4 presents the performance of different leakage development models. The results show that the Richards function performs better than the Logistic and Gompertz functions. Its average MAPE is 15.33%. The Gompertz function performs worse than other growth functions. Its average MAPE is 22.83%. This indicates that the Richards function is more suitable for simulating the leak's growth process. We can also observe that MAPE values for three growth functions are relatively high. The main reason is that the prediction error of the low leak flow rate may lead to a larger MAPE value. This indicates that the performance of leakage development models for low leak flow rates still needs to be improved. For DN400 cement pipelines, the model performance is poor because there are fewer leaks that can be used to optimize growth functions. For the same leak type, the leakage development model needs data for at least 200 leaks to obtain a satisfactory result.

The growth function and the probability distribution of leak flow rates for different pipelines are similar, as shown in Figures 4–6. This is an ideal growth curve. A leak is discovered with gradually increasing leak flow rates under natural conditions. In practice, leaks discovered by practitioners could be located anywhere on growth functions. Most leak flow rates are less than 25 m^{3}/h and the leak duration is less than 2 years. A few leak flow rates are more than 50 m^{3}/h and the leak duration exceeds 5 years. The potential reasons are given as follows: (1) some pipelines may not be detected by practitioners and (2) large ambient noise could reduce leakage detection accuracy. Furthermore, the probability distribution of simulated leak flow rates using the Richards function is close to the probability distribution of real leak flow rates, which further demonstrates that the Richards function outperforms other growth functions. For the same pipe material, the larger the pipe diameter, the longer the average leak duration. For the same pipe diameter, the steel and cement pipes have a longer average leak duration than cast iron pipes due to strong corrosion resistance. The leak's growth process is related to pipe properties.

### Rationality of the random sampling process

The proportion (*d*1 + *d*2)/(*H*/*h*) converges to the probability *P* based on the law of large numbers, where *d*1 and *d*2 obey a binomial distribution and the deviation from *P* is bounded by an error coefficient. According to Chebyshev's theorem, this convergence is satisfied for dataset *M*.
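The convergence argument can be written out with the standard Chebyshev bound for a binomial count. The symbols *d* (number of nonzero points drawn), *N_s* (number of sampled points, *H*/*h* per detection period), and ε (error tolerance) are introduced here for illustration and are not the notation of the original derivation.

```latex
\begin{align}
  d &\sim \mathrm{Binomial}(N_s, P), \qquad
  \mathbb{E}\!\left[\frac{d}{N_s}\right] = P, \qquad
  \operatorname{Var}\!\left[\frac{d}{N_s}\right] = \frac{P(1-P)}{N_s}, \\
  \Pr\!\left(\left|\frac{d}{N_s} - P\right| \ge \varepsilon\right)
  &\le \frac{P(1-P)}{N_s\,\varepsilon^{2}}
  \le \frac{1}{4\,N_s\,\varepsilon^{2}}
  \;\longrightarrow\; 0 \quad \text{as } N_s \to \infty .
\end{align}
```

The bound shows that, for any fixed tolerance ε, the sampled proportion of nonzero points deviates from *P* with a probability that shrinks in inverse proportion to the number of sampled points.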

### Application of the growth function

The growth function is used to establish the mathematical relationship between the leak flow rate and the leak duration. In this study, real losses mainly include unreported losses and reported losses, which can be calculated by integrating growth functions. For example, a leak was discovered on a DN100 cast iron pipe on 5 September 2013, and its leak flow rate was 5.74 m^{3}/h. According to the Richards function, this leak appeared on 27 June 2012, and its leak duration was 435 days. The total real loss caused was 35,544 m^{3}. However, according to the conventional method (i.e., component analysis of leakage) for calculating real losses, the leak flow rate is considered a constant value during a given calculation period (Al-Washali *et al.* 2020). The total real loss is equal to the leak flow rate multiplied by the calculation period. If the calculation period is equal to 3 months, the total real loss caused is 12,398 m^{3}.
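The two calculations in this example can be sketched as follows. The conventional estimate uses the figures quoted above, with 3 months taken as 90 days. The growth curve passed to the integral is a hypothetical Logistic function whose parameters *a* and *b* are chosen only so that the flow rate reaches 5.74 m^{3}/h after 435 days. It is not the fitted Richards function, so the integral will not reproduce the 35,544 m^{3} reported above.

```python
import numpy as np

HOURS_PER_DAY = 24.0


def conventional_loss(flow_m3_per_h, period_days):
    """Static-flow estimate: leak flow rate multiplied by the calculation period."""
    return flow_m3_per_h * HOURS_PER_DAY * period_days


def integrated_loss(flow_fn, duration_days, steps=10_000):
    """Trapezoidal integral of flow_fn(t_days) in m^3/h over [0, duration_days], in m^3."""
    t = np.linspace(0.0, duration_days, steps + 1)
    q = flow_fn(t)
    dt_hours = (t[1] - t[0]) * HOURS_PER_DAY
    return float(np.sum((q[:-1] + q[1:]) * 0.5) * dt_hours)


print(round(conventional_loss(5.74, 90)))   # 12398

# Hypothetical Logistic curve calibrated to reach 5.74 m^3/h at day 435.
a, b = 8.0, 20.0
c = np.log(b / (a / 5.74 - 1.0)) / 435.0


def flow(t_days):
    return a / (1.0 + b * np.exp(-c * t_days))


print(integrated_loss(flow, 435.0))
```

For reference, the reported 35,544 m^{3} over 435 days corresponds to a mean leak flow rate of about 3.4 m^{3}/h, which is below the 5.74 m^{3}/h measured at discovery, as expected for a leak that grows over time.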

Figure 7 shows the comparison of real losses under different scenarios. For scenario 1, unreported leak A is discovered in the current detection period. Areas A1 and A2 represent real losses calculated by the leak flow rate multiplied by the calculation period, and area A2 represents real losses calculated by integrating growth functions. In this case, real losses calculated by the conventional method are larger than real losses calculated by growth functions. For scenario 2, unreported leak B is discovered in the next detection period, and real losses for the current detection period are calculated by integrating growth functions. However, the conventional method ignores real losses caused by unreported leak B. For scenario 3, reported leak C is discovered in the current detection period. Area C1 represents real losses calculated by the leak flow rate multiplied by the repair duration, and area C2 represents real losses calculated by integrating growth functions. In this case, real losses calculated by the conventional method are not necessarily greater than real losses calculated by growth functions. The results illustrate that the conventional method ignores a leak's growth process, so errors will inevitably occur when calculating real losses. The proposed method for calculating real losses by growth functions considers the dynamic evolution process of leak flow rates, which is in line with the actual situation. A novel calculation method of real losses proposed in this paper is as follows:

- (1) For unreported leakage, real losses can be calculated by integrating growth functions, which includes leaks that have been discovered by existing technologies (e.g., unreported leak A in Figure 7) and leaks that can be detected by existing technologies but are not discovered (e.g., unreported leak B in Figure 7).

- (2) For reported leakage, real losses caused by human factors are equal to the leak flow rate multiplied by the repair duration, and real losses caused by nonhuman factors can be calculated by integrating growth functions (e.g., reported leak C in Figure 7).

- (3) For the current detection period, real losses caused by unreported leak A and reported leak C correspond to detected losses, and real losses caused by unreported leak B correspond to undetected losses.

Table 5 presents a comparison of real losses between the conventional method and the proposed method using growth functions. For different calculation periods (i.e., 3, 6, 9, and 12 months), the errors between the conventional method and the proposed method are different. When the calculation period is equal to 3 months, the errors are larger than those of other calculation periods. When the calculation period is equal to 6 months, real losses calculated by the proposed method are close to real losses calculated by the conventional method. During the same calculation period, the number of leaks used to calculate real losses in the conventional method is less than the number of leaks used to calculate real losses in the proposed method. The main reason is that many leaks that can be detected by existing technologies or measures are not discovered (e.g., unreported leak B in Figure 7). For the conventional method, the calculation period is uncertain and mainly depends on the experience of practitioners. Our proposed method for calculating real losses will not be influenced by these factors. In a certain calculation period, practitioners discover fewer leaks, which does not mean that there are no other leaks on the pipelines. According to our assumptions, a leak does not occur suddenly, but will gradually grow until it is discovered. The number of leaks used to calculate real losses in each calculation period should be stable. The proposed method may be more in line with the actual situation than the conventional method.

Figure 8 presents calculation errors for real losses calculated using the conventional method and the proposed method from 2007 to 2014 in City DC. The results show that when the calculation period is equal to 6 months, the absolute percentage error is low. When the calculation period is equal to 3 months, the absolute percentage error has a large deviation. The number of leaks discovered by practitioners during a short calculation period has great uncertainty because it may be affected by factors such as practitioners’ experience, equipment performance, and ambient noise. This indicates that 6 months may be a suitable calculation period for the conventional method. In real cases, it is difficult to select an optimal calculation period for calculating real losses. The leak duration cannot be obtained for the conventional method, which may lead to calculation errors of real losses. However, the proposed method can avoid the influence of the calculation period on real losses.

Meanwhile, the growth function can be used to evaluate the effects of leakage detection. For example, Table 6 shows real losses in DN100 cast iron pipes in City DC. The detection coefficient is equal to detected losses divided by real losses, which can reflect the leakage detection efficiency of practitioners. The higher the detection coefficient, the higher the detection efficiency of practitioners. The results show that detection coefficients gradually increase from 2007 to 2014, which indicates that the detection effect has improved during these years. For some areas with low detection coefficients, water utilities should further improve leakage detection strategies (e.g., increase the frequency of leakage detection).
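The detection coefficient can be sketched as a simple ratio. The function name is hypothetical, and treating real losses as the sum of detected and undetected losses is an assumption based on the classification given for the current detection period. The input values are illustrative and are not taken from Table 6.

```python
def detection_coefficient(detected_losses, undetected_losses):
    """Detected losses divided by real losses (assumed to be detected + undetected)."""
    real_losses = detected_losses + undetected_losses
    if real_losses <= 0:
        raise ValueError("real losses must be positive")
    return detected_losses / real_losses


# Illustrative volumes in m^3, not values from the study.
print(detection_coefficient(75_000.0, 25_000.0))   # 0.75
```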

## CONCLUSIONS

This study investigates the potential of using a growth function to represent a leak's growth process. The growth function can be used to calculate real losses and assess leakage detection efficiency. The following conclusions can be drawn:

- (1) The growth function can be used to represent a leak's growth process. The Richards function is more suitable for establishing the mathematical relationship between the leak flow rate and the leak duration than the Logistic and Gompertz functions.

- (2) The leakage development model can be used to simulate a leak's growth process and optimize the parameters of growth functions. The random sampling process can be used to simulate a real leakage detection process, and its rationality has been verified.

- (3) The proposed calculation method of real losses considers a leak's dynamic evolution process, which helps provide a new angle to further understand each component of real losses. However, the conventional method ignores a leak's growth process, which may lead to large errors.

- (4) The detection coefficient may be used as an indicator to evaluate leakage detection efficiency. Water utilities could formulate leakage detection measures based on the detection coefficients of different regions.

It is suggested that future work test the leakage development model on a large leakage dataset. For small leak flow rates, the time scale could be shortened to allow more accurate assessments. In addition, the calculation method of real losses based on growth functions still needs to be improved and compared with actual real losses. Further research could investigate the influence of pipe properties (e.g., pipe age) on the proposed method.

## ACKNOWLEDGEMENT

This work was supported by the National Natural Science Foundation of China (51879139).

## DECLARATION OF COMPETING INTEREST

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

## DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.