Automatic water meter reading (AMR) is now the best kind of technology to supply real time information on water consumption. Complete equipment of a district metered area enables the assessment of the total consumption of a finite size population, for a time scale sometimes as short as an hour. However, its cost for generalization can generate high capital expenditures (CAPEX), unaffordable for the utility, in which case sampling techniques have to be set up. With the purpose of total consumption estimation, this article describes standard methods of survey techniques applied to water networks and proposes a methodology for implementation of an operational sample. The methodology, which includes some constraints on the estimator precision, proposes a smart AMR equipment plan of the population, while reducing CAPEX. Finally, estimation of the total consumption, in addition to the knowledge of supplied volume, enables more accurate loss assessment and potential detection of new leaks.

## INTRODUCTION

Problems linked to water have evolved through time and the current aim for water managers is the preservation of the resource. Losses in the French drinking water network represent on average 20% of the annual volume delivered and reduction of lost volumes has become an important goal to achieve.

Subtracting the total consumption from the total supplied volume is the most accurate way to calculate losses. Although the supplied volume is accurately known, total consumption data are often available only on a monthly or yearly time scale. Automatic meter reading (AMR) seems to be a solution to this problem as it supplies consumption indexes at a time scale shorter than a day. However, the instrumentation of a whole district metered area (DMA) can either generate high capital expenditures (CAPEX) that are unbearable for some utilities or take too long (up to 10 years in some cases) to enable a direct exploitation of AMR data. In 2008, *Lyonnaise des Eaux* began investigating the estimation of total consumption from a sample of meters equipped with AMR. The consumption estimator must be accurate enough to detect leaks on the network. Therefore, the sample has to fulfil some criteria (sample size, selection of the individuals in the sample) to generate a total consumption estimator with a controlled precision. The application of survey techniques in the case of a finite population is an answer to this issue. But, as we are dealing with longitudinal data, the efficiency of the sample worsens through time. If the initial estimator precision does not fulfil the criteria to reliably detect leaks on the water network, auxiliary information can be used to calibrate the estimator, improving its precision.

From this estimation of a DMA total consumption, losses can be assessed and it would be possible to detect leakage. Other methodologies for a more accurate leakage detection and location already exist and have been proven to be efficient, like using minimum night flow (MNF) technique (Amoatey *et al*. 2014), pressure data (Romano *et al*. 2010) or sensor devices (Sarrate *et al.* 2014). However, we present here a methodology for enhancing AMR value, not only for billing or customer profiling (Solanas & Cusso 2010), but also for network monitoring, with the aim of minimizing the investments made.

We propose here a full sampling plan for the equipment of a DMA with AMR, which leads to a total consumption estimator accurate enough to permit leakage detection. The first section of this article is devoted to the definition of the sampling plan that generates a stratified sample, after a review of the intermediate steps (strata constitution, sample size). After having illustrated the fact that additional information becomes available between the time of the sample composition and its use for real time follow-up, we evaluate, in the Methods section, the performance of the estimator calibration, using this information. We present, in the Results and discussion section, a numerical application of this methodology on real data from the fully AMR equipped city of Canéjan (France, 33). This case enables a numerical check of the method efficiency. Then, we emphasize the main benefit of using such a method: a CAPEX reduction generated by a partial AMR equipment of the DMA. Finally we introduce the advantage of using AMR data in the case of leakage estimation. The last section concludes this article and gives some perspectives.

## METHODS

Stratified sampling splits a population of size *N* into homogeneous groups according to a stratification variable, supposed to be correlated with the variable of interest and known for each individual. Then, in each group, individuals are randomly sampled. Each year, a manual reading of the meters is carried out, giving the volume of water consumed by a user during the past year. For every individual of the population, the annual consumption for the last 4 years is available. These data fulfil the criterion previously mentioned: the information is available for the whole population and is likely to be correlated with the daily consumption. Because of the evolution of the user's behaviour (demographic evolution, water saving awareness, etc.), the most suitable information is the last annual consumption. We note *Y*_{i}(*t*) as the variable of interest (individual daily water consumption for the *i*th individual at time *t*, *i* = 1, …, *N*) and *X* designates the stratification variable. The final goal of the sampling is to estimate the total of the variable *Y*_{i}(*t*):

$$T_Y(t) = \sum_{i=1}^{N} Y_i(t) \qquad (1)$$

### Selection of the sample

We first isolate the ‘big consumers’, i.e. the individuals whose annual consumption is at least 1,000 m^{3}. Despite their small size (1% of the population in our study case), these individuals represent a significant part of the total consumption (in this case, 12% of the 2010 annual consumption); for this reason, we advocate putting them in a specific stratum and sampling them exhaustively as their consumption represents influential values (Beaumont *et al.* 2013). If we try to create *L* strata, one is thus defined and *L** (= *L* − 1) are remaining. The number and the bounds of the strata are chosen so as to minimize *V*(*X*) (the variance of the stratification variable *X*). Increasing the number of strata should obviously also increase the strata homogeneity and thus decrease the dispersion. However, as Kpedekpo (1973) says: ‘the gain in efficiency, though substantial for initial increases in the number of strata, becomes marginal after a certain stage […] There may also be situations where an increase in the number of strata may even lead to less homogeneous strata, and thereby decrease the efficiency of stratified sampling.’ The number of strata is thus iteratively chosen considering the efficiency in adding one more stratum:

$$\frac{V_{l-1}(X) - V_{l}(X)}{V_{l-1}(X)} \qquad (2)$$

where $V_l(X)$ is the variance of *X* considering *l* strata. *L** is the maximal number of strata with a gain in efficiency, by adding one more stratum, greater than 1%.
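As a rough illustration of this stopping rule, the sketch below computes the relative gain obtained by adding one stratum and keeps adding strata while that gain exceeds 1%. The function names and the variance sequence `demo` are hypothetical, and stopping at the first gain below the threshold is an assumption about how the iteration is run.

```python
def efficiency_gain(v_prev, v_new):
    """Relative drop in variance when one more stratum is added."""
    return (v_prev - v_new) / v_prev


def choose_strata_count(variances, threshold=0.01):
    """variances[k] is the variance of X obtained with k + 1 strata.

    Returns the largest number of strata whose gain over the previous
    count exceeds the threshold, stopping at the first count that
    fails the test (assumption).
    """
    best = 1
    for l in range(2, len(variances) + 1):
        if efficiency_gain(variances[l - 2], variances[l - 1]) <= threshold:
            break
        best = l
    return best


# Hypothetical variance sequence, not the values of the case study.
demo = [1000.0, 800.0, 700.0, 650.0, 646.0, 645.0]
print(choose_strata_count(demo))
```

With the first two values reported in the Results section (16,019 and 13,372), `efficiency_gain` returns about 16.5%, which is the gain listed for two strata.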

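The strata bounds are then set from the cumulative square root of the frequencies of *X*: the upper bound of each stratum *h* (*x*_{h}) is the largest value such that

$$\sum_{\xi \le x_h} \sqrt{F(\xi) - F(\xi - 1)} \;\le\; \frac{h}{L^*} \sum_{\xi} \sqrt{F(\xi) - F(\xi - 1)} \qquad (3)$$

where $F(\xi)$ is the number of *X* values lower than or equal to *ξ* and *h* = {1, …, *L** − 1}. Once this step is achieved, the strata are completely defined, the next step being the selection of the sample.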
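The bound selection can be sketched as follows, assuming the rule is the cumulative square root of frequencies, which is what the increments reported in Table 2 suggest (2.45 ≈ √6 and 3.00 = √9). The function name, the integer rounding of consumptions and the toy data are assumptions made for the illustration.

```python
import math
from collections import Counter


def cum_sqrt_f_bounds(x_values, n_strata):
    """Upper bounds of strata 1..n_strata-1 by the cumulative sqrt(f) rule.

    x_values: annual consumptions, assumed rounded to integers.
    """
    freq = Counter(x_values)
    levels = sorted(freq)
    cumulative, running = [], 0.0
    for level in levels:
        running += math.sqrt(freq[level])
        cumulative.append(running)
    total = cumulative[-1]

    bounds = []
    for h in range(1, n_strata):
        target = h * total / n_strata
        # Keep the last level whose cumulative value does not exceed the target.
        eligible = [v for v, c in zip(levels, cumulative) if c <= target + 1e-12]
        bounds.append(eligible[-1] if eligible else levels[0])
    return bounds


# Toy data: four meters at each of four consumption levels.
toy = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4
print(cum_sqrt_f_bounds(toy, 4))
```

Each stratum receives the same share of the cumulative quantity, so classes with many meters are cut more finely than sparse ones.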

The sample *s* is split in each stratum *h* into sub-samples of size *n*_{h}. The size *n*_{h} is determined according to the sample allocation, *a*_{h} (percentage of the sample in the stratum *h*). As the last stratum (the ‘big consumers’) is exhaustively investigated, this sample allocation only concerns the other strata. We chose the *x*-optimal allocation (see for instance Cochran 1977) to define *a*_{h}:

$$a_h = \frac{N_h S_{X,h}}{\sum_{k=1}^{L^*} N_k S_{X,k}} \qquad (4)$$

where *N*_{h} is the size of the stratum *h* and *S*_{X,h} is the standard deviation of *X* in the stratum *h*. This approach is preferred in cases of high inter-strata dispersion (see Table 3). The size of our sample (*n*) has a direct impact on the precision of the estimator (see for instance Tillé 2001): with *n*_{h} = *a*_{h}*n*, the variance of the stratified estimator of the total is

$$V = \sum_{h} N_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{Y,h}^2 \qquad (5)$$

Given a required precision *σ* for the estimator, we can estimate the optimal size of the sample:

$$n = \frac{\sum_{h} N_h^2 S_{Y,h}^2 / a_h}{\sigma^2 + \sum_{h} N_h S_{Y,h}^2} \qquad (6)$$

where $S_{Y,h}^2$ is the variance of the variable *Y* within the stratum *h*. This quantity is, however, unknown, as *Y* is precisely what we want to estimate. As Fellegi (2010) notes, it is ‘hard to obtain and an approximation is frequently made from similar population’. A preliminary calculation has to be made on several sectors entirely equipped in order to define the sample size *n* needed with respect to *σ*. Once *n* has been estimated and allocated in each stratum, a sub-sample is randomly selected in each stratum using simple random sampling (see for instance Ardilly 2006).
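The link between the required precision and the sample size can be checked numerically. The sketch below assumes the textbook variance of a stratified simple random sample and an allocation *n*_{h} = *a*_{h}*n* over the strata that are not sampled exhaustively; the function names and the two-stratum figures are hypothetical.

```python
def stratified_variance(N, S2, n_h):
    """Variance of the estimated total for stratum sizes N, within-stratum
    variances S2 and (possibly fractional) sub-sample sizes n_h."""
    return sum(Nh ** 2 * (1.0 / nh - 1.0 / Nh) * s2
               for Nh, s2, nh in zip(N, S2, n_h))


def required_sample_size(N, S2, a, sigma):
    """Sample size giving a standard deviation sigma on the estimated total,
    for allocation shares a (summing to 1)."""
    numerator = sum(Nh ** 2 * s2 / ah for Nh, s2, ah in zip(N, S2, a))
    denominator = sigma ** 2 + sum(Nh * s2 for Nh, s2 in zip(N, S2))
    return numerator / denominator


# Hypothetical two-stratum population.
N = [100, 200]
S2 = [4.0, 9.0]
a = [0.4, 0.6]
sigma = 50.0

n = required_sample_size(N, S2, a, sigma)
check = stratified_variance(N, S2, [n * ah for ah in a])
print(round(n, 1), round(check ** 0.5, 1))
```

Plugging the resulting size back into the variance formula returns the requested standard deviation, which is a convenient sanity check before the size is rounded up to an integer.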

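The total consumption is then estimated by

$$\hat{T}_Y(t) = \sum_{h=1}^{L} \frac{N_h}{n_h} \sum_{i \in s} Y_i(t)\, \mathbb{1}_h(i)$$

where $\mathbb{1}_h(i)$ equals 1 if the individual *i* is in the stratum *h*, 0 otherwise. This estimator is a Horvitz–Thompson estimator (Cochran 1977): it is unbiased (the expected value of $\hat{T}_Y(t)$ equals the value of $T_Y(t)$).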
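A minimal sketch of this stratified estimator is given below: each sampled reading is weighted by the inverse of its sampling rate in its stratum. The function name and the dictionary layout are choices made for the example, and the figures are invented.

```python
def stratified_total(sample, stratum_sizes):
    """Estimate a population total from a stratified sample.

    sample: dict mapping a stratum label to the list of observed values.
    stratum_sizes: dict mapping a stratum label to its population size N_h.
    """
    total = 0.0
    for h, values in sample.items():
        weight = stratum_sizes[h] / len(values)  # N_h / n_h
        total += weight * sum(values)
    return total


# Invented daily readings (m3/day): 2 of 4 meters in stratum 1, 1 of 3 in stratum 2.
readings = {1: [2.0, 4.0], 2: [10.0]}
sizes = {1: 4, 2: 3}
print(stratified_total(readings, sizes))
```

When a stratum is sampled exhaustively its weight equals 1, so its contribution is the exact sub-total and adds no sampling error.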

### Calibration of the estimator

The population is split using the best information (*X*) available and correlated to the variable of interest (*Y*). However, because of the dynamic nature of *Y*, the evolution of individual behaviour can lead to strata alterations (movement of individuals from one stratum to another) due to ‘stratum jumpers’ (Rivest 1999), which harm the initial homogeneity of the strata. In other words, the larger the time gap between the sample creation and its use, the less accurate the estimator will be.

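In addition to this ‘stratum jumper’ phenomenon, new information becomes available over time, for instance new data on customers such as the latest annual consumption. This information is up to date and exhaustive, and can be used to correct the estimations. This auxiliary variable (denoted *Z*) can be used to calibrate the initial estimator, using sampling techniques of calibration. We will use one of them to improve the estimator $\hat{T}_Y(t)$: the regression calibration (see for instance Isaki 1983; Särndal *et al.* 1992; Deville & Särndal 1992).

If $\hat{T}_Y(t)$ is an estimator of the total $T_Y(t)$, the same estimator is computed for the variable *Z* ($\hat{T}_Z$ is an estimator of the total $T_Z$). As *Z* is known for the whole population, the total $T_Z$ is also known. The regression calibration then corrects $\hat{T}_Y(t)$ in proportion to the error observed on *Z*:

$$\hat{T}_Y^{cal}(t) = \hat{T}_Y(t) + \hat{\beta}(t)\left(T_Z - \hat{T}_Z\right)$$

where $\hat{\beta}(t)$ is the slope of the regression of *Y* on *Z* estimated on the sample.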

The estimator calibrated by regression is not necessarily less biased than the initial one (which was already unbiased) but the precision is improved as the calibrated estimator has a lower dispersion than the initial one (Cochran 1977).
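The calibration step can be illustrated with the classical regression estimator, in which the initial estimate is shifted by the known error made on the auxiliary total. This is a sketch under that assumption, not necessarily the exact variant used in the study; the function names, the design weights (*N*_{h}/*n*_{h} for each sampled meter) and the numbers are hypothetical.

```python
def weighted_total(values, weights):
    """Weighted estimate of a total from sampled values."""
    return sum(w * v for v, w in zip(values, weights))


def regression_calibrated_total(y, z, weights, z_total):
    """Regression estimator of the total of Y using the known total of Z.

    y, z: sampled values of the variable of interest and of the auxiliary
    variable; weights: design weight of each sampled unit.
    """
    w_sum = sum(weights)
    y_mean = weighted_total(y, weights) / w_sum
    z_mean = weighted_total(z, weights) / w_sum
    s_zy = sum(w * (zi - z_mean) * (yi - y_mean)
               for yi, zi, w in zip(y, z, weights))
    s_zz = sum(w * (zi - z_mean) ** 2 for zi, w in zip(z, weights))
    beta = s_zy / s_zz  # weighted regression slope of Y on Z
    return weighted_total(y, weights) + beta * (z_total - weighted_total(z, weights))


# Hypothetical sample where Y is exactly twice Z.
z_sample = [1.0, 2.0, 3.0, 4.0]
y_sample = [2.0, 4.0, 6.0, 8.0]
w = [2.0, 2.0, 3.0, 3.0]
calibrated = regression_calibrated_total(y_sample, z_sample, w, z_total=30.0)
print(round(calibrated, 6))
```

In this toy case the uncalibrated estimate is 54 while the calibrated one recovers twice the known total of *Z*: the stronger the link between *Y* and *Z*, the larger the reduction in dispersion.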

## RESULTS AND DISCUSSION

### Application of the methodology

The numerical results of this article come from the city of Canéjan (33): the population is composed of 1,822 water meters, all equipped with AMR. The data available are the consumption, every 6 hours, between 01/01/2011 and 12/31/2012, for the entire population. The key point of using this population is to be able to compare the estimation results with the real data and verify the relevance of the methodology previously described. The time step chosen is the day in order to use the estimation for assessing the daily water loss.

There is no way to directly assess the performance of the sampling method from just one sample, because the result depends on which individuals happen to be selected. To evaluate this plan numerically, we chose a Monte Carlo approach, repeatedly drawing a sample according to the plan recommended above. The results presented hereafter are averaged over 25,000 sample draws. The stratified estimator is a Horvitz–Thompson estimator, which is unbiased (Cochran 1977); we therefore mainly assess the standard deviation (sd.) of the estimator. We first estimate the daily consumption for year 2011 and then we estimate and calibrate the estimator for year 2012.
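The Monte Carlo evaluation can be sketched on a synthetic population: samples are drawn repeatedly with the same plan, and the mean and standard deviation of the resulting estimates are compared with the true total. Everything below (population, stratum sizes, number of draws, seeds) is invented for the illustration and far smaller than the case study.

```python
import random
import statistics


def stratified_estimate(strata, n_h, rng):
    """One draw: simple random sampling without replacement in each stratum."""
    estimate = 0.0
    for values, n in zip(strata, n_h):
        picked = rng.sample(values, n)
        estimate += len(values) / n * sum(picked)
    return estimate


def monte_carlo(strata, n_h, draws=2000, seed=0):
    """Mean and standard deviation of the estimator over repeated draws."""
    rng = random.Random(seed)
    estimates = [stratified_estimate(strata, n_h, rng) for _ in range(draws)]
    return statistics.mean(estimates), statistics.stdev(estimates)


# Synthetic daily consumptions; the third stratum is sampled exhaustively.
pop_rng = random.Random(1)
strata = [
    [pop_rng.uniform(0.0, 1.0) for _ in range(50)],
    [pop_rng.uniform(1.0, 3.0) for _ in range(40)],
    [pop_rng.uniform(5.0, 20.0) for _ in range(5)],
]
true_total = sum(sum(values) for values in strata)
mean_est, sd_est = monte_carlo(strata, n_h=[20, 16, 5])
print(round(true_total, 1), round(mean_est, 1), round(sd_est, 1))
```

The mean of the draws should sit very close to the true total, while the standard deviation is the quantity compared with the precision target.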

#### Sampling plan

The methodology described in the Methods section enables the construction of the sampling plan: following Equation (2) the population is split into 11 strata (including the ‘big consumers’) using the 2010 annual consumption as stratification variable as shown in Table 1.

Number of strata *l* | Variance of *X* with *l* strata | Gain in efficiency |
---|---|---|
1 | 16,019 | |
2 | 13,372 | 16.50% |
… | … | … |
9 | 11,088 | 1.10% |
10 | 10,972 | 1.10% |
11 | 10,869 | 0.90% |


According to Table 1, the optimal number of strata is 10 (the gain in stratum homogeneity is marginal by selecting more than 10) plus the ‘big consumers’ stratum. Regarding the bounds of the strata, the cumulative quantity of Equation (3) totals 655. Table 2 presents an example of selection of the upper bound of the first stratum.

Annual consumption (m^{3}/year) | Cumulative quantity of Equation (3) |
---|---|
… | … |
30 | 62.51 |
31 | 64.96 |
32 | 67.96 |


As (1/10) × 655 = 65.5 and according to Equation (3), the upper bound of the first stratum would be 31 m^{3}/year: its cumulative value (64.96) is the last one that does not exceed 65.5. The other strata boundaries are presented in Table 3.

Stratum | Bounds (m^{3}/year) | Stratum size | Variance | Sample allocation (%) | Sample size |
---|---|---|---|---|---|
1 | [ 0 ; 30 [ | 180 | 100 | 13 | 108 |
2 | [ 30 ; 49 [ | 173 | 33 | 7 | 55 |
3 | [ 49 ; 65 [ | 205 | 19 | 7 | 51 |
4 | [ 65 ; 79 [ | 200 | 20 | 6 | 46 |
5 | [ 79 ; 94 [ | 198 | 19 | 7 | 49 |
6 | [ 94 ; 109 [ | 191 | 19 | 6 | 47 |
7 | [ 109 ; 129 [ | 180 | 34 | 8 | 57 |
8 | [ 129 ; 150 [ | 174 | 31 | 8 | 57 |
9 | [ 150 ; 185 [ | 159 | 111 | 13 | 95 |
10 | [ 185 ; 1,000 [ | 149 | 22,071 | 23 | 149 |
11 | [ 1,000 ; [ | 13 | 1,277,965 | 2 | 13 |


Concerning the choice of the sample size, the use of data coming from an entirely equipped DMA enables us to estimate the value of the sample size *n* according to an expected precision *σ*. We calculate from Equation (6) a mean value of *n/N* for a range of values of *σ* between 0 and 50 m^{3}/day. Figure 1 plots the required sample rate *n/N* against the expected precision *σ*.

As we try to assess leakage on the drinking water network, we set *σ* equal to 13 m^{3}/day, corresponding to an estimation of a daily leakage flow for a connection pipe (Lambert *et al.* 1999). This means that the standard error made by estimating the total consumption does not exceed this flow, so that a leak of this size remains detectable on the network. For *σ* = 13 m^{3}/day, the sample size has to be at least 727, which represents a sample rate (*n/N*) of 40%. These individuals have been selected from the first 10 strata by simple random sampling, and the size of the sample in each stratum has been defined by the *x*-optimal allocation. Only the last stratum (the ‘big consumers’) has been exhaustively investigated. The sampling plan is presented in Table 3.

We can notice that, because of the high dispersion of the 10th stratum, the allocation technique leads to investigating it exhaustively.
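The saturation of the 10th stratum can be reproduced with a short sketch of the *x*-optimal allocation in which any stratum allocated more meters than it contains is taken exhaustively and the remainder is re-allocated. The capping loop and the function name are assumptions; the stratum sizes and variances are those of Table 3 (strata 1 to 10), and 714 is the sample size left once the 13 big consumers are set aside. Because the published variances are rounded, the result is close to, but not identical with, the published sample sizes.

```python
import math


def x_optimal_allocation(N, var_x, n):
    """Allocate n units proportionally to N_h * S_h, capping each stratum at N_h."""
    sizes = [0.0] * len(N)
    free = set(range(len(N)))
    remaining = float(n)
    while True:
        denom = sum(N[h] * math.sqrt(var_x[h]) for h in free)
        for h in free:
            sizes[h] = remaining * N[h] * math.sqrt(var_x[h]) / denom
        over = {h for h in free if sizes[h] > N[h]}
        if not over:
            return sizes
        # Strata that would exceed their size are sampled exhaustively.
        for h in over:
            sizes[h] = float(N[h])
            remaining -= N[h]
        free -= over


# Strata 1 to 10 of Table 3 (sizes and variances of X).
N = [180, 173, 205, 200, 198, 191, 180, 174, 159, 149]
var_x = [100, 33, 19, 20, 19, 19, 34, 31, 111, 22071]
sizes = x_optimal_allocation(N, var_x, 714)
print([round(s) for s in sizes])
```

The very large variance of the 10th stratum makes its first-pass allocation exceed its 149 meters, so it is capped and the remaining units are shared among strata 1 to 9.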

#### Estimation results

As we estimate the total daily consumption for the whole year 2011, we obtain 365 estimators (one for each day). We average the results of 25,000 simulations in order to obtain one estimator for each day and its dispersion.

As shown in Table 4, the estimator matches the criterion imposed, as the average value of its standard deviation (resp. median value) is 13 m^{3}/day (resp. 12 m^{3}/day). The sampling plan proposed here is efficient with respect to the imposed constraint: to obtain an estimator of the total daily consumption accurate enough for the detection of leaks.

Statistic | Sd. of the estimator (m^{3}/day) |
---|---|
1st Quart. daily sd. | 10 |
Median daily sd. | 12 |
Average daily sd. | 13 |
3rd Quart. daily sd. | 18 |
% daily sd. ≤ 13 m^{3}/day | 54% |


#### Calibration of the stratified estimator

Let us suppose that the same sample (see Table 3) is used for an estimation of the total daily consumption in 2012. Because of the *stratum jumper* phenomenon, the strata are no longer homogeneous and there is no guarantee that the criterion imposed (an average value of standard deviation lower than or equal to 13 m^{3}/day) is still respected. The use of updated data, like the 2011 annual consumptions, enables us to mitigate this problem. The regression calibration of the initial estimator, using this variable, does indeed reduce the dispersion of the initial estimator, as is shown in Table 5.

Statistic | Sd. of the initial estimator (m^{3}/day) | Sd. of the calibrated estimator (m^{3}/day) | Evolution initial vs calibrated (%) |
---|---|---|---|
1st Quart. daily sd. | 12 | 11 | −8 |
Median daily sd. | 15 | 13 | −13 |
Average daily sd. | 19 | 15 | −21 |
3rd Quart. daily sd. | 24 | 22 | −8 |
% daily sd. ≤ 13 m^{3}/day | 39% | 51% | +31 |


Scenario | Sample size* (sample rate) | CAPEX** (compared to *P*_{0}) |
---|---|---|
*P*_{0} | 727 (40%) | 54 k€ |
*P*_{1} | 1,687 (93%) | 120 k€ (+124%) |
*P*_{2} | 1,804 (99%) | 125 k€ (+133%) |
*P*_{3} | 791 (43%) | 58 k€ (+7%) |
*P*_{4} | 833 (46%) | 61 k€ (+13%) |
*P*_{5} | 1,291 (71%) | 93 k€ (+72%) |
*P*_{6} | 770 (42%) | 57 k€ (+6%) |


*Sample size needed to reach the same precision as *P*_{0}.

**Assuming an equipment cost of €70/meter.

The calibration of the estimator clearly improves its precision. Of the 366 daily estimators, only 39% initially meet the precision criterion, against 51% after calibration.

### Interest of the methodology

We have proposed a methodology to partially equip a DMA with AMR for an estimation of the total consumption. The relevance of using this method instead of simple random sampling could be questioned. We compare (Table 6) our plan (called *P*_{0}: sample plan as defined in Table 3 + calibration) with other sampling scenarios (each scenario exhaustively includes the ‘big consumers’, the plan only concerns households):

- *P*_{1}: a simple random sampling with equal sampling probability,
- *P*_{2}: a simple random sampling with sampling probability proportional to *X*,
- *P*_{3}: a stratified sampling with strata boundaries roughly defined by technical experts (expert-defined strata) and *x*-optimal allocation in each stratum,
- *P*_{4}: a stratified sampling with expert-defined strata and stratum allocation proportional to *X*,
- *P*_{5}: a stratified sampling with expert-defined strata and allocation constant in each stratum,
- *P*_{6}: a stratified sampling as defined in Table 3, without calibration.

The main idea is to compare the gap in precision between these scenarios. But, as we saw in Equation (5), precision is linked to the sample size. This is why we present, for the scenarios *P*_{1} to *P*_{6}, the sample size needed to reach the precision obtained with *P*_{0} at a sample rate of 40%.

As there are no specific necessary investments for the methodology, the plan *P*_{0} is the one requiring the least CAPEX. There are few differences between *P*_{0} and *P*_{6}, which would lead us to think that calibration is not that efficient. However, in this case, there is a difference of only 1 year between the stratification variable (2010 annual consumption) and the calibration variable (2011 annual consumption); the calibration efficiency increases as the time between the two variables grows.

### Perspectives

The methods presented hereafter are just an example of what could be carried out using estimated consumption, especially on losses assessment and leakage detection.

#### From water losses assessment …

Real-time consumption estimation enables the assessment of real-time losses. Knowing the supplied volumes, the losses can be calculated using the estimator of the total consumption. We use the sample previously created (see Table 3), for an estimation of the total daily consumption between January 1st 2011 and April 1st 2011. We draw (Figure 2) the supplied volumes, the estimated losses and a 95% confidence interval of lost volumes, using the standard deviation obtained from the consumption estimation.
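The loss calculation described above amounts to a subtraction and an interval around it. The sketch below assumes a normal approximation (hence the 1.96 factor) and treats the supplied volume as exactly known; the function name and the volumes are hypothetical.

```python
def loss_interval(supplied, consumption_estimate, consumption_sd, z=1.96):
    """Estimated daily loss and its approximate 95% confidence interval.

    The only uncertainty considered is the one on the consumption estimate.
    """
    loss = supplied - consumption_estimate
    half_width = z * consumption_sd
    return loss - half_width, loss, loss + half_width


# Hypothetical day: 1,000 m3 supplied, 800 m3 estimated consumption, sd of 13 m3.
low, loss, high = loss_interval(1000.0, 800.0, 13.0)
print(round(low, 2), loss, round(high, 2))
```

With a standard deviation of 13 m^{3}/day on the consumption, the interval is about ±25 m^{3}/day wide, which gives an idea of the smallest change in losses that can be distinguished from estimation noise on a single day.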

#### … to leakage detection

As fluctuations on the estimated loss curve can mainly be due to inaccuracies on the consumption estimator, it will be hard to detect new leaks on the network. Statistical process control provides charts that evaluate the quality of a process (in our case the losses). We just provide here an example of techniques that can be used to detect leaks.

Denoting by *L*_{t} the estimated lost volume at time *t*, we build a new variable, the exponentially weighted moving average (EWMA) statistic, named *R*_{t}:

$$R_t = \lambda L_t + (1 - \lambda) R_{t-1}$$

where 0 < *λ* ≤ 1 is a parameter weighting the effect of previous values of *L*_{t}. The process control consists in drawing on the same chart *R*_{t} and two control limits [*LCL* ; *UCL*] (Lucas & Saccucci 1990). As the consumption is estimated and as there could be meter inaccuracies, it is possible that *R*_{t} fluctuates through time without the actual presence of a leak. These limits avoid flagging an abnormality in the loss process when the fluctuation merely reflects inaccuracy in the calculation of the lost volumes.

*LCL* and *UCL* are built using a starting level (named *μ*) from which *R*_{t} can fluctuate:

$$LCL = \mu - k\,\sigma_L \sqrt{\frac{\lambda}{2-\lambda}\left[1-(1-\lambda)^{2t}\right]}, \qquad UCL = \mu + k\,\sigma_L \sqrt{\frac{\lambda}{2-\lambda}\left[1-(1-\lambda)^{2t}\right]}$$

where *σ*_{L} is the standard deviation of *L*_{t} and *k* sets the width of the control band in multiples of this standard deviation. *σ*_{L} depends on the standard deviation of the consumption estimator but also on uncertainty due to measurement devices, as pointed out by Mamo & Juran (2014) in their review of uncertainty sources. We consider, however, that the latter is negligible compared with the consumption estimation error, considering the orders of magnitude; in such a case *σ*_{L} is equal to the standard deviation of the consumption estimator. Moreover, we can notice that, as most of the meters have been replaced because of AMR equipment, meter under-registration is almost null.

In this study case, we chose *μ* as the threshold of background water loss, which means that the starting value for losses is the background losses (below that value, losses would be undetectable) and any increase in the process suggests the presence of a detectable leak. If at any time *R*_{t} > *UCL* then the process is said to be out of control, which means that a leak has occurred on the network. Figure 3 provides a simplified example of an EWMA chart for the estimated losses. The chart has been implemented with the R package *qcc* (Scrucca 2004).
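For readers who want to experiment outside R, the chart logic can be sketched in a few lines of Python. This is not the *qcc* implementation: it uses the usual time-varying EWMA limits, starts the statistic at *μ*, and takes *λ* = 0.2 and a band of 3 standard deviations as illustrative values. The function name and the loss series are hypothetical.

```python
import math


def ewma_chart(losses, mu, sigma, lam=0.2, k=3.0):
    """Return one (R_t, LCL_t, UCL_t, alarm) tuple per observation."""
    rows = []
    r = mu  # the statistic starts at the reference level mu (assumption)
    for t, loss in enumerate(losses, start=1):
        r = lam * loss + (1.0 - lam) * r
        width = k * sigma * math.sqrt(
            lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * t))
        )
        rows.append((r, mu - width, mu + width, r > mu + width))
    return rows


# Hypothetical losses (m3/day): background level of 100, then a leak adding 100.
daily_losses = [100.0] * 5 + [200.0] * 5
chart = ewma_chart(daily_losses, mu=100.0, sigma=13.0)
alarms = [alarm for _, _, _, alarm in chart]
print(alarms.index(True) + 1)
```

A smaller *λ* smooths the estimation noise more strongly but delays the alarm, so the choice is a trade-off between false alerts and detection time.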

During the considered period, three interventions were recorded on the network: two burst repairs on a connection pipe, on January 13th and 14th, and a leak repair on a main pipe after an active leakage control on March 31st. As we can see in Figure 3, the process control easily detects the bursts on a connection pipe. It takes some days to detect invisible leaks, but at the end of January the chart signals the presence of the leak, whereas the repair only occurs two months later.

The use of such a technique for network monitoring could have shortened the period during which the leak developed.

## CONCLUSION

The methodology, described in the first part of this article, efficiently estimates the total daily consumption, considering the criterion imposed for the precision of the estimator. Some points have to be clearly stated, like the size of the sample, which can be defined according to operational (as in this example), economic or contractual constraints.

This sample plan, as efficient as it may be, nevertheless becomes obsolete because of the deterioration of the relation between the variable of interest and the stratification variable. Despite the unbiasedness of the estimator, the precision keeps deteriorating as time elapses. Calibration techniques, like regression calibration, mitigate this problem; the regression calibration is an efficient method to reduce the standard deviation of the initial estimator, using an updated variable and its link with the variable of interest.

As we have shown, this methodology reduces CAPEX, as it does not require any additional investments and is more efficient than standard sampling plans. Estimation of total consumption can thus lead to assessment of lost volumes. A control chart of the lost volumes seems to be an efficient tool for real-time monitoring of the network by alerting the operator to a leak occurrence.

Finally, associated with a leakage model, control charts of lost volume would enable the operator to model leakage on the network and optimize the water network monitoring with early leakage detection.

The work presented in the Perspectives section does not claim to replace existing methods or technologies for leakage detection; rather, it introduces a way of enhancing AMR data for network management and, to some extent, of improving the existing techniques. For instance, the MNF technique has been proven to be efficient; however, it is often based on an approximation of customer night use (CNU) (Amoatey *et al*. 2014). Estimation of consumption from AMR data should hence be complementary to the MNF technique by enabling the estimation of CNU.

## ACKNOWLEDGEMENTS

The authors wish to thank the editor as well as the reviewers for their valuable comments on the manuscript and their constructive suggestions.