## Abstract

Water consumption varies with time of use, season and the socio-economic status of consumers, and can be treated as a continuous random variable. Incorporating this probabilistic nature into water-consumption modelling leads to more realistic assessments of the performance of water distribution systems. Furthermore, fitting water-consumption patterns to a suitable statistical distribution helps to determine how often peaks will occur, or the probability of exceeding the peaking factor in a system, for incorporation into design calculations. Few studies in the literature have considered the random variations of consumption. The purpose of this study is to evaluate real water-consumption data from the United Kingdom (UK) and North America and to investigate the possibility of establishing a standard probability distribution function for simulating water consumption in developed countries. Daily water-consumption data for 5 years (2009–2013) were obtained from water companies in the UK and North America and analysed by fitting them to normal, log-normal, log-logistic and Weibull distributions. Statistical modelling was performed using the MINITAB version 18 statistical package. The Anderson-Darling goodness-of-fit test was used to show how well each selected statistical distribution fits the water-consumption data.

## INTRODUCTION

A major unresolved problem in water-consumption modelling is the identification of a statistical distribution that best represents the water-consumption pattern. Fitting water-consumption patterns to a suitable statistical distribution helps to determine how often peaks will occur, or the probability of exceeding the peaking factor in the system, so that these can be incorporated into design calculations on a sound statistical basis. The aim of this research is to study real water-consumption data and to find a standard statistical distribution for use in water-consumption modelling that addresses the probabilistic nature of water consumption. The advantage of modelling real water-consumption data is that it permits forecasting of the probability of occurrence of any consumption value and provides confidence in future projections.

Relatively few studies have considered the random variations of water consumption. It is often assumed, usually with insufficient justification, that variation in water consumption in distribution systems follows the normal distribution. Furthermore, there are inadequate reliable data regarding the suitability of various statistical distributions for modelling water consumption. Goulter & Bouchart (1990), Bao & Mays (1990), Xu & Goulter (1997, 1998, 1999), Syntetos & Boylan (2001, 2005) and Kwietniewski (2003) assumed that water consumption has a normal distribution. Mays (1994) used randomly generated water-consumption data drawn from a range of distributions to study the sensitivity of a system's performance to changes in water-consumption patterns. Khomsi *et al.* (1996) stated that the consumption of water has a normal distribution based on the Kolmogorov-Smirnov (KS) test. However, the KS test is more sensitive near the centre of the distribution than at the tails and is therefore not well suited to validating water-consumption data, because the high-consumption data points lie in the tail of the distribution. Further research papers by De Marinis *et al.* (2007), Tricarico *et al.* (2007) and Gato-Trinidad & Gan (2012) support the effectiveness of the normal distribution by means of rigorous statistical inferences on real data.

The American Water Works Association (AWWA) Research Foundation sponsored a study (Bowen *et al.* 1993) of residential water-use patterns in the USA, and the results revealed that the demand data were not normally distributed. Several data transformations to improve the data analysis were investigated, and it was found that the log transformation was only mildly effective in reducing the positive skewness of the frequency distributions of the data.

Surendran & Tanyimboh (2005) and Tanyimboh *et al.* (2004) addressed the modelling of short-term consumption variations in a comprehensive way using UK water-consumption data, and concluded that the data fitted a long-tailed distribution better than a normal distribution. However, the findings were limited to UK water-consumption data.

The log-logistic distribution resembles the log-normal in shape, but it has a more tractable form and is one of the few distributions for which the probability density, cumulative distribution and quantile functions exist in simple closed form (Kleiber 2004). Furthermore, it can cope well with outliers in the upper tail (Dey & Kundu 2004). The log-logistic distribution has been used by Swamee (2002), El-Saidi *et al.* (1990) and Rowinski *et al.* (2001) in hydrological studies (frequency analysis of multi-year drought durations, precipitation data and flood frequency analysis) and in survival (reliability) analysis, where outliers occur in the upper tail. Gargano *et al.* (2016, 2017) stated that the log-logistic distribution is the best fit for real water-consumption data.

Ashkar & Mahdi (2006), Cordeiro *et al.* (2012) and Ramos *et al.* (2013) described the log-logistic distribution in detail and concluded that the log-logistic model is suitable for positively skewed data and positive random variables. As water consumption is a positive random variable and is positively skewed, this supports the suitability of the log-logistic distribution for modelling water consumption.

### Identifying a suitable statistical distribution

It is generally assumed that water consumption follows a normal distribution, and the literature review shows that past studies supported this assumption. However, water consumption varies as a result of weather patterns, fire incidents and leakages, and these scenarios lead to extreme usage conditions. Consequently, because of occasional high consumption, the data would be expected to fit a positively skewed distribution better than a normal distribution.

As a preliminary check, normal probability plots were drawn to identify the distribution patterns of the real water-consumption data obtained from the two water companies. The plots were drawn for the 20 data sets (yearly data) and show that two of the 20 sets follow a normal distribution and the other 18 sets follow a positively skewed distribution.

This suggests that water-consumption data will fit well into positively skewed distributions such as the log-normal, log-logistic and Weibull distributions.

This study used these distributions to select the best fit for water-consumption modelling. The normal distribution was used for comparison. The four distributions are described in Table 1, together with their suitability for modelling water consumption.

Distribution | Description | Suitability for water-consumption data modelling | Parameters of the probability density function (PDF) |
---|---|---|---|
Normal | Has the familiar symmetrical bell shape, and its sample space extends from minus to plus infinity. | Water-consumption data contain only positive values and typically show skewed frequency patterns, so the normal distribution must be approached with caution, particularly if inferences focus on the tails of the distribution. | μ is the mean of the distribution and σ is the standard deviation. |
Log-normal | Logarithmically related to the normal distribution and shows considerable flexibility of shape, which is always skewed to the right. | Its sample space admits only positive values, which makes it suitable for analysing water-consumption data. | μ is the mean and σ is the standard deviation. |
Log-logistic | Resembles the log-normal in shape but has a more tractable form. It can cope well with outliers in the upper tail. | It is unimodal, defined only for positive random variables and positively skewed, which represents the water-consumption pattern well. | A shape parameter (as it increases, the density becomes more peaked) and a scale parameter. |
Weibull | Depending on the values of its parameters, the Weibull distribution can be used to model a variety of life behaviours and provides a better distribution for life-length data. | To use the Weibull distribution in analysis, it is essential to have a well-justified estimate of the shape parameter in order to replicate the distribution pattern accurately. | β is the shape parameter and η is the scale parameter. |
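
For reference, the four densities can be written in their standard textbook forms, as sketched below. The symbols μ, σ, β and η follow Table 1. Two points are assumptions made here for illustration: the log-logistic symbols α (scale) and β (shape) are not taken from the source, and for the log-normal case μ and σ are taken to be the mean and standard deviation of ln x.

```latex
% Normal, -\infty < x < \infty
f(x) = \frac{1}{\sigma\sqrt{2\pi}}
       \exp\!\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right]

% Log-normal, x > 0 (\mu, \sigma assumed to describe \ln x)
f(x) = \frac{1}{x\,\sigma\sqrt{2\pi}}
       \exp\!\left[-\frac{(\ln x-\mu)^{2}}{2\sigma^{2}}\right]

% Log-logistic, x > 0 (\alpha = scale, \beta = shape; symbols assumed)
f(x) = \frac{(\beta/\alpha)\,(x/\alpha)^{\beta-1}}
            {\left[1+(x/\alpha)^{\beta}\right]^{2}}

% Weibull, x \ge 0 (\beta = shape, \eta = scale)
f(x) = \frac{\beta}{\eta}\left(\frac{x}{\eta}\right)^{\beta-1}
       \exp\!\left[-\left(\frac{x}{\eta}\right)^{\beta}\right]
```

Statistical packages may parameterise the log-logistic distribution differently, for example through the location and scale of ln x, so fitted values should be read against the parameterisation of the software used.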

### Description of data and the relative water distribution systems

The daily water-consumption data for the 5 years from 1st April 2009 to 1st April 2013 were obtained from a water utility company in the North-West region of England for use in this research. The water works system delivers water to approximately 6.7 million households and businesses in the UK. The data were collected at the supply end of the network using flow meters connected either to telemetry (live readings) or to data loggers. The data loggers record the number of pulses within each 15-minute interval. This count is then used to calculate the average flow rate during the 15-minute period, depending on the pulse setting (i.e. how many litres per pulse). These raw data are imported daily into the data management system and converted to hourly and daily volumes.

To analyse North American daily water consumption, data for three demand zones in a Canadian city in the province of Manitoba were obtained. The data were collected by the water services division using flow meters connected to data loggers at the water treatment plant. The data were received between 1st January 2009 and 31st December 2014. The water supply system delivers an average of 225 million litres of water per day to approximately 270,000 households and businesses across approximately 297 square kilometres (114 square miles) of developed area in Canada.
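
The conversion from logger pulse counts to flow rates and daily volumes can be sketched as follows. This is a minimal illustration, not the utility's actual processing: the function names, the pulse setting of 10 litres per pulse and the pulse counts are hypothetical.

```python
from typing import List

INTERVAL_SECONDS = 15 * 60  # logging interval stated in the text


def average_flow_l_per_s(pulses: int, litres_per_pulse: float) -> float:
    """Average flow rate (litres per second) over one 15-minute interval."""
    return pulses * litres_per_pulse / INTERVAL_SECONDS


def daily_volume_litres(pulse_counts: List[int], litres_per_pulse: float) -> float:
    """Total volume for a day's worth of 15-minute pulse counts."""
    return sum(pulse_counts) * litres_per_pulse


# Hypothetical day: 96 intervals of 1,200 pulses at 10 litres per pulse.
counts = [1200] * 96
flow = average_flow_l_per_s(counts[0], litres_per_pulse=10.0)
volume_ml = daily_volume_litres(counts, litres_per_pulse=10.0) / 1e6
print(f"average flow: {flow:.2f} L/s, daily volume: {volume_ml:.3f} ML")
```

Hourly volumes follow in the same way by summing the four intervals in each hour.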

## METHODOLOGY

In this research, a suitable statistical distribution was selected using a descriptive analysis. Data were screened and sorted by plotting raw demand data against time. This provided a quick check for abnormal data. If the points were homogeneously distributed and there were no negative points, the data were considered acceptable for use in the analysis. Similarly, if there were any inconsistencies in the distribution, the time-series graphs showed the abnormal data points to be removed prior to analysis.

The data were then analysed using the MINITAB version 18 statistical package and were fitted to a suitable probability distribution. As previously described, the descriptive analysis showed that the data fit well into a positively skewed distribution, so log-normal, Weibull and log-logistic distributions were applied to find an appropriate distribution. The normal distribution was used for comparison purposes.
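
The screening in this study was done visually from time-series plots. A programmatic analogue is sketched below under stated assumptions: the rule used here (flag negative values and values more than `k` median absolute deviations from the median) and the threshold `k = 5` are illustrative choices, not the authors' criteria, and the demand values are hypothetical.

```python
from statistics import median


def screen_daily_demand(values, k=5.0):
    """Split daily demands into (kept, flagged).

    A value is flagged if it is negative or lies more than k median
    absolute deviations (MAD) from the median. The MAD rule is an
    assumed stand-in for visual inspection of the time-series plot.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    kept, flagged = [], []
    for day, v in enumerate(values):
        is_outlier = mad > 0 and abs(v - med) > k * mad
        if v < 0 or is_outlier:
            flagged.append((day, v))  # keep the day index for review
        else:
            kept.append(v)
    return kept, flagged


# Hypothetical week of daily demand with one negative and one extreme value.
demand = [1790, 1805, 1778, -5, 1812, 2950, 1796]
kept, flagged = screen_daily_demand(demand)
print("kept:", kept)
print("flagged (day, value):", flagged)
```

Flagged points would still need a manual check, because a genuine peak-demand day should not be discarded simply for being large.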

### Analysing data

Once data have been fitted to a distribution, a goodness-of-fit test should be used to assess how well the data fit that distribution. The parameters of the distribution, such as location, shape and scale, are also essential to describe it.

#### The Anderson-Darling test

The appropriateness of each candidate distribution (normal, log-normal, log-logistic and Weibull) for the water-consumption data was assessed using the Anderson-Darling (AD) goodness-of-fit test. The AD test (Stephens 1974) is used to test whether a sample of data came from a population with a specific distribution. It is a modification of the Kolmogorov-Smirnov (KS) test that gives more weight to the tails. The KS test is distribution free in the sense that its critical values do not depend on the specific distribution being tested, whereas the AD test makes use of the specific distribution in calculating critical values. This has the advantage of allowing a more sensitive test and the disadvantage that critical values must be calculated for each distribution. Critical values were calculated, tabulated and published by Stephens (1974) for a few specific distributions, including the log-logistic distribution.

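The AD statistic, $A^2$, is

$$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln \omega_i + \ln\left(1-\omega_{n+1-i}\right)\right]$$

where $n$ is the number of observations and $\omega_i$ is the value of the cumulative distribution function of the distribution in question at the $i$th observation, with the observations sorted in ascending order. A smaller AD value indicates that the distribution fits the data better. The critical value of the AD statistic at the 5% significance level (95% confidence) is 2.492, and the 1% point is 3.857, for $n \geq 5$ (Johnson 2000). The AD test was preferred to the KS test because of the latter's lack of sensitivity in the tails (Ahmad *et al.* 1988; Johnson 2000).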
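
The AD statistic can be computed directly from its standard definition, as in the sketch below. It is not the MINITAB implementation. The sample is hypothetical, and the normal and log-normal parameters are estimated here by simple closed-form maximum likelihood for illustration only.

```python
import math
from statistics import fmean, pstdev


def anderson_darling(sample, cdf):
    """A^2 = -n - (1/n) * sum_{i=1..n} (2i-1) [ln w_i + ln(1 - w_{n+1-i})],
    where w_i is the candidate CDF at the i-th smallest observation."""
    x = sorted(sample)
    n = len(x)
    eps = 1e-12  # keep CDF values strictly inside (0, 1) so the logs are finite
    w = [min(max(cdf(v), eps), 1.0 - eps) for v in x]
    total = sum(
        (2 * i - 1) * (math.log(w[i - 1]) + math.log(1.0 - w[n - i]))
        for i in range(1, n + 1)
    )
    return -n - total / n


def normal_cdf(mu, sigma):
    return lambda v: 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))


def lognormal_cdf(mu, sigma):
    # mu and sigma describe ln(x)
    return lambda v: normal_cdf(mu, sigma)(math.log(v))


# Hypothetical daily demands with a longer upper tail.
demand = [1745, 1752, 1760, 1766, 1772, 1781, 1795, 1830, 1880, 1960]
logs = [math.log(v) for v in demand]
a2_normal = anderson_darling(demand, normal_cdf(fmean(demand), pstdev(demand)))
a2_lognormal = anderson_darling(demand, lognormal_cdf(fmean(logs), pstdev(logs)))
print(f"A2 normal = {a2_normal:.3f}, A2 log-normal = {a2_lognormal:.3f}")
```

The candidate with the smaller value is the better fit in the AD sense. Other distributions can be compared in the same way by passing their fitted cumulative distribution functions.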

#### Parameter estimates

The location and scale parameters are associated with central tendency and dispersion, respectively, and are essential to describe the distribution. For the normal distribution, the parameters are the mean and standard deviation, which correspond directly to the location and scale parameters. The log-normal, log-logistic and Weibull distributions use location, shape or scale as their parameters and, unlike the normal distribution, their parameters must be transformed using more complex equations to obtain the mean and standard deviation.

These parameters give a distribution flexibility and effectiveness in modelling applications. In simple terms, the shape parameter allows a distribution to take on a variety of shapes depending on its value. The effect of the location parameter is to shift the graph to the left or right along the horizontal axis. The scale parameter describes how far the probability distribution function is stretched.
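
As an illustration of such a transformation, the sketch below converts log-logistic location and scale parameters into a mean and standard deviation. It assumes the parameterisation in which ln X follows a logistic distribution with the given location and scale, which is an assumption about the software convention rather than something stated in the text. Under that assumption the k-th raw moment is exp(k·loc)·(kπ·scale)/sin(kπ·scale), valid for k·scale < 1. The function name is hypothetical, and the inputs are the 2009 UK values from Table 2.

```python
import math


def loglogistic_mean_sd(loc, scale):
    """Mean and standard deviation of X when ln X ~ Logistic(loc, scale).

    Uses E[X^k] = exp(k*loc) * t / sin(t) with t = k*pi*scale, which
    requires k*scale < 1. The variance therefore needs scale < 0.5.
    """
    if not 0.0 < scale < 0.5:
        raise ValueError("variance is finite only for 0 < scale < 0.5")

    def raw_moment(k):
        t = k * math.pi * scale
        return math.exp(k * loc) * t / math.sin(t)

    mean = raw_moment(1)
    variance = raw_moment(2) - mean ** 2
    return mean, math.sqrt(variance)


# Location and scale taken from the 2009 UK row of Table 2.
mean_2009, sd_2009 = loglogistic_mean_sd(7.485, 0.01776)
median_2009 = math.exp(7.485)  # the median is simply exp(location)
print(f"median = {median_2009:.1f}, mean = {mean_2009:.1f}, sd = {sd_2009:.1f}")
```

These are model-implied moments and need not coincide with sample statistics computed directly from the data.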

#### The graphical method

Various numerical and graphical methods are used in the literature for estimating the parameters of a probability distribution. In this study, graphical methods were used for the analysis, together with the maximum likelihood method, to draw the probability plots (see Figures 1 and 2 in the following section). The data were analysed using 95% confidence intervals (5% significance level) and were fitted to normal, log-normal, Weibull and log-logistic distributions to establish the parameters of each distribution. The middle line in the probability plot is the normal line and the other two lines show the 95% confidence intervals. Montgomery & Runger (2002) stated that normal probability plots are useful in distinguishing data that fit the normal distribution from data that have skewed distributions with long tails. If the data fall below the normal line, then the data have a positively skewed distribution (Montgomery & Runger 2002).
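
The coordinates behind a normal probability plot can be sketched as below. The median-rank plotting position (i − 0.3)/(n + 0.4) is an assumption chosen for illustration and may differ from the positions used by the statistical package. The demand values are hypothetical and the confidence bands are not reproduced.

```python
from statistics import NormalDist, fmean, pstdev


def normal_plot_points(sample):
    """Return (z, x) pairs: standard-normal quantile against ordered observation."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()
    points = []
    for i, x in enumerate(ordered, start=1):
        p = (i - 0.3) / (n + 0.4)  # assumed median-rank plotting position
        points.append((std_normal.inv_cdf(p), x))
    return points


# Hypothetical daily demands with a longer upper tail.
demand = [1745, 1760, 1772, 1781, 1795, 1830, 1910]
mu, sigma = fmean(demand), pstdev(demand)  # maximum likelihood estimates
points = normal_plot_points(demand)
for z, x in points:
    fitted = mu + sigma * z  # value on the fitted normal line at this quantile
    print(f"z = {z:+.2f}  observed = {x:7.1f}  normal line = {fitted:7.1f}")
```

Systematic departures of the observed values from the fitted line at the ends of the plot indicate that the sample is not well described by a normal distribution.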

## RESULTS AND DISCUSSION

The normal probability plots for the UK and North American water-consumption data for the year 2009 are shown in Figures 1 and 2, respectively. In both figures, the values at both ends tend to fall below the normal line, which demonstrates that the data have a positively skewed distribution. In addition, the plots of the four distributions for the UK and North American data show that more data points lie within the 95% confidence intervals for the log-logistic distribution than for the log-normal, Weibull and normal distributions.

### The goodness-of-fit test

The AD goodness-of-fit test was used to confirm which of the normal, log-normal, log-logistic and Weibull distributions best fits the data. The AD values of the four distributions for the UK and North American data are shown in Figures 3–6. For the data used in this study, the log-logistic distribution has the lowest AD values when compared with the normal, Weibull and log-normal distributions.

### Parameter estimates

The location parameter obtained in this study for the log-logistic distribution is approximately 7.4 for the UK's water-consumption data, and the scale parameter is between 0.0107 and 0.026 (Table 2). For the North American water-consumption data, the location parameter obtained for the log-logistic distribution is between 3.92 and 4.562, and the scale parameter is between 0.0296 and 0.389 (Table 3). For the UK's consumption data, the mean is in the range 1,720 to 1,792 and the standard deviation is in the range 35 to 96.51 (Table 4).

Year | Location | Scale |
---|---|---|
2009 | 7.485 | 0.01776 |
2010 | 7.482 | 0.0256 |
2011 | 7.464 | 0.01517 |
2012 | 7.448 | 0.01068 |

Year | Zone 1 location | Zone 1 scale | Zone 2 location | Zone 2 scale | Zone 3 location | Zone 3 scale |
---|---|---|---|---|---|---|
2009 | 4.089 | 0.0415 | 4.545 | 0.0457 | 3.920 | 0.105 |
2010 | 4.089 | 0.0415 | 4.545 | 0.0457 | 3.920 | 0.105 |
2011 | 4.187 | 0.055 | 4.446 | 0.063 | 4.308 | 0.142 |
2012 | 4.098 | 0.389 | 4.361 | 0.041 | 4.152 | 0.092 |
2013 | 4.115 | 0.0296 | 4.562 | 0.054 | 3.992 | 0.088 |

Year | Standard deviation | Mean |
---|---|---|
2009 | 67.9 | 1,792 |
2010 | 96.51 | 1,787 |
2011 | 45.31 | 1,746 |
2012 | 35 | 1,720 |
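
Once location and scale parameters are available, exceedance probabilities for the log-logistic distribution follow in closed form. The sketch below assumes that ln X is logistic with the reported location and scale, which is an assumption about the parameterisation. The function names and the threshold of 1,900 are hypothetical, the threshold is in the same unstated units as the consumption data, and the parameters are the 2009 UK values from Table 2.

```python
import math


def loglogistic_exceedance(x, loc, scale):
    """P(X > x) when ln X ~ Logistic(loc, scale)."""
    z = (math.log(x) - loc) / scale
    return 1.0 / (1.0 + math.exp(z))


def loglogistic_quantile(p, loc, scale):
    """Value x such that P(X <= x) = p, for 0 < p < 1."""
    return math.exp(loc + scale * math.log(p / (1.0 - p)))


LOC, SCALE = 7.485, 0.01776  # 2009 UK values from Table 2

# Probability that daily consumption exceeds a hypothetical threshold.
p_exceed = loglogistic_exceedance(1900.0, LOC, SCALE)

# Consumption level exceeded on average on 1% of days.
x_99 = loglogistic_quantile(0.99, LOC, SCALE)

print(f"P(X > 1900) = {p_exceed:.4f}")
print(f"99th percentile = {x_99:.1f}")
```

The same two functions can be used to read a design peaking factor off the fitted distribution, by dividing a chosen upper quantile by the mean or median consumption.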

## CONCLUSIONS

Analysis of the water-consumption data showed that 88% of the data have a positively skewed distribution. This means that the data fit positively skewed distributions such as the log-normal, log-logistic and Weibull better. Following detailed analysis of the data, the study shows that, of the four distributions studied, the log-logistic distribution provided the lowest AD values and was the most suitable distribution to standardise on when modelling water demand.

The findings of this study are in accordance with the literature, which stated that the log-logistic distribution is the best fit for real water-consumption data. Although the log-normal and log-logistic distributions may be similar for moderate sample sizes, it is still desirable to choose the more suitable model in order to obtain accurate probability values in the tails.

Moreover, the normal and log-normal distributions produced marginally acceptable AD values. The AD values obtained for the Weibull distribution were higher than those of the other three distributions (log-logistic, log-normal and normal), and the Weibull distribution was found not to be suitable for simulating the water demand data.

To the best of the authors' knowledge, no previous study has incorporated the probability of occurrence using real water-consumption data on the basis of a statistical method focused on the upper tails. By using the AD test to validate the fit, this study focused on the data in the upper tails, which best represent the water-consumption data.

The log-logistic distribution could be used as a standard statistical distribution for quantifying the probability of exceedance of water consumption. Additionally, this work has the potential to provide significant information to help policy makers forecast future demands using a fully probabilistic method.

## ACKNOWLEDGEMENTS

The authors wish to express their gratitude to the Environment Agency for supporting this research initiative, and water utility companies from the UK and North America for providing the data sets for this study.