## Abstract

Four statistical models (linear regression, exponential regression, Poisson regression and logistic regression) were applied to analyze the variables in pipe vulnerabilities, with the objective of finding equations to predict probable future pipe accidents. The most influential variables in pipe failures are material, age, length, diameter and hydraulic pressure. To evaluate these models, data collected in recent years in the water distribution network of district 1 in Tehran were used; the network has a total pipe length of 582,702 m and 48,500 consumers. The results demonstrate that, among the four models studied, the logistic regression model performs best and is capable of predicting future accidents with a higher probability.

## INTRODUCTION

Accidents and pipe failures in urban water systems, amongst them water distribution networks, create great challenges for water utilities and researchers around the world (Kropp & Herz 2005; Grigg 2012). Winkler *et al.* (2018) used a boosted technique of decision trees for pipe failure prediction based on existing network data and historical failure records in a medium sized city, and concluded that this method has the best performance for studying real pipe failures in comparison to other models.

Wilson *et al.* (2017) studied pipe failure prediction for large-diameter water mains (greater than 500 mm) over a period of 13 years using a series of statistical models. They concluded that the studied models were capable of predicting the failures of an individual pipe or pipe segment with a high probability. Alvisi & Franchini (2010) used the two parameterization models proposed by Mailhot *et al.* (2000) and Le Gat & Eisenbeis (2000) to estimate pipe failures and concluded that both models have similar performance over an observed period. Savic (2009) and Kakoudakis *et al.* (2017) developed an evolutionary polynomial regression (EPR) model with K-means clustering to predict the failure of pipes based on length, diameter, and age.

Inanloo *et al.* (2016) investigated the risk of vulnerability and pipe failure in distribution networks in Miami, Florida, using a GIS-based decision aid for risk assessment. Pelletier *et al.* (2003) used a variety of scenarios with a strategy model to find the number of annual failures in the water networks of three cities. Tabesh *et al.* (2009) used two data-driven models, artificial neural networks and neuro-fuzzy systems, along with a multivariate regression approach to predict pipe failures more precisely, and concluded that the neural network outcomes are more accurate. Kabir *et al.* (2015) reported that in the year 2000 there were on average 700 water pipe breaks in Canada and the USA, incurring a cost equivalent to 10 billion Canadian dollars for these two governments. Most current decisions on a pipe's repair or replacement are based only on the age of the pipe and the number of previous breaks (Shamir & Howard 1979; Kettler & Goulter 1985; Andreou *et al.* 1987; Le Gat & Eisenbeis 2000; Xu *et al.* 2018). Debón *et al.* (2010) compared models using a receiver operating characteristic (ROC) graph and concluded that the generalized linear regression model is the best model for this subject.

Some researchers in Iran note that 20–30% or more of the total revenues of Iranian water and wastewater companies is spent on repair, rehabilitation and corrective actions, and that about 30% of accidents occur in the pipes of the distribution system (Elahipanah 1999; Beigi 2000; Tajrishi & Abrishamchi 2005).

The objective of this research is to compare four statistical regression models that are appropriate for identification of pipe break patterns in a real pilot in a Tehran distribution network. The models used in this research were the linear model, exponential model, the Poisson generalized linear model, and the logistic generalized linear model, while the variables used include pipe diameter, pipe length, pipe age and pipe internal pressure.

## METHOD

This research has applied the four mentioned regression models to analyze the variables in pipe vulnerabilities. In regression, a search is made for an equation which can express the relationship between the variables and on the basis of which it is possible to make the necessary predictions or estimations (Scheidegger *et al.* 2015; Nishiyama & Filion 2013; Rausand & Arnljot, 2004). Explanations of the four models are given below.

### The linear regression model

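The linear regression model expresses the dependent variable $y$ as a linear function of the independent variables $x_i$:

$$y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i + \varepsilon$$

in which $\beta_0$ and $\beta_i$ are the constants (regression parameters) that are estimated and $\varepsilon$ is the error value, assuming that the errors are independent and normally distributed with zero mean and unknown variance (Montgomery *et al.*, 2012).

$\beta_0$ is the regression parameter and Age is the age of the pipe at the first break.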

The variables and their units in all equations are as follows: NB: number of failures; D: pipe diameter (mm); Age: pipe age (years); P: pressure (atm); L: pipe length (m). The pipe materials are: DI: ductile iron; AC: asbestos cement; GCI: cast iron; PE: polyethylene. The numerical value for each pipe material in all equations is either 0 or 1. For example, if a pipe is made of asbestos cement, the value of AC is 1 and the values of the other materials are 0.
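The 0/1 coding of pipe materials described above can be illustrated with a minimal Python sketch. The record layout (a plain dictionary with the keys `L`, `P`, `D`, `Age` and `material`) and the example pipe are assumptions made for illustration; they are not taken from the study data.

```python
# Material codes are taken from the text; the record layout is an assumption.
MATERIALS = ("AC", "DI", "GCI", "PE")


def encode_material(material):
    """Return one 0/1 indicator per pipe material."""
    if material not in MATERIALS:
        raise ValueError(f"unknown pipe material: {material!r}")
    return {name: int(name == material) for name in MATERIALS}


def covariates(pipe):
    """Build the covariate row for one pipe record (a plain dict)."""
    row = {"L": pipe["L"], "P": pipe["P"], "D": pipe["D"], "Age": pipe["Age"]}
    row.update(encode_material(pipe["material"]))
    return row


# Hypothetical pipe: 120 m long, 4 atm, 150 mm diameter, 30 years old.
example = covariates(
    {"L": 120.0, "P": 4.0, "D": 150.0, "Age": 30.0, "material": "AC"}
)
print(example)
```

Because the four indicators always sum to 1, including all of them together with an intercept makes one indicator redundant, which is consistent with the NA reported for PE in Case 1 of the significance tables.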

### Exponential regression model

The general form of the non-linear regression model is

$$y = f(x, \beta) + \varepsilon$$

where $f(x, \beta)$ is the non-linear function with the parameters $\beta_0, \beta_1, \ldots$, and $\varepsilon$ is the remaining error. Shamir & Howard (1979) applied non-linear regression analysis to find the exponential relation between the age of pipes and breaks. Their model is shown in Equation (5):

$$N(t) = N(t_0)\, e^{A(t+g)} \tag{5}$$

where $N(t)$ is the number of breaks (NB) per unit length per year; $N(t_0)$ is the number of breaks per unit length in the year of pipe installation; $t$ is the time between the break in a year and the year of the previous break; $g$ is the pipe age at time $t$; and $A$ is the break rate coefficient (yr$^{-1}$).

Some researchers developed Equation (5) so it was capable of considering all the variables in pipe failures (Andreou *et al.*, 1987; Tabesh *et al.* 2009). In this study the exponential regression model is not just dependent on age but also on all the variables according to Equation (11).
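As a small numerical illustration, the exponential break model of Shamir & Howard (1979) is commonly written as $N(t) = N(t_0)\,e^{A(t+g)}$. The sketch below evaluates that form in Python; the function name and all input values are hypothetical and are not fitted to the Tehran data.

```python
import math


def breaks_per_unit_length(n_t0, a, t, g):
    """Evaluate N(t) = N(t0) * exp(A * (t + g))."""
    return n_t0 * math.exp(a * (t + g))


# Hypothetical inputs: 0.1 breaks per unit length at installation,
# growth coefficient A = 0.05 per year, t = 5 years, g = 25 years.
rate = breaks_per_unit_length(n_t0=0.1, a=0.05, t=5.0, g=25.0)
print(round(rate, 4))
```

With a positive coefficient A the predicted break rate grows with pipe age, and with A = 0 it stays at the installation value.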

### Poisson generalized linear model

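The Poisson generalized linear model treats the number of failures as a Poisson-distributed count with mean $\mu(x_i, \beta)$; $x_i = [x_{1i}, x_{2i}, \ldots, x_{ni}]$ is the vector of covariates from the $i$th section of the system ($i = 1, 2, \ldots, m$), $\beta$ is the vector of the regression coefficients, and $y_i$ is the dependent variable (the observed number of failures), according to Equations (6) and (7):

$$P(y_i \mid x_i) = \frac{e^{-\mu(x_i, \beta)}\, \mu(x_i, \beta)^{y_i}}{y_i!} \tag{6}$$

$$\ln \mu(x_i, \beta) = \beta_0 + \sum_{j=1}^{n} \beta_j x_{ji} \tag{7}$$

where $\beta_j$, the regression coefficients, and the independent variables are as explained above.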
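A Poisson generalized linear model with the usual log link gives the expected number of failures as $\mu = \exp(\beta_0 + \sum_j \beta_j x_j)$ and the probability of $y$ failures as $e^{-\mu}\mu^{y}/y!$. The following sketch computes both in plain Python; the coefficients and covariate values are made up for illustration and are not the fitted values of this study.

```python
import math


def poisson_mean(beta0, betas, x):
    """Expected failure count with a log link: exp(beta0 + sum(b_j * x_j))."""
    return math.exp(beta0 + sum(b * xj for b, xj in zip(betas, x)))


def poisson_pmf(y, mu):
    """Probability of observing exactly y failures when the mean is mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)


# Hypothetical coefficients for length (m) and diameter (mm).
mu = poisson_mean(beta0=-1.0, betas=[0.004, -0.002], x=[200.0, 150.0])
probs = [poisson_pmf(y, mu) for y in range(4)]
print(round(mu, 4), [round(p, 4) for p in probs])
```

The positive length coefficient and negative diameter coefficient were chosen only to mirror the direction of the effects reported later in the results.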

### The logistic generalized linear model

In most cases the main concern for water supply utilities is whether or not a failure occurs in a pipe during a specific period of time, not the actual number of failures. The dependent variable is therefore binary and takes the value 1 when the pipe breaks during the period considered. In this model the independent variables do not need to follow a normal distribution. The dependent variable takes the value 1 with probability *P* or the value 0 with probability (1 − *P*).

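The relation of $P$ to its LOGIT is shown in Equation (8):

$$\mathrm{logit}(P) = \ln\left(\frac{P}{1 - P}\right) = \alpha + \sum_{i=1}^{n} \beta_i x_i \tag{8}$$

where $\alpha$ is the regression's constant parameter, $\beta_i$ are the regression coefficients for the descriptive variables and $x_i$ are the independent variables. The logistic model is a generalized linear model in which the link function is the LOGIT. Solving for the probability gives

$$P(x) = \frac{e^{\alpha + \sum_{i} \beta_i x_i}}{1 + e^{\alpha + \sum_{i} \beta_i x_i}}$$

where $P(x)$ is the break probability, $1 - P(x)$ is the no-break probability, $\alpha$ is the intercept and the $\beta_i$ represent the estimated regression parameters.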
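The logit transformation and its inverse can be sketched in a few lines of Python. Here the logit is $\ln(P/(1-P))$ and the break probability is $1/(1+e^{-z})$ with $z = \alpha + \sum_i \beta_i x_i$; the coefficients and covariates used below are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def break_probability(alpha, betas, x):
    """Inverse logit of the linear predictor alpha + sum(b_i * x_i)."""
    z = alpha + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for length (m) and age (years).
p_break = break_probability(alpha=-2.0, betas=[0.005, 0.03], x=[200.0, 30.0])
print(round(p_break, 4), round(1.0 - p_break, 4))
```

Applying `logit` to the returned probability recovers the linear predictor, which is the sense in which the logit acts as the link function.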

## APPLYING THE MODELS IN A REAL PILOT

The selected pilot is district 1 of Tehran's north region, with a total pipe length of 582,702 m and 48,500 consumers (Figure 1). Given the large number of subscribers, this district has a high record of accidents. For example, during the period from July 2004 to December 2007, more than 65,000 cases were registered, of which more than 25,000 were registered in 2007 alone (IWWC 2017). Given the existing constraints in data collection, the parameters of pipe material, diameter, length, pressure and age were selected. The pipes in the study area are mainly ductile iron (DI), asbestos cement (AC), cast iron (GCI) and polyethylene (PE), with an average age of about 30 years.

## RESULTS AND DISCUSSION

The four regression models described above were applied to the data gathered from the study pilot to find the number and probability of pipe failures. The open-source statistical software R (Hothorn & Everitt 2014) was used for this study. For every model, different cases were analyzed. For example, as shown in Table 1, six cases were studied for the linear regression model. For every case, a random 70% of the data was used for training and the remaining 30% for testing and validation.
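The random 70/30 division of the data can be sketched as follows. The study itself used R; this Python version, the function name and the fixed seed are assumptions made only to illustrate the idea.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=0):
    """Shuffle the records and split them into training and testing sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data set of 1000 pipe records, identified by index.
train_set, test_set = train_test_split(range(1000))
print(len(train_set), len(test_set))
```

Every record ends up in exactly one of the two sets, so the testing set is never seen during training.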

Table 1. Significance of the parameters in the six linear regression cases

| Parameter | Case 1 | Case 2 | Case 3 | Case 4 | Case 5 | Case 6 |
|---|---|---|---|---|---|---|
| Intercept | **** | **** | **** | **** | **** | **** |
| L | **** | **** | **** | **** | **** | **** |
| P | * | * | * | . | . | _ |
| D | . | _ | _ | _ | _ | _ |
| Age | . | . | _ | _ | _ | _ |
| AC | . | . | . | _ | _ | _ |
| DI | **** | **** | **** | **** | **** | **** |
| GCI | . | . | . | . | _ | _ |
| PE | NA | _ | _ | _ | _ | _ |
| R² | 0.1913 | 0.1906 | 0.1902 | 0.1891 | 0.1881 | 0.1856 |
| Degrees of freedom | 771 | 772 | 773 | 774 | 775 | 776 |


‘****’ denotes that the *p*-value of the considered parameter is between 0 and 0.001.

‘***’ denotes that the *p*-value of the considered parameter is between 0.001 and 0.01.

‘**’ denotes that the *p*-value of the considered parameter is between 0.01 and 0.05.

‘*’ denotes that the *p*-value of the considered parameter is between 0.05 and 0.1.

‘.’ denotes that the *p*-value of the considered parameter is between 0.1 and 1.

‘_’ denotes that the considered parameter was not used in the model.

### Linear model results

Variables were selected according to their *p*-values, with the level of significance defined as 0.05. Variables with low significance (*p*-value > 0.1) were eliminated before the model was retrained in the next case. This process continued until all variables remaining in the model were highly significant (*p*-value < 0.001).

The *p*-value of the pipe diameter indicated very low significance, so this variable was eliminated from the covariates in case 2. Among the parameters in Equation (10), the variables L and DI were the most significant, with the lower *p*-value for L.
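The case-by-case elimination of weak variables resembles a backward elimination loop, sketched below in Python under stated assumptions. The `fit_pvalues` callback stands in for refitting the model and is hypothetical, as are the toy *p*-values; this is not the procedure as implemented in the study's R scripts.

```python
def backward_elimination(variables, fit_pvalues, threshold=0.1):
    """Drop the least significant variable until all p-values <= threshold.

    fit_pvalues is a caller-supplied function (an assumption of this sketch)
    that refits the model on the given variables and returns {name: p_value}.
    """
    current = list(variables)
    history = [list(current)]  # one entry per case
    while current:
        pvalues = fit_pvalues(current)
        worst = max(current, key=lambda name: pvalues[name])
        if pvalues[worst] <= threshold:
            break
        current.remove(worst)
        history.append(list(current))
    return current, history


# Toy stand-in for a real model fit: fixed, made-up p-values per variable.
TOY_P = {"L": 0.0001, "P": 0.07, "D": 0.6, "Age": 0.4, "DI": 0.0002}
kept, cases = backward_elimination(
    TOY_P, lambda names: {n: TOY_P[n] for n in names}
)
print(kept, len(cases))
```

In a real fit the *p*-values change after each removal, which is why the model is refitted inside the loop. A stricter `threshold` would reproduce a stopping rule such as *p*-value < 0.001.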

Figure 2(a) shows the number of failures predicted versus their actual number per year.

### Exponential model results

The exponential model is fitted by non-linear regression, which requires initial parameter values (Montgomery *et al.*, 2012). These initial values were selected through trial and error. Two exponential cases were used in this research and the investigated exponential regression is shown in Equation (11).

According to the parameters estimated in Equation (11) for the existing variables, as the pipe length increases the number of failures also increases, while there is an inverse relation between the pipe diameter and number of failures. Furthermore, the use of ductile iron pipe significantly reduced the number of accidents.

Figure 2(b) is a chart of the number of predicted breaks versus the number of observed breaks. The MSE for zero and non-zero breaks in the analysis were 0.54 and 2.28, respectively, as shown in Table 2.

Table 2. Goodness-of-fit criteria for the linear, exponential and Poisson models

| Regression model | MSE | MSE (zero failure) | MSE (non-zero failure) | LL (log likelihood) | AIC | Deviance | GCV |
|---|---|---|---|---|---|---|---|
| Linear | 1.41 | 0.65 | 2.37 | −1240.4 | 2488.8 | 1101.8 | 1.42 |
| Exponential | 1.3 | 0.54 | 2.28 | −1209.08 | 2434.16 | 1016.7 | 1.32 |
| Poisson | 1.3 | 0.55 | 2.28 | −900.75 | 1811.5 | 991.3 | 1.31 |


### Poisson generalized linear model results

As in the linear and exponential models, the process in this model began with the level of significance set at 0.05. The summary of the significance results for the parameters of the Poisson generalized linear model is shown in Table 2.

With the level of significance set at 0.05, the variables of pressure and ductile iron pipe material are the most statistically significant. In this model the number of failures increases with the length of the pipe and decreases as the diameter increases. Moreover, the use of ductile iron pipes leads to a reduction in the number of breaks. Table 2 shows the MSE for this model. Figure 2(c) shows the number of observed failures versus their predicted number.

The results shown in Table 2 demonstrate that all three models perform better in predicting the zero breaks than the non-zero ones.
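The separate MSE values for pipes with zero and non-zero observed breaks can be computed as in the following Python sketch. The helper names and the four example observations are hypothetical; the values reported in Table 2 come from the study's own data.

```python
def mse(pairs):
    """Mean squared error over (observed, predicted) pairs."""
    pairs = list(pairs)
    if not pairs:
        return float("nan")
    return sum((obs - pred) ** 2 for obs, pred in pairs) / len(pairs)


def mse_by_break_group(observed, predicted):
    """MSE over all pipes, over zero-break pipes and over non-zero-break pipes."""
    pairs = list(zip(observed, predicted))
    return {
        "all": mse(pairs),
        "zero": mse((o, p) for o, p in pairs if o == 0),
        "non_zero": mse((o, p) for o, p in pairs if o != 0),
    }


# Made-up observed break counts and model predictions for four pipes.
scores = mse_by_break_group([0, 0, 1, 3], [0.5, 0.5, 1.0, 1.0])
print(scores)
```

A model that predicts small values for every pipe scores well on the zero-break group and poorly on the non-zero group, which is the pattern visible in Table 2.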

The models were compared using the log likelihood (LL), the deviance, the Akaike information criterion (AIC) (Akaike 1974) and generalized cross-validation (GCV) (Golub *et al.* 1979). The model with the highest LL and the lowest AIC, deviance and GCV is the better one. Based on these results, the Poisson model is better than the other two models, while the quality of the linear and exponential models is very similar.
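The AIC is defined as $\mathrm{AIC} = 2k - 2\,\mathrm{LL}$, where $k$ is the number of estimated parameters. The Python sketch below applies this definition to the LL and AIC values of Table 2 and backs out the parameter count each pair implies. The implied counts are an arithmetic consequence of the tabulated values, not figures stated in the paper.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2*LL."""
    return 2 * n_params - 2 * log_likelihood


def implied_param_count(aic_value, log_likelihood):
    """Solve AIC = 2k - 2*LL for k."""
    return (aic_value + 2 * log_likelihood) / 2


# (LL, AIC) pairs copied from Table 2.
TABLE_2 = {
    "Linear": (-1240.4, 2488.8),
    "Exponential": (-1209.08, 2434.16),
    "Poisson": (-900.75, 1811.5),
}

for name, (ll, aic_value) in TABLE_2.items():
    print(name, round(implied_param_count(aic_value, ll), 2))

# The model with the lowest AIC is preferred.
best = min(TABLE_2, key=lambda name: TABLE_2[name][1])
print(best)
```

Because the AIC penalizes each additional parameter by 2, a model with more variables is preferred only if its log likelihood improves enough to offset that penalty.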

In general, the models used up to this point in the research are not suitable predictors for the number of non-zero breaks. To address the weakness of the previous models in predicting non-zero breaks, the logistic generalized linear model was used.

### Logistic generalized linear model results

As can be observed, the number of failures in the model increases with the length and age of the pipe. Moreover, the use of ductile iron reduces the number of failures, whereas the use of cast iron pipes increases the number of breaks. The significance results for the logistic generalized linear model are shown in Table 3. Based on Table 3, in case 4 the model's deviance and AIC are 936.6 and 946 respectively, which indicates better performance than the other cases.

Table 3. Significance of the parameters in the four logistic generalized linear model cases

| Parameter | Case 1 | Case 2 | Case 3 | Case 4 |
|---|---|---|---|---|
| Intercept | . | _ | _ | _ |
| L | **** | **** | **** | **** |
| P | ** | ** | ** | ** |
| D | . | . | . | _ |
| Age | ** | *** | *** | *** |
| AC | . | . | _ | _ |
| DI | **** | **** | **** | **** |
| GCI | ** | ** | **** | **** |
| PE | NA | _ | _ | _ |
| Deviance | 932.71 | 933.14 | 934.01 | 936.6 |
| AIC | 948.71 | 947.14 | 946 | 946 |


‘****’ denotes that the *p*-value of the considered parameter is between 0 and 0.001.

‘***’ denotes that the *p*-value of the considered parameter is between 0.001 and 0.01.

‘**’ denotes that the *p*-value of the considered parameter is between 0.01 and 0.05.

‘*’ denotes that the *p*-value of the considered parameter is between 0.05 and 0.1.

‘.’ denotes that the *p*-value of the considered parameter is between 0.1 and 1.

‘_’ denotes that the considered parameter was not used in the model.

Figure 2(d) shows the predicted probability of failure against the observed breaks. The value 0 represents pipes without failures and the value 1 represents pipes with failures.

## CONCLUSION

Four different statistical models, namely linear regression, exponential regression, Poisson generalized linear regression and logistic generalized linear regression, were used to study the probability of failures in water distribution pipes. The results obtained with the linear model show that it is not suitable for modeling the reliability of a water distribution network. The exponential model results showed that the number of failures increases with the length of the pipe and decreases as the diameter increases.

The comparison of the goodness-of-fit criteria for the linear, exponential and Poisson models shows that the Poisson generalized linear regression model is better at predicting failures.

Overall, according to the results, the logistic model is more appropriate for estimating and predicting the probability of failures in the considered pilot distribution system over the three-and-a-half-year data set. In all the presented models, the ductile iron pipe variable was an important factor in reducing the number of failures in comparison with the other independent variables. However, it should be mentioned that the results demonstrated in this paper have not included pressure data, since the available data were not sufficient.

## ACKNOWLEDGEMENT

The author expresses his gratitude to Tehran Water and Wastewater Company, Dr Mousavi-Nadooshani and PhD student Mr Ahani for their assistance in using R software.

## REFERENCES

*Tehran Water and Wastewater Website*