The increasing availability of condition assessment data highlights the challenge of managing data imbalance in the asset management of aging infrastructure. Aging sewer pipes pose significant threats to health and the environment, underscoring the importance of proactive management practices to enhance asset maintenance and mitigate associated risks. While machine learning (ML) models are widely employed to model the complex deterioration process of sewer pipes, they face performance limitations when trained on imbalanced condition grade data. This paper addresses this issue by proposing a novel approach that uses a conditional generative adversarial network (cGAN) for data augmentation. By generating synthetic data for minority classes, the skewed distribution of the sewer dataset is balanced, facilitating more robust and accurate predictive models. The utility of the proposed method is evaluated by training different ML classifiers, including neural network (NN), decision tree, quadratic discriminant analysis, Naïve Bayes, support vector machine (SVM), and K-nearest neighbor (KNN). The quadratic discriminant analysis, Naïve Bayes, NN, and SVM classifiers demonstrated improvement. The cGAN-based data augmentation method also outperformed two other imbalance-handling techniques: random under-sampling and cost-sensitive NN. Consequently, data generated by cGAN can effectively aid asset management by supporting the development of proactive classifiers that accurately predict pipes at a high risk of failure.

  • A conditional generative adversarial network (cGAN) model is utilized to handle an imbalanced sewer condition dataset.

  • The effectiveness of cGAN-based data augmentation is assessed by employing various machine learning classifiers.

  • The cGAN-based data augmentation method outperformed other imbalance-handling approaches, namely random under-sampling and cost-sensitive neural networks.

Public infrastructure plays a crucial role in the health, safety and economic prosperity of citizens. According to Canada’s national infrastructure report card, a large portion of sewer pipes are in ‘poor’ to ‘very poor’ condition, with a replacement cost of over 26 billion CAD (Canada Infrastructure Report Card 2016). A similar trend is observed in North American sewer pipe condition grades (Sterling et al. 2009; American Society of Civil Engineers 2013; Canada Infrastructure Report Card 2019). Condition prediction models, which are trained on actual inspection databases, are needed to determine the condition of the assets and help decision makers prioritize the inspection, renewal or repair of sewer pipes (Danandeh Mehr & Safari 2020; Ebtehaj et al. 2020; Fitchett et al. 2020; Hawari et al. 2020; Gul et al. 2021). Reliable prediction of the current and future conditions of sewer systems is necessary for proactive management plans (Moradi et al. 2020).

With the increasing availability of condition assessment databases, however, one prevalent challenge is data imbalance (Harvey & McBean 2014b). Imbalanced data can be defined as a binary or multi-class dataset in which the minority class has fewer instances than the majority class (Liu et al. 2018). Most machine learning (ML)-based classifiers assume a uniform class distribution, and hence imbalanced data pose a challenge during model training (Japkowicz & Stephen 2002). The modeling dataset in this study, the Calgary sewer database, is class imbalanced because the structurally defective pipes are heavily under-represented. A model trained on imbalanced data performs poorly because it misclassifies instances of the minority class more often than those of the majority class. If the minority class consists of sewer pipes in poor condition, the cost of misclassification can be high. The subsequent sub-sections give details on sewer condition assessment and on data augmentation techniques that help balance the skewed distribution in the Calgary sewer database.

Sewer condition assessment and prediction

The Water Research Centre (WRc) protocol assigns a condition grade to sewer pipes based on closed-circuit television (CCTV) inspections (Sousa et al. 2014). The defects observed through the CCTV camera are categorized as structural (e.g., crack, fracture, deformation, joint defects, collapse, break, sag, surface damage, corrosion and hole) and operational (e.g., root, debris, encrustation, protrusion, and infiltration). Although most municipalities are concerned with structural defects, operational defects also have a significant impact on the sewer system (Rahman & Vanier 2004). An internal condition grade (ICG) is allocated according to the WRc protocol, based on the highest defect score, which describes the severity and orientation of the pipe defects. The grading uses a five-level ordinal scale, as shown in Table 1, where ICGs 1 and 5 represent excellent and bad sewer condition, respectively. The inspection program is limited to a few samples in the sewer network, and prediction models are required to infer the condition of the sewers not inspected (Mashford et al. 2011). A number of papers have been published on defect detection and classification techniques that assess sewer condition using CCTV images and videos (Moselhi & Shehab-Eldeen 2000; Yang & Su 2008; Cheng & Wang 2018; Myrans et al. 2018, 2019; Kumar et al. 2020).

Table 1

Description of pipe ICGs and their defect values range

Condition grade | Condition | Structural | Operational
1 | Excellent |  | 
2 | Good | 10–39 | 1–1.9
3 | Fair | 40–79 | 2–4.9
4 | Poor | 80–164 | 5–9.9
5 | Bad |  | 

Many researchers have developed condition assessment models to help municipalities predict the condition of their sewer networks (Sousa et al. 2014). The condition assessment models can be classified into physical (Wirahadikusumah et al. 2001; Chughtai & Zayed 2008), statistical (Kabir et al. 2018b; Balekelayi & Tesfamariam 2019) and artificial intelligence (Khan et al. 2010; Sousa et al. 2014) models. The ML-based models, a subset of artificial intelligence, can capture the complex nonlinear deterioration process of sewer pipes (Kerwin et al. 2023). Different ML algorithms have been used to model sewer infrastructure deterioration, including artificial neural network (ANN), support vector machine (SVM) and random forest. A non-exhaustive summary of the applications of ML to sewer pipe condition prediction is given in Table 2.

Table 2

Machine learning models for sewer condition prediction

References | Factors | Models
Najafi & Kulandaivel (2005) | Length, diameter, material, age, depth, slope and type of sewer | Artificial neural network
Kulandaivel (2004) | Length, size, type of material, age, depth of cover, slope and type of sewer | Artificial neural network
Caradot et al. (2018) | Construction year, material, effluent type, width, length, depth, district and tree density | Random forest
Tran & Ng (2010) | Diameter, age, depth, slope, tree-count, location, hydraulic condition, soil type, moisture | Support vector machine
Tran et al. (2009) | Size, age, depth, slope, tree-count, hydraulic condition, location, soil type and Thornthwaite moisture index | Neural network
Khan et al. (2010) | Diameter, depth, bedding material, material, length, age | Back propagation and probabilistic neural network
Sousa et al. (2014) | Diameter, depth, length, slope, material, velocity and age | Artificial neural network; Support vector machine; Logistic regression
Harvey & McBean (2014a) | Material, age, type, diameter, length, slope, down elevation, depth, road coverage, water main breaks (3m) | Decision trees; Support vector machine
Harvey & McBean (2014b) | Material, age, installation era, type, diameter, length, slope, slope change, up elevation, down elevation, orientation change, burial depth, road coverage, watermain breaks, land use, census tract | Random forests
Mashford et al. (2011) | Diameter, time laid, road, grade, start invert, end invert, material, grade, angle, soil corrosivity, and sulfate soil/ground water | Support vector machine
Vitorino et al. (2014) | Age and material | Random forest
Tran et al. (2006) | Size, age, depth, slope, location, tree new, hydraulic condition, soil type, and TMI | Probabilistic neural network

Data augmentation

Three classic methods of handling class imbalance are cost-sensitive learning, resampling and generating artificial data (Wang et al. 2020). The cost-sensitive learning method handles class imbalance by considering the cost of misclassifying minority class samples (Harvey & McBean 2014b; Haixiang et al. 2017). Although it is computationally efficient compared to resampling techniques, its major drawback is the difficulty of assigning the misclassification cost (Haixiang et al. 2017). The resampling technique alleviates the effect of a skewed class distribution on the learning process by rebalancing the sample space of an imbalanced dataset (Haixiang et al. 2017). Resampling techniques fall into two categories: under-sampling and over-sampling (Wang et al. 2020). The under-sampling technique balances the skewed class distribution by removing samples from the majority class. The over-sampling technique alleviates the effect of the skewed class distribution by copying instances from the minority class. In artificial data generation techniques, the skewed distribution is balanced by generating new samples belonging to the minority class rather than copying existing ones. This is a special type of over-sampling technique. Frequently used techniques of this kind are the Synthetic Minority Oversampling Technique (SMOTE) and the Adaptive Synthetic Sampling Approach (ADASYN) (Liu et al. 2018). The SMOTE technique generates artificial data by taking a random sample in the minority class and introducing synthetic samples along the line segments joining that sample to its k-nearest neighbors of the same class. This technique focuses on low-dimensional data and has limited application to multi-class, high-dimensional data (García et al. 2012).
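The SMOTE interpolation step described above can be illustrated with a minimal sketch. It is not the reference SMOTE implementation: the function name `smote_like_sample`, the value of `k`, and the sample rows are hypothetical and chosen only for illustration.

```python
import math
import random


def smote_like_sample(minority, k=3, rng=None):
    """Create one synthetic point between a random minority sample and one of its k nearest neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority samples by Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbour = rng.choice(others[:k])
    # Pick a random position on the segment joining base and neighbour.
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Hypothetical minority-class rows: (age in years, length in m, depth in m).
minority = [[60.0, 80.0, 3.0], [65.0, 90.0, 3.5], [70.0, 75.0, 2.8], [58.0, 100.0, 4.0]]
synthetic = [smote_like_sample(minority, k=2, rng=random.Random(i)) for i in range(5)]
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies inside the range already spanned by the minority class, which is one reason such interpolation can struggle on complex, high-dimensional distributions.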

Harvey & McBean (2014b) addressed the negative impact of class imbalance on predictive performance by adjusting the predicted probability cutoffs. The sewer pipes are first grouped into either good or bad structural condition, and a receiver-operating characteristic (ROC) curve is then used to specify cutoffs for the predicted class probability. A generative adversarial network (GAN) handles class imbalance in a dataset by generating new data samples based on the joint probability of the input data and labels (Gao et al. 2019). Douzas & Bacao (2018) proposed a conditional GAN (cGAN) model to approximate the original data distribution and generate synthetic data that augment the minority class instances in various imbalanced datasets. The performance of cGAN was compared with that of other oversampling algorithms, and the results showed that cGAN can significantly improve the quality of the generated data. Wang et al. (2020) proposed a novel cGAN-based traffic data augmentation method called PacketcGAN. The model was compared with other data augmentation techniques (e.g., random over-sampling, SMOTE and GAN), and the experimental results indicated that the encrypted traffic dataset augmented by PacketcGAN achieved a higher classification accuracy than the datasets augmented by the other techniques.

In this paper, a cGAN model is used to generate samples belonging to the minority classes in a sewer dataset by conditioning the training on external information (class labels). GANs have received attention in the ML field for their potential to learn high-dimensional and complex real data distributions. As a generative model, cGAN learns the distribution of the real data and generates synthetic data with a similar distribution. The generated data are combined with the original data to build a new dataset in which the major and minor classes are balanced. With the augmented data, condition classification is undertaken using ML classifiers. In recent years, many studies have applied cGAN to imbalanced data (Douzas & Bacao 2018; Gao et al. 2019; Wang et al. 2020). The current study focuses on the application of cGAN to augmenting the sewer condition database. This paper is divided into five sections. Section two discusses the proposed methodology and the different machine learning algorithms. Section three discusses the case study. The results and discussion are provided in section four. Section five presents the conclusions and recommendations for future work.

The main objective of this study is to demonstrate the application of a cGAN model to handling class imbalance in a sewer dataset. The methodology includes data generation using cGAN and evaluation of the utility of this technique using different ML-based classifiers. The classifiers considered in this study are NN, decision tree, quadratic discriminant analysis, Naïve Bayes, SVM, and KNN. During model development, the original dataset was divided into a 70% training set and a 30% testing set. The cGAN model was trained on the training set to generate new instances (the augmented dataset). All the classifiers were trained using 10-fold cross validation on the training and augmented sets, and validated using the testing set.

The proposed methodology is applied to both a multi-class and a binary-class problem. In the multi-class problem, the sewer pipes are grouped into the five ICGs. In the binary-class problem, the sewer pipes are grouped into bad (ICG 4–5) and good (ICG 1–3) pipes (Harvey & McBean 2014b). A flowchart of the proposed study is shown in Figure 1. Details of the cGAN and the ML classifiers are given in the subsequent sub-sections.
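As a small illustration of the binary grouping, the mapping from ICG to the good/bad label can be written as follows. The function name `to_binary_condition` is hypothetical; only the grouping rule (ICG 1–3 good, ICG 4–5 bad) comes from the text.

```python
def to_binary_condition(icg):
    """Map an internal condition grade (1-5) to the binary label used in the text."""
    if icg not in (1, 2, 3, 4, 5):
        raise ValueError(f"ICG must be an integer from 1 to 5, got {icg!r}")
    # ICGs 4 and 5 form the 'bad' class; ICGs 1, 2 and 3 form the 'good' class.
    return "bad" if icg >= 4 else "good"


grades = [1, 2, 3, 4, 5]
labels = [to_binary_condition(g) for g in grades]
```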
Figure 1

Work flow of the proposed study.


Generative adversarial network

GANs are a framework designed to train implicit generative models using NNs (Goodfellow et al. 2014). A GAN approximates the underlying distribution of real data in order to generate samples that have a similar distribution to the real data. The generator (G) and the discriminator (D) are the two networks of a GAN. G takes a noise sample $z \sim p_z$ as input and returns a generated sample $G(z)$. Here $p_z$ is the distribution of the noise samples, which can be either a uniform or a standard Gaussian distribution. In contrast, the role of D is to distinguish between real samples from a dataset and fake samples generated by G. D has a single output that can be interpreted as the estimated probability that a given sample is real.

Training a GAN involves D and G playing the following minimax game with objective function $V(D, G)$ (Goodfellow et al. 2014). The optimization of the GAN stops when D is no longer able to distinguish the generated data from the real data.
$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{y \sim p_d}\left[\log D(y)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right] \tag{1}$$
The objective function in Equation (1) is a binary cross-entropy cost function. The network parameters of G and D are denoted by $\theta_g$ and $\theta_d$, respectively. Looking at the maximization with respect to D, the first term in $V(D, G)$ indicates that a good D should assign high probabilities to real samples. The second term in $V(D, G)$ means that a good D should assign low probabilities to generated samples. Looking at the minimization with respect to G, the first term is constant. The second term means that the best G is one whose samples are assigned a high probability by D (Oskarsson 2020). For example, given real sewer attributes, the ideal output of the discriminator is 1. Conversely, given fake sewer attributes generated by $G(z)$, the ideal output of the discriminator is 0.
G implicitly defines the generator distribution $p_g$ through the choice of noise distribution $p_z$ and G itself. Samples from $p_g$ are obtained by sampling $z \sim p_z$ and passing $z$ through the generator to get $y = G(z)$ (Goodfellow et al. 2014). Since $p_g$ is defined only implicitly through this sampling process, it is not possible to compute the probability density $p_g(y)$ for any value of $y$. By defining $p_g$, however, it is possible to rewrite Equation (1) using only expectations over $y$ (Equation (2)).
$$V(D, G) = \mathbb{E}_{y \sim p_d}\left[\log D(y)\right] + \mathbb{E}_{y \sim p_g}\left[\log\left(1 - D(y)\right)\right] \tag{2}$$
The minimax optimization problem is solved by minimizing the expected losses of G and D, where the expected loss can be estimated using a training dataset. The task of the discriminator is a binary classification problem, and hence its loss can be expressed using the cross-entropy loss function. In addition, the maximization can be changed into a minimization by multiplying the objective by −1, which gives the following loss functions.
$$L_D(\theta_d) = -\frac{1}{N_d}\sum_{i=1}^{N_d} \log D(y_i) - \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right] \tag{3}$$
$$L_G(\theta_g) = \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right] \tag{4}$$
where $N_d$ is the number of data samples $y_i$ drawn from the underlying data distribution $p_d$. The GAN uses a gradient descent step to minimize Equations (3) and (4) for each batch of training data. The remaining expectations are over the known noise distribution $p_z$ and can be estimated in each training step using samples drawn from $p_z$. This leads to an unbiased estimate of the true loss (Goodfellow et al. 2014). The training procedure of a standard GAN is summarized in Algorithm 1.

Algorithm 1: Standard Generative Adversarial Networks Algorithm
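The alternating updates of a standard GAN can be sketched on a toy problem. The example below is not the network used in this study: it assumes a one-dimensional linear generator and a logistic discriminator so that the gradients of the discriminator and generator losses can be written by hand, and all names, learning rates and step counts are hypothetical.

```python
import math
import random


def sigmoid(s):
    s = max(-60.0, min(60.0, s))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-s))


def train_toy_gan(real, steps=200, batch=16, lr=0.05, seed=0):
    """Alternate gradient steps on L_D and L_G for a 1-D linear generator and logistic discriminator."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator G(z) = a*z + b
    w, c = 0.1, 0.0   # discriminator D(y) = sigmoid(w*y + c)
    for _ in range(steps):
        ys = [rng.choice(real) for _ in range(batch)]      # real batch, y ~ p_d
        zs = [rng.gauss(0.0, 1.0) for _ in range(batch)]   # noise batch, z ~ p_z
        gs = [a * z + b for z in zs]                       # generated batch, G(z)
        # Discriminator step: descend L_D = -mean log D(y) - mean log(1 - D(G(z))).
        d_real = [sigmoid(w * y + c) for y in ys]
        d_fake = [sigmoid(w * g + c) for g in gs]
        grad_w = (sum((d - 1.0) * y for d, y in zip(d_real, ys))
                  + sum(d * g for d, g in zip(d_fake, gs))) / batch
        grad_c = (sum(d - 1.0 for d in d_real) + sum(d_fake)) / batch
        w, c = w - lr * grad_w, c - lr * grad_c
        # Generator step: descend L_G = mean log(1 - D(G(z))) with D held fixed.
        d_fake = [sigmoid(w * g + c) for g in gs]
        grad_a = sum(-d * w * z for d, z in zip(d_fake, zs)) / batch
        grad_b = sum(-d * w for d in d_fake) / batch
        a, b = a - lr * grad_a, b - lr * grad_b
    return (a, b), (w, c)


# Hypothetical real data: 200 draws from a Gaussian with mean 3.0.
data_rng = random.Random(1)
real_samples = [data_rng.gauss(3.0, 0.5) for _ in range(200)]
generator_params, discriminator_params = train_toy_gan(real_samples)
```

The sketch only shows the order of operations (sample a real batch, sample noise, update D, then update G); it makes no claim about convergence, which in practice depends heavily on the architecture and tuning.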

Conditional generative adversarial network (cGAN)

The conditional generative adversarial network (cGAN) is an enhancement of GAN proposed by Mirza & Osindero (2014). In cGAN, both the generator and the discriminator receive additional information x (e.g., a class label) to control the data generation process in a supervised manner. With the class information, cGAN can control the number of instances generated for a particular label, which is impossible for a standard GAN (Hong et al. 2019). The cGAN technique has the same principle, structure and training process as GAN, except that the objective function changes slightly to include the class information:
$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{y \sim p_d}\left[\log D(y \mid x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z \mid x) \mid x)\right)\right] \tag{5}$$
The conditioning of G and D on x is not present in Equation (1). This means that G in cGAN learns a conditional distribution $p_g(y \mid x)$ to approximate the true conditional data distribution $p_d(y \mid x)$. In practice, the conditioning on x is achieved by concatenating it to the network input. Hence, the G network is fed both the noise z and x, and the D network takes both a sample y and x. In this study, x represents the class labels of the minority sewer condition grades (Figure 1).
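The concatenation-based conditioning can be sketched as follows. The one-hot label encoding, the noise dimension of 8 and the function names are assumptions made for illustration; the text does not state how the labels are encoded.

```python
import random

MINOR_GRADES = [1, 3, 4, 5]  # minority ICGs named in the text


def one_hot(grade, grades=MINOR_GRADES):
    """Encode a class label x as a one-hot vector (assumed encoding)."""
    return [1.0 if g == grade else 0.0 for g in grades]


def generator_input(grade, noise_dim=8, rng=None):
    """Concatenate noise z ~ N(0, 1) with the encoded label x."""
    rng = rng or random.Random(0)
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return z + one_hot(grade)


def discriminator_input(sample, grade):
    """Concatenate a sample y (e.g. age, length, depth) with the same encoded label x."""
    return list(sample) + one_hot(grade)


g_in = generator_input(5)
d_in = discriminator_input([40.0, 90.0, 3.0], 3)
```

Requesting a given number of synthetic pipes for one grade then amounts to calling the generator repeatedly with the same label and fresh noise.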

ML classifiers

Artificial neural network

An ANN is an ML algorithm composed of an input layer, one or more hidden layers, and an output layer. The layers are connected to each other by weighted connections, and the weights are calculated through the training process according to the chosen algorithm (Sousa et al. 2014). Backpropagation is a frequently used training methodology that adjusts the weights by minimizing the error between the model output and the observed values (Hassoun 1995). The general equation for an ANN is written as follows:
$$y = \varphi\left(\sum_{i} w_i x_i + b\right) \tag{6}$$
where $y$ = output of the model; $\varphi$ = activation function; $w_i$ = weight corresponding to input $x_i$; and $b$ = bias. A loss function is used to measure the difference between the model output and the observed values. A minimum error indicates that the model is able to approximate the mapping between the input and output layers (Hassoun 1995). A deep neural network architecture has better generalization ability than a shallow architecture (e.g., a neural network with fewer than three hidden layers) when modeling a complex system (Bengio 2009).
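The weighted sum followed by an activation can be worked through for a single neuron. The inputs, weights, bias and the choice of ReLU below are hypothetical values used only to illustrate the computation.

```python
import math


def relu(s):
    return max(0.0, s)


def neuron_output(inputs, weights, bias, activation=math.tanh):
    """Compute y = activation(sum_i w_i * x_i + b) for one neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Hypothetical normalized inputs (e.g. age, length, depth) and weights.
y = neuron_output([0.5, 0.2, 0.8], [0.4, -0.6, 0.3], bias=0.1, activation=relu)
# Weighted sum: 0.20 - 0.12 + 0.24 + 0.10 = 0.42, which ReLU leaves unchanged.
```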

Decision trees

Decision trees are among the most popular classification algorithms and predict the target variable through a set of prediction rules. The prediction rules are arranged in a tree-like structure (Syachrani et al. 2013). During the training process, a decision tree starts with the most informative attribute and splits the data based on the values of the predictor variables. As a result of the splitting, at least two branches grow out of each node. The splitting process continues at the nodes in each branch based on their informativeness. A node at which the split can no longer continue is called a leaf, and it belongs to a particular class (Piryonesi & El-Diraby 2020). Interested readers are referred to Syachrani et al. (2013) and Harvey & McBean (2014a) for applications of this algorithm to sewer infrastructure.
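A single splitting step can be sketched by searching for the threshold on one attribute that gives the purest child nodes. Gini impurity is used here as the measure of informativeness, which is an assumption: the text does not name the splitting criterion, and the data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best binary split on one feature."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical pipe ages (years) and binary condition labels.
ages = [10, 20, 30, 60, 70, 80]
condition = ["good", "good", "good", "bad", "bad", "bad"]
threshold, impurity = best_threshold(ages, condition)
```

A full tree repeats this search over every attribute at every node and stops when a node is pure or no further split is possible.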

Discriminant analysis

Discriminant analysis is a statistical technique that uses a set of independent variables to allocate objects to the categories of a dependent categorical variable. In sewer condition assessment, discriminant analysis is used to analyze the relationship between a number of predictor variables (e.g., deterioration factors) and a dependent categorical variable (e.g., pipe condition) (Hawari et al. 2020).

Naïve Bayes classifier

The Naïve Bayes classifier is a simple probabilistic classifier based on Bayes' rule. The basic assumption of Naïve Bayes is that all features are conditionally independent, so that each of them contributes independently to the probability that an input belongs to a class (Hastie et al. 2009). However, Naïve Bayes can still be applied in the presence of feature dependencies. The classifier’s parameters can be obtained by calculating the mean and variance of the training data. Once these parameters are obtained, they can be used to classify the testing dataset (Hastie et al. 2009).
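The parameter estimation and prediction steps can be sketched for a Gaussian Naïve Bayes classifier. The Gaussian likelihood is an assumption consistent with estimating a mean and variance per feature; the function names and the data are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate per-class prior, feature means and feature variances."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        means = [sum(col) / n for col in zip(*members)]
        # A small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*members), means)]
        model[label] = (n / len(rows), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Pick the class with the highest log posterior under feature independence."""
    def log_posterior(label):
        prior, means, variances = model[label]
        log_like = sum(-0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
                       for x, m, var in zip(row, means, variances))
        return math.log(prior) + log_like
    return max(model, key=log_posterior)


# Hypothetical rows: (age in years, depth in m).
rows = [[10, 2.0], [15, 2.5], [12, 2.2], [70, 4.0], [80, 4.5], [75, 3.8]]
labels = ["good", "good", "good", "bad", "bad", "bad"]
nb_model = fit_gaussian_nb(rows, labels)
prediction = predict_gaussian_nb(nb_model, [72, 4.1])
```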

Support vector machine

The SVM classifier is a well-established classification technique that has been used for various applications (Eker et al. 2012). SVMs have some advantages over ANNs because they can be trained on smaller training sets than those required by ANNs and do not require the specification of an internal architecture (Tran & Ng 2010). The SVM algorithm works by finding the best hyperplane that separates data points of one class from those of another class. The classifier is based on the concept of maximum margin hyperplanes. The maximum margin hyperplane is the one whose decision boundary has the largest margin, and intuitively it tends to have a smaller generalization error than hyperplanes with small margins (Tran & Ng 2010). Mashford et al. (2011) developed an SVM model to predict the condition of sewers, and the results showed that the SVM achieved good performance.

K-nearest neighbor (KNN)

KNN is a simple non-parametric algorithm that is used for both regression and classification problems. The classifier works by computing a distance measure (e.g., Euclidean, Mahalanobis) between the training samples and the test sample, and the test sample is assigned to the majority class among its K-nearest training samples. For large training sets, the KNN classifier can be computationally expensive because of the need to calculate the distances between each input test sample and the entire training set (Cover & Hart 1967).
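The majority-vote rule can be sketched with the Euclidean distance. The function name, the value of K and the training rows are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training samples."""
    # Sorting all training samples by distance is what makes KNN costly on large sets.
    ranked = sorted(zip(train_rows, train_labels),
                    key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical rows: (age in years, depth in m).
train_rows = [[10, 2.0], [15, 2.5], [12, 2.2], [70, 4.0], [80, 4.5], [75, 3.8]]
train_labels = ["good", "good", "good", "bad", "bad", "bad"]
knn_label = knn_predict(train_rows, train_labels, [68, 3.9], k=3)
```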

Performance metrics

Prediction accuracy is a good indicator of the overall performance of a classifier on balanced data. For imbalanced data, however, the confusion matrix, precision and recall are more suitable for assessing the classifier on the minority class (Harvey & McBean 2014b). For example, in an imbalanced classification problem where the bad pipes (minority class) are the class of interest, true positive (TP) indicates the correctly classified bad pipes; false positive (FP) indicates the pipes classified as bad when they are actually good; true negative (TN) indicates the correctly classified good pipes; and false negative (FN) indicates the pipes classified as good when they are actually bad (Harvey & McBean 2014b). A good classifier is expected to minimize FN more than FP. The accuracy, precision, recall and F1 score were calculated using the following equations:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

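The four metrics can be computed directly from the confusion counts. The counts in the example are hypothetical and are chosen to show how a high accuracy can coexist with a low recall on the minority class.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 score from binary confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test set: 80 bad pipes (positive class) and 920 good pipes.
metrics = binary_metrics(tp=30, fp=20, tn=900, fn=50)
```

Here the accuracy is 0.93 while the recall on bad pipes is only 0.375, which is why accuracy alone is a poor guide for imbalanced data.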
The proposed methodology is applied to the sewer database collected from the city of Calgary, Alberta. The network (Figure 2) comprises more than 76,743 sewer pipes (5,316.43 km) that connect residential homes and industrial, commercial and public buildings with three treatment facilities (Fish Creek, Pine Creek, and Bonnybrook). The sewer pipe materials are concrete, asbestos cement, brick, vitrified clay, steel, cast iron, corrugated metal pipes, PVC, and polyethylene (PE). In this study, the different materials are aggregated into four material categories: (1) cementitious (concrete, asbestos cement); (2) clay (brick, vitrified clay); (3) metallic (steel, cast iron, corrugated metal pipes); and (4) plastic (PVC, polyethylene). The majority of the pipes installed prior to the 1980s were brick, cast iron (CI) and concrete, whereas PVC and HDPE dominated afterwards. The sewer pipes have diameters in the range of 25–3,300 mm, and the material and vintage of construction are plotted in Figure 3. Balekelayi & Tesfamariam (2019) and Kabir et al. (2018b) have used the Calgary sewer dataset to study sewer asset condition.
Figure 2

Sewer pipe network in the city of Calgary.

Figure 3

Sewer pipe material usage over time.


The city of Calgary performs regular sewer inspection programs using the CCTV technique to help prioritize pipes for maintenance and replacement. Through this program, around 12,736 pipes (1,052.5 km) were inspected and assigned an ICG. In our study, the pipes with missing information were removed from the dataset, and the remaining 12,146 instances were used to implement the proposed methodology. The distribution of the pipes according to their material and ICG is shown in Table 3. The material types are distributed as follows: cementitious (69.4%), clay (17.5%), metallic (8.6%), and plastic (4.5%). Furthermore, the row totals in Table 3 indicate that a class imbalance exists in the database, with the classes distributed as follows: ICG 1 (11.1%), ICG 2 (44.2%), ICG 3 (20%), ICG 4 (21.4%), and ICG 5 (3.2%). The imbalance between the different ICGs gives the data a skewed distribution. The sewer pipes in good condition (ICGs 1, 2, and 3) and bad condition (ICGs 4 and 5) are shown in Figure 2 in gray and red, respectively.

Table 3

Distribution of sewer pipes according to ICG and material type

ICG | Cementitious | Clay | Metallic | Plastic | Total
1 | 927 | 272 | 69 | 78 | 1,346
2 | 4,093 | 718 | 253 | 309 | 5,373
3 | 1,754 | 448 | 216 | 16 | 2,434
4 | 1,475 | 594 | 396 | 138 | 2,603
5 | 179 | 94 | 109 | 8 | 390
Total | 8,428 | 2,126 | 1,043 | 549 | 12,146

Factors considered in sewer condition assessment are shown in Table 2. Identifying sensitive variables can reduce the costs associated with data collection and enhance the condition prediction accuracy of models (Malek Mohammadi et al. 2020). In this paper, the sewer condition grade is the target variable, and three explanatory variables (age, length and depth) are used as independent variables. A pairwise comparison of the explanatory variables is shown in Figure 4.
Figure 4

Pairwise comparison of frequently used attributes in sewer condition assessment.


Multi-class problem

The pre-processed training dataset is normalized and split into major (ICG 2) and minor (ICGs 1, 3, 4, and 5) classes for the cGAN training. The minor class labels and a noise vector sampled from a Gaussian distribution are used as inputs to train the cGAN model. The output of the model is a set of generated instances of sewer attributes for the minor classes. The generated data augment the number of instances in each minor class so that every ICG has an equal number of instances (5,373). Subsequently, the original data are combined with the generated data to make up the augmented dataset (26,865 instances).
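The balancing arithmetic can be checked with a short calculation that uses the row totals of Table 3. The variable names are illustrative.

```python
# Number of inspected pipes per ICG (row totals of Table 3).
icg_counts = {1: 1346, 2: 5373, 3: 2434, 4: 2603, 5: 390}

# Every minor class is topped up to the size of the majority class (ICG 2).
majority_size = max(icg_counts.values())
synthetic_needed = {grade: majority_size - n
                    for grade, n in icg_counts.items() if n < majority_size}

# Original instances plus generated instances.
augmented_total = sum(icg_counts.values()) + sum(synthetic_needed.values())
```

The result is 4,027, 2,939, 2,770 and 4,983 synthetic instances for ICGs 1, 3, 4 and 5, respectively, and an augmented dataset of 26,865 instances, that is, five classes of 5,373 instances each.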

Result for data augmentation using cGAN

The cGAN model is implemented in Python using the code developed by Aggarwal et al. (2019). Considering the sensitivity of training a deep network, pre-processed data are used to train the cGAN model. The minor class labels and a noise vector sampled from a Gaussian distribution are fed to the generator to train the cGAN model (Figure 1). Implementing a cGAN model requires an extensive fine-tuning process to set hyperparameters such as the number of hidden layers, the number of hidden neurons, the number of epochs, and the mini-batch size. A trial-and-error method was used to select the best architecture for the generator and discriminator. The fine-tuned network architectures of the generator and discriminator are summarized in Table 4. The discriminator takes the minor class labels and the generator’s output of sewer attributes as inputs, and distinguishes between the real and synthetic data using a sigmoid activation. The Adam optimizer was found to be stable and to produce good model output. The Adam parameters $\beta_1$ and $\beta_2$ were left at their default values, and the same learning rate (0.0002) was used for the generator and discriminator. The number of training steps needed to update the network parameters depends on the batch size. A batch size of 16 was found to produce a desirable output in this work. Although the output of a cGAN is remarkable, it is challenging to obtain a stable model.
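For reference, one Adam update and the number of updates per epoch can be sketched as follows. The learning rate (0.0002) and batch size (16) are taken from the text; the values 0.9 and 0.999 for the two decay rates and 1e-8 for epsilon are the commonly cited Adam defaults and are an assumption here, since the text does not state them. The function names are hypothetical.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.0002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def steps_per_epoch(n_samples, batch_size=16):
    """Number of parameter updates needed to pass once over the data."""
    return math.ceil(n_samples / batch_size)


# First update of a hypothetical weight with gradient 0.5.
new_param, m1, v1 = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

On the first step the bias correction makes the update size approximately equal to the learning rate regardless of the gradient magnitude, so the weight moves from 1.0 to about 0.9998.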

Table 4

Architecture of both generator and discriminator

Network layer | Generator neurons | Generator activation function | Discriminator neurons | Discriminator activation function
Input Layer | (2,75) | ReLU | (2,50) | ReLU
Hidden Layer | (75,75) | ReLU | (50,50) | ReLU
Hidden Layer | (75,75) | ReLU | (50,50) | ReLU
Hidden Layer | (75,75) | ReLU | (50,50) | ReLU
Hidden Layer | (75,75) | ReLU | (50,75) | ReLU
Hidden Layer | (75,75) | ReLU | – | ReLU
Hidden Layer | (75,100) | ReLU | – | ReLU
Output Layer | (100,15) | Linear | (75,1) | Sigmoid
Network layersGenerator
Discriminator
NeuronsActivation functionNeuronsActivation function
Input Layer (2,75) ReLU (2,50) ReLU 
Hidden Layer (75,75) ReLU (50,50) ReLU 
Hidden Layer (75,75) ReLU (50,50) ReLU 
Hidden Layer (75,75) ReLU (50,50) ReLU 
Hidden Layer (75,75) ReLU (50,75) ReLU 
Hidden Layer (75,75) ReLU – ReLU 
Hidden Layer (75,100) ReLU – ReLU 
Output Layer (100,15) Linear (75,1) Sigmoid 
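The layer sizes in Table 4 can be checked for internal consistency with a short sketch. It copies the (inputs, outputs) pairs from the table, verifies that each layer's output width matches the next layer's input width, and counts trainable parameters. The assumption that every layer is a fully connected layer with one bias per output unit is ours; the table does not state it, and the function names are illustrative.

```python
# (inputs, outputs) per layer, copied from Table 4.
GENERATOR = [(2, 75)] + [(75, 75)] * 5 + [(75, 100), (100, 15)]
DISCRIMINATOR = [(2, 50)] + [(50, 50)] * 3 + [(50, 75), (75, 1)]


def shapes_chain(layers):
    """True when each layer's output width equals the next layer's input width."""
    return all(a[1] == b[0] for a, b in zip(layers, layers[1:]))


def count_parameters(layers, bias=True):
    """Weights plus (assumed) one bias per output unit of every dense layer."""
    return sum(n_in * n_out + (n_out if bias else 0) for n_in, n_out in layers)


gen_ok = shapes_chain(GENERATOR)
disc_ok = shapes_chain(DISCRIMINATOR)
gen_params = count_parameters(GENERATOR)
disc_params = count_parameters(DISCRIMINATOR)
```

Under these assumptions both networks chain correctly, with 37,840 parameters in the generator and 11,701 in the discriminator.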

Using the calibrated cGAN model, synthetic data are generated for all minor classes (ICG 1 = 4,027, ICG 3 = 2,939, ICG 4 = 2,770, ICG 5 = 4,983) so that each has the same number of instances as the majority class (ICG 2 = 5,373). The generator output has three neurons to generate continuous values for the age, length, and depth features, respectively. The generated data are compared with the original data in the scatter plots in Figure 5, where the rows show the subplots for ICGs 5, 4, 3, and 1, in that order. Figure 5 shows that the majority of the generated data (blue) follow the distribution of most original variables (red). This indicates that the cGAN is able to generalize the distribution of the data from a small dataset.
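The balancing arithmetic can be sketched as follows. The sketch reads the bracketed numbers as the count of synthetic instances generated per minority class, which is our interpretation of the text, and derives the implied original class sizes as the difference from the majority count. The derived values are not reported directly in the paper, although ICGs 4 and 5 sum to 2,993 and ICGs 1 to 3 sum to 9,153, matching the binary-class totals given later.

```python
MAJORITY_COUNT = 5373  # ICG 2, stated in the text
# Synthetic instances generated per minority class, as read from the text.
SYNTHETIC = {1: 4027, 3: 2939, 4: 2770, 5: 4983}

# Implied original counts: whatever is missing to reach the majority count.
original = {icg: MAJORITY_COUNT - n for icg, n in SYNTHETIC.items()}
original[2] = MAJORITY_COUNT

# After augmentation every class should match the majority class.
balanced = {icg: original[icg] + SYNTHETIC.get(icg, 0) for icg in original}
total_original = sum(original.values())
total_balanced = sum(balanced.values())
```

Under this reading, ICG 5 is by far the rarest class (390 implied original instances), so most of its balanced set is synthetic.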
Figure 5

Scatter plot comparison between original and augmented data (red indicates the original data and blue indicates the augmented data). Subplots in rows 1–4 are for ICGs 5, 4, 3, and 1, respectively.


Comparison of the classifiers

The classifiers were developed using the built-in classification learners in MATLAB version R2019b. Confusion matrices (CMs) from model testing on the original and augmented datasets are given in Figures 6 and 7. Diagonal and off-diagonal entries of the CMs indicate correctly and incorrectly classified instances, respectively. The row summary of each CM gives the rates of correctly and incorrectly classified pipes for each ICG. Pipes in ICGs 4 and 5 have a high risk of failure; hence, they are taken as the classes of interest. A false prediction that sends a good pipe to inspection is less of an issue than misclassifying a bad pipe as being in a good state. Overall, the classifiers performed poorly on the original dataset, where only ICG 2 was predicted accurately. The accuracy of the models improved only slightly on the augmented dataset, and the rate of correctly classified pipes in ICGs 4 and 5 remained low. The rates of correctly classified bad pipes in ICGs 4 and 5 are: Decision Tree (30.5% and 11.5%), Quadratic Discriminant (38.2% and 19.2%), Naïve Bayes (35.7% and 10.3%), NN (35.1% and 11.5%), KNN (33.4% and 6.4%), and SVM (35.7% and 7.7%). The Quadratic Discriminant classifier achieved relatively higher classification accuracy on both ICGs 4 and 5. The low performance of the classifiers indicates that the models are unsuitable for identifying uninspected pipes that pose a threat to the structural integrity of the sewer network (Harvey & McBean 2014b).
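The row summary described above can be sketched in a few lines. The example row-normalises a confusion matrix (rows are true classes, columns are predicted classes) and reads the rate of correctly classified instances per class from the diagonal. The 3-class counts are hypothetical and are not taken from Figures 6 and 7; the function names are illustrative.

```python
def row_rates(cm):
    """Row-normalise a confusion matrix (rows = true class, columns = predicted)."""
    rates = []
    for row in cm:
        total = sum(row)
        rates.append([count / total if total else 0.0 for count in row])
    return rates


def per_class_correct_rate(cm):
    """Diagonal of the row-normalised matrix: share of each true class predicted correctly."""
    return [row[i] for i, row in enumerate(row_rates(cm))]


# Hypothetical 3-class counts, for illustration only.
cm = [
    [80, 15, 5],
    [30, 60, 10],
    [20, 30, 50],
]
rates = per_class_correct_rate(cm)
```

Each value is the per-class rate of correct classification, which is the quantity quoted as a percentage for ICGs 4 and 5 in the text.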
Figure 6

Test set confusion matrices for the decision tree, quadratic discriminant, and Naïve Bayes models; the augmented dataset relates to pipes in ICGs 1, 3, 4, and 5. (a) Decision Tree – Original data. (b) Decision Tree – Augmented data. (c) Quadratic Discriminant – Original data. (d) Quadratic Discriminant – Augmented data. (e) Naïve Bayes – Original data. (f) Naïve Bayes – Augmented data.

Figure 7

Test set confusion matrices for the NN, KNN, and SVM models; the augmented dataset relates to pipes in ICGs 1, 3, 4, and 5. (a) NN – Original data. (b) NN – Augmented data. (c) KNN – Original data. (d) KNN – Augmented data. (e) SVM – Original data. (f) SVM – Augmented data.

The performance of the classifiers was also assessed by training the cGAN model with ICGs 4 and 5 only. The CMs from model testing on the augmented dataset are given in Figure 8. The rates of correctly classified bad pipes in ICGs 4 and 5 are: Decision Tree (24.6% and 28.2%), Quadratic Discriminant (22.1% and 46.2%), Naïve Bayes (20.5% and 51.3%), NN (29.4% and 43.6%), KNN (29.8% and 28.2%), and SVM (30.9% and 30.8%). Training the cGAN with only the classes of interest increased the classification accuracy of the classifiers on ICG 5. The Naïve Bayes classifier achieved the highest classification accuracy on ICG 5 (51.3%). It can also be noted that no single classifier showed consistently higher results across the different training scenarios.
Figure 8

Test set confusion matrices for all classifiers; the augmented dataset relates to pipes in ICGs 4 and 5. (a) Decision Tree – Augmented data. (b) Quadratic Discriminant – Augmented data. (c) Naïve Bayes – Augmented data. (d) NN – Augmented data. (e) KNN – Augmented data. (f) SVM – Augmented data.


The negative impact of class imbalance on predictive performance can be mitigated by categorizing pipe conditions into two classes (good and bad), because the majority of class-imbalance learning strategies have been designed for two-class problems, in which it is easier for predictive models to predict the minority class of interest. Hence, the ICGs are aggregated into bad (ICGs 4–5) and good (ICGs 1–3) pipes in the subsequent analysis (Harvey & McBean 2014b).

Binary-class problem

The minor (ICGs 4–5) and major (ICGs 1–3) classes have 2,993 and 9,153 instances, respectively. As in the multi-class problem, the output of the cGAN model consists of generated instances of sewer attributes for the minor class. The generated data augment the minor class so that each class has an equal number of instances (9,153).
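The aggregation and the resulting augmentation target can be written out as a minimal sketch. The class counts are those stated above; the function and variable names are illustrative.

```python
def to_binary(icg):
    """Map an internal condition grade (1-5) to 'bad' (ICGs 4-5) or 'good' (ICGs 1-3)."""
    if icg not in (1, 2, 3, 4, 5):
        raise ValueError(f"unexpected ICG: {icg}")
    return "bad" if icg >= 4 else "good"


N_GOOD = 9153  # majority class (ICGs 1-3), stated in the text
N_BAD = 2993   # minority class (ICGs 4-5), stated in the text

n_synthetic = N_GOOD - N_BAD      # synthetic 'bad' instances needed to balance
imbalance_ratio = N_GOOD / N_BAD  # majority-to-minority ratio before augmentation
```

Balancing the two classes therefore requires 6,160 generated instances, starting from a majority-to-minority ratio of about 3 to 1.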

Result for data augmentation using cGAN

The fine-tuned network architecture of the generator and discriminator (Table 4) is used for the binary-class problem. The generated and original data of the minority class are compared in the scatter plot in Figure 9. The scatter plot indicates a good overlap between the original and generated data; hence, the cGAN is able to generalize the distribution of the minority class.
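The scatter plot is a visual check. As a possible numerical complement, which is not part of the analysis reported here, the two-sample Kolmogorov-Smirnov statistic measures the largest gap between the empirical distribution functions of an original feature and its generated counterpart. The sketch below is a plain-Python implementation; the pipe ages are hypothetical values used only for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two 1-D samples (0 = same ECDF, 1 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The supremum of |F_a - F_b| is attained at one of the observed values.
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical pipe ages (years), for illustration only.
original_age = [12, 25, 31, 40, 47, 55]
generated_age = [14, 22, 33, 38, 49, 58]
d_age = ks_statistic(original_age, generated_age)
```

A value near 0 means the two samples have similar one-dimensional distributions for that feature; it says nothing about whether the joint relationship between features is preserved, which is what the scatter plot shows.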
Figure 9

Scatter plot comparison between original and augmented data (Red color indicates the original data and Blue color indicates the augmented data).


Comparison of the classifiers

The models' performance is presented in Table 5. The overall accuracy of the classifiers is higher when they are trained on the original dataset than on the augmented dataset. Nevertheless, accuracy is not a reliable metric here because the dataset is imbalanced. As with accuracy, the models have higher precision when trained on the original dataset. The Naïve Bayes classifier achieved the highest precision among the models trained on the augmented dataset. Lower precision values indicate that there are more false positives in the model predictions. In sewer condition prediction, a false positive, which results in the inspection of a pipe that is actually in good condition, is less of a problem than a false negative, where a pipe leaking wastewater into the ground is missed in the next round of inspections because it was predicted to be in good condition (Harvey & McBean 2014b). A high recall value indicates that a model has correctly classified the majority of the positive class (pipes in bad condition). The models have higher recall values when trained on the augmented dataset, which indicates that generating synthetic data for the minority class improves the models' ability to recognize it when evaluated on new data. The highest recall values, 53 and 51%, were attained by the Naïve Bayes and Quadratic Discriminant classifiers, respectively. The value of data augmentation and the best classifier are finally determined using the F1 score; the greater the F1 score, the better the model. Overall, the models trained on the augmented dataset have higher F1 scores than the models trained on the original dataset. The Naïve Bayes and Quadratic Discriminant classifiers achieve the highest F1 scores (0.48 and 0.47) when trained on the augmented dataset.
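The point about accuracy can be made concrete with a minimal sketch that computes the four metrics from confusion counts, with bad pipes as the positive class. It scores a trivial model that labels every pipe as good, using the full-dataset class counts (9,153 good and 2,993 bad). The sizes of the actual test split are not given in the text, so this is an illustration rather than a reproduction of any reported result; the function name is ours.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive ('bad' pipe) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # 0 by convention when nothing is flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# A trivial model that predicts 'good' for every pipe: no true or false positives.
always_good = binary_metrics(tp=0, fp=0, fn=2993, tn=9153)
```

The trivial model reaches an accuracy of about 0.75 while detecting no bad pipes (recall and F1 of 0), which is why recall and the F1 score are the more informative metrics for this dataset.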

Table 5

The performance of the classification model

| Model developed | Training dataset | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|---|
| Decision Tree | Original | 0.75 | 0.48 | 0.11 | 0.18 |
| Decision Tree | Augmented | 0.63 | 0.30 | 0.38 | 0.34 |
| Quadratic Discriminant | Original | 0.75 | 0.58 | 0.15 | 0.23 |
| Quadratic Discriminant | Augmented | 0.57 | 0.43 | 0.51 | 0.47 |
| Naïve Bayes | Original | 0.75 | 0.58 | 0.15 | 0.24 |
| Naïve Bayes | Augmented | 0.57 | 0.44 | 0.53 | 0.48 |
| Neural network | Original | 0.75 | 0.46 | 0.10 | 0.16 |
| Neural network | Augmented | 0.61 | 0.36 | 0.45 | 0.40 |
| KNN | Original | 0.75 | 0.45 | 0.09 | 0.15 |
| KNN | Augmented | 0.62 | 0.34 | 0.42 | 0.37 |
| SVM | Original | 0.75 | 0.39 | 0.08 | 0.13 |
| SVM | Augmented | 0.62 | 0.37 | 0.45 | 0.40 |
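As a quick consistency check on Table 5, the F1 score can be recomputed as the harmonic mean of the tabulated precision and recall. The sketch below does this for the augmented-data rows; because the tabulated inputs are rounded to two decimals, small differences from the reported F1 values are expected. The names are illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the augmented-data rows of Table 5.
AUGMENTED = {
    "Decision Tree": (0.30, 0.38, 0.34),
    "Quadratic Discriminant": (0.43, 0.51, 0.47),
    "Naive Bayes": (0.44, 0.53, 0.48),
    "Neural network": (0.36, 0.45, 0.40),
    "KNN": (0.34, 0.42, 0.37),
    "SVM": (0.37, 0.45, 0.40),
}

recomputed = {name: f1_from(p, r) for name, (p, r, _) in AUGMENTED.items()}
max_gap = max(abs(recomputed[name] - f1) for name, (_, _, f1) in AUGMENTED.items())
best = max(recomputed, key=recomputed.get)
```

The recomputed values agree with the reported ones to within 0.01, and Naïve Bayes has the highest recomputed F1 score, followed by the Quadratic Discriminant classifier.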

Finally, the cGAN-based data augmentation method is compared with two other data imbalance handling techniques: random under-sampling (RUS) (Kim et al. 2016) and cost-sensitive NN (Zhou & Liu 2005). RUS creates a new dataset by randomly removing data from the majority class until it has as few instances as the minority class. The cost-sensitive NN, in contrast, assigns more weight to the minority class. The performance metrics from model testing for RUS and the cost-sensitive NN are given in Table 6. The accuracies of the RUS and cost-sensitive models are 65 and 64%, respectively. The low precision values of both models indicate that there are many false positives in their predictions. The cost-sensitive model performed better at correctly classifying the positive class, as shown by its recall and F1 score values. However, the cost-sensitive model underperformed in terms of F1 score (0.42) when compared with the Naïve Bayes and Quadratic Discriminant classifiers trained on the augmented dataset (0.48 and 0.47). It can therefore be concluded from the overall analysis that data generated using the cGAN method are better suited to handling the class imbalance problem in a sewer condition database. The overall classification accuracy achieved in this paper is lower than that reported in the literature. However, the objective of this paper is to demonstrate the utility of the cGAN model for handling the class imbalance problem. The performance of the classifiers can be increased by considering additional explanatory variables and further tuning the cGAN model.

Table 6

The performance of the RUS and cost-sensitive NN models

| Model developed | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|
| RUS | 0.65 | 0.34 | 0.47 | 0.40 |
| Cost-Sensitive | 0.64 | 0.35 | 0.53 | 0.42 |
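The two baseline techniques compared in Table 6 can be sketched in plain Python. The under-sampling function randomly keeps only as many majority-class records as there are minority-class records. The weighting function uses inverse class frequency, which is one common heuristic assumed here for illustration; the text does not state how the cost-sensitive NN set its weights. The integer lists are stand-ins for pipe records, sized with the class counts given for the binary-class problem.

```python
import random


def random_under_sample(majority, minority, seed=0):
    """Randomly keep only as many majority-class rows as there are minority-class rows."""
    if len(majority) < len(minority):
        raise ValueError("majority class must be at least as large as the minority class")
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(majority, len(minority)), list(minority)


def inverse_frequency_weights(class_counts):
    """Assumed heuristic: weight = n_total / (n_classes * n_class)."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return {label: total / (k * n) for label, n in class_counts.items()}


# Index stand-ins for pipe records (9,153 good, 2,993 bad).
good = list(range(9153))
bad = list(range(2993))
good_kept, bad_kept = random_under_sample(good, bad)
weights = inverse_frequency_weights({"good": 9153, "bad": 2993})
```

Under-sampling discards 6,160 of the good-pipe records, whereas weighting keeps all records and penalizes errors on bad pipes roughly three times as heavily as errors on good pipes.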

Aging sewer pipes pose major health, environmental, and economic threats to citizens. Utilities need proactive management practices to improve their asset maintenance systems. ML-based models are frequently used to model the complex and nonlinear deterioration process of sewer pipes. However, the performance of ML models suffers when they are trained on imbalanced condition grade data. This paper addresses this challenge by proposing a novel cGAN-based data augmentation method for handling class imbalance in sewer datasets (ICGs 1, 3, 4, and 5). The generated data, which exhibit a distribution similar to that of the original data, are valuable for building a balanced sewer condition database for asset management. Additionally, the cGAN is compared with two other data imbalance handling methods, random under-sampling (RUS) and cost-sensitive NN, and yields better predictive performance for the ML classifiers. The approach offers a practical way to improve the classifiers' ability to predict pipes at high risk of failure in sewer networks.

The utility of this concept is evaluated using different ML classifiers (NN, decision tree, discriminant analysis, Naïve Bayes, SVM, and KNN) on the original and augmented datasets. In the multi-class problem, the classification accuracy on ICG 5 increased more when the cGAN was trained with the classes of interest only than when it was trained with all the minority classes. In the binary-class problem, the cGAN-based data augmentation method outperformed two other data imbalance handling techniques, RUS and cost-sensitive NN. ML classifiers trained with the augmented dataset improved their classification of the bad pipes that have a high risk of failure. The classifiers that showed the largest improvement on these pipes are Quadratic Discriminant and Naïve Bayes.

Pipes in ICGs 4 and 5 are close to failure, and better condition prediction for these classes will help municipalities avoid severe consequences of failure. In the binary-class problem, recall for these pipes rose from 0.08–0.15 with the original dataset to 0.38–0.53 with the augmented dataset. Hence, the data generated by the cGAN can be used in asset management to build proactive techniques that better predict pipes at high risk of failure. Furthermore, a risk-based decision-making framework can be developed by integrating the condition prediction ML classifiers with a consequence model (Kabir et al. 2018a). The pipes classified as high risk would then be prioritized for inspection, maintenance, rehabilitation, or replacement (Balekelayi & Tesfamariam 2020).

The current work can be expanded by comparing the cGAN with other GAN variants (e.g., the Wasserstein generative adversarial network (WGAN)), with architectures that reduce the instability of GAN models, and with other data generation techniques (Zhan et al. 2022; Sun et al. 2023). In future work, the predictive power of the classifiers can be assessed by including additional pipe-specific features or other features related to pipe condition. Finally, the proposed cGAN-based data augmentation can be extended to other infrastructure assets (e.g., roads and water pipes).

The authors acknowledge the financial support through the Natural Sciences and Engineering Research Council of Canada (RGPIN-2019-05584) under the Discovery Grant programs.

Data cannot be made publicly available; readers should contact the corresponding author for details.

The authors declare there is no conflict.

Aggarwal K., Kirchmeyer M., Yadav P., Keerthi S. S. & Gallinari P. 2019 Regression with Conditional GAN. ArXiv Preprint ArXiv:1905.12868 (accessed 1 September 2022).
American Society of Civil Engineers 2013 2013 Report Card for America’s Infrastructure. ASCE (accessed 28 February 2020).
Balekelayi N. & Tesfamariam S. 2019 Statistical inference of sewer pipe deterioration using Bayesian geoadditive regression model. Journal of Infrastructure Systems 25 (3), 04019021.
Balekelayi N. & Tesfamariam S. 2020 Operational risk-based decision making for wastewater pipe management. Journal of Infrastructure Systems 27 (1), 04020042.
Bengio Y. 2009 Learning deep architectures for AI. Foundations and Trends in Machine Learning 2 (1), 1–127.
Canada Infrastructure Report Card 2016 Informing the Future: The Canadian Infrastructure Report Card 2016. CIRC (accessed 27 November 2020).
Canada Infrastructure Report Card 2019 Monitoring the State of Canada’s Core Public Infrastructure: The Canadian Infrastructure Report Card 2019. CIRC (accessed 28 February 2024).
Caradot N., Riechel M., Fesneau M., Hernandez N., Torres A., Sonnenberg H., Eckert E., Lengemann N., Waschnewski J. & Rouault P. 2018 Practical benchmarking of statistical and machine learning models for predicting the condition of sewer pipes in Berlin, Germany. Journal of Hydroinformatics 20 (5), 1131–1147.
Chughtai F. & Zayed T. 2008 Infrastructure condition prediction models for sustainable sewer pipelines. Journal of Performance of Constructed Facilities 22 (5), 333–341.
Cover T. & Hart P. 1967 Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13 (1), 21–27.
Danandeh Mehr A. & Safari M. J. S. 2020 Application of soft computing techniques for particle Froude number estimation in sewer pipes. Journal of Pipeline Systems Engineering and Practice 11 (2), 04020002.
Ebtehaj I., Bonakdari H., Safari M. J. S., Gharabaghi B., Zaji A. H., Madavar H. R., Khozani Z. S., Es-haghi M. S., Shishegaran A. & Mehr A. D. 2020 Combination of sensitivity and uncertainty analyses for sediment transport modeling in sewer pipes. International Journal of Sediment Research 35 (2), 157–170.
Eker O., Camci F. & Kumar U. 2012 SVM based diagnostics on railway turnouts. International Journal of Performability Engineering 8 (3), 289–298.
Fitchett J. C., Karadimitriou K., West Z. & Hughes D. M. 2020 Machine learning for pipe condition assessments. Journal-American Water Works Association 112 (5), 50–55.
Gao Y., Kong B. & Mosalam K. M. 2019 Deep leaf-bootstrapping generative adversarial network for structural image data augmentation. Computer-Aided Civil and Infrastructure Engineering 34 (9), 755–773.
García V., Sánchez J. S., Martín-Félez R. & Mollineda R. A. 2012 Surrounding neighborhood-based SMOTE for learning from imbalanced data sets. Progress in Artificial Intelligence 1 (4), 347–362.
Goodfellow I., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A. & Bengio Y. 2014 Generative adversarial nets. In: Advances in Neural Information Processing Systems. Curran Associates, Inc., New York, Volume 27, 2672–2680.
Gul E., Safari M. J. S., Torabi Haghighi A. & Danandeh Mehr A. 2021 Sediment transport modeling in non-deposition with clean bed condition using different tree-based algorithms. PLoS One 16 (10), e0258125.
Haixiang G., Yijing L., Shang J., Mingyun G., Yuanyue H. & Bing G. 2017 Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications 73, 220–239.
Harvey R. R. & McBean E. A. 2014b Predicting the structural condition of individual sanitary sewer pipes with random forests. Canadian Journal of Civil Engineering 41 (4), 294–303.
Hassoun M. H. 1995 Fundamentals of Artificial Neural Networks. MIT Press, Cambridge, MA.
Hastie T., Tibshirani R. & Friedman J. 2009 The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Volume 2. Springer, New York, USA.
Hawari A., Alkadour F., Elmasry M. & Zayed T. 2020 A state of the art review on condition assessment models developed for sewer pipelines. Engineering Applications of Artificial Intelligence 93, 103721.
Hong Y., Hwang U., Yoo J. & Yoon S. 2019 How generative adversarial networks and their variants work: An overview. ACM Computing Surveys (CSUR) 52 (1), 1–43.
Japkowicz N. & Stephen S. 2002 The class imbalance problem: a systematic study. Intelligent Data Analysis 6 (5), 429–449.
Kabir G., Balek N. B. C. & Tesfamariam S. 2018a Consequence-based framework for buried infrastructure systems: a Bayesian belief network model. Reliability Engineering & System Safety 180, 290–301.
Kabir G., Balek N. B. C. & Tesfamariam S. 2018b Sewer structural condition prediction integrating Bayesian model averaging with logistic regression. Journal of Performance of Constructed Facilities 32 (3), 04018019.
Kerwin S., Garcia de Soto B., Adey B., Sampatakaki K. & Heller H. 2023 Combining recorded failures and expert opinion in the development of ANN pipe failure prediction models. Sustainable and Resilient Infrastructure 8 (1), 86–108.
Khan Z., Zayed T. & Moselhi O. 2010 Structural condition assessment of sewer pipelines. Journal of Performance of Constructed Facilities 24 (2), 170–179.
Kulandaivel G. 2004 Sewer Pipeline Condition Prediction Using Neural Network Models. Michigan State University, Michigan, US.
Kumar S. S., Wang M., Abraham D. M., Jahanshahi M. R., Iseley T. & Cheng J. C. 2020 Deep learning–based automated detection of sewer defects in CCTV videos. Journal of Computing in Civil Engineering 34 (1), 04019047.
Liu C., Wang X., Wu K., Tan J., Li F. & Liu W. 2018 Oversampling for imbalanced time series classification based on generative adversarial networks. In: 2018 IEEE 4th International Conference on Computer and Communications (ICCC). IEEE. pp. 1104–1108.
Malek Mohammadi M., Najafi M., Kermanshachi S., Kaushal V. & Serajiantehrani R. 2020 Factors influencing the condition of sewer pipes: state-of-the-art review. Journal of Pipeline Systems Engineering and Practice 11 (4), 03120002.
Mashford J., Marlow D., Tran D. & May R. 2011 Prediction of sewer condition grade using support vector machines. Journal of Computing in Civil Engineering 25 (4), 283–290.
Mirza M. & Osindero S. 2014 Conditional Generative Adversarial Nets. ArXiv Preprint ArXiv:1411.1784 (accessed 1 September 2022).
Moselhi O. & Shehab-Eldeen T. 2000 Classification of defects in sewer pipes using neural networks. Journal of Infrastructure Systems 6 (3), 97–104.
Myrans J., Everson R. & Kapelan Z. 2018 Automated detection of faults in sewers using CCTV image sequences. Automation in Construction 95, 64–71.
Myrans J., Everson R. & Kapelan Z. 2019 Automated detection of fault types in CCTV sewer surveys. Journal of Hydroinformatics 21 (1), 153–163.
Najafi M. & Kulandaivel G. 2005 Pipeline condition prediction using neural network models. In: Pipelines 2005: Optimizing Pipeline Design, Operations, and Maintenance in Today's Economy. ASCE, Reston, VA, 767–781.
Oskarsson J. 2020 Probabilistic Regression Using Conditional Generative Adversarial Networks. Linköping University, Linköping, Sweden.
Piryonesi S. M. & El-Diraby T. E. 2020 Role of data analytics in infrastructure asset management: overcoming data size and quality problems. Journal of Transportation Engineering, Part B: Pavements 146 (2), 04020022.
Rahman S. & Vanier D. 2004 An Evaluation of Condition Assessment Protocols for Sewer Management. NRC-CNRC B-5123.6, NRC Institute for Research in Construction, National Research Council Canada, Ottawa, ON.
Sterling R., Wang L. & Morrison R. 2009 Rehabilitation of Wastewater Collection and Water Distribution Systems: State of Technology Review Report. US Environmental Protection Agency, Washington, DC.
Syachrani S., Jeong H. S. D. & Chung C. S. 2013 Decision tree-based deterioration model for buried wastewater pipelines. Journal of Performance of Constructed Facilities 27 (5), 633–645.
Tran H. D. & Ng A. W. M. 2010 Classifying structural condition of deteriorating stormwater pipes using support vector machine. In: Pipelines 2010: Climbing New Peaks to Infrastructure Reliability: Renew, Rehab, and Reinvest. ASCE, Reston, VA, 857–866.
Tran D., Ng A., Perera B., Burn S. & Davis P. 2006 Application of probabilistic neural networks in modelling structural deterioration of stormwater pipes. Urban Water Journal 3 (3), 175–184.
Tran D., Perera B. C. & Ng A. 2009 Comparison of structural deterioration models for stormwater drainage pipes. Computer-Aided Civil and Infrastructure Engineering 24 (2), 145–156.
Vitorino D., Coelho S. T., Santos P. M., Sheets S., Jurkovac B. & Amado C. 2014 A random forest algorithm applied to condition-based wastewater deterioration modeling and forecasting. Procedia Engineering 89, 401–410.
Wang P., Li S., Ye F., Wang Z. & Zhang M. 2020 PacketCGAN: exploratory study of class imbalance for encrypted traffic classification using CGAN. In: ICC 2020 - 2020 IEEE International Conference on Communications (ICC). pp. 1–7.
Wirahadikusumah R., Abraham D. & Iseley T. 2001 Challenging issues in modeling deterioration of combined sewers. Journal of Infrastructure Systems 7 (2), 77–84.
Yang M.-D. & Su T.-C. 2008 Automated diagnosis of sewer pipe defects based on machine learning approaches. Expert Systems with Applications 35 (3), 1327–1337.
Zhan C., Dai Z., Soltanian M. R. & de Barros F. P. 2022 Data-worth analysis for heterogeneous subsurface structure identification with a stochastic deep learning framework. Water Resources Research 58 (11), e2022WR033241.
Zhou Z.-H. & Liu X.-Y. 2005 Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering 18 (1), 63–77.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).