ABSTRACT
The increasing availability of condition assessment data highlights the challenge of managing data imbalance in the asset management of aging infrastructure. Aging sewer pipes pose significant threats to health and the environment, underscoring the importance of proactive management practices to enhance asset maintenance and mitigate associated risks. While machine learning (ML) models are widely employed to model the complex deterioration process of sewer pipes, they face performance limitations when trained on imbalanced condition grade data. This paper addresses this issue by proposing a novel approach that uses a conditional generative adversarial network (cGAN) for data augmentation. By generating synthetic data for minority classes, the skewed distribution of the sewer dataset is balanced, facilitating more robust and accurate predictive models. The utility of the proposed method is evaluated by training different ML classifiers, including neural network (NN), decision tree, quadratic discriminant analysis, Naïve Bayes, support vector machine (SVM), and K-nearest neighbor (KNN). Quadratic discriminant, Naïve Bayes, NN, and SVM classifiers demonstrated improvement. The cGAN-based data augmentation method also outperformed two other data imbalance handling techniques: random under-sampling and cost-sensitive NN. Consequently, data generated by cGAN can effectively aid asset management by supporting the development of proactive classifiers that accurately predict pipes at a high risk of failure.
HIGHLIGHTS
A conditional generative adversarial network (cGAN) model is utilized to handle an imbalanced sewer condition dataset.
The effectiveness of cGAN-based data augmentation is assessed by employing various machine learning classifiers.
The cGAN-based data augmentation method demonstrated superior performance compared to other data imbalance handling techniques such as random under-sampling and cost-sensitive neural networks.
INTRODUCTION
Public infrastructure plays a crucial role in the health, safety and economic prosperity of citizens. According to Canada’s national infrastructure report card, a large portion of sewer pipes are in ‘poor’ to ‘very poor’ condition, with a replacement cost of over 26 billion CAD (Canada Infrastructure Report Card 2016). A similar trend is observed in North American sewer pipe condition grades (Sterling et al. 2009; American Society of Civil Engineers 2013; Canada Infrastructure Report Card 2019). Condition prediction models, which are trained on actual inspection databases, are needed to determine the condition of the assets and help decision makers prioritize the inspection, renewal or repair of sewer pipes (Danandeh Mehr & Safari 2020; Ebtehaj et al. 2020; Fitchett et al. 2020; Hawari et al. 2020; Gul et al. 2021). Reliable prediction of the current and future conditions of sewer systems is necessary for proactive management plans (Moradi et al. 2020).
With the increasing availability of condition assessment databases, however, one prevalent challenge is data imbalance (Harvey & McBean 2014b). Imbalanced data can be defined as a binary or multi-class dataset in which the minority class has fewer instances than the majority class (Liu et al. 2018). Most machine learning (ML)-based classifiers assume a uniform class distribution, and hence imbalanced data pose a challenge during model training (Japkowicz & Stephen 2002). The modeling dataset in this study, the Calgary sewer database, is class imbalanced because the structurally defective pipes are heavily under-represented. A model trained on imbalanced data performs poorly, misclassifying instances of the minority class more often than those of the majority class. If the minority class corresponds to sewer pipes in poor condition, the cost of misclassification can be high. The following sub-sections give details on sewer condition assessment and on data augmentation techniques that help balance the skewed distribution in the Calgary sewer database.
Sewer condition assessment and prediction
The Water Research Centre (WRc) protocol assigns a condition grade to sewer pipes based on closed-circuit television (CCTV) inspections (Sousa et al. 2014). The defects observed through the CCTV camera are categorized as structural (e.g., crack, fracture, deformation, joint defects, collapse, break, sag, surface damage, corrosion and hole) or operational (e.g., root, debris, encrustation, protrusion, and infiltration). Although most municipalities are concerned with the structural defects, the operational defects also have a significant impact on the sewer system (Rahman & Vanier 2004). An internal condition grade (ICG) is allocated according to the WRc protocol, based on the highest defect score, which describes the severity and orientation of the pipe defects. The grading uses a five-level ordinal scale, as shown in Table 1, where ICGs 1 and 5 represent excellent and bad sewer conditions, respectively. The inspection program is limited to a few samples in the sewer network, and prediction models are required to infer the condition of the sewers not inspected (Mashford et al. 2011). A number of papers have been published on defect detection and classification techniques that assess sewer condition using CCTV images and videos (Moselhi & Shehab-Eldeen 2000; Yang & Su 2008; Cheng & Wang 2018; Myrans et al. 2018, 2019; Kumar et al. 2020).
Condition grade | Condition | Structural | Operational |
---|---|---|---|
1 | Excellent | | |
2 | Good | 10–39 | 1–1.9 |
3 | Fair | 40–79 | 2–4.9 |
4 | Poor | 80–164 | 5–9.9 |
5 | Bad | | |
Many researchers have developed condition assessment models to help municipalities predict the condition of their sewer networks (Sousa et al. 2014). The condition assessment models can be classified into physical (Wirahadikusumah et al. 2001; Chughtai & Zayed 2008), statistical (Kabir et al. 2018b; Balekelayi & Tesfamariam 2019) and artificial intelligence (Khan et al. 2010; Sousa et al. 2014) models. ML-based models, a subset of artificial intelligence, can capture the complex nonlinear deterioration process of sewer pipes (Kerwin et al. 2023). Different ML algorithms have been used to model sewer infrastructure deterioration, including artificial neural network (ANN), support vector machine (SVM) and random forest. A non-exhaustive summary of the applications of ML to sewer pipe condition prediction is given in Table 2.
References | Factors | Models |
---|---|---|
Najafi & Kulandaivel (2005) | Length, diameter, material, age, depth, slope and type of sewer | Artificial neural network |
Kulandaivel (2004) | Length, size, type of material, age, depth of cover, slope and type of sewer | Artificial neural network |
Caradot et al. (2018) | Construction year, material, effluent type, width, length, depth, district and tree density | Random forest |
Tran & Ng (2010) | Diameter, age, depth, slope, tree-count, location, hydraulic condition, soil type, moisture | Support vector machine |
Tran et al. (2009) | Size, age, depth, slope, tree-count, hydraulic condition, location, soil type and Thornthwaite moisture index | Neural network |
Khan et al. (2010) | Diameter, depth, bedding material, material, length, age | Back propagation and probabilistic neural network |
Sousa et al. (2014) | Diameter, depth, length, slope, material, velocity and age | Artificial neural network; Support vector machine; Logistic regression |
Harvey & McBean (2014a) | Material, age, type, diameter, length, slope, down elevation, depth, road coverage, water main breaks (3m) | Decision trees; Support vector machine |
Harvey & McBean (2014b) | Material, age, installation era, type, diameter, length, slope, slope change, up elevation, down elevation, orientation change, burial depth, road coverage, watermain breaks, land use, census tract | Random forests |
Mashford et al. (2011) | Diameter, time laid, road, grade, start invert, end invert, material, grade, angle, soil corrosivity, and sulfate soil/ground water | Support vector machine |
Vitorino et al. (2014) | Age, and material | Random forest |
Tran et al. (2006) | Size, age, depth, slope, location, tree new, hydraulic condition, soil type, and TMI | Probabilistic neural network |
Data augmentation
Three classic methods of handling class imbalance are cost-sensitive learning, resampling and artificial data generation (Wang et al. 2020). Cost-sensitive learning handles class imbalance by considering the cost of misclassifying minority class samples (Harvey & McBean 2014b; Haixiang et al. 2017). Though it is computationally efficient compared to resampling techniques, its major drawback is the difficulty of assigning misclassification costs (Haixiang et al. 2017). Resampling alleviates the effect of a skewed class distribution on the learning process by rebalancing the sample space of an imbalanced dataset (Haixiang et al. 2017). Resampling techniques fall into two categories: under-sampling and over-sampling (Wang et al. 2020). Under-sampling balances the skewed class distribution by removing samples from the majority class, whereas over-sampling does so by copying instances from the minority class. Artificial data generation is a special type of over-sampling in which new minority class samples are generated rather than copied. Frequently used techniques of this kind are the Synthetic Minority Oversampling Technique (SMOTE) and the Adaptive Synthetic Sampling Approach (ADASYN) (Liu et al. 2018). SMOTE generates artificial data by taking a random sample from the minority class and introducing synthetic samples along the line segments joining that sample to its k-nearest neighbors of the same class. This technique is geared toward low-dimensional data and has limited application to multi-class, high-dimensional data (García et al. 2012).
Harvey & McBean (2014b) mitigated the negative impact of class imbalance on predictive performance by adjusting the predicted probability cutoffs. The sewer pipes were first grouped into either good or bad structural condition, and a receiver-operating characteristic (ROC) curve was then used to specify cutoffs for the predicted class probability. A generative adversarial network (GAN) handles class imbalance in a dataset by generating new data samples based on the joint probability of the input data and labels (Gao et al. 2019). Douzas & Bacao (2018) proposed a conditional GAN (cGAN) model to approximate the original data distribution and generate synthetic data to augment the minority class instances in various imbalanced datasets. The performance of cGAN was compared with that of other oversampling algorithms, and the results showed that cGAN can significantly improve the quality of the generated data. Wang et al. (2020) proposed a novel cGAN-based traffic data augmentation method called PacketcGAN. The model was compared with other data augmentation techniques (e.g., random over-sampling, SMOTE and GAN), and the experimental results indicated that the encrypted traffic dataset augmented by PacketcGAN achieved a higher classification accuracy than the datasets augmented by the other techniques.
In this paper, a cGAN model is used to generate minority class samples in a sewer dataset by conditioning the training on external information (class labels). GANs have received attention in the ML field for their potential to learn high-dimensional and complex real data distributions. As a generative model, cGAN learns the distribution of the real data and generates synthetic data with a similar distribution. The generated data are combined with the original data to build a new dataset that is balanced between the majority and minority classes. With the augmented data, condition classification is undertaken using ML classifiers to facilitate the learning of many features. Over the past years, many studies have applied cGAN to imbalanced data (Douzas & Bacao 2018; Gao et al. 2019; Wang et al. 2020). The current study focuses on the application of cGAN for augmenting the sewer condition database. This paper is divided into five sections. Section two discusses the proposed methodology and the different ML algorithms. Section three presents the case study. The results and discussion are provided in section four. Section five presents the conclusions and recommendations for future work.
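The interpolation step that SMOTE relies on can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the implementation used in the cited studies: the function name `smote_sample`, the use of Euclidean distance and the toy points are all hypothetical.

```python
import math
import random

def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point by interpolating between a random minority
    sample and one of its k nearest minority-class neighbours."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Rank the other minority samples by Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbour = rng.choice(others[:k])
    # Pick a random position on the line segment joining base and neighbour.
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

# Hypothetical two-feature minority samples (not from the sewer database).
rng = random.Random(0)
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(3)]
```

Because every synthetic point lies on a segment between two existing minority samples, the generated data never leave the region already covered by the minority class, which is one reason generative models such as GANs are considered as an alternative.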
METHODOLOGY
The main objective of this study is to demonstrate the application of a cGAN model in handling class imbalance in a sewer dataset. The methodology includes generating data using cGAN and evaluating the utility of this technique using different ML-based classifiers. The classifiers considered in this study are NN, decision tree, quadratic discriminant analysis, Naïve Bayes, SVM, and K-nearest neighbor (KNN). During model development, the original dataset was divided into 70% training and 30% testing sets. The cGAN model was trained on the training set to generate new instances (augmented dataset). All the classifiers were trained using 10-fold cross-validation on the training and augmented sets, and validated using the testing set.
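The data partitioning described above can be sketched with index bookkeeping alone. The following Python sketch assumes a simple random (non-stratified) 70/30 split and interleaved folds; the function names, the seed and these two choices are assumptions made for illustration, since the text does not specify how the split or the folds were drawn.

```python
import random

def train_test_split_indices(n, test_fraction=0.3, seed=0):
    """Shuffle indices 0..n-1 and split them into training and testing sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]

def k_fold(indices, k=10):
    """Yield (fit, validate) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, folds[i]

# 12,146 is the number of pipes retained in the case study.
train_idx, test_idx = train_test_split_indices(12146)
splits = list(k_fold(train_idx, k=10))
```

In this arrangement the testing indices are set aside before any synthetic data are generated, so the held-out set contains only real pipes.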
Generative adversarial network
GANs are a framework designed to train implicit generative models using NNs (Goodfellow et al. 2014). A GAN approximates the underlying distribution of real data in order to generate samples that have a similar distribution to the real data. The generator (G) and the discriminator (D) are the two networks of a GAN. G takes a noise sample $z \sim p_z$ as input and outputs a generated sample $G(z)$, where $p_z$ is the distribution of the noise samples, which can be either a uniform or a standard Gaussian distribution. In contrast, the role of D is to distinguish between real samples from a dataset and fake samples generated by G. D has a binary output, which can be interpreted as the estimated probability that a given sample is real.
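For reference, the two-player game between G and D is usually written as the minimax objective of Goodfellow et al. (2014). The form below is the standard textbook statement, given here as a reminder rather than as the exact notation of this study; $p_{data}$ denotes the distribution of the real data, $p_z$ the noise distribution, and $V(D, G)$ the value function.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{data}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

D is trained to push $D(x)$ toward 1 for real samples and $D(G(z))$ toward 0 for generated ones, while G is trained to make $D(G(z))$ large.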
Algorithm 1: Standard Generative Adversarial Networks Algorithm
Conditional generative adversarial network (cGAN)
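In the conditional variant, both networks additionally receive side information $y$, which in this study corresponds to the class label (ICG). A commonly used statement of the cGAN objective, given here as a standard form and not necessarily in the notation of the original study, is:

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{data}(x)} \left[ \log D(x \mid y) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z \mid y)) \right) \right]
```

Here $G(z \mid y)$ is a sample generated for label $y$ from noise $z \sim p_z$, and D scores a sample together with its label. Conditioning on $y$ is what allows samples to be requested for a specific minority class.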
ML classifiers
Artificial neural network
Decision trees
Decision trees are among the most popular classification algorithms; they predict the target variable through a set of prediction rules arranged in a tree-like structure (Syachrani et al. 2013). During training, a decision tree starts with the most informative attribute and splits the data based on the values of the predictor variables. As a result of the splitting, at least two branches grow out of each node. The splitting process continues at the nodes in each branch based on their informativeness. A node where the split can no longer continue is called a leaf, and it belongs to a particular class (Piryonesi & El-Diraby 2020). Interested readers are referred to Syachrani et al. (2013) and Harvey & McBean (2014a) for applications of this algorithm to sewer infrastructure.
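One common way to quantify how informative an attribute is, assumed here purely for illustration because the text does not name a splitting criterion, is the information gain based on entropy. The sketch below uses hypothetical toy records and hypothetical attribute names (`material`, `era`).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in label entropy after splitting on a categorical attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy records (not from the Calgary database).
rows = [
    {"material": "clay", "era": "old"},
    {"material": "clay", "era": "new"},
    {"material": "plastic", "era": "old"},
    {"material": "plastic", "era": "new"},
]
labels = ["bad", "bad", "good", "good"]
# The root split uses the attribute with the largest information gain.
best = max(rows[0], key=lambda a: information_gain(rows, labels, a))
```

In this toy case `material` separates the two classes perfectly, so it is selected for the first split, while `era` carries no information about the label.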
Discriminant analysis
Discriminant analysis is a statistical technique that uses a set of independent variables to allocate objects to the categories of a dependent categorical variable. In sewer condition assessment, discriminant analysis is used to analyze the relationship between a number of predictor variables (e.g., deterioration factors) and a dependent categorical variable (e.g., pipe condition) (Hawari et al. 2020).
Naïve Bayes classifier
The Naïve Bayes classifier is a simple probabilistic classifier based on Bayes’ rule. The basic assumption of Naïve Bayes is that all features are conditionally independent given the class, so that each feature contributes independently to the probability that an input belongs to a class (Hastie et al. 2009). However, Naïve Bayes can still be applied in the presence of feature dependencies. The classifier’s parameters can be obtained by calculating the mean and variance of the training data for each class. Once these parameters are obtained, they can be used to classify the testing dataset (Hastie et al. 2009).
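The role of the per-class mean and variance can be made concrete with a small Gaussian Naïve Bayes sketch. It assumes normally distributed features and uses hypothetical toy data; the function names and the small variance floor are assumptions made here, not details taken from the study.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate per-class priors and per-feature means and variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small constant keeps the variance strictly positive (assumption).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior under independence."""
    def log_posterior(label):
        prior, means, variances = model[label]
        log_lik = sum(-0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
                      for v, m, var in zip(row, means, variances))
        return math.log(prior) + log_lik
    return max(model, key=log_posterior)

# Hypothetical features: [age in years, diameter in mm].
X = [[60, 300], [55, 250], [70, 375], [10, 300], [15, 450], [5, 250]]
y = ["bad", "bad", "bad", "good", "good", "good"]
model = fit_gaussian_nb(X, y)
prediction = predict_gaussian_nb(model, [65, 300])
```

The sum over features inside `log_posterior` is where the conditional independence assumption enters: each feature adds its own log-likelihood term without any interaction with the others.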
Support vector machine
The SVM classifier is a well-established classification technique that has been used for various applications (Eker et al. 2012). SVMs have some advantages over ANNs: they can be trained on smaller training sets than those required by ANNs, and they do not require the specification of an internal architecture (Tran & Ng 2010). The SVM algorithm works by finding the best hyperplane that separates data points of one class from those of another class. The classifier is based on the concept of maximum margin hyperplanes. The maximum margin hyperplane is the one whose decision boundary has the largest margin, and intuitively it tends to have a smaller generalization error than hyperplanes with small margins (Tran & Ng 2010). Mashford et al. (2011) developed an SVM model to predict the condition of sewers, and the results showed that the SVM achieved good performance.
K-nearest neighbor (KNN)
KNN is a simple non-parametric algorithm that is used for both regression and classification problems. The classifier works based on a distance measure (e.g., Euclidean, Mahalanobis) between the training samples and the test sample, and the test sample is assigned to the majority class among its K nearest training samples. For large training sets, the KNN classifier can be computationally expensive because of the need to calculate the distances between each test sample and the entire training set (Cover & Hart 1967).
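A brute-force version of this procedure fits in a few lines of Python. The sketch below assumes Euclidean distance and uses hypothetical, already normalized toy features; the function name `knn_predict` is likewise an assumption.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # The distance to every training sample is computed; this full scan is
    # what makes KNN costly on large training sets.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical normalized features (not from the sewer database).
train_X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3]]
train_y = ["bad", "bad", "bad", "good", "good", "good"]
prediction = knn_predict(train_X, train_y, [0.85, 0.85], k=3)
```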
Performance metrics
CASE STUDY
The City of Calgary performs a regular sewer inspection program using CCTV to help prioritize pipes for maintenance and replacement. Through this program, around 12,736 pipes (1,052.5 km) were inspected and assigned an ICG. In our study, the pipes with missing information were removed from the dataset, and the remaining 12,146 instances were used to implement the proposed methodology. The distribution of the pipes according to their material and ICG is shown in Table 3. The material types are cementitious (69.4%), clay (17.5%), metallic (8.6%), and plastic (4.5%). Furthermore, the row totals in Table 3 indicate that a class imbalance exists in the database: ICG 1 (11.1%), ICG 2 (44.2%), ICG 3 (20%), ICG 4 (21.4%), and ICG 5 (3.2%). This imbalance between the ICGs gives the data a skewed distribution. The sewer pipes in good condition (ICGs 1, 2, and 3) and bad condition (ICGs 4 and 5) are shown in Figure 2 in gray and red, respectively.
ICG | Cementitious | Clay | Metallic | Plastic | Total |
---|---|---|---|---|---|
1 | 927 | 272 | 69 | 78 | 1,346 |
2 | 4,093 | 718 | 253 | 309 | 5,373 |
3 | 1,754 | 448 | 216 | 16 | 2,434 |
4 | 1,475 | 594 | 396 | 138 | 2,603 |
5 | 179 | 94 | 109 | 8 | 390 |
Total | 8,428 | 2,126 | 1,043 | 549 | 12,146 |
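The percentages quoted in the text follow directly from the counts in Table 3, as the short Python check below shows. The variable names are illustrative, and the imbalance ratio (largest class divided by smallest class) is an extra summary computed here rather than a figure reported in the study.

```python
# Pipe counts per ICG and material, copied from Table 3.
materials = ["Cementitious", "Clay", "Metallic", "Plastic"]
counts = {
    1: [927, 272, 69, 78],
    2: [4093, 718, 253, 309],
    3: [1754, 448, 216, 16],
    4: [1475, 594, 396, 138],
    5: [179, 94, 109, 8],
}

total = sum(sum(row) for row in counts.values())
# Share of each ICG and of each material, in percent.
icg_share = {icg: round(100 * sum(row) / total, 1) for icg, row in counts.items()}
material_share = {
    m: round(100 * sum(row[i] for row in counts.values()) / total, 1)
    for i, m in enumerate(materials)
}
# Ratio between the largest and the smallest class.
imbalance_ratio = max(map(sum, counts.values())) / min(map(sum, counts.values()))
```

With these counts, ICG 2 is roughly 14 times more frequent than ICG 5, which is the class of greatest practical concern.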
RESULTS AND DISCUSSION
Multi-class problem
The pre-processed training dataset is normalized and split into major (ICG 2) and minor (ICGs 1, 3, 4, and 5) classes for the cGAN training. The minor class labels and a noise vector sampled from a Gaussian distribution are used as inputs to train the cGAN model. The outputs of the model are generated instances of sewer attributes for the minor classes. The generated data augment the number of instances in each of the minor classes so that every ICG has an equal number of instances (5,373). Subsequently, the original data are combined with the generated data to make up the augmented dataset (26,865 instances).
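The size of the augmented dataset can be reproduced from the class totals in Table 3. The sketch below only performs the bookkeeping (how many synthetic instances each minor class needs to reach the size of ICG 2); the variable names are illustrative.

```python
# Class totals from Table 3.
class_counts = {1: 1346, 2: 5373, 3: 2434, 4: 2603, 5: 390}

# Every class is raised to the size of the majority class (ICG 2).
target = max(class_counts.values())
to_generate = {icg: target - n for icg, n in class_counts.items() if n < target}
augmented_size = sum(class_counts.values()) + sum(to_generate.values())
```

The result, five classes of 5,373 instances each, matches the 26,865 instances stated above, with ICG 5 requiring the largest number of synthetic samples.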
Result for data augmentation using cGAN
The cGAN model is implemented in Python using the code developed by Aggarwal et al. (2019). Considering the sensitivity of training a deep network, pre-processed data are used to train the cGAN model. The minor class labels and a noise vector sampled from a Gaussian distribution are fed to the generator to train the cGAN model (Figure 1). Implementing a cGAN model requires extensive fine-tuning of hyperparameters such as the number of hidden layers, the number of hidden neurons, the number of epochs, and the mini-batch size. A trial-and-error method was used to select the best architecture for the generator and discriminator. The fine-tuned network architectures of the generator and discriminator are summarized in Table 4. The discriminator takes the minor class labels and the generator’s output of sewer attributes as inputs, and distinguishes between real and synthetic data using a sigmoid activation. The Adam optimizer was found to be stable and to produce good model output. The Adam parameters $\beta_1$ and $\beta_2$ were kept at their default values, and the same learning rate (0.0002) was used for the generator and discriminator. The number of training steps needed to update the network parameters depends on the batch size. A batch size of 16 was found to give desirable output in this work. Although the output of a cGAN is remarkable, it is challenging to obtain a stable model.
Network layers | Generator neurons | Generator activation function | Discriminator neurons | Discriminator activation function |
---|---|---|---|---|
Input Layer | (2,75) | ReLU | (2,50) | ReLU |
Hidden Layer | (75,75) | ReLU | (50,50) | ReLU |
Hidden Layer | (75,75) | ReLU | (50,50) | ReLU |
Hidden Layer | (75,75) | ReLU | (50,50) | ReLU |
Hidden Layer | (75,75) | ReLU | (50,75) | ReLU |
Hidden Layer | (75,75) | ReLU | – | ReLU |
Hidden Layer | (75,100) | ReLU | – | ReLU |
Output Layer | (100,15) | Linear | (75,1) | Sigmoid |
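To give a sense of the size of the two networks, the layer shapes in Table 4 can be turned into trainable parameter counts. The sketch below reads each pair as (inputs, outputs) of a fully connected layer with a bias term; that reading, and the helper name `dense_parameters`, are assumptions, since the table does not state the layer type explicitly.

```python
# (inputs, outputs) per layer, copied from Table 4.
generator = [(2, 75)] + [(75, 75)] * 5 + [(75, 100), (100, 15)]
discriminator = [(2, 50)] + [(50, 50)] * 3 + [(50, 75), (75, 1)]

def dense_parameters(layers):
    """Weights plus biases, assuming every layer is fully connected."""
    return sum(n_in * n_out + n_out for n_in, n_out in layers)

generator_params = dense_parameters(generator)
discriminator_params = dense_parameters(discriminator)
```

Under this reading the generator is roughly three times larger than the discriminator, and both are small networks by deep learning standards.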
Comparison of the classifiers
The negative impact of class imbalance on predictive performance can be mitigated by categorizing pipe conditions into two classes (good and bad), because the majority of class-imbalance learning strategies have been designed for two-class problems, in which it is easier for predictive models to predict the minority class of interest. Hence, the ICGs are aggregated into bad (ICGs 4–5) and good (ICGs 1–3) pipes in the subsequent analysis (Harvey & McBean 2014b).
Binary-class problem
The minor (ICGs 4–5) and major (ICGs 1–3) classes have 2,993 and 9,153 instances, respectively. As in the multi-class problem, the outputs of the cGAN model are generated instances of sewer attributes for the minor class. The generated data augment the number of instances in the minor class so that both classes have an equal number of instances (9,153).
Result for data augmentation using cGAN
Comparison of the classifiers
The models’ performance is presented in Table 5. The classifiers’ overall accuracy is higher when they are trained on the original dataset than on the augmented dataset. Nevertheless, accuracy is not a reliable metric here because the dataset is imbalanced. Similarly, the models have a higher precision when trained on the original dataset. The Naïve Bayes classifier achieved the highest precision among the models trained on the augmented dataset. The lower precision values indicate that there are more false positives in the model predictions. A false positive that results in the inspection of a pipe that is actually in good condition is less of a problem in sewer condition prediction than a false negative (where a pipe leaking wastewater into the ground is missed in the next round of inspections because it was predicted to be in good condition) (Harvey & McBean 2014b). A high recall value indicates that a model has correctly classified most of the positive class (pipes in bad condition). The models have higher recall values when trained on the augmented dataset, which indicates that creating synthetic data for the minority classes improves the models’ ability to recognize them when evaluated on new data. The highest recall values, 53 and 51%, were attained by the Naïve Bayes and Quadratic Discriminant classifiers, respectively. The importance of data augmentation and the best classifier are finally determined using the F1 score. The greater the F1 score, the better the model. Overall, the models trained on the augmented dataset have a higher F1 score than the models trained on the original dataset. The Naïve Bayes and Quadratic Discriminant classifiers also achieve the highest F1 scores (0.48 and 0.47) when trained on the augmented dataset.
Model developed | Training dataset | Accuracy | Precision | Recall | F1 score |
---|---|---|---|---|---|
Decision Tree | Original | 0.75 | 0.48 | 0.11 | 0.18 |
Decision Tree | Augmented | 0.63 | 0.30 | 0.38 | 0.34 |
Quadratic Discriminant | Original | 0.75 | 0.58 | 0.15 | 0.23 |
Quadratic Discriminant | Augmented | 0.57 | 0.43 | 0.51 | 0.47 |
Naïve Bayes | Original | 0.75 | 0.58 | 0.15 | 0.24 |
Naïve Bayes | Augmented | 0.57 | 0.44 | 0.53 | 0.48 |
Neural network | Original | 0.75 | 0.46 | 0.10 | 0.16 |
Neural network | Augmented | 0.61 | 0.36 | 0.45 | 0.40 |
KNN | Original | 0.75 | 0.45 | 0.09 | 0.15 |
KNN | Augmented | 0.62 | 0.34 | 0.42 | 0.37 |
SVM | Original | 0.75 | 0.39 | 0.08 | 0.13 |
SVM | Augmented | 0.62 | 0.37 | 0.45 | 0.40 |
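The metrics in Table 5 follow the usual binary definitions, with pipes in bad condition as the positive class: precision is TP/(TP+FP), recall is TP/(TP+FN), and the F1 score is their harmonic mean. The Python sketch below states these definitions and recomputes the F1 score from the rounded precision and recall of the augmented rows; the function names and the confusion-matrix counts in the last line are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return accuracy, precision, recall, f1_score(precision, recall)

# (precision, recall, reported F1) for the augmented rows of Table 5.
augmented = {
    "Decision Tree": (0.30, 0.38, 0.34),
    "Quadratic Discriminant": (0.43, 0.51, 0.47),
    "Naïve Bayes": (0.44, 0.53, 0.48),
    "Neural network": (0.36, 0.45, 0.40),
    "KNN": (0.34, 0.42, 0.37),
    "SVM": (0.37, 0.45, 0.40),
}
# Largest difference between the recomputed and the reported F1 score.
max_gap = max(abs(f1_score(p, r) - f1) for p, r, f1 in augmented.values())

# Hypothetical counts, used only to exercise the definitions.
example = binary_metrics(tp=5, fp=5, fn=15, tn=75)
```

The recomputed values agree with the reported F1 scores to within 0.01, the residual being consistent with rounding of the tabulated precision and recall.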
Finally, the cGAN-based data augmentation method is compared with two other data imbalance handling techniques: random under-sampling (RUS) (Kim et al. 2016) and cost-sensitive NN (Zhou & Liu 2005). RUS creates a new dataset by randomly removing data from the majority class so that it has as few instances as the minority class. The cost-sensitive NN, on the other hand, assigns more weight to the minority classes. The testing results for the RUS and cost-sensitive NN models are given in Table 6. The accuracies of the RUS and cost-sensitive models are 65 and 64%, respectively. The low precision values of both models indicate that there are many false positives in the models’ predictions. The cost-sensitive model performed better than RUS at correctly classifying the positive class, as can be observed from the recall and F1 score values. However, the cost-sensitive model underperformed in terms of F1 score (0.42) when compared to the Naïve Bayes (0.48) and Quadratic Discriminant (0.47) classifiers trained on the augmented dataset. As a result, it can be concluded from the overall analysis that data generation using the cGAN method is better suited to handling the class imbalance problem in a sewer condition database. The overall classification accuracy achieved in the current paper is lower than that reported in the literature. However, the objective of this paper is to demonstrate the utility of the cGAN model for handling the class imbalance problem. The performance of the classifiers can be increased by considering additional explanatory variables and further tuning the cGAN model.
Model developed | Accuracy | Precision | Recall | F1 score |
---|---|---|---|---|
RUS | 0.65 | 0.34 | 0.47 | 0.40 |
Cost-Sensitive | 0.64 | 0.35 | 0.53 | 0.42 |
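The two baseline techniques can be sketched in a few lines of Python. The under-sampling function below keeps a random subset of each class equal in size to the smallest class, and the weighting function uses inverse class frequencies. The inverse-frequency scheme is an assumption chosen for illustration, since the weights used by the cost-sensitive NN are not stated, and all names and the seed are hypothetical.

```python
import random
from collections import Counter

def random_under_sample(labels, seed=0):
    """Return indices that keep every class at the size of the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        # Randomly discard the surplus samples of the larger classes.
        kept.extend(rng.sample(idx, n_min))
    return sorted(kept)

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}

# Binary labels with the class sizes reported for the binary-class problem.
labels = ["good"] * 9153 + ["bad"] * 2993
kept = random_under_sample(labels)
weights = balanced_class_weights(labels)
```

Under-sampling discards 6,160 of the 9,153 good-condition pipes in this setting, which illustrates the information loss that motivates generating synthetic minority samples instead.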
CONCLUSION
Aging sewer pipes pose major health, environmental, and economic threats to citizens. Utilities need proactive management practices to improve their asset maintenance systems. ML-based models are frequently used to model the complex and nonlinear deterioration process of sewer pipes. However, the performance of ML models is affected when they are trained with imbalanced condition grade data. This paper contributes to addressing this challenge by proposing a novel cGAN-based data augmentation method that handles class imbalance in sewer datasets by generating synthetic instances of the minority classes (ICGs 1, 3, 4, and 5). The generated data, exhibiting a similar distribution to the original data, prove valuable in building a balanced sewer condition database for asset management. Additionally, the cGAN is compared with other data imbalance handling methods, including random under-sampling (RUS) and cost-sensitive NN, demonstrating its superiority in enhancing the predictive performance of ML classifiers. This novel approach offers a significant contribution to the field, providing a practical solution to improve the accuracy of classifiers and predict pipes at high risk of failure in sewer networks.
The utility of this concept is evaluated using different ML classifiers (NN, decision tree, discriminant analysis, Naïve Bayes, SVM and KNN) on the original and augmented datasets. In the multi-class problem, the classification accuracy of cGAN-based data augmentation increased more when trained with the classes of interest rather than all the minority classes. In the binary-class problem, the cGAN-based data augmentation method outperformed two other data imbalance handling techniques, RUS and cost-sensitive NN. ML classifiers trained with the augmented dataset improved their classification accuracy for the bad pipes that have a high risk of failure. The classifiers that showed relatively better classification improvement on these pipes are Quadratic Discriminant and Naïve Bayes.
Pipes in ICGs 4 and 5 are close to failure, and enhancing the condition prediction for these classes will help municipalities avoid the severe consequences of failure. The classification accuracy for these pipes increased significantly with the augmented dataset. Hence, the data generated by cGAN can be used in asset management to build proactive techniques that predict pipes at high risk of failure with good accuracy. Furthermore, a risk-based decision-making framework can be developed by integrating the condition prediction ML classifiers with a consequence model (Kabir et al. 2018a). The pipes classified as high-risk will be prioritized for maintenance, rehabilitation, and replacement or inspection (Balekelayi & Tesfamariam 2020).
The current work can be expanded by comparing cGAN to other GAN variants (e.g., Wasserstein Generative Adversarial Network (WGAN)), architectures that minimize the instability in GAN models, and other data generation techniques (Zhan et al. 2022; Sun et al. 2023). For future work, the predictive power of the classifiers can be assessed by including additional pipe specific features or others that have a connection with the pipe condition. Finally, the concept of the proposed cGAN-based data augmentation can be further extended to the other utilities (e.g., roads and water pipes).
ACKNOWLEDGEMENTS
The authors acknowledge the financial support through the Natural Sciences and Engineering Research Council of Canada (RGPIN-2019-05584) under the Discovery Grant programs.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.