Abstract
Sewer systems play a key role in cities, safeguarding public assets and safety. Timely detection of defects can effectively alleviate system deterioration. Conventional manual inspection is labor-intensive, error-prone and expensive. Object detection is a powerful deep learning technique that can complement and/or replace conventional inspection, especially in complex environments. This study compares two classic object-detection methods, namely the faster region-based convolutional neural network (R-CNN) and You Only Look Once (YOLO), for the detection and localization of five types of sewer defects. Model performances are evaluated in terms of detection accuracy and processing speed under the parameterization impacts of dataset size and training parameters. Results show that faster R-CNN achieved higher prediction accuracy. Training dataset size and maximum number of epochs (MaxE) had the dominant impacts on the performances of faster R-CNN and YOLO, respectively. The processing time of faster R-CNN increased with increasing training data, whereas that of YOLO did not vary significantly. The models' abilities to detect disjoint and residential wall were highest, whereas crack and tree root were more difficult to detect. The results help to better understand the strengths and weaknesses of the classic methods and provide useful guidance for practical applications in automated sewer defect detection.
HIGHLIGHTS
A deep learning technique for automated detection of multiple types of sewer defects.
Compared the performances of two types of classic object-detection models.
Evaluated model parameterization impacts and identified key influencing factors.
INTRODUCTION
Sewer systems play a key role in cities, safeguarding public assets and safety by transporting excess water from urban areas to receiving waters or treatment facilities (Butler & Davies 2010; Xie et al. 2017; Cheng & Wang 2018; Zhou et al. 2019). Sewers without proper operation and maintenance are often associated with structural and/or functional failures, and regular inspections are necessary to ensure system performance and reliability (Tafuri & Selvakumar 2002; Stanić et al. 2014). Given the large budgets and resources invested in sewer systems, inspection of sewer defects at an early stage is especially beneficial, so that system deterioration can be detected, fixed or avoided in a timely manner through appropriate measures (Cheng & Wang 2018; Yin et al. 2020). Nevertheless, achieving rapid and accurate detection and classification of sewer defects remains a very challenging task, given the huge number of pipelines and their complex and varying conditions (Kumar et al. 2018; Hassan et al. 2019; Meijer et al. 2019).
Nowadays, closed-circuit television (CCTV) inspection systems are widely applied to examine and record sewer conditions. The tool is especially necessary under unfavorable circumstances, such as high-pressure, toxic and unsanitary environments (Wirahadikusumah et al. 2001; Hassan et al. 2019; Meijer et al. 2019; Yin et al. 2020). The CCTV videos are then sent to off-site technologists, who assess sewer defects based on their expert knowledge and personal experience. Such a manual assessment process is time-consuming and labor-intensive, and its subjective nature introduces inconsistencies and uncertainties (Dirksen et al. 2013; Meijer et al. 2019; Xie et al. 2019; Yin et al. 2020). There is a pressing need for automated sewer defect detection methods, so that the large amount of time and resources spent on conventional manual assessment can be saved. Automation can also greatly benefit sewer maintenance and management in the long term (Cheng & Wang 2018; Jiang et al. 2019; Kumar & Abraham 2019; Xie et al. 2019).
Deep learning techniques, which have developed rapidly over the past decades, offer a pathway toward such automated methods (Cheng & Wang 2018; Xie et al. 2019; Yin et al. 2020). They have been successfully applied in various fields, including water resources (Roushangar & Alizadeh 2018; Roushangar et al. 2021). One of the most representative and popular algorithms, the convolutional neural network (CNN), has been increasingly used for pipe defect classification (Goodfellow et al. 2016; Gu et al. 2018). Compared with conventional image processing methods, the CNN supports both supervised and unsupervised learning and minimizes the complex feature extraction and training processes, making it superior in terms of both processing speed and detection accuracy (Hassan et al. 2019; Kumar & Abraham 2019; Meijer et al. 2019; Xie et al. 2019). Despite these advantages, a classification-based CNN can only assign a single defect type to an image at a time and thus does not perform well in real-life inspection, where complex sewer conditions (e.g., the coexistence of multiple types of sewer defects) often exist (Krizhevsky et al. 2017).
In recent years, technologies that can detect, classify and localize multiple defects simultaneously, namely object detection, have gained widespread attention (Cheng & Wang 2018; Jiang et al. 2019; Kumar & Abraham 2019; Yin et al. 2020). Several object-detection algorithms extend CNNs, including the region-based CNN (R-CNN) (Girshick et al. 2014), fast R-CNN (Girshick 2015) and faster R-CNN (Ren et al. 2017). The main improvements of these algorithms lie in innovative methods for finding the region of interest (ROI): the selective search of R-CNN, the convolutional feature map of fast R-CNN and the region proposal network (RPN) of faster R-CNN (Girshick et al. 2014; Girshick 2015; Ren et al. 2017). For example, Cheng & Wang (2018) adopted a faster R-CNN to automate the detection and localization of tree root, deposit, infiltration and crack inside sewer pipes. Their results showed that faster R-CNN achieves high precision and recall values and enables accurate detection of sewer defects. Nevertheless, these algorithms were reported to be slow due to their two-stage processing approach.
On the other hand, one-stage detection methods, such as the well-known YOLO by Redmon & Farhadi (2017), were reported to be faster than faster R-CNN due to the removal of the region proposal mechanism. YOLO uses only one CNN (whereas faster R-CNN comprises an RPN plus a CNN) to directly predict the classes and locations of different objects in an image. Thus, in theory, the YOLO model has a higher calculation speed, but at some loss in accuracy, as precise localization and classification are harder to achieve in a single stage. Kumar & Abraham (2019) adopted a YOLO model to detect pipe fractures. Yin et al. (2020) applied a YOLO-based object detector for real-time detection of six types of sewer defects.
Nevertheless, research comparing the performances of these two classic types of object-detection algorithms has been limited, especially for detecting multiple types of sewer defects within a consistent framework. This study applies object-detection-based deep learning for accurate and efficient detection of multiple types of sewer defects as a complement to and/or replacement for conventional manual inspection. The faster R-CNN and YOLO_v2 models were compared for automated sewer defect detection, and model performances were evaluated in terms of detection accuracy and processing speed. Five types of sewer defects, namely crack (CR), disjoint (DJ), obstacle (OB), residential wall (RW) and tree root (TR), were investigated. The results provide insights into the relative performance (e.g., strengths and weaknesses) of the classic object-detection techniques for sewer defect detection and offer guidance on selecting an appropriate method for practical applications.
MATERIALS AND METHODS
The overall workflow of this study (Figure 1) consists of (1) sewer defect image processing, including data augmentation and annotation techniques, (2) a CNN model, pretrained as the feature extractor for the object-detection methods, (3) object-detection model training and testing: faster R-CNN vs. YOLO_v2 and (4) performance evaluation and comparison. Details are explained in the following sections.
Image processing
The investigated sewer images include five types of defects, namely CR, DJ, OB, RW and TR. These five types were selected because they are the most commonly encountered sewer defects in southern China (Lin 2014; Qi et al. 2017; Xiao et al. 2019). The original images were obtained from multiple sources of CCTV videos under various pipe conditions. The images were further examined and assigned defect labels by sewer experts. All images were rescaled to a uniform 256 × 256 pixel resolution based on technical details suggested in the previous literature (Cha et al. 2017; Xu et al. 2020), which retains the image features while keeping computational cost low, even though higher-resolution images can provide more information for model training (Cheng & Wang 2018).
To improve network detection accuracy and prevent overfitting, data augmentation techniques were employed to increase the size and variety of the model dataset (Krizhevsky et al. 2017; Kumar et al. 2018; Douarre et al. 2019; Zhang et al. 2019). Data augmentation is an important technique, especially under limited-data conditions, to generate new and representative data and thus enhance the quantity and quality of the dataset. Consequently, the technique can significantly improve the performance of the neural network (Li et al. 2020b; Rodriguez-Lozano et al. 2020; Xu et al. 2020). In summary, geometric transformations (rotation, mirroring and translation) and color transformations (dithering, color adjustment, and Gaussian, salt-and-pepper and Poisson noise) were applied (Figure 2). Details of the applied data augmentation are given in Table 1; in total, 11 operations were adopted. The original set of 610 defect images was thereby enriched to 7,320 images for model training and validation. During the data augmentation, the image format (.tif) and resolution of each image were kept unchanged. Color transformation may alter the brightness and definition of an image, but inputting images of different qualities helps ensure higher model robustness. Finally, the 'Image Labeler' toolbox in MATLAB was employed to label the rectangular ROIs (i.e., ground-truth bounding boxes) and the associated class IDs (namely CR, DJ, OB, RW and TR) for object detection of the sewer defects.
Data augmentation operations adopted in the study
| General type | Subtype | Description |
| --- | --- | --- |
| Geometric transformations | Rotation | Rotate the image by 15° counterclockwise. |
| | Rotation | Rotate the image by 15° clockwise. |
| | Mirror | Mirror the input image horizontally. |
| | Translation | Translate the input image by the [25, 25] vector. |
| | Translation | Translate the input image by the [−25, −25] vector. |
| Color transformations | Dithering | Create an indexed image approximation of the input image by dithering the colors in the Parula colormap. |
| | Color adjustment | Adjust the image grayscale and increase brightness with nonlinear mapping. |
| | Color adjustment | Adjust the image grayscale and reduce brightness with nonlinear mapping. |
| | Adding noise | Add Gaussian white noise with mean 0 and variance selected randomly from the range [0.05, 0.1]. |
| | Adding noise | Add salt-and-pepper noise with a noise density selected randomly from the range [0.1, 0.3]. |
| | Adding noise | Add Poisson noise. |
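For illustration, the operations in Table 1 map onto standard Image Processing Toolbox calls. The following is a minimal sketch assuming a 256 × 256 RGB input; the file name is a placeholder, and the gamma values for the nonlinear brightness adjustment are assumptions (the paper does not report them).

```matlab
% Minimal sketch of the Table 1 augmentation operations (MATLAB, Image
% Processing Toolbox). The file name and gamma values are illustrative.
I = imresize(imread('defect_example.tif'), [256 256]);   % rescaled input

% Geometric transformations
rotCCW  = imrotate(I,  15, 'bilinear', 'crop');   % 15 deg counterclockwise
rotCW   = imrotate(I, -15, 'bilinear', 'crop');   % 15 deg clockwise
mirrorH = flip(I, 2);                             % horizontal mirror
transP  = imtranslate(I, [ 25,  25]);             % translate by [25, 25]
transN  = imtranslate(I, [-25, -25]);             % translate by [-25, -25]

% Color transformations
map      = parula(256);
dithered = ind2rgb(dither(I, map), map);          % indexed approximation (Parula)
brighter = im2uint8(im2double(I).^0.7);           % gamma < 1: nonlinear brightening
darker   = im2uint8(im2double(I).^1.5);           % gamma > 1: nonlinear darkening
gaussN   = imnoise(I, 'gaussian', 0, 0.05 + 0.05*rand);  % variance in [0.05, 0.1]
saltN    = imnoise(I, 'salt & pepper', 0.1 + 0.2*rand);  % density in [0.1, 0.3]
poisN    = imnoise(I, 'poisson');                 % Poisson noise
```

Together, the five geometric and six color operations yield the 11 augmented variants per original image described above.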
CNN-based feature extractor
A pretrained CNN model (Zhou et al. 2021) is used as the feature extractor for both faster R-CNN and YOLO_v2 (Figure 3). Several essential layers perform different functions during feature extraction: (1) the convolutional layer (CONV), which extracts local features from input images; the convolution operation computes weighted sums over local regions to produce feature maps, multiplying the elements of the convolutional kernel with the elements of the input data (Cha et al. 2017; Gu et al. 2018; Xie et al. 2019). (2) The activation layer with rectified linear units (ReLUs), which adds non-linearity between layers, allowing the model to converge faster and enhancing its computing capability (Nair & Hinton 2010; Krizhevsky et al. 2017). (3) The max pooling layer (MaxPOOLing), which reduces the dimensions of the feature maps by downsampling, taking the maximum value over local regions of the prior/input layer (Scherer et al. 2010; Xie et al. 2019; Teng et al. 2020). (4) The fully connected layer, which computes class probabilities by forming final nonlinear combinations of the features (Hassan et al. 2019; Meijer et al. 2019). Meanwhile, the softmax function is used to calculate the output probability scores of the predicted classes. The detailed architecture of the pretrained CNN model is shown in Figure 3: one input layer, three convolutional layers, two max pooling layers, three activation layers (i.e., kernel functions), one fully connected layer and one output layer, with the softmax layer placed after the fully connected layer and before the output layer.
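Written as a MATLAB layer array, the described architecture takes the following form. This is a sketch under assumptions: the paper specifies the layer types and counts (Figure 3) but not the kernel sizes or filter counts, so the 3 × 3 kernels and 16/32/64 filters here are illustrative.

```matlab
% Sketch of the pretrained feature-extraction CNN; layer types and counts
% follow Figure 3, while kernel sizes and filter counts are assumed.
layers = [
    imageInputLayer([256 256 3], 'Name', 'input')
    convolution2dLayer(3, 16, 'Padding', 'same', 'Name', 'conv1')
    reluLayer('Name', 'relu1')
    maxPooling2dLayer(2, 'Stride', 2, 'Name', 'pool1')
    convolution2dLayer(3, 32, 'Padding', 'same', 'Name', 'conv2')
    reluLayer('Name', 'relu2')
    maxPooling2dLayer(2, 'Stride', 2, 'Name', 'pool2')
    convolution2dLayer(3, 64, 'Padding', 'same', 'Name', 'conv3')  % last CONV
    reluLayer('Name', 'relu3')
    fullyConnectedLayer(5, 'Name', 'fc')    % five defect classes
    softmaxLayer('Name', 'softmax')         % output probability scores
    classificationLayer('Name', 'output')];
```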
Object-detection models
Faster R-CNN
Architecture of the applied (a) faster R-CNN model and (b) YOLO_v2 model.
Several inputs were specified to parameterize the faster R-CNN network: (1) network input size, (2) feature extraction network and (3) training options. In this study, we specified a network input size of [256 256 3], i.e., 256 × 256 pixel resolution with three channels (an RGB image), to ensure a reasonable computing time; all images were therefore resized during image processing prior to the training step. As the feature extraction layer, the last CONV of the pretrained CNN model was selected based on empirical analysis. The network layer design was primarily based on the official MATLAB documentation and our previous research results (Teng et al. 2021; Lin et al. 2022), which explained the functions of all layers in detail and confirmed that these layers perform excellently for defect image detection and capture the most abundant defect features. Furthermore, option parameters such as the initial learning rate, the solver for training the network, the maximum number of epochs and the mini-batch size can be defined to specify the network training options. Finally, the faster R-CNN object detector was set up and trained by calling the 'trainFasterRCNNObjectDetector' tool in MATLAB with the training dataset.
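The corresponding MATLAB calls can be sketched as follows. This is one plausible wiring of the set-up described above, not the authors' exact script; the annotation table 'trainingData' and the reuse of the layer array 'layers' from the feature-extractor sketch are assumptions.

```matlab
% Hedged sketch of faster R-CNN training (Computer Vision Toolbox).
% 'trainingData' is assumed to be a table whose first column holds image
% file names and whose remaining columns hold the [x y w h] ground-truth
% boxes for each of the five defect classes (CR, DJ, OB, RW, TR).
options = trainingOptions('sgdm', ...
    'InitialLearnRate', 0.001, ...
    'MiniBatchSize',    8, ...
    'MaxEpochs',        100);   % MaxE was varied over 50/100/200/300

% Passing the pretrained layer array lets MATLAB convert the
% classification network into a faster R-CNN detector, with the RPN
% added internally.
detector = trainFasterRCNNObjectDetector(trainingData, layers, options);
```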
YOLO_v2 model
To create the YOLO_v2 object-detection network, the network input size, number of object classes, anchor boxes, base network and feature extraction layer need to be defined in the 'yolov2Layers' function, which returns a LayerGraph object (in the MATLAB platform). Similarly, training options such as the initial learning rate, the solver for training the network and the maximum number of epochs were defined. Finally, with the training dataset, LayerGraph object and option settings as inputs, the YOLO_v2 detector was established using the 'trainYOLOv2ObjectDetector' function, as sketched below. The training parameters shared by faster R-CNN and YOLO_v2 were: (1) optimizer: SGDM; (2) mini-batch size: 8 and (3) learning rate: 0.001. Training was performed on a computer with an NVIDIA GeForce GTX 1650 GPU and an Intel Core CPU, running Windows 10.
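A matching sketch for YOLO_v2 is given below. The anchor boxes and the feature-layer name are illustrative assumptions; 'layers' is the layer array from the feature-extractor sketch and 'options' is the SGDM configuration defined in the faster R-CNN sketch.

```matlab
% Hedged sketch of the YOLO_v2 set-up (Computer Vision Toolbox).
inputSize   = [256 256 3];
numClasses  = 5;                              % CR, DJ, OB, RW, TR
anchorBoxes = [32 32; 64 64; 128 96];         % assumed anchor boxes
lgraph = yolov2Layers(inputSize, numClasses, anchorBoxes, ...
                      layerGraph(layers), 'relu3');  % assumed feature layer
detector = trainYOLOv2ObjectDetector(trainingData, lgraph, options);
```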
Experimental setup and performance evaluation
Experimental setup
The effects of three main parameters on model performance were investigated: (1) the proportion of the total number of images used for model training and validation (ND). Six values of ND were tested: 0.1, 0.2, 0.4, 0.6, 0.8 and 1.0. For example, with ND = 0.8, 80% of the images in the total dataset were used for model training and validation. (2) The proportion of ND used for model training (Tp). The tested Tp values were 0.5, 0.6, 0.7, 0.8 and 0.9. Once Tp is assigned, the images are randomly divided into two independent datasets for training and validation, respectively. For example, when ND = 0.8 and Tp = 0.9, 72% (80% × 90%) of the total images were employed for model training. The first two parameters thus test the impact of different partitions/percentages of images on model performance. (3) The maximum number of epochs (MaxE), where an epoch is one full pass of the training algorithm over the entire training set. Four values of MaxE were tested: 50, 100, 200 and 300.
To summarize, the first two parameters mainly define the sizes of the different datasets, and the third is the main influencing factor in the model training process. The adopted parameter values were set according to values suggested in the literature (Deng et al. 2020; Kumar et al. 2020; Li et al. 2020a, 2021). Theoretically, the higher these parameter values, the better the anticipated detection performance. Both object-detection models were run for each combination of the three parameters (ND: 0.1, 0.2, 0.4, 0.6, 0.8 and 1.0; Tp: 0.5, 0.6, 0.7, 0.8 and 0.9; MaxE: 50, 100, 200 and 300), giving in total 6 × 5 × 4 = 120 parameter combinations for model performance comparison, as sketched below. As the training and validation sets and MaxE differ in each simulation, the associated processing speed also varies.
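The 120 runs can be organized as three nested loops over ND, Tp and MaxE. The driver below is a sketch under the assumption that 'allData' holds the full 7,320-image annotation table; the paper does not publish its experiment script.

```matlab
% Sketch of the 6 x 5 x 4 = 120-run parameter sweep. 'allData' is the
% assumed full annotation table; splits are redrawn at random per run.
NDs   = [0.1 0.2 0.4 0.6 0.8 1.0];
Tps   = [0.5 0.6 0.7 0.8 0.9];
MaxEs = [50 100 200 300];
for nd = NDs
    sel    = randperm(height(allData), round(nd*height(allData)));
    subset = allData(sel, :);                  % ND: share of all images
    for tp = Tps
        idx       = randperm(height(subset));
        nTrain    = round(tp*height(subset)); % Tp: share used for training
        trainData = subset(idx(1:nTrain), :);
        valData   = subset(idx(nTrain+1:end), :);
        for maxE = MaxEs
            opts = trainingOptions('sgdm', 'InitialLearnRate', 0.001, ...
                                   'MiniBatchSize', 8, 'MaxEpochs', maxE);
            % Train and time both detectors here, then evaluate AP on
            % valData (see the performance evaluation sketch below).
        end
    end
end
```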
Performance evaluation
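Model accuracy is reported as the average precision (AP) per defect class and the mean AP over the five classes, together with the processing time of each run. A hedged sketch of this evaluation, assuming the standard Computer Vision Toolbox workflow and the variable names of the earlier sketches, is:

```matlab
% Hedged sketch of the AP evaluation. Detections are collected over the
% validation images, then scored against the ground-truth boxes; mean AP
% averages the per-class APs over the five defect classes.
numImages = height(valData);
results   = table('Size', [numImages 3], ...
                  'VariableTypes', {'cell', 'cell', 'cell'}, ...
                  'VariableNames', {'Boxes', 'Scores', 'Labels'});
for k = 1:numImages
    I = imread(valData.imageFilename{k});
    [bboxes, scores, labels] = detect(detector, I);
    results.Boxes{k}  = bboxes;
    results.Scores{k} = scores;
    results.Labels{k} = labels;
end
% Per-class AP at the default intersection-over-union threshold of 0.5.
ap     = evaluateDetectionPrecision(results, valData(:, 2:end));
meanAP = mean(ap);
```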
RESULTS AND DISCUSSION
Example detections of the object-detection models are shown in Figure 5. Each bounding box carries two pieces of information: the predicted class label (abbreviated) and the corresponding confidence score. The models can identify and label multiple defects simultaneously, even under very complex conditions, such as in Figure 5(a), 5(c) and 5(d); a minimal sketch of producing such annotated outputs follows this paragraph. Comparisons of the performances of faster R-CNN and YOLO_v2 under parameterization impacts are illustrated by the compass plots in Figure 6. As indicated by the results, both dataset size and training parameters can influence the performance of the defect detection models. The mean APs are shown in the innermost black ring, where the APs are categorized into four classes (i.e., the 0–25, 25–50, 50–75 and 75–100 quartiles). Similarly, the corresponding processing time is categorized and illustrated in the green ring next to the mean AP results. The combinations of the investigated parameters are shown in the three outer rings (MaxE in yellow, Tp in blue and ND in red).
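Producing annotated outputs like those in Figure 5 takes only a few toolbox calls; the sketch below is illustrative (placeholder image name, assumed score threshold).

```matlab
% Minimal sketch of rendering detections as in Figure 5.
I = imread('test_defect.tif');                       % placeholder image
[bboxes, scores, labels] = detect(detector, I, 'Threshold', 0.5);
captions  = cellstr(string(labels) + ": " + string(round(scores, 2)));
annotated = insertObjectAnnotation(I, 'rectangle', bboxes, captions);
figure, imshow(annotated)
```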
AP and processing speed (time) of (a) faster R-CNN and (b) YOLO_v2 under parameterization impacts of MaxE, Tp and ND over the 120 runs. Each ring illustrates one type of parameter shown in the legend and each slot (highlighted by the dotted line) in the radial direction corresponds to a specific combination of investigated parameters.
Figure 6 summarizes the model performances of faster R-CNN and YOLO_v2 under each combination of the investigated parameters (ND: 0.1, 0.2, 0.4, 0.6, 0.8 and 1.0; Tp: 0.5, 0.6, 0.7, 0.8 and 0.9; MaxE: 50, 100, 200 and 300). Both compasses are sorted by the value of mean AP (the innermost ring) in descending order in the clockwise direction. Results show that high APs generally required longer processing times. Nevertheless, the parameterization impacts differed between the two models. For faster R-CNN, ND had a strong influence on model performance, and high APs were generally associated with high ND values. This indicates that the number of training images (i.e., the size of the training dataset) is important for enhancing the prediction capability of faster R-CNN. For Tp and MaxE, there was a slight tendency for larger values to contribute to higher APs. For YOLO_v2, MaxE had the dominant impact on model performance, with larger MaxE values contributing to higher detection accuracy, whereas the combinations of Tp and ND made no clear contribution to the mean AP values. This confirms that faster R-CNN was more sensitive to the relevant parameters than YOLO_v2. These findings are essential for identifying the important parameters in the model setting and guiding further use of the two types of object-detection models.
A statistical comparison of the model performance is shown in Figure 7. When the dataset was small (ND ≤ 100), YOLO_v2 achieved higher detection accuracy. Moreover, the mean APs of YOLO_v2 varied much less than those of faster R-CNN. As ND (i.e., the dataset size) increased, faster R-CNN started to outperform YOLO_v2 in identifying sewer defects. The main reason may be that, with more training data, faster R-CNN has more input resources and thus learns the object features more precisely. Overall, faster R-CNN achieved a higher median mean AP than YOLO_v2 in this study. As for processing speed, faster R-CNN required less time at small ND values, although the relative AP it achieved was correspondingly lower; its computation cost then increased with increasing ND. Interestingly, the processing time of YOLO_v2 did not vary significantly over the entire set of simulations. There was therefore a trade-off between detection accuracy and computation cost between the two types of models.
Statistical summary of the relative (a) AP and (b) processing time of faster R-CNN and YOLO_v2 over the 120 runs.
The AP for each type of defect is shown in Figure 8. The faster R-CNN and YOLO_v2 models have different detection capabilities for different types of defects, and the achieved detection accuracy thus differed among the investigated defects. The AP of disjoint was highest (a higher AP means a better detection effect), followed by those of residential wall and obstacle. Crack and tree root had much lower APs than the other types of defects, meaning they were difficult for both object-detection models to predict. One reason may be the complexity of detecting these two types of defects, which have less distinct features than the other defects. Second, cracks and tree roots often occur multiple times in one image and their locations are more scattered, which makes it difficult to detect all of them at once and thus lowers the recall rates. Finally, we also examined the distribution of the five sewer defects in the training datasets. The low APs are unlikely to be attributable to the sample sizes of the two defects, as their distribution proportions (the pie chart) did not differ greatly from those of the other defects. The ratios of the five defects were kept as similar as possible to ensure the comparability of the model results.
Individual prediction accuracy of the five types of sewer defects achieved by the faster R-CNN (F) and YOLO_v2 (Y), respectively. The box edges illustrate the 25th and 75th percentiles of the data, the red line in the center of the box represents the median values and the whiskers mark the 5th and 95th percentiles. The distribution proportions of the five sewer defects in the datasets are illustrated in the pie chart. Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/hydro.2022.132.
CONCLUSIONS AND FUTURE WORK
This paper compares the performances of two types of classic object-detection models, namely faster R-CNN and YOLO_v2, in terms of their prediction accuracy and processing speed. We evaluated the models' capabilities in detecting and localizing five types of sewer defects under the parameterization impacts of dataset size and a model parameter (the maximum number of epochs). Results show that, on the whole, faster R-CNN achieved higher prediction accuracy according to the median values. Regarding parameterization, the size of the training dataset had an essential impact on the performance of faster R-CNN: with more training data, the model learns the object features more precisely. For YOLO_v2, MaxE made the dominant contribution to its prediction accuracy. As for processing speed, the computation cost of faster R-CNN increased with increasing training data, whereas the processing time of YOLO_v2 did not vary significantly over the entire set of simulations. There is thus a trade-off between detection accuracy and computation cost between the two types of models. Furthermore, both models detected disjoint and residential wall best, whereas crack and tree root were more difficult to detect.
We acknowledge limitations of the current study, which future work can address to further improve model performance from four perspectives. First, the network structures and parameters of both models can be optimized to benefit feature extraction and detection performance. Second, the color, shape, brightness and contrast of the defect images can influence model accuracy; we will therefore investigate more data augmentation techniques (e.g., a generative adversarial network (GAN), which can generate large numbers of virtual images with characteristics similar to real-world images) and combination strategies. More original defect images are also needed to enlarge the training dataset and cover as many features of the multiple defect types as possible, and improved labeling techniques are suggested. Third, other types of object-detection methods and/or transfer learning techniques could be considered to improve accuracy, such as the SSD (Single Shot MultiBox Detector) algorithm, especially for multi-scale defects. Finally, one limitation of object detection is that it cannot provide further detailed information on the geometric properties of sewer defects (e.g., shape, area and boundary); future work will therefore look into semantic segmentation techniques that provide pixel-level segmentation of sewer defects for a more accurate description.
Despite these limitations, this study compares the performances of the two types of object-detection methods based on comparable input datasets and model settings. Both models are shown to be capable of detecting and localizing multiple types of defects in sewer pipelines, with great potential to complement the conventional labor-intensive manual sewer inspection. The results provide references for other studies to better understand the relative strengths and weaknesses of the two types of object-detection methods. The findings indicate the important influencing factors of the two models and provide insights for guiding their further use. Equally important, the trade-off revealed between detection accuracy and computation cost can serve as a further reference for practical applications.
ACKNOWLEDGEMENTS
This research was funded by the National Natural Science Foundation of China (Grant No. 51809049), the Science and Technology Program of Guangzhou, China (Grant No. 201804010406) and the National College Student Innovation and Entrepreneurship Training Program Project (Grant No. 202111845038).
CONFLICTS OF INTEREST
The authors declare no conflict of interest.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.