Urban water supply and drainage systems are a crucial component of urban infrastructure, directly affecting residents' livelihoods and industrial production. The normal operation of water supply and drainage pipelines is of great significance for conserving water resources and preventing water pollution. However, because these pipelines are deeply buried, built from diverse materials, and extensive in length, defect detection is exceptionally complex. Traditional detection methods used in practice, such as ground excavation and destructive testing, typically require the shutdown of water pipelines; this process is time-consuming and labor-intensive, often resulting in significant economic losses. This paper proposes an effective technique for detecting defects in water supply and drainage pipelines. The method captures images of the inner walls of water supply conduits and then applies a large-scale vision-language model (grounded language-image pre-training, GLIP) together with a You Only Look Once version 5 (YOLOv5) model to detect defects within them. The experimental results show that GLIP demonstrates impressive detection performance in zero-shot scenarios, while YOLOv5 performs well on the existing dataset. By combining these two models, we achieve a balance between fast, flexible detection and high precision, making our approach both practical and efficient for real-world applications.

  • This study combines GLIP and YOLOv5 for efficient and accurate pipeline defect detection.

  • It applies zero-shot learning for rapid defect identification without prior training.

  • It provides non-destructive testing to minimize environmental impacts.

  • It enhances early defect detection in urban water systems.

  • It offers a globally applicable solution for sustainable water infrastructure management.

Water supply pipelines are crucial for daily life and industrial activities, as they ensure the delivery of clean water. Imperfections in these pipelines can lead to significant water loss (Taiwo et al. 2023) and introduce pollutants that harm human health (Lee & Schwab 2005). Similarly, sewage systems are critical, and leaks can contaminate groundwater (Reynolds & Barrett 2003), posing health risks and complicating water treatment.

Table 1  Results of the GLIP pre-trained model and YOLOv5 model

                       GLIP model                                YOLOv5 model
            Rupture  Corrosion  Leakage  Seepage     Rupture  Corrosion  Leakage  Seepage
Precision   0.222    0.728      0.275    0.008       0.331    0.887      0.456    0.855
Recall      0.176    0.738      0.259    0.074       0.286    0.891      0.75     0.333
F1 score    0.196    0.734      0.267    0.077       0.307    0.889      0.567    0.479
mAP50       –        –          –        –           0.296    0.933      0.648    0.417
mAP50-95    –        –          –        –           0.0858   0.742      0.444    0.329
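The F1 scores in Table 1 are the harmonic mean of the corresponding precision and recall values; a short sketch (values copied from the YOLOv5 columns of the table) reproduces them to within rounding:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision/recall pairs from the YOLOv5 columns of Table 1.
yolo = {
    "rupture":   (0.331, 0.286),
    "corrosion": (0.887, 0.891),
    "leakage":   (0.456, 0.750),
    "seepage":   (0.855, 0.333),
}
for name, (p, r) in yolo.items():
    # corrosion prints F1 = 0.889, matching Table 1.
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
```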

The automated identification of faults in these systems is therefore critical. Closed-circuit television (CCTV) inspection is a popular method due to its safety and the intuitive nature of the resulting imagery. Traditional approaches for defect detection using CCTV data range from pattern recognition (Guo et al. 2009) to advanced deep learning techniques such as convolutional neural networks (CNNs) (Zhao et al. 2023). In the field of water pipeline defect detection, several studies and applications based on deep learning models have been developed. For instance, Shen et al. (2023) proposed an improved object detection algorithm, enhanced feature extraction (EFE)-Single Shot MultiBox Detector (SSD), which strengthens feature extraction by adding a Receptive Field Block (RFB_s) module and an improved Efficient Channel Attention (ECA) mechanism, and which addresses the imbalance between positive and negative samples during training by replacing cross-entropy loss with focal loss. The You Only Look Once version 5 (YOLOv5) model has also been applied in multiple pipeline defect detection tasks. Wang et al. (2023) introduced a pipeline defect detection model based on an improved YOLOv5s algorithm. This model incorporates convolution modules and Grouped Spatial Convolution (GSConv) to simplify the model structure while integrating the Convolutional Block Attention Module (CBAM) attention mechanism. As a result, it significantly improves detection accuracy and speed, achieving a mean average precision (mAP) of 80.5% and a detection speed of 75 frames per second (FPS). Hu et al. (2022) applied the YOLOv3 network model to the identification and localization of sewage pipeline defects based on a self-designed pipeline detection robot system. Moreover, Chen et al. 
(2024) proposed a cascaded deep learning approach combining YOLOv5 and pre-trained Vision Transformer (ViT) models, which performed well in detecting and classifying pipeline defects. Yuksel et al. (2023) introduced a novel cascaded deep learning model using the Swin Transformer Backbone YOLOv5 (SwinYv5) for object detection and a cross-residual CNN (CR-CNN) for quantifying pipeline defects.

Although deep learning offers high accuracy and the ability to discern geometric features of defects, enabling the simultaneous detection and classification of multiple pipeline defects, models such as YOLO architecture still face challenges in detecting small objects due to their limited size, low visibility, and the lack of large-scale training datasets. Future improvements in pipeline defect detection using deep learning models based on YOLO architecture can be achieved by introducing more sophisticated small object detection methods and creating large-scale datasets that include a variety of defect types and sizes.

This paper proposes a method for detecting defects in water supply and drainage pipelines based on the grounded language-image pre-training (GLIP) model and the YOLOv5 model. The defect detection utilizes CCTV-collected images of pipeline defects, which include structural defects (corrosion, misalignment, foreign object intrusion, and leakage) and functional defects (sediment, scaling, blockages, and floating debris), encompassing a total of eight types of defects. The GLIP model focuses on evaluating the detection success rate without the need for dataset fine-tuning, assessing the transferability of this approach and thereby demonstrating the effectiveness of using pre-trained models for defect detection. The advantages of this work are as follows:

  • (1) Rich real-world data: The dataset collected independently includes abundant real-world scenario data, enhancing the generalization capability of the model.

  • (2) Zero-shot performance: The GLIP pre-trained model demonstrates excellent detection capability in zero-shot scenarios, achieving efficient defect detection without requiring extensive training on task-specific data.

  • (3) Combining GLIP and YOLOv5: By integrating the GLIP and YOLOv5 models, this method exhibits outstanding detection accuracy and speed, making it highly practical and promising for application.

The GLIP pre-trained model

The GLIP pre-trained model is a unified model for object detection and grounding tasks (Li et al. 2022) and offers several advantages for object detection: (1) compared with previous supervised models (such as the Fast region-based convolutional neural network (RCNN) and Dynamic Head (Dai et al. 2021)), GLIP demonstrates superior detection performance in both zero-shot and fine-tuning settings; (2) with only one-shot training, GLIP-L is competitive with fully supervised Dynamic Head models; and (3) GLIP can perform all downstream tasks without changing the model parameters.

For pipeline defect detection, the advantages of GLIP are as follows:

  • (1) GLIP's robust zero-shot performance allows the model to recognize targets of untrained categories, which is highly beneficial in scenarios where it is impractical to collect extensive data for every type of defect.

  • (2) GLIP is more accessible to grassroots pipeline maintenance workers who may not have training in artificial intelligence or deep learning, as specific tasks can be executed with GLIP without the need to adjust model parameters.
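GLIP reformulates detection as phrase grounding: the category names are joined into a single text prompt, and predicted boxes are grounded to phrases in that prompt, which is what lets new defect categories be added by editing text alone. A minimal sketch of this prompt construction (the separator convention is an assumption modeled on the reference implementation, not a detail given in this paper):

```python
def build_glip_prompt(defect_classes: list[str]) -> str:
    """Join category names into one grounding caption, GLIP-style.

    The '. ' separator treats each class name as a separate phrase
    in the caption; detections are then grounded to these phrases.
    """
    return ". ".join(defect_classes) + "."

# The eight defect types considered in this study.
classes = ["rupture", "corrosion", "leakage", "seepage",
           "misalignment", "sediment", "scaling", "floating debris"]
print(build_glip_prompt(classes))
# → rupture. corrosion. leakage. seepage. misalignment. sediment. scaling. floating debris.
```

A zero-shot detector then returns boxes paired with the character spans of the matched phrases, so no class head needs retraining when the list changes.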

The YOLOv5 model

The YOLOv5 model is a single-stage object detection algorithm that predicts class confidences and bounding boxes directly from the image's feature maps in a single forward pass. Unlike two-stage object detection models (such as RCNN, Fast RCNN, Faster RCNN, and Mask RCNN), which first propose regions of interest (ROIs) and then perform classification, the YOLO algorithm treats object detection as a single-stage problem, predicting bounding boxes and their associated class probabilities from the entire image at once. This approach makes YOLO significantly faster than two-stage models, albeit with a slight trade-off in accuracy. Its flexibility allows fine-tuning on specific datasets to achieve better performance.

The YOLOv5 model belongs to the family of compound-scaled object detection models. It uses a fully convolutional network to process images, dividing them into a grid where each cell is responsible for detecting objects within it. YOLOv5 uses CSPDarknet as its backbone, PANet as its neck, and YOLO layers as its head. This design reduces the model's parameters and floating-point operations (FLOPs). These components enhance the flow of information and the utilization of low-level features in end-to-end training, thereby increasing the accuracy of multi-scale predictions and localization. YOLOv5 also uses the Generalized Intersection over Union (GIoU) loss function and a weighted non-maximum suppression (NMS) process to obtain the optimal bounding boxes.
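The GIoU loss mentioned above augments plain IoU with a penalty term based on the smallest box enclosing both prediction and ground truth, so that even non-overlapping boxes yield a useful training signal. A self-contained sketch for axis-aligned boxes in (x1, y1, x2, y2) form (the loss used in training is 1 − GIoU):

```python
def giou(a, b):
    """Generalized IoU for two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest box enclosing both inputs.
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return iou - (c_area - union) / c_area

print(giou((0, 0, 2, 2), (0, 0, 2, 2)))   # 1.0 for identical boxes
print(giou((0, 0, 1, 1), (2, 2, 3, 3)))   # ≈ -0.778 for disjoint boxes
```

Unlike IoU, which is zero for all disjoint pairs, GIoU decreases as the boxes move further apart, which is why it gives better gradients for box regression.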

In this study, the YOLOv5 model is used for binary detection tasks because it balances classification accuracy with high detection efficiency. The ViT is employed as a cascaded model for pipeline defect classification. This combination of a pre-trained YOLOv5 object detection model and a ViT image classification architecture is suitable for offline analysis of pipeline inspection data, effectively balancing detection precision and efficiency. By leveraging existing pre-trained models, this method reduces the need for large amounts of specific defect training data and model tuning, thus achieving rapid analysis of pipeline inspection data.

Results of the GLIP pre-trained model

We used a German IPEK pipeline inspection robot to conduct defect detection on water supply pipelines around Guangzhou, China, and collected 467 internal pipeline images (13 categories) as our dataset through CCTV. The pipeline robot consists of a controller, an automatic cable reel, a crawler vehicle equipped with a high-definition imaging system, a memory unit, and a telescopic control rod. The robot enters the pipeline on the crawler, capturing real-time images of the pipeline's interior with its integrated imaging system; these are displayed on the control panel monitor and stored in memory. The crawler vehicle measures 310 mm in length, 110 mm in width, and 90 mm in height, weighs 6 kg, and has a maximum waterproof depth of 10 m. The camera system measures 168 mm in length, 81 mm in width, and 72 mm in height, weighs 1.5 kg, and offers a sensitivity of 1 lux, a horizontal resolution of 460 lines, 10× optical zoom, 12× digital zoom, a 60° viewing angle, and LED lighting. To facilitate the robot's entry into the pipeline for image acquisition, all images were captured during pipeline maintenance when there was no water flow inside the pipeline. We used the GLIP model for detection. The model successfully detected various types of defects, with the defects in the input images marked with green bounding boxes, as shown in Figure 1.
Figure 1

Pipeline defect detection results based on the GLIP pre-trained model.

Figure 2

Pipeline defect detection results based on the YOLOv5 model.

Figure 3

F1-confidence curve and precision–recall curve of the trained YOLOv5 model.


In zero-shot scenarios, where the model has not been specifically trained for the task, the GLIP method performs well in detecting foreign object intrusion, scaling, leakage, corrosion, cracks, and floating debris and can correctly identify misalignment and sediment. This indicates that GLIP can identify defects without being trained on specific data. However, as shown in Figure 1(a), the detection boxes marked two parts of the pipeline but did not correctly distinguish the gap at the misaligned section. Additionally, while the GLIP model successfully detected sediment targets, it did not capture all instances of sediment. This suggests that further training or fine-tuning is needed to improve its accuracy for pipeline defect detection. To address this limitation and enhance detection performance, we employed YOLOv5, which was specifically trained on our dataset to ensure a more comprehensive identification of sediment and other defects. By combining the strengths of both models, we were able to achieve faster detection with GLIP and more precise results with YOLOv5.
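The paper does not prescribe a specific mechanism for combining the two models at inference time; one plausible arrangement, sketched below with placeholder detector functions (all names here are illustrative assumptions), is to use zero-shot GLIP as a fast first-pass screen and the fine-tuned YOLOv5 as the precise second pass:

```python
def two_stage_screen(frames, glip_detect, yolo_detect, conf_gate=0.3):
    """Two-stage screening sketch: the zero-shot model flags candidate
    frames, and only those are passed to the fine-tuned detector.

    `glip_detect` and `yolo_detect` stand in for the two models'
    inference calls, each returning (label, confidence) pairs; their
    exact interfaces are assumptions, not the paper's API.
    """
    confirmed = []
    for frame in frames:
        candidates = glip_detect(frame)            # fast, zero-shot pass
        if any(score >= conf_gate for _, score in candidates):
            confirmed.append((frame, yolo_detect(frame)))  # precise pass
    return confirmed

# Toy stand-ins for the two detectors.
fake_glip = lambda f: [("corrosion", 0.6)] if "defect" in f else []
fake_yolo = lambda f: [("corrosion", 0.9)]
print(two_stage_screen(["clean_1", "defect_2"], fake_glip, fake_yolo))
# → [('defect_2', [('corrosion', 0.9)])]
```

The gate threshold trades recall for inspection speed: a low gate keeps GLIP's broad zero-shot coverage while the second pass restores precision.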

Results of the YOLOv5 model

The YOLOv5 model achieved commendable performance under conditions of sparse samples or indistinct features in the existing dataset. Training was conducted using an Intel(R) Xeon(R) Silver 4210R CPU, 128 GB RAM, and an NVIDIA V100s GPU with 32 GB of memory. The final training and detection results for the four types of pipeline defect (rupture, corrosion, leakage, and seepage) are shown in Figure 2.

The F1 confidence curve illustrates the F1 scores for various categories at different confidence thresholds. As shown in Figure 3, it is evident that the corrosion category performs exceptionally well within the high-confidence range, with an F1 score approaching 0.9. In contrast, the rupture and seepage categories exhibit relatively lower F1 scores, indicating that the model's precision and recall for these categories need further optimization.

The precision–recall curve provides insights into the model's precision at different recall rates. It can be observed that the corrosion category achieves very high precision and recall, demonstrating the best performance. However, the rupture and seepage categories have lower precision and recall. This suggests that there are certain challenges in detecting these categories, which may require a larger dataset or a more sophisticated model to improve detection performance for these categories.

Figure 4 illustrates the training and validation losses, along with key performance metrics for the YOLOv5 model. The first column represents the bounding box regression loss during training and validation, indicating the accuracy of predicted bounding boxes. The second column shows the objectness loss for both training and validation, measuring how well the model predicts the presence of objects. The third column displays the classification loss during training and validation, reflecting the accuracy of object classification. The fourth and fifth columns show the precision and recall metrics during training, which represent the model's ability to correctly identify objects. The last two columns depict the mAP at an Intersection over Union (IoU) threshold of 0.5 and across thresholds from 0.5 to 0.95, providing a comprehensive measure of the model's object detection performance. The mAP at an IoU threshold of 0.5 (often referred to as mAP@0.5) measures the model's detection performance when the predicted bounding box overlaps with the ground truth by at least 50%. It is a more lenient metric, allowing moderate variations in bounding box predictions. On the other hand, mAP across thresholds from 0.5 to 0.95 (often written as mAP@0.5:0.95) calculates the average precision over a range of IoU thresholds (from 0.5 to 0.95, typically in steps of 0.05). This provides a more comprehensive and stricter evaluation, as it requires the predicted bounding boxes to match the ground truth more closely (up to 95% overlap). Therefore, mAP@0.5:0.95 is considered a more robust indicator of the model's overall performance across varying levels of overlap accuracy. The training and validation loss curves show a gradual reduction in loss, indicating that the model is progressively learning and converging. 
Simultaneously, the precision, recall, and mAP metrics consistently improve, demonstrating the model's increasing accuracy in detecting and classifying objects. Notably, the validation loss stabilizes after some epochs, suggesting that the model is not overfitting and has good generalization capability on the validation set.
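The practical difference between the two metrics can be made concrete: a predicted box whose IoU with the ground truth is 0.8 counts as a true positive at mAP50's single threshold, but only at 7 of the 10 thresholds swept by mAP50-95. A small sketch:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# The 10 IoU thresholds used by mAP@0.5:0.95 (0.50 to 0.95, step 0.05).
thresholds = [t / 100 for t in range(50, 100, 5)]

pred, gt = (0, 0, 10, 10), (2, 0, 10, 10)   # this overlap gives IoU = 0.8
score = iou(pred, gt)
hits = sum(score >= t for t in thresholds)
print(f"IoU = {score:.2f}; true positive at {hits} of {len(thresholds)} thresholds")
# → IoU = 0.80; true positive at 7 of 10 thresholds
```

Averaging over the full sweep is what makes mAP50-95 the stricter metric: loosely fitted boxes earn partial credit at best.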
Figure 4

YOLOv5 training and validation losses and metrics.


Comparison of test results

Due to the limited size of our dataset, the four defect types with the most samples were selected for testing and validation. The test results are shown in Table 1.

The comparative analysis indicates that YOLOv5 achieves the better results overall. The model performs strongly on corrosion, with high precision and recall, and attains high precision on seepage, although its recall for that category remains low. However, due to the limited size of our dataset, the model tends to overfit, reducing its ability to generalize to new, unseen data. This overfitting is especially noticeable in the lower precision and recall for rupture and seepage defects.

In addition, the pre-trained GLIP model exhibits robust performance in zero-shot scenarios, effectively detecting a wide range of defects without task-specific training on the dataset. This indicates strong transferability and generalization capabilities. If the GLIP model were specifically trained on the current dataset, it would likely achieve superior performance. GLIP's inherent strength in feature extraction and defect recognition suggests that, with dedicated training, it could overcome the limitations YOLOv5 faces due to the constrained dataset size.

This paper proposes a method for detecting defects in water supply and drainage pipelines based on the GLIP pre-trained model and YOLOv5, with experimental validation conducted on a limited-size dataset. The GLIP model, with its robust zero-shot detection capabilities, was used for rapid defect identification during the data collection phase. This allowed for efficient detection without the need for extensive model retraining. However, to further enhance the accuracy and performance of defect detection, we fine-tuned YOLOv5 on our dataset. By combining these two models, we were able to achieve a balance between fast, flexible detection and high precision, making our approach both practical and efficient for real-world applications.

With over 500,000 km of sewage pipelines needing maintenance in China alone (Fan et al. 2024), the widespread adoption of this method could benefit 440 million people who rely on urban water supply and drainage systems (Wang et al. 2021). Additionally, it can protect the surrounding soil and groundwater from pollution. Timely repair of water pipeline defects using this method is crucial for public health, the stability of industrial production, and the sustainable use of water resources. Future work will focus on expanding the dataset and further optimizing the GLIP model to improve detection accuracy.

This work was supported by the Water Conservancy Science and Technology Innovation Project of Guangdong Province (No. 2022-03) and the Shenzhen Science and Technology Program (Grant No. GJHZ20210705141403009).

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Chen, P. C., Li, R., Fu, K., Zhong, Z., Xie, J., Wang, J. & Zhu, J. (2024) A cascaded deep learning approach for detecting pipeline defects via pretrained YOLOv5 and ViT models based on MFL data, Mechanical Systems and Signal Processing, 206, 110919. https://doi.org/10.1016/j.ymssp.2023.110919

Dai, X., Chen, Y., Xiao, B., Chen, D., Liu, M., Yuan, L. & Zhang, L. (2021) Dynamic Head: unifying object detection heads with attentions. In: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, pp. 7369–7378. https://doi.org/10.1109/CVPR46437.2021.00729

Fan, X., Gao, X., Cai, M., Ma, H., Fu, J. & Li, Z. (2024) The sustainable development model of rural domestic sewage treatment in China. In: Weng, C. H. (ed.) Proceedings of the 5th International Conference on Advances in Civil and Ecological Engineering Research (ACEER 2023), Lecture Notes in Civil Engineering, vol. 336. Singapore: Springer. https://doi.org/10.1007/978-981-99-5716-3_22

Guo, W., Soibelman, L. & Garrett, J. H. Jr (2009) Automated defect detection for sewer pipeline inspection and condition assessment, Automation in Construction, 18 (5), 587–596.

Hu, Z., Zhou, J., Yang, B. & Chen, A. (2022) Design of pipe-inspection robot based on YOLOv3, Journal of Physics: Conference Series, 2284 (1), 012023.

Lee, E. J. & Schwab, K. J. (2005) Deficiencies in drinking water distribution systems in developing countries, Journal of Water and Health, 3 (2), 109–127.

Li, L. H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.-N., Chang, K.-W. & Gao, J. (2022) Grounded language-image pre-training. In: Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10955–10965.

Reynolds, J. H. & Barrett, M. H. (2003) A review of the effects of sewer leakage on groundwater quality, Water and Environment Journal, 17 (1), 34–39.

Shen, D., Liu, X., Shang, Y. & Tang, X. (2023) Deep learning-based automatic defect detection method for sewer pipelines, Sustainability, 15 (12), 9164. https://doi.org/10.3390/su15129164

Taiwo, R., Shaban, I. A. & Zayed, T. (2023) Development of sustainable water infrastructure: a proper understanding of water pipe failure, Journal of Cleaner Production, 398, 136653. https://doi.org/10.1016/j.jclepro.2023.136653

Wang, J., Liu, G., Wang, J., Xu, X., Shao, Y., Zhang, Q., Liu, Y., Qi, L. & Wang, H. (2021) Current status, existent problems, and coping strategy of urban drainage pipeline network in China, Environmental Science and Pollution Research, 28, 43035–43049. https://doi.org/10.1007/s11356-021-14802-9

Wang, T., Li, Y., Zhai, Y., Wang, W. & Huang, R. (2023) A sewer pipeline defect detection method based on improved YOLOv5, Processes, 11 (8), 2508. https://doi.org/10.3390/pr11082508

Yuksel, V., Tetik, Y. E., Basturk, M. O., Recepoglu, O., Gokce, K. & Cimen, M. A. (2023) A novel cascaded deep learning model for the detection and quantification of defects in pipelines via magnetic flux leakage signals, IEEE Transactions on Instrumentation and Measurement, 72, 1–9, Art. no. 2512709. https://doi.org/10.1109/TIM.2023.3272377

Zhao, C., Hu, C., Shao, H., Wang, Z. & Wang, Y. (2023) Towards trustworthy multi-label sewer defect classification via evidential deep learning. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, pp. 1–5.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).