To address the localization and identification of fish in complex fishway environments and to improve fish detection accuracy, this paper proposes an object detection algorithm, YOLORG, and a fishway fish detection dataset (FFDD). The FFDD contains 4,591 images collected from the web and from laboratory footage, labeled with the LabelImg tool and covering fish in a wide range of complex scenarios. The YOLORG algorithm, based on YOLOv8, replaces the traditional FPN–PAN network with a C2f Multi-scale feature fusion network built on a Gather-and-Distribute mechanism, which reduces the information loss that occurs when feature maps of different sizes are fused. We also propose a C2D Structural Re-parameterization module with rich gradient flow to further improve detection accuracy. Experimental results show that YOLORG improves mAP50 and mAP50-95 by 1.2% and 1.8%, respectively, over the original network on the joint VOC dataset, compares favorably in accuracy with other state-of-the-art object detection algorithms, and is able to detect fish in very turbid environments after training on the FFDD.

  • We propose an FFDD fish detection dataset.

  • We propose a Structural Re-parameterization convolution module C2D.

  • We propose a C2f Multi-scale feature fusion network to solve the problem of information loss in the YOLOv8 network.

  • We propose a YOLORG-series model constructed by C2D Structural Re-parameterization module and C2f Multi-scale feature fusion network.

  • The proposed method has fewer parameters.

Fish are an important part of the ecosystem and of human culture, with more than 300 million people worldwide relying on them as their main food source (Vianna et al. 2020). At present, the destruction of fish habitat has led to a decline in fishery resources. Fishways and other fish-passage facilities have become an important measure for mitigating the adverse impacts of water conservancy projects. Biologists and fisheries practitioners estimate the presence and abundance of fish from underwater videos and images to help understand the natural environment and support the industry. However, manually analyzing large amounts of underwater video and imagery is tedious and time-consuming (Li et al. 2015). It is therefore practical to use object detection techniques to detect fish in videos and images automatically, saving practitioners valuable time.

Traditional fish detection methods rely on sonar equipment: spatial and temporal sampling plans are developed, and the collected data are analyzed to detect fish. Digital image processing techniques have also been applied to fish detection, distinguishing fish by background subtraction and other classification methods. Object detection methods have been used frequently as well, but mostly with two-stage or early single-stage detectors. Each approach has limitations. Sonar-based detection depends on expensive equipment, which makes it poorly suited to small-scale smart fisheries projects. Detection based on digital image processing performs poorly in both detection and classification and cannot meet the needs of modern intelligent fisheries. Detection based on early object detection methods still falls short of modern smart fisheries requirements in inference speed and detection accuracy when images and videos are analyzed automatically. To address these limitations in equipment, method, and performance, we propose an improved object detection algorithm, YOLORG (YOLO Re-parameterization and Gather-and-Distribute Network). YOLORG is an end-to-end fish detector that automatically detects multiple types of fish in images and videos, helping to build smart fisheries and smart fishways.

The YOLO series of algorithms is a cornerstone of the object detection field, and many YOLO variants improve network performance by modifying the network structure and designing new loss functions. What they often overlook, however, is that information is lost when feature maps at different scales are fused, although FPN–PAN networks have greatly alleviated this problem (Wang et al. 2023a). To this end, we design a C2f Multi-scale feature fusion network. It fuses feature maps at different scales through a Gather-and-Distribute mechanism consisting of a Feature Alignment Module (FAM), an Information Fusion Module (IFM), and an Information Injection Module (Inject_C2f), which reduces information loss in the fusion phase and improves network performance. In addition, we apply Structural Re-parameterization to C2f, the main module in the YOLOv8 network. The resulting C2D module has a complex multi-branch structure in the training phase and converts it into a single convolution in the inference phase, so it retains the performance of the multi-branch structure while being embedded equivalently into the network to improve accuracy. Finally, we propose a fishway fish detection dataset (FFDD) containing 3,428 training images and 1,163 test images of fish in motion in complex environments.

Overall, the contributions of this paper are as follows:

  • (1) We propose the FFDD, containing 3,428 training images and 1,163 test images of fish in motion in a variety of complex environments. The FFDD enriches the available underwater object detection datasets, adds fish images from special environments, and expands the scope of underwater object detection.

  • (2) We propose a Structural Re-parameterization convolution module, C2D, which has a complex multi-branch structure during training and converts it into a single convolution during inference, so that it is embedded equivalently into the network to improve accuracy. The C2D Structural Re-parameterization method offers an approach to optimizing detector performance and can be applied in the same way to other detection algorithms.

  • (3) We propose a C2f Multi-scale feature fusion network to address the problem of information loss. It aligns feature maps of different scales through the FAM module, fuses the aligned feature maps through the IFM module, and injects the fused feature information into different levels through the Inject_C2f module. The C2f Multi-scale feature fusion network offers an approach to the information loss problem in object detection.

  • (4) We propose the YOLORG series of models, built from the C2D Structural Re-parameterization module and the C2f Multi-scale feature fusion network. The YOLORG-n model achieves 82.6% mAP50 and 63.9% mAP50-95 on the joint VOC dataset, outperforming the existing YOLO series of object detectors and related fish detection methods. Multiple experiments demonstrate the effectiveness of the proposed detection algorithm.

Structural Re-parameterization

In recent years, Structural Re-parameterization has been one of the most active topics in CNN research. It obtains better performance without adding parameters at inference: a multi-branch module is trained, and its branches are then fused into a single convolution for inference. Ioffe & Szegedy (2015), Szegedy et al. (2015), Szegedy et al. (2016) and Szegedy et al. (2017) found that a multi-branch network structure can effectively enrich the feature space, demonstrating the importance of different connections and combinations of multiple branches. Hu et al. (2018) and Wang et al. (2020) proposed the SE and ECA channel attention modules, which are effective in improving the representation ability of modules. Ding et al. (2019), Ding et al. (2021a) and Ding et al. (2021c) proposed convolutional modules with a multi-branch structure that effectively extract spatial feature information and can be converted into a single convolution during inference, improving detection accuracy without additional parameters. Ding et al. (2021b) first applied Structural Re-parameterization to fully connected (FC) layers by constructing convolutional layers within RepMLP during training and merging them into FC layers for inference. Combining the global representation capability and location awareness of FC layers with the local prior of convolution improves the performance of the neural network. Meng et al. (2021) found that Structural Re-parameterization can only be applied to linear blocks, while the nonlinear layer (ReLU) must be placed outside the residual connections. They therefore proposed the RM operation, which removes residual connections that contain nonlinear layers while keeping the output of the model unchanged. The work in this paper applies Structural Re-parameterization to the main module of an object detection algorithm and uses it to detect fish in a real environment, improving the accuracy of the detector.

Multi-scale feature fusion

With the development of object detection, it has become clear that the Neck network of a detector plays a very important role in small object recognition and in improving accuracy. Redmon & Farhadi (2018) used an FPN network for the first time in the YOLOv3 algorithm to achieve a top-down fusion of feature maps at different scales, which dramatically improved detector performance. Bochkovskiy et al. (2020) used the FPN–PAN network for the first time in the YOLOv4 algorithm to address the severe loss of shallow information after it passes through a multilayer network, which substantially improved detection accuracy. Li et al. (2022b) designed BiC modules with bi-directional links for the FPN–PAN network and applied Re-parameterization to it, proposing the RepBi–PAN Multi-scale fusion network. This network introduces a bottom-up flow of information into the top-down path, allowing shallow information to participate efficiently in the fusion. Xu et al. (2022) proposed a Light-Backbone Heavy-Neck structure, which uses Efficient RepGFPN as the Neck network so that high-level semantic information and low-level spatial information can be fully exchanged, and achieved SOTA performance. Li et al. (2022a) found that large models struggle to meet real-time detection requirements on in-vehicle edge computing platforms, while lightweight models built from a large number of depthwise separable convolutions cannot achieve sufficient accuracy. They therefore proposed the Slim-Neck design paradigm, which better balances the accuracy and speed of the model.

Although the above detectors perform well on accuracy metrics, they still lose information when fusing different feature maps, even though the many FPN–PAN designs have greatly mitigated this loss. Wang et al. (2023a) proposed a Gather-and-Distribute Mechanism Multi-scale feature fusion network implemented with convolution and self-attention, which greatly reduces information loss and enhances network performance. The work in this paper draws on the Gather-and-Distribute Mechanism Multi-scale fusion network of the Gold-YOLO algorithm to design a C2f Multi-scale feature fusion network that improves the accuracy of the object detection algorithm.

Application of object detection in fisheries

Smart fisheries can save a great deal of manpower and resources and have become the main trend in fishery development. Paspalakis et al. (2020) used small, low-cost autonomous underwater vehicles instead of divers for periodic inspection of fishing nets, with significant cost savings. Dong et al. (2023) proposed a method to localize fish keypoints based on object detection and a regression model. Using YOLOv5 and perceptual strategies, the method efficiently detects individual fish and estimates keypoints. Yu et al. (2023) used a detection model to measure fish size in place of acoustic detection methods, which require high-cost equipment and have low detection accuracy. Kandimalla et al. (2022) used YOLOv3 and Mask-RCNN to detect and classify eight fish species from a high-resolution DIDSON imaging sonar dataset and integrated the Norfair object tracking framework to track and count fish. Zhang et al. (2020) proposed an automatic fish counting method that estimates the population of farmed Atlantic salmon using machine vision and a new hybrid deep neural network model. Li et al. (2015) trained a Fast R-CNN algorithm with 24,277 homemade ImageCLEF fish images to detect and recognize fish in underwater images. Kay & Merrifield (2021) set up a website, Fishnet.AI, that continuously collects fish detection data covering multiple fish categories and publishes it. Xu & Matzner (2018) annotated 34,316 fish images extracted from the frames of three fish videos and used the dataset to train the YOLOv3 algorithm to identify fish in underwater videos automatically. Muksit et al. (2022) proposed a fish detection model, YOLO-Fish, for detecting fish in real underwater environments. By fixing the up-sampling step problem in YOLOv3 and adding a spatial pyramid pooling module, they improved the model's ability to detect fish in dynamic environments and achieved good results. Li et al. (2023) proposed a fish detection model, RC_YOLOv5, which introduces the Res2Net residual structure and a coordinate attention mechanism to achieve fast and accurate fish detection. Wang et al. (2022) proposed a fish anomaly detection algorithm that adds multi-level features and feature mapping to the YOLOv5 algorithm for accurate fish detection and uses the single-object tracking algorithm SiamRPN++ to track abnormal individuals.

Although the above work on fish detection has made good progress, these methods still suffer from false and missed detections, and most of them require substantial computation and many parameters, which makes them inconvenient to deploy in real projects. The reason is that these detectors remain weak in image preprocessing, network structure, loss calculation, speed-accuracy balance, anchor assignment, and similar aspects, and most of them lose information during feature fusion. This paper applies an object detector that incorporates state-of-the-art methods to fish detection in fishways. We also address the information loss problem in the detector, propose a Structural Re-parameterization method that improves detector performance at no inference cost, and contribute a fish detection dataset, FFDD, containing 3,428 training images and 1,163 test images.

YOLO re-parameterization and Gather-and-Distribute network

The network structure of the n version of the YOLORG (YOLO Re-parameterization and Gather-and-Distribute Network) algorithm is shown in Figure 1. YOLORG comes in five versions (n, s, m, l, and x) that trade off inference speed against detection accuracy. The YOLORG model consists of the SPPF module, the Detect module, the C2D Structural Re-parameterization module, and the C2f Multi-scale feature fusion network. The SPPF module is a commonly used spatial pyramid pooling module that enables adaptive output size to improve detector accuracy. The Detect module adopts the current mainstream decoupled head structure, switches from Anchor-Based to Anchor-Free detection, and uses the TAL dynamic label assignment strategy to calculate the loss. The C2D Structural Re-parameterization module has a complex multi-branch structure that learns more useful feature information in the training phase. In the inference phase, the multi-branch structure is converted equivalently into a single convolution, which is embedded into the network to improve accuracy without additional parameters. The C2D module also has rich gradient flow while remaining lightweight. The C2f Multi-scale feature fusion network combines the outputs of four network layers: it aligns feature information at different scales through the FAM module, fuses the aligned information through the IFM module, and injects the fused information into different levels through the Inject_C2f module, improving detection accuracy. The C2D Structural Re-parameterization module and the C2f Gather-and-Distribute Mechanism Multi-scale feature fusion network are the two new components we propose. They exploit the mathematical properties of convolution and the Gather-and-Distribute idea to handle the feature flow during computation, which improves the detection accuracy of the network and has not appeared in previous fish detection algorithms.
Figure 1

YOLORG network diagram.


Compared with other fish detection algorithms, YOLORG also incorporates other advanced results from the field of object detection. It uses Mosaic image augmentation to improve the generalization of the detector, the geometrically motivated CIoU to calculate the fish localization loss, and the Anchor-Free idea together with the TAL label assignment strategy to reduce the influence of anchors on fish detection. Compared with other fish detection algorithms, YOLORG converges faster during training, generalizes better, achieves higher classification and localization accuracy, and produces fewer false and missed detections. YOLORG is therefore well suited to fish detection.
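As a point of reference for the CIoU term mentioned above, the following minimal sketch computes CIoU for two axis-aligned boxes in plain Python, following the commonly published definition (IoU minus a normalized centre-distance penalty and an aspect-ratio penalty). The function name `ciou` and the `(x1, y1, x2, y2)` box layout are illustrative assumptions and are not taken from the authors' implementation.

```python
import math


def ciou(box_a, box_b, eps=1e-9):
    """CIoU similarity of two boxes given as (x1, y1, x2, y2); the loss is 1 - CIoU."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Plain IoU from the intersection rectangle.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    aw, ah = ax2 - ax1, ay2 - ay1
    bw, bh = bx2 - bx1, by2 - by1
    iou = inter / (aw * ah + bw * bh - inter + eps)

    # Squared diagonal of the smallest box enclosing both inputs.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    diag_sq = enclose_w ** 2 + enclose_h ** 2 + eps

    # Squared distance between the two box centres.
    centre_sq = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(bw / (bh + eps)) - math.atan(aw / (ah + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return iou - centre_sq / diag_sq - alpha * v


same = ciou((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0))      # identical boxes
disjoint = ciou((0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0))  # no overlap
print(same, disjoint)
```

Unlike plain IoU, the value stays informative for non-overlapping boxes: the centre-distance penalty makes it negative and still dependent on how far apart the boxes are.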

The network structure of the YOLORG series of models is defined in a YAML configuration file, which is parsed and instantiated by code at runtime. The parsing pseudo-code is shown in Algorithm 1.
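To illustrate the idea of instantiating a network from a configuration, the sketch below walks a layer list of the kind that loading a YAML file would produce and builds one object per entry. Everything here is an assumption made for illustration: the `[from, repeats, module, args]` entry layout, the `REGISTRY` and `parse_model` names, the placeholder `Layer` class, and the layer values, none of which reproduce the actual YOLORG configuration.

```python
from typing import Any, Callable, Dict, List


class Layer:
    """Hypothetical stand-in for a real network module; it only records its settings."""

    def __init__(self, kind: str, *args: Any) -> None:
        self.kind = kind
        self.args = args

    def __repr__(self) -> str:
        return f"{self.kind}{self.args}"


def make_factory(kind: str) -> Callable[..., Layer]:
    """Return a constructor for the placeholder layer of the given kind."""
    return lambda *args: Layer(kind, *args)


# Maps the module names used in the configuration to constructors.
REGISTRY: Dict[str, Callable[..., Layer]] = {
    kind: make_factory(kind) for kind in ("Conv", "C2D", "SPPF", "Detect")
}

# Data as it might look after loading the YAML file (illustrative values only).
# Each entry is [from, repeats, module, args]; -1 means "previous layer".
config: Dict[str, List[list]] = {
    "backbone": [
        [-1, 1, "Conv", [64, 3, 2]],
        [-1, 3, "C2D", [128]],
        [-1, 1, "SPPF", [256, 5]],
    ],
    "head": [
        [[1, 2], 1, "Detect", [1]],
    ],
}


def parse_model(cfg: Dict[str, List[list]]) -> List[Dict[str, Any]]:
    """Instantiate every configured layer and record where its input comes from."""
    built = []
    for index, (source, repeats, kind, args) in enumerate(cfg["backbone"] + cfg["head"]):
        if kind not in REGISTRY:
            raise KeyError(f"unknown module type: {kind}")
        module = REGISTRY[kind](*args)
        built.append({"index": index, "from": source, "repeats": repeats, "module": module})
    return built


layers = parse_model(config)
for entry in layers:
    print(entry)
```

Keeping the structure in a configuration file means that variants such as the n and s versions can differ only in their configuration values while sharing the same parsing code.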

C2D Structural Re-parameterization module

The C2D Structural Re-parameterization module inherits the basic structure of the C2f module and embeds the Diverse Branch Block (DBB) module into it. The DBB module has a complex multi-branch structure in the training phase and converts it into a single convolution in the inference phase, so it can be embedded equivalently into any existing network structure. By embedding the DBB module into the C2f module, the single convolution obtained at inference not only performs better than the single convolution trained directly in the C2f module but also keeps the multi-gradient-flow advantage of the C2f module, which improves detection accuracy. The C2D module network structure is shown in Figure 2, and the pseudo-code is shown in Algorithm 2.
Figure 2

C2D module network structure diagram.


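Algorithm 2: C2D Module Pseudo-Code

Input: input_tensor; in_channel; out_channel; number = 1; expansion = 0.5;

Output: output_tensor;

Begin Initialization

    c = int(out_channel * expansion)              # hidden channel width
    cv1 = dbb(in_channel, 2 * c, 1, 1)            # 1x1 DBB; its output is split into two halves
    cv2 = dbb((2 + number) * c, out_channel, 1)   # fuses both halves plus `number` bottleneck outputs
    m = torch.nn.ModuleList(BottleneckD(c, c, k=(3, 3), e=1.0) for _ in range(number))

End Initialization

Begin Calculation

    x = list(cv1(input_tensor).chunk(2, 1))       # split along the channel dimension
    x.extend(module(x[-1]) for module in m)       # each bottleneck consumes the previous output
    output_tensor = cv2(torch.cat(x, 1))          # concatenate all branches, then fuse
    return output_tensor

End Calculation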

The C2f module is the main module in the YOLOv8 algorithm. The C2f module references the C3 module in YOLOv5 (Jocher et al. 2021) and the ELAN module in YOLOv7 (Wang et al. 2023b). The C2f module has multiple gradient flow branches in the operation process, which not only can effectively extract spatial feature information but also has a small number of parameters and fast computation. The structure of C2f network is shown in Figure 3.

Diverse Branch Block (DBB) is a Structural Re-parameterization neural network module (Ding et al. 2021a). In the early days of deep learning, networks such as VGG16 and AlexNet were single-branch models, but continued experimentation showed that multi-branch networks can significantly improve performance, and multi-branch models became mainstream. Structural Re-parameterization draws on this design: it trains a module with multiple branches and, at inference time, merges the branches into a single convolution that retains the good performance. The essence of Structural Re-parameterization is the homogeneity and additivity of convolution, that is, its linearity in the kernel, which plays a role similar to the distributive law in arithmetic. Homogeneity and additivity are expressed by the following formulas:
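$$\mathrm{Input} \ast (pF) = p\,(\mathrm{Input} \ast F) \tag{1}$$
$$\mathrm{Input} \ast F^{(1)} + \mathrm{Input} \ast F^{(2)} = \mathrm{Input} \ast \left(F^{(1)} + F^{(2)}\right) \tag{2}$$
Here Input is the input feature map, $\ast$ denotes convolution, $p$ is a real number, and $F$, $F^{(1)}$ and $F^{(2)}$ are convolutional kernel tensors. It is worth noting that additivity holds only when $F^{(1)}$ and $F^{(2)}$ have the same shape.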
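Both properties can be checked numerically. The following minimal sketch does so for a single-channel 2D convolution written in plain Python; the helper names (`conv2d_valid`, `scale`, `add`, `close`) and the example values are assumptions chosen for illustration only.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Sliding-window convolution as used in CNN layers (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def scale(m: Matrix, p: float) -> Matrix:
    """Multiply every entry of m by the scalar p."""
    return [[p * v for v in row] for row in m]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def close(a: Matrix, b: Matrix, tol: float = 1e-9) -> bool:
    """True when two matrices agree entry by entry within tol."""
    return all(abs(x - y) <= tol for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


image = [[1.0, 2.0, 0.0, 1.0],
         [0.0, 1.0, 3.0, 2.0],
         [2.0, 1.0, 0.0, 1.0],
         [1.0, 0.0, 2.0, 3.0]]
f1 = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
f2 = [[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]
p = 2.5

# Equation (1): scaling the kernel equals scaling the output.
homogeneity_holds = close(conv2d_valid(image, scale(f1, p)), scale(conv2d_valid(image, f1), p))

# Equation (2): summing two branch outputs equals one convolution with the summed kernel.
additivity_holds = close(
    add(conv2d_valid(image, f1), conv2d_valid(image, f2)),
    conv2d_valid(image, add(f1, f2)),
)
print(homogeneity_holds, additivity_holds)
```

The additivity check is the one that makes branch merging possible: two parallel branches with same-shaped kernels cost two convolutions during training but only one at inference.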

The six transformations used in the DBB module are shown in Figure 4. All of them are based on the homogeneity and additivity of convolution, and each ends in a single K×K convolution. Through these six transformations, the branches in the DBB module are progressively simplified and eventually merged into one K×K convolution. The network structure of the DBB module is shown in Figure 5. During training, the DBB has a multi-branch structure similar to Inception, whose branches use convolutions of different sizes and therefore obtain receptive fields of different sizes. By combining these branches, the DBB module generates a richer feature space and enhances the representational capacity of a single convolution. In the inference stage, the parameters of the equivalent single convolution are computed as a linear combination of the parameters of the multi-branch structure and deployed in the model.
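The merging process can be illustrated with a deliberately simplified sketch: a single-channel, one-dimensional setting with two parallel conv+BN branches (3-tap and 1-tap). It shows two widely used steps of this kind, folding batch normalization into the preceding convolution and zero-padding a smaller kernel so that branches can be added, and checks that the merged convolution reproduces the two-branch output. All names and numbers are assumptions for illustration and do not reproduce the DBB implementation.

```python
import math


def conv1d_valid(signal, kernel, bias=0.0):
    """Sliding-window convolution without padding, plus a scalar bias."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(signal) - k + 1)]


def batchnorm_inference(values, mean, var, gamma, beta, eps=1e-5):
    """Batch normalization with fixed (inference-time) statistics."""
    s = gamma / math.sqrt(var + eps)
    return [(v - mean) * s + beta for v in values]


def fuse_conv_bn(kernel, bias, mean, var, gamma, beta, eps=1e-5):
    """Fold BN into the convolution: BN(w*x + b) = (s*w)*x + (b - mean)*s + beta."""
    s = gamma / math.sqrt(var + eps)
    return [w * s for w in kernel], (bias - mean) * s + beta


signal = [0.5, -1.0, 2.0, 3.0, -0.5, 1.5]
k3, b3 = [0.2, -0.4, 0.1], 0.3
bn3 = dict(mean=0.1, var=0.8, gamma=1.2, beta=-0.2)
k1, b1 = [0.7], -0.1
bn1 = dict(mean=-0.3, var=1.5, gamma=0.9, beta=0.05)

# Training-time view: two conv+BN branches whose outputs are summed.
branch3 = batchnorm_inference(conv1d_valid(signal, k3, b3), **bn3)
# The 1-tap branch is evaluated at the centre positions of the 3-tap windows.
branch1 = batchnorm_inference(conv1d_valid(signal[1:-1], k1, b1), **bn1)
train_out = [a + b for a, b in zip(branch3, branch1)]

# Inference-time view: fold each BN, zero-pad the 1-tap kernel to 3 taps, then add.
fk3, fb3 = fuse_conv_bn(k3, b3, **bn3)
fk1, fb1 = fuse_conv_bn(k1, b1, **bn1)
fk1_padded = [0.0, fk1[0], 0.0]
merged_kernel = [a + b for a, b in zip(fk3, fk1_padded)]
merged_bias = fb3 + fb1
infer_out = conv1d_valid(signal, merged_kernel, merged_bias)

max_abs_diff = max(abs(a - b) for a, b in zip(train_out, infer_out))
print(max_abs_diff)
```

The zero-padding step is what satisfies the same-shape requirement of additivity: once both kernels have three taps, they can simply be summed.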

The C2D module combines the advantages of the C2f module and the DBB module, offering good performance and rich gradient flow. Compared with the C2f module, it improves network performance and detection accuracy without any increase in computation or parameter count at inference.
Figure 3

C2f module network structure diagram.

Figure 4

Forms of transformation.

Figure 5

DBB module network structure diagram.


C2f Multi-scale feature fusion network

Information loss is a problem in many detection algorithms. In these algorithms, different branches are responsible for detecting objects of different sizes. Initially their feature maps all carry a great deal of high-level semantic information, but this information dissipates during propagation and fusion, so the detection head receives incomplete spatial information from the feature maps, which ultimately degrades algorithm performance.

Many advanced algorithms in the YOLO family use the traditional FPN–PAN network structure to fuse feature maps from different levels and improve detector accuracy. The traditional FPN–PAN network is shown in Figure 6. This structure contains multiple branches for Multi-scale feature fusion, but it can fully fuse information only from neighboring layers; information from non-neighboring layers can be obtained only indirectly and recursively. For example, B1 can fuse directly with B2. To obtain information from B3, however, B1 can only fuse the result of fusing B2 and B3, and thus receives B3's information indirectly. This type of fusion leads to significant information loss. In Figure 7, panel (a) is the original dog picture and panel (b) is a feature map of FPN–PAN Output1. The dog is already hard to see in Figure 7(b) because much of its spatial information has been lost, which makes it difficult for the detection head to classify and localize the dog.
Figure 6

Traditional FPN–PAN network structure.

Figure 7

Feature maps for different network outputs: (a) Original image; (b) Feature map of FPN network output; and (c) Feature map of C2f Multi-scale feature fusion network output.

To avoid the loss caused by the traditional FPN–PAN structure, the Gather-and-Distribute information fusion mechanism uses a unified module to collect and distribute information from all levels (Wang et al. 2023a). In the C2f Multi-scale feature fusion network, we use the FAM feature alignment module, the IFM information fusion module, and the Inject_C2f information injection module. Comparing Figure 7(c) with Figure 7(b), the outline of the dog is very blurred in Figure 7(b) but can still be recognized at a glance in Figure 7(c). This shows that the C2f Multi-scale feature fusion network retains more semantic information than the traditional FPN–PAN network, which mitigates the information loss problem to a certain extent and improves algorithm performance.

The structure of the C2f Multi-scale feature fusion network is shown in Figure 1. It has two branches, Low Gather-and-Distribute (LowGD) and High Gather-and-Distribute (HighGD). These branches fuse feature maps from the low and high levels, respectively, and the results are injected into feature maps at different layers via the Inject_C2f module. The detailed network structure of LowGD is shown in Figure 8. The LowFAM module takes the feature map size of B4 as the baseline: B2 and B3 are downsampled to the B4 size with an AvgPool layer, B5 is upsampled to the B4 size with bilinear interpolation, and the results are concatenated along the channel dimension. The LowIFM module consists of convolution, a Structural Re-parameterization RepConv Block, and a Split operation. The RepConv Block takes the output F_align of the LowFAM module as input, and its output is split into two parts along the channel dimension to facilitate the fusion of information at different levels, which is formulated as follows:
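$$F_{align} = \mathrm{LowFAM}([B2, B3, B4, B5]) \tag{3}$$
$$F_{fuse} = \mathrm{RepConvBlock}(F_{align}) \tag{4}$$
$$F_{inj\_P3},\ F_{inj\_P4} = \mathrm{Split}(F_{fuse}) \tag{5}$$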
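The alignment step of LowFAM described above (average pooling for the larger maps, bilinear interpolation for the smaller one, then channel concatenation) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: feature maps are nested lists indexed as `[channel][row][col]`, spatial sizes differ by integer factors, the bilinear resize uses a half-pixel coordinate convention, and the names `avg_pool`, `bilinear_resize`, `low_fam`, and `constant_map` are hypothetical. The convolution and RepConv Block of LowIFM are not modeled.

```python
def avg_pool(channel, factor):
    """Downsample one 2D channel by averaging non-overlapping factor x factor blocks."""
    out_h, out_w = len(channel) // factor, len(channel[0]) // factor
    return [
        [
            sum(channel[i * factor + a][j * factor + b]
                for a in range(factor) for b in range(factor)) / factor ** 2
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def bilinear_resize(channel, out_h, out_w):
    """Resize one 2D channel with bilinear interpolation (half-pixel convention)."""
    in_h, in_w = len(channel), len(channel[0])
    out = []
    for i in range(out_h):
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = channel[y0][x0] * (1 - wx) + channel[y0][x1] * wx
            bottom = channel[y1][x0] * (1 - wx) + channel[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def low_fam(b2, b3, b4, b5):
    """Align B2, B3 and B5 to the spatial size of B4 and concatenate along channels."""
    target_h, target_w = len(b4[0]), len(b4[0][0])
    aligned = []
    for fmap in (b2, b3):                      # larger maps: average-pool down
        factor = len(fmap[0]) // target_h
        aligned += [avg_pool(ch, factor) for ch in fmap]
    aligned += b4                              # baseline size: used unchanged
    aligned += [bilinear_resize(ch, target_h, target_w) for ch in b5]  # smaller map: upsample
    return aligned


def constant_map(channels, size, value):
    """Build a feature map of the given channel count and square size, filled with value."""
    return [[[value] * size for _ in range(size)] for _ in range(channels)]


b2 = constant_map(1, 16, 2.0)
b3 = constant_map(2, 8, 3.0)
b4 = constant_map(3, 4, 4.0)
b5 = constant_map(4, 2, 5.0)
f_align = low_fam(b2, b3, b4, b5)
print(len(f_align), len(f_align[0]), len(f_align[0][0]))
```

After alignment every input contributes its channels at the same resolution, so a single fusion block can see all four levels at once instead of receiving distant levels through intermediate fusions.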
The detailed network structure of HighGD is shown in Figure 8. The HighFAM module takes the feature map size of P5 as the baseline: P3 and P4 are downsampled to the P5 size with an AvgPool layer, and the results are concatenated along the channel dimension. The HighIFM module consists of multiple self-attention, MLP, and Split operations. The specific formulas are as follows:
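$$F_{align} = \mathrm{HighFAM}([P3, P4, P5]) \tag{6}$$
$$F_{fuse} = \mathrm{Transformer}(F_{align}) \tag{7}$$
$$F_{inj\_N4},\ F_{inj\_N5} = \mathrm{Split}(F_{fuse}) \tag{8}$$
where Transformer denotes the stacked self-attention and MLP operations of HighIFM.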
The network structure of the Inject_C2f module is shown in Figure 9. X_Global is the global information generated by the IFM module, and X_Local is the local information at different levels. With Low LAF and High LAF, the receptive field of the Inject_C2f module is further enlarged to enhance its representational capacity. Local and global information are processed with operations such as convolution, sampling, and interpolation and then concatenated along the channel dimension, and the final output of the information injection module is obtained after the C2f module.
Figure 8

HighGD and LowGD module network structure diagram.

Figure 9

Inject_C2f module network structure diagram.


The C2f Multi-scale feature fusion network works because, unlike FPN–PAN networks, it does not have to obtain information from other layers indirectly or recursively. The LowGD and HighGD modules replace the downward and upward fusion operations of the FPN–PAN network, so all information in the C2f Multi-scale feature fusion network is acquired directly, with less loss and higher efficiency, which improves the accuracy of the detector.

Experimental environment and datasets

The experiments were run on four NVIDIA RTX 3060 graphics cards with an Intel(R) Xeon(R) Silver 4210 CPU @ 2.20 GHz and 128 GB of memory, under Ubuntu 20.04.1 with CUDA 11.3, Python 3.8, and PyTorch 1.12.1. All algorithm results were obtained in this environment. The training settings are listed in Table 1 and were used for all datasets.

Table 1

Experimental training data

Model | Epochs | Batch size | Optimizer | Learning rate | GPU IDs (RTX 3060)
YOLOv8-n | 500 | 64 | SGD | 1e-2 | 0,1,2,3
YOLOv8-n C2D | 500 | 64 | SGD | 1e-2 | 0,1,2,3
YOLOv8-n Multi-scale | 500 | 64 | SGD | 1e-2 | 0,1,2,3
YOLORG-n | 500 | 64 | SGD | 1e-2 | 0,1,2,3
YOLOv5-s | 500 | 64 | SGD | 1e-2 | 0,1,2,3
YOLOv6-n | 500 | 64 | SGD | 1e-2 | 0,1,2,3
YOLOv7-tiny | 500 | 64 | SGD | 1e-2 | 0,1,2,3
YOLORG-s | 500 | 16 | SGD | 1e-2 | 0,1,2,3

There are four datasets used in this experiment:
  1. The self-built FFDD dataset contains 3,428 training images and 1,163 test images. Some of the fish images were collected from the web, and the rest were taken by underwater cameras installed in a laboratory environment. The laboratory water was mixed with a large amount of sediment and water plants to simulate a real fishway environment, so visibility is very low. All images were labeled using the LabelImg tool. Compared with other underwater datasets, the FFDD covers a variety of images of fish in motion in complex environments.

  2. The COCO2017 public dataset contains 118,000 training images and more than 40,000 test images, with a total of 1.5 million objects in 80 detection categories. COCO is also widely used in other areas of computer vision (Lin et al. 2014).

  3. The joint VOC dataset is VOC2007+VOC2012. Its training set contains 16,000 images and more than 40,000 detection targets, and its test set contains 5,000 images and 12,000 detection targets. It covers 20 common object categories such as airplanes, bicycles, people, and various small animals (Everingham et al. 2010).

  4. The DeepFish dataset contains approximately 40,000 images from 20 habitats in tropical Australian marine environments. The dataset contains classification labels, localization labels and segmentation labels, and is able to meet a wide range of underwater research needs (Saleh et al. 2020).

Figure 10 shows example images from the four datasets. The FFDD is used to evaluate the model and its practical application to fish detection. COCO2017 and VOC2007_2012 are standard object detection benchmarks on which a good detector should be validated, and they are used together with the DeepFish dataset to compare the YOLORG model with other relevant fish detection models.
Figure 10

Sample images of DataSet: (a) FFDD; (b) COCO; (c) VOC2007_2012; and (d) DeepFish.


Ablation experiment

To verify the effectiveness of the proposed C2D Structural Re-parameterization module and C2f Multi-scale feature fusion network, we evaluated each module independently on YOLORG-n, focusing on mAP, the number of parameters, and the amount of computation. Adding the C2D module to YOLOv8 improves mAP50 and mAP50-95 by 0.2 and 0.3 percentage points, respectively; owing to the nature of Structural Re-parameterization, this introduces no additional parameters or inference computation. Replacing the traditional FPN–PAN network with the C2f Multi-scale feature fusion network mitigates the information loss and improves mAP50 and mAP50-95 by 0.6 and 1.4 percentage points, respectively. With both components, YOLORG improves mAP50 and mAP50-95 by 1.2 and 1.8 percentage points over the YOLOv8-n baseline. The experimental results are shown in Table 2. They indicate that the proposed C2D Structural Re-parameterization module and C2f Multi-scale feature fusion network effectively improve the performance of the detector.
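The reason a structurally re-parameterized block costs nothing extra at inference can be illustrated with a minimal sketch. It is not the C2D module itself, whose internals are not given in this section; it is a generic RepVGG-style merge (Ding et al. 2021c) of a parallel 3x3 and 1x1 branch into one 3x3 kernel, written in plain Python for a single channel. The helper names `conv2d_same` and `merge_branches` are made up for this illustration.

```python
def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (output keeps the input size)."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:  # outside the image counts as zero
                        acc += kernel[di][dj] * image[y][x]
            out[i][j] = acc
    return out


def merge_branches(k3, k1):
    """Fold a 1x1 kernel into the centre tap of a 3x3 kernel."""
    merged = [row[:] for row in k3]
    merged[1][1] += k1[0][0]
    return merged


image = [
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 1.0, 3.0, 2.0],
    [2.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 2.0],
]
k3 = [[0.5, -1.0, 0.25], [0.0, 2.0, -0.5], [1.0, 0.0, -0.25]]
k1 = [[0.75]]

# Training-time view: two parallel branches whose outputs are summed.
out3 = conv2d_same(image, k3)
out1 = conv2d_same(image, k1)
two_branch = [[a + b for a, b in zip(r3, r1)] for r3, r1 in zip(out3, out1)]

# Inference-time view: one merged 3x3 convolution.
one_branch = conv2d_same(image, merge_branches(k3, k1))

max_diff = max(
    abs(a - b) for ra, rb in zip(two_branch, one_branch) for a, b in zip(ra, rb)
)
print(max_diff)
```

Because convolution is linear, the two views give the same output, so the extra branch exists only during training. Real re-parameterized blocks also fold batch normalization into the kernels, which this sketch omits.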

Table 2

Ablation experimental results of YOLORG-n and YOLOv8-n

| Model | C2D | C2f Multi-scale feature fusion network | mAP50 | mAP50-95 | FLOPs (inference) | Params |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv8-n | | | 81.4% | 62.1% | 8.1 GFLOPs | 3M |
| YOLOv8-n | ✓ | | 81.6% (+0.2%) | 62.4% (+0.3%) | 8.1 GFLOPs | 3M |
| YOLOv8-n | | ✓ | 82% (+0.6%) | 63.5% (+1.4%) | 12.3 GFLOPs | 6.1M |
| YOLORG-n | ✓ | ✓ | 82.6% (+1.2%) | 63.9% (+1.8%) | 12.4 GFLOPs | 6.1M |

Bold values signify the best result in each column.
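The gains and costs in Table 2 can be recomputed directly from its rows. The short sketch below does this and also compares the combined gain with the sum of the two single-module gains; the variable names are illustrative, and the figures are copied from the table, with the two accuracy columns read as mAP50 and mAP50-95.

```python
# (mAP50, mAP50-95) in percent, copied from Table 2.
baseline = (81.4, 62.1)      # YOLOv8-n
c2d_only = (81.6, 62.4)      # YOLOv8-n + C2D
fusion_only = (82.0, 63.5)   # YOLOv8-n + C2f Multi-scale feature fusion network
both = (82.6, 63.9)          # YOLORG-n


def gain(variant, reference=baseline):
    """Absolute gain in percentage points, rounded to one decimal."""
    return tuple(round(v - r, 1) for v, r in zip(variant, reference))


gain_c2d = gain(c2d_only)
gain_fusion = gain(fusion_only)
gain_both = gain(both)
sum_of_parts = tuple(round(a + b, 1) for a, b in zip(gain_c2d, gain_fusion))

# Cost of the full model relative to the baseline.
flops_ratio = round(12.4 / 8.1, 2)
params_ratio = round(6.1 / 3.0, 2)

print(gain_c2d, gain_fusion, gain_both, sum_of_parts)
print(flops_ratio, params_ratio)
```

On these numbers the combined mAP50 gain (1.2 points) is larger than the sum of the individual gains (0.8 points), while for mAP50-95 the two are nearly equal (1.8 against 1.7). The table reports single values without variance, so differences of a few tenths of a point should be read with caution. The accuracy gain comes with roughly 1.5 times the FLOPs and twice the parameters of YOLOv8-n.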

Comparison experiments

To validate the performance of the YOLORG algorithm, we selected current state-of-the-art object detectors, kept their computation and parameter counts at a comparable level, and trained them in the same experimental environment. After training on the joint VOC dataset, YOLORG achieves the highest detection accuracy on the test set, which supports the detection capability of the proposed algorithm. The experimental results are shown in Table 3.
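The accuracy figures in these comparisons are the mAP50 and mAP50-95 metrics named in the abstract. Both rest on the intersection over union (IoU) between a predicted box and a ground-truth box: mAP50 counts a prediction as correct at IoU of at least 0.50, and mAP50-95 follows the COCO convention of averaging over ten thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows only the IoU step and the threshold check, not a full average-precision computation; the box values are invented for illustration.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds used by mAP50-95: 0.50, 0.55, ..., 0.95.
thresholds = [(50 + 5 * i) / 100 for i in range(10)]

ground_truth = (0.0, 0.0, 10.0, 10.0)
prediction = (2.0, 0.0, 12.0, 10.0)   # same size, shifted 2 units to the right

overlap = iou(ground_truth, prediction)
hits = [overlap >= t for t in thresholds]
print(round(overlap, 3), sum(hits))
```

In this example the prediction counts as correct for mAP50 but passes only four of the ten thresholds, which is why mAP50-95 is always the stricter of the two numbers and rewards tighter localization.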

Table 3

YOLORG-n vs. advanced YOLO model

| Model | mAP50 | mAP50-95 | FLOPs (inference) | Params |
| --- | --- | --- | --- | --- |
| YOLOv5-s 7.0 (Jocher et al. 2021) | 79.8% | 57.2% | 15.9 GFLOPs | 7M |
| YOLOv6-n 0.2.0 (Li et al. 2022b) | 82.2% | 60.4% | 11.08 GFLOPs | 4.3M |
| YOLOv7-tiny (Wang et al. 2023b) | 79.2% | 55.3% | 13.2 GFLOPs | 6M |
| YOLORG-n | 82.6% | 63.9% | 12.4 GFLOPs | 6.1M |

Bold values signify the best result in each column.

Detection of fish in the fish passage

Table 4 compares the YOLORG algorithm with other related fish detection algorithms. Kay & Merrifield (2021) used the RetinaNet-ResNet101-FPN model for fish detection. Their article reports the model's results after training on the COCO2017 dataset only for mAP50-95 and does not give its computation or parameter counts. To be fair, we did not choose a very large model for this comparison but simply used YOLORG-s, which improves mAP50-95 by 4.7 percentage points over RetinaNet-ResNet101-FPN. Xu & Matzner (2018) trained the YOLOv3 model on a self-built dataset. Unfortunately, we could not find that dataset, so we instead scaled the Ultralytics YOLOv3 algorithm to the same size as YOLORG-n and trained it on the joint VOC dataset; YOLORG-n improves mAP50 and mAP50-95 by 1.4 and 2.9 percentage points, respectively. Muksit et al. (2022) trained two improved YOLOv3 algorithms on the DeepFish dataset, and we trained YOLORG-n on the same dataset. Although Muksit et al. (2022) do not report computational data for YOLO-Fish-1 and YOLO-Fish-2, YOLORG-n has roughly one-tenth of their parameters and improves accuracy by nearly 2 percentage points. According to these results, YOLORG outperforms the algorithms used in the other related fish detection studies.
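The two DeepFish claims above, roughly ten times fewer parameters and nearly 2 points higher accuracy, can be checked against the Table 4 rows. The sketch below does so; it assumes the single accuracy value reported for the Muksit et al. (2022) models sits in the mAP50 column, and the dictionary names are illustrative.

```python
# Values copied from the DeepFish rows of Table 4.
params_m = {"YOLOv3": 61.58, "YOLO-Fish-1": 61.6, "YOLO-Fish-2": 62.61}  # millions
map50 = {"YOLOv3": 96.01, "YOLO-Fish-1": 96.15, "YOLO-Fish-2": 95.74}    # percent

yolorg_n_params_m = 6.1
yolorg_n_map50 = 98.0

# How many times larger each reference model is than YOLORG-n.
param_ratio = {name: round(p / yolorg_n_params_m, 1) for name, p in params_m.items()}

# Accuracy gain of YOLORG-n in percentage points.
map50_gain = {name: round(yolorg_n_map50 - m, 2) for name, m in map50.items()}

for name in params_m:
    print(f"{name}: {param_ratio[name]}x parameters, +{map50_gain[name]} points")
```

Against the strongest of the three reference models, YOLO-Fish-1, the gain is 1.85 points with about a tenth of the parameters, which matches the wording in the text.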

Table 4

Experimental results comparing YOLORG-s and fish passage correlation algorithms

| Model | Dataset | mAP50 | mAP50-95 | FLOPs (inference) | Params |
| --- | --- | --- | --- | --- | --- |
| YOLOv3-n ultralytics | VOC2007+2012 | 81.2% | 61% | 12.8 GFLOPs | 4.2M |
| YOLORG-n | VOC2007+2012 | 82.6% (+1.4%) | 63.9% (+2.9%) | 12.4 GFLOPs | 6.1M |
| RetinaNet-ResNet101-FPN (Kay & Merrifield 2021) | COCO2017 | – | 40.4% | – | – |
| YOLORG-s | COCO2017 | 61.9% | 45.1% (+4.7%) | 33.6 GFLOPs | 14.2M |
| YOLOv3 (Muksit et al. 2022) | DeepFish | 96.01% | – | – | 61.58M |
| YOLO-Fish-1 (Muksit et al. 2022) | DeepFish | 96.15% | – | – | 61.6M |
| YOLO-Fish-2 (Muksit et al. 2022) | DeepFish | 95.74% | – | – | 62.61M |
| YOLORG-n | DeepFish | 98% | 67% | 12.4 GFLOPs | 6.1M |

Bold values signify the best result in each column.

Before applying YOLORG to fish detection, we used dogs as a stand-in for fish to check whether YOLORG could correctly detect the dogs in the pictures. Figure 11(a) and 11(b) show different kinds of dogs; both the YOLORG_VOC model (trained on the VOC2007_2012 dataset) and the YOLORG_COCO model (trained on the COCO2017 dataset) correctly recognize and label all the dogs in the pictures. Table 5 shows the results of the YOLORG-n algorithm and other advanced models after training on the FFDD. YOLORG-n obtains the highest recognition accuracy among all detection models: although the difference in mAP50 is less pronounced, its mAP50-95 is almost 1 percentage point higher than that of the second most accurate model. In Figure 12, the fish in Figure 12(a) live in our laboratory-simulated turbid fish passage environment, whereas the fish in Figure 12(b) live in a real fish passage. The water in both is so turbid that the fish are difficult to distinguish even with the naked eye, yet the YOLORG model detects them accurately. This shows that the YOLORG model trained on the FFDD can recognize fish even in very turbid environments, not only on benchmark data. In an actual smart fishery project, it is only necessary to install an underwater camera, send the underwater images back in real time over a network cable, and use the YOLORG model to detect and classify fish automatically, which frees practitioners from watching the footage manually.
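The deployment flow described above (camera, frame stream, detector, automatic tally) can be outlined as a small loop. This is only a sketch under stated assumptions: `detect` stands for any callable that returns `(label, confidence, box)` tuples, `fake_detector` is a stand-in so the example runs without a camera or trained weights, and neither is the YOLORG interface.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# Hypothetical detection record: (label, confidence, (x1, y1, x2, y2)).
Detection = Tuple[str, float, Tuple[int, int, int, int]]


def monitor(
    frames: Iterable[object],
    detect: Callable[[object], List[Detection]],
    min_conf: float = 0.5,
) -> Counter:
    """Run the detector on every frame and tally confident detections per label."""
    tally: Counter = Counter()
    for frame in frames:
        for label, conf, _box in detect(frame):
            if conf >= min_conf:
                tally[label] += 1
    return tally


def fake_detector(frame: object) -> List[Detection]:
    """Stand-in detector: one confident and one weak detection per frame."""
    return [
        ("fish", 0.9, (10, 20, 60, 50)),
        ("fish", 0.3, (100, 40, 130, 70)),  # dropped by the confidence filter
    ]


frames = ["frame-0", "frame-1", "frame-2"]  # placeholders for decoded video frames
tally = monitor(frames, fake_detector)
print(tally)
```

The tally counts detections per frame, so the same fish seen in consecutive frames is counted repeatedly; estimating how many individual fish pass through would additionally need tracking across frames.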
Table 5

YOLORG vs. advanced YOLO model on FFDD dataset.

| Model | mAP50 | mAP50-95 | FLOPs (inference) | Params |
| --- | --- | --- | --- | --- |
| YOLOv5-n 7.0 (Jocher et al. 2021) | 90.7% | 64% | 4.2 GFLOPs | 1.7M |
| YOLOv6-n 0.2.0 (Li et al. 2022b) | 90.4% | 67% | 11.08 GFLOPs | 4.3M |
| YOLOv7-tiny (Wang et al. 2023b) | 90.8% | 63.8% | 13.2 GFLOPs | 6M |
| YOLOv8-n | 91.1% | 68.6% | 8.1 GFLOPs | 3M |
| YOLORG-n | 91.3% | 69.4% | 12.4 GFLOPs | 6.1M |

Bold values signify the best result in each column.

Figure 11

Example images of dogs detected by YOLORG: (a) YOLORG trained on VOC2007_2012; (b) YOLORG trained on COCO2017.

Figure 12

Prediction results in laboratory and fish passage environment: (a) fish in laboratory and (b) fish in passage.


In this paper, to improve the accuracy of fish detection and increase the amount of data available in the field, we propose the FFDD and YOLORG, a series of fish detection models. Multiple experiments show that the proposed YOLORG algorithm achieves advanced results.

The FFDD is well organized and can be easily downloaded and used for training to support fish projects in need. The C2D Structural Re-parameterization module can be easily embedded into existing fish detection models without changing the original network structure, improving detection performance without adding parameters. The C2f Multi-scale feature fusion network addresses the problem of information loss in fish detection models and offers an idea for improving the accuracy of fish detection algorithms. The YOLORG series is available in several versions, so a speed-accuracy balance can be chosen according to the needs of the actual project. In addition, the YOLORG-n model is small enough to run directly on embedded devices.

The FFDD and the YOLORG model can provide data and ideas for fish detection research as well as practical help for real projects. In the future, we will further expand the FFDD and propose more effective fish detection methods.

This work was funded in part by the Scientific Research Fund of Hunan Provincial Education Department of China under Grant 22A0200 and in part by the Scientific and Technological Innovation Project of Quanmutang Reservoir under Grant 2022430119001440.

All relevant data are available from https://github.com/wannabetter/YOLORG.

The authors declare there is no conflict.

Bochkovskiy A., Wang C. Y. & Liao H. Y. M. 2020 Yolov4: Optimal Speed and Accuracy of Object Detection. arXiv preprint arXiv:2004.10934.

Ding X., Guo Y., Ding G. & Han J. 2019 Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1911–1920.

Ding X., Zhang X., Han J. & Ding G. 2021a Diverse branch block: Building a convolution as an inception-like unit. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10886–10895.

Ding X., Xia C., Zhang X., Chu X., Han J. & Ding G. 2021b Repmlp: Re-parameterizing Convolutions into Fully-Connected Layers for Image Recognition.

Ding X., Zhang X., Ma N., Han J., Ding G. & Sun J. 2021c Repvgg: Making vgg-style convnets great again. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13733–13742.

Dong J., Shangguan X., Zhou K., Gan Y., Fan H. & Chen L. 2023 A detection-regression based framework for fish keypoints detection. Intelligent Marine Technology and Systems 1, 9.

Everingham M., Gool L. V., Williams C. K. I., Winn J. & Zisserman A. 2010 The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88, 303–338.

Hu J., Shen L. & Sun G. 2018 Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.

Ioffe S. & Szegedy C. 2015 Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, PMLR, pp. 448–456.

Jocher G., Stoken A., Chaurasia A., Borovec J., Kwon Y., Michael K., Changyu L., Fang J., Skalski P., Hogan A., Nadar J., Mammana L., Fati C., Montes D., Hajek J., Diaconu L. & Minh M. T. 2021 ultralytics/yolov5: v6.0 – YOLOv5n 'Nano' Models, Roboflow Integration, TensorFlow Export, OpenCV DNN Support. Zenodo.

Kandimalla V., Richard M., Smith F., Quirion J., Torgo L. & Whidden C. 2022 Automated detection, classification and counting of fish in fish passages with deep learning. Frontiers in Marine Science 8, 2049.

Kay J. & Merrifield M. 2021 The Fishnet Open Images Database: A Dataset for Fish Detection and Fine-Grained Categorization in Fisheries. arXiv preprint arXiv:2106.09178.

Li X., Shang M., Qin H. & Chen L. 2015 Fast accurate fish detection and recognition of underwater images with fast r-cnn. In: OCEANS 2015-MTS/IEEE Washington. IEEE, pp. 1–5.

Li H., Li J., Wei H., Liu Z., Zhan Z. & Ren Q. 2022a Slim-Neck by gsconv: A Better Design Paradigm of Detector Architectures for Autonomous Vehicles. arXiv preprint arXiv:2206.02424.

Li C., Li L., Jiang H., Weng K., Geng Y., Li L., Ke Z., Li Q., Cheng M., Nie W., Li Y., Zhang B., Liang Y., Zhou L., Xu X., Chu X., Wei X. & Wei X. 2022b Yolov6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv preprint arXiv:2209.02976.

Li L., Shi G. & Jiang T. 2023 Fish detection method based on improved yolov5. Aquaculture International 31, 2513–2530.

Lin T. Y., Maire M., Belongie S., Hays J., Perona P., Ramanan D., Dollár P. & Zitnick C. L. 2014 Microsoft coco: Common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, pp. 740–755.

Meng F., Cheng H., Zhuang J., Li K. & Sun X. 2021 Rmnet: Equivalently Removing Residual Connection from Networks. arXiv e-prints.

Muksit A. A., Hasan F., Emon M. F. H. B., Haque M. R., Anwary A. R. & Shatabda S. 2022 Yolo-fish: A robust fish detection model to detect fish in realistic underwater environment. Ecological Informatics 72, 101847.

Paspalakis S., Moirogiorgou K., Papandroulakis N., Giakos G. & Zervakis M. 2020 Automated fish cage net inspection using image processing techniques. IET Image Processing 14, 2028–2034.

Redmon J. & Farhadi A. 2018 Yolov3: An Incremental Improvement. arXiv preprint arXiv:1804.02767.

Saleh A., Laradji I. H., Konovalov D. A., Bradley M., Vazquez D. & Sheaves M. 2020 A realistic fish-habitat dataset to evaluate algorithms for underwater visual analysis. Scientific Reports 10, 14671. doi:10.1038/s41598-020-71639-x.

Szegedy C., Liu W., Jia Y., Sermanet P., Reed S., Anguelov D., Erhan D., Vanhoucke V. & Rabinovich A. 2015 Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.

Szegedy C., Vanhoucke V., Ioffe S., Shlens J. & Wojna Z. 2016 Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.

Szegedy C., Ioffe S., Vanhoucke V. & Alemi A. 2017 Inception-v4, inception-resnet and the impact of residual connections on learning. In: Proceedings of the AAAI Conference on Artificial Intelligence.

Vianna G. M., Zeller D. & Pauly D. 2020 Fisheries and policy implications for human nutrition. Current Environmental Health Reports 7, 161–169.

Wang Q., Wu B., Zhu P., Li P., Zuo W. & Hu Q. 2020 Eca-net: Efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11534–11542.

Wang H., Zhang S., Zhao S., Wang Q., Li D. & Zhao R. 2022 Real-time detection and tracking of fish abnormal behavior based on improved yolov5 and siamrpn++. Computers and Electronics in Agriculture 192, 106512.

Wang C., He W., Nie Y., Guo J., Liu C., Wang Y. & Han K. 2023a Gold-yolo: Efficient object detector via Gather-and-Distribute mechanism. In: Thirty-Seventh Conference on Neural Information Processing Systems.

Wang C. Y., Bochkovskiy A. & Liao H. Y. M. 2023b Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7464–7475.

Xu W. & Matzner S. 2018 Underwater fish detection using deep learning for water power applications. In: 2018 International Conference on Computational Science and Computational Intelligence (CSCI). IEEE, pp. 313–318.

Xu X., Jiang Y., Chen W., Huang Y., Zhang Y. & Sun X. 2022 Damo-yolo: A Report on Real-time Object Detection Design. arXiv preprint arXiv:2211.15444.

Yu Y., Zhang H. & Yuan F. 2023 Key point detection method for fish size measurement based on deep learning. IET Image Processing 17, 4142–4158.

Zhang S., Yang X., Wang Y., Zhao Z., Liu J., Liu Y., Sun C. & Zhou C. 2020 Automatic fish population counting by machine vision and a hybrid deep neural network model. Animals 10, 364.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).