ABSTRACT
This study develops a novel double-loop contraction and C value sorting selection-based shrinkage frog-leaping algorithm (double-contractive cognitive random field [DC-CRF]) to mitigate the interference of complex salts and ions in seawater on the ultraviolet–visible (UV–Vis) absorbance spectra for chemical oxygen demand (COD) quantification. The key innovations of DC-CRF are (1) a C value-based evaluation of variable importance that guides wavelength selection and accelerates convergence, and (2) a double-loop structure that integrates random frog (RF) leaping with contraction attenuation to dynamically balance convergence speed and efficiency. Using seawater samples from Jiaozhou Bay, DC-CRF-partial least squares regression (PLSR) reduced the input variables by 97.5% after 1,600 iterations relative to full-spectrum PLSR, RF-PLSR, and CRF-PLSR. It achieved a test R2 of 0.943 and root mean square error of 1.603, markedly improving prediction accuracy and efficiency. This work demonstrates the efficacy of DC-CRF-PLSR in enhancing UV–Vis spectroscopy for rapid COD analysis in intricate seawater matrices, providing an efficient solution for optimizing seawater spectra.
HIGHLIGHT
The double-contractive cognitive random field-partial least squares regression algorithm reduces input 97.5% with high seawater chemical oxygen demand (COD) prediction accuracy. Integrating variable selection and shrinkage frog leaping optimizes efficiency and robustness. An outer loop strategy and attenuation function balance convergence and efficiency. The C value criterion evaluates variables for seawater COD modeling. The algorithm improves ultraviolet–visible detection for rapidly determining seawater COD.
INTRODUCTION
In recent decades, the rapid development of industrialization and urbanization in China has led to a large discharge of wastewater into the ocean, exacerbating water pollution levels (Xu et al. 2021). Coastal water quality is not only affected by land-based pollutant inputs but is also closely related to nearshore hydrodynamic processes such as waves and currents (Abolfathi & Pearson 2017). These hydrodynamic processes directly influence coastal water quality by driving water mixing and dilution (Abolfathi & Pearson 2014; Abolfathi et al. 2020). The chemical oxygen demand (COD), inorganic nitrogen, and active phosphates in wastewater from land-based sewage outlets have exceeded standards, resulting in eutrophication, frequent red tides, and damage to marine ecosystems and human health in coastal waters (Lin et al. 2020). COD is an important parameter for detecting water quality pollution (Fogelman et al. 2006), reflecting the relative content of organics and the pollution levels of reducing substances such as ferrous ions and sulfides in water bodies. Although traditional chemical detection methods such as the potassium permanganate method can achieve accurate detection results (Miller et al. 2001), they require 2–3 h for sample pretreatment, titration, and result calculation, and improper treatment of the chemical reagents may cause secondary pollution. In contrast, the ultraviolet–visible (UV–Vis) spectroscopy used in this study can scan samples for detection in 1–2 min (Guo et al. 2020). Therefore, developing effective, rapid, and eco-friendly detection techniques enables timely detection and protection of marine resources. UV–Vis spectroscopy is a novel rapid non-destructive detection method (Picollo et al. 2018; Xu et al. 2021). It establishes regression models between the specific spectral curves of substance absorption and chemical measurement parameters, and is widely used in ecological, industrial, and environmental monitoring fields (Chen et al. 2014; Li et al. 2020).
Considering the deterioration of water environments, the composition of water quality has become increasingly complex, and the demand for model prediction accuracy has also increased (Cao et al. 2014). Therefore, research on the use of UV–Vis spectroscopy to determine the concentration of organic pollutants in water has developed from single- and double-wavelength modeling to the use of multi-wavelength and wide-band calibration modeling (Lee et al. 1999; Kim et al. 2001). Multivariate statistical methods such as partial least squares (PLS) (Kusnierek & Korsaeth 2015), principal component analysis (Sharifzadeh et al. 2017), and support vector machines (Devos et al. 2009) are commonly used for spectral analysis and modeling. The PLS algorithm can comprehensively filter the spectrum and play an important role in modeling and analyzing spectral information variables. However, a large amount of research has shown that water quality COD measurement is severely affected by various ion spectra. Considering the strong UV absorption spectrum band formed by a large number of ions in seawater, the use of PLS full-spectrum variables for modeling will greatly increase the complexity of the model and produce large measurement errors (Han et al. 2022). Therefore, the effective selection of spectral variables is a key issue in spectral detection. Population-based evolutionary algorithms can effectively solve this problem. These methods can fully explore the solution space and handle discrete data that traditional methods are difficult to handle, including genetic algorithms (GAs) (Feng et al. 2020), particle swarm optimization (Qi et al. 2018), ant colony optimization (Shamsipur et al. 2006), and gray wolf optimization (Gao et al. 2022). Among these methods, random frog (RF) (Li et al. 2012) is a feature wavelength selection algorithm proposed in recent years. 
It uses the reversible jump Markov chain Monte Carlo (RJMCMC) method to iteratively calculate the probability of each variable being selected in each iteration, evaluate the importance of variables, and select variables with high probabilities as feature variables. Given its excellent performance in selecting explanatory feature variables from a wide range of complex datasets, this algorithm has been effectively applied in the field of spectral feature selection.
Although the RF algorithm has certain advantages in feature wavelength selection, when this method is applied to complex datasets, the randomness of the initial variable set leads to slow iteration speed and easy convergence to local optima. In response to this problem, some studies have introduced preprocessing or heuristic methods (Zhu et al. 2010; Fan et al. 2017) that can effectively remove irrelevant information, noise, and background interference from variables, optimize the initial variable set, and improve the convergence speed of the algorithm. However, when collinearity exists between variables, the association between the size of the coefficients and the importance of the variables decreases, thus decreasing the effectiveness of the initial variable set selected. The C value criterion proposed in recent years reasonably evaluates the importance of each variable (Zhang et al. 2019). By statistically calculating the contribution of different variables to model errors, it can effectively avoid the synergistic effects between variables (Kjeldahl & Bro 2010; Tran et al. 2015).
Moreover, although optimizing the initial variable set can improve the computational efficiency of the water quality spectral model, the high degree of randomness during the RF algorithm hinders the rapid and accurate extraction of feature variables. Relevant research has been conducted, but a general consensus has not been reached on the selection of probability-guided parameters and optimization methods. In response to the issue of individual fitness, Xu et al. (2014) proposed a least random shuffled frog-leaping algorithm, which optimizes the population's iteration efficiency based on a roulette wheel strategy. Yun et al. (2013) suggested that replacing a single wavelength point with a continuous window improves the optimization accuracy of the original RF algorithm. Sun et al. (2020) proposed interval selection based on RF, which optimizes wavelength selection and its width. However, balancing the search and optimization capabilities of the algorithm when applied to seawater COD spectra is crucial. Excessive searching leads to slow convergence, while focusing on finding the optimal solution may cause the algorithm to become trapped in a local optimum. The adaptive balance between search and optimization is essential for ensuring the robustness of the algorithm (Deng et al. 2017).
This paper proposes a double-contractive cognitive random field (DC-CRF) algorithm based on adaptive and C value sorting selection for the improved efficiency of the solution of seawater COD spectra to strengthen the balance in exploring global optimal solutions. The contributions and main findings of this paper are as follows: (1) the introduction of heuristic probability optimization and new contraction strategies, by nesting the RF algorithm in a dual-layer loop, to adapt to COD spectral optimization in seawater environments; (2) the validation of the rationality of the C value forward sorting parameter applied to seawater COD spectral optimization; (3) the evaluation of the RF, CRF, and DC-CRF wavelength selection methods using partial least squares regression (PLSR) modeling from the perspectives of speed, robustness, and accuracy; and (4) the determination of the optimal spectral region and sensitive bands for seawater COD.
MATERIALS AND METHODS
Study area and sample collection
Division and preprocessing of spectral dataset
Theory and algorithm
RF is an algorithm proposed by Li et al. (2012) that resembles the RJMCMC algorithm. Its primary steps are as follows: (1) initialization: setting parameters and randomly selecting a variable subset Z0 containing Q0 variables; (2) probability-guided model search: based on Z0, generating a candidate variable subset Z* containing Q* (randomly generated) variables, accepting Z* with a certain probability and, if accepted, replacing Z0 with Z*, and repeating this step until Lk iterations are completed (where Lk represents the number of iterations within the internal loop); (3) variable evaluation: calculating the selection probability of each variable, with a higher probability indicating greater importance.
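Step (3) can be illustrated with a minimal sketch, assuming the selection probability of a variable is simply the fraction of iterations in which it appeared in the retained subset. The function name `selection_probability` and the toy history are hypothetical and are not taken from the original implementation.

```python
from collections import Counter


def selection_probability(subsets, n_vars):
    """Fraction of iterations in which each variable index was selected."""
    counts = Counter(v for subset in subsets for v in set(subset))
    n_iter = len(subsets)
    return [counts[v] / n_iter for v in range(n_vars)]


# Toy history: subsets retained over three iterations, four candidate wavelengths.
history = [{0, 2}, {0, 1}, {0, 2, 3}]
probs = selection_probability(history, n_vars=4)

# Rank variables from most to least frequently selected.
ranked = sorted(range(4), key=lambda v: probs[v], reverse=True)
```

Variables at the front of `ranked` would be the ones treated as feature wavelengths.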
The DC-CRF algorithm proposed in this study adopts the variable generation strategy of RF but differs in three aspects. First, it uses the C value importance ranking criterion to determine the initial variable set instead of a random strategy. Second, it introduces a parameter adaptive contraction strategy to control the variable selection range and acceptance rate during the algorithm iteration process. Finally, it adds an outer contraction loop and a Metropolis acceptance criterion to preserve or eliminate new solutions of spectral wavelength variables, thus enhancing the robustness and facilitating escape from local optima. The specific method is elaborated in the following subsections.
The proposed DC-CRF algorithm
Determining initial subset with C value criterion
In RF, the initial variable subset Z0 was randomly generated, which may result in the presence of uninformative or interfering variables, thereby increasing the number of iterations and runtime of the algorithm. The effectiveness of the initial subset Z0 variables was improved, and the number of iterations was reduced by improving the generation of the Z0 subset.



The binary matrix b and error vector e were modeled using multiple linear regression (MLR), where the regression coefficients for the variables were defined as the C values to estimate the contribution of the variables to the model error. A smaller C value indicates a higher importance of the corresponding variable. The variables were sorted into positive, neutral, and negative categories, and the top Q0 variables were selected as the initial subset Z0.
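The final selection step can be sketched as follows, assuming the C values have already been obtained from the MLR fit and that "top Q0" means the Q0 variables with the smallest C values. The function name `initial_subset` and the numbers are made up for illustration.

```python
def initial_subset(c_values, q0):
    """Return the indices of the q0 variables with the smallest C values."""
    order = sorted(range(len(c_values)), key=lambda i: c_values[i])
    return order[:q0]


# Hypothetical C values for five wavelengths (not taken from the study).
c_values = [0.12, -0.40, 0.03, -0.05, 0.31]
z0 = initial_subset(c_values, q0=2)
```

In this toy case the two most negative C values belong to indices 1 and 3, so these form the initial subset.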
Shrinkage strategy
Subset generation of variables
A strategy for generating new variables in RF was set as an inner loop with a loop count of Lk; it uses a normal distribution to control the number of variables and thereby achieves variable selection and modification. In comparison with classical RF, an elastic shrinkage parameter k was introduced for the new variable set (the definition of k is introduced in 2.3). An integer β was randomly drawn from a normal distribution with mean k·α and variance 0.3α, where α is the number of variables in the current subset Z0. Then, a candidate variable subset Z* containing β variables was generated according to one of the following three cases:
If β = α, let Z* = Z0.
If β < α, a PLS model is first established on Z0, and the regression coefficients of the variables in the model are recorded and compared. The α − β variables with the smallest absolute regression coefficients are then removed from Z0, and the remaining β variables form the candidate subset Z*.
If β > α, δ(β − α) variables, where δ is a multiplier with a default value of 3, are randomly selected from V − Z0 (where V represents the set containing all N variables) to generate a variable subset T. A PLS model is established using the combination of Z0 and T, and the β variables with the largest absolute regression coefficients in the model are retained as the candidate subset Z*.
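The three cases can be sketched in a few lines. This is only an illustration under stated assumptions: `abs_coef` is a hypothetical callable standing in for a PLS fit that returns the absolute regression coefficient of each variable, β is rounded and clamped to the range 1 to N, and the function and parameter names are not from the original code.

```python
import random


def propose_subset(z0, all_vars, k, abs_coef, rng, delta=3):
    """Generate a candidate subset Z* from the current subset Z0."""
    current = sorted(z0)
    alpha = len(current)
    # beta ~ Normal(mean = k * alpha, variance = 0.3 * alpha), rounded and clamped.
    beta = round(rng.gauss(k * alpha, (0.3 * alpha) ** 0.5))
    beta = max(1, min(len(all_vars), beta))
    if beta == alpha:
        return set(current)
    if beta < alpha:
        # Shrink: keep the beta variables with the largest |coefficient|.
        coefs = abs_coef(current)
        return set(sorted(current, key=lambda v: coefs[v], reverse=True)[:beta])
    # Grow: draw delta * (beta - alpha) extra variables, refit, keep the best beta.
    pool = [v for v in all_vars if v not in z0]
    extra = rng.sample(pool, min(len(pool), delta * (beta - alpha)))
    merged = current + extra
    coefs = abs_coef(merged)
    return set(sorted(merged, key=lambda v: coefs[v], reverse=True)[:beta])


def toy_coef(variables):
    """Toy stand-in for PLS coefficients: lower indices get larger weights."""
    return {v: 1.0 / (1 + v) for v in variables}


rng = random.Random(0)
candidate = propose_subset({0, 5, 9, 12}, list(range(20)), k=0.9,
                           abs_coef=toy_coef, rng=rng)
```

With k below 1, the mean of β sits below the current subset size, so on average the candidate subsets shrink.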
Update of the variable subset

Here, p denotes the probability of accepting a new solution, and the two error terms are the RMSECV of the Z0 subset and the RMSECV of the Z* subset, respectively.
However, the fixed threshold elimination method in RF can become a disadvantage because it may increase the computational burden and require more iterations to achieve convergence. This is particularly problematic when evaluating complex datasets, such as in the optimization of seawater COD spectra. In this connection, Li et al. described a search mechanism that allows for elastic contraction strategies and is better suited to optimizing complex datasets. Accordingly, the elastic contraction operator of the simulated annealing (SA) algorithm was used to maintain the diversity of the initial population as much as possible while ensuring accuracy in the later stages, thereby reducing the number of iterations and adapting the termination threshold.


In this equation, the two objective-function terms correspond to solutions i and j (the RMSECV of the original and new variable subsets, respectively), and the temperature term is the value of T during the nth iteration of the outer loop. If the RMSECV of the original variable subset is not less than that of the new variable subset, the new variable subset is accepted. If the RMSECV of the original variable subset is less than that of the new variable subset, the new subset is accepted with a probability determined by the contraction formula.
The improved non-optimal solution acceptance strategy was controlled by the parameter T, which gradually shrinks with the algorithm's iterations. The corresponding selection mechanism has a higher probability of accepting poor solutions in the early stages, thus expanding the search space. As the algorithm progressed, the probability of accepting poor solutions decreased, thus improving the search accuracy. The detailed setting of parameter T is described below. After each iteration of the inner loop, the subset and model error were saved. After Lk iterations, the optimal subset Z1, the number of variables δ, and the model error were extracted. The initial subset Z0 is updated by setting Z0 = Z1 and α = δ.
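The behavior described above can be illustrated with the textbook Metropolis rule from simulated annealing. This is an assumption made for illustration: the exact contraction formula used in the study, including its parameter C, may differ, and the function name `accept` is hypothetical.

```python
import math
import random


def accept(rmsecv_old, rmsecv_new, temperature, rng):
    """Standard Metropolis rule (an assumed stand-in for the contraction formula)."""
    if rmsecv_new <= rmsecv_old:
        return True  # The new subset is at least as good: always accept.
    # A worse subset is accepted with a probability that shrinks as T decreases.
    p = math.exp(-(rmsecv_new - rmsecv_old) / temperature)
    return rng.random() < p


rng = random.Random(42)
# Count how often a worse solution (RMSECV 1.5 -> 2.0) is accepted in 1,000 trials.
hot = sum(accept(1.5, 2.0, 100.0, rng) for _ in range(1000))
cold = sum(accept(1.5, 2.0, 0.01, rng) for _ in range(1000))
```

At T = 100 nearly every worse candidate is accepted, whereas at T = 0.01 essentially none are, which mirrors the shift from exploration to refinement.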
Outer loop and contraction variable range setting

After updating T, the initial subset Z0 was updated by setting Z0 = Zbest and Q0 = δ, and the next round of Lk inner loops was started. The algorithm stops and outputs the final number of wavelength variables, the corresponding wavelengths, and the model error when either the error acceptance threshold or the lower limit of T (Tf) is reached.
The process of the shrinking control algorithm and the acceptance probability p of the inner loop solution are controlled by the outer loop parameter. When the initial T is large, the selection range of variables is wide and the probability of accepting new solutions is high. As the outer loop iterates, the T parameter decreases, the selection range of variables narrows, and the probability of accepting new solutions decreases. The algorithm implements a wide range of initial selection to ensure that the search range includes the entire spectral dataset while avoiding local optima, and focuses on variable selection in the later stages of the algorithm. Therefore, the key to the algorithm is to choose a set of control parameters for the algorithm process that returns an optimal solution within a reasonable time. Such a set of control parameters is usually referred to as a schedule, which mainly includes the following parameters: (1) starting parameter T0; (2) decay function k; (3) parameter lower limit Tf.
When selecting these parameters, multiple factors need to be considered. The size of the initial temperature T0 determines the exploration range in the early stages. The decay coefficient k controls the cooling schedule and should be selected to gradually decrease the temperature. The final temperature Tf needs to be set small enough to allow convergence. The number of inner loops Lk affects optimization progress. In addition, the acceptance probability of new solutions depends on parameter T.
For the dataset used in this study, the parameters were optimized through extensive experiments. The initial temperature T0 was set to 100 based on the search space size to ensure sufficient exploration. The decay coefficient k of 0.85 achieved a good balance between global and local search. The final temperature Tf of 0.01 allowed the algorithm to converge to a near-optimal solution. The number of inner loops Lk was set to 50 through trial and error. The parameter C was chosen as −0.016 for the acceptance probability, which led to the best results.
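To give a feel for what these settings imply, the sketch below assumes a geometric decay, T ← k·T after each outer loop. The text does not confirm this form of the decay function, and the function name `cooling_schedule` is hypothetical.

```python
def cooling_schedule(t0=100.0, k=0.85, tf=0.01):
    """Yield the temperature of each outer loop, assuming T <- k * T."""
    t = t0
    while t > tf:
        yield t
        t *= k


temps = list(cooling_schedule())
n_outer = len(temps)               # outer loops run before T falls to Tf or below
max_inner_iterations = n_outer * 50  # Lk = 50 inner iterations per outer loop
```

Under this assumption the schedule allows at most 57 outer loops, or 2,850 inner iterations, before the lower limit Tf is reached; the error acceptance threshold can stop the run earlier.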
It should be noted that the parameters need to be individually tuned for different datasets to achieve optimal performance. We performed detailed sensitivity analysis and obtained robust value ranges for the parameters. The algorithm shows low sensitivity to small variations of the parameters within these ranges. The algorithm process is shown by the flow chart in Figure 2.
Model evaluation methods
In this study, the measured COD content of seawater was taken as the dependent variable, and a set of spectral variables was selected as the independent variables to establish four PLSR models, including a PLSR model based on the full spectrum, a PLSR model based on RF wavelength selection, a PLSR model based on CRF (C value sorting to determine the initial variable set of RF) wavelength selection, and a PLSR model based on DC-CRF wavelength selection. The effectiveness of DC-CRF-PLSR was verified by comparing the predictive abilities of the four models. The predictive ability of the models was mainly evaluated based on the calibration coefficient of determination, root mean square error of calibration (RMSEC), relative percent deviation (RPD) of calibration, prediction coefficient of determination (Rp2), root mean square error of prediction (RMSEP), relative percent deviation of prediction, root mean squared logarithmic error (RMSLE) (Habib et al. 2023), mean absolute error (MAE) and relative absolute error (RAE) (Yeganeh-Bakhtiary et al. 2023). A value of R2 closer to 1 and values of RMSEC, RMSEP, RMSLE, and MAE closer to 0 indicate better model fitting and higher prediction accuracy (Rohman et al. 2010). An RPD value less than 1.4 indicates an unreliable model, an RPD value between 1.4 and 2.0 indicates a relatively reliable model, and an RPD value greater than 2.0 indicates a highly reliable model that can be used for model analysis (Fearn 2002). Data analysis and modeling were carried out in MATLAB R2020a.
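A minimal sketch of several of these metrics is given below. It assumes the commonly used definitions, in particular RPD as the sample standard deviation of the reference values divided by the RMSE; the study's exact formulas are not shown in the text, and the function names are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, R2, RPD, and MAE for paired reference/predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)
    rmse = math.sqrt(sse / n)
    sd = math.sqrt(sst / (n - 1))  # sample standard deviation of the references
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"rmse": rmse, "r2": 1 - sse / sst, "rpd": sd / rmse, "mae": mae}


def rpd_class(rpd):
    """Reliability classes using the thresholds quoted in the text."""
    if rpd < 1.4:
        return "unreliable"
    if rpd <= 2.0:
        return "relatively reliable"
    return "highly reliable"


# Toy values, not data from the study.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 3.0, 3.5])
label = rpd_class(metrics["rpd"])
```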
RESULTS AND DISCUSSION
Spectral analysis
A PLSR prediction model for seawater COD was established based on the bands within the range of 200–800 nm, not only because seawater COD has characteristic absorption peaks between 200 and 800 nm that are closely related to COD values, but also because subtle fluctuations within this range, identifiable by the algorithm itself, are useful for COD prediction. The spectra after applying different preprocessing methods are shown in Figure 4, and the prediction results are presented in Table 2. The model performance was poor when using the full-spectrum data, indicating that factors such as redundant and interfering information affect the prediction results. Therefore, an appropriate preprocessing method is needed to effectively optimize the data without affecting the spectral characteristics of the combined bands. For the seawater COD spectral model, S-G preprocessing did not significantly improve the prediction results compared with the PLSR model established with the original spectra. The SNV preprocessing method had the best effect, with an Rp2 of 0.712, RMSEP of 4.787, and RPD of 1.863. The RPD value of the MSC-PLS model was 99.36% of that of the SNV-PLS model, the RPD value of the 1stDer-PLS model was 95.33% of that of the SNV-PLS model, and the RPD value of the 2ndDer-PLS model was 97.53% of that of the SNV-PLS model. Therefore, the SNV method was chosen as the preprocessing method for subsequent research.
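For reference, the SNV transformation is usually defined per spectrum: each absorbance value is centered by the spectrum's own mean and scaled by its own standard deviation. The sketch below follows that standard definition with the sample standard deviation; it is not the study's MATLAB code, and the function name `snv` is hypothetical.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own statistics."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)  # sample standard deviation
    return [(a - mean) / sd for a in spectrum]


# Toy three-point "spectrum" for illustration only.
corrected = snv([1.0, 2.0, 3.0])
```

Because every spectrum is normalized independently, SNV needs no reference spectrum, unlike MSC.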
Performance of C value
The impact of variable sets with three different C value intervals on the PLSR model
Removed variables | nVar | nLV | RMSECV | R2 | RMSLE | MAE | RAE |
---|---|---|---|---|---|---|---|
None | 601 | 6 | 4.86 | 0.699 | 0.039 | 1.245 | 0.134 |
Positive | 450 | 6 | 5.03 | 0.626 | 0.042 | 1.372 | 0.149 |
Negative | 450 | 6 | 4.61 | 0.734 | 0.037 | 1.092 | 0.118 |
nVar: Number of spectral variables used in the model and nLV: number of latent variables used in the model.
Table 1 shows that removing the positive variables leads to an increase in RMSECV and a decrease in R2, indicating model degradation. In contrast, removing the negative variables leads to a decrease in RMSECV (from 4.86 to 4.61) and an increase in R2 (from 0.699 to 0.734), indicating model improvement. This result supports the effectiveness of the C value criterion in evaluating the importance of spectral variables in modeling.
The C value statistical chart of 601 variables. The blue line Q1 represents the upper quartile, while the red line Q3 represents the lower quartile.
Effect of outer loop shrinkage strategy
Changes in RMSECV values with increasing number of variables using two methods: C-index ranking and random selection.
Changes in the number of variables selected by DC-CRF and corresponding RMSECV of PLSR modeling with increasing iteration times.
Comparison of models
This study explores the effect of four approaches, namely the RF, CRF, and DC-CRF variable selection methods and the use of full-spectrum variables, on the accuracy and simplicity of PLSR models based on UV–Vis spectra preprocessed by the standard normal variate (SNV) transformation. All methods were run 200 times, and the mean values were taken. The prediction results of the FULL-PLSR, RF-PLSR, CRF-PLSR, and DC-CRF-PLSR models for seawater are shown in Table 2, which shows that the three feature variable selection techniques improved the model performance compared with the full-spectrum variables. The number of wavelengths selected by the RF-PLSR model is less than that selected by the combined band PLSR model. The number of feature variables selected by the CRF-PLSR model is similar to that of the RF-PLSR model, as the two methods share the same variable selection procedure. However, the number of feature variables selected by DC-CRF was further reduced, indicating that DC-CRF selects a narrower set of variables than the critical band range selected by RF. At the same time, the R2 and RPD values of the three variable selection models are significantly higher, and the RMSE values significantly lower, than those of the full-spectrum PLSR. By using the RF-PLSR algorithm, the R2 of the test set increased from 0.712 to 0.921, RMSEP decreased from 4.787 to 1.742, and RPD increased from 1.863 to 1.903. By using the CRF-PLSR algorithm, R2 increased to 0.914, RMSEP decreased to 1.731, and RPD increased to 2.364. In comparison, the DC-CRF-PLSR model obtained a higher R2 value of 0.943, a lower RMSEP value of 1.603, and a higher RPD value of 2.635. In addition, the RMSE, RAE, and RMSLE of the FULL-PLSR model were 4.787, 0.712, and 1.863, respectively, while for RF-PLSR they were 1.742, 0.421, and 1.903; for CRF-PLSR they were 1.731, 0.407, and 2.364; and for DC-CRF-PLSR they were 1.603, 0.318, and 2.635. This indicates that the DC-CRF-PLSR model performs well in seawater prediction.
Summary of PLSR, RF-PLSR, CRF-PLSR, and DC-CRF-PLSR prediction model results for seawater based on different preprocessing methods
Method | Pre.p | nVar | RMSE | R2 | RPD | RAE | RMSLE | MAE |
---|---|---|---|---|---|---|---|---|
PLSR | RS | 601 | 5.126 | 0.686 | 1.785 | 0.123 | 0.043 | 1.327 |
PLSR | FD | 601 | 4.863 | 0.699 | 1.823 | 0.134 | 0.039 | 1.245 |
PLSR | SD | 601 | 4.813 | 0.713 | 1.867 | 0.127 | 0.041 | 1.372 |
PLSR | S-G | 601 | 4.854 | 0.706 | 1.844 | 0.115 | 0.038 | 1.201 |
PLSR | MSC | 601 | 4.712 | 0.711 | 1.860 | 0.129 | 0.037 | 1.263 |
PLSR | SNV | 601 | 4.722 | 0.720 | 1.890 | 0.124 | 0.036 | 1.192 |
RF-PLSR | SNV | 38 | 1.728 | 0.935 | 3.922 | 0.116 | 0.028 | 0.921 |
CRF-PLSR | SNV | 33 | 1.722 | 0.931 | 3.807 | 0.129 | 0.027 | 0.836 |
DC-CRF-PLSR | SNV | 8 | 1.576 | 0.955 | 5.345 | 0.107 | 0.023 | 0.672 |
All metrics are computed on the testing set.
RS: raw spectra; Pre.p: preprocessing method; nVar: number of variables; FD: first derivative; SD: second derivative; S-G: Savitzky–Golay smoothing; MSC: multiplicative scatter correction; and SNV: standard normal variate.
Compared to the study by Li et al. (2020), which used PLSR combined with successive projections algorithm for wavelength selection and obtained R2 of 0.843 and RMSEP of 6.622 for COD prediction, our proposed DC-CRF-PLSR achieved better prediction performance in terms of both Rp2 and RMSEP. Another recent study by Cen et al. (2021) applied a multi-scale analysis method combined with GA to identify phytoplankton functional groups in coastal waters using UV–Vis spectroscopy, and attained R2 of 0.993 and RMSEP of 1.29. In comparison, our DC-CRF-PLSR model achieved comparable results while using a simpler modeling approach. To further validate the differences in RMSEP values among these variable selection methods, a Wilcoxon signed rank test was performed, and the results indicated a significant difference between the DC-CRF method and other methods at a significance level of 0.05. Overall, for the performance of seawater COD spectra, the DC-CRF-PLSR is far more accurate than FULL-PLSR and slightly better than the other two methods.
Table 3 shows the RMSECV values and error fluctuations of the RF-PLSR, CRF-PLSR, and DC-CRF-PLSR algorithms as the number of iterations changes. Each case ran 20 times for statistical purposes, and the DC-CRF-PLSR iteration was calculated by multiplying the number of outer and inner loops. RF-PLSR and CRF-PLSR stabilized after 5,000 iterations, while DC-CRF-PLSR achieved stable performance in less than 2,000 iterations, with stability achieved at around 1,600 iterations. The reduction in the number of iterations indicates that a large amount of repeated random sampling was avoided, which improved the efficiency on the basis of algorithm stability. The RMSECV of the three algorithms all showed a downward trend with increasing iterations, with rapid initial decline and gradual flattening in the middle and later stages. The difference lies in the fact that at 50 iterations, the RMSECV of the random variables selected by RF for modeling was significantly higher than the other two because of the optimization of the C value ranking for the initial variable subset. After 2,000 iterations, the results of RF-PLSR and CRF-PLSR tended to be consistent, indicating that a large number of iterations had overwhelmed the advantages of the initial variable set optimization. In addition, after 2,000 iterations, the RMSECV of RF-PLSR and CRF-PLSR still showed a downward trend, although the change was small, while the RMSECV of DC-CRF-PLSR had solidified. Although the overall fluctuation error of DC-CRF-PLSR is smaller than that of the two other algorithms, this solidification of the variable set caused by the contraction function hindered the further optimization of the model, resulting in limitations in model optimization when facing different datasets.
Variation of RMSECV values for the three models with increasing number of iterations
nITE | RF-PLSR | CRF-PLSR | DC-CRF-PLSR |
---|---|---|---|
50 | 4.483 ± 2.131 | 3.802 ± 1.834 | 3.761 ± 1.345 |
200 | 3.105 ± 2.013 | 2.879 ± 1.637 | 3.686 ± 1.531 |
1,000 | 2.359 ± 0.831 | 2.351 ± 0.816 | 2.560 ± 1.653 |
2,000 | 2.133 ± 0.234 | 2.153 ± 0.315 | 1.575 ± 0.072 |
5,000 | 1.845 ± 0.213 | 1.834 ± 0.234 | 1.583 ± 0.046 |
10,000 | 1.722 ± 0.135 | 1.731 ± 0.101 | 1.577 ± 0.045 |
nITE: number of iterations.
Scatter plots of the four models
The prediction scatter plots of FULL-PLSR (scatter index: 1.26), RF-PLSR (scatter index: 1.21), CRF-PLSR (scatter index: 1.23), and DC-CRF-PLSR (scatter index: 1.15) models based on UV–Vis spectra are shown in Figure 9. The plots exhibit a strong linear relationship between the predictor and response variables in each PLSR model. The scatter index was computed by taking the logarithmic transformation of the original data, fitting a linear regression model to the log-transformed data, calculating the mean squared error (MSE) of the regression model, and taking the square root of the MSE. Compared to the FULL-PLSR model predictions, the scatter points from the other three models cluster more tightly around the 1:1 line, indicating a high correlation between the selected wavelength features and the COD values of the seawater spectra. However, the distribution of points in the DC-CRF-PLSR scatter plot was not markedly different from that of the other two variable selection techniques. This suggests the limited diversity in the sample set may have constrained the performance of the DC-CRF algorithm.
CONCLUSION
This paper proposes a novel wavelength variable selection algorithm based on DC-CRF. The DC-CRF algorithm successfully addresses the issues of slow convergence, reduced exploratory power, and difficult parameter adjustment faced by previous methods for complex seawater spectral optimization. By analyzing five different spectral preprocessing methods and selecting SNV transformation as the optimal preprocessing approach, this study demonstrates the rationale behind selecting the C value and outer loop contraction settings in the DC-CRF algorithm. Through comparative analysis of the prediction performance of three models – RF-PLSR, CRF-PLSR, and DC-CRF-PLSR – for estimating seawater COD levels, it is shown that the DC-CRF-PLSR model achieves superior performance and maximizes simplification of model inputs.
The DC-CRF algorithm selects important spectral feature variables through a double-loop contraction mechanism, improving efficiency as well as identifying key wavelengths containing information relevant to COD variation. This helps gain insights into the spectral characteristics and variation patterns of COD and enhances the accuracy of COD detection. This study presents a preliminary investigation of the spectral variables related to seawater COD, providing useful improvements to guide further research. However, this research has limitations in only considering the internal factors in water and not accounting for external environmental factors in a more comprehensive way. Future studies will expand sample ranges, integrate other spectral models, and consider environmental factors such as waves and climate to improve model applicability. Expanding the range of low COD seawater samples and samples from diverse regions will also be crucial. Overall, this paper makes important contributions by proposing and validating a novel DC-CRF algorithm for optimizing complex seawater spectral analyses.
AUTHORS’ CONTRIBUTIONS
All authors contributed to the study's conception and design. Material preparation, data collection and analysis were performed by S. Hou. The first draft of the manuscript was written by S. Hou, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
FUNDING
This study was financially supported by the Taishan Scholars Program, the National Natural Science Foundation of China (No. U2006222), the National Key Research and Development Program of China (No. 2020ZLHY04), and the Qilu University of Technology (Shandong Academy of Sciences) unveiling project (No. 2022JBZ01-02-02).
CODE AVAILABILITY
The code generated during the current study is available from the corresponding author on reasonable request.
DECLARATIONS
All authors have read, understood, and have complied as applicable with the statement on ‘Ethical responsibilities of authors’ as found in the Instructions for Authors.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.