Abstract
Oil content (OC) is one of the important evaluation indicators in oilfield wastewater (OW) treatment. The purpose of this study is to realize online real-time detection of OC in OW by combining ultraviolet spectrophotometry with a convolutional neural network (CNN). Transmission data of 80 groups of OW samples were measured for model establishment. Three CNN models with different structures were established to generalize the hyperparameter optimization process of the model. Furthermore, the synergy interval partial least squares (siPLS) model, a common method in spectroscopy, was built so that its accuracy could be compared with that of the CNN models. The results indicate that the CNN models perform better than siPLS; Model 3 has the lowest root mean square error of prediction (RMSEP), 1.606 mg/L. Consequently, the CNN model can be used in the monitoring of OW. This article provides guidance for rapid analysis of the OC of OW.
HIGHLIGHTS
Transmission spectra of the oil wastewater with different oil contents were determined.
Accuracy of the siPLS and CNN models for the prediction of the oil content in the oil wastewater was compared.
The CNN model is more suitable to analyze the oil content of oilfield wastewater than the siPLS model.
The CNN models with different depth structures were compared, and the deepest model exhibited the highest ability of prediction.
INTRODUCTION
With the application of tertiary oil extraction in Chinese oilfields in recent years (Seyyedsar et al. 2016; Hamza et al. 2017; Wei et al. 2018), the annual volume of oilfield wastewater (OW) has continued to increase (Cui et al. 2019; Zhang et al. 2019a, 2019b); emulsified oil, suspended solids, and dissolved salts are its major pollutants. OW must be treated and can be reinjected into the oil well only after meeting the standard (He et al. 2020). Oil content (OC) is one of the important parameters in OW treatment, so this work seeks an effective method for rapidly analyzing the OC of OW.
A traditional method for determining the OC of OW is the standard curve method (Qi et al. 2018), in which the absorbance of an extract collected from the OW sample is compared with a standard curve to obtain the OC. However, this method is not suitable for rapid measurement of OC in OW because it requires manual operation and involves complicated steps. Spectral detection is used to analyze a variety of substances and is non-contact, fast, and highly precise (Leng et al. 2021; Cardoso et al. 2022; Liu et al. 2022; Ma et al. 2022), so it is better suited to online real-time monitoring. Many recent studies have built on this approach. Yan et al. (2020a, 2020b) proposed a semi-supervised soft sensor modeling method that embeds a manifold into a deep neural regression network and used it to predict the total Kjeldahl nitrogen concentration in a sewage treatment process. Yang et al. (2022) applied near-infrared (NIR) spectroscopy to reactive group analysis of photothermal curing silicone and established calibration models for the epoxy and C=C double bond values. The determination coefficients (R²) of the calibration models for the epoxy and C=C double bond values were 0.9631 and 0.9689, respectively, and the corresponding root mean square errors (RMSE) were 0.0179 and 0.0143. Zhang et al. (2021) compared the ability of UV and infrared (IR) spectroscopy combined with partial least squares (PLS) to quantitatively analyze the diesel content of an oil mixture. Their results show that IR spectroscopy combined with PLS has the higher accuracy, with the RMSE of prediction (RMSEP) and the RMSE of cross validation (RMSECV) reaching 0.0181 and 0.111, respectively. Qi et al. (2018) measured the UV transmittance spectra of prepared OW with a UV–Vis spectrophotometer. Focusing on the relationship between the absorption coefficient and the OC, they found that a third-order polynomial at 234 nm gave the best fit, with a fitting accuracy of 0.9940. Shimamoto & Tubino (2016) established a PLS model based on UV–Vis spectroscopy to quantitatively identify the differences between biodiesel and vegetable oil, and they found that full-spectrum data can identify differences that single-wavelength data cannot. These papers suggest that spectrophotometry combined with an appropriate quantitative analysis model offers high accuracy and a short response time in substance detection. However, there are few studies on the direct analysis of actual water samples from oilfield sites, because the complex nature of OW makes the analysis of OC difficult. Therefore, this work selected and established the quantitative analysis model based on data from actual water samples at the oilfield site.
As a popular deep learning method, the convolutional neural network (CNN) has been widely employed in image recognition (Bora et al. 2020), video analysis (Ma et al. 2017; Hu et al. 2019), and language identification (Pande et al. 2022; Zhang et al. 2022a, 2022b). Lately, the CNN has been gradually applied to spectral data analysis (Liu et al. 2017; Malek et al. 2018; Zhang et al. 2019a, 2019b). Haghi et al. (2021) analyzed the predictive performance of four regression methods based on NIR spectroscopy and Fourier transform infrared spectroscopy (FTIR) for nine different soil properties. In terms of RMSEP, the one-dimensional CNN model gave better results than PLS, support vector regression, and the two-dimensional CNN model for all the soil components considered. Yan et al. (2020a, 2020b) developed quantitative calibration models by combining online Raman spectroscopy with the CNN and compared them with a PLS quantitative calibration model; the experimental results show that the CNN model using raw spectra performed better than the PLS model developed with preprocessed spectra for most analytes. Zhang et al. (2022a, 2022b) reported that the combination of hyperspectral imaging technology and a deep convolutional generative adversarial network could generate single maize kernel spectral data and OC data, which were used to improve the performance of the OC prediction model. Wu et al. (2022) evaluated the ability of the one-dimensional CNN, SVM, and improved PLS discriminant analysis, each combined with Raman spectroscopy, to identify three types of vegetable blend oil, and the results showed that the overall performance of the one-dimensional CNN was significantly higher than that of the other two methods. These studies show that the CNN performs well in spectrum analysis, which can be attributed to its capacity for non-linear fitting. Meanwhile, the CNN handles high-dimensional independent variable matrices and discrete points better than linear analysis. Nevertheless, the model-building process of the CNN is not transparent, unlike linear fitting, which yields an explicit polynomial relating the independent and dependent variables. The number of samples is also one of the factors that influence the performance of the CNN. When the sample size is small, the accuracy of the model is low and its adaptability is poor. On the other hand, the best quantitative analysis model may differ between fields of wastewater monitoring.
In summary, the UV transmission spectrum combined with a quantitative analysis model, such as PLS or the CNN, is suitable for online, real-time monitoring of OC in OW. The CNN and PLS models each have advantages and disadvantages, so it is necessary to compare them and select the one with the highest accuracy. In this paper, the UV spectra of 80 groups of OW samples were measured. Because the CNN model has many hyperparameters, its optimization is complicated, so three CNN models with different structures were established and compared to summarize the model optimization process. These three CNN models were then compared with the siPLS model to select the quantitative analysis model with the highest accuracy. It is hoped that the outcomes of this work can contribute to machine learning implementations in fields related to OW evaluation.
MATERIALS AND METHODS
Material and transmittance spectra measurement
OW samples were collected from the sewage treatment stations of five oil extraction plants in Daqing. A total of 80 groups of OW samples were collected from different injection wells, with each injection well sampled twice at a 2 h interval. A UV–Vis spectrophotometer (TU1900) was used to measure the transmission spectra of the OW with an optical path of 10 mm. The wavelength range was 190–900 nm and the resolution was 2 nm.
Table 1. The OC in re-injection samples
Number | OC (mg/L) | Number | OC (mg/L) | Number | OC (mg/L) |
---|---|---|---|---|---|
1 | 0.99 | 28 | 11.91 | 55 | 9.68 |
2 | 16.76 | 29 | 6.59 | 56 | 2.40 |
3 | 11.64 | 30 | 13.41 | 57 | 6.31 |
4 | 5.96 | 31 | 9.76 | 58 | 15.07 |
5 | 2.68 | 32 | 7.11 | 59 | 12.48 |
6 | 3.76 | 33 | 25.54 | 60 | 13.32 |
7 | 1.45 | 34 | 31.60 | 61 | 0.91 |
8 | 9.22 | 35 | 3.91 | 62 | 27.43 |
9 | 20.26 | 36 | 0.76 | 63 | 19.92 |
10 | 8.53 | 37 | 3.22 | 64 | 3.85 |
11 | 0.93 | 38 | 1.24 | 65 | 14.24 |
12 | 1.17 | 39 | 2.17 | 66 | 2.39 |
13 | 22.66 | 40 | 11.94 | 67 | 1.70 |
14 | 5.92 | 41 | 2.09 | 68 | 11.16 |
15 | 4.38 | 42 | 0.98 | 69 | 8.71 |
16 | 10.37 | 43 | 19.04 | 70 | 2.21 |
17 | 13.28 | 44 | 2.03 | 71 | 12.08 |
18 | 0.92 | 45 | 1.57 | 72 | 8.04 |
19 | 3.47 | 46 | 1.72 | 73 | 6.70 |
20 | 14.37 | 47 | 15.56 | 74 | 3.39 |
21 | 1.07 | 48 | 6.48 | 75 | 15.47 |
22 | 2.71 | 49 | 17.92 | 76 | 1.39 |
23 | 4.68 | 50 | 3.75 | 77 | 10.76 |
24 | 2.50 | 51 | 13.86 | 78 | 11.84 |
25 | 2.63 | 52 | 3.94 | 79 | 21.08 |
26 | 11.13 | 53 | 1.53 | 80 | 3.43 |
27 | 0.91 | 54 | 21.76 | – | – |
As shown in Figure 1, the transmission spectra of the test set show a similar trend at wavelengths between 190 and 900 nm. The transmittance increases gradually with wavelength in the range of 190–500 nm, changes little between 500 and 850 nm, and decreases with wavelength between 850 and 900 nm. OW with a higher OC has lower transmittance, which is consistent with the Lambert–Beer law.
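The relation between transmittance and concentration can be illustrated with a minimal sketch. Under the Lambert–Beer law, absorbance is A = -log10(T) and, at a fixed wavelength, A is proportional to the path length and the concentration, so a higher OC gives a lower transmittance. The function names and the absorptivity value below are illustrative assumptions and are not taken from this study.

```python
import math

def absorbance_from_transmittance(t_percent):
    """Convert percent transmittance (0 < T <= 100) to absorbance A = -log10(T/100)."""
    if not 0 < t_percent <= 100:
        raise ValueError("percent transmittance must be in (0, 100]")
    return -math.log10(t_percent / 100.0)

def transmittance_from_concentration(conc_mg_l, absorptivity, path_cm=1.0):
    """Percent transmittance predicted by A = absorptivity * path * concentration.

    `absorptivity` (L mg^-1 cm^-1) is a hypothetical constant used only for illustration.
    """
    a = absorptivity * path_cm * conc_mg_l
    return 100.0 * 10 ** (-a)

# Hypothetical absorptivity; a 10 mm optical path corresponds to 1 cm.
EPS = 0.02
low_oc = transmittance_from_concentration(2.0, EPS)
high_oc = transmittance_from_concentration(20.0, EPS)
print(low_oc > high_oc)  # the sample with the higher OC transmits less light
```

Real OW spectra also include scattering by suspended solids, so this single-component relation should be read as an idealization.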
Establishment of the CNN model
Local connections and weight sharing (LeCun et al. 2015) allow the CNN to greatly reduce the number of parameters, improve its ability to extract data features, and help prevent overfitting. The convolutional layer, the core structure of the CNN, perceives the spectral data locally, which is conducive to extracting the information carried by spectral peaks and valleys. Therefore, the convolutional layer is well suited to the analysis and processing of one-dimensional spectral data.
The structure of a two-layer CNN. Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/wst.2023.097.
In addition to the convolutional layer, the CNN also has pooling, activation, flatten, and fully connected layers. The activation layer introduces non-linearity so that the network can solve problems that a linear model cannot. The pooling layer extracts the main features and thereby reduces the complexity of the network calculations. The flatten layer converts the multi-dimensional input into one dimension. The fully connected layer summarizes the features learned by the preceding layers and maps them to the sample labels, and the output layer outputs the results. In addition, a batch normalization (BN) layer (Ioffe & Szegedy 2015), dropout (Srivastava et al. 2014), L1 regularization, and L2 regularization can be added to the CNN to prevent overfitting.
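The roles of these layers can be sketched with a forward pass written in plain Python. This is a toy illustration under stated assumptions, not the implementation used in this work: the spectrum, kernel weights, and dense weights are hypothetical, the convolution uses stride 1 without padding, and pooling windows do not overlap.

```python
def conv1d(signal, kernel, bias=0.0):
    """Convolutional layer: slide the kernel over the signal (stride 1, no padding)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(signal) - k + 1)
    ]

def relu(values):
    """Activation layer: keep positive responses, set the rest to zero."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, size=2):
    """Pooling layer: keep the maximum of each non-overlapping window.

    A trailing incomplete window is dropped.
    """
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

def flatten(feature_maps):
    """Flatten layer: concatenate all feature maps into one vector."""
    return [v for fmap in feature_maps for v in fmap]

def dense(inputs, weights, bias=0.0):
    """Fully connected layer with a single output neuron."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

# Hypothetical 8-point spectrum and two hypothetical kernels.
spectrum = [0.1, 0.3, 0.8, 0.4, 0.2, 0.6, 0.9, 0.5]
kernels = [[-1.0, 0.0, 1.0], [0.25, 0.5, 0.25]]

feature_maps = [max_pool1d(relu(conv1d(spectrum, k))) for k in kernels]
features = flatten(feature_maps)
prediction = dense(features, [0.5] * len(features))
print(len(features), prediction)
```

Because the same kernel weights are reused at every position of the spectrum, each kernel adds only three parameters here, which is the weight sharing described above.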
Table 2. Hyperparameter optimization results of each CNN model
Hyperparameter | Model 1 | Model 2 | Model 3 |
---|---|---|---|
Kernel size | 5 | 3 | 3 |
Kernel number | 16 | 16 | 8 |
Hidden number | 128 | 64 | 32 |
Dropout rate | 0.5 | 0.2 | 0.3 |
Batch size | 8 | 16 | 8 |
The rectified linear unit (ReLU) function was used as the activation function of the convolutional layers and the F1 layer. The Adam optimizer was used to train the model and find a local optimum of the objective function.
Model 2 includes two convolutional layers (Conv1 and Conv2), two max-pooling layers (Pooling1 and Pooling2), one fully connected layer (F1), and one output layer (Output). As in Model 1, the dropout function was added to the F1 layer. Model 3 differs from the other two CNN models in having three convolutional layers and three max-pooling layers (Conv1, Conv2, Conv3, Pooling1, Pooling2, and Pooling3). The dropout function was again added to the fully connected layer.
The three models have the same objective function and activation function.
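To give a feel for how depth changes the size of the network, the following sketch estimates how many features reach the fully connected layer in each model. Several details are assumptions, because the text does not state them: one spectral point every 2 nm (356 points from 190 to 900 nm), stride-1 convolution without padding, a pooling size of 2, the same kernel size and kernel number in every stage, and a single convolution and pooling stage in Model 1. The function name `flattened_length` is hypothetical.

```python
# Assumption: one spectral point every 2 nm from 190 to 900 nm.
N_POINTS = (900 - 190) // 2 + 1

def flattened_length(n_points, kernel_size, n_kernels, n_stages, pool_size=2):
    """Number of features after n_stages of convolution followed by max pooling.

    Assumes stride-1 convolution without padding, non-overlapping pooling,
    and identical kernel size and kernel number in every stage.
    """
    length = n_points
    for _ in range(n_stages):
        length = length - kernel_size + 1  # convolution shortens the signal
        length //= pool_size               # pooling divides the length
    return length * n_kernels

# Kernel size and kernel number come from the hyperparameter table;
# the single stage of Model 1 is an assumption.
MODELS = {
    "Model 1": dict(kernel_size=5, n_kernels=16, n_stages=1),
    "Model 2": dict(kernel_size=3, n_kernels=16, n_stages=2),
    "Model 3": dict(kernel_size=3, n_kernels=8, n_stages=3),
}

for name, cfg in MODELS.items():
    print(name, flattened_length(N_POINTS, **cfg))
```

Under these assumptions the deepest model passes the fewest features to the fully connected layer, since each additional pooling stage roughly halves the signal length.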
Establishment of the siPLS model
As a common model in spectral quantitative analysis, the PLS model has been widely used and has many improved versions. Interval partial least squares (iPLS) is a PLS algorithm proposed by Nørgaard et al. (2000). Its principle is to divide the entire modeled band into multiple sub-intervals of equal width, build a PLS model on each sub-interval, and select the modeling interval with the highest accuracy. The siPLS (Chen et al. 2008) builds on iPLS by combining two or more sub-intervals, which gives it better predictive power than iPLS.
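The interval search behind iPLS and siPLS can be sketched as follows. The sketch only covers the splitting of the band and the enumeration of interval combinations; fitting and scoring a PLS model on each candidate is omitted. The number of spectral points (356), the number of sub-intervals (10), and the maximum of three combined intervals are illustrative assumptions, and the function names are hypothetical.

```python
from itertools import combinations

def split_into_intervals(n_points, n_intervals):
    """Return (start, stop) index pairs of near-equal width covering range(n_points)."""
    base, extra = divmod(n_points, n_intervals)
    bounds, start = [], 0
    for i in range(n_intervals):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def synergy_candidates(intervals, max_combined=3):
    """Enumerate every combination of 2..max_combined sub-intervals (siPLS search space)."""
    for r in range(2, max_combined + 1):
        yield from combinations(intervals, r)

def select_columns(spectrum, combo):
    """Concatenate the spectral points that belong to the chosen sub-intervals."""
    return [x for start, stop in combo for x in spectrum[start:stop]]

intervals = split_into_intervals(356, 10)          # iPLS: one model per interval
candidates = list(synergy_candidates(intervals))   # siPLS: one model per combination
print(len(intervals), len(candidates))
```

In a full implementation, each candidate would be scored by cross validation, for example by RMSECV, and the combination with the lowest error would be kept.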
Based on the spectral data of the OW, the siPLS model and three CNN models were established. The training set data was utilized to establish the model, while the test set data was considered to validate it.
The evaluation parameters of the regression model
In these equations, $y_i$ represents the real value; $\hat{y}_i$ indicates the predicted value; $\bar{y}$ expresses the average of $y_i$; $\bar{\hat{y}}$ stands for the average of $\hat{y}_i$; and $n$ and $m$ represent the numbers of samples in the correction set and the prediction set, respectively.
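The equations themselves are not reproduced in this version of the text. As a hedged reconstruction, the standard forms of these metrics are given below, with $y_i$ the measured value, $\hat{y}_i$ the predicted value, and $\bar{y}$ and $\bar{\hat{y}}$ their averages; the symbols are assumed rather than taken from the original. For RMSECV, $\hat{y}_i$ is the cross-validated prediction for a correction-set sample. The correlation coefficient is written once: $r_c$ is evaluated over the $n$ correction-set samples and $r_p$ over the $m$ prediction-set samples.

```latex
\[
\mathrm{RMSECV} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}},
\qquad
\mathrm{RMSEP} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^{2}}
\]
\[
r = \frac{\sum_{i}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}
         {\sqrt{\sum_{i}\left(y_i - \bar{y}\right)^{2}\,\sum_{i}\left(\hat{y}_i - \bar{\hat{y}}\right)^{2}}}
\]
```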
A decrease in RMSECV means that the training loss of the model is reduced, and a lower RMSEP means a lower prediction loss on the test set. The model is overfitting when RMSECV is much lower than RMSEP and RMSEP does not meet the requirement. Conversely, if both RMSEP and RMSECV are higher than expected, the model is underfitting. In either case, the hyperparameters should be optimized until RMSEP meets the requirement.
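The evaluation parameters and the fitting rules above can be sketched in a few lines. The `rmse` and `pearson_r` functions follow the standard definitions; the `diagnose` helper is a rough heuristic whose `target` and `gap_ratio` thresholds are hypothetical, since the text does not quantify "the requirement" or "much lower".

```python
import math

def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(measured, predicted)) / n)

def pearson_r(measured, predicted):
    """Pearson correlation coefficient between measured and predicted values."""
    mean_y = sum(measured) / len(measured)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((y - mean_y) * (p - mean_p) for y, p in zip(measured, predicted))
    var_y = sum((y - mean_y) ** 2 for y in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_y * var_p)

def diagnose(rmsecv, rmsep, target, gap_ratio=2.0):
    """Classify a model from its errors; `target` and `gap_ratio` are assumed thresholds."""
    if rmsep <= target:
        return "acceptable"
    if rmsep > gap_ratio * rmsecv:
        return "overfitting"   # training error far below prediction error
    if rmsecv > target:
        return "underfitting"  # both errors above the requirement
    return "needs tuning"

# Hypothetical measured and predicted OC values (mg/L).
measured = [1.0, 5.0, 10.0, 20.0]
predicted = [1.5, 4.0, 11.0, 18.0]
print(rmse(measured, predicted), pearson_r(measured, predicted))
```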
RESULTS
The comparison of the three CNN models with the siPLS model
Figure 4. The correlation plots between the predicted and the measured values on the training set and test set, obtained from (a) Model 1; (b) Model 2; (c) Model 3; and (d) siPLS.
The prediction results of the three CNN models are shown in Figures 4(a)–(c). The three models show good prediction ability, with an RMSEP below 3 mg/L and an rp above 0.88. Model 3 has the best prediction results among all models: its rp, RMSEP, rc, and RMSECV are 0.9732, 1.606, 0.9949, and 0.845, respectively. As shown in Figure 4(d), the rp, RMSEP, rc, and RMSECV of the siPLS model are 0.7375, 3.372, 0.8140, and 4.325, respectively.
The outcomes obtained from all models are given in Table 3 for a more comprehensive comparison. Table 3 shows that Model 3 has the best prediction ability, followed by Model 1 and then Model 2, while the siPLS model has the worst. The prediction accuracies of Model 1 and Model 2 are close. In contrast, the prediction ability of Model 3 is considerably better than that of Model 1, with the RMSEP reduced by 44.20%. Compared with the siPLS model, the RMSEP of Models 1, 2, and 3 decreased by 14.65, 12.77, and 52.35%, respectively. The prediction results of the CNN are generally better than those of the siPLS model.
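The relative reductions quoted above can be checked with a short calculation from the RMSEP values in Table 3. This is a sketch using the rounded tabulated values, so the last digit may differ slightly from the reported percentages; the function name `relative_reduction` is hypothetical.

```python
# RMSEP values (mg/L) as tabulated in Table 3.
RMSEP = {"Model 1": 2.878, "Model 2": 2.941, "Model 3": 1.606, "siPLS": 3.372}

def relative_reduction(reference, value):
    """Percentage by which `value` is lower than `reference`."""
    return 100.0 * (reference - value) / reference

# Reduction of each CNN model relative to siPLS.
for name in ("Model 1", "Model 2", "Model 3"):
    print(name, "vs siPLS:", round(relative_reduction(RMSEP["siPLS"], RMSEP[name]), 2))

# Reduction of Model 3 relative to Model 1.
print("Model 3 vs Model 1:",
      round(relative_reduction(RMSEP["Model 1"], RMSEP["Model 3"]), 2))
```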
Table 3. The comparison of evaluation parameters calculated by each model
Model | RMSECV (mg/L) | RMSEP (mg/L) | rc | rp |
---|---|---|---|---|
Model 1 | 1.298 | 2.878 | 0.9850 | 0.8996 |
Model 2 | 2.231 | 2.941 | 0.9536 | 0.8801 |
Model 3 | 0.845 | 1.606 | 0.9949 | 0.9732 |
siPLS | 4.325 | 3.372 | 0.8140 | 0.7375 |
The errors of the three CNN models
Figure 5. (a) The absolute and (b) relative errors between the predicted and the measured values of the CNN on the test set.
DISCUSSION
It should be noted that Model 3 is deeper than Model 2, and Model 2 is deeper than Model 1. Theoretically, provided that the model does not overfit, the more complex the structure, the higher the accuracy of the model. For example, in PLS model optimization (Lopez-Fornieles et al. 2022), the accuracy rises as the number of latent variables increases until it peaks, after which the model overfits and the accuracy decreases. In this work, however, Model 2 rather than Model 1 or Model 3 has the lowest accuracy. This can be attributed to the fact that CNN optimization progressively narrows a region of the hyperparameter space, like shrinking from a big circle to a small one, rather than moving a point along a single curve. Furthermore, PLS is a linear model, while the CNN can handle non-linear correlations in the data, so the CNN model can analyze the spectrum of OW with complex properties more accurately. Nevertheless, the optimization of the CNN model is more complex and time-consuming than that of the PLS model, which should be considered in practical application.
On the other hand, the influence of the sample distribution on model accuracy cannot be ignored (Buda et al. 2018). As shown in Figure 5, the prediction error for samples with an OC above 20 mg/L is higher than that for samples at other concentrations in all three models, because there are few samples with an OC above 20 mg/L. Meanwhile, the water quality of samples with excessive OC is more complex, which makes them more difficult to predict. As a result, in practical application, samples should be taken so that their concentrations are uniformly distributed. Water quality should also be included as an influencing factor in further study.
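The uneven sample distribution can be made concrete by counting the measured OC values per concentration range. The values below are the 80 OC measurements listed for the re-injection samples; the bin edges are an illustrative choice, and the count covers the whole data set because the split into training and test sets is not given in the text.

```python
# OC (mg/L) of samples 1-80, in sample order.
OC_MG_L = [
    0.99, 16.76, 11.64, 5.96, 2.68, 3.76, 1.45, 9.22, 20.26, 8.53,
    0.93, 1.17, 22.66, 5.92, 4.38, 10.37, 13.28, 0.92, 3.47, 14.37,
    1.07, 2.71, 4.68, 2.50, 2.63, 11.13, 0.91, 11.91, 6.59, 13.41,
    9.76, 7.11, 25.54, 31.60, 3.91, 0.76, 3.22, 1.24, 2.17, 11.94,
    2.09, 0.98, 19.04, 2.03, 1.57, 1.72, 15.56, 6.48, 17.92, 3.75,
    13.86, 3.94, 1.53, 21.76, 9.68, 2.40, 6.31, 15.07, 12.48, 13.32,
    0.91, 27.43, 19.92, 3.85, 14.24, 2.39, 1.70, 11.16, 8.71, 2.21,
    12.08, 8.04, 6.70, 3.39, 15.47, 1.39, 10.76, 11.84, 21.08, 3.43,
]

def count_by_bin(values, edges):
    """Count values in [edges[i], edges[i+1]); the last bin is open-ended."""
    counts = [0] * len(edges)
    for v in values:
        # Index of the highest lower edge that does not exceed the value.
        idx = max(i for i, edge in enumerate(edges) if v >= edge)
        counts[idx] += 1
    return counts

EDGES = [0.0, 5.0, 10.0, 20.0]  # illustrative bin edges in mg/L
counts = count_by_bin(OC_MG_L, EDGES)
for edge, count in zip(EDGES, counts):
    print(f"from {edge:>4} mg/L: {count} samples")
```

Only a small fraction of the samples falls in the highest range, which is consistent with the larger prediction errors reported there.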
CONCLUSIONS
In this paper, the UV spectra of 80 groups of OW samples with different OCs were measured and analyzed. The CNN model and siPLS model were established for comparison. Three CNN models with different structures were selected to show the general optimization process. The following conclusions can be drawn:
OC in OW can be quickly detected based on UV spectrophotometry and a quantitative analysis model, in which the CNN model has higher accuracy than the siPLS model.
The optimization process of the CNN model is more complicated than that of the PLS model, so model establishment takes more time but yields better accuracy.
ACKNOWLEDGEMENTS
The authors would like to thank Daqing Oil Field Co. for the support of sample collection.
FUNDING
This work was supported by the Youth Innovative Talents Training Plan of General Undergraduate University in Heilongjiang Province (UNPYSCT-2020148), the Open project of MOE Key Laboratory for Enhanced Oil and Gas Recovery (No. NEPU-EOR-2022-06), and the Postdoctoral support project of Heilongjiang Province (LBH-Q21084).
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.