Abstract
With the growing demand for water quality monitoring, soft measurement sensors, which overcome the high cost and long measurement time of traditional methods, have drawn increasing attention. In this study, a machine learning-based soft sensor was developed to simultaneously monitor four water quality indicators: COD, NH4+-N, NO3--N, and PO43--P. Firstly, specialized experimental equipment and calibration methods were developed to generate a matching dataset of over 94,000 data points. Secondly, five models (Multiple Linear Regression, Ridge Regression, AdaBoost, Decision Tree Regression, and Bagging Regression) were constructed and compared. The learning accuracy of the models ranged from 0.8860 to 0.9999, and the predictions of Bagging Regression fitted the true values most closely. Subsequently, the fuzzy grade method was adopted to reduce the prediction error and strike a balance between efficiency and accuracy. Finally, the designed soft sensor was used for real-time monitoring at three monitoring points in Changzhou, China from September to October 2020, and the results demonstrated the feasibility of the soft sensor in practical application. This study provides a fast and accurate method for water quality measurement, which is of great significance for the management of rural sewage treatment facilities.
HIGHLIGHTS
The soft sensor based on machine learning is applied to the water quality monitoring of rural sewage treatment facilities.
Laboratory-level devices are designed to obtain datasets for model training.
Fuzzy classification method is introduced to analyze and process data to reduce errors and obtain comprehensive water quality evaluation results.
INTRODUCTION
In recent years, with the construction of new sewage treatment plants and the upgrading of discharge standards, China's sewage treatment rate has gradually increased. However, statistics show that the treatment of rural wastewater lags far behind that of urban areas (Wang & Gong 2018; Yizhou & Liming 2019; Huang et al. 2020). Rural wastewater and its treatment facilities differ markedly from their urban counterparts. Rural wastewater is characterized by its small scale, complex composition, wide variation, relatively low pollutant concentration, and biochemical properties (Liang et al. 2011; Liu & Shen 2015; Yizhou & Liming 2019). Given these water quality characteristics, decentralized treatment is better suited to reducing cost and improving treatment efficiency, but it also brings more complex management needs (Massoud et al. 2009; Liu & Shen 2015) and requires rapid and accurate measurement of water quality indicators. However, the online and offline water quality detection methods in common use can hardly provide real-time feedback control of water quality and are not suitable for process control and management systems.
With the development of computer technology, soft measurement has become a major research hotspot in the process control field owing to its dynamic response, low cost, and agility (Qing & Yu 2005; Kadlec et al. 2011; Haimi et al. 2013; Zhu et al. 2020; Ching et al. 2021; Paepae et al. 2021). The soft measurement technique establishes a correlation between the target water quality indicator and other water quality indicators and infers the target from their numeric values. Soft sensors are iterated and optimized to balance the accuracy and speed of real-time online measurement, using the relationships between water quality indicators together with computational prediction (Paepae et al. 2021). According to the principle and structure of the internal model (Haimi et al. 2013), soft sensors can be divided into three categories: mechanistic model sensors, data-driven model sensors, and composite model sensors. The most typical mechanistic models are the ASM series (Henze et al. 1987). With ASM models, Spérandio et al. predicted ammonia and nitrate from DO and ORP values (Sperandio & Queinnec 2004), and Grau et al. inferred component concentrations and mass fractions from measured parameters (Grau et al. 2007). Because mechanistic models face many constraints, data-driven models are receiving increasing attention. Data-driven models are not built on known physical or chemical knowledge but depend entirely on the collected data, which can reflect the real process state, and they are particularly suitable for wastewater quality monitoring (Dürrenmatt & Gujer 2012). The composite model combines biochemical mechanisms with data-driven black-box features and offers high accuracy and interpretability (Sheng-guang et al. 2008; Cong et al. 2015), but it has a higher computational demand (Nair et al. 2022).
Compared with mechanistic models and composite models, data-driven models strike a compromise between accuracy and prediction efficiency, can achieve higher practical value after purposeful development and optimization, and are better suited to the complex and changeable application scenarios of rural sewage.
In the development of soft sensors, improving measurement accuracy requires selecting suitable model algorithms. Previous studies have reported on algorithm selection and optimization of data-driven models in different application scenarios, and have implemented methods such as the artificial neural network (ANN) with a back-propagation (BP) training algorithm (Liping & Boshnakov 2010), the radial basis function (RBF) neural network optimized by a genetic algorithm (Li & Yang 2009), the time difference-based multi-kernel relevance vector machine (MRVM) (Wu et al. 2020), and the eXtreme Gradient Boosting (XGBoost) machine (Ching et al. 2022). Although existing studies have made breakthroughs in algorithm design, relatively few have addressed the prediction of water quality indicators in rural sewage treatment facilities. Most existing prediction models are single-target measurement methods, which can hardly cover the multiple indicators required in sewage treatment. Moreover, artificial neural networks place high demands on datasets and hardware, which does not match the characteristics of rural sewage treatment facilities. In contrast, machine learning algorithms have shown good performance with respect to dataset requirements, training duration, and discrete points. Therefore, this study chose machine learning algorithms for soft sensor development.
In addition, the model algorithms can be continuously optimized based on different application scenarios and data history, and model performance will also be affected by the characteristics and differences of the dataset (Wu et al. 2019; Ching et al. 2022; Fox et al. 2022; Nair et al. 2022). Therefore, it is necessary to design a matching dataset specifically to improve prediction accuracy. Due to the constraints of matching datasets, the design of soft sensors for rural sewage monitoring in China currently faces a series of problems. Firstly, the characteristics of sewage quality in rural China are different from those in urban areas and from other countries, and it is not possible to directly use existing datasets to train soft sensors. Therefore, specific datasets need to be constructed based on actual water quality characteristics. Secondly, in the development of soft sensors, the algorithm selection and parameter adjustment optimization need to be based on matched datasets and be calibrated in actual water samples to improve accuracy. Finally, compared to actual laboratory measurement methods, the accuracy of soft sensors in machine learning models inevitably faces shortcomings. Therefore, it is necessary to combine it with the needs of actual management to seek a balance between accuracy and timeliness.
In response to the above issues, this study focused on data-driven models based on machine learning algorithms and designed a soft sensor suitable for real-time monitoring in rural sewage treatment facilities in China. First, a laboratory pilot device was designed to obtain a dataset that matches the scenario of rural sewage treatment facilities. Then, based on this dataset, multiple regression models and machine learning models were compared, and the algorithm adopted for the soft sensor was determined. Next, the parameters of the soft sensor model were optimized using actual water quality data from the sewage treatment plant, verifying the prediction accuracy and the generality of the application. Finally, the calibrated soft sensors were deployed at three points in the Wujingang watershed in Changzhou City to analyze their performance in practical application.
MATERIALS AND METHODS
Development framework for soft sensors
Design of laboratory experimental devices and data generation
In the experimental process, sensor probe measurement and mathematical calculation were combined to fit the change equation on the time series on the basis of an exact measurement, and the corresponding additional dataset was obtained. The measured dataset was combined with the additional data to form the large dataset needed for model design. Physical indicators (pH, ORP, COND, TURB) were determined directly by sensor probes. Changes in the concentration of the target biochemical indicators (COD, NH4+-N, NO3--N, PO43--P) were calculated using absorbance and initial concentration data.
The data were cleaned and pretreated before fitting and during calculation. The main purpose is to remove noise and outliers and reflect different background states of sewage. Details of the specific dataset generation and cleaning steps are provided in Supplementary material, Appendix 1.
Modelling and comparison
Five algorithms, covering traditional regression and machine learning, were modeled in this study and compared by their outputs: the Multiple Linear Regression (MLR) model, the Ridge Regression (RR) model (Hoerl & Kennard 1970a, 1970b), the AdaBoost model (Freund & Schapire 1997), the Decision Tree Regression (DTR) model (Quinlan 1996), and the Bagging Regression (BagR) model (Breiman 1996). Three MLR models were used: the general multiple regression model, the generalized MLR model with cross-multiplied terms, and the generalized MLR model with cross-terms and quadratic terms, abbreviated as LR, LR2, and LR3 for convenience. The RR model adds a restriction on the size of the model coefficients to the multivariate linear model, which prevents overfitting and gives the model stronger generalization capability. The AdaBoost model (abbreviated as AdaR) integrates a series of linear regression models to obtain a better-performing model with stronger generalization ability. When relatively simple and interpretable models are integrated, high accuracy can be achieved. The DTR model divides the interval of the continuous independent variables and replaces the 0–1 loss function with a continuous loss function, so that the regression problem is turned into a classification problem. In this study, the CART algorithm was adopted for the DTR model. CART is a greedy algorithm that judges the purity of a classification based on the Gini index, ensuring the greatest decrease in the Gini index with each split. In the BagR algorithm, modeling and training were conducted in multiple rounds. In each round, the model was trained on samples drawn at random from the initial training set. After all rounds, the results were averaged.
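The bagging procedure described above (bootstrap resampling, one model per round, averaged predictions) can be illustrated with a minimal sketch. This is not the implementation used in the study: the base learner here is assumed to be a one-split regression tree chosen by squared error, the data are synthetic, and the names `fit_stump` and `fit_bagging` are hypothetical.

```python
import random
from statistics import mean


def fit_stump(rows, targets):
    """Fit a one-split regression tree (assumed base learner) by squared error."""
    best = None
    for j in range(len(rows[0])):
        # Candidate thresholds: every observed value except the largest,
        # so that both sides of the split are non-empty.
        for t in sorted({r[j] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, targets) if r[j] <= t]
            right = [y for r, y in zip(rows, targets) if r[j] > t]
            lm, rm = mean(left), mean(right)
            sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no valid split: fall back to the overall mean
        overall = mean(targets)
        return lambda row: overall
    _, j, t, lm, rm = best
    return lambda row: lm if row[j] <= t else rm


def fit_bagging(rows, targets, n_rounds=25, seed=0):
    """Train one stump per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_rounds):
        # Bootstrap: draw n training samples with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([rows[i] for i in idx], [targets[i] for i in idx]))
    return lambda row: mean(m(row) for m in models)


# Synthetic toy data (not from the study): the target steps from 0 to 10 at x = 10.
rows = [(float(x), float(x % 3)) for x in range(20)]
targets = [0.0 if x < 10 else 10.0 for x in range(20)]
predict = fit_bagging(rows, targets)
low, high = predict((2.0, 2.0)), predict((17.0, 2.0))
```

Averaging over bootstrap rounds is what reduces the variance of the individual trees; in practice a library implementation with deeper trees would normally replace the hand-written stump.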
Applications in actual scenarios
The best-performing model algorithm after screening needs to be validated and calibrated with data from different sewage backgrounds to demonstrate the accuracy and robustness of the model prediction. To validate the performance of the model, a total of 5 datasets with different water quality backgrounds were obtained through small experiments and field measurements. The background water qualities are as follows: pure water, tap water, SBR water, SBR water (before filtration), and SBR water (after filtration). The SBR data were obtained from the actual operation data of Changzhou WWTP. As NO3--N is missing from the SBR-related data, the five backgrounds and four indicators yield 20 combinations minus the 3 missing ones, so a total of 17 application scenarios can be constructed. In each application scenario, the parameters of the model need to be tuned sufficiently to obtain a relatively optimal prediction within an acceptable range. The comprehensive performance and stability of the model are analyzed from the statistics of the optimal prediction results.
To verify the monitoring performance in practice, the designed soft sensor was applied to rural wastewater treatment facilities in Changzhou. The equipment was installed at the outlets of three sewage treatment facilities in the Wujingang basin of Changzhou City, in the Luodong, Qingduntou, and Xiejiatou areas. The locations of the three facilities are shown in Supplementary material, Figure A2.1, and all three use the airlift circulation MBR process. Monitoring sensors for pH, ORP, COND, TURB, SS, and other parameters were installed at the corresponding points and recorded data at 3-min intervals. The resulting data were fed into the soft sensor to predict the target indicators. The study examined operational data from September 1, 2020 to October 15, 2020, and the follow-up analysis is based on data from this period.
Fuzzy grade method
In the monitoring of rural sewage treatment facilities, the objective is to obtain the trends of the chemical indicators measured by the soft sensors and the comprehensive pollution level at each monitoring site. This paper therefore analyzes actual water quality monitoring and management requirements and adopts the fuzzy grading method to classify the degree of water pollution. The fuzzy grading method compensates for the small deficiency in the absolute accuracy of the machine learning soft sensor while reducing time consumption and measurement cost.
Here, SS, TN, and TP refer to the concentrations of suspended solids, total nitrogen, and total phosphorus estimated by calculation; TURB refers to the turbidity measurement; and NH4+-N, NO3--N, and PO43--P refer to the ammonia-nitrogen, nitrate, and phosphate predictions obtained by the soft sensor, respectively.
Once the composite water quality index is obtained, it is graded according to the fuzzy scale of the index value. By construction of the formula, the index lies between 0 and 100 when the water meets the standard, so an index greater than 100 is considered exceeding. Scores below 100 are divided into five equal intervals, which correspond to excellent, good, medium, poor, and bad. The specific grading criteria are shown in Table 1. Finally, the changes and statistics of the comprehensive water quality grade over a given time series are obtained for further analysis.
Score of water quality | Grade |
---|---|
0–20 | Excellent/Grade Ⅰ |
20–40 | Good/Grade Ⅱ |
40–60 | Medium/Grade Ⅲ |
60–80 | Poor/Grade Ⅳ |
80–100 | Bad/Grade Ⅴ |
>100 | Exceeding |
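The grading rule in Table 1 can be sketched as a small lookup function. The table does not say which grade a boundary score such as 20 belongs to, so this sketch assumes each interval includes its lower bound and that a score of exactly 100 still counts as Grade Ⅴ; the function name `fuzzy_grade` is hypothetical.

```python
# Upper bounds of the first four intervals and their labels (from Table 1).
GRADE_BOUNDS = [
    (20, "Excellent/Grade Ⅰ"),
    (40, "Good/Grade Ⅱ"),
    (60, "Medium/Grade Ⅲ"),
    (80, "Poor/Grade Ⅳ"),
]


def fuzzy_grade(score):
    """Map a composite water quality score to its grade label.

    Assumption: intervals are [lower, upper), and 80 to 100 inclusive is Grade Ⅴ.
    """
    if score < 0:
        raise ValueError("score must be non-negative")
    if score > 100:
        return "Exceeding"
    for upper, label in GRADE_BOUNDS:
        if score < upper:
            return label
    return "Bad/Grade Ⅴ"


# Example scores chosen only for illustration.
examples = {s: fuzzy_grade(s) for s in (12.0, 44.17, 50.55, 65.56, 93.0, 104.2)}
```

Under these assumptions a score of 50.55 falls in the medium grade and 65.56 in the poor grade.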
RESULTS
Dataset generation and descriptive statistics
A total of 94,852 experimental data points were obtained from 23 experimental forms, forming the dataset used to train the soft sensor machine learning models. In the dataset, the concentration ranges are 0–150 mg/L for NH4+-N and NO3--N, 0–1,000 mg/L for COD, and 0–15 mg/L for PO43--P, which supports measurement over a wide range.
Machine learning model prediction and comparison
Fit validation of residual index of experimental data
To further demonstrate the predictive performance of the algorithms in the NH4+-N, NO3--N, and PO43--P soft sensors, representative linear models (multiple regression models) and machine learning models (DTR and BagR models) were selected, based on Section 3.2, to validate the remaining water quality indicators. The prediction results are shown in Table 2, and the fitting diagrams are shown in Supplementary material, Appendix 4.
Indicator | Metric | MLR | DTR | BagR |
---|---|---|---|---|
COD | MRE | 1.1989 | 0.0056 | 0.0038 |
COD | MAE | 2.7369 | 0.0425 | 0.0324 |
COD | RMSE | 37.7377 | 1.1758 | 0.9739 |
COD | Corr | 0.8860 | 0.9999 | 0.9999 |
NH4+-N | MRE | 0.9751 | 0.0047 | 0.0035 |
NH4+-N | MAE | 0.4198 | 0.0045 | 0.0033 |
NH4+-N | RMSE | 5.9182 | 0.1321 | 0.0791 |
NH4+-N | Corr | 0.8864 | 0.9999 | 0.9999 |
NO3--N | MRE | 0.9638 | 0.0044 | 0.0030 |
NO3--N | MAE | 0.4713 | 0.0039 | 0.0024 |
NO3--N | RMSE | 7.0369 | 0.0938 | 0.0508 |
NO3--N | Corr | 0.8422 | 0.9999 | 0.9999 |
PO43--P | MRE | 1.4475 | 0.0165 | 0.0074 |
PO43--P | MAE | 0.0312 | 0.0005 | 0.0003 |
PO43--P | RMSE | 0.4614 | 0.0073 | 0.0065 |
PO43--P | Corr | 0.8636 | 0.9999 | 0.9999 |
The model performance metrics in Table 2 show that both machine learning models have low MRE, MAE, and RMSE values and a Corr close to 1, which means that they predict all four water quality indicators well and significantly better than the multiple regression models. The advantages of the machine learning models over the multiple regression models are also evident from the fit curves in Supplementary material, Appendix 4. Although both machine learning models achieved high accuracy, the BagR model performed slightly better than the DTR model on these four metrics. For this reason, the BagR model was chosen for further investigation in the design of subsequent soft sensors.
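For readers who want to reproduce metrics of this kind, the sketch below computes MRE, MAE, RMSE, and Corr under their common textbook definitions. The exact formulas used in the study are not given in this section, so these definitions are an assumption: MRE is taken as the mean of |prediction − truth| / |truth| over non-zero true values, and Corr as the Pearson correlation coefficient. The function name `regression_metrics` and the sample values are illustrative only.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MRE, MAE, RMSE and Pearson Corr (assumed textbook definitions)."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # Relative error is undefined where the true value is zero; skip those points.
    rel = [abs(e) / abs(t) for e, t in zip(errors, y_true) if t != 0]
    mre = sum(rel) / len(rel) if rel else float("nan")
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    # Correlation is undefined if either series is constant.
    corr = cov / math.sqrt(var_t * var_p) if var_t > 0 and var_p > 0 else float("nan")
    return {"MRE": mre, "MAE": mae, "RMSE": rmse, "Corr": corr}


# Illustrative values only (not measurements from the study).
metrics = regression_metrics([10.0, 20.0, 30.0, 40.0], [11.0, 19.0, 33.0, 38.0])
```

Because RMSE squares the errors before averaging, a few large single-point deviations raise it much more than they raise MAE, which is one reason to report both.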
Validation of model measurement capabilities in different sewage background
To further validate the wider applicability of the machine learning model in the present study, the BagR model was used to validate measurement capability against different water backgrounds. Table 3 summarizes the prediction performance for each indicator under the different water backgrounds. The actual and predicted values in the fitted curves almost overlap, as detailed in Supplementary material, Appendix 5.
Indicator | Metric | Pure water | Tap water | SBR influent | SBR effluent (filtered) | SBR effluent (before filtration) |
---|---|---|---|---|---|---|
COD | MRE | 0.0038 | 0.0041 | 0.0046 | 0.0072 | 0.0044 |
COD | MAE | 0.0202 | 0.0311 | 0.0223 | 0.0500 | 0.0178 |
COD | RMSE | 0.3960 | 0.5458 | 0.4483 | 1.4965 | 0.3933 |
COD | Corr | 0.999988 | 0.999987 | 0.999973 | 0.999844 | 0.999973 |
NH4+-N | MRE | 0.0035 | 0.0034 | 0.0056 | 0.0054 | 0.0040 |
NH4+-N | MAE | 0.0046 | 0.0030 | 0.0017 | 0.0028 | 0.0020 |
NH4+-N | RMSE | 0.1142 | 0.0940 | 0.0240 | 0.0429 | 0.0667 |
NH4+-N | Corr | 0.999957 | 0.999974 | 0.999995 | 0.999982 | 0.999952 |
NO3--N | MRE | 0.0032 | 0.0036 | | | |
NO3--N | MAE | 0.0028 | 0.0051 | | | |
NO3--N | RMSE | 0.0597 | 0.1128 | | | |
NO3--N | Corr | 0.999988 | 0.999966 | | | |
PO43--P | MRE | 0.0069 | 0.0062 | 0.0102 | 0.0054 | 0.0218 |
PO43--P | MAE | 0.0005 | 0.0005 | 0.0004 | 0.0005 | 0.0001 |
PO43--P | RMSE | 0.0141 | 0.0105 | 0.0087 | 0.0157 | 0.0015 |
PO43--P | Corr | 0.999890 | 0.999960 | 0.999943 | 0.999880 | 0.999943 |
All 17 models achieve prediction performance close to or better than that reported in comparable studies, which demonstrates the feasibility of the model for all four water quality indicators. Model training and prediction are also fast because the model logic is built directly on the training data. On a computer with an Intel(R) Core(TM) i5-8300H CPU @ 2.30 GHz, the training time ranged from 4 to 10 min, and the prediction time for a single group of data ranged from 10 to 20 ms.
Practical application effectiveness and analysis
To obtain and analyze data in real-time, and to validate the model's effectiveness, the model obtained by the BagR algorithm was applied to the testing of water quality indicators in actual water samples. Three sampling points in the Luodong area, Qingduntou area and Xiejiatou area were analyzed, and the prediction data of each sampling point were obtained. The above data were collated into 12 images showing the time series projections for each of the four indicators, as detailed in Supplementary material, Appendix 6.
Site | Excellent/Grade Ⅰ | Good/Grade Ⅱ | Medium/Grade Ⅲ | Poor/Grade Ⅳ | Bad/Grade Ⅴ | Exceeding |
---|---|---|---|---|---|---|
Luodong | 0 | 740 | 2,746 | 341 | 284 | 26 |
Qingduntou | 0 | 1,929 | 1,628 | 466 | 72 | 42 |
Xiejiatou | 0 | 1,394 | 192 | 242 | 2,222 | 87 |
The average comprehensive water quality index of Luodong during this period was 50.55, which is at a medium level. Qingduntou had an average comprehensive water quality index of 44.17, which is at a medium level. The average comprehensive water quality index in the Xiejiatou area is 65.56, which is at a poor level, and 87 of the measured values are exceeding.
In terms of grades, the water quality at most of the monitoring time points in Luodong and Qingduntou during the study period was Grade Ⅱ or Ⅲ, which suggests that these facilities were operating relatively well. In contrast, most of the monitoring time points in Xiejiatou were Grade Ⅱ or Ⅴ, indicating that the overall treatment performance of this facility (for effluent only) was inferior to that of the Luodong and Qingduntou facilities. Despite the mixed results at the three sites and the use of adjectives such as ‘poor’ and ‘bad’ in the grade descriptions, monitoring results in Grades Ⅰ to Ⅴ met the baseline requirements for wastewater treatment facilities, and the water quality was in the normal range. The number of ‘exceeding’ results at all three facilities is noteworthy: it shows that all three still require some degree of optimization, and the number of exceedances reflects the scale of the problem.
Figure 5 shows the variation in water quality at the three monitoring points. The poor water quality data are concentrated in particular time periods. The monitoring results from the Xiejiatou facility show that, although its average comprehensive pollution index and its numbers of substandard and exceeding data are the highest of the three, most of the severe pollution is concentrated in the early part of the monitoring period, and in the middle of the period its pollution levels are similar to those at the other two monitoring locations. This reflects the volatility of the monitoring data and, indirectly, the instability of rural domestic wastewater and the need for monitoring and management of treatment facilities. On the other hand, more data over a longer time scale need to be monitored and analyzed to provide a more meaningful reference for optimizing operation and regulation.
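The grade distribution at each site can be checked directly from the counts in the table above. The following sketch simply converts those counts to shares; the variable and function names are hypothetical.

```python
# Counts per grade, copied from the site table above, in the order:
# Grade Ⅰ, Grade Ⅱ, Grade Ⅲ, Grade Ⅳ, Grade Ⅴ, Exceeding.
GRADE_COUNTS = {
    "Luodong": [0, 740, 2746, 341, 284, 26],
    "Qingduntou": [0, 1929, 1628, 466, 72, 42],
    "Xiejiatou": [0, 1394, 192, 242, 2222, 87],
}


def grade_shares(counts):
    """Convert grade counts to fractions of all records at a site."""
    total = sum(counts)
    return [c / total for c in counts]


shares = {site: grade_shares(counts) for site, counts in GRADE_COUNTS.items()}
totals = {site: sum(counts) for site, counts in GRADE_COUNTS.items()}

for site, fractions in shares.items():
    # Print the share of each grade as a percentage.
    print(site, [f"{100 * f:.1f}%" for f in fractions])
```

Each site has the same number of records, so the shares are directly comparable across the three facilities.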
DISCUSSIONS
Effect of soft sensors based on BagR algorithm
In this study, we compared the results of machine learning algorithms with those of traditional linear regression models. The predictions of the linear regression models were close to the actual data in terms of trends, but many of the single-point predictions were biased, suggesting limitations of linear regression models compared with machine learning models, which agrees with the results of other researchers (Viviano et al. 2014; Wang et al. 2015; Ha et al. 2020; Pattanayak et al. 2020; Pattnaik et al. 2021). However, Schilling's research on monitoring points on two rivers in Iowa, the United States, showed that the accuracy of the MLR model in predicting TP improved significantly after orthophosphate (OP) was added as an input variable (Schilling et al. 2017), suggesting that traditional MLR models can still perform very well once the inputs have been screened by appropriate primary analysis and other methods. The MLR model is still useful in certain scenarios due to its lower computation demand and faster training (Zhu & Anderson 2019; Pattanayak et al. 2020). These results also show that the optimal model is not necessarily the same across different scenarios. Since most machine learning models can achieve relatively high prediction accuracy after parameter optimization, the characteristics of the training data are a significant factor in model selection. Similarly, the results of this study can only suggest that the BagR algorithm is more suitable for soft sensors at rural wastewater treatment facilities in the Wujingang Watershed, Changzhou City, Jiangsu Province, China.
In this paper, we use the BagR machine learning model and combine it with management requirements to build an effective soft sensor. Along with good prediction performance, the machine learning model achieves a relative balance among computation, prediction time, and interpretability. Our BagR model takes four physical indicators as inputs, which makes full use of the existing monitoring equipment to reduce costs while avoiding the overfitting that too many input variables can cause. Compared with the models from the earlier studies listed in Table 5, the prediction performance of the soft sensor designed in this study is stable or significantly improved.
Authors | Algorithms/Models | Input variables | Output variables | Sources/Features of data | Effect of models | Year of publication | References |
---|---|---|---|---|---|---|---|
Spérandio et al. | ASM | DO + ORP | NH4+-N/NO3--N | Lab + software simulation | Ammonia relative error = 3.94% | 2004 | Sperandio & Queinnec (2004) |
Sheng-guang et al. | Combined model of mechanism and data-driven | Soluble refractory organics, soluble degradable organics, soluble oxygen, heterotrophic bacteria | Effluent COD | | Average error = 0.0326; standard deviation = 0.3933; R2 = 0.9685 | 2008 | Sheng-guang et al. (2008) |
Li and Yang | Radial Basis Function (RBF) neural network + genetic algorithm + gradient descent method | COD + TN + DO + T + HRT | TN | WWTP of Dingshu, Yixing | Relative errors = 1.65%–3.14% | 2009.7 | Li & Yang (2009) |
Liping and Boshnakov | BP neural network | | Effluent COD/BOD | | Relative errors = 7.5%–12% | 2010 | Liping & Boshnakov (2010) |
Mulas et al. | Ordinary least squares regression (OLSR), partial least squares regression (PLSR), local linear regression based on k-nearest neighbors (k-NN LLR) | 6 variables after PCA | -N | Viikinmäki WWTP | RMSE = 0.14–0.46 | 2012 | Mulas et al. (2012) |
Liu et al. | PCA + JIT (just-in-time learning)-ENS (ensemble learning) | 19 variables after PCA | Effluent BOD5 | Barcelona WWTP | RMSE = 0.3825; r = 0.8991 | 2013.4 | Liu et al. (2013) |
Guo et al. | PSO + Elman neural network | -N, pH, T, MLSS | SVI | A WWTP in Beijing | RMSE = 0.0509–0.1039 | 2014.6 | Guo et al. (2014) |
Cong et al. | Combined model of mechanism and data-driven | SS, -N, Q, CODinf, DO | Effluent COD | A WWTP in Shenyang | RMSE = 8.31 | 2015 | Cong et al. (2015) |
Mali and Laskar | Deep learning-based soft sensor (DLSS) | DO | TN | BSM2 | MSE = 0.072–0.0825; r = 0.9852–0.9869 | 2020 | Mali & Laskar (2020) |
Wu et al. | Lasso regression + time difference-based multi-kernel relevance vector machine (MRVM) | 20 variables for analysis | BOD | BSM1 model + a real WWTP | RMSE = 7.0301; r = 0.9580 | 2020.5 | Wu et al. (2020) |
Schneider et al. | Feature detection algorithms | pH/DO | -N | Lab + 3 real WWTPs | Accuracy = 68%–94% | 2019.6, 2020.7 | Schneider et al. (2019); Schneider et al. (2020) |
Li et al. | Gaussian process regression (GPR) and least squares support vector machine (LSSVM) algorithm, Kalman filter (KF) and moving window function (MW) | Historical data | SS, -N, -N, COD, BOD | BSM1 model + a real WWTP | RMSE = 0.013–89.654; R = 0.796–0.957; RMSSD = 2.984–11.135; RR = 0.697–0.745 | 2021.2 | Li et al. (2021) |
Ching et al. | XGBoost | Other indices of the same side (influent or effluent) | Influent and effluent BOD | UCI Machine Learning Repository, a WWTP in Hong Kong (the slope of the data is high and contains extreme values) | RMSE = 0.92–62.10 | 2022.2 | Ching et al. (2022) |
Fox et al. | Neural network (NN), Multiple Linear Regression (MLR) | pH + ORP | Effluent -N | Lab | R2 = 0.465–0.769; RMSE = 0.196–0.5 | 2022.3 | Fox et al. (2022) |
Alvi et al. | Gated Recurrent Neural Network units (GRUs) + Convolution Neural Network (CNN) | pH + DO + turbidity + TSS + ORP | -N | Luggage Point sewage treatment plant in Pinkenba, Queensland | RMSE = 0.04909 ± 0.0106; MAE = 0.01655 ± 0.0022; R2 = 0.9305 ± 0.0318 | 2022.5 | Alvi et al. (2022) |
Authors . | Algorithms/Models . | Input variables . | Output variables . | Sources/Features of Data . | Effect of models . | Year of publiction . | References . |
---|---|---|---|---|---|---|---|
Spérandio et al. | ASM | DO + ORP | -N/-N | Lab + Software Simulation | Ammonia relative error = 3.94% | 2004 | Sperandio & Queinnec (2004) |
Sheng-guang et al. | Combined model of mechanism and data-driven | Soluble refractory organics, Soluble degradable organics, Soluble Oxygen, Heterotrophic bacteria | Effluent COD | average error = 0.0326 standard deviation = 0.3933 R2 = 0.9685 | 2008 | Sheng-guang et al. (2008) | |
Li and Yang et al. | Radial Basis Function (RBF) neural network + genetic algorithm + Gradient descent method | COD + TN + DO + T + HRT | TN | WWTP of Dingshu, Yixing | relative errors = 1.65%-3.14% | 2009.7 | Li & Yang (2009) |
Liping and Boshnakov et al. | BP neural network | Effluent COD/BOD | relative errors = 7.5%-12% | 2010 | Liping & Boshnakov (2010) | ||
Mulas et al. | ordinary least squares regression (OLSR), partial least squares regression (PLSR), local linear regression based on k-nearest neighbors (k-NN LLR) | 6 variables after PCA | -N | Viikinmäki WWTP | RMSE = 0.14–0.46 | 2012 | Mulas et al. (2012) |
Liu et al. | PCA + JIT (just-in-time learning)-ENS (ensemble learning) | 19 variables after PCA | Effluent BOD5 | Barcelona WWTP | RMSE = 0.3825 r = 0.8991 | 2013.4 | Liu et al. (2013) |
Guo et al. | PSO + Elman neural network | -N, pH, T, MLSS | SVI | A WWTP in Beijing | RMSE = 0.0509–0.1039 | 2014.6 | Guo et al. (2014) |
Cong et al. | Combined model of mechanism and data-driven | SS, -N, Q, CODinf, DO | Effluent COD | A WWTP in Shenyang | RMSE = 8.31 | 2015 | Cong et al. (2015) |
Mali and Laskar et al. | deep learning-based soft sensor (DLSS) | DO | TN | BSM2 | mean squared errors MSE = 0.072–0.0825 r = 0.9852–0.9869 | 2020 | Mali & Laskar (2020) |
Wu et al. | Lasso Regression + Time Difference-based Multi-kernel Relevance Vector Machine (MRVM) | 20 variables for analysis | BOD | BSM1 model + a real WWTP | RMSE = 7.0301 r = 0.9580 | 2020.5 | Wu et al. (2020) |
Schneider et al. | feature detection algorithms | pH/DO | -N | Lab + 3 real WWTP | accuracy = 68%-94% | 2019.6 2020.7 | Schneider et al. (2019); Schneider et al. (2020) |
Li et al. | Gaussian process regression (GPR) and least squares support vector machine (LSSVM) algorithm, Kalman filter (KF) and moving window function (MW) | Historical data | SS, -N, -N, COD, BOD | BSM1 model + a real WWTP | RMSE = 0.013–89.654 R = 0.796–0.957 RMSSD = 2.984–11.135 RR = 0.697–0.745 | 2021.2 | Li et al. (2021) |
Ching et al. | Xgboost | Other index of the same side (influent or effluent) | Influent and effluent BOD | UCI Machine Learning Repository, a WWTP in Hong Kong (The slope of the data is high and contains extreme values) | RMSE = 0.92–62.10 | 2022.2 | Ching et al. (2022) |
Fox et al. | neural network (NN), Multiple Linear Regression (MLR) | pH + ORP | Effluent -N | Lab | R2 = 0.465–0.769 RMSE = 0.196–0.5 | 2022.3 | Fox et al. (2022) |
Alvi et al. | Gated Recurrent Neural Network units (GRUs) + Convolution Neural Network (CNN) | pH + DO + turbidity + TSS + ORP | -N | Luggage Point sewage treatment plant in Pinkenba Queensland | RMSE = 0.04909 ± 0.0106 MAE = 0.01655 ± 0.0022 R2 = 0.9305 ± 0.0318 | 2022.5 | Alvi et al. (2022) |
In validation experiments covering several water quality indicators and a range of contexts, the soft sensor based on the BagR algorithm achieved good predictive performance. The RMSE is below 1 in most cases, and the mean relative error (MRE) is below 1% in nearly all scenarios and below 0.5% in the majority, which is better than the predictions of single-input models (Schneider et al. 2020). Our algorithm is considerably less interpretable than the ASM-based mechanistic model (Sperandio & Queinnec 2004), but its prediction accuracy is comparable, and data-driven models can learn and adapt quickly in complex and variable scenarios. In rural wastewater treatment facilities, the complex variation in water quality and quantity, together with the need to forecast several parameters, makes accurate computation with a mechanistic model difficult. The data-driven model is therefore more feasible in practice.
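The two accuracy measures quoted above can be written out as a minimal sketch. It assumes the usual definitions (RMSE as the root of the mean squared residual, MRE as the mean of absolute residuals divided by the true value); the exact formulas used in this study are not restated in this section, and the readings below are hypothetical values chosen only for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def mean_relative_error(y_true, y_pred):
    """Mean of |prediction - truth| / |truth|, as a fraction (0.01 means 1%)."""
    n = len(y_true)
    return sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)) / n

# Hypothetical COD readings in mg/L (not data from this study)
measured = [50.0, 80.0, 100.0, 200.0]
predicted = [50.5, 79.6, 100.0, 201.0]

print("RMSE:", rmse(measured, predicted))
print("MRE :", mean_relative_error(measured, predicted))
```

Because RMSE carries the unit of the indicator while MRE is dimensionless, RMSE values are only comparable across studies that measure the same indicator over a similar concentration range.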
Compared with other data-driven soft sensors, the soft sensor designed in this study matches or surpasses some studies in absolute accuracy, but does not reach the highest level among studies of the same type. When RMSE is used as the measure of model accuracy, our sensor performs better than, or comparably to, the best cases reported by Liu et al., Cong et al., Wu et al., Li et al., Ching et al. and Fox et al. (Liu et al. 2013; Cong et al. 2015; Wu et al. 2020; Li et al. 2021; Ching et al. 2022; Fox et al. 2022). Compared with the studies in Table 5 that use correlation coefficients or coefficients of determination as the main measure of accuracy (Sheng-guang et al. 2008; Liu et al. 2013; Wu et al. 2020; Li et al. 2021), our sensors generally show better prediction performance. These results cannot be taken to mean that our sensor surpasses previous studies, because the datasets and measured indicators differ; nevertheless, the higher accuracy values demonstrate the applicability and accuracy of the sensor under certain conditions. On the other hand, many of the more algorithm-focused studies have provided soft sensor models with higher accuracy or robustness than the present study (Mulas et al. 2012; Guo et al. 2014; Mali & Laskar 2020; Zhu et al. 2020; Alvi et al. 2022). Complex models that combine mechanistic and data-driven methods can also further improve prediction accuracy. However, more accurate and complex models entail more computation, longer training and prediction times, or higher computational costs. For the practical scenario of water quality testing in rural wastewater treatment facilities, a balance should be sought between accuracy and cost; the highest predictive accuracy is not required.
The BagR model in this study uses a moderate amount of data and a moderate number of parameters, and the algorithm itself has low complexity and low computational volume. As a result, the soft sensor can be trained in 10 min or less and produces a prediction in at most 20 ms, which greatly improves timeliness. Furthermore, the fuzzy grading method lowers the accuracy required of the monitoring data to a level that the soft sensor can meet. Application under actual conditions also shows that the current accuracy of our sensors meets the requirements for use (this is discussed in detail later). Therefore, although its accuracy is not the highest, the sensor balances computational cost against accuracy and is well suited to real application scenarios.
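The low algorithmic complexity mentioned here follows from how bagging works: several simple base learners are each fitted to a bootstrap resample of the training data, and their predictions are averaged. The sketch below illustrates that idea using only the Python standard library. It is not the implementation used in this study; the one-split regression stump, the single input feature, the function names, and the toy data are all assumptions made for illustration.

```python
import random
import statistics

def fit_stump(xs, ys):
    """Fit a one-split regression stump on a single feature by least squares."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean, right_mean = statistics.fmean(left), statistics.fmean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All feature values are identical: fall back to the mean target.
        mean = statistics.fmean(ys)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_bagging(xs, ys, n_estimators=25, seed=0):
    """Train stumps on bootstrap resamples and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: statistics.fmean([m(x) for m in models])

# Toy data (hypothetical): a step change in the target at x = 10
xs = [float(x) for x in range(20)]
ys = [0.0 if x < 10 else 10.0 for x in xs]

predict = fit_bagging(xs, ys)
print(predict(2.0), predict(17.0))
```

Prediction costs one pass over the fitted base learners, which is why an ensemble of this kind can respond within milliseconds once trained.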
Recommendations for the preparation of datasets
The training dataset in this study was generated with a small laboratory test device, a method that can quickly produce matching datasets. By contrast, collecting and labelling the relevant data during the operation of actual sewage plants is relatively costly in both time and money. Paepae et al. (2021) reported that many studies have used large amounts of time series data to train soft sensors. In most cases, the time series spanned months to years, and some covered up to 49 years of data (Sepahvand et al. 2021), with sampling frequencies as high as once every 10 min (Wang et al. 2015). Although some of these studies used very large datasets covering very long periods, many of them directly used data routinely recorded during the operation of wastewater treatment plants, which were not calibrated in the field. Where accurate operational monitoring data are available for the scenario the soft sensor is designed for, those data are indeed the most appropriate basis for a training dataset. However, in the application scenario of the present study, the rural sewage treatment facilities in Wujin port, Changzhou, the large number of small and scattered facilities and the downstream channels involved have not previously been regulated or monitored, and data from the real scenario are completely missing. In such a scenario, labelling field data would require significant time and measurement costs. Moreover, the indicators to be measured, COD, NH4+-N, NO3--N and PO43--P, have to be determined manually, which is a heavy workload and makes it difficult to guarantee the frequency of data collection.
In this study, we trained the model on data generated by a small laboratory test device, and ensured the reliability of the soft sensor data by cross-checking them against operating data that had already been collected and centralized from surrounding sewage plants, which avoided additional field measurement. The soft sensor is effective in predicting the SBR influent water quality at the Changzhou sewage plant, and has been used successfully in Luodong, Qingduntou and Xiejiatou. This approach also shortens the development cycle to some extent.
Necessity and feasibility of integration with management
Since soft sensing relies on data-driven black-box models, errors and uncertainties in the estimated data are unavoidable. The soft sensors designed here use the basic BagR algorithm without additional data processing or model optimization algorithms, and the accuracy achieved is not the highest of its kind. However, the fuzzy grading method used in the management process effectively lowers the accuracy required of the model and allows our model to strike a balance between efficiency and accuracy. The soft sensor errors in this study were reduced through three channels. Firstly, in the calculation of the composite water quality index, the errors in the estimates of the four soft sensors do not always appear in the result, because only the maximum of the five individual indicators is used. Secondly, there is the contribution of the fuzzy grading method. The fuzzy grading method used in the application case of this study divides the composite water quality index into six grades, each of which is wide. Our soft sensor test results showed that the MRE was less than 1% in most scenarios and less than 6% in all scenarios, so, as an approximation, only errors of about twice the MRE could change the assigned grade. Thirdly, the results are reported as overall statistics. In practical applications, management requires the number of samples falling into each composite pollution index grade within a certain amount of data (i.e. over time). Thus, even if an individual composite pollution index is misestimated by the soft sensor and its grade is altered, the effect is diluted in the quantitative analysis of a large amount of data. By reducing the impact of estimation error through these three management methods and data requirements, we designed a soft sensor with a simple model that can be applied in practice.
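The argument that grading absorbs small estimation errors can be illustrated with a minimal sketch. The grade boundaries below are hypothetical placeholders, since the thresholds used in this study are not restated in this section, and the composite index is read here as the largest of the individual indicator indices, which is one interpretation of the text rather than a confirmed formula. All function names are illustrative.

```python
from bisect import bisect_right

# Hypothetical boundaries separating six grades of the composite index;
# the thresholds used in the study are not reproduced here.
GRADE_BOUNDS = [1.0, 2.0, 3.0, 4.0, 5.0]

def grade(index):
    """Return a grade from 1 (best) to 6 (worst) for a composite index."""
    return bisect_right(GRADE_BOUNDS, index) + 1

def composite_index(indicator_indices):
    """Assumed composite: the largest (worst) individual indicator index."""
    return max(indicator_indices)

def grade_change_rate(true_indices, rel_error):
    """Fraction of samples whose grade changes under a +rel_error overestimate."""
    changed = sum(grade(v) != grade(v * (1 + rel_error)) for v in true_indices)
    return changed / len(true_indices)

# An error in an indicator that is not the maximum leaves the composite unchanged.
print(composite_index([0.4, 2.3, 1.1]), composite_index([0.4 * 1.05, 2.3, 1.1]))

# Evenly spaced hypothetical index values from 0.05 to 5.99
grid = [k / 100 for k in range(5, 600)]
print("share of grades changed by a 1% error:", grade_change_rate(grid, 0.01))
print("share of grades changed by a 6% error:", grade_change_rate(grid, 0.06))
```

Only values lying just below a boundary change grade, so the share of affected samples shrinks as the grades get wider relative to the estimation error, and it shrinks further in effect when the result is reported as counts per grade over many samples.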
In scenarios that require specific and accurate data, researchers will continue to seek more accurate machine learning algorithms and more efficient prediction models. However, not all application scenarios require the highest model accuracy. Where conditions permit, adopting an appropriate management strategy and designing a matching management system can relax the requirement for data accuracy, which would make data-driven soft sensing more widely applicable and allow it to replace traditional methods to some extent.
CONCLUSIONS
This study takes the development of soft sensors as the starting point for solving the problem of real-time monitoring and management of rural sewage treatment facilities, and applies the research results to the management of actual facilities. The following conclusions are drawn:
- (1) A laboratory pilot device designed based on the characteristics of actual scenarios can generate datasets that meet the requirements of soft sensor training.
- (2) In the design of soft sensors, it is recommended to use the BagR model, which combines relatively fast simulation speed and relatively high accuracy, and can match actual management needs.
- (3) The method of fuzzy classification can effectively reduce the error of prediction results. Adjusting management strategies based on practical application needs can reduce the difficulty of developing soft sensors and improve their practicality.
- (4) Using the soft sensors designed in this study, it was found that the water quality in Luodong and Qingdun was at a moderate level, while the water quality in Xiejiatou was at an inferior level in the actual measurement of three locations in the Wujin Port basin. The results demonstrate the feasibility of soft sensors in practical applications and provide data reference for local water quality supervision authorities.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.