Dissolved oxygen (DO) is an important indicator of the health of aquatic ecosystems. The balance between oxygen supply and consumption in a water body is significantly influenced by physical and chemical parameters. This study aimed to evaluate and compare the performance of multiple linear regression (MLR), back propagation neural network (BPNN), and support vector machine (SVM) models for the prediction of DO concentration based on multiple water quality parameters. The data set included 969 samples collected from rivers in China, and the 16 predictor variables covered physical factors, nutrients, organic substances, and metal ions, which affect DO concentrations directly or indirectly by influencing water–air exchange, the growth of water plants, and the lives of aquatic animals. The models optimized by the particle swarm optimization (PSO) algorithm were calibrated and tested with nearly 80% and 20% of the data, respectively. The results showed that the PSO-BPNN and PSO-SVM had better predictive performance than the linear regression method. All of the evaluation criteria, including the coefficient of determination, mean squared error, and absolute relative error, suggested that the PSO-SVM model was superior to the MLR and PSO-BPNN for DO prediction in the rivers of China when little other information is available.
INTRODUCTION
Water quality monitoring always involves a series of physical, chemical, and biological parameters (Collins et al. 2012; Cao et al. 2016). Identifying the complex relationships among such a large number of variables is difficult (Carlyle & Hill 2001; Bonansea et al. 2015). Among the multiple physical and chemical parameters, the concentration of dissolved oxygen (DO) is an important parameter affecting the healthy functioning of aquatic species and an integrated indicator reflecting the state of aquatic ecosystems (Ficklin et al. 2013). The sources of DO mainly include re-aeration from the atmosphere and photosynthetic oxygen production, while the consumption processes include the oxidation of carbonaceous and nitrogenous material, respiration of aquatic animals and plants, and sediment oxygen demand (Kuo et al. 2007; Klose et al. 2012). In the momentary balance between oxygen supply and metabolic consumption, the concentration of DO is known to be influenced directly by many biological processes, such as respiration, photosynthesis, and decomposition (Kannel et al. 2007). Although biological processes directly influence DO, other physical and chemical parameters can control and limit the effects of these biological processes to some extent (Rounds 2002; Salami Shahid & Ehteshami 2016). For example, nitrogen and phosphorus affect the growth of algae and plants, while metal ions are important to the lives of animals and bacteria in the aquatic environment (Li et al. 2015; Surinaidu 2016). Because DO is an integrated indicator, deterministic models of it rest on many parameters of complex biological processes that are difficult to obtain. For water resources managers, it is therefore highly desirable to have a DO model for rivers that can quantify and predict DO concentrations accurately from physico-chemical parameters alone.
Several models, generally grouped as deterministic models and statistical models, have been developed for the analysis of DO (Cox 2003). Most physically based models have complex structures and require many types of input data that are not easily accessible, which makes the modeling process very costly and time-consuming (Stefan & Fang 1994; Sear et al. 2014). In addition, such models demand of researchers an explicit understanding of a series of physical processes, as well as considerable expertise and experience with the models. Data-driven models, which need no information on the physical, chemical, and biological reaction processes, are therefore very useful for DO prediction in rivers. Many different methods, such as multiple linear regression (MLR), partial least squares regression, various types of artificial neural networks (ANNs), genetic algorithms (GAs), and support vector machines (SVMs), have been developed and applied widely in recent years (Modaresi & Araghinejad 2014; Were et al. 2015; Isunju & Kemp 2016). ANNs and SVMs in particular, which offer advantages over conventional non-linear models in handling complex relationships between input and output variables, have been used successfully in various water resources problems (Hosseini & Mahjouri 2014), including the modeling and forecasting of DO concentrations (Liu et al. 2013; Wen et al. 2013). However, it is still not clear which method performs best in predicting DO concentrations in rivers from other physico-chemical parameters.
In this study, MLR, ANN, and SVM models were applied to forecast DO concentrations in the rivers of China. The results of the three models were compared with each other using several statistical evaluation measures. The aim of this study was to evaluate the performance of the three data-driven models and to identify the best one for predicting DO concentration from other water quality parameters.
MATERIALS AND METHODS
Data sets
In this study, three models with different structures were designed to predict DO concentrations based on multiple water quality parameters. To achieve this objective, a data set covering 600 monitoring sites distributed on nearly all the main streams and chief tributaries in China from 2009 to 2010 was obtained from national environmental agencies. Monitoring sites with missing data were excluded, and 969 records with 21 parameters were selected as the initial data set. The parameters included DO, water temperature (TEMP), pH, potassium permanganate index (CODMn), biochemical oxygen demand (BOD), ammonia nitrogen (NH3-N), total nitrogen (TN), total phosphorus (TP), petroleum (PE), volatile phenol (VP), chemical oxygen demand (CODCr), mercury (Hg), lead (Pb), copper (Cu), fluoride (F), zinc (Zn), arsenic (As), cadmium (Cd), hexavalent chromium (Cr), cyanide (Cyn), and anionic surfactant (LAS). To reduce the large number of predictors and select the most effective variables, Spearman correlation analysis was used to evaluate the degree of association between DO and the other parameters, as shown in Table 1. According to the results of the correlation analysis, significant (P < 0.001) correlations were observed between DO and most of the water quality parameters, the exceptions being Pb, Cu, Cd, and Cr. Some correlation coefficients were relatively small, indicating weak linear relationships. However, the statistical significance of these correlations showed that the variables were meaningfully associated with DO, possibly through non-linear relationships. Sixteen parameters were finally selected as input data, and the basic statistics of these measured water quality parameters are summarized in Table 2.
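The screening step described above can be illustrated with a minimal pure-Python sketch of Spearman's rank correlation. The study itself was implemented in MATLAB, so the function names `average_ranks` and `spearman` are hypothetical, and the toy values are invented for illustration rather than taken from the data set. The sketch computes only the coefficient, not its significance level.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 0-based positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)  # undefined (division by zero) for a constant column

# Toy values (hypothetical, not from the study): DO falls as NH3-N rises.
do_values = [8.1, 7.4, 6.9, 5.2, 3.0]
nh3n_values = [0.1, 0.3, 0.8, 2.5, 9.0]
rho = spearman(do_values, nh3n_values)
```

Because the coefficient depends only on ranks, a strictly decreasing but strongly non-linear relationship still yields a value close to −1, which is one reason a rank correlation suits a screening step where non-linear associations are expected.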
The predictor variables covered physical factors, nutrients, organic substances, and metal ions, which affect DO concentrations directly or indirectly by influencing water–air exchange, the growth of water plants, the lives of aquatic animals, and so on. The data set of 969 samples was randomly split into 769 training samples and 200 testing samples, nearly 80% and 20% of the whole data set. The testing set was used to evaluate the calibrated models. The raw data of both the training set and the testing set were scaled to the range 0.1 to 0.9 before analysis, to remove the effect of differing units and magnitudes and to give the variables the same or similar importance.
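The split and scaling described above can be sketched as follows. This is an illustrative Python sketch, not the authors' MATLAB code: the helper names `scale_01_09` and `random_split` and the fixed seed are assumptions, and the text does not state whether the minimum and maximum were taken from the whole data set or from the training set only.

```python
import random

def scale_01_09(column, lo=0.1, hi=0.9):
    """Linearly map a list of numbers onto [lo, hi] (min-max scaling)."""
    cmin, cmax = min(column), max(column)
    if cmax == cmin:
        # A constant column carries no information; place it mid-range.
        return [(lo + hi) / 2 for _ in column]
    return [lo + (hi - lo) * (v - cmin) / (cmax - cmin) for v in column]

def random_split(n_samples, n_test, seed=0):
    """Return (train_indices, test_indices) from a seeded random shuffle."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return idx[n_test:], idx[:n_test]

# 969 samples split into 769 for training and 200 for testing, as in the text.
train_idx, test_idx = random_split(969, 200)

# Minimum, mean, and maximum DO from Table 2, used only to show the mapping.
scaled = scale_01_09([0.10, 7.13, 14.92])
```

With this mapping the smallest observed value becomes 0.1 and the largest 0.9, so every variable occupies the same range regardless of its original unit.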
Table 1. Spearman correlation coefficients between DO and the other water quality parameters

Parameter | Correlation coefficient | Sig. (2-tailed) | Parameter | Correlation coefficient | Sig. (2-tailed) |
---|---|---|---|---|---|
TEMP (°C) | −0.31** | <0.01 | TN (mg/L) | −0.52** | <0.01 |
pH | 0.40** | <0.01 | TP (mg/L) | −0.55** | <0.01 |
CODMn (mg/L) | −0.50** | <0.01 | Cu (mg/L) | −0.01 | 0.81 |
BOD (mg/L) | −0.55** | <0.01 | F (mg/L) | −0.39** | <0.01 |
NH3-N (mg/L) | −0.65** | <0.01 | Zn (mg/L) | −0.22** | <0.01 |
PE (mg/L) | −0.39** | <0.01 | As (mg/L) | −0.17** | <0.01 |
VP (mg/L) | −0.37** | <0.01 | Cd (mg/L) | 0 | 0.99 |
Hg (mg/L) | −0.13** | <0.01 | Cr (mg/L) | −0.04 | 0.22 |
Pb (mg/L) | −0.002 | 0.95 | Cyn (mg/L) | −0.20** | <0.01 |
CODCr (mg/L) | −0.48** | <0.01 | LAS (mg/L) | −0.45** | <0.01 |
**Correlation is significant at the 0.01 level (2-tailed).
Table 2. Basic statistics of the measured water quality parameters

Parameter | Unit | Minimum value | Mean value | Maximum value | Standard deviation |
---|---|---|---|---|---|
DO | mg/L | 0.10 | 7.13 | 14.92 | 1.86 |
TEMP | °C | 2.00 | 16.74 | 30.93 | 3.79 |
pH | dimensionless | 6.76 | 7.72 | 8.96 | 0.36 |
CODMn | mg/L | 0.68 | 5.28 | 52.99 | 5.15 |
BOD | mg/L | 0.42 | 4.36 | 98.35 | 7.50 |
NH3-N | mg/L | 0.02 | 1.92 | 37.10 | 4.58 |
PE | mg/L | 0.01 | 0.07 | 1.39 | 0.12 |
VP | mg/L | 0.0002 | 0.003 | 0.11 | 0.008 |
Hg | mg/L | 0 | 0.00003 | 0.001 | 0.00005 |
CODCr | mg/L | 1.65 | 21.80 | 258.42 | 22.29 |
TN | mg/L | 0.03 | 4.27 | 54.72 | 6.41 |
TP | mg/L | 0.01 | 0.24 | 4.80 | 0.45 |
F | mg/L | 0.01 | 0.54 | 7.40 | 0.47 |
Zn | mg/L | 0.0005 | 0.03 | 0.44 | 0.04 |
As | mg/L | 0.00003 | 0.003 | 0.08 | 0.006 |
Cyn | mg/L | 0.0005 | 0.003 | 0.07 | 0.004 |
LAS | mg/L | 0.01 | 0.09 | 2.88 | 0.19 |
SVM for regression
linear kernel: $K(x_i, x_j) = x_i^{T} x_j$
polynomial kernel: $K(x_i, x_j) = (\gamma x_i^{T} x_j + r)^{d}$, $\gamma > 0$
radial basis kernel (RBF): $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^{2})$, $\gamma > 0$
sigmoid kernel: $K(x_i, x_j) = \tanh(\gamma x_i^{T} x_j + r)$
Here, $\gamma$, $r$, and $d$ are kernel parameters. The performance of SVM for regression depends on a set of parameters: C, the kernel type, and the corresponding kernel parameters (Min & Lee 2005). In this paper, as the structure of the predictors was not accurately recognizable, all four types of kernel function were tried in the SVM model and the most appropriate one was selected based on the results. The penalty parameter C and the kernel function's parameters were determined by the particle swarm optimization (PSO) algorithm with cross-validation, as described below.
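As an illustration, the four kernel types named above can be written out in a few lines of Python. This is a sketch that assumes the kernel forms commonly used by LIBSVM, with `gamma`, `r`, and `d` as the kernel parameters; the function names are hypothetical and the study's own implementation was in MATLAB.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(xi, xj):
    return dot(xi, xj)

def polynomial_kernel(xi, xj, gamma, r, d):
    return (gamma * dot(xi, xj) + r) ** d

def rbf_kernel(xi, xj, gamma):
    # Depends only on the squared Euclidean distance between the two samples.
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(xi, xj, gamma, r):
    return math.tanh(gamma * dot(xi, xj) + r)

# Two toy samples with two scaled features each (values are illustrative only).
a, b = [0.2, 0.7], [0.4, 0.5]
k_rbf = rbf_kernel(a, b, gamma=1.0)
```

The RBF kernel equals 1 for identical samples and decays toward 0 as they move apart, with `gamma` controlling how quickly; this makes `gamma`, together with C, a natural target for the parameter search.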
ANNs
PSO algorithms
Leave-one-out cross-validation
Cross-validation is a popular statistical method for evaluating and comparing learning algorithms by dividing the data into two segments: one is used to train a model and the other to validate it (Bengio & Grandvalet 2004). The basic and most widely used form is k-fold cross-validation, in which the training data are randomly partitioned into k mutually exclusive subsets of approximately equal size. In each run, a different subset is held out for validation while the remaining k−1 subsets are used for model training. This process is repeated k times, and the estimated parameters and accuracy are derived by averaging over the runs (Diamantidis et al. 2000). The goal of cross-validation is to improve generalization ability by defining several independent data sets on which to test the model, thereby limiting over-fitting, especially when the training data set is small or the number of parameters in the model is large (Prechelt 1998). However, it is difficult to choose k for common k-fold cross-validation, as the results may have considerable bias (Kohavi 1995). The situation improves when k is increased to a much larger number. The most extreme form of k-fold cross-validation, in which k equals the number of training patterns, is known as leave-one-out cross-validation and has been shown to provide an ‘almost’ unbiased estimate of the true generalization ability of the model (Cawley & Talbot 2004). Although the key disadvantage of leave-one-out cross-validation is its high computational cost, this was overcome in the present study, and the method was combined with MLR, BPNN, and SVM, respectively.
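The procedure can be sketched in Python with a one-variable least-squares line standing in for the model. The helper names and the toy data are hypothetical. Folds here are contiguous blocks, so the data should be shuffled beforehand for general k; for leave-one-out (k equal to the number of samples) the order makes no difference.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k contiguous folds; k == n_samples is leave-one-out."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def cv_mse(xs, ys, k):
    """Mean squared validation error pooled over all held-out samples."""
    sq_errors = []
    for train, val in k_fold_indices(len(xs), k):
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_errors.extend((ys[i] - (a + b * xs[i])) ** 2 for i in val)
    return sum(sq_errors) / len(sq_errors)

# Toy data (hypothetical): a noisy straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
loo_mse = cv_mse(xs, ys, k=len(xs))  # leave-one-out: one fold per sample
```

Leave-one-out refits the model once per training sample, which is where the computational cost mentioned above comes from: with 769 training samples, every candidate parameter setting requires 769 fits.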
In this study, the data set of 969 samples was first divided into a training subset and a testing subset. The training subset of 769 samples was used to obtain the best parameters of the MLR, PSO-BPNN, and PSO-SVM models through leave-one-out cross-validation. The testing subset of 200 samples, randomly extracted from the whole data pool, was used to compare the predictive capacities of the different models. All calibration and prediction work was implemented in MATLAB R2013b. LIBSVM, a library for support vector machines (Chang & Lin 2011), was used to build the PSO-SVM model.
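The parameter search described above can be pictured with a generic global-best PSO. The sketch below is a textbook variant in Python, not the authors' MATLAB implementation: the inertia weight `w`, the acceleration constants `c1` and `c2`, the swarm size, and the iteration count are assumed values, and `toy_cv_error` is a hypothetical stand-in for a cross-validation error that would in practice come from training the model.

```python
import random

def pso_minimize(objective, bounds, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best particle swarm optimization over a box-bounded search space."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia plus attraction toward the personal and global bests.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)  # clamp to bounds
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def toy_cv_error(params):
    """Hypothetical error surface with its minimum at log C = 2, log gamma = -1."""
    log_c, log_gamma = params
    return (log_c - 2.0) ** 2 + (log_gamma + 1.0) ** 2

best, best_val = pso_minimize(toy_cv_error, bounds=[(-5.0, 5.0), (-5.0, 5.0)])
```

In a setting like the one described in the text, the objective would return the leave-one-out error of an SVM trained with the candidate C and kernel parameters. Searching over their logarithms, as assumed here, is a common choice because useful values often span several orders of magnitude.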
Models’ performance criteria
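The criteria reported in this paper are the coefficient of determination (R2), the mean squared error (MSE), and the absolute relative error. A minimal Python sketch of these measures follows, assuming the common textbook definitions; in particular, R2 is taken here as one minus the ratio of residual to total sum of squares, and the absolute relative error is expressed in percent, which may differ from the formulas the authors used. The function names and toy values are hypothetical.

```python
def mse(observed, predicted):
    """Mean squared error."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def r_squared(observed, predicted):
    """Assumed definition: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def absolute_relative_errors(observed, predicted):
    """Per-sample |observed - predicted| / |observed|, in percent."""
    return [abs(o - p) / abs(o) * 100.0 for o, p in zip(observed, predicted)]

def percent_below(errors, threshold=10.0):
    """Share of samples (in percent) whose error is below the threshold."""
    return 100.0 * sum(1 for e in errors if e < threshold) / len(errors)

# Toy observed and predicted DO values in mg/L (hypothetical, not from the study).
obs = [8.0, 6.0, 4.0, 2.0]
pred = [7.6, 6.3, 4.0, 3.0]
are = absolute_relative_errors(obs, pred)
summary = (r_squared(obs, pred), mse(obs, pred), percent_below(are))
```

In the toy data the lowest observed value has by far the largest relative error, because the same absolute miss is divided by a small observed concentration.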
RESULTS AND DISCUSSION
Criterion | Training: Linear | Training: Polynomial | Training: RBF | Training: Sigmoid | Testing: Linear | Testing: Polynomial | Testing: RBF | Testing: Sigmoid |
---|---|---|---|---|---|---|---|---|
R2 | 0.50 | 0.73 | 0.76 | 0.14 | 0.55 | 0.43 | 0.74 | 0.13 |
MSE | 1.56 | 1.67 | 1.64 | 3.56 | 1.74 | 2.20 | 1.62 | 3.16 |
Criterion | Training: MLR | Training: PSO-BPNN | Training: PSO-SVM | Testing: MLR | Testing: PSO-BPNN | Testing: PSO-SVM |
---|---|---|---|---|---|---|
R2 | 0.20 | 0.69 | 0.76 | 0.22 | 0.63 | 0.74 |
MSE | 1.68 | 1.65 | 1.64 | 1.93 | 1.80 | 1.62 |
Model | Min value | Mean value | Max value | Percent lower than 10 |
---|---|---|---|---|
MLR | 0.48 | 22.16 | 172.36 | 39.0% |
PSO-BPNN | 0.01 | 15.00 | 183.65 | 52.5% |
PSO-SVM | 0.02 | 13.00 | 147.36 | 59.0% |
CONCLUSION
In this study, BPNN and SVM models optimized by PSO, as well as an MLR model, were developed to predict DO concentration based on water quality parameters. To achieve this objective, 969 samples collected during 2009 and 2010 from rivers in China were selected as input data. The training process was mainly used to calibrate the parameters of the models. To avoid over-fitting and biased estimation, leave-one-out cross-validation was applied to the three models during the training process. About 20% of the input data were set aside to test the models, and the results of the testing period reflected their prediction and generalization capabilities. The statistical criteria obtained from the three models with their best-fit structures during both the training and testing periods showed that the PSO-BPNN and PSO-SVM had better predictive performance than the linear regression method, which suggested non-linear relationships between DO concentration and the water quality parameters.
The absolute relative errors calculated for the testing results showed that the PSO-SVM had the smallest mean error, while the MLR had the largest. The stronger error fluctuation of the PSO-BPNN indicated a less stable predictive capacity than that of the PSO-SVM, but the distribution of absolute relative errors was not smooth even for the PSO-SVM. The analysis of observed values against absolute relative errors showed that the models had better predictive capacities when the data were close to average values. The errors at points with extreme values were much larger, indicating inaccurate predictions by the data-driven models there. The samples with the larger predictive errors were similar for the three models and had the lowest observed DO concentrations. Overall, the PSO-SVM model was superior to the MLR and PSO-BPNN in DO prediction based on multiple physical and chemical parameters. It can be considered for predicting DO levels in the rivers of China when little other information is available.
However, improvement will be necessary in future research. For example, the prediction accuracy, especially for the extreme values, needs to be improved by combining other model parameters. The factors influencing DO concentration may differ among rivers in different regions, so the selection of suitable predictors would lead to more efficient models and more accurate results.
ACKNOWLEDGEMENTS
The Chinese Academy for Environmental Planning is acknowledged for providing the water quality data. This work was supported by the Tianjin Normal University Doctor Foundation (52XB1517), the innovation team training plan of the Tianjin Education Committee (TD12-5037), and the National Natural Science Foundation of China (No. 41372373). Comments and suggestions from the anonymous reviewers and the editor are greatly appreciated.