Dissolved oxygen (DO) is an important indicator reflecting the healthy state of aquatic ecosystems. The balance between oxygen supply and consumption in the water body is significantly influenced by physical and chemical parameters. This study aimed to evaluate and compare the performance of multiple linear regression (MLR), back propagation neural network (BPNN), and support vector machine (SVM) models for the prediction of DO concentration based on multiple water quality parameters. The data set included 969 samples collected from rivers in China, and the 16 predictor variables covered physical factors, nutrients, organic substances, and metal ions, which affect DO concentrations directly or indirectly by influencing the water–air exchange, the growth of water plants, and the lives of aquatic animals. The models, optimized by the particle swarm optimization (PSO) algorithm, were calibrated and tested with nearly 80% and 20% of the data, respectively. The results showed that the PSO-BPNN and PSO-SVM had better predictive performance than the linear regression method. All of the evaluation criteria, including the coefficient of determination, mean squared error, and absolute relative errors, suggested that the PSO-SVM model was superior to the MLR and PSO-BPNN for DO prediction in the rivers of China when little other information is available.

## INTRODUCTION

Water quality monitoring always involves a series of physical, chemical, and biological parameters (Collins *et al.* 2012; Cao *et al.* 2016), and uncovering the complex relationships among this large number of variables is a difficult problem (Carlyle & Hill 2001; Bonansea *et al.* 2015). Among the multiple physical and chemical parameters, the concentration of dissolved oxygen (DO) is an important parameter affecting the healthy functioning of aquatic species and an integrated indicator reflecting the state of aquatic ecosystems (Ficklin *et al.* 2013). The sources of DO mainly include re-aeration from the atmosphere and photosynthetic oxygen production, while the consumption processes include the oxidation of carbonaceous and nitrogenous material, respiration of aquatic animals and plants, as well as sediment oxygen demand (Kuo *et al.* 2007; Klose *et al.* 2012). Within the momentary balance between oxygen supply and metabolic consumption, the concentration of DO is known to be influenced directly by many biological processes, such as respiration, photosynthesis, and decomposition (Kannel *et al.* 2007). Although biological processes directly influence DO, other physical and chemical parameters can control and limit the effects of these biological processes to some extent (Rounds 2002; Salami Shahid & Ehteshami 2016). For example, nitrogen and phosphorus affect the growth of algae and plants, while metal ions are important to the lives of animals and bacteria in the aquatic environment (Li *et al.* 2015; Surinaidu 2016). Because DO is such an integrated indicator, deterministic models of it depend on many parameters of these complex biological processes that are difficult to measure. For water resources managers, it is therefore highly desirable to have a DO model for rivers that can quantify and predict DO concentrations accurately based only on physico-chemical parameters.

Several models, generally grouped into deterministic models and statistical models, have been developed for the analysis of DO (Cox 2003). Most physically based models have complex structures and require many types of input data that are not easily accessible, making modeling a very costly and time-consuming process (Stefan & Fang 1994; Sear *et al.* 2014). In addition, these models demand an explicit understanding of a series of physical processes, as well as a degree of expertise and experience from the researchers. Data-driven models, which require no information on the physical, chemical, and biological reaction processes, are therefore very useful for DO prediction in rivers. Many different methods, such as multiple linear regression (MLR), partial least squares regression, various types of artificial neural networks (ANNs), genetic algorithms (GAs), and support vector machines (SVMs), have been developed and applied widely in recent years (Modaresi & Araghinejad 2014; Were *et al.* 2015; Isunju & Kemp 2016). In particular the ANN and SVM, which offer advantages over conventional non-linear models in handling complex relationships between input and output variables, have been used successfully in various water resources problems (Hosseini & Mahjouri 2014), including the modeling and forecasting of DO concentrations (Liu *et al.* 2013; Wen *et al.* 2013). However, it is still not clear which method performs best for the prediction of DO concentrations in rivers based on other physico-chemical parameters.

In this study, the MLR, ANN, and SVM were applied to forecast DO concentrations in the rivers of China, and the results of the three models were compared based on various statistical evaluation measures. The aim of this study was to evaluate the performance of the three data-driven models and choose the best one for the prediction of DO concentration influenced by other water quality parameters.

## MATERIALS AND METHODS

### Data sets

In this study, three models with different structures were designed to predict DO concentrations based on multiple water quality parameters. To achieve this objective, a data set covering 600 monitoring sites distributed over nearly all the main streams and chief tributaries in China from 2009 to 2010 was obtained from national environmental agencies. Monitoring sites with missing data were not taken into account, and 969 records with 21 parameters were selected as the initial data set. The parameters included DO, water temperature (TEMP), pH values, potassium permanganate index (COD_{Mn}), biochemical oxygen demand (BOD), ammonia nitrogen (NH_{3}-N), total nitrogen (TN), total phosphorus (TP), petroleum (PE), volatile phenol (VP), chemical oxygen demand (COD_{Cr}), mercury (Hg), lead (Pb), copper (Cu), fluoride (F), zinc (Zn), arsenic (As), cadmium (Cd), hexavalent chrome (Cr), cyanide (Cyn), and anionic surfactant (LAS). In order to reduce the large number of predictors and select the most effective variables, Spearman correlation analysis was used to evaluate the degree of association between DO and the other parameters, as shown in Table 1. According to the results of the correlation analysis, significant (*P* < 0.001) correlations were observed between DO and most of the water quality parameters, except for Pb, Cu, Cd, and Cr. Although some correlation coefficients were relatively small, indicating weak linear relationships, the statistically significant correlations still suggested meaningful, possibly non-linear, associations between these variables. Sixteen parameters were finally selected as input data, and the basic statistics of these measured water quality parameters are summarized in Table 2.
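As an illustration of the screening step above, Spearman's coefficient is simply the Pearson correlation computed on ranks. The minimal sketch below (not the statistical package the authors used; `rank` and `spearman` are names chosen here) handles ties by assigning average ranks:

```python
def rank(values):
    """Return average ranks (1-based) of a sequence, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

Any monotone relationship, linear or not, yields a coefficient near ±1, which is why Spearman (rather than Pearson) correlation is a reasonable screen when non-linear associations are expected.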
The predictor variables covered physical factors, nutrients, organic substances, and metal ions, which affect DO concentrations directly or indirectly by influencing the water–air exchange, the growth of water plants, and the lives of aquatic animals. The data set of 969 samples was randomly split into a training set of 769 samples and a testing set of 200 samples, i.e., nearly 80% and 20% of the whole data set. The testing set was used to evaluate the calibrated models. The raw data of both the training and testing sets were standardized to the range 0.1–0.9 before analysis, to eliminate the effects of differing dimensions and give the variables similar importance.
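The 0.1–0.9 standardization described above is a linear min-max transform. A small sketch of one plausible implementation (the exact formula the authors used is not given; `scale_01_09` is a name chosen here) is:

```python
def scale_01_09(values, v_min=None, v_max=None):
    """Min-max normalize values to the range [0.1, 0.9].

    v_min/v_max default to the extremes of `values`; pass the
    training-set extremes explicitly to apply the same transform
    to the testing set.
    """
    if v_min is None:
        v_min = min(values)
    if v_max is None:
        v_max = max(values)
    span = v_max - v_min
    return [0.1 + 0.8 * (v - v_min) / span for v in values]
```

Fixing the extremes from the training set and reusing them on the testing set avoids leaking test-set information into the preprocessing step.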

**Table 1** Spearman correlation coefficients between DO and the other water quality parameters

| Parameters | Correlation coefficient | Sig. (2-tailed) | Parameters | Correlation coefficient | Sig. (2-tailed) |
|---|---|---|---|---|---|
| TEMP (°C) | −0.31** | <0.01 | TN (mg/L) | −0.52** | <0.01 |
| pH | 0.40** | <0.01 | TP (mg/L) | −0.55** | <0.01 |
| COD_{Mn} (mg/L) | −0.50** | <0.01 | Cu (mg/L) | −0.01 | 0.81 |
| BOD (mg/L) | −0.55** | <0.01 | F (mg/L) | −0.39** | <0.01 |
| NH_{3}-N (mg/L) | −0.65** | <0.01 | Zn (mg/L) | −0.22** | <0.01 |
| PE (mg/L) | −0.39** | <0.01 | As (mg/L) | −0.17** | <0.01 |
| VP (mg/L) | −0.37** | <0.01 | Cd (mg/L) | 0 | 0.99 |
| Hg (mg/L) | −0.13** | <0.01 | Cr (mg/L) | −0.04 | 0.22 |
| Pb (mg/L) | −0.002 | 0.95 | Cyn (mg/L) | −0.20** | <0.01 |
| COD_{Cr} (mg/L) | −0.48** | <0.01 | LAS (mg/L) | −0.45** | <0.01 |


**Correlation is significant at the 0.01 level (2-tailed).

**Table 2** Basic statistics of the measured water quality parameters

| Parameters | Unit | Minimum value | Mean value | Maximum value | Standard deviation |
|---|---|---|---|---|---|
| DO | mg/L | 0.10 | 7.13 | 14.92 | 1.86 |
| TEMP | °C | 2.00 | 16.74 | 30.93 | 3.79 |
| pH | dimensionless | 6.76 | 7.72 | 8.96 | 0.36 |
| COD_{Mn} | mg/L | 0.68 | 5.28 | 52.99 | 5.15 |
| BOD | mg/L | 0.42 | 4.36 | 98.35 | 7.50 |
| NH_{3}-N | mg/L | 0.02 | 1.92 | 37.10 | 4.58 |
| PE | mg/L | 0.01 | 0.07 | 1.39 | 0.12 |
| VP | mg/L | 0.0002 | 0.003 | 0.11 | 0.008 |
| Hg | mg/L | 0 | 0.00003 | 0.001 | 0.00005 |
| COD_{Cr} | mg/L | 1.65 | 21.80 | 258.42 | 22.29 |
| TN | mg/L | 0.03 | 4.27 | 54.72 | 6.41 |
| TP | mg/L | 0.01 | 0.24 | 4.80 | 0.45 |
| F | mg/L | 0.01 | 0.54 | 7.40 | 0.47 |
| Zn | mg/L | 0.0005 | 0.03 | 0.44 | 0.04 |
| As | mg/L | 0.00003 | 0.003 | 0.08 | 0.006 |
| Cyn | mg/L | 0.0005 | 0.003 | 0.07 | 0.004 |
| LAS | mg/L | 0.01 | 0.09 | 2.88 | 0.19 |


### SVM for regression

The SVM is a widely used machine learning method (… *et al.* 2003). It is based on non-linear statistical theory, transforming the input space into a higher dimensional feature space for the purpose of separating the data patterns (Baylar *et al.* 2009). The goal of the SVM is to find an optimal hyperplane which can differentiate the data in different classes by the maximum gap (Smola & Schölkopf 2004). The SVM is one of the best algorithms used for binary classification, and it has also been extended to solve non-linear regression problems (He *et al.* 2014).

Consider a set of training data (*x*_{1}, *y*_{1}), (*x*_{2}, *y*_{2}), ···, (*x*_{i}, *y*_{i}), ···, (*x*_{n}, *y*_{n}), where *x*_{i} is the input vector containing *m* features, *y*_{i} is the observed output value related to *x*_{i}, and *n* represents the number of samples in the data set. The regression function of the SVM is constructed as follows:

$$f(x) = \langle w, \phi(x) \rangle + b$$

where *w* is a vector of weights in a feature space with the same dimension as *ϕ*(*x*), *b* is the bias term, and ⟨·,·⟩ denotes the inner product. The regression problem can be expressed as the minimization of the following regularized risk function with the ε-insensitive loss function:

$$\min_{w,\,b,\,\xi_i,\,\xi_i^*} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$$

subject to

$$y_i - \langle w, \phi(x_i) \rangle - b \le \varepsilon + \xi_i, \qquad \langle w, \phi(x_i) \rangle + b - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i, \xi_i^* \ge 0$$

where *i* = 1, 2, ···, *n*; $\xi_i$ and $\xi_i^*$ are the two slack variables measuring the distance from the actual values to the corresponding boundary values of ε, and *C* is a constant that determines the penalty for prediction errors greater than ε. This optimization problem is usually transformed into a quadratic programming problem by using Lagrangian multipliers, and the form of the solution can be given by:

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(x_i, x) + b$$

where $\alpha_i$ and $\alpha_i^*$ are Lagrange multipliers. One of the basic ideas in the SVM is to map the data set *x*_{i} into a higher dimensional feature space by the function *ϕ*. The kernel function *K*(*x*_{i}, *x*_{j}) is defined as an inner product of the points *ϕ*(*x*_{i}) and *ϕ*(*x*_{j}):

$$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$$

The most popular kernel functions used in the literature are:

- linear kernel: $K(x_i, x_j) = x_i^{T} x_j$
- polynomial kernel: $K(x_i, x_j) = (\gamma\, x_i^{T} x_j + r)^{d}$
- radial basis kernel (RBF): $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$
- sigmoid kernel: $K(x_i, x_j) = \tanh(\gamma\, x_i^{T} x_j + r)$

Here, γ, *r*, and *d* are kernel parameters. The performance of the SVM for regression depends on a set of parameters: *C*, the kernel type, and the corresponding kernel parameters (Min & Lee 2005). In this paper, as the structure of the predictors was not accurately known, all four types of kernel function were tried in the SVM model and the most appropriate one was selected based on the results. The penalty parameter *C* and the kernel function's parameters were determined by the particle swarm optimization (PSO) algorithm with cross-validation, as described below.
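The four candidate kernels above can be written directly as functions. The sketch below (an illustration, not the authors' LIBSVM code; all function names and default parameter values are chosen here) also shows the form of the SVR decision function, where the net multipliers $(\alpha_i - \alpha_i^*)$ and support vectors are assumed to come from an already-trained model:

```python
import math

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(xi, xj):
    return dot(xi, xj)

def poly_kernel(xi, xj, gamma=1.0, r=1.0, d=2):
    """Polynomial kernel: (gamma * <xi, xj> + r) ** d."""
    return (gamma * dot(xi, xj) + r) ** d

def rbf_kernel(xi, xj, gamma=0.5):
    """RBF kernel: exp(-gamma * ||xi - xj||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq)

def sigmoid_kernel(xi, xj, gamma=0.1, r=0.0):
    """Sigmoid kernel: tanh(gamma * <xi, xj> + r)."""
    return math.tanh(gamma * dot(xi, xj) + r)

def svr_predict(x, support, alphas, b, kernel=rbf_kernel):
    """Decision function f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b,
    where `alphas` holds the net multipliers of the support vectors."""
    return sum(a * kernel(xi, x) for xi, a in zip(support, alphas)) + b
```

Note that only the samples with non-zero net multipliers (the support vectors) contribute to the sum, which is what keeps SVR prediction cheap even for large training sets.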

### ANNs

The back propagation neural network (BPNN) is one of the most widely applied ANNs (… *et al.* 2007). The typical BPNN includes three types of neuron layers: an input layer, a hidden layer, and an output layer. The information flows unidirectionally from the input layer to the output layer, through the hidden layer (Pradhan & Lee 2010). As the BPNN is a supervised learning algorithm, the number of neurons in the input layer is equal to the number of input variables (16 water quality parameters in this study), and the single neuron in the output layer gives the related concentration of DO for each case. The differences between the predicted outputs and the observed data are defined as error values, which are propagated backwards through the network. During the training process, the weights between the nodes are adjusted in response to the errors until the overall error value falls below the pre-determined threshold. The number of neurons of the hidden layer (NNH) has a significant influence on the result of the BPNN. If the number is too small, the fault tolerance and generalization capability of the net will be poor; however, over-fitting will occur when the hidden layer has a large number of neurons. In this study, the upper and lower bounds of NNH were first estimated from the commonly used empirical formula:

$$N = \sqrt{M + J} + a$$

where *N* was NNH, *M* was the number of neurons of the input layer (16), *J* was the number of neurons of the output layer (1), and *a* was a constant from 1 to 10. The value range of NNH was defined as 4–14 in this study according to the values given by this and related empirical formulas. A loop was nested into the optimization process and the training was replicated several times, starting with 4 neurons and then increasing the number up to 14. The sum of the absolute values of errors (SAE) was monitored and the number with the minimum SAE was chosen. The choices of the initial values of the network connection weights and NNH are very important to the convergent behavior of the BPNN, and the PSO algorithm with cross-validation was chosen as the optimization method to improve the model performance, as described below.
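The hidden-neuron search loop described above can be sketched as follows. Here `train_fn` is a hypothetical placeholder standing in for "train a BPNN with *n* hidden neurons and return its SAE"; the range 4–14 matches the bounds derived in the text:

```python
def select_nnh(train_fn, n_range=range(4, 15)):
    """Try each candidate hidden-layer size and keep the one with the
    minimum sum of absolute errors (SAE).

    train_fn(n) is assumed to train the network with n hidden neurons
    and return its SAE on the training data.
    """
    best_n, best_sae = None, float("inf")
    for n in n_range:
        sae = train_fn(n)
        if sae < best_sae:
            best_n, best_sae = n, sae
    return best_n, best_sae
```

In practice each `train_fn(n)` call would itself involve PSO-optimized training with cross-validation, so this outer loop is the cheapest part of the procedure.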

### PSO algorithms

The PSO is a population-based stochastic optimization algorithm (… *et al.* 2009). The best position of each particle in the problem space is obtained by keeping track of its previous best position, which is stored as *p*_{best}. Then the new position that the particle should fly to is calculated from *p*_{best} and the best position among its neighbors (*g*_{best}). The dimension of the searching space is defined as *D*, the total number of particles is *n*, the position of the *i*th particle is represented as the vector $x_i = (x_{i1}, x_{i2}, \cdots, x_{iD})$, and the velocity of the *i*th particle is defined as the vector $v_i = (v_{i1}, v_{i2}, \cdots, v_{iD})$. The updating processes can be described as:

$$v_{id}^{k+1} = \omega v_{id}^{k} + c_1 \cdot rand \cdot (p_{best,id} - x_{id}^{k}) + c_2 \cdot rand \cdot (g_{best,d} - x_{id}^{k})$$

then

$$x_{id}^{k+1} = x_{id}^{k} + v_{id}^{k+1}$$

where *k* and *k* + 1 represent the iteration count, *c*_{1} and *c*_{2} are the acceleration coefficients with positive values, *rand* is a random number between 0 and 1, and ω is the inertia weight representing the degree to which the current velocity of the particle is influenced by its previous one. The PSO is a flexible algorithm that can be combined with the SVM and BPNN for better model performance (Zhang *et al.* 2007; Lin *et al.* 2008). The fitness function of the *i*th particle in both models is expressed in terms of an output error as follows:

$$f = \frac{1}{S} \sum_{k=1}^{S} (t_k - p_k)^2$$

where *f* is the fitness value, *S* is the number of training samples, *t*_{k} is the target output (observed values), and *p*_{k} is the predicted output based on *x*_{i}. In the PSO-BPNN, *x*_{i} indicates the connection weight matrices between the input layer and the hidden layer, as well as between the hidden layer and the output layer. In the PSO-SVM, *x*_{i} indicates the penalty parameter and the kernel function's parameters. The main goal of the optimization is to search for the best parameters that produce the most accurate predictions for the different models.
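A minimal, self-contained sketch of the velocity and position updates above (illustrative coefficient values ω = 0.7, c₁ = c₂ = 1.5; not the paper's exact implementation) is:

```python
import random

def pso(fitness, bounds, n_particles=20, iters=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `fitness` over the box given by `bounds` with basic PSO."""
    rng = random.Random(seed)
    D = len(bounds)
    X = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    V = [[0.0] * D for _ in range(n_particles)]
    pbest = [x[:] for x in X]                    # personal best positions
    pbest_f = [fitness(x) for x in X]
    g = min(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]     # global best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(D):
                # velocity update: inertia + cognitive pull + social pull
                V[i][d] = (w * V[i][d]
                           + c1 * rng.random() * (pbest[i][d] - X[i][d])
                           + c2 * rng.random() * (gbest[d] - X[i][d]))
                X[i][d] += V[i][d]               # position update
            f = fitness(X[i])
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = X[i][:], f
                if f < gbest_f:
                    gbest, gbest_f = X[i][:], f
    return gbest, gbest_f
```

In the paper's setting, `fitness` would be the cross-validated output error of a BPNN or SVM trained with the parameters encoded in the particle position.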

### Leave-one-out cross-validation

Cross-validation is a popular statistical method to evaluate and compare learning algorithms by dividing the data into two segments, in which one part is used to train a model and the other is used to validate the same model (Bengio & Grandvalet 2004). The basic and most used form is *k*-fold cross-validation, in which the training data are randomly partitioned into *k* mutually exclusive subsets of approximately equal size. Each time, a different subset is held out for validation while the remaining *k* − 1 subsets are used for model training. This process is repeated *k* times, and the estimated parameters and accuracy are derived by averaging over the runs (Diamantidis *et al.* 2000). The goal of cross-validation is to improve the generalization ability by defining several independent data sets to test the model, in order to limit the over-fitting problem, especially when the size of the training data is small or the number of parameters in the model is large (Prechelt 1998). However, it is always difficult to determine the value of *k* for common *k*-fold cross-validation, as the results may have considerable bias (Kohavi 1995); this condition improves when *k* is increased to a much larger number. The most extreme form of *k*-fold cross-validation, in which *k* equals the number of training patterns, is known as *leave-one-out* cross-validation, which has been shown to provide an 'almost' unbiased estimate of the true generalization ability of the model (Cawley & Talbot 2004). Although the key disadvantage of *leave-one-out* cross-validation is its high computational cost, this cost was accepted here and the method was combined with the MLR, BPNN, and SVM, respectively.
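The leave-one-out procedure can be sketched as follows; `fit` and `predict` are hypothetical placeholders for any of the three models (MLR, BPNN, or SVM):

```python
def loocv_mse(xs, ys, fit, predict):
    """Leave-one-out cross-validation: hold out each sample once,
    train on the remaining n-1 samples, and average the squared
    prediction errors over all n held-out samples."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        train_x = xs[:i] + xs[i + 1:]   # all samples except i
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        err = ys[i] - predict(model, xs[i])
        total += err * err
    return total / n
```

With n = 769 training samples this means 769 model fits per candidate parameter setting, which is the computational cost the section above refers to.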

In this study, the data set of 969 samples was first divided into a training subset and a testing subset. The training subset of 769 samples was used to obtain the best parameters of the MLR, PSO-BPNN, and PSO-SVM through the *leave-one-out* cross-validation method. The testing subset, including 200 samples randomly extracted from the whole data pool, was used for the comparison of predictive capacities among the different models. All of the calibration and subsequent prediction work was performed by programming code in MATLAB R2013b. The LIBSVM library for support vector machines (Chang & Lin 2011) was used to design the PSO-SVM model.

### Models’ performance criteria

The performances of the developed models were evaluated by the coefficient of determination (*R*^{2}) and the mean squared error (MSE). The degree of correlation between the observed and predicted values is defined as *R*^{2} and described as follows:

$$R^2 = \left[ \frac{\sum_{i=1}^{n} (DO_{io} - \overline{DO}_{o})(DO_{ip} - \overline{DO}_{p})}{\sqrt{\sum_{i=1}^{n} (DO_{io} - \overline{DO}_{o})^{2}} \; \sqrt{\sum_{i=1}^{n} (DO_{ip} - \overline{DO}_{p})^{2}}} \right]^{2}$$

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (DO_{io} - DO_{ip})^{2}$$

where *n* is the number of input samples, *DO*_{io} and *DO*_{ip} are the observed and predicted DO concentrations of sample *i*, respectively, and $\overline{DO}_{o}$ and $\overline{DO}_{p}$ are the mean values of the observed and predicted DO concentrations. The best fit between observed and predicted values corresponds to *R*^{2} = 1 and MSE = 0.
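The two criteria can be computed directly from paired observed/predicted series; in this sketch (function names chosen here) *R*² is the squared Pearson correlation, matching the definition above:

```python
import math

def mse(obs, pred):
    """Mean squared error between observed and predicted values."""
    n = len(obs)
    return sum((o - p) ** 2 for o, p in zip(obs, pred)) / n

def r_squared(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(obs)
    mo = sum(obs) / n
    mp = sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return (cov / (so * sp)) ** 2
```

Note that this *R*² measures linear agreement only, which is why the perfectly correlated but biased series in the usage below still scores 1; the MSE is what penalizes the bias.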

## RESULTS AND DISCUSSION

**Table 3** Performance of the PSO-SVM with different kernel functions

| | Training: Linear | Training: Polynomial | Training: RBF | Training: Sigmoid | Testing: Linear | Testing: Polynomial | Testing: RBF | Testing: Sigmoid |
|---|---|---|---|---|---|---|---|---|
| *R*^{2} | 0.50 | 0.73 | 0.76 | 0.14 | 0.55 | 0.43 | 0.74 | 0.13 |
| MSE | 1.56 | 1.67 | 1.64 | 3.56 | 1.74 | 2.20 | 1.62 | 3.16 |


Among the four kernel functions tried in the PSO-SVM, the RBF kernel produced the highest *R*^{2} and the lowest MSE in the testing period and was therefore adopted for the subsequent comparisons. Among the three models, the MLR gave the lowest *R*^{2} values in both training and testing periods. As the concentration of DO is a complex outcome influenced by a series of factors, the combined effect could not be described by a linear relationship. Unlike the ANN and SVM, which cannot reveal the functional relationships between the target and predictor variables and are often referred to as 'black box' approaches, the MLR was able to define the coefficient of each parameter; however, those coefficients are meaningful only if they satisfy a series of statistical criteria. The importance of the input variables could be assessed for the ANN and SVM with the *leave-one-out* cross-validation method, but that was not the objective of this study. The PSO-BPNN gave higher *R*^{2} values than the MLR in both processes, while the difference between their MSEs was very small. The PSO-SVM had the best results for both indices among the three models, with a higher *R*^{2} and a lower MSE. The line charts obtained by using the MLR, PSO-BPNN, and PSO-SVM during the testing period are shown in Figures 2–4. It can be seen that the estimated DO concentrations of the PSO-SVM were closer to the corresponding observed values than those of the other two models.

**Table 4** Performance comparison of the MLR, PSO-BPNN, and PSO-SVM models

| | Training: MLR | Training: PSO-BPNN | Training: PSO-SVM | Testing: MLR | Testing: PSO-BPNN | Testing: PSO-SVM |
|---|---|---|---|---|---|---|
| *R*^{2} | 0.20 | 0.69 | 0.76 | 0.22 | 0.63 | 0.74 |
| MSE | 1.68 | 1.65 | 1.64 | 1.93 | 1.80 | 1.62 |


**Table 5** Absolute relative errors (%) of the testing results

| | Min value | Mean value | Max value | Percentage lower than 10% |
|---|---|---|---|---|
| MLR | 0.48 | 22.16 | 172.36 | 39% |
| PSO-BPNN | 0.01 | 15.00 | 183.65 | 52.5% |
| PSO-SVM | 0.02 | 13.00 | 147.36 | 59.0% |


## CONCLUSION

In this study, the BPNN and SVM optimized by PSO, as well as the MLR model, were developed to predict DO concentration based on water quality parameters. To achieve this objective, 969 samples collected during 2009 and 2010 from rivers in China were selected as input data. The training process was mainly used to calibrate the parameters in the models. In order to avoid the over-fitting problem and biased estimation, *leave-one-out* cross-validation was applied for the three models during the training process. The results of the testing period reflected the prediction and generalization capabilities, with 20% of the input data set aside from the whole to test the models. The statistical criteria obtained from the three models with best-fit structures during both training and testing periods showed that the PSO-BPNN and PSO-SVM had better predictive performance than the linear regression method, which suggested non-linear relationships between DO concentration and the water quality parameters.

The absolute relative errors calculated for the testing results showed that the PSO-SVM had the minimum mean error, while the MLR had the maximum one. The PSO-BPNN, with greater error fluctuation, showed a less stable predictive capacity than the PSO-SVM, although the distribution of absolute relative errors was not smooth even for the PSO-SVM. The analysis of observed values against absolute relative errors showed that the models had better predictive capacities when the data were close to the average values. The errors at points with extreme values were much larger, indicating inaccurate predictions by the data-driven models. The samples with larger predictive errors were similar for the three models and had the lowest observed DO concentrations. Overall, the PSO-SVM model was superior to the MLR and PSO-BPNN in DO prediction based on multiple physical and chemical parameters. It can be considered for predicting DO levels in the rivers of China when little other information is available.

However, improvements will be necessary in future research. For example, the prediction accuracy, especially for extreme values, needs to be improved, possibly by incorporating additional model parameters. The factors influencing the DO concentration of rivers in different regions may also differ, so the selection of suitable predictors for each region would lead to more efficient models and more accurate results.

## ACKNOWLEDGEMENTS

The Chinese Academy for Environmental Planning is acknowledged for the support of water quality data. This work was supported by Tianjin Normal University Doctor Foundation (52XB1517), the innovation team training plan of the Tianjin Education Committee (TD12-5037) and the National Natural Science Foundation of China (No. 41372373). Comments and suggestions from the anonymous reviewers and the editor are greatly appreciated.