## Abstract

As the water environment continues to deteriorate, accurate prediction of water quality changes has become a topic of increasing concern. To further improve the accuracy of water quality prediction, as well as the stability and generalization ability of the model, we propose a new spatiotemporal forecast model for future water quality. To capture the spatiotemporal characteristics of the water quality data, the three stations whose water temperature time series are most strongly correlated (stations S1, S2, and S4) were first selected to predict the water temperature at station S1; 17,380 records were collected at each monitoring station, and the spatiotemporal characteristics were extracted by a BiGRU-SVR network model. The prediction test is based on actual water quality data from the Qinhuangdao sea area in Hebei province, collected from 2 September to 26 September 2013, and the model is compared with other baseline models. The experimental results show that the proposed model is better than the other baseline models and effectively improves the accuracy of water quality prediction: the mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (*R*^{2}) are 0.071, 0.076, and 0.957, respectively, indicating good robustness.

## HIGHLIGHT

To capture the spatiotemporal characteristics of water pollution data, this paper proposes a new hybrid neural network model that combines bidirectional gated recurrent unit (BiGRU) networks with support vector regression (SVR) and focuses on the trends and contextual temporal attributes of water quality data.

Multiparameter water quality prediction is realized, with comprehensive consideration of the correlations between data.

## INTRODUCTION

With the rapid development of urbanization and industrialization in recent years, water quality has deteriorated and water safety has suffered from great threats and challenges. Water quality safety is a guarantee for human health and aquaculture, and timely monitoring and scientific management of water quality are of great importance to ensure water safety. The Ministry of Ecology and Environment of China released the ‘Ecological Environment Monitoring Planning Outline (2020–2035)’, which emphasizes the need to speed up the implementation of the Ministry of Water Resources to build an automatic water quality monitoring network to ensure that the monitoring data are ‘accurate and complete’. Water quality prediction is an important part of the automatic water quality monitoring network, which is a prerequisite for all subsequent work and an important tool for water resources protection. As an important factor affecting the stability of water ecosystems, water temperature not only restricts the reproduction and growth of aquatic plants and animals, the living environment, and the population distribution of each aquatic species but also affects the dissolved oxygen and chemical reactions of toxic substances in water (Fu *et al.* 2021). Therefore, water temperature plays a key role in water resource management decisions. Advance prediction of water quality parameter concentrations is the basis for water quality pollution prevention and control, achieving integrated environmental management and is important for public health and governmental decision-making.

However, current prototype observation of water temperature has great limitations due to factors such as observation point location and instrument accuracy (Quan *et al.* 2020). It is therefore necessary to explore high-precision water temperature simulation methods, which have important reference value for the management and protection of the reservoir water environment and for aquaculture. Existing water quality prediction methods include water quality simulation models, the gray theory method (Wu *et al.* 2021), regression analysis (Bauwe *et al.* 2019), time series methods (Li *et al.* 2021), and neural networks (Deng *et al.* 2021). Water quality simulation models are an early and widely used approach: a mathematical model is built from a series of water quality indicators to predict future water quality changes, as in the Soil and Water Assessment Tool (SWAT) model (Akoko *et al.* 2021). This type of model can accurately simulate the basic laws of water quality but is often applicable only to a small range of waters, such as specific lakes and rivers, so its generality is poor. The gray theory method builds a gray differential prediction model from a small amount of incomplete information and gives a fuzzy long-term description of how a system develops. Its drawbacks are a large average prediction error and the inability to predict interval number time series. Regression analysis uses regression equations to fit the relationship between independent and dependent variables; for water quality prediction, the correlation coefficients between water quality indicators must be analyzed first, and commonly used models include linear regression (LR) and multiple regression analysis. For a water environment affected by the interaction of multiple factors, regression analysis that considers multiple variables predicts well when the situation is simple, but it has difficulty representing highly complex data. The time series method applies the principles of mathematical statistics to the historical series of water quality data itself, studying the trends of water quality parameters in order to predict future values. It is often used for short- and medium-term water quality forecasting, for example with the autoregressive integrated moving average (ARIMA) model (Chen *et al.* 2021) and the autoregressive conditional heteroskedasticity (ARCH) model (Wu *et al.* 2012). Its advantage is that it does not rely on other variables. However, it has the following limitations: it can use only the series' own data for prediction and requires a sufficiently long historical series; the data must be autocorrelated; and it can capture only linear relationships, not nonlinear relationships between attributes.

The use of machine learning and neural network models in water resources research is a current hotspot. Neural networks are extensive, interconnected network structures composed of a large number of simple adaptive units, which can simulate the information processing of the human brain through effective learning mechanisms (Haghiabi *et al.* 2018). They can learn by themselves the nonlinear mapping between historical water quality parameters and external factor variables and apply this mapping to future forecasts. To efficiently integrate contextual information in time series data, Liu *et al.* (2020) successfully applied bidirectional stacked simple recurrent units (Bi-S-SRU) to mariculture water quality prediction and demonstrated the feasibility of bidirectional neural networks for predicting water quality parameters. Yan *et al.* (2021) used a hybrid of a one-dimensional residual convolutional neural network (1-DRCNN) and a bidirectional gated recurrent unit (BiGRU) network to predict the water quality of the Luan River, extracting the potential local features among water quality parameters and integrating information from both earlier and later time steps. Hu *et al.* (2019) proposed a water quality prediction method for smart mariculture based on a deep long short-term memory (LSTM) network to predict pH and water temperature. Wang *et al.* (2019) established a water pollution prediction model that takes the eutrophication indicators total phosphorus and total nitrogen as parameters, based on a time series ARIMA model optimized with the Holt–Winters seasonal model. Valadkhan *et al.* (2022) proposed a groundwater quality parameter prediction method with new effective parameters, based on LSTM and recurrent neural networks (RNN), to prevent water pollution. Zhou *et al.* (2022) proposed an integrated wavelet decomposition, autoregressive integrated moving average, and gated recurrent unit (W-ARIMA-GRU) model for water quality prediction, which decomposes the original water quality index series into a trend series and a fluctuation series and then learns the characteristics of the decomposed series. Haq & Harigovindan (2022) used a CNN to obtain aquaculture water quality characteristics and LSTM and GRU to learn long-term dependencies in time series data, and proposed a hybrid deep learning model for water quality prediction. Yan *et al.* (2020) predicted water quality based on a deep belief network and a least squares support vector regression model (PSO-DBN-LSSVR). In addition, error back-propagation (BP) networks (Huang *et al.* 2022), support vector regression (SVR) (Su *et al.* 2022), temporal convolutional network (TCN) models (Li *et al.* 2022), and other machine learning methods are also used in water quality prediction. These different machine learning model structures provide an important reference for the study in this paper.

In this paper, spatial dependence is captured through the influence of other monitoring sites on the water quality parameters at the target site; in the temporal dimension, the water temperature at the experimental site is also influenced by its own past values, so temporal dependence is captured as well. We propose a new hybrid neural network model combining SVR and BiGRU that focuses on the trend and contextual temporal attributes of water quality data, and we use it to predict the water temperature in the Qinhuangdao sea area, Hebei, China. Finally, the experimental results are compared with a total of seven models, covering both non-deep and deep models, and the accuracy and stability of each method are evaluated against the real values.

## DATA AND METHOD

### Research area and data

The study area is located between 119°34′–119°54′ E and 39°25′–39°42′ N in the Qinhuangdao sea area of the Bohai Sea, China, and the data are the water temperature and salinity observations of the Qinhuangdao sea monitoring stations. The temperature and salinity dataset uses records from 2 September to 26 September 2013; the monitoring station sensors collected data every 10 min, and 17,380 records were collected at each monitoring station. The name, data volume, and location of each monitoring station are listed in Table 1. One of the four monitoring stations (S1) was selected as the evaluation point for water temperature prediction.

| Monitoring station name | Number of data | Longitude | Latitude |
|---|---|---|---|
| Station 1 (S1) | 17,380 | 119°37′ 52.3812″E | 39°39′ 55.3212″N |
| Station 2 (S2) | 17,380 | 119°54′ 57.5994″E | 39°42′ 21.3588″N |
| Station 3 (S3) | 17,380 | 119°45′ 55.4394″E | 39°37′ 1.4406″N |
| Station 4 (S4) | 17,380 | 119°34′ 56.46″E | 39°25′ 55.9812″N |


Water temperature data were collected from the four monitoring stations via water temperature sensors, at a frequency of one data item every 10 min, to support the subsequent data analysis and modeling.

### Extraction for spatial factors

According to Tobler's first law of geography, everything is related to everything else, but near things are more related than distant things; neighboring sites therefore have a greater influence on the experimental site than distant sites (Li *et al.* 2021a, 2021b). To illustrate the spatial characteristics of the water temperature data, the authors calculated the distance between each pair of monitoring points and the Pearson correlation coefficient of the water temperature between them.

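The distance between two monitoring points is calculated with the Haversine formula (*et al.* 2022). The formula uses a sine function so that enough significant digits are retained even when the distance is small. The formula is as follows:

$$\operatorname{hav}\left(\frac{d}{R}\right) = \operatorname{hav}(\varphi_n - \varphi_1) + \cos\varphi_1 \cos\varphi_n \operatorname{hav}(\lambda_n - \lambda_1)$$

$$\operatorname{hav}(\theta) = \sin^{2}\left(\frac{\theta}{2}\right) = \frac{1 - \cos\theta}{2}$$

$$d = 2R \arcsin\sqrt{\sin^{2}\left(\frac{\varphi_n - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_n \sin^{2}\left(\frac{\lambda_n - \lambda_1}{2}\right)}$$

where $d$ is the distance between the two sites, $R$ is the radius of the Earth, and $\lambda$ and $\varphi$ are the longitude and latitude of a site in radians.

The Pearson correlation coefficient (*et al.* 2020) is commonly used to measure and analyze the degree of linear correlation between variables. The formula is as follows:

$$r = \frac{\sum_{i=1}^{m}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{m}(x_i - \bar{x})^{2}}\sqrt{\sum_{i=1}^{m}(y_i - \bar{y})^{2}}}$$

where $x_i$ and $y_i$ are the water temperatures at the two sites at time $i$, $\bar{x}$ and $\bar{y}$ are their means, and $m$ is the number of samples.

Given the target site S1($\lambda_1$, $\varphi_1$) and the other sites Sn($\lambda_n$, $\varphi_n$), the coordinates of each surrounding site Sn are substituted into the Haversine formula above to calculate the distance between Sn and the target site S1. The Pearson correlation coefficient is used to calculate the correlation between the water temperature of the surrounding sites and that of the target site, and the water temperature data from the sites that are highly correlated with the target site are selected and fused with the target site's water temperature as the later model input.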
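The two spatial measures described above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the authors' implementation: the function names (`dms_to_degrees`, `haversine_km`, `pearson`) and the mean Earth radius of 6371 km are assumptions introduced here, and only the S1 and S2 coordinates are taken from Table 1.

```python
import math
from typing import Sequence

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; not stated in the paper


def dms_to_degrees(degrees: float, minutes: float, seconds: float) -> float:
    """Convert a degrees/minutes/seconds coordinate to decimal degrees."""
    return degrees + minutes / 60.0 + seconds / 3600.0


def haversine_km(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    # hav(d/R) = hav(d_phi) + cos(phi1) * cos(phi2) * hav(d_lambda)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Coordinates of S1 and S2 from Table 1, as (longitude, latitude).
s1 = (dms_to_degrees(119, 37, 52.3812), dms_to_degrees(39, 39, 55.3212))
s2 = (dms_to_degrees(119, 54, 57.5994), dms_to_degrees(39, 42, 21.3588))
print(f"S1-S2 distance: {haversine_km(*s1, *s2):.2f} km")
```

In a station-selection step, `pearson` would be applied to the temperature series of S1 and each candidate station, and the stations with the highest coefficients would be kept as model inputs.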

### Data cleaning

The water quality data used in this paper come from different collection devices at automatic monitoring stations. Due to equipment failure, human recording errors, and other factors, the water quality data inevitably contain abnormal and missing values. Such non-conforming data can mislead the algorithm about the direction of the predicted values and reduce the accuracy of the model. Therefore, the dataset must be cleaned before the prediction model is constructed. In this paper, the Pauta criterion and a generative adversarial network (GAN) approach are used to correct the water quality data.

#### Pauta criterion

*et al.* (2020) used the Pauta criterion (also known as the 3*σ* principle) to effectively detect anomalous values in highway traffic flow, showing that the algorithm is able to correctly identify anomalous data. The Pauta criterion assumes that a set of test data contains only random errors. The standard deviation is computed from the data, an interval is determined with a given probability, and any error exceeding this interval is considered an outlier (the threshold is generally taken as ±3*σ*). The Pauta criterion formula is:

$$|V_i| = |x_i - \bar{x}| > 3\sigma \tag{7}$$

where $\bar{x}$ is the sample mean, $V_i = x_i - \bar{x}$ is the residual of the $i$-th sample, and *σ* is the standard deviation. If a sample value $x_i$ satisfies Equation (7), $x_i$ is considered an outlier and is excluded.
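As a rough illustration of the 3*σ* rule described above, the following sketch flags the indices of values whose deviation from the sample mean exceeds three standard deviations. The function name `pauta_outliers`, the use of the sample standard deviation, and the toy readings are assumptions made for this example and are not taken from the paper.

```python
import statistics
from typing import List, Sequence


def pauta_outliers(values: Sequence[float], k: float = 3.0) -> List[int]:
    """Return the indices whose residual |x_i - mean| exceeds k * sigma."""
    mean = statistics.fmean(values)
    # Sample standard deviation (Bessel's correction); needs at least 2 values.
    sigma = statistics.stdev(values)
    return [i for i, x in enumerate(values) if abs(x - mean) > k * sigma]


# Toy series: 19 plausible readings and one spike (made-up values).
readings = [20.0] * 19 + [100.0]
print(pauta_outliers(readings))
```

Note that the rule needs a reasonably long series: with only a handful of samples, a single extreme value inflates *σ* so much that it can never exceed the 3*σ* threshold.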

#### Generative adversarial network

### ST-BiGRU-SVR

The ST-BiGRU-SVR model predicts the water temperature at the future times *t* + 1, *t* + 2, …, *t* + *N*. The model is divided into three parts: extraction of the site's self-correlated water temperature and the correlated water temperature data of adjacent sites; auxiliary data fusion; and spatiotemporal feature extraction with future water temperature prediction.

The first part is the extraction of spatial factors, that is, the extraction of the station's self-correlated water temperature and the water temperature data from neighboring stations. The details are described in the section ‘Extraction for spatial factors’.

The second part is the fusion of auxiliary data. Auxiliary data are incorporated so that more spatiotemporal features can be extracted during model training. All data are cleaned and processed for missing values before use, and records with outliers are removed. For the water temperature values at each site, the authors used the Pauta criterion and GAN-based filling to process the water quality data. The merged data are normalized and used as input for the next stage.

The last part is the extraction of spatiotemporal features and the prediction of future water temperature. After cleaning and normalization, the enhanced dataset is obtained. The spatiotemporal features of the normalized time series are extracted by the BiGRU model and the SVR model. The series values at times (*t* + 1, *t* + 2, …, *t* + *N*) are then predicted using lagged data up to the past time *t*.
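The normalization and lagged-input steps can be sketched as follows. The paper does not state which normalization or window length was used, so min-max scaling, the helper names (`min_max_normalize`, `make_windows`), and the parameters `n_lags` and `horizon` are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def min_max_normalize(series: Sequence[float]) -> List[float]:
    """Scale a series to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(series), max(series)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in series]


def make_windows(
    rows: Sequence[Sequence[float]], target_col: int, n_lags: int, horizon: int
) -> Tuple[List[List[float]], List[List[float]]]:
    """Build (inputs, targets) pairs from time-ordered rows.

    Each row holds the values of all fused stations at one time step.
    Each input is the flattened block of the last n_lags rows; each target
    is the next `horizon` values of the target column (t+1 ... t+N).
    """
    inputs, targets = [], []
    for start in range(len(rows) - n_lags - horizon + 1):
        past = rows[start:start + n_lags]
        future = rows[start + n_lags:start + n_lags + horizon]
        inputs.append([v for row in past for v in row])
        targets.append([row[target_col] for row in future])
    return inputs, targets


# Toy example with two stations per row (made-up values); column 0 is the target.
rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (4.0, 40.0)]
X, y = make_windows(rows, target_col=0, n_lags=2, horizon=1)
print(X, y)
```

With real data, each row would hold the normalized temperatures of S1, S2, and S4 at one 10-min time step, and the target column would be S1.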

In this paper, the BiGRU model is fused with the SVR model to obtain the ST-BiGRU-SVR water temperature prediction model. First, the input data are split into training and test data for the ST-BiGRU model; the ST-BiGRU network structure, weights, and queue parameters are determined; the network is trained; predictions are made on the test data; and the prediction error is calculated. Then, based on the ST-BiGRU error, the time series data are matched again and normalized as the input of the ST-SVR network; the ST-SVR network structure, weights, and queue parameters are determined; training and prediction are performed; and the ST-SVR residual predictions are obtained. Finally, the ST-BiGRU predictions and the ST-SVR residual predictions are summed to obtain the final predicted values.
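The two-stage structure described above (a first model, a second model fitted to its errors, and a summed output) can be sketched independently of any deep learning library. In the sketch below, the two fitting functions are deliberately trivial stand-ins, a persistence forecast in place of BiGRU and a mean-residual corrector in place of SVR; they and all names are hypothetical and only illustrate how the residual predictions are combined, not how the authors trained their networks.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Predictor = Callable[[Vector], float]
Fitter = Callable[[Sequence[Vector], Sequence[float]], Predictor]


def fit_persistence(X: Sequence[Vector], y: Sequence[float]) -> Predictor:
    """Stand-in for the first-stage model (BiGRU in the paper): repeat the last lag."""
    return lambda x: x[-1]


def fit_mean_residual(X: Sequence[Vector], residuals: Sequence[float]) -> Predictor:
    """Stand-in for the second-stage model (SVR in the paper): predict the mean residual."""
    mean_res = sum(residuals) / len(residuals)
    return lambda x: mean_res


def fit_residual_hybrid(
    fit_primary: Fitter, fit_residual: Fitter,
    X: Sequence[Vector], y: Sequence[float],
) -> Predictor:
    """Fit the primary model, fit a second model on its errors, and sum both."""
    primary = fit_primary(X, y)
    residuals: List[float] = [yi - primary(xi) for xi, yi in zip(X, y)]
    corrector = fit_residual(X, residuals)
    return lambda x: primary(x) + corrector(x)


# Toy lagged windows of a series that rises by 1 per step (made-up values).
X_train = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
y_train = [4.0, 5.0, 6.0]
model = fit_residual_hybrid(fit_persistence, fit_mean_residual, X_train, y_train)
print(model([10.0, 11.0, 12.0]))
```

In this toy case the persistence forecast is always one unit too low, so the residual stage learns a correction of 1 and the summed prediction follows the trend. Swapping in real models only requires two fitting functions with the same signature.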

## EXPERIMENTS

### Evaluation

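In this paper, three evaluation metrics, the MAE, the RMSE, and the coefficient of determination (*R*^{2}), are used to evaluate the performance of the model:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}$$

$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}$$

In the formulas, $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the actual values, and *n* is the total number of samples. In statistics, *R*^{2} characterizes how well the regression equation explains the variation in the dependent variable and can be used to determine the degree of fit between the true and predicted values. Values closer to 1 indicate a better fit, and *R*^{2} = 1 means that the model explains 100% of the variation in the target values.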
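For readers who want to reproduce the three metrics, a minimal standard-library sketch is given below. It uses the standard definitions of MAE, RMSE, and the coefficient of determination; the function names and the toy values are illustrative assumptions, not the authors' evaluation code.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_true) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values (made up) to show the calling convention.
actual = [1.0, 2.0, 3.0]
predicted = [2.0, 2.0, 2.0]
print(mae(actual, predicted), rmse(actual, predicted), r2_score(actual, predicted))
```

Predicting the mean of the actual values for every sample, as in the toy call, gives *R*^{2} = 0, which is a useful reference point when reading the scores of the baseline models.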

### Baseline model

To test the overall performance of the proposed model, the authors conducted a series of comparative experiments against two categories of baseline models: non-deep learning models and deep learning models.

- (1) Non-deep learning baseline models. These include the LR model, the random forest (RF) model, the ARIMA model, and the SVR model.

- (2) Deep learning baseline models. These include the temporal convolutional network (TCN) model, the gated recurrent unit (GRU) network model, the bidirectional gated recurrent unit (BiGRU) network model, and the ST-BiGRU-SVR model (the model proposed in this paper).

## RESULTS AND DISCUSSION

### Prediction performance

### Comparison of experiments

Table 2 compares the performance of the proposed model with the seven baseline models on the three evaluation metrics MAE, RMSE, and *R*^{2}. The deep learning baseline models generally perform better than the non-deep learning baseline models, and the ARIMA model performs the worst. Comparing the three evaluation metrics across the deep learning models shows that the proposed model performs best overall, with the lowest MAE and the highest *R*^{2}. The reason for the high prediction accuracy of ST-BiGRU-SVR may be that the water temperature at the surrounding monitoring stations influences the water temperature at the target station S1, while the other models only consider the water temperature data of the target station itself. Because the water temperature is affected by both the water temperature of the surrounding stations and the temporal context of the water temperature, all of these influencing factors are used as the input of the ST-BiGRU-SVR model, and the predicted water temperature at station S1 is the model output. In essence, the ST-BiGRU-SVR model determines its input by calculating the distance and the water temperature correlation coefficient between monitoring stations and selecting the neighboring stations whose water temperature is most highly correlated with that of S1. According to the temporal characteristics of water temperature data, BiGRU is used to process the water quality data at different times, mine their time series information, and make full use of their time series characteristics. To handle the nonlinear characteristics of water quality datasets, the SVR model transforms the nonlinear regression problem into a linear regression problem in a high-dimensional space. SVR has the advantages of global optimality, simple structure, and strong generalization ability, which makes it well suited to the prediction of nonlinear data. Its core idea is to find a curve in the high-dimensional space that represents the relationship between input and output data and to obtain the desired predicted value from the curve function. Therefore, the ST-BiGRU-SVR model predicts better than the non-deep learning baseline models (LR, RF, ARIMA, SVR) and the deep learning baseline models (TCN, GRU, BiGRU).

| Baseline model | MAE | RMSE | *R*^{2} |
|---|---|---|---|
| LR | 0.202 | 0.260 | 0.831 |
| RF | 0.112 | 0.171 | 0.868 |
| ARIMA | 1.63 | 2.620 | 0.581 |
| SVR | 0.124 | 0.186 | 0.845 |
| TCN | 0.081 | 0.094 | 0.903 |
| GRU | 0.212 | 0.084 | 0.916 |
| BIGRU | 0.078 | 0.071 | 0.922 |
| ST-BIGRU-SVR | 0.071 | 0.076 | 0.957 |


## CONCLUSIONS AND OUTLOOK

Currently, most reservoirs and intelligent aquaculture systems mainly use sensor-based IoT systems to monitor water quality in real time, but this kind of real-time monitoring still involves a lag. If water quality can be accurately predicted over a long period into the future, water quality monitors can propose measures to prevent water pollution in advance, and farmers can act early to counteract farming risks and improve production efficiency. However, water quality trends are currently judged mainly from regular manual surveys and monitoring by inspectors and from farmers' long-accumulated experience; such methods are highly subjective, unreliable, and untimely.

For the long-term prediction of water quality parameters, this paper proposes a hybrid neural network-based spatiotemporal prediction model for future water quality. The model achieves more accurate and stable predictions by integrating historical water quality data and water quality data from the nearest neighboring sites. After pre-processing the data with GAN interpolation, we also use the Pearson correlation coefficient to analyze the correlation of water quality parameters. Finally, we input the prior information and the preprocessed data into the constructed model for training. The experimental results show that the proposed model has a higher prediction accuracy than the non-deep and deep baseline models. Specifically, the proposed long-term prediction method for water quality parameters achieves a coefficient of determination (*R*^{2}) of 0.957, which demonstrates the effectiveness of the model. The proposed model still has some limitations: (1) Water quality parameters are affected by a variety of complex factors; however, due to limited equipment funding, more parameters could not be obtained at the sites. In the future, other water quality parameters, meteorological factors, etc., can be integrated to improve predictive performance. (2) To make the water quality prediction model more robust and further reduce the impact of water quality changes on the model prediction, factors such as seasonal changes and water quality changes can be incorporated into the deep neural network as a priori information, so that the prediction model can obtain longer-term prediction results. (3) The proposed model was evaluated only on the dataset of the Qinhuangdao watershed in Hebei, which limits the generality of the conclusions. In the future, it is hoped that more monitoring data (such as dissolved oxygen, chemical oxygen demand, ammonia nitrogen, etc.) from water quality monitoring sites in other watersheds can be collected to further validate the performance of the model.

## ACKNOWLEDGEMENTS

This work was financially supported by the Dalian Science and Technology Innovation Fund Project: Liupanshui Citizens' Drinking Water Source Water Quality Monitoring and Early Warning System Construction (2020JJ27SN106).

## DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.

## CONFLICT OF INTEREST

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Rongli Gai reports financial support was provided by Department of Science and Technology of Liaoning Province.