Abstract
Signal analysis and anomaly detection are essential to water pollution early warning systems. In view of the nonlinear and volatile characteristics of water quality time series, this paper proposes a new method for water quality anomaly detection based on fluctuation feature analysis. The method has two steps. First, a long short-term memory (LSTM) network predicts the water quality time series, and the residuals between the observed and predicted values are calculated. Second, dynamic features are extracted with a sliding time window and described by the Approximate Entropy (ApEn); these features are then input to an anomaly detection model based on Isolation Forest. Compared with traditional anomaly detection methods, the proposed method shows better performance in distinguishing water quality anomalies. It can be applied to real-time water quality anomaly detection and early warning.
HIGHLIGHTS
A prediction model based on LSTM networks is constructed to predict six water quality indicators.
Dynamic features of water time series are extracted by the Approximate Entropy (ApEn).
Combined with the high-dimensional ApEn features, the Isolation Forest method is applied to identify water quality anomalies.
This research has the potential to improve water quality early warning systems.
INTRODUCTION
Water is an essential resource for living organisms and human society. However, with the accelerated development of industrialization and urbanization, river pollution incidents have become increasingly frequent in recent decades, damaging the ecological environment and endangering the safety of people's lives and property (Zhu et al. 2018; Liu et al. 2020). Thus, anomaly detection of water quality is crucial to protect the ecological environment and human beings.
Conventional water quality indicators, such as total phosphorus (TP), total nitrogen (TN) and chemical oxygen demand (COD), are the most direct tools for water quality monitoring. They are mainly collected chronologically as time series at a fixed time interval (Shi et al. 2018; Tinelli & Juran 2019; Jiang et al. 2020). Using these conventional water quality indicators, many water quality anomaly detection approaches have been proposed (Byer & Carlson 2005; Koch & McKenna 2010; Arad et al. 2012; Olsen et al. 2012; Perelman et al. 2012; Wechmongkhonkon et al. 2012; Zhang et al. 2014; Azhar et al. 2015), including threshold methods (Byer & Carlson 2005; Zhang et al. 2014), statistical analysis models (Koch & McKenna 2010; Perelman et al. 2012) and artificial intelligence methods (Arad et al. 2012; Olsen et al. 2012; Wechmongkhonkon et al. 2012; Azhar et al. 2015), such as clustering classification (Azhar et al. 2015), time series analysis and artificial neural network (ANN) algorithms (Wechmongkhonkon et al. 2012). However, most threshold methods and statistical analysis methods rely on a single water quality indicator and struggle to describe the fluctuation of water quality. Thus, a method is needed that can capture the fluctuation characteristics of water quality and can be applied to real-time water quality anomaly detection.
With the development of water quality monitoring technologies, more water quality indicators can be obtained and used in water quality prediction and anomaly detection. Research teams have introduced different methods (Bouamar & Ladjal 2007; Durdu 2010; Bao & Meng 2015; Wang et al. 2016; Wang et al. 2019; Baek et al. 2020; Liu et al. 2020), such as data assimilation, machine learning and ANNs, to improve the accuracy of water quality prediction models and anomaly detection algorithms. Durdu (2010) proposed a hybrid model combining a neural network and an autoregressive moving average (ARMA) sequence for water quality prediction, which achieved better performance than a single model. Bouamar & Ladjal (2007) utilized ANNs and support vector machines (SVMs) to classify water quality data into normal and anomalous groups. To address strong water quality fluctuation, Bao & Meng (2015) proposed a wavelet-based method to extract the features of the water quality index at different scales, and realized feature recognition using energy spectrum analysis to implement water quality anomaly detection. Baek et al. (2020) used long short-term memory (LSTM) networks to accurately simulate water quality, including TP, TN and total organic carbon (TOC). Liu et al. (2020) used a Bayesian autoregressive (BAR) model for water quality variation prediction and the Isolation Forest (IF) algorithm for water quality anomaly detection. However, these methods give little consideration to the trend of changes and the dependence properties in long-term time series, and rarely explore the correlation among the multidimensional features, resulting in low accuracy of long-term water quality prediction (Ahmed et al. 2019).
Therefore, it is a challenge to predict the changes in water quality indicators and extract the fluctuation features of water quality accurately over a long period.
MATERIALS AND METHODS
Dataset
Based on practical needs and an analysis of common pollution sources in urban rivers, and in accordance with water quality standards such as environmental quality standards for surface water (Zhang et al. 2020), sanitary standards for drinking water (Zhang et al. 2022) and technical specifications for surface water and sewage monitoring (Qi et al. 2006), we chose six conventional water quality indicators: dissolved oxygen (DO), NH3-N, electrical conductivity (Cond), pH, COD and turbidity.
In this paper, the normal water quality time series data were measured each hour from the water quality monitoring stations deployed in an urban river in south China. The obtained dataset comprised 3,000 continuous data points in total. Each data point contained the measurements of six types of conventional water quality indicators, including DO, NH3-N, Cond, pH, COD and turbidity.
In the water quality prediction stage, all normal water quality time series data were used to construct the prediction models: 85% of the data were used for training and 15% for testing.
In the water anomaly detection stage, only the last 600 data points were used as the observed values. In order to verify the performance of the detection algorithm, the Gaussian inverted U-shaped anomalies were superimposed artificially on the normal observed values to simulate different strengths of water pollution events. The time periods of adding anomaly events were (30, 60), (150, 180), (250, 270), (350, 370) and (500, 520), respectively.
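As a rough illustration of how such synthetic events might be generated, the sketch below adds a bell-shaped (inverted-U) bump to a series inside each listed time period. The paper does not give the exact formula, so several choices here are assumptions made only for the example: the bump spans about three standard deviations of the Gaussian on each side, its amplitude equals the event strength multiplied by the standard deviation of the series, each period is treated as half-open, and the sine-wave "normal" series is a placeholder for real measurements.

```python
import math
import statistics


def gaussian_bump(length):
    """Bell-shaped profile over `length` points, peaking near 1 in the middle."""
    center = (length - 1) / 2.0
    sigma = length / 6.0  # assumption: the window covers roughly +/- 3 sigma
    return [math.exp(-0.5 * ((k - center) / sigma) ** 2) for k in range(length)]


def inject_anomalies(series, windows, strength):
    """Return a copy of `series` with a scaled bump added in each (start, end)
    window, plus 0/1 labels marking the modified points."""
    scale = strength * statistics.pstdev(series)  # assumption: amplitude = strength * SD
    out = list(series)
    labels = [0] * len(series)
    for start, end in windows:
        for k, b in enumerate(gaussian_bump(end - start)):
            out[start + k] += scale * b
            labels[start + k] = 1
    return out, labels


# Time periods taken from the text; the series itself is a placeholder.
WINDOWS = [(30, 60), (150, 180), (250, 270), (350, 370), (500, 520)]
normal = [7.0 + 0.3 * math.sin(2 * math.pi * t / 24) for t in range(600)]
polluted, labels = inject_anomalies(normal, WINDOWS, strength=1.5)
```

The returned labels give a point-level ground truth against which detection results can later be scored.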
Water quality prediction model based on LSTM networks
When a water pollution incident occurs, the related water quality indicators fluctuate and deviate from their normal levels. To accurately capture the fluctuation and change of water quality indicators, we used the LSTM network for water quality prediction in this paper, combining six water quality indicators: DO, NH3-N, pH, Cond, COD and turbidity.
The monitoring data of water quality indicators were arranged in chronological order, and the sliding window structure could be used as the input of the LSTM model for multivariate time series prediction.
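A minimal sketch of this sliding window arrangement is given below. It only prepares the (input, target) pairs that a multivariate sequence model would consume and does not build the LSTM itself. The function name `make_windows`, the window length of 4 and the toy records are illustrative assumptions, not values from the paper.

```python
def make_windows(records, window):
    """Split a chronological list of multivariate records into (input, target) pairs.

    Each input holds `window` consecutive records; the target is the record
    that immediately follows them.
    """
    inputs, targets = [], []
    for t in range(len(records) - window):
        inputs.append(records[t:t + window])
        targets.append(records[t + window])
    return inputs, targets


# Toy data: 10 hourly records of six indicators (placeholder values).
records = [[float(t + 0.1 * j) for j in range(6)] for t in range(10)]
X, y = make_windows(records, window=4)
```

Each element of `X` then has shape (window, number of indicators), which is the usual layout for recurrent network inputs.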
Once the LSTM-based prediction model had been trained, the observed water quality time series data could be input into it to predict the current water quality indicators. The deviation between the observed value and the predicted value at the current time was then calculated to obtain a group of residual vectors.
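The residual step itself is simple subtraction, as the sketch below shows. The predicted values here are placeholder numbers standing in for the output of a trained model, and the function name `residual_vectors` is hypothetical.

```python
def residual_vectors(observed, predicted):
    """Observed minus predicted, giving one residual vector per time step."""
    return [[o - p for o, p in zip(obs_row, pred_row)]
            for obs_row, pred_row in zip(observed, predicted)]


# Two time steps, two indicators each (placeholder values).
observed = [[8.0, 0.5], [7.5, 0.75]]
predicted = [[7.5, 0.5], [7.0, 0.5]]
res = residual_vectors(observed, predicted)
```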
Feature extraction method with ApEn
ApEn, which inherits the good anti-interference ability of information entropy in signal processing, was suitable for extracting the statistical complexity characteristics of irregular nonlinear signals, such as water quality time series data (Huang et al. 2017; Ma et al. 2019). Because the entropy amplitude differed greatly between normal and abnormal water quality, the ApEn was used as a feature to detect water quality anomalies in this paper. The ApEn of water quality time series data could be calculated by the following steps:
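- The reconstructed subsequences were defined in Equation (3):
  $$X(i) = [u(i), u(i+1), \ldots, u(i+m-1)], \quad i = 1, 2, \ldots, N-m+1 \qquad (3)$$
  where $u(1), u(2), \ldots, u(N)$ was the one-dimensional time series, $N$ was the length of the time series and $m$ was the size of the sliding time window.
- The distance $d$ between the vectors $X(i)$ and $X(j)$ was calculated in Equation (4):
  $$d[X(i), X(j)] = \max_{0 \le k \le m-1} \left| u(i+k) - u(j+k) \right| \qquad (4)$$
- Given a threshold $r$, the number of vectors with $d < r$ was counted for each $i$, and its ratio to the total number of vectors was denoted $C_i^m(r)$ in Equation (5):
  $$C_i^m(r) = \frac{\#\{\, j : d[X(i), X(j)] < r \,\}}{N-m+1} \qquad (5)$$
- Following the standard ApEn definition, the mean logarithm of these ratios was taken in Equation (6):
  $$\Phi^m(r) = \frac{1}{N-m+1} \sum_{i=1}^{N-m+1} \ln C_i^m(r) \qquad (6)$$
- For sliding time window size $m + 1$, the steps above were repeated to get $\Phi^{m+1}(r)$.
- The ApEn of a sequence of length $N$ was estimated in Equation (7):
  $$\mathrm{ApEn}(m, r, N) = \Phi^m(r) - \Phi^{m+1}(r) \qquad (7)$$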
Based on empirical values, we took m = 2 and r = 0.25SD(u) in this paper, where SD(u) represents the standard deviation of the original water quality time series.
According to the calculation principle of the ApEn value, after selecting different subsequences from the original data sequence by using the sliding window, we could calculate the ApEn of each subsequence and analyze the abnormal volatility of the data by the fluctuation in the ApEn values.
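The calculation described above can be sketched in plain Python as follows. This follows the standard ApEn formulation with m = 2 and r = 0.25SD(u) as stated in the text; the use of the population standard deviation, the strict inequality d < r, the window length of 50 and the step of 10 are assumptions made for the example, and the function names are hypothetical.

```python
import math
import random
import statistics


def _phi(u, m, r):
    """Mean log-fraction of length-m subsequences within distance r of each one."""
    n = len(u) - m + 1
    vectors = [u[i:i + m] for i in range(n)]
    total = 0.0
    for xi in vectors:
        # Maximum-coordinate distance; the self-match (d = 0) keeps count >= 1.
        count = sum(1 for xj in vectors
                    if max(abs(a - b) for a, b in zip(xi, xj)) < r)
        total += math.log(count / n)
    return total / n


def approximate_entropy(u, m=2, r=None):
    """ApEn(m, r, N) of a one-dimensional series u."""
    if r is None:
        r = 0.25 * statistics.pstdev(u)  # assumption: population SD
    if r <= 0:
        return 0.0  # constant series: treated as perfectly regular
    return _phi(u, m, r) - _phi(u, m + 1, r)


def sliding_apen(series, window, step=1, m=2):
    """ApEn of each sliding subsequence, with r fixed from the whole series."""
    r = 0.25 * statistics.pstdev(series)
    return [approximate_entropy(series[s:s + window], m, r)
            for s in range(0, len(series) - window + 1, step)]


regular = [1.0, 2.0] * 50                        # strictly periodic signal
rng = random.Random(0)
noisy = [rng.gauss(0.0, 1.0) for _ in range(100)]  # irregular signal
apen_regular = approximate_entropy(regular)
apen_noisy = approximate_entropy(noisy)
features = sliding_apen(noisy, window=50, step=10)
```

A periodic signal yields an ApEn close to zero while an irregular one yields a clearly larger value, which is the contrast the feature relies on. The double loop is quadratic in the window length, so short windows keep the cost manageable.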
IF algorithm for anomaly detection
The construction process of IF (Liu et al. 2008) could be divided into the following steps:
- Constructed each iTree from a subsample drawn without replacement, to improve the diversity among iTrees.
- Set the depth limit of the iTrees and the stop condition used when constructing an iTree from the samples.
- Calculated the weighted path length over multiple iTrees for real-time anomaly detection. When there were abnormal fluctuations in the multidimensional water quality characteristics, the weighted path length would be shorter.
- Used ensemble learning to fuse the results of multiple iTrees. Details of the calculation could be found in Equations (8) and (9):
  $$c(n) = 2H(n-1) - \frac{2(n-1)}{n} \qquad (8)$$
  $$s(x, n) = 2^{-E(h(x))/c(n)} \qquad (9)$$
  where $n$ was the number of given samples, $H(i)$ was a harmonic number, $c(n)$ was the average path length of a failed search in a binary tree, $h(x)$ was the path length of any sample $x$, with $E(h(x))$ its average over the iTrees standardized by $c(n)$, and $s(x, n)$ was the anomaly score.
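To make these steps concrete, the following is a compact pure-Python sketch of an Isolation Forest in the form given by Liu et al. (2008): subsampling without replacement, random axis-parallel splits up to a depth limit, and the score $2^{-E(h(x))/c(n)}$. It is a teaching sketch, not the implementation used in this paper. The defaults of 100 trees and a subsample size of 256, the plain (unweighted) mean of path lengths, the logarithmic approximation of the harmonic number and the toy two-dimensional data are all assumptions.

```python
import math
import random

EULER_GAMMA = 0.5772156649  # used in the approximation H(i) ~ ln(i) + gamma


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup on n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_itree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random value."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    dim = rng.randrange(len(rows[0]))
    lo = min(r[dim] for r in rows)
    hi = max(r[dim] for r in rows)
    if lo == hi:  # simplification: stop if the chosen feature cannot be split
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[dim] < split]
    right = [r for r in rows if r[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_itree(left, depth + 1, max_depth, rng),
            "right": build_itree(right, depth + 1, max_depth, rng)}


def path_length(row, node, depth=0):
    """Depth at which row reaches a leaf, plus c(size) for the unsplit remainder."""
    if "size" in node:
        return depth + c_factor(node["size"])
    child = node["left"] if row[node["dim"]] < node["split"] else node["right"]
    return path_length(row, child, depth + 1)


def fit_forest(rows, n_trees=100, sample_size=256, seed=0):
    """Build n_trees iTrees, each on a subsample drawn without replacement."""
    rng = random.Random(seed)
    psi = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(psi))
    trees = [build_itree(rng.sample(rows, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, psi


def anomaly_score(row, trees, psi):
    """Score in (0, 1]; values well above 0.5 indicate likely anomalies."""
    mean_depth = sum(path_length(row, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / c_factor(psi))


# Toy data: a 2-D Gaussian cluster plus one distant point.
data_rng = random.Random(1)
normal_rows = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(200)]
outlier = [10.0, 10.0]
trees, psi = fit_forest(normal_rows + [outlier])
outlier_score = anomaly_score(outlier, trees, psi)
typical_score = sum(anomaly_score(r, trees, psi) for r in normal_rows) / len(normal_rows)
```

In the setting of this paper, each row would hold the ApEn features of the six indicators for one time window rather than toy coordinates. The distant point is isolated after very few splits, so its score is well above the average score of the cluster.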
The LSTM network could be used for high-precision and dynamical prediction of the time series of water quality. Then, the IF algorithm was proposed to detect the abnormal water quality by using ApEn to select anomalous fluctuation characteristics.
The combination of the prediction-based and isolation-based method could achieve water quality anomaly detection and early warning.
RESULTS AND DISCUSSION
Water quality prediction
Taking the prediction results of COD and DO as examples, we compared and analyzed the performance of the LSTM model, the back propagation (BP) neural network model, the recurrent neural network (RNN) model and the gated recurrent unit (GRU) model.
In the BP prediction model, the number of hidden nodes was set to 50, the optimizer was set to ‘adam’ and the mean squared error (MSE) was used as the loss function. In the RNN prediction model, the number of units was set to 64, the optimizer was set to Root Mean Square Propagation (RMSprop) and the MSE was used as the loss function. In the GRU prediction model, the number of units was set to 50, the optimizer was set to ‘adam’ and the MSE was used as the loss function.
Model | RMSE | MAE | MAPE (%) | R2 |
---|---|---|---|---|
BP | 0.8197 | 0.6727 | 10.7912 | 0.7373 |
RNN | 0.5880 | 0.4271 | 6.4838 | 0.9510 |
GRU | 0.5713 | 0.4163 | 6.2606 | 0.9513 |
LSTM | 0.2842 | 0.2169 | 3.1484 | 0.9878 |
Bold entries emphasize that LSTM model performs better than others.
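For reference, the four indices reported in the table follow their usual definitions, sketched below. The function name and the sample values are illustrative and are not taken from the paper's data.

```python
import math


def regression_metrics(y_true, y_pred):
    """RMSE, MAE, MAPE (in percent) and R2 for paired observations and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        # MAPE is undefined when an observed value is zero.
        "MAPE": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
        "R2": 1.0 - ss_res / ss_tot,
    }


metrics = regression_metrics([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
```

Lower RMSE, MAE and MAPE and an R2 closer to 1 indicate a better fit.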
Anomaly detection
Event strength | Evaluation index | Original | LSTM | LSTM + ApEn |
---|---|---|---|---|
0.8 | Precision | 0.66 | 0.76 | 0.78 |
0.8 | Recall | 0.63 | 0.77 | 0.87 |
0.8 | F1-score | 0.64 | 0.76 | 0.81 |
1 | Precision | 0.67 | 0.77 | 0.90 |
1 | Recall | 0.64 | 0.77 | 0.81 |
1 | F1-score | 0.65 | 0.77 | 0.83 |
1.5 | Precision | 0.71 | 0.80 | 0.90 |
1.5 | Recall | 0.68 | 0.83 | 0.80 |
1.5 | F1-score | 0.69 | 0.81 | 0.82 |
2 | Precision | 0.83 | 0.90 | 0.87 |
2 | Recall | 0.76 | 0.82 | 0.91 |
2 | F1-score | 0.79 | 0.84 | 0.89 |
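The three evaluation indices in the table have standard definitions in terms of true positives, false positives and false negatives. The sketch below computes them at the level of individual data points; whether the paper scores individual points or whole events, and how it averages, is not stated, so this choice and the toy labels are assumptions.

```python
def detection_metrics(labels, flags):
    """Precision, recall and F1-score for 0/1 ground-truth labels and detector flags."""
    tp = sum(1 for l, f in zip(labels, flags) if l == 1 and f == 1)
    fp = sum(1 for l, f in zip(labels, flags) if l == 0 and f == 1)
    fn = sum(1 for l, f in zip(labels, flags) if l == 1 and f == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example: 3 true positives, 1 false positive, 1 false negative.
labels = [0, 0, 1, 1, 1, 0, 0, 1]
flags = [0, 1, 1, 1, 0, 0, 0, 1]
precision, recall, f1 = detection_metrics(labels, flags)
```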
CONCLUSIONS
This study presented a method for water quality anomaly detection based on entropy features and a water quality prediction model, which was effective in the early warning of water quality anomalies. The results and analysis demonstrate the following:
The water quality prediction model based on LSTM revealed the trends of the water quality baseline. Compared with the prediction model based on RNN, the proposed prediction method could obtain higher prediction accuracy in less time.
The ApEn feature selection method played an important role in the detection performance. The residual sequences of each principal component were combined and the ApEn was used to select features of the water quality prediction residuals. The ApEn had different trends and distributions when the water system was under pollution or chaos.
The developed IF algorithm, using multiple water time series, proved to be effective in detecting water quality anomalies.
Abnormal water quality events often cause correlated changes in multiple water quality indicators at the same time (Mao et al. 2017). Different water quality parameters have different sensitivities to different pollutants. Kroll (2006) concluded that residual chlorine and TOC are sensitive to pollutants such as sodium citrate, sodium cyanide and nicotine. Related studies have shown that using multiple water quality indicators (Vugrin et al. 2009; Liu et al. 2014) can reduce false alarms caused by events other than water pollution, such as changes in operating conditions.
In the process of water quality prediction, due to the complexity of prediction problems and the limitation of data availability, the proposed method only considers the historical data of conventional water quality indicators that directly reflect the impact of water quality changes. However, river water bodies are complex and are often affected by external factors such as geography and meteorology. The water quality prediction model proposed in this paper does not consider hydrological information (Herath et al. 2020; Jiang et al. 2022), which results in a lack of interpretability and physical consistency (Herath et al. 2021). How to take more background knowledge into account and enhance the interpretability of the prediction model will be the subject of our follow-up research.
FUNDING
This work was funded by the Key Technology Research and Development Program of Zhejiang Province (2021C03177 and 2022C03078) and National Natural Science Foundation of China (U21A20519 and 61803333).
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.
REFERENCES
Author notes
These authors contributed equally.