To improve the efficiency of urban drinking water safety monitoring and early warning management, a pollution risk early warning model for the urban drinking water supply chain is proposed. First, the current situation of urban drinking water supply and the causes of pollution are analyzed. Then, an autoregressive model is used to predict the time series of multiple water quality indicators by continuously introducing new monitoring data, and the prediction residuals of the indicators are combined into residual vector groups. The outlier score of each vector group is obtained with the isolated forest algorithm to judge whether the water quality is abnormal, and the fuzzy comprehensive evaluation method is used to grade the abnormal situation and issue an early warning of the corresponding level. The experimental results show that the area under the receiver operating characteristic curve reaches 0.919 when the prediction residual vector group of turbidity and conductivity is used to detect changes in water quality parameters in the drinking water supply chain. The model can accurately identify abnormal data and issue early warnings, providing a guarantee for the survival of urban residents and for urban development.

  • Current situation of urban drinking water supply.

  • The causes of pollution.

  • AR model (autoregressive model).

  • Isolated forest algorithm.

  • Accurately predict the abnormal data.

Graphical Abstract

Water is the life of a city, and a water source is the most basic condition for the survival and development of urban residents. The risk assessment and early warning management of water source security is the basis for the safety of urban public water supply and the safe operation of the urban economic and social system. However, worsening water pollution has long been an important threat to the safety of water sources in most cities at home and abroad. This pollution includes not only conventional pollution such as industrial sewage, domestic wastewater and agricultural non-point sources, but also sudden water pollution such as terrorism, ship chemical and oil leakage, industrial accident discharge, storm runoff pollution and poisoning (Bai et al. 2018; Díez-Del-Molino et al. 2018; McComb et al. 2019).

The research on risk assessment and safety early warning management of urban water sources in foreign countries has been carried out rapidly since the promulgation of the World Health Organization (WHO) drinking water guidelines in 2004. The WHO has developed some guidelines for drinking water quality, which are international reference points for standard setting and drinking water safety. The latest guidelines developed by WHO are those agreed in Geneva in 1993. There are no guidelines for certain elements and substances considered. This is because there is not enough research on the effects of substances on organisms, so it is impossible to define the guidance limit. In other cases, there are no guidelines because the substance cannot reach dangerous concentrations in water due to its insolubility or scarcity. The purpose of the study is to monitor and warn pollutants, risk assessment and risk warning management (Ahmed et al. 2018; Khatmullina et al. 2018). The purpose of early warning is to ensure the water quality safety and sufficient water supply in the whole process of the water supply system from water source to user faucet, and gradually from ‘emphasizing the importance of building a robust water supply system’ (Diendere et al. 2018) to ‘using specific models to analyze and evaluate its safety’ (Liu et al. 2019). At present, water source risk assessment studies have focused on two aspects: one is the research on the relationship between water supply system and human health; The other is the assessment of water supply system's own security and vulnerability (Wilson et al. 2018).

In China, the current research focuses on water quality health risk assessment, index comprehensive evaluation, emergency management system construction, risk response measures implementation and research and development, etc. The comprehensive risk assessment and early warning management of the whole process of urban water supply chain has not been paid enough attention.

Compared with foreign countries, domestic water quality prediction research started relatively late. Chen Yue studied a water quality anomaly detection method based on radial basis function (RBF) neural network prediction and wavelet denoising, and used fuzzy c-means clustering (FCM) for anomaly classification. The RBF neural network predicts future water quality values from previous observations. The residual time series is obtained by comparing the predicted values with the actual values, and wavelet denoising is then applied to the residuals within a sliding window. If a new residual is greater than a specific threshold (baseline), the water quality is considered abnormal. Taking online monitoring of ammonia nitrogen as the research object, the algorithm was shown to have a lower false alarm rate (FAR) and a higher probability of detection (PD) than the time series increment method, further improving the performance of water quality anomaly detection (Hou et al. 2013). He proposed a water quality anomaly detection algorithm based on multifactor fusion. First, the autoregressive model was used to predict the water quality indicators, and then fuzzy c-means clustering analysis was carried out by fusing the prediction residuals of various sensitive water quality indicators. A comparison showed that the anomaly detection performance of the proposed method was better than that of the traditional autoregressive prediction method and the multidimensional Euclidean distance method (He 2013).

Wang et al. (2019) proposed a supervised learning-based UV–visible spectral water quality anomaly detection method. The method obtains the normal sample difference space in different data sets, and then uses an orthogonal projection method to remove the spectral data components in the difference space to achieve the purpose of baseline correction. Then, partial least squares discriminant analysis is used to extract features from the corrected spectra. The optimal threshold obtained from the training set is used to determine outlier points; Finally, sequential Bayesian rolling updating is used to update the abnormal probability at each time to determine the water quality alarm sequence. The background difference of different batches of water quality spectrum is eliminated, the spectral information is more fully utilized, and the detection lower limit of characteristic pollutants is reduced. Yin et al. (2019) proposed a method of water quality anomaly detection based on supervised learning. The method firstly obtains the difference space of normal samples in different data sets, and then uses the orthogonal projection method to remove the spectral data components in the difference space to achieve the purpose of baseline correction; then partial least squares discriminant analysis is used to extract features from the corrected spectra, and the optimal threshold obtained from the training set is used to determine outliers. Finally, the water quality alarm sequence is determined by using a sequential Bayesian rolling update to update the abnormal probability at each time (Liu et al. 2019).

Yang et al. (2018) regarded the parameter calibration of the prediction model as a Bayesian estimation problem. The posterior probability density function of the parameters is obtained from the finite difference method and Bayesian inference, and more reasonable parameter values are then obtained by an improved Metropolis-Hastings sampling method. The results show that the new method is more suitable for the prediction of water pollution.

On this basis, an early warning model of pollution risk in urban drinking water supply chain is proposed. The AR model and the isolated forest algorithm are used to analyze the abnormal scores of water quality in the water supply network system of the urban drinking water supply chain (Zuo et al. 2020). The fuzzy comprehensive evaluation method is used to analyze the abnormal grade and make different grade early warning. The effectiveness of the proposed method is verified by an actual case.

Drinking water safety is a prominent environmental problem, which is directly related to people's life and health. Drinking water pollution accidents seriously affect the normal drinking water of residents, and even lead to the outbreak of water-borne diseases or pollution poisoning. Drinking water safety is directly related to the health of the masses. Poor drinking water quality can cause a variety of diseases. Studies have shown that 80% of human diseases are related to water (Jindal et al. 2018). At present, there are two major problems of drinking water in the world.

Serious shortage of drinking water resources

China's total water resources rank sixth in the world, but China has a large population. If the per capita water resources are calculated, the per capita occupancy is only 2,500 cubic meters, about 1/4 of the world's per capita water resources, ranking 110th in the world. Water resource constraints restrict the scale and growth speed of economic development, and the resource constraints caused by water resource shortage will restrain short-term economic development and often become the ‘bottleneck’ of short-term economic development. With the acceleration of modernization, the sharp increase of world urban population, especially the improvement of people's living standards, the shortage of water resources has become a major problem restricting urban development and affecting people's quality of life (Shi et al. 2018).

Water pollution is increasing

Due to the increasingly serious environmental pollution, urban drinking water sources are polluted to varying degrees. The city is the center of economic life, often with a large amount of sewage discharge. In addition, the level of urban sewage treatment is not high. The situation of urban water environment is very serious. The river section flowing through the city is polluted more seriously, and the water quality of urban lakes is poor. Considerable numbers of urban water sources are seriously polluted, directly threatening the safety of drinking water (Burnett et al. 2014).

Urban drinking water supply mode

At present, urban water supply by quality follows this basic pattern. High-quality source water receives regular treatment for urban drinking water and industrial production water, or centralized advanced treatment for urban domestic drinking water, or local advanced treatment for direct drinking as barreled purified water or pipeline purified water. Low-quality source water is used directly, or after simple treatment, as industrial cooling water, municipal miscellaneous water and domestic miscellaneous water (Chen et al. 2020). The reclaimed water from urban sewage treatment is mainly used for irrigation, municipal administration, industrial production, environmental landscape and entertainment, and domestic miscellaneous water. In residential areas, building groups and public buildings, the use of reclaimed water as miscellaneous water is promoted, and in industrial enterprises, internal wastewater reuse or interplant series reuse is implemented. The pattern of water supply by quality is shown in Figure 1.
Figure 1

Chinese dual water supply situation.

Classification of urban water supply chain

The urban water supply chain is mainly divided into three categories. The first is centralized treatment with unified supply (one pipe network). For water quality control, an advanced treatment process is added after the current conventional treatment process so that the produced water meets the standard for direct drinking water and is delivered to households through a high-quality pipe network. The second is centralized treatment with water supply by quality (two pipe networks), also called urban separate-quality water supply. It uses two water supply pipe network systems, with the drinking water system as the main water supply system of the city, and the water supplied can be drunk directly. The third is subquality water supply (also known as ‘pipeline direct drinking water’), which refers to the advanced treatment of urban tap water, or of water that meets the quality standard for drinking water sources, by setting up special water treatment stations in residential areas and using advanced water treatment technologies (Chen et al. 2021). High-quality pipes and an independent circulation network then transport the purified water to users for direct drinking, ensuring that the water is sanitary, stable and fresh. This system needs to treat only a small amount of water to an advanced level, and because the water purification station is located within the water supply area, high-quality pipes (and fittings) can deliver the water to user points over the shortest distance.

Causes and characteristics of abnormal water quality

The main factors causing abnormal water quality events include natural factors such as rainstorms and floods, and other factors such as illegal discharge by enterprises and chemical leakage. This paper studies the early warning of water pollution in the water supply network system of the urban drinking water supply chain. In the water supply network system, water quality anomalies are usually divided into three categories: baseline changes, outliers and abnormal events. A baseline change is an obvious shift in the average value of a water quality parameter caused by operations such as switching pumps on and off. An outlier is a sudden increase or decrease at a single time point caused by noise; the values before and after it still follow the normal pattern of water quality change, so this situation does not require a warning. An abnormal event is a period in which the water quality time series does not conform to its historical pattern of change (Gao & Lu 2020). The time series characteristics of the three types of water quality anomaly are shown in Figure 2.
Figure 2

Time series variation characteristics of abnormal water quality in a drinking water supply chain pipe network.

Abnormal water quality detection process

Based on the causes of water quality anomalies, the AR model, which is simple, convenient and well suited to water quality time series, is used to predict the time series of multiple water quality indexes in the water supply network of the urban drinking water supply chain. The next time point is predicted by continuously introducing new monitoring data. When a monitoring value changes so much that it falls outside the range of the mean value plus or minus three times the residual error, the predicted value is used instead of the actual monitoring value to predict the next time point. The prediction residual time series of each index is obtained by subtracting the predicted value from the actual monitoring value (He et al. 2018), and the residual vector group is obtained by combining the residual time series of the indexes. Then, the outlier score of each vector group is obtained by using the isolated forest algorithm. If the abnormal score is greater than the threshold value, the water quality is considered normal; otherwise, it is considered abnormal. The specific process is shown in Figure 3.
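The dynamic prediction rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rolling_ar_forecast`, the already-fitted coefficients, and the reading of "three times the residual error" as `k * sigma` (with `sigma` the residual standard deviation from the training period and residuals centred on zero) are all assumptions.

```python
def rolling_ar_forecast(series, coeffs, const, sigma, k=3.0):
    """One-step-ahead AR forecasts with substitution of extreme observations.

    series: observed (standardized) values
    coeffs: AR coefficients [phi_1, ..., phi_p], assumed already fitted
    const:  AR constant term
    sigma:  residual standard deviation from training (assumed interpretation)
    Returns (predictions, residuals) for indices p .. len(series) - 1.
    """
    p = len(coeffs)
    history = list(series[:p])           # values fed back into the model
    preds, residuals = [], []
    for t in range(p, len(series)):
        # phi_1 multiplies the most recent value, phi_p the oldest
        lags = history[-1:-p - 1:-1]
        pred = const + sum(c * x for c, x in zip(coeffs, lags))
        resid = series[t] - pred         # actual minus predicted
        preds.append(pred)
        residuals.append(resid)
        # An extreme observation is replaced by its prediction for later steps
        history.append(pred if abs(resid) > k * sigma else series[t])
    return preds, residuals

# Toy series with one spike (illustrative values only)
observed = [0, 0, 0, 0, 10, 0, 0]
preds, residuals = rolling_ar_forecast(observed, coeffs=[0.5], const=0.0, sigma=1.0)
```

In the toy run, the spike of 10 produces a large residual at that time point, but because the prediction is fed forward in its place, the forecast for the following point is not pulled upward by the spike.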
Figure 3

Flow chart of water quality anomaly detection technology.

Water quality time series prediction based on the autoregressive model

In the AR model, the value at the next time point of a time series is expressed as a linear combination of the values at the previous p time points plus a random error, as shown in Equation (1):
$$X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t \qquad (1)$$

where $\varphi_i$ is the coefficient of the historical observation value $X_{t-i}$, $c$ is the constant value, and $\varepsilon_t$ is the error value. The residual vector group is obtained from the prediction residuals of each index.

The process of time series prediction of water quality index by AR model is as follows:

  • (1) Stationarity test of the time series: the augmented Dickey-Fuller (ADF) test is used to test the stationarity of the time series. When the P-value is less than 0.05 (α = 0.05), the null hypothesis that a unit root exists is rejected, and the time series is considered stationary.

  • (2) The Bayesian information criterion (BIC) is used to determine the order p of the model. The BIC is defined in Equation (2):

    $$\mathrm{BIC} = k \ln(n) - 2 \ln(L) \qquad (2)$$

    where $L$ is the maximum likelihood function of the model, n is the number of data, and k is the number of parameters to be estimated. When the BIC reaches its minimum value, the corresponding model order is the best order of the AR model.

  • (3) The AR model of order p is established.

  • (4) Based on the historical monitoring data, the current water quality data are predicted. The specific process is shown in Figure 4.
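As an illustration of steps (2) and (3), the sketch below fits AR models of several orders by ordinary least squares and picks the order with the smallest BIC. It is a standard-library sketch under stated assumptions: a Gaussian likelihood evaluated at the residual variance, k = p + 1 estimated parameters, a synthetic AR(1) series in place of the monitoring data, and hypothetical helper names (`solve`, `fit_ar`, `bic`, `select_order`). The ADF test of step (1) is not reproduced here.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_ar(x, p, start):
    """Least-squares AR(p) fit on targets x[start:]; returns (params, rss, n)."""
    rows = [[1.0] + [x[t - i] for i in range(1, p + 1)] for t in range(start, len(x))]
    y = x[start:]
    k = p + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yt for r, yt in zip(rows, y)) for i in range(k)]
    params = solve(xtx, xty)             # [c, phi_1, ..., phi_p]
    rss = sum((yt - sum(b * v for b, v in zip(params, r))) ** 2
              for r, yt in zip(rows, y))
    return params, rss, len(y)

def bic(rss, n, k):
    """BIC = k ln(n) - 2 ln(L), Gaussian likelihood with sigma^2 = rss / n."""
    log_l = -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)
    return k * math.log(n) - 2 * log_l

def select_order(x, max_p):
    """Order in 1..max_p with the smallest BIC (same targets for every order)."""
    scores = {}
    for p in range(1, max_p + 1):
        _, rss, n = fit_ar(x, p, start=max_p)
        scores[p] = bic(rss, n, k=p + 1)
    return min(scores, key=scores.get), scores

# Synthetic AR(1) series (illustrative data, not the monitoring record)
random.seed(0)
x = [0.0]
for _ in range(1999):
    x.append(0.6 * x[-1] + random.gauss(0, 1))
best_p, scores = select_order(x, max_p=6)
```

Using the same target range for every candidate order keeps the likelihoods comparable; a library routine may handle this differently.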

Figure 4

Flow chart of AR model prediction.

Analysis of water quality outliers based on the isolated forest algorithm

After the prediction results of the water quality data are obtained, the isolated forest algorithm is used to analyze the water quality anomalies. The isolated forest (iForest) algorithm is an unsupervised anomaly detection method proposed by Liu et al. (2019). It has two stages: the training stage and the evaluation stage. In the training stage, an isolated tree (i-Tree) is constructed by randomly drawing a subsample from the data without replacement and recursively dividing it as a binary tree: a feature of the residual vector group of the water quality indexes is selected at random, a split point is selected at random between the maximum and minimum values of that feature, and samples less than the split point enter the left subtree while those greater than or equal to the split point enter the right subtree. This process is repeated until a node contains only one sample or only identical samples (so that splitting cannot continue), or until the depth limit of the tree is reached. The flow chart of the i-Tree construction is shown in Figure 5. The path length h(x) is the number of edges of the binary tree that sample point x passes through from the root node to an external node. Because abnormal points are few and different, they are usually separated early and reach an external node after a short path, whereas normal points are separated only after many splits and usually have a long path. Figure 6 shows the process of anomaly detection for two-dimensional data. Because the recursive partition of a two-dimensional plane is equivalent to the classification process of a binary tree, the number of splits needed to separate a point equals its path length in the tree. As can be seen from the figure, the orange point, which represents the anomaly, is separated with a path length of 2, while the blue point needs five splits to reach an external node.
Therefore, abnormal points are easier to isolate and have shorter path lengths. The training phase of the iForest algorithm has two input parameters: the subsample size $\delta$ and the number t of i-Trees constructed. The depth limit l of the isolated trees is calculated according to Equation (3):
$$l = \lceil \log_2 \delta \rceil \qquad (3)$$
where $\delta$ is the size of the subsample.
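The tree construction and path length described above can be sketched in a few lines. This is a simplified illustration, not the authors' code: the names `depth_limit`, `build_itree` and `path_length` are hypothetical, the tree is stored as nested dictionaries, a node is closed as soon as the randomly chosen feature is constant, and the data are synthetic two-dimensional points rather than residual vectors from the case study.

```python
import math
import random

def depth_limit(subsample_size):
    """Depth limit of an isolated tree: ceil(log2(subsample size))."""
    return math.ceil(math.log2(subsample_size))

def build_itree(points, depth, limit, rng):
    """Recursively split points (tuples) on a random feature and split value."""
    if depth >= limit or len(points) <= 1 or all(p == points[0] for p in points):
        return {"size": len(points)}                 # external node
    dim = rng.randrange(len(points[0]))              # random feature
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:                                     # simplification: stop here
        return {"size": len(points)}
    split = rng.uniform(lo, hi)                      # random split point
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_itree(left, depth + 1, limit, rng),
            "right": build_itree(right, depth + 1, limit, rng)}

def path_length(tree, x, depth=0):
    """Number of edges from the root to the external node that x falls into."""
    if "size" in tree:
        return depth
    child = tree["left"] if x[tree["dim"]] < tree["split"] else tree["right"]
    return path_length(child, x, depth + 1)

rng = random.Random(0)
# Synthetic 2-D points: a cluster near (0, 0) plus one far point
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(255)]
outlier = (8.0, 8.0)
sample = normal + [outlier]
limit = depth_limit(len(sample))                     # 256 points give a limit of 8
trees = [build_itree(sample, 0, limit, rng) for _ in range(50)]

def mean_path(x):
    """Average path length of x over all trees."""
    return sum(path_length(t, x) for t in trees) / len(trees)

outlier_path = mean_path(outlier)
normal_path = sum(mean_path(p) for p in normal[:20]) / 20
```

For brevity every tree here is grown on the full 256-point sample; in the algorithm as described, each tree would be grown on its own random subsample. The far point should reach an external node after far fewer splits on average than points in the cluster.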
Figure 5

Flow chart of the isolated tree construction.

Figure 6

Two-dimensional data anomaly detection process.

In the same way, an isolated forest consisting of t isolated trees is constructed. The schematic diagram of the isolated forest constructed is shown in Figure 7. The red nodes represent the scenario where the outliers may be separated.
Figure 7

Schematic diagram of isolated forest.

The second stage is the evaluation stage. After the iForest is established, the degree of abnormality of a data point can be evaluated from its path length in each isolated tree and is expressed by the abnormal score s(x, δ), defined in Equation (4):
$$s(x, \delta) = 2^{-\frac{E(h(x))}{C(\delta)}}, \qquad C(\delta) = 2H(\delta - 1) - \frac{2(\delta - 1)}{\delta} \qquad (4)$$

where H(i) is the harmonic number, which can be estimated by ln(i) + 0.5772156649 (Euler's constant); C(δ) is the average path length of a binary search tree (BST), which is used to normalize h(x); and E(h(x)) is the average path length of x over all isolated trees in the isolated forest. If a water quality data point reaches an external node that contains multiple data points, the path length is adjusted according to the number of data points to make up for the subtree that was not built below the depth limit of the tree. The evaluation rule based on the abnormal score is as follows: when the abnormal score s(x, δ) is close to 1, the point is regarded as abnormal. If s(x, δ) is less than 0.5, the point can be regarded as normal data. If the abnormal score s(x, δ) of all water quality data points is about 0.5, there is no obvious abnormality in the residual vector groups of the monitored indexes. Based on the abnormal score, the most likely abnormal data in the sample can be identified. The training phase returns the isolated tree structures and split conditions, so the abnormal score can also be calculated for residual vector groups outside the training sample to judge whether they are abnormal and to support early warning decisions.
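A short sketch of the score computation may help. It assumes the standard iForest normalization, C(n) = 2H(n − 1) − 2(n − 1)/n for n > 2, with C(2) = 1 and C(n) = 0 otherwise, and the logarithmic estimate of the harmonic number given in the text. The function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649

def harmonic(i):
    """Estimate of the harmonic number H(i) as ln(i) + Euler's constant."""
    return math.log(i) + EULER_GAMMA

def c(n):
    """Average BST path length used to normalize h(x)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def adjusted_path_length(depth, node_size):
    """Path length plus a correction for an external node holding several points."""
    return depth + c(node_size)

def anomaly_score(mean_path_length, subsample_size):
    """s = 2 ** (-E(h(x)) / C(subsample size)); values near 1 indicate anomalies."""
    return 2.0 ** (-mean_path_length / c(subsample_size))

# With 256 subsamples the normalizing constant is about 10.24
norm_256 = c(256)
score_short = anomaly_score(2.0, 256)     # short path, score well above 0.5
score_avg = anomaly_score(norm_256, 256)  # average path, score exactly 0.5
```

A point whose mean path length equals C(δ) scores 0.5, and shorter paths push the score towards 1, which matches the evaluation rule stated above.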

Early warning model based on fuzzy comprehensive evaluation

After the analysis and prediction results of the test data are obtained, the fuzzy comprehensive evaluation method is used to grade them. The fuzzy relationship between the evaluation indexes and the evaluation levels is described by membership functions, which handles both the inherent fuzziness of the problem and the subjectivity of the evaluation. The risk assessment of urban drinking water pollution by fuzzy comprehensive evaluation includes the following five steps:

  • (1) Determine the evaluation index system. If there are n indexes in the index system, the indexes can be expressed as the vector $u = \{u_1, u_2, \ldots, u_n\}$.

  • (2) Determine the evaluation grade system. If there are m evaluation levels, the grades can be expressed as the vector $v = \{v_1, v_2, \ldots, v_m\}$.

  • (3) Establish the fuzzy relation matrix R, whose form is shown in Equation (5):

    $$R = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1m} \\ r_{21} & r_{22} & \cdots & r_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{n1} & r_{n2} & \cdots & r_{nm} \end{bmatrix} \qquad (5)$$

    Each row of R represents the degree to which one evaluation index belongs to each of the m evaluation levels.

  • (4) Determine the weight of each evaluation index. The weight vector of the n indexes can be expressed as $w = \{w_1, w_2, \ldots, w_n\}$.

  • (5) Calculate the fuzzy comprehensive evaluation vector from the membership matrix and the index weights, as shown in Equation (6):

    $$B = w \circ R = (b_1, b_2, \ldots, b_m) \qquad (6)$$

    where $\circ$ is the fuzzy composition operator (different composition operators follow different operation rules) and B is the fuzzy comprehensive evaluation vector, that is, the membership degree of the evaluated object to the m evaluation grades. The fuzzy comprehensive evaluation vector is interpreted by one of the following two principles:

    (a) Weighted average principle, as shown in Equation (7):

    $$\bar{v} = \frac{\sum_{j=1}^{m} b_j v_j}{\sum_{j=1}^{m} b_j} \qquad (7)$$

    where $v_j$ is the rank value assigned to grade j.

    (b) Maximum membership principle: the evaluation result is the grade corresponding to the maximum value in the fuzzy comprehensive evaluation vector, which is used as the basis for early warning level evaluation.
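The evaluation steps can be illustrated with a small numerical sketch. The weights and the membership matrix below are hypothetical values chosen for illustration, not data from the study, and the composition operator is assumed to be the weighted-sum operator (multiply, then add); other operators such as max-min would give different vectors.

```python
def fuzzy_evaluate(weights, relation):
    """B = w o R with the weighted-sum operator, normalized to sum to 1."""
    m = len(relation[0])
    b = [sum(w * row[j] for w, row in zip(weights, relation)) for j in range(m)]
    total = sum(b)
    return [v / total for v in b]

def max_membership(b, grades):
    """Grade with the largest membership degree."""
    best = max(range(len(b)), key=b.__getitem__)
    return grades[best]

# Warning grades in the order used in this paper
grades = ["red", "orange", "yellow", "blue", "green"]

# Hypothetical weights for three indexes (turbidity, conductivity, dissolved oxygen)
weights = [0.5, 0.3, 0.2]

# Hypothetical membership matrix: one row per index, one column per grade
relation = [
    [0.0, 0.1, 0.2, 0.6, 0.1],
    [0.0, 0.1, 0.2, 0.5, 0.2],
    [0.0, 0.0, 0.1, 0.2, 0.7],
]

b = fuzzy_evaluate(weights, relation)
level = max_membership(b, grades)
```

With these illustrative numbers the largest membership falls on the blue grade, so the maximum membership principle would select a blue warning.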

Study area and monitoring data

To study the effectiveness of the proposed method, the drinking water supply system in a city was monitored. According to the fuzzy comprehensive evaluation method, five early warning levels of pollution accidents are established: red, orange, yellow, blue and green. Turbidity, conductivity and dissolved oxygen are taken as monitoring indicators (the selection is not limited to these three; more online monitoring indicators can be combined to detect abnormal water quality events). The original monitoring data are preprocessed, with missing data filled in by interpolation. Because the water quality indicators have different dimensions, the time series should be standardized with the Z-score before AR model prediction. The standardized series has a mean of 0 and a variance of 1. The Z-score is calculated as in Equation (8):
$$z_i = \frac{x_i - \mu}{\sigma} \qquad (8)$$
where $\mu$ is the mean value of the water quality time series, $x_i$ is the water quality value at time point i, and $\sigma$ is the standard deviation of the water quality time series. The basic statistical characteristics of the data are shown in Table 1.
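The Z-score standardization can be sketched as below. The use of the population standard deviation is an assumption (the text does not say whether the population or sample form was used), and the input values are made up for illustration.

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize a series to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Illustrative values only
z = z_score([1.0, 2.0, 3.0, 4.0, 5.0])
```

After this step, indicators measured in different units (NTU, μS/cm, mg/L) are on a common scale, so their residuals can be combined into one vector group.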
Table 1

Basic statistical characteristics of data

| Statistical characteristics | Turbidity (NTU) | Conductivity (μS/cm) | Dissolved oxygen (mg/L) |
|---|---|---|---|
| Maximum | 1.23 | 520 | 14.8 |
| Minimum | | 429 | 7.9 |
| Average | 0.46 | 492.36 | 11.54 |
| Variance | 0.48 | 15.99 | 1.33 |

Water quality time series prediction

In this study, the processed data are divided into two parts. The first 2,400 data points are used as the training set to establish the AR model, and the last 769 data points are then predicted dynamically with the model. First, the ADF test was applied to the time series. The P-values of the turbidity, conductivity and dissolved oxygen time series were all less than 0.05, which shows that they meet the stationarity requirement of the AR model. Second, the order of the AR model is determined for each water quality parameter by constructing AR models of different orders for each index. The corresponding BIC values are shown in Figure 8. According to the BIC, the order p of the AR model for turbidity, conductivity and dissolved oxygen is 7, 12 and 7, respectively; that is, linear combinations of the previous 7, 12 and 7 historical monitoring values are used to predict the current value. Figure 9 shows the standardized values, AR model predictions and residuals for the three water quality indicators.
Figure 8

BIC corresponding to the AR model with different orders of turbidity, conductivity and dissolved oxygen. (a) Order of turbidity AR model. (b) Order of AR model for conductivity. (c) Order of AR model for dissolved oxygen.

Figure 9

Standardized value, AR model prediction value and residual error of water quality time series monitoring.

The predicted values are compared with the actual monitoring values to obtain the corresponding residual time series. In general, the AR model predicts well in the stable fluctuation stage of the water quality indexes and tracks changes in the water quality data closely. In the later stage, when the water quality anomaly occurs and the indexes change suddenly, the prediction residuals of turbidity and conductivity increase and decrease markedly, with absolute residual values of 11.35 and 2.11, respectively. The dissolved oxygen concentration in the river is not affected by the abnormal event: its residual sequence shows no obvious change, and its largest absolute residual is 07. Table 2 shows the residual statistics of the water quality time series from June 13 to June 17 in the prediction stage. The mean absolute error, mean square error and root mean square error are all small, which indicates that the AR model predicts the water quality time series well.

Table 2

Prediction and evaluation of standard water quality parameters by AR model

| Evaluating indicator | Turbidity (NTU) | Conductivity (μS/cm) | Dissolved oxygen (mg/L) |
|---|---|---|---|
| Mean absolute error (MAE) | 0.1086 | 0.0453 | 0.0282 |
| Mean square error (MSE) | 0.0287 | 0.0069 | 0.0011 |
| Root mean square error (RMSE) | 0.1694 | 0.0831 | 0.0332 |
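For reference, the three error measures in Table 2 can be computed as in the following sketch. The function name and the input values are illustrative, not taken from the monitoring data.

```python
import math

def error_metrics(actual, predicted):
    """Return (MAE, MSE, RMSE) for two equally long sequences."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    return mae, mse, math.sqrt(mse)

# Illustrative values only
mae, mse, rmse = error_metrics([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 5.0])
```

Since RMSE is the square root of MSE, the last two rows of the table can be checked against each other; for example, the square root of 0.0287 is about 0.1694.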

Abnormal detection of water quality time series

In this paper, the outlier detection is carried out for the residual vector group of pairwise combination of residual time series by using the isolated forest algorithm, and the abnormal water quality can be judged according to the abnormal score. Outlier detection of isolated forest includes two stages (Yu et al. 2020). The first stage is the training stage. In total, 256 subsamples are randomly selected from 2,400 groups of residual vector groups in the first 25 days of time series to construct an isolated tree with maximum depth of 8. In the same way, 100 i-Trees are constructed to form an iForest, and the abnormal scores of each vector group in the training set are obtained.

The second stage is the anomaly detection stage. The outlier scores for the 8 days after the establishment of the isolated forest model are calculated. The receiver operating characteristic (ROC) curve is then drawn from the abnormal scores to evaluate the performance of the anomaly detection method, and a threshold is set to judge whether a water quality abnormal event has occurred and emergency monitoring should start. Outlier detection with the isolated forest was carried out on pairwise combinations of the AR prediction residuals of turbidity, conductivity and dissolved oxygen. The results are shown in Figure 10. The dotted lines in the figure are contour lines drawn from the abnormal score. Most of the water quality residual vectors are concentrated near the point (0, 0) and only a few are far from it; the closer to (0, 0), the denser the contour lines. When the residual vector group of turbidity and conductivity is selected for anomaly detection, many abnormal points clearly deviate from the circular area near (0, 0) where the normal points are concentrated, and some of them lie outside the contour line representing an abnormal score of −0.24. When the other two combinations were selected, the outliers were concentrated inside the contour lines with abnormal scores of −0.24 and −0.20, respectively, and their number decreased. Therefore, abnormal water quality is identified better when the residual vector group of turbidity and conductivity is selected, and the blue pollution warning can be issued according to the detection results.
Figure 10

Outlier detection results of isolated forest algorithm.

Water quality time series anomaly detection performance

In this paper, the ROC curve is used to quantitatively evaluate the performance of the anomaly detection algorithm. The false positive rate (FPR) is taken as the abscissa and the true positive rate (TPR) as the ordinate; TPR and FPR are calculated by Equations (9) and (10):
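$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \quad (9)$$
$$\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} \quad (10)$$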

TP stands for true positive (the actual water quality is abnormal and the detection result is abnormal); FN stands for false negative (the actual water quality is abnormal and the detection result is normal); FP stands for false positive (the actual water quality is normal and the detection result is abnormal); TN stands for true negative (the actual water quality is normal and the detection result is normal). The area under the ROC curve (AUC) is the area enclosed by the ROC curve and the abscissa. The larger the AUC, the better the performance of the anomaly detection algorithm and the higher the warning accuracy.
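
As an illustration of these definitions, the sketch below computes TPR and FPR at a given threshold and the AUC by the trapezoidal rule. It assumes the convention used in this paper that a sample is flagged abnormal when its abnormal score is below the threshold; the function names and the six example scores are made up for the example and are not data from the study.

```python
def confusion_counts(scores, labels, threshold):
    """Return (TP, FN, FP, TN); a score below the threshold is flagged abnormal."""
    tp = fn = fp = tn = 0
    for score, label in zip(scores, labels):
        flagged = score < threshold
        if label == 1 and flagged:
            tp += 1
        elif label == 1:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def tpr_fpr(scores, labels, threshold):
    """True positive rate and false positive rate at one threshold."""
    tp, fn, fp, tn = confusion_counts(scores, labels, threshold)
    return tp / (tp + fn), fp / (fp + tn)


def roc_auc(scores, labels):
    """Area under the ROC curve, sweeping the threshold over every observed score."""
    thresholds = sorted(set(scores)) + [float("inf")]
    points = [(0.0, 0.0)]
    for threshold in thresholds:
        tpr, fpr = tpr_fpr(scores, labels, threshold)
        points.append((fpr, tpr))
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2.0
    return area


# Made-up abnormal scores (lower = more abnormal) and labels (1 = abnormal).
example_scores = [-0.30, -0.25, -0.10, 0.05, 0.08, 0.10]
example_labels = [1, 1, 0, 1, 0, 0]
example_tpr, example_fpr = tpr_fpr(example_scores, example_labels, -0.2)
example_auc = roc_auc(example_scores, example_labels)
print(example_tpr, example_fpr, example_auc)
```

In this toy example, eight of the nine (abnormal, normal) pairs are ranked correctly, so the AUC equals 8/9.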

The analysis of this case shows that the abnormal water quality occurred from June 18 to June 21. The water quality condition label for this period is set to 1 (abnormal), and the rest to 0 (normal). The ROC curve obtained from the abnormal scores of the residual vector group and the water quality condition labels is shown in Figure 11. The AUC obtained when the prediction residual vector group of turbidity and conductivity is used for isolated forest outlier detection is 0.919. With a threshold of −0.0334, water quality is considered abnormal when the abnormal score is less than −0.0334. When the detection rate reaches 80%, the false alarm rate is 9.7%. The AUCs for the combination of turbidity with dissolved oxygen and the combination of conductivity with dissolved oxygen were 0.797 and 0.805, respectively; at a detection rate of 80%, the corresponding false alarm rates were 45.4% and 33.9%. This further shows that the residual vector group of turbidity and conductivity can effectively judge whether the water quality is abnormal, which provides a basis for accurate early warning.
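
The way a false alarm rate is read off at a target detection rate can be sketched as follows. This is an illustrative helper only: `false_alarm_at_detection` is a hypothetical name, the scores and labels are made-up values, and the rule "flag as abnormal when the score is below the threshold" is taken from the paragraph above.

```python
def false_alarm_at_detection(scores, labels, target_tpr=0.80):
    """Return (threshold, detection rate, false alarm rate) for the smallest
    threshold whose detection rate reaches target_tpr.

    A sample is flagged abnormal when its score is strictly below the threshold.
    """
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both abnormal and normal samples are required")
    for threshold in sorted(set(scores)) + [float("inf")]:
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
        if tp / n_pos >= target_tpr:
            return threshold, tp / n_pos, fp / n_neg
    raise RuntimeError("unreachable: an infinite threshold flags every sample")


# Made-up abnormal scores (lower = more abnormal) and labels (1 = abnormal).
demo_scores = [-0.30, -0.25, -0.10, 0.05, 0.08, 0.10]
demo_labels = [1, 1, 0, 1, 0, 0]
chosen_threshold, detection_rate, false_alarm_rate = false_alarm_at_detection(
    demo_scores, demo_labels, target_tpr=0.80)
print(chosen_threshold, detection_rate, false_alarm_rate)
```

Choosing the smallest threshold that meets the target keeps the false alarm rate as low as possible for that detection rate, which is how operating points on different ROC curves can be compared.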
Figure 11

Performance evaluation of the ROC curve for abnormal detection of different water quality index combinations.

AR model prediction forecasts the current water quality index dynamically by continuously introducing the actual monitoring data. When a water quality index time series changes suddenly, the residual time series increases or decreases markedly. The outliers of each residual vector group can then be obtained with the isolated forest algorithm. When the residual time series of only one water quality index fluctuates, the cause may be a sensor abnormality; combining multiple water quality indicators can therefore improve the accuracy of anomaly detection. The ROC curves for single-index and three-index detection are shown in Figures 12 and 13.
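
The dynamic prediction described above can be sketched as a one-step-ahead forecast in which every prediction uses the actual past observations. The sketch below is a minimal illustration under assumptions: the AR coefficients are fitted by ordinary least squares (the fitting method is not stated in this section), the function names are hypothetical, and a synthetic sine wave with one injected jump stands in for a monitored series. An order of 2 is used only because a sine wave satisfies an exact AR(2) relation.

```python
import math


def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ar(series, order):
    """Least-squares AR fit; returns [intercept, phi_1, ..., phi_order]."""
    rows = [[1.0] + [series[t - k] for k in range(1, order + 1)]
            for t in range(order, len(series))]
    targets = series[order:]
    dim = order + 1
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(dim)] for i in range(dim)]
    atb = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(dim)]
    return solve_linear(ata, atb)


def rolling_residuals(coeffs, history, new_values):
    """Residuals (actual - predicted) of one-step forecasts fed with actual data."""
    order = len(coeffs) - 1
    window = list(history[-order:])
    residuals = []
    for actual in new_values:
        predicted = coeffs[0] + sum(coeffs[k] * window[-k] for k in range(1, order + 1))
        residuals.append(actual - predicted)
        window.append(actual)  # the real observation, not the forecast, is carried forward
    return residuals


# Synthetic data (assumed): a sine wave for training and a test segment with one jump.
train_series = [math.sin(0.3 * t) for t in range(200)]
test_series = [math.sin(0.3 * t) for t in range(200, 230)]
test_series[10] += 1.0  # sudden change in the monitored index
coeffs = fit_ar(train_series, order=2)
residuals = rolling_residuals(coeffs, train_series, test_series)
print(coeffs, residuals[9:13])
```

The residual stays near zero while the series follows its usual pattern and jumps at the injected change, which is the signal the outlier detection stage then operates on.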
Figure 12

Performance evaluation of single index abnormal detection ROC curve.

Figure 13

Performance evaluation of ROC curve for abnormal detection of three indexes.


The AUC was 0.821, 0.816 and 0.509 when only the predicted residual of turbidity, conductivity or dissolved oxygen, respectively, was used. The AUC is 0.902 when the three-dimensional vector group composed of the predicted residuals of all three water quality indexes is used for anomaly detection. Adding an index that shows no abnormal change therefore lowers the AUC only slightly (from 0.919 for turbidity and conductivity to 0.902), which has essentially no effect on anomaly detection performance. By contrast, adding an index that responds to the abnormal event increases the AUC significantly (from 0.821 for turbidity alone to 0.919). It is therefore worth predicting as many online monitoring indicators as possible in real time and integrating the prediction residuals of all indicators for anomaly detection, to increase the detection rate of abnormal water quality events and improve early warning efficiency.
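
How the residual series of several indicators are combined into vector groups can be illustrated with a short sketch. The helper names and the residual values below are hypothetical; only the idea of stacking per-indicator residuals into one vector per time step is taken from the text.

```python
from itertools import combinations


def vector_group(residuals, indicators):
    """Stack the residual series of the chosen indicators into one vector per time step."""
    return list(zip(*(residuals[name] for name in indicators)))


def all_groups(residuals, size):
    """Build every vector group that combines `size` indicators."""
    return {combo: vector_group(residuals, combo)
            for combo in combinations(sorted(residuals), size)}


# Made-up prediction residuals for three time steps (illustrative values only).
demo_residuals = {
    "turbidity": [0.1, -0.2, 1.5],
    "conductivity": [0.0, 0.1, 2.0],
    "dissolved_oxygen": [0.05, -0.05, 0.0],
}
pair_groups = all_groups(demo_residuals, 2)
triple_groups = all_groups(demo_residuals, 3)
print(sorted(pair_groups))
```

Each group can then be passed to the outlier detector on its own, so that the AUC of single indicators, pairs and the full three-dimensional group can be compared on the same labels.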

To further test how well this method recognizes low-level abnormal conditions, the standardized turbidity and conductivity data at 04:00 on each day from June 1 to June 7 are doubled to form the test set, with the label set to 1; the resulting time series are shown in Figures 14 and 15, where the red areas indicate the constructed abnormal situations. Taking the data from May 16 to May 28 as the training set, an AR prediction model of order 7 is established for turbidity and one of order 12 for conductivity. The predicted values of turbidity and conductivity from 00:00 on June 1 to 00:00 on June 8 are obtained by dynamic prediction, and the corresponding prediction residuals are calculated. The predicted and actual values of turbidity and conductivity are shown in Figures 16 and 17. The two-dimensional prediction residual vector group corresponding to the training set is again used to construct 100 i-Trees, with a subsample size of 256, to form an iForest. Using the isolated forest, the abnormal scores from May 1 to June 8 are obtained. The ROC curve based on the abnormal scores and labels is shown in Figure 18, with an AUC of 0.745. Detection performance is still better than with the prediction residuals of the turbidity AR model alone (AUC = 0.691) or the conductivity AR model alone (AUC = 0.654), but it is lower than in the actual case. The main reason is that the change in water quality is small, so the prediction residuals are small and the anomaly detection performance decreases.
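
The construction of this low-level test set can be sketched as follows. Several details are assumptions made for illustration: z-score standardization, a 15-minute sampling interval (inferred from 2,400 vector groups over 25 days, that is, 96 samples per day), a single doubled sample at 04:00 each day, and a synthetic daily cycle in place of the monitoring data. The function names are hypothetical.

```python
import math
import statistics

SAMPLES_PER_DAY = 96  # 2,400 groups over 25 days implies 15-minute sampling
ANOMALY_SLOT = 16     # index of the 04:00 sample within a day (assumed single sample)


def standardize(values):
    """Z-score standardization using the mean and standard deviation of the series."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def inject_daily_anomaly(standardized, n_days, factor=2.0):
    """Multiply the 04:00 sample of each day by `factor` and label it 1 (abnormal)."""
    series = list(standardized)
    labels = [0] * len(series)
    for day in range(n_days):
        idx = day * SAMPLES_PER_DAY + ANOMALY_SLOT
        series[idx] *= factor
        labels[idx] = 1
    return series, labels


# Synthetic seven-day series with a daily cycle (assumed data, not the study's).
raw = [10.0 + math.sin(2.0 * math.pi * t / SAMPLES_PER_DAY)
       for t in range(7 * SAMPLES_PER_DAY)]
z = standardize(raw)
test_series, test_labels = inject_daily_anomaly(z, n_days=7)
print(sum(test_labels), z[ANOMALY_SLOT], test_series[ANOMALY_SLOT])
```

Because a standardized value is doubled rather than replaced, the injected change scales with how far that sample already lies from the mean, which is consistent with the text's description of these events as low-level anomalies.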
Figure 14

Constructed turbidity anomaly.

Figure 15

Constructed conductivity anomaly.

Figure 16

Monitoring value and prediction value of turbidity.

Figure 17

Monitoring value and prediction value of conductivity.

Figure 18

Performance evaluation of turbidity conductivity ROC curve.


In conclusion, the more high-frequency water quality indexes that change in response to a water quality anomaly, and the larger the magnitude of those changes, the more pronounced the detection effect and the better the early warning of abnormal water quality events.

Through the analysis of water quality time series, abnormal water quality events can be identified in time, enabling supervision of water environment quality and timely warning and prediction of water pollution events:

  • (1) The method can dynamically judge the degree of water quality abnormality, and the fuzzy comprehensive evaluation method can grade the abnormality of surface water quality to realize graded early warning.

  • (2) The proposed method is used to detect abnormalities in the turbidity, conductivity and dissolved oxygen time series of urban drinking water. An AR prediction model is established to obtain the prediction residuals, and outlier detection is carried out with the isolated forest algorithm. The results show that the AUC reaches 0.919 when the prediction residual vector group of turbidity and conductivity is used for isolated forest anomaly detection, which verifies that the method has good anomaly detection performance. For turbidity and conductivity time series with a low degree of anomaly, the AUC decreases to 0.745 when the same method is used.

  • (3) The more water quality indexes that change because of a water quality abnormality, and the greater the range of change, the higher the detection rate of abnormal events and the stronger the early warning ability.

The authors are grateful for the support of the 2021 Henan Provincial Water Conservancy Science and Technology Tackle Project (No. 2021063). Special thanks go to the reviewers for their constructive comments and suggestions, which improved the quality of this manuscript.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Ahmed, I., Nazzal, Y. & Zaidi, F. 2018 Groundwater pollution risk mapping using modified DRASTIC model in parts of Hail region of Saudi Arabia. Environmental Engineering Research 23 (s1-2), 145–162.
Bai, Z., Lu, J., Zhao, H., Velthof, G. L., Oenema, O., Chadwick, D., Williams, J. R., Jin, S., Liu, H., Wang, M., Strokal, M., Kroeze, C., Hu, C. & Ma, L. 2018 Designing vulnerable zones of nitrogen and phosphorus transfers to control water pollution in China. Environmental Science & Technology 52 (16), 8987–8988.
Burnett, R. T., Pope, C. A., 3rd, Ezzati, M., Olives, C., Lim, S. S., Mehta, S., Shin, H. H., Singh, G., Hubbell, B., Brauer, M., Anderson, H. R., Smith, K. R., Balmes, J. R., Bruce, N. G., Kan, H., Laden, F., Prüss-Ustün, A., Turner, M. C., Gapstur, S. M., Diver, W. R. & Cohen, A. 2014 An integrated risk function for estimating the global burden of disease attributable to ambient fine particulate matter exposure. Environmental Health Perspectives 122 (4), 397–403.
Chen, H., Chen, A., Xuc, L., Xie, H., Qiao, H., Lin, Q. & Cai, K. 2020 A deep learning CNN architecture applied in smart near-infrared analysis of water pollution for agricultural irrigation resources. Agricultural Water Management 240, 106303.
Diendéré, A., Nguyen, G., Del Corsoc, J. P. & Kephaliacosc, C. 2018 Modeling the relationship between pesticide use and farmers’ beliefs about water pollution in Burkina Faso. Ecological Economics 151, 114–121.
Díez-Del-Molino, D., García-Berthou, E., Araguas, R. M., Alcaraz, C., Vidal, O., Sanz, N. & García-Marín, J. L. 2018 Effects of water pollution and river fragmentation on population genetic structure of invasive mosquitofish. The Science of the Total Environment 637-638, 1372–1382.
He, H. M. 2013 Research on Water Quality Anomaly Detection Method Based on Multi-Sensor Data Fusion. Zhejiang University, Hangzhou, Zhejiang, China.
Hou, D.-B., Chen, Y., Zhao, H.-F., Huang, P.-J. & Zhang, G.-X. 2013 Water quality anomaly detection method based on RBF neural network and wavelet analysis. Transducer and Microsystem Technologies 32 (2), 138–141.
Jindal, H., Saxena, S. & Kasana, S. S. 2018 A sustainable multi-parametric sensors network topology for river water quality monitoring. Wireless Networks 24 (8), 3241–3265.
Khatmullina, R. M., Safarova, V. I. & Latypova, V. Z. 2018 Reliability of the assessment of water pollution by petroleum hydrocarbons and phenols using some of total indices. Journal of Analytical Chemistry 73 (7), 728–733.
Liu, J., Liu, R., Zhang, Z., Cai, Y. & Zhang, L. 2019 A Bayesian network-based risk dynamic simulation model for accidental water pollution discharge of mine tailings ponds at watershed-scale. Journal of Environmental Management 246 (15), 821–831.
McComb, D. W., Lengyel, J. & Carter, C. B. 2019 The G-filter: a simple high-tech solution to India's water pollution. MRS Bulletin 44 (12), 914–915.
Wang, B., Liu, S. D., Jin, B. & Qiu, W. 2019 Fine imaging by using advanced detection of reflected waves in underground coal mine. Earth Sciences Research Journal 23 (1), 93–99.
Wilson, N. J., Mutter, E., Inkster, J. & Satterfield, T. 2018 Community-based monitoring as the practice of indigenous governance: a case study of indigenous-led water quality monitoring in the Yukon River Basin. Journal of Environmental Management 210 (15), 290–298.
Yang, H. D., Liu, B. Y. & Huang, J. H. 2018 Forecast model parameters calibration method for sudden water pollution accidents based on improved Bayesian-Markov chain Monte Carlo. Control and Decision 33 (4), 679–686.
Yin, H., Yu, Q. J., Hou, D. B., Huang, P. J., Zhang, G. X. & Zhang, H. J. 2019 In-situ detection of water quality anomaly with UV/Vis spectrum based on supervised learning. Spectroscopy and Spectral Analysis 39 (2), 491–499.
Yu, D., Mao, Y., Gu, B., Nojavan, S., Jermsittiparsert, K. & Nasseri, M. 2020 A new LQG optimal control strategy applied on a hybrid wind turbine/solid oxide fuel cell/ in the presence of the interval uncertainties. Sustainable Energy, Grids and Networks 21, 100296.
Zuo, X., Dong, M., Gao, F. & Tian, S. 2020 The modeling of the electric heating and cooling system of the integrated energy system in the coastal area. Journal of Coastal Research 103 (sp1), 1022.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).