We develop an approach for water quality time series monitoring and contamination event detection. The approach combines affine projection algorithms and an autoregressive (AR) model to predict water quality time series. Then, we apply online change-point detection methods to the estimated residuals to determine the presence, or not, of contamination events. Particularly, we compare the performance of four change-point detection methods, namely, sequential probability ratio test (SPRT), cumulative sum (CUSUM), binomial event discriminator (BED), and online Bayesian change-point detection (OBCPD), by using residuals obtained from four water quality time series, chlorine, conductivity, total organic carbon, and turbidity. Our fundamental criterion for the performance evaluation of the four change-point detection methods is given by the receiver operating characteristic (ROC) curve which is characterized by the true positive rate as a function of the false positive rate. We highlight with detailed experiments that OBCPD provides the best performance for large contamination events, and we also provide insight on the choice of change-point detection algorithms to consider for designing efficient contamination detection schemes.
The expansion of the world's population leads to the scarcity of drinking water and drives increasing exploitation of untapped water resources. Among them, the desalination of seawater or saline ground water and the purification of water originating from rivers or lakes are of growing interest. Chemical, biological, and physical parameters defining water quality criteria and quality thresholds imposed by government agencies and health institutions must be satisfied. This requires the implementation of monitoring procedures in order to assess the conformity of potable water against specified objectives. Facilitated by the advent of information communication and technology, combined with the development of state-of-the-art signal processing algorithms, new approaches for efficient online water quality monitoring are investigated. The benefit of such approaches is to allow for online monitoring of the measurements of large scale water systems and early detection of intentional or unintentional contamination events.
The design of an efficient online water quality monitoring approach is complicated by several factors. First, changes within the system including variation of source water or the treatment process, the fluctuation of water demand, and changes in the water quality sensors can create anomalous water quality signatures, different from contamination events. Distinguishing normal behavior from abnormal behavior is difficult in these situations. Second, due to the complex characteristics of water quality, a physical model describing water quality behavior is difficult to derive and implement in real time. This complexity has motivated the extensive use of data-driven models for predicting water quality behavior. Although data-driven approaches have proven to be an efficient alternative, several challenges persist including: accurate detection of contamination events with low false positive rates (FPRs) and low detection latency; definition of an adequate trade-off between sensitivity to contamination events and remaining insensitive to non-event changes; characterization of faults due to water quality sensor failures, numerical algorithmic instabilities and water process operation. In an attempt to alleviate some of these challenges, research in both forecasting and monitoring has been conducted.
Water quality forecasting and monitoring
Time series forecasting algorithms have seen widespread application in hydrology. As examples, Yu et al. (2004) combine a support vector machine with chaos theory to analyze chaotic time series, and Phoon et al. (2002) explore an inverse method to forecast nonlinear time series. In a similar vein, artificial neural network methods have been applied in Palani et al. (2008) to forecast water quality data. To detect contamination events, classification methods have also been widely examined. As an example, Yang et al. (2009) develop a real-time method for contaminant detection and classification, while Branisavljevic et al. (2011) examine a context classification method for real-time anomaly detection. A real-time, threshold-based approach for detecting intentional contamination events is proposed by Byer (2005). Studies concerning attributes of water quality sensors in the framework of contamination events have also been investigated; for more details see McKenna et al. (2006), Hatzikos et al. (2007), Yang et al. (2009) and Koch & McKenna (2010). Work on the subject is presented in various reports and reviews, for example Grayman et al. (1988), Hasan et al. (2004, 2005), Chau (2006), Murray et al. (2010) and Perelman et al. (2012). A general framework for event detection is to combine a forecasting method with an algorithm that can analyze the residuals of the forecast to determine the onset of an event. Change-point detection methods provide a means of event determination and are being applied for water quality monitoring.
Change-point detection methods have been the subject of intensive research investigations. Due to their sequential nature and their adequacy to address online problems, sequential probability ratio test (SPRT; Wald 1945), cumulative sum (CUSUM; Page 1954), and generalized likelihood ratio (GLR; Willsky & Jones 1976) are among the most popular frequentist change-point detection methods. The commonality of these widely used algorithms is that the logarithm of the likelihood ratio between two consecutive intervals, for the same time series, is monitored and a change-point is declared whenever these intervals possess different statistical properties. In spite of their widespread success for practical applications, their major limitation comes from their dependence on the knowledge of probability distribution functions (pdf) of the data. Furthermore, other approaches, such as spectral based methods (Adak 1998), maximum likelihood estimation (Guralnik & Srivastava 1999), and subspace identification (Katayama 2005) have also been explored. The limitation of all these change-point detection methods is that they rely on prespecified thresholds, which are difficult to establish a priori. However, most of the Bayesian change-point detection (BCPD) approaches have been used in an offline setting for retrospective studies, which means that the entire data set is required before computation of the probability of change-point. To overcome these limitations, and those of the frequentist methods, online BCPD (OBCPD) approaches have been introduced by Adams & MacKay (2007).
The purpose of this paper is to provide an efficient methodology for water quality monitoring and contamination event detection. We review a recent development in time series forecasting and use it to forecast different water quality time series. The performance of four change-point detection algorithms is compared using residuals from the forecast. Note that the change-point detection algorithms examined here can be applied to residuals derived from any predictive modeling algorithm. We place emphasis on the smallest detectable contaminant, and the rate of false and true positives, which are characterized by the receiver operating characteristic (ROC) curve. Particularly, we show that our methodology is able to distinguish normal operation changes from contamination events. For each change-point detection method, and a particular contamination event, we provide the ROC curve. The remainder of the paper is organized as follows. The following section, ‘Water quality time series prediction’, presents the time series prediction method used to learn the parameters of the autoregressive (AR) model describing the water quality data. Next is ‘Water quality time series monitoring’, which introduces the frequentist approaches used in this paper, namely SPRT and CUSUM as well as the binomial event discriminator (BED) and the OBCPD. This is followed by ‘Application of change-point detection methods to water quality data’ in which the experiments with water quality time series are examined. The final section provides conclusions and future work.
WATER QUALITY TIME SERIES PREDICTION
WATER QUALITY TIME SERIES MONITORING
We present in this section the frequentist algorithms, SPRT and CUSUM, the BED and the OBCPD.
Frequentist approaches for change-point detection
Frequentist approaches are based on hypothesis testing, and they are designed to signal an alert when a change-point occurs. We present two frequentist approaches, derived from Wald's hypothesis testing, used here for water quality time series monitoring, namely SPRT and CUSUM.
Sequential probability ratio test
Cumulative sum test
Binomial event discriminator
Bayesian approach to change-point detection
APPLICATION OF CHANGE-POINT DETECTION METHODS TO WATER QUALITY DATA
We analyze the performance of the four change-point detection methods introduced previously. First, we incorporate contamination events into the measured background data. Then, we apply the APA to learn the coefficients of the AR model describing the water quality time series. The obtained residuals are subsequently used with the change-point detection methods to monitor water quality.
We experiment using four water quality time series, which have been selected due to these parameters having extensive use in online water quality monitoring. These data are taken from the ‘Station C’ data set supplied as part of the Event Detection System Challenge organized by the US Environmental Protection Agency (EPA 2013). All signals are collected at the same location in an operating water distribution network with a 2-minute sampling rate. The measured signals are as follows.
Chlorine: Chlorine is used in the majority of drinking water networks as a disinfectant to eliminate microbial and other contaminants. Typical values are maintained between 0.5 and 1.0 mg/L.
Conductivity: Measures the ability of water to transmit an electrical current and is a measure of the amount of dissolved solids in the water. Typical values in drinking water networks range from 50 to 500 μS/cm.
Total organic carbon (TOC): Gives the amount of organic carbon dissolved in the water. Organic carbon in water is generally from decaying naturally occurring organic matter in the source waters. TOC levels are generally below 10 mg/L.
Turbidity: Measures the water clarity and is often subject to sharp increases due to changes in flow that suspend sediment within the water. Typical values are near 0.0 with short-lived increases to 5.0 NTU or more being common.
The background values of these water quality parameters can be altered by the presence of contaminants added to the water (see Hall et al. 2007). Therefore, the objective is to monitor these variables in real-time and rapidly and accurately detect any abnormal situations.
We present the tuning parameters that have been used, and which allow the figures that will be presented in the next subsection to be obtained. Recall that both SPRT and CUSUM consider the knowledge of the pdf of the data. Here, we use Gaussian pdf, where in the absence of contamination the mean and variance of our data are respectively given by and , which are determined by computing respectively the mean and standard deviation of the signal of interest, and in the absence of contamination. They change to and in the presence of contamination, and in this study we consider that and for all the measurements except the TOC, where and . A and B were computed by using Equation (5), and determine the thresholds for SPRT, the threshold for the CUSUM test is given by . We summarize the values of the tuning parameters for the SPRT and CUSUM in Table 1. As mentioned previously, the BED considers as tuning parameters the threshold , the window W, the probability of failure , and the threshold h; the probability of failure has been fixed at 0.5. This choice is driven by the trade-off that one wishes to establish between the number of false alarms and the number of missed detections. For OBCPD (Table 2), the initial tuning parameters are the center , the precision , the shape , the scale , and the hazard rate H; for these tuning parameters we fix , , for all the water quality parameters and we attribute them the same hazard rate , except for the conductivity , which is due to the fact that the conductivity measurements present more noise than the rest of the measurements. For the scale , the initial parameter has been determined empirically. In all cases, these parameters are set without any measurements of the contamination events, as would be the case in an actual application. The parameters for the BED are given as follows: , , , . These tuning parameters are maintained intact for all the water quality parameters.
We present the results of our experiments using the aforementioned tuning parameters for each of the water quality time series. Through Figures 12345–6: (a) shows the raw data; (b) presents the raw data and the incorporated contaminants; (c) provides the residuals obtained from the prediction results and in the presence of contaminations; (d), (e), (f), (g) depict respectively the results obtained with the SPRT, the CUSUM, the BED, and the OBCPD. The y-axes, for (d), (e), (f), (g) show respectively the computed sequential probability ratio, the computed CUSUM, the computed probability with the binomial distribution, and the computed run length.
The first result of this experiment is given by the residuals. As can be seen, the values of the residuals are low, and in the order of 0 for the mean, and this is valid for the presence or absence of contamination. Also, this result is valid when the chlorine measurement is in a steady state (constant mean) or transient stage (mean deviating from a constant value). The variance of the residual is also relatively low. This is an indication of the good performance of the forecast, where we analyzed approximately 9 days of data, but Figure 1 places emphasis on the interval (1.2 × 105) in order to illustrate the different scenarios. The presence of contamination is declared whenever the sequential probability ratio and the CUSUM cross their thresholds. The values of the tuning parameters that have been used are given in Tables 1 and 2. Throughout the figures, we observe that the four change-point detection algorithms react with different performances to the presence of contamination events. However, they also present a small delay to detections, due to the data buffering process necessary to construct the sequence of samples, in the context of SPRT and CUSUM. As this delay is mainly constant, it can be subtracted in real operations, if one is interested in determining the exact time at which the change-point occurred. As will be evidenced later, the OBCPD provides the best performance in terms of fewer missed detections and false positives for a certain category of contamination events. Moreover, the developed algorithms are able to detect contamination events, even when they occur simultaneously to the change in the background chlorine levels due to changes in network operation, except in some cases for the CUSUM test.
During this experiment, we used the same tuning parameters as those used for chlorine, except for the the OBCPD, where the scale has been changed to and the hazard rate H to . Experiments with conductivity show that the four change-point detection algorithms provide different performances in terms of contamination event detection. Particularly, the results obtained with the SPRT and the CUSUM do not outperform those given by the BED and the OBCPD, where OBCPD provides the best performances in terms of detection. However, these results are sensitive to the amount of noise in the data (Figure 2). In Figure 3, we observe that the residuals exhibit a high level of noise, which is a consequence of the high level of noise in the conductivity measurement and in this interval of time, which impacts the change-point. In this high noise situation, all the change-point detection methods fail to provide good results, due to their difficulty in distinguishing normal changes from contamination events. However, after this interval of time, BED and OBCPD are able to detect the contaminations.
Total organic carbon [mg/L]
The experiments with TOC have been conducted with the tuning parameters indicated in Table 1. For OBCPD, the tuning parameters have been maintained intact as those used for chlorine except for the scale . The results presented here show a low level of noise in the computed residuals, which is also translated into the results of the change-point algorithms, where we observe that the four change-point algorithms react with good performances to the presence of contaminations (Figure 4). The large spikes in the background TOC (Figure 5, after 3.8 × 105s) create difficulties for all four change-point algorithms.
The results obtained with the turbidity data (Figure 6) show the sensitivity of the change-point detection methods to the contaminations. Note that during this experiment we used the same tuning parameters as those used for the experiment with the chlorine. Turbidity, similar to TOC, has very low variance in residuals. This causes SPRT and CUSUM to typically identify two change-points, one at the start of event and one at the end. BED results continue to be relatively wide due to the longer time period chosen for the length of window W. OBCPD is the only method that only identifies a single change-point for each event.
True positive TP: occurs when the algorithm detects a change, and there is a change.
False positive FP: corresponds to situations where the algorithm detects a change, whereas there is no change.
False negative FN: occurs when the algorithm does not declare a change and there is a change, also known as missed detection.
True negative TN: the algorithm does not declare a change and there is no change.
High strength events
In this section we present ROC curves obtained for high strength events, and for both chlorine and turbidity.
Performance evaluation for chlorine
Table 3 presents the ROC for the chlorine measurement and shows that at a 5% FPR, OBCPD provides the highest TPR value compared to the other change-point detection algorithms. Also, the result obtained with OBCPD shows a TPR equal to 97% for a contamination strength higher than or equal to 7. We observe that the performances of the four algorithms improve as we increase the strength of the contamination. Figure 7 presents the ROC curves corresponding to the cases where the contamination strength is equal to 5 (left) and 8 (right).
Performance evaluation for turbidity
Similarly to Table 3, for the turbidity measurement we observe that at a 5% FPR, the OBCPD provides the highest TPR value compared with the other change-point detection algorithms. Figure 8 presents the ROC curves corresponding to the cases where the contamination strength is equal to 5 (left) and 8 (right). ROC results for turbidity are generally lower than for chlorine, especially for SPRT and CUSUM. This is due to two change-points being identified for each event in the lower noise data as discussed in the ‘Results’ section. There, the second change-point (end event) is counted as a false positive.
Low strength events
Here, we focus on the cases where the contamination strength is lower than or equal to 1, and for both chlorine and turbidity.
Performance evaluation for chlorine
Figure 9 shows the results of the experiments where the previous tuning parameters are maintained. These ROC curves indicate that for a contamination strength lower than or equal to 0.75, the TPR becomes small, except for the results obtained with BED, for a FPR equal to 5% (Table 4). BED introduces a delay into the change-point detection which causes an increase in the number of false positive time steps and a slower rise in the ROC curve. We summarize the results of our experiments in Table 5. Figure 9 presents the ROC curves corresponding to the cases where the contamination strength is equal to 0.5 (left) and 1 (right).
Performance evaluation for turbidity
Here, we present the performance evaluation for the turbidity at 5% of FPR for low strength events, lower than or equal to 1. SPRT and CUSUM are again penalized for two change-points at every event. Figure 10 presents the ROC curves corresponding to the cases where the contamination strength is equal to 0.5 (left) and 1 (right). We summarize the results of our experiments in Table 6.
CONCLUSIONS AND FUTURE WORK
This paper demonstrates the use of APA with an AR model for accurate predictions of water quality time series. We also introduce the use of OBCPD for water quality event detection. The OBCPD is compared with three other online algorithms using ROC calculations and demonstrates superior performance for events with large contamination strengths, whereas for smaller contamination strengths the best performances of the change-point detection methods are given by the frequentist methods. We observed, for example, for contamination strength equal to 0.75, the frequentist approaches, due to the fact that their tuning parameters calculated in the presence and in the absence of contamination are supposed to be known a priori, yield the best performance. The ROC curves associated with OBCPD, for the chlorine for instance, showed that the TPR is higher than 97% at the expense of a 5% FPR, and from a contamination strength equal to 7. This result is valid for the rest of the water quality time series examined here.
A reason for the good performances provided by OBCPD, for contamination strength overtaking 4, comes from its ability to combine Gaussian and Gamma distributions to efficiently predict the residuals of water quality time series data irrespective of their prior knowledge. A significant advantage of the OBCPD is that it is not necessary to set any threshold to detect changes. These two factors represent a crucial asset to water utility companies when designing online water quality monitoring approaches. Furthermore, our experiments demonstrated that the smallest detectable fault corresponds to the contamination strength being equal to 1 times the standard deviation of the recent data. This study highlights that water utility companies can consider each of the experimented algorithms, depending on the expected performances and the possibility to have access to the data a priori. For example, if the data are unavailable it is preferable to have recourse to OBCPD, whereas if the availability of the data allows determination of the descriptive statistics of the water quality time series a priori in the presence and in the absence of contamination, then the frequentist approaches represent the best alternatives. However, in practice, it is difficult to establish the characteristics of the contamination a priori. This suggests the use of OBCPD as the adequate solution.
In this paper, we considered that the structure of the contamination follows a Gaussian distribution with a relatively sharp change; however, situations where the change has a very slow variation represent a viable alternative. Therefore, improving the characteristics of the change-point detection algorithms to make them rapidly reactive to such changes is our future work.