Methods to improve the operational efficiency of a water supply network by early detection of anomalies are investigated by making use of the data streams from multiple sensor locations within the network. The water supply network is a demonstration site of Vitens, a Dutch water company that has several district metering areas where flow, pressure, electrical conductance and temperature are measured and logged online. Three different machine learning approaches are tested for their feasibility to detect anomalies. In the first approach, day-dependent support vector regression (SVR) models are trained for predicting the measurement signals and compared to straightforward models using mean and median estimates, respectively. Using SVRs or the averaged data as real-time pattern recognizers on all available signals, large leakages can be detected. The second approach utilizes adaptive orthogonal projections and reports an event when the number of hidden variables required to describe the streaming data to a user-defined degree (energy-level threshold) increases. As a third approach, (unsupervised) clustering techniques are applied to detect anomalies and underlying patterns from the raw data streams. Preliminary results indicate that the current dataset contains too few events and patterns to harness the potential of these techniques.
INTRODUCTION
Water utilities collect ever-increasing amounts of data from water distribution systems via loggers and telemetry systems. The unprecedented availability of data can help improve the operation of water supply systems by increasing the reliability and availability of supply and by supporting safe and efficient operation (Wang et al. 2015).
Interpreting this data is challenging, because of the volume of data and possible sensor malfunctions, as well as specific ‘fingerprints’ of the data related to seasonal variation, location of water supply, the daily and weekly cycles, and so forth. Moreover, inaccuracies in hydraulic models, poor calibration of measurements, and lack of system operational feedback exacerbate the data analysis problems. Hence, a continuous improvement of operational practice is desired, preferably aided by real-time (monitoring) techniques.
In the past, several approaches have been explored. For example, Izquierdo et al. (2007) assess anomalies using a hybrid model composed of a deterministic part (flow rates and head at the nodes) coupled with a state estimation technique and a machine learning algorithm (neural networks). Kühnert et al. (2014) applied principal component analysis to detect novelties in water distribution network (WDN) sensor data streams. In a recent study by Aminravan et al. (2015), a hierarchical rule-based approach was suggested to also account for the spatial aspect of events in WDNs.
In this work we explore alternative machine learning and data processing methods for anomaly detection on WDN sensor data streams with the goal to improve the operational efficiency. Three methods to recognize unusual patterns in these sensor streams are applied: supervised and unsupervised machine learning, and orthogonal transformation with subsequent tracking of hidden variables. We make use of interrelated sensor measurements at multiple locations in the same distribution network to enhance model development and validation.
METHOD
Machine learning techniques are combined with trained, day-dependent models to function as real-time pattern recognizers and classifiers of anomalies. The dataset, water supply network, and the pattern recognizers and classifiers are outlined below.
Water supply network and dataset
Water supply network within the VIP. District metering areas (DMAs) are depicted as blue regions. Labeled green dots indicate sensor locations. The network consists of transport pipes (red) and secondary and tertiary pipes (blue). Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/ws.2016.062.
Machine learning and event detection
The sensor signals, averaged over a 5 min sampling interval, are processed with machine learning methods, with orthogonal transformation and hidden variable tracking, and with straightforward median and average models. An event detection algorithm is subsequently applied to the model output. The following three methods are carried out.
(M1) (supervised) machine learning by epsilon-Support Vector Regression (ɛ-SVR), where data are first split into a training and test set, and labeled where necessary. We make use of ɛ-SVR as provided by the Scikit-learn Python module (Pedregosa et al. 2011) and apply the procedure as outlined by Mounce et al. (2011). ɛ-SVR is characterized by computing a linear regression function in feature space, where input data are mapped via a nonlinear function in such a way that a so-called ɛ-insensitive loss function is minimized (Basak et al. 2007). Event monitoring and subsequent detection of anomalies are performed as outlined by Mounce et al. (2011). In short, an anomaly is detected when a pre-specified number of measurements within a certain time frame differ from the modelled signal by more than a certain tolerance width. This tolerance is a dynamic bandwidth of three standard deviations around the model prediction. The anomaly, or event, is then assigned a ‘surprise score’, a measure of severity (Mounce et al. 2011). Anomalies in sensor data may be caused by, e.g. pipe leaks, sensor malfunctions, and unusual demand patterns. Trained SVR models are compared to median and average signals over selected periods.
(M2) (unsupervised) clustering for pattern recognition and event detection. We make use of clustering algorithms in an attempt to obtain underlying patterns that may be present in the data. Because the data is expected to display patterns with daily and weekly periodicities, additional variables are added to denote the timestamp of each data point, incorporating these periodicities. The extended dataset is explored with various clustering algorithms, including k-means clustering and several variations of the Gaussian Mixture Models method provided by the Scikit-learn Python module (Pedregosa et al. 2011). Resulting cluster assignments are investigated for underlying patterns within the dataset and are compared with known events and outliers in the data. To further investigate outliers in the dataset, a daily average signal is computed for each measured variable and subtracted from the signal, and the cluster analysis is repeated on the residual dataset.
(M3) orthogonal transformation and tracking hidden variables. The main idea behind this simple approach is to take the multidimensional time series data stream and reduce its dimensionality by using adaptive orthogonal projections at each time step. Given n numerical data streams whose values are observed at each sampling time t, one can incrementally find correlations and hidden variables, which summarise the key patterns in the entire multidimensional dataset. For this purpose, we implemented a modified version of the ‘Streaming Pattern dIscoveRy on multIple Time series’ (SPIRIT) algorithm (Papadimitriou et al. 2005), which resembles principal component analysis calculations. The algorithm performs incremental updates of the weights of a small set of orthogonal components, say k, and further tracks the number of components that is needed to satisfy a predefined threshold (energy level). If more than k orthogonal components (these will be called hidden variables) are needed, it is expected that an unusual pattern is occurring. This new hidden variable indicates the possible occurrence of a new pattern or anomaly in the streaming dataset. The value of k is adapted on the fly. In addition, by tracking and updating the vector of the weights of the orthogonal projections, one can estimate the contribution of each of the time series measurements to the hidden variables.
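The event rule of M3 (report an event when more hidden variables are needed to reach the energy threshold) can be illustrated with a small batch analogue. The sketch below is not the modified SPIRIT implementation and does not perform its incremental weight updates; as a stand-in, it recomputes a singular value decomposition over a trailing one-day window and counts the orthogonal components needed to reach the threshold. The synthetic, mean-centred flows, the window length, the 0.95 threshold, and all names are assumptions made for illustration.

```python
import numpy as np

def hidden_variable_count(window, energy_threshold=0.95):
    """Smallest number of orthogonal components whose cumulative
    energy share reaches the threshold (batch stand-in for SPIRIT)."""
    s = np.linalg.svd(window, compute_uv=False)
    share = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(share, energy_threshold)) + 1
    return min(k, s.size)

# Hypothetical data: three mean-centred DMA flows that follow one
# shared daily cycle at different scales, plus small sensor noise.
rng = np.random.default_rng(2)
samples_per_day = 288                      # 5 min sampling interval
t = np.arange(7 * samples_per_day)
pattern = 20.0 * np.sin(2 * np.pi * t / samples_per_day)
scales = np.array([1.0, 0.6, 1.4])
flows = np.outer(pattern, scales) + rng.normal(0.0, 0.5, (t.size, 3))

# Hypothetical anomaly: one stream deviates from the shared pattern.
anomaly = slice(1620, 1680)
flows[anomaly, 1] += 30.0

# Track the number of hidden variables over a trailing one-day window.
W = samples_per_day
k_at = np.zeros(t.size, dtype=int)         # 0 means "window not yet full"
for i in range(W - 1, t.size):
    k_at[i] = hidden_variable_count(flows[i - W + 1 : i + 1])

# An event is reported at every time step where k increases.
events = np.flatnonzero(np.diff(k_at[W - 1 :]) > 0) + W
print("event time steps:", events)
```

In this toy setting one hidden variable describes the three correlated flows until the anomaly enters the window, and the extra variable disappears once the anomaly has left the window again. An incremental method such as SPIRIT avoids recomputing the decomposition at every step, which is what makes it attractive for streaming data.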
RESULTS AND DISCUSSION
Results are presented in line with the methods M1, M2, and M3 as outlined in the Method section.
M1: supervised machine learning by ɛ-SVR
Anomaly detection with three models applied to a 2-day period (May 1 and May 2, 2015, horizontal axis) of flow data (in cubic metres per hour, vertical axis) for sensor location FR-MLKA. Shown are the sensor signal (red), SVR model (orange curve), mean model (green), and median model (blue). The gray shaded region indicates the tolerance width of the SVR model. An anomaly was detected just after 8:00 h AM on May 2 (framed rectangle). Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/ws.2016.062.
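The M1 detection rule from the Method section can be sketched on synthetic data as follows. This is a minimal illustration under stated assumptions, not the procedure of Mounce et al. (2011): the flow profile, the SVR hyperparameters, the per-time-of-day standard deviation used for the tolerance band, and the window settings (9 of the last 12 samples) are all hypothetical, and the surprise score is not computed.

```python
import numpy as np
from sklearn.svm import SVR

SAMPLES_PER_DAY = 288  # 5 min sampling interval

def synthetic_weekday_flows(n_days, rng):
    """Hypothetical flow profiles (m3/h) for one weekday, one row per day."""
    t = np.arange(SAMPLES_PER_DAY) / SAMPLES_PER_DAY   # fraction of the day
    base = 50 + 20 * np.sin(2 * np.pi * (t - 0.25)) + 10 * np.sin(4 * np.pi * t)
    return t, base + rng.normal(0.0, 2.0, size=(n_days, SAMPLES_PER_DAY))

def detect_events(signal, model, sigma, n_sigma=3.0, window=12, min_count=9):
    """Flag time steps where at least `min_count` of the last `window`
    samples lie outside the band model +/- n_sigma * sigma."""
    outside = np.abs(signal - model) > n_sigma * sigma
    counts = np.convolve(outside.astype(int), np.ones(window, dtype=int))
    return counts[: signal.size] >= min_count

rng = np.random.default_rng(0)
t, train = synthetic_weekday_flows(6, rng)   # stacked same-weekday patterns

# Day-dependent models: SVR on time of day, plus mean and median profiles.
X_train = np.tile(t, train.shape[0]).reshape(-1, 1)
svr = SVR(kernel="rbf", C=100.0, gamma=20.0, epsilon=0.5)
svr.fit(X_train, train.ravel())
models = {
    "svr": svr.predict(t.reshape(-1, 1)),
    "mean": train.mean(axis=0),
    "median": np.median(train, axis=0),
}
sigma = train.std(axis=0)   # assumed: spread per time of day over training days

# Two unseen days; a hypothetical leak adds 25 m3/h for five hours to one.
_, new_days = synthetic_weekday_flows(2, rng)
normal_day, leak_day = new_days[0], new_days[1].copy()
leak_day[100:160] += 25.0

events = {name: detect_events(leak_day, m, sigma) for name, m in models.items()}
for name, flags in events.items():
    print(name, "flagged samples on leak day:", int(flags.sum()))
```

Requiring several exceedances within a short window, rather than a single one, keeps isolated noisy samples from triggering an event even where the estimated standard deviation is small.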
Events for flow, conductivity, and temperature sensors as detected by SVR during a period from July 2014 until September 2015. Colors indicate aggregated weekly abnormality of the sensor signal, expressed as the sum of individual surprise scores and normalized by the standard deviation of the sensor signal. The severity of detected anomalies is expressed as a ‘surprise score’: 0 to 5 (gray), 5 to 20 (yellow), 20 to 800 (orange), 800 to 100,000 (purple), more than 100,000 (black). The coefficient of determination (R²) is given between brackets. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/ws.2016.062.
M2: using clustering for pattern recognition and event detection
Data for one sensor location projected on two components (calculated with PCA) after subtracting the day-average from each signal. Data are colored according to their cluster assignments. The crosses denote the centroids of the clusters. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/ws.2016.062.
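The M2 steps (adding periodic time variables, subtracting the daily average signal, clustering the residuals, and projecting on two components) can be sketched as follows. The source does not state how the timestamp variables were encoded; the sine and cosine encoding below is an assumption, as are the synthetic flow and pressure signals, the leak, the number of clusters, and all names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

samples_per_day, n_days = 288, 28          # 5 min sampling, four weeks
rng = np.random.default_rng(1)
idx = np.arange(samples_per_day * n_days)
time_of_day = (idx % samples_per_day) / samples_per_day
day_of_week = (idx // samples_per_day) % 7

# Hypothetical flow and pressure signals with a daily cycle and one leak.
flow = 50 + 20 * np.sin(2 * np.pi * time_of_day) + rng.normal(0, 2.0, idx.size)
pressure = 3.0 - 0.01 * flow + rng.normal(0, 0.02, idx.size)
leak = slice(3000, 3100)
flow[leak] += 40.0
pressure[leak] -= 0.5
signals = np.column_stack([flow, pressure])

def periodic_features(time_of_day, day_of_week):
    """Assumed encoding of daily and weekly periodicity as points on a circle."""
    return np.column_stack([
        np.sin(2 * np.pi * time_of_day), np.cos(2 * np.pi * time_of_day),
        np.sin(2 * np.pi * day_of_week / 7), np.cos(2 * np.pi * day_of_week / 7),
    ])

# Extended dataset: measurements plus periodic time variables.
extended = np.column_stack([signals, periodic_features(time_of_day, day_of_week)])
labels_extended = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(extended)
)

# Residual dataset: subtract the average daily profile from each signal.
daily_profile = signals.reshape(n_days, samples_per_day, -1).mean(axis=0)
residual = signals - np.tile(daily_profile, (n_days, 1))

X = StandardScaler().fit_transform(residual)
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      n_init=5, random_state=0)
labels = gmm.fit_predict(X)

# Two-component projection of the residuals, e.g. for a scatter plot.
projected = PCA(n_components=2).fit_transform(X)
leak_label = np.bincount(labels[leak]).argmax()
print("share of all samples in the leak cluster:", (labels == leak_label).mean())
```

With a single large leak in otherwise regular data, the mixture model tends to place the leak samples in a small cluster of their own, which mirrors the observation that only clear outliers are separated while the remaining points form one undifferentiated group.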
Data from one sensor location in a 6-week period. Shown are (from top to bottom) the pipe flow (negative values indicate a direction reversal), pressure, conductivity and temperature. A major leak occurred during this period (cyan data). Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/ws.2016.062.
M3: using orthogonal transformation and tracking hidden variables
Flow data for three DMA locations for January–March 2014. The top figure depicts the measured flow signals in three DMAs (Westeinde, Bilgaard, and Marwei), the bottom figure shows the reconstructed flow data using the orthogonal projections. The algorithm is able to reduce the main hydrodynamics of the multidimensional dataset and describe it with only two hidden linearly independent orthogonal variables (Figure 7).
One of the flow signals with the hidden variables describing the main patterns and structure of the multidimensional data (2 weeks). The top figure depicts one of the measured flow signals at Bilgaard, within which a possible anomaly is highlighted; the bottom figure shows the uncovered hidden variables including anomaly detection in the multi-dimensional dataset.
Due to the identified possible anomaly (an irregular flow pattern followed by a spike) in the original data, the algorithm is no longer able to describe the signal with the two hidden variables while maintaining sufficient accuracy. Thus, in order to correctly describe the anomaly, the algorithm temporarily generates an additional third hidden variable (depicted in red). After the behavior of the system returns to its regular operating state (i.e. when the energy of the signal drops), this additional hidden variable is no longer required to achieve the desired level of accuracy and is therefore automatically removed by the algorithm. There are, however, major disadvantages in using this algorithm. Firstly, the algorithm requires considerable tuning (among others of the boundary parameters for capturing the energy level) before it reveals an additional hidden variable when anomalies are present. Secondly, the algorithm is not suitable for capturing sensor faults reliably, e.g. when the sensor gives extreme values or zeros, because it captures the faulty signal with the same number of hidden variables as before the faulty sensor event (not shown).
CONCLUSIONS
In this work, supervised machine learning, simple mean and median models, (unsupervised) clustering techniques and a tracking technique of hidden variables by orthogonal transformation are used to recognize unusual patterns in flow, pressure, electrical conductivity and temperature sensor readings for a real-life drinking WDN. Although all methods are capable of recognizing unusual patterns and detecting anomalies, not all methods use the information in the same way or detect the same number of anomalies. Apart from a small number of clear data outliers due to severe events, the current dataset lacks (logged) information on what is considered an anomaly. The small number of known anomalies is a severe limitation for training machine learning methods for the detection of anomalies.
Furthermore:
Selection of the training sets strongly determines the performance of supervised machine learning methods like SVR. When flow patterns from the same weekday are stacked to train a model for expected water demand, the pattern recognition performance of median or mean models with a confidence region is comparable to that of more complex SVR models using the epsilon bandwidth.
With clustering, recognition of clear outliers by an unsupervised technique is possible, but requires a transformation of the time series to use clustering techniques as pattern recognizers. Other data points could not yet be separated into meaningful classes. To the authors’ knowledge, the proposed clustering procedure has not been used for the application of anomaly detection in water supply networks.
The modified SPIRIT algorithm is able to reveal correlations, reduce the dimensionality of the original multidimensional dataset, and identify possible temporal patterns (anomalies) in the dataset with low computational demand and without training. However, the algorithm is very sensitive to its tuning parameters and we have found that it generates a high number of false negatives (e.g. a sensor fault that is not detected as a change in the number of hidden variables).
Future research
The algorithms developed in this work will be tuned to produce a robust and early detection and classification system for anomalies in drinking WDNs. Further work is aimed at increasing the reliability and performance of detecting novelties and investigating whether classification with machine learning can further aid in discerning the severity and type of event (leakage, sensor error or otherwise).
ACKNOWLEDGEMENTS
This activity is co-financed with TKI-funding from the Topconsortia for Knowledge & Innovation (TKIs) of the Dutch Ministry of Economic Affairs. We thank the partners within the TKI project for their ideas and cooperation: Slavco Velickov (formerly Hydrologic), Jonathan van der Wielen (Hydrologic), Johan Fitié and Eelco Trietsch (Vitens), Sijbrand Balkema (formerly Hydrologic, now Royal HaskoningDHV), Henk van Duist (PWN), and Perry van der Marel (WLN).