Methods to improve the operational efficiency of a water supply network through early detection of anomalies are investigated using the data streams from multiple sensor locations within the network. The network is a demonstration site of Vitens, a Dutch water company, and has several district metering areas where flow, pressure, electrical conductance and temperature are measured and logged online. Three machine learning approaches are tested for their feasibility to detect anomalies. In the first approach, day-dependent support vector regression (SVR) models are trained to predict the measurement signals and are compared with straightforward models using mean and median estimates, respectively. Using SVRs or the averaged data as real-time pattern recognizers on all available signals, large leakages can be detected. The second approach uses adaptive orthogonal projections and reports an event when the number of hidden variables required to describe the streaming data to a user-defined degree (energy-level threshold) increases. In the third approach, (unsupervised) clustering techniques are applied to detect anomalies and underlying patterns in the raw data streams. Preliminary results indicate that the current dataset contains too few events and patterns to harness the potential of these techniques.

INTRODUCTION

Water utilities collect ever-increasing amounts of data from water distribution systems via loggers and telemetry systems. The unprecedented availability of data can help improve the operation of water supply systems by increasing the reliability and availability of supply and by supporting safe and efficient operation (Wang et al. 2015).

Interpreting this data is challenging, because of the volume of data and possible sensor malfunctions, as well as specific ‘fingerprints’ of the data related to seasonal variation, location of water supply, the daily and weekly cycles, and so forth. Moreover, inaccuracies in hydraulic models, poor calibration of measurements, and lack of system operational feedback exacerbate the data analysis problems. Hence, a continuous improvement of operational practice is desired, preferably aided by real-time (monitoring) techniques.

In the past, several approaches have been explored. For example, Izquierdo et al. (2007) assess anomalies using a hybrid model composed of a deterministic part (flow rates and head at the nodes) coupled with a state estimation technique and a machine learning algorithm (neural networks). Kühnert et al. (2014) applied principal component analysis to detect novelties in water distribution network (WDN) sensor datastreams. In a recent study of Aminravan et al. (2015), a hierarchical rule-based approach was suggested to account also for the spatial aspect of events in WDNs.

In this work we explore alternative machine learning and data processing methods for anomaly detection on WDN sensor data streams, with the goal of improving operational efficiency. Three methods to recognize unusual patterns in these sensor streams are applied: supervised machine learning, unsupervised machine learning, and orthogonal transformation with subsequent tracking of hidden variables. We make use of interrelated sensor measurements at multiple locations in the same distribution network to enhance model development and validation.

METHOD

Machine learning techniques are combined with trained, day-dependent models to function as real-time pattern recognizers and classifiers of anomalies. The dataset, water supply network, and the pattern recognizers and classifiers are outlined below.

Water supply network and dataset

We apply machine learning techniques for anomaly detection and classification by using four sensor signals (flow, pressure, temperature, conductivity) within the Vitens Innovation Playground (VIP). Vitens, the largest water supply company in the Netherlands, has a dedicated demonstration area used for testing new technologies aimed at creating an intelligent water supply (see, e.g., De Graaf et al. 2012). The demonstration site comprises 2,300 km of distribution network with an average age of 50 to 60 years serving over 200,000 households. This non-chlorinated drinking WDN has several district metering areas (DMA) equipped with sensors to measure (amongst others) flow, pressure, conductivity and temperature (Figure 1). For this article we investigated 144 measurement points. The number of connections within a DMA ranges from 136 (mainly industrial) to 4,876 (mainly household).
Figure 1

Water supply network within the VIP. DMA are depicted as blue regions. Labeled green dots indicate sensor locations. The network is constituted of transport pipes (red) and secondary and tertiary pipes (blue). Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/ws.2016.062.

Machine learning and event detection

The sensor signals, with an averaged 5 min sampling interval, are processed with machine learning methods, with orthogonal transformation and hidden variable tracking, and with straightforward median and average models. An event detection algorithm is subsequently applied. The following three methods are carried out.

  • (M1) (supervised) machine learning by epsilon-Support Vector Regression (ɛ-SVR), where data are first split into a training and test set, and labeled where necessary. We make use of ɛ-SVR as provided by the Scikit-learn Python module (Pedregosa et al. 2011) and apply the procedure as outlined by Mounce et al. (2011). ɛ-SVR computes a linear regression function in a feature space, to which the input data are mapped via a nonlinear function, in such a way that a so-called ɛ-insensitive loss function is minimized (Basak et al. 2007). Event monitoring and subsequent detection of anomalies are performed as outlined by Mounce et al. (2011). In short, an anomaly is detected when a pre-specified number of measurements within a certain time frame differ from the modelled signal by more than a certain tolerance width. This tolerance is a dynamic bandwidth of three standard deviations around the model prediction. The anomaly, or event, is then assigned a ‘surprise score’, a measure of severity (Mounce et al. 2011). Anomalies in sensor data may be caused by, e.g., pipe leaks, sensor malfunctions, and unusual demand patterns. Trained SVR models are compared to median and average signals over selected periods.

  • (M2) (unsupervised) clustering for pattern recognition and event detection. We make use of clustering algorithms in an attempt to obtain underlying patterns that may be present in the data. Because the data is expected to display patterns with daily and weekly periodicities, additional variables are added to denote the timestamp of each data point, incorporating these periodicities. The extended dataset is explored with various clustering algorithms, including k-means clustering and several variations of the Gaussian Mixture Models method provided by the Scikit-learn Python module (Pedregosa et al. 2011). Resulting cluster assignments are investigated for underlying patterns within the dataset and are compared with known events and outliers in the data. To further investigate outliers in the dataset, a daily average signal is computed for each measured variable and subtracted from the signal, and the cluster analysis is repeated on the residual dataset.

  • (M3) orthogonal transformation and tracking hidden variables. The main idea behind this simple approach is to take the multidimensional time series data stream and reduce its dimensionality by using adaptive orthogonal projections at each time step. Given n numerical data streams whose values are observed at each sampling time t, one can incrementally find correlations and hidden variables that summarise the key patterns in the entire multidimensional dataset. For this purpose, we implemented a modified version of the ‘Streaming Pattern dIscoveRy on multIple Time series’ (SPIRIT) algorithm (Papadimitriou et al. 2005), which resembles principal component analysis calculations. The algorithm performs incremental updates of the weights of a small set of orthogonal components, say k, and tracks the number of components needed to satisfy a predefined threshold (energy level). If more than k orthogonal components – these will be called hidden variables – are needed, it is expected that an unusual pattern is occurring. The new hidden variable indicates the possible occurrence of a new pattern or anomaly in the streaming dataset. The value of k is adapted on the fly. In addition, by tracking and updating the vector of weights of the orthogonal projections, one can estimate the contribution of each of the time series measurements to the hidden variables.
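
As an illustration of the detection rule described under M1, the following is a minimal sketch and not the procedure of Mounce et al. (2011) itself. It flags a time step when at least `min_count` of the most recent `window` samples lie outside a band of three standard deviations around the model prediction. The function name, the parameter names, and the default values of `min_count` and `window` are assumptions made for illustration; the text does not specify them.

```python
import numpy as np


def detect_events(observed, predicted, sigma, min_count=3, window=6, n_sigma=3.0):
    """Return a boolean array that is True where an event is flagged.

    An event is flagged at time step t when at least `min_count` of the
    samples in the trailing window [t - window + 1, t] deviate from the
    prediction by more than n_sigma * sigma (the tolerance width).
    `min_count` and `window` are hypothetical defaults.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), observed.shape)

    # True where a single measurement falls outside the tolerance band.
    outside = np.abs(observed - predicted) > n_sigma * sigma

    # Trailing-window count of exceedances (causal, so usable on a stream).
    kernel = np.ones(window, dtype=int)
    counts = np.convolve(outside.astype(int), kernel, mode="full")[: observed.size]
    return counts >= min_count
```

Requiring several exceedances within a window, rather than a single one, is what keeps isolated noisy samples from raising an event.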

RESULTS AND DISCUSSION

Results are presented in line with the methods M1, M2, and M3 as outlined in the Method section.

M1: supervised machine learning by ɛ-SVR

Time series data covering a period of approximately 1 year are used. The SVR models are trained on 26 weeks of flow data, selected to exclude weeks of known anomalous flow (holidays, recalibration events, major leaks, etc.); the remaining dataset (27 weeks) is used for testing. Figure 2 depicts a period during which a recalibration of the sensor produced a higher than expected (trained) signal, prompting the detection of anomalies by the algorithm. The similar performance of all three models (SVR, median and mean) demonstrates that each is suitable for event detection on the current dataset.
Figure 2

Anomaly detection with three models applied to a 2-day period (May 1 and May 2, 2015, horizontal axis) of flow data (in cubic metres per hour, vertical axis) for sensor location FR-MLKA. Shown are the sensor signal (red), SVR model (orange curve), mean model (green), and median model (blue). The gray shaded region indicates the tolerance width of the SVR model. An anomaly was detected just after 8:00 h AM on May 2 (framed rectangle). Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/ws.2016.062.
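
The comparison between the SVR, mean and median models can be sketched as follows. This is a minimal example under stated assumptions, not the models trained in this study: it assumes that the same weekday from each training week is stacked into one array, that time of day (cyclically encoded) is the only SVR input, and that the per-time-step standard deviation of the training days serves as the tolerance scale. The function names and the values of `C` and `epsilon` are hypothetical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

SAMPLES_PER_DAY = 288  # one value every 5 minutes


def time_of_day_features(n_samples=SAMPLES_PER_DAY):
    """Encode time of day cyclically so that 23:55 and 00:00 are neighbours."""
    phase = 2.0 * np.pi * np.arange(n_samples) / n_samples
    return np.column_stack([np.sin(phase), np.cos(phase),
                            np.sin(2 * phase), np.cos(2 * phase)])


def fit_weekday_models(stacked_days, C=100.0, epsilon=0.5):
    """Fit SVR, mean and median models for one weekday.

    stacked_days: array of shape (n_weeks, samples_per_day); each row is the
    same weekday taken from a different training week.
    C and epsilon are assumed values, not those used in the paper.
    """
    stacked_days = np.asarray(stacked_days, dtype=float)
    n_weeks, n_samples = stacked_days.shape
    features = time_of_day_features(n_samples)

    # epsilon-SVR with an RBF kernel on standardized time-of-day features.
    svr = make_pipeline(StandardScaler(),
                        SVR(kernel="rbf", C=C, epsilon=epsilon))
    svr.fit(np.tile(features, (n_weeks, 1)), stacked_days.ravel())

    return {
        "svr": svr.predict(features),                # SVR daily pattern
        "mean": stacked_days.mean(axis=0),           # mean model
        "median": np.median(stacked_days, axis=0),   # median model
        "sigma": stacked_days.std(axis=0),           # scale for the tolerance band
    }
```

With this layout, each of the three predicted daily patterns can be compared against a test day using the same tolerance band.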

The SVR analysis was performed for three measured variables (flow, conductivity and pressure) for several measurement locations to assess the robustness of the algorithm and the co-occurrence of events (Figure 3). Such an application of event detection to multiple sensor locations allows for an improved understanding of large-scale anomalous activity and the interrelations between sensor locations. The temperature signal was excluded from the analysis because its seasonal pattern does not comply with the assumption of a weekly periodicity and the training period was found to be too short for success. No events from the conductivity measurements were detected during this period. Three flow sensors show events throughout the measurement period that were caused by anomalous measurements during the training period, indicating the importance of checking the training set.
Figure 3

Events for flow, conductivity, and temperature sensors as detected by SVR during a period from July 2014 until September 2015. Colors indicate aggregated weekly abnormality of the sensor signal, expressed as the sum of individual surprise scores and normalized by the standard deviation of the sensor signal. The severity of detected anomalies is expressed as a ‘surprise score’: 0 to 5 (gray), 5 to 20 (yellow), 20 to 800 (orange), 800 to 100,000 (purple), more than 100,000 (black). The coefficient of determination (R²) is given between brackets. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/ws.2016.062.

M2: using clustering for pattern recognition and event detection

The dataset has been separated into four clusters using a mixture of full covariance matrix Gaussians (GMM clustering). The clustering algorithms show that very few distinct patterns can be discerned from the datasets (Figure 4). Apart from a small number of clear outliers (cyan cluster), the data points could not yet be separated into meaningful classes. The outliers correspond to easily detectable leakages and sensor malfunctions (Figure 5).
Figure 4

Data for one sensor location projected on two components (calculated with PCA) after subtracting the day-average from each signal. Data are colored according to their cluster assignments. The crosses denote the centroids of the clusters. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/ws.2016.062.

Figure 5

Data from one sensor location in a 6-week period. Shown are (from top to bottom) the pipe flow (negative values indicate a direction reversal), pressure, conductivity and temperature. A major leak occurred during this period (cyan data). Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/ws.2016.062.
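
The clustering workflow of M2 can be sketched as follows. This is a minimal example, not the analysis code used in this study. It assumes that the 'daily average signal' is the average daily profile (the mean over all days at each time of day), that the daily and weekly periodicities are encoded as sine and cosine terms, and that the features are standardized before clustering. The function names are hypothetical; `GaussianMixture` with `covariance_type="full"` is the Scikit-learn estimator for a mixture of full-covariance Gaussians.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

SAMPLES_PER_DAY = 288  # one value every 5 minutes


def subtract_daily_profile(signals, samples_per_day=SAMPLES_PER_DAY):
    """Remove the average daily profile from each measured variable.

    signals: 2-D array of shape (n_samples, n_variables), starting at
    midnight. Incomplete trailing days are dropped.
    """
    signals = np.asarray(signals, dtype=float)
    n_days = signals.shape[0] // samples_per_day
    whole = signals[: n_days * samples_per_day]
    profile = whole.reshape(n_days, samples_per_day, -1).mean(axis=0)
    return whole - np.tile(profile, (n_days, 1))


def add_periodic_features(signals, samples_per_day=SAMPLES_PER_DAY):
    """Append cyclic encodings of the time of day and the time of week."""
    t = np.arange(signals.shape[0])
    day = 2.0 * np.pi * t / samples_per_day
    week = day / 7.0
    return np.column_stack([signals, np.sin(day), np.cos(day),
                            np.sin(week), np.cos(week)])


def cluster_samples(features, n_clusters=4, seed=0):
    """Assign each sample to one of n_clusters full-covariance Gaussians."""
    scaled = StandardScaler().fit_transform(features)
    gmm = GaussianMixture(n_components=n_clusters, covariance_type="full",
                          random_state=seed)
    return gmm.fit_predict(scaled)
```

A typical call would be `cluster_samples(add_periodic_features(subtract_daily_profile(signals)))`, after which the cluster labels can be compared with logged events.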

M3: using orthogonal transformation and tracking hidden variables

From the raw, unfiltered signals, we computed the net total flows for each of the DMAs in the study area for 2013 and 2014. This resulted in a multivariate time series with, for each time step, a flow and a pressure data stream for each of six DMAs, so that 12 data streams are processed. The measurement interval is 5 min, which amounts to 105,120 samples per year. Figure 6 shows the measured and the reconstructed flow data for three DMAs using only two hidden variables, whilst capturing 99% of the energy level of the original signals.
Figure 6

Flow data for three DMA locations for January–March 2014. The top figure depicts the measured flow signals in three DMAs (Westeinde, Bilgaard, and Marwei); the bottom figure shows the reconstructed flow data using the orthogonal projections. The algorithm is able to reduce the main hydrodynamics of the multidimensional dataset and describe it with only two hidden, linearly independent orthogonal variables (Figure 7).

Figure 7

One of the flow signals with the hidden variables describing the main patterns and structure of the multidimensional data (2 weeks). The top figure depicts one of the measured flow signals at Bilgaard, within which a possible anomaly is highlighted; the bottom figure shows the uncovered hidden variables including anomaly detection in the multi-dimensional dataset.

Because of the identified possible anomaly (an irregular flow pattern followed by a spike) in the original data, the algorithm is no longer able to describe the signal with the two hidden variables while maintaining sufficient accuracy. To describe the anomaly correctly, the algorithm therefore temporarily generates a third hidden variable (depicted in red). After the system returns to its regular operating state (i.e. when the energy of the signal drops), this additional hidden variable is no longer required to achieve the desired level of accuracy and is automatically removed by the algorithm. There are, however, major disadvantages in using this algorithm. Firstly, the algorithm requires considerable tuning (amongst others of the boundary parameters for capturing the energy level) before it reveals the additional hidden variable when anomalies are present. Secondly, the algorithm is not suitable for capturing sensor faults reliably, e.g. when the sensor gives extreme values or zeros, because it captures the faulty signal with the same number of hidden variables as before the faulty sensor event (not shown).
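
The mechanism by which a hidden variable is added and removed can be illustrated with a simplified sketch of the SPIRIT update of Papadimitriou et al. (2005). It is not the modified version used in this work. The class and function names, the forgetting factor `lam`, and the energy bounds `f_low` and `f_high` are assumed values chosen for illustration.

```python
import numpy as np


class SpiritSketch:
    """Simplified streaming tracker of hidden variables (illustrative only)."""

    def __init__(self, n_streams, lam=0.96, f_low=0.95, f_high=0.98):
        self.n = n_streams
        self.k = 1                          # current number of hidden variables
        self.lam = lam                      # exponential forgetting factor
        self.f_low, self.f_high = f_low, f_high
        self.W = np.eye(n_streams)          # column i: weights of hidden variable i
        self.d = np.full(n_streams, 1e-3)   # energy tracked per hidden variable
        self.energy_x = 0.0                 # energy of the input streams
        self.energy_y = 0.0                 # energy captured by hidden variables

    def update(self, x):
        """Process one time step; return (hidden variables, updated k)."""
        x = np.asarray(x, dtype=float)
        residual = x.copy()
        y = np.zeros(self.k)
        for i in range(self.k):
            w = self.W[:, i]
            y[i] = w @ residual                       # project onto component i
            self.d[i] = self.lam * self.d[i] + y[i] ** 2
            error = residual - y[i] * w
            w = w + (y[i] / self.d[i]) * error        # incremental weight update
            self.W[:, i] = w
            residual = residual - y[i] * w            # deflate for next component

        self.energy_x = self.lam * self.energy_x + x @ x
        self.energy_y = self.lam * self.energy_y + y @ y

        if self.energy_y < self.f_low * self.energy_x and self.k < self.n:
            # Too little energy captured: start a new hidden variable.
            self.W[:, self.k] = np.eye(self.n)[:, self.k]
            self.d[self.k] = 1e-3
            self.k += 1
        elif self.energy_y > self.f_high * self.energy_x and self.k > 1:
            # Energy is captured with room to spare: drop the last one.
            self.k -= 1
        return y, self.k


def track_hidden_variables(X, **kwargs):
    """Run the sketch over X (n_steps, n_streams); return k at each step."""
    tracker = SpiritSketch(X.shape[1], **kwargs)
    return np.array([tracker.update(row)[1] for row in X])
```

An increase in the returned number of hidden variables marks a candidate event. The two energy bounds are the tuning parameters referred to above, and the outcome can change noticeably when they are varied.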

CONCLUSIONS

In this work, supervised machine learning, simple mean and median models, (unsupervised) clustering techniques and a technique that tracks hidden variables by orthogonal transformation are used to recognize unusual patterns in flow, pressure, electrical conductivity and temperature sensor readings for a real-life drinking WDN. Although all methods are capable of recognizing unusual patterns and detecting anomalies, they do not all use the information in the same way or detect the same number of anomaly events. Apart from a small number of clear data outliers due to severe events, the current dataset lacks (logged) information on what is considered an anomaly. The small number of known anomalies is a severe limitation for training machine learning methods to detect anomalies.

Furthermore:

  • Selection of the training sets strongly determines the performance of supervised machine learning methods such as SVR. When same-weekday flow patterns are stacked to train a model for expected water demand, the pattern recognition performance of median or mean models with a confidence region is comparable to that of more complex SVR models using the epsilon bandwidth.

  • With clustering, recognition of clear outliers by an unsupervised technique is possible, but requires a transformation of the time series to use clustering techniques as pattern recognizers. Other data points could not yet be separated into meaningful classes. To the authors’ knowledge, the proposed clustering procedure has not been used for the application of anomaly detection in water supply networks.

  • The modified SPIRIT algorithm is able to reveal correlations, reduce the dimensionality of the original multidimensional dataset and identify possible temporal patterns (anomalies) in the dataset with low computational demand and without training. However, the algorithm is very sensitive to its tuning parameters, and we have found that it generates a high number of false negatives (e.g. a sensor fault that does not change the number of hidden variables).

Future research

The algorithms developed in this work will be tuned to produce a robust and early detection and classification system for anomalies in drinking WDNs. Further work is aimed at increasing the reliability and performance of detecting novelties and investigating whether classification with machine learning can further aid in discerning the severity and type of event (leakage, sensor error or otherwise).

ACKNOWLEDGEMENTS

This activity is co-financed with TKI-funding from the Topconsortia for Knowledge & Innovation (TKIs) of the Dutch Ministry of Economic Affairs. We thank the partners within the TKI project for their ideas and cooperation: Slavco Velickov (formerly Hydrologic), Jonathan van der Wielen (Hydrologic), Johan Fitié and Eelco Trietsch (Vitens), Sijbrand Balkema (formerly Hydrologic, now Royal HaskoningDHV), Henk van Duist (PWN), and Perry van der Marel (WLN).

REFERENCES

Aminravan F., Sadiq R., Hoorfar M., Rodriguez M. J., Najjaran H. 2015 Multi-level information fusion for spatiotemporal monitoring in water distribution networks. Expert Systems with Applications 42, 3813–3831.
Basak D., Pal S., Patranabis D. C. 2007 Support vector regression. Neural Information Processing – Letters and Reviews 11, 203–224.
de Graaf B., Williamson F., Koerkamp M. K., Verhoef J., Wuestman R., Bajema B., Trietsch E., van Delft W. 2012 Implementation of an innovative sensor technology for effective online water quality monitoring in the distribution network. In: Proc. Singapore International Water Week.
Izquierdo J., López P. A., Martinez F. J., Pérez R. 2007 Fault detection in water supply systems using hybrid (theory and data-driven) modelling. Mathematical and Computer Modelling 46, 341–350.
Papadimitriou S., Sun J., Faloutsos C. 2005 Streaming pattern discovery in multiple time-series. In: Proceedings of the 31st International Conference on Very Large Data Bases, Klemens Böhm, Christian S. Jensen, Laura M. Haas, Martin L. Kersten, Per-Åke Larson & Beng-Chin Ooi (eds), ACM, New York, pp. 697–708.
Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V. 2011 Scikit-learn: machine learning in Python. The Journal of Machine Learning Research 12, 2825–2830.
Wang Z., Song H., Watkins D. W., Ong K. G., Xue P., Yang Q., Shi X. 2015 Cyber-physical systems for water sustainability: challenges and opportunities. Communications Magazine, IEEE 53 (5), 2016–2222.