Pattern matching and associative artificial neural networks for water distribution system time series data analysis

Water distribution systems, and other infrastructures, are increasingly being pervaded by sensing technologies, collecting a growing volume of data aimed at supporting operational and investment decisions. These sensors monitor system characteristics such as flows, pressures and water quality in pipes. This paper presents the application of pattern matching techniques and binary associative neural networks for novelty detection in such data. A protocol for applying pattern matching to automatically recognise specific waveforms in time series based on their shapes is described, together with a system called Advanced Uncertain Reasoning Architecture (AURA) Alert for autonomous determination of novelty. AURA is a class of binary neural network that has a number of advantages over standard artificial neural network techniques for condition monitoring, including a sound theoretical basis to determine the bounds of the system operation. Results from application to several case studies are provided, including both hydraulic and water quality data. In the case of pattern matching, the results demonstrated some transferability of burst patterns across District Metered Areas (DMAs); however, limitations in performance and difficulties with assembling pattern libraries were found. Results for the AURA system demonstrate the potential for robust event detection across multiple parameters, providing valuable information for diagnosis; one example also demonstrates the potential for detection of precursor information, vital for proactive management.

doi: 10.2166/hydro.2013.057

S. R. Mounce (corresponding author), R. B. Mounce, J. B. Boxall: Pennine Water Group, Department of Civil and Structural Engineering, University of Sheffield, Sheffield, S1 3JD, UK. E-mail: s.r.mounce@sheffield.ac.uk
T. Jackson, J. Austin: Advanced Computer Architecture Group, Department of Computer Science, University of York, Deramore Lane, York, YO10 5GH, UK


INTRODUCTION
Population growth, urbanisation, industrialisation and climate change are placing increasing pressure on water resources. The water-energy-food nexus is a term being used to describe the complex linkages and dependencies among water, energy and food security (Olsson ).
Global demand for water is forecast to outstrip supply by 40% by 2030 due to factors such as population growth and climate change (Parliamentary Office of Science and Technology ). This building pressure on water availability is driving a greater consideration of optimal management of clean water resources. Continuous online monitors and sensors are increasingly being used to measure a wide range of potable water hydraulic and quality variables within water distribution systems (WDSs) (Wu et al. ). Obtaining system information from these data can facilitate proactive system operation and maintenance. For water quality in particular, online data are generally not as reliable as laboratory-based discrete sample analysis, with many associated problems that include absolute accuracy, maintenance, calibration, connectivity issues and local disturbances (Aisopou et al. ). This situation is compounded by the ever increasing volumes of data being collected, at higher sampling frequencies than ever before and with coverage of hundreds or even thousands of sites.

Data from online monitors potentially provide a wealth of information about what is happening within WDSs, and intelligent algorithms can be applied to turn these data into information for water utility companies. Many companies are not making effective use of what is being collected in this regard and are missing an opportunity to better understand and assess current system status. Any data interpretation system employed must be able to deal with 'dirty data', i.e. inherent (though improving) limitations in data variability and quality. Hence systems need to include strategies for handling missing values and dealing with noise, e.g. Branisavljević et al. (). Analysis systems need to provide useful classifications of system status, events and conditions without generating an onerous number of alerts or alarms; otherwise system operators will ignore the warnings, compromising the value of the information.
This paper presents the application of pattern matching techniques and binary associative artificial neural networks (ANNs) for novelty detection in time series data collected from WDSs. Algorithms are described and a protocol developed for applying the approach to case study data, both hydraulic and water quality, from water supply systems in the UK.

APPROACHES FOR EVENT DETECTION IN WDS MEASURED TIME SERIES DATA
A water distribution network is a complex, distributed, non-linear dynamic system, and thus it may not be effectively or satisfactorily described using purely linear methods or models. It is not possible to build an accurate non-linear model completely describing the system from data due to the uncertainties present. However, data-driven modelling is highly applicable. It has the advantage of not requiring a detailed understanding of the interacting physical, chemical and/or biological processes that affect a system before model inputs can be mapped to outputs.
Data-driven models can complement and sometimes replace deterministic models (Solomatine ). Recent developments in the field of computational intelligence (sometimes termed soft computing or machine learning) are helping to solve various problems in the water resources domain.
A number of approaches from the fields of artificial intelligence and statistics have been applied for detecting abnormality in WDSs from time series data. Alert systems that convert flow and pressure sensor data into usable information in the form of timely alerts (event detection systems) have been developed with a focus on burst detection to help with the issue of leakage reduction. Some of the most recent approaches are summarised in Table 1. The aforementioned event detection systems generally have the following features in common:
(i) They learn from training data in some way to make a prediction about expected future values.
(ii) They have some type of methodology or rules for deciding when sufficient deviation from normality constitutes an abnormal event.
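
The two common features listed above can be illustrated with a deliberately simple sketch. It is not any of the systems in Table 1: the "prediction" is just the mean of the training data, and the decision rule (a deviation of more than `k` standard deviations) and the function name `detect_events` are assumptions made for illustration only.

```python
from statistics import mean, stdev


def detect_events(training, new_values, k=3.0):
    """Flag new values that deviate from the training mean by more than k standard deviations.

    Feature (i): the expected value is learned from the training data (here, its mean).
    Feature (ii): a fixed rule (k standard deviations, an assumed choice) decides abnormality.
    """
    mu = mean(training)
    sigma = stdev(training)
    return [abs(v - mu) > k * sigma for v in new_values]


# Hypothetical flow readings: a normal training period, then two new values.
flags = detect_events([10.0, 11.0, 9.0, 10.5, 9.5], [10.2, 15.0])
print(flags)
```

Real event detection systems replace the constant mean with a learned forecast (for example, one that follows the diurnal profile) and usually require a deviation to persist before raising an alert.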

Data pre-processing
It is necessary to pre-process the data in order to be able to compare patterns from different sensors and at different times on a more equal basis. Suppose that the time series for a particular variable is represented by $(x(t))_{t \in T}$. Firstly, we transform the time series to differences from the mean, i.e.

$$x(t) \mapsto x(t) - \mu(t)$$

for each time $t$, where $\mu(t)$ is either:

(i) the current average on some moving window $[t - t_A, t]$, i.e.

$$\mu(t) = \frac{1}{n} \sum_{u \in [t - t_A,\, t]} x(u)$$

where $t_A$ is the length of the time window for averaging and $n$ is the number of time series values in the interval; or

(ii) the average for that time of day, i.e.

$$\mu(t) = \frac{1}{n} \sum_{u \,:\, \tau(u) = \tau(t)} x(u)$$

where $\tau(u) = \tau(t)$ means that times $u$ and $t$ are at the same time of day and $n$ is the number of measurements $u$ that meet the criteria in the summation.

Secondly, we need to normalise these differences by the standard deviation over the same time series window as the mean is calculated from, so that overall

$$x(t) \mapsto \frac{x(t) - \mu(t)}{\sigma_t}$$

where $\sigma_t$ is the standard deviation of the values on which the mean is calculated.
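
The two pre-processing options can be sketched as follows. This is a minimal illustration under several assumptions not stated in the text: the series is regularly sampled, the moving window is given as a number of samples, the population standard deviation is used, the time-of-day average is taken over the whole available series, and a zero standard deviation maps to a normalised value of 0. The function names are hypothetical.

```python
from statistics import mean, pstdev


def window_normalise(series, window):
    """Normalise each value by the mean and standard deviation of its trailing window.

    The window covers the current sample and up to (window - 1) preceding samples.
    """
    out = []
    for i in range(len(series)):
        values = series[max(0, i - window + 1): i + 1]
        mu = mean(values)
        sigma = pstdev(values)
        # Assumed convention: a constant window gives a normalised value of 0.
        out.append((series[i] - mu) / sigma if sigma > 0 else 0.0)
    return out


def time_of_day_normalise(series, samples_per_day):
    """Normalise each value by the mean and standard deviation of all values
    recorded at the same time of day."""
    out = []
    for i, x in enumerate(series):
        same_time = series[i % samples_per_day::samples_per_day]
        mu = mean(same_time)
        sigma = pstdev(same_time)
        out.append((x - mu) / sigma if sigma > 0 else 0.0)
    return out


# Toy example with two samples per "day".
print(time_of_day_normalise([1.0, 10.0, 3.0, 10.0], samples_per_day=2))
```

For 15 minute data, `samples_per_day` would be 96. An online implementation would restrict the time-of-day average to past values only.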

Populating the libraries
The libraries need to be populated with profiles for the different relevant variables. These profiles consist of consecutive measurements over possibly different event window lengths. It is important to use profiles that are typical and indicative of the given event type. A level of expert knowledge and/or water network records may be required to obtain these exemplars.

Searching the libraries
Define $t(E)$ to be the duration of event $E$. At each time $t$, the time series profiles used for comparison with event library $L$ are the time series

$$\{\, x(u) : u \in [t - t(E),\, t] \,\} \quad \text{for each } E \in L$$

so that if the event library has profiles of length 30, 60 and 90 minutes, at each time step we would perform a search over the last 30, 60 and 90 minutes' worth of data respectively.
The distance between profiles (which must be of the same length in terms of time) is found using the $l_2$ norm (Euclidean distance), i.e. the distance between the two $n$-vectors $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ is

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Match scores can then be generated from this distance (assuming we are comparing profile $x$ with library pattern $y$), so that if $d(x, y) = 0$ then the match score is 1. A threshold can then be used above which two time sequences are said to be similar.
Similarity can also be calculated for profiles that are of a similar shape but different magnitude. This research uses this type of pattern matching for populating a pattern library and then comparing a new data stream against it for detecting faults.
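
A minimal sketch of the library search at a single time step is given below. The Euclidean distance follows the text; the mapping from distance to match score, `1 / (1 + d)`, is an assumption chosen only because it gives a score of 1 when the distance is 0. The function names and the dictionary layout of the library are likewise hypothetical.

```python
from math import sqrt


def euclidean(x, y):
    """l2 distance between two profiles of equal length."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def match_score(x, y):
    """Assumed score mapping: 1 / (1 + d), so that d = 0 gives a score of 1."""
    return 1.0 / (1.0 + euclidean(x, y))


def search_library(series, library, threshold):
    """Compare the most recent samples of `series` with every library profile.

    Each profile is compared with the trailing window of the same length.
    Returns (name, score) pairs at or above `threshold`, best match first.
    """
    hits = []
    for name, profile in library.items():
        n = len(profile)
        if len(series) < n:
            continue  # not enough data yet for this event length
        score = match_score(series[-n:], profile)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Hypothetical normalised profiles of different lengths.
library = {"burst_a": [0.0, 1.0, 3.0], "burst_b": [0.0, 0.0]}
print(search_library([0.2, 0.0, 1.0, 3.0], library, threshold=0.9))
```

Running the search at every time step over a normalised data stream reproduces the sliding comparison described in the text.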

Associative memories
Novelty detection is the identification of new or unknown data that a machine learning system has not been trained on or previously seen. Many applications exist for analysing temporal sequences (Keogh et al. ). Rather than relying on manual review, it is useful to have some form of automated analysis system, which can scan the time series generated by monitoring sensors, and report any abnormal observations. This can be crucial in safety-critical environments. Novelty detection is a two-class problem in that it needs to be ascertained whether acquired data come from a normal operating condition or not. There are many techniques for novelty detection, including outlier analysis; however, some types of faults do not involve any one variable departing from its normal operating range.
Since the classification of novelty is a priori unknown, this is a challenging problem and rules out the use of many supervised techniques. There is often no clear-cut boundary between novel events and normal events in real-world applications and a lack of meta-data (such as information about water treatment or process changes, maintenance events, industrial processes, etc.) in WDSs is a particular problem.
We can treat the WDS, or sub-areas (such as individual DMAs) in the context of real-time condition monitoring (CM), where it is critical to identify deviations from normal behaviour in sensor readings. A key element of CM is the early detection of potential faults in the monitored system or asset (such as a building, an engine or a pipeline), allowing preventative action to be taken before major damage occurs (for example a catastrophic burst). The CM system has to identify these potential faults based on the values of a (possibly large) number of variables.
In the field of ANNs, an associative memory is a network which stores mappings from specific input representations to specific output representations. Hence, a system that 'associates' two patterns is one that, when presented with only one of these patterns later, can reliably recall the other. There are two types of associative memory: auto-associative and hetero-associative. Auto-associative memories are capable of retrieving a piece of data upon presentation of only partial information from that piece of data, while hetero-associative memories can recall an associated piece of data from one category upon presentation of data from another category. Auto-associative mapping can be created by training an ANN to reproduce its input at its output (Masters ). This research uses binary associative neural networks for detecting faults, by storing a representation of normal behaviour and monitoring when the asset's activity deviates from this behaviour. They are an example of a hetero-associative memory (although they can also be used in an auto-associative fashion).

METHODOLOGY AND SOFTWARE

Signal Data Explorer
The Signal Data Explorer (SDE) is a general purpose data browser and search engine for time series signal data (Fletcher et al. ). The SDE allows a user to specify the signal event to be searched by supplying a short example of that event (query by content).
In training a correlation matrix memory (CMM), $M_k$ and $M_{k-1}$ are the CMM after and before the training step (with $M_0 = 0$ and $\cup$ denoting a logical OR operation between the vectors). The recall vector $S_i$ associated to the input $I_i$ is generally an integer vector, and the value of each element of the recall vector is called the 'score' of the CMM matching on the relevant column vector. The recall vector can then be thresholded to a binary output vector by either using a fixed threshold or selecting the $L$ closest matches. This process is shown in an accompanying figure.
In practice, the recall system needs to factor in not only the number of bins that match exactly, but also the distance between the assigned bins when they differ, since this will provide important information on the closeness of match. In this calculation, the output is the value of bin number $k$ ($\mathrm{bins}_k$) when the value of the variable has been assigned to bin $t$ ($\mathrm{bins}_t$), $\max(n)$ is the number of bins for any variable and $n_f$ is the number of bins for this variable.
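
The training and recall steps of a binary CMM can be sketched as below. This follows the conventional formulation of a correlation matrix memory (training ORs the outer product of a binary input and output vector into the matrix; recall sums the matrix rows selected by the set input bits) and is an assumption about the general technique, not a reproduction of the AURA implementation. The function names are hypothetical, and the sketch does not include the bin-distance weighting described in the text.

```python
def new_cmm(n_inputs, n_outputs):
    """An empty binary correlation matrix memory (M_0 = 0)."""
    return [[0] * n_outputs for _ in range(n_inputs)]


def train(cmm, input_vec, output_vec):
    """OR the outer product of the binary input and output vectors into the CMM."""
    for i, a in enumerate(input_vec):
        if a:
            for j, b in enumerate(output_vec):
                cmm[i][j] |= b


def recall(cmm, input_vec):
    """Integer recall vector: for each column, count the set input bits whose row is set."""
    n_out = len(cmm[0])
    return [sum(input_vec[i] * cmm[i][j] for i in range(len(cmm)))
            for j in range(n_out)]


def fixed_threshold(scores, theta):
    """Binary output with a 1 wherever the score reaches the fixed threshold."""
    return [1 if s >= theta else 0 for s in scores]


def l_max(scores, L):
    """Binary output with a 1 at the L highest-scoring columns."""
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:L]
    return [1 if j in top else 0 for j in range(len(scores))]


cmm = new_cmm(4, 2)
train(cmm, [1, 1, 0, 0], [1, 0])   # associate the first input pattern with output 0
train(cmm, [0, 0, 1, 1], [0, 1])   # associate the second input pattern with output 1
scores = recall(cmm, [1, 1, 0, 0])
print(scores, fixed_threshold(scores, 2), l_max(scores, 1))
```

Because both storage and recall are simple bitwise and counting operations, the approach lends itself to fast search over large numbers of stored patterns.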

Water distribution system time series data
Data streams from WDSs can be somewhat different to those from other domains, such as engine or power plant monitoring. Some variables, particularly hydraulic parameters such as flow and pressure, possess a diurnal pattern which reflects the daily demand profile dominated by residential use; the pressure trace in Figure 2 illustrates this. Some water quality measurements also reflect this, so that chlorine concentration, for example, will (generally) have a periodic, sinusoidal-like profile. However, other water quality parameters such as conductivity are more similar to those measurements usually encountered in CM. Finally, some can have both characteristics, such as turbidity (as seen in Figure 2).
In order to use AURA Alert on data with periodic (e.g. daily) cycles, it is necessary to introduce an extra 'time of day' variable (e.g. the number of elapsed hours of the day). This enables AURA Alert to detect patterns in the data that are unusual at that time of day. The data collected from sensors are first formatted into input files for a MATLAB pre-processing program which identifies and fills in any missing timestamps or values so as to provide a continuous stream of data. The data are finally reformatted into the comma delimited format required by the SDE. Note though that for non-periodic data streams, the AURA system is able to deal with completely missing data, with a zero code indicating the absence of data. This is particularly useful for dealing with instrumentation or telemetry problems in online systems.
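
The gap-filling step can be sketched as follows. The original pre-processing was done in MATLAB and its filling rule is not described here, so this Python version is only an illustration: it places readings on a regular time grid and inserts a placeholder (`fill_value`, an assumed parameter) wherever a timestamp is missing. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def fill_missing(readings, step_minutes, fill_value=None):
    """Return (timestamp, value) pairs on a regular grid from first to last reading.

    `readings` maps datetime -> value. Timestamps absent from the input are
    inserted with `fill_value` (an assumed placeholder, not the source's rule).
    """
    if not readings:
        return []
    start, end = min(readings), max(readings)
    step = timedelta(minutes=step_minutes)
    filled = []
    t = start
    while t <= end:
        filled.append((t, readings.get(t, fill_value)))
        t += step
    return filled


# Hypothetical 15 minute flow data with the 00:15 reading missing.
raw = {
    datetime(2005, 12, 24, 0, 0): 4.1,
    datetime(2005, 12, 24, 0, 30): 4.3,
}
for stamp, value in fill_missing(raw, step_minutes=15):
    print(stamp.isoformat(), value)
```

In practice the placeholder might be replaced by an interpolated value for periodic signals, or by a dedicated missing-data code where the downstream system supports one.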
AURA Alert is then provided with data from an extended period of time (at least 2 weeks) during which the WDS sensor has been known to perform correctly with 'normal' conditions in the distribution system. At regular time intervals during this period, the values of a representative set of variables from the data are converted into a pattern, which represents the state of the WDS zone at that time instance. This pattern is then stored in an AURA associative memory.
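
The idea of storing quantised 'normal' states and scoring new states against them can be sketched as below. This is a heavily simplified stand-in and not the AURA Alert algorithm: patterns are kept in a plain Python set rather than a CMM, the match strength is taken as the percentage of variables whose bins agree with the closest stored pattern, and no credit is given for neighbouring bins. The class name, the bin ranges and the use of bin 0 for missing data in this sketch are all assumptions.

```python
def quantise(value, low, high, n_bins):
    """Map a value to a bin index in 1..n_bins; 0 is reserved for missing data."""
    if value is None:
        return 0
    if high <= low:
        raise ValueError("high must be greater than low")
    idx = int((value - low) / (high - low) * n_bins) + 1
    return min(max(idx, 1), n_bins)  # clamp out-of-range values to the end bins


class NoveltyMemory:
    """Stores quantised normal states and scores new states against them."""

    def __init__(self, ranges, n_bins=10):
        self.ranges = ranges      # one (low, high) pair per variable
        self.n_bins = n_bins
        self.patterns = set()

    def encode(self, hour, values):
        """Pattern = time-of-day variable followed by one bin index per variable."""
        bins = [quantise(v, lo, hi, self.n_bins)
                for v, (lo, hi) in zip(values, self.ranges)]
        return (hour,) + tuple(bins)

    def store(self, hour, values):
        self.patterns.add(self.encode(hour, values))

    def match_strength(self, hour, values):
        """100 for a previously seen state, lower as fewer variables agree."""
        if not self.patterns:
            return 0.0
        query = self.encode(hour, values)
        best = max(sum(q == p for q, p in zip(query, pattern))
                   for pattern in self.patterns)
        return 100.0 * best / len(query)


# Hypothetical flow (0 to 10) and pressure (0 to 100) readings at 03:00.
memory = NoveltyMemory(ranges=[(0.0, 10.0), (0.0, 100.0)])
memory.store(3, [2.5, 42.0])
print(memory.match_strength(3, [2.6, 43.0]))   # same bins as the stored state
print(memory.match_strength(3, [9.9, 43.0]))   # flow falls in an unseen bin
```

Training corresponds to calling `store` at regular intervals over the period of normal operation; monitoring corresponds to calling `match_strength` on each new reading and alerting when it falls below a chosen threshold.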

Pattern matching
Data analysis was conducted using the SDE and a query by content approach for pattern matching. In addition, pattern matching software was developed in C# using Microsoft Visual Studio. Libraries of event profiles were created from .csv data to allow batch processing. Ten DMA inlet flows (A to J) were obtained for a large water supply system, with a mixture of urban and rural areas, for an approximate 8 month period for use in selecting burst profiles (industry standard 15 minute sampled data) along with the Work Management System (WMS) mains repairs record. A pattern library of known bursts for these DMA flow inlets using the SDE was assembled from this dataset.
These were identified from within the 10 DMA flow inlet datasets (normalised as described previously in order to allow generalisation from the DMA flow values) by using WMS information to confirm large burst events and hence creating a set of profiles consisting of a number of consecutive measurements (described in the pattern matching section). These were chosen to capture the significant first features of change in parameter due to an event, using between one and two hours of data. The SDE allows searching for similar patterns in this library. One DMA (G) was held back for testing using the pattern library. An example is provided in Figure 4 of a detected burst in this DMA, which was matched with a very high probability to a burst from another DMA.
Three other matches of above 90% match score were obtained for DMA G for the whole period of analysis, summarised in Table 2. In the case of each of the detections, visual analysis revealed that the profile was briefly unusual, although there was only one correlation found with WMS history. The results for an Artificial Intelligence (AI) analysis system and all mains burst repairs (MR) for the same DMA are also reported in Table 2 (after Mounce et al. ). Table 2 reports three MR in the period, of which two were detected by the AI system (the other having no significant impact on the nightline). In particular, a burst of significant duration was repaired on 24/12/05; it was detected by the AI system (a total of three AI detections > 85% confidence) but did not result in a hit using pattern matching.
Although this example illustrates the transferability of the concept of a 'burst' pattern, a limitation in the approach is in the manual assembly of the pattern library and the uncertainty prevalent in defining event classes for WDS.
Even when limited to burst-only patterns, performance on the test DMA was found to be less accurate than that of an AI system utilising outlier detection. Accurate selection of precursor patterns is also far from obvious. Using AURA Alert to automatically calculate a novelty score for any type of event, possibly never encountered before, was thus identified as a more attractive technique, with the possibility of detecting precursor features before major failure.

AURA Alert
The AURA Alert system utilising CMMs has been used for the detection of irregularities in highly complex assets in a variety of different industries. The application of AURA Alert to real data from two different WDSs, to explore the capabilities of the method, and the results obtained are now described.

Flow data analysis
The DMA inlet flows A to J, used in the pattern matching test, were each analysed by the AURA system and performance compared to WMS and the aforementioned AI system (a 4 week period was used to create the CMM model). A match score threshold of 85% was used to identify reasonably large deviations from normality, resulting in 20 overall detections (in comparison to 16 for the AI system). Of these, four detections corresponded well to WMS burst repairs (for the AI based system this number was five, with three of these detected by both systems). Of the remainder, 13 were correlated visually with abnormal temporary increases in flow and three with likely short sensor drop outs to zero.
Overall the performance was thus comparable to the AI detection as reported in Mounce et al. (). AURA offers other possible advantages, such as analysis across multiple parameters or potential detection of short precursor events, as further explored in the next two examples.

Water quality example
A multi-parameter water quality dataset was obtained for a measuring instrument based at a DMA inlet in an urban WDS deployed as part of a pilot study. Parameters measured were water temperature, pH, conductivity, turbidity and pressure at a 5 minute resolution. Data from a period of several weeks when the DMA was considered to be operating normally were presented to the AURA Alert system and the learned configurations encountered were stored in the AURA memory. Figure 5 shows the five channels corresponding to the raw data. The AURA Alert output can be seen in the 'Match Strength' channel (bottom axes), which has a value of 100 when in a previously seen state and drops down when a novelty is detected. In Figure

Pre-cursor example
The final example presented is in the use of AURA Alert to identify novelties in multiparameter data several days before a catastrophic failure in a complex asset, without any prior knowledge of similar failures. A flow and pressure dataset was assembled for a DMA. The data consisted of 15 minute readings, the WMS record and any associated customer contacts (CC) (complaints to call centres). These data include pressure data from the DMA inlet in addition to two specific point pressure loggers located at critical (determined by expert judgement) locations in the DMA.
Hydraulic data were utilised, with AURA trained using several weeks of normal data, and a test period with known multiple events and supporting information has then been used to illustrate the possibility of precursor detection.
The fact that the water company recorded a repair start date in the WMS database on 20th September supports this. Whilst Figure 6 shows the potential for precursor detection, confirmation can be rather subjective due to the resolution of data and in particular the

It has been demonstrated that the AURA Alert system has the potential for detecting changes across multiple parameters, allowing robust detection and information for interpretation, and offering scope for detection of event precursors. Timely event detection and diagnosis offer significant improvements in service delivery with a move towards proactive maintenance, while the implication of precursor information is to provide network engineers additional time to investigate the cause of abnormal conditions and perhaps prevent major asset failure before customers are impacted. Of course, datasets with more exhaustive information (such as known artificial hydrant flushing) could be used to evaluate more rigorous quantifiable error metrics such as the level of false-positives.

General discussion
WDS sensors monitor assets (reservoirs, pipes, valves, etc.) with the performance of these assets being indicated by the collected measurements. At the present time, the granularity, i.e. number of devices and sampling interval, is quite limited compared to other industries. However, the quantity and complexity of sensor and environmental data are growing at an increasing rate and it seems clear that in the future the water sector can and should be penetrated by Information and Communications Technologies and Internet-like technologies. It is easy to anticipate that the environment may before very long be teeming with tens of thousands of small, low-power, wireless sensors. Each of these devices will produce a stream of data, and those streams will need to be monitored and combined to detect changes of interest in the environment. The easier it is to collect and analyse large datasets the more water utilities will collect and, in a decade, tens or even hundreds of petabytes of data may be routinely available.
Demands for solutions and tools will become more urgent to meet the aspiration for intelligent water networks, proactively managed through access to timely information. Permanent installation of high frequency (several hundred or even thousand Hz) pressure monitoring devices may also become routine and pilot studies using these have demonstrated how the arrival times of the burst induced wave at the measurement points can be used to derive the location of the burst using transients (Misiunas et al. ). The data compression facilities of systems such as AURA could prove very useful for these future data quantities.
This proliferation of monitoring will facilitate the continuous and simultaneous monitoring of the complete WDS (or at least significant sub-areas). By evaluating deviation from normality from a set of distributed sensors, both detection and location of abnormal events will be

CONCLUSIONS
The effective and efficient operation of WDSs is essential for three important reasons: maintaining safe and continuous supply to consumers, avoiding loss of water resources through leaks and bursts in the pipe network, and reducing the energy and other resources input to the system and so minimising the carbon footprint of water system operations.
To achieve this efficiency, information is continually required about current system performance, so adjustments can be made where necessary and interventions can occur before any fault or failure impacts on the customer. This paper has presented the use of pattern matching and binary associative neural networks using time series from WDS. Using AURA Alert, time series data from sensors (variables) are converted into vectors using a quantisation process. Vectors are then stored in a historical database in the correlation matrix memory. New data presented as vectors can either be used to generate the k best matching historical patterns or alternatively a measure of novelty (termed Match Strength) can be generated. One of the major features of the system is its ability to search small and very large datasets very quickly. The key conclusions of this research are as follows:
• A pattern matching approach can be proficient at finding known patterns in data and has been applied successfully for many applications. The transferability (i.e. not tuned per DMA) of burst patterns was demonstrated here to some extent. However, overall the performance was found to be not as high as when using outlier detection based methods for this type of WDS time series data. A limitation of the approach is in the manual assembly of the pattern library and the uncertainty prevalent in defining event classes for WDS.
• AURA Alert (Advanced Uncertain Reasoning Architecture, utilising a class of binary neural network built on CMMs) can rapidly learn and model the normal operating envelope for a system, with the ability to search through complex high-dimensional multivariate spaces to detect deviations from normal conditions. The novel use of AURA Alert in WDS so as to automatically calculate a continuous novelty score for every time step and hence enable the detection of any type of event, possibly never encountered before, was proposed, explored and demonstrated. Examples have demonstrated successful early detection of abnormality in systems using multi-parameter data as well as significant potential for precursor event detection beyond typical outlier detection approaches. These precursors could be linked to appropriate maintenance requirements for water infrastructure.