This research aims to simulate bio-contamination risk propagation under real-life conditions in the water distribution system (WDS) of Lille University's Scientific City Campus (France), addressing both source identification and response modeling. Neglecting dynamic reactions and the possible chemical decay of most contaminants leads to an overestimation of the exposed population. Therefore, unlike the available event detection models, this study considers the interrelated changes of several water-quality parameters, such as free chlorine concentration, pH, alkalinity, and total organic carbon (TOC), resulting from pollutant blending. Starting from regular WDS monitoring, baseline thresholds are established for each of these parameters; significant deviations from the baseline are then used as indicators of contamination. The purpose of the research was therefore to develop and demonstrate the feasibility of an artificial intelligence (AI)-based smart monitoring system that effectively enables water operators to ensure quasi real-time quality control for early chemical and/or bio-contamination detection and preemptive risk management. Advanced pattern recognizers, such as Support Vector Machines (SVMs) and Artificial Neural Networks (ANNs), have been used for this purpose to identify anomalies and assess their severity level.
Within the European project Smart Water for Europe (SW4EU), from which the current paper originates, the limitations of traditional approaches to optimal water-quality sensor placement in water distribution systems (WDSs) were discussed, together with bounds on the achievable performance of a water-quality sensor network (van Thienen et al. 2018). Vitens, the largest drinking water company in the Netherlands, also allocated a designated part of its distribution network as a demonstration network for online water-quality monitoring, the Vitens Innovation Playground (VIP). In the VIP, a network of 44 sensors was installed, transmitting data by GPRS to servers where the data were processed using event detection algorithms. Deployed as an online sensor network, it allowed early detection and rapid response, as well as accurate localization of the spread of a contamination within the distribution network (Williamson et al. 2014).
In addition, various models have been developed for event detection in WDSs. The detection of a contamination event requires that the related variations in the values of the measured parameters can be distinguished from normal daily and/or seasonal fluctuations. It is therefore necessary to use specific algorithms, essentially based on statistical or stochastic methodologies (van Thienen et al. 2013; Blokker et al. 2016). In order to improve event detection systems (EDSs), in recent years data-driven models such as Artificial Neural Networks (ANNs) have been successfully applied to water-quality evaluation focused on modeling and prediction. Perelman et al. (2012) and Arad et al. (2013) developed a model based on multivariate artificial neural networks followed by a Bayesian event classification. Murray et al. (2011) used pH, conductivity, and turbidity coupled with Bayesian Belief Networks (BBNs) for event detection of Escherichia coli. Yang (2013) developed a system-wide algorithm, integrating the binary signals of each sensor's independent event detection with a source identification model based on a backtracking algorithm. By cross-referencing sensor alerts, the algorithm rated the probability of an event occurrence. Arad et al. (2013) aimed to improve on the study of Perelman et al. (2012) by applying adaptively updated dynamic thresholds, which are characterized by a sliding window size and are computed as a function of the error standard deviation within the sliding window.
Based on the concept of dynamic thresholds, and on the use of ANNs combined with considerations of the interrelated changes of several water-quality parameters (e.g. free chlorine concentration, pH, alkalinity, total organic carbon (TOC)) resulting from pollutant blending, the purpose of this paper is to develop and demonstrate the feasibility of smart monitoring of water-quality parameters for early bio-contamination detection, as well as the visualization of the detected anomalies on the network. The objectives of this study are mainly: (i) simulations with engineering models (such as EPANET and EPANET-MSX) of a pilot demonstration at the University of Lille water network of Escherichia coli blending in the drinking water, to define a pattern recognition for the changes in the physical/chemical water parameters (Shang et al. 2008); (ii) the definition of baseline thresholds for each of the measured parameters, used as indicators of probable bio-contaminations; and (iii) the development of statistical models and artificial intelligence (AI)-based algorithms enabling non-specific anomaly detection, referring to the previously defined pattern recognition, and its geo-localization. The proposed algorithms are tested and compared with each other on the WDS of the Lille Campus (northern France). The results show not only efficient anomaly detection and risk-based classification, but also the ability of the final output to visualize the contaminated nodes on the network map, according to a risk severity scale.
In the current paper, ‘chlorine’ is used as shorthand for ‘free chlorine’.
The use of quality sensors to detect accidental or deliberate biological contaminations in WDSs produces large-scale databases, raising the need for a smart data-processing procedure to be established for the development of a risk assessment model, together with a decision support system. In detail, the model should be able to detect biological contaminations in a drinking water network early, in order to efficiently enable water operators to ensure real-time water-quality control management while filtering out false alarms.
Tinelli et al. (2018), Tinelli & Juran (2017) and Abdallah (2015) demonstrated numerically and experimentally the feasibility of detecting non-specific biological anomalies (such as E. coli) through the use of statistical and engineering models for chlorine trend analysis, developing a prototype system for early non-specific bio-contamination detection. The authors also illustrated a comparison between numerical simulations (carried out using EPANET-MSX) of the chlorine decay trend during injection of E. coli and a laboratory model test performed at the University of Lille (Abdallah 2015). Therefore, in the absence of field data on bio-contamination, the EPANET-MSX model was used for scenario simulations to produce numerical data (hereafter named Chlorscan data) representing the chlorine decay trend during E. coli injection and diffusion in the system, and simulating the effect of bio-anomaly scenarios on chlorine concentration in water distribution networks. In addition to early detection based on monitoring of chlorine concentration decay, the AI-based methodology presented in this paper enables a systemic integration of multi-parameter co-variant trend analysis in order to improve the reliability of the anomaly detection and geo-localization process.
In machine learning, SVMs are supervised learning models (Cortes & Vapnik 1995) with associated learning algorithms that analyze data for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, acting as a non-probabilistic binary linear classifier. Hence, the paper proposes an SVM able to identify two different classes, that is, (+1) or (0), respectively, for ‘anomaly’ or ‘no-anomaly’ classification. Since a single SVM only resolves two-class problems, the paper proposes an SVM model composed of several binary classifiers to distinguish different anomaly levels (Mamo et al. 2014). Thus, the proposed SVM can classify incoming unknown data as belonging to one specific signature, based on the anomaly severity.
The identification of each of the anomaly signatures starts according to the scales drawn in Tinelli et al. (2018). In particular, the default threshold values of the deviation Amplitude (A) identifying different likelihood levels are calculated through statistical data analysis, which yields the deviation amplitude thresholds corresponding to the 1st, 2nd and 3rd Standard Deviations (STDs) of the Chlorscan data. Each STD indicates the likelihood level of an anomaly occurrence, as shown in Figure 1. Using these scales, different values of Amplitude (A) and Duration (D) define the anomaly signature: the amplitude has three levels, that is, A1 in the range between the 1st and the 2nd STD, A2 between the 2nd and the 3rd STD, and A3 above the 3rd STD; durations are divided into four groups, as a function of the duration in hours. A likelihood matrix is established using the selected thresholds of the state parameters (A, the deviation amplitude, and D, the number of hours the anomaly amplitude persists), as reported in Figure 1(b).
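The amplitude and duration scales above can be sketched as a simple classifier. In the following Python sketch, the STD-based amplitude levels follow the scheme described above, while the hour boundaries of the four duration groups are illustrative assumptions (the actual boundaries are given in Figure 1):

```python
import statistics

def amplitude_level(deviation, baseline):
    """Map a chlorine deviation amplitude to levels A1-A3 using the
    1st, 2nd and 3rd standard deviations of the baseline data."""
    std = statistics.pstdev(baseline)
    a = abs(deviation)
    if a < std:
        return None        # within normal fluctuation: no anomaly
    if a < 2 * std:
        return "A1"        # between the 1st and 2nd STD
    if a < 3 * std:
        return "A2"        # between the 2nd and 3rd STD
    return "A3"            # above the 3rd STD

def duration_level(hours):
    """Bin anomaly persistence into four duration groups
    (the hour boundaries here are illustrative assumptions)."""
    if hours <= 2:
        return "D1"
    if hours <= 6:
        return "D2"
    if hours <= 12:
        return "D3"
    return "D4"

def signature(deviation, baseline, hours):
    """Combine amplitude and duration levels into an anomaly signature."""
    a = amplitude_level(deviation, baseline)
    return None if a is None else (a, duration_level(hours))

baseline = [0.48, 0.50, 0.52, 0.50, 0.50]   # illustrative chlorine readings (mg/L)
print(signature(0.05, baseline, hours=13))  # → ('A3', 'D4')
```

The (amplitude, duration) pair returned by `signature` indexes one cell of the likelihood matrix of Figure 1(b).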
Figure 1(a) illustrates the definition of the anomaly signatures according to the amplitude and duration levels. In more detail, using the classification of the signatures, Figure 1(b) reports the risk-scale classification of the SVM output: it is evaluated at each single node, since the methodology investigates the chemical/physical water parameters at every node. Therefore, the ith SVM is designed to recognize one of the presented anomaly signatures, and the output of the ith node state is ‘+1, or positive’ if the contamination is actually present in the specific node, and ‘0, or negative’ otherwise. Figure 1(c) illustrates the steps of the multi-class SVM anomaly detector. If an anomaly is detected, the procedure continues with the identification of the severity level, according to the defined signatures (Tinelli 2018).
Regarding the input data, the proposed methodology requires: (i) X, the so-called matrix of predictor data, where each row is one monitored node of the network (thus, the number of rows equals the number of nodes) and each column is one parameter (e.g. chlorine, TOC, etc.); and (ii) Y, the array of class labels, with each row corresponding to the value of the corresponding row in X. Y is indeed a column vector whose values are +1 or 0, according to the classified category, that is, ‘positive, anomaly’ or ‘negative, no anomaly’, respectively. The methodology assumes a linear separation for two-class learning, meaning that the data are separated by a hyperplane.
The resulting trained model (SVMModel) contains the optimized parameters from the SVM algorithm, enabling the classification of new data.
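As an illustration of this training step, the following sketch uses scikit-learn's `SVC` in place of the authors' MATLAB implementation, with made-up chlorine/TOC deviation values; it builds a linear two-class SVM from a small X/Y pair and classifies a new reading:

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative matrix of predictor data X: one row per monitored node,
# columns = [chlorine deviation, TOC deviation] (synthetic values).
X = np.array([
    [0.01, 0.02],   # normal node
    [0.02, 0.01],   # normal node
    [0.45, 0.60],   # contaminated node
    [0.50, 0.55],   # contaminated node
])
# Array of class labels Y: +1 = anomaly, 0 = no anomaly.
y = np.array([0, 0, 1, 1])

# Linear kernel, mirroring the paper's linear-separability assumption.
svm_model = SVC(kernel="linear")
svm_model.fit(X, y)

# Classify a new, unseen reading.
print(svm_model.predict([[0.48, 0.58]]))   # → [1]
```

The fitted `svm_model` plays the role of the trained SVMModel described above: it holds the separating hyperplane and can label new node readings.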
In order to provide a comparison for the results obtained, an ANN was also applied (Kohavi & Provost 1998). ANNs are computational models based on a large collection of connected simple units called artificial neurons, loosely analogous to the neurons in a biological brain. To define a pattern recognition problem, the ANN requires a set of input vectors arranged as columns in a matrix. Then, another set of target vectors is required to indicate the classes to which the input vectors are assigned. In detail, the ANN input data consist of: (i) the matrix of predictor data, where each row is one parameter (e.g. chlorine, TOC, etc.) and each column is one monitored node of the network (thus, the number of columns equals the number of nodes); and (ii) the array of class labels: when there are only two classes, each scalar target value is set to either 0 or 1, indicating which class the corresponding input belongs to. Once the input is defined, the pattern recognition tool is able to train the network, evaluate its performance using cross-entropy and percent misclassification error, and analyze the results using visualization tools such as confusion matrices and Receiver Operating Characteristic (ROC) curves. The performance of the network can always be evaluated on a test set. The network can also be retrained with modified settings or on a larger dataset in order to obtain more satisfactory results.
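The ANN workflow above can be sketched with scikit-learn's `MLPClassifier` standing in for the MATLAB pattern recognition tool (both minimize a cross-entropy loss); the node readings below are synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix

# Illustrative inputs arranged as in the pattern recognition tool:
# one row per parameter (chlorine, TOC), one column per monitored node.
inputs = np.array([
    [0.01, 0.02, 0.45, 0.50],   # chlorine deviation at each node
    [0.02, 0.01, 0.60, 0.55],   # TOC deviation at each node
])
targets = np.array([0, 0, 1, 1])  # 1 = anomaly at that node

# scikit-learn expects samples as rows, so the input matrix is transposed.
clf = MLPClassifier(hidden_layer_sizes=(10,), solver="lbfgs",
                    max_iter=1000, random_state=0)
clf.fit(inputs.T, targets)

# Confusion matrix of the (here, trivially small) training set.
print(confusion_matrix(targets, clf.predict(inputs.T)))
```

In practice, the confusion matrix and ROC curve would be computed on a held-out test set rather than on the training data shown here.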
The proposed methodology has been applied to a real case study, presented in the following section, in order to test and demonstrate it.
TESTING AND DEMONSTRATION
The described data analysis has been tested through its application to a real case study, the SUNRISE Demonstration Site located at the Lille University Campus in northern France. The Lille Campus has been located in the town of Villeneuve d'Ascq since 1967. It can be compared to a small city: it covers an area of 110 hectares and comprises 145 buildings with very different uses. The network is nearly 15 km long. As a university and scientific city distribution network, it experiences large differences in the quantity of water supplied due to seasonal influences and holidays. The pipes are mainly made of cast iron, with diameters ranging from 20 to 300 mm. The water network includes 49 fire hydrants, 250 valves, 93 Automatic Meter Readers (AMRs) measuring hourly water consumption, five pressure sensors and two Virtual District Metering Areas (VDMAs). The network is about 60 years old, suffering from ageing with significant leakage. For the purpose of water-quality control, the demonstration included online monitoring using Optiqua's EventLab and s::can sensors, which were installed in two monitoring stations at building connections. Before installation, the sensors were tested in a laboratory pilot, which allowed simulating injections of chemical and biological substances at controlled concentration and duration and following the responses of the sensors to the injections. Sensor data were compared with laboratory analyses to check the reliability and accuracy of the sensors.
An EPANET hydraulic model was integrated with the site monitoring, as shown in Figure 2(b). In this case study, the system water demand was assumed to vary in hourly steps. One month of data was averaged to form a one-day (24-hour) pattern, which was used for the demand multiplying coefficients, and the simulations were run for one day.
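The averaging of one month of hourly data into a single 24-hour demand pattern can be sketched as follows (the readings are synthetic, and the final normalization to a mean of 1 is the common convention for EPANET demand multipliers):

```python
import numpy as np

# Hypothetical month of hourly demand readings: 30 days x 24 hours.
rng = np.random.default_rng(42)
month = rng.uniform(0.5, 1.5, size=(30, 24))

# Average each hour of the day across the month to obtain one 24-hour
# pattern, then normalize so the multipliers average to 1, as is common
# for EPANET demand patterns.
hourly_mean = month.mean(axis=0)
multipliers = hourly_mean / hourly_mean.mean()

print(np.round(multipliers, 3))   # 24 demand multiplying coefficients
```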
Tinelli & Juran (2017) presented numerical simulations of bio-contaminant propagation during selected scenarios of E. coli injections to provide several datasets representing the resulting chlorine decay patterns in the WDS of the Lille Campus.
Chlorine and TOC are thus used as input data to conduct a multi-variable analysis at single and multiple spots for specific bio-contaminations (E. coli injections). Regarding the SVM, the matrix of predictor data for mono-spot analysis (chlorine and TOC) is input as explained in the previous section. The array of class labels is defined according to the statistical process drawn in Tinelli et al. (2018): for each row of the matrix, the proposed model evaluates chlorine and TOC in order to classify the row into one of the two classes, that is, (+1) or (0), respectively, for ‘anomaly’ or ‘no anomaly’. As described, the recognition of each of the anomaly signatures starts from the Amplitude (A) and Duration (D), as illustrated in Figure 1(a) and 1(b). Thus, the array of class labels is a vector made up of (+1) or (0). The methodology used the MATLAB programming language to train and cross-validate the SVM model for the two-class (binary) classification.
RESULTS AND DISCUSSION
The available chlorine and TOC datasets are randomly split into a training phase (70% of the database) and validation and testing phases (the remaining 30% of the database). The results obtained show the proper classification of the evaluated datasets, including:
the class order, which is ‘+1’ for the positive (anomaly) class and ‘0’ for the negative class;
out-of-sample misclassification rate.
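A minimal sketch of this split and of the out-of-sample misclassification rate, using synthetic chlorine/TOC deviations and scikit-learn instead of the authors' MATLAB workflow:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the chlorine/TOC database: two well-separated
# clusters of (chlorine deviation, TOC deviation) readings.
rng = np.random.default_rng(0)
normal = rng.normal([0.02, 0.02], 0.01, size=(100, 2))
anomalous = rng.normal([0.50, 0.55], 0.05, size=(100, 2))
X = np.vstack([normal, anomalous])
y = np.array([0] * 100 + [1] * 100)   # +1 = anomaly, 0 = no anomaly

# Random 70/30 split between training and held-out data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=1)

model = SVC(kernel="linear").fit(X_tr, y_tr)
# Out-of-sample misclassification rate (the classification loss).
loss = 1.0 - model.score(X_te, y_te)
print(f"misclassification rate: {loss:.3f}")
```

On such cleanly separated synthetic clusters the loss is near zero; the 4‰ loss reported below is the value obtained on the actual Chlorscan data.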
Regarding the tested Chlorscan data, the classification loss is approximately 4‰, meaning that the system is able to detect the anomaly with an accuracy close to 100%.
In addition, the proposed methodology is able to visualize the contaminated nodes in the network, according to the color-based risk analysis indicated in Figure 1.
Chlorine and TOC datasets are also input as rows in a matrix in the Neural Network Pattern Recognition Tool (ANN-based algorithms) in order to create and train a network, as well as to evaluate its performance using cross-entropy and confusion matrices. The confusion matrix, illustrated in Figure 2(a), reports the number of false positives, false negatives, true positives, and true negatives, and consequently allows for a detailed analysis in terms of accuracy. The network is trained with scaled conjugate gradient backpropagation. The confusion matrix in Figure 2(a) shows that the error is around 2‰, illustrating the efficiency and closeness (i.e. how close the trained/validated and tested sets are in a metric space) of the train/validation/test phases, and demonstrating the accuracy of the methodology. In addition, the difference between the SVM and the ANN is 0.0019, approximately 2‰.
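The error rate follows directly from the four counts of a confusion matrix; the counts below are hypothetical, chosen only to reproduce an error on the order of the reported 2‰:

```python
import numpy as np

# Hypothetical 2x2 confusion matrix [[TN, FP], [FN, TP]]; the counts are
# invented for illustration, not taken from the study's actual results.
cm = np.array([[498, 1],
               [1, 500]])

tn, fp = cm[0]
fn, tp = cm[1]
total = cm.sum()

error_rate = (fp + fn) / total      # misclassification error
accuracy = (tp + tn) / total
print(f"error: {error_rate:.3f}, accuracy: {accuracy:.3f}")
# → error: 0.002, accuracy: 0.998
```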
Figure 2(c) shows the application of the proposed methodology for multi-spot analysis for the Lille network shown in Figure 2(b). In this case, the matrix of predictor data and the array of class labels are the same but they report multiple changes in the values of chlorine and TOC, according to the injection locations. The final output illustrated in Figure 2(c) points out the visualization of the contaminated nodes in the network according to the color-based risk analysis procedure.
For these scenario simulations of multi-spot injections of contaminants at various nodes (nodes 1 and 204) of the Lille Campus network, the outcome of the co-variant risk analysis, shown in Figure 2, supports the visualization of both the temporal and spatial geo-localization of the bio-anomaly in the network. In fact, the chlorine and TOC are input as time series in which every value corresponds to a predefined time-step. Therefore, the difference in the time at which the bio-chemical water parameters change at a specific node is directly related to the contaminant transport rate. By comparing the results at different time-steps, the sensitivity and accuracy of the methodology can be assessed to efficiently identify the risk level (or event likelihood) and preemptively minimize public health risks and consumer concerns.
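This per-node timing comparison can be sketched as follows, where the first time-step at which the chlorine deviation exceeds a threshold marks the anomaly's arrival at each node (node names, values, and the threshold are all illustrative):

```python
# Hypothetical chlorine time series (mg/L) at three nodes; the anomaly
# reaches each node at a different time-step as it propagates.
series = {
    "node_A": [0.50, 0.50, 0.30, 0.20, 0.20],
    "node_B": [0.50, 0.50, 0.50, 0.30, 0.20],
    "node_C": [0.50, 0.50, 0.50, 0.50, 0.30],
}

def first_detection(values, baseline=0.50, threshold=0.15):
    """Return the first time-step whose deviation from the baseline
    exceeds the threshold, or None if it is never exceeded."""
    for t, value in enumerate(values):
        if abs(value - baseline) > threshold:
            return t
    return None

# Earlier arrival suggests a node closer to the injection point.
arrivals = {node: first_detection(v) for node, v in series.items()}
print(arrivals)   # → {'node_A': 2, 'node_B': 3, 'node_C': 4}
```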
Unlike the available event-detection models, the aim of the early bio-anomaly detection methodology presented in this study is to build AI-based algorithms upon the observed interrelated changes of several water-quality parameters, such as free chlorine concentration, pH, alkalinity and TOC, resulting from E. coli blending and diffusion in WDSs. Using AI, an algorithm appropriately trained on the standard conditions of a system is able to recognize deviations from normal conditions evident enough to constitute an anomaly. In particular, the proposed methodology enables the integration of the datasets obtained for each parameter through the use of AI-based pattern recognition algorithms for automated detection and accurate visualization of bio-anomalies in WDSs. Both of these features improve water-quality control: by identifying the correct contamination source and accurately estimating the contamination time associated with each possible contamination source, even in the case of large and complex WDSs, false alarms are eliminated. Thus, the efficiency of the system is increased and public health risks are preemptively minimized. Among the available supervised learning models, Support Vector Machines (SVMs) and Artificial Neural Networks (ANNs) are tested and compared with each other.
Both the SVM learning and the ANN analysis demonstrate promising performance, which leads to the following conclusions: (i) multi-class SVM advanced pattern recognizers can be successfully employed for contamination detection and classification in WDSs; (ii) the integration of multi-class SVMs and advanced pattern recognizers is able to assist the decision-making process of utility managers and to benefit system operators in the early detection of bio-anomalies; (iii) the feasibility of early detection of bio-anomalies in drinking-water distribution systems is demonstrated through the use of chlorine decay trend monitoring and Chlorscan data; and (iv) the accuracy of anomaly detection is improved, leading to a more precise localization of the anomaly on the network.
Finally, a color-based risk analysis procedure, which indicates the likelihood and the severity of the event itself at each node and supports geo-localization of the bio-anomaly source via visualization of the risk level at the contaminated nodes, is presented and applied on the Lille Campus network for both mono-spot and multi-spot injections.
Therefore, within the frame of the European project SW4EU, the paper supports the scientific community in that it was used to implement software able to run scenarios, loaded as Excel data, to provide a risk matrix and eventually raise an alarm. The analyzed WDS is also displayed in the software through a user-accessible Google Earth visualization, so that the detected anomalies can be geo-located on the map according to the explained color-based procedure. Consequently, an effective and preventive monitoring system was obtained, providing crucial information for timely intervention and limiting harm to citizens.
However, the final purpose of the current work is the development of software based on the use of the advanced pattern recognizers and linked with ArcGIS, in order to contribute to the decision support systems of water utility managers in real time. Finally, future developments may also concern the study of other contaminants (such as pesticides and herbicides) in order to obtain a general picture of the water-quality parameters if a contamination event occurs.