ABSTRACT
The timely and robust detection of anomalies is essential for resilient and secure operations of critical water infrastructures against operational faults or malicious actions. However, real-world systems exhibit diverse and evolving spatiotemporal relationships among their components, posing an intricate challenge in anomaly detection. This study proposes a Hybrid Dynamic graph neural network that jointly maps long- and short-term spatiotemporal relationships in multivariate data streams. Those relationships are encoded via a hybrid graph, comprising an optimally learned static subgraph for persistent relationships and a complementary dynamic subgraph for dynamically shifting relationships. Additionally, an attention mechanism captures time-varying relational importances and shifts the model's focus towards significant relationships, while minimising contributions of less importance to the final outputs. The proposed architecture is showcased through a synthetic case study of a water distribution system with multivariate data streams from both on-site and soft sensors, and shows strong detection and localisation accuracy against all scenarios of operational faults and malicious actions explored. A comparative analysis with an equivalent static graph model indicates that the addition of the hybrid dynamic component enhances detection accuracy and reduces false alarm rates through more robust characterisation of behaviours, thus allowing actionable insights for more resilient and secure operations.
HIGHLIGHTS
This study introduced Hybrid Dynamic GNN to capture long- and short-term relationships in complex multivariate systems.
Dynamically adjusted attention mechanisms to varying relationships’ significance.
Reliable detection and explanation both operational faults and malicious actions in complex, stochastically driven systems.
Comparative analysis suggests reduced false alarms and increased detection accuracy over static GNN in identical scenarios.
INTRODUCTION
Water infrastructures, as critical entities for the well-being and security of societies, are expected to continuously sustain and overcome any stressor that may hinder integrity of their services. This constitutes an increasingly complex task, as the current (and future) operational landscape of the sector is characterised by a plethora of disruptive events, including typical operational faults (e.g., pipe bursts and asset malfunctions), and intentional malicious actions (e.g., sabotages and cyber–physical attacks). In the unremitting quest for resilience, albeit the merits of strategic planning (see e.g., Dimas et al. 2022; Moraitis et al. 2023), the prompt real-time detection of events is crucial to prevent escalation and cascading of impacts through timely interventions. At its core, anomaly detection, also referred to as outlier detection, relies on the formulation of a benchmark normal profile for a system, to characterise any new behaviour as either normal or abnormal (Patcha & Park 2007). Although conceptually straightforward, in practice anomaly detection is challenged by many factors (Chandola et al. 2009), both intrinsic (i.e., technique-specific) and extrinsic (i.e., domain or case-specific).
With a focus on the water sector, real-time anomaly detection is identified as a key capability for mature and insightful Water Digital Twins (SWAN Digital Twin Work Group 2022) – a point in the Digital Water horizon emerging from the new data landscape and smart hydroinformatics (Makropoulos & Savić 2019; Sarni et al. 2021). However, the field of anomaly detection has a vast amount of branching techniques that propagate from their early statistical form in the seminal works of Newcomb (1886) and Edgeworth (1887), and expand to the deep learning era. There exist literature over the suite of anomaly detection techniques, both wide (see e.g., Patcha & Park 2007; Chandola et al. 2009) and type-specific (e.g., deep learning (Pang et al. 2021) and graph neural networks (GNNs) (Wang & Yu 2022), detailing their merits and drawbacks for a variety of applications. Notably, there are the following five key challenges that overarch the anomaly detection applications:
CH.1: Water systems, like other infrastructures (e.g., power grids and communication networks), are dynamic by nature, with complex spatiotemporal relationships and multimodal ‘normal’ behaviours.
CH.2: The new data landscape is characterised by larger and more heterogeneous data streams (Makropoulos & Savić 2019; Savić 2021).
CH.3: The difference between normal and anomalous behaviour is obscured by the complex cyber–physical architecture of systems – and even concealed under bogus data in the case of cyber–physical attacks.
CH.4: It is hard to attain complete, accurate, and representative labelled datasets of past events (Chandola et al. 2009; Mounce et al. 2011; Wangen 2019).
CH.5: The majority of anomaly detection techniques are focused on a specific application or a specific problem formulation (Chandola et al. 2009).
Anomaly detection in the water sector is utilised predominantly in the following two distinct applications: (a) leakage detection or (b) attack detection, yet never jointly. The early formulation for both applications was through model-based approaches, with Carpentier & Cohen (1991) comparing on-site measurements to a calibrated static model to detect leakages in the Vaugirard network in Paris, and Amin et al. (2013) proposing hydrodynamic modelling as a reference to detect false-data injection attacks in open channel sensors. Later, Housh & Ohar (2018) proposed the use of EPANET (Rossman 2000) for a model-based attack detector in water distribution systems. Despite promising results and recent advances in cyber–physical modelling of water systems (Taormina et al. 2019; Nikolopoulos et al. 2020; Murillo et al. 2023), attaining well-calibrated models of real systems is not trivial. Moreover, the uncertainty bound of their estimations is expanded by time-evolving system parameters (e.g. pipe roughness) (Chan et al. 2019), as well as the variability and stochastic nature of the underlying water demands (Kossieris & Makropoulos 2018) that drive the systems.
In recent years there has been extensive work in anomaly detection, catalysed by relevant research ‘battles’ (Taormina et al. 2018; Vrachimis et al. 2022), and machine learning (ML) techniques are increasingly proposed for leakage (Chan et al. 2019) and attack (Tuptuk et al. 2021) detection, utilising both supervised and unsupervised approaches. A bottleneck of supervised ML techniques is their need for labelled training datasets, thus favouring the use of unsupervised techniques.
A common drawback among classical ML approaches is that their representation learning aims to decipher short- and long-term temporal relationships and often overlooks or lacks the mechanisms to explicitly account for combined spatiotemporal relationships. A field of ML that explicitly accounts for those mechanisms is that of GNNs, a deep learning graph representation technique especially potent for anomaly detection (Wang & Yu 2022). GNNs jointly infer the spatiotemporal relationships through the adjacency matrix (a.k.a. connectivity matrix), which can be either user-defined or self-learned by the model, and either static or dynamic; with the most common combination being that of user-defined static graphs. The latter is the case for the prediction-based graph anomaly detection introduced by Tsiami & Makropoulos (2021) for the water sector, via temporal graph convolutional neural networks (TGCNs). In their work, system interconnections were represented as a skeletonised graph of the network, derived by the on-site sensors' location. Although the actual spatial connectivity is a meaningful direct relationship, systems often exhibit secondary and possibly unperceived relations. This yields a major challenge to the wide-spread implementation of GNNs, as their performance is highly affected by the representational quality of the underlying graph (Zhu et al. 2021; Wang & Yu 2022).
To maximise understanding and representation of underlying mechanisms, GNNs can be coupled with metric-based, neural or direct graph structure learning (GSL) methods (Zhu et al. 2021; Chen & Wu 2022), and jointly learn the optimal parameters and graph structure, in an end-to-end manner. A proven technique in this process is ‘node embedding’, which, in simple terms, provides a unique ‘summary’ of each node, learnt from its corresponding data per se – instead of requiring human expertise to characterise it. In practice, the embedding vectors build a map of meaningful representations of each node's characteristics and patterns in a model-readable format, and allow the identification of ‘similar’ nodes as well as the generalisation of ML models to unseen examples. Chen et al. (2020) proposed node embeddings to derive an adjacency matrix optimised for the downstream prediction task, and linearly combine it with an original graph to formulate the final static connectivity for a two-layered graph convolutional network (GCN). In the absence of an initial graph, one may be derived, for example, through kNN (k-Nearest Neighbour) graph constructor based on cosine similarity. Deng & Hooi (2021) took the idea a step further and suggested that the final static graph can be optimally derived solely from the similarity between node embeddings, restricted to a predefined set of candidate relationships if prior knowledge exists. The proposed graph deviation network (GDN) utilises the learned sensor embeddings in both the GSL and the downstream graph attention network (GAT) (Velickovic et al. 2018), to represent the different behaviours of sensors in a static graph, and enhance the anomaly detection of attacks. GAT introduce the concept of attention for graph networks, which allow the adjustment of the convolution parameters, i.e., the degree of association between connected nodes, and thus weight the importance of each pairwise connection in the static graph. However, real-world systems are rarely – if ever – static in their relational interconnections, raising questions over the possible existence of a threshold in the ‘optimal’ representability of static graphs.
Water systems, like power grids, communications, and other systems, are multimodal, i.e., they exhibit multiple operational schemes. For example, a distribution system may interchangeably rely on tanks and pressure-boosting stations to sustain pressure levels in a district metered area (DMA), diversify daily the percentage of supply among its water treatment plants, and so on. Although the spatial connectivity of assets remains practically static, the systems' operational connectivity continuously alternates. Dynamic GNNs (DGNNs) have recently emerged to capture both discrete (Seo et al. 2018) and continuous (Trivedi et al. 2019) evolutions of structural information, and have become the state-of-the-art approach, with several models proposed (Yang et al. 2024). Examples for anomaly detection applications include the AddGraph (Zheng et al. 2019), which utilises GCN and gated recurrent units (GRUs) to learn spatiotemporal relationships of dynamic graphs' snapshots, and NetWalk (Yu et al. 2018), which proposed dynamic clustering of clique embedding outputs, i.e., sets of tightly associated node groups, to detect anomalies based on the derived network representation.
Challenged by alternating relationships between multivariate data streams, the problem of dynamic graph anomaly detection in heterogeneous systems is largely under-explored (Kim et al. 2022), and, notably, never addressed in regard to the water systems. So far, there is still no approach proposed that explicitly considers the multimodal behaviour of the systems and the dynamic spatiotemporal relationships emerging from the evolution of their operational connectivity, without loss of potentially strong, long-lasting relationships.
In this vein, the objective of this work is the development of a robust unsupervised solution for anomaly detection in multivariate data streams (CH.2 and CH.4), able to properly adjust to the underlying dynamic spatiotemporal relationships and multimodal behaviours of any complex, cyber–physical system (CH.1, CH.3, and CH.5).
Inspired by the above-mentioned GNN works, and in response to the identified challenges, this work proposes a hybrid dynamic graph neural network (HDGNN) architecture that jointly represents dynamically shifting and persistent structural relationships in multivariate data streams, through a learned hybrid graph. As the term ‘hybrid’ suggests, the generated graph consists of a learned static subgraph, to encode persistent relationships as permanent edges, and a dynamic subgraph, which encodes additional instantaneous relationships in the system as temporary edges in each timestep. The hybrid graph is passed downstream through a masked attention mechanism layer of the HDGNN to update the features of the vertices based on the information aggregation of the assigned neighbourhood for each timestep. The HDGNN outputs predict the baseline performance of the system and, compared to the supervisory control and data acquisition (SCADA) system measurements, characterise normal or anomalous behaviours at node level.
The novelty of this work lies in the fact that, unlike traditional methods that focus on static relationships, the HDGNN jointly captures both persistent and dynamic relationships in a hybrid graph network. This new approach enables the model to effectively capture the complexities and multimodal behaviours inherent in water systems. Moreover, the proposed architecture utilises mechanisms that allow for better characterisation of spatiotemporal dynamics and varying degrees of relationships, thus enhancing the expressive power of the model and leading to nuanced anomaly detections. Last, but not least, with the aim of building a practical solution for the sector applicable to multiple cases, the proposed prediction-based anomaly detection scheme is adoptable without the need for labelled datasets or previous knowledge over optimal graph structures in the system.
METHODS
In this section, we present the proposed HDGNN model and its basic components, namely: (1) hybrid dynamic graph construction, to jointly encode into a single graph the persistent and temporal structural relationships (edges) between measured entities (vertices) of a system; (2) dynamic masked attention for graphs, to enforce the learned connectivity on embeddings-aware attention mechanisms and generate custom dynamic attention patterns per graph instance, i.e., dynamic graph attention; and (3) stacked fully connected output, to sequentially aggregate vertices' features based on the dynamic attention patterns, apply linear and non-linear transformations to the extracted features, and derive the predictions of the vertices' behaviour. The HDGNN is then coupled downstream with an anomaly scoring system that compares model predictions and actual system measurements to characterise anomalous and normal vertex behaviours, in an end-to-end workflow.
Hereafter, and for simplicity, the term sensor refers to any component (tangible or not) and its measurements derived from soft or on-site sensors, simulation, naïve, or statistical models, while the term system refers to any isolated or interconnected network, such as interdependent critical entities, which the entire set of sensors measures.
Preprocessing
An important step in ML techniques is feature scaling, which improves convergence and stability of the model by transforming sensors' features into a standardised scale. This reduces scale-driven dominance of sensors and ensures a more balanced representation of their influence, through various normalisation and standardisation techniques. Given the problem addressed is focused on multivariate timeseries, we hereby utilise z-score normalisation, transforming inputs into zero-mean and unit-variance features.
Let denote the set of system sensors that measures number of variables over a period T, i.e., produces features sets . The features are normalised via , where is the features' mean and , their standard deviation over period T, for each sensor i.
Hybrid dynamic graph construction
Real-world systems are rarely static, and often exhibit multimodal behaviour, i.e., different vertices are connected via different relationships at different times t, as the system adapts to the most recent conditions. Yet, systems often tend to also retain a certain degree of ‘deterministic’ relationships, especially in relation to source points (e.g., water treatment plants and power plants) or central operational hubs (e.g., storage tanks and transmission substations) at a local or global scale. The goal of the hybrid dynamic graph construction module is to formulate a common graph that reflects both static and dynamic relationships, in a two-step approach. In the first step, graph node embeddings are utilised to extract an optimal static graph that reflects the learned persistent relational structure among system components, without any prior knowledge or initial graph required. The second step seeks to augment the static graph with dynamic vertices derived by sensors' short-term correlation, which reflects the strongest relations emerging in the given period. The fusion of the two creates the hybrid dynamic graph used to represent the sensors' relationships at every timestep t.
The above-mentioned outcome represents the learned size-constrained optimal connections within the system via the corresponding static subgraph .
An indicative threshold criterion, and a good starting point, is , which suggests a strong positive correlation between sensors, or , which reflects both negative and positive strong correlations within the selected time window w.
Subsequently, and for each timestep the hybrid dynamic graph of the system is synthesised as the union of its static and dynamic subgraphs, while the adjacency matrix is derived as the addition of the two adjacency matrices, i.e., , and contains self-edges . Although the hybrid dynamic graph is derived solely from sensor-centred information, pairwise relationships may be enforced or restricted based on expert judgement, to modify the final graph. Experts may enforce or restrict connections within specific subsystems according to their expert judgement, or even utilise this capability to restrict connections with sensors known to be prone to failure, inaccuracies, or vulnerability-induced attacks.
Dynamic masked attention for graphs
GNNs obtain their expressive power and representativeness by leveraging the relational information encoded within the system graph. Central to their operation is the fusion of information between neighbouring vertices, which enables the extraction of higher-level representations of the vertices' features based on the characteristics of their neighbourhood. In the derived hybrid graph, the neighbourhood of each vertex can be defined at each timestep t as , i.e., all the vertices connected to it. However, the presence of connections between vertices does not inherently imply the uniform significance of the represented relationships; rather, it simply signifies their existence. To attend to significant relationships while minimising the contribution of less relevant neighbouring sensors, we utilise a graph attention layer, as conceptualised by Velickovic et al. (2018) and later enhanced through sensors' embeddings (Deng & Hooi 2021). Due to the dynamic nature of the hybrid graph, vertices can dynamically enter or exit each other's neighbourhood over time, adding to the complexity of the process. To adaptively adjust the attention mechanism to the varying significance of relationships of a neighbourhood in a dynamic context, the graph attention layer masks the raw attention weights prior to their normalisation for each timestep t. In that manner, any raw attention learned for the pairwise feature extraction of vertices is applied only in the presence of a connection and normalised based on the characteristics of the respective neighbourhood – yielding higher or lower attention at each different graph occurrence. Similarly to the dynamic subgraph construction process, the HDGNN uses a sliding window to perform feature extraction in both space and time, for each sensor's sequence in the time window . Note that the sliding window can be of different size than the one used for the construction of the dynamic graph.
Stacked fully connected output
The output of the HDGNN is the prediction vector that represents the normalised expected features of each sensor's behaviour at time t, based on the learnt relationships captured in the hybrid graph and the attended feature sequences of its neighbourhood in the time window .
Anomaly detection
However, a disruptive event can cause the system to deviate in numerous sensors, posing challenges to their explainability. To provide more valuable insight and pinpoint the source of anomalies, the proposed anomaly detection scheme suggests an expected causal association between the detected anomaly event and the sensor that exhibits the highest deviation score .
CASE STUDY
A synthetic water infrastructure
A key maturity pillar of Digital Twins for water is their ability to provide insights over the operational state of the system, including the detection of anomalous behaviours (SWAN Digital Twin Work Group 2022). In this vein, the HDGNN was initially conceptualised and subsequently deployed as the latest building block for a proof-of-concept Water Digital Twin (WDT) of a synthetic real-world system. The physical counterpart of the WDT resembled a modified water distribution network of C-Town, a benchmark network of an anonymised real-world system (Ostfeld et al. 2012). To account for real-world conditions and the inevitable pipe deterioration, fissures are assigned across all pipes, following the leakage modelling framework presented by Chela et al. (2023). Hence, the synthetic system exhibits hydraulic behaviours past the ‘optimal design’ of the original benchmark, and under the effect of continuous, dynamically varying leakages, with a mean monthly background leakage ratio of ∼10%. Moreover, the system is driven by a stochastically generated demand timeseries synthesised with the anySim R package (Tsoukalas et al. 2020). The recreated synthetic consumption behaviours preserve the auto-correlation structure of historical patterns and their lag-0 cross-correlation, and lead the system to exhibit synthetic, yet realistic operational behaviours.
In total, the water distribution system comprises five DMAs served by a single water source, i.e., reservoir R1 and operated through 11 pumps, four valves, and seven tanks and a total of 52 on-site sensors. It is supervised by a central SCADA system, which regulates operations via automated controls. The system is hypothesised as not yet equipped with smart-water metering devices at the consumers' endpoints, while their consumptions are measured monthly and recorded to the utility's database.
The historical data of the case study refer to an unlabelled dataset of a 2-year operational window, with 15-minutes measuring frequency and includes measurements from 52 on-site sensors and five soft sensors that estimate the fine-scale patterns of background leakages at DMA level, briefly introduced in Supplementary material, Appendix B.
Case study scenarios
Utilising the SCADA and soft-sensor data, this case study comprises six indicative scenarios, to explore the effectiveness of anomaly detection against a diverse array of scenarios that range from operational faults (e.g., pipe bursts and asset malfunctions), to intentional malicious actions (e.g., cyber–physical attacks). Each scenario spans a 2-week period past the historical period. The scenario designs also accommodate additional evaluation insights, presented along the scenario descriptions as follows:
Sc1: An event-free period under business-as-usual conditions. This scenario serves as a test to false alarms per se, and thus evaluation of the operational applicability of the anomaly detection, as high rates and long false-positive detections can lead to the ‘crying wolf’ effect, i.e., scepticism or disregard when real emergency occurs.
Sc2: A motivated and skilled attacker leverages knowledge over system operations and tank refill automations via data breaches that have gone unnoticed. The attacker thus manipulates the system by injecting realistic bogus data timeseries of the level sensor of the main system tank, i.e., Tank T1 in DMA1. The attack occurs during the refill process with the bogus data indicating an increasing water level above 4.5 m (following a polynomial shape). This leads the control system to close Pump PU2 and maintain connection to the source only via Pump PU1. The attack occurs on Day 7 at 15:45, with a total duration of 8 h. This scenario can help evaluate the anomaly detection against sophisticated attacks concealed as normal operations. It can also help evaluate the ability of the algorithm to ‘escape’ strong autoregressive dynamics, and thus, by deduction, infer meaningful spatiotemporal relationships in the system.
Sc3: Occurrence of asset malfunction, and specifically of the actuator of Pump PU4, located at the inlet of DMA3. After Tank T3 level reaches 5.3 m, the SCADA sends an actuation command to turn off Pump PU4, yet the control is not accomplished due to a fault in the servo-electric motor. This leads Pump PU4 to remain open, pushing water in the DMA and causing Tank T3 to overflow. This event occurs in the second day of Week 2 at 23:45 and has a total duration of 15 h. This scenario helps evaluate the capacity of the anomaly algorithm to detect deviations from the control rules in-place and separate the source of deviation among the faulty behaviour and its cascading effects.
Sc4: A threat actor has gained access to the local PLC that oversees the operation of the pressure-boosting station at the inlet of DMA1 and maliciously activates the unscheduled Pump PU3. The attack occurs at 23:45 of the 10th day and has a total duration of 15 h. This scenario helps evaluate the ability of the anomaly detection algorithm to deal with assets that have not been operated within the training period, and the subsequent effects they may have in the system behaviour.
Sc5: Aging and material defects had weakened the structural integrity of Pipe P937, located in the middle of DMA2, leading to a sudden burst on the fifth day at 12:30. The event lasts for a total of 20 h, before utility staff applies clamp-on repair couplings to stop the leakage, i.e., no isolation is required, and downstream flow is not interrupted. This scenario can help evaluate the detection capacity of the algorithm in events that occur in non-monitored assets of the system (i.e., outside the graph), and thus its ability to indirectly infer anomalous behaviours solely from the effects of the events.
Sc6: In the last scenario a combination of disruptive events occurs, and more specifically a cyber–physical attack against a pump that leads to a pipe rapture. An attacker maliciously activates Pump PU5 located at the inlet of DMA3 for a short time, causing a pressure surge. The excessive stress to downstream pipeline leads to a rapture in Pipe P794. The malicious pump activation occurs on Day 7 at 17:30, with a simultaneous occurrence of the pipe rapture. The pump is activated for 30 min while the subsequent leakage lasts 12 h, before clamp-on repair couplings are applied. This scenario helps evaluate the algorithm's adaptability against combinatorial simultaneous events and test its ability to continue detecting anomalous behaviours even after their cause ends.
RESULTS
The HDGNN used in this case study learns a static subgraph that allows each sensor to be steadily associated with approximately 25% of the network (i.e., ), while the dynamic subgraph in each timestep is derived from the correlation between sensors' readings in the last hour (i.e., ). The feature extraction for the attention processes is performed with a sliding window of 4 h (i.e., ). Both the HDGNN predictions and errors are smoothed via weighted moving average filter of the last hour (i.e., ). The anomaly detection threshold for each sensor is derived using an unlabeled holdout set of the last 5 months (20%) of the historical dataset.
The following paragraphs present the results of anomaly detection across each scenario, showcasing the applicability of the proposed HDGNN under a diverse set of disruptive events, and evaluating the performance of the case formulated HDGNN. In summary, the proposed scheme was able to detect all the events, with the swiftest anomaly detection time being 0 h (i.e., instantly detect the event) and the latest being 3 h, for the malfunction of a pump's actuator. The duration of false alarms prior to events never exceeded 1hr, while post-event false alarms not associated to known events last a maximum of 2.5 h. The longest false alarm associated with the extension of a true alarm past the event duration was 8.5 h. In terms of explainability, the HDGNN was able to localise the affected assets, while in the case of pipe bursts the localisation was more generally inferred through the corresponding soft sensor. To explore the contribution of the additional dynamic subgraph component, Supplementary material, Appendix A provides a comparative analysis of a prediction-based anomaly detection with a static GNN for the case study dataset and scenarios.
The HDGNN results for each scenario are presented as follows in two parts; the first part is the performance of the event detection and the second, the localisation of the events' sources.
Sc1: Event-free operations
By examining the highest anomaly scores in each alarm, the sources are identified, respectively, for each as potential issues in (i) the level of Tank T7, (ii) the inlet pressure in Pipe PU1, and (iii) the flow passing through Valve V4. In the case of longest alarm issued, the HDGNN predicts a higher level of Tank T7 and a decrease with slower rates – possibly due to an unusually increased consumption rate in DMA4. The other two alarms are issued instantaneously, because of error deviations that marginally exceeded the defined thresholds, with no obvious trends of behavioural deviations in the HDGNN predictions. Nevertheless, in this case study the anomaly threshold is deterministic, and alarms are activated regardless of the surpass magnitude.
Sc2: Cyber–physical attack of tank level manipulation
At the early steps of the attack, the HDGNN predicts a similar performance that is reversed soon after to indicate a tank emptying process throughout the attack period – and in contrast to the tank sensor measurements. Past the attack duration, and after receiving the true level of the tank, the system continues to detect anomalous behaviour in the tank level for 1 h.
Sc3: Asset malfunction
Sc4: Cyber–physical attack with malicious activation of unscheduled pump
Sc5: Pipe burst
Sc6: Pipe burst caused by malicious activation of pump
With respect to the entire scenario set, the HDGNN exhibited a detection accuracy of ∼86.5% for both operational faults and malicious actions and a false alarm ratio of ∼1.9%, under stochastically driven system behaviours. The comparative analysis to a static GNN in the same case study, presented in Supplementary material, Appendix A, indicates the positive effects of jointly accounting for dynamic and static relationships, on both the detection accuracy (+14%) and the reduction of false alarms (−68%).
DISCUSSION
The proposed HDGNN demonstrates robust detection performance under a diverse set of disruptive scenarios that include operational faults, malicious actions, and their combination. The algorithm was able to detect all the events occurring in the synthetic water distribution system, with a detection accuracy of ∼86.5% across all cases. At the same time, the HDGNN issued a small number of false alarms, with consistent rates across the event-free and the five disruptive scenarios, with an overall false alarm ratio of ∼1.9%. The HDGNN was able to directly associate anomalies with the affected assets per se in all the disruptive scenarios, showing strong localisation capabilities. The results indicate the algorithms robustness in detecting and localising system anomalies without any previous knowledge of specific failure patterns, and under realistic, stochastic system behaviours.
This work also demonstrated the HDGNN's ability to independently identify meaningful spatiotemporal relationships and adapt to operational changes dynamically, without relying on strong autoregressive patterns. This allows the HDGNN to properly generalise to unseen operational patterns and behaviours, and better identify contextual anomalies, i.e., infer if the system behaviour is normal under a given set of conditions, and abnormal under another.
When compared to more ‘traditional’, yet robust, approaches, the HDGNN also exhibits improved performance, exceeding the accuracy, false alarm ratio, and localisation capabilities of a static GNN in detecting complex cyber–physical events under realistic, highly variable conditions. Such conditions may be often found in the real world yet not often encountered in literature, in which detection performance is showcased with respect to either leakages (Vrachimis et al. 2022) or malicious events (Taormina 2017), with the underlying water systems operating under ‘fairly regular’ demand patterns with ‘no significant variations’, thus raising questions over the retainment of performance under real-world conditions (Tsiami & Makropoulos 2021). As such, the performance evaluation and comparative analysis can be deemed sufficiently representative of real-world conditions, while under less variable or more ‘typical’ consumption behaviours, HDGNN is expected to be equally good – if not better. The additional positive effects from coupling static with dynamic relationships are also aligned with the key findings of relevant literature over the advantage of dynamic GNNs to capture stochastic spatiotemporal relationships over the traditional ‘deterministic’ approaches (Kim et al. 2022). To the best of our knowledge, the proposed HDGNN is the first approach in our sector to explicitly consider the multimodal behaviour of systems and the dynamic spatiotemporal relationships emerging from the evolution of their operational connectivity, without loss of potentially strong, long-lasting relationships. Moreover, the proposed architecture goes beyond the combination of static and dynamic relationships by also accounting for their time-varying importance (i.e., the existence of connections does not imply a priori a steady or uniform importance) as it considers the significance of each relationship under both the existing system state and the unique characteristics of each sensor captured via embedding vectors. Hence, it continuously maps and ‘evaluates’ both the static and the dynamic relationships of a system, as they evolve through time. A key advantage of the proposed HDGNN, regardless of its application, is that its underlying graph connectivity can span any combination between a static fully connected (i.e., ) to a fully dynamic graph (i.e., ). This allows the HDGNN to adjust its internal relationship dynamics by relying more on static or dynamic connections, per the problem's needs. To this end, there are both practical and conceptual evidences towards its adoption as a robust, adaptive approach.
While the proposed approach demonstrates promising results in anomaly detection, it is important to acknowledge certain limitations. As every other unsupervised learning, the training of the HDGNN demonstrated in this case study builds on the principle that abnormal instances in the datasets are much rarer than normal, thus no labels are required to train the model into representing the ‘normal’ behaviour. However, if this assumption does not hold true, and training data are not properly curated, the anomaly detection will generate high rates of false alarms, and lead to the ‘cry wolf’ effect. An equally important aspect in the same vein is that of the smoothing process parameterisation, and specifically the time window for the weighted moving average filtering, where the trade-off between low false alarms and swift detection of actual events should be accounted for.
Another limitation and challenge in anomaly detection is the ability to detect and localise two or more events that occur simultaneously. Although not a limitation of the HDGNN per se, the downstream anomaly detection scheme utilised in this work can associate the system anomaly to a single point in the system based on the magnitude of the resulting deviation. This may create a gap in detection, where the anomaly detection system cannot automatically alert for a simultaneous event until the most critical issue is resolved.
An overarching limitation in anomaly detection is the ability to localise events when those occur in unmonitored parts of the system, and subsequently outside the HDGNN graph. Although the use of soft sensors, as illustrated in this case study, can help better identify and localise such events, further downstream schemes could be pursued to deduct more spatially specific information and allow for swifter decision making at the operational level.
CONCLUSIONS
This paper introduced a HDGNN architecture as an advancement in the expressive power of GNNs for multivariate data streams of complex systems, particularly in the water sector. The proposed architecture is the first in the sector to jointly address both persistent and dynamically shifting spatiotemporal relationships among system components, allowing a more detailed and realistic representation of multimodal behaviours and, thus, a more robust inference of contextual anomalies; a great challenge in the anomaly detection domain. This capability becomes more pivotal under the operational complexity emerging from the expanding cyber–physical water system architecture and – in parallel – the evolving perplexity of cyber–physical attacks. Notably, the HDGNN was able to perform robust anomaly detection on both cyber–physical attacks and asset malfunctions, under the same configuration and training, proving that its expressive power can capture a wide range of unseen system behaviours. Especially since this capability emerged without the use of labelled dataset, it signifies its practical potential for a broad range of applications without the need for multiple instances or application-specific modifications. This can be of special interest to water companies, as they could derive a single, fine-tuned instance of the HDGNN of their system and use it across departments. Moreover, the HDGNN architecture requires no prior knowledge regarding both the static and the dynamic subgraphs in the system, and can be used at any scale (from DMA to system wide). Moreover, the HDGNN architecture allows for a wide range of potential system connectivities, ranging from fully static to fully dynamic graphs, thus being able to adapt to each case.
Besides the underlying conceptual advancement that makes it more suitable for complex water systems, the HDGNN also exhibits enhanced performance, with respect to ‘traditional’ static approaches. The advantage of hybrid over static connectivity appears in all the desired traits of the anomaly detection, i.e., the detection accuracy, the localisation, the contextual inference, and the false alarm ratio – which in practice can lead to the ‘crying wolf effect’. Notably, the performance and comparative analysis of the HDGNN were evaluated under realistic boundary conditions, thus yielding more realistic insights. The case study presented in this work provides a synthetic, yet realistic benchmark cyber–physical water system that mimics real-world dynamics, including background leakages and stochastically generated water demands. The dataset can aid future researchers in evaluating the performance of relevant applications in conditions that more closely represent real-world variability and stochasticity – thus ensuring relevance and scalability to larger real systems. This will be crucial in tackling the remaining and future challenges in GNNs’ expressive power and anomaly detection for complex systems.
Future direction could include enhancing the proposed anomaly detection scheme to detect multiple simultaneous anomalies. To this end, new approaches need to be explored employing either simplistic methodologies, e.g., set a second threshold above which parallel alarms are issued, or more advanced ML-based approaches to indicate if the resulting anomalies belong to the same or different events. Moreover, the HDGNN scheme can be combined with parallel modules to allow the detection of anomalies outside the graph, i.e., the cyber layer. As a point towards future research, such a scheme could build on DT technologies to simulate multiple what-if scenarios of asset failures based on the most up-to-date reference model of the system. Those simulation results can then be used as ‘fingerprints’ for each asset's failure and help identify the most probable cause of the identified anomaly.
Nevertheless, the HDGNN anomaly detection scheme, explored under both typical operational faults (i.e., pipe bursts and asset malfunctions) and intentional malicious actions (i.e., cyber–physical attacks), has already yielded promising results in terms of its detection capacity for contextual anomalies and robust explainability, surpassing those of deterministic static GNN. We argue that the proposed joint exploration of static and dynamic relationships can help better map the varying behavioural patterns of complex real-world systems and enhance the performance of graph-based deep learning models used in prediction and anomaly detection tasks.
ACKNOWLEDGEMENT
The research work was supported by the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the ‘Third Call for HFRI PhD Fellowships’ (Fellowship Number: 6349).
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.