Assessing the performances and transferability of graph neural network metamodels for water distribution systems

Metamodels accurately reproduce the output of physics-based hydraulic models with a signi ﬁ cant reduction in simulation times. They are widely employed in water distribution system (WDS) analysis since they enable computationally expensive applications in the design, control, and optimisation of water networks. Recent machine-learning-based metamodels grant improved ﬁ delity and speed; however, they are only applicable to the water network they were trained on. To address this issue, we investigate graph neural networks (GNNs) as metamodels for WDSs. GNNs leverage the networked structure of WDS by learning shared coef ﬁ cients and thus offering the potential of transferability. This work evaluates the suitability of GNNs as metamodels for estimating nodal pressures in steady-state EPANET simulations. We ﬁ rst compare the effectiveness of GNN metamodels against multi-layer perceptrons (MLPs) on several benchmark WDSs. Then, we explore the transfer-ability of GNNs by training them concurrently on multiple WDSs. For each con ﬁ guration, we calculate model accuracy and speedups with respect to the original numerical model. GNNs perform similarly to MLPs in terms of accuracy and take longer to execute but may still provide substantial speedup. Our preliminary results indicate that GNNs can learn shared representations across networks, although assessing the feasibility of truly general metamodels requires further work.


INTRODUCTION
Water utilities rely on hydrodynamic models to design and control water distribution systems.These physics-based models, such as EPANET (Rossman 2000), compute the state of the system, i.e., the flow rates and pressures at all the pipes and junctions, by solving the underlying equations of mass and energy conservation.The inputs for these computer programs include the layout of the network and the characteristics and settings of components such as pipes, pumps, valves, and reservoirs, among others.Hydrodynamic models provide valuable insight into the functioning of the system.However, the speed of these models is often insufficient for applications such as optimisation (e.g., Bi & Dandy 2014) or criticality assessment (e.g., Meijer et al. 2021), especially in large search space problems (Maier et al. 2014).One alternative to address this issue is developing surrogate models, also referred to as metamodels.
Metamodels are models that aim to significantly reduce simulation times while still obtaining comparable results to the hydrodynamic model.A main family of metamodels is response surface models (Razavi et al. 2012).These surrogate models mimic the input-output relation, i.e., the response surface, of the original physics-based model to obtain results in a fraction of the time while retaining sufficient accuracy.Among the multiple algorithms that can be used for creating these metamodels, artificial neural networks (ANNs) have been increasingly popular due to their high performance and execution speed; previous studies using mainly ANNs have shown remarkable gains in computational time (Broad et al. 2005;Martínez et al. 2007;Salomons et al. 2007;Behzadian et al. 2009;Broad et al. 2010).
ANNs are models obtained by stacking parametric functions that take an input x and produce an estimated output ŷfrom a target representation y.The parameters in an ANN are learned, i.e., calibrated, by minimising the difference between the expected and real targets, measured with a loss function.This calibration process is usually performed via backpropagation, i.e., the parameters change based on the value of the loss function.The process of learning these parameters is referred to as training and the data employed during training are called training sets.The model performances are then evaluated on a separate dataset known as a validation dataset, before being employed for testing on unseen data.
Fully connected ANNs, also known as multi-layer perceptrons (MLPs), are arguably the most used metamodels for water distribution systems (WDS).Even though MLPs can approximate any function (Hornik et al. 1989), their number of parameters increases exponentially with the size of the input, making them unsuitable for high-dimensional data.This issue is known as the 'curse of dimensionality' and implies the amount of training data required by an MLP increases exponentially with the input's dimensions (Lecun et al. 2015).Furthermore, MLPs require a fixed-size input, and consequently, a new model needs to be created when the size of the inputs changes, e.g., by adding new pipes or junctions to a WDS.Thus, they do not overcome a major limitation of traditional metamodels: they are only applicable to the water network they were trained on.As noted by Garzon et al. (2022), this implies that new metamodels must be trained with new sets of simulations to account for multiple networks or structural changes in the original system.This characteristic could discourage the use of metamodels or even make them impractical.
Components of metamodels can resemble the underlying structure of the problem at hand by including inductive bias, i.e., assumptions or knowledge about the data-generating process, underlying physical processes, or the space of solutions (Battaglia et al. 2018).This similitude aids the effectiveness of model transfer by exploiting the connectivity of the nodes and the physical information of the components.For metamodels in WDSs, this information includes connectivity and data such as node elevation, pipe roughness, and length.
Graph neural networks (GNN) are a recent variant of ANN which can perform operations on data that lie on graphsmathematical objects that describe how entities are connected to each other via nodes and edges.WDSs can be represented by graphs, considering junctions, reservoirs, or storage tanks as nodes and pipes, pumps, or valves as edges.GNNs can then take the information embedded in WDS and apply the same linear and non-linear operations used in MLPs.The main difference is that GNNs can have permutation invariant and equivariant properties, which allows them to consider arbitrarily sized graphs.This inductive bias preserves the additional information embedded in the graph structure which helps in decreasing the number of trainable parameters in the metamodel.
Recently, GNNs found successful applications in water networks.Hajgató et al. (2021) and Xing & Sela (2022) used a GNN model to estimate the pressure state of a WDS, based on a few sensors in the network.Bonilla et al. (2022) employed a GNN model to predict the pump speed from pressure and flow measurements in the networks.However, no works yet explored the transferability of GNNs for WDSs, i.e., their ability to learn representations and perform predictions across multiple case studies.Furthermore, to the best of our knowledge, no studies on WDSs have compared the performance of GNNs against that of traditional data-driven alternatives, such as MLPs.In this paper, we move the first steps in these directions by developing GNN-based metamodels for estimating nodal pressures on six benchmark WDSs and comparing them against several MLP baselines.
The remainder of the paper is organised as follows.In the methodology, we firstly describe the data generation procedure and the six case studies used in this work.Next, we present the employed metamodels based on MLP and GNN and describe the metrics used for assessing their performances.The section additionally includes the description of the conducted experiments and the adopted setup.The result section presents the comparison between MLP-and GNN-based metamodels for the benchmark WDSs and discusses the results on GNN transferability.The last section concludes the paper.

Case studies and data generation
To assess the performances and transferability of GNN-based metamodels, we consider the problem of reconstructing nodal pressures as computed by EPANET steady-state simulations.Table 1 reports the six benchmark datasets employed in this study along with their bibliographic reference, and the number of nodes, pipes, and reservoirs in the system.While the chosen WDSs do not feature hydraulic elements such as pumps and valves, they form a sufficiently diverse ensemble, as shown in the network layouts displayed in Figure 1.
For each of these networks, we employed the WNTR Python package (Klise et al. 2018) to generate a dataset of 10,000 samples, divided into training (8,000), validation (1,000), and test (1,000) subsets.Each sample is created by altering all (i) nodal base demands, (ii) pipe diameters, and (iii) pipe roughness coefficients.The altered values are selected from different distributions that reflect commercial ranges for pipe diameters (0-1.5 m at 2.5 cm increments) and Hazen-Williams roughness coefficients (50-150 at 1 unit increments), as well as reasonable base demands with respect to the selected case studies (0-100 L/s with 0.1 increments).To avoid unrealistic configurations leading to very low or very high pressures, the distributions were sampled within a range centred on each original node or pipe characteristics.While the distribution of pipe roughness and nodal-based demands are uniform within these ranges, the selection of pipe diameters is biased towards larger values to reduce the possibility of unfeasible setups (e.g., yielding failed simulation).
The described alterations ensured sufficient variability in the nodal pressures obtained via WNTR pressure-driven simulations.When running the simulations, we kept all other network characteristics constant, including network connectivity, geographical coordinates, elevation, pipe lengths, and boundary conditions (i.e., total head of the reservoirs).

ANN-based metamodels
MLPs are formed by sequences of linear operations between the input data and a parameter matrix, followed by non-linear activation functions (e.g., ReLU, sigmoid, tanh).The propagation rule for a generic MLP layer is where y lþ1 is the layer output, s( Á ) is an activation function, W l is a trainable weight matrix, y l is the layer input, and b l is the bias term.Equation ( 1) is repeated for a certain number of layers, resulting in a fully connected network (see Figure 2).The number of inputs is given by the problem's variables, i.e., node demand, pipe diameters, and pipe roughness coefficient, while the outputs are the pressure estimate at each network's junction.Considering a generic WDS with N nodes and E pipes, the input X and output Y have thus dimensions Contrary to GNNs that can learn shared representations within the same WDS and across WDSs by exploiting topological and 'static' features, adding constant inputs to the MLP (e.g., elevations, pipe lengths) hinders its training process.This is carried out by means of gradient descent algorithms minimising a loss function; for the nodal regression problem entailed in our metamodelling approach, the loss is chosen as the mean squared error (MSE) between the pressure values of WNTR simulations and those predicted by the MLP averaged for the entire training dataset.

GNN-based metamodels
A WDS can be represented as an undirected graph G ¼ (V, E), where V ¼ {1, . . ., N} is the set of nodes (e.g., junctions) and E # V Â V is the set of edges (e.g., pipes).For each node i [ V, we can define its neighbourhood as N i ¼ { j if (i, j) [ E} as a set of nodes that are directly connected to a given node with an edge.Variables defined on the nodes, such as node demands, can be encoded as node signals and multiple variables can be represented by the node matrix X [ R NÂF , where F represents the number of node features.A GNN works by propagating those node features from one node to another via its neighbouring nodes.This propagation can be repeated, for a single layer, for as many K-hop neighbourhoods, as shown in Figure 3.The bigger this value, the wider the reach of information sharing throughout the graph.For mathematical convenience, we define the graph shift operator as a matrix S [ R NÂN , where S ij ¼ S ij if (i, j) [ E and S ij ¼ 0 otherwise.Thus, the expression for a graph convolutional layer is given by where Y lþ1 is the layer output node matrix, s is an activation function, Y l is the layer input node matrix, H k l is a shared trainable matrix, and K is the K-hop neighbourhood (Gama et al. 2020).Nodes and edges can be automatically inferred by the WDS connectivity.Each node has features defined by its elevation and base demand or base head, depending on whether the node is a junction or a reservoir.We included the edge attributes, i.e., diameter, length, and roughness of the pipes, via a preprocessing MLP layer shared by every edge.The obtained output is then aggregated for every node to create a new node embedding Y 0 [ R NÂG , where G is the embedding dimension, as illustrated in Figure 4.As for the GNN layer, we employed the ChebNet graph convolutional model (Defferrard et al. 2016), which has been extensively studied in previous works, including some in water networks (Hajgató et al. 2021).This considers as graph shift operator S in Equation ( 2) the Laplacian matrix, defined as L ¼ I À D À1=2 AD 1=2 obtained from the adjacency matrix A. Finally, we use an MLP shared on the nodes to convert the final node embeddings into a single pressure prediction for each node.Similarly to the MLP-based metamodel described in the previous section, the GNN is trained by minimising the MSE for the training dataset using stochastic gradient descent.

Metamodel performance
We consider the coefficient of determination R 2 and the root mean squared error (RMSE) as goodness-of-fit metrics.The former is defined by where y i is the target pressure at node i computed by WNTR, ŷi is the predicted value of the metamodel, and y is the mean of the target pressure data for a given dataset of size N.A value of 1.0 indicates a perfect fit, while values of R 2 below 0 indicate that the model is performing worse than simply assuming the dataset mean for each point (i.e., R 2 ¼ 0).On the other hand, the RMSE measures the average error in the pressure prediction defined as Since metamodels are built to overcome the computational costs of physics-based simulators, we evaluate the speedup provided by each metamodel as where T s is the average computational time for the numerical simulation, and T mm is the average computational time for the metamodel execution.

Experimental setup
We run two main sets of experiments aimed at (i) assessing the performances of GNN metamodels trained on individual WDSs, and (ii) exploring the advantages of a transferable GNN metamodel trained on datasets featuring samples of all WDSs.We compare the results against that of MLP-based metamodels trained on each WDS separately.
The comparison of MLP-vs GNN-based metamodels on individual WDSs is carried out by training the models on the entire 8,000 samples available for each water network.To facilitate the training process, all variables are scaled using log transformations (elevation, pressure, and pipe length) or min-max scaling (all other features).The scaling parameters derived from the training dataset are applied to the validation and test datasets.We decided to substitute the reservoirs' elevation (set to 0 by default in EPANET solvers) with their base hydraulic head and replace this latter feature with a Boolean flag to discriminate reservoirs from junctions.All GNN datasets thus have three features per node (elevation þ base head, base demand, node Boolean flag) and three per edge (pipe diameter, pipe length, pipe roughness).
The study on GNN transferability entails three separate training datasets built using samples from all WDSs as shown in Table 2.The first dataset contains 1,024 samples for each WDS, for a total of 6,144 data points.Since the GNN loss function is calculated per node, this dataset is unbalanced, as it overrepresents larger networks.To balance the dataset, we undersample the larger WDSs by including a number of data points that is inversely proportional to the number of nodes in the WDS with respect to the smallest systems (FOS, BAK).Using this strategy, we create two extra datasets named balanced and balanced extended with 5,722 and 22,350 data points, respectively.All training datasets in Table 2 are normalised using the same procedure described before for the individual WDS datasets, but with extreme values computed across all WDSs.Similarly, the derived scaling parameters are then applied to normalize the validation and test datasets, now consisting of data from all WDSs.

Hyperparameter search
We test MLP architectures of different complexity by changing the number of hidden layers, as well as the number of units in each layer.In general, a larger network performs better but requires additional data and computational power.We also employ dropout layers (Srivastava et al. 2014) after each fully-connected layer to improve model generalisation.All activation functions are rectified linear units (ReLU).Table 3 shows the hyperparameters' range selected for MLPs, yielding 36 potential combinations.The upper limits of the hyperparameters are selected based on similar works in the water distribution system domain (Martínez et al. 2007;Hajgató et al. 2021).Table 4 shows the values chosen for the optimisation of the GNN hyperparameters, yielding 24 possible combinations.The hyperparameters include the embedding dimensions of the shared preprocessing MLP, the number of graph convolutional (Cheb-Net) layers after the preprocessing MLP, the number of output channels of the graph convolutional layers (e.g., number of hidden units), and the max K-hop neighbourhood considered by the GNN.As for the MLP metamodel described before, all activation functions in the GNN are ReLU.No hyperparameter tuning is performed for the transferability experiments, where we use the largest possible GNN with 64 embedding dimensions, 3 ChebNet layers, 128 hidden output channels, and K ¼ 6.
We run all the experiments using the Pytorch library (Paszke et al. 2019) for MLP models and Pytorch Geometric (Fey & Lenssen 2019) for the GNN models.We used default library weight initialisation methods (Glorot & Bengio 2010) and fixed the random seeds.Each metamodel was trained using the Adam optimisation algorithm with a constant learning rate of 0.001, no weight decay, and default selection of parameters.The training was carried out for 30 epochs with no early stopping and a batch size of 128.In terms of hardware, we employed a Xeon W-10855M @2.8 GHz CPU and a Nvidia Quadro RTX 5000, 16Gb RAM GPU.

RESULTS AND DISCUSSION
In this section, we first compare the MLP and the GNN in terms of performance and execution time one WDS at a time.Then, we assess the transferability of GNNs trained on the combined datasets of Table 2.

MLP vs GNN
Table 5 reports the best configurations of the metamodels based on the validation R 2 for each WDS, and after considering all hyperparameter combinations described in the previous section.From the comparison of the test R 2 , it emerges that MLPs outperform GNNs in all benchmark datasets apart from BAK, where the performances are almost identical, and FOS, where the GNN largely outperforms the MLP.Table 6 presents the comparison also in terms of RMSE and computational speedups, along with the total number of parameters of the best metamodels.While the RMSE follows the same trends described for R 2 , it better indicates the average error in meters for nodal pressure estimation.As expected, the best GNNs are usually smaller than the MLPs in terms of parameters, especially for the FOS and the RUR case studies where the GNN achieves better or similar performances.Nevertheless, MLPs are faster, granting execution speedups of three orders of magnitude with respect to WNTR simulations.Similarly, their training time is between one to two orders of magnitude smaller than that of GNNs.That said, the GNNs provide substantial speedups of up to 70Â that may justify their utilisation.Furthermore, this gap in optimised GPU implementation will likely get smaller as these relatively novel techniques become mainstream.While further GNN hyperparameters tuning may narrow the gap in performances for all WDS where MLPs perform best, Figure 5 suggests that MLPs may never reach performances similar to those of the GNNs for the FOS benchmark.Indeed, regardless of the model size and dropout used, the MLPs reach a performance ceiling at around R 2 ¼ 0:35 and then 'overfit' the validation dataset, despite the consistency in the distribution of input and output variables across the different splits.This behaviour does not occur for the remaining WDS, indicating that, for some case studies, GNNs may discover hidden representations hardly accessible to MLPs.This initial finding requires further analysis linking model performances to topological and hydraulic features of the WDS.
The better fit of GNNs for the FOS case study is clearly visible in Figure 6(a) that shows the test dataset scatterplot of simulated vs predicted pressures across all nodes for the best-performing models.Figure 6 also shows that, regardless of the case study, the developed metamodels tend to overestimate the simulated pressures.The isolated clusters in each panel (e.g., see the top left corner of Figure 6(a)) usually correspond to individual simulation runs which are more difficult to surrogate.These challenges could be rooted in the sensitivity of the pressure response of WDS to input pipe parameters, such as pipe diameter.Employing a data generation process with systematically smaller pipe diameters could potentially be helpful in the training.In that case, the dataset will exhibit a broader distribution of pressure values.Consequently, the metamodel's sensitivity to the input might increase and may require more parameters to capture more complex relationships.

Transferability of GNNs
Figure 7 reports the R 2 scores of the transferable GNNs trained on the combined datasets (see Table 2) for all WDS test datasets, along with the distribution of R 2 for the MLPs (blue cross) and GNNs (orange cross) trained with 8,000 simulations on the individual benchmark WDS.The comparison between the GNN trained on the unbalanced (cyan circle) and balanced dataset (green square) seems to confirm that training on the former favours larger networks, such as MOD, RUR, and KL.While the general GNN seems to retain comparable performances to the individual ones for BAK, MOD, RUR, and KL, the performances drop drastically for PES, and especially, FOS.The problem persists even when considering the GNN trained on the balanced extended dataset (red triangle), which features 8,000 simulations for FOS and over 4,000 for PES.Further analysis is thus necessary to understand why the GNN seems to learn transferable features across a subset of the available WDSs, failing to perform on others.
On the other hand, these results also indicate that GNNs may exploit information learned in some case studies for predictions elsewhere.This particularly emerges when considering the performances of the transferable GNN trained on the balanced extended dataset on RUR and KL.This GNN shows equal performances to those trained on the unbalanced dataset despite having only around 75 and 30% of the training samples for these two WDSs, respectively (see Table 2), and it slightly underperforms the best individual MLPs and GNNs trained individually on 8,000 samples.While the training times of the transferable GNNs grow longer with larger training datasets, the model retains execution speedups in the order of those reported in Table 6 for the individual case studies.

CONCLUSION
In this work, we assessed the performances and transferability properties of GNNs used for metamodelling the pressure response surface of six benchmark WDSs.After generating a large sample of steady-state (snapshot) simulations with WNTR, we first compared the performances of multiple configurations of GNNs trained on individual WDSs against MLPs.The results indicate that, while MLPs tend to slightly outperform GNNs in most case studies, there is partial evidence that GNNs may be inherently better architectures for some WDSs.Despite requiring less trainable parameters to achieve comparable goodness-of-fit, GNNs are substantially slower than MLPs.At the current stage, the direct applicability of both GNNs and MLPs in downstream tasks might be limited for some WDSs.For the other cases, however, they still provide consistent speedups that could justify their use.
We assessed the transferability property of GNNs by training a single model on datasets with samples from all WDSs.Testing results for the larger WDSs suggest that a general GNN may perform comparably to the best MLP and GNN models while requiring substantially less training data for these case studies.These initial findings may indicate that GNNs can indeed learn shared representations across different water networks.However, the performance drop witnessed for two of the six networks implies that substantial efforts are required to design adequate datasets and test the general validity of this approach.
This exploratory study only considered a limited combination of GNN architectures consisting of ChebNet layers with a shared MLP for embedding nodal and edge features.Future studies should assess the effects of other graph layers and GNN paradigms on individual WDS performances (Wu et al. 2021) and transferability (Ruiz et al. 2020).Additionally, the effect of each hyperparameter could be investigated in more detail.Furthermore, we aim to investigate whether we can achieve better transferability by including more variability in the data generation process for example by resorting to a large number of randomly generated WDSs (Sitzenfrei 2016).This can include more complex techniques of sampling the pipe parameters that will result in broader pressure distribution in the dataset.New research could consider the equivalences that occur across different settings, such as when different pipe parameters in consecutive pipes result in the same headloss.These equivalences can be used in a self-supervised setting (Xie et al. 2022).Similarly, we aim to extend this work by considering surrogates of extended-period hydraulic analyses, rather than steady-state simulations.

Figure 2
Figure 2 | A multi-layer perceptron (MLP) with two hidden layers.Every layer is connected to the following one by weights, represented by directed arrows.

Figure 3
Figure 3 | A graph neural network (GNN) layer with a 2-hop neighbourhood.The figures from left to right indicate how the node signal in the black node propagates throughout the network.The same reasoning is applied to every other node in the graph.

Figure 4 |
Figure 4 | Procedure to determine the node embeddings Y 0 .(a) The edge attributes E ij and node attributes of the incident nodes X i and X j are concatenated and then fed to an MLP shared across all edges, to determine an edge embedding E 0 ij .(b) For every node, the embeddings of incidental edges E 0 ij are aggregated to determine the final node embedding Y 0 .(c) Node embeddings for every node are finally fed to the model.

Figure 7 |
Figure 7 | Test R 2 scores of transferable GNNs for all benchmark WDSs.

Table 1 |
Water distribution systems featured in this study

Table 2 |
Training datasets used for the study on transferability

Table 3 |
Range of hyperparameters for the MLP models

Table 4 |
Range of hyperparameters for the GNN models

Table 6 |
Goodness-of-fit metrics on the test dataset, execution speedups, and number of parameters of the best metamodels Figure 5 | Scatter plot of MLPs (blue) and GNNs (orange) validation and testing R 2 for the FOS case study.