The real-time hydraulic model (RTHM) is a key assistive tool in water distribution system (WDS) management, and its performance directly affects assisted decision-making. This study develops a framework to improve the timeliness and accuracy of RTHMs, which includes the following five steps: flow data processing, establishing nodal water demand (NWD) prediction models, node grouping, data assimilation (DA) and uncertainty analysis. Based on the actual network data, the performance of two data processing methods and three machine learning algorithms are, respectively, compared, and the best is selected for modeling. In the establishment of the hourly NWD prediction models, massive data, including flow measurement and data of all 26 input variables on climate, time and social influencing factors are used. It is found that the time features are the most important model input parameter. Application results of actual network prove that the flow data processing method, accurate NWD prediction, node grouping and Kalman filter-based DA method reduce the uncertainty in the RTHM and improve its timeliness and accuracy, so as to obtain the real-time state estimation of the WDS. Accurate NWD estimation (especially in the high-demand period) and combining RTHM with DA have a great influence on the uncertainty reduction in water pressure estimation, although uncertainty is weakened in the propagation process.

  • A framework to improve the timeliness and accuracy of real-time hydraulic models was established, helping to accurately estimate the real-time status of the water distribution system.

  • Reliable nodal water demand prediction models were established, with an accuracy of hourly level.

  • The obtained results improve the understanding of the propagation process and reduction methods of the uncertainty of the WDS hydraulic models.

Graphical Abstract

Graphical Abstract
Graphical Abstract

By combining real-time measurement to simulate the dynamic operating state of the water distribution system (WDS), real-time hydraulic models (RTHMs) play a key role in improving fine control and emergency handling, which also makes people have higher requirements for the accuracy and timeliness of the model. Therefore, it is very essential to explore the methods to improve the timeliness and accuracy of the RTHM. The development of the Supervisory Control and Data Acquisition (SCADA) system has improved the availability of online monitoring data, which contributes to improving the accuracy of the hydraulic model on the one hand, and on the other hand, it also increases the challenge of modeling because of the need for preprocessing and comprehensive analysis of massive real-time data. The currently collected hydraulic data mainly includes pressure and flow data. The analysis and application potential of pressure data have attracted widespread attention (Blesa et al. 2015; Xie et al. 2017; Raei et al. 2018; Soroush & Abedini 2019), but there is little effort in the deep understanding of flow information (Kang & Lansey 2009).

Flow information is very important for the RTHM as an essential model input. The flow meters are generally installed in key locations of the WDS. In addition, a large number of district metering areas (DMAs) (Chondronasios et al. 2017) have been established in WDSs in order to control leakage. Most of such DMAs’ inlets have been covered with flow meters to analyze non-revenue water. These key positions are often abstracted as important nodes in the skeletonization process of establishing hydraulic models, and corresponding flow data inform the nodal water demand (NWD).

The NWD has a great influence on the performance of hydraulic model results. Due to insufficient information, the NWD is often estimated indirectly through model calibration and engineering judgment (Kang & Lansey 2011). Now, the massive flow measurement can provide us with access to directly calculate the NWD. In this study, the collected flow data in the database are considered to establish hourly NWD prediction models, and then the predicted NWD is input into the RTHM, so as to realize the real-time and effective estimation of the WDS's operation state, and further improve the timeliness performance of the hydraulic model. In addition to hydraulic analysis, accurate hourly NWD prediction can also be used to improve modeling the transport of chlorine in the network.

In order to improve the accuracy of the hydraulic model, it is necessary to reduce the uncertainty in the hydraulic model. The uncertainty in the hydraulic model is caused by the uncertainty in model parameters or input variables, which can be generally divided into the following three types: model structural uncertainty, parameter uncertainty and measurement/data uncertainty (Hutton et al. 2014). The uncertainty in the model is objective and cannot be completely eliminated, but it can be quantified and reduced by various methods, such as increasing the monitoring and collection of various parameter information, and the combined use of data assimilation (DA) methods, so that the hydraulic model can fully represent the real system.

This study considers three ways to reduce the uncertainty in the model, the first is to minimize the measurement uncertainty. Measurement errors caused by inaccurate instruments is beyond the discussion of this study, whereas the abnormal data caused by instrument failure, data transmission failure or external anomaly is the focus of this study. There may be abnormal values (such as zero value or abnormal maximum value) in the collected measurement data of the NWD, which will impede the establishment of accurate NWD prediction models.

The second way is to reduce the uncertainty in water demand. The NWD has been demonstrated to be the most important input parameter that will cause uncertainty in the hydraulic model results (Kang et al. 2009). Therefore, the accurate NWD prediction is important for estimating output pressure and water quality. Due to the cost limit, only some nodes in the WDS (referred to as A-class nodes) have been covered with flow meters, whereas water demand of rest nodes (referred to as B-class nodes) is still unknown. In order to estimate the NWD of B-class nodes as accurately as possible, nodes with similar characteristics are divided into a node group after spatial correlation and the similarity of user types between nodes are considered. The nodal demands in the same group are assumed to be correlated with each other (Kang & Lansey 2009), that is, they have similar change trends of the NWD. In this way, the prediction of water demand at these nodes is more directional, thereby reducing the uncertainty in input parameters.

The third way is to introduce the DA method to continuously update the hydraulic models, so that the uncertainty of the model parameters (pipe roughness coefficient) is reduced. Through driving the hydraulic models to simulate the WDS state of next stage, the state estimation including pressures, flow forecasts can be obtained. When time moves forward to the next stage, and flow and pressure measurements are available, the deviations of the measurements and forecasts are obtained. Based on these deviations, calibrating the parameters or state inputs of the models to continuously improve the models outputs is exactly the principle concept of DA. With the goal of reducing uncertainty and improving forecasts, DA update model states by combining the model with observations.

The study develops a framework to improve the timeliness and accuracy of RTHMs. It consists of the following five parts: flow data processing, establishing the hourly NWD prediction models, node grouping, DA and uncertainty analysis. Specifically, two methods are adopted to identify and correct the abnormal data of collected flow measurement, so as to improve the data quality and reduce the measurement uncertainty. Then, three different machine learning algorithms are used to establish the hourly (the next 24 h) NWD prediction models, and the prediction accuracy of different algorithms and the effect of the model input variables on the model prediction results are compared and analyzed. Next, all nodes of the system are grouped according to relevant rules, and the uncertainty in water demand is reduced by assuming that the NWD's change trend in the same node group is consistent. Then, the predicted hourly NWD is used for real-time hydraulic simulation, and the differences between the predicted flow rates and the observation data of A-class nodes are applied by the DA method to optimize the RTHM. Finally, the uncertainty in the hydraulic model output caused by the uncertainty in the model input is quantitatively analyzed, so as to realize real-time state estimation of the WDS.

This paper is organized as follows. In the following section, an overview of water demand prediction and the model uncertainty background is presented. The ‘Methodology’ section mainly describes data processing methods, NWD prediction models and algorithms, node grouping, DA and uncertainty analysis. The case study follows, along with the information of NWD raw data and models developing. Results are then presented followed by discussion and finally conclusions.

Generally, the monitoring data may contain abnormal data due to various factors including unstable meters or sensors, failure of data transfer, infrastructure inspection, maintenance and water facility failure. Also, such abnormal data are often mixed in the water demand that fluctuates greatly within a day, which lowers data quality, worsens the knowledge of water demand information, and escalates the model uncertainty. From this perspective, the processing of flow data is indispensable.

Then, the processed flow data are used to build the NWD prediction models, but this is a very challenging task, because many factors (Mouatadid & Adamowski 2016) influence water demand, such as population, water price, water saving awareness, weather, social factors, etc. In addition, the water demand also has obvious temporal and spatial characteristics. For short-term water demand forecasting, weather variables are considered to be key input factors (Tian et al. 2016). However, the used weather indicators are limited, and usually only include common indicators such as precipitation, temperature and relative humidity (Tian et al. 2016). Moreover, weather variables are rarely used in combination with other variables, this may cause some important information not to be considered in the water demand forecasting. Therefore, this study considers multi-dimensional input variables (weather variables, time variables and social variables) when predicting the NWD.

The methods of building NWD prediction models include linear and nonlinear methods (Guo et al. 2018). Machine learning methods are an important representative of nonlinear methods. They can learn the underlying and deep relationship between the NWD and related variables from historical data, and this relationship is usually implicit before and after modeling. Common machine learning algorithms include Artificial Neural Network (ANN), Support Vector Machine (SVM), Random Forest (RF), etc., which have been applied in water demand prediction and have good performance (Herrera et al. 2010). In addition to choosing two common machine learning algorithms, ANN and RF, this study also adopts another effective machine learning algorithm, extreme gradient boosting (XGBoost), to establish the NWD prediction model.

XGBoost (Chen et al. 2017) is an open source project designed and optimized to enhance the Gradient Tree Boosting algorithm, which is designed to be efficient, flexible and portable to solve linear models and nonlinear tree algorithms (Darmatasi & Arymurthy 2016). In the field of machine learning algorithm applications and competitions, XGBoost has proven to be an extremely advantageous algorithm. Research results (Sheridan et al. 2016) showed that it has higher computational efficiency and better ability to deal with overfitting problems. As a result, this study chooses XGBoost to construct the NWD prediction model, and compares it with the RF and ANN methods.

During the past two decades, DA has been widely applied in meteorological, oceanographic and other research fields (Li et al. 2013; Mehrparvar & Asghari 2018). Its specific advantages are: integration of information from new observations, flexibility in the evolution of time-varying parameters and incorporation all the sources of uncertainty (Moradkhani et al. 2005). In WDS modeling, a rapid increase of DA activities has been witnessed (Li et al. 2013; Okeya et al. 2014). These achievements have been made possible by the following two facts: first, the increasing availability of real-time measurements and second, the continuous improvement of DA methods. One of the prominent DA methods is the ensemble Kalman filter (EnKF), suitable for complex, nonlinear and chaotic problems, which is adopted in this study. The two variables that are often considered in the calibration of the hydraulic models are the NWD and the pipe roughness coefficient. The NWD input in the hydraulic models of this study is continuously updated through NWD prediction models. The pipe roughness coefficient is a correction parameter when the EnKF method is used to update the RHTMs.

The uncertainty quantification technology currently applied to WDS hydraulic models can be divided into analytical methods or approximate methods (Kang et al. 2009; Pasha & Lansey 2010). The choice of these techniques depends on the availability of information, the complexity of the model, and the type and accuracy of the desired results (Tung & Yen 2005). Commonly used methods include Monte Carlo simulation (MCS) and the first-order second-order moment (FOSM) approximation method and Latin hypercube sampling (LHS). Hwang et al. (2017) and Kang et al. (2009) studied the accuracy of the FOSM, LHS and MCS methods, the results showed that the MCS method has the widest applicability and the most reliable estimation. The disadvantage of MCS is its calculation time. For large-scale systems, it may take a long time to complete the simulation, but its alternative FOSM method can reduce computational time and still provide good statistical estimates of model prediction, although its behavior in branched systems is poor. Taking into account the size of the actual network and the preference for short simulation time, this study uses the FOSM method to quantity the model uncertainty, while using the MCS method to verify.

Figure 1 shows the research outline. Density-Based Spatial Clustering Applications with Noise (DBSCAN) and Abnormal Data Refinement with a Confidence Interval (ADR-CI) methods are used to process water demand raw data in this study. To comparatively analyze the effectiveness of two methods, the raw data without any preprocessing are also used in the subsequent modeling. After data preprocessing, XGBoost, RF and ANN methods are used to establish the hourly NWD forecasting model. They are used in different combinations to obtain combined models, namely:

  • D-XGB, D-RF and D-ANN

  • represent XGBoost, RF and ANN models established after the data are processed by the DBSCAN method.

  • A-XGB, A-RF and A-ANN

  • represent XGBoost, RF and ANN models established after the data are processed by the ADR-CI method

  • XGB, RF and ANN

  • represent XGBoost, RF and ANN models established directly by using the raw data, without using data preprocessing methods

Figure 1

Outline of the research framework.

Figure 1

Outline of the research framework.

Close modal

In this manner, a total of nine models are obtained and tested, and the results are analyzed. Next, the best-performing prediction models are selected to predict water demand of all A-class nodes. On the other hand, according to the relevant rules, all nodes of the WDS model are grouped, and then water demand of B-class nodes in each node group is estimated. The last part is to simulate and analyze the uncertainty in the RTHM combined with the DA method, including the uncertainty simulation in the model input and output. The specific details are introduced in the following sections.

Data processing methods

The DBSCAN method identifies abnormal data by clustering. This method can discover consistent arbitrary-shaped clusters along with detection of noisy outliers and is very sensitive to outliers; therefore, it is very powerful for practical applications. DBSCAN method groups the data points, which are dense and nearby a single cluster. As a result, the data in the dataset that do not belong to any of the clusters are known as noisy outliers. The advantages of this method are the following: less requirement for user domain knowledge, less sensitive to the order in which the data are entered, suitable for large databases. More details about the DBSCAN method can be found in the relevant literature (Viswanath & Suresh Babu 2009).

The ADR-CI method is another method for detecting abnormal data. ADR-CI identifies the abnormal data based on statistical confidence intervals. First, ADR-CI assumes that the hourly water demand at the same time in the past T days is normally distributed. Based on the normal distribution assumption, the corresponding mean and variance can be found. Then, the detection range of abnormal data can be obtained based on the percentage confidence coefficient α and weighting factor β of the corresponding confidence interval. Notably, it is important to determine the appropriate detection range. If the range is set too narrow, some normal data may be incorrectly identified as abnormal data. Instead, the abnormal data may not be found effective. For effective short-term water demand forecasting, the percentage confidence coefficient α is set to 97%–99.75% (Seok et al. 2014), and the detection result of the abnormal value is most consistent with the actual situation. After the detection process, the abnormal data are replaced by the average water demand at the same time in the past T days.

NWD prediction models

Water demand forecasting models can be divided into single-dimensional and multi-dimensional forecasting models, the single-dimensional forecasting model refers to a time series forecasting model established using only historical water demand data as input variables, focusing only on the time correlation of water demand changes, while the multi-dimensional forecasting model takes into account climate, time, society and other factors as input variables, and establishes a gray box model. In this study, both types of NWD prediction models are established, and their prediction capabilities are analyzed and compared.

For the short-term water demand prediction, weather conditions are considered as important factors. As the global warming and extreme weather conditions are increasing, more weather features may need to be considered. It has been shown that the increase in annual average temperature, changes in the temporal and spatial distribution of rainfall and changes in the frequency and intensity of extreme weather events all limit water availability and water use change (Easterling et al. 2000). Under these circumstances, in order to establish more accurate forecasting models, this study considers more weather characteristics indicators as weather input variables, which are shown in Table 1. In addition, the life and work trace of modern urban residents are cut into two parts by working and non-working days, which leads to a large NWD difference in working and non-working days. Therefore, this study adds a social input variable, that is, whether the day is a Chinese legal working day. A study of short-term forecasts of urban household water demand indicated that the temporal dynamics of water use are more important for prediction than any exogenous features (Duerr et al. 2018). As a result, this study also considered the relevant temporal dynamics. The meaning and value range of all 26 input variables (features) used in this study are shown in Table 1.

Table 1

Basic information of the input variables (features) of water demand forecasting models

NumberCategoryNameMeaningValue range or unit
Time variables hour Current hour 0–24 h 
Time variables week_of_year Current week of the year 0–53 
Time variables weekday Current day 0–6 
Social variables is_workday Whether the day is a legal working day 1 or 0 (yes or no) 
Weather variables temp Current temperature °C 
Weather variables temp_max Maximum temperature of the day °C 
Weather variables temp_min Minimum temperature of the day °C 
Weather variables Pressure Atmospheric pressure hpa 
Weather variables humidity Relative humidity 0–100% 
10 Weather variables wind_deg Wind direction 0–360° 
11 Weather variables wind_speed Wind speed in knots 
12 Weather variables rain_1 h Accumulated precipitation within 1 h mm 
13 Weather variables rain_3 h Accumulated precipitation within 3 h mm 
14 Weather variables rain_24 h Accumulated precipitation within 1 h mm 
15 Weather variables rain_today Whether it rains that day 1 or 0 (yes or no) 
16 Weather variables clouds_all Percentage of clouds covering the sky 0–100% 
17–26 Weather variables weather_descrip-tion Including clear, clouds, drizzle, fog, haze, mist, rain, snow, squall, thunderstorm, a total of 10 weather conditions Snow = 1 or 0 (Indicating this weather appears or does not appear)
Fog = 1 or 0 (Indicating this weather appears or does not appear) 
NumberCategoryNameMeaningValue range or unit
Time variables hour Current hour 0–24 h 
Time variables week_of_year Current week of the year 0–53 
Time variables weekday Current day 0–6 
Social variables is_workday Whether the day is a legal working day 1 or 0 (yes or no) 
Weather variables temp Current temperature °C 
Weather variables temp_max Maximum temperature of the day °C 
Weather variables temp_min Minimum temperature of the day °C 
Weather variables Pressure Atmospheric pressure hpa 
Weather variables humidity Relative humidity 0–100% 
10 Weather variables wind_deg Wind direction 0–360° 
11 Weather variables wind_speed Wind speed in knots 
12 Weather variables rain_1 h Accumulated precipitation within 1 h mm 
13 Weather variables rain_3 h Accumulated precipitation within 3 h mm 
14 Weather variables rain_24 h Accumulated precipitation within 1 h mm 
15 Weather variables rain_today Whether it rains that day 1 or 0 (yes or no) 
16 Weather variables clouds_all Percentage of clouds covering the sky 0–100% 
17–26 Weather variables weather_descrip-tion Including clear, clouds, drizzle, fog, haze, mist, rain, snow, squall, thunderstorm, a total of 10 weather conditions Snow = 1 or 0 (Indicating this weather appears or does not appear)
Fog = 1 or 0 (Indicating this weather appears or does not appear) 

XGBoost, RF and ANN

XGBoost is an improved machine learning algorithm based on Gradient Boosting Machine (GBM); XGBoost applies a more regularized model than GBM to control overfitting. To obtain the relationship from the data, the following functional form is established to predict given the input :
(1)
where k is the number of functions and F is the space of functions containing the partition of region and the score in each of them. In general, F is a set of regression trees. To learn the set of functions used in the model, the regularized objective function can be defined in the following form:
(2)
where l is a differentiable convex loss function that measures the difference between the prediction and the target y; Ω measures the complexity of the model to avoid overfitting. This indicates that we are penalizing complex models by making the second item Ω as small as possible. Intuitively, the regularized objective will tend to select a model by using simple and predictive functions (Chen et al. 2017). The unique advantage of the XGBoost algorithm is its scalability in all scenarios. The algorithm runs more than 10 times faster than the existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings.

RF is an ensemble of tree-based models built on decision trees (DTs). The DT algorithm mimics the structure of a tree. The ensemble averages a large number of DTs, lowering the variance and bias in prediction. RF can also be considered as a very convenient and easy-to-use algorithm because its default hyper-parameters usually provide good predictions. RFs have proven to be outstanding predictive models in regression (and classification) tasks and are often used in studies on water demand prediction (Ambrosio et al. 2019).

ANN models have been used to predict hourly or daily water demand (Romano & Kapelan 2014). The feed–forward neural network is one of the most widely used ANNs; its learning process consists of forward propagation and back propagation. In the classical structure of feed–forward neural network, the outputs of each layer are processed layer-by-layer and transferred to the output layer, and an output layer produces the calculated data. This is known as the forward-propagation process. Then, the generated output for each input set is compared with the actual output, and an error is computed. The error is then propagated backward throughout the ANN, and the network parameters are adjusted to control the errors at a very low level. This is known as the back-propagation process. Each ANN is trained by an in-hand dataset and compares the achieved results with the actual output data until acceptable outputs are obtained (Ghiassi et al. 2008).

Node grouping

In order to overcome the limitation of low measurement redundancy, and try to reduce the uncertainty in models, nodes with similar user character are aggregated to a single node group to estimate the water demand of B-class nodes as accurately as possible. Kang & Lansey (2009) proposed to divide general nodes into three categories of service area: industrial, commercial and residential types. Among them, residential types are subdivided into three types according to different residential types: apartment, large lots and 1/2 acres. It is also assumed that the water demand diurnal curves of the same type are similar, but the differences and characteristics between different types are obvious. For example, at the beginning (6–9 AM) and end (5–9 PM) of the working day, large peaks appear in water demand curves of residential types. In contrast, the water demand curves of the commercial types have a sharper rise and fall at the beginning and end of the working day, and remains flat during the working day. Industrial water use is relatively constant throughout the day.

However, in Chinese cities, the boundaries between different types of water service areas are not always clear, and there are many mixed areas, such as areas where industrial users and residential users are mixed, and even areas where commercial, industrial and residential users are mixed. In addition, most of the residential types in Chinese cities are commercial residential buildings. Considering the above practical factors, the water use nodes are divided into the following seven categories: (a) commercial, (b) industrial, (c) residential, (d) commercial and industrial mixed (com-ind-mix), (e) commercial and residential mixed (com-res-mix), (f) industrial and residential mixed (ind-res-mix) and (g) industrial, commercial and residential mixed (All-mix).

All user nodes are grouped according to the region, the user type and the location of the measurement meters, while it is ensured that each group has at least one A-class node (if there are more than one A-class node in a group, one is selected as the representative). By assuming that variation trend and amplitude of water demand of the next period for B-class nodes in the same group are similar to that of the A-class node, the NWD in the next period can be estimated based on the current NWD.

Data assimilation

This study couples the EnKF technique with the RTHM to improve state and parameter estimation. In the combined approach, flow rate measurements of A-class nodes are assimilated with the RTHM for approximating the states of the WDS. The EnKF method consists of two steps, prediction and update. The prediction step uses a forward simulation model to predict the system state as in Equation (3).
(3)
(4)
where is the ensemble member forecast at time ; is the updated ensemble member at time t; f is the forward simulation model (in this case, the EPANET hydraulic model of the WDS); is the system input or forcing data, is the ensemble of the forcing data, is the system parameter; is the process noise; is the noise, sampled from a distribution. Important model parameters include water demand, pipe length, pipe diameter, pipe roughness coefficient and boundary conditions include tanks initial level, reservoir head and pump scheduling. Among these parameters, water demand, pipe length, and pipe diameter are assumed to be known and so can be classified as the system input u. is then used to calculate the predicted states of the system, the predicted measurements,
(5)
(6)
(7)
where h represents the relationship between the system states, parameters and measurements; is the ensemble of the field observation at the ()th time step; is the the field observation at the ()th time step; is a noise; is the forecasted state ensemble; is the the Kalman gain matrix, which is estimated by:
(8)
where is the forecast cross covariance of a priori state estimate and prediction ; is the forecast error covariance of prediction . In this study, pipe roughness coefficient C is updated using iterative restart EnKF, which is suitable for sequential forecasting and updating of parameters (Rajakumar et al. 2019).

Uncertainty simulation and quantification

Because there is actually uncertainty in the water demand of all nodes in each group, a random normal variable N (0, σ) is introduced in the predicted or estimated value of each node demand to simulate the actual NWD. In order to examine the impact of the input uncertainty amplitude on the model output, three levels of uncertainty are evaluated, defined by the coefficient of variation (CV) of normally distributed variables, and the CV corresponding to the three levels are 0.1, 0.15 and 0.2 respectively.

FOSM estimates model uncertainty by approximating a function using a Tylor series expansion and dropping the high-order terms. The variance of the model output Y can be shown to:
(9)
where is a K-dimensional vector of the sensitivity coefficients at mean of parameters. The sensitivity matrix elements are the gradients of the model output Y with respect to the model parameter X. is the variance matrix of the model input parameters that represents input uncertainty. Since FOSM is a truncated approximation based on the first-order Taylor series, it provides a better estimate of the output statistical parameters for a linear system than a nonlinear system.

MCS is an enumeration-based technique for simulating and calculating the uncertainty in the system output given model input uncertainty. In MCS, a large number of parameter sets (known as realizations) are generated according to the probability distribution of the input parameters, and the target system response is repeatedly measured. In this study, the node demand is generated from the basic predicted or estimated value and the added appropriate normal distributed random variable, the probability distribution of node demand is defined by their mean and variance. EPANET2 is used to calculate the nodal pressure under the assumption that the demand is fully satisfied. This process is repeated many times until the convergence criterion is met. The application of MCS is not limited to any specific type of probability distribution. But the disadvantage is that it requires sufficient computational power.

The real network system in the new urban area of a Chinese city is used for analysis and discussion. The network includes 331 nodes, 337 pipes and one source. The pipe length of the network system ranges from 137 to 4,783 m, and the pipe diameter ranges from 50 to 500 mm. The pipes are mainly ductile iron pipes and steel pipes, and laying date is between year 2001 and 2002. Currently, 14 flow meters are installed. According to the aforementioned grouping rule, the entire system is divided into 13 node groups artificially, the metered nodes (A-class nodes) in each group are shown in Figure 2, and the two nodes close to the water source have no water consumption, so they did not participate in the grouping, the result of grouping area and type is shown in Figure 2. The 14 flow meters have collected about 240,000 NWD data for almost two years (April 1, 2015 to March 7, 2017). The data collection and transmission frequency are noted once an hour. The weather data of the same period are collected and processed as quantifiable data. Table 2 shows the information about the raw data of the water demand for 14 nodes. It can be seen that there are obviously impractical extreme values.

Table 2

The information on the raw data of water demand for 14 nodes

NodeGroup typeCountMean (m3/h)Standard deviation (m3/h)Coefficient of variationMinimum (m3/h)1st Quartile (m3/h)Median (m3/h)3rd Quartile (m3/h)Maximum (m3/h)
217 commercial 17,016 14.32 24.21 1.69 3.32 9.66 14.25 825.67 
198 commercial 17,016 9.8 6.5 0.66 6.1 7.77 11.53 134.01 
86 Industrial 17,016 35.24 21.85 0.62 21.94 39.38 57.56 422.77 
327 Industrial 17,016 47.58 24.9 0.52 18.22 56.78 79.47 936.06 
62 Residential 17,016 8.87 7.51 0.85 3.03 7.59 13.72 131.19 
244 Residential 17,016 21.96 40.01 1.82 7.06 13.13 18.79 3,632.43 
260 Residential 17,016 27.31 65.08 2.38 8.31 12.59 22.35 6,717.54 
272 Residential 17,016 8.34 13.8 1.57 3.56 6.22 18.92 123.34 
44 Com-ind-mix 17,016 19.44 33.5 1.72 10.89 17.73 55.18 386.02 
171 Com-res-mix 17,016 20.33 28.22 1.39 8.48 22.81 46.63 190.34 
111 Ind-res-mix 17,016 69.33 154.18 2.22 32.75 50.63 77.79 6,125.44 
317 Ind-res-mix 17,016 35.24 58.85 1.67 11.94 19.38 27.56 1,422.77 
142 All-mix 17,016 14.55 13.94 0.96 7.74 11.73 17.28 400.52 
21 All-mix 17,016 5.45 5.72 1.05 1.88 3.3 7.61 83.8 
NodeGroup typeCountMean (m3/h)Standard deviation (m3/h)Coefficient of variationMinimum (m3/h)1st Quartile (m3/h)Median (m3/h)3rd Quartile (m3/h)Maximum (m3/h)
217 commercial 17,016 14.32 24.21 1.69 3.32 9.66 14.25 825.67 
198 commercial 17,016 9.8 6.5 0.66 6.1 7.77 11.53 134.01 
86 Industrial 17,016 35.24 21.85 0.62 21.94 39.38 57.56 422.77 
327 Industrial 17,016 47.58 24.9 0.52 18.22 56.78 79.47 936.06 
62 Residential 17,016 8.87 7.51 0.85 3.03 7.59 13.72 131.19 
244 Residential 17,016 21.96 40.01 1.82 7.06 13.13 18.79 3,632.43 
260 Residential 17,016 27.31 65.08 2.38 8.31 12.59 22.35 6,717.54 
272 Residential 17,016 8.34 13.8 1.57 3.56 6.22 18.92 123.34 
44 Com-ind-mix 17,016 19.44 33.5 1.72 10.89 17.73 55.18 386.02 
171 Com-res-mix 17,016 20.33 28.22 1.39 8.48 22.81 46.63 190.34 
111 Ind-res-mix 17,016 69.33 154.18 2.22 32.75 50.63 77.79 6,125.44 
317 Ind-res-mix 17,016 35.24 58.85 1.67 11.94 19.38 27.56 1,422.77 
142 All-mix 17,016 14.55 13.94 0.96 7.74 11.73 17.28 400.52 
21 All-mix 17,016 5.45 5.72 1.05 1.88 3.3 7.61 83.8 
Figure 2

Application network schematic showing node groups and measurement locations.

Figure 2

Application network schematic showing node groups and measurement locations.

Close modal

First, the two data processing methods are used to process the node WD data, and the abnormal data are detected and corrected. Then, the historical WD data, weather data, time dynamic data and workday judgment data of each node are merged according to the date and time axis. The data are divided into training and test subsets (80 and 20% of the data length, respectively) to examine the prediction performance of different machine learning prediction models.

Anaconda, an integrated Python environment, is used to complete the training of the corresponding models. The XGBoost models, RF models and ANN models use the python-related machine learning library (Scikit-Learn) to develop. XGBoost and RF algorithms are used to build multi-dimensional prediction models, the input variable X of the training data has 26 features, as shown in Table 1. The output variable Y is the WD of the node. At the same time, the grid search module is used to systematically traverse the various parameter combinations of the models, and the optimal parameters are determined by cross-validation. These efforts significantly reduce the time and effort needed for modeling.

The ANN algorithm is used to establish single-dimensional prediction models. The input variable X of the training data for ANN models is the 24-h NWD in the past few days, and the output variable Y is the 24-h NWD of the target day (represented by a 24-dimensional vector). Its feature value of input layer is 5, and the number of neurons in the hidden layer is 8. The test data subset is used to analyze and compare the performance of different models. The specific details of the optimization parameter settings, input and output data, application scenarios and implementation environment of three machine learning algorithms and two data processing methods are shown in Table 3.

Table 3

The specific details of two data processing methods and three machine learning algorithms

ModelOptimized model parametersModel inputsModel outputsApplication scenarioImplementation environment
DBSCAN EPS=0.066,
minpts=12 
All NWD data Abnormal data point Classification Python 
ADR-CI α%=97%,
β=2.17 
All NWD data Abnormal data point Classification Python 
ANN Number of layers=3,
Learning rate=0.02,
Activation=ReLU,
Batch size=80 
The 24-h NWD in the past few days The 24-h NWD of the target day Prediction Python 
RF Estimators=1,000;
min samples split=2,
min samples leaf=1 
The training data with 26 features The WD of the node Prediction Python 
XGBoost Estimators=2,000,
learning rate=0.05,
max depth=8 
The training data with 26 features The WD of the node Prediction Python 
ModelOptimized model parametersModel inputsModel outputsApplication scenarioImplementation environment
DBSCAN EPS=0.066,
minpts=12 
All NWD data Abnormal data point Classification Python 
ADR-CI α%=97%,
β=2.17 
All NWD data Abnormal data point Classification Python 
ANN Number of layers=3,
Learning rate=0.02,
Activation=ReLU,
Batch size=80 
The 24-h NWD in the past few days The 24-h NWD of the target day Prediction Python 
RF Estimators=1,000;
min samples split=2,
min samples leaf=1 
The training data with 26 features The WD of the node Prediction Python 
XGBoost Estimators=2,000,
learning rate=0.05,
max depth=8 
The training data with 26 features The WD of the node Prediction Python 

According to the pipe characteristics, 337 pipes are divided into five groups, and the corresponding initial pipe roughness coefficient C values are, respectively, as follows: =112.87, =114.92, =118.43, =123.67, =126.18. It is worth noting that these C values have been pre-calibrated when the hydraulic model of the case network was first established. Subsequently, they are continuously updated using the EnKF method to make the RTHM closer to the actual system. In fact, not only the pipe roughness coefficient is corrected, but also the subtle changes of other status or parameters of the RTHM will be expressed indirectly through this process. In other words, the update of the C values does not necessarily make the roughness coefficient of the pipe closer to the true value, but makes the overall simulation process and predicted output of the RTHM more accurate. The simulation time step of the RTHM in this study is 1 h. If the moving step of the DA method is also 1 h, it will not only lead to enormous computation burden, but also make the updating results of the pipe roughness coefficient very unstable, which will greatly reduce the robustness of the hydraulic model. Considering this problem, the moving step length of the EnKF method is set to 24 h, that is, the errors of average forecasts and measurements in the past 24 h are used to update the pipe roughness coefficient of the RHTM.

Performance of NWD prediction models

The established prediction models are used to predict the hourly NWD during the test subset period (from October 1, 2016 to February 28, 2017, one value per hour, 24 values per day). The coefficient of determination R2 of the predicted and the actual value (from the test subset) are obtained, and the average R2 values of the test data period are finally calculated, and the results are shown in Table 4, Figure 3 is a heat map drawn using the data in Table 4.

Table 4

The average value of R2 of forecasting models on testing data

NodeXGBRFANND-XGBD-RFD-ANNA-XGBA-RFA-ANNAverage
217 0.674 0.658 0.692 0.741 0.704 0.736 0.705 0.683 0.716 0.701 
198 0.882 0.869 0.910 0.934 0.905 0.962 0.890 0.889 0.942 0.909 
86 0.846 0.821 0.803 0.898 0.858 0.879 0.854 0.838 0.819 0.846 
327 0.864 0.809 0.812 0.947 0.873 0.858 0.873 0.813 0.827 0.853 
62 0.409 0.421 0.367 0.410 0.395 0.451 0.423 0.446 0.384 0.412 
244 0.818 0.774 0.764 0.920 0.865 0.855 0.848 0.792 0.785 0.825 
260 0.794 0.836 0.778 0.880 0.911 0.852 0.814 0.845 0.795 0.834 
272 0.710 0.688 0.732 0.807 0.711 0.799 0.720 0.702 0.763 0.737 
44 0.734 0.695 0.789 0.819 0.760 0.871 0.768 0.711 0.802 0.772 
171 0.772 0.717 0.754 0.864 0.779 0.834 0.786 0.724 0.761 0.777 
111 0.801 0.751 0.750 0.883 0.829 0.817 0.811 0.760 0.777 0.798 
317 0.787 0.769 0.753 0.864 0.803 0.859 0.808 0.785 0.773 0.800 
142 0.791 0.742 0.786 0.900 0.824 0.882 0.804 0.748 0.819 0.811 
21 0.808 0.750 0.783 0.902 0.838 0.863 0.813 0.762 0.815 0.815 
Average 0.764 0.736 0.748 0.841 0.790 0.823 0.780 0.750 0.770 0.778 
NodeXGBRFANND-XGBD-RFD-ANNA-XGBA-RFA-ANNAverage
217 0.674 0.658 0.692 0.741 0.704 0.736 0.705 0.683 0.716 0.701 
198 0.882 0.869 0.910 0.934 0.905 0.962 0.890 0.889 0.942 0.909 
86 0.846 0.821 0.803 0.898 0.858 0.879 0.854 0.838 0.819 0.846 
327 0.864 0.809 0.812 0.947 0.873 0.858 0.873 0.813 0.827 0.853 
62 0.409 0.421 0.367 0.410 0.395 0.451 0.423 0.446 0.384 0.412 
244 0.818 0.774 0.764 0.920 0.865 0.855 0.848 0.792 0.785 0.825 
260 0.794 0.836 0.778 0.880 0.911 0.852 0.814 0.845 0.795 0.834 
272 0.710 0.688 0.732 0.807 0.711 0.799 0.720 0.702 0.763 0.737 
44 0.734 0.695 0.789 0.819 0.760 0.871 0.768 0.711 0.802 0.772 
171 0.772 0.717 0.754 0.864 0.779 0.834 0.786 0.724 0.761 0.777 
111 0.801 0.751 0.750 0.883 0.829 0.817 0.811 0.760 0.777 0.798 
317 0.787 0.769 0.753 0.864 0.803 0.859 0.808 0.785 0.773 0.800 
142 0.791 0.742 0.786 0.900 0.824 0.882 0.804 0.748 0.819 0.811 
21 0.808 0.750 0.783 0.902 0.838 0.863 0.813 0.762 0.815 0.815 
Average 0.764 0.736 0.748 0.841 0.790 0.823 0.780 0.750 0.770 0.778 
Figure 3

The heat map of the average of R2 of the forecasting models on the test data. Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/hydro.2022.033.

Figure 3

The heat map of the average of R2 of the forecasting models on the test data. Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/hydro.2022.033.

Close modal

The results in Table 4 and Figure 3 lead to the following observations: (1) according to the overall analysis of R2, the XGBoost models show the best performance, and their forecasts are the most accurate. The ANN models are slightly inferior, and the RF models are the worst among the three machine learning models. (2) With data processing for abnormal data detection and correction, the forecasting accuracy of the three machine learning models improves significantly. Here, we take the XGBoost models as an example to illustrate: for the XGB models that directly uses the original data as the input value, the average R2 value for all 14 nodes is 0.764, and after the raw data are processed by the DBSCA and ACI method, this value increases to 0.841 and 0.780, respectively. The ANN and RF models also have similar behaviors, indicating that the data processing methods are effective, and compared with the ACI method, the DBSCA method performs better in removing outliers and contributes more to improving the performance of the models. (3) The D-XGB models established on 14 nodes are analyzed separately: the highest R2 value is 0.947 and the average is 0.841; only two nodes (nodes 217 and 62) have R2 values below 0.8. This demonstrates that the overall forecasting performance of the XGBoost models is acceptable. When combined with the results of the ANN model, it is found that the overall performance of the multi-dimensional forecasting model is better than that of the single-dimensional forecasting model, so it is feasible and effective to use multi-dimensional input variables as the input to establish forecasting models with the NWD as the output. (4) Among the 14 nodes in this study, node 198 is the best performer with the average R2 values 0.909 for all nine models. By analyzing the raw data in Table 2, it can be seen that the CV of node 198's water demand data is the smallest among all 14 nodes. This indicates that the performance of NWD prediction models is related to the mathematical characteristics of the NWD data. Contrastingly, node 62 is the worst performer, the average R2 values for all nine models is only 0.412. The reason may be related to the high percentage of zero value in node 62′ raw water demand data. In all 17,016 data of node 62, about 9% of the data (1,493) have a value of 0. They mainly occurred at 0:00–5:00 AM during the night, whereas the nonzero data at the same period fluctuated significantly (the fluctuation range is from less than 0.01 to more than 15 m3/h). This extremely irregular variation of the NWD is probably responsible for the poor prediction of the models. Even with the use of preprocessing method, the improvement in data quality is extremely limited. In this case, it may be necessary to go to the site to find the cause or use other methods to forecast the NWD for such particular node. (5) In the heat map, the darker the block color, the larger the R2 value, and the better the performance of the corresponding model. Figure 3 shows predictive models built on some nodes perform well, such as nodes 198, 86, 327, etc., whereas models built on other nodes are relatively poor, such as nodes 272, 44, 171, etc. It can be seen from Figure 2 that the nodes 86 and 327 belong to the industrial node group, the node 198 belongs to the commercial node group and the nodes 272 and 44.171 belong to the residential node group or the mixed group. Compared with industrial or commercial water users, the uncertainty of the water demand of residential or mixed users is greater, the diurnal water demand curves are more irregular, and the CV of the original water demand data is relatively large, which brings more difficulties to the establishment of accurate prediction models.

In addition, some more interesting phenomena are found in Table 4 and Figure 3: (1) although the overall performance of XGBoost models is better than that of RF and ANN models, there are also individual nodes where the performance of RF or ANN models is superior. For example, the ANN models for nodes 44 and 198 are the most outstanding among the three machine learning models. The reason is probably related to the magnitude of the NWD change. Table 2 shows that the CV of nodes 44 and 198 is small, indicating that the dispersion of the NWD data of these two nodes is relatively small. The water demand curve is relatively stable, so it may be more appropriate to use the ANN methods to make predictions. (2) The similar exception phenomenon is observed for the two processing methods. As a whole, the DBSCAN method is more conducive to improving the performance of models than the ADR-CI method, but there are exceptions. For example, the ADR-CI method performs better and is more effective for the models established in node 327. This is probably related to the distribution characteristics of the NWD raw data. As shown in Table 2, the average water demand of node 327 is 27.578 m3/h; the CV is as high as 597.94%; its first quartile, median, and third quartile are 18.219, 26.781 and 33.469 m3/h, respectively; the average and median are very close. These results indicate that the water demand data of node 327 is more discrete and symmetrical. These inferences are consistent with the assumption of the ADR-CI method that the data conform to normal distribution, so the ADR-CI method performs better at node 327.

Feature importance

For the XGBoost and RF methods, a total of 26 features are present in the input independent variables of training data, but they play different roles in the training model. Some features have a more important impact on the determination of model parameters and model predictions, whereas other features are less significant. Through the relevant built-in function of the Scikit-Learn library, the feature importance of XGBoost and RF models is analyzed. The orders of feature importance of D-XGB and D-RF models established in nodes 244 and 21 are shown in Figure 4.

In the D-XGB and D-RF models established for nodes 244 and 21, the two most important features are hour and week of the year, both of which are time variables. It proves that time features are extremely important in predictive models. Figure 4(a) and 4(c) shows that the orders of feature importance of D-XGB prediction models established in two nodes are inconsistent. Similarly, Figure 4(b) and 4(d) shows that the feature important orders for two nodes of the D-RF prediction models are also different.

Figure 4

Feature importance of different forecasting models established in two nodes: (a) the D-XGB model in node 244; (b) the D-RF model in node 244; (c) the D-XGB model in node 21; (d) the D-RF model in node 21.

Figure 4

Feature importance of different forecasting models established in two nodes: (a) the D-XGB model in node 244; (b) the D-RF model in node 244; (c) the D-XGB model in node 21; (d) the D-RF model in node 21.

Close modal

Previous studies of short-term water demand (mainly daily water demand or peak daily water demand) forecasting models have found that weather is the most important influencing factor (Tian et al. 2016), and this research proves that in the hourly NWD forecasting models, time features are the most important influencing factors. This inconsistency is caused by the different time accuracy of the prediction models. There is an essential relationship between the hourly NWD and time period, which can also be confirmed from water demand diurnal curves.

Notably, the most important feature is hour in D-RF prediction models (from Figure 4(b) and 4(d)), its feature importance far exceeds other features, while other features importance is very limited. Oppositely, the overall distribution of feature importance of D-XGB prediction models (from Figure 4(a) and 4(c)) is more uniform. In other words, the feature importance of D-RF prediction models has extremes and their overall performance is extremely uneven. In comparison, the gaps between feature importance of D-XGB prediction models are small, this indicates that during the establishment of D-XGB prediction models, the information introduced by different types of data is more widely used. It is also more robust to the risk of model failure due to the abnormal variation of a single most important feature. The corresponding cost is a more complex and bulky raw data collection process.

DA for RTHM correction to improve parameter and state estimation

The results in Figure 5 compare the differences between observed flow rates and RTHM forecasts (with or without EnKF) for node 317 and the total average of 14 A-class nodes during 72 h. It can be observed that whether for an individual node (node 317 is used as an example in Figure 5, other nodes are not shown), or for the total average of 14 A-class nodes, the simulated forecasts of RTHM combined with EnKF is closer to the actual observation. This shows that the EnKF method reduces the prediction error, especially during the peak water consumption period, the EnKF method improves the model prediction results more obviously. To quantify the modeling performance, the Root Mean Square Error (RMSE) was implemented as the criteria to measure the significance of improvement resulting from the EnKF techniques, and the observed flow rates is approximately regarded as the true value.

Figure 5

Comparison of observed flow rates and RTHM forecasts (with or without EnKF) for node 317 and the total average of 14 A-class nodes during the period from 0:00 on March 5, 2017 to 24:00 on March 7, 2017 (72 h).

Figure 5

Comparison of observed flow rates and RTHM forecasts (with or without EnKF) for node 317 and the total average of 14 A-class nodes during the period from 0:00 on March 5, 2017 to 24:00 on March 7, 2017 (72 h).

Close modal

Table 5 summarizes RMSE of observed flow rates and RTHM forecasts (with or without EnKF) for all 14 A-class nodes during 3 days. Note that RMSE is the average value of 24-h values calculated on the day. The improvement in model performance is not obviously ignorable. When RTHM is combined with EnKF, the reduction of RMSE of each node basically exceeds 20%, which shows that the EnKF technique is able to attain good output accuracy across the case network for state estimation. Figure 6 shows changes of computed pipe roughness coefficient C of updated RTHM during the period from February 6, 2017 to March 8, 2017. It is clear from the results that C values constantly change dynamically, which also makes the RTHM continually updated and tends to be higher accuracy. Hence, it can be deduced that EnKF can be used for RTHM state and parameter estimation.

Table 5

RMSE of observed flow rates and RTHM forecasts (with or without EnKF) for all 14A-class nodes during the period from 0:00 on March 5, 2017 to 24:00 on March 7, 2017 (3 days)

NodeRMSE for RTHM
RMSE for RTHM with EnKF
Day 1Day 2Day 3AverageDay 1Day 2Day 3Average
217 2.493 2.413 2.473 2.460 1.966 1.912 1.953 1.944 
198 1.138 1.238 1.087 1.154 1.057 1.136 1.017 1.070 
86 4.156 3.957 4.116 4.076 3.202 3.072 3.176 3.150 
327 6.928 6.929 7.142 6.999 5.559 5.560 5.705 5.608 
62 1.771 1.772 1.881 1.808 1.243 1.243 1.308 1.265 
244 2.718 2.757 2.955 2.810 1.877 1.900 2.016 1.931 
260 3.385 3.321 3.215 3.307 2.893 2.847 2.769 2.836 
272 1.809 1.805 1.821 1.812 1.309 1.306 1.317 1.311 
44 2.885 2.775 2.907 2.856 2.616 2.531 2.633 2.594 
171 2.132 2.160 2.237 2.176 2.115 2.139 2.204 2.152 
111 7.894 7.578 8.489 7.987 6.937 6.701 7.381 7.006 
317 4.978 5.141 5.315 5.145 4.407 4.529 4.661 4.532 
142 1.732 1.802 1.637 1.723 1.160 1.200 1.106 1.155 
21 0.925 0.950 0.891 0.922 0.798 0.817 0.773 0.796 
Average 3.210 3.186 3.297 3.231 2.653 2.635 2.716 2.668 
NodeRMSE for RTHM
RMSE for RTHM with EnKF
Day 1Day 2Day 3AverageDay 1Day 2Day 3Average
217 2.493 2.413 2.473 2.460 1.966 1.912 1.953 1.944 
198 1.138 1.238 1.087 1.154 1.057 1.136 1.017 1.070 
86 4.156 3.957 4.116 4.076 3.202 3.072 3.176 3.150 
327 6.928 6.929 7.142 6.999 5.559 5.560 5.705 5.608 
62 1.771 1.772 1.881 1.808 1.243 1.243 1.308 1.265 
244 2.718 2.757 2.955 2.810 1.877 1.900 2.016 1.931 
260 3.385 3.321 3.215 3.307 2.893 2.847 2.769 2.836 
272 1.809 1.805 1.821 1.812 1.309 1.306 1.317 1.311 
44 2.885 2.775 2.907 2.856 2.616 2.531 2.633 2.594 
171 2.132 2.160 2.237 2.176 2.115 2.139 2.204 2.152 
111 7.894 7.578 8.489 7.987 6.937 6.701 7.381 7.006 
317 4.978 5.141 5.315 5.145 4.407 4.529 4.661 4.532 
142 1.732 1.802 1.637 1.723 1.160 1.200 1.106 1.155 
21 0.925 0.950 0.891 0.922 0.798 0.817 0.773 0.796 
Average 3.210 3.186 3.297 3.231 2.653 2.635 2.716 2.668 
Figure 6

Evolutions of pipe roughness coefficient C of the updated RTHM during the period from February 6, 2017 to March 8, 2017.

Figure 6

Evolutions of pipe roughness coefficient C of the updated RTHM during the period from February 6, 2017 to March 8, 2017.

Close modal

Uncertainty analysis

The established D-XGB prediction models are chosen to predict the NWD of 14 A-class nodes on March 8, 2017. Then, according to the aforementioned rules, the water demand of all B-class nodes in the 13 node groups are estimated. In order to simulate the uncertainty of the real NWD, normal distribution random variables for three uncertainty levels are introduced into the estimated NWD.

After using RTHM combined with EnKF for hydraulic simulation, the hydraulic states of the whole system are obtained. The nodal pressure is the output parameter of interest because of its importance and wide applicability in hydraulic analysis. In each node group, one node is taken as a representative node for discussion, and the position of the 13 representative node is shown in Figure 2.

Table 6 lists the standard deviations and mean values of output nodal pressures simulated by FOSM and MCS, respectively, on 13 representative nodes at 8 AM on March 8, 2017, which shows that simulation results of the two methods are slightly different under three uncertainty levels, the mean values and standard deviations calculated by MCS are slightly smaller compared to that of FOSM. The reason may be related to the different inherent principles of the two methods. As mentioned earlier, FOSM is based on a first-order truncated Taylor series approximation, its estimation is affected by the degree of nonlinearity of the system. Under normal conditions, it is generally believed that the simulation result of MCS is closer to the true value and have more accurate prediction intervals (Kang et al. 2009). However, for the selected representative nodes, FOSM and MCS give less than 3% errors for pressure mean value, that are, good fits. The simulation time required by the FOSM method is only about 210 s, which is much shorter than the MCS method (about 6,700 s), so a trade-off between accuracy and computational effort needs to be considered.

Table 6

Standard deviations and mean values of output nodal pressures simulated by FOSM and MCS, respectively, on 13 representative nodes at 8 AM on March 8, 2017

NodeUncertainty level 1
Uncertainty level 2
Uncertainty level 3
Mean
Standard deviation
Mean
Standard deviation
Mean
Standard deviation
FOSMMCSFOSMMCSFOSMMCSFOSMMCSFOSMMCSFOSMMCS
233 34.78 34.63 0.98 0.86 34.44 34.08 1.26 1.13 33.75 33.39 1.72 1.62 
195 38.05 37.69 1.01 0.98 37.66 37.20 1.28 1.11 36.87 36.42 1.72 1.76 
72 36.94 36.54 0.94 0.90 36.31 35.85 0.99 1.22 35.39 34.94 1.64 1.62 
331 35.74 35.56 0.94 0.84 35.32 34.91 1.09 1.01 34.40 34.00 1.75 1.43 
52 35.33 34.99 1.96 1.77 34.91 34.52 2.19 1.98 34.13 33.74 3.24 2.94 
251 34.08 33.89 2.07 1.93 33.82 33.41 2.68 2.16 33.07 32.67 3.59 3.32 
99 37.11 36.88 0.91 0.85 36.66 36.20 0.98 0.87 35.80 35.35 1.65 1.49 
36 36.67 36.23 1.20 1.07 36.15 35.74 1.38 1.24 35.46 35.05 2.16 1.90 
211 36.13 35.87 1.14 0.99 35.66 35.23 1.55 1.10 34.66 34.24 2.16 1.78 
128 35.91 35.60 1.30 1.21 35.50 35.08 1.52 1.40 34.64 34.23 2.41 2.24 
293 35.06 34.82 1.22 1.12 34.65 34.28 1.46 1.29 33.76 33.40 2.18 1.92 
133 39.24 38.85 1.95 1.76 38.51 38.08 2.67 2.42 37.53 37.11 3.61 3.04 
29 39.01 38.55 0.95 0.90 38.27 37.84 1.06 0.97 37.40 36.98 1.70 1.53 
NodeUncertainty level 1
Uncertainty level 2
Uncertainty level 3
Mean
Standard deviation
Mean
Standard deviation
Mean
Standard deviation
FOSMMCSFOSMMCSFOSMMCSFOSMMCSFOSMMCSFOSMMCS
233 34.78 34.63 0.98 0.86 34.44 34.08 1.26 1.13 33.75 33.39 1.72 1.62 
195 38.05 37.69 1.01 0.98 37.66 37.20 1.28 1.11 36.87 36.42 1.72 1.76 
72 36.94 36.54 0.94 0.90 36.31 35.85 0.99 1.22 35.39 34.94 1.64 1.62 
331 35.74 35.56 0.94 0.84 35.32 34.91 1.09 1.01 34.40 34.00 1.75 1.43 
52 35.33 34.99 1.96 1.77 34.91 34.52 2.19 1.98 34.13 33.74 3.24 2.94 
251 34.08 33.89 2.07 1.93 33.82 33.41 2.68 2.16 33.07 32.67 3.59 3.32 
99 37.11 36.88 0.91 0.85 36.66 36.20 0.98 0.87 35.80 35.35 1.65 1.49 
36 36.67 36.23 1.20 1.07 36.15 35.74 1.38 1.24 35.46 35.05 2.16 1.90 
211 36.13 35.87 1.14 0.99 35.66 35.23 1.55 1.10 34.66 34.24 2.16 1.78 
128 35.91 35.60 1.30 1.21 35.50 35.08 1.52 1.40 34.64 34.23 2.41 2.24 
293 35.06 34.82 1.22 1.12 34.65 34.28 1.46 1.29 33.76 33.40 2.18 1.92 
133 39.24 38.85 1.95 1.76 38.51 38.08 2.67 2.42 37.53 37.11 3.61 3.04 
29 39.01 38.55 0.95 0.90 38.27 37.84 1.06 0.97 37.40 36.98 1.70 1.53 

By analyzing individual representative nodes, it is summarized that with the increase of uncertainty level of the NWD, the mean values of the nodal pressure decrease, which may be because the pressure loss caused by flow change is not linear, but exponential. So, the increase of uncertainty level leads to the broadening of the variation range of the NWD, and the occurrence of the NWD with a larger value leads to the rapid exponential drop of pressure head, and finally causes the node pressure mean to decrease continuously. In addition, the standard deviations of individual nodes’ pressure are less than proportionally to change in input uncertainty for water demand with varying levels of uncertainty, which shows that uncertainty is weakened in the propagation process.

In order to visually compare and analyze the uncertainty in the output pressure of different representative nodes, the box plots of the 95% confidence intervals of the output pressure simulated by the FOSM and MCS methods (8 AM on March 8, 2017) under uncertainty level 2 for the 13 representative nodes are shown in Figure 7. For MCS, the 97.5 percentile and 2.5 percentile of the simulated results correspond to the upper and lower edges of the boxes, respectively, while FOSM estimates the upper and lower limits of the 95% confidence interval by adding 1.96 times the standard deviation to the mean and subtracting 1.96 times the standard deviation from the mean (Kang & Lansey 2009). Prediction intervals estimated by the FOSM method are very close to the MCS results even for the highest input uncertainty (not presented). However, the pressure prediction intervals obtained by the MCS method are generally narrower and more accurate. It is also shown that the interval width between individual nodes is quite different. For example, the confidence intervals of output pressure at node 52,251,293 are wider and the uncertainties are relatively large. This demonstrates that the uncertainty of output pressure is related to the characteristics of the NWD, or more traceably, the characteristics of the node water users. The randomness of water use in residential areas and mixed areas leads to the greater uncertainty of the output value. From this perspective, grouping different user types and then discussing uncertainty can more reasonably estimate the variation characteristics of the NWD within the group, and thus reduce the uncertainty in the model output.

Figure 7

The box plots of the 95% confidence intervals of the output pressure simulated by FOSM and MCS (8 AM on March 8, 2017) under uncertainty level 2 for the 13 representative nodes.

Figure 7

The box plots of the 95% confidence intervals of the output pressure simulated by FOSM and MCS (8 AM on March 8, 2017) under uncertainty level 2 for the 13 representative nodes.

Close modal

The time series (24 h of March 8, 2017) of pressure prediction intervals estimated by MCS and FOSM for uncertainty level 2 at node 99 are shown in Figure 8. The uncertainty bound estimated by FOSM for pressure fit well with that of MCS in 24 h. During the high-demand times, the uncertainty in the nodal pressure is greater, while the uncertainty in the pressure is small during the night minimum demand. Therefore, accurate estimation of the NWD (especially during the high-demand times) has vital influence on the uncertainty in water pressure. Also, this is why this study has spent great effort to build accurate NWD prediction models. In relevant decisions (such as anomaly identification and model design), not only model uncertainty is taken into account, but also the changes of model uncertainty in different water demand periods need to be analyzed. In other words, accurate dynamic real-time model that takes uncertainty into account is very important in hydraulic application analysis.

Figure 8

The time series (24 h of March 8, 2017) of pressure prediction intervals estimated by MCS and FOSM for uncertainty level 2 at node 99.

Figure 8

The time series (24 h of March 8, 2017) of pressure prediction intervals estimated by MCS and FOSM for uncertainty level 2 at node 99.

Close modal

This paper proposes a framework to improve the timeliness and accuracy of RTHMs by accurately forecasting the hourly NWD, node grouping for reducing the uncertainty in water demand and DA method, so as to realize the accurate real-time state estimation of the WDS. In addition, the methods of processing monitoring data are studied, and the influence of different machine learning algorithms and input features on the performance of the NWD prediction models is analyzed. The effect of the EnKF method on improving RTHM parameters and state estimation performance is compared. Based on the results of the case study, the following main conclusions are drawn:

(1) By establishing the accurate hourly NWD prediction models and synchronously updating it into the input of the RTHMs, the model timeliness is effectively improved. (2) Appropriate flow data processing methods are shown to reduce effectively the measurement uncertainty, while the methods of node grouping are proven to reduce the uncertainty in water demand, and ultimately leading to obtain a narrower and more accurate confidence interval for pressure estimation. In other words, it reduces the uncertainty in the model output and improves the accuracy of the real-time hydraulic model. (3) The hourly NWD estimation (especially during the period of high demand) has a great influence on the uncertainty in water pressure, so it is very important to establish accurate hourly NWD prediction models. The XGBoost method is proven to be very reliable and excellent. Its overall performance is better than the ANN and RF methods. In addition, time features are very important for hourly NWD prediction models. (4) The EnKF method is capable of reducing the uncertainty inherited in RTHMs by assimilating the field measurements, and contributing to improvement of the model performance. (5) The simulation results obtained by using the two uncertainty analysis methods of FOSM and MCS are in good fits. The mean and standard deviation of the output pressure calculated by MCS are slightly smaller, but the error is within 3%. In addition, the standard deviations of individual nodes’ pressure are less than proportionally to change in input uncertainty for water demand with varying levels of uncertainty, which shows that uncertainty is weakened in the propagation process.

Future work could consider the combination of RTHM simulation results and pressure monitoring data to detect abnormal events in the WDS. The hydraulic models considering the uncertainty are used to estimate the state of the WDS, and the pressure detection threshold are created to identify the occurrence of the failure events. Combined with the on-site pressure data, the cause, time and location of the failure events are further diagnosed, so as to achieve a refined and scientific management of water distribution systems. In addition, when using the DA method to optimize the hydraulic models, the on-site pressure data, tank or reservoir water level could also be used to update the model parameters and status estimation.

This study is jointly supported by the National Natural Science Foundation of China (51778178 and 51978203), the National Key Research and Development Program of China (2016YFC0802402 and 2018YFC0406201-3).

The authors declare that they have no conflict of interest.

Data cannot be made publicly available; readers should contact the corresponding author for details.

Ambrosio
J. K.
,
Brentan
B. M.
,
Herrera
M.
,
Luvizotto
E.
,
Ribeiro
L.
&
Izquierdo
J.
2019
Committee machines for hourly water demand forecasting in water supply systems
.
Mathematical Problems in Engineering
2019
,
1
11
.
Blesa
J.
,
Nejjari
F.
&
Sarrate
R.
2015
Robust sensor placement for leak location: analysis and design
.
Journal of Hydroinformatics
18
(
1
),
136
148
.
Chen
T.
,
He
T.
,
Benesty
M.
,
Khotilovich
V.
&
Tang
Y.
2017
XGBoost: eXtreme Gradient Boosting (R Package Version 0.6-4)
.
Chondronasios
A.
,
Gonelas
K.
,
Kanakoudis
V.
,
Patelis
M.
&
Korkana
P.
2017
Optimizing DMAs formation in a water pipe network: the water aging and the operating pressure factors
.
Journal of Hydroinformatics
19
(
6
),
890
899
.
Darmatasi
&
Arymurthy
A. M.
2016
Predicting the status of water pumps using data mining approach
. In
International Workshop on Big Data and Information Security (IWBIS)
,
East Jakarta, INDONESIA
, pp.
57
63
.
Duerr
I.
,
Merrill
H. R.
,
Wang
C.
,
Bai
R.
,
Boyer
M.
,
Dukes
M. D.
&
Bliznyuk
N.
2018
Forecasting urban household water demand with statistical and machine learning methods using large space-time data: a comparative study
.
Environmental Modelling & Software
102
,
29
38
.
Easterling
D. R.
,
Meehl
G. A.
,
Parmesan
C.
,
Changnon
S. A.
,
Karl
T. R.
&
Mearns
L. O.
2000
Climate extremes: observations, modeling, and impacts
.
Science
289
(
5487
),
2068
2074
.
Ghiassi
M.
,
Zimbra
D.
&
Saidane
H.
2008
Urban water demand forecasting with a dynamic artificial neural network model
.
Journal of Water Resources Planning and Management
134
(
2
),
71
75
.
Guo
G.
,
Liu
S.
,
Wu
Y.
,
Li
J.
,
Zhou
R.
&
Zhu
X.
2018
Short-term water demand forecast based on deep learning method
.
Journal of Water Resources Planning and Management
144
(
12
),
4018076
.
Herrera
M.
,
Torgo
L.
,
Izquierdo
J.
&
Pérez-García,
R.
2010
Predictive models for forecasting hourly urban water demand
.
Journal of Hydrology
387
(
1–2
),
141
150
.
Hutton
C. J.
,
Lansey
K.
,
Vamvakeridou-Lyroudia
L.
&
Savic
D. A.
2014
Dealing with uncertainty in water distribution system models: a framework for real-time modeling and data assimilation
.
Journal of Water Resources Planning and Management
140
(
2
),
169
183
.
Hwang
H.
,
Lansey
K.
&
Jung
D.
2017
Accuracy of first-order second moment approximation for uncertainty analysis of water distribution systems
.
Journal of Water Resources Planning and Management
144
(
2
),
04017087
.
Kang
D.
&
Lansey
K.
2009
Real-time demand estimation and confidence limit analysis for water distribution systems
.
Journal of Hydraulic Engineering
135
(
10
),
825
837
.
Kang
D.
&
Lansey
K.
2011
Demand and roughness estimation in water distribution systems
.
Journal of Water Resources Planning and Management
137
(
1
),
20
30
.
Kang
D. S.
,
Pasha
M. F. K.
&
Lansey
K.
2009
Approximate methods for uncertainty analysis of water distribution systems
.
Urban Water Journal
6
(
3
),
233
249
.
Li
X.-L.
,
H.
,
Horton
R.
,
An
T.
&
Yu
Z.
2013
Real-time flood forecast using the coupling support vector machine and data assimilation method
.
Journal of Hydroinformatics
16
,
973
988
.
Mehrparvar
M.
&
Asghari
K.
2018
Modular optimized data assimilation and support vector machine for hydrologic modeling
.
Journal of Hydroinformatics
20
,
728
738
.
Moradkhani
H.
,
Sorooshian
S.
,
Gupta
H. V.
&
Houser
P. R.
2005
Dual state–parameter estimation of hydrological models using ensemble Kalman filter
.
Advances in Water Resources
28
,
135
147
.
Mouatadid
S.
&
Adamowski
J.
2016
Using extreme learning machines for short-term urban water demand forecasting
.
Urban Water Journal
14
(
6
),
630
638
.
Okeya
I.
,
Kapelan
Z.
,
Hutton
C.
&
Naga
D.
2014
Online modelling of water distribution system using data assimilation
.
Procedia Engineering
70
,
1261
1270
.
Raei
E.
,
Shafiee
M. E.
,
Nikoo
M. R.
&
Berglund
E.
2018
Placing an ensemble of pressure sensors for leak detection in water distribution networks under measurement uncertainty
.
Journal of Hydroinformatics
21
(
2
),
223
239
.
Rajakumar
A. G.
,
Mohan Kumar
M.
,
Amrutur
B.
&
Kapelan
Z.
2019
Real-time water quality modeling with ensemble Kalman filter for state and parameter estimation in water distribution networks
.
Journal of Water Resources Planning and Management
145
(
11
),
04019049
.
Romano
M.
&
Kapelan
Z.
2014
Adaptive water demand forecasting for near real-time management of smart water distribution systems
.
Environmental Modelling & Software
60
(
7
),
265
276
.
Seok
J. H.
,
Kim
J. J.
,
Lee
J. Y.
,
Lee
J. J.
,
Song
Y. J.
&
Shin
G. W.
2014
Abnormal data refinement and error percentage correction methods for effective short-term hourly water demand forecasting
.
International Journal of Control Automation and Systems
12
,
1245
.
Sheridan
R. P.
,
Wang
W. M.
,
Liaw
A.
,
Ma
J.
&
Gifford
E. M.
2016
Extreme gradient boosting as a method for quantitative structure–activity relationships
.
Journal of Chemical Information and Modeling
56
,
2353
2360
.
Tian
D.
,
Martinez
C. J.
&
Asefa
T.
2016
Improving short-term urban water demand forecasts with reforecast analog ensembles
.
Journal of Water Resources Planning and Management
142
(
6
),
04016008
.
Tung
Y. K.
&
Yen
B. C.
2005
Hydrosystems Engineering Uncertainty Analysis
.
McGraw-Hill
,
New York
.
Viswanath
P.
&
Suresh Babu
V.
2009
Rough-DBSCAN: A fast hybrid density based clustering method for large data sets
.
Pattern Recognition Letters
30
(
16
),
1477
1488
.
Xie
X.
,
Zhou
Q.
,
Hou
D.
&
Zhang
H.
2017
Compressed sensing based optimal sensor placement for leak localization in water distribution networks
.
Journal of Hydroinformatics
20
(
6
),
1286
1295
.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).