Abstract
The real-time hydraulic model (RTHM) is a key assistive tool in water distribution system (WDS) management, and its performance directly affects assisted decision-making. This study develops a framework to improve the timeliness and accuracy of RTHMs, which includes the following five steps: flow data processing, establishing nodal water demand (NWD) prediction models, node grouping, data assimilation (DA) and uncertainty analysis. Based on the actual network data, the performance of two data processing methods and three machine learning algorithms are, respectively, compared, and the best is selected for modeling. In the establishment of the hourly NWD prediction models, massive data, including flow measurement and data of all 26 input variables on climate, time and social influencing factors are used. It is found that the time features are the most important model input parameter. Application results of actual network prove that the flow data processing method, accurate NWD prediction, node grouping and Kalman filter-based DA method reduce the uncertainty in the RTHM and improve its timeliness and accuracy, so as to obtain the real-time state estimation of the WDS. Accurate NWD estimation (especially in the high-demand period) and combining RTHM with DA have a great influence on the uncertainty reduction in water pressure estimation, although uncertainty is weakened in the propagation process.
HIGHLIGHTS
A framework to improve the timeliness and accuracy of real-time hydraulic models was established, helping to accurately estimate the real-time status of the water distribution system.
Reliable nodal water demand prediction models were established, with an accuracy of hourly level.
The obtained results improve the understanding of the propagation process and reduction methods of the uncertainty of the WDS hydraulic models.
Graphical Abstract
INTRODUCTION
By combining real-time measurement to simulate the dynamic operating state of the water distribution system (WDS), real-time hydraulic models (RTHMs) play a key role in improving fine control and emergency handling, which also makes people have higher requirements for the accuracy and timeliness of the model. Therefore, it is very essential to explore the methods to improve the timeliness and accuracy of the RTHM. The development of the Supervisory Control and Data Acquisition (SCADA) system has improved the availability of online monitoring data, which contributes to improving the accuracy of the hydraulic model on the one hand, and on the other hand, it also increases the challenge of modeling because of the need for preprocessing and comprehensive analysis of massive real-time data. The currently collected hydraulic data mainly includes pressure and flow data. The analysis and application potential of pressure data have attracted widespread attention (Blesa et al. 2015; Xie et al. 2017; Raei et al. 2018; Soroush & Abedini 2019), but there is little effort in the deep understanding of flow information (Kang & Lansey 2009).
Flow information is very important for the RTHM as an essential model input. The flow meters are generally installed in key locations of the WDS. In addition, a large number of district metering areas (DMAs) (Chondronasios et al. 2017) have been established in WDSs in order to control leakage. Most of such DMAs’ inlets have been covered with flow meters to analyze non-revenue water. These key positions are often abstracted as important nodes in the skeletonization process of establishing hydraulic models, and corresponding flow data inform the nodal water demand (NWD).
The NWD has a great influence on the performance of hydraulic model results. Due to insufficient information, the NWD is often estimated indirectly through model calibration and engineering judgment (Kang & Lansey 2011). Now, the massive flow measurement can provide us with access to directly calculate the NWD. In this study, the collected flow data in the database are considered to establish hourly NWD prediction models, and then the predicted NWD is input into the RTHM, so as to realize the real-time and effective estimation of the WDS's operation state, and further improve the timeliness performance of the hydraulic model. In addition to hydraulic analysis, accurate hourly NWD prediction can also be used to improve modeling the transport of chlorine in the network.
In order to improve the accuracy of the hydraulic model, it is necessary to reduce the uncertainty in the hydraulic model. The uncertainty in the hydraulic model is caused by the uncertainty in model parameters or input variables, which can be generally divided into the following three types: model structural uncertainty, parameter uncertainty and measurement/data uncertainty (Hutton et al. 2014). The uncertainty in the model is objective and cannot be completely eliminated, but it can be quantified and reduced by various methods, such as increasing the monitoring and collection of various parameter information, and the combined use of data assimilation (DA) methods, so that the hydraulic model can fully represent the real system.
This study considers three ways to reduce the uncertainty in the model, the first is to minimize the measurement uncertainty. Measurement errors caused by inaccurate instruments is beyond the discussion of this study, whereas the abnormal data caused by instrument failure, data transmission failure or external anomaly is the focus of this study. There may be abnormal values (such as zero value or abnormal maximum value) in the collected measurement data of the NWD, which will impede the establishment of accurate NWD prediction models.
The second way is to reduce the uncertainty in water demand. The NWD has been demonstrated to be the most important input parameter that will cause uncertainty in the hydraulic model results (Kang et al. 2009). Therefore, the accurate NWD prediction is important for estimating output pressure and water quality. Due to the cost limit, only some nodes in the WDS (referred to as A-class nodes) have been covered with flow meters, whereas water demand of rest nodes (referred to as B-class nodes) is still unknown. In order to estimate the NWD of B-class nodes as accurately as possible, nodes with similar characteristics are divided into a node group after spatial correlation and the similarity of user types between nodes are considered. The nodal demands in the same group are assumed to be correlated with each other (Kang & Lansey 2009), that is, they have similar change trends of the NWD. In this way, the prediction of water demand at these nodes is more directional, thereby reducing the uncertainty in input parameters.
The third way is to introduce the DA method to continuously update the hydraulic models, so that the uncertainty of the model parameters (pipe roughness coefficient) is reduced. Through driving the hydraulic models to simulate the WDS state of next stage, the state estimation including pressures, flow forecasts can be obtained. When time moves forward to the next stage, and flow and pressure measurements are available, the deviations of the measurements and forecasts are obtained. Based on these deviations, calibrating the parameters or state inputs of the models to continuously improve the models outputs is exactly the principle concept of DA. With the goal of reducing uncertainty and improving forecasts, DA update model states by combining the model with observations.
The study develops a framework to improve the timeliness and accuracy of RTHMs. It consists of the following five parts: flow data processing, establishing the hourly NWD prediction models, node grouping, DA and uncertainty analysis. Specifically, two methods are adopted to identify and correct the abnormal data of collected flow measurement, so as to improve the data quality and reduce the measurement uncertainty. Then, three different machine learning algorithms are used to establish the hourly (the next 24 h) NWD prediction models, and the prediction accuracy of different algorithms and the effect of the model input variables on the model prediction results are compared and analyzed. Next, all nodes of the system are grouped according to relevant rules, and the uncertainty in water demand is reduced by assuming that the NWD's change trend in the same node group is consistent. Then, the predicted hourly NWD is used for real-time hydraulic simulation, and the differences between the predicted flow rates and the observation data of A-class nodes are applied by the DA method to optimize the RTHM. Finally, the uncertainty in the hydraulic model output caused by the uncertainty in the model input is quantitatively analyzed, so as to realize real-time state estimation of the WDS.
This paper is organized as follows. In the following section, an overview of water demand prediction and the model uncertainty background is presented. The ‘Methodology’ section mainly describes data processing methods, NWD prediction models and algorithms, node grouping, DA and uncertainty analysis. The case study follows, along with the information of NWD raw data and models developing. Results are then presented followed by discussion and finally conclusions.
BACKGROUND
Generally, the monitoring data may contain abnormal data due to various factors including unstable meters or sensors, failure of data transfer, infrastructure inspection, maintenance and water facility failure. Also, such abnormal data are often mixed in the water demand that fluctuates greatly within a day, which lowers data quality, worsens the knowledge of water demand information, and escalates the model uncertainty. From this perspective, the processing of flow data is indispensable.
Then, the processed flow data are used to build the NWD prediction models, but this is a very challenging task, because many factors (Mouatadid & Adamowski 2016) influence water demand, such as population, water price, water saving awareness, weather, social factors, etc. In addition, the water demand also has obvious temporal and spatial characteristics. For short-term water demand forecasting, weather variables are considered to be key input factors (Tian et al. 2016). However, the used weather indicators are limited, and usually only include common indicators such as precipitation, temperature and relative humidity (Tian et al. 2016). Moreover, weather variables are rarely used in combination with other variables, this may cause some important information not to be considered in the water demand forecasting. Therefore, this study considers multi-dimensional input variables (weather variables, time variables and social variables) when predicting the NWD.
The methods of building NWD prediction models include linear and nonlinear methods (Guo et al. 2018). Machine learning methods are an important representative of nonlinear methods. They can learn the underlying and deep relationship between the NWD and related variables from historical data, and this relationship is usually implicit before and after modeling. Common machine learning algorithms include Artificial Neural Network (ANN), Support Vector Machine (SVM), Random Forest (RF), etc., which have been applied in water demand prediction and have good performance (Herrera et al. 2010). In addition to choosing two common machine learning algorithms, ANN and RF, this study also adopts another effective machine learning algorithm, extreme gradient boosting (XGBoost), to establish the NWD prediction model.
XGBoost (Chen et al. 2017) is an open source project designed and optimized to enhance the Gradient Tree Boosting algorithm, which is designed to be efficient, flexible and portable to solve linear models and nonlinear tree algorithms (Darmatasi & Arymurthy 2016). In the field of machine learning algorithm applications and competitions, XGBoost has proven to be an extremely advantageous algorithm. Research results (Sheridan et al. 2016) showed that it has higher computational efficiency and better ability to deal with overfitting problems. As a result, this study chooses XGBoost to construct the NWD prediction model, and compares it with the RF and ANN methods.
During the past two decades, DA has been widely applied in meteorological, oceanographic and other research fields (Li et al. 2013; Mehrparvar & Asghari 2018). Its specific advantages are: integration of information from new observations, flexibility in the evolution of time-varying parameters and incorporation all the sources of uncertainty (Moradkhani et al. 2005). In WDS modeling, a rapid increase of DA activities has been witnessed (Li et al. 2013; Okeya et al. 2014). These achievements have been made possible by the following two facts: first, the increasing availability of real-time measurements and second, the continuous improvement of DA methods. One of the prominent DA methods is the ensemble Kalman filter (EnKF), suitable for complex, nonlinear and chaotic problems, which is adopted in this study. The two variables that are often considered in the calibration of the hydraulic models are the NWD and the pipe roughness coefficient. The NWD input in the hydraulic models of this study is continuously updated through NWD prediction models. The pipe roughness coefficient is a correction parameter when the EnKF method is used to update the RHTMs.
The uncertainty quantification technology currently applied to WDS hydraulic models can be divided into analytical methods or approximate methods (Kang et al. 2009; Pasha & Lansey 2010). The choice of these techniques depends on the availability of information, the complexity of the model, and the type and accuracy of the desired results (Tung & Yen 2005). Commonly used methods include Monte Carlo simulation (MCS) and the first-order second-order moment (FOSM) approximation method and Latin hypercube sampling (LHS). Hwang et al. (2017) and Kang et al. (2009) studied the accuracy of the FOSM, LHS and MCS methods, the results showed that the MCS method has the widest applicability and the most reliable estimation. The disadvantage of MCS is its calculation time. For large-scale systems, it may take a long time to complete the simulation, but its alternative FOSM method can reduce computational time and still provide good statistical estimates of model prediction, although its behavior in branched systems is poor. Taking into account the size of the actual network and the preference for short simulation time, this study uses the FOSM method to quantity the model uncertainty, while using the MCS method to verify.
METHODOLOGY
Figure 1 shows the research outline. Density-Based Spatial Clustering Applications with Noise (DBSCAN) and Abnormal Data Refinement with a Confidence Interval (ADR-CI) methods are used to process water demand raw data in this study. To comparatively analyze the effectiveness of two methods, the raw data without any preprocessing are also used in the subsequent modeling. After data preprocessing, XGBoost, RF and ANN methods are used to establish the hourly NWD forecasting model. They are used in different combinations to obtain combined models, namely:
D-XGB, D-RF and D-ANN
represent XGBoost, RF and ANN models established after the data are processed by the DBSCAN method.
A-XGB, A-RF and A-ANN
represent XGBoost, RF and ANN models established after the data are processed by the ADR-CI method
XGB, RF and ANN
represent XGBoost, RF and ANN models established directly by using the raw data, without using data preprocessing methods
In this manner, a total of nine models are obtained and tested, and the results are analyzed. Next, the best-performing prediction models are selected to predict water demand of all A-class nodes. On the other hand, according to the relevant rules, all nodes of the WDS model are grouped, and then water demand of B-class nodes in each node group is estimated. The last part is to simulate and analyze the uncertainty in the RTHM combined with the DA method, including the uncertainty simulation in the model input and output. The specific details are introduced in the following sections.
Data processing methods
The DBSCAN method identifies abnormal data by clustering. This method can discover consistent arbitrary-shaped clusters along with detection of noisy outliers and is very sensitive to outliers; therefore, it is very powerful for practical applications. DBSCAN method groups the data points, which are dense and nearby a single cluster. As a result, the data in the dataset that do not belong to any of the clusters are known as noisy outliers. The advantages of this method are the following: less requirement for user domain knowledge, less sensitive to the order in which the data are entered, suitable for large databases. More details about the DBSCAN method can be found in the relevant literature (Viswanath & Suresh Babu 2009).
The ADR-CI method is another method for detecting abnormal data. ADR-CI identifies the abnormal data based on statistical confidence intervals. First, ADR-CI assumes that the hourly water demand at the same time in the past T days is normally distributed. Based on the normal distribution assumption, the corresponding mean and variance can be found. Then, the detection range of abnormal data can be obtained based on the percentage confidence coefficient α and weighting factor β of the corresponding confidence interval. Notably, it is important to determine the appropriate detection range. If the range is set too narrow, some normal data may be incorrectly identified as abnormal data. Instead, the abnormal data may not be found effective. For effective short-term water demand forecasting, the percentage confidence coefficient α is set to 97%–99.75% (Seok et al. 2014), and the detection result of the abnormal value is most consistent with the actual situation. After the detection process, the abnormal data are replaced by the average water demand at the same time in the past T days.
NWD prediction models
Water demand forecasting models can be divided into single-dimensional and multi-dimensional forecasting models, the single-dimensional forecasting model refers to a time series forecasting model established using only historical water demand data as input variables, focusing only on the time correlation of water demand changes, while the multi-dimensional forecasting model takes into account climate, time, society and other factors as input variables, and establishes a gray box model. In this study, both types of NWD prediction models are established, and their prediction capabilities are analyzed and compared.
For the short-term water demand prediction, weather conditions are considered as important factors. As the global warming and extreme weather conditions are increasing, more weather features may need to be considered. It has been shown that the increase in annual average temperature, changes in the temporal and spatial distribution of rainfall and changes in the frequency and intensity of extreme weather events all limit water availability and water use change (Easterling et al. 2000). Under these circumstances, in order to establish more accurate forecasting models, this study considers more weather characteristics indicators as weather input variables, which are shown in Table 1. In addition, the life and work trace of modern urban residents are cut into two parts by working and non-working days, which leads to a large NWD difference in working and non-working days. Therefore, this study adds a social input variable, that is, whether the day is a Chinese legal working day. A study of short-term forecasts of urban household water demand indicated that the temporal dynamics of water use are more important for prediction than any exogenous features (Duerr et al. 2018). As a result, this study also considered the relevant temporal dynamics. The meaning and value range of all 26 input variables (features) used in this study are shown in Table 1.
Basic information of the input variables (features) of water demand forecasting models
Number . | Category . | Name . | Meaning . | Value range or unit . |
---|---|---|---|---|
1 | Time variables | hour | Current hour | 0–24 h |
2 | Time variables | week_of_year | Current week of the year | 0–53 |
3 | Time variables | weekday | Current day | 0–6 |
4 | Social variables | is_workday | Whether the day is a legal working day | 1 or 0 (yes or no) |
5 | Weather variables | temp | Current temperature | °C |
6 | Weather variables | temp_max | Maximum temperature of the day | °C |
7 | Weather variables | temp_min | Minimum temperature of the day | °C |
8 | Weather variables | Pressure | Atmospheric pressure | hpa |
9 | Weather variables | humidity | Relative humidity | 0–100% |
10 | Weather variables | wind_deg | Wind direction | 0–360° |
11 | Weather variables | wind_speed | Wind speed | in knots |
12 | Weather variables | rain_1 h | Accumulated precipitation within 1 h | mm |
13 | Weather variables | rain_3 h | Accumulated precipitation within 3 h | mm |
14 | Weather variables | rain_24 h | Accumulated precipitation within 1 h | mm |
15 | Weather variables | rain_today | Whether it rains that day | 1 or 0 (yes or no) |
16 | Weather variables | clouds_all | Percentage of clouds covering the sky | 0–100% |
17–26 | Weather variables | weather_descrip-tion | Including clear, clouds, drizzle, fog, haze, mist, rain, snow, squall, thunderstorm, a total of 10 weather conditions | Snow = 1 or 0 (Indicating this weather appears or does not appear) Fog = 1 or 0 (Indicating this weather appears or does not appear) |
Number . | Category . | Name . | Meaning . | Value range or unit . |
---|---|---|---|---|
1 | Time variables | hour | Current hour | 0–24 h |
2 | Time variables | week_of_year | Current week of the year | 0–53 |
3 | Time variables | weekday | Current day | 0–6 |
4 | Social variables | is_workday | Whether the day is a legal working day | 1 or 0 (yes or no) |
5 | Weather variables | temp | Current temperature | °C |
6 | Weather variables | temp_max | Maximum temperature of the day | °C |
7 | Weather variables | temp_min | Minimum temperature of the day | °C |
8 | Weather variables | Pressure | Atmospheric pressure | hpa |
9 | Weather variables | humidity | Relative humidity | 0–100% |
10 | Weather variables | wind_deg | Wind direction | 0–360° |
11 | Weather variables | wind_speed | Wind speed | in knots |
12 | Weather variables | rain_1 h | Accumulated precipitation within 1 h | mm |
13 | Weather variables | rain_3 h | Accumulated precipitation within 3 h | mm |
14 | Weather variables | rain_24 h | Accumulated precipitation within 1 h | mm |
15 | Weather variables | rain_today | Whether it rains that day | 1 or 0 (yes or no) |
16 | Weather variables | clouds_all | Percentage of clouds covering the sky | 0–100% |
17–26 | Weather variables | weather_descrip-tion | Including clear, clouds, drizzle, fog, haze, mist, rain, snow, squall, thunderstorm, a total of 10 weather conditions | Snow = 1 or 0 (Indicating this weather appears or does not appear) Fog = 1 or 0 (Indicating this weather appears or does not appear) |
XGBoost, RF and ANN



RF is an ensemble of tree-based models built on decision trees (DTs). The DT algorithm mimics the structure of a tree. The ensemble averages a large number of DTs, lowering the variance and bias in prediction. RF can also be considered as a very convenient and easy-to-use algorithm because its default hyper-parameters usually provide good predictions. RFs have proven to be outstanding predictive models in regression (and classification) tasks and are often used in studies on water demand prediction (Ambrosio et al. 2019).
ANN models have been used to predict hourly or daily water demand (Romano & Kapelan 2014). The feed–forward neural network is one of the most widely used ANNs; its learning process consists of forward propagation and back propagation. In the classical structure of feed–forward neural network, the outputs of each layer are processed layer-by-layer and transferred to the output layer, and an output layer produces the calculated data. This is known as the forward-propagation process. Then, the generated output for each input set is compared with the actual output, and an error is computed. The error is then propagated backward throughout the ANN, and the network parameters are adjusted to control the errors at a very low level. This is known as the back-propagation process. Each ANN is trained by an in-hand dataset and compares the achieved results with the actual output data until acceptable outputs are obtained (Ghiassi et al. 2008).
Node grouping
In order to overcome the limitation of low measurement redundancy, and try to reduce the uncertainty in models, nodes with similar user character are aggregated to a single node group to estimate the water demand of B-class nodes as accurately as possible. Kang & Lansey (2009) proposed to divide general nodes into three categories of service area: industrial, commercial and residential types. Among them, residential types are subdivided into three types according to different residential types: apartment, large lots and 1/2 acres. It is also assumed that the water demand diurnal curves of the same type are similar, but the differences and characteristics between different types are obvious. For example, at the beginning (6–9 AM) and end (5–9 PM) of the working day, large peaks appear in water demand curves of residential types. In contrast, the water demand curves of the commercial types have a sharper rise and fall at the beginning and end of the working day, and remains flat during the working day. Industrial water use is relatively constant throughout the day.
However, in Chinese cities, the boundaries between different types of water service areas are not always clear, and there are many mixed areas, such as areas where industrial users and residential users are mixed, and even areas where commercial, industrial and residential users are mixed. In addition, most of the residential types in Chinese cities are commercial residential buildings. Considering the above practical factors, the water use nodes are divided into the following seven categories: (a) commercial, (b) industrial, (c) residential, (d) commercial and industrial mixed (com-ind-mix), (e) commercial and residential mixed (com-res-mix), (f) industrial and residential mixed (ind-res-mix) and (g) industrial, commercial and residential mixed (All-mix).
All user nodes are grouped according to the region, the user type and the location of the measurement meters, while it is ensured that each group has at least one A-class node (if there are more than one A-class node in a group, one is selected as the representative). By assuming that variation trend and amplitude of water demand of the next period for B-class nodes in the same group are similar to that of the A-class node, the NWD in the next period can be estimated based on the current NWD.
Data assimilation






















Uncertainty simulation and quantification
Because there is actually uncertainty in the water demand of all nodes in each group, a random normal variable N (0, σ) is introduced in the predicted or estimated value of each node demand to simulate the actual NWD. In order to examine the impact of the input uncertainty amplitude on the model output, three levels of uncertainty are evaluated, defined by the coefficient of variation (CV) of normally distributed variables, and the CV corresponding to the three levels are 0.1, 0.15 and 0.2 respectively.


MCS is an enumeration-based technique for simulating and calculating the uncertainty in the system output given model input uncertainty. In MCS, a large number of parameter sets (known as realizations) are generated according to the probability distribution of the input parameters, and the target system response is repeatedly measured. In this study, the node demand is generated from the basic predicted or estimated value and the added appropriate normal distributed random variable, the probability distribution of node demand is defined by their mean and variance. EPANET2 is used to calculate the nodal pressure under the assumption that the demand is fully satisfied. This process is repeated many times until the convergence criterion is met. The application of MCS is not limited to any specific type of probability distribution. But the disadvantage is that it requires sufficient computational power.
CASE STUDY
The real network system in the new urban area of a Chinese city is used for analysis and discussion. The network includes 331 nodes, 337 pipes and one source. The pipe length of the network system ranges from 137 to 4,783 m, and the pipe diameter ranges from 50 to 500 mm. The pipes are mainly ductile iron pipes and steel pipes, and laying date is between year 2001 and 2002. Currently, 14 flow meters are installed. According to the aforementioned grouping rule, the entire system is divided into 13 node groups artificially, the metered nodes (A-class nodes) in each group are shown in Figure 2, and the two nodes close to the water source have no water consumption, so they did not participate in the grouping, the result of grouping area and type is shown in Figure 2. The 14 flow meters have collected about 240,000 NWD data for almost two years (April 1, 2015 to March 7, 2017). The data collection and transmission frequency are noted once an hour. The weather data of the same period are collected and processed as quantifiable data. Table 2 shows the information about the raw data of the water demand for 14 nodes. It can be seen that there are obviously impractical extreme values.
The information on the raw data of water demand for 14 nodes
Node . | Group type . | Count . | Mean (m3/h) . | Standard deviation (m3/h) . | Coefficient of variation . | Minimum (m3/h) . | 1st Quartile (m3/h) . | Median (m3/h) . | 3rd Quartile (m3/h) . | Maximum (m3/h) . |
---|---|---|---|---|---|---|---|---|---|---|
217 | commercial | 17,016 | 14.32 | 24.21 | 1.69 | 0 | 3.32 | 9.66 | 14.25 | 825.67 |
198 | commercial | 17,016 | 9.8 | 6.5 | 0.66 | 0 | 6.1 | 7.77 | 11.53 | 134.01 |
86 | Industrial | 17,016 | 35.24 | 21.85 | 0.62 | 0 | 21.94 | 39.38 | 57.56 | 422.77 |
327 | Industrial | 17,016 | 47.58 | 24.9 | 0.52 | 0 | 18.22 | 56.78 | 79.47 | 936.06 |
62 | Residential | 17,016 | 8.87 | 7.51 | 0.85 | 0 | 3.03 | 7.59 | 13.72 | 131.19 |
244 | Residential | 17,016 | 21.96 | 40.01 | 1.82 | 0 | 7.06 | 13.13 | 18.79 | 3,632.43 |
260 | Residential | 17,016 | 27.31 | 65.08 | 2.38 | 0 | 8.31 | 12.59 | 22.35 | 6,717.54 |
272 | Residential | 17,016 | 8.34 | 13.8 | 1.57 | 0 | 3.56 | 6.22 | 18.92 | 123.34 |
44 | Com-ind-mix | 17,016 | 19.44 | 33.5 | 1.72 | 0 | 10.89 | 17.73 | 55.18 | 386.02 |
171 | Com-res-mix | 17,016 | 20.33 | 28.22 | 1.39 | 0 | 8.48 | 22.81 | 46.63 | 190.34 |
111 | Ind-res-mix | 17,016 | 69.33 | 154.18 | 2.22 | 0 | 32.75 | 50.63 | 77.79 | 6,125.44 |
317 | Ind-res-mix | 17,016 | 35.24 | 58.85 | 1.67 | 0 | 11.94 | 19.38 | 27.56 | 1,422.77 |
142 | All-mix | 17,016 | 14.55 | 13.94 | 0.96 | 0 | 7.74 | 11.73 | 17.28 | 400.52 |
21 | All-mix | 17,016 | 5.45 | 5.72 | 1.05 | 0 | 1.88 | 3.3 | 7.61 | 83.8 |
Node . | Group type . | Count . | Mean (m3/h) . | Standard deviation (m3/h) . | Coefficient of variation . | Minimum (m3/h) . | 1st Quartile (m3/h) . | Median (m3/h) . | 3rd Quartile (m3/h) . | Maximum (m3/h) . |
---|---|---|---|---|---|---|---|---|---|---|
217 | commercial | 17,016 | 14.32 | 24.21 | 1.69 | 0 | 3.32 | 9.66 | 14.25 | 825.67 |
198 | commercial | 17,016 | 9.8 | 6.5 | 0.66 | 0 | 6.1 | 7.77 | 11.53 | 134.01 |
86 | Industrial | 17,016 | 35.24 | 21.85 | 0.62 | 0 | 21.94 | 39.38 | 57.56 | 422.77 |
327 | Industrial | 17,016 | 47.58 | 24.9 | 0.52 | 0 | 18.22 | 56.78 | 79.47 | 936.06 |
62 | Residential | 17,016 | 8.87 | 7.51 | 0.85 | 0 | 3.03 | 7.59 | 13.72 | 131.19 |
244 | Residential | 17,016 | 21.96 | 40.01 | 1.82 | 0 | 7.06 | 13.13 | 18.79 | 3,632.43 |
260 | Residential | 17,016 | 27.31 | 65.08 | 2.38 | 0 | 8.31 | 12.59 | 22.35 | 6,717.54 |
272 | Residential | 17,016 | 8.34 | 13.8 | 1.57 | 0 | 3.56 | 6.22 | 18.92 | 123.34 |
44 | Com-ind-mix | 17,016 | 19.44 | 33.5 | 1.72 | 0 | 10.89 | 17.73 | 55.18 | 386.02 |
171 | Com-res-mix | 17,016 | 20.33 | 28.22 | 1.39 | 0 | 8.48 | 22.81 | 46.63 | 190.34 |
111 | Ind-res-mix | 17,016 | 69.33 | 154.18 | 2.22 | 0 | 32.75 | 50.63 | 77.79 | 6,125.44 |
317 | Ind-res-mix | 17,016 | 35.24 | 58.85 | 1.67 | 0 | 11.94 | 19.38 | 27.56 | 1,422.77 |
142 | All-mix | 17,016 | 14.55 | 13.94 | 0.96 | 0 | 7.74 | 11.73 | 17.28 | 400.52 |
21 | All-mix | 17,016 | 5.45 | 5.72 | 1.05 | 0 | 1.88 | 3.3 | 7.61 | 83.8 |
Application network schematic showing node groups and measurement locations.
First, the two data processing methods are used to process the node WD data, and the abnormal data are detected and corrected. Then, the historical WD data, weather data, time dynamic data and workday judgment data of each node are merged according to the date and time axis. The data are divided into training and test subsets (80 and 20% of the data length, respectively) to examine the prediction performance of different machine learning prediction models.
Anaconda, an integrated Python environment, is used to complete the training of the corresponding models. The XGBoost models, RF models and ANN models use the python-related machine learning library (Scikit-Learn) to develop. XGBoost and RF algorithms are used to build multi-dimensional prediction models, the input variable X of the training data has 26 features, as shown in Table 1. The output variable Y is the WD of the node. At the same time, the grid search module is used to systematically traverse the various parameter combinations of the models, and the optimal parameters are determined by cross-validation. These efforts significantly reduce the time and effort needed for modeling.
The ANN algorithm is used to establish single-dimensional prediction models. The input variable X of the training data for ANN models is the 24-h NWD in the past few days, and the output variable Y is the 24-h NWD of the target day (represented by a 24-dimensional vector). Its feature value of input layer is 5, and the number of neurons in the hidden layer is 8. The test data subset is used to analyze and compare the performance of different models. The specific details of the optimization parameter settings, input and output data, application scenarios and implementation environment of three machine learning algorithms and two data processing methods are shown in Table 3.
The specific details of two data processing methods and three machine learning algorithms
Model . | Optimized model parameters . | Model inputs . | Model outputs . | Application scenario . | Implementation environment . |
---|---|---|---|---|---|
DBSCAN | EPS=0.066, minpts=12 | All NWD data | Abnormal data point | Classification | Python |
ADR-CI | α%=97%, β=2.17 | All NWD data | Abnormal data point | Classification | Python |
ANN | Number of layers=3, Learning rate=0.02, Activation=ReLU, Batch size=80 | The 24-h NWD in the past few days | The 24-h NWD of the target day | Prediction | Python |
RF | Estimators=1,000; min samples split=2, min samples leaf=1 | The training data with 26 features | The WD of the node | Prediction | Python |
XGBoost | Estimators=2,000, learning rate=0.05, max depth=8 | The training data with 26 features | The WD of the node | Prediction | Python |
Model . | Optimized model parameters . | Model inputs . | Model outputs . | Application scenario . | Implementation environment . |
---|---|---|---|---|---|
DBSCAN | EPS=0.066, minpts=12 | All NWD data | Abnormal data point | Classification | Python |
ADR-CI | α%=97%, β=2.17 | All NWD data | Abnormal data point | Classification | Python |
ANN | Number of layers=3, Learning rate=0.02, Activation=ReLU, Batch size=80 | The 24-h NWD in the past few days | The 24-h NWD of the target day | Prediction | Python |
RF | Estimators=1,000; min samples split=2, min samples leaf=1 | The training data with 26 features | The WD of the node | Prediction | Python |
XGBoost | Estimators=2,000, learning rate=0.05, max depth=8 | The training data with 26 features | The WD of the node | Prediction | Python |
According to the pipe characteristics, 337 pipes are divided into five groups, and the corresponding initial pipe roughness coefficient C values are, respectively, as follows: =112.87,
=114.92,
=118.43,
=123.67,
=126.18. It is worth noting that these C values have been pre-calibrated when the hydraulic model of the case network was first established. Subsequently, they are continuously updated using the EnKF method to make the RTHM closer to the actual system. In fact, not only the pipe roughness coefficient is corrected, but also the subtle changes of other status or parameters of the RTHM will be expressed indirectly through this process. In other words, the update of the C values does not necessarily make the roughness coefficient of the pipe closer to the true value, but makes the overall simulation process and predicted output of the RTHM more accurate. The simulation time step of the RTHM in this study is 1 h. If the moving step of the DA method is also 1 h, it will not only lead to enormous computation burden, but also make the updating results of the pipe roughness coefficient very unstable, which will greatly reduce the robustness of the hydraulic model. Considering this problem, the moving step length of the EnKF method is set to 24 h, that is, the errors of average forecasts and measurements in the past 24 h are used to update the pipe roughness coefficient of the RHTM.
RESULTS AND DISCUSSION
Performance of NWD prediction models
The established prediction models are used to predict the hourly NWD during the test subset period (from October 1, 2016 to February 28, 2017, one value per hour, 24 values per day). The coefficient of determination R2 of the predicted and the actual value (from the test subset) are obtained, and the average R2 values of the test data period are finally calculated, and the results are shown in Table 4, Figure 3 is a heat map drawn using the data in Table 4.
The average value of R2 of forecasting models on testing data
Node . | XGB . | RF . | ANN . | D-XGB . | D-RF . | D-ANN . | A-XGB . | A-RF . | A-ANN . | Average . |
---|---|---|---|---|---|---|---|---|---|---|
217 | 0.674 | 0.658 | 0.692 | 0.741 | 0.704 | 0.736 | 0.705 | 0.683 | 0.716 | 0.701 |
198 | 0.882 | 0.869 | 0.910 | 0.934 | 0.905 | 0.962 | 0.890 | 0.889 | 0.942 | 0.909 |
86 | 0.846 | 0.821 | 0.803 | 0.898 | 0.858 | 0.879 | 0.854 | 0.838 | 0.819 | 0.846 |
327 | 0.864 | 0.809 | 0.812 | 0.947 | 0.873 | 0.858 | 0.873 | 0.813 | 0.827 | 0.853 |
62 | 0.409 | 0.421 | 0.367 | 0.410 | 0.395 | 0.451 | 0.423 | 0.446 | 0.384 | 0.412 |
244 | 0.818 | 0.774 | 0.764 | 0.920 | 0.865 | 0.855 | 0.848 | 0.792 | 0.785 | 0.825 |
260 | 0.794 | 0.836 | 0.778 | 0.880 | 0.911 | 0.852 | 0.814 | 0.845 | 0.795 | 0.834 |
272 | 0.710 | 0.688 | 0.732 | 0.807 | 0.711 | 0.799 | 0.720 | 0.702 | 0.763 | 0.737 |
44 | 0.734 | 0.695 | 0.789 | 0.819 | 0.760 | 0.871 | 0.768 | 0.711 | 0.802 | 0.772 |
171 | 0.772 | 0.717 | 0.754 | 0.864 | 0.779 | 0.834 | 0.786 | 0.724 | 0.761 | 0.777 |
111 | 0.801 | 0.751 | 0.750 | 0.883 | 0.829 | 0.817 | 0.811 | 0.760 | 0.777 | 0.798 |
317 | 0.787 | 0.769 | 0.753 | 0.864 | 0.803 | 0.859 | 0.808 | 0.785 | 0.773 | 0.800 |
142 | 0.791 | 0.742 | 0.786 | 0.900 | 0.824 | 0.882 | 0.804 | 0.748 | 0.819 | 0.811 |
21 | 0.808 | 0.750 | 0.783 | 0.902 | 0.838 | 0.863 | 0.813 | 0.762 | 0.815 | 0.815 |
Average | 0.764 | 0.736 | 0.748 | 0.841 | 0.790 | 0.823 | 0.780 | 0.750 | 0.770 | 0.778 |
Node . | XGB . | RF . | ANN . | D-XGB . | D-RF . | D-ANN . | A-XGB . | A-RF . | A-ANN . | Average . |
---|---|---|---|---|---|---|---|---|---|---|
217 | 0.674 | 0.658 | 0.692 | 0.741 | 0.704 | 0.736 | 0.705 | 0.683 | 0.716 | 0.701 |
198 | 0.882 | 0.869 | 0.910 | 0.934 | 0.905 | 0.962 | 0.890 | 0.889 | 0.942 | 0.909 |
86 | 0.846 | 0.821 | 0.803 | 0.898 | 0.858 | 0.879 | 0.854 | 0.838 | 0.819 | 0.846 |
327 | 0.864 | 0.809 | 0.812 | 0.947 | 0.873 | 0.858 | 0.873 | 0.813 | 0.827 | 0.853 |
62 | 0.409 | 0.421 | 0.367 | 0.410 | 0.395 | 0.451 | 0.423 | 0.446 | 0.384 | 0.412 |
244 | 0.818 | 0.774 | 0.764 | 0.920 | 0.865 | 0.855 | 0.848 | 0.792 | 0.785 | 0.825 |
260 | 0.794 | 0.836 | 0.778 | 0.880 | 0.911 | 0.852 | 0.814 | 0.845 | 0.795 | 0.834 |
272 | 0.710 | 0.688 | 0.732 | 0.807 | 0.711 | 0.799 | 0.720 | 0.702 | 0.763 | 0.737 |
44 | 0.734 | 0.695 | 0.789 | 0.819 | 0.760 | 0.871 | 0.768 | 0.711 | 0.802 | 0.772 |
171 | 0.772 | 0.717 | 0.754 | 0.864 | 0.779 | 0.834 | 0.786 | 0.724 | 0.761 | 0.777 |
111 | 0.801 | 0.751 | 0.750 | 0.883 | 0.829 | 0.817 | 0.811 | 0.760 | 0.777 | 0.798 |
317 | 0.787 | 0.769 | 0.753 | 0.864 | 0.803 | 0.859 | 0.808 | 0.785 | 0.773 | 0.800 |
142 | 0.791 | 0.742 | 0.786 | 0.900 | 0.824 | 0.882 | 0.804 | 0.748 | 0.819 | 0.811 |
21 | 0.808 | 0.750 | 0.783 | 0.902 | 0.838 | 0.863 | 0.813 | 0.762 | 0.815 | 0.815 |
Average | 0.764 | 0.736 | 0.748 | 0.841 | 0.790 | 0.823 | 0.780 | 0.750 | 0.770 | 0.778 |
The heat map of the average of R2 of the forecasting models on the test data. Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/hydro.2022.033.
The heat map of the average of R2 of the forecasting models on the test data. Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/hydro.2022.033.
The results in Table 4 and Figure 3 lead to the following observations: (1) according to the overall analysis of R2, the XGBoost models show the best performance, and their forecasts are the most accurate. The ANN models are slightly inferior, and the RF models are the worst among the three machine learning models. (2) With data processing for abnormal data detection and correction, the forecasting accuracy of the three machine learning models improves significantly. Here, we take the XGBoost models as an example to illustrate: for the XGB models that directly uses the original data as the input value, the average R2 value for all 14 nodes is 0.764, and after the raw data are processed by the DBSCA and ACI method, this value increases to 0.841 and 0.780, respectively. The ANN and RF models also have similar behaviors, indicating that the data processing methods are effective, and compared with the ACI method, the DBSCA method performs better in removing outliers and contributes more to improving the performance of the models. (3) The D-XGB models established on 14 nodes are analyzed separately: the highest R2 value is 0.947 and the average is 0.841; only two nodes (nodes 217 and 62) have R2 values below 0.8. This demonstrates that the overall forecasting performance of the XGBoost models is acceptable. When combined with the results of the ANN model, it is found that the overall performance of the multi-dimensional forecasting model is better than that of the single-dimensional forecasting model, so it is feasible and effective to use multi-dimensional input variables as the input to establish forecasting models with the NWD as the output. (4) Among the 14 nodes in this study, node 198 is the best performer with the average R2 values 0.909 for all nine models. By analyzing the raw data in Table 2, it can be seen that the CV of node 198's water demand data is the smallest among all 14 nodes. This indicates that the performance of NWD prediction models is related to the mathematical characteristics of the NWD data. Contrastingly, node 62 is the worst performer, the average R2 values for all nine models is only 0.412. The reason may be related to the high percentage of zero value in node 62′ raw water demand data. In all 17,016 data of node 62, about 9% of the data (1,493) have a value of 0. They mainly occurred at 0:00–5:00 AM during the night, whereas the nonzero data at the same period fluctuated significantly (the fluctuation range is from less than 0.01 to more than 15 m3/h). This extremely irregular variation of the NWD is probably responsible for the poor prediction of the models. Even with the use of preprocessing method, the improvement in data quality is extremely limited. In this case, it may be necessary to go to the site to find the cause or use other methods to forecast the NWD for such particular node. (5) In the heat map, the darker the block color, the larger the R2 value, and the better the performance of the corresponding model. Figure 3 shows predictive models built on some nodes perform well, such as nodes 198, 86, 327, etc., whereas models built on other nodes are relatively poor, such as nodes 272, 44, 171, etc. It can be seen from Figure 2 that the nodes 86 and 327 belong to the industrial node group, the node 198 belongs to the commercial node group and the nodes 272 and 44.171 belong to the residential node group or the mixed group. Compared with industrial or commercial water users, the uncertainty of the water demand of residential or mixed users is greater, the diurnal water demand curves are more irregular, and the CV of the original water demand data is relatively large, which brings more difficulties to the establishment of accurate prediction models.
In addition, some more interesting phenomena are found in Table 4 and Figure 3: (1) although the overall performance of XGBoost models is better than that of RF and ANN models, there are also individual nodes where the performance of RF or ANN models is superior. For example, the ANN models for nodes 44 and 198 are the most outstanding among the three machine learning models. The reason is probably related to the magnitude of the NWD change. Table 2 shows that the CV of nodes 44 and 198 is small, indicating that the dispersion of the NWD data of these two nodes is relatively small. The water demand curve is relatively stable, so it may be more appropriate to use the ANN methods to make predictions. (2) The similar exception phenomenon is observed for the two processing methods. As a whole, the DBSCAN method is more conducive to improving the performance of models than the ADR-CI method, but there are exceptions. For example, the ADR-CI method performs better and is more effective for the models established in node 327. This is probably related to the distribution characteristics of the NWD raw data. As shown in Table 2, the average water demand of node 327 is 27.578 m3/h; the CV is as high as 597.94%; its first quartile, median, and third quartile are 18.219, 26.781 and 33.469 m3/h, respectively; the average and median are very close. These results indicate that the water demand data of node 327 is more discrete and symmetrical. These inferences are consistent with the assumption of the ADR-CI method that the data conform to normal distribution, so the ADR-CI method performs better at node 327.
Feature importance
For the XGBoost and RF methods, a total of 26 features are present in the input independent variables of training data, but they play different roles in the training model. Some features have a more important impact on the determination of model parameters and model predictions, whereas other features are less significant. Through the relevant built-in function of the Scikit-Learn library, the feature importance of XGBoost and RF models is analyzed. The orders of feature importance of D-XGB and D-RF models established in nodes 244 and 21 are shown in Figure 4.
In the D-XGB and D-RF models established for nodes 244 and 21, the two most important features are hour and week of the year, both of which are time variables. It proves that time features are extremely important in predictive models. Figure 4(a) and 4(c) shows that the orders of feature importance of D-XGB prediction models established in two nodes are inconsistent. Similarly, Figure 4(b) and 4(d) shows that the feature important orders for two nodes of the D-RF prediction models are also different.
Feature importance of different forecasting models established in two nodes: (a) the D-XGB model in node 244; (b) the D-RF model in node 244; (c) the D-XGB model in node 21; (d) the D-RF model in node 21.
Feature importance of different forecasting models established in two nodes: (a) the D-XGB model in node 244; (b) the D-RF model in node 244; (c) the D-XGB model in node 21; (d) the D-RF model in node 21.
Previous studies of short-term water demand (mainly daily water demand or peak daily water demand) forecasting models have found that weather is the most important influencing factor (Tian et al. 2016), and this research proves that in the hourly NWD forecasting models, time features are the most important influencing factors. This inconsistency is caused by the different time accuracy of the prediction models. There is an essential relationship between the hourly NWD and time period, which can also be confirmed from water demand diurnal curves.
Notably, the most important feature is hour in D-RF prediction models (from Figure 4(b) and 4(d)), its feature importance far exceeds other features, while other features importance is very limited. Oppositely, the overall distribution of feature importance of D-XGB prediction models (from Figure 4(a) and 4(c)) is more uniform. In other words, the feature importance of D-RF prediction models has extremes and their overall performance is extremely uneven. In comparison, the gaps between feature importance of D-XGB prediction models are small, this indicates that during the establishment of D-XGB prediction models, the information introduced by different types of data is more widely used. It is also more robust to the risk of model failure due to the abnormal variation of a single most important feature. The corresponding cost is a more complex and bulky raw data collection process.
DA for RTHM correction to improve parameter and state estimation
The results in Figure 5 compare the differences between observed flow rates and RTHM forecasts (with or without EnKF) for node 317 and the total average of 14 A-class nodes during 72 h. It can be observed that whether for an individual node (node 317 is used as an example in Figure 5, other nodes are not shown), or for the total average of 14 A-class nodes, the simulated forecasts of RTHM combined with EnKF is closer to the actual observation. This shows that the EnKF method reduces the prediction error, especially during the peak water consumption period, the EnKF method improves the model prediction results more obviously. To quantify the modeling performance, the Root Mean Square Error (RMSE) was implemented as the criteria to measure the significance of improvement resulting from the EnKF techniques, and the observed flow rates is approximately regarded as the true value.
Comparison of observed flow rates and RTHM forecasts (with or without EnKF) for node 317 and the total average of 14 A-class nodes during the period from 0:00 on March 5, 2017 to 24:00 on March 7, 2017 (72 h).
Comparison of observed flow rates and RTHM forecasts (with or without EnKF) for node 317 and the total average of 14 A-class nodes during the period from 0:00 on March 5, 2017 to 24:00 on March 7, 2017 (72 h).
Table 5 summarizes RMSE of observed flow rates and RTHM forecasts (with or without EnKF) for all 14 A-class nodes during 3 days. Note that RMSE is the average value of 24-h values calculated on the day. The improvement in model performance is not obviously ignorable. When RTHM is combined with EnKF, the reduction of RMSE of each node basically exceeds 20%, which shows that the EnKF technique is able to attain good output accuracy across the case network for state estimation. Figure 6 shows changes of computed pipe roughness coefficient C of updated RTHM during the period from February 6, 2017 to March 8, 2017. It is clear from the results that C values constantly change dynamically, which also makes the RTHM continually updated and tends to be higher accuracy. Hence, it can be deduced that EnKF can be used for RTHM state and parameter estimation.
RMSE of observed flow rates and RTHM forecasts (with or without EnKF) for all 14A-class nodes during the period from 0:00 on March 5, 2017 to 24:00 on March 7, 2017 (3 days)
Node . | RMSE for RTHM . | RMSE for RTHM with EnKF . | ||||||
---|---|---|---|---|---|---|---|---|
Day 1 . | Day 2 . | Day 3 . | Average . | Day 1 . | Day 2 . | Day 3 . | Average . | |
217 | 2.493 | 2.413 | 2.473 | 2.460 | 1.966 | 1.912 | 1.953 | 1.944 |
198 | 1.138 | 1.238 | 1.087 | 1.154 | 1.057 | 1.136 | 1.017 | 1.070 |
86 | 4.156 | 3.957 | 4.116 | 4.076 | 3.202 | 3.072 | 3.176 | 3.150 |
327 | 6.928 | 6.929 | 7.142 | 6.999 | 5.559 | 5.560 | 5.705 | 5.608 |
62 | 1.771 | 1.772 | 1.881 | 1.808 | 1.243 | 1.243 | 1.308 | 1.265 |
244 | 2.718 | 2.757 | 2.955 | 2.810 | 1.877 | 1.900 | 2.016 | 1.931 |
260 | 3.385 | 3.321 | 3.215 | 3.307 | 2.893 | 2.847 | 2.769 | 2.836 |
272 | 1.809 | 1.805 | 1.821 | 1.812 | 1.309 | 1.306 | 1.317 | 1.311 |
44 | 2.885 | 2.775 | 2.907 | 2.856 | 2.616 | 2.531 | 2.633 | 2.594 |
171 | 2.132 | 2.160 | 2.237 | 2.176 | 2.115 | 2.139 | 2.204 | 2.152 |
111 | 7.894 | 7.578 | 8.489 | 7.987 | 6.937 | 6.701 | 7.381 | 7.006 |
317 | 4.978 | 5.141 | 5.315 | 5.145 | 4.407 | 4.529 | 4.661 | 4.532 |
142 | 1.732 | 1.802 | 1.637 | 1.723 | 1.160 | 1.200 | 1.106 | 1.155 |
21 | 0.925 | 0.950 | 0.891 | 0.922 | 0.798 | 0.817 | 0.773 | 0.796 |
Average | 3.210 | 3.186 | 3.297 | 3.231 | 2.653 | 2.635 | 2.716 | 2.668 |
Node . | RMSE for RTHM . | RMSE for RTHM with EnKF . | ||||||
---|---|---|---|---|---|---|---|---|
Day 1 . | Day 2 . | Day 3 . | Average . | Day 1 . | Day 2 . | Day 3 . | Average . | |
217 | 2.493 | 2.413 | 2.473 | 2.460 | 1.966 | 1.912 | 1.953 | 1.944 |
198 | 1.138 | 1.238 | 1.087 | 1.154 | 1.057 | 1.136 | 1.017 | 1.070 |
86 | 4.156 | 3.957 | 4.116 | 4.076 | 3.202 | 3.072 | 3.176 | 3.150 |
327 | 6.928 | 6.929 | 7.142 | 6.999 | 5.559 | 5.560 | 5.705 | 5.608 |
62 | 1.771 | 1.772 | 1.881 | 1.808 | 1.243 | 1.243 | 1.308 | 1.265 |
244 | 2.718 | 2.757 | 2.955 | 2.810 | 1.877 | 1.900 | 2.016 | 1.931 |
260 | 3.385 | 3.321 | 3.215 | 3.307 | 2.893 | 2.847 | 2.769 | 2.836 |
272 | 1.809 | 1.805 | 1.821 | 1.812 | 1.309 | 1.306 | 1.317 | 1.311 |
44 | 2.885 | 2.775 | 2.907 | 2.856 | 2.616 | 2.531 | 2.633 | 2.594 |
171 | 2.132 | 2.160 | 2.237 | 2.176 | 2.115 | 2.139 | 2.204 | 2.152 |
111 | 7.894 | 7.578 | 8.489 | 7.987 | 6.937 | 6.701 | 7.381 | 7.006 |
317 | 4.978 | 5.141 | 5.315 | 5.145 | 4.407 | 4.529 | 4.661 | 4.532 |
142 | 1.732 | 1.802 | 1.637 | 1.723 | 1.160 | 1.200 | 1.106 | 1.155 |
21 | 0.925 | 0.950 | 0.891 | 0.922 | 0.798 | 0.817 | 0.773 | 0.796 |
Average | 3.210 | 3.186 | 3.297 | 3.231 | 2.653 | 2.635 | 2.716 | 2.668 |
Evolutions of pipe roughness coefficient C of the updated RTHM during the period from February 6, 2017 to March 8, 2017.
Evolutions of pipe roughness coefficient C of the updated RTHM during the period from February 6, 2017 to March 8, 2017.
Uncertainty analysis
The established D-XGB prediction models are chosen to predict the NWD of 14 A-class nodes on March 8, 2017. Then, according to the aforementioned rules, the water demand of all B-class nodes in the 13 node groups are estimated. In order to simulate the uncertainty of the real NWD, normal distribution random variables for three uncertainty levels are introduced into the estimated NWD.
After using RTHM combined with EnKF for hydraulic simulation, the hydraulic states of the whole system are obtained. The nodal pressure is the output parameter of interest because of its importance and wide applicability in hydraulic analysis. In each node group, one node is taken as a representative node for discussion, and the position of the 13 representative node is shown in Figure 2.
Table 6 lists the standard deviations and mean values of output nodal pressures simulated by FOSM and MCS, respectively, on 13 representative nodes at 8 AM on March 8, 2017, which shows that simulation results of the two methods are slightly different under three uncertainty levels, the mean values and standard deviations calculated by MCS are slightly smaller compared to that of FOSM. The reason may be related to the different inherent principles of the two methods. As mentioned earlier, FOSM is based on a first-order truncated Taylor series approximation, its estimation is affected by the degree of nonlinearity of the system. Under normal conditions, it is generally believed that the simulation result of MCS is closer to the true value and have more accurate prediction intervals (Kang et al. 2009). However, for the selected representative nodes, FOSM and MCS give less than 3% errors for pressure mean value, that are, good fits. The simulation time required by the FOSM method is only about 210 s, which is much shorter than the MCS method (about 6,700 s), so a trade-off between accuracy and computational effort needs to be considered.
Standard deviations and mean values of output nodal pressures simulated by FOSM and MCS, respectively, on 13 representative nodes at 8 AM on March 8, 2017
Node . | Uncertainty level 1 . | Uncertainty level 2 . | Uncertainty level 3 . | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Mean . | Standard deviation . | Mean . | Standard deviation . | Mean . | Standard deviation . | |||||||
FOSM . | MCS . | FOSM . | MCS . | FOSM . | MCS . | FOSM . | MCS . | FOSM . | MCS . | FOSM . | MCS . | |
233 | 34.78 | 34.63 | 0.98 | 0.86 | 34.44 | 34.08 | 1.26 | 1.13 | 33.75 | 33.39 | 1.72 | 1.62 |
195 | 38.05 | 37.69 | 1.01 | 0.98 | 37.66 | 37.20 | 1.28 | 1.11 | 36.87 | 36.42 | 1.72 | 1.76 |
72 | 36.94 | 36.54 | 0.94 | 0.90 | 36.31 | 35.85 | 0.99 | 1.22 | 35.39 | 34.94 | 1.64 | 1.62 |
331 | 35.74 | 35.56 | 0.94 | 0.84 | 35.32 | 34.91 | 1.09 | 1.01 | 34.40 | 34.00 | 1.75 | 1.43 |
52 | 35.33 | 34.99 | 1.96 | 1.77 | 34.91 | 34.52 | 2.19 | 1.98 | 34.13 | 33.74 | 3.24 | 2.94 |
251 | 34.08 | 33.89 | 2.07 | 1.93 | 33.82 | 33.41 | 2.68 | 2.16 | 33.07 | 32.67 | 3.59 | 3.32 |
99 | 37.11 | 36.88 | 0.91 | 0.85 | 36.66 | 36.20 | 0.98 | 0.87 | 35.80 | 35.35 | 1.65 | 1.49 |
36 | 36.67 | 36.23 | 1.20 | 1.07 | 36.15 | 35.74 | 1.38 | 1.24 | 35.46 | 35.05 | 2.16 | 1.90 |
211 | 36.13 | 35.87 | 1.14 | 0.99 | 35.66 | 35.23 | 1.55 | 1.10 | 34.66 | 34.24 | 2.16 | 1.78 |
128 | 35.91 | 35.60 | 1.30 | 1.21 | 35.50 | 35.08 | 1.52 | 1.40 | 34.64 | 34.23 | 2.41 | 2.24 |
293 | 35.06 | 34.82 | 1.22 | 1.12 | 34.65 | 34.28 | 1.46 | 1.29 | 33.76 | 33.40 | 2.18 | 1.92 |
133 | 39.24 | 38.85 | 1.95 | 1.76 | 38.51 | 38.08 | 2.67 | 2.42 | 37.53 | 37.11 | 3.61 | 3.04 |
29 | 39.01 | 38.55 | 0.95 | 0.90 | 38.27 | 37.84 | 1.06 | 0.97 | 37.40 | 36.98 | 1.70 | 1.53 |
Node . | Uncertainty level 1 . | Uncertainty level 2 . | Uncertainty level 3 . | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Mean . | Standard deviation . | Mean . | Standard deviation . | Mean . | Standard deviation . | |||||||
FOSM . | MCS . | FOSM . | MCS . | FOSM . | MCS . | FOSM . | MCS . | FOSM . | MCS . | FOSM . | MCS . | |
233 | 34.78 | 34.63 | 0.98 | 0.86 | 34.44 | 34.08 | 1.26 | 1.13 | 33.75 | 33.39 | 1.72 | 1.62 |
195 | 38.05 | 37.69 | 1.01 | 0.98 | 37.66 | 37.20 | 1.28 | 1.11 | 36.87 | 36.42 | 1.72 | 1.76 |
72 | 36.94 | 36.54 | 0.94 | 0.90 | 36.31 | 35.85 | 0.99 | 1.22 | 35.39 | 34.94 | 1.64 | 1.62 |
331 | 35.74 | 35.56 | 0.94 | 0.84 | 35.32 | 34.91 | 1.09 | 1.01 | 34.40 | 34.00 | 1.75 | 1.43 |
52 | 35.33 | 34.99 | 1.96 | 1.77 | 34.91 | 34.52 | 2.19 | 1.98 | 34.13 | 33.74 | 3.24 | 2.94 |
251 | 34.08 | 33.89 | 2.07 | 1.93 | 33.82 | 33.41 | 2.68 | 2.16 | 33.07 | 32.67 | 3.59 | 3.32 |
99 | 37.11 | 36.88 | 0.91 | 0.85 | 36.66 | 36.20 | 0.98 | 0.87 | 35.80 | 35.35 | 1.65 | 1.49 |
36 | 36.67 | 36.23 | 1.20 | 1.07 | 36.15 | 35.74 | 1.38 | 1.24 | 35.46 | 35.05 | 2.16 | 1.90 |
211 | 36.13 | 35.87 | 1.14 | 0.99 | 35.66 | 35.23 | 1.55 | 1.10 | 34.66 | 34.24 | 2.16 | 1.78 |
128 | 35.91 | 35.60 | 1.30 | 1.21 | 35.50 | 35.08 | 1.52 | 1.40 | 34.64 | 34.23 | 2.41 | 2.24 |
293 | 35.06 | 34.82 | 1.22 | 1.12 | 34.65 | 34.28 | 1.46 | 1.29 | 33.76 | 33.40 | 2.18 | 1.92 |
133 | 39.24 | 38.85 | 1.95 | 1.76 | 38.51 | 38.08 | 2.67 | 2.42 | 37.53 | 37.11 | 3.61 | 3.04 |
29 | 39.01 | 38.55 | 0.95 | 0.90 | 38.27 | 37.84 | 1.06 | 0.97 | 37.40 | 36.98 | 1.70 | 1.53 |
By analyzing individual representative nodes, it is summarized that with the increase of uncertainty level of the NWD, the mean values of the nodal pressure decrease, which may be because the pressure loss caused by flow change is not linear, but exponential. So, the increase of uncertainty level leads to the broadening of the variation range of the NWD, and the occurrence of the NWD with a larger value leads to the rapid exponential drop of pressure head, and finally causes the node pressure mean to decrease continuously. In addition, the standard deviations of individual nodes’ pressure are less than proportionally to change in input uncertainty for water demand with varying levels of uncertainty, which shows that uncertainty is weakened in the propagation process.
In order to visually compare and analyze the uncertainty in the output pressure of different representative nodes, the box plots of the 95% confidence intervals of the output pressure simulated by the FOSM and MCS methods (8 AM on March 8, 2017) under uncertainty level 2 for the 13 representative nodes are shown in Figure 7. For MCS, the 97.5 percentile and 2.5 percentile of the simulated results correspond to the upper and lower edges of the boxes, respectively, while FOSM estimates the upper and lower limits of the 95% confidence interval by adding 1.96 times the standard deviation to the mean and subtracting 1.96 times the standard deviation from the mean (Kang & Lansey 2009). Prediction intervals estimated by the FOSM method are very close to the MCS results even for the highest input uncertainty (not presented). However, the pressure prediction intervals obtained by the MCS method are generally narrower and more accurate. It is also shown that the interval width between individual nodes is quite different. For example, the confidence intervals of output pressure at node 52,251,293 are wider and the uncertainties are relatively large. This demonstrates that the uncertainty of output pressure is related to the characteristics of the NWD, or more traceably, the characteristics of the node water users. The randomness of water use in residential areas and mixed areas leads to the greater uncertainty of the output value. From this perspective, grouping different user types and then discussing uncertainty can more reasonably estimate the variation characteristics of the NWD within the group, and thus reduce the uncertainty in the model output.
The box plots of the 95% confidence intervals of the output pressure simulated by FOSM and MCS (8 AM on March 8, 2017) under uncertainty level 2 for the 13 representative nodes.
The box plots of the 95% confidence intervals of the output pressure simulated by FOSM and MCS (8 AM on March 8, 2017) under uncertainty level 2 for the 13 representative nodes.
The time series (24 h of March 8, 2017) of pressure prediction intervals estimated by MCS and FOSM for uncertainty level 2 at node 99 are shown in Figure 8. The uncertainty bound estimated by FOSM for pressure fit well with that of MCS in 24 h. During the high-demand times, the uncertainty in the nodal pressure is greater, while the uncertainty in the pressure is small during the night minimum demand. Therefore, accurate estimation of the NWD (especially during the high-demand times) has vital influence on the uncertainty in water pressure. Also, this is why this study has spent great effort to build accurate NWD prediction models. In relevant decisions (such as anomaly identification and model design), not only model uncertainty is taken into account, but also the changes of model uncertainty in different water demand periods need to be analyzed. In other words, accurate dynamic real-time model that takes uncertainty into account is very important in hydraulic application analysis.
The time series (24 h of March 8, 2017) of pressure prediction intervals estimated by MCS and FOSM for uncertainty level 2 at node 99.
The time series (24 h of March 8, 2017) of pressure prediction intervals estimated by MCS and FOSM for uncertainty level 2 at node 99.
CONCLUSIONS
This paper proposes a framework to improve the timeliness and accuracy of RTHMs by accurately forecasting the hourly NWD, node grouping for reducing the uncertainty in water demand and DA method, so as to realize the accurate real-time state estimation of the WDS. In addition, the methods of processing monitoring data are studied, and the influence of different machine learning algorithms and input features on the performance of the NWD prediction models is analyzed. The effect of the EnKF method on improving RTHM parameters and state estimation performance is compared. Based on the results of the case study, the following main conclusions are drawn:
(1) By establishing the accurate hourly NWD prediction models and synchronously updating it into the input of the RTHMs, the model timeliness is effectively improved. (2) Appropriate flow data processing methods are shown to reduce effectively the measurement uncertainty, while the methods of node grouping are proven to reduce the uncertainty in water demand, and ultimately leading to obtain a narrower and more accurate confidence interval for pressure estimation. In other words, it reduces the uncertainty in the model output and improves the accuracy of the real-time hydraulic model. (3) The hourly NWD estimation (especially during the period of high demand) has a great influence on the uncertainty in water pressure, so it is very important to establish accurate hourly NWD prediction models. The XGBoost method is proven to be very reliable and excellent. Its overall performance is better than the ANN and RF methods. In addition, time features are very important for hourly NWD prediction models. (4) The EnKF method is capable of reducing the uncertainty inherited in RTHMs by assimilating the field measurements, and contributing to improvement of the model performance. (5) The simulation results obtained by using the two uncertainty analysis methods of FOSM and MCS are in good fits. The mean and standard deviation of the output pressure calculated by MCS are slightly smaller, but the error is within 3%. In addition, the standard deviations of individual nodes’ pressure are less than proportionally to change in input uncertainty for water demand with varying levels of uncertainty, which shows that uncertainty is weakened in the propagation process.
Future work could consider the combination of RTHM simulation results and pressure monitoring data to detect abnormal events in the WDS. The hydraulic models considering the uncertainty are used to estimate the state of the WDS, and the pressure detection threshold are created to identify the occurrence of the failure events. Combined with the on-site pressure data, the cause, time and location of the failure events are further diagnosed, so as to achieve a refined and scientific management of water distribution systems. In addition, when using the DA method to optimize the hydraulic models, the on-site pressure data, tank or reservoir water level could also be used to update the model parameters and status estimation.
ACKNOWLEDGEMENTS
This study is jointly supported by the National Natural Science Foundation of China (51778178 and 51978203), the National Key Research and Development Program of China (2016YFC0802402 and 2018YFC0406201-3).
CONFLICT OF INTEREST STATEMENT
The authors declare that they have no conflict of interest.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.