ABSTRACT
One of the goals of efficient water supply management is to maintain stable water pressure that meets the needs of consumers. Water pressure control is closely related to consumer water demand, so accurate prediction of water consumption is critical for effective water supply management. However, the flow in water supply networks is characterized by uncertainty and instability, which makes it difficult to predict. To obtain more accurate flow predictions, this work proposes a new adaptive and robust time series prediction method, called Complete Ensemble Empirical Mode Decomposition with Adaptive Noise–Kernel Principal Component Analysis–Long Short-Term Memory (CEEMDAN–KPCA–LSTM), in which the CEEMDAN and KPCA preprocessing techniques are combined for flow prediction. First, the flow data are decomposed using the CEEMDAN algorithm to reduce non-stationarity. Then, KPCA is used to extract the key influencing factors from the feature series. Finally, an LSTM network is constructed to predict the water supply network flow from the results of the CEEMDAN and KPCA algorithms. The suggested scheme offers significant application prospects for water supply systems.
HIGHLIGHTS
The water flow data are decomposed by Complete Ensemble Empirical Mode Decomposition with Adaptive Noise to extract more informative features.
Kernel Principal Component Analysis is used to select the most important decomposed sequences, which reduces the computational load of the model.
The developed method could effectively improve the prediction accuracy.
The proposed model has a shorter runtime than the CEEMDAN–LSTM model without dimensionality reduction.
The scheme has significant potential for use in water supply systems.
NOMENCLATURE
Abbreviation | Expansion
---|---
ANN | artificial neural network
ARIMA | AutoRegressive Integrated Moving Average
CEEMDAN | Complete Ensemble Empirical Mode Decomposition with Adaptive Noise
CNN | convolutional neural network
EMD | empirical mode decomposition
EEMD | ensemble empirical mode decomposition
GRU | gated recurrent unit
IMF | intrinsic mode function
KPCA | Kernel Principal Component Analysis
LSTM | Long Short-Term Memory
MAE | mean absolute error
MAPE | mean absolute percentage error
RNN | recurrent neural network
RMSE | root mean square error
SWAT | Soil and Water Assessment Tool
VMD | variational mode decomposition
CEEMDAN–AM–LSTM | Complete Ensemble Empirical Mode Decomposition with Adaptive Noise–Amplitude Modulation–Long Short-Term Memory
INTRODUCTION
The water supply system is an important part of urban infrastructure. It is directly related to the lives of city residents and also has a great impact on the economic development of the city (Saleem et al. 2021). Flow prediction for a water supply network is the process of using historical data and mathematical models to estimate the trend and pattern of water flows in the network over a future period. For the government, flow prediction is important for optimizing the design, operation, control, and management of the water supply network, improving the efficiency of water resource utilization, and reducing fluctuations in water pressure and water quantity (Xia et al. 2022). Residents, in turn, benefit from a more stable water supply, because the water supply department can plan the pressure in each area according to the predicted flow. Therefore, studying water flow prediction is of great significance, both for the regional planning of water supply networks and for improving residents' water consumption experience.
Over the past few decades, scholars have conducted a great deal of research on urban water demand prediction. Because the flow data from a water supply network form a typical time series, traditional statistical and machine learning methods are the most common solutions. Brentan et al. used an adaptive Fourier series to improve support vector regression, eliminating the errors and most of the biases inherent in the regression structure, so that water demand could be predicted in near real time (Brentan et al. 2017). Zubaidi et al. aimed to provide a suitable and reliable technique for predicting municipal water demand by combining the gravitational search algorithm and the backtracking search algorithm with an artificial neural network (ANN) (Zubaidi et al. 2018). Wu et al. proposed a back-propagation neural network based on principal component analysis to predict the water demand in Taiyuan, Shanxi Province, China, and the model outperformed other models on a variety of evaluation metrics (Wu et al. 2021). Saranya and Vinish evaluated, for the first time, a neural network autoregression model as a replacement for the physically based Soil and Water Assessment Tool (SWAT) hydrologic model for predicting streamflow under data-poor conditions and for immediate, high-quality modeling results (Saranya & Vinish 2023). Predicting water flow directly from the raw series often performs poorly because the series is nonlinear and non-stationary and contains noise and outliers. Therefore, to improve the prediction accuracy, the data are usually decomposed and reduced in dimensionality. Empirical mode decomposition (EMD) is an adaptive and efficient method for decomposing nonlinear and non-stationary signals (Battista et al. 2009). It extracts a set of intrinsic mode functions (IMFs) from the analyzed signal by stepwise sifting.
The ensemble EMD (EEMD) algorithm of Wu & Huang (2004), which adds Gaussian white noise to the signal to be decomposed, addressed the mode mixing problem of EMD. Rezaiy and Shabri coupled EEMD with the AutoRegressive Integrated Moving Average (ARIMA) model for drought prediction (Rezaiy & Shabri 2024). However, some residual white noise always remains in the IMFs produced by EEMD, which makes subsequent processing and analysis difficult.
To address these issues, Cao et al. proposed Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) (Cao et al. 2019). After each order of decomposition, CEEMDAN adds Gaussian white noise to the decomposition and immediately computes the overall mean. Wang et al. proposed a hybrid CEEMDAN–AM–LSTM model for NOx prediction, which achieved the most accurate prediction of NOx concentration for thermal power plants among the compared methods (Wang et al. 2022). Guo et al. combined CEEMDAN with Long Short-Term Memory (LSTM) and bidirectional LSTM, respectively, to predict precipitation and lower Yellow River discharge, and the models achieved high accuracy (Guo et al. 2022; Zhang et al. 2023). Jiao et al. proposed a hybrid water quality prediction model based on variational mode decomposition (VMD), optimized by a sparrow search algorithm and a bidirectional gated recurrent unit (GRU), to provide technical support for river water quality protection and pollution prevention (Dong et al. 2023; Jiao et al. 2024). Gou and Ning established a deep convolutional neural network (CNN) based on Kernel Principal Component Analysis (KPCA) and a coding scheme for power prediction, with high prediction accuracy and good robustness (Gou & Ning 2021). However, these methods do not fundamentally reduce the dimensionality of the data, so the prediction model takes longer to run and generalizes poorly.
The LSTM model has a long-term memory function, which can effectively alleviate the exploding and vanishing gradient problems of the recurrent neural network (RNN). Guo et al. established a threshold GRU network model for water demand forecasting, which outperformed the ANN and ARIMA models (Guo et al. 2018). Mu et al. used an LSTM model to predict short-term urban water demand in Hefei, China, and demonstrated that external parameters did not improve the performance of the LSTM model (Mu et al. 2020). Wang et al. proposed a short-term water quality prediction model that uses VMD to optimize LSTM, and the model performs well in short-term water quality prediction (Wang et al. 2023). Yao et al. proposed a hybrid model based on CNN and LSTM (CNN–LSTM) for runoff prediction, whose wide applicability was verified on several datasets (Yao et al. 2023). These studies show that LSTM neural networks are widely used in water flow prediction, with high accuracy and low computational complexity. In this work, the LSTM is used to study water flow prediction in a water supply network, and its structure is modified to suit the characteristics of such a network. The specific contributions are as follows:
The CEEMDAN–KPCA–LSTM urban water supply network flow prediction model is established in this article. Compared with the preprocessing and prediction techniques in previous studies, CEEMDAN–KPCA can effectively deal with the nonlinear and non-smooth problems of the data, reduce the noise and dimensionality of the flow data, and extract the intrinsic features of the flow signal. It can improve the performance of flow prediction models by improving the input data quality and reducing the computational complexity.
In the flow prediction studies of water supply networks, there is a lack of a unified model evaluation standard or framework that integrates the predictive performance, computational complexity, interpretability and scalability of models. In this article, starting from the characteristics of water supply network flow data with time series data, combined with modal decomposition and feature extraction, the flow of the water supply network is predicted by the prediction model.
The rest of this article is organized as follows. Section 2 briefly introduces the basic principles of CEEMDAN, KPCA, and LSTM. Our model, which blends CEEMDAN and KPCA algorithms to achieve higher accuracy and lower time complexity, is presented in Section 3. In Section 4, the model is verified in an urban water supply network system.
PRELIMINARIES
In this section, we will briefly introduce the basic elements of the model, including CEEMDAN, KPCA, and LSTM.
Complete Ensemble Empirical Mode Decomposition with Adaptive Noise
CEEMDAN is an extension of EEMD. It is more efficient than EMD and EEMD in terms of eliminating mode mixing and reducing computational cost, and it requires fewer sifting iterations than EEMD (Poongadan & Lineesh 2024). CEEMDAN adds adaptive Gaussian white noise a finite number of times during the decomposition process, which reduces the residual noise in the final result and addresses the mode mixing that occurs in EMD. The specific steps of the CEEMDAN decomposition are as follows.
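- (1) Gaussian white noise is added to the original flow signal to generate a set of noise-added signals.
- (2) In the first EMD decomposition, each noise-added signal is decomposed and the results are averaged to produce the first-mode component and the first-stage residual, as in Equations (2)–(3).
- (3) Each subsequent mode component is generated by performing k EMD decompositions of the noise-added residual from the previous stage, followed by a series of functional operations, as in Equations (4)–(5). Here, the EMD operator denotes the kth IMF component obtained by EMD decomposition.
- (4) After T cycles of sequence decomposition, the ith IMF can finally be obtained.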
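For reference, the steps above can be written out in the commonly used CEEMDAN formulation. This is only a sketch: the symbols are assumptions chosen here and may differ from the notation and numbering of the article's own Equations (1)–(5). In it, $x(t)$ is the flow signal, $w^{i}(t)$ is the $i$th white-noise realization out of $I$, $\varepsilon_k$ is the noise amplitude at stage $k$, and $E_k(\cdot)$ returns the $k$th mode obtained by EMD.

```latex
\begin{align*}
x^{i}(t) &= x(t) + \varepsilon_0\, w^{i}(t), \qquad i = 1, \dots, I \\
\mathrm{IMF}_1(t) &= \frac{1}{I} \sum_{i=1}^{I} E_1\!\left(x^{i}(t)\right) \\
r_1(t) &= x(t) - \mathrm{IMF}_1(t) \\
\mathrm{IMF}_{k+1}(t) &= \frac{1}{I} \sum_{i=1}^{I} E_1\!\left(r_k(t) + \varepsilon_k\, E_k\!\left(w^{i}(t)\right)\right) \\
r_{k+1}(t) &= r_k(t) - \mathrm{IMF}_{k+1}(t) \\
x(t) &= \sum_{k=1}^{K} \mathrm{IMF}_k(t) + r_K(t)
\end{align*}
```

The last line states the completeness property: the $K$ modes and the final residual sum back to the original signal, which is why the residual can be read as the overall trend.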
Kernel Principal Component Analysis
Unlike PCA, KPCA uses a kernel function to map the data into a high-dimensional feature space, which improves the linear separability of the data in that space. The traditional PCA algorithm is then applied to reduce the dimension of the mapped data. In this way, data that are not linearly separable can be handled.
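As an illustration of the idea, the following is a minimal NumPy sketch of KPCA. The Gaussian (RBF) kernel, the value of `gamma`, and the random 13-column input are assumptions made for the example; the article does not state which kernel or parameters were used.

```python
import numpy as np

def rbf_kernel(X, gamma):
    # Pairwise squared Euclidean distances, then the Gaussian (RBF) kernel.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kpca(X, n_components, gamma=1.0):
    n = X.shape[0]
    K = rbf_kernel(X, gamma)
    # Center the kernel matrix, i.e. center the data in feature space.
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # eigh returns eigenvalues in ascending order; reverse to descending.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    eigvals = np.maximum(eigvals[::-1], 0.0)
    eigvecs = eigvecs[:, ::-1]
    # Projection of the training points onto component k is sqrt(lambda_k) * v_k.
    scores = eigvecs[:, :n_components] * np.sqrt(eigvals[:n_components])
    # Share of feature-space variance carried by each component.
    contribution = eigvals / eigvals.sum()
    return scores, contribution

# Hypothetical input: 200 samples with 13 columns (e.g. IMFs plus a residual).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))
scores, contribution = kpca(X, n_components=6, gamma=0.05)
```

The `contribution` vector plays the role of the per-component contribution rate, so a cumulative threshold can be applied to it to decide how many components to keep.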




Long Short-Term Memory







DEVELOPED METHOD
The flow sequence in a large-scale water supply network is often linked to the water use habits of the users and shows a cyclical trend. In practice, events such as pipeline maintenance occur frequently and can disrupt the periodicity of the flow sequence. As a result, the sequence is subject to uncertainty and random disturbances over time and exhibits chaotic characteristics.
CEEMDAN needs to introduce noise at each decomposition stage and perform multiple decompositions, which significantly increases the computation. Compared to traditional EMD or EEMD, CEEMDAN is more expensive to compute, especially when dealing with high-dimensional data or long time series, where the computation time can be significantly extended. However, flow prediction models in the water supply network often need to respond to real-time flow to adjust future flow predictions. Therefore, an algorithm that reduces the dimensionality of the data is needed. We chose the KPCA algorithm because it uses a kernel function to map the data into a high-dimensional feature space, which allows it to process data with a complex nonlinear structure and extract more information and patterns, whereas traditional PCA can only capture linear features.
Before the LSTM is trained, the dimension-reduced data and the original data are normalized, which is necessary for prediction accuracy. Max–min normalization is used here. The principal components are used as input variables to build the LSTM model, and the model prediction results and evaluation indices are the output variables. Importantly, the rectified linear unit (ReLU) is used as the activation function in the LSTM. Traditional saturating activation functions, such as sigmoid and tanh, lead to vanishing gradients, whereas non-saturating activation functions, such as ReLU, largely avoid this problem (Tollner et al. 2024). Compared with saturating activation functions, a non-saturating activation function such as ReLU can also accelerate the convergence of the model. A deep learning model using ReLU can achieve similar or better results without pre-training prior to supervised training.
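The max–min normalization step can be sketched as follows. The function names and the sample flow values are hypothetical, and fitting the minimum and maximum on one array and reusing them for the inverse transform is an assumption about the workflow rather than a detail given in the article.

```python
import numpy as np

def min_max_fit(x):
    # Column-wise minimum and maximum of the data.
    return x.min(axis=0), x.max(axis=0)

def min_max_transform(x, lo, hi):
    # Scale each column to [0, 1]; constant columns map to 0.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (x - lo) / span

def min_max_inverse(z, lo, hi):
    # Map scaled values (e.g. LSTM outputs) back to the original units.
    return z * (hi - lo) + lo

# Hypothetical hourly flow readings.
flow = np.array([[120.0], [180.0], [150.0], [240.0]])
lo, hi = min_max_fit(flow)
scaled = min_max_transform(flow, lo, hi)
restored = min_max_inverse(scaled, lo, hi)
```

Keeping `lo` and `hi` around matters because the error metrics are only interpretable once predictions have been mapped back to flow units.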
EXPERIMENTAL RESULTS
In this section, we will introduce how the CEEMDAN–KPCA–LSTM model is applied in the urban water supply network and compare it with other mainstream prediction models, including accuracy and running time.
Data source
This study uses real flow data from the water supply network of a city in Anhui, China. The flow data were obtained from sensors and electromagnetic flowmeters deployed in the water supply network. The models of the sensors and flowmeters are TDS-100W and KEFN-XXX-103-G3, respectively. The collection period runs from March 1 to November 26, 2019. The flow data were sampled at 1 h intervals, covering each day from 00:00 to 23:00.
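As a rough check on the dataset size implied by this description, the sketch below counts the hourly samples in the collection period. It assumes a gap-free series that includes both the first and the last day in full; the article does not report the actual number of samples.

```python
from datetime import datetime, timedelta

# First and last hourly timestamps of the stated collection period.
start = datetime(2019, 3, 1, 0)
end = datetime(2019, 11, 26, 23)

# Inclusive count of hourly samples, assuming no missing readings.
n_samples = (end - start) // timedelta(hours=1) + 1
n_days = n_samples // 24

print(n_days, n_samples)
```

Under these assumptions the period spans 271 days, or 6504 hourly samples.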
Validation of CEEMDAN decomposition


Figure 5 shows that IMF1 has the shortest wavelength and the highest frequency. Starting from IMF6, the frequency gradually decreases, the wavelength becomes longer, and the amplitude decreases. The residual term represents the overall trend of the flow in the city's water supply network from March 1 to November 26, 2019. The flow gradually increases from March, reaches its maximum in July and August, and then starts to decrease. This overall trend is consistent with the seasonal changes.
Implementation of dimensionality reduction based on the KPCA algorithm
KPCA can use kernel functions to map the data into a high-dimensional space where nonlinear features can be separated. In addition, this method is particularly suitable for dealing with complex nonlinear signals, whereas PCA may miss important nonlinear features.
Based on the previously mentioned cumulative contribution requirement of q% (98%), six principal components are retained. The contributions of the first six principal components are 47.75, 31.74, 8.44, 5.09, 2.85, and 2.27%, respectively. Their cumulative contribution rate is 98.14%, as shown in the cumulative contribution rate line graph, which meets the requirement. The selected principal components are highly representative and contain most of the information in the data. Reducing the 13-dimensional data to six dimensions reduces the complexity of the data and shortens the computation time.
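The selection rule can be reproduced from the contribution rates quoted above. The helper name `components_needed` is hypothetical; only the six reported rates are used, since the rates of the remaining seven components are not given in the article.

```python
import numpy as np

# Contribution rates (%) of the first six principal components, as reported.
contribution = np.array([47.75, 31.74, 8.44, 5.09, 2.85, 2.27])
cumulative = np.cumsum(contribution)

def components_needed(contribution_pct, threshold_pct):
    # Smallest number of leading components whose cumulative rate reaches the threshold.
    cum = np.cumsum(contribution_pct)
    idx = int(np.searchsorted(cum, threshold_pct))
    if idx == len(cum):
        raise ValueError("threshold not reached by the given components")
    return idx + 1

n_keep = components_needed(contribution, 98.0)
```

With the 98% requirement, the first five components reach only 95.87%, so the sixth is needed to bring the cumulative rate to 98.14%.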
Evaluation metrics
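The four metrics reported in this article (RMSE, MAE, MAPE, and R2) can be computed as in the sketch below. These are the standard textbook definitions, with MAPE expressed in percent; they are assumed to match the article's formulas, which are not reproduced here. The sample values are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean square error.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    # Mean absolute error.
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    # Mean absolute percentage error, in percent; y_true must be nonzero.
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical observed and predicted flows.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])
scores = {
    "RMSE": rmse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
    "R2": r2(y_true, y_pred),
}
```

RMSE and MAE are in flow units, so they should be computed after predictions are mapped back from the normalized scale, while MAPE and R2 are scale-free.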




Flow prediction implementation
The LSTM model is used to predict the flow data of the urban water supply network. The selected principal components are the input variables to the LSTM model. The dataset is normalized and divided into a training set and a test set in a ratio of 7:3. After repeated tuning, the parameters used to build the LSTM model for urban water supply network flow prediction are as follows: the number of input time steps is 24, the input dimension is 6, the output dimension is 1, the number of hidden layers is 1, the number of hidden layer nodes is 100, and the gradient threshold is set to 1 to prevent gradient explosion.
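The input shapes implied by these settings can be sketched as follows. The chronological (unshuffled) split, the one-hour-ahead target, and the random placeholder data are assumptions for the example; the article gives only the 7:3 ratio, the 24 time steps, and the 6 input dimensions.

```python
import numpy as np

def chronological_split(features, target, train_parts=7, test_parts=3):
    # Split without shuffling so the test set lies strictly after the training set.
    n_train = len(features) * train_parts // (train_parts + test_parts)
    return (features[:n_train], target[:n_train]), (features[n_train:], target[n_train:])

def make_windows(features, target, time_steps=24):
    # Each sample holds `time_steps` consecutive rows; the label is the next value.
    n_windows = len(features) - time_steps
    X = np.stack([features[i:i + time_steps] for i in range(n_windows)])
    y = target[time_steps:]
    return X, y

# Placeholder data: 1000 hourly rows of 6 principal components and one flow target.
rng = np.random.default_rng(0)
features = rng.random((1000, 6))
target = rng.random(1000)

(train_f, train_t), (test_f, test_t) = chronological_split(features, target)
X_train, y_train = make_windows(train_f, train_t)
X_test, y_test = make_windows(test_f, test_t)
```

Each training sample then has shape (24, 6), matching an LSTM input layer with 24 time steps and 6 features, and each label is a single flow value.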
Monthly flow prediction results for the following months: (a) March; (b) May; (c) July; and (d) August.
To assess the prediction performance of the CEEMDAN–KPCA–LSTM model more intuitively, the RMSE, MAE, MAPE, and R2 are computed for the prediction results of each month. As Table 1 shows, the MAPE values for March, May, July, and August are all around 5%, and the R2 values all exceed 0.97. The MAPE for May is 5.173%, the highest among the four months. The results show that the developed model achieves high accuracy and good prediction performance for monthly flow predictions.
Table 1. Evaluation of monthly performances
Model | Month | RMSE | MAE | MAPE (%) | R2
---|---|---|---|---|---
CEEMDAN–KPCA–LSTM | Mar | 13.102 | 9.485 | 4.857 | 0.972
CEEMDAN–KPCA–LSTM | May | 14.086 | 11.294 | 5.173 | 0.983
CEEMDAN–KPCA–LSTM | Jul | 13.129 | 10.055 | 4.338 | 0.973
CEEMDAN–KPCA–LSTM | Aug | 12.003 | 10.488 | 4.669 | 0.979
DISCUSSION
Table 2. Results of performance evaluation
Model | RMSE | MAE | MAPE (%) | R2
---|---|---|---|---
GRU | 33.675 | 24.178 | 11.341 | 0.843
LSTM | 32.726 | 23.23 | 11.312 | 0.852
EMD–LSTM | 14.928 | 13.172 | 7.240 | 0.969
CEEMDAN–GRU | 16.132 | 11.96 | 5.784 | 0.964
CEEMDAN–LSTM | 13.278 | 11.09 | 5.947 | 0.976
EMD–KPCA–LSTM | 7.696 | 6.026 | 3.399 | 0.991
CEEMDAN–KPCA–GRU | 8.646 | 7.023 | 3.805 | 0.989
CEEMDAN–KPCA–LSTM | 7.113 | 5.481 | 2.756 | 0.993
Compared with the GRU, LSTM, EMD–LSTM, CEEMDAN–GRU, CEEMDAN–LSTM, EMD–KPCA–LSTM, and CEEMDAN–KPCA–GRU models, the developed model has the best prediction performance. Unlike the single LSTM model, the CEEMDAN–KPCA–LSTM model performs signal decomposition and dimensionality reduction, which yields higher prediction accuracy, and it is superior on all performance metrics. Its RMSE, MAE, and MAPE are lower than those of the single LSTM by 78.3, 76.4, and 75.6%, respectively, and lower than those of the CEEMDAN–LSTM model without dimensionality reduction by 46.4, 50.6, and 53.7%, respectively. The combined models significantly improve the prediction accuracy compared with the single models. These results indicate that the KPCA dimensionality reduction is necessary and that our model performs better than the others.
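The percentage improvements quoted above follow directly from the reported error values as relative reductions, (baseline − proposed) / baseline. A short sketch of that arithmetic, with the metric values copied from the performance table and a hypothetical helper name:

```python
def pct_reduction(baseline, proposed):
    # Relative reduction of an error metric, in percent.
    return (baseline - proposed) / baseline * 100.0

# (RMSE, MAE, MAPE) as reported for each model.
lstm = (32.726, 23.23, 11.312)
ceemdan_lstm = (13.278, 11.09, 5.947)
proposed = (7.113, 5.481, 2.756)

vs_lstm = [round(pct_reduction(b, p), 1) for b, p in zip(lstm, proposed)]
vs_ceemdan_lstm = [round(pct_reduction(b, p), 1) for b, p in zip(ceemdan_lstm, proposed)]

print(vs_lstm)
print(vs_ceemdan_lstm)
```

Rounded to one decimal place, the results are 78.3, 76.4, and 75.6% against the single LSTM and 46.4, 50.6, and 53.7% against CEEMDAN–LSTM.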
To verify the effect of the KPCA dimensionality reduction on the running time of the algorithm, the running time of three prediction models was measured over five rounds, and the results are shown in Table 3. The LSTM model with direct input data has the shortest running time. The CEEMDAN–LSTM model has the longest running time because it uses the 13-dimensional data obtained from the CEEMDAN decomposition as input variables. Compared with the CEEMDAN–LSTM model, the average running time of our model with the KPCA dimensionality reduction falls from about 91.57 s to about 68.89 s, a reduction of roughly 25%. Performing the KPCA dimensionality reduction therefore reduces the running time of the decomposition-based algorithm in this study.
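The average running times can be checked from the five per-round timings reported in Table 3. The sketch below uses only those values; the dictionary layout and variable names are illustrative.

```python
# Per-round running times in seconds, as reported in Table 3.
runtimes = {
    "LSTM": [52.54, 56.77, 59.99, 56.03, 54.80],
    "CEEMDAN-LSTM": [87.51, 93.75, 93.08, 92.04, 91.47],
    "CEEMDAN-KPCA-LSTM": [65.38, 70.16, 69.56, 70.52, 68.81],
}

mean_time = {name: sum(t) / len(t) for name, t in runtimes.items()}

# Relative saving of the KPCA variant over CEEMDAN-LSTM, in percent.
saving = (mean_time["CEEMDAN-LSTM"] - mean_time["CEEMDAN-KPCA-LSTM"]) / mean_time["CEEMDAN-LSTM"] * 100.0
# Relative overhead of the KPCA variant over the plain LSTM, in percent.
overhead = (mean_time["CEEMDAN-KPCA-LSTM"] - mean_time["LSTM"]) / mean_time["LSTM"] * 100.0
```

On these figures the KPCA variant is about 25% faster than CEEMDAN–LSTM but still about 23% slower than the plain LSTM, so the runtime gain is relative to the decomposition-based model rather than to every baseline.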
Table 3. Running time of the algorithm
Model | Round 1 (s) | Round 2 (s) | Round 3 (s) | Round 4 (s) | Round 5 (s)
---|---|---|---|---|---
LSTM | 52.54 | 56.77 | 59.99 | 56.03 | 54.80
CEEMDAN–LSTM | 87.51 | 93.75 | 93.08 | 92.04 | 91.47
CEEMDAN–KPCA–LSTM | 65.38 | 70.16 | 69.56 | 70.52 | 68.81
In short, the advantage of the CEEMDAN–KPCA–LSTM model is that it can effectively extract the multi-scale features of the time series, reduce the dimensionality and complexity of the data, and improve the prediction performance of the LSTM network. However, the model has some shortcomings. In dimensionality reduction, the choice of different kernel functions will lead to different reduction effects, and the kernel matrix will also take up a lot of memory and computational resources when the data volume is large.
CONCLUSIONS
In this article, the CEEMDAN decomposition, the KPCA dimensionality reduction, and the LSTM model were combined to build the CEEMDAN–KPCA–LSTM model, which was applied to the flow prediction of an urban water supply network. The results were compared with related baseline models, and the main findings are as follows:
The original flow data were decomposed by the CEEMDAN algorithm to obtain different IMFs and residual terms. The various scale fluctuations and trends present in the flow data were decomposed. The KPCA algorithm was used to extract the six principal components that reflect the original information. It can remove redundant features, reduce the dimensionality of model input parameters, and improve model efficiency and performance. Reducing high-dimensional data to low-dimensional data speeds up the operation of the algorithm and improves the accuracy of the model.
The prediction results for the water supply network flow data from March 1 to November 26, 2019 show that the CEEMDAN–KPCA–LSTM model has high prediction performance. The model achieves an RMSE of 7.113, an MAE of 5.481, a MAPE of 2.756%, and an R2 of 0.993. These indicators show that the model in this article outperforms the other models in flow prediction.
Our model focuses primarily on data optimization and learning from historical data, so it can be adapted to other distribution networks with either similar or different characteristics. The developed model can also be used to predict other hydraulic parameters in the distribution network that often exhibit periodic patterns, such as water demand and flow rate. When the model is applied to predict flow in other distribution networks, it can be trained using only the corresponding historical data. In other words, as long as there is sufficient historical data, this model can be applied to any distribution network, and it shows significant advantages when data are scarce or of poor quality. Most importantly, the runtime of the model is shorter than that of the CEEMDAN–LSTM model without dimensionality reduction, which has positive implications for real-time scheduling of water supply networks. In addition, the collection interval for the flow data in this study was set to 1 h. Shortening the time interval and increasing the amount of data can effectively reduce the uncertainties caused by long-term correlations. To improve the accuracy, future work could reduce the data collection interval to 1, 5, or 15 min, allowing the model to extract more detailed information. Incorporating these factors into future research will be crucial.
ACKNOWLEDGEMENTS
This work was supported in part by the Demonstration Project of Water Supply Safety Guarantee and Optimized Operation of Chaohu Pipeline Network and Major Scientific Research Project of Universities in Anhui Province (2024AH040039).
AUTHOR CONTRIBUTION
W.X. conceptualized, visualized, and investigated the study. Y.C. conceptualized the study, developed the methodology, wrote the original draft, and reviewed and edited the manuscript. K.N. curated the data and reviewed the manuscript. Z.M. reviewed and edited the manuscript and supervised the study.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.