Short-term water demand prediction is crucial for real-time optimal scheduling and leakage control in water distribution systems. This paper proposes a new deep learning-based method for short-term water demand prediction. The proposed method consists of four main parts: the variational mode decomposition method, the golden jackal optimization algorithm, the multihead attention mechanism, and the bidirectional gated recurrent unit (BiGRU) model. Furthermore, a seq2seq strategy was adopted for multistep prediction to avoid the error accumulation problem. Hourly water demand data collected from a real-world water distribution system were applied to investigate the potential of the proposed method. The results show that the proposed method can yield remarkably accurate and stable forecasts in single-step prediction (i.e., the mean absolute percentage error (MAPE) reaches 0.45%, and the root mean squared error (RMSE) is 25 m³/h). Moreover, the proposed method still achieves credible performance in 24-step prediction (i.e., the MAPE reaches 2.12%, and the RMSE is 126 m³/h). In general, for both single-step prediction and multistep prediction, the proposed method consistently outperforms other BiGRU-based methods. These findings suggest that the proposed method can provide a reliable alternative for short-term water demand prediction.

  • Based on the ‘decomposition–prediction’ strategy, an attention mechanism-based framework was proposed.

  • A new nature-inspired optimization method was used to optimize the parameters of the variational mode decomposition method, contributing to improving the decomposition quality.

  • A seq2seq strategy was adopted for multistep prediction to avoid the error accumulation problem in multistep prediction.

Short-term water demand prediction (STWDP) can provide an important decision-making basis for optimal scheduling and leakage control of water distribution systems. Therefore, the accuracy of prediction methods has a significant impact on water distribution system operation in terms of security, reliability and cost-effectiveness. To date, the exploration of prediction methods has remained a crucial research area for scholars worldwide. Many prediction methods, such as statistics-based methods (Yasar et al. 2012), artificial neural networks (ANNs) (Ghiassi et al. 2008), and Gaussian process regression (Roushangar & Alizadeh 2018), have been proposed over the past few decades. With the rapid development of computer technology, machine learning methods have gained increasing attention due to their ability to effectively handle nonlinear data. Overall, the application of machine learning in STWDP can be roughly divided into two stages: the shallow learning-based stage and the deep learning-based stage.

In the shallow learning-based stage, prediction methods tend to adopt a simple structure to capture the internal relationships among the data. Therefore, the desired results can be obtained by adjusting a few hyperparameters. Among these prediction methods, the ANN is one of the most commonly used base models. So far, a wide variety of methods based on ANNs have been developed to predict short-term water demand. For example, Bougadis et al. (2005) adopted three different ANNs to predict short-term municipal water demand and conducted a comparison with regression models and time series models. The results showed that the ANNs consistently outperformed the other models. Similar conclusions were drawn from the case studies of Pulido-Calvo & Gutierrez-Estrada (2009) and Adamowski (2008). To improve the prediction accuracy of ANNs, Huang et al. (2022) applied the mind evolutionary algorithm (MEA) to optimize the initial thresholds and weights of backpropagation neural networks (BPNNs) and found that the MEA_BPNN model can yield more accurate and stable results than the traditional BPNN model. Based on the above literature, there is no doubt that ANN-based methods can achieve good results in many case studies. However, because such shallow structures cannot effectively extract deep-level features, raising the upper limit of the prediction accuracy is difficult.

In the deep learning-based stage, prediction methods rely on constructing deeper networks to extract deep-level features of data, thereby achieving more accurate forecasts. A recurrent neural network (RNN) is a typical architecture of deep learning networks and was specifically designed for processing sequence information. To solve the problems (i.e., vanishing and exploding gradients) that exist in traditional RNNs, two improved architectures known as long short-term memory (LSTM) and gated recurrent units (GRUs) were successively developed (Hochreiter & Schmidhuber 1997; Cho et al. 2014).

In recent years, LSTM-based methods have been widely used in the field of STWDP due to the outstanding ability of LSTM to analyze time series data. For instance, Mu et al. (2020) proposed an LSTM-based model to predict hourly and daily urban water demands in Hefei City, China. They concluded that the LSTM-based model can offer predictions with improved accuracy under unfavourable conditions (e.g., data with high temporal resolution, abrupt changes, or a relatively high uncertainty level). To improve the prediction accuracy, Pu et al. (2023) combined wavelet analysis (WA) with convolutional neural networks (CNNs) and LSTM for single-step and multistep prediction. Instead of using the inputs directly, the inputs in this method were a subseries of components obtained from WA. Similarly, Liu et al. (2023) developed a novel LSTM-based method in which seasonal-trend decomposition using LOESS (STL) and an adaptive boosting algorithm were used to enhance the LSTM performance. In this work, the STL method was utilized to obtain input series for the LSTM network. The results obtained from the two aforementioned case studies show that using input series derived from appropriate decomposition techniques as model inputs is conducive to improving the prediction accuracy. Considering that the selection of hyperparameters can easily affect prediction accuracy, Wang et al. (2023a, 2023b) used the clouded leopard algorithm to optimize the input parameters of LSTM and eventually obtained the desired forecasts.

Compared with LSTM, the GRU has a simpler architecture, a faster computation speed and similar prediction performance. In view of this, GRU-based methods have also become popular in the field of STWDP. For example, Guo et al. (2018) developed a GRU model integrated with a correction module to produce 15-min and 24-h water demand forecasts with a 15-min time step. The results showed that the GRU-based models outperformed the ANN and seasonal autoregressive integrated moving average (SARIMA) models for both forecast horizons. To alleviate forecasting errors at extreme points, Salloom et al. (2021) proposed a novel deep neural network architecture for one-step and multistep forecasting, in which a GRU was used as the basic model to handle the sequential relationships among historical demand data. This study revealed that the proposed architecture can significantly reduce the complexity of the model while preserving its accuracy. Moreover, Hu et al. (2021) proposed a novel framework in which a GRU was also used as the basic forecasting model, and an isolation forest and complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) were adopted for data preprocessing. The results demonstrated that the GRU-based method was superior to support vector regression (SVR) and ANNs. In addition, this study confirmed that using a decomposition technique to preprocess the data is indeed conducive to improving the prediction accuracy.

In summary, LSTM-based and GRU-based methods are adequate for obtaining promising forecasts in STWDP. Nevertheless, there is still some room to improve the prediction accuracy. For example, more effective decomposition techniques can be applied to enhance the decomposition efficiency. Furthermore, the most recent advancements in the deep learning domain can also be applied to water demand forecasting problems.

Inspired by the above literature, this paper proposes a novel deep learning-based method for STWDP. The proposed method is composed of four main parts: the variational mode decomposition (VMD) method, the golden jackal optimization (GJO) algorithm, the multihead attention mechanism (Multi-Attn) and the bidirectional gated recurrent unit (BiGRU) model. In the proposed method, the VMD method was used for data decomposition to create a series of relatively stable subsequences; the GJO algorithm was adopted to optimize the parameters of the VMD method, thereby enhancing the decomposition quality; the BiGRU model was used as the basic prediction method; and Multi-Attn was introduced into the BiGRU model to improve the prediction accuracy. For clarity, the proposed method is hereafter denoted as the GJO-VMD-BiGRU-Multi-Attn model.

The objective of this study is to investigate the potential of new techniques in the fields of signal decomposition, optimization and deep learning in addressing water demand prediction problems. The major contributions of this study can be briefly summarized as follows:

  • (1) Based on the ‘decomposition–prediction’ strategy, a novel framework was proposed for STWDP.

  • (2) A new nature-inspired optimization method named GJO was used to optimize the parameters of the VMD method, contributing to improving the decomposition quality.

  • (3) A Multi-Attn was introduced into the BiGRU model to capture the key temporal feature information of sequence data. In addition, a seq2seq strategy was adopted for multistep prediction to avoid the error accumulation problem, thus improving the model performance in multistep prediction.

The remainder of this paper is organized as follows. Section 2 describes the theory of related techniques used in this study. In Section 3, the proposed method is evaluated using a case study, following a comprehensive comparison with other BiGRU-based methods. Section 4 presents some conclusions and provides some suggestions for future work.

VMD method

Water demand series data exhibit strong fluctuations over time, which may seriously affect the accuracy of prediction methods. In this paper, the VMD method was used to decompose water demand series data, thereby smoothing the series. The VMD method, proposed by Dragomiretskiy & Zosso (2014), is an adaptive and entirely nonrecursive signal processing method that divides a signal into a series of bandlimited intrinsic mode functions (BIMFs). The method has a solid mathematical foundation and can effectively address the problems of the empirical mode decomposition (EMD) method (i.e., endpoint effects and mode mixing). Accordingly, the VMD method has been widely used to decompose time series in many studies (Ghasempour et al. 2021; Hu et al. 2023; Li et al. 2023). The core idea of the VMD method is to construct and solve a variational model. Let f(t) be the original signal; then, the variational model can be constructed by the following formula:
$$\min_{\{u_k\},\{\omega_k\}} \left\{ \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \right\} \quad \text{s.t.} \quad \sum_{k=1}^{K} u_k(t) = f(t)$$
(1)
where $u_k(t)$ is the mode function of the k-th mode, $\omega_k$ is the centre frequency of the k-th mode, K is the number of decomposed modes, $\delta(t)$ represents the Dirac distribution, $*$ denotes convolution, and $\partial_t$ is a symbol for taking the partial derivative with respect to time.
To solve the constrained optimization problem mentioned above, the constrained variational problem is transformed into an unconstrained variational problem by introducing a quadratic penalty term and a Lagrangian multiplier. An augmented Lagrangian function is then constructed to obtain the solution of the unconstrained variational problem, which can be denoted as:
$$L\big(\{u_k\},\{\omega_k\},\lambda\big) = \alpha \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| f(t) - \sum_{k=1}^{K} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\, f(t) - \sum_{k=1}^{K} u_k(t) \right\rangle$$
(2)
where $\alpha$ is a penalty factor that ensures the accuracy after signal reconstruction and $\lambda(t)$ is a Lagrangian multiplier that maintains the strictness of the constraint conditions.
To find the saddle point of Formula (2), $u_k^1$, $\omega_k^1$ and $\lambda^1$ are initialized first. Then, $\hat{u}_k^{n+1}$, $\omega_k^{n+1}$ and $\hat{\lambda}^{n+1}$ can be updated according to Formulas (3) ∼ (5). When Formula (6) is satisfied, the update procedure is terminated.
$$\hat{u}_k^{n+1}(\omega) = \frac{\hat{f}(\omega) - \sum_{i \neq k} \hat{u}_i(\omega) + \hat{\lambda}^{n}(\omega)/2}{1 + 2\alpha\left(\omega - \omega_k^{n}\right)^2}$$
(3)
$$\omega_k^{n+1} = \frac{\int_0^{\infty} \omega \left| \hat{u}_k^{n+1}(\omega) \right|^2 \, d\omega}{\int_0^{\infty} \left| \hat{u}_k^{n+1}(\omega) \right|^2 \, d\omega}$$
(4)
$$\hat{\lambda}^{n+1}(\omega) = \hat{\lambda}^{n}(\omega) + \tau \left( \hat{f}(\omega) - \sum_{k=1}^{K} \hat{u}_k^{n+1}(\omega) \right)$$
(5)
$$\sum_{k=1}^{K} \frac{\left\| \hat{u}_k^{n+1} - \hat{u}_k^{n} \right\|_2^2}{\left\| \hat{u}_k^{n} \right\|_2^2} < \varepsilon$$
(6)
where n denotes the n-th iteration, $\hat{\cdot}$ denotes the Fourier transform of the corresponding quantity, $\tau$ denotes the noise tolerance, which has an impact on the denoising effect, and $\varepsilon$ denotes a threshold used as the basis for terminating the iteration process.

Once the unconstrained variational problem is solved, a series of BIMFs can be generated. Compared with existing EMD-based methods, a unique advantage of the VMD method is that the number of modes K can be specified according to the actual conditions. The parameter K can remarkably affect the decomposition quality; nevertheless, selecting a proper value for K remains a challenging task. Moreover, the penalty factor α also significantly influences the performance of the VMD method. Therefore, it is necessary to find appropriate values for both K and α before implementing the VMD method. In the following, we introduce a novel optimization algorithm that is used to simultaneously optimize the parameters K and α.
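To make the update scheme concrete, the iterations of Formulas (3) ∼ (6) can be sketched in a few dozen lines of NumPy. This is a simplified illustration operating on a one-sided spectrum, without the boundary mirroring and careful initialization of published VMD implementations, not a faithful reimplementation:

```python
import numpy as np

def vmd(signal, K, alpha, tau=0.0, tol=1e-7, max_iter=500):
    """Simplified VMD sketch: ADMM-style updates of Formulas (3)-(6)
    in the frequency domain, on a one-sided spectrum (no mirroring)."""
    T = len(signal)
    f_hat = np.fft.fft(signal)
    f_hat[T // 2:] = 0.0                       # keep positive frequencies only
    freqs = np.fft.fftfreq(T)
    u_hat = np.zeros((K, T), dtype=complex)
    omega = np.linspace(0, 0.25, K)            # initial centre frequencies
    lam = np.zeros(T, dtype=complex)
    pos = slice(1, T // 2)
    for _ in range(max_iter):
        u_prev = u_hat.copy()
        for k in range(K):
            others = u_hat.sum(axis=0) - u_hat[k]
            # Formula (3): Wiener-filter update of mode k
            u_hat[k] = (f_hat - others + lam / 2) / (1 + 2 * alpha * (freqs - omega[k]) ** 2)
            # Formula (4): centre frequency as the power-weighted mean frequency
            power = np.abs(u_hat[k, pos]) ** 2
            omega[k] = (freqs[pos] @ power) / (power.sum() + 1e-14)
        lam = lam + tau * (f_hat - u_hat.sum(axis=0))          # Formula (5)
        diff = (np.abs(u_hat - u_prev) ** 2).sum() / ((np.abs(u_prev) ** 2).sum() + 1e-14)
        if diff < tol:                                         # Formula (6)
            break
    # restore Hermitian symmetry so that each mode is real-valued
    full = u_hat.copy()
    full[:, T // 2 + 1:] = np.conj(u_hat[:, 1:T // 2])[:, ::-1]
    modes = np.real(np.fft.ifft(full, axis=1))
    return modes, omega
```

On a synthetic two-tone signal, the recovered centre frequencies land near the true tone frequencies and the modes sum back to the input, which is the behaviour the iteration above is designed to produce.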

GJO algorithm

The GJO algorithm, inspired by the collaborative hunting behaviour of golden jackals, is a new nature-inspired optimization method. Owing to its high performance in complicated optimization problems, the GJO algorithm has become increasingly popular since its development in 2022 (Najjar et al. 2022; Lu et al. 2023). In principle, the GJO algorithm consists of three basic stages: the initialization stage, exploration stage and exploitation stage. In the initialization stage, as a population-based method, the GJO algorithm needs to initialize a random prey population to obtain candidate solutions that are uniformly distributed over the search space. In the exploration stage, the behaviour of golden jackals searching for prey is simulated by constructing an exploration strategy. In the last stage, the GJO algorithm describes how and when the jackals enclose and pounce on the prey according to the prey's energy evasion. More information about the development process of the GJO algorithm can be found in the relevant literature (Nitish & Muhammad 2022).

To optimize the parameters of the VMD method, a fitness function needs to be defined in advance. In this paper, the envelope entropy is used as the fitness function. The envelope entropy represents the sparsity of the original signal (Ma et al. 2019; Chang et al. 2022). The larger the envelope entropy value is, the more complicated the BIMFs obtained from the VMD method are. In this context, the parameter combination (K, α) corresponding to the minimum envelope entropy is the optimal parameter combination. For a signal with a mean of zero, its envelope entropy can be calculated as follows:
$$E_p = -\sum_{j=1}^{N} p_j \log p_j, \qquad p_j = \frac{a(j)}{\sum_{j=1}^{N} a(j)}$$
(7)
where $p_j$ is the normalized form of $a(j)$, $a(j)$ is the envelope signal obtained by Hilbert demodulation of the signal $x(j)$, and N is the signal length.
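As a concrete illustration, the envelope entropy of Formula (7) can be computed with a NumPy sketch (the analytic signal is built via the FFT, equivalent to `scipy.signal.hilbert`; the exact preprocessing used in the paper may differ):

```python
import numpy as np

def envelope_entropy(x):
    """Envelope entropy of Formula (7): Shannon entropy of the
    normalized Hilbert envelope of a zero-mean signal."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    # analytic signal via the FFT (equivalent to scipy.signal.hilbert)
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1.0
    h[1:(N + 1) // 2] = 2.0
    if N % 2 == 0:
        h[N // 2] = 1.0
    envelope = np.abs(np.fft.ifft(X * h))      # a(j)
    p = envelope / envelope.sum()              # p_j
    return float(-np.sum(p * np.log(p + 1e-12)))
```

A sparse, impulsive envelope (feature-rich mode) yields a low entropy, while a near-constant envelope yields an entropy close to the maximum log N, which is why the optimization minimizes this quantity.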

Sliding window sampling

In this paper, the samples of prediction models were constructed using a sliding window model. To evaluate the effectiveness of the proposed method, both single-step and multistep prediction were considered in this study. In general, there are at least four commonly used strategies for making multistep forecasts: direct multistep forecasts, recursive multistep forecasts, direct recursive hybrid multistep forecasts and multiple-output forecasts. However, all of the above strategies have inherent drawbacks that may degrade the prediction performance of models. A direct multistep forecast strategy needs to create multiple models to achieve multistep prediction, resulting in a high computational load. For the recursive multistep forecast strategy, the accumulated errors may increase when the time series to be predicted is long, which can easily lead to a rapid decline in performance. The direct recursive hybrid multistep forecast strategy seeks to combine the best properties of both the recursive and direct forecast strategies, but this strategy cannot fully utilize all the observed values for prediction due to defects in the framework. The multiple-output forecast strategy has no error accumulation problem; however, this strategy does not consider the temporal relationships among the output data. Therefore, this strategy may cause information loss, thereby leading to an increase in prediction errors.

In the above context, a seq2seq strategy derived from a multiple-output forecast strategy was adopted to implement multistep prediction. The seq2seq strategy accounts for the temporal dependence among labels, thereby effectively utilizing the temporal relationships among output data. Therefore, the seq2seq strategy can overcome the drawback of the multiple-output strategy, which may be beneficial for improving the prediction performance. To implement the seq2seq strategy, a seq2seq structure should be constructed. The seq2seq structure consists of an encoder and a decoder, as shown in Figure 1. Typically, the encoder and decoder can be an RNN-based model (e.g. LSTM, GRU and BiGRU). Based on the seq2seq structure, multistep prediction can be achieved by constructing samples with multiple inputs and outputs.
Figure 1

Seq2seq structure.

Figure 2 shows a specific example of sample construction for both single-step and multistep prediction. The data structure of the samples for both single-step and multistep prediction is presented in Table 1. In Figure 2 and Table 1, $x_i$ denotes the data point at time i.
Table 1

Data structure of samples

Prediction horizon | Inputs | Outputs
Single-step prediction | $x_{i-n}$, … , $x_{i-2}$, $x_{i-1}$ | $x_i$
Multistep prediction | $x_{i-n}$, … , $x_{i-2}$, $x_{i-1}$ | $x_i$, $x_{i+1}$, … , $x_{i+m-1}$
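The sampling scheme of Table 1 can be sketched as follows (a minimal illustration; `n_in=12` matches the window length used later in the paper, and `n_out` switches between single-step and 24-step samples):

```python
import numpy as np

def make_samples(series, n_in=12, n_out=1):
    """Build (inputs, outputs) pairs with a sliding window, as in Table 1.
    n_out=1 gives single-step samples; n_out=24 gives seq2seq multistep samples."""
    X, Y = [], []
    for i in range(n_in, len(series) - n_out + 1):
        X.append(series[i - n_in:i])     # inputs: x_{i-n}, ..., x_{i-1}
        Y.append(series[i:i + n_out])    # outputs: x_i, ..., x_{i+m-1}
    return np.array(X), np.array(Y)
```

For example, a 100-point series with a 12-point window yields 88 single-step samples but only 65 samples when 24 steps ahead must be covered by each label.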
Figure 2

An example of sample construction.


BiGRU model

A GRU is a commonly used RNN architecture that can effectively process sequence data. As in LSTM, a gating mechanism is adopted in standard GRUs to address the vanishing and exploding gradient problems. The structure of a GRU is similar to that of an LSTM network, but only two gates are retained: an update gate and a reset gate. As a consequence, the GRU has fewer parameters and a lower computational load than the LSTM. In a GRU, the update gate determines how much past information (from previous time steps) is transmitted to the next time step, hence controlling the integration of new information; the reset gate controls the degree of mixing between the current state and the previous state, which determines the retention of historical information (Zhang et al. 2022). The relevant formulas for the GRU are as follows:
$$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right)$$
(8)
$$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right)$$
(9)
$$\tilde{h}_t = \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right)$$
(10)
$$h_t = \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
(11)
where $r_t$ represents the reset gate, $z_t$ represents the update gate, $W_r$, $W_z$, $W_h$, $U_r$, $U_z$ and $U_h$ denote the related weight matrices, $b_r$, $b_z$ and $b_h$ denote the related biases, $\tilde{h}_t$ is the candidate hidden state at time t, $h_t$ is the hidden state at time t, $\sigma$ and $\tanh$ represent the sigmoid and hyperbolic tangent activation functions, respectively, and $\odot$ represents element-wise multiplication.
The BiGRU is an extension of the GRU that consists of two independent GRU layers. One GRU layer processes the sequence in the forward direction, while the other processes it in the reverse direction. This bidirectional structure enables the BiGRU to capture long-distance dependencies in the sequence, thereby improving the prediction performance. BiGRUs generally achieve better prediction performance on nonlinear time series than traditional GRUs (Chen et al. 2021). Figure 3 illustrates the structure of the BiGRU.
Figure 3

Diagram of the BiGRU structure.


In Figure 3, the labels on the connections denote the related weights.
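A minimal NumPy sketch of Formulas (8) ∼ (11) and the bidirectional structure of Figure 3 can be written as follows. This uses one common gating convention (deep learning frameworks such as PyTorch order the gates slightly differently) and random, untrained weights, purely to illustrate the data flow:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, p):
    """One GRU step following Formulas (8)-(11)."""
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])             # reset gate
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])             # update gate
    h_cand = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate state
    return (1.0 - z) * h_prev + z * h_cand                              # new hidden state

def bigru(xs, p_fwd, p_bwd):
    """Run two independent GRUs over the sequence (forward and backward)
    and concatenate their hidden states at each time step."""
    d_h = p_fwd["br"].shape[0]
    h_f = np.zeros(d_h)
    h_b = np.zeros(d_h)
    fwd, bwd = [], []
    for x in xs:                      # forward pass
        h_f = gru_cell(x, h_f, p_fwd)
        fwd.append(h_f)
    for x in xs[::-1]:                # backward pass
        h_b = gru_cell(x, h_b, p_bwd)
        bwd.append(h_b)
    bwd.reverse()                     # realign with forward time order
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])
```

Each output row concatenates a forward and a backward hidden state, so a hidden size of d yields 2d features per time step, as sketched in Figure 3.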

Multihead attention mechanism (Multi-Attn)

The attention mechanism simulates the human ability to focus on important information at specific moments. Among a large quantity of input data, usually only a small subset of features is crucial. By introducing attention mechanisms, a prediction model can focus on these important parts, thereby improving its overall performance.

Multi-Attn is an extended form of the attention mechanism that can more effectively extract information from sequence data (Nei et al. 2022). The specific procedures for introducing Multi-Attn into the BiGRU model can be described as follows:

  • (1) The output vector obtained from the BiGRU is first projected to different spaces (i.e., the query, key and value spaces) through linear transformations, thereby obtaining the matrices $Q$, $K$ and $V$, respectively. Afterwards, $Q$, $K$ and $V$ are split into h heads to create the matrices $Q_i$, $K_i$ and $V_i$ ($i = 1, 2, \ldots, h$). The attention output can be calculated by:
    $$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$
    (12)
    where $\mathrm{head}_i$ denotes the attention output vector corresponding to the i-th head, $d_k$ denotes the dimension of the per-head matrices $Q_i$, $K_i$ and $V_i$, and $\mathrm{softmax}(\cdot)$ denotes the normalization of the attention scores.
  • (2) The attention outputs of all the heads are concatenated according to Formula (13). Afterwards, the final result can be obtained through a linear transformation.
    $$A = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^{O}$$
    (13)
    where $A$ is the final result obtained through Multi-Attn, $\mathrm{Concat}(\cdot)$ denotes the concatenation operation, and $W^{O}$ represents the weight matrix of the linear transformation.
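The two steps above can be sketched in NumPy as follows. This is an illustration that assumes the model dimension is divisible by the number of heads and realizes the per-head projections by slicing a single projection matrix, a common simplification:

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(H, Wq, Wk, Wv, Wo, h=4):
    """Multi-Attn over a BiGRU output H of shape (T, d_model),
    following Formulas (12) and (13)."""
    T, d_model = H.shape
    d_k = d_model // h
    Q, K, V = H @ Wq, H @ Wk, H @ Wv             # project to query/key/value spaces
    heads = []
    for i in range(h):                           # split into h heads by slicing
        sl = slice(i * d_k, (i + 1) * d_k)
        scores = softmax(Q[:, sl] @ K[:, sl].T / np.sqrt(d_k))   # Formula (12)
        heads.append(scores @ V[:, sl])
    return np.concatenate(heads, axis=1) @ Wo    # Formula (13)
```

Each row of the softmax output sums to one, so every time step receives a convex combination of the value vectors, weighted by query-key similarity.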

GJO-VMD-BiGRU-Multi-Attn model

First, the GJO algorithm was used to obtain the optimal values of K and α for the VMD method according to the minimum envelope entropy. Second, the water demand series data were decomposed into a series of BIMFs using the VMD method with the optimal parameters. Then, for each BIMF, a BiGRU-Multi-Attn model was constructed for both single-step and multistep prediction. To obtain appropriate hyperparameter combinations, a single hyperparameter was adjusted while the other hyperparameters were fixed. Furthermore, the seq2seq strategy was adopted for multistep prediction, which contributes to avoiding the error accumulation problem and improving the prediction performance. Finally, the final prediction results can be obtained by accumulating the forecasts of all the BIMFs. Figure 4 shows the flowchart of the GJO-VMD-BiGRU-Multi-Attn model for water demand prediction.
Figure 4

Flowchart of the GJO-VMD-BiGRU-Multi-Attn model for prediction.


Prediction performance criteria

Numerous indicators are used to evaluate prediction performance (Donkor et al. 2014). In this study, four commonly used indicators were selected for performance evaluation: the absolute percentage error (APE), mean absolute percentage error (MAPE), root mean squared error (RMSE) and coefficient of determination (R²). In practice, the lower the APE, MAPE and RMSE are, the better the prediction models perform, while the opposite is true for R². The above indicators can be calculated by the following formulas (Huang et al. 2023):
$$\mathrm{APE}_i = \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\%$$
(14)
$$\mathrm{MAPE} = \frac{1}{m} \sum_{i=1}^{m} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\%$$
(15)
$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left(y_i - \hat{y}_i\right)^2}$$
(16)
$$R^2 = 1 - \frac{\sum_{i=1}^{m} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{m} \left(y_i - \bar{y}\right)^2}$$
(17)
where $y_i$ and $\hat{y}_i$ represent the observed values and the predicted values, respectively, m is the number of observed values, and $\bar{y}$ is the mean of the observed values.
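The four criteria of Formulas (14) ∼ (17) translate directly into NumPy (a straightforward sketch; it assumes no observed value is zero, since APE and MAPE divide by $y_i$):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """APE, MAPE, RMSE and R2 as defined in Formulas (14)-(17)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ape = np.abs((y_true - y_pred) / y_true) * 100          # Formula (14), per point
    mape = ape.mean()                                       # Formula (15)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))         # Formula (16)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                              # Formula (17)
    return ape, mape, rmse, r2
```

For instance, observed demands (100, 200, 400) against predictions (110, 190, 400) give a MAPE of 5% and an R² just under 0.996.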

Data description

The dataset used in this study was obtained from a real-world water distribution system. The water distribution system is located in the Qinzhou region of southern China, as shown in Figure 5. The water distribution system served a population of approximately 140,000 in 2018. Overall, the water distribution system mainly provides water for residential areas and for some industrial areas. The data were collected from March 5, 2018, to September 30, 2018, with a recording interval of 1 h. Therefore, the dataset contains 5,040 observations in total.
Figure 5

Location of the case study area.


In this study, the length of the sliding window was set to 12 based on a trial and error procedure. Specifically, we used water demand data from the previous 12 h to predict future water demand. Note that the length of the sliding window can differ depending on the characteristics of the dataset. To further evaluate the performance of the proposed method, a 24-h prediction was also investigated. According to the approach for sample construction illustrated in Table 1, for each BIMF, approximately 5,000 samples can be obtained for both single-step and multistep prediction. The samples for each BIMF were divided into a training set, a validation set and a testing set at a ratio of 7:1.5:1.5. Therefore, the training set contains approximately 3,500 samples, while both the validation set and the testing set contain approximately 750 samples.
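The chronological 7:1.5:1.5 split described above can be expressed as follows (a sketch with a placeholder count of 5,000 samples standing in for the real per-BIMF sample array):

```python
import numpy as np

n_samples = 5000                      # approximate per-BIMF sample count
n_train = int(n_samples * 0.70)       # 70% for training
n_val = int(n_samples * 0.15)         # 15% for validation

samples = np.arange(n_samples)        # placeholder for the real sample array
train = samples[:n_train]             # earliest 3,500 samples
val = samples[n_train:n_train + n_val]
test = samples[n_train + n_val:]      # most recent samples held out for testing
```

Keeping the split chronological (rather than shuffled) avoids leaking future demand values into the training set.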

Optimization using the GJO algorithm

To carry out the GJO algorithm effectively, several important parameters need to be appropriately specified in advance. These parameters include the maximum number of iterations, population size, and search spaces for both K and α. Based on the analyses in the related literature (Nitish & Muhammad 2022), the maximum number of iterations and population size were set to 20 and 30, respectively, in this study. The search space for K was in the range of [3,10], while the search space for α ranged from 100 to 2,500. Afterwards, the VMD method can be optimized using the GJO algorithm. Table 2 presents the parameter settings for the GJO algorithm. Figure 6 shows the fitness curve of the optimization.
Table 2

Summary of parameter settings for the GJO algorithm

Parameter | Value
Maximum number of iterations | 20
Population size | 30
Search space for K | [3, 10]
Search space for α | [100, 2,500]
Figure 6

Fitness curve of the GJO optimization.


As shown in Figure 6, the best fitness (i.e., 8.3546) was achieved when the number of iterations reached 6, at which point the optimal results were obtained. Eventually, corresponding to the best fitness, the optimal value of K was 10, while the optimal value of α was 640.
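The shape of this optimization loop can be illustrated with a simple population-based search used as a stand-in for GJO; the toy fitness surface and the resample-around-the-incumbent rule below are illustrative assumptions, not the jackal-pair update rules of the actual algorithm (see Nitish & Muhammad 2022):

```python
import numpy as np

rng = np.random.default_rng(42)

def fitness(K, alpha):
    # Toy surrogate for illustration only; in the paper the fitness is the
    # minimum envelope entropy of the BIMFs produced by VMD with (K, alpha).
    return (K - 7) ** 2 / 10 + (alpha - 800) ** 2 / 1e6

pop_size, max_iter = 30, 20                   # settings from Table 2
K_lo, K_hi, a_lo, a_hi = 3, 10, 100, 2500     # search spaces from Table 2
pop = np.column_stack([rng.integers(K_lo, K_hi + 1, pop_size),
                       rng.uniform(a_lo, a_hi, pop_size)])
best, best_fit = (K_lo, float(a_lo)), np.inf
for _ in range(max_iter):
    for K, alpha in pop:
        f = fitness(int(K), alpha)
        if f < best_fit:
            best, best_fit = (int(K), alpha), f
    # Resample around the incumbent. The real GJO instead moves the whole
    # population with exploration/exploitation (hunting) update rules.
    pop = np.column_stack([
        np.clip(rng.integers(best[0] - 1, best[0] + 2, pop_size), K_lo, K_hi),
        np.clip(rng.normal(best[1], 200.0, pop_size), a_lo, a_hi)])
```

With 20 iterations of 30 candidates, the incumbent (K, α) homes in on the minimum of the toy fitness, mirroring how the real GJO run converged within six iterations in Figure 6.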

Decomposition using the VMD method

Once the appropriate parameters of the VMD method were determined, the decomposition procedure was applied. Figure 7 shows the decomposition results obtained using the optimized VMD method. In Figure 7, most of the BIMFs are relatively stable with low amplitudes (e.g., within the range [−500, 500]). Moreover, no significant mode mixing can be found in any of the BIMFs. These observations confirm the effectiveness of the optimized VMD method, which provides a good foundation for subsequent prediction.
Figure 7

Decomposition results.


Results

Since 10 BIMFs were obtained from decomposition, 10 different BiGRU-Multi-Attn models were constructed in this study. To achieve the desired forecasts, the hyperparameters of each BiGRU-Multi-Attn model can differ according to the characteristics of the corresponding BIMF. Generally, three methods are commonly used for hyperparameter adjustment: grid search, random search and intelligent optimization methods. However, hyperparameter adjustment is still inflexible and time-consuming, which makes this task challenging. In this study, we adjusted a single hyperparameter while the other hyperparameters were fixed; according to the minimum MAPE on the validation set, an acceptable value for each hyperparameter can be obtained.

Regarding the BiGRU-Multi-Attn model, the hyperparameters that usually need to be adjusted include the number of GRU layers, the number of hidden units, the learning rate, the number of epochs, the batch size, the dropout rate and the number of attention heads. Theoretically, the more GRU layers and hidden units there are, the better the model performs. However, this approach also means a higher computational load. In addition, too many GRU layers and hidden units may easily lead to overfitting, thereby deteriorating the model performance. Given this point, the number of GRU layers was set to 1, while the number of hidden units was set to 32. To avoid model overfitting, the dropout technique was adopted. In this study, we set the dropout rate to 0.1 to utilize this technique. The major parameters that need to be specified are presented in Table 3.

Table 3

Parameter settings of the BiGRU-Multi-Attn models for each BIMF

Parameter | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 | M9 | M10
Number of BiGRU layers | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
Number of hidden units | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32
Learning rate | 0.007 | 0.005 | 0.006 | 0.007 | 0.004 | 0.002 | 0.005 | 0.003 | 0.007 | 0.0022
Optimizer | Adam | Adam | Adam | Adam | Adam | Adam | Adam | Adam | Adam | Adam
Number of epochs | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128
Batch size | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32
Dropout rate | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1
Number of attention heads | | | | | | | | | |

Note: Mi (i = 1, 2, …, 10) denotes the corresponding model for the i-th BIMF.

To evaluate the effectiveness of the GJO-VMD-BiGRU-Multi-Attn model, we compared it with the BiGRU, BiGRU-Multi-Attn and VMD-BiGRU-Multi-Attn models in terms of single-step and multistep prediction. In addition, to verify the performance of the GJO algorithm in VMD optimization, we also compared the proposed method with other optimized VMD-based methods, in which two previously reported optimization algorithms, particle swarm optimization (PSO) and the whale optimization algorithm (WOA) (Qi et al. 2022; Wang et al. 2023a, 2023b), were used to optimize the VMD method. For clarity, these two methods are denoted as the PSO-VMD-BiGRU-Multi-Attn model and the WOA-VMD-BiGRU-Multi-Attn model.

Like an ANN, the BiGRU-based model can also be regarded as a black box that can yield different forecasts for each execution. Therefore, to evaluate the performance of BiGRU-based methods more reasonably, we implemented each BiGRU-based model 10 times. The best results were used as the final values for performance evaluation. To evaluate the model performance more comprehensively, two prediction scenarios were designed. Scenario 1 is a single-step prediction scenario in which one-step prediction can be implemented. Scenario 2 is a multistep prediction scenario in which 24-step predictions are carried out simultaneously via the seq2seq strategy. Note that all the prediction models in this study were implemented in the PyTorch environment.

Scenario 1

Table 4 summarizes the forecasting performance obtained by applying the models to a testing set.

Table 4

Forecasting performance for single-step prediction

Model | MAPE (%) | RMSE (m³/h) | R² | APE standard deviation (%) | APE maximum value (%)
BiGRU | 2.51 | 148 | 0.896 | 1.93 | 11.88
BiGRU-Multi-Attn | 0.75 | 55 | 0.985 | 0.92 | 7.08
VMD-BiGRU-Multi-Attn | 1.27 | 75 | 0.973 | 1.03 | 6.31
GJO-VMD-BiGRU-Multi-Attn | 0.45 | 25 | 0.997 | 0.34 | 2.02
PSO-VMD-BiGRU-Multi-Attn | 0.47 | 27 | 0.996 | 0.36 | 1.96
WOA-VMD-BiGRU-Multi-Attn | 1.19 | 69 | 0.977 | 0.92 | 4.91

Note: The GJO-VMD-BiGRU-Multi-Attn model is the proposed method.

Bold values denote the best results among the methods used for comparison.

According to the results depicted in Table 4, the following conclusions can be drawn. First, the proposed method achieved the best performance in terms of the MAPE, RMSE and R², indicating that it yields more accurate and stable forecasts than the other methods. Strikingly, the MAPE of the proposed method was only 0.45%, an extremely high accuracy, and its R² was as high as 0.997, demonstrating excellent fitting ability.

Second, the BiGRU-Multi-Attn model significantly improved all the indicators compared with the BiGRU model. For example, the MAPE decreased dramatically from 2.51 to 0.75%, a reduction of approximately 70%, and the RMSE fell from 148 to 55 m³/h, a remarkable reduction of more than 60%. These significant improvements are a direct consequence of introducing Multi-Attn: with the attention mechanism, the BiGRU model can exploit more of the useful information among the inputs, thereby substantially increasing the model performance.

Third, the proposed method and the PSO-VMD-BiGRU-Multi-Attn model performed much better than the BiGRU-Multi-Attn model on all the indicators. At the same time, however, the BiGRU-Multi-Attn model had lower MAPE and RMSE values than the other two VMD-based models (i.e., the VMD-BiGRU-Multi-Attn and WOA-VMD-BiGRU-Multi-Attn models). This suggests, on the one hand, that effective decomposition can contribute to improving the model performance and, on the other hand, that undesirable decomposition results may instead deteriorate it; in other words, the model performance is sensitive to the decomposition results. In this context, it can be inferred that the GJO algorithm is effective for optimizing the VMD method.
Finally, the overall performance of the PSO-VMD-BiGRU-Multi-Attn model was very close to that of the proposed method; on a particular indicator (e.g., the maximum APE), it was even better. Nevertheless, the proposed method held the advantage on most indicators (e.g., the MAPE, RMSE, R2 and standard deviation of the APE). Based on the above analyses, it may be safely concluded that the proposed method yields exceptionally accurate and stable forecasts and outperforms the other compared methods in terms of both prediction accuracy and stability.
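For reference, the three indicators discussed above can be computed as follows (a minimal sketch over NumPy arrays of observed and predicted demands; this is an illustration, not the study's code):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (lower is better)."""
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))

def rmse(y_true, y_pred):
    """Root mean squared error, in the data's units, here m3/h (lower is better)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (closer to 1 is better)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

Note that MAPE is scale-free (useful for comparing across demand levels), while RMSE penalizes large absolute errors more heavily, which is why both are reported.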

Figure 8 gives the regression results for the test data. Furthermore, some representative forecasts were selected as examples. Figure 9(a) compares the prediction accuracy of the BiGRU, BiGRU-Multi-Attn and VMD-BiGRU-Multi-Attn models with that of the proposed method, while Figure 9(b) compares the proposed method with the PSO-VMD-BiGRU-Multi-Attn and WOA-VMD-BiGRU-Multi-Attn models.
Figure 8

Regression results for test data.
Figure 9

Water demand forecasts for single-step prediction. (a) Comparisons among BiGRU-based models and (b) comparisons among VMD-BiGRU-Multi-Attn-based models.

According to Figure 8, the regression value R is as high as 0.9985, which again confirms that the proposed model has good fitting ability. As shown in Figure 9, the prediction curves of the BiGRU-Multi-Attn and PSO-VMD-BiGRU-Multi-Attn models and the proposed method closely follow the real curve, whereas those of the other models deviate from it over several periods. The APE curves offer another perspective on the model performance: a smooth APE curve usually indicates stable forecasts, while low APE values indicate accurate forecasts. According to Figure 9, the APE curves of both the proposed method and the PSO-VMD-BiGRU-Multi-Attn model are relatively smooth, and their APE values remain consistently low (e.g., most are below 1%). In contrast, the APE curves of the other models, especially those of the WOA-VMD-BiGRU-Multi-Attn and BiGRU models, fluctuate strongly, with universally high APE values. These observations confirm the excellent prediction performance of the proposed method, which is consistent with the conclusions drawn from Table 4.
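The accuracy/stability reading of the APE curves can be made concrete with a short sketch: the mean of the APE series reflects accuracy, and its standard deviation reflects smoothness, i.e., stability (the demand values below are hypothetical, not the study's data):

```python
import numpy as np

def ape(y_true, y_pred):
    """Absolute percentage error at each time step, in percent."""
    return 100.0 * np.abs((y_true - y_pred) / y_true)

# Hypothetical hourly demands (m3/h) and forecasts for illustration only.
y_true = np.array([1000.0, 1050.0, 980.0, 1020.0])
y_pred = np.array([1005.0, 1045.0, 985.0, 1015.0])

errors = ape(y_true, y_pred)
# Low mean -> accurate forecasts; low standard deviation -> a smooth
# APE curve, i.e., stable forecasts.
print(errors.mean(), errors.std())
```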

Scenario 2

This scenario provides insight into how the proposed method performs in multistep prediction. Table 5 summarizes the prediction performance of the models, and Figure 10 shows a representative example of 24-step prediction.
Table 5

Forecasting performance for 24-step prediction

Model                      MAPE (%)  RMSE (m3/h)  R2
BiGRU                      3.23      194          0.820
BiGRU-Multi-Attn           3.21      193          0.823
VMD-BiGRU-Multi-Attn       3.85      225          0.759
GJO-VMD-BiGRU-Multi-Attn   2.12      126          0.924
PSO-VMD-BiGRU-Multi-Attn   2.24      135          0.913
WOA-VMD-BiGRU-Multi-Attn   2.71      156          0.884
Figure 10

A good example of 24-step prediction. (a) Comparisons among BiGRU-based models and (b) comparisons among VMD-BiGRU-Multi-Attn-based models.

Comparing Tables 4 and 5 shows that the prediction performance of all the models deteriorated to different degrees. For example, the MAPE of the proposed method increased from 0.45 to 2.12%, roughly a fivefold increase, and its RMSE likewise grew by roughly a factor of five (from 25 to 126 m3/h). These observations suggest that as the prediction step increases, the prediction accuracy and stability of the models inevitably decrease. However, several interesting findings deserve further analysis. Although increasing the number of prediction steps degrades model performance, the prediction performance remains acceptable: all the MAPE values in Table 5 are below 5%, and, notably, those of the proposed method and the PSO-VMD-BiGRU-Multi-Attn model are only slightly greater than 2%, with R2 values above 0.9. These achievements can be partly attributed to the seq2seq strategy, which enables multistep prediction without error accumulation and alleviates performance deterioration by capturing the time dependence among the outputs.
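The error-accumulation contrast can be illustrated with a deliberately simplified toy (a biased one-step "model", not the paper's BiGRU): a recursive strategy feeds each forecast back as the next input, so per-step errors compound over the 24-step horizon, whereas a seq2seq (direct) strategy emits all 24 outputs from the observed inputs in one pass:

```python
bias = 1.01          # hypothetical 1% multiplicative error per model call
true_value = 100.0   # assume the true demand stays constant for clarity
horizon = 24

# Recursive multistep: each output becomes the next input, so the
# 1% bias compounds to (1.01**24 - 1) ~ 27% by the last step.
recursive = true_value
recursive_errors = []
for _ in range(horizon):
    recursive *= bias
    recursive_errors.append(abs(recursive - true_value))

# Direct (seq2seq-style) multistep: every step is predicted from the
# observed inputs, so each step carries only the single-call error.
direct_errors = [abs(true_value * bias - true_value) for _ in range(horizon)]

print(recursive_errors[-1] > direct_errors[-1])  # True: recursive error compounds
```

This is only a caricature of the mechanism; the actual seq2seq decoder also models dependence among the outputs, which the toy does not capture.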

According to Figure 10, and similar to Scenario 1, the prediction curves of the proposed method and the PSO-VMD-BiGRU-Multi-Attn model fit the real curve better than those of the other methods, indicating that these two methods yield more accurate forecasts. Moreover, their APE values fluctuated at relatively low levels: although a few slightly exceeded 5%, in most cases they were below 2%. This finding suggests that even in 24-step prediction, the proposed method and the PSO-VMD-BiGRU-Multi-Attn model still achieve high accuracy.

In light of the results shown in Table 5 and Figure 10, we can safely draw the following conclusions: (1) for multistep prediction, by exploiting the seq2seq strategy, the proposed method maintains a high accuracy; and (2) similar to single-step prediction, the proposed method is superior to other methods in multistep prediction in terms of the prediction accuracy and stability.

In this study, we investigated the potential of a novel method for STWDP. A dataset obtained from a real-world water distribution system was used to evaluate the effectiveness of the proposed method, and a comparison with other BiGRU-based methods was conducted. Based on the results, the following conclusions can be drawn: (1) the proposed method obtains highly accurate and stable forecasts in single-step prediction and also achieves reliable performance in 24-step prediction; overall, it consistently outperforms the other BiGRU-based methods in both prediction scenarios; (2) the GJO algorithm is effective at optimizing the VMD method, providing a solid foundation for improving the prediction performance, and both the optimized VMD method and Multi-Attn contribute to enhancing it; and (3) the seq2seq strategy improves multistep prediction performance by preventing error accumulation, which is beneficial for real-time optimal scheduling in water distribution systems.

Although the proposed method achieved satisfactory results in this study, several limitations merit attention. For instance, more models need to be constructed than in traditional methods where decomposition is not considered, and, as a result, additional hyperparameters must be tuned. Given that hyperparameter tuning is usually time-consuming for deep learning methods, this may hinder the practical application of the proposed method. Moreover, all the BiGRU models corresponding to the BIMFs used the same number of inputs for single-step and multistep prediction, even though the appropriate number of inputs may differ according to the characteristics of the dataset; the effect of the number of inputs on model performance was not examined. These limitations require further investigation.
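The extra modeling burden stems from the 'decomposition–prediction' design: one predictor per decomposed mode, with the final forecast formed by summing the per-mode forecasts. The following schematic sketch shows this structure; the decomposition and per-mode models are trivial placeholders, not the paper's VMD/BiGRU components:

```python
import numpy as np

def decompose(signal, k=3):
    """Placeholder for VMD: split the signal into k additive modes.
    (A trivially equal split, only to show that the modes sum back.)"""
    return np.tile(signal / k, (k, 1))

def persistence_forecast(mode):
    """Placeholder per-mode model: repeat the mode's last value.
    In the paper, each mode gets its own trained BiGRU instead."""
    return mode[-1]

signal = np.array([100.0, 102.0, 101.0, 103.0])  # hypothetical demands
modes = decompose(signal)                         # k models to build/tune
forecast = sum(persistence_forecast(m) for m in modes)  # aggregate modes
print(forecast)
```

Because every mode carries its own model, each with its own hyperparameters, the tuning effort scales with k, which is the practical cost the limitation above refers to.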

This work is supported by the Middle-Aged and Young Teachers' Basic Ability Promotion Project of Guangxi (Grant No. 2024KY146) and the Vocational Education Teaching Reform Research Project of Guangxi (Grant No. GXGZJG2022B102).

H.H. contributed to conceptualization, formal analysis, investigation, software, writing – original draft. M.W. contributed to funding acquisition, methodology, validation, resources, supervision, writing – review & editing.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Adamowski, J. F. 2008 Peak daily water demand forecast modeling using artificial neural networks. J. Water Resour. Plann. Manage. 134 (2), 119–128.

Bougadis, J., Adamowski, K. & Diduch, R. 2005 Short-term municipal water demand forecasting. Hydrol. Processes 19 (1), 137–148.

Chang, H., Gao, F., Li, Y., Wei, X., Gao, C. & Chang, L. 2022 An optimized VMD method for predicting milling cutter wear using vibration signal. Machines 10 (7), 548.

Chen, W., Qi, W., Li, Y., Zhang, J., Zhu, F., Xie, D., Ru, W., Luo, G., Song, M. & Tang, F. 2021 Ultra-short-term wind power prediction based on bidirectional gated recurrent unit and transfer learning. Front. Energy Res. 9, 1–6.

Cho, K., Van Merrienboer, B., Bahdanau, D. & Bengio, Y. 2014 On the properties of neural machine translation: Encoder-decoder approaches. Preprint, submitted September 3, 2014.

Donkor, E., Mazzuchi, T., Soyer, R. & Roberson, J. 2014 Urban water demand forecasting: Review of methods and models. J. Water Resour. Plann. Manage. 140 (2), 146–159.

Dragomiretskiy, K. & Zosso, D. 2014 Variational mode decomposition. IEEE Trans. Signal Process. 62 (3), 531–544.

Ghasempour, R., Azamathulla, H. M. & Roushangar, K. 2021 EEMD- and VMD-based hybrid GPR models for river streamflow point and interval predictions. Water Supply 21 (7), 3960–3975.

Ghiassi, M., Zimbra, D. K. & Saidane, H. 2008 Urban water demand forecasting with a dynamic artificial neural network model. J. Water Resour. Plann. Manage. 134 (2), 138–146.

Guo, G., Liu, S., Wu, Y., Zhou, R. & Zhu, X. 2018 Short-term water demand forecast based on deep learning method. J. Water Resour. Plann. Manage. 144 (12), 04018076.

Hochreiter, S. & Schmidhuber, J. 1997 Long short-term memory. Neural Comput. 9 (8), 1735–1780.

Mu, L., Zheng, F., Tao, R., Zhang, Q. & Kapelan, Z. 2020 Hourly and daily urban water demand predictions using a long short-term memory based model. J. Water Resour. Plann. Manage. 146 (9), 05020017.

Nei, L., Xu, S., Zhang, L., Yin, Y., Dong, Z. & Zhou, X. 2022 Remaining useful life prediction of aeroengines based on multi-head attention mechanism. Machines 10 (7), 552.

Pu, Z., Yan, J., Chen, L., Li, Z., Tian, W., Tao, T. & Xin, K. 2023 A hybrid Wavelet-CNN-LSTM deep learning model for short-term urban water demand forecasting. Front. Env. Sci. Eng. 17 (2), 22.

Pulido-Calvo, I. & Gutierrez-Estrada, J. C. 2009 Improved irrigation water demand forecasting using a soft-computing hybrid model. Biosyst. Eng. 102 (2), 202–218.

Qi, P., Chang, J., Chen, X., Wang, T. & Wu, M. 2022 Identification of rock properties of rock wall cut by roadheader based on PSO-VMD-LSSVM. Front. Earth Sc. 10, 1–12.

Wang, K., Ye, Z., Wang, Z., Liu, B. & Feng, T. 2023a MACLA-LSTM: A novel approach for forecasting water demand. Sustainability 15 (4), 3628.

Yasar, A., Bilgili, M. & Simsek, E. 2012 Water demand forecasting based on stepwise multiple nonlinear regression analysis. Arab. J. Sci. Eng. 37 (8), 2333–2341.

Zhang, J., Jiang, Y., Wu, S., Li, X. & Luo, H. 2022 Prediction of remaining useful life based on bidirectional gated recurrent unit with temporal self-attention mechanism. Reliab. Eng. Syst. Saf. 221 (4), 108297.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).