## ABSTRACT

Short-term water demand prediction is crucial for real-time optimal scheduling and leakage control in water distribution systems. This paper proposes a new deep learning-based method for short-term water demand prediction. The proposed method consists of four main parts: the variational mode decomposition method, the golden jackal optimization algorithm, the multihead attention mechanism, and the bidirectional gated recurrent unit (BiGRU) model. Furthermore, a seq2seq strategy was adopted for multistep prediction to avoid the error accumulation problem. Hourly water demand data collected from a real-world water distribution system were applied to investigate the potential of the proposed method. The results show that the proposed method can yield remarkably accurate and stable forecasts in single-step prediction (i.e., the mean absolute percentage error (MAPE) reaches 0.45%, and the root mean squared error (RMSE) is 25 m^{3}/h). Moreover, the proposed method still achieves credible performance in 24-step prediction (i.e., the MAPE reaches 2.12%, and the RMSE is 126 m^{3}/h). In general, for both single-step prediction and multistep prediction, the proposed method consistently outperforms other BiGRU-based methods. These findings suggest that the proposed method can provide a reliable alternative for short-term water demand prediction.

## HIGHLIGHTS

Based on the ‘decomposition–prediction’ strategy, an attention mechanism-based framework was proposed.

A new nature-inspired optimization method was used to optimize the parameters of the variational mode decomposition method, contributing to improving the decomposition quality.

A seq2seq strategy was adopted for multistep prediction to avoid the error accumulation problem in multistep prediction.

## INTRODUCTION

Short-term water demand prediction (STWDP) can provide an important decision-making basis for optimal scheduling and leakage control of water distribution systems. Therefore, the accuracy of prediction methods has a significant impact on water distribution system operation in terms of security, reliability and cost-effectiveness. To date, the exploration of prediction methods has remained a crucial research area for scholars worldwide. Many prediction methods, such as statistics-based methods (Yasar *et al.* 2012), artificial neural networks (ANNs) (Ghiassi *et al.* 2008), and Gaussian process regression (Roushangar & Alizadeh 2018), have been proposed over the past few decades. With the rapid development of computer technology, machine learning methods have gained increasing attention due to their ability to effectively handle nonlinear data. Overall, the application of machine learning in STWDP can be roughly divided into two stages: the shallow learning-based stage and the deep learning-based stage.

In the shallow learning-based stage, prediction methods tend to adopt a simple structure to capture the internal relationships among the data. Therefore, the desired results can be obtained by adjusting a few hyperparameters. Among these prediction methods, one of the base models commonly used is ANN. So far, a wide variety of methods based on ANNs have been developed to predict short-term water demand. For example, Bougadis *et al.* (2005) adopted three different ANNs to predict short-term municipal water demand and conducted a comparison with regression models and time series models. The results showed that the ANNs consistently outperformed the other models. Similar conclusions were drawn from the case studies of Pulido-Calvo & Gutierrez-Estrada (2009) and Adamowski (2008). To improve the prediction accuracy of ANNs, Huang *et al.* (2022) applied the mind evolutionary algorithm (MEA) to optimize the initial thresholds and weights of backpropagation neural networks (BPNNs) and found that the MEA_BPNN model can yield more accurate and stable results than the traditional BPNN model. Based on the above literature, there is no doubt that ANN-based methods can achieve good results in many case studies. However, because of the failure to effectively extract deep-level features, improving the upper limit of the prediction accuracy is difficult.

In the deep learning-based stage, prediction methods rely on constructing deeper networks to extract deep-level features of data, thereby achieving more accurate forecasts. A recurrent neural network (RNN) is a typical architecture of deep learning networks and was specifically designed for processing sequence information. To solve the problems (i.e., vanishing and exploding gradients) that exist in traditional RNNs, two improved architectures known as long short-term memory (LSTM) and gated recurrent units (GRUs) were successively developed (Hochreiter & Schmidhuber 1997; Cho *et al.* 2014).

In recent years, LSTM-based methods have been widely used in the field of STWDP due to the outstanding ability of LSTM to analyze time series data. For instance, Mu *et al.* (2020) proposed an LSTM-based model to predict hourly and daily urban water demands in Hefei City, China. They concluded that the LSTM-based model can offer predictions with improved accuracy under unfavourable conditions (e.g., data with high temporal resolution, abrupt changes, or a relatively high uncertainty level). To improve the prediction accuracy, Pu *et al.* (2023) combined wavelet analysis (WA) with convolutional neural networks (CNNs) and LSTM for single-step and multistep prediction. Instead of using the raw inputs directly, this method used the subseries components obtained from WA as inputs. Similarly, Liu *et al.* (2023) developed a novel LSTM-based method in which seasonal-trend decomposition using Loess (STL) and the adaptive boosting algorithm were used to enhance the LSTM performance; the STL method was utilized to obtain the input series for the LSTM network. The results of the two aforementioned case studies show that using input series derived from appropriate decomposition techniques as model inputs is conducive to improving the prediction accuracy. Considering that the selection of hyperparameters can easily affect the prediction accuracy, Wang *et al.* (2023a, 2023b) used the clouded leopard algorithm to optimize the input parameters of LSTM and eventually obtained the desired forecasts.

Compared with LSTM, the GRU has a simpler architecture, a faster computation speed and similar prediction performance. In view of this, GRU-based methods have also become popular in the field of STWDP. For example, Guo *et al.* (2018) developed a GRU model integrated with a correction module to produce 15-min and 24-h water demand forecasts with a 15-min time step. The results showed that the GRU-based models outperform the ANN and seasonal autoregressive integrated moving average (SARIMA) models for both forecast horizons. To alleviate forecasting errors at extreme points, Salloom *et al.* (2021) proposed a novel deep neural network architecture for one-step and multistep forecasting, in which a GRU was used as the basic model to handle the sequential relationships among historical demand data. This study revealed that the proposed architecture can significantly reduce the complexity of the model while maintaining the same accuracy. Moreover, Hu *et al.* (2021) proposed a novel framework in which a GRU was also used as the basic forecasting model; an isolation forest and complete ensemble empirical mode decomposition (EMD) with adaptive noise (CEEMDAN) were adopted for data preprocessing. The results demonstrated that the GRU-based method was superior to support vector regression (SVR) and ANNs. In addition, this study confirmed that using a decomposition technique to preprocess the data is indeed conducive to improving the prediction accuracy.

In summary, LSTM-based and GRU-based methods are adequate for obtaining promising forecasts in STWDP. Nevertheless, there is still some room to improve the prediction accuracy. For example, more effective decomposition techniques can be applied to enhance the decomposition efficiency. Furthermore, the most recent advancements in the deep learning domain can also be applied to water demand forecasting problems.

Inspired by the above literature, this paper proposes a novel deep learning-based method for STWDP. The proposed method is composed of four main parts: the variational mode decomposition (VMD) method, the golden jackal optimization (GJO) algorithm, the multihead attention mechanism (Multi-Attn) and the bidirectional gated recurrent unit (BiGRU) model. In the proposed method, the VMD method was used for data decomposition to create a series of relatively stable subsequences; the GJO algorithm was adopted to optimize the parameters of the VMD method, thereby enhancing the decomposition quality; the BiGRU model was used as the basic prediction method; and Multi-Attn was introduced into the BiGRU model to improve the prediction accuracy. For clarity, the proposed method is hereafter denoted as the GJO-VMD-BiGRU-Multi-Attn model.

The objective of this study is to investigate the potential of new techniques in the fields of signal decomposition, optimization and deep learning in addressing water demand prediction problems. The major contributions of this study can be briefly summarized as follows:

(1) Based on the ‘decomposition–prediction’ strategy, a novel framework was proposed for STWDP.

(2) A new nature-inspired optimization method named GJO was used to optimize the parameters of the VMD method, contributing to improving the decomposition quality.

(3) A Multi-Attn was introduced into the BiGRU model to capture the key temporal feature information of sequence data. In addition, a seq2seq strategy was adopted for multistep prediction to avoid the error accumulation problem, thus improving the model performance in multistep prediction.

The remainder of this paper is organized as follows. Section 2 describes the theory of related techniques used in this study. In Section 3, the proposed method is evaluated using a case study, following a comprehensive comparison with other BiGRU-based methods. Section 4 presents some conclusions and provides some suggestions for future work.

## METHODOLOGY

### VMD method

The VMD method is a widely used signal decomposition technique (*et al.* 2021; Hu *et al.* 2023; Li *et al.* 2023). The core idea of the VMD method is to construct and solve a variational model. Let *f*(*t*) be the original signal; then, the variational model can be constructed by the following formula:

$$\min_{\{u_k\},\{\omega_k\}}\left\{\sum_{k=1}^{K}\left\|\partial_t\left[\left(\delta(t)+\frac{j}{\pi t}\right)*u_k(t)\right]e^{-j\omega_k t}\right\|_2^2\right\}\quad \text{s.t.}\quad \sum_{k=1}^{K}u_k(t)=f(t)$$

where $u_k(t)$ is the mode function of the *k*-th mode, $\omega_k$ is the centre frequency of the *k*-th mode, *K* is the number of decomposed modes, $\delta$ represents the Dirac distribution, $*$ denotes convolution, and $\partial_t$ is a symbol for taking the partial derivative.

By introducing a quadratic penalty factor $\alpha$ and a Lagrangian multiplier $\lambda$, the constrained problem above is converted into an unconstrained one, which is solved iteratively with the alternating direction method of multipliers. The multiplier is updated as $\lambda^{n+1}(\omega)=\lambda^{n}(\omega)+\tau\left[f(\omega)-\sum_{k}u_k^{n+1}(\omega)\right]$, and the iterations terminate when $\sum_{k}\|u_k^{n+1}-u_k^{n}\|_2^2/\|u_k^{n}\|_2^2<\varepsilon$, where *n* denotes the *n*-th iteration, $\tau$ denotes the noise tolerance, which has an impact on the denoising effect, and $\varepsilon$ denotes a threshold used as the basis for terminating the iteration process.

Once the unconstrained variational problem is addressed, a series of band-limited intrinsic mode functions (BIMFs) can be generated. Compared with existing EMD-based methods, a unique advantage of the VMD method is that it can specify the number of modes *K* according to the actual conditions. The parameter *K* can remarkably affect the decomposition quality; nevertheless, how to select a proper value for *K* is still a challenging task. Moreover, the penalty factor $\alpha$ also significantly influences the performance of the VMD method. Therefore, it is necessary to find appropriate values for both *K* and $\alpha$ before implementing the VMD method. In the following, we introduce a novel optimization algorithm that is used to simultaneously optimize the parameters *K* and $\alpha$.

### GJO algorithm

The GJO algorithm, inspired by the collaborative hunting behaviour of golden jackals, is a new nature-inspired optimization method. Owing to its high performance in complicated optimization problems, the GJO algorithm has become increasingly popular since its development in 2022 (Najjar *et al.* 2022; Lu *et al.* 2023). In principle, the GJO algorithm consists of three basic stages: the initialization stage, exploration stage and exploitation stage. In the initialization stage, as a population-based method, the GJO algorithm needs to initialize a random prey population to obtain candidate solutions that are uniformly distributed over the search space. In the exploration stage, the behaviour of golden jackals searching for prey is simulated by constructing an exploration strategy. In the last stage, the GJO algorithm describes how and when the jackals enclose and pounce on the prey according to the prey's energy evasion. More information about the development process of the GJO algorithm can be found in the relevant literature (Nitish & Muhammad 2022).
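The three stages above can be illustrated with a deliberately simplified population-based search. The sketch below mimics only the initialization/exploration/exploitation structure; the update rules are illustrative stand-ins, not the actual golden jackal equations (see Nitish & Muhammad 2022 for those), and the function name is ours.

```python
import numpy as np

def simplified_gjo(fitness, bounds, pop_size=30, max_iter=20, seed=0):
    """Schematic population-based minimizer with GJO-like stages.

    NOTE: a simplified illustration of initialization, exploration and
    exploitation; NOT the exact golden jackal update equations.
    """
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T  # bounds: list of (low, high)
    # Initialization: random prey population uniformly over the search space
    pop = rng.uniform(lo, hi, size=(pop_size, len(bounds)))
    fit = np.array([fitness(p) for p in pop])
    best = pop[fit.argmin()].copy()
    for t in range(max_iter):
        # "Evasion energy" decays over iterations: explore early, exploit late
        energy = 1.5 * (1.0 - t / max_iter) * rng.uniform(-1, 1, pop_size)
        for i in range(pop_size):
            if abs(energy[i]) > 1:   # exploration: random step around the leader
                step = rng.uniform(-1, 1, len(bounds)) * (hi - lo) * 0.1
                cand = best + step
            else:                    # exploitation: enclose and pounce on prey
                cand = best + energy[i] * (best - pop[i])
            cand = np.clip(cand, lo, hi)
            f = fitness(cand)
            if f < fit[i]:           # greedy replacement
                pop[i], fit[i] = cand, f
        best = pop[fit.argmin()].copy()
    return best, float(fit.min())
```

In the present context, `fitness` would evaluate the envelope entropy of the BIMFs produced by VMD for a candidate (*K*, *α*) pair.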

To optimize the VMD parameters with the GJO algorithm, a fitness function must be specified; in this study, the envelope entropy was adopted (*et al.* 2019; Chang *et al.* 2022). The larger the envelope entropy value is, the more complicated the BIMFs obtained from VMD. In this context, the parameter combination (*K*, $\alpha$) corresponding to the minimum envelope entropy is the optimal parameter combination. For a signal $x(j)$ ($j = 1, 2, \ldots, N$) with a mean of zero, its envelope entropy can be calculated as follows:

$$E_p=-\sum_{j=1}^{N}p_j\lg p_j,\qquad p_j=\frac{a(j)}{\sum_{j=1}^{N}a(j)}$$

where $p_j$ is the normalized form of $a(j)$ and $a(j)$ is the envelope signal obtained by Hilbert demodulation of the signal $x(j)$.
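As a concrete illustration, the envelope entropy of a zero-mean signal can be computed from its Hilbert envelope as follows (a minimal NumPy/SciPy sketch; the function name is ours):

```python
import numpy as np
from scipy.signal import hilbert

def envelope_entropy(x):
    """Envelope entropy of a zero-mean signal: take the Hilbert envelope
    a(j), normalize it to a distribution p_j, then compute the Shannon
    entropy -sum p_j * log(p_j)."""
    a = np.abs(hilbert(x))   # envelope via Hilbert demodulation
    p = a / a.sum()          # normalized envelope, sums to 1
    return float(-(p * np.log(p)).sum())
```

A GJO fitness evaluation would run VMD with a candidate (*K*, *α*), compute this quantity for the resulting BIMFs, and return the value to be minimized.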

### Sliding window sampling

In this paper, the samples of prediction models were constructed using a sliding window model. To evaluate the effectiveness of the proposed method, both single-step and multistep prediction were considered in this study. In general, there are at least four commonly used strategies for making multistep forecasts: direct multistep forecasts, recursive multistep forecasts, direct recursive hybrid multistep forecasts and multiple-output forecasts. However, all of the above strategies have inherent drawbacks that may degrade the prediction performance of models. A direct multistep forecast strategy needs to create multiple models to achieve multistep prediction, resulting in a high computational load. For the recursive multistep forecast strategy, the accumulated errors may increase when the time series to be predicted is long, which can easily lead to a rapid decline in performance. The direct recursive hybrid multistep forecast strategy seeks to combine the best properties of both the recursive and direct forecast strategies, but this strategy cannot fully utilize all the observed values for prediction due to defects in the framework. The multiple-output forecast strategy has no error accumulation problem; however, this strategy does not consider the temporal relationships among the output data. Therefore, this strategy may cause information loss, thereby leading to an increase in prediction errors.
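The error accumulation of the recursive strategy is easy to see in code. The sketch below (illustrative only; not the seq2seq approach adopted in this paper) feeds each one-step forecast back into the input window, so any bias in the model compounds with every step:

```python
import numpy as np

def recursive_forecast(one_step_model, history, steps):
    """Recursive multistep strategy: predict one step, append the
    prediction to the window, and repeat. Early errors are reused as
    inputs, so they can accumulate over the horizon."""
    window = list(history)
    preds = []
    for _ in range(steps):
        yhat = one_step_model(np.asarray(window))
        preds.append(yhat)
        window = window[1:] + [yhat]   # slide the window, reusing the forecast
    return preds
```

For example, a one-step model with a constant bias of +0.1 on a flat series yields forecasts that drift by 0.1 per step, whereas a multiple-output (or seq2seq) model predicts the whole horizon from observed data only.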

Table 1 illustrates the construction of samples for single-step and multistep prediction, where x_{i} denotes the data point at time *i*.

**Table 1** | Sample construction for single-step and multistep prediction

| Prediction horizon | Inputs | Outputs |
| --- | --- | --- |
| Single-step prediction | x_{i−n}, … , x_{i−2}, x_{i−1} | x_{i} |
| Multistep prediction | x_{i−n}, … , x_{i−2}, x_{i−1} | x_{i}, x_{i+1}, … , x_{i+m−1} |
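The sampling scheme in Table 1 can be sketched as follows (a minimal NumPy illustration; the function name is ours):

```python
import numpy as np

def sliding_window_samples(series, n_in, n_out=1):
    """Build (inputs, outputs) sample pairs as in Table 1: n_in past
    points predict the next n_out points (n_out=1 for single-step)."""
    X, Y = [], []
    for i in range(n_in, len(series) - n_out + 1):
        X.append(series[i - n_in:i])   # x_{i-n}, ..., x_{i-1}
        Y.append(series[i:i + n_out])  # x_i, ..., x_{i+m-1}
    return np.array(X), np.array(Y)
```

With an hourly series, `n_in=12` and `n_out=24` would reproduce the 12-h input window and 24-step horizon used later in this study.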

### BiGRU model

The GRU simplifies the LSTM cell by using only two gates, a reset gate and an update gate (*et al.* 2022). The relevant formulas for the GRU are as follows:

$$r_t=\sigma(W_r x_t+U_r h_{t-1}+b_r)$$
$$z_t=\sigma(W_z x_t+U_z h_{t-1}+b_z)$$
$$\tilde{h}_t=\tanh\left(W_h x_t+U_h(r_t\odot h_{t-1})+b_h\right)$$
$$h_t=(1-z_t)\odot h_{t-1}+z_t\odot \tilde{h}_t$$

where $r_t$ represents the reset gate, $z_t$ represents the update gate, the *W* and *U* matrices denote the related weights, $b_r$, $b_z$ and $b_h$ denote the related biases, $\tilde{h}_t$ is the candidate hidden state at time *t*, $h_t$ is the hidden state at time *t*, $\sigma$ and $\tanh$ represent the sigmoid and activation functions, respectively, and $\odot$ represents element-wise multiplication.

A single GRU processes the sequence in one direction only. The BiGRU combines a forward GRU and a backward GRU so that each hidden state draws on both past and future context (*et al.* 2021). Figure 3 illustrates the structure of the BiGRU, in which the annotated matrices denote the related weights.
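For concreteness, a single GRU step with the standard gating equations can be written as follows (a NumPy sketch with illustrative parameter names; a BiGRU simply runs two such recurrences, forward and backward, and combines their hidden states):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step: reset gate r_t, update gate z_t, candidate state
    h~_t, then a gated blend of the old and candidate states."""
    Wr, Ur, br = params["Wr"], params["Ur"], params["br"]
    Wz, Uz, bz = params["Wz"], params["Uz"], params["bz"]
    Wh, Uh, bh = params["Wh"], params["Uh"], params["bh"]
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)             # reset gate
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)             # update gate
    h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)  # candidate state
    return (1.0 - z) * h_prev + z * h_cand               # element-wise blend
```

In practice this cell is provided by the framework (e.g., `torch.nn.GRU` with `bidirectional=True`); the sketch only makes the gating arithmetic explicit.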

### Multihead attention mechanism (Multi-Attn)

The attention mechanism simulates human ability and focuses on important information at specific moments. Among a large quantity of input data, usually, only a small subset of features is crucial. For this reason, a prediction model can focus on these important parts by introducing attention mechanisms, thereby improving the overall performance.

Multi-Attn is an extended form of the attention mechanism that can more effectively extract information from sequence data (Nei *et al.* 2022). The specific procedures for introducing Multi-Attn into the BiGRU model can be described as follows:

- (1) The output vector obtained from the BiGRU is first projected into different spaces (i.e., *query*, *key*, and *value* spaces) through linear transformations, thereby obtaining the matrices *Q*, *Key*, and *V*, respectively. Afterwards, the matrices *Q*, *Key*, and *V* are split into *h* heads to create the matrices $Q_i$, $Key_i$ and $V_i$ (*i* = 1, 2, … , *h*). The attention output can be calculated by:

  $$head_i=\mathrm{softmax}\left(\frac{Q_i\,Key_i^{T}}{\sqrt{d_k}}\right)V_i$$

  where $head_i$ denotes the attention output corresponding to the *i*-th head, $d_k$ denotes the dimension of the matrices $Q_i$, $Key_i$, and $V_i$, and softmax denotes the normalization of the scores.
- (2) The attention outputs of all the heads are concatenated according to Formula (13). Afterwards, the final result can be obtained through a linear transformation:

  $$A=\mathrm{Concat}(head_1, head_2, \ldots, head_h)\,W^{O}$$

  where *A* is the final result obtained through Multi-Attn, Concat denotes the concatenation operation, and $W^{O}$ represents the weight of the linear transformation.
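The two steps above can be sketched in NumPy as follows (an illustrative implementation; the matrix names and shapes are assumptions made for the example, and frameworks such as `torch.nn.MultiheadAttention` provide the production version):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(H, Wq, Wk, Wv, Wo, h):
    """Multihead scaled dot-product attention over BiGRU outputs H (T, d):
    project H to Q/Key/V, split into h heads, apply
    softmax(Q Key^T / sqrt(d_k)) V per head, concatenate, project with Wo."""
    T, d = H.shape
    Q, K, V = H @ Wq, H @ Wk, H @ Wv        # linear projections, each (T, d)
    d_k = d // h                            # per-head dimension
    heads = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_k)  # (T, T) attention scores
        heads.append(softmax(scores) @ V[:, sl])       # (T, d_k) head output
    return np.concatenate(heads, axis=1) @ Wo          # concat + output projection
```

With zero query/key weights the attention becomes uniform, so every output row reduces to the mean of the value rows, which is a convenient sanity check.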

### GJO-VMD-BiGRU-Multi-Attn model

First, the GJO algorithm was used to find the optimal parameters *K* and *α* for the VMD method according to the minimum envelope entropy. Second, the water demand series was decomposed into a series of BIMFs using the VMD method with the optimal parameters. Then, for each BIMF, a BiGRU-Multi-Attn model was constructed for both single-step and multistep prediction. To obtain appropriate hyperparameter combinations, a single hyperparameter was adjusted while the other hyperparameters were fixed. Furthermore, the seq2seq strategy was adopted for multistep prediction, which contributes to avoiding the error accumulation problem and improving the prediction performance. Finally, the final prediction results were obtained by accumulating the forecasts of all the BIMFs. Figure 4 shows the flowchart of the GJO-VMD-BiGRU-Multi-Attn model for water demand prediction.
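The overall 'decomposition–prediction' workflow can be summarized in a few lines. In this sketch, `decompose` and `fit_predict` are placeholders standing in for the GJO-optimized VMD method and the trained per-BIMF BiGRU-Multi-Attn models:

```python
import numpy as np

def decompose_predict_sum(series, decompose, fit_predict, horizon):
    """'Decomposition-prediction' pipeline sketch: decompose the demand
    series into component BIMFs, forecast each component separately,
    then sum the component forecasts to obtain the final prediction."""
    bimfs = decompose(series)                        # list of component series
    forecasts = [fit_predict(b, horizon) for b in bimfs]
    return np.sum(forecasts, axis=0)                 # aggregate per-BIMF forecasts
```

Because the final forecast is a plain sum of component forecasts, any inverse-transform step is trivial; the modeling effort is concentrated in the per-component predictors.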

### Prediction performance criteria

Appropriate criteria are essential for evaluating prediction performance (*et al.* 2014). In this study, four commonly used indicators were selected for performance evaluation: the absolute percent error (APE), the mean absolute percentage error (MAPE), the root mean squared error (RMSE), and the coefficient of determination (R^{2}). In practice, the lower the APE, MAPE and RMSE are, the better the prediction models perform, while the opposite is true for R^{2}. The above indicators can be calculated by the following formulas (Huang *et al.* 2023):

$$APE_i=\frac{|y_i-\hat{y}_i|}{y_i}\times 100\%$$
$$MAPE=\frac{1}{m}\sum_{i=1}^{m}\frac{|y_i-\hat{y}_i|}{y_i}\times 100\%$$
$$RMSE=\sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i-\hat{y}_i)^2}$$
$$R^2=1-\frac{\sum_{i=1}^{m}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{m}(y_i-\bar{y})^2}$$

where $y_i$ and $\hat{y}_i$ represent the observed values and the predicted values, respectively, *m* is the number of observed values, and $\bar{y}$ is the mean of the observed values.
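The four indicators described above translate directly into NumPy (a minimal sketch; the function name is ours):

```python
import numpy as np

def prediction_metrics(y, y_hat):
    """APE per point, MAPE, RMSE and R^2 for observed y and predicted y_hat."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ape = np.abs(y - y_hat) / np.abs(y) * 100.0        # APE (%), per point
    mape = float(ape.mean())                           # MAPE (%)
    rmse = float(np.sqrt(np.mean((y - y_hat) ** 2)))   # RMSE, same units as y
    r2 = 1.0 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return ape, mape, rmse, float(r2)
```

The standard deviation and maximum of the returned APE array give the two APE summary statistics reported in the results tables.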

## CASE STUDY

### Data description

In this study, the length of the sliding window was set to 12 based on a trial and error procedure. Specifically, we used water demand data from the previous 12 h to predict future water demand. Note that the length of the sliding window can differ depending on the characteristics of the dataset. To further evaluate the performance of the proposed method, a 24-h prediction was also investigated. According to the approach for sample construction illustrated in Table 1, for each BIMF, approximately 5,000 samples can be obtained for both single-step and multistep prediction. The samples for each BIMF were divided into a training set, a validation set and a testing set at a ratio of 7:1.5:1.5. Therefore, the training set contains approximately 3,500 samples, while both the validation set and the testing set contain approximately 750 samples.
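The chronological 7:1.5:1.5 split described above can be expressed as follows (a minimal sketch; the function name is ours):

```python
def chronological_split(samples, ratios=(0.7, 0.15, 0.15)):
    """Split samples chronologically into train/validation/test sets at
    the 7:1.5:1.5 ratio used in this study. No shuffling is applied, so
    the test period strictly follows the training period."""
    n = len(samples)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])
```

Preserving temporal order in the split avoids leaking future information into the training set, which matters for time series evaluation.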

### Optimization using the GJO algorithm

The GJO algorithm was applied to search for the optimal VMD parameters *K* and *α*. Based on the analyses in the related literature (Nitish & Muhammad 2022), the maximum number of iterations and the population size were set to 20 and 30, respectively, in this study. The search space for *K* was the range [3, 10], while the search space for *α* ranged from 100 to 2,500. With these settings, the VMD method was optimized using the GJO algorithm. Table 2 presents the parameter settings for the GJO algorithm, and Figure 6 shows the fitness curve of the optimization.

| Parameter | Maximum number of iterations | Population size | Search space for *K* | Search space for *α* |
| --- | --- | --- | --- | --- |
| Value | 20 | 30 | [3, 10] | [100, 2,500] |

As shown in Figure 6, when the number of iterations reached 6, the best fitness (i.e., 8.3546) was achieved, and the optimal results were obtained at this point. Corresponding to the best fitness, the optimal value for *K* was 10, while the optimal value for *α* was 640.

### Decomposition using the VMD method

### Results

Since 10 BIMFs were obtained from decomposition, 10 different BiGRU-Multi-Attn models were constructed in this study. To achieve the desired forecasts, the hyperparameters for each BiGRU-Multi-Attn model can be different according to the characteristics of the BIMF. Generally, three methods are commonly used for hyperparameter adjustment: grid search, random search and intelligent optimization methods. However, hyperparameter adjustment is still inflexible and time-consuming, which makes this task challenging. In this study, we adjusted a single hyperparameter when the other hyperparameters were fixed. According to the minimum MAPE on the validation set, an acceptable hyperparameter can be obtained.

Regarding the BiGRU-Multi-Attn model, the hyperparameters that usually need to be adjusted include the number of GRU layers, the number of hidden units, the learning rate, the number of epochs, the batch size, the dropout rate and the number of attention heads. Theoretically, the more GRU layers and hidden units there are, the better the model performs. However, this approach also means a higher computational load. In addition, too many GRU layers and hidden units may easily lead to overfitting, thereby deteriorating the model performance. Given this point, the number of GRU layers was set to 1, while the number of hidden units was set to 32. To avoid model overfitting, the dropout technique was adopted. In this study, we set the dropout rate to 0.1 to utilize this technique. The major parameters that need to be specified are presented in Table 3.

| Parameter | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 | M9 | M10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Number of BiGRU layers | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Number of hidden units | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| Learning rate | 0.007 | 0.005 | 0.006 | 0.007 | 0.004 | 0.002 | 0.005 | 0.003 | 0.007 | 0.0022 |
| Optimizer | Adam | Adam | Adam | Adam | Adam | Adam | Adam | Adam | Adam | Adam |
| Number of epochs | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 |
| Batch size | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| Dropout rate | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Number of attention heads | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |

*Note*: **M_{i}** (*i* = 1, 2, … , 10) denotes the corresponding model for the *i*-th BIMF.

To evaluate the effectiveness of the GJO-VMD-BiGRU-Multi-Attn model, we compared it with the BiGRU, BiGRU-Multi-Attn and VMD-BiGRU-Multi-Attn models in terms of single-step and multistep prediction. In addition, to verify the performance of the GJO algorithm in VMD optimization, we also compared the proposed method with other optimized VMD-based methods in which two reported optimization algorithms (i.e., particle swarm optimization (PSO) and the whale optimization algorithm (WOA)) (Qi *et al.* 2022; Wang *et al.* 2023a, 2023b) were considered in this study. Similarly, PSO and the WOA were also used to optimize VMD. To facilitate the description of the methods, two other optimized VMD-based methods are denoted as the PSO-VMD-BiGRU-Multi-Attn model and the WOA-VMD-BiGRU-Multi-Attn model.

Like an ANN, the BiGRU-based model can also be regarded as a black box that can yield different forecasts for each execution. Therefore, to evaluate the performance of BiGRU-based methods more reasonably, we implemented each BiGRU-based model 10 times. The best results were used as the final values for performance evaluation. To evaluate the model performance more comprehensively, two prediction scenarios were designed. Scenario 1 is a single-step prediction scenario in which one-step prediction can be implemented. Scenario 2 is a multistep prediction scenario in which 24-step predictions are carried out simultaneously via the seq2seq strategy. Note that all the prediction models in this study were implemented in the PyTorch environment.

#### Scenario 1

Table 4 summarizes the forecasting performance obtained by applying the models to a testing set.

| Model | MAPE (%) | RMSE (m^{3}/h) | R^{2} | Standard deviation of APE (%) | Maximum APE (%) |
| --- | --- | --- | --- | --- | --- |
| BiGRU | 2.51 | 148 | 0.896 | 1.93 | 11.88 |
| BiGRU-Multi-Attn | 0.75 | 55 | 0.985 | 0.92 | 7.08 |
| VMD-BiGRU-Multi-Attn | 1.27 | 75 | 0.973 | 1.03 | 6.31 |
| GJO-VMD-BiGRU-Multi-Attn | **0.45** | **25** | **0.997** | **0.34** | 2.02 |
| PSO-VMD-BiGRU-Multi-Attn | 0.47 | 27 | 0.996 | 0.36 | **1.96** |
| WOA-VMD-BiGRU-Multi-Attn | 1.19 | 69 | 0.977 | 0.92 | 4.91 |

*Note*: The GJO-VMD-BiGRU-Multi-Attn model is the proposed method. The bold values denote the best results among the methods used for comparison.

According to the results depicted in Table 4, the following conclusions can be drawn. First, the proposed method achieved the best performance in terms of the MAPE, RMSE and R^{2}, indicating that the proposed method can yield more accurate and stable forecasts than the other methods. Strikingly, the MAPE of the proposed method was only 0.45%, which means that the proposed method yielded an extremely high accuracy. With respect to the R^{2}, the value was as high as 0.997, demonstrating that the proposed method has excellent fitting ability. Second, the BiGRU-Multi-Attn model significantly improved all the indicators when compared to those of the BiGRU model. For example, the MAPE decreased dramatically from 2.51 to 0.75%, thereby improving the prediction accuracy by 70%. Regarding the RMSE, a considerable decrease from 148 to 55 m^{3}/h was observed, resulting in a remarkable reduction of more than 60%. These significant improvements are a direct consequence of introducing Multi-Attn. By introducing Multi-Attn, the BiGRU model can utilize more useful information among the inputs, thereby substantially increasing the model performance. Third, the proposed method and the PSO-VMD-BiGRU-Multi-Attn model performed much better in terms of all the indicators compared with those of the BiGRU-Multi-Attn model. However, at the same time, the BiGRU-Multi-Attn model had lower MAPE and RMSE values than the other two VMD-based models (i.e., the VMD-BiGRU-Multi-Attn and WOA-VMD-BiGRU-Multi-Attn models). On the one hand, it is suggested that effective decomposition can contribute to improving the model performance. On the other hand, if the decomposition results are not desirable enough, deterioration in the model performance may occur instead. In other words, the model performance is sensitive to the decomposition results. In this context, it can be inferred that the GJO algorithm is effective enough to optimize the VMD method. 
Finally, the overall performance of the PSO-VMD-BiGRU-Multi-Attn model was very close to that of the proposed method; on one indicator (the maximum APE), the PSO-VMD-BiGRU-Multi-Attn model even surpassed the proposed method. Nevertheless, the proposed method retained an advantage on most of the indicators (i.e., the MAPE, RMSE, R^{2} and the standard deviation of the APE). Based on the above analyses, it may be safe to conclude that the proposed method yields exceptionally accurate and fairly stable forecasts and outperforms the other methods used for comparison in terms of prediction accuracy and stability.

According to Figure 8, the regression value R is as high as 0.9985, which once again confirms that the proposed model has good fitting ability. As shown in Figure 9, the prediction curves of the BiGRU-Multi-Attn and PSO-VMD-BiGRU-Multi-Attn models and the proposed method are very close to the real curve, whereas the prediction curves of the other models tend to deviate from the real curve in several periods. The APE curves provide another perspective for us to understand more details about the model performance. A smooth APE curve usually means that the forecasts are stable, while low APE values indicate that the forecasts are accurate. According to Figure 9, the APE curves of both the proposed method and the PSO-VMD-BiGRU-Multi-Attn model are relatively smooth, and their APE values are consistent at a very low level (e.g., most of the APE values are below 1%). In contrast, the APE curves of the other models, especially those of the WOA-VMD-BiGRU-Multi-Attn and BiGRU models, fluctuate strongly with universally high APE values. Based on the observations shown in Figure 9, the proposed method achieved excellent prediction performance, which is consistent with the conclusions drawn from Table 4.

#### Scenario 2

Table 5 summarizes the forecasting performance of the models in the 24-step prediction scenario.

| Model | MAPE (%) | RMSE (m^{3}/h) | R^{2} |
| --- | --- | --- | --- |
| BiGRU | 3.23 | 194 | 0.820 |
| BiGRU-Multi-Attn | 3.21 | 193 | 0.823 |
| VMD-BiGRU-Multi-Attn | 3.85 | 225 | 0.759 |
| GJO-VMD-BiGRU-Multi-Attn | **2.12** | **126** | **0.924** |
| PSO-VMD-BiGRU-Multi-Attn | 2.24 | 135 | 0.913 |
| WOA-VMD-BiGRU-Multi-Attn | 2.71 | 156 | 0.884 |

Comparing Tables 4 and 5, the prediction performance of all the models deteriorated to different degrees. For example, the MAPE of the proposed method increased from 0.45 to 2.12%, a nearly fivefold increase, and its RMSE grew by roughly the same factor (i.e., from 25 to 126 m^{3}/h). These observations suggest that as the prediction horizon grows, the prediction accuracy and stability of the models inevitably decrease. However, several interesting findings deserve further analysis. For instance, although an increase in the number of prediction steps degrades model performance, acceptable prediction performance can still be obtained: all of the MAPE values listed in Table 5 were below 5%. Notably, the MAPE values of the proposed method and the PSO-VMD-BiGRU-Multi-Attn model were only slightly greater than 2%, and the R^{2} values of these two methods exceeded 0.9. In practice, these achievements can be partly attributed to the seq2seq strategy, which performs multistep prediction without error accumulation and alleviates performance deterioration by capturing the temporal dependence among the outputs.

According to Figure 10, the following observations can be made. As in Scenario 1, the prediction curves of the proposed method and the PSO-VMD-BiGRU-Multi-Attn model fit the real curve better than those of the other methods, indicating that these two methods yield more accurate forecasts. In addition, the APE values of the proposed method and the PSO-VMD-BiGRU-Multi-Attn model fluctuated at relatively low levels: although a few APE values slightly exceeded 5%, in most cases they were below 2%. This finding suggests that even in the 24-step prediction scenario, the proposed method and the PSO-VMD-BiGRU-Multi-Attn model still achieve high accuracy.

In light of the results shown in Table 5 and Figure 10, we can safely draw the following conclusions: (1) for multistep prediction, by exploiting the seq2seq strategy, the proposed method maintains a high accuracy; and (2) similar to single-step prediction, the proposed method is superior to other methods in multistep prediction in terms of the prediction accuracy and stability.
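The advantage of the seq2seq strategy over iterated one-step forecasting can be illustrated with a minimal sketch. The two functions below are hypothetical stand-ins (not the paper's BiGRU models): the recursive scheme feeds each forecast back as input, so early errors propagate into later steps, whereas the seq2seq scheme emits the whole horizon at once from observed data only:

```python
import numpy as np

def recursive_forecast(step_model, history, steps):
    """Iterated one-step prediction: each forecast replaces a real
    observation in the input window, so errors accumulate over steps."""
    window = list(history)
    preds = []
    for _ in range(steps):
        yhat = float(step_model(np.array(window)))
        preds.append(yhat)
        window = window[1:] + [yhat]  # predicted value re-enters the input
    return np.array(preds)

def seq2seq_forecast(seq_model, history, steps):
    """Direct multistep prediction: the model outputs all `steps` values
    in one pass, conditioned only on real observations."""
    return np.asarray(seq_model(np.array(history), horizon=steps), dtype=float)
```

In the paper's setting, `seq_model` corresponds to the BiGRU decoder producing the full 24-h sequence, which is why the 24-step MAPE degrades gracefully rather than compounding step by step.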

## CONCLUSIONS

In this study, we investigated the potential of a novel method for STWDP. A dataset obtained from a real-world water distribution system was applied to evaluate the effectiveness of the proposed method, and a comparison with other BiGRU-based methods was conducted. Based on the results, the following conclusions can be drawn: (1) the proposed method obtains highly accurate and stable forecasts in single-step prediction and still achieves reliable performance in 24-step prediction; overall, it consistently outperforms the other BiGRU-based methods in both prediction scenarios; (2) the GJO algorithm is effective at optimizing the VMD method, providing a solid foundation for improving the prediction performance, and both the optimized VMD method and Multi-Attn are conducive to enhancing prediction accuracy; and (3) the seq2seq strategy improves multistep prediction by preventing error accumulation, which is beneficial for real-time optimal scheduling in water distribution systems.

Although the proposed method achieved satisfactory results in this study, several limitations merit attention. For instance, more submodels must be constructed than in traditional methods, where decomposition is not considered, and consequently additional hyperparameters must be tuned. Because hyperparameter tuning is usually time-consuming for deep learning methods, this may hinder the practical application of the proposed method. Moreover, all the BiGRU models corresponding to the BIMFs used the same number of inputs for both single-step and multistep prediction, whereas in general the optimal number of inputs may differ according to the characteristics of the dataset; the effect of the number of inputs on model performance was not examined. These limitations require further investigation.

## ACKNOWLEDGEMENTS

This work is supported by the Middle-Aged and Young Teachers' Basic Ability Promotion Project of Guangxi (Grant No. 2024KY146) and the Vocational Education Teaching Reform Research Project of Guangxi (Grant No. GXGZJG2022B102).

## AUTHOR CONTRIBUTIONS

H.H. contributed to conceptualization, formal analysis, investigation, software, writing – original draft. M.W. contributed to funding acquisition, methodology, validation, resources, supervision, writing – review & editing.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.

## CONFLICT OF INTEREST

The authors declare there is no conflict.