Industrial water withdrawal prediction is the cornerstone of water resources monitoring and early warning, as well as a key task in implementing rigid constraints on water resources. Although many factors may influence water withdrawal, they are difficult to obtain. Extracting features from the historical water withdrawal data itself is therefore a more convenient way to achieve accurate prediction. However, because industrial water data lack significant regularity and periodicity, feature extraction is challenging. In this paper, a multi-head attention encoder (MAEN) model consisting of multi-head attention mechanisms and feed-forward neural networks is proposed to enable more convenient and accurate forecasting. The multi-head attention mechanism allows the model to focus on different parts of the data simultaneously, detect complex dependencies, and extract key features from historical data more efficiently. Applied to water withdrawal data from real factories, the proposed model outperforms common time series prediction models, including the artificial neural network, long short-term memory, and gated recurrent unit. Compared to the best-performing comparison model, MAEN reduces the mean squared error by 3.2% for 1 day ahead predictions and 8.9% for 7 days ahead predictions, illustrating its superior performance in industrial water withdrawal prediction.

  • A multi-head attention encoder model is used to predict industrial water withdrawal.

  • Multi-head attention mechanism allows efficient extraction of complex data dependencies.

  • The model achieves superior performance in both single-step and seven-step predictions.

  • The model shows better accuracy and stability compared with ANN, LSTM, and GRU.

Water is a finite resource essential for human survival. Over time, factors such as climate change, population growth, urbanization, and industrial expansion have significantly impacted water resources, leading to increased consumption (Leitão et al. 2019). In response, the Chinese government has launched numerous initiatives and strategies aimed at curbing water usage in pursuit of sustainable development. Rapid urbanization and stringent national energy-saving policies underscore the urgency of precise water resource management (Pu et al. 2022). Forecasting user water demand is a valuable yet challenging endeavor for achieving efficient water resource management (Du et al. 2021). Recent years have seen millions of tons of water wasted in industrial processes (Zhou et al. 2019). If industrial water withdrawal can be predicted in advance, corresponding water-saving measures can be implemented to reduce water usage and thereby minimize waste. However, unlike domestic water consumption, which exhibits strong periodicity and trends, industrial water consumption is characterized by complex randomness and nonlinearity (Wang et al. 2019). Various factors, including gross domestic product, industry development, and product structure, complicate the prediction of industrial water withdrawal (Zhang et al. 2022), resulting in a scarcity of accurate forecasts.

Under the severe water resource situation, many water consumption prediction efforts have been undertaken. Nevertheless, research on water consumption prediction has predominantly concentrated on domestic usage. These methods include time series models (Liu et al. 2011), machine learning techniques (Zhang et al. 2018), and deep learning approaches (Kim et al. 2022). Widely adopted time series models include the Holt–Winters method, the autoregressive integrated moving average (ARIMA) model, and the seasonal autoregressive integrated moving average (SARIMA) model. Both ARIMA and SARIMA use differencing to render the data stationary, allowing linear models to identify features and yield accurate predictions (Razali et al. 2018). Mombeni et al. (2013) applied SARIMA models to monthly residential water consumption in Iran from May 2001 to March 2010, discovering that the residuals of the model were well approximated by a three-parameter log-logistic distribution. Razali et al. (2018) successfully forecasted water consumption expenditure in Malaysia from 2006 to 2014 using the ARIMA model, with a mean absolute percentage error (MAPE) of 19.7085%. Nonetheless, the inherent volatility of water consumption data means that traditional time series models often produce significant errors when predicting nonlinear time series data (Du et al. 2020).

Machine learning models are effective at handling nonlinear time series data because of their powerful learning ability. They can automatically learn the variation characteristics of historical water consumption data and have therefore been widely applied in water consumption forecasting (Guo et al. 2018). Support vector regression (SVR), one of the pioneering machine learning models, leverages kernel functions to effectively handle system nonlinearity. A comparative analysis by Ibrahim et al. (2020) of water demand forecasting using SVR and ARIMA indicated superior prediction accuracy of SVR over ARIMA. Furthermore, Mouatadid & Adamowski (2017) found that the extreme learning machine model provided greater accuracy than multiple linear regression and SVR models in forecasting Montreal urban water demand 1 and 3 days ahead. Among machine learning models, the artificial neural network (ANN) is the most prevalent; it employs the backpropagation algorithm to iteratively adjust weights, thereby minimizing prediction error and optimizing outcomes. For example, Lorente-Leyva et al. (2019) used an ANN to predict urban water demand and, compared to ARIMA, achieved a 38.75% reduction in mean squared error (MSE). Although machine learning methods have been widely used in prediction problems, the relatively simple architecture of some machine learning models struggles with large-scale data analysis, hindering the extraction of complex data features (Chen et al. 2020).

Excellent capabilities in nonlinear mapping and deep feature extraction have made deep learning a focal point in water demand forecasting research (Salloom et al. 2021). Deep learning methods are multi-layer neural network models based on the framework of artificial neural networks. These deep structures excel at autonomously identifying complex features from vast volumes of unstructured data, compensating for the limitations of machine learning in handling complex data. Deep learning has wide applications across various fields, including computer vision, natural language processing, and speech recognition (Li et al. 2021). Guo et al. (2018) developed a gated recurrent unit network (GRUN) model to forecast water demand with a 15-min time step and found that deep neural network models such as GRUN outperform the ANN and SARIMA models for 15-min water demand forecasts. Mu et al. (2020) demonstrated the efficacy of a long short-term memory (LSTM)-based model in predicting Hefei City's short-term urban water demands, outperforming the ARIMA model. Similarly, Chen et al. (2022) developed a one-dimensional convolution-gated recurrent unit model with significant automatic feature extraction capability for short-term water demand forecasting. With optimal parameter settings and training strategies, this method achieved the best MAPE and Nash–Sutcliffe efficiency coefficient values of 1.677 and 0.983, respectively.

Despite the achievements of machine learning and deep learning in domestic water consumption forecasting, research on industrial water consumption prediction remains scarce (Zhang et al. 2022). This gap can be attributed to two main factors. First, residential water consumption generally exhibits cyclical and predictable patterns, enabling forecasts based on historical data characteristics (Kozłowski et al. 2018). In contrast, industrial water withdrawal is more unpredictable due to irregular demands and weaker periodicity. Second, industrial water withdrawal is influenced by variables such as industrial structure, market conditions, and gross domestic product, which are often challenging to obtain. Although studies incorporating water consumption influencers as model inputs have been proposed to enhance prediction accuracy (Kavya et al. 2023), the diverse withdrawal behaviors among industrial water users and the limited availability of influencing-factor data (such as enterprise output value and product structure) present substantial obstacles to accurately predicting individual users' water withdrawals.

In recent years, the multi-head attention mechanism has attracted widespread attention. It is a key innovation of the transformer model proposed by Vaswani et al. (2017) to process natural language, which, like water consumption data, is essentially sequential data. Owing to its unique attention framework for extracting features from historical data based on attention scores, the multi-head attention mechanism excels at capturing long-term dependencies in sequential data. This provides a new tool for extracting the features in historical industrial water withdrawal data for more precise forecasting. The mechanism's ability to discern potential attentional correlations across various layers enhances data representation, significantly mitigating the challenges that the weak periodicity of industrial water consumption poses for forecasting accuracy (Voita et al. 2019).

Another issue in industrial water withdrawal prediction is the temporal scale. Generally, time series prediction can be divided into single-step prediction and multi-step prediction (Kelley & Hagan 2024). For industrial water management, allowing sufficient time to respond to changes is important, so multi-step ahead forecasting strategies are necessary. There are two major approaches to multi-step ahead prediction: recursive and direct strategies (Chandra et al. 2021). The recursive method extends single-step forecasts to subsequent time units, using each new forecast as the basis for the next, which may accumulate errors and diminish accuracy. The direct method, in contrast, generates multiple future data points in a single model run, posing challenges to the model's capacity. The multi-head attention mechanism, adept at processing sequential data, discerns the relationships among data points via a weight matrix, enabling it to capture temporal dependencies in water usage and facilitating the extraction of usage patterns (Zhang & Wang 2023). Consequently, it can effectively generate parallel outputs for multiple future points. This raises the question of whether the multi-head attention mechanism offers enhanced performance in predicting industrial water withdrawal over conventional models, particularly for multi-step forecasting.
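To make the two strategies concrete, the sketch below contrasts them with generic forecasting functions; `one_step_model` and `direct_model` are hypothetical callables used only for illustration, not models defined in this paper.

```python
import numpy as np

def recursive_forecast(one_step_model, history, horizon=7):
    """Recursive strategy: each one-step forecast is fed back into the input window."""
    window = list(history)
    forecasts = []
    for _ in range(horizon):
        next_value = one_step_model(np.array(window))   # predict 1 day ahead
        forecasts.append(next_value)
        window = window[1:] + [next_value]               # slide window, reuse the forecast
    return forecasts

def direct_forecast(direct_model, history):
    """Direct strategy: one model call emits the whole horizon at once."""
    return direct_model(np.array(history))               # e.g., an array of 7 future values
```

The recursive loop makes forecast errors compound across the horizon, whereas the direct call shifts the burden onto a single model that must output all future points in parallel.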

Aiming to resolve the challenges of missing influencing factors, complex feature extraction, and accurate multi-step prediction, this paper proposes a multi-head attention encoder (MAEN) model to address the industrial water withdrawal prediction problem. First, a linear layer is adopted to embed and transform the original inputs into a higher-dimensional space, capturing more complex patterns and interactions within the data. The multi-head attention mechanism is then used to identify data relationships by simultaneously learning multiple sets of attention weights between different locations. Finally, a linear layer is designed to map the encoder's high-dimensional output to the desired shape. The purpose of this approach is to capture temporal correlations in the time series and automatically extract data features. To demonstrate the performance of the MAEN model, it is compared with ANN, LSTM, and gated recurrent unit (GRU) models. The models are tested on predicting water withdrawal 1 day ahead and 7 days ahead using data from real factories.

Data description

This study utilizes withdrawal data from a chemical material factory, sourced from the Water Resources Management Platform of a central province in China. The data were collected daily from 1 September 2015 to 9 March 2023, giving a total of 2,748 records. As illustrated in Figure 1, the dataset has an average withdrawal of 13,228.29 m³, a maximum value of 41,676 m³, and a minimum value of 4,666 m³. Figure 1 indicates a general decreasing trend in water withdrawal, characterized by volatility and varying wave amplitudes. The overall decline in water withdrawal may result from the enterprise's water-saving goals, while the fluctuations could be attributed to changes in industrial output value. However, factors like industrial output value, which are closely tied to water withdrawal, are considered private information, and water resource managers do not have the authority to access them. This irregularity and fluctuation in the data, along with the lack of influencing factors, create significant challenges for accurate prediction.
Figure 1. Water withdrawal data.

The dataset is divided into a training set and a test set at a ratio of 9:1, used for model training and water withdrawal prediction, respectively (as shown in Figure 1).

Sample processing

Processing the input data can enhance the model's prediction accuracy and mitigate the impact of a broad value range. As depicted in Figure 1, the industrial water withdrawal data contain outliers. The 3σ principle, as described by Cui & Yan (2009), is employed to eliminate these outliers. Subsequently, standardization is used to normalize the data according to the following equation:

$$X^{\prime} = \frac{X - \mu}{\sigma} \tag{1}$$

where $X^{\prime}$ is the normalized data, $X$ is the original data, $\mu$ is the mean of the original data, and $\sigma$ is the standard deviation of the original data.
When training a deep learning model, the data cannot be used directly as input. A time window is needed to construct the data into labeled sets for input into the neural network. Considering the inherent dependence of time series data on temporal dynamics, the feature value at a time point is expanded to an interval encompassing that time point, which is called a time window (Bandara et al. 2020). The sliding time window technique transforms the original dataset into time series samples of a predefined length by applying a time window, thereby facilitating data input into the model. As shown in Figure 2(a), for single-step prediction, data division is achieved through a time window of fixed length and a specified sliding step size, generating samples for both the training set and the test set. Similarly, Figure 2(b) illustrates the sampling process for seven-step predictions.
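A minimal sketch of this preprocessing pipeline is given below, assuming the withdrawal series is a one-dimensional NumPy array. The 3σ handling, z-score standardization (Equation (1)), and sliding window follow the description above; because the exact outlier treatment is not detailed, outliers are simply clipped to the 3σ bounds here as one reasonable choice.

```python
import numpy as np

def build_samples(series, window=30, horizon=1):
    """3-sigma outlier handling, z-score standardization, and sliding-window samples."""
    x = np.asarray(series, dtype=float)

    # 3-sigma principle: values outside mean +/- 3*std are treated as outliers (clipped here)
    mu, sigma = x.mean(), x.std()
    x = np.clip(x, mu - 3 * sigma, mu + 3 * sigma)

    # Standardization, Equation (1): X' = (X - mean) / std
    mu, sigma = x.mean(), x.std()
    x = (x - mu) / sigma

    # Sliding time window: each sample maps `window` past days to the next `horizon` days
    inputs, targets = [], []
    for i in range(len(x) - window - horizon + 1):
        inputs.append(x[i:i + window])
        targets.append(x[i + window:i + window + horizon])
    return np.array(inputs), np.array(targets), (mu, sigma)

# Example: 30-day input window, direct 7-day-ahead targets
# X, y, stats = build_samples(raw_withdrawal, window=30, horizon=7)
```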
Figure 2. Model sample input and output.

Model construction

The multi-head attention mechanism, a variant of the attention mechanism, analyzes the relationships among sequence elements. It transforms input elements into queries (denoted as Q), keys (denoted as K), and values (denoted as V). The mechanism calculates the relationship weights between Q and K and applies these weights to the V elements to derive the final outcome (Chang et al. 2022). In the process of the multi-head attention mechanism, Q, K, and V are all transformed from the same input matrix. The computational formula for this mechanism within neural networks is presented as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O} \tag{2}$$

$$\mathrm{head}_i = \mathrm{Attention}\left(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}\right) \tag{3}$$

The attention function is computed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \tag{4}$$

where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the weight matrices of the $i$th head for queries, keys, and values, respectively, $W^{O}$ is the weight matrix for the output of the concatenated heads, $h$ is the number of heads in the multi-head attention mechanism, and $d_k$ is the dimension of the keys (and queries), used to scale the dot products for more stable gradients. The SoftMax function is applied row-wise.
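To make Equations (2)–(4) concrete, the snippet below evaluates scaled dot-product attention and a plain multi-head variant with PyTorch tensor operations. The shapes, head count, and weight matrices are illustrative placeholders rather than the paper's configuration.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = SoftMax(Q K^T / sqrt(d_k)) V, Equation (4)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (seq, seq) relationship weights
    weights = torch.softmax(scores, dim=-1)              # row-wise SoftMax
    return weights @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """MultiHead = Concat(head_1..head_h) W_O, with Q, K, V derived from X (Equations 2-3)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):           # per-head projections
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i
        heads.append(scaled_dot_product_attention(Q, K, V))
    return torch.cat(heads, dim=-1) @ W_o                 # concatenate heads and project

# Illustrative shapes: sequence length 30, model dimension 64, 8 heads of size 8
# X = torch.randn(30, 64)
# W_q = [torch.randn(64, 8) for _ in range(8)]; W_k, W_v likewise; W_o = torch.randn(64, 64)
# out = multi_head_attention(X, W_q, W_k, W_v, W_o)       # (30, 64)
```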
The model structure is shown in Figure 3:
  • (1) The water consumption data are embedded through the initial linear layer, which transforms the daily water withdrawal values into a high-dimensional space (as shown in Figure 3). This process reveals the intricate relationships between water consumption across different days, establishing a solid basis for further feature extraction and pattern recognition.

  • (2) The second layer utilizes a multi-head attention mechanism to identify the relationships between daily water withdrawals across different periods. Three weight matrices are multiplied with the embedded input to obtain the queries, keys, and values (Q, K, and V). The attention scores are calculated through dot products and converted into probability values by the SoftMax layer. The results from the different heads are then concatenated and fed into the feed-forward layer, which incorporates the Gaussian error linear unit (GELU) activation function to handle smaller data values effectively. GELU enables the model to account for the probability distribution of water consumption in neuron activation decisions, making it especially adept at discerning nuanced water withdrawal patterns.

  • (3) Finally, a linear layer maps the encoder output to the desired output size. By transforming the sequence and feature dimensions into the specified future sequence length, this layer reduces dimensionality, condensing the complex information processed by the encoder into forecasts (see the code sketch following this list).
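The PyTorch sketch below mirrors the three-stage structure just described: an input embedding layer, a multi-head self-attention encoder with a GELU feed-forward sublayer, and an output layer mapping to the forecast horizon. It is a minimal illustration; the layer sizes, head count, dropout, and the use of nn.TransformerEncoder are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MAEN(nn.Module):
    """Multi-head attention encoder sketch: embed -> attention encoder -> output mapping."""

    def __init__(self, window=30, horizon=7, d_model=64, n_heads=8,
                 n_layers=2, dropout=0.1):
        super().__init__()
        # (1) embed each daily withdrawal value into a d_model-dimensional space
        self.embed = nn.Linear(1, d_model)
        # (2) multi-head self-attention encoder with a GELU feed-forward sublayer
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            dropout=dropout, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # (3) map the flattened encoder output to the desired forecast length
        self.head = nn.Linear(window * d_model, horizon)

    def forward(self, x):                 # x: (batch, window)
        x = self.embed(x.unsqueeze(-1))   # (batch, window, d_model)
        x = self.encoder(x)               # (batch, window, d_model)
        return self.head(x.flatten(1))    # (batch, horizon)

# model = MAEN(window=30, horizon=7)
# y_hat = model(torch.randn(16, 30))      # 16 samples, 7-day-ahead forecasts
```

For 1 day ahead prediction, the same skeleton is used with the horizon set to 1.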

Figure 3. Model structure.

Comparison model

To demonstrate the model's predictive performance, the ANN, LSTM, and GRU models are used for comparison. ANN models were widely used for time series forecasting before the advent of deeper neural networks and have shown good results for domestic water consumption (Gagliardi et al. 2017). With the rise of deep learning, LSTM and GRU models have achieved notable success in time series forecasting due to their recurrent structures (Kühnert et al. 2021). They are therefore selected as comparison models in this study. A detailed description of the comparison models can be found in the supplementary materials.

All models are developed in Python, with PyTorch used to construct the deep neural network architectures.
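As one possible realization in the same PyTorch setting, the sketch below defines a minimal recurrent baseline that can be instantiated with either an LSTM or a GRU cell; it is illustrative only and does not reproduce the exact architectures described in the supplementary materials.

```python
import torch.nn as nn

class RecurrentForecaster(nn.Module):
    """Minimal LSTM/GRU baseline: recurrent encoding of the input window, linear read-out."""

    def __init__(self, horizon=7, hidden=128, layers=1, cell="lstm", dropout=0.3):
        super().__init__()
        rnn_cls = nn.LSTM if cell == "lstm" else nn.GRU
        self.rnn = rnn_cls(input_size=1, hidden_size=hidden, num_layers=layers,
                           batch_first=True,
                           dropout=dropout if layers > 1 else 0.0)
        self.head = nn.Linear(hidden, horizon)

    def forward(self, x):                  # x: (batch, window)
        h, _ = self.rnn(x.unsqueeze(-1))   # (batch, window, hidden)
        return self.head(h[:, -1, :])      # read out from the last time step
```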

Model parameter configuration

Appropriate hyperparameter configuration is crucial for the model to effectively learn deep features and achieve optimal predictive performance. This study adopts the grid search method to determine the best hyperparameters (as shown in Table 1). Grid search involves a comprehensive search over various hyperparameter combinations to determine the optimal parameters (Priyadarshini & Cotton 2021). One-ninth of the training dataset is held out as a validation set, and the hyperparameter combination with the smallest MSE on the validation set is chosen.

Table 1. Model hyperparameter search range and optimal results

Hyperparameter | Search range | MAEN | GRU | LSTM | ANN
Learning rate | 0.005, 0.01, 0.02, 0.03 (reduced by a factor of 0.2 every 40 rounds) | 0.02 | 0.02^a/0.03^b | 0.03^a/0.02^b | 0.03^a/0.01^b
Batch size | 16, 32, 64, 128 | 64 | 64^a/128^b | 128^a/32^b | 64
Epochs | 50, 100, 200 | 200^a/100^b | 100^a/200^b | 50^a/100^b | 100^a/200^b
Time window | 7, 30, 60, 90 | 30 | 60^a/90^b | 90^a/60^b | 30^a/60^b
N-head | 6, 8, 10 |  | – | – | –
Encoder layers | 1, 2, 3 |  | – | – | –
Hidden layers | 1, 2, 3 | – |  |  | 1^a/3^b
Hidden size | 32, 64, 128 | – | 128 | 64^a/128^b | –

^a Predictions for 1 day ahead.

^b Predictions for 7 days ahead.

The hyperparameters considered are learning rate, batch size, number of epochs, and time window size. For time series data, the water withdrawal from a week or a month ago significantly impacts current predictions (Li et al. 2020), making the time window important. Considering the characteristics of industrial sectors, the search range for the time window is set to 7, 30, 60, and 90 days. For the MAEN model, the number of attention heads and layers is also searched. For GRU and LSTM models, the dimensions of the hidden layers and the number of layers are considered, while for the ANN model, the number of hidden layers is searched.
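A schematic of this grid search is sketched below. The helper functions `train_model` and `validation_mse` are hypothetical stand-ins for training a model with a given configuration and scoring it on the held-out validation split; the search ranges follow Table 1.

```python
from itertools import product

# Search ranges from Table 1 (MAEN-specific entries shown as an example)
SEARCH_SPACE = {
    "learning_rate": [0.005, 0.01, 0.02, 0.03],
    "batch_size": [16, 32, 64, 128],
    "epochs": [50, 100, 200],
    "time_window": [7, 30, 60, 90],
    "n_heads": [6, 8, 10],
    "encoder_layers": [1, 2, 3],
}

def grid_search(train_model, validation_mse, train_data, val_data):
    """Evaluate every hyperparameter combination; keep the one with the lowest validation MSE."""
    best_config, best_mse = None, float("inf")
    for values in product(*SEARCH_SPACE.values()):
        config = dict(zip(SEARCH_SPACE.keys(), values))
        model = train_model(config, train_data)        # hypothetical training routine
        mse = validation_mse(model, val_data)          # hypothetical validation scorer
        if mse < best_mse:
            best_config, best_mse = config, mse
    return best_config, best_mse
```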

Model training

During model training, it is essential to employ optimizers and overfitting prevention techniques to improve training effectiveness and the model's generalization ability. The optimizer updates the neural network's weights to minimize the loss function, enhancing the model's performance. This work uses the Adam optimizer, one of the most widely used optimizers (Schmidt et al. 2021).

Regularization in deep neural network optimization is crucial for avoiding overfitting and enhancing the model's generalization. A popular regularization technique is applying an L2 penalty on model parameters, known as weight decay (Nakamura & Hong 2019). Weight decay helps prevent overfitting by reducing the weight values during updates, preventing the model parameters from becoming excessively large. In this study, weight decay is configured in the Adam optimizer with a value of 0.001.

Another method to prevent overfitting is dropout, which randomly sets selected neuron activations to zero during training with a certain probability (Srivastava et al. 2014). This reduces reliance on specific neurons, forcing the network to learn more robust features. Here, dropout layers are added to the network structures of the MAEN, LSTM, and GRU models. After a random search, the dropout rates are set to 0.1, 0.3, and 0.3, respectively. No dropout layers are used in the ANN model due to its simple structure.
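A minimal training loop reflecting these choices (Adam with a weight decay of 0.001, MSE loss, a step learning-rate schedule, and dropout already built into the model) might look as follows; the data loader, epoch count, and learning rate are placeholders.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=200, lr=0.02, weight_decay=1e-3, device="cpu"):
    """Train with Adam + L2 weight decay; dropout layers inside `model` are active in train mode."""
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    # One reading of Table 1: multiply the learning rate by 0.2 every 40 epochs
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.2)
    criterion = nn.MSELoss()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```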

Evaluation metrics

The metrics employed to evaluate model performance in this study are the mean absolute error (MAE), MSE, and MAPE, which are computed according to the following equations.
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{5}$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2} \tag{6}$$

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{7}$$

where $y_i$ is the true value and $\hat{y}_i$ is the predicted value.

MAE and MSE gauge the absolute magnitude of the prediction error, with values ranging over [0, +∞); MAPE shares this range but expresses the error as a percentage. In all cases, lower values indicate that the model's predictions are more closely aligned with the actual values, denoting better predictive performance.
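The three metrics can be computed directly from the true and predicted series, as in the short sketch below (MAPE returned as a percentage; the withdrawal values are strictly positive, so division by zero is not a concern).

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, Equation (5)."""
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean squared error, Equation (6)."""
    return np.mean((y_true - y_pred) ** 2)

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent, Equation (7)."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0
```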

Model predictions for 1 day ahead

The detailed 1 day ahead prediction performance of the ANN, LSTM, GRU, and MAEN models is shown in Figure 4. The green and pink shaded areas represent the differences between the actual and predicted values. The prediction curve of the ANN model shows the largest deviation from the actual curve, indicating its limitation in forecasting industrial water withdrawal. The ANN model relies only on neurons and nonlinear activation functions to train on the input data and lacks the ability to effectively extract the complex relationships between historical data points and sequential information (Sakib et al. 2024), making it unsuitable for non-periodic, volatile data. Although the LSTM and GRU models show reduced deviation compared to the ANN model, they still deviate when predicting turning points and individual peaks. Sequential models such as LSTM and GRU have recurrent connections that capture the sequential relationships in the data during forecasting (Kumari & Toshniwal 2021). However, they do not accurately predict the data at inflection points and extreme values (e.g., June 2022 and January 2023). The MAEN model mitigates these shortcomings and achieves accurate predictions, especially at inflection points and extreme values. In time series data, the importance of different time points varies. The attention mechanism can evaluate data importance at inflection points and extreme values using attention scores, ensuring minimal deviation in predictions for these special points (Abbasimehr & Paki 2022). Consequently, the MAEN model's prediction curve is closest to the original sequence, providing the best overall fit among the four models.
Figure 4. Water withdrawal prediction for 1 day.

Error of model predictions for 1 day ahead

To provide a thorough assessment of the prediction models, the absolute errors at each moment of the test dataset are plotted as violin plots (Figure 5). Violin plots display the distribution of the data: the width of the plot represents the probability density, so the wider it is, the higher the density of the data around that value. The GRU model has a few larger errors around 2,000, while the ANN model exhibits a wider distribution of errors above 2,000. The median is also marked in Figure 5 as a statistical measure of the error trend. The median of the MAEN model's violin plot is 385.01, showing that its errors are concentrated around this relatively low value. This is attributed to the attention mechanism, which reduces significant errors by assigning attention scores and accurately extracting historical data features, enabling more accurate predictions of industrial water withdrawal (Shih et al. 2019). The medians of the LSTM, GRU, and ANN violin plots increase sequentially, with values of 414.41, 421.68, and 461.06, respectively. The error distribution plot further confirms that the prediction performance of the MAEN model is superior to that of the other three models.
Figure 5. Error distribution of water withdrawal prediction for 1 day.

The evaluation indicators of all models are shown in Table 2 to present the error values more directly.

Table 2. The error of different models on the water withdrawal test set for 1 day

Indicator | ANN | GRU | LSTM | MAEN
MAE | 690 | 653 | 648 | 628
MSE | 1.28 × 10⁶ | 1.24 × 10⁶ | 1.28 × 10⁶ | 1.2 × 10⁶
MAPE (%) | 8.04 | 7.55 | 7.58 | 7.14

In time series forecasting, sequential models generally outperform the ANN because their recurrent structures can store sequential information. In this context, both the LSTM and GRU models outperform the ANN across all evaluation metrics, except for the MSE, where the LSTM's MSE matches the ANN's. The small differences between the evaluation indicators of LSTM and GRU suggest that they have similar predictive performance. GRU gives lower MSE and MAPE values of 1.24 × 10⁶ and 7.55%, respectively, but a higher MAE value compared to LSTM. This discrepancy may arise from distinct error distributions; the GRU model has a few larger errors compared to the LSTM model (as shown in Figure 5). Compared to the ANN, GRU, and LSTM models, MAEN reduces MAPE by 11.2, 5.4, and 5.8%, respectively. As anticipated, the MAEN model outperforms the other three models, demonstrating the lowest MAE, MSE, and MAPE. The MAEN model has several advantages: first, the initial linear layer transmits complete information to the attention layer, enhancing prediction stability; second, the multi-head attention mechanism assigns appropriate weights to the input information through attention scores (Wang et al. 2023), adaptively discerning importance and ensuring no information is missed in predictions.

Model predictions for 7 days ahead

For further performance evaluation, all models were used to forecast industrial water withdrawal for the next 7 days using direct forecasting, which prevents the error accumulation of rolling forecasts. The results are shown in Figure 6. The prediction curve of the ANN model shows significant deviations during the period from November 2022 to March 2023, which is precisely when water withdrawal fluctuations are at their highest. The ANN model has only a three-layer network, performing only simple transformations, and is unable to handle significant water withdrawal fluctuations, leading to less reliable predictions (Guo et al. 2018). Interestingly, at time points without significant fluctuations, the GRU model also shows large deviations. For example, in December 2022, there were three instances of significant prediction deviations, which did not occur during single-step prediction. The prediction curve of the LSTM model also showed significant deviations during this period. Although the GRU and LSTM models rely on recurrent structures to retain the memory of previous data, they are unable to simultaneously capture the relationships between the multiple time points to be forecasted and the historical data within the prediction segment (Nguyen et al. 2020). Nevertheless, direct forecasting is more promising than the recursive strategy for multi-step forecasting problems (Ghobadi & Kang 2022). In this context, the effectiveness of the multi-head attention mechanism is evident: it retains the relationships of data at arbitrary positions using multiple attention heads and assigns different levels of importance, thereby improving the prediction accuracy of multiple outputs (Sakib et al. 2024). Therefore, the MAEN model exhibits higher prediction stability than the other models in multi-step forecasting.
Figure 6. Water withdrawal prediction for 7 days.

Error of model predictions for 7 days ahead

The absolute prediction error distributions of the four models are illustrated in Figure 7. The violin plots of all models have a similar shape and are generally consistent with the single-step predictions. The only difference is that the width at the widest point of the ANN model's violin plot is smaller than that of the other three models, mainly because part of its error mass is distributed at larger values. The medians for the MAEN, LSTM, GRU, and ANN models are 467.15, 520.65, 477.38, and 622.79, respectively. The median of the MAEN model is still the smallest, while the ANN model has the largest median. This indicates the different applicability of the two models in predicting industrial water withdrawal. The ANN model is not effective at handling highly volatile water withdrawal data (such as June 2022 and December 2022 to January 2023). In contrast, the MAEN model, with its unique attention mechanism, automatically extracts fluctuation characteristics and makes reliable predictions.
Figure 7. Error distribution of water withdrawal prediction for 7 days.

As observed in Table 3, while the MAE value of the GRU model is slightly larger than that of the LSTM model in single-step prediction, it is lower in multi-step prediction. The same pattern is observed in the median errors of the two models (as indicated in Figure 7). This could be because, although the GRU model had several significant errors in multi-step predictions, the rest of its errors were generally smaller than those of the LSTM model. Although the LSTM and GRU models have different structures, they are both essentially recurrent neural networks. As a result, their evaluation metrics show similar values on the same dataset (Cheng & Yang 2021).

Table 3. The error of different models on the water withdrawal test set for 7 days

Indicator | ANN | GRU | LSTM | MAEN
MAE | 871 | 773 | 777 | 715
MSE | 1.71 × 10⁶ | 1.65 × 10⁶ | 1.58 × 10⁶ | 1.44 × 10⁶
MAPE (%) | 10.64 | 9.58 | 9.49 | 8.27

When compared to single-step prediction, all models exhibit varying degrees of metric increase, indicating a decline in performance in multi-step prediction. The MAEN model still has the highest prediction accuracy with an MAE of 715, MSE of 1.44 × 10⁶, and MAPE of 8.27%, further corroborating its superior performance. It is worth noting that, for single-step predictions, the MAEN model reduced MAE, MSE, and MAPE by 3.8, 3.2, and 5.4%, respectively, compared to the best-performing comparison model. For multi-step predictions, MAE, MSE, and MAPE decreased by 8.0, 8.9, and 12.9%, respectively. This indicates that the MAEN model's superiority is greater in multi-step predictions than in single-step predictions. The direct strategy in multi-step predictions tests the model's stability and reliability more rigorously. The multi-head attention mechanism can simultaneously learn the complex relationships between water withdrawal data across multiple time steps and extract important feature information, thereby enhancing the model's learning effectiveness. This makes it particularly suitable for prediction tasks involving highly volatile and irregular data, such as industrial water withdrawal.

Overall performance

The superior performance of the MAEN model comes from the core of the model: the multi-head attention layer. Self-attention captures information from historical data using attention scores, and the ‘multi-head’ mechanism enables the model to view this information from multiple perspectives. As seen in Figures 4 and 7, the model gives more attention to data at inflection points and extreme values, resulting in smaller prediction errors at these points. In multi-step predictions, the data relationships within the prediction horizon (7 days in this study) can also be appropriately captured. As a result, the MAEN model's advantage over the comparison models is greater in multi-step predictions than in single-step predictions. This is evident from Tables 2 and 3, where the reduction in error is greater for multi-step predictions than for single-step predictions. The incorporation of the multi-head attention mechanism enables the MAEN model to automatically capture water use characteristics through trained weights and to provide direct multi-step prediction outputs, which are more informative and applicable to actual production scenarios.

Benefits of industrial water withdrawal predictions

Domestic water demand is highly cyclical and is related to easily accessible factors such as temperature and rainfall, making its prediction relatively easy to achieve. Industrial water withdrawal has weak periodicity and is prone to fluctuations (as shown in Figure 1), and its influencing factors are difficult to obtain. Therefore, effectively extracting the historical data features of industrial water withdrawal is crucial for improving prediction accuracy. Models such as ANN, LSTM, and GRU, while performing well in predicting domestic water demand, do not exhibit the same level of performance for industrial water withdrawal (e.g., the ANN model shows the largest error metrics, while LSTM and GRU demonstrate instability in multi-step predictions). The smallest error values produced by the MAEN model further confirm the significant superiority of the multi-head attention mechanism in capturing relationships between water withdrawals on any given date. This approach is not limited by the periodicity or stationarity of water withdrawal, making it well suited for prediction tasks involving highly volatile industrial water withdrawal. To demonstrate the generalizability of the MAEN model in predicting industrial water withdrawal, we applied it to a real dataset from a chemical factory for both single-step and seven-step predictions. The MAEN model still exhibited superior predictive capabilities compared to the ANN, LSTM, and GRU models (see the supplementary material for details).

Significance of multi-step predictions

Industrial water withdrawals account for a significant portion of total water resource withdrawals. Predictive modeling plays a crucial role in aiding managers to anticipate water demand, utilize water resources efficiently, and optimize industrial production. In particular, multi-step predictions provide industrial enterprises with more time to respond (González Perea et al. 2023). In building the model, multiple layers of the multi-head attention mechanism were implemented to ensure effective attention to water withdrawal features from different perspectives. As a result, even when using a direct strategy, the seven-step prediction maintained high accuracy. The MAEN model can be used by water resource management departments to anticipate industrial enterprises' water usage dynamics, promptly remind relevant enterprises to optimize their industrial activities, and implement water-saving measures.

The MAEN model proposed in this paper performs high-dimensional embedding of water withdrawal through linear layers and leverages the multi-head attention mechanism to capture historical water features, enabling accurate and stable single-step and seven-step industrial water withdrawal predictions. Compared to the ANN, LSTM, and GRU models, the MAEN model has the smallest error metrics, with MAE, MSE, and MAPE values of 715, 1.44 × 10⁶, and 8.27%, corroborating its superior performance. The reduction in error values for the MAEN model is greater in multi-step predictions than in single-step predictions (e.g., the MAPE for 1 day ahead predictions decreased by 5.4% compared to the best comparison model, while the MAPE for 7 days ahead predictions decreased by 12.9%), further demonstrating its superiority in multi-step predictions.

The primary objective of this work is to assess the multi-head attention mechanism's efficacy in addressing the complexity and dynamics of industrial water usage patterns, ultimately achieving highly accurate water quantity predictions and optimizing water resource utilization. The proposed model is tailored for time series data with no periodic patterns and significant fluctuations, such as industrial water withdrawal. Knowing the industrial water withdrawal several days in advance will help manage water resource systems, optimize water-saving measures, and reduce energy costs.

However, it is essential to acknowledge that current research predominantly focuses on the use of the multi-head attention mechanism while overlooking data preprocessing. Future research endeavors will delve into exploring more effective data processing techniques and advanced deep learning models to further enhance the accuracy of industrial water withdrawal forecasting. These efforts aim to contribute significantly to sustainable water resource management and optimize industrial production practices.

This work was supported by the Joint Fund Project of the Natural Science Foundation of Anhui Province (Grant Nos. 2208085US05 and 2308085US05) and the Fundamental Research Funds for the Central Universities (Grant No. JZ2023HGTB0264). During the preparation of this work, the authors used an AI tool for language polishing to improve the readability of the manuscript.

R. Y. wrote the original draft, developed the methodology, rendered support in formal analysis, validated, and visualized the article. Y. M. developed the methodology, rendered support in formal analysis, and visualized the article. L. L. investigated and rendered support in data curation. H. X. investigated the data, arranged the resources, rendered support in data curation, and validated the article. H. L. investigated the data, arranged the software, and arranged the resources for the article. H. M. conceptualized the whole article, arranged the resources, and rendered support in funding acquisition. W. W. supervised the work and conceptualized the whole article. K. S. wrote the review, edited the article, and rendered support in project administration and funding acquisition. X. Z. conceptualized the whole article, wrote the review and edited the article, validated the article, and rendered support in project administration and funding acquisition.

Data cannot be made publicly available; readers should contact the corresponding author for details.

The authors declare there is no conflict.

Abbasimehr H. & Paki R. 2022 Improving time series forecasting using LSTM and attention models. Journal of Ambient Intelligence and Humanized Computing 13, 673–691. https://doi.org/10.1007/s12652-020-02761-x
Bandara K., Bergmeir C. & Smyl S. 2020 Forecasting across time series databases using recurrent neural networks on groups of similar series: A clustering approach. Expert Systems with Applications 140, 112896. https://doi.org/10.1016/j.eswa.2019.112896
Chandra R., Goyal S. & Gupta R. 2021 Evaluation of deep learning models for multi-step ahead time series prediction. IEEE Access 9, 83105–83123. https://doi.org/10.1109/ACCESS.2021.3085085
Chang Y., Li F., Chen J., Liu Y. & Li Z. 2022 Efficient temporal flow transformer accompanied with multi-head probsparse self-attention mechanism for remaining useful life prognostics. Reliability Engineering & System Safety 226, 108701. https://doi.org/10.1016/j.ress.2022.108701
Chen Y., Peng G., Zhu Z. & Li S. 2020 A novel deep learning method based on attention mechanism for bearing remaining useful life prediction. Applied Soft Computing 86, 105919. https://doi.org/10.1016/j.asoc.2019.105919
Chen L., Yan H., Yan J., Wang J., Tao T., Xin K., Li S., Pu Z. & Qiu J. 2022 Short-term water demand forecast based on automatic feature extraction by one-dimensional convolution. Journal of Hydrology 606, 127440. https://doi.org/10.1016/j.jhydrol.2022.127440
Cheng Y. & Yang Y. 2021 Prediction of oil well production based on the time series model of optimized recursive neural network. Petroleum Science and Technology 39, 303–312. https://doi.org/10.1080/10916466.2021.1877303
Cui W. & Yan X. 2009 Adaptive weighted least square support vector machine regression integrated with outlier detection and its application in QSAR. Chemometrics and Intelligent Laboratory Systems 98, 130–135. https://doi.org/10.1016/j.chemolab.2009.05.008
Du B., Zhou Q., Guo J., Guo S. & Wang L. 2021 Deep learning with long short-term memory neural networks combining wavelet transform and principal component analysis for daily urban water demand forecasting. Expert Systems with Applications 171, 114571. https://doi.org/10.1016/j.eswa.2021.114571
Gagliardi F., Alvisi S., Franchini M. & Guidorzi M. 2017 A comparison between pattern-based and neural network short-term water demand forecasting models. Water Supply 17, 1426–1435. https://doi.org/10.2166/ws.2017.045
González Perea R., Fernández García I., Camacho Poyato E. & Rodríguez Díaz J. A. 2023 New memory-based hybrid model for middle-term water demand forecasting in irrigated areas. Agricultural Water Management 284, 108367. https://doi.org/10.1016/j.agwat.2023.108367
Guo G., Liu S., Wu Y., Li J., Zhou R. & Zhu X. 2018 Short-term water demand forecast based on deep learning method. Journal of Water Resources Planning and Management 144, 04018076. https://doi.org/10.1061/(ASCE)WR.1943-5452.0000992
Ibrahim T., Omar Y. & Maghraby F. A. 2020 Water demand forecasting using machine learning and time series algorithms. IEEE Access, 325–329.
Kavya M., Mathew A., Shekar P. R. & Sarwesh P. 2023 Short term water demand forecast modelling using artificial intelligence for smart water management. Sustainable Cities and Society 95, 104610. https://doi.org/10.1016/j.scs.2023.104610
Kelley J. & Hagan M. T. 2024 Comparison of neural network NARX and NARMAX models for multi-step prediction using simulated and experimental data. Expert Systems with Applications 237, 121437. https://doi.org/10.1016/j.eswa.2023.121437
Kim J., Lee H., Lee M., Han H., Kim D. & Kim H. S. 2022 Development of a deep learning-based prediction model for water consumption at the household level. Water 14, 1512. https://doi.org/10.3390/w14091512
Kozłowski E., Kowalska B., Kowalski D. & Mazurkiewicz D. 2018 Water demand forecasting by trend and harmonic analysis. Archives of Civil and Mechanical Engineering 18, 140–148. https://doi.org/10.1016/j.acme.2017.05.006
Kühnert C., Gonuguntla N. M., Krieg H., Nowak D. & Thomas J. A. 2021 Application of LSTM networks for water demand prediction in optimal pump control. Water 13, 644. https://doi.org/10.3390/w13050644
Kumari P. & Toshniwal D. 2021 Long short term memory–convolutional neural network based deep hybrid approach for solar irradiance forecasting. Applied Energy 295, 117061. https://doi.org/10.1016/j.apenergy.2021.117061
Leitão J., Simões N., Sá Marques J. A., Gil P., Ribeiro B. & Cardoso A. 2019 Detecting urban water consumption patterns: A time-series clustering approach. Water Supply 19, 2323–2329. https://doi.org/10.2166/ws.2019.113
Li M., Wang Y., Wang Z. & Zheng H. 2020 A deep learning method based on an attention mechanism for wireless network traffic prediction. Ad Hoc Networks 107, 102258. https://doi.org/10.1016/j.adhoc.2020.102258
Li L., Li Z., Liu Y. & Hong Q. 2021 Deep joint learning for language recognition. Neural Networks 141, 72–86.
Liu J. L., Chen X. & Zhang T. J. 2011 Application of time series – exponential smoothing model on urban water demand forecasting. Advanced Materials Research 183–185, 1158–1162. https://doi.org/10.4028/www.scientific.net/AMR.183-185.1158
Lorente-Leyva L. L., Pavón-Valencia J. F., Montero-Santos Y., Herrera-Granda I. D., Herrera-Granda E. P. & Peluffo-Ordóñez D. H. 2019 Artificial neural networks for urban water demand forecasting: A case study. Journal of Physics: Conference Series 1284, 012004. https://doi.org/10.1088/1742-6596/1284/1/012004
Mombeni H. A., Rezaei S., Nadarajah S. & Emami M. 2013 Estimation of water demand in Iran based on SARIMA models. Environmental Modeling & Assessment 18, 559–565. https://doi.org/10.1007/s10666-013-9364-4
Mouatadid S. & Adamowski J. 2017 Using extreme learning machines for short-term urban water demand forecasting. Urban Water Journal 14, 630–638. https://doi.org/10.1080/1573062X.2016.1236133
Mu L., Zheng F., Tao R., Zhang Q. & Kapelan Z. 2020 Hourly and daily urban water demand predictions using a long short-term memory based model. Journal of Water Resources Planning and Management 146, 05020017. https://doi.org/10.1061/(ASCE)WR.1943-5452.0001276
Nakamura K. & Hong B.-W. 2019 Adaptive weight decay for deep neural networks. IEEE Access 7, 118857–118865. https://doi.org/10.1109/ACCESS.2019.2937139
Priyadarshini I. & Cotton C. 2021 A novel LSTM-CNN-grid search-based deep neural network for sentiment analysis. The Journal of Supercomputing 77, 13911–13932. https://doi.org/10.1007/s11227-021-03838-w
Pu Z., Yan J., Chen L., Li Z., Tian W., Tao T. & Xin K. 2022 A hybrid Wavelet-CNN-LSTM deep learning model for short-term urban water demand forecasting. Frontiers of Environmental Science & Engineering 17, 22. https://doi.org/10.1007/s11783-023-1622-3
Razali S. N. A. M., Rusiman M. S., Zawawi N. I. & Arbin N. 2018 Forecasting of water consumptions expenditure using Holt-Winter's and ARIMA. Journal of Physics: Conference Series 995, 012041. https://doi.org/10.1088/1742-6596/995/1/012041
Sakib S., Mahadi M. K., Abir S. R., Moon A.-M., Shafiullah A., Ali S., Faisal F. & Nishat M. M. 2024 Attention-based models for multivariate time series forecasting: Multi-step solar irradiation prediction. Heliyon 10, e27795. https://doi.org/10.1016/j.heliyon.2024.e27795
Salloom T., Kaynak O. & He W. 2021 A novel deep neural network architecture for real-time water demand forecasting. Journal of Hydrology 599, 126353. https://doi.org/10.1016/j.jhydrol.2021.126353
Schmidt R. M., Schneider F. & Hennig P. 2021 Descending through a crowded valley – benchmarking deep learning optimizers. PMLR 139, 9367–9376.
Shih S. Y., Sun F. K. & Lee H. 2019 Temporal pattern attention for multivariate time series forecasting. Machine Learning 108, 1421–1441. https://doi.org/10.1007/s10994-019-05815-0
Srivastava N., Hinton G., Krizhevsky A., Sutskever I. & Salakhutdinov R. 2014 Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1929–1958.
Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł. & Polosukhin I. 2017 Attention is all you need. Advances in Neural Information Processing Systems 30, 5998–6008.
Voita E., Talbot D., Moiseev F., Sennrich R. & Titov I. 2019 Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv:1905.09418. https://doi.org/10.48550/arXiv.1905.09418
Wang B., Wang X. & Zhang X. 2019 An empirical research on influence factors of industrial water use. Water 11, 2267. https://doi.org/10.3390/w11112267
Wang Y., Wang W., Chau K., Xu D., Zang H., Liu C. & Ma Q. 2023 A new stable and interpretable flood forecasting model combining multi-head attention mechanism and multiple linear regression. Journal of Hydroinformatics 25, 2561–2588. https://doi.org/10.2166/hydro.2023.160
Zhang Y. M. & Wang H. 2023 Multi-head attention-based probabilistic CNN-BiLSTM for day-ahead wind speed forecasting. Energy 278, 127865. https://doi.org/10.1016/j.energy.2023.127865
Zhang W., Yang Q., Kumar M. & Mao Y. 2018 Application of improved least squares support vector machine in the forecast of daily water consumption. Wireless Personal Communications 102, 3589–3602. https://doi.org/10.1007/s11277-018-5393-2
Zhang X., Zhao D., Wang T. & Wu X. 2022 Industrial water consumption forecasting based on combined CEEMD-ARIMA model for Henan province, central China: A case study. Environmental Monitoring and Assessment 194, 471. https://doi.org/10.1007/s10661-022-10149-x
Zhou Z., Wu H. & Song P. 2019 Measuring the resource and environmental efficiency of industrial water consumption in China: A non-radial directional distance function. Journal of Cleaner Production 240, 118169. https://doi.org/10.1016/j.jclepro.2019.118169

Author notes

Equally contributed

This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY-NC-ND 4.0), which permits copying and redistribution for non-commercial purposes with no derivatives, provided the original work is properly cited (http://creativecommons.org/licenses/by-nc-nd/4.0/).
