ABSTRACT
Industrial water withdrawal prediction is the cornerstone of water resources monitoring and early warning, as well as a key task in implementing rigid constraints on water resources. Although many factors may influence water withdrawal, they are difficult to obtain. Extracting features from the historical water withdrawal data itself is a more convenient way to achieve accurate prediction. However, because industrial water data lack significant regularity and periodicity, feature extraction is challenging. In this paper, a multi-head attention encoder (MAEN) model consisting of multi-head attention mechanisms and feed-forward neural networks is proposed to enable more convenient and accurate forecasting. The multi-head attention mechanism allows the model to focus on different parts of the data simultaneously, detect complex dependencies, and extract key features from historical data. Applied to water withdrawal data from real factories, the proposed model outperforms common time series prediction models, including the artificial neural network (ANN), long short-term memory (LSTM), and gated recurrent unit (GRU). Compared to the best-performing comparison model, MAEN reduces the mean squared error by 3.2% for 1 day ahead predictions and by 8.9% for 7 days ahead predictions, illustrating its superior performance in industrial water withdrawal prediction.
HIGHLIGHTS
A multi-head attention encoder model is used to predict industrial water withdrawal.
Multi-head attention mechanism allows efficient extraction of complex data dependencies.
The model achieves superior performance in both single-step and seven-step predictions.
The model shows better accuracy and stability compared with ANN, LSTM, and GRU.
INTRODUCTION
Water is a finite resource essential for human survival. Over time, factors such as climate change, population growth, urbanization, and industrial expansion have significantly impacted water resources, leading to increased consumption (Leitão et al. 2019). In response, the Chinese government has launched numerous initiatives and strategies aimed at curbing water usage in the pursuit of sustainable development. Rapid urbanization and stringent national energy-saving policies underscore the urgency of adopting precise water resource management (Pu et al. 2022). Forecasting user water demand is a valuable yet challenging step toward efficient water resource management (Du et al. 2021). In recent years, millions of tons of water have been wasted in industrial processes (Zhou et al. 2019). If industrial water withdrawal can be predicted in advance, corresponding water-saving measures can be implemented to reduce water usage and thereby minimize water waste. However, unlike domestic water consumption, which exhibits strong periodicity and trends, industrial water consumption is characterized by complex randomness and nonlinearity (Wang et al. 2019). Various factors, including gross domestic product, industry development, and product structure, complicate the prediction of industrial water withdrawal (Zhang et al. 2022), so accurate forecasts remain scarce.
Under the severe water resource situation, many water consumption prediction efforts have been undertaken. Nevertheless, research on water consumption prediction has predominantly concentrated on domestic usage. These methods include the time series model (Liu et al. 2011), machine learning techniques (Zhang et al. 2018), and deep learning approaches (Kim et al. 2022). Widely adopted time series models include the Holt–Winters method, the autoregressive integrated moving average (ARIMA) model, and the seasonal autoregressive integrated moving average (SARIMA). Both ARIMA and SARIMA models utilize differencing to smooth data, allowing linear models to identify features and yield accurate predictions (Razali et al. 2018). Mombeni et al. (2013) applied SARIMA models to water demand data from monthly residential water consumption in Iran from May 2001 to March 2010, discovering that the residuals of the model were well approximated by a three-parameter log-logistic distribution. Razali et al. (2018) successfully forecasted water consumption expenditure in Malaysia from 2006 to 2014 using the ARIMA model, with a mean absolute percentage error (MAPE) of 19.7085%. Nonetheless, the inherent volatility of water consumption data means that traditional time series models often produce significant errors in predicting nonlinear time series data (Du et al. 2020).
Machine learning models are effective at handling nonlinear time series data because of their powerful learning ability. They can automatically learn the variation characteristics of historical water consumption data, which is why they are widely applied in water consumption forecasting (Guo et al. 2018). Support vector regression (SVR), one of the pioneering machine learning models, leverages kernel functions to effectively handle system nonlinearity. A comparative analysis by Ibrahim et al. (2020) on water demand forecasting using SVR and ARIMA indicated superior prediction accuracy of SVR over ARIMA. Furthermore, Mouatadid & Adamowski (2017) found that the extreme learning machine model provided greater accuracy than multiple linear regression and SVR models in forecasting Montreal urban water demand for 1 and 3 days ahead. Among machine learning models, the artificial neural network (ANN) is the most prevalent, employing the backpropagation algorithm to iteratively adjust weights, thereby minimizing prediction error and optimizing outcomes. For example, Lorente-Leyva et al. (2019) used an ANN to predict urban water demand and achieved a 38.75% reduction in mean squared error (MSE) compared to ARIMA. Although machine learning methods have been widely used in prediction problems, the relatively simple architecture of some machine learning models struggles with large-scale data analysis, hindering the extraction of complex data features (Chen et al. 2020).
Its excellent capabilities in nonlinear mapping and deep feature extraction make deep learning a focal point in water demand forecasting research (Salloom et al. 2021). Deep learning methods are multi-layer neural network models based on the framework of artificial neural networks. These deep structures excel at autonomously identifying complex features from vast volumes of unstructured data, compensating for the limitations of machine learning in handling complex data. Deep learning has wide applications across various fields, including computer vision, natural language processing, and speech recognition (Li et al. 2021). Guo et al. (2018) developed a gated recurrent unit network (GRUN) model to forecast water demand with a 15-min time step and found that deep neural network models such as GRUN outperform the ANN and SARIMA models at this time step. Mu et al. (2020) demonstrated the efficacy of a long short-term memory (LSTM)-based model in predicting Hefei City's short-term urban water demands, outperforming the ARIMA model. Similarly, Chen et al. (2022) developed a one-dimensional convolution-gated recurrent unit model showcasing significant automatic feature extraction capability for short-term water demand forecasting. With optimal parameter settings and training strategies, this method achieved the best values for the MAPE and Nash–Sutcliffe efficiency coefficient metrics, 1.677 and 0.983, respectively.
Despite the achievements of machine learning and deep learning in domestic water consumption forecasting, research on industrial water consumption prediction remains scarce (Zhang et al. 2022). This gap can be attributed to two main factors. First, residential water consumption generally exhibits cyclical and predictable patterns, enabling forecasts based on historical data characteristics (Kozłowski et al. 2018). In contrast, industrial water withdrawal is less predictable because demand is irregular and periodicity is weak. Second, industrial water withdrawal is influenced by variables such as industrial structure, market conditions, and gross domestic product, which are often challenging to obtain. Although studies incorporating water consumption influencers as model inputs have been proposed to enhance prediction accuracy (Kavya et al. 2023), the diverse withdrawal behaviors among industrial water users and the limited availability of influential data (such as enterprise output value and product structure) present substantial obstacles to accurately predicting individual user water withdrawals.
In recent years, the multi-head attention mechanism has attracted widespread attention. It is a key innovation in the transformer model proposed by Vaswani et al. (2017) to process natural language, which, like water consumption data, is essentially sequential data. Because its attention framework extracts features from historical data based on attention scores, the multi-head attention mechanism excels at capturing long-term dependencies in sequential data. This provides a new tool for extracting the features in historical industrial water withdrawal data for more precise forecasting. The mechanism's ability to discern potential attentional correlations across various layers enhances data representation, mitigating the challenges that the weak periodicity of industrial water consumption poses to forecasting accuracy (Voita et al. 2019).
Another problem of industrial water withdrawal prediction is temporal scale. Generally, the prediction of time series can be divided into single-step prediction and multi-step prediction (Kelley & Hagan 2024). For industrial water management, preparing sufficient time for responses to its changes is important, thus taking multi-step ahead strategies for forecasting is necessary. There are two major approaches for multi-step ahead prediction, which include recursive and direct strategies (Chandra et al. 2021). The recursive method extends single-step forecasts to subsequent time units, using each new forecast as the basis for the next, which may accumulate errors and diminish accuracy. The direct method, in contrast, generates multiple future data points in a single model run, posing challenges to the model's capacity. The multi-head attention mechanism, adept at processing sequential data, discerns the relationships among data points via a weight matrix, enabling it to capture temporal dependencies in water usage and facilitate the extraction of usage patterns (Zhang & Wang 2023). Consequently, it can effectively generate parallel outputs for multiple future points. This raises the question of whether the multi-head attention mechanism offers enhanced performance in predicting industrial water withdrawal over conventional models, particularly for multi-step forecasting.
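The difference between the recursive and direct strategies can be illustrated with a minimal sketch. The two stand-in "models" below simply return the mean of their input window; they are hypothetical placeholders chosen for illustration, not the models evaluated in this paper, and the function names are assumptions.

```python
from typing import Callable, List, Sequence


def recursive_forecast(
    history: Sequence[float],
    one_step_model: Callable[[Sequence[float]], float],
    window: int,
    horizon: int,
) -> List[float]:
    """Predict `horizon` steps by feeding each forecast back as an input."""
    values = list(history)
    forecasts = []
    for _ in range(horizon):
        next_value = one_step_model(values[-window:])
        forecasts.append(next_value)
        values.append(next_value)  # any error in next_value is carried onward
    return forecasts


def direct_forecast(
    history: Sequence[float],
    multi_step_model: Callable[[Sequence[float]], List[float]],
    window: int,
) -> List[float]:
    """Predict all future steps in one call, using observed values only."""
    return multi_step_model(list(history)[-window:])


# Hypothetical stand-in models: both return the mean of the input window.
def mean_one_step(window_values: Sequence[float]) -> float:
    return sum(window_values) / len(window_values)


def mean_seven_step(window_values: Sequence[float]) -> List[float]:
    return [sum(window_values) / len(window_values)] * 7


history = [10.0, 12.0, 11.0, 13.0]
recursive = recursive_forecast(history, mean_one_step, window=2, horizon=7)
direct = direct_forecast(history, mean_seven_step, window=2)
```

In the recursive version, the second forecast already depends on the first forecast rather than on an observation, which is where error accumulation can arise. The direct version never consumes its own outputs.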
To address the challenges of missing influence factors, complex feature extraction, and accurate multi-step prediction, this paper proposes a multi-head attention encoder (MAEN) model for industrial water withdrawal prediction. First, a linear layer is adopted to embed and transform the original inputs into a higher-dimensional space, capturing more complex patterns and interactions within the data. The multi-head attention mechanism is then used to identify data relationships by simultaneously learning multiple sets of attention weights between different locations. Finally, a linear layer is designed to map the encoder's high-dimensional output to the desired shape. The purpose of this approach is to capture temporal correlations in the time series and automatically extract data features. To demonstrate the performance of the MAEN model, it is compared with ANN, LSTM, and gated recurrent unit (GRU) models. The models are tested on predicting the water withdrawal 1 day ahead and 7 days ahead using data from real factories.
DATA AND METHODOLOGY
Data description
The dataset is divided into a training dataset and a test dataset at a ratio of 9:1, which are used for model training and water withdrawal prediction, respectively (as shown in Figure 1).
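A 9:1 split of this kind might be sketched as follows. The sketch assumes the split is chronological (the earliest 90% of days for training, the latest 10% for testing), which is common practice for time series but is not stated explicitly in the text; the function name and the placeholder data are illustrative.

```python
from typing import List, Sequence, Tuple


def chronological_split(
    series: Sequence[float], train_parts: int = 9, test_parts: int = 1
) -> Tuple[List[float], List[float]]:
    """Split a time series in order, without shuffling."""
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(series) * train_parts // (train_parts + test_parts)
    return list(series[:cut]), list(series[cut:])


daily_withdrawal = [float(day) for day in range(100)]  # placeholder data
train, test = chronological_split(daily_withdrawal)
```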
Sample processing
Model construction
(1) The water consumption data are embedded through the initial linear layer, which transforms the daily water withdrawal figures into a high-dimensional space (as shown in Figure 3). This process exposes the relationships between water consumption on different days, establishing a solid basis for further feature extraction and pattern recognition.
(2) The second layer uses a multi-head attention mechanism to identify the relationships between daily water withdrawals across various periods. The embedded input is multiplied by three weight matrices to obtain the query, key, and value matrices. The attention scores are calculated through dot products and converted into probability values by the softmax layer. The results of the different heads are then concatenated and passed to the feed-forward layer, which incorporates the Gaussian error linear unit (GELU) activation function to effectively manage smaller data values. GELU enables the model to account for the probability distribution of water consumption in neuron activation decisions, making it especially adept at discerning nuanced water withdrawal patterns.
(3) Finally, a linear layer maps the encoder output into the desired output size. By transforming the sequence and feature dimensions into the specified future sequence length, this layer reduces dimensionality, condensing the complex information processed by the encoder into forecasts.
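The attention computation in step (2) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: each head uses its slice of the embedded input directly as query, key, and value, whereas a trained model would first apply learned projection matrices. The function names and the toy input are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention for one head."""
    scale = math.sqrt(len(keys[0]))
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value rows: one output row per query (per day).
        output.append(
            [sum(w * v[j] for w, v in zip(weights, values))
             for j in range(len(values[0]))]
        )
    return output


def multi_head_attention(x: Matrix, n_heads: int) -> Matrix:
    """Split features into heads, attend per head, then concatenate.

    Simplification: Q = K = V = the head's slice of x (no learned weights).
    """
    head_dim = len(x[0]) // n_heads
    heads = []
    for h in range(n_heads):
        part = [row[h * head_dim:(h + 1) * head_dim] for row in x]
        heads.append(attention(part, part, part))
    # Concatenate the head outputs row by row.
    return [sum((head[i] for head in heads), []) for i in range(len(x))]


def gelu(value: float) -> float:
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * value * (1.0 + math.erf(value / math.sqrt(2.0)))


# Toy embedded input: 3 days, 4 features per day.
embedded = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, 0.5], [1.0, 1.0, 0.0, 1.0]]
attended = multi_head_attention(embedded, n_heads=2)
activated = [[gelu(v) for v in row] for row in attended]
```

Because the softmax weights sum to 1, every output row is a weighted average of the value rows, so each day's representation blends information from all days in the window.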
Comparison model
To demonstrate the model's predictive performance, the ANN, LSTM, and GRU models are used for comparison. ANN models were popular for time series forecasting before deeper neural networks, showing good results for domestic water consumption (Gagliardi et al. 2017). With the rise of deep learning, LSTM and GRU models have achieved notable success in time series forecasting due to their unique recurrent structures (Kühnert et al. 2021). Therefore, they are selected as comparison models in this study. A detailed description of the comparison models can be found in the supplementary materials.
All models are developed in Python, using PyTorch to construct the deep neural network architectures.
Model parameter configuration
Appropriate hyperparameter configuration is crucial for the model to effectively learn the deep features and achieve optimal predictive performance. This study adopts the grid search method to determine the best hyperparameters (as shown in Table 1). Grid search exhaustively evaluates the combinations of candidate hyperparameter values to determine the optimal parameters (Priyadarshini & Cotton 2021). Part of the training dataset is held out as a validation set at a 1/9 ratio, and the hyperparameter combination with the smallest MSE on the validation set is chosen.
Hyperparameter | Search range | MAEN | GRU | LSTM | ANN |
---|---|---|---|---|---|
Learning rate | 0.005, 0.01, 0.02, 0.03 (reduced by a factor of 0.2 every 40 rounds) | 0.02 | 0.02ᵃ/0.03ᵇ | 0.03ᵃ/0.02ᵇ | 0.03ᵃ/0.01ᵇ |
Batch size | 16, 32, 64, 128 | 64 | 64ᵃ/128ᵇ | 128ᵃ/32ᵇ | 64 |
Epochs | 50, 100, 200 | 200ᵃ/100ᵇ | 100ᵃ/200ᵇ | 50ᵃ/100ᵇ | 100ᵃ/200ᵇ |
Time window | 7, 30, 60, 90 | 30 | 60ᵃ/90ᵇ | 90ᵃ/60ᵇ | 30ᵃ/60ᵇ |
N-head | 6, 8, 10 | 8 | – | – | – |
Encoder layers | 1, 2, 3 | 2 | – | – | – |
Hidden layers | 1, 2, 3 | – | 3 | 1ᵃ/3ᵇ | 3 |
Hidden size | 32, 64, 128 | – | 128 | 64ᵃ/128ᵇ | – |
ᵃ Predictions for 1 day ahead.
ᵇ Predictions for 7 days ahead.
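The grid search over the four shared hyperparameters in Table 1 can be sketched as below. The scoring function is a hypothetical stand-in: in a real run it would train a model with the given configuration and return its MSE on the validation set. All names are illustrative assumptions.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Search ranges for the four shared hyperparameters listed in Table 1.
SEARCH_SPACE: Dict[str, List[float]] = {
    "learning_rate": [0.005, 0.01, 0.02, 0.03],
    "batch_size": [16, 32, 64, 128],
    "epochs": [50, 100, 200],
    "time_window": [7, 30, 60, 90],
}


def grid_search(
    space: Dict[str, List[float]],
    validation_mse: Callable[[Dict[str, float]], float],
) -> Tuple[Dict[str, float], float]:
    """Try every combination and keep the one with the lowest score."""
    names = list(space)
    best_config: Dict[str, float] = {}
    best_score = float("inf")
    for combo in product(*(space[name] for name in names)):
        config = dict(zip(names, combo))
        score = validation_mse(config)  # would train and validate a model
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical scoring function standing in for training and validation.
def toy_validation_mse(config: Dict[str, float]) -> float:
    return (abs(config["learning_rate"] - 0.02)
            + abs(config["time_window"] - 30) / 100
            + abs(config["batch_size"] - 64) / 1000
            + abs(config["epochs"] - 200) / 10000)


best_config, best_score = grid_search(SEARCH_SPACE, toy_validation_mse)
```

With these ranges the loop evaluates 4 × 4 × 3 × 4 = 192 combinations, and each model-specific hyperparameter (heads, layers, hidden size) multiplies that count further.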
The hyperparameters considered are learning rate, batch size, number of epochs, and time window size. For time series data, the water withdrawal from a week or a month ago significantly impacts current predictions (Li et al. 2020), making the time window important. Considering the characteristics of industrial sectors, the search range for the time window is set to 7, 30, 60, and 90 days. For the MAEN model, the number of attention heads and layers is also searched. For GRU and LSTM models, the dimensions of the hidden layers and the number of layers are considered, while for the ANN model, the number of hidden layers is searched.
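The role of the time window can be made concrete with a sliding-window sketch that pairs each input window with the days that follow it. The function name, the placeholder series, and the exact pairing scheme are assumptions for illustration; the paper's own sample construction may differ.

```python
from typing import List, Sequence, Tuple


def make_samples(
    series: Sequence[float], window: int, horizon: int
) -> List[Tuple[List[float], List[float]]]:
    """Pair each `window`-day input with the `horizon` days that follow."""
    samples = []
    last_start = len(series) - window - horizon
    for start in range(last_start + 1):
        inputs = list(series[start:start + window])
        targets = list(series[start + window:start + window + horizon])
        samples.append((inputs, targets))
    return samples


series = [float(day) for day in range(40)]  # placeholder daily values
samples = make_samples(series, window=30, horizon=7)
```

A longer window gives the model more history per sample but yields fewer samples from the same series, which is one reason the window size is worth searching.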
Model training
During model training, it is essential to employ optimizers and overfitting prevention techniques to improve training effectiveness and the model's generalization ability. The optimizer updates the neural network's weights to minimize the loss function, enhancing the model's performance. This work uses the Adam optimizer, one of the most widely used optimizers (Schmidt et al. 2021).
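Table 1 states that the learning rate is reduced by a factor of 0.2 every 40 rounds. A minimal sketch of such a stepped schedule follows, assuming that a "round" is one epoch and that epochs are counted from 0; both are assumptions, as is the function name.

```python
def stepped_learning_rate(
    initial_rate: float, epoch: int, step_size: int = 40, factor: float = 0.2
) -> float:
    """Learning rate after multiplying by `factor` every `step_size` epochs."""
    return initial_rate * factor ** (epoch // step_size)


# Learning rate at selected epochs for an initial rate of 0.02.
schedule = [stepped_learning_rate(0.02, epoch) for epoch in (0, 39, 40, 80, 199)]
```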
Regularization in deep neural network optimization is crucial for avoiding overfitting and enhancing the model's generalization. A popular regularization technique is applying an L2 penalty on model parameters, known as weight decay (Nakamura & Hong 2019). Weight decay helps prevent overfitting by reducing the weight values during updates, preventing the model parameters from becoming excessively large. In this study, weight decay is configured in the Adam optimizer with a value of 0.001.
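The shrinking effect of an L2 penalty can be seen in a single update step. The sketch below uses plain gradient descent for clarity rather than Adam's adaptive update, so it illustrates the penalty term only; the function name and example values are hypothetical.

```python
from typing import List


def gradient_step_with_l2(
    weights: List[float],
    gradients: List[float],
    learning_rate: float,
    weight_decay: float = 0.001,
) -> List[float]:
    """One gradient step where the L2 penalty adds weight_decay * w."""
    return [
        w - learning_rate * (g + weight_decay * w)
        for w, g in zip(weights, gradients)
    ]


# With zero loss gradient, only the penalty acts: weights shrink toward 0.
updated = gradient_step_with_l2([2.0, -1.0], [0.0, 0.0], learning_rate=0.02)
```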
Another method to prevent overfitting is dropout, which randomly sets selected neuron activations to zero during training with a certain probability (Srivastava et al. 2014). This reduces reliance on specific neurons, forcing the network to learn more robust features. Here, dropout layers are added to the network structures of the MAEN, LSTM, and GRU models. After a random search, the dropout rates are set to 0.1, 0.3, and 0.3, respectively. No dropout layers are used in the ANN model due to its simple structure.
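Dropout during training can be sketched as follows. The rescaling of the surviving activations by 1 / (1 − rate), often called inverted dropout, is common practice and is an assumption here rather than something stated in the text; the function name is illustrative.

```python
import random
from typing import List


def dropout(
    activations: List[float], rate: float, rng: random.Random
) -> List[float]:
    """Zero each activation with probability `rate`; rescale the survivors."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
dropped = dropout([1.0] * 1000, rate=0.1, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
```

The rescaling keeps the expected value of each activation unchanged, so no adjustment is needed when dropout is switched off at prediction time.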
Evaluation metrics
The mean absolute error (MAE) and mean squared error (MSE) gauge the real magnitude of the prediction error, with values in the range [0, +∞). Similarly, the value range of the mean absolute percentage error (MAPE) is [0, +∞). In all instances, lower values indicate that the model's predictions are closer to the actual values, denoting better predictive performance.
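For reference, the standard definitions of these three metrics are given below. The notation is an assumption: y_i denotes the observed water withdrawal, ŷ_i the predicted value, and n the number of predictions.

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad
\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|
```

Because MSE squares each error, a few large errors raise it more than they raise MAE, which is why the two metrics can rank models differently.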
RESULTS
Model predictions for 1 day ahead
Error of model predictions for 1 day ahead
Table 2 lists the evaluation indicators of all models so that the error values can be compared directly.
Indicator | ANN | GRU | LSTM | MAEN |
---|---|---|---|---|
MAE | 690 | 653 | 648 | 628 |
MSE | 1.28 × 10⁶ | 1.24 × 10⁶ | 1.28 × 10⁶ | 1.2 × 10⁶ |
MAPE (%) | 8.04 | 7.55 | 7.58 | 7.14 |
In time series forecasting, sequential models generally outperform ANN because they contain recurrent structures that can retain information from earlier time steps. In this context, both the LSTM and GRU models outperform ANN across all evaluation metrics, except for the MSE, where the LSTM's MSE matches the ANN's. The small difference between the evaluation indicators of LSTM and GRU suggests that they have similar predictive performance. GRU gives lower MSE and MAPE values of 1.24 × 10⁶ and 7.55%, respectively, but a higher MAE value compared to LSTM. This discrepancy may arise from distinct error distributions; the GRU model has a few larger errors compared to the LSTM model (as shown in Figure 5). Compared to the ANN, GRU, and LSTM models, MAEN exhibits a decrease in MAPE of 11.2, 5.4, and 5.8%, respectively. As anticipated, the MAEN model outperforms the other three models, demonstrating the lowest MAE, MSE, and MAPE. The MAEN model has several advantages. First, the initial linear layer transmits complete information to the attention layer, enhancing prediction stability. Second, the multi-head attention mechanism assigns appropriate weights to input information through attention scores (Wang et al. 2023), adaptively discerning importance and ensuring no information is missed in predictions.
Model predictions for 7 days ahead
Error of model predictions for 7 days ahead
As observed in Table 3, while the MAE value of the GRU model is slightly larger than that of the LSTM model in single-step prediction, it is lower than the latter in the multi-step prediction model. The same phenomenon is observed in the median errors of both models (as indicated in Figure 7). This could be because, although the GRU model had several significant errors in multi-step predictions, the rest of its errors were generally smaller than those of the LSTM model. Although the LSTM and GRU models have different structures, they are both essentially recurrent neural networks. As a result, their various evaluation metrics show similar values on the same dataset (Cheng & Yang 2021).
Indicator | ANN | GRU | LSTM | MAEN |
---|---|---|---|---|
MAE | 871 | 773 | 777 | 715 |
MSE | 1.71 × 10⁶ | 1.65 × 10⁶ | 1.58 × 10⁶ | 1.44 × 10⁶ |
MAPE (%) | 10.64 | 9.58 | 9.49 | 8.27 |
When compared to single-step prediction, all models exhibit varying degrees of metric increase, indicating a decline in performance in multi-step prediction. The MAEN model still has the highest prediction accuracy with an MAE of 715, MSE of 1.44 × 10⁶, and MAPE of 8.27%, further corroborating its superior performance. It is worth noting that, for single-step predictions, the MAEN model reduced MAE, MSE, and MAPE by 3.8, 3.2, and 5.4%, respectively, compared to the best-performing comparison model. For multi-step predictions, MAE, MSE, and MAPE decreased by 8.0, 8.9, and 12.9%, respectively. This indicates that the MAEN model's superiority is greater in multi-step predictions than in single-step predictions. The direct strategy in multi-step predictions tests the model's stability and reliability more rigorously. The multi-head attention mechanism can simultaneously learn the complex relationships between water withdrawal data across multiple time steps and extract important feature information, thereby enhancing the model's learning effectiveness. This makes it particularly suitable for prediction tasks involving highly volatile and irregular data, such as industrial water withdrawal.
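The percentage reductions quoted above can be checked against Tables 2 and 3 with a short calculation. In this sketch the single-step baseline is taken to be GRU and the seven-step baseline LSTM for all three metrics, an assumption that reproduces the reported figures; the variable names are illustrative.

```python
def percent_reduction(baseline: float, improved: float) -> float:
    """Relative decrease of `improved` with respect to `baseline`, in percent."""
    return (baseline - improved) / baseline * 100.0


# (baseline value, MAEN value) pairs taken from Tables 2 and 3.
one_day = {"MAE": (653, 628), "MSE": (1.24e6, 1.2e6), "MAPE": (7.55, 7.14)}
seven_day = {"MAE": (777, 715), "MSE": (1.58e6, 1.44e6), "MAPE": (9.49, 8.27)}

one_day_gain = {k: round(percent_reduction(*v), 1) for k, v in one_day.items()}
seven_day_gain = {k: round(percent_reduction(*v), 1) for k, v in seven_day.items()}
```

Note that LSTM has a slightly lower single-step MAE than GRU (648 versus 653); measured against LSTM, the single-step MAE reduction would be about 3.1% rather than 3.8%.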
DISCUSSION
Overall performance
The superior performance of the MAEN model comes from the core of the model: the multi-head attention layer. Self-attention captures information from historical data using attention scores, and the ‘multi-head’ mechanism enables the model to view this information from multiple perspectives. As seen in Figures 4 and 7, the model gives more attention to data at inflection points and extreme values, resulting in smaller prediction errors at these points. In multi-step predictions, the data relationships within the prediction horizon (7 days in this study) can also be appropriately captured. As a result, the MAEN model's advantage over the comparison models is larger in multi-step predictions than in single-step predictions. This is evident from Tables 2 and 3, where the reduction in error is greater for multi-step predictions than for single-step predictions. The incorporation of the multi-head attention mechanism enables the MAEN model to automatically capture water withdrawal characteristics through trained weights and provide direct multi-step prediction outputs, which are more informative and applicable to actual production scenarios.
Benefits of industrial water withdrawal predictions
Domestic water demand is highly cyclical and is related to easily accessible factors such as temperature and rainfall, making its prediction relatively easy to achieve. Industrial water withdrawal has weak periodicity and is prone to fluctuations (as shown in Figure 1), and its influencing factors are difficult to obtain. Therefore, effectively extracting historical data features of industrial water withdrawal is crucial for improving prediction accuracy. Models such as ANN, LSTM, and GRU, while performing well in predicting domestic water demand, do not exhibit the same level of performance in industrial water withdrawal predictions (e.g., the ANN model shows the highest error metrics, while LSTM and GRU demonstrate instability in multi-step predictions). The smallest error values produced by the MAEN model further confirm the superiority of the multi-head attention mechanism in capturing relationships between water withdrawals on any given dates. This approach is not limited by the periodicity or stationarity of water withdrawal, making it well suited for prediction tasks involving highly volatile industrial water withdrawal. To demonstrate the generalizability of the MAEN model in predicting industrial water withdrawal, we applied it to a real dataset from a chemical factory for both single-step and seven-step predictions. The MAEN model still exhibited superior predictive capabilities compared to the ANN, LSTM, and GRU models (see the supplementary material for details).
Significance of multi-step predictions
Industrial water withdrawals account for a significant portion of total water resource withdrawals. Predictive modeling plays a crucial role in aiding managers to anticipate water demand, utilize water resources efficiently, and optimize industrial production. In particular, multi-step predictions provide industrial enterprises with more time to respond (González Perea et al. 2023). In building the model, multiple layers of the multi-head attention mechanism were implemented to ensure effective attention to water withdrawal features from different perspectives. As a result, even when using a direct strategy, the seven-step prediction maintained high accuracy. The MAEN model can be used by water resource management departments to anticipate industrial enterprises' water usage dynamics, promptly remind relevant enterprises to optimize their industrial activities, and implement water-saving measures.
CONCLUSIONS
The MAEN model proposed in this paper performs high-dimensional embedding of water withdrawal through linear layers and leverages the multi-head attention mechanism to capture historical water features, enabling accurate and stable single-step and seven-step industrial water withdrawal predictions. Compared to the ANN, LSTM, and GRU models, the MAEN model has the smallest error metrics; for 7 days ahead predictions, its MAE, MSE, and MAPE values are 715, 1.44 × 10⁶, and 8.27%, respectively, corroborating its superior performance. The reduction in error values for the MAEN model is greater in multi-step predictions than in single-step predictions (e.g., the MAPE for 1 day ahead predictions decreased by 5.4% compared to the best comparison model, while the MAPE for 7 days ahead predictions decreased by 12.9%), further demonstrating its superiority in multi-step predictions.
The primary objective of this work is to assess the multi-head attention mechanism's efficacy in addressing the complexity and dynamics of industrial water usage patterns, ultimately achieving highly accurate water quantity predictions and optimizing water resource utilization. The proposed model is tailored for time series data with no periodic patterns and significant fluctuations, such as industrial water withdrawal. Knowing the industrial water withdrawal several days in advance will help manage water resource systems, optimize water-saving measures, and reduce energy costs.
However, it is essential to acknowledge that current research predominantly focuses on the use of the multi-head attention mechanism while overlooking data preprocessing. Future research endeavors will delve into exploring more effective data processing techniques and advanced deep learning models to further enhance the accuracy of industrial water withdrawal forecasting. These efforts aim to contribute significantly to sustainable water resource management and optimize industrial production practices.
ACKNOWLEDGEMENTS
This work was supported by the Joint Fund Project of the Natural Science Foundation of Anhui Province (Grant Nos. 2208085US05 and 2308085US05) and the Fundamental Research Funds for the Central Universities (Grant No. JZ2023HGTB0264). During the preparation of this work, the authors used an AI tool for language polishing to improve the readability of the manuscript.
AUTHOR CONTRIBUTIONS
R. Y. wrote the original draft, developed the methodology, rendered support in formal analysis, validated, and visualized the article. Y. M. developed the methodology, rendered support in formal analysis, and visualized the article. L. L. investigated and rendered support in data curation. H. X. investigated the data, arranged the resources, rendered support in data curation, and validated the article. H. L. investigated the data, arranged the software, and arranged the resources for the article. H. M. conceptualized the whole article, arranged the resources, and rendered support in funding acquisition. W. W. supervised the work and conceptualized the whole article. K. S. wrote the review, edited the article, and rendered support in project administration and funding acquisition. X. Z. conceptualized the whole article, wrote the review and edited the article, validated the article, and rendered support in project administration and funding acquisition.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.
REFERENCES
Author notes
Equally contributed