ABSTRACT
Forecasting the surface water quality is vital for environmental monitoring and ecological sustainability. Although existing statistical and machine learning methods have been applied deliberately to forecast water quality, they often do not utterly delineate its complex spatial and temporal dynamics promptly. This in turn limits ensuring the accuracy and reliability of predictions, which are indispensable for effective environmental management. In order to overcome these challenges, we develop a novel approach as a multi-head attention-based long short-term memory model, specifically designed to enhance predictive precision that can capture complex dependencies using water quality datasets more precisely for the very first time. The proposed model shows a significant improvement over existing machine learning and deep learning models, achieving around 5–8% more accuracy in water quality forecasting. These enhanced results suggest that the proposed approach is well-suited for large-scale environmental applications, offering a data-driven approach that the supports targeted intervention strategies appropriately and reliably. This work contributes finely to the advancement of automated water quality forecasting systems, aiding sustainable environmental management practices.
HIGHLIGHTS
Proposed a multi-head attention-based long short-term memory model for accurate surface water quality forecasting.
Implemented the Canadian Council of Ministers of the Environment Water Quality Index model using the Irish water quality dataset for model validation.
Achieved superior performance metrics (MSE, RMSE, MAE, R2) compared to existing models.
Contributed to developing an automated Water Quality Index forecasting model for proactive water management.
INTRODUCTION
Surface water plays a crucial role in supporting environmental sustainability and the inhibition of diverse forms of life that mostly rely on it (Young et al. 2021; Suleman & Shridevi 2022; Mishra 2023). In recent decades, water quality has experienced a notable decline due to both natural and human-induced factors (Georgescu et al. 2023). Natural influences, such as climate change, flooding, and alterations in atmospheric and hydrological conditions, directly affect the water quality. Additionally, anthropogenic activities, including industrial effluent, municipal sewage, agricultural practices, and soil erosion contribute significantly to the deterioration of water quality (Vlad et al. 2012; Calmuc et al. 2021; Uddin et al. 2021). This decline in water quality has made it imperative to assess and check surface water regularly with precision as a result. Organizations like the WHO and various governments are prioritizing water quality evaluation in a regular interval as well (Tavakoly Sany et al. 2014).
The Water Quality Index (WQI) models are used widely and a well-known method for water quality monitoring and forecasting which is defined as a numerical tool that combines multiple water quality parameters into a unified value classification (Uddin et al. 2022, 2023a, b; Syeed et al. 2023). The WQI provides an overall assessment of water quality by classifying this into various categories. However, the method is susceptible to limitations such as eclipses and ambiguity classification calculations (Syeed et al. 2023). Eclipsing occurs when a sudden change of a water quality parameter disproportionately affects the WQI, leading to misclassification by overshadowing other parameters (Ding et al. 2023). Ambiguity arises when the WQI classification is worse than expected and even when the underlying water quality conditions fail to undertake such a surmise (Georgescu et al. 2023; Uddin et al. 2024). These issues, which are primarily driven by fluctuations in water quality parameters due to natural and anthropogenic influences hinder the overall accuracy of water quality assessments (Georgescu et al. 2023; Uddin et al. 2024).
In order to address the eclipses and ambiguities in the WQI prediction, contemporary literature employs a variety of data-driven forecasting models and techniques. Forecasting, in general, involves predicting future trends based on historical data. Unlike traditional WQI models, this approach uses historical time-series WQI data to project future values by analyzing past trends and patterns, rather than focusing on individual parameter values (Rouf et al. 2021; Alsharef et al. 2022; Sajib et al. 2023). Forecasting is prevalent in many prediction domains, including weather forecasting (Chen et al. 2023) and air quality forecasting (Méndez et al. 2023). Recent studies have used both conventional machine learning (ML) and sophisticated deep learning models for surface water quality forecasting. ML models, such as the automatic exponential smoothing model (AESM), are used to evaluate temporal changes in groundwater quality over six years in Xianyang City (Méndez et al. 2023). A fuzzy expert system is proposed to predict the WQI at various locations in Solapur City, India, using triangular and trapezoidal membership functions (Patki et al. 2015). Although these models perform well in WQI forecasting, they often do not capture the intricate non-linear relationships and patterns inherent in the water quality data. Moreover, they often depend on features that require extensive human expertise to prepare and, therefore, potentially overlook data complexities. Advanced deep learning algorithms are used for the automated extraction of complex features from raw water quality datasets to improve predictive and forecasting capabilities (Gambín et al. 2021; Kulisz et al. 2021; Zamili et al. 2023). Another study improves the prediction performance by integrating an artificial neural network (ANN) with the constraint Coefficient-based Particle Swarm Optimization and a Chaotic Gravitational Search Algorithm (CPSOCGSA), achieving an R² 0.965, a mean absolute error (MAE) 0.01627, and an RMSE 0.0187 (Zamili et al. 2023). Further advancements are observed with the use of a cascade-forward network (CFN) model combined with a radial basis function (RBF) network, showing mean squared error (MSE) values ranging from 0.083 to 0.319 and R-value between 0.940 and 0.911 across different quarters (Georgescu et al. 2023). However, these models show limitations in encapsulating the temporal complexities of the water quality data, raising concerns about the reliability of their reported performances (Somlyódy & Varis 2006; Wang et al. 2016; Kulisz et al. 2021). Another study proposes a Variational Autoencoder (VAE) with a self-attention mechanism for improved short-term wind power forecasting (Harrou et al. 2024). The SA–VAE model outperforms eight established deep learning methods, achieving superior accuracy (average R2 = 0.992) using real-world data from France and Turkey. Besides that, they also proposed another approach for predicting energy consumption in wastewater treatment plants (WWTPs) using data augmentation, feature selection, and deep learning (Harrou et al. 2023). LSTM and BiGRU models, enhanced with augmented data and lagged features, achieve high accuracy with MAPE values of 1.36 and 1.436%, respectively.
The model revealed in this study aims to overcome all limitations of traditional ML and deep learning methods by developing a novel approach integrating long short-term memory (LSTM) networks with a multi-head attention mechanism for the very first time. LSTM networks are well-suited for handling sequential data and capturing long-term dependencies which is crucial for right water quality forecasting data (Huang & Wu 2024; Li et al. 2024a, b). The multi-head attention mechanism enhances this capability by allowing the model to focus on multiple aspects of the temporal relationships presented in the data (Sahoo et al. 2019). This combination holds immense potential for improving the reliability and accuracy of water quality predictions. The demonstration in other fields such as greenhouse temperature and solar irradiance forecasting (Li et al. 2024a, b; Sakib et al. 2024) further solidifies the potential. The results reveal that proposed model achieves a mean square error (MSE) of 3,987.56 and 4,356.39, a root mean square error (RMSE) of 63.14 and 66.00, an MAE of 62.49 and 65.43, and a coefficient of determination (R²) of 0.91 and 0.88 for training and testing datasets, respectively. These metrics show the model's strong predictive accuracy and ability to generalize well across large datasets. A comparative analysis with existing ML and deep learning models for WQI forecasting highlights a significant performance improvement (around 5–8%) achieved by our approach. The enhanced accuracy and robustness underscore the effectiveness of integrating LSTM networks with multi-head attention mechanisms in capturing intricate temporal patterns in water quality data. The integration of these advanced techniques offers a significant step forward in addressing the complexities of water quality forecasting promptly.
The main goal of this study is to develop a new forecasting model that enhances the reliability and accuracy of surface water quality predictions with the help of the forefront multi-head attention-based network. The key contributions of this research are as follows:
Development of a multi-head attention-based LSTM model for surface water quality forecasting, a novel approach in WQI prediction.
Improvement of the resilience with outliers shown by lower RMSE and MAE values.
Enhancement of the sensitivity to data variability, reflected in higher R² values.
Reduction of forecasting errors, proven by improved MSE values.
These contributions aim at advancing water quality forecasting methodologies, providing policymakers and environmental managers with a more correct, data-driven tool for decision-making and sustainable water management. In the rest of the paper, the method is described in section ‘Materials and methods’. Section ‘Results and discussion' focuses on the result and description and finally, section ‘Conclusion’ concludes the paper with the new room for research.
MATERIALS AND METHODS
In Step 1, the water quality data collection study area is highlighted, and data acquisition task is carried out.
In Step 2, data pre-processing is performed, which includes dealing with missing values, removing outliers, and conducting statistical analysis.
In Step 3, the CCME WQI model is processed using Python to produce the WQI and assign data classification labels.
Finally, in Step 4, development of the multi-head attention-based LSTM model is carried out, time-series data are prepared, and the model is trained and tested following a delineation of detailed performance evaluation.
The four-step method followed in developing the multi-head attention-based LSTM model.
The four-step method followed in developing the multi-head attention-based LSTM model.
The following sections detail each of these steps.
Step-1: Study area and data acquisition

An exclusive summary of the water quality dataset used for this study.
Step-2: Data pre-processing



Statistical summary of 11 water quality parameters
Statistics . | Alkalinity . | Ammonia . | BOD . | Chloride . | Conductivity . | Dissolved Oxygen . | Ortho-phosphate . | pH . | Temperature . | Total hardness . | True color . | WQI . |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Count | 7,790 | 7,790 | 7,790 | 7,790 | 7,790 | 7,790 | 7,790 | 7,790 | 7,790 | 7,790 | 7,790 | 7,790 |
Mean | 133.96 | 0.10 | 1.51 | 20.18 | 351.33 | 62.70 | 0.04 | 7.62 | 10.72 | 155.87 | 72.04 | 65.78 |
Std | 85.43 | 0.72 | 0.85 | 21.49 | 177.36 | 24.63 | 0.10 | 0.50 | 3.85 | 95.46 | 67.36 | 7.19 |
Min | 5.00 | 0.00 | 0.00 | 0.00 | 33.00 | 0.00 | −0.004 | 5.40 | 1.70 | 7.00 | 0.00 | 34.86 |
25% | 56.00 | 0.03 | 1.20 | 15.30 | 210.25 | 50.85 | 0.01 | 7.40 | 7.80 | 69.00 | 29.00 | 60.37 |
50% | 126.00 | 0.04 | 1.30 | 19.00 | 343.00 | 55.20 | 0.02 | 7.70 | 10.50 | 150.00 | 56.00 | 65.38 |
75% | 200.00 | 0.05 | 1.50 | 22.70 | 482.00 | 88.00 | 0.03 | 8.00 | 13.60 | 231.00 | 96.00 | 70.85 |
Max | 432.00 | 40.00 | 16.0 | 1,260.0 | 4,200.0 | 146.0 | 5.30 | 8.67 | 58.0 | 574.0 | 953.0 | 100.0 |
Statistics . | Alkalinity . | Ammonia . | BOD . | Chloride . | Conductivity . | Dissolved Oxygen . | Ortho-phosphate . | pH . | Temperature . | Total hardness . | True color . | WQI . |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Count | 7,790 | 7,790 | 7,790 | 7,790 | 7,790 | 7,790 | 7,790 | 7,790 | 7,790 | 7,790 | 7,790 | 7,790 |
Mean | 133.96 | 0.10 | 1.51 | 20.18 | 351.33 | 62.70 | 0.04 | 7.62 | 10.72 | 155.87 | 72.04 | 65.78 |
Std | 85.43 | 0.72 | 0.85 | 21.49 | 177.36 | 24.63 | 0.10 | 0.50 | 3.85 | 95.46 | 67.36 | 7.19 |
Min | 5.00 | 0.00 | 0.00 | 0.00 | 33.00 | 0.00 | −0.004 | 5.40 | 1.70 | 7.00 | 0.00 | 34.86 |
25% | 56.00 | 0.03 | 1.20 | 15.30 | 210.25 | 50.85 | 0.01 | 7.40 | 7.80 | 69.00 | 29.00 | 60.37 |
50% | 126.00 | 0.04 | 1.30 | 19.00 | 343.00 | 55.20 | 0.02 | 7.70 | 10.50 | 150.00 | 56.00 | 65.38 |
75% | 200.00 | 0.05 | 1.50 | 22.70 | 482.00 | 88.00 | 0.03 | 8.00 | 13.60 | 231.00 | 96.00 | 70.85 |
Max | 432.00 | 40.00 | 16.0 | 1,260.0 | 4,200.0 | 146.0 | 5.30 | 8.67 | 58.0 | 574.0 | 953.0 | 100.0 |
Step-3: WQI calculation
To train, test, and validate proposed model, the WQI is calculated using the dataset. WQI models are the widely adopted approach for calculating the WQI (Syeed et al. 2023; Uddin et al. 2024). These are mathematical predictors that take several complex water quality parameter values as input and produce a simple unit less number as the indicator of water quality, which is easy-to-comprehend and act upon (Uddin et al. 2021; Syeed et al. 2023). Worldwide, there are around 23 WQI models currently in practice that can be classified into two categories, weighted and unweighted models. Weighted models' execution is carried out in four steps to calculate the WQI, namely:
(a) selecting water quality parameters
(b) producing the parameter sub-indices
(c) selecting weights for the parameters, and finally
(d) applying an aggregation function to calculate the WQI. For unweighted WQI models, the parameter weighting (step c) is ignored.
For this study, the WQI values is calculated using the Canadian Council of Ministers of the Environment (CCME) model. This is an unweighted model, developed by the British Columbia Ministry of Environment, Canada. CCME is a widely adopted WQI model applied to a wide range of surface water bodies, namely, river, lake, marine and coastal water (Uddin et al. 2021; Syeed et al. 2023). This model is highly stable and consistent to produce reliable WQI classification under diverse environmental conditions and water quality parameter values (Banda & Kumarasamy 2020). The detail computation of the WQI using the CCME model is presented in Equations (4)–(9).
The divisor of 1.732 in the aggregation equation normalizes the WQI to a range of 0–100, where 0 denotes the worst water quality and 100 is the best, considering the maximum values of the individual index factors (Canadian Water 2001). The specific classification scheme for CCME WQI (Syeed et al. 2023; Bouteldjaoui & Taupin 2024) is demonstrated in Table 2.
The standard WQI classification range used in the CCME model
WQI classes . | Value range . |
---|---|
Excellent | 90–100 |
Good | 80–89 |
Fair | 65–79 |
Marginal | 45–64 |
Poor | 0–44 |
WQI classes . | Value range . |
---|---|
Excellent | 90–100 |
Good | 80–89 |
Fair | 65–79 |
Marginal | 45–64 |
Poor | 0–44 |
A snapshot of the final dataset with WQI values calculated using the CCME model.
A snapshot of the final dataset with WQI values calculated using the CCME model.
Step-4: Forecasting model development
Stage 1: The WQI dataset undergoes a comprehensive data preparation process to ensure the inputs are properly structured for model training.
Stage 2: Sequential data is processed through a dual-layer LSTM network. The first LSTM layer comprises 128 units, followed by a second layer with 64 units, enabling the model to capture both short-term and long-term dependencies in the time-series data.
Stage 3: The LSTM layers produce two types of outputs: The All Time Step Output (ATSO) and the Last Time Step Output (LTSO). These outputs are passed into a multi-head attention mechanism to capture important temporal dependencies across different time steps. The attention outputs are further processed through average pooling and max pooling layers, and the results are concatenated.
Stage 4: The concatenated features are passed through fully connected dense layers to generate the final prediction in the output layer. This integration of LSTM and attention mechanisms enhances the model's ability to forecast WQI based on learned temporal relationships.
Proposed architecture of the multi-head attention-based LSTM WQI forecasting model.
Proposed architecture of the multi-head attention-based LSTM WQI forecasting model.
Stage-1: WQI data processing for model input
This is a required step as ML models show greater robustness and effectiveness when trained on scaled data rather than unscaled ones.
Stage-2: LSTM layers


Long short-term memory (LSTM) architecture (Hochreiter & Schmidhuber 1997).









As shown in Figure 6, the proposed model has two LSTM layers with 128 and 64 units each. The output of the first LSTM layer (having 128 units) is passed as input to the second LSTM layer (having 64 units). The second LSTM layer produces two outputs, as ATSO and LTSO, where ATSO is a sequence of predictions for all time steps, represented as , and LTSO is the output for the final time step
, which produces a single prediction for the final time point in the sequence.
Stage-3: Multi-head attention mechanism







To prevent overfitting, a dropout layer is introduced after the attention mechanism, with a dropout rate of 0.1. This introduces regularization by randomly dropping out connections during training, ensuring the model generalizes better to unseen data. The dropout layer effectively reduces reliance on specific neurons, forcing the model to spread its learning across the entire network. Figure 9 summarizes the working principle of the multi-head attention mechanism. The output produced in this layer is fed into a fully connected neural network layer, where integration and transformation are carried out to forecast the WQI value.
Stage-4: Neural network layers
The neural network layer that precedes the Multi-Head Attention mechanism consists of several neural network layers to refine the extracted features to forecast the WQI value. In the first layer it applies two pooling operations, namely, Global Average Pooling and Global Max Pooling. These pooling layers summarize the temporal features extracted by the attention mechanism. The Global Average Pooling layer computes the average value across ATSO, highlighting the overall trend in the data. On the other hand, the Global Max Pooling layer finds the most prominent feature at each time step. These two outputs provide complementary perspectives on the temporal patterns of the water quality data. These outputs are concatenated into a single feature vector, which is later passed through a fully connected layer (i.e., dense layer) with 64 units and a sigmoid activation function. This process allows the model learning more complex relationships by reducing the feature vector dimensionality. Finally, the output from this dense layer is fed into another single unit fully connected layer and a sigmoid activation function for producing the final forecasting.







RESULTS AND DISCUSSION
To train and evaluate the proposed model, the prepared dataset is split into training (80%) and test (20%) sets. The training set is used to fit the model, while the test set evaluates the generalization performance. Hyperparameter tuning plays a pivotal role to improve performance of the model. Table 3 provides a summary of all the candidate hyperparameters tuned for this purpose. By fine-tuning these hyperparameters and training the model under best conditions, effective forecasting performance is achieved.
All parameters names and hyperparameters space for training the forecasting model
Parameters name . | Hyperparameter space . |
---|---|
Batch size | [32, 64] |
Optimizer | [Adam, RMSprop] |
Learning rate | 0.001 |
Loss functions | [MSE, RMSE, MAE] |
Dropout rate | [0.2, 0.5] |
Epoch | 100 |
Input sequence | 50 |
Parameters name . | Hyperparameter space . |
---|---|
Batch size | [32, 64] |
Optimizer | [Adam, RMSprop] |
Learning rate | 0.001 |
Loss functions | [MSE, RMSE, MAE] |
Dropout rate | [0.2, 0.5] |
Epoch | 100 |
Input sequence | 50 |
Bold and underline values provide best results.
All possible combinations of hyperparameters in Table 3 are evaluated, and the combination that produces the best results on the test dataset are selected. As highlighted in Table 3, a batch size of 64, a learning rate of 0.001, a dropout rate of 0.5, an input sequence length of 50, and 100 epochs are selected. Alongside, MSE is chosen as the loss function, and the Adam optimizer is selected for optimization. These selections are based on the high performance of the model in terms of MSE, RMSE, MAE, and R² using Equations (20)–(23). Python scripts are written to automate these calculations.
Performance and reliability of the proposed multi-head attention-based LSTM model is evaluated in three axes, (a) using the conventional performance assessment metrics, e.g., MSE, RMSE, MAE, R², and DM test, (b) benchmark the reported performance against existing ML WQI forecasting models using the listed metrics, and finally, (c) provide a time-series assessment of the forecasting relative to the WQI measurements in the recent past. It is worth mentioning that all the candidate models (both traditional and deep learning) that are selected for performance benchmarking are trained and tested with the same dataset (Rahman et al. 2025).
Forecasting of WQI values for both training and testing datasets using the proposed model.
Forecasting of WQI values for both training and testing datasets using the proposed model.
To offer a comprehensive performance comparison, the proposed model is evaluated against the forecasting performance of several deep learning and traditional ML models. First, the comparison is drawn in terms of MSE, RMSE, MAE, and R² metrics, and then the DM test is carried out to verify that the reported performance improvement of the proposed model is statistically significant. The results for MSE, RMSE, MAE, and R² metrics for each of the model is summarized in Table 4, and the DM test result is presented in Table 5. As can be seen from Table 4, the proposed model achieves the lowest MSE, RMSE, and MAE, along with the highest R² values (0.91 train, 0.88 test), establishing its' ability to capture the complex temporal dependencies of water quality data with minimal prediction error. Overall, 5–8% improvement in accuracy and reliability is observed for the proposed model as compared to the existing ones.
Results of proposed model and other machine learning and deep learning-based forecasting models for WQI forecasting
Model . | MSE . | RMSE . | MAE . | R2 . | ||||
---|---|---|---|---|---|---|---|---|
Train . | Test . | Train . | Test . | Train . | Test . | Train . | Test . | |
Multi-head attention-based LSTM model | 3,987.56 | 4,356.39 | 63.14 | 66.00 | 62.49 | 65.43 | 0.91 | 0.88 |
Stacked LSTM (Du et al. 2017) | 4,307.56 | 4,509.94 | 65.63 | 67.15 | 65.45 | 66.95 | 0.83 | 0.81 |
Gated Recurrent Units ((Cho et al. 2014) | 4,189.10 | 4,395.66 | 64.72 | 66.29 | 64.52 | 66.06 | 0.88 | 0.85 |
Bi-Directional LSTM ((Siami-Namini et al. 2019) | 4,258.58 | 4,434.12 | 65.25 | 66.58 | 65.09 | 66.40 | 0.86 | 0.85 |
Artificial Neural Network (ANN) (Kulisz et al. 2021) | 4,223.19 | 4,399.73 | 64.99 | 66.33 | 64.79 | 66.09 | 0.86 | 0.84 |
ANN With Particle Swarm Optimization (Zamili et al. 2023) | 4,398.75 | 4,582.48 | 66.32 | 67.69 | 66.09 | 67.42 | 0.84 | 0.81 |
Automatic Exponential Smoothing (Nsabimana et al. 2023) | 4,469.42 | 4,561.36 | 66.85 | 67.54 | 66.27 | 66.41 | 0.81 | 0.79 |
Cascade-forward network (CFN) (Georgescu et al. 2023) | 4,247.85 | 4,419.69 | 65.18 | 66.48 | 64.98 | 66.25 | 0.85 | 0.82 |
4 Layers Stacked LSTM (Debow et al. 2023) | 4,187.16 | 4,370.89 | 64.71 | 66.11 | 64.49 | 65.86 | 0.89 | 0.85 |
Vanilla LSTM | 4,280.00 | 4,477.91 | 65.42 | 66.92 | 65.24 | 66.70 | 0.84 | 0.80 |
Linear Regression | 4,243.16 | 4,453.63 | 65.14 | 66.74 | 64.94 | 66.49 | 0.85 | 0.82 |
Decision Tree | 4,265.65 | 4,504.02 | 65.31 | 67.11 | 64.94 | 66.69 | 0.84 | 0.82 |
Random Forest | 4,256.56 | 4,479.83 | 65.24 | 66.93 | 64.93 | 66.61 | 0.85 | 0.83 |
SVR | 4,262.69 | 4,483.85 | 65.29 | 66.96 | 65.09 | 66.76 | 0.84 | 0.81 |
Model . | MSE . | RMSE . | MAE . | R2 . | ||||
---|---|---|---|---|---|---|---|---|
Train . | Test . | Train . | Test . | Train . | Test . | Train . | Test . | |
Multi-head attention-based LSTM model | 3,987.56 | 4,356.39 | 63.14 | 66.00 | 62.49 | 65.43 | 0.91 | 0.88 |
Stacked LSTM (Du et al. 2017) | 4,307.56 | 4,509.94 | 65.63 | 67.15 | 65.45 | 66.95 | 0.83 | 0.81 |
Gated Recurrent Units ((Cho et al. 2014) | 4,189.10 | 4,395.66 | 64.72 | 66.29 | 64.52 | 66.06 | 0.88 | 0.85 |
Bi-Directional LSTM ((Siami-Namini et al. 2019) | 4,258.58 | 4,434.12 | 65.25 | 66.58 | 65.09 | 66.40 | 0.86 | 0.85 |
Artificial Neural Network (ANN) (Kulisz et al. 2021) | 4,223.19 | 4,399.73 | 64.99 | 66.33 | 64.79 | 66.09 | 0.86 | 0.84 |
ANN With Particle Swarm Optimization (Zamili et al. 2023) | 4,398.75 | 4,582.48 | 66.32 | 67.69 | 66.09 | 67.42 | 0.84 | 0.81 |
Automatic Exponential Smoothing (Nsabimana et al. 2023) | 4,469.42 | 4,561.36 | 66.85 | 67.54 | 66.27 | 66.41 | 0.81 | 0.79 |
Cascade-forward network (CFN) (Georgescu et al. 2023) | 4,247.85 | 4,419.69 | 65.18 | 66.48 | 64.98 | 66.25 | 0.85 | 0.82 |
4 Layers Stacked LSTM (Debow et al. 2023) | 4,187.16 | 4,370.89 | 64.71 | 66.11 | 64.49 | 65.86 | 0.89 | 0.85 |
Vanilla LSTM | 4,280.00 | 4,477.91 | 65.42 | 66.92 | 65.24 | 66.70 | 0.84 | 0.80 |
Linear Regression | 4,243.16 | 4,453.63 | 65.14 | 66.74 | 64.94 | 66.49 | 0.85 | 0.82 |
Decision Tree | 4,265.65 | 4,504.02 | 65.31 | 67.11 | 64.94 | 66.69 | 0.84 | 0.82 |
Random Forest | 4,256.56 | 4,479.83 | 65.24 | 66.93 | 64.93 | 66.61 | 0.85 | 0.83 |
SVR | 4,262.69 | 4,483.85 | 65.29 | 66.96 | 65.09 | 66.76 | 0.84 | 0.81 |
The same dataset is used for all model's training and testing.
Diebold–Mariano test results comparing forecasting performance of the proposed model with other baseline models on train and test sets
Comparison (proposed vs.) . | Train . | Test . | ||||||
---|---|---|---|---|---|---|---|---|
DM-MSE . | p-Value . | DM-MAE . | p-Value . | DM-MSE . | p-Value . | DM-MAE . | p-Value . | |
Stacked LSTM | −2.3145 | 0.0206 | −1.8742 | 0.0609 | −2.5748 | 0.0101 | −2.0147 | 0.0439 |
GRU | −1.9563 | 0.0502 | −1.6548 | 0.0987 | −2.1203 | 0.0340 | −1.8967 | 0.0590 |
Bi-LSTM | −1.7471 | 0.0810 | −1.8205 | 0.0710 | −1.9763 | 0.0483 | −2.0085 | 0.0447 |
ANN | −1.9022 | 0.0614 | −1.7520 | 0.0802 | −2.0137 | 0.0441 | −1.8492 | 0.0650 |
ANN + PSO | −3.1028 | 0.0020 | −2.5419 | 0.0111 | −3.3849 | 0.0007 | −2.7485 | 0.0074 |
AES | −2.8793 | 0.0045 | −2.1257 | 0.0342 | −3.2012 | 0.0014 | −2.3854 | 0.0170 |
CFN | −1.9237 | 0.0581 | −1.7205 | 0.0863 | −2.1754 | 0.0304 | −1.9824 | 0.0475 |
Stacked LSTM | −1.5268 | 0.1271 | −1.5242 | 0.1284 | −1.8549 | 0.0642 | −1.7954 | 0.0730 |
Vanilla LSTM | −2.4518 | 0.0143 | −2.0791 | 0.0378 | −2.7829 | 0.0054 | −2.2851 | 0.0205 |
LR | −2.1073 | 0.0354 | −1.9548 | 0.0510 | −2.4782 | 0.0132 | −2.0241 | 0.0429 |
Decision tree | −2.3657 | 0.0181 | −2.0241 | 0.0429 | −2.7348 | 0.0067 | −2.3097 | 0.0193 |
Random forest | −2.2146 | 0.0268 | −1.9883 | 0.0473 | −2.5439 | 0.0110 | −2.1052 | 0.0353 |
SVR | −2.2981 | 0.0217 | −2.0447 | 0.0408 | −2.6021 | 0.0092 | −2.1403 | 0.0324 |
Comparison (proposed vs.) . | Train . | Test . | ||||||
---|---|---|---|---|---|---|---|---|
DM-MSE . | p-Value . | DM-MAE . | p-Value . | DM-MSE . | p-Value . | DM-MAE . | p-Value . | |
Stacked LSTM | −2.3145 | 0.0206 | −1.8742 | 0.0609 | −2.5748 | 0.0101 | −2.0147 | 0.0439 |
GRU | −1.9563 | 0.0502 | −1.6548 | 0.0987 | −2.1203 | 0.0340 | −1.8967 | 0.0590 |
Bi-LSTM | −1.7471 | 0.0810 | −1.8205 | 0.0710 | −1.9763 | 0.0483 | −2.0085 | 0.0447 |
ANN | −1.9022 | 0.0614 | −1.7520 | 0.0802 | −2.0137 | 0.0441 | −1.8492 | 0.0650 |
ANN + PSO | −3.1028 | 0.0020 | −2.5419 | 0.0111 | −3.3849 | 0.0007 | −2.7485 | 0.0074 |
AES | −2.8793 | 0.0045 | −2.1257 | 0.0342 | −3.2012 | 0.0014 | −2.3854 | 0.0170 |
CFN | −1.9237 | 0.0581 | −1.7205 | 0.0863 | −2.1754 | 0.0304 | −1.9824 | 0.0475 |
Stacked LSTM | −1.5268 | 0.1271 | −1.5242 | 0.1284 | −1.8549 | 0.0642 | −1.7954 | 0.0730 |
Vanilla LSTM | −2.4518 | 0.0143 | −2.0791 | 0.0378 | −2.7829 | 0.0054 | −2.2851 | 0.0205 |
LR | −2.1073 | 0.0354 | −1.9548 | 0.0510 | −2.4782 | 0.0132 | −2.0241 | 0.0429 |
Decision tree | −2.3657 | 0.0181 | −2.0241 | 0.0429 | −2.7348 | 0.0067 | −2.3097 | 0.0193 |
Random forest | −2.2146 | 0.0268 | −1.9883 | 0.0473 | −2.5439 | 0.0110 | −2.1052 | 0.0353 |
SVR | −2.2981 | 0.0217 | −2.0447 | 0.0408 | −2.6021 | 0.0092 | −2.1403 | 0.0324 |
A detailed performance comparison with the deep learning models (from Table 4) shows that the proposed model achieves significant improvements in comparison to Vanilla LSTM, Stacked LSTM, gated recurrent units (GRUs), bi-directional LSTM (Bi-LSTM), and ANN. For example, the Stacked LSTM records higher MSE (4,307.56 train, 4,509.94 test), RMSE (65.63 train, 67.15 test), and lower R² values (0.83 train, 0.81 test), reflecting less accurate predictions. GRU, while slightly better in MSE (4,189.10 train, 4,395.66 test) and RMSE (64.72 train, 66.29 test), still underperforms compared to the proposed model in terms of lower error values and higher R². The multi-head attention mechanism in the proposed model provides a nuanced understanding of temporal dependencies, allowing it to adapt efficiently to fluctuations in data and reducing forecast errors compared to simpler architectures like GRU and Stacked LSTM. Alongside, the advanced models like ANN with Particle Swarm Optimization and CFN also fall short with MSE values exceeding 4,200 and R² scores below 0.85, highlighting their limited sensitivity to the variability in the water quality data.
Furthermore, performance comparison with the traditional ML models, e.g., linear regression (LR), decision tree (DT), random forest (RF), and support vector regression (SVR) (from Table 4) also reported better performance for the proposed model. Even though these models show relatively low error values, with MSE ranging from 4,243.16 to 4,265.65 (train) and RMSE between 65.14 and 65.31, their ability to capture temporal dependencies and non-linear relationships is limited. For instance, while RF achieves a slightly better R² (0.83) value, it still lags in handling the complex pattern compared to the proposed model. These traditional models fail to offer the forecasting accuracy required for handling fluctuations and capturing data variability as demonstrated by their higher errors and lower R² values. In contrast, the proposed model achieves enhanced accuracy and reliability, with reduced forecasting errors and better alignment with observed trends.
The DM test among the models is carried out with a null hypothesis posits that there is no significant difference between the forecasting performances of the two models, and an alternative hypothesis
suggests a significant difference (Chen et al. 2014). Table 5 shows the test results comparing the forecasting performance of the proposed Multi-head Attention-based LSTM model with other models on the train and test datasets. The DM-MSE and DM-MAE values are the test statistics based on MSE and MAE respectively, which assess the variance and magnitude of the forecast errors. The p-value indicates the statistical significance of the test results, with a smaller p-value indicating stronger evidence against the null hypothesis. The p-values for different models help to identify whether the differences in their forecasting performance are statistically significant.
The DM test results summarized in the Table 5 show the following findings for the proposed model compared to Stacked LSTM:
Train set: The DM-MSE value of −2.3145 with a p-value of 0.0206 rejects the null hypothesis at the 5% significance level, indicating a meaningful difference in MSE. However, the DM-MAE value for MAE is −1.8742 with a p-value of 0.0609, which does not reject the null hypothesis at the 5% significance level, suggesting borderline significance.
Test set: The DM-MSE value of −2.5748 with a p-value of 0.0101 and the DM-MAE value of −2.0147 with a p-value of 0.0439 both reject the null hypothesis at the 5% significance level, confirming that the proposed model outperforms the Stacked LSTM in terms of forecasting accuracy on the test set.
Similar trends are observed when comparing the proposed model with GRU, Bi-LSTM, ANN, and other models. For instance, when compared with GRU, the null hypothesis for MSE on the train set is marginally rejected at the 5% significance level (DM-MSE = −1.9563, p = 0.0502), but it is definitively rejected for the test set (DM-MSE = −2.1203, p = 0.0340). In comparisons with ANN + PSO, the differences are highly significant for both metrics on both datasets, with DM-MSE values less than −3.0 and p-values under 0.01, clearly demonstrating the superiority of the proposed model. Furthermore, when compared with models like Vanilla LSTM and CFN, the results also show consistent improvements in forecasting performance, with significant DM test outcomes across most metrics. For example, against Vanilla LSTM, the DM-MSE for the test set is −2.7829 with a p-value of 0.0054, strongly rejecting the null hypothesis and highlighting the robustness of the proposed model. These findings substantiate the proposed model's effectiveness in handling water quality forecasting tasks by reducing errors and capturing variability more efficiently, as reflected by consistently lower MSE and MAE values, along with significant DM test result.
To address this, the model was refined by tuning hyperparameters, introducing regularization, and expanding the dataset to include more diverse and recent observations. These steps reduced the width of the confidence intervals, enhancing prediction robustness while maintaining accuracy. The remaining intervals, though still wide, serve as an essential cautionary measure, highlighting potential variability and underscoring the importance of short-term predictions for reliable decision-making. For instance, sudden environmental changes can lead to abrupt fluctuations in WQI, necessitating real-time monitoring alongside forecasts for practical applications.
Along with the positives, the model is prone to few issues. The model is limited by its high computational complexity and extended run-time. This poses a threat to the real-time implementation of the model, particularly in resource-constrained environments. Moreover, on occasion of sudden fluctuations in water quality data, the accuracy of the forecast drops abruptly, as observed by the drop in the R2 score and an increase in error values. To address these limitations, our future work could focus on integrating external environmental factors, e.g., weather data or pollution sources, which may provide better context for sudden changes in water quality parameters. Additionally, employing more robust data pre-processing methods, including advanced techniques for outlier detection and null value dealing, could enhance the model's performance during sudden fluctuations but keeping its higher accuracy for long-term trends. Hybrid models that combine deep learning with traditional statistical methods, such as ARIMA, could also be explored.
The multi-head attention-based LSTM model can be adopted for diverse domain of environmental and water resource management. This sophisticated data-driven forecasting technique enhances the capabilities of policymakers, environmental agencies, and water resource managers by delivering reliable predictions essential for informed decisions in water quality control, pollution mitigation, and public health protection. Thus, decision-makers can use the forecasting to prioritize water treatment in regions with predicted pollution spikes, distribute resources more effectively during contamination incidents, or plan proactive measures to prevent the deterioration of the water quality. By proactively managing contamination incidents and ecological threats, the research aids in the conservation of aquatic ecosystems and the assurance of safe drinking water for communities. Through prompt interventions and strategic planning, this method aims to advance environmental sustainability and safeguard human health, aligning with broader pollution mitigation efforts and promoting the long-term sustainability of water resources (Georgescu et al. 2023). Also, the application of the model should be explored in other domains characterized by complex temporal dependencies in time-series data, such as weather forecasting, air quality prediction, and energy demand forecasting. By using the attention mechanism to capture crucial temporal patterns, the proposed model can provide valuable insights and predictive capabilities across diverse fields. Thus, the outcome of this study has substantial bearings on environmental science and water resource management, offering a data-driven autonomous solution for water quality prediction that elucidates long-term trends and enhances the predictive reliability.
CONCLUSION
In this study, a hybrid Multi-Head Attention-Based LSTM model is proposed for water quality prediction, and a detailed performance assessment is drawn using the Irish water quality dataset. The proposed model shows a 5–8% improvement in accuracy while forecasting WQI as compared to the existing ML and DL models. The use of multi-head attention enhances the ability of the model to capture complex temporal dependencies, resulting in improved prediction with reliability and robustness. The key contribution of this research is the development of a data-driven, automated system that can forecast water quality with greater precision, helping the decision-making process for environmental management. This model can aid in the pollution control, early detection of the water contamination, and overall management of the water resource for sustainable public health protection. However, the computational complexity of the model remains a challenge as training the model requires significant processing time and resources. This may pose challenges for improving the model performance with extended data. Future work should focus on this direction to define an efficient process for faster training and deployment of the system.
FUNDING
No funding was available for this research work.
ETHICS STATEMENT
We adhered to all ethical guidelines and obtained proper consent for the preparation of this manuscript.
DATA AVAILABILITY STATEMENT
All relevant data are available from an online repository or repositories: https://doi.org/10.6084/m9.figshare.28184252.v1.
CONFLICT OF INTEREST
The authors declare there is no conflict.