ABSTRACT
The water quality of drinking water reservoirs directly impacts the water supply safety for urban residents. This study focuses on the Da Jing Shan Reservoir, a crucial drinking water source for Zhuhai City and the Macau Special Administrative Region. The aim is to establish a prediction model for the water quality of drinking water reservoirs, which can serve as a vital reference for water plants when formulating their water supply plans. In this research, after smoothing the data using the Hodrick-Prescott filter, we utilized the long short-term memory (LSTM) network model to create a water quality prediction model for the Da Jing Shan Reservoir. Simulation calculations show that the model's goodness of fit is consistently above 60%; for pH, dissolved oxygen (DO), and biochemical oxygen demand (BOD), the agreement between predicted and observed values exceeds 70%, indicating that the model effectively simulates the reservoir's water quality changes. Moreover, for parameters such as pH, DO, BOD, and total phosphorus, the relative forecasting error of the LSTM model is less than 10%, confirming the model's validity. The results of this study offer an essential model reference for predicting water quality for the Da Jing Shan Reservoir.
HIGHLIGHTS
The long short-term memory (LSTM) model was optimized with the Hodrick-Prescott (HP) filter.
The HP-LSTM model is used to predict the water quality index.
The HP-LSTM model is compared with the LSTM model to prove the effectiveness of the optimized model.
INTRODUCTION
With rapid socio-economic development in recent years, water environmental pollution has become an increasingly pressing issue (Hasanzadeh et al. 2020; Chen et al. 2021; Ouyang et al. 2023). Accurately predicting water quality indicators is crucial for anticipating and promptly responding to sudden water pollution events (Bi et al. 2024; Wang et al. 2024). Numerous studies on temporal and geographical water quality forecasts and data-driven models have been conducted to limit water pollution and lessen its detrimental effects on human civilization and the aquatic ecological system (Chen et al. 2020). In particular, because reservoirs serve an important function as drinking water sources, changes in their water quality receive especially close attention (Lian et al. 2022; Yin et al. 2022). The urban reservoir serves as a crucial source of drinking water in urban areas, playing a significant role in ensuring the safety of urban water supply. Water quality in the reservoir is a critical determinant of water supply safety. The process of economic development inevitably brings about specific pollution problems. Therefore, it is crucial to protect the urban reservoir's water source and ensure the safety of the urban water supply. One of the most critical tasks in safeguarding the water supply is to track and monitor the water quality of the source and identify pollution problems in real time (Abraham et al. 2022; Zhang et al. 2022a).
Water quality prediction has always been a focal research direction in the water resources field, and forecasting trends in water quality fluctuations is essential for the timely detection of changes and for assessing the impact on drinking water quality and safety for human health and the environment (Bourjila et al. 2023; Chen et al. 2024). As ecological systems, reservoirs are characterized by their complexity, dynamism, vulnerability, and significant ecological value. Predictive alerts for reservoir water quality enable the early detection of anomalies, strengthening ecological and environmental protection (Marcé et al. 2016). Current water quality prediction methods can be categorized into mechanistic and non-mechanistic (Zhang et al. 2022a), and data analysis techniques applied to water quality datasets help identify key factors and characteristics affecting water quality (Li et al. 2023; Shan et al. 2023). However, water quality data are nonlinear and non-stationary, influenced by numerous factors, so choosing appropriate computational methods for modeling is crucial for developing predictive water quality models. These models are instrumental in foreseeing and alerting about future water quality changes, promptly identifying potential issues, and initiating corrective actions (Shi et al. 2018). In the era of big data, machine learning (ML) and deep learning (DL) models within artificial intelligence (AI) have seen significant growth, offering advanced methods for predicting water quality dynamics (Reichstein et al. 2019; Xiong et al. 2022; Ouyang et al. 2023).
In the field of predicting water quality changes, ML and DL have numerous successful applications. Models such as Extreme Gradient Boosting (XGBoost), support vector regression, K-nearest neighbors, ensemble trees, and random forests have been used to predict the water quality index (WQI), effectively reducing the time and errors associated with calculating the WQI and improving prediction accuracy (Sakaa et al. 2022; Hussein et al. 2024; Yan et al. 2024). In the domain of artificial neural networks, Multilayer Perceptron Neural Networks are widely applied for predicting river/stream water temperature and quality. These models handle complex nonlinear relationships and have demonstrated superior performance compared to traditional statistical models in various case studies (Zhu & Piotrowski 2020; Wong et al. 2021).
These models are data-centric and independent of watershed process mechanisms. They employ algorithms to discern internal relationships between input and output data, effectively capturing the predicted subjects' dynamic patterns (Kasiviswanathan et al. 2016). Among these ML/DL models, neural network models have been broadly adopted (Yang et al. 2017), especially the application of long short-term memory (LSTM), being utilized in predicting water quality, air quality, and other environmental domains (Zhang & Li 2022; Zhang et al. 2022a; Wang et al. 2023). For instance, LSTM is employed to enhance weather forecasting models due to its capability to analyze and predict time-series data. Researchers have utilized LSTM to forecast meteorological variables such as temperature, wind speed, and rainfall. Within the medical domain, LSTM finds utility in disease diagnosis, including risk assessment for conditions such as heart attacks and diabetes. Additionally, it aids in predicting patients' hospital stays and associated medical costs (Varadharajan & Nallasamy 2022; Su et al. 2023). In energy demand forecasting, LSTM is utilized to predict electricity demand and energy consumption, which is crucial for the stable operation of power systems and energy supply planning.
As the demand for high-frequency prediction escalates, traditional data-driven ML and DL models require enhanced robustness to capture transient and dramatic dynamic changes effectively (Xiao et al. 2017). Academic research on LSTM networks shows that a single LSTM structure has apparent limitations in forecasting accuracy when facing complex data. This is especially evident when dealing with highly volatile and weakly regular data samples, where the LSTM cannot fit the data effectively, resulting in reduced prediction accuracy. Therefore, it is necessary to combine the LSTM model with other state-of-the-art techniques (Pyo et al. 2023; Wang et al. 2023b; Cai et al. 2023; Zamani et al. 2023), and the prediction accuracy and efficiency of the LSTM model can be significantly improved by using various pre-processing techniques (Haq & Harigovindan 2022; Zhang et al. 2022c).
Data pre-processing techniques are crucial in harnessing the full potential of LSTM models' algorithmic strengths (Fan et al. 2022; Hao et al. 2022). Among these techniques, the Hodrick-Prescott (HP) filter is notable for addressing the non-stationarity issue by smoothing data (Domala & Kim 2023). Therefore, it can effectively counter the limitations of LSTM models in handling the non-stationary dynamics of the original series, significantly enhancing model performance. The primary use of the HP filter lies in signal denoising for stock price forecasting, load forecasting, and energy consumption forecasting (Xu et al. 2015; Ilyas et al. 2022). However, integrating the HP filter with LSTM models for surface water environment modeling remains largely unexplored. We hypothesize that using the HP filter as a pre-processing technique will bolster LSTM's capability to capture water quality dynamics. This combination is expected to enhance the model's overall performance and reduce its prediction error, particularly in high-frequency scenarios for water quality prediction.
This study aims to analyze the Da Jing Shan Reservoir in Zhuhai as a critical urban water source by collecting WQI measurements from 2010 to 2020. Utilizing these parameters, the study employs LSTM cyclic networks for training and prediction throughout the dataset. To address the problem caused by low-frequency fluctuations, an HP filter is integrated for data pre-processing, which effectively improves the prediction accuracy of the LSTM model. Additionally, the study conducts comparative analyses of LSTM configurations to identify the most effective model structure for precise water quality forecasting, thereby providing a benchmark for future advancements in water quality prediction methodologies.
METHODOLOGICAL FRAMEWORK
Area description
Source and partition of dataset
The research data for this study are sourced from the Environmental Monitoring Yearbook of Zhuhai City, Guangdong Province, spanning from 2010 to 2020. A decade of water quality monitoring data specific to the Da Jing Shan Reservoir is extracted for analysis. A predictive model based on the LSTM network is developed. This model aims to simulate and forecast key water quality parameters of the reservoir, including pH, sulfates (SO42−), chlorides (Cl−), the permanganate index (CODMn), dissolved oxygen (DO), total phosphorus (TP), and biochemical oxygen demand (BOD). The historical data from the Da Jing Shan Reservoir are recorded monthly, providing a temporal resolution suitable for detailed analysis. Our predictive model is designed to process 15 prior temporal measurements of water quality data to predict the subsequent measurement. The model's training and evaluation used historical data from January 2010 to December 2020, comprising 132 monthly records. Data allocation for training and validation followed a 4:1 ratio, which was strategically chosen to ensure comprehensive model development and accurate performance evaluation.
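As a rough illustration of the windowing and 4:1 split described above (not the study's actual implementation), a minimal sketch is shown below; the file name and column labels are assumptions for illustration only.

```python
# Hypothetical sketch of the 15-step sliding window and 4:1 train/validation split.
# The CSV file name and column names are assumptions, not the study's actual files.
import numpy as np
import pandas as pd

LOOKBACK = 15  # 15 prior monthly measurements predict the next one

def make_windows(series, lookback=LOOKBACK):
    """Turn a 1-D monthly series into (samples, lookback) inputs and next-step targets."""
    X, y = [], []
    for t in range(lookback, len(series)):
        X.append(series[t - lookback:t])
        y.append(series[t])
    return np.array(X), np.array(y)

df = pd.read_csv("dajingshan_monthly.csv")   # assumed file: 132 monthly records, 2010-2020
do = df["DO"].to_numpy(dtype=float)          # one indicator (e.g. dissolved oxygen)

X, y = make_windows(do)
split = int(len(X) * 0.8)                    # 4:1 training/validation ratio
X_train, y_train = X[:split], y[:split]
X_val, y_val = X[split:], y[split:]
```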
Model design
The overarching pattern of water quality variations often manifests a degree of periodicity, which is conducive to enhancing the accuracy of water quality predictions (Syeed et al. 2023). Nevertheless, the actual measurements of water quality parameters are subject to considerable uncertainty, influenced by various external factors, posing significant challenges to the predictive modeling of water quality (Razavi 2021). As indicated previously, standalone LSTM models do not yield highly accurate predictions (Barzegar et al. 2020); therefore, the HP filter is employed to pre-process the data (Nath et al. 2021; Wang et al. 2024), and the smoothed series is then used to construct an LSTM-based water quality prediction model with refined prediction accuracy.
LSTM network modeling
LSTM networks are a sophisticated variant of recurrent neural networks (RNNs) (Xiang et al. 2020), developed to address the limitations of RNNs with respect to short-term memory. Traditional RNNs experience difficulties in preserving information over extended sequences, which impedes their ability to relate information from early time steps to later ones (Xiang et al. 2020). In contrast, LSTMs excel at recognizing and retaining long-term dependencies, thereby circumventing the vanishing and exploding gradient issues that often plague the training process of standard RNNs. This attribute enables LSTMs to remember inputs over a more extended period, making them particularly effective for tasks involving long or delayed sequences. Despite being proposed in 1997, the LSTM model remains widely utilized due to its robust performance in processing time-series data (Murugesan et al. 2022; Roy et al. 2022; Liang et al. 2023; Patel et al. 2023).
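For context, a minimal single-layer LSTM regressor of the kind used here can be sketched as below; the Keras framework and layer sizes are our assumptions, not details reported by the original model.

```python
# Minimal sketch of an LSTM regressor for one water quality indicator (Keras assumed).
import tensorflow as tf

def build_lstm(lookback=15, hidden=64):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(lookback, 1)),  # 15 lagged monthly values, one feature
        tf.keras.layers.LSTM(hidden),                # gated memory cells retain long-range context
        tf.keras.layers.Dense(1),                    # next-month prediction
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```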
HP filter
The HP filter, a prominent method in signal separation, is grounded in frequency analysis techniques. It operates on the premise that any time series can be decomposed into a blend of cyclic patterns occurring at diverse frequencies. The primary role of the HP filter is to segregate these cyclical components from the trend component within the time series. This segregation is achieved by smoothing out higher frequency fluctuations, highlighting the lower frequency trend. The extracted trend component is vital for analyzing the time-series data's underlying movements and long-term tendencies. The HP filter clarifies the time series by dividing it into a trend, which indicates the general direction over time, and a cyclical component that reflects short-term variations.
The mathematical basis of the HP filter involves statistical equations that adjust the smoothness of the time series. This adjustment allows for control over the separation between cyclical and trend components. The filter's operation minimizes the sum of squared deviations of the trend component from the actual data. This process is subject to a penalty that regulates the second derivative of the trend component, thus regularizing its smoothness. The smoothing parameter selection, commonly represented by lambda (λ), is critical. It determines the filter's responsiveness to short-term fluctuations versus long-term trends, striking a balance between the two for practical analysis (Zhang et al. 2022b).
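Formally, writing $y_t$ for the observed series, $\tau_t$ for the trend component, and $\lambda$ for the smoothing parameter, the HP filter solves

$$\min_{\{\tau_t\}}\ \sum_{t=1}^{T}\left(y_t-\tau_t\right)^2+\lambda\sum_{t=2}^{T-1}\left[\left(\tau_{t+1}-\tau_t\right)-\left(\tau_t-\tau_{t-1}\right)\right]^2,$$

with the cyclical component obtained as $c_t=y_t-\tau_t$.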
Indicators for evaluating the model predictions
In generating predictive outcomes with a forecasting model, it is essential to assess the model's predictive performance rigorously. This assessment focuses on the accuracy and the practical applicability of the model's predictions. The evaluation typically includes several key performance indicators. One such indicator is the root mean squared error (RMSE), a metric that quantifies the differences between predicted and observed values. The RMSE is computed by squaring these differences, averaging them, and then taking the square root of this average. This calculation results in an assessment of the average magnitude of the model's errors. A lower RMSE value indicates a model that more closely aligns with the observed data, highlighting its accuracy and reliability (Song et al. 2021; Li & Li 2023).
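In symbols, with $y_i$ the observed values, $\hat{y}_i$ the predictions, and $n$ the number of samples,

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i-y_i\right)^2}.$$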
Another critical metric in model evaluation is the coefficient of determination, which is commonly represented as R². This statistic measures the fit of a regression model to the observed data by quantifying the proportion of variance in the dependent variable that is predictable from the model's independent variables. An R² value near 1 implies a high explanatory power of the model and a better fit to the data. In addition to R², the mean absolute error (MAE) and the mean relative error (MRE) are also used. The MAE calculates the average of the absolute differences between predicted and observed values, offering a straightforward interpretation of error magnitude. Conversely, the MRE expresses the average error as a percentage of the actual values, with a value closer to zero indicating higher predictive accuracy (Than et al. 2021; Wan et al. 2022). These metrics provide a comprehensive view of a model's performance, guiding improvements and application in real-world scenarios.
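The corresponding expressions, with $\bar{y}$ the mean of the observations, are

$$R^2=1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2},\qquad \mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i-y_i\right|,\qquad \mathrm{MRE}=\frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i-y_i}{y_i}\right|.$$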
Workflow of the model
RESULTS AND DISCUSSION
Water quality assessment and trends of water quality parameters
To enhance the management of water quality and the WQI for this source, we have identified seven critical water quality indicators for ongoing environmental monitoring: pH, SO42−, Cl−, CODMn, TP, DO, and BOD. The pH level is crucial as it influences biological activity and chemical processes in water. Sulfates, chlorides, and phosphorus are significant pollutants that pose direct threats to the environment. CODMn serves as an indirect measure of water quality by indicating the oxidation potential of organic substances, while BOD is a direct measure of the organic pollution that can affect aquatic life. Monitoring these indicators provides a thorough evaluation of both the environmental quality and the ecological health of the water body (Chanapathi & Thatikonda 2019; Manna & Biswas 2023).
SO42− levels in the reservoir also showed considerable fluctuations, primarily oscillating between 10 and 40 mg/L. However, within the 60- to 80-month interval, these levels experienced significant variations, peaking at 88 mg/L. The extreme high value of 88.25 mg/L and a remarkably low value of 0.12 mg/L, combined with a median of 19.3 mg/L, indicate sporadic surges in SO42− concentrations. This inference is further supported by the mean value exceeding the median, and the distribution pattern, as visualized in the histogram and standard curve, suggests a deviation from a normal distribution. Cl− concentrations displayed relatively less variability, mainly within the range of 20–40 mg/L. However, a notable transient increase is observed early in the sequence, within the 1- to 30-month interval, with the maximum value recorded at 107 mg/L and the minimum value at 4.87 mg/L. The median value of 16.4 mg/L implies that a significant number of readings are concentrated between 10 and 30 mg/L. The distribution pattern of Cl− does not conform to a standard curve, indicating a skewed distribution.
Additionally, the study notes significant variations in CODMn and DO levels. Initially, the CODMn showed relative stability with minor fluctuations, but after the 80th month, the range of variation increased considerably, with differences of up to 3.5 mg/L. DO levels also varied notably, especially in the latter part of the series, with fluctuations of around 2.85 mg/L. Both TP and BOD recorded extreme values early in the study but later exhibited more stability. After the 80th month, BOD levels began showing more pronounced fluctuations. These patterns, marked by outlier events followed by stability, suggest a distribution that deviates from the typical pattern. Typically, the lower the water temperature and the load of oxygen-consuming organic pollutants, the higher the DO. Since the water temperature in the reservoir area has not changed significantly over the years, the influence of the oxygen-consuming indicators, which characterize the organic pollution of the water body in the reservoir area, is evident (Li et al. 2020; Jiang et al. 2022). In summary, the oxygen-consuming metrics reflected the impact of organic pollution on the water body in the reservoir, whereas pH, chloride ions, and sulfate ions showed only slight variation.
LSTM standalone prediction
Preprocessing by the HP filter and its LSTM prediction
Estimation of the lambda value for the HP filter and its preprocessing
In the HP filter, the lambda parameter controls the smoothness of the output. Lambda selection is typically heuristic, often relying on data properties and iterative refinement (Zhang et al. 2022b). Applying the HP filter as a pre-processing step to the water quality parameter data from the Da Jing Shan Reservoir profoundly impacted the data characteristics, as evidenced in Table 1. The pre-processing led to a significant decrease in variance and kurtosis across different parameters. This reduction in variance implies a more homogeneous dataset, and the lower kurtosis suggests a distribution with fewer extreme outliers or less 'tailedness'. These changes generally make the data easier for the model to learn from, as extreme fluctuations and anomalies can complicate the training process and lead to poor model performance.
Table 1. Variance and kurtosis coefficient of each water quality parameter before and after HP filtering.

| | | pH | SO42− | Cl− | CODMn | DO | TP | BOD |
|---|---|---|---|---|---|---|---|---|
| Unfiltered | Variance | 0.17 | 8.52 | 12.55 | 0.55 | 0.43 | 0.01 | 0.51 |
| | Kurtosis coefficient | 1.68 | 1.30 | 4.33 | 0.29 | 1.04 | 2.94 | 1.14 |
| Filtered | Variance | 0.24 | 10.13 | 12.98 | 0.40 | 0.49 | 0.02 | 0.63 |
| | Kurtosis coefficient | 5.93 | 1.02 | 18.24 | −0.87 | 2.39 | 63.25 | 9.40 |
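A hedged sketch of the pre-processing check summarized in Table 1, using the HP filter from statsmodels and excess kurtosis from SciPy, is shown below; the file name, column labels, and use of the trend component as the 'filtered' series are assumptions.

```python
# Sketch of the Table 1 check: variance and kurtosis before and after HP filtering.
# File name, column names, and the lambda value are assumptions for illustration.
import pandas as pd
from scipy.stats import kurtosis
from statsmodels.tsa.filters.hp_filter import hpfilter

df = pd.read_csv("dajingshan_monthly.csv")             # assumed monthly records
for col in ["pH", "SO4", "Cl", "CODMn", "DO", "TP", "BOD"]:
    cycle, trend = hpfilter(df[col], lamb=1)            # returns (cycle, trend)
    print(f"{col:6s} unfiltered var={df[col].var():.2f} kurt={kurtosis(df[col]):.2f}  "
          f"filtered var={trend.var():.2f} kurt={kurtosis(trend):.2f}")
```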
Model parameter adjustment
The parameter adjustment of the HP-LSTM model involves two main aspects. The lambda value of the HP model component is set to 1 based on prior experience. Parameters of the LSTM model include the hidden layer size, back-step, and iteration number. The hidden layer size is set to 64 based on empirical knowledge. However, the back-step and iteration numbers require more careful adjustment according to individual datasets. Observing the fluctuations and trends of each dataset helps determine a general range for these parameters. Subsequently, continuous adjustments are made based on the LOSS chart and the MSE value, leading to the identification of appropriate back-step and iteration numbers.
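Reusing the helpers from the earlier sketches (make_windows, build_lstm, and the example series do), the configuration described above might look roughly like this; the epoch count and batch size are placeholders tuned against the LOSS curve and MSE, not values reported by the study.

```python
# Illustrative HP-LSTM configuration: lambda = 1, hidden size = 64, back-step = 15.
# Epochs and batch size are assumed placeholders adjusted per dataset.
import numpy as np
from statsmodels.tsa.filters.hp_filter import hpfilter

_, trend = hpfilter(do, lamb=1)                  # smooth the raw indicator first
X, y = make_windows(np.asarray(trend))           # back-step of 15 lagged months
split = int(len(X) * 0.8)

model = build_lstm(lookback=15, hidden=64)
history = model.fit(
    X[:split, :, None], y[:split],
    validation_data=(X[split:, :, None], y[split:]),
    epochs=60, batch_size=16, verbose=0)         # monitor the LOSS curve / MSE to tune
```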
LSTM prediction performance
The utilization of the HP filter as a pre-processing method on the water quality data from the Da Jing Shan Reservoir had a notable impact on the data's characteristics, as demonstrated in Table 1. This pre-processing resulted in a marked reduction in variance and kurtosis across various parameters. A decrease in variance suggests a more homogeneous dataset, implying that the data points are more closely clustered around the mean. Simultaneously, the reduction in kurtosis indicates a distribution with fewer extreme outliers or less pronounced tails. These alterations generally lead to data that is more amenable to modeling. This is because extreme fluctuations and anomalies in the data can complicate the training process of models, often leading to suboptimal performance.
Conversely, the LSTM model trained on data processed with the HP filter demonstrated significantly different behavior. Post-preprocessing, the LOSS curve of this model showed a pronounced downward trajectory, signifying improved performance with each additional training batch. This trend eventually plateaued, indicating that the model had attained an optimal level of comprehension of the data patterns. Such a trend indicates a successful training process, where the model learns effectively and reaches a point of diminishing returns on further training, a hallmark of effective model optimization.
The efficacy of the LSTM model's training, enhanced by the HP filter pre-processing, is convincingly demonstrated in Figure 7, which represents the prediction results for both the training and validation phases, respectively. These figures vividly illustrate the close alignment between the model's predictions and the observed values for each water quality index in both periods. The remarkable predictive performance observed during the training phase is not isolated; it is consistently replicated in the validation phase. This consistency is a strong indicator that the model has achieved a commendable level of generalization. It suggests that the LSTM network is not merely memorizing the training data but is effectively learning and adapting to the underlying patterns in the dataset.
This ability to accurately capture and predict the behavior of the various water quality parameters in the Da Jing Shan Reservoir is a significant testament to the combined approach's effectiveness. The LSTM network, renowned for its capability to process and learn from sequential data, and the HP filter, known for its proficiency in smoothing and highlighting essential trends in time-series data, provide a robust framework for water quality prediction. The combined approach effectively mitigates issues like overfitting, where a model performs well on training data but poorly on unseen data, ensuring that the model remains practical and reliable for water quality applications.
The LOSS plot identifies the model fit
Figure 8 illustrates the LOSS diagram of each indicator's model under the final parameters, offering insight into the fitting status of each model.
pH: Analysis of Figure 8(a) reveals that the LOSS curve of the unfiltered training set exhibits minimal decrease, suggesting inadequate learning from the training data. Conversely, the filtered LOSS curve for both training and test sets begins to converge after 20 iterations, with a negligible gap between them, indicating an optimal fit.
SO42−: Examination of Figure 8(b) indicates declining unfiltered LOSS curves. While the test set stabilizes after 20 iterations, the LOSS value remains above 0.3, indicating underfitting. However, post-filtering, the LOSS curve steadily decreases and flattens after 20 iterations, with a value below 0.1, demonstrating a perfect fit.
Cl−: As seen in Figure 8(c), the unfiltered LOSS curve stabilizes around 0.6 after 10 iterations, suggesting poor fitting. Conversely, the filtered LOSS curve consistently decreases in both training and test sets, approaching 0.05 around the 35th iteration, with minimal disparity between sets, signifying excellent fitting.
CODMn, DO, and BOD: Figure 8(d), Figure 8(e) and Figure 8(g) depict similar LOSS graphs for these indicators. Under unfiltered conditions, the training set yields stable, low LOSS values, while the test set's LOSS curve either stagnates or increases significantly, indicating underfitting. However, post-filtering, both training and test set LOSS curves steadily decline, plateauing around the 40th iteration, with significantly reduced values compared to pre-filtering. Despite not being perfect, these models exhibit strong fitting.
TP: Figure 8(f) illustrates a steadily declining LOSS curve for this index both pre- and post-filtering, flattening after 25 iterations. Post-filtering, the value decreases significantly, indicating improved fitting.
Model evaluation and error comparison before and after filtering
Figure 7 depicts the impact of data pre-processing on the performance of the LSTM model applied to pH value predictions from January 2010 to December 2020 using 132 data points. Before the implementation of data processing, the LOSS metric during the test period demonstrates a trend of stabilization but remains at a high error magnitude. The lack of a decrease in error with an increasing batch count indicates the model's initial inability to extract meaningful patterns from the dataset. This suggests that, in its unprocessed state, the data presents complexities or anomalies that the LSTM model struggles to interpret effectively, hindering its learning and predictive capabilities.
The scenario changes dramatically after the application of the HP filter. Post-filtering, the LOSS values consistently stay below 0.2, a substantial improvement over the pre-processing phase. This marked reduction in LOSS values signifies a significantly enhanced predictive performance. The HP filter, known for its ability to smooth out noise and highlight underlying trends in time-series data, appears to have made the data more tractable for the LSTM model. By reducing the complexity and irregularities in the data, the HP filter allows the LSTM network to learn from the data more effectively and make accurate predictions. This contrast in performance before and after data processing underscores the importance of appropriate data pre-processing in ML applications, especially when dealing with time-series data. It highlights how pre-processing techniques like the HP filter can improve the quality and reliability of predictions made by sophisticated models such as LSTM networks. This improved performance is not just a technical achievement but also has practical implications, ensuring more reliable and accurate predictions in real-world applications such as environmental monitoring and management.
Table 2. Evaluation metrics (MSE, MAE, MRE, and R²) of the LSTM model for each parameter on the training and test sets, without and with HP filtering.

| Dataset | Parameter | MSE | MAE | MRE | R² |
|---|---|---|---|---|---|
| Training (without HP filter) | pH | 0.23 | 0.16 | 0.02 | 0.32 |
| | SO42− | 8.33 | 6.68 | 1.36 | 0.59 |
| | Cl− | 9.73 | 6.37 | 0.44 | 0.49 |
| | CODMn | 0.40 | 0.25 | 0.12 | 0.60 |
| | DO | 0.31 | 0.22 | 0.03 | 0.63 |
| | TP | 0.01 | 0.00 | 0.22 | 0.66 |
| | BOD | 0.33 | 0.22 | 0.16 | 0.52 |
| Test (without HP filter) | pH | 0.28 | 0.22 | 0.28 | 0.00 |
| | SO42− | 10.42 | 7.28 | 0.38 | – |
| | Cl− | 9.21 | 6.46 | 0.45 | 0.27 |
| | CODMn | 0.91 | 0.79 | 0.35 | 0.22 |
| | DO | 0.71 | 0.60 | 0.09 | – |
| | TP | 0.01 | 0.01 | 0.51 | 0.23 |
| | BOD | 0.95 | 0.84 | 0.50 | 0.22 |
| Training (with HP filter) | pH | 0.05 | 0.03 | 0.01 | 0.99 |
| | SO42− | 2.27 | 1.66 | 0.08 | 0.93 |
| | Cl− | 1.71 | 1.38 | 0.08 | 0.97 |
| | CODMn | 0.05 | 0.03 | 0.02 | 0.98 |
| | DO | 0.10 | 0.07 | 0.01 | 0.94 |
| | TP | 0.00 | 0.00 | 0.09 | 0.94 |
| | BOD | 0.09 | 0.07 | 0.04 | 0.93 |
| Test (with HP filter) | pH | 0.07 | 0.06 | 0.01 | 0.73 |
| | SO42− | 1.23 | 1.02 | 0.05 | 0.97 |
| | Cl− | 2.57 | 2.07 | 0.13 | 0.92 |
| | CODMn | 0.39 | 0.12 | 0.15 | 0.62 |
| | DO | 0.17 | 0.11 | 0.02 | 0.80 |
| | TP | 0.00 | 0.00 | 0.13 | 0.54 |
| | BOD | 0.28 | 0.25 | 0.13 | 0.85 |
Regarding the precision of pH value predictions prior to processing, the forecasted values displayed a reasonable alignment with the actual values during training, evidenced by an MRE of 2% and an MAE of 0.167. Conversely, during the testing phase, the MRE escalated to 28%, and the MAE increased to 0.228. Post-processing improvements are noteworthy, as evidenced by the MRE decreasing to a mere 0.5% and the MAE to 0.027 during training. The testing phase also reflected minimal discrepancies, with an MAE value of 0.056 and an MRE value of approximately 0.7% (Wun & Wen 1991; Karunasingha 2022; Jiao et al. 2023).
For SO42− and Cl− predictions before data processing, the training phase MAEs are over 6, with the MRE soaring to 135% for SO42− and 44% for Cl−. These metrics stabilized at approximately 40% during the testing phase. Following data processing, both parameters exhibited reduced disparities between predicted and actual values, with MAEs around 1.5 and MREs under 9%. The testing phase retained this trend of precision, with MREs also below 10%. Before processing, the CODMn index predictions reflected MAEs of 0.25 during training and 0.8 during testing, with MREs of 11 and 35%, respectively. These values improved significantly after processing, with MAEs reducing to 0.025 and 0.12, and MREs to 2 and 15%, underscoring enhanced predictive accuracy. Before processing, MAEs are above 0.55 for DO and BOD during testing, with BOD's MRE reaching 50%. Post-processing, however, revealed a significant amelioration during training, with MAEs below 0.2 and MREs of only 2 and 12%, respectively (Willmott & Matsuura 2005).
The impact of data processing on the accuracy metrics for TP is evident when comparing the pre- and post-processing phases. Post-processing, there is a noticeable improvement in the model fitting during training, as indicated by a relatively low error rate of just 8%. This improvement reflects the model's enhanced ability to learn and adapt to the patterns within the training dataset. However, when it comes to testing, the model demonstrates notable lags when its predictions are compared with the actual values. This discrepancy suggests that the model does not fully account for the presence of significant temporal effects. Despite these lags, the overall trend in the model's predictions post-processing remains consistent, with a relative error of 12%. This consistency, even in the presence of temporal variability and a limited dataset, is a positive indicator of the model's capability to capture general trends in the data. However, the discernible delays in model predictions point to areas where further refinement is needed.
Given the complex nature of environmental data, such as that of TP levels, which can be influenced by many factors and exhibit considerable temporal variation, it is not uncommon for models to face challenges in achieving perfect accuracy. The delays and lags observed suggest that the model may benefit from enhancements, such as incorporating additional variables that account for temporal dynamics or applying more sophisticated data pre-processing techniques that can better handle the variability. The current findings, as highlighted in the referenced study (Park & Stefanski 1998), underscore the importance of continuous model evaluation and refinement, especially in fields dealing with dynamic and complex datasets. Improving the model's ability to account for temporal effects and reduce prediction delays will be crucial for increasing its reliability and applicability, such as water quality monitoring and management.
CONCLUSION AND FUTURE PERSPECTIVES
The simulation analyses conducted in this study clearly illustrate the enhanced performance of the LSTM model when augmented with HP filtering, especially when compared to an unaided LSTM model. This enhancement is evident in the model's fitting effects and prediction accuracy. A significant achievement of this combined approach is the high degree of congruence observed between the model's predicted values and the actual observations. For most water quality indicators studied, this congruence surpassed the 80% threshold, a notable benchmark in predictive modeling. In particular, the predictions for SO42−, Cl−, and DO are remarkably accurate, closely matching the empirical results with an accuracy of up to 90%. This high precision in tracing and simulating the measurement processes speaks volumes about the model's efficacy.
Additionally, the performance of the model for critical parameters such as SO42−, Cl−, and DO is further validated by the low error metrics. The MSE, MAE, and MRE for these parameters all remained below 10%. Such low error rates reinforce the robustness and validity of the proposed model, confirming its reliability in accurately predicting water quality parameters.
LSTM has been applied in various fields, resulting in models such as Long Short-Term Memory with Empirical Mode Decomposition (EMD-LSTM), Long Short-Term Memory with Principal Component Analysis (PCA-LSTM), and Long Short-Term Memory with Wavelet Decomposition and Wavelet Neural Network (WD-WNN-LSTM). EMD-LSTM addresses feature fluctuations, decomposes scale variations, and enhances raw data utilization. PCA-LSTM reduces data dimensions, eliminates redundancy, and decreases prediction errors. WD-WNN-LSTM captures complex spatiotemporal characteristics and non-linear relationships, improving model accuracy (Hao et al. 2022; Li et al. 2022; Wang 2024).
This study significantly impacts water quality forecasting for the Da Jing Shan Reservoir. The successful application of the LSTM model combined with the HP filter in analyzing the reservoir's water quality demonstrates their effectiveness and suitability for similar predictive tasks. For models where short-term data fluctuations impact fitting accuracy, combining HP with LSTM mitigates these fluctuations, enhancing accuracy and achieving a precision of over 85%. Compared to newer composite LSTM models such as EMD-LSTM and Long Short-Term Memory with Kernel Principal Component Analysis (KPCA-LSTM), this approach is more concise and practical. This research is crucial for ensuring water supply safety for both the reservoir and Zhuhai City by providing a reliable model framework.
A real-time water quality monitoring platform based on the model has been developed to predict and warn of future changes in water quality. This platform aids in the early detection of potential issues, the timely issuance of warnings, and reviews of pollution processes, enhancing response capabilities to water environmental risks. It also improves governance planning, promoting high-quality water conservation and environmental protection.
ACKNOWLEDGEMENTS
This work is supported by the Foshan Shunde District Core Technology Breakthrough Project (2230218004273), the Science and Technology Plan Project of Zhuhai in the Field of Social Development (2220004000355), Guangdong Basic and Applied Basic Research Foundation (2023B1515040028), and the National Key Research and Development Program of China (2022YFC3202200).
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.