ABSTRACT
Accurate, stable, and long-term water quality predictions are essential for water pollution warning and efficient water environment management. In this study, a hierarchical water quality prediction (HWQP) model was developed based on ‘data decomposition–predictor screening–efficient prediction’ via wavelet decomposition, Spearman correlation analysis, and long short-term memory network, respectively. The observed data from 14 stations in the Huaihe River–Hongze Lake system, including ammonia nitrogen (AN) and chemical oxygen demand (COD), were used to make long-term water quality predictions. The results suggested that, compared to existing water quality prediction models, the HWQP model has higher accuracy, with root mean square errors of 0.06 and 0.17 mg/l for simulating AN and COD, respectively. The AN and COD concentrations will range from 0 to 1 mg/l and from 3 to 5 mg/l at 12 stations, respectively, and the COD concentrations will exceed the water quality target at Stations 4 and 5. The established model has great potential to address the challenges associated with the water environment.
HIGHLIGHTS
The original water quality sequences are decomposed by wavelet transformation.
Driving predictors for water quality predictions are identified using the Spearman coefficient.
A novel hybrid model is developed for long-term water quality prediction.
INTRODUCTION
Pollution discharge to sensitive ecosystems has profoundly disrupted their health and functions (Mahdian et al. 2024). Under the background of population growth and industrialization, the increasing pollution discharge has led to water quality degradation as well as a series of problems such as nutrient load and emerging pollutants (Tian et al. 2024a, b). To address the increasing problems of water pollution, a large number of studies have been conducted on water quality monitoring, simulation, and prediction (Behmel et al. 2016; Singh & Ahmed 2021). The prediction of water quality is necessary to improve water pollution control and efficient management of the water environment.
Two categories of models are generally used in water quality prediction, i.e., conventional models and artificial intelligence (AI) models (Dong et al. 2023). Conventional models are based on physical mechanism simulation (Tang et al. 2014; Ding et al. 2019), regression analysis methods (Avila et al. 2017; Wu et al. 2023), and time series decomposition (Yu et al. 2020). The physical mechanism models that can achieve detailed analysis in contaminant transportation processes often involve complex modeling and massive data requirements (Fomarelli et al. 2013). The prediction accuracy of regression analysis and data decomposition methods is related to the complexity of the observed water quality sequence, and the stability and universality of the model are difficult to guarantee. With the development of AI technology, its strong predictive ability for highly non-linear and non-stationary sequences has also been applied to predict water quality. The uncertainty of AI methods has been challenging, and it is difficult to handle complex water quality sequences with traditional algorithms such as random forest, artificial neural networks, and support vector machine (Habib et al. 2024). Many deep neural network models have been successfully applied to predict the concentration of various pollutants, such as ammonia nitrogen (AN), chemical oxygen demand (COD), and biological oxygen demand (Najah et al. 2011; Emamgholizadeh et al. 2014; Imani et al. 2021).
Long short-term memory network (LSTM) is a type of deep learning model that can effectively capture long-term dependencies and mitigate the gradient explosion problem of traditional neural networks during model training (Peng et al. 2022). When applying LSTM, it is essential to determine the predictors and the form of the inputs. However, if the original water quality sequences are directly simulated by LSTM, there will be significant errors in the results for sequences with seasonal and long-term trends (Komornikova et al. 2008). Furthermore, it is difficult for LSTM to provide inter-annual predictions because of missing inputs and error accumulation in simulation.
To improve the performance of AI models in water quality predictions, data decomposition methods have been used to process the original water quality datasets (Wang & Wu 2016). Compared to directly using original data, water quality prediction models based on data decomposition can extract multi-scale features and simplify the datasets (Eze & Ajmal 2020). Wavelet decomposition (WTD) (Song et al. 2021) has been found to be effective in the analysis of water quality series and can facilitate the analysis of multi-temporal scale structures, identification of principal components, and removal of noise. WTD can decompose the original water quality sequence into several relatively stable subsequences with different frequencies, which has been proven to have advantages in revealing subtle changes and hidden information of water quality series (Yuan et al. 2022; Han et al. 2023). Thus, one of the main motivations of this study is to apply WTD into the proposed water quality prediction model.
Water quality parameters often fluctuate due to various natural and human influences, exhibiting instability and seasonality. To predict water quality using deep learning models effectively, it is crucial to identify the appropriate influencing factors, which is essential for ensuring the rationality of the model and improving its accuracy (Li et al. 2022). The Spearman correlation analysis (SCA), which is a non-parametric statistical method, possesses strong capabilities for data interpretation and spatiotemporal pattern recognition (Karthikeyan et al. 2017). Therefore, the SCA can establish a relationship with LSTM by extracting key predictors from multiple influencing factors. This process helps in identifying the most significant variables that impact water quality, thereby enhancing the predictive performance and accuracy of LSTM.
Based on previous research models, the structure of a water quality prediction model can be optimized by considering the distribution patterns of pollutants and related influencing factors of the study area. Numerous studies have explored water quality prediction models, focusing on input pre-processing (data decomposition), structure adjustment (predictor screening), sequence prediction, and output post-processing, all of which have been proven to enhance model performance. However, previous studies improved only a portion of the model without considering the overall optimization, so the model's performance was not fully optimized. For example, when only the WTD-based optimization of a deep learning model was considered and the impact of streamflow as a factor in water quality prediction was ignored, there were significant relative errors in simulating rivers with seasonal variation of streamflow (Zhou et al. 2022).
This article proposes a hierarchical water quality prediction model with ‘decomposition–inputs–prediction’ hierarchical optimization. Distinct from previous related work, the developed model optimizes the design of water quality prediction by combining data decomposition (WTD) and predictor screening (SCA). Based on the stationary characteristics of the decomposed subsequences, long-term predictions were made from the decomposed inputs.
The aims of this study were to (1) develop a novel water quality prediction model with the ‘decomposition–inputs–prediction’ hierarchical optimization framework, (2) apply this model for water quality predictions (including AN and COD) of the Huaihe River–Hongze Lake (HR–HL) system, and (3) make long-term water quality predictions for the HR–HL system.
METHODS
Data decomposition based on wavelet decomposition
Step 1. Select a set of continuous finite orthogonal wavelet basis functions and align it with the starting point of the water quality series.
Step 2. Calculate the decomposition coefficients c and d. The larger the coefficient, the more similar the waveform of the current water quality sequence is to the wavelet basis function.
Step 3. Move the wavelet basis function along the time axis and calculate the decomposition coefficient at each time until it covers the entire water quality sequence.
Step 4. Scale the selected wavelet function by one unit and then repeat Steps 1–3.
In this study, we selected Daubechies 5 (db5) as the basis function based on the waveform matching algorithm (Farajpanah et al. 2024). db5 is more suitable for decomposing water quality data in this study through matching the shape of the observed data with the desired wavelet.
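The decomposition steps above can be illustrated with a minimal sketch of a three-level discrete wavelet decomposition. As a simplification, it uses the Haar wavelet instead of db5 so that the filter stays short, and it returns coefficient arrays rather than full-length reconstructed subsequences. The function names and the sample series are hypothetical and are not taken from the study.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_step(signal):
    """One decomposition level: return (approximation, detail) coefficients."""
    values = list(signal)
    if len(values) % 2:
        values.append(values[-1])  # pad odd-length input by repeating the last value
    approx = [(values[i] + values[i + 1]) / SQRT2 for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / SQRT2 for i in range(0, len(values), 2)]
    return approx, detail

def wavelet_decompose(signal, levels=3):
    """Return [c_L, d_L, ..., d_1], i.e. [c3, d3, d2, d1] for levels=3."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return [approx] + details[::-1]

def wavelet_reconstruct(coeffs):
    """Invert wavelet_decompose for inputs whose length is a multiple of 2**levels."""
    approx = coeffs[0]
    for detail in coeffs[1:]:
        rebuilt = []
        for a, d in zip(approx, detail):
            rebuilt.extend([(a + d) / SQRT2, (a - d) / SQRT2])
        approx = rebuilt
    return approx

# Hypothetical concentration series (mg/l), 16 samples.
series = [0.42, 0.40, 0.55, 0.61, 0.38, 0.35, 0.47, 0.52,
          0.44, 0.41, 0.58, 0.66, 0.39, 0.33, 0.45, 0.50]
c3, d3, d2, d1 = wavelet_decompose(series, levels=3)
restored = wavelet_reconstruct([c3, d3, d2, d1])
```

In practice, a wavelet library that supports db5 would replace `haar_step`; the structure of the output (one low-frequency sequence c3 and three detail sequences d1 to d3) is the same.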
Predictor screening by SCA
Water quality indicators are usually correlated with multiple factors, and the decomposed water quality subsequences are also interrelated with each other. SCA was used to describe the trend direction and correlation strength of two random variables in this study. SCA is good at identifying and revealing the degree of association between one dependent variable and one or more variables in a complex water quality dataset (Xiao et al. 2016). This information can evaluate the contribution of driving predictors for more accurate water quality prediction.
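As an illustration of how SCA can rank candidate predictors, the sketch below computes the Spearman coefficient as the Pearson correlation of average ranks and then searches for the lead time with the strongest association. The function names, the maximum lag of 4, and the toy series are assumptions made for this example only.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

def best_lag(target, candidate, max_lag=4):
    """Return (lag, rho) maximizing |rho| between target(t) and candidate(t - lag)."""
    best = None
    for lag in range(1, max_lag + 1):
        rho = spearman(target[lag:], candidate[:-lag])
        if best is None or abs(rho) > abs(best[1]):
            best = (lag, rho)
    return best

# Toy data: the target is a monotone (cubic) function of the candidate two steps earlier.
candidate = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
target = [0, 0] + [c ** 3 for c in candidate[:-2]]
lag, rho = best_lag(target, candidate)
```

Because the coefficient works on ranks, the non-linear but monotone relationship in the toy data is still detected at the correct lead time.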
Combining more relevant factors can enhance the performance of the water quality prediction model. However, too many predictors may lead to longer computation time and greater prediction uncertainty, while too few might reduce accuracy. Moreover, when the prediction range exceeds the set value of the time delay, recursive prediction is required, and predicted values need to be substituted for the measured values of the predictors. Thus, the reliability of the predictors is crucial in water quality prediction.
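The recursive prediction mentioned above can be sketched as follows. The example assumes a single predictor with a fixed time delay and a hypothetical one-step model (`toy_model`); it only shows how predicted values replace measured ones once the horizon exceeds the delay.

```python
def recursive_forecast(history, one_step_model, lag, horizon):
    """Forecast `horizon` steps; each step is driven by the value `lag` steps earlier.

    The first `lag` forecasts use measured inputs. Later forecasts use
    earlier forecasts as inputs, which is where errors can accumulate.
    """
    series = list(history)
    n_observed = len(series)
    for _ in range(horizon):
        series.append(one_step_model(series[-lag]))
    return series[n_observed:]

def toy_model(x_lagged):
    """Hypothetical stand-in for a trained predictor."""
    return 0.5 * x_lagged + 1.0

measured = [2.0, 4.0, 6.0]
forecast = recursive_forecast(measured, toy_model, lag=3, horizon=5)
# Steps 1-3 are driven by the measured values 2.0, 4.0 and 6.0;
# steps 4-5 are driven by the forecasts from steps 1 and 2.
```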
Long short-term models for water quality parameters
The variation of water quality is controlled by coupled impacts of numerous hydrological and anthropogenic factors, and it can be generalized as a mapping relationship of an output with multiple inputs, which is similar to neural network models (Kratzert et al. 2018). LSTM, one of the recurrent neural networks (RNNs), is especially good at predicting time series. It overcomes the limitations of conventional RNNs by alleviating the vanishing and exploding gradient problems in long-term time series prediction (Ahmadi et al. 2024). Other improved models (such as Bi-LSTM and stacked LSTM) have similar accuracy to LSTM in predicting non-stationary sequences such as water quality but incur higher training costs because of their more complex structures (Adib et al. 2024). Hence, LSTM was applied in this study to achieve accurate and stable prediction of complex water quality sequences.
As introduced above, the LSTM network can better capture long-term dependence in sequences through the cell state and gate mechanism, and the temporal correlation does not weaken even as the time series becomes longer.
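To make the cell state and gate mechanism concrete, the following sketch implements one forward step of a standard LSTM cell with a scalar hidden state. The weights are hypothetical and untrained; the example only shows how a forget gate close to 1 lets the cell state persist over many time steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step with a scalar hidden state.

    params maps a gate name to (input weights, recurrent weight, bias).
    """
    def pre_activation(gate):
        weights, recurrent, bias = params[gate]
        return sum(w * xi for w, xi in zip(weights, x)) + recurrent * h_prev + bias

    f = sigmoid(pre_activation("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre_activation("input"))        # how much new information to write
    o = sigmoid(pre_activation("output"))       # how much of the cell state to expose
    g = math.tanh(pre_activation("candidate"))  # candidate cell update
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Hypothetical parameters: forget gate nearly open, input gate nearly closed.
params = {
    "forget": ([0.0, 0.0], 0.0, 10.0),
    "input": ([0.0, 0.0], 0.0, -10.0),
    "output": ([0.0, 0.0], 0.0, 0.0),
    "candidate": ([0.0, 0.0], 0.0, 0.0),
}
h, c = 0.0, 1.0
for _ in range(50):
    h, c = lstm_step([0.3, -0.2], h, c, params)
# After 50 steps the cell state is still close to its initial value of 1.0.
```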
Evaluation metrics of long-term water quality prediction
Three indicators that are often used to assess simulation errors are calculated to evaluate the accuracy, reliability, and stability of the water quality prediction models in this study.
The root mean square error (RMSE), the mean absolute percentage error (MAPE), and the augmented Dickey–Fuller (ADF) test are selected to evaluate the accuracy, reliability, and stability, respectively, with the objective of addressing the interference of outliers and calibrating the performance in long-term water quality prediction.
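The two error metrics can be sketched as below, assuming their standard definitions (the study's own equations are not reproduced here). The ADF stationarity test is not implemented in this sketch; one option is the `adfuller` function in `statsmodels.tsa.stattools`. The sample values are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean square error, in the units of the data (e.g. mg/l)."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mape(observed, predicted):
    """Mean absolute percentage error (%); observed values must be non-zero."""
    ratios = [abs((o - p) / o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

observed = [1.0, 2.0, 4.0]
predicted = [1.1, 1.8, 4.4]
error_rmse = rmse(observed, predicted)   # sqrt(0.07), about 0.26 mg/l
error_mape = mape(observed, predicted)   # about 10 %
```

Because MAPE divides by the observed value, it grows quickly when concentrations are close to zero, which is one plausible reason why MAPE is larger for AN than for COD in this study.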
STUDY REGION AND DATA
The HR–HL system (Eastern China) is the study region for this research with data from 14 monitoring stations (S1–S14) obtained during 1998–2018.
Study region
Data
There are 14 monitoring stations in this study area. S1–S6 are located in the mainstream of the HR, and S7–S14 are located in the HL. In addition, S6 is a monitoring station at the confluence of the HR and the HL, and S8, S10, S12, and S14 are at the confluences of the HL and other tributaries. In this study, contaminant concentrations and water temperature are sampled weekly or monthly, and the streamflow is measured daily. The measurement period of the available AN and COD concentration datasets is from 1998 to 2018 for S1–S4, from 2003 to 2018 for S5 and S6, and from 2004 to 2018 for the other stations around the HL (Table 1). The concentration and temperature data were measured using the national standard water quality detection method and were provided by the HR Water Resources Protection Bulletin and the HR Water Environment Monitoring Center. The streamflow data were obtained from the hydrographic office of the HR Commission of the Ministry of Water Resources. The streamflow and temperature data series are sufficiently long to match the water quality data, and the daily streamflow datasets are converted into weekly or monthly mean values to drive the prediction model. Datasets before January 2018 are used for calibration, and data from 2018 are used for validation of the water quality model in this study.
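The conversion of daily streamflow into monthly means and the split into calibration and validation periods described above can be sketched as follows. The function names and the sample records are hypothetical; only the cut-off at January 2018 comes from the text.

```python
from collections import defaultdict
from datetime import date

def monthly_means(daily_records):
    """Aggregate (date, value) pairs into {(year, month): mean value}."""
    sums = defaultdict(lambda: [0.0, 0])
    for day, value in daily_records:
        acc = sums[(day.year, day.month)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sorted(sums.items())}

def split_calibration_validation(monthly, validation_year=2018):
    """Months before the validation year calibrate the model; that year validates it."""
    calibration = {k: v for k, v in monthly.items() if k[0] < validation_year}
    validation = {k: v for k, v in monthly.items() if k[0] == validation_year}
    return calibration, validation

# Hypothetical daily streamflow records (m3/s).
daily = [
    (date(2017, 12, 30), 10.0),
    (date(2017, 12, 31), 20.0),
    (date(2018, 1, 1), 30.0),
]
monthly = monthly_means(daily)
calibration, validation = split_calibration_validation(monthly)
```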
Station ID | Station name | Location | Monitoring period | Elevation (m) |
---|---|---|---|---|
S1 | Lu Taizi | HR | 1998.1–2018.12 | 24.2 |
S2 | Beng Bu | HR | 1998.1–2018.12 | 21.3 |
S3 | Wu Jiadu | HR | 1998.1–2018.12 | 20.7 |
S4 | Lin Huaiguan | HR | 1998.1–2018.12 | 18.5 |
S5 | Xiao Liuxiang | HR | 2003.1–2018.12 | 16.5 |
S6 | Lao Zishan | HR | 2003.1–2018.12 | 13.5 |
S7 | Lin Huai | HL | 2004.1–2018.12 | 12–14 |
S8 | Jiang Ba | HL | 2004.1–2018.12 | |
S9 | Gao Liangjian | HL | 2004.1–2018.12 | |
S10 | Er Hezha | HL | 2004.1–2018.12 | |
S11 | Cheng Zihu | HL | 2004.1–2018.12 | |
S12 | Xu Hong | HL | 2004.1–2018.12 | |
S13 | Cheng He | HL | 2004.1–2018.12 | |
S14 | Li Hewa | HL | 2004.1–2018.12 | |
CASE STUDY
The proposed WTD-SCA-LSTM method was applied to this case, and the driving predictors for the HR and the HL were screened separately. On the basis of short-term accurate and reliable prediction, the long-term prediction and assessment of water quality in the study area will also be discussed in this section.
Original water quality series decomposition
The water quality indicators of a river–lake system are highly non-linear and complex due to the influence of climate change and human interference. This section aims to obtain subsequences with simple features for prediction input by using discrete wavelet decomposition to decompose the original AN and COD concentrations.
Before 2008, the water pollution problem in the HR Basin was severe, and high pollutant concentrations were often recorded during water pollution incidents (Han et al. 2023). After 2008, the concentrations of pollutants were controlled and remained stable due to the implementation of water environment protection policies. Therefore, the dataset period selected for the prediction model in this article uniformly starts from 2008 to ensure the stability and accuracy of the model. The sampling frequency of the water quality data is half a month and db5 is chosen as the wavelet basis function; according to Equation (2), the number of wavelet decomposition levels is rounded to three after calculation.
Driving predictor screening and predictor set construction
Separately determining the inputs of the prediction model for the water quality indicators of each station is a prerequisite for accurate prediction. SCA was used in this section to quantitatively analyze the contribution of influencing factors to water quality parameters.
According to the physical processes of the water environment in the study area, the main hydrological factors affecting water quality are streamflow, water temperature, and wind speed (Lutz et al. 2016; Zlatanovic et al. 2017). Previous studies have utilized these factors to establish prediction inputs for predicting the original water quality sequence. When data decomposition methods are applied to optimize the prediction model, the predictors should also account for the potential impact of the hydrological factors. From the perspective of the transformation mechanism of environmental factors, complex interactions may exist between pollutants in water bodies, such as the inverse relationship between total nitrogen and dissolved oxygen. In addition, the transport of pollutants can lead to spatial variability in water quality distribution. Therefore, the spatial connection of water quality among stations was evaluated in the inputs of the prediction model. For stations along the HR, the streamflow is the main factor affecting pollutant transport; the contribution of the pollutant concentration at the upstream station to that at the local station was calculated in SCA. For stations around the HL, the spatial correlation was calculated with the pollutant concentration of nearby stations. The correlation is significant for stations around the confluences of the river–lake system (S6, S8, S10 and S12) due to highly dynamic contaminant activities.
The driving predictor sets for stations in the HR (S3) and the HL (S8) were obtained by maximizing the correlation coefficient of SCA (Tables 2 and 3, respectively). For the monitoring stations in this study area, the inputs of the high-frequency subsequences were repeatedly influenced by upstream pollutant transport, especially during wet years when pollution transport activities were intense. Comparing AN and COD, the inputs for COD prediction usually include temperature, which takes priority over other factors such as streamflow and wind speed. In contrast, the inputs for AN prediction were mainly driven by combinations of the original decomposed sequences.
Contaminant | Subsequence | Driving predictors (with the lead time) |
---|---|---|
AN | d1 | d1 (t − 3), AN (S3) (t − 3), COD (t − 3) |
AN | d2 | d2 (t − 3), AN (S2) (t − 3), AN (S3) (t − 3) |
AN | d3 | d3 (t − 2), AN (S2) (t − 2), AN (S3) (t − 2) |
AN | c3 | c3 (t − 1), AN (S3) (t − 1), Q (t − 2) |
COD | d1 | d1 (t − 3), COD (S3) (t − 3), AN (t − 3) |
COD | d2 | d2 (t − 3), COD (S2) (t − 3), COD (S3) (t − 3), T (t − 1) |
COD | d3 | d3 (t − 2), COD (S3) (t − 2), T (t − 1) |
COD | c3 | c3 (t − 1), COD (S3) (t − 1) |
Contaminant | Subsequence | Driving predictors (with the lead time) |
---|---|---|
AN | d1 | d1 (t − 4), AN (S7) (t − 4), AN (S8) (t − 4) |
AN | d2 | d2 (t − 4), AN (S7) (t − 4), AN (S8) (t − 4) |
AN | d3 | d3 (t − 3), AN (S8) (t − 3), COD (S8) (t − 3) |
AN | c3 | c3 (t − 1), AN (S8) (t − 1) |
COD | d1 | d1 (t − 4), COD (S8) (t − 4), COD (S7) (t − 4) |
COD | d2 | d2 (t − 4), COD (S8) (t − 4), T (t − 3) |
COD | d3 | d3 (t − 3), COD (S8) (t − 3), T (t − 2) |
COD | c3 | c3 (t − 1), COD (S8) (t − 2), T (t − 1) |
Water quality prediction based on LSTM
The LSTM was used to predict the subsequences decomposed from the original water quality series. The dimension of the input layer depends on the number of input predictors; the maximum input dimension in this study was 4, while the output dimension was 1, i.e., the target subsequence. The dimension of the hidden layer was set to 2 when simulating the original water quality sequence and d1, and to 1 for the others. The initial learning rate of LSTM was set to 0.001, and it decreased continuously with training. The period of training set was from 2009 to 2018 with a prediction period from 2018 to 2019. Before starting the LSTM simulation, the input datasets were normalized (zero mean and small variance) to improve training efficiency.
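The normalization step can be sketched as below. The sketch assumes z-score standardization (zero mean, unit variance) fitted on the training period only; the exact scaling used in the study is not specified, and the function names and sample values are hypothetical.

```python
import math

def fit_scaler(train):
    """Return (mean, std) of the training data; std falls back to 1.0 if constant."""
    n = len(train)
    mean = sum(train) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in train) / n)
    return mean, (std if std > 0 else 1.0)

def transform(values, mean, std):
    """Scale values with statistics learned from the training data."""
    return [(v - mean) / std for v in values]

def inverse_transform(values, mean, std):
    """Map scaled predictions back to concentration units."""
    return [v * std + mean for v in values]

train = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, std = fit_scaler(train)
scaled = transform(train, mean, std)
restored = inverse_transform(scaled, mean, std)
```

Fitting the scaler on the training period alone avoids leaking information from the prediction period into the model inputs.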
The simulation errors of all stations were analyzed using Equations (10) and (11). Overall, the mean RMSE values in the recorded period for predicting the AN and COD concentration series at all stations are 0.06 and 0.17 mg/l, respectively, and the mean MAPE values are 24.8 and 5.2%, respectively. This study used the ADF test to examine the stationarity of the relative error variation, and the errors of AN and COD concentrations were stationary (at the 95% confidence level) at 79.8 and 88.5% of all stations, respectively. In conclusion, the hierarchical optimization prediction model is characterized by high accuracy, reliability, and stability, which provides efficient and predictable support information for relevant departments.
Performance comparison of water quality prediction models
To compare the performance of the novel model with other water quality prediction models, the contributions of predictor screening and data decomposition to the LSTM were evaluated in this section. Selecting predictors with higher correlation by SCA helps eliminate unnecessary information from the inputs, especially when the water quality is significantly affected by a single factor (e.g., the response of AN concentration to streamflow during the flood season). The evaluation of simulating AN and COD concentrations for all stations by different hybrid models is shown in Table 4. After applying SCA to the LSTM prediction model, the mean value of RMSE for AN and COD concentrations decreased by 11 and 21%, respectively. In addition, the decrease in maximum RMSE is significant, which indicates that SCA is an essential part of water quality prediction for some stations.
Metric | Contaminant | Period | LSTM (Max) | LSTM (Min) | LSTM (Mean) | SCA + LSTM (Max) | SCA + LSTM (Min) | SCA + LSTM (Mean) | EMD + SCA + LSTM (Max) | EMD + SCA + LSTM (Min) | EMD + SCA + LSTM (Mean) | WTD + SCA + LSTM (Max) | WTD + SCA + LSTM (Min) | WTD + SCA + LSTM (Mean) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RMSE (mg/l) | AN | Training | 0.39 | 0.05 | 0.15 | 0.31 | 0.04 | 0.11 | 0.15 | 0.03 | 0.07 | 0.09 | 0.03 | 0.04 |
RMSE (mg/l) | AN | Testing | 0.46 | 0.05 | 0.18 | 0.37 | 0.05 | 0.16 | 0.15 | 0.03 | 0.08 | 0.12 | 0.03 | 0.06 |
RMSE (mg/l) | COD | Training | 0.73 | 0.06 | 0.31 | 0.55 | 0.06 | 0.24 | 0.28 | 0.06 | 0.17 | 0.27 | 0.05 | 0.14 |
RMSE (mg/l) | COD | Testing | 0.91 | 0.11 | 0.39 | 0.69 | 0.11 | 0.31 | 0.34 | 0.07 | 0.19 | 0.31 | 0.06 | 0.17 |
MAPE (%) | AN | Training | 43.9 | 19.1 | 24.9 | 35.4 | 18.3 | 23.6 | 29.9 | 14.2 | 20.7 | 29.1 | 11.7 | 16.3 |
MAPE (%) | AN | Testing | 46.1 | 19.6 | 31.3 | 40.5 | 18.9 | 29.7 | 38.6 | 18.6 | 28.9 | 35.3 | 17.5 | 24.8 |
MAPE (%) | COD | Training | 11.2 | 4.1 | 6.8 | 9.9 | 4.7 | 6.7 | 8.4 | 3.1 | 5.3 | 6.6 | 2.4 | 4.1 |
MAPE (%) | COD | Testing | 11.4 | 4.7 | 7.4 | 10.1 | 4.7 | 7.2 | 9.2 | 3.6 | 6.1 | 7.2 | 3.1 | 5.2 |
Empirical Mode Decomposition (EMD) is a classic data decomposition method (Liang et al. 2021), and its performance is compared with WTD in Table 4. In general, the application of data decomposition methods significantly improves the accuracy and reliability of the prediction model. The mean value of RMSE for AN and COD concentrations decreased by more than 50% in some scenarios, and a significant reduction in MAPE values helps reduce the uncertainty in predictions. From the results of all stations, WTD performs better in both maximum and mean errors, indicating that WTD is a more robust choice in hybrid water quality prediction models.
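The relative improvements quoted in this section can be checked against the mean RMSE values in Table 4. The short calculation below assumes that the quoted percentages refer to the testing period, which reproduces the 11 and 21% reductions for SCA; the dictionary layout and function name are illustrative only.

```python
def pct_decrease(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

# Mean testing RMSE (mg/l) taken from Table 4.
testing_rmse_mean = {
    "AN": {"LSTM": 0.18, "SCA+LSTM": 0.16, "EMD+SCA+LSTM": 0.08, "WTD+SCA+LSTM": 0.06},
    "COD": {"LSTM": 0.39, "SCA+LSTM": 0.31, "EMD+SCA+LSTM": 0.19, "WTD+SCA+LSTM": 0.17},
}

# Effect of predictor screening (LSTM -> SCA + LSTM).
sca_gain = {
    name: pct_decrease(row["LSTM"], row["SCA+LSTM"])
    for name, row in testing_rmse_mean.items()
}

# Effect of adding wavelet decomposition (SCA + LSTM -> WTD + SCA + LSTM).
wtd_gain = {
    name: pct_decrease(row["SCA+LSTM"], row["WTD+SCA+LSTM"])
    for name, row in testing_rmse_mean.items()
}
```

With these values, predictor screening lowers the mean RMSE by roughly 11% (AN) and 21% (COD), and adding WTD lowers it by a further 62% or so for AN and about 45% for COD, consistent with the statement that the reduction exceeds 50% in some scenarios.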
Long-term water quality prediction
DISCUSSION
Advantages of the developed prediction model
Previous studies mainly focused on the accuracy and reliability of water quality simulation and prediction, and the LSTM models were widely applied due to their powerful performance in time series prediction (Song et al. 2021). While several pre-processing methods have been proposed to enhance the LSTM models, few studies addressed the selection of factors for the preprocessed subsequences, limiting the potential for fully improving model performance. A noteworthy aspect of this study is the application of WTD and SCA methods in the hierarchical optimization of the LSTM-based water quality prediction model. WTD was used to extract multi-dimensional features from the original water quality sequences, reducing uncertainties arising from observation. Other data pre-processing methods (such as Locally Weighted Scatterplot Smoothing (LOWESS), Seasonal and Trend decomposition using Loess (STL), EMD, and Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN)) were also considered for data denoising (Yu et al. 2022; Zhou et al. 2022; Dodig et al. 2024). However, we chose WTD for its ability to highlight local details of sequence data through an appropriate wavelet function. The combination of WTD and LSTM has been validated as robust in predicting non-linear time series in uncertain environments (Chen & Xue 2023). This study also demonstrated the effectiveness of WTD in pre-processing observed water quality data. In addition, traditional prediction models used only water quality data as input variables (Song et al. 2021; Yu et al. 2022). In contrast, this study considered multiple factors, including hydrology, water quality, and spatial connections among stations, with the SCA method employed to select relevant predictors. This hierarchical optimization model significantly improves the performance of LSTM, as shown in Table 4, which highlights the model's advantages in accuracy and reliability. 
In addition, the reconstruction of input features using WTD reduces the simulation time compared to a single LSTM model (Wu & Wang 2022).
In the HR Basin, it is difficult to obtain the input data for water quality prediction. Compared with the few existing studies of water quality prediction in this region (Zhang & Wu 2021; Chen et al. 2024), our predictions have a smaller RMSE. Moreover, this study found that the subsequences obtained through WTD are stationary except for the low-frequency subsequences. These stationary subsequences can ensure the accuracy and improve the stability of LSTM in long-term prediction. The low-frequency subsequences can be calculated by linear fitting because of their simple trends. Accordingly, we achieved long-term water quality predictions for the HR Basin for the next few years. According to the National Water Quality Reports published by the China National Environmental Monitoring Centre (https://www.mee.gov.cn/hjzl/shj/dbsszyb/), the water quality of the HR Basin in 2023 was consistent with our results, demonstrating the reliability of the developed model.
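The linear fitting of the low-frequency subsequence and its recombination with the predicted detail subsequences can be sketched as follows. The sketch assumes equally spaced samples and full-length subsequences that add up to the original series; all names and numbers are hypothetical.

```python
def fit_line(values):
    """Ordinary least squares fit of values against t = 0, 1, ..., n - 1."""
    n = len(values)
    t_mean = (n - 1) / 2.0
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean

def extrapolate_trend(values, horizon):
    """Extend the fitted line `horizon` steps beyond the last observation."""
    slope, intercept = fit_line(values)
    n = len(values)
    return [slope * (n + k) + intercept for k in range(horizon)]

def recombine(trend_forecast, detail_forecasts):
    """Sum the trend forecast and the detail forecasts step by step."""
    return [sum(parts) for parts in zip(trend_forecast, *detail_forecasts)]

low_frequency = [1.0, 1.5, 2.0, 2.5]   # hypothetical c3 subsequence
trend = extrapolate_trend(low_frequency, horizon=2)
details = [[0.1, -0.1], [0.0, 0.2]]    # hypothetical forecasts of detail subsequences
prediction = recombine(trend, details)
```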
Limitations of the developed prediction model
Due to the limitation of the observation frequency of water quality in the HR Basin, only monthly AN and COD data were used for model training and prediction in this study. In the future, higher frequency data are expected to be utilized to evaluate how data resolution affects model accuracy. In addition, factors such as land use and climate change also affect the water quality, but they are not key factors for predicting the water quality at the timescale of this study. When the observation period is longer, these factors could become important, and hence future development and application of the proposed model should take them into account.
Water environment protection and management
Due to the transport of pollutants from the HR to the HL, there is a correlation between the water quality in the HR (S1–S6) and the HL (S7–S14). For example, in April 2014, a water pollution incident in the HR (Figure 11(a)) directly led to a significant increase in AN concentration within the HL (Figure 11(b)). Therefore, controlling the pollution load in the HR is crucial for ensuring the health and sustainability of the water environment in the HL.
As shown in Figure 11, the predicted AN and COD concentrations will remain stable, and COD will sometimes fail to meet water quality targets at S4 and S5. Therefore, the water environment management department can adjust pollution discharge standards in different sub-basins based on the long-term water quality predictions. Stricter control policies should be implemented at stations that are predicted to exceed water quality targets, while easing pollution discharge limits at other stations could be considered to support economic development.
CONCLUSIONS
In this study, a novel river water quality prediction model was developed and applied to simulate the water quality in the HR–HL system. This model used ‘decomposition–inputs–prediction’ hierarchical optimization to preprocess the original datasets, upgrade the predictors, and improve the prediction accuracy. The results lead to the following conclusions:
(1) A hierarchical optimization model based on WTD and SCA was developed for water quality prediction at 14 stations. In comparison with other models, the proposed model improves the accuracy (RMSE: 0.06 and 0.17 mg/l for AN and COD, respectively), reliability (MAPE: 24.8 and 5.2% for AN and COD, respectively), and simulation efficiency in the study area.
(2) AN and COD concentrations will continue to be stable in this region, and the mean values of COD concentration will be more likely to exceed the water quality targets than AN. The water quality of S4 and S5 in the HR will be worse than other stations, and their water quality will have a higher risk of exceeding the targets.
(3) The observation frequency of water quality data affects the accuracy of water quality prediction. Higher frequency data (weekly or daily) should be used to drive the developed model in the future so that the impact of data resolution on its performance can be assessed.
(4) In this article, the new model was applied only to water quality predictions in the HR Basin. It could be applied in other basins for long-term water quality prediction. The hyper-parameter settings of the LSTM are suited to the stations within the study region; when applying this model in other regions, the hyper-parameters need to be adjusted.
FUNDING
This research was funded by the National Key R&D Program of China (2022YFC3202600), the National Natural Science Foundation of China (52479062 and 52309086), the Jiangsu Provincial Science and Technology Basic Research Program Youth Fund Project (BK20241516), and the Water Conservancy Science and Technology Project in Jiangsu Province (2023013, 2022061 and 2024009).
CONFLICT OF INTEREST
The authors declare there is no conflict.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.