ABSTRACT
Water quality prediction is crucial for effective river stream management. Dissolved oxygen, conductivity and chemical oxygen demand are vital chemical parameters for water quality. Advances in machine learning (ML) and deep learning (DL) methods have led to their wide use in this domain. Sophisticated DL techniques, especially long short-term memory (LSTM) networks, are required for accurate, real-time multistep prediction. LSTM networks are effective in predicting water quality due to their ability to handle long-term dependencies in sequential data. We propose a novel hybrid approach for predicting water quality parameters that combines DL with a data smoothing method. The Sava River at the Jamena hydrological station serves as a case study. Our workflow uses LSTM networks alongside the LOcally WEighted Scatterplot Smoothing (LOWESS) technique for data filtering. For comparison, a Support Vector Regressor (SVR) is used as the baseline method. Performance is evaluated using the Root Mean Squared Error (RMSE) and the Coefficient of Determination (R2). Results demonstrate that LSTM outperforms the baseline method, with an R2 of up to 0.9998 and an RMSE of 0.0230 on the test set for dissolved oxygen. Over a 5-day prediction period, our approach achieves an R2 of 0.9912 and an RMSE of 0.1610, confirming it as a reliable method for multistep prediction of water quality parameters.
HIGHLIGHTS
Water quality prediction using hybrid machine learning-based approach.
Time series data forecasting using long short-term memory networks.
Data denoising using the LOcally WEighted Scatterplot Smoothing method.
Case study focused on the Sava River at the Jamena station.
The new hybrid approach outperformed the baseline prediction method.
INTRODUCTION
The quality of river streams is crucial for human health, well-being, and biodiversity (Committee 2022). Communities commonly rely on surface river water as their main source of drinking water, and water quality is a key indicator of aquatic system health. A decline in river water quality leads to a loss of biodiversity and aquatic species, with cascading impacts on ecosystems in general. A progressive declining trend in river water quality parameters has been reported for river basins under continental climate, such as the Sava River basin (Lutz et al. 2016). As expected, the physical–chemical water quality of the Sava River basin decreases from the upstream to the downstream river sections, with point sources of pollution suggested as a major driver of water quality (Pantelić et al. 2022).
Water quality models are commonly used in water management practice to simulate the quality of river streams and the propagation of pollution along river networks. These models are founded on physical laws: they widely use the one-dimensional (1-D) and two-dimensional (2-D) Saint Venant equations (Cardona et al. 2011; Sabokruhie et al. 2021) to describe how water quality characteristics change dynamically over time. They consider both point and non-point sources of pollution, and their propagation within rivers, channels, and hydraulic structures such as dams and bridges (Zhou et al. 2018b). Because water quality models rely on physics equations, they can simulate river pollution under different climate and hydrological scenarios, as well as different hydraulic conditions. Although water quality models have been constantly improved, their implementation has several drawbacks (Zhou et al. 2018b). These drawbacks are interconnected: the time needed to develop a complex river network model, the comprehensive input data required on river morphology, the required boundary conditions, the need for precise simulation time steps, and time-consuming and expensive simulations. Data-driven models utilizing machine learning (ML) techniques could effectively address these challenges, which stem from highly complex non-linear relations among various physical and chemical water parameters (Zhu et al. 2022; Cojbasic et al. 2023). These techniques make it possible to develop timely and effective solutions for managing water pollution, improving water quality, and protecting watershed ecosystems. In recent years, ML and deep learning (DL) methods have appeared that overcome the shortcomings of conventional methods (Faruk 2010; Wang et al. 2021; Wu & Wang 2022). They can capture the non-linearity and non-stationarity of water quality from the data.
At first, standard Feedforward Artificial Neural Networks (ANNs) and their variations were used for water quality prediction (Faruk 2010; Zhang et al. 2015; Kostić et al. 2016; Elkiran et al. 2019; Chen et al. 2020b; Yahya et al. 2021). Elkiran et al. (2019) applied an ANN for single and multi-step ahead modeling of dissolved oxygen (DO) in the Yamuna River, India, using several chemical water parameters and water temperature as inputs. DO in Singapore seawater has also been forecast from salinity and temperature (Palani et al. 2008). In addition to ANNs, a variety of other ML techniques are employed, including Support Vector Regression (SVR) (Liu et al. 2013, 2014; Su et al. 2022), as well as models based on decision trees such as Extreme Gradient Boosting (XGBoost) and Random Forest (RF), as illustrated in hybrid studies (Lu & Ma 2020; Chen et al. 2020a). For short-term prediction of water quality parameters, including water temperature, DO, pH value, specific conductance, and others, Lu & Ma (2020) employed two hybrid decision tree models. They achieved very good results using these methods in combination with the Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) data de-noising method. Xu et al. (2021) also used hybrid models based on the wavelet transform (WT) and the previously mentioned ML methods (SVM, RF, ANN), with the addition of Multiple Linear Regression (MLR), to predict the daily DO in the Dongjiang River Basin, China. In addition to the mentioned methods, recent technological advancements have also promoted the use of computational intelligence models in water quality modeling (Aghel et al. 2019; Mohadesi & Aghel 2020).
For instance, Mohadesi & Aghel (2020) utilized a hybrid model combining adaptive neuro-fuzzy inference system (ANFIS) and genetic algorithm (GA) to enhance prediction accuracy of inorganic water quality indicators, including electrical conductivity, demonstrating superior performance over traditional ANN approaches. Nowadays, Recurrent Neural Networks (RNNs), especially Long Short-Term Memory Recurrent Neural Networks (LSTM RNNs) for time series prediction have gained increasing popularity. They proved to be highly efficient for predicting time series data due to their ability to capture long-term dependencies in sequential data (Liu et al. 2019; Hu et al. 2019; Chen et al. 2020b; Fu et al. 2022; Schutte et al. 2024). Specifically, attention-based RNNs have been used for accurate short-term prediction and long-term prediction of DO (Liu et al. 2019). Hu et al. (2019) used deep LSTM network for pH and water temperature prediction for more than three days ahead and their model reached accuracy above 95%. Wang et al. (2023b) have explored water quality prediction using several ML and DL techniques and their LSTM model demonstrated exceptional abilities in water quality prediction. The effectiveness of LSTM models compared to other methods (ANN, RNN and Gated Recurrent Unit (GRU)) was also shown by Wang et al. (2023a) in their work. Furthermore, recent advancements in runoff time series prediction are exemplified by Xu et al. (2023), who also developed a hybrid model using a CEEMDAN and LSTM (among others) mixture approach.
The mentioned studies predict water quality by considering non-linear relationships among water quality parameters in previous time steps. More precisely, these studies treat the water quality parameters as autoregressive processes, without the influence of local hydrology. However, there is a lack of literature on water quality prediction using hydrological data, including river stream flows and surface river temperatures, which are significant drivers of water quality parameters. For this reason, we propose a general, effective and robust framework for the estimation of water quality parameters using both physical (hydrological variables) and chemical parameters. The proposed method is capable of modeling complex and non-linear relationships among physical (e.g. river flows and water temperatures) and chemical (water quality) river stream characteristics. This work has the following four main contributions: (1) Preprocessing the data in terms of de-noising due to uncertainty in measuring equipment. The LOcally WEighted Scatterplot Smoothing (LOWESS) method is used (Cleveland 1979) to extract the chemical water parameters; (2) Estimating and evaluating the cross-correlation between the physical and chemical parameters of water quality by the use of LSTM networks; (3) Proposing, training and validating distinct LSTM models for the prediction of different chemical parameters over various time steps; (4) Assessing the performance of the LSTM models for the prediction of different chemical parameters on unseen data.
This research sets out to accomplish several objectives aimed at enhancing the prediction and management of water quality in river basins, particularly those identified as pollution hotspots like the Sava River basin. The goals include the following: (a) Development of a novel prediction framework: the core of this research is to develop an innovative framework for predicting water quality parameters. This framework distinguishes itself by employing denoising techniques, such as LOWESS, to mitigate the uncertainty inherent in water quality data. Additionally, it leverages LSTM networks for forecasting denoised water quality parameters over various future time steps. This approach is designed to handle the complex and non-linear relationships between physical (e.g. river flow and water temperatures) and chemical parameters of water quality, offering a more accurate prediction model. (b) Benchmarking against traditional models: a significant part of this research involves comparing the performance of the newly developed framework against traditional models, such as SVR. This comparison aims to validate the effectiveness and efficiency of the LSTM-based framework in predicting water quality parameters more accurately and reliably. (c) Implementation of the framework in a pollution hotspot: the framework is applied within the Sava River basin, a critical area suffering from pollution issues. By implementing this advanced predictive model, the research aims to provide actionable insights for environmental and water management planning, ultimately contributing to the improvement of water quality and the protection of the river’s aquatic ecosystem. The innovative aspect of this research lies in its integration of both physical and chemical parameters to forecast water quality, moving beyond the limitations of traditional models that primarily rely on the serial correlation of water quality parameters alone (Lu & Ma 2020; Xu et al. 2021, 2023).
By combining data preprocessing techniques with advanced ML models, this framework represents a pioneering tool in the field of environmental science and water management. Its success in the Sava River basin could serve as a model for addressing water quality issues in other river basins globally, marking a significant step forward in the efforts to protect and improve freshwater resources.
CASE STUDY AND DATA COLLECTION: SAVA RIVER
The Sava River lies within southeast Europe as a part of the Danube River basin. It drains an area of 97,800 km² and has a length of 926 km, making it the second largest tributary of the Danube River (Schwarz 2016). It flows through the countries from Slovenia, Croatia, Bosnia and Herzegovina and Montenegro to Serbia. The highest elevation of the Sava River basin is 2,864 m a.s.l. at the mountain peak of Triglav (Slovenia), while the lowest point, at the confluence with the Danube River, is 71 m a.s.l. (Serbia). The climate conditions in the Sava River basin differ between the upper and the lower parts of the basin. The annual precipitation sum and air temperature are 1,140 mm and 10.4 °C, respectively, at the highest latitude, while the lowest part of the river basin is characterized by significantly reduced annual precipitation (657 mm) and slightly increased air temperature (11.6 °C) (Schwarz 2016). The spatial distribution of river stream flows largely follows the pattern of the recorded precipitation. Flow varies from 10 m³/s up to 1,700 m³/s for the hydrological stations located in Slovenia and Serbia, respectively (ISRBC 2016). The most significant water yield comes from the right tributaries of the Sava River (e.g. Una, Vrbas, Bosna, Drina). The left tributaries receive annual precipitation sums in the range of 700–1,000 mm (ISRBC 2016), but their evapotranspiration rates are significant and reduce the runoff from tributaries such as Krapina, Lonja, Ornjava, and Bosut.
Parameter [Unit] | Type | Interval | Average |
---|---|---|---|
Flow [m³/s] | Physical | [158, 4490] | 1131.5 |
Temperature [°C] | Physical | [0.1, 29.7] | 13.92 |
Oxygen [mg/l] | Chemical | [6.04, 14.1] | 9.48 |
Conductivity [] | Chemical | [226, 780] | 451.8 |
Chemical Oxygen Demand [mg/l] | Chemical | [1.2, 11.5] | 3.05 |
METHODOLOGY
A general approach
Prediction of the water quality parameters requires highly accurate and comprehensive input data. We collected data from the Environmental Protection Agency (SEPA 2023), as it officially measures the water quality parameters. Table 1 presents the data we obtained and used, alongside summary statistics. The aim of this work was to predict chemical parameters of the water for several days ahead based on the measurement of physical parameters (flow and water temperature) of the previous time steps as well as the recorded chemical parameter. This choice is based on the observation that correlations between physical and chemical parameters are higher than those among the chemical parameters themselves. Additionally, by relying on more commonly measured parameters, we ensure greater adaptability, applicability, scalability and robustness of the models even when limited chemical data is available. To achieve this, we perform the steps illustrated in Figure 4. First, data for the representative hydrological station is collected and processed by filling in missing values using non-linear regression (Stojković et al. 2014). A non-linear regression model is formulated using standardized chemical parameters as the dependent variable and standardized physical parameters as the independent variables, with the objective of addressing missing data in water quality datasets. A z-score normalization of the data is performed for dependent and independent variables. Through this method, 11% of the total datasets are completed, facilitating the training and testing of the proposed concept for forecasting water quality data. Second, the processed data is smoothed using the LOWESS method to filter out the noise related to measurement uncertainties. Next, data normalization is performed to scale records to values between 0.0 and 1.0, and subsequently the dataset is divided into training, validation, and test sets.
Separate LSTM models are developed for each chemical parameter to predict their values for several time steps ahead. Then, the developed models are tested on unused data to prove the model efficiency. The last step includes the application of the baseline method for comparison purposes.
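The scaling and splitting steps above can be sketched in a few lines. This is a minimal illustration only: the function names, the 70/15/15 split fractions, and the sample values are assumptions, since the text does not state the split proportions. The sketch follows the order given above (scale, then split); fitting the scaler on the training portion only is a common alternative that avoids leaking information from later data.

```python
def min_max_scale(values):
    """Scale a series to [0.0, 1.0] using its own minimum and maximum."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant series: no spread to scale, map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def chronological_split(values, train_frac=0.7, val_frac=0.15):
    """Split a time-ordered series into train/validation/test without shuffling.

    The fractions are hypothetical defaults, not values from the study.
    """
    n = len(values)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = values[:n_train]
    val = values[n_train:n_train + n_val]
    test = values[n_train + n_val:]
    return train, val, test


# Illustrative dissolved oxygen readings in mg/l (made-up values).
do_series = [6.04, 7.5, 8.2, 9.48, 10.07, 11.3, 12.6, 14.1, 13.0, 9.9]
scaled = min_max_scale(do_series)
train, val, test = chronological_split(scaled)
```

Keeping the split chronological matters for time series: shuffling would let the model see observations that follow the ones it is asked to predict.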
The following sections explain in detail the previously mentioned steps.
LOcally WEighted Scatterplot Smoothing – LOWESS
Long short-term memory - LSTM model
LSTM networks are ANNs (Hochreiter & Schmidhuber 1997) widely applied for modeling water quality parameter data, due to their capability to reproduce the complex and non-linear structure of water quality and water quantity data (Zhou et al. 2018a; Hu et al. 2019; Dilmi & Ladjal 2021; Zhi et al. 2021; Wang et al. 2023a). Unlike standard feedforward neural networks, LSTM is a type of Recurrent Neural Network (RNN), which means that it has feedback connections. Such neural networks can process not only single data points but also entire sequences of data. LSTM networks are also designed to avoid the exploding/vanishing gradient problem, which is a common problem for RNNs. The main concept behind LSTM is that it introduces a more complex cell structure compared to simple RNNs. It consists of three main gates - the input gate, the forget gate, and the output gate - along with a cell state that helps in preserving long-term information and a hidden state that preserves short-term memory. In that way the exploding/vanishing gradient problem is alleviated. These characteristics make LSTM networks well suited to capturing long-term dependencies in sequential data and to accurately predicting data over extended periods of time, which is a critical aspect of water quality data.
In this work, we used three different LSTM networks implemented using Keras (Chollet 2015) to forecast different chemical water quality parameters. To find optimal hyperparameters for the LSTM networks (number of LSTM cells per layer, number of layers, activation functions, type of optimizer and learning rate), the GridSearchCV function from the Scikit-Learn library (Pedregosa et al. 2011) in Python was used with 5-fold cross validation (CV). All networks have two hidden LSTM layers with 64 and 32 LSTM units respectively with activation function. For the input layer we used 10 previous steps, and the output layer has a size of 5, which means that we predict 5 steps ahead. To optimize the loss, we used the ADAM optimizer (Kingma & Ba 2017) with a batch size of 16 to minimize the mean squared error (MSE) between the predictions and the ground truth, with a learning rate of 0.001 for dissolved oxygen and chemical oxygen demand prediction and a learning rate of 0.007 for conductivity prediction. We monitored the losses on both the training and validation sets to prevent overfitting. We set the maximum number of epochs for training the model to 100, but the training ended earlier when the loss function value for the validation set no longer decreased for more than 8 epochs.
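For reference, the gate structure described above is commonly written as follows. This is the standard textbook formulation of an LSTM cell, not necessarily the exact variant used in this study; the symbols ($x_t$ for the input at step $t$, $h_t$ for the hidden state, $c_t$ for the cell state, $W$, $U$, $b$ for learned weights and biases, $\sigma$ for the logistic sigmoid, $\odot$ for the element-wise product) are notation introduced here for illustration.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{aligned}
```

The additive form of the cell state update is what lets gradients flow across many time steps: when $f_t$ is close to 1, information in $c_{t-1}$ is carried forward largely unchanged.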
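The input and output shapes described above (10 previous steps in, 5 steps ahead out) imply a sliding-window preparation of the time series. The sketch below shows one way this could be done; the function name `make_windows`, the column layout (flow, water temperature, chemical parameter), and the synthetic rows are assumptions for illustration, not the study's code.

```python
def make_windows(rows, target_col, n_in=10, n_out=5):
    """Build supervised samples from a time-ordered table.

    rows: sequence of per-day feature tuples, oldest first.
    target_col: index of the chemical parameter to forecast.
    Returns X (n_in past rows per sample) and Y (n_out future target values).
    """
    X, Y = [], []
    last_start = len(rows) - n_in - n_out
    for start in range(last_start + 1):
        past = rows[start:start + n_in]
        future = rows[start + n_in:start + n_in + n_out]
        X.append([list(r) for r in past])
        Y.append([r[target_col] for r in future])
    return X, Y


# Synthetic daily rows: (flow, water temperature, chemical parameter).
rows = [(float(t), 10.0 + t, 0.1 * t) for t in range(20)]
X, Y = make_windows(rows, target_col=2)
```

Each element of `X` has shape (10, number of features) and each element of `Y` has length 5, which matches a network whose output layer has size 5. A series shorter than 15 rows yields no samples.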
Baseline method - support vector machine regression (SVR) model
We chose the support vector machine (SVM) for comparison purposes to validate the reliability of LSTM for water quality data prediction. It is a popular supervised ML algorithm based on Vapnik's statistical learning theory (Vapnik 1999). This technique is mostly used for classification, but also for prediction. Support Vector Regression (SVR) uses the same principles as the SVM for classification, with one main difference: SVR seeks a hyperplane that best fits the data points in a continuous rather than a discrete space. SVR is formulated as an optimization problem with the aim of finding a function that approximates the relationship between the input variables x and the continuous target variable y, while minimizing the prediction error. This is achieved by mapping the input variables to a high-dimensional feature space using a kernel function, which can also handle non-linear data, and finding the hyperplane that maximizes the margin (distance) between the hyperplane and the closest data points, while also minimizing the prediction error. This robust handling of non-linearities and complexities in the data makes SVR a reliable and effective choice as a baseline or main method in various studies focused on water quality prediction (Liu et al. 2013; Xu et al. 2021; Su et al. 2022; Ortiz-Lopez et al. 2023). These features, together with its popularity and established efficiency in supervised ML tasks, make SVR a suitable reference for validating the reliability of LSTM for predicting water quality. The SVR optimization problem can be expressed as follows: Suppose we have a set of training data where $x_n$ is a multivariate set of N observations with observed response values $y_n$.
For regression problems where a linear model is insufficient, the Lagrange dual formulation enables the extension of the previously mentioned technique to incorporate non-linear functions. To obtain a non-linear SVR model, the dot product is substituted with a non-linear kernel function denoted as $K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$. Here, $\varphi$ represents a transformation that maps the input to a higher-dimensional space. Common kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid. There is also a parameter $\gamma$ that determines the kernel’s influence. It defines the reach of the kernel function. A smaller value results in a wider reach and a smoother decision boundary, while a larger value focuses on the nearby points, leading to a more complex and irregular decision boundary. By utilizing this non-linear kernel function, the SVR model can effectively capture complex relationships and patterns in the data that cannot be adequately described by a linear model.
We used different SVR models for each prediction day and for each chemical water quality parameter. SVR was implemented using the Scikit-Learn library (Pedregosa et al. 2011) in Python. To find optimal hyperparameters (kernel function, C, and ) for the SVR model, we used the GridSearchCV function, also from the Scikit-Learn library, with 5-fold cross validation (CV). For each water quality parameter, the SVR models have the same hyperparameters across the prediction period. For dissolved oxygen prediction, the hyperparameters are (kernel = RBF, C = 100, , ). For conductivity prediction, the hyperparameters are (kernel = RBF, C = 500, , ), and for chemical oxygen demand prediction, they are (kernel = RBF, C = 2,000, , ).
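For orientation, the widely used $\varepsilon$-insensitive formulation of linear SVR is sketched below. It is the standard textbook form and is offered as an assumption about what the formulation looks like, not as a reproduction of the authors' equations. The weight vector $w$, bias $b$, tolerance $\varepsilon$, slack variables $\xi_n$ and $\xi_n^{*}$, and penalty constant $C$ are symbols introduced here.

```latex
\begin{aligned}
\min_{w,\, b,\, \xi,\, \xi^{*}} \quad & \frac{1}{2}\lVert w \rVert^{2} + C \sum_{n=1}^{N} \left(\xi_n + \xi_n^{*}\right) \\
\text{subject to} \quad & y_n - \left(w^{\top} x_n + b\right) \le \varepsilon + \xi_n, \\
& \left(w^{\top} x_n + b\right) - y_n \le \varepsilon + \xi_n^{*}, \\
& \xi_n \ge 0, \quad \xi_n^{*} \ge 0, \qquad n = 1, \dots, N.
\end{aligned}
```

Residuals smaller than $\varepsilon$ incur no penalty, while $C$ controls the trade-off between a flat function (small $\lVert w \rVert$) and tolerance of larger deviations.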
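To make the notion of kernel reach concrete, the sketch below evaluates an RBF kernel in its common form, exp(-gamma * squared distance). The parameter name `gamma` follows the usual convention (for example in Scikit-Learn) and is an assumption here; the function name and sample points are likewise illustrative.

```python
import math


def rbf_kernel(a, b, gamma):
    """RBF kernel between two points: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


# Two points at squared distance 2.0 under a small and a large gamma.
wide = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.1)
narrow = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=10.0)
```

With the small value the two points still count as similar (`wide` is well above zero), whereas with the large value their similarity is practically zero, so only very close neighbours influence the fitted function.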
RESULTS
This section presents the results of the individual components of the proposed workflow for predicting water quality parameters. Initially, the data is smoothed using the LOWESS method, and the smoothed data is analyzed. Subsequently, the performance of the baseline method and the LSTM models is demonstrated. Performance is evaluated using the Root Mean Squared Error (RMSE) and the coefficient of determination (R2) as metrics. The results for different prognostic intervals are provided for both the training and test sets. All of the analysis and modeling in this study was performed on a workstation equipped with an AMD Ryzen 7 PRO 4750U processor with Radeon Graphics, clocked at 1.70 GHz, and 16.0 GB of RAM.
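The two evaluation metrics can be computed directly from their usual definitions, as in the short sketch below. The function names and the sample values are illustrative and are not taken from the study's data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up measured and predicted dissolved oxygen values (mg/l).
measured = [9.1, 9.4, 9.8, 10.2]
predicted = [9.0, 9.5, 9.7, 10.3]
error = rmse(measured, predicted)
fit = r2_score(measured, predicted)
```

RMSE is expressed in the units of the predicted parameter, so values are not comparable across parameters with different scales, whereas R2 is dimensionless. Note that `r2_score` is undefined for a constant measured series because the total sum of squares is zero.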
LOcally WEighted Scatterplot Smoothing – LOWESS
Even though the data is verified it may be corrupted by noise in the recorded time series. As highlighted in Section 2, the original data contains outliers originating from uncertainties associated with the measuring equipment. To mitigate the impact of measurement uncertainties and enhance prediction accuracy, the data was subjected to a smoothing process using LOWESS. Figures 2 and 3 represent smoothed physical and chemical parameters respectively. Table 2 demonstrates the percentage difference in variance between the raw and smoothed data. Notably, by employing LOWESS, the variance of the time series data is reduced and the quality of input data for the LSTM network is enhanced, thereby improving its performance.
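The smoothing idea can be illustrated with a compact LOWESS sketch: for each point, a straight line is fitted to its nearest neighbours with tricube distance weights, and the fitted value replaces the observation. This is a simplified version without the robustness iterations of Cleveland's full procedure; the function name, the `frac` value, and the synthetic series are assumptions, and the smoothing span used in the study is not stated here. In practice a library routine such as the `lowess` function in statsmodels would normally be used.

```python
import math


def lowess(x, y, frac=0.3):
    """Locally weighted linear smoothing with tricube weights (no robustness pass)."""
    n = len(x)
    # Number of neighbours used in each local fit.
    k = min(n, max(3, math.ceil(frac * n)))
    smoothed = []
    for i in range(n):
        dists = [abs(xj - x[i]) for xj in x]
        h = sorted(dists)[k - 1]  # distance to the k-th nearest point
        if h == 0.0:
            weights = [1.0 if d == 0.0 else 0.0 for d in dists]
        else:
            weights = [(1.0 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dists]
        # Weighted least squares for a local straight line.
        sw = sum(weights)
        swx = sum(w * xj for w, xj in zip(weights, x))
        swy = sum(w * yj for w, yj in zip(weights, y))
        swxx = sum(w * xj * xj for w, xj in zip(weights, x))
        swxy = sum(w * xj * yj for w, xj, yj in zip(weights, x, y))
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            smoothed.append(swy / sw)  # degenerate case: weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            smoothed.append(intercept + slope * x[i])
    return smoothed


# Synthetic daily series: a slow trend plus alternating measurement noise.
days = [float(d) for d in range(30)]
raw = [9.0 + 0.05 * d + (0.4 if d % 2 else -0.4) for d in range(30)]
smooth = lowess(days, raw, frac=0.3)
```

A larger `frac` uses more neighbours per fit and produces a smoother curve, at the cost of flattening genuine short-lived changes.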
Parameter | Raw variance | Smoothed variance | Difference in % |
---|---|---|---|
Flow | 660,580 | 615,667 | 6.80 |
Water temperature | 55.39 | 54.69 | 1.26 |
Dissolved oxygen | 2.91 | 2.76 | 5.29 |
Conductivity | 5,275 | 4,119 | 21.91 |
Chemical oxygen demand | 0.602 | 0.300 | 50.09 |
Table 2 shows the difference in % between the variance of the raw and the smoothed data. Using LOWESS reduced the variance of the time series, which was the goal in order to improve the data as input for the LSTM network. The variance reductions range from about 1% to 50%. Physical water parameters show smaller variance reductions because their measurements are more reliable and the measuring equipment introduces little noise into the recorded values. Chemical parameters (especially COD), however, carry higher levels of uncertainty in the recorded data (Figures 2 and 3). Usually, in a stable aquatic environment, there should not be large fluctuations in COD levels without corresponding changes in DO levels, so these large fluctuations are treated as outliers in our data. Outliers in COD data could result from various factors, such as measurement errors or unexpected events. These outliers do not represent the typical water quality pattern, especially in the case of large river streams such as the Sava River. For this reason, smoothing techniques like LOWESS are used to filter them out and to better reveal the overall trends. Application of the LOWESS method reduced the variance by up to 50% for COD data and up to 22% for conductivity (COND) data, diminishing outliers and noise (Table 2).
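The percentage differences in Table 2 are consistent with a relative variance reduction, (raw - smoothed) / raw. The short check below applies that formula, which is an assumption about how the column was computed, to two rows of the table; the function name is illustrative.

```python
def variance_reduction_pct(raw_var, smoothed_var):
    """Relative reduction in variance after smoothing, in percent."""
    return 100.0 * (raw_var - smoothed_var) / raw_var


# Variances taken from Table 2.
flow_pct = variance_reduction_pct(660580, 615667)
cond_pct = variance_reduction_pct(5275, 4119)
```

Both results land within rounding distance of the tabulated 6.80% for flow and 21.91% for conductivity. Small discrepancies for other rows are expected because the tabulated variances are themselves rounded.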
Water quality parameters prediction with SVR
The following sections discuss the performance of the SVR models in predicting water quality parameters. The next three subsections present separate results for the three chemical water quality parameters (Dissolved Oxygen, Conductivity and Chemical Oxygen Demand).
Dissolved oxygen prediction
Conductivity prediction
Results for the trained SVR model on the training and test sets in terms of RMSE and R2 score for COND are presented in Figure 7. Figure 8 presents results on the test set along with regression diagrams. First day prediction scores are the best with , on training set and , on test set. Performance for the following days slightly declines. Results for the fifth day forecast are , on training set and , on test set. The fourth row of Figure 9 shows the performance of the SVR model on the time series test data. Measured and predicted data are shown from the first to the fifth day forecasts.
Chemical oxygen demand prediction
The sixth row of Figure 8 shows the performance and regression diagrams of the SVR model on the test set for forecasting COD five days ahead. Results for the first day prediction are , for the training set and , for the test set and for the fifth day prediction results are , and , for the training and test set, respectively. The results for each prediction time period on both training and test set can be seen in Figure 7. The sixth row of Figure 9 illustrates SVR model performance on the time series test data from the first to the fifth day prediction.
Water quality parameters prediction with LSTM
In this section, we present and discuss the results of training LSTM networks to predict the different water quality parameters. The following three subsections present the individual results of training and evaluating the LSTM model on the three water quality parameters mentioned previously.
Dissolved oxygen prediction
Conductivity prediction
Training performance for COND prediction is shown in Figure 10(b). For this parameter, training was shorter because we used a larger learning rate, and it stopped after 22 epochs. Results for training and test are summarized in Figure 7. Predictions for the first day are excellent with , for training set and , for test set. As for dissolved oxygen, predictions further ahead are slightly worse. For the fifth day prediction , for training and , for test. Regression diagrams for COND prediction are shown in Figure 8 for the test set. Figure 9 shows the performance of the LSTM model on the unseen data (test set). Compared to the SVR, LSTM performs better in terms of both RMSE and R2 on the test set. Consequently, the model can be effectively utilized for conductivity prediction with a high level of precision.
Chemical oxygen demand prediction
LSTM network training for COD prediction is presented in Figure 10(b). As was the case for the previous parameters, the training was completed early, here after 40 epochs. Results are summarized in Figure 7. Predictions for the first time step ahead have , for training set and , for test set. Further predictions for the fifth day prediction have , for training and , for test. Regression diagrams for COD prediction are given in Figures 8 and 9, showing the prediction performance of the LSTM model on the test set. As for the previous chemical parameters, the performance of the model is exceptionally strong and superior to that of SVR when evaluated on the test set. Accordingly, the model could also be used efficiently and successfully for COD prediction.
Prediction on the raw dataset
DISCUSSION
Developing reliable water management strategies is crucial to maintain water quality and river health. Surface river water serves as the primary fresh water source for communities, and river water quality is an important metric for assessing the health of aquatic systems. The findings of this study provide significant insights into the complex relationship between hydrological data (river stream flows and surface river temperatures) and chemical water quality parameters. In contrast to the majority of studies, which consider water quality parameters as autoregressive processes (Wang et al. 2017; Elkiran et al. 2019; Lu & Ma 2020; Bi et al. 2021; Chen et al. 2022), we added the influence of physical parameters for better prediction of water quality parameters. Our results confirm the assumption that the chemical parameters are correlated with the physical parameters, and the obtained results are very satisfactory. The close alignment between the measured and predicted values, especially for short-term prediction (Figure 8), lends strong support to this conclusion.
Our approach proved to be better even in relation to some more complex model architectures. Chen et al. (2022) used LSTM networks with an attention mechanism to predict DO at the Burnett River, Australia. Their multi-step prediction results proved to be better than prediction using only LSTM networks; however, we managed to achieve better results in terms of both RMSE and R2 score. Moreover, Bi et al. (2021) carried out the prediction of DO and COD using several methods (ARIMA, SVR, LSTM, etc.), but treated the water parameters only as autoregressive processes. Their best results were also obtained using LSTM networks for all prediction steps (1-5), but we obtained significantly better prediction results in terms of RMSE, which they also used as an evaluation metric. There are other works that demonstrate the potential of LSTM architectures for predicting water parameters (Wang et al. 2017; Hu et al. 2019), but there is a lack of literature on water quality prediction using both physical and chemical water parameters. Our findings demonstrate a clear relationship between these interconnected parameters and, by achieving better results than the works mentioned, emphasize the importance of taking them into account.
One notable aspect of our work is the consideration of the uncertainties in the original data associated with the measuring equipment. In order to decrease the influence of measurement uncertainties and improve the precision of predictions, we applied a smoothing procedure to the data using the LOWESS method. Other authors have also used several methods for data denoising. Bi et al. (2021) used the Savitzky-Golay filter to eliminate potential noise in the water time series data, and Lu & Ma (2020) and Xu et al. (2023) also achieved very good results using CEEMDAN as a data denoising technique. We chose the LOWESS methodology, which is widely applied for analyzing time series data in the field of hydrology. This method, recognized for its simplicity and non-parametric nature, has been validated as robust in detecting and mitigating noise, as demonstrated by Stojković et al. (2017). This paper also demonstrated the effectiveness of this technique when applied to the raw dataset, leading to a marked improvement in the performance metrics of both LSTM and SVR models. This enhancement is particularly notable in the context of longer-term predictions, where the smoothed data allows the LSTM to leverage its strength in capturing long-term temporal relationships. The difference in performance between the raw and smoothed datasets highlights the impact of data quality on predictive modeling. In particular, the smoothing process mitigates the effect of noise and outliers, which can disproportionately affect model training and validation, leading to more reliable forecasts. Therefore, the preprocessing step of smoothing not only cleans the data but also prepares the models to interpret the underlying patterns with greater precision, leading to improved predictive capabilities.
However, it is important to acknowledge the limitations of our study. One limitation is the water quality dataset itself, which is based on discrete measurements performed once per day rather than on continuous recordings of water quality parameters over short time intervals. As a result, the measured values exhibit a greater degree of measurement uncertainty, and robust methods such as LOWESS become essential for reducing it. Next, while our approach incorporates both physical and chemical water parameters, some localized phenomena, biodiversity and weather data, which also affect water quality, are not considered. This was beyond the scope of our research and remains under consideration for future work. Additionally, we acknowledge the importance of validating our models with independent data from different rivers. While our study was limited to the Sava River at the Jamena hydrological station due to data availability constraints, we believe that the developed models can be generalized and applied to other rivers, particularly considering recent advances in transfer learning (Weiss et al. 2016; Peng et al. 2022). Given these advances, our study’s broader implications lie in its contribution to the evolving field of water quality modeling, where ML and transfer learning can play a key role.
CONCLUSION
This study aimed to devise an ML methodology capable of accurately forecasting diverse chemical parameters related to water quality, encompassing dissolved oxygen, conductivity, and chemical oxygen demand. The proposed approach utilizes both physical water parameters (flow and water temperature) and chemical parameters from previous time steps as inputs to enable precise predictions. The objective was to provide a method that can be used by water resource managers, environmental scientists, and policymakers to monitor and manage river stream and water quality effectively as it is important for public health, environmental protection, water supply and agriculture.
Precisely forecasting chemical water parameters poses a significant challenge due to their intricate and fluctuating nature. Moreover, the effectiveness and precision of LSTM networks heavily rely on the quantity and quality of available data. Considering that data quality is influenced by the quality of measurement instruments, this study introduces a hybrid prediction approach for water quality parameters. The proposed method combines an LSTM network with a LOWESS filter, thereby addressing uncertainties arising from the measuring equipment and enhancing prediction accuracy.
First, data analysis, preparation and smoothing were performed. These steps are very important because they determine the subsequent success of the prediction method and reduce the uncertainties related to the measuring equipment.
Next, the LSTM neural network was implemented to predict water quality parameters from one to five days ahead. This methodology was applied to the Sava River at the Jamena hydrological station. The data cover the period from 2013 to 2023 and consist of physical (flow and water temperature) and chemical (dissolved oxygen, conductivity and chemical oxygen demand) parameters.
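As an illustration of how a daily multivariate series can be framed for one-to-five-day-ahead prediction, the following sketch builds input windows from previous time steps and the corresponding multistep targets. The function name `make_windows`, the `lookback` of 7 days and the column order are assumptions made for this example and are not taken from the study.

```python
def make_windows(rows, target_idx, lookback=7, horizon=5):
    """Turn a daily multivariate series into supervised samples.

    rows: one sequence of feature values per day, oldest first
          (e.g. flow, temperature, DO, conductivity, COD).
    Returns X as nested lists of shape (samples, lookback, features)
    and Y of shape (samples, horizon): the target on each of the
    next `horizon` days following the input window.
    """
    X, Y = [], []
    for t in range(lookback, len(rows) - horizon + 1):
        X.append([list(r) for r in rows[t - lookback:t]])
        Y.append([rows[t + h][target_idx] for h in range(horizon)])
    return X, Y


# Illustrative data only: 30 days, 5 features per day.
days = [[100 + d, 15.0, 9.0 + 0.1 * d, 400.0, 2.5] for d in range(30)]
X, Y = make_windows(days, target_idx=2)  # index 2 plays the role of DO
print(len(X), len(X[0]), len(X[0][0]), len(Y[0]))
```

Arrays of this shape are what a sequence model such as an LSTM typically expects; the windows would normally be built after smoothing and scaling, with the train and test split made in time order to avoid leakage.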
The results indicate that predicting water quality parameters using LSTM networks is a promising and effective approach. LSTM networks can effectively model the complex temporal relationships in water quality data and capture long-term dependencies. As expected, the best results were obtained for the first-day prediction for all parameters and they are: for dissolved oxygen , , for conductivity , and for chemical oxygen demand , . Results for the fifth day prediction are: for dissolved oxygen , , for conductivity , and for chemical oxygen demand , .
According to these results, the following conclusions can be drawn:
The LSTM model shows very good performance for the prediction of the first time step.
Performance decreases as the forecast step increases up to 5 days, and extending the prediction horizon beyond this point would likely result in a more substantial decrease in the reliability and accuracy of the forecasts.
Based on the efficiency measures shown, the developed methodology can be used for early prediction of water quality parameters.
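The two efficiency measures used throughout, RMSE and R2, can be evaluated separately for each forecast step to make the degradation with horizon visible. The sketch below uses the standard definitions; the helper name `per_step_metrics` and the sample numbers are hypothetical and do not reproduce the results of this study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(
        sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)
    )


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes the observed values are not all identical (SS_tot > 0).
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


def per_step_metrics(Y_true, Y_pred):
    """Return one (RMSE, R2) pair per forecast step.

    Y_true and Y_pred hold one row per sample and one column per step.
    """
    horizon = len(Y_true[0])
    results = []
    for h in range(horizon):
        t = [row[h] for row in Y_true]
        p = [row[h] for row in Y_pred]
        results.append((rmse(t, p), r2_score(t, p)))
    return results


# Made-up two-step example: step 1 is perfect, step 2 is biased by 0.5.
obs = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
pred = [[1.0, 1.5], [2.0, 2.5], [3.0, 3.5], [4.0, 4.5]]
for step, (e, r2) in enumerate(per_step_metrics(obs, pred), start=1):
    print(f"step {step}: RMSE={e:.4f}, R2={r2:.4f}")
```

Reporting the pair per step rather than as a single average keeps the first-day and fifth-day performance distinguishable.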
In our future research, we plan to enhance our data collection methodology, including the incorporation of new sensors capable of more frequent measurements. This enhancement aims to further reduce noise in the data, thereby improving the accuracy and reliability of our models. In addition, we aim to augment the datasets with factors that influence water quality, such as climate data, thereby enabling the forecasting of additional parameters. Furthermore, we intend to explore alternative methodologies, including Transformers (Wolf et al. 2020) and Graph Neural Networks (GNNs) (Sun et al. 2022). GNNs are advantageous for graph-structured data, which are prevalent in water distribution networks and river networks. GNNs can also incorporate spatial information by considering the spatial relationships between different monitoring stations or sampling locations. Lastly, we plan to replicate the devised methodology on other river streams within Serbia and the surrounding region, thereby assessing its generalizability and applicability in varying contexts.
ACKNOWLEDGEMENTS
This research is supported by the Science Fund of the Republic of Serbia, Grant No. 6707, REmote WAter Quality monitoRing and INtelliGence, REWARDING.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.