ABSTRACT
The water quality index model is a popular tool for evaluating drinking water quality. To overcome the low precision and significant errors of traditional single prediction models, a novel autoregressive integrated moving average (ARIMA)-sparrow search algorithm (SSA)-long short-term memory (LSTM) combination model is proposed to accurately predict residual chlorine, turbidity, and pH in drinking water. First, the ARIMA model is used to extract the linear part of the water quality data and output the nonlinear residual. Then, the LSTM model is used to predict the residual, and the SSA is used to find the optimal hyperparameters of the LSTM model, which plays an essential role in reducing the model error. To demonstrate the superiority of the proposed model, the ARIMA-SSA-LSTM model is compared with SSA-LSTM, whale optimization algorithm (WOA)-LSTM, particle swarm optimization (PSO)-LSTM, ARIMA-LSTM, ARIMA, and LSTM. The results show that the coefficients of determination (R2) of the combination model for residual chlorine, turbidity, and pH are 0.950, 0.990, and 0.998, respectively, which are greater than those of all comparison models. Therefore, the model is more suitable for the prediction and analysis of water quality data.
HIGHLIGHTS
This article presents a new combined forecasting method to predict the trend of water quality in water distribution networks.
The autoregressive integrated moving average (ARIMA)-sparrow search algorithm (SSA)-long short-term memory (LSTM) combined model overcomes the limitations of the traditional single model.
The SSA finds more suitable hyperparameters for the LSTM model, thereby improving its prediction accuracy.
INTRODUCTION
Urban water distribution networks are responsible for providing people with high-quality and pollution-free drinking water. Drinking water is prone to ‘secondary pollution’ when flowing through a vast and complex water distribution system, which leads to declining water quality and affects people's lives and health. Residual chlorine, turbidity, pH, and other indicators in water quality monitoring are essential for evaluating drinking water quality. Fast and accurate prediction of these indicators can enable water supply enterprises to quickly find the trend of water quality deterioration and take relevant measures. Therefore, how to scientifically predict the change in water quality data has become increasingly important.
The main methods for water quality prediction are the time series model and the deep learning model. Time series analysis is based on studying a given variable's historical characteristics. Then, a model is built according to its regularity to predict the state or value of the variable in the next period. Due to its flexibility, simplicity, and feasibility, the autoregressive integrated moving average (ARIMA) model has become the most important and widely used time series model (Chen 2007). Wang et al. (2019) built a general water quality prediction model by combining the Holt-Winters seasonal model with the time series ARIMA model, taking total phosphorus and total nitrogen, and the eutrophication indicators as parameters. Abdul Wahid & Arunbabu (2022) successfully predicted water quality trends in the Krishnagiri Reservoir in India using a seasonal ARIMA model by integrating in situ measurement and remote sensing techniques. However, the ARIMA model cannot analyze and deal with nonlinear time series.
To address this issue, a large number of nonlinear deep learning methods have been applied to the analysis and prediction of time series data. The most commonly used deep learning model for this purpose is the recurrent neural network (RNN) (Ong et al. 2014; Li et al. 2019). The long short-term memory (LSTM) network is an enhanced RNN that alleviates the vanishing and exploding gradient problems common in traditional RNN models. It has strong information capture and storage capabilities and clear advantages in processing time series data such as water quality data (Pascanu et al. 2013). Zhou et al. (2018) established a water quality prediction model based on LSTM using water quality features selected by the improved grey association analysis algorithm. They demonstrated the method's effectiveness on two water quality datasets: Taihu Lake and Victoria Bay. Hu et al. (2019) used the LSTM model to predict water quality data, such as pH and water temperature in seawater cages, and obtained relatively accurate results. This study provides a reliable tool for flood forecasting and a valuable reference for water transfer in the Three Gorges reservoir area. However, in the aforementioned studies, the hyperparameters of the LSTM model are usually chosen based on the user's experience, and this uncertainty limits the model's applicability (Xu et al. 2023). Therefore, it is necessary to find the optimal hyperparameters of the LSTM.
Currently, swarm intelligence is widely used to adjust the parameters of neural networks. Bonabeau et al. (1999) define it as ‘the emergent collective intelligence of groups of simple agents.’ The method is inspired by the natural foraging behavior of social organisms such as ants, birds, and fish. It uses information exchange and cooperation within the group to achieve optimization through simple and limited interaction between individuals. Particle swarm optimization (PSO) is the most widely used swarm intelligence algorithm. Jia et al. (2023) used the PSO algorithm to optimize the parameters of the LSTM model and applied the optimized model to the prediction of crop reference evapotranspiration (ETo) in Shaanxi Province, China. The results show that the optimized model has good prediction accuracy. Wu et al. (2018) used the PSO algorithm to optimize back propagation (BP) neural network parameters, improving the accuracy of predicting dissolved oxygen concentration in water. Although the PSO algorithm converges quickly, its search space is small, and it struggles with high-dimensional problems. Because the PSO algorithm uses a fixed inertia weight, it often fails to adjust the velocity step correctly and quickly falls into a local optimum (Nasrollahzadeh et al. 2021). To address these defects, and inspired by the hunting behavior of humpback whales, Mirjalili & Lewis (2016) proposed the whale optimization algorithm (WOA), which has a large search space and a strong ability to escape local optima but suffers from slow convergence and low convergence accuracy. Xue & Shen (2020) proposed a new swarm intelligence optimization algorithm, the sparrow search algorithm (SSA), based on the foraging and antipredation behavior of sparrow swarms. The algorithm has the advantages of fast convergence, high convergence accuracy, few control parameters, and a strong ability to adapt to various complex problems. Since it was proposed, it has attracted the attention of researchers and has become a new tool in the field of swarm intelligence optimization. It has been widely used in combinatorial optimization, path optimization, image processing, data prediction, and other areas (Wu et al. 2021; Fan et al. 2023). However, until now, no studies have integrated ARIMA and SSA-LSTM for water quality prediction.
Water quality data have both linear and nonlinear characteristics. This study therefore draws on the strength of the ARIMA model in processing linear time series, the excellent performance of the LSTM model in processing nonlinear time series, and the ability of the SSA to optimize the network structure parameters of the LSTM model. An ARIMA-SSA-LSTM combination model is proposed to accurately predict water quality data such as residual chlorine, turbidity, and pH. First, the ARIMA model is used to predict the water quality data of water distribution networks, and the resulting training residuals form a nonlinear time series. Then, the SSA selects appropriate parameters to construct the LSTM residual prediction network, which predicts the residual values. Finally, the ARIMA-SSA-LSTM prediction of the original water quality time series is obtained by combining the sequence prediction calculated by the ARIMA model with the residual prediction obtained by the LSTM model. To show that the proposed model has higher prediction accuracy, water quality data from the water distribution network of a residential area in China are used as a case study. The ARIMA-SSA-LSTM model is compared with other models, and the effectiveness and accuracy of the proposed model are comprehensively and systematically evaluated. To address the low accuracy and large error of traditional single prediction models, this article combines the ARIMA model and the SSA-LSTM algorithm for drinking water quality prediction for the first time and verifies the model with actual water quality index data. Mean absolute error (MAE), root-mean-square error (RMSE), and the coefficient of determination (R2) are used to evaluate the prediction effect of the model. The results show that the model has the smallest error and is more suitable for the prediction and analysis of drinking water quality data.
METHODS
Autoregressive integrated moving average
Long short-term memory
In the LSTM model, the input gate of each memory unit controls what enters the memory unit, the forget gate controls what is deleted from it, the output gate controls what is output from it, and the memory unit itself is responsible for memorizing long sequences of information. At each time step t, each gate receives the current input xt and the hidden state ht-1 output at the previous time step t-1. The unit state ct and the output ht of the LSTM layer at time t are then computed from these gate activations.
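For reference, the gate computations are commonly written as below. This is the standard LSTM formulation, given here as a sketch and not as a reproduction of the authors' own equations. The weight matrices W, bias vectors b, the sigmoid function σ, the candidate state c̃t, and the element-wise product ⊙ are notation assumed here.

```latex
\begin{align}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(unit state)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(output)}
\end{align}
```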
When the LSTM network processes the input sequence, the features of the input sequence are extracted step by step. For each time step input, the aforementioned steps are performed until all elements of the entire sequence are processed, and finally, the final output of the whole series is returned.
Sparrow search algorithm
Sparrows are social birds with strong memory, curiosity, and vigilance, and they show a clear division of labor when foraging. Based on the foraging and antipredation behavior of the sparrow population, researchers have identified three roles: finder, follower, and scout. The finder is responsible for finding food and determining the foraging area and direction of the population, the follower forages by following the finder, and the scout issues a timely warning when it detects danger.
In the SSA, the position of each sparrow corresponds to a feasible solution to the optimization problem. n is the total number of sparrows. Xn = [x1n, x2n,…, xdn] is the location of the nth sparrow, each position element corresponds to an optimization variable, and d is the dimension of the optimization variable. The fitness value f(Xn) is used to evaluate the quality of sparrow position Xn. It reflects how good the feasible solution corresponding to each sparrow's position is for the objective optimization problem, and it determines the sparrow's role and therefore the position update rule applied to it.
To find the optimal hyperparameters of the LSTM, the following steps need to be performed.
- (1) Define the hyperparameter space: The hyperparameter space is treated as the population, and each individual is one hyperparameter combination, including hidden layer size, learning rate, batch size, etc.
- (2) Initialize the hyperparameter space: The SSA is used to initialize the hyperparameter space and generate the initial population.
- (3) Evaluate fitness: Each individual (hyperparameter combination) is evaluated using the fitness function.
- (4) Update the population: The SSA updates the population, selects the hyperparameter combination with the best fitness, and generates new hyperparameter combinations.
- (5) Termination condition: Steps (3) and (4) are repeated until a predetermined termination condition is reached, such as a maximum number of iterations or a satisfactory combination of hyperparameters.
- (6) Output: The hyperparameter combination with the best fitness is taken as the optimal hyperparameter combination of the LSTM model.
Through the aforementioned steps, the hyperparameters of the LSTM are optimized iteratively to improve model performance. This method helps find a well-performing combination of hyperparameters and avoids the subjectivity of manual parameter selection.
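The six steps above can be sketched in Python as follows. This is a simplified illustration that loosely follows the finder, follower, and scout update rules of Xue & Shen (2020); it is not the authors' implementation. The function names, the role fractions (20% finders, 10% scouts), the safety threshold of 0.8, and in particular `toy_fitness` are assumptions made for the example. In practice the fitness function would train an LSTM with the candidate hyperparameters and return its validation error.

```python
import numpy as np

def sparrow_search(fitness, lower, upper, n_sparrows=20, max_iter=50,
                   finder_frac=0.2, scout_frac=0.1, safety_threshold=0.8, seed=0):
    """Minimise `fitness` over the box [lower, upper] with a simplified SSA."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    dim = lower.size
    n_finders = max(1, int(finder_frac * n_sparrows))
    n_scouts = max(1, int(scout_frac * n_sparrows))

    # Steps (1)-(3): initialise the population and evaluate its fitness.
    pos = rng.uniform(lower, upper, size=(n_sparrows, dim))
    fit = np.array([fitness(p) for p in pos])
    best_pos, best_fit = pos[fit.argmin()].copy(), float(fit.min())
    history = [best_fit]

    for _ in range(max_iter):  # step (5): stop after a fixed number of iterations
        order = np.argsort(fit)  # rank sparrows, best first
        pos, fit = pos[order], fit[order]
        worst_pos = pos[-1].copy()

        # Finders: contract while the alarm value is below the threshold, else random step.
        alarm = rng.random()
        for i in range(n_finders):
            if alarm < safety_threshold:
                pos[i] = pos[i] * np.exp(-(i + 1) / (rng.random() * max_iter + 1e-12))
            else:
                pos[i] = pos[i] + rng.normal()
        leader = pos[0].copy()

        # Followers: the worse half relocates, the rest move around the leading finder.
        for i in range(n_finders, n_sparrows):
            if i + 1 > n_sparrows / 2:
                pos[i] = rng.normal() * np.exp((worst_pos - pos[i]) / (i + 1) ** 2)
            else:
                signs = rng.choice([-1.0, 1.0], size=dim)
                pos[i] = leader + np.sum(np.abs(pos[i] - leader) * signs) / dim

        # Scouts: randomly chosen sparrows react to danger. `fit` still holds the
        # values from the start of this iteration, which is enough for a sketch.
        for i in rng.choice(n_sparrows, size=n_scouts, replace=False):
            if fit[i] > best_fit:
                pos[i] = best_pos + rng.normal() * np.abs(pos[i] - best_pos)
            else:
                gap = abs(fit[i] - fit[-1]) + 1e-12  # guard against division by zero
                pos[i] = pos[i] + rng.uniform(-1, 1) * np.abs(pos[i] - worst_pos) / gap

        # Step (4): keep candidates inside the search range and re-evaluate them.
        pos = np.clip(pos, lower, upper)
        fit = np.array([fitness(p) for p in pos])
        if fit.min() < best_fit:
            best_pos, best_fit = pos[fit.argmin()].copy(), float(fit.min())
        history.append(best_fit)

    return best_pos, best_fit, history  # step (6): best combination found

def toy_fitness(x):
    # Hypothetical stand-in for "train an LSTM with x and return its validation RMSE".
    hidden, lr, batch = x
    return ((hidden - 120) / 300) ** 2 + (np.log10(lr) + 2) ** 2 + ((batch - 64) / 300) ** 2

# Search ranges: hidden size 10-300, learning rate 0.0001-0.99, batch size 1-300.
best, score, history = sparrow_search(toy_fitness, [10, 1e-4, 1], [300, 0.99, 300])
print(best, score)
```

Integer-valued hyperparameters such as the hidden layer size and batch size would need to be rounded before a real model is built; the sketch leaves them continuous for brevity.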
ARIMA-SSA-LSTM
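The three-step procedure described in the Introduction can be summarized with the following sketch. It assumes the usual additive decomposition for ARIMA-residual hybrids; the symbols L_t, N_t, e_t, the residual predictor f_LSTM, and the window length k are notation introduced here for illustration and may differ from the authors' own.

```latex
\begin{align}
y_t &= L_t + N_t && \text{(linear part plus nonlinear part)} \\
e_t &= y_t - \hat{L}_t && \text{(residual of the ARIMA prediction } \hat{L}_t\text{)} \\
\hat{e}_t &= f_{\mathrm{LSTM}}\left(e_{t-1}, e_{t-2}, \ldots, e_{t-k}\right) && \text{(SSA-tuned LSTM residual prediction)} \\
\hat{y}_t &= \hat{L}_t + \hat{e}_t && \text{(combined prediction)}
\end{align}
```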



RESULTS AND DISCUSSION
Study area
The pH has a significant impact on human health and the environment. An excessively high pH makes the water alkaline, which may lead to adverse reactions such as sore throat and gastrointestinal discomfort. An excessively low pH makes the water acidic, which may harm the human body and corrode metal materials such as steel. Residual chlorine in urban water distribution networks is unstable, and its concentration decays with time. When the residual chlorine is lower than 0.05 mg/L, it cannot effectively kill bacteria, viruses, and other microorganisms, so the water quality deteriorates and can cause severe harm to health. Turbidity refers to the content of small particulate matter in the water, which affects the transparency and cleanliness of the water. High-turbidity water contains more sediment, minerals, and other substances, which affect not only the appearance of the water but also the absorption and utilization of nutrients in the water. High-turbidity water can also lead to scale in households, affecting the circulation and use of drinking water. According to the ‘Technical Standards for Online Monitoring of Urban Water Supply Quality’ (CJJ/T 271-2017) issued by the Ministry of Housing and Urban-Rural Development of China, the online monitoring indicators of water supply quality should include residual chlorine and turbidity. The pH value is closely related to residual chlorine, and together these three indicators are important for evaluating tap water quality. Monitoring them helps to evaluate water quality and ensure the safety and sanitation of residents' drinking water, and they are common online monitoring and evaluation indices for water supply companies in China. Therefore, this study selected three water quality indicators, residual chlorine, turbidity, and pH, according to relevant standards for online monitoring.
Water quality prediction results based on LSTM optimization algorithm
All prediction models in the study are implemented in Python. The LSTM model consists of three hidden layers with 50, 100, and 200 neurons, respectively. The learning rate is set to 0.1, the dropout rate to 0.2, the batch size to 64, the number of training iterations to 100, and the time step to 15. RMSE was used as the model's loss function, and the LSTM model was trained using the Adam optimizer.
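To make the role of the time step concrete, the sketch below shows one common way to turn a univariate series into supervised samples, where each input holds 15 consecutive values and the target is the value that follows. The function name `make_windows` and the one-step-ahead target are assumptions for illustration; the article does not describe its data preparation in this detail.

```python
import numpy as np

def make_windows(series, time_step=15):
    """Split a 1-D series into (samples, time_step) inputs and next-value targets."""
    values = np.asarray(series, dtype=float)
    if values.ndim != 1 or values.size <= time_step:
        raise ValueError("series must be 1-D and longer than time_step")
    n_samples = values.size - time_step
    # Row i holds values[i : i + time_step]; its target is values[i + time_step].
    X = np.stack([values[i:i + time_step] for i in range(n_samples)])
    y = values[time_step:]
    return X, y

# Toy series 0, 1, ..., 19 stands in for a monitored water quality indicator.
X, y = make_windows(np.arange(20.0), time_step=15)
print(X.shape, y)
```

For a recurrent layer, `X` would typically be reshaped to (samples, time_step, 1) before training.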
The population size of the SSA was set to 20 and the maximum number of iterations to 50. The search range of the number of neurons in the three hidden layers of the LSTM model is 10–300, the search range of the learning rate is 0.0001–0.99, and the search range of the batch size is 1–300. The population size, number of iterations, and search range of LSTM hyperparameters of PSO and WOA are the same as those of the SSA so that the three optimization algorithms are compared under the same conditions.
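The three evaluation indices reported in the tables of this section can be computed as in the minimal sketch below. It uses the textbook definitions of MAE, RMSE, and R2 (one minus the ratio of residual to total sum of squares); the article does not state its exact formulas, so treat these as an assumption. The sample values are made up for illustration.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up observations and predictions, for illustration only.
observed = [0.30, 0.32, 0.35, 0.31]
predicted = [0.31, 0.30, 0.36, 0.33]
print(mae(observed, predicted), rmse(observed, predicted), r2(observed, predicted))
```

Lower MAE and RMSE and a higher R2 indicate a better fit, which is the direction of improvement reported in the comparisons that follow.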
As shown in Table 1, compared with the LSTM, PSO-LSTM, and WOA-LSTM models, the predicted residual chlorine MAE based on the SSA-LSTM model decreased by 36.9, 15.7, and 10.3%, RMSE decreased by 34.0, 17.7, and 10.6%, and R2 increased by 69.0, 7.7, and 4.3%, respectively.
Table 1. Error analysis of residual chlorine prediction models based on LSTM, PSO-LSTM, WOA-LSTM, and SSA-LSTM models
Models | MAE | RMSE | R2
---|---|---|---
LSTM | 0.111 | 0.141 | 0.516
PSO-LSTM | 0.083 | 0.113 | 0.810
WOA-LSTM | 0.078 | 0.104 | 0.836
SSA-LSTM | 0.070 | 0.093 | 0.872
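The percentage changes quoted for residual chlorine follow directly from the Table 1 values as relative changes with respect to each baseline model. The short check below reproduces them; the helper name `percent_change` is introduced here for illustration.

```python
def percent_change(baseline, value):
    """Relative change of `value` with respect to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0

# (MAE, RMSE, R2) values copied from Table 1.
baselines = {
    "LSTM": (0.111, 0.141, 0.516),
    "PSO-LSTM": (0.083, 0.113, 0.810),
    "WOA-LSTM": (0.078, 0.104, 0.836),
}
ssa_lstm = (0.070, 0.093, 0.872)

changes = {
    name: tuple(round(percent_change(b, s), 1) for b, s in zip(row, ssa_lstm))
    for name, row in baselines.items()
}
for name, (d_mae, d_rmse, d_r2) in changes.items():
    print(f"SSA-LSTM vs {name}: MAE {d_mae:+.1f}%, RMSE {d_rmse:+.1f}%, R2 {d_r2:+.1f}%")
```

Negative values are reductions in error and positive values are gains in R2, matching the 36.9, 15.7, and 10.3% MAE reductions stated in the text.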
As shown in Table 2, compared with the LSTM, PSO-LSTM, and WOA-LSTM models, the MAE of the predicted turbidity based on the SSA-LSTM model decreased by 45.5, 19.9, and 14.7%, RMSE decreased by 26.6, 10.3, and 5.9%, and R2 increased by 130.0, 2.7, and 2.0%, respectively.
Table 2. Error analysis of turbidity prediction models based on LSTM, PSO-LSTM, WOA-LSTM, and SSA-LSTM models
Models | MAE | RMSE | R2
---|---|---|---
LSTM | 0.00576 | 0.00717 | 0.253
PSO-LSTM | 0.00392 | 0.00587 | 0.582
WOA-LSTM | 0.00368 | 0.00559 | 0.586
SSA-LSTM | 0.00314 | 0.00526 | 0.598
As shown in Table 3, all optimization algorithms achieved good results in predicting pH, with a coefficient of determination (R2) above 0.9, which is related to the periodic characteristics of pH data. Compared with the LSTM, PSO-LSTM, and WOA-LSTM models, the MAE of the predicted pH based on the SSA-LSTM model decreased by 79.6, 67.0, and 50.7%, the RMSE decreased by 72.7, 62.0, and 48.6%, and R2 increased by 15.1, 6.5, and 4.0%, respectively.
Table 3. Error analysis of pH prediction models based on LSTM, PSO-LSTM, WOA-LSTM, and SSA-LSTM models
Models | MAE | RMSE | R2
---|---|---|---
LSTM | 0.0872 | 0.0937 | 0.859
PSO-LSTM | 0.0539 | 0.0674 | 0.929
WOA-LSTM | 0.0361 | 0.0498 | 0.951
SSA-LSTM | 0.0178 | 0.0256 | 0.989
In summary, although the prediction results of the SSA-LSTM model still contain some inaccuracies, its accuracy significantly exceeds that of the single LSTM model and is also better than that of the other optimized models, PSO-LSTM and WOA-LSTM.
Water quality prediction results based on ARIMA-SSA-LSTM combined model
For the ARIMA model, the augmented Dickey–Fuller (ADF) and Kwiatkowski–Phillips–Schmidt–Shin (KPSS) tests are first used to judge data stationarity. The data are considered stationary only when both the ADF and KPSS tests pass; otherwise, the data are differenced until they are stationary. The value range of parameters p and q was set to [0,8], and the Akaike information criterion (AIC) and Bayesian information criterion (BIC) were calculated for each combination of p and q. The combination with the smallest sum of AIC and BIC was selected as the final parameter set of the ARIMA model. The ARIMA model parameters (p, d, q) of residual chlorine, turbidity, and pH were (7,1,7), (3,1,6), and (2,1,5), respectively. The ARIMA parameters in the other combination models are the same as those of the single ARIMA model.
As shown in Table 4, compared with the ARIMA and ARIMA-LSTM models, the MAE of the predicted residual chlorine based on the combined model of ARIMA-SSA-LSTM decreased by 55.9 and 42.4%, the RMSE decreased by 54.5 and 38.2%, and R2 increased by 24.5 and 9.2%, respectively.
Table 4. Comparison and analysis of the errors of three residual chlorine prediction models
Models | MAE | RMSE | R2
---|---|---|---
ARIMA | 0.0941 | 0.1270 | 0.763
ARIMA-LSTM | 0.0721 | 0.0936 | 0.870
ARIMA-SSA-LSTM | 0.0415 | 0.0578 | 0.950
As shown in Table 5, compared with the ARIMA and ARIMA-LSTM models, the MAE of the predicted water turbidity based on the combined model of ARIMA-SSA-LSTM decreased by 81.6 and 48.7%, the RMSE decreased by 84.4 and 39.4%, and the R2 increased by 69.5 and 1.5%, respectively.
Table 5. Comparison and analysis of the errors of three turbidity prediction models
Models | MAE | RMSE | R2
---|---|---|---
ARIMA | 0.00331 | 0.00531 | 0.584
ARIMA-LSTM | 0.00119 | 0.00137 | 0.975
ARIMA-SSA-LSTM | 0.00061 | 0.00083 | 0.990
As shown in Table 6, compared with the ARIMA and ARIMA-LSTM models, the MAE of the predicted pH based on the combined model of ARIMA-SSA-LSTM decreased by 70.8 and 46.2%, the RMSE decreased by 70.5 and 49.4%, and the R2 increased by 1.4 and 0.4%, respectively.
Table 6. Comparison and analysis of the errors of three pH prediction models
Models | MAE | RMSE | R2
---|---|---|---
ARIMA | 0.0216 | 0.0295 | 0.984
ARIMA-LSTM | 0.0117 | 0.0172 | 0.994
ARIMA-SSA-LSTM | 0.0063 | 0.0087 | 0.998
CONCLUSIONS
Real-time data from a community water distribution network in China were used as an example in this research to establish the ARIMA-SSA-LSTM water quality prediction model. Comparisons are made between the advantages of SSA, WOA, and PSO algorithms on LSTM model hyperparameter selection, as well as between the benefits of the ARIMA-SSA-LSTM, ARIMA-LSTM, and ARIMA for prediction. The main conclusions are as follows:
- (1) The prediction accuracy of the LSTM model can be improved by selecting proper hyperparameters. Compared with WOA-LSTM and PSO-LSTM, the SSA-LSTM has higher prediction accuracy. Therefore, the SSA can find more suitable hyperparameters for the LSTM model.
- (2) A new combination prediction model of residual chlorine, turbidity, and pH based on ARIMA, SSA, and LSTM, named ARIMA-SSA-LSTM, is proposed. Experimental results show that the R2 of this model is 0.95 or higher for all three indicators, and the prediction accuracy is higher than that of all comparison models.
As a result, the approach developed in this study provides a new perspective on predicting residual chlorine, turbidity, and pH. When the predicted value of a water quality indicator falls outside the normal range prescribed by national standards, relevant departments can take corresponding measures in advance.
However, there are still some shortcomings in this study: (1) The number and type of water quality indicators in the dataset are small. The selected water quality indicators, such as residual chlorine, turbidity, and pH, may not be sufficient to fully confirm whether the water quality is clean and hygienic; (2) among many excellent deep learning models, only the LSTM model was selected in this study. These two issues will be addressed in future research by expanding the collection of water quality data to include additional indicators such as Escherichia coli, total hardness, dissolved solids, and total suspended solids to enrich the training set. Furthermore, we plan to explore combining linear time series models with other deep learning models for improved performance.
FUNDING
This study was supported by the Ningbo Key Research and Development Program (No. 2022Z092).
AUTHOR CONTRIBUTION
All authors contributed to the study conception and design. Tingyu Wang and Bo Tang wrote and edited the article; Wei Chen collected the preliminary data. All authors read and approved the final manuscript.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.