ABSTRACT
This work explores the accurate prediction of water quality, a critical aspect of water environment protection. Leveraging advanced algorithms and statistical models, including regression analyses, moving average, and artificial neural networks (ANN), we analyze multivariate time-series datasets from tap water testing sites. Our approach involves model fitting and evaluation based on the Akaike information criterion (AIC) and root mean square error (RMSE). The work introduces the long short-term memory (LSTM) neural network and the seasonal autoregressive integrated moving average (SARIMA) model, detailing their architectures and implementations. Results highlight the effectiveness of a multivariate LSTM model in forecasting total chlorine concentrations, complemented by the SARIMA model. Evaluation metrics reveal comparable predictive accuracy, with a slight advantage observed for the mixed method combining both models. This work underscores the significance of integrating technological and statistical methodologies for enhanced water quality predictions, contributing to proactive water management strategies.
HIGHLIGHTS
Optimization of water quality monitoring process.
Predictive analysis to alleviate nitrification concerns.
Atmospheric surface temperature is a good predictor of total chlorine concentration.
Long short-term memory and ARIMA models were fit to the data with optimal hyperparameters.
Both models provide a similar degree of accuracy.
INTRODUCTION
Ensuring the accurate prediction of water quality is of paramount importance in the domain of water environment protection, offering operators the foresight to implement proactive measures and mitigate potential issues before they even occur. Despite the critical nature of this endeavor, achieving the desired precision remains a complex challenge. However, the contemporary landscape is replete with a plethora of meticulously designed and rigorously tested algorithms and statistical models specifically crafted for the analysis of time-series data. Notable among these are regression analyses, support vector regressions, Grey systems, and artificial neural networks (ANN) (Zhou et al. 2018). There are several models that were tested but not included in this study. Among these are the random forest model and gradient boosting models such as XGBoost. These are not included in this study because the goal is to compare models that are specifically designed for time series prediction.
Water quality monitoring data typically takes the form of multivariate time-series datasets, integrating historical and daily updated information from tap water testing sites within the distribution networks of two water treatment plants. Advanced technologies like IoT facilitate real-time water quality monitoring, addressing critical needs in resource conservation and management (Karthikeyan et al. 2023). Exploratory data analysis plays a vital role in understanding the dataset's structure and analysis objectives.
Researchers have extensively studied the influence of factors such as temperature (Oliveira & Von Sperling 2008), pH (Zhang et al. 2019), and other chemical components on water quality. Analytical techniques help uncover relationships among these variables. For instance, principal component analysis effectively explores correlations among physicochemical and microbiological parameters (Platikanov et al. 2019), while correlation analysis provides insights into variable associations (Mitryasova et al. 2016). Cross-correlation analysis reveals temporal dependencies among time-series variables (Lehmann & Rode 2001).
Time-series forecasting methods like autoregressive integrated moving average (ARIMA) have proven useful in water quality predictions (Hernández et al. 2017; Selvaraj & Shalma 2024). More advanced approaches, such as long short-term memory (LSTM) networks, offer sophisticated solutions for modeling and predicting time-series data (Chen et al. 2022). Case studies demonstrate LSTM's superior performance in water quality analysis and prediction (Yoon et al. 2024). Due to time constraints, not every type of machine learning or statistical model could be tested on this dataset. Considering the goal of this case study is solely to maximize the accuracy of predicted values of a time series, the models tested are renowned for their predictive ability in a time series context. Therefore, the LSTM and ARIMA models, representing machine learning and statistical methods, respectively, are considered in this study.
In this work, we focus on key parameters such as weather, chlorine, ammonia, and nitrite to monitor and predict water quality. By bridging water science with predictive analytics, this study contributes to advancing water quality forecasting, promoting better environmental stewardship, and resource management.
METHODOLOGY
The data comprise multivariate time-series datasets, integrating past and current information from tap water testing sites within the distribution system of two water treatment plants. Initial exploration involves selecting target and feature variables using the Pearson correlation coefficient. Subsequently, three models, one being a hybrid of the other two, are fitted to the data. Model evaluation employs the Akaike information criterion (AIC), and predictive performance is assessed through the root mean square error (RMSE).
The LSTM model
The LSTM framework also supports multivariate input and output, which is particularly important when forecasting chemical concentrations at tap water testing sites (Im et al. 2022). Typically, multiple chemical components are measured simultaneously at each testing site, resulting in several time series for each location (Marcoux et al. 2017). The criteria for selecting additional feature variables or time series involve considering the correlation coefficient between the variable to be predicted and any variable that may be related.
Model implementation
Model input is of the form {Xi1(t − d), Xi2(t − d), … , Xis(t − d)}, where observations for each Xis(t − d) for s ∈ {1, 2, … , n} are expanded to the set {Xis(t − 1), Xis(t − 2), … , Xis(t − d)} assuming there are n number of feature variables included that are directly related to the time series of interest and each Xis represents a feature/predictor variable for row/observation i and feature indicator s (Siami-Namini et al. 2018).
The implementation of a multivariate LSTM model can pose challenges due to the requirement of (k − 1) future/predicted values for all included feature variables, based on the number k of future values in the time series to be predicted. A frequently adopted approach involves separately predicting up to (k − 1) future values for all feature variables and subsequently using these predicted values as the required future values.

Each of m1[d, 1], m2[d, 1], ··· , mn−d[d, 1] is exactly equal to Y = Xi1(t − n + d + 1), Xi1(t − n + d + 2), ··· , Xi1(t − 1). Therefore, if we let M = m1, m2, ··· , mn, we can assert that every element of M corresponds to every element of Y. This outlines how the LSTM model is trained: predict Yi given Mi. Conceptually, this can be envisioned as a window of length d commencing at the first row of the dataset (time t − n) and progressively sliding down row-by-row until the second to last row (time t − 2) (Fneish 2019).
The SARIMA model
The seasonal autoregressive integrated moving average (SARIMA) model represents a more statistical approach to time series forecasting compared to an LSTM model (Theerthagiri & Ruby 2023). It extends the ARIMA model to incorporate the seasonality inherent in a time series. The model is commonly employed in situations where a discernible pattern of variation occurs at specific intervals of time, indicating a seasonal component. SARIMA is characterized by six hyperparameters (Valipour et al. 2012):
p: autoregressive order
d: differencing order
q: moving average order
P: seasonal autoregressive order
D: seasonal differencing order
Q: seasonal moving average order
The ‘trend’ hyperparameters encompass the p, d, and q parameters, as they are utilized to model the immediate trend in the time series. Conversely, P, D, and Q are employed to model the seasonal component of the time series. If the autoregressive order (p) is set to n, then the model incorporates n weighted lags with which to model yt would be (t − 1, t − 2, ··· , t − n). Similarly, if the moving average order (q) is set to n, the model incorporates n weighted errors of the lags of the time series to model yt would be (t − 1, t − 2, ··· , t − n) (Hyndman & Athanasopoulos 2018).
For seasonal autoregressive and moving average orders, (P and Q) are set to n, then there would be 2n additional terms added to the model where the same explanations of p and q apply to P and Q, respectively. However, the lags for P and Q are represented as (t − m, t − 2m, ··· t − nm), where ‘m’ is the length of the ‘season’, a period where yt varies in a predictable pattern – could be weeks, months (Hyndman & Athanasopoulos 2018).
The first three (p, d, q) compose the entirety of the ARIMA model, and the last three (P, D, Q) are the same as the first three but apply to the seasonal component. In notation (Pennsylvania State University 2022):
Autoregressive order AR(p = n): yt = β0 + β1(yt−1) + β1(yt−1) + ··· + βn(yt−n) + εt
Differencing order I(d = n): where Byt = yt−1
Moving average order MA(q = n): yt = c + εt + θ1εt−1 + θ2εt−2 + ··· + θnεt−n
In the above model, yt represents the time series of interest, zt denotes the time series after both trend (d) and seasonal (D) differencing have been applied, ω is the intercept term, and n is the order of all six other hyperparameters. The first two summations in the lower part correspond to the p (trend) and P (seasonal) autoregressive components, respectively, while the last two are the q (trend) and Q (seasonal) moving average components, respectively (Hyndman & Athanasopoulos 2018).
CASE STUDY
Data
The United States' Center for Disease Control reports that nine out of 10 Americans get their water from one of more than 148,000 public water systems. These treatment systems typically rely on either groundwater or surface water (from rivers and lakes) as their source of raw water (Centers for Disease Control & Prevention n.d.). To uphold water quality standards, operators of said treatment facilities undertake the daily measurement of specific chemical components in the treated water. These measurements are taken as the water flows out of storage tanks, passes through booster pump stations (where the treated water is ‘boosted’ with additional chlorine), and at various locations, primary endpoints, throughout the distribution system.
The dataset used in this analysis consists of daily measurements for each chemical of interest, collected from various locations throughout the distribution system of a medium-sized city in the southern US. The city's water supply comes from two main treatment plants, Jefferson and El Pico, which together serve approximately 260,000 residents with a combined capacity of 85 million gallons per day. The data collection period across these 36 tap water testing sites spans from 1 September 2021 to 3 December 2023, spanning approximately 1,100 days/data points. It should be noted that the frequency and method of measurement result in a highly variable time series for each chemical monitored, resulting in generally low autocorrelation (<0.2) across the time series of interest. Additionally, some data cleaning was performed prior to predictive analysis. The cleaning process included ensuring that there were no data gaps for addresses that had some missing data. These small data gaps were filled via linear interpolation. The provided data were generally very clean, and there were not significant outliers that warranted removal.
The levels of various chemicals in milligrams per liter (mg/L) undergo continuous monitoring at each tap water testing site, encompassing parameters such as total chlorine, monochloramine, free ammonia, nitrite, and nitrate. However, the focal point of predictive analysis and forecasting narrows down to total chlorine and free ammonia for two reasons. First, free ammonia is predicted because it is the main contributing factor to nitrification. Second, chlorine concentration is predicted because it is the sole parameter subject to manual adjustment by controlling the chlorine dosage at booster pump stations. Finally, other chemicals – beside total chlorine and free ammonia – are either redundant measurements of total chlorine (such as monochloramine) or were measured with too low precision to allow for practically useful prediction. The concentrations of all other monitored chemicals such as nitrite, nitrate, monochloramine, are the result of chemical reactions with chlorine taking place as the water ages throughout the distribution system and are not directly modifiable. Moreover, the flow of treated water from a booster pump station to an endpoint may take several days. Therefore, forecasting total chlorine and free ammonia concentrations at endpoints of the distribution system gives operators the best chance of proactively remedying the situation of chemical levels moving into their respective critical zones.
Levels of total chlorine and free ammonia in mg/L at a testing site.
Despite the evident inverse relationship observed between the levels of the two chemicals plotted above, the coefficient of correlation, using the Pearson method, is only −0.39. Surprisingly, establishing a statistically significant relationship between any two chemical components, whether analyzing one location or an aggregate of several, proves challenging. In exploring potential predictors of total chlorine (or other components) level, the variable demonstrating the strongest observed relationship with the concentration of any chemical is the maximum daily temperature recorded by the Automated Weather Observing System (AWOS) of the international airport within the city. Pearson correlation coefficients of the relationship between said maximum temperature and the concentration of the previously noted key chemical components – whether at a single site or any combination – range from approximately 0.4 to 0.7 but tend to hover around 0.6.
To streamline the model and minimize unnecessary feature variables, only those variables consistently showing a statistically significant relationship with total chlorine levels (>50% of sites) are considered. In this scenario, maximum daily temperature from the airport's AWOS emerges as the sole variable meeting this requirement. We have these daily maximum temperature values for every single day for which we have chemical concentration data. Consequently, the final dataset encompasses total chlorine levels for each of the typically 36 (variable due to real-time monitoring needs) testing sites for each day, along with the maximum temperature for each day converted into degrees Fahrenheit and rounded to the nearest whole number.
RESULTS/DISCUSSION
A multivariate LSTM model is employed to forecast the next six daily total chlorine concentration measurements (in mg/L) at 10 tap water testing sites in a southern US city. The model's input consists of 16 batches of sequences with two variables that are heavily correlated with total chlorine levels: lagged chlorine levels and daily maximum temperature. The output is a single value representing the predicted total chlorine for the next day at each site. The model is trained on 80% of the available data and tested on the remaining 20%.
Adapted from insights from Liu et al. (2022), the stacked LSTM RNN consists of two hidden layers (between the required input and output layers) containing three and eight neurons, respectively. The choice of these specific layer sizes was based on a balance between model complexity and performance, as preliminary experiments suggested that this configuration achieved good predictive accuracy without overfitting. Liu et al. (2022) employed similar layer configurations for time series forecasting tasks, demonstrating the effectiveness of smaller layers in capturing temporal dependencies without excessive computation. The optimizer used is ‘Adam,’ known for its efficiency in training deep learning models, and the loss function is MSE, as it is commonly employed for regression tasks to minimize error. The activation function is ‘SoftPlus,’ as it resulted in the lowest average loss during experimentation, reflecting its ability to avoid issues like vanishing gradients compared to traditional activation functions. The learning rate is set to 0.001, and the number of epochs is fixed at 50. Additionally, an ‘EarlyStopping’ callback is implemented to prevent the initiation of the next epoch if the validation loss has not improved for the past five epochs (Keras 2022). Further experiments could explore the impact of varying the number of neurons or epochs to optimize performance.
Actual vs. forecasted total chlorine concentration (in mg/L) at a testing site for the next 700 days using future maximum daily temperatures.
Actual vs. forecasted total chlorine concentration (in mg/L) at a testing site for the next 700 days using future maximum daily temperatures.
Among the various statistical models considered, including SARIMA and ARIMA types, an exhaustive model selection process using the ‘auto.arima’ function in R's ‘forecast’ package led to the implementation of the ARIMA(5,1,1) model. The choice is based on the model providing the minimum AIC metric. The SARIMA model was considered because of the observed seasonal variation of both chemical concentrations. However, the ARIMA model performed the best of all the ARIMA-type models tested.
ARIMA-class models are reduced SARIMA-class models that do not account for any seasonal component, with only the p, d, and q terms included. Consequently, this model's prediction of the current day's total chlorine level at any testing site is a function of only the previous 5 days' total chlorine levels (autoregressive order of 5) and yesterday's error in total chlorine level (moving average order of 1) applied to the first difference of the time series (differencing order of 1).
Recognizing the proven efficiency of aggregated forecasts from multiple models in enhancing prediction accuracy (Jose & Yasala 2024), a supplementary forecasting approach involves computing another forecast. This is achieved by averaging the predicted values from each of the two models for every day throughout the forecast period. Subsequently, the performance of each method is systematically compared, focusing on a specific testing site.
The model used in this case study demonstrates strengths, such as capturing the strong relationship between total chlorine levels and daily maximum temperature, leading to accurate predictions. Its flexibility in forecasting multiple sites and using previous day values alongside temperature enhances its effectiveness. However, the model has weaknesses, such as relying solely on temperature, which may limit accuracy in areas where factors like manual boosting chemicals significantly impact chlorine levels.
In the provided formula, Pi represents the predicted total chlorine concentration for a specific time at a specific site, Oi denotes the observed total chlorine concentration at that same time and site, and n represents the number of recorded measurements for that site. It is important to note that RMSE is calculated on a per-site basis, considering each site individually.
Five-number summaries of the RMSE values for each method provide a better sense of how well these models predict relative to each other.
Five-number summary of each method's RMSE
. | Five-number summary . | |||||
---|---|---|---|---|---|---|
Min . | 25th percentile . | Median . | 75th percentile . | Max . | ||
Method | ARIMA | 0.23 | 0.29 | 0.33 | 0.41 | 0.74 |
LSTM | 0.22 | 0.28 | 0.31 | 0.40 | 0.77 | |
Mixed | 0.22 | 0.27 | 0.31 | 0.40 | 0.74 |
. | Five-number summary . | |||||
---|---|---|---|---|---|---|
Min . | 25th percentile . | Median . | 75th percentile . | Max . | ||
Method | ARIMA | 0.23 | 0.29 | 0.33 | 0.41 | 0.74 |
LSTM | 0.22 | 0.28 | 0.31 | 0.40 | 0.77 | |
Mixed | 0.22 | 0.27 | 0.31 | 0.40 | 0.74 |
Comparison of all three models' RMSE values upon fitting each to all 36 testing sites of the distribution system.
Comparison of all three models' RMSE values upon fitting each to all 36 testing sites of the distribution system.
Additionally, the same LSTM model used for total chlorine level prediction was applied to predict free ammonia concentration for the next 6 days for each of the 36 testing sites. Maximum daily temperature is the only other variable that is included in the model (just like total chlorine prediction) due to the strong correlation of free ammonia and near-surface air temperature. Validation results of these predictions are essentially the same (proportionally) as those illustrated above of total chlorine.
Finally, follow-up evaluation of the LSTM model's predictive accuracy several months after implementation demonstrates its high degree of accuracy. On average, model predictions of total chlorine for the next day are off by roughly 80% of a testing site's standard deviation in total chlorine levels. Given the high degree of variability in day-to-day total chlorine levels (as seen in Figure 3), we believe that an RMSE value less than the standard deviation indicates a good/useful level of predictive accuracy.
CONCLUSION
Accurate water quality prediction is essential for the efficient operation of water treatment systems, yet noisy data remains a persistent challenge to achieving reliable outcomes. To address this, we proposed a hybrid prediction model that combines the strengths of ARIMA and LSTM, as outlined by Fathi (2019). This composite model leverages the complementary capabilities of its constituent models, with its performance evaluated using RMSE as the benchmark.
The analysis reveals that the hybrid ARIMA–LSTM model offers a marginal improvement in predictive accuracy over individual ARIMA and LSTM models. While the standalone models exhibit comparable performance in this study, the LSTM network demonstrates a slight edge, particularly in capturing long-term trends. The reason that the LSTM model is used instead of either the ARIMA or mixed model is that the multivariate nature of the LSTM enables it to effectively model extended trends in water quality, driven by the strong correlation between chemical concentrations and near-surface air temperature. Furthermore, model interpretability is not a key consideration in this case, as the goal is to maximize raw predictive power.
Despite these strengths, daily variations in water quality data, often influenced by human inputs, introduce noise that cannot be solely attributed to temperature fluctuations. These variations may also be affected by external factors, like water treatment interventions, including chemical dosing, which can significantly alter water quality but may not be accounted for in the model. In cases where these variations could significantly impact chemical component levels, modeling this variability becomes crucial. Incorporating additional predictor variables beyond daily maximum temperature is recommended to improve predictive accuracy and better capture these external influences. However, data acquisition depends on the sensor capabilities of municipal water treatment systems, which may have limitations in both coverage and resolution, further complicating model accuracy. Addressing these data constraints is a critical step toward improving daily variability predictions.
The LSTM model takes roughly 25 s to be trained on the data and generate predictions for each site, meaning that it takes around 15 min in total to train the model on all 36 sites. This amount of training time is not a constraint in the overall predictive process, however, as the predictions are automatically generated early in the morning each day. Given the resources available, this process could be expanded to include a vast number of additional municipalities before training time and hardware limitations even become a concern.
Further optimization, including exhaustive tuning of LSTM hyperparameters and expanding the ensemble with additional models, holds potential for improving predictive accuracy. Nevertheless, the LSTM model presented here provides sufficient accuracy for most practical applications, demonstrating its utility in advancing water quality management and fostering better resource stewardship.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.