Abstract
Rainfall and increased river flow can deteriorate raw water (RW) quality parameters such as turbidity and UV absorbance at 254 nm. This study aims to develop a methodology for integrating both time-lagged watershed rainfall and river flow data into machine learning models of the quality of RW supplying a drinking water treatment plant (DWTP). Spearman's rank non-parametric cross-correlation analyses were performed using both river flow and rain in the watershed and RW data from the water intake. Then, RW turbidity and RW UV254 were modelled, using a support vector regression (SVR) and an artificial neural network (ANN) under several prediction scenarios with time-lagged variables. River flow presented a very strong correlation with RW quality, whereas rainfall showed a moderate correlation. Time lags with maximum correlations between flow data and turbidity were a few hours, while for UV254, they were between 2 and 4 days, demonstrating varied time lags and a complex behaviour. The best performing scenario was the one that used time-lagged watershed rainfall and river flow as input data. The ANN performed better for both turbidity and UV254 than SVR. Results from this study suggest the possibility for new modelling strategies and more accurate chemical dosing for the removal of key contaminants.
HIGHLIGHTS
A methodology for selecting the best raw water (RW) modelling predictors from rainfall and river flow data.
Estimating cross-correlations between both rainfall events and river flow rates and RW quality parameters could be useful to select input data in RW modelling.
RW predictive models could help drinking water treatment plants to anticipate necessary adjustments in the treatment operating conditions.
INTRODUCTION
Surface water contains elements such as pathogens, particles, and organic matter. These elements can be undesirable for human consumption, act as precursors of toxic compounds, or can interfere in water treatment processes and must therefore be removed in drinking water treatment plants (DWTPs) (Health Canada 2002). Rainfall leads to increases in these contaminants in source waters, through the transport of particles and organic and inorganic dissolved substances through surface and subsurface runoff (Singh 2017). Hydrological responses to rainfall depend on the physical characteristics of watersheds (Pratt & Chang 2012; Delpla & Rodriguez 2014; Shaw et al. 2017). Rainfall characteristics such as quantity, duration, intensity, and frequency also play a role (Musy & Higy 2010), as well as soil saturation conditions before rainfall. Additionally, deterioration of source water quality can be exacerbated by non-point and point source pollution due to anthropogenic activity within watersheds (Delpla & Rodriguez 2014), such as sewer overflows from rainfall events and agricultural pollution.
Sudden changes in raw water (RW) quality following rainfall require modifications to water treatment operations. For example, the dosage of coagulant and disinfectant may need to be adjusted (Edzwald 2011; Crittenden et al. 2012) due to increased fine particle (Khan et al. 2015) and natural organic matter (NOM) concentrations (Eikebrokk et al. 2004; Delpla & Rodriguez 2014, 2016; Khan et al. 2015). Modelling and predicting RW quality after rainfall could help DWTPs anticipate these adjustments.
Recent studies have aimed to model and evaluate the impacts of rainfall events on water quality (Rajaee et al. 2020; Tiyasha & Yaseen 2020; Ortiz-Lopez et al. 2022). For instance, Bastaraud et al. (2020) observed, by means of clustering analysis and daily water sampling at different points in the drinking water distribution network, that the variability of microbiological parameters was affected during periods of rain. Jia et al. (2021) estimated the impacts of rainfall events on urban lakes. They observed that water quality (total phosphorus, dissolved oxygen, and total suspended solids) during the rainy season was poorer than during the dry season. Cisty et al. (2021) observed that the more saturated the watershed soil, the more sediments are transported into the river, demonstrating the impacts of rainfall on DWTPs. They also demonstrated that time-lagged series of rainfall could provide valuable information about the presence of sediments in rivers. However, no method for selecting the time lag was mentioned. Wang et al. (2022) used streamflow for predicting chemical dosage in a DWTP using machine learning algorithms with daily data. They used day-lag cross-correlations between flow and various RW quality parameters to establish modelling. However, in this study, neither method for selecting the time of lag was mentioned. Most of the studies mentioned above were based on sampling campaigns that were conducted in source waters (rivers and lakes) following rainfall events. In a previous work, Ortiz-Lopez et al. (2022), however, reviewed the models for surface RW quality that were available in the literature and that could be applied for DWTP decision-making. These authors concluded that although the available models are useful, they have various limitations. These include (a) prediction time steps that are too large (e.g., daily predictions), making it difficult for short-term decision-making in DWTPs, (b) output variables that are not relevant to DWTP decision-making indicators, such as environmental and ecological integrity (dissolved oxygen, phosphorous, chloride, etc.), and (c) the absence of a systematic methodology for choosing input variables, in particular when considering time series and time lags.
In this paper, we predict two relevant RW quality parameters associated with DWTP decision-making, turbidity (Tu) (as indicator of particles) and UV absorbance at 254 nm (UV254) (as indicator of NOM). These crucial parameters are used by DWTP operators to adjust treatment operations (Edzwald 2011) and are highly impacted following rainfall (Baurès et al. 2013; Ortiz-Lopez et al. 2022). We propose a data-driven (machine learning) approach for modelling the quality of the RW supplying DWTPs using watershed rainfall and river flow rates as input variables. We also propose a methodology for determining the best time lags for input variables using Spearman's rank cross-correlation coefficient with hourly time steps. Our novel methodology uses hourly rainfall, the river flow rate and to first select model input data. These input data are then used to model Tu and UV254. Finally, we do not aim to establish herein a new model method but a general methodology for selecting the best time-lagged rain watershed and river flow series as RW quality predictors. For that reason, we used simple machine learning methods in order to test whether our proposed methodology improves RW quality modelling. To the best of our knowledge, our research is the first to propose a modelling procedure for RW quality that considers the need of DWTP operators to react to sudden water quality variations. This is possible through the use of cost-effective data (for input and output variables) that are gathered and modelled based on time steps that most accurately represent RW quality variability after rainfall.
METHODS
Case study
The municipal DWTP in our study is located on the low watershed of the Chaudière River near its river mouth in the Saint Lawrence River. The Chaudière River lower watershed is strongly influenced by the Beaurivage River, a tributary of the Chaudière River. The DWTP supplies about 35,000 inhabitants in the City of Lévis. The land use in the Chaudière River watershed is 68% forest, 23% agricultural, 4% urban and residential, and 5% other (COBARIC 2014).
Data
We collected RW quality data from the water intake at the DWTP. The water intake is located in the Chaudière River, 6 km upstream from the mouth in the Saint Lawrence River (Figure 1). The DWTP is equipped with a continuous analyzer probe for several water quality parameters, such as Tu, conductivity (Co), pH, and temperature (T). Average hourly Tu data were extracted from the DWTP data management system. RW turbidity was measured onsite by online analyzers (Turbidity: HACH Surface scatter). Additionally, RW UV absorbance was measured online with spectro::lyser™ UV spectrophotometer probes (s::can, Austria) installed by our research team at the RW intake. Continuous UV absorbance monitoring was conducted with a 35-mm path length probe (measuring range: 0–70 m−1). UV254 is considered to be an excellent surrogate parameter for estimating RW concentrations of dissolved NOM (Edzwald et al. 1985; Crittenden et al. 2012; Beauchamp et al. 2018). Data for RW quality and hydrological parameters were collected from 15 April to 14 October 2017. Around 4,392 observations were collected. Due to technical issues, 10.4 and 14.5% of observations were lost from the Tu time series and UV254 time series, respectively. Kalman smoothing by an ARIMA model (Hyndman & Khandakar 2008) was used to impute the missing data. Additionally, Tu interference with UV254 measurements was removed using a correction function that was previously developed by our research team for the same water source (Delpla et al. 2023).
Hydrological data for the Chaudière River watershed were also acquired for analysis. Hourly rainfall data were collected from five stations throughout the watershed (MELCC 2020). These hydrological stations are located along the Chaudière River and capture data on the rainfall regime in the watershed. Rainfall gauge stations are located in the low watershed in Charny (PS.1), in the middle watershed in Scott (PS.2), Saint-Séverin (PS.3), and Saint-Georges (PS.4), and in the high watershed in Saint-Ludger (PS.5) (see Figure 1). The stations are managed by the Environment and Climate Change Ministry of Quebec (MELCC). We also collected the hourly data from two flow meter stations: one at Chaudière River (QS.1), located in Saint-Lambert-de-Lauzon, and one at Beaurivage River (QS.2), located in Saint-Étienne (Figure 1). QS.1 is in the lower watershed, 20 km upstream from the water intake, at a point where the water has travelled 89% of the total length of the Chaudière River. QS.2 is also located in the lower watershed, 5.2 km upstream from the Chaudière River mouth (see Figure 1).
RW quality modelling methodology
Input variable choices from time-lag correlations
Rainfall induces various hydrological responses in watersheds, including the rate of river flow and variations in RW quality. There are always time lags between a rainfall event and the hydrological response, but these may vary from one watershed to another (Singh 2017). In order to predict and therefore anticipate RW quality variations after rainfall, historical data must be used as model input variables. The literature also suggests that a specific time lag should be identified and applied between the input and output variables. However, this is challenging, as there are no structured guidelines for determining the best time lag in the literature.
To identify the best input variables, we proposed a cross-correlation analysis between rainfall accumulation (input variables) and RW quality (output variables). The analysis consists of determining the relationships between time-lag rainfall and RW Tu or UV54 . The time lags (k) were tested from 0 (no time lag) to 240 h, at 1-h increments. The complete time series for these variables were considered for each correlation. The same analysis was performed between the river flow rates and the RW quality variables, as the river flow rate is an intermediate variable that could help predict RW quality.
The non-parametric Spearman's rank correlation coefficient (a special version of Pearson's coefficient) was used for the cross-correlation analysis. This index evaluates the associations between two sets of ranked scores (Chen & Popovich 2002). Typically, associations between pairs of variables can be evaluated using the unmodified Pearson's correlation coefficient (r). However, the required assumptions related to data distribution, homoscedasticity, linearity, and continuity were not met in our case. Spearman's rank correlation coefficient is suitable for assessing the strength and direction of a monotonic relationship between two variables, even if the relationship is not linear (Kottegoda & Rosso 2008).
The correlation was considered significant if rs was greater than , where n is the number of observations (Hyndman & Athanasopoulos 2018). Empirically, one could demonstrate that if cross-correlation with a k lag in a population is zero for , then when n is very large, will have a normal distribution ( and ). This is possible since approximately 95% of a normal population is within 2 standard deviations above or below the mean. We determined correlation coefficients to be significant at a p-value of 0.05. All analyses were conducted using the R software package (R Core Team 2022).
Data processing and splitting
We randomly divided 3,893 hourly observations into training (80%) and testing (20%) data categories. Since Tu and UV254 values can change over short periods of time during rainfall, randomly splitting the data ensured that both training and test datasets contained both high and low RW quality values. In order to ensure best performance in models, hyperparameter tuning was carried out by means of a validation stage. Training dataset was split into a pre-training dataset and a validation dataset using a k-fold cross-validation approach. During the validation stage, a 10-fold cross-validation was performed in order to tune the hyperparameters. The training dataset is randomly split into 10 subsamples. Each model (with new parameters) is trained using 9 of the folds as training data, and then, the model is validated using the remaining part of the data. After calculating the accuracy of the model, the process is repeated nine times, with each of the nine subsamples. This model (with new and tuned hyperparameters) is then presented to the validation dataset. This validation process is repeated using a range or grid of hyperparametres previously set. This validation stage ends when every hyperparameter set has been tested. Finally, the validated model (with the highest performance metrics) is tested with the ‘unseen’ test dataset.
Modelling techniques
We used one of the most common ANN algorithms, called the multi-layer feed-forward back propagation neural network (MLPBPNN). The aim of this algorithm is to reduce the expected error by looking for the best collection of relation weights (Ibrahim et al. 2022). The Neuralnet R software package was used to develop the ANN models. The hyperparameters adjusted in the validation stage correspond to the number of neurons in each of the two layers of the neural network and the learning rate. The caret R software package was used to develop this validation stage.
The support vector machine (SVM) is an algorithm that is designed to carry out generalisation tasks and estimate dependencies from finite spaces. Generally, SVMs are used for classification by mapping input data information into output data predictions by using separation hyperplanes. The term SVR is used to refer to an SVM that is used in regression and prediction tasks. SVR develops a linear regression based on the trajectory of the hyperplane. This technique transforms input data in a space with higher dimension characteristics where an optimal separation hyperplane is found (Ortiz-Lopez et al. 2022). The ANN SVR quality model is described in Figure 3(b). SVR is useful for high-dimensional and non-linear prediction problems. For non-linear problems, Kernel functions can be used (Ibrahim et al. 2022). SVR works by constructing two parallel hyperplanes around the main hyperplane. The distance between the hyperplanes is maximized in order to minimize prediction error. The Kernelab R software package was used to develop the SVR models. The Kernelab R software package was used to develop the SVR models. The hyperparameters adjusted in the validation stage correspond to sigma and cost . The e1,017 R software package was used to develop this validation stage.
Modelling scenarios
Different model scenarios were established as shown in Table 1. These scenarios (see first column) vary according to the modelling technique implemented (ANN or SVR), the RW quality parameter modelled (Tu, LogTu, or UV254) and the time lag applied to the input variables.
Scenario . | Variable to predict . | Modelling technique . | Input variables . |
---|---|---|---|
ANN_Tu_1 | Tu | ANN | |
ANN_Tu_2 | Tu | ANN | |
ANN_Tu_3 | Tu | ANN | |
ANN_Tu_4 | Tu | ANN | |
SVR_Tu_1 | Tu | SVR | |
SVR_Tu_2 | Tu | SVR | |
SVR_Tu_3 | Tu | SVR | |
SVR_Tu_4 | Tu | SVR | |
ANN_uv_1 | UV254 | ANN | |
ANN_uv_2 | UV254 | ANN | |
ANN_uv_3 | UV254 | ANN | |
ANN_uv_4 | UV254 | ANN | |
SVR_uv_1 | UV254 | SVR | |
SVR_uv_2 | UV254 | SVR | |
SVR_uv_3 | UV254 | SVR | |
SVR_uv_4 | UV254 | SVR |
Scenario . | Variable to predict . | Modelling technique . | Input variables . |
---|---|---|---|
ANN_Tu_1 | Tu | ANN | |
ANN_Tu_2 | Tu | ANN | |
ANN_Tu_3 | Tu | ANN | |
ANN_Tu_4 | Tu | ANN | |
SVR_Tu_1 | Tu | SVR | |
SVR_Tu_2 | Tu | SVR | |
SVR_Tu_3 | Tu | SVR | |
SVR_Tu_4 | Tu | SVR | |
ANN_uv_1 | UV254 | ANN | |
ANN_uv_2 | UV254 | ANN | |
ANN_uv_3 | UV254 | ANN | |
ANN_uv_4 | UV254 | ANN | |
SVR_uv_1 | UV254 | SVR | |
SVR_uv_2 | UV254 | SVR | |
SVR_uv_3 | UV254 | SVR | |
SVR_uv_4 | UV254 | SVR |
Scenario names that contain the number 1 use non-time-lagged input data for both rainfall accumulation and river flow rates. This means that these scenarios are not predictive models but are only reference models. Scenario names that contain the numbers 2, 3, and 4 use time-lagged input data for both rainfall accumulation and river flow rates (the time lags varied). The time lags were selected as a function of , as indicated in the input variable column in Table 1. When the highest rs value was obtained for a given correlation, this is denoted in the table as . For example, PS.1_rs_max corresponds to the rainfall accumulation in PS.1 with the optimal time lag, meaning that this scenario provided the best correlation between rainfall accumulation at station PS.1 (Figure 1) and RW Tu. Significantly rainfall data with lower and higher time lags are denoted as and , respectively, compared to the rainfall data with optimal time lag.
Model performance metrics
The performance metrics used to evaluate model performance are described below. The R2 metric was used to calculate the percentage of the total variance of the observed data that is explained by the model. R2 values range between 0 and 1 (Rajaee et al. 2020). The coefficient of Nash–Sutcliffe efficiency (NSE) was also used, of which the values range between minus infinity and 1, with 1 representing a perfect fit. An efficiency of less than zero indicates that using the mean value of the observed data is a better predictor than the model (Nash & Sutcliffe 1970). Root mean square error (RMSE) provided us with the standard deviation of the model prediction error. We chose RMSE because it is able to show several deviations, is reliable, and provides high weight to error. Lastly, mean absolute error (MAE) was used to show the absolute difference between the actual and predicted values. However, this only provides information about the extent of the error and not the model validity.
Rainfall and river peak flow event analysis
Although statistical analyses can establish links between input and output variables, they might conceal some disparities between events and oversimplify complex phenomena. The qualitative analysis of a few events can partly reveal these disparities and complexities. We, therefore, analysed the main characteristics of a few rainfall events in PS.3 and river flow peaks in QS.1, as well as their relationships with changes in RW quality. PS.3 presents the highest correlation RW quality variables and it is also centred-placed in the watershed. Due to those, we considered the PS.3 rainfall gauge station to generally represent all of the rainfall events registered at the other rainfall gauge stations. QS.1 was selected because it is the closest river flow meter station to the RW sampling point at the DWTP.
Rainfall event analysis
Rainfall classification systems are based on several features of rain events, such as physical properties, meteorological conditions, or impacts (Breugem et al. 2020). We started by identifying every single rainfall event at each precipitation station (PS.1–PS.5) by considering the following characteristics: duration, accumulation, intensity, antecedent dry time, and posterior dry time. An algorithm was run through the rain data series from the first observation. First, the algorithm identified the observations with values higher than . Then, the observations with values lower than were identified. The time difference between and was the duration of the rainfall event. The sum of rainfall values within a specific is the rainfall accumulation . The rainfall intensity was obtained by dividing by . The antecedent dry time of a rainfall event was calculated by subtracting the of that rainfall event from the of the previous rainfall event. The posterior dry time of a rainfall event was calculated by subtracting the of the next rainfall event from the of the present rainfall event. According to this methodology, we selected four rainfall events for analysis with subsequent changes in RW quality. The rainfall events and RW quality parameter plots and the rainfall characteristics might provide valuable information that can be used to understand model input variable selection.
River peak flow event analysis
The river flow rate can be considered an ‘integrative’ variable of watershed hydrological phenomena (Singh 2017). This is because river flow is the hydrographic response to different rain-induced phenomena that occur in the watershed. Here, we aimed to show the impacts of river flow changes on RW quality parameters. Flow stations QS.1 and QS.2 were selected for this analysis because they are located relatively near the DWTP water intake. Because our case study is in an area with a temperate climate, there are two major periods that can affect flow rate variation, and therefore, our analyses. In late winter/early spring, river flows are affected by melting snow and ice jams. This period presents high river flow rates. In late spring, summer and fall, river flows are principally determined by rain and base flow (subsurface and underground flow). We divided the entire flow rate time series (QS.1 and QS.2) into two periods: (1) the melting snow (MS) period in May (QS.1MS and QS.2MS) and (2) the warm (W) period from June 1 to the end of the time series (QS.1W and QS.2W). The base flow is almost constant during the W period. We were mainly interested in linking river flow and RW quality during the summer period and the rainy season (W period), as during this period, changes in RW quality are more noticeable. A specific peak flow events were identified and selected in order to analyse their impacts on RW quality. Based on this methodology, we selected four peak flow events for analysis with subsequent changes in RW quality. Two of the mentioned peak flow events corresponded to events generated by two of the rainfall events from the previous section. The other two peak flow events were selected in order to represent more event diversity in the analysis. From river flow rate events and RW quality parameter plots, valuable information might be extracted to understand the model input variable selection.
RESULTS AND DISCUSSION
Time-lagged correlations between watershed rainfall and river flow time series and RW quality
Rainfall time series and RW quality
. | Mean . | Std. . | Tu . | PS.1 . | PS.2 . | PS.3 . | PS.4 . | PS.5 . |
---|---|---|---|---|---|---|---|---|
Dev. . | (NTU) . | (mm) . | (mm) . | (mm) . | (mm) . | (mm) . | ||
UV254 (m−1) | 30.41 | 10.47 | 0.53 | − 0.03 | 0 | − 0.04 | −0.03 | −0.01 |
Tu (NTU) | 9.14 | 15.87 | 0.02 | 0.04 | 0.03 | 0 | 0.01 | |
PS.1 | 0.11 | 0.78 | 0.57 | 0.52 | 0.45 | 0.32 | ||
PS.2 | 0.14 | 0.95 | 0.63 | 0.48 | 0.37 | |||
PS.3 | 0.15 | 0.80 | 0.56 | 0.41 | ||||
PS.4 | 0.15 | 0.81 | 0.49 | |||||
PS.5 | 0.14 | 0.80 |
. | Mean . | Std. . | Tu . | PS.1 . | PS.2 . | PS.3 . | PS.4 . | PS.5 . |
---|---|---|---|---|---|---|---|---|
Dev. . | (NTU) . | (mm) . | (mm) . | (mm) . | (mm) . | (mm) . | ||
UV254 (m−1) | 30.41 | 10.47 | 0.53 | − 0.03 | 0 | − 0.04 | −0.03 | −0.01 |
Tu (NTU) | 9.14 | 15.87 | 0.02 | 0.04 | 0.03 | 0 | 0.01 | |
PS.1 | 0.11 | 0.78 | 0.57 | 0.52 | 0.45 | 0.32 | ||
PS.2 | 0.14 | 0.95 | 0.63 | 0.48 | 0.37 | |||
PS.3 | 0.15 | 0.80 | 0.56 | 0.41 | ||||
PS.4 | 0.15 | 0.81 | 0.49 | |||||
PS.5 | 0.14 | 0.80 |
Correlations in bold are significant (p-value <0.05). PS, rainfall gauge station; Tu, turbidity in nephelometric turbidity units (NTU); UV254, UV absorbance at 254 nm.
UV254 has an almost-null or negative correlation with rainfall events from up to 2 days prior. From that point, cross-correlation plots CCF started to increase. The maximum UV254 correlation coefficient occurred at PS.3 with a time lag of . Figure 5 shows that two local maximums were observed for the cross-correlation plots. The first local maximum had an average time lag of 3 days and the second local maximum had an average time lag of 8 days. Time-lagged PS.3 and PS.5 presented the highest correlation coefficient (rs) with UV254. These results are consistent with those of previous studies by other authors. For instance, Volk et al. (2002) observed that UV254 (and DOC) concentrations increased after a rainfall event. They also observed that it could take up to 6 days for the UV254 and DOC concentrations to return to their initial values. Panton et al. (2020) observed that DOC concentrations in two different rivers were high for 2 days following rainfall before returning to base/initial values 10 days later.
River flow time series and RW quality
On the other hand, the flow rate of the main source river, measured at QS.1W (Figure 7(a)) at h, was weakly correlated with UV254. The maximum Spearman's rank correlation coefficient between the time-lagged river flow rate and UV254 was at . The correlation then decreased to for days. River flow rate was moderately correlated with UV254, with Spearman's rank correlation coefficients of at time lags between and 144 h. The flow rate of the secondary (affluent) source river, measured at QS.2W (Figure 7(b)) at h, had a low correlation with UV254. The maximum Spearman's rank correlation coefficient between time-lagged river flow rate and UV254 was at . The correlation then decreased to for days. The river flow rate was moderately correlated with UV254, with Spearman's rank correlation coefficients of at time lags between and 144 h. These results show that the maximum values for UV254 did not occur simultaneously with increases in river flow. On average, peaks in UV254 were observed between 2 and 4 days after river flow peaks (Figure 7).
RW quality modelling: input variable scenarios and model evaluation
The results from the previous section show the empirical relationships between both rainfall and river flow rates and changes in RW quality parameters Tu and UV254. These relationships experienced delays in time and their magnitudes were calculated using correlation coefficients. We therefore chose to apply rainfall and river flow rate variables that were time-lagged for the model input variables. As mentioned in Section 2.3, four modelling scenarios were proposed. Each of the scenarios was developed to model both Tu and UV254 using ANN and SVR.
Scenarios 1 in Table 1 (ANN_Tu_1, ANN_uv_1, SVR_Tu_1, and SVR_uv_1) are not predictive scenarios because there is no time lag applied to the input data (PS.1–5 and QS.1–2) to produce Tu or UV254 output data. The input data for scenarios that included 2 (ANN_Tu_2, ANN_uv_2, SVR_Tu_2, and SVR_uv_2) for Tu modelling (Table 1) are composed of seven independent time-lagged input variables (time-lagged PS.1–5 and QS.1–2). The first five input variables are the time-lagged PS.1–5 values at the maximum correlation coefficient between rainfall and RW quality data. The last two input variables are the time-lagged QS.1 and QS.2 values at the maximum correlation coefficient between river flow data and the RW quality data. The input data for scenario 2 for UV254 modelling (Table 1) are composed of 12 independent time-lagged input variables (time-lagged PS.1–5 and QS.1–2). The first five input variables are the time-lagged PS.1–5 values at the first local maximum correlation coefficient between rainfall and RW quality data. The following five input variables are the time-lagged PS.1–5 values at the second local maximum correlation coefficient between rainfall and RW quality data. This is based on results shown in Figure 5(b) that indicate that cross-correlation plots between rainfall events (PS.1, PS.2, PS.2, PS.5) and UV254 have two local maximums for every rain gauge (except for QS.4 with a third further lag). Globally, cross-correlations between UV254 and lagged rainfalls are low and smaller than those between UV254 and river flow. This suggests that rainfall series are not as effective predictors for UV254 as river flow series. Keeping these observations in mind, only those predictors with time lags represented by the two local maxima were included in the predictor group. The last two input variables are the time-lagged QS.1–2 values at the maximum correlation coefficient between the river flow data and the RW quality data. Tables 3 and 4 present several performance metrics for both the ANN and SVR models for every input data scenario predicting RW quality parameters Tu and UV254. In general, ANN provided better predictive results than SVR for both Tu and UV254.
Scenario . | . | . | NSEcal . | NSEtest . | RMSEcal . | RMSEtest . | MAEcal . | MAEtest . |
---|---|---|---|---|---|---|---|---|
(NTU) . | (NTU) . | (NTU) . | (NTU) . | |||||
SVR_Tu_1 | 0.398 | 0.368 | 0.14 | 0.12 | 14.36 | 8.22 | 4.75 | 1.53 |
SVR_Tu_2 | 0.820 | 0.860 | 0.67 | 0.63 | 0.11 | 0.11 | 0.07 | 0.08 |
SVR_Tu_3 | 0.779 | 0.783 | 0.60 | 0.60 | 0.24 | 0.13 | 0.17 | 0.06 |
SVR_Tu_4 | 0.676 | 0.657 | 0.44 | 0.41 | 0.28 | 0.16 | 0.22 | 0.07 |
ANN_Tu_1 | 0.837 | 0.747 | 0.70 | 0.59 | 8.50 | 5.80 | 3.80 | 1.40 |
ANN_Tu_2 | 0.860 | 0.870 | 0.73 | 0.75 | 0.10 | 0.09 | 0.07 | 0.06 |
ANN_Tu_3 | 0.862 | 0.798 | 0.74 | 0.62 | 0.19 | 0.13 | 0.13 | 0.05 |
ANN_Tu_4 | 0.785 | 0.718 | 0.62 | 0.52 | 0.23 | 0.15 | 0.17 | 0.06 |
Scenario . | . | . | NSEcal . | NSEtest . | RMSEcal . | RMSEtest . | MAEcal . | MAEtest . |
---|---|---|---|---|---|---|---|---|
(NTU) . | (NTU) . | (NTU) . | (NTU) . | |||||
SVR_Tu_1 | 0.398 | 0.368 | 0.14 | 0.12 | 14.36 | 8.22 | 4.75 | 1.53 |
SVR_Tu_2 | 0.820 | 0.860 | 0.67 | 0.63 | 0.11 | 0.11 | 0.07 | 0.08 |
SVR_Tu_3 | 0.779 | 0.783 | 0.60 | 0.60 | 0.24 | 0.13 | 0.17 | 0.06 |
SVR_Tu_4 | 0.676 | 0.657 | 0.44 | 0.41 | 0.28 | 0.16 | 0.22 | 0.07 |
ANN_Tu_1 | 0.837 | 0.747 | 0.70 | 0.59 | 8.50 | 5.80 | 3.80 | 1.40 |
ANN_Tu_2 | 0.860 | 0.870 | 0.73 | 0.75 | 0.10 | 0.09 | 0.07 | 0.06 |
ANN_Tu_3 | 0.862 | 0.798 | 0.74 | 0.62 | 0.19 | 0.13 | 0.13 | 0.05 |
ANN_Tu_4 | 0.785 | 0.718 | 0.62 | 0.52 | 0.23 | 0.15 | 0.17 | 0.06 |
Bold values represent the scenario that obtained the highest performance metrics in turbidity modeling for both the ANN and the SVR.
Scenario . | . | . | NSEcal . | NSEtest . | RMSEcal . | RMSEtest . | MAEcal . | MAEtest . |
---|---|---|---|---|---|---|---|---|
(m−1) . | (m−1) . | (m−1) . | (m−1) . | |||||
SVR_uv_1 | 0.443 | 0.424 | 0.16 | 0.15 | 9.71 | 5.80 | 8.15 | 2.86 |
SVR_uv_2 | 0.710 | 0.680 | 0.48 | 0.45 | 0.16 | 0.18 | 0.14 | 0.16 |
SVR_uv_3 | 0.598 | 0.590 | 0.32 | 0.32 | 8.72 | 5.18 | 7.34 | 2.52 |
SVR_uv_4 | 0.607 | 0.602 | 0.34 | 0.33 | 8.61 | 5.13 | 6.63 | 2.28 |
ANN_uv_1 | 0.593 | 0.569 | 0.35 | 0.32 | 8.61 | 5.10 | 6.70 | 2.32 |
ANN_uv_2 | 0.740 | 0.730 | 0.54 | 0.52 | 0.16 | 0.17 | 0.12 | 0.13 |
ANN_uv_3 | 0.753 | 0.684 | 0.56 | 0.46 | 7.01 | 4.60 | 5.22 | 1.93 |
ANN_uv_4 | 0.694 | 0.689 | 0.48 | 0.47 | 7.71 | 4.50 | 5.45 | 1.82 |
Scenario . | . | . | NSEcal . | NSEtest . | RMSEcal . | RMSEtest . | MAEcal . | MAEtest . |
---|---|---|---|---|---|---|---|---|
(m−1) . | (m−1) . | (m−1) . | (m−1) . | |||||
SVR_uv_1 | 0.443 | 0.424 | 0.16 | 0.15 | 9.71 | 5.80 | 8.15 | 2.86 |
SVR_uv_2 | 0.710 | 0.680 | 0.48 | 0.45 | 0.16 | 0.18 | 0.14 | 0.16 |
SVR_uv_3 | 0.598 | 0.590 | 0.32 | 0.32 | 8.72 | 5.18 | 7.34 | 2.52 |
SVR_uv_4 | 0.607 | 0.602 | 0.34 | 0.33 | 8.61 | 5.13 | 6.63 | 2.28 |
ANN_uv_1 | 0.593 | 0.569 | 0.35 | 0.32 | 8.61 | 5.10 | 6.70 | 2.32 |
ANN_uv_2 | 0.740 | 0.730 | 0.54 | 0.52 | 0.16 | 0.17 | 0.12 | 0.13 |
ANN_uv_3 | 0.753 | 0.684 | 0.56 | 0.46 | 7.01 | 4.60 | 5.22 | 1.93 |
ANN_uv_4 | 0.694 | 0.689 | 0.48 | 0.47 | 7.71 | 4.50 | 5.45 | 1.82 |
Bold values represent the scenario that obtained the highest performance metrics in UV modeling for both the ANN and the SVR.
A quick look at the performance metrics in Tables 3 and 4 indicates that, when using the same models, the prediction model for turbidity is more accurate than the model for UV absorbance. Scenarios 1 (SVR_uv_1, SVR_tu_1, ANN_uv_1, and ANN_tu_1 in Table 1) are not useful for predicting RW quality deterioration due to rainfall events or flow peak events. However, their weak performance metrics confirm that it is important to consider time lags between climatic or hydrological events and changes in RW quality. Indeed, simultaneous rainfall and flow peak data do not provide enough information to estimate Tu or UV254.
Scenarios 2, 3, and 4 are predictive scenarios since they use time-lagged rainfall events and river flow peaks as input data. For scenario 2, these time lags were calculated using the methodology described in Section 2.3. The scenarios that used input variables with the highest time-lag correlation coefficient values resulted in the highest R2 and NSE performance metrics in both calibration and validation stages, as shown in Tables 3 and 4.
Figure 8 shows the scatter plots between the observed UV254 and predicted UV254 values generated by SVR and ANN for scenario 2. ANN once again performed better than SVR . As shown in Figures 5 and 7, cross-correlation plots between both rainfall events and river flow rates and water UV254 did not reveal maximum values at specific time lags (hours). There were at least two maximum cross-correlation coefficients between rainfall events and UV254 (Figure 5). Additionally, cross-correlation plots between river flow peaks and UV254 show an extended period of time lags where correlations are near their maximums (Figure 7). The SVR and ANN models demonstrated that input data with those time lags (at maximum correlations) resulted in models with acceptable predictive capabilities.
Analysis of RW quality variations after rain and flow peak events
As mentioned in Section 2.4, empirical analyses may help establish relationships between input variables such as rainfall and river flow rate data and RW quality output variables such as Tu and UV245. These relationships can be useful to model and predict changes in RW quality variables. However, the models presented herein were not completely accurate when making predictions. As shown in Tables 3 and 4, the performance metrics results suggest that the input data are not always enough to generate accurate output data. In Figure 9, we observed that some RW peak events were not accurately predicted. This suggested that an analysis of individual events might be useful to complement the model results. Specific rainfall events, river flow rate events, and Tu and UV254 parameters were therefore investigated to support the modelled results. These analysed events also led to the overestimation and underestimation of events by the machine learning models discussed in the previous section.
Effects of rainfall events on RW quality
During the data collection period (see Section 2.2), there were between 76 and 92 rainfall events at each of the meteorological stations. Most of the rainfall events were of short duration, typically 1–3 h. At PS.3, most of the rainfall events had accumulations between 1 and 4 mm. Seven rainfall events had more than 20 mm of accumulation. The intensities for most of the rainfall events were 1–4 mm/h. According to Breugem et al. (2020), those rainfall event intensities are considered light to moderate. Eleven rainfall events were of intensities higher than 5 mm/h, which are considered heavy to very heavy. In our data, rainfall intensities in the 90th percentile were 5.2, 5.8, 5.1, 4.9, and 4.8 mm/h at PS.1, PS.2, PS.3, PS.4, and PS.5, respectively. Rainfall accumulations in the 90th percentile were 12.6, 15.3, 16.4, 16.3, and 12.3 mm at PS.1, PS.2, PS.3, PS.4, and PS.5, respectively. For instance, in PS.3, five events recorded intensities higher than 5.1 mm/h and another five had accumulations higher than 16.4 mm. The rainfall events and their main characteristics are listed in Table 5.
Rainfall event . | Start date and time . | Duration (dr) . | Rain accumulation (ar) . | Intensity (ir) . | Antecedent dry time (adtr) . | Maximum hourly precipitation . |
---|---|---|---|---|---|---|
(h) . | (mm) . | (mm/h) . | (h) . | (mm) . | ||
R1 | 2017-05-26 00:00 | 7 | 23.0 | 3.3 | 158 | 6.4 |
R2 | 2017-06-23 22:00 | 3 | 30.4 | 10.1 | 7 | 19.2 |
R3 | 2017-08-05 14:00 | 3 | 28.2 | 9.4 | 60 | 13.2 |
R4 | 2017-08-22 12:00 | 2 | 10.0 | 5.0 | 14 | 6.0 |
Rainfall event . | Start date and time . | Duration (dr) . | Rain accumulation (ar) . | Intensity (ir) . | Antecedent dry time (adtr) . | Maximum hourly precipitation . |
---|---|---|---|---|---|---|
(h) . | (mm) . | (mm/h) . | (h) . | (mm) . | ||
R1 | 2017-05-26 00:00 | 7 | 23.0 | 3.3 | 158 | 6.4 |
R2 | 2017-06-23 22:00 | 3 | 30.4 | 10.1 | 7 | 19.2 |
R3 | 2017-08-05 14:00 | 3 | 28.2 | 9.4 | 60 | 13.2 |
R4 | 2017-08-22 12:00 | 2 | 10.0 | 5.0 | 14 | 6.0 |
Figure 10 clearly shows the variations in Tu and UV254 levels after rainfall. After the maximum precipitation event, Tu levels peaked at 52 h for R1, 33 h for R2, 55 h for R3, and 45 h for R4. However, UV254 peaked at 97 h for R1, 69 h for R2, 55 h for R3, and 188 h for R4. Moreover, there were fewer UV254 peaks and smaller, slower variations (except after R1). According to the statistical analyses (Section 3.1), the highest correlation between rainfall events and Tu occurred at a time lag of 45 h, while there were two local maxima between rainfall events and UV254. The first local maximum was at around 96 h and the second was at around 180 h. In the cross-correlation analysis, we identified weak correlation coefficients between rainfall events and RW quality parameters. This could indicate that in our case study, rainfall events by themselves are insufficient predictors of RW quality parameters such as Tu and UV254. In fact, as shown previously in Figure 5, correlation coefficients between rainfall data and RW quality parameters Tu and UV254 are low, with maximum values up to 0.2. Scenarios using only rainfall input data that are not shown here, resulted in lower performance.
Peak river flow events and RW quality variations
Whereas Tu maximums were observed several hours after the river flow peak (Figure 11(a), 11(c), and 11(d)), UV254 maximums were observed several days after (Figure 11(a)–11(d)). According to the statistical analyses (Section 3.1), the highest correlation between river flow peaks at QS.1 and Tu occurred with a time lag of 8 h. Meanwhile, the highest correlation between river flow peaks and UV254 was at around 62 h. According to the analysis of river flow peaks per individual event, there was moderate variability in Tu and UV254 responses following rainfall events. However, in the cross-correlation analysis, we found high correlation coefficients between river flow peaks and RW quality parameters Tu and UV254. This could indicate that river flow peak events are good predictors of those RW quality parameters. Compared to rainfall event analyses and cross-correlations, river flow rates showed better correlations with RW quality. In fact, river flow rate data showed high correlation coefficients with Tu and UV254. Correlation coefficients higher than 0.8 between QS.1 and Tu and higher than 0.56 between QS.1 and UV254 (Figure 7) are compatible with results from Figure 11. We also tested scenarios with only river flow rate input data (not shown here), which showed a moderately high performance.
Discussion
This section discusses the physical phenomena that determine the presence of NOM and other particles in source water during rainfall. By considering these phenomena, we hope to better understand the statistical relationships between the variables and the performance of machine learning models.
Data analysis and the modelling process showed that Tu and UV254 have different behaviours after rainfall events and river flow peaks. This indicates that there are different transport mechanisms in the watershed that modify particles and dissolved NOM in RW after rainfall. Rain generates surface (directly) and subsurface (indirectly) runoff that generally trigger flow peaks in rivers and groundwater runoff maintain base flow (Chow 1988). Particles and NOM are transported from the watershed and into streams through several mechanisms that, over time, lead to RW deterioration. From a purely empirical point of view, our results indicate that rainfall and river flow have different correlation values with RW quality parameters. Rainfall time series present low correlation values with RW parameters, despite rain being the primary cause of RW quality deterioration in rivers. The diversity of pollutant transport processes, which depend on a multitude of factors such as rainfall intensity and antecedent soil conditions, could explain the low correlation values.
Observations based on the results from our study might be associated with antecedent soil moisture conditions that may have different impacts on the erosion resistance of sediments. Indeed, it has been suggested that pre-wetted soil materials can erode at faster rates (Lawler et al. 1998). Rainfall events R2 and R4 (presented in Figure 10(b) and 10(d)) only had between 7 and 14 h of antecedent dry time. The soil during those events could therefore be considered pre-wetted. Tu peaks were observed between a few hours and a couple of days after the hour of maximum rainfall. In contrast, turbidity peaks in events R1 and R3 started 2 days after the hour of maximum rainfall. Events R1 and R3 had the driest soil.
In R1 and R3, we observed pre-peaks and sustained turbidity values on the recession limb. These pre-peaks (called first-flashings in hydrology) and sustained values on the recession limb have also been observed by Lawler et al. (2006). Rainfall intensity also affects the degree of turbidity. As shown by Lee et al. (2016), intense rainfall may induce erosion and lead to high levels of turbidity runoff flowing into the river. The runoff is caused by the remobilization and resuspension of sediments from sediment beds and recently eroded sediment from different sources in the watershed (Chen & Chang 2019). Our results were consistent with these findings and showed that an intense rainfall event (R2 in Figure 10(b)) generated a high river flow peak and consequently, a high Tu value. The time lag between the rainfall event and the Tu peak was several hours, whereas the river flow rate peak and Tu peak were almost simultaneous. This confirms that intense and extreme rainfall events can generate rapid and high turbidity peaks due to resuspension of sediment from sediment beds. Small rainfall events can also increase turbidity levels due to sediment build-up on topsoil (Chen & Chang 2019). Moderate-intensity rainfall events (R4 and R2 in Figure 10) generated moderate river flow peaks and consequently, moderate to low values of Tu during summer (BF period). For events R1 and R3, there were sustained Tu values on the recession limb. On the other hand, relatively high correlation coefficients were observed between time-lagged river flow rate and turbidity, while mild correlation coefficients were observed between the time-lagged river flow rate and UV254. This could be explained by the fact that the river flow at the sampling point might have already been affected by the rainfall–runoff phenomena from the upstream watershed. There is a link between river flow rate and contaminant (particles and dissolved NOM) transportation into the river. As mentioned, rainfall over watersheds leads to superficial runoff that contributes to increases in river flow rate. This superficial runoff (causing the erosion and transport of soil materials) and the direct rainfall over the water stream (causing remobilization and resuspension) are the main mechanisms that generate increased turbidity.
Machine learning methods can describe the relationships between appropriate input variables (lagged rainfall and river flow series) as predictors and output variables (turbidity and UV254). It has been demonstrated that these input variables are related to the output variables as mentioned above. The ANN model has shown a better performance in modelling and forecasting RW turbidity and RW UV254 than the SVM model, as observed in the performance metrics in Tables 3 and 4. Lee et al. (2016) demonstrated high correlations between daily flow rate and turbidity during extreme rain events. Rajaee et al. (2009) created machine learning models to predict water turbidity using different combinations of input variables, such as river flow rates from 1, 2, and 3 days before the event. These demonstrated better model performance using river flow rate data from 1-day before. This trial-error method, used with daily time steps, allows identifying the best delay time, but represents a higher computational cost. Cisty et al. (2021) used hydrological and climate data in machine learning models for modelling suspended sediments in rivers. They used river flows from the given and the previous day but rainfall information from previous days could also provide information about water turbidity. We have proposed herein a methodology for systematically selecting time-lagged watershed rainfall and river flow as RW quality predictors. Our results using this novelty methodology for selecting rainfall and time-lagged river flow rate data are compatible with the previous modelling studies described above.
According to the statistical analysis in Section 3.1, time-lagged rainfall and RW UV254 showed a weak correlation. Although some authors have reported changes in RW UV254 after rainfall, those changes were shown to appear hours to several days later. Maximum cross-correlation values were recorded almost a week after rainfall events. This suggests that rainfall event time series are not good predictors of RW UV254, even if they do trigger RW quality variations that include UV254. Nevertheless, as presented in Section 3.1, river flow peaks have a moderate correlation with RW UV254, with correlations higher than 0.5 when applying time lags from 2 to 4 days. When using time-lagged river flow rate data, the model performance metrics reported in Section 3.2 were the highest, with the highest cross-correlation coefficients between input and output data. Panton et al. (2020) showed that DOC concentrations increased with increased river flow rates. Flows above base flow rivers were positively correlated with DOC concentrations. Those DOC concentrations decreased progressively as the proportion of river base flow increased. This can be observed in Figure 11(a), 11(c), and 11(d), where UV254 recessing limbs decreased while river base flow amounts recovered to usual values.
UV254 values and the time lag between rainfall events and UV254 maximums could be affected by rainfall intensity. For instance, maximum UV254 values were time lagged between 2 and 3 days after the hour with the highest rain accumulation during events R2 and R3, which had relatively intense rainfall. However, R2 generated a higher maximal UV254 value compared to R3. This could be due to R2 having had only a few hours of antecedent dry time while R3 had several days. Furthermore, maximum UV254 values were time lagged for more than 4 days after the hour with the highest rain accumulation during events R1 and R4, which had moderate rainfall intensity. R1 generated a maximal UV254 value that was slightly higher than R4. Rainfall intensity could also affect the recovery time of UV254 levels in the water. The recovery time can be defined as the time between the start of the rising limb and the end of the declining limb. For high-intensity events (R2 and R3), recovery times were roughly less than 6 days, whereas for moderate-intensity events (R1 and R4), recovery times were longer than 1 week.
CONCLUSION
In this study, we developed an approach for modelling relevant RW quality parameters in RW that influence drinking water treatment (DWT) based on rainfall and river flow rate data. This approach included statistical analyses to select the best rainfall and river flow data predictors to use as model input variables.
The selection of the above-mentioned predictors was based on the highest cross-correlation coefficients between both time-lagged rainfall events in the watershed and the river flow rate and RW quality time series. The highest correlation between rainfall and turbidity was found at around 2-day lag, whereas between rainfall and UV254 was found at around 4 and 10-day lag. Regarding the relationships between river flow and turbidity, the highest correlation was found at 8-h lag for QS.1 and at 13-h lag for QS.2. Moreover, the highest correlation between river flow and UV254 was found at 62-h lag for QS.1 and at 2 and 5-day lag for QS.2. Then, two machine learning models (ANN and SVR) were tested using several input data scenarios from the generated database. The scenarios that used time lags for both rainfall and river flow data with the highest cross-correlation coefficients yielded the highest model performance metrics. This is because empirically, time-lagged input data was more correlated with output data. This was observed for model results for all the proposed scenarios.
Following rainfall, particles (turbidity) are transported from the watershed into the river through mainly surface runoff. Intense rain also generates remobilization and resuspension, further increasing water turbidity. Water turbidity behaviour depends on hydrology and rainfall characteristics such as intensity, as well as antecedent dry time and soil moisture, among others. This could explain the low performance metrics when only rainfall data were used as the input data in the models. The transport of NOM from the watershed and into the river after rainfall is even more complex than Tu. This is reflected by the generally lower UV254 model performance metrics compared to Tu model performance metrics. Alternately, river flow rate data can function as a watershed integrator variable and enhance the performance metrics of RW quality models.
To the best of our knowledge, our methodology is novel in considering the intersections of hydrology and drinking water quality. Our method uses cross-correlations to associate both rainfall events and river flow events with the most relevant RW quality parameters that influence water treatment at DWTPs. This approach allows the most accurate time lags with the highest cross-correlation coefficients between both rainfall and river flow rates and RW quality parameters to be estimated and then used as input data in the RW quality model. Our methodology could make it possible to associate both rain events and river flow events with water quality at the DWTP intake for any case study. DWTPs could use our results to help determine the best strategies for adjusting chemical dosing for the removal of key contaminants such as Tu and NOM. Future research should focus on creating decision-making tools that use the predicted variability of Tu and NOM (based on the methodology herein) to select the most appropriate chemical doses (e.g., coagulants) for particle and NOM removal from RW. Novel cutting-edge and more accurate models could be constructed using input data selected with the methodology herein proposed. Additionally, robustness analysis could be conducted in order to establish the appropriate number or rain gauge or flow gauge stations. Finally, using the proposed methodology, new methods to predict coagulant dosages in DWTP should be proposed. With these tools, DWTP operators can make preventative decisions, especially during periods of rain that will ensure the production of safe drinking water.
ACKNOWLEDGEMENTS
This study was funded by the NSERC (Natural Sciences and Engineering Research Council of Canada) Drinking Water Research Chair at Laval University, whose main partners are the municipalities of Quebec City and Lévis, Avensys Solutions, Agiro and WaterShed Monitoring. The authors are thankful to Crayon-Bleu for English proofreading of the paper.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.