Rainfall and increased river flow can deteriorate raw water (RW) quality parameters such as turbidity and UV absorbance at 254 nm. This study aims to develop a methodology for integrating both time-lagged watershed rainfall and river flow data into machine learning models of the quality of RW supplying a drinking water treatment plant (DWTP). Spearman's rank non-parametric cross-correlation analyses were performed using both river flow and rain in the watershed and RW data from the water intake. Then, RW turbidity and RW UV254 were modelled, using a support vector regression (SVR) and an artificial neural network (ANN) under several prediction scenarios with time-lagged variables. River flow presented a very strong correlation with RW quality, whereas rainfall showed a moderate correlation. Time lags with maximum correlations between flow data and turbidity were a few hours, while for UV254, they were between 2 and 4 days, demonstrating varied time lags and a complex behaviour. The best performing scenario was the one that used time-lagged watershed rainfall and river flow as input data. The ANN performed better for both turbidity and UV254 than SVR. Results from this study suggest the possibility for new modelling strategies and more accurate chemical dosing for the removal of key contaminants.

  • A methodology for selecting the best raw water (RW) modelling predictors from rainfall and river flow data.

  • Estimating cross-correlations between both rainfall events and river flow rates and RW quality parameters could be useful to select input data in RW modelling.

  • RW predictive models could help drinking water treatment plants to anticipate necessary adjustments in the treatment operating conditions.

Surface water contains elements such as pathogens, particles, and organic matter. These elements can be undesirable for human consumption, act as precursors of toxic compounds, or can interfere in water treatment processes and must therefore be removed in drinking water treatment plants (DWTPs) (Health Canada 2002). Rainfall leads to increases in these contaminants in source waters, through the transport of particles and organic and inorganic dissolved substances through surface and subsurface runoff (Singh 2017). Hydrological responses to rainfall depend on the physical characteristics of watersheds (Pratt & Chang 2012; Delpla & Rodriguez 2014; Shaw et al. 2017). Rainfall characteristics such as quantity, duration, intensity, and frequency also play a role (Musy & Higy 2010), as well as soil saturation conditions before rainfall. Additionally, deterioration of source water quality can be exacerbated by non-point and point source pollution due to anthropogenic activity within watersheds (Delpla & Rodriguez 2014), such as sewer overflows from rainfall events and agricultural pollution.

Sudden changes in raw water (RW) quality following rainfall require modifications to water treatment operations. For example, the dosage of coagulant and disinfectant may need to be adjusted (Edzwald 2011; Crittenden et al. 2012) due to increased fine particle (Khan et al. 2015) and natural organic matter (NOM) concentrations (Eikebrokk et al. 2004; Delpla & Rodriguez 2014, 2016; Khan et al. 2015). Modelling and predicting RW quality after rainfall could help DWTPs anticipate these adjustments.

Recent studies have aimed to model and evaluate the impacts of rainfall events on water quality (Rajaee et al. 2020; Tiyasha & Yaseen 2020; Ortiz-Lopez et al. 2022). For instance, Bastaraud et al. (2020) observed, by means of clustering analysis and daily water sampling at different points in the drinking water distribution network, that the variability of microbiological parameters was affected during periods of rain. Jia et al. (2021) estimated the impacts of rainfall events on urban lakes. They observed that water quality (total phosphorus, dissolved oxygen, and total suspended solids) during the rainy season was poorer than during the dry season. Cisty et al. (2021) observed that the more saturated the watershed soil, the more sediments are transported into the river, demonstrating the impacts of rainfall on DWTPs. They also demonstrated that time-lagged series of rainfall could provide valuable information about the presence of sediments in rivers. However, no method for selecting the time lag was mentioned. Wang et al. (2022) used streamflow for predicting chemical dosage in a DWTP using machine learning algorithms with daily data. They used day-lag cross-correlations between flow and various RW quality parameters to establish modelling. However, in this study, neither method for selecting the time of lag was mentioned. Most of the studies mentioned above were based on sampling campaigns that were conducted in source waters (rivers and lakes) following rainfall events. In a previous work, Ortiz-Lopez et al. (2022), however, reviewed the models for surface RW quality that were available in the literature and that could be applied for DWTP decision-making. These authors concluded that although the available models are useful, they have various limitations. These include (a) prediction time steps that are too large (e.g., daily predictions), making it difficult for short-term decision-making in DWTPs, (b) output variables that are not relevant to DWTP decision-making indicators, such as environmental and ecological integrity (dissolved oxygen, phosphorous, chloride, etc.), and (c) the absence of a systematic methodology for choosing input variables, in particular when considering time series and time lags.

In this paper, we predict two relevant RW quality parameters associated with DWTP decision-making, turbidity (Tu) (as indicator of particles) and UV absorbance at 254 nm (UV254) (as indicator of NOM). These crucial parameters are used by DWTP operators to adjust treatment operations (Edzwald 2011) and are highly impacted following rainfall (Baurès et al. 2013; Ortiz-Lopez et al. 2022). We propose a data-driven (machine learning) approach for modelling the quality of the RW supplying DWTPs using watershed rainfall and river flow rates as input variables. We also propose a methodology for determining the best time lags for input variables using Spearman's rank cross-correlation coefficient with hourly time steps. Our novel methodology uses hourly rainfall, the river flow rate and to first select model input data. These input data are then used to model Tu and UV254. Finally, we do not aim to establish herein a new model method but a general methodology for selecting the best time-lagged rain watershed and river flow series as RW quality predictors. For that reason, we used simple machine learning methods in order to test whether our proposed methodology improves RW quality modelling. To the best of our knowledge, our research is the first to propose a modelling procedure for RW quality that considers the need of DWTP operators to react to sudden water quality variations. This is possible through the use of cost-effective data (for input and output variables) that are gathered and modelled based on time steps that most accurately represent RW quality variability after rainfall.

Case study

This study was carried out in the Chaudière River watershed, in Quebec, Canada. Chaudière River is located in southern Quebec. Its source is Mégantic Lake (near the US border) and it flows 185 km into the Saint Lawrence River. The Chaudière River watershed covers an area of 6,714 km2. The form factor (Rf) of this watershed is 0.19 (a small value), corresponding to an elongated watershed. The Chaudière watershed is divided into three main sections, the low, middle, and high watersheds, which are 981, 2,656, and 3,075 km2, respectively (see Figure 1). Most of the watershed has a moderate, sub-humid climate with a long growing season, hot summers, and cold winters. The average annual air temperature is 3.8 °C with a maximum average of 25 °C and a minimum average of −15 °C. The average annual precipitation is 1,050 mm/yr (COBARIC 2014).
Figure 1

Chaudière River watershed. The drinking water treatment plant and water intake (water sampling point) are located in the low watershed. The river flow meter stations are located on the Chaudière River (QS.1) and Beaurivage River (QS.2). Five rain gauges (PS.1–PS.5) are distributed throughout the watershed, along the Chaudière River.

Figure 1

Chaudière River watershed. The drinking water treatment plant and water intake (water sampling point) are located in the low watershed. The river flow meter stations are located on the Chaudière River (QS.1) and Beaurivage River (QS.2). Five rain gauges (PS.1–PS.5) are distributed throughout the watershed, along the Chaudière River.

Close modal

The municipal DWTP in our study is located on the low watershed of the Chaudière River near its river mouth in the Saint Lawrence River. The Chaudière River lower watershed is strongly influenced by the Beaurivage River, a tributary of the Chaudière River. The DWTP supplies about 35,000 inhabitants in the City of Lévis. The land use in the Chaudière River watershed is 68% forest, 23% agricultural, 4% urban and residential, and 5% other (COBARIC 2014).

Data

We collected RW quality data from the water intake at the DWTP. The water intake is located in the Chaudière River, 6 km upstream from the mouth in the Saint Lawrence River (Figure 1). The DWTP is equipped with a continuous analyzer probe for several water quality parameters, such as Tu, conductivity (Co), pH, and temperature (T). Average hourly Tu data were extracted from the DWTP data management system. RW turbidity was measured onsite by online analyzers (Turbidity: HACH Surface scatter). Additionally, RW UV absorbance was measured online with spectro::lyser™ UV spectrophotometer probes (s::can, Austria) installed by our research team at the RW intake. Continuous UV absorbance monitoring was conducted with a 35-mm path length probe (measuring range: 0–70 m−1). UV254 is considered to be an excellent surrogate parameter for estimating RW concentrations of dissolved NOM (Edzwald et al. 1985; Crittenden et al. 2012; Beauchamp et al. 2018). Data for RW quality and hydrological parameters were collected from 15 April to 14 October 2017. Around 4,392 observations were collected. Due to technical issues, 10.4 and 14.5% of observations were lost from the Tu time series and UV254 time series, respectively. Kalman smoothing by an ARIMA model (Hyndman & Khandakar 2008) was used to impute the missing data. Additionally, Tu interference with UV254 measurements was removed using a correction function that was previously developed by our research team for the same water source (Delpla et al. 2023).

Hydrological data for the Chaudière River watershed were also acquired for analysis. Hourly rainfall data were collected from five stations throughout the watershed (MELCC 2020). These hydrological stations are located along the Chaudière River and capture data on the rainfall regime in the watershed. Rainfall gauge stations are located in the low watershed in Charny (PS.1), in the middle watershed in Scott (PS.2), Saint-Séverin (PS.3), and Saint-Georges (PS.4), and in the high watershed in Saint-Ludger (PS.5) (see Figure 1). The stations are managed by the Environment and Climate Change Ministry of Quebec (MELCC). We also collected the hourly data from two flow meter stations: one at Chaudière River (QS.1), located in Saint-Lambert-de-Lauzon, and one at Beaurivage River (QS.2), located in Saint-Étienne (Figure 1). QS.1 is in the lower watershed, 20 km upstream from the water intake, at a point where the water has travelled 89% of the total length of the Chaudière River. QS.2 is also located in the lower watershed, 5.2 km upstream from the Chaudière River mouth (see Figure 1).

RW quality modelling methodology

Figure 2 displays the proposed methodology of the present study. The novel methodology proposed for selecting time-lagged both watershed rainfall and river flow data into machine learning models of the quality of RW will be explained in the subsequent subchapters. We included in this novel methodology, a rainfall and river peak flow analysis as a complement of modelling results. This part of the methodology is described in Section 2.4.
Figure 2

Framework of the proposed methodology.

Figure 2

Framework of the proposed methodology.

Close modal

Input variable choices from time-lag correlations

Rainfall induces various hydrological responses in watersheds, including the rate of river flow and variations in RW quality. There are always time lags between a rainfall event and the hydrological response, but these may vary from one watershed to another (Singh 2017). In order to predict and therefore anticipate RW quality variations after rainfall, historical data must be used as model input variables. The literature also suggests that a specific time lag should be identified and applied between the input and output variables. However, this is challenging, as there are no structured guidelines for determining the best time lag in the literature.

To identify the best input variables, we proposed a cross-correlation analysis between rainfall accumulation (input variables) and RW quality (output variables). The analysis consists of determining the relationships between time-lag rainfall and RW Tu or UV54 . The time lags (k) were tested from 0 (no time lag) to 240 h, at 1-h increments. The complete time series for these variables were considered for each correlation. The same analysis was performed between the river flow rates and the RW quality variables, as the river flow rate is an intermediate variable that could help predict RW quality.

The non-parametric Spearman's rank correlation coefficient (a special version of Pearson's coefficient) was used for the cross-correlation analysis. This index evaluates the associations between two sets of ranked scores (Chen & Popovich 2002). Typically, associations between pairs of variables can be evaluated using the unmodified Pearson's correlation coefficient (r). However, the required assumptions related to data distribution, homoscedasticity, linearity, and continuity were not met in our case. Spearman's rank correlation coefficient is suitable for assessing the strength and direction of a monotonic relationship between two variables, even if the relationship is not linear (Kottegoda & Rosso 2008).

The correlation was considered significant if rs was greater than , where n is the number of observations (Hyndman & Athanasopoulos 2018). Empirically, one could demonstrate that if cross-correlation with a k lag in a population is zero for , then when n is very large, will have a normal distribution ( and ). This is possible since approximately 95% of a normal population is within 2 standard deviations above or below the mean. We determined correlation coefficients to be significant at a p-value of 0.05. All analyses were conducted using the R software package (R Core Team 2022).

Data processing and splitting

Before implementing the modelling algorithms, both the input and output variables were processed. For Tu modelling, we observed that the distribution of Tu data was right-skewed. As this could present a problem for modelling, the logarithm for turbidity was used as a dependent variable. These data were also normalized. Rajaee et al. (2020) concluded that data normalization modifies the distribution of the input variables and ensures that they will receive equal attention during the training stage through neural network model techniques. Additionally, normalization might improve the numerical stability of the model and often reduces training time. We used a normalization range between according to Equation 1, where is the normalized data, is the original data, and and denote the minimum and maximum values of the original data, respectively (Equation 1):

We randomly divided 3,893 hourly observations into training (80%) and testing (20%) data categories. Since Tu and UV254 values can change over short periods of time during rainfall, randomly splitting the data ensured that both training and test datasets contained both high and low RW quality values. In order to ensure best performance in models, hyperparameter tuning was carried out by means of a validation stage. Training dataset was split into a pre-training dataset and a validation dataset using a k-fold cross-validation approach. During the validation stage, a 10-fold cross-validation was performed in order to tune the hyperparameters. The training dataset is randomly split into 10 subsamples. Each model (with new parameters) is trained using 9 of the folds as training data, and then, the model is validated using the remaining part of the data. After calculating the accuracy of the model, the process is repeated nine times, with each of the nine subsamples. This model (with new and tuned hyperparameters) is then presented to the validation dataset. This validation process is repeated using a range or grid of hyperparametres previously set. This validation stage ends when every hyperparameter set has been tested. Finally, the validated model (with the highest performance metrics) is tested with the ‘unseen’ test dataset.

Modelling techniques

Two well-known machine learning techniques have been used to model RW quality: artificial neural networks (ANNs) and support vector regression (SVR). These techniques have already been successfully applied for surface water quality modelling (Ortiz-Lopez et al. 2022). ANNs are robust mathematical model systems that can map a set of input data into a corresponding set of output data. This machine learning technique can identify complex and dynamic non-linear relationships between RW quality variables and rainfall and river flow data. A predictive ANN model uses historical data time series to predict future values (Ortiz-Lopez et al. 2022). The ANN RW quality model is described below in Equation 2 and also in Figure 3(a). In this equation, a neuron receives a set of input signals (x, rainfall events and flow peak events that either are or are not time lagged). Then, a weighted average is calculated using the summation function and weights (w), an activation function (f), computes an output (y), and a RW quality parameter (such as Tu or UV254). For water quality (and all other environmental science fields), ANN is generally composed of at least three layers, constituting a multi-layer perceptron (MLP). The input layer contains the input data series according to the scenarios described in the following section. Each node of the input layer corresponds to a single predictor variable. One or more hidden layers include activation and transfer functions. The output layer shows the predicted values. The bias (b) in Equation 2 allows the output data to have the best fit with the input data in an ANN model:
Figure 3

General architecture of the modelling techniques. (a) ANN and (b) SVR.

Figure 3

General architecture of the modelling techniques. (a) ANN and (b) SVR.

Close modal

We used one of the most common ANN algorithms, called the multi-layer feed-forward back propagation neural network (MLPBPNN). The aim of this algorithm is to reduce the expected error by looking for the best collection of relation weights (Ibrahim et al. 2022). The Neuralnet R software package was used to develop the ANN models. The hyperparameters adjusted in the validation stage correspond to the number of neurons in each of the two layers of the neural network and the learning rate. The caret R software package was used to develop this validation stage.

The support vector machine (SVM) is an algorithm that is designed to carry out generalisation tasks and estimate dependencies from finite spaces. Generally, SVMs are used for classification by mapping input data information into output data predictions by using separation hyperplanes. The term SVR is used to refer to an SVM that is used in regression and prediction tasks. SVR develops a linear regression based on the trajectory of the hyperplane. This technique transforms input data in a space with higher dimension characteristics where an optimal separation hyperplane is found (Ortiz-Lopez et al. 2022). The ANN SVR quality model is described in Figure 3(b). SVR is useful for high-dimensional and non-linear prediction problems. For non-linear problems, Kernel functions can be used (Ibrahim et al. 2022). SVR works by constructing two parallel hyperplanes around the main hyperplane. The distance between the hyperplanes is maximized in order to minimize prediction error. The Kernelab R software package was used to develop the SVR models. The Kernelab R software package was used to develop the SVR models. The hyperparameters adjusted in the validation stage correspond to sigma and cost . The e1,017 R software package was used to develop this validation stage.

Modelling scenarios

Different model scenarios were established as shown in Table 1. These scenarios (see first column) vary according to the modelling technique implemented (ANN or SVR), the RW quality parameter modelled (Tu, LogTu, or UV254) and the time lag applied to the input variables.

Table 1

Proposed modelling scenarios

ScenarioVariable to predictModelling techniqueInput variables
ANN_Tu_1 Tu ANN  
ANN_Tu_2 Tu ANN  
ANN_Tu_3 Tu ANN  
ANN_Tu_4 Tu ANN  
SVR_Tu_1 Tu SVR  
SVR_Tu_2 Tu SVR  
SVR_Tu_3 Tu SVR  
SVR_Tu_4 Tu SVR  
ANN_uv_1 UV254 ANN  
ANN_uv_2 UV254 ANN  
ANN_uv_3 UV254 ANN  
ANN_uv_4 UV254 ANN  
SVR_uv_1 UV254 SVR  
SVR_uv_2 UV254 SVR  
SVR_uv_3 UV254 SVR  
SVR_uv_4 UV254 SVR  
ScenarioVariable to predictModelling techniqueInput variables
ANN_Tu_1 Tu ANN  
ANN_Tu_2 Tu ANN  
ANN_Tu_3 Tu ANN  
ANN_Tu_4 Tu ANN  
SVR_Tu_1 Tu SVR  
SVR_Tu_2 Tu SVR  
SVR_Tu_3 Tu SVR  
SVR_Tu_4 Tu SVR  
ANN_uv_1 UV254 ANN  
ANN_uv_2 UV254 ANN  
ANN_uv_3 UV254 ANN  
ANN_uv_4 UV254 ANN  
SVR_uv_1 UV254 SVR  
SVR_uv_2 UV254 SVR  
SVR_uv_3 UV254 SVR  
SVR_uv_4 UV254 SVR  

Scenario names that contain the number 1 use non-time-lagged input data for both rainfall accumulation and river flow rates. This means that these scenarios are not predictive models but are only reference models. Scenario names that contain the numbers 2, 3, and 4 use time-lagged input data for both rainfall accumulation and river flow rates (the time lags varied). The time lags were selected as a function of , as indicated in the input variable column in Table 1. When the highest rs value was obtained for a given correlation, this is denoted in the table as . For example, PS.1_rs_max corresponds to the rainfall accumulation in PS.1 with the optimal time lag, meaning that this scenario provided the best correlation between rainfall accumulation at station PS.1 (Figure 1) and RW Tu. Significantly rainfall data with lower and higher time lags are denoted as and , respectively, compared to the rainfall data with optimal time lag.

Model performance metrics

The performance metrics used to evaluate model performance are described below. The R2 metric was used to calculate the percentage of the total variance of the observed data that is explained by the model. R2 values range between 0 and 1 (Rajaee et al. 2020). The coefficient of Nash–Sutcliffe efficiency (NSE) was also used, of which the values range between minus infinity and 1, with 1 representing a perfect fit. An efficiency of less than zero indicates that using the mean value of the observed data is a better predictor than the model (Nash & Sutcliffe 1970). Root mean square error (RMSE) provided us with the standard deviation of the model prediction error. We chose RMSE because it is able to show several deviations, is reliable, and provides high weight to error. Lastly, mean absolute error (MAE) was used to show the absolute difference between the actual and predicted values. However, this only provides information about the extent of the error and not the model validity.

Rainfall and river peak flow event analysis

Although statistical analyses can establish links between input and output variables, they might conceal some disparities between events and oversimplify complex phenomena. The qualitative analysis of a few events can partly reveal these disparities and complexities. We, therefore, analysed the main characteristics of a few rainfall events in PS.3 and river flow peaks in QS.1, as well as their relationships with changes in RW quality. PS.3 presents the highest correlation RW quality variables and it is also centred-placed in the watershed. Due to those, we considered the PS.3 rainfall gauge station to generally represent all of the rainfall events registered at the other rainfall gauge stations. QS.1 was selected because it is the closest river flow meter station to the RW sampling point at the DWTP.

Rainfall event analysis

Rainfall classification systems are based on several features of rain events, such as physical properties, meteorological conditions, or impacts (Breugem et al. 2020). We started by identifying every single rainfall event at each precipitation station (PS.1–PS.5) by considering the following characteristics: duration, accumulation, intensity, antecedent dry time, and posterior dry time. An algorithm was run through the rain data series from the first observation. First, the algorithm identified the observations with values higher than . Then, the observations with values lower than were identified. The time difference between and was the duration of the rainfall event. The sum of rainfall values within a specific is the rainfall accumulation . The rainfall intensity was obtained by dividing by . The antecedent dry time of a rainfall event was calculated by subtracting the of that rainfall event from the of the previous rainfall event. The posterior dry time of a rainfall event was calculated by subtracting the of the next rainfall event from the of the present rainfall event. According to this methodology, we selected four rainfall events for analysis with subsequent changes in RW quality. The rainfall events and RW quality parameter plots and the rainfall characteristics might provide valuable information that can be used to understand model input variable selection.

River peak flow event analysis

The river flow rate can be considered an ‘integrative’ variable of watershed hydrological phenomena (Singh 2017). This is because river flow is the hydrographic response to different rain-induced phenomena that occur in the watershed. Here, we aimed to show the impacts of river flow changes on RW quality parameters. Flow stations QS.1 and QS.2 were selected for this analysis because they are located relatively near the DWTP water intake. Because our case study is in an area with a temperate climate, there are two major periods that can affect flow rate variation, and therefore, our analyses. In late winter/early spring, river flows are affected by melting snow and ice jams. This period presents high river flow rates. In late spring, summer and fall, river flows are principally determined by rain and base flow (subsurface and underground flow). We divided the entire flow rate time series (QS.1 and QS.2) into two periods: (1) the melting snow (MS) period in May (QS.1MS and QS.2MS) and (2) the warm (W) period from June 1 to the end of the time series (QS.1W and QS.2W). The base flow is almost constant during the W period. We were mainly interested in linking river flow and RW quality during the summer period and the rainy season (W period), as during this period, changes in RW quality are more noticeable. A specific peak flow events were identified and selected in order to analyse their impacts on RW quality. Based on this methodology, we selected four peak flow events for analysis with subsequent changes in RW quality. Two of the mentioned peak flow events corresponded to events generated by two of the rainfall events from the previous section. The other two peak flow events were selected in order to represent more event diversity in the analysis. From river flow rate events and RW quality parameter plots, valuable information might be extracted to understand the model input variable selection.

Time-lagged correlations between watershed rainfall and river flow time series and RW quality

Rainfall time series and RW quality

We summarized the obtained time series for rainfall events that were identified from PS.3 and the RW quality parameters (Figure 4). At first view, it is not possible to establish relationships between rainfall events or river flow events and RW quality parameters. Table 2 shows more specific results in correlations from PS.1–PS.5 and RW quality parameters. For instance, rainfall measurements at different stations are correlated. The precipitation amount at each station was shown to have a stronger correlation with that of neighbouring stations. The correlation decreased as the stations became further apart (Table 2). Low or almost-null correlations between rainfall time series and RW quality were observed. Some of the correlation coefficients were not statistically significant (p-values >0.05). This was expected due to the delay between the moment when rainfall is registered and the moment when changes in flow and water quality are observed. This delay is associated with the time it takes for a hydrological response to occur in the watershed (Musy & Higy 2010) and for flood routing (Shaw et al. 2017). Despite of these low correlations results, it is expected a relationship between rainfalls and RW quality because the runoff following rainfall carries both dissolved and undissolved matter to the water, leading to increases in Tu and UV254.
Table 2

Correlation matrix (Spearman's rank correlation coefficient, ) between rainfall events and RW quality (simultaneous)

MeanStd.TuPS.1PS.2PS.3PS.4PS.5
Dev.(NTU)(mm)(mm)(mm)(mm)(mm)
UV254 (m−130.41 10.47 0.53 − 0.03 − 0.04 −0.03 −0.01 
Tu (NTU) 9.14 15.87  0.02 0.04 0.03 0.01 
PS.1 0.11 0.78   0.57 0.52 0.45 0.32 
PS.2 0.14 0.95    0.63 0.48 0.37 
PS.3 0.15 0.80     0.56 0.41 
PS.4 0.15 0.81      0.49 
PS.5 0.14 0.80       
MeanStd.TuPS.1PS.2PS.3PS.4PS.5
Dev.(NTU)(mm)(mm)(mm)(mm)(mm)
UV254 (m−130.41 10.47 0.53 − 0.03 − 0.04 −0.03 −0.01 
Tu (NTU) 9.14 15.87  0.02 0.04 0.03 0.01 
PS.1 0.11 0.78   0.57 0.52 0.45 0.32 
PS.2 0.14 0.95    0.63 0.48 0.37 
PS.3 0.15 0.80     0.56 0.41 
PS.4 0.15 0.81      0.49 
PS.5 0.14 0.80       

Correlations in bold are significant (p-value <0.05). PS, rainfall gauge station; Tu, turbidity in nephelometric turbidity units (NTU); UV254, UV absorbance at 254 nm.

Figure 4

Hourly RW quality parameters and rainfall accumulation for the PS.3 rainfall gauge station (bars in dark grey). (a) Turbidity (in red) and (b) UV absorbance at 254 nm (in blue).

Figure 4

Hourly RW quality parameters and rainfall accumulation for the PS.3 rainfall gauge station (bars in dark grey). (a) Turbidity (in red) and (b) UV absorbance at 254 nm (in blue).

Close modal
Levels of RW quality (UV254 and Tu) were correlated with rainfall events from several days before. Indeed, positive correlations were observed between time-lagged rainfall events and RW quality parameters (UV254 and Tu) (Figure 5). Different correlation patterns were observed. Tu presented positive correlations with time-lagged rainfall events, starting at . The maximum correlation coefficient between the time-lagged rainfall events and Tu appeared at h, almost two days apart.
Figure 5

Cross-correlations between time series data for rain gauge stations PS.1–5 and RW quality parameters represented by (a) turbidity and (b) UV254.

Figure 5

Cross-correlations between time series data for rain gauge stations PS.1–5 and RW quality parameters represented by (a) turbidity and (b) UV254.

Close modal

UV254 has an almost-null or negative correlation with rainfall events from up to 2 days prior. From that point, cross-correlation plots CCF started to increase. The maximum UV254 correlation coefficient occurred at PS.3 with a time lag of . Figure 5 shows that two local maximums were observed for the cross-correlation plots. The first local maximum had an average time lag of 3 days and the second local maximum had an average time lag of 8 days. Time-lagged PS.3 and PS.5 presented the highest correlation coefficient (rs) with UV254. These results are consistent with those of previous studies by other authors. For instance, Volk et al. (2002) observed that UV254 (and DOC) concentrations increased after a rainfall event. They also observed that it could take up to 6 days for the UV254 and DOC concentrations to return to their initial values. Panton et al. (2020) observed that DOC concentrations in two different rivers were high for 2 days following rainfall before returning to base/initial values 10 days later.

River flow time series and RW quality

The time series for the average hourly river flow at QS.1 and RW quality parameters are summarized in Figure 6. The results suggest that river flow peaks and RW quality peaks are related. The MS period and the W period are shown. The time series from QS.1 and QS.2 were categorized split, according to average flow behaviour. Several high values for river flow were recorded between the beginning of the time series in mid-April to the spring season in late May (MS period). The high river flow values during the MS period are due to melting snow and ice. During the warm period (W period), river flow peaks are mainly due to rain and underground flows (base flow).
Figure 6

Hourly river flow rate (in black) and RW quality at QS.1, represented by (a) turbidity (in red) and (b) UV254 (in blue).

Figure 6

Hourly river flow rate (in black) and RW quality at QS.1, represented by (a) turbidity (in red) and (b) UV254 (in blue).

Close modal
RW quality levels (UV254 and Tu) were correlated with rainfall events that occurred several hours to days before. Positive correlations were observed between time-lagged rainfall events and UV254 and Tu concentrations (Figure 7). The flow of the main source river, measured at QS.1W (Figure 7(a)) at , was highly correlated with Tu . The maximum Spearman's rank correlation coefficient between the time-lagged river flow at QS.1W and Tu was around h. The flow of the secondary (affluent) source river measured at QS.2W (Figure 7(b)) at h was relatively highly correlated with Tu . The maximum Spearman's rank correlation coefficient between the time-lagged river flow at QS.2W and Tu was around h. These results show that Tu values are highly correlated with river flow values and that the changes in these two parameters are almost simultaneous.
Figure 7

Cross-correlations between time series data for river flow and RW quality parameters at (a) QS.1 and (b) QS.2.

Figure 7

Cross-correlations between time series data for river flow and RW quality parameters at (a) QS.1 and (b) QS.2.

Close modal

On the other hand, the flow rate of the main source river, measured at QS.1W (Figure 7(a)) at h, was weakly correlated with UV254. The maximum Spearman's rank correlation coefficient between the time-lagged river flow rate and UV254 was at . The correlation then decreased to for days. River flow rate was moderately correlated with UV254, with Spearman's rank correlation coefficients of at time lags between and 144 h. The flow rate of the secondary (affluent) source river, measured at QS.2W (Figure 7(b)) at h, had a low correlation with UV254. The maximum Spearman's rank correlation coefficient between time-lagged river flow rate and UV254 was at . The correlation then decreased to for days. The river flow rate was moderately correlated with UV254, with Spearman's rank correlation coefficients of at time lags between and 144 h. These results show that the maximum values for UV254 did not occur simultaneously with increases in river flow. On average, peaks in UV254 were observed between 2 and 4 days after river flow peaks (Figure 7).

RW quality modelling: input variable scenarios and model evaluation

The results from the previous section show the empirical relationships between both rainfall and river flow rates and changes in RW quality parameters Tu and UV254. These relationships experienced delays in time and their magnitudes were calculated using correlation coefficients. We therefore chose to apply rainfall and river flow rate variables that were time-lagged for the model input variables. As mentioned in Section 2.3, four modelling scenarios were proposed. Each of the scenarios was developed to model both Tu and UV254 using ANN and SVR.

Scenarios 1 in Table 1 (ANN_Tu_1, ANN_uv_1, SVR_Tu_1, and SVR_uv_1) are not predictive scenarios because there is no time lag applied to the input data (PS.1–5 and QS.1–2) to produce Tu or UV254 output data. The input data for scenarios that included 2 (ANN_Tu_2, ANN_uv_2, SVR_Tu_2, and SVR_uv_2) for Tu modelling (Table 1) are composed of seven independent time-lagged input variables (time-lagged PS.1–5 and QS.1–2). The first five input variables are the time-lagged PS.1–5 values at the maximum correlation coefficient between rainfall and RW quality data. The last two input variables are the time-lagged QS.1 and QS.2 values at the maximum correlation coefficient between river flow data and the RW quality data. The input data for scenario 2 for UV254 modelling (Table 1) are composed of 12 independent time-lagged input variables (time-lagged PS.1–5 and QS.1–2). The first five input variables are the time-lagged PS.1–5 values at the first local maximum correlation coefficient between rainfall and RW quality data. The following five input variables are the time-lagged PS.1–5 values at the second local maximum correlation coefficient between rainfall and RW quality data. This is based on results shown in Figure 5(b) that indicate that cross-correlation plots between rainfall events (PS.1, PS.2, PS.2, PS.5) and UV254 have two local maximums for every rain gauge (except for QS.4 with a third further lag). Globally, cross-correlations between UV254 and lagged rainfalls are low and smaller than those between UV254 and river flow. This suggests that rainfall series are not as effective predictors for UV254 as river flow series. Keeping these observations in mind, only those predictors with time lags represented by the two local maxima were included in the predictor group. The last two input variables are the time-lagged QS.1–2 values at the maximum correlation coefficient between the river flow data and the RW quality data. Tables 3 and 4 present several performance metrics for both the ANN and SVR models for every input data scenario predicting RW quality parameters Tu and UV254. In general, ANN provided better predictive results than SVR for both Tu and UV254.

Table 3

Performance metrics of SVR and ANN models using several data input scenarios for turbidity modelling

ScenarioNSEcalNSEtestRMSEcalRMSEtestMAEcalMAEtest
(NTU)(NTU)(NTU)(NTU)
SVR_Tu_1 0.398 0.368 0.14 0.12 14.36 8.22 4.75 1.53 
SVR_Tu_2 0.820 0.860 0.67 0.63 0.11 0.11 0.07 0.08 
SVR_Tu_3 0.779 0.783 0.60 0.60 0.24 0.13 0.17 0.06 
SVR_Tu_4 0.676 0.657 0.44 0.41 0.28 0.16 0.22 0.07 
ANN_Tu_1 0.837 0.747 0.70 0.59 8.50 5.80 3.80 1.40 
ANN_Tu_2 0.860 0.870 0.73 0.75 0.10 0.09 0.07 0.06 
ANN_Tu_3 0.862 0.798 0.74 0.62 0.19 0.13 0.13 0.05 
ANN_Tu_4 0.785 0.718 0.62 0.52 0.23 0.15 0.17 0.06 
ScenarioNSEcalNSEtestRMSEcalRMSEtestMAEcalMAEtest
(NTU)(NTU)(NTU)(NTU)
SVR_Tu_1 0.398 0.368 0.14 0.12 14.36 8.22 4.75 1.53 
SVR_Tu_2 0.820 0.860 0.67 0.63 0.11 0.11 0.07 0.08 
SVR_Tu_3 0.779 0.783 0.60 0.60 0.24 0.13 0.17 0.06 
SVR_Tu_4 0.676 0.657 0.44 0.41 0.28 0.16 0.22 0.07 
ANN_Tu_1 0.837 0.747 0.70 0.59 8.50 5.80 3.80 1.40 
ANN_Tu_2 0.860 0.870 0.73 0.75 0.10 0.09 0.07 0.06 
ANN_Tu_3 0.862 0.798 0.74 0.62 0.19 0.13 0.13 0.05 
ANN_Tu_4 0.785 0.718 0.62 0.52 0.23 0.15 0.17 0.06 

Bold values represent the scenario that obtained the highest performance metrics in turbidity modeling for both the ANN and the SVR.

Table 4

Performance metrics of SVR and ANN models using several data input scenarios for UV254 modelling

ScenarioNSEcalNSEtestRMSEcalRMSEtestMAEcalMAEtest
(m−1)(m−1)(m−1)(m−1)
SVR_uv_1 0.443 0.424 0.16 0.15 9.71 5.80 8.15 2.86 
SVR_uv_2 0.710 0.680 0.48 0.45 0.16 0.18 0.14 0.16 
SVR_uv_3 0.598 0.590 0.32 0.32 8.72 5.18 7.34 2.52 
SVR_uv_4 0.607 0.602 0.34 0.33 8.61 5.13 6.63 2.28 
ANN_uv_1 0.593 0.569 0.35 0.32 8.61 5.10 6.70 2.32 
ANN_uv_2 0.740 0.730 0.54 0.52 0.16 0.17 0.12 0.13 
ANN_uv_3 0.753 0.684 0.56 0.46 7.01 4.60 5.22 1.93 
ANN_uv_4 0.694 0.689 0.48 0.47 7.71 4.50 5.45 1.82 
ScenarioNSEcalNSEtestRMSEcalRMSEtestMAEcalMAEtest
(m−1)(m−1)(m−1)(m−1)
SVR_uv_1 0.443 0.424 0.16 0.15 9.71 5.80 8.15 2.86 
SVR_uv_2 0.710 0.680 0.48 0.45 0.16 0.18 0.14 0.16 
SVR_uv_3 0.598 0.590 0.32 0.32 8.72 5.18 7.34 2.52 
SVR_uv_4 0.607 0.602 0.34 0.33 8.61 5.13 6.63 2.28 
ANN_uv_1 0.593 0.569 0.35 0.32 8.61 5.10 6.70 2.32 
ANN_uv_2 0.740 0.730 0.54 0.52 0.16 0.17 0.12 0.13 
ANN_uv_3 0.753 0.684 0.56 0.46 7.01 4.60 5.22 1.93 
ANN_uv_4 0.694 0.689 0.48 0.47 7.71 4.50 5.45 1.82 

Bold values represent the scenario that obtained the highest performance metrics in UV modeling for both the ANN and the SVR.

A quick look at the performance metrics in Tables 3 and 4 indicates that, when using the same models, the prediction model for turbidity is more accurate than the model for UV absorbance. Scenarios 1 (SVR_uv_1, SVR_tu_1, ANN_uv_1, and ANN_tu_1 in Table 1) are not useful for predicting RW quality deterioration due to rainfall events or flow peak events. However, their weak performance metrics confirm that it is important to consider time lags between climatic or hydrological events and changes in RW quality. Indeed, simultaneous rainfall and flow peak data do not provide enough information to estimate Tu or UV254.

Scenarios 2, 3, and 4 are predictive scenarios since they use time-lagged rainfall events and river flow peaks as input data. For scenario 2, these time lags were calculated using the methodology described in Section 2.3. The scenarios that used input variables with the highest time-lag correlation coefficient values resulted in the highest R2 and NSE performance metrics in both calibration and validation stages, as shown in Tables 3 and 4.

Figure 8 shows the scatter plots between the observed LogTu and predicted LogTu generated by SVR and ANN for scenario 2. ANN performed better than SVR . The SVR model showed limitations when predicting Tu peaks (LogTu), in particular, when values of values were higher than . As shown in Figures 5 and 7, cross-correlation plots between both rainfall events and river flow peak rates and water Tu indicated maximum values at specific time lags (hours). Using these time-lagged input data (with maximum correlations) resulted in relatively high predictive capabilities for both SVR and ANN models.
Figure 8

Scatter plots displaying relationships between observed and predicted RW quality parameters. Scenarios: (a) SVR_Tu_2, (b) ANN_Tu_2, (c) SVR_uv_2, and (d) ANN_uv_2. Grey dots are from the dataset that was used at the training stage and black dots at the test stage.

Figure 8

Scatter plots displaying relationships between observed and predicted RW quality parameters. Scenarios: (a) SVR_Tu_2, (b) ANN_Tu_2, (c) SVR_uv_2, and (d) ANN_uv_2. Grey dots are from the dataset that was used at the training stage and black dots at the test stage.

Close modal

Figure 8 shows the scatter plots between the observed UV254 and predicted UV254 values generated by SVR and ANN for scenario 2. ANN once again performed better than SVR . As shown in Figures 5 and 7, cross-correlation plots between both rainfall events and river flow rates and water UV254 did not reveal maximum values at specific time lags (hours). There were at least two maximum cross-correlation coefficients between rainfall events and UV254 (Figure 5). Additionally, cross-correlation plots between river flow peaks and UV254 show an extended period of time lags where correlations are near their maximums (Figure 7). The SVR and ANN models demonstrated that input data with those time lags (at maximum correlations) resulted in models with acceptable predictive capabilities.

Figure 9 shows time series plots for observed and predicted RW parameters for scenarios with the highest performance metrics in Tables 3 and 4. Both models (ANN and SVR) had good predictive capabilities in the ‘peaks’ and ‘valleys’ of the LogTu time series (Figure 9(a) and 9(b)). However, ANN better predicted the peak that was observed in the second half of July (Figure 9(b)). Both models (ANN and SVR) showed similar performances when predicting UV254 (Figure 9(c) and 9(d)). Maximums and minimums for UV254 were reasonably predicted by time-lagged rainfall events and river flow rates.
Figure 9

Time series plot of observed and predicted RW quality parameters using SVR and ANN models. Scenarios (a) SVR_Tu_2 and (b) ANN_Tu_2 show the LogTu (red line) and the modelled LogTu in the training (black dots) and test (pink dots) stages. Scenarios (c) SVR_uv_2 and (d) ANN_uv_2 show the UV254 (blue line) and the modelled UV254 in the training (black dots) and test (cyan dots) stages.

Figure 9

Time series plot of observed and predicted RW quality parameters using SVR and ANN models. Scenarios (a) SVR_Tu_2 and (b) ANN_Tu_2 show the LogTu (red line) and the modelled LogTu in the training (black dots) and test (pink dots) stages. Scenarios (c) SVR_uv_2 and (d) ANN_uv_2 show the UV254 (blue line) and the modelled UV254 in the training (black dots) and test (cyan dots) stages.

Close modal

Analysis of RW quality variations after rain and flow peak events

As mentioned in Section 2.4, empirical analyses may help establish relationships between input variables such as rainfall and river flow rate data and RW quality output variables such as Tu and UV245. These relationships can be useful to model and predict changes in RW quality variables. However, the models presented herein were not completely accurate when making predictions. As shown in Tables 3 and 4, the performance metrics results suggest that the input data are not always enough to generate accurate output data. In Figure 9, we observed that some RW peak events were not accurately predicted. This suggested that an analysis of individual events might be useful to complement the model results. Specific rainfall events, river flow rate events, and Tu and UV254 parameters were therefore investigated to support the modelled results. These analysed events also led to the overestimation and underestimation of events by the machine learning models discussed in the previous section.

Effects of rainfall events on RW quality

During the data collection period (see Section 2.2), there were between 76 and 92 rainfall events at each of the meteorological stations. Most of the rainfall events were of short duration, typically 1–3 h. At PS.3, most of the rainfall events had accumulations between 1 and 4 mm. Seven rainfall events had more than 20 mm of accumulation. The intensities for most of the rainfall events were 1–4 mm/h. According to Breugem et al. (2020), those rainfall event intensities are considered light to moderate. Eleven rainfall events were of intensities higher than 5 mm/h, which are considered heavy to very heavy. In our data, rainfall intensities in the 90th percentile were 5.2, 5.8, 5.1, 4.9, and 4.8 mm/h at PS.1, PS.2, PS.3, PS.4, and PS.5, respectively. Rainfall accumulations in the 90th percentile were 12.6, 15.3, 16.4, 16.3, and 12.3 mm at PS.1, PS.2, PS.3, PS.4, and PS.5, respectively. For instance, in PS.3, five events recorded intensities higher than 5.1 mm/h and another five had accumulations higher than 16.4 mm. The rainfall events and their main characteristics are listed in Table 5.

Table 5

Rainfall event characteristics of four selected events at PS.3

Rainfall eventStart date and timeDuration (dr)Rain accumulation (ar)Intensity (ir)Antecedent dry time (adtr)Maximum hourly precipitation
(h)(mm)(mm/h)(h)(mm)
R1 2017-05-26 00:00 23.0 3.3 158 6.4 
R2 2017-06-23 22:00 30.4 10.1 19.2 
R3 2017-08-05 14:00 28.2 9.4 60 13.2 
R4 2017-08-22 12:00 10.0 5.0 14 6.0 
Rainfall eventStart date and timeDuration (dr)Rain accumulation (ar)Intensity (ir)Antecedent dry time (adtr)Maximum hourly precipitation
(h)(mm)(mm/h)(h)(mm)
R1 2017-05-26 00:00 23.0 3.3 158 6.4 
R2 2017-06-23 22:00 30.4 10.1 19.2 
R3 2017-08-05 14:00 28.2 9.4 60 13.2 
R4 2017-08-22 12:00 10.0 5.0 14 6.0 

Plots of selected rainfall events from PS.3 and RW quality parameters both observed and predicted are shown in Figure 10. As observed, rainfall events (R1–4) have major impacts on RW quality parameters Tu and UV254.
Figure 10

Examples of changes in RW quality concentrations associated with rainfall events (a) R1, (b) R2, (c) R3, and (d) R4 at PS.3. (Red line: observed turbidity, red points: predicted turbidity, blue line: observed UV254, blue points: predicted UV254).

Figure 10

Examples of changes in RW quality concentrations associated with rainfall events (a) R1, (b) R2, (c) R3, and (d) R4 at PS.3. (Red line: observed turbidity, red points: predicted turbidity, blue line: observed UV254, blue points: predicted UV254).

Close modal

Figure 10 clearly shows the variations in Tu and UV254 levels after rainfall. After the maximum precipitation event, Tu levels peaked at 52 h for R1, 33 h for R2, 55 h for R3, and 45 h for R4. However, UV254 peaked at 97 h for R1, 69 h for R2, 55 h for R3, and 188 h for R4. Moreover, there were fewer UV254 peaks and smaller, slower variations (except after R1). According to the statistical analyses (Section 3.1), the highest correlation between rainfall events and Tu occurred at a time lag of 45 h, while there were two local maxima between rainfall events and UV254. The first local maximum was at around 96 h and the second was at around 180 h. In the cross-correlation analysis, we identified weak correlation coefficients between rainfall events and RW quality parameters. This could indicate that in our case study, rainfall events by themselves are insufficient predictors of RW quality parameters such as Tu and UV254. In fact, as shown previously in Figure 5, correlation coefficients between rainfall data and RW quality parameters Tu and UV254 are low, with maximum values up to 0.2. Scenarios using only rainfall input data that are not shown here, resulted in lower performance.

Peak river flow events and RW quality variations

As presented in Figure 7, river flow rates resulting from watershed processes such as rainfall events showed higher correlation coefficients with RW quality parameters. Investigating a number of river flow rate events together with Tu and UV254 helped improve model results. Four river flow events (FRP.1, FRP.2, FRP.3, and FRP.4 in Figure 11) were analysed. Of those events, FRP.1 and FRP.2 were triggered by rainfall events R1 and R2. Events FRP.3 and FRP.4 were selected because they showed similar Tu and UV254 behaviours, but at different time lags. Plots of specific flow rate peak events (FRP.1–4) and RW quality parameters are shown in Figure 11.
Figure 11

Examples of changes in RW quality concentrations associated with river flow peak events (a) FRP.1, (b) FRP.2, (c) FRP.3, and (d) FRP.4 at QS.1. Maximum hourly flows were 129 m3/h for FRP.1, 200 m3/h for R2, 234 m3/h for R3, and 199 m3/h for R4. (Black line: river flow, red line: observed turbidity, red points: predicted turbidity, blue line: observed UV254, blue points: predicted UV254).

Figure 11

Examples of changes in RW quality concentrations associated with river flow peak events (a) FRP.1, (b) FRP.2, (c) FRP.3, and (d) FRP.4 at QS.1. Maximum hourly flows were 129 m3/h for FRP.1, 200 m3/h for R2, 234 m3/h for R3, and 199 m3/h for R4. (Black line: river flow, red line: observed turbidity, red points: predicted turbidity, blue line: observed UV254, blue points: predicted UV254).

Close modal

Whereas Tu maximums were observed several hours after the river flow peak (Figure 11(a), 11(c), and 11(d)), UV254 maximums were observed several days after (Figure 11(a)–11(d)). According to the statistical analyses (Section 3.1), the highest correlation between river flow peaks at QS.1 and Tu occurred with a time lag of 8 h. Meanwhile, the highest correlation between river flow peaks and UV254 was at around 62 h. According to the analysis of river flow peaks per individual event, there was moderate variability in Tu and UV254 responses following rainfall events. However, in the cross-correlation analysis, we found high correlation coefficients between river flow peaks and RW quality parameters Tu and UV254. This could indicate that river flow peak events are good predictors of those RW quality parameters. Compared to rainfall event analyses and cross-correlations, river flow rates showed better correlations with RW quality. In fact, river flow rate data showed high correlation coefficients with Tu and UV254. Correlation coefficients higher than 0.8 between QS.1 and Tu and higher than 0.56 between QS.1 and UV254 (Figure 7) are compatible with results from Figure 11. We also tested scenarios with only river flow rate input data (not shown here), which showed a moderately high performance.

Discussion

This section discusses the physical phenomena that determine the presence of NOM and other particles in source water during rainfall. By considering these phenomena, we hope to better understand the statistical relationships between the variables and the performance of machine learning models.

Data analysis and the modelling process showed that Tu and UV254 have different behaviours after rainfall events and river flow peaks. This indicates that there are different transport mechanisms in the watershed that modify particles and dissolved NOM in RW after rainfall. Rain generates surface (directly) and subsurface (indirectly) runoff that generally trigger flow peaks in rivers and groundwater runoff maintain base flow (Chow 1988). Particles and NOM are transported from the watershed and into streams through several mechanisms that, over time, lead to RW deterioration. From a purely empirical point of view, our results indicate that rainfall and river flow have different correlation values with RW quality parameters. Rainfall time series present low correlation values with RW parameters, despite rain being the primary cause of RW quality deterioration in rivers. The diversity of pollutant transport processes, which depend on a multitude of factors such as rainfall intensity and antecedent soil conditions, could explain the low correlation values.

Observations based on the results from our study might be associated with antecedent soil moisture conditions that may have different impacts on the erosion resistance of sediments. Indeed, it has been suggested that pre-wetted soil materials can erode at faster rates (Lawler et al. 1998). Rainfall events R2 and R4 (presented in Figure 10(b) and 10(d)) only had between 7 and 14 h of antecedent dry time. The soil during those events could therefore be considered pre-wetted. Tu peaks were observed between a few hours and a couple of days after the hour of maximum rainfall. In contrast, turbidity peaks in events R1 and R3 started 2 days after the hour of maximum rainfall. Events R1 and R3 had the driest soil.

In R1 and R3, we observed pre-peaks and sustained turbidity values on the recession limb. These pre-peaks (called first-flashings in hydrology) and sustained values on the recession limb have also been observed by Lawler et al. (2006). Rainfall intensity also affects the degree of turbidity. As shown by Lee et al. (2016), intense rainfall may induce erosion and lead to high levels of turbidity runoff flowing into the river. The runoff is caused by the remobilization and resuspension of sediments from sediment beds and recently eroded sediment from different sources in the watershed (Chen & Chang 2019). Our results were consistent with these findings and showed that an intense rainfall event (R2 in Figure 10(b)) generated a high river flow peak and consequently, a high Tu value. The time lag between the rainfall event and the Tu peak was several hours, whereas the river flow rate peak and Tu peak were almost simultaneous. This confirms that intense and extreme rainfall events can generate rapid and high turbidity peaks due to resuspension of sediment from sediment beds. Small rainfall events can also increase turbidity levels due to sediment build-up on topsoil (Chen & Chang 2019). Moderate-intensity rainfall events (R4 and R2 in Figure 10) generated moderate river flow peaks and consequently, moderate to low values of Tu during summer (BF period). For events R1 and R3, there were sustained Tu values on the recession limb. On the other hand, relatively high correlation coefficients were observed between time-lagged river flow rate and turbidity, while mild correlation coefficients were observed between the time-lagged river flow rate and UV254. This could be explained by the fact that the river flow at the sampling point might have already been affected by the rainfall–runoff phenomena from the upstream watershed. There is a link between river flow rate and contaminant (particles and dissolved NOM) transportation into the river. As mentioned, rainfall over watersheds leads to superficial runoff that contributes to increases in river flow rate. This superficial runoff (causing the erosion and transport of soil materials) and the direct rainfall over the water stream (causing remobilization and resuspension) are the main mechanisms that generate increased turbidity.

Machine learning methods can describe the relationships between appropriate input variables (lagged rainfall and river flow series) as predictors and output variables (turbidity and UV254). It has been demonstrated that these input variables are related to the output variables as mentioned above. The ANN model has shown a better performance in modelling and forecasting RW turbidity and RW UV254 than the SVM model, as observed in the performance metrics in Tables 3 and 4. Lee et al. (2016) demonstrated high correlations between daily flow rate and turbidity during extreme rain events. Rajaee et al. (2009) created machine learning models to predict water turbidity using different combinations of input variables, such as river flow rates from 1, 2, and 3 days before the event. These demonstrated better model performance using river flow rate data from 1-day before. This trial-error method, used with daily time steps, allows identifying the best delay time, but represents a higher computational cost. Cisty et al. (2021) used hydrological and climate data in machine learning models for modelling suspended sediments in rivers. They used river flows from the given and the previous day but rainfall information from previous days could also provide information about water turbidity. We have proposed herein a methodology for systematically selecting time-lagged watershed rainfall and river flow as RW quality predictors. Our results using this novelty methodology for selecting rainfall and time-lagged river flow rate data are compatible with the previous modelling studies described above.

According to the statistical analysis in Section 3.1, time-lagged rainfall and RW UV254 showed a weak correlation. Although some authors have reported changes in RW UV254 after rainfall, those changes were shown to appear hours to several days later. Maximum cross-correlation values were recorded almost a week after rainfall events. This suggests that rainfall event time series are not good predictors of RW UV254, even if they do trigger RW quality variations that include UV254. Nevertheless, as presented in Section 3.1, river flow peaks have a moderate correlation with RW UV254, with correlations higher than 0.5 when applying time lags from 2 to 4 days. When using time-lagged river flow rate data, the model performance metrics reported in Section 3.2 were the highest, with the highest cross-correlation coefficients between input and output data. Panton et al. (2020) showed that DOC concentrations increased with increased river flow rates. Flows above base flow rivers were positively correlated with DOC concentrations. Those DOC concentrations decreased progressively as the proportion of river base flow increased. This can be observed in Figure 11(a), 11(c), and 11(d), where UV254 recessing limbs decreased while river base flow amounts recovered to usual values.

UV254 values and the time lag between rainfall events and UV254 maximums could be affected by rainfall intensity. For instance, maximum UV254 values were time lagged between 2 and 3 days after the hour with the highest rain accumulation during events R2 and R3, which had relatively intense rainfall. However, R2 generated a higher maximal UV254 value compared to R3. This could be due to R2 having had only a few hours of antecedent dry time while R3 had several days. Furthermore, maximum UV254 values were time lagged for more than 4 days after the hour with the highest rain accumulation during events R1 and R4, which had moderate rainfall intensity. R1 generated a maximal UV254 value that was slightly higher than R4. Rainfall intensity could also affect the recovery time of UV254 levels in the water. The recovery time can be defined as the time between the start of the rising limb and the end of the declining limb. For high-intensity events (R2 and R3), recovery times were roughly less than 6 days, whereas for moderate-intensity events (R1 and R4), recovery times were longer than 1 week.

In this study, we developed an approach for modelling relevant RW quality parameters in RW that influence drinking water treatment (DWT) based on rainfall and river flow rate data. This approach included statistical analyses to select the best rainfall and river flow data predictors to use as model input variables.

The selection of the above-mentioned predictors was based on the highest cross-correlation coefficients between both time-lagged rainfall events in the watershed and the river flow rate and RW quality time series. The highest correlation between rainfall and turbidity was found at around 2-day lag, whereas between rainfall and UV254 was found at around 4 and 10-day lag. Regarding the relationships between river flow and turbidity, the highest correlation was found at 8-h lag for QS.1 and at 13-h lag for QS.2. Moreover, the highest correlation between river flow and UV254 was found at 62-h lag for QS.1 and at 2 and 5-day lag for QS.2. Then, two machine learning models (ANN and SVR) were tested using several input data scenarios from the generated database. The scenarios that used time lags for both rainfall and river flow data with the highest cross-correlation coefficients yielded the highest model performance metrics. This is because empirically, time-lagged input data was more correlated with output data. This was observed for model results for all the proposed scenarios.

Following rainfall, particles (turbidity) are transported from the watershed into the river through mainly surface runoff. Intense rain also generates remobilization and resuspension, further increasing water turbidity. Water turbidity behaviour depends on hydrology and rainfall characteristics such as intensity, as well as antecedent dry time and soil moisture, among others. This could explain the low performance metrics when only rainfall data were used as the input data in the models. The transport of NOM from the watershed and into the river after rainfall is even more complex than Tu. This is reflected by the generally lower UV254 model performance metrics compared to Tu model performance metrics. Alternately, river flow rate data can function as a watershed integrator variable and enhance the performance metrics of RW quality models.

To the best of our knowledge, our methodology is novel in considering the intersections of hydrology and drinking water quality. Our method uses cross-correlations to associate both rainfall events and river flow events with the most relevant RW quality parameters that influence water treatment at DWTPs. This approach allows the most accurate time lags with the highest cross-correlation coefficients between both rainfall and river flow rates and RW quality parameters to be estimated and then used as input data in the RW quality model. Our methodology could make it possible to associate both rain events and river flow events with water quality at the DWTP intake for any case study. DWTPs could use our results to help determine the best strategies for adjusting chemical dosing for the removal of key contaminants such as Tu and NOM. Future research should focus on creating decision-making tools that use the predicted variability of Tu and NOM (based on the methodology herein) to select the most appropriate chemical doses (e.g., coagulants) for particle and NOM removal from RW. Novel cutting-edge and more accurate models could be constructed using input data selected with the methodology herein proposed. Additionally, robustness analysis could be conducted in order to establish the appropriate number or rain gauge or flow gauge stations. Finally, using the proposed methodology, new methods to predict coagulant dosages in DWTP should be proposed. With these tools, DWTP operators can make preventative decisions, especially during periods of rain that will ensure the production of safe drinking water.

This study was funded by the NSERC (Natural Sciences and Engineering Research Council of Canada) Drinking Water Research Chair at Laval University, whose main partners are the municipalities of Quebec City and Lévis, Avensys Solutions, Agiro and WaterShed Monitoring. The authors are thankful to Crayon-Bleu for English proofreading of the paper.

Data cannot be made publicly available; readers should contact the corresponding author for details.

The authors declare there is no conflict.

Bastaraud
A.
,
Perthameid
E.
,
Rakotondramangaid
J.-M.
,
Mahazosaotra
J.
,
Ravaonindrina
N.
&
Jambouid
R.
2020
The impact of rainfall on drinking water quality in Antananarivo, Madagascar
.
PLoS ONE
15
(
6
),
e0218698
.
https://doi.org/10.1371/journal.pone.0218698
.
Baurès
E.
,
Delpla
I.
,
Merel
S.
,
Thomas
M. F.
,
Jung
A. V.
&
Thomas
O.
2013
Variation of organic carbon and nitrate with river flow within an oceanic regime in a rural area and potential impacts for drinking water production
.
Journal of Hydrology
477
,
86
93
.
https://doi.org/10.1016/j.jhydrol.2012.11.006.
Beauchamp
N.
,
Dorea
C.
,
Bouchard
C.
&
Rodriguez
M.
2018
Use of differential absorbance to estimate concentrations of chlorinated disinfection by-product in drinking water: Critical review and research needs
.
Critical Reviews in Environmental Science and Technology
48
(
2
),
210
241
.
https://doi.org/10.1080/10643389.2018.1443668.
Breugem
A. J.
,
Wesseling
J. G.
,
Oostindie
K.
&
Ritsema
C. J.
2020
Meteorological aspects of heavy precipitation in relation to floods – An overview
.
Earth-Science Reviews Journal
204
,
103171
.
https://doi.org/10.1016/j.earscirev.2020.103171.
Chen
J.
&
Chang
H.
2019
Dynamics of wet-season turbidity in relation to precipitation, discharge, and land cover in three urbanizing watersheds, Oregon
.
River Research and Applications
35
(
7
),
892
904
.
https://doi.org/10.1002/RRA.3487
.
Chen
P. Y.
&
Popovich
P. M.
2002
Correlation: Parametric and Nonparametric Measures. Sage University Papers Series on Quantitative Applications in the Social Sciences
.
Sage Publications
,
Thousand Oaks, Calif
.
Chow
V.
,
Te Maidment
D. R.
&
Mays
L. W.
1988
Applied Hydrology
.
McGraw-Hill
,
New York
.
Cisty
M.
,
Soldanova
V.
,
Cyprich
F.
,
Holubova
K.
&
Simor
V.
2021
Suspended sediment modelling with hydrological and climate input data
.
Journal of Hydroinformatics
23
(
1
),
192
210
.
https://doi.org/10.2166/HYDRO.2020.116.
COBARIC
2014
Plan directeur de l'eau du bassin versant de la rivière Chaudière: Mise à jour 2014 - Portrait. Comité de bassin de la rivière Chaudière
.
Crittenden
J. C.
,
Trussell
R. R.
,
Hand
D. W.
,
Howe
K. J.
,
Tchobanoglous
G.
&
Borchardt
J. H.
2012
MWH's Water Treatment - Principles and Design
(3rd ed.).
John Wiley and Sons
.
Hoboken, NJ
.
Delpla
I.
&
Rodriguez
M. J.
2014
Effects of future climate and land use scenarios on riverine source water quality
.
Science of the Total Environment
493
,
1014
1024
.
https://doi.org/10.1016/j.scitotenv.2014.06.087.
Delpla
I.
&
Rodriguez
M. J.
2016
Experimental disinfection by-product formation potential following rainfall events
.
Water Research
104
(
1
),
340
348
.
https://doi.org/10.1016/j.watres.2016.08.031
.
Delpla
I.
,
Bouchard
C.
,
Dorea
C.
&
Rodriguez
M. J.
2023
Assessment of rain event effects on source water quality degradation and subsequent water treatment operations
.
Science of The Total Environment
866
,
161085
.
https://doi.org/10.1016/J.SCITOTENV.2022.161085.
Edzwald
J. K.
2011
Water quality and treatment. A handbook on drinking water
(AWWA, Ed.; 6th ed.).
New York
.
McGraw Hill
.
Edzwald
J. K.
,
Becker
W. C.
&
Wattier
K. L.
1985
Surrogate parameters for monitoring organic matter and THM precursors on JSTOR
.
Journal American Water Works Association
77
(
4
),
122
131
.
Eikebrokk
B.
,
Vogt
R. D.
&
Liltved
H.
2004
NOM increase in Northern European source waters: Discussion of possible causes and impacts on coagulation/contact filtration processes
.
Water Supply
4
(
4
),
47
54
.
https://doi.org/10.2166/ws.2004.0060
.
Health Canada
2002
From Source to Tap – The Multi-Barrier Approach to Safe Drinking Water
.
Hyndman
R. J.
&
Athanasopoulos
G.
2018
Forecasting: Principles and Practice
(OTexts, ed.)
.
OTexts
,
Melbourne
,
Australia
.
Hyndman
R. J.
&
Khandakar
Y.
2008
Automatic time series forecasting: The forecast package for R
.
Journal of Statistical Software
27
(
3
),
1
22
.
https://doi.org/10.18637/JSS.V027.I03
.
Ibrahim
K. S. M. H.
,
Huang
Y. F.
,
Ahmed
A. N.
,
Koo
C. H.
&
El-Shafie
A.
2022
A review of the hybrid artificial intelligence and optimization modelling of hydrological streamflow forecasting
.
Alexandria Engineering Journal
61
(
1
),
279
303
.
https://doi.org/10.1016/J.AEJ.2021.04.100
.
Jia
Z.
,
Chang
X.
,
Duan
T.
,
Wang
X.
,
Wei
T.
&
Li
Y.
2021
Water quality responses to rainfall and surrounding land uses in urban lakes
.
Journal of Environmental Management
298
,
113514
.
https://doi.org/10.1016/j.jenvman.2021.113514
.
Khan
S. J.
,
Deere
D.
,
Leusch
F. D. L.
,
Humpage
A.
,
Jenkins
M.
&
Cunliffe
D.
2015
Extreme weather events: Should drinking water quality management systems adapt to changing risk profiles?
Water Research
85
,
124
136
.
https://doi.org/10.1016/j.watres.2015.08.018
.
Kottegoda
N. T.
&
Rosso
R.
2008
Applied Statistics for Civil and Environmental Engineers
. 2nd ed.
Oxford UK
:
Blackwell Pub
.
Lawler
D. M.
,
Thorne
C. R.
&
Hooke
J. M.
1998
Bank erosion and instability
. In:
Applied Fluvial Geomorphology for River Engineering and Management (Thorne, C. R., Hey, R. D. & Newson, M. D., eds)
.
West Sussex, UK
.
Wiley-Blackwell
, pp.
69
93
.
Lawler
D. M.
,
Petts
G. E.
,
Foster
I. D. L.
&
Harper
S.
2006
Turbidity dynamics during spring storm events in an urban headwater river system: The Upper Tame, West Midlands, UK
.
Science of the Total Environment
360
,
109
126
.
https://doi.org/10.1016/j.scitotenv.2005.08.032
.
Lee
C.-S.
,
Lee
Y.-C.
&
Chiang
H.-M.
2016
Abrupt state change of river water quality (turbidity): Effect of extreme rainfalls and typhoons
.
Science of the Total Environment
557
(
558
),
91
101
.
https://doi.org/10.1016/j.scitotenv.2016.02.213
.
Ministère de l'environnement et de la lutte contre les changements climatiques
2020
Données du Réseau de surveillance du climat du Québec.
Direction de la qualité de l'air et du climat
,
Québec
.
Musy
A.
&
Higy
C.
2010
Hydrology: A Science of Nature
(1st ed.).
Enfield, USA
.
CRC Press
.
https://doi.org/10.1201/b10426
.
Nash
J. E.
&
Sutcliffe
J. V.
1970
River flow forecasting through conceptual models
.
Journal of Hydrology
10
,
282
290
.
Ortiz-Lopez
C.
,
Bouchard
C.
&
Rodriguez
M.
2022
Machine learning models with potential application to predict source water quality for treatment purposes: a critical review
.
Environmental Technology Reviews
11
(
1
),
118
147
.
https://doi.org/10.1080/21622515.2022.2118084
.
Panton
A.
,
Couceiro
F.
,
Fones
G. R.
&
Purdie
D. A.
2020
The impact of rainfall events, catchment characteristics and estuarine processes on the export of dissolved organic matter from two lowland rivers and their shared estuary
.
Science of The Total Environment
735
,
139481
.
https://doi.org/10.1016/J.SCITOTENV.2020.139481
.
Pratt
B.
&
Chang
H.
2012
Effects of land cover, topography, and built structure on seasonal water quality at multiple spatial scales
.
Journal of Hazardous Materials
209–210
,
48
58
.
https://doi.org/10.1016/J.JHAZMAT.2011.12.068
.
Rajaee
T.
,
Mirbagheri
S. A.
,
Zounemat-Kermani
M.
&
Nourani
V.
2009
Daily suspended sediment concentration simulation using ANN and neuro-fuzzy models
.
Science of the Total Environment
407
(
17
),
4916
4927
.
https://doi.org/10.1016/j.scitotenv.2009.05.016.
Rajaee
T.
,
Khani
S.
&
Ravansalar
M.
2020
Artificial intelligence-based single and hybrid models for prediction of water quality in rivers: A review
.
Chemometrics and Intelligent Laboratory Systems
200
,
103978
.
https://doi.org/10.1016/j.chemolab.2020.103978
.
R Core Team
2023
R: A Language and Environment for Statistical Computing_. R Foundation for Statistical Computing
,
Vienna, Austria
. https://www.R-project.org/.
Shaw
E. M.
,
Beven
K. J.
,
Chappell
N. A.
&
Lamb
R.
2017
Hydrology in Practice
, 4th ed.
London, UK
.
CRC Press
.
Singh
V. P.
2017
Handbook of Applied Hydrology
.
New York, USA
.
McGraw-Hill
.
Tiyasha.
,
Tung
T. M.
&
Yaseen
Z. M.
2020
A survey on river water quality modelling using artificial intelligence models: 2000–2020
. In
Journal of Hydrology
585
,
124670
124732
.
Elsevier B.V.
https://doi.org/10.1016/j.jhydrol.2020.124670
Volk
C.
,
Wood
L.
,
Johnson
B.
,
Robinson
J.
,
Zhu
H. W.
&
Kaplan
L.
2002
Monitoring dissolved organic carbon in surface and drinking waters
.
Journal of Environmental Monitoring
4
(
1
),
43
47
.
https://doi.org/10.1039/B107768F
.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).