## Abstract

Quantifying pollutant loads from combined sewer overflows (CSOs) is necessary for assessing the impacts of urban drainage on receiving water bodies. Based on data obtained at three adjacent CSO structures in the Louis Fargue catchment in Bordeaux, France, this study implements multiple linear regression (MLR) and random forest regression (RFR) approaches to develop statistical models for estimating emitted loads of total suspended solids (TSS). Model development includes a comparison between hierarchical clustering selection and random selection of CSO events for calibration. The results indicate that the selection of a model's explanatory variables depends on both the type of approach and the CSO structure. Using the cluster technique to select representative events for model calibration generally improves model predictability. For the available dataset, MLR may have advantages over RFR in terms of verification performance and a lower range of error due to splitting events for calibration and verification; however, the RFR model uncertainty bands are considerably narrower than the MLR ones.

## INTRODUCTION

CSOs have long been known as a major cause of receiving water quality degradation. Their processes and effects on receiving water have been comprehensively investigated and documented in previous papers, e.g. House *et al.* (1993) and Métadier & Bertrand-Krajewski (2011). There are several solutions to alleviate CSO effects; typical ones include disconnecting impervious surfaces, upgrading the treatment capacity of the wastewater treatment plant (WWTP), adding retention tanks or sewer tunnels, and real time control (RTC) of an integrated system. In Bordeaux, France, a hydraulics-based RTC system has been implemented and operated in one of its catchments, Louis Fargue, since 2012 (Andréa *et al.* 2013). Within the on-going European LIFE EFFIDRAIN project (www.life-effidrain.eu/en), the same catchment will be used to further experiment with a quality-based RTC approach. As part of this project, one of the objectives is to develop local statistical models to predict total suspended solids (TSS) loads caused by CSO events, based on rainfall parameters derived from rain gauge measurements.

Multiple options to develop statistical storm water quality models can be found in the literature, and regression is considered an effective and practical way of using existing data to derive the relation between variables (Vaze & Chiew 2003). Using rainfall and/or hydraulic observations, several regression models have been applied in past publications to calculate storm event loads or event mean concentrations. However, few regression models use the same information to predict the pollutant load of a CSO event; one example is the partial least squares regression models of Sandoval *et al.* (2013). This study therefore proposes and compares two approaches for estimating CSO loads: the MLR (Multiple Linear Regression) approach implemented by Sun & Bertrand-Krajewski (2013) and the RFR (Random Forest Regression) approach proposed by Breiman (2001).

This paper aims to overcome two major issues affecting the predictive ability of any regression model: the proper selection of explanatory variables (EVs) and of the calibration data subset. When there are many candidate EVs, a strategy balancing under-fitting (too few inputs) against over-fitting (too many inputs) should be identified (McQuarrie & Tsai 1998). Furthermore, different ways of splitting the data into calibration and verification subsets result in different model parameter estimates. Data splitting can be performed in several ways. For instance, chronological splitting groups all observations before a specific point in time for calibration and reserves the later observations for verification. However, this method, like other random splitting methods, degrades the predictive capacity of the model if the calibration data subset contains observations with a narrow range of recorded values (Montgomery & Peck 1992).

## MATERIAL AND METHODS

### Study area and database

The Louis Fargue urban catchment has an area of 7,700 ha and a population of approximately 300,000 inhabitants. Its total sewer length is about 1,340 km, of which 80% are combined and 20% are separate sewers. Its main outlet is the Garonne River. Along the river, there are five CSO structures within the catchment. However, due to data availability, this study only focuses on three of them. Figure 1 displays the distribution of these CSO structures: Peugue, Naujac, and Lauzun, accompanied by the positions of nearby rain gauges used to analyse rainfall characteristics. In particular, the rain gauge named Abria is utilized for the Peugue CSO, Paulin for the Naujac CSO, and Louis Fargue for the Lauzun CSO. Distances between Peugue and Naujac, Naujac and Lauzun are around 1 km and 4.3 km respectively.

Each CSO structure is equipped with a nephelometric turbidimeter and an ultrasonic flow meter, which allow continuous measurements of turbidity and flow rates downstream of each CSO weir. The recording time step for these sensors is set at 5 minutes, which balances measurement frequency for a 7,700 ha catchment against required storage memory. TSS concentration is estimated through local calibration, which derives a correlation function between data provided by the online turbidity sensor and wet weather sampling campaigns for each CSO (Caradot *et al.* 2015; Lepot *et al.* 2016). Wet weather sampling campaigns are key for local calibration because laboratory-analysed values from grab samples are needed to determine the relationship with the values recorded by the turbidimeter. The resulting correlations for the four CSO sites are presented in Maruéjouls *et al.* (2017). Table 1 summarises the characteristics and measurement ranges of rainfall, volume and TSS load data from the CSO events selected for developing the models. There are 32, 37, and 27 events used for model development for Peugue, Naujac and Lauzun respectively, and the data collection period runs from March 2015 to October 2016. Other CSO events were excluded from this study because of gaps in measurements.

| | Peugue | | | Naujac | | | Lauzun | | |
|---|---|---|---|---|---|---|---|---|---|
| Number of events | 32 | | | 37 | | | 27 | | |
| Collection period | Mar 2015–Aug 2016 | | | Mar 2015–Aug 2016 | | | Oct 2015–Aug 2016 | | |
| | min | mean | max | min | mean | max | min | mean | max |
| Rainfall (RF) | | | | | | | | | |
| total depth (mm) | 2.2 | 11.9 | 37.6 | 2.2 | 15.8 | 70.8 | 2.4 | 13.9 | 64.6 |
| duration (hour) | 1.0 | 14.2 | 61.2 | 1.6 | 17.8 | 84.4 | 1.7 | 15.3 | 75.6 |
| average intensity (mm/hour) | 0.3 | 1.7 | 10.8 | 0.2 | 1.4 | 5.6 | 0.3 | 1.2 | 3.1 |
| max intensity, 5 min (mm/hour) | 1.9 | 17.8 | 48.1 | 2.8 | 17.7 | 62.5 | 1.2 | 16.2 | 42.5 |
| ADWP^{a} (hour) | 8.1 | 105.1 | 792.3 | 8.9 | 117.6 | 1,099.9 | 7.5 | 118.1 | 796.4 |
| CSO volume | | | | | | | | | |
| total volume (m^{3}) | 686 | 33,200 | 147,245 | 546 | 28,727 | 425,821 | 628 | 12,341 | 113,440 |
| TSS | | | | | | | | | |
| total load (kg) | 132 | 8,297 | 41,536 | 129 | 8,100 | 99,060 | 144 | 6,454 | 32,268 |


^{a}ADWP: Antecedent Dry Weather Period.

### MLR model

The MLR model expresses the response variable as a linear combination of the explanatory variables:

*Y* = *b*_{0} + Σ_{i} *b*_{i}*X*_{i}

where:

- *Y* refers to the response variable (i.e. TSS load),
- *X*_{i} are explanatory or input variables obtained from rainfall characteristics,
- *b*_{0} and *b*_{i} are model parameters to be calibrated.

Developing the MLR model is performed using the Python Numpy and Scipy packages (Oliphant 2007) and contains three steps:

#### Pre-screening of candidate variables

Initially, the list of potential EVs contains all possible variables representing rainfall characteristics. To lessen the impact of multicollinearity, if two candidate variables are highly correlated (i.e. Pearson correlation coefficient is higher than 0.9), only one should be used to develop the model. After this pre-evaluation, the list of potential EVs is filtered and can be used in the next two steps of model development.
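The pre-screening step can be sketched as follows. `prescreen` is a hypothetical helper, and the greedy keep-first rule for deciding which of two highly correlated variables is retained is an assumption, as the paper does not specify it:

```python
import numpy as np

def prescreen(candidates, threshold=0.9):
    """Keep only one of each pair of highly correlated candidate EVs.

    candidates: dict mapping variable name -> sequence of per-event values.
    Variables are scanned in insertion order; a candidate is dropped when its
    absolute Pearson correlation with an already-kept variable exceeds the
    threshold (0.9 in the paper).
    """
    kept = {}
    for name, values in candidates.items():
        values = np.asarray(values, dtype=float)
        redundant = any(
            abs(np.corrcoef(values, kept_vals)[0, 1]) > threshold
            for kept_vals in kept.values()
        )
        if not redundant:
            kept[name] = values
    return list(kept)

# e.g. rainfall depth and duration strongly correlated -> duration dropped:
# prescreen({'depth': [...], 'duration': [...], 'intensity': [...]})
```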

#### Selection of EVs

The second step involves the use of the leave-one-out cross validation (LOOCV) technique (Rudemo 1982) and forward stepwise algorithm (FSA) to select only useful EVs for the MLR model. For each run of a LOOCV procedure, one CSO event is reserved for verification while all the remaining CSO events are utilized in calibration. The procedure involves multiple runs and completes when all CSO events take part in verification. The verification performances of all runs are then averaged to represent the model's predictive capacity through the LOOCV procedure. The application of LOOCV generally helps to account for uncertainty due to separation of data subsets for calibration and verification.

In addition, FSA is incorporated to reduce the efforts of searching for the best combination of EVs. In order to find the first EV of the MLR model, it is first assumed that the MLR model has solely one EV. Each variable from the list of potential EVs is tested by applying the LOOCV procedure. The selected EV is the one that gives the highest model's predictive capacity. Similarly, the MLR model is then assumed to contain two EVs (the first one is already known) and each variable from the list of remaining variables is added to the model together with the known EV. The results of the LOOCV procedure are then evaluated to identify the second EV, corresponding to the model with the highest predictive capacity. The selection of more EVs is repeated until the predictive capacity of the model cannot be further improved.
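A minimal sketch of the LOOCV-driven forward stepwise selection, assuming an ordinary least squares fit and pooled leave-one-out predictions scored against the observation mean; the function names are illustrative:

```python
import numpy as np

def loocv_score(X, y):
    """Predictive skill of an OLS fit under leave-one-out cross validation.

    Returns a Nash-Sutcliffe-like score computed on the pooled leave-one-out
    predictions (1 is perfect; <= 0 is no better than the observation mean).
    """
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        A = np.column_stack([np.ones(mask.sum()), X[mask]])  # intercept + EVs
        b, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        preds[i] = np.concatenate(([1.0], X[i])) @ b
    return 1 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)

def forward_stepwise(candidates, y):
    """Greedy forward selection of EVs driven by the LOOCV score."""
    names = list(candidates)
    selected, best_score = [], -np.inf
    while names:
        trial = [(loocv_score(np.column_stack(
                    [candidates[v] for v in selected + [name]]), y), name)
                 for name in names]
        score, name = max(trial)
        if score <= best_score:
            break  # no further improvement: stop adding EVs
        best_score = score
        selected.append(name)
        names.remove(name)
    return selected, best_score
```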

#### Splitting of dataset

The final step involves the application of hierarchical clustering to split the dataset for calibration and verification. The clustering method is adapted from Sun & Bertrand-Krajewski (2013) to group similar CSO events into distinct groups called clusters. In brief, the similarity between two CSO events is assessed by means of their standardized Euclidean distance. Using the computed distances, two similar CSO events are paired into a binary cluster. Similar binary clusters are gradually merged to form larger clusters until a hierarchical tree is built. Depending on the number of CSO events needed in calibration, the tree can be cut into different branches to form the expected number of clusters. If *k* CSO events are needed in calibration, the dataset is divided into *k* clusters and the most representative event from each cluster is picked for the calibration data subset. The remaining events are kept for the verification data subset.
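The clustering-based split can be sketched with SciPy's hierarchical clustering. The paper specifies standardized Euclidean distances; the choice of average linkage and of the centroid-nearest event as each cluster's representative are assumptions here:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_split(features, k):
    """Split events into k clusters and pick one representative per cluster.

    features: (n_events, n_features) array of event descriptors.
    Returns (calibration_idx, verification_idx).
    """
    dist = pdist(features, metric='seuclidean')   # standardized Euclidean
    tree = linkage(dist, method='average')        # build hierarchical tree
    labels = fcluster(tree, t=k, criterion='maxclust')  # cut into k clusters
    calib = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        centroid = features[members].mean(axis=0)
        # representative = member closest to the cluster centroid
        rep = members[np.argmin(
            np.linalg.norm(features[members] - centroid, axis=1))]
        calib.append(rep)
    calib = sorted(calib)
    verif = [i for i in range(len(features)) if i not in calib]
    return calib, verif
```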

In this study, calibration is performed with an increasing number *k* of CSO events until all events are chosen. In order to verify the proficiency of the cluster technique in selecting events for calibration, 2000 combinations of events are generated by random selection and used for calibration at each corresponding *k* number of calibration events. Higher numbers of the combinations of events were tested but given the size of the dataset, no significant variations in the results were observed. The resulting modified Nash-Sutcliffe efficiency indices (NS), as seen in Sun & Bertrand-Krajewski (2012), are summarized by five statistical values, i.e. minimum, lower quartile, median, upper quartile and maximum, and then compared with the modified NS given by using the cluster technique. Besides, these statistical values are also applicable for analysing the difference between MLR and RFR in terms of the range of error due to calibration data selection.

### RFR model

Random forest is an ensemble machine learning method based on bootstrap aggregation to enhance the decision tree method. By sampling with replacement of the original calibration data subset, a large number of trees are grown to maximum size without pruning, and aggregation is done by averaging the results from all trees. For each tree, the set of EVs utilized to determine the best split at each node is randomly chosen from the total number of EVs (Prasad *et al.* 2006). Due to bootstrapping, on average, approximately one-third of the events in calibration data subset are not used to build a tree. They are regarded as the out-of-bag (oob) events and used to compute the internal prediction performance of the model (i.e. oob score, indicated by NS value) and then to evaluate the prediction strength of each variable (i.e. feature importance), avoiding the need for any cross-validation method (Archer & Kimes 2008). Feature importance of a variable indicates how much the oob score would decrease if only that variable is permuted from the data subsets of oob events. More information on the conceptual aspects of RFR can be found in Grömping (2009).

The development of RFR is implemented by Python Scikit-learn package (Pedregosa *et al.* 2011) and involves an analogous three-step procedure as for MLR to allow straight assessments of the two approaches' performances. The main difference lies in the second step of selecting EVs, which utilizes RFR's measures of feature importance to eliminate trivial variables. Given the relatively small number of potential EVs, this study tailors the technique presented by Genuer *et al.* (2010) for the selection. Firstly, all the *M* potential EVs remaining after the first step of pre-screening are used as inputs of the RFR model. According to their feature importance outputs, they are subsequently ranked in a list of decreasing order of importance. Lastly, multiple runs of RFR models with increasing *m* number of inputs taken from the above list are performed (*m* = 1 to *M*). The variables involved in the model leading to the highest oob score are selected as the final EVs.
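The importance-ranking selection adapted from Genuer *et al.* (2010) can be sketched with scikit-learn; `select_rfr_inputs` is an illustrative helper, and the number of trees and random seed are arbitrary choices:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def select_rfr_inputs(X, y, names, n_trees=500, seed=42):
    """Rank candidate EVs by feature importance, then refit with an
    increasing number m of top-ranked inputs and keep the subset giving
    the highest out-of-bag (oob) score."""
    rf = RandomForestRegressor(n_estimators=n_trees, oob_score=True,
                               random_state=seed)
    rf.fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]  # decreasing importance
    best = (-np.inf, None)
    for m in range(1, X.shape[1] + 1):
        cols = order[:m]
        rf_m = RandomForestRegressor(n_estimators=n_trees, oob_score=True,
                                     random_state=seed)
        rf_m.fit(X[:, cols], y)
        if rf_m.oob_score_ > best[0]:
            best = (rf_m.oob_score_, [names[c] for c in cols])
    return best  # (best oob score, selected EV names)
```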

### Model performance indication

Model performance is assessed with the modified Nash-Sutcliffe efficiency (NS), computed with the mean of all observations as the reference:

NS = 1 − Σ_{i=1}^{n} *e*_{i}^{2} / Σ_{i=1}^{n} (*Y*_{i} − *μ*)^{2}

where:

- *N* is the total number of the available observations,
- *n* is the number of observations used for calibration or verification,
- *μ* is the mean of all *N* observed values of *Y*, and
- *e*_{i} is the residual error of observation *i*.

In addition, modified root-mean-square error (RMSE), which is the RMSE normalized by the mean value of the response variable (Dembélé *et al.* 2010), is computed. Because of the good agreement between the results from these two indicators, this paper only presents the figures showing modified NS values.
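As an illustration, both indicators can be computed as below. The exact form of the modified NS follows Sun & Bertrand-Krajewski (2012); the squared-residual form with a shared reference mean used here is an assumption:

```python
import numpy as np

def modified_ns(obs, sim, mu=None):
    """Modified Nash-Sutcliffe efficiency. `mu` is the mean of ALL observed
    values, so calibration and verification scores share one reference;
    it defaults to the mean of `obs` if not given."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    if mu is None:
        mu = obs.mean()
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - mu) ** 2)

def modified_rmse(obs, sim):
    """RMSE normalized by the mean of the observed response variable."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return np.sqrt(np.mean((obs - sim) ** 2)) / obs.mean()
```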

### Assessment of uncertainty

Based on the results obtained by hierarchical clustering, one RFR model and one MLR model are chosen for each CSO to assess their uncertainties. A chosen model should be amongst the models with leading verification performances while its calibration performance remains acceptable. Furthermore, the size of its verification data subset ought to be at least 10% of the dataset.

The method of assessment requires each EV of the model to be associated with a probability density function (pdf). As an initial hypothesis, the pdf for each EV is assumed as normal (as applied e.g. by Dembélé *et al.* 2011) with a standard deviation of 10% of the mean value. Review of previous work shows that rainfall measurement uncertainty can vary in the order of 0–50% (Willems 2001), but the typical values are within the range of 0–15% (Habib *et al.* 2001; Weerts & Serafy 2006). The selected value of 10% is therefore considered as realistic and was used previously by Leonhardt *et al.* (2014). For the MLR approach, the 95% confidence interval of the response variable is derived by propagating the pdf associated to EVs using Monte Carlo (MC) simulations of 10^{6} runs (Muste *et al.* 2012). For the RFR approach, estimation of uncertainty is far more challenging due to its non-parametric nature. In this study, standard errors in an RFR prediction are obtained by applying the Jackknife and infinitesimal Jackknife procedures proposed by Wager *et al.* (2014). 1,000 runs of MC simulations for the RFR model containing 1,000 trees are carried out to propagate pdf of EVs and assess the 95% confidence interval.
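The MC propagation for the MLR approach can be sketched as follows. `predict` stands for any calibrated model, the linear model in the trailing comment is purely illustrative, and the default number of runs is reduced from the paper's 10^6 for brevity:

```python
import numpy as np

def mc_confidence_interval(predict, ev_means, rel_sd=0.10,
                           n_runs=10_000, seed=1):
    """Propagate normal pdfs on the EVs through a model by Monte Carlo and
    return the 95% confidence interval of the predicted load.

    predict: callable mapping an (n_runs, n_ev) array of EV samples to
    predicted loads. Each EV ~ N(mean, (rel_sd * mean)^2), i.e. a standard
    deviation of 10% of the mean value as assumed in the paper.
    """
    rng = np.random.default_rng(seed)
    ev_means = np.asarray(ev_means, float)
    samples = rng.normal(ev_means, rel_sd * np.abs(ev_means),
                         size=(n_runs, ev_means.size))
    loads = predict(samples)
    return np.percentile(loads, [2.5, 97.5])

# e.g. a calibrated MLR model load = b0 + b1*depth + b2*duration
# (coefficients purely illustrative):
# lo, hi = mc_confidence_interval(
#     lambda s: 100 + 500 * s[:, 0] + 20 * s[:, 1], ev_means=[12.0, 14.0])
```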

## RESULTS AND DISCUSSION

### Selection of EVs

Pre-evaluation shows that model performance using the existing dataset is generally higher with Equation (2), which is therefore chosen for MLR model development. After pre-screening of candidate variables, the potential EVs are obtained and listed in the first column of Table 2. From this list, EV selection is applied to identify the useful EVs for each model. Table 2 shows the selected EVs for the MLR and RFR models of the three CSO structures. Notably, the models for the three CSO structures do not share the same type and number of variables, although the catchments of these structures are contiguous, with similar land use activities, and a majority of their rainfall, flow and quality characteristics are comparable (see Table 1). For Lauzun, the lower number of EVs compared to the other sites is due to the substantially high correlation between TSS load and rainfall depth (the Pearson correlation coefficient reaches 0.92). MLR appears to require more EVs than RFR, apart from the Lauzun case. It is equally interesting to observe that RFR tends to prefer input variables having higher correlation with the measurand and excludes ADWP-related candidates.

| Variables | Peugue | | Naujac | | Lauzun | |
|---|---|---|---|---|---|---|
| | MLR | RFR | MLR | RFR | MLR | RFR |
| ADWP | ✓ | | | | | |
| RF depth | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| RF duration | ✓ | ✓ | ✓ | ✓ | | |
| Average RF intensity | ✓ | ✓ | | | | |
| Max RF intensity, 5 mins | ✓ | | | | | |
| Max RF intensity, 15 mins | | | | | | |
| Max RF intensity, 30 mins | ✓ | ✓ | ✓ | | | |
| ADWP × last event's RF depth | | | | | | |
| ADWP × last storm event's RF duration | | | | | | |
| ADWP × last storm event's average RF intensity | | | | | | |
| ADWP × last storm event's max RF intensity, 5 mins | ✓ | ✓ | | | | |
| ADWP × last storm event's max RF intensity, 15 mins | | | | | | |
| ADWP × last storm event's max RF intensity, 30 mins | ✓ | | | | | |


Tick symbol indicates that a variable is selected as an EV for a model.

### Dataset division for calibration and verification

Model calibration and verification by the two approaches are performed with increasing sizes of the calibration data subset, using both random selection and the cluster technique. The resulting NS values are illustrated in Figures 2 and 3. For plotting the outputs, the number of CSO events used for calibration starts from at least the number of EVs selected by either MLR or RFR, whichever is higher. The cluster technique clearly demonstrates improved prediction ability (lower errors) compared to random selection, as its verification NS values are mostly higher than the corresponding median values. The box plots indicate that MLR typically shows lower performance in calibration compared to RFR; nevertheless, in terms of verification, MLR provides a narrower range of NS values. Its median NS values are usually higher than those of RFR in the cases of Peugue and Lauzun, where there is higher correlation between explanatory and response variables. On the other hand, RFR better captures the non-linearity of the data in the Naujac case. Error in the verification results generally decreases as more calibration events are added, and appears to be lower with MLR.

### Model uncertainty

MC simulation results are used to estimate the expanded uncertainty (with a coverage factor of 2, as applied in Muste *et al.* (2012)) for each CSO event. Expanded uncertainties are presented both as percentages and in kg. For the RFR models, the average expanded uncertainties of verification events at Peugue, Naujac, and Lauzun are about 50% [±2,781 kg], 28% [±790 kg], and 18% [±707 kg] respectively. Most MLR models yield high expanded uncertainties; the lowest belongs to the Lauzun model, at 78% [±1,958 kg]. The MLR model for Naujac does not yet seem satisfactory due to its very high expanded uncertainty [±35,402 kg].

Given the widely varying magnitude of CSO events, from very small to large ones, it is convenient and operationally practical to sort the expanded uncertainties according to ranges of simulated load values, as in Table 3. Specifically, all CSO events within the same range of simulated loads (i.e. 0–5,000 kg, 5,000–10,000 kg, or >10,000 kg) are considered one group, and the group's expanded uncertainty is obtained by averaging the individual expanded uncertainties. The expanded uncertainties of the RFR models are globally lower than those of the MLR models, in both calibration and verification results. This could reflect the benefit of RFR's ensemble averaging in lessening the effect of measurement uncertainty. Figure 4 gives an example of the expanded uncertainties of the MLR and RFR models at Lauzun. High expanded uncertainties are observed in the MLR modelling results; low-TSS-load events (less than 5,000 kg) in particular show significantly larger expanded uncertainties than the others, approximately 85% [±1,940 kg] versus 30% [±2,068 kg] for the other events. It is also interesting to note that RFR can be significantly more reliable for some predictions than for others. This is consistent with the observation by Wager *et al.* (2014) that RFR gives wider confidence intervals when the predicted values differ considerably from the measured ones.

| CSO events' loads (kg) | Expanded uncertainty in calibration results (±kg) | | | | | | Expanded uncertainty in verification results (±kg) | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | PEUGUE | | NAUJAC | | LAUZUN | | PEUGUE | | NAUJAC | | LAUZUN | |
| | MLR | RFR | MLR | RFR | MLR | RFR | MLR | RFR | MLR | RFR | MLR | RFR |
| 0–5,000 | 4,689 | 2,627 | 15,358 | 1,241 | 1,944 | 647 | 3,590 | 1,521 | 14,937 | 790 | 1,940 | 398 |
| 5,000–10,000 | 5,599 | 3,385 | 12,140 | 4,595 | 2,310 | 3,459 | 8,613 | 4,760 | N.A.^{a} | N.A. | 2,068 | 3,054 |
| >10,000 | 6,746 | 6,547 | 18,023 | 12,393 | 4,557 | 3,471 | 10,335 | 10,156 | 59,960 | N.A. | N.A. | 32 |
| Average | 5,593 | 4,346 | 15,806 | 4,276 | 2,658 | 2,443 | 6,235 | 2,781 | 35,402 | 790 | 1,958 | 707 |


^{a}N.A.: not available because the measured loads of selected events for verification do not fall in the considered range.

## CONCLUSIONS AND PERSPECTIVES

This study on modelling TSS loads at three CSO structures in the Louis Fargue catchment, Bordeaux, France, shows that the set of EVs depends on both the modelling approach (MLR or RFR) and the CSO structure. For the available dataset, RFR requires fewer EVs than MLR. The cluster technique for selecting representative events for model calibration clearly enhances model predictability for both approaches. MLR shows smaller variances of error due to calibration data selection, and higher performances in verification results. However, the uncertainties of simulated values in both calibration and verification of the RFR models are lower than those of the MLR models.

Given the current results, this paper provides a valuable means of evaluating CSO loads by regression models. In addition to the typical use of MLR, RFR may be an interesting alternative worth considering. Further improvement of the models for operational purposes can be achieved by applying a sensitivity analysis of the impact of the standard uncertainty of rainfall measurements, and by expanding the dataset through further data collection and analysis to include more CSO events. The developed regression models with better performance are intended for later application in a simplified water-quality-based RTC strategy. Using the rainfall forecast, the total CSO load of each sub-catchment can be quickly computed and recursively updated, facilitating decisions on allocating excess storm water and prioritizing available storage capacities between sub-catchments according to the magnitude of their CSO loads.

## ACKNOWLEDGEMENTS

The authors wish to thank Bordeaux Métropole, Société de Gestion de l'Assainissement de Bordeaux Métropole (SGAC), and the LIFE programme of the European Commission (through the LIFE EFFIDRAIN project, ref number: LIFE14 ENV/ES/000860) for their financial and technical support.
