Abstract

Quantifying pollutant loads from combined sewer overflows (CSOs) is necessary for assessing the impacts of urban drainage on receiving water bodies. Based on data obtained at three adjacent CSO structures in the Louis Fargue catchment in Bordeaux, France, this study implements multiple linear regression (MLR) and random forest regression (RFR) approaches to develop statistical models for estimating emitted loads of total suspended solids (TSS). Model development includes a comparison between hierarchical clustering and random selection of CSO events for model calibration. The results indicate that the selection of the model's explanatory variables depends on both the type of approach and the CSO structure. Using the cluster technique to select representative events for model calibration generally improves model predictability. For the available dataset, MLR may have advantages over RFR in terms of verification performance and a lower range of error due to the splitting of events between calibration and verification; however, the RFR uncertainty bands are considerably narrower than the MLR ones.

INTRODUCTION

CSOs have long been known as a major cause of receiving water quality degradation. Their processes and effects on receiving waters have been comprehensively investigated and documented in previous papers, e.g. House et al. (1993) and Métadier & Bertrand-Krajewski (2011). Several solutions exist to alleviate CSO effects; typical ones include disconnecting impervious surfaces, upgrading the treatment capacity of the wastewater treatment plant (WWTP), adding retention tanks or sewer tunnels, and real time control (RTC) of the integrated system. In Bordeaux, France, hydraulics-based RTC has been implemented and operated in one of the city's catchments, Louis Fargue, since 2012 (Andréa et al. 2013). Within the ongoing European LIFE EFFIDRAIN project (www.life-effidrain.eu/en), the same catchment will be used to experiment with a quality-based RTC approach. As part of this project, one objective is to develop local statistical models to predict total suspended solids (TSS) loads caused by CSO events, based on rainfall parameters derived from rain gauge measurements.

Multiple options for developing statistical storm water quality models can be found in the literature, and regression is considered an effective and practical way to use existing data to derive relations between variables (Vaze & Chiew 2003). Using rainfall and/or hydraulic observations, several regression models have been applied in past publications to calculate storm event loads or event mean concentrations. However, few regression models use the same information to predict the pollutant load of a CSO event; one example is the partial least squares regression models of Sandoval et al. (2013). This study therefore proposes and compares two approaches for estimating CSO loads: the MLR (multiple linear regression) approach implemented by Sun & Bertrand-Krajewski (2013) and the RFR (random forest regression) approach proposed by Breiman (2001).

This paper addresses two issues central to the predictive ability of any regression model: the proper selection of explanatory variables (EVs) and of the calibration data subset. When there are many candidate EVs, a strategy to balance between under-fitting (too few inputs) and over-fitting (too many inputs) should be identified (McQuarrie & Tsai 1998). Furthermore, different ways of splitting the data into calibration and verification subsets result in different model parameter estimates. Data splitting can be performed in several ways; for instance, chronological splitting groups all observations before a specific point in time for calibration and reserves the later observations for verification. However, this method, like random splitting methods, can degrade the predictive capacity of the model if the calibration data subset contains only observations within a narrow range of recorded values (Montgomery & Peck 1992).

MATERIAL AND METHODS

Study area and data-base

The Louis Fargue urban catchment has an area of 7,700 ha and a population of approximately 300,000 inhabitants. Its total sewer length is about 1,340 km, of which 80% are combined and 20% are separate sewers. Its main outlet is the Garonne River. Along the river, there are five CSO structures within the catchment; due to data availability, this study focuses on only three of them. Figure 1 displays the locations of these CSO structures, Peugue, Naujac and Lauzun, together with the positions of the nearby rain gauges used to analyse rainfall characteristics. The Abria rain gauge is used for the Peugue CSO, Paulin for the Naujac CSO, and Louis Fargue for the Lauzun CSO. The distances between Peugue and Naujac and between Naujac and Lauzun are around 1 km and 4.3 km respectively.

Figure 1

Louis Fargue catchment and locations of the CSO structures and nearby rain gauges.


Each CSO structure is equipped with a nephelometric turbidimeter and an ultrasonic flow meter, allowing continuous measurement of turbidity and flow rate downstream of each CSO weir. The recording time step of these sensors is set at 5 minutes, a reasonable balance between measurement frequency for a 7,700 ha catchment and required storage memory. TSS concentration is estimated through local calibration, which derives a correlation function between the data provided by the online turbidity sensor and wet weather sampling campaigns for each CSO (Caradot et al. 2015; Lepot et al. 2016). Wet weather sampling campaigns are key for local calibration because laboratory-analysed values from their grab samples are needed to determine the relationship with the values recorded by the turbidimeter. The resulting correlations for the four CSO sites are presented in Maruéjouls et al. (2017). Table 1 summarises the characteristics and measurement ranges of the rainfall, volume and TSS load data from the CSO events selected for developing the models: 32, 37 and 27 events for Peugue, Naujac and Lauzun respectively, collected between March 2015 and October 2016. Other CSO events were excluded from this study because of gaps in the measurements.

Table 1

Rainfall characteristics, volume and TSS load of selected CSO events at three CSO structures

                                    Peugue                 Naujac                 Lauzun
Number of events                    32                     37                     27
Collection period                   Mar 2015-Aug 2016      Mar 2015-Aug 2016      Oct 2015-Aug 2016

                                    min    mean   max      min    mean   max      min    mean   max
Rainfall (RF)
  total depth (mm)                  2.2    11.9   37.6     2.2    15.8   70.8     2.4    13.9   64.6
  duration (hour)                   1.0    14.2   61.2     1.6    17.8   84.4     1.7    15.3   75.6
  average intensity (mm/hour)       0.3    1.7    10.8     0.2    1.4    5.6      0.3    1.2    3.1
  max intensity, 5 min (mm/hour)    1.9    17.8   48.1     2.8    17.7   62.5     1.2    16.2   42.5
  ADWP* (hour)                      8.1    105.1  792.3    8.9    117.6  1,099.9  7.5    118.1  796.4
CSO volume
  total volume (m³)                 686    33,200 147,245  546    28,727 425,821  628    12,341 113,440
TSS
  total load (kg)                   132    8,297  41,536   129    8,100  99,060   144    6,454  32,268

* ADWP: Antecedent Dry Weather Period.

MLR model

The two most common forms of the MLR model equation are:

Y = b_0 + \sum_i b_i X_i \qquad (1)

or

Y = b_0 \prod_i X_i^{b_i} \qquad (2)

where Y refers to the response variable (i.e. TSS load), X_i are the explanatory or input variables obtained from rainfall characteristics, and b_0 and b_i are model parameters to be calibrated.
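As an illustration of how Equation (2) could be calibrated, the following minimal sketch fits the power form by ordinary least squares after a log transform (valid here because rainfall-derived EVs are strictly positive). The fitting procedure is an assumption, as the paper does not state how its MLR models are calibrated.

```python
import numpy as np

def fit_power_mlr(X, y):
    """Fit Y = b0 * prod(Xi**bi) (Equation (2)) by log-linear least squares.

    X: (n_events, n_EVs) strictly positive EV values; y: observed TSS loads.
    NOTE: illustrative only; the paper does not state its fitting procedure.
    """
    A = np.column_stack([np.ones(len(y)), np.log(X)])   # log-linearized design matrix
    coeffs, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
    return np.exp(coeffs[0]), coeffs[1:]                # b0 and (b1, ..., bp)

def predict_power_mlr(b0, bi, X):
    """Back-transform predictions to the original TSS load scale."""
    return b0 * np.prod(X ** bi, axis=1)
```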

The MLR model is developed using the Python Numpy and Scipy packages (Oliphant 2007) in three steps:

Pre-screening of candidate variables

Initially, the list of potential EVs contains all possible variables representing rainfall characteristics. To lessen the impact of multicollinearity, if two candidate variables are highly correlated (i.e. Pearson correlation coefficient is higher than 0.9), only one should be used to develop the model. After this pre-evaluation, the list of potential EVs is filtered and can be used in the next two steps of model development.
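A minimal sketch of this pre-screening, assuming the candidate variables are held in a Numpy array; the 0.9 threshold is the one stated above, while the greedy keep-first rule is an assumption.

```python
import numpy as np

def prescreen(candidates, names, threshold=0.9):
    """Keep only candidate EVs that are not highly correlated (|Pearson r| >
    threshold) with an EV already kept. Greedy keep-first rule assumed.

    candidates: (n_events, n_vars) array of candidate EV values.
    """
    r = np.corrcoef(candidates, rowvar=False)   # Pearson correlation matrix
    kept = []
    for j in range(len(names)):
        # keep variable j only if no already-kept variable is highly correlated with it
        if all(abs(r[j, k]) <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]
```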

Selection of EVs

The second step involves the use of the leave-one-out cross validation (LOOCV) technique (Rudemo 1982) and forward stepwise algorithm (FSA) to select only useful EVs for the MLR model. For each run of a LOOCV procedure, one CSO event is reserved for verification while all the remaining CSO events are utilized in calibration. The procedure involves multiple runs and completes when all CSO events take part in verification. The verification performances of all runs are then averaged to represent the model's predictive capacity through the LOOCV procedure. The application of LOOCV generally helps to account for uncertainty due to separation of data subsets for calibration and verification.

In addition, FSA is incorporated to reduce the effort of searching for the best combination of EVs. To find the first EV of the MLR model, it is first assumed that the model has only one EV; each variable from the list of potential EVs is tested by applying the LOOCV procedure, and the selected EV is the one giving the highest predictive capacity. The MLR model is then assumed to contain two EVs (the first already known), and each remaining variable is added to the model together with the known EV; the LOOCV results are evaluated to identify the second EV, corresponding to the model with the highest predictive capacity. The selection of further EVs is repeated until the predictive capacity of the model cannot be improved any more. A simplified sketch of this procedure is given below.
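The sketch below combines LOOCV and FSA as described above. It is a simplified illustration: the linear form of Equation (1) is fitted by least squares, and predictive capacity is scored by negative mean squared prediction error as a stand-in for the modified NS index defined later in the paper.

```python
import numpy as np

def loocv_score(X, y):
    """Average predictive capacity over a leave-one-out loop
    (negative mean squared prediction error as a stand-in score)."""
    n = len(y)
    sq_errors = []
    for i in range(n):
        mask = np.arange(n) != i                          # leave event i out
        A = np.column_stack([np.ones(mask.sum()), X[mask]])
        b, *_ = np.linalg.lstsq(A, y[mask], rcond=None)   # calibrate on the rest
        pred = np.concatenate([[1.0], X[i]]) @ b          # verify on event i
        sq_errors.append((y[i] - pred) ** 2)
    return -np.mean(sq_errors)

def forward_stepwise(X_all, y):
    """Greedily add the EV that most improves the LOOCV score;
    stop when no remaining variable improves it further."""
    selected, best = [], -np.inf
    remaining = list(range(X_all.shape[1]))
    while remaining:
        scores = {j: loocv_score(X_all[:, selected + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:
            break                                         # no further improvement
        best = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected
```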

Splitting of dataset

The final step involves the application of hierarchical clustering to split the dataset for calibration and verification. The clustering method is adapted from Sun & Bertrand-Krajewski (2013) to group similar CSO events into distinct groups called clusters. In brief, the similarity between two CSO events is assessed by means of their standardized Euclidean distance. Using the computed distances, two similar CSO events are paired into a binary cluster. Similar binary clusters are gradually merged to form larger clusters until a hierarchical tree is built. Depending on the number of CSO events needed in calibration, the tree can be cut into different branches to form the expected number of clusters. If k CSO events are needed in calibration, the dataset is divided into k clusters and the most representative event from each cluster is picked for the calibration data subset. The remaining events are kept for the verification data subset.
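A minimal sketch of this splitting step with Scipy, assuming average linkage and taking the event closest to the cluster centre as the "most representative" one; neither choice is specified in the paper, so both are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_split(features, k):
    """Split events into k calibration events (one representative per
    cluster) and the remaining verification events.

    features: (n_events, n_vars) array describing each CSO event.
    """
    d = pdist(features, metric='seuclidean')            # standardized Euclidean distance
    tree = linkage(d, method='average')                 # assumed linkage criterion
    labels = fcluster(tree, t=k, criterion='maxclust')  # cut the tree into k clusters
    calibration = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        centre = features[members].mean(axis=0)
        dist = np.linalg.norm(features[members] - centre, axis=1)
        calibration.append(members[np.argmin(dist)])    # closest to cluster centre
    verification = np.setdiff1d(np.arange(len(features)), calibration)
    return np.array(calibration), verification
```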

In this study, calibration is performed with an increasing number k of CSO events until all events are chosen. To verify the proficiency of the cluster technique in selecting calibration events, 2,000 combinations of events are generated by random selection and used for calibration at each corresponding number k of calibration events; higher numbers of combinations were tested, but given the size of the dataset no significant variations in the results were observed. The resulting modified Nash-Sutcliffe efficiency indices (NS), as defined in Sun & Bertrand-Krajewski (2012), are summarized by five statistics, i.e. minimum, lower quartile, median, upper quartile and maximum, and then compared with the modified NS obtained with the cluster technique. These statistics are also used to analyse the difference between MLR and RFR in terms of the range of error due to calibration data selection.
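A sketch of this random-selection benchmark, assuming a user-supplied `evaluate` callable (a hypothetical helper) that calibrates a model on one subset and returns the modified NS of the other:

```python
import numpy as np

def random_selection_benchmark(n_events, k, evaluate, n_combinations=2000, seed=0):
    """Draw random calibration subsets of size k and summarize the resulting
    verification scores by min, lower quartile, median, upper quartile, max.

    evaluate(cal_idx, ver_idx) -> modified NS of the verification subset
    (hypothetical helper supplied by the user).
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_combinations):
        cal = rng.choice(n_events, size=k, replace=False)       # random calibration events
        ver = np.setdiff1d(np.arange(n_events), cal)            # the rest for verification
        scores.append(evaluate(cal, ver))
    return np.percentile(scores, [0, 25, 50, 75, 100])          # the five statistics
```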

RFR model

Random forest is an ensemble machine learning method that uses bootstrap aggregation to enhance the decision tree method. By sampling with replacement from the original calibration data subset, a large number of trees are grown to maximum size without pruning, and aggregation is done by averaging the results from all trees. For each tree, the set of EVs used to determine the best split at each node is randomly chosen from the total set of EVs (Prasad et al. 2006). Due to bootstrapping, on average approximately one-third of the events in the calibration data subset are not used to build a given tree. These are regarded as the out-of-bag (oob) events and are used to compute the internal prediction performance of the model (i.e. the oob score, indicated by an NS value) and then to evaluate the prediction strength of each variable (i.e. feature importance), avoiding the need for any cross-validation method (Archer & Kimes 2008). The feature importance of a variable indicates how much the oob score would decrease if only that variable were permuted in the data subsets of the oob events. More information on the conceptual aspects of RFR can be found in Grömping (2009).

The RFR models are developed with the Python Scikit-learn package (Pedregosa et al. 2011), following a three-step procedure analogous to that of MLR to allow direct comparison of the two approaches' performances. The main difference lies in the second step of selecting EVs, which uses RFR's measures of feature importance to eliminate trivial variables. Given the relatively small number of potential EVs, this study tailors the technique presented by Genuer et al. (2010) for the selection. First, all M potential EVs remaining after the pre-screening step are used as inputs of the RFR model. According to the resulting feature importances, they are ranked in decreasing order of importance. Then, multiple runs of RFR models with an increasing number m of inputs taken from this list are performed (m = 1 to M). The variables involved in the model with the highest oob score are selected as the final EVs.
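A minimal sketch of this ranking-and-rerun procedure with Scikit-learn. Note that `RandomForestRegressor.oob_score_` is an out-of-bag R² score, used here as a stand-in for the NS-valued oob score described above; the hyperparameters are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rfr_select(X, y, names):
    """Rank EVs by feature importance, then keep the leading subset
    that maximizes the out-of-bag score."""
    rf = RandomForestRegressor(n_estimators=1000, oob_score=True, random_state=0)
    rf.fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]   # decreasing importance
    best_m, best_score = 1, -np.inf
    for m in range(1, X.shape[1] + 1):                  # rerun with the m leading EVs
        sub = RandomForestRegressor(n_estimators=1000, oob_score=True,
                                    random_state=0)
        sub.fit(X[:, order[:m]], y)
        if sub.oob_score_ > best_score:
            best_m, best_score = m, sub.oob_score_
    return [names[j] for j in order[:best_m]]
```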

Model performance indication

The modified NS, in which the variance of all observations is used for normalization (Sun & Bertrand-Krajewski 2012), is applied as the goodness-of-fit indicator for all models to enable direct comparison and interpretation across data subsets of the same or different sizes. Without this normalization, a CSO event with a very large TSS load may substantially increase the overall variance of observed values in the selected data subset, resulting in a high NS value even when the model performance is not satisfactory, which may bias the judgement when comparing the performances of different models. The modified NS is defined as:

NS = 1 - \frac{\frac{1}{n}\sum_{j=1}^{n} e_j^2}{\frac{1}{N}\sum_{j=1}^{N} (Y_j - \mu)^2} \qquad (3)

where N is the total number of available observations, n is the number of observations used for calibration or verification, μ is the mean of all observed values of Y, and e_j is the residual error of observation j.
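Equation (3) translates directly into a few lines of Numpy; this sketch assumes the observed and simulated loads of the evaluated subset, plus the full vector of observed loads, are available as arrays.

```python
import numpy as np

def modified_ns(y_obs, y_sim, y_all):
    """Modified Nash-Sutcliffe index (Equation (3)): residual variance of the
    evaluated subset normalized by the variance of all N observations."""
    e = y_obs - y_sim                    # residual errors of the n evaluated events
    mu = np.mean(y_all)                  # mean of all N observed loads
    return 1.0 - np.mean(e ** 2) / np.mean((y_all - mu) ** 2)
```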

In addition, the modified root-mean-square error (RMSE), i.e. the RMSE normalized by the mean value of the response variable (Dembélé et al. 2010), is computed. Because the results from these two indicators agree well, this paper only presents figures showing modified NS values.

Assessment of uncertainty

Based on the results obtained by hierarchical clustering, one RFR model and one MLR model are chosen for each CSO to assess their uncertainties. A chosen model should be amongst the models with leading verification performances while its calibration performance remains acceptable. Furthermore, the size of its verification data subset ought to be at least 10% of the dataset.

The method of assessment requires each EV of the model to be associated with a probability density function (pdf). As an initial hypothesis, the pdf of each EV is assumed to be normal (as applied e.g. by Dembélé et al. 2011) with a standard deviation of 10% of the mean value. A review of previous work shows that rainfall measurement uncertainty can vary in the range of 0–50% (Willems 2001), but typical values are within 0–15% (Habib et al. 2001; Weerts & Serafy 2006). The selected value of 10% is therefore considered realistic and was used previously by Leonhardt et al. (2014). For the MLR approach, the 95% confidence interval of the response variable is derived by propagating the pdfs associated with the EVs using Monte Carlo (MC) simulations of 10^6 runs (Muste et al. 2012). For the RFR approach, estimation of uncertainty is far more challenging due to its non-parametric nature. In this study, standard errors of an RFR prediction are obtained by applying the jackknife and infinitesimal jackknife procedures proposed by Wager et al. (2014). 1,000 MC runs of the RFR model containing 1,000 trees are carried out to propagate the pdfs of the EVs and assess the 95% confidence interval.
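A minimal sketch of the MC propagation for the MLR case, assuming a fitted model is available as a vectorized `predict` callable (a hypothetical name) and using the normal input pdfs with a standard deviation of 10% of the mean described above:

```python
import numpy as np

def mc_confidence_interval(predict, x_measured, n_runs=10**6, rel_sd=0.10, seed=1):
    """Propagate normal input pdfs through a fitted model and return the
    95% confidence interval of the predicted TSS load for one CSO event.

    predict: vectorized callable mapping an (n_runs, n_EVs) array to loads.
    x_measured: measured EV values of the event (means of the input pdfs).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=x_measured,
                   scale=rel_sd * np.abs(x_measured),   # sd = 10% of the mean
                   size=(n_runs, len(x_measured)))
    y = predict(x)
    return np.percentile(y, [2.5, 97.5])                # 95% interval bounds
```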

RESULTS AND DISCUSSION

Selection of EVs

Pre-evaluation shows that model performance with the existing dataset is generally higher with Equation (2), which is therefore chosen for MLR model development. After pre-screening of candidate variables, the potential EVs are listed in the first column of Table 2. From this list, the selection procedure identifies the useful EVs for each model. Table 2 shows the EVs selected for the MLR and RFR models of the three CSO structures. The models for the three structures do not share the same type and number of variables, although the catchments of these structures are contiguous, have similar land use activities, and have comparable rainfall, flow and quality characteristics (see Table 1). For Lauzun, the lower number of EVs compared to the other sites is due to the substantially high correlation between TSS load and rainfall depth (Pearson correlation coefficient of up to 0.92). MLR appears to require more EVs than RFR, except in the Lauzun case. It is equally interesting to observe that RFR tends to prefer input variables with higher correlation to the measurand and excludes ADWP-related candidates.

Table 2

Candidate EVs and EVs selected by the two approaches for each CSO structure model

Variables                                             Peugue      Naujac      Lauzun
                                                      MLR   RFR   MLR   RFR   MLR   RFR
ADWP                                                  ✓
RF depth                                              ✓     ✓     ✓     ✓     ✓     ✓
RF duration                                           ✓           ✓     ✓           ✓
Average RF intensity                                        ✓     ✓
Max RF intensity, 5 mins                                                ✓
Max RF intensity, 15 mins
Max RF intensity, 30 mins                                   ✓     ✓     ✓
ADWP x last event's RF depth
ADWP x last storm event's RF duration
ADWP x last storm event's average RF intensity
ADWP x last storm event's max RF intensity, 5 mins    ✓           ✓
ADWP x last storm event's max RF intensity, 15 mins
ADWP x last storm event's max RF intensity, 30 mins               ✓

Tick symbol indicates that a variable is selected as an EV for a model.

Dataset division for calibration and verification

Model calibration and verification with the two approaches are performed with increasing sizes of the calibration data subset, using both random selection and the cluster technique. The resulting NS values are illustrated in Figures 2 and 3. For plotting, the number of CSO events used for calibration starts from at least the number of EVs selected by MLR or RFR, whichever is higher. The cluster technique clearly demonstrates improved prediction ability (lower errors) compared to random selection, as its verification NS values are mostly higher than the corresponding medians of the random splits. The box plots indicate that MLR typically shows lower calibration performance than RFR; nevertheless, in verification, MLR provides a narrower range of NS values. Its median NS values are usually higher than those of RFR in the cases of Peugue and Lauzun, where the correlation between explanatory and response variables is higher. Conversely, RFR better captures the non-linearity of the data in the Naujac case. The error of the verification results generally decreases as more calibration events are added, and appears lower with MLR.

Figure 2

Calibration results at the three CSO structures obtained by the two approaches, left charts: modified NS by MLR and right charts: modified NS by RFR.


Figure 3

Verification results at the three CSO structures obtained by the two approaches, left charts: modified NS by MLR and right charts: modified NS by RFR.


Model uncertainty

MC simulation results are used to estimate the expanded uncertainty (with a coverage factor of 2, as applied in Muste et al. (2012)) for each CSO event. Expanded uncertainties are presented both as percentages and in kg. For the RFR models, the average expanded uncertainties of the verification events at Peugue, Naujac and Lauzun are about 50% [±2,781 kg], 28% [±790 kg] and 18% [±707 kg] respectively. Most MLR models give high expanded uncertainties; the lowest belongs to the Lauzun model, at 78% [±1,958 kg]. The MLR model for Naujac does not yet appear satisfactory, given its very high expanded uncertainty [±35,402 kg].

Given the widely varying magnitude of CSO events, from very small to large ones, it is convenient and operationally practical to group the expanded uncertainties according to ranges of simulated load values, as in Table 3. All CSO events within the same range of simulated loads (i.e. 0–5,000 kg, 5,000–10,000 kg or >10,000 kg) are considered one group, and the group's expanded uncertainty is obtained by averaging the individual expanded uncertainties. The expanded uncertainties of the RFR models are globally lower than those of the MLR models, in both calibration and verification. This could reflect the benefit of RFR's ensemble structure in lessening the effect of measurement uncertainty. Figure 4 gives an example of the expanded uncertainties of the MLR and RFR models at Lauzun. High expanded uncertainties are observed in the MLR results; low-TSS-load events (less than 5,000 kg) in particular show significantly larger expanded uncertainties than the others: approximately 85% [±1,940 kg] versus 30% [±2,068 kg] for the other events. It is also interesting to note that RFR can be significantly more reliable for some predictions than for others, which is consistent with the observation by Wager et al. (2014) that RFR gives wider confidence intervals when the predicted values differ considerably from the measured ones.

Table 3

Expanded uncertainties of the two models according to the ranges of the simulated loads of CSO events

Expanded uncertainty in calibration results (±kg)

CSO events' loads (kg)   Peugue           Naujac           Lauzun
                         MLR     RFR      MLR     RFR      MLR     RFR
0–5,000                  4,689   2,627    15,358  1,241    1,944   647
5,000–10,000             5,599   3,385    12,140  4,595    2,310   3,459
>10,000                  6,746   6,547    18,023  12,393   4,557   3,471
Average                  5,593   4,346    15,806  4,276    2,658   2,443

Expanded uncertainty in verification results (±kg)

CSO events' loads (kg)   Peugue           Naujac           Lauzun
                         MLR     RFR      MLR     RFR      MLR     RFR
0–5,000                  3,590   1,521    14,937  790      1,940   398
5,000–10,000             8,613   4,760    N.A.*   N.A.     2,068   3,054
>10,000                  10,335  10,156   59,960  N.A.     N.A.    32
Average                  6,235   2,781    35,402  790      1,958   707

* N.A.: not available because the measured loads of the events selected for verification do not fall in the considered range.

Figure 4

Observed versus simulated results with 95% confidence intervals at Lauzun CSO structure.


Conclusions and perspectives

This study on modelling TSS loads at three CSO structures in the Louis Fargue catchment, Bordeaux, France, shows that the set of EVs depends on both the modelling approach (MLR or RFR) and the CSO structure. For the available dataset, RFR requires fewer EVs than MLR. The cluster technique for selecting representative events for model calibration clearly enhances the predictability of both approaches. MLR shows smaller variances of error due to calibration data selection and higher verification performances. However, the uncertainties of the simulated values, in both calibration and verification, are lower for the RFR models than for the MLR models.

Given the current results, this paper provides a valuable means of evaluating CSO loads with regression models. In addition to the typical use of MLR, RFR may be an interesting alternative worth considering. Further improvement of the models for operational purposes can be achieved by applying a sensitivity analysis of the impact of the standard uncertainty of rainfall measurements, and by expanding the dataset through further data collection and analysis to include more CSO events. The better-performing regression models developed here are intended for later application in a simplified water quality based RTC strategy. Using rainfall forecasts, the total CSO load of each sub-catchment can be quickly computed and recursively updated, facilitating decisions on allocating excess storm water and prioritizing available storage capacities between sub-catchments according to the magnitude of their CSO loads.

ACKNOWLEDGEMENTS

The authors wish to thank Bordeaux Métropole, Société de Gestion de l'Assainissement de Bordeaux Métropole (SGAC), and the LIFE programme of the European Commission (through the LIFE EFFIDRAIN project, ref number: LIFE14 ENV/ES/000860) for their financial and technical support.

REFERENCES

Andréa G., Ahyerre M., Pleau M., Pérarnaud J. J., Komorowski F. & Schoorens J. 2013 Gestion dynamique des RUTP du bassin versant Louis Fargue à Bordeaux : mise en œuvre et premiers résultats opérationnels (Dynamic management of wet weather urban discharges in the Louis Fargue catchment of Bordeaux: implementation and first operational results). In: Proceedings of the 8th International NOVATECH Conference, June 23–27, Lyon, France, 10 pp.

Archer K. J. & Kimes R. V. 2008 Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis 52, 2249–2260.

Breiman L. 2001 Random forests. Machine Learning 45 (1), 5–32.

Caradot N., Sonnenberg H., Rouault P., Gruber G., Hofer T., Torres A., Pesci M. & Bertrand-Krajewski J.-L. 2015 Influence of local calibration on the quality of online wet weather discharge monitoring: feedback from five international case studies. Water Science and Technology 71 (1), 45–51.

Dembélé A., Bertrand-Krajewski J.-L. & Barillon B. 2010 Calibration of stormwater quality regression models: a random process? Water Science and Technology 62 (4), 875–882.

Dembélé A., Bertrand-Krajewski J.-L., Becouze C. & Barillon B. 2011 A new empirical model for stormwater TSS event mean concentrations (EMCs). Water Science and Technology 64 (9), 1926–1934.

Genuer R., Poggi J.-M. & Tuleau-Malot C. 2010 Variable selection using random forests. Pattern Recognition Letters 31, 2225–2236.

Habib E., Krajewski W. F., Nespor V. & Kruger A. 2001 Sampling errors of tipping-bucket rain gauge measurements. Journal of Hydrologic Engineering 6 (2), 159–166.

House M. A., Ellis J. B., Herricks E. E., Hvitved-Jacobsen T., Seager J., Lijklema L., Aalderink H. & Clifforde I. T. 1993 Urban drainage – impacts on receiving water quality. Water Science and Technology 27 (12), 117–158.

Leonhardt G., Sun S., Rauch W. & Bertrand-Krajewski J.-L. 2014 Comparison of two model based approaches for areal rainfall estimation in urban hydrology. Journal of Hydrology 511 (4), 880–890.

Maruéjouls T., Granger D., Aouichat F., Lafficher C., Chadoutaud E., Pouly N. & Binet G. 2017 Suivi de la qualité des eaux unitaires à l'échelle du bassin versant urbain par la mesure en continu (Online monitoring of combined wastewater quality at urban catchment scale). Techniques Sciences Méthodes 10, 35–44.

McQuarrie A. D. R. & Tsai C.-L. 1998 Regression and Time Series Model Selection. World Scientific, Singapore.

Métadier M. & Bertrand-Krajewski J.-L. 2011 Assessing dry weather flow contribution in TSS and COD storm events loads in combined sewer systems. Water Science and Technology 63 (12), 2983–2991.

Montgomery D. C. & Peck E. A. 1992 Introduction to Linear Regression Analysis. Wiley, New York.

Muste M., Lee K. & Bertrand-Krajewski J.-L. 2012 Standardized uncertainty analysis for hydrometry: a review of relevant approaches and implementation examples. Hydrological Sciences Journal 57 (4), 643–667.

Oliphant T. E. 2007 Python for scientific computing. Computing in Science and Engineering 9, 10–20.

Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. 2011 Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.

Rudemo M. 1982 Empirical choice of histograms and kernel density estimators. Scandinavian Journal of Statistics 9, 65–78.

Sandoval S., Torres A., Pawlowsky-Reusing E., Riechel M. & Caradot N. 2013 The evaluation of rainfall influence on combined sewer overflows characteristics: the Berlin case study. Water Science and Technology 68 (12), 2683–2690.

Sun S. & Bertrand-Krajewski J.-L. 2012 On calibration data selection: the case of stormwater quality regression models. Environmental Modelling & Software 35, 61–73.

Vaze J. & Chiew F. H. S. 2003 Comparative evaluation of urban storm water quality models. Water Resources Research 39 (10), 1280.

Wager S., Hastie T. & Efron B. 2014 Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. Journal of Machine Learning Research 15 (1), 1625–1651.

Willems P. 2001 Stochastic description of the rainfall input errors in lumped hydrological models. Stochastic Environmental Research and Risk Assessment 15, 132–152.