Abstract
Modelling, automation, and control are widely used for water resource recovery facility (WRRF) optimization. An influent generator (IG) is a model that provides the flowrate and pollutant concentration dynamics at the inlet of a WRRF for a range of modelling applications. In this study, a new data-driven IG model is proposed that uses only routine data and weather information, without the need for any additional data collection. The model is built around an artificial neural network (ANN) and completed with a multivariate regression to generate time series for certain pollutants. The model is able to generate flowrate and quality data (TSS, COD, and nutrients) at different time scales and resolutions (daily or hourly), depending on the user's objectives. The model performance is analyzed with a series of statistical criteria. It is shown that the model can generate a reliable dataset for different model applications.
HIGHLIGHTS
A data-driven influent generator (IG) to create a long-term time series for flow and pollutant concentration at hourly and daily frequencies.
The IG application only needs WRRF routine data and weather information, without need for any additional data collection.
The IG also generates nutrient concentration time series.
A random walk stochastic process is added to better approximate observed temporal variability.
INTRODUCTION AND OBJECTIVES
Nowadays, modelling is widely used for water resource recovery facility (WRRF) design, upgrade and controller evaluation (Sweeney & Kabouris 2013). However, because of a lack of adequate (dynamic) input datasets, most of the models are still used under steady state conditions. For example, for WRRF design, engineers usually make an initial sizing by using design guidelines and safety factors or some statistical evaluations to come up with the values for the design inputs, and the consequence of overly conservative safety factors is oversizing the WRRF (Talebizadeh 2015). However, steady-state simulations are not able to represent the temporal variability present in reality. Thus, a reliable dynamic influent generator (IG) is essential as key input for WRRF modelling (Gernaey et al. 2011; Martin & Vanrolleghem 2014).
An Influent Generator (IG) is a model that generates realistic dynamic (or static) influent scenarios at the inlet of a WRRF. The outputs of an IG include flowrates and pollutant concentrations at different time scales (long term, short term) and resolutions (daily, hourly), according to different user objectives.
An IG model can be applied in very diverse domains, such as the integrated urban wastewater systems (IUWS) to globally improve the wastewater treatment process (Benedetti et al. 2013), and WRRF design and upgrade to face the growing amounts of wastewater produced by increasing urbanization and population. IGs can help to provide a fast and accurate estimation for WRRF designers and a credible input time series to anticipate the treatment performance under different operating conditions.
An IG model can also be used for influent database quality evaluation, optimization and completion (Martin & Vanrolleghem 2014). Since data observation and collection become more and more important in wastewater management, an IG model enables quality evaluation of a dataset collected from online measurements in different ways, e.g. by identifying errors due to clogged sensors, detecting extreme measurement values (outliers), etc. An IG can also complete missing data (gap filling, i.e. temporary measurement failure) and interpolate a low frequency time series into a denser time series.
Different researchers (see below) have been developing IGs based on different modelling principles. The current IGs can be divided into two types: data-driven IGs and phenomenological IGs, also called ‘black’ box and ‘grey’ box, respectively. Data-driven IGs focus on finding the relation between their inputs and outputs, without any knowledge of the internals of the system. Therefore, the performance of data-driven IG depends on the provided dataset. Varying degrees of model complexity have been studied, such as harmonic function models, regression equations, and also artificial neural networks (ANN). The Fourier series IG based on harmonic function has been used to develop a simple and reliable generator of diurnal variations for dry weather influent data, and has been used in different studies (Langergraber et al. 2008; Mannina et al. 2011). Ahnert et al. (2016) developed a statistical method for the generation of a continuous time series of influent quality by only using routine data based on the Weibull distribution. Troutman et al. (2017) developed an automated toolchain based on Gaussian processes to predict the dynamics of combined sewer system flow. Recently, machine learning is also being studied for flow forecasting over a short time horizon, with model structures such as artificial neural networks (ANN) (El-Din & Smith 2002), multiple linear regression (Zhu & Anderson 2016) or nonlinear autoregressive exogenous models (NARX) (Banihabib et al. 2019).
Unlike data-driven approaches, the phenomenological model is built by integrating some important processes, which influence the generation of the influent. This type of model is usually constructed around different submodels, such as dry weather generation, sewer system transport, soil model including infiltration of groundwater etc. (Gernaey et al. 2011). For example, the dry weather flowrate has a diurnal pattern with two peaks corresponding to the morning and evening urban activities, e.g. characterized by Butler et al. (1995). This diurnal pattern can reflect different lifestyles and can be described by a harmonic function. For storm flow, the dilution and first flush during a rain event can be integrated in the model (Bechmann et al. 1999; Talebizadeh et al. 2016; Langeveld et al. 2017). Gernaey et al. (2011) developed and Flores-Alsina et al. (2014) extended a more conceptual phenomenological model, by focusing on flowrate scenarios and submodels of the urban drainage system.
There are advantages and disadvantages in both types of models (Price & Vojinovic 2011). Compared with a data-driven model, a phenomenological model contains more details of the influent generation process, which leads to an explicit result and good extrapolation power beyond the calibration range. Normally, such a model consists of submodels with parameters related to physical processes. One of the shortcomings of such an IG is the need for calibration: because of the number of parameters and submodels, it is difficult to calibrate everything that is needed (catchment information, sewer system, etc.), which leads to less flexibility when applying the model to a different case study. In contrast, a data-driven model focuses only on the relation between input and output, instead of providing more understanding of the system. Thus, it is easier to calibrate and use a data-driven model. However, its high dependence on data demands a more complete dataset in order to reach adequate model quality.
Currently, most influent generators in the literature (either phenomenological or data-driven) suffer from the following two issues. The first issue is that it is difficult to balance the complexity and precision of the IG. Usually, a higher complexity leads to better performance, but it makes the modelling more time consuming and more intense in terms of calculation. The other issue is the ability to generate an adaptable influent profile that can be easily adjusted to different user objectives. For example, the design process demands a long-term series but allows for low time resolution, while a process control application needs a higher time resolution profile.
To solve the issues with the available IG models, in this study, a data-driven IG is proposed that can generate dynamic WRRF influents of different time scales and resolutions. The model aims to generate reliable influent properties and more complete time series for the influent dynamics, including daily and hourly flowrate, daily and hourly total suspended solids (TSS), and daily chemical oxygen demand (COD) and nutrient concentrations for the case study at hand. With the available daily concentration results the IG is constructed such that it is able to obtain hourly concentration dynamics by applying an observed daily pattern or by developing a more detailed model in future studies.
The IG proposed in this work is intended to generate influent time series only based on weather information, without any further collection of flowrate or concentration data. This allows obtaining long-term simulation inputs and since it does not need a complicated dataset, it reduces the investment of labor and money for measurement campaigns. The proposed IG model is not built around a physical submodel, resulting in lower modelling efforts than with a phenomenological model, so that it can provide a satisfactory compromise between model complexity and prediction quality.
MATERIALS AND METHODS
This section describes the case study and the preparation of the data set, followed by an overview of the modelling approach. First, the basic ANN model is presented followed by the stochastic process model that is added to increase the generated variability. Subsequently, the multivariate regression that allows calculating the nutrient time series is introduced. This section finishes with a definition of the criteria that will be used to assess the quality of the proposed IG.
Case studies and dataset preparation
The modelling approach was developed and tested on two urban catchments, Quebec City (Canada) (Tik & Vanrolleghem 2017) and Bordeaux, Clos de Hilde (France) (Ledergerber et al. 2020). These two case studies are both combined sewer systems, with a similar number of population equivalents, 300,000 PE and 200,000 PE, respectively. However, the two catchments provide different extents (range, data frequency) of available data.
In this research, the Quebec City case study is developed around routine influent data that are collected regularly at the entry of the WRRF, including flowrate, COD, TSS. The nutrient concentrations (ammonia and phosphorus) are sampled and measured only once or twice per week since the plant is only required to remove organic matter. Data were available for 8 years (2011–2018). For the Bordeaux case study, which is also a carbon removal WRRF, hourly flowrate and TSS concentrations were available for three months in 2017 and 2018.
Modelling approach
The proposed IG was developed with MATLAB R2019b (www.mathworks.com) using the following machine learning methodology. In short, the dataflow through the model is according to the flowchart shown in Figure 1, detailed in the following sections. Figure 1 shows that the raw data series is pre-treated. The pre-treatment aims to remove outliers detected by a univariate method and the faulty data are replaced by estimated values obtained by the Gaussian kernel smoother (Alferes & Vanrolleghem 2016). Then the cleaned flowrate data are Fourier transformed to obtain a yearly-seasonal-daily pattern for the baseflow P(t). The input dataset of the ANN consists of this P(t) together with weather data (temperature T and rainfall intensity R, provided by the WRRF's pluviometry and treated at hourly frequency), creating input vectors with a number of lags τ, which represents the weather data in the previous τ time steps. A stochastic generator is subsequently applied on the ANN's output, to get a time series, which better mimics the reality in terms of variability (Qsim). Finally, the nutrient concentrations (Cnutrisim) are obtained by applying a multivariate regression with input of the ANN's results and the weather data time series.
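The pre-treatment step described above can be illustrated with a minimal Python sketch (the study itself used MATLAB). The text only states that outliers are detected by "a univariate method", so the median/MAD rule below is a stand-in assumption, as are the function names, the threshold and the bandwidth. Flagged points are replaced by a Gaussian kernel (Nadaraya-Watson) estimate computed from the remaining valid samples.

```python
import math
import statistics


def flag_outliers(values, threshold=3.5):
    """Flag points whose modified z-score (median/MAD based) exceeds threshold.

    Assumption: this rule stands in for the unspecified univariate method.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


def kernel_estimate(index, values, valid, bandwidth):
    """Gaussian kernel-weighted average at `index`, using valid samples only."""
    num = den = 0.0
    for j, (v, ok) in enumerate(zip(values, valid)):
        if ok:
            w = math.exp(-0.5 * ((index - j) / bandwidth) ** 2)
            num += w * v
            den += w
    return num / den


def clean_series(values, bandwidth=2.0):
    """Replace flagged outliers by the kernel estimate; keep other points as-is."""
    flags = flag_outliers(values)
    valid = [not f for f in flags]
    return [kernel_estimate(i, values, valid, bandwidth) if f else v
            for i, (v, f) in enumerate(zip(values, flags))]


# Illustrative values only, not data from the study.
flow = [100.0, 102.0, 101.0, 500.0, 99.0, 103.0, 100.0]
cleaned = clean_series(flow)
```

In this toy series only the spike at the fourth position is flagged, and its replacement is a weighted average of its neighbours; the bandwidth controls how far that neighbourhood reaches.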
ANN model
The artificial neural network takes inspiration from the biological learning process. As a powerful data-driven tool, it can simulate nonlinear systems and is increasingly used in water engineering, urban hydrology and catchment modelling (Maier & Dandy 2000; Rajurkar et al. 2004; Fu et al. 2010).
Input data selection is important before using any ANN model. For a combined sewer system, the dry weather flow (DWF) component varies with the pattern of urban behaviour and is completed with hydrological processes (snowmelt, groundwater) and wet weather flow (WWF) also including direct storm water inflow, and rainfall-dependent infiltration and inflow (RDII) (Wright et al. 2001). Thus, the input of the ANN-based IG consists of a basic domestic flowrate pattern, weather information, including temperature and rainfall measurements with a number of lags τ time steps, to represent an internal autoregression.
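As a rough illustration of how such input vectors could be assembled, the Python sketch below stacks the baseflow pattern P(t) with the temperature and rainfall values at the current and the τ previous time steps. The function name, the ordering of the entries and the inclusion of the current time step are assumptions made for the example; the text does not specify them.

```python
def build_lagged_inputs(pattern, temperature, rainfall, n_lags):
    """Return one input vector per time step t >= n_lags.

    Each vector holds P(t), then temperature at t, t-1, ..., t-n_lags,
    then rainfall at t, t-1, ..., t-n_lags.
    """
    if not (len(pattern) == len(temperature) == len(rainfall)):
        raise ValueError("all series must have the same length")
    rows = []
    for t in range(n_lags, len(pattern)):
        row = [pattern[t]]
        row += [temperature[t - k] for k in range(n_lags + 1)]
        row += [rainfall[t - k] for k in range(n_lags + 1)]
        rows.append(row)
    return rows


# Illustrative values only, not data from the study.
P = [10.0, 11.0, 12.0, 11.0, 10.0]
T = [1.0, 2.0, 3.0, 4.0, 5.0]
R = [0.0, 0.0, 5.0, 1.0, 0.0]
X = build_lagged_inputs(P, T, R, n_lags=2)
```

The first `n_lags` time steps are dropped because their history is incomplete, so a series of length N yields N − τ input vectors.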
The DWF pattern is obtained by applying a Chebyshev bandpass filter after Fourier transform (Heideman et al. 1984). The Fourier transform enables converting the signal from the time domain into the frequency domain, and the Chebyshev bandpass filter allows only the signal with selected frequencies to pass (Schlichthärle 2011). By removing high-frequency contributions such as measurement noise, the dry weather flow pattern can be obtained. In this study, the bandpass filter is focusing on extracting two major signals: the seasonal/yearly effect and the daily effect. To this end, the frequencies that were retained are: 4 year−1 and 2 year−1 for the seasonality and 1 day−1 and 2 day−1 for the diurnal phenomena.
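The idea of retaining only a few selected frequencies can be sketched in Python as follows. Note that this is a simplification and not the Chebyshev bandpass filter used in the study: the series is projected directly onto sines and cosines at the chosen frequencies, which is exact only when the record spans a whole number of periods. The function name and the synthetic signal are assumptions for illustration.

```python
import math


def harmonic_pattern(series, dt, frequencies):
    """Reconstruct the mean plus the components of `series` at `frequencies`.

    dt is the sampling interval; frequencies are in cycles per time unit of dt.
    """
    n = len(series)
    mean = sum(series) / n
    pattern = [mean] * n
    for f in frequencies:
        cos_t = [math.cos(2 * math.pi * f * i * dt) for i in range(n)]
        sin_t = [math.sin(2 * math.pi * f * i * dt) for i in range(n)]
        a = 2.0 / n * sum(x * c for x, c in zip(series, cos_t))
        b = 2.0 / n * sum(x * s for x, s in zip(series, sin_t))
        for i in range(n):
            pattern[i] += a * cos_t[i] + b * sin_t[i]
    return pattern


# Synthetic hourly flow over 4 days: 1/day and 2/day cycles plus a fast ripple.
dt = 1.0 / 24.0  # sampling interval in days
hours = range(4 * 24)
clean = [100 + 20 * math.sin(2 * math.pi * 1 * h * dt)
         + 8 * math.cos(2 * math.pi * 2 * h * dt) for h in hours]
noisy = [c + 3 * math.sin(2 * math.pi * 9 * h * dt) for c, h in zip(clean, hours)]
dwf = harmonic_pattern(noisy, dt, frequencies=[1.0, 2.0])
```

Here the 9 day−1 ripple plays the role of high-frequency noise and is discarded, while the 1 day−1 and 2 day−1 components are recovered.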
The dataset was divided into a training set, a cross-validation set and a test set (70, 15 and 15%, respectively). To determine the architecture of the ANN (hyper-parameters: number of neurons and layers), a series of ANNs with different numbers of layers and neurons were trained. For each training, the performance of the model was recorded and compared. The number of layers and neurons giving the best performance for the cross-validation dataset, was selected as final architecture. To avoid local minima during training, a series of iterations of the training procedure was applied. The final test of the ANN model was performed with the test set.
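A minimal Python sketch of the split and of the architecture selection is given below. Whether the split in the study was chronological or random is not stated; the sketch assumes a chronological split. The candidate architectures and their validation errors are placeholders, not results from the study.

```python
def split_dataset(samples, train_frac=0.70, val_frac=0.15):
    """Split samples in order into training, cross-validation and test parts."""
    n = len(samples)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])


def select_architecture(candidates, validation_error):
    """Return the candidate with the lowest cross-validation error."""
    return min(candidates, key=validation_error)


samples = list(range(100))
train, val, test = split_dataset(samples)

# Placeholder errors keyed by (hidden layers, neurons per layer).
errors = {(1, 3): 0.21, (1, 4): 0.18, (2, 4): 0.19}
best = select_architecture(errors, errors.get)
```

In practice `validation_error` would train each candidate network several times from different initial weights and return the best (or average) cross-validation score, in line with the repeated training mentioned above.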
Stochastic process
After obtaining the ANN result, it was observed that the model output exhibited less variability than the measurements. However, it is important that the probability distribution of the generated time series is similar to reality, because design quantities of a WRRF, such as the expected load, the hydraulic capacity and certain percentiles of their distributions, depend on it. Therefore, a stochastic process was added to improve the adequacy of the IG model. The aim was to simulate a more random time series whose statistical characteristics better reflect the distribution of the measurements.
Multivariate regression
Criteria and error analysis
RESULTS AND DISCUSSION
In this section, first, the result of each of the IG's submodels is presented, including the daily and seasonal pattern, and the ANN modelling results of flowrate and pollutant concentrations for both Quebec City and Bordeaux. Next, attention is focused on the usefulness of adding a stochastic process to increase the variability of the generated time series. Subsequently, the result of the nutrient concentration calculations is shown. Then, the model performance is analyzed with two other criteria. Finally, this section ends with a detailed evaluation of the proposed IG from an overall perspective.
Model and submodel results
Daily and seasonal pattern
A Chebyshev bandpass filter (2nd order Chebyshev Type I) was applied after the Fourier transform of each time series to get the seasonal and daily DWF pattern of the urban activity. Figure 2 shows the pattern for the Quebec City case study. The flowrate pattern exhibits a clear increase during the snowmelt from March to the end of April, while the COD and TSS patterns demonstrate dilution by the snowmelt. The TSS daily concentration shows a slight trend from 2011 to 2018. There are two explanations for this phenomenon: (i) an increase of the urbanization brings increasing TSS discharges and (ii) thanks to the improvements made to the sewer system, the infiltration inflow has decreased, which leads to a higher TSS concentration.
Similarly, the 2nd order Chebyshev Type I bandpass filter is applied for the Bordeaux case study, in order to extract the daily pattern. Figure 3 illustrates that the daily pattern has two peaks corresponding to the increased urban activity in the morning and evening, as reported elsewhere in the literature (Butler et al. 1995). However, the peak and form are different for weekday and weekend days, because of the different weekend behaviour.
ANN modelling result for Quebec City
The Quebec City case study aimed at generating the flowrate, COD and TSS time series using an ANN model. The selected ANN has one input layer, one hidden layer and one output layer. The hidden layer contains four neurons for flowrate and five neurons for COD and TSS. Figure 4 shows the daily flowrate, TSS and COD concentration in the test set. The result clearly demonstrates the snowmelt effect at the end of winter, with an increase of the flowrate and dilution of TSS and COD. The impact of each rain event is adequately described by the IG model.
ANN modelling result for Bordeaux
The ANN model also demonstrated good results for Bordeaux, as shown in Figure 5 for the hourly flowrate (a) and hourly TSS (b). The flowrate is generated by using a daily dry weather flow pattern and hourly rain data as input. The daily dry weather flow pattern profile and the increase of inflow by storm water can both be seen for each rain event. The TSS results demonstrate that the data-driven methodology is also able to describe water quality at high frequency. The TSS concentration follows a daily concentration pattern and it is diluted by the inflow of storm water.
The ability of the IG model to generate concentration time series can also be used for dataset gap filling, especially under wet weather conditions when sensor clogging occurs more frequently. Moreover, as shown in Figure 6, the model can be improved by also feeding it with daily average laboratory measurements as input. Although the model output is still less dynamic than what the sensor data exhibit, this extra input enables the model to better represent the variability of reality (by representing the differences of wastewater discharge on different days). For instance, thanks to this add-on, the lower concentrations on 13 and 14 June could be captured even though there was no wet weather effect. Finally, note that a longer time series will help to find a yearly or seasonal pattern so that the TSS concentration results would be further improved.
Better representing variability by adding a stochastic process
As mentioned before in the Materials and Methods section Stochastic process, the obtained ANN model was extended with a random walk model after having analyzed the error between the ANN model output and the measurement data. The stochastic process extension allows the influent generator to statistically better approach the variability in the real measurements.
The random walk model is a special case of the autoregressive (AR) model. To define the order of the autoregressive model, orders of 1 to 5 time lags were evaluated for the Quebec City case study. As the autocorrelation coefficient plot and the MAE results in Figure 7 show, the best stochastic model is obtained with an autoregression of 4 time lags, because this gave the best compromise between model complexity and precision.
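One possible way to fit and simulate such an autoregressive error model is sketched below in Python. It is an assumption that the coefficients are estimated with the Yule-Walker equations (solved here by the Levinson-Durbin recursion); the study does not state its estimation method. The residual series and the ANN output are synthetic placeholders.

```python
import random


def autocovariance(x, max_lag):
    """Biased sample autocovariance for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    return [sum(d[i] * d[i + k] for i in range(n - k)) / n
            for k in range(max_lag + 1)]


def fit_ar(x, order):
    """Yule-Walker AR coefficients via the Levinson-Durbin recursion.

    Returns (phi, noise_variance) for x_t = sum_k phi[k-1] * x_{t-k} + e_t.
    """
    r = autocovariance(x, order)
    phi, err = [], r[0]
    for k in range(1, order + 1):
        acc = r[k] - sum(phi[j] * r[k - 1 - j] for j in range(k - 1))
        refl = acc / err
        phi = [phi[j] - refl * phi[k - 2 - j] for j in range(k - 1)] + [refl]
        err *= 1.0 - refl ** 2
    return phi, err


def simulate_ar(phi, noise_std, n, rng):
    """Simulate n steps of a zero-mean AR process started from zeros."""
    p = len(phi)
    out = [0.0] * p
    for _ in range(n):
        recent = out[-1:-p - 1:-1]  # x_{t-1}, ..., x_{t-p}
        out.append(sum(c * v for c, v in zip(phi, recent))
                   + rng.gauss(0.0, noise_std))
    return out[p:]


rng = random.Random(0)
# Synthetic stand-in for the residuals (measurement minus ANN output).
residuals = simulate_ar([0.6], 1.0, 5000, rng)
phi, noise_var = fit_ar(residuals, order=4)

ann_output = [100.0] * 10  # placeholder for the deterministic ANN output
noise = simulate_ar(phi, noise_var ** 0.5, len(ann_output), rng)
perturbed = [y + e for y, e in zip(ann_output, noise)]
```

Comparing the fit for orders 1 to 5, for example through the MAE of one-step predictions, would mirror the order selection described above; a random walk corresponds to a first-order model with a coefficient of 1.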
Figure 8 shows the results of COD and TSS generation with addition of the stochastic process. Compared with the ANN-only model, the stochastic process allows the model to better mimic the random variation of the influent. A more detailed analysis is provided in the performance analysis section.
Nutrient concentration generation by multivariate regression
For the reasons explained before, in practice, the influent nutrient concentrations (ammonia and phosphorus) in Quebec City's COD removing plant are measured only once or twice a week. This makes it difficult to estimate the current treatment result from the limited nutrient data or to create an influent time series in view of upgrading the plant for nutrient removal. Thanks to the high correlation between the nutrient concentrations and the flowrate, it was found that the concentrations could be generated quite well by multivariate regression, see Equation (2). Based on the same principles of regression modelling as used in section 3.1.4, regression orders between 1 and 5 were tested and the best order of the regression was found to be 3, giving the best performance (RMSE) on both training set and validation set.
Figure 9 shows the model performance for the test set, demonstrating that the regression is able to adequately describe the ammonia concentrations on the basis of only flowrate and weather data.
In the same way, Figure 10 presents the generated phosphorus concentration time series. Thus, the multivariate model enables generating a high-frequency time series (daily simulation) from a low frequency measurement time series (weekly measurement).
Performance analysis and discussion
The performance of the final data-driven influent generator is summarized in Table 1. The criteria are all calculated based on the test sets. Table 1 demonstrates that the model is able to generate time series with a MAPE of about 12–14% for flowrate and 13–20% for water quality. The NSE ranges from 0.43 (COD) to 0.74 (daily flowrate), indicating that the model can match the observed dataset well, including for water quality. However, it is worth remembering that the observed data consist of imperfect lab and sensor measurements and that these errors indirectly influence the quality of the generated water quality series. Considering the high uncertainty and measurement errors of raw wastewater data (Montgomery & Sanders 1986; Bertrand-Krajewski et al. 2007), the RMSE and MAPE obtained can be considered to be in the same order of magnitude as the errors of real-life measurements. These results thus indicate that the model performance is sufficiently good.
Variable | Average | RMSE | MAPE | NSE |
---|---|---|---|---|
Case study for Quebec City: daily data | ||||
Flowrate | 200,000 m3/d | 38,000 | 11.8% | 0.74 |
COD | 400 mg/L | 52 | 19.8% | 0.43 |
TSS | 250 mg/L | 30 | 16.7% | 0.59 |
Ammonia | 12.5 mg/L | 2.6 | 13.7% | 0.68 |
Phosphorus | 4.3 mg/L | 0.8 | 13.4% | 0.71 |
Case study for Bordeaux: hourly data | ||||
Flowrate | 12,000 m3/h | 1,200 | 13.5% | 0.61 |
TSS | 300 mg/L | 42 | 17.5% | 0.70 |
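For reference, the three criteria reported in Table 1 can be computed as in the Python sketch below. The formulas are the standard definitions of RMSE, MAPE and the Nash-Sutcliffe efficiency (NSE) and are assumed to match those used in the study; the numbers are illustrative only.

```python
import math


def rmse(obs, sim):
    """Root mean square error, in the unit of the variable."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def mape(obs, sim):
    """Mean absolute percentage error in %; observations must be non-zero."""
    return 100.0 / len(obs) * sum(abs((o - s) / o) for o, s in zip(obs, sim))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 equals the observed mean."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst


# Illustrative values only, not data from the study.
observed = [100.0, 120.0, 80.0, 100.0]
simulated = [110.0, 110.0, 90.0, 90.0]
scores = (rmse(observed, simulated), mape(observed, simulated), nse(observed, simulated))
```

For these toy series the RMSE is 10, the MAPE is about 10.2% and the NSE is 0.5, which shows that a moderate percentage error can still coincide with a mid-range NSE when the observed variance is small.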
Figure 11 compares the observed flow data with the model-generated data in different ways. First, the quantile–quantile (q–q) plot in Figure 11(a) shows the good correspondence of the generated data with the observed data (coefficient of determination r² = 0.88). The CDF in Figure 11(b) compares the statistical characteristics of the observed and generated datasets. The time series was also split into two subsets, one for winter (from January to May) and one for summer (from July to December), in order to demonstrate that the model's performance does not differ between seasons, see Figure 11(c) and 11(d). The winter and summer flow distributions are different: the winter distribution is more dispersed, with 85% of the flowrate below 3.8 × 10⁵ m3/d in winter against 2.4 × 10⁵ m3/d in summer. Nevertheless, both the seasonal PDF and CDF analyses show that the distribution of the model outputs is similar to the real data distribution.
Figure 12 illustrates the COD concentration generation by the ANN model without (a) and with stochastic process extension (b). It can be concluded that the ANN model with the stochastic process can better mimic the observed variability. The stochastic process enables the model output distribution to be wider and closer to the distribution of the real measurement time series.
This improvement is confirmed by the KL divergence values in Table 2. A smaller KL value indicates greater similarity between two distributions. The KL divergence obtained with the random walk extension is smaller than the value for the ANN model output without the extension, which indicates that the extended model is closer to the distribution of the observations.
KL divergence | COD | TSS |
---|---|---|
ANN model (a) | 0.573 | 0.204 |
ANN model with stochastic process extension (b) | 0.115 | 0.058 |
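A KL divergence between an observed and a generated series can be estimated from histograms, as in the Python sketch below. The number of bins, the small smoothing constant and the direction of the divergence (observed relative to generated) are assumptions; the study does not report how its values were computed.

```python
import math


def histogram(values, lo, hi, n_bins):
    """Count values in n_bins equal-width bins spanning [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def kl_divergence(observed, generated, n_bins=10, eps=1e-9):
    """Histogram estimate of KL(observed || generated) on shared bins."""
    lo = min(min(observed), min(generated))
    hi = max(max(observed), max(generated))
    if hi == lo:
        return 0.0
    p = histogram(observed, lo, hi, n_bins)
    q = histogram(generated, lo, hi, n_bins)
    # eps avoids log(0) and division by zero for empty bins.
    p = [(c + eps) / (len(observed) + eps * n_bins) for c in p]
    q = [(c + eps) / (len(generated) + eps * n_bins) for c in q]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Synthetic series: a generated series that is too narrow versus one that
# covers the same range as the observations.
observed = [float(i % 10) for i in range(100)]
narrow = [4.0 + (i % 2) for i in range(100)]
wide = [float((i * 3) % 10) for i in range(100)]
kl_narrow = kl_divergence(observed, narrow)
kl_wide = kl_divergence(observed, wide)
```

A generated series with too little spread leaves observed bins nearly empty and is penalized heavily, which is the effect the stochastic extension is meant to reduce.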
Discussion and evaluation
In general, the error analysis (Table 1) shows the high precision of the IG model outputs. Figure 11 demonstrates a good match between the model results and the measured data for the test set, which means the model can successfully generate flowrate and pollutant concentration time series. However, some discrepancies can be noticed for the hourly generation occurring at the beginning of wet weather flows (Figure 5). This might be caused by overflows generated in the sewer system or depression storage effects, which are not included in the model. On the other hand, high-frequency water quality time series generation is based on sensor measurements, and it must be recognized that sensor clogging and anomalies often occur, especially during rain events (Bertrand-Krajewski et al. 2007).
The snowmelt period in the generated data is also shifted in time for some of the years that were studied (Figures 4 and 5). This timing error is due to the model considering snowmelt water infiltration as an average amount and at an average time in the year. This issue may be dealt with by modifying the yearly pattern for different years, using for instance temperature data.
The proposed IG model provides a tool for fast influent profile generation at different resolutions, and it benefits not only the carbon removal but also the nutrient removal process. Moreover, instead of being a deterministic model, the proposed IG includes a stochastic process, which may be helpful for further WRRF modelling uncertainty and scenario analysis. Compared with the available phenomenological models presented in the literature review, the advantage of using the proposed data-driven model is that there is no need to gather any information on the catchment, nor does it need calibration of physical parameters. Therefore, it is suitable for catchments with unknown physical details.
Finally, to confirm its utility and advantages, the proposed model was compared with a very simple data-driven method: the generated TSS concentration was calculated simply as the average TSS load (calculated for the whole available data series), divided by measured flowrate data. This reflects the fact that a city typically generates a stable daily pollution load. The results in Appendix 1 show that the obtained ANN model is considerably more precise, using the same input and the same size of training set.
This simple data-driven model faces the same problem as any ANN model, i.e. it is risky to extrapolate if the new time series is too different from the training set. To counter this limitation, a more complete database is required to make sure the model has been exposed to ‘more experiences’ and can thus capture more of the system properties, such as more diverse rain data, wider and higher temporal distribution of water quality measurements, etc.
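The average-load baseline used for this comparison can be written in a few lines of Python. The function names and the numbers are illustrative assumptions; only the principle (average load divided by measured flowrate) comes from the text.

```python
def average_load(flows, concentrations):
    """Mean pollutant load over the record (flow times concentration)."""
    loads = [q * c for q, c in zip(flows, concentrations)]
    return sum(loads) / len(loads)


def baseline_concentration(flows, mean_load):
    """Concentration obtained by diluting a constant load in the measured flow."""
    return [mean_load / q for q in flows]


# Illustrative values only: flow in m3/d, TSS in g/m3 (equal to mg/L).
flow_train = [200000.0, 250000.0, 400000.0]
tss_train = [250.0, 200.0, 125.0]
load = average_load(flow_train, tss_train)  # g/d
tss_baseline = baseline_concentration([200000.0, 500000.0], load)
```

Because the load is held constant, this baseline can only reproduce dilution by flow; it cannot represent day-to-day differences in discharged load, which is one reason a more flexible model may perform better.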
CONCLUSION
This work developed an urban wastewater influent generator model for flowrate and pollutant concentration generation. The proposed IG model is data-driven and includes an ANN, a multivariate regression and a random walk process. Its performance was analyzed by different criteria: MAPE, NSE and statistical characteristics evaluating variability (KL divergence). The IG model showed a high generation precision for the two case studies: 10% error for flowrate and 15–20% for the concentrations. Given the fact that the performance of a data-driven model depends on the quality and coverage of the training data, a more complete dataset (in temporal or spatial sense) can help further improve the model performance.
The proposed IG is balanced in terms of modelling effort and precision. Compared with a simple data-driven IG model based on average pollutant load and flow (result of Appendix 1), the proposed IG demonstrated a higher accuracy. On the other hand, compared with phenomenological IG models, this model requires less modelling effort, especially in calibrating the parameters for which prior expert knowledge is needed. Another significant improvement of the proposed IG is that stochastic variability is included in the model, which enables the generated time series to better represent reality in terms of the probability distribution of the generated data. Last but not least, the application of the proposed IG requires only routine WRRF data, thus not requiring any additional investments in data collection. Once the IG model is calibrated, the generation of new time series only requires weather data and does not rely on historical influent data. This ensures the prediction result will be stable over a long time horizon and will not accumulate errors over time.
In conclusion, the proposed IG generates a dynamic and complete data set (flowrate and pollutant concentrations, both organics and nutrients) with good performance, and it is able to generate a time series at different time resolutions (daily and hourly). Further research will be focusing on the optimization of the IG in order to better support WRRF modelling studies.
ACKNOWLEDGEMENTS
Sincere thanks are extended to SUEZ SGAC and NSERC (Natural Sciences and Engineering Research Council of Canada) for funding the research project. Québec City and Bordeaux (SUEZ Le Lyre) are thanked for their contribution of all relevant data. Peter Vanrolleghem holds the Canada Research Chair on Water Quality Modelling.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.