ABSTRACT
Rainfall-derived inflow/infiltration (RDII) modelling during heavy rainfall events is essential for sewer flow management. In this study, two machine learning algorithms, random forest (RF) and long short-term memory (LSTM), were developed for sewer flow prediction and RDII estimation based on field monitoring data. The study implemented feature engineering to extract physically significant features for sewer flow modelling and investigated the importance of the relevant features. The results from two case studies indicated the superior capability of machine learning models in RDII estimation in combined and separate sewer systems, with the LSTM model outperforming the RF model. Compared to traditional methods, machine learning models were capable of simulating the temporal variation in RDII processes and improved prediction accuracy for peak flows and RDII volumes in storm events.
HIGHLIGHTS
Machine learning models, particularly the LSTM model, outperformed traditional methods in estimating RDII in sewer systems.
Feature engineering techniques allowed for the extraction of physically significant features in sewer flow modelling.
Machine learning models successfully simulated the temporal variation in RDII processes, resulting in improved accuracy for predicting peak flows and RDII volumes during storm events.
INTRODUCTION
Inflow/infiltration into defective sewers increases sewage volumes and reduces the treatment efficiency of wastewater treatment plants (WWTPs), with negative impacts on sewer operation and management. During heavy rainfall events, sewer systems can easily become overloaded due to rainfall-derived inflow and infiltration (RDII) and release untreated sewage and other pollutants into the environment (Jiang et al. 2019; Zhang & Parolari 2022). This can damage ecosystems and also pose a health risk to people who live near the affected areas (Yap & Ngien 2017).
Traditional methods for RDII estimation commonly used in the literature (Bennett 1999; Lai 2008; Mikalson et al. 2012) include the constant unit rate method (Empirical), the percentage of rainfall volume (R-value) method (Empirical), the percentage of stream flow method (Empirical), the synthetic unit hydrograph (SUH) method (Conceptual), the probabilistic method (Probabilistic), the rainfall/sewer flow regression method (Statistical), the synthetic stream flow regression method (Statistical), and methods embedded in hydraulic software (Mechanistic/Conceptual). Classifications in parentheses are based on the underlying principles employed by each method. Building on existing reviews of RDII quantification, Bennett (1999) concluded that the selection of an RDII estimation method depends strongly on site-specific rainfall and runoff conditions. The R-value method can be used to estimate RDII when monitoring data are lacking, but its accuracy is usually limited by inconsistent judgement among analysts (Crawford et al. 1999; Vallabhaneni et al. 2007). Applying the percentage of stream flow method to an unmonitored area requires that the precipitation regimes of the sewer system and the watershed be consistent (Mikalson et al. 2012). The SUH method (Vallabhaneni et al. 2008) is highly dependent on the quality of hourly rainfall monitoring data. The rainfall/flow regression method typically applies a multiple linear regression model, whose coefficients are sensitive to frequently occurring values, which interferes with the prediction results. The prediction accuracy of these traditional methods is high only if the application storms are similar to the storms from which the predictive equations were derived, and the choice of storms used to determine the coefficients is subject to analyst judgement (Bennett 1999).
RDII estimation methods based on water quality parameters (Zhang et al. 2017) have also been applied in previous studies, for example using chemical oxygen demand (COD) (Bareš et al. 2012), or total nitrogen (TN) and total phosphorus (TP) (Guo et al. 2022). Although laboratory-based testing determines water quality parameters with high accuracy, sample collection, transportation, and laboratory analysis are time-consuming. Online monitoring technologies make it possible to collect data continuously and automatically, giving users instant access to metrics such as rainfall and flow rates. In RDII research, this immediacy is beneficial, particularly when evaluating the dynamic responses of sewer systems to rainfall events. Online monitoring technologies therefore offer a more efficient approach than traditional water quality assessment methods.
Some studies also employed hydrological models, such as the Antecedent Moisture Condition (AMC) model (Sadri & Graham 2011) and MIKE URBAN (Borup et al. 2011), or applied mathematical models (Zhang 2007) that combine flow data, the inflow/infiltration ratio and autoregressive error models, to simulate RDII processes and quantify RDII characteristics in sewer systems. However, these models are not suitable for areas lacking the necessary infrastructure data. The conductivity-based RDII identification method introduced by Zhang et al. (2018) employs an empirical and statistical approach, making it particularly effective in identifying abnormal flow situations. This technique exploits the high sensitivity of conductivity measurements to the dilution of wastewater by rainwater, offering a real-time and cost-efficient solution for monitoring sewer systems. However, the accuracy of this method can be affected by other conductivity-altering substances in the sewer, such as industrial discharges, so the data must be interpreted carefully.
Overflows and RDII are typical sewer processes that exhibit nonlinear patterns (Zhang 2008; Zhang et al. 2018). Various machine learning models have been used in numerous fields to identify nonlinear patterns (Finnis et al. 2012; Gholami et al. 2020; Huang et al. 2020) and have yielded excellent results. Particularly in environmental engineering, artificial neural network (ANN), M5 model tree, long short-term memory (LSTM) and k-nearest neighbour (kNN) were applied to simulate stormwater runoff from 16 green roofs across four Norwegian cities (Abdalla et al. 2021). The Autoregressive Integrated Moving Average (ARIMA) and Multilayer Perceptron Neural Network (MLPNN) were used for forecasting wastewater inflow in Barrie, Canada, with ARIMA showing slightly better performance than MLPNN (Zhang et al. 2019). In Vietnam, a deep learning neural network (DLNN) approach, combined with various geographical and environmental factors, was used to predict flash flood susceptibility in regions frequently affected by tropical storms (Bui et al. 2020). However, currently, there are limited studies using machine learning methods to estimate RDII in the combined and separate sewer systems (Mikalson et al. 2012; Zhang et al. 2018).
For a given time span, a shorter sampling interval produces a larger amount of data. Machine learning can effectively analyse massive data and identify hidden relationships in complex datasets (Senanayake et al. 2019), and more available data generally implies better model performance. The hourly data used in some of the traditional methods mentioned above may therefore be too coarse, whereas higher-resolution data carry more information for prediction. Moreover, machine learning is less dependent on subjective judgement. A major advantage of machine learning is its ability to learn from expanding sets of data in order to predict or analyse the characteristics of a system (Rezapour et al. 2021). The random forest (RF) model uses an ensemble of decision trees to produce its predictions, with each tree constructed from a randomly selected subset of the training dataset, which ensures diversity within the ensemble. By combining the predictions of multiple trees, this method mitigates the risk of overfitting, a significant limitation of single decision tree models, and so makes the predictions more robust. The regression is carried out by selecting and ranking related features, and the variation pattern of the target feature is learned from the training set (Liu et al. 2012). Features with high importance and abundant information are selected from the input features to improve prediction accuracy. ANNs are another common machine learning algorithm for engineering problems. Owing to their self-learning ability, neural networks can model complex nonlinear relationships between inputs and outputs and so reflect the variation pattern of the samples (Dongare et al. 2012).
When dealing with a complex problem, ANNs often outperform other models owing to their fast convergence and computing power. LSTM is a kind of ANN, proposed by Hochreiter & Schmidhuber (1997) to address the problems of long-term dependencies and vanishing/exploding gradients in recurrent neural networks (RNNs) (Yu et al. 2019).
This study aims to develop and verify reliable machine learning models that can predict sewer flow and estimate RDII, and to compare their performance with a traditional empirical approach. The tree ensemble model RF and the deep learning model LSTM were applied for this purpose, based on rainfall and sewer flow monitoring data from combined and separate sewer systems. The contributions of input features were examined by feature importance ranking, leading to a better understanding of the monitoring data. Modelling results were validated against field observations and subsequently compared with a traditional empirical approach to evaluate the applicability and accuracy of the machine learning models. The optimal frequency of sewer flow data for similar modelling is also discussed.
METHODS
Data processing and feature engineering
Overview of the sewer network: (a) in Bailong and (b) in Wantou, Ningbo, China.
Model development
In this study, an RF model and an LSTM model were constructed. The features affecting sewer flow were determined first, and the models were then developed. Rainfall was assumed to be the dominant feature, and the features ‘sine and cosine functions’ and ‘Prior Flow’ were added. Predicting sewer flow variation from these data is a supervised learning task. The RF model is a tree ensemble model that focuses on reducing variance. Its performance was compared with that of the deep learning model LSTM.
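The feature set described above can be sketched as follows. This is a minimal illustration only: the text does not state the period of the sine and cosine functions or how many past flow values make up ‘Prior Flow’, so a daily period and a single one-step lag are assumed here, and the function names `time_of_day_features` and `build_samples` are hypothetical.

```python
import math

MINUTES_PER_DAY = 24 * 60

def time_of_day_features(minute_of_day):
    """Map time of day onto the unit circle so 23:55 and 00:00 stay close."""
    angle = 2 * math.pi * minute_of_day / MINUTES_PER_DAY
    return math.sin(angle), math.cos(angle)

def build_samples(flows, rainfall, step_min=5, start_minute=0):
    """Return (features, target) pairs.

    features = [rainfall, sin_time, cos_time, prior_flow]; prior_flow is the
    flow one step earlier (an assumed definition of 'Prior Flow').
    """
    samples = []
    for i in range(1, len(flows)):
        minute = (start_minute + i * step_min) % MINUTES_PER_DAY
        sin_t, cos_t = time_of_day_features(minute)
        samples.append(([rainfall[i], sin_t, cos_t, flows[i - 1]], flows[i]))
    return samples

# Made-up 5-min readings (flow in L/s, rainfall in mm), for illustration only.
flows = [10.0, 10.5, 12.0, 11.0]
rainfall = [0.0, 0.2, 0.6, 0.0]
samples = build_samples(flows, rainfall)
```

Encoding time as a sine/cosine pair rather than a raw clock value avoids an artificial jump at midnight, which is the usual motivation for cyclic features.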
Estimates of RDII can be obtained by subtracting the statistical mean dry weather flow (DWF) from the wet weather flow (WWF). Therefore, the accuracy of RDII estimation is highly dependent on the performance of total sewer flow modelling.
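As a worked illustration of this subtraction, the sketch below computes an RDII series and integrates it to a volume. The flow values are made up, the 5-min step matches the monitoring interval used in the study, and the rectangle-rule integration and the function names are assumptions rather than the authors' procedure.

```python
def rdii_series(wet_weather_flow, dry_weather_flow):
    """RDII at each time step = WWF minus the mean DWF for that time step."""
    if len(wet_weather_flow) != len(dry_weather_flow):
        raise ValueError("WWF and DWF series must have the same length")
    return [w - d for w, d in zip(wet_weather_flow, dry_weather_flow)]

def rdii_volume_m3(rdii_l_per_s, step_min=5):
    """Integrate RDII (L/s) over time with a rectangle rule; 1000 L = 1 m3."""
    return sum(rdii_l_per_s) * step_min * 60 / 1000.0

# Made-up flows in L/s at 5-min steps.
wwf = [12.0, 15.0, 20.0, 14.0]
dwf = [10.0, 10.0, 11.0, 11.0]
rdii = rdii_series(wwf, dwf)      # [2.0, 5.0, 9.0, 3.0]
volume = rdii_volume_m3(rdii)     # 19 L/s * 300 s = 5700 L = 5.7 m3
```

Negative differences are kept as they are in this sketch; whether to clip them to zero is a modelling choice the text does not specify.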



RF model
There are three important hyperparameters to consider when optimising the RF model: max_features (the maximum number of features considered at each node split), max_depth (the maximum depth of a tree) and n_estimators (the number of trees). Hyperparameter optimisation aims to reduce model complexity and increase model accuracy. The grid search method in the sklearn library was adopted for parameter tuning (Wang et al. 2019; Sun et al. 2020).
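A grid search over these three hyperparameters might look like the sketch below, which uses scikit-learn's `GridSearchCV` with `RandomForestRegressor`. The data are synthetic, and the candidate values, the time-ordered cross-validation split and the MAE scoring are all assumptions; the study does not report its search grid or validation scheme.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for [rainfall, sin_time, cos_time, prior_flow].
rng = np.random.default_rng(0)
X = rng.random((300, 4))
y = 5.0 * X[:, 3] + 2.0 * X[:, 0] + rng.normal(0.0, 0.1, 300)

# Hypothetical candidate values for the three hyperparameters named in the text.
param_grid = {
    "n_estimators": [50, 100],
    "max_features": [2, 4],
    "max_depth": [5, 10],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",   # assumed metric
    cv=TimeSeriesSplit(n_splits=3),      # keeps validation folds later in time
)
search.fit(X, y)

best_params = search.best_params_
best_mae = -search.best_score_           # scores are negated MAE
```

A time-ordered split is used here because shuffled folds would let the model see flow values that come after the ones it is asked to predict.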
LSTM model














In this study, the model is a sequential Keras model (Joshi et al. 2020) made up of a bidirectional LSTM layer with 128 hidden units, a dropout rate of 0.2, and a dense output layer with a sigmoid activation function. It is compiled with the Adam optimiser (Nwankpa 2020) and a mean squared error loss function. The model is trained for 45 epochs with a batch size of 300 and then verified on a separate test set. Training and validation losses are plotted as learning curves to help evaluate model performance.
In this study, the LSTM model was optimised by adjusting the number of epochs and the batch size using the Bayesian optimisation algorithm (Mockus 1989). The number of epochs is the number of complete forward and backward passes of the training dataset through the LSTM network. Too few epochs can lead to underfitting, whereas too many can cause overfitting. Bayesian optimisation helps find the number of epochs that balances model performance and training efficiency. The batch size is the number of training samples used in a single iteration of the optimisation process. Smaller batches can converge faster but produce noisier gradient updates. Larger batches give more consistent gradient updates, although they may be slower and need more memory. Bayesian optimisation likewise helps find a batch size that balances learning effectiveness and computational cost.
The procedure for applying Bayesian optimisation to an LSTM model's hyperparameter can be summed up as follows:
First, an objective function is defined. For the LSTM model, the objective function is the loss function, which provides a means of assessing different hyperparameter combinations and is fundamental to the optimisation process.
Second, a few random trials are carried out to gather preliminary results. These results serve as the foundation for the Bayesian optimisation algorithm, enabling it to start modelling the behaviour of the objective function.
Third, in each iteration the Bayesian optimisation algorithm chooses the next set of hyperparameters to evaluate based on the current model of the objective function. Here the hyperparameters are the number of epochs and the batch size of the LSTM model. The LSTM model is trained and evaluated for each selected set, and the Bayesian model is updated with the new evaluation results. By repeating this step, the algorithm progressively develops a more precise understanding of the objective function's behaviour.
Finally, the iteration continues until a predetermined number of iterations is reached or the model's performance meets a predetermined convergence threshold, and the best set of hyperparameters found is returned.
The goal of training or optimising a neural network is to minimise the loss function and increase prediction accuracy. A learning curve illustrates a model's performance over time as it learns from data. The curve typically plots the loss on the y-axis against the quantity of training data or the number of iterations (epochs) on the x-axis, and it provides important insights into the learning process of the model. It indicates when to stop training in order to prevent overfitting and thus suggests a suitable number of training epochs. The shape of the curve shows whether the model is underfitting (high losses on both training and validation data), overfitting (training loss decreases while validation loss increases), or well balanced. When the training loss and validation loss both level off as they descend, the model can be considered converged. Further optimisation of the model may be needed in other situations (Domingo et al. 2022).
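The Bayesian optimisation procedure described above can be sketched with a small Gaussian-process surrogate and a lower-confidence-bound acquisition rule. Everything specific here is an assumption made for illustration: the kernel, the acquisition rule, the candidate grid, and above all `toy_validation_loss`, a synthetic bowl-shaped function standing in for the real train-and-validate step. Its minimum is placed at 45 epochs and a batch size of 300 only to echo the values reported in this study; it is not the study's loss surface.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential kernel between two sets of row vectors."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a GP fitted to (x_obs, y_obs)."""
    y_mean, y_scale = y_obs.mean(), (y_obs.std() or 1.0)
    z = (y_obs - y_mean) / y_scale
    k_obs = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_cross = rbf_kernel(x_obs, x_new)
    weights = np.linalg.solve(k_obs, k_cross)            # K^-1 K*
    mean = y_mean + y_scale * (weights.T @ z)
    var = 1.0 - np.einsum("ij,ij->j", k_cross, weights)
    return mean, y_scale * np.sqrt(np.clip(var, 0.0, None))

def bayes_opt(objective, candidates, n_init=4, n_iter=10, kappa=2.0, seed=0):
    """Minimise `objective` over a finite candidate set."""
    rng = np.random.default_rng(seed)
    lo, hi = candidates.min(axis=0), candidates.max(axis=0)
    scaled = (candidates - lo) / (hi - lo)               # unit-cube inputs
    # Steps 1-2: evaluate the objective at a few random starting points.
    tried = [int(i) for i in rng.choice(len(candidates), n_init, replace=False)]
    losses = [objective(*candidates[i]) for i in tried]
    # Step 3: update the surrogate, pick the most promising untried point.
    for _ in range(n_iter):
        mean, std = gp_posterior(scaled[tried], np.array(losses), scaled)
        acquisition = mean - kappa * std                 # lower confidence bound
        acquisition[tried] = np.inf                      # never repeat a point
        nxt = int(np.argmin(acquisition))
        tried.append(nxt)
        losses.append(objective(*candidates[nxt]))
    # Step 4: return the best evaluated hyperparameters.
    best = int(np.argmin(losses))
    return tuple(candidates[tried[best]]), losses[best]

def toy_validation_loss(epochs, batch_size):
    """Synthetic stand-in for training an LSTM and returning validation loss."""
    return ((epochs - 45) / 100.0) ** 2 + ((batch_size - 300) / 500.0) ** 2

candidates = np.array(
    [(e, b) for e in range(10, 101, 5) for b in range(50, 501, 50)], dtype=float
)
best_params, best_loss = bayes_opt(toy_validation_loss, candidates)
```

In practice the objective would train the network and return its validation loss, which is why each evaluation is expensive and a surrogate model is worth the overhead.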
RESULTS AND DISCUSSION
Feature importance ranking
Sewer flow and rainfall time series data: (a) Bailong from August 1 to December 10, 2020; (b) Wantou from November 3, 2021 to December 31, 2022.
Autocorrelation analysis of sewage flow in Bailong and Wantou. The lag is indicated on the horizontal axis (each lag is 5 min), and the value of the autocorrelation coefficient indicates the degree of correlation on the vertical axis. The light blue shaded part represents 95% confidence interval.
The key observation is that these coefficients lie outside the 95% confidence interval, which indicates that they are statistically significant. Such correlations are unlikely to be the result of random noise, which points to systematic and persistent patterns in the data. In practical terms, this strong autocorrelation has considerable implications for forecasting and decision-making.
In Wantou, the first autocorrelation coefficient exceeds 0.6 and the following coefficients exceed 0.5 before declining to around 0.3, all remaining outside the 95% confidence interval (Figure 6). The initial value of about 0.6 denotes a relatively strong positive correlation between the current observation and the previous one. Because the subsequent coefficients also exceed 0.5 and fall outside the confidence interval, they underscore the persistence and significance of this autocorrelation.
The decline in the autocorrelation coefficients beyond a certain lag suggests that this relationship weakens over time. The decline to around 0.3 indicates that the impact of past observations on the current one diminishes as the lag increases. Although the coefficients remain statistically significant, the pattern shows that the time-dependent influence on the data wanes as the temporal gap widens.
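The kind of autocorrelation check discussed above can be reproduced with a few lines of code. The sketch below computes sample autocorrelation coefficients and compares them with the simple white-noise bound of ±1.96/√N; the study does not state how its 95% confidence interval was computed, so that bound is an assumption, and the flow series is a synthetic diurnal curve rather than monitoring data.

```python
import math

def autocorrelation(series, max_lag):
    """Sample autocorrelation coefficients for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]
    denom = sum(c * c for c in centred)
    return [
        sum(centred[t] * centred[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]

def white_noise_bound(n, z=1.96):
    """Approximate 95% bound for the autocorrelation of pure noise."""
    return z / math.sqrt(n)

# Synthetic two-day flow record at 5-min steps (288 samples per day).
flow = [10.0 + 3.0 * math.sin(2 * math.pi * t / 288) for t in range(576)]

acf = autocorrelation(flow, max_lag=12)          # lags up to one hour
bound = white_noise_bound(len(flow))
significant_lags = [k for k, r in enumerate(acf) if k > 0 and abs(r) > bound]
```

Lags whose coefficient falls outside the bound are the ones a lagged-flow feature can exploit.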
Flow fluctuates more in combined systems because stormwater and sanitary sewage are transported through the same pipes. Weather, particularly rainfall, has a major impact on this variation since it can cause abrupt and significant increases in flow. Such variability results in a complex flow pattern with higher autocorrelation, especially during persistent weather patterns such as a rainy season. In contrast, flow autocorrelation is lower in separate systems, which isolate household wastewater from stormwater. The sanitary flow is more stable, is governed largely by human usage patterns, and is less affected by weather-driven variations. This makes the flow pattern more regular, which lowers autocorrelation in sanitary flow.
The feature importance ranking aids in comprehending the relationship between independent features and dependent features. A high feature importance signifies a substantial contribution to the accuracy of model predictions. It is important to note that the comparatively lower importance of the remaining features does not imply a lack of informative value. It is possible that these features contain similar information to other features, resulting in a minor impact on the split nodes of each tree and consequently, a lower weight. Consequently, they may not be selected by a significant number of trees. Following the sequential forward feature selection process, the optimal feature combination for the RF model encompasses all the aforementioned features.
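A feature importance ranking of this kind can be obtained from a fitted random forest, as in the sketch below. It uses scikit-learn's impurity-based `feature_importances_` on synthetic data in which the target is driven mainly by the lagged-flow column. The study does not state which importance measure it used, so this choice, the feature names and the data are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

feature_names = ["rainfall", "sin_time", "cos_time", "prior_flow"]

# Synthetic data: the target depends strongly on prior_flow, weakly on rainfall.
rng = np.random.default_rng(1)
X = rng.random((400, 4))
y = 8.0 * X[:, 3] + 1.0 * X[:, 0] + rng.normal(0.0, 0.05, 400)

model = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)

# Impurity-based importances sum to 1 across features.
importances = model.feature_importances_
ranking = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
```

Impurity-based importances are shared among correlated features, which is consistent with the remark above that a low weight does not mean a feature is uninformative.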
Model prediction analysis
Evaluation of performance indicators for different models (train set vs. test set 7:3)
Model | Site | Data type | MAE | MSE | R2
---|---|---|---|---|---
RF | Bailong | Train | 1.221 | 2.918 | 0.899
RF | Bailong | Test | 1.388 | 4.053 | 0.863
RF | Wantou | Train | 0.414 | 0.403 | 0.992
RF | Wantou | Test | 1.124 | 2.959 | 0.945
LSTM | Bailong | Train | 1.401 | 3.768 | 0.889
LSTM | Bailong | Test | 1.428 | 3.740 | 0.800
LSTM | Wantou | Train | 0.927 | 1.914 | 0.928
LSTM | Wantou | Test | 1.649 | 8.609 | 0.918
Sewer flow prediction on four selected dates: (a) August 27, 2020 in Bailong; (b) September 14, 2020 in Bailong; (c) April 13, 2022 in Wantou; (d) August 30, 2022 in Wantou.
Table 2 illustrates the model performance before and after optimisation at the two sites. The RF and LSTM models performed well and consistently on the training and test sets. The optimum parameters after model optimisation are presented in Table 3. After optimisation, the accuracy of the model on the training set was reduced, while the accuracy on the test set increased. The model development process shows that the accuracy of machine learning models can be improved by optimising the hyperparameters. However, across a series of trials, the MAE fell by only around 0.1 L/s at most, so the improvement from optimisation can be regarded as negligible. In fact, it is difficult to raise the accuracy of the tree ensemble RF model to the level of LSTM through hyperparameter optimisation alone. If a substantial improvement in accuracy is required, independent features that correlate strongly with the dependent feature, such as water level and conductivity (Zhang et al. 2018), need to be added, or the time series needs to be lengthened.
Evaluation of model performance on the dataset for two chosen dates
Model | Description | Site | Data type | MAE | MSE | R2
---|---|---|---|---|---|---
RF | Before optimisation | Wantou | Train | 1.104 | 2.718 | 0.950
RF | Before optimisation | Wantou | Test-Apr. 13, 2022 | 1.039 | 1.882 | 0.834
RF | Before optimisation | Wantou | Test-Aug. 30, 2022 | 1.295 | 3.917 | 0.941
RF | Before optimisation | Bailong | Train | 0.629 | 0.785 | 0.973
RF | Before optimisation | Bailong | Test-Aug. 27, 2020 | 1.751 | 7.256 | 0.858
RF | Before optimisation | Bailong | Test-Sep. 14, 2020 | 1.825 | 7.562 | 0.950
RF | After optimisation | Wantou | Train | 1.122 | 2.825 | 0.948
RF | After optimisation | Wantou | Test-Apr. 13, 2022 | 1.012 | 1.795 | 0.842
RF | After optimisation | Wantou | Test-Aug. 30, 2022 | 1.245 | 3.366 | 0.949
RF | After optimisation | Bailong | Train | 1.283 | 3.125 | 0.892
RF | After optimisation | Bailong | Test-Aug. 27, 2020 | 1.588 | 6.754 | 0.868
RF | After optimisation | Bailong | Test-Sep. 14, 2020 | 1.660 | 6.248 | 0.959
LSTM | / | Wantou | Train | 1.069 | 2.686 | 0.951
LSTM | / | Wantou | Test-Apr. 13, 2022 | 0.904 | 1.422 | 0.875
LSTM | / | Wantou | Test-Aug. 30, 2022 | 1.433 | 5.169 | 0.962
LSTM | / | Bailong | Train | 1.381 | 3.721 | 0.870
LSTM | / | Bailong | Test-Aug. 27, 2020 | 1.326 | 4.110 | 0.917
LSTM | / | Bailong | Test-Sep. 14, 2020 | 1.507 | 5.018 | 0.967
Parameter values after model optimisation
Model | Parameter | Value
---|---|---
RF | n_estimators | 200
RF | max_features | 2
RF | max_depth | 10
LSTM | batch_size | 300
LSTM | epochs | 45
MAE and MSE of LSTM model results were the smallest, while R2 was the largest, which further demonstrates that LSTM has relatively high accuracy. R2 values in the test set were above or close to 0.9 on both sites in the LSTM model and RF model.
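For reference, the three evaluation indices used in the tables can be computed as in the sketch below, using their standard definitions. The observed and predicted values are made up for illustration, and the function names are hypothetical.

```python
def mean_absolute_error(observed, predicted):
    """MAE: average magnitude of the errors, in the units of the data."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def mean_squared_error(observed, predicted):
    """MSE: average squared error, which penalises large misses more heavily."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def r_squared(observed, predicted):
    """R2 = 1 - SS_res / SS_tot, the share of variance explained."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Made-up flows in L/s.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 17.0]

mae = mean_absolute_error(observed, predicted)   # (1 + 0 + 1 + 1) / 4 = 0.75
mse = mean_squared_error(observed, predicted)    # (1 + 0 + 1 + 1) / 4 = 0.75
r2 = r_squared(observed, predicted)              # 1 - 3 / 20 = 0.85
```

Because MSE squares the errors, a model can have a lower MAE but a higher MSE than another if it makes a few large misses, for instance at storm peaks.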
In reality, the accuracy of the tree ensemble model is low without adding the feature of Prior Flow. In the RF model, R2 before adding the feature Prior Flow does not exceed 0.5. After adding the feature Prior Flow to the non-optimised model, R2 increases to over 0.8 and even exceeds 0.9. The accuracy of the RF model has been greatly improved, which is close to the LSTM model. The same findings can also be reached using other evaluation indices. This is due to the fact that the flow rate in the time series correlates with itself over different time periods. The majority of sewer flow data comes from dry weather days. Additionally, due to the lack of other significant independent features directly correlating to the dependent feature, strong autocorrelation is a leading factor in improving the prediction accuracy in the tree ensemble model in this study. In other words, if there are other independent features that are very relevant to the dependent feature, they will be more decisive in improving the accuracy of the model. Autocorrelation may lose its dominant role. Autoregressive models use historical data of the feature itself to predict future values (Singh & Mohapatra 2019). Despite the low percentage of dates with rainfall in the selected sample, rainfall is still needed as an external feature to respond to the sudden increase in sewer flow. ARIMA, a popular statistical model for time series forecasting, performs optimally with data exhibiting clear trends or seasonal patterns. However, as Wang & Jiang (2019) note, ARIMA, being an autoregressive model, necessitates a stationary time series and struggles with nonlinear forecasting patterns, which are prevalent in our dataset. Additionally, ARIMA's effectiveness may diminish when applied to large datasets with diverse input features. These limitations are key reasons why the ARIMA model was not employed in this study.
In contrast, long-term dependencies and nonlinear relationships in time series data are well captured by LSTM, which makes it particularly well suited to this work, which involves analysing complex patterns over extended durations. Its capability to process entire sequences of data matches the demands of this study. Likewise, RF is known for its resilience and adaptability and handles diverse data types and configurations well. Relative to single decision tree models, the ensemble approach, which aggregates multiple decision trees, yields enhanced accuracy and manages overfitting more effectively.
Simulating sewer flow for four typical storm events with rainfall depth in Bailong: (a) 9.5 mm on Oct 4; (b) 17 mm on Aug 27; (c) 38.4 mm on Sep 14; (d) 52.6 mm on Sep 18.
Simulating sewer flow for four typical storm events with rainfall depth in Wantou: (a) 20.5 mm on Nov 29, 2021; (b) 32.5 mm on Nov 7, 2021; (c) 50 mm on Mar 21, 2022; (d) 78.5 mm on Aug 30, 2022.
Estimating total volume of RDII for different rainfall events using the LSTM model: (a) in Bailong and (b) in Wantou.
The RDII in a separate sewer system is mainly related to infiltration and inflow in sanitary sewage systems. This situation may occur when there are defects in the sewer pipeline, such as cracks or joints, that allow rainwater to enter the sewer systems. In addition, groundwater may seep into the sewer system through damaged pipes or manholes. In a combined sewer system, RDII is the direct result of the combined flow of stormwater runoff, inflow, and infiltration during rainfall events.
In a separate sewage system, according to the design of the system, the diverted rainwater is usually sent to separate treatment facilities, detention ponds, or natural water bodies. RDII related to diverted stormwater is treated or managed separately from sanitary sewage. Due to inflow and infiltration, RDII in the sanitary sewage may still occur, but it should be lower compared to the combined sewer system during the same rainfall event. After the rainfall event subsides, the separate system will gradually return to normal flow conditions.
The relationship between rainfall and RDII exhibits a strong linearity within the context of a separate sewage system as shown in Yap & Ngien (2017). However, there are instances of data points deviating from the trendline in Wantou (Figure 12). This discrepancy may be attributed to the RDII data derived from the difference in flow rates between wet and dry days. It is likely that the selected pattern for DWF does not perfectly align with the actual DWF observed on the day of the rainfall event. In essence, DWF in Wantou examined in this study did not consistently conform to a similar pattern at certain times.
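A linear relationship between rainfall depth and RDII volume can be checked with an ordinary least-squares fit, as sketched below. The numbers are synthetic and lie exactly on a line so that the result is easy to verify by hand; they are not values from Bailong or Wantou, and `linear_fit` is a hypothetical helper.

```python
def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept, plus R2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic events on the line volume = 3 * rain + 2 (not study data).
rain_mm = [10.0, 20.0, 30.0, 40.0, 50.0]
rdii_m3 = [32.0, 62.0, 92.0, 122.0, 152.0]

slope, intercept, r2 = linear_fit(rain_mm, rdii_m3)

# Residuals show which events sit away from the trendline.
residuals = [v - (slope * r + intercept) for r, v in zip(rain_mm, rdii_m3)]
```

With real events, large residuals would flag the storms that deviate from the trendline, such as those where the assumed DWF pattern does not match the actual dry weather flow on that day.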
This study aims to assess RDII from monitoring data using machine learning methods. Machine learning models can predict flow under both wet- and dry-weather conditions with better accuracy, and RDII is then estimated for different storm events. RDII is a common problem in sewer systems. An accurate assessment of RDII helps operators cope with sewer backflow and overflow and provides a reference for decision-makers in planning and management. The performance of machine learning depends on the ability of algorithms to convert datasets into regression models. A model is much easier to train, and its estimates are more accurate, when the dataset has strong regularity.
The impact of time resolution on model accuracy
The time interval of the monitoring data selected in this study was 5 min, while the monitoring interval of the sensors installed in the sewer system was 1 min. Model accuracy was also tested on data resampled to 5, 10, 20, 30 and 60 min. In Table 4, the results of the two models show that the highest accuracy occurs with the 5-min dataset, and accuracy generally decreases as the monitoring interval lengthens. As the interval increases from 5 to 60 min, R2 decreases by 0.154 (RF) and 0.222 (LSTM) in Bailong and by 0.208 (RF) and 0.225 (LSTM) in Wantou. The accuracy of the LSTM model declines more than that of the RF model. Other evaluation indices are consistent with this outcome. This may be due to the reduced information available in the training set and a higher probability that patterns in the test set have not been learned. It should be noted that a shorter monitoring interval increases the information available, but it also increases running time.
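Resampling to coarser intervals can be sketched as simple block averaging. The text does not say how the data were aggregated, so averaging the flow rate over each block is an assumption, and `resample_mean` is a hypothetical helper. For rainfall depth, summing over each block would be the more natural choice.

```python
def resample_mean(values, factor):
    """Average consecutive blocks of `factor` samples.

    An incomplete block at the end of the series is dropped.
    """
    n_blocks = len(values) // factor
    return [
        sum(values[i * factor:(i + 1) * factor]) / factor
        for i in range(n_blocks)
    ]

# One hour of made-up 5-min flow readings (12 samples).
five_min = [float(i) for i in range(12)]

ten_min = resample_mean(five_min, 2)       # 6 samples
sixty_min = resample_mean(five_min, 12)    # 1 sample
```

Going from 5 min to 60 min leaves one twelfth of the samples, which illustrates why less information is available for training at coarser resolutions.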
Comparison of the evaluation indices in test set at different monitoring frequencies (train/test = 7/3)
Model | Site | Indices | 5 min | 10 min | 20 min | 30 min | 60 min
---|---|---|---|---|---|---|---
RF | Wantou | MAE | 1.151 | 1.352 | 1.576 | 1.811 | 2.429
RF | Wantou | MSE | 3.060 | 4.225 | 5.878 | 7.718 | 14.243
RF | Wantou | R2 | 0.943 | 0.919 | 0.890 | 0.862 | 0.735
RF | Bailong | MAE | 1.423 | 1.614 | 1.711 | 1.688 | 1.880
RF | Bailong | MSE | 3.832 | 4.910 | 5.298 | 5.253 | 6.489
RF | Bailong | R2 | 0.794 | 0.735 | 0.649 | 0.680 | 0.640
LSTM | Wantou | MAE | 1.649 | 2.099 | 2.395 | 2.759 | 3.584
LSTM | Wantou | MSE | 8.609 | 11.308 | 14.924 | 19.809 | 32.287
LSTM | Wantou | R2 | 0.918 | 0.892 | 0.857 | 0.811 | 0.693
LSTM | Bailong | MAE | 1.428 | 1.604 | 1.844 | 2.020 | 2.170
LSTM | Bailong | MSE | 3.740 | 4.578 | 5.686 | 6.777 | 7.793
LSTM | Bailong | R2 | 0.800 | 0.749 | 0.685 | 0.638 | 0.578
CONCLUSIONS
In this study, two machine learning algorithms, RF and LSTM, were used to predict sewer flow in combined and separate sewer systems, and the RDII was then analysed using the estimated DWF. Both models achieved high accuracy in sewer flow prediction, with the LSTM model outperforming the RF model. Compared with the traditional methods in the previous literature, the machine learning models provided reliable estimates of the sewer flow rate and RDII and produced an importance ranking of the related features in sewer flow modelling. To further improve prediction accuracy, future work should examine the effectiveness of additional features relevant to sewer flows, such as water level and conductivity, and explore the potential of more advanced machine learning models in sewer modelling.
ACKNOWLEDGEMENTS
This work was funded by the National Key R&D Programme of China (2022YFC3203200), the Key Research and Development Programme of Zhejiang Province (2020C03082), and the Key Research and Development Programme of Ningbo (2023Z216). The corresponding author was also supported by the Ningbo Young Technology Innovation Leading Talent Programme (2023QL029).
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.