Abstract
The flow rate in sanitary sewer lines is an important parameter for normalizing data used in wastewater-based epidemiology applications, but measuring it directly is not always feasible. The use of machine learning to estimate past wastewater influent flow rates in support of public health applications has not been studied. The aim of this study was to assess wastewater treatment plant influent flow rates against weather data and to retrospectively estimate flow rates in Louisville, Kentucky (USA), from other data types using machine learning. A random forest model was trained using a range of variables, such as feces-related indicators, weather data that could be associated with dilution in sewage systems, and area demographics. The developed algorithm estimated the flow rate with an accuracy of 91.7%, although it did not perform as well for short-term (one-day) high flow rates. This study suggests that variables such as precipitation (mm/day) and population size are more important for wastewater flow estimation, whereas the fecal indicator concentrations (cross-assembly phage and pepper mild mottle virus) were less important. Our study challenges currently accepted opinions by showing the potential public health application of artificial intelligence in wastewater treatment plant flow rate estimation for wastewater-based epidemiology.
HIGHLIGHTS
Machine learning to estimate wastewater influent flow rates has not been studied for wastewater-based epidemiology applications.
Five wastewater treatment plants in Louisville, KY, USA, were studied to provide training and testing data sets of measured flow.
The random forest algorithm estimated past flow rates with an accuracy of 91.7%.
Artificial intelligence has potential applications in wastewater-based epidemiology.
INTRODUCTION
On March 11, 2020, the World Health Organization (2020) declared coronavirus disease 2019 (COVID-19) a global pandemic. Conventionally, wastewater-based epidemiology (WBE) for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has relied on the flow rate of wastewater treatment plant influent as a critical variable for standardized data reporting (McClary-Gutierrez et al. 2021). While these facilities have flow measurement systems for regulatory sampling (United States Environmental Protection Agency 2017), upstream locations in the sanitary sewer network, such as streetline manholes, often lack equipment and access (McClary-Gutierrez et al. 2021), so the flow rate there must be modelled and estimated. The conventional approach for adjusting the level of SARS-CoV-2 RNA for dilution due to stormwater and other factors is to divide it by flow rate. Alternatively, when the flow rate is not available, imperfect and variable measures of other fecal indicators, such as cross-assembly phage (crAssphage) and pepper mild mottle virus (PMMoV), have been used (Holm et al. 2022a, 2022b). A more comprehensive strategy would involve multiple inputs, including biological, meteorological, seasonal, and geographical data. Such a complex array of candidate inputs can be used to train a machine-learning algorithm that aids flow rate estimation. Applications of machine learning to estimate wastewater influent flow rates at sampling locations in support of public health applications have not been reported previously.
The aims of this study were (1) to assess wastewater treatment plant influent flow rates against meteorological data (minimum or maximum daily temperature (°C), or precipitation (mm/day)); and (2) to retrospectively estimate flow rates at wastewater treatment plants from a range of other data types using machine learning.
METHODS
Study site
Jefferson County, Kentucky (USA), contains five wastewater treatment plants (Table 1) that cover approximately 97% of the county population (Holm et al. 2022b). The Morris Forman Water Quality Treatment Plant is the only facility served by a combined sewer system, meaning that wastewater and stormwater share the same piped network, which makes its influent flow rate particularly susceptible to precipitation. The other four facilities have separate piping networks for stormwater and wastewater.
Table 1. Characteristics of the studied wastewater treatment plants and associated areas
Water quality treatment plant | Combined sewer | Income ($)a | Populationa | Area (km²) | 2019 Mean flow rate (MGD) | 2020 Mean flow rate (MGD) | 2021 Mean flow rate (MGD) |
---|---|---|---|---|---|---|---|
Cedar Creek | No | 76,606 | 55,928 | 80 | 5 | 6 | 6 |
Derek R. Guthrie | No | 53,577 | 295,910 | 332 | 45 | 49 | 37 |
Floyds Fork | No | 113,699 | 32,460 | 88 | 4 | 4 | 3 |
Hite Creek | No | 106,769 | 31,269 | 67 | 5 | 5 | 5 |
Morris Forman | Yes | 54,138 | 349,850 | 280 | 89 | 81 | 97 |
MGD, million gallons per day.
a Based on the 2018 United States Census Bureau American Community Survey (ACS). Income is the mean of median household income.
Data
Daily influent flow rate data were provided by the wastewater utility, Louisville/Jefferson County Metropolitan Sewer District, from January 1, 2019, to December 31, 2021. Temperature and precipitation data georeferenced for each wastewater treatment plant from August 1, 2020, through June 16, 2021, at 15-minute resolution, were provided by a commercial service, Tomorrow.io (Boston, MA, USA), which uses the wireless network infrastructure to collect weather data. The crAssphage (copies/mL) and PMMoV (copies/mL) concentrations were obtained from Holm et al. (2022a, 2022b). Population, income, and race/ethnicity data for each wastewater treatment plant area were obtained from the United States Census Bureau (2020).
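The weather data arrive at 15-minute resolution, whereas the model inputs are a daily precipitation total (mm/day) and the air temperature at 12:00. The source does not describe how this reduction was done; the following is a minimal sketch of one way to do it. The record layout `(timestamp, temperature_c, precip_mm)`, the function name `daily_weather`, and the assumption that each precipitation value is the depth that fell in that interval are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def daily_weather(records):
    """Collapse 15-minute weather records into one entry per calendar day.

    `records` is an iterable of (timestamp, temperature_c, precip_mm) tuples.
    This layout is an assumption for illustration, not the provider's format.
    """
    precip_total = defaultdict(float)
    noon_temp = {}
    for ts, temp_c, precip_mm in records:
        day = ts.date()
        precip_total[day] += precip_mm      # interval depths add up to mm/day
        if ts.hour == 12 and ts.minute == 0:
            noon_temp[day] = temp_c         # air temperature at 12:00 (noon)
    return {
        day: {
            "precip_mm_day": round(precip_total[day], 3),
            "temp_noon_c": noon_temp.get(day),  # None if the noon reading is missing
        }
        for day in sorted(precip_total)
    }


# Toy day of 96 readings: 0.5 mm in each of four intervals, 25 °C at noon.
start = datetime(2020, 8, 1)
toy_records = [
    (
        start + timedelta(minutes=15 * i),
        25.0 if i == 48 else 20.0,
        0.5 if 40 <= i < 44 else 0.0,
    )
    for i in range(96)
]
daily = daily_weather(toy_records)
```

Summing interval depths is appropriate only if the feed reports accumulated depth per interval; a feed that reports intensity (mm/h) would need each value scaled by the interval length first.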
Model
The random forest machine learning algorithm considered the following variables: crAssphage concentration (copies/mL), PMMoV concentration (copies/mL), site air temperature (°C) at 12:00 (noon), location, site precipitation (mm/day), and area population size, income, and race/ethnicity. The three main software packages used to create the random forest model were NumPy, Pandas, and Scikit-learn (McKinney 2010; Pedregosa et al. 2011; Harris et al. 2020). NumPy stores data in arrays, and Pandas was used to manipulate the data. The random forest model was constructed with Scikit-learn (Kensert et al. 2018), starting from the default model with no preset hyperparameters in the local Python development environment. The data from August 18, 2020, through June 16, 2021, excluding April 23 to May 31, 2021, were split randomly using Scikit-learn into a training group containing 80% of the data and a testing group containing the remaining 20%. The following hyperparameters were then optimized for the wastewater treatment plant data: maximum depth, maximum number of features, minimum number of samples per leaf, minimum number of samples per split, and number of estimators. The excluded period of April 23 to May 31, 2021, was held out to compare the measured and estimated flow rates of the studied wastewater treatment plants.
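The workflow described above can be sketched as follows. This is an illustration on synthetic data, not the study's code: the use of `RandomForestRegressor` (rather than a classifier), the use of `GridSearchCV` as the search method, the candidate values in `param_grid`, and the invented relationship between the inputs and flow are all assumptions. Only the 80/20 random split and the five tuned hyperparameters come from the text, and the date-based hold-out period is not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-ins for three of the inputs named in the text (not study data).
precip_mm_day = rng.gamma(shape=0.5, scale=8.0, size=n)
temp_noon_c = rng.normal(15.0, 8.0, size=n)
population = rng.choice([31_269, 55_928, 295_910], size=n)
X = np.column_stack([precip_mm_day, temp_noon_c, population])

# Invented relationship, used only to give the model something to fit.
y = population / 7_000 + 0.4 * precip_mm_day + rng.normal(0.0, 1.0, size=n)

# Random 80/20 split into training and testing groups.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The five hyperparameters listed in the text; the candidate values are assumptions.
param_grid = {
    "max_depth": [5, None],
    "max_features": [0.5, 1.0],
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 10],
    "n_estimators": [25, 50],
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

# Coefficient of determination on the 20% testing group.
test_r2 = search.best_estimator_.score(X_test, y_test)
```

With real data, a contiguous period such as April 23 to May 31, 2021, would be removed before the split and scored separately, so that the comparison of measured and estimated flow uses days the model never saw.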
Data analysis
Analysis of the yearly flow data was performed using Minitab Statistical Software (version 21.1.0.0; Minitab, LLC, State College, PA, USA). Plots were produced using RStudio (version 1.4.1106; R Core Team 2019).
Ethics
The University of Louisville Institutional Review Board classified this project as non-human subject research (reference#: 717950).
RESULTS AND DISCUSSION
Treatment plant flow rate by year, with referenced holidays: (a) Cedar Creek Water Quality Treatment Plant; (b) Derek R. Guthrie Water Quality Treatment Plant; (c) Floyds Fork Water Quality Treatment Plant; (d) Hite Creek Water Quality Treatment Plant; and (e) Morris Forman Water Quality Treatment Plant.
Precipitation likewise had little predictive value for flow rate; similarly weak correlation coefficients were observed across the treatment plants. Despite this weak relationship, the association between flow rate and precipitation was statistically significant at each of the five treatment plants (p < 0.05) (see Supplementary Information). This finding was common to all five treatment plants even though they span both combined and separate sanitary sewer systems.
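A weak correlation coefficient and a significant p-value are not contradictory: with many daily observations, even a small association is unlikely to be due to chance. The sketch below illustrates this on synthetic data. The source does not state which correlation statistic was used; Pearson's r is assumed here, the p-value uses a large-sample normal approximation to the t statistic, and the function names and simulated values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value for r, treating the t statistic as normal (large n only)."""
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


random.seed(1)
n = 1000
# Synthetic daily precipitation (mm/day) with a mean of 5 mm.
precip = [random.expovariate(0.2) for _ in range(n)]
# Synthetic flow (MGD): a small precipitation effect buried in larger noise.
flow = [40.0 + 0.3 * p + random.gauss(0.0, 5.0) for p in precip]

r = pearson_r(precip, flow)
p_value = approx_p_value(r, n)
```

In this construction precipitation explains under a tenth of the variance in flow, yet the p-value falls far below 0.05 because of the sample size.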
Actual and estimated flow rate (million gallons per day) from the random forest algorithm: (a) Cedar Creek Water Quality Treatment Plant; (b) Derek R. Guthrie Water Quality Treatment Plant; and (c) Floyds Fork Water Quality Treatment Plant.
The random forest model accurately estimated the flow rates across the three treatment plants, which differ in population size and flow rate. However, at the Derek R. Guthrie Water Quality Treatment Plant, the model's estimates were unreliable for several days around a temporary one-day high flow rate. On May 5, 2021, the model underestimated the flow by nearly 33 MGD (29% variance), and the estimates for the day before and the day after were off by 20–30 MGD.
The concentrations of the two fecal indicators, crAssphage and PMMoV, were not highly weighted by the model, indicating their lower importance when estimating the flow rate. This finding may be due to the large variance in these data (Holm et al. 2022a), which may therefore be less useful for estimating retrospective flow rates. However, in situations where precipitation is not a usable variable, owing to a drier climate or a lack of georeferenced precipitation data, crAssphage and PMMoV concentrations may provide some useful information.
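The text does not define how the overall accuracy figure was computed. One common convention for regression models, assumed here purely for illustration, is 100% minus the mean absolute percentage error (MAPE). The sketch below uses toy values, not study data, and the function name `mape` is hypothetical.

```python
def mape(measured, estimated):
    """Mean absolute percentage error, in percent, relative to the measured values."""
    errors = [abs(m - e) / m for m, e in zip(measured, estimated)]
    return 100.0 * sum(errors) / len(errors)


# Toy flows in MGD: two days are each off by 10%, one day is exact.
measured = [100.0, 50.0, 40.0]
estimated = [90.0, 55.0, 40.0]

error_pct = mape(measured, estimated)   # (10 + 10 + 0) / 3
accuracy_pct = 100.0 - error_pct
```

Because each day contributes equally to the average, a metric of this kind can remain high overall even when an individual peak day is badly underestimated.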
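The weighting of variables by a random forest can be inspected through Scikit-learn's impurity-based `feature_importances_` attribute. The sketch below shows the mechanics on synthetic data in which, by construction, flow depends on population and precipitation and is unrelated to the two fecal indicators, so the ranking it produces is built in and does not reproduce the study's result. Whether the authors used this attribute or another importance measure is not stated, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 400
names = ["precip_mm_day", "population", "crassphage_copies_ml", "pmmov_copies_ml"]

# Synthetic inputs (not study data).
precip = rng.gamma(0.5, 8.0, size=n)
population = rng.choice([31_269.0, 55_928.0, 295_910.0], size=n)
crassphage = rng.lognormal(10.0, 1.5, size=n)   # high variance, unrelated to flow here
pmmov = rng.lognormal(8.0, 1.5, size=n)         # high variance, unrelated to flow here
X = np.column_stack([precip, population, crassphage, pmmov])

# Invented flow relationship driven only by population and precipitation.
flow = population / 7_000 + 1.0 * precip + rng.normal(0.0, 1.0, size=n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, flow)

# Importances are non-negative and sum to 1 across the features.
importance = dict(zip(names, model.feature_importances_))
ranked = sorted(importance, key=importance.get, reverse=True)
```

Impurity-based importances can overstate high-variance continuous inputs, so a permutation-based check on held-out data is a reasonable complement when the ranking matters.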
FUTURE RESEARCH
While this research provides preliminary insight into the potential application of machine learning for wastewater-based epidemiology in settings where the flow rate is unavailable, a few key areas require further investigation. First, alternative machine learning models can be used. Algorithms such as artificial neural networks, recurrent neural networks, and support vector machines may produce different results. Second, although the random forest model proved effective in this study, the dataset used to train it covered a limited period (August 2020 to June 2021), and pandemic conditions may have influenced the flow rates. Finally, the application of machine learning models to wastewater sampling locations upstream of treatment plants, mainly streetline manholes, may be most useful in areas that routinely lack in-place flow-rate measuring equipment.
CONCLUSIONS
Wastewater systems exhibit a regular flux of flow rates over time. The use of a machine learning model to retrospectively estimate the flow rate resulted in an algorithm with an accuracy of 91.7%. For future machine learning applications to estimate wastewater flow rates, the present study suggests prioritizing site precipitation (mm/day) and population size as input variables; fecal indicators (crAssphage and PMMoV) were found to be less important. A limitation of the model is that a single high flow event could affect three days of estimation, which suggests a limit to the temporal resolution at which high flow rates can be estimated. Our study challenges currently accepted opinions by showing the potential application of artificial intelligence in wastewater treatment plant flow rate estimation for wastewater-based epidemiology applications.
ACKNOWLEDGEMENTS
We thank the Louisville/Jefferson County Metropolitan Sewer District for their valuable collaboration for wastewater data collection. We would also like to thank Dr Andrew Karem for insight into creating the machine learning model.
FUNDING
This work was supported by a contract from the Louisville-Jefferson County Metro Government as a component of the Coronavirus Aid, Relief, and Economic Security Act, as well as grants from the James Graham Brown Foundation and Owsley Brown II Family Foundation. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.