ABSTRACT
Surveillance networks have been established in many countries worldwide to monitor SARS-CoV-2 in sewage and to estimate the communal prevalence of COVID-19 cases. Despite their popularity, gaining a rapid understanding of how infectious diseases spread across the territory covered by a network is difficult because of the many factors involved. To improve the detection of warning signals within the territory, we propose to apply principal component analysis (PCA) to screen time-series data generated from wastewater treatment plants (WWTPs) under surveillance. Our analysis allows us to identify single WWTPs deviating from the normal behavior as well as deviations of a cluster of WWTPs (indicative of an intermunicipal outbreak). Our approach is illustrated through the analysis of the dataset generated by the Catalan Surveillance Network of SARS-CoV-2 in Sewage (SARSAIGUA). Using 10 principal components, we captured 78.6% of the variance in the original dataset of 51 variables (WWTPs). Our analysis identified exceedance of the Q-statistic threshold as evidence of anomalous performance of a single WWTP, and exceedance of the T2-statistic as a sign of an intermunicipal outbreak. Our approach provides a comprehensive picture of the spread of the COVID-19 pandemic, enabling decision-makers to make informed decisions and better manage future pandemics.
HIGHLIGHTS
PCA is applied to SARS-CoV-2 measurements in sewage from multiple WWTPs.
Q- and T2-statistics are useful to trigger warnings of COVID-19 outbreaks at city and national levels.
Warnings across waves are triggered by different subsets of WWTPs.
The contribution of WWTPs to triggering warnings is independent of their size.
INTRODUCTION
Wastewater-based epidemiology (WBE) has emerged as a powerful approach for extracting valuable health-related information from wastewater, enabling comprehensive surveillance of population health (Singer et al. 2023). While its initial focus was on tracking poliovirus, WBE techniques have rapidly expanded to encompass a broad spectrum of chemical and biological targets. These include monitoring illicit drug use (González-Mariño et al. 2020), assessing pharmaceutical consumption patterns (Escolà Casas et al. 2021), and analyzing various health and lifestyle biomarkers (Daughton 2018; Shao et al. 2023). Moreover, the global outbreak of SARS-CoV-2 has significantly heightened the importance of wastewater as a valuable source of information (Bivins et al. 2020). This has led to the establishment of surveillance networks at regional and country scales, with over 4,107 monitoring sites across 72 countries worldwide dedicated to monitor SARS-CoV-2 in sewage (Naughton et al. 2023). These networks play a crucial role in enhancing our understanding of chemical and biological target dynamics and facilitating targeted interventions. These networks usually involve monitoring multiple wastewater treatment plants (WWTPs), ranging from 8 (e.g., in Slovenia) to 352 (i.e., in The Netherlands) (Table 1). Once established, the data generated are analyzed and transformed into useful information to support decision-making of health authorities. Results are commonly presented by plotting normalized loads of the targeted SARS-CoV-2 genetic marker over time per WWTP. Also, such plots usually include a smoothing operation (e.g., median over 7 days or the moving average of 7 days). In some cases, loads are summed up to have a general overview of the network at the country level (Guerrero-Latorre et al. 2022). Trends in SARS-CoV-2 loads are then estimated by applying the percent change recommended by the CDC, which entails utilizing the regression slope of at least the three most recent measurements and categorizing it based on the statistical significance of the slope (https://www.cdc.gov/nwss/reporting.html). Alternatively, other trend estimates, such as the relative strength index and the Mann-Kendall trend test, are also considered (Chan et al. 2023). Some surveillance networks transform the results into qualitative indicators (of concentrations or loads and/or of trends) shown on a map (e.g., the French network). In some cases, such as the Swiss and Austrian surveillance networks, the effective reproductive number of SARS-CoV-2, the time-varying analog of the basic reproductive number along one or several waves of the disease, is estimated from wastewater data at each monitored WWTP (Huisman et al. 2022).
Name . | # WWTPs . | Cov. (%) . | SF . | Data analysis . |
---|---|---|---|---|
Austria | 48 | 58 | 2 | Figures: Gene copies/inhab./day per site; Sum of loads of all WWTPs; Effective reproductive number per WWTP |
Belgium | 43 | 45 | 2 | Figures: Gene copies/day and per 100,000 inhab. (rolling average); Increasing trend indicators |
Canada | 38 | NA | 3 | Figures: 7-day rolling average of gene copies/day for each site. Trend expressed as: Statistically significant increase or decrease, possible increase, no change |
Catalonia | 56 | 80 | 1* | Figures: 7-day rolling average of gene copies/day for each site and aggregated at the national level. Trend expressed as: Increasing, stable, decreasing |
Finland | 16 | 49 | 1* | Figures: Gene copies/1,000 inhab./day; the trend and its uncertainty are estimated using a statistical model |
France | 200 | 33 | NA | Maps: qualitative assessment, trend over last 30 days, trend over last 7 days |
Germany | 135 | NA | NA | Figures: Gene copies/L per site and aggregated; Maps and Heatmaps of trends |
Hungary | 21 | NA | 1 | Maps: Concentrations and trend (qualitative, 4 categories) |
Ireland | 68 | 80 | 1 | Figures: Qualitative (Positive, positive DNQ, weak positive, negative) |
Luxembourg | 13 | NA | 1 | Figures: Gene copies/day/100,000 equivalent inhab. |
Netherlands | 352 | 98 | 2–3 | Figures and Map: average number of gene copies per 100,000 inhab. per municipality |
Slovenia | 8 | 25 | 1 | Figures: N2 gene copies/PMMoV; estimated cases from wastewater after linear regression |
Sweden | 13 | 25 | 1 | Figures: Gene copies/PMMOV |
Switzerland | 14 | 27 | 3–6 | Figures: Gene copies/100,000 inhab./day; Effective reproductive number per WWTP |
England | 270 | 60 | 3 | National and regional means of gene copies/L |
US (wastewaterSCAN) | 200 | 60 | 2–3 | Figures and maps: Gene copies/PMMoV per site, and national levels categorized into bottom, middle, and upper thirds |
US (Biobot) | 700 + | 30 | 1 | Figures: Nationwide, region, county, expressed as gene copies per mL; case estimates from effective concentrations |
Name . | # WWTPs . | Cov. (%) . | SF . | Data analysis . |
---|---|---|---|---|
Austria | 48 | 58 | 2 | Figures: Gene copies/inhab./day per site; Sum of loads of all WWTPs; Effective reproductive number per WWTP |
Belgium | 43 | 45 | 2 | Figures: Gene copies/day and per 100,000 inhab. (rolling average); Increasing trend indicators |
Canada | 38 | NA | 3 | Figures: 7-day rolling average of gene copies/day for each site. Trend expressed as: Statistically significant increase or decrease, possible increase, no change |
Catalonia | 56 | 80 | 1* | Figures: 7-day rolling average of gene copies/day for each site and aggregated at the national level. Trend expressed as: Increasing, stable, decreasing |
Finland | 16 | 49 | 1* | Figures: Gene copies/1,000 inhab./day; the trend and its uncertainty are estimated using a statistical model |
France | 200 | 33 | NA | Maps: qualitative assessment, trend over last 30 days, trend over last 7 days |
Germany | 135 | NA | NA | Figures: Gene copies/L per site and aggregated; Maps and Heatmaps of trends |
Hungary | 21 | NA | 1 | Maps: Concentrations and trend (qualitative, 4 categories) |
Ireland | 68 | 80 | 1 | Figures: Qualitative (Positive, positive DNQ, weak positive, negative) |
Luxembourg | 13 | NA | 1 | Figures: Gene copies/day/100,000 equivalent inhab. |
Netherlands | 352 | 98 | 2–3 | Figures and Map: average number of gene copies per 100,000 inhab. per municipality |
Slovenia | 8 | 25 | 1 | Figures: N2 gene copies/PMMoV; estimated cases from wastewater after linear regression |
Sweden | 13 | 25 | 1 | Figures: Gene copies/PMMOV |
Switzerland | 14 | 27 | 3–6 | Figures: Gene copies/100,000 inhab./day; Effective reproductive number per WWTP |
England | 270 | 60 | 3 | National and regional means of gene copies/L |
US (wastewaterSCAN) | 200 | 60 | 2–3 | Figures and maps: Gene copies/PMMoV per site, and national levels categorized into bottom, middle, and upper thirds |
US (Biobot) | 700 + | 30 | 1 | Figures: Nationwide, region, county, expressed as gene copies per mL; case estimates from effective concentrations |
Cov., population coverage (in %); SF, sampling frequency (# per week); NA, not available.
* A subset of WWTPs is sampled at less frequency than once per week.
Beyond the implementations at the national level, several studies have proposed mathematical methods: (i) to convert raw wastewater data to prevalence estimates (Ahmed et al. 2020; Morvan et al. 2022), (ii) to smooth the time-series data based on an autoregressive model (Courbariaux et al. 2022), (iii) to reduce and correct the noise associated with the quantification of SARS-CoV-2 gene targets (Cluzel et al. 2022), (iv) to model single outbreaks for short-term forecasting (Joseph-Duran et al. 2022), and (v) to model the transmission of SARS-CoV-2 (SEIR model) in combination with the fate of SARS-CoV-2 in sewage after fecal shedding (Nourbakhsh et al. 2022; Mattei et al. 2023) conferring long-term forecasting prediction capabilities and a way to simulate non-pharmaceutical interventions. Yet, all these approaches are applied to data obtained from a single WWTP. When it comes to aggregating information from multiple WWTPs within the network, it becomes more complex to gain a rapid understanding of how the pandemic is evolving across the entire territory covered by the network. As such, there is a need for reliable and scalable mathematical methods that can provide a more comprehensive picture of the pandemic's spread at the national level. Two surveillance networks employ mathematical methods to integrate data from multiple WWTPs. The WastewaterSCAN network (wastewaterscan.org) represents the 5-sample trimmed average of SARS-CoV-2 concentrations normalized by the concentration of the Pepper Mild Mottle Virus, used as an indicator of human fecal shedding. Aggregated trend lines present population-weighted averages across groups of sites. National levels, depicted in charts, offer a relative interpretation of current wastewater levels over the last 365 days. These levels are categorized into bottom, middle, and upper thirds, based on a retrospective analysis of data from the past 365 days across all sites. The US Biobot (https://biobot.io) approach includes weekly averaging within reporting counties, weighted by population, using a 3-sample rolling average for greater emphasis on recent measurements. This data is subsequently averaged at both the national and regional levels.
In this work, we applied a statistical process control technique to analyze the time-series data generated from multiple WWTPs at once. The usefulness of the approach is illustrated by the application to the data obtained from the Catalan Surveillance Network of SARS-CoV-2 in Sewage (Corominas et al. 2021; Guerrero-Latorre et al. 2022) (SARSAIGUA). Our analysis is based on principal component analysis (PCA), which allows us to feed the gene target loads into an algorithm that can identify deviations in single WWTP from normal behavior (a COVID-19 early warning at a city scale) as well as generic deviations from normal behavior for a cluster of WWTPs (occurrence of an outbreak at the national level).
METHODS
The Catalan Surveillance Network of SARS-CoV-2
Catalonia is a region with more than 7.7 million inhabitants in north-eastern Spain. The Catalan Water Agency (ACA) and the Public Health Agency of Catalonia (ASPCAT) promoted and funded the deployment of SARSAIGUA (Guerrero-Latorre et al. 2022). This network started in July 2020 monitoring 56 WWTPs evenly distributed across Catalonia and serving 80% of the total population. The sample collection and analysis started on the 6th of July 2020, approximately 4 months after the detection of the first clinical COVID-19 case in Catalonia (25 February 2020). Out of the 532 WWTPs in Catalonia, 56 were included in the surveillance network; these 56 represent a high population coverage (80%) and are evenly distributed across the territory (at least 1 WWTP per county). The sampling frequency was set to one sample per week in 36 of the selected 56 WWTPs and fortnightly for the remaining 18, thus resulting in the collection and analysis of 45 samples per week. Notably, some WWTPs are only surveyed during the summer season to better monitor municipalities receiving high tourism (e.g., Castell-Platja d'Aro, Vilaseca-Salou). Refrigerated flow-based composite samples are collected for most WWTPs at the entrance. Details on the WWTPs and the sampling can be found in Supplementary Table S1. The 45 weekly samples are distributed to the three reference laboratories with wide expertise in molecular diagnosis and environmental virology. Each laboratory receives 15 samples per week that are analyzed for SARS-CoV-2 genome abundance using optimized protocols. Quantification of SARS-CoV-2 genomes is accomplished using RT-qPCR targeting a common genetic marker (N1) and two complementary targets, N2 and IP4 (CDC 2020; Pasteur 2020). Manual data quality control on the results generated from the laboratories is executed once a week. This manual quality control involves evaluating: (i) RT-qPCR standard curve parameters; (ii) recovery values from process control; and (iii) correlation between RT-qPCR target genes. Whenever large deviations from previous accumulated data in parameters like recovery percentage of a process control are identified, the laboratories are requested to repeat the sample.
WWTP data
Data used for this study is obtained from an API-created ad-hoc (https://apicovid.icradev.cat/n1). When accessed, this API reads the SQL database where the results of the SARS-CoV-2 gene targets are uploaded weekly by the laboratories responsible for the quantification, and outputs the data retrieved in CSV format. The data includes information about the concentration of N1 gene copies at each WWTP and influent flows (in m3/day). For the analysis, we only included data corresponding to the 51 WWTPs sampled weekly and biweekly, while those WWTPs that were sampled on a seasonal basis (5) were excluded from the analysis. The data used as input for the statistical method applied in this study was the normalized N1 gene load for each WWTP, that is the concentration of N1 (in gene copies (GC)/L) multiplied by the daily flow of the WWTP (L/day) and divided by the number of inhabitants assisted by the corresponding WWTP. Data interpolation was applied to the biweekly sampled WWTPs (N1 concentration and flows) to obtain a weekly value, and missing values (due to sampling or analytical issues) on the entire dataset were interpolated as well. We are aware that this interpolation is not feasible in a real-world surveillance network where only weekly WWTPs can be used. However, we decided to interpolate fortnightly sampled WWTP to increase the size of the dataset and provide more robust evidence of our method. In total, 138 observations collected from the 51 WWTPs were used for the period between July 2020 and August 2023.
Reported clinical cases
Daily reported clinical cases were aggregated for municipalities in the catchment of each WWTP using the official API from the Information Systems of the Department of Health and the Catalan Health Service (Departament de Salut n.d.) (https://analisi.transparenciacatalunya.cat/resource/jj6z-iyrp.json).
Simulated COVID-19 active cases using a compartmental epidemic model
The outcomes from the SCVEIR ID RHUD model described in Fonseca i Casas et al. (2023) and continuously updated and calibrated in the SDL-PAND dashboard (http://pand.sdlps.com) were used in our work to cover the gap of reported clinical cases after April 2022. The model is a cellular automation (CA) model, hence the ‘CA’ in the name of the model, CA-SCVEIR ID RHUD, where each one of the different cells of the automata implements a complete SCVEIR ID RHUD model. The different levels of the compartmental model are: (i) S: susceptible, (ii) C: confined, (iii) V: vaccinated, (iv) E: exposed, (v) I: infective, with IR: infective real (real cases) and ID: infective detected, where the evolution is based on the IR and ID are only for validation purposes, (vi) R: recovered, (vii) H: hospitalized, (viii) U: critical hospitalized, and (ix) D: Dead. The presented data encompasses all simulated active cases throughout Catalonia, extending beyond the confines of the 51 WWTPs specifically included in the PCA. Nevertheless, utilizing this broader dataset for comparative analysis remains valuable, given that these 51 WWTPs represent 80% of the Catalan population.
PCA and statistical process control
PCA is a multivariate statistical method for exploratory data analysis. It can be used to reduce the number of variables in a dataset while retaining as much information as possible (Jolliffe 2002). PCA can be used for a variety of applications, such as identifying patterns in data, reducing the dimensionality of data for visualization, and making predictions using machine learning algorithms. Here, we propose to use PCA as a basis for an automated warning system procedure using the Q-statistic (a.k.a. squared prediction error) (Jackson & Mudholkar 1979) and the Hotelling's T2 (Johnson & Wichern 2002).
The application of PCA followed comprehensive manual data quality control procedures for both flow and N1 measurements. Thus, the primary objective of the PCA analysis was not to identify issues arising from measurement errors, but rather to elucidate the correlations among N1 loads across WWTPs. For the PCA, each of the 51 WWTPs was treated as a variable (matrix column) and an observation (matrix row) is a week in time between July 2020 and August 2022. An input matrix without gaps is required when running PCA; hence, linear interpolation was applied to the biweekly sampled WWTPs (N1 concentration and flows) and to the missing values due to incidences in sampling (e.g., missing flow data). Scaling was applied to each column, to ensure each WWTP is given equal weight in the monitoring process. This involves subtracting each variable by its sample mean (to capture the variation from this mean) and then, divide by its standard deviation. This gives equal importance to all columns and removes variabilities caused by different sampling and analytical methods (e.g., analytical differences between laboratories). After performing PCA, the Q-statistic was calculated to identify structural changes in the process. An increase in Q suggests that the correlation structure captured by the PCA model is not maintained for that observation (1 observation is the set of N1 loads from the 51 WWTPs). In our context, this indicates that one or more WWTPs deviate from the average behavior. Additionally, T2 was computed to measure the distance of each observation from the center of the PCA model, represented by the mean. An increase in T2 signifies that the observation maintains the model structure but with values that are further from the mean, indicating a general deviation of N1 loads from ‘normal’ values and signaling an outbreak scenario. We set up the threshold values for Q and T2 using the first set of observations (n = 51) (training dataset) and using an arbitrary level of significance (α) of 0.05 which gave us acceptable results to illustrate the potential of the PCA. We did not follow a rigorous approach for setting up the thresholds which would require data labeling, training, and validation. Finally, a contribution analysis is performed to determine which variable (WWTP) or variables in the original dataset space are responsible for the detected warning, specifically identifying instances where thresholds are exceeded. We implemented adaptive PCA, which is an extension of traditional PCA that allows for the continuous updating of the principal components as new data becomes available. We used an incremental window to update the PCA, that is, on day i, we used the observations from 1 to i. To improve the stability of the method, in each new PCA, we rejected all observations where Q or T2 values were higher than the respective thresholds of 0.05.
PCA and thresholds implementation
The PCA, the calculations of Q and T2 and the thresholds (see detailed equations in the Supplementary Material) were implemented in JavaScript programming language, which is native to the most-used web browsers at the time of writing (e.g., Google Chrome, Mozilla Firefox, Microsoft Edge, etc.) and it is independent of the operating system and device type. We chose JavaScript rather than other languages keener to statistical methods (such as R or Python) to facilitate the integration of the warning system in the SARSAIGUA platform (https://sarsaigua.icra.cat), already developed using JavaScript. However, the proposed warning system method has only been applied retrospectively for the purpose of this research but not into the SARSAIGUA programme since it was discontinued in December 2023. A matrix library was created to compute basic matrix operations (e.g., sum, multiplication, transposition, determinant, inverse, etc.). Then, the SVD algorithm and PCA procedures, along with the computation of the Hotelling's T2 and Q statistics and contribution analysis were implemented on top of this first layer. This matrix library is freely available at https://github.com/icra/matrix-library-js. To validate the JavaScript implementation, numeric tests were conducted using R language (https://www.r-project.org/) since it is a well-known and established numeric platform with PCA functions built in. A visual web interface was built on top of the matrix library, coded in HTML and CSS, and using VueJS (https://vuejs.org/) as the front-end JavaScript framework.
RESULTS
Raw time series
PCA results (scores and loadings)
Triggering warnings for outbreaks
Threshold exceedances
Contributions of the different WWTPs to the Q and T2 warnings
DISCUSSION
The method proposed in this work allows triggering warnings for COVID-19 outbreaks using data of SARS-CoV-2 N1 gene loads measured in influent wastewater of several WWTPs monitored weekly by SARSAIGUA surveillance network. Despite SARSAIGUA prioritized maximizing spatial coverage over relying on high-frequency measurements, our method allows the effective triggering of outbreak alerts using 1 measurement per week at high spatial resolution. In this regard, it is worth mentioning that the distance from any place to the nearest WWTP included in the network is less than 13 km (SD: 6.6, max: 38) representing 141k inhabitants/WWTP and 569 km2/WWTP. By prioritizing the spatial coverage over the high-frequency data acquisition, it is thus possible to enhance the robustness of the warning system. In practice, the national wastewater surveillance networks have adopted different strategies for data collection. Most networks typically collect between 3 and 7 samples per week (see Table 1) to identify trends in SARS-CoV-2 concentration in wastewater and estimate the communal prevalence of COVID-19 in the surveilled territory. Such high frequencies are required for properly estimating trends (Chan et al. 2023) and for the estimation of the effective reproductive number of the virus (Re) (Huisman et al. 2022). Conversely, networks such as SARSAIGUA (Guerrero-Latorre et al. 2022) opted for a broader coverage by monitoring a larger number of WWTPs at a lower sampling frequency. With such low temporal resolution, it is not possible to get a reliable trend analysis per each WWTP, but as shown here it is possible to trigger warnings at a national scale. Discerning which of these approaches is better requires future investigations conducted on a comprehensive dataset encompassing a large number of WWTPs sampled at high frequency, enabling the downscaling of the temporal and the spatial resolutions. This debate has gained prominence, especially in the current scenario, where budget constraints for wastewater surveillance networks prompt considerations about optimizing either the sampling locations or the sampling frequency. To make informed decisions under these budgetary limitations, it is imperative to determine the most effective approach for ensuring timely warnings and accurate detection of anomalies within the wastewater treatment infrastructure.
The activation of warnings across the progressive pandemic waves does not consistently involve the same set of WWTPs. Interestingly, the larger WWTPs assisting populations exceeding 1 million inhabitants did not significantly contribute to the triggering of warnings but largely dictate the overall (or ‘average’) behavior. In contrast, WWTPs serving populations smaller than 1 million bore the responsibility for initiating warnings. In our analytical framework, all WWTPs hold equal weight due to the statistical standardization of N1 loads of each WWTP. Consequently, an outbreak involving only a few individuals in a small population (i.e., small WWTP) may exert a substantial influence on the outcome of warning initiation. This underscores the heightened impact of even minor outbreaks within these smaller WWTPs on the warning-triggering results. Our investigations in Catalonia provide substantial support for the notion that the propagation of COVID-19 across waves originates differentially at a spatial scale. Assuming rigorous sampling protocols were consistently implemented in all WWTPs across the territory, particularly employing flow-based composite sampling, we propose that our approach effectively identifies the originating sources of each outbreak using wastewater data.
Our approach has several limitations. While data-driven methodologies like PCA offer operational simplicity by requiring few parameters, their effectiveness hinges on the careful selection of these parameters. In our PCA method, two parameters – the number of principal components retained in the model and the level of statistical significance (α) – need to be determined. The selection of the latter entails data labeling, dataset separation into training and validation sets, and refinement of the Q and T2 thresholds. Although in our application, these thresholds were kept fixed after training, monitoring non-stationary processes may necessitate adjusting the parameters to maintain the desired false detection rate (Schmitt et al. 2016). Moreover, as the behavior of the virus may evolve with the emergence of new variants, there is a need to readjust the thresholds after each outbreak. Secondly, the first PCA can only be applied when the number of observations (elapsed weeks) matches the number of WWTPs. Consequently, if wastewater monitoring begins after the outbreak onset, there will be a delay before the method becomes useful as an ‘early warning system’. This issue becomes more pronounced for larger networks. For example, a network of 100 WWTPs, collecting and analyzing 1 sample per week, would require approximately 2 years of training data. In such cases, either the number of WWTPs included in the PCA model should be reduced or the sampling frequency should be increased. The lead time of warnings generated by our method relative to other approaches warrants further investigation. Another limitation is the utilization of interpolation, particularly in the test dataset. While linear interpolation is acceptable for training datasets, its application in the test dataset can lead to data leakage and overly optimistic performance estimates. Another option to address data gaps is to use the last available value and maintain it until a new value is received. However, this strategy may lead to false alarms. For example, when an outbreak has reached the peak and starts descending, the majority of WWTPs may experience a decrease in values, while those interpolated with the last available value remain unchanged. Consequently, this discrepancy may trigger a Q alarm, erroneously signaling an anomaly in the data. Addressing this interpolation issue and refining the adaptation of Q and T2 thresholds would require further research.
The wastewaterSCAN national surveillance network (USA) facilitates integrated data visualization from multiple WWTPs, featuring two key aspects. Firstly, it enables the plotting of population-weighted average trend lines for selected site groups, accessible by choosing predefined groups (e.g., state or county) from the dropdown menu. Secondly, the charts incorporate national-level indicators, categorizing data into bottom, middle, and upper percentiles based on the past 365 days across all wastewaterSCAN sites. The 33rd and 66th percentiles delineate these categories, offering a relative context for interpreting the last year's wastewater levels. While effective for contextualizing SARS-CoV-2 RNA trend lines, the approach lacks warning capabilities. This is where the PCA approach proposed in this paper stands out. Finally, the PCA approach was coded in JavaScript and is open source so that it can be easily integrated in the existing web dashboards of the different national surveillance networks. The developed system allows detecting such abnormal behavior and tracing back the origin of the warning to a single WWTP. Notwithstanding this automation, a manual check is recommended for each warning generated to detect potential errors in the input data that may conduct to false alarms that mislead health authorities. Such a situation occurred once in our dataset, where an unexpected increase in 4.5 standard deviations was observed. All analytical parameters were within the quality control ranges and the only explanation was the discharge of sewage from cruise ships to the target WWTP. Yet, the system allowed identifying such an anomaly and allowed the quality check of the data corresponding to that observation.
CONCLUSIONS
In this study, we have presented a method for generating network-level warnings in sewage surveillance networks for SARS-CoV-2. Our approach maximizes the value contributed by each individual WWTP to the whole network, highlighting that ‘the whole is greater than the sum of its parts’. By utilizing a statistical process control technique, our method can identify deviations from normal behavior at the level of individual WWTPs, providing early warnings for potential epidemic outbreaks providing that a wastewater monitoring programme for the pathogen is already in place. Furthermore, our analysis shows that network-level warnings are not generated by a specific set of WWTPs on a recurrent basis, but rather all WWTPs contribute with relevant information at different times, and this contribution is independent of WWTP size. Our method for generating network-level warnings can provide a comprehensive picture of the spread of a pandemic across a territory, which can inform decision-making processes and intervention strategies by health authorities. By leveraging the value of individual WWTPs in the whole network, our method can maximize the potential for early detection and intervention, ultimately leading to more effective control of the pandemic. Last but not least, despite being applied for SARS-CoV-2 datasets, our method can be applied to other biological or chemical targets detected in urban sewage.
ACKNOWLEDGEMENTS
The study is within the frame of the virWASTE project, which has received funding from the Agència de Gestió d'Ajuts Universitaris i de Recerca (AGAUR) under the call ‘Pandèmies 2020’ (Ref. 2020 PANDE 00044). ICRA authors acknowledge the Economy and Knowledge Department of the Catalan Government through Consolidated Research Groups 2021-SGR-01283 ICRA-TECH and 2021 SGR 01282 ICRA-ENV, and the funding from the CERCA program (Generalitat de Catalunya). The authors also acknowledge the Catalan Surveillance Network of SARS-CoV-2 in Sewage (SARSAIGUA) for the provision of data. We thank Miquel Calvo from UB for his contributions during the design phase of the study.
DATA AVAILABILITY STATEMENT
All relevant data are available from an online repository or repositories: https://zenodo.org/records/5137621; https://github.com/icra/sars-aigues/.
CONFLICT OF INTEREST
The authors declare there is no conflict.