## Abstract

In the case of SARS-CoV-2 pandemic management, wastewater-based epidemiology aims to derive information on the infection dynamics by monitoring virus concentrations in the wastewater. However, due to the intrinsic random fluctuations of the viral signal in wastewater caused by several influencing factors that cannot be determined in detail (e.g. dilutions; number of people discharging; variations in virus excretion; water consumption per day; transport and fate processes in sewer system), the subsequent prevalence analysis may result in misleading conclusions. It is thus helpful to apply data filtering techniques to reduce the noise in the signal. In this paper we investigate 13 smoothing algorithms applied to the virus signals monitored in four wastewater treatment plants in Austria. The parameters of the algorithms have been defined by an optimization procedure aiming for performance metrics. The results are further investigated by means of a cluster analysis. While all algorithms are in principle applicable, SPLINE, Generalized Additive Model and Friedman's Super Smoother are recognized as superior methods in this context (with the latter two having a tendency to over-smoothing). A first analysis of the resulting datasets indicates the positive effect of filtering to the correlation of the viral signal to monitored incidence values.

## HIGHLIGHTS

The random component in the timeline of SARS-CoV-2 virus concentration makes data filtering necessary.

Thirteen common filtering techniques are investigated for their potential to smooth the virus signals.

SPLINE, GAM and Friedman's Super Smoother are seen as superior algorithms for smoothing SARS-CoV-2 signals.

## INTRODUCTION

Management of the SARS-CoV-2 pandemic rests upon measures such as hygiene, isolation and vaccination but requires rigorous monitoring on the state and spread of the disease (Nicola *et al.* 2020). Next to individual qPCR and antigen testing, wastewater-based epidemiology (WBE) has been recognized as a valuable tool to estimate viral prevalence (Weidhaas *et al.* 2021). There is a rapidly increasing body of evidence about the methodology and its application (see e.g., Ahmed *et al.* 2020; Kitajima *et al.* 2020; Wu *et al.* 2020a, 2020b; Zhu *et al.* 2021). Several studies, e.g. He *et al.* (2020) or Wölfel *et al.* (2020) showed that infected persons in a catchment shed a certain amount of viral load per day into the sewer system, resulting in measurable viral titers at the sampling point – expressed as virus RNA concentrations [number of RNA copies/ml]. Such measurements are usually taken as composite samples at the treatment plant of the urban drainage system.

A key element in WBE is to derive information on the infection dynamics in the catchment by means of the monitored virus particle concentrations by quantifying their RNA genome. The signal serves as a proxy for prevalence, i.e., the total number of infected persons in the catchment (Ahmed *et al.* 2020). WBE is thus a valuable additional source of information next to individual testing strategies. Even more, time series prediction can be applied to the signal, thus serving as potential early-warning tool in pandemic management (e.g., Gonzalez *et al.* 2020; Hart & Halden 2020).

Relating to the basic concepts of WBE (Choi *et al.* 2018; Feng *et al.* 2018) it is typical not only to use the raw signal (RNA concentration as equivalent to virus particles) for further analysis but to apply normalization regarding (a) flow dynamics and (b) changes in population by use of biomarkers (Been *et al.* 2014). Still, it is due to the complexity of the whole process that the wastewater titer signals (both raw value and normalized) not only express the prevalence information but also contain huge variations. Reasons are manifold, but key factors are (a) the individual variances in viral shedding (both amount and time) of the infected persons, (b) effect of spatial distribution of the viral load in the catchment (i.e., the location of the main entry points), (c) stochastic influences to transport mechanics and virus degradation in the sewer system, (d) influence of rain runoff due to dilution and loss via CSO, and (e) variances that stem from both the sampling procedure and the laboratory methods. To conclude, the wastewater signal (even if normalized) contains not only the sought-after information on prevalence in the catchment but also a large noise contribution. The latter, however, makes the analysis and data interpretation difficult and data modeling a poorer fit to the actual status (Wand 2003; Samuelsson *et al.* 2017).

To differentiate the noise from the actual information in the signal, filtering techniques are frequently used in science and engineering (Huang *et al.* 2016). Simple filter techniques such as the moving average are common, but recent studies (e.g., Stadler *et al.* 2020; Wu *et al.* 2020a, 2020b; Nemudryi *et al.* 2020; Graham *et al.* 2020) applied more advanced procedures such as locally weighted polynomial or spline methods to smooth the viral load signals. Clearly, the choice of the filtering method, as a data preprocessing step, depends on the aim of the subsequent data use. Still, a thorough investigation to identify the optimal smoothing method for filtering SARS-CoV-2 time series is missing. In this study, 13 filtering methods, from simple to advanced, have been applied to the viral load measured at four locations in Austria and are evaluated against three performance indicators, i.e., mean absolute error (MAE), variability (VAR), and Akaike Information Criterion (AIC).

In the remainder of the paper, we first outline the status of WBE in Austria, the four selected case studies and the titer datasets derived therefrom. The filtering methods are presented but not elaborated in detail. To obtain the best performance of each method, its parameters are calibrated by an optimization procedure with respect to the performance metrics. The results are further investigated by means of a cluster analysis. Last, the explanatory power of the datasets with respect to infection dynamics is analyzed.

## MATERIALS AND METHODS

### Wastewater surveillance and datasets

Early in the pandemic, Austria established the research project Coron-A to develop the scientific background of WBE as a Covid-19 surveillance tool (Coron-A 2021). Fundamental to the project is the surveillance of 23 wastewater treatment plants (WWTPs) in Austria by taking 24 h volume-proportional composite samples (CVVT: constant volume; variable time) from the inlet of the WWTPs. The sampling frequency is bi-weekly or higher.

Sampling is done by cooled automatic samplers (various suppliers). Samples were cooled to 4 °C (Markt *et al.* 2021) and shipped to the laboratory in Styrofoam boxes with coolpacks guaranteeing continuous cooling during transport. In the laboratory, the 70 g sample was centrifuged for 30 min at 4,500 *g* (4508 R cooling centrifuge, Eppendorf, Hamburg, Germany) to remove particulate matter. The supernatant was then concentrated through polyethylene glycol (PEG) centrifugation at 12,000 *g* for 99 minutes. The pellet obtained was suspended in 800–1,000 μl lysis buffer (details see Markt *et al.* 2021) and transferred to a microreaction tube (Eppendorf). The RNA was purified using the Monarch™ total RNA Miniprep Kit (New England Biolabs, Ipswich, USA). After Nanodrop RNA quantification and appropriate dilution the SARS-CoV-2 nucleocapsid (N1) gene RNA copy numbers were determined on a RotorGene cycler (Qiagen, Hilden, Germany) using a plasmid standard containing the N gene of SARS-CoV-2 (2019-nCoV_N_Positive Control, IDT, Leuven, Belgium) (Markt *et al.* 2021). According to Pérez-Cataluña *et al.* (2021) the recovery can be estimated as approximately 50%.

According to national regulations, WWTPs apply a self-monitoring scheme and measure flow rate and temperature on a daily basis. Water quality parameters such as COD, N_{tot} and NH_{4}^{+} are analyzed as well but the frequency depends on the design capacity of the investigated WWTP (varying between daily to weekly). Water quality parameters are likewise determined via the same 24 h composite samples as used for determination of the SARS-CoV-2 titer.

For our study, we selected four different sampling locations or urban drainage catchments respectively (see Table 1). For easier reference and respecting data protection acts in Austria, the catchments/cities are denoted as A–D in the following. The locations mainly vary in population and catchment area size as well as type of sewage system.

Sampling site | Connected residents | Avg. daily inflow 2020 (m^{3}/d) | Avg. monthly temperature 1971–2000 (°C)^{a} | Avg. total annual precipitation 1971–2000 (mm/a)^{a} | Avg. number of days with total daily precipitation >10 mm (d/a)^{a} |
---|---|---|---|---|---|
A (urban) | 1900000 | 539450 | 11.4 | 548 | 14.9 |
B (urban) | 320681 | 83187 | 9.0 | 1184 | 40.0 |
C (rural) | 41696 | 16344 | 8.9 | 1231 | 40.3 |
D (rural) | 23600 | 4899 | 7.9 | 889 | 29.1 |

Case studies A and B represent prototypical Austrian cities (large to medium) with high population density and an urban environment. In both cases, the entire urban catchment discharges to the WWTP. Case studies C and D, on the other hand, represent smaller settlements, and case study D is moreover a highly touristic place with predominantly summer tourism. Figure 1 depicts the locations of the case studies within Austria's administrative regions. Meteorological data from 1971 to 2000 show a temperate climate for all sampling sites. However, the locations experience up to 40 days/year with a total daily precipitation of 10 mm or higher, which leads to significant runoff and to a loss of virus particles in the sewage by combined sewer overflow.

Wastewater surveillance in the four chosen case studies started in summer 2020 (in case of location A already in May 2020) and samples were taken weekly or more frequently. In this study we concern ourselves with the data until the end of 2020, which is a timeline of 8 months – see Figure 2 for details.

The timeline of the (raw) wastewater titer values follows the epidemic data derived from individual testing. For the latter we depict here active cases as identified by summing up infections versus recovered/deceased cases (Figure 2). The lockdown after the first pandemic wave in March 2020 was quite successful in also reducing the virus signal in wastewater. This is demonstrated by the low RNA concentration measured at city A in the early summer period. In principle, the signal remained at a low level for all sites during summer and early fall. One exception was case study D, where a sharp increase in RNA concentrations was observed in August, potentially related to the increase in summer tourism. Starting in October 2020, the beginning of the second Austrian pandemic wave was also reflected in the wastewater signal. Both the reported cases of infections and the viral RNA concentration in the four WWTPs peaked in mid-November. Thereafter, another lockdown was imposed over the country, which again reduced both the infections and the RNA signal. Note that the temporal differences in the infection dynamics in the four case studies are likely to be due to the regional epidemic management.

### Normalization

The aim of WBE is to relate the viral load L_{virus} to the infection dynamics. We are thus less interested in the raw surveillance data than in the specific viral load, which is derived for an arbitrary datapoint in the series as:

$$L_{virus} = \frac{c_{virus} \cdot Q}{P}$$

where L_{virus} is the specific viral load in RNA copies/P/d; Q = flow volume in L/d; c_{virus} = virus concentration in the sample in RNA copies/L and *P* = number of persons in the watershed. While the flow in the timeline is available from the measured inflow data at the WWTP (see section 2.1), the temporal variation of *P* in the catchment is to be estimated via a wastewater biomarker (Been *et al.* 2014) as:

$$P = \frac{c_{bm} \cdot Q}{f_{bm}}$$

where c_{bm} is the concentration of the biomarker in g/L and f_{bm} = specific biomarker load in g/P/d. The choice of an appropriate biomarker has been subject to numerous investigations (Choi *et al.* 2018). However, for this investigation we are less interested in actual values but can express the influence by normalization. As a biomarker that is readily available at wastewater treatment plants due to regular surveillance, we apply the standard water quality parameter NH_{4}-N. The specific load f_{bm} is here derived from the measured 50-percentile value in the period of the first lockdown in Austria, as load fluctuations are minimal therein. Despite NH_{4}-N being potentially influenced by industry contribution, the parameter is applicable in this context (Been *et al.* 2014; Rauch *et al.* 2021).
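The normalization can be illustrated with a minimal Python sketch. The function names and all numbers below are hypothetical (they are not measured data from the case studies); the sketch only mirrors the two relations between flow, biomarker concentration, population and specific viral load described above.

```python
def estimate_population(c_bm, flow, f_bm):
    """P = c_bm * Q / f_bm, i.e. (g/L * L/d) / (g/P/d) = persons."""
    return c_bm * flow / f_bm


def specific_viral_load(c_virus, flow, population):
    """L_virus = c_virus * Q / P, in RNA copies per person per day."""
    return c_virus * flow / population


# Hypothetical daily values for one catchment (illustration only).
flow = [5.0e7, 6.5e7, 5.2e7]       # Q in L/d
c_nh4 = [0.040, 0.031, 0.039]      # NH4-N concentration in g/L
c_virus = [2.0e4, 1.5e4, 2.4e4]    # RNA copies/L
f_bm = 8.0                         # assumed specific NH4-N load in g/P/d

for q, cb, cv in zip(flow, c_nh4, c_virus):
    persons = estimate_population(cb, q, f_bm)
    load = specific_viral_load(cv, q, persons)
    print(f"P = {persons:.0f} persons, L_virus = {load:.3e} copies/P/d")
```

Substituting the population estimate into the load equation shows that the flow cancels, so the biomarker-normalized load reduces to c_{virus} · f_{bm} / c_{bm}; the flow measurement then only matters through the population estimate itself.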

### Filtering techniques

The timeline of surveillance raw data (and normalized data as well) includes not only the sought-after information regarding prevalence in the catchment but contains a significant noise contribution, that is due to stochastic effects in the whole process. In this study we apply and compare 13 filter/smoothing techniques with the aim to de-noise the time series of RNA-concentration in WBE. Table 3 summarizes the methods applied herein and gives the key reference(s) for each.

Percentile | City A | City B | City C | City D |
---|---|---|---|---|
2.5% percentile | 9.77 | 5.84 | 8.02 | 5.94 |
50.0% percentile | 10.71 | 6.49 | 8.99 | 6.80 |
97.5% percentile | 12.17 | 7.13 | 9.73 | 9.32 |

Calculated percentiles from daily measurements during the first lockdown period in Austria (April to mid-May 2020).

Method | Abbreviation | Reference | Sample |
---|---|---|---|
Tukey Smoother | TUK | Mallows (1979) | Fiskeaux & Ling (1982) |
Kalman Filtering | KAF | Tusell (2011) | Pan et al. (2016) |
Fast Fourier Transform Filtering | FFT | Cochran et al. (1967) | Yang et al. (2004) |
Spline | SPL^{a,e} | Reinsch (1967) | Eubank (1988) |
Kernel Smoother | KER^{a,e} | Härdle & Vieu (1992) | Speckman (1988) |
Simple Moving Average | SMA^{a} | Hyndman (2011) | He et al. (2020) |
Robust Running Medians | RRM^{a} | Friedman & Stuetzle (1982) | Polasek (1984) |
Friedman's Super Smoother | SUP^{a,e} | Friedman (1984) | Friedman & Silverman (1989) |
Locally-Weighted Polynomial | POL^{a,e} | Atkeson et al. (1997) | Rajagopalan & Lall (1998) |
Savitzky-Golay Filters | SGF^{b} | Press & Teukolsky (1990) | Bromba & Ziegler (1981) |
Auto Regressive Model | ARI^{b,e} | Akaike (1969) | Lohani et al. (2012) |
Adaptive Degree Polynomial Filter | ADP^{c,e} | Barak (1995) | Jakubowska & Kubiak (2004) |
Generalized Additive Model | GAM^{d,e} | Hastie (2017) | Murphy et al. (2019) |

^{a}Single parameter.

^{b}Double parameter.

^{c}Triple parameter.

^{d}Above triple parameter model.

^{e}Models with the ability of forecasting.

Typically, the filtering methods can be distinguished by the number of parameters they require, ranging from 0 to more than 3. Parameter-less data models (here FFT, TUK, and KAF) are mathematically complex and (usually) more difficult to implement, but have the obvious benefit that no calibration is needed. The other data models typically have one to three adjustable parameters. The most commonly used parametric method in the engineering community is the (centered) SMA, which is both simple to implement and contains one parameter only. Its drawback is the lower robustness as compared to other, more complex methods (Williams *et al.* 1998; Raudys *et al.* 2013). The GAM model, which is a more recent development, contains a high number of parameters and is a combination of additive and generalized linear models (Wood *et al.* 2016). Since the parameters can be estimated, the GAM method could also be applied as parameter-less (not done herein). The numerical details of the methods implemented in this study are omitted, as this information is given exhaustively in the literature (see reference column in Table 3). The methods are implemented in R and were tested on reference datasets prior to application.
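To make the notion of a single-parameter filter concrete, the following Python sketch shows a centered simple moving average whose only parameter is the window length. It is an illustration under stated assumptions, not the R implementation used in the study; in particular, truncating the window at the series boundaries is one possible edge treatment chosen here for simplicity.

```python
def centered_moving_average(series, window):
    """Centered simple moving average with an odd window length.

    Near the start and end of the series the window is truncated to the
    available points (an assumed edge treatment).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        chunk = series[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical titer values (illustration only).
raw = [12.0, 30.0, 18.0, 45.0, 39.0, 80.0, 62.0]
print(centered_moving_average(raw, window=3))
```

A larger window removes more noise but also flattens genuine peaks, which is why the window length has to be calibrated rather than chosen ad hoc.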

Some of the implemented models are not only used for smoothing but also provide means for signal forecasting, e.g., GAM, ADP, ARI, POL, SUP, KER, and SPL. Since this study focuses on signal filtering, the capability and accuracy of the techniques in terms of prediction are not taken into account.

### Workflow

The general workflow for the investigation is depicted in Figure 3. The overall data analysis is divided into two main steps: (a) model fitting and (b) clustering. In the first step, multiple smoothing algorithms are applied to the SARS-CoV-2 titer values (raw and normalized to NH_{4}) to de-noise the measured signals in the wastewater system. To quantify their performance, a cross-validation approach (Stone 1978) is implemented to estimate a precise error value associated with each model configuration. First, a given filter is fitted to the SARS-CoV-2 time series ($x = x_1, \dots, x_T$) with one arbitrary element of *x* left out, and the resulting model fit and metrics are calculated. This procedure is repeated, leaving out each remaining element of the series in turn, until a *T*-by-*T* matrix of fitted values $\hat{X}$ is computed. Thus, for any observed measurement $x_t$ there are *T* fitted values available, i.e., the vector $\hat{x}_t$ (the $t$-th row of $\hat{X}$). From $\hat{x}_t$ we compute the model prediction including uncertainty bounds by using the empirical cumulative distribution function of $\hat{x}_t$.
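The leave-one-out procedure can be sketched in Python as follows. The smoother used here (a moving average that skips missing entries) is only a stand-in for any of the 13 filters, and the function names, the window length and the example values are assumptions made for illustration.

```python
import math


def sma_skip_missing(series, window=5):
    """Centered moving average that ignores entries set to None."""
    half = window // 2
    fitted = []
    for i in range(len(series)):
        values = [v for v in series[max(0, i - half): i + half + 1]
                  if v is not None]
        fitted.append(sum(values) / len(values))
    return fitted


def leave_one_out_matrix(series, smoother):
    """Return a T-by-T list of rows; rows[t] holds the T fitted values
    for observation t, one per left-out element."""
    T = len(series)
    columns = []
    for j in range(T):
        reduced = list(series)
        reduced[j] = None              # drop the j-th observation
        columns.append(smoother(reduced))
    return [[columns[j][t] for j in range(T)] for t in range(T)]


def empirical_quantile(values, q):
    """Smallest value whose empirical CDF is at least q."""
    ordered = sorted(values)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]


# Hypothetical signal (illustration only).
signal = [3.1, 2.8, 3.6, 5.9, 4.4, 7.2, 6.8, 9.5, 8.1, 10.3]
rows = leave_one_out_matrix(signal, sma_skip_missing)
bounds = [(empirical_quantile(r, 0.025),
           empirical_quantile(r, 0.5),
           empirical_quantile(r, 0.975)) for r in rows]
for x_t, (lower, median, upper) in zip(signal, bounds):
    print(f"x={x_t:5.1f}  fit={median:5.2f}  95% band=[{lower:5.2f}, {upper:5.2f}]")
```

Each row of the matrix collects how the fitted value at one time step reacts to removing any single observation, so a wide band flags time steps where the filter output depends strongly on individual measurements.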

As most models contain adjustable parameters, calibration is an essential step of the workflow. We apply mathematical optimization (in discrete or continuous mode, as needed) to guarantee the best model structure/parameter selection. As stated frequently in the literature, the genetic algorithm (GA) is well suited to solve both real-valued and integer programming problems, even for complex and ill-posed problems (e.g., Panchal & Panchal 2015).

The second step involves clustering of the methods according to their performance (error and consistency). As some of the methods are functionally similar, the center of the clusters is seen as a representative solution. For clustering the K-Medoid algorithm (Park & Jun 2009) is applied.
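A compact sketch of k-medoid clustering on two performance indices per method is given below. It is a simplified variant (alternating assignment and medoid update with a farthest-first initialization, which is an assumption and not necessarily the initialization of Park & Jun), and the method labels and scores are hypothetical. Ranking the clusters by the sum of the medoid's indices is likewise an assumed convention; distinct points are assumed.

```python
import math


def k_medoids(points, k, max_iter=100):
    """Simplified k-medoids: returns (medoid indices, cluster label per point)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Farthest-first initialization (assumed, deterministic).
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        medoids.append(max(range(n),
                           key=lambda i: min(dist[i][m] for m in medoids)))

    def assign(meds):
        return [min(range(k), key=lambda c: dist[i][meds[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            # New medoid: member with the smallest total distance to its cluster.
            updated.append(min(members,
                               key=lambda i: sum(dist[i][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(medoids)


# Hypothetical (MAE, VAR) pairs for six unnamed methods.
names = ["m1", "m2", "m3", "m4", "m5", "m6"]
scores = [(0.10, 0.10), (0.12, 0.11), (0.50, 0.50),
          (0.52, 0.48), (0.90, 0.95), (0.92, 0.90)]
medoids, labels = k_medoids(scores, k=3)
best_cluster = min(range(3), key=lambda c: sum(scores[medoids[c]]))
print("representative of best cluster:", names[medoids[best_cluster]])
```

Because a medoid is always an actual data point, the center of the best cluster directly names one of the filtering methods, which is what makes the approach convenient for selecting a representative.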

### Calibration of parametric data models

Filtering methods with parameter(s) require calibration to estimate the best configuration. Global calibration algorithms based on mathematical optimizers are manifold in science and engineering (Schutte *et al.* 2004; Price *et al.* 2006; Kaur *et al.* 2020). Since GA has been applied successfully to different optimization problems (Mehr *et al.* 2018; Ghoddusi *et al.* 2019) and is suitable to solve either permutation (integer) programming or real-valued optimization as required herein, GA was selected as optimization procedure (see Mirjalili (2019) about GA details and operators). Table 4 summarizes the parameters/configurations used for GA to calibrate the parametric filtering methods.

Parameter/configuration | Value/method |
---|---|
Population size | 100 |
Iteration | 1000 |
Mutation rate | 0.1 |
Crossover rate | 0.8 |
Elitism | 0.05 |
Selection | Roulette wheel |
Mutation method | Random |
Crossover | Two points |
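How such a GA configuration might be used to calibrate a one-parameter filter is sketched below in Python. The defaults follow the table above (population size, iterations, mutation, crossover and elitism rates, roulette-wheel selection, two-point crossover), but the bit-string encoding of the window length, the per-bit mutation, the leave-one-out cost function and the synthetic series are all assumptions made for this illustration, not the calibration code of the study.

```python
import math
import random


def loo_sma_error(series, window):
    """Mean absolute error when each point is predicted by the mean of its
    neighbours in a centered window (the point itself is left out)."""
    half = window // 2
    total = 0.0
    for i, value in enumerate(series):
        neighbours = [series[j]
                      for j in range(max(0, i - half),
                                     min(len(series), i + half + 1))
                      if j != i]
        total += abs(value - sum(neighbours) / len(neighbours))
    return total / len(series)


def decode_window(bits):
    """Map a bit string to an odd window length: 3, 5, 7, ..."""
    return 2 * int("".join(map(str, bits)), 2) + 3


def genetic_algorithm(cost, n_bits, pop_size=100, generations=1000,
                      mutation_rate=0.1, crossover_rate=0.8,
                      elitism=0.05, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    n_elite = max(1, round(elitism * pop_size))
    for _ in range(generations):
        costs = [cost(ind) for ind in pop]
        ranked = [ind for _, ind in sorted(zip(costs, pop), key=lambda p: p[0])]
        # Roulette wheel: lower cost gives a larger selection weight.
        worst = max(costs)
        weights = [worst - c + 1e-12 for c in costs]
        children = [ind[:] for ind in ranked[:n_elite]]   # elitism
        while len(children) < pop_size:
            mother, father = rng.choices(pop, weights=weights, k=2)
            child = mother[:]
            if rng.random() < crossover_rate:
                # Two-point crossover: copy the middle segment from the father.
                i, j = sorted(rng.sample(range(n_bits + 1), 2))
                child[i:j] = father[i:j]
            # Random mutation: flip each bit with probability mutation_rate.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        pop = children
    return min(pop, key=cost)


# Synthetic noisy series (illustration only).
noise = random.Random(1)
series = [math.sin(t / 6.0) + noise.gauss(0.0, 0.3) for t in range(60)]


def cost(bits):
    return loo_sma_error(series, decode_window(bits))


# Fewer generations than in the table, to keep the demonstration fast.
best = genetic_algorithm(cost, n_bits=3, generations=50)
print("calibrated window length:", decode_window(best))
```

For a single integer parameter an exhaustive search would of course suffice; the GA becomes worthwhile for the multi-parameter filters, where the search space is mixed integer and real-valued.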

### Performance indices

To quantify the performance of the filters, the indicators MAE and VAR as well as the Akaike Information Criterion (AIC; … *et al.* 1986; Anderson & Burnham 2004) were computed for every model. With $x$ as the original signal and $\hat{X}$ as a square *T*-by-*T* matrix of the filtered series (where column *t* holds the filtered values of $x$ obtained in the absence of the *t*^{th} signal value), the performance indicators MAE, VAR, and AIC are computed from $\hat{x}_t$, the *t*^{th} row of $\hat{X}$, and its average $\bar{\hat{x}}_t$, where *k* is the number of model parameters and 1 < *t* < *T*.
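As a hedged illustration, the three indicators could be computed from the original signal and the rows of the cross-validation matrix as in the Python sketch below. The formulas are common textbook forms and are assumptions here: MAE as the mean absolute deviation between each observation and the average of its fitted values, VAR as the mean variance of the fitted values per time step, and AIC in its least-squares form T·ln(RSS/T) + 2k. They may differ from the exact definitions used by the authors.

```python
import math
from statistics import mean, pvariance


def performance_indices(x, rows, k):
    """Return (MAE, VAR, AIC) under assumed textbook definitions.

    x    : original signal, length T
    rows : rows[t] holds the fitted values for x[t] from cross-validation
    k    : number of model parameters
    """
    T = len(x)
    centers = [mean(r) for r in rows]                 # average fitted value
    mae = sum(abs(a - c) for a, c in zip(x, centers)) / T
    var = sum(pvariance(r) for r in rows) / T         # spread of the fits
    rss = sum((a - c) ** 2 for a, c in zip(x, centers))
    aic = T * math.log(rss / T) + 2 * k               # requires rss > 0
    return mae, var, aic


# Tiny hypothetical example (illustration only).
x = [1.0, 2.0, 3.0]
rows = [[1.0, 2.0, 1.5],
        [2.0, 2.0, 2.0],
        [2.0, 4.0, 3.0]]
mae, var, aic = performance_indices(x, rows, k=1)
print(mae, var, aic)
```

Read together, MAE measures how far the smoothed curve departs from the data, VAR how stable the fit is when single observations are removed, and AIC penalizes the error by the number of parameters.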

## RESULTS AND DISCUSSION

Applying the methods described in section 2, the suitability of the smoothing methods is tested for the viral load signals of the four case studies. Note that we first apply the whole procedure as described in 2.4 to the raw signal and, in a second step, repeat the procedure for the NH_{4}-normalized signal.

### Raw signal

Figure 4 shows the results of the 13 implemented filtering techniques for the indicators MAE and VAR. According to the results, the MAE varies significantly from station to station, while the indicator VAR varies similarly across stations. Generally, we see a strong influence of the catchment size (City A–D) on the results. For the dataset City A (large city), both MAE and VAR are smaller for all filtering techniques as compared to the values obtained for the dataset City D (small community). City B and City C corroborate the trend that MAE and VAR increase as the catchment size decreases. In Figure 4, the methods marked with a solid box indicate the center of the K-Medoid clustering for each station.

For the K-Medoid algorithm, the performance indices have been partitioned into 3 clusters, namely best, middle and worst. Next, the center (medoid) of the best cluster has been defined as the optimal method. As a result, for the raw signal, we found the SPLINE method to exhibit the best results in both MAE and VAR for the datasets A, B and C. Only for the smallest (and touristic) catchment (City D) does the locally-weighted polynomial method perform best. Note, however, that Spline is clustered among the best for dataset D as well.

Table 5 shows the results of clustering for the filtering methods. Accordingly, the majority of the methods are placed in cluster 1, which contains the best-performing methods. It is worth mentioning that Kalman filtering (KAF) performs worst for smoothing of the viral signals in all WWTPs. Conversely, Friedman's Super Smoother (SUP), Spline (SPL) and Savitzky-Golay Filters (SGF) outperform the other filtering methods in most of the cases. According to Table 5, SPL and SUP as parametric and nonparametric methods, respectively, are the only ones among the best filtering techniques in all four WWTPs. The two-parameter methods ARI and SGF are mostly placed in the best cluster, while nonparametric methods such as TUK and FFT appear in the best and the second-best clusters with equal frequency.

Figure 5 indicates the raw signal of viral RNA concentration measured in the four wastewater treatment plants and the smoothed signal as computed by the optimal method (center method of the best cluster). Also the result of the cross-validation is depicted as well as the 95% confidence interval to assess the uncertainty of the filtering. Note that the computed uncertainty bounds are higher for City D (small population) and lower for City A (large population). Uncertainty increases where there is a peak (positive or negative) in the original data series or where a significant difference is seen between the original signal and the smoothed one.

### Normalized signal

Applying the procedure outlined above to the SARS-CoV-2 signal normalized to NH_{4}, the general picture does not change drastically. For clustering the methods according to their performance, the K-Medoid algorithm was applied as well (see Table 6). As a result, POL, SPL, SUP, and GAM are identified as optimal smoothing methods. Although the optimal results (center of the best cluster) are different for the two signals (raw versus normalized), the methods SPL, GAM, SUP, and SGF are found to be suitable/preferable for both signals. It is also notable that KAF again consistently performs worst among the deployed methods.

Similar to above, all normalized time series are depicted by using the optimal signal filtering method based on the clustering (Figure 6). Despite the difference in scale, the shape of the smoothed data is quite similar for both normalized and raw signals. A difference is the higher variation of the cross-validation results – indicating a higher sensitivity of the selected filtering techniques when applied to the normalized signal as compared to the raw data.

### Virological analysis

In addition to the analysis of the smoothing methods, the dataset also allows us to reflect on the relation of the wastewater measurements to the infection dynamics. As a ballpark approximation we plot the relation of the viral load per capita to the 7-day incidence value (expressed as the sum of new infections over 7 days per 100,000 persons) by fitting linear models to each case study for both filtered and raw viral loads (Figure 7). While this relation is a severe simplification that disregards the different statistical properties of the data, it still allows us to reflect on the benefit of data filtering.

Using the marginal probability densities, we derive for the 50-percentile value of the viral load (appr. 100 ×10^{6} copies/cap) incidence values ranging between 100 and 600 (7 day infections/100000 persons). The interesting feature is the influence of scale: From the linear models we see both higher intercepts and lower slopes for small catchments and vice versa. The first observation indicates that small catchments have a certain threshold of viral load before a clear relation with infection dynamics is seen. The second observation (i.e., the slope of the model) points to the fact that infection clusters are (statistically) more significant in a smaller population than in a large one. While still speculative, community size could be an influential factor for WBE in the case of SARS-CoV-2.
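The intercept and slope discussed above come from an ordinary least-squares line. A minimal Python sketch of such a fit is shown below; which quantity is placed on which axis is an assumption made here, and the data points are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical pairs: 7-day incidence (per 100,000 persons) versus
# viral load per capita (10^6 copies/cap); illustration only.
incidence = [50.0, 120.0, 260.0, 410.0, 580.0]
viral_load = [30.0, 55.0, 95.0, 150.0, 205.0]
intercept, slope = fit_line(incidence, viral_load)
print(f"intercept = {intercept:.2f}, slope = {slope:.4f}")
```

Fitting the same model separately per catchment, once on raw and once on filtered loads, is what allows the intercepts and slopes to be compared across community sizes.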

Further, the analysis demonstrates the benefit of data filtering. A first empirical indication is the narrowing of the model uncertainty bounds when comparing unfiltered data (Figure 7(a)) against filtered data (Figure 7(b)). A quantitative argument for filtering is given by the increase of the Pearson correlation coefficient between the datasets once filtering is applied (Table 7). For all case studies a significant increase of the correlation coefficient is apparent, indicating the role of time series filtering in enhancing model performance.

Sites | Raw | Filtered |
---|---|---|
City A | 0.788^{b} | 0.920^{b} |
City B | 0.601^{b} | 0.684^{b} |
City C | 0.200^{a} | 0.394^{b} |
City D | 0.453^{b} | 0.515^{b} |

^{a}95% confidence.

^{b}99% confidence.
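The comparison behind the correlation table can be sketched in Python as follows: compute the Pearson coefficient between incidence and viral load once for the raw and once for a smoothed signal. The data and the simple three-point moving average used as the filter are hypothetical stand-ins, so the printed values say nothing about the case studies.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    spread_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (spread_a * spread_b)


def moving_average3(series):
    """Three-point centered moving average, truncated at the edges."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - 1): i + 2]
        out.append(sum(chunk) / len(chunk))
    return out


# Hypothetical weekly data (illustration only).
incidence = [40.0, 60.0, 90.0, 150.0, 240.0, 380.0, 520.0, 470.0]
raw_load = [35.0, 20.0, 70.0, 45.0, 130.0, 95.0, 210.0, 160.0]
filtered_load = moving_average3(raw_load)

print("r (raw)      =", round(pearson_r(incidence, raw_load), 3))
print("r (filtered) =", round(pearson_r(incidence, filtered_load), 3))
```

Whether filtering raises the coefficient depends on the noise structure of the data at hand, so the comparison is best repeated per catchment rather than assumed.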

## CONCLUSION

For the management of pandemics such as SARS-CoV-2 and the interpretation of data obtained by means of WBE, filtering of the wastewater titer signal is an important pre-processing step. Modeling the infection dynamics or developing predictive tools may therefore yield misleading results when based on noisy information. This is especially important when models are not robust enough against extreme or oscillating inputs. This study focused on the application of 13 well-established signal filtering techniques for smoothing SARS-CoV-2 datasets from four wastewater treatment plants across Austria. Based on the findings of this study, the following conclusions are made:

Spline, GAM, and Friedman's Super Smoother are recognized as superior methods in this context. In three wastewater treatment plants Spline was found to be a robust approach to cope with missing data and uncertainties.

Although GAM is a robust smoother against extremes and outliers, it requires a high number of parameters to be tuned and has a tendency to over-smooth signals. The latter also applies to the Friedman's Super Smoother technique.

Among the nonparametric methods, TUK and FFT performed generally well and are suitable algorithms. However, nonparametric methods are sensitive to missing values and are thus only recommended for time series with a small number of missing values.

Despite acceptable error values for methods such as KAF, SMA, and POL, they are not suitable in this context as they generally overfit.

A first analysis of the dataset indicates that community size has an influence on WBE for SARS-CoV-2. For smaller catchments both a threshold of viral load is apparent before any relation with infection dynamics is visible and also a higher sensitivity towards infection clusters.

For the application of linear regression models for incidence prediction, filtering results in a consistently improved Pearson correlation coefficient, i.e., model performance.

Apart from the applicability of GAM, Spline and POL (known as LOWESS) in filtering of the wastewater SARS-CoV-2 signals, the conclusions made herein are data-driven, and applying them to similar cases must be done with discretion.

## ACKNOWLEDGEMENTS

This study was funded by the Austrian Federal Ministry of Education, Science and Research, the Austrian Federal Ministry of Agriculture, Regions and Tourism, the federal states Burgenland, Carinthia, Lower Austria, Salzburg, Styria, Tirol, Upper Austria and Vorarlberg and the Austrian Association of Cities and Towns. We would like to thank the staff of the treatment plants involved for their support and the use of the wastewater titer measurement data. The support of Günther Weichlinger and Christoph Scheffknecht is also appreciated.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.