Abstract
Monitoring SARS-CoV-2 spread is challenging due to asymptomatic infections, numerous variants, and population behavior changes from non-pharmaceutical interventions. We developed a Digital Twin model to simulate SARS-CoV-2 evolution in Catalonia. Continuous validation ensures our model's accuracy. Our system uses Catalonia Health Service data to quantify cases, hospitalizations, and healthcare impact. These data may be under-reported due to screening policy changes. To improve our model's reliability, we incorporate data from the Catalan Surveillance Network of SARS-CoV-2 in Sewage (SARSAIGUA). This paper shows how we use sewage data in the Digital Twin validation process to identify discrepancies between model predictions and real-time data. This continuous validation approach enables us to generate long-term forecasts, gain insights into SARS-CoV-2 spread, reassess assumptions, and enhance our understanding of the pandemic's behavior in Catalonia.
HIGHLIGHTS
The paper presents a Digital Twin model to analyze the evolution of SARS-CoV-2.
It shows how sewage data are used to validate the model's predictions.
It outlines how the continuous validation allows long-term forecasts.
The model can account for the asymptomatic nature of the infections, variants, and changes in population behavior due to NPIs.
The method enhances the reliability and accuracy of the model's forecasts.
INTRODUCTION
Accurate detection of the true infection rate of SARS-CoV-2 is of utmost importance for health authorities as it enables them to forecast current trends, anticipate the impact on the healthcare system, and understand the long-term effects on the population. However, under-reporting of COVID-19 cases, primarily attributed to a significant number of asymptomatic individuals, has emerged as a major challenge (Nishiura et al. 2020; Ma et al. 2021). While asymptomatic individuals may have lower infectivity compared to symptomatic individuals (Sayampanathan et al. 2020), it is essential to precisely calculate the true infection rate not only for assessing the spread but also for estimating the number of patients affected by secondary effects (Carfì et al. 2020; Lindan et al. 2021; Nordvig et al. 2021; Taquet et al. 2021; COVID-19 takes serious toll on heart health – a full year after recovery). To address this need, we developed in the past a Digital Twin (DT) model for monitoring the spread of SARS-CoV-2 in Catalonia (Fonseca i Casas et al. 2023). A DT is a virtual representation of a real-world entity or system. It mirrors an object, process, organization, person, or other abstraction and spans the lifecycle of its physical counterpart. It is updated from real-time data, and uses simulation, machine learning, and reasoning to aid in decision-making. Therefore, the DT serves as a virtual model that simulates infection dynamics based on diverse data sources and underlying assumptions. To simulate the dynamics of the pandemic, and considering the DT structure we follow, we use different sources of information. These sources include the detected infected population, the hospitalizations, the deaths and the virus load in the wastewater, which are indicators of the spread and severity of the disease. The simulation traces are composed of the number of detected cases, which depend on the testing capacity and strategy, and the real cases, which are estimated by the model based on the infection rate and the incubation period and validated with the detected cases and the wastewater information.
The objectives of this model are twofold: (i) to forecast infection trends across different regions and scenarios and (ii) to validate model assumptions by comparing them with actual outcomes and adjusting them accordingly. By doing so, we aim to understand the factors influencing disease spread and evaluate the effectiveness of various interventions.
To estimate the true infection rate of SARS-CoV-2 accurately, various methods have been developed to analyze wastewater data (Phipps et al. 2020; Wu et al. 2020; Rippinger et al. 2021). These methods aim to infer infection cases from the collected wastewater data. Consequently, any changes in the wastewater information can directly impact the forecasts generated by the model, making the forecast highly dependent on the quality and availability of these data. Detecting potential future turning points becomes challenging in such scenarios. To address this, it is crucial to utilize multiple data sources to feed our models and, more importantly, continuously validate the forecasts to identify discrepancies or errors in the underlying assumptions. Our assumption is that SARS-CoV-2 sewage loads serve as a valuable source of information for validating epidemic models, particularly as clinical testing and reporting decrease. Furthermore, by leveraging the model's forecasts, sewage data analysis becomes an excellent tool for understanding the nature of the newly acquired information and analyzing the assumptions that depict the effects of non-pharmaceutical interventions (NPIs) applied to the population.
Several studies have explored the use of SARS-CoV-2 loads in sewage as a means to estimate COVID-19 cases in a population, as referenced in (McMahan et al. 2021; Omori et al. 2021). In Catalonia, two specific studies (Ayala-Aldana et al. 2022; Joseph-Duran et al. 2022) have employed this approach to monitor the regional pandemic situation and provide insights into virus transmission dynamics. In contrast to previous studies that used wastewater data to infer actual COVID-19 cases, our approach employs wastewater data to validate a DT that integrates multiple data sources, enhancing the reliability of its forecasts. This paper focuses on describing the validation process using wastewater data to detect forecast inaccuracies in the models used within the Digital Twin, prompting recalibration for generating new long-term forecasts. However, it is important to note that wastewater data are not the sole source of validation. Information from the Health Service regarding confirmed cases detected through testing is also utilized. Based on our methodology, we have implemented a system that provides warnings to alert specialists when adjustments to the models are needed to accurately reflect the changing situation of the pandemic. Our main objective is to predict the true number of infection cases, which will help us understand the influence of NPIs. Ultimately, we aim to estimate the number of individuals suffering from secondary effects of COVID-19, like Long COVID.
MATERIALS AND METHODS
Understanding the propagation of SARS-CoV-2 is a complex task due to various factors. Firstly, different variants with varying transmission rates (parameter β as used in the epidemiological models) continue to emerge. Secondly, the prevalence of asymptomatic individuals is significant and varies depending on the variant, with a higher number of asymptomatic cases observed with the Omicron variant. Finally, citizen behavior changes, such as wearing face masks and adhering to physical distancing recommendations, also affect the transmission rate (β) (Chu et al. 2020). To tackle these challenges, we employed a DT approach, which combines a set of epidemiological models with a continuous validation and verification process.
The DT comprises three primary models that evolve throughout the pandemic. The first model is a System Dynamics model, conceptualized using Forrester diagrams (Forrester 1988) which represents a SEIRD model. This initial model allows us to test the initial assumptions through its implementation in InsightMaker software (Fortmann-Roe 2014). The second model is an optimization model developed in Python. Its purpose is to detect changing points and estimate the transmission values for different regimes of the infection curve, reflecting changes in NPIs implemented in the population. The third model is represented using the Specification and Description Language (SDL) (ITU-T 2019) and (Sherratt et al. 2015), a graphical, standardized, and unambiguous language from informatics area, but commonly used to describe environmental and social simulation models (Fonseca i Casas et al. 2010; Fonseca i Casas 2014; Olmos et al. 2014; Fonseca i Casas 2019). This SDL model extends the initial SEIRD model by incorporating different regimes based on NPI variations. Additionally, it employs cellular automata to represent different Health Regions, which are administrative divisions in Catalonia. Synthetically, a cellular automaton (CA) is a discrete model of computation that consists of a regular grid of cells, each in one of a finite number of states, such as on and off. The grid can be in any finite number of dimensions. For each cell, a set of cells called its neighborhood is defined relative to the specified cell. An initial state (time t = 0) is selected by assigning a state for each cell. A new generation is created (advancing t), according to some rule, or rules, that determines the new state of each cell in terms of the current state of the cell and the states of the cells in its neighborhood (Sirakoulis et al. 2014). While all three models contribute to the validation process by comparing their outcomes, the SDL model, which provides the most detailed representation, is specifically used for generating forecasts. Thus, in this paper, when referring to the model or the Digital Twin, we are referring to the SDL model.
This approach requires accurate near real-time data to validate the models. The detected cases data used in this study were obtained from the Public Health Agency of Catalonia (ASPCAT). Additionally, data from the Catalan Surveillance Network of SARS-CoV-2 in Sewage (SARSAIGUA) (Guerrero-Latorre et al. 2022) were utilized for the continuous validation of a DT representing the pandemic evolution in Catalonia. As we mention, the DT is based on an extended SEIRD model that can forecast the true number of COVID-19-infected cases in Catalonia (Fonseca i Casas et al. 2023), and the overall continuous life cycle we follow for the DT is detailed in Fonseca i Casas et al. (2021, 2023). Our focus is not on estimating the number of cases based on SARS-CoV-2 viral loads in sewage. Instead, we aim to establish a continuous validation process to ensure the accuracy of predictions provided by the SEIRD model simulations. We specifically aim to detect discrepancies between the trends observed in sewage SARS-CoV-2 data and the simulated trends of true cases in the model. This validation process allows for the generation of long-term forecasts (several months) that can be combined with other data sources to produce more precise and reliable predictions. As we mention. our primary goal is to forecast the true number of infection cases in order to understand the impact of NPIs and eventually forecast the number of individuals experiencing secondary effects of COVID-19, such as Long COVID (Cooper et al. 2022; Mehandru & Merad 2022; Munblit et al. 2022; Davis et al. 2023; Landry et al. 2023; Mizrahi et al. 2023).
Structural and data assumption
Model parameters are implemented to account for vaccination rates, SARS-CoV-2 variants, and reinfections. It is important to note that these parameter values change over time due to the inherent evolution of the pandemic and modifications in the non-pharmaceutical interventions (NPIs) applied to the population, as well as the distribution of vaccines.
Starting from the previous study (Fonseca i Casas et al. 2023) we need to calculate the new percentage of under-reporting due to the changes in testing protocols in Catalonia. Initially, from the beginning of the pandemic until March 28, the under-reporting rate for Catalonia was estimated to be around 60% based on a prevalence study published on several rounds (Pollán et al. 2020; Ministerio de Sanidad 2021a, 2021b, 2021c, 2021d). This value was derived from an assumed asymptomatic infection rate of approximately 40% (Syangtan et al. 2020; Ma et al. 2021), and around 30% for the Omicron variant (Fonseca i Casas et al. 2023). However, since March 28, the under-reporting rate increased due to a reduction in the number of tests conducted to assess the spread of infection across the population. To estimate the under-reporting rate, we utilized information from ASPCAT and made two assumptions.
Firstly, we assumed that the percentage of people infected by age range remains similar in the future (post-March 28, 2022) as it was in the past (during previous Omicron waves). To work with this assumption, we calculated the detection rate percentage for each age range until March 28. Then, we compared this percentage with the new percentage for each age range before March 28, assuming that the differences in detection rates reflect under-reporting. Specifically, we started the calculation using data from the Omicron wave, which showed a detection rate of approximately 30% (as mentioned in the previous study (Fonseca i Casas et al. 2023). However, after March 28, the detection measures changed, resulting in only severe symptomatic cases being detected. Therefore, as a second assumption, we considered that the detection rate for people over 70 years old remains unchanged. This assumption is valid since the testing approach for older people remained consistent, and the symptomatology is typically more severe in this age group, simplifying the detection process. It is important to note that since severe cases are more common in older people, the relative weight of cases in this age group increases, leading to changes in the detection rate by age range, as depicted in Figure 1. By applying these calculations, we can estimate the actual cases for the population older than 70 years old, who are more likely to be tested and reported. Then, we use the proportion of different age groups in the population to determine the number of cases that should be added to the other age groups, representing the actual cases that may be asymptomatic or untested.
Table 1 presents the calculation of under-reporting for different age ranges. The columns PW1 and PW2 represent the respective weight of detected cases for each age range. Taking into account the aforementioned assumptions, PW1/PW2 indicates the variation in the detection of each age range due to changes in monitoring methods for disease spread. Using these values, we can calculate the expected detected cases if monitoring continues for each age range, with the age range of 70+ considered invariant. It is important to note that these expected detected cases (approximately 732,890) represent the cases that would be detected from March 28 onwards, but they do not reflect the actual number of cases since the detected cases only account for about 30% of the total. To calculate the new under-reporting rate, we compare the detected cases (476,141) to the expected detected cases and determine that it represents approximately 65% of the expected value. Considering that the expected detected cases correspond to 30% of the actual cases and the detected cases now represent 65% (35% of under-reporting), we can infer that the detection rate from March 28 onwards is approximately 19%.
. | . | From 1 November 2021 to 28 March 2022 . | From 28 March (as previous) . | . | From 28 March . | ||||
---|---|---|---|---|---|---|---|---|---|
Age . | Population rate (1) . | Cases (2) . | PW1(3) . | Cases (2) . | PW2(3) . | PW2/PW1 . | % under-reporting (4) . | Expected detected cases(5) . | % under-reporting (6) . |
0–4 | 4.1 | 57,949.0 | 4.0 | 9,742.0 | 2.05 | 0.50 | 87.4 | 18,250.1 | 46.6 |
5–14 | 10.5 | 196,806.0 | 13.3 | 16,146.0 | 3.39 | 0.24 | 93.2 | 31,294.7 | 48.4 |
14–44 | 37.1 | 657,482.0 | 46.3 | 145,466.0 | 30.55 | 0.66 | 83.4 | 266,700.3 | 45.5 |
45–59 | 23.6 | 319,222.0 | 22.9 | 113,932.0 | 23.93 | 1.06 | 73.3 | 197,248.7 | 42.2 |
60–69 | 11.7 | 87,630.0 | 6.0 | 65,960.0 | 13.85 | 2.23 | 43.3 | 94,538.9 | 30.2 |
70–79 | 8.2 | 52,102.0 | 3.9 | 69,268.0 | 14.55 | 3.95 | 0.0 (1) | 69,200.9 | −0.1 |
80 | 5.8 | 41,905.0 | 2.7 | 55,657.0 | 11.69 | 3.94 | 0.0 (1) | 55,657.0 | 0.0 |
Total | 100% | 1.413.096.0 | 100.0 | 476,171.0 | 100.00 | 732,890.5 | 35.0 |
. | . | From 1 November 2021 to 28 March 2022 . | From 28 March (as previous) . | . | From 28 March . | ||||
---|---|---|---|---|---|---|---|---|---|
Age . | Population rate (1) . | Cases (2) . | PW1(3) . | Cases (2) . | PW2(3) . | PW2/PW1 . | % under-reporting (4) . | Expected detected cases(5) . | % under-reporting (6) . |
0–4 | 4.1 | 57,949.0 | 4.0 | 9,742.0 | 2.05 | 0.50 | 87.4 | 18,250.1 | 46.6 |
5–14 | 10.5 | 196,806.0 | 13.3 | 16,146.0 | 3.39 | 0.24 | 93.2 | 31,294.7 | 48.4 |
14–44 | 37.1 | 657,482.0 | 46.3 | 145,466.0 | 30.55 | 0.66 | 83.4 | 266,700.3 | 45.5 |
45–59 | 23.6 | 319,222.0 | 22.9 | 113,932.0 | 23.93 | 1.06 | 73.3 | 197,248.7 | 42.2 |
60–69 | 11.7 | 87,630.0 | 6.0 | 65,960.0 | 13.85 | 2.23 | 43.3 | 94,538.9 | 30.2 |
70–79 | 8.2 | 52,102.0 | 3.9 | 69,268.0 | 14.55 | 3.95 | 0.0 (1) | 69,200.9 | −0.1 |
80 | 5.8 | 41,905.0 | 2.7 | 55,657.0 | 11.69 | 3.94 | 0.0 (1) | 55,657.0 | 0.0 |
Total | 100% | 1.413.096.0 | 100.0 | 476,171.0 | 100.00 | 732,890.5 | 35.0 |
(1) Data from https://www.idescat.cat/pub/?id=ep; (2) data from https://sivic.salut.gencat.cat/covid; (3) the percent of cases PW1 and PW2, detected by age range, changes because only the most serious cases are detected by the health system from 28 of March. We calculate the (4) and (5) the expected detected cases. Finally, we can calculate the increase on the under-reporting (6) in each age range. Then we can calculate an under-reporting increase which is about 35%.
Another factor considered in the model is reinfections caused by new variants. At the time of writing, we have been affected by several lineages, including the Original, Alfa, Delta, and Omicron lineages, as well as the BA.5, BA.4, BA2.75 lineages, and XBB.1. In our model, we account for the possibility of reinfections by assuming that a certain percentage of individuals who have recovered from the virus can become susceptible again. We continuously validate our model using wastewater data and detected cases reported by the Health Service to adjust this percentage accordingly. Reinfections have an impact on the intensity of infection waves, and the validation process allows us to estimate the percentage of people who become infected by different variants of the virus. Table 2 provides a summary of the reinfection percentages that we incorporate into the model to capture the effects of various variants.
Variant . | Percent of reinfection . | Date . |
---|---|---|
Omicron | 100% | 22/11/2021 |
BA.5 | 21.46% | 1/05/2022–19/06/2022 |
BA.4 | 20% (increase 5% per week) | 29/8/2022–19/09/2022 |
BA2.75 | 10.8% (increase 0.9% per week) | 29/08/2022–14/11/2022 |
XBB.1 | 3.6% | 15/01/2023 |
Variant . | Percent of reinfection . | Date . |
---|---|---|
Omicron | 100% | 22/11/2021 |
BA.5 | 21.46% | 1/05/2022–19/06/2022 |
BA.4 | 20% (increase 5% per week) | 29/8/2022–19/09/2022 |
BA2.75 | 10.8% (increase 0.9% per week) | 29/08/2022–14/11/2022 |
XBB.1 | 3.6% | 15/01/2023 |
It is important to note that the emergence of variants does not occur abruptly; rather, their presence in the population increases gradually as the variant spreads. In our model, we capture this increase by defining that reinfection does not happen suddenly, but rather occurs gradually on a weekly basis. Each week, we generate an event (SDL SIGNAL) that represents the percentage of individuals who have recovered and become susceptible again. For the latest variant (XBB.1), the percentage is so small that we use a single event to represent reinfection. This approach allows us to simulate the progressive impact of variants on the population over time.
Digital shadow, wastewater treatment plant data
The Catalan Surveillance Network of SARS-CoV-2 in Sewage (SARSAIGUA) was established through the collaboration and funding of the Catalan Water Agency (ACA) and the Public Health Agency of Catalonia (ASPCAT). This network commenced its monitoring activities in July 2020 and encompassed 56 wastewater treatment plants (WWTPs) across Catalonia, covering approximately 80% of the total population of 7.5 million inhabitants (7). The initiation of sample collection and analysis took place on the 6 of July 2020, around four months after the first clinical case of COVID-19 in Catalonia was detected on the 25 of February 2020. The selected 56 WWTPs were evenly distributed throughout the region, with at least one WWTP per county, ensuring comprehensive coverage.
Sampling frequency was determined as one sample per week for 36 of the 56 selected WWTPs, while the remaining 18 were sampled biweekly, resulting in a total of 45 samples collected and analyzed per week. Notably, certain WWTPs were only monitored during the summer season to enhance surveillance in municipalities with high tourism, such as Castell-Platja d'Aro and Vilaseca-Salou. Most WWTPs collected flow-based composite samples at the entrance, which were then refrigerated at 4 °C. The 45 weekly samples were distributed among three reference laboratories with expertise in molecular diagnosis and environmental virology. Each laboratory analyzed 15 samples per week to quantify the abundance of SARS-CoV-2 genomes using optimized protocols. The analysis employed reverse transcription quantitative polymerase chain reaction (RT-qPCR) targeting a common genetic marker (N1) as well as two complementary targets, N2 and IP4 (CDC 2020; Pasteur Institute 2020). Over the course of 20 months of monitoring (from July 2020 to February 2023), SARSAIGUA analyzed approximately 3,600 samples.
For the purpose of this study, a subset of 18 WWTPs was selected to demonstrate the integration of sewage data, specifically the N1 target concentrations, into an epidemiological model that forms part of the DT representing the evolution of the pandemic in Catalonia. The selection of WWTPs was based on the following criteria: (i) inclusion of at least one WWTP per sanitary region (Catalonia has seven regions), (ii) inclusion of WWTPs with weekly monitoring frequency, and (iii) coverage of a wide range of connected populations, ranging from 2,935 to 1.4 million inhabitants. The variables obtained from the sewage data are continuous, and no significant outliers were detected in the dataset.
The data used in this study were sourced directly from the SARSAIGUA public repository (Corominas et al. 2022) which provides regular updates every Friday. The repository can be accessed through GitHub at the following URL: https://github.com/icra/sars-aigues/. It contains comprehensive data on the N1 gene copies for all collected samples, along with information on influent flows from the WWTPs. Researchers were able to retrieve the necessary data from this repository for analysis and integration into the study's methodology. The file that contains the genetic information is release.csv (https://github.com/icra/sars-aigues/blob/main/release.csv).
Code . | WWTP . | County . | Province . | Sanitary regions . | Population . | Correlation (Rho) . |
---|---|---|---|---|---|---|
DPUI | PUIGCERDÀ | Cerdanya | Girona | Alt Pirineu i Aran | 14.753 | 0.752 |
DBSS | BESÒS | Barcelonès | Barcelona | Barcelona | 1.444.884 | 0.534 |
DGVC | GAVÀ/VILADECANS | Baix Llobregat | Barcelona | Barcelona | 172.208 | 0.666 |
DPDL | PRAT DE LLOBREGAT, EL | Baix Llobregat | Barcelona | Barcelona | 1.092.573 | 0.753 |
DFAL | FALSET | Priorat | Tarragona | Camp de Tarragona | 2.935 | 0.658 |
DRUS | REUS | Baix Camp | Tarragona | Camp de Tarragona | 110.574 | 0.751 |
DTAR | TARRAGONA | Tarragonès | Tarragona | Camp de Tarragona | 98.311 | 0.678 |
DIGU | IGUALADA | Anoia | Barcelona | Catalunya Central | 61.375 | 0.579 |
DMAS | MANRESA | Bages | Barcelona | Catalunya Central | 86.721 | 0.543 |
DSOL | SOLSONA | Solsonès | Lleida | Catalunya Central | 10.600 | 0.680 |
DBAY | BANYOLES | Pla de l'Estany | Girona | Girona | 28.464 | 0.744 |
DGIR | GIRONA | Gironès | Girona | Girona | 145.373 | 0.674 |
DPAM | PALAMÓS | Baix Empordà | Girona | Girona | 51.673 | 0.761 |
DBAL | BALAGUER | Noguera | Lleida | Lleida | 15.891 | 0.554 |
DLLE | LLEIDA | Segrià | Lleida | Lleida | 117.673 | 0.510 |
DMOF | MONTFERRER | Alt Urgell | Lleida | Lleida | 12.693 | 0.782 |
DAMP | AMPOSTA | Montsià | Tarragona | Terres de l'Ebre | 21.083 | 0.748 |
DTOT | TORTOSA-ROQUETES | Baix Ebre | Tarragona | Terres de l'Ebre | 39.332 | 0.640 |
Code . | WWTP . | County . | Province . | Sanitary regions . | Population . | Correlation (Rho) . |
---|---|---|---|---|---|---|
DPUI | PUIGCERDÀ | Cerdanya | Girona | Alt Pirineu i Aran | 14.753 | 0.752 |
DBSS | BESÒS | Barcelonès | Barcelona | Barcelona | 1.444.884 | 0.534 |
DGVC | GAVÀ/VILADECANS | Baix Llobregat | Barcelona | Barcelona | 172.208 | 0.666 |
DPDL | PRAT DE LLOBREGAT, EL | Baix Llobregat | Barcelona | Barcelona | 1.092.573 | 0.753 |
DFAL | FALSET | Priorat | Tarragona | Camp de Tarragona | 2.935 | 0.658 |
DRUS | REUS | Baix Camp | Tarragona | Camp de Tarragona | 110.574 | 0.751 |
DTAR | TARRAGONA | Tarragonès | Tarragona | Camp de Tarragona | 98.311 | 0.678 |
DIGU | IGUALADA | Anoia | Barcelona | Catalunya Central | 61.375 | 0.579 |
DMAS | MANRESA | Bages | Barcelona | Catalunya Central | 86.721 | 0.543 |
DSOL | SOLSONA | Solsonès | Lleida | Catalunya Central | 10.600 | 0.680 |
DBAY | BANYOLES | Pla de l'Estany | Girona | Girona | 28.464 | 0.744 |
DGIR | GIRONA | Gironès | Girona | Girona | 145.373 | 0.674 |
DPAM | PALAMÓS | Baix Empordà | Girona | Girona | 51.673 | 0.761 |
DBAL | BALAGUER | Noguera | Lleida | Lleida | 15.891 | 0.554 |
DLLE | LLEIDA | Segrià | Lleida | Lleida | 117.673 | 0.510 |
DMOF | MONTFERRER | Alt Urgell | Lleida | Lleida | 12.693 | 0.782 |
DAMP | AMPOSTA | Montsià | Tarragona | Terres de l'Ebre | 21.083 | 0.748 |
DTOT | TORTOSA-ROQUETES | Baix Ebre | Tarragona | Terres de l'Ebre | 39.332 | 0.640 |
Validation, time series analysis
In this study, a time series analysis was conducted as part of the validation process to assess the model's ability to capture the dynamics of the epidemic. The analysis aimed to compare the simulation traces (synthetic data) with the Digital Shadows (sewage data) and determine how well the model aligns with the observed trends over time. To reduce the impact of noise and focus on detecting significant changes, a logarithmic transformation was applied to the time series of simulated true cases and N1 concentrations. The validation process focused on a three-month window, which was deemed appropriate for evaluating the model's performance and forecasting capabilities over a few months. The first step of the time series analysis involved examining potential delays between the two time series. Cross-correlation analysis, utilizing the ‘ccf’ function in R, was employed to detect any lag between the model outcomes and sewage outcomes. After analyzing various lags, it was found that the model performed well with a lag of zero, indicating that the model predicts the number of true cases without delay compared to sewage data. This lag information was then utilized to validate the key performance indicators (KPIs) obtained from the sewage data. Considering that viral shedding can occur before symptoms appear and continue for up to 10 days after symptoms start (Lavania et al. 2022), it is reasonable to expect that the dynamics of true cases and sewage data would peak simultaneously. However, as shedding may persist beyond 10 days, the decrease in true cases after the peak is expected to be more rapid than in sewage data.
To analyze the correlation between true cases and sewage N1 data from each WWTP, Spearman correlation analysis was applied. Spearman correlation is a non-parametric test that measures the strength and direction of association between two ranked variables and is not strongly influenced by differences in variable scales. Finally, to obtain a single value indicating the validity of the model, an aggregated value was computed by weighting all WWTP correlation values based on the served population. As a convention, if this aggregated value exceeds 0.5, the model is considered to be valid. The statistical analysis, including the Spearman test and computation of aggregated values, was performed using R, utilizing the ‘cor.test’ function for the Spearman test.
In relation to the other Digital Shadows, it is important to note that even as the frequency of clinical testing decreases over time, these data still hold relevance for model validation. The validation process does not solely rely on a single dataset but instead encompasses a combination of factors, including the wastewater system, detected cases, hospitalization rates, and death rates. However, the N1 concentration observed in wastewater remains a particularly valuable and informative source of data for generating the Digital Shadows necessary to validate the model. This concentration serves as a rich indicator of the viral presence in the population and allows us to assess the consistency and accuracy of the model's predictions. By incorporating multiple data sources, including wastewater analysis, we can enhance the robustness and reliability of the model validation process because we can test several assumptions that rely on several data sources. This multi-dimensional approach provides a comprehensive understanding of the pandemic dynamics and strengthens our ability to make informed decisions based on the DT and its corresponding shadows.
RESULTS
The DT model as illustrated in Figure 4, accurately describes the evolution of the different waves of the pandemic in Catalonia. The SDL simulation model produces three curves that illustrate the progression of cases (refer to Figure 4). The first curve represents the detected cases, which should align with the Digital Shadows provided by health authorities. The second curve represents the real cases, accounting for under-reporting, including asymptomatic cases prevalent in the population. The third curve shows reinfections occurring in the population, considering factors such as decreased protection and the emergence of new variants. More details on this model, and how it manages reinfections or other elements (like vaccination status, variants, etc.) can be found in reference (Fonseca i Casas et al. 2023).
The true cases significantly outnumber the detected cases. The detected cases from the model closely match the observed data (depicted in blue), with a coefficient of correlation above 0.5. The red line represents reinfections resulting from waning immunity and the presence of new variants. The peaks of the reinfections and true cases curves occur earlier than those of the detected cases curve, as it takes more time for detection to increase due to the need for symptom development and testing.
The website accompanying the model (https://pand.sdlps.com/) displays a time series of infection dynamics, including key indicators such as the number of real cases, reinfections, and detected cases. The website provides a user-friendly and clear presentation of this information. It is regularly updated with new data from various sources (Digital Shadows), offering valuable insights into the pandemic's evolution and the accuracy of the model's forecasts. To ensure ongoing accuracy, the website also features panels related to the continuous validation process. The validation process is crucial as it allows for the identification and correction of any errors or deviations in the model assumptions. Establishing an infrastructure that facilitates this process and enables prompt error detection and resolution is essential. So, we use wastewater data as a proxy for true infection rates. The next section shows the validation results with these data.
Continuous validation process
One of the main outcomes presented in this paper is the continuous validation process developed to monitor and verify the accuracy of the DT using wastewater data. This process is presented on a panel called WWTP presented on the website for easy validation, which provides key information obtained from the WWTPs. The panel displays the viral load measured in the wastewater and compares it with the model-estimated true cases. This allows for the evaluation of the consistency and reliability of the model's predictions and facilitates the detection of any discrepancies or anomalies in the infection dynamics. On the WWTP panel, the time series for the last 3 months is displayed for visual inspection to assess the correlation between the two time series. The choreograph is also shown to identify any potential lag between the time series. This enables easy updating of the DT as needed.
If the aggregated value calculated from all the individual WWTP values considering the population, goes below 0.5 in the validation panel or if a visual inspection suggests that some WWTPs are not accurately represented, a revision is triggered to review the model assumptions. This revision can be conducted by analyzing the graphical representation of the model or examining the configuration files, if no modifications to the model structure are necessary (which is typically the case). This revision process allows us to infer the reasons behind the model's failure and enhance our understanding of how the system, in this case, how spread of SARS-CoV-2 in Catalonia, operates. It provides valuable insights into the actions that need to be taken to control the spread of the pandemic.
Results from the continuous validation
As part of the continuous validation process, we have encountered some unusual outliers that require further investigation. These outliers represent cases where the viral load in wastewater and the true cases predicted by the model do not align, suggesting a possible error in the model assumptions or a change in the infection dynamics. It is essential to closely examine these outliers, understand their causes, and assess their implications for our model and forecasts. The annexes provide an example of this process, offering valuable insights into the overall behavior of the system.
DISCUSSION
Our study underscores the efficacy of a DT that integrates wastewater data and detected cases as Digital Shadows to generate long-term pandemic forecasts (more than two months). The forecast's reliability hinges on successful validation, ensuring alignment with the Digital Shadows. If discrepancies arise, (considering at this time only regional and no local discrepancies) we must revise the model assumptions to realign the generated data with the Digital Shadows. This methodology enables us to continually monitor and adjust our predictions accuracy. At the time of writing this paper, our forecast has remained valid for over three months without significant deviations from the Digital Shadows.
A key finding of our study is the synchronicity between the true cases estimated by the model and the viral load measured in wastewater. This synchronicity suggests a delay of less of one day between individual infections and virus detection in the sewage system. Thus, wastewater analysis provides a timely and accurate indicator of a given area's pandemic situation. The validation panel for 2022-12-27 (SDL-PAND 2022) displays the correlograms for different WWTPs, illustrating the close alignment between the viral load time series and the true cases predicted by the model. This lack of lag is attributed to our model's use of real cases rather than detected cases, which may be subject to reporting delays or testing errors. This real-time insight is invaluable for decision-making and enables the prompt detection of any changes or anomalies in the system.
The employed methodology aligns well with our analysis scope, focusing on the COVID-19 pandemic's evolution and maintaining a DT that accurately represents the pandemic's key indicators, including real cases, detected cases, and reinfections. The Spearman statistical analysis provides a clear view of each WWTP, with values ranging from 0 (no correlation) to 1 or −1 (perfect correlation). For analysis purposes, we set the threshold at 0.5, indicating significant differences between a specific WWTP's time series and suggesting validation failure.
The methodology outlined in our study holds potential for application across various regions, extending its utility beyond the current geographical scope. The first step in this process involves defining a model that accurately represents the real cases of infection within the target region. This model should be tailored to the specific characteristics of the region, taking into account factors such as population density, population behavior, healthcare infrastructure, and local pandemic response measures. Once a suitable model has been established, the next step is to obtain data from the WWTPs within the region. These data form the basis for the validation process, which ensures that the model's predictions align with the actual situation on the ground. It's important to note that the quality and reliability of the WWTP data can significantly impact the accuracy of the model's validation process. Therefore, establishing robust data collection and processing protocols is crucial. Upon successful validation, the model can then be used to monitor the evolution of the pandemic within the selected area. This involves regularly updating the model with the latest WWTP data and adjusting the model parameters as necessary to reflect changing conditions. In this way, the model serves as a dynamic tool for tracking the pandemic's progression, providing valuable insights that can inform public health decision-making.
Future research can be pursued along two primary avenues. The first involves determining the minimum population size that a WWTP must serve to effectively monitor the evolution of a pandemic or to function as an early warning system. This analysis would entail a comprehensive study of various WWTPs serving different population sizes, examining the accuracy and timeliness of data derived from each. The goal would be to establish a population threshold below which the effectiveness of a WWTP as a monitoring tool significantly diminishes. The second line of inquiry involves the integration of additional data sources into the validation process. For instance, incorporating data on critical hospitalizations could provide a more nuanced understanding of the pandemic's severity at any given time. Similarly, data on secondary effects observed in the population, such as long-term health impacts or societal changes, could offer insights into the broader implications of the pandemic. These additional data sources could also enhance the model's predictive capabilities. By correlating these data with the real cases inferred from wastewater analysis, we could potentially reconstruct the historical trajectory of the pandemic. This retrospective analysis would not only validate the model's predictions but also contribute to a more comprehensive understanding of the long-term effects of the pandemic. These proposed lines of research underscore the potential of our approach to evolve and adapt in response to new data and insights, ultimately enhancing its utility as a tool for pandemic monitoring and management.
CONCLUSIONS
We demonstrated the efficacy of a DT that integrates wastewater data and detected cases as Digital Shadows to generate long-term COVID-19 pandemic forecasts.
We considered the potential impact of reinfections caused by the XBB.1 variant (Arora et al. 2023) and assumed that the reinfection rate would not exceed that observed for the BA.5 or BQ.1 variants. Based on this assumption, we projected Catalonia's future pandemic course, which indicated a stable trend with minor fluctuations, suggesting a stabilization of the pandemic situation.
Our analysis reveals that the COVID-19 case detection rate dropped to 19% from March 28, 2022 onwards, as the health system database only recorded cases with severe symptoms. Before that date, the detection rate was around 30% for Omicron waves and 40% for previous variants. From our analysis, we inferred that approximately 90% of Catalonia's population had been infected or reinfected with SARS-CoV-2. While this value seems high, it aligns with a study conducted in the USA in February 2022 (Clarke et al. 2022), nearly a year earlier. This finding is understandable considering Catalonia's dense population.
We acknowledge that both the DT and the Shadows, which indicate the N1 concentration in wastewater, perform better for larger populations. The effectiveness of this approach diminishes for smaller populations, an area we intend to investigate further.
Finally, we emphasize that the work presented in this paper is not merely an academic proposal but has been implemented as a product accessible through the website https://pand.sdlps.com. A permanent capture of the published website for January 2, 2023, is available (Fonseca i Casas & Garcia Subirana 2023).
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.