The recent SARS-CoV-2 pandemic has sparked the adoption of wastewater-based epidemiology (WBE) as a low-cost way to monitor the health of populations. In parallel, the pandemic has encouraged researchers to openly share their data to serve the public better and accelerate science. However, environmental surveillance data are highly dependent on context and are difficult to interpret meaningfully across sites. This paper presents the second iteration of the Public Health Environmental Surveillance Open Data Model (PHES-ODM), an open-source dictionary and set of data tools to enhance the interoperability of environmental surveillance data and enable the storage of contextual (meta)data. The data model describes how to store environmental surveillance program data, metadata about measurements taken on various specimens (water, air, surfaces, sites, populations) and data about measurement protocols. The model provides software tools that support the collection and use of PHES-ODM formatted data, including performing PCR calculations and data validation, recording data into input templates, generating wide tables for analysis, and producing SQL database definitions. Fully open-source and already adopted by institutions in Canada, the European Union, and other countries, the PHES-ODM provides a path forward for creating robust, interoperable, open datasets for environmental public health surveillance for SARS-CoV-2 and beyond.

  • The PHES-ODM supports the collection, storage, and use of environmental public health data.

  • The PHES-ODM defines a standardized dictionary to describe environmental surveillance programs.

  • The PHES-ODM provides metadata attributes to facilitate the interpretation of various data, including sampling, protocols, and geolocation.

  • The PHES-ODM supports open-science principles, allowing its use as a platform for additional software to facilitate data entry, validation and sharing.

The SARS-CoV-2 pandemic has created a need for simple and cost-effective ways to monitor the health of populations. Wastewater-based epidemiology (WBE) is one such monitoring approach that has garnered broad appeal in recent years (Hemalatha et al. 2021; Hill et al. 2021). WBE consists of measuring biomarkers (substances indicating the health status of a population, e.g., viral RNA) in wastewater to gain insights into the health of the population producing that wastewater. WBE of SARS-CoV-2 has been adopted at various scales, from single laboratories to national programs (Morvan et al. 2021; Prado et al. 2021; US CDC 2021), in sewersheds covering buildings (Vanrolleghem & Haddad 2021), campuses (Betancourt et al. 2021; Gibas et al. 2021), and often, entire cities (Sherchan et al. 2020; D'Aoust et al. 2021a; Fernandez-Cassi et al. 2021). Because of this mass adoption, projects have even sprung up with the intention of aggregating results worldwide (Naughton et al. 2021).

However, measuring biomarkers in wastewater poses a significant challenge: the complexity of the wastewater matrix and its collection system makes interpreting the obtained measurements particularly difficult. Sewer transport, biochemical decay, dynamic water usage patterns, rainfall, runoff, and snow melt significantly affect biomarker measurements (Hill et al. 2021; Haboub et al. 2022). Biomarkers also sorb onto particles to varying degrees and are therefore more or less affected by suspension, deposition and resuspension dynamics (McCall et al. 2017). Variability in sampling strategies (grab, composite, passive), laboratory assays, and quality control measures also complicate comparing WBE data from various sites or labs, let alone countries (Li et al. 2021; Wade et al. 2022).

WBE measurements are thus directly entwined with their context. It is, therefore, essential to capture this context carefully and extensively. Context is gleaned from collecting metadata and taking additional measurements to help identify factors influencing the target biomarker measurements (Figure 1).
Figure 1

Selection of factors to consider while interpreting biomarker measurements.

It is also essential to recognise that not every data point is created equal; laboratory issues, sensor malfunctions and human error are always possible and are best identified by the people who collected the data. It is, therefore, essential to include a record of their judgement in the metadata to support later interpretations of the data.

Capturing metadata helps ensure that the data are used correctly, are compared with other data only when appropriate (i.e., when the measurement context is similar enough between different campaigns or different sewersheds) and can still be used confidently many years after their original collection date, when the persons familiar with the details of a sampling campaign may no longer be available (Michener 2006).

As efforts have ramped up to produce reliable measurements of SARS-CoV-2 through environmental surveillance during the COVID-19 pandemic, researchers worldwide have shown great openness and willingness to collaborate. Sampling methods (Schang et al. 2021) and molecular assays (Ahmed et al. 2020) were developed and shared to accelerate the global fight against the disease. Practitioners are thus faced with the sizable challenge of stitching together many data sources, measurement campaigns, and assay results and inserting them into the larger framework of environmental public health surveillance – defined by McGeehin et al. (2004) as the ongoing collection, integration, analysis, and dissemination of data from environmental hazard monitoring, human exposure tracking, and health effect surveillance.

The scope of WBE is also continuously expanding. More complex assays are being developed (e.g., gene sequencing), and disease agents other than SARS-CoV-2 are being targeted, e.g., respiratory syncytial virus (Hughes et al. 2022), influenza (Mercier et al. 2022), and the MPox virus (de Jonge et al. 2022). With WBE becoming a part of more extensive environmental surveillance programs, other environmental components are also being probed (e.g., air and surfaces; Zuniga-Montanez et al. 2022). A data model that can accommodate the breadth of measurements taken in WBE is therefore required. Moreover, this model must have in its structure a particular concern for capturing data quality and be able to accept varied and extensive metadata. In this way, WBE resembles other environmental disciplines, such as microbial food safety (Griffiths et al. 2017), where international collaboration is crucial, and wastewater treatment, where the harsh operational conditions of sensors make context essential for the interpretation of water quality measurements (Plana et al. 2019).

The SARS-CoV-2 pandemic has created data repositories aggregating open case data from around the world (Dong et al. 2020). WBE researchers have relied on these datasets and have generated their own datasets that have also been aggregated and published (Naughton et al. 2021; Therrien et al. 2021). However, a consensus has yet to emerge on what data and metadata are required to maximise the utility of those datasets. This has proven true for many sub-areas of environmental public health in general. In WBE, these sub-areas include data correction (choosing what combination of wastewater measurements can help determine how viral signals are affected by processes other than viral shedding in the population) (Been et al. 2014; Maere et al. 2022); the creation of adequate quality assurance and quality control (QA/QC) measures, protocols, and reports; and the development of processes to meaningfully compare biomarker measurements across sampling locations. Therefore, a data model compatible with environmental public health surveillance must allow the rapid integration of new types of data and metadata into its structure as the community develops and includes these elements in its analyses. The rest of this paper will describe how the Public Health Environmental Surveillance Open Data Model (henceforth, PHES-ODM; Manuel et al. 2021) uses open science to address the challenges of environmental public health surveillance.

The importance of open data and open science

Openly sharing data encourages reuse via standard data formats and markup languages. It also removes barriers to access (paywalls, restrictive licences). This increased access to data speeds up the scientific process, encourages the detection of inaccuracies, and stimulates the development of new lines of enquiry (National Academies of Science, Engineering and Medicine 2018; Burgelman et al. 2019). In matters of public health, where timely decisions must be made based on the most accurate data possible, openly sharing new scientific results and datasets can benefit the general welfare considerably. The SARS-CoV-2 pandemic includes many examples of the scientific community accelerating the adoption of open science (Elsevier 2022; The Lancet 2022). The creation of easily shareable datasets helps the development of open science by reducing the burden on researchers taking on the arduous task of integrating data from various sources.

The FAIR guidelines have been proposed by Wilkinson et al. (2016) to define the features of datasets that facilitate (or discourage) informed data reuse. The guidelines suggest four main characteristics of suitable open datasets:

  1. Findability: Pieces of data should have unique identifiers, contain searchable metadata and be indexable.

  2. Accessibility: It should be possible to access and transfer data using open protocols, such as those widely used on the internet (HTTP, FTP, text-based formats).

  3. Interoperability: The data use an open dictionary that clearly identifies the meaning of data and metadata contained in the dataset.

  4. Reusability: The data contain rich metadata that enable researchers beyond the original dataset's authors to understand the data's meaning and judge its applicability to their research.

A good data model for the public health environmental surveillance field should strive to follow the FAIR guidelines. Its structure and tooling should also be designed to support the needs of the multiple stakeholders that create, manipulate, and consume those data: examples of stakeholders include utility workers contributing to sampling efforts, laboratory professionals, mathematical modellers, and public health officials.

Beyond open data, it is important to acknowledge the role of open software in open science. Open software fosters transparency and reproducibility of scientific results and improves the reliability of scientific software by allowing anyone to inspect, learn from, and improve it. Similarly, it speeds up science by enabling researchers to build upon the software tools of their peers (National Academies of Science, Engineering and Medicine 2018). However, the success of open software initiatives relies on more than merely publishing source code. Successful open projects fulfil a need shared by multiple people; they make the contribution process transparent and accessible to the community and have a well-defined technical process, governing structure, and leadership (Abernathy et al. 2022).

What are data models, exactly?

According to West (2011), data models define data's structure and intended meaning. They help describe a domain of activity by listing the entities involved, the relevant attributes of those entities, and the relationships between them. Data models offer standardised dictionaries of terms with precise definitions to support the interpretation of data records. Data models are closely related to ontologies: whereas ontologies define entities and relationships in a general and reusable way (Fishman & Stryker 2020), data models are, in contrast, designed to be specific to a particular application. Ontologies thus support the creation of data models but are not data models themselves.

Requirements for an environmental public health data model

To summarise, a data model for environmental public health surveillance should support the inclusion of data for many types of substances, biomarkers and pathogens and allow for the rapid inclusion of new measures over time to respond to emerging concerns. It should qualify the sampling site and its relationship with its population. It should let users describe their samples and the sampling methods employed. Beyond biomarkers, there should also be provisions to record measurements that help characterise or contextualise biomarker information (e.g., on-site measurements, population information, sample composition and laboratory values that affect the reported biomarker measurements). Sampling and analysis protocols should be documented. Given the amount of information required to create complete records, steps should be taken to make the model reasonably approachable via templates, supporting documentation, or dedicated data input tools.

A data model for environmental public health surveillance should aim to uphold the FAIR principles by allowing data owners to identify themselves and the licence that governs their data. The model should itself have an explicit licence to promote uptake by new users. Public health data can also be sensitive: data that identify specific individuals or groups can cause harm, such as stigmatisation and dehumanisation (Coffman et al. 2021). It is, therefore, critical that data models that hold public health data have provisions for anonymisation, implement restrictions to sharing, and provide ethical usage guidelines.

Currently available data models

Environmental public health datasets currently fall into two categories: those in ad-hoc formats unique to a data producer or sampling campaign and those that conform to standardised formats. Multiple formats have some level of compatibility with the requirements of an environmental public health data model proposed above, with structures that emphasise different aspects according to the model's original intended audience. Table 1 describes some of the available data models in the field, emphasising those developed during the SARS-CoV-2 pandemic.

Table 1

Feature set of currently available environmental public health data models

| Feature | PHES-ODM | WISE | NORMAN SCORE | W-SPHERE | NWSS | PHA4GE |
| --- | --- | --- | --- | --- | --- | --- |
| Reference | Manuel et al. (2021) | European Environmental Agency (2023) | NORMAN Network (2020) | Global Water Pathogens Project (2020) | US CDC (2021) | Griffiths et al. (2022) |
| Main intended audience | Environmental public health surveillance practitioners | Governmental agencies | Ecotoxicologists, SARS-CoV-2 template for WBE practitioners | WBE practitioners | WBE practitioners | Environmental genomics |
| Public dictionary of headers/tables? | Yes | Yes | No | Yes | Yes | Yes |
| Public dictionary of values? | Yes | No | No | No | Yes | Yes |
| Public database definition? | Yes | No | Yes | No | No | No |
| Public data conversion tools? | Yes | No | No | No | No | Yes |
| Public data validation tools? | Dictionary, software tool, template | Not found | Template | Template | Dictionary | Dictionary, software tool, template |
| Public data collection templates? | Yes | No | Yes | Yes | No | Yes |
| Governance and development | Open source | Inter-institutional | Inter-institutional | Internal | Internal | Open source |
| Model license | CC-BY4 | Not found | Not found for the model, but the template is open-access | Not found | Not found | CC-BY4 |
| Rights management | Element-level (any row, header or combination) | Not found | Dataset level | Dataset level | Dataset level | Dataset level |
| Environmental compartments | Various | Various (water bodies) | Various, but only wastewater in the template | Wastewater | Wastewater | Various |
| Pathogen measurements | Any pathogen in the dictionary | Not found | Yes, but only SARS-CoV-2 in template | SARS-CoV-2-specific | Multiple | SARS-CoV-2, MPOX, AMR |
| Measurement methods | Yes | Yes | Yes, but only PCR and sequencing-specific in the template | PCR and sequencing-specific | PCR and sequencing-specific | PCR and sequencing-specific |
| In-sample measurements | Any measure in the dictionary | Water quality | Water quality, but only PCR in the template | PCR and sequencing | PCR and sequencing, pH, conductivity, TSS | PCR and sequencing |
| Collection site information | Yes | Yes | Yes | Yes | Yes | Yes |
| On-site measurements | Any measure in the dictionary | Water quality | Flow, weather, COD, TSS, NH4+-N, water temperature | Flow | Flow, water temperature | No |
| Population count | Served by site or within a geographic region | No | Served by site | Served by site | Served by site | No |
| Sewer network information | No | No | No | No | Average wastewater travel time, industrial input, stormwater input | No |
| Sample and sampling method | Yes | Yes | Yes | Yes | Yes | Yes |
| Population health data | Any measure in the dictionary applied to a geographic region | No | SARS-CoV-2 prevalence | No | No | No |

The Water Information System for Europe (WISE; European Environmental Agency 2023) is a repository of dashboards, maps and datasets about water quality in water bodies across Europe. It is intended to carry water quality information for various substances, but it is not geared towards emerging pathogens. It is, therefore, an environmental surveillance system but not an environmental public health model. Each WISE dataset has its own schema (dictionary, collection of data fields, and specified relationships between fields), which complicates data harmonisation.

The Network of Reference Laboratories, Research Centres and Related Organisations for Monitoring of Emerging Environmental Substances (NORMAN) database system (NORMAN Association 2011) is another European system that collects data concerning a wide array of emerging contaminants in multiple environmental compartments. It uses standardised data collection templates targeting specific contaminants rather than a uniform data model. As such, though it has an extensive overall dictionary, it restricts the effective dictionary size for a given substance to a fit-for-purpose, but fairly limiting level. Similarly, the Wastewater SARS Public Health Environmental Response (W-SPHERE) repository (Global Water Pathogens Project 2020) also offers a fit-for-purpose data template, emphasising collecting minimal but strictly necessary metadata to support data aggregation and analysis at a global scale.

The National Wastewater Surveillance System (NWSS) (US CDC 2021), deployed in the United States, is designed to collect pathogen data from wastewater only. It can store metadata concerning analysis methods for detecting multiple pathogens. However, its structure is primarily designed for biomarkers that can be analysed with PCR-based methods. The structure of NWSS resembles the first version of the PHES-ODM in many ways (what tables are present, contents of the dictionary) as both model development teams consulted with each other while designing their respective first versions. However, the institutional governance of NWSS did not allow for community input, which made it difficult to modify the model to accommodate demands for new targets and metadata coming from the WBE community (genomics, including mutations, gene sequences, variants, and proteins, as well as environmental compartments beyond wastewater).

Finally, the Public Health Alliance for Genomic Epidemiology (PHA4GE; Griffiths et al. 2022) data dictionary is based on a per-pathogen template system. Its dictionary strongly emphasises interoperability, with each term being attached to a reference entry in formal ontologies, and with multiple fields allowing the connection of the PHA4GE data to information found in other databases. Its origins in the genomics field are evident from its concern for open science and software, and its concern for mapping between formats and databases – critical components of any genomics analysis workflow. However, measurements not linked directly to measuring a pathogen (i.e., on-site environmental quality measurements or physicochemical characterisation of environmental samples) fall outside the scope of the PHA4GE model.

The PHES-ODM

The first iteration of the PHES-ODM was made available on GitHub in 2020 with the stated goal of making the collection and sharing of high-quality data as easy as possible for researchers and organisations working in WBE for SARS-CoV-2. The model initially focused on capturing data and metadata from PCR-based viral measurements, as well as water quality measurements in samples and on-site at the sampling point. As the types of assays being performed on WBE samples became more varied and surveillance of SARS-CoV-2 expanded to additional environmental compartments, expansions of the PHES-ODM model became necessary, and its ontology was revised to capture relationships inherent in environmental public health surveillance in general. This paper aims to describe the main features of the PHES-ODM and show how these features support environmental surveillance. In particular, the PHES-ODM 2.0's goals are to:

  • Expand the types of data and metadata that can be recorded in the PHES-ODM beyond WBE.

  • Facilitate the addition of new terms in the dictionary as the community requests them.

  • Support common data model uses, such as data input, validation, analysis, and aggregation.

  • Support open science by ensuring that the PHES-ODM follows the FAIR guidelines, facilitates the creation of FAIR datasets, and employs the principles of open software development.

Working towards achieving these goals has led the PHES-ODM team to develop the following:

  1. A basic ontology of the environmental surveillance field.

  2. A data model that maps onto this ontology, supports many use cases, and is general enough to support its various sub-disciplines.

  3. A dictionary of terms that defines each element of the data model, with a practical emphasis on the sub-discipline currently most commonly practised among its stakeholders, i.e., WBE for SARS-CoV-2.

  4. Tools that support the consistent use of the PHES-ODM for data collection, analysis, and reporting.

Open-source management

The management of the PHES-ODM project follows an open-science approach. The data model and all its supporting software tools are available on an open software platform (Big Life Lab 2023). Beyond that, however, the project also fosters community participation. An international steering committee composed of core developers, academics, and institutional users of the data model guides the long-term goals and scope of the project (Big Life Lab 2022a). In parallel, users of the model can request new features and changes to the model or its tools using the GitHub issue tracker (Big Life Lab 2022b), while discussions of the model and the best practices surrounding it occur on the Discourse platform (Discourse 2022). The bulk of the development is carried out by the core development team (Big Life Lab 2022c), whose work undergoes public review by its users before it is approved for inclusion into official releases. The project also hosts working group meetings where participants can weigh in on ways the model could evolve to suit their needs better. Participation in these bi-weekly meetings is open, with regular participants being actively involved in environmental public health surveillance programs worldwide. These processes help ensure that the evolution of the PHES-ODM follows the needs of its community of users. They also make it possible for anyone to witness the development process and trust in the reliability and long-term viability of the model.

Scope

Environmental public health surveillance is, by necessity, multidisciplinary as it encompasses epidemiological research, microbiology, chemistry, mathematical modelling and disciplines directly related to the environmental compartment being probed (e.g., WBE involves sewer characterisation, whereas measuring soil contamination would involve geology). Answers to six questions describe the elements required to describe the provenance and meaning of the data captured in such varied systems (Plana 2015):

  • What data are being stored: The model is used to store measurements. Measurements are realisations of measures (i.e., things that can be quantified or qualified). The model also holds information on the specimens on which the measurements are taken (i.e., the samples taken from the environment to perform the measurements).

  • Where were the measurements taken: Measurements are usually done on samples. However, they can also be done on-site. Some measures may also describe the region represented by the site.

  • How were the data collected: Samples have specific collection methods, and each type of measurement also has its method. Sometimes, those methods rely on instruments.

  • Why were the data collected: Data are always collected with a purpose (intended use).

  • Who collected the data: Every person or organisation in the custody chain of measurements and samples must be known to maintain trust. That includes data owners, creators, custodians, and funders.

  • When were the data collected: Samples are collected at a single point in time (grab samples) or over several hours or days. Analyses are then carried out on the samples. These may also take a substantial amount of time. Finally, once a result has been obtained, it is reported to a data repository. For public health data, dates can have multiple meanings (e.g., disease onset or reporting date).

Answering these questions helps identify which entities are crucial for describing the environmental surveillance system. By naming these entities and the relationships they have with each other, a system map can be traced (see Figure 2). This map forms a basic ontology of the environmental surveillance field.
Figure 2

Main entities and relationships of the PHES-ODM's proposed environmental surveillance system ontology.

Data model attributes

For a data model to be relevant, its entities must be described with relevant attributes. Once entities, relationships and attributes are identified, they can be formally defined in a dictionary. The dictionary is thus the central repository for all data model elements. In it, elements can be given unique identifiers, standard names, and a description as a first step to utilising them in a data model. The PHES-ODM dictionary is defined in the ‘Parts’ table. Each row of the table is called a part. Each part describes an entity, an attribute, or an accepted value for an attribute.

Every concept used in the PHES-ODM is defined using the ‘part’ data structure. Parts can thus be considered building blocks that can be assembled to create larger structures within the PHES-ODM. For example, tables are a collection of parts of the ‘attribute’ type, and category sets are collections of parts of the ‘category’ type. Data input templates can also be defined as a list of parts to which users provide values.

Parts have over 50 attributes in the PHES-ODM, some of which are optional. The most important attributes are the following:

  • partID: the unique identifier of each part.

  • partLabel: an English language identifier for the part.

  • partDescription: an explanation of what the part represents.

  • partInstruction: an explanation of how to use the part within the PHES-ODM.
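
The building-block idea can be illustrated with a minimal sketch in which a data input template is simply an ordered list of parts. Only the four attribute names (partID, partLabel, partDescription, partInstruction) come from the text; the example rows, the `build_template` helper and the two-row CSV layout are assumptions made for illustration and do not reproduce the official PHES-ODM templates.

```python
import csv
import io

# Hypothetical dictionary rows keyed by partID. The identifiers and wording
# below are illustrative placeholders, not actual PHES-ODM entries.
PARTS = {
    "sampleID": {
        "partID": "sampleID",
        "partLabel": "Sample ID",
        "partDescription": "Unique identifier of a sample.",
        "partInstruction": "Reuse the same ID wherever the sample is referenced.",
    },
    "collDT": {
        "partID": "collDT",
        "partLabel": "Collection date and time",
        "partDescription": "When the sample was collected.",
        "partInstruction": "Use an ISO 8601 timestamp.",
    },
    "value": {
        "partID": "value",
        "partLabel": "Value",
        "partDescription": "The reported value of a measurement.",
        "partInstruction": "Enter the value in the reported unit.",
    },
}


def build_template(part_ids, parts=PARTS):
    """Return CSV text: a row of partIDs followed by a row of human-readable labels."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(part_ids)
    writer.writerow([parts[part_id]["partLabel"] for part_id in part_ids])
    return buffer.getvalue()


template = build_template(["sampleID", "collDT", "value"])
```

Because the template is derived from the dictionary, renaming a label or adding an attribute only requires changing the corresponding part.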

Relationships are created by simply adding the partID of a part to the list of attributes of another. For one-to-many and many-to-many relationships to be representable in this way, it is necessary to define another part type called a set.

Collections of parts are called ‘sets’ within the PHES-ODM. The identity of the set is represented as a regular part. However, the contents of sets are defined in the sets table. Each row of the sets table associates the unique identifier of a set to the identifier of one of its members. This arrangement allows for the creation of arbitrarily large sets. It also allows any part to be included in an arbitrary number of sets.
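
A minimal sketch of this arrangement is given below, assuming simplified in-memory tables. The `partType` field, the `setID` column name and every example identifier (`collectionSet`, `manualSet`, `grab`, and so on) are hypothetical names chosen for illustration; only the idea of one row per set-member pair comes from the text.

```python
from collections import defaultdict

# Simplified parts table: the set itself is a regular part, as are its members.
PARTS = [
    {"partID": "collectionSet", "partLabel": "Collection methods", "partType": "set"},
    {"partID": "manualSet", "partLabel": "Manual methods", "partType": "set"},
    {"partID": "grab", "partLabel": "Grab sample", "partType": "category"},
    {"partID": "composite", "partLabel": "Composite sample", "partType": "category"},
    {"partID": "passive", "partLabel": "Passive sample", "partType": "category"},
]

# Simplified sets table: each row links one set to one of its members.
SETS = [
    {"setID": "collectionSet", "partID": "grab"},
    {"setID": "collectionSet", "partID": "composite"},
    {"setID": "collectionSet", "partID": "passive"},
    {"setID": "manualSet", "partID": "grab"},
]


def members_by_set(set_rows):
    """Map each set identifier to the list of its member partIDs."""
    index = defaultdict(list)
    for row in set_rows:
        index[row["setID"]].append(row["partID"])
    return dict(index)


def sets_by_member(set_rows):
    """Map each member partID to the list of sets that contain it."""
    index = defaultdict(list)
    for row in set_rows:
        index[row["partID"]].append(row["setID"])
    return dict(index)


def dangling_ids(set_rows, part_rows):
    """Return identifiers used in the sets table but missing from the parts table."""
    known = {part["partID"] for part in part_rows}
    used = {row["setID"] for row in set_rows} | {row["partID"] for row in set_rows}
    return sorted(used - known)
```

Here `grab` belongs to two sets and `collectionSet` has three members, which shows how the row-per-membership layout covers both one-to-many and many-to-many relationships.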

Table types

Since PHES-ODM entities have a standard number of attributes, which always have a single value, all entities of the same type can be stored together in tabular form. There are three types of tables in the PHES-ODM. Result tables hold measurements, sampling reports, and quality information. These tables are thus frequently updated in a surveillance program. Conversely, Program description tables contain data that stay constant for a given surveillance program: sampling locations, organisations in charge of the campaign, the contact information of the people responsible, the instruments used to take samples and to perform analyses, and the identity of the final dataset where it will all be recorded. Finally, Dictionary tables are tables where dictionary components are defined (sets and parts). Internationalisation information (list of languages, translations of the dictionary parts, country codes) is also held there. Dictionary table entries are defined by the PHES-ODM maintainers and updated in every release, whereas entries in the other two table types (result and program description) store actual user-defined data. PHES-ODM dictionary tables comprise over one thousand parts as of November 2023. The complete entity relationship diagram of the PHES-ODM can be found in Supplement 1.

Persons with different roles are typically responsible for each of the three table types. PHES-ODM maintainers update the dictionary tables. Data custodians oversee the program description tables: they are aware of all sampling locations and can assign them unique identifiers, and they know what laboratory protocols will be used to gather measurements and samples. Field and lab workers, on the other hand, are responsible for generating the information that goes into the result tables. Most users, therefore, do not have to interact with the entirety of the PHES-ODM, which vastly simplifies its use for any individual.

Entering data into the PHES-ODM

Because the PHES-ODM is organised into tables, data entry is relatively straightforward. The available columns indicate what information should be collected for each entity type. The parts table also supports data entry – to fill an attribute, one can look in the parts table for the attribute's entry (row). In this entry, one will find the expected data type for the attribute's value. If the data type is categorical, the set of all correct values for that attribute is linked to the entry via one of its properties (‘mmaSet’, short for ‘measure, method or attribute set’). Linking parts to their corresponding values reduces opportunities for input errors and paves the way for automatic data validation.
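The lookup described above could be automated roughly as follows. This is a sketch under stated assumptions: only the ‘mmaSet’ property is named in the text, whereas the ‘dataType’ key, the part identifiers and the set contents are hypothetical.

```python
# Hypothetical parts-table entries keyed by partID. Only 'mmaSet' is a
# property named in the text; 'dataType' and all identifiers are assumed.
PARTS = {
    "collectionMethod": {"dataType": "categorical", "mmaSet": "collectSet"},
    "sampleVolume": {"dataType": "float", "mmaSet": None},
}

# Hypothetical sets-table rows: (setID, partID).
SETS_ROWS = [("collectSet", "grab"), ("collectSet", "composite")]


def allowed_values(attribute_id):
    """Return the valid categories of an attribute, or None if it is not categorical."""
    entry = PARTS[attribute_id]
    if entry["dataType"] != "categorical":
        return None
    return {part for set_id, part in SETS_ROWS if set_id == entry["mmaSet"]}


def is_valid(attribute_id, value):
    """Check one value against the data type declared in the parts table."""
    allowed = allowed_values(attribute_id)
    if allowed is not None:
        return value in allowed
    if PARTS[attribute_id]["dataType"] == "float":
        try:
            float(value)
        except (TypeError, ValueError):
            return False
        return True
    # Data types not handled by this sketch are accepted as-is.
    return True
```

For instance, `is_valid("collectionMethod", "grab")` is accepted because ‘grab’ belongs to the linked set, whereas a value outside the set is rejected.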

The abundance of fields in the PHES-ODM reflects its desire to be robust, but only some users will be interested in collecting the complete set of available metadata attributes. Thus, the PHES-ODM aims for flexibility by defining most fields as optional. However, some fields are mandatory because they help uniquely identify data or minimally define an entity.

Recording measurements and samples

Measurements and samples are each stored in a result table where each row reports a unique measurement event or sample. These tables contain attributes that help capture the full context of each measurement or sampling event. Figure 3 shows each attribute and the type of context it provides. For both reports, there are attributes for fully describing the origin, purpose and quality of a measurement or sample, as well as features to locate it in time and space and explain how it was created. In the measure report, the most crucial piece of information is the value. For samples, the identity of the sample and the time at which it was taken matter most.
Figure 3

Attributes characterising the reported values in the PHES-ODM. Top: Attributes of the measure reports. Bottom: Attributes of the sample reports.

The measure and sample result tables are each paired with a companion table that helps define relationships between reports. For measure reports, that table holds measure report sets. Sets allow measurements that should be read together (e.g., points in a calibration curve) to be explicitly linked with a common identifier. Similarly, samples can be linked together in the sample relationships table by denoting their relationship to each other (e.g., split from a main sample, pooled together, or field replicates).

Environmental public health surveillance is a broad, multidisciplinary field. A wide variety of biomarkers, supporting measurements and types of samples may, therefore, be recorded during a measurement campaign. To help PHES-ODM users see only the dictionary items relevant to a particular campaign or laboratory at any given time, the PHES-ODM assigns attributes to measures that categorise them along several axes. These attributes are:

  • Compartment set: Indicates whether a measure can be taken in air, in water, on surfaces, in humans, in flora, in fauna, in soils or a combination.

  • Specimen set: Indicates whether a measure can be taken in a sample, on-site, over a population, or a combination.

  • Domain: Indicates whether the measure pertains to a physical phenomenon, chemical composition, or biological characteristics.

  • Group: Measures characterising the same target biomarker or the same type of specimen are placed into groups. For instance, measures of various SARS-CoV-2 genes would be put into the same group, while water biochemical composition measures (e.g., COD, NH4+-N or total phosphorus) would be placed in another.

  • Class: Measures that use a similar assay and are quantified using similar units are combined into a class. For instance, measures of the N1 gene of SARS-CoV-2 and a gene of the Pepper Mild Mottle Virus (PMMoV) are placed in the same class since they could be measured using quantitative PCR and expressed using the number of gene copies per unit of sample.
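The categorisation attributes above lend themselves to simple filtering. The sketch below assumes a hypothetical in-memory representation of a few measures; the field names (‘compartments’, ‘specimens’, ‘measureClass’) and all identifiers are illustrative and are not the PHES-ODM's actual column names.

```python
# Hypothetical measure entries. Field names mirror the attributes listed
# above; all identifiers and values are illustrative only.
MEASURES = [
    {"partID": "covN1", "compartments": {"water", "air", "surface"},
     "specimens": {"sample"}, "domain": "biological",
     "group": "sarsCov2", "measureClass": "geneCopies"},
    {"partID": "pmmov", "compartments": {"water"},
     "specimens": {"sample"}, "domain": "biological",
     "group": "pmmov", "measureClass": "geneCopies"},
    {"partID": "cod", "compartments": {"water"},
     "specimens": {"sample", "site"}, "domain": "chemical",
     "group": "waterQuality", "measureClass": "concentration"},
]


def select_measures(measures, compartment=None, specimen=None, **exact):
    """Return the partIDs of measures matching every criterion supplied."""
    selected = []
    for measure in measures:
        if compartment is not None and compartment not in measure["compartments"]:
            continue
        if specimen is not None and specimen not in measure["specimens"]:
            continue
        # Remaining keyword arguments must match the stored value exactly.
        if any(measure.get(key) != wanted for key, wanted in exact.items()):
            continue
        selected.append(measure["partID"])
    return selected
```

A laboratory working only with air samples could call `select_measures(MEASURES, compartment="air")` to hide every measure that does not apply to that compartment.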

Recording geographical information

A fundamental context element for measurements concerns the location where they were taken. The PHES-ODM lets users record entries for any location where measurements or samples are taken. These locations are called ‘sites’. Sites vary by type, that is, by the infrastructure they contain and by their use (e.g., sites could be day-care facilities, hospitals, sewer utility holes, or wastewater treatment plants.) Besides the site's civic address and geographical location, the model records the site type and allows for a free-text description of aspects of the site that are difficult to standardise into a dictionary.

Sites are represented in the PHES-ODM by a single point in space. However, the environment is usually described in terms of areas (defined as polygons or collections of polygons). In environmental surveillance, the areas of concern might be the watersheds of sewer networks that drain to a sampling site (often called ‘sewersheds’) or the jurisdiction of a public health agency, for example. The PHES-ODM has provisions to store geographical descriptions of these areas and their type.

Representing sites and polygons in the PHES-ODM makes the model interoperable with Geographical Information Systems (GIS). It also allows the PHES-ODM to store measurements done on things other than samples (e.g., wastewater flow measurements taken at a sampling site or daily reported cases in a geographical region) and thus better understand the context of measurement campaigns. One example coming from WBE is sewer and sewershed characterisation. Currently, PHES-ODM does not have dictionary entries for sewer and sewershed characteristics (e.g., combined or separated sewers, average residence time or percentage of permeable surfaces). If the community requests these measures, the PHES-ODM's flexible structure will make adding these measures to the dictionary very straightforward.

Recording protocols

When analysing data from various sources, it is essential to understand how the data were produced, as differences in analysis steps may make values of the same measure incomparable. For example, solids must be concentrated before RNA extraction in PCR analysis of wastewater samples for SARS-CoV-2 genes. This step can be done via filtration through a membrane (Ahmed et al. 2020) or via centrifugation of the sample (D'Aoust et al. 2021b); however, the two approaches yield significantly different values. The PHES-ODM thus recognises the importance of tracking the analysis methods in metadata. Protocols, therefore, are one of the entities of the data model. Protocols are complicated entities, however, as they consist of multiple steps that can take many forms. Several challenges must be addressed to store protocols efficiently in tables:

  • Protocols comprise a collection of steps.

  • Protocol steps can consist of various things:

    ◦ They can prescribe the use of a specific quantity of something. Quantities can be described using the measures found in the PHES-ODM dictionary and assigned a prescribed value and unit of measurement.

    ◦ They can prescribe an action. The PHES-ODM calls this a method.

  • Protocol steps must be organised in the correct order to convey the meaning of the protocol.

  • Steps may follow each other, or they can be done concurrently.

  • Measures inside a protocol usually specify the quantity of reagents, supplies and conditions involved in executing a method.

  • Different protocols may use some of the same steps but in a different order.

  • Protocols may have steps that consist of one or several (sub)protocols.

To address these challenges, the PHES-ODM stores a record describing the protocol itself (e.g., what it does or who developed it) separately from its constituent steps. This separation of a protocol into smaller pieces is similar to other protocol tracking tools (Teytelman et al. 2020), which break down protocols into blocks that can be independently version-controlled. The entries in the steps table are either methods or measures. The relationships linking the various steps of the protocol are then extracted into a separate table. Rows in the relationship table have four main attributes. The first is the identifier of the protocol for which the relationship holds. The three others define a relationship between the two protocol steps in question. The relationship is thus expressed in the form:

[subject] [relationship] [object]

The subject and the object can either be a (sub)protocol or a protocol step. The available relationships (e.g., ‘is before’, ‘specifies’, and ‘is concurrent with’) allow one to organise the protocol in a more semantically meaningful way than simply using a sequential order (Robinson et al. 2015).

This flexible structure allows protocol steps and subprotocols to be reused in other protocols. The relationships between protocols, protocol orderings and protocol steps are shown in Figure 4.
Figure 4

Main entities and relationships used to represent protocols.

Because PHES-ODM protocols use standard dictionary parts and explicit relationships, they are machine-readable, thus paving the way for using protocols in data analysis pipelines. Such pipelines could automatically apply data transformations to reconcile data when possible and segregate it when not.
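To illustrate what machine-readable relationships make possible, the sketch below recovers an execution order from ‘is before’ rows and collects the measures that specify each method. The row layout (protocol, subject, relationship, object) follows the description above, but the step identifiers are hypothetical and the use of a topological sort is this example's choice, not a documented PHES-ODM procedure.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical relationship rows: (protocolID, subject, relationship, object).
RELATIONSHIPS = [
    ("protoA", "concentrate", "is before", "extractRNA"),
    ("protoA", "extractRNA", "is before", "runPCR"),
    ("protoA", "spinSpeed", "specifies", "concentrate"),
]


def ordered_steps(rows, protocol_id):
    """Order the steps of one protocol using its 'is before' relationships."""
    sorter = TopologicalSorter()
    for protocol, subject, relationship, obj in rows:
        if protocol == protocol_id and relationship == "is before":
            # The object can only start once the subject has been done.
            sorter.add(obj, subject)
    return list(sorter.static_order())


def specifications(rows, protocol_id):
    """Map each method to the measures that specify how it is executed."""
    specs = {}
    for protocol, subject, relationship, obj in rows:
        if protocol == protocol_id and relationship == "specifies":
            specs.setdefault(obj, []).append(subject)
    return specs
```

Two protocols that reuse the same steps in a different order would differ only in their relationship rows, which is what allows steps to be shared between protocols.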

Recording data quality

Keeping track of quality concerns is critical to the interpretation and use of data. The PHES-ODM includes a table dedicated to quality reports where quality flags can be attributed to measures, sets of measures, and samples. Quality flags can indicate moments when a protocol was not faithfully followed (e.g., the wrong amount of some reagent was added) or when the result of an experiment should not be used (e.g., when the signal is below the level of quantification). A quality flag indicating that no concern was reported also exists to distinguish cases where quality is certified adequate from those where no quality information exists. The dictionary contains all valid quality flag values, thus making quality information machine-readable and paving the way for automatic filtering or processing of reports that have certain flags attached. However, a notes field is also available for free-text input if a quality issue is not codified into a flag. Additional quality flags can be added to the PHES-ODM as the community requests them.

Since quality reports are detached from measure and sample reports, an arbitrary quantity of flags can be assigned to any value. This decoupling of measure and quality reports allows stakeholders (lab analysts, data scientists, and public health officials) to store their own quality flags without affecting other stakeholders. Quality flags thus represent a flexible but robust system for reporting data quality and compliance.

Various stakeholders have different needs when it comes to data quality. Analysts familiar with the subtleties of sampling and laboratory experiments may want to record extensive quality information to support their interpretation of the data. However, condensing this complex information into a straightforward binary may be essential when communicating with outside agents: either the data point can be trusted or it can be disregarded. Measure and sample reports contain a Boolean reportable attribute to address such situations. Having access to fine-grained quality flags and a layman-friendly binary flag lets users of the PHES-ODM communicate with each other regardless of their role.
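The interplay between detached quality reports and the Boolean reportable attribute can be sketched as follows. Only the ‘reportable’ attribute is named in the text; the other field names, the flag identifiers and the three status labels are hypothetical.

```python
# Hypothetical measure reports and quality reports. The flag identifiers
# and field names (other than 'reportable') are illustrative assumptions.
NO_CONCERN = "noConcern"

MEASURE_REPORTS = [
    {"measureRepID": "m1", "value": 120.0, "reportable": True},
    {"measureRepID": "m2", "value": 3.0, "reportable": False},
    {"measureRepID": "m3", "value": 88.0, "reportable": True},
]

QUALITY_REPORTS = [
    {"measureRepID": "m1", "qualityFlag": NO_CONCERN},
    {"measureRepID": "m2", "qualityFlag": "belowLOQ"},
    {"measureRepID": "m2", "qualityFlag": "reagentError"},
]


def flags_for(report_id):
    """Collect every quality flag attached to one measure report."""
    return [q["qualityFlag"] for q in QUALITY_REPORTS
            if q["measureRepID"] == report_id]


def quality_status(report_id):
    """Distinguish certified, flagged and undocumented measure reports."""
    flags = flags_for(report_id)
    if not flags:
        return "unknown"  # no quality information exists
    if all(flag == NO_CONCERN for flag in flags):
        return "certified"
    return "flagged"


def reportable_values():
    """Return the values a non-specialist audience can rely on."""
    return [m["value"] for m in MEASURE_REPORTS if m["reportable"]]
```

Here report ‘m3’ has no quality rows at all, so it is kept distinct from ‘m1’, whose quality was explicitly certified.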

Ownership, anonymisation and sharing

In the PHES-ODM, the ownership information is linked to each recorded measure via a link to the ‘dataset’ table. In this table, the provenance of the data is established via links to contact persons and organisations that can be identified as either custodians or funders of the data. Measures are also individually linked to a license table, meaning that ownership and usage conditions can be separately prescribed for each measurement.

As a data model, PHES-ODM does not support anonymisation directly. However, an open-source companion sharing tool allows dataset custodians to define rules for selecting a subset of their data and potentially transforming it before sharing (Big Life Lab 2022d). For example, users might prescribe in their schema to only select rows in the measures table with a particular license that come from a dataset with a specific funder. After selecting rows to share, another rule might define the fields that should be kept.
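The two-stage rule in this example (select rows, then keep fields) can be sketched in plain Python. This does not reproduce the companion tool's rule syntax, which is not described here; the table contents, field names and identifiers are hypothetical.

```python
# Hypothetical dataset and measure records; not the sharing tool's format.
DATASETS = {"ds1": {"funder": "orgA"}, "ds2": {"funder": "orgB"}}

MEASURES = [
    {"measureRepID": "m1", "datasetID": "ds1", "license": "open",
     "value": 10.0, "notes": "internal remark"},
    {"measureRepID": "m2", "datasetID": "ds2", "license": "open",
     "value": 20.0, "notes": ""},
    {"measureRepID": "m3", "datasetID": "ds1", "license": "restricted",
     "value": 30.0, "notes": ""},
]


def select_rows(rows, license_id, funder_id):
    """Keep rows with the given license whose dataset has the given funder."""
    return [row for row in rows
            if row["license"] == license_id
            and DATASETS[row["datasetID"]]["funder"] == funder_id]


def keep_fields(rows, fields):
    """Drop every field that is not explicitly listed for sharing."""
    return [{field: row[field] for field in fields} for row in rows]


SHARED = keep_fields(select_rows(MEASURES, "open", "orgA"),
                     ["measureRepID", "value"])
```

Listing the fields to keep, rather than the fields to drop, means that a newly added column is withheld by default.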

Leveraging wide tables

Entities defined in the PHES-ODM data model (e.g., measures, samples, sites, etc.) are each stored in separate tables, where a single row describes every new entity instance. New rows are added as new data is accumulated. This style of data storage is called a ‘long’ table, and it has the advantage of maintaining its schema (what tables, table columns, and relationships are present in the data model) regardless of how much data are added to the tables. However, this efficient data format has some disadvantages because it is far removed from how WBE practitioners work with data. On the data collection side, it is common for laboratories to organise their data into a ‘wide’ format, where a row represents a sample with columns for all measures from that sample. As a sample is processed, the analyst fills the row until all the measures listed in the column headers have a value. A similar scheme can be applied, not with a sample, but with individual ‘runs’ of an experiment.

Wide tables are practical because attributes of the measures being taken that never change (e.g., the units, the expected data type, whether the analysis was done on the solid fraction of the sample or its liquid fraction) are stored outside the table. Thus, the analyst need not concern themselves with entering them for each new value, saving time and effort. Moreover, having values taken from the same sample aligned side to side makes them much easier to compare and to quickly judge whether an error has been made during the experiment.

Similarly, data analysts such as epidemiologists typically use wide tables where each row represents all measures for a single day. Transforming the data into time series is essential for understanding the evolution of the biomarker through time. It is also necessary for most visual representations used in WBE and most modelling schemes.

Wide tables become impractical when combining data from multiple laboratories, sampling sites, or datasets. Properties that can be assumed constant in a single lab are no longer constant at this scale, which might lead analysts to improperly combine data from columns that share a name but not a meaning. Conversely, since a wide column name can condense multiple attributes of a measure into a single field, the name of that field may vary wildly between table creators, even if they start with the same dictionary of attributes.

The PHES-ODM team recognises this usability issue and proposes a standardised naming scheme for long variables. These names take advantage of the fact that every element of the PHES-ODM dictionary is defined in the parts table and has a unique identifier. Since these identifiers are human-readable (they consist of a shortened version of the part label), they can be concatenated into a long string and remain easy to understand. The order in which the attributes of a measure are chained is defined in advance to ensure uniformity between users. The wide-to-long translation scheme is straightforward and creates valid variable names in all open programming languages (e.g., Python and R) and databases (e.g., SQL). For program description tables, the following pattern is used:
For measure reports, the partIDs of the attributes of the report rows are strung in the following order (see Figure 5):
  1. The environmental compartment the specimen is taken from.

  2. The specimen the measure is taken on.

  3. The fraction of the specimen the measure is taken on.

  4. The measure itself.

  5. The unit that is used for reporting.

  6. The aggregation that is applied to the value.

  7. An index to allow the wide table to record more than one instance of the wide variable in question if required.

  8. The attribute whose value is recorded in the wide table cell. Typically, this would be ‘value’; however, any other attribute not included in the wide name can be specified here.

Figure 5

Transposition of a measurement report row (long format) to wide names and their values. For illustrative purposes, the labels are shown in the long table row. The part identifiers are shown in parentheses.

Other measure attributes, such as class, group, and domain, need not be added to the wide name. Their role is to help organise the measures. However, since measures are unique, adding these attributes to the wide name of a measure would be redundant.

The resulting names have many components, but they have the advantage of being highly specific, thus reducing the risk of inadvertently combining incompatible data. Moreover, since they consist of identifiers for parts that all have labels in the parts table, they are both machine-readable and easily translated to an expanded natural language label that anyone can read.
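The ordering of components described for measure reports can be sketched as a small helper. The component order follows the list above; the underscore separator, the dictionary keys and the part identifiers in the usage example are assumptions made for illustration.

```python
# Order of the wide-name components for measure reports, as listed above.
# The dictionary keys are hypothetical labels chosen for this sketch.
WIDE_NAME_ORDER = (
    "compartment",
    "specimen",
    "fraction",
    "measure",
    "unit",
    "aggregation",
    "index",
)


def wide_name(report, attribute="value", sep="_"):
    """Chain the part identifiers of a measure report into a wide name.

    The separator is an assumption; it is chosen so that the result is a
    valid identifier in common programming languages.
    """
    components = [str(report[key]) for key in WIDE_NAME_ORDER]
    components.append(attribute)
    return sep.join(components)


def to_wide(reports, attribute="value"):
    """Map each report's wide name to the requested attribute's value."""
    return {wide_name(r, attribute): r[attribute] for r in reports}
```

With hypothetical part identifiers such as ‘wat’, ‘sa’, ‘liq’, ‘covN1’, ‘gcL’ and ‘sin’ and an index of 1, the helper returns `wat_sa_liq_covN1_gcL_sin_1_value`.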

The ability to transpose data from long to wide format (and vice versa) can be leveraged at the data entry stage by creating templates, with wide names as column headers, that respect the typical workflow of laboratory workers. At the data analysis stage, recreating the wide names from the data stored in the PHES-ODM automatically dispatches measure reports with differing attributes to columns with different names, thus reducing the risk of erroneous combinations.
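A long-to-wide transposition of this kind might look as follows. For brevity, the column names here chain only two components (measure and unit) rather than the full naming scheme, and all field names and identifiers are hypothetical.

```python
# Hypothetical long-format measure rows: one row per reported value.
LONG_ROWS = [
    {"sampleID": "s1", "measure": "covN1", "unit": "gcL", "value": 1200.0},
    {"sampleID": "s1", "measure": "pmmov", "unit": "gcL", "value": 5.0e6},
    {"sampleID": "s2", "measure": "covN1", "unit": "gcL", "value": 800.0},
    {"sampleID": "s2", "measure": "covN1", "unit": "gcMl", "value": 0.8},
]


def pivot_by_sample(rows, sep="_"):
    """Build one wide row per sample, keyed by a composed column name."""
    wide = {}
    for row in rows:
        column = sep.join([row["measure"], row["unit"]])
        wide.setdefault(row["sampleID"], {})[column] = row["value"]
    return wide


def column_names(wide):
    """List every column that appears in at least one wide row."""
    return sorted({column for cells in wide.values() for column in cells})


WIDE = pivot_by_sample(LONG_ROWS)
```

The two values of the same gene reported in different units for sample ‘s2’ land in separate columns instead of being merged under a single header.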

Supporting community adoption

There often exists a tension between usability and completeness. As the number of concepts included in a system grows, the effort required for a user to become comfortable with it increases and quickly reaches unmanageable levels. This is especially true if the system targets users of varied levels of familiarity with the concepts it contains. Environmental public health surveillance data are often collected by workers with little to no experience with data models (e.g., utility staff, laboratory technicians, or policymakers).

The PHES-ODM, therefore, must strike a balance between robustness (being able to host multiple types of data and metadata) and ease of use. In that respect, the guiding philosophy of the PHES-ODM project has been to aim for robustness and subsequently provide tools (e.g., utility software, templates, standardised long-to-wide conversion procedure) to support common uses and provide extensive documentation to facilitate onboarding and understanding. Information regarding these tools is provided on the PHES-ODM website and GitHub repository.

The documentation provided by the PHES-ODM project follows a four-quadrant approach (Divio Technologies AB 2023), where resources are designed to be learning-oriented (tutorials), problem-oriented (how-to guides), understanding-oriented (long-form explanations) or information-oriented (reference guides). The community of PHES-ODM users also has access to an open forum where they can ask other users questions directly about cases not covered by the documentation (Discourse 2022).

The PHES-ODM in action

The structure of the PHES-ODM was designed with WBE for SARS-CoV-2 in mind. It is thus not surprising that the first application of the data model was for that purpose. The Ontario provincial government and the Public Health Agency of Canada adopted the first version of the PHES-ODM in 2021 for their respective wastewater-based epidemiology programs (Caminsky 2022). Both are planning to migrate to version 2.0. In the province of Québec, a university-led pilot study, the CentrEau-COVID project, has adopted the PHES-ODM to collect and harmonise wastewater-based epidemiology data from 31 sampling sites (CentrEau 2021). Outside Canada, version 2.0 of the PHES-ODM has been selected for the European Union Sewage Sentinel System for SARS-CoV-2 to collect and organise surveillance data from its member countries (EU4S-DEEP 2022). Though many other data models exist for environmental surveillance, the use of the PHES-ODM in large-scale surveillance WBE programs suggests robustness and ability to reliably identify, store, link and organise WBE data and metadata. Though PHES-ODM was designed to hold environmental public health data from various environmental compartments, the most common use of the data model remains WBE. Additional involvement from researchers and public health agencies sampling other environmental compartments would help stress-test these areas of the model and improve its robustness.

As environmental public health surveillance becomes more widespread and institutionalised, the number of tracked pathogens and environmental toxins will continue to grow, and vast amounts of data will be generated worldwide. The COVID-19 pandemic has shown how critical the principles of open science and open data are to accessing the full potential of this data for understanding the spread of disease among populations. By creating and maintaining tools that serve all stakeholders of environmental surveillance data, from field and laboratory workers to data analysts and decision-makers, the scientific community benefits from more efficient and frequent data exchanges, which will lead to greater understanding and benefit the public. By presenting the PHES-ODM, the authors hope to contribute to the flourishing of the field and convince the environmental surveillance community that open science within the field is not only desirable but also possible and accessible.

The authors would like to acknowledge the financial support of PHES-ODM development by the CIHR-funded network, CoVaRR-Net (Coronavirus Variants Rapid Response Network) [Funding Reference Number: 175622] and the Public Health Agency of Canada. The Ontario Ministry of the Environment, Conservation and Parks provides funding for the PHES-ODM validation toolkit. The Natural Sciences and Engineering Research Council of Canada, the Fonds de Recherche du Québec, and the Molson-Trottier Foundation supported salaries, scholarships, travel expenses, sample collection and laboratory analysis. Peter Vanrolleghem holds the Canada Research Chair on Water Quality Modelling. As an open project, the PHES-ODM benefits from the contribution of its steering committee (Big Life Lab 2022a), the core user group, and many people and organisations that provide comments and suggestions. The PHES-ODM is grateful for development platforms that provide freely available resources for open-source projects, including GitHub and Discourse.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Abernathy
C.
,
Haddad
I.
,
Martin
G.
,
Mertic
J.
&
Smith
J.
2022
Starting an Open Source Project
.
Open Source Guide. Available from: https://www.linuxfoundation.org/resources/open-source-guides/starting-an-open-source-project (accessed 16 December 2022)
.
Ahmed
W.
,
Bertsch
P. M.
,
Bivins
A.
,
Bibby
K.
,
Farkas
K.
,
Gathercole
A.
,
Haramoto
E.
,
Gyawali
P.
,
Korajkic
A.
,
McMinn
B. R.
,
Mueller
J. F.
,
Simpson
S. L.
,
Smith
W. J. M.
,
Symonds
E. M.
,
Thomas
K. V.
,
Verhagen
R.
&
Kitajima
M.
2020
Comparison of virus concentration methods for the RT-qPCR-based recovery of murine hepatitis virus, a surrogate for SARS-CoV-2 from untreated wastewater
.
Sci. Total Environ.
739
,
139960
.
https://doi.org/10.1016/j.scitotenv.2020.139960
.
Been
F.
,
Rossi
L.
,
Ort
C.
,
Rudaz
S.
,
Delémont
O.
&
Esseiva
P.
2014
Population normalization with ammonium in wastewater-based epidemiology: Application to illicit drug monitoring
.
Environ. Sci. Technol.
48
,
8162
8169
.
https://doi.org/10.1021/es5008388
.
Betancourt
W. Q.
,
Schmitz
B. W.
,
Innes
G. K.
,
Prasek
S. M.
,
Pogreba Brown
K. M.
,
Stark
E. R.
,
Foster
A. R.
,
Sprissler
R. S.
,
Harris
D. T.
,
Sherchan
S. P.
,
Gerba
C. P.
&
Pepper
I. L.
2021
COVID-19 containment on a college campus via wastewater-based epidemiology, targeted clinical testing and an intervention
.
Sci. Total Environ.
779
,
146408
.
https://doi.org/10.1016/j.scitotenv.2021.146408
.
Big Life Lab
2022a
Steering Group Members · Big-Life-Lab/PHES-ODM Wiki [WWW Document]. GitHub. Available from: https://github.com/Big-Life-Lab/PHES-ODM (accessed 23 December 2022)
.
Big Life Lab
2022b
Issues · Big-Life-Lab/PHES-ODM. [WWW Document]. GitHub. Available from: https://github.com/Big-Life-Lab/PHES-ODM. (accessed 30 December 2022)
.
Big Life Lab
2022c
Core User Working Group · Big-Life-Lab/PHES-ODM Wiki [WWW Document]. GitHub. Available from: https://github.com/Big-Life-Lab/PHES-ODM (accessed 30 December 2022)
.
Big Life Lab
2022d
PHES-ODM-Validation: A toolkit to assist in validating whether data conforms to the PHES-ODM dictionary. [WWW Document]. GitHub. Available from: https://github.com/Big-Life-Lab/PHES-ODM-Validation (Accessed 22 December 2022)
.
Big Life Lab
2023
PHES-ODM – Official site [WWW Document]. Available from: https://phes-odm.org/ (accessed 28 November 2023)
.
Burgelman
J.-C.
,
Pascu
C.
,
Szkuta
K.
,
Von Schomberg
R.
,
Karalopoulos
A.
,
Repanas
K.
&
Schouppe
M.
2019
Open science, open data, and open scholarship: European policies to make science fit for the twenty-first century
.
Front. Big Data
2
,
43
.
Caminsky
N. E.
2022
Wastewater Surveillance Research Group [WWW Document]. CoVaRR-Net. Available from: https://covarrnet.ca/wastewater-surveillance-research-group/ (accessed 9 December 2022)
.
CentrEau
2021
COVID – CentrEau [WWW Document]. Available from: https://centreau.org/covid/ (accessed 3 February 2022)
.
Coffman
M. M.
,
Guest
J. S.
,
Wolfe
M. K.
,
Naughton
C. C.
,
Boehm
A. B.
,
Vela
J. D.
&
Carrera
J. S.
2021
Preventing scientific and ethical misuse of wastewater surveillance data
.
Environ. Sci. Technol.
55
,
11473
11475
.
https://doi.org/10.1021/acs.est.1c04325
.
D'Aoust
P. M.
,
Graber
T. E.
,
Mercier
E.
,
Montpetit
D.
,
Alexandrov
I.
,
Neault
N.
,
Baig
A. T.
,
Mayne
J.
,
Zhang
X.
,
Alain
T.
,
Servos
M. R.
,
Srikanthan
N.
,
MacKenzie
M.
,
Figeys
D.
,
Manuel
D.
,
Jüni
P.
,
MacKenzie
A. E.
&
Delatolla
R.
2021a
Catching a resurgence: Increase in SARS-CoV-2 viral RNA identified in wastewater 48 h before COVID-19 clinical tests and 96 h before hospitalizations
.
Sci. Total Environ.
770
,
145319
.
https://doi.org/10.1016/j.scitotenv.2021.145319
.
D'Aoust
P. M.
,
Mercier
E.
,
Montpetit
D.
,
Jia
J.-J.
,
Alexandrov
I.
,
Neault
N.
,
Baig
A. T.
,
Mayne
J.
,
Zhang
X.
,
Alain
T.
,
Langlois
M.-A.
,
Servos
M. R.
,
MacKenzie
M.
,
Figeys
D.
,
MacKenzie
A. E.
,
Graber
T. E.
&
Delatolla
R.
2021b
Quantitative analysis of SARS-CoV-2 RNA from wastewater solids in communities with low COVID-19 incidence and prevalence
.
Water Res.
188
,
116560
.
https://doi.org/10.1016/j.watres.2020.116560
.
de Jonge
E. F.
,
Koelewijn
J. M.
,
van der Drift
A.-M. R.
,
van der Beek
R. F. H. J.
,
Nagelkerke
E.
&
Lodder
W. J.
2022
The detection of monkeypox virus DNA in wastewater samples in The Netherlands
.
Sci. Total Environ.
852
,
158265
.
https://doi.org/10.1016/j.scitotenv.2022.158265
.
Discourse
.
2022
Open Data Model for Public Health and Environmental Surveillance Discourse Community
.
[WWW Document] Open Data Model. Available from: https://odm.discourse.group/ (Accessed 30 December 2022)
.
Divio Technologies AB
,
2023
Documentation System [WWW Document]. Available from: https://documentation.divio.com/#about-the-system (accessed 20 November 2023)
.
Dong
E.
,
Du
H.
&
Gardner
L.
2020
An interactive web-based dashboard to track COVID-19 in real time
.
Lancet Infect. Dis.
20
,
533
534
.
https://doi.org/10.1016/S1473-3099(20)30120-1
.
Elsevier
2022
Novel Coronavirus Information Center [WWW Document]. Elsevier Connect. Available from: https://www.elsevier.com/connect/coronavirus-information-center (accessed 14 November 2022)
.
EU4S-DEEP
2022
The Public Health Environmental Surveillance (Canada) Open Data Model [WWW Document]. EU Wastewater Observatory for Public Health
. Available from: https://wastewater-observatory.jrc.ec.europa.eu/#/open-data-model (accessed 9 December 2022).
European Environmental Agency
2023
WISE - Water Information System for Europe [WWW Document]. European Information Gateway to Water Issues. Available from: https://water.europa.eu/ (Accessed 18 November 2023)
.
Fernandez-Cassi
X.
,
Scheidegger
A.
,
Bänziger
C.
,
Cariti
F.
,
Tuñas Corzon
A.
,
Ganesanandamoorthy
P.
,
Lemaitre
J. C.
,
Ort
C.
,
Julian
T. R.
&
Kohn
T.
2021
Wastewater monitoring outperforms case numbers as a tool to track COVID-19 incidence dynamics when test positivity rates are high
.
Water Res.
200
,
117252
.
https://doi.org/10.1016/j.watres.2021.117252
.
Fishman
N.
&
Stryker
C.
2020
Smarter Data Science
.
Electronic Edition. edn
.
John Wiley & sons, Ltd
,
Hoboken, NJ, USA
.
Gibas
C.
,
Lambirth
K.
,
Mittal
N.
,
Juel
M. A. I.
,
Barua
V. B.
,
Roppolo Brazell
L.
,
Hinton
K.
,
Lontai
J.
,
Stark
N.
,
Young
I.
,
Quach
C.
,
Russ
M.
,
Kauer
J.
,
Nicolosi
B.
,
Chen
D.
,
Akella
S.
,
Tang
W.
,
Schlueter
J.
&
Munir
M.
2021
Implementing building-level SARS-CoV-2 wastewater surveillance on a university campus
.
Sci. Total Environ.
782
,
146749
.
https://doi.org/10.1016/j.scitotenv.2021.146749
.
Global Water Pathogens Project
2020
Wastewater SPHERE [WWW Document]. Available from: https://sphere.waterpathogens.org/ (accessed 3 December 2021)
.
Griffiths
E.
,
Dooley
D.
,
Graham
M.
,
Van Domselaar
G.
,
Brinkman
F. S. L.
&
Hsiao
W. W. L.
2017
Context is everything: Harmonization of critical food microbiology descriptors and metadata for improved food safety and surveillance
.
Front. Microbiol.
8
.
https://doi.org/10.3389/fmicb.2017.01068
.
Griffiths
E. J.
,
Timme
R. E.
,
Mendes
C. I.
,
Page
A. J.
,
Alikhan
N.-F.
,
Fornika
D.
,
Maguire
F.
,
Campos
J.
,
Park
D.
,
Olawoye
I. B.
,
Oluniyi
P. E.
,
Anderson
D.
,
Christoffels
A.
,
da Silva
A. G.
,
Cameron
R.
,
Dooley
D.
,
Katz
L. S.
,
Black
A.
,
Karsch-Mizrachi
I.
,
Barrett
T.
,
Johnston
A.
,
Connor
T. R.
,
Nicholls
S. M.
,
Witney
A. A.
,
Tyson
G. H.
,
Tausch
S. H.
,
Raphenya
A. R.
,
Alcock
B.
,
Aanensen
D. M.
,
Hodcroft
E.
,
Hsiao
W. W. L.
,
Vasconcelos
A. T. R.
&
MacCannell
D. R.
&
on behalf of the Public Health Alliance for Genomic Epidemiology (PHA4GE) consortium
2022
Future-proofing and maximizing the utility of metadata: The PHA4GE SARS-CoV-2 contextual data specification package
.
GigaScience
11
,
giac003
.
https://doi.org/10.1093/gigascience/giac003
.
Haboub
K.
,
Maere
T.
,
Mercier
E.
,
D'Aoust
P. M.
,
Delatolla
R.
&
Vanrolleghem
P. A.
2022
Particle characterization and transport processes in view of modelling the fate of SARS-CoV-2 in sewer systems
. In
Proceedings of the 12th Urban Drainage Modeling Conference
,
January 10–12 2022
,
Costa Mesa, CA, United States
.
Hemalatha
M.
,
Kiran
U.
,
Kuncha
S. K.
,
Kopperi
H.
,
Gokulan
C. G.
,
Mohan
S. V.
&
Mishra
R. K.
2021
Surveillance of SARS-CoV-2 spread using wastewater-based epidemiology: Comprehensive study
.
Sci. Total Environ.
768
,
144704
.
https://doi.org/10.1016/j.scitotenv.2020.144704
.
Hughes B., Duong D., White B. J., Wigginton K. R., Chan E. M. G., Wolfe M. K. & Boehm A. B. 2022 Respiratory syncytial virus (RSV) RNA in wastewater settled solids reflects RSV clinical positivity rates. Environ. Sci. Technol. Lett. 9, 173–178. https://doi.org/10.1021/acs.estlett.1c00963.
Li X., Zhang S., Shi J., Luby S. P. & Jiang G. 2021 Uncertainties in estimating SARS-CoV-2 prevalence by wastewater-based epidemiology. Chem. Eng. J. 415, 129039. https://doi.org/10.1016/j.cej.2021.129039.
Maere T., Therrien J.-D. & Vanrolleghem P. 2022 Normalization Practices for SARS-CoV-2 Data in Wastewater-Based Epidemiology (Technical Report). National Collaborating Centre for Infectious Diseases, Public Health Agency of Canada, Canada.
Manuel D., Therrien J.-D., Nicolaï N., Bennett C., Thomson M., Sequeria Y. & Vanrolleghem P. A. 2021 Public Health Environmental Surveillance Open Data Model (PHES-ODM). Osf.io. https://doi.org/10.17605/osf.io/49z2b.
McCall A.-K., Palmitessa R., Blumensaat F., Morgenroth E. & Ort C. 2017 Modeling in-sewer transformations at catchment scale – implications on drug consumption estimates in wastewater-based epidemiology. Water Res. 122, 655–668. https://doi.org/10.1016/j.watres.2017.05.034.
McGeehin M. A., Qualters J. R. & Niskar A. S. 2004 National environmental public health tracking program: bridging the information gap. Environ. Health Perspect. 112, 1409–1413. https://doi.org/10.1289/ehp.7144.
Mercier E., D'Aoust P. M., Thakali O., Hegazy N., Jia J.-J., Zhang Z., Eid W., Plaza-Diaz J., Kabir M. P., Fang W., Cowan A., Stephenson S. E., Pisharody L., MacKenzie A. E., Graber T. E., Wan S. & Delatolla R. 2022 Municipal and neighbourhood level wastewater surveillance and subtyping of an influenza virus outbreak. Sci. Rep. 12, 15777. https://doi.org/10.1038/s41598-022-20076-z.
Michener W. K. 2006 Meta-information concepts for ecological data management. Ecol. Inf. 1, 3–7. https://doi.org/10.1016/j.ecoinf.2005.08.004.
Morvan M., Lo Jacomo A., Souque C., Wade J. B. C., Hoffmann T., Koen J. B. C., Pouwels M., Singer A., Bunce J., Engeli A., Grimsley J. & Bristol D. 2021 Estimating SARS-CoV-2 prevalence from large-scale wastewater surveillance: Insights from combined analysis of 44 sites in England. https://doi.org/10.21203/RS.3.RS-770963/V1.
National Academies of Sciences, Engineering and Medicine 2018 Open Science by Design: Realizing a Vision for 21st Century Research. National Academies Press, Washington, DC, United States. https://doi.org/10.17226/25116.
Naughton C. C., Roman F. A., Alvarado A. G. F., Tariqi A. Q., Deeming M. A., Bibby K., Bivins A., Rose J. B., Medema G., Ahmed W., Katsivelis P., Allan V., Sinclair R., Zhang Y. & Kinyua M. N. 2021 Show us the data: Global COVID-19 wastewater monitoring efforts, equity, and gaps. https://doi.org/10.1101/2021.03.14.21253564.
NORMAN Association 2011 NORMAN Position Paper: Towards a harmonized approach for collection and interpretation of data on emerging substances in support of European environmental policies.
NORMAN Network 2020 NORMAN Database System – SARS-CoV-2 [WWW Document]. NORMAN Network of Reference Laboratories, Research Centres and Related Organisations for Monitoring of Emerging Environmental Substances. Available from: https://www.norman-network.com/nds/sars_cov_2/ (accessed 3 December 2021).
Plana Q. 2015 Automated Data Collection and Management at Enhanced Lagoons for Wastewater Treatment. M.Sc. Thesis. Université Laval, Québec, QC, Canada.
Plana Q., Alferes J., Fuks K., Kraft T., Maruéjouls T., Torfs E. & Vanrolleghem P. A. 2019 Towards a water quality database for raw and validated data with emphasis on structured metadata. Water Qual. Res. J. 54, 1–9. https://doi.org/10.2166/wqrj.2018.013.
Prado T., Fumian T. M., Mannarino C. F., Resende P. C., Motta F. C., Eppinghaus A. L. F., Chagas do Vale V. H., Braz R. M. S., de Andrade J. d. S. R., Maranhão A. G. & Miagostovich M. P. 2021 Wastewater-based epidemiology as a useful tool to track SARS-CoV-2 and support public health policies at municipal level in Brazil. Water Res. 191, 116810. https://doi.org/10.1016/j.watres.2021.116810.
Robinson I., Webber J. & Eifrem E. 2015 Graph Databases. O'Reilly, Sebastopol, CA, USA.
Schang C., Crosbie N. D., Nolan M., Poon R., Wang M., Jex A., John N., Baker L., Scales P., Schmidt J., Thorley B. R., Hill K., Zamyadi A., Tseng C. W., Henry R., Kolotelo P., Langeveld J., Schilperoort R., Shi B., Einsiedel S., Thomas M., Black J., Wilson S. & McCarthy D. T. 2021 Passive sampling of SARS-CoV-2 for wastewater surveillance. Environ. Sci. Technol. 55, 10432–10441. https://doi.org/10.1021/ACS.EST.1C01530.
Sherchan S. P., Shahin S., Ward L. M., Tandukar S., Aw T. G., Schmitz B., Ahmed W. & Kitajima M. 2020 First detection of SARS-CoV-2 RNA in wastewater in North America: A study in Louisiana, USA. Sci. Total Environ. 743, 140621. https://doi.org/10.1016/j.scitotenv.2020.140621.
Teytelman L., Team P. I. & Broellochs A. 2020 How to Make Your Protocol More Reproducible, Discoverable, and User-Friendly.
The Lancet 2022 The Lancet COVID-19 Content Hub [WWW Document]. Available from: https://www.thelancet.com/coronavirus (accessed 14 November 2022).
Therrien J. D., Maere T., Sánchez-Quete F., Tsitouras A., Goitom E., Cloutier F., Dufour D., Proulx F., Nicolaï N., Philippe R., Tohidi M., Dorner S., Frigon D. & Vanrolleghem P. A. 2021 SARS-CoV-2 wastewater surveillance data and metadata in the Open Data Model format. Part 1: Québec City. https://doi.org/10.5281/zenodo.5597158.
US CDC 2021 National Wastewater Surveillance System (NWSS) – a new public health tool to understand COVID-19 spread in a community [WWW Document]. Available from: https://www.cdc.gov/healthywater/surveillance/wastewater-surveillance/wastewater-surveillance.html (accessed 17 January 2022).
Vanrolleghem P. A. & Haddad S. 2021 Collecting, quality ensuring and transferring SARS-CoV-2 data for decision-making by public health authorities – The Québec experience. In: The EU Sewage Sentinel System for SARS-CoV-2 (EU4S) Solutions and Science for Support. 5th Town Hall Meeting. European Commission, Webex Meeting (Gawlik B. M., Remonnay I. & Rubini A., eds). Brussels, Belgium: European Joint Research Center.
Wade M. J., Lo Jacomo A., Armenise E., Brown M. R., Bunce J. T., Cameron G. J., Fang Z., Farkas K., Gilpin D. F., Graham D. W., Grimsley J. M. S., Hart A., Hoffmann T., Jackson K. J., Jones D. L., Lilley C. J., McGrath J. W., McKinley J. M., McSparron C., Nejad B. F., Morvan M., Quintela-Baluja M., Roberts A. M. I., Singer A. C., Souque C., Speight V. L., Sweetapple C., Walker D., Watts G., Weightman A. & Kasprzyk-Hordern B. 2022 Understanding and managing uncertainty and variability for wastewater monitoring beyond the pandemic: Lessons learned from the United Kingdom national COVID-19 surveillance programmes. J. Hazard. Mater. 424, 127456. https://doi.org/10.1016/j.jhazmat.2021.127456.
West M. 2011 Developing High Quality Data Models. Elsevier, Amsterdam, The Netherlands.
Wilkinson M. D., Dumontier M., Aalbersberg I. J., Appleton G., Axton M., Baak A., Blomberg N., Boiten J.-W., da Silva Santos L. B., Bourne P. E., Bouwman J., Brookes A. J., Clark T., Crosas M., Dillo I., Dumon O., Edmunds S., Evelo C. T., Finkers R., Gonzalez-Beltran A., Gray A. J. G., Groth P., Goble C., Grethe J. S., Heringa J., ’t Hoen P. A. C., Hooft R., Kuhn T., Kok R., Kok J., Lusher S. J., Martone M. E., Mons A., Packer A. L., Persson B., Rocca-Serra P., Roos M., van Schaik R., Sansone S.-A., Schultes E., Sengstag T., Slater T., Strawn G., Swertz M. A., Thompson M., van der Lei J., van Mulligen E., Velterop J., Waagmeester A., Wittenburg P., Wolstencroft K., Zhao J. & Mons B. 2016 The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3, 160018. https://doi.org/10.1038/sdata.2016.18.
Zuniga-Montanez R., Coil D. A., Eisen J. A., Pechacek R., Guerrero R. G., Kim M., Shapiro K. & Bischel H. N. 2022 The challenge of SARS-CoV-2 environmental monitoring in schools using floors and portable HEPA filtration units: Fresh or relic RNA? PLOS ONE 17, e0267212. https://doi.org/10.1371/journal.pone.0267212.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).

Supplementary data