Abstract
The recent SARS-CoV-2 pandemic has sparked the adoption of wastewater-based epidemiology (WBE) as a low-cost way to monitor the health of populations. In parallel, the pandemic has encouraged researchers to openly share their data to serve the public better and accelerate science. However, environmental surveillance data are highly dependent on context and are difficult to interpret meaningfully across sites. This paper presents the second iteration of the Public Health Environmental Surveillance Open Data Model (PHES-ODM), an open-source dictionary and set of data tools to enhance the interoperability of environmental surveillance data and enable the storage of contextual (meta)data. The data model describes how to store environmental surveillance program data, metadata about measurements taken on various specimens (water, air, surfaces, sites, populations) and data about measurement protocols. The model provides software tools that support the collection and use of PHES-ODM formatted data, including performing PCR calculations and data validation, recording data into input templates, generating wide tables for analysis, and producing SQL database definitions. Fully open-source and already adopted by institutions in Canada, the European Union, and other countries, the PHES-ODM provides a path forward for creating robust, interoperable, open datasets for environmental public health surveillance for SARS-CoV-2 and beyond.
HIGHLIGHTS
The PHES-ODM supports the collection, storage, and use of environmental public health data.
The PHES-ODM defines a standardized dictionary to describe environmental surveillance programs.
The PHES-ODM provides metadata attributes to facilitate the interpretation of various data, including sampling, protocols, and geolocation.
The PHES-ODM supports open-science principles, allowing its use as a platform for additional software to facilitate data entry, validation and sharing.
INTRODUCTION
The SARS-CoV-2 pandemic has created a need for simple and cost-effective ways to monitor the health of populations. Wastewater-based epidemiology (WBE) is one such monitoring approach that has garnered broad appeal in recent years (Hemalatha et al. 2021; Hill et al. 2021). WBE consists of measuring biomarkers (substances indicating the health status of a population, e.g., viral RNA) in wastewater to gain insights into the health of the population producing that wastewater. WBE of SARS-CoV-2 has been adopted at various scales, from single laboratories to national programs (Morvan et al. 2021; Prado et al. 2021; US CDC 2021), in sewersheds covering buildings (Vanrolleghem & Haddad 2021), campuses (Betancourt et al. 2021; Gibas et al. 2021), and often, entire cities (Sherchan et al. 2020; D'Aoust et al. 2021a; Fernandez-Cassi et al. 2021). Because of this mass adoption, projects have even sprung up with the intention of aggregating results worldwide (Naughton et al. 2021).
However, measuring biomarkers in wastewater poses a significant challenge: the complexity of the wastewater matrix and its collection system makes interpreting the obtained measurements particularly difficult. Sewer transport, biochemical decay, dynamic water usage patterns, rainfall, runoff, and snow melt significantly affect biomarker measurements (Hill et al. 2021; Haboub et al. 2022). Biomarkers also sorb onto particles to varying degrees and are therefore more or less affected by suspension, deposition and resuspension dynamics (McCall et al. 2017). Variability in sampling strategies (grab, composite, passive), laboratory assays, and quality control measures also complicate comparing WBE data from various sites or labs, let alone countries (Li et al. 2021; Wade et al. 2022).
Selection of factors to consider while interpreting biomarker measurements.
It is also essential to recognise that not every data point is created equal; laboratory issues, sensor malfunctions and human error are always possible and are best identified by the people who collected the data. It is, therefore, essential to include a record of their judgement in the metadata to support later interpretations of the data.
Capturing metadata helps ensure that the data are used correctly, are compared with other data only when appropriate (i.e., when the measurement context is similar enough between different campaigns or different sewersheds) and can still be used confidently many years after their original collection date, when the persons familiar with the details of a sampling campaign may no longer be available (Michener 2006).
As efforts have ramped up to produce reliable measurements of SARS-CoV-2 through environmental surveillance during the COVID-19 pandemic, researchers worldwide have shown great openness and willingness to collaborate. Sampling methods (Schang et al. 2021) and molecular assays (Ahmed et al. 2020) were developed and shared to accelerate the global fight against the disease. Practitioners are thus faced with the sizable challenge of stitching together many data sources, measurement campaigns, and assay results and inserting them into the larger framework of environmental public health surveillance – defined by McGeehin et al. (2004) as the ongoing collection, integration, analysis, and dissemination of data from environmental hazard monitoring, human exposure tracking, and health effect surveillance.
The scope of WBE is also continuously expanding. More complex assays are being developed (e.g., gene sequencing), and disease agents other than SARS-CoV-2 are being targeted, e.g., the respiratory syncytial virus (Hughes et al. 2022), influenza (Mercier et al. 2022), and the MPox virus (de Jonge et al. 2022). With WBE becoming a part of more extensive environmental surveillance programs, other environmental components are also being probed (e.g., air and surfaces; Zuniga-Montanez et al. 2022). A data model that can accommodate the breadth of measurements taken in WBE is therefore required. Moreover, the structure of this model must pay particular attention to capturing data quality and must accept varied and extensive metadata. In this way, WBE resembles other environmental disciplines, such as microbial food safety (Griffiths et al. 2017), where international collaboration is crucial, and wastewater treatment, where the harsh operational conditions of sensors make context essential for the interpretation of water quality measurements (Plana et al. 2019).
The SARS-CoV-2 pandemic has created data repositories aggregating open case data from around the world (Dong et al. 2020). WBE researchers have relied on these datasets and have generated their own datasets that have also been aggregated and published (Naughton et al. 2021; Therrien et al. 2021). However, a consensus has yet to emerge on what data and metadata are required to maximise the utility of those datasets. This has proven true for many sub-areas of environmental public health in general. In WBE, these sub-areas include data correction (choosing what combination of wastewater measurements can help determine how viral signals are affected by processes other than viral shedding in the population) (Been et al. 2014; Maere et al. 2022); the creation of adequate quality assurance and quality control (QA/QC) measures, protocols, and reports; and the development of processes to meaningfully compare biomarker measurements across sampling locations. Therefore, a data model compatible with environmental public health surveillance must allow the rapid integration of new types of data and metadata into its structure as the community develops and includes these elements in its analyses. The rest of this paper will describe how the Public Health Environmental Surveillance Open Data Model (henceforth, PHES-ODM; Manuel et al. 2021) uses open science to address the challenges of environmental public health surveillance.
METHODS
The importance of open data and open science
Openly sharing data encourages reuse via standard data formats and markup languages. It also removes barriers to access (paywalls, restrictive licences). This increased access to data speeds up the scientific process, encourages the detection of inaccuracies, and stimulates the development of new lines of enquiry (National Academies of Science, Engineering and Medicine 2018; Burgelman et al. 2019). In matters of public health, where timely decisions must be made based on the most accurate data possible, openly sharing new scientific results and datasets can benefit the general welfare considerably. The SARS-CoV-2 pandemic includes many examples of the scientific community accelerating the adoption of open science (Elsevier 2022; The Lancet 2022). The creation of easily shareable datasets helps the development of open science by reducing the burden on researchers taking on the arduous task of integrating data from various sources.
The FAIR guidelines have been proposed by Wilkinson et al. (2016) to define the features of datasets that facilitate (or discourage) informed data reuse. The guidelines suggest four main characteristics of suitable open datasets:
Findability: Pieces of data should have unique identifiers, contain searchable metadata and be indexable.
Accessibility: It should be possible to access and transfer data using open protocols, such as those widely used on the internet (HTTP, FTP, text-based formats).
Interoperability: The data use an open dictionary that clearly identifies the meaning of data and metadata contained in the dataset.
Reusability: The data contain rich metadata that enable researchers beyond the original dataset's authors to understand the data's meaning and judge its applicability to their research.
A good data model for the public health environmental surveillance field should strive to follow the FAIR guidelines. Its structure and tooling should also be designed to support the needs of the multiple stakeholders that create, manipulate, and consume those data: examples of stakeholders include utility workers contributing to sampling efforts, laboratory professionals, mathematical modellers, and public health officials.
Beyond open data, it is important to acknowledge the importance of software in open science. It fosters transparency and reproducibility of scientific results and improves the reliability of scientific software by allowing anyone to inspect, learn from, and improve it. Similarly, it speeds up science by enabling researchers to build upon the software tools of their peers (National Academies of Science, Engineering and Medicine 2018). However, the success of open software initiatives relies on more than merely publishing source code. Successful open projects fulfil a need shared by multiple people; they make the contribution process transparent and accessible to the community and have a well-defined technical process, governing structure, and leadership (Abernathy et al. 2022).
What are data models, exactly?
According to West (2011), data models define data's structure and intended meaning. They help describe a domain of activity by listing the entities involved, the relevant attributes of those entities, and the relationships between them. Data models offer standardised dictionaries of terms with precise definitions to support the interpretation of data records. Data models are closely related to ontologies: whereas ontologies define entities and relationships in a general and reusable way (Fishman & Stryker 2020), data models are, in contrast, designed to be specific to a particular application. Ontologies thus support the creation of data models but are not data models themselves.
Requirements for an environmental public health data model
To summarise, a data model for environmental public health surveillance should support the inclusion of data for many types of substances, biomarkers and pathogens and allow for the rapid inclusion of new measures over time to respond to emerging concerns. It should qualify the sampling site and its relationship with its population. It should let users describe their samples and the sampling methods employed. Beyond biomarkers, there should also be provisions to record measurements that help characterise or contextualise biomarker information (e.g., on-site measurements, population information, sample composition and laboratory values that affect the reported biomarker measurements). Sampling and analysis protocols should be documented. Given the amount of information required to create complete records, steps should be taken to make the model reasonably approachable via templates, supporting documentation, or dedicated data input tools.
A data model for environmental public health surveillance should aim to uphold the FAIR principles by allowing data owners to identify themselves and the license that governs their data. The model should itself have an explicit license to promote uptake by new users. As public health data can be sensitive, data that identifies specific individuals or groups can cause harm, such as stigmatisation and dehumanisation (Coffman et al. 2021). It is, therefore, critical that data models that hold public health data have provisions for anonymisation, implement restrictions to sharing, and provide ethical usage guidelines.
Currently available data models
Environmental public health datasets currently fall into two categories: those in ad-hoc formats unique to a data producer or sampling campaign and those that conform to standardised formats. Multiple formats have some level of compatibility with the requirements of an environmental public health data model proposed above, with structures that emphasise different aspects according to the model's original intended audience. Table 1 describes some of the available data models in the field, emphasising those developed during the SARS-CoV-2 pandemic.
Table 1. Feature set of currently available environmental public health data models
Feature | PHES-ODM | WISE | NORMAN SCORE | W-SPHERE | NWSS | PHA4GE |
---|---|---|---|---|---|---|
Reference | Manuel et al. (2021) | European Environmental Agency (2023) | NORMAN Network (2020) | Global Water Pathogens Project (2020) | US CDC (2021) | Griffiths et al. (2022) |
Main intended audience | Environmental public health surveillance practitioners | Governmental agencies | Ecotoxicologists, SARS-CoV-2 template for WBE practitioners | WBE practitioners | WBE practitioners | Environmental genomics |
Public dictionary of headers/tables | Yes | Yes | No | Yes | Yes | Yes |
Public dictionary of values? | Yes | No | No | No | Yes | Yes |
Public database definition? | Yes | No | Yes | No | No | No |
Public data conversion tools? | Yes | No | No | No | No | Yes |
Public data validation tools? | Dictionary, software tool, template | Not found | Template | Template | Dictionary | Dictionary, software tool, template |
Public data collection templates? | Yes | No | Yes | Yes | No | Yes |
Governance and development | Open source | Inter-institutional | Inter-institutional | Internal | Internal | Open source |
Model license | CC-BY4 | Not found | Not found for the model, but the template is open-access | Not found | Not found | CC-BY4 |
Rights management | Element-level (any row, header or combination) | Not found | Dataset level | Dataset level | Dataset level | Dataset level |
Environmental compartments | Various | Various (water bodies) | Various, but only wastewater in the template | Wastewater | Wastewater | Various |
Pathogen measurements | Any pathogen in the dictionary | Not found | Yes, but only SARS-CoV-2 in template | SARS-CoV-2-specific | Multiple | SARS-CoV-2, MPOX, AMR |
Measurement methods | Yes | Yes | Yes, but only PCR and sequencing-specific in the template | PCR and sequencing-specific | PCR and sequencing-specific | PCR and sequencing-specific |
In-sample measurements | Any measure in the dictionary | Water quality | Water quality, but only PCR in the template | PCR and sequencing | PCR and sequencing, pH, Conductivity, TSS | PCR and sequencing |
Collection site information | Yes | Yes | Yes | Yes | Yes | Yes |
On-site measurements | Any measure in the dictionary | Water quality | Flow, Weather, COD, TSS, NH4+-N, Water temperature | Flow | Flow, water temperature | No |
Population count | Served by site or within a geographic region | No | Served by site | Served by site | Served by site | No |
Sewer network information | No | No | No | No | Average wastewater travel time, industrial input, stormwater input | No |
Sample and sampling method | Yes | Yes | Yes | Yes | Yes | Yes |
Population health data | Any measure in the dictionary applied to a geographic region | No | SARS-CoV-2 prevalence | No | No | No |
The Water Information System for Europe (WISE; European Environmental Agency 2023) is a repository of dashboards, maps and datasets about water quality in water bodies across Europe. It is intended to carry water quality information for various substances, but it is not geared towards emerging pathogens. It is, therefore, an environmental surveillance system but not an environmental public health model. Each WISE dataset has its unique schema (dictionary, collection of data fields, and specified relationships between fields), which complicates data harmonisation.
The Network of Reference Laboratories, Research Centres and Related Organisations for Monitoring of Emerging Environmental Substances (NORMAN) database system (NORMAN Association 2011) is another European system that collects data concerning a wide array of emerging contaminants in multiple environmental compartments. It uses standardised data collection templates targeting specific contaminants rather than a uniform data model. As such, though it has an extensive overall dictionary, it restricts the effective dictionary size for a given substance to a fit-for-purpose, but fairly limiting level. Similarly, the Wastewater SARS Public Health Environmental Response (W-SPHERE) repository (Global Water Pathogens Project 2020) also offers a fit-for-purpose data template, emphasising collecting minimal but strictly necessary metadata to support data aggregation and analysis at a global scale.
The National Wastewater Surveillance System (NWSS) (US CDC 2021), deployed in the United States, is designed to collect pathogen data from wastewater only. It can store metadata concerning analysis methods for detecting multiple pathogens. However, its structure is primarily designed for biomarkers that can be analysed with PCR-based methods. The structure of NWSS resembles the first version of the PHES-ODM in many ways (what tables are present, contents of the dictionary) as both model development teams consulted with each other while designing their respective first versions. However, the institutional governance of NWSS did not allow for community input, which made it difficult to modify the model to accommodate demands for new targets and metadata coming from the WBE community (genomics, including mutations, gene sequences, variants, and proteins, as well as environmental compartments beyond wastewater).
Finally, the Public Health Alliance for Genomic Epidemiology (PHA4GE; Griffiths et al. 2022) data dictionary is based on a per-pathogen template system. Its dictionary strongly emphasises interoperability: each term is attached to a reference entry in formal ontologies, and multiple fields allow PHA4GE data to be connected to information found in other databases. Its origins in the genomics field are evident from its concern for open science and software and for mapping between formats and databases, which are critical components of any genomics analysis workflow. However, measurements not linked directly to measuring a pathogen (i.e., on-site environmental quality measurements or physicochemical characterisation of environmental samples) fall outside the scope of the PHA4GE model.
RESULTS AND DISCUSSION
The PHES-ODM
The first iteration of the PHES-ODM was made available on GitHub in 2020 with the stated goal of making the collection and sharing of high-quality data as easy as possible for researchers and organisations working in WBE for SARS-CoV-2. The model initially focused on capturing data and metadata from PCR-based viral measurements, as well as water quality measurements in samples and on-site at the sampling point. As the types of assays being performed on WBE samples became more varied and surveillance of SARS-CoV-2 expanded to additional environmental compartments, expansions of the PHES-ODM became necessary, and its ontology was revised to capture relationships inherent in environmental public health surveillance in general. This paper aims to describe the main features of the PHES-ODM and show how these features support environmental surveillance. In particular, the goals of the PHES-ODM 2.0 are to:
Expand the types of data and metadata that can be recorded in the PHES-ODM beyond WBE.
Facilitate the addition of new terms in the dictionary as the community requests them.
Support common data model uses, such as data input, validation, analysis, and aggregation.
Support open science by ensuring that the PHES-ODM follows the FAIR guidelines, facilitates the creation of FAIR datasets, and employs the principles of open software development.
Working towards achieving these goals has led the PHES-ODM team to develop the following:
1. A basic ontology of the environmental surveillance field.
2. A data model that maps onto this ontology, supports many use cases, and is general enough to support its various sub-disciplines.
3. A dictionary of terms that defines each element of the data model, with a practical emphasis on the sub-discipline currently most commonly practised among its stakeholders, i.e., WBE for SARS-CoV-2.
4. Tools that support the consistent use of the PHES-ODM for data collection, analysis, and reporting.
Open-source management
The management of the PHES-ODM project follows an open-science approach. The data model and all its supporting software tools are available on an open software platform (Big Life Lab 2023). Beyond that, however, the project also fosters community participation. An international steering committee composed of core developers, academics, and institutional users of the data model guides the long-term goals and scope of the project (Big Life Lab 2022a). In parallel, users of the model can request new features and changes to the model or its tools using the GitHub issue tracker (Big Life Lab 2022b), while discussions of the model and the best practices surrounding it occur on the Discourse platform (Discourse 2022). The bulk of the development is carried out by the core development team (Big Life Lab 2022c), whose work undergoes public review by its users before it is approved for inclusion into official releases. The project also hosts working group meetings where participants can weigh in on ways the model could evolve to suit their needs better. Participation in these bi-weekly meetings is open, with regular participants being actively involved in environmental public health surveillance programs worldwide. These processes help ensure that the evolution of the PHES-ODM follows the needs of its community of users. They also make it possible for anyone to witness the development process and trust in the reliability and long-term viability of the model.
Scope
Environmental public health surveillance is, by necessity, multidisciplinary as it encompasses epidemiological research, microbiology, chemistry, mathematical modelling and disciplines directly related to the environmental compartment being probed (e.g., WBE involves sewer characterisation, whereas measuring soil contamination would involve geology). Answering six questions identifies the elements required to describe the provenance and meaning of the data captured in such varied systems (Plana 2015):
What data are being stored: The model is used to store measurements. Measurements are realisations of measures (i.e., things that can be quantified or qualified). The model also holds information on the specimens on which the measurements are taken (i.e., the samples taken from the environment to perform the measurements).
Where were the measurements taken: Measurements are usually done on samples. However, they can also be done on-site. Some measures may also describe the region represented by the site.
How were the data collected: Samples have specific collection methods, and each type of measurement also has its method. Sometimes, those methods rely on instruments.
Why were the data collected: Data are always collected with a purpose (intended use).
Who collected the data: Every person or organisation in the custody chain of measurements and samples must be known to maintain trust. That includes data owners, creators, custodians, and funders.
When were the data collected: Samples are collected at a single point in time (grab samples) or over several hours or days. Analyses are then carried out on the samples. These may also take a substantial amount of time. Finally, once a result has been obtained, it is reported to a data repository. For public health data, dates can have multiple meanings (e.g., disease onset or reporting date).
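The six questions above can be read as a checklist of what a single record needs to carry. The following Python sketch illustrates this under stated assumptions: the class, its field names, and the example values are all invented for illustration and are not the attribute names or data of the PHES-ODM dictionary.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class MeasurementRecord:
    """Hypothetical record; field names are NOT actual PHES-ODM attributes."""
    measure: str               # what: the measure being realised
    value: float               # what: the reported value
    unit: str                  # what: unit of the reported value
    site_id: str               # where: the sampling site
    sample_id: Optional[str]   # where: None for an on-site measurement
    method: str                # how: collection or analysis method
    purpose: str               # why: intended use of the data
    organisation: str          # who: party in the custody chain
    collected_at: datetime     # when: sample collection
    analysed_at: datetime      # when: laboratory analysis
    reported_at: datetime      # when: report to the data repository


def days_from_collection_to_report(record: MeasurementRecord) -> float:
    """Delay between sample collection and reporting, in days."""
    delta = record.reported_at - record.collected_at
    return delta.total_seconds() / 86400.0


# Invented example values, for illustration only.
example = MeasurementRecord(
    measure="example viral target",
    value=1200.0,
    unit="gene copies/L",
    site_id="site-001",
    sample_id="sample-0001",
    method="composite sample, PCR",
    purpose="routine surveillance",
    organisation="example laboratory",
    collected_at=datetime(2023, 1, 1, 8, 0),
    analysed_at=datetime(2023, 1, 3, 8, 0),
    reported_at=datetime(2023, 1, 4, 8, 0),
)
```

Keeping the three dates separate, rather than storing a single timestamp, makes delays such as the time between collection and reporting directly computable.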
Main entities and relationships of the PHES-ODM's proposed environmental surveillance system ontology.
Data model attributes
For a data model to be relevant, its entities must be described with relevant attributes. Once entities, relationships and attributes are identified, they can be formally defined in a dictionary. The dictionary is thus the central repository for all data model elements. In it, elements can be given unique identifiers, standard names, and a description as a first step to utilising them in a data model. The PHES-ODM dictionary is defined in the ‘Parts’ table. Each row of the table is called a part. Each part describes an entity, an attribute, or an accepted value for an attribute.
Every concept used in the PHES-ODM is defined using the ‘part’ data structure. Parts can thus be considered building blocks that can be assembled to create larger structures within the PHES-ODM. For example, tables are a collection of parts of the ‘attribute’ type, and category sets are collections of parts of the ‘category’ type. Data input templates can also be defined as a list of parts to which users provide values.
Parts have over 50 attributes in the PHES-ODM, some of which are optional. The most important attributes are the following:
partID: the unique identifier of each part.
partLabel: an English language identifier for the part.
partDescription: an explanation of what the part represents.
partInstruction: explanation of how to use the part within the PHES-ODM.
Relationships are created by simply adding the partID of a part to the list of attributes of another. For one-to-many and many-to-many relationships to be representable in this way, it is necessary to define another part type called a set.
Collections of parts are called ‘sets’ within the PHES-ODM. The identity of the set is represented as a regular part. However, the contents of sets are defined in the sets table. Each row of the sets table associates the unique identifier of a set to the identifier of one of its members. This arrangement allows for the creation of arbitrarily large sets. It also allows any part to be included in an arbitrary number of sets.
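The relationship between the parts table and the sets table can be illustrated with a minimal Python sketch. The four attribute names come from the list above; everything else (the part identifiers, the descriptions, and the helper functions) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    """One row of the parts table, reduced to the four attributes listed above."""
    partID: str
    partLabel: str
    partDescription: str
    partInstruction: str = ""


# Hypothetical parts: two sets (each identified by a regular part) and three members.
parts = {p.partID: p for p in [
    Part("exampleCollectionSet", "Collection methods", "Set of sample collection methods."),
    Part("exampleQuickSet", "Quick methods", "Set of methods needing no equipment on site."),
    Part("grab", "Grab sample", "Sample collected at a single point in time."),
    Part("composite", "Composite sample", "Sample collected over a period of time."),
    Part("passive", "Passive sample", "Sample collected with a passive sampler."),
]}

# Sets table: each row pairs a set identifier with the identifier of ONE member.
sets_table = [
    ("exampleCollectionSet", "grab"),
    ("exampleCollectionSet", "composite"),
    ("exampleCollectionSet", "passive"),
    ("exampleQuickSet", "grab"),
]


def members_of(set_id, rows):
    """Return the identifiers of all parts belonging to a set."""
    return [member for sid, member in rows if sid == set_id]


def sets_containing(part_id, rows):
    """Return the identifiers of all sets that include a given part."""
    return [sid for sid, member in rows if member == part_id]
```

Because membership is stored one row per (set, member) pair, adding a part to another set only requires appending a row, which is how a part can belong to an arbitrary number of sets.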
Table types
Since PHES-ODM entities have a standard number of attributes, which always have a single value, all entities of the same type can be stored together in tabular form. There are three types of tables in the PHES-ODM. Result tables hold measurements, sampling reports, and quality information. These tables are thus frequently updated in a surveillance program. Conversely, Program description tables contain data that stay constant for a given surveillance program: sampling locations, organisations in charge of the campaign, the contact information of the people responsible, the instruments used to take samples and to perform analyses, and the identity of the final dataset where it will all be recorded. Finally, Dictionary tables are tables where dictionary components are defined (sets and parts). Internationalisation information (list of languages, translations of the dictionary parts, country codes) is also held there. Dictionary table entries are defined by the PHES-ODM maintainers and updated in every release, whereas entries in the other two table types (result and program description) store actual user-defined data. PHES-ODM dictionary tables comprise over one thousand parts as of November 2023. The complete entity relationship diagram of the PHES-ODM can be found in Supplement 1.
Persons with different roles are typically responsible for each of the three table types. Dictionary tables are updated by PHES-ODM maintainers. Data custodians oversee the program description tables: they are aware of all sampling locations and can assign them unique identifiers, and they know what laboratory protocols will be used to gather measurements and samples. Field and lab workers, on the other hand, are responsible for generating the information that goes into the result tables. Most users, therefore, do not have to interact with the entirety of the PHES-ODM, which vastly simplifies its use for individual persons.
Entering data into the PHES-ODM
Because the PHES-ODM is organised into tables, data entry is relatively straightforward. The available columns indicate what information should be collected for each entity type. The parts table also supports data entry – to fill an attribute, one can look in the parts table for the attribute's entry (row). In this entry, one will find the expected data type for the attribute's value. If the data type is categorical, the set of all correct values for that attribute is linked to the entry via one of its properties (‘mmaSet’, short for ‘measure, method or attribute set’). Linking parts to their corresponding values reduces opportunities for input errors and paves the way for automatic data validation.
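A minimal sketch of how such a lookup could drive automatic validation is given below. Only the ‘mmaSet’ property name comes from the text above; the `dataType` key, the attribute names, and the set contents are hypothetical stand-ins chosen for the example.

```python
# Hypothetical excerpt of a parts table, keyed by attribute identifier.
# 'mmaSet' is the property named in the text; 'dataType' is an assumed name.
parts = {
    "collectionType": {"dataType": "categorical", "mmaSet": "collectionSet"},
    "value": {"dataType": "float", "mmaSet": None},
}

# Hypothetical set contents (set identifier -> allowed category values).
sets = {"collectionSet": {"grab", "composite"}}


def validate(attribute, raw_value, parts, sets):
    """Return (is_valid, message) for one attribute value."""
    entry = parts.get(attribute)
    if entry is None:
        return False, f"unknown attribute {attribute!r}"
    if entry["dataType"] == "categorical":
        # Categorical values must belong to the set linked through mmaSet.
        allowed = sets.get(entry["mmaSet"], set())
        if raw_value not in allowed:
            return False, f"{raw_value!r} is not in set {entry['mmaSet']!r}"
        return True, ""
    if entry["dataType"] == "float":
        try:
            float(raw_value)
        except (TypeError, ValueError):
            return False, f"{raw_value!r} is not a number"
        return True, ""
    # Data types not handled by this sketch are accepted without checks.
    return True, ""
```

For example, `validate("collectionType", "grab", parts, sets)` passes, whereas a misspelt category is rejected with a message naming the set that was consulted.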
The abundance of fields in the PHES-ODM reflects its desire to be robust, but only some users will be interested in collecting the complete set of available metadata attributes. Thus, the PHES-ODM aims for flexibility by defining most fields as optional. However, some fields are mandatory because they help uniquely identify data or minimally define an entity.
Recording measurements and samples
Attributes characterising the reported values in the PHES-ODM. Top: Attributes of the measure reports. Bottom: Attributes of the sample reports.
The measure and sample result tables are each paired with a companion table that helps define relationships between reports. For measure reports, that table holds measure report sets. Sets allow measurements that should be read together to be explicitly linked with a common identifier (e.g., points in a calibration curve). Similarly, samples can be linked together in the sample relationships table by denoting their relationship to each other (e.g., split from a main sample, pooled together, or field replicates).
Environmental public health surveillance is a broad, multidisciplinary field. A wide variety of biomarkers, supporting measurements and types of samples may, therefore, be recorded during a measurement campaign. To assist PHES-ODM users and only reveal dictionary items relevant to a particular campaign or laboratory at any given time, the PHES-ODM assigns to the measures attributes that help categorise measures along several axes. These attributes are:
Compartment set: Whether a measure can be taken in air, in water, on surfaces, in humans, in flora, in fauna, in soils or a combination.
Specimen set: Indicates whether a measure can be taken in a sample, on-site, over a population, or a combination.
Domain: Indicates whether the measure pertains to a physical phenomenon, chemical composition, or biological characteristics.
Group: Measures characterising the same target biomarker or the same type of specimen are placed into groups. For instance, measures of various SARS-CoV-2 genes would be put into the same group, while water biochemical composition measures (e.g., COD, NH4+-N or total phosphorus) would be placed in another.
Class: Measures that use a similar assay and are quantified using similar units are combined into a class. For instance, measures of the N1 gene of SARS-CoV-2 and a gene of the Pepper Mild Mottle Virus (PMMoV) are placed in the same class since they could be measured using quantitative PCR and expressed using the number of gene copies per unit of sample.
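The categorisation attributes listed above lend themselves to simple filtering. The sketch below shows one way a tool might reveal only the measures relevant to a campaign; the measure identifiers, category values, and key names are all hypothetical and are not taken from the PHES-ODM dictionary.

```python
# Hypothetical dictionary excerpt; every identifier and value is illustrative.
measures = [
    {"measureID": "covN1", "compartments": {"water", "air", "surface"},
     "specimens": {"sample"}, "domain": "biological"},
    {"measureID": "pmmov", "compartments": {"water"},
     "specimens": {"sample"}, "domain": "biological"},
    {"measureID": "flowRate", "compartments": {"water"},
     "specimens": {"site"}, "domain": "physical"},
    {"measureID": "pm25", "compartments": {"air"},
     "specimens": {"site", "sample"}, "domain": "physical"},
]


def relevant_measures(measures, compartment=None, specimen=None, domain=None):
    """Return identifiers of measures matching every criterion that is given."""
    selected = []
    for measure in measures:
        if compartment is not None and compartment not in measure["compartments"]:
            continue
        if specimen is not None and specimen not in measure["specimens"]:
            continue
        if domain is not None and domain != measure["domain"]:
            continue
        selected.append(measure["measureID"])
    return selected
```

A wastewater laboratory could, for instance, call `relevant_measures(measures, compartment="water", specimen="sample")` to hide air-only and site-level measures from its data entry forms.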
Recording geographical information
A fundamental context element for measurements concerns the location where they were taken. The PHES-ODM lets users record entries for any location where measurements or samples are taken. These locations are called ‘sites’. Sites vary by type, that is, by the infrastructure they contain and by their use (e.g., sites could be day-care facilities, hospitals, sewer utility holes, or wastewater treatment plants.) Besides the site's civic address and geographical location, the model records the site type and allows for a free-text description of aspects of the site that are difficult to standardise into a dictionary.
Sites are represented in the PHES-ODM by a single point in space. However, the environment is usually described in terms of areas (defined as polygons or collections of polygons). In environmental surveillance, the areas of concern might be the watersheds of sewer networks that drain to a sampling site (often called ‘sewersheds’) or the jurisdiction of a public health agency, for example. The PHES-ODM has provisions to store geographical descriptions of these areas and their type.
Representing sites and polygons in the PHES-ODM makes the model interoperable with Geographical Information Systems (GIS). It also allows the PHES-ODM to store measurements done on things other than samples (e.g., wastewater flow measurements taken at a sampling site or daily reported cases in a geographical region) and thus better understand the context of measurement campaigns. One example coming from WBE is sewer and sewershed characterisation. Currently, PHES-ODM does not have dictionary entries for sewer and sewershed characteristics (e.g., combined or separated sewers, average residence time or percentage of permeable surfaces). If the community requests these measures, the PHES-ODM's flexible structure will make adding these measures to the dictionary very straightforward.
Recording protocols
When analysing data from various sources, it is essential to understand how the data were produced, as differences in analysis steps may make values of the same measure incomparable. For example, solids must be concentrated before RNA extraction in PCR analysis of wastewater samples for SARS-CoV-2 genes. This step can be done via filtration through a membrane (Ahmed et al. 2020) or via centrifugation of the sample (D'Aoust et al. 2021b); however, the two methods yield significantly different values. The PHES-ODM thus recognises the importance of tracking the analysis methods in metadata. Protocols, therefore, are one of the entities of the data model. Protocols are complicated entities, however, as they consist of multiple steps that can take many forms. Several challenges must be addressed to store protocols efficiently in tables:
• Protocols are comprised of a collection of steps.
• Protocol steps can consist of various things:
o They can prescribe the use of a specific quantity of something. Quantities can be described using the measures found in the PHES-ODM dictionary and assigned a prescribed value and unit of measurement.
o They can prescribe an action. The PHES-ODM calls this a method.
• Protocol steps must be organised in the correct order to convey the meaning of the protocol.
• Steps may follow each other, or they can be done concurrently.
• Measures inside a protocol usually specify the quantity of reagents, supplies and conditions involved in executing a method.
• Different protocols may use some of the same steps but in a different order.
• Protocols may have steps that consist of one or several (sub)protocols.
The subject and the object can either be a (sub)protocol or a protocol step. The available relationships (e.g., ‘is before’, ‘specifies’, and ‘is concurrent with’) allow one to organise the protocol in a more semantically meaningful way than simply using a sequential order (Robinson et al. 2015).
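To illustrate how explicit relationships make a protocol machine-readable, the sketch below recovers an execution order from ‘is before’ relationships using Python's standard `graphlib` module (Python 3.9 or later). The step identifiers and the (subject, relationship, object) tuple layout are assumptions made for the example; they do not reproduce the actual PHES-ODM protocol tables.

```python
from graphlib import TopologicalSorter

# Hypothetical protocol relationships as (subject, relationship, object).
relationships = [
    ("concentrateSolids", "is before", "extractRna"),
    ("extractRna", "is before", "runPcr"),
    ("centrifugeSpeed", "specifies", "concentrateSolids"),
    ("addSpike", "is concurrent with", "concentrateSolids"),
]


def order_steps(relationships):
    """Order steps so that every 'is before' relationship is respected.

    Only steps that appear in an 'is before' relationship are returned.
    Raises graphlib.CycleError if the relationships are contradictory.
    """
    sorter = TopologicalSorter()
    for subject, relation, obj in relationships:
        if relation == "is before":
            # The object step depends on the subject step being done first.
            sorter.add(obj, subject)
    return list(sorter.static_order())


def specifications(relationships):
    """Map each step to the measures that 'specify' it."""
    specs = {}
    for subject, relation, obj in relationships:
        if relation == "specifies":
            specs.setdefault(obj, []).append(subject)
    return specs


ordered = order_steps(relationships)
specs = specifications(relationships)
```

Storing relationships rather than step numbers means the same step can be reused in another protocol at a different position without being redefined.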
Because PHES-ODM protocols use standard dictionary parts and explicit relationships, they are machine-readable, thus paving the way for using protocols in data analysis pipelines. Such pipelines could automatically apply data transformations to reconcile data when possible and segregate it when not.
Recording data quality
Keeping track of quality concerns is critical to the interpretation and use of data. The PHES-ODM includes a table dedicated to quality reports where quality flags can be attributed to measures, sets of measures, and samples. Quality flags can indicate moments when a protocol was not faithfully followed (e.g., the wrong amount of some reagent was added) or when the result of an experiment should not be used (e.g., when the signal is below the level of quantification). A quality flag indicating that no concern was reported also exists to distinguish cases where quality is certified adequate from those where no quality information exists. The dictionary contains all valid quality flag values, thus making quality information machine-readable and paving the way for automatic filtering or processing of reports that have certain flags attached. However, a notes field is also available for free-text input if a quality issue is not codified into a flag. Additional quality flags can be added to the PHES-ODM as the community requests them.
Since quality reports are detached from measure and sample reports, an arbitrary quantity of flags can be assigned to any value. This decoupling of measure and quality reports allows stakeholders (lab analysts, data scientists, and public health officials) to store their own quality flags without affecting other stakeholders. Quality flags thus represent a flexible but robust system for reporting data quality and compliance.
Various stakeholders have different needs when it comes to data quality. Analysts familiar with the subtleties of sampling and laboratory experiments may want to record extensive quality information to support their interpretation of the data. However, condensing this complex information into a straightforward binary may be essential when communicating with outside agents: either the data point can be trusted or it can be disregarded. Measure and sample reports contain a Boolean reportable attribute to address such situations. Having access to fine-grained quality flags and a layman-friendly binary flag lets users of the PHES-ODM communicate with each other regardless of their role.
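The interplay between detached quality reports and the Boolean reportable attribute can be sketched as follows. The field names (`measureRepID`, `flag`, `reporter`) and the flag values are hypothetical; only the idea of a separate quality table and a `reportable` attribute comes from the text.

```python
from collections import defaultdict

# Hypothetical measure reports with the Boolean 'reportable' attribute.
measure_reports = [
    {"measureRepID": "m1", "value": 120.0, "reportable": True},
    {"measureRepID": "m2", "value": 3.0, "reportable": True},
    {"measureRepID": "m3", "value": 95.0, "reportable": False},
]

# Hypothetical quality reports, stored apart from the measures so that
# several stakeholders can flag the same value independently.
quality_reports = [
    {"measureRepID": "m1", "flag": "noConcern", "reporter": "lab"},
    {"measureRepID": "m2", "flag": "belowLoq", "reporter": "lab"},
    {"measureRepID": "m2", "flag": "outlier", "reporter": "analyst"},
]


def flags_by_report(quality_reports):
    """Group quality flags by the measure report they refer to."""
    flags = defaultdict(set)
    for report in quality_reports:
        flags[report["measureRepID"]].add(report["flag"])
    return flags


def usable(measure_reports, quality_reports, blocking_flags):
    """Keep reportable measures that carry none of the blocking flags."""
    flags = flags_by_report(quality_reports)
    return [
        report["measureRepID"]
        for report in measure_reports
        if report["reportable"]
        and not (flags.get(report["measureRepID"], set()) & blocking_flags)
    ]
```

An analyst can pass a detailed set of blocking flags, while an outside reader can rely on the `reportable` attribute alone by passing an empty set.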
Ownership, anonymisation and sharing
In the PHES-ODM, the ownership information is linked to each recorded measure via a link to the ‘dataset’ table. In this table, the provenance of the data is established via links to contact persons and organisations that can be identified as either custodians or funders of the data. Measures are also individually linked to a license table, meaning that ownership and usage conditions can be separately prescribed for each measurement.
As a data model, PHES-ODM does not support anonymisation directly. However, an open-source companion sharing tool allows dataset custodians to define rules for selecting a subset of their data and potentially transforming it before sharing (Big Life Lab 2022d). For example, users might prescribe in their schema to only select rows in the measures table with a particular license that come from a dataset with a specific funder. After selecting rows to share, another rule might define the fields that should be kept.
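The two-stage rule described in this example (select rows, then keep fields) can be sketched as below. This is not the schema format or API of the companion sharing tool, which is not detailed here; the function, field names, and identifiers are all hypothetical.

```python
# Hypothetical measure rows and dataset table; all identifiers are illustrative.
measures = [
    {"measureRepID": "m1", "license": "open", "datasetID": "d1",
     "value": 10.0, "notes": "internal remark"},
    {"measureRepID": "m2", "license": "restricted", "datasetID": "d1",
     "value": 12.0, "notes": ""},
    {"measureRepID": "m3", "license": "open", "datasetID": "d2",
     "value": 8.0, "notes": ""},
]
datasets = {"d1": {"funder": "agencyA"}, "d2": {"funder": "agencyB"}}


def share(measures, datasets, license_id, funder, keep_fields):
    """Select rows by license and dataset funder, then keep only some fields."""
    shared = []
    for row in measures:
        if row["license"] != license_id:
            continue
        if datasets[row["datasetID"]]["funder"] != funder:
            continue
        # Drop every field that is not explicitly listed for sharing.
        shared.append({field: row[field] for field in keep_fields})
    return shared


shared = share(measures, datasets, "open", "agencyA", ["measureRepID", "value"])
```

Listing the fields to keep, rather than the fields to drop, means that a field added later is not shared by accident.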
Leveraging wide tables
Entities defined in the PHES-ODM data model (e.g., measures, samples, sites, etc.) are each stored in separate tables, where each entity instance is described by a single row. New rows are added as new data are accumulated. This style of data storage is called a ‘long’ table, and it has the advantage of maintaining its schema (what tables, table columns, and relationships are present in the data model) regardless of how much data are added to the tables. However, this efficient data format has some disadvantages because it is far removed from how WBE practitioners work with data. On the data collection side, it is common for laboratories to organise their data into a ‘wide’ format, where a row represents a sample with columns for all measures from that sample. As a sample is processed, the analyst fills the row until all the measures listed in the column headers have a value. A similar scheme can be applied, not with a sample, but with individual ‘runs’ of an experiment.
Wide tables are practical because attributes of the measures being taken that never change (e.g., the units, the expected data type, whether the analysis was done on the solid fraction of the sample or its liquid fraction) are stored outside the table. Thus, the analyst need not concern themselves with entering them for each new value, saving time and effort. Moreover, having values taken from the same sample aligned side to side makes them much easier to compare and to quickly judge whether an error has been made during the experiment.
Similarly, data analysts such as epidemiologists typically use wide tables where each row represents all measures for a single day. Transforming the data into time series is essential for understanding the evolution of the biomarker through time. It is also necessary for most visual representations used in WBE and most modelling schemes.
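The difference between the two layouts can be made concrete with a small sketch that pivots long rows into a wide table with one row per day. The field names and measure identifiers are hypothetical simplifications; real PHES-ODM measure reports carry many more attributes.

```python
from collections import defaultdict

# Hypothetical long-format measure reports: one row per reported value.
long_rows = [
    {"date": "2023-01-01", "measure": "covN1", "value": 100.0},
    {"date": "2023-01-01", "measure": "pmmov", "value": 5000.0},
    {"date": "2023-01-02", "measure": "covN1", "value": 150.0},
]


def long_to_wide(rows):
    """Pivot long rows into one row per date with one column per measure."""
    columns = sorted({row["measure"] for row in rows})
    by_date = defaultdict(dict)
    for row in rows:
        by_date[row["date"]][row["measure"]] = row["value"]
    wide = []
    for date in sorted(by_date):
        wide_row = {"date": date}
        for column in columns:
            # A measure missing on a given day is left empty (None).
            wide_row[column] = by_date[date].get(column)
        wide.append(wide_row)
    return wide


wide = long_to_wide(long_rows)
```

Note that the wide table gains a column whenever a new measure appears, whereas the long table only gains rows; this is the schema stability referred to above.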
Wide tables become impractical when combining data from multiple laboratories, sampling sites, or datasets. Properties that can be assumed constant in a single lab are no longer constant at this scale, which might lead analysts to improperly combine data from sources with the same name but not the same meaning. Conversely, since a wide column name can condense multiple attributes of a measure into a single field, the name of that field may vary wildly between table creators, even if they start with the same dictionary of attributes.
The environmental compartment the specimen is taken from.
The specimen the measure is taken on.
The fraction of the specimen the measure is taken on.
The measure itself.
The unit that is used for reporting.
The aggregation that is applied to the value.
An index to allow the wide table to record more than one instance of the wide variable in question if required.
The attribute whose value is recorded in the wide table cell. Typically, this would be ‘value’; however, any other attribute not included in the wide name can be specified here.
Transposition of a measurement report row (long format) to wide names and their values. For illustrative purposes, the labels are shown in the long table row. The part identifiers are shown in parentheses.
Other measure attributes, such as class, group, and domain, need not be added to the wide name. Their role is to help organise the measures. However, since measures are unique, adding these attributes to the wide name of a measure would be redundant.
The resulting names have many components, but they have the advantage of being highly specific, thus reducing the risk of inadvertently combining incompatible data. Moreover, since they consist of identifiers for parts that all have labels in the parts table, they are both machine-readable and easily translated to an expanded natural language label that anyone can read.
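A sketch of how a wide name might be assembled from a long-format row, and expanded back into a readable label, is shown below. The component order follows the list given earlier in this section; the underscore separator, the part identifiers, and the labels are assumptions made for illustration and may differ from the actual PHES-ODM convention.

```python
# Order of the wide-name components, following the list in this section.
COMPONENTS = ("compartment", "specimen", "fraction", "measure",
              "unit", "aggregation", "index")


def wide_name(row, attribute="value", sep="_"):
    """Join the part identifiers of a long row into a single wide name.

    Assumes that no part identifier contains the separator.
    """
    parts = [str(row[component]) for component in COMPONENTS]
    return sep.join(parts + [attribute])


def expand(name, labels, sep="_"):
    """Translate a wide name into a readable label using the parts table."""
    return " / ".join(labels.get(part, part) for part in name.split(sep))


# Hypothetical long row and label lookup; identifiers are illustrative only.
row = {"compartment": "wat", "specimen": "sa", "fraction": "so",
       "measure": "covN1", "unit": "gcMl", "aggregation": "sin",
       "index": 1, "value": 120.0}
labels = {"wat": "Water", "sa": "Sample", "so": "Solid fraction",
          "covN1": "SARS-CoV-2 N1 gene", "gcMl": "Gene copies per mL",
          "sin": "Single value"}

name = wide_name(row)
label = expand(name, labels)
```

Two measure reports that differ in any component, for instance the unit, produce different wide names and therefore land in different columns.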
The ability to transpose data from long-to-wide formats (and vice-versa) can be leveraged at the data entry stage by creating templates that respect the typical workflow of laboratory workers where column headers would be wide names. At the data analysis stage, by recreating the wide names from the data stored in the PHES-ODM, measure reports with differing attributes would automatically be dispatched to columns with different names, thus reducing the risk of erroneous combinations.
Supporting community adoption
There often exists a tension between usability and completeness. As the number of concepts included in a system grows, the effort required for a user to become comfortable with it increases and quickly reaches unmanageable levels. This is especially true if the system targets users of varied levels of familiarity with the concepts it contains. Environmental public health surveillance data are often collected by workers with little to no experience with data models (e.g., utility staff, laboratory technicians, or policymakers).
The PHES-ODM, therefore, must strike a balance between robustness (being able to host multiple types of data and metadata) and ease of use. In that respect, the guiding philosophy of the PHES-ODM project has been to aim for robustness and subsequently provide tools (e.g., utility software, templates, standardised long-to-wide conversion procedure) to support common uses and provide extensive documentation to facilitate onboarding and understanding. Information regarding these tools is provided on the PHES-ODM website and GitHub repository.
The documentation provided by the PHES-ODM project follows a four-quadrant approach (Divio Technologies AB 2023), where resources are either designed to be learning-oriented (tutorials), problem-oriented (how-to guides), understanding-oriented (long-form explanations) or information-oriented (reference guides). The community of PHES-ODM users also has access to an open forum to directly ask questions from other users for cases not covered by documentation (Discourse 2022).
The PHES-ODM in action
The structure of the PHES-ODM was designed with WBE for SARS-CoV-2 in mind. It is thus not surprising that the first application of the data model was for that purpose. The Ontario provincial government and the Public Health Agency of Canada adopted the first version of the PHES-ODM in 2021 for their respective wastewater-based epidemiology programs (Caminsky 2022). Both are planning to migrate to version 2.0. In the province of Québec, a university-led pilot study, the CentrEau-COVID project, has adopted the PHES-ODM to collect and harmonise wastewater-based epidemiology data from 31 sampling sites (CentrEau 2021). Outside Canada, version 2.0 of the PHES-ODM has been selected for the European Union Sewage Sentinel System for SARS-CoV-2 to collect and organise surveillance data from its member countries (EU4S-DEEP 2022). Though many other data models exist for environmental surveillance, the use of the PHES-ODM in large-scale WBE surveillance programs suggests that the model is robust and able to reliably identify, store, link and organise WBE data and metadata. Though the PHES-ODM was designed to hold environmental public health data from various environmental compartments, the most common use of the data model remains WBE. Additional involvement from researchers and public health agencies sampling other environmental compartments would help stress-test these areas of the model and improve its robustness.
CONCLUSION
As environmental public health surveillance becomes more widespread and institutionalised, the number of tracked pathogens and environmental toxins will continue to grow, and vast amounts of data will be generated worldwide. The COVID-19 pandemic has shown how critical the principles of open science and open data are to accessing the full potential of this data for understanding the spread of disease among populations. By creating and maintaining tools that serve all stakeholders of environmental surveillance data, from field and laboratory workers to data analysts and decision-makers, the scientific community benefits from more efficient and frequent data exchanges, which will lead to greater understanding and benefit the public. By presenting the PHES-ODM, the authors hope to contribute to the flourishing of the field and convince the environmental surveillance community that open science within the field is not only desirable but also possible and accessible.
ACKNOWLEDGEMENTS
The authors would like to acknowledge the financial support of PHES-ODM development by the CIHR-funded network, CoVaRR-Net (Coronavirus Variants Rapid Response Network) [Funding Reference Number: 175622] and the Public Health Agency of Canada. The Ontario Ministry of the Environment, Conservation and Parks provides funding for the PHES-ODM validation toolkit. The Natural Sciences and Engineering Research Council of Canada, the Fonds de Recherche du Québec, and the Molson-Trottier Foundation supported salaries, scholarships, travel expenses, sample collection and laboratory analysis. Peter Vanrolleghem holds the Canada Research Chair on Water Quality Modelling. As an open project, the PHES-ODM benefits from the contribution of its steering committee (Big Life Lab 2022a), the core user group, and many people and organisations that provide comments and suggestions. The PHES-ODM is grateful for development platforms that provide freely available resources for open-source projects, including GitHub and Discourse.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.