Towards a water quality database for raw and validated data with emphasis on structured metadata

On-line continuous monitoring of water bodies produces large quantities of high-frequency data. Long-term quality control and applicability of these data require rigorous storage and documentation. To carry out these activities successfully, a database has to be built. Such a database should make it simple to store and document all relevant data and should be easy to use for further data evaluation and interpretation. In this paper, a comprehensive database structure for water quality data is proposed. Its goal is to centralize the data, standardize their format, provide easy access, and, especially, document all relevant information (metadata) associated with the measurements in an efficient way. The emphasis on data documentation enables the provision of detailed information not only on the history of the measurements (e.g., where, how, when and by whom the value was measured) but also on the history of the equipment (e.g., sensor maintenance, calibration/validation history), personnel (e.g., experience), projects, sampling sites, etc. As such, the proposed database structure provides a robust and efficient tool for functional data storage and access, allowing future use of data collected at great expense.


INTRODUCTION
Automated monitoring stations and state-of-the-art instrumentation are used to continuously monitor and control water bodies over the long term and increasingly also in real time. This on-line, continuous monitoring is used to collect data at high frequency thus generating large sets of data (Rieger & Vanrolleghem ). However, these large quantities of data are only beneficial if they are accessible, well-documented and reliable (Copp et al. ). Thus, the tasks of efficient storage and quality control are crucial to their interpretation and further application.
Generally, in many organizations, storage and quality checks of the collected data are done individually by the users at their own workstations. However, each user organizes, structures and evaluates the data in a different manner (Camhy et al. ). As personnel change over time, this diversity hinders data interpretation, understanding and reproduction, leading to inconsistencies in further studies.
Thus, to successfully manage these large amounts of heterogeneous data, a systematic and efficient storage system is needed (Rieger et al. ). In this respect, Camhy et al. () and Horsburgh et al. () identified several data management challenges: the collected raw data have a highly variable format; the database has to be flexible and adaptable because it grows continuously (monitoring programs are modified, additional variables are measured and different sensors are used); and the personnel involved in collecting and managing the data change. It is thus critical to document the collected data with all relevant metadata (data about data).
Metadata are any additional information that provides more details about the data and their identification: the measured attributes, their names, units, the extent, the quality, the spatial and temporal aspects, the content, and how the value was obtained (Gray et al. ; ISO ). This information is essential for other potential users to understand and interpret the collected data.
The issues of metadata are illustrated with an example of a one-month measurement campaign conducted at a full-scale wastewater treatment plant. For this campaign, a number of automated sensors to measure water quality parameters (TSS, N-components, etc.) were installed. If only the measured values are stored, the data will have very limited meaning. At the very least, metadata such as the variable names and their units should be stored as well.
However, even with the addition of these metadata, the relevance and application of the data set will most likely be limited to persons that were directly involved in the campaign. Subsequently, the data will either be shelved and lost or applied unsuccessfully in a further study because too much information on the data is missing. If we want the efforts of such a measurement campaign to transcend this limited life-expectancy, much more detailed metadata should be stored: the exact location where the sensors were placed, the type of sensors (and their measurement principles), their maintenance, calibration and validation history, the weather conditions during the campaign, etc.
Providing a systematic structure to store all these metadata is an important challenge for effective data management.
Some commercial databases to store water quality and hydrological data in a structured way are offered on the market. Nevertheless, accessing the raw data or making a modification of the metadata is sometimes limited or not possible, and can only be done through a predefined graphical user interface (GUI) (Camhy et al. ). Moreover, data have to be continuously transformed to the proprietary format of the software. In addition, any modification relies on the vendor support, thus placing important restraints on customized use.
Also, some organizations have proposed standards to exchange environmental data including data description, analysis and reporting, e.g., the Environmental Data Stan-
Using their experience with high frequency data collection, the modelEAU research group at Université Laval in Québec City (Canada) developed a database structure to be applied to water quality data from rivers, sewer systems and water resource recovery facilities (WRRFs). The main objectives of this database are to centralize data storage from on-line measurements, laboratory analysis and data post-treatments, and deal with the challenges presented above, especially regarding the storage of metadata. This paper presents the structure of the developed database and its application.
Downloaded from https://iwaponline.com/wqrj/article-pdf/54/1/1/520763/wqrjc0540001.pdf

DATABASE DESIGN
The database structure that was designed, named datEAUbase (water database, 'eau' is water in French), offers robustness, data format uniformity, flexibility if modifications are needed, efficient storage of relevant metadata, and the possibility to comprehensively document a monitoring program.
The datEAUbase has been designed to store all relevant data, i.e., the raw, filtered and validated data, laboratory measurements and corresponding metadata (see Figure 1).
The storage of the raw, filtered and laboratory data in the same database has been considered essential since all of them are related, and crucial to validate the data series and assure their quality.

datEAUbase STRUCTURE
The metadata considered are presented in Figure 2 and include detailed information about the sites, the sampling points, the watershed, the parameters, the equipment used, the measurement procedure followed, the project in which the data have been collected, for which purpose the value has been measured, the person responsible for the value and the weather conditions when the value was taken.
The design presented in Figure 2 is materialized by 23 different, interrelated tables in MySQL. The overall structure of the datEAUbase is presented in Figure 3.
Compared to other software, e.g., MS Access, MySQL not only offers a large capacity but, more importantly, also the possibility to work with m-to-n relationships (MS Access, for instance, only allows 1-to-n relations). An m-to-n relationship means that each row in one table can be related to multiple rows in another table and vice versa. For example, many people can be involved in one project, and one person can also be involved in several projects. The links between the tables are made through the specific keys (called IDs in Figure 3) associated with each row of a table. The storage requirements for each data type included in the datEAUbase are described in Table 1.
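The people-and-projects example can be sketched with a linking table. The snippet below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module in place of MySQL so that it runs without a server, and the table names, column names and the second project are hypothetical rather than taken from the datEAUbase itself.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contact (contact_id INTEGER PRIMARY KEY, last_name TEXT NOT NULL);
CREATE TABLE project (project_id INTEGER PRIMARY KEY, project_name TEXT NOT NULL);
-- Linking table: one row per (project, contact) pair realizes the m-to-n link.
CREATE TABLE project_has_contact (
    project_id INTEGER NOT NULL REFERENCES project(project_id),
    contact_id INTEGER NOT NULL REFERENCES contact(contact_id),
    PRIMARY KEY (project_id, contact_id)
);
""")
conn.executemany("INSERT INTO contact VALUES (?, ?)",
                 [(1, "Alferes"), (2, "Plana")])
conn.executemany("INSERT INTO project VALUES (?, ?)",
                 [(1, "monEAU"), (2, "hypothetical_project")])
# Plana is linked to both projects; monEAU is linked to both people.
conn.executemany("INSERT INTO project_has_contact VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2)])
conn.commit()


def people_in(project_name):
    """Return the last names of everyone linked to a project."""
    rows = conn.execute(
        """SELECT c.last_name
           FROM contact c
           JOIN project_has_contact pc ON pc.contact_id = c.contact_id
           JOIN project p ON p.project_id = pc.project_id
           WHERE p.project_name = ?
           ORDER BY c.last_name""",
        (project_name,),
    )
    return [r[0] for r in rows]


def projects_of(last_name):
    """Return the names of all projects a person is linked to."""
    rows = conn.execute(
        """SELECT p.project_name
           FROM project p
           JOIN project_has_contact pc ON pc.project_id = p.project_id
           JOIN contact c ON c.contact_id = pc.contact_id
           WHERE c.last_name = ?
           ORDER BY p.project_name""",
        (last_name,),
    )
    return [r[0] for r in rows]
```

Both directions of the relationship are answered from the same linking table, so neither the person nor the project has to be stored twice.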

Primary tables
The general structure is based on primary and lookup tables.
The primary tables are the Metadata, Value and Comments tables presented in Figure 3 (see also Table 2).
To illustrate the database's structure, an example follows. In the primary tables, the information stored can be: on June 15, 2015 at 10:40:00 GMT, a value of 6.5 was measured. This value is linked to Metadata_ID 22. Moreover, a comment can be added that the calibration activity was unsuccessful. Through the internal links with the
Ultimately, through its specific structure the datEAUbase not only permits rigorous documentation of all measured values but also allows a memory of the measuring campaigns to be built in a reliable way. For instance, the structure allows tracking the history of a piece of equipment, e.g., in which projects a sensor has been used or what its calibration/
e.g., who has been involved in a certain project or who has used certain equipment, which can be useful information if an experienced person is needed.
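The stored example (a value of 6.5 at 10:40:00 GMT on June 15, 2015, linked to Metadata_ID 22 and carrying a comment) can be sketched as follows. This is a minimal illustration, not the actual datEAUbase schema: SQLite replaces MySQL, the table and column names are assumptions, and the uniqueness rule on time stamp plus metadata ID is one possible way to keep each stored value unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces links only when enabled
conn.executescript("""
CREATE TABLE metadata (metadata_id INTEGER PRIMARY KEY);
CREATE TABLE comments (comment_id INTEGER PRIMARY KEY, comment TEXT NOT NULL);
CREATE TABLE measured_value (
    value_id    INTEGER PRIMARY KEY,
    timestamp   TEXT NOT NULL,      -- ISO 8601 string, GMT
    value       REAL NOT NULL,
    metadata_id INTEGER NOT NULL REFERENCES metadata(metadata_id),
    comment_id  INTEGER REFERENCES comments(comment_id),
    UNIQUE (timestamp, metadata_id) -- assumed rule: one value per time stamp and metadata set
);
""")
conn.execute("INSERT INTO metadata (metadata_id) VALUES (22)")
conn.execute("INSERT INTO comments (comment_id, comment) "
             "VALUES (1, 'Calibration activity was unsuccessful')")
conn.commit()


def store_value(timestamp, value, metadata_id, comment_id=None):
    """Insert one measurement; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO measured_value "
                "(timestamp, value, metadata_id, comment_id) VALUES (?, ?, ?, ?)",
                (timestamp, value, metadata_id, comment_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# The example from the text, then a duplicate and a value without metadata.
first = store_value("2015-06-15T10:40:00Z", 6.5, 22, comment_id=1)
duplicate = store_value("2015-06-15T10:40:00Z", 6.5, 22)
orphan = store_value("2015-06-15T10:41:00Z", 6.4, 99)
```

In this sketch the first insert is accepted, while the duplicate and the value pointing at a non-existent Metadata_ID are rejected by the database rather than by application code.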

Lookup tables
The lookup tables have been divided into six different blocks, shown in Figure 3: all information about the instru-

Sampling location information
The Sampling location tables contain the information about the site and the identification of the specific sampling points.
Also, some more information about urban and hydrological characteristics is included.

Project information
In the Project table, information about the project is detailed. This table is linked to other parts of the database by a number of tables containing m-to-n links. These linking tables contain information about who is working in a project, where a project takes place and which equipment is used, and vice versa, in how many projects someone is working, for how many projects a location is used, and in how many projects a piece of equipment is used.
For example, the monEAU project deals with the usefulness of automatic monitoring stations (AMS) to study the water quality. The measurements are located at the inlet of Grandes-Piles F/AL. The following equipment is used: conductivity_001, pH_003 and ammolyser_001. The personnel involved are Alferes, Plana and Vanrolleghem.

Contact information
In the Contact table, detailed information about the people involved in the different projects is stored. This information includes the first name, the last name, the affiliation together with the address of the corresponding office, and the person's function. Also, the e-mail address, the phone number, the Skype name or the LinkedIn information are stored.

Purpose of the measurement information
The Purpose table stores information about the aim of the value included in the database, i.e., on-line measurement, laboratory analysis, calibration, validation or cleaning. This is accompanied by a detailed description of the different purposes.
For example, the purpose of the measurement is sensor validation. This is a routine sensor validation activity for verification of proper operation.

Weather information
Despite the fact that weather data such as daily rainfall or hourly temperatures can be stored into the database, this

datEAUbase APPLICATION
The structure and design of the datEAUbase creates a comprehensive environment to store and document data alongside their relevant metadata in a robust and highly efficient way. Moreover, it ensures that each value stored in the datEAUbase is unique, being linked to a specific time stamp and a complete set of metadata.
Although these features represent the core functional-
The following important steps in the maintenance and application of the datEAUbase are facilitated through the user interface (Figure 5):
• Before measurements can be stored in the datEAUbase, their metadata need to be present in the lookup tables. The interface allows easy addition or modification of metadata (for example, adding a new sensor in an existing project).
• Different metadata_IDs have to be created in the metadata
• Non-automated data (such as laboratory results) can be entered in the datEAUbase through the user interface. This also consists of a simple coupling of the measured values to their corresponding metadata_ID.
• One of the main features of the interface is its application to search the database and extract a specific data set of interest or information on sensor or project history.
• During the search process, an internal quality check is also performed. Data will only be available for extraction if all internal links are present. All metadata combinations that are present in the metadata table should also be linked internally in the lookup tables.
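The internal quality check in the last step can be sketched as a pair of queries. The snippet is an assumption-laden illustration, not the interface's actual implementation: SQLite replaces MySQL, only two lookup tables are modelled, and all table names, column names and sample rows are hypothetical. Foreign-key enforcement is deliberately left off so that a broken link can exist and be detected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parameter (parameter_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE equipment (equipment_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE metadata (
    metadata_id  INTEGER PRIMARY KEY,
    parameter_id INTEGER,
    equipment_id INTEGER
);
CREATE TABLE measured_value (timestamp TEXT, value REAL, metadata_id INTEGER);

INSERT INTO parameter VALUES (1, 'pH');
INSERT INTO equipment VALUES (3, 'pH_003');
INSERT INTO metadata VALUES (22, 1, 3);   -- all links present
INSERT INTO metadata VALUES (23, 1, 99);  -- equipment 99 is missing from the lookup table
INSERT INTO measured_value VALUES ('2015-06-15T10:40:00Z', 6.5, 22);
INSERT INTO measured_value VALUES ('2015-06-15T10:41:00Z', 6.6, 23);
""")


def extractable_values():
    """Return only values whose metadata resolve in every lookup table.

    Inner joins drop any value with a dangling link, so incomplete
    records never reach the extracted data set.
    """
    return conn.execute(
        """SELECT v.timestamp, v.value, p.name, e.name
           FROM measured_value v
           JOIN metadata m ON m.metadata_id = v.metadata_id
           JOIN parameter p ON p.parameter_id = m.parameter_id
           JOIN equipment e ON e.equipment_id = m.equipment_id
           ORDER BY v.timestamp"""
    ).fetchall()


def broken_metadata_ids():
    """Return metadata IDs with at least one missing lookup link."""
    rows = conn.execute(
        """SELECT m.metadata_id
           FROM metadata m
           LEFT JOIN parameter p ON p.parameter_id = m.parameter_id
           LEFT JOIN equipment e ON e.equipment_id = m.equipment_id
           WHERE p.parameter_id IS NULL OR e.equipment_id IS NULL
           ORDER BY m.metadata_id"""
    )
    return [r[0] for r in rows]
```

The second query gives the maintainer a list of metadata combinations to repair, while the first guarantees that extraction never returns a value whose documentation is incomplete.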

CONCLUSIONS
Technological advances in water quality measurement lead to the creation of large quantities of high frequency data.
Without efficient storage and rigorous documentation, the life expectancy of these data is often limited to the specific project for which they were collected. Such common practices represent a significant loss of information as well as expense (that often goes into a measurement campaign).
To maintain understanding of the collected data, track their history and secure their usefulness in further studies, documentation by metadata is crucial. This includes detailed information about the sites, the sampling points, the watershed, the parameters, the equipment used, the measurement procedure followed, the project in which the data have been collected, for which purpose the value has been measured, the person responsible for the value and the weather conditions when the value was taken.
This paper presents a comprehensive database structure (the datEAUbase) that offers a data storage system with an emphasis on metadata. It provides a robust, large storage capacity with flexibility for future modifications and possible improvements.
Its specific structure, consisting of a combination of three primary tables interlinked with 20 lookup tables, allows for very efficient storage of huge amounts of information while avoiding redundancy. Moreover, this rigorous documentation of all measured values with their metadata allows a memory of sensor history, project history and so on to be built in a reliable way.
Since this tool is meant for large data users to store and exchange water quality data, easy access and maintenance are ensured through a user-friendly interface.